# IEEE-CIS Fraud Detection

![fraud](https://abcountrywide.com.au/wp-content/uploads/2018/04/fraud.jpg)

Imagine standing at the check-out counter at the grocery store with a long line behind you when the cashier not-so-quietly announces that your card has been declined. In this moment, you probably aren’t thinking about the data science that determined your fate.

Embarrassed, and certain you have the funds to cover everything needed for an epic nacho party for 50 of your closest friends, you try your card again. Same result. As you step aside and allow the cashier to tend to the next customer, you receive a text message from your bank. “Press 1 if you really tried to spend $500 on cheddar cheese.”

While perhaps cumbersome (and often embarrassing) in the moment, this fraud prevention system is actually saving consumers millions of dollars per year. Researchers from the IEEE Computational Intelligence Society (IEEE-CIS) want to improve this figure, while also improving the customer experience. With higher accuracy fraud detection, you can get on with your chips without the hassle.

IEEE-CIS works across a variety of AI and machine learning areas, including deep neural networks, fuzzy systems, evolutionary computation, and swarm intelligence. Today they’re partnering with the world’s leading payment service company, Vesta Corporation, seeking the best solutions for the fraud prevention industry, and now you are invited to join the challenge.

In this competition, you’ll benchmark machine learning models on a challenging large-scale dataset. The data comes from Vesta's real-world e-commerce transactions and contains a wide range of features from device type to product features. You also have the opportunity to create new features to improve your results.

If successful, you’ll improve the efficacy of fraudulent transaction alerts for millions of people around the world, helping hundreds of thousands of businesses reduce their fraud loss and increase their revenue. And of course, you will save party people just like you the hassle of false positives.


![](https://www.xenonstack.com/wp-content/uploads/xenonstack-credit-card-fraud-detection.png)

## Competition objective: detect fraud in transactions

## Categorical Features - Transaction

* ProductCD
* emaildomain
* card1 - card6
* addr1, addr2
* P_emaildomain
* R_emaildomain
* M1 - M9

## Categorical Features - Identity

* DeviceType
* DeviceInfo
* id_12 - id_38

The TransactionDT feature is a timedelta from a given reference datetime (not an actual timestamp).

## Questions

I will start by exploring the categorical features and the transaction amounts. The aim is to answer questions such as:

* What types of data do we have?
* How many columns, rows, and missing values are there?
* What is the target distribution?
* How are transaction amounts distributed for fraud and non-fraud transactions?
* Are some products predominantly fraudulent?
* Which features show interesting patterns with respect to the target?
* And many more questions that will come up during the exploration.

# Importing necessary libraries

In [None]:
!pip install lightgbm

In [None]:
!pip install chart_studio

In [None]:
!pip install xgboost

In [None]:
!pip install plotly==3.10.0

In [None]:
!pip install git+https://github.com/hyperopt/hyperopt.git

In [None]:
import gc
import os
import time
import warnings
from contextlib import contextmanager

# Silence noisy warnings before the heavier libraries are imported
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import scipy as sp
from numpy import array
from scipy import stats

# Plotting with matplotlib and seaborn
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import cm
plt.rcParams.update({'figure.max_open_warning': 200})
%matplotlib inline
color = sns.color_palette()

# Standard plotly imports
import plotly.figure_factory as ff
import plotly.graph_objs as go
import plotly.plotly as py  # available in plotly < 4 (3.10.0 is pinned above)
import plotly.tools as tls
from plotly.offline import iplot, init_notebook_mode

# Using plotly in offline mode
init_notebook_mode(connected=True)

# Preprocessing, modelling and evaluating
import xgboost as xgb
from lightgbm import LGBMClassifier
from sklearn import preprocessing
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, KFold, StratifiedKFold, cross_val_score
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

print(os.listdir("../input"))


In [None]:
warnings.simplefilter("ignore")
plt.style.use('ggplot')
color_pal = [x['color'] for x in plt.rcParams['axes.prop_cycle']]

In [None]:
df_sample_submission = pd.read_csv("../input/sample_submission.csv")
df_test_identity = pd.read_csv("../input/test_identity.csv")
df_test_transaction = pd.read_csv("../input/test_transaction.csv")
df_train_identity = pd.read_csv("../input/train_identity.csv")
df_train_transaction = pd.read_csv("../input/train_transaction.csv")

In [None]:
print('Size of df_sample_submission data', df_sample_submission.shape)
print('Size of df_test_identity data', df_test_identity.shape)
print('Size of df_test_transaction data', df_test_transaction.shape)
print('Size of df_train_identity data', df_train_identity.shape)
print('Size of df_train_transaction data', df_train_transaction.shape)

In [None]:
df_test_identity.head(5)

In [None]:
df_test_transaction.head()

In [None]:
df_train_identity.head()

In [None]:
df_train_transaction.head()

In [None]:
df_sample_submission.head(10)

In [None]:
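# Count each class, ordered by label so the bars line up with 0 and 1
target_counts = df_train_transaction['isFraud'].value_counts().sort_index()
sns.barplot(x=target_counts.index, y=target_counts.values)
plt.title('Target variable count')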

* There is clearly a class imbalance problem.
* We will look into methods of solving this issue later in this notebook.
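
One common way to account for this kind of imbalance is to weight the positive class by the ratio of negative to positive examples. The sketch below is only an illustration, not necessarily the approach taken later in this notebook: the helper name `positive_class_weight` and the toy labels are assumptions made for the example. The resulting value is the kind of number usually passed as `scale_pos_weight` to `LGBMClassifier` or `XGBClassifier`.

```python
import pandas as pd


def positive_class_weight(target: pd.Series) -> float:
    """Return n_negative / n_positive for a binary 0/1 target."""
    counts = target.value_counts()
    n_neg = counts.get(0, 0)
    n_pos = counts.get(1, 0)
    if n_pos == 0:
        raise ValueError("target contains no positive examples")
    return float(n_neg / n_pos)


# Toy labels (not the competition data): 96 legitimate, 4 fraudulent.
toy_target = pd.Series([0] * 96 + [1] * 4)
weight = positive_class_weight(toy_target)
print(f"fraud share: {toy_target.mean():.2%}, positive class weight: {weight:.1f}")
```

On the real data the same call would be `positive_class_weight(df_train_transaction['isFraud'])`.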

## Train and Test Are Split by Time

* The TransactionDT feature is a timedelta from a given reference datetime (not an actual timestamp). One early discovery about the data is that train and test appear to be split by time. There is a slight gap in between, but otherwise the training set comes from an earlier period and the test set from a later one. This affects which cross-validation techniques should be used.

* We will look into this more when reviewing differences in the distribution of features between train and test.
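
Because the test set comes later in time than the training set, a validation split that mimics this ordering is often more informative than a random one. The following is a minimal sketch under that assumption: the helper `time_based_split`, the 20% hold-out fraction, and the toy frame are all illustrative choices, not part of the competition material.

```python
import pandas as pd


def time_based_split(df: pd.DataFrame, time_col: str = "TransactionDT",
                     valid_fraction: float = 0.2):
    """Hold out the most recent `valid_fraction` of rows as a validation set."""
    ordered = df.sort_values(time_col)
    cut = int(len(ordered) * (1 - valid_fraction))
    return ordered.iloc[:cut], ordered.iloc[cut:]


# Toy frame standing in for train_transaction (values are made up).
toy = pd.DataFrame({
    "TransactionDT": [500, 100, 400, 200, 300, 900, 700, 600, 1000, 800],
    "isFraud":       [0,   0,   1,   0,   0,   0,   1,   0,   0,    0],
})
train_part, valid_part = time_based_split(toy, valid_fraction=0.2)

# Every validation row is later than every training row.
print(train_part["TransactionDT"].max(), "<", valid_part["TransactionDT"].min())
```

scikit-learn's `TimeSeriesSplit` offers a multi-fold variant of the same idea when more than one validation window is wanted.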

* **Of the 394 features/columns in the train_transaction data, 14 begin with C. The official explanation of these columns is:**
* **C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.**
* **All C columns are of a numeric data type; a summary is given below.**

#### The graph below shows the number of unique values in each of the C columns as blue bars. The orange bars show the number of unique values in 96.5% of the data in each column. The difference between the two bars is a measure of how spread out the data is across the range of unique values in the column. The red line is the percentage of missing values in each column.
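
The quantities behind such a graph can be computed roughly as follows. This is a sketch under an assumption: "unique values in 96.5% of the data" is read here as the smallest number of most frequent values whose combined share of the non-missing rows reaches 96.5%. The helper name `unique_value_summary` and the toy columns are hypothetical.

```python
import pandas as pd


def unique_value_summary(df, cols, coverage=0.965):
    """Per column: distinct values, values needed to cover `coverage`, % missing."""
    rows = []
    for col in cols:
        # Relative frequency of each value, most frequent first (NaN excluded).
        shares = df[col].value_counts(normalize=True)
        cumulative = shares.cumsum()
        # Smallest number of top values whose combined share reaches `coverage`.
        n_for_coverage = min(int((cumulative < coverage).sum()) + 1, len(shares))
        rows.append({
            "column": col,
            "n_unique": len(shares),
            "n_for_coverage": n_for_coverage,
            "missing_pct": df[col].isna().mean() * 100,
        })
    return pd.DataFrame(rows).set_index("column")


# Toy columns: C1 is dominated by one value, C2 is evenly spread with 10% missing.
toy = pd.DataFrame({
    "C1": [1.0] * 97 + [2.0, 3.0, 4.0],
    "C2": [float(v) for v in range(10)] * 9 + [None] * 10,
})
summary = unique_value_summary(toy, ["C1", "C2"])
print(summary)
```

A large gap between `n_unique` and `n_for_coverage` means a few values account for almost all rows, which is the pattern the two bars are meant to expose.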

In [None]:
Ccols= df_train_transaction.columns[df_train_transaction.columns.str.startswith('C')]
df_train_transaction[Ccols].describe()

In [None]:
missing_values_count = df_train_transaction.isnull().sum()
print(missing_values_count[0:10])
total_cells = np.prod(df_train_transaction.shape)  # np.product was removed in NumPy 2.0
total_missing = missing_values_count.sum()
print("% of missing data = ", (total_missing / total_cells) * 100)

In [None]:
missing_values_count = df_train_identity.isnull().sum()
print(missing_values_count[0:10])
total_cells = np.prod(df_train_identity.shape)  # np.product was removed in NumPy 2.0
total_missing = missing_values_count.sum()
print("% of missing data = ", (total_missing / total_cells) * 100)

## Plotting Transaction Amount Values Distribution

In [None]:
df_id = pd.read_csv("../input/train_identity.csv")
df_trans = pd.read_csv("../input/train_transaction.csv")

In [None]:
plt.figure(figsize=(16,12))
plt.suptitle('Transaction Values Distribution', fontsize=22)
plt.subplot(221)
g = sns.distplot(df_trans[df_trans['TransactionAmt'] <= 1000]['TransactionAmt'])
g.set_title("Transaction Amount Distribution <= 1000", fontsize=18)
g.set_xlabel("")
g.set_ylabel("Probability", fontsize=15)

plt.subplot(222)
g1 = sns.distplot(np.log(df_trans['TransactionAmt']))
g1.set_title("Transaction Amount (Log) Distribution", fontsize=18)
g1.set_xlabel("")
g1.set_ylabel("Probability", fontsize=15)

plt.figure(figsize=(16,12))


plt.subplot(212)
g4 = plt.scatter(range(df_trans[df_trans['isFraud'] == 0].shape[0]),
                 np.sort(df_trans[df_trans['isFraud'] == 0]['TransactionAmt'].values), 
                 label='NoFraud', alpha=.2)
g4 = plt.scatter(range(df_trans[df_trans['isFraud'] == 1].shape[0]),
                 np.sort(df_trans[df_trans['isFraud'] == 1]['TransactionAmt'].values), 
                 label='Fraud', alpha=.2)
g4= plt.title("ECDF \nFRAUD and NO FRAUD Transaction Amount Distribution", fontsize=18)
g4 = plt.xlabel("Index")
g4 = plt.ylabel("Amount Distribution", fontsize=15)
g4 = plt.legend()

plt.figure(figsize=(16,12))

plt.subplot(321)
g = plt.scatter(range(df_trans[df_trans['isFraud'] == 1].shape[0]), 
                 np.sort(df_trans[df_trans['isFraud'] == 1]['TransactionAmt'].values), 
                label='isFraud', alpha=.4)
plt.title("FRAUD - Transaction Amount ECDF", fontsize=18)
plt.xlabel("Index")
plt.ylabel("Amount Distribution", fontsize=12)

plt.subplot(322)
g1 = plt.scatter(range(df_trans[df_trans['isFraud'] == 0].shape[0]),
                 np.sort(df_trans[df_trans['isFraud'] == 0]['TransactionAmt'].values), 
                 label='NoFraud', alpha=.2)
g1= plt.title("NO FRAUD - Transaction Amount ECDF", fontsize=18)
g1 = plt.xlabel("Index")
g1 = plt.ylabel("Amount Distribution", fontsize=15)

plt.suptitle('Individual ECDF Distribution', fontsize=22)

plt.show()
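
The plots above place the sorted amounts against their row index. A conventional empirical CDF puts the amount on the x-axis and the cumulative share of observations on the y-axis, which makes groups of very different sizes (fraud vs. no fraud) directly comparable. A minimal sketch, with a hypothetical helper `ecdf` and made-up amounts:

```python
import numpy as np


def ecdf(values):
    """Return sorted values and the cumulative share of observations <= each value."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# Made-up amounts, only to show the shape of the output.
amounts = [50.0, 20.0, 100.0, 20.0]
x, y = ecdf(amounts)
print(x)
print(y)
```

Calling `plt.step(x, y, where='post')` once per class would then draw both curves on a shared 0 to 1 scale.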

# Now, let's get to know the Product feature
- Distribution of products
- Distribution of fraud by product
- Is there a difference in transaction amounts between products?

In [None]:
df_train_transaction['TransactionDT'].plot(kind='hist',
                                        figsize=(15, 5),
                                        label='train',
                                        bins=50,
                                        title='Train vs Test TransactionDT distribution')
df_test_transaction['TransactionDT'].plot(kind='hist',
                                       label='test',
                                       bins=50)
plt.legend()
plt.show()

In [None]:
ax = df_train_transaction.plot(x='TransactionDT',
                       y='TransactionAmt',
                       kind='scatter',
                       alpha=0.01,
                       label='TransactionAmt-train',
                       title='Train and test Transaction Amounts by Time (TransactionDT)',
                       ylim=(0, 5000),
                       figsize=(15, 5))
df_test_transaction.plot(x='TransactionDT',
                      y='TransactionAmt',
                      kind='scatter',
                      label='TransactionAmt-test',
                      alpha=0.01,
                      color=color_pal[1],
                       ylim=(0, 5000),
                      ax=ax)
# Plot Fraud as Orange
df_train_transaction.loc[df_train_transaction['isFraud'] == 1] \
    .plot(x='TransactionDT',
         y='TransactionAmt',
         kind='scatter',
         alpha=0.01,
         label='TransactionAmt-train-fraud',
         title='Train and test Transaction Amounts by Time (TransactionDT)',
         ylim=(0, 5000),
         color='orange',
         figsize=(15, 5),
         ax=ax)
plt.show()

# Distribution of Target in Training Set

In [None]:
print('  {:.4f}% of Transactions that are fraud in train '.format(df_train_transaction['isFraud'].mean() * 100))

## Fraud rate: 3.4990% of transactions

In [None]:
df_train_transaction.groupby('isFraud') \
    .count()['TransactionID'] \
    .plot(kind='barh',
          title='Distribution of Target in Train',
          figsize=(15, 3))
plt.show()

# TransactionAmt

#### The amount of the transaction. I've taken a log transform in some of these plots to better show the distribution; otherwise the few very large transactions skew it. Because of the log transform, any values between 0 and 1 will appear to be negative.
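
A small numeric illustration of that last point, using made-up amounts. It also shows `np.log1p`, a commonly used alternative that stays non-negative for non-negative inputs; it is mentioned here only for comparison and is not what the plots in this notebook use.

```python
import numpy as np

# Made-up amounts spanning several orders of magnitude.
amounts = np.array([0.25, 1.0, 50.0, 5000.0])

log_amounts = np.log(amounts)      # negative below 1, exactly zero at 1
log1p_amounts = np.log1p(amounts)  # log(1 + x): never negative for x >= 0

for amt, a, b in zip(amounts, log_amounts, log1p_amounts):
    print(f"amount={amt:>8.2f}  log={a:>7.3f}  log1p={b:>7.3f}")
```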

In [None]:
df_train_transaction['TransactionAmt'] \
    .apply(np.log) \
    .plot(kind='hist',
          bins=100,
          figsize=(15, 5),
          title='Distribution of Log Transaction Amt')
plt.show()

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 6))
df_train_transaction.loc[df_train_transaction['isFraud'] == 1] \
    ['TransactionAmt'].apply(np.log) \
    .plot(kind='hist',
          bins=100,
          title='Log Transaction Amt - Fraud',
          color=color_pal[1],
          xlim=(-3, 10),
         ax= ax1)
df_train_transaction.loc[df_train_transaction['isFraud'] == 0] \
    ['TransactionAmt'].apply(np.log) \
    .plot(kind='hist',
          bins=100,
          title='Log Transaction Amt - Not Fraud',
          color=color_pal[2],
          xlim=(-3, 10),
         ax=ax2)
df_train_transaction.loc[df_train_transaction['isFraud'] == 1] \
    ['TransactionAmt'] \
    .plot(kind='hist',
          bins=100,
          title='Transaction Amt - Fraud',
          color=color_pal[1],
         ax= ax3)
df_train_transaction.loc[df_train_transaction['isFraud'] == 0] \
    ['TransactionAmt'] \
    .plot(kind='hist',
          bins=100,
          title='Transaction Amt - Not Fraud',
          color=color_pal[2],
         ax=ax4)
plt.show()

* Fraudulent charges appear to have a higher average transaction amount.

In [None]:
print('Mean transaction amt for fraud is {:.4f}'.format(df_train_transaction.loc[df_train_transaction['isFraud'] == 1]['TransactionAmt'].mean()))
print('Mean transaction amt for non-fraud is {:.4f}'.format(df_train_transaction.loc[df_train_transaction['isFraud'] == 0]['TransactionAmt'].mean()))

### ProductCD
- For now we don't know exactly what these values represent.
- `W` has the most observations, `C` the least.
- ProductCD `C` has the highest fraud rate, at >11%.
- ProductCD `W` has the lowest, at ~2%.

In [None]:
df_train_transaction.groupby('ProductCD') \
    ['TransactionID'].count() \
    .sort_index() \
    .plot(kind='barh',
          figsize=(15, 3),
         title='Count of Observations by ProductCD')
plt.show()

In [None]:
df_train_transaction.groupby('ProductCD')['isFraud'] \
    .mean() \
    .sort_index() \
    .plot(kind='barh',
          figsize=(15, 3),
         title='Fraud rate by ProductCD')
plt.show()

## Categorical Features - Transaction


We are told in the data description that the following transaction columns are categorical:

* ProductCD
* emaildomain
* card1 - card6
* addr1, addr2
* P_emaildomain
* R_emaildomain
* M1 - M9

### card1 - card6

* We are told these are all categorical, even though some appear numeric.

In [None]:
card_cols = [c for c in df_train_transaction.columns if 'card' in c]
df_train_transaction[card_cols].head()

In [None]:
color_idx = 0
for c in card_cols:
    if df_train_transaction[c].dtype in ['float64','int64']:
        df_train_transaction[c].plot(kind='hist',
                                      title=c,
                                      bins=50,
                                      figsize=(15, 2),
                                      color=color_pal[color_idx])
    color_idx += 1
    plt.show()

In [None]:
df_train_transaction_fr = df_train_transaction.loc[df_train_transaction['isFraud'] == 1]
df_train_transaction_nofr = df_train_transaction.loc[df_train_transaction['isFraud'] == 0]
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 8))
df_train_transaction_fr.groupby('card4')['card4'].count().plot(kind='barh', ax=ax1, title='Count of card4 fraud')
df_train_transaction_nofr.groupby('card4')['card4'].count().plot(kind='barh', ax=ax2, title='Count of card4 non-fraud')
df_train_transaction_fr.groupby('card6')['card6'].count().plot(kind='barh', ax=ax3, title='Count of card6 fraud')
df_train_transaction_nofr.groupby('card6')['card6'].count().plot(kind='barh', ax=ax4, title='Count of card6 non-fraud')
plt.show()

## Fraud by card type

## addr1 & addr2

The data description states that these are categorical even though they look numeric. Could they be the address value?

In [None]:
print(' addr1 - has {} NA values'.format(df_train_transaction['addr1'].isna().sum()))
print(' addr2 - has {} NA values'.format(df_train_transaction['addr2'].isna().sum()))

In [None]:
df_train_transaction['addr1'].plot(kind='hist', bins=500, figsize=(15, 2), title='addr1 distribution')
plt.show()
df_train_transaction['addr2'].plot(kind='hist', bins=500, figsize=(15, 2), title='addr2 distribution')
plt.show()

## dist1 & dist2

Plotting with logx to better show the distribution. Possibly this could be the distance of the transaction vs. the card owner's home/work address. This is just a guess.

In [None]:
df_train_transaction['dist1'].plot(kind='hist',
                                bins=5000,
                                figsize=(15, 2),
                                title='dist1 distribution',
                                color=color_pal[1],
                                logx=True)
plt.show()
df_train_transaction['dist2'].plot(kind='hist',
                                bins=5000,
                                figsize=(15, 2),
                                title='dist2 distribution',
                                color=color_pal[1],
                                logx=True)
plt.show()

In [None]:
df_id = pd.read_csv("../input/train_identity.csv")
df_trans = pd.read_csv("../input/train_transaction.csv")

In [None]:
def resumetable(df):
    print(f"Dataset Shape: {df.shape}")
    summary = pd.DataFrame(df.dtypes,columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name','dtypes']]
    summary['Missing'] = df.isnull().sum().values    
    summary['Uniques'] = df.nunique().values
    summary['First Value'] = df.loc[0].values
    summary['Second Value'] = df.loc[1].values
    summary['Third Value'] = df.loc[2].values

    for name in summary['Name'].value_counts().index:
        summary.loc[summary['Name'] == name, 'Entropy'] = round(stats.entropy(df[name].value_counts(normalize=True), base=2),2) 

    return summary

## Function to reduce the DF size
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

def CalcOutliers(df_num): 

    # calculating mean and std of the array
    data_mean, data_std = np.mean(df_num), np.std(df_num)

    # setting the cut-off at 3 standard deviations on either side of the mean
    # (you can change this multiplier)
    cut = data_std * 3

    # calculating the lower and upper cut values
    lower, upper = data_mean - cut, data_mean + cut

    # lists of the lower, upper and total outlier values
    outliers_lower = [x for x in df_num if x < lower]
    outliers_higher = [x for x in df_num if x > upper]
    outliers_total = [x for x in df_num if x < lower or x > upper]

    # values that are not outliers
    outliers_removed = [x for x in df_num if x > lower and x < upper]
    
    print('Identified lowest outliers: %d' % len(outliers_lower)) # number of values below the lower cut
    print('Identified upper outliers: %d' % len(outliers_higher)) # number of values above the upper cut
    print('Total outlier observations: %d' % len(outliers_total)) # number of outliers on both sides
    print('Non-outlier observations: %d' % len(outliers_removed)) # number of non-outlier values
    # note: this ratio is relative to the non-outlier observations, not to all rows
    print("Outliers as % of non-outlier observations: ", round((len(outliers_total) / len(outliers_removed) )*100, 4))
    
    return

## Target Distribution

In [None]:
df_trans['TransactionAmt'] = df_trans['TransactionAmt'].astype(float)
total = len(df_trans)
total_amt = df_trans.groupby(['isFraud'])['TransactionAmt'].sum().sum()
plt.figure(figsize=(16,6))

plt.subplot(121)
g = sns.countplot(x='isFraud', data=df_trans, )
g.set_title("Fraud Transactions Distribution \n# 0: No Fraud | 1: Fraud #", fontsize=22)
g.set_xlabel("Is fraud?", fontsize=18)
g.set_ylabel('Count', fontsize=18)
for p in g.patches:
    height = p.get_height()
    g.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=15) 

perc_amt = (df_trans.groupby(['isFraud'])['TransactionAmt'].sum())
perc_amt = perc_amt.reset_index()
plt.subplot(122)
g1 = sns.barplot(x='isFraud', y='TransactionAmt',  dodge=True, data=perc_amt)
g1.set_title("% Total Amount in Transaction Amt \n# 0: No Fraud | 1: Fraud #", fontsize=22)
g1.set_xlabel("Is fraud?", fontsize=18)
g1.set_ylabel('Total Transaction Amount Scalar', fontsize=18)
for p in g1.patches:
    height = p.get_height()
    g1.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/total_amt * 100),
            ha="center", fontsize=15) 
    
plt.show()

- **About 3.5% of the transactions in our dataset are fraudulent. It is interesting to check whether fraud's share of the total transaction amount is higher or lower than 3.5%. The plot above shows roughly the same percentage when considering the total transaction amount for fraud and no fraud. Let's explore the transaction amount further below.**

In [None]:
df_trans['TransactionAmt'] = df_trans['TransactionAmt'].astype(float)
print("Transaction Amounts Quantiles:")
print(df_trans['TransactionAmt'].quantile([.01, .025, .1, .25, .5, .75, .9, .975, .99]))

### Plotting Transaction Amount Values Distribution

### Quantiles of Fraud and No Fraud Transactions

In [None]:
print(pd.concat([df_trans[df_trans['isFraud'] == 1]['TransactionAmt']\
                 .quantile([.01, .1, .25, .5, .75, .9, .99])\
                 .reset_index(), 
                 df_trans[df_trans['isFraud'] == 0]['TransactionAmt']\
                 .quantile([.01, .1, .25, .5, .75, .9, .99])\
                 .reset_index()],
                axis=1, keys=['Fraud', "No Fraud"]))

### Transaction Amount Outliers

- Values more than 3 standard deviations from the mean are treated as outliers.

In [None]:
CalcOutliers(df_trans['TransactionAmt'])

* Identified lowest outliers: 0
* Identified upper outliers: 10093
* Total outlier observations: 10093
* Non-outlier observations: 580447
* Outliers relative to non-outlier observations: 1.7388%

* If we consider only values between 0 and 800, we avoid the outliers and can have more confidence in our distribution.
* About 10 thousand rows contain outliers, roughly 1.7% of all rows.
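
The same 3-standard-deviation rule can be written with NumPy boolean masks instead of Python list comprehensions, which tends to be much faster on several hundred thousand rows. The sketch below is an illustrative alternative, not the `CalcOutliers` function used above: the name `three_sigma_outliers` and the toy data are assumptions, and it reports the outlier share relative to all rows. With the counts listed above, that share would be 10093 / 590540, or about 1.71%.

```python
import numpy as np


def three_sigma_outliers(values, n_std: float = 3.0) -> dict:
    """Count values lying more than `n_std` standard deviations from the mean."""
    arr = np.asarray(values, dtype=float)
    mean, std = arr.mean(), arr.std()
    lower, upper = mean - n_std * std, mean + n_std * std
    n_lower = int((arr < lower).sum())
    n_upper = int((arr > upper).sum())
    n_total = n_lower + n_upper
    return {
        "n_lower": n_lower,
        "n_upper": n_upper,
        "n_outliers": n_total,
        "n_non_outliers": len(arr) - n_total,
        "pct_of_all_rows": 100 * n_total / len(arr),
    }


# Toy data: 99 ordinary amounts and one extreme value.
toy_amounts = [10.0] * 99 + [1000.0]
result = three_sigma_outliers(toy_amounts)
print(result)
```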

#### Now, let's get to know the Product feature
* Distribution of products
* Distribution of fraud by product
* Is there a difference in transaction amounts between products?

In [None]:
tmp = pd.crosstab(df_trans['ProductCD'], df_trans['isFraud'], normalize='index') * 100
tmp = tmp.reset_index()
tmp.rename(columns={0:'NoFraud', 1:'Fraud'}, inplace=True)

plt.figure(figsize=(14,10))
plt.suptitle('ProductCD Distributions', fontsize=22)

plt.subplot(221)
g = sns.countplot(x='ProductCD', data=df_trans)
# plt.legend(title='Fraud', loc='upper center', labels=['No', 'Yes'])

g.set_title("ProductCD Distribution", fontsize=19)
g.set_xlabel("ProductCD Name", fontsize=17)
g.set_ylabel("Count", fontsize=17)
g.set_ylim(0,500000)
for p in g.patches:
    height = p.get_height()
    g.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=14) 

plt.subplot(222)
g1 = sns.countplot(x='ProductCD', hue='isFraud', data=df_trans)
plt.legend(title='Fraud', loc='best', labels=['No', 'Yes'])
gt = g1.twinx()
gt = sns.pointplot(x='ProductCD', y='Fraud', data=tmp, color='black', order=['W', 'H',"C", "S", "R"], legend=False)
gt.set_ylabel("% of Fraud Transactions", fontsize=16)

g1.set_title("Product CD by Target(isFraud)", fontsize=19)
g1.set_xlabel("ProductCD Name", fontsize=17)
g1.set_ylabel("Count", fontsize=17)

plt.subplot(212)
g3 = sns.boxenplot(x='ProductCD', y='TransactionAmt', hue='isFraud', 
              data=df_trans[df_trans['TransactionAmt'] <= 2000] )
g3.set_title("Transaction Amount Distribution by ProductCD and Target", fontsize=20)
g3.set_xlabel("ProductCD Name", fontsize=17)
g3.set_ylabel("Transaction Values", fontsize=17)

plt.subplots_adjust(hspace = 0.6, top = 0.85)

plt.show()

#### Exploring M1-M9 Features

In [None]:
for col in ['M1', 'M2', 'M3', 'M4', 'M5', 'M6', 'M7', 'M8', 'M9']:
    df_trans[col] = df_trans[col].fillna("Miss")
    
def ploting_dist_ratio(df, col, lim=2000):
    tmp = pd.crosstab(df[col], df['isFraud'], normalize='index') * 100
    tmp = tmp.reset_index()
    tmp.rename(columns={0:'NoFraud', 1:'Fraud'}, inplace=True)

    plt.figure(figsize=(20,5))
    plt.suptitle(f'{col} Distributions ', fontsize=22)

    plt.subplot(121)
    g = sns.countplot(x=col, data=df, order=list(tmp[col].values))
    # plt.legend(title='Fraud', loc='upper center', labels=['No', 'Yes'])
    g.set_title(f"{col} Distribution\nCount and %Fraud by each category", fontsize=18)
    g.set_ylim(0,400000)
    gt = g.twinx()
    gt = sns.pointplot(x=col, y='Fraud', data=tmp, order=list(tmp[col].values),
                       color='black', legend=False, )
    gt.set_ylim(0,20)
    gt.set_ylabel("% of Fraud Transactions", fontsize=16)
    g.set_xlabel(f"{col} Category Names", fontsize=16)
    g.set_ylabel("Count", fontsize=17)
    for p in gt.patches:
        height = p.get_height()
        gt.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(height/total*100),
                ha="center",fontsize=14) 
        
    perc_amt = (df.groupby(['isFraud',col])['TransactionAmt'].sum() / total_amt * 100).unstack('isFraud')
    perc_amt = perc_amt.reset_index()
    perc_amt.rename(columns={0:'NoFraud', 1:'Fraud'}, inplace=True)

    plt.subplot(122)
    g1 = sns.boxplot(x=col, y='TransactionAmt', hue='isFraud', 
                     data=df[df['TransactionAmt'] <= lim], order=list(tmp[col].values))
    g1t = g1.twinx()
    g1t = sns.pointplot(x=col, y='Fraud', data=perc_amt, order=list(tmp[col].values),
                       color='black', legend=False, )
    g1t.set_ylim(0,5)
    g1t.set_ylabel("%Fraud Total Amount", fontsize=16)
    g1.set_title(f"{col} by Transactions dist", fontsize=18)
    g1.set_xlabel(f"{col} Category Names", fontsize=16)
    g1.set_ylabel("Transaction Amount(U$)", fontsize=16)
        
    plt.subplots_adjust(hspace=.4, wspace = 0.35, top = 0.80)
    
    plt.show()
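
To make the `pd.crosstab(..., normalize='index') * 100` step used by the plotting functions concrete, here is a minimal sketch on a made-up six-row frame. The values are invented for illustration and are not taken from the competition data. Each row of the result sums to 100, so the `Fraud` column is the fraud percentage within each category.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "M4": ["M0", "M0", "M0", "M0", "M2", "M2"],
    "isFraud": [0, 0, 0, 1, 0, 1],
})

# Row-normalised crosstab: share of each isFraud value within a category, in percent.
pct = pd.crosstab(toy["M4"], toy["isFraud"], normalize="index") * 100
pct = pct.reset_index().rename(columns={0: "NoFraud", 1: "Fraud"})
print(pct)
```

This is the same table that the point plots on the secondary axis are drawn from.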


### M distributions:  Count, %Fraud and Transaction Amount distribution

In [None]:
for col in ['M1', 'M2', 'M3', 'M4', 'M5', 'M6', 'M7', 'M8', 'M9']:
    ploting_dist_ratio(df_trans, col, lim=2500)

* **These plots give us several interesting insights into the M features.**
* **M4 is the only feature where missing values do not have the highest fraud percentage.**

#### Addr1 and Addr2

In [None]:
print("Address Features Quantiles: ")
print(df_trans[['addr1', 'addr2']].quantile([0.01, .025, .1, .25, .5, .75, .90,.975, .99]))

I will set all Addr1 values with 5000 or fewer entries to "Others".
In Addr2, I will set all values with 50 or fewer entries to "Others".

In [None]:
df_trans.loc[df_trans.addr1.isin(df_trans.addr1.value_counts()[df_trans.addr1.value_counts() <= 5000 ].index), 'addr1'] = "Others"
df_trans.loc[df_trans.addr2.isin(df_trans.addr2.value_counts()[df_trans.addr2.value_counts() <= 50 ].index), 'addr2'] = "Others"
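
The same thresholding is repeated for several columns in this notebook, so it can be wrapped in a small helper. The sketch below is one possible way to do that; `collapse_rare` is a hypothetical name that does not appear elsewhere in the notebook, and the toy values are invented.

```python
import pandas as pd

def collapse_rare(series, max_count, label="Others"):
    """Replace values that occur at most max_count times with a single label."""
    counts = series.value_counts()
    rare = counts[counts <= max_count].index
    # Cast to object so a string label can sit next to numeric codes.
    return series.astype(object).where(~series.isin(rare), label)

# Toy example, invented for illustration.
s = pd.Series([87, 87, 87, 60, 60, 96])
out = collapse_rare(s, max_count=1)
print(out.tolist())
```

Missing values are left untouched, because `value_counts()` ignores them by default.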

### Addr1 Distributions

In [None]:
def ploting_cnt_amt(df, col, lim=2000):
    tmp = pd.crosstab(df[col], df['isFraud'], normalize='index') * 100
    tmp = tmp.reset_index()
    tmp.rename(columns={0:'NoFraud', 1:'Fraud'}, inplace=True)
    
    plt.figure(figsize=(16,14))    
    plt.suptitle(f'{col} Distributions ', fontsize=24)
    
    plt.subplot(211)
    g = sns.countplot( x=col,  data=df, order=list(tmp[col].values))
    gt = g.twinx()
    gt = sns.pointplot(x=col, y='Fraud', data=tmp, order=list(tmp[col].values),
                       color='black', legend=False, )
    gt.set_ylim(0,tmp['Fraud'].max()*1.1)
    gt.set_ylabel("%Fraud Transactions", fontsize=16)
    g.set_title(f"Most Frequent {col} values and % Fraud Transactions", fontsize=20)
    g.set_xlabel(f"{col} Category Names", fontsize=16)
    g.set_ylabel("Count", fontsize=17)
    g.set_xticklabels(g.get_xticklabels(),rotation=45)
    sizes = []
    for p in g.patches:
        height = p.get_height()
        sizes.append(height)
        g.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(height/total*100),
                ha="center",fontsize=12) 
        
    g.set_ylim(0,max(sizes)*1.15)
    
    #########################################################################
    perc_amt = (df.groupby(['isFraud',col])['TransactionAmt'].sum() \
                / df.groupby([col])['TransactionAmt'].sum() * 100).unstack('isFraud')
    perc_amt = perc_amt.reset_index()
    perc_amt.rename(columns={0:'NoFraud', 1:'Fraud'}, inplace=True)
    amt = df.groupby([col])['TransactionAmt'].sum().reset_index()
    perc_amt = perc_amt.fillna(0)
    plt.subplot(212)
    g1 = sns.barplot(x=col, y='TransactionAmt', 
                       data=amt, 
                       order=list(tmp[col].values))
    g1t = g1.twinx()
    g1t = sns.pointplot(x=col, y='Fraud', data=perc_amt, 
                        order=list(tmp[col].values),
                       color='black', legend=False, )
    g1t.set_ylim(0,perc_amt['Fraud'].max()*1.1)
    g1t.set_ylabel("%Fraud Total Amount", fontsize=16)
    g1.set_title(f"{col} by Transactions Total + %of total and %Fraud Transactions", fontsize=20)
    g1.set_xlabel(f"{col} Category Names", fontsize=16)
    g1.set_ylabel("Transaction Total Amount(U$)", fontsize=16)
    g1.set_xticklabels(g1.get_xticklabels(),rotation=45)    
    
    for p in g1.patches:
        height = p.get_height()
        g1.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(height/total_amt*100),
                ha="center",fontsize=12) 
        
    plt.subplots_adjust(hspace=.4, top = 0.9)
    plt.show()
    
ploting_cnt_amt(df_trans, 'addr1')

* Addr1 shows some interesting patterns.

### Addr2 Distributions

In [None]:
ploting_cnt_amt(df_trans, 'addr2')

* Almost all entries in Addr2 share the same value.
* Interestingly, for the value 65 the fraud percentage is almost 60%.
* Although the value 87 accounts for 88% of all entries, it accounts for 96% of the total transaction amount.

### emaildomain Distributions

- I will group all e-mail domains by their respective companies.
- I will also set all values with 500 or fewer entries to "Others".

In [None]:
df_trans.loc[df_trans['P_emaildomain'].isin(['gmail.com', 'gmail']),'P_emaildomain'] = 'Google'

df_trans.loc[df_trans['P_emaildomain'].isin(['yahoo.com', 'yahoo.com.mx',  'yahoo.co.uk',
                                         'yahoo.co.jp', 'yahoo.de', 'yahoo.fr',
                                         'yahoo.es']), 'P_emaildomain'] = 'Yahoo Mail'
df_trans.loc[df_trans['P_emaildomain'].isin(['hotmail.com','outlook.com','msn.com', 'live.com.mx', 
                                         'hotmail.es','hotmail.co.uk', 'hotmail.de',
                                         'outlook.es', 'live.com', 'live.fr',
                                         'hotmail.fr']), 'P_emaildomain'] = 'Microsoft'
df_trans.loc[df_trans.P_emaildomain.isin(df_trans.P_emaildomain\
                                         .value_counts()[df_trans.P_emaildomain.value_counts() <= 500 ]\
                                         .index), 'P_emaildomain'] = "Others"
df_trans['P_emaildomain'] = df_trans['P_emaildomain'].fillna("NoInf")

### Plotting P-Email Domain

In [None]:
ploting_cnt_amt(df_trans, 'P_emaildomain')

### R-Email Domain Distribution
- I will group all e-mail domains by their respective companies.
- I will set all values with 300 or fewer entries to "Others".

In [None]:
df_trans.loc[df_trans['R_emaildomain'].isin(['gmail.com', 'gmail']),'R_emaildomain'] = 'Google'

df_trans.loc[df_trans['R_emaildomain'].isin(['yahoo.com', 'yahoo.com.mx',  'yahoo.co.uk',
                                             'yahoo.co.jp', 'yahoo.de', 'yahoo.fr',
                                             'yahoo.es']), 'R_emaildomain'] = 'Yahoo Mail'
df_trans.loc[df_trans['R_emaildomain'].isin(['hotmail.com','outlook.com','msn.com', 'live.com.mx', 
                                             'hotmail.es','hotmail.co.uk', 'hotmail.de',
                                             'outlook.es', 'live.com', 'live.fr',
                                             'hotmail.fr']), 'R_emaildomain'] = 'Microsoft'
df_trans.loc[df_trans.R_emaildomain.isin(df_trans.R_emaildomain\
                                         .value_counts()[df_trans.R_emaildomain.value_counts() <= 300 ]\
                                         .index), 'R_emaildomain'] = "Others"
df_trans['R_emaildomain'] = df_trans['R_emaildomain'].fillna("NoInf")

In [None]:
ploting_cnt_amt(df_trans, 'R_emaildomain')

- Both email domain features show very similar distributions.
- Interestingly, the Google and icloud domains show high fraud values.

### C1-C14 features

- Let's understand what these features are.
- What do their distributions look like?

In [None]:
# Reducing memory
df_trans = reduce_mem_usage(df_trans)
df_id = reduce_mem_usage(df_id)

In [None]:
df_trans[['C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8',
                      'C9', 'C10', 'C11', 'C12', 'C13', 'C14']].describe()

In [None]:
df_trans.loc[df_trans.C1.isin(df_trans.C1\
                              .value_counts()[df_trans.C1.value_counts() <= 400 ]\
                              .index), 'C1'] = "Others"

#### C1 Distribution Plot

In [None]:
ploting_cnt_amt(df_trans, 'C1')

In [None]:
df_trans.loc[df_trans.C2.isin(df_trans.C2\
                              .value_counts()[df_trans.C2.value_counts() <= 350 ]\
                              .index), 'C2'] = "Others"

In [None]:
ploting_cnt_amt(df_trans, 'C2')

* **The top 3 values are 1, 2 and 3, and the same holds for the total amounts. We see the same pattern in the fraud ratios.**

### TimeDelta Feature

* **Let's see whether particular hours have a higher percentage of fraud.**

### Converting to Total Days, Weekdays and Hours

* **In the Discussions tab I read an excellent approach to the time-delta column (`TransactionDT`); the link is in the code cell below. We will take 2017-12-01 as the first date and use the time delta to compute datetime features.**

In [None]:
# https://www.kaggle.com/c/ieee-fraud-detection/discussion/100400#latest-579480
import datetime

START_DATE = '2017-12-01'
startdate = datetime.datetime.strptime(START_DATE, "%Y-%m-%d")
df_trans["Date"] = df_trans['TransactionDT'].apply(lambda x: (startdate + datetime.timedelta(seconds=x)))

df_trans['_Weekdays'] = df_trans['Date'].dt.dayofweek
df_trans['_Hours'] = df_trans['Date'].dt.hour
df_trans['_Days'] = df_trans['Date'].dt.day
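
The row-by-row `apply` above can also be written as a vectorized operation, which is usually faster on a frame of this size. The following is a minimal sketch on a toy frame. It assumes, as the cell above does, that `TransactionDT` is an offset in seconds from the chosen start date; the start date itself is an assumption taken from the linked discussion, not a documented fact.

```python
import pandas as pd

# Toy frame; TransactionDT is assumed to be an offset in seconds.
toy = pd.DataFrame({"TransactionDT": [86400, 90000, 86400 * 8 + 3600 * 13]})

START_DATE = pd.Timestamp("2017-12-01")  # assumed reference date
toy["Date"] = START_DATE + pd.to_timedelta(toy["TransactionDT"], unit="s")

toy["_Weekdays"] = toy["Date"].dt.dayofweek  # Monday is 0
toy["_Hours"] = toy["Date"].dt.hour
toy["_Days"] = toy["Date"].dt.day
print(toy)
```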

### Top Days with Highest Total Transaction Amount

In [None]:
from functools import partial

import os
import gc
print(os.listdir("../input"))

In [None]:
ploting_cnt_amt(df_trans, '_Days')

#### Plotting Weekday Distributions

In [None]:
ploting_cnt_amt(df_trans, '_Weekdays')

#### Plotting Hour Distributions

In [None]:
ploting_cnt_amt(df_trans, '_Hours')

In [None]:
df_id[['id_12', 'id_13', 'id_14', 'id_15', 'id_16', 'id_17', 'id_18',
       'id_19', 'id_20', 'id_21', 'id_22', 'id_23', 'id_24', 'id_25',
       'id_26', 'id_27', 'id_28', 'id_29', 'id_30', 'id_31', 'id_32',
       'id_33', 'id_34', 'id_35', 'id_36', 'id_37', 'id_38']].describe(include='all')

In [None]:
df_train = df_trans.merge(df_id, how='left', left_index=True, right_index=True)

In [None]:
def cat_feat_ploting(df, col):
    tmp = pd.crosstab(df[col], df['isFraud'], normalize='index') * 100
    tmp = tmp.reset_index()
    tmp.rename(columns={0:'NoFraud', 1:'Fraud'}, inplace=True)

    plt.figure(figsize=(14,10))
    plt.suptitle(f'{col} Distributions', fontsize=22)

    plt.subplot(221)
    g = sns.countplot(x=col, data=df, order=tmp[col].values)
    # plt.legend(title='Fraud', loc='upper center', labels=['No', 'Yes'])

    g.set_title(f"{col} Distribution", fontsize=19)
    g.set_xlabel(f"{col} Name", fontsize=17)
    g.set_ylabel("Count", fontsize=17)
    # g.set_ylim(0,500000)
    for p in g.patches:
        height = p.get_height()
        g.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(height/total*100),
                ha="center", fontsize=14) 

    plt.subplot(222)
    g1 = sns.countplot(x=col, hue='isFraud', data=df, order=tmp[col].values)
    plt.legend(title='Fraud', loc='best', labels=['No', 'Yes'])
    gt = g1.twinx()
    gt = sns.pointplot(x=col, y='Fraud', data=tmp, color='black', order=tmp[col].values, legend=False)
    gt.set_ylabel("% of Fraud Transactions", fontsize=16)

    g1.set_title(f"{col} by Target(isFraud)", fontsize=19)
    g1.set_xlabel(f"{col} Name", fontsize=17)
    g1.set_ylabel("Count", fontsize=17)

    plt.subplot(212)
    g3 = sns.boxenplot(x=col, y='TransactionAmt', hue='isFraud', 
                       data=df[df['TransactionAmt'] <= 2000], order=tmp[col].values )
    g3.set_title(f"Transaction Amount Distribution by {col} and Target", fontsize=20)
    g3.set_xlabel(f"{col} Name", fontsize=17)
    g3.set_ylabel("Transaction Values", fontsize=17)

    plt.subplots_adjust(hspace = 0.4, top = 0.85)

    plt.show()

- **Transaction Amount Distribution by ProductCD and Target**

#### Plotting columns with few unique values

In [None]:
for col in ['id_12', 'id_15', 'id_16', 'id_23', 'id_27', 'id_28', 'id_29']:
    df_train[col] = df_train[col].fillna('NaN')
    cat_feat_ploting(df_train, col)

### Modelling

In [None]:
df_trans = pd.read_csv('../input/train_transaction.csv')
df_test_trans = pd.read_csv('../input/test_transaction.csv')

df_id = pd.read_csv('../input/train_identity.csv')
df_test_id = pd.read_csv('../input/test_identity.csv')

sample_submission = pd.read_csv('../input/sample_submission.csv', index_col='TransactionID')

# Join on the TransactionID key only; `on` should not be combined with left_index/right_index.
df_train = df_trans.merge(df_id, how='left', on='TransactionID')
df_test = df_test_trans.merge(df_test_id, how='left', on='TransactionID')

print(df_train.shape)
print(df_test.shape)

# y_train = df_train['isFraud'].copy()
del df_trans, df_id, df_test_trans, df_test_id
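
Since not every transaction has an identity row, the join has to match on `TransactionID` rather than on row position. The toy sketch below (values invented) shows a key-based left join; the `validate` argument is an optional guard that raises if either side contains duplicate keys.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables (values invented).
trans = pd.DataFrame({"TransactionID": [10, 11, 12], "TransactionAmt": [5.0, 7.5, 9.0]})
ident = pd.DataFrame({"TransactionID": [12, 10], "DeviceType": ["mobile", "desktop"]})

# Left join on the key column only; validate guards against duplicate keys.
merged = trans.merge(ident, how="left", on="TransactionID", validate="one_to_one")
print(merged)
```

Note that the identity rows are in a different order from the transactions here, so a positional join would attach them to the wrong transactions.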


### Mapping emails

In [None]:
emails = {'gmail': 'google', 'att.net': 'att', 'twc.com': 'spectrum', 
          'scranton.edu': 'other', 'optonline.net': 'other', 'hotmail.co.uk': 'microsoft',
          'comcast.net': 'other', 'yahoo.com.mx': 'yahoo', 'yahoo.fr': 'yahoo',
          'yahoo.es': 'yahoo', 'charter.net': 'spectrum', 'live.com': 'microsoft', 
          'aim.com': 'aol', 'hotmail.de': 'microsoft', 'centurylink.net': 'centurylink',
          'gmail.com': 'google', 'me.com': 'apple', 'earthlink.net': 'other', 'gmx.de': 'other',
          'web.de': 'other', 'cfl.rr.com': 'other', 'hotmail.com': 'microsoft', 
          'protonmail.com': 'other', 'hotmail.fr': 'microsoft', 'windstream.net': 'other', 
          'outlook.es': 'microsoft', 'yahoo.co.jp': 'yahoo', 'yahoo.de': 'yahoo',
          'servicios-ta.com': 'other', 'netzero.net': 'other', 'suddenlink.net': 'other',
          'roadrunner.com': 'other', 'sc.rr.com': 'other', 'live.fr': 'microsoft',
          'verizon.net': 'yahoo', 'msn.com': 'microsoft', 'q.com': 'centurylink', 
          'prodigy.net.mx': 'att', 'frontier.com': 'yahoo', 'anonymous.com': 'other', 
          'rocketmail.com': 'yahoo', 'sbcglobal.net': 'att', 'frontiernet.net': 'yahoo', 
          'ymail.com': 'yahoo', 'outlook.com': 'microsoft', 'mail.com': 'other', 
          'bellsouth.net': 'other', 'embarqmail.com': 'centurylink', 'cableone.net': 'other', 
          'hotmail.es': 'microsoft', 'mac.com': 'apple', 'yahoo.co.uk': 'yahoo', 'netzero.com': 'other', 
          'yahoo.com': 'yahoo', 'live.com.mx': 'microsoft', 'ptd.net': 'other', 'cox.net': 'other',
          'aol.com': 'aol', 'juno.com': 'other', 'icloud.com': 'apple'}

us_emails = ['gmail', 'net', 'edu']

# https://www.kaggle.com/c/ieee-fraud-detection/discussion/100499#latest-579654
for c in ['P_emaildomain', 'R_emaildomain']:
    df_train[c + '_bin'] = df_train[c].map(emails)
    df_test[c + '_bin'] = df_test[c].map(emails)
    
    df_train[c + '_suffix'] = df_train[c].map(lambda x: str(x).split('.')[-1])
    df_test[c + '_suffix'] = df_test[c].map(lambda x: str(x).split('.')[-1])
    
    df_train[c + '_suffix'] = df_train[c + '_suffix'].map(lambda x: x if str(x) not in us_emails else 'us')
    df_test[c + '_suffix'] = df_test[c + '_suffix'].map(lambda x: x if str(x) not in us_emails else 'us')
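
To see what the `_bin` and `_suffix` columns end up containing, here is a small sketch that applies the same three steps to a handful of toy domains. `emails_demo` is a three-entry excerpt of the mapping above, and the names `emails_demo` and `us_suffixes` are used only for this illustration.

```python
import numpy as np
import pandas as pd

# Small excerpt of the mapping above, plus toy inputs (invented for illustration).
emails_demo = {"gmail.com": "google", "yahoo.co.jp": "yahoo", "comcast.net": "other"}
us_suffixes = ["gmail", "net", "edu"]

domains = pd.Series(["gmail.com", "yahoo.co.jp", "comcast.net", np.nan], dtype=object)
provider = domains.map(emails_demo)
suffix = domains.map(lambda x: str(x).split(".")[-1])
suffix = suffix.map(lambda x: "us" if x in us_suffixes else x)
print(pd.DataFrame({"domain": domains, "bin": provider, "suffix": suffix}))
```

Two details are easy to miss: `gmail.com` keeps the suffix `com` (only the bare value `gmail` is mapped to `us`), and a missing domain becomes the literal string `nan` because of the `str(x)` call.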

### Encoding categorical features

## Libraries

In [None]:
!pip install catboost

In [None]:
# data analysis libraries:
import numpy as np
import pandas as pd

# data visualization libraries:
import matplotlib.pyplot as plt
import seaborn as sns

# to ignore warnings:
import warnings
warnings.filterwarnings('ignore')

# to display all columns:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# standard library:
import os
import gc
import time
import datetime
from contextlib import contextmanager

# statistics and modelling libraries:
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.preprocessing import scale, LabelEncoder
from sklearn.model_selection import (train_test_split, GridSearchCV, cross_val_score,
                                     GroupKFold, KFold, TimeSeriesSplit)
from sklearn.metrics import (confusion_matrix, accuracy_score, classification_report,
                             roc_auc_score, roc_curve)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import lightgbm as lgb
from catboost import CatBoostClassifier
from tqdm import tqdm_notebook

warnings.simplefilter('ignore')
sns.set()
%matplotlib inline

# Helper Functions

In [None]:
@contextmanager
def timer(title):
    t0 = time.time()
    yield
    print("{} - done in {:.0f}s".format(title, time.time() - t0))

In [None]:
# FREQUENCY ENCODE TOGETHER
def encode_FE(df1, df2, cols):
    for col in cols:
        df = pd.concat([df1[col],df2[col]])
        vc = df.value_counts(dropna=True, normalize=True).to_dict()
        vc[-1] = -1
        nm = col+'_FE'
        df1[nm] = df1[col].map(vc)
        df1[nm] = df1[nm].astype('float32')
        df2[nm] = df2[col].map(vc)
        df2[nm] = df2[nm].astype('float32')
        print(nm,', ',end='')
        
# LABEL ENCODE
def encode_LE(col,train,test,verbose=False):
    df_comb = pd.concat([train[col],test[col]],axis=0)
    df_comb,_ = df_comb.factorize(sort=True)
    nm = col
    if df_comb.max()>32000: 
        train[nm] = df_comb[:len(train)].astype('int32')
        test[nm] = df_comb[len(train):].astype('int32')
    else:
        train[nm] = df_comb[:len(train)].astype('int16')
        test[nm] = df_comb[len(train):].astype('int16')
    del df_comb; x=gc.collect()
    if verbose: print(nm,', ',end='')
        
# GROUP AGGREGATION MEAN AND STD
# https://www.kaggle.com/kyakovlev/ieee-fe-with-some-eda
def encode_AG(main_columns, uids, aggregations, train_df, test_df, 
              fillna=True, usena=False):
    # AGGREGATION OF MAIN WITH UID FOR GIVEN STATISTICS
    for main_column in main_columns:  
        for col in uids:
            for agg_type in aggregations:
                new_col_name = main_column+'_'+col+'_'+agg_type
                temp_df = pd.concat([train_df[[col, main_column]], test_df[[col,main_column]]])
                if usena: temp_df.loc[temp_df[main_column]==-1,main_column] = np.nan
                temp_df = temp_df.groupby([col])[main_column].agg([agg_type]).reset_index().rename(
                                                        columns={agg_type: new_col_name})

                temp_df.index = list(temp_df[col])
                temp_df = temp_df[new_col_name].to_dict()   

                train_df[new_col_name] = train_df[col].map(temp_df).astype('float32')
                test_df[new_col_name]  = test_df[col].map(temp_df).astype('float32')
                
                if fillna:
                    train_df[new_col_name] = train_df[new_col_name].fillna(-1)
                    test_df[new_col_name] = test_df[new_col_name].fillna(-1)
                
                print("'"+new_col_name+"'",', ',end='')
                
# COMBINE FEATURES
def encode_CB(col1,col2,train,test):
    nm = col1+'_'+col2
    train[nm] = train[col1].astype(str)+'_'+train[col2].astype(str)
    test[nm] = test[col1].astype(str)+'_'+test[col2].astype(str) 
    encode_LE(nm,train,test)
# GROUP AGGREGATION NUNIQUE
def encode_AG2(main_columns, uids, train_df, test_df):
    for main_column in main_columns:  
        for col in uids:
            comb = pd.concat([train_df[[col]+[main_column]],test_df[[col]+[main_column]]],axis=0)
            mp = comb.groupby(col)[main_column].agg(['nunique'])['nunique'].to_dict()
            train_df[col+'_'+main_column+'_ct'] = train_df[col].map(mp).astype('float32')
            test_df[col+'_'+main_column+'_ct'] = test_df[col].map(mp).astype('float32')
            print(col+'_'+main_column+'_ct, ',end='')
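
As a worked illustration of what `encode_FE` and `encode_AG` compute, the sketch below reproduces their core steps by hand on two tiny frames. The data is invented, and the code is a simplified restatement for a single column, not a replacement for the helpers above.

```python
import pandas as pd

# Toy train/test frames, invented for illustration.
train = pd.DataFrame({"card1": [1, 1, 2, 3], "TransactionAmt": [10.0, 30.0, 5.0, 8.0]})
test = pd.DataFrame({"card1": [1, 2, 2, 4], "TransactionAmt": [20.0, 15.0, 25.0, 50.0]})

# Frequency encoding: share of each value across train and test combined.
both = pd.concat([train["card1"], test["card1"]])
freq = both.value_counts(dropna=True, normalize=True).to_dict()
train["card1_FE"] = train["card1"].map(freq).astype("float32")
test["card1_FE"] = test["card1"].map(freq).astype("float32")

# Group aggregation: mean TransactionAmt per card1 over the combined rows.
comb = pd.concat([train[["card1", "TransactionAmt"]], test[["card1", "TransactionAmt"]]])
mean_amt = comb.groupby("card1")["TransactionAmt"].mean().to_dict()
train["TransactionAmt_card1_mean"] = train["card1"].map(mean_amt).astype("float32")
test["TransactionAmt_card1_mean"] = test["card1"].map(mean_amt).astype("float32")

print(train)
print(test)
```

Because the statistics are computed over train and test together, a value such as `card1 == 4`, which appears only in the test frame, still receives a frequency and a group mean.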

In [None]:
%%time
# From kernel https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
# WARNING! THIS CAN DAMAGE THE DATA 
def reduce_mem_usage2(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
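
The warning at the top of this cell is worth taking literally: downcasting floats to `float16` loses precision. A minimal sketch with an invented transaction amount shows the effect.

```python
import numpy as np

amount = np.float64(117.95)   # invented example value
as_f16 = np.float16(amount)   # what the float16 branch would store
as_f32 = np.float32(amount)   # what the float32 branch would store
print(float(as_f16), float(as_f32))

# float16 has an 11-bit significand, so not every integer above 2048 is representable.
print(float(np.float16(2049)))
```

If exact amounts or large counts matter for a feature, one option is to skip the `float16` branch and start at `float32`.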

In [None]:
def comb_mails(emails,us_emails,train,test):
    for c in ['P_emaildomain', 'R_emaildomain']:
        train[c + '_bin'] = train[c].map(emails)
        test[c + '_bin'] = test[c].map(emails)

        train[c + '_suffix'] = train[c].map(lambda x: str(x).split('.')[-1])
        test[c + '_suffix'] = test[c].map(lambda x: str(x).split('.')[-1])

        train[c + '_suffix'] = train[c + '_suffix'].map(lambda x: x if str(x) not in us_emails else 'us')
        test[c + '_suffix'] = test[c + '_suffix'].map(lambda x: x if str(x) not in us_emails else 'us')

In [None]:
!pip install future

In [None]:
def love(data):
    print('\n'.join([''.join([(' I_Love_Data_Science_'[(x-y) % len('I_Love_Data_Science_')] if ((x*0.05)**2+(y*0.1)**2-1)**3-(x*0.05)**2*(y*0.1)**3 <= 0 else ' ') for x in range(-30, 30)]) for y in range(15, -15, -1)]))

# Combine

In [None]:
train_id = pd.read_csv("../input/train_identity.csv")
test_id = pd.read_csv("../input/test_identity.csv")
train_tr = pd.read_csv("../input/train_transaction.csv")
test_tr = pd.read_csv("../input/test_transaction.csv")

In [None]:
def combine():
    with timer('Combining :'):
        print('Combining Start...')
        # Read train and test data with pd.read_csv():
        train_id = pd.read_csv("../input/train_identity.csv")
        test_id = pd.read_csv("../input/test_identity.csv")
        train_tr = pd.read_csv("../input/train_transaction.csv")
        test_tr = pd.read_csv("../input/test_transaction.csv")
        # Join on the TransactionID key only; `on` should not be combined with left_index/right_index.
        train = pd.merge(train_tr, train_id, on="TransactionID", how='left')
        test = pd.merge(test_tr, test_id, on="TransactionID", how='left')
        del train_id, train_tr, test_id, test_tr
        # Align test column names with train positionally (assumes identical column order).
        test.columns = train.columns.drop("isFraud")

        return train, test

In [None]:
test_tr.head()

# Preprocessing and Feature Engineering

In [None]:
def pre_processing_and_feature_engineering():
    train,test=combine()   
    
    with timer('Preprocessing and Feature Engineering'):
        print('-' * 30)
        print('Preprocessing and Feature Engineering start...')
        print('-' * 10)
        
        emails = {'gmail': 'google', 'att.net': 'att', 'twc.com': 'spectrum', 
              'scranton.edu': 'other', 'optonline.net': 'other', 'hotmail.co.uk': 'microsoft',
              'comcast.net': 'other', 'yahoo.com.mx': 'yahoo', 'yahoo.fr': 'yahoo',     
              'yahoo.es': 'yahoo', 'charter.net': 'spectrum', 'live.com': 'microsoft', 
              'aim.com': 'aol', 'hotmail.de': 'microsoft', 'centurylink.net': 'centurylink',
              'gmail.com': 'google', 'me.com': 'apple', 'earthlink.net': 'other', 'gmx.de': 'other',
              'web.de': 'other', 'cfl.rr.com': 'other', 'hotmail.com': 'microsoft', 
              'protonmail.com': 'other', 'hotmail.fr': 'microsoft', 'windstream.net': 'other', 
              'outlook.es': 'microsoft', 'yahoo.co.jp': 'yahoo', 'yahoo.de': 'yahoo',
              'servicios-ta.com': 'other', 'netzero.net': 'other', 'suddenlink.net': 'other',
              'roadrunner.com': 'other', 'sc.rr.com': 'other', 'live.fr': 'microsoft',
              'verizon.net': 'yahoo', 'msn.com': 'microsoft', 'q.com': 'centurylink', 
              'prodigy.net.mx': 'att', 'frontier.com': 'yahoo', 'anonymous.com': 'other', 
              'rocketmail.com': 'yahoo', 'sbcglobal.net': 'att', 'frontiernet.net': 'yahoo', 
              'ymail.com': 'yahoo', 'outlook.com': 'microsoft', 'mail.com': 'other', 
              'bellsouth.net': 'other', 'embarqmail.com': 'centurylink', 'cableone.net': 'other', 
              'hotmail.es': 'microsoft', 'mac.com': 'apple', 'yahoo.co.uk': 'yahoo', 'netzero.com': 'other', 
              'yahoo.com': 'yahoo', 'live.com.mx': 'microsoft', 'ptd.net': 'other', 'cox.net': 'other',
              'aol.com': 'aol', 'juno.com': 'other', 'icloud.com': 'apple'}

        us_emails = ['gmail', 'net', 'edu']
        comb_mails(emails,us_emails,train,test)
        print("Creating New Features...")
        numeric=['TransactionAmt','D4','D9','D10','D15', 'C1' , 'C2' ,
               'C3' , 'C4' , 'C5' , 'C6' , 'C7' , 'C8' , 'C9' , 'C10' , 
                'C11' , 'C12' , 'C13' , 'C14','V95' ,'V279' , 'V281',
                'V284' ,'V285' , 'V286' , 'V288' , 'V290', 'V296',
                 'V297',"C11","D14", "id_05","id_06","C2", "id_13","D3","D11",
                "C14","D5", "D9", "C1", "dist2", "D1", "D2", "C13", "id_20", "id_19", "D4",
                "D8", "dist1", "id_02", "D15", "addr1", "card2", "day", "id_19", "id_02"]
        encode_FE(train,test,['addr1','card1','card2','card3','P_emaildomain'])
        encode_CB('card1','addr1',train,test)
        train['day'] = train.TransactionDT / (24*60*60)
        train['uid'] = train.card1_addr1.astype(str)+'_'+np.floor(train.day-train.D1).astype(str)

        test['day'] = test.TransactionDT / (24*60*60)
        test['uid'] = test.card1_addr1.astype(str)+'_'+np.floor(test.day-test.D1).astype(str)
        encode_FE(train,test,['uid'])
        encode_AG(numeric, ['uid'],['mean',"std"], train, test)
        encode_AG2(['P_emaildomain','dist1','id_02',"M4","P_emaildomain", "M5", "id_31"], ['uid'], train, test)
        # FREQUENCY ENCODE: ADDR1, CARD1, CARD2, CARD3, P_EMAILDOMAIN
        encode_FE(train,test,['addr1','card1','card2','card3','P_emaildomain'])
        # COMBINE COLUMNS CARD1+ADDR1+P_EMAILDOMAIN
        encode_CB('card1_addr1','P_emaildomain',train, test)
        # FREQUENCY ENCODE
        encode_FE(train,test,['card1_addr1','card1_addr1_P_emaildomain'])
        # GROUP AGGREGATE
        encode_AG(['TransactionAmt','D9','D11'],['card1','card1_addr1','card1_addr1_P_emaildomain'],['mean','std'],train,test,usena=True)
        del train['uid'], test['uid']
        print("Creating New Features Finished")
        print('-' * 10)
        print('Label Coding and One Hot Encoding Start...')
        train_cat = train.select_dtypes(include=['object'])
        train_cat_columns=train_cat.columns
        train_columns=train.columns
        test_cat = test.select_dtypes(include=['object'])
        test_cat_columns=test_cat.columns
        del test_cat, train_cat
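        from sklearn import preprocessing
        for i in train_cat_columns.union(test_cat_columns):
            # Fit one encoder on train and test together so both share the same codes.
            lbe = preprocessing.LabelEncoder()
            lbe.fit(pd.concat([train[i], test[i]]).astype(str))
            train[i] = lbe.transform(train[i].astype(str))
            test[i] = lbe.transform(test[i].astype(str))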
        for i in train_cat_columns:
            if (test[i].max()== train[i].max())&(train[i].max()<6):
                    test = pd.get_dummies(test, columns = [i])
                    train=pd.get_dummies(train, columns = [i])
          
        
        import datetime
        START_DATE = datetime.datetime.strptime('2017-11-30', '%Y-%m-%d')
        train['DT_M'] = train['TransactionDT'].apply(lambda x: (START_DATE + datetime.timedelta(seconds = x)))
        train['DT_M'] = (train['DT_M'].dt.year-2017)*12 + train['DT_M'].dt.month 

        test['DT_M'] = test['TransactionDT'].apply(lambda x: (START_DATE + datetime.timedelta(seconds = x)))
        test['DT_M'] = (test['DT_M'].dt.year-2017)*12 + test['DT_M'].dt.month 
         
        print('-' * 10)
        return train,test
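
The `uid` feature built inside the function above is easier to follow on a few rows. The sketch below repeats the same arithmetic on invented data. It assumes that `D1` counts days back to some per-client reference event, which is an interpretation used here for illustration, not something the dataset documentation states.

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration.
toy = pd.DataFrame({
    "card1_addr1": [7, 7, 7, 9],
    "TransactionDT": [86400 * 10, 86400 * 15 + 3600, 86400 * 15, 86400 * 15],
    "D1": [3.0, 8.0, 2.0, 8.0],
})

toy["day"] = toy["TransactionDT"] / (24 * 60 * 60)
# day - D1 is constant for rows that point back to the same reference day.
toy["uid"] = toy["card1_addr1"].astype(str) + "_" + np.floor(toy["day"] - toy["D1"]).astype(str)
print(toy[["card1_addr1", "day", "D1", "uid"]])
```

The first two rows share a `uid` even though they are five days apart, because both point back to day 7. The third row has the same `card1_addr1` but a different reference day, and the fourth has the same reference day but a different `card1_addr1`.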

In [None]:
def pre_processing_and_feature_engineering():
    train,test=combine()   
    
    with timer('Preprocessing and Feature Engineering'):
        print('-' * 30)
        print('Preprocessing and Feature Engineering start...')
        print('-' * 10)
        
        emails = {'gmail': 'google', 'att.net': 'att', 'twc.com': 'spectrum', 
              'scranton.edu': 'other', 'optonline.net': 'other', 'hotmail.co.uk': 'microsoft',
              'comcast.net': 'other', 'yahoo.com.mx': 'yahoo', 'yahoo.fr': 'yahoo',     
              'yahoo.es': 'yahoo', 'charter.net': 'spectrum', 'live.com': 'microsoft', 
              'aim.com': 'aol', 'hotmail.de': 'microsoft', 'centurylink.net': 'centurylink',
              'gmail.com': 'google', 'me.com': 'apple', 'earthlink.net': 'other', 'gmx.de': 'other',
              'web.de': 'other', 'cfl.rr.com': 'other', 'hotmail.com': 'microsoft', 
              'protonmail.com': 'other', 'hotmail.fr': 'microsoft', 'windstream.net': 'other', 
              'outlook.es': 'microsoft', 'yahoo.co.jp': 'yahoo', 'yahoo.de': 'yahoo',
              'servicios-ta.com': 'other', 'netzero.net': 'other', 'suddenlink.net': 'other',
              'roadrunner.com': 'other', 'sc.rr.com': 'other', 'live.fr': 'microsoft',
              'verizon.net': 'yahoo', 'msn.com': 'microsoft', 'q.com': 'centurylink', 
              'prodigy.net.mx': 'att', 'frontier.com': 'yahoo', 'anonymous.com': 'other', 
              'rocketmail.com': 'yahoo', 'sbcglobal.net': 'att', 'frontiernet.net': 'yahoo', 
              'ymail.com': 'yahoo', 'outlook.com': 'microsoft', 'mail.com': 'other', 
              'bellsouth.net': 'other', 'embarqmail.com': 'centurylink', 'cableone.net': 'other', 
              'hotmail.es': 'microsoft', 'mac.com': 'apple', 'yahoo.co.uk': 'yahoo', 'netzero.com': 'other', 
              'yahoo.com': 'yahoo', 'live.com.mx': 'microsoft', 'ptd.net': 'other', 'cox.net': 'other',
              'aol.com': 'aol', 'juno.com': 'other', 'icloud.com': 'apple'}

        us_emails = ['gmail', 'net', 'edu']
        comb_mails(emails,us_emails,train,test)
        print("Creating New Features...")
        # Columns to aggregate per uid; each name is listed once.
        numeric = ['TransactionAmt', 'D4', 'D9', 'D10', 'D15',
                   'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'C10',
                   'C11', 'C12', 'C13', 'C14',
                   'V95', 'V279', 'V281', 'V284', 'V285', 'V286', 'V288', 'V290', 'V296', 'V297',
                   'D14', 'id_05', 'id_06', 'id_13', 'D3', 'D11', 'D5', 'dist2', 'D1', 'D2',
                   'id_20', 'id_19', 'D8', 'dist1', 'id_02', 'addr1', 'card2', 'day']
        encode_FE(train,test,['addr1','card1','card2','card3','P_emaildomain'])
        encode_CB('card1','addr1',train,test)
        train['day'] = train.TransactionDT / (24*60*60)
        train['uid'] = train.card1_addr1.astype(str)+'_'+np.floor(train.day-train.D1).astype(str)

        test['day'] = test.TransactionDT / (24*60*60)
        test['uid'] = test.card1_addr1.astype(str)+'_'+np.floor(test.day-test.D1).astype(str)
        encode_FE(train,test,['uid'])
        encode_AG(numeric, ['uid'],['mean',"std"], train, test)
        encode_AG2(['P_emaildomain','dist1','id_02',"M4", "M5", "id_31"], ['uid'], train, test)
        # FREQUENCY ENCODE: ADDR1, CARD1, CARD2, CARD3, P_EMAILDOMAIN
        encode_FE(train,test,['addr1','card1','card2','card3','P_emaildomain'])
        # COMBINE COLUMNS CARD1+ADDR1+P_EMAILDOMAIN
        encode_CB('card1_addr1','P_emaildomain',train, test)
        # FREQUENCY ENCODE
        encode_FE(train,test,['card1_addr1','card1_addr1_P_emaildomain'])
        # GROUP AGGREGATE
        encode_AG(['TransactionAmt','D9','D11'],['card1','card1_addr1','card1_addr1_P_emaildomain'],['mean','std'],train,test,usena=True)
        del train['uid'], test['uid']
        print("Creating New Features Finished")
        print('-' * 10)
        print('Label Coding and One Hot Encoding Start...')
        train_cat = train.select_dtypes(include=['object'])
        train_cat_columns=train_cat.columns
        train_columns=train.columns
        test_cat = test.select_dtypes(include=['object'])
        test_cat_columns=test_cat.columns
        del test_cat, train_cat
        from sklearn import preprocessing
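        for i in train_cat_columns:
            # Fit one encoder per column on train and test together so that
            # a category maps to the same integer in both frames.
            lbe = preprocessing.LabelEncoder()
            lbe.fit(pd.concat([train[i], test[i]]).astype(str))
            train[i] = lbe.transform(train[i].astype(str))
            test[i] = lbe.transform(test[i].astype(str))
        for i in train_cat_columns:
            # One-hot encode low-cardinality columns (label codes 0 to 5).
            if test[i].max() == train[i].max() and train[i].max() < 6:
                test = pd.get_dummies(test, columns=[i])
                train = pd.get_dummies(train, columns=[i])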
          
        
        import datetime
        START_DATE = datetime.datetime.strptime('2017-11-30', '%Y-%m-%d')
        # TransactionDT is treated as seconds elapsed since START_DATE;
        # DT_M is a month index counted from January 2017 (December 2017 -> 12).
        train['DT_M'] = START_DATE + pd.to_timedelta(train['TransactionDT'], unit='s')
        train['DT_M'] = (train['DT_M'].dt.year-2017)*12 + train['DT_M'].dt.month

        test['DT_M'] = START_DATE + pd.to_timedelta(test['TransactionDT'], unit='s')
        test['DT_M'] = (test['DT_M'].dt.year-2017)*12 + test['DT_M'].dt.month
         
        print('-' * 10)
        return train,test

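The `uid` built above joins `card1_addr1` with `floor(day - D1)`, and per-`uid` statistics are then attached to every row. The sketch below reproduces that pattern on a toy frame. It is a stand-in, not the notebook's `encode_AG`: the data and the output column names are invented, and it assumes `D1` counts days since a card's first transaction, so that `day - D1` stays constant for one card.

```python
import numpy as np
import pandas as pd

# Toy transactions; all values are invented for illustration.
toy = pd.DataFrame({
    "card1_addr1": ["1000_300", "1000_300", "1000_300", "2000_100"],
    "TransactionDT": [86400 * 10, 86400 * 12, 86400 * 40, 86400 * 11],
    "D1": [5.0, 7.0, 2.0, 0.0],
    "TransactionAmt": [50.0, 70.0, 20.0, 100.0],
})

# Same construction as in the function above: seconds -> days, then
# card/address key plus the (assumed) day of the card's first transaction.
toy["day"] = toy["TransactionDT"] / (24 * 60 * 60)
toy["uid"] = toy["card1_addr1"].astype(str) + "_" + np.floor(toy["day"] - toy["D1"]).astype(str)

# Group aggregate: broadcast each uid's mean and std back onto its rows.
grouped = toy.groupby("uid")["TransactionAmt"]
toy["TransactionAmt_uid_mean"] = grouped.transform("mean")
toy["TransactionAmt_uid_std"] = grouped.transform("std")

print(toy[["uid", "TransactionAmt", "TransactionAmt_uid_mean", "TransactionAmt_uid_std"]])
```

The first two rows share the uid `1000_300_5.0` and receive a mean of 60.0, while the third row has the same card and address but a different start day, so it forms its own group. A group with a single row has an undefined sample standard deviation and gets `NaN`.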
> # main

In [None]:
def main():
    with timer('Full Model Run '):
        print("Full Model Run Start...")
        print('-' * 50)
        data=feature_importance()
        love(data)   
        print('-' * 50)

In [None]:
if __name__ == "__main__":
    main()