# Hello and welcome to my IEEE fraud detection tutorial!


**In this tutorial I will start with some basic data preparation, because the data of this competition is not as clean as in other competitions.**

**Afterwards I will handle the missing values, encode the categorical variables and finally train the model and optimize its parameters.**


**private score = 0.911683**

**private rank: 3085/6381**

**top 48 %**

# Overview
 
## '1'. Data preparation

## '2'. Analyze features

## '3'. Summary of feature analysis

## '4'. Dropping features with many missing values

## '5'. Impute features with low missing values

## '6'. Impute features with medium missing values

## '7'. Final check if there are still missing values in num. columns

## '8'. Reduce memory usage

## '9'. Dealing with missing values in the cat. features

## '10'. Final check if there are still missing values in cat. columns

## '11'. Concat encoded dataframes

## '12'. Train the model

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import time

start_time = time.time()

print("loading data takes about 1 minute....")

train_transaction = pd.read_csv('/kaggle/input/ieee-fraud-detection/train_transaction.csv', index_col='TransactionID')
test_transaction = pd.read_csv('/kaggle/input/ieee-fraud-detection/test_transaction.csv', index_col='TransactionID')

train_identity = pd.read_csv('/kaggle/input/ieee-fraud-detection/train_identity.csv', index_col='TransactionID')
test_identity = pd.read_csv('/kaggle/input/ieee-fraud-detection/test_identity.csv', index_col='TransactionID')

#sample_submission = pd.read_csv('/kaggle/input/ieee-fraud-detection/sample_submission.csv', index_col='TransactionID')

print("loading successful!")

# 1. Data Preparation

**Let's have a look at our data. We will simply do all the basic stuff before we actually modify and prepare the data.**

In [None]:
train_transaction.info()   # info() prints its report itself and returns None, so no print() around it
print()
#print(train_transaction.describe(), "\n")
#print(train_transaction.head(), "\n")
print(train_transaction.shape, "\n")
print(train_transaction.columns, "\n")

print(test_transaction.shape, "\n")
print(test_transaction.columns, "\n")

print(train_transaction.isFraud, "\n")

print(train_transaction.isFraud.isnull().sum(), "\n")  # 0 missing values in target

print("percent of fraudulent train-transactions: ", len(train_transaction.loc[train_transaction.isFraud == 1])*100/len(train_transaction))

**As we can see, about 3.5% of the transactions in train are fraudulent, so this is a rather imbalanced dataset. We can presume that the share of fraudulent transactions in the test data is of about the same magnitude.**

**Now we will drop the target from our train_transaction dataframe.**

In [None]:
y_train = train_transaction["isFraud"]

# drop target column from train dataframe
train_transaction = train_transaction.drop(columns = ['isFraud'])

print(y_train.shape, "\n")
print(train_transaction.shape, "\n")

In [None]:
print(train_identity.shape, "\n")
print(train_identity.columns, "\n")
print(train_identity.head(), "\n")

print("\n\n")

print(test_identity.shape, "\n")
print(test_identity.columns, "\n")
print(test_identity.head(), "\n")

In [None]:
print("train_transaction.index: \n", train_transaction.index, "\n")
print("train_identity.index: \n", train_identity.index, "\n")

print(train_identity.id_01.value_counts(), "\n")    #  the id features seem to have many different values
print(train_identity.id_07.value_counts(), "\n")    #  the id features seem to have many different values
print(train_identity.DeviceType.value_counts(), "\n")  
print(train_identity.DeviceInfo.value_counts(), "\n")  

**The 38 different id features seem to have many different values, but the first 10 unique values will cover about 90% of all data (quickly estimated).**

**DeviceType only has 2 different values:  'desktop' and 'mobile'.**

**DeviceInfo has about 5 to 7 different values, that cover about 90% of all data (quickly estimated).**

**Later on we will calculate how many unique values it takes to cover 90% of the data per feature.**
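
**A minimal sketch of how such a coverage count could be computed. The helper name `n_values_to_cover` and the toy data are my own assumptions for illustration, they are not part of the competition data or of the later cells.**

```python
import pandas as pd

def n_values_to_cover(series: pd.Series, coverage: float = 0.9) -> int:
    """Return how many of the most frequent values are needed to cover
    at least `coverage` of the non-missing entries of `series`."""
    # value_counts() sorts from most to least frequent and ignores NaN
    counts = series.value_counts()
    if counts.empty:
        return 0
    cumulative = counts.cumsum().to_numpy()
    target = coverage * counts.sum()
    # number of values whose cumulative count is still below the target, plus one
    needed = int((cumulative < target).sum()) + 1
    return min(needed, len(counts))

# hypothetical toy column: 'a' covers 70%, 'a' and 'b' together cover 90%
toy = pd.Series(["a"] * 7 + ["b"] * 2 + ["c"])
print(n_values_to_cover(toy, coverage=0.85))
print(n_values_to_cover(toy, coverage=0.5))
```

**On the real data the helper could be called once per column, for example on `train_identity.DeviceInfo`.**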



**Now we will concat our train and test dataframes such that we have one big transaction dataframe and one big identity dataframe.**


**This will make the preprocessing of our data much easier compared to doing everything twice on both train and test dataframes.**

**And then before the training process we will simply split the dataframe up into train and test.** 
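
**The concat-then-split idea can be sketched on two tiny, made-up dataframes. The column name `amount` and the variable `n_train` are assumptions for this illustration only.**

```python
import pandas as pd

# hypothetical miniature train and test frames, indexed like TransactionID
train = pd.DataFrame({"amount": [10.0, 20.0, 30.0]}, index=[1, 2, 3])
test = pd.DataFrame({"amount": [40.0, 50.0]}, index=[4, 5])

# remember how many rows belong to train before stacking the frames
n_train = len(train)
combined = pd.concat([train, test])

# any preprocessing step now runs once on both parts
combined["amount_scaled"] = combined["amount"] / combined["amount"].max()

# split back by position: the first n_train rows are the train rows
train_again = combined.iloc[:n_train]
test_again = combined.iloc[n_train:]

print(train_again.shape, test_again.shape)
```

**Splitting by position only works as long as no step reorders or removes rows of the combined dataframe.**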

In [None]:
# sadly the id columns of test_identity are called id-01 instead of id_01, which is their name in the train dataframe.
# hence we must first rename all the 38 id columns in test_identity, before we can concat the dataframes.
# I simply used print(train_identity.columns) to get the list of correct column names, and now
# we will just assign them as the column names to the test_identity dataframe.

test_identity.columns = ['id_01', 'id_02', 'id_03', 'id_04', 'id_05', 'id_06', 'id_07', 'id_08',
       'id_09', 'id_10', 'id_11', 'id_12', 'id_13', 'id_14', 'id_15', 'id_16',
       'id_17', 'id_18', 'id_19', 'id_20', 'id_21', 'id_22', 'id_23', 'id_24',
       'id_25', 'id_26', 'id_27', 'id_28', 'id_29', 'id_30', 'id_31', 'id_32',
       'id_33', 'id_34', 'id_35', 'id_36', 'id_37', 'id_38', 'DeviceType',
       'DeviceInfo']


print("before concatting: \n\n")
print("train_transaction.shape: ", train_transaction.shape, "\n")
print("test_transaction.shape: ", test_transaction.shape, "\n")
print("train_transaction.index: ", train_transaction.index, "\n")
print("test_transaction.index: ", test_transaction.index, "\n")

print("train_identity.shape: ", train_identity.shape, "\n")
print("test_identity.shape: ", test_identity.shape, "\n")
print("train_identity.index: ", train_identity.index, "\n")
print("test_identity.index: ", test_identity.index, "\n")



transaction_data = pd.concat([train_transaction, test_transaction])
identity_data = pd.concat([train_identity, test_identity])


print("after concatting: \n\n")
print("transaction_data.shape: ", transaction_data.shape, "\n")
print("transaction_data.index: ", transaction_data.index, "\n")
print("identity_data.shape: ", identity_data.shape, "\n")
print("identity_data.index: ", identity_data.index, "\n")

In [None]:
# lets generate some useful lists of columns
# we want a list of numerical features
# and a list of categorical features

c = (identity_data.dtypes == 'object')
n = (identity_data.dtypes != 'object')
cat_id_cols = list(c[c].index)
num_id_cols = list(n[n].index) 

print(cat_id_cols, "\n")
print("number categorical identity features: ", len(cat_id_cols), "\n\n")
print(num_id_cols, "\n")
print("number numerical identity features: ", len(num_id_cols))

In [None]:
# lets generate some useful lists of columns
# we want a list of numerical features
# and a list of categorical features

c = (transaction_data.dtypes == 'object')
n = (transaction_data.dtypes != 'object')
cat_trans_cols = list(c[c].index)
num_trans_cols = list(n[n].index) 

print(cat_trans_cols, "\n")
print("number categorical transaction features: ", len(cat_trans_cols), "\n\n")
print(num_trans_cols, "\n")
print("number numerical transaction features: ", len(num_trans_cols))

**Now we can delete train_transaction, train_identity, test_transaction, test_identity, because we no longer need them.**

In [None]:
# we save the shapes in these variables before deleting the dataframes
shape_of_train_trans = train_transaction.shape
shape_of_train_id    = train_identity.shape

shape_of_test_trans  = test_transaction.shape
shape_of_test_id     = test_identity.shape

del train_transaction
del train_identity
del test_transaction
del test_identity

print("deletion successful!")

# 2. Analyze features

## 2.1 Analyze identity features

In [None]:
print(identity_data.id_12, "\n")
print(identity_data.id_15, "\n")
print(identity_data.id_16, "\n")

**As we can see, some id features are categorical features containing strings. The different id features contain different sorts of strings though, so let's find out more about these features.**

In [None]:
for i in cat_id_cols:
    print(identity_data[i].value_counts())
    print(i, "missing values: ", identity_data[i].isnull().sum())
    print(identity_data[i].isnull().sum()*100/ len(identity_data[i]), "\n")

In [None]:
#  categorical identity features:

#  id_12:   2 values       0  missing values
#  id_15:   3 values    8178  missing values
#  id_16:   2 values   31053  missing values
#  id_23:   3 values  275909  missing values 96%
#  id_27:   2 values  275909  missing values 96%
#  id_28:   2 values    8384  missing values 
#  id_29:   2 values    8384  missing values  
#  id_30:  75 values  137916  missing values 48%
#  id_31: 130 values:   9233  missing values  
#  id_33: 260 values: 142180  missing values 50%
#  id_34:   4 values: 136160  missing values 47%
#  id_35:   2 values:   8178  missing values
#  id_36:   2 values:   8178  missing values
#  id_37:   2 values:   8178  missing values
#  id_38:   2 values:   8178  missing values
#  DeviceType: 2 values 8399  missing values
#  DeviceInfo: 2799 values,  52417  missing values

In [None]:
low_missing_cat_id_cols = []      # lower than 15% missing values
medium_missing_cat_id_cols = []   # between 15% and 60% missing
many_missing_cat_id_cols = []     # more than 60% missing

for i in cat_id_cols:
    percentage = identity_data[i].isnull().sum() * 100 / len(identity_data[i])
    if percentage < 15:
        low_missing_cat_id_cols.append(i)
    elif percentage >= 15 and percentage < 60:
        medium_missing_cat_id_cols.append(i)
    else:
        many_missing_cat_id_cols.append(i)
        
print("cat_id_cols: \n\n")      
print("number low missing: ", len(low_missing_cat_id_cols), "\n")
print("number medium missing: ", len(medium_missing_cat_id_cols), "\n")
print("number many missing: ", len(many_missing_cat_id_cols), "\n")

In [None]:
for i in num_id_cols:
    print(identity_data[i].value_counts())
    print(i, "missing values: ", identity_data[i].isnull().sum()) 
    print(identity_data[i].isnull().sum()*100/len(identity_data[i]), "\n") # missing percent

In [None]:
#  numerical identity  features:

#  id_01:       77 values,       0  missing values  
#  id_02:   115655 values,    8292  missing values 
#  id_03:       24 values,  153335  missing values 54%
#  id_04:       15 values,  153335  missing values 54%
#  id_05:       93 values,   14525  missing values 
#  id_06:      101 values,   14525  missing values
#  id_07:       84 values,  275926  missing values 96%
#  id_08:       94 values,  275926  missing values 96%
#  id_09:       46 values,  136876  missing values 48%
#  id_10:       62 values,  136876  missing values 48%
#  id_11:      365 values,    8384  missing values
#  id_13:       54 values,   28534  missing values
#  id_14:       25 values,  134739  missing values 47%
#  id_17:      104 values,   10805  missing values
#  id_18:       18 values,  190152  missing values 66%
#  id_19:      522 values,   10916  missing values
#  id_20:      394 values,   11246  missing values
#  id_21:      490 values,  275922  missing values 96%
#  id_22:       25 values,  275909  missing values 96%
#  id_24:       12 values,  276653  missing values 97%
#  id_25:      341 values,  275969  missing values 96%
#  id_26:       95 values,  275930  missing values 96%
#  id_32:        4 values,  137883  missing values 48%

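**So far we have a pretty good overview of our identity features. Before we move on, we sort the numerical identity features into the same three groups of missing values as the categorical ones.**

**The 378 numerical transaction features are simply too many to evaluate by hand like we just did with the identity features, so in section 2.2 we will only list the categorical transaction features by hand.**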

In [None]:
low_missing_num_id_cols = []      # lower than 15% missing values
medium_missing_num_id_cols = []   # between 15% and 60% missing
many_missing_num_id_cols = []     # more than 60% missing

for i in num_id_cols:
    percentage = identity_data[i].isnull().sum() * 100 / len(identity_data[i])
    if percentage < 15:
        low_missing_num_id_cols.append(i)
    elif percentage >= 15 and percentage < 60:
        medium_missing_num_id_cols.append(i)
    else:
        many_missing_num_id_cols.append(i)
        
print("num_id_cols: \n\n")        
print("number low missing: ", len(low_missing_num_id_cols), "\n")
print("number medium missing: ", len(medium_missing_num_id_cols), "\n")
print("number many missing: ", len(many_missing_num_id_cols), "\n")

## 2.2 Analyze transaction features

In [None]:
for i in cat_trans_cols:
    print(transaction_data[i].value_counts())
    print(i, transaction_data[i].isnull().sum(), "missing values")
    print(i, transaction_data[i].isnull().sum()*100/len(transaction_data[i]), "\n")  # missing percent

In [None]:
#  categorical transaction features:

#  ProductCD:      5  values,      0 missing values
#  card4:          4  values,   4663 missing values
#  card6:          4  values,   4578 missing values 
#  P_emaildomain  59  values, 163648 missing values 15%  
#  R_emaildomain  60  values, 824070 missing values 75%  
#  M1:             2  values, 447739 missing values 41%
#  M2:             2  values, 447739 missing values 41% 
#  M3:             2  values, 447739 missing values 41%
#  M4:             3  values, 519189 missing values 47%  
#  M5:             2  values, 660114 missing values 60% 
#  M6:             2  values, 328299 missing values 30% 
#  M7:             2  values, 581283 missing values 53% 
#  M8:             2  values, 581256 missing values 53% 
#  M9:             2  values, 581256 missing values 53% 

**For the 378 numerical cols we have to think of something, because we can not evaluate them by hand.** 

In [None]:
low_missing_num_trans_cols = []      # lower than 15% missing values
medium_missing_num_trans_cols = []   # between 15% and 60% missing
many_missing_num_trans_cols = []     # more than 60% missing

for i in num_trans_cols:
    percentage = transaction_data[i].isnull().sum() * 100 / len(transaction_data[i])
    if percentage < 15:
        low_missing_num_trans_cols.append(i)
    elif percentage >= 15 and percentage < 60:
        medium_missing_num_trans_cols.append(i)
    else:
        many_missing_num_trans_cols.append(i)
        
print("num_trans_cols: \n\n")        
print("number low missing: ", len(low_missing_num_trans_cols), "\n")
print("number medium missing: ", len(medium_missing_num_trans_cols), "\n")
print("number many missing: ", len(many_missing_num_trans_cols), "\n")

**Ok, as we can see there are 155 columns with <15% missing values, we will impute these missing values somehow.**

**For the 56 columns with 15%-60% missing values we will think of something later.**

**For the 167 columns with more than 60% missing values, we will simply drop these columns.**
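
**The same three-way grouping is repeated for every list of columns in this notebook. As a sketch, it could be wrapped in one helper. The function name `bucket_by_missing` and the toy dataframe are assumptions of mine, the thresholds 15 and 60 are the ones used above.**

```python
import numpy as np
import pandas as pd

def bucket_by_missing(dataframe, columns, low=15, high=60):
    """Split `columns` into (low, medium, many) by their percentage of missing values."""
    # isnull().mean() is the share of missing values per column
    share = dataframe[columns].isnull().mean() * 100
    low_cols = list(share[share < low].index)
    medium_cols = list(share[(share >= low) & (share < high)].index)
    many_cols = list(share[share >= high].index)
    return low_cols, medium_cols, many_cols

# hypothetical toy data with 0%, 25% and 75% missing values
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

low_cols, medium_cols, many_cols = bucket_by_missing(toy, ["a", "b", "c"])
print(low_cols, medium_cols, many_cols)
```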

In [None]:
low_missing_cat_trans_cols = []      # lower than 15% missing values
medium_missing_cat_trans_cols = []   # between 15% and 60% missing
many_missing_cat_trans_cols = []     # more than 60% missing

for i in cat_trans_cols:
    percentage = transaction_data[i].isnull().sum() * 100 / len(transaction_data[i])
    if percentage < 15:
        low_missing_cat_trans_cols.append(i)
    elif percentage >= 15 and percentage < 60:
        medium_missing_cat_trans_cols.append(i)
    else:
        many_missing_cat_trans_cols.append(i)
        
print("cat_trans_cols: \n\n")    
print("number low missing: ", len(low_missing_cat_trans_cols), "\n")
print("number medium missing: ", len(medium_missing_cat_trans_cols), "\n")
print("number many missing: ", len(many_missing_cat_trans_cols), "\n")

**Let's summarize what we found out about the features of our 2 dataframes so far:**

# 3. Summary of feature analysis

In [None]:
# Summary so far:

# we have 2 dataframes:   transaction_data and identity_data

####################################################################
# features:

# transaction_data:     14 categorical and 378 numerical features
# identity_data:        17 categorical and  23 numerical features
####################################################################
# missing values:

# cat_trans_cols:      4 low,    8 medium,    2 many 
# num_trans_cols:    176 low,   35 medium,  167 many

# cat_id_cols:        11 low,    4 medium,    2 many 
# num_id_cols:         9 low,    6 medium,    8 many
####################################################################

**We will drop all numerical features with many missing values.**

**We will impute with 'median'  all numerical features with medium missing values.**

**We will impute with 'mean' all numerical features with low missing values.**

**But first let's drop all numerical features with many missing values.**
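
**To show what the two strategies do, here is a small sketch on made-up numbers. The column names are hypothetical. The median is less sensitive to outliers than the mean, which is why it is the more cautious choice when many values have to be filled in.**

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical toy data: one column with few, one with more missing values
toy = pd.DataFrame({
    "low_missing": [1.0, 2.0, 3.0, np.nan, 10.0],
    "medium_missing": [1.0, np.nan, np.nan, 100.0, 3.0],
})

mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")

# mean of 1, 2, 3, 10 is 4.0; median of 1, 3, 100 is 3.0
toy[["low_missing"]] = mean_imputer.fit_transform(toy[["low_missing"]])
toy[["medium_missing"]] = median_imputer.fit_transform(toy[["medium_missing"]])

print(toy)
```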

# 4. Dropping features with many missing values

In [None]:
print("shape before dropping num_trans_cols: ", transaction_data.shape, "\n")        
transaction_data = transaction_data.drop(columns = many_missing_num_trans_cols)
print("shape after dropping num_trans_cols: ", transaction_data.shape, "\n\n")    


print("shape before dropping num_id_cols: ", identity_data.shape, "\n")        
identity_data = identity_data.drop(columns = many_missing_num_id_cols)
print("shape after dropping num_id_cols: ", identity_data.shape, "\n")


# because we dropped some numerical columns from the dataframe,
# we must create the list 'num_trans_cols' and
# 'num_id_cols' again such that the dropped cols are no longer in them
n = (transaction_data.dtypes != 'object')
num_trans_cols = list(n[n].index) 

n = (identity_data.dtypes != 'object')
num_id_cols = list(n[n].index) 

# 5. Impute features with low missing values

## 5.1 Impute the transaction features

In [None]:
from sklearn.impute import SimpleImputer

print("index before imputation: ", transaction_data.index, "\n")
print("columns before imputation: ", transaction_data.columns, "\n")

print("starting imputation..... \n\n")
my_imputer = SimpleImputer(strategy = 'mean') 
my_imputer.fit(transaction_data[low_missing_num_trans_cols])

#print("values before imputing: ", train_transaction[low_missing_num_trans_cols], "\n")

transaction_data[low_missing_num_trans_cols] = my_imputer.transform(transaction_data[low_missing_num_trans_cols])

print("index after imputation: ", transaction_data.index, "\n")
print("columns after imputation: ", transaction_data.columns, "\n")

In [None]:
print("values after imputing: ", transaction_data[low_missing_num_trans_cols], "\n")

print("As we can see the imputation was successful! \n")

## 5.2 Impute the identity features

**Now we do the exact same imputation procedure for the identity_data dataframe:**

In [None]:
print("index before imputation: ", identity_data.index, "\n")
print("columns before imputation: ", identity_data.columns, "\n")


my_imputer = SimpleImputer(strategy = 'mean') 
my_imputer.fit(identity_data[low_missing_num_id_cols])

print("starting imputation....\n")
identity_data[low_missing_num_id_cols] = my_imputer.transform(identity_data[low_missing_num_id_cols])

print("index after imputation: ", identity_data.index, "\n")
print("columns after imputation: ", identity_data.columns, "\n")

# 6. Impute features with medium missing values

## 6.1 Impute the transaction features

In [None]:
print("index before imputation: ", transaction_data.index, "\n")
print("columns before imputation: ", transaction_data.columns, "\n")

print("values before imputing: ", transaction_data[medium_missing_num_trans_cols], "\n")

print("starting imputation.....\n\n")
my_imputer = SimpleImputer(strategy = 'median') 
my_imputer.fit(transaction_data[medium_missing_num_trans_cols])

transaction_data[medium_missing_num_trans_cols] = my_imputer.transform(transaction_data[medium_missing_num_trans_cols])

print("index after imputation: ", transaction_data.index, "\n")
print("columns after imputation: ", transaction_data.columns, "\n")

In [None]:
print("values after imputing: ", transaction_data[medium_missing_num_trans_cols], "\n")

## 6.2 Impute the identity features

In [None]:
print("index before imputation: ", identity_data.index, "\n")
print("columns before imputation: ", identity_data.columns, "\n")


my_imputer = SimpleImputer(strategy = 'median') 
my_imputer.fit(identity_data[medium_missing_num_id_cols])

print("values before imputing: ", identity_data[medium_missing_num_id_cols], "\n")

identity_data[medium_missing_num_id_cols] = my_imputer.transform(identity_data[medium_missing_num_id_cols])

print("index after imputation: ", identity_data.index, "\n")
print("columns after imputation: ", identity_data.columns, "\n")

# 7. Final check if there are still missing values in num. columns

In [None]:
print(transaction_data[num_trans_cols].isnull().sum().sum())

In [None]:
print(identity_data[num_id_cols].isnull().sum().sum())

**Hooray, no more missing values in our numerical features!**

**Next we will reduce the memory usage of our dataframes.**

# 8. Reduce memory usage

**We might run into problems if we don't reduce the memory usage of our dataframes.**

**One of our dataframes has more than one million rows, hence it really makes sense to try a few methods that are known to reduce the memory usage.**

**I explain these methods in this short tutorial**: https://www.kaggle.com/jonas0/reduce-memory-usage-tutorial



**We will print the memory usage of transaction_data and identity_data and compare with the memory_usage after we have converted the numerical datatypes.**
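
**If you prefer a single number over the full `info()` report, a small helper like the following could be used. This is only a sketch: the name `memory_in_mb` and the toy dataframe are my assumptions.**

```python
import numpy as np
import pandas as pd

def memory_in_mb(dataframe):
    """Total memory of the dataframe in MB, including the contents of object columns."""
    return dataframe.memory_usage(deep=True).sum() / 1024 ** 2

# hypothetical toy column stored as float64
toy = pd.DataFrame({"x": np.arange(100_000, dtype="float64")})
before = memory_in_mb(toy)

# float32 needs half the bytes per value
toy["x"] = toy["x"].astype("float32")
after = memory_in_mb(toy)

print(f"before: {before:.2f} MB, after: {after:.2f} MB")
```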

In [None]:
transaction_data.info()   # 1.8 GB  (info() prints its report itself and returns None)

identity_data.info()      #  72 MB

**Let's get an overview of our features and which datatype they have.**

In [None]:
object_counter = 0
int_counter = 0
float_counter = 0

not_detected = []

for i in transaction_data.columns:
        if transaction_data[i].dtype == 'object':
            object_counter += 1
        elif transaction_data[i].dtype == 'int':
            int_counter += 1
        elif transaction_data[i].dtype in ['float', 'float16', 'float32', 'float64']:
            float_counter += 1
        else:
            not_detected.append(i)
            
print("transaction_data has ", "\n")
print(object_counter, "object columns, \n")
print(int_counter, "int columns, \n")
print(float_counter, "float columns \n")

total = object_counter + int_counter  + float_counter

if total != len(transaction_data.columns):
    
    print("WARNING: some columns have not been detected!!")
    print("these columns have not been detected: ", not_detected)
    for i in not_detected:
        print(transaction_data[i].dtype, "\n")

In [None]:
object_counter = 0
int_counter = 0
float_counter = 0

not_detected = []

for i in identity_data.columns:
        if identity_data[i].dtype == 'object':
            object_counter += 1
        elif identity_data[i].dtype == 'int':
            int_counter += 1
        elif identity_data[i].dtype in ['float', 'float16', 'float32', 'float64']:
            float_counter += 1
        else:
            not_detected.append(i)
            
            
print("identity_data has ", "\n")
print(object_counter, "object columns, \n")
print(int_counter, "int columns, \n")
print(float_counter, "float columns \n")

total = object_counter + int_counter  + float_counter

if total != len(identity_data.columns):
    
    print("WARNING: some columns have not been detected!!")
    print("these columns have not been detected: ", not_detected)    
    for i in not_detected:
        print(identity_data[i].dtype, "\n")

In [None]:
# the integer datatypes have the following ranges:

#   int8:  -128 to 127, range = 255  

#  int16:  -32,768 to 32,767, range = 65,535

#  int32:  -2,147,483,648 to 2,147,483,647, range = 4,294,967,295

#  int64:  -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807,
#           range = 18,446,744,073,709,551,615


#  By default all numerical columns in pandas are in int64 or float64.
#  This means that when we find a numerical integer column whose 
#  values do not exceed one of the ranges shown above, we can then
#  convert this datatype down to a smaller one. 

**Now we will define a function which takes a list of numerical columns and a dataframe.**

**The function returns a list of lists indicating which columns need to be converted.**
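
**As a side note, pandas also offers a built-in way to downcast a single column, `pd.to_numeric` with the `downcast` argument. This is an alternative sketch on made-up data, not the function used in this notebook. Note that `downcast="float"` never goes below float32.**

```python
import pandas as pd

# hypothetical toy columns
toy = pd.DataFrame({
    "small_int": [0, 5, 100],
    "big_int": [0, 1000, 70000],
    "some_float": [0.5, 1.5, 2.5],
})

# pandas picks the smallest dtype that can hold all values of the column
toy["small_int"] = pd.to_numeric(toy["small_int"], downcast="integer")
toy["big_int"] = pd.to_numeric(toy["big_int"], downcast="integer")
toy["some_float"] = pd.to_numeric(toy["some_float"], downcast="float")

print(toy.dtypes)
```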

In [None]:
#  this function detects all the numerical columns
#  that can be converted to a smaller datatype.

def detect_num_cols_to_shrink(list_of_num_cols, dataframe):
 
    convert_to_int8 = []
    convert_to_int16 = []
    convert_to_int32 = []
    
    #  sadly the datatype float8 does not exist
    convert_to_float16 = []
    convert_to_float32 = []
    
    for col in list_of_num_cols:
        
        #  we compare the actual minimum and maximum with the limits of each
        #  datatype, because the difference (maximum - minimum) alone does not
        #  tell us whether the values fit (e.g. 1000 to 1100 does not fit into int8)
        minimum = dataframe[col].min()
        maximum = dataframe[col].max()
        
        if pd.api.types.is_integer_dtype(dataframe[col]):
            if minimum >= np.iinfo(np.int8).min and maximum <= np.iinfo(np.int8).max:
                convert_to_int8.append(col)
            elif minimum >= np.iinfo(np.int16).min and maximum <= np.iinfo(np.int16).max:
                convert_to_int16.append(col)
            elif minimum >= np.iinfo(np.int32).min and maximum <= np.iinfo(np.int32).max:
                convert_to_int32.append(col)   
                
        elif pd.api.types.is_float_dtype(dataframe[col]):
            #  note: float16 only keeps about 3 significant decimal digits,
            #  so the converted values get rounded
            if minimum >= np.finfo(np.float16).min and maximum <= np.finfo(np.float16).max:
                convert_to_float16.append(col)
            elif minimum >= np.finfo(np.float32).min and maximum <= np.finfo(np.float32).max:
                convert_to_float32.append(col) 
        
    #  same order as before: int8, int16, int32, float16, float32
    return [convert_to_int8, convert_to_int16, convert_to_int32,
            convert_to_float16, convert_to_float32]

In [None]:
num_cols_to_shrink_trans = detect_num_cols_to_shrink(num_trans_cols, transaction_data)

convert_to_int8 = num_cols_to_shrink_trans[0]
convert_to_int16 = num_cols_to_shrink_trans[1]
convert_to_int32 = num_cols_to_shrink_trans[2]

convert_to_float16 = num_cols_to_shrink_trans[3]
convert_to_float32 = num_cols_to_shrink_trans[4]

print("convert_to_int8 :", convert_to_int8, "\n")
print("convert_to_int16 :", convert_to_int16, "\n")
print("convert_to_int32 :", convert_to_int32, "\n")

print("convert_to_float16 :", convert_to_float16, "\n")
print("convert_to_float32 :", convert_to_float32, "\n")

In [None]:
print("starting with converting process....")

for col in convert_to_int16:
    transaction_data[col] = transaction_data[col].astype('int16') 
    
for col in convert_to_int32:
    transaction_data[col] = transaction_data[col].astype('int32') 

for col in convert_to_float16:
    transaction_data[col] = transaction_data[col].astype('float16')
    
for col in convert_to_float32:
    transaction_data[col] = transaction_data[col].astype('float32')
    
print("successfully converted!")

**And now we do the same with the identity_data.**

In [None]:
num_cols_to_shrink_id = detect_num_cols_to_shrink(num_id_cols, identity_data)

convert_to_int8 = num_cols_to_shrink_id[0]
convert_to_int16 = num_cols_to_shrink_id[1]
convert_to_int32 = num_cols_to_shrink_id[2]

convert_to_float16 = num_cols_to_shrink_id[3]
convert_to_float32 = num_cols_to_shrink_id[4]

print("convert_to_int8 :", convert_to_int8, "\n")
print("convert_to_int16 :", convert_to_int16, "\n")
print("convert_to_int32 :", convert_to_int32, "\n")

print("convert_to_float16 :", convert_to_float16, "\n")
print("convert_to_float32 :", convert_to_float32, "\n")

In [None]:
for col in convert_to_float16:
    identity_data[col] = identity_data[col].astype('float16')
    
for col in convert_to_float32:
    identity_data[col] = identity_data[col].astype('float32')
    
    
print("successfully converted!")

In [None]:
transaction_data.info()   # now uses 615 MB  (info() prints its report itself and returns None)

identity_data.info()      # now uses 48 MB

**Wow, this really was helpful, since transaction_data went down from 1.8 GB to 615 MB.**

**For identity_data we went down from 72 MB to 48 MB.**

# 9. Dealing with missing values in the cat. features

**I will try a simple one-hot encoding approach. It is important that the train data and the test data are preprocessed in exactly the same way.**

**Otherwise we can't train our model with the train data and then feed the test data into it: for this the train and test data must be identical in terms of columns, column names and column order.**

**Because we concatenated train and test into transaction_data and identity_data, every step is applied to both at once. For the categorical features we will do the following:**

**-Drop features with many missing values**

**-Label-encode features with high cardinality**

**-Onehot-Encode features with low cardinality**

## 9.1 Drop cat. features with many missing values

In [None]:
print("shape before dropping many_missing_cat_trans_cols: ", transaction_data.shape, "\n")        
transaction_data = transaction_data.drop(columns = many_missing_cat_trans_cols)
print("shape after dropping many_missing_cat_trans_cols: ", transaction_data.shape, "\n\n")    

print("shape before dropping many_missing_cat_id_cols: ", identity_data.shape, "\n")        
identity_data = identity_data.drop(columns = many_missing_cat_id_cols)
print("shape after dropping many_missing_cat_id_cols: ", identity_data.shape, "\n")


# because we dropped some categorical columns from the dataframe,
# we must create the list 'cat_trans_cols' and
# 'cat_id_cols' again such that the dropped cols are no longer in them
c = (transaction_data.dtypes == 'object')
cat_trans_cols = list(c[c].index) 

c = (identity_data.dtypes == 'object')
cat_id_cols = list(c[c].index) 

## 9.2 Label-Encode features with high cardinality

**First we have to calculate the cardinality of our categorical features of transaction_data.**

In [None]:
for col in cat_trans_cols:
    print(col, transaction_data[col].nunique(), "\n")

**As we can see, there are two groups of cardinality: low cardinality with 2, 3, 4 or 5 unique values, and comparably high cardinality with roughly 60 unique values.**
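
**Instead of typing the two lists by hand, they could also be derived from the number of unique values. A sketch, where the helper name `split_by_cardinality`, the threshold of 10 and the toy data are my own assumptions:**

```python
import pandas as pd

def split_by_cardinality(dataframe, columns, threshold=10):
    """Return (low_card, high_card) lists, split at `threshold` unique values."""
    low_card, high_card = [], []
    for col in columns:
        # nunique() ignores missing values by default
        if dataframe[col].nunique() <= threshold:
            low_card.append(col)
        else:
            high_card.append(col)
    return low_card, high_card

# hypothetical toy data: 3 distinct products, 20 distinct email domains
toy = pd.DataFrame({
    "product": ["W", "C", "W", "H"] * 5,
    "email": [f"domain{i}.com" for i in range(20)],
})

low_card, high_card = split_by_cardinality(toy, ["product", "email"])
print(low_card, high_card)
```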


In [None]:
low_card_trans_cols = ["ProductCD", "card4", "card6", "M1", "M2", "M3", "M4", "M6", "M7", "M8", "M9"]
high_card_trans_cols = ["P_emaildomain"]

print("lists successfully created!")

**First we will label-encode our high cardinality features.**

**Before we can actually label-encode, we must replace all NaN's with the most frequent value per column.**
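
**Here is the idea on a tiny, made-up column, so that the effect of both steps is easy to see. The column name `email` and its values are hypothetical.**

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical toy column with two missing entries
toy = pd.DataFrame({"email": ["gmail.com", np.nan, "yahoo.com", "gmail.com", np.nan]})

# step 1: replace NaN with the most frequent value (mode() ignores NaN)
most_frequent_value = toy["email"].mode()[0]
toy["email"] = toy["email"].fillna(most_frequent_value)

# step 2: map every distinct string to an integer (classes are sorted alphabetically)
label_encoder = LabelEncoder()
toy["email_encoded"] = label_encoder.fit_transform(toy["email"])

print(toy)
```

**Assigning the result of `fillna` back to the column avoids relying on `inplace = True` on a selected column, which newer pandas versions warn about.**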

In [None]:
for i in cat_trans_cols:
    most_frequent_value = transaction_data[i].mode()[0]
    print("For column: ", i, "the most frequent value is: ", most_frequent_value, "\n")
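    # assign the result back instead of using inplace = True on a selected column,
    # which newer pandas versions warn about and may silently ignore
    transaction_data[i] = transaction_data[i].fillna(most_frequent_value)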

In [None]:
from sklearn.preprocessing import LabelEncoder
    
label_encoder = LabelEncoder()
print("transaction_data.shape before label-encoding: ", transaction_data.shape, "\n")

for col in high_card_trans_cols:
    # LabelEncoder expects a 1-dimensional input, so we encode column by column
    transaction_data[col] = label_encoder.fit_transform(transaction_data[col])

print("transaction_data.shape after label-encoding: ", transaction_data.shape, "\n")
print("transaction_data[high_card_trans_cols] after label_encoding: ",transaction_data[high_card_trans_cols], "\n")

**The last print above shows the label-encoded column. Now we calculate the cardinality of the categorical features of identity_data:**

In [None]:
for col in cat_id_cols:
    print(col, identity_data[col].nunique(), "\n")

**Here we can also find 2 groups with low and high cardinality.**

In [None]:
low_card_id_cols =  ["id_12", "id_15", "id_16", "id_28", "id_29", "id_34", "id_35", "id_36", "id_37", "id_38", "DeviceType"]
high_card_id_cols = ["id_30", "id_31", "id_33", "DeviceInfo"]
    
print("lists successfully created!")

**Before we can label-encode, we must replace all NaN's with the most frequent value per column, as we did for transaction_data.**

In [None]:
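for i in cat_id_cols:
    most_frequent_value = identity_data[i].mode()[0]
    print("For column: ", i, "the most frequent value is: ", most_frequent_value, "\n")
    # assign the result back: inplace = True on a selected column
    # does not reliably modify the dataframe in newer pandas versions
    identity_data[i] = identity_data[i].fillna(most_frequent_value)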

In [None]:
label_encoder = LabelEncoder()

print("identity_data.shape before label-encoding: ", identity_data.shape, "\n")

for col in high_card_id_cols:
    identity_data[col] = label_encoder.fit_transform(identity_data[col])

print("identity_data.shape after label-encoding: ", identity_data.shape, "\n")
print("identity_data[high_card_id_cols] after label_encoding: ",identity_data[high_card_id_cols], "\n")

In [None]:
transaction_data.info()   # info() prints directly and returns None

In [None]:
identity_data.info()   # info() prints directly and returns None

## 9.3 Onehot-Encode features with low cardinality

**For the onehot-encoding-process we sadly have to think a little more than for the imputation-process, because the imputation process did not change the number of columns.**

**When we onehot-encode our low cardinality features, the onehot-encoder will generate many more columns. The column names, column values and column positions will all be generated automatically.**

**Hence we have to create a separate dataframe for each onehot-encoding-process, and then put these dataframes back together afterwards.**

**Due to the generation of extra columns, we cannot simply do something like**

`dataframe[list_of_columns] = encoder.fit_transform(dataframe[list_of_columns])`

**as we did for the imputation.**
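
**The following sketch shows the idea on a tiny made-up dataframe: encode into a separate dataframe, drop the original columns, and put both parts back together. The data and the names `toy`, `encoded` and `toy_encoded` are assumptions for illustration. The explicit `dtype = "uint8"` is also my choice here, so that the result does not depend on the pandas version.**

```python
import pandas as pd

# toy dataframe: one numeric and one low cardinality categorical column (hypothetical data)
toy = pd.DataFrame({
    "TransactionAmt": [10.0, 25.5, 7.0],
    "card4": ["visa", "mastercard", "visa"],
})
cols_to_encode = ["card4"]

# get_dummies creates one new column per category: card4_mastercard and card4_visa
encoded = pd.get_dummies(toy[cols_to_encode], dtype="uint8")

# drop the original column and attach the new ones; the rows are matched by index
toy_encoded = pd.concat([toy.drop(columns=cols_to_encode), encoded], axis=1)

print("newly generated columns: ", encoded.columns.tolist())
print("shape after encoding: ", toy_encoded.shape)
```

**One categorical column with two categories became two new columns, which is why the shape of the dataframe changes.**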

In [None]:
print("shape before encoding: ", transaction_data.shape, "\n")
print("columns to encode: ", low_card_trans_cols, "\n")
print("transaction_data.columns.to_list() before encoding: ", transaction_data.columns.to_list(), "\n")


# this line does the onehot encoding
low_card_trans_encoded = pd.get_dummies(transaction_data[low_card_trans_cols], dummy_na = False)
transaction_data.drop(columns = low_card_trans_cols, inplace = True)

print("shape after encoding: ", transaction_data.shape, "\n\n")
print("shape of new dataframe: ", low_card_trans_encoded.shape, "\n\n")
print("newly generated columns: ", low_card_trans_encoded.columns, "\n")
print("low_card_trans_encoded.info(): ")
low_card_trans_encoded.info()   # info() prints directly and returns None
print("\n")
print("transaction_data.columns.to_list() after encoding: ", transaction_data.columns.to_list(), "\n")

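**The cool thing is that the onehot-encoding process with pd.get_dummies already produces small datatypes (uint8 in older pandas versions, bool since pandas 2.0), hence we do not have to convert these columns down to a smaller datatype. The label-encoded columns are plain integer columns; their datatype can be checked in the info() output above.**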

**Now let's do the same thing for the cat. features of the identity dataframe:**

In [None]:
print("shape before encoding: ", identity_data.shape, "\n")
print("columns to encode: ", low_card_id_cols, "\n")

# this line does the onehot encoding
low_card_id_encoded = pd.get_dummies(identity_data[low_card_id_cols], dummy_na = False)
identity_data.drop(columns = low_card_id_cols, inplace = True)


print("shape after encoding: ", identity_data.shape, "\n\n")
print("shape of new dataframe: ", low_card_id_encoded.shape, "\n\n")
print("newly generated columns: ", low_card_id_encoded.columns, "\n")
print("low_card_id_encoded.info(): ")
low_card_id_encoded.info()   # info() prints directly and returns None

# 10. Final check if there are still missing values in cat. columns

In [None]:
print(transaction_data.isnull().sum().sum(), "\n")
print(low_card_trans_encoded.isnull().sum().sum())

In [None]:
print(identity_data.isnull().sum().sum(), "\n")
print(low_card_id_encoded.isnull().sum().sum())

In [None]:
transaction_data.info()   # info() prints directly, so no print() is needed
print("\n")
low_card_trans_encoded.info()

In [None]:
identity_data.info()   # info() prints directly, so no print() is needed
print("\n")
low_card_id_encoded.info()

# 11. Concat encoded dataframes

**Right now we have our 2 original dataframes transaction_data and identity_data containing the label-encoded categorical features, and our 2 onehot-encoded dataframes:**

1. low_card_trans_encoded
2. low_card_id_encoded


**Now we have to concat each original dataframe with its onehot-encoded counterpart. Afterwards we have to split up the transaction data into train_transaction and test_transaction, and then do the same with the identity data.**

In [None]:
print("transaction_data.shape before concatting: ", transaction_data.shape, "\n")
print("low_card_trans_encoded.shape before concatting: ", low_card_trans_encoded.shape, "\n")

transaction_concatted = pd.concat([transaction_data, low_card_trans_encoded], axis = 1)

print("transaction_concatted.shape after concatting: ", transaction_concatted.shape, "\n")
print("transaction_concatted.columns after concatting: ", transaction_concatted.columns, "\n")

#del low_card_trans_encoded
#del transaction_data

transaction_concatted.info()   # info() prints directly and returns None

**Now we do the same with the identity_data.**

In [None]:
print("identity_data.shape before concatting: ", identity_data.shape, "\n")
print("low_card_id_encoded.shape before concatting: ", low_card_id_encoded.shape, "\n")

identity_concatted = pd.concat([identity_data, low_card_id_encoded], axis = 1)

print("identity_concatted.shape after concatting: ", identity_concatted.shape, "\n")
print("identity_concatted.columns after concatting: ", identity_concatted.columns, "\n")

#del low_card_id_encoded
#del identity_data

identity_concatted.info()   # info() prints directly and returns None

**Now we have our 2 finished dataframes transaction_concatted and identity_concatted, which contain all the features we want to use. All categorical features are encoded and all missing values have been imputed.**

**Before we start the training process, we have to split up our transaction_concatted into train and test, and we also have to split up our identity_concatted.**

**The reason for this is shown in the following image:**

![](https://i.imgur.com/9spsvK8.png)


**In the left part of the image we can see why we cannot concat both dataframes first and then split up.**

**For the splitting process we have to split up our dataframe into two pieces via a horizontal cut.**

**But due to the different shapes of transaction_concatted (1097231, 253)  and identity_concatted  (286140, 55), we cannot concat first and then split up.**

**This would not result in the correct 4 dataframes  train_transaction, test_transaction, train_identity, test_identity, with which we started.**

**In order to get the correct train and test dataframes, we have to split up first,  and then concat.**

**Due to the different shapes of both dataframes, there will also be a huge number of generated NaN's. This is caused by the fact that identity_concatted has far fewer rows than transaction_concatted: every transaction without a matching identity row gets NaN's in all identity columns.**

**We will not impute these missing values, we will simply let our model handle them. I will explain this later in more detail.**
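
**Where these NaN's come from can be reproduced on a toy example. `pd.concat` with `axis = 1` matches rows by their index label, and rows without a partner are filled with NaN. The sketch assumes that both dataframes are indexed by TransactionID. All values are made up.**

```python
import pandas as pd

# toy transaction table with 4 rows, indexed by a made-up TransactionID
transactions = pd.DataFrame(
    {"TransactionAmt": [10.0, 25.5, 7.0, 99.0]},
    index=pd.Index([1001, 1002, 1003, 1004], name="TransactionID"),
)

# toy identity table: only 2 of the 4 transactions have identity information
identity = pd.DataFrame(
    {"id_35_F": [1, 0]},
    index=pd.Index([1002, 1004], name="TransactionID"),
)

# axis = 1 places the columns side by side and aligns the rows on the index
combined = pd.concat([transactions, identity], axis=1)

print(combined)
print("missing values per column:\n", combined.isnull().sum())
```

**If the dataframes carried a default 0, 1, 2, ... index instead, the rows would be paired by position and not by TransactionID, so it is worth checking the index before concatting.**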

In [None]:
print("transaction_concatted.shape before splitting up: ", transaction_concatted.shape, "\n")

# shape of train_transaction was (590540, 393), 
# shape of test_transaction  was (506691, 392)
train_transaction = transaction_concatted.iloc[0:590540]
test_transaction = transaction_concatted.iloc[590540:]

print("train_transaction.shape after splitting up: ", train_transaction.shape, "\n")
print("test_transaction.shape after splitting up: ", test_transaction.shape, "\n")

In [None]:
print("identity_concatted.shape before splitting up: ", identity_concatted.shape, "\n")

# shape of train_identity was  (144233, 40)
# shape of test_identity  was  (141907, 40)
train_identity = identity_concatted.iloc[0:144233]
test_identity = identity_concatted.iloc[144233:]

print("train_identity.shape after splitting up: ", train_identity.shape, "\n")
print("test_identity.shape after splitting up: ", test_identity.shape, "\n")

In [None]:
print("train_transaction.shape before concatting: ", train_transaction.shape, "\n")
print("train_identity.shape before concatting: ", train_identity.shape, "\n")

train_data  = pd.concat([train_transaction, train_identity], axis = 1)

print("train_data.shape: ", train_data.shape)

In [None]:
counter = 0

for i in train_data.columns:
    
    summ = train_data[i].isnull().sum()
    print(i, summ)
    if summ > 0:
        counter += 1
        
print("\n number of columns with missing values: ", counter)

**As we can see, exactly 590540 - 144233 = 446307 values are missing in each of exactly 44 columns, hence the concatting process went exactly as we expected.**
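
**The same check can also be written without a loop. This is a sketch on a made-up dataframe; the names `missing_per_column` and `columns_with_missing` are assumptions for illustration.**

```python
import numpy as np
import pandas as pd

# toy stand-in for train_data: two identity columns miss the same 2 rows (hypothetical data)
toy = pd.DataFrame({
    "TransactionAmt": [10.0, 25.5, 7.0, 99.0],
    "id_35_F": [np.nan, 1.0, np.nan, 0.0],
    "id_35_T": [np.nan, 0.0, np.nan, 1.0],
})

# number of NaN's per column, then keep only the columns that have any
missing_per_column = toy.isnull().sum()
columns_with_missing = missing_per_column[missing_per_column > 0]

print(columns_with_missing)
print("number of columns with missing values: ", len(columns_with_missing))

# if the concatting worked as expected, all affected columns share one missing count
print("distinct missing counts: ", columns_with_missing.unique().tolist())
```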

In [None]:
print("test_transaction.shape before concatting: ", test_transaction.shape, "\n")
print("test_identity.shape before concatting: ", test_identity.shape, "\n")

test_data  = pd.concat([test_transaction, test_identity], axis = 1)

print("test_data.shape: ", test_data.shape)

In [None]:
counter = 0

for i in test_data.columns:
    
    summ = test_data[i].isnull().sum()
    print(i, summ)
    if summ > 0:
        counter += 1
        
print("\n number of columns with missing values: ", counter)

**Let's have a look at one of the columns containing this many NaN's:**

In [None]:
print(test_data["id_35_F"])

**The good thing is that we don't have to remove these NaN's, we can simply let our XGBoost model handle them.**

# 12. Train the model

**We choose an XGBClassifier as our model and start the training process.**

**I commented out the code so that I can save this notebook faster.**

**The given parameters result roughly in a private rank of 3085/6381.**


In [None]:
#from xgboost import XGBClassifier


#clf = XGBClassifier(objective = 'binary:logistic',
#                    gamma = 0.05,
#                    colsample_bytree = 0.5, 
#                    eval_metric = 'auc',
#                    n_estimators = 1350,         
#                    max_depth = 8,
#                    min_child_weight = 2, 
#                    learning_rate = 0.02,
#                    subsample = 0.8,
#                    n_jobs = -1,
#                    verbosity = 0)   # the deprecated 'silent' flag is dropped, 'verbosity' replaces it
                

#print("starting training process..... \n") 
#clf.fit(train_data, y_train)

**Now we save the predictions in a .csv file.**

In [None]:
#sample_submission = pd.read_csv('/kaggle/input/ieee-fraud-detection/sample_submission.csv', index_col='TransactionID')

#sample_submission['isFraud'] = clf.predict_proba(test_data)[:,1]
#sample_submission.to_csv('simple_xgboost.csv')

#print("saving was successful!")

# Thank you for reading my IEEE Fraud detection tutorial!

# Feel free to comment or ask questions :)