In this project, the target variable is "isFraud", which is a categorical (binary) variable. The rest of the columns in the dataset are used as independent variables.
Because the given datasets are very large, several steps are needed to clean and reshape the data so that it is readable and small enough to train the planned models.

We are going to explore the XGBoost (XGB) and LightGBM (LGB) machine learning models, as the size of this project's data is enormous and it would be difficult for traditional algorithms to give fast results. Both are implementations of gradient boosting machines, and we compare them to see which model is the better fit for this machine learning task.


In [None]:
#importing the libraries needed for this analysis
import pandas as pd
from pandas import Series, DataFrame 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder
import time
import xgboost as xgb
import lightgbm as lgb
from sklearn.model_selection import train_test_split

In [None]:
#listing the available input files
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

There are four files we need to load in order to inspect the features of the datasets.

In [None]:
#loading the necessary files (train_ID must come from train_identity.csv, not test_identity.csv)
train_ID=pd.read_csv('/kaggle/input/ieee-fraud-detection/train_identity.csv')
train_trans=pd.read_csv('/kaggle/input/ieee-fraud-detection/train_transaction.csv')
test_ID=pd.read_csv('/kaggle/input/ieee-fraud-detection/test_identity.csv')
test_trans=pd.read_csv('/kaggle/input/ieee-fraud-detection/test_transaction.csv')

As the shapes below show, the files are large. We are going to merge the train_ID and train_trans files because they hold different variables, and we need all of those features in one dataframe to run the models with "isFraud" as the dependent variable.

In [None]:
#check their shapes (number of rows and columns)
print(train_ID.shape)
print(train_trans.shape)
print(test_ID.shape)
print(test_trans.shape)

First, let's create the train and test sets by merging the respective data files.

In [None]:
train_set=pd.merge(train_ID, train_trans, on='TransactionID', how='outer')
test_set=pd.merge(test_ID, test_trans, on='TransactionID', how='outer')
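To make the merge step concrete, here is a minimal sketch on two made-up frames (the values and the tiny sizes are assumptions for illustration only). It shows that transactions without an identity record keep their row and simply get NaN in the identity columns, which is where many of the missing values handled later come from.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables (values are made up).
trans = pd.DataFrame({"TransactionID": [1, 2, 3], "isFraud": [0, 1, 0]})
ident = pd.DataFrame({"TransactionID": [2, 3], "DeviceType": ["mobile", "desktop"]})

# An outer join on TransactionID keeps every row from both tables;
# identity columns are NaN for transactions with no identity record.
merged = pd.merge(ident, trans, on="TransactionID", how="outer")
print(merged.sort_values("TransactionID"))
```

When every identity row has a matching transaction, a left join starting from the transaction table would give the same rows; the notebook's outer join is kept here.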

The Transaction dataset has the following categorical features:
- ProductCD
- card1 - card6
- addr1, addr2
- P_emaildomain
- R_emaildomain
- M1 - M9

The following categorical features from the Identity dataset are now in the combined datasets, train_set and test_set:
- DeviceType
- DeviceInfo
- id_12 - id_38

- The TransactionDT feature is a timedelta from a given reference datetime (not an actual timestamp).

After inspecting the files, we can say that the columns of both datasets have the same fields. The following steps deal with the missing values in the merged files.

In [None]:
#delete the originally loaded files to save memory
del train_ID, train_trans, test_ID, test_trans

In [None]:
#drop rows with too many NA values: thresh=250 keeps only rows with at least 250 non-null values (currently disabled)
#train_set=train_set.dropna(thresh=250)
#test_set=test_set.dropna(thresh=250)

In [None]:
#then check the size of datasets again to see how many we have left 
print(train_set.shape)
print(test_set.shape)

In [None]:
#As the datasets are still large, drop the train columns in which 80% or more of the values are null.
train_set = train_set.loc[:, train_set.isnull().sum() < 0.8*train_set.shape[0]]


In [None]:
#Apply the same rule to the test set: drop the columns in which 80% or more of the values are null.
test_set = test_set.loc[:, test_set.isnull().sum() < 0.8*test_set.shape[0]]
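The column filter above can be checked on a small made-up frame. This sketch uses `isnull().mean()`, which gives the share of nulls per column and is equivalent to dividing `isnull().sum()` by the row count; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up frame with six rows and three columns of varying completeness.
df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],  # 5/6 null
    "some_missing": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],                # 1/6 null
    "complete": [1, 2, 3, 4, 5, 6],                                   # 0 null
})

# Share of null values per column, then keep columns below the 80% threshold.
null_share = df.isnull().mean()
kept = df.loc[:, null_share < 0.8]
print(kept.columns.tolist())
```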

In [None]:
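#Then train_set and test_set are combined; ignore_index=True gives every row a unique label so later drops by index remove exactly the chosen rows
both_data = pd.concat([train_set, test_set], axis=0, sort=False, ignore_index=True)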

In [None]:
#bar plot of the frequency of the target variable "isFraud"
both_data['isFraud'].value_counts().plot(kind = 'bar')

In [None]:
#then delete the old dataframes, as we do not need them for training the models later.
del train_set, test_set

When we inspected the combined dataset built from those four files, the following fields had many null values, almost 90% of their entries.

In [None]:
#To avoid dropping the target column "isFraud", we drop these columns individually by name
both_data=both_data.drop(['id-01', 'id-02', 'id-05', 'id-06', 'id-11', 'id-12', 'id-13', 'id-15', 'id-16', 'id-17', 'id-19', 'id-20',
'id-28', 'id-29', 'id-31', 'id-35', 'id-36', 'id-37', 'id-38'], axis=1)


In [None]:
#We cannot simply drop every row that contains an NA, as that could leave zero rows, so we again drop columns by a null threshold (70% this time)
both_data = both_data.loc[:, both_data.isnull().sum() < 0.7*both_data.shape[0]]

In [None]:
#count the values of the target variable in the merged dataset
both_data.isFraud.value_counts() 

In [None]:
#randomly dropping rows to make the dataset more manageable when running the models
np.random.seed(10)
remove_n = 100000
drop_indices = np.random.choice(both_data.index, remove_n, replace=False)
both_data = both_data.drop(drop_indices)
both_data.shape
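One detail of dropping rows by index is worth illustrating: when frames are concatenated without `ignore_index=True`, index labels repeat, and `drop` removes every row that shares a chosen label. The sketch below uses made-up frames to show the effect, plus a variant with unique labels and `DataFrame.sample` that removes exactly the requested number of rows.

```python
import pandas as pd

# Two made-up frames, each with the default index 0, 1, 2.
a = pd.DataFrame({"x": [1, 2, 3]})
b = pd.DataFrame({"x": [4, 5, 6]})

# Without ignore_index the combined index is 0, 1, 2, 0, 1, 2.
both = pd.concat([a, b], axis=0)
dropped_by_label = both.drop([0])  # removes BOTH rows labelled 0

# With unique labels, dropping n sampled labels removes exactly n rows.
both_unique = both.reset_index(drop=True)
to_drop = both_unique.sample(n=2, random_state=10).index
kept = both_unique.drop(to_drop)
print(len(dropped_by_label), len(kept))
```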

In [None]:
#As mentioned above, the dataset has categorical fields, and all of them have to be encoded.
#pd.get_dummies one-hot encodes them in a single call, which saves time compared with setting up a column transformer.

both_data = pd.get_dummies(both_data)
print(both_data.shape)
both_data.head(2)
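As a small illustration of what `pd.get_dummies` does to a frame, the sketch below encodes one made-up categorical column. Numeric columns pass through unchanged, and each category becomes its own indicator column named `<column>_<value>`.

```python
import pandas as pd

# Made-up frame: one numeric column and one categorical column.
df = pd.DataFrame({
    "TransactionAmt": [10.0, 25.5, 7.0],
    "ProductCD": ["W", "C", "W"],
})

# Object columns are one-hot encoded; numeric columns are left as they are.
encoded = pd.get_dummies(df)
print(encoded.columns.tolist())
```

On the real data, high-cardinality fields such as DeviceInfo can produce a very large number of indicator columns, which is one reason the frame was reduced beforehand.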

In [None]:
#clean the newly generated column names, as some contain characters other than letters, digits and underscores
import re
both_data = both_data.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))
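The effect of the regular expression can be seen on a few column names. The names below are hypothetical examples of what one-hot encoding might produce; the point is that stripping characters can make two different names identical, which is why duplicate columns have to be handled afterwards.

```python
import re

# Hypothetical column names of the kind produced by one-hot encoding.
names = ["P_emaildomain_gmail.com", "DeviceInfo_SM-G950U Build/NRD90M", "card6_debit"]

# Remove every run of characters that is not a letter, digit or underscore.
cleaned = [re.sub(r"[^A-Za-z0-9_]+", "", n) for n in names]
print(cleaned)

# Two distinct names can collapse to the same cleaned name.
collision = {re.sub(r"[^A-Za-z0-9_]+", "", n) for n in ["dev_a.b", "dev_ab"]}
print(collision)
```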

In [None]:
both_data.head(5)

In [None]:
#removing duplicate columns, which helps the chosen models run smoothly
_, i = np.unique(both_data.columns, return_index=True) 
both_data=both_data.iloc[:, i] 
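A side effect of the `np.unique` approach is that the remaining columns come back in sorted order. The sketch below, on a made-up frame, compares it with `columns.duplicated()`, an alternative that also keeps the first occurrence but preserves the original column order.

```python
import numpy as np
import pandas as pd

# Made-up frame with a repeated column name.
df = pd.DataFrame([[1, 2, 3]], columns=["b", "a", "b"])

# Approach used above: first occurrence of each name, columns sorted by name.
_, first_idx = np.unique(df.columns, return_index=True)
sorted_unique = df.iloc[:, first_idx]

# Alternative: first occurrence of each name, original order preserved.
ordered_unique = df.loc[:, ~df.columns.duplicated()]

print(sorted_unique.columns.tolist(), ordered_unique.columns.tolist())
```

Either form is fine for tree-based models, since they do not depend on column order.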

In [None]:
#count the values of the target variable in the merged dataset
both_data.isFraud.value_counts() 

Hoorayyy!!! Finally, we can now split the combined dataset into feature and target datasets.

In [None]:
#separating our data into features dataset X and our target dataset y 
X=both_data.drop('isFraud',axis=1) 
y=both_data.isFraud 

In [None]:
#rows that came from the test files have no label; fill them with the most frequent class
y = y.fillna(y.mode()[0])

In [None]:
#fill the remaining missing feature values with each column's mean
X = X.fillna(X.mean())


In [None]:
#delete the old dataset after splitting it into X and y
del both_data

In [None]:
#Now splitting our datasets into test and train to apply into the models
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.5)

In [None]:
X_test.shape

In [None]:
#fitting a linear model only to plot its coefficients as a rough indicator of feature influence; it is not used for prediction
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_train, y_train)

In [None]:
#inspecting the fitted coefficients
reg.coef_

In [None]:
#then plot the 30 largest coefficients
%matplotlib inline
feat_importances = pd.Series(reg.coef_, index=X.columns)
feat_importances.nlargest(30).plot(kind='barh')

We plotted the 30 features with the largest linear-regression coefficients, used here as a rough proxy for association with the target variable. R_emaildomain appears to have the most influence in this case. One possible explanation is that accounts using similar email domains belong to the same group of fraudsters. DeviceInfo ranks higher than DeviceType, and features such as M4 and V1 are also among the influential factors for the target variable.
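Regression coefficients depend on the scale of each feature, so a complementary check is the Pearson correlation of every feature with the target. The following is a minimal sketch on made-up data (`X_demo` and `y_demo` are hypothetical stand-ins, not the notebook's variables); it is not the method used for the plot above.

```python
import pandas as pd

# Made-up features and binary target for illustration.
X_demo = pd.DataFrame({
    "feat_a": [0, 1, 0, 1, 1, 0],
    "feat_b": [5.0, 3.0, 6.0, 2.0, 1.0, 7.0],
})
y_demo = pd.Series([0, 1, 0, 1, 1, 0], name="isFraud")

# Absolute Pearson correlation of each feature with the target, largest first.
corr = X_demo.corrwith(y_demo).abs().sort_values(ascending=False)
print(corr)
```

In the notebook this would be called on `X_train` and `y_train`, followed by `nlargest(30)` as in the plot above.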

In [None]:
#delete the old datasets as we will not need them for further purposes
del X,y

In [None]:
#XGB model was applied
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X_train, y_train)

In [None]:
#evaluate the model on the test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

In [None]:
#getting the accuracy of the model on the test data
from sklearn.metrics import accuracy_score
accur = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accur * 100.0))

Awesome!! The accuracy is 98.12%, so the model looks like a good fit for this analysis. But as we want to explore more and compare the results, we are going to train a light gradient boosting model as well.
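When one class is rare, as fraud labels typically are, accuracy can be high even for a model that never predicts fraud. The sketch below uses made-up labels (95 negatives, 5 positives, an assumed ratio) to compare accuracy with ROC AUC, computed with `roc_auc_score`, which is already imported at the top of the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up imbalanced labels: 95 non-fraud, 5 fraud.
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts non-fraud with the same score for every row.
always_zero = np.zeros(100, dtype=int)
constant_scores = np.zeros(100)

acc = accuracy_score(y_true, always_zero)     # high despite finding no fraud
auc = roc_auc_score(y_true, constant_scores)  # no better than chance
print(acc, auc)
```

For the models in this notebook, the equivalent call would be `roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])`.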

In [None]:
# building the lightgbm model
import lightgbm as lgb
model2 = lgb.LGBMClassifier()
model2.fit(X_train, y_train)


In [None]:
# predict the results
y_pred_lgb=model2.predict(X_test)

In [None]:
#getting the accuracy of the LightGBM model on the test data (true labels first, predictions second)
accur2=accuracy_score(y_test,y_pred_lgb)
print("Accuracy: %.2f%%" % (accur2 * 100.0))

Tremendous again!! The accuracy of the LGB model is 98%. We cannot say that one model is clearly better than the other, as the difference between XGB and LGB is very small. If you want to explore more, the models can be tuned further by passing parameters to the model constructors.

In [None]:
#Creating a dataframe 'comparison_df' for comparing the performance of LightGBM and XGBoost.
comparison = {'accuracy score':(accur2,accur)}
comparison_df = DataFrame(comparison) 
comparison_df.index= ['LightGBM','xgboost'] 
comparison_df

In [None]:
#predicted class probabilities with the XGB model
test_pred_Prob = model.predict_proba(X_test)
print(test_pred_Prob[:20])


In [None]:
submission = pd.DataFrame({'TransactionID': X_test.TransactionID, 'isFraud': test_pred_Prob[:, 1]})
submission.head(20)

In [None]:
submission.shape

In [None]:
submission.head()

In [None]:


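#load the sample submission and keep only its TransactionID column
ss=pd.read_csv('/kaggle/input/ieee-fraud-detection/sample_submission.csv')
#ss.loc[:, 'isFraud'] = submission['isFraud']
ss=ss.drop(['isFraud'], axis=1)
ss.head(2)
#stack the predictions under the sample IDs; dropna() then removes the sample rows, which have no isFraud value
my_submission =pd.concat([ss,submission], axis=0)
final=my_submission.dropna()
final.shape

#write the submission file
final.to_csv('submission_pmm.csv',index=False)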



Ref:

https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc

https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
https://towardsdatascience.com/lightgbm-vs-xgboost-which-algorithm-win-the-race-1ff7dd4917d