# Ecommerce Fraud Detection EDA + Feature Engg. + Modelling

### Fraud Detection on Customer's Ecommerce Transactions Data.

**Problem Statement:**

The profiles contain information about the customer, their orders, their transactions, what payment methods they used and whether the customer is fraudulent or not.

1) Tasks

 * Provide exploratory analysis of the dataset.
 * Summarise and explain the key trends in the data, providing visualisations and tabular representations as necessary.
 * Explain what factors you think are significant and insignificant in contributing to fraud.
 * Construct a model to predict if a customer is fraudulent based on their profile.
 * Report on the model's success and show which features are most important.


### I am a beginner in the field of machine learning and data analysis. This notebook marks my entry into Kaggle kernels.

### If you like the notebook, do UPVOTE. Suggestions, improvements and corrections are always welcome. 

**Importing Libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

**Importing Datasets**

In [None]:
d1 = pd.read_csv("../input/ecommerce-fraud-data/Customer_DF (1).csv")

In [None]:
d1.columns

In [None]:
d1.head()

In [None]:
d1.info()

In [None]:
d1.describe()

In [None]:
d2 = pd.read_csv('../input/ecommerce-fraud-data/cust_transaction_details (1).csv')

In [None]:
d2.columns

In [None]:
d2.head()

In [None]:
d2.info()

In [None]:
d2.describe()

### Exploratory Data Analysis and Data Visualisations

In [None]:
d1['customerEmail'].nunique()

In [None]:
d2['customerEmail'].nunique()

**There are 168 rows in the d1 dataset but only 161 unique email addresses, so some addresses must be repeated.**

**Finding out the emails that are repeated in the dataset.**

In [None]:
# Count how many times each email address occurs in d1.
result = {}
n_rows = len(d1)  # avoid hard-coding the row count
for i in range(n_rows):
    repeat = 0
    for j in range(n_rows):
        if d1['customerEmail'][i] == d1['customerEmail'][j]:
            repeat += 1
    result[d1['customerEmail'][i]] = repeat
result

**From this we can see that one email address, 'johnlowery@gmail.com', appears 8 times in the dataset.**
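The same tally can be obtained without nested loops. A minimal sketch, assuming a small made-up frame in place of `d1` (the `toy` frame and its addresses are illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for d1; the addresses are made up for illustration.
toy = pd.DataFrame(
    {"customerEmail": ["a@x.com", "b@x.com", "a@x.com", "c@x.com", "a@x.com"]}
)

# value_counts() tallies how often each email occurs.
email_counts = toy["customerEmail"].value_counts()

# Keep only the emails that occur more than once.
repeated = email_counts[email_counts > 1]
print(repeated.to_dict())
```

On the real data the equivalent call would be `d1['customerEmail'].value_counts()`.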

In [None]:
d1[d1['customerEmail']=='johnlowery@gmail.com']

**On further checking this email address, we can see that all of its rows are marked as fraudulent.**

Now, doing some analysis on d2, we can see that d2 has **4 unique payment methods** and that **most of the payments are made by card**.

In [None]:
d2['paymentMethodType'].unique()

In [None]:
# Pass the column by keyword; recent seaborn versions treat the first positional argument as `data`.
sns.countplot(x='paymentMethodType', data=d2)

In [None]:
sns.countplot(x='orderState', data=d2)

**We can see that most of the orders are fulfilled. On further analysis we find:**
1. Orders Fulfilled = 516
2. Orders Failed = 63
3. Orders Pending = 44

In [None]:
plt.figure(figsize=(16,5))
sns.countplot(x='paymentMethodProvider', data=d2)
plt.tight_layout()

**More payments were made with 'JCB 16 digit' and 'VISA 16 digit' than with the other payment method providers.**

In [None]:
plt.figure(figsize=(15,5))
sns.countplot(x='paymentMethodProvider', hue='paymentMethodRegistrationFailure', data=d2)
plt.tight_layout()

**This shows that the most payments were made using 'JCB 16 digit', while 'VISA 13 digit' had the most registration failures.**

In [None]:
sns.countplot(x='No_Payments', hue='Fraud', data=d1)

**From this we can see that, in this dataset, every customer with more than 5 payments is labelled as fraudulent.**
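A threshold read off a plot can be checked numerically by comparing the fraud rate on either side of it. A minimal sketch on made-up rows (the `toy` frame, its values and the `bucket` column are assumptions for illustration; only the column names `No_Payments` and `Fraud` come from the data):

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for d1.
toy = pd.DataFrame({
    "No_Payments": [1, 2, 6, 7, 3, 6],
    "Fraud": [False, True, True, True, False, True],
})

# Label each row by which side of the threshold it falls on.
toy["bucket"] = np.where(toy["No_Payments"] > 5, "more_than_5", "5_or_fewer")

# Fraud rate and number of customers in each bucket.
fraud_rate = toy.groupby("bucket")["Fraud"].agg(["mean", "size"])
print(fraud_rate)
```

A fraud rate of 1.0 in the upper bucket supports the observation, while the `size` column shows how few customers it rests on.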

**Now, finding out the emails that are not common in the 2 datasets.**

In [None]:
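# Collect the emails from d1 that never appear in d2.
l = []
for i in range(len(d1)):
    matches = 0  # number of d2 rows with the same email
    for j in range(len(d2)):
        if d1['customerEmail'][i] == d2['customerEmail'][j]:
            matches += 1
    if matches == 0:
        l.append(d1['customerEmail'][i])
print(len(l))
l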

**These 25 emails appear in d1 but not in d2, so we do not have the transaction details of these customers.**
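The same comparison can be written with a boolean mask instead of nested loops. A minimal sketch, assuming two small made-up frames (`customers` and `transactions` are hypothetical stand-ins for `d1` and `d2`):

```python
import pandas as pd

# Hypothetical stand-ins for d1 and d2.
customers = pd.DataFrame(
    {"customerEmail": ["a@x.com", "b@x.com", "c@x.com", "d@x.com"]}
)
transactions = pd.DataFrame(
    {"customerEmail": ["a@x.com", "a@x.com", "c@x.com"]}
)

# True where the customer's email appears at least once in the transactions.
has_transactions = customers["customerEmail"].isin(transactions["customerEmail"])

# Emails with no transaction details.
missing = customers.loc[~has_transactions, "customerEmail"].tolist()

# Customers that can be joined with their transactions.
matched = customers[has_transactions].reset_index(drop=True)
print(missing, len(matched))
```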

**To use the information from both datasets, we have to find the emails that are common to both.**

In [None]:
common =0
for i in d1['customerEmail']:
    for email in d2['customerEmail']:
        if i==email:
            common+=1
            break
common

In [None]:
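# .copy() makes `final` an independent frame, so the in-place edits below do not act on a slice of d1.
final = d1[d1['customerEmail'].isin(d2['customerEmail'])].copy()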
final.shape

**This dataset has common emails from both the datasets.**

In [None]:
final.drop('Unnamed: 0',axis = 1, inplace = True)

In [None]:
final.reset_index(inplace = True)

**Now adding Total Transaction amount column.**

In [None]:
Total_transaction_amt = []
for i in range(0,143):
    s=0
    for j in range(0,623):
        if(final['customerEmail'][i]==d2['customerEmail'][j]):
            s += d2['transactionAmount'][j]
    Total_transaction_amt.append(s)        


In [None]:
final['Total_transaction_amt'] = Total_transaction_amt

**Now adding No. of Transactions Failed Columns.**

In [None]:
No_transactionsFail = []
for i in range(0,143):
    s=0
    for j in range(0,623):
        if(final['customerEmail'][i]==d2['customerEmail'][j]):
            s += d2['transactionFailed'][j]
    No_transactionsFail.append(s)        

In [None]:
final['No_transactionsFail'] = No_transactionsFail

**Now adding Payment Method Registration Failures column.**

In [None]:
PaymentRegFail = []
for i in range(0,143):
    s=0
    for j in range(0,623):
        if(final['customerEmail'][i]==d2['customerEmail'][j]):
            s += d2['paymentMethodRegistrationFailure'][j]
    PaymentRegFail.append(s)  

In [None]:
final['PaymentRegFail'] = PaymentRegFail
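The three per-customer sums above can also be computed in one pass with `groupby`. A minimal sketch on made-up data (`profiles` and `tx` are hypothetical stand-ins for `final` and `d2`; the column names match the ones used above):

```python
import pandas as pd

# Hypothetical stand-ins for final and d2.
profiles = pd.DataFrame({"customerEmail": ["a@x.com", "b@x.com"]})
tx = pd.DataFrame({
    "customerEmail": ["a@x.com", "a@x.com", "b@x.com"],
    "transactionAmount": [10, 25, 7],
    "transactionFailed": [0, 1, 0],
    "paymentMethodRegistrationFailure": [1, 0, 0],
})

# One row per email, with each column summed over that customer's transactions.
per_customer = (
    tx.groupby("customerEmail")
      .agg(
          Total_transaction_amt=("transactionAmount", "sum"),
          No_transactionsFail=("transactionFailed", "sum"),
          PaymentRegFail=("paymentMethodRegistrationFailure", "sum"),
      )
      .reset_index()
)

# A left merge keeps every profile row, including repeated emails.
profiles = profiles.merge(per_customer, on="customerEmail", how="left")
print(profiles)
```

This avoids hard-coding the row counts (143 and 623), so it keeps working if the input files change.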

**Now adding No. of payments from Paypal, Apple Pay, Card, Bitcoin columns.**

In [None]:
def col_make(column_name,category):
    array = []
    for i in range(0,143):
        s=0
        for j in range(0,623):
            if(final['customerEmail'][i]==d2['customerEmail'][j]):
                if d2[column_name][j]==category:
                    s+=1
        array.append(s)
    return array 

In [None]:
PaypalPayments = col_make('paymentMethodType','paypal')
ApplePayments = col_make('paymentMethodType','apple pay')
BitcoinPayments = col_make('paymentMethodType','bitcoin')
CardPayments = col_make('paymentMethodType','card')

In [None]:
final['PaypalPayments']= PaypalPayments
final['ApplePayments']= ApplePayments
final['CardPayments']= CardPayments
final['BitcoinPayments']= BitcoinPayments

**Now adding Orders Fulfilled, Pending and Failed columns.**

In [None]:
OrdersFulfilled = col_make('orderState','fulfilled')
OrdersFailed =  col_make('orderState','failed')
OrdersPending = col_make('orderState','pending')

In [None]:
final['OrdersFulfilled'] = OrdersFulfilled
final['OrdersPending'] = OrdersPending
final['OrdersFailed'] = OrdersFailed

In [None]:
JCB_16 = col_make('paymentMethodProvider','JCB 16 digit')
AmericanExp = col_make('paymentMethodProvider','American Express')
VISA_16 =  col_make('paymentMethodProvider','VISA 16 digit')
Discover =  col_make('paymentMethodProvider','Discover')
Voyager = col_make('paymentMethodProvider','Voyager')
VISA_13 = col_make('paymentMethodProvider','VISA 13 digit')
Maestro = col_make('paymentMethodProvider','Maestro')
Mastercard = col_make('paymentMethodProvider','Mastercard')
DC_CB =col_make('paymentMethodProvider','Diners Club / Carte Blanche')
JCB_15= col_make('paymentMethodProvider','JCB 15 digit')

In [None]:
final['JCB_16'] = JCB_16
final['AmericanExp'] = AmericanExp 
final['VISA_16'] = VISA_16 
final['Discover'] = Discover
final['Voyager'] = Voyager 
final['VISA_13'] = VISA_13
final['Maestro'] = Maestro 
final['Mastercard'] = Mastercard
final['DC_CB'] = DC_CB 
final['JCB_15'] = JCB_15
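The per-category counts produced by `col_make` can also be built with a cross-tabulation, which yields one column per category in a single call. A minimal sketch on made-up transactions (`tx` is a hypothetical stand-in for `d2`):

```python
import pandas as pd

# Hypothetical stand-in for d2.
tx = pd.DataFrame({
    "customerEmail": ["a@x.com", "a@x.com", "b@x.com", "b@x.com", "b@x.com"],
    "paymentMethodType": ["card", "paypal", "card", "card", "bitcoin"],
})

# Rows are customers, columns are payment method types, cells are counts.
method_counts = pd.crosstab(tx["customerEmail"], tx["paymentMethodType"])
print(method_counts)
```

The same call with `orderState` or `paymentMethodProvider` in place of `paymentMethodType` would give the other groups of count columns, which could then be merged onto `final` by email.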

In [None]:
final.shape

In [None]:
Trns_fail_order_fulfilled = []
for i in range(0,143):
    s=0
    for j in range(0,623):
        if(final['customerEmail'][i]==d2['customerEmail'][j]):
            if (d2['orderState'][j]=='fulfilled') & (d2['transactionFailed'][j]==1):
                s+=1
    Trns_fail_order_fulfilled.append(s)

In [None]:
final['Trns_fail_order_fulfilled'] = Trns_fail_order_fulfilled

In [None]:
Duplicate_IP = []
for i in range(0,143):
    s=0
    for j in range(0,143):
        if(final['customerIPAddress'][i]==final['customerIPAddress'][j]):
            s+=1
    s-=1        
    Duplicate_IP.append(s)

In [None]:
final['Duplicate_IP'] = Duplicate_IP
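The "number of other rows sharing this value" feature can be computed by mapping each value to its frequency and subtracting one. A minimal sketch on made-up addresses (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Made-up IP addresses standing in for final['customerIPAddress'].
toy = pd.DataFrame(
    {"customerIPAddress": ["1.1.1.1", "2.2.2.2", "1.1.1.1", "3.3.3.3"]}
)

# Frequency of each address in the column.
ip_counts = toy["customerIPAddress"].value_counts()

# Look up each row's address and subtract 1 to exclude the row itself.
toy["Duplicate_IP"] = toy["customerIPAddress"].map(ip_counts) - 1
print(toy)
```

The same pattern applies to `customerBillingAddress` for the `Duplicate_Address` column.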

In [None]:
Fraud_Decoded = []
for i in range(0,143):
    s=0
    if(final['Fraud'][i]==True):
        s+=1        
    Fraud_Decoded.append(s)

In [None]:
final['Fraud_Decoded'] = Fraud_Decoded

In [None]:
Duplicate_Address = []
for i in range(0,143):
    s=0
    for j in range(0,143):
        if(final['customerBillingAddress'][i]==final['customerBillingAddress'][j]):
            s+=1
    s-=1        
    Duplicate_Address.append(s)

In [None]:
final['Duplicate_Address']=Duplicate_Address

In [None]:
final[final['Fraud']==True].count()

**Out of 143 data points in the final dataset, 56 are labelled as fraudulent and the remaining 87 are not.**

In [None]:
final.head()

In [None]:
sns.barplot(x = final['No_Transactions'],y = final['No_transactionsFail'],hue = final['Fraud'])

**Note: Where the number of transactions is 0, the number of failed transactions is 6. This inconsistency coincides with the customer being labelled as fraudulent.**

**When the number of transactions is 10, 11 or 13, the customer is labelled as fraudulent in this dataset irrespective of the number of failed transactions.**

In [None]:
final[(final['No_transactionsFail'] == 6) & (final['No_Transactions'] == 0)]

**It is the same Email Address that has been repeated 8 times in the data.**

In [None]:
print(final['customerPhone'].nunique())
print(final['customerDevice'].nunique())
print(final['customerIPAddress'].nunique())
print(final['customerBillingAddress'].nunique())

**As all the phone numbers and devices in the data are unique, they are not of much use in the analysis.**

**We can see that some of the IP addresses and billing addresses are repeated in the data.**

In [None]:
final[final['Duplicate_IP']>0]

**From this we can see that these 4 customers share the same IP address. Devices on the same network can legitimately share an IP, so this is not proof on its own, but here it is treated as a strong indicator of fraud.**

In [None]:
final[final['Duplicate_Address']>0]

**These 3 customers share the same billing address, which is treated here as an indicator of fraud.**

In [None]:
sns.countplot(x = final['OrdersFulfilled'], hue = final['Fraud'])

**From this we can see that, in this dataset, customers with more than 8 fulfilled orders are always labelled as fraudulent.**

In [None]:
final.columns

**Preparing data to feed into model.**

In [None]:
X = final[['No_Transactions',
       'No_Orders', 'No_Payments', 'Total_transaction_amt',
       'No_transactionsFail', 'PaymentRegFail', 'PaypalPayments',
       'ApplePayments', 'CardPayments', 'BitcoinPayments', 'OrdersFulfilled',
       'OrdersPending', 'OrdersFailed','Trns_fail_order_fulfilled','Duplicate_IP','Duplicate_Address','JCB_16', 'AmericanExp', 'VISA_16',
       'Discover', 'Voyager', 'VISA_13', 'Maestro', 'Mastercard', 'DC_CB',
       'JCB_15']]
y = final['Fraud_Decoded']

**Splitting the data into training and testing set.**

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
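With a small and imbalanced dataset, a plain random split can leave the two sets with different fraud rates. A minimal sketch of a stratified split on synthetic data (`X_demo` and `y_demo` are made up; passing `stratify` is a suggestion and not what the cell above does):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels: 40 positives, 60 negatives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 40 + [0] * 60)

# stratify=y_demo keeps the class ratio roughly equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```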

### Model Training

**The models used are Random Forests, Logistic Regression and Support Vector Machines, as all of them are suitable for binary classification.**

**Training the Random Forest Classifier.**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score

In [None]:
rfc = RandomForestClassifier(n_estimators=150)
rfc.fit(X_train,y_train)
pred = rfc.predict(X_test)
print(accuracy_score(y_test,pred))
sns.heatmap(data = confusion_matrix(y_test,pred),annot = True)
print(classification_report(y_test,pred))

**Training the Logistic Regression Model.**

In [None]:
from sklearn.linear_model import LogisticRegression
logr = LogisticRegression()
logr.fit(X_train,y_train)
log_pred =logr.predict(X_test)
print(accuracy_score(y_test,log_pred))
sns.heatmap(data=confusion_matrix(y_test,log_pred),annot = True)
print(classification_report(y_test,log_pred))

**Training the Support Vector Machines Classifier.**

In [None]:
from sklearn.svm import SVC
svc = SVC(gamma = 'auto')
svc.fit(X_train,y_train)
svc_pred=svc.predict(X_test)
# Evaluate the SVC predictions (svc_pred), not the Random Forest predictions (pred).
print(accuracy_score(y_test,svc_pred))
sns.heatmap(data = confusion_matrix(y_test,svc_pred),annot = True)
print(classification_report(y_test,svc_pred))

**We can see that Logistic Regression Model is giving the best outcome as compared to other models.**
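Logistic Regression and SVC are sensitive to feature scale, and a feature such as the total transaction amount is on a much larger scale than the count features. A minimal sketch of scaling inside a pipeline, on synthetic data (`amount`, `count`, `X_demo` and `y_demo` are made up, and scaling is a suggestion that the cells above do not apply):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data: one large-scale feature and one small-scale count feature.
rng = np.random.default_rng(0)
amount = rng.normal(500, 200, size=200)
count = rng.integers(0, 10, size=200)
X_demo = np.column_stack([amount, count])
y_demo = (count > 5).astype(int)

# The scaler is fitted on the training rows only, then applied before the SVC.
scaled_svc = make_pipeline(StandardScaler(), SVC(gamma="auto"))
scaled_svc.fit(X_demo[:150], y_demo[:150])

# Mean accuracy on the held-out rows.
acc = scaled_svc.score(X_demo[150:], y_demo[150:])
print(acc)
```

A pipeline like this can be passed to `GridSearchCV` or `cross_val_score` in place of the bare estimator, so the scaler is refitted within each fold.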

### Hyperparameter Tuning

**Using Grid Search Cross Validation to fine tune the models to improve accuracy.**

**Applying Grid Search CV on Support Vector Classifier.**

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
svc_param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf']} 

In [None]:
gridsvc = GridSearchCV(SVC(),svc_param_grid,refit=True,verbose=3)

In [None]:
gridsvc.fit(X_train,y_train)

In [None]:
gridsvc.best_params_

In [None]:
gridsvc.best_estimator_

In [None]:
grid_svc_predictions = gridsvc.predict(X_test)

In [None]:
print(accuracy_score(y_test,grid_svc_predictions))
sns.heatmap(data = confusion_matrix(y_test,grid_svc_predictions),annot= True)
print(classification_report(y_test,grid_svc_predictions))

**Applying Grid Search CV on Logistic Regression.**

In [None]:
logr_param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }

In [None]:
grid_logr = GridSearchCV(LogisticRegression(),logr_param_grid,refit=True,verbose=3)

In [None]:
grid_logr.fit(X_train,y_train)

In [None]:
grid_logr.best_params_

In [None]:
grid_logr.best_estimator_

In [None]:
grid_logr_predictions = grid_logr.predict(X_test)

In [None]:
print(accuracy_score(y_test,grid_logr_predictions))
sns.heatmap(data = confusion_matrix(y_test,grid_logr_predictions),annot = True)
print(classification_report(y_test,grid_logr_predictions))

**Applying Grid Search CV on Random Forests Classifier.**

In [None]:
rfc_param_grid = { 
    'n_estimators': [100,150,200,350,500],
    'max_features': ['sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
    'max_depth' : [4,5,6,7,8],
    'criterion' :['gini', 'entropy']
}

In [None]:
grid_rfc = GridSearchCV(RandomForestClassifier(),rfc_param_grid,refit=True,verbose=3)

In [None]:
grid_rfc.fit(X_train,y_train)

In [None]:
grid_rfc.best_params_

In [None]:
grid_rfc.best_estimator_

In [None]:
grid_rfc_predictions = grid_rfc.predict(X_test)

In [None]:
print(accuracy_score(y_test,grid_rfc_predictions))
sns.heatmap(data = confusion_matrix(y_test,grid_rfc_predictions),annot = True)
print(classification_report(y_test,grid_rfc_predictions))

In [None]:
rfc.feature_importances_

**The top 3 features used in the prediction are:**
1. Total Transaction Amount
2. Number Of Payments
3. Orders Fulfilled 
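The raw `feature_importances_` array is easier to read when paired with the column names and sorted. A minimal sketch on synthetic data (the `signal` and `noise_*` features are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only `signal` determines the label.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})
y_demo = (X_demo["signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Pair each importance with its feature name and sort from most to least important.
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

On the notebook's data the equivalent would be `pd.Series(rfc.feature_importances_, index=X.columns).sort_values(ascending=False)`.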

**Applying K Fold Cross Validation on Grid Search Model to check the robustness and quality of the model.**

**Applying Kfold CV on RFC.**

In [None]:
from sklearn.model_selection import cross_val_score
cv_scores_rfc = cross_val_score(grid_rfc.best_estimator_, X, y, cv=5)
print(cv_scores_rfc)
print("Mean 5-Fold Accuracy: {}".format(np.mean(cv_scores_rfc)))  # classifiers are scored by accuracy by default

**Applying Kfold CV on Logistic Regression.**

In [None]:
cv_scores_logr = cross_val_score(grid_logr.best_estimator_, X, y, cv=5)
print(cv_scores_logr)
print("Mean 5-Fold Accuracy: {}".format(np.mean(cv_scores_logr)))  # classifiers are scored by accuracy by default

**Applying Kfold CV on SVC.**

In [None]:
cv_scores_svc = cross_val_score(gridsvc.best_estimator_, X, y, cv=5)
print(cv_scores_svc)
print("Mean 5-Fold Accuracy: {}".format(np.mean(cv_scores_svc)))  # classifiers are scored by accuracy by default
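For a classifier, `cross_val_score` reports accuracy unless a `scoring` argument is given. A minimal sketch that makes the metric explicit and also reports F1, on synthetic data (`X_demo`, `y_demo` and the choice of F1 are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary problem driven mostly by the first feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)

# Shuffled, stratified folds with a fixed seed for repeatability.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

acc_scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=cv, scoring="accuracy")
f1_scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=cv, scoring="f1")

print("Mean 5-fold accuracy:", acc_scores.mean())
print("Mean 5-fold F1:", f1_scores.mean())
```

F1 is often more informative than accuracy when the fraud class is the minority.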