
# eCommerce Fraud Detection

eCommerce websites often process large volumes of money, and wherever large sums move, there is a high risk of fraudulent activity, e.g. purchases with stolen credit cards or money laundering. Machine learning is well suited to identifying such activity: most websites that accept credit card information have a risk team that uses machine learning to prevent fraud.

Company X is an e-commerce site that sells electronic products.

This project builds a machine learning model that predicts the probability that a new user's first transaction is fraudulent. The goals of this project are:

- Classify transactions in a highly imbalanced (roughly 100:1) dataset
- Engineer features that extract more information from the data
- Apply resampling techniques to handle class imbalance
- Predict whether a transaction on an eCommerce site is fraudulent, comparing logistic regression, random forest, and gradient boosting classifiers on several evaluation metrics
- Make actionable recommendations for fraud detection system design


Data info:
1. "Fraud_Data" - information about each user's first transaction
2. "IpAddress_to_Country" - maps numeric IP addresses to countries. Each row gives a range (lower and upper bound) for a country. If a numeric IP address falls within the range, it belongs to the corresponding country.

**Unsupervised Learning**

In a fraud detection scenario, we often have very few labeled examples, and the process of labeling fraud cases can be time-consuming. Therefore, we seek to leverage the available unlabeled data to gain valuable insights. Anomaly detection is a form of unsupervised learning where we try to identify unusual instances based solely on their feature characteristics.

**Supervised Learning**

Once we have collected a sufficient amount of labeled training data, we can use a supervised learning algorithm that identifies relationships between the features and the corresponding target class.


# Contents
* Part 1: Import Data
* Part 2: Data exploration
* Part 3: Feature Engineering
  * Part 3.1: Identify country info based on ip_address
  * Part 3.2: Time related features
  * Part 3.3: Data Split
  * Part 3.4: Convert categorical features
  * Part 3.5: Scaling
* Part 4: Model Training
  * Part 4.1: Baseline model
  * Part 4.2: Resampling
* Part 5: Parameter tuning by Grid Search
* Part 6: Model Selection
* Part 7: Fraud Characteristics
* Part 8: How to use the prediction for fraud detection system

# Part 1: Import Data

In [None]:
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score, roc_auc_score, roc_curve, precision_recall_curve, auc, make_scorer, recall_score, accuracy_score, precision_score, confusion_matrix
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')

!pip install ydata_profiling

In [None]:
!git clone https://github.com/xx.git

Cloning into 'fraudDetection'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 11 (delta 2), reused 11 (delta 2), pack-reused 0
Receiving objects: 100% (11/11), 6.74 MiB | 9.68 MiB/s, done.
Resolving deltas: 100% (2/2), done.


In [None]:
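# Note: `!cd` runs in a temporary subshell and would not change the notebook's working directory,
# so all paths below stay relative to the folder that contains the clone.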
!ls fraudDetection/

cv_data.csv   imbalancedFraudDF.csv	test_data.csv	tr_server_data.csv
cv_label.csv  IpAddress_to_Country.csv	test_label.csv


In [None]:
ipToCountry = pd.read_csv('fraudDetection/IpAddress_to_Country.csv')
fraud_data = pd.read_csv('fraudDetection/imbalancedFraudDF.csv')

In [None]:
fraud_data.head()

Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,class
0,22058,2015-02-24 22:55:49,2015-04-18 02:47:11,34,QVPSPJUOCKZAR,SEO,Chrome,M,39,732758400.0,0
1,333320,2015-06-07 20:39:50,2015-06-08 01:38:54,16,EOGFQPIZPYXFZ,Ads,Chrome,F,53,350311400.0,0
2,150084,2015-04-28 21:13:25,2015-05-04 13:54:50,44,ATGTXKYKUDUQN,SEO,Safari,M,41,3840542000.0,0
3,221365,2015-07-21 07:09:52,2015-09-09 18:40:53,39,NAUITBZFJKHWW,Ads,Safari,M,45,415583100.0,0
4,159135,2015-05-21 06:03:03,2015-07-09 08:05:14,42,ALEYXFXINSXLZ,Ads,Chrome,M,18,2809315000.0,0


# Part 2: Data exploration

In [None]:
# Distribution of the label column
print(fraud_data['class'].value_counts())
print(fraud_data['class'].value_counts()/fraud_data['class'].count()) # fraud_data['class'].value_counts(normalize = True)

0    136961
1      1415
Name: class, dtype: int64
0    0.989774
1    0.010226
Name: class, dtype: float64


The dataset contains 1,415 fraudulent transactions out of 138,376, so it is highly imbalanced: 98.98% of the transactions are non-fraud and only 1.02% are fraud.

In [None]:
fraud_data.shape

(138376, 11)

In [None]:
fraud_data.dtypes

user_id             int64
signup_time        object
purchase_time      object
purchase_value      int64
device_id          object
source             object
browser            object
sex                object
age                 int64
ip_address        float64
class               int64
dtype: object

In [None]:
# check missing values
fraud_data.isna().sum()
# fraud_data.isnull().sum(axis = 0)

user_id           0
signup_time       0
purchase_time     0
purchase_value    0
device_id         0
source            0
browser           0
sex               0
age               0
ip_address        0
class             0
dtype: int64

In [None]:
# Check if column user_id is unique(no dup) for time related aggregates
# This dataset only contains the first transaction of the customer
print (fraud_data["user_id"].nunique())
print (len(fraud_data.user_id))

138376
138376


In [None]:
fraud_data.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
user_id,138376.0,200149.0,115226.8,2.0,100894.8,200000.5,299745.2,400000.0
purchase_value,138376.0,36.93899,18.32109,9.0,22.0,35.0,49.0,154.0
age,138376.0,33.12587,8.623645,18.0,27.0,33.0,39.0,76.0
ip_address,138376.0,2154381000.0,1250563000.0,52093.496895,1085079000.0,2156471000.0,3249150000.0,4294850000.0
class,138376.0,0.01022576,0.1006045,0.0,0.0,0.0,0.0,1.0


In [None]:
import ydata_profiling

ydata_profiling.ProfileReport(fraud_data)

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]



# Part 3: Feature Engineering




## Part 3.1: Identify country info based on ip_address


In [None]:
ipToCountry.head()

Unnamed: 0,lower_bound_ip_address,upper_bound_ip_address,country
0,16777216.0,16777471,Australia
1,16777472.0,16777727,China
2,16777728.0,16778239,China
3,16778240.0,16779263,Australia
4,16779264.0,16781311,China


We can infer a transaction's geographical location by matching its ip_address against the mapping between IP ranges and countries, and then add the result to 'fraud_data' as a new feature named country.

In [None]:
ip_address = fraud_data.loc[8, 'ip_address']
tmp = ipToCountry[(ipToCountry['lower_bound_ip_address'] <= ip_address) &
                    (ipToCountry['upper_bound_ip_address'] >= ip_address)]
print(tmp)
print('-------------------')

ip_address = fraud_data.loc[5, 'ip_address']
tmp = ipToCountry[(ipToCountry['lower_bound_ip_address'] <= ip_address) &
                    (ipToCountry['upper_bound_ip_address'] >= ip_address)]
print(tmp)

      lower_bound_ip_address  upper_bound_ip_address        country
1017             335544320.0               352321535  United States
-------------------
Empty DataFrame
Columns: [lower_bound_ip_address, upper_bound_ip_address, country]
Index: []


In [None]:
# start = time.time()

countries = []
for i in range(len(fraud_data)):
    ip_address = fraud_data.loc[i, 'ip_address'] # get each row's ip_address

    # check which interval ip_address falls into
    tmp = ipToCountry[(ipToCountry['lower_bound_ip_address'] <= ip_address) &
                    (ipToCountry['upper_bound_ip_address'] >= ip_address)]

    if len(tmp) == 1: # found match
        countries.append(tmp['country'].values[0])
    else: # no match
        countries.append('NA')

# add a new column
fraud_data['country'] = countries
# runtime = time.time() - start
# print("Lookup took", runtime, "seconds.")

# This row-by-row scan is slow (one full pass over the range table per transaction).
# A binary search over the sorted lower bounds could speed it up.
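
The binary-search idea mentioned in the comment can be sketched with the standard-library `bisect` module. This is a minimal illustration, not the notebook's implementation: the helper names `build_ip_index` and `lookup_country` and the toy ranges are hypothetical, and the sketch assumes the ranges do not overlap.

```python
from bisect import bisect_right

def build_ip_index(ranges):
    """Sort (lower, upper, country) tuples by lower bound for binary search."""
    ordered = sorted(ranges, key=lambda r: r[0])
    lowers = [r[0] for r in ordered]
    return lowers, ordered

def lookup_country(ip, lowers, ordered, default='NA'):
    """Return the country whose [lower, upper] range contains ip, else default."""
    # Position of the last range whose lower bound is <= ip
    pos = bisect_right(lowers, ip) - 1
    if pos >= 0 and ip <= ordered[pos][1]:
        return ordered[pos][2]
    return default

# Toy ranges standing in for ipToCountry (values are made up)
toy_ranges = [(10, 19, 'A'), (20, 29, 'B'), (40, 49, 'C')]
lowers, ordered = build_ip_index(toy_ranges)
found = [lookup_country(ip, lowers, ordered) for ip in [5, 10, 25, 35, 49, 50]]
print(found)
```

With the notebook's data, the ranges could presumably be built from `zip(ipToCountry['lower_bound_ip_address'], ipToCountry['upper_bound_ip_address'], ipToCountry['country'])`, and each lookup then costs a logarithmic number of comparisons instead of a full scan.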

In [None]:
fraud_data.head()

Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,class,country
0,22058,2015-02-24 22:55:49,2015-04-18 02:47:11,34,QVPSPJUOCKZAR,SEO,Chrome,M,39,732758400.0,0,Japan
1,333320,2015-06-07 20:39:50,2015-06-08 01:38:54,16,EOGFQPIZPYXFZ,Ads,Chrome,F,53,350311400.0,0,United States
2,150084,2015-04-28 21:13:25,2015-05-04 13:54:50,44,ATGTXKYKUDUQN,SEO,Safari,M,41,3840542000.0,0,
3,221365,2015-07-21 07:09:52,2015-09-09 18:40:53,39,NAUITBZFJKHWW,Ads,Safari,M,45,415583100.0,0,United States
4,159135,2015-05-21 06:03:03,2015-07-09 08:05:14,42,ALEYXFXINSXLZ,Ads,Chrome,M,18,2809315000.0,0,Canada


In [None]:
import collections

collections.Counter(fraud_data.country.value_counts(dropna = False) < 400)


Counter({False: 33, True: 148})

## Part 3.2: Time related features

In [None]:
# time related features
# hours elapsed between signup and first purchase
fraud_data['interval_after_signup'] = (pd.to_datetime(fraud_data['purchase_time']) - pd.to_datetime(
        fraud_data['signup_time'])).dt.total_seconds()/3600

# difference between the day-of-year of purchase and of signup
# (approximate days elapsed; it would go negative if the two dates straddle a new year)
fraud_data['interval_after_signup_days_of_year'] = pd.DatetimeIndex(fraud_data['purchase_time']).dayofyear - pd.DatetimeIndex(fraud_data['signup_time']).dayofyear
fraud_data = fraud_data.drop(['user_id','signup_time','purchase_time'], axis=1)
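
As a worked check of the two time features, the calculation can be reproduced for a single record using only the standard library. The timestamps are those of the first row of `fraud_data`; the variable names are illustrative.

```python
from datetime import datetime

fmt = '%Y-%m-%d %H:%M:%S'
signup = datetime.strptime('2015-02-24 22:55:49', fmt)
purchase = datetime.strptime('2015-04-18 02:47:11', fmt)

# Hours between signup and purchase (interval_after_signup)
hours = (purchase - signup).total_seconds() / 3600

# Day-of-year difference (interval_after_signup_days_of_year)
day_diff = purchase.timetuple().tm_yday - signup.timetuple().tm_yday

print(round(hours, 6), day_diff)
```

The two features carry nearly the same information at different resolutions, which is worth keeping in mind when reading feature importances.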

In [None]:
fraud_data.head()

Unnamed: 0,purchase_value,device_id,source,browser,sex,age,ip_address,class,country,interval_after_signup,interval_after_signup_days_of_year
0,34,QVPSPJUOCKZAR,SEO,Chrome,M,39,732758400.0,0,Japan,1251.856111,53
1,16,EOGFQPIZPYXFZ,Ads,Chrome,F,53,350311400.0,0,United States,4.984444,1
2,44,ATGTXKYKUDUQN,SEO,Safari,M,41,3840542000.0,0,,136.690278,6
3,39,NAUITBZFJKHWW,Ads,Safari,M,45,415583100.0,0,United States,1211.516944,50
4,42,ALEYXFXINSXLZ,Ads,Chrome,M,18,2809315000.0,0,Canada,1178.036389,49


In [None]:
fraud_data.source.value_counts()

SEO       55766
Ads       54913
Direct    27697
Name: source, dtype: int64

## Part 3.3: Data Split

We need to split the data into train and test sets before applying any technique for handling imbalanced data. This ensures that we do not leak information from the test set into the training set.

In [None]:
y = fraud_data['class']
X = fraud_data.drop(['class'], axis=1)

# split into train/test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print("X_train.shape:", X_train.shape)
print("y_train.shape:", y_train.shape)
print("X_test.shape:", X_test.shape)
print("y_test.shape:", y_test.shape)

X_train.shape: (110700, 10)
y_train.shape: (110700,)
X_test.shape: (27676, 10)
y_test.shape: (27676,)


In [None]:
X_train.head()

Unnamed: 0,purchase_value,device_id,source,browser,sex,age,ip_address,country,interval_after_signup,interval_after_signup_days_of_year
29343,12,OULPAZAFRFPXP,Ads,Chrome,M,42,3690922000.0,Korea Republic of,972.128889,41
12190,10,AIIWMFEYQQIEB,Ads,Opera,M,29,1686759000.0,United States,1879.455278,79
19388,34,VUVETBUPCIWJE,Direct,Chrome,M,53,4138429000.0,,1630.698611,68
89104,48,QCFULAJOYKFUU,Ads,Chrome,M,29,96173370.0,France,596.005,25
82082,44,IHRWLMIJMEEEU,Ads,FireFox,M,24,1936025000.0,China,1966.405278,82


## Part 3.4: Convert categorical features

### Part 3.4.1: Convert low-cardinality categorical features using one-hot encoding


In [None]:
# encoding is done after the split, separately for train and test
X_train = pd.get_dummies(X_train, columns=['source', 'browser'])  # get_dummies drops the original 'source' and 'browser' columns
X_train['sex'] = (X_train.sex == 'M').astype(int)

X_test = pd.get_dummies(X_test, columns=['source', 'browser'])
X_test['sex'] = (X_test.sex == 'M').astype(int)
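
Encoding the two splits separately works here because both contain every `source` and `browser` value, but that is not guaranteed in general. A minimal sketch of one way to guard against mismatched columns, using toy frames rather than the notebook's data:

```python
import pandas as pd

# Toy splits: the test split is missing two of the browser categories
train = pd.DataFrame({'browser': ['Chrome', 'Opera', 'Safari']})
test = pd.DataFrame({'browser': ['Chrome', 'Chrome']})

train_enc = pd.get_dummies(train, columns=['browser'])
test_enc = pd.get_dummies(test, columns=['browser'])

# Force the test frame to have exactly the training columns, in the same order;
# categories absent from the test split become all-zero columns.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```

A category that appears only in the test split would be dropped by this approach, which is usually acceptable because the model has never seen it.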


### Part 3.4.2: Convert categorical features with high cardinality to numeric features

In [None]:
X_train['country'].value_counts(ascending=True)
# drawback of count encoding: countries with the same count collide (they become indistinguishable)

Benin                 1
Yemen                 1
Fiji                  1
Monaco                1
Madagascar            1
                  ...  
United Kingdom     3253
Japan              5251
China              8876
NA                16275
United States     42348
Name: country, Length: 177, dtype: int64

In [None]:
# Training set
# the more a device is shared, the more suspicious
# number of times each device_id occurs in the training data
X_train['n_dev_shared'] = X_train.device_id.map(X_train.device_id.value_counts(dropna=False))

# the more an IP is shared, the more suspicious
X_train['n_ip_shared'] = X_train.ip_address.map(X_train.ip_address.value_counts(dropna=False))

# the fewer visits from a country, the more suspicious
# counts include missing values (dropna=False)
X_train['n_country_shared'] = X_train.country.map( X_train.country.value_counts(dropna=False))

X_train = X_train.drop(['device_id','ip_address','country'], axis=1)


X_test['n_dev_shared'] = X_test.device_id.map(X_test.device_id.value_counts(dropna=False))
X_test['n_ip_shared'] = X_test.ip_address.map(X_test.ip_address.value_counts(dropna=False))
X_test['n_country_shared'] = X_test.country.map(X_test.country.value_counts(dropna=False))

X_test = X_test.drop(['device_id','ip_address','country'], axis=1)
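
The cell above counts occurrences within each split separately, so the same device can receive different counts in train and test. One possible alternative, shown here only as a sketch and not what the notebook does, is to reuse the training-set counts when encoding the test set. The toy IDs and the choice of 0 for unseen devices are assumptions.

```python
import pandas as pd

train_ids = pd.Series(['d1', 'd1', 'd2', 'd3'])
test_ids = pd.Series(['d1', 'd4'])

# Counts learned from the training split only
train_counts = train_ids.value_counts()

# Devices never seen in training get a count of 0 (an illustrative choice)
test_shared = test_ids.map(train_counts).fillna(0).astype(int)
print(test_shared.tolist())
```

Whether this is preferable depends on how the feature would be computed in production, where counts usually come from historical data.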


In [None]:
X_train.head()

Unnamed: 0,purchase_value,sex,age,interval_after_signup,interval_after_signup_days_of_year,source_Ads,source_Direct,source_SEO,browser_Chrome,browser_FireFox,browser_IE,browser_Opera,browser_Safari,n_dev_shared,n_ip_shared,n_country_shared
29343,12,1,42,972.128889,41,1,0,0,1,0,0,0,0,1,1,3075
12190,10,1,29,1879.455278,79,1,0,0,0,0,0,1,0,1,1,42348
19388,34,1,53,1630.698611,68,0,1,0,1,0,0,0,0,1,1,16275
89104,48,1,29,596.005,25,1,0,0,1,0,0,0,0,1,1,2322
82082,44,1,24,1966.405278,82,1,0,0,0,1,0,0,0,1,1,8876


## Part 3.5: Scaling

In [None]:
# Min-max normalization rescales values to [0, 1]; standardization (StandardScaler) instead
# gives mean 0 and variance 1, so its outputs can be negative.

# Compute the train minimum and maximum to be used for later scaling:
scaler = preprocessing.MinMaxScaler().fit(X_train[['n_dev_shared', 'n_ip_shared', 'n_country_shared']])

# transform the training data and use them for the model training
X_train[['n_dev_shared', 'n_ip_shared', 'n_country_shared']] = scaler.transform(X_train[['n_dev_shared', 'n_ip_shared', 'n_country_shared']])

# before predicting on the test data, apply the scaler fitted on the training data to X_test
# instead of fitting a brand-new scaler on the test set
X_test[['n_dev_shared', 'n_ip_shared', 'n_country_shared']] = scaler.transform(X_test[['n_dev_shared', 'n_ip_shared', 'n_country_shared']])
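
The arithmetic behind min-max scaling is `(x - min) / (max - min)`, with `min` and `max` taken from the training data only. A small pure-Python sketch of that formula follows; the helper names and the raw counts are hypothetical, not the notebook's values.

```python
def min_max_fit(values):
    """Learn the scaling bounds from the training values."""
    return min(values), max(values)

def min_max_transform(values, lo, hi):
    """Rescale values with bounds learned elsewhere (here: from training data)."""
    return [(v - lo) / (hi - lo) for v in values]

train_counts = [1, 2, 3, 6]   # hypothetical raw share counts in the training set
test_counts = [1, 3, 8]       # hypothetical raw share counts in the test set

lo, hi = min_max_fit(train_counts)
scaled_train = min_max_transform(train_counts, lo, hi)
scaled_test = min_max_transform(test_counts, lo, hi)
print(scaled_train)
print(scaled_test)
```

Because the bounds come from the training data, a test value larger than the training maximum maps above 1; `MinMaxScaler` behaves the same way unless clipping is enabled.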


In [None]:
X_train.head()

Unnamed: 0,purchase_value,sex,age,interval_after_signup,interval_after_signup_days_of_year,source_Ads,source_Direct,source_SEO,browser_Chrome,browser_FireFox,browser_IE,browser_Opera,browser_Safari,n_dev_shared,n_ip_shared,n_country_shared
29343,12,1,42,972.128889,41,1,0,0,1,0,0,0,0,0.0,0.0,0.072591
12190,10,1,29,1879.455278,79,1,0,0,0,0,0,1,0,0.0,0.0,1.0
19388,34,1,53,1630.698611,68,0,1,0,1,0,0,0,0,0.0,0.0,0.384301
89104,48,1,29,596.005,25,1,0,0,1,0,0,0,0,0.0,0.0,0.054809
82082,44,1,24,1966.405278,82,1,0,0,0,1,0,0,0,0.0,0.0,0.209578


In [None]:
X_train.n_dev_shared.value_counts(dropna=False)

0.0    105427
0.2      4774
0.4       324
0.6       124
0.8        45
1.0         6
Name: n_dev_shared, dtype: int64

In [None]:
X_test.n_dev_shared.value_counts(dropna=False)

0.0    27330
0.2      334
0.4       12
Name: n_dev_shared, dtype: int64

In [None]:
X_train.dtypes

purchase_value                          int64
sex                                     int64
age                                     int64
interval_after_signup                 float64
interval_after_signup_days_of_year      int64
source_Ads                              uint8
source_Direct                           uint8
source_SEO                              uint8
browser_Chrome                          uint8
browser_FireFox                         uint8
browser_IE                              uint8
browser_Opera                           uint8
browser_Safari                          uint8
n_dev_shared                          float64
n_ip_shared                           float64
n_country_shared                      float64
dtype: object

# Part 4: Model Training


## Part 4.1: Baseline model


### Simple LogisticRegression model

In [None]:
def evaluate_metrics(y_test, y_predict, probs):
  cm = metrics.confusion_matrix(y_test, y_predict)
  cmDF = pd.DataFrame(cm, columns=['pred_0', 'pred_1'], index=['true_0', 'true_1'])
  print ("confusion_matrix is: ")
  print(cmDF)
  print("roc_auc_score is: ", roc_auc_score(y_test, probs[:, 1]))
  print ('recall =',float(cm[1,1])/(cm[1,0]+cm[1,1]))
  print ('precision =', float(cm[1,1])/(cm[1,1] + cm[0,1]))
  print("%s: %r" % ("f1_score is: ", f1_score(y_test, y_predict )))

In [None]:
# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X_train,y_train)

# predict on test
y_pred=logreg.predict(X_test)

# generate class probabilities
probs = logreg.predict_proba(X_test)

In [None]:
# evaluation metrics
print ("Baseline - Logistic Regression: ")
evaluate_metrics(y_test, y_pred, probs)

Baseline - Logistic Regression: 
confusion_matrix is: 
        pred_0  pred_1
true_0   27388       1
true_1     280       7
roc_auc_score is:  0.7063308943047024
recall = 0.024390243902439025
precision = 0.875
f1_score is: : 0.04745762711864407
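
To make the relationship between the confusion matrix and the printed metrics explicit, the values can be recomputed by hand. This is a small sketch with a hypothetical helper, `precision_recall_f1`, fed with the baseline logistic regression counts shown above.

```python
def precision_recall_f1(tn, fp, fn, tp):
    """Compute precision, recall, and F1 for the positive (fraud) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1

# Baseline logistic regression: true_0 row = (27388, 1), true_1 row = (280, 7)
precision, recall, f1 = precision_recall_f1(tn=27388, fp=1, fn=280, tp=7)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

The high precision and very low recall show that this model flags almost nothing as fraud: the few alerts it raises are mostly right, but 280 of the 287 fraud cases go undetected.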


### Simple Random Forest model

In [None]:
classifier_RF = RandomForestClassifier(random_state=0)

classifier_RF.fit(X_train, y_train)

# predict class labels 0/1 for the test set
predicted = classifier_RF.predict(X_test)

# generate class raw probabilities
probs = classifier_RF.predict_proba(X_test)

In [None]:
# generate evaluation metrics
print ("Baseline - Random Forest: ")
evaluate_metrics(y_test, predicted, probs)

Baseline - Random Forest: 
confusion_matrix is: 
        pred_0  pred_1
true_0   27389       0
true_1     142     145
roc_auc_score is:  0.7659237291402243
recall = 0.5052264808362369
precision = 1.0
f1_score is: : 0.6712962962962962


### Simple XGBoost model

In [None]:
classifier_xgb = XGBClassifier(random_state=0)
classifier_xgb.fit(X_train,y_train)
y_preds = classifier_xgb.predict(X_test)
# predicted = np.where(y_preds > 0.5, 1, 0)
probs = classifier_xgb.predict_proba(X_test)

In [None]:
# generate evaluation metrics
print ("Baseline - XGBoost: ")
evaluate_metrics(y_test, y_preds, probs)

Baseline - XGBoost: 
confusion_matrix is: 
        pred_0  pred_1
true_0   27389       0
true_1     142     145
roc_auc_score is:  0.7626967666639993
recall = 0.5052264808362369
precision = 1.0
f1_score is: : 0.6712962962962962


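We trained logistic regression, random forest, and XGBoost models with default hyperparameters. Random forest and XGBoost clearly outperform logistic regression: both reach a recall of about 0.51 and an f1 score of 0.67, versus 0.02 and 0.05 for logistic regression. Their precision of 1.0 means they raised no false alerts on the test set, although they still miss roughly half of the fraud cases.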

## Part 4.2: Resampling

We can try to see if resampling techniques for imbalanced problems can improve the performance of the model.

### Oversampling - SMOTE

In [None]:
# oversampling on only the training data
smote = SMOTE(random_state = 0)
x_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

collections.Counter(y_train_sm)

Counter({0: 109572, 1: 109572})

In [None]:
# RF on smoted training data
classifier_RF_sm = RandomForestClassifier(random_state=0)
classifier_RF_sm.fit(x_train_sm, y_train_sm)
predicted_RF_sm = classifier_RF_sm.predict(X_test)
probs_RF_sm = classifier_RF_sm.predict_proba(X_test)

# generate evaluation metrics
print ("Random Forest with SMOTE: ")
evaluate_metrics(y_test, predicted_RF_sm, probs_RF_sm)

Random Forest with SMOTE: 
confusion_matrix is: 
        pred_0  pred_1
true_0   27374      15
true_1     142     145
roc_auc_score is:  0.7608108522419859
recall = 0.5052264808362369
precision = 0.90625
f1_score is: : 0.6487695749440715


In [None]:
classifier_xgb_sm = XGBClassifier(random_state=0)
classifier_xgb_sm.fit(x_train_sm,y_train_sm)
predicted_XGB_sm = classifier_xgb_sm.predict(X_test)
probs_XGB_sm = classifier_xgb_sm.predict_proba(X_test)

# generate evaluation metrics
print ("XGBoost with SMOTE: ")
evaluate_metrics(y_test, predicted_XGB_sm, probs_XGB_sm)

XGBoost with SMOTE: 
confusion_matrix is: 
        pred_0  pred_1
true_0   15351   12038
true_1      81     206
roc_auc_score is:  0.7482831239123822
recall = 0.7177700348432056
precision = 0.016824567134923227
f1_score is: : 0.03287846141568909


From the above results, false positives (non-fraud cases classified as fraud) increase after SMOTE: from 0 to 15 for random forest and from 0 to 12,038 for XGBoost. A likely reason is that the synthetic fraud samples generated by SMOTE overlap with non-fraud cases in the feature space.
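
To see why synthetic samples can end up in regions occupied by the other class, it helps to look at the core step of SMOTE: a new point is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbors. The sketch below shows only that interpolation step on two made-up points; it is a simplification and not the `imblearn` implementation.

```python
import random

def smote_like_point(x, neighbor, rng):
    """Return a random point on the segment between x and one of its neighbors."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(0)
fraud_a = [1.0, 4.0]   # two hypothetical minority-class (fraud) samples
fraud_b = [3.0, 8.0]
synthetic = smote_like_point(fraud_a, fraud_b, rng)
print(synthetic)
```

If non-fraud transactions lie between the two fraud samples, the synthetic point lands among them, and a model trained on such points may learn to flag that region as fraud.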

### Undersampling

In [None]:
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=0)
# rus.fit(X_train, y_train)
x_train_us, y_train_us = rus.fit_resample(X_train, y_train)
collections.Counter(y_train_us)

Counter({0: 1128, 1: 1128})

In [None]:
classifier_RF_us = RandomForestClassifier(random_state=0)
classifier_RF_us.fit(x_train_us, y_train_us)
predicted_RF_us = classifier_RF_us.predict(X_test)
probs_RF_us = classifier_RF_us.predict_proba(X_test)

# generate evaluation metrics
print ("Random Forest with undersampling: ")
evaluate_metrics(y_test, predicted_RF_us, probs_RF_us)

Random Forest with undersampling: 
confusion_matrix is: 
        pred_0  pred_1
true_0   26015    1374
true_1     129     158
roc_auc_score is:  0.7651391241149103
recall = 0.5505226480836237
precision = 0.1031331592689295
f1_score is: : 0.1737218251786696


In [None]:
classifier_xgb_us = XGBClassifier(random_state=0)
classifier_xgb_us.fit(x_train_us,y_train_us)
predicted_XGB_us = classifier_xgb_us.predict(X_test)
probs_XGB_us = classifier_xgb_us.predict_proba(X_test)

# generate evaluation metrics
print ("XGBoost with undersampling: ")
evaluate_metrics(y_test, predicted_XGB_us, probs_XGB_us)

XGBoost with undersampling: 
confusion_matrix is: 
        pred_0  pred_1
true_0   23267    4122
true_1     117     170
roc_auc_score is:  0.7613274130373304
recall = 0.5923344947735192
precision = 0.03960857409133271
f1_score is: : 0.07425202009172309


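Observations:
- Resampling improved recall in three of the four runs (up to 0.72 for XGBoost with SMOTE), but it also increased the number of false positives dramatically.
- For random forest, SMOTE produces far fewer false positives than undersampling (15 vs. 1,374); for XGBoost the opposite holds (12,038 vs. 4,122).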


# Part 5: Parameter tuning by Grid Search

In [None]:
def grid_search_wrapper(models, model_names, parameters, refit_score='roc_auc'):
    """
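    Fits a GridSearchCV for each model, scoring with the global `scorers` dict and refitting
    the best configuration according to refit_score.
    Prints each model's confusion matrix on the test data and returns the fitted searches
    together with a DataFrame of test metrics.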
    """
    count = 0
    best_model = {}
    for model in models:
      grid_search = GridSearchCV(model, parameters[model_names[count]], scoring=scorers, refit=refit_score, cv=5, return_train_score=True)
      grid_search.fit(X_train, y_train)

      # make the predictions
      y_pred = grid_search.predict(X_test)
      y_prob = grid_search.predict_proba(X_test)[:, 1]

      # print('Best params for {}: '.format(refit_score))
      # print(grid_search.best_params_)
      best_model[model_names[count]] = grid_search

      # confusion matrix on the test data.
      print('Confusion matrix of {} optimized on the test data:'.format(model_names[count]))
      cm = confusion_matrix(y_test, y_pred)
      cmDF = pd.DataFrame(cm, columns=['pred_0', 'pred_1'], index=['true_0', 'true_1'])
      print(cmDF)

      f1 = round(f1_score(y_test, y_pred),4)
      prec = round(float(cm[1,1]) / (cm[1, 1] + cm[0,1]),4)
      rec = round(float(cm[1,1]) / (cm[1,0] + cm[1,1]), 4)
      auc = round(roc_auc_score(y_test, y_prob), 4)
      results = pd.DataFrame([[model_names[count], f1, prec, rec, auc]],
                          columns = ["Model", "f1", "precision", "recall", "roc_auc"])
      if count > 0:
        model_results = pd.concat([model_results, results], ignore_index=True)
      else:
        model_results = results
      count += 1
    return best_model, model_results



In [None]:
scorers = {
    'precision_score': make_scorer(precision_score),
    'recall_score': make_scorer(recall_score),
    'f1_score': make_scorer(f1_score, pos_label=1),
    'roc_auc': make_scorer(roc_auc_score, needs_threshold=True),
     }
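
Depending on the installed scikit-learn version, the `needs_threshold` argument of `make_scorer` may be deprecated or unavailable. A hedged alternative is to pass scikit-learn's predefined scoring names as strings, which `GridSearchCV` accepts as dictionary values. The dictionary name `scorers_by_name` is hypothetical; the keys are kept identical so that `refit='roc_auc'` and `refit='f1_score'` would still resolve.

```python
# Hypothetical drop-in alternative to the `scorers` dict,
# using scikit-learn's predefined scoring names instead of make_scorer.
scorers_by_name = {
    'precision_score': 'precision',
    'recall_score': 'recall',
    'f1_score': 'f1',
    'roc_auc': 'roc_auc',
}
```

The string scorers evaluate the positive class (label 1) for precision, recall, and f1, which matches the binary setup used here.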

### Optimizing based on roc_auc score on LR/RF/XGBoost

In [None]:
parameters = {'Logistic Regression (roc_auc)': {'C': [0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']}, # C: inverse of regularization strength, smaller values specify stronger regularization, l1 lasso l2 ridge
             'Random Forest (roc_auc)': {'max_depth': [None, 5, 15], 'n_estimators' :  [10,150], 'class_weight' : [{0: 1, 1: w} for w in [0.2, 1, 100]]},
             'XGBoost (roc_auc)': {'gamma': [0.5, 0.8, 1], 'max_depth': [1, 2, 3, 4],'n_estimators': [30, 40, 50, 60]
        }}

logRegModel = LogisticRegression(solver='liblinear', random_state=0)
ranForModel = RandomForestClassifier(random_state=0)
xgbModel = XGBClassifier(random_state=0)

models = [logRegModel, ranForModel, xgbModel]
model_names = ['Logistic Regression (roc_auc)','Random Forest (roc_auc)','XGBoost (roc_auc)']

best_model, model_results = grid_search_wrapper(models, model_names, parameters, refit_score='roc_auc')

Confusion matrix of Logistic Regression (roc_auc) optimized on the test data:
        pred_0  pred_1
true_0   27386       3
true_1     278       9
Confusion matrix of Random Forest (roc_auc) optimized on the test data:
        pred_0  pred_1
true_0   27389       0
true_1     142     145
Confusion matrix of XGBoost (roc_auc) optimized on the test data:
        pred_0  pred_1
true_0   27389       0
true_1     142     145


In [None]:
print(model_results)

                           Model      f1  precision  recall  roc_auc
0  Logistic Regression (roc_auc)  0.0602       0.75  0.0314   0.7404
1        Random Forest (roc_auc)  0.6713       1.00  0.5052   0.7716
2              XGBoost (roc_auc)  0.6713       1.00  0.5052   0.7664


In [None]:
model_performance = pd.DataFrame({'Model':["Baseline LR","Baseline RF","Baseline XGBoost","RF (SMOTE)","XGBoost (SMOTE)"],
                                  'f1':[0.0475,0.6713,0.6713,0.6488,0.0329],
                                  'precision':[0.8750,1.00,1.00,0.9063,0.0168],
                                  'recall':[0.0244,0.5052,0.5052,0.5052,0.7178],
                                  'roc_auc':[0.7063,0.7659,0.7627,0.7608,0.7483]})
pd.concat([model_performance, model_results], ignore_index=True)

Unnamed: 0,Model,f1,precision,recall,roc_auc
0,Baseline LR,0.0475,0.875,0.0244,0.7063
1,Baseline RF,0.6713,1.0,0.5052,0.7659
2,Baseline XGBoost,0.6713,1.0,0.5052,0.7627
3,RF (SMOTE),0.6488,0.9063,0.5052,0.7608
4,XGBoost (SMOTE),0.0329,0.0168,0.7178,0.7483
5,Logistic Regression (roc_auc),0.0602,0.75,0.0314,0.7404
6,Random Forest (roc_auc),0.6713,1.0,0.5052,0.7716
7,XGBoost (roc_auc),0.6713,1.0,0.5052,0.7664


In [None]:
best_LR_model_roc = best_model['Logistic Regression (roc_auc)'].best_estimator_
best_LR_model_roc

In [None]:
best_RF_model_roc = best_model['Random Forest (roc_auc)'].best_estimator_
best_RF_model_roc

In [None]:
best_XGB_model_roc = best_model['XGBoost (roc_auc)'].best_estimator_
best_XGB_model_roc

In [None]:
# feature importances of random forest
pd.DataFrame(best_RF_model_roc.feature_importances_, index = X_train.columns, columns=['importance']).sort_values('importance', ascending=False).head(5)


Unnamed: 0,importance
interval_after_signup,0.616815
n_ip_shared,0.168079
interval_after_signup_days_of_year,0.140667
n_dev_shared,0.070134
n_country_shared,0.001965


In [None]:
# feature importances of XGBoost
pd.DataFrame(best_XGB_model_roc.feature_importances_, index = X_train.columns, columns=['importance']).sort_values('importance', ascending=False).head(5)


Unnamed: 0,importance
interval_after_signup,0.454033
n_dev_shared,0.215644
source_Ads,0.066167
source_SEO,0.03391
browser_Chrome,0.032339


### Optimizing based on f1 score on LR/RF/XGBoost

In [None]:
parameters = {'Logistic Regression (f1)': {'C': [0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']}, # C: inverse of regularization strength, smaller values specify stronger regularization, l1 lasso l2 ridge
             'Random Forest (f1)': {'max_depth': [None, 5, 15], 'n_estimators' :  [10,150], 'class_weight' : [{0: 1, 1: w} for w in [0.2, 1, 100]]},
             'XGBoost (f1)': {'gamma': [0.5, 0.8, 1], 'max_depth': [1, 2, 3, 4],'n_estimators': [30, 40, 50, 60]
        }}

logRegModel = LogisticRegression(solver='liblinear', random_state=0)
ranForModel = RandomForestClassifier(random_state=0)
xgbModel = XGBClassifier(random_state=0)

models = [logRegModel, ranForModel, xgbModel]
model_names = ['Logistic Regression (f1)','Random Forest (f1)','XGBoost (f1)']
best_model_f1, model_results_f1 = grid_search_wrapper(models, model_names, parameters, refit_score='f1_score')

Confusion matrix of Logistic Regression (f1) optimized on the test data:
        pred_0  pred_1
true_0   27386       3
true_1     278       9
Confusion matrix of Random Forest (f1) optimized on the test data:
        pred_0  pred_1
true_0   27389       0
true_1     142     145
Confusion matrix of XGBoost (f1) optimized on the test data:
        pred_0  pred_1
true_0   27389       0
true_1     142     145


In [None]:
print(model_results_f1)

                      Model      f1  precision  recall  roc_auc
0  Logistic Regression (f1)  0.0602       0.75  0.0314   0.7374
1        Random Forest (f1)  0.6713       1.00  0.5052   0.7777
2              XGBoost (f1)  0.6713       1.00  0.5052   0.7862


In [None]:
# feature importances of random forest
pd.DataFrame(best_model_f1['Random Forest (f1)'].best_estimator_.feature_importances_, index = X_train.columns, columns=['importance']).sort_values('importance', ascending=False).head(5)


Unnamed: 0,importance
interval_after_signup,0.599194
interval_after_signup_days_of_year,0.175865
n_ip_shared,0.158314
n_dev_shared,0.057569
age,0.002717


In [None]:
# feature importances of XGBoost
pd.DataFrame(best_model_f1['XGBoost (f1)'].best_estimator_.feature_importances_, index = X_train.columns, columns=['importance']).sort_values('importance', ascending=False).head(5)


Unnamed: 0,importance
interval_after_signup,0.827834
n_dev_shared,0.161467
purchase_value,0.0107
sex,0.0
age,0.0


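Optimizing for the f1 score yields the same f1, recall, and precision on the test set as optimizing for roc_auc. The test roc_auc is slightly higher for random forest (0.7777 vs. 0.7716) and XGBoost (0.7862 vs. 0.7664), and slightly lower for logistic regression (0.7374 vs. 0.7404).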

Next, we look at how the fine-tuned models perform when SMOTE is used to handle the class imbalance.

### Optimizing based on f1 score on RF/XGBoost with SMOTE technique

In [None]:
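# RF on smoted training data
# Note: dict.copy() is shallow and each value is a fitted GridSearchCV object, so fit() below
# re-runs the whole grid search on the SMOTE data, and the same object is shared with best_model_f1.
best_model_f1_sm = best_model_f1.copy()
best_RF_f1_sm = best_model_f1_sm['Random Forest (f1)']
best_RF_f1_sm.fit(x_train_sm, y_train_sm)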
predicted_best_RF_f1_sm = best_RF_f1_sm.predict(X_test)
probs_best_RF_f1_sm = best_RF_f1_sm.predict_proba(X_test)

# generate evaluation metrics
print ("Fine-tuned Random Forest with SMOTE: ")
evaluate_metrics(y_test, predicted_best_RF_f1_sm, probs_best_RF_f1_sm)

# XGBoost on smoted training data
best_xgb_f1_sm = best_model_f1_sm['XGBoost (f1)']
best_xgb_f1_sm.fit(x_train_sm,y_train_sm)
predicted_best_XGB_f1_sm = best_xgb_f1_sm.predict(X_test)
probs_best_XGB_f1_sm = best_xgb_f1_sm.predict_proba(X_test)

# generate evaluation metrics
print ("Fine-tuned XGBoost with SMOTE: ")
evaluate_metrics(y_test, predicted_best_XGB_f1_sm, probs_best_XGB_f1_sm)

Fine-tuned Random Forest with SMOTE: 
confusion_matrix is: 
        pred_0  pred_1
true_0   27371      18
true_1     142     145
roc_auc_score is:  0.7553431697635931
recall = 0.5052264808362369
precision = 0.8895705521472392
f1_score is: : 0.6444444444444445
Fine-tuned XGBoost with SMOTE: 
confusion_matrix is: 
        pred_0  pred_1
true_0   21830    5559
true_1     117     170
roc_auc_score is:  0.752260851943028
recall = 0.5923344947735192
precision = 0.02967359050445104
f1_score is: : 0.05651595744680851


Resampling with SMOTE balanced the training labels, but it did not improve results on the test set. Random Forest kept the same recall (0.505) while its precision fell from 1.00 to 0.89. XGBoost gained some recall (0.592) but produced 5,559 false positives, which pushed its precision down to about 0.03. In practice, misclassifying this many legitimate transactions as fraud would mean cancelling valid orders, which would lower customer satisfaction and increase complaints. Therefore, I decided not to use resampling for future modeling.
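To make the trade-off concrete, the metrics printed above can be recomputed directly from the four cells of each confusion matrix. This is a minimal sketch; `metrics_from_confusion` is a hypothetical helper that is not part of the notebook, and the counts are copied from the SMOTE outputs above.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Return (precision, recall, f1) from the four confusion-matrix cells.

    Hypothetical helper for illustration; not defined in the notebook.
    """
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1


# Cells copied from the two SMOTE confusion matrices above.
rf_smote = metrics_from_confusion(tn=27371, fp=18, fn=142, tp=145)
xgb_smote = metrics_from_confusion(tn=21830, fp=5559, fn=117, tp=170)

for name, (prec, rec, f1) in [("RF + SMOTE", rf_smote), ("XGBoost + SMOTE", xgb_smote)]:
    print(f"{name}: precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```

With 5,559 false positives against 170 true positives, only about 3 in every 100 transactions flagged by the SMOTE-trained XGBoost model are actually fraudulent.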

# Part 6: Model Selection

### Optimal Threshold Tuning

We can define a set of thresholds and then evaluate predicted probabilities under each in order to find and select the optimal threshold.

In [None]:
def thres_results(model, thresholds):
    """Evaluate a fitted classifier on the test data at each probability threshold."""
    # predicted probability of the positive (fraud) class, computed once
    y_probs = model.predict_proba(X_test)[:, 1]
    # AUC depends only on the probabilities, not on the threshold
    auc = round(roc_auc_score(y_test, y_probs), 4)

    rows = []
    for thres in thresholds:
        # label a transaction as fraud when its probability exceeds the threshold
        y_thres_preds = np.where(y_probs > thres, 1, 0)

        # confusion matrix on the test data
        cm = confusion_matrix(y_test, y_thres_preds)
        tp, fp, fn = cm[1, 1], cm[0, 1], cm[1, 0]

        f1 = round(f1_score(y_test, y_thres_preds), 4)
        # precision is undefined (NaN) when nothing is predicted as fraud
        prec = round(tp / (tp + fp), 4) if (tp + fp) > 0 else np.nan
        rec = round(tp / (tp + fn), 4)
        rows.append([thres, f1, prec, rec, auc])

    return pd.DataFrame(rows, columns=["threshold", "f1", "precision", "recall", "roc_auc"])


In [None]:
thres_results(best_model_f1['Logistic Regression (f1)'], np.linspace(0.1, 0.9, num=9))

Unnamed: 0,threshold,f1,precision,recall,roc_auc
0,0.1,0.2391,0.7321,0.1429,0.7374
1,0.2,0.241,0.8889,0.1394,0.7374
2,0.3,0.0602,0.75,0.0314,0.7374
3,0.4,0.0602,0.75,0.0314,0.7374
4,0.5,0.0602,0.75,0.0314,0.7374
5,0.6,0.0602,0.75,0.0314,0.7374
6,0.7,0.0602,0.75,0.0314,0.7374
7,0.8,0.0606,0.9,0.0314,0.7374
8,0.9,0.0,,0.0,0.7374


In [None]:
thres_results(best_model_f1['Random Forest (f1)'], np.linspace(0.1, 0.9, num=9))

Unnamed: 0,threshold,f1,precision,recall,roc_auc
0,0.1,0.6713,1.0,0.5052,0.7777
1,0.2,0.6713,1.0,0.5052,0.7777
2,0.3,0.6713,1.0,0.5052,0.7777
3,0.4,0.6713,1.0,0.5052,0.7777
4,0.5,0.6713,1.0,0.5052,0.7777
5,0.6,0.6713,1.0,0.5052,0.7777
6,0.7,0.6713,1.0,0.5052,0.7777
7,0.8,0.3018,1.0,0.1777,0.7777
8,0.9,0.25,1.0,0.1429,0.7777


In [None]:
thres_results(best_model_f1['XGBoost (f1)'], np.linspace(0.1, 0.9, num=9))

Unnamed: 0,threshold,f1,precision,recall,roc_auc
0,0.1,0.6713,1.0,0.5052,0.7862
1,0.2,0.6713,1.0,0.5052,0.7862
2,0.3,0.6713,1.0,0.5052,0.7862
3,0.4,0.6713,1.0,0.5052,0.7862
4,0.5,0.6713,1.0,0.5052,0.7862
5,0.6,0.6713,1.0,0.5052,0.7862
6,0.7,0.6713,1.0,0.5052,0.7862
7,0.8,0.6713,1.0,0.5052,0.7862
8,0.9,0.6713,1.0,0.5052,0.7862


Observations:

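- Changing the threshold does not change AUC, because AUC summarizes the ROC curve over all thresholds and is computed from the predicted probabilities rather than from the thresholded labels.

- For the logistic regression model, recall falls as the threshold increases (from 0.1429 at 0.1 to 0.0314 at 0.3 and 0 at 0.9). Precision (the share of predicted frauds that are actually fraud) fluctuates between 0.73 and 0.90 and is undefined at 0.9, where no transaction is predicted as fraud.

- For the random forest model, recall is unchanged up to a threshold of 0.7 and then drops (0.1777 at 0.8, 0.1429 at 0.9), while precision stays at 1.0.

- For XGBoost, all evaluation scores stay the same across the tested thresholds. This suggests that its predicted probabilities are concentrated near 0 and 1, so no threshold between 0.1 and 0.9 changes the predicted labels.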
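The flat XGBoost rows are what one would expect when a model's probabilities sit close to 0 or 1. The sketch below illustrates this on made-up data; `f1_at_thresholds` and both probability lists are hypothetical and are not taken from the notebook.

```python
import numpy as np


def f1_at_thresholds(y_true, probs, thresholds):
    """F1 of the rule `prob > t` for each threshold t (toy helper, not from the notebook)."""
    y_true = np.asarray(y_true)
    probs = np.asarray(probs)
    scores = []
    for t in thresholds:
        preds = probs > t
        tp = np.sum(preds & (y_true == 1))
        fp = np.sum(preds & (y_true == 0))
        fn = np.sum(~preds & (y_true == 1))
        denom = 2 * tp + fp + fn
        scores.append(float(2 * tp / denom) if denom else 0.0)
    return scores


# Made-up labels and probabilities, for illustration only.
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]
polarized = [0.01, 0.02, 0.03, 0.02, 0.04, 0.03, 0.97, 0.98]  # every score is near 0 or 1
spread = [0.05, 0.15, 0.25, 0.35, 0.30, 0.55, 0.75, 0.95]     # scores cover the whole range

thresholds = np.linspace(0.1, 0.9, num=9)
f1_polarized = f1_at_thresholds(y_toy, polarized, thresholds)
f1_spread = f1_at_thresholds(y_toy, spread, thresholds)

print("polarized:", [round(s, 4) for s in f1_polarized])
print("spread:   ", [round(s, 4) for s in f1_spread])
```

In the polarized case every threshold yields the same predictions, so F1 is constant. In the spread case the predictions, and therefore F1, change with the threshold.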

In [None]:
pd.concat([model_performance, model_results, model_results_f1], ignore_index=True)

Unnamed: 0,Model,f1,precision,recall,roc_auc
0,Baseline LR,0.0475,0.875,0.0244,0.7063
1,Baseline RF,0.6713,1.0,0.5052,0.7659
2,Baseline XGBoost,0.6713,1.0,0.5052,0.7659
3,RF (SMOTE),0.6488,0.9063,0.5052,0.7608
4,XGBoost (SMOTE),0.0329,0.0168,0.7178,0.7482
5,Logistic Regression (roc_auc),0.0602,0.75,0.0314,0.7404
6,Random Forest (roc_auc),0.6713,1.0,0.5052,0.7716
7,XGBoost (roc_auc),0.6713,1.0,0.5052,0.7664
8,Logistic Regression (f1),0.0602,0.75,0.0314,0.7374
9,Random Forest (f1),0.6713,1.0,0.5052,0.7777


- The results show minimal difference between the Random Forest and XGBoost models: both reach the same F1, precision, and recall, and their roc_auc values differ by less than 0.01.

- XGBoost is often faster and more computationally efficient, and here it performed well on the imbalanced data without any resampling. Since real-time or near-real-time fraud detection is a priority, I decided to use XGBoost as the final model in this project.

# Part 7: Fraud Characteristics

Based on the feature importances above, we can observe:

- `interval_after_signup` and the number of shared devices (`n_dev_shared`) dominate the predictive performance of the tuned models.
- The number of shared IP addresses (`n_ip_shared`), the browser used, and the source (the user's marketing channel) also influence the classification.

We will take a deeper dive into those features.



In [None]:
# action velocity: mean interval between signup and purchase, in hours
fraud_data.groupby("class")[['interval_after_signup']].mean()

class  interval_after_signup
0                1441.994052
1                 713.951802


In [None]:
# median interval, converted from hours to seconds
fraud_data.groupby("class")[['interval_after_signup']].median()*3600

class  interval_after_signup
0                  5194911.0
1                        1.0


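Fraudulent transactions occur much sooner after signup than legitimate ones: the mean interval is about 714 hours versus 1,442 hours. The median is even more telling. At least half of the fraudulent purchases were made about 1 second after signup, compared with a median of roughly 1,443 hours (5,194,911 seconds) for legitimate purchases. Such speed is abnormal for a human user and suggests bot operations.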
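A simple velocity rule follows from this observation. The sketch below derives the interval in hours from two timestamps and flags near-instant purchases. The column names `signup_time` and `purchase_time`, the toy rows, and the 5-second cut-off are assumptions made for illustration.

```python
import pandas as pd

# Toy rows; `signup_time` and `purchase_time` are assumed column names.
toy = pd.DataFrame({
    "signup_time": pd.to_datetime(
        ["2015-01-01 10:00:00", "2015-01-01 10:00:00", "2015-02-01 08:30:00"]
    ),
    "purchase_time": pd.to_datetime(
        ["2015-01-01 10:00:01", "2015-03-05 14:20:00", "2015-02-01 08:30:01"]
    ),
})

# Interval between signup and first purchase, in hours (the unit used above).
delta = toy["purchase_time"] - toy["signup_time"]
toy["interval_after_signup"] = delta.dt.total_seconds() / 3600

# Hypothetical rule: flag purchases made within 5 seconds of signup.
MAX_SECONDS = 5
toy["instant_purchase"] = delta.dt.total_seconds() <= MAX_SECONDS

print(toy[["interval_after_signup", "instant_purchase"]])
```

A one-second interval corresponds to 1/3600 ≈ 0.000278 hours, which is the value that appears repeatedly in the fraud rows of the dataset.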

In [None]:
trainDF = pd.concat([X_train, y_train], axis=1)
pd.crosstab(trainDF["n_dev_shared"],trainDF["class"], normalize='index')

n_dev_shared   class=0   class=1
0.0           0.995627  0.004373
0.2           0.922287  0.077713
0.4           0.469136  0.530864
0.6           0.298387  0.701613
0.8           0.288889  0.711111
1.0           0.166667  0.833333


In [None]:
pd.crosstab(trainDF["n_ip_shared"],trainDF["class"], normalize='index')

n_ip_shared   class=0   class=1
0.0          0.994503  0.005497
0.2          0.426887  0.573113
0.4          0.313492  0.686508
0.6          0.240385  0.759615
0.8          0.200000  0.800000
1.0          0.166667  0.833333


The more a device or a (shared/public) IP address is shared, the higher the fraud rate: it rises from under 1% at the lowest value of each feature to about 83% at the highest.
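Sharing features of this kind can be derived with a grouped count. The following is a minimal sketch on toy data, assuming a `user_id` column and assuming the feature counts distinct users per device or IP address. The notebook's own definition and its scaling to the [0, 1] range are not shown in this section and may differ.

```python
import pandas as pd

# Toy data; `user_id` is an assumed column name.
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "device_id": ["A", "A", "A", "B", "C"],
    "ip_address": [10.0, 10.0, 11.0, 12.0, 12.0],
})

# Number of distinct users seen on each device and on each IP address.
toy["n_dev_shared"] = toy.groupby("device_id")["user_id"].transform("nunique")
toy["n_ip_shared"] = toy.groupby("ip_address")["user_id"].transform("nunique")

print(toy)
```

Using `transform` keeps one value per original row, so the counts can be attached to the transaction table without a separate merge.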

In [None]:
pd.crosstab([trainDF["source_SEO"], trainDF["source_Ads"], trainDF["source_Direct"]], trainDF["class"], normalize='columns')

source_SEO  source_Ads  source_Direct   class=0   class=1
0           0           1              0.200316  0.215426
0           1           0              0.396716  0.399823
1           0           0              0.402968  0.384752


In [None]:
pd.crosstab([trainDF["browser_Chrome"], trainDF["browser_FireFox"], trainDF["browser_Opera"], trainDF["browser_IE"], trainDF["browser_Safari"]], trainDF["class"], normalize='columns')

browser_Chrome  browser_FireFox  browser_Opera  browser_IE  browser_Safari   class=0   class=1
0               0                0              0           1               0.164513  0.160461
0               0                0              1           0               0.244670  0.213652
0               0                1              0           0               0.024441  0.018617
0               1                0              0           0               0.163427  0.161348
1               0                0              0           0               0.402950  0.445922


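Within each class, the mix of sources and browsers is similar, so these features separate the classes only weakly. Direct traffic (21.5% of fraud vs. 20.0% of non-fraud) and Google Chrome (44.6% vs. 40.3%) are slightly over-represented among fraudulent transactions, while SEO (38.5% vs. 40.3%) and IE (21.4% vs. 24.5%) are slightly under-represented. Ads traffic is about equally common in both classes (40.0% vs. 39.7%).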
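The crosstabs above use `normalize='columns'`, which gives the share of each source or browser within a class. To read a fraud rate per source, `normalize='index'` is the option to use. A minimal sketch on made-up data (the `source` values and labels below are not from the dataset):

```python
import pandas as pd

# Made-up data: 4 SEO users, 4 Ads users, 2 Direct users.
toy = pd.DataFrame({
    "source": ["SEO"] * 4 + ["Ads"] * 4 + ["Direct"] * 2,
    "class": [0, 0, 0, 1, 0, 0, 0, 1, 0, 1],
})

# Each column sums to 1: how a class is distributed over sources.
share_within_class = pd.crosstab(toy["source"], toy["class"], normalize="columns")

# Each row sums to 1: the fraud rate of a source is the value in column 1.
fraud_rate_by_source = pd.crosstab(toy["source"], toy["class"], normalize="index")

print(share_within_class)
print(fraud_rate_by_source)
```

In this toy data every source contributes one third of the fraud cases, yet Direct has twice the fraud rate of the other two sources, which is why the two normalizations answer different questions.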

In [None]:
fraud_data[fraud_data['class'] == 1].head(20)

Unnamed: 0,purchase_value,device_id,source,browser,sex,age,ip_address,class,country,interval_after_signup,interval_after_signup_days_of_year
136961,24,VLHGCDPFCICDA,SEO,Chrome,F,33,3432126000.0,1,United States,924.431111,39
136962,14,YLUQSRNYYIPXU,Ads,Chrome,M,40,3905319000.0,1,,0.000278,0
136963,63,ABUBCQDATQMQH,Ads,FireFox,F,46,550567000.0,1,United States,2122.241667,88
136964,34,QHEODGCAVJKIQ,SEO,Chrome,M,37,940809600.0,1,United States,0.000278,0
136965,76,DAKVYHKIEYRBH,SEO,Chrome,F,48,636104100.0,1,Hungary,0.000278,0
136966,32,ESANFBTIVMNHX,Ads,IE,M,30,3875475000.0,1,,1589.236667,66
136967,95,HIAMXITLJWYCT,SEO,FireFox,M,42,3786924000.0,1,,2822.252222,117
136968,13,BQTPLJBGYXQYX,Ads,IE,M,32,2463262000.0,1,Austria,0.000278,0
136969,15,BWSMVSLCJXMCM,Direct,IE,F,39,2937899000.0,1,Japan,0.000278,0
136970,26,HPPSDIRGUSSTB,Direct,Opera,M,31,647126100.0,1,United States,0.000278,0


**Fraudulent patterns** include customers with the following characteristics, who may be flagged as "at risk" for potential fraud:

- Make a purchase immediately after signing up, or show an unusually rapid pace of transactions.

- Use a shared IP address or public network for transactions.

- Make more than 2 transactions from the same device using multiple accounts.

- Reach the site directly or use Google Chrome. Both are slightly over-represented among fraudulent transactions, but the difference from legitimate traffic is small, so this is a weak signal on its own. Traffic from advertisements and search engine optimization (the process of improving the website to increase its visibility in search engines) is not over-represented among fraud.


# Part 8: How to Use the Predictions in a Fraud Detection System

In [None]:
# compare how RF and XGBoost distribute their predicted fraud probabilities, binned as int(10 * p)
probs_RF_f1 = best_model_f1['Random Forest (f1)'].predict_proba(X_test)[:, 1]
print('Random Forest')
print(np.array(sorted(collections.Counter((10 * probs_RF_f1).astype(int)).items())))

probs_XGB_f1 = best_model_f1['XGBoost (f1)'].predict_proba(X_test)[:, 1]
print('XGBoost')
print(np.array(sorted(collections.Counter((10 * probs_XGB_f1).astype(int)).items())))

Random Forest
[[    0 27531]
 [    7    94]
 [    8    10]
 [    9    41]]
XGBoost
[[    0 27531]
 [    9   145]]


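The probability estimates of the random forest and XGBoost models can be transformed into a score, here `int(10 * probability)`, to facilitate automated fraud detection for customer orders. The score ranges from 0 to 9 and reaches 10 only when the predicted probability is exactly 1. Here's how the scoring system could work:

- For a customer with a score ranging from 0 to 7, the transaction can pass through to the shipping process without any manual intervention. These customers are considered low-risk for fraud.

- For a customer with a score of 8, the transaction should undergo manual investigation before proceeding to the shipping process. This score suggests a moderate level of risk.

- For a customer with a score from 9 to 10, the transaction should be declined outright. These scores indicate a high likelihood of fraudulent behavior, and it is best to cancel such orders to prevent potential losses.

These cut-offs should be checked against each model's score distribution. In the output above, the random forest places 94 test transactions at score 7, and the threshold table shows that every test transaction with a probability above 0.7 was fraudulent, so for that model the review band would need to start at 7.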
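The scoring rules above can be sketched as a small routing function. `fraud_score` and `route_transaction` are hypothetical names, and the cut-offs are exposed as parameters because they are a policy choice rather than something the model determines.

```python
def fraud_score(prob):
    """Map a fraud probability in [0, 1] to an integer score, int(10 * prob)."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must be in [0, 1]")
    return int(10 * prob)


def route_transaction(prob, review_from=8, decline_from=9):
    """Return the action for a transaction (hypothetical helper, default cut-offs as above)."""
    score = fraud_score(prob)
    if score >= decline_from:
        return "decline"
    if score >= review_from:
        return "manual_review"
    return "approve"


# Example probabilities, for illustration only.
for p in (0.05, 0.85, 0.95):
    print(p, fraud_score(p), route_transaction(p))
```

Lowering `review_from` sends more transactions to manual review, which trades analyst time for fewer missed frauds.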