# Identifying Fraudulent Activities

Company XYZ is an e-commerce site that sells hand-made clothes.
You have to build a model that predicts whether a user has a high probability of using the site to
perform some illegal activity. This is a very common task for data scientists.
You only have information about the user's first transaction on the site, and based on that you
have to make your classification ("fraud/no fraud").
These are the tasks you are asked to do:

- For each user, __determine her country__ based on the numeric IP address.

- Build a model to predict whether an activity is fraudulent or not. Explain how different
assumptions about the cost of false positives vs false negatives would impact the model.

- Your boss is a bit worried about using a model she doesn't understand for something as
important as fraud detection. How would you explain to her how the model makes its
predictions? Not from a mathematical perspective (she couldn't care less about that), but
from a user perspective. What kinds of users are more likely to be classified as at risk?
What are their characteristics?

- Let's say you now have this model which can be used live to predict in real time if an
activity is fraudulent or not. From a product perspective, how would you use it? That is,
what kind of different user experiences would you build based on the model output?

## Import Libraries and Data

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve, classification_report

In [22]:
from sklearn.tree import DecisionTreeClassifier,export_graphviz
import xgboost as xgb
plt.style.use('ggplot')

In [25]:
from sklearn.model_selection import train_test_split 
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

In [47]:
data = pd.read_csv('Fraud_Data.csv')

In [48]:
data.head()

Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,class
0,22058,2015-02-24 22:55:49,2015-04-18 02:47:11,34,QVPSPJUOCKZAR,SEO,Chrome,M,39,732758400.0,0
1,333320,2015-06-07 20:39:50,2015-06-08 01:38:54,16,EOGFQPIZPYXFZ,Ads,Chrome,F,53,350311400.0,0
2,1359,2015-01-01 18:52:44,2015-01-01 18:52:45,15,YSSKYOSJHPPLJ,SEO,Opera,M,53,2621474000.0,1
3,150084,2015-04-28 21:13:25,2015-05-04 13:54:50,44,ATGTXKYKUDUQN,SEO,Safari,M,41,3840542000.0,0
4,221365,2015-07-21 07:09:52,2015-09-09 18:40:53,39,NAUITBZFJKHWW,Ads,Safari,M,45,415583100.0,0


In [49]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 151112 entries, 0 to 151111
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   user_id         151112 non-null  int64  
 1   signup_time     151112 non-null  object 
 2   purchase_time   151112 non-null  object 
 3   purchase_value  151112 non-null  int64  
 4   device_id       151112 non-null  object 
 5   source          151112 non-null  object 
 6   browser         151112 non-null  object 
 7   sex             151112 non-null  object 
 8   age             151112 non-null  int64  
 9   ip_address      151112 non-null  float64
 10  class           151112 non-null  int64  
dtypes: float64(1), int64(4), object(6)
memory usage: 12.7+ MB


In [50]:
data.describe()

Unnamed: 0,user_id,purchase_value,age,ip_address,class
count,151112.0,151112.0,151112.0,151112.0,151112.0
mean,200171.04097,36.935372,33.140704,2152145000.0,0.093646
std,115369.285024,18.322762,8.617733,1248497000.0,0.291336
min,2.0,9.0,18.0,52093.5,0.0
25%,100642.5,22.0,27.0,1085934000.0,0.0
50%,199958.0,35.0,33.0,2154770000.0,0.0
75%,300054.0,49.0,39.0,3243258000.0,0.0
max,400000.0,154.0,76.0,4294850000.0,1.0


In [51]:
country = pd.read_csv('IpAddress_to_Country.csv')
country.head()

Unnamed: 0,lower_bound_ip_address,upper_bound_ip_address,country
0,16777216.0,16777471,Australia
1,16777472.0,16777727,China
2,16777728.0,16778239,China
3,16778240.0,16779263,Australia
4,16779264.0,16781311,China


In [52]:
country.describe()

Unnamed: 0,lower_bound_ip_address,upper_bound_ip_address
count,138846.0,138846.0
mean,2724532000.0,2724557000.0
std,897521500.0,897497900.0
min,16777220.0,16777470.0
25%,1919930000.0,1920008000.0
50%,3230887000.0,3230888000.0
75%,3350465000.0,3350466000.0
max,3758096000.0,3758096000.0


In [53]:
country.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138846 entries, 0 to 138845
Data columns (total 3 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   lower_bound_ip_address  138846 non-null  float64
 1   upper_bound_ip_address  138846 non-null  int64  
 2   country                 138846 non-null  object 
dtypes: float64(1), int64(1), object(1)
memory usage: 3.2+ MB


## Question 1:

In [54]:
country_columns = []

for i in range(len(data)):
    ip_address = data.loc[i, 'ip_address']
    tmp = country[(country['lower_bound_ip_address'] <= ip_address) & (country['upper_bound_ip_address'] >= ip_address)]
    
    if len(tmp) == 1:
        country_columns.append(tmp['country'].values[0])
    else:
        country_columns.append('NA')

data['country'] = country_columns

data.head()

Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,class,country
0,22058,2015-02-24 22:55:49,2015-04-18 02:47:11,34,QVPSPJUOCKZAR,SEO,Chrome,M,39,732758400.0,0,Japan
1,333320,2015-06-07 20:39:50,2015-06-08 01:38:54,16,EOGFQPIZPYXFZ,Ads,Chrome,F,53,350311400.0,0,United States
2,1359,2015-01-01 18:52:44,2015-01-01 18:52:45,15,YSSKYOSJHPPLJ,SEO,Opera,M,53,2621474000.0,1,United States
3,150084,2015-04-28 21:13:25,2015-05-04 13:54:50,44,ATGTXKYKUDUQN,SEO,Safari,M,41,3840542000.0,0,
4,221365,2015-07-21 07:09:52,2015-09-09 18:40:53,39,NAUITBZFJKHWW,Ads,Safari,M,45,415583100.0,0,United States
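
The row-by-row loop above scans the whole range table once per user, which is slow for 151,112 users and 138,846 ranges. A faster alternative can be sketched with a binary search over the sorted lower bounds. `lookup_country` is a hypothetical helper, not part of the original notebook, and it assumes the IP ranges do not overlap (the loop above instead returns 'NA' whenever a lookup does not match exactly one range).

```python
import numpy as np
import pandas as pd

def lookup_country(ip_addresses, ranges):
    """Map numeric IPs to country names; assumes the ranges do not overlap."""
    ranges = ranges.sort_values('lower_bound_ip_address').reset_index(drop=True)
    lower = ranges['lower_bound_ip_address'].to_numpy()
    upper = ranges['upper_bound_ip_address'].to_numpy()
    names = ranges['country'].to_numpy(dtype=object)

    ips = np.asarray(ip_addresses, dtype=float)
    # index of the last range whose lower bound is <= ip (-1 if there is none)
    idx = np.searchsorted(lower, ips, side='right') - 1
    safe_idx = np.clip(idx, 0, len(lower) - 1)
    # the candidate range only matches if the ip is also below its upper bound
    inside = (idx >= 0) & (ips <= upper[safe_idx])
    return np.where(inside, names[safe_idx], 'NA')
```

Under those assumptions it could be used as `data['country'] = lookup_country(data['ip_address'], country)`.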


## Question 2:

Features worth engineering before building the model:

- Time difference between sign-up time and purchase time.
- Whether the device id is unique or shared by several users (many different user ids using
the same device could be an indicator of fake accounts).
- The same for the IP address: many different users sharing one IP address could be an indicator of
fake accounts.
- Week of the year and day of the week, derived from the time variables.
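
The calendar features in the last bullet are not computed later in this notebook. A minimal sketch of how they could be derived is given here; `add_calendar_features` and the output column names `purchase_week` and `purchase_dow` are hypothetical, and the helper has to run before the time columns are dropped.

```python
import pandas as pd

def add_calendar_features(df, time_col='purchase_time'):
    """Return a copy of df with ISO week of year and day of week (Monday=0)."""
    out = df.copy()
    ts = pd.to_datetime(out[time_col])
    out['purchase_week'] = ts.dt.isocalendar().week.astype(int)
    out['purchase_dow'] = ts.dt.dayofweek
    return out
```

The same function can be called with `time_col='signup_time'` if the output columns are renamed accordingly.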

## Feature Engineering

In [55]:
data['signup_time'] = pd.to_datetime(data.signup_time)
data['purchase_time'] = pd.to_datetime(data.purchase_time)

# a purchase made immediately after signup is very suspicious
data['interval_after_signup'] = (data.purchase_time - data.signup_time).dt.total_seconds()

data.drop(["signup_time", "purchase_time"], axis=1, inplace=True)

In [56]:
n_dev_shared = data.device_id.value_counts()

# we only see each user's first transaction,
# so the more users share a device, the more suspicious it is
data['n_dev_shared'] = data.device_id.map(n_dev_shared)
del data['device_id']

In [42]:
data.head()

Unnamed: 0_level_0,purchase_value,source,browser,sex,age,ip_address,class,interval_after_signup,n_dev_shared
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
22058,34,SEO,Chrome,M,39,732758400.0,0,4506682.0,1
333320,16,Ads,Chrome,F,53,350311400.0,0,17944.0,1
1359,15,SEO,Opera,M,53,2621474000.0,1,1.0,12
150084,44,SEO,Safari,M,41,3840542000.0,0,492085.0,1
221365,39,Ads,Safari,M,45,415583100.0,0,4361461.0,1


In [57]:
# how many times an IP address is shared
n_ip_shared = data.ip_address.value_counts()

# we only see each user's first transaction,
# so the more users share an IP, the more suspicious it is
data['n_ip_shared'] = data.ip_address.map(n_ip_shared)
del data['ip_address']

In [58]:
# how many users are from the same country
n_country_shared = data.country.value_counts()

# the fewer visits from a country, the more suspicious
data['n_country_shared'] = data.country.map(n_country_shared)
del data['country']

In [59]:
data.head()

Unnamed: 0,user_id,purchase_value,source,browser,sex,age,class,interval_after_signup,n_dev_shared,n_ip_shared,n_country_shared
0,22058,34,SEO,Chrome,M,39,0,4506682.0,1,1,7306
1,333320,16,Ads,Chrome,F,53,0,17944.0,1,1,58049
2,1359,15,SEO,Opera,M,53,1,1.0,12,12,58049
3,150084,44,SEO,Safari,M,41,0,492085.0,1,1,21966
4,221365,39,Ads,Safari,M,45,0,4361461.0,1,1,58049


In [67]:
data['Male_or_not'] = (data.sex == 'M').astype(int) #Make it 1/0

In [74]:
del data['sex']

In [79]:
data.head()

Unnamed: 0,user_id,purchase_value,age,class,interval_after_signup,n_dev_shared,n_ip_shared,n_country_shared,Male_or_not,source_Ads,source_SEO,browser_Chrome,browser_FireFox,browser_IE,browser_Safari
0,22058,34,39,0,4506682.0,1,1,7306,1,0,1,1,0,0,0
1,333320,16,53,0,17944.0,1,1,58049,0,1,0,1,0,0,0
2,1359,15,53,1,1.0,12,12,58049,1,0,1,0,0,0,0
3,150084,44,41,0,492085.0,1,1,21966,1,0,1,0,0,0,1
4,221365,39,45,0,4361461.0,1,1,58049,1,1,0,0,0,0,1


In [80]:
data.source.value_counts()

AttributeError: 'DataFrame' object has no attribute 'source'

In [71]:
data.browser.value_counts()

Chrome     61432
IE         36727
Safari     24667
FireFox    24610
Opera       3676
Name: browser, dtype: int64

In [77]:
# one-hot encode the categorical columns (run this cell only once:
# after encoding, 'source' and 'browser' no longer exist)
data = pd.get_dummies(data,columns=['source','browser'])

# drop one level per feature to serve as the baseline
del data['source_Direct']
del data['browser_Opera']
data.head()

In [None]:
data.rename(columns={'class':'is_fraud'},inplace=True)# 'class' is a reserved keyword

In [None]:
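# use user_id as the index so it is not treated as a feature, then save the cleaned data
datas = data.set_index("user_id")
datas.to_csv("fraud_cleaned.csv")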

## Train the model

In [None]:
seed = 999
X = datas.loc[:,datas.columns != 'is_fraud']
y = datas.is_fraud

# split into training dataset and test dataset
Xtrain,Xtest,ytrain,ytest = train_test_split(X,y,test_size=0.3,random_state=seed)
train_matrix = xgb.DMatrix(Xtrain,ytrain)
test_matrix = xgb.DMatrix(Xtest)

Use cross-validation to find the best number of trees.

In [None]:
params = {}
params['verbosity'] = 0  # replaces the deprecated 'silent' flag
params['objective'] = 'binary:logistic'  # output probabilities
params['eval_metric'] = 'auc'
# params['min_child_weight'] = 2
params['max_depth'] = 6
params['eta'] = 0.1
params["subsample"] = 0.8
params["colsample_bytree"] = 0.8

# these are arguments of xgb.cv, not booster parameters, so keep them out of params
num_rounds = 300
early_stopping_rounds = 30

cv_results = xgb.cv(params,train_matrix,
                    num_boost_round = num_rounds,
                    nfold = 5,
                    metrics = params['eval_metric'],
                    early_stopping_rounds = early_stopping_rounds,
                    verbose_eval = True,
                    seed = seed)

In [None]:
cv_results

In [None]:
n_best_trees = cv_results.shape[0]
n_best_trees

In [None]:
# retrain on the whole training dataset
watchlist = [(train_matrix, 'train')]
gbt = xgb.train(params, train_matrix, n_best_trees,watchlist)

## Plot ROC and choose threshold

In [None]:
def plot_validation_roc():
    """
    Compute the ROC points on a held-out validation set.

    We cannot use the training set (the model has already seen it) or the test set
    (choosing a cutoff on it would bias the final evaluation), so the training
    dataset is split again into a training set and a validation set. The model is
    retrained on the training set and the ROC is computed on the validation set
    to choose a proper cutoff value.

    Wrapping this in a function keeps the temporary names out of the global namespace.
    """
    Xtrain_only,Xvalid,ytrain_only,yvalid = train_test_split(Xtrain,ytrain,test_size=0.3,random_state=seed)
    onlytrain_matrix = xgb.DMatrix(Xtrain_only,ytrain_only)
    valid_matrix = xgb.DMatrix(Xvalid,yvalid)

    temp_gbt = xgb.train(params, onlytrain_matrix, n_best_trees,[(onlytrain_matrix,'train_only'),(valid_matrix,'validate')])
    # the booster holds exactly n_best_trees trees, so a plain predict uses all of them
    # (the ntree_limit argument was removed in xgboost 2.0)
    yvalid_proba_pred = temp_gbt.predict(valid_matrix)

    fpr,tpr,thresholds = roc_curve(yvalid,yvalid_proba_pred)
    return pd.DataFrame({'FPR':fpr,'TPR':tpr,'Threshold':thresholds})

In [None]:
roc = plot_validation_roc()

In [None]:
plt.figure(figsize=(10,5))
plt.plot(roc.FPR,roc.TPR,marker='h')
plt.xlabel("FPR")
plt.ylabel("TPR")

## Impact of FP vs. FN

- If a __false positive__ costs much more, we should raise the probability threshold, at the price of a lower TPR.
- If a __false negative__ costs much more, we should lower the probability threshold, at the price of a higher FPR.

In [None]:
roc.loc[ (roc.TPR >= 0.78) & (roc.TPR <=0.83),:]

A fraud detection model like this is normally used as a pre-screening step whose results are further investigated by human experts, so:

- if 'Not Fraud' is classified as 'Fraud', a human expert still has a way to correct the mistake;
- but if 'Fraud' is classified as 'Not Fraud', the company loses money directly.

In this case a false negative costs much more, so we should choose a relatively low threshold.
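
The trade-off can be made concrete by attaching a cost to each kind of error and picking the cutoff with the lowest expected cost. The sketch below assumes a table with the `FPR`, `TPR` and `Threshold` columns returned by `plot_validation_roc`; the helper `pick_threshold` and the cost values passed to it are hypothetical and would have to come from the business.

```python
import pandas as pd

def pick_threshold(roc, n_pos, n_neg, cost_fp, cost_fn):
    """Return (threshold, expected_cost) minimising the total misclassification cost.

    Expected false positives = FPR * n_neg, expected false negatives = (1 - TPR) * n_pos.
    """
    expected_cost = (cost_fp * roc['FPR'] * n_neg
                     + cost_fn * (1 - roc['TPR']) * n_pos)
    best = expected_cost.idxmin()
    return float(roc.loc[best, 'Threshold']), float(expected_cost.loc[best])
```

Raising `cost_fn` relative to `cost_fp` moves the selected cutoff down, which matches the reasoning above.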

## Question 3

Your boss is a bit worried about using a model she doesn't understand for something as important as fraud detection. How would you explain to her how the model makes its predictions? Not from a mathematical perspective (she couldn't care less about that), but from a user perspective. What kinds of users are more likely to be classified as at risk? What are their characteristics?

In [None]:
# first we plot the feature importance from GBM
xgb.plot_importance(gbt)

From the plot above, we can see that 'interval_after_signup' is the most important feature for deciding whether a transaction is fraudulent.

To understand this better, we fit a simple, shallow decision tree and plot it.

In [None]:
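dt = DecisionTreeClassifier(max_depth=3,min_samples_leaf=20,min_samples_split=20)
dt.fit(X,y)
# with no out_file, export_graphviz returns the tree as DOT source text;
# feature_names is passed as a plain list, which all sklearn versions accept
export_graphviz(dt,feature_names=list(X.columns),class_names=['NotFraud','Fraud'],
                proportion=True,leaves_parallel=True,filled=True)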

From the plot above, we focus on two leaf nodes:

- The blue leaf node shows that if 'interval_after_signup' is <= 69 seconds, meaning the customer purchases immediately after signup, the transaction is very likely to be fraud.
- The leaf node with a positive ratio of 23% (the second node from the right) shows that if the purchase comes from a device shared by 2~4 users, the probability of fraud is above normal.
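
For a non-technical audience, the same two rules can be backed by plain fraud rates per user segment. The sketch below assumes the engineered columns `interval_after_signup`, `n_dev_shared` and `is_fraud` from this notebook; the helper `segment_fraud_rates` and the segment labels are hypothetical, and the 69-second cutoff is simply the split reported by the tree.

```python
import numpy as np
import pandas as pd

def segment_fraud_rates(df, quick_seconds=69, target='is_fraud'):
    """Fraud rate and user count per (purchase speed, device sharing) segment."""
    speed = pd.Series(
        np.where(df['interval_after_signup'] <= quick_seconds, 'quick', 'slow'),
        index=df.index, name='purchase_speed')
    device = pd.Series(
        np.where(df['n_dev_shared'] >= 2, 'shared', 'single'),
        index=df.index, name='device')
    # mean of a 0/1 target is the share of fraudulent users in the segment
    return df.groupby([speed, device])[target].agg(fraud_rate='mean', n_users='size')
```

A table like this lets the boss read off statements such as "users who buy within about a minute of signing up are fraudulent at rate X" without looking at the model itself.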

## Question 4

Let's say you now have this model which can be used live to predict in real time if an activity is fraudulent or not. From a product perspective, how would you use it? That is, what kind of different user experiences would you build based on the model output?

Since the model predicts the probability that a purchase is fraudulent, I would set two probability cutoffs as alert values, alert1 and alert2, with alert1 < alert2.

For an incoming purchase, the model returns the probability 'p' that the purchase is fraud:

- if p < alert1, I assume the purchase is normal and let it proceed without any friction.
- if alert1 <= p < alert2, I assume the purchase is suspicious and ask the customer for additional authorization, for example by sending an email or SMS asking him/her to confirm the purchase.
- if p >= alert2, the purchase is highly suspicious: I not only ask the customer for additional authorization via email or SMS, but also put the purchase on hold and send its details to a human expert for further investigation.
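
The three-way routing described above can be sketched as a small function. The name `route_purchase`, the action labels, and the default cutoffs 0.3 and 0.8 are placeholders for illustration; real cutoffs would be chosen from the validation ROC and the cost assumptions.

```python
def route_purchase(p, alert1=0.3, alert2=0.8):
    """Map a predicted fraud probability to a user experience."""
    if not 0.0 <= alert1 < alert2 <= 1.0:
        raise ValueError("cutoffs must satisfy 0 <= alert1 < alert2 <= 1")
    if p < alert1:
        return 'approve'          # normal purchase, no extra friction
    if p < alert2:
        return 'verify'           # ask for confirmation by email or SMS
    return 'hold_and_review'      # verify, hold the order, notify a human expert
```

Keeping the cutoffs as arguments makes it easy to tighten or relax them as the cost of false positives and false negatives changes.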