# Sheet 2

# Part 0: Summary of Fraud Detection Code Lab

- How to handle extremely imbalanced data
- Time-related variables play a significant role in detecting fraud
- Finding feasible forecasting solutions for enterprises

# Part 1: Import Data

In [1]:
!git clone https://github.com/loganlaioffer/fraudDetection.git

fatal: destination path 'fraudDetection' already exists and is not an empty directory.


In [2]:
!cd fraudDetection/
!ls fraudDetection/
!pip install -U imbalanced-learn
# !pip install pandas-profiling
!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

cv_data.csv   imbalancedFraudDF.csv	test_data.csv	tr_server_data.csv
cv_label.csv  IpAddress_to_Country.csv	test_label.csv
Collecting imbalanced-learn
  Downloading imbalanced_learn-0.14.0-py3-none-any.whl.metadata (8.8 kB)
Downloading imbalanced_learn-0.14.0-py3-none-any.whl (239 kB)
Installing collected packages: imbalanced-learn
Successfully installed imbalanced-learn-0.14.0
Collecting https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
  Downloading https://github.com/pandas-profiling/pandas-profiling/archive/master.zip (17.9 MB)

In [3]:
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn import metrics, preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    f1_score, roc_auc_score, roc_curve, precision_recall_curve, auc,
    make_scorer, recall_score, accuracy_score, precision_score, confusion_matrix,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler

In [4]:
ipToCountry = pd.read_csv('IpAddress_to_Country.csv')
fraud_data = pd.read_csv('imbalancedFraudDF.csv')

fraud_data.head()

Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,class
0,22058,2015-02-24 22:55:49,2015-04-18 02:47:11,34,QVPSPJUOCKZAR,SEO,Chrome,M,39,732758400.0,0
1,333320,2015-06-07 20:39:50,2015-06-08 01:38:54,16,EOGFQPIZPYXFZ,Ads,Chrome,F,53,350311400.0,0
2,150084,2015-04-28 21:13:25,2015-05-04 13:54:50,44,ATGTXKYKUDUQN,SEO,Safari,M,41,3840542000.0,0
3,221365,2015-07-21 07:09:52,2015-09-09 18:40:53,39,NAUITBZFJKHWW,Ads,Safari,M,45,415583100.0,0
4,159135,2015-05-21 06:03:03,2015-07-09 08:05:14,42,ALEYXFXINSXLZ,Ads,Chrome,M,18,2809315000.0,0


# Part 2: Data exploration

In [5]:
fraud_data['class'].value_counts()

In [6]:
import pandas_profiling

#Inline summary report without saving report as object
pandas_profiling.ProfileReport(fraud_data)





**Some noteworthy aspects of this report:**
- In the alerts section, `user_id` and `signup_time` are flagged as unique. This is expected, since they identify the user and the registration time. The `class` column, however, is extremely imbalanced, and imbalanced data is likely to cause problems at prediction time.
- Registration times are roughly uniformly distributed, while purchase times are concentrated in the middle of the period. A likely reason is that some users who registered early only purchased later, while users who registered in the middle of the period purchased right away, so purchases pile up in the middle. Because the data only covers a fixed time window, delayed purchases by users who registered late are not recorded, which leaves fewer purchases late in the period than in the middle.
- More users come from SEO and Ads than from Direct, but overall the sources are balanced.
- A few ages stand out with unusually large counts, which is suspicious.
- The variables have almost zero pairwise correlation, apart from each variable's correlation of 1 with itself.
- The interactions show that most buyers are 20 to 40 years old and that purchase values mostly fall between 20 and 60. Purchases by unusual age groups, and large numbers of purchases at very high or very low prices, may therefore be more suspicious.
- There are no missing values in the data.

# Task 1: Identify country info based on ip_address

In [7]:
ipToCountry.head()

Unnamed: 0,lower_bound_ip_address,upper_bound_ip_address,country
0,16777216.0,16777471,Australia
1,16777472.0,16777727,China
2,16777728.0,16778239,China
3,16778240.0,16779263,Australia
4,16779264.0,16781311,China


In [8]:
country1 = []

def BS(country1, target, ipToCountry):
    left = 0
    right = len(ipToCountry) - 1
    while left <= right:
        mid = (left + right) // 2
        if target < ipToCountry.loc[mid, 'lower_bound_ip_address']:
            right = mid - 1
        elif target > ipToCountry.loc[mid, 'upper_bound_ip_address']:
            left = mid + 1
        else:
            country1.append(ipToCountry.loc[mid, 'country'])
            return
    country1.append('NA')
    return

for i in range(len(fraud_data)):
    target = fraud_data.loc[i, 'ip_address']
    BS(country1, target, ipToCountry)

fraud_data['country']=country1

Directly matching each `ip_address` against every range in `ipToCountry` takes too long, so the lookup uses binary search instead. The resulting country list is then added to `fraud_data` as the `country` column.
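The same range lookup can also be done for all rows at once. The following is a minimal sketch rather than the method used in this notebook. It assumes the ranges are sorted by `lower_bound_ip_address` and do not overlap; `lookup_country` and the toy table are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def lookup_country(ips, ranges, missing="NA"):
    """Vectorized range lookup (hypothetical helper).

    Assumes `ranges` is sorted by lower_bound_ip_address and that
    the ranges do not overlap.
    """
    lower = ranges["lower_bound_ip_address"].to_numpy()
    upper = ranges["upper_bound_ip_address"].to_numpy()
    countries = ranges["country"].to_numpy()

    ips = np.asarray(ips, dtype=float)
    # Index of the last range whose lower bound is <= ip (-1 if there is none).
    idx = np.searchsorted(lower, ips, side="right") - 1
    safe_idx = np.clip(idx, 0, len(lower) - 1)
    # A match also requires the ip to be at or below the upper bound.
    matched = (idx >= 0) & (ips <= upper[safe_idx])
    return np.where(matched, countries[safe_idx], missing)

# Toy table mirroring the first rows of ipToCountry shown above.
toy_ranges = pd.DataFrame({
    "lower_bound_ip_address": [16777216.0, 16777472.0, 16777728.0],
    "upper_bound_ip_address": [16777471, 16777727, 16778239],
    "country": ["Australia", "China", "China"],
})
toy_result = lookup_country([16777300.0, 16777500.0, 5.0, 99999999.0], toy_ranges)
print(toy_result)
```

`np.searchsorted` performs the binary search for every address in one call, which avoids the Python-level loop and the repeated `.loc` lookups.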

In [9]:
fraud_data.head()

Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,class,country
0,22058,2015-02-24 22:55:49,2015-04-18 02:47:11,34,QVPSPJUOCKZAR,SEO,Chrome,M,39,732758400.0,0,Japan
1,333320,2015-06-07 20:39:50,2015-06-08 01:38:54,16,EOGFQPIZPYXFZ,Ads,Chrome,F,53,350311400.0,0,United States
2,150084,2015-04-28 21:13:25,2015-05-04 13:54:50,44,ATGTXKYKUDUQN,SEO,Safari,M,41,3840542000.0,0,
3,221365,2015-07-21 07:09:52,2015-09-09 18:40:53,39,NAUITBZFJKHWW,Ads,Safari,M,45,415583100.0,0,United States
4,159135,2015-05-21 06:03:03,2015-07-09 08:05:14,42,ALEYXFXINSXLZ,Ads,Chrome,M,18,2809315000.0,0,Canada


# Part 3a: Feature Engineering

In [10]:
fraud_data['interval_after_signup'] = (pd.to_datetime(fraud_data['purchase_time']) - pd.to_datetime(
        fraud_data['signup_time'])).dt.total_seconds()

fraud_data['signup_days_of_year'] = pd.DatetimeIndex(fraud_data['signup_time']).dayofyear

fraud_data['signup_seconds_of_day'] = pd.DatetimeIndex(fraud_data['signup_time']).second + 60 * pd.DatetimeIndex(
    fraud_data['signup_time']).minute + 3600 * pd.DatetimeIndex(fraud_data['signup_time']).hour

fraud_data['purchase_days_of_year'] = pd.DatetimeIndex(fraud_data['purchase_time']).dayofyear
fraud_data['purchase_seconds_of_day'] = pd.DatetimeIndex(fraud_data['purchase_time']).second + 60 * pd.DatetimeIndex(
    fraud_data['purchase_time']).minute + 3600 * pd.DatetimeIndex(fraud_data['purchase_time']).hour

fraud_data = fraud_data.drop(['user_id','signup_time','purchase_time'], axis=1)

In [11]:

print(fraud_data.purchase_days_of_year.value_counts())

purchase_days_of_year
171    648
201    647
118    646
157    645
172    642
      ... 
346     15
347     13
348      7
349      3
350      1
Name: count, Length: 350, dtype: int64


The time-related variables do not depend on other rows, so they can be processed before the train/test split:

- Compute the interval between the registration time and the purchase time;
- Convert the time-related variables to numeric features (day of year and seconds of day).

In [12]:
fraud_data.head()

Unnamed: 0,purchase_value,device_id,source,browser,sex,age,ip_address,class,country,interval_after_signup,signup_days_of_year,signup_seconds_of_day,purchase_days_of_year,purchase_seconds_of_day
0,34,QVPSPJUOCKZAR,SEO,Chrome,M,39,732758400.0,0,Japan,4506682.0,55,82549,108,10031
1,16,EOGFQPIZPYXFZ,Ads,Chrome,F,53,350311400.0,0,United States,17944.0,158,74390,159,5934
2,44,ATGTXKYKUDUQN,SEO,Safari,M,41,3840542000.0,0,,492085.0,118,76405,124,50090
3,39,NAUITBZFJKHWW,Ads,Safari,M,45,415583100.0,0,United States,4361461.0,202,25792,252,67253
4,42,ALEYXFXINSXLZ,Ads,Chrome,M,18,2809315000.0,0,Canada,4240931.0,141,21783,190,29114


**The remaining features are computed across rows, so we split the data first to prevent data leakage.**

# Part 4: Data Split

In [13]:
y = fraud_data['class']
X = fraud_data.drop(['class'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=0)
print("X_train.shape:", X_train.shape)
print("y_train.shape:", y_train.shape)

X_train.shape: (110700, 13)
y_train.shape: (110700,)


In [14]:
X_train.head()

Unnamed: 0,purchase_value,device_id,source,browser,sex,age,ip_address,country,interval_after_signup,signup_days_of_year,signup_seconds_of_day,purchase_days_of_year,purchase_seconds_of_day
29343,12,OULPAZAFRFPXP,Ads,Chrome,M,42,3690922000.0,Korea Republic of,3499664.0,183,67384,224,24648
12190,10,AIIWMFEYQQIEB,Ads,Opera,M,29,1686759000.0,United States,6766039.0,5,78146,84,18585
19388,34,VUVETBUPCIWJE,Direct,Chrome,M,53,4138429000.0,,5870515.0,197,81354,265,76669
89104,48,QCFULAJOYKFUU,Ads,Chrome,M,29,96173370.0,France,2145618.0,160,30920,185,16538
82082,44,IHRWLMIJMEEEU,Ads,FireFox,M,24,1936025000.0,China,7079059.0,111,71897,193,66156


# Part 3b: Feature Engineering

In [15]:
X_train = pd.get_dummies(X_train, columns=['source', 'browser'])
X_train['sex'] = (X_train.sex == 'M').astype(int)

X_train_device_id_mapping = X_train.device_id.value_counts(dropna=False)
X_train['n_dev_shared'] = X_train.device_id.map(X_train_device_id_mapping)

X_train_ip_address_mapping = X_train.ip_address.value_counts(dropna=False)
X_train['n_ip_shared'] = X_train.ip_address.map(X_train_ip_address_mapping)

X_train_country_mapping = X_train.country.value_counts(dropna=False)
X_train['n_country_shared'] = X_train.country.map(X_train_country_mapping)


X_train = X_train.drop(['device_id','ip_address','country'], axis=1)

- `source` and `browser` have few distinct values, so `get_dummies` (one-hot encoding) is still used.
- One-hot encoding `device_id`, `ip_address`, and `country` would blow up the dimensionality, so frequency encoding is used instead: each value is replaced by the number of times it occurs.
- Finally, the original columns are dropped once they have been encoded.
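Frequency encoding requires a decision about where the counts come from. The test-set cell (In [16]) counts values within `X_test` itself. An alternative, sketched here on toy data, is to learn the counts on the training split and reuse them for the test split, so that both splits are encoded on the same scale. `fit_frequency_map` and `apply_frequency_map` are hypothetical helpers, and mapping unseen values to 0 is an assumption, not something this notebook does.

```python
import pandas as pd

def fit_frequency_map(train_col):
    """Count how often each value occurs in the training column."""
    return train_col.value_counts(dropna=False)

def apply_frequency_map(col, freq_map, unseen=0):
    """Replace each value by its training count; unseen values get `unseen`."""
    return col.map(freq_map).fillna(unseen).astype(int)

# Toy data: devD never appears in the training split.
toy_train = pd.Series(["devA", "devA", "devB", "devC"], name="device_id")
toy_test = pd.Series(["devA", "devD"], name="device_id")

freq_map = fit_frequency_map(toy_train)
train_encoded = apply_frequency_map(toy_train, freq_map)
test_encoded = apply_frequency_map(toy_test, freq_map)
print(train_encoded.tolist(), test_encoded.tolist())
```

Whether training counts or per-split counts are more appropriate depends on how the model is scored in production, so this is a design choice to validate rather than a fixed rule.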

In [16]:
X_test = pd.get_dummies(X_test, columns=['source', 'browser'])
X_test['sex'] = (X_test.sex == 'M').astype(int)

# the more a device is shared, the more suspicious
X_test['n_dev_shared'] = X_test.device_id.map(X_test.device_id.value_counts(dropna=False))

# the more an IP is shared, the more suspicious
X_test['n_ip_shared'] = X_test.ip_address.map(X_test.ip_address.value_counts(dropna=False))

# the fewer visits from a country, the more suspicious
X_test['n_country_shared'] = X_test.country.map(X_test.country.value_counts(dropna=False))

X_test = X_test.drop(['device_id','ip_address','country'], axis=1)

In [17]:
X_train.head()

Unnamed: 0,purchase_value,sex,age,interval_after_signup,signup_days_of_year,signup_seconds_of_day,purchase_days_of_year,purchase_seconds_of_day,source_Ads,source_Direct,source_SEO,browser_Chrome,browser_FireFox,browser_IE,browser_Opera,browser_Safari,n_dev_shared,n_ip_shared,n_country_shared
29343,12,1,42,3499664.0,183,67384,224,24648,True,False,False,True,False,False,False,False,1,1,3075
12190,10,1,29,6766039.0,5,78146,84,18585,True,False,False,False,False,False,True,False,1,1,42348
19388,34,1,53,5870515.0,197,81354,265,76669,False,True,False,True,False,False,False,False,1,1,16275
89104,48,1,29,2145618.0,160,30920,185,16538,True,False,False,True,False,False,False,False,1,1,2322
82082,44,1,24,7079059.0,111,71897,193,66156,True,False,False,False,True,False,False,False,1,1,8876


In [18]:
train_scaler = preprocessing.MinMaxScaler().fit(X_train[['n_dev_shared', 'n_ip_shared', 'n_country_shared']])

X_train[['n_dev_shared', 'n_ip_shared', 'n_country_shared']] = train_scaler.transform(X_train[['n_dev_shared', 'n_ip_shared', 'n_country_shared']])

X_test[['n_dev_shared', 'n_ip_shared', 'n_country_shared']] = train_scaler.transform(X_test[['n_dev_shared', 'n_ip_shared', 'n_country_shared']])

In [19]:
X_train.n_dev_shared.value_counts(dropna=False)

In [20]:
X_test.n_dev_shared.value_counts(dropna=False)

# Part 5: Model Training

*Simple LogisticRegression model*

In [21]:
logreg = LogisticRegression()

logreg.fit(X_train,y_train)

y_pred=logreg.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [22]:
cm = metrics.confusion_matrix(y_test, y_pred)
cmDF = pd.DataFrame(cm, columns=['pred_0', 'pred_1'], index=['true_0', 'true_1'])
print(cmDF)

        pred_0  pred_1
true_0   27389       0
true_1     287       0


**Logistic regression clearly performs poorly here. The confusion matrix shows that the model labels every transaction as normal and catches none of the 287 fraud cases in the test set, which is unacceptable.**
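A short calculation on the confusion matrix above illustrates why accuracy alone hides this failure. This is only a worked example using the numbers already printed; the variable names are introduced for illustration.

```python
import numpy as np

# Confusion matrix reported above (rows: true class, columns: predicted class).
cm_logreg = np.array([[27389, 0],
                      [287, 0]])

tn, fp = cm_logreg[0]
fn, tp = cm_logreg[1]

# Accuracy counts the many correct "normal" predictions ...
accuracy = (tp + tn) / cm_logreg.sum()
# ... while recall only looks at the fraud cases.
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.4f}")
print(f"recall   = {recall:.4f}")
```

Accuracy comes out at about 99% even though recall on the fraud class is zero, because almost all test rows belong to class 0.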

*Simple RF model*

In [23]:
classifier_RF = RandomForestClassifier(random_state=0)

classifier_RF.fit(X_train, y_train)

probs = classifier_RF.predict_proba(X_test)

predicted = classifier_RF.predict(X_test)

print("%s: %r" % ("accuracy_score is: ", accuracy_score(y_test, predicted)))
print("%s: %r" % ("roc_auc_score is: ", roc_auc_score(y_test, probs[:, 1])))
print("%s: %r" % ("f1_score is: ", f1_score(y_test, predicted )))#string to int

print ("confusion_matrix is: ")
cm = confusion_matrix(y_test, predicted)
cmDF = pd.DataFrame(cm, columns=['pred_0', 'pred_1'], index=['true_0', 'true_1'])
print(cmDF)
print('recall =',float(cm[1,1])/(cm[1,0]+cm[1,1]))
print('precision =', float(cm[1,1])/(cm[1,1] + cm[0,1]))

accuracy_score is: : 0.9948692007515537
roc_auc_score is: : 0.7801672204169557
f1_score is: : 0.6712962962962963
confusion_matrix is: 
        pred_0  pred_1
true_0   27389       0
true_1     142     145
recall = 0.5052264808362369
precision = 1.0


*SMOTE sampling*

In [24]:
smote = SMOTE(random_state=12)
x_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

unique, counts = np.unique(y_train_sm, return_counts=True)

print(np.asarray((unique, counts)).T)

[[     0 109572]
 [     1 109572]]


In [25]:
#RF on smoted training data
classifier_RF_sm = RandomForestClassifier(random_state=0)

classifier_RF_sm.fit(x_train_sm, y_train_sm)

predicted_sm = classifier_RF_sm.predict(X_test)

probs_sm = classifier_RF_sm.predict_proba(X_test)

print("%s: %r" % ("accuracy_score_sm is: ", accuracy_score(y_test, predicted_sm)))
print("%s: %r" % ("roc_auc_score_sm is: ", roc_auc_score(y_test, probs_sm[:, 1])))
print("%s: %r" % ("f1_score_sm is: ", f1_score(y_test, predicted_sm )))#string to int

print ("confusion_matrix_sm is: ")
cm_sm = confusion_matrix(y_test, predicted_sm)
cmDF = pd.DataFrame(cm_sm, columns=['pred_0', 'pred_1'], index=['true_0', 'true_1'])
print(cmDF)
print('recall or sens_sm =',float(cm_sm[1,1])/(cm_sm[1,0]+cm_sm[1,1]))
print('precision_sm =', float(cm_sm[1,1])/(cm_sm[1,1] + cm_sm[0,1]))

accuracy_score_sm is: : 0.9948330683624801
roc_auc_score_sm is: : 0.7666438992331798
f1_score_sm is: : 0.6697459584295612
confusion_matrix_sm is: 
        pred_0  pred_1
true_0   27388       1
true_1     142     145
recall or sens_sm = 0.5052264808362369
precision_sm = 0.9931506849315068


Because the data is so imbalanced, accuracy says little in this project. Both models reach the same recall, but the RF trained on the original data has slightly higher precision than the RF trained on SMOTE-resampled data, so SMOTE does not help here.
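The comparison can be reproduced directly from the two confusion matrices printed above. The helper `prf` is a hypothetical name for this sketch; it assumes a 2x2 matrix with at least one predicted positive, otherwise precision is undefined.

```python
import numpy as np

def prf(cm):
    """Return (precision, recall, f1) for the positive class of a 2x2 matrix."""
    tp, fp, fn = cm[1, 1], cm[0, 1], cm[1, 0]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Confusion matrices copied from the outputs of the two RF cells above.
cm_rf = np.array([[27389, 0], [142, 145]])
cm_rf_smote = np.array([[27388, 1], [142, 145]])

for name, cm in [("RF", cm_rf), ("RF + SMOTE", cm_rf_smote)]:
    p, r, f = prf(cm)
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

The only difference between the two models on this test set is a single false positive, which is why the gap in precision and F1 is so small.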

# Part 6: Parameter tuning by GridSearchCV

**Below, we tune the parameters of each model to find its best performance:**

In [26]:
scorers = {
    'precision_score': make_scorer(precision_score),
    'recall_score': make_scorer(recall_score),
    'f1_score': make_scorer(f1_score, pos_label=1)

}

In [27]:
def grid_search_wrapper(model, parameters, refit_score='f1_score'):
    """
    fits a GridSearchCV classifier using refit_score for optimization(refit on the best model according to refit_score)
    for each combination of parameters, calculate all score in scorers, save them
    prints classifier performance metrics
    """

    grid_search = GridSearchCV(model, parameters, scoring=scorers, refit=refit_score,
                           cv=3, return_train_score=True)
    grid_search.fit(X_train, y_train)

    y_pred = grid_search.predict(X_test)
    y_prob = grid_search.predict_proba(X_test)[:, 1]

    print('Best params for {}'.format(refit_score))
    print(grid_search.best_params_)

    print('\nConfusion matrix of Random Forest optimized for {} on the test data:'.format(refit_score))
    cm = confusion_matrix(y_test, y_pred)
    cmDF = pd.DataFrame(cm, columns=['pred_0', 'pred_1'], index=['true_0', 'true_1'])
    print(cmDF)

    print("\t%s: %r" % ("roc_auc_score is: ", roc_auc_score(y_test, y_prob)))
    print("\t%s: %r" % ("f1_score is: ", f1_score(y_test, y_pred)))#string to int

    print('recall = ', float(cm[1,1]) / (cm[1,0] + cm[1,1]))
    print('precision = ', float(cm[1,1]) / (cm[1, 1] + cm[0,1]))

    return grid_search

In [28]:
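# C: inverse of regularization strength, smaller values specify stronger regularization
# Note: the default lbfgs solver does not support the l1 penalty, so the l1 candidates
# fail to fit unless a solver such as 'liblinear' or 'saga' is set on the model.
LRGrid = {"C" : np.logspace(-2,2,5), "penalty":["l1","l2"]}  # l1: lasso, l2: ridge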

logRegModel = LogisticRegression(random_state=0)

grid_search_LR_f1 = grid_search_wrapper(logRegModel, LRGrid, refit_score='f1_score')

Best params for f1_score
{'C': 0.01, 'penalty': 'l2'}

Confusion matrix of Random Forest optimized for f1_score on the test data:
        pred_0  pred_1
true_0   27389       0
true_1     287       0
	roc_auc_score is: : 0.7495650165005586
	f1_score is: : 0.0
recall =  0.0
precision =  nan


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

In [29]:
parameters = {
'max_depth': [None, 5, 15],
'n_estimators' :  [10,150],
'class_weight' : [{0: 1, 1: w} for w in [0.2, 1, 100]]
}

clf = RandomForestClassifier(random_state=0)

In [30]:
grid_search_rf_f1 = grid_search_wrapper(clf, parameters, refit_score='f1_score')

Best params for f1_score
{'class_weight': {0: 1, 1: 0.2}, 'max_depth': None, 'n_estimators': 150}

Confusion matrix of Random Forest optimized for f1_score on the test data:
        pred_0  pred_1
true_0   27389       0
true_1     142     145
	roc_auc_score is: : 0.7781993788548851
	f1_score is: : 0.6712962962962963
recall =  0.5052264808362369
precision =  1.0


In [31]:
best_rf_model_f1 = grid_search_rf_f1.best_estimator_
best_rf_model_f1

In [32]:
results_f1 = pd.DataFrame(grid_search_rf_f1.cv_results_)
results_sortf1 = results_f1.sort_values(by='mean_test_f1_score', ascending=False)
results_sortf1[['mean_test_precision_score', 'mean_test_recall_score', 'mean_test_f1_score', 'mean_train_precision_score', 'mean_train_recall_score', 'mean_train_f1_score','param_max_depth', 'param_class_weight', 'param_n_estimators']].round(3).head()

Unnamed: 0,mean_test_precision_score,mean_test_recall_score,mean_test_f1_score,mean_train_precision_score,mean_train_recall_score,mean_train_f1_score,param_max_depth,param_class_weight,param_n_estimators
1,1.0,0.527,0.69,1.0,1.0,1.0,,"{0: 1, 1: 0.2}",150
3,1.0,0.527,0.69,1.0,0.527,0.69,5.0,"{0: 1, 1: 0.2}",150
13,1.0,0.527,0.69,1.0,1.0,1.0,,"{0: 1, 1: 100}",150
5,1.0,0.527,0.69,1.0,0.56,0.718,15.0,"{0: 1, 1: 0.2}",150
9,1.0,0.527,0.69,1.0,0.527,0.69,5.0,"{0: 1, 1: 1}",150


In [33]:
grid_search_rf_recall = grid_search_wrapper(clf, parameters, refit_score='recall_score')

Best params for recall_score
{'class_weight': {0: 1, 1: 100}, 'max_depth': 5, 'n_estimators': 150}

Confusion matrix of Random Forest optimized for recall_score on the test data:
        pred_0  pred_1
true_0   27146     243
true_1     132     155
	roc_auc_score is: : 0.7904661234456265
	f1_score is: : 0.45255474452554745
recall =  0.5400696864111498
precision =  0.38944723618090454


In [34]:
best_RF_model_recall = grid_search_rf_recall.best_estimator_
best_RF_model_recall

In [35]:
predictedBest_recall = best_RF_model_recall.predict(X_test)

probsBest_recall = best_RF_model_recall.predict_proba(X_test)

results_recall = pd.DataFrame(grid_search_rf_recall.cv_results_)# recall score is different from above, as above is metric on test data, this is performance on cv data
results_sortrecall = results_recall.sort_values(by='mean_test_recall_score', ascending=False)
results_sortrecall[['mean_test_precision_score', 'mean_test_recall_score', 'mean_test_f1_score', 'mean_train_precision_score', 'mean_train_recall_score', 'mean_train_f1_score','param_max_depth', 'param_class_weight', 'param_n_estimators']].round(3).head()

Unnamed: 0,mean_test_precision_score,mean_test_recall_score,mean_test_f1_score,mean_train_precision_score,mean_train_recall_score,mean_train_f1_score,param_max_depth,param_class_weight,param_n_estimators
15,0.159,0.636,0.254,0.164,0.656,0.262,5.0,"{0: 1, 1: 100}",150
14,0.16,0.633,0.255,0.162,0.652,0.26,5.0,"{0: 1, 1: 100}",10
16,0.675,0.533,0.593,0.759,0.813,0.782,15.0,"{0: 1, 1: 100}",10
0,0.995,0.527,0.689,1.0,0.856,0.923,,"{0: 1, 1: 0.2}",10
1,1.0,0.527,0.69,1.0,1.0,1.0,,"{0: 1, 1: 0.2}",150


**Given the business context, we consider the RF model tuned for recall the best choice, because the loss from an undetected fraud is much greater than the cost of flagging a normal purchase as abnormal.**
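This trade-off can be made concrete by attaching a cost to each kind of error. The sketch below uses the two test-set confusion matrices printed above; the cost values are purely hypothetical and would have to come from the business, and `total_cost` is a name introduced for illustration.

```python
import numpy as np

def total_cost(cm, cost_fn, cost_fp):
    """Total misclassification cost for a 2x2 confusion matrix."""
    return cm[1, 0] * cost_fn + cm[0, 1] * cost_fp

# Test-set confusion matrices of the f1-optimized and recall-optimized RF models.
cm_f1 = np.array([[27389, 0], [142, 145]])
cm_recall = np.array([[27146, 243], [132, 155]])

COST_FP = 1.0  # hypothetical cost of reviewing one false alarm

# Cost ratio at which the two models cost the same:
# extra false positives divided by the frauds additionally caught.
extra_fp = cm_recall[0, 1] - cm_f1[0, 1]
fewer_fn = cm_f1[1, 0] - cm_recall[1, 0]
break_even = extra_fp * COST_FP / fewer_fn
print("break-even cost of a missed fraud:", break_even)

for cost_fn in (10.0, 100.0):  # hypothetical costs of one missed fraud
    print(cost_fn,
          total_cost(cm_f1, cost_fn, COST_FP),
          total_cost(cm_recall, cost_fn, COST_FP))
```

Under these assumptions the recall-optimized model only pays off when a missed fraud costs more than the break-even multiple of a false alarm, which is the condition the statement above relies on.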

In [36]:
pd.DataFrame(best_rf_model_f1.feature_importances_, index = X_train.columns, columns=['importance']).sort_values('importance', ascending=False)

Unnamed: 0,importance
interval_after_signup,0.408875
purchase_days_of_year,0.132442
purchase_seconds_of_day,0.079075
signup_seconds_of_day,0.077661
signup_days_of_year,0.057319
n_ip_shared,0.052617
purchase_value,0.044106
age,0.038233
n_dev_shared,0.035686
n_country_shared,0.027432


**The feature importances of the RF model show that the interval between registration and purchase is by far the most important indicator of suspicious behavior, followed by the purchase time and the registration time. How often an IP address or a device ID is shared also influences the model's predictions to some extent.**

# Task 3: Fraud Characteristics

**The feature importances show that time is the biggest factor in the model's decisions. We therefore start by analyzing the interval between registration and purchase:**

In [37]:
fraud_data.groupby("class")[['interval_after_signup']].mean()

Unnamed: 0_level_0,interval_after_signup
class,Unnamed: 1_level_1
0,5191179.0
1,2570226.0


In [38]:
fraud_data.groupby("class")[['interval_after_signup']].median()

Unnamed: 0_level_0,interval_after_signup
class,Unnamed: 1_level_1
0,5194911.0
1,1.0


Comparing the mean interval for normal and fraudulent transactions shows that, in general, the shorter the interval between registration and purchase, the more suspicious the behavior. The median makes this even clearer: at least half of the fraudulent purchases happen within one second of registration, which suggests these accounts are driven by scripts.
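The gap between mean and median can be checked with a small example. The data below is a made-up toy sample, not taken from `fraud_data`; it only illustrates how a few long intervals pull the mean up while the median stays at one second, and how the share of "instant" purchases per class can be measured.

```python
import pandas as pd

# Toy sample (hypothetical values): three of the five fraud rows purchase after 1 second.
toy = pd.DataFrame({
    "class": [0, 0, 0, 1, 1, 1, 1, 1],
    "interval_after_signup": [4.0e6, 5.0e6, 6.0e6, 1.0, 1.0, 1.0, 3.0e6, 7.0e6],
})

# Mean and median interval per class.
summary = toy.groupby("class")["interval_after_signup"].agg(["mean", "median"])

# Share of purchases made within one second of signup, per class.
instant_share = (toy["interval_after_signup"] <= 1).groupby(toy["class"]).mean()

print(summary)
print(instant_share)
```

Applying the same `instant_share` computation to the real data would quantify how much of the fraud is explained by this single pattern.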

In [39]:
trainDF = pd.concat([X_train, y_train], axis=1)
pd.crosstab(trainDF["n_dev_shared"],trainDF["class"])

class,0,1
n_dev_shared,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,104966,461
0.2,4403,371
0.4,152,172
0.6,37,87
0.8,13,32
1.0,1,5


The device ID distribution shows that, among normal transactions, the count drops off sharply as more accounts share a device. Fraudulent transactions are far more likely to come from shared devices, so multiple accounts on the same device are a strong sign of abnormal behavior. (`n_dev_shared` has been min-max scaled, so 0.0 is the lowest sharing count and 1.0 the highest.)
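The crosstab above can be turned into a fraud rate per bucket. This sketch re-enters the printed counts by hand. Mapping the scaled values back to 1 to 6 accounts per device is an inference from the 0.2 spacing and from the unscaled value 1 seen in `X_train.head()`, not something the notebook states.

```python
import numpy as np
import pandas as pd

# Counts copied from the crosstab above (columns: class 0 and class 1).
crosstab = pd.DataFrame(
    {0: [104966, 4403, 152, 37, 13, 1], 1: [461, 371, 172, 87, 32, 5]},
    index=pd.Index([0.0, 0.2, 0.4, 0.6, 0.8, 1.0], name="n_dev_shared"),
)

# Fraction of fraudulent rows in each bucket.
fraud_rate = crosstab[1] / crosstab.sum(axis=1)

# Undo the min-max scaling, assuming the raw counts ranged from 1 to 6.
accounts_per_device = np.rint(crosstab.index.to_numpy() * 5 + 1).astype(int)

for n, rate in zip(accounts_per_device, fraud_rate):
    print(f"{n} account(s) per device: fraud rate {rate:.1%}")
```

The fraud rate rises from under 1% for unshared devices to well over half once three or more accounts share a device, which is the pattern described above.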

In [40]:
fraud_data[fraud_data['class'] == 1].head(10)

Unnamed: 0,purchase_value,device_id,source,browser,sex,age,ip_address,class,country,interval_after_signup,signup_days_of_year,signup_seconds_of_day,purchase_days_of_year,purchase_seconds_of_day
136961,24,VLHGCDPFCICDA,SEO,Chrome,F,33,3432126000.0,1,United States,3327952.0,218,80113,257,38465
136962,14,YLUQSRNYYIPXU,Ads,Chrome,M,40,3905319000.0,1,,1.0,12,4207,12,4208
136963,63,ABUBCQDATQMQH,Ads,FireFox,F,46,550567000.0,1,United States,7640070.0,49,40723,137,77593
136964,34,QHEODGCAVJKIQ,SEO,Chrome,M,37,940809600.0,1,United States,1.0,12,77710,12,77711
136965,76,DAKVYHKIEYRBH,SEO,Chrome,F,48,636104100.0,1,Hungary,1.0,10,48421,10,48422
136966,32,ESANFBTIVMNHX,Ads,IE,M,30,3875475000.0,1,,5721252.0,176,53824,242,72676
136967,95,HIAMXITLJWYCT,SEO,FireFox,M,42,3786924000.0,1,,10160108.0,9,33511,126,84819
136968,13,BQTPLJBGYXQYX,Ads,IE,M,32,2463262000.0,1,Austria,1.0,12,29576,12,29577
136969,15,BWSMVSLCJXMCM,Direct,IE,F,39,2937899000.0,1,Japan,1.0,7,61065,7,61066
136970,26,HPPSDIRGUSSTB,Direct,Opera,M,31,647126100.0,1,United States,1.0,1,80617,1,80618


# Task 4: How to use the prediction

In [41]:
t = (10*probsBest_recall[:, 1]).astype(int)
unique, counts = np.unique(t, return_counts=True)

print(np.asarray((unique, counts)).T)

[[    1     1]
 [    2 24555]
 [    3  2623]
 [    4    99]
 [    5   177]
 [    6    76]
 [    7     1]
 [    8    20]
 [    9   124]]


**When applying the model in practice, transactions with a predicted fraud probability of at most 30% can be treated as normal, and those with a probability of at least 70% can be treated as fraudulent and handled accordingly. For transactions with a probability between 30% and 70%, the platform or company can follow up with the user or run additional checks to decide whether the behavior is abnormal.**
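The three-way rule can be sketched as a small function. `triage` and its label strings are hypothetical names; the 30% and 70% thresholds are the ones proposed above and should be tuned to the business costs.

```python
import numpy as np

def triage(probabilities, low=0.3, high=0.7):
    """Map fraud probabilities to 'normal', 'review', or 'fraud'."""
    p = np.asarray(probabilities, dtype=float)
    # <= low is accepted, >= high is blocked, everything in between is reviewed.
    return np.where(p <= low, "normal", np.where(p >= high, "fraud", "review"))

example_probs = [0.10, 0.30, 0.50, 0.70, 0.95]
decisions = triage(example_probs)
print(decisions.tolist())
```

With the notebook's variables, the same function could be applied to `probsBest_recall[:, 1]` to label every test transaction.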