## XGBoost Credit Card Fraud Classifier

In [32]:
# Imports
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
)


In [2]:
fraud = pd.read_csv('fraud test.csv')
fraud.head()

Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,0,21/06/2020 12:14,2291160000000000.0,fraud_Kirlin and Sons,personal_care,2.86,Jeff,Elliott,M,351 Darlene Green,...,33.9659,-80.9355,333497,Mechanical engineer,19/03/1968,2da90c7d74bd46a0caf3777415b3ebd3,1371816865,33.986391,-81.200714,0
1,1,21/06/2020 12:14,3573030000000000.0,fraud_Sporer-Keebler,personal_care,29.84,Joanne,Williams,F,3638 Marsh Union,...,40.3207,-110.436,302,"Sales professional, IT",17/01/1990,324cc204407e99f51b0d6ca0055005e7,1371816873,39.450498,-109.960431,0
2,2,21/06/2020 12:14,3598220000000000.0,"fraud_Swaniawski, Nitzsche and Welch",health_fitness,41.28,Ashley,Lopez,F,9333 Valentine Point,...,40.6729,-73.5365,34496,"Librarian, public",21/10/1970,c81755dbbbea9d5c77f094348a7579be,1371816893,40.49581,-74.196111,0
3,3,21/06/2020 12:15,3591920000000000.0,fraud_Haley Group,misc_pos,60.05,Brian,Williams,M,32941 Krystal Mill Apt. 552,...,28.5697,-80.8191,54767,Set designer,25/07/1987,2159175b9efe66dc301f149d3d5abf8c,1371816915,28.812398,-80.883061,0
4,4,21/06/2020 12:15,3526830000000000.0,fraud_Johnston-Casper,travel,3.19,Nathan,Massey,M,5783 Evan Roads Apt. 465,...,44.2529,-85.017,1126,Furniture designer,06/07/1955,57ff021bd3f328f8738bb535c302a31b,1371816917,44.959148,-85.884734,0


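Columns: transaction datetime, credit card number, merchant, purchase category and amount, cardholder details (name, gender, address, job, date of birth), merchant location, and the is_fraud label. 
Goal: Use XGBoost to build a strong classifier for is_fraud from the available data. 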

Create XGBoost object. 

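XGBoost mechanics: The model is an ensemble of shallow ("weak") decision trees built one after another. Each new tree is fit to the errors left by the trees before it, so the ensemble steadily lowers its loss function. The fitted model exposes feature importance scores that reflect how much each variable contributed to the tree splits. The main risk is overfitting, which can be limited with parameters such as the number of trees, the learning rate, and the maximum tree depth. The ensemble is also less interpretable than a single decision tree, because a prediction is the sum of many trees rather than one readable path of rules. 

It uses gradient boosting: each tree is fit to the negative gradient of the loss with respect to the current predictions, and its output is scaled by the learning rate before being added to the ensemble. This is gradient descent in function space, not stochastic gradient descent over model weights. 

It is used here for fast and accurate predictions. 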
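To make the boosting idea concrete, here is a minimal pure-Python sketch. It is not how XGBoost is implemented: it assumes squared-error loss (XGBoost uses log loss for classification by default), one-split "stumps" instead of full trees, and a made-up eight-point data set. The helper names `fit_stump` and `boost` are hypothetical.

```python
# Toy gradient boosting for squared-error loss, using one-split "stumps".
# Illustrative only: XGBoost adds regularization, second-order gradient
# information, and much faster tree construction.

def fit_stump(xs, residuals):
    """Find the threshold on xs that best splits residuals into two means."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=20, learning_rate=0.3):
    """Return the mean squared error after each boosting round."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    losses = []
    for _ in range(n_trees):
        # For squared error, the negative gradient is just the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        # Add the new stump's output, shrunk by the learning rate.
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return losses


# Made-up data: the label tends to be 1 for larger x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
losses = boost(xs, ys)
print(losses[0], losses[-1])
```

Each round fits a stump to what the current ensemble still gets wrong, so the training loss shrinks round after round. A smaller learning rate makes each step more cautious, which is the same trade-off the `learning_rate` parameter controls in XGBoost.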

### Preprocessing for different Datatypes

In [3]:
fraud.columns

Index(['Unnamed: 0', 'trans_date_trans_time', 'cc_num', 'merchant', 'category',
       'amt', 'first', 'last', 'gender', 'street', 'city', 'state', 'zip',
       'lat', 'long', 'city_pop', 'job', 'dob', 'trans_num', 'unix_time',
       'merch_lat', 'merch_long', 'is_fraud'],
      dtype='object')

In [4]:
fraud = fraud.drop(columns=['trans_date_trans_time', 'Unnamed: 0', 'zip', 'lat', 'long', 'street','dob','trans_num', 'cc_num', 'merch_lat', 'merch_long'])

In [5]:
#Check missing values
fraud.isna().sum(axis='rows')

merchant     0
category     0
amt          0
first        0
last         0
gender       0
city         0
state        0
city_pop     0
job          0
unix_time    0
is_fraud     0
dtype: int64

In [6]:
print(len(fraud['merchant'].unique()))
print(len(fraud['first'].unique()))
print(len(fraud['category'].unique()))

print(len(fraud))
fraud.columns

693
341
14
555719


Index(['merchant', 'category', 'amt', 'first', 'last', 'gender', 'city',
       'state', 'city_pop', 'job', 'unix_time', 'is_fraud'],
      dtype='object')

Problem: I have numerous categorical variables with high cardinality (number of unique values). I have to encode them numerically for XGBoost to use them for classification. 

Solution: For merchant, job, state, and city, I'll use frequency encoding. I'll use value_counts(normalize=True) to replace each value with its relative frequency in the dataset (similar in spirit to the term-frequency part of TF-IDF). The cost is a loss of interpretability: I won't be able to tell how an individual merchant, job, state, or city affects fraud classification, and values that occur equally often become indistinguishable. 

For first and last name, to avoid creating a dummy variable for every unique name in the dataset, I'll reduce each name to its string length. 

For category, since there are only 14 unique values, I'll use one-hot encoding. 


In [7]:
# Frequency encoding
encode = ['merchant', 'city', 'job', 'state']

# Iterate over each column and compute the frequency of each category
for col in encode:
    freq_encoding = fraud[col].value_counts(normalize=True)
    fraud[col + '_freq'] = fraud[col].map(freq_encoding)
    
fraud.drop(columns=encode, inplace=True)
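The cell above computes frequencies over the whole dataset before the train/test split, so the encoded features carry a little information about the test rows. One possible alternative, sketched below on a hypothetical four-row frame, is to learn the frequencies from the training rows only and then apply that mapping to the test rows. Filling unseen categories with 0 is an assumption; other fallbacks are equally defensible.

```python
import pandas as pd

# Hypothetical miniature data; the column name mirrors the notebook's.
train = pd.DataFrame({"state": ["NY", "NY", "CA", "TX"]})
test = pd.DataFrame({"state": ["NY", "CA", "WA"]})  # "WA" never appears in train

# Learn relative frequencies from the training rows only.
state_freq = train["state"].value_counts(normalize=True)

train["state_freq"] = train["state"].map(state_freq)
# Categories unseen during training map to NaN; treat them as frequency 0.
test["state_freq"] = test["state"].map(state_freq).fillna(0.0)

print(test)
```

In this notebook that would mean splitting first and running the encoding loop on the training frame, then reusing each learned `value_counts` series on the test frame.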

In [8]:
fraud.head()

Unnamed: 0,category,amt,first,last,gender,city_pop,unix_time,is_fraud,merchant_freq,city_freq,job_freq,state_freq
0,personal_care,2.86,Jeff,Elliott,M,333497,1371816865,0,0.001324,0.001152,0.004373,0.022567
1,personal_care,29.84,Joanne,Williams,F,302,1371816873,0,0.001413,0.001506,0.004562,0.008382
2,health_fitness,41.28,Ashley,Lopez,F,34496,1371816893,0,0.001359,0.001931,0.004655,0.064633
3,misc_pos,60.05,Brian,Williams,M,54767,1371816915,0,0.001279,0.001193,0.001193,0.032578
4,travel,3.19,Nathan,Massey,M,1126,1371816917,0,0.000666,0.001603,0.001603,0.035397


In [9]:
# Feature transformation: replace each name with its character count
features = ['first', 'last']

for feature in features:
    fraud[feature] = fraud[feature].str.len()

In [10]:
# Convert gender to a boolean column: True for 'M', False otherwise
fraud['genderM'] = fraud['gender'] == 'M'
fraud = fraud.drop(columns=['gender'])

In [11]:
fraud.head()

Unnamed: 0,category,amt,first,last,city_pop,unix_time,is_fraud,merchant_freq,city_freq,job_freq,state_freq,genderM
0,personal_care,2.86,4,7,333497,1371816865,0,0.001324,0.001152,0.004373,0.022567,True
1,personal_care,29.84,6,8,302,1371816873,0,0.001413,0.001506,0.004562,0.008382,False
2,health_fitness,41.28,6,5,34496,1371816893,0,0.001359,0.001931,0.004655,0.064633,False
3,misc_pos,60.05,5,8,54767,1371816915,0,0.001279,0.001193,0.001193,0.032578,True
4,travel,3.19,6,6,1126,1371816917,0,0.000666,0.001603,0.001603,0.035397,True


In [12]:
#one-hot encoding: category
one_hot_encoded = pd.get_dummies(fraud['category'])

# Concatenate the one-hot encoded columns with the original DataFrame
fraud_encoded = pd.concat([fraud, one_hot_encoded], axis=1)
fraud = fraud_encoded.drop(columns='category')

In [13]:
fraud.head()

Unnamed: 0,amt,first,last,city_pop,unix_time,is_fraud,merchant_freq,city_freq,job_freq,state_freq,...,grocery_pos,health_fitness,home,kids_pets,misc_net,misc_pos,personal_care,shopping_net,shopping_pos,travel
0,2.86,4,7,333497,1371816865,0,0.001324,0.001152,0.004373,0.022567,...,0,0,0,0,0,0,1,0,0,0
1,29.84,6,8,302,1371816873,0,0.001413,0.001506,0.004562,0.008382,...,0,0,0,0,0,0,1,0,0,0
2,41.28,6,5,34496,1371816893,0,0.001359,0.001931,0.004655,0.064633,...,0,1,0,0,0,0,0,0,0,0
3,60.05,5,8,54767,1371816915,0,0.001279,0.001193,0.001193,0.032578,...,0,0,0,0,0,1,0,0,0,0
4,3.19,6,6,1126,1371816917,0,0.000666,0.001603,0.001603,0.035397,...,0,0,0,0,0,0,0,0,0,1


### Analysis

In [14]:
# Test Train Split
X = fraud.drop(columns=['is_fraud'])  # Features
y = fraud['is_fraud']  # Target variable

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
type(X_test)

pandas.core.frame.DataFrame
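Because fraud is rare, a purely random split can leave the train and test sets with slightly different fraud rates. A hedged sketch of one option, using synthetic labels with an assumed 2% positive rate, is to pass `stratify` to `train_test_split` so both parts keep the same class proportions. The split above does not do this, and changing it would change the reported numbers.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 20 positives out of 1000 rows (2%).
y_demo = np.array([1] * 20 + [0] * 980)
X_demo = np.arange(1000).reshape(-1, 1)

# stratify=y_demo keeps the positive rate equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```

Applied to this notebook, the equivalent change would be adding `stratify=y` to the existing `train_test_split` call.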

In [16]:
#fitting the model to the training data
model = XGBClassifier()
model.fit(X_train, y_train)

In [20]:
# Predict based on training data
predict_train = model.predict(X_train)
print('\nTarget on train data',predict_train) 

accuracy_train = accuracy_score(y_train,predict_train)
print('\naccuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(X_test)
print('\nTarget on test data',predict_test) 

# Accuracy Score on test dataset
accuracy_test = accuracy_score(y_test,predict_test)
print('\naccuracy_score on test dataset : ', accuracy_test)


Target on train data [0 0 0 ... 0 0 0]

accuracy_score on train dataset :  0.99969049087188

Target on test data [0 0 0 ... 0 0 0]

accuracy_score on test dataset :  0.9991218599294609


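Accuracy is very high on both sets, but it is a misleading measure here because fraud is rare: the test set contains only 529 fraudulent transactions out of 138,930 (see the confusion matrix below), so a model that always predicts "not fraud" would already score about 99.6%. The small gap between train and test accuracy suggests overfitting is limited, though accuracy alone is too coarse to settle that. My remaining concern is that I cannot easily interpret how XGBoost is classifying fraud. Let's look at evaluation metrics that focus on the fraud class. 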

In [26]:
# precision score: proportion of true positives among all positive predictions
precision = precision_score(y_test, predict_test)
print(precision)

0.9320594479830149


In [28]:
# recall score (sensitivity): proportion of actual positives that the model predicts as positive
recall = recall_score(y_test, predict_test)
print(recall)

0.8298676748582231


In [30]:
# f1 score: harmonic mean of precision and recall
harmonic = f1_score(y_test, predict_test)
print(harmonic)

0.878


In [34]:
# confusion matrix: rows are actual classes, columns are predicted classes
confus = confusion_matrix(y_test, predict_test)
print(confus)

[[138369     32]
 [    90    439]]
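As a cross-check, the scores reported above can be recomputed by hand from the four counts in the confusion matrix. The sketch below does only that arithmetic, plus the accuracy of a trivial model that labels every transaction as non-fraud. The variable names are my own.

```python
# Counts copied from the confusion matrix above
# (rows = actual class, columns = predicted class).
tn, fp = 138369, 32
fn, tp = 90, 439

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the predicted frauds, how many were real
recall = tp / (tp + fn)      # of the real frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that predicts "not fraud" for every row.
baseline_accuracy = (tn + fp) / total

print(f"accuracy          {accuracy:.4f}")
print(f"precision         {precision:.4f}")
print(f"recall            {recall:.4f}")
print(f"f1                {f1:.4f}")
print(f"baseline accuracy {baseline_accuracy:.4f}")
```

The model misses 90 of the 529 fraudulent transactions in the test set, which is what the recall of about 0.83 expresses, and that gap is invisible in the accuracy figure.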
