# Preprocessing & Training Data

# Modeling



Tree-based ensemble models such as Random Forest, XGBoost, and LightGBM are well suited to binary classification tasks like fraud detection: they scale to large datasets, capture non-linear relationships among features, and are robust to data issues such as outliers and unscaled features.

### Import and Initial Set-up

In [33]:
# import files from google colab
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [34]:
# import necessary packages
import pandas as pd
import numpy as np


# preprocessing
from sklearn.model_selection import train_test_split
#from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# models
from sklearn.linear_model import LogisticRegression

In [35]:
# create path variables
path_test = '/content/drive/MyDrive/Springboard/capstones/Capstone Project/data/eda_test_df.csv'
path_train = '/content/drive/MyDrive/Springboard/capstones/Capstone Project/data/eda_train_df.csv'

In [36]:
# read the files and convert to data frames
df_test = pd.read_csv(path_test)
df_train = pd.read_csv(path_train)

### Beginning of Pre-Processing

In [37]:
# observe the first five rows
df_train.head()

Unnamed: 0,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,...,V315,V316,V317,V318,V319,V320,V321,Days,Transaction_day_of_week,Transaction_hour
0,0,86400,68.5,W,13926,-1.0,150,discover,142,credit,...,0.0,0.0,117.0,0.0,0.0,0.0,0.0,1,0,0
1,0,86401,29.0,W,2755,404.0,150,mastercard,102,credit,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0
2,0,86469,59.0,W,4663,490.0,150,visa,166,debit,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0
3,0,86499,50.0,W,18132,567.0,150,mastercard,117,debit,...,0.0,50.0,1404.0,790.0,0.0,0.0,0.0,1,0,0
4,0,86506,50.0,H,4497,514.0,150,mastercard,102,credit,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0


In [38]:
# print information about train data
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 590538 entries, 0 to 590537
Columns: 228 entries, isFraud to Transaction_hour
dtypes: float64(198), int64(17), object(13)
memory usage: 1.0+ GB


In [39]:
# observe all object data types
cat_feats = df_train.select_dtypes(include=['object'], exclude=['datetime'])
# create the columns into a list
cat_feat_list = list(cat_feats.columns)

In [40]:
# use get dummies
df_train = pd.get_dummies(df_train, columns=cat_feat_list)

In [41]:
#check one hot encoded columns
df_train.head()

Unnamed: 0,isFraud,TransactionDT,TransactionAmt,card1,card2,card3,card5,addr1,addr2,dist1,...,M6_T,M7_-1,M7_F,M7_T,M8_-1,M8_F,M8_T,M9_-1,M9_F,M9_T
0,0,86400,68.5,13926,-1.0,150,142,315,87,19.0,...,1,1,0,0,1,0,0,1,0,0
1,0,86401,29.0,2755,404.0,150,102,325,87,-1.0,...,1,1,0,0,1,0,0,1,0,0
2,0,86469,59.0,4663,490.0,150,166,330,87,287.0,...,0,0,1,0,0,1,0,0,1,0
3,0,86499,50.0,18132,567.0,150,117,476,87,-1.0,...,0,1,0,0,1,0,0,1,0,0
4,0,86506,50.0,4497,514.0,150,102,420,87,-1.0,...,0,1,0,0,1,0,0,1,0,0
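To illustrate what `pd.get_dummies` does here, the following is a minimal sketch on a toy frame. The values and the `toy_train` / `toy_test` names are made up for illustration. It also shows one possible way to align a second frame to the training columns with `reindex`; this alignment step is an assumption and is not something the notebook itself performs.

```python
import pandas as pd

# Toy frames (hypothetical values); '-1' mimics the missing-value placeholder seen in the M columns
toy_train = pd.DataFrame({"amt": [10.0, 25.5, 7.0], "card4": ["visa", "mastercard", "-1"]})
toy_test = pd.DataFrame({"amt": [12.0, 3.5], "card4": ["visa", "discover"]})

# One-hot encode the categorical column; dtype=int gives 0/1 instead of booleans
train_enc = pd.get_dummies(toy_train, columns=["card4"], dtype=int)
test_enc = pd.get_dummies(toy_test, columns=["card4"], dtype=int)

# Align the test frame to the training columns: categories unseen in training are dropped,
# categories missing from the test frame are filled with 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_aligned)
```

Without such an alignment, a model fitted on the encoded training frame could receive a test frame with a different set of columns.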


In [42]:
# split the data frames to X and y
X = df_train.iloc[:,1:]
y = df_train['isFraud']

In [43]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify = y, random_state=42)
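The `stratify = y` argument matters because fraud is rare. As a small sketch on synthetic labels (the 3.5% positive rate and all `*_toy` names are assumptions chosen for illustration), the split below keeps nearly the same positive rate in both subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 35 positives in 1000 rows (3.5%), chosen for illustration
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:35] = 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

# With stratify, both subsets keep (almost exactly) the original positive rate
print(f"overall: {y_toy.mean():.4f}, train: {y_tr.mean():.4f}, test: {y_te.mean():.4f}")
```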

I will fit the Random Forest without scaling the values, since decision trees do not measure **distances** between samples; they split each feature on a threshold. Standard scaling is therefore not necessary.
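The scaling argument can be checked on synthetic data. The sketch below is only an illustration under stated assumptions (a single `DecisionTreeClassifier`, `make_classification` data, made-up column scale factors); it compares the predictions of a tree fitted on raw features with those of a tree fitted on standardized features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to illustrate the point
X_demo, y_demo = make_classification(n_samples=600, n_features=6, random_state=0)
# Put the columns on very different scales (factors are arbitrary)
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 1000.0, 1.0, 1.0])
X_a, X_b, y_a, y_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

scaler_demo = StandardScaler().fit(X_a)

# Same tree settings, once on raw features and once on standardized features
tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_a, y_a)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(
    scaler_demo.transform(X_a), y_a
)

pred_raw = tree_raw.predict(X_b)
pred_scaled = tree_scaled.predict(scaler_demo.transform(X_b))

# Standardization is a monotone per-feature transform, so the splits partition the data the same way
agreement = np.mean(pred_raw == pred_scaled)
print(f"share of identical predictions: {agreement:.3f}")
```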

In [44]:
# packages for RF
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# instantiate model
rf = RandomForestClassifier()

# fit model on training data
rf.fit(X_train, y_train)
y_predict = rf.predict(X_test)

accuracy_score(y_test, y_predict)

0.9795554351384609

The accuracy score above is high, but on a dataset this imbalanced accuracy alone can be misleading, so we will also look at other metrics and at ways to make the model more efficient.

Next, we want to identify the most important features so that we do not have to work with hundreds of columns.

In [45]:
# rf.feature_importances_ is a bare array and hard to read; put it in a labelled, sorted DataFrame
feature_importances = pd.DataFrame(rf.feature_importances_,
                                  index = X_train.columns,
                                   columns = ['importance']).sort_values('importance', ascending= False)
feature_importances

Unnamed: 0,importance
TransactionDT,3.694938e-02
TransactionAmt,3.315365e-02
Days,3.177528e-02
C1,3.003512e-02
card1,2.883973e-02
...,...
P_emaildomain_live.fr,1.635078e-07
M1_F,6.315702e-08
V305,1.366104e-08
card6_debit or credit,3.901870e-10


From the feature importance dataframe above we will take the 20 features with the greatest importance.

In [46]:
# extract the 20
top_20_features = feature_importances.index[:20]
print(top_20_features)

Index(['TransactionDT', 'TransactionAmt', 'Days', 'C1', 'card1', 'card2',
       'C13', 'Transaction_hour', 'addr1', 'C14', 'C2',
       'Transaction_day_of_week', 'V45', 'C11', 'V87', 'C12', 'card5', 'C6',
       'C7', 'C4'],
      dtype='object')
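One way to judge a cut-off such as 20 is to look at how much of the total importance the selected features cover. The sketch below uses synthetic data and a hypothetical cut-off `k = 10`; the names `X_fi`, `rf_fi` and `imp` are illustrative and the numbers it prints say nothing about the fraud dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data (30 features, 5 informative)
X_fi, y_fi = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=0)
X_fi = pd.DataFrame(X_fi, columns=[f"f{i}" for i in range(30)])

rf_fi = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_fi, y_fi)

# Importances as a labelled Series, sorted from most to least important
imp = pd.Series(rf_fi.feature_importances_, index=X_fi.columns).sort_values(ascending=False)

# Share of total importance captured by the k most important features
k = 10
top_k = imp.index[:k]
coverage = imp.iloc[:k].sum()
print(f"top {k} features cover {coverage:.1%} of total importance")
```

Impurity-based importances sum to 1, so `coverage` reads directly as a fraction of the total.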


Below we keep only the top 20 features and redo the `train_test_split` as well as the Random Forest fit. From now on we use only these 20 features.

In [47]:
# after random forest feature importance method we re-split

# split the data frames to X and y
X = df_train[top_20_features]
y = df_train['isFraud']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify = y, random_state=42)

In [48]:
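# standardize: fit the scaler on the training set only, then apply it to both sets
# (the tree-based models below are fit on the unscaled X_train)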
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

By applying SMOTE only to the training set, after the train/test split, we ensure that the synthetic examples it generates do not influence the test set. This prevents the model from "cheating" by learning from synthetic data derived from samples it will later be tested on.

In [49]:
# class imbalance: oversample the minority class in the training set with SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

In [50]:
print("X_resampled shape:", X_train_resampled.shape)
print("y_resampled shape:", y_train_resampled.shape)

X_resampled shape: (797824, 20)
y_resampled shape: (797824,)
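With its default settings SMOTE oversamples the minority class until both classes have the same count, which is consistent with the even row total above (797824 / 2 = 398912 rows per class). The core idea, interpolating between a minority point and one of its nearest minority neighbours, can be sketched in plain NumPy. This is a simplified illustration and not the `imblearn` implementation; the function name `smote_like_sample` and the toy points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class points in 2D (hypothetical values)
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.7]])

def smote_like_sample(points, k, n_new, rng):
    """Create n_new synthetic points by interpolating between a point and one of its k nearest neighbours."""
    # Pairwise Euclidean distances between minority points
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base_idx = rng.integers(0, len(points), size=n_new)            # random base points
    nb_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]  # one random neighbour each
    lam = rng.random((n_new, 1))                                   # interpolation weight in [0, 1)
    return points[base_idx] + lam * (points[nb_idx] - points[base_idx])

synthetic = smote_like_sample(minority, k=3, n_new=10, rng=rng)
print(synthetic.shape)
```

Because each synthetic point lies on the segment between two real minority points, it stays inside the region the minority class already occupies.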


**ACCURACY**

Accuracy is the share of all predictions made by the classification model that are correct.

High accuracy indicates that the model makes a high proportion of correct predictions, but it can be misleading on imbalanced datasets: a model that labels every transaction as legitimate would still score high.

**F1 SCORE**

The F1 score is the harmonic mean of precision and recall, which makes it useful for imbalanced datasets. With `average='weighted'`, scikit-learn computes the F1 score for each class and averages them weighted by class support. The cells below call `f1_score` with its default `average='binary'`, which reports the F1 score of the positive (fraud) class only.

Higher values indicate better model performance.
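A small worked example may make the difference between these metrics concrete. The labels below are made up: 100 toy transactions with 5 frauds, of which a hypothetical model catches 2 and raises 1 false alarm.

```python
from sklearn.metrics import accuracy_score, f1_score

# 100 toy transactions: 95 legitimate (0) and 5 fraudulent (1)
y_true_toy = [0] * 95 + [1] * 5
# Hypothetical predictions: one false alarm, and only 2 of the 5 frauds caught
y_pred_toy = [0] * 94 + [1] + [1, 1, 0, 0, 0]

acc = accuracy_score(y_true_toy, y_pred_toy)
f1_binary = f1_score(y_true_toy, y_pred_toy)                        # fraud class only (default)
f1_weighted = f1_score(y_true_toy, y_pred_toy, average="weighted")  # support-weighted over both classes

print(f"accuracy={acc:.3f}  binary F1={f1_binary:.3f}  weighted F1={f1_weighted:.3f}")
# accuracy=0.960  binary F1=0.500  weighted F1=0.955
```

Accuracy and weighted F1 are both dominated by the majority class, while the binary F1 score exposes that most frauds are missed.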

In [51]:
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

### **Random Forest Classifier**


Ensemble learning: builds an ensemble of decision trees and aggregates their predictions.

Robust to outliers and can handle missing values.

The `n_estimators` parameter sets the number of trees in the forest. Random Forest trains them independently; it does not use boosting rounds.

In [52]:
# Let's do the RF again but with our top 20 features only

# instantiate model
rf = RandomForestClassifier()

# fit model on training data
rf.fit(X_train, y_train)
y_predict = rf.predict(X_test)

f1 = f1_score(y_test, y_predict)
ac = accuracy_score(y_test, y_predict)

print(f'Random Forest: Accuracy for 100 is {ac}')
print(f'Random Forest: f1-score for 100 is {f1}')

Random Forest: Accuracy for 100 is 0.9839638297151759
Random Forest: f1-score for 100 is 0.7133488043587932


As the output above shows, this Random Forest reaches slightly higher accuracy (0.984 versus 0.980) than the one we fitted before the feature importance step.

Let us explore further and check whether we can improve it by changing `n_estimators`.

In [53]:
# n_estimator_range = [10, 50, 100, 200, 300]
# n_estimator_range = [200, 250, 300, 350, 400]

# for n_estimators in n_estimator_range:
#   clf = RandomForestClassifier(n_estimators=n_estimators)
#   model = clf.fit(X_train, y_train)
#   y_pred = model.predict(X_test)
#   ac = accuracy_score(y_test, y_pred)
#   f1 = f1_score(y_test, y_pred)

#   print(f'Random Forest: Accuracy for {n_estimators} is {ac}')

#   print(f'Random Forest: f1-score for {n_estimators} is {f1}')

We have commented out the code above because fitting the larger `n_estimators` values takes a long time and each step brings only a gradual improvement. To keep training time manageable we will stick with `n_estimators=100`, which is also the scikit-learn default.
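If the sweep is worth revisiting, one option is to grow a single forest incrementally instead of refitting from scratch for every value. The sketch below assumes scikit-learn's `warm_start=True` option and uses synthetic data; the `*_ws` names and the list `[10, 30, 60]` are illustrative, and it is not what the notebook ran.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data as a stand-in for the fraud features
X_ws, y_ws = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0)
X_ws_tr, X_ws_te, y_ws_tr, y_ws_te = train_test_split(
    X_ws, y_ws, test_size=0.3, stratify=y_ws, random_state=0
)

# warm_start=True keeps the trees already fitted and only adds the missing ones on each fit
rf_ws = RandomForestClassifier(warm_start=True, n_jobs=-1, random_state=0)

scores_ws = {}
for n in [10, 30, 60]:
    rf_ws.set_params(n_estimators=n)
    rf_ws.fit(X_ws_tr, y_ws_tr)
    scores_ws[n] = f1_score(y_ws_te, rf_ws.predict(X_ws_te))
    print(f"n_estimators={n}: F1={scores_ws[n]:.3f}")
```

Each step only trains the additional trees, so the total cost is that of one 60-tree forest rather than three separate fits.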

In [54]:
# instantiate model
rf = RandomForestClassifier(n_estimators=100)

# fit model on training data
rf.fit(X_train, y_train)
y_predict = rf.predict(X_test)

f1 = f1_score(y_test, y_predict)
ac = accuracy_score(y_test, y_predict)

print(f'Random Forest: Accuracy for 100 is {ac}')
print(f'Random Forest: f1-score for 100 is {f1}')

Random Forest: Accuracy for 100 is 0.9838227159323105
Random Forest: f1-score for 100 is 0.7098602956063982


Let us try Random Forest on the SMOTE-oversampled training set to deal with the class imbalance and see whether the scores improve or get worse.

In [55]:
# SMOTE applied on X_train and y_train
n_estimator_range = [10, 50, 100, 200, 300]

for n_estimators in n_estimator_range:
  clf = RandomForestClassifier(n_estimators=n_estimators)
  model = clf.fit(X_train_resampled, y_train_resampled)
  y_pred = model.predict(X_test)
  y_pred_prob = model.predict_proba(X_test)
  ac = accuracy_score(y_test, y_pred)
  f1 = f1_score(y_test, y_pred)

  print(f'Random Forest: Accuracy for {n_estimators} is {ac}')

  print(f'Random Forest: f1-score for {n_estimators} is {f1}')

Random Forest: Accuracy for 10 is 0.9826486492588704
Random Forest: f1-score for 10 is 0.6930297583383264
Random Forest: Accuracy for 50 is 0.9835404883665797
Random Forest: f1-score for 50 is 0.7126527394560503
Random Forest: Accuracy for 100 is 0.9837267585599621
Random Forest: f1-score for 100 is 0.7158766137774711
Random Forest: Accuracy for 200 is 0.9836477348415574
Random Forest: f1-score for 200 is 0.7142716244205543
Random Forest: Accuracy for 300 is 0.9838001377270521
Random Forest: f1-score for 300 is 0.716739044611133


Again we observe only a gradual improvement with SMOTE: the F1 score levels off from 100 `n_estimators` onward (0.716 at 100, 0.714 at 200, 0.717 at 300).

Random Forest: Accuracy for 50 is 0.9834445309942312
Random Forest: f1-score for 50 is 0.7096327096327096

Random Forest: Accuracy for 100 is 0.9833034172113658

Random Forest: f1-score for 100 is 0.7073025925192955

### Gradient Boosting Machines

##### **XGBoost**


Has a good ability to handle imbalanced datasets.

Designed for speed and performance, crucial for large-scale datasets.

In [56]:
import xgboost as xgb

n_estimator_range = [10, 50, 100, 200, 300]

for n_estimators in n_estimator_range:
    xgb_model = xgb.XGBClassifier(n_estimators=n_estimators, objective="binary:logistic", random_state=42)
    xgb_model.fit(X_train_resampled, y_train_resampled)

    y_pred = xgb_model.predict(X_test)
    # y_pred_prob = xgb_model.predict_proba(X_test)

    ac = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    print(f'XGBoost: Accuracy for {n_estimators} is {ac}')
    print(f'XGBoost: f1-score for {n_estimators} is {f1}')

XGBoost: Accuracy for 10 is 0.9620008805500051
XGBoost: f1-score for 10 is 0.4330469934310258
XGBoost: Accuracy for 50 is 0.9728327745227532
XGBoost: f1-score for 50 is 0.5359174621540834
XGBoost: Accuracy for 100 is 0.9758074530655558
XGBoost: f1-score for 100 is 0.5742947953913389
XGBoost: Accuracy for 200 is 0.9783587902597622
XGBoost: f1-score for 200 is 0.6175942549371634
XGBoost: Accuracy for 300 is 0.9802045585396417
XGBoost: f1-score for 300 is 0.6496153461884304


Beyond `n_estimators` of 100 the gains become smaller but do not stop: F1 rises from 0.574 at 100 to 0.650 at 300, which is still below the Random Forest results.


XGBoost: Accuracy for 100 is 0.9757566521037243


XGBoost: f1-score for 100 is 0.5757186604761435




Keep in mind that for both XGBoost and LightGBM we are using the SMOTE technique.

##### **LightGBM**

Efficient with large datasets, and able to handle categorical features and missing data.

As a gradient boosting framework, it builds trees sequentially, with each new tree correcting the errors of the previous ones.
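The "each tree corrects the previous errors" idea can be sketched in a few lines. The example below is a toy regression with squared-error loss built from scikit-learn regression trees; it is not LightGBM's actual algorithm (which adds histogram binning, leaf-wise growth and more), and the data, learning rate and number of rounds are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_gb = rng.uniform(-3, 3, size=(300, 1))
y_gb = np.sin(X_gb[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
prediction = np.full_like(y_gb, y_gb.mean())   # start from a constant model
mse_history = [np.mean((y_gb - prediction) ** 2)]

for _ in range(50):
    residuals = y_gb - prediction               # negative gradient of the squared error
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_gb, residuals)
    prediction += learning_rate * tree.predict(X_gb)   # add a damped correction
    mse_history.append(np.mean((y_gb - prediction) ** 2))

print(f"training MSE: start={mse_history[0]:.4f}, end={mse_history[-1]:.4f}")
```

Each round fits a small tree to what the current ensemble still gets wrong, so the training error shrinks step by step.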

In [57]:
import lightgbm as lgb

n_estimator_range = [200, 300, 350, 400]

for n_estimators in n_estimator_range:
    lgb_model = lgb.LGBMClassifier(n_estimators=n_estimators, objective="binary",force_col_wise=True,verbose=-1, random_state=42)
    lgb_model.fit(X_train_resampled, y_train_resampled)

    y_pred = lgb_model.predict(X_test)
    # y_pred_prob = lgb_model.predict_proba(X_test)

    ac = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    print(f'LightGBM: Accuracy for {n_estimators} is {ac}')
    print(f'LightGBM: f1-score for {n_estimators} is {f1}')

LightGBM: Accuracy for 200 is 0.9745148508145087
LightGBM: f1-score for 200 is 0.555303851078499
LightGBM: Accuracy for 300 is 0.9763323963378151
LightGBM: f1-score for 300 is 0.5801542004605987
LightGBM: Accuracy for 350 is 0.9770492543547713
LightGBM: f1-score for 350 is 0.5897073662966701
LightGBM: Accuracy for 400 is 0.9775629085244014
LightGBM: f1-score for 400 is 0.598200748003639


## Cross validation on our best model


In [58]:
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_validate

# instantiate
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# fit model on training data
rf_model.fit(X_train, y_train)
y_predict = rf_model.predict(X_test)

scoring = {'accuracy': make_scorer(accuracy_score),
           'f1_score': make_scorer(f1_score)}


cv_results = cross_validate(rf_model, X_train, y_train, cv=5, scoring=scoring)  # 5-fold cross-validation


for metric, scores in cv_results.items():
    print(f'{metric}: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})')

fit_time: 82.2695 (+/- 1.4055)
score_time: 1.9219 (+/- 0.3241)
test_accuracy: 0.9826 (+/- 0.0003)
test_f1_score: 0.6825 (+/- 0.0058)
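The printed results include `fit_time` and `score_time` because `cross_validate` returns them alongside the test scores. A minimal sketch of the same idea on synthetic data is shown below; the explicit `StratifiedKFold` with shuffling, the built-in scorer names `"accuracy"` and `"f1"`, and the `*_cv` / `cv_demo` names are assumptions for illustration, not the notebook's exact setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data as a stand-in for the fraud features
X_cv, y_cv = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0)

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_demo = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_cv, y_cv, cv=cv, scoring=["accuracy", "f1"],
)

# Report only the test metrics, skipping the timing entries
for name, values in cv_demo.items():
    if name.startswith("test_"):
        print(f"{name}: {values.mean():.4f} (+/- {values.std() * 2:.4f})")
```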


# Summary of Models Tested

In [59]:
# Model Summary
model_summary = pd.DataFrame({
    'Model': ['Random Forest', 'LightGBM', 'XGBoost'],
    'Accuracy': [0.983, 0.977, 0.975],
    'F1 Score': [0.981, 0.596, 0.972]
})

display(model_summary)

Unnamed: 0,Model,Accuracy,F1 Score
0,Random Forest,0.983,0.981
1,LightGBM,0.977,0.596
2,XGBoost,0.975,0.972


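The Random Forest model outperformed LightGBM and XGBoost in terms of accuracy and F1 score. Note that the F1 values of 0.981 and 0.972 in the table do not match the binary F1 scores printed in the earlier cells (about 0.71 for Random Forest and 0.57 to 0.65 for XGBoost); Random Forest ranks first under either set of numbers. Moreover, Random Forest models are known for their robustness and ability to handle imbalanced data, which is often the case in fraudulent transaction detection. The model's ability to provide feature importances also offers valuable insights into the factors contributing to fraud, aiding in interpretability and potential feature engineering for further model improvement.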


In [60]:
# Final Random Forest model
print(rf_model.get_params())

{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': 42, 'verbose': 0, 'warm_start': False}


In [61]:
# Cross-validation results
for metric, scores in cv_results.items():
    print(f'{metric}: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})')

fit_time: 82.2695 (+/- 1.4055)
score_time: 1.9219 (+/- 0.3241)
test_accuracy: 0.9826 (+/- 0.0003)
test_f1_score: 0.6825 (+/- 0.0058)


In conclusion, the Random Forest model demonstrated superior performance in identifying fraudulent transactions. Its robustness to outliers and ability to handle imbalanced data make it a suitable choice for this task. Moving forward, further tuning of the model parameters and exploring additional feature engineering strategies could potentially enhance the model's performance. Additionally, deploying the model and monitoring its performance in a real-world scenario will be essential to ensure its efficacy in detecting fraudulent activities.


In [62]:
# #save files
# df_train.to_csv('/content/drive/MyDrive/Springboard/capstones/Capstone Project/data/preprocess_train_df.csv', index=False)
# df_test.to_csv('/content/drive/MyDrive/Springboard/capstones/Capstone Project/data/preprocess_test_df.csv', index=False)