# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you've created in your previous data transformation step (not the original dataset).

## Questions
Is this a classification or regression task?  

ANS: Classification.

Are you predicting for multiple classes or binary classes?  

ANS: Binary classes.

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

ANS: K-Nearest Neighbors, Support Vector Machines, and (maybe) Random Forests.

## First Model

Using the first model that you've chosen, implement the following steps.

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [26]:
import pandas as pd
import numpy as np

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, precision_score
from sklearn.ensemble import RandomForestClassifier

from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.svm import LinearSVC, SVC
from sklearn.naive_bayes import GaussianNB

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV

from imblearn.over_sampling import BorderlineSMOTE


In [27]:
transactions = pd.read_csv("../data/transactions.csv")
transactions.head()

Unnamed: 0,amount,oldbalanceOrig,newbalanceOrig,oldbalanceDest,newbalanceDest,type_CASH_OUT,type_TRANSFER,isDuplicateDestLF,isFraud,isFlaggedFraud
0,1.1e-05,0.000928,0.000918,0.0,0.0,0.0,0.0,0,0,0
1,0.000597,0.002511,0.001135,0.0,0.0,0.0,0.0,0,0,0
2,0.00239,0.196364,0.205295,0.002599,0.001975,0.0,0.0,0,0,0
3,0.0255,0.0,0.0,0.01182,0.018426,0.0,1.0,1,0,0
4,0.000735,0.0,0.0,0.001759,0.001947,1.0,0.0,1,0,0


In [28]:
# Sample a smaller subset of the transactions for quicker model training
sample_transactions = transactions.sample(frac=0.2, random_state=37)
sample_transactions.shape

(200000, 10)

In [29]:
sample_transactions.value_counts("isFraud")

isFraud
0    199757
1       243
Name: count, dtype: int64

## Full dataset (earlier runs used a sample of 200,000 rows)

In [30]:
# Separate features and target variable
feats_to_drop = ["isFraud", "isFlaggedFraud"]

#X = sample_transactions.drop(columns=feats_to_drop)
#y = sample_transactions["isFraud"]

X = transactions.drop(columns=feats_to_drop)
y = transactions["isFraud"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=37)

y_train.value_counts()

isFraud
0    699091
1       909
Name: count, dtype: int64
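
With only 909 fraud rows among 700,000 training rows, a purely random split can shift the fraud rate slightly between the training and test sets. As a minimal sketch on synthetic data (the array sizes and the 0.1% positive rate are assumptions for illustration, not the transactions dataset), passing `stratify=y` to `train_test_split` keeps the class ratio the same in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10,000 rows, 10 of them positive (0.1%)
rng = np.random.default_rng(37)
X_demo = rng.random((10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:10] = 1

# stratify=y_demo preserves the positive rate in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=37, stratify=y_demo
)

print("positives in train:", y_tr.sum(), "of", len(y_tr))
print("positives in test: ", y_te.sum(), "of", len(y_te))
```

Whether stratification changes the results on the real data would need to be checked by rerunning the notebook.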

### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [33]:
# This time using random search and the million rows of data...

knn = KNeighborsClassifier()

param_dist = {
    'n_neighbors': [3, 5, 7, 9, 11, 13, 15],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

rand_search = RandomizedSearchCV(estimator=knn,
    random_state=37,
    param_distributions=param_dist,
    cv=5,                 # 5-fold cross-validation
    scoring='precision',   # Optimize for precision 
    n_jobs=-1)            # Use all available cores

print("Starting RandomizedSearchCV...")
rand_search.fit(X_train, y_train)
print("RandomizedSearchCV complete.\n")

print(f"Best parameters found: {rand_search.best_params_}")
print(f"Best cross-validation precision: {rand_search.best_score_:.4f}")

best_knn_model = rand_search.best_estimator_
test_accuracy = best_knn_model.score(X_test, y_test)  # score() returns mean accuracy, not precision
print(f"Test set accuracy with best parameters: {test_accuracy:.4f}")

Starting RandomizedSearchCV...
RandomizedSearchCV complete.

Best parameters found: {'weights': 'uniform', 'n_neighbors': 15, 'metric': 'manhattan'}
Best cross-validation precision: 0.9384
Test set accuracy with best parameters: 0.9995
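
A scikit-learn classifier's `score` method returns mean accuracy, which on data this imbalanced says little about fraud detection. The sketch below uses a synthetic label vector and a `DummyClassifier` (both assumptions for illustration) to show how far accuracy and precision can diverge:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_score

# Synthetic labels: 990 negatives, 10 positives; features are irrelevant here
y_demo = np.array([0] * 990 + [1] * 10)
X_demo = np.zeros((1000, 1))

# A baseline that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)

accuracy = baseline.score(X_demo, y_demo)  # mean accuracy
precision = precision_score(y_demo, baseline.predict(X_demo), zero_division=0)

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
```

To report test precision for the tuned model, `precision_score(y_test, best_knn_model.predict(X_test))` would be the matching call.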


## Observations: (using 200,000 rows, GridSearch)
- Best parameters found: {'metric': 'manhattan', 'n_neighbors': 15, 'weights': 'uniform'}
- Best cross-validation precision: 0.9454
- Test set precision with best parameters: 0.9991

## Next Step:
- Performing a RandomSearch using the full dataset

### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, recall, and sensitivity.  

In [34]:
best_knn_model.fit(X_train, y_train)

yhat = best_knn_model.predict(X_test)

In [35]:
confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Confusion Matrix \n", confusion)
print("\nClassification Report\n", class_report)

Confusion Matrix 
 [[299591     21]
 [   130    258]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    299612
           1       0.92      0.66      0.77       388

    accuracy                           1.00    300000
   macro avg       0.96      0.83      0.89    300000
weighted avg       1.00      1.00      1.00    300000
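
The headline metrics can be derived by hand from the confusion matrix above. This short sketch recomputes them from the four printed counts; note that recall and sensitivity are the same quantity, while specificity is the corresponding rate for the non-fraud class:

```python
# Counts copied from the confusion matrix above (rows: actual, columns: predicted)
tn, fp = 299591, 21
fn, tp = 130, 258

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)    # of the flagged transactions, how many were fraud
recall = tp / (tp + fn)       # of the fraud cases, how many were flagged (sensitivity)
specificity = tn / (tn + fp)  # of the non-fraud cases, how many were left alone

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} specificity={specificity:.4f}")
```

Accuracy and specificity are both close to 1 because the negative class dominates; precision and recall for class 1 carry the useful signal.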



In [36]:
# Applying SMOTE to handle class imbalance
'''
sampling_strategies = [0.002, 0.003, 0.004, 0.005]

best_strategy = None
best_precision = 0

for strategy in sampling_strategies:
    print(f"Testing sampling_strategy={strategy}...")
    
    # Apply SMOTE with the current sampling_strategy
    smote = BorderlineSMOTE(random_state=37, k_neighbors=3, sampling_strategy=strategy)
    X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
    
    # Train the tuned KNN model on the resampled data
    model = best_knn_model
    model.fit(X_train_smote, y_train_smote)
    
    # Evaluate the model on the test set
    y_pred = model.predict(X_test)
    precision = precision_score(y_test, y_pred, zero_division=1)
    
    print(f"Precision for sampling_strategy={strategy}: {precision:.4f}")
    
    # Update the best strategy if the current one is better
    if precision > best_precision:
        best_precision = precision
        best_strategy = strategy

# Output the best sampling_strategy and its precision
print(f"\nBest sampling_strategy: {best_strategy}")
print(f"Best precision: {best_precision:.4f}")

###############################################################################

print("Class distribution after SMOTE:")
print(y_train_smote.value_counts())
'''

# Using previously determined best sampling strategy with full dataset

best_strategy = 0.002


## Observations: (using 200,000 rows)
- Best sampling_strategy: 0.002
- Best precision: 0.7500
- Class distribution after SMOTE:
    - isFraud
    - 0:    139843
    - 1:       699


In [37]:
best_smote = BorderlineSMOTE(random_state=37, k_neighbors=3, sampling_strategy=best_strategy)
X_train_smote, y_train_smote = best_smote.fit_resample(X_train, y_train)

In [38]:
best_knn_model.fit(X_train_smote, y_train_smote)

yhat = best_knn_model.predict(X_test)

confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Confusion Matrix \n", confusion)
print("\nClassification Report\n", class_report)

Confusion Matrix 
 [[299527     85]
 [   101    287]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    299612
           1       0.77      0.74      0.76       388

    accuracy                           1.00    300000
   macro avg       0.89      0.87      0.88    300000
weighted avg       1.00      1.00      1.00    300000



## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare accuracy metrics between these two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.

In [39]:
rf = RandomForestClassifier(random_state=37)
rf.fit(X_train_smote, y_train_smote)

yhat = rf.predict(X_test) 

confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Confusion Matrix \n", confusion)
print("\nClassification Report\n", class_report)

Confusion Matrix 
 [[299593     19]
 [    95    293]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    299612
           1       0.94      0.76      0.84       388

    accuracy                           1.00    300000
   macro avg       0.97      0.88      0.92    300000
weighted avg       1.00      1.00      1.00    300000



In [41]:
param_dist = {
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 10, 20, 30, 40, 50],
    "min_samples_split": [2, 5, 10, 15],
    "max_features": ["sqrt", "log2"]
}

# TODO: set up RandomizedSearchCV with 5-fold cross-validation
rf_grid = RandomizedSearchCV(random_state=37,
        estimator=rf, 
        param_distributions=param_dist,
        cv=5, 
        n_jobs=-1)

# TODO: fit this model on your training data
rf_grid.fit(X_train_smote, y_train_smote)

## Observations: (using 200,000 rows)
- RandomForestClassifier(max_depth=20, max_features='log2', random_state=37)
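
`RandomizedSearchCV` evaluates only `n_iter` sampled candidates (10 by default) rather than the whole grid, and when `scoring` is omitted it falls back to the estimator's default score, which is accuracy for classifiers. As a rough sketch of the search budget for the Random Forest parameter space above:

```python
import math

# Same search space as the Random Forest cell above
param_dist = {
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 10, 20, 30, 40, 50],
    "min_samples_split": [2, 5, 10, 15],
    "max_features": ["sqrt", "log2"],
}

n_combinations = math.prod(len(values) for values in param_dist.values())
n_iter = 10  # RandomizedSearchCV default when n_iter is not passed
cv = 5

n_fits = n_iter * cv
print(f"{n_combinations} possible combinations, {n_iter} sampled, {n_fits} cross-validation fits")
```

Raising `n_iter` covers more of the space at a proportional cost in training time.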

In [42]:
# TODO: extract the best rf estimator
rf_best = rf_grid.best_estimator_

# TODO: use this estimator to generate "yhat" on your X_test dataset
yhat = rf_best.predict(X_test)

# TODO: generate a confusion matrix and a classification report
confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Confusion Matrix \n", confusion)
print("\nClassification Report\n", class_report)

Confusion Matrix 
 [[299590     22]
 [    91    297]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    299612
           1       0.93      0.77      0.84       388

    accuracy                           1.00    300000
   macro avg       0.97      0.88      0.92    300000
weighted avg       1.00      1.00      1.00    300000



### (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.

In [43]:
# Training, fitting, and testing a basic Logistic Regression model, untuned hyperparameters
log_reg = LogisticRegression()
log_reg.fit(X_train_smote, y_train_smote)
yhat = log_reg.predict(X_test)

confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Confusion Matrix \n", confusion)
print("\nClassification Report\n", class_report)

Confusion Matrix 
 [[299612      0]
 [   388      0]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    299612
           1       0.00      0.00      0.00       388

    accuracy                           1.00    300000
   macro avg       0.50      0.50      0.50    300000
weighted avg       1.00      1.00      1.00    300000





In [44]:
# Hyperparameter tuning of Logistic regression model

# Randomly search for the best hyperparameters on a logistic regression model
param_dist = {
    'penalty': ['l1', 'l2', 'elasticnet'],  # 'elasticnet' needs l1_ratio, which is not set here, so those candidates fail
    'C': np.linspace(0.01, 1, 100),
    'solver': ['saga'], 
    'max_iter': [10000],
}

random_search = RandomizedSearchCV(LogisticRegression(), param_distributions=param_dist, cv=5, scoring='accuracy', random_state=37)
random_search.fit(X_train_smote, y_train_smote)

# Best model from random search
best_params_random = random_search.best_params_
best_score_random = random_search.best_score_

print(f"RandomizedSearchCV - Best Params: {best_params_random}")
print(f"RandomizedSearchCV - Cross-Val Accuracy: {best_score_random:.2f}")

20 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/miniconda3/envs/ds/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/miniconda3/envs/ds/lib/python3.12/site-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/ds/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py", line 1203, in fit
    raise ValueError("l1_ratio must be specified when penalty is elasticnet.")
ValueError: l1_ratio must be specified when penalty is elasticnet.

RandomizedSearchCV - Best Params: {'solver': 'saga', 'penalty': 'l1', 'max_iter': 10000, 'C': np.float64(0.78)}
RandomizedSearchCV - Cross-Val Accuracy: 1.00
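
The 20 failed fits above come from `'elasticnet'` candidates that were sampled without an `l1_ratio`. One way to avoid this, sketched here under the assumption that elastic net is still wanted, is to pass a list of dictionaries so that `l1_ratio` only appears alongside `'elasticnet'`. The `C` and `l1_ratio` values are illustrative, not tuned choices, and the sketch only samples candidates with `ParameterSampler` without fitting any model:

```python
from sklearn.model_selection import ParameterSampler

# Two sub-spaces: plain l1/l2 penalties, and elastic net with its required l1_ratio
param_dist = [
    {
        "penalty": ["l1", "l2"],
        "C": [0.01, 0.1, 1.0],
        "solver": ["saga"],
        "max_iter": [10000],
    },
    {
        "penalty": ["elasticnet"],
        "l1_ratio": [0.25, 0.5, 0.75],
        "C": [0.01, 0.1, 1.0],
        "solver": ["saga"],
        "max_iter": [10000],
    },
]

# Draw 10 candidates, as RandomizedSearchCV would with its default n_iter
candidates = list(ParameterSampler(param_dist, n_iter=10, random_state=37))

for params in candidates:
    print(params)
```

The same list can be passed as `param_distributions` to `RandomizedSearchCV`.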


In [45]:
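# NOTE: yhat still holds the predictions of the untuned log_reg from In [43];
# the tuned random_search.best_estimator_ is never used to predict here.
confusion = confusion_matrix(y_test, yhat)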
class_report = classification_report(y_test, yhat)

print("Confusion Matrix \n", confusion)
print("\nClassification Report\n", class_report)

Confusion Matrix 
 [[299612      0]
 [   388      0]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    299612
           1       0.00      0.00      0.00       388

    accuracy                           1.00    300000
   macro avg       0.50      0.50      0.50    300000
weighted avg       1.00      1.00      1.00    300000





In [46]:
# Implementing Support Vector Classifier model

lin_svc = LinearSVC(C=1.0, max_iter=10000, random_state=37)

lin_svc.fit(X_train, y_train)

yhat = lin_svc.predict(X_test)

confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Confusion Matrix \n", confusion)
print("\nClassification Report\n", class_report)

Confusion Matrix 
 [[299611      1]
 [   336     52]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    299612
           1       0.98      0.13      0.24       388

    accuracy                           1.00    300000
   macro avg       0.99      0.57      0.62    300000
weighted avg       1.00      1.00      1.00    300000



In [47]:
# Hyperparameter tuning for SVC

param_grid = {
    "penalty": ["l1", "l2"],
    "C": [0.01, 0.1, 1.0]
}

svc = LinearSVC(max_iter=10000)

# TODO: set up RandomizedSearchCV with 5-fold cross-validation
random_search = RandomizedSearchCV(
    svc, 
    param_distributions=param_grid, 
    n_iter=6,  # param_grid has only 2 x 3 = 6 combinations
    cv=5, 
    scoring='accuracy', 
    random_state=37
)

# TODO: fit this model on your training data
random_search.fit(X_train_smote, y_train_smote)

best_svc = random_search.best_estimator_

yhat = best_svc.predict(X_test)

confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Confusion Matrix \n", confusion)
print("\nClassification Report\n", class_report)



Confusion Matrix 
 [[299607      5]
 [   271    117]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    299612
           1       0.96      0.30      0.46       388

    accuracy                           1.00    300000
   macro avg       0.98      0.65      0.73    300000
weighted avg       1.00      1.00      1.00    300000



In [None]:
# Implementing random search on the kernel SVC model to find best hyperparams
param_grid = {
    'C': np.linspace(0.1, 10, 100),
    'kernel': ['rbf', 'poly', 'sigmoid'],
    'gamma': ['scale', 'auto'],
    'degree': [2, 3, 4, 5]  
}

svc = SVC(max_iter=10000, random_state=37)

# TODO: set up RandomizedSearchCV with 5-fold cross-validation
random_search = RandomizedSearchCV(
    svc, 
    param_distributions=param_grid, 
    n_iter=100, 
    cv=5, 
    scoring='accuracy', 
    random_state=37
)

# TODO: fit this model on your training data
random_search.fit(X_train_smote, y_train_smote)

best_svc = random_search.best_estimator_

yhat = best_svc.predict(X_test)

confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Confusion Matrix \n", confusion)
print("\nClassification Report\n", class_report)

In [None]:
# Lastly, using the Naive Bayes model

gnb = GaussianNB()
gnb.fit(X_train_smote, y_train_smote)

yhat = gnb.predict(X_test)

confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Confusion Matrix \n", confusion)
print("\nClassification Report\n", class_report)

Confusion Matrix 
 [[59440   474]
 [   74    12]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      0.99      1.00     59914
           1       0.02      0.14      0.04        86

    accuracy                           0.99     60000
   macro avg       0.51      0.57      0.52     60000
weighted avg       1.00      0.99      0.99     60000



## Observations:

- The Random Forest Classifier missed the fewest fraudulent transactions and had the highest fraud-class f1-score: 0.78 (0.84 when using the full dataset).
- K-Nearest Neighbors, when tuned, yielded an f1-score of 0.66 (0.76 when using the full dataset).
- Logistic Regression was tried out for comparison; it resulted in an f1-score of 0 because it predicted no fraud at all. The evaluation after hyperparameter tuning reused the untuned model's predictions, so the tuned model's test performance is still unknown.
- The Support Vector Classifier (which I expected to perform the best with this dataset) was actually second to last in performance, with an f1-score of 0.31 (0.24 with the full dataset) without using the kernel trick and 0.24 (0.46 with the full dataset) after trying out several kernels.
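
The fraud-class f1-scores for the full-dataset runs can be recomputed directly from the confusion matrices printed earlier in this notebook. This sketch copies those counts (the model labels are shorthand introduced here) and ranks the models:

```python
# (tn, fp, fn, tp) copied from the full-dataset confusion matrices in this notebook
results = {
    "KNN (tuned)": (299591, 21, 130, 258),
    "KNN (tuned) + SMOTE": (299527, 85, 101, 287),
    "Random Forest (default) + SMOTE": (299593, 19, 95, 293),
    "Random Forest (tuned) + SMOTE": (299590, 22, 91, 297),
    "Logistic Regression + SMOTE": (299612, 0, 388, 0),
    "LinearSVC (default)": (299611, 1, 336, 52),
    "LinearSVC (tuned) + SMOTE": (299607, 5, 271, 117),
}


def f1(tn, fp, fn, tp):
    """F1 for the positive (fraud) class; 0.0 when nothing is predicted or present."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Highest f1 first
ranking = sorted(((f1(*cm), name) for name, cm in results.items()), reverse=True)
for score, name in ranking:
    print(f"{name:<34} f1={score:.2f}")
```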

## Questions:
- How can the Random Forest and Support Vector Classifiers be improved? Which features and hyperparameters could be tested further?