# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you created in your previous data transformation step (not the original dataset).

## Questions
Is this a classification or regression task?  

Classification

Are you predicting for multiple classes or binary classes?  

Binary

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

- Random Forest Classifier
- Logistic Regression
- XGBoost (Extreme Gradient Boosting)


## First Model

Using the first model that you've chosen, implement the following steps.

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import MinMaxScaler
import pandas as pd


In [10]:
transactions = pd.read_csv(r"C:\Users\tosth\OneDrive\Documents\detect-fraud\detect-fraud\bank_transactions.csv")

In [11]:
import numpy as np

# log_amount: log(1 + amount); log1p stays finite when amount is 0
transactions['log_amount'] = np.log1p(transactions['amount'])

# AmountScaled and BalanceScaled: min-max scaled copies of 'amount' and 'oldbalanceOrg'
scaler = MinMaxScaler()

transactions['AmountScaled'] = scaler.fit_transform(transactions[['amount']])
transactions['BalanceScaled'] = scaler.fit_transform(
    transactions[['oldbalanceOrg']].fillna(0)  # Missing balances are treated as 0 before scaling
)

In [None]:
# Select only numeric or already-encoded features; identifier columns
# such as 'nameOrig' and 'nameDest' are left out.
numeric_features = ['amount', 'oldbalanceOrg', 'newbalanceOrig',
                    'oldbalanceDest', 'newbalanceDest',
                    'isFlaggedFraud', 'log_amount', 'AmountScaled', 'BalanceScaled']

In [13]:
print(transactions.columns.tolist())

['type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig', 'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud', 'isFlaggedFraud', 'log_amount', 'AmountScaled', 'BalanceScaled']


In [14]:
# Encode 'type' column (if not already encoded)
transactions = pd.get_dummies(transactions, columns=['type'], drop_first=True)

# Add the one-hot encoded 'type' columns
numeric_features += [col for col in transactions.columns if col.startswith("type_")]

# Define X and y
X = transactions[numeric_features]
y = transactions['isFraud']

In [15]:
print(transactions.columns.tolist())

['amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig', 'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud', 'isFlaggedFraud', 'log_amount', 'AmountScaled', 'BalanceScaled', 'type_CASH_OUT', 'type_DEBIT', 'type_PAYMENT', 'type_TRANSFER']


In [None]:
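# Scale features to the [0, 1] range.
# Note: the scaler is fitted on all rows, before the split in the next cell,
# so test-set minimums and maximums influence the training features.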
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
# Stratified 70/30 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, stratify=y, random_state=42
)
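One way to keep the test rows out of the scaling step is to split first and let a pipeline fit the scaler on the training rows only. The sketch below is a minimal illustration on made-up data, not the notebook's own result: `X_demo`, `y_demo`, and `pipe` are hypothetical names, and the small forest size is chosen only so the example runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in data: 1,000 rows, 4 features, roughly 5% positives.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = (rng.random(1000) < 0.05).astype(int)

# Split first, so the test rows are never seen while fitting.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# The pipeline fits MinMaxScaler on the training rows only, then reuses
# those training minimums and maximums when transforming the test rows.
pipe = make_pipeline(
    MinMaxScaler(),
    RandomForestClassifier(n_estimators=50, class_weight='balanced', random_state=42),
)
pipe.fit(X_tr, y_tr)
y_te_pred = pipe.predict(X_te)
```

Tree-based models are largely insensitive to monotonic scaling, so this matters more for the logistic regression later in the notebook than for the random forest.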


In [None]:

# Train Random Forest
rf = RandomForestClassifier(class_weight='balanced', random_state=42)
rf.fit(X_train, y_train)


In [None]:

# Evaluate on the test set
y_pred = rf.predict(X_test)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))

Confusion Matrix:
 [[299604      7]
 [   106    283]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    299611
           1       0.98      0.73      0.83       389

    accuracy                           1.00    300000
   macro avg       0.99      0.86      0.92    300000
weighted avg       1.00      1.00      1.00    300000

Accuracy Score: 0.9996233333333333


### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [20]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Initialize model
rf = RandomForestClassifier(class_weight='balanced', random_state=42)

# Define a small grid of key parameters
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10],
}

# Setup GridSearchCV
grid = GridSearchCV(rf, param_grid, cv=3, scoring='f1', n_jobs=-1)

# Fit on training data
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)

Best params: {'max_depth': None, 'n_estimators': 100}
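`best_params_` shows only the winner. To see how close the other candidates came, the cross-validation table can be inspected, and the refitted winner is available directly as `best_estimator_`. The sketch below assumes a small synthetic dataset and a reduced grid (`X_demo`, `y_demo`, `demo_grid` are hypothetical names) so that it runs in a few seconds.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical imbalanced data: about 10% positives.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, weights=[0.9, 0.1], random_state=42
)

demo_grid = GridSearchCV(
    RandomForestClassifier(class_weight='balanced', random_state=42),
    {'n_estimators': [20, 40], 'max_depth': [None, 5]},
    cv=3,
    scoring='f1',
)
demo_grid.fit(X_demo, y_demo)

# One row per parameter combination, best rank first.
results = pd.DataFrame(demo_grid.cv_results_)[
    ['param_n_estimators', 'param_max_depth',
     'mean_test_score', 'std_test_score', 'rank_test_score']
].sort_values('rank_test_score')
print(results)

# refit=True is the default, so best_estimator_ is already trained on
# all the data passed to fit() with the winning parameters.
tuned_rf = demo_grid.best_estimator_
```

If the mean scores of several candidates differ by less than their standard deviation, the smaller or shallower forest is often a reasonable choice.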


### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, and recall (also called sensitivity).

- 299,604 legitimate transactions correctly identified (true negatives).
- 7 legitimate transactions wrongly flagged as fraud (false positives).
- 106 fraudulent transactions missed by the model (false negatives).
- 283 fraudulent transactions correctly flagged (true positives).
- Accuracy is 0.9996. Very high, but not enough on its own for imbalanced data: with only 389 frauds in 300,000 test rows, predicting "not fraud" for every row would already score about 0.9987.
- Recall is 0.7275. The model detected about 72.75% of actual frauds. Decent, but this could be improved.
- F1-score for the fraud class is 0.83, a balanced metric combining precision and recall. Solid performance.
- The model is good at avoiding false positives (precision 0.9759, so a low cost to users).
- It still misses about 27% of actual fraud cases (106 false negatives). We may want to prioritize higher recall, even if precision drops slightly.
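The figures in the list above follow directly from the four confusion-matrix counts. As a worked check, using only the counts reported in this notebook:

```python
# Counts taken from the random forest confusion matrix in this notebook.
tn, fp, fn, tp = 299604, 7, 106, 283
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)          # share of flagged rows that are fraud
recall = tp / (tp + fn)             # share of frauds that were flagged
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that always predicts "not fraud".
baseline_accuracy = (tn + fp) / total

print(round(accuracy, 4))           # 0.9996
print(round(precision, 4))          # 0.9759
print(round(recall, 4))             # 0.7275
print(round(f1, 4))                 # 0.8336
print(round(baseline_accuracy, 4))  # 0.9987
```

The gap between 0.9996 and the 0.9987 baseline is small, which is why recall and F1 are the more informative numbers here.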


In [21]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score

# 1. Use the best model from hyperparameter tuning
best_rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    class_weight='balanced',
    random_state=42
)
best_rf.fit(X_train, y_train)

# 2. Predict on test set
y_pred = best_rf.predict(X_test)

# 3. Evaluate model
conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)  # Also called sensitivity

# 4. Print results
print("Confusion Matrix:\n", conf_matrix)
print("\nAccuracy:", round(accuracy, 4))
print("Precision:", round(precision, 4))
print("Recall (Sensitivity):", round(recall, 4))

# Optional: Full classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Confusion Matrix:
 [[299604      7]
 [   106    283]]

Accuracy: 0.9996
Precision: 0.9759
Recall (Sensitivity): 0.7275

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    299611
           1       0.98      0.73      0.83       389

    accuracy                           1.00    300000
   macro avg       0.99      0.86      0.92    300000
weighted avg       1.00      1.00      1.00    300000
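If higher recall is the priority, one common option is to lower the decision threshold applied to the predicted fraud probability instead of relying on `predict`, which effectively uses 0.5 for a binary forest. The sketch below runs on synthetic data with hypothetical names (`X_demo`, `clf`, `fraud_proba`); the thresholds 0.3 and 0.1 are arbitrary illustrations, and in practice the threshold should be chosen on a validation split rather than the test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: about 5% positives.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05],
    flip_y=0.02, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

clf = RandomForestClassifier(
    n_estimators=100, class_weight='balanced', random_state=42
).fit(X_tr, y_tr)

# Probability of the positive (fraud) class for each test row.
fraud_proba = clf.predict_proba(X_te)[:, 1]

recalls = {}
for threshold in (0.5, 0.3, 0.1):
    y_hat = (fraud_proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_hat)
    print(
        f"threshold={threshold}: "
        f"precision={precision_score(y_te, y_hat, zero_division=0):.3f}, "
        f"recall={recalls[threshold]:.3f}"
    )
```

Lowering the threshold can only flag more rows, so recall never decreases; the cost shows up as lower precision.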



## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare accuracy metrics between these two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.

### Logistic Regression as the second model

In [22]:
from sklearn.linear_model import LogisticRegression

# Create logistic regression model with class_weight to handle imbalance
lr_model = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
lr_model.fit(X_train, y_train)

In [23]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score

# Predict on test set
y_pred_lr = lr_model.predict(X_test)

# Metrics
print("Confusion Matrix (Logistic Regression):")
print(confusion_matrix(y_test, y_pred_lr))

print("\nAccuracy:", accuracy_score(y_test, y_pred_lr))
print("Precision:", precision_score(y_test, y_pred_lr))
print("Recall (Sensitivity):", recall_score(y_test, y_pred_lr))

print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr))


Confusion Matrix (Logistic Regression):
[[268155  31456]
 [    14    375]]

Accuracy: 0.8951
Precision: 0.01178096823850963
Recall (Sensitivity): 0.9640102827763496

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.90      0.94    299611
           1       0.01      0.96      0.02       389

    accuracy                           0.90    300000
   macro avg       0.51      0.93      0.48    300000
weighted avg       1.00      0.90      0.94    300000
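To compare how the two models handle the imbalance, it can help to translate both confusion matrices into review workload: how many transactions each model flags, and how many false alarms accompany each fraud it catches. The sketch below uses only the counts printed in this notebook; the helper `summarize` and its dictionary keys are hypothetical names introduced for illustration.

```python
# Confusion-matrix counts (tn, fp, fn, tp) reported in this notebook.
matrices = {
    'Random Forest': (299604, 7, 106, 283),
    'Logistic Regression': (268155, 31456, 14, 375),
}

def summarize(tn, fp, fn, tp):
    """Derive fraud-class metrics and review workload from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        'flagged': tp + fp,                          # alerts raised
        'missed_frauds': fn,
        'precision': precision,
        'recall': recall,
        'f1': 2 * precision * recall / (precision + recall),
        'false_alarms_per_caught_fraud': fp / tp,
    }

summary = {name: summarize(*counts) for name, counts in matrices.items()}

for name, row in summary.items():
    print(name)
    for key, value in row.items():
        print(f"  {key}: {value:.4f}" if isinstance(value, float) else f"  {key}: {value}")
```

On these counts the logistic regression misses far fewer frauds but raises many more alerts, so which model handles the imbalance better depends on the relative cost of a missed fraud and a false alarm.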



### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your second model.

In [38]:
# Sample 10% of the original data
sampled_data = transactions.sample(frac=0.1, random_state=42)

# Split features and target
X_sample = sampled_data[numeric_features]
y_sample = sampled_data['isFraud']


In [39]:
from sklearn.model_selection import train_test_split

X_tune_train, X_tune_test, y_tune_train, y_tune_test = train_test_split(
    X_sample, y_sample, test_size=0.2, stratify=y_sample, random_state=42
)
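Note that the sample above is drawn from all of `transactions` and uses the unscaled columns, so it can include rows that also sit in `X_test`, and its feature ranges differ from the scaled `X_train` used for the final fit. A possible alternative, sketched here on made-up arrays, is to carve the tuning subset out of the training split itself. `X_train_demo` and `y_train_demo` are hypothetical stand-ins for the notebook's `X_train` and `y_train`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the scaled training split (about 2% positives).
rng = np.random.default_rng(0)
X_train_demo = rng.random((5000, 3))
y_train_demo = (rng.random(5000) < 0.02).astype(int)

# Keep a stratified 10% of the training rows for the hyperparameter search.
# The remaining 90% is simply not used during tuning.
X_tune, _, y_tune, _ = train_test_split(
    X_train_demo, y_train_demo,
    train_size=0.1, stratify=y_train_demo, random_state=42
)

print(X_tune.shape, y_tune.mean())
```

Because the subset comes from the training split, the held-out test rows stay unseen during tuning, and cross-validation inside the search provides the validation scores.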


In [40]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform
import numpy as np

# Define parameter space
param_dist = {
    'penalty': ['l1', 'l2'],               # Regularization type
    'C': loguniform(0.001, 100),           # Regularization strength
    'solver': ['liblinear', 'saga'],       # Solvers that support l1/l2
    'max_iter': [100, 200, 300]            # Allow more iterations for convergence
}

# Set up model
lr = LogisticRegression(class_weight='balanced', random_state=42)

# Random search
random_search = RandomizedSearchCV(
    estimator=lr,
    param_distributions=param_dist,
    n_iter=20,
    scoring='f1',
    cv=3,
    verbose=1,
    n_jobs=-1,
    random_state=42
)

# Fit on tuning data
random_search.fit(X_tune_train, y_tune_train)


Fitting 3 folds for each of 20 candidates, totalling 60 fits


In [41]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

best_lr = random_search.best_estimator_
y_pred_tune = best_lr.predict(X_tune_test)

print("Best Parameters:", random_search.best_params_)
print("\nConfusion Matrix:")
print(confusion_matrix(y_tune_test, y_pred_tune))

print("\nClassification Report:")
print(classification_report(y_tune_test, y_pred_tune))

print("Accuracy:", accuracy_score(y_tune_test, y_pred_tune))


Best Parameters: {'C': np.float64(11.015056790269638), 'max_iter': 100, 'penalty': 'l2', 'solver': 'liblinear'}

Confusion Matrix:
[[18971  1002]
 [    1    26]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.95      0.97     19973
           1       0.03      0.96      0.05        27

    accuracy                           0.95     20000
   macro avg       0.51      0.96      0.51     20000
weighted avg       1.00      0.95      0.97     20000

Accuracy: 0.94985


### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, and recall (also called sensitivity).

This model catches almost all fraud cases in the tuning test set (recall = 0.96, or 26 of 27). However, precision for the fraud class is very low (0.03): 1,002 legitimate transactions were flagged alongside the 26 real frauds, so most alerts are false alarms. In the grand scheme of fraud detection (no pun intended), the high recall is still valuable, because it is better to raise a false alarm than to miss a real fraud.


In [42]:
final_lr = LogisticRegression(
    **random_search.best_params_,
    class_weight='balanced',
    random_state=42
)

final_lr.fit(X_train, y_train)
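The cell above fits `final_lr` but stops before generating test-set predictions. A minimal sketch of that remaining step is shown below on synthetic data, so no results for the real dataset are implied. The names `X_demo`, `demo_lr`, and `y_pred_demo` are hypothetical; `C` is rounded from the best parameters printed earlier, and the `l2` penalty found by the search is left at scikit-learn's default.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical imbalanced data: about 3% positives.
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=6, weights=[0.97, 0.03], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Fit the scaler on the training rows, then apply it to both splits.
scaler_demo = MinMaxScaler().fit(X_tr)
X_tr_s, X_te_s = scaler_demo.transform(X_tr), scaler_demo.transform(X_te)

demo_lr = LogisticRegression(
    C=11.015, solver='liblinear', max_iter=100,
    class_weight='balanced', random_state=42
)
demo_lr.fit(X_tr_s, y_tr)

# Predict on the held-out rows and report the same metrics as before.
y_pred_demo = demo_lr.predict(X_te_s)
cm = confusion_matrix(y_te, y_pred_demo)
print("Confusion Matrix:\n", cm)
print("Accuracy:", round(accuracy_score(y_te, y_pred_demo), 4))
print("Precision:", round(precision_score(y_te, y_pred_demo, zero_division=0), 4))
print("Recall (Sensitivity):", round(recall_score(y_te, y_pred_demo), 4))
print(classification_report(y_te, y_pred_demo, zero_division=0))
```

In the notebook itself, the equivalent calls would be `final_lr.predict(X_test)` followed by the same metric functions on `y_test`.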


### (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.
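As a starting point for the third model, the sketch below trains a gradient-boosted classifier on synthetic data. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost, which is a different library with its own API, and handles the imbalance with balanced sample weights. All names (`X_demo`, `gb`, `gb_pred`) and the hyperparameter values are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical imbalanced data: about 3% positives.
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=6, weights=[0.97, 0.03], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# GradientBoostingClassifier has no class_weight argument, so each
# training row is weighted inversely to its class frequency instead.
weights = compute_sample_weight('balanced', y_tr)

gb = GradientBoostingClassifier(
    n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42
)
gb.fit(X_tr, y_tr, sample_weight=weights)

gb_pred = gb.predict(X_te)
print(confusion_matrix(y_te, gb_pred))
print(classification_report(y_te, gb_pred, zero_division=0))
```

Steps (2) and (3) would then follow the same pattern as for the first two models: a search over a small parameter grid scored with F1, then evaluation on the held-out test set.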