# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you created in your previous data transformation step (not the original dataset).

## Questions
Is this a classification or regression task?  

Answer: It is a classification task.

Are you predicting for multiple classes or binary classes?  

Answer: Binary classes, since each transaction is either fraudulent (1) or not fraudulent (0).

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

Answer: Random Forest Classifier, Logistic Regression

## First Model

Using the first model that you've chosen, implement the following steps.

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [31]:
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.over_sampling import SMOTE

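# Load the cleaned dataset produced by the data transformation step
df = pd.read_csv("../data/filtered_transactions.csv")
# One-hot encode the categorical transaction type so every feature is numeric
df_encoded = pd.get_dummies(df, columns=["type"], drop_first=True)

# Separate the features from the binary fraud label
X = df_encoded.drop(columns=["isFraud"])
y = df_encoded["isFraud"]

# Stratify so the rare fraud class keeps the same proportion in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)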

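# SMOTE oversamples the minority (fraud) class; it is applied to the training split only
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

print("Original training set size:", len(X_train))
print("Training set size after SMOTE oversampling:", len(X_train_resampled))
print("Class counts after resampling:\n", y_train_resampled.value_counts())

Original training set size: 800000
Training set size after SMOTE oversampling: 1597924
Class counts after resampling: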
 isFraud
0    798962
1    798962
Name: count, dtype: int64
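The counts above can be checked by hand, and the idea behind SMOTE can be illustrated in a few lines. The following is a minimal sketch, not imblearn's implementation: `smote_like_sample` and the toy `minority` points are hypothetical names introduced here, and the real `SMOTE` class handles neighbour search and sampling ratios in its own way.

```python
import random

# Arithmetic on the printed counts: how many fraud rows were real vs. synthetic?
n_train = 800_000                       # original training rows (from the output above)
n_majority = 798_962                    # non-fraud rows after resampling (unchanged by SMOTE)
n_minority = n_train - n_majority       # real fraud rows in the training split
n_synthetic = n_majority - n_minority   # synthetic fraud rows needed to balance the classes
fraud_rate = n_minority / n_train

print(n_minority, n_synthetic, f"{fraud_rate:.4%}")


def smote_like_sample(minority, k=2, rng=None):
    """Hypothetical helper: build one synthetic point by interpolating between
    a random minority point and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    # Sort the remaining minority points by squared Euclidean distance to the base point
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment between base and neighbour
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Toy minority-class points (made up for illustration)
minority = [[1.0, 2.0], [2.0, 3.0], [8.0, 9.0]]
synthetic_point = smote_like_sample(minority, k=1, rng=random.Random(42))
print(synthetic_point)
```

Only 1038 of the 800000 training rows are real fraud cases, so nearly all of the fraud rows the models train on are interpolated rather than observed.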


### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [32]:
rf = RandomForestClassifier(random_state=42, n_jobs=1)
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions={
        "n_estimators": [100, 200],
        "max_depth": [10, 20, None],
        "min_samples_split": [2, 5],
        "min_samples_leaf": [1, 2],
        "bootstrap": [True, False]
    },
    n_iter=3,
    scoring="f1",
    cv=3,
    verbose=1,
    random_state=42,
    n_jobs=-1
)

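# Caveat: the CV folds are cut from SMOTE-resampled data, so synthetic fraud rows in a
# validation fold may be interpolated from real rows in the training folds; treat the CV F1 as optimistic.
random_search.fit(X_train_resampled, y_train_resampled)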

Fitting 3 folds for each of 3 candidates, totalling 9 fits
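To put the "3 candidates" in the log line into perspective, the size of the full search space can be counted directly. This is a small sketch using only the standard library; the dictionary is copied from the cell above, and `n_combinations` and `total_fits` are names introduced here.

```python
from itertools import product

# Same search space as the RandomizedSearchCV call above
param_distributions = {
    "n_estimators": [100, 200],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "bootstrap": [True, False],
}

# Every distinct hyperparameter combination an exhaustive grid search would try
n_combinations = len(list(product(*param_distributions.values())))

n_iter, cv = 3, 3
total_fits = n_iter * cv             # matches "totalling 9 fits" in the log
coverage = n_iter / n_combinations   # fraction of the space actually sampled

print(n_combinations, total_fits, f"{coverage:.1%}")
```

With `n_iter=3`, only 3 of the 48 combinations are evaluated, so the reported best hyperparameters are the best of a small sample rather than of the whole grid. Raising `n_iter` trades training time for better coverage.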


### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, recall (also called sensitivity), and F1 score.  

In [33]:
best_rf = random_search.best_estimator_

y_pred = best_rf.predict(X_test)

print("Best Hyperparameters:", random_search.best_params_)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, zero_division=0))
print("Recall:", recall_score(y_test, y_pred, zero_division=0))
print("F1 Score:", f1_score(y_test, y_pred, zero_division=0))

Best Hyperparameters: {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': None, 'bootstrap': False}
Accuracy: 0.998965
Precision: 0.5625
Recall: 0.9034749034749034
F1 Score: 0.6933333333333334
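The metrics above are easier to interpret as raw counts. The sketch below recovers an approximate confusion matrix from the printed accuracy, precision, and recall. It assumes the test set has 200000 rows (inferred from the 800000 training rows and `test_size=0.2`), and `confusion_from_metrics` is a hypothetical helper, not a scikit-learn function. It also assumes precision and recall are both non-zero.

```python
def confusion_from_metrics(n, accuracy, precision, recall):
    """Recover (tp, fp, fn, tn) from summary metrics on n test rows.

    Uses: errors = fp + fn = n * (1 - accuracy),
          fp = tp * (1 / precision - 1),
          fn = tp * (1 / recall - 1).
    """
    errors = n * (1 - accuracy)
    tp = errors / ((1 / precision - 1) + (1 / recall - 1))
    fp = tp * (1 / precision - 1)
    fn = tp * (1 / recall - 1)
    tp, fp, fn = round(tp), round(fp), round(fn)
    tn = n - tp - fp - fn
    return tp, fp, fn, tn


n_test = 200_000  # assumption: 20% split of a 1,000,000-row dataset
tp, fp, fn, tn = confusion_from_metrics(
    n_test, accuracy=0.998965, precision=0.5625, recall=0.9034749034749034
)
print("TP, FP, FN, TN:", tp, fp, fn, tn)

# Accuracy of a trivial model that labels every transaction as not fraudulent
baseline_accuracy = (tn + fp) / n_test
print("Always-negative baseline accuracy:", baseline_accuracy)
```

Under these assumptions the counts come out to 234 true positives, 182 false positives, and 25 missed fraud cases. The always-negative baseline already reaches about 0.9987 accuracy, so accuracy alone says little on data this imbalanced. Precision, recall, and F1 are the more informative numbers.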


## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare accuracy metrics between these two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.

In [34]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

lr = LogisticRegression(max_iter=1000, random_state=42)

grid_search = GridSearchCV(
    estimator=lr,
    param_grid={
        "C": [0.01, 0.1, 1, 10],
        "penalty": ["l2"],
        "solver": ["lbfgs"]
    },
    scoring="f1",
    cv=3,
    verbose=1
)

grid_search.fit(X_train_resampled, y_train_resampled)

Fitting 3 folds for each of 4 candidates, totalling 12 fits


In [35]:
best_lr = grid_search.best_estimator_

y_pred = best_lr.predict(X_test)

print("Best Hyperparameters:", grid_search.best_params_)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, zero_division=0))
print("Recall:", recall_score(y_test, y_pred, zero_division=0))
print("F1 Score:", f1_score(y_test, y_pred, zero_division=0))

Best Hyperparameters: {'C': 10, 'penalty': 'l2', 'solver': 'lbfgs'}
Accuracy: 0.956805
Precision: 0.028789923526765633
Recall: 0.9884169884169884
F1 Score: 0.055950169380395584
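To compare the two result blocks, it helps to recompute F1 as the harmonic mean of precision and recall and to express precision as "alerts raised per real fraud caught". This is a short sketch using only the numbers printed above; `f1_from` and `alerts_per_hit` are names introduced here for illustration.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Metrics copied from the two output blocks in this notebook
reported = {
    "Random Forest": {
        "precision": 0.5625,
        "recall": 0.9034749034749034,
        "f1": 0.6933333333333334,
    },
    "Logistic Regression": {
        "precision": 0.028789923526765633,
        "recall": 0.9884169884169884,
        "f1": 0.055950169380395584,
    },
}

summary = {}
for name, m in reported.items():
    f1 = f1_from(m["precision"], m["recall"])
    alerts_per_hit = 1 / m["precision"]  # flagged transactions per true fraud caught
    summary[name] = (f1, alerts_per_hit)
    print(f"{name}: F1={f1:.4f}, alerts per caught fraud={alerts_per_hit:.1f}")
```

On these numbers, Logistic Regression catches slightly more fraud (recall 0.99 vs. 0.90) but flags roughly 35 transactions for every real fraud case, against fewer than 2 for the Random Forest. Because the harmonic mean is pulled toward the smaller of the two values, its F1 collapses to about 0.06. By that measure the Random Forest handles the class imbalance more effectively, though the right trade-off depends on the relative cost of a missed fraud and a false alarm.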


### (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.