# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you created in your previous data transformation step (not the original dataset).

## Questions
Is this a classification or regression task?  

This is a classification task: the goal is to predict whether a transaction is fraudulent (`isFraud = 1`) or not (`isFraud = 0`). The target takes discrete class labels, not continuous values.

Are you predicting for multiple classes or binary classes?  

Binary classification. The model must decide between two classes: fraud (`1`) or not fraud (`0`).

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

Random Forest Classifier: handles many features well and can compensate for class imbalance through class weighting.

Logistic Regression: simple, interpretable, and effective for binary targets.
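Both models are instantiated later in this notebook with `class_weight='balanced'`. As a minimal sketch of what that setting computes, the snippet below applies scikit-learn's documented formula, `n_samples / (n_classes * count_per_class)`. The training counts (7,488 legitimate, 12 fraud) are an assumption inferred from the 2,496 / 4 test support and the 0.25 stratified split; substitute the real counts from your data.

```python
# Assumed training-set class counts (inferred, not read from the data):
# a 0.25 stratified test split holding 2,496 / 4 rows implies about 7,488 / 12 in training.
class_counts = {0: 7488, 1: 12}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "balanced" weighting: n_samples / (n_classes * count_for_that_class)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in balanced_weights.items():
    print(f"class {label}: weight {weight:.4f}")
```

Under these assumed counts, each fraud row is weighted about 624 times more heavily than a legitimate row during training, which is how both models are pushed to pay attention to the rare class.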

## First Model

Using the first model that you've chosen, implement the following steps.

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the cleaned and transformed data
df = pd.read_csv('/Users/talgat/Downloads/detect-fraud/data/transformed_dataset.csv')

# Separate the features from the binary target
X = df.drop('isFraud', axis=1)
y = df['isFraud']

# Stratify so the rare fraud class keeps the same proportion in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```
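The `stratify=y` argument matters a great deal when positives are this rare. The sketch below illustrates the effect on synthetic labels; `X_demo`, `y_demo`, and the 16-in-10,000 fraud rate are assumptions chosen for illustration, not values read from the project dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (assumption): 16 positives in 10,000 rows, i.e. 0.16% fraud.
y_demo = np.zeros(10_000, dtype=int)
y_demo[:16] = 1
X_demo = np.arange(10_000).reshape(-1, 1)  # placeholder feature column

# Stratified split: class proportions are preserved in both parts.
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# Plain random split: the number of positives in the test set is left to chance.
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

print("stratified   train/test positives:", y_tr_strat.sum(), y_te_strat.sum())
print("unstratified train/test positives:", y_tr_plain.sum(), y_te_plain.sum())
```

With stratification the 16 positives are divided 12 / 4 in line with the 75 / 25 split. Without it, the test set could end up with very few positives, or none, which would make recall and precision unstable or undefined.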


### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameter values to try
params = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20]
}

# class_weight='balanced' upweights the rare fraud class during training
rf = RandomForestClassifier(class_weight='balanced', random_state=42)

# Select the combination with the highest cross-validated recall
grid_rf = GridSearchCV(rf, param_grid=params, scoring='recall', cv=3, n_jobs=-1)
grid_rf.fit(X_train, y_train)

print("Best parameters:", grid_rf.best_params_)
```

```text
Best parameters: {'max_depth': None, 'n_estimators': 100}
```
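Selecting on recall alone ignores how many false alarms a configuration raises. One possible variation, sketched here on synthetic data, tracks several metrics during the search and refits on F1. The dataset from `make_classification`, the smaller grid, and the choice of F1 as the refit metric are all assumptions for illustration, not part of the original workflow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data (assumption): about 5% positives, standing in for the fraud set.
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=8, weights=[0.95], random_state=0
)

# Explicit stratified folds so every fold contains some positives.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(class_weight='balanced', random_state=42),
    param_grid={'n_estimators': [25, 50], 'max_depth': [None, 5]},
    scoring={'recall': 'recall', 'precision': 'precision', 'f1': 'f1'},
    refit='f1',  # pick the final model by F1 instead of recall alone
    cv=cv,
)
search.fit(X_demo, y_demo)

print("Best parameters by F1:", search.best_params_)
for name in ('recall', 'precision', 'f1'):
    best_mean = search.cv_results_[f'mean_test_{name}'][search.best_index_]
    print(f"mean CV {name}: {best_mean:.3f}")
```

Whether F1, recall, or another metric is the right selection criterion depends on the relative cost of a missed fraud versus a false alarm.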


### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, and recall (also called sensitivity).  

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

# GridSearchCV refits the best configuration on the full training set
best_rf = grid_rf.best_estimator_
y_pred_rf = best_rf.predict(X_test)

# Metrics for the positive (fraud) class, followed by the per-class report
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Precision:", precision_score(y_test, y_pred_rf))
print("Recall:", recall_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf, digits=4))
```

```text
Accuracy: 0.9988
Precision: 1.0
Recall: 0.25
              precision    recall  f1-score   support

           0     0.9988    1.0000    0.9994      2496
           1     1.0000    0.2500    0.4000         4

    accuracy                         0.9988      2500
   macro avg     0.9994    0.6250    0.6997      2500
weighted avg     0.9988    0.9988    0.9984      2500
```
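The random forest's recall of 0.25 comes from the default decision rule behind `predict`, which for a binary problem amounts to a 0.5 probability cutoff. One option worth exploring is to lower that cutoff using `predict_proba`. The sketch below runs on synthetic data; the dataset, the `forest` model, and the three thresholds are assumptions for illustration, and results on the fraud data may look quite different.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption), standing in for the fraud set.
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=8, weights=[0.97], flip_y=0.02, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

forest = RandomForestClassifier(
    n_estimators=50, class_weight='balanced', random_state=42
)
forest.fit(X_tr, y_tr)

# Probability of the positive class for each test row
fraud_proba = forest.predict_proba(X_te)[:, 1]

# Lowering the threshold flags more rows, so recall can only stay equal or rise.
recalls = {}
for threshold in (0.5, 0.3, 0.1):
    y_hat = (fraud_proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_hat)
    precision = precision_score(y_te, y_hat, zero_division=0)
    print(f"threshold {threshold:.1f}: "
          f"recall {recalls[threshold]:.3f}, precision {precision:.3f}")
```

A threshold should be chosen on validation data, not on the test set, so that the final test metrics remain an honest estimate.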



## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare accuracy metrics between these two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.

```python
from sklearn.linear_model import LogisticRegression

# liblinear supports both the l1 and l2 penalties searched below
lr = LogisticRegression(class_weight='balanced', max_iter=1000, solver='liblinear')
params_lr = {
    'C': [0.01, 0.1, 1.0, 10],
    'penalty': ['l1', 'l2']
}

# Same search setup as the first model: 3-fold CV, selecting on recall
grid_lr = GridSearchCV(lr, param_grid=params_lr, scoring='recall', cv=3, n_jobs=-1)
grid_lr.fit(X_train, y_train)

print("Best parameters for Logistic Regression:", grid_lr.best_params_)

best_lr = grid_lr.best_estimator_
y_pred_lr = best_lr.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred_lr))
print("Precision:", precision_score(y_test, y_pred_lr))
print("Recall:", recall_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr, digits=4))
```

```text
Best parameters for Logistic Regression: {'C': 0.01, 'penalty': 'l1'}
Accuracy: 0.9872
Precision: 0.1111111111111111
Recall: 1.0
              precision    recall  f1-score   support

           0     1.0000    0.9872    0.9935      2496
           1     0.1111    1.0000    0.2000         4

    accuracy                         0.9872      2500
   macro avg     0.5556    0.9936    0.5968      2500
weighted avg     0.9986    0.9872    0.9923      2500
```
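To compare the two models side by side, it helps to work from confusion counts instead of accuracy. The counts below are reconstructed from the two classification reports above (support, precision, and recall), not recomputed from the data, and the `summarize` helper and the choice of F2 as a recall-weighted score are assumptions for illustration.

```python
# Confusion counts derived from the reported support, precision, and recall.
confusion = {
    'Random Forest':       {'tp': 1, 'fp': 0,  'fn': 3, 'tn': 2496},
    'Logistic Regression': {'tp': 4, 'fp': 32, 'fn': 0, 'tn': 2464},
}


def summarize(tp, fp, fn, tn, beta=2.0):
    """Return positive-class metrics; beta > 1 weights recall above precision."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    def f_score(b):
        denom = b * b * precision + recall
        return (1 + b * b) * precision * recall / denom if denom else 0.0

    return {
        'accuracy': (tp + tn) / (tp + fp + fn + tn),
        'precision': precision,
        'recall': recall,
        'f1': f_score(1.0),
        'f_beta': f_score(beta),
        'false_alarms': fp,
        'missed_frauds': fn,
    }


summary = {name: summarize(**counts) for name, counts in confusion.items()}

for name, metrics in summary.items():
    print(name)
    for key, value in metrics.items():
        print(f"  {key}: {value:.4f}" if isinstance(value, float) else f"  {key}: {value}")
```

On this split, logistic regression caught all 4 frauds at the cost of 32 false alarms, while the random forest raised no false alarms but missed 3 of the 4 frauds. With only 4 positive test rows, each one moves recall by 25 points, so these figures are indicative at best.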





### (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.
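As a starting point for this step, one way to compare several candidates is to score them all on the same stratified folds with average precision (area under the precision-recall curve), which is less sensitive to the decision threshold than recall alone. This is a sketch on synthetic data: the dataset, the choice of `GradientBoostingClassifier` as a hypothetical third model, and the settings shown are assumptions, not results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (assumption), standing in for the fraud set.
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=8, weights=[0.95], random_state=0
)

candidates = {
    'random_forest': RandomForestClassifier(
        n_estimators=50, class_weight='balanced', random_state=42
    ),
    'logistic_regression': LogisticRegression(
        class_weight='balanced', max_iter=1000, solver='liblinear'
    ),
    # Hypothetical third model, chosen only for illustration
    'gradient_boosting': GradientBoostingClassifier(n_estimators=50, random_state=42),
}

# The same folds are reused for every model so the scores are comparable.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
pr_auc = {
    name: cross_val_score(
        model, X_demo, y_demo, cv=cv, scoring='average_precision'
    ).mean()
    for name, model in candidates.items()
}

for name, score in sorted(pr_auc.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean average precision {score:.3f}")
```

To apply this to the project data, replace `X_demo` and `y_demo` with `X_train` and `y_train`, and keep the test set aside for the final evaluation.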