# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you created in your previous data transformation step (not the original dataset).

## Questions
Is this a classification or regression task?  

- Classification

Are you predicting for multiple classes or binary classes?  

- Binary

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

- Logistic Regression, SVM, KNN

## First Model

Using the first model that you've chosen, implement the following steps.

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [9]:
# imports
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
import pandas as pd 
import numpy as np
import time

# using the subset of the entire dataset
transactions = pd.read_csv("../data/bank_transactions_transformed.csv")
transactions.head()

Unnamed: 0,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER,isFraud
0,4991.92,0.0,0.0,0.0,0.0,False,False,False,True,False,0
1,190509.48,2077257.8,2267767.28,1156520.81,966011.33,True,False,False,False,False,0
2,63289.56,0.0,0.0,7564302.19,7627591.75,False,True,False,False,False,0
3,69590.65,0.0,0.0,357084.05,426674.7,False,True,False,False,False,0
4,154130.51,5133566.98,5287697.49,1273335.54,1119205.04,True,False,False,False,False,0


In [10]:
# separate the predictors (X) from the target label (y)
X = transactions.drop(['isFraud'], axis=1)
y = transactions['isFraud']
# view first 5 rows of predictors
X.head()

Unnamed: 0,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
0,4991.92,0.0,0.0,0.0,0.0,False,False,False,True,False
1,190509.48,2077257.8,2267767.28,1156520.81,966011.33,True,False,False,False,False
2,63289.56,0.0,0.0,7564302.19,7627591.75,False,True,False,False,False
3,69590.65,0.0,0.0,357084.05,426674.7,False,True,False,False,False
4,154130.51,5133566.98,5287697.49,1273335.54,1119205.04,True,False,False,False,False


In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.head()

Unnamed: 0,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
11633,1006357.0,1006357.0,0.0,139905.1,1146262.0,False,True,False,False,False
221,13097.52,231110.1,218012.63,0.0,0.0,False,False,False,True,False
558,148666.6,0.0,0.0,512664.5,661331.1,False,True,False,False,False
6435,184415.3,0.0,0.0,1658737.0,1843152.0,False,True,False,False,False
1327,162490.2,35693.0,0.0,0.0,162490.2,False,True,False,False,False
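
If the fraud label were imbalanced, a plain random split could leave the two sets with different class ratios. The following is a minimal sketch of a stratified split, assuming a hypothetical toy dataset (`X_demo`, `y_demo`) rather than the notebook's transactions; the only change relative to the call above is the `stratify` argument.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data (NOT the notebook's dataset): 1,000 rows, 10% positives
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)

# stratify=y_demo keeps the class ratio the same in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Positive rate (train): {y_tr.mean():.2f}")
print(f"Positive rate (test):  {y_te.mean():.2f}")
```

On the real data, the equivalent call would pass `stratify=y`; whether that changes the results depends on how balanced `isFraud` is in the transformed file.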


### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [12]:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=1.0)

lasso.fit(X_train, y_train)
print("Learned coefficients", lasso.coef_, '\n')

Learned coefficients [ 5.63683670e-09  4.52380019e-07 -4.76661953e-07  4.98983239e-08
 -8.49855750e-08 -0.00000000e+00  0.00000000e+00 -0.00000000e+00
 -0.00000000e+00  0.00000000e+00] 



In [13]:
# predict on the training data and compute the training error
y_train_pred = lasso.predict(X_train)
mse_train = mean_squared_error(y_train, y_train_pred)

print(f"Train MSE (Basic LASSO, alpha=1.0): {mse_train:.2f}")

Train MSE (Basic LASSO, alpha=1.0): 0.15


In [14]:
# RandomizedSearchCV over a list of candidate alpha values
alpha_grid = {'alpha': np.linspace(0.01, 10, 100)}

lasso_model = Lasso()

# start the timer
start_time = time.time()

random_search = RandomizedSearchCV(estimator=lasso_model, param_distributions=alpha_grid, cv=5)
random_search.fit(X_train, y_train)

# end the timer after we're done fitting
end_time = time.time()
# calculate elapsed time
elapsed_time_random = end_time - start_time

# Extract the best model from random search and score it on the test set
best_alpha_random = random_search.best_params_['alpha']
y_test_pred_random = random_search.best_estimator_.predict(X_test)
mse_test_random = mean_squared_error(y_test, y_test_pred_random)

print(f"RandomizedSearchCV - Best alpha: {best_alpha_random}")
print(f"RandomizedSearchCV - Test MSE: {mse_test_random:.2f}")
print(f"RandomizedSearchCV - Time elapsed: {elapsed_time_random:.2f} seconds")


RandomizedSearchCV - Best alpha: 9.798181818181817
RandomizedSearchCV - Test MSE: 0.17
RandomizedSearchCV - Time elapsed: 27.95 seconds
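
Because `isFraud` is a binary label, a search scored with a classification metric is arguably a closer fit to this task than MSE on a regression model. The sketch below is one possible setup, not a definitive recipe: it assumes synthetic data from `make_classification`, samples `C` from a log-uniform distribution (`scipy.stats.loguniform`) instead of a fixed list, and fixes `random_state` so the sampled candidates are reproducible. The ranges and `n_iter` are illustrative choices.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the transformed transactions (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Sample the inverse regularization strength C on a log scale
param_dist = {"C": loguniform(1e-3, 1e2)}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_dist,
    n_iter=20,            # number of sampled candidates
    cv=5,
    scoring="accuracy",   # classification metric instead of MSE
    random_state=42,      # makes the sampled candidates reproducible
)
search.fit(X_tr, y_tr)

print(f"Best C: {search.best_params_['C']:.4f}")
print(f"Cross-val accuracy: {search.best_score_:.2f}")
```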


### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, recall (sensitivity), and specificity.  

In [15]:
# Randomly search for the best hyperparameters on a logistic regression model
param_dist = {
    'penalty': ['l1', 'l2'],
    'C': np.linspace(0.01, 1, 100),
    'solver': ['saga'], 
    'max_iter': [10000]
}

model = RandomizedSearchCV(LogisticRegression(), param_distributions=param_dist, cv=5, scoring='accuracy', random_state=42)
model.fit(X_train, y_train)

# Best model from random search
best_params_random = model.best_params_
best_score_random = model.best_score_

print(f"RandomizedSearchCV - Best Params: {best_params_random}")
print(f"RandomizedSearchCV - Cross-Val Accuracy: {best_score_random:.2f}")

RandomizedSearchCV - Best Params: {'solver': 'saga', 'penalty': 'l2', 'max_iter': 10000, 'C': np.float64(0.48000000000000004)}
RandomizedSearchCV - Cross-Val Accuracy: 0.90


In [16]:
# Use the best model found from RandomizedSearchCV to predict on unseen test data

# extract the best estimator
best_log = model.best_estimator_

# predict on testing data
log_predictions = best_log.predict(X_test)

# evaluate its accuracy
test_score = accuracy_score(y_test, log_predictions)

print(f"RandomizedSearchCV - Coefficients: {best_log.coef_}")
print(f"RandomizedSearchCV - Test Accuracy: {test_score:.2f}")

RandomizedSearchCV - Coefficients: [[-5.88886710e-06  1.93285136e-05 -3.49054039e-05  6.25244605e-06
  -7.26058342e-06 -4.17485393e-11  6.40006883e-10 -1.03933209e-11
  -1.05339038e-09  2.24334517e-10]]
RandomizedSearchCV - Test Accuracy: 0.90
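
The very small coefficient magnitudes above are consistent with features on very different scales (raw currency amounts next to 0/1 indicator columns), and the `saga` solver generally converges faster on standardized inputs. One common remedy, shown here only as a sketch on synthetic data, is to put a `StandardScaler` in a pipeline so that scaling is refit inside each cross-validation fold. The `1e5` multiplier is an assumption used to mimic large amounts, and `n_iter=10` is an illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, inflated to mimic currency-sized values (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_demo = X_demo * 1e5

# Scaling lives inside the pipeline, so each CV fold is scaled on its own training part
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", max_iter=10000),
)

# Pipeline parameters are addressed as <step name>__<parameter>
param_dist = {"logisticregression__C": np.linspace(0.01, 1, 100)}

search = RandomizedSearchCV(
    pipe, param_distributions=param_dist, n_iter=10, cv=5,
    scoring="accuracy", random_state=42,
)
search.fit(X_demo, y_demo)

print(f"Best params: {search.best_params_}")
print(f"Cross-val accuracy: {search.best_score_:.2f}")
```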


In [17]:
# Import for metrics
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_pred = model.predict(X_test)

# generate a confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)
# calculate all measures of accuracy
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# calculate specificity by hand
tn, fp, fn, tp = cm.ravel()
specificity = tn / (tn + fp)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall (Sensitivity): {recall:.2f}")
print(f"Specificity: {specificity:.2f}")
print(f"F1 Score: {f1:.2f}")

Confusion Matrix:
 [[1491   88]
 [ 219 1398]]
Accuracy: 0.90
Precision: 0.94
Recall (Sensitivity): 0.86
Specificity: 0.94
F1 Score: 0.90
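
As a sanity check, each reported metric can be recomputed by hand from the four confusion-matrix counts. The sketch below hard-codes the counts printed above and uses only the standard definitions.

```python
# Counts copied from the confusion matrix above: [[tn, fp], [fn, tp]]
tn, fp, fn, tp = 1491, 88, 219, 1398

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are true
recall = tp / (tp + fn)                      # share of actual positives that were found
specificity = tn / (tn + fp)                 # share of actual negatives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall (Sensitivity): {recall:.2f}")
print(f"Specificity: {specificity:.2f}")
print(f"F1 Score: {f1:.2f}")
```

The rounded values agree with the scikit-learn output above.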


## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare accuracy metrics between these two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.
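
One possible starting point for the second model is an SVM tuned the same way as the first model. This is a sketch under assumptions, not a finished solution: it uses synthetic data in place of the transformed transactions, wraps `SVC` in a scaling pipeline, and searches illustrative log-uniform ranges for `C` and `gamma`.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); replace with X and y from the notebook
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# SVMs are sensitive to feature scale, so scaling is part of the pipeline
svm_pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

param_dist = {
    "svc__C": loguniform(1e-2, 1e2),      # illustrative range
    "svc__gamma": loguniform(1e-4, 1e0),  # illustrative range
}

svm_search = RandomizedSearchCV(
    svm_pipe, param_distributions=param_dist, n_iter=10, cv=5,
    scoring="accuracy", random_state=42,
)
svm_search.fit(X_tr, y_tr)

# Step (3): evaluate the refit best estimator on the held-out test set
svm_pred = svm_search.predict(X_te)
test_acc = accuracy_score(y_te, svm_pred)
test_recall = recall_score(y_te, svm_pred)

print(f"Best params: {svm_search.best_params_}")
print(f"Test accuracy: {test_acc:.2f}")
print(f"Test recall: {test_recall:.2f}")
```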

## (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.
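
To answer the comparison question, it can help to score every candidate with the same folds and the same metrics. The sketch below assumes synthetic data and untuned default hyperparameters (with `n_neighbors=5` for KNN), so the numbers it prints say nothing about the real dataset; it only illustrates one way to lay the models side by side.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); replace with X_train and y_train
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# The three models named in the Questions section, untuned for brevity
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

metrics = ["accuracy", "precision", "recall", "f1"]
rows = []
for name, estimator in candidates.items():
    pipe = make_pipeline(StandardScaler(), estimator)
    scores = cross_validate(pipe, X_demo, y_demo, cv=5, scoring=metrics)
    # average each metric over the 5 folds
    row = {"model": name}
    for metric in metrics:
        row[metric] = scores[f"test_{metric}"].mean()
    rows.append(row)

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(2))
```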