#### **LAB for CLASSIFICATION MACHINE LEARNING**


<font color='Blue'>*Objective*:</font> Predict the probability that a customer will default on their credit card payment in the following month, based on the historical information provided in the dataset. This probability helps the collections team prioritise follow-up with customers who have a high propensity to default. The goal of the research is to explore and demonstrate the capabilities of various model-agnostic interpretability techniques for generating explainable insights into prediction model outcomes.

<font color='Blue'>*Input Data*:</font> The cleaned and pre-processed files from the feature engineering session are used as input.
  - x_train.csv, x_test.csv, y_train.csv, y_test.csv <br>
  - Location --> content/04_data_preprocessing_&_feature_engineering/Solution_Classification_preprocessing.ipynb
             

<font color='Blue'>*Outcome Expected*:</font> Build machine learning models, evaluate their performance, and compare ML algorithms across various performance metrics.

###                                                  Day 1 - Part_2 LAB

####  Basic Model Implementation

In [None]:
#importing required libraries
import pandas as pd

import time
import warnings
warnings.filterwarnings('ignore')

#Import Data balancing libraries
import imblearn
from imblearn.over_sampling import SMOTE

# Import models from sklearn
from sklearn.linear_model import LogisticRegression 
from sklearn.ensemble import RandomForestClassifier

# Import evaluation metrics
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_auc_score, roc_curve, auc

# Import tuning library from sklearn
from sklearn.model_selection import GridSearchCV

In [None]:
# Read the training & test datasets produced by the Part 1 feature engineering solution

x_train=pd.read_csv('../datasets/classification/processed/X_train.csv', index_col=0)
x_test=pd.read_csv('../datasets/classification/processed/X_test.csv', index_col=0)

y_train=pd.read_csv('../datasets/classification/processed/y_train.csv', index_col=0)
y_test=pd.read_csv('../datasets/classification/processed/y_test.csv', index_col=0)

print(x_train.shape,x_test.shape,y_train.shape,y_test.shape)

#### **Data Balancing**

In [None]:
# Check if the dataset requires Resampling
print("Original unbalanced dataset distribution: ", y_train.value_counts())

In [None]:
#importing SMOTETomek to handle class imbalance
from imblearn.combine import SMOTETomek

balanced_data = SMOTETomek()

# fit predictor and target variable
x_smote, y_smote = balanced_data.fit_resample(x_train, y_train)

In [None]:
print("Resampled balanced dataset distribution: ", y_smote.value_counts())
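
To make the effect of resampling on the class counts concrete, here is a minimal sketch on a tiny synthetic table. It deliberately swaps in plain random oversampling via `sklearn.utils.resample` rather than SMOTETomek, so it is a simpler point of comparison and not the method used above. The `toy` DataFrame and its column names are assumptions made up for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Synthetic, imbalanced toy data (hypothetical; not the lab dataset)
toy = pd.DataFrame({
    "feature": range(10),
    "default": [0] * 8 + [1] * 2,
})

majority = toy[toy["default"] == 0]
minority = toy[toy["default"] == 1]

# Random oversampling: draw minority rows with replacement
# until the minority class is as large as the majority class
minority_upsampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=42,
)

toy_balanced = pd.concat([majority, minority_upsampled], ignore_index=True)

before_counts = toy["default"].value_counts().to_dict()
after_counts = toy_balanced["default"].value_counts().to_dict()
print("Before:", before_counts)
print("After: ", after_counts)
```

As in the lab, resampling of this kind would be applied to the training split only, so that the test set keeps its original class distribution.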

#### **Logistic Regression Model**

In [None]:
# Fit the Logistic Regression Model
start = time.time()

logmodel = LogisticRegression()
logmodel.fit(x_smote,y_smote)

stop = time.time()

# predicting the y test observations
y_pred = logmodel.predict(x_test)
y_train_pred = logmodel.predict(x_smote)

In [None]:
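# Derive all scores for Logistic Regression.
# sklearn metrics expect the true labels first: metric(y_true, y_pred).
# Swapping the arguments would exchange precision and recall.
log_acctr = round(accuracy_score(y_smote, y_train_pred), 3)
log_acc = round(accuracy_score(y_test, y_pred), 3)

log_prec = round(precision_score(y_test, y_pred), 3)
log_rec = round(recall_score(y_test, y_pred), 3)

log_f1 = round(f1_score(y_test, y_pred), 3)
# ROC AUC is computed from the predicted probability of the positive class
# (assumed here to be the second column of predict_proba), not from hard labels
log_roc = round(roc_auc_score(y_test, logmodel.predict_proba(x_test)[:, 1]), 3)
log_time = stop - start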

results = pd.DataFrame([['Logistic Regression', log_acctr, log_acc, log_prec, log_rec, log_f1, log_roc, log_time]],
        columns = ['Model', 'Train Accuracy', 'Test Accuracy', 'Precision', 'Recall', 'F1 Score','ROC/AUC', 'Training Time(s)'])
results
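
The objective asks for a probability of default rather than only a class label. The sketch below shows one way such probabilities could be obtained and used to rank customers for follow-up. It runs on synthetic data from `make_classification` as a stand-in for the lab files, and the names `ranking`, `p_default` and `auc_from_proba` are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data (hypothetical; roughly 20% positives)
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Column 1 of predict_proba is the estimated probability of class 1
proba = model.predict_proba(X_te)[:, 1]

# Rank test rows from highest to lowest estimated probability
ranking = (
    pd.DataFrame({"row": range(len(proba)), "p_default": proba})
    .sort_values("p_default", ascending=False)
    .reset_index(drop=True)
)
print(ranking.head())

# ROC AUC takes the true labels first and the scores second
auc_from_proba = roc_auc_score(y_te, proba)
print(round(auc_from_proba, 3))
```

A ranked list like this lets a team work from the top down instead of depending on a single 0.5 cut-off.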

### Task 1 - **Build a different model, Evaluate the predictions & Compare the results**

In [None]:
# Import the Model from sklearn




In [None]:
# Fit the Model & Predict the outcome using x_smote, y_smote, x_test, y_test





In [None]:
# Evaluate the model predictions using various performance metrics





### Task 2 - **Re-train the model using data from a different balancing technique & Compare the results**

In [None]:
# Import & fit a different balancing technique 




In [None]:
# Use the new x_train and y_train to re-train the model (fit and predict) 






In [None]:
# Evaluate the new model predictions using performance metrics and compare the results






------------------------------------ **END of DAY 1 LAB** ------------------------------------

###                                                  Day 2 - Part_1

####  Ensemble Model Implementation

*NOTE:* Re-run the Day 1 lab code above (the run should take a few seconds) so that the same input files are loaded and the earlier results are available for comparison.

#### **Random Forest Classifier**

In [None]:
#fitting data into Random Forest Classifier
start = time.time()

rfc = RandomForestClassifier()
rfc.fit(x_smote,y_smote)

stop = time.time()

# predicting the test observations
y_pred = rfc.predict(x_test)
y_train_pred = rfc.predict(x_smote)

In [None]:
# Derive all scores for the Random Forest Classifier.
# sklearn metrics expect the true labels first: metric(y_true, y_pred)
rfc_acctr = round(accuracy_score(y_smote, y_train_pred), 3)
rfc_acc = round(accuracy_score(y_test, y_pred), 3)
rfc_prec = round(precision_score(y_test, y_pred), 3)
rfc_rec = round(recall_score(y_test, y_pred), 3)
rfc_f1 = round(f1_score(y_test, y_pred), 3)
# ROC AUC uses the predicted probability of the positive class, not hard labels
rfc_roc = round(roc_auc_score(y_test, rfc.predict_proba(x_test)[:, 1]), 3)
rfc_time = stop - start

results = pd.DataFrame([['Random Forest classifier', rfc_acctr, rfc_acc, rfc_prec, rfc_rec, rfc_f1, rfc_roc, rfc_time]],
            columns = ['Model', 'Train Accuracy', 'Test Accuracy', 'Precision', 'Recall', 'F1 Score','ROC/AUC', 'Training Time(s)'])
results
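
The imports at the top of the notebook include `confusion_matrix` and `classification_report`, which the summary table does not use. The sketch below shows how they could complement the single-number metrics. It runs on synthetic data rather than the lab files, and the variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data (hypothetical)
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.8, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

clf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_te, pred)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Per-class precision, recall and F1 in one text report
print(classification_report(y_te, pred, digits=3))
```

For a collections use case, the false-negative count (`fn`, defaulters predicted as non-defaulters) is often the cell worth watching most closely.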

### Task 3 -  **Build a different Ensemble model, Evaluate the predictions & Compare the results**

In [None]:
# Import a different Model from sklearn




In [None]:
# Fit the Model & Predict the outcome - x_smote, y_smote, x_test, y_test





In [None]:
# Evaluation of model predictions using various performance metrics





###                                                  Day 2 - Part_2 LAB

###  Hyperparameter tuning

#### Random Forest Classifier with tuning
Possible parameters to tune:

1. n_estimators in [10, 100, 1000]
2. max_features in ['sqrt', 'log2'], or an integer from 1 to 20
3. min_samples_split
4. min_samples_leaf
5. max_depth
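
When all five parameters listed above are searched together, an exhaustive grid grows quickly. One possible alternative, sketched here as an assumption and not as the lab's method, is `RandomizedSearchCV`, which samples a fixed number of combinations. The candidate values and the synthetic data are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for x_smote, y_smote (hypothetical)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate values for the five parameters (illustrative choices)
param_distributions = {
    "n_estimators": [10, 50, 100],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_depth": [2, 3, 5, 10],
}

# Sample 5 of the 216 possible combinations instead of trying them all
search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

The trade-off is coverage against run time: a random search may miss the best combination, but its cost is set by `n_iter` and not by the size of the grid.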

In [None]:
# Hyperparameter Grid
param_dict = {'n_estimators' : [10, 50, 100],
               'max_depth' : [2, 3, 5, 10]}

# Create an instance of the RandomForestClassifier
start = time.time()

rfch = RandomForestClassifier()

# Grid search
rfch_grid = GridSearchCV(estimator=rfch,
                       param_grid = param_dict,
                       cv = 3, verbose=2, scoring='roc_auc')
rfch_grid.fit(x_smote, y_smote)

stop = time.time()

In [None]:
# Displaying the best parameters
print(rfch_grid.best_estimator_)
print(rfch_grid.best_params_)

rfch_optimal_model = rfch_grid.best_estimator_

#class prediction of y on train and test sets
y_pred_rfch_grid = rfch_optimal_model.predict(x_test)
y_train_pred_rfch_grid = rfch_optimal_model.predict(x_smote)

In [None]:
# Derive all scores for the tuned Random Forest Classifier.
# sklearn metrics expect the true labels first: metric(y_true, y_pred)
rfch_acctr = round(accuracy_score(y_smote, y_train_pred_rfch_grid), 3)
rfch_acc = round(accuracy_score(y_test, y_pred_rfch_grid), 3)
rfch_prec = round(precision_score(y_test, y_pred_rfch_grid), 3)
rfch_rec = round(recall_score(y_test, y_pred_rfch_grid), 3)
rfch_f1 = round(f1_score(y_test, y_pred_rfch_grid), 3)
# ROC AUC uses the predicted probability of the positive class, not hard labels
rfch_roc = round(roc_auc_score(y_test, rfch_optimal_model.predict_proba(x_test)[:, 1]), 3)
rfch_time = stop - start

results = pd.DataFrame([['Random Forest tuned', rfch_acctr, rfch_acc, rfch_prec, rfch_rec, rfch_f1, rfch_roc, rfch_time]],
            columns = ['Model', 'Train Accuracy', 'Test Accuracy', 'Precision', 'Recall', 'F1 Score','ROC/AUC', 'Training Time(s)'])
results
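
Beyond `best_params_`, a fitted `GridSearchCV` keeps the cross-validated score of every combination in `cv_results_`. The sketch below shows how that could be turned into a small table to see how sensitive the score is to each setting. It uses synthetic data and a one-parameter grid, both of which are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for x_smote, y_smote (hypothetical)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

grid = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 5, 10]},
    cv=3,
    scoring="roc_auc",
)
grid.fit(X, y)

# One row per parameter combination, best-ranked first
columns = ["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
cv_table = (
    pd.DataFrame(grid.cv_results_)[columns]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(cv_table)
```

If the top rows differ by less than their `std_test_score`, the simpler setting (for example a shallower tree) may be a reasonable choice.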

### Task 4 - **Tune the model from Task 3, Evaluate the predictions & Compare the results**

In [None]:
# Define the hyperparameters and apply Gridsearch to find best parameters




In [None]:
# Fit & Predict the tuned Model





In [None]:
# Evaluation of tuned model





#### Compare the results of all 4 Tasks to see how the performance changes or improves based on the following factors:
1. Choice/Usage of Class Balancing techniques
2. Model selection
3. Hyperparameter tuning
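
Each scoring cell above assigns a fresh one-row `results` table, so earlier rows are replaced. For the comparison, one option is to collect one row per model into a single DataFrame. The sketch below assumes a hypothetical helper `score_row` (not part of the lab) and runs on synthetic data in place of `x_smote`, `y_smote`, `x_test` and `y_test`.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split


def score_row(name, model, X_tr, y_tr, X_te, y_te):
    """Fit `model` and return one row of metrics (hypothetical helper)."""
    start = time.time()
    model.fit(X_tr, y_tr)
    elapsed = time.time() - start

    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]  # positive-class probability
    return {
        "Model": name,
        "Train Accuracy": round(accuracy_score(y_tr, model.predict(X_tr)), 3),
        "Test Accuracy": round(accuracy_score(y_te, pred), 3),
        "Precision": round(precision_score(y_te, pred), 3),
        "Recall": round(recall_score(y_te, pred), 3),
        "F1 Score": round(f1_score(y_te, pred), 3),
        "ROC/AUC": round(roc_auc_score(y_te, proba), 3),
        "Training Time(s)": round(elapsed, 3),
    }


# Synthetic stand-in for the lab's train and test splits (hypothetical)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

all_results = pd.DataFrame([
    score_row("Logistic Regression", LogisticRegression(max_iter=1000),
              X_tr, y_tr, X_te, y_te),
    score_row("Random Forest", RandomForestClassifier(n_estimators=50, random_state=0),
              X_tr, y_tr, X_te, y_te),
])
print(all_results)
```

Sorting `all_results` by the metric that matters most for the use case (for example `Recall` or `ROC/AUC`) gives a quick side-by-side view across the tasks.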