## Table of contents

* [1. Background](#Background)
* [2. Imports](#import)
* [3. Load the data](#Load_Data)
* [4. Is the data imbalanced?](#imblnc)
* [5. Data preparation prior to model training](#pre)
* [6. Choose the right metrics for model evaluation](#metrics)
* [7. Train models](#model)
    * [7.1. Logistic regression](#lr)
        * [7.1.1. Make a pipeline for logistic regression model training](#pllr)
        * [7.1.2. Hyperparameter search using GridSearchCV for logistic regression](#gdlr)
        * [7.1.3. Best Logistic Regression model](#bestlr)
    * [7.2. Decision tree classifier](#dt)
        * [7.2.1. Make a pipeline for decision tree](#pldt)
        * [7.2.2. Hyperparameter search for decision tree classifier](#gddt)
        * [7.2.3. Decision tree classifier with best parameters](#bestdt)
        * [7.2.4. Feature importance assessment in the decision tree classifier](#fidt)
    * [7.3. Random forest classifier](#rf)
        * [7.3.1. Make a pipeline for random forest classifier](#plrf)
        * [7.3.2. Hyperparameter search for random forest classifier](#gdrf)
        * [7.3.3. Random forest classifier feature importance assessment](#firf)
    * [7.4. Gradient boosting classifier](#GB)
        * [7.4.1. Make a pipeline for gradient boosting classifier](#plgb)
        * [7.4.2. Hyperparameter tuning for gradient boosting classifier](#gdgb)
        * [7.4.3. Gradient boosting feature importance assessment](#figb)
* [8. Final model selection](#select)
    * [8.1. Logistic regression](#lr_test)
    * [8.2. Decision tree](#dt_test)
    * [8.3. Random forest](#rf_test)
    * [8.4. XGBoost](#XG_test)
    * [8.5. Discussion](#disc)
* [9. Save the best model](#save_model)
* [10. Summary](#discussion)    

# 1. Background  <a class='anchor' id='Background'></a>

In this notebook we apply several classification models to our cleaned data frame of accepted loans from Lending Club. Briefly, Lending Club used to be the biggest peer-to-peer lending platform. To decide on a loan application, Lending Club relies on the information applicants provide when they apply, such as income, length of employment, and credit history. In previous notebooks, we addressed missing data and explored the data to understand it better.

The goal is to predict whether a loan will default. The models are:

1. Logistic regression

2. Decision tree

3. Random forest

4. Gradient boosting



# 2. Imports <a class='anchor' id='import'></a>

We start by importing required packages.

In [28]:
import pandas as pd
import numpy as np

from sklearn.compose import make_column_selector as selector
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn import tree, metrics

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import classification_report, roc_auc_score, f1_score, precision_recall_curve, accuracy_score, confusion_matrix, balanced_accuracy_score
from sklearn.linear_model import LogisticRegression

from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

from sklearn.metrics import fbeta_score, make_scorer
from sklearn.inspection import permutation_importance

import itertools

import warnings
warnings.filterwarnings('ignore')

from imblearn.under_sampling import RandomUnderSampler

from sklearn.feature_selection import SelectKBest, chi2
from sklearn.compose import ColumnTransformer

import pickle


# 3. Load the data <a class='anchor' id='Load_Data'></a>

We load the cleaned data with pd.read_csv and drop the leftover index column.

In [29]:
df = pd.read_csv("C:\\Users\\somfl\\Documents\\Data Science Career Track\\LendingClub\\LendingClubClean.csv")
df = df.drop(columns=['Unnamed: 0'])

# 4. Is the data imbalanced? <a class='anchor' id='imblnc'></a>

A data set is called [imbalanced](https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data) if the minority class makes up a small percentage of the data set.

If the minority class (in our case the defaulted loans) makes up 20 to 40% of a data set, the data set is mildly imbalanced.
If the minority class is 1 to 20%, the data set is moderately imbalanced, and if the minority class is less than 1% of the data set, the data is extremely imbalanced.
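
The thresholds above can be written as a small helper. This is only a sketch: the function name `imbalance_level` is hypothetical, and because the quoted ranges share their endpoints, the choice of which bucket gets exactly 1%, 20%, or 40% is an assumption.

```python
def imbalance_level(minority_fraction):
    """Map the minority-class share (between 0 and 0.5) to a degree of imbalance."""
    if not 0 <= minority_fraction <= 0.5:
        raise ValueError("minority share must lie between 0 and 0.5")
    if minority_fraction < 0.01:
        return "extreme"
    if minority_fraction < 0.20:
        return "moderate"
    if minority_fraction <= 0.40:
        return "mild"
    # Assumption: above 40% the data set is not treated as imbalanced
    return "balanced"


print(imbalance_level(0.05))  # moderate
```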

To find out whether our data set is imbalanced, we look at what percentage of the loans in the data frame defaulted.

In [32]:
df.loan_status.value_counts()/len(df)

Fully Paid    0.783049
Default       0.216951
Name: loan_status, dtype: float64

Defaulted loans account for about 22% of our data set. Therefore, our data is mildly imbalanced, which may or may not be a problem. The usual suggestion is to model with the true distribution first and, if the results are not good enough, apply techniques such as undersampling to deal with the imbalance.

# 5. Data preparation prior to model training <a class='anchor' id='pre'></a>

Because our data frame is complex, we define two functions to handle all the preprocessing steps needed before training. These steps are:

1. Defining X and y
2. Applying a one-hot encoder to the categorical data
3. Undersampling the data with RandomUnderSampler
4. Splitting the data into training and testing sets
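
Of these steps, undersampling is probably the least familiar. As a rough illustration of what random undersampling does, here is a plain-Python sketch on made-up toy data. The helper `random_undersample` is hypothetical and is not imblearn's implementation; the notebook itself relies on RandomUnderSampler.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=42):
    """Randomly drop rows until every class is as small as the rarest one."""
    rng = random.Random(seed)
    # Group the rows by their class label
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    # Keep as many rows per class as the smallest class has
    n_keep = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        for row in rng.sample(group, n_keep):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data: 8 fully paid loans and 2 defaults (made-up values)
rows = [[i] for i in range(10)]
labels = ["Fully Paid"] * 8 + ["Default"] * 2
res_rows, res_labels = random_undersample(rows, labels)
print(Counter(res_labels))
```

After resampling, both classes contain two rows each, at the cost of discarding six majority-class rows.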


In [33]:
df.isna().sum()

revol_util                      880
dti                               0
chargeoff_within_12_mths          0
collections_12_mths_ex_med        0
inq_last_6mths                   30
open_acc                         29
mort_acc                      50030
annual_inc                        4
sub_grade                         0
loan_status                       0
installment                       0
int_rate                          0
term                              0
revol_bal                         0
emp_length                    77070
home_ownership                    0
num_rev_accts                 70277
pub_rec_bankruptcies              0
tax_liens                         0
loan_amnt                         0
Credit Length (year)             29
fico_score                        0
dtype: int64

## Model for deployment

In [35]:
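# Features used by the deployed model
categorical_columns = ['term']
numerical_columns = ['int_rate', 'loan_amnt', 'fico_score']

# Define the target (yy) and the features (XX)
yy = df['loan_status']
XX = df[numerical_columns + categorical_columns]

X_cols = XX.columns
y_col = ['status']

Xn = XX.to_numpy()
yn = yy.to_numpy()

# Undersample the majority class so both classes have the same number of rows
RU = RandomUnderSampler(random_state=42)
X_res, y_res = RU.fit_resample(Xn, yn)

# Rebuild data frames so the pipeline can select columns by name
X = pd.DataFrame(X_res, columns=X_cols)
y = pd.DataFrame(y_res, columns=y_col)

# Stratified 75/25 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25, random_state=42)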


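# Categorical features: fill missing values with the most frequent category, then one-hot encode
categorical_pipe = Pipeline(steps=[('cat_impute', SimpleImputer(strategy='most_frequent')), ('encoder', OneHotEncoder(drop='if_binary'))])
# Numerical features: fill missing values with the median, then scale to [0, 1]
numerical_pipe = Pipeline(steps=[('num_impute', SimpleImputer(strategy='median')), ('scaler', MinMaxScaler())])

preprocessing = ColumnTransformer(
    [
        ("cat", categorical_pipe, categorical_columns),
        ("num", numerical_pipe, numerical_columns),
    ]
)

# liblinear supports both penalties searched in the grid; the default lbfgs solver does not support 'l1'
lr = Pipeline(
    [
        ("preprocess", preprocessing),
        ("classifier", LogisticRegression(solver='liblinear', random_state=42))
    ])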

grid_params = {'classifier__penalty': ['l1','l2'],
               'classifier__C': [0.1,100]
              }
  

grid_cv = GridSearchCV(lr, param_grid=grid_params, cv=5, scoring = 'balanced_accuracy', n_jobs=-1)
grid_cv.fit(X_train, y_train)


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocess',
                                        ColumnTransformer(transformers=[('cat',
                                                                         Pipeline(steps=[('cat_impute',
                                                                                          SimpleImputer(strategy='most_frequent')),
                                                                                         ('encoder',
                                                                                          OneHotEncoder(drop='if_binary'))]),
                                                                         ['term']),
                                                                        ('num',
                                                                         Pipeline(steps=[('num_impute',
                                                                                          SimpleImputer(strategy='me
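
The numerical branch of the pipeline above imputes missing values with the median and then applies min-max scaling. The following plain-Python sketch illustrates the two operations on made-up values; the helper names are hypothetical and the code is not scikit-learn's implementation.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


fico = [660, None, 700, 740]  # made-up scores with one missing entry
scaled = min_max_scale(impute_median(fico))
print(scaled)  # [0.0, 0.5, 0.5, 1.0]
```

Inside a scikit-learn pipeline the median, minimum, and maximum are learned from the training data during fit and then reused unchanged on the test data.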

In [36]:
y_pred = grid_cv.best_estimator_.predict(X_test)
y_prob = grid_cv.best_estimator_.predict_proba(X_test)
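
The grid search above is scored with balanced_accuracy, which scikit-learn defines as the average of the recall obtained on each class. A minimal plain-Python sketch on made-up labels shows the idea; the helper below is illustrative and is not scikit-learn's code.

```python
def balanced_accuracy(y_true, y_pred):
    """Average of the per-class recalls, so each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Positions where the true label is this class
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Made-up labels: 2 defaults and 4 fully paid loans
y_true = ["Default", "Default", "Fully Paid", "Fully Paid", "Fully Paid", "Fully Paid"]
y_hat = ["Default", "Fully Paid", "Fully Paid", "Fully Paid", "Fully Paid", "Default"]
print(balanced_accuracy(y_true, y_hat))  # 0.625
```

Here the recall is 0.5 on the defaults and 0.75 on the fully paid loans, giving 0.625, whereas plain accuracy (4 correct out of 6) would weight the larger class more heavily.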


In [37]:
import pickle

# Save the best pipeline (preprocessing and classifier) for deployment
model = grid_cv.best_estimator_
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
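
For completeness, here is a sketch of how a pickled object is written and read back, as the deployed app would need to do. A plain dictionary with made-up contents stands in for the fitted pipeline, and the temporary path is an assumption made so the example runs anywhere.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted pipeline (hypothetical contents);
# any picklable object round-trips the same way
stand_in_model = {"name": "logistic_regression", "C": 0.1}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(stand_in_model, f)

# Later, for example in the deployed app, restore the object from disk
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in_model)  # True
```

With a real scikit-learn pipeline, the environment that loads the file should use the same library versions as the one that saved it, and pickle files should only be loaded from trusted sources.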

In [44]:
import os

# Write the Procfile, which holds the command that starts the Streamlit app.
# 'test.py' is the app script; replace it with the name of your own app file.
procfile_path = os.path.join('C:\\Users\\somfl\\Documents\\GitHub\\LoanDefaultDeploy', 'Procfile')
toFile = 'web: sh setup.sh && streamlit run test.py'
with open(procfile_path, "w") as file1:
    file1.write(toFile)