## Hi guys! Today I have chosen an important problem to work on. Let me break it down in simple words, and you will soon see why it matters.

### -  Financial institutions invest heavily in building credit risk models that estimate a potential borrower's probability of default. These models indicate the level of a borrower's credit risk at any given time.

### -  Some of you might be wondering what "credit" is. Well, here comes the definition:

<div style="width:100%;text-align: center;"> <img align=middle src="https://www.greenbiz.com/sites/default/files/2020-09/definition_conceptart.jpg" alt="Definition concept art" style="height:300px;margin-top:3rem;"> </div>

### <span style="color:red;"> - "Credit is the ability to borrow money or access goods or services with the understanding that you'll pay later." </span>

### <span style="color:blue;"> - "Creditworthiness is how a lender determines how likely you are to default on your debt obligations, or how worthy you are to receive new credit. Your creditworthiness is what creditors look at before they approve any new credit to you." </span>

### Credit risk shows up in every area of finance that involves lending: mortgages, credit cards, and other kinds of loans. There is always a chance that the borrower will not repay the amount owed.

So when a borrower applies for a loan, the lender or issuer must assess the borrower's ability to repay it. In this notebook, I will do the following:

1. Exploring the Dataset (**EDA**)
2. Applying **Oversampling** Techniques
3. Testing all machine learning models with **Cross Validation**
4. **Hyperparameter tuning** the best model
5. Analysing the best model with the help of relevant **Metrics**
6. **Pickling** the best model
7. Creating a **UI** (User Interface) with the help of **Streamlit**
8. **Deployment** on the **Heroku** platform
9. Making a sample prediction to test the application

The 7th and 8th steps will be shown and explained with the help of screenshots.

# 1. Exploring the Dataset (EDA)

## About the Dataset:

### The dataset consists of the following features:

| Name | Description |
|---|---|
| `person_age` | Age of the person |
| `person_income` | Annual income |
| `person_home_ownership` | Home ownership |
| `person_emp_length` | Employment length (in years) |
| `loan_intent` | Loan intent |
| `loan_grade` | Loan grade |
| `loan_amnt` | Loan amount |
| `loan_int_rate` | Interest rate |
| `loan_status` | Loan status (0 is non-default / 1 is default) |
| `loan_percent_income` | Loan amount as a fraction of income |
| `cb_person_default_on_file` | Historical default |
| `cb_person_cred_hist_length` | Credit history length |

## Part 1:

- Some data wrangling
- Some outlier removal based on domain knowledge
- Use Column Transformer and Pipeline to streamline the process
- Use Randomized Search to find the optimal set of parameters
- Automate the procedure for multiple classifiers
- Plot the Precision-Recall Curve
- Plot the Learning Curve (for the bias-variance tradeoff / to check for overfitting or underfitting)

## Part 2:

- Rectify the existing model based on inferences from the learning curve and make a better one

# Part 1:

In [None]:
## Basic libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Preprocessing:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.experimental import enable_iterative_imputer  # must be imported before IterativeImputer
from sklearn.impute import IterativeImputer

## Model selection and models:
from sklearn.model_selection import train_test_split, learning_curve, RandomizedSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from lightgbm import LGBMClassifier

## Metrics:
# Note: plot_precision_recall_curve and plot_confusion_matrix were removed in scikit-learn 1.2.
# On newer versions, use PrecisionRecallDisplay.from_estimator and ConfusionMatrixDisplay.from_estimator.
from sklearn.metrics import plot_precision_recall_curve
from sklearn.metrics import plot_confusion_matrix, confusion_matrix, classification_report

In [None]:
df = pd.read_csv("../input/credit-risk-dataset/credit_risk_dataset.csv")
df.head()

In [None]:
dups = df.duplicated()
dups.value_counts() #There are 165 Duplicated rows

In [None]:
df[dups]

In [None]:
df.query("person_age==23 & person_income==42000 &\
person_home_ownership=='RENT' & loan_int_rate==9.99")

In [None]:
df.shape

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df.shape

In [None]:
# X and y will be thought of as the entire training data
# X_test and y_test will be thought of as the out of sample data for model evaluation
# df["loan_status"] is the target variable.

X, X_test, y, y_test = train_test_split(df.drop('loan_status', axis=1), df['loan_status'],
                                        random_state=0,  test_size=0.2, stratify=df['loan_status'],
                                        shuffle=True)

In [None]:
df['loan_status'].value_counts(normalize=True)

In [None]:
y.value_counts(normalize=True)   #Note that the proportion remains the same because of stratify.

In [None]:
y_test.value_counts(normalize=True)
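To see what `stratify` buys us, here is a minimal sketch on a synthetic, imbalanced target. The toy labels and the 80/20 class ratio are assumptions for illustration, not the real dataset. The share of the positive class should come out nearly identical in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80% class 0, 20% class 1 (illustrative only)
toy_y = pd.Series([0] * 800 + [1] * 200)
toy_X = pd.DataFrame({"feature": np.arange(len(toy_y))})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=0
)

train_share = y_tr.mean()  # share of class 1 in the training split
test_share = y_te.mean()   # share of class 1 in the test split
print(train_share, test_share)
```

Without `stratify`, the two shares can drift apart by chance, which matters more the rarer the positive class is.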

In [None]:
np.round(X.isna().sum()* 100 / X.shape[0], 3) #Percentage of missing values per column. Only a small share is missing; the cost of dropping those rows is checked below

In [None]:
X.shape

In [None]:
X.dropna().shape

In [None]:
(X.shape[0] - X.dropna().shape[0]) / X.shape[0]  #Fraction of training rows with at least one missing value (originally hard-coded as (25932-22763)/25932, about 12.2%)

In [None]:
X[['person_income', 'loan_amnt', 'loan_percent_income']].head()

In [None]:
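# loan_percent_income appears to be derived from loan_amnt / person_income, so it is dropped as redundant
X.drop('loan_percent_income', axis=1, inplace=True)
X_test.drop('loan_percent_income', axis=1, inplace=True)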
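The reasoning behind the drop is that `loan_percent_income` looks like plain `loan_amnt / person_income`. Here is a hedged sketch of how one might check that; the three rows below are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the three columns inspected above
toy = pd.DataFrame({
    "person_income": [50000, 80000, 120000],
    "loan_amnt": [10000, 4000, 30000],
    "loan_percent_income": [0.20, 0.05, 0.25],
})

# Recompute the ratio and compare it with the stored column
recomputed = (toy["loan_amnt"] / toy["person_income"]).round(2)
is_redundant = bool(np.allclose(recomputed, toy["loan_percent_income"]))
print(is_redundant)
```

On the real data the same comparison could be run on `X` before the column is dropped; if it holds, the column adds no information beyond the other two.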

In [None]:
#To print the number of unique values:
for col in X:
    print(col, '--->', X[col].nunique())
    if X[col].nunique()<20:
        print(X[col].value_counts(normalize=True)*100)
    print()

In [None]:
X.describe()

In [None]:
num_cols = X.select_dtypes(exclude=["object"]).columns
num_cols

In [None]:
for col in num_cols:
    sns.histplot(X[col])
    plt.show()

In [None]:
X.loc[X['person_age']>=80, :]  #Using common sense, we can exclude rows whose age is >= 80

In [None]:
X = X.loc[X['person_age']<80, :]

In [None]:
X.shape

In [None]:
X.loc[X['person_emp_length']>=66, :]

In [None]:
# Rows where the person would have started working at age 14 or younger, which is implausible
df.query("person_age<=person_emp_length+14")

In [None]:
X = X.loc[(X['person_emp_length']<66) | (X['person_emp_length'].isna()), :]

In [None]:
# Since we've removed some rows from X, we need to pass these updates on to y as well,
# because y doesn't know that some of its corresponding rows in X have been deleted.
y = y.loc[X.index]
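Selecting the labels by index is what keeps features and target in sync after filtering. A tiny sketch with invented values (the ages, labels, and index numbers are assumptions for illustration):

```python
import pandas as pd

# Toy features and labels sharing the same (non-default) index
toy_X = pd.DataFrame({"person_age": [25, 144, 31, 90]}, index=[10, 11, 12, 13])
toy_y = pd.Series([0, 1, 0, 1], index=[10, 11, 12, 13])

# Filter rows of X, then keep only the labels whose index survived
toy_X = toy_X.loc[toy_X["person_age"] < 80]
toy_y = toy_y.loc[toy_X.index]

print(toy_y.tolist())
```

This only works because the split kept the original row index on both `X` and `y`; if either had been reset, the labels would no longer line up.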

In [None]:
cat_cols = X.select_dtypes(include=["object"]).columns
cat_cols

### Creating Pipelines:

In [None]:
num_pipe = Pipeline([
    ('impute', IterativeImputer()),     #MICE (Multivariate Imputation by Chained Equations)
    ('scale', StandardScaler()),
])

In [None]:
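ct = ColumnTransformer([
    ('num_pipe', num_pipe, num_cols),
    # sparse=False returns a dense array; the argument is named sparse_output from scikit-learn 1.2 onwards
    ('cat_cols', OneHotEncoder(sparse=False, handle_unknown='ignore'), cat_cols)
], remainder='passthrough')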
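To make the preprocessing concrete, here is a small sketch of the same idea on a toy frame. The column names and values are made up, and `sparse_threshold=0` is my own choice to force a dense result without relying on the version-specific `sparse` argument of `OneHotEncoder`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one missing numeric value (illustrative only)
toy = pd.DataFrame({
    "age": [22, 35, 41, 29, 50, 33],
    "income": [30000, 52000, np.nan, 41000, 90000, 47000],
    "home": ["RENT", "OWN", "RENT", "MORTGAGE", "OWN", "RENT"],
})

# Numeric branch: impute first, then scale
toy_num_pipe = Pipeline([
    ("impute", IterativeImputer(random_state=0)),
    ("scale", StandardScaler()),
])

toy_ct = ColumnTransformer(
    [
        ("num_pipe", toy_num_pipe, ["age", "income"]),
        ("cat_cols", OneHotEncoder(handle_unknown="ignore"), ["home"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

out = toy_ct.fit_transform(toy)
print(out.shape)  # 2 scaled numeric columns + 3 one-hot columns
```

The missing income is filled in by the imputer before scaling, so the output contains no NaN values.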

In [None]:
# Keys are the classifiers to try; values are their search spaces.
# 'coltf' is the name of the ColumnTransformer step in the final pipeline, so
# 'coltf__num_pipe__impute__estimator' treats the base estimator of the
# IterativeImputer as one more hyperparameter to search over.
grid = {
    RandomForestClassifier(random_state=0, n_jobs=-1, class_weight='balanced'):
    {'model__n_estimators':[300,400,500],
     'coltf__num_pipe__impute__estimator': [LinearRegression(), RandomForestRegressor(random_state=0),
                                        KNeighborsRegressor()]},

    LGBMClassifier(class_weight='balanced', random_state=0, n_jobs=-1):
    {'model__n_estimators':[300,400,500],
     'model__learning_rate':[0.001,0.01,0.1,1,10],
     'model__boosting_type': ['gbdt', 'goss', 'dart'],
     'coltf__num_pipe__impute__estimator':[LinearRegression(), RandomForestRegressor(random_state=0),
                                        KNeighborsRegressor()]},
}
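The double-underscore names follow the nesting of the steps: pipeline step, then transformer name, then inner step, then parameter. A quick way to check which names are valid is to list them with `get_params()`. The sketch below rebuilds a stripped-down pipeline with the same step names; the two column names are placeholders I made up.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same step names as in the notebook; column names are placeholders
demo_num_pipe = Pipeline([("impute", IterativeImputer()), ("scale", StandardScaler())])
demo_ct = ColumnTransformer([("num_pipe", demo_num_pipe, ["age", "income"])])
demo_pipe = Pipeline([("coltf", demo_ct), ("model", RandomForestClassifier())])

# Every key returned here is a valid name for a search space
param_names = sorted(demo_pipe.get_params().keys())
print([name for name in param_names if name.endswith("impute__estimator")])
print("model__n_estimators" in param_names)
```

No fitting is needed for this; `get_params()` works on an unfitted pipeline, so it is a cheap sanity check before launching a long search.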

In [None]:
for clf, param in grid.items():
    print(clf)
    print('-'*50)
    print(param)
    print('\n')

In [None]:
full_df = pd.DataFrame()
best_algos = {}

for clf, param in grid.items():
    pipe = Pipeline([
        ('coltf', ct),       #ct for the column transformer for preprocessing
        ('model', clf)
    ])

    gs = RandomizedSearchCV(estimator=pipe, param_distributions=param, scoring='accuracy',
                            n_jobs=-1, verbose=3, n_iter=4, random_state=0)
    
    gs.fit(X, y)
    
    all_res = pd.DataFrame(gs.cv_results_)

    temp = all_res.loc[:, ['params', 'mean_test_score']]
    algo_name = str(clf).split('(')[0]
    temp['algo'] = algo_name
    
    full_df = pd.concat([full_df, temp], ignore_index=True)
    best_algos[algo_name] = gs.best_estimator_

In [None]:
full_df.sort_values('mean_test_score', ascending=False)

In [None]:
full_df.sort_values('mean_test_score', ascending=False).iloc[0, 0]

In [None]:
be = best_algos['RandomForestClassifier']
be

In [None]:
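be.fit(X, y)  #Optional: with the default refit=True, RandomizedSearchCV has already refit best_estimator_ on all of X, y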
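A side note on the refit step: with the default `refit=True`, `RandomizedSearchCV` already retrains the best configuration on the full data it was given. A minimal sketch on synthetic data (the dataset and the tiny search space are assumptions chosen to run quickly):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic problem so the search finishes in seconds
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [10, 20, 30], "max_depth": [2, 4, 6]},
    n_iter=3, cv=3, scoring="accuracy", random_state=0,
)
search.fit(X_demo, y_demo)

# best_estimator_ can predict right away, without another call to fit
demo_preds = search.best_estimator_.predict(X_demo[:5])
print(search.best_params_, demo_preds.shape)
```

Calling `fit` again on the best estimator is harmless, it just repeats work that the search has already done.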

In [None]:
preds = be.predict(X_test)

In [None]:
confusion_matrix(y_test, preds)

In [None]:
plot_confusion_matrix(be, X_test, y_test)

In [None]:
print(classification_report(y_test, preds))

In [None]:
be.score(X_test, y_test)

### Final Accuracy on test set is 92.27%

### Plotting the Precision-Recall Curve:

In [None]:
plot_precision_recall_curve(estimator=be, X=X_test, y=y_test, name='model AUC')
baseline = y_test.sum() / len(y_test)
plt.axhline(baseline, ls='--', color='r', label=f'Baseline model ({round(baseline,2)})')
plt.legend(loc='best')

### Learning Curve:

In [None]:
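# learning_curve returns the training sizes plus per-fold training and validation scores
train_sizes, train_scores, val_scores = learning_curve(be, X, y, n_jobs=-1, scoring='accuracy')

In [None]:
# Average the scores over the cross-validation folds for each training size
plt.plot(train_sizes, train_scores.mean(axis=1), label='training accuracy')
plt.plot(train_sizes, val_scores.mean(axis=1),  label='validation accuracy')
plt.xlabel('training sample sizes')
plt.ylabel('accuracy')
plt.legend()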

### From the above diagram, we can conclude the following:
We can observe **Overfitting** because:

1. Training accuracy is high (meaning low bias)
2. Validation accuracy is noticeably lower than training accuracy (which indicates high variance)
3. There is a big gap between the training and validation curves (a consequence of high variance)

An overfit model is too complex for the data: it tries to learn even the "noise" in the training set, which is not desired because that noise does not carry over to unseen data.
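The same diagnosis can be read off the numbers instead of the plot. Below is a rough sketch on synthetic data with noisy labels, using an unconstrained decision tree as a stand-in for an overly flexible model. The dataset, noise level, and model are assumptions, not the notebook's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y) so there is something to overfit
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)

sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=5, scoring="accuracy"
)

# Average over folds, then look at the train/validation gap per training size
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gap = train_mean - val_mean

for n, tr, va, g in zip(sizes, train_mean, val_mean, gap):
    print(f"n={n:4d}  train={tr:.3f}  validation={va:.3f}  gap={g:.3f}")
```

A training score pinned near 1.0 together with a gap that does not close as the sample size grows is the pattern described in the three points above.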

### To address the problem of overfitting, the following remedial measures can be taken:

1. Add more training samples, if possible, to allow the model to learn better (which is not possible here)
2. Work with the data at hand by making a simpler model / reducing the complexity of the model:
    - Try reducing the number of features
    - Try increasing regularization (lambda)
    - Try pruning the decision trees
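As a rough illustration of the "simpler model" remedy, the sketch below compares an unconstrained decision tree with a depth-limited one on synthetic noisy data. The data, the single tree (instead of a random forest), and `max_depth=3` are all assumptions picked to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise, as a stand-in for a noisy real problem
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)

gaps = {}
for depth in (None, 3):  # None = grow the tree fully, 3 = limit its depth
    res = cross_validate(
        DecisionTreeClassifier(max_depth=depth, random_state=0),
        X_demo, y_demo, cv=5, scoring="accuracy", return_train_score=True,
    )
    train_acc = res["train_score"].mean()
    val_acc = res["test_score"].mean()
    gaps[depth] = train_acc - val_acc
    print(f"max_depth={depth}: train={train_acc:.3f}  validation={val_acc:.3f}")
```

Limiting depth trades some training accuracy for a smaller train/validation gap, which is the motivation for adding `max_depth` and `min_samples_split` to the search space.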

In [None]:
grid = {
    
    RandomForestClassifier(random_state=0, n_jobs=-1, class_weight='balanced'):
    {'model__n_estimators':[100,200,300],
     'model__max_depth':[5, 9, 13],
     'model__min_samples_split':[4,6,8],
     'coltf__num_pipe__impute__estimator': [LinearRegression(), RandomForestRegressor(random_state=0),
                                        KNeighborsRegressor()]},
    
#     LGBMClassifier(class_weight='balanced', random_state=0, n_jobs=-1):
#     {'model__n_estimators':[100,200,300],
#      'model__max_depth':[5, 9, 13],
#      'model__num_leaves': [7,15,31],
#      'model__learning_rate':[0.0001,0.001,0.01,0.1,],
#      'model__boosting_type': ['gbdt', 'goss', 'dart'],
#      'coltf__num_pipe__impute__estimator':[LinearRegression(), RandomForestRegressor(random_state=0),
#                                         KNeighborsRegressor()]} 
}

In [None]:
for clf, param in grid.items():
    print(clf)
    print('-'*50)
    print(param)
    print('\n')

In [None]:
full_df = pd.DataFrame()
best_algos = {}

for clf, param in grid.items():
    pipe = Pipeline([
        ('coltf', ct),
        ('model', clf)
    ])

    gs = RandomizedSearchCV(estimator=pipe, param_distributions=param, scoring='accuracy',
                            n_jobs=-1, verbose=3, n_iter=4)
    
    gs.fit(X, y)
    
    all_res = pd.DataFrame(gs.cv_results_)

    temp = all_res.loc[:, ['params', 'mean_test_score']]
    algo_name = str(clf).split('(')[0]
    temp['algo'] = algo_name
    
    full_df = pd.concat([full_df, temp])
    best_algos[algo_name] = gs.best_estimator_

In [None]:
full_df.sort_values('mean_test_score', ascending=False)

In [None]:
be = best_algos['RandomForestClassifier']
be

In [None]:
be.fit(X, y)

In [None]:
preds = be.predict(X_test)

In [None]:
confusion_matrix(y_test, preds)

In [None]:
plot_confusion_matrix(be, X_test, y_test)

In [None]:
print(classification_report(y_test, preds))

In [None]:
plot_precision_recall_curve(be, X_test, y_test)
baseline = y_test.sum() / len(y_test)
plt.axhline(baseline, ls='--', color='r', label=f'Baseline model ({round(baseline,2)})')
plt.legend(loc='best')

In [None]:
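# learning_curve returns the training sizes plus per-fold training and validation scores
train_sizes, train_scores, val_scores = learning_curve(be, X, y, n_jobs=-1, cv=5)

In [None]:
# Average the scores over the cross-validation folds for each training size
plt.plot(train_sizes, train_scores.mean(axis=1), label='training accuracy')
plt.plot(train_sizes, val_scores.mean(axis=1),  label='validation accuracy')
plt.xlabel('training sample sizes')
plt.ylabel('accuracy')
plt.legend()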

The model is clearly **OVERFITTING** the training data.

# Part 2:

Remedial measures:

1. Add more training samples, if possible, to allow the model to learn better.
2. Work with the data at hand by making a simpler model / reducing the complexity of the model:
    - Try reducing the number of features
    - Try increasing regularization (lambda)
    - Try pruning the decision trees

In [None]:
grid = {
    
    RandomForestClassifier(random_state=0, n_jobs=-1, class_weight='balanced'):
    {'model__n_estimators':[100,200,300],
     'model__max_depth':[5, 9, 13],
     'model__min_samples_split':[4,6,8],
     'coltf__num_pipe__impute__estimator': [LinearRegression(), RandomForestRegressor(random_state=0),
                                        KNeighborsRegressor()]},
    
#     LGBMClassifier(class_weight='balanced', random_state=0, n_jobs=-1):
#     {'model__n_estimators':[100,200,300],
#      'model__max_depth':[5, 9, 13],
#      'model__num_leaves': [7,15,31],
#      'model__learning_rate':[0.0001,0.001,0.01,0.1,],
#      'model__boosting_type': ['gbdt', 'goss', 'dart'],
#      'coltf__num_pipe__impute__estimator':[LinearRegression(), RandomForestRegressor(random_state=0),
#                                         KNeighborsRegressor()]} 
}

In [None]:
for clf, param in grid.items():
    print(clf)
    print('-'*50)
    print(param)
    print('\n')

In [None]:
full_df = pd.DataFrame()
best_algos = {}

for clf, param in grid.items():
    pipe = Pipeline([
        ('coltf', ct),
        ('model', clf)
    ])

    gs = RandomizedSearchCV(estimator=pipe, param_distributions=param, scoring='accuracy',
                            n_jobs=-1, verbose=3, n_iter=4)
    
    gs.fit(X, y)
    
    all_res = pd.DataFrame(gs.cv_results_)

    temp = all_res.loc[:, ['params', 'mean_test_score']]
    algo_name = str(clf).split('(')[0]
    temp['algo'] = algo_name
    
    full_df = pd.concat([full_df, temp])
    best_algos[algo_name] = gs.best_estimator_

In [None]:
full_df.sort_values('mean_test_score', ascending=False)

In [None]:
be = best_algos['RandomForestClassifier']
be

In [None]:
be.fit(X, y)

In [None]:
preds = be.predict(X_test)

In [None]:
confusion_matrix(y_test, preds)

In [None]:
plot_confusion_matrix(be, X_test, y_test)

In [None]:
print(classification_report(y_test, preds))

In [None]:
be.score(X_test, y_test)

In [None]:
plot_precision_recall_curve(be, X_test, y_test)
baseline = y_test.sum() / len(y_test)
plt.axhline(baseline, ls='--', color='r', label=f'Baseline model ({round(baseline,2)})')
plt.legend(loc='best')

In [None]:
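# learning_curve returns the training sizes plus per-fold training and validation scores
train_sizes, train_scores, val_scores = learning_curve(be, X, y, n_jobs=-1, cv=5)

In [None]:
train_sizes

In [None]:
train_scores

In [None]:
val_scores

In [None]:
# Average the scores over the cross-validation folds for each training size
plt.plot(train_sizes, train_scores.mean(axis=1), label='training accuracy')
plt.plot(train_sizes, val_scores.mean(axis=1),  label='validation accuracy')
plt.xlabel('training sample sizes')
plt.ylabel('accuracy')
plt.legend()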