# Credit Risk Modelling

**Description:**
In this competition, you must explore and cleanse a dataset consisting of over 111,000 loan records to determine the best way to predict whether a loan applicant will fully repay or default on a loan. You must then build a machine learning model that returns the unique loan ID and a loan status label that indicates whether the loan will be fully paid or charged off.

### Getting all the Dependencies

# Installing dependencies
!pip install --upgrade pip
!pip install numpy --upgrade --user
!pip install pandas --upgrade --user
!pip install scikit-learn --upgrade --user

In [1]:
# Avoiding Warnings
import warnings
warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
# Importing Dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

ModuleNotFoundError: No module named 'sklearn.experimental'

### Loading the Data

In [None]:
# Importing Dataset
train = pd.read_csv("https://dim-mlpython.s3.amazonaws.com/CreditRiskModeling/train.csv", low_memory=False)
test = pd.read_csv("https://dim-mlpython.s3.amazonaws.com/CreditRiskModeling/test.csv", low_memory=False)

### Describing the data

In [None]:
# Replace spaces in each column heading with underscores and convert it to lowercase
cleancolumn = []
for i in range(len(train.columns)):
    cleancolumn.append(train.columns[i].replace(' ', '_').lower())
train.columns = cleancolumn

In [None]:
train.head()

In [None]:
print(train.shape)
train.describe()

In [None]:
train.info()

### Removing the duplicates

Each loan should have a unique Loan ID, so we use Loan ID to detect and remove duplicate records.

In [None]:
# Count the unique loan IDs to see whether there are any duplicates
unique_loanid=train['loan_id'].unique().tolist()
print("Total samples in data:", str(train.shape[0]))
print("Total unique samples in data:", str(len(unique_loanid)))
print("Duplicate samples in data:", str(train.shape[0] - len(unique_loanid)))

In [None]:
# Drop the duplicates
train = train.drop_duplicates()
print("Total samples in data:", str(train.shape[0]))
print("Total unique samples in data:", str(len(unique_loanid)))
print("Duplicate samples in data:", str(train.shape[0] - len(unique_loanid)))

In [None]:
#Get the duplicates
dup_loanid=train[train.duplicated(['loan_id'],keep=False)]
print(dup_loanid.shape)
dup_loanid.describe()

In [None]:
# Sort the duplicated rows in ascending order, placing NaNs last
sorted_df=dup_loanid.sort_values(['current_loan_amount', 'credit_score'], ascending=True, na_position='last')
sorted_df.head()

In [None]:
# Keep the first row per loan ID, which after sorting is the one with genuine values
correct_df = sorted_df.drop_duplicates(['loan_id'], keep='first')
print(correct_df.shape)
correct_df.head()

In [None]:
# Check whether any 99999999 placeholder remains among the de-duplicated rows
correct_df[correct_df['current_loan_amount']==99999999]

In [None]:
# Check whether any NaNs remain in credit_score among the de-duplicated rows
correct_df[correct_df['credit_score'].isnull()]

In [None]:
# Drop every row whose loan ID is duplicated (the genuine copies are added back below)
train.drop_duplicates(['loan_id'], keep=False, inplace=True)

In [None]:
train.shape

In [None]:
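# Build the final train data by adding back one genuine row per duplicated loan ID
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
train = pd.concat([train, correct_df], ignore_index=True)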
print(train.shape)
train.describe()
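
The de-duplication steps above can be sketched on a small toy frame. The values below are hypothetical and chosen only to show why sorting before `drop_duplicates` keeps the more complete row for each loan ID.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): loans "A" and "C" appear twice, once with a
# placeholder amount or missing credit score and once with genuine values.
loans = pd.DataFrame({
    "loan_id": ["A", "A", "B", "C", "C"],
    "current_loan_amount": [99999999, 12000, 8000, 15000, 15000],
    "credit_score": [np.nan, 710.0, 690.0, np.nan, 720.0],
})

# Sort ascending with NaNs last, so placeholder amounts and missing scores
# end up after the genuine rows.
ordered = loans.sort_values(
    ["current_loan_amount", "credit_score"],
    ascending=True,
    na_position="last",
)

# Keep the first row seen for each loan_id, then restore a readable order.
deduped = (
    ordered.drop_duplicates("loan_id", keep="first")
    .sort_values("loan_id")
    .reset_index(drop=True)
)
print(deduped)
```

Applying this to the whole frame in one pass would avoid splitting out the duplicates and adding them back, though it assumes the same sort keys are appropriate for every row.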

### Preprocessing / Cleaning the data

#### Feature: Years in Current Job

Remove the special characters and words (such as "years") and make the feature numeric.

In [None]:
train['years_in_current_job'].unique()

In [None]:
train['years_in_current_job'] = [0 if str(x)=='< 1 year' else x if str(x)=='nan' else int(re.findall(r'\d+', str(x))[0]) for x in train['years_in_current_job']]
train['years_in_current_job'].unique()
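
As an alternative to the list comprehension, the same parsing could be written with vectorized string methods. This is a minimal sketch; the category strings in `raw` are assumed examples and may not match the dataset exactly.

```python
import numpy as np
import pandas as pd

# Assumed example categories, including a missing entry.
raw = pd.Series(["< 1 year", "1 year", "3 years", "10+ years", np.nan])

# Pull out the first run of digits; missing entries stay NaN.
years = pd.to_numeric(raw.str.extract(r"(\d+)", expand=False))

# "< 1 year" contains the digit 1, so it has to be overridden explicitly.
years = years.mask(raw == "< 1 year", 0.0)
print(years.tolist())
```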

#### Feature: Credit Score

Credit scores should range from 0 to 800, but some values exceed this range. These appear to be data errors, so any score above 800 is divided by 10 to bring it back into range.

In [None]:
train['credit_score'].head(10)

In [None]:
# Function to bring credit score in range
def credit_range(x):
    if x > 800:
        return int(x/10)
    elif str(x) == 'nan' : 
        return x
    else:
        return int(x)

In [None]:
train['credit_score'] = train['credit_score'].map(credit_range)
train['credit_score'].head(10)
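
The same rescaling can be sketched without a row-wise function. The scores below are made-up examples; integer division by 10 matches `int(x/10)` for positive values.

```python
import numpy as np
import pandas as pd

# Hypothetical scores: two carry an extra trailing digit, one is missing.
scores = pd.Series([709.0, 7410.0, np.nan, 800.0, 6850.0])

# Comparisons with NaN evaluate to False, so missing scores are left untouched.
fixed = scores.mask(scores > 800, scores // 10)
print(fixed.tolist())
```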

#### Feature: Maximum Open Credit

The column contains a data error, the string `#VALUE!`, which must be removed before the column can be converted to numeric.

In [None]:
print(train.shape)
train[train['maximum_open_credit']=='#VALUE!']

In [None]:
train = train[train['maximum_open_credit'] != '#VALUE!']
train['maximum_open_credit']= pd.to_numeric(train['maximum_open_credit'])
train.shape
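
A slightly more general variant, sketched here on made-up values, lets `pd.to_numeric` coerce anything unparseable to NaN and then drops those rows. Unlike the exact-match filter above, it would also drop rows that were already missing or that contain other non-numeric strings, so it is an alternative rather than an equivalent.

```python
import pandas as pd

# Hypothetical values; '#VALUE!' is the error string seen in the data.
credit = pd.DataFrame({"maximum_open_credit": ["15000", "#VALUE!", "2300", "0"]})

# errors="coerce" turns anything that cannot be parsed into NaN.
credit["maximum_open_credit"] = pd.to_numeric(
    credit["maximum_open_credit"], errors="coerce"
)

# Drop the rows that failed to parse.
credit = credit.dropna(subset=["maximum_open_credit"]).reset_index(drop=True)
print(credit)
```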

#### Feature: Monthly Debt

Monthly debt contains a currency symbol, so pandas reads the column as strings. Remove the symbol and convert the column to numeric.

In [None]:
train['monthly_debt']=train['monthly_debt'].str.strip('$')
train['monthly_debt']=pd.to_numeric(train['monthly_debt'])
train['monthly_debt'].describe()
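
If the column also contained thousands separators, stripping only `$` would not be enough. The sketch below assumes such values might occur (the sample strings are invented) and removes both characters before converting.

```python
import pandas as pd

# Invented sample values, including a thousands separator and a missing entry.
debt = pd.Series(["$584.03", "1106.04", "$1,321.85", None])

# Remove dollar signs and commas, then convert; unparseable values become NaN.
cleaned = pd.to_numeric(
    debt.str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)
print(cleaned.tolist())
```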

### Handling Missing Values and Outliers

Describe the data to check which features have missing values and whether there are any outliers.

In [None]:
train.describe()

#### Outlier treatment: Current Loan Amount

In [None]:
ax = sns.boxplot(data=train['current_loan_amount'], orient="h", palette="Set2")

In [None]:
# The description shows a placeholder (99999999) as the max value
train[train['current_loan_amount']==99999999.000]

In [None]:
# There are 5861 such samples, too many to drop, so replace the placeholder with NaN
train['current_loan_amount'] = train['current_loan_amount'].replace(99999999, np.nan)

In [None]:
ax = sns.boxplot(data=train['current_loan_amount'], orient="h", palette="Set2")

In [None]:
train.describe()

#### Outlier treatment: Annual Income

In [None]:
ax = sns.boxplot(data=train['annual_income'], orient="h", palette="Set2")

In [None]:
train[train['annual_income']==8713547.000]

In [None]:
train = train[train['annual_income']!=8713547.000]
train.shape

In [None]:
ax = sns.boxplot(data=train['annual_income'], orient="h", palette="Set2")

In [None]:
train[train['annual_income']>1200000]

In [None]:
train = train.drop([3686, 11660, 46615])
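
Dropping by hard-coded index labels depends on the exact row order of this run. A boolean mask expresses the same rule directly; the sketch below uses invented incomes and the 1,200,000 threshold from the cell above.

```python
import numpy as np
import pandas as pd

# Invented incomes with arbitrary index labels.
incomes = pd.DataFrame(
    {"annual_income": [52000.0, 1350000.0, np.nan, 98000.0, 2400000.0]},
    index=[10, 11, 12, 13, 14],
)

# NaN compares as False, so rows with a missing income are kept
# and remain available for later imputation.
threshold = 1_200_000
too_high = incomes["annual_income"] > threshold
trimmed = incomes[~too_high]
print(trimmed)
```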

In [None]:
ax = sns.distplot(train['annual_income'].dropna(), hist=True, kde=True, 
             color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 2})
ax.set(xlabel='Annual Income') 
plt.title('Annual Income frequency chart'); 
plt.show()

In [None]:
train.describe()

#### Outlier Treatment: Years of Credit History

In [None]:
ax = sns.boxplot(data=train['years_of_credit_history'], orient="h", palette="Set2")

In [None]:
train[train['years_of_credit_history']>58]

In [None]:
train = train.drop([1908, 32096, 45779, 49017, 61832])

In [None]:
ax = sns.boxplot(data=train['years_of_credit_history'], orient="h", palette="Set2")

In [None]:
train.describe()

#### Outlier treatment: Number of Open Accounts

In [None]:
ax = sns.boxplot(data=train['number_of_open_accounts'], orient="h", palette="Set2")

In [None]:
train[train['number_of_open_accounts']>50]

In [None]:
train = train.drop([26502, 26637, 27936, 39254])

In [None]:
ax = sns.boxplot(data=train['number_of_open_accounts'], orient="h", palette="Set2")

In [None]:
#Capping the outliers
IQR = train['number_of_open_accounts'].quantile(0.75) - train['number_of_open_accounts'].quantile(0.25)
upper_limit = train['number_of_open_accounts'].quantile(0.75) + (IQR * 1.5)
print("Upper Limit:", upper_limit)

In [None]:
# Cap values above 23.0 (the cap used here); clip() leaves NaNs unchanged
train['number_of_open_accounts'] = train['number_of_open_accounts'].clip(upper=23.0)
ax = sns.boxplot(data=train['number_of_open_accounts'], orient="h", palette="Set2")
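
The IQR rule used for the cap can be illustrated on a small series. The numbers are invented so that the arithmetic is easy to follow: the upper limit is the third quartile plus 1.5 times the interquartile range.

```python
import numpy as np
import pandas as pd

# Invented counts with one obvious outlier and one missing value.
accounts = pd.Series([4, 6, 8, 9, 10, 11, 12, 14, 40, np.nan])

# quantile() ignores NaN by default.
q1 = accounts.quantile(0.25)
q3 = accounts.quantile(0.75)
upper_limit = q3 + 1.5 * (q3 - q1)

# clip() caps values above the limit and leaves NaN as NaN.
capped = accounts.clip(upper=upper_limit)
print("Upper limit:", upper_limit)
print(capped.tolist())
```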

In [None]:
print(train.shape)
train.describe()

Now that all outliers have been handled, let's check the missing values.

In [None]:
train.isnull().sum()

#### Missing Value Treatment: Bankruptcies & Tax Liens

Reference Link for Iterative Imputing:
https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html#sphx-glr-auto-examples-impute-plot-iterative-imputer-variants-comparison-py

In [None]:
# The share of missing values in Bankruptcies and Tax Liens is very low, so we can drop those rows
train = train.dropna(subset=['bankruptcies', 'tax_liens'])
print(train.shape)
train.isnull().sum()
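
To back the "very low" judgement with a number, the share of missing values per column can be computed directly. The frame below is a made-up stand-in for `train`, so the percentages are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the train frame.
sample = pd.DataFrame({
    "bankruptcies": [0.0, 1.0, np.nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "tax_liens": [0.0] * 10,
    "credit_score": [700.0, np.nan, np.nan, 720.0, np.nan,
                     690.0, 710.0, np.nan, 705.0, 730.0],
})

# isnull().mean() gives the fraction of missing values in each column.
missing_pct = sample.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```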

#### Missing Value Treatment: Months Since Last Delinquent

This feature records the number of months since the borrower's last delinquency. A missing value is taken to mean the customer has never been delinquent, so NAs are replaced with 0.

In [None]:
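# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
train["months_since_last_delinquent"] = train["months_since_last_delinquent"].fillna(0)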
print(train.shape)
train.isnull().sum()

#### Missing Value Treatment: Current Loan Amount, Credit Score, Years In Current Job, Annual Income

The remaining missing values are filled with scikit-learn's **'Iterative Imputer'**, using its default estimator **'Bayesian Ridge'**, which is a **Regularized Linear Regression**.

In [None]:
train.reset_index(drop=True, inplace=True)
my_imputer = IterativeImputer()
# The imputer needs numerical variables only, so select those columns
train_numerical = train.select_dtypes(include='number')
train_numerical_columns = train_numerical.columns 
print(train_numerical.shape)
train_numerical.isnull().sum()

In [None]:
train_imputed = my_imputer.fit_transform(train_numerical)
# The imputer returns a NumPy array, so convert it back to a DataFrame with column names
train_imputed = pd.DataFrame(train_imputed, columns=train_numerical_columns)
train_imputed.isnull().sum()
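
How the imputer behaves is easier to see on synthetic data. The sketch below builds two correlated columns (the names are borrowed from the dataset, the values are randomly generated), blanks out 10% of one column, and lets `IterativeImputer` fill the gaps from the other.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data: the loan amount is roughly 20% of income plus noise.
rng = np.random.default_rng(0)
income = rng.normal(70_000, 15_000, size=200)
toy = pd.DataFrame({
    "annual_income": income,
    "current_loan_amount": 0.2 * income + rng.normal(0, 1_000, size=200),
})

# Blank out 10% of the loan amounts.
missing_rows = toy.sample(frac=0.1, random_state=0).index
toy.loc[missing_rows, "current_loan_amount"] = np.nan

# Each feature with gaps is regressed on the others (BayesianRidge by default).
imputer = IterativeImputer(random_state=0)
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
print(filled.loc[missing_rows].head())
```

Observed values pass through unchanged; only the NaN cells are replaced. One design choice worth weighing is whether to fit the imputer once on the training data and reuse it with `transform` on other data, rather than fitting a fresh imputer each time.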

In [None]:
train_imputed.describe()

The histograms below compare each feature's distribution before imputation (NA values dropped) and after imputation with the Iterative Imputer. <br>
`train_numerical` is the DataFrame with NAs, while `train_imputed` is the DataFrame without NAs.

#### Histogram: Current Loan Amount

In [None]:
ax = sns.distplot(train_numerical['current_loan_amount'].dropna(), hist=True, kde=True, 
             bins=int(42740/1000), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 2})
ax.set(xlabel='Current Loan Amount') 
plt.title('Current Loan Amount before Imputation'); 
plt.show()

In [None]:
ax = sns.distplot(train_imputed['current_loan_amount'], hist=True, kde=True, 
             bins=int(42740/1000), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 2})
ax.set(xlabel='Current Loan Amount') 
plt.title('Current Loan Amount after Imputation'); 
plt.show()

In [None]:
temp = train_imputed[train_imputed['current_loan_amount']<40000]

In [None]:
ax = sns.distplot(temp['current_loan_amount'], hist=True, kde=True, 
             bins=int(42740/1000), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 2})
ax.set(xlabel='Current Loan Amount') 
plt.title('Current Loan Amount after Imputation'); 
plt.show()

#### Histogram: Credit Score

In [None]:
ax = sns.distplot(train_numerical['credit_score'].dropna(), hist=True, kde=True, 
             bins=int(800/20), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 2})
ax.set(xlabel='Credit Score') 
plt.title('Credit Score before Imputation'); 
plt.show()

In [None]:
ax = sns.distplot(train_imputed['credit_score'], hist=True, kde=True, 
             bins=int(800/20), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 2})
ax.set(xlabel='Credit Score') 
plt.title('Credit Score after Imputation'); 
plt.show() 

#### Histogram: Annual Income

In [None]:
ax = sns.distplot(train_numerical['annual_income'].dropna(), hist=True, kde=True, 
             bins=int(215580/10000), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 2})
ax.set(xlabel='Annual Income') 
plt.title('Annual Income before Imputation'); 
plt.show()

In [None]:
ax = sns.distplot(train_imputed['annual_income'], hist=True, kde=True, 
             bins=int(215580/10000), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 2})
ax.set(xlabel='Annual Income') 
plt.title('Annual Income after Imputation'); 
plt.show()

#### Histogram: Years In Current Job

In [None]:
ax = sns.distplot(train_numerical['years_in_current_job'].dropna(), hist=True, kde=True, 
             bins=int(215580/10000), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 2})
ax.set(xlabel='Years In Current Job') 
plt.title('Years In Current Job before Imputation'); 
plt.show()

In [None]:
ax = sns.distplot(train_imputed['years_in_current_job'], hist=True, kde=True, 
             bins=int(215580/10000), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 2})
ax.set(xlabel='Years In Current Job') 
plt.title('Years In Current Job after Imputation'); 
plt.show()

Replace the four features in the train DataFrame with their counterparts from the imputed DataFrame.

In [None]:
train_imputed.shape

In [None]:
train.shape

In [None]:
train['years_in_current_job'] = train_imputed['years_in_current_job']
train['current_loan_amount'] = train_imputed['current_loan_amount']
train['credit_score'] = train_imputed['credit_score']
train['annual_income'] = train_imputed['annual_income']

In [None]:
print(train.shape)
train.isnull().sum()

#### Converting Months since last delinquent into categories

In [None]:
train['months_since_last_delinquent'] = ['extreme_risk' if x>51 
        else 'high_risk' if x>32 
        else 'moderate_risk' if x>16 
        else 'low_risk' if x>0 else 'no_risk' for x in train['months_since_last_delinquent']]

In [None]:
train['months_since_last_delinquent'].unique()
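
The chained conditional above is equivalent to binning with right-closed intervals, which `pd.cut` can express declaratively. The sketch below uses invented month values placed on the bin edges to show where each boundary falls.

```python
import pandas as pd

# Invented values sitting on and around each threshold.
months = pd.Series([0, 5, 16, 17, 32, 33, 51, 52])

# Right-closed bins: (-inf, 0], (0, 16], (16, 32], (32, 51], (51, inf)
bins = [-float("inf"), 0, 16, 32, 51, float("inf")]
labels = ["no_risk", "low_risk", "moderate_risk", "high_risk", "extreme_risk"]

risk = pd.cut(months, bins=bins, labels=labels, right=True)
print(risk.tolist())
```

Two differences from the list comprehension are worth noting: `pd.cut` returns a categorical column, and it leaves NaN as NaN, whereas the comprehension would label a NaN as `no_risk` because every comparison with NaN is False.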

### Preprocessing the Test Data

The same cleaning steps are now applied to the test set.

In [None]:
# Replace spaces in each column heading with underscores and convert it to lowercase
cleancolumn = []
for i in range(len(test.columns)):
    cleancolumn.append(test.columns[i].replace(' ', '_').lower())
test.columns = cleancolumn
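
Since the train and test sets go through the same column renaming, the step could be wrapped in a small helper. `normalize_columns` is a hypothetical name introduced here for illustration, and the demo frame uses assumed raw column headings.

```python
import pandas as pd


def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with lower-case, underscore-separated column names."""
    out = df.copy()
    out.columns = out.columns.str.replace(" ", "_", regex=False).str.lower()
    return out


# Assumed raw headings, for demonstration only.
demo = pd.DataFrame(columns=["Loan ID", "Current Loan Amount", "Credit Score"])
renamed = normalize_columns(demo)
print(renamed.columns.tolist())
```

Calling the same helper on both frames keeps their column names aligned and removes one source of copy-paste drift between the two pipelines.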

In [None]:
test.head()

In [None]:
print(test.shape)
test.describe()

In [None]:
test.info()

In [None]:
# Count the unique loan IDs to see whether there are any duplicates
unique_loanid_test=test['loan_id'].unique().tolist()
print("Total samples in data:", str(test.shape[0]))
print("Total unique samples in data:", str(len(unique_loanid_test)))
print("Duplicate samples in data:", str(test.shape[0] - len(unique_loanid_test)))

In [None]:
# Drop the duplicates
test = test.drop_duplicates()
print("Total samples in data:", str(test.shape[0]))
print("Total unique samples in data:", str(len(unique_loanid_test)))
print("Duplicate samples in data:", str(test.shape[0] - len(unique_loanid_test)))

In [None]:
test['years_in_current_job'].unique()

In [None]:
# Map '< 1 year' to 0, consistent with the train set
test['years_in_current_job'] = [0 if str(x)=='< 1 year' else x if str(x)=='nan' else int(re.findall(r'\d+', str(x))[0]) for x in test['years_in_current_job']]
test['years_in_current_job'].unique()

In [None]:
test['credit_score'].head(10)

In [None]:
test['credit_score'] = test['credit_score'].map(credit_range)
test['credit_score'].head(10)

In [None]:
print(test.shape)
test[test['maximum_open_credit']=='#VALUE!']

In [None]:
test = test[test['maximum_open_credit'] != '#VALUE!']
test['maximum_open_credit']= pd.to_numeric(test['maximum_open_credit'])
test.shape

In [None]:
test['monthly_debt']=test['monthly_debt'].str.strip('$')
test['monthly_debt']=pd.to_numeric(test['monthly_debt'])
test['monthly_debt'].describe()

In [None]:
ax = sns.boxplot(data=test['current_loan_amount'], orient="h", palette="Set2")

In [None]:
test[test['current_loan_amount']==99999999.000]


In [None]:
# Replace the 99999999 placeholder with NaN
test['current_loan_amount'] = test['current_loan_amount'].replace(99999999, np.nan)

In [None]:
ax = sns.boxplot(data=test['current_loan_amount'], orient="h", palette="Set2")

In [None]:
ax = sns.boxplot(data=test['annual_income'], orient="h", palette="Set2")

In [None]:
test.describe()

In [None]:
test[test['annual_income']==8713547.000]

In [None]:
test = test[test['annual_income']!=8713547.000]
test.shape

In [None]:
ax = sns.boxplot(data=test['annual_income'], orient="h", palette="Set2")


In [None]:
test[test['annual_income']>1200000]


In [None]:
test = test.drop([9470,21777])

In [None]:
ax = sns.boxplot(data=test['annual_income'], orient="h", palette="Set2")

In [None]:
ax = sns.distplot(test['annual_income'].dropna(), hist=True, kde=True, 
             color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 2})
ax.set(xlabel='Annual Income') 
plt.title('Annual Income frequency chart'); 
plt.show()

In [None]:
ax = sns.boxplot(data=test['years_of_credit_history'], orient="h", palette="Set2")

In [None]:
test[test['years_of_credit_history']>58]

In [None]:
test = test.drop([4829,5060, 24164])

In [None]:
ax = sns.boxplot(data=test['years_of_credit_history'], orient="h", palette="Set2")

In [None]:
ax = sns.boxplot(data=test['number_of_open_accounts'], orient="h", palette="Set2")

In [None]:
test[test['number_of_open_accounts']>50]

In [None]:
test = test.drop([4935,18759])

In [None]:
ax = sns.boxplot(data=test['number_of_open_accounts'], orient="h", palette="Set2")

In [None]:
#Capping the outliers
IQR = test['number_of_open_accounts'].quantile(0.75) - test['number_of_open_accounts'].quantile(0.25)
upper_limit = test['number_of_open_accounts'].quantile(0.75) + (IQR * 1.5)
print("Upper Limit:", upper_limit)

In [None]:
# Cap at 23.0, the same limit used for the train set; clip() leaves NaNs unchanged
test['number_of_open_accounts'] = test['number_of_open_accounts'].clip(upper=23.0)
ax = sns.boxplot(data=test['number_of_open_accounts'], orient="h", palette="Set2")

In [None]:
test.isnull().sum()

In [None]:
test = test.dropna(subset=['bankruptcies', 'tax_liens'])
print(test.shape)
test.isnull().sum()

In [None]:
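# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
test["months_since_last_delinquent"] = test["months_since_last_delinquent"].fillna(0)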
print(test.shape)
test.isnull().sum()

In [None]:
test.reset_index(drop=True, inplace=True)
my_imputer = IterativeImputer()
# The imputer needs numerical variables only, so select those columns
test_numerical = test.select_dtypes(include='number')
test_numerical_columns = test_numerical.columns 
print(test_numerical.shape)
test_numerical.isnull().sum()


In [None]:
test_imputed = my_imputer.fit_transform(test_numerical)
# The imputer returns a NumPy array, so convert it back to a DataFrame with column names
test_imputed = pd.DataFrame(test_imputed, columns=test_numerical_columns)
test_imputed.isnull().sum()

In [None]:
test['years_in_current_job'] = test_imputed['years_in_current_job']
test['current_loan_amount'] = test_imputed['current_loan_amount']
test['credit_score'] = test_imputed['credit_score']
test['annual_income'] = test_imputed['annual_income']

In [None]:
test['months_since_last_delinquent'].unique()

In [None]:
test['months_since_last_delinquent'] = ['extreme_risk' if x>51 
        else 'high_risk' if x>32 
        else 'moderate_risk' if x>16 
        else 'low_risk' if x>0 else 'no_risk' for x in test['months_since_last_delinquent']]

In [None]:
test.isnull().sum()