<h1 style="text-align:center;font-size:30px;" > Home Credit Default Risk </h1>

[1. Business Problem](#business) <br>
- [1.1 Description](#description) <br>
- [1.2 Sources/Useful Links](#links) <br>
- [1.3 Real world/Business Objectives and Constraints](#real) <br>
- [1.4 Data](#data) <br>

[2. Exploratory Data Analysis](#analysis)<br>
- [2.1 Reading data and basic stats](#stats)<br>
- [2.2 Basic Analysis](#basic)<br>
  - [2.2.1 Checking for Missing values](#missing) <br>
  - [2.2.2 Checking for Duplicates](#dup) <br>
  - [2.2.3 Distribution of data points among output classes](#distribution) <br>
- [2.3 Data Analysis](#3.3)<br> 
  - [2.3.1 Types of loan](#3.3.1)<br>
  - [2.3.2 Distribution of AMT_INCOME_TOTAL](#3.3.2)<br>
  - [2.3.3 Distribution of AMT_CREDIT](#3.3.3)<br>
  - [2.3.4 Distribution of Name of type of the Suite by whether the loan was repaid](#3.3.4)<br>
  - [2.3.5 Distribution of Income sources of Applicants by whether the loan was repaid](#3.3.5)<br>
  - [2.3.6 Distribution of Education of Applicants by whether the loan was repaid](#3.3.6)<br>
  - [2.3.7 Distribution of Family status of Applicants by whether the loan was repaid](#3.3.7)<br>
  - [2.3.8 Distribution of Housing type of Applicants by whether the loan was repaid](#3.3.8)<br>
  - [2.3.9 Distribution of Clients Age](#3.3.9)<br>
  - [2.3.10 Distribution of years before the application the person started current employment](#3.3.10)<br>
  - [2.3.11 Occupation of Applicants by whether the loan was repaid](#3.3.11)<br>
- [2.4 Preparation of data](#3.4)<br>
  - [2.4.1 Feature Engineering of Application data](#3.4.1)<br>
  - [2.4.2 Using Bureau Data](#3.4.2)<br>
  - [2.4.3 Feature Engineering of Bureau Data](#3.4.3)<br>
  - [2.4.4 Using Previous Application Data](#3.4.4)<br>
  - [2.4.5 Using POS_CASH_balance data](#3.4.5)<br>
  - [2.4.6 Using installments_payments data](#3.4.6)<br>
  - [2.4.7 Using Credit card balance data](#3.4.7)<br>
- [2.5 Dividing data into train, valid and test](#3.5)<br>
- [2.6 Featurizing the data](#3.6)<br>
- [2.7 Selection of features](#3.7)<br>

[3. Machine Learning Models](#4)<br>
- [3.1 Logistic regression with selected features](#4.1)<br>



<a id='business'></a>
<h1> 1. Business Problem </h1>

<a id='description'></a>
<h2> 1.1 Description </h2>

<p>
Home Credit offers easy, simple and fast loans for a range of Home Appliances, Mobile Phones, Laptops, Two Wheelers, and varied personal needs. Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. To make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data, including telco and transactional information, to predict their clients' repayment abilities.
<br />
<br />
This project focuses on predicting how capable each applicant is of repaying a loan, given the application data, all credit data from the Credit Bureau, previous application data from Home Credit, and some additional data. <br />
</p>

<p style='font-size:18px'><b> Problem Statement </b></p>
Predict how capable each applicant is of repaying a loan, so that loans are sanctioned only for applicants who are likely to repay.

<p style='font-size:18px'> <b> Source:  </b> <br /> https://www.kaggle.com/c/home-credit-default-risk </p>

<a id='links'></a>
<h2> 1.2 Sources/Useful Links</h2>

Data Source : https://www.kaggle.com/c/home-credit-default-risk/data <br>

<a id='real'></a>
<h2> 1.3 Real World / Business Objectives and Constraints </h2>

1. No strict latency constraints.
2. Predict the probability that each applicant is capable of repaying the loan, so that any decision threshold can be chosen afterwards.
3. The cost of a misclassification can be very high (a loss for the organization).
4. Interpretability is partially important.

<a id='data'></a><h2> 1.4 Data </h2>

The data is provided by Home Credit, a service dedicated to providing lines of credit (loans) to the unbanked population. <br />
<br/>
There are 7 different sources of data:

- <b>application_train/application_test</b>: The main training data with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature SK_ID_CURR. The training application data comes with the TARGET indicating 0: the loan was repaid or 1: the loan was not repaid. Here we will use only the Training data.
- <b>bureau</b>: Data concerning the client's previous credits from other financial institutions. Each previous credit has its own row in bureau, but one loan in the application data can have multiple previous credits.
- <b>bureau_balance</b>: Monthly data about the previous credits in bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length.
- <b>previous_application</b>: Previous applications for loans at Home Credit by clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV.
- <b>POS_CASH_BALANCE</b>: Monthly data about previous point of sale or cash loans that clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.
- <b>credit_card_balance</b>: Monthly data about previous credit cards that clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
- <b>installments_payments</b>: The payment history for previous loans at Home Credit. There is one row for every payment made and one row for every missed payment.
<br/>
<br/>
The below diagram shows how the data is related:
<img src='images/home_credit.png'/>
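The one-to-many relations described above can be sketched on toy data. This is only an illustration under assumptions: the three small frames and the derived column names (`MONTHS_COUNT`, `BUREAU_MONTHS_COUNT_MEAN`) are hypothetical, and only the key columns `SK_ID_CURR` and `SK_ID_BUREAU` come from the dataset description.

```python
import pandas as pd

# Hypothetical toy tables that mimic the key structure of the real files
application_toy = pd.DataFrame({'SK_ID_CURR': [100, 101], 'TARGET': [0, 1]})
bureau_toy = pd.DataFrame({'SK_ID_CURR': [100, 100, 101],
                           'SK_ID_BUREAU': [1, 2, 3]})
bureau_balance_toy = pd.DataFrame({
    'SK_ID_BUREAU':   [1, 1, 1, 2, 3, 3, 3, 3],
    'MONTHS_BALANCE': [0, -1, -2, 0, 0, -1, -2, -3],
})

# Level 1: one row per previous credit (number of monthly records)
months_per_credit = (bureau_balance_toy.groupby('SK_ID_BUREAU').size()
                     .rename('MONTHS_COUNT').reset_index())

# Level 2: attach to bureau, then reduce to one row per current loan
bureau_with_months = bureau_toy.merge(months_per_credit, on='SK_ID_BUREAU', how='left')
per_client = (bureau_with_months.groupby('SK_ID_CURR')['MONTHS_COUNT'].mean()
              .rename('BUREAU_MONTHS_COUNT_MEAN').reset_index())

# Final: a left join keeps exactly one row per application
linked = application_toy.merge(per_client, on='SK_ID_CURR', how='left')
print(linked)
```

Each child table has to be reduced to one row per `SK_ID_CURR` before it is joined, otherwise the join would duplicate application rows.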

<a id='analysis'></a>
<h1>2. Exploratory Data Analysis </h1>

In [None]:
import pandas as pd
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import os
import warnings
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.linear_model import SGDClassifier

import plotly.offline as py
#import chart_studio.plotly as py 
import plotly.graph_objs as go
#import plotly.graph_objects as go
#import chart_studio.plotly.graph_objs as go

from plotly.offline import init_notebook_mode, iplot
#from chart_studio.plotly import plot, iplot
from sklearn.model_selection import train_test_split
init_notebook_mode(connected=True)


import cufflinks as cf
cf.go_offline()

import pickle
import gc
import lightgbm as lgb

warnings.filterwarnings('ignore')
%matplotlib inline

<a id='stats'></a>
<h2> 2.1 Reading data and basic stats </h2>

- In this case study, we have multiple datasets from different data sources to deal with. First, we will start with the application dataset(Main table) and proceed further with the other datasets.

In [None]:
print('Reading the data....', end='')
application = pd.read_csv('application_train.csv')
print('Finished reading the data.')
print('The shape of data:',application.shape)
print('First 5 rows of data:')
application.head()

In [None]:
application.info()

We are using 'application_train.csv' file :
- This dataset consists of 307511 rows and 122 columns.
- Each row has unique id 'SK_ID_CURR' and the output label is in the 'TARGET' column.
- TARGET indicating 0: the loan was repaid or 1: the loan was not repaid.
- The description of each column can be found in the file 'HomeCredit_columns_description.csv'

<a id='basic'></a>
<h2> 2.2 Basic Analysis</h2>

<a id='missing'></a>
<h3> 2.2.1 Checking for Missing values</h3>

In [None]:
count = application.isnull().sum().sort_values(ascending=False)
percentage = ((application.isnull().sum()/len(application)*100)).sort_values(ascending=False)

missing_application = pd.concat([count, percentage], axis=1, keys=['Count','Percentage'])
print('Count and % of missing values for top 20 columns:')
missing_application.head(20)

#### Observations:
- A lot of columns have a large number of missing values, so they need to be imputed before modelling.
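To see which columns are actually affected, the summary can be restricted to columns with at least one missing value. The following is a minimal sketch; `missing_summary` is a hypothetical helper and the toy frame only stands in for `application`.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, only for columns that have any."""
    count = df.isnull().sum()
    summary = pd.DataFrame({
        'Count': count,
        'Percentage': count / len(df) * 100,
    })
    return summary[summary['Count'] > 0].sort_values('Count', ascending=False)

# Toy frame with made-up values
toy = pd.DataFrame({
    'SK_ID_CURR': [1, 2, 3, 4],
    'AMT_ANNUITY': [100.0, np.nan, 300.0, 400.0],
    'OWN_CAR_AGE': [np.nan, np.nan, np.nan, 5.0],
})
toy_missing = missing_summary(toy)
print(toy_missing)
```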

<a id='dup'></a>
<h3>2.2.2 Checking for Duplicates </h3>


In [None]:
columns_without_id = [col for col in application.columns if col!='SK_ID_CURR']

#Checking for duplicates in the data.
application[application.duplicated(subset = columns_without_id, keep=False)]

print('The no of duplicates in the data:',application[application.duplicated(subset = columns_without_id, keep=False)]
      .shape[0])

<a id='distribution'></a>
<h3> 2.2.3 Distribution of data points among output classes</h3>

Most of the plots are made using **Plotly**; you can hover over a plot to see an overview of the data.

In [None]:
cf.set_config_file(theme='polar')
target_val = application['TARGET'].value_counts()
# Map each label from the TARGET value itself, so labels cannot be mismatched with counts
target_df = pd.DataFrame({'labels': target_val.index.map({0: 'Loan Repaid (0)', 1: 'Loan not Repaid (1)'}),
                   'values': target_val.values
                  })
target_df.iplot(kind='pie',labels='labels',values='values', title='Loan Repaid or not', hole = 0.6)



#### Observations:
- The data is imbalanced (91.93% loan repaid (0) and 8.07% loan not repaid (1)), and we need to handle this problem.
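One common way to handle such imbalance is to weight the classes inversely to their frequency. The sketch below is an assumption about how this could be done, not necessarily what this notebook does later. It computes the weights with the formula `n_samples / (n_classes * count_per_class)`, which is the heuristic behind `class_weight='balanced'` in scikit-learn, on a toy label vector with roughly the same class ratio.

```python
import numpy as np

# Toy labels: 92 repaid (0) and 8 not repaid (1), made-up data
y_toy = np.array([0] * 92 + [1] * 8)

counts = np.bincount(y_toy)                      # samples per class
weights = len(y_toy) / (len(counts) * counts)    # 'balanced' heuristic
class_weight = {cls: float(w) for cls, w in enumerate(weights)}
print(class_weight)
```

A dictionary like this can be passed as the `class_weight` argument of estimators such as `LogisticRegression`, so that errors on the minority class cost more during training.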

<a id='3.3'></a>
<h2> 2.3 Data Analysis </h2>

<a id='3.3.1'></a>
<h3> 2.3.1 Types of loan </h3>

In [None]:
cf.set_config_file(theme='polar')
contract_val = application['NAME_CONTRACT_TYPE'].value_counts()
contract_df = pd.DataFrame({'labels': contract_val.index,
                   'values': contract_val.values
                  })
contract_df.iplot(kind='pie',labels='labels',values='values', title='Types of Loan', hole = 0.6)

#### Observations:
- Far more applicants take cash loans than revolving loans (https://www.investopedia.com/terms/r/revolving-loan-facility.asp).

<a id='3.3.2'></a>
<h3> 2.3.2 Distribution of AMT_INCOME_TOTAL </h3>

In [None]:
cf.set_config_file(theme='pearl')
application['AMT_INCOME_TOTAL'].iplot(kind='histogram', bins=100,
                                      xTitle = 'Total Income',yTitle ='Count of applicants',
                                     title='Distribution of AMT_INCOME_TOTAL')

In [None]:
application[application['AMT_INCOME_TOTAL'] < 2000000]['AMT_INCOME_TOTAL'].iplot(kind='histogram', bins=100,
                                                        xTitle = 'Total Income',yTitle ='Count of applicants',
                                     title='Distribution of AMT_INCOME_TOTAL')

In [None]:
(application[application['AMT_INCOME_TOTAL'] > 1000000]['TARGET'].value_counts())/len(application[application['AMT_INCOME_TOTAL'] > 1000000])*100

#### Observations:
- The distribution is right skewed and there are extreme values, so we can apply a log transformation.
- Applicants with high income (above 1,000,000 in the cell above) are likely to repay the loan.
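The effect of a log transformation on a right-skewed variable can be checked numerically with the sample skewness. The sketch below uses synthetic log-normal data as a stand-in for `AMT_INCOME_TOTAL`; the distribution parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "income" values (not the real data)
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=12, sigma=0.8, size=5000))

log_income = np.log(income)

skew_before = income.skew()      # strongly positive for right-skewed data
skew_after = log_income.skew()   # close to 0 after the transformation
print('skew before:', round(skew_before, 2), 'skew after:', round(skew_after, 2))
```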

<a id='3.3.3'></a>
<h3> 2.3.3 Distribution of AMT_CREDIT</h3>

In [None]:
application['AMT_CREDIT'].iplot(kind='histogram', bins=100,
                               xTitle = 'Credit Amount',yTitle ='Count of applicants',
                               title='Distribution of AMT_CREDIT')

In [None]:
(application[application['AMT_CREDIT'] > 2000000]['TARGET'].value_counts())/len(application[application['AMT_CREDIT'] > 2000000])*100

In [None]:
np.log(application['AMT_CREDIT']).iplot(kind='histogram', bins=100,
                               xTitle = 'log(Credit Amount)',yTitle ='Count of applicants',
                               title='Distribution of log(AMT_CREDIT)')

#### Observations:
- Applicants who take credit for a large amount (above 2,000,000 in the cell above) are very likely to repay the loan.
- The original distribution is right skewed; we used a log transformation to bring it closer to a normal distribution.

<a id='3.3.4'></a>
<h3>2.3.4 Distribution of Name of type of the Suite by whether the loan was repaid</h3>

In [None]:
cf.set_config_file(theme='polar')
suite_val = (application['NAME_TYPE_SUITE'].value_counts()/len(application))*100
suite_val.iplot(kind='bar', xTitle = 'Name of type of the Suite',
             yTitle='Count of applicants in %',
             title='Who accompanied client when applying for the  application in % ')

In [None]:
suite_val = application['NAME_TYPE_SUITE'].value_counts()

suite_val_y0 = []
suite_val_y1 = []
for val in suite_val.index:
    suite_val_y1.append(np.sum(application['TARGET'][application['NAME_TYPE_SUITE']==val] == 1))
    suite_val_y0.append(np.sum(application['TARGET'][application['NAME_TYPE_SUITE']==val] == 0))

# TARGET == 1 means the loan was not repaid, TARGET == 0 means it was repaid
data = [go.Bar(x = suite_val.index, y = (np.array(suite_val_y1) / suite_val.sum()) * 100, name='Not repaid (1)' ),
        go.Bar(x = suite_val.index, y = (np.array(suite_val_y0) / suite_val.sum()) * 100, name='Repaid (0)' )]

layout = go.Layout(
    title = "Who accompanied client when applying for the application, by whether the loan was repaid, in %",
    xaxis=dict(
        title='Name of type of the Suite',
       ),
    yaxis=dict(
        title='Count of applicants in %',
        )
)

fig = go.Figure(data = data, layout=layout) 
#fig.layout.template = 'plotly_dark'


py.iplot(fig)


<a id='3.3.5'></a>
<h3>2.3.5 Distribution of Income sources of Applicants by whether the loan was repaid </h3>

In [None]:
income_val = application['NAME_INCOME_TYPE'].value_counts()

income_val_y0 = []
income_val_y1 = []
for val in income_val.index:
    income_val_y1.append(np.sum(application['TARGET'][application['NAME_INCOME_TYPE']==val] == 1))
    income_val_y0.append(np.sum(application['TARGET'][application['NAME_INCOME_TYPE']==val] == 0))

# TARGET == 1 means the loan was not repaid, TARGET == 0 means it was repaid
data = [go.Bar(x = income_val.index, y = (np.array(income_val_y1) / income_val.sum()) * 100, name='Not repaid (1)' ),
        go.Bar(x = income_val.index, y = (np.array(income_val_y0) / income_val.sum()) * 100, name='Repaid (0)' )]

layout = go.Layout(
    title = "Income sources of Applicants by whether the loan was repaid, in %",
    xaxis=dict(
        title='Income source',
       ),
    yaxis=dict(
        title='Count of applicants in %',
        )
)

fig = go.Figure(data = data, layout=layout) 
#fig.layout.template = 'plotly_dark'


py.iplot(fig)


#### Observations:
- All applicants with income type Student or Businessman repaid their loans (hover over the plot to observe). Both groups are very small, so this should be interpreted with care.
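The bars in these plots are shares of all applicants, while observations such as the one above are about the rate within a category. A minimal sketch of computing the within-category rate directly with `groupby` is given below; `default_rate_by_category` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd

def default_rate_by_category(df, column, target='TARGET'):
    """Number of applicants and % with TARGET == 1 for each category of `column`."""
    summary = (
        df.groupby(column)[target]
          .agg(applicants='count', default_rate='mean')
          .sort_values('applicants', ascending=False)
    )
    summary['default_rate'] *= 100
    return summary

# Toy data standing in for `application`
toy = pd.DataFrame({
    'NAME_INCOME_TYPE': ['Working', 'Working', 'Working', 'Working',
                         'Student', 'Student', 'Pensioner', 'Pensioner'],
    'TARGET': [1, 0, 0, 0, 0, 0, 1, 0],
})
rates = default_rate_by_category(toy, 'NAME_INCOME_TYPE')
print(rates)
```

Keeping the `applicants` count next to the rate makes it easier to spot categories that are too small to support a conclusion.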

<a id='3.3.6'></a>
<h3> 2.3.6 Distribution of Education of Applicants by whether the loan was repaid </h3>

In [None]:
education_val = application['NAME_EDUCATION_TYPE'].value_counts()

education_val_y0 = []
education_val_y1 = []
for val in education_val.index:
    education_val_y1.append(np.sum(application['TARGET'][application['NAME_EDUCATION_TYPE']==val] == 1))
    education_val_y0.append(np.sum(application['TARGET'][application['NAME_EDUCATION_TYPE']==val] == 0))

# TARGET == 1 means the loan was not repaid, TARGET == 0 means it was repaid
data = [go.Bar(x = education_val.index, y = (np.array(education_val_y1) / education_val.sum()) * 100, name='Not repaid (1)' ),
        go.Bar(x = education_val.index, y = (np.array(education_val_y0) / education_val.sum()) * 100, name='Repaid (0)' )]

layout = go.Layout(
    title = "Education of Applicants by whether the loan was repaid, in %",
    xaxis=dict(
        title='Education of Applicants',
       ),
    yaxis=dict(
        title='Count of applicants in %',
        )
)

fig = go.Figure(data = data, layout=layout) 
#fig.layout.template = 'plotly_dark'


py.iplot(fig)


#### Observations:
- Applicants with an Academic degree are more likely to repay the loan (out of 164, only 3 applicants did not repay).

<a id='3.3.7'></a>
<h3> 2.3.7 Distribution of Family status of Applicants by whether the loan was repaid </h3>

In [None]:
family_val = application["NAME_FAMILY_STATUS"].value_counts()

family_val_y0 = []
family_val_y1 = []
for val in family_val.index:
    family_val_y1.append(np.sum(application["TARGET"][application["NAME_FAMILY_STATUS"]==val] == 1))
    family_val_y0.append(np.sum(application["TARGET"][application["NAME_FAMILY_STATUS"]==val] == 0))

# TARGET == 1 means the loan was not repaid, TARGET == 0 means it was repaid
data = [go.Bar(x = family_val.index, y = (np.array(family_val_y1) / family_val.sum()) * 100, name='Not repaid (1)' ),
        go.Bar(x = family_val.index, y = (np.array(family_val_y0) / family_val.sum()) * 100, name='Repaid (0)' )]

layout = go.Layout(
    title = "Family Status of Applicants by whether the loan was repaid, in %",
    xaxis=dict(
        title='Family Status',
       ),
    yaxis=dict(
        title='Count of applicants in %',
        )
)

fig = go.Figure(data = data, layout=layout) 
#fig.layout.template = 'plotly_dark'

py.iplot(fig)


#### Observations:
- Widows are more likely to repay the loan than applicants with other family statuses.

<a id='3.3.8'></a>
<h3> 2.3.8 Distribution of Housing type of Applicants by whether the loan was repaid </h3>

In [None]:
housing_val = application['NAME_HOUSING_TYPE'].value_counts()

housing_val_y0 = []
housing_val_y1 = []
for val in housing_val.index:
    housing_val_y1.append(np.sum(application['TARGET'][application['NAME_HOUSING_TYPE']==val] == 1))
    housing_val_y0.append(np.sum(application['TARGET'][application['NAME_HOUSING_TYPE']==val] == 0))

# TARGET == 1 means the loan was not repaid, TARGET == 0 means it was repaid
data = [go.Bar(x = housing_val.index, y = (np.array(housing_val_y1) / housing_val.sum()) * 100, name='Not repaid (1)' ),
        go.Bar(x = housing_val.index, y = (np.array(housing_val_y0) / housing_val.sum()) * 100, name='Repaid (0)' )]

layout = go.Layout(
    title = "Housing type of Applicants by whether the loan was repaid, in %",
    xaxis=dict(
        title='Type of house',
       ),
    yaxis=dict(
        title='Count of applicants in %',
        )
)

fig = go.Figure(data = data, layout=layout) 
#fig.layout.template = 'plotly_dark'


py.iplot(fig)


<a id='3.3.9'></a>
<h3> 2.3.9 Distribution of Clients Age </h3>

In [None]:
cf.set_config_file(theme='pearl')
(application['DAYS_BIRTH']/(-365)).iplot(kind='histogram', xTitle = 'Age',bins=50,
             yTitle='Count of applicants',
             title='Distribution of Clients Age')

<a id='3.3.10'></a>
<h3> 2.3.10 Distribution of years before the application the person started current employment </h3>

In [None]:
cf.set_config_file(theme='pearl')
(application['DAYS_EMPLOYED']).iplot(kind='histogram', xTitle = 'Days',bins=50,
             yTitle='Count of applicants',
             title='Days before the application the person started current employment')

- The data looks strange: the maximum value is 365243 days (about 1000 years), which is impossible as a length of employment. It looks like a data entry error or a placeholder value.

In [None]:
application['DAYS_EMPLOYED'].describe()

In [None]:
error = application[application['DAYS_EMPLOYED'] == 365243]
print('The no of errors are :', len(error))
(error['TARGET'].value_counts()/len(error))*100

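- The rows with this anomalous value default at a rate of 5.4%, which is lower than the overall rate of 8.07%, so the anomaly itself carries information. We therefore flag these rows and replace the value with NaN.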

In [None]:
# Create an error flag column
application['DAYS_EMPLOYED_ERROR'] = application["DAYS_EMPLOYED"] == 365243

# Replace the error values with nan
application['DAYS_EMPLOYED'] = application['DAYS_EMPLOYED'].replace({365243: np.nan})

In [None]:
cf.set_config_file(theme='pearl')
(application['DAYS_EMPLOYED']/(-365)).iplot(kind='histogram', xTitle = 'Years of Employment',bins=50,
             yTitle='Count of applicants',
             title='Years before the application the person started current employment')

In [None]:
application[application['DAYS_EMPLOYED']>(-365*2)]['TARGET'].value_counts()/sum(application['DAYS_EMPLOYED']>(-365*2))

#### Observations:
- The applicants with less than 2 years of employment are less likely to repay the loan.
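The comparison above uses a single 2-year cutoff. One way to look at the whole range is to bucket employment length and compute the default rate per bucket. The sketch below runs on made-up rows, and the bin edges are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for `application` (NaN mimics a replaced anomaly)
toy = pd.DataFrame({
    'DAYS_EMPLOYED': [-200, -500, -700, -1500, -3000, -6000, np.nan, -4000],
    'TARGET':        [1,    0,    1,    0,     0,     0,     0,      1],
})

years = toy['DAYS_EMPLOYED'] / -365
bins = [0, 2, 5, 10, np.inf]
labels = ['<2', '2-5', '5-10', '10+']

# Left-closed bins: [0, 2), [2, 5), [5, 10), [10, inf)
toy['YEARS_EMPLOYED_BIN'] = pd.cut(years, bins=bins, labels=labels, right=False)

# Rows with missing employment length are left out of the grouping
rate_by_bin = toy.groupby('YEARS_EMPLOYED_BIN', observed=False)['TARGET'].mean() * 100
print(rate_by_bin)
```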

<a id='3.3.11'></a>
<h3> 2.3.11 Occupation of Applicants by whether the loan was repaid</h3>

In [None]:
occupation_val = application['OCCUPATION_TYPE'].value_counts()

occupation_val_y0 = []
occupation_val_y1 = []
for val in occupation_val.index:
    occupation_val_y1.append(np.sum(application['TARGET'][application['OCCUPATION_TYPE']==val] == 1))
    occupation_val_y0.append(np.sum(application['TARGET'][application['OCCUPATION_TYPE']==val] == 0))

# TARGET == 1 means the loan was not repaid, TARGET == 0 means it was repaid
data = [go.Bar(x = occupation_val.index, y = (np.array(occupation_val_y1) / occupation_val.sum()) * 100, name='Not repaid (1)' ),
        go.Bar(x = occupation_val.index, y = (np.array(occupation_val_y0) / occupation_val.sum()) * 100, name='Repaid (0)' )]

layout = go.Layout(
    title = "Occupation of Applicants by whether the loan was repaid, in %",
    xaxis=dict(
        title='Occupation ',
       ),
    yaxis=dict(
        title='Count in %',
        )
)

fig = go.Figure(data = data, layout=layout)
#fig.layout.template = 'plotly_dark'


py.iplot(fig)


#### Observations:
- Core staff, Managers, High skill tech staff and Accountants are more likely to repay than Laborers, Sales staff and Drivers. Low-skill Laborers are the least likely to repay.

In [None]:
application.shape

<a id='3.4'></a>
<h2> 2.4 Preparation of data </h2>

<a id='3.4.1'></a>
<h3> 2.4.1 Feature Engineering of Application data </h3>

In [None]:
# Flag to represent when Total income is greater than Credit
application['INCOME_GT_CREDIT_FLAG'] = application['AMT_INCOME_TOTAL'] > application['AMT_CREDIT']
# Column to represent Credit Income Percent
application['CREDIT_INCOME_PERCENT'] = application['AMT_CREDIT'] / application['AMT_INCOME_TOTAL']
# Column to represent Annuity Income percent
application['ANNUITY_INCOME_PERCENT'] = application['AMT_ANNUITY'] / application['AMT_INCOME_TOTAL']
# Column to represent Credit Term
application['CREDIT_TERM'] = application['AMT_CREDIT'] / application['AMT_ANNUITY'] 
# Column to represent Days Employed percent in his life
application['DAYS_EMPLOYED_PERCENT'] = application['DAYS_EMPLOYED'] / application['DAYS_BIRTH']

In [None]:
application.shape

<a id='3.4.2'></a>
<h3> 2.4.2 Using Bureau Data </h3>

In [None]:
print('Reading the data....', end='')
bureau = pd.read_csv('bureau.csv')
print('done!!!')
print('The shape of data:',bureau.shape)
print('First 5 rows of data:')
bureau.head()

In [None]:
# Combining numerical features
grp = bureau.drop(['SK_ID_BUREAU'], axis = 1).groupby(by=['SK_ID_CURR']).mean(numeric_only=True).reset_index()
grp.columns = ['BUREAU_'+column if column !='SK_ID_CURR' else column for column in grp.columns]
application_bureau = application.merge(grp, on='SK_ID_CURR', how='left')
application_bureau.update(application_bureau[grp.columns].fillna(0))

In [None]:
# Combining categorical features
bureau_categorical = pd.get_dummies(bureau.select_dtypes('object'))
bureau_categorical['SK_ID_CURR'] = bureau['SK_ID_CURR']

grp = bureau_categorical.groupby(by = ['SK_ID_CURR']).mean().reset_index()
grp.columns = ['BUREAU_'+column if column !='SK_ID_CURR' else column for column in grp.columns]
application_bureau = application_bureau.merge(grp, on='SK_ID_CURR', how='left')
application_bureau.update(application_bureau[grp.columns].fillna(0))

In [None]:
application_bureau.shape
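The same aggregate-and-merge pattern (mean of the numeric columns, mean of the one-hot encoded categorical columns, prefix, left join, fill with 0) is repeated for every auxiliary table. It could be wrapped in one function, as in the sketch below. `aggregate_child_table` is a hypothetical helper and the toy frames are made up. Filling unmatched rows with 0 follows the cells above, although it treats "no history" the same as a genuine 0.

```python
import pandas as pd

def aggregate_child_table(child, key, prefix, drop_cols=()):
    """Per-`key` mean of numeric columns and of one-hot encoded categorical columns."""
    child = child.drop(columns=list(drop_cols))
    numeric = child.select_dtypes('number')
    categorical = child.select_dtypes(exclude='number')
    parts = [numeric]
    if categorical.shape[1] > 0:
        parts.append(pd.get_dummies(categorical, columns=list(categorical.columns), dtype=float))
    grouped = pd.concat(parts, axis=1).groupby(key).mean()
    grouped.columns = [prefix + col for col in grouped.columns]
    return grouped.reset_index()

# Toy stand-ins for `application` and `bureau`
application_toy = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})
bureau_toy = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'SK_ID_BUREAU': [10, 11, 12],
    'AMT_CREDIT_SUM': [100.0, 300.0, 50.0],
    'CREDIT_ACTIVE': ['Active', 'Closed', 'Active'],
})

bureau_features = aggregate_child_table(bureau_toy, 'SK_ID_CURR', 'BUREAU_',
                                        drop_cols=['SK_ID_BUREAU'])
merged = application_toy.merge(bureau_features, on='SK_ID_CURR', how='left')

# Applicants without any bureau record get 0 for the new features
feature_cols = [c for c in bureau_features.columns if c != 'SK_ID_CURR']
merged[feature_cols] = merged[feature_cols].fillna(0)
print(merged)
```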

<a id='3.4.3'></a>
<h3> 2.4.3 Feature Engineering of Bureau Data </h3>

In [None]:
# Number of past loans per customer
grp = bureau.groupby(by = ['SK_ID_CURR'])['SK_ID_BUREAU'].count().reset_index().rename(columns = {'SK_ID_BUREAU': 'BUREAU_LOAN_COUNT'})

application_bureau = application_bureau.merge(grp, on='SK_ID_CURR', how='left')
application_bureau['BUREAU_LOAN_COUNT'] = application_bureau['BUREAU_LOAN_COUNT'].fillna(0)

In [None]:
# Number of types of past loans per customer 
grp = bureau[['SK_ID_CURR', 'CREDIT_TYPE']].groupby(by = ['SK_ID_CURR'])['CREDIT_TYPE'].nunique().reset_index().rename(columns={'CREDIT_TYPE': 'BUREAU_LOAN_TYPES'})

application_bureau = application_bureau.merge(grp, on='SK_ID_CURR', how='left')
application_bureau['BUREAU_LOAN_TYPES'] = application_bureau['BUREAU_LOAN_TYPES'].fillna(0)

In [None]:
# Debt over credit ratio 
bureau['AMT_CREDIT_SUM'] = bureau['AMT_CREDIT_SUM'].fillna(0)
bureau['AMT_CREDIT_SUM_DEBT'] = bureau['AMT_CREDIT_SUM_DEBT'].fillna(0)

grp1 = bureau[['SK_ID_CURR','AMT_CREDIT_SUM']].groupby(by=['SK_ID_CURR'])['AMT_CREDIT_SUM'].sum().reset_index().rename(columns={'AMT_CREDIT_SUM': 'TOTAL_CREDIT_SUM'})

grp2 = bureau[['SK_ID_CURR','AMT_CREDIT_SUM_DEBT']].groupby(by=['SK_ID_CURR'])['AMT_CREDIT_SUM_DEBT'].sum().reset_index().rename(columns={'AMT_CREDIT_SUM_DEBT':'TOTAL_CREDIT_SUM_DEBT'})

grp1['DEBT_CREDIT_RATIO'] = grp2['TOTAL_CREDIT_SUM_DEBT']/grp1['TOTAL_CREDIT_SUM']

del grp1['TOTAL_CREDIT_SUM']

application_bureau = application_bureau.merge(grp1, on='SK_ID_CURR', how='left')
application_bureau['DEBT_CREDIT_RATIO'] = application_bureau['DEBT_CREDIT_RATIO'].fillna(0)
application_bureau['DEBT_CREDIT_RATIO'] = application_bureau['DEBT_CREDIT_RATIO'].replace([np.inf, -np.inf], 0)
application_bureau['DEBT_CREDIT_RATIO'] = pd.to_numeric(application_bureau['DEBT_CREDIT_RATIO'], downcast='float')

In [None]:
(application_bureau[application_bureau['DEBT_CREDIT_RATIO'] > 0.5]['TARGET'].value_counts()/len(application_bureau[application_bureau['DEBT_CREDIT_RATIO'] > 0.5]))*100
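A ratio of two per-customer sums can also be computed from a single `groupby`, which keeps both totals aligned on `SK_ID_CURR` by construction. The sketch below shows the idea on a toy frame with made-up values. Mapping infinite and undefined ratios to 0 mirrors the cells above and is a modelling assumption, not a requirement.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `bureau`
bureau_toy = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2, 3],
    'AMT_CREDIT_SUM': [100.0, 300.0, 0.0, 200.0],
    'AMT_CREDIT_SUM_DEBT': [50.0, 150.0, 10.0, np.nan],
})

# Both totals come from the same groupby, so they share one index
totals = (
    bureau_toy.fillna({'AMT_CREDIT_SUM': 0, 'AMT_CREDIT_SUM_DEBT': 0})
    .groupby('SK_ID_CURR')[['AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT']]
    .sum()
)

ratio = totals['AMT_CREDIT_SUM_DEBT'] / totals['AMT_CREDIT_SUM']
# Division by a zero total gives inf (or NaN for 0/0); both are mapped to 0 here
ratio = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)

debt_credit_ratio = ratio.rename('DEBT_CREDIT_RATIO').reset_index()
print(debt_credit_ratio)
```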

In [None]:
# Overdue over debt ratio
bureau['AMT_CREDIT_SUM_OVERDUE'] = bureau['AMT_CREDIT_SUM_OVERDUE'].fillna(0)
bureau['AMT_CREDIT_SUM_DEBT'] = bureau['AMT_CREDIT_SUM_DEBT'].fillna(0)

grp1 = bureau[['SK_ID_CURR','AMT_CREDIT_SUM_OVERDUE']].groupby(by=['SK_ID_CURR'])['AMT_CREDIT_SUM_OVERDUE'].sum().reset_index().rename(columns={'AMT_CREDIT_SUM_OVERDUE': 'TOTAL_CUSTOMER_OVERDUE'})

grp2 = bureau[['SK_ID_CURR','AMT_CREDIT_SUM_DEBT']].groupby(by=['SK_ID_CURR'])['AMT_CREDIT_SUM_DEBT'].sum().reset_index().rename(columns={'AMT_CREDIT_SUM_DEBT':'TOTAL_CUSTOMER_DEBT'})

grp1['OVERDUE_DEBT_RATIO'] = grp1['TOTAL_CUSTOMER_OVERDUE']/grp2['TOTAL_CUSTOMER_DEBT']

del grp1['TOTAL_CUSTOMER_OVERDUE']

application_bureau = application_bureau.merge(grp1, on='SK_ID_CURR', how='left')
application_bureau['OVERDUE_DEBT_RATIO'] = application_bureau['OVERDUE_DEBT_RATIO'].fillna(0)
application_bureau['OVERDUE_DEBT_RATIO'] = application_bureau['OVERDUE_DEBT_RATIO'].replace([np.inf, -np.inf], 0)

application_bureau['OVERDUE_DEBT_RATIO'] = pd.to_numeric(application_bureau['OVERDUE_DEBT_RATIO'], downcast='float')

In [None]:
application_bureau.shape

In [None]:
gc.collect()

<a id='3.4.4'></a>
<h3> 2.4.4 Using Previous Application Data </h3>

In [None]:
print('Reading the data....', end='')
previous_applicaton = pd.read_csv('previous_application.csv')
print('done!!!')
print('The shape of data:',previous_applicaton.shape)
print('First 5 rows of data:')
previous_applicaton.head()

In [None]:
# Number of previous applications per customer
grp = previous_applicaton[['SK_ID_CURR','SK_ID_PREV']].groupby(by=['SK_ID_CURR'])['SK_ID_PREV'].count().reset_index().rename(columns={'SK_ID_PREV':'PREV_APP_COUNT'})
application_bureau_prev = application_bureau.merge(grp, on =['SK_ID_CURR'], how = 'left')
application_bureau_prev['PREV_APP_COUNT'] = application_bureau_prev['PREV_APP_COUNT'].fillna(0)

In [None]:
# Combining numerical features
grp = previous_applicaton.drop('SK_ID_PREV', axis =1).groupby(by=['SK_ID_CURR']).mean(numeric_only=True).reset_index()
prev_columns = ['PREV_'+column if column != 'SK_ID_CURR' else column for column in grp.columns ]
grp.columns = prev_columns
application_bureau_prev = application_bureau_prev.merge(grp, on =['SK_ID_CURR'], how = 'left')
application_bureau_prev.update(application_bureau_prev[grp.columns].fillna(0))

In [None]:
# Combining categorical features
prev_categorical = pd.get_dummies(previous_applicaton.select_dtypes('object'))
prev_categorical['SK_ID_CURR'] = previous_applicaton['SK_ID_CURR']
prev_categorical.head()

grp = prev_categorical.groupby('SK_ID_CURR').mean().reset_index()
grp.columns = ['PREV_'+column if column != 'SK_ID_CURR' else column for column in grp.columns]

application_bureau_prev = application_bureau_prev.merge(grp, on=['SK_ID_CURR'], how='left')
application_bureau_prev.update(application_bureau_prev[grp.columns].fillna(0))

<a id='3.4.5'></a>
<h3> 2.4.5 Using POS_CASH_balance data </h3>

In [None]:
print('Reading the data....', end='')
pos_cash = pd.read_csv('POS_CASH_balance.csv')
print('done!!!')
print('The shape of data:',pos_cash.shape)
print('First 5 rows of data:')
pos_cash.head()

In [None]:
# Combining numerical features
grp = pos_cash.drop('SK_ID_PREV', axis =1).groupby(by=['SK_ID_CURR']).mean(numeric_only=True).reset_index()
prev_columns = ['POS_'+column if column != 'SK_ID_CURR' else column for column in grp.columns ]
grp.columns = prev_columns
application_bureau_prev = application_bureau_prev.merge(grp, on =['SK_ID_CURR'], how = 'left')
application_bureau_prev.update(application_bureau_prev[grp.columns].fillna(0))

In [None]:
# Combining categorical features
pos_cash_categorical = pd.get_dummies(pos_cash.select_dtypes('object'))
pos_cash_categorical['SK_ID_CURR'] = pos_cash['SK_ID_CURR']

grp = pos_cash_categorical.groupby('SK_ID_CURR').mean().reset_index()
grp.columns = ['POS_'+column if column != 'SK_ID_CURR' else column for column in grp.columns]

application_bureau_prev = application_bureau_prev.merge(grp, on=['SK_ID_CURR'], how='left')
application_bureau_prev.update(application_bureau_prev[grp.columns].fillna(0))

<a id='3.4.6'></a>
<h3> 2.4.6 Using installments_payments data</h3>

In [None]:
print('Reading the data....', end='')
insta_payments = pd.read_csv('installments_payments.csv')
print('done!!!')
print('The shape of data:',insta_payments.shape)
print('First 5 rows of data:')
insta_payments.head()

In [None]:
# Combining numerical features and there are no categorical features in this dataset
grp = insta_payments.drop('SK_ID_PREV', axis =1).groupby(by=['SK_ID_CURR']).mean().reset_index()
prev_columns = ['INSTA_'+column if column != 'SK_ID_CURR' else column for column in grp.columns ]
grp.columns = prev_columns
application_bureau_prev = application_bureau_prev.merge(grp, on =['SK_ID_CURR'], how = 'left')
application_bureau_prev.update(application_bureau_prev[grp.columns].fillna(0))

<a id='3.4.7'></a>
<h3> 2.4.7 Using Credit card balance data </h3>

In [None]:
print('Reading the data....', end='')
credit_card = pd.read_csv('credit_card_balance.csv')
print('done!!!')
print('The shape of data:',credit_card.shape)
print('First 5 rows of data:')
credit_card.head()

In [None]:
# Combining numerical features
grp = credit_card.drop('SK_ID_PREV', axis =1).groupby(by=['SK_ID_CURR']).mean(numeric_only=True).reset_index()
prev_columns = ['CREDIT_'+column if column != 'SK_ID_CURR' else column for column in grp.columns ]
grp.columns = prev_columns
application_bureau_prev = application_bureau_prev.merge(grp, on =['SK_ID_CURR'], how = 'left')
application_bureau_prev.update(application_bureau_prev[grp.columns].fillna(0))

In [None]:
# Combining categorical features
credit_categorical = pd.get_dummies(credit_card.select_dtypes('object'))
credit_categorical['SK_ID_CURR'] = credit_card['SK_ID_CURR']

grp = credit_categorical.groupby('SK_ID_CURR').mean().reset_index()
grp.columns = ['CREDIT_'+column if column != 'SK_ID_CURR' else column for column in grp.columns]

application_bureau_prev = application_bureau_prev.merge(grp, on=['SK_ID_CURR'], how='left')
application_bureau_prev.update(application_bureau_prev[grp.columns].fillna(0))

In [None]:
application_bureau_prev.shape

<a id='3.5'></a>
<h2> 2.5 Dividing data into train, valid and test</h2>

In [None]:
y = application_bureau_prev.pop('TARGET').values

In [None]:
X_train, X_temp, y_train, y_temp = train_test_split(application_bureau_prev.drop(['SK_ID_CURR'],axis=1), y, stratify = y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, stratify = y_temp, test_size=0.5, random_state=42)

In [None]:
print('Shape of X_train:',X_train.shape)
print('Shape of X_val:',X_val.shape)
print('Shape of X_test:',X_test.shape)
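The two-step split above yields roughly 70% train, 15% validation and 15% test, and `stratify` keeps the class ratio the same in each part. This can be checked on synthetic labels, as in the sketch below. The toy arrays are assumptions, with about 8% positives to resemble the real imbalance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 2000 rows, 8% positives
X_toy = np.arange(2000).reshape(-1, 1)
y_toy = np.array([0] * 1840 + [1] * 160)

# Step 1: hold out 30% of the rows
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.3, random_state=42)
# Step 2: split the held-out part in half into validation and test
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, stratify=y_tmp, test_size=0.5, random_state=42)

for name, labels in [('train', y_tr), ('val', y_va), ('test', y_te)]:
    print(name, len(labels), round(labels.mean() * 100, 2), '% positives')
```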

<a id='3.6'></a>
<h2> 2.6 Featurizing the data</h2>

In [None]:
# Separation of columns into numeric and categorical columns
types = np.array([dt for dt in X_train.dtypes])

all_columns = X_train.columns.values
is_num = types != 'object'

num_cols = all_columns[is_num]
cat_cols = all_columns[~is_num]


In [None]:
# Featurization of numeric data

imputer_num = SimpleImputer(strategy='median')
X_train_num = imputer_num.fit_transform(X_train[num_cols])
X_val_num = imputer_num.transform(X_val[num_cols])
X_test_num = imputer_num.transform(X_test[num_cols])

scaler_num = StandardScaler()
X_train_num1 = scaler_num.fit_transform(X_train_num)
X_val_num1 = scaler_num.transform(X_val_num)
X_test_num1 = scaler_num.transform(X_test_num)

X_train_num_final = pd.DataFrame(X_train_num1, columns=num_cols)
X_val_num_final = pd.DataFrame(X_val_num1, columns=num_cols)
X_test_num_final = pd.DataFrame(X_test_num1, columns=num_cols)


In [None]:
# Featurization of categorical data

imputer_cat = SimpleImputer(strategy='constant', fill_value='MISSING')
X_train_cat = imputer_cat.fit_transform(X_train[cat_cols])
X_val_cat = imputer_cat.transform(X_val[cat_cols])
X_test_cat = imputer_cat.transform(X_test[cat_cols])

X_train_cat1= pd.DataFrame(X_train_cat, columns=cat_cols)
X_val_cat1= pd.DataFrame(X_val_cat, columns=cat_cols)
X_test_cat1= pd.DataFrame(X_test_cat, columns=cat_cols)

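# The default sparse output is densified with .toarray(); this avoids the
# `sparse` argument, which was renamed to `sparse_output` in scikit-learn 1.2
ohe = OneHotEncoder(handle_unknown='ignore')
X_train_cat2 = ohe.fit_transform(X_train_cat1).toarray()
X_val_cat2 = ohe.transform(X_val_cat1).toarray()
X_test_cat2 = ohe.transform(X_test_cat1).toarray()

# get_feature_names_out (scikit-learn >= 1.0) replaces the removed get_feature_names
cat_cols_ohe = list(ohe.get_feature_names_out(cat_cols))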

X_train_cat_final = pd.DataFrame(X_train_cat2, columns = cat_cols_ohe)
X_val_cat_final = pd.DataFrame(X_val_cat2, columns = cat_cols_ohe)
X_test_cat_final = pd.DataFrame(X_test_cat2, columns = cat_cols_ohe)

In [None]:
# To free up the unused memory
gc.collect()

In [None]:
# Final complete data
X_train_final = pd.concat([X_train_num_final,X_train_cat_final], axis = 1)
X_val_final = pd.concat([X_val_num_final,X_val_cat_final], axis = 1)
X_test_final = pd.concat([X_test_num_final,X_test_cat_final], axis = 1)
print(X_train_final.shape)
print(X_val_final.shape)
print(X_test_final.shape)
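
The featurization cells fit every imputer, scaler and encoder on the training split only and then reuse the fitted objects on the validation and test splits. The same steps can be expressed more compactly with scikit-learn's `ColumnTransformer`. The sketch below is an alternative formulation, not the code used in this notebook, and the columns `income` and `housing` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy splits; 'with parents' never appears in the training data
train = pd.DataFrame({
    'income': [10.0, np.nan, 30.0, 20.0],
    'housing': pd.Series(['rent', 'own', np.nan, 'own'], dtype=object),
})
valid = pd.DataFrame({
    'income': [np.nan, 40.0],
    'housing': pd.Series(['own', 'with parents'], dtype=object),
})
num_cols, cat_cols = ['income'], ['housing']

numeric = Pipeline([('impute', SimpleImputer(strategy='median')),
                    ('scale', StandardScaler())])
categorical = Pipeline([('impute', SimpleImputer(strategy='constant', fill_value='MISSING')),
                        ('ohe', OneHotEncoder(handle_unknown='ignore'))])

# sparse_threshold=0 makes the combined output a dense array
preprocess = ColumnTransformer([('num', numeric, num_cols),
                                ('cat', categorical, cat_cols)],
                               sparse_threshold=0)

train_arr = preprocess.fit_transform(train)   # statistics learned here only
valid_arr = preprocess.transform(valid)       # reused, never refitted
print(train_arr.shape, valid_arr.shape)
```

Because of `handle_unknown='ignore'`, a category that was not seen during fitting is encoded as all zeros instead of raising an error.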

In [None]:
# Saving the Dataframes into CSV files for future use
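# index=False keeps the row index out of the file, so reading it back does not add an 'Unnamed: 0' column
X_train_final.to_csv('X_train_final.csv', index=False)
X_val_final.to_csv('X_val_final.csv', index=False)
X_test_final.to_csv('X_test_final.csv', index=False)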

In [None]:
# Saving the numpy arrays into text files for future use
np.savetxt('y.txt', y)
np.savetxt('y_train.txt', y_train)
np.savetxt('y_val.txt', y_val)
np.savetxt('y_test.txt', y_test)

<h3> 2.7 Selection of features</h3>

In [None]:
model_sk = lgb.LGBMClassifier(boosting_type='gbdt', max_depth=7, learning_rate=0.01, n_estimators= 2000, 
                 class_weight='balanced', subsample=0.9, colsample_bytree= 0.8, n_jobs=-1)

train_features, valid_features, train_y, valid_y = train_test_split(X_train_final, y_train, test_size = 0.15, random_state = 42)

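# LightGBM 4.0 removed the early_stopping_rounds and verbose arguments of fit(); callbacks (LightGBM >= 3.3) do the same job
model_sk.fit(train_features, train_y, eval_set = [(valid_features, valid_y)], eval_metric = 'auc',
             callbacks = [lgb.early_stopping(stopping_rounds=100), lgb.log_evaluation(period=200)])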


In [None]:
feature_imp = pd.DataFrame(sorted(zip(model_sk.feature_importances_, X_train_final.columns)), columns=['Value','Feature'])
features_df = feature_imp.sort_values(by="Value", ascending=False)
selected_features = list(features_df[features_df['Value']>=50]['Feature'])
print('The no. of features selected:',len(selected_features))

In [None]:
# Saving the selected features into pickle file
with open('select_features.txt','wb') as fp:
    pickle.dump(selected_features, fp)

In [None]:
# Feature importance Plot
data1 = features_df.head(20)
data = [go.Bar(x =data1.sort_values(by='Value')['Value'] , y = data1.sort_values(by='Value')['Feature'], orientation = 'h',
              marker = dict(
        color = 'rgba(43, 13, 150, 0.6)',
        line = dict(
            color = 'rgba(43, 13, 150, 1.0)',
            width = 1.5)
    )) ]

layout = go.Layout(
    autosize=False,
    width=1300,
    height=700,
    title = "Top 20 important features",
    xaxis=dict(
        title='Importance value'
        ),
    yaxis=dict(
        automargin=True
        ),
    bargap=0.4
    )

fig = go.Figure(data = data, layout=layout)
#fig.layout.template = 'seaborn'


py.iplot(fig)

<h1> 3. Machine Learning Models </h1>

In [None]:
X_train_final = pd.read_csv('X_train_final.csv')
X_val_final = pd.read_csv('X_val_final.csv')
X_test_final = pd.read_csv('X_test_final.csv')

print('X_train_final',X_train_final.shape)
print('X_val_final',X_val_final.shape)
print('X_test_final',X_test_final.shape)

In [None]:
with open('select_features.txt', 'rb') as fp:
    selected_features = pickle.load(fp)

In [None]:
y_train = np.loadtxt('y_train.txt')
y_val = np.loadtxt('y_val.txt')
y_test = np.loadtxt('y_test.txt')

In [None]:
def plot_confusion_matrix(test_y, predicted_y):
    # Confusion matrix
    C = confusion_matrix(test_y, predicted_y)
    
    # Recall matrix
    A = (((C.T)/(C.sum(axis=1))).T)
    
    # Precision matrix
    B = (C/C.sum(axis=0))
    
    plt.figure(figsize=(20,4))
    
    labels = ['Re-paid(0)','Not Re-paid(1)']
    cmap=sns.light_palette("purple")
    plt.subplot(1,3,1)
    sns.heatmap(C, annot=True, cmap=cmap,fmt="d", xticklabels = labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title('Confusion matrix')
    
    plt.subplot(1,3,2)
    sns.heatmap(A, annot=True, cmap=cmap, xticklabels = labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title('Recall matrix')
    
    plt.subplot(1,3,3)
    sns.heatmap(B, annot=True, cmap=cmap, xticklabels = labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title('Precision matrix')
    
    plt.show()
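
The recall and precision matrices in `plot_confusion_matrix` are the confusion matrix normalised by its row sums and its column sums respectively. A small worked example with made-up labels shows the arithmetic:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 6 re-paid (0) and 4 not re-paid (1) loans
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# Rows are original classes, columns are predicted classes
C = confusion_matrix(y_true, y_pred)

# Divide each row by its total: every row sums to 1, the diagonal holds per-class recall
recall_matrix = C / C.sum(axis=1, keepdims=True)

# Divide each column by its total: every column sums to 1, the diagonal holds per-class precision
precision_matrix = C / C.sum(axis=0, keepdims=True)

print(C)
print(recall_matrix)
print(precision_matrix)
```

Here 3 of the 4 defaulters are caught (recall 0.75), while 3 of the 5 loans flagged as defaults really are defaults (precision 0.6).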

In [None]:
def cv_plot(alpha, cv_auc):
    
    fig, ax = plt.subplots()
    ax.plot(np.log10(alpha), cv_auc,c='g')
    for i, txt in enumerate(np.round(cv_auc,3)):
        ax.annotate((alpha[i],str(txt)), (np.log10(alpha[i]),cv_auc[i]))
    plt.grid()
    plt.xticks(np.log10(alpha))
    plt.title("Validation AUC for each alpha")
    plt.xlabel("log10(alpha)")
    plt.ylabel("AUC")
    plt.show()

<h2> 3.1 Logistic regression with selected features </h2>

In [None]:
alpha = np.logspace(-4,4,9)
cv_auc_score = []

for i in alpha:
    # loss='log_loss' (scikit-learn >= 1.1) is the current name of the former loss='log'
    clf = SGDClassifier(alpha=i, penalty='l1',class_weight = 'balanced', loss='log_loss', random_state=28)
    clf.fit(X_train_final[selected_features], y_train)
    sig_clf = CalibratedClassifierCV(clf, method='sigmoid')
    sig_clf.fit(X_train_final[selected_features], y_train)
    y_pred_prob = sig_clf.predict_proba(X_val_final[selected_features])[:,1]
    cv_auc_score.append(roc_auc_score(y_val,y_pred_prob))
    print('For alpha {0}, cross validation AUC score {1}'.format(i,roc_auc_score(y_val,y_pred_prob)))

cv_plot(alpha, cv_auc_score)    

print('The optimal alpha value is:', alpha[np.argmax(cv_auc_score)])

In [None]:
best_alpha = alpha[np.argmax(cv_auc_score)]
# loss='log_loss' (scikit-learn >= 1.1) is the current name of the former loss='log'
logreg = SGDClassifier(alpha = best_alpha, class_weight = 'balanced', penalty = 'l1', loss='log_loss', random_state = 28)
logreg.fit(X_train_final[selected_features], y_train)
logreg_sig_clf = CalibratedClassifierCV(logreg, method='sigmoid')
logreg_sig_clf.fit(X_train_final[selected_features], y_train)
y_pred_prob = logreg_sig_clf.predict_proba(X_train_final[selected_features])[:,1]
print('For best alpha {0}, The Train AUC score is {1}'.format(best_alpha, roc_auc_score(y_train,y_pred_prob) ))    
y_pred_prob = logreg_sig_clf.predict_proba(X_val_final[selected_features])[:,1]
print('For best alpha {0}, The Cross validated AUC score is {1}'.format(best_alpha, roc_auc_score(y_val,y_pred_prob) ))  
y_pred_prob = logreg_sig_clf.predict_proba(X_test_final[selected_features])[:,1]
print('For best alpha {0}, The Test AUC score is {1}'.format(best_alpha, roc_auc_score(y_test,y_pred_prob) ))  
# Hard class labels come from the uncalibrated SGD model; the AUC scores above use the calibrated probabilities
y_pred = logreg.predict(X_test_final[selected_features])
print('The percentage of misclassified points: {:05.2f}%'.format((1-accuracy_score(y_test, y_pred))*100))
plot_confusion_matrix(y_test, y_pred)

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc = roc_auc_score(y_test,y_pred_prob)

plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, marker='.')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.title('ROC curve', fontsize = 20)
plt.xlabel('FPR', fontsize=15)
plt.ylabel('TPR', fontsize=15)
plt.grid()
plt.legend(["AUC=%.3f"%auc])
plt.show()
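
The AUC shown in the legend is the area under the plotted (FPR, TPR) curve. The sketch below, on four made-up scores, checks that integrating the output of `roc_curve` with `sklearn.metrics.auc` gives the same number as `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up labels and predicted probabilities
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

area_from_curve = auc(fpr, tpr)                  # trapezoidal area under the ROC points
area_direct = roc_auc_score(y_true, y_score)     # same quantity computed directly

print(area_from_curve, area_direct)
```

Three of the four (positive, negative) pairs are ranked correctly in this toy case, which is another way to read the resulting AUC of 0.75.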