#    AV Janata Hack - Credit Card Payment Default Prediction Hackathon              

# Executive Summary


As a participant in the AnalyticsVidhya Janata Hack - Credit Card Payment Default Prediction Hackathon, I analyzed the data on credit card payment defaults by the customers of a Taiwan-based company. Based on this analysis, I present the following summary of my findings and proposed changes going forward:

1. As per the available data, 22% of the customers are flagged as defaulting on their payments next month. This is an extremely high share of defaulting customers. The global benchmark for payment default is less than 2%.

2. Although the problem statement names payment defaults (i.e. customers not paying their bills) as the primary concern the company is trying to address, the analysis of the data shows that the real problem is customers paying less than the "Minimum Billing Amount", which results in large "Pending Amounts" against such "Delinquent" customers month after month.

3. The company gave a credit of 3.5 billion NT dollars to its customers. 

4. The total payment pending from delinquent customers is 5 billion NT dollars. 

5. The total payment pending from defaulting customers is 1.2 billion NT dollars.

6. The pending amount keeps increasing by an order of magnitude as the customers keep getting delinquent month after month. This is also because there is a regular increase in the number of customers who are getting delinquent multiple times.

7. At the same time, the pending amount keeps decreasing by an order of magnitude as the customers keep defaulting month after month. This is also because there is a regular decrease in the number of customers who are defaulting multiple times.

**We can clearly establish that certain basic "Credit Controls" have not been observed by the company. Obviously, they have allowed customers to continue to use cards beyond the authorized credit limits.**

**The reason is very simple - the entire focus has been on customers who shall not pay next month, rather than on customers who are paying less than the "Minimum Amounts" and falling into "Delinquent" status month-after-month.**

**My recommendation is that the focus of the company should shift to predicting customers who shall pay less than the Minimum Amount to keep their cards "Active", rather than on customers who make "No Payments."**

**This change in strategy and focus will deliver far greater financial and bottom line benefits to the company.**



# Problem Statement



Predicting accurately which customers are most probable to default represents a significant business opportunity for all banks. Bank cards are the most common credit card type in Taiwan, which emphasizes the impact of risk prediction on both the consumers and banks. 

This would inform the bank’s decisions on criteria to approve a credit card application and also decide upon what credit limit to provide.

This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. 

Using the information given, predict the probability of a customer defaulting in the next month.



## Data Dictionary

Below is a thorough description of the 25 features/variables:

1. **ID :** Customer ID

2. **LIMIT_BAL:** Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.

3. **SEX :** Gender (1 = male; 2 = female).

4. **EDUCATION:** Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).

5. **MARRIAGE :** Marital status (1 = married; 2 = single; 3 = others).

6. **AGE :** Age (year).

7. **PAY_0 - PAY_6:** History of past payments. The past monthly payment records (from April to September, 2005) are as follows: 

* PAY_0 = the repayment status in September, 2005 
* PAY_2 = the repayment status in August, 2005; . . .;
* PAY_6 = the repayment status in April, 2005. 

The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.


8. **BILL_AMT1 - BILL_AMT6:** Amount of bill statement (NT dollar) as follows : 

* BILL_AMT1 = amount of bill statement in September, 2005 
* BILL_AMT2 = amount of bill statement in August, 2005; . . .; 
* BILL_AMT6 = amount of bill statement in April, 2005.

9. **PAY_AMT1 - PAY_AMT6:** Amount of previous payment (NT dollar) as follows : 

* PAY_AMT1 = amount paid in September, 2005; 
* PAY_AMT2 = amount paid in August, 2005; . . .;
* PAY_AMT6 = amount paid in April, 2005.

10. **default_payment_next_month:** Target variable - Default Payment (Yes = 1, No = 0)



#                               Hypothesis Generation


Based on the information provided about the data, and general understanding of the credit card business, let us try to list out all the possible factors that can affect the outcome (identifying who will default on payment) :

1. It should not matter whether customer is a male or a female when it comes to defaulting on payments. We assume equal probability of defaulters being males or females.

2. Married customers are less likely to default than single customers.

3. Younger customers (less than 25 years of age) are more likely to default, than senior customers.

4. Customers who are senior in age, education and job profile should have higher billing, than the younger customers who are still studying or are unemployed. 

5. Lower the billing, lower should be the probability of a payment default. Lower the billing, higher should be the probability of making full payment on time, and hence lower the probability of payment default.

6. Higher the billing, higher should be the probability of payment default irrespective of credit limit.  Higher the billing, lower should be the probability of making full payment on time, and hence higher the probability of payment default.

7. Higher the billing, higher the credit limit, and hence higher the probability that customer shall not make regular payments. Lower the billing, lower the credit limit, and hence lower the probability that customer shall not make regular payments.

8. We should expect about 2% delinquency, which is a normal and acceptable delinquency rate in the cards industry.

9. Not paying even the Minimum Amount required to be paid to keep the card from getting blocked should point to increasing probability of payment default in next cycle.

10. Paying only the Minimum Amount month after month should result in higher probability of eventual payment default.

As Data Scientists, we should discuss these hypotheses with the business and have their full concurrence. We should be ready to make any changes to these so that there are no surprises when we actually present the analysis facts to the business.

Also, it is a well known fact that in any data science / machine learning project, more than 70% of the time gets spent on data gathering, data exploration, data cleaning, data transformation, data enhancement and detailed data analysis.

Only when the data is understood clearly do we proceed to apply ML modeling algorithms.

So let us spend the necessary 70% of the time in understanding our data.



## Loading Packages and data.


For this Janata Hack problem, we have been given three CSV files: train, test and sample submission.

Train file will be used for training the model, i.e. our model will learn from this file. It contains all the independent variables and the target variable.

Test file contains all the independent variables, but not the target variable. We will apply the model to predict the target variable for the test data.

Sample submission file contains the format in which we have to submit our predictions.



In [None]:
import pandas as pd 
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot as plt 
%matplotlib inline 

import warnings
warnings.filterwarnings("ignore")

In [None]:
# Reading data

train_df=pd.read_csv("../input/av-janata-hack-payment-default-prediction/train_20D8GL3.csv") 
test_df=pd.read_csv("../input/av-janata-hack-payment-default-prediction/test_O6kKpvt.csv")


Let’s make a copy of train and test data so that even if we have to make any changes in these datasets we would not lose the original datasets.



In [None]:
original_train_df=train_df.copy() 
original_test_df=test_df.copy()

#                        Understanding the Data


We will look at the structure of the train and test datasets. 

Firstly, we will check the features present in our data and then we will look at their data types.



In [None]:
train_df.columns

In [None]:
test_df.columns

We have 24 independent variables and 1 target variable, i.e. 'default_payment_next_month' in the train dataset. 

We will predict the Payment Default next month using the model built using the train data.



In [None]:
train_df.dtypes

In [None]:
test_df.dtypes

We see that all our data is already in numeric (integer) format.


In [None]:
train_df.shape, test_df.shape

We have 21000 rows and 25 columns in the train dataset and 9000 rows and 24 columns in test dataset.

In [None]:
train_df.head()

In [None]:
test_df.head()

##       Proposed Enhancements to the data provided

From the metadata provided, and looking at the first few rows of the training dataset (and hence the test dataset), we find the following serious anomalies in the dataset:

1. The PAY_0 column name should actually be PAY_1. We shall rename it.

2. The columns PAY_1 to PAY_6 are said to capture the "Payment delay in months". For example, PAY_6 represents "the repayment status in April, 2005"; a value of 7 in that column would mean "payment delay for seven months", whereas a value of -1 would mean "paid duly". However, many of the values in PAY_1 to PAY_6 are 0 and -2, and the metadata does not assign any meaning to 0 or -2 in these columns.
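
One way to confirm which repayment codes fall outside the documented scale is to compare the observed values against it. The sketch below is only an illustration on made-up rows: `toy_df`, `DOCUMENTED_CODES` and the helper `undocumented_codes` are hypothetical names that are not part of the competition data or of this notebook, and the documented scale is assumed to be -1 and 1 through 9, as listed in the data dictionary.

```python
import pandas as pd

# Codes documented in the data dictionary: -1 (pay duly) and 1..9 (months of delay).
DOCUMENTED_CODES = {-1} | set(range(1, 10))

def undocumented_codes(df, pay_cols):
    """Return the sorted list of repayment codes not covered by the data dictionary."""
    observed = set()
    for col in pay_cols:
        observed.update(df[col].unique().tolist())
    return sorted(observed - DOCUMENTED_CODES)

# Hypothetical toy rows standing in for the real PAY_* columns.
toy_df = pd.DataFrame({
    'PAY_1': [-1, 0, 2, -2],
    'PAY_2': [0, 0, 1, -2],
})

extra_codes = undocumented_codes(toy_df, ['PAY_1', 'PAY_2'])
print(extra_codes)
```

Running the same helper on `train_df` with the six PAY_* columns would list every code that needs a business definition before modeling.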

Now when it comes to credit card payments, the following types of payments are generally captured by such organizations :

1.  **"No Payment"**  indicating that the customer refused to make any payment against the bill.

2.  **"Minimum Payment"**  representing about 5-10% against the total billed amount which the customer has to pay within the credit period applicable to the customer to keep the card current and active.

3. **"Full Payment"**  representing full payment against the total billed amount which the customer has to pay within the credit period applicable.

4. **"Part Payment"**  representing any fraction of the payment against the total billed amount which the customer may decide to pay.

5. Such **"Part Payment"**  can be less than the **"Minimum Payment"**, equal to the minimum payment or more than the minimum payment but less than the total amount.

6. All such **"Pending Payments"** are then carried forward to the next bill, with additional applicable finance and interest charges on such pending amount.

7. Customer can keep paying **"Part Payments"** over many months and keep accumulating his **"Total Pending Amount"** till his **"Credit Balance Limit"** is reached, in the process incurring additional monthly finance and interest charges.

Based on this common understanding of the credit card billing cycle, we shall use the BILL_AMT1 to BILL_AMT6 columns, the PAY_AMT1 to PAY_AMT6 columns and the LIMIT_BAL column to create additional variables that enrich our predictive model.

Even though it is not explicitly mentioned anywhere, we are assuming that all the bills have a 1-month credit period in which customers need to make payments. The additional variables are as follows:

1.  **MIN_AMT1 to MIN_AMT6 :**

These values shall represent the "Minimum Billing" amount that needs to be paid within the next 1-month credit period for the card to be kept active. We are taking 10% as the minimum amount that needs to be paid to keep the card live. This means that MIN_AMT1 shall represent the minimum amount (10%) of the total pending amount till August-2005 to be paid in September-2005.

2.  **PENDING_AMT1 to PENDING_AMT6 :**

These variables shall represent the "Total Pending Amount" as of each of the 6 bills, since we have data for only 6 months. Hence PENDING_AMT1 shall represent the total pending amount in September-2005, which is the accumulated unpaid amount over all 6 months (i.e. since April-2005), PENDING_AMT2 shall represent the total pending amount in August-2005 accumulated since April-2005, and so on.

3.  **DELINQ_1 to DELINQ_5 :**

These indicators shall represent how many times the customer has been classified into "Delinquent" state. As per the credit card business, customers move to the delinquency state when they fail to pay the Minimum Billing Amount necessary to keep the card alive. Assuming 1-month credit period for making the payment, this would mean that if the customer does not pay more than the MIN_AMT bill of August-2005 by September-2005, the customer would be classified as "Delinquent". 

Please note that any amount paid which is less than the required MIN_AMT is considered as no payment, and customer is moved to "Delinquent" state.

Hence, DELINQ_1 would represent how many times customer has been in delinquent state since April-2005.

4. **NO_PMNT1 to NO_PMNT5 :**

These indicators shall capture how many times the customer made "No Payments". This would mean that if the customer does not pay any amount (0 value) against the bill of August-2005 by September-2005, the customer payment status would be moved to "No Payment" state.

Hence, NO_PMNT1 would represent how many times customer made "No Payments" since April-2005.

5. **AVG_6MTH_BAL :**

This value shall represent PENDING_AMT1 divided by 6, i.e. the mean (average) monthly pending amount over the 6 month period.

6. **CREDIT_UTIL_RATIO :**

Average 6 month balance (AVG_6MTH_BAL) divided by the individual’s credit limit (LIMIT_BAL). As per the credit card industry, anything <= .3 is considered good, whereas anything closer to 1 is considered very risky.

From the data given, we have been asked to predict customers who are likely to default in their payments next month. 

Considering that the last bill data in the dataset is for September-2005, we have been asked to predict customers who shall default in Oct-2005.


## Create additional variables for the enhancement of the data


First of all, we shall proceed to create the additional variables from the existing data as described in detail above.



In [None]:
train_df.rename(columns = {'PAY_0':'PAY_1'}, inplace = True)
test_df.rename(columns = {'PAY_0':'PAY_1'}, inplace = True)

We now proceed to create the additional variables that were explained earlier.

Let us start with creating MIN_AMT6 variable which shall represent the minimum amount of the April-2005 bill to be paid by May-2005.



In [None]:
train_df['MIN_AMT6']=train_df['BILL_AMT6']*0.1
test_df['MIN_AMT6']=test_df['BILL_AMT6']*0.1


Let us create PENDING_AMT6 variable which shall represent (April-2005 Bill Amount - April-2005 Payment Amount). 



In [None]:
train_df['PENDING_AMT6']=train_df['BILL_AMT6'] - train_df['PAY_AMT6']
test_df['PENDING_AMT6']=test_df['BILL_AMT6'] - test_df['PAY_AMT6']

Let us create MIN_AMT5 variable which shall represent the minimum of the pending amount till May-2005 to be paid by June-2005.



In [None]:
train_df['MIN_AMT5']=train_df['PENDING_AMT6']*0.1
test_df['MIN_AMT5']=test_df['PENDING_AMT6']*0.1

Let us create DELINQ_5 variable which shall represent whether the customer became 'Delinquent' in May-2005, i.e. made a payment (PAY_AMT5) that was greater than zero but less than the minimum amount due (MIN_AMT6).



In [None]:
train_df['DELINQ_5'] = np.where((train_df['PAY_AMT5']>0) & (train_df['PAY_AMT5']<train_df['MIN_AMT6']),1,0)
test_df['DELINQ_5'] = np.where((test_df['PAY_AMT5']>0) & (test_df['PAY_AMT5']<test_df['MIN_AMT6']),1,0)

Let us create NO_PMNT5 variable which shall represent whether the customer did not make any payment in May-2005.



In [None]:
train_df['NO_PMNT5']=np.where(train_df['PAY_AMT5'] == 0,1,0)
test_df['NO_PMNT5']=np.where(test_df['PAY_AMT5'] == 0,1,0)

Let us create PENDING_AMT5 variable. 



In [None]:
train_df['PENDING_AMT5'] = (train_df['BILL_AMT5']+train_df['BILL_AMT6']) - (train_df['PAY_AMT5']+train_df['PAY_AMT6'])
test_df['PENDING_AMT5'] = (test_df['BILL_AMT5']+test_df['BILL_AMT6']) - (test_df['PAY_AMT5']+test_df['PAY_AMT6'])

Let us create all other derived variables in a similar manner.
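
The cells below spell out each month explicitly. The same pattern can also be expressed as a loop; the following is only a sketch of that idea and not the code used in the rest of this notebook. The function name `add_payment_features`, its `min_rate` parameter and the one-row `toy` frame are assumptions made for illustration. The rules mirror the cells in this section: pending amounts accumulate from April-2005 onwards, each minimum amount is 10% of the pending amount one month earlier, and the delinquency and no-payment indicators are running counts from May-2005.

```python
import pandas as pd

def add_payment_features(df, min_rate=0.10):
    """Return a copy of df with PENDING_AMT*, MIN_AMT*, DELINQ_* and NO_PMNT* columns."""
    out = df.copy()

    # Cumulative (bills - payments), starting from month 6 (April-2005).
    bill_total = 0
    paid_total = 0
    for k in range(6, 0, -1):
        bill_total = bill_total + out[f'BILL_AMT{k}']
        paid_total = paid_total + out[f'PAY_AMT{k}']
        out[f'PENDING_AMT{k}'] = bill_total - paid_total

    # Minimum amounts: 10% of the April bill, then 10% of the previous pending amount.
    out['MIN_AMT6'] = out['BILL_AMT6'] * min_rate
    for k in range(5, 0, -1):
        out[f'MIN_AMT{k}'] = out[f'PENDING_AMT{k + 1}'] * min_rate

    # Running counts of "paid something, but below the minimum" and "paid nothing".
    delinq = 0
    no_pmnt = 0
    for k in range(5, 0, -1):
        paid = out[f'PAY_AMT{k}']
        delinq = delinq + ((paid > 0) & (paid < out[f'MIN_AMT{k + 1}'])).astype(int)
        no_pmnt = no_pmnt + (paid == 0).astype(int)
        out[f'DELINQ_{k}'] = delinq
        out[f'NO_PMNT{k}'] = no_pmnt
    return out

# Hypothetical single customer: a bill of 1000 every month, irregular payments.
toy = pd.DataFrame({
    **{f'BILL_AMT{k}': [1000] for k in range(1, 7)},
    **{f'PAY_AMT{k}': [p] for k, p in zip(range(1, 7), [0, 10, 500, 0, 50, 0])},
})
toy_features = add_payment_features(toy)
print(toy_features[['PENDING_AMT1', 'DELINQ_1', 'NO_PMNT1']])
```

Keeping the explicit cells makes every intermediate column easy to inspect, while a loop like this removes the risk of copy-paste slips between the train and test frames.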


In [None]:
train_df['MIN_AMT4']=train_df['PENDING_AMT5']*0.1
test_df['MIN_AMT4']=test_df['PENDING_AMT5']*0.1

In [None]:
train_df['DELINQ_4'] = np.where((train_df['PAY_AMT4']>0) & (train_df['PAY_AMT4']<train_df['MIN_AMT5']),1,0) + train_df['DELINQ_5']
test_df['DELINQ_4'] = np.where((test_df['PAY_AMT4']>0) & (test_df['PAY_AMT4']<test_df['MIN_AMT5']),1,0) + test_df['DELINQ_5']

In [None]:
train_df['NO_PMNT4']=np.where(train_df['PAY_AMT4'] == 0,1,0) + train_df['NO_PMNT5']
test_df['NO_PMNT4']=np.where(test_df['PAY_AMT4'] == 0,1,0) + test_df['NO_PMNT5']

In [None]:
train_df['PENDING_AMT4'] = (train_df['BILL_AMT4']+train_df['BILL_AMT5']+train_df['BILL_AMT6']) - (train_df['PAY_AMT4']+train_df['PAY_AMT5']+train_df['PAY_AMT6'])
test_df['PENDING_AMT4'] = (test_df['BILL_AMT4']+test_df['BILL_AMT5']+test_df['BILL_AMT6']) - (test_df['PAY_AMT4']+test_df['PAY_AMT5']+test_df['PAY_AMT6'])

In [None]:
train_df['MIN_AMT3']=train_df['PENDING_AMT4']*0.1
test_df['MIN_AMT3']=test_df['PENDING_AMT4']*0.1

In [None]:
train_df['DELINQ_3'] = np.where((train_df['PAY_AMT3']>0) & (train_df['PAY_AMT3']<train_df['MIN_AMT4']),1,0) + train_df['DELINQ_4']
test_df['DELINQ_3'] = np.where((test_df['PAY_AMT3']>0) & (test_df['PAY_AMT3']<test_df['MIN_AMT4']),1,0) + test_df['DELINQ_4']

In [None]:
train_df['NO_PMNT3']=np.where(train_df['PAY_AMT3'] == 0,1,0) + train_df['NO_PMNT4']
test_df['NO_PMNT3']=np.where(test_df['PAY_AMT3'] == 0,1,0) + test_df['NO_PMNT4']

In [None]:
train_df['PENDING_AMT3'] = (train_df['BILL_AMT3']+train_df['BILL_AMT4']+train_df['BILL_AMT5']+train_df['BILL_AMT6']) - (train_df['PAY_AMT3']+train_df['PAY_AMT4']+train_df['PAY_AMT5']+train_df['PAY_AMT6'])
test_df['PENDING_AMT3'] = (test_df['BILL_AMT3']+test_df['BILL_AMT4']+test_df['BILL_AMT5']+test_df['BILL_AMT6']) - (test_df['PAY_AMT3']+test_df['PAY_AMT4']+test_df['PAY_AMT5']+test_df['PAY_AMT6'])

In [None]:
train_df['MIN_AMT2']=train_df['PENDING_AMT3']*0.1
test_df['MIN_AMT2']=test_df['PENDING_AMT3']*0.1

In [None]:
train_df['DELINQ_2'] = np.where((train_df['PAY_AMT2']>0) & (train_df['PAY_AMT2']<train_df['MIN_AMT3']),1,0) + train_df['DELINQ_3']
test_df['DELINQ_2'] = np.where((test_df['PAY_AMT2']>0) & (test_df['PAY_AMT2']<test_df['MIN_AMT3']),1,0) + test_df['DELINQ_3']

In [None]:
train_df['NO_PMNT2']=np.where(train_df['PAY_AMT2'] == 0,1,0) + train_df['NO_PMNT3']
test_df['NO_PMNT2']=np.where(test_df['PAY_AMT2'] == 0,1,0) + test_df['NO_PMNT3']

In [None]:
train_df['PENDING_AMT2'] = (train_df['BILL_AMT2']+train_df['BILL_AMT3']+train_df['BILL_AMT4']+train_df['BILL_AMT5']+train_df['BILL_AMT6']) - (train_df['PAY_AMT2']+train_df['PAY_AMT3']+train_df['PAY_AMT4']+train_df['PAY_AMT5']+train_df['PAY_AMT6'])
test_df['PENDING_AMT2'] = (test_df['BILL_AMT2']+test_df['BILL_AMT3']+test_df['BILL_AMT4']+test_df['BILL_AMT5']+test_df['BILL_AMT6']) - (test_df['PAY_AMT2']+test_df['PAY_AMT3']+test_df['PAY_AMT4']+test_df['PAY_AMT5']+test_df['PAY_AMT6'])

In [None]:
train_df['MIN_AMT1']=train_df['PENDING_AMT2']*0.1
test_df['MIN_AMT1']=test_df['PENDING_AMT2']*0.1

In [None]:
train_df['DELINQ_1'] = np.where((train_df['PAY_AMT1']>0) & (train_df['PAY_AMT1']<train_df['MIN_AMT2']),1,0) + train_df['DELINQ_2']
test_df['DELINQ_1'] = np.where((test_df['PAY_AMT1']>0) & (test_df['PAY_AMT1']<test_df['MIN_AMT2']),1,0) + test_df['DELINQ_2']

In [None]:
train_df['NO_PMNT1']=np.where(train_df['PAY_AMT1'] == 0,1,0) + train_df['NO_PMNT2']
test_df['NO_PMNT1']=np.where(test_df['PAY_AMT1'] == 0,1,0) + test_df['NO_PMNT2']

In [None]:
train_df['PENDING_AMT1'] = (train_df['BILL_AMT1']+train_df['BILL_AMT2']+train_df['BILL_AMT3']+train_df['BILL_AMT4']+train_df['BILL_AMT5']+train_df['BILL_AMT6']) - (train_df['PAY_AMT1']+train_df['PAY_AMT2']+train_df['PAY_AMT3']+train_df['PAY_AMT4']+train_df['PAY_AMT5']+train_df['PAY_AMT6'])
test_df['PENDING_AMT1'] = (test_df['BILL_AMT1']+test_df['BILL_AMT2']+test_df['BILL_AMT3']+test_df['BILL_AMT4']+test_df['BILL_AMT5']+test_df['BILL_AMT6']) - (test_df['PAY_AMT1']+test_df['PAY_AMT2']+test_df['PAY_AMT3']+test_df['PAY_AMT4']+test_df['PAY_AMT5']+test_df['PAY_AMT6'])

Let us create AVG_6MTH_BAL variable which shall represent mean / average value of Amount owed (PENDING_AMT1) by the customer over a 6 month period.



In [None]:
train_df['AVG_6MTH_BAL'] = train_df['PENDING_AMT1']/6
test_df['AVG_6MTH_BAL'] = test_df['PENDING_AMT1']/6

Let us create CREDIT_UTIL_RATIO variable which shall represent (average 6 month balance divided by the individual’s credit limit)

Please note that a Credit utilization ratio <= .3 is considered good, whereas anything close to 1 or more is considered very risky.
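
To make the rule of thumb concrete, the ratio can be mapped to a small set of bands. This is a sketch under stated assumptions: the helper `utilization_band`, the 'MODERATE' label and the exact cut-off of 1.0 for 'VERY_RISKY' are illustrative choices, since the text above only says that <= .3 is good and that values close to 1 or more are very risky. The `toy` frame is made up.

```python
import numpy as np
import pandas as pd

def utilization_band(ratio):
    """Label utilization ratios; the 'MODERATE' band and the 1.0 cut-off are assumptions."""
    conditions = [ratio <= 0.3, ratio < 1.0]
    labels = ['GOOD', 'MODERATE']
    banded = np.select(conditions, labels, default='VERY_RISKY')
    return pd.Series(banded, index=ratio.index)

# Hypothetical customers, all with a credit limit of 100,000 NT dollars.
toy = pd.DataFrame({
    'AVG_6MTH_BAL': [15000.0, 60000.0, 120000.0],
    'LIMIT_BAL': [100000.0, 100000.0, 100000.0],
})
toy['CREDIT_UTIL_RATIO'] = toy['AVG_6MTH_BAL'] / toy['LIMIT_BAL']
toy['UTIL_BAND'] = utilization_band(toy['CREDIT_UTIL_RATIO'])
print(toy)
```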



In [None]:
train_df['CREDIT_UTIL_RATIO'] = train_df['AVG_6MTH_BAL']/train_df['LIMIT_BAL']
test_df['CREDIT_UTIL_RATIO'] = test_df['AVG_6MTH_BAL']/test_df['LIMIT_BAL']

Looking at the Age distribution of the customers, we have decided to divide the customers into Age Bins as follows.
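
Before relying on the bins, it helps to know how `pd.cut` treats the edges: by default each interval includes its right edge, and values outside the outermost edges become missing. The short sketch below shows this on a few made-up ages (`toy_ages` is a hypothetical series), using the same bin edges and labels as the cell that follows.

```python
import pandas as pd

bins = [0, 20, 30, 40, 50, 60, 70, 80]
group = ['VERY_YOUNG', 'YOUNG', 'MIDDLE', 'SENIOR', 'VERY_SENIOR', 'RETIRED', 'ELDERLY']

# Made-up ages chosen to sit on and around the bin edges.
toy_ages = pd.Series([20, 21, 30, 45, 80, 85])
toy_bins = pd.cut(toy_ages, bins, labels=group)

# Age 20 falls in (0, 20], age 30 in (20, 30], and age 85 is outside every bin.
print(pd.DataFrame({'AGE': toy_ages, 'AGE_BIN': toy_bins}))
```

If any customer were older than 80, AGE_BIN would be missing for that row, so it is worth re-checking for missing values after binning.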



In [None]:
bins=[0,20,30,40,50,60,70,80] 
group=['VERY_YOUNG','YOUNG','MIDDLE','SENIOR','VERY_SENIOR','RETIRED','ELDERLY'] 

train_df['AGE_BIN']=pd.cut(train_df['AGE'],bins,labels=group)
test_df['AGE_BIN']=pd.cut(test_df['AGE'],bins,labels=group)

In [None]:
original_columns = train_df.columns

In [None]:
# delete un-necessary columns

columns_to_delete = ['ID','AGE']

In [None]:
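# Keep the original column order; a set difference returns columns in arbitrary order.
final_columns = [col for col in original_columns if col not in columns_to_delete]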

In [None]:
final_columns

Let us create final version of the training data set.



In [None]:
final_train_df = train_df[final_columns]

The test_df does not contain the target variable, so we will make appropriate change for the test_df.



In [None]:
original_columns = test_df.columns

In [None]:
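# Keep the original column order; a set difference returns columns in arbitrary order.
final_columns = [col for col in original_columns if col not in columns_to_delete]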

In [None]:
final_columns

In [None]:
final_test_df = test_df[final_columns]

With the above, we have created all the required variables and now we can proceed to Data Exploration & Analysis stage, and gather some Insights about this business.



#        Data Exploration & Data Analysis


Now, we will check the final features present in our data, look at their range of values, distributions, missing values, any outliers etc.



In [None]:
final_train_df.shape, final_test_df.shape

In [None]:
# Let us look at the datatypes of the training data columns.

final_train_df.dtypes

In [None]:
# Let us look at the datatypes of the test data columns.

final_test_df.dtypes

We see that all the columns are of numeric type except column AGE_BIN. 



##   Check missing data


Let's check if there is any missing data.



In [None]:
total = final_train_df.isnull().sum().sort_values(ascending = False)
percent = (final_train_df.isnull().sum()/final_train_df.isnull().count()*100).sort_values(ascending = False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent']).transpose()

There is no missing data in the entire dataset, so we have nothing to worry about substituting for any missing values.



In [None]:
final_train_df.describe()

There are 21,000 distinct credit card clients.

The average value for the amount of credit card limit is 167,214. The standard deviation is unusually large (at 128,965), and the max value is 800,000.

We observe negative values in BILL_AMT* and PAY_AMT*.

Let us analyze all of this in detail.

#                     Univariate Analysis



## Target Variable


We will first look at the target variable, i.e., default_payment_next_month. As it is a categorical variable, let us look at its frequency table, percentage distribution and bar plot.



In [None]:
final_train_df['default_payment_next_month'].value_counts()

In [None]:
# Pass the column as x explicitly; recent seaborn versions expect keyword arguments here.
sns.countplot(x=final_train_df['default_payment_next_month'])
plt.show()

Let us print proportions instead of counts.



In [None]:
final_train_df['default_payment_next_month'].value_counts(normalize=True)

In [None]:
final_train_df['default_payment_next_month'].value_counts(normalize=True).plot.bar()

We see that 22% of the customers shall be defaulting on their payments next month.

The global benchmark for payment default is less than 2%.

This is an extremely high number of defaulting customers.

Also, while this can be called an "Imbalanced dataset", this should have been far more imbalanced if we consider that only 2-3% customers should have been classified as defaulting.


Now let us visualize each variable separately.

Let’s visualize the categorical and ordinal features first.


##              Independent Variable (Categorical)



In [None]:
final_train_df['SEX'].value_counts(normalize=True).plot.bar(figsize=(10,5), title= 'SEX') 
plt.show()


It can be inferred from the above bar plot that about 60% of the customers are females.

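Please note that our initial hypothesis was about default behaviour, namely that defaulters are equally likely to be male or female, and not about the gender mix of the customer base. Since the customer base itself is skewed towards females, that hypothesis has to be checked by comparing default rates within each gender rather than raw counts of defaulters.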
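
A default rate per gender is the quantity that actually speaks to the first hypothesis. The sketch below shows the calculation on made-up rows (`toy` is hypothetical and its numbers say nothing about the real data); on the notebook data the same `groupby` would be applied to `final_train_df`.

```python
import pandas as pd

# Hypothetical customers: 4 males (SEX = 1) and 6 females (SEX = 2).
toy = pd.DataFrame({
    'SEX': [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    'default_payment_next_month': [1, 0, 0, 0, 1, 0, 0, 0, 0, 1],
})

# The mean of a 0/1 target within each group is that group's default rate.
default_rate_by_sex = toy.groupby('SEX')['default_payment_next_month'].mean()
print(default_rate_by_sex)
```

Comparing rates instead of counts removes the effect of one group simply having more customers.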



Now let’s visualize the ordinal variables.


##                        Independent Variables (Ordinal)



In [None]:
plt.figure(1) 
plt.subplot(221) 
final_train_df['MARRIAGE'].value_counts(normalize=True).plot.bar(figsize=(20,10), title= 'MARRIAGE') 
plt.subplot(222) 
final_train_df['EDUCATION'].value_counts(normalize=True).plot.bar(title= 'EDUCATION') 
plt.show()

In [None]:
plt.figure(1)
plt.subplot(221) 
final_train_df['AGE_BIN'].value_counts(normalize=True).plot.bar(figsize=(20,10), title= 'AGE') 
plt.show()

Following inferences can be made from the above bar plots:

1. About 55% of the customers have 'Single' status

2. About 44% of the customers have 'Married' status

3. About 0.5% of the customers have 'Other' status

4. About 0.5% of the customers have 'Invalid' status (with a Value = 0)

5. About 50% of the customers have 'University' level education

6. About 35% of the customers have 'Graduate School' level education

7. About 14% of the customers have 'High School' level education

8. About 0.4% of the customers have 'Other' level education

9. About 0.6% of the customers have 'Invalid' level education (with Values = 0, 5, 6)

10. The majority of the customers are in the 20-50 years Age bracket

11. We also see a lot of customers in the 51-75 years Age bracket


##                   Independent Variable (Numerical)


Now let us visualize the numerical variables. Let us look at the distribution of 'LIMIT_BAL' first.



In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.distplot(final_train_df['LIMIT_BAL']); 
plt.subplot(122) 
final_train_df['LIMIT_BAL'].plot.box(figsize=(16,5)) 
plt.show()

Let us try to place the LIMIT_BAL in 20 bins and see the distribution.



In [None]:
bins = 20
plt.hist(final_train_df.LIMIT_BAL, bins = bins, label = 'Total', alpha=0.5)

# plt.hist(data.LIMIT_BAL[data['default.payment.next.month'] == 1], bins = bins, color='b',label = 'Default')

plt.xlabel('Credit Limit (NT dollar)');plt.ylabel('Number of Accounts')
plt.title('Fig.1 : Credit Limit ',fontweight="bold", size=12)
plt.legend();
plt.show()

We can clearly see that very few customers have a credit limit of 600,000 or more.

Let us count how many customers belong to this category.



In [None]:
len(final_train_df[final_train_df['LIMIT_BAL']>= 600000])

We see that 57 customers out of 21,000 appear as outliers, which is about 0.27% (57 / 21,000).

In [None]:
len(final_train_df[(final_train_df['LIMIT_BAL']>= 600000) & (final_train_df['default_payment_next_month']== 1)])

We also see that out of these 57 outlier customers, only 7 are flagged for 'default_payment_next_month'.

So we certainly reserve the option to remove these outliers if we find that this can improve our prediction ability.



It can be inferred that most of the data in the distribution of 'LIMIT_BAL' is towards left (right-skewed), which means it is not normally distributed. 

The boxplot confirms the presence of a lot of outliers/extreme values. 

Please note that the box plot flags outliers using a purely statistical rule: by default, any point lying more than 1.5 times the Inter-Quartile Range (IQR) below the first quartile or above the third quartile is drawn as an outlier.

From the business perspective this may not always be true. There could be genuine transactions with large values in the dataset. Removing them may not be the correct way in such cases.

The business should always be consulted about the correctness of these values.

In case they seem to be data entry errors, they can safely be either removed (OR) substituted with mean / mode / median values.

**(OR)**

If we find that there are relatively very few (say less than 3-5%) significant outliers whose presence is likely to bias the model, we can take a decision to remove these outliers.

We will study these outliers carefully.
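
As a starting point for that study, the usual box plot rule can be computed directly. The sketch below assumes the common 1.5 x IQR convention; the helper `iqr_fences` and the `toy_limits` values are hypothetical and only illustrate the calculation.

```python
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values with one obvious extreme point.
toy_limits = pd.Series([10, 20, 30, 40, 50, 1000])
lower, upper = iqr_fences(toy_limits)

# Points outside the fences are the ones a box plot would draw individually.
flagged = toy_limits[(toy_limits < lower) | (toy_limits > upper)]
print(lower, upper, flagged.tolist())
```

The flagged rows are candidates for review with the business, not an automatic removal list.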



Similar to LIMIT_BAL, let us see if we also have outliers for BILL_AMT* and PAY_AMT* variables.



In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.distplot(final_train_df['BILL_AMT1']); 
plt.subplot(122) 
final_train_df['BILL_AMT1'].plot.box(figsize=(16,5)) 
plt.show()

In [None]:
bins = 20
plt.hist(final_train_df.BILL_AMT1, bins = bins, label = 'Total', alpha=0.8)

plt.xlabel('Bill Amount Sep-2005 (NT dollar)');plt.ylabel('Number of Accounts')
plt.title('Bill Amount Sep-2005 ',fontweight="bold", size=12)
plt.legend();
plt.show()

In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.distplot(final_train_df['BILL_AMT2']); 
plt.subplot(122) 
final_train_df['BILL_AMT2'].plot.box(figsize=(16,5)) 
plt.show()

In [None]:
bins = 20
plt.hist(final_train_df.BILL_AMT2, bins = bins, label = 'Total', alpha=0.8)

plt.xlabel('Bill Amount Aug-2005 (NT dollar)');plt.ylabel('Number of Accounts')
plt.title('Bill Amount Aug-2005 ',fontweight="bold", size=12)
plt.legend();
plt.show()

In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.distplot(final_train_df['BILL_AMT3']); 
plt.subplot(122) 
final_train_df['BILL_AMT3'].plot.box(figsize=(16,5)) 
plt.show()

In [None]:
bins = 20
plt.hist(final_train_df.BILL_AMT3, bins = bins, label = 'Total', alpha=0.8)

plt.xlabel('Bill Amount July-2005 (NT dollar)');plt.ylabel('Number of Accounts')
plt.title('Bill Amount July-2005 ',fontweight="bold", size=12)
plt.legend();
plt.show()

In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.distplot(final_train_df['BILL_AMT4']); 
plt.subplot(122) 
final_train_df['BILL_AMT4'].plot.box(figsize=(16,5)) 
plt.show()

In [None]:
bins = 20
plt.hist(final_train_df.BILL_AMT4, bins = bins, label = 'Total', alpha=0.8)

plt.xlabel('Bill Amount June-2005 (NT dollar)');plt.ylabel('Number of Accounts')
plt.title('Bill Amount June-2005 ',fontweight="bold", size=12)
plt.legend();
plt.show()

In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.distplot(final_train_df['BILL_AMT5']); 
plt.subplot(122) 
final_train_df['BILL_AMT5'].plot.box(figsize=(16,5)) 
plt.show()

In [None]:
bins = 20
plt.hist(final_train_df.BILL_AMT5, bins = bins, label = 'Total', alpha=0.8)

plt.xlabel('Bill Amount May-2005 (NT dollar)');plt.ylabel('Number of Accounts')
plt.title('Bill Amount May-2005 ',fontweight="bold", size=12)
plt.legend();
plt.show()

In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.distplot(final_train_df['BILL_AMT6']); 
plt.subplot(122) 
final_train_df['BILL_AMT6'].plot.box(figsize=(16,5)) 
plt.show()

In [None]:
bins = 20
plt.hist(final_train_df.BILL_AMT6, bins = bins, label = 'Total', alpha=0.8)

plt.xlabel('Bill Amount April-2005 (NT dollar)');plt.ylabel('Number of Accounts')
plt.title('Bill Amount April-2005 ',fontweight="bold", size=12)
plt.legend();
plt.show()

The plots for all the BILL_AMT* show very skewed distributions.
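
Skewness can also be quantified instead of being read off the plots. The sketch below uses pandas' `skew()` on a made-up frame (`toy` and its values are hypothetical); a clearly positive value indicates a long right tail, and a value near zero indicates a roughly symmetric column.

```python
import pandas as pd

# Made-up bills: the first column has one very large value, the second is symmetric.
toy = pd.DataFrame({
    'BILL_AMT1': [0, 1000, 2000, 3000, 500000],
    'BILL_AMT2': [1000, 2000, 3000, 4000, 5000],
})

# One skewness value per column.
skews = toy[['BILL_AMT1', 'BILL_AMT2']].skew()
print(skews)
```

On the notebook data the same call could be applied to the six BILL_AMT columns of `final_train_df`.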

We should closely analyze the customers with BILL_AMT* >= 200000, as well as with BILL_AMT* < 0.

Let us look at the negative values first.



In [None]:
final_train_df[final_train_df['BILL_AMT1'] <0][['BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6']]

In [None]:
final_train_df[final_train_df['BILL_AMT2'] <0][['BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6']]

We clearly see negative bill values, which perhaps indicate excess payments in previous months.


In [None]:
for billamt in ['BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6']:
    print("\nNo. of Customers with ", billamt, " >= 200000 : ", len(final_train_df[final_train_df[billamt] >=200000]))

In [None]:
for billamt in ['BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6']:
    print("\nNo. of Customers with ", billamt, " >= 400000 : ", len(final_train_df[final_train_df[billamt] >=400000]))

In [None]:
for billamt in ['BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6']:
    print("\nNo. of Customers with ", billamt, " >= 600000 : ", len(final_train_df[final_train_df[billamt] >=600000]))

We observe from Sept-2005 billing that 6 customers have billing in excess of 600,000, 119 customers have billing in excess of 400,000 and 1070 customers have billing in excess of 200,000. 
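The three threshold loops above can also be folded into a single table. This is a sketch only: `count_at_or_above` is a hypothetical helper and `toy_df` is a small made-up frame standing in for `final_train_df`.

```python
import pandas as pd

def count_at_or_above(df, columns, thresholds):
    """Return a table of row counts with value >= threshold (one row per threshold)."""
    return pd.DataFrame(
        {col: [(df[col] >= t).sum() for t in thresholds] for col in columns},
        index=pd.Index(thresholds, name="threshold"),
    )

# Made-up values, chosen only to exercise the helper.
toy_df = pd.DataFrame({
    "BILL_AMT1": [-500, 150_000, 250_000, 450_000, 650_000],
    "BILL_AMT2": [0, 210_000, 190_000, 410_000, 390_000],
})

counts = count_at_or_above(toy_df, ["BILL_AMT1", "BILL_AMT2"], [200_000, 400_000, 600_000])
print(counts)
```

Each cell of the table is the number of customers at or above that threshold for that month, which makes the month-to-month comparison easier than reading three separate printouts.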

With a good understanding of the BILL_AMT* variables, we will now look at the PAY_AMT* variables.



In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.distplot(final_train_df['PAY_AMT1']); 
plt.subplot(122) 
final_train_df['PAY_AMT1'].plot.box(figsize=(16,5)) 
plt.show()

In [None]:
bins = 20
plt.hist(final_train_df.PAY_AMT1, bins = bins, label = 'Total', alpha=0.8)

plt.xlabel('Payment Amount Sept-2005 (NT dollar)');plt.ylabel('Number of Accounts')
plt.title('Payment Amount Sept-2005 ',fontweight="bold", size=12)
plt.legend();
plt.show()

In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.distplot(final_train_df['PAY_AMT2']); 
plt.subplot(122) 
final_train_df['PAY_AMT2'].plot.box(figsize=(16,5)) 
plt.show()

In [None]:
bins = 20
plt.hist(final_train_df.PAY_AMT2, bins = bins, label = 'Total', alpha=0.8)

plt.xlabel('Payment Amount Aug-2005 (NT dollar)');plt.ylabel('Number of Accounts')
plt.title('Payment Amount Aug-2005 ',fontweight="bold", size=12)
plt.legend();
plt.show()

In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.distplot(final_train_df['PAY_AMT3']); 
plt.subplot(122) 
final_train_df['PAY_AMT3'].plot.box(figsize=(16,5)) 
plt.show()

In [None]:
bins = 20
plt.hist(final_train_df.PAY_AMT3, bins = bins, label = 'Total', alpha=0.8)

plt.xlabel('Payment Amount July-2005 (NT dollar)');plt.ylabel('Number of Accounts')
plt.title('Payment Amount July-2005 ',fontweight="bold", size=12)
plt.legend();
plt.show()

In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.distplot(final_train_df['PAY_AMT4']); 
plt.subplot(122) 
final_train_df['PAY_AMT4'].plot.box(figsize=(16,5)) 
plt.show()

In [None]:
bins = 20
plt.hist(final_train_df.PAY_AMT4, bins = bins, label = 'Total', alpha=0.8)

plt.xlabel('Payment Amount June-2005 (NT dollar)');plt.ylabel('Number of Accounts')
plt.title('Payment Amount June-2005 ',fontweight="bold", size=12)
plt.legend();
plt.show()

In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.distplot(final_train_df['PAY_AMT5']); 
plt.subplot(122) 
final_train_df['PAY_AMT5'].plot.box(figsize=(16,5)) 
plt.show()

In [None]:
bins = 20
plt.hist(final_train_df.PAY_AMT5, bins = bins, label = 'Total', alpha=0.8)

plt.xlabel('Payment Amount May-2005 (NT dollar)');plt.ylabel('Number of Accounts')
plt.title('Payment Amount May-2005 ',fontweight="bold", size=12)
plt.legend();
plt.show()

In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.distplot(final_train_df['PAY_AMT6']); 
plt.subplot(122) 
final_train_df['PAY_AMT6'].plot.box(figsize=(16,5)) 
plt.show()

In [None]:
bins = 20
plt.hist(final_train_df.PAY_AMT6, bins = bins, label = 'Total', alpha=0.8)

plt.xlabel('Payment Amount April-2005 (NT dollar)');plt.ylabel('Number of Accounts')
plt.title('Payment Amount April-2005 ',fontweight="bold", size=12)
plt.legend();
plt.show()

From the PAY_AMT* plots, we can easily derive following insights about the customer's payment behavior :


1. Most of the payments seem to be for amounts < 100000.


2. Very few payments are seen for amounts >= 100000.


To get a clearer picture, we may need to divide this data in more bins.

But before we do that, let us compute how many customers make payments >= 100000.



In [None]:
for pmnt in ['PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6',]:
    print("\nNo. of Customers with ", pmnt, " >= 100000 : ", len(final_train_df[final_train_df[pmnt]>=100000]))

What we see is that out of 21000 customers, only about 150 or so (under 1%) pay 100000 or more in any given month over the last 6 months.
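The counts above can be turned into percentages directly, since the mean of a boolean column is the fraction of `True` values. A minimal sketch on made-up data (`toy_df` is an assumption standing in for `final_train_df`):

```python
import pandas as pd

# Made-up payments, only to illustrate the calculation.
toy_df = pd.DataFrame({
    "PAY_AMT1": [0, 2_000, 150_000, 5_000],
    "PAY_AMT2": [120_000, 1_000, 100_000, 0],
})
pay_columns = ["PAY_AMT1", "PAY_AMT2"]

# Fraction of customers at or above the threshold, expressed as a percentage per month.
pct_large = (toy_df[pay_columns] >= 100_000).mean() * 100
print(pct_large.round(2))
```

For reference, 150 customers out of 21000 works out to roughly 0.7%.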

Earlier, we saw that the customers with BILL_AMT* >= 200000 were causing the skew, and now we see that the customers with PAY_AMT* >= 100000 are also causing a skewed distribution.

What this means is that the resultant model may get biased with these very few customers, but these are not outliers. These are valid records, and must be retained for the analysis and modeling.

Apart from the original variables in the dataset, we had created a few additional variables.

Let us now analyze the data using these derived variables.

We start with the distributions of 'DELINQ_1', 'NO_PMNT1', 'CREDIT_UTIL_RATIO', and 'AVG_6MTH_BAL'.

In [None]:
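# Let's start by visualizing the distribution of 'DELINQ_1' in the dataset.  

fig, ax = plt.subplots()

# Counting customers per number of delinquencies.
# sort_index() keeps every count paired with its own x value; unique() and
# value_counts() return their results in different orders, so they must not be zipped.

delinq_counts = final_train_df.DELINQ_1.value_counts().sort_index()

# Plotting the bar graph

ax.bar(delinq_counts.index, delinq_counts.values)
ax.set_xlabel('Total Number of Delinquencies in last 6 months.')
ax.set_ylabel('No. of Customers')
plt.show()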


In [None]:
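# Now let's visualize the distribution of 'NO_PMNT1' in the dataset.  

fig, ax = plt.subplots()

# Counting customers per number of missed payments.
# sort_index() keeps every count paired with its own x value; unique() and
# value_counts() return their results in different orders, so they must not be zipped.

no_pmnt_counts = final_train_df.NO_PMNT1.value_counts().sort_index()

# Plotting the bar graph

ax.bar(no_pmnt_counts.index, no_pmnt_counts.values)
ax.set_xlabel('Total Number of No Payments in last 6 months.')
ax.set_ylabel('No. of Customers')
plt.show()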


As the objective of this case is to find who shall default next month, the above bar-graphs need to be analyzed in detail.


In [None]:
plt.figure(1) 
plt.subplot(221) 
final_train_df['DELINQ_1'].value_counts(normalize=True).plot.bar(figsize=(20,10), title= 'Total Number of Delinquencies in last 6 months.') 
plt.subplot(222) 
final_train_df['NO_PMNT1'].value_counts(normalize=True).plot.bar(title= 'Total Number of No Payments in last 6 months.') 
plt.show()

The "Number of Delinquencies" bar plot indicates that, in most months, most of the customers are paying less than the "Minimum Payment" required to keep the card active.

So even when customers do pay, most of them pay less than the Minimum Payment required.


**Ideally, the focus of this company should have been on predicting customers who shall pay less than the Minimum Payment, rather than customers who make No Payment.**


**That change in strategy and focus would deliver far greater financial and bottom-line benefits to the company than focusing on Non Payment.**

To understand this in money terms, let us do a few additional computations.

We have created derived variables 'PENDING_AMT*' - which represent amounts that are pending from the customer's side (ie. Bills - Payments) for the last 1 - 6 months.

Let us study the distributions of these derived variables :


In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.distplot(final_train_df['PENDING_AMT5']); 
plt.subplot(122) 
final_train_df['PENDING_AMT5'].plot.box(figsize=(16,5)) 
plt.show()

In [None]:
bins = 20
plt.hist(final_train_df.PENDING_AMT5, bins = bins, label = 'Total', alpha=0.8)

plt.xlabel('Pending Amount till May-2005 (NT dollar)');plt.ylabel('Number of Accounts')
plt.title('Pending Amount till May-2005 ',fontweight="bold", size=12)
plt.legend();
plt.show()

In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.distplot(final_train_df['PENDING_AMT4']); 
plt.subplot(122) 
final_train_df['PENDING_AMT4'].plot.box(figsize=(16,5)) 
plt.show()

In [None]:
bins = 20
plt.hist(final_train_df.PENDING_AMT4, bins = bins, label = 'Total', alpha=0.8)

plt.xlabel('Pending Amount till June-2005 (NT dollar)');plt.ylabel('Number of Accounts')
plt.title('Pending Amount till June-2005 ',fontweight="bold", size=12)
plt.legend();
plt.show()

In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.distplot(final_train_df['PENDING_AMT3']); 
plt.subplot(122) 
final_train_df['PENDING_AMT3'].plot.box(figsize=(16,5)) 
plt.show()

In [None]:
bins = 20
plt.hist(final_train_df.PENDING_AMT3, bins = bins, label = 'Total', alpha=0.8)

plt.xlabel('Pending Amount till July-2005 (NT dollar)');plt.ylabel('Number of Accounts')
plt.title('Pending Amount till July-2005 ',fontweight="bold", size=12)
plt.legend();
plt.show()

In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.distplot(final_train_df['PENDING_AMT2']); 
plt.subplot(122) 
final_train_df['PENDING_AMT2'].plot.box(figsize=(16,5)) 
plt.show()

In [None]:
bins = 20
plt.hist(final_train_df.PENDING_AMT2, bins = bins, label = 'Total', alpha=0.8)

plt.xlabel('Pending Amount till August-2005 (NT dollar)');plt.ylabel('Number of Accounts')
plt.title('Pending Amount till August-2005 ',fontweight="bold", size=12)
plt.legend();
plt.show()

In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.distplot(final_train_df['PENDING_AMT1']); 
plt.subplot(122) 
final_train_df['PENDING_AMT1'].plot.box(figsize=(16,5)) 
plt.show()

In [None]:
bins = 20
plt.hist(final_train_df.PENDING_AMT1, bins = bins, label = 'Total', alpha=0.8)

plt.xlabel('Pending Amount till Sept-2005 (NT dollar)');plt.ylabel('Number of Accounts')
plt.title('Pending Amount till Sept-2005 ',fontweight="bold", size=12)
plt.legend();
plt.show()

In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.distplot(final_train_df['AVG_6MTH_BAL']); 
plt.subplot(122) 
final_train_df['AVG_6MTH_BAL'].plot.box(figsize=(16,5)) 
plt.show()

In [None]:
bins = 20
plt.hist(final_train_df.AVG_6MTH_BAL, bins = bins, label = 'Total', alpha=0.8)

plt.xlabel('Avg. 6 Month Balance in Sept-2005 (NT dollar)');plt.ylabel('Number of Accounts')
plt.title('Avg. 6 Month Balance in Sept-2005 ',fontweight="bold", size=12)
plt.legend();
plt.show()

Now we shall focus on analysing the impact of no payments and delayed payments on the financial health of this company.

Let us start with understanding the Total Credit given to the customers vs. Total Pending Amount from the customers as on Sept-2005.


In [None]:
print("\nTotal Credit given to all the customers (as on Sept-2005): ", final_train_df['LIMIT_BAL'].sum())

In [None]:
print("\nTotal Credit given to all Defaulting customers (as on Sept-2005): ", final_train_df.groupby('default_payment_next_month')['LIMIT_BAL'].sum()[1])

In [None]:
print("\nTotal Pending Amount from all the customers (as on Sept-2005): ", final_train_df['PENDING_AMT1'].sum())

In [None]:
print("\nTotal Pending Amount from all Defaulting customers (as on Sept-2005): ", final_train_df.groupby('default_payment_next_month')['PENDING_AMT1'].sum()[1])

As we can see, the company gave a credit of 3.5 billion NT dollars, whereas the total amount pending is 5 billion NT dollars.

Out of this, the company gave a credit of 600 million NT dollars to defaulting customers, whereas total amount pending from defaulting customers is 1.1 billion NT dollars. 
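The four totals printed above can be collected in one table, together with the ratio of pending amount to credit. This is a sketch on a made-up frame (`toy_df` is an assumption; the column names follow the ones used in this notebook).

```python
import pandas as pd

# Made-up rows standing in for final_train_df.
toy_df = pd.DataFrame({
    "default_payment_next_month": [0, 0, 1, 1],
    "LIMIT_BAL": [100_000, 200_000, 50_000, 30_000],
    "PENDING_AMT1": [10_000, 0, 60_000, 45_000],
})
amount_columns = ["LIMIT_BAL", "PENDING_AMT1"]

# Totals per class of the target variable (0 = no default, 1 = default).
by_class = toy_df.groupby("default_payment_next_month")[amount_columns].sum()
by_class["pending_to_credit"] = by_class["PENDING_AMT1"] / by_class["LIMIT_BAL"]

# Totals over all customers.
overall = toy_df[amount_columns].sum()

print(by_class)
print(overall)
```

With the figures quoted above, the ratio is about 5 / 3.5 ≈ 1.4 over all customers and about 1.1 / 0.6 ≈ 1.8 for defaulting customers.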


**We can clearly establish that certain basic "Credit Controls" have not been observed by the company. They have obviously allowed customers to continue to use cards beyond the authorized credit.**


**The reason is very simple - the entire focus has been on customers who shall not pay next month, rather than on customers who are paying less than the "Minimum Amounts" and falling into "Delinquent" status.**


Let us analyze pending amounts according to the "DELINQ_1" counts.

Let us understand the distribution of Total Credit and Total Pending Amount based on No of delinquencies over 6-month period.



In [None]:
delinquencies = list(final_train_df['DELINQ_1'].unique())

In [None]:
for num_delinq in delinquencies:
    
    print("\nTotal Credit Amount for : ", num_delinq, " delinquencies is : ", sum(final_train_df[final_train_df['DELINQ_1']==num_delinq]['LIMIT_BAL']))
    
    print("\nTotal Pending Amount for : ", num_delinq, " delinquencies is : ", sum(final_train_df[final_train_df['DELINQ_1']==num_delinq]['PENDING_AMT1']))
              

From the above, we get the following insights :

1. About 1.5 billion dollars credit has been allocated to customers who have never been delinquent.

2. Only about 111 million dollars is pending from such non-delinquent customers.

3. About 303 million dollars credit has been allocated to customers who have been delinquent for 1 month.

4. About 146 million dollars is pending from such customers who have been delinquent for 1 month.

5. About 209 million dollars credit has been allocated to customers who have been delinquent for 2 months

6. About 255 million dollars is pending from such customers who have been delinquent for 2 months.

7. About 236 million dollars credit has been allocated to customers who have been delinquent for 3 months

8. About 502 million dollars is pending from such customers who have been delinquent for 3 months.

9. About 411 million dollars credit has been allocated to customers who have been delinquent for 4 months

10. About 1.15 billion dollars is pending from such customers who have been delinquent for 4 months.

11. About 856 million dollars credit has been allocated to customers who have been delinquent for 5 months

12. About 2.9 billion dollars is pending from such customers who have been delinquent for 5 months.

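**We can clearly see that the pending amount rises steeply as the customers keep getting delinquent month after month: from about 111 million dollars for customers who were never delinquent to about 2.9 billion dollars for customers who were delinquent for 5 months, an increase of more than an order of magnitude.**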
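The per-group totals printed by the loop above can also be produced as one table with a `groupby`. The sketch below is an illustration on made-up data; `exposure_by` is a hypothetical helper, and `toy_df` stands in for `final_train_df`.

```python
import pandas as pd

def exposure_by(df, key):
    """Customer count, total credit, total pending amount and their ratio per value of `key`."""
    table = df.groupby(key).agg(
        customers=("LIMIT_BAL", "size"),
        total_credit=("LIMIT_BAL", "sum"),
        total_pending=("PENDING_AMT1", "sum"),
    )
    table["pending_to_credit"] = table["total_pending"] / table["total_credit"]
    return table.sort_index()

# Made-up rows, only to exercise the helper.
toy_df = pd.DataFrame({
    "DELINQ_1": [0, 0, 1, 2, 2],
    "LIMIT_BAL": [100_000, 50_000, 80_000, 40_000, 60_000],
    "PENDING_AMT1": [5_000, 0, 40_000, 70_000, 90_000],
})

delinq_table = exposure_by(toy_df, "DELINQ_1")
print(delinq_table)
```

The same helper could be called with `"NO_PMNT1"` as the key. On the figures listed above, the pending-to-credit ratio goes from about 0.07 (111 million against 1.5 billion) for customers who were never delinquent to about 3.4 (2.9 billion against 856 million) for customers delinquent for 5 months.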

Now let us understand the Total Pending amount that is stuck by "No Payment" customers.


In [None]:
nopayments = list(final_train_df['NO_PMNT1'].unique())

In [None]:
for num_pmnt in nopayments:
    
    print("\nTotal Credit Amount for : ", num_pmnt, " no payments is : ", sum(final_train_df[final_train_df['NO_PMNT1']==num_pmnt]['LIMIT_BAL']))
    
    print("\nTotal Pending Amount for : ", num_pmnt, " no payments is : ", sum(final_train_df[final_train_df['NO_PMNT1']==num_pmnt]['PENDING_AMT1']))


From the above, we get the following insights :

1. About 2.1 billion dollars credit has been allocated to customers who have never defaulted on payments - even if they may have paid token amounts.

2. About 3.8 billion dollars is pending from such non-defaulting customers. This is the most surprising finding from the data. This is a proof that the company is focussed on the wrong metric.

3. About 577 million dollars credit has been allocated to customers who have defaulted on payments just once.

4. About 817 million dollars is pending from customers who have defaulted just once. 

5. About 299 million dollars credit has been allocated to customers who have defaulted on payments 2 times.

6. About 266 million dollars is pending from customers who have defaulted two times. 

7. About 186 million dollars credit has been allocated to customers who have defaulted on payments 3 times.

8. About 79 million dollars is pending from customers who have defaulted 3 times. 

9. About 132 million dollars credit has been allocated to customers who have defaulted on payments 4 times.

10. About 21 million dollars is pending from customers who have defaulted 4 times. 

11. About 205 million dollars credit has been allocated to customers who have defaulted on payments 5 times.

12. About 18 million dollars is pending from customers who have defaulted 5 times. 

**We can clearly see that the pending amount falls steeply as the customers keep defaulting month after month: from about 3.8 billion dollars for customers who never defaulted to about 18 million dollars for customers who defaulted 5 times. This is also because there is a regular decrease in the number of customers who are defaulting multiple times.**

**This behavior seems to be exactly opposite of the customers who are delinquent month after month. This is because there is a regular increase in the number of customers who are becoming delinquent.**

**It is obvious that the focus of this company needs to change towards customers becoming delinquent.**

Finally, let us understand the Total Pending amount with respect to Credit Utilization Ratios.

We know that Credit utilization ratio of <= 0.3 is considered good in the industry.

Let us understand how many of the customers fall in this safe category.

In [None]:
bins = 20
plt.hist(final_train_df.CREDIT_UTIL_RATIO, bins = bins, label = 'Total', alpha=0.8)

plt.xlabel('Credit Utilization Ratios in Sept-2005');plt.ylabel('Number of Accounts')
plt.title('Credit Utilization Ratios in Sept-2005 ',fontweight="bold", size=12)
plt.legend();
plt.show()

In [None]:
len(final_train_df[final_train_df['CREDIT_UTIL_RATIO'] <= 0.3])

In [None]:
len(final_train_df[(final_train_df['CREDIT_UTIL_RATIO'] > 0.3) & (final_train_df['CREDIT_UTIL_RATIO'] <= 0.7)])

In [None]:
len(final_train_df[(final_train_df['CREDIT_UTIL_RATIO'] > 0.7) & (final_train_df['CREDIT_UTIL_RATIO'] <= 1)])

In [None]:
len(final_train_df[final_train_df['CREDIT_UTIL_RATIO'] > 1])

We see that almost 50% of the customers have a credit utilization ratio of less than 30% which is good.

However, more than 20% customers have credit utilization ratios of more than 70%.

This again points to the fact that the company should be focusing on the delinquent customers more than on the ones who miss the payments altogether.
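The four `len(...)` cells above can be expressed as a single bucketing step with `pd.cut`. A minimal sketch on made-up ratios (`toy_df` is an assumption; the bucket edges are the ones used in the cells above):

```python
import numpy as np
import pandas as pd

# Made-up utilization ratios, including a negative one and one above 1.
toy_df = pd.DataFrame({"CREDIT_UTIL_RATIO": [0.05, 0.30, 0.31, 0.70, 0.95, 1.00, 1.40, -0.02]})

bins = [-np.inf, 0.3, 0.7, 1.0, np.inf]
labels = ["<= 0.3", "0.3 - 0.7", "0.7 - 1.0", "> 1.0"]

# pd.cut intervals are right-inclusive by default, matching the <= comparisons above.
buckets = pd.cut(toy_df["CREDIT_UTIL_RATIO"], bins=bins, labels=labels)

# Share of customers per bucket, in percent, in the order of the labels.
bucket_share = buckets.value_counts(normalize=True).reindex(labels) * 100
print(bucket_share)
```

Because every ratio falls into exactly one bucket, the shares always add up to 100%.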

## Correlation Coefficients:

Now we shall identify the correlation coefficients of the variables to shortlist the variables which have positive correlation with the target variable 'default_payment_next_month'.

It is important to remove the variables which have no impact (OR) negative impact on the target variable, and hence improve our ML model performance metrics.


In [None]:
corr = final_train_df.corr()

In [None]:
corr['default_payment_next_month']

We see that only a handful of features show a positive correlation with our target variable.

We will only select these features for predictive modeling.
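The selection rule can be written so that it returns the feature names directly instead of a boolean column. This is a sketch on synthetic data: `toy_df` and its three features are assumptions, and the 0.1 threshold is the one used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2020)
n = 2_000
target = rng.integers(0, 2, size=n)

# Synthetic features with a known relationship to the target.
toy_df = pd.DataFrame({
    "PAY_1": target + rng.normal(0, 1.0, size=n),       # positively related
    "LIMIT_BAL": -target + rng.normal(0, 1.0, size=n),  # negatively related
    "AGE": rng.normal(35, 9, size=n),                   # unrelated noise
    "default_payment_next_month": target,
})

target_col = "default_payment_next_month"

# Correlation of every feature with the target, excluding the target itself.
corr_with_target = toy_df.corr()[target_col].drop(target_col)

# Keep only features whose correlation is at or above the threshold.
selected = corr_with_target[corr_with_target >= 0.1].index.tolist()
print(selected)
```

Note that this rule also drops strongly negatively correlated features; filtering on `corr_with_target.abs()` would be the variant that keeps them.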

In [None]:
corr['default_payment_next_month']>= 0.1

In [None]:
# Making correlation coefficients pair plot of the selected features

selected_columns = ['CREDIT_UTIL_RATIO','PAY_1','PAY_2','PAY_3','PAY_4','PAY_5','PAY_6','NO_PMNT1','NO_PMNT2','NO_PMNT3','NO_PMNT4','NO_PMNT5','default_payment_next_month']
plt.figure(figsize=(20,20))
ax = plt.axes()
corr_selected = final_train_df[selected_columns].corr()
sns.heatmap(corr_selected, vmax=1,vmin=-1, square=True, annot=True, cmap='Spectral',linecolor="white", linewidths=0.01, ax=ax)
ax.set_title('Correlation Coefficient Pair Plot',fontweight="bold", size=20)
plt.show()

**Let us create training set X and target set y.**



In [None]:
train_columns = ['PAY_1','PAY_2','PAY_3','PAY_4','PAY_5','PAY_6']

In [None]:
target_column = ['default_payment_next_month']

In [None]:
X = final_train_df[train_columns]

In [None]:
X.shape

In [None]:
y = final_train_df['default_payment_next_month']

In [None]:
y.shape

We will use the train_test_split function from sklearn to divide our train dataset. 



In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, random_state=2020)
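Since only about 22% of the customers are defaulters, one option is to pass `stratify` to `train_test_split` so that both parts keep the same class ratio. The sketch below uses synthetic data (`X_toy` and `y_toy` are assumptions, not the notebook's `X` and `y`) and is not the split used for the reported results.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2020)

# Synthetic feature and an imbalanced binary target (about 22% positives).
X_toy = pd.DataFrame({"PAY_1": rng.integers(-2, 9, size=1_000)})
y_toy = pd.Series((rng.random(1_000) < 0.22).astype(int), name="default_payment_next_month")

# stratify=y_toy keeps the share of defaulters (almost) identical in both parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=2020, stratify=y_toy
)

print("Default rate in train:", y_tr.mean())
print("Default rate in validation:", y_va.mean())
```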


# Building and Evaluating Predictive Models


As this is a classification problem, we can use the following algorithms:


* Logistic regression
* Decision tree
* Random forest
* Support Vector Classification
* Stochastic Gradient Descent
* Gradient Boosting
* AdaBoost
* XGBoost
* Neural Network Models


Tree-based ensembles are generally well suited to tabular data of this kind, so we shall only focus on Random Forest, Gradient Boosting, AdaBoost and XGBoost.


We shall be using stratified k-folds cross validation, and "class_weight balancing" capability of the algorithms for estimating model parameters.
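As a compact illustration of this setup, stratified k-fold cross validation and class weighting can be combined in a single `cross_val_score` call. This is a sketch on synthetic data from `make_classification` (an assumption), scored with AUC rather than accuracy; it is not the loop used in the cells that follow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary problem (about 22% positives).
X_toy, y_toy = make_classification(
    n_samples=600, n_features=6, n_informative=4,
    weights=[0.78, 0.22], random_state=2020,
)

# Each fold keeps the same class ratio as the full data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2020)

# class_weight='balanced' reweights classes inversely to their frequency.
model = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=2020)

auc_scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="roc_auc")
print("AUC per fold:", auc_scores)
print("Mean AUC:", auc_scores.mean())
```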


We start with the Random Forest predictive model.



## Random Forest


Let’s import stratified KFold from sklearn and fit the model.



In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
import xgboost as xgb

from sklearn import metrics 
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, auc, f1_score

from statistics import mean 


In [None]:
i=1 
kf = StratifiedKFold(n_splits=5,random_state=2020,shuffle=True) 

score_rf = []

X_train = np.array(X_train)
                     
for train_index,test_index in kf.split(X_train,y_train):
    print('\n{} of kfold {}'.format(i,kf.n_splits))
    xtr,xvl = X_train[train_index],X_train[test_index]
    ytr,yvl = y_train.iloc[train_index],y_train.iloc[test_index]
    model_rf = RandomForestClassifier(class_weight='balanced',random_state=2020)
    model_rf.fit(xtr, ytr)
    pred_test_rf = model_rf.predict(xvl)
    score_rf.append(accuracy_score(yvl,pred_test_rf))
    print('\nAccuracy_score : ',score_rf[i-1])
    i+=1 
    
print("\nThe mean validation accuracy of Random Forest model is : ", mean(score_rf))


In [None]:
# Output confusion matrix

pred_rf = model_rf.predict(X_val)

print("Confusion Matrix:")
print(confusion_matrix(y_val, pred_rf))
print()
print("Classification Report")
print(classification_report(y_val, pred_rf))


In [None]:
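# Visualize the ROC curve
# Note: xvl and yvl are the held-out rows of the last fold from the loop above.

pred_rf_prob=model_rf.predict_proba(xvl)[:,1]

fpr, tpr, _ = metrics.roc_curve(yvl,  pred_rf_prob)
# Named roc_auc so that it does not shadow the auc function imported from sklearn.metrics
roc_auc = metrics.roc_auc_score(yvl, pred_rf_prob)
plt.figure(figsize=(12,8))
plt.plot(fpr,tpr,label="Validation, auc="+str(roc_auc))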
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc=4) 
plt.show()


In [None]:
importances=pd.Series(model_rf.feature_importances_, index=X.columns) 
importances.plot(kind='barh', figsize=(10,6))


We can see that PAY_1 seems to be the most important feature.

It is followed by PAY_2, PAY_3, and so on.


## Gradient Boosting



In [None]:
i=1 

score_gb = []

for train_index,test_index in kf.split(X_train,y_train):
    print('\n{} of kfold {}'.format(i,kf.n_splits))
    xtr,xvl = X_train[train_index],X_train[test_index]
    ytr,yvl = y_train.iloc[train_index],y_train.iloc[test_index]
    
    model_gb = GradientBoostingClassifier(random_state=2020)
    
    model_gb.fit(xtr, ytr)
    pred_test_gb = model_gb.predict(xvl)
    score_gb.append(accuracy_score(yvl,pred_test_gb))
    print('\nAccuracy_score : ',score_gb[i-1])
    i+=1 
    
print("\nThe mean validation accuracy of the Gradient Boosting model is : ", mean(score_gb))


In [None]:
# Output confusion matrix and classification report of Gradient Boosting algorithm on validation set

pred_gb = model_gb.predict(X_val)

print("Confusion Matrix:")
print(confusion_matrix(y_val, pred_gb))
print()
print("Classification Report")
print(classification_report(y_val, pred_gb))

In [None]:
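# Visualize the ROC curve
# Note: xvl and yvl are the held-out rows of the last fold from the loop above.

pred_gb_prob=model_gb.predict_proba(xvl)[:,1]

fpr, tpr, _ = metrics.roc_curve(yvl,  pred_gb_prob)
# Named roc_auc so that it does not shadow the auc function imported from sklearn.metrics
roc_auc = metrics.roc_auc_score(yvl, pred_gb_prob)
plt.figure(figsize=(12,8))
plt.plot(fpr,tpr,label="Validation, auc="+str(roc_auc))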
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc=4) 
plt.show()


In [None]:
importances=pd.Series(model_gb.feature_importances_, index=X.columns) 
importances.plot(kind='barh', figsize=(12,8))


## AdaBoost



In [None]:
i=1 

score_adb = []

for train_index,test_index in kf.split(X_train,y_train):
    print('\n{} of kfold {}'.format(i,kf.n_splits))
    xtr,xvl = X_train[train_index],X_train[test_index]
    ytr,yvl = y_train.iloc[train_index],y_train.iloc[test_index]
    model_adb = AdaBoostClassifier(random_state=2020)
    model_adb.fit(xtr, ytr)
    pred_test_adb = model_adb.predict(xvl)
    score_adb.append(accuracy_score(yvl,pred_test_adb))
    print('\nAccuracy_score : ',score_adb[i-1])
    i+=1 
    
print("\nThe mean validation accuracy of the Ada Boosting model is : ", mean(score_adb))


In [None]:
# Output confusion matrix and classification report of Ada Boosting algorithm on validation set

pred_adb = model_adb.predict(X_val)

print("Confusion Matrix:")
print(confusion_matrix(y_val, pred_adb))
print()
print("Classification Report")
print(classification_report(y_val, pred_adb))

In [None]:
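# Visualize the ROC curve
# Note: xvl and yvl are the held-out rows of the last fold from the loop above.

pred_adb_prob=model_adb.predict_proba(xvl)[:,1]

fpr, tpr, _ = metrics.roc_curve(yvl,  pred_adb_prob)
# Named roc_auc so that it does not shadow the auc function imported from sklearn.metrics
roc_auc = metrics.roc_auc_score(yvl, pred_adb_prob)
plt.figure(figsize=(12,8))
plt.plot(fpr,tpr,label="Validation, auc="+str(roc_auc))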
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc=4) 
plt.show()


In [None]:
importances=pd.Series(model_adb.feature_importances_, index=X.columns) 
importances.plot(kind='barh', figsize=(12,8))


## XGBoost




In [None]:
i=1 

score_xgb = []

for train_index,test_index in kf.split(X_train,y_train):
    print('\n{} of kfold {}'.format(i,kf.n_splits))
    xtr,xvl = X_train[train_index],X_train[test_index]
    ytr,yvl = y_train.iloc[train_index],y_train.iloc[test_index]
    
    model_xgb = xgb.sklearn.XGBClassifier(objective="binary:logistic", random_state=2020)
    model_xgb.fit(xtr, ytr)
    pred_test_xgb = model_xgb.predict(xvl)
    score_xgb.append(accuracy_score(yvl,pred_test_xgb))
    print('\nAccuracy_score : ',score_xgb[i-1])
    i+=1 
    
print("\nThe mean validation accuracy of the XGBoost model is : ", mean(score_xgb))


In [None]:
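# Visualize the ROC curve
# Note: xvl and yvl are the held-out rows of the last fold from the loop above.

pred_xgb_prob=model_xgb.predict_proba(xvl)[:,1]

fpr, tpr, _ = metrics.roc_curve(yvl,  pred_xgb_prob)
# Named roc_auc so that it does not shadow the auc function imported from sklearn.metrics
roc_auc = metrics.roc_auc_score(yvl, pred_xgb_prob)
plt.figure(figsize=(12,8))
plt.plot(fpr,tpr,label="Validation, auc="+str(roc_auc))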
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc=4) 
plt.show()


In [None]:
importances=pd.Series(model_xgb.feature_importances_, index=X.columns) 
importances.plot(kind='barh', figsize=(12,8))


Out of the 4 algorithms, we find that the Gradient Boosting and XGBoost algorithms have performed the best.

As our performance metric for submission is AUC, we find that Gradient Boosting has given us the best AUC.
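A side-by-side AUC comparison on a hold-out set can be sketched as a loop over candidate models. The example below is an illustration under stated assumptions: the data is synthetic (`make_classification`), only scikit-learn estimators are included so that the block has no extra dependency, and the resulting numbers say nothing about the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (about 22% positives).
X_toy, y_toy = make_classification(
    n_samples=800, n_features=6, n_informative=4,
    weights=[0.78, 0.22], random_state=2020,
)
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X_toy, y_toy, train_size=0.8, stratify=y_toy, random_state=2020
)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=2020),
    "gradient_boosting": GradientBoostingClassifier(random_state=2020),
    "adaboost": AdaBoostClassifier(random_state=2020),
}

# Fit each model on the training part and score the hold-out part with AUC.
holdout_auc = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    holdout_auc[name] = roc_auc_score(y_ho, model.predict_proba(X_ho)[:, 1])

best_name = max(holdout_auc, key=holdout_auc.get)
print(holdout_auc)
print("Best by hold-out AUC:", best_name)
```

Scoring every model on the same hold-out rows keeps the comparison like-for-like.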

We can now explore if we can improve this AUC by parameter tuning.



In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# Search grid: learning_rate in {0.05, 0.1, 0.15}, max_depth from 3 to 18 with an interval of 3,
# and n_estimators from 60 to 140 with an interval of 20

paramgrid_gb = {'learning_rate':[0.05, 0.1, 0.15], 'max_depth': list(range(3, 21, 3)), 'n_estimators': list(range(60, 160, 20))}

# max_features='sqrt' is what 'auto' meant for this classifier; recent scikit-learn releases no longer accept 'auto'
grid_search_gb=GridSearchCV(GradientBoostingClassifier(max_features='sqrt',random_state=2020),paramgrid_gb)


In [None]:
# Fit the grid search model 

grid_search_gb.fit(X_train,y_train)


In [None]:
# Estimating the optimized value 


grid_search_gb.best_estimator_



Now we will use these best parameters to run the model again and see what is the best result that we can get from this.
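Rather than copying the printed estimator by hand, the tuned settings can be read from `best_params_` and passed straight to a new model. The sketch below uses synthetic data and a deliberately small grid (both assumptions), and it scores the search with AUC, which is a choice made for this illustration and not the setting of the search above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced binary problem (about 22% positives).
X_toy, y_toy = make_classification(
    n_samples=400, n_features=6, n_informative=4,
    weights=[0.78, 0.22], random_state=2020,
)

# Small illustrative grid; the real search above covers more values.
param_grid = {"learning_rate": [0.05, 0.15], "max_depth": [2, 3], "n_estimators": [40, 80]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=2020),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=2020),
)
search.fit(X_toy, y_toy)

# Rebuild a fresh model from the winning parameter combination.
tuned_model = GradientBoostingClassifier(random_state=2020, **search.best_params_)
tuned_model.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
print("Best cross-validated AUC:", round(search.best_score_, 4))
```

Passing only `best_params_` avoids carrying over arguments from a printed estimator that a different scikit-learn version may not accept.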

In [None]:
i=1 
# kf = KFold(n_splits=5,random_state=2020,shuffle=True) 

score_gb = []

for train_index,test_index in kf.split(X_train,y_train):
    print('\n{} of kfold {}'.format(i,kf.n_splits))
    xtr,xvl = X_train[train_index],X_train[test_index]
    ytr,yvl = y_train.iloc[train_index],y_train.iloc[test_index]
    # model_gb = GradientBoostingClassifier(learning_rate=0.1, max_features='sqrt', max_depth=18, n_estimators=120, random_state=2020)
    
    # Best parameters reported by the grid search. Only the non-default settings are passed:
    # arguments such as presort and min_impurity_split, which appear in the printed estimator,
    # are not accepted by recent scikit-learn releases, and max_features='sqrt' is what 'auto'
    # meant for this classifier.
    model_gb = GradientBoostingClassifier(learning_rate=0.15, max_depth=3,
                                          max_features='sqrt', n_estimators=80,
                                          random_state=2020)
    
    model_gb.fit(xtr, ytr)
    pred_test_gb = model_gb.predict(xvl)
    score_gb.append(accuracy_score(yvl,pred_test_gb))
    print('\nAccuracy_score : ',score_gb[i-1])
    i+=1 
    
print("\nThe mean validation accuracy of the Gradient Boosting model is : ", mean(score_gb))


In [None]:
# Output confusion matrix and classification report of Gradient Boosting algorithm on validation set

pred_gb = model_gb.predict(X_val)

print("Confusion Matrix:")
print(confusion_matrix(y_val, pred_gb))
print()
print("Classification Report")
print(classification_report(y_val, pred_gb))

In [None]:
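# Visualize the ROC curve
# Note: xvl and yvl are the held-out rows of the last fold from the loop above.

pred_gb_prob=model_gb.predict_proba(xvl)[:,1]

fpr, tpr, _ = metrics.roc_curve(yvl,  pred_gb_prob)
# Named roc_auc so that it does not shadow the auc function imported from sklearn.metrics
roc_auc = metrics.roc_auc_score(yvl, pred_gb_prob)
plt.figure(figsize=(12,8))
plt.plot(fpr,tpr,label="Validation, auc="+str(roc_auc))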
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc=4) 
plt.show()


After tuning of the GradientBoosting model, we see an improvement in AUC from 0.7652 to 0.7687, and accuracy from 0.8216 to 0.8219.



We will now prepare the data for submission to the AnalyticsVidhya and Kaggle websites.

We will predict on the provided test dataset using the GradientBoosting model, and use these predictions to populate the submission.csv file. 



In [None]:
final_test_df.shape

In [None]:
final_train_df.shape

In [None]:
final_test_df.columns

In [None]:
final_train_df.columns

In [None]:
final_test_df.head()

In [None]:
X_train

In [None]:
X_train.shape

In [None]:
train_columns

In [None]:
target_column

In [None]:
X_test = final_test_df[train_columns]

In [None]:
X_test.head()

In [None]:
X_test.shape

Now we shall predict using the GB model on this test data, that has not been seen by the model so far.

In [None]:
pred_gb_test = model_gb.predict(X_test)

In [None]:
pred_gb_prob_test=model_gb.predict_proba(X_test)[:,1]

In [None]:
pred_gb_prob_test.shape

In [None]:
pred_gb_prob_test

In [None]:
submission=pd.read_csv("../input/av-janata-hack-payment-default-prediction/sample_submission_gm6gE0l.csv")

In [None]:
submission

In [None]:
submission['default_payment_next_month']=pred_gb_prob_test

In [None]:
original_test_df['ID']

In [None]:
submission['ID']=original_test_df['ID']

In [None]:
submission.to_csv("CC_Payment_Default_Janata_Hack_31_May.csv", index=False)
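Before uploading, it can help to check that IDs and probabilities line up. The following is a sketch only: `build_submission` is a hypothetical helper, the IDs and probabilities are made up, and the column names are the ones used in the cells above.

```python
import numpy as np
import pandas as pd

def build_submission(ids, probabilities, target_col="default_payment_next_month"):
    """Pair each test ID with its predicted default probability, with basic sanity checks."""
    # Resetting the index avoids misalignment when the IDs come from a filtered frame.
    ids = pd.Series(ids).reset_index(drop=True)
    probabilities = np.asarray(probabilities, dtype=float)

    if len(ids) != len(probabilities):
        raise ValueError("ids and probabilities must have the same length")
    if ids.duplicated().any():
        raise ValueError("duplicate IDs in submission")
    if ((probabilities < 0) | (probabilities > 1)).any():
        raise ValueError("probabilities must lie in [0, 1]")

    return pd.DataFrame({"ID": ids, target_col: probabilities})

# Made-up IDs and probabilities.
toy_submission = build_submission([101, 102, 103], [0.12, 0.80, 0.45])
print(toy_submission)
```

Assigning a column from another DataFrame aligns on the index, so building the frame from position-aligned values, as here, guards against silently mismatched rows.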