### CREDIT CARD DEFAULT PREDICTION 

Objective – Predict the probability that a customer will default on their credit card payment in the following month, 
based on the past information provided in the dataset. 
The collections team can use this probability to prioritise follow-up with customers who have a high propensity to default.

Input : Raw dataset from UCI (default_of_credit_card_clients.xls) - https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

#### Attributes description

    
```
This study uses 23 explanatory variables, extracted/interpreted from the source dataset:

   ----------------------------------------------------------------------------------------
   Feature Name         Feature Description        
   -------------------- -------------------------------------------------------------------
   limit_bal            Amount of the given credit (NT dollar): it includes both the individual 
                        consumer credit and his/her family (supplementary) credit.

   sex                  Gender (1 = male; 2 = female)

   education            Education (1 = graduate school; 2 = university; 3 = high school; 4 = others)

   marriage             Marital status (1 = married; 2 = single; 3 = others)

   age                  Age (in years)

   pay_1 - pay_6        History of past payment: monthly payment records 
                        from April to September 2005, as follows:
                        pay_1 = the repayment status in September, 2005
                        pay_2 = the repayment status in August, 2005
                        ...
                        pay_6 = the repayment status in April, 2005 

                        The measurement scale for the repayment status is: 
                        -1 = pay duly; 
                        1 = payment delay for one month 
                        2 = payment delay for two months
                        ...
                        8 = payment delay for eight months 
                        9 = payment delay for nine months and above

   bill_amt1-bill_amt6  Amount of bill statement (NT dollar). 
                        bill_amt1 = amount of bill statement in September, 2005 
                        bill_amt2 = amount of bill statement in August, 2005
                        ...
                        bill_amt6= amount of bill statement in April, 2005 

   pay_amt1-pay_amt6    Amount of previous payment (NT dollar)
                        pay_amt1 = amount paid in September, 2005
                        pay_amt2 = amount paid in August, 2005
                        ...
                        pay_amt6 = amount paid in April, 2005 
   ----------------------------------------------------------------------------------------
   ```

#### **Initializing Libraries**

In [1]:
#importing required libraries for data analysis
import pandas as pd
import numpy as np
import math

#importing libraries for data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [2]:
#Importing the excel file as python dataframe

data = pd.read_excel('default_of_credit_card_clients.xls',header=1)
print(data.columns)
data.shape

Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'default payment next month'],
      dtype='object')


(30000, 25)

In [3]:
# Taking a 25% sample of the full data to limit the run time of the ML algorithms
# (no random_state is set, so the sampled rows differ between runs)
df = data.sample(frac=0.25)

# checking that the sample is 0.25 times the size of the full data
if 0.25 * len(data) == len(df):
    print(len(data), len(df))

#display dependent variable distribution
print(df['default payment next month'].value_counts())

30000 7500
0    5847
1    1653
Name: default payment next month, dtype: int64
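
Because `data.sample` is called without a seed, the class counts above change on every run, and the share of defaulters in the sample can drift from that of the full data. A minimal sketch of a reproducible, stratified 25% sample is shown below. It runs on a small synthetic frame rather than the real file, and the `stratified_sample` helper and the seed value `42` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def stratified_sample(frame, target, frac, seed):
    # Sample the same fraction from each class so the class ratio is preserved
    return frame.groupby(target, group_keys=False).sample(frac=frac, random_state=seed)

# Synthetic stand-in for the real data: 80 non-defaulters, 20 defaulters
toy = pd.DataFrame({
    "LIMIT_BAL": range(100),
    "default payment next month": [0] * 80 + [1] * 20,
})

sample = stratified_sample(toy, "default payment next month", frac=0.25, seed=42)
print(sample["default payment next month"].value_counts())
```

Fixing the seed makes every downstream shape and count repeatable, which helps when comparing models across runs.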


### **Data Understanding**

In [4]:
print('\n',"Shape of data:", df.shape)    #The sample has 7,500 observations and 25 columns

print('\n',"Numerical features count: ", len(df.dtypes[df.dtypes != "object"].index))

print('\n',"Categorical features count: ", len(df.dtypes[df.dtypes == "object"].index))

print('\n',"No of unique records: ", df['ID'].nunique())    #Every ID is unique, so no duplicate customers

print('\n',"Missing/null values:", df.isnull().sum())       #There are no missing values


 Shape of data: (7500, 25)

 Numerical features count:  25

 Categorical features count:  0

 No of unique records:  7500

 Missing/null values: ID                            0
LIMIT_BAL                     0
SEX                           0
EDUCATION                     0
MARRIAGE                      0
AGE                           0
PAY_0                         0
PAY_2                         0
PAY_3                         0
PAY_4                         0
PAY_5                         0
PAY_6                         0
BILL_AMT1                     0
BILL_AMT2                     0
BILL_AMT3                     0
BILL_AMT4                     0
BILL_AMT5                     0
BILL_AMT6                     0
PAY_AMT1                      0
PAY_AMT2                      0
PAY_AMT3                      0
PAY_AMT4                      0
PAY_AMT5                      0
PAY_AMT6                      0
default payment next month    0
dtype: int64
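
A null check does not reveal category codes that fall outside the documented scale. As a hedged sketch, the helper below compares the codes present in a column with the codes listed in the attribute description. The `undocumented_codes` name and the toy values are assumptions for illustration; on the real data the same call would be made on `df`.

```python
import pandas as pd

# Codes documented in the attribute description
documented = {
    "EDUCATION": {1, 2, 3, 4},
    "MARRIAGE": {1, 2, 3},
}

def undocumented_codes(frame, documented):
    # Return, per column, the sorted codes that the description does not list
    return {
        col: sorted(int(v) for v in set(frame[col].unique()) - allowed)
        for col, allowed in documented.items()
    }

# Toy stand-in for df; values are illustrative only
toy = pd.DataFrame({"EDUCATION": [1, 2, 0, 5, 6, 3], "MARRIAGE": [1, 2, 0, 3, 1, 2]})
extra = undocumented_codes(toy, documented)
print(extra)
```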


### **Data Pre-processing & Transformation**

In [5]:
#renaming of columns for simplicity and better understanding
# df.rename(columns={'default payment next month' : 'IsDefaulter'}, inplace=True)
# df.rename(columns={'PAY_0':'PAY_SEPT','PAY_2':'PAY_AUG','PAY_3':'PAY_JUL','PAY_4':'PAY_JUN','PAY_5':'PAY_MAY','PAY_6':'PAY_APR'},inplace=True)
# df.rename(columns={'BILL_AMT1':'BILL_AMT_SEPT','BILL_AMT2':'BILL_AMT_AUG','BILL_AMT3':'BILL_AMT_JUL','BILL_AMT4':'BILL_AMT_JUN','BILL_AMT5':'BILL_AMT_MAY','BILL_AMT6':'BILL_AMT_APR'}, inplace = True)
# df.rename(columns={'PAY_AMT1':'PAY_AMT_SEPT','PAY_AMT2':'PAY_AMT_AUG','PAY_AMT3':'PAY_AMT_JUL','PAY_AMT4':'PAY_AMT_JUN','PAY_AMT5':'PAY_AMT_MAY','PAY_AMT6':'PAY_AMT_APR'},inplace=True)

In [6]:
# Dropping the ID column, which is only a row identifier and carries no predictive information
df = df.drop(["ID"],axis=1)

#Renaming Pay_0 to Pay_1 to correct the numbering order of payment status
df.rename(columns={'PAY_0':'PAY_1'}, inplace=True)

#renaming of columns for simplicity and better understanding
df.rename(columns={'default payment next month' : 'IsDefaulter'}, inplace=True)

#Grouping all undocumented class values into the single category 'Others'

df["EDUCATION"]=df["EDUCATION"].map({0:4,1:1,2:2,3:3,4:4,5:4,6:4})
df["MARRIAGE"]=df["MARRIAGE"].map({0:3,1:1,2:2,3:3})
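
One caveat with a dictionary passed to `Series.map`: any value missing from the dictionary becomes `NaN`. The sketch below shows an alternative that keeps the documented codes and sends everything else to the 'others' code. The `collapse_unknown` helper and the unseen code `7` are hypothetical, used only to illustrate the difference.

```python
import pandas as pd

def collapse_unknown(series, known, other):
    # Keep documented codes; send everything else to the 'others' code
    return series.where(series.isin(known), other)

education = pd.Series([0, 1, 2, 3, 4, 5, 6, 7])  # 7 is a hypothetical unseen code
collapsed = collapse_unknown(education, known=[1, 2, 3], other=4)
print(collapsed.tolist())

# For comparison: the dict-based map turns the unseen code 7 into NaN
mapped = education.map({0: 4, 1: 1, 2: 2, 3: 3, 4: 4, 5: 4, 6: 4})
print(mapped.isna().sum())
```

Either approach gives the same result on codes 0 to 6; the `where` form is simply more forgiving if a new code ever appears.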

#### **Label Encoding**

In [7]:
#label encoding
encoders_nums = {"SEX":{2:0,1:1}}

df = df.replace(encoders_nums)
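
The nested dictionary scopes the replacement to the `SEX` column, so a value of 2 in any other column is left alone. A small sketch on toy data (the values are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({"SEX": [1, 2, 2], "MARRIAGE": [2, 1, 2]})

# Outer key = column name, inner dict = old value -> new value
encoded = toy.replace({"SEX": {2: 0, 1: 1}})
print(encoded)
```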

#### **One Hot Encoding**

In [8]:
# Feature encoding - One Hot Encoding of Categorical variables

def onehot_encode(df):
    # One-hot encode the categorical columns; all other columns pass through unchanged
    categorical_cols = ["EDUCATION", "MARRIAGE", 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
    onehotlabels = pd.get_dummies(df, columns=categorical_cols, prefix_sep='_', drop_first=False)

    return onehotlabels

df = onehot_encode(df)

In [9]:
#check for all the created variables 
df.head()

Unnamed: 0,LIMIT_BAL,SEX,AGE,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,...,PAY_6_-2,PAY_6_-1,PAY_6_0,PAY_6_2,PAY_6_3,PAY_6_4,PAY_6_5,PAY_6_6,PAY_6_7,PAY_6_8
10530,180000,0,36,787,398,0,0,0,0,398,...,1,0,0,0,0,0,0,0,0,0
11192,280000,1,45,170678,26031,191493,136249,111454,96425,7000,...,0,0,1,0,0,0,0,0,0,0
19915,500000,0,40,212961,264961,237006,139311,128095,181304,63000,...,0,0,1,0,0,0,0,0,0,0
13759,150000,0,27,573,0,220,693,0,0,0,...,1,0,0,0,0,0,0,0,0,0
10442,260000,1,37,36664,22228,39330,39880,78787,52790,1690,...,0,0,1,0,0,0,0,0,0,0


In [10]:
#shape of dataset after creating dummy variables
df.shape

(7500, 87)
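
Each distinct code in an encoded column becomes its own indicator column, which is how the 24 columns left after dropping `ID` grow to 87. The sketch below reproduces the effect on a toy frame with illustrative values. Note that on pandas 2.0 and later `get_dummies` returns boolean indicators by default, so the sketch passes `dtype=int` to get 0/1 values like those shown in the `df.head()` output.

```python
import pandas as pd

toy = pd.DataFrame({
    "LIMIT_BAL": [180000, 280000, 500000],
    "MARRIAGE": [1, 2, 3],
    "PAY_6": [-2, 0, 0],
})

# dtype=int keeps 0/1 indicators on any pandas version
encoded = pd.get_dummies(toy, columns=["MARRIAGE", "PAY_6"], dtype=int)
print(encoded.columns.tolist())
print(encoded.shape)
```

Columns that are not listed in `columns` (here `LIMIT_BAL`) pass through unchanged and stay at the front of the frame.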

In [11]:
#separating dependent and independent variables

X = df.drop(columns=['IsDefaulter'])
y = df['IsDefaulter']

In [12]:
X.shape

(7500, 86)

In [13]:
y.shape

(7500,)

### Train Test Split

In [15]:
#splitting data into training (70%) and testing (30%) datasets

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=0)

print(x_train.shape,y_train.shape,x_test.shape,y_test.shape)

(5250, 86) (5250,) (2250, 86) (2250,)
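
Roughly 22% of the sampled rows are defaulters (1653 of 7500), so a purely random split can leave the train and test sets with slightly different class ratios. As a hedged sketch, the `stratify` argument of `train_test_split` keeps the ratio the same in both parts. The synthetic arrays below are assumptions standing in for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 22% positives
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 220 + [0] * 780)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

# Share of positives in each part
print(y_tr.mean(), y_te.mean())
```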


#### Feature Scaling

In [16]:
#importing libraries for data transformation
from sklearn.preprocessing import StandardScaler

#fit the scaler on the training data only, then apply the same transformation to the test data
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
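
The scaler learns its mean and standard deviation from the training data and reuses them on the test data, so no information from the test set leaks into preprocessing. A minimal sketch on synthetic arrays (assumed values, not the real features) checks both properties:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # learns mean_ and scale_ from train only
test_scaled = scaler.transform(test)         # reuses the training statistics

# Training columns end up with mean ~0 and standard deviation ~1
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))

# The test data is shifted and scaled with the *training* mean and std
expected_test = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, expected_test))
```

The test columns will generally not have exactly zero mean after the transform, which is expected.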

In [19]:
#Final Datasets

print(x_train.shape,y_train.shape,x_test.shape,y_test.shape)

# Saving final processed files in csv format for future reference

pd.DataFrame(x_train).to_csv("x_train.csv", index=False)
pd.DataFrame(x_test).to_csv("x_test.csv", index=False)

pd.DataFrame(y_train).to_csv("y_train.csv", index=False)
pd.DataFrame(y_test).to_csv("y_test.csv", index=False)

(5250, 86) (5250,) (2250, 86) (2250,)
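
`StandardScaler` returns NumPy arrays, so the feature CSVs written above carry integer headers (0, 1, 2, ...) instead of the feature names. If the names are wanted downstream, one option is to pass them back in when building the frame. The sketch below is an illustration under assumptions: the two feature names and values are stand-ins for `X.columns` and the scaled `x_train`, and a temporary directory replaces the working directory.

```python
import os
import tempfile

import numpy as np
import pandas as pd

feature_names = ["LIMIT_BAL", "AGE"]            # stand-in for X.columns
scaled = np.array([[0.5, -1.0], [-0.5, 1.0]])   # stand-in for the scaled x_train

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "x_train.csv")
    # Re-attach the column names before writing
    pd.DataFrame(scaled, columns=feature_names).to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
```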


#### The final datasets (train and test) are used in Part 2 and Part 3: machine learning modeling, evaluation and hyperparameter tuning