# $$ \color{red}{Credit\ Card\ Default\ Propensity\ Prediction} $$    

__Objective__ - Predict the probability that a customer will default on their credit card payment in the subsequent month, based on the past information provided in the dataset. This probability will help the collections team prioritise follow-up with customers who have a high propensity to default.

## Project instructions
1.	Perform the required data pre-processing to treat for missing values and outliers
2.	Perform exploratory data analysis to visualise the spread of each of the X variables and the relationship between the various X variables and the Y variable
3.	Identify any multi-collinearity and perform dimensionality reduction like PCA if required
4.	Use the original X variables or the components from PCA & divide the given data into train and test sets
5.	Build a model to predict the default propensity of a customer
6.	Evaluate the model based on model performance measures for classification and recommend the most suitable model.
7.	Come up with recommendations / actionable insights based on feature importance scores derived from the model.


# About the dataset
The dataset contains the following information of 30000 customers:
1.	__DEFAULT__ - Default payment next month (Yes=1, No=0)
2.	__LIMIT_BAL__ - Amount of the given credit (INR)  
3.	__SEX__ - Gender (1 = male; 2 = female)
4.	__EDUCATION__ - Education (1 = graduate school; 2 = university; 3 = high school; 4 = others)
5.	__MARRIAGE__ - (1 = married; 2 = single; 3 = others)
6.	__AGE__ - (year)
7.	__PAST_PAY__ - History of repayment status
    <br> -  PAST_PAY1 = the repayment status in September 2005
    <br> -  PAST_PAY2 = the repayment status in August 2005
    <br> -  PAST_PAY6 = the repayment status in April 2005
    <br> -  The measurement scale for the repayment status is 
         i.  -1 = pay duly <br>
        ii.	 1 = payment delay for one month<br>
        iii. 2 = payment delay for two months <br>
        iv.	 8 = payment delay for eight months<br>
        v.	 9 = payment delay for nine months and above <br>
8.	__BILL_AMT__- Amount of bill statement (INR)
    <br> -  BILL_AMT1 = amount of bill statement in September 2005
    <br> -  BILL_AMT2 = amount of bill statement in August 2005
    <br> -  BILL_AMT6 = amount of bill statement in April 2005
9.	__PAY_AMT__ - Amount of previous payment (INR)
    <br> -  PAY_AMT1 = amount paid in September 2005
    <br> -  PAY_AMT2 = amount paid in August 2005
    <br> -  PAY_AMT6 = amount paid in April 2005


## Additional information
1.	The difference between the bill amount (BILL_AMT) and the amount paid (PAY_AMT) for a month indicates the pending amount for that month and can be used as an additional X variable
2.	The average / sum of bill amount can be used as an additional X variable which will give an indication on the customer value.
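
The two derived variables suggested above could be sketched as follows. This is a minimal illustration on a toy frame, not part of the original analysis: the column names `PENDING_AMT1`, `PENDING_AMT2` and `AVG_BILL_AMT` are hypothetical, and pairing `BILL_AMTn` with `PAY_AMTn` of the same index is an assumption about how the months line up.

```python
import pandas as pd

# Toy frame using the BILL_AMT / PAY_AMT column names from the dataset
# (only two months shown; the real data has six of each).
toy = pd.DataFrame({
    "BILL_AMT1": [3913, 2682, 29239],
    "BILL_AMT2": [3102, 1725, 14027],
    "PAY_AMT1": [0, 0, 1518],
    "PAY_AMT2": [689, 1000, 1500],
})

months = [1, 2]
for m in months:
    # Pending amount = amount billed minus amount paid (same month index, an assumption)
    toy[f"PENDING_AMT{m}"] = toy[f"BILL_AMT{m}"] - toy[f"PAY_AMT{m}"]

# Average bill amount across the available months as a rough proxy for customer value
bill_cols = [f"BILL_AMT{m}" for m in months]
toy["AVG_BILL_AMT"] = toy[bill_cols].mean(axis=1)

print(toy[["PENDING_AMT1", "PENDING_AMT2", "AVG_BILL_AMT"]])
```

On the full dataset the same loop would run over months 1 to 6.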


# Importing important libraries:

In [None]:
import pandas            as pd
import numpy             as np
import matplotlib.pyplot as plt
import seaborn           as sns
import statsmodels.api   as sm

import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection   import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model      import LogisticRegression
from sklearn.metrics           import classification_report
from sklearn.tree              import DecisionTreeClassifier
from sklearn.ensemble          import RandomForestClassifier
from scipy.stats               import randint as sp_randint
from imblearn.over_sampling    import SMOTE

# Read the dataset and display first five rows:

In [None]:
df = pd.read_csv('../input/default-of-credit-card-clients-dataset/UCI_Credit_Card.csv')
df.head()

In [None]:
print("There are {} rows and {} columns in the dataset.".format(df.shape[0],df.shape[1]))

# Data types of the columns:

In [None]:
df.info()

# Descriptive Statistics:

In [None]:
# Five point summary of the dataset

df.describe().T

Inference:

- There are around 30,000 distinct credit card clients.
- The average credit limit is Rs 1,67,484.
- The credit limit (LIMIT_BAL) has a high standard deviation: the median is Rs 1,40,000 while the maximum is Rs 10,00,000.
- For AGE, the mean is about 35 and the median is 28, with a standard deviation of 9.2. The gap is explained by a few much older customers; the maximum age is 79.
- The bill and payment amounts show that some customers have extremely high bills, which may be due to a higher credit limit or to accumulated pending dues.
- The mean bill amount is around Rs 40,000 to Rs 50,000 in every month, with extreme values such as Rs 16,64,089 in BILL_AMT3.
- The mean payment amount is around Rs 4,800 to Rs 5,800 in every month, with extreme values such as Rs 16,64,089.
- Since 0 means 'no default' and 1 means 'default', the target mean of 0.221 means that 22.1% of customers default next month (verified in a later section of this analysis).

In [None]:
print("There are {} missing records in the dataset.".format(df.isnull().sum().sum()))

In [None]:
# Storing feature names in variable 'cols'

cols = df.columns.tolist()

# Exploratory Data Analysis:

In [None]:
for i in [ 'SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']:
    plt.figure(figsize=(10,5))
    sns.countplot(x=df[i])  # pass the column as x; recent seaborn versions reject a positional Series
    plt.show()

#### BOX PLOT - Outliers:

In [None]:
# Boxplot for Bill_Amt vs Limit_bal

plt.figure(figsize=(10,7))
sns.boxplot(data=df[['LIMIT_BAL','BILL_AMT1', 'BILL_AMT2','BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']])
plt.show()

In [None]:
# Boxplot for Pay_Amt vs Limit_bal

plt.figure(figsize=(10,7))
sns.boxplot(data=df[['LIMIT_BAL','PAY_AMT1','PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']])
plt.show()

In [None]:
# Boxplot for column 'AGE'
plt.figure(figsize=(5,5))
sns.boxplot(data=df['AGE'])
plt.show()

#### Outliers:

In [None]:
# Outliers in numerical columns (IQR rule)

num_var = df.select_dtypes(exclude='object')
for i in num_var:
    
    q1 = df[i].quantile(0.25)
    q3 = df[i].quantile(0.75)

    IQR = q3 - q1
    UL = q3 + 1.5*IQR
    LL = q1 - 1.5*IQR

    print('IQR of',i,'= ',IQR)
    print('UL of',i,'= ',UL)
    print('LL of',i,'= ',LL)
    print('Number of Outliers in',i,' = ',(df.shape[0] - df[(df[i]<UL) & (df[i]>LL)].shape[0]))
    print(' ')
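
The cell above only counts outliers. One possible treatment, sketched here under assumptions, is to cap values at the same IQR fences. The helper name `cap_outliers_iqr` is hypothetical, and whether capping suits skewed columns such as the bill amounts is a judgment call this notebook does not make.

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest fence."""
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy series: Q1 = 2, Q3 = 4, IQR = 2, so the fences are -1 and 7
s = pd.Series([1, 2, 3, 4, 100])
capped = cap_outliers_iqr(s)
print(capped.tolist())
```

Tree-based models are fairly insensitive to such extremes, so capping mainly matters for the logistic regression.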

#### Distribution Plot:

In [None]:
mi0 = df[df['default.payment.next.month']==0]
mi1 = df[df['default.payment.next.month']==1]

In [None]:
con_col=['AGE','LIMIT_BAL','BILL_AMT1', 'BILL_AMT2','BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

for i in con_col:
    plt.figure(figsize=(20,5))
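    # distplot is deprecated in recent seaborn; histplot with a KDE overlay is the closest replacement
    sns.histplot(mi0[i], color='g', kde=True, stat='density')
    sns.histplot(mi1[i], color='r', kde=True, stat='density')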
    plt.show()

#### Heat map:

In [None]:
plt.figure(figsize=(25,20))
sns.heatmap(df.corr(),annot=True)
plt.show()
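
With this many columns the annotated heat map is hard to read. As a complement, the strongly correlated pairs can be listed directly. This is a sketch on synthetic data; the 0.9 threshold and the toy columns are assumptions, not values taken from this analysis.

```python
import numpy as np
import pandas as pd

# Synthetic frame: two nearly identical "bill" columns and one unrelated column
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "BILL_AMT1": base,
    "BILL_AMT2": base + rng.normal(scale=0.05, size=200),
    "AGE": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Pairs above the (assumed) threshold are candidates for dropping or for PCA
high = pairs[pairs > 0.9]
print(high)
```

Applied to `df.corr()`, a listing like this would flag which groups of columns are candidates for dimensionality reduction.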

#### Pairplot:

In [None]:
sns.pairplot(df)
plt.show()

## Feature Engineering:

#### Binning the 'AGE' column

In [None]:
def age(x):
    if x in range(21,41):
        return 1
    elif x in range(41,61):
        return 2
    elif x in range(61,80):
        return 3

df['AGE']=df['AGE'].apply(age)

#### Binning the 'PAY' column

In [None]:
def bins(x):
    if x == -2:
        return 'Paid Duly'
    if x == 0:
        return 'Paid Duly'
    if x == -1:
        return 'Paid Duly'
    if x in range(1,4):
        return '1 to 3'
    if x in range(4,7):
        return '4 to 6'
    if x in range(7,10):  # include 9 so the bin matches its '7 to 9' label
        return '7 to 9'

for i in df[['PAY_0','PAY_2','PAY_3','PAY_4','PAY_5','PAY_6']]:
    df[i]=df[i].apply(bins)

#### Replacing 0,5,6 to 4 in education columns

In [None]:
def rep(x):
    if x in [0,4,5,6]:
        return 4
    else:
        return x
df['EDUCATION']=df.EDUCATION.apply(rep)

In [None]:
# Dataset after feature engineering

df.head()

## Encoding categorical variable:

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

for col in df.select_dtypes(include=object).columns:
    df[col] = le.fit_transform(df[col])

## Splitting dataset in dependent and independent variable:

In [None]:
X =df.drop('default.payment.next.month',axis=1)
y = df['default.payment.next.month']

## Splitting dataset into train and test split:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

## Checking if the data is imbalanced:

In [None]:
# Class shares in percent
print(df['default.payment.next.month'].value_counts(normalize=True)*100)
sns.countplot(x=df['default.payment.next.month'])
plt.show()

Here we can see that the data is imbalanced: defaulters (class 1) are the minority class.

#### Balancing the dataset:

In [None]:
print('Before OverSampling, the shape of train_X: {}'.format(X_train.shape)) 
print('Before OverSampling, the shape of train_y: {} \n'.format(y_train.shape))

In [None]:
smote = SMOTE(sampling_strategy='minority')
# fit_sample was removed from imbalanced-learn; fit_resample is the current method
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

In [None]:
print('After OverSampling, the shape of train_X: {}'.format(X_train_sm.shape)) 
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_sm.shape))

## Modelling:

### Base Model using LogisticRegression:

In [None]:
logreg = LogisticRegression(solver='liblinear', fit_intercept=True)

logreg.fit(X_train_sm, y_train_sm)

y_prob_train = logreg.predict_proba(X_train_sm)[:,1]
y_pred_train = logreg.predict (X_train_sm)

print('Classification report - Train: ', '\n', classification_report(y_train_sm, y_pred_train))

y_prob = logreg.predict_proba(X_test)[:,1]
y_pred = logreg.predict (X_test)

print('Classification report - Test: ','\n', classification_report(y_test, y_pred))
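
The cell above stores the class-1 probabilities in `y_prob` but reports only thresholded labels. Since the objective is a default probability used for ranking customers, a threshold-free measure such as ROC AUC may be worth reporting alongside the classification report. The sketch below uses synthetic data from `make_classification` as a stand-in for the credit card data, so the printed score says nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (roughly 22% positives)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.78, 0.22], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=1
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of the positive class, analogous to y_prob in the cell above
prob_te = clf.predict_proba(X_te)[:, 1]

auc = roc_auc_score(y_te, prob_te)
print("ROC AUC on the test split: {:.3f}".format(auc))
```

In the notebook itself the equivalent call would be `roc_auc_score(y_test, y_prob)`.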

### Feature selection- Backward Elimination:

In [None]:
Xc=sm.add_constant(X_train_sm)
model = sm.Logit ( y_train_sm , Xc ).fit ( )

In [None]:
model.summary ( )

In [None]:
cols = list(X_train_sm.columns)
pmax = 1
while (len(cols)>0):
    p= []
    X_1 = X_train_sm[cols]
    X_1 = sm.add_constant(X_1)
    # Use a logistic model for the binary target, consistent with the Logit fit above
    model = sm.Logit(y_train_sm,X_1).fit(disp=0)
    p = pd.Series(model.pvalues.values[1:],index = cols)      
    pmax = max(p)
    feature_with_p_max = p.idxmax()
    if(pmax>0.05):
        cols.remove(feature_with_p_max)
    else:
        break
selected_features_BE = cols
print("Important features: {}".format(selected_features_BE))
print("\nNumber of important features: {}".format(len(selected_features_BE)))

In [None]:
# Adding target column

selected_features_BE.append('default.payment.next.month')
df2=df[selected_features_BE]

### Logistic Regression with important feature:

In [None]:
X = df2.drop('default.payment.next.month',axis=1)
y = df2['default.payment.next.month']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

smote = SMOTE(sampling_strategy='minority')
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

logreg = LogisticRegression(solver='liblinear', fit_intercept=True)

# Fit on the oversampled training data, as in the base model
logreg.fit(X_train_sm, y_train_sm)

y_prob_train = logreg.predict_proba(X_train_sm)[:,1]
y_pred_train = logreg.predict (X_train_sm)

print('Classification report - Train: ', '\n', classification_report(y_train_sm, y_pred_train))

y_prob = logreg.predict_proba(X_test)[:,1]
y_pred = logreg.predict (X_test)

print('Classification report - Test: ','\n', classification_report(y_test, y_pred))

### Decision Tree:

In [None]:
# Defining a DecisionTreeClassifier and fitting it on the oversampled training set
dt = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10, random_state=1 )
dt.fit(X_train_sm, y_train_sm)

y_pred_train = dt.predict(X_train_sm)
y_pred = dt.predict(X_test)
y_prob = dt.predict_proba(X_test)

In [None]:
#Classification for test before hyperparameter tuning
print(classification_report(y_test,y_pred))

#### Hyperparameter Tuning:

In [None]:
dt = DecisionTreeClassifier(random_state=1)

params = {'criterion': ['gini','entropy'],
          'splitter' : ["best", "random"],
          'max_depth' : [2,4,6,8,10,12],
          'min_samples_split': [2,3,4,5],
          'min_samples_leaf': [1,2,3,4,5]}

# random_state makes the sampled parameter combinations reproducible
rand_search_dt = RandomizedSearchCV(dt, param_distributions=params, random_state=1, cv=3)

rand_search_dt.fit(X_train_sm,y_train_sm)

rand_search_dt.best_params_

In [None]:
# Passing best parameter for the Hyperparameter Tuning
dt = DecisionTreeClassifier(**rand_search_dt.best_params_, random_state=1)

dt.fit(X_train_sm, y_train_sm)

y_pred = dt.predict(X_test)

In [None]:
#Classification for test after hyperparameter tuning
print(classification_report(y_test,y_pred))

### Random Forest:

In [None]:
# Create a random forest classifier
rfc=RandomForestClassifier(n_estimators=100, random_state=1)

# Train the model on the oversampled training set
rfc.fit(X_train_sm,y_train_sm)

y_pred = rfc.predict(X_test)

In [None]:
#Classification for test before hyperparameter tuning
print(classification_report(y_test,y_pred))

#### Hyperparameter Tuning:

In [None]:
rfc = RandomForestClassifier(random_state=1)

params = {'n_estimators': sp_randint(5,30),
          'criterion' : ['gini','entropy'],
          'max_depth' : sp_randint(2,10),
          'min_samples_split' : sp_randint(2,20),
          'min_samples_leaf' : sp_randint(1,20),
          # Upper bound tied to the number of features left after backward elimination
          'max_features' : sp_randint(2, X_train_sm.shape[1] + 1)}

rand_search_rfc = RandomizedSearchCV(rfc, param_distributions=params, random_state=1, cv=3)

rand_search_rfc.fit(X_train_sm,y_train_sm)

rand_search_rfc.best_params_

In [None]:
# Passing best parameter for the Hyperparameter Tuning
rfc = RandomForestClassifier(**rand_search_rfc.best_params_, random_state=1)

rfc.fit(X_train_sm, y_train_sm)

y_pred = rfc.predict(X_test)

In [None]:
#Classification for test after hyperparameter tuning
print(classification_report(y_test,y_pred))
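
The project instructions ask for recommendations based on feature importance scores, which the cells above do not compute. A fitted random forest exposes them through `feature_importances_`. The sketch below shows the pattern on synthetic data; the column names `f0` to `f4` are placeholders, and nothing here reflects the importances of the actual credit card features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the first two columns are the informative ones
X_arr, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=1
)
X_demo = pd.DataFrame(X_arr, columns=["f{}".format(i) for i in range(5)])

rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_demo, y_demo)

# Impurity-based importances, normalised by scikit-learn to sum to 1
importances = pd.Series(
    rf.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

In the notebook the same two lines would use the tuned `rfc` and `X_train_sm.columns`. Impurity-based importances can favour high-cardinality features, so they are best read as a rough ranking.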