The dataset comes from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients).
The aim of this analysis is to establish which variables are the most important in predicting default among credit card clients.


There are 25 variables:
ID: ID of each client
LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
SEX: Gender (1=male, 2=female)
EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
MARRIAGE: Marital status (1=married, 2=single, 3=others)
AGE: Age in years
PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)
PAY_2: Repayment status in August, 2005 (scale same as above)
PAY_3: Repayment status in July, 2005 (scale same as above)
PAY_4: Repayment status in June, 2005 (scale same as above)
PAY_5: Repayment status in May, 2005 (scale same as above)
PAY_6: Repayment status in April, 2005 (scale same as above)
BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
default payment next month: Default payment (1=yes, 0=no)

In [None]:
# loading modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN

# getting the data about credit card defaults
defaults=pd.read_excel('../input/default of credit card clients.xls')

In [None]:
# glancing at data
defaults.head()

In [None]:
# inspecting missing values
defaults.isnull().sum()

In [None]:
# examining the df
defaults.info()

In [None]:
# adjusting EDUCATION (X3) column
'''
According to the data description, the EDUCATION (X3) column should be coded as follows:
X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
The data also contains the values 0, 5 and 6. We will reclassify them as 4 (others).
'''
defaults['EDUCATION']=defaults['EDUCATION'].apply(lambda x:4 if x in [0,4,5,6] else x)

# adjusting MARRIAGE
'''
According to the data description, MARRIAGE should be coded as follows: (1 = married; 2 = single; 3 = others).
The data also contains the value 0. We will reclassify it as 3 (others).
'''
defaults['MARRIAGE']=defaults['MARRIAGE'].apply(lambda l:3 if l==0 else l)

# What is the distribution of labels ?

In [None]:
# looking at the distribution of labels
ax=defaults['default payment next month'].value_counts().plot(kind='bar')
plt.title('Number of people who defaulted on credit card')
ax.axes.xaxis.set_ticklabels(['not defaulted','defaulted'])
plt.show()

display(defaults['default payment next month'].value_counts(normalize=True))

We see that the classes are heavily imbalanced.
A classifier can achieve 78 % accuracy just by always predicting 'not defaulted'.
We will have to address this before implementing any predictive model.
We can either equalize the classes or choose other metrics to assess model performance.
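
To make the point about accuracy concrete, here is a minimal sketch in plain Python. The labels are illustrative only (78 non-defaulters and 22 defaulters, roughly mirroring the ratio reported above) and are not the notebook's data. It shows that a model which always predicts 'not defaulted' scores well on accuracy but catches no defaulters, which is what a metric such as recall or balanced accuracy exposes.

```python
# Illustrative labels only: 78 non-defaulters (0) and 22 defaulters (1).
y_true = [0] * 78 + [1] * 22

# A "classifier" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of correct predictions.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the default class: share of actual defaulters that were caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_default = true_positives / sum(t == 1 for t in y_true)

# Recall for the non-default class.
true_negatives = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
recall_non_default = true_negatives / sum(t == 0 for t in y_true)

# Balanced accuracy: mean of the per-class recalls.
balanced_accuracy = (recall_default + recall_non_default) / 2

print(accuracy, recall_default, balanced_accuracy)
```

Accuracy looks respectable here while recall for the default class is zero, so accuracy alone is not a reliable guide on imbalanced data.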

# What are the relationships between the numeric variables ?

In [None]:
# checking for interdependencies (correlations) in numeric data
defa_num=defaults[['BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6','PAY_AMT1',
                  'PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6','LIMIT_BAL']]
display(defa_num.corr())
correlations=defa_num.corr()

In [None]:
# visualizing correlations
ax=sns.heatmap(correlations)
plt.show()

We can see that amounts on bill statements from different months are very strongly correlated and that the strongest
correlations are between the most proximate time periods.

# How did the payment status in different months affect the default rate ?

In [None]:
# we will calculate spearman correlations between the ordinal data in our dataset
payment_categories=defaults[['PAY_0',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6','default payment next month']]

display(payment_categories.corr(method='spearman'))
correlation_ordinal=payment_categories.corr(method='spearman')

The pairwise correlations between repayment status and the response variable 'default payment next month' are weak.
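
Spearman correlation is used here because the repayment codes are ordinal: it is the Pearson correlation computed on ranks rather than on raw values. The sketch below checks this equivalence on a small hypothetical frame (the column names `pay_status` and `default` and their values are made up for illustration).

```python
import pandas as pd

# Toy ordinal data (hypothetical), loosely shaped like a PAY_* column and the label.
toy = pd.DataFrame({
    "pay_status": [-1, 0, 0, 1, 2, 2, 3, 8],
    "default":    [0, 0, 0, 0, 1, 0, 1, 1],
})

# Spearman correlation as computed by pandas.
spearman = toy.corr(method="spearman").loc["pay_status", "default"]

# The same quantity obtained by ranking first (ties get average ranks)
# and then taking the ordinary Pearson correlation.
pearson_on_ranks = toy.rank().corr(method="pearson").loc["pay_status", "default"]

print(spearman, pearson_on_ranks)
```

Because only ranks matter, the outlying code 8 has no more influence than any other largest value would.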

In [None]:
sns.pairplot(payment_categories,hue='default payment next month')
plt.show()

In [None]:
# visualizing correlations
sns.heatmap(correlation_ordinal)
plt.show()

The correlations between repayment status from different months tend to be strong.
Correlations are the strongest between periods which are the most proximate in time.

# Who defaulted more often, men or women ?

In [None]:
# checking the default frequency among men
display(defaults[defaults.SEX==1]['default payment next month'].value_counts(normalize=True))
# checking the default frequency among women
display(defaults[defaults.SEX==2]['default payment next month'].value_counts(normalize=True))

We can see that 24.2 % of men defaulted on their credit card, compared with
20.7 % of women, a difference of 3.5 percentage points.
Women defaulted a little less often than men.
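
The same comparison can be computed in one step, since the mean of a 0/1 label within a group is that group's default rate. This is a sketch on a hypothetical toy frame (the column name `default` and the values are made up), not the notebook's data.

```python
import pandas as pd

# Toy data (hypothetical), using the dataset's coding: SEX 1=male, 2=female.
toy = pd.DataFrame({
    "SEX":     [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "default": [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# The mean of a 0/1 column within each group is that group's default rate.
rate_by_sex = toy.groupby("SEX")["default"].mean()

# Difference in percentage points (male rate minus female rate).
gap_pp = (rate_by_sex.loc[1] - rate_by_sex.loc[2]) * 100

print(rate_by_sex)
print(round(gap_pp, 1))
```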

# What was the age of the people who defaulted vs people who did not ?

In [None]:
# age statistics of people who defaulted on credit card payments
display(defaults[defaults['default payment next month']==1]['AGE'].describe())
# age statistics of people who paid off their debts in time
display(defaults[defaults['default payment next month']==0]['AGE'].describe())

From descriptive statistics there is no discernible difference between the age of people who defaulted and of those who did not.

In [None]:
# plotting the default rate among people above and below 60 years old 
older=defaults[defaults.AGE>=60]['default payment next month'].value_counts(normalize=True).plot(kind='bar')
older.axes.xaxis.set_ticklabels(['not defaulted','defaulted'])
plt.title('The default rate among people 60 years old or older')
plt.show()

younger=defaults[defaults.AGE<60]['default payment next month'].value_counts(normalize=True).plot(kind='bar')
younger.axes.xaxis.set_ticklabels(['not defaulted','defaulted'])
plt.title('The default rate among people below 60 years old')
plt.show()

From the charts above we see that the default rate was markedly lower for people under 60
than for people aged 60 or older.
The default rate was around 30 % for people aged 60 or older and only about 20 % for people below 60.

# How do defaults vary with education level ?

In [None]:
# mapping EDUCATION codes to names so that each slice is labelled by its own category
# (value_counts() orders by frequency, so a fixed label list is only right by coincidence)
education_names={1:'graduate school',2:'university',3:'high school',4:'others'}
pie_colors=['lightgray', 'khaki', 'steelblue', 'c']

# education level of people who have defaulted
counts=defaults[defaults['default payment next month']==1]['EDUCATION'].value_counts().sort_index()
plt.pie(counts.values,labels=list(counts.index.map(education_names)),autopct='%1.1f%%',colors=pie_colors)
plt.title('Education level of people who have defaulted on their credit card bills')
plt.axis('equal')
plt.show()

# education level of people who have not defaulted
counts2=defaults[defaults['default payment next month']==0]['EDUCATION'].value_counts().sort_index()
plt.pie(counts2.values,labels=list(counts2.index.map(education_names)),autopct='%1.1f%%',colors=pie_colors)
plt.title('Education level of people who have NOT defaulted on their credit card bills')
plt.axis('equal')
plt.show()


From the pie charts above one can see that people with graduate school degrees defaulted on their credit card payments
considerably less often than people with only a bachelor's degree or without higher education.
Among people who have defaulted, the fraction of people with a graduate school degree was 5.9 percentage points lower than among people who have not defaulted.
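
The pie charts show the education mix within each outcome group. A complementary view is the default rate within each education level. Both can be read from a cross-tabulation, as in the sketch below. The frame is a hypothetical toy example (only codes 1 and 2 are used, and the column name `default` is made up).

```python
import pandas as pd

# Toy data (hypothetical); EDUCATION codes follow the dataset: 1=graduate school, 2=university.
toy = pd.DataFrame({
    "EDUCATION": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "default":   [1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
})

# normalize="columns": education mix within each outcome group (what the pie charts show).
composition = pd.crosstab(toy["EDUCATION"], toy["default"], normalize="columns")

# normalize="index": default rate within each education level.
rate_within = pd.crosstab(toy["EDUCATION"], toy["default"], normalize="index")

print(composition)
print(rate_within)
```

In `rate_within`, the column for label 1 is the default rate per education level, which is the quantity behind a statement such as "defaulted less often".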

# Did marital status affect number of defaults ?

In [None]:
# mapping MARRIAGE codes to names so that each slice is labelled by its own category
marital_names={1:'married',2:'single',3:'others'}
marital_colors=['tan','mediumturquoise','coral']

# Fraction of married people among those who have defaulted
occurrence=defaults[defaults['default payment next month']==1].MARRIAGE.value_counts().sort_index()
plt.pie(occurrence.values,labels=list(occurrence.index.map(marital_names)),autopct='%1.1f%%',colors=marital_colors)
plt.title('Breakdown of marital status of people who have defaulted on their credit card bills')
plt.axis('equal')
plt.show()

# Fraction of married people among those who have NOT defaulted
incidence=defaults[defaults['default payment next month']==0].MARRIAGE.value_counts().sort_index()
plt.pie(incidence.values,labels=list(incidence.index.map(marital_names)),autopct='%1.1f%%',colors=marital_colors)
plt.title('Breakdown of marital status of people who have NOT defaulted on their credit card bills')
plt.axis('equal')
plt.show()

From the charts above we can see that the fraction of people who were single was measurably higher (3.7 percentage points) among people who have paid off their credit card bills. Single people tended to default somewhat less often than married ones.

# Preparing data for future modelling

Let's examine the unique values in each column of our dataset.

In [None]:
# looking at the unique values for each column
for item in defaults.columns:
    print(item,' >>>>>>>>>>>>> ',defaults[item].unique())

Columns PAY_0 to PAY_6 need to be transformed to binary data because they are not meaningful as numbers.
The same applies to the EDUCATION and MARRIAGE columns. For the feature SEX we will create a dummy variable that
takes the value 1 if the client is female and 0 otherwise (for males).
We will use a process called one-hot encoding to transform the above-mentioned columns to
binary values to enable future modelling.
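
For comparison with the hand-written helpers used in this notebook, here is a minimal sketch of the same idea with `pd.get_dummies`. It is an alternative, not the method this notebook applies. The toy frame is hypothetical. The design point is the column prefix: the indicators are named after their source column, so a code such as 1 in PAY_0 and the same code in MARRIAGE cannot end up in the same output column.

```python
import pandas as pd

# Toy frame (hypothetical) with two categorical columns that share raw codes.
toy = pd.DataFrame({
    "PAY_0":    [-1, 1, 2, 1],
    "MARRIAGE": [1, 2, 1, 3],
})

# get_dummies prefixes each indicator with its source column name,
# so PAY_0 == 1 and MARRIAGE == 1 become separate columns.
encoded = pd.get_dummies(toy, columns=["PAY_0", "MARRIAGE"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```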

In [None]:
# creating dummy variable for SEX column
defaults['SEX']=defaults['SEX'].apply(lambda x:1 if x==2 else 0)

In [None]:
# renaming the values of EDUCATION column
defaults['EDUCATION']=defaults.EDUCATION.apply(lambda x:int(str(x)+'0'))

In [None]:
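# Creating the functions to perform One hot encoding
# Note: each indicator column is named after the raw category value, so equal
# values coming from different source columns (e.g. 1 in PAY_0 and 1 in MARRIAGE)
# map to the same output column and the later one overwrites the earlier one.
def OneHotEncoder(df_Column,what_is_1):
    '''Returns a list with 1 where the row equals what_is_1 and 0 elsewhere'''
    output=[]
    for row in df_Column:
        if row==what_is_1:
            output.append(1)
        else:
            output.append(0)
    return output

def OneHotCaller(df_Column,set_of_categories):
    '''Builds a dict mapping each category to its list of 0/1 indicators'''
    cat_columns_encoded={}
    for thing in set_of_categories:
        cat_columns_encoded[thing]=OneHotEncoder(df_Column,thing)
    return cat_columns_encoded

def get_Distinct_Values_from_Column(df_column):
    '''Returns the distinct values found in the column'''
    distinct_df_column=list(set([row for row in df_column]))
    return distinct_df_column

def CallOneHotCaller(df_Column):
    '''Encodes the column for every distinct value it contains'''
    distinct=get_Distinct_Values_from_Column(df_column=df_Column)
    return OneHotCaller(df_Column,distinct)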
        
def Encode_and_Implement_in_DF(df,df_Column):
    for key,value in CallOneHotCaller(df_Column).items():
        df[key]=value
    return df

# We will make use of this last function
def EncodeImplementAll(df,ListOfColumnsToEncode):
    '''The functions takes as arguments the df containing the columns to encode 
    and a list of df columns (pandas series) to encode'''
    for column in ListOfColumnsToEncode:
        Encode_and_Implement_in_DF(df,column)
    for i in ListOfColumnsToEncode:
        try :
            del df[str(i.name)]
        
        except Exception:
            pass
    
    return df

# creating a list of columns which we want to encode
columns_to_encode=[defaults.PAY_0,defaults.PAY_2,defaults.PAY_3,defaults.PAY_4,
                   defaults.PAY_5,defaults.PAY_6,defaults.EDUCATION,defaults.MARRIAGE]



In [None]:
# getting the data ready for modelling
enc_defaults=EncodeImplementAll(defaults.drop('default payment next month',axis=1),columns_to_encode)
enc_defaults.head()

As mentioned before, we have to balance the classes in our response variable, because currently a classifier
can get 78 % accuracy by just always predicting that the customer will not default.
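
One caveat with balancing by duplication is ordering: if rows are duplicated before the train/test split, copies of the same row can land in both sets. The sketch below shows an alternative ordering, splitting first and oversampling only the training part. It is not the procedure used in this notebook. The toy frame and its column names `feature` and `default` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 80 non-defaulters and 20 defaulters.
toy = pd.DataFrame({
    "feature": range(100),
    "default": [0] * 80 + [1] * 20,
})

# 1) Split first, stratifying so both parts keep the original class ratio.
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["default"], random_state=0
)

# 2) Oversample the minority class in the training part only.
majority = train[train["default"] == 0]
minority = train[train["default"] == 1]
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)
train_balanced = pd.concat([majority, minority_upsampled])

# The test part is untouched, so none of its rows also appear in training.
overlap = set(train_balanced.index) & set(test.index)

print(train_balanced["default"].value_counts())
print(len(overlap))
```

With this ordering the test set keeps the original class ratio, so metrics other than plain accuracy are needed to judge the model on it.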

In [None]:
# we have roughly 3.5 times more class 0 than class 1
# we will include the rows with label 0 two times and the rows with label 1 seven times
# we need to bring back the response variable to filter the df by it
enc_defaults['default payment next month']=defaults['default payment next month']

# balancing the data
negative=enc_defaults[enc_defaults['default payment next month']==0]
neg_2=negative.copy()
positive=enc_defaults[enc_defaults['default payment next month']==1]
pos_2=enc_defaults[enc_defaults['default payment next month']==1].copy()
pos_3=enc_defaults[enc_defaults['default payment next month']==1].copy()
pos_4=enc_defaults[enc_defaults['default payment next month']==1].copy()
pos_5=enc_defaults[enc_defaults['default payment next month']==1].copy()
pos_6=enc_defaults[enc_defaults['default payment next month']==1].copy()
pos_7=enc_defaults[enc_defaults['default payment next month']==1].copy()

# putting together balanced dataset
bal_enc_defaults=pd.concat((negative,neg_2,positive,pos_2,pos_3,pos_4,pos_5,pos_6,pos_7))

In [None]:
# looking at the distribution of labels again
bal_enc_defaults['default payment next month'].value_counts().plot(kind='bar')
plt.show()

# Summing-up

In our Exploratory Data Analysis we have found that being single, holding a graduate degree, being below 60 years old or being a woman was associated with a lower chance of defaulting on credit card bills.
Repayment statuses from different months were strongly correlated with each other.
The amounts on bill statements from different months were strongly correlated with each other too.
The dataset itself did not contain any missing values; however, it did contain categorical data stored as numeric codes, which was transformed to
binary indicator columns through the one-hot encoding technique.
The dataset was also unbalanced, but the classes were equalized.
The data is now ready for modelling.

# What factors influence the default rate the most ?

We will use a random forest classifier to find out which variables influence
the target variable 'default payment next month' the most.

In [None]:
# separating the response variable
y=bal_enc_defaults['default payment next month']

# selecting columns representing features data
features_col=[col for col in bal_enc_defaults.columns if col!='default payment next month' and col!='ID']

# getting features data
X=bal_enc_defaults[features_col]

# Splitting the data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# instantiating the classifier
rfc=RFC()

# fitting the model
rfc.fit(X_train,y_train)

# predicting the test labels
predictions=rfc.predict(X_test)

# checking the accuracy of the model
print(accuracy_score(y_test,predictions))

# double-checking the accuracy of the model

print(len(predictions),' << >> ',len(y_test))

ok=0
for pr,yt in zip(predictions,y_test):
    if pr==yt:
        ok+=1
        
print(ok/len(predictions))

# extracting feature importances and storing them in a dataframe
feature_importances = pd.DataFrame(rfc.feature_importances_,
                                   index = X_train.columns,
                                    columns=['importance']).sort_values('importance',ascending=False)

# giving education-related features more descriptive names to avoid confusion with repayment status
ind=[feature for feature in feature_importances.index]
ind[ind.index(20)]='university'
ind[ind.index(10)]='graduate school'
ind[ind.index(30)]='high school'
ind[ind.index(40)]='other education'
feature_importances.index=ind

# examining feature importances
display(feature_importances)

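According to our model, the most important variables in predicting default are the amount of previous payment, the amount of bill statement and age. The accuracy of predicting defaults was about 99 %. Note that the classes were balanced by duplicating rows before the train/test split, so copies of the same client can appear in both sets and this figure likely overstates the accuracy on genuinely unseen clients.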

# Predicting defaults with different algorithms

Support vector machine

We need to standardize our data before applying SVM.
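
As a reminder of what standardization does, `StandardScaler` rescales each feature to zero mean and unit variance via z = (x - mean) / std. It does not make the data normally distributed. The sketch below uses hypothetical toy arrays and fits the scaler on the training part only, then reuses those statistics for the test part. That ordering is a common convention and an assumption here, not what this notebook does.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy "training" and "test" columns (hypothetical values).
train = np.array([[10.0], [20.0], [30.0], [40.0]])
test = np.array([[25.0], [50.0]])

# Fit on the training data only, then reuse the same mean and scale for the test data.
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# StandardScaler computes z = (x - mean) / std, using the population standard deviation.
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(train_scaled.ravel())
print(test_scaled.ravel(), manual.ravel())
```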

In [None]:
# instantiating the standardizer, which rescales each feature to
# mean 0 and standard deviation 1 (it does not change the shape of the distribution)
standardizer=StandardScaler()

# selecting features to transform
to_standardize=['LIMIT_BAL','AGE','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6',
   'PAY_AMT1','PAY_AMT2',  'PAY_AMT3',  'PAY_AMT4',  'PAY_AMT5',  'PAY_AMT6']

# creating standardized version of numeric variables in our df
numeric_data_normalized=standardizer.fit_transform(X[to_standardize])

# creating a list with the names of binary variables to subset X with it and append to the standardized df later on
binary_variables=[var for var in X.columns if var not in to_standardize]

# creating df with normalized numeric features data
X_norm=pd.DataFrame(data=numeric_data_normalized,columns=to_standardize)

# creating ready for modelling df with normalized numeric features data and binary data as well
for bv in binary_variables:
    X_norm[bv]=list(X[bv])

# instantiating svm classifier
classify=LinearSVC()

# Splitting the data into train and test set
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_norm, y, test_size = 0.2)

# fitting SVM model
classify.fit(X_train2,y_train2)

# making predictions
preds=classify.predict(X_test2)

# checking the accuracy of predictions
print(accuracy_score(y_test2,preds))

The SVM model clearly suffers from underfitting. Both training and test accuracy were about 63 %.
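
A diagnosis of underfitting rests on comparing training and test accuracy. The sketch below shows one way to obtain both numbers with a classifier's `score` method. It runs on synthetic data from `make_classification`, not on the credit card dataset, and the variable names are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic data (not the credit card dataset), just to show the comparison.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

model = LinearSVC(max_iter=10000).fit(X_tr, y_tr)

# For classifiers, .score returns the mean accuracy on the given data.
train_acc = model.score(X_tr, y_tr)
test_acc = model.score(X_te, y_te)

print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

Low accuracy on both sets points to underfitting, while high training accuracy with clearly lower test accuracy points to overfitting.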

K nearest neighbours

In [None]:
# initiating knn classifier
knn=KNN()

# fitting knn model to the data
knn.fit(X_train,y_train)

# predicting the test labels with knn
forecast=knn.predict(X_test)

# checking the accuracy of the knn classifier
print(accuracy_score(y_test,forecast))

The KNN classifier achieved a training accuracy of about 89 % and a test accuracy of about 86 %.
It fits the data very well and also generalizes to unseen data with high accuracy.

XGBoost

In [None]:
# initiating xgb classifier
boost=XGBClassifier()

# fitting xgb model to the data
boost.fit(X_train,y_train)

# predicting the test labels with xgboost
prophecies=boost.predict(X_test)

# checking the accuracy of the model
print(accuracy_score(y_test,prophecies))

XGBoost classifier did not fit the training data very well.
The accuracy on training set was around 70 % whereas on the test set it was around 69 %.

# Conclusions

The defaults of credit card clients seem predictable: after dealing with class imbalance we were able to classify who will default with about 99 % accuracy using the Random Forest classifier. This accuracy was measured on a test set drawn from the oversampled data, so it may be optimistic for genuinely new clients. The most important variables in predicting default were the amount of previous payment, the amount of bill statement and age.