<h2><u>Machine Learning</u></h2>

# Case Study - Credit Card Payment


In this case study, you will apply various classification algorithms to predict whether a client will default on next month's credit card payment.

---

- ### Load Required Libraries

In [1]:
!pip install imbalanced-learn



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings # to ignore warning
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
%matplotlib inline

- ### Load and analyse data

        - Load the data from the required location into a DataFrame
        - Analyse the shape of the data by printing its total number of rows & columns
        - Also print 5 rows of the DataFrame

In [None]:
credit = pd.read_csv("UCI_Credit_Card.csv")

In [None]:
credit.shape

In [None]:
credit.head() ### Glimpse of the data

- ### Clean the data


    - Drop the ID variable, as it has no relevance to training a model
    - Check for any null values


In [None]:
# Removing the ID variable as it has no relevance for the logistic regression model
credit.drop(['ID'],axis=1, inplace=True)

In [None]:
# Checking for any null values
credit.isnull().sum()

There are no null values in the data set

- ### Check Description 
       
       - Summary statistics (count, mean, standard deviation, minimum, quartiles, maximum) for the numeric features

In [None]:
credit.describe()

 - We know that many of the variables are categorical

- ### Check whether the data is balanced

In [None]:
## Checking for data unbalance
temp = credit["default.payment.next.month"].value_counts()
df = pd.DataFrame({'default.payment.next.month': temp.index,'values': temp.values})
plt.figure(figsize = (6,6))
plt.title('Default Credit Card Clients - target value - data unbalance\n (Default = 1, Not Default = 0)')
sns.set_color_codes("pastel")
sns.barplot(x = 'default.payment.next.month', y="values", data=df)
locs, labels = plt.xticks()
plt.show()

Around 22% of clients default next month. The imbalance with respect to the target value (default.payment.next.month) is therefore not large.
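The class shares behind a figure like this can be read directly from `value_counts(normalize=True)`. The sketch below uses a small made-up target column (not the real dataset) purely to illustrate the call; the name `toy_target` is an assumption.

```python
import pandas as pd

# Toy target column for illustration only: 1 = default, 0 = no default
toy_target = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
                       name="default.payment.next.month")

# normalize=True returns class shares instead of raw counts
class_share = toy_target.value_counts(normalize=True)
print(class_share)

# The smaller share is the minority (default) class
minority_share = class_share.min()
print(f"Minority class share: {minority_share:.0%}")
```

On the real data, the same call on `credit["default.payment.next.month"]` would give the shares plotted in the bar chart.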

- ### Correlation for numeric variables only

In [None]:
var = ['BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6']

plt.figure(figsize = (8,8))
plt.title('Amount of bill statement (Apr-Sept) \ncorrelation plot (Pearson)')
corr = credit[var].corr()
sns.heatmap(corr,xticklabels=corr.columns,yticklabels=corr.columns)
plt.show()


Correlation seems fine for these variables.

- ### Treatment of categorical features

Encode the categorical features so that they can be fed into the model.

In [None]:
cat_features = ['EDUCATION', 'SEX', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']

In [None]:
credit_dummies = pd.get_dummies(credit, columns = cat_features)

In [None]:
print("Default of Credit Card Clients train data -  rows:",credit_dummies.shape[0]," columns:", credit_dummies.shape[1])
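To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a toy frame (the values are invented for illustration). Each listed column is replaced by one indicator column per observed level, named `<column>_<level>`, which is how names such as `SEX_1` or `PAY_0_-2` arise.

```python
import pandas as pd

# Toy frame for illustration; the codes mimic the SEX and MARRIAGE columns
toy = pd.DataFrame({"LIMIT_BAL": [20000, 50000, 90000],
                    "SEX": [1, 2, 2],
                    "MARRIAGE": [1, 2, 3]})

# Only the listed columns are encoded; LIMIT_BAL is passed through unchanged
toy_dummies = pd.get_dummies(toy, columns=["SEX", "MARRIAGE"])
print(toy_dummies.columns.tolist())
```

Note that a dummy column is only created for levels that actually occur in the data, so the set of columns depends on the file that was loaded.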

In [None]:
target= 'default.payment.next.month'
predictors = ['LIMIT_BAL', 'AGE', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4',
       'BILL_AMT5', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3',
       'PAY_AMT4', 'PAY_AMT5',
       'EDUCATION_0', 'EDUCATION_1', 'EDUCATION_2', 'EDUCATION_3',
       'EDUCATION_4', 'EDUCATION_5', 'SEX_1',
       'MARRIAGE_0', 'MARRIAGE_1', 'MARRIAGE_2', 'PAY_0_-2',
       'PAY_0_-1', 'PAY_0_0', 'PAY_0_1', 'PAY_0_2', 'PAY_0_3', 'PAY_0_4',
       'PAY_0_5', 'PAY_0_6', 'PAY_0_7', 'PAY_2_-2', 'PAY_2_-1',
       'PAY_2_0', 'PAY_2_1', 'PAY_2_2', 'PAY_2_3', 'PAY_2_4', 'PAY_2_5',
       'PAY_2_6', 'PAY_2_7', 'PAY_3_-2', 'PAY_3_-1', 'PAY_3_0',
       'PAY_3_1', 'PAY_3_2', 'PAY_3_3', 'PAY_3_4', 'PAY_3_5', 'PAY_3_6',
       'PAY_3_7', 'PAY_4_-2', 'PAY_4_-1', 'PAY_4_0', 'PAY_4_1',
       'PAY_4_2', 'PAY_4_3', 'PAY_4_4', 'PAY_4_5', 'PAY_4_6', 'PAY_4_7',
       'PAY_5_-2', 'PAY_5_-1', 'PAY_5_0', 'PAY_5_2', 'PAY_5_3',
       'PAY_5_4', 'PAY_5_5', 'PAY_5_6', 'PAY_5_7', 'PAY_6_-2',
       'PAY_6_-1', 'PAY_6_0', 'PAY_6_2', 'PAY_6_3', 'PAY_6_4', 'PAY_6_5',
       'PAY_6_6', 'PAY_6_7']

- ### Divide target and features

In [None]:
#Assigning and dividing the dataset
X = credit_dummies[predictors]
y=credit_dummies['default.payment.next.month']

- ### Create training and testing set

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection  import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 5)

In [None]:
X_train.shape

In [None]:
X_test.shape
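The split above is purely random. As an optional variation (not what this notebook does), `train_test_split` also accepts a `stratify` argument that keeps the class ratio identical in the training and test parts, which can matter when the target is imbalanced. A minimal sketch on synthetic data, with all variable names assumed for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data for illustration: 100 rows, 20% positives
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 20 + [0] * 80)

# stratify=y_toy keeps the share of positives the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=5, stratify=y_toy)

print(y_tr.mean(), y_te.mean())
```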

- ### Apply Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(C=0.4,max_iter=1000,solver='liblinear')
classifier.fit(X_train, y_train)

In [None]:
#### Predicting on X_test dataset
y_pred = classifier.predict(X_test)

- #### Assessing Model performance
    - __Precision__: Fraction of predicted defaults that are actual defaults
    - __Recall__: Fraction of actual defaults that are correctly identified
    - __F1 Score__: The harmonic mean of precision and recall. It is 1 at best and 0 at worst, so a higher value indicates a better balance between the two.
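As a worked example of these definitions, the sketch below computes each metric by hand from confusion-matrix counts. The counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical confusion-matrix counts, chosen only to illustrate the formulas
tp, fp, fn, tn = 30, 10, 20, 140

precision = tp / (tp + fp)   # share of predicted defaults that are real defaults
recall = tp / (tp + fn)      # share of real defaults that were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")
```

Here accuracy is high (0.85) even though 40% of the true defaults are missed, which is why precision, recall and F1 are tracked alongside accuracy on imbalanced data.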

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

results = pd.DataFrame([['Logistic Regression', acc, prec, rec, f1]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])
print(results)

In [None]:
confusion_matrix(y_test,y_pred)

- ### Apply SMOTE as data is unbalanced

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
sm = SMOTE(random_state=589)
X_SMOTE, y_SMOTE = sm.fit_resample(X_train, y_train)  # fit_sample was removed in newer imbalanced-learn releases
print(len(y_SMOTE))

In [None]:
classifier1 = LogisticRegression(C=0.4,max_iter=1000,solver='liblinear')
classifier1.fit(X_SMOTE, y_SMOTE)

In [None]:
#### Predicting on X_test dataset
y_pred1 = classifier1.predict(X_test)

In [None]:
acc = accuracy_score(y_test, y_pred1)
prec = precision_score(y_test, y_pred1)
rec = recall_score(y_test, y_pred1)
f1 = f1_score(y_test, y_pred1)

model_results = pd.DataFrame([['Logistic Regression - with SMOTE', acc, prec, rec, f1]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])

results = pd.concat([results, model_results], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
print(results)

In [None]:
confusion_matrix(y_test,y_pred1)

Accuracy has dipped, but the F1 score of the model has improved.
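For intuition, SMOTE creates each synthetic minority sample by interpolating between a minority point and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step on two toy points; it is a simplified illustration, not the imbalanced-learn implementation, and all names in it are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two minority-class points (toy values)
x_i = np.array([1.0, 2.0])
x_neighbor = np.array([3.0, 6.0])

# A synthetic sample lies on the segment between a point and its neighbour:
# x_new = x_i + lam * (x_neighbor - x_i), with lam drawn uniformly from [0, 1)
lam = rng.random()
x_new = x_i + lam * (x_neighbor - x_i)
print(x_new)
```

Because the new points are interpolated, one-hot columns can take fractional values after resampling. Note also that only the training set is resampled here; the test set keeps its original class ratio.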

- ### Apply Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
clf = DecisionTreeClassifier(random_state=14) 
# training the classifier
clf.fit(X_train, y_train)
# do our predictions on the test
pred_dt = clf.predict(X_test)

In [None]:
# Predicting Test Set
acc = accuracy_score(y_test, pred_dt)
prec = precision_score(y_test, pred_dt)
rec = recall_score(y_test, pred_dt)
f1 = f1_score(y_test, pred_dt)

model_results = pd.DataFrame([['Decision Tree', acc, prec, rec, f1]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])

results = pd.concat([results, model_results], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
print(results)
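The same four metrics are computed for every model in this notebook. One possible way to avoid repeating that block is a small helper; `score_row` below is a hypothetical name, not part of the original notebook, and the labels are toy values for illustration.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def score_row(name, y_true, y_pred):
    """Return a one-row DataFrame in the same layout as `results`."""
    return pd.DataFrame(
        [[name,
          accuracy_score(y_true, y_pred),
          precision_score(y_true, y_pred, zero_division=0),
          recall_score(y_true, y_pred, zero_division=0),
          f1_score(y_true, y_pred, zero_division=0)]],
        columns=['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])

# Toy labels for illustration only
y_true_toy = [1, 0, 1, 1, 0, 0]
y_pred_toy = [1, 0, 0, 1, 1, 0]

# Stack one row per model, as the notebook does with `results`
toy_results = pd.concat(
    [score_row('Model A', y_true_toy, y_pred_toy),
     score_row('Model B', y_true_toy, y_true_toy)],
    ignore_index=True)
print(toy_results)
```

In the notebook this would be called as, for example, `score_row('Decision Tree', y_test, pred_dt)`.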

- ### Apply Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf_rf = RandomForestClassifier()
clf_rf.fit(X_train, y_train)

# Predicting Test Set
y_pred_rf = clf_rf.predict(X_test)
acc = accuracy_score(y_test, y_pred_rf)
prec = precision_score(y_test, y_pred_rf)
rec = recall_score(y_test, y_pred_rf)
f1 = f1_score(y_test, y_pred_rf)

model_results = pd.DataFrame([['Random Forest(Gini)', acc, prec, rec, f1]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])

results = pd.concat([results, model_results], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
print(results)

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf_rf = RandomForestClassifier(criterion='entropy')
clf_rf.fit(X_train, y_train)

# Predicting Test Set
y_pred_rf = clf_rf.predict(X_test)
acc = accuracy_score(y_test, y_pred_rf)
prec = precision_score(y_test, y_pred_rf)
rec = recall_score(y_test, y_pred_rf)
f1 = f1_score(y_test, y_pred_rf)

model_results = pd.DataFrame([['Random Forest(Entropy)', acc, prec, rec, f1]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])

results = pd.concat([results, model_results], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
print(results)

- ### Top 5 Features

> Decision Tree

In [None]:
result = pd.DataFrame({'features':X_train.columns,'score':clf.feature_importances_.tolist()})
result.sort_values(by=['score'],ascending=False).head()

> Random Forest

In [None]:
result_rf = pd.DataFrame({'features':X_train.columns,'score':clf_rf.feature_importances_.tolist()})
result_rf.sort_values(by=['score'],ascending=False).head()

The feature importances reported by the random forest classifier differ from those of the decision tree.

- ### Apply SVM

In [None]:
from sklearn.svm import SVC

In [None]:
model_svm = SVC(cache_size=100)

In [None]:
model_svm.fit(X_train,y_train)

The model takes a long time to train, so be cautious before running it.

In [None]:
# Predicting Test Set
predicted= model_svm.predict(X_test)

In [None]:
acc = accuracy_score(y_test, predicted)
# zero_division=1 reports a score of 1 (instead of warning) when a metric's denominator is zero
prec = precision_score(y_test, predicted, zero_division=1)
rec = recall_score(y_test, predicted, zero_division=1)
f1 = f1_score(y_test, predicted, zero_division=1)

model_results = pd.DataFrame([['Support Vector Machine', acc, prec, rec, f1]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])

results = pd.concat([results, model_results], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
print(results)

- ### Apply KNN with N = 3

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
model_knn3 = KNeighborsClassifier(n_neighbors=3)
model_knn3.fit(X_train,y_train)

- ### Apply KNN with N = 4

In [None]:
model_knn4 = KNeighborsClassifier(n_neighbors=4)
model_knn4.fit(X_train,y_train)

- ### Apply KNN with N = 5

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
model_knn5 = KNeighborsClassifier(n_neighbors=5)
model_knn5.fit(X_train,y_train)

In [None]:
# Predicting Test Set N=3
pred_knn3= model_knn3.predict(X_test)
acc = accuracy_score(y_test, pred_knn3)
prec = precision_score(y_test, pred_knn3)
rec = recall_score(y_test, pred_knn3)
f1 = f1_score(y_test, pred_knn3)

model_results = pd.DataFrame([['KNN-3 neighbours', acc, prec, rec, f1]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])

results = pd.concat([results, model_results], ignore_index=True)  # DataFrame.append was removed in pandas 2.0


# Predicting Test Set N=4
pred_knn4= model_knn4.predict(X_test)
acc = accuracy_score(y_test, pred_knn4)
prec = precision_score(y_test, pred_knn4)
rec = recall_score(y_test, pred_knn4)
f1 = f1_score(y_test, pred_knn4)

model_results = pd.DataFrame([['KNN-4 neighbours', acc, prec, rec, f1]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])

results = pd.concat([results, model_results], ignore_index=True)  # DataFrame.append was removed in pandas 2.0


# Predicting Test Set N=5
pred_knn5= model_knn5.predict(X_test)
acc = accuracy_score(y_test, pred_knn5)
prec = precision_score(y_test, pred_knn5)
rec = recall_score(y_test, pred_knn5)
f1 = f1_score(y_test, pred_knn5)

model_results = pd.DataFrame([['KNN-5 neighbours', acc, prec, rec, f1]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])

results = pd.concat([results, model_results], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

print(results)
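The three KNN models differ only in `n_neighbors`, so the fit-and-score steps can also be written as a loop. The sketch below runs on synthetic stand-in data generated with `make_classification`; in the notebook, `X_train`, `X_test`, `y_train` and `y_test` would take the place of the toy arrays.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the credit card dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=5)

# Fit one classifier per k and record its F1 score on the held-out part
knn_scores = {}
for k in (3, 4, 5):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    knn_scores[k] = f1_score(y_te, model.predict(X_te))

print(knn_scores)
```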

- ### Apply Naive Bayes

In this case, it is possible to apply both Gaussian and Bernoulli Naive Bayes.
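The two variants treat the features differently: `GaussianNB` models each feature as normally distributed within a class, while `BernoulliNB` treats every feature as binary and, with its default `binarize=0.0`, maps any value greater than 0 to 1 and everything else to 0. The sketch below illustrates this thresholding on toy values (all data here is invented): fitting on the raw values gives the same predictions as fitting on explicitly thresholded values.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy features for illustration: one continuous column, one 0/1 indicator column
X_toy = np.array([[1500.0, 1], [0.0, 0], [320.0, 1],
                  [0.0, 1], [75.0, 0], [0.0, 0]])
y_toy = np.array([1, 0, 1, 0, 1, 0])

# Default binarize=0.0: every value > 0 becomes 1, everything else 0
bnb_default = BernoulliNB().fit(X_toy, y_toy)

# Same model fitted on explicitly thresholded data, with binarization switched off
X_binary = (X_toy > 0).astype(float)
bnb_manual = BernoulliNB(binarize=None).fit(X_binary, y_toy)

print(bnb_default.predict(X_toy), bnb_manual.predict(X_binary))
```

For this dataset it means that continuous amounts are reduced to "positive or not" by Bernoulli Naive Bayes, while the one-hot encoded columns are used as they are.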

- ### Apply Gaussian Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
gnb = GaussianNB()
gnb.fit(X_train, y_train)

In [None]:
# Predicting Test Set
pred_gnb = gnb.predict(X_test)
acc = accuracy_score(y_test, pred_gnb)
prec = precision_score(y_test, pred_gnb)
rec = recall_score(y_test, pred_gnb)
f1 = f1_score(y_test, pred_gnb)

model_results = pd.DataFrame([['Gaussian Naive Bayes', acc, prec, rec, f1]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])

results = pd.concat([results, model_results], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
print(results)

- ### Apply Bernoulli Naive Bayes

In [None]:
from sklearn.naive_bayes import BernoulliNB

In [None]:
model_bnb = BernoulliNB()
model_bnb.fit(X_train, y_train)

In [None]:
# Predicting Test Set
pred_bnb = model_bnb.predict(X_test)
acc = accuracy_score(y_test, pred_bnb)
prec = precision_score(y_test, pred_bnb)
rec = recall_score(y_test, pred_bnb)
f1 = f1_score(y_test, pred_bnb)

model_results = pd.DataFrame([['Bernoulli Naive Bayes', acc, prec, rec, f1]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])

results = pd.concat([results, model_results], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
print(results)

- ### Which model fits the best?

- #### Check Best Accuracy

In [None]:
plt.figure(figsize=(8,5))
max_acc_index=results.Accuracy[results.Accuracy==results.Accuracy.max()].index[0]
plt.barh(results.Model,results.Accuracy,color='c')
plt.barh(results.Model[max_acc_index],results.Accuracy[max_acc_index],color='m')
plt.show()

- #### Check Best Precision

In [None]:
plt.figure(figsize=(8,5))
max_pre_index=results.Precision[results.Precision==results.Precision.max()].index[0]
plt.barh(results.Model,results.Precision,color='c')
plt.barh(results.Model[max_pre_index],results.Precision[max_pre_index],color='m')
plt.show()

- #### Check Best Recall

In [None]:
plt.figure(figsize=(8,5))
max_rc_index=results.Recall[results.Recall==results.Recall.max()].index[0]
plt.barh(results.Model,results.Recall,color='c')
plt.barh(results.Model[max_rc_index],results.Recall[max_rc_index],color='m')
plt.show()

- #### Best F1-Score

In [None]:
plt.figure(figsize=(8,5))
max_f1_index=results['F1 Score'][results['F1 Score']==results['F1 Score'].max()].index[0]
plt.barh(results.Model,results['F1 Score'],color='c')
plt.barh(results.Model[max_f1_index],results['F1 Score'][max_f1_index],color='m')
plt.show()

Based on accuracy, Random Forest(Gini) is the best model. It also performs well on the other metrics.
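Besides reading the charts, the best model per metric can be looked up directly with `idxmax`. The scores in this sketch are placeholders for illustration only, not the results of this notebook; on the real data `results` would be used in place of `demo`.

```python
import pandas as pd

# Placeholder scores for illustration only; these are not the notebook's results
demo = pd.DataFrame({
    'Model': ['Model A', 'Model B', 'Model C'],
    'Accuracy': [0.80, 0.82, 0.78],
    'Precision': [0.60, 0.66, 0.50],
    'Recall': [0.30, 0.35, 0.55],
    'F1 Score': [0.40, 0.46, 0.52],
})

metric_cols = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
# idxmax returns the row label of the largest value in each metric column
best_by_metric = {m: demo.loc[demo[m].idxmax(), 'Model'] for m in metric_cols}
print(best_by_metric)
```

In the placeholder table, the model with the best accuracy is not the one with the best recall or F1 score, which is the situation the separate charts are meant to expose.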

<b><i>Conclusion</i></b>: In this demonstration of the case study, we have gained an understanding of how to apply various data pre-processing steps, apply SMOTE and different classification algorithms.