#Personal Loan Prediction <a></a>

1. Overview
2. Importing Modules and Exploratory Data Analysis(EDA) 
3. Logistic Regression
4. K-Nearest Neighbour
5. Naive Bayes'
6. Support Vector Machine
7. Optimization

#Overview <a></a>
Welcome to my kernel! In this kernel, I use several classification models to predict whether a customer will accept a personal loan. As you can guess, there are various ways to approach this, and each method has its pros and cons.
If you have a question or feedback, do not hesitate to write.

#Importing Modules and Exploratory Data Analysis(EDA) <a></a>

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#load the csv file and make the data frame
bank_df = pd.read_csv('/kaggle/input/bank-loan/Bank_Personal_Loan_Modelling.csv')

In [None]:
#display the first 5 rows of data frame
bank_df.head()

In [None]:
print("The dataframe has {} rows and {} columns".format(bank_df.shape[0],bank_df.shape[1]))

In [None]:
#display information of data frame
bank_df.info()

From the output above, we can see that there are no null values in the dataframe and that all columns are numeric.

In [None]:
#another way to check whether there are any null values
bank_df.isnull().sum()

From the output above, we can see that each column has 0 null values.

In [None]:
#summary statistics (count, mean, std, min, quartiles, max) of the dataframe
bank_df.describe().transpose()

In [None]:
#display histogram plot of each attribute/column
for i in bank_df.columns:
    plt.hist(bank_df[i])
    plt.xlabel(i)
    plt.ylabel('frequency')
    plt.show()

->As per the data description, the CreditCard attribute indicates whether the customer uses a credit card issued by Universal Bank. From the histogram above, most customers do not use one (i.e., the frequency of customers without a credit card is higher).

->As per the data description, the Online attribute indicates whether the customer uses internet banking facilities. From the histogram above, most customers do use them (i.e., the frequency of customers using the online facility is higher).

->As per the data description, the CD Account attribute indicates whether the customer has a certificate of deposit (CD) account with the bank. From the histogram above, most customers do not have a CD account.

->As per the data description, the Securities Account attribute indicates whether the customer has a securities account with the bank. From the histogram above, most customers do not have a securities account.

->As per the data description, the Personal Loan attribute indicates whether the customer accepted the personal loan offered in the last campaign. From the histogram above, most customers did not accept it. This is our target variable, because our objective is to predict the probability that a customer will accept a personal loan.

->As per the data description, the Mortgage attribute is the value of the house mortgage, if any (in thousands of dollars). From the histogram above, the Mortgage column is right-skewed: the long tail is on the right side (mean > median), and for more than 50% of customers the mortgage value is 0.

->As per the data description, the Education attribute is the education level (1: Undergrad; 2: Graduate; 3: Advanced/Professional). From the histogram above, most customers are undergraduates, followed by Advanced/Professional and then Graduate.

->As per the data description, the CCAvg attribute is the average spending on credit cards per month (in thousands of dollars). From the histogram above, the CCAvg column is right-skewed: the long tail is on the right side (mean > median), and the maximum average monthly spending on credit cards is $10,000.

->As per the data description, the Family attribute is the family size of the customer. From the histogram above, most customers have a family size of 1 and the fewest have a family size of 3.

->As per the data description, the ZIP Code attribute is the home address ZIP code.

->As per the data description, the Income attribute is the annual income of the customer (in thousands of dollars). From the histogram above, the Income column is right-skewed: the long tail is on the right side (mean > median), and the maximum income is $224,000.

->As per the data description, the Experience attribute is the number of years of professional experience. From the histogram above, Experience looks roughly normally distributed, but it also contains negative values. The maximum experience is 43 years.

->As per the data description, the Age attribute is the customer's age in completed years. From the histogram above, Age looks roughly normally distributed; the maximum age is 67 years and the minimum age is 23 years.

->As per the data description, the ID attribute is the customer ID. It looks like it is just a serial number (1 to 5000).
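
The "mean > median" reading used above for Mortgage, CCAvg and Income can also be checked numerically. The following is a minimal sketch on a small made-up series (the values are illustrative assumptions, not rows from the dataset); on the real data the same calls could be applied to a column such as `bank_df['Mortgage']`.

```python
import pandas as pd

# Hypothetical mortgage-like values (in thousands of dollars): many zeros
# and a few large amounts. These numbers are made up for illustration.
toy_mortgage = pd.Series([0, 0, 0, 0, 0, 0, 90, 120, 250, 635])

mean_value = toy_mortgage.mean()
median_value = toy_mortgage.median()
skewness = toy_mortgage.skew()  # sample skewness; positive means a longer right tail

print("mean    :", mean_value)
print("median  :", median_value)
print("skewness:", skewness)
print("right-skewed:", mean_value > median_value and skewness > 0)
```

A positive skewness together with a mean above the median is what the histograms suggest visually for these three columns.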

In [None]:
bank_df['Personal Loan'].value_counts()

From the output above, we can see that only 480 customers accepted the personal loan and 4520 customers did not.

In [None]:
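#compute the class percentages from the data instead of hard-coding the counts
loan_pct = bank_df['Personal Loan'].value_counts(normalize=True) * 100
print("Percentage of customers who accepted the personal loan is {}%".format(loan_pct.loc[1]))
print("Percentage of customers who did not accept the personal loan is {}%".format(loan_pct.loc[0]))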

In [None]:
#distplot is deprecated in recent seaborn versions; a count plot suits a binary target
sns.countplot(x='Personal Loan',data=bank_df)
plt.show()

From the above, we can see that 90.4% of customers did not accept the personal loan, so the target column is heavily imbalanced. This means that, without building any model, if I say that a random customer will not accept the personal loan, I will be right about 90% of the time.

So accuracy alone can be misleading here, and a model may struggle to predict the customers who will accept the personal loan.
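
The 90% figure can be reproduced with a "model" that always predicts the majority class. The sketch below rebuilds the labels from the class counts reported above (480 and 4520) rather than loading the CSV, so the arrays are a reconstruction for illustration.

```python
import numpy as np

# Class counts reported above: 4520 customers declined, 480 accepted.
y_true = np.array([0] * 4520 + [1] * 480)

# A baseline that ignores the features and always predicts class 0.
y_baseline = np.zeros_like(y_true)

baseline_accuracy = (y_baseline == y_true).mean()

# Recall for class 1: share of actual acceptors that the baseline finds.
true_positives = ((y_baseline == 1) & (y_true == 1)).sum()
baseline_recall = true_positives / (y_true == 1).sum()

print("baseline accuracy          :", baseline_accuracy)
print("baseline recall for class 1:", baseline_recall)
```

The baseline reaches high accuracy while finding none of the customers who accept, which is why precision, recall and F1 score are reported for every model below.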

**Objective:**
Our objective is to predict the probability that a customer will accept a personal loan.

As seen above, Experience contains negative values, which is not feasible. We will fix this, assuming the minus (-) sign was entered by mistake.
We will make a copy of the dataframe and apply the changes to that copy.

In [None]:
new_bank_df = bank_df.copy()

Now we have a new dataframe called new_bank_df.

In [None]:
print("The number of customers with negative experience is {}".format((new_bank_df['Experience']<0).sum()))

In [None]:
#converting negative experience values into positive
new_bank_df['Experience'] = new_bank_df['Experience'].abs()

In [None]:
print("After the fix, the number of customers with negative experience is {}".format((new_bank_df['Experience']<0).sum()))

ID is just a serial number and ZIP Code is a home address code, so neither is required for model building. We can remove these two feature columns.

In [None]:
#dropping ID and ZIP Code columns from new_bank_df dataframe
new_bank_df.drop(['ID','ZIP Code'],axis=1,inplace=True)

In [None]:
#display first 5 rows of dataframe.
new_bank_df.head()

Education and Family are ordinal: they can be ordered by education level and family size respectively, so there is no need to apply one-hot encoding.
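
For contrast, this is roughly what one-hot encoding would do to an ordered column such as Education. It is a small sketch on a made-up frame (the rows are illustrative assumptions), using the pandas helper `pd.get_dummies`.

```python
import pandas as pd

# Made-up rows for illustration; Education uses the coding described above
# (1: Undergrad, 2: Graduate, 3: Advanced/Professional).
toy_df = pd.DataFrame({"Education": [1, 3, 2, 1], "Family": [1, 4, 2, 3]})

# Ordinal view: the integer codes already carry the order 1 < 2 < 3.
print(toy_df.sort_values("Education"))

# One-hot view: one indicator column per level, and the order is lost.
one_hot = pd.get_dummies(toy_df, columns=["Education"])
print(one_hot.columns.tolist())
```

Keeping the integer codes gives the models a single column per attribute and preserves the ordering, which is the choice made in this kernel.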

In [None]:
#display pair plot
sns.pairplot(data=new_bank_df,hue='Personal Loan')
plt.show()

From the pair plot above, it looks like customers with higher income are more likely to accept the personal loan.
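
A visual impression like this can be backed up with a group summary. The sketch below uses a made-up frame (the rows are illustrative assumptions, not data from the file); on the real data, `toy_df` could be replaced with `new_bank_df`.

```python
import pandas as pd

# Made-up rows for illustration only (Income in thousands of dollars).
toy_df = pd.DataFrame({
    "Income":        [35, 48, 60, 72, 150, 165, 40, 180],
    "Personal Loan": [0,  0,  0,  0,  1,   1,   0,  1],
})

# Mean and median income for customers who declined (0) and accepted (1).
income_by_loan = toy_df.groupby("Personal Loan")["Income"].agg(["mean", "median"])
print(income_by_loan)
```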

**SPLITTING OF DATA INTO TRAINING AND TEST SET WITH 70:30 ratio**

In [None]:
X = new_bank_df.drop('Personal Loan',axis=1)
y = new_bank_df['Personal Loan']

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.30,random_state=1)

In [None]:
n_total = len(X) #total number of customers before the split
print("The training features are {} % of the dataset and the training labels are {} % of the dataset".format(((X_train.shape[0]/n_total)*100),((y_train.shape[0]/n_total)*100)))
print("The test features are {} % of the dataset and the test labels are {} % of the dataset".format(((X_test.shape[0]/n_total)*100),((y_test.shape[0]/n_total)*100)))
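
Because the target is imbalanced, a plain random split can leave the training and test parts with slightly different shares of acceptors. One option, not used in this kernel, is the `stratify` argument of `train_test_split`. The sketch below runs on made-up data (1000 rows with 10% positives, an assumption for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced data for illustration: 1000 rows, 10% positives.
rng = np.random.RandomState(1)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([0] * 900 + [1] * 100)

# stratify=y_toy keeps the class ratio (almost) identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1, stratify=y_toy
)

print("positive share in full data :", y_toy.mean())
print("positive share in train part:", y_tr.mean())
print("positive share in test part :", y_te.mean())
```

Note that switching the kernel to a stratified split would change the confusion matrices reported below.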

#Logistic Regression <a></a>

In [None]:
#importing the library
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,confusion_matrix,recall_score,precision_score,f1_score 

In [None]:
lr = LogisticRegression() #Instantiate the LogisticRegression object
lr.fit(X_train,y_train) #call the fit method of logistic regression to train the model or to learn the parameters of model

In [None]:
y_predict = lr.predict(X_test) #predicting the result of test dataset and storing in a variable called y_predict

In [None]:
print(accuracy_score(y_test,y_predict))#printing overall accuracy score

In [None]:
print("Confusion matrix")
print(confusion_matrix(y_test,y_predict))#creating confusion matrix

A confusion matrix is a square matrix that helps us see the class-level accuracy. Our test dataset has 1500 customers in total. Of these, (1334+17)=1351 customers did not actually accept the personal loan; our model predicted 1334 of these 1351 correctly and made a wrong prediction for 17. Likewise, (65+84)=149 customers actually accepted the personal loan; our model predicted 84 of these 149 correctly and made a wrong prediction for 65.

In [None]:
#displaying accuracy,precision,recall and f1 score computed from the confusion matrix
#for binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test,y_predict).ravel()
a = (tn + tp) / (tn + fp + fn + tp)
p = tp / (tp + fp)
r = tp / (tp + fn)
f = (2 * p * r) / (p + r)

print("accuracy : ",round(a,2))
print("precision: ",round(p,2))
print("recall   : ",round(r,2))
print("F1 score : ",round(f,2))

In [None]:
#another way of displaying precision,recall and f1 score
print("precision:",precision_score(y_test,y_predict))
print("recall   :",recall_score(y_test,y_predict))
print("f1 score :",f1_score(y_test,y_predict))

In [None]:
for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, lr.coef_[0][idx]))

In [None]:
print("The intercept is {}".format(lr.intercept_))

In logistic regression, our hypothesis is h(z) = 1/(1+e^(-z)), where, using the rounded coefficients and intercept printed above, z = -0.40*Age + 0.40*Experience + 0.048*Income + 0.63*Family + 0.16*CCAvg + 1.62*Education + 0.000782*Mortgage - 0.86*Securities Account + 3.2*CD Account - 0.59*Online - 1.01*CreditCard - 2.37.
The hypothesis gives a probability: if the probability is >= 0.5 we classify the customer as 1, otherwise as 0.
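
To make the hypothesis concrete, the sketch below plugs one customer into it by hand. The coefficients are the rounded values quoted above; the customer is hypothetical and made up for illustration, so the resulting probability is not a result from the dataset.

```python
import math

# Rounded coefficients and intercept quoted in the text above.
coefficients = {
    "Age": -0.40, "Experience": 0.40, "Income": 0.048, "Family": 0.63,
    "CCAvg": 0.16, "Education": 1.62, "Mortgage": 0.000782,
    "Securities Account": -0.86, "CD Account": 3.2, "Online": -0.59,
    "CreditCard": -1.01,
}
intercept = -2.37

# Hypothetical customer, made up for illustration.
customer = {
    "Age": 40, "Experience": 15, "Income": 100, "Family": 2,
    "CCAvg": 2.0, "Education": 2, "Mortgage": 0,
    "Securities Account": 0, "CD Account": 0, "Online": 1, "CreditCard": 0,
}

def sigmoid(z):
    """Logistic function h(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Linear part z: intercept plus the sum of coefficient * feature value.
z = intercept + sum(coefficients[name] * customer[name] for name in coefficients)
probability = sigmoid(z)

# Same decision rule as in the text: class 1 if probability >= 0.5.
predicted_class = 1 if probability >= 0.5 else 0

print("z =", round(z, 2))
print("probability of accepting =", round(probability, 4))
print("predicted class =", predicted_class)
```

With the fitted model, the same probability is what `lr.predict_proba` returns in its second column for a given row.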

#K-Nearest Neighbour <a></a>

In [None]:
#importing the library
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=5) #Initialize the object
knn.fit(X_train,y_train)  #call the fit method of knn classifier to train the model

In [None]:
knn_y_predict = knn.predict(X_test) #predicting the result of test dataset and storing in a variable called knn_y_predict

In [None]:
print(accuracy_score(y_test,knn_y_predict)) #printing overall accuracy score

In [None]:
print("Confusion matrix")
print(confusion_matrix(y_test,knn_y_predict)) #creating confusion matrix

Our test dataset has 1500 customers in total. Of these, (1306+45)=1351 customers did not actually accept the personal loan; our model predicted 1306 of these 1351 correctly and made a wrong prediction for 45. Likewise, (94+55)=149 customers actually accepted the personal loan; our model predicted 55 of these 149 correctly and made a wrong prediction for 94.

In [None]:
#displaying precision,recall and f1 score
print("precision:",precision_score(y_test,knn_y_predict))
print("recall   :",recall_score(y_test,knn_y_predict))
print("f1 score :",f1_score(y_test,knn_y_predict))

#Naive Bayes' <a></a>

In [None]:
#importing the library
from sklearn.naive_bayes import GaussianNB

In [None]:
nb = GaussianNB() #Initialize the object
nb.fit(X_train,y_train)  #call the fit method of gaussian naive bayes to train the model or to learn the parameters of model

In [None]:
nb_y_predict = nb.predict(X_test)  #predicting the result of test dataset and storing in a variable called nb_y_predict

In [None]:
print(accuracy_score(y_test,nb_y_predict))  #printing overall accuracy score

In [None]:
print("Confusion matrix")
print(confusion_matrix(y_test,nb_y_predict))  #printing confusion matrix

Our test dataset has 1500 customers in total. Of these, (1228+123)=1351 customers did not actually accept the personal loan; our model predicted 1228 of these 1351 correctly and made a wrong prediction for 123. Likewise, (65+84)=149 customers actually accepted the personal loan; our model predicted 84 of these 149 correctly and made a wrong prediction for 65.

In [None]:
#displaying precision,recall and f1 score
print("precision:",precision_score(y_test,nb_y_predict))
print("recall   :",recall_score(y_test,nb_y_predict))
print("f1 score :",f1_score(y_test,nb_y_predict))

#Support Vector Machine <a></a>

In [None]:
#importing the library
from sklearn.svm import SVC

In [None]:
svc = SVC()  #Initialize the object
svc.fit(X_train,y_train)  #call the fit method of support vector machine to train the model or to learn the parameters of model

In [None]:
svc_y_predict = svc.predict(X_test)  #predicting the result of test dataset and storing in a variable called svc_y_predict

In [None]:
print(accuracy_score(y_test,svc_y_predict))  #printing overall accuracy score

In [None]:
print("Confusion matrix")
print(confusion_matrix(y_test,svc_y_predict))#printing confusion matrix

Our test dataset has 1500 customers in total. Of these, (1350+1)=1351 customers did not actually accept the personal loan; our model predicted 1350 of these 1351 correctly and made a wrong prediction for 1. Likewise, (142+7)=149 customers actually accepted the personal loan; our model predicted only 7 of these 149 correctly and made a wrong prediction for 142.
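
One possible reason for the low recall of the SVM, which this kernel does not test, is that the features are on very different scales (Income and Mortgage can be in the hundreds, while several columns are 0/1 flags). The sketch below shows scaling before fitting, on made-up data where one small-scale feature carries the label and one large-scale feature is pure noise. `make_pipeline` and `StandardScaler` are scikit-learn utilities; whether scaling helps on the bank data is an assumption that would need to be checked.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Made-up data for illustration only.
rng = np.random.RandomState(1)
n = 600
signal = rng.normal(size=n)                # small scale, determines the label
noise = rng.normal(scale=1000.0, size=n)   # large scale, unrelated to the label
X_toy = np.column_stack([signal, noise])
y_toy = (signal > 1.0).astype(int)         # minority positive class

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1, stratify=y_toy
)

# Same default SVC as above, once on raw features and once after scaling.
plain_svc = SVC().fit(X_tr, y_tr)
scaled_svc = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)

plain_recall = recall_score(y_te, plain_svc.predict(X_te))
scaled_recall = recall_score(y_te, scaled_svc.predict(X_te))

print("recall without scaling:", plain_recall)
print("recall with scaling   :", scaled_recall)
```

The same pipeline idea applies to KNN, which also relies on distances between rows.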

In [None]:
#displaying precision,recall and f1 score
print("precision:",precision_score(y_test,svc_y_predict))
print("recall   :",recall_score(y_test,svc_y_predict))
print("f1 score :",f1_score(y_test,svc_y_predict))

These are the baseline models for logistic regression, KNN, Naive Bayes and support vector machine. Comparing the results above, we can conclude that logistic regression performs best among them. Now we will see how we can optimize these models.
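
To put the four baseline models side by side, the sketch below recomputes accuracy, precision, recall and F1 score from the confusion-matrix counts quoted in the text above. It uses only those reported numbers, not the fitted models.

```python
# Confusion-matrix counts (tn, fp, fn, tp) as reported in the text above.
reported = {
    "Logistic Regression": (1334, 17, 65, 84),
    "KNN (k=5)": (1306, 45, 94, 55),
    "Naive Bayes": (1228, 123, 65, 84),
    "SVM": (1350, 1, 142, 7),
}

def summarize(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

results = {name: summarize(*counts) for name, counts in reported.items()}

for name, (accuracy, precision, recall, f1) in results.items():
    print("{:<20} accuracy={:.3f} precision={:.3f} recall={:.3f} f1={:.3f}".format(
        name, accuracy, precision, recall, f1))

# Model with the highest F1 score among the four.
best_by_f1 = max(results, key=lambda name: results[name][3])
print("Best F1 score:", best_by_f1)
```

The SVM has the highest precision but finds very few acceptors, so its F1 score is the lowest of the four.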

#Optimization <a></a>

In [None]:
#Earlier we chose k=5 arbitrarily; now we will see which k value gives the least misclassification error
# creating the list of odd K values (1,3,5....19) for KNN
neighbors = list(range(1, 20, 2))

In [None]:
# empty list that will hold accuracy scores
ac_scores = []

# perform accuracy metrics for values from 1,3,5....19
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    # predict the response
    y_pred_var = knn.predict(X_test)
    # evaluate accuracy
    scores = accuracy_score(y_test, y_pred_var)
    ac_scores.append(scores)

# changing to misclassification error (1 - accuracy)
misclassification_error = [1 - x for x in ac_scores]

# determining best k
optimal_k = neighbors[misclassification_error.index(min(misclassification_error))]
print("The optimal number of neighbors is %d" % optimal_k)

In [None]:
# plot misclassification error vs k
plt.plot(neighbors, misclassification_error)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()
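
The loop above picks k by scoring on the test set, so the test accuracy of the chosen k is likely to be somewhat optimistic. One common alternative is to choose k with cross-validation on the training data only. The sketch below runs on made-up data; the data and the 5-fold setting are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data for illustration only.
rng = np.random.RandomState(1)
X_toy = rng.normal(size=(300, 4))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0.8).astype(int)

# Odd values of k from 1 to 19, as in the loop above.
neighbors = list(range(1, 20, 2))

cv_errors = []
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds of the training data; no test data involved.
    scores = cross_val_score(knn, X_toy, y_toy, cv=5, scoring="accuracy")
    cv_errors.append(1 - scores.mean())

best_k = neighbors[int(np.argmin(cv_errors))]
print("k with the lowest cross-validated error:", best_k)
```

On the bank data, `X_toy` and `y_toy` would be replaced by `X_train` and `y_train`, and the test set would be used once at the end for the final score.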

In [None]:
knn_opt = KNeighborsClassifier(n_neighbors=9) #Initialize the object
knn_opt.fit(X_train,y_train)#call the fit method of knn classifier to train the model

In [None]:
knn_opt_y_predict = knn_opt.predict(X_test)#predicting the result of test dataset and storing in a variable called knn_opt_y_predict

In [None]:
print(accuracy_score(y_test,knn_opt_y_predict))#printing overall accuracy score

In [None]:
print("Confusion matrix")
print(confusion_matrix(y_test,knn_opt_y_predict))#creating confusion matrix

Our test dataset has 1500 customers in total. Of these, (1315+36)=1351 customers did not actually accept the personal loan; our model predicted 1315 of these 1351 correctly and made a wrong prediction for 36. Likewise, (99+50)=149 customers actually accepted the personal loan; our model predicted 50 of these 149 correctly and made a wrong prediction for 99.

In [None]:
#displaying precision,recall and f1 score
print("precision:",precision_score(y_test,knn_opt_y_predict))
print("recall   :",recall_score(y_test,knn_opt_y_predict))
print("f1 score :",f1_score(y_test,knn_opt_y_predict))

In logistic regression, we can change the classification threshold and check how the accuracy changes.

In [None]:
lr_scores = []
thresh = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
#predicted probability of class 1, computed once and reused for every threshold
proba = lr.predict_proba(X_test)[:,1]
for t in thresh:
    preds = np.where(proba >= t, 1, 0)
    lr_scores.append(accuracy_score(y_test, preds))

df = pd.DataFrame(data={'thresh':thresh,'accuracy_scores':lr_scores})
print(df)

In [None]:
plt.plot(thresh,lr_scores)
plt.xlabel('Threshold')
plt.ylabel('Accuracy_scores')
plt.show()

From the above, we can see that the accuracy score is highest (0.94533) at a threshold of 0.5.
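
Since the classes are imbalanced, accuracy may hide how the threshold trades precision against recall. The sketch below tabulates both per threshold. The labels and probabilities are made up for illustration and are not output of the model above; on the real data, `proba` would be `lr.predict_proba(X_test)[:,1]` and `y_true` would be `y_test`.

```python
import numpy as np
import pandas as pd

# Made-up labels and predicted probabilities of class 1, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
proba = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.15, 0.65, 0.30, 0.55, 0.90])

rows = []
for t in [0.3, 0.5, 0.7]:
    preds = (proba >= t).astype(int)
    tp = int(((preds == 1) & (y_true == 1)).sum())
    fp = int(((preds == 1) & (y_true == 0)).sum())
    fn = int(((preds == 0) & (y_true == 1)).sum())
    # Guard against division by zero when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn)
    rows.append({"thresh": t, "precision": precision, "recall": recall})

threshold_table = pd.DataFrame(rows)
print(threshold_table)
```

In this toy example, lowering the threshold finds more acceptors at the cost of more false alarms, so the threshold could be chosen according to which error is more costly for the campaign.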