**Overview:** This dataset is about a bank (Thera Bank) whose management wants to explore ways of converting its liability customers to personal loan customers.

**Data Description:** This dataset contains 5000 customers. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.
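Because only 9.6% of customers accepted the loan, plain accuracy can look high even for a model that never predicts an acceptance. The short sketch below works through that arithmetic using only the counts quoted above; the variable names are illustrative.

```python
# Counts taken from the data description above.
total_customers = 5000
accepted = 480
not_accepted = total_customers - accepted

acceptance_rate = accepted / total_customers

# Accuracy of a trivial model that always predicts "not accepted".
baseline_accuracy = not_accepted / total_customers

print(f"Acceptance rate:   {acceptance_rate:.1%}")
print(f"Baseline accuracy: {baseline_accuracy:.1%}")
```

Any classifier in this notebook should therefore be judged against this baseline, and recall on the accepted class is likely to be more informative than accuracy alone.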

# **Importing Libraries**

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns 
sns.set(color_codes=True)

from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.metrics import classification_report
from sklearn.feature_selection import mutual_info_classif

from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier 

from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE


# **Import Dataset**

In [None]:
Bank  = pd.read_csv("../input/bank-personal-loan-modelling/Bank_Personal_Loan_Modelling.csv")
# Independent copies of the raw data for the later experiments
Bank2 = Bank.copy()
Bank3 = Bank.copy()


# **Attributes Information:**


1. ID: Customer ID
2. Age: Customer's age in completed years
3. Experience: Number of years of professional experience
4. Income: Annual income of the customer
5. ZIP Code: Home Address ZIP code.
6. Family: Family size of the customer
7. CCAvg: Avg. spending on credit cards per month
8. Education: Education Level. 1: Undergrad  2: Graduate  3: Advanced/Professional
9. Mortgage: Value of house mortgage if any.
10. Personal Loan: Did this customer accept the personal loan offered in the last campaign?
11. Securities Account: Does the customer have a securities account with the bank?
12. CD Account: Does the customer have a certificate of deposit (CD) account with the bank?
13. Online: Does the customer use internet banking facilities?
14. CreditCard: Does the customer use a credit card issued by the bank?

# **Explore Dataset**

In [None]:
Bank.head()

In [None]:
Bank.describe()

# The target of the dataset is **"Personal Loan"**

In [None]:
Target= ["Personal Loan"]
t= Bank[Target]
t.head()

In [None]:
Bank['Personal Loan'].describe()

# **Show Unique Values on dataset**

In [None]:
Bank.nunique()

# **Dataset Shape**

In [None]:
rows_count, columns_count = Bank.shape
print("Number of rows :", rows_count)
print("Number of columns :", columns_count)

# **Check Null Values**

In [None]:
Bank.isnull().sum()  # number of missing values per column

# **Check Duplicates**

In [None]:
Bank.duplicated().sum()  # number of fully duplicated rows

# **Heat map Correlation of Attributes**

In [None]:
plt.figure(figsize=(15,7))
plt.title('Correlation of Attributes', size=15)
sns.heatmap(Bank.corr(), annot=True, linewidths=3, fmt='.3f', center=1);

**Observation:** The features most correlated with Personal Loan are:

1.   Income
2.   CCAvg
3.   CD Account

# **Data Visualization**

**Description:** Distribution of several features:

In [None]:
sns.histplot(Bank['Income'], kde=True, stat='density')  # distplot is deprecated in seaborn
plt.title('Income Distribution with KDE');

In [None]:
sns.histplot(Bank['Family'], kde=True, stat='density')  # distplot is deprecated in seaborn
plt.title('Family Distribution with KDE');

In [None]:
sns.histplot(Bank['CCAvg'], kde=True, stat='density')  # distplot is deprecated in seaborn
plt.title('Avg spending of credit cards Distribution with KDE');

In [None]:
sns.histplot(Bank['Education'], kde=True, stat='density')  # distplot is deprecated in seaborn
plt.title('Education Distribution with KDE');

In [None]:
sns.histplot(Bank['Mortgage'], kde=True, stat='density')  # distplot is deprecated in seaborn
plt.title('Mortgage Distribution with KDE');

In [None]:
sns.kdeplot(
   data=Bank, x='Income', hue="Personal Loan",
   fill=True, common_norm=False, palette="crest",
   alpha=.5, linewidth=0,
)

**Observation:** Personal Loan acceptance becomes more likely as Income increases.

In [None]:
sns.kdeplot(
   data=Bank, x='Family', hue="Personal Loan",
   fill=True, common_norm=False, palette="crest",
   alpha=.5, linewidth=0,
)

**Observation:** Personal Loan acceptance is more likely for customers with larger families.

In [None]:
sns.kdeplot(
   data=Bank, x='CCAvg', hue="Personal Loan",
   fill=True, common_norm=False, palette="crest",
   alpha=.5, linewidth=0,
)

**Observation:** Personal Loan acceptance is more likely when average credit card spending (CCAvg) is higher.

In [None]:
sns.kdeplot(
   data=Bank, x='Education', hue="Personal Loan",
   fill=True, common_norm=False, palette="crest",
   alpha=.5, linewidth=0,
)

**Observation:** Personal Loan acceptance is more likely when the Education level is higher.

In [None]:
sns.kdeplot(
   data=Bank, x='Mortgage', hue="Personal Loan",
   fill=True, common_norm=False, palette="crest",
   alpha=.5, linewidth=0,
) 

**Observation:** Personal Loan acceptance is most common when Mortgage equals 0.

In [None]:
loan_acceptance_count = pd.DataFrame(Bank['Personal Loan'].value_counts()).reset_index()
loan_acceptance_count.columns = ['Labels', 'Personal Loan']
loan_acceptance_count

In [None]:
pie_labels = loan_acceptance_count['Labels']
pie_labels = ['Not Accepted' if x == 0 else 'Accepted' for x in pie_labels]
pie_data = loan_acceptance_count['Personal Loan'] 
explode = (0, 0.15) 
wp = { 'linewidth' : 1, 'edgecolor' : '#000000' }

def func(pct, allvalues): 
    absolute = int(np.round(pct / 100.*np.sum(allvalues)))
    return "{:.1f}%\n({:d})".format(pct, absolute)

fig, ax = plt.subplots(figsize =(10, 6))

ax.pie(pie_data,  
       autopct = lambda pct: func(pct, pie_data), 
       explode = explode,  
       labels = pie_labels, 
       shadow = True, 
       startangle = 70, 
       wedgeprops = wp)

ax.axis('equal') 
plt.title('Personal Loan Acceptance Percentage', size=22)
plt.show();

**Observation:** Share of each Personal Loan acceptance category:

1.   Accepted: 480 customers (9.6%)
2.   Not Accepted: 4520 customers (90.4%)



In [None]:
fig = px.bar(Bank, x='Experience', y='Income', title='...', color='Experience')
fig.show()

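**Observation:** The bars are tallest at about 20 years of experience. Because `px.bar` stacks one segment per customer, the bar height is the summed Income at each Experience value, so the peak reflects both the number of customers and their incomes.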

In [None]:
categorical_variables= [col for col in Bank.columns if Bank[col].nunique()<=5]
print(categorical_variables)

In [None]:
categorical_variables.remove("Personal Loan")
print(categorical_variables)

In [None]:
fig=plt.figure(figsize=(15,10))
for i,col in enumerate(categorical_variables):
    ax=fig.add_subplot(2,3,i+1)
    sns.barplot(x=col,y='Personal Loan',data=Bank,ci=None)

**Observation:**
1. Customers with a family size of 3 are more likely to have a Personal Loan.
2. Customers with an Undergraduate degree are less likely to have a Personal Loan than customers with a Graduate or Advanced/Professional degree.
3. Customers with a CD Account or a Securities Account are more likely to have a Personal Loan.
4. Customers who use Online banking or a bank-issued Credit Card are more likely to have a Personal Loan than those who do not.
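Because Personal Loan is coded as 0/1, each bar in the plot above is a group mean, which is the acceptance rate for that category. A minimal sketch of that calculation follows; the rows are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical (education_level, accepted_loan) rows, for illustration only.
rows = [(1, 0), (1, 0), (1, 0), (1, 1),
        (2, 0), (2, 1),
        (3, 1), (3, 0), (3, 1), (3, 0)]

totals = defaultdict(int)    # customers per education level
accepted = defaultdict(int)  # customers per level who accepted the loan
for level, loan in rows:
    totals[level] += 1
    accepted[level] += loan

# The mean of a 0/1 column within a group is the acceptance rate of that group.
acceptance_rate = {level: accepted[level] / totals[level] for level in sorted(totals)}
print(acceptance_rate)
```

On the real data the same numbers could be obtained with `Bank.groupby('Education')['Personal Loan'].mean()`.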

# **Feature Selection**

In [None]:
X = Bank.drop('Personal Loan', axis = 1)    #set X with all features except Personal Loan
Y = Bank['Personal Loan']                   #set Y with our target Personal Loan (1-D Series, as scikit-learn expects)

# **Train Test Split**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=1, stratify = Y)

In [None]:
mutual_information = mutual_info_classif(X_train, y_train, n_neighbors=5, copy = True)

plt.subplots(1, figsize=(26, 1))
sns.heatmap(mutual_information[:, np.newaxis].T, cmap='Blues', cbar=False, linewidths=1, annot=True, annot_kws={"size": 20})
plt.yticks([], [])
plt.gca().set_xticklabels(X_train.columns, rotation=45, ha='right', fontsize=16)
plt.suptitle("Variable Importance (mutual_info_classif)", fontsize=22, y=1.2)
plt.gcf().subplots_adjust(wspace=0.2)

**Observation:** The most important features in the dataset by mutual information are Income, CCAvg, and CD Account.
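For intuition about what the scores above measure, mutual information between two discrete variables can be computed directly from their joint and marginal frequencies. The sketch below is a hand-rolled illustration with toy inputs; `mutual_information` is a hypothetical helper, and scikit-learn's `mutual_info_classif` uses a nearest-neighbor estimator for continuous features, so its values will not match this formula exactly.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi

# Toy examples: an uninformative feature and a perfectly informative one.
target = [0, 1, 0, 1]
independent_feature = [0, 0, 1, 1]
identical_feature = [0, 1, 0, 1]

print(mutual_information(independent_feature, target))
print(mutual_information(identical_feature, target))
```

A score near zero means the feature tells us nothing about the target, while larger scores mean knowing the feature reduces uncertainty about the target.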

In [None]:
rf_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rf_clf.fit(X_train, y_train)

features = list(X_train.columns)
importances = rf_clf.feature_importances_
indices = np.argsort(importances)

fig, ax = plt.subplots(figsize=(10, 7))
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
ax.tick_params(axis="x", labelsize=12)
ax.tick_params(axis="y", labelsize=14)
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Feature Importance', fontsize = 18)

**Observation:** Ranking the features with Random Forest **feature_importances_**, the most influential features are:


1. Income
2. Education
3. CCAvg



# **Model Building**

In [None]:
Bank = Bank.drop(['ID','ZIP Code'], axis=1) #dropped unimportant features

In [None]:
Bank.head(3)

# **Scaling Dataset**

**Observation:** Standard scaling is used because the features have very different value ranges.

In [None]:
scaler=StandardScaler()
scaled_df=scaler.fit_transform(Bank.drop('Personal Loan',axis=1))

In [None]:
scaled_df=pd.DataFrame(scaled_df)

In [None]:
scaled_df.columns=Bank.drop('Personal Loan',axis=1).columns
scaled_df.head()



* Scaling was applied to improve the results and reduce misclassifications.
* After scaling, accuracy and recall were higher.
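`StandardScaler` rescales each column to z = (x - mean) / std, using the population standard deviation. The sketch below reproduces that formula by hand on made-up numbers and, as a common practice, learns the mean and standard deviation from the training values only before applying them to the test values. The helper names and the values are assumptions for illustration.

```python
import statistics

def fit_standardizer(values):
    """Learn the mean and population standard deviation from training values."""
    return statistics.fmean(values), statistics.pstdev(values)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(v - mean) / std for v in values]

# Hypothetical income values, not taken from the dataset.
income_train = [40, 60, 80, 100, 120]
income_test = [50, 150]

mean, std = fit_standardizer(income_train)
income_train_scaled = standardize(income_train, mean, std)
income_test_scaled = standardize(income_test, mean, std)

print(income_train_scaled)
print(income_test_scaled)
```

Scaling only changes model results if the scaled columns are the ones passed to `fit` and `predict`, so the scaled data (or a scaler fitted on the training split) has to replace the raw features in the model cells.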





# Decision Tree (DT)

In [None]:
DT = DecisionTreeClassifier(max_depth=2)
# max_depth is maximum number of levels in the tree 
DT.fit(X_train,y_train)

In [None]:
y_pred_DT= DT.predict(X_test)

In [None]:
print(accuracy_score(y_test,y_pred_DT))

In [None]:
cm_DT = confusion_matrix(y_test, y_pred_DT)
print(cm_DT)

In [None]:
y_pred_DT= DT.predict(X_test)
print(classification_report(y_test, y_pred_DT))

In [None]:
sns.heatmap(confusion_matrix(y_test,y_pred_DT),annot=True,fmt='',cmap='YlGnBu')
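The confusion matrix and the classification report carry the same information. scikit-learn puts true labels on the rows and predicted labels on the columns, so for labels 0/1 the layout is `[[TN, FP], [FN, TP]]`. The sketch below derives the usual metrics from such a matrix; the counts are hypothetical and are not the output of the model above.

```python
# Hypothetical counts for a 1000-row test set (not results from this notebook).
# Rows are true labels, columns are predicted labels: [[TN, FP], [FN, TP]].
cm = [[890, 14],
      [40, 56]]
(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # of the predicted acceptances, how many were real
recall = tp / (tp + fn)      # of the real acceptances, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With these made-up counts accuracy is high while recall is much lower, which is the pattern to watch for on an imbalanced target such as Personal Loan.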

# Support Vector Machine (SVM)

In [None]:
SVM = SVC(kernel='linear',C=1.0, gamma='scale')  # gamma has no effect with a linear kernel
SVM.fit(X_train,y_train)

In [None]:
y_pred_SVM = SVM.predict(X_test)

In [None]:
print(accuracy_score(y_test,y_pred_SVM))

In [None]:
cm_SVM = confusion_matrix(y_test, y_pred_SVM)
print(cm_SVM)

In [None]:
y_pred_SVM= SVM.predict(X_test)
print(classification_report(y_test, y_pred_SVM))

In [None]:
sns.heatmap(confusion_matrix(y_test,y_pred_SVM),annot=True,fmt='',cmap='YlGnBu')

# K-Nearest Neighbors (KNN)

In [None]:
kclf = KNeighborsClassifier(n_neighbors=5)

In [None]:
kclf.fit(X_train,y_train)

In [None]:
y_pred_KNN= kclf.predict(X_test)

In [None]:
print(metrics.accuracy_score(y_test, y_pred_KNN))

In [None]:
cm_KNN = confusion_matrix(y_test, y_pred_KNN)
print(cm_KNN)

In [None]:
y_pred_KNN= kclf.predict(X_test)
print(classification_report(y_test, y_pred_KNN))

In [None]:
sns.heatmap(confusion_matrix(y_test,y_pred_KNN),annot=True,fmt='',cmap='YlGnBu')

# Logistic Regression (LR)

In [None]:
LR= LogisticRegression()
LR.fit(X_train, y_train)

In [None]:
y_pred_LR= LR.predict(X_test)

In [None]:
print(metrics.accuracy_score(y_test, y_pred_LR))

In [None]:
cm_LR = confusion_matrix(y_test, y_pred_LR)
print(cm_LR)

In [None]:
y_pred_LR= LR.predict(X_test)
print(classification_report(y_test, y_pred_LR))

In [None]:
sns.heatmap(confusion_matrix(y_test,y_pred_LR),annot=True,fmt='',cmap='YlGnBu')

# Gaussian Naive Bayes (GNB)

In [None]:
GNB = GaussianNB()

In [None]:
GNB.fit(X_train,y_train)

In [None]:
y_pred_GNB= GNB.predict(X_test)

In [None]:
print(metrics.accuracy_score(y_test, y_pred_GNB))

In [None]:
cm_GNB = confusion_matrix(y_test, y_pred_GNB)
print(cm_GNB)

In [None]:
y_pred_GNB= GNB.predict(X_test)
print(classification_report(y_test, y_pred_GNB))

In [None]:
sns.heatmap(confusion_matrix(y_test,y_pred_GNB),annot=True,fmt='',cmap='YlGnBu')

# Random Forest (RF)

In [None]:
RF= RandomForestClassifier(n_estimators=500, random_state=0)

In [None]:
RF.fit(X_train,y_train)

In [None]:
y_pred_RF= RF.predict(X_test)

In [None]:
print(metrics.accuracy_score(y_test, y_pred_RF))

In [None]:
cm_RF = confusion_matrix(y_test, y_pred_RF)
print(cm_RF)

In [None]:
y_pred_RF= RF.predict(X_test)
print(classification_report(y_test, y_pred_RF))

In [None]:
sns.heatmap(confusion_matrix(y_test,y_pred_RF),annot=True,fmt='',cmap='YlGnBu')


End of work before trying to enhance the results of models.

# **Understanding Relations Between Features**

In [None]:
Selected= Bank2[['Income','Family','CCAvg']].corr()
Selected 

**Observation:** Chose the most important and most highly correlated features (Income, Family, CCAvg).

In [None]:
plt.figure(figsize=(15,7))
plt.title('Correlation of Attributes', size=15)
sns.heatmap(Selected, annot=True, linewidths=3, fmt='.3f', center=1);  # Selected is already a correlation matrix

In [None]:
corr_PL= Bank2[['Age','Experience','Personal Loan']].corr()
corr_PL

In [None]:
plt.figure(figsize=(15,7))
plt.title('Correlation of Attributes', size=15)
sns.heatmap(corr_PL, annot=True, linewidths=3, fmt='.3f', center=1);  # corr_PL is already a correlation matrix

In [None]:
Bank2.drop('Experience',axis=1)  # preview only; Bank2 itself is not modified here

**Observation:** Since Age shows a slightly better correlation with Personal Loan, we will drop the Experience attribute.
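The reasoning above has two steps: confirm that two features are nearly redundant, then keep the one that correlates more strongly with the target. A minimal sketch of both steps follows, using a hand-written Pearson correlation and toy values that are not taken from the dataset; `pearson` is a hypothetical helper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values, for illustration only.
age = [25, 32, 41, 50, 58, 63]
experience = [1, 7, 16, 25, 33, 38]
loan = [0, 0, 1, 0, 1, 0]

# Step 1: how redundant are the two features with each other?
redundancy = pearson(age, experience)

# Step 2: of the redundant pair, keep the one more correlated with the target.
candidates = {"Age": age, "Experience": experience}
strength = {name: abs(pearson(values, loan)) for name, values in candidates.items()}
keep = max(strength, key=strength.get)

print(f"corr(Age, Experience) = {redundancy:.3f}")
print(f"keep: {keep}")
```

When the difference in correlation with the target is small, as the observation above notes, the choice between the two features is close to arbitrary and mostly serves to remove the redundancy.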



# **Model Building After Trying to Enhance**

In [None]:
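Bank2 = Bank2.drop(['ID','ZIP Code','Securities Account','CreditCard','Online','Experience'], axis=1)

In [None]:
X = Bank2.drop('Personal Loan', axis=1)
Y = Bank2['Personal Loan']
# Split before oversampling so that the test set contains only real customers
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)

In [None]:
# Apply SMOTE to the training split only
smote = SMOTE(sampling_strategy='minority', random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)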

**Observation:** Dropped Experience rather than Age because Age was slightly better correlated with Personal Loan, and dropped Securities Account, CreditCard, and Online because they did not show any influence.

In [None]:
Bank2.head() #dataset after dropping unwanted attributes

# Decision Tree (DT) After Enhance

In [None]:
dt2 = DecisionTreeClassifier(max_depth=2)
# max_depth is maximum number of levels in the tree 
dt2.fit(X_train,y_train)

In [None]:
y_pred_DT2= dt2.predict(X_test)

In [None]:
print(accuracy_score(y_test,y_pred_DT2))

In [None]:
cm_DT2 = confusion_matrix(y_test, y_pred_DT2)
print(cm_DT2)

In [None]:
y_pred_DT2= dt2.predict(X_test)
print(classification_report(y_test, y_pred_DT2))


In [None]:
sns.heatmap(confusion_matrix(y_test,y_pred_DT2),annot=True,fmt='',cmap='YlGnBu')

# Support Vector Machine (SVM) After Enhance

In [None]:
SVM2 = SVC(kernel='linear',C=1.0, gamma='scale')  # gamma has no effect with a linear kernel
SVM2.fit(X_train,y_train)

In [None]:
y_pred_SVM2 = SVM2.predict(X_test)

In [None]:
print(accuracy_score(y_test,y_pred_SVM2))

In [None]:
cm_SVM2 = confusion_matrix(y_test, y_pred_SVM2)
print(cm_SVM2)

In [None]:
y_pred_SVM2= SVM2.predict(X_test)
print(classification_report(y_test, y_pred_SVM2))

In [None]:
sns.heatmap(confusion_matrix(y_test,y_pred_SVM2),annot=True,fmt='',cmap='YlGnBu')

# K-Nearest Neighbors After Enhance

In [None]:
kclf2 = KNeighborsClassifier(n_neighbors=5)

In [None]:
kclf2.fit(X_train,y_train)

In [None]:
y_pred_KNN2= kclf2.predict(X_test)

In [None]:
print(metrics.accuracy_score(y_test, y_pred_KNN2))

In [None]:
cm_KNN2 = confusion_matrix(y_test, y_pred_KNN2)
print(cm_KNN2)

In [None]:
y_pred_KNN2= kclf2.predict(X_test)
print(classification_report(y_test, y_pred_KNN2))

In [None]:
sns.heatmap(confusion_matrix(y_test,y_pred_KNN2),annot=True,fmt='',cmap='YlGnBu')

# Logistic Regression After Enhance

In [None]:
LR2= LogisticRegression()
LR2.fit(X_train, y_train)

In [None]:
y_pred_LR2= LR2.predict(X_test)

In [None]:
print(metrics.accuracy_score(y_test, y_pred_LR2))

In [None]:
cm_LR2 = confusion_matrix(y_test, y_pred_LR2)
print(cm_LR2)

In [None]:
y_pred_LR2= LR2.predict(X_test)
print(classification_report(y_test, y_pred_LR2))

In [None]:
sns.heatmap(confusion_matrix(y_test,y_pred_LR2),annot=True,fmt='',cmap='YlGnBu')

# Gaussian Naive Bayes After Enhance

In [None]:
GNB2 = GaussianNB()

In [None]:
GNB2.fit(X_train,y_train)

In [None]:
y_pred_GNB2= GNB2.predict(X_test)

In [None]:
print(metrics.accuracy_score(y_test, y_pred_GNB2))

In [None]:
cm_GNB2 = confusion_matrix(y_test, y_pred_GNB2)
print(cm_GNB2)

In [None]:
y_pred_GNB2= GNB2.predict(X_test)
print(classification_report(y_test, y_pred_GNB2))

In [None]:
sns.heatmap(confusion_matrix(y_test,y_pred_GNB2),annot=True,fmt='',cmap='YlGnBu')

# Random Forest After Enhance

In [None]:
RF2= RandomForestClassifier(n_estimators=500, random_state=0)

In [None]:
RF2.fit(X_train,y_train)

In [None]:
y_pred_RF2= RF2.predict(X_test)

In [None]:
print(metrics.accuracy_score(y_test, y_pred_RF2))

In [None]:
cm_RF2 = confusion_matrix(y_test, y_pred_RF2)
print(cm_RF2)

In [None]:
y_pred_RF2= RF2.predict(X_test)
print(classification_report(y_test, y_pred_RF2))

In [None]:
sns.heatmap(confusion_matrix(y_test,y_pred_RF2),annot=True,fmt='',cmap='YlGnBu')


# **Data Visualization After Enhancing**

In [None]:
sns.kdeplot(data=Bank, x="Income", hue="Personal Loan", multiple="stack")

In [None]:
sns.kdeplot(data=Bank, x="Family", hue="Personal Loan", multiple="stack")

In [None]:
sns.kdeplot(data=Bank, x="CCAvg", hue="Personal Loan", multiple="stack") 

In [None]:
sns.kdeplot(data=Bank, x="Education", hue="Personal Loan", multiple="stack")

In [None]:
sns.kdeplot(data=Bank, x="Mortgage", hue="Personal Loan", multiple="stack") 

In [None]:
X = Bank3[['CCAvg','Family','Income']]
Y = Bank3['Personal Loan']  # 1-D Series, as scikit-learn expects for the target

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=1, stratify = Y) 

**Description:** We made another attempt to enhance the model by training it on only three of the most important features:
1.   Income
2.   Family
3.   CCAvg



# Decision Tree 3 (Try)

In [None]:
dt3 = DecisionTreeClassifier(max_depth=2)
# max_depth is maximum number of levels in the tree 
dt3.fit(X_train,y_train)

In [None]:
y_pred_DT3= dt3.predict(X_test)

In [None]:
print(accuracy_score(y_test,y_pred_DT3))

In [None]:
cm_DT3 = confusion_matrix(y_test, y_pred_DT3)
print(cm_DT3)

In [None]:
y_pred_DT3= dt3.predict(X_test)
print(classification_report(y_test, y_pred_DT3))

In [None]:
sns.heatmap(confusion_matrix(y_test,y_pred_DT3),annot=True,fmt='',cmap='YlGnBu')

# Random Forest 3 (Try)

In [None]:
RF3= RandomForestClassifier(n_estimators=500, random_state=0)

In [None]:
RF3.fit(X_train,y_train)

In [None]:
y_pred_RF3= RF3.predict(X_test)

In [None]:
print(metrics.accuracy_score(y_test, y_pred_RF3))

In [None]:
cm_RF3 = confusion_matrix(y_test, y_pred_RF3)
print(cm_RF3)

In [None]:
y_pred_RF3= RF3.predict(X_test)
print(classification_report(y_test, y_pred_RF3))

In [None]:
sns.heatmap(confusion_matrix(y_test,y_pred_RF3),annot=True,fmt='',cmap='YlGnBu')

# **Finally Thank You!**