# Personal Loan Campaign Modelling

## Objectives:
1. To predict whether a liability customer will buy a personal loan.
2. To identify which variables are most significant.
3. To identify which customer segments should be targeted.

## Key Questions:
1. What are the key variables that have a strong relationship with the dependent variable?
2. Which metric is right for model performance evaluation and why?
3. How accurate are the model predictions and can they be improved?


# Importing Necessary Libraries

In [None]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
pd.set_option("display.max_columns", None)
# pd.set_option('display.max_rows', None)
pd.set_option("display.max_rows", 200)

import statsmodels.api as sm

## Load and Explore the Data

In [None]:
data = '../input/personal-loan-modeling/Bank_Personal_Loan_Modelling.csv'
data_frame = pd.read_csv(data) #load and read the csv file
df= data_frame.copy() #making a copy to avoid changes to data
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")
#checking the shape of the dataset
np.random.seed(85) 
df.sample(10) #loading random 10 rows

In [None]:
df.info() # looking at the structure of the data

* All the variables are numerical and there are no missing values.
* The dependent variable is Personal Loan. It takes only two values, `0` and `1`. As it is a binary class variable, we will convert it to a category for further processing.
* Education and Family hold numeric codes for distinct categories and should be stored as the category datatype.
* Securities Account, CD Account, Online and CreditCard are int columns with binary values
    * {i.e. 0 = No and 1 = Yes}
* These can also be treated as two-level categories, so we will convert them to the categorical datatype for modelling and analysis.
* ZIP Code identifies the area where the customer lives and cannot be used in its numerical form. Hence we will extract the relevant information from it and replace the raw values.


## Feature Engineering:

In [None]:
#remove the spaces in the columns
df.rename(columns={"ZIP Code":"ZIPCode","Personal Loan":"Personal_Loan",
                        "Securities Account":"Securities_Account","CD Account":'CD_Account'},inplace=True)

* Checking the different values in the ZIPCode variable, we conclude that all the customers in this dataset are from the State of California. 
* We will extract the county where the customer is currently residing.

In [None]:
# checking the number of uniques in the zip code
df['ZIPCode'].nunique()

* There are 467 unique values in ZIP code.
* In the US, the first digit of a ZIP code identifies a broad group of states, the second and third digits together identify a regional sorting facility (sectional center) within that group, and the last two digits identify the local post office or delivery area.
* Hence we will group the customers by the first two digits of their ZIP code.


In [None]:
df['ZIPCode'] = df['ZIPCode'].astype(str)
df['ZIPCode'] = df['ZIPCode'].str[0:2]
df['ZIPCode'].nunique()

* The unique ZIPCodes are now reduced to seven groups.

## Fixing DataTypes

In [None]:
df.drop(['ID'],axis=1,inplace=True)
#Dropping ID as it's not relevant

In [None]:
df['Education'] = df['Education'].astype('category')
df['Family'] = df['Family'].astype('category')
df['Personal_Loan'] = df['Personal_Loan'].astype('category')
df['Securities_Account'] = df['Securities_Account'].astype('category')
df['CD_Account'] = df['CD_Account'].astype('category')
df['Online'] = df['Online'].astype('category')
df['CreditCard'] = df['CreditCard'].astype('category')
df['ZIPCode'] = df['ZIPCode'].astype('category')

df.info() #rechecking the datatypes 

* We now have 5000 rows and 13 columns, and the memory usage has also reduced.
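
As a rough illustration of why the memory footprint shrinks, the sketch below compares a binary flag stored as `int64` with the same flag stored as `category`. The series `flag` is a hypothetical stand-in, not a column from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 flag with 5000 entries, standing in for a column such as Online
rng = np.random.default_rng(85)
flag = pd.Series(rng.integers(0, 2, size=5000), dtype="int64")

# Bytes used by the values themselves (index excluded)
bytes_int = flag.memory_usage(index=False, deep=True)
bytes_cat = flag.astype("category").memory_usage(index=False, deep=True)

print(f"int64: {bytes_int} bytes, category: {bytes_cat} bytes")
```

A category column stores one small integer code per row plus the list of distinct categories, which is why it is cheaper when there are only a few distinct values.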

## Summary of Numerical Columns

In [None]:
df.describe().T 

* The mean and median of Age are almost equal, at approximately 45 years.
* The Experience column has a minimum value of -3, which could be an error and needs to be checked.
* Mean Income is greater than median Income, indicating right skew. We also see a very high maximum value, suggesting outliers.
* The minimum CCAvg is 0.0 dollars, suggesting that some customers may not have any credit cards. The mean and median for this variable are fairly close.
* Similarly, Mortgage is 0.0 for at least 50% of the customers (the median is 0); this could mean these customers don't own a home.
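
The "mean greater than median" reading of skewness can be checked numerically. The sketch below uses a small set of hypothetical income-like values (not taken from the dataset) and compares the mean, the median and the sample skewness reported by pandas.

```python
import pandas as pd

# Hypothetical income-like values (in thousand dollars) with a long right tail
income = pd.Series([20, 25, 30, 35, 40, 45, 50, 60, 150, 220])

mean_income = income.mean()
median_income = income.median()
skewness = income.skew()   # a positive value indicates a right-skewed distribution

print(mean_income, median_income, round(skewness, 2))
```

On the real data, the same check would be `df['Income'].skew()`.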


## Processing Columns

In [None]:
df[df['Experience'] < 0]['Experience'].count() # counting rows with negative Experience values

In [None]:
df1 = df[df.Experience < 0]
print(f"The unique Negative Experience Array= {df1['Experience'].unique()}")
df1['Age'].value_counts(ascending=True)  # number of negative-Experience rows at each Age

In [None]:
# Let's check the actual experience distribution for the ages above
# Negative experience values only occur for customers younger than 30
df2 = df[(df.Experience >= 0) & (df.Age < 30)]  # one combined mask instead of chained boolean indexing
df2.groupby('Age')['Experience'].agg(['mean', 'median'])

* We see that the mean and median Experience for the above ages are approximately equal.
* There is no row for 23-year-olds, suggesting that the data has only negative Experience values at that age.
* Replacing with either the mean or the median for the corresponding age will lose information for ages 23 & 24.
* Hence we will treat this as a data entry error and remove the negative sign from the values, making them positive.

In [None]:
# Vectorized fix: take the absolute value instead of looping with chained assignment
df['Experience'] = df['Experience'].abs()
(df.Experience < 0).value_counts()

* All negative experience values have been corrected.

## Missing Values :

In [None]:
df.isna().sum()

* There are no missing values in this dataset

## Exploratory Data Analysis:
### Univariate Analysis - Numerical Columns:

In [None]:
#Performing Univariate Analysis to study the central tendency and dispersion
#Plotting histogram to study distribution
Uni_num = df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(17,75))
for i in range(len(Uni_num)):     #creating a loop that will show the plots for the columns in one plot
    plt.subplot(18,3,i+1)
    sns.histplot(df[Uni_num[i]],kde=False)
    plt.tight_layout()
    plt.title(Uni_num[i],fontsize=25)

plt.show()

In [None]:
plt.figure(figsize=(15,35))
for i in range(len(Uni_num)):
    plt.subplot(10,3,i+1)
    sns.boxplot(x=df[Uni_num[i]], showmeans=True, color='yellow')  # pass the data by keyword; recent seaborn rejects positional vectors
    plt.tight_layout()
    plt.title(Uni_num[i],fontsize=25)

plt.show()

**Observations**:
* Age and Experience are almost normally distributed and look quite similar. This suggests a correlation between the two.
* The other three variables are skewed:
* Income:
    - Income shows the annual income of the customer and is right-skewed. The majority of customers have an income below 100K, but there are several observations at the higher end.

* Credit Card Average:
    - CCAvg has several outliers at the higher end and is heavily right-skewed. Almost 75% of customers have an average of less than 2.5 (in thousand dollars). This suggests that some customers have very high charges compared to the rest.

* Mortgage:
    - The Mortgage distribution is also heavily skewed. Almost 50% of customers don't have a mortgage, possibly because they don't own a home. To understand the distribution we will analyse Mortgage only for customers who have one.

In [None]:
df3= df[(df.Mortgage>0)]

In [None]:
fig,(ax_box,ax_hist) = plt.subplots(2,1,sharex=True ,
                                        figsize=(10,9),
                                        gridspec_kw = {"height_ratios": (.35, .65)})
sns.boxplot(x=df3.Mortgage, ax=ax_box, showmeans=True, color='orange')
sns.histplot(df3.Mortgage, ax=ax_hist,kde=True)

**Insights**:
* The distribution is again right-skewed, and the mean rises to around 183K dollars.
* There are again several outliers at the higher end. We suspect this could be due to the location of the homes, as higher land value could mean a higher mortgage. Over 75% of these customers have a Mortgage below 230K dollars.

### Univariate Analysis - Categorical Columns:

In [None]:
categorical_val = df.select_dtypes(exclude=np.number).columns.tolist()

In [None]:
plt.figure(figsize=(15,75))
for i in range(len(categorical_val)):     # loop over the columns so all plots share one figure
    plt.subplot(18,3,i+1)
    ax=sns.countplot(x=df[categorical_val[i]],palette='Spectral')
    plt.tight_layout()
    plt.title(categorical_val[i],fontsize=25)
    total = len (df[categorical_val[i]])
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total) # percentage of each class of the category
        x = p.get_x() + (p.get_width() / 2)-0.1  # horizontal position of the label
        y = p.get_y() + p.get_height()           # top of the bar
        ax.annotate(percentage, (x, y), size = 12.5,color='black') # annotate the bar with its percentage
plt.show()

**Observations**:
* 29.4% of customers are from region 94, followed by region 92 at 19.8%.
* 29.4% of customers have a family size of 1; the Family variable has four unique values.
* Education has three unique values, with 41.9% of customers at Undergrad level (1).
* Personal_Loan is the dependent variable and we see that it is heavily imbalanced. Only 9.6% of customers in the data accepted a loan in the previous campaign.
* 89.6% of customers don't have a Securities account, whereas 94% of customers don't have a CD account.
* We see that 59.7% of customers use the bank's online facilities and about 70.6% don't have a credit card issued by another bank.

### Correlation Matrix

In [None]:
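corr = df.select_dtypes(include=np.number).corr()  # correlate numeric columns only; category columns are excluded
plt.figure(figsize=(10,7))
sns.heatmap(corr,annot= True,vmin=-1,vmax=1, cmap='RdYlGn_r',linewidths=0.75)  # correlations range from -1 to 1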
plt.show()

**Observations**:
* Age and Experience have the highest correlation at 0.99. We suspect multicollinearity between these variables.
* Income and CCAvg have the next highest positive correlation at 0.65. This suggests that customers with higher income have higher credit card charges.
* Income and CCAvg also have a positive correlation with Mortgage.

In [None]:
sns.pairplot(data=df,hue='Personal_Loan')

**Observations:**
* The pair plot shows how the distributions of the variables differ between customers who took a loan and those who didn't.
* We see again that the distributions of Age and Experience are very similar. This could suggest possible multicollinearity.
* There are overlaps that make it difficult to tell who has a personal loan and who doesn't, hence we will analyse further with other plots.

## Bivariate Analysis

In [None]:
# For all numerical variables with Personal_Loan
plt.figure(figsize=(20,10))
for i, variable in enumerate(Uni_num):
    plt.subplot(3,2,i+1)
    sns.boxplot(x=df['Personal_Loan'],y=df[variable],palette="Dark2")
    plt.tight_layout()
    plt.title(variable)
plt.show()

**Observations**:
* The median Age is the same for both categories of Personal Loan.
* Similarly, the median Experience is almost equal for both categories of Personal Loan. Neither variable has any outliers.

* Customers who have Personal Loans also have a higher median **Income and CreditCard Average** than customers who don't have a loan. Interestingly, we see several outliers at the higher end for both these variables in Class **0**.

* The median Mortgage in both classes is 0.0 (in dollars). This is because the majority of customers don't have a mortgage. However, we see that customers with higher mortgages have Personal Loans. We also see several outliers at the high end for customers who don't have a loan.

* The above plots suggest a relationship between Income, CCAvg, Mortgage and loan uptake: customers with high values for these variables have taken loans. This makes them possible features for identifying customers to target.

In [None]:
#Stacked plot of categorical variables with Personal Loans
def stacked_plot(x):
    sns.set(palette='Accent')
    tab1 = pd.crosstab(x,df['Personal_Loan'],margins=True)  # raw counts with row and column totals
    print(tab1)
    print('-'*120)
    tab = pd.crosstab(x,df['Personal_Loan'],normalize='index')  # share of each loan class within a category
    tab.plot(kind='bar',stacked=True,figsize=(10,5))
    plt.legend(loc="upper left", bbox_to_anchor=(1,1))  # place the legend outside the axes
    plt.ylabel('Proportion')
    plt.show()
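
To make the two `pd.crosstab` calls in `stacked_plot` concrete, here is a small sketch on hypothetical toy data (the `group` column and its values are invented for illustration). `margins=True` adds totals to the counts, while `normalize='index'` turns each row into shares that sum to 1.

```python
import pandas as pd

# Hypothetical toy data: 'group' stands in for a categorical feature
toy = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B"],
    "Personal_Loan": [0, 0, 0, 1, 0, 1],
})

counts = pd.crosstab(toy["group"], toy["Personal_Loan"], margins=True)
shares = pd.crosstab(toy["group"], toy["Personal_Loan"], normalize="index")

print(counts)
print(shares)
```

The stacked bars are drawn from the row-normalized table, so bars of equal height do not imply equally sized groups; the printed counts are needed for that.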

In [None]:
stacked_plot(df.ZIPCode)

* All the sub-regions show fairly equal distribution among customers who purchased a loan

In [None]:
stacked_plot(df.Family)


- More Customers with larger family size(3&4) have Personal Loans. 

In [None]:
stacked_plot(df.Education)

* Customers with higher Education levels have taken Personal Loans.

In [None]:
stacked_plot(df.Securities_Account)

* The majority of customers don't have a Securities Account; 420 of them have Personal Loans.
* Of the remaining customers, who do have an account, only 60 have loans.

In [None]:
stacked_plot(df.CD_Account)

* Of the 302 customers who have a CD account, almost 50% have a Personal Loan.
* This suggests that customers who have a CD account are likely to buy loans, making it a possible feature for targeting.

In [None]:
stacked_plot(df.Online)

* 10% of customers in each class of the Online variable have purchased loans.

In [None]:
stacked_plot(df.CreditCard)

* More customers don't have a credit card with another bank than do.
* Again, 10% of customers in each CreditCard class have purchased loans.

## Multi-variate Analysis

In [None]:
#Income Vs Education Vs Personal_Loan
plt.figure(figsize=(15,7))
sns.boxplot(data=df,y='Income',x='Education',hue='Personal_Loan')
plt.show()

* As Education level increases, Mean Income also increases.
* Customers with Education level 2 and 3 who have personal loans have a much higher mean income than Education level 1 customers

In [None]:
plt.figure(figsize=(15,7))
sns.boxplot(data=df,y='Income',x='Family',hue='Personal_Loan')
plt.show()

* Income level among all Family groups is significantly higher for customers who have a Personal Loan. 

In [None]:
plt.figure(figsize=(15,7))
sns.boxplot(data=df,y='Mortgage',x='Family',hue='Personal_Loan')
plt.show()

* There are several outliers in Family sizes 1 and 2 for customers who don't have a Personal Loan, compared to the rest.
* We also see that as Family size increases, the Mortgage value rises among customers who have Personal Loans.

In [None]:
plt.figure(figsize=(15,7))
sns.boxplot(data=df,y='CCAvg',x='CreditCard',hue='Personal_Loan')
plt.show()

* Customers who have Personal Loans have a higher credit card average.
* There are several outliers among customers who don't have personal loans.

In [None]:
plt.figure(figsize=(15,7))
sns.scatterplot(data=df,y='Income',x='CCAvg',hue='Personal_Loan')
plt.show()

**Observations**:
* More Customers with higher income and CCAvg `>2.5(in thousand dollars)` have personal loans.

## Data Pre-Processing:

### Outliers Treatment:
* Income, CCAvg and Mortgage have very high outliers in the higher end and must be treated.
* Since we will also be creating a Decision Tree model(Decision Trees are not influenced by outliers) 
  we will make a copy of the dataset before proceeding with outlier treatment.

In [None]:
df1=df.copy() # new copy for Decision Tree model

In [None]:
# Lets treat outliers by flooring and capping
def treat_outliers(df,col):
   
    Q1=df[col].quantile(0.25) # 25th quantile
    Q3=df[col].quantile(0.75)  # 75th quantile
    IQR=Q3-Q1
    Lower_Whisker = Q1 - 1.5*IQR 
    Upper_Whisker = Q3 + 1.5*IQR
    df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker) # all the values smaller than Lower_Whisker will be assigned the value of Lower_Whisker
                                                            # and all the values above Upper_Whisker will be assigned the value of Upper_Whisker
    return df

def treat_outliers_all(df, col_list): # treat outliers in numerical column of Dataframe
    
    for c in col_list:
        df = treat_outliers(df,c)
        
        
    return df 

In [None]:
no_treatment = {'Age','Experience'} # These two variables don't have outliers
numerical_col = [ele for ele in Uni_num if ele not in no_treatment] 
#Applying outlier treatment
df = treat_outliers_all(df,numerical_col)

* All Outliers are treated
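
One way to confirm the treatment is to count the values that fall outside the whiskers before and after capping. The sketch below does this on hypothetical values; `count_outside_whiskers` is an illustrative helper, not part of the notebook.

```python
import pandas as pd

def count_outside_whiskers(series):
    """Count values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return int(((series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)).sum())

# Hypothetical values with one extreme observation
raw = pd.Series([10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 20.0, 200.0])

# Same flooring and capping rule as treat_outliers above
q1, q3 = raw.quantile(0.25), raw.quantile(0.75)
capped = raw.clip(q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1))

print(count_outside_whiskers(raw), count_outside_whiskers(capped))
```

Applied to the notebook's data, the same helper could be run on each column in `numerical_col` after `treat_outliers_all`.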

## Model Building 

### Model Evaluation Criterion 
#### Model can make two kinds of wrong predictions: 
1. Wrongly identifying customers as loan borrowers when they are not - False Positive
2. Wrongly identifying customers as non-borrowers when they actually buy loans - False Negative

* Since the bank wants to identify all potential customers who will purchase a loan, the number of False Negatives must be kept low.

#### How to reduce losses
* Recall is the performance metric that must be improved.
* The Recall score must be maximised: the greater the score, the lower the chance of missing potential customers.
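
A short piece of arithmetic shows why accuracy alone is a poor guide here. Using the figures from the EDA (5000 customers, 9.6% of whom accepted a loan), a hypothetical "model" that predicts class 0 for everyone still scores high accuracy while finding no borrowers at all.

```python
# Figures taken from the EDA above: 5000 customers, 9.6% accepted a loan
n_customers = 5000
n_positive = round(n_customers * 0.096)      # borrowers
n_negative = n_customers - n_positive        # non-borrowers

# Hypothetical baseline that predicts class 0 (no loan) for every customer
tp, fn = 0, n_positive
tn, fp = n_negative, 0

accuracy = (tp + tn) / n_customers
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
```

Any model worth using therefore has to beat this baseline on Recall, not just on accuracy.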


#### Creating a Confusion Matrix

In [None]:
#Defining a function for Confusion matrix
from sklearn.metrics import classification_report,confusion_matrix
sns.set(font_scale=2.0) # to set font size for the matrix
def make_confusion_matrix(y_actual,y_predict):
    '''
    y_predict: prediction of class
    y_actual : ground truth  
    '''
    cm=confusion_matrix(y_actual,y_predict)
    group_names = ['True -ve','False +ve','False -ve','True +ve']
    group_counts = ["{0:0.0f}".format(value) for value in
                cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in
                         cm.flatten()/np.sum(cm)]
    labels = [f"{v1}\n{v2}\n{v3}" for v1, v2,v3 in
              zip(group_names,group_counts,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)
    plt.figure(figsize = (10,7))
    sns.heatmap(cm, annot=labels,fmt='',cmap='Blues')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
#Importing all necessary libraries
from sklearn.model_selection import train_test_split
from sklearn import linear_model
import statsmodels.api as sm
from sklearn import metrics #accuracy,confusion metrics, etc
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
from statsmodels.stats.outliers_influence import variance_inflation_factor


In [None]:
## Defining X and Y variables
X = df.drop(['Personal_Loan'], axis=1) #dropping the dependent variable
Y = df[['Personal_Loan']]

#Convert categorical variables to dummy variables
X = pd.get_dummies(X, drop_first=True, dtype=int)  # integer dummies; recent pandas defaults to bool, which statsmodels may reject
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.30,random_state=29) # 70% train set and 30% test set


### Logistic Regression (with Sklearn library)

In [None]:
logreg = LogisticRegression(solver='saga',max_iter=1000,penalty=None,verbose=True,n_jobs=1,random_state=29)

# There are several optimizers; we are using the one called 'saga' with max_iter equal to 1000
# max_iter is the maximum number of iterations allowed for the solver to converge
# penalty=None fits an unregularized model (scikit-learn >= 1.2; older releases spell it penalty='none')

logreg.fit(X_train, y_train)
pred_train = logreg.predict(X_train)
pred_test = logreg.predict(X_test)

#Checking the Accuracy of the model:
print('\nAccuracy on train data:%.6f'%accuracy_score(y_train, pred_train) )
print('Accuracy on test data:%.6f' %accuracy_score(y_test, pred_test))
#checking the Recall metrics of the model:
print('\nRecall on train data:%.6f'%recall_score(y_train, pred_train) )
print('Recall on test data:%.6f'%recall_score(y_test, pred_test))
#checking the Precision metrics of model:
print("\nPrecision on training set : ",precision_score(y_train, pred_train))
print("Precision on test set : ",precision_score(y_test, pred_test))

print("\nF1 Score on training set : ",f1_score(y_train, pred_train))
print("F1 Score on test set : ",f1_score(y_test, pred_test))

In [None]:
make_confusion_matrix(y_test,pred_test) #display confusion matrix for test set

**Observations:**

* The Logistic Regression model has good accuracy but poor Recall.
* This could be due to multicollinearity and insignificant variables in the model.
* To check this, we will build a model using the statsmodels library.

### Logistic Regression Using Stats Model:
* Using statsmodels in Python, we get a list of statistical results for each estimator.
* statsmodels can also be used to conduct further tests and statistical data exploration.

In [None]:
# adding constant to training and test set
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

In [None]:
#Defining a function to print all the performance metric scores
def metrics_score(model,train,test,train_y,test_y):
    '''
    Function to calculate different metric scores of the model - Accuracy, Recall, Precision, and F1 score
    model: classifier to predict values of X
    train, test: Independent features
    train_y,test_y: Dependent variable '''
     
    pred = model.predict(train)
    pred_train = list(map(round,pred))
    pred1 = model.predict(test)
    pred_test = list(map(round,pred1))
   
    print("Accuracy on training set : ",accuracy_score(train_y,pred_train))
    print("Accuracy on test set : ",accuracy_score(test_y,pred_test))
    print("Recall on training set : ",recall_score(train_y,pred_train))
    print("Recall on test set : ",recall_score(test_y,pred_test))
    print("Precision on training set : ",precision_score(train_y,pred_train))
    print("Precision on test set : ",precision_score(test_y,pred_test))
    print("F1 on training set : ",f1_score(train_y,pred_train))
    print("F1 on test set : ",f1_score(test_y,pred_test))
        
  

In [None]:
logit = sm.Logit(y_train, X_train) #logistic regression
lg = logit.fit(warn_convergence =False) 
#Checking model performance 
metrics_score(lg,X_train,X_test,y_train,y_test)

**Observations**:
* The Accuracy on the test set is 0.96, which looks good.
* But the Recall on the test set is only 0.68.
* We must further analyse this model and check if the performance can be improved.

In [None]:
cm_pred = lg.predict(X_test)
pred_test = list(map(round,cm_pred))
make_confusion_matrix(y_test,pred_test)

* The True Positive share (i.e. correctly predicted customers who will purchase a loan) is only 6.2%.
* The False Negative share is 2.93%. We have to check if we can bring this lower.

## Checking for Multicollinearity using VIF Scores:
* Multicollinearity occurs when there is correlation between the predictor variables.
* Variance Inflation factor: Variance inflation factors measure the inflation in the variances of the regression coefficients estimates due to collinearities that exist among the predictors.
* If VIF is 1 then there is no correlation among the predictor variables. Whereas if VIF exceeds 5, we say there is moderate multi-collinearity and if it is 10 or exceeding 10, it shows signs of high multi-collinearity. 
* Alternatively we can check the significance of a variable to the model with the P-value
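
The definition above amounts to VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The sketch below computes this by hand with NumPy on hypothetical data in which `experience` is almost a linear function of `age`. The function `vif` is an illustrative stand-in that adds its own intercept; it is not the statsmodels routine used in this notebook.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()   # R^2 of the auxiliary regression
    return 1.0 / (1.0 - r2)

# Hypothetical data: experience tracks age closely, income is unrelated
rng = np.random.default_rng(0)
age = rng.uniform(23, 67, size=500)
experience = age - 25 + rng.normal(0, 1, size=500)
income = rng.uniform(8, 224, size=500)
X = np.column_stack([age, experience, income])

vifs = [vif(X, j) for j in range(X.shape[1])]
print(np.round(vifs, 2))
```

The two collinear columns get very large VIFs while the unrelated column stays close to 1, which is the pattern to look for in the real output.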

In [None]:
#checking the VIF scores for X_train set
vif_series1 = pd.Series([variance_inflation_factor(X_train.values,i) for i in range(X_train.shape[1])],index=X_train.columns)
print('Series before feature selection: \n\n{}\n'.format(vif_series1))

**Observations:**
* Age and Experience have the highest VIF values.
* We already suspected that these variables might have multicollinearity which is proven true with the above values.
* We will remove Experience column to remove multi-collinearity

In [None]:
X_train1 = X_train.drop('Experience', axis=1)
X_test1 = X_test.drop('Experience', axis=1)
vif_series2 = pd.Series([variance_inflation_factor(X_train1.values,i) for i in range(X_train1.shape[1])],index=X_train1.columns)
print('Series after dropping Experience: \n\n{}\n'.format(vif_series2))

* The VIF scores for all the variables are less than 5, so there is no remaining multicollinearity in the model.
* Let's check the model performance

In [None]:
logit1=sm.Logit(y_train,X_train1)
lg1=logit1.fit()
metrics_score(lg1,X_train1,X_test1,y_train,y_test)

**Observations**:
* The Accuracy and Recall values for the test set show a slight increase.


**Variable Significance:**
- The p-value of a predictor variable indicates whether it is significant or not.
- The level of significance is 0.05; any variable with a p-value less than 0.05 is considered significant.

In [None]:
print(lg1.summary())


**Insights**
- In the above model the following variables have p-value > 0.05:
    - Age, Mortgage, Family_2, Securities_Account_1 and all the dummy variables of ZIPCode.
- We know that Mortgage has a positive correlation with Personal Loan, and Family_2, despite its high p-value, cannot be removed because it is part of the Family categorical variable.
- Let's remove all the dummy variables of ZIPCode and check model performance.

In [None]:
#dropping all dummy variables of ZIPCode
X_train2 = X_train1.drop(['ZIPCode_91','ZIPCode_92','ZIPCode_93','ZIPCode_94','ZIPCode_95','ZIPCode_96'], axis=1)
X_test2 = X_test1.drop(['ZIPCode_91','ZIPCode_92','ZIPCode_93','ZIPCode_94','ZIPCode_95','ZIPCode_96'], axis=1)

In [None]:
logit2=sm.Logit(y_train,X_train2)
lg2=logit2.fit()
#print(lg2.summary())

#Lets look at model performance 
metrics_score(lg2,X_train2,X_test2,y_train,y_test)

* The Recall for test set has dropped to 0.686
* Next let's drop Age variable and check model performance 

In [None]:
#Let's drop Age 
X_train3 = X_train2.drop(['Age'],axis=1)
X_test3 = X_test2.drop(['Age'],axis=1)
logit3=sm.Logit(y_train,X_train3)
lg3=logit3.fit()

metrics_score(lg3,X_train3,X_test3,y_train,y_test)

* The Recall for test set did not change.
* Next let's drop Mortgage variable for this model and check performance again.
* Even though Mortgage has a positive coefficient, it is comparatively small next to the rest of the variables.

In [None]:
#Let's drop Mortgage 
X_train4 = X_train3.drop(['Mortgage'],axis=1)
X_test4 = X_test3.drop(['Mortgage'],axis=1)
logit4=sm.Logit(y_train,X_train4)
lg4=logit4.fit()
metrics_score(lg4,X_train4,X_test4,y_train,y_test)

* The Recall for test set improved to 0.6934
* Let's check the model summary

In [None]:
print(lg4.summary())

* There are no more insignificant variables 

**Hence, we will use lg4 as the final model**

## Observations from Model:

### Coefficient Interpretations:

* Income, CCAvg, Family_3, Family_4, both Education dummies and CD_Account_1 have positive coefficients, which indicates that an increase in their values increases the probability of a customer having a Personal Loan.


* Family_2, Securities_Account_1, Online_1 and CreditCard_1 have negative coefficients, which indicates that an increase in their values decreases the probability of a customer having a Personal Loan.


### Converting Coefficients to odds: 
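* In a Logistic Regression model, each coefficient is the change in the log of odds for a one-unit increase in that variable.
* We will calculate the odds ratio to quantify the strength of the association between the dependent and independent variables.
* Applying the probability formula below to a single odds ratio gives a 0-1 score for ranking effects; a customer's actual predicted probability also depends on the intercept and all the other variables.

**Odds ratio =  Exp(coef)**

**Probability  = odds/(1+odds)**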
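
As a worked example of the two formulas, the sketch below takes a hypothetical coefficient (0.7, not a value from the fitted model) and shows how the odds ratio moves a customer's probability. The baseline probability of 10% is also an assumption chosen for illustration.

```python
import math

# Hypothetical coefficient, not taken from the fitted model
coef = 0.7

odds_ratio = math.exp(coef)            # multiplicative change in odds per one-unit increase

# Effect on a customer whose baseline probability is 10% (also hypothetical)
p_before = 0.10
odds_before = p_before / (1 - p_before)
odds_after = odds_before * odds_ratio
p_after = odds_after / (1 + odds_after)

print(f"odds ratio={odds_ratio:.3f}, probability {p_before:.2f} -> {p_after:.3f}")
```

The odds roughly double, but the resulting probability depends on where the customer started, which is why an odds ratio on its own is best read as a measure of effect strength.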

In [None]:
#Calculate Odds Ratio, probability
##create a data frame to collate Odds ratio, probability and p-value of the coef
lgcoef = pd.DataFrame(lg4.params, columns=['coef']) # getting the coefficients from the lg4 model
lgcoef.loc[:, "Odds_ratio"] = np.exp(lgcoef.coef) #calculate the odds ratio

lgcoef['probability'] = lgcoef['Odds_ratio']/(1+lgcoef['Odds_ratio']) #calculate the probability 
lgcoef['pval']=lg4.pvalues  # p-values must come from the same model (lg4) as the coefficients
pd.options.display.float_format = '{:.2f}'.format

In [None]:
# Filter by a strict p-value cutoff (pval <= 0.005) and sort descending by Odds ratio
lgcoef = lgcoef.sort_values(by="Odds_ratio", ascending=False)
pval_filter = lgcoef['pval']<=0.005
lgcoef[pval_filter]

**Observations**:
* Customers with a higher Education level, i.e. Graduate and Post-Graduate, and customers with a CD Account have a 98% probability of having personal loans.
* Customers with a larger family size of 3 and 4 have higher probabilities, 91% and 82% respectively, of having a personal loan.
* Other significant variables with moderate to high probability are Income, use of the bank's online services and holding additional credit cards from other banks.


### Identifying Key Variables:
* The model indicates the following key variables to have a strong relationship with the dependent variable Personal_Loan
    - Education 
    - CD_Account 
    - Family
    - CCAvg
    - Income
    - Online and
    - CreditCard

**Confusion matrix Prediction on lg4 model Test Data**

In [None]:
pred1 = lg4.predict(X_test4)
pred_test1 = list(map(round,pred1))
make_confusion_matrix(y_test,pred_test1)

**Observations**:
- In the lg4 model:
    - True positive is 6.33% 
    - True Negative is 90.33%
    - False Positive is 0.53% 
    - False Negative is 2.8%
- We need to improve the True Positive and reduce False Negative values.
- Let's check for model improvement
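
The percentages above can be turned back into counts to recover the headline metrics. The sketch assumes the test set holds 1500 rows (30% of the 5000 customers, as set by `test_size=0.30`); the rounding step is part of that assumption.

```python
# Percentages reported above; test size assumed to be 30% of 5000 = 1500 rows
n_test = 1500
tp = round(0.0633 * n_test)
tn = round(0.9033 * n_test)
fp = round(0.0053 * n_test)
fn = round(0.0280 * n_test)

recall = tp / (tp + fn)        # share of actual borrowers that the model finds
precision = tp / (tp + fp)     # share of predicted borrowers that are real
accuracy = (tp + tn) / n_test

print(f"recall={recall:.4f}, precision={precision:.4f}, accuracy={accuracy:.4f}")
```

The recovered Recall agrees with the 0.6934 reported for lg4, and it makes the gap visible: precision is high, so the model's weakness is the borrowers it misses rather than the ones it wrongly flags.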

## Model Performance Improvement

###  AUC-ROC curve:
* This is a performance measurement for classification problems at various threshold settings

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

logit_roc_auc = roc_auc_score(y_test, lg4.predict(X_test4))
fpr, tpr, thresholds = roc_curve(y_test, lg4.predict(X_test4))
plt.figure(figsize=(13,8))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

**Optimal Threshold from AUC-ROC**

* The optimal threshold cut off will be where True Positive Rate is high and False Positive Rate is low
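
Stated as a rule, this means picking the threshold that maximizes TPR - FPR, a quantity known as Youden's J statistic. The sketch below applies it to hypothetical ROC points; the three arrays are invented for illustration and only mimic the shape of what `roc_curve` returns.

```python
import numpy as np

# Hypothetical ROC points, ordered by decreasing threshold
thresholds = np.array([0.90, 0.60, 0.30, 0.10, 0.02])
fpr = np.array([0.00, 0.02, 0.08, 0.30, 1.00])
tpr = np.array([0.20, 0.55, 0.85, 0.95, 1.00])

j_statistic = tpr - fpr                 # Youden's J at each threshold
best_idx = int(np.argmax(j_statistic))
best_threshold = thresholds[best_idx]

print(best_idx, best_threshold, j_statistic[best_idx])
```

Youden's J weights false positives and false negatives equally. Since missed borrowers are the costlier error in this campaign, a threshold somewhat below the J-optimal one may also be worth evaluating.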

In [None]:
# Optimal threshold as per AUC-ROC curve

optimal_idx = np.argmax(tpr - fpr)
optimal = thresholds[optimal_idx]
print(optimal)

In [None]:
#Applying the optimal threshold to predict model for test data
y_pred_train = (lg4.predict(X_train4)>optimal).astype(int)
y_pred_test = (lg4.predict(X_test4)>optimal).astype(int)

In [None]:
#Confusion matrix for test set for lg4 model
make_confusion_matrix(y_test,y_pred_test)

In [None]:
print("Accuracy on training set : ",accuracy_score(y_train,y_pred_train))
print("Accuracy on test set : ",accuracy_score(y_test,y_pred_test))
print("\nRecall on training set : ",recall_score(y_train,y_pred_train))
print("Recall on test set : ",recall_score(y_test,y_pred_test))
print("\nPrecision on training set : ",precision_score(y_train,y_pred_train))
print("Precision on test set : ",precision_score(y_test, y_pred_test))

print("\nF1 Score on training set : ",f1_score(y_train,y_pred_train))
print("F1 Score on test set : ",f1_score(y_test, y_pred_test))

**Observations**

* At the optimal threshold, the Accuracy on the test set has reduced to 0.94.
* But the Recall score for the test set has risen significantly to 0.8321.


### Precision-Recall Curve

* This curve will plot the precision and Recall values for the lg4 model.

In [None]:
from sklearn.metrics import precision_recall_curve
y_PresRec=lg4.predict(X_train4)
prec, rec, tre = precision_recall_curve(y_train, y_PresRec)

def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], 'b--', label='precision')
    plt.plot(thresholds, recalls[:-1], 'g--', label = 'recall')
    plt.xlabel('Threshold')
    plt.legend(loc='upper left')
    plt.ylim([0,1])
plt.figure(figsize=(10,7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()

In [None]:
#Applying the optimal threshold to predict model for test data
optimal_threshold = 0.25 # we get a balanced recall and precision at this threshold

y_pred_train1 = (lg4.predict(X_train4)>optimal_threshold).astype(int)
y_pred_test1 = (lg4.predict(X_test4)>optimal_threshold).astype(int)

In [None]:
#Confusion matrix for test set for lg4 model
make_confusion_matrix(y_test,y_pred_test1)

In [None]:
print("Accuracy on training set : ",accuracy_score(y_train,y_pred_train1))
print("Accuracy on test set : ",accuracy_score(y_test,y_pred_test1))
print("\nRecall on training set : ",recall_score(y_train,y_pred_train1))
print("Recall on test set : ",recall_score(y_test,y_pred_test1))
print("\nPrecision on training set : ",precision_score(y_train,y_pred_train1))
print("Precision on test set : ",precision_score(y_test, y_pred_test1))

print("\nF1 Score on training set : ",f1_score(y_train,y_pred_train1))
print("F1 Score on test set : ",f1_score(y_test, y_pred_test1))


**Observations**
* At the 0.25 threshold, Recall has dropped to 0.810 for the test set.
* But the Precision value has increased and Accuracy remains the same.
* Since Precision is not the defining metric, the threshold from the AUC-ROC curve gives the better model.
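
The trade-off described above can be reproduced on a small scale. The sketch below computes precision and recall at two thresholds for a handful of hypothetical labels and predicted probabilities (the function name and data are illustrative, not part of the notebook); raising the threshold trades recall for precision.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores above `threshold` are predicted as 1."""
    y_pred = [int(s > threshold) for s in scores]
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels and predicted probabilities (illustration only)
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.90]

results = {t: precision_recall_at(y_true, scores, t) for t in (0.25, 0.50)}
for t, (p, r) in results.items():
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```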

# Sequential Feature Selector method:
* This method begins with an empty model and, at each forward step, adds the one variable that gives the maximum improvement to the model.
* The aim of this method is to discard deceptive features and also to speed up the training and testing process.
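
The greedy idea behind forward selection can be sketched without any library. In the sketch below, `toy_score` and the `GAIN` values are hypothetical stand-ins for the cross-validated recall that the selector would compute; a real score is usually not additive like this one.

```python
def forward_select(candidates, score_fn, k):
    """Greedily add the feature that improves score_fn the most, k times."""
    selected, history = [], []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        # Try each remaining feature on top of the current selection
        best = max(remaining, key=lambda f: score_fn(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        history.append((tuple(selected), score_fn(selected)))
    return selected, history

# Hypothetical per-feature gains standing in for cross-validated recall
GAIN = {"Income": 0.40, "Education": 0.20, "CCAvg": 0.10, "Age": 0.01}

def toy_score(features):
    return sum(GAIN[f] for f in features)

selected, history = forward_select(GAIN, toy_score, k=3)
for subset, score in history:
    print(subset, round(score, 2))
```

Each step refits and rescores one candidate model per remaining feature, which is why the search becomes slow when the feature count is large.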

In [None]:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt 

In [None]:
## Defining X and Y variables
X = df.drop(['Personal_Loan'], axis=1)
Y = df[['Personal_Loan']]

#Convert categorical variables to dummy variables
X = pd.get_dummies(X, drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.30)
# Define the model; it is fitted later inside the feature selector
m = LogisticRegression(solver='newton-cg',n_jobs=-1,random_state=0)

In [None]:
# We will first build the model with all the features (k_features=19)
sfs = SFS(m, k_features=19, forward=True, floating=False, scoring='recall', verbose=2, cv=5)
sfs = sfs.fit(X_train, y_train)
fig = plot_sfs(sfs.get_metric_dict(),kind='std_dev')
plt.ylim([0, 1])
plt.title('Sequential Forward Selection (w. StdDev)')
plt.grid()
plt.show()

* Since the recall value stops rising after the 10th feature, we will proceed with only the best 10 features.
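
Reading the plateau off the plot can also be done programmatically. The sketch below picks the smallest number of features whose score is within a tolerance of the best score; the scores and the tolerance are hypothetical, and with mlxtend the real values could be read from `sfs.get_metric_dict()`.

```python
def smallest_k_within(scores_by_k, tolerance=0.01):
    """Smallest feature count whose score is within `tolerance` of the best."""
    best = max(scores_by_k.values())
    return min(k for k, s in scores_by_k.items() if best - s <= tolerance)

# Hypothetical mean CV recall per number of selected features
scores_by_k = {1: 0.30, 2: 0.42, 3: 0.50, 4: 0.55, 5: 0.555, 6: 0.556}

k = smallest_k_within(scores_by_k, tolerance=0.01)
print(k)
```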

In [None]:
sfs1 = SFS(m,k_features=10, forward=True, floating=False, scoring='recall', verbose=2, cv=5)

sfs1 = sfs1.fit(X_train, y_train)
fig1 = plot_sfs(sfs1.get_metric_dict(),kind='std_dev')
plt.ylim([0, 1])
plt.title('Sequential Forward Selection (w. StdDev)')
plt.grid()
plt.show()


In [None]:
#Which are the important features?
feat_cols = list(sfs1.k_feature_idx_)
print(feat_cols)

In [None]:
#Looking at the column names
X_train.columns[feat_cols]

In [None]:
#Creating new X_train and X_test with the selected columns
X_train_final = X_train[X_train.columns[feat_cols]]
X_test_final = X_test[X_train_final.columns]

In [None]:
#Fitting logistic regression model
logreg1 = LogisticRegression(solver='saga',max_iter=1000,penalty='none',verbose=True,n_jobs=1,random_state=29)
logreg1.fit(X_train_final, y_train)

In [None]:
#Lets check the model performance
metrics_score(logreg1,X_train_final,X_test_final,y_train,y_test)

In [None]:
pred_test2 = logreg1.predict(X_test_final)

print("confusion matrix = \n")
make_confusion_matrix(y_test,pred_test2)

In [None]:
#AUC-ROC Curve

SFS_roc_auc = roc_auc_score(y_test, logreg1.predict_proba(X_test_final)[:,1])
fpr, tpr, thresholds = roc_curve(y_test, logreg1.predict_proba(X_test_final)[:,1])
plt.figure(figsize=(13,8))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % SFS_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

In [None]:
optimal_idx1 = np.argmax(tpr - fpr)
optimal_threshold1 = thresholds[optimal_idx1]
print(optimal_threshold1)

In [None]:
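# predict() returns hard 0/1 labels, so threshold the class-1 probabilities instead,
# using the threshold computed for this model (optimal_threshold1)
y_pred_trn = (logreg1.predict_proba(X_train_final)[:,1]>optimal_threshold1).astype(int)
y_pred_tst = (logreg1.predict_proba(X_test_final)[:,1]>optimal_threshold1).astype(int)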

In [None]:
# Let us make the confusion matrix after the optimal threshold has been chosen
make_confusion_matrix(y_test,y_pred_tst)

In [None]:
print('Accuracy on train data:',accuracy_score(y_train, y_pred_trn) )
print('Accuracy on test data:',accuracy_score(y_test, y_pred_tst))

print('\nRecall on train data:',recall_score(y_train, y_pred_trn) )
print('Recall on test data:',recall_score(y_test, y_pred_tst))


**Observations**:
* The Accuracy is at 0.94 for the test set but the Recall is only 0.513 
* The recall value is only slightly better than the Sklearn logistic regression model

## Model Building - Decision Tree:
### Approach
1. Data preparation
2. Partition the data into train and test set.
3. Build a CART model on the train data.
4. Tune the model and prune the tree, if required.
5. Test the model on the test set.

In [None]:
df1.info()

### Split Data

In [None]:
X= df1.drop(['Personal_Loan'],axis=1)
y=df1['Personal_Loan']

In [None]:
# encoding the categorical variables
X = pd.get_dummies(X, drop_first=True)
# Splitting data into training and test set:
X_train,X_test, y_train, y_test =train_test_split(X,y, test_size=0.3, random_state=29)
print(X_train.shape,X_test.shape)

## Model Building
* We will build the Decision Tree model using the default 'gini' criterion to split.
* In our dataset, we know that there is an imbalance in the dependent variable Personal_Loan, i.e. 90.4% of the records are 0 and 9.6% are 1.
* This might cause the Decision Tree model to become biased towards the dominant class.
* Hence we will add the class_weight hyperparameter as a dictionary {0:0.15,1:0.85} to specify the weight of each class, so that the decision tree gives more weight to class 1.
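
To see what the class weights do, it helps to compute the Gini impurity of a node with and without them. The sketch below assumes, as scikit-learn does, that class weights act as per-sample weights when class proportions are computed; the node counts are hypothetical and simply mirror the 90.4% / 9.6% split.

```python
def gini(counts, class_weight=None):
    """Gini impurity of a node, optionally weighting each class's samples."""
    class_weight = class_weight or {c: 1.0 for c in counts}
    mass = {c: n * class_weight[c] for c, n in counts.items()}
    total = sum(mass.values())
    return 1.0 - sum((m / total) ** 2 for m in mass.values())

# Hypothetical root node: 904 non-buyers and 96 buyers
counts = {0: 904, 1: 96}

plain = gini(counts)
weighted = gini(counts, {0: 0.15, 1: 0.85})
print(round(plain, 4), round(weighted, 4))
```

With the weights applied, the minority class carries a much larger share of the node, so splits that separate buyers from non-buyers reduce the impurity by more.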

In [None]:
from sklearn import tree
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
Dt = DecisionTreeClassifier(criterion='gini',class_weight={0:0.15,1:0.85},random_state=29)

In [None]:
Dt.fit(X_train,y_train)

In [None]:
y_predict = Dt.predict(X_test)
make_confusion_matrix(y_test,y_predict)

In [None]:
y_train.value_counts(normalize=True) # class proportions in the training set

**Observations**:
* True Positive - 8.20%
* False Positive - 0.27%
* False Negative - 0.93%
* True Negative - 90.6%

* We also see that only 9.8% of the training set belongs to class '1'.

## Model Evaluation Criteria - Recall
* As discussed earlier, the Bank wants to predict and identify all potential customers who will buy a Personal loan.
* We want to keep False Negatives, i.e. customers wrongly identified as non-buyers who actually purchase a loan, as low as possible.

* Hence Recall is the metric to be used
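
A short illustration of why accuracy alone is misleading here: with roughly 90% non-buyers, a model that never predicts a buyer still scores high accuracy while finding no buyers at all. The sketch uses 1,000 hypothetical customers in the 90.4% / 9.6% proportion noted earlier.

```python
def accuracy_and_recall(y_true, y_pred):
    """Accuracy and recall (TP / (TP + FN)) for binary 0/1 labels."""
    correct = sum(1 for y, p in zip(y_true, y_pred) if y == p)
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return correct / len(y_true), recall

# Hypothetical customers: 904 non-buyers, 96 buyers
y_true = [0] * 904 + [1] * 96
# A trivial model that predicts "no loan" for everyone
y_pred = [0] * 1000

acc, rec = accuracy_and_recall(y_true, y_pred)
print(acc, rec)
```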


In [None]:
def scores(model):
    """ model : fitted classifier; prints metrics on the global X_train/X_test and y_train/y_test """
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    print("Accuracy on training set : ",metrics.accuracy_score(y_train,y_pred_train))
    print("Accuracy on test set : ",metrics.accuracy_score(y_test,y_pred_test))

    print("\nRecall on training set : ",metrics.recall_score(y_train,y_pred_train))
    print("Recall on test set : ",metrics.recall_score(y_test,y_pred_test))
    
    print("\nPrecision on training set : ",metrics.precision_score(y_train,y_pred_train))
    print("Precision on test set : ",metrics.precision_score(y_test,y_pred_test))
    
    print("\nF1 on training set : ",metrics.f1_score(y_train,y_pred_train))
    print("F1 on test set : ",metrics.f1_score(y_test,y_pred_test))

In [None]:
#Let's calculate the Accuracy and Recall Score of the model
scores(Dt)

* The Accuracy values for both Train and Test set are very close.
* But there is huge difference in the Recall Scores for train and test set. 
* This suggests that the model is overfitting.

## Visualizing the Decision Tree

In [None]:
column_names = list(X.columns)
print(column_names)

In [None]:
plt.figure(figsize=(20,30))

out = tree.plot_tree(Dt,feature_names=column_names,filled=True,fontsize=8,node_ids=True,class_names=True)
for o in out:
     arrow = o.arrow_patch
     if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()

In [None]:
# Text report showing the rules of a decision tree -

print(tree.export_text(Dt,feature_names=column_names,show_weights=True))

**Observations**:
* There are close to 100 nodes in the tree, the smallest sample size is 2 and the Gini value for the last node is 0.0.
* This is surely an overfitted Decision Tree model.
* Let's check the important features in the tree. This is also called the Gini importance.

In [None]:
importance = Dt.feature_importances_
indices = np.argsort(importance)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importance[indices], color='green', align='center')
plt.yticks(range(len(indices)), [column_names[i] for i in indices])
plt.xlabel('Importance Value')
plt.show()

**Observations**:
* The top five important features are:
    - Income
    - Education_2
    - Family_4
    - CCAvg
    - Family_3
    
* The above tree is complex to interpret.
* Since overfitting is suspected, we must prune the tree to reduce it and get better model performance.


## Reduce Over-Fitting:


### GridSearch for Hyperparameter tuning of Tree Model
* Hyperparameters are settings that control the structure of the Decision Tree, such as its maximum depth.
* As there is no direct way to calculate the effect that a change in a hyperparameter has on the model, we will use a GridSearch.
* This is a tuning technique that searches for the optimum values of specific hyperparameters of the model.
* The parameters are optimized using a cross-validated GridSearch over a parameter grid.
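
To get a feel for the cost of a grid search, the combinations can be enumerated by hand. The sketch below uses a grid of 10 depths, 1 criterion, 2 splitters and 2 `max_features` options with 5-fold cross-validation, and counts the model fits the search needs.

```python
from itertools import product

param_grid = {
    "max_depth": list(range(1, 11)),
    "criterion": ["gini"],
    "splitter": ["best", "random"],
    "max_features": ["log2", "sqrt"],
}

# Every combination of the listed values is one candidate model
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]

cv_folds = 5
n_fits = len(combos) * cv_folds
print(len(combos), "candidates,", n_fits, "fits during the search")
print(combos[0])
```

Each extra hyperparameter multiplies the number of candidates, so the grid is best kept to the values that are expected to matter.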


In [None]:
from sklearn.model_selection import GridSearchCV
# Choose the type of classifier. 
classifier = DecisionTreeClassifier(random_state=29,class_weight = {0:.15,1:.85}) #adding classweights 

#Defining the Hyperparameters

parameters = {'max_depth': np.arange(1,11), 
            'criterion': ['gini'],
            'splitter': ['best','random'],
            'max_features': ['log2','sqrt']}

# Type of scoring used to compare parameter combinations
recall_scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search with the above parameters
grid_obj = GridSearchCV(classifier, parameters, scoring=recall_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set to the best combination of parameters
classifier = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
classifier.fit(X_train, y_train)

In [None]:
pred_test2 = classifier.predict(X_test)
make_confusion_matrix(y_test,pred_test2)

In [None]:
scores(classifier)

**Observations**:
* The Recall for test set has improved to 0.876 after the hyperparameter tuning.

### Visualizing the Tree

In [None]:
plt.figure(figsize=(20,30))

out = tree.plot_tree(classifier,feature_names=column_names,filled=True,fontsize=11,node_ids=True,class_names=True)
for o in out:
     arrow = o.arrow_patch
     if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()

In [None]:
# Text report showing the rules of a decision tree -

print(tree.export_text(classifier,feature_names=column_names,show_weights=True))

In [None]:
importances = classifier.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='green', align='center')
plt.yticks(range(len(indices)), [column_names[i] for i in indices])
plt.xlabel('Importance Value')
plt.show()

**Observations**:
* The order of importance of the features has changed.
* The importance value for Income has increased and CCAvg is now the second most important feature.

## Cost Complexity Pruning
* This is another method to reduce and control the size of the tree. It is called post-pruning.
* Here, we use the cost complexity parameter `ccp_alpha` to prune the tree.
* Nodes are removed based on the alpha value. The greater the `ccp_alpha` value, the more nodes are pruned and the higher the total impurity of the leaves.
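
Cost complexity pruning scores a tree T as R(T) + alpha * (number of leaves), where R(T) is the total impurity of the leaves. For each internal node, the "effective alpha" is the value at which collapsing its subtree into a single leaf breaks even, and the node with the smallest effective alpha is pruned first. The sketch below applies that formula to two hypothetical nodes; the impurity numbers are made up for illustration.

```python
def effective_alpha(node_impurity, subtree_impurity, n_leaves):
    """Alpha at which replacing a subtree by a single leaf breaks even.

    node_impurity    : impurity R(t) if the node were a leaf
    subtree_impurity : total leaf impurity R(T_t) of the subtree below it
    n_leaves         : number of leaves in that subtree (must be > 1)
    """
    return (node_impurity - subtree_impurity) / (n_leaves - 1)

# Hypothetical internal nodes: (R(t), R(T_t), number of leaves)
subtrees = {
    "node_A": (0.020, 0.008, 4),
    "node_B": (0.050, 0.010, 3),
}

alphas = {name: effective_alpha(*vals) for name, vals in subtrees.items()}
weakest = min(alphas, key=alphas.get)
print(alphas, "-> prune first:", weakest)
```

Here `node_A` buys only a small impurity reduction per extra leaf, so it is removed at a lower alpha than `node_B`.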

**Finding the `ccp_alpha` values**

In [None]:
ccp = DecisionTreeClassifier(random_state=29,class_weight = {0:0.15,1:0.85})
ccp.fit(X_train,y_train)
path = ccp.cost_complexity_pruning_path(X_train, y_train) #finding the alpha and impurity values
ccp_alphas, impurities = path.ccp_alphas, path.impurities

In [None]:
pd.DataFrame(path)
 #display as a dataframe

In [None]:
#plotting alpha vs impurities
fig, ax = plt.subplots(figsize=(15,5))
ax.plot(ccp_alphas, impurities, marker='o', drawstyle="steps-post")
ax.set_xlabel("Alpha")
ax.set_ylabel("Total impurity of leaves")
ax.set_title("Total Impurity vs Alpha for training set")
plt.show()

* The impurity value increases until alpha ~ 0.03 and then remains constant until alpha ~ 0.22 before rising sharply.

In [None]:
# Finding the number of nodes in the last tree and the corresponding alpha value
clfs = [] #creating an empty list
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=29, ccp_alpha=ccp_alpha,class_weight = {0:0.15,1:0.85})
    clf.fit(X_train, y_train) #apply classifier model with alpha values  
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
      clfs[-1].tree_.node_count, ccp_alphas[-1])) #finding the last node and its corresponding alpha

**Let's plot the Recall Vs Alpha values for both Train and Test set**

In [None]:
#Creating empty lists for train and test recall
recall_train=[]
recall_test=[]

In [None]:
#run a loop to append the train and test recall scores at each alpha value
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=29, ccp_alpha=ccp_alpha,class_weight = {0:0.15,1:0.85})
    clf.fit(X_train, y_train)
    y_pred_train1 = clf.predict(X_train)
    y_pred_test1 = clf.predict(X_test)
    values_train = metrics.recall_score(y_train,y_pred_train1)
    values_test= metrics.recall_score(y_test,y_pred_test1)
    recall_train.append(values_train)
    recall_test.append(values_test)

In [None]:
#plot the recall VS alpha 
fig, ax = plt.subplots(figsize=(9,10))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker='o', label="train",
        drawstyle="steps-post",)
ax.plot(ccp_alphas, recall_test, marker='o', label="test",
        drawstyle="steps-post")
ax.legend(loc='lower left')
plt.show()

In [None]:
#Let's find the best alpha threshold for max recall
index_best_alpha = np.argmax(recall_test)
best_model = clfs[index_best_alpha]
print(best_model)

**The maximum Recall value is at alpha 0.0042. But at this alpha we will lose valuable business information and the decision tree might have very few nodes.** 

**Hence we will use the point where the Recall value first begins to drop, at alpha = 0.003.
This will ensure we retain information and still get a high recall value.**

In [None]:
#at alpha = 0.003
best_model2 = DecisionTreeClassifier(ccp_alpha=0.003,
                       class_weight={0: 0.15, 1: 0.85}, random_state=29)
best_model2.fit(X_train, y_train)


In [None]:
pred_test3=best_model2.predict(X_test)
make_confusion_matrix(y_test,pred_test3)

**Observations**:
* The True positive has increased to 8.53% and the False negative has decreased to 0.6%.

In [None]:
scores(best_model2)

**Observations**:

* The overall recall has increased from the initial model and it is also higher than that of the hyperparameter-tuned model.
* The recall for the train (0.9417) and test (0.9343) sets is close and comparable.


### Visualizing Decision Tree for best_model2

In [None]:
plt.figure(figsize=(10,10))

out = tree.plot_tree(best_model2,feature_names=column_names,filled=True,fontsize=9,node_ids=True,class_names=True)
for o in out:
     arrow = o.arrow_patch
     if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()

In [None]:
# Text report showing the rules of a decision tree -

print(tree.export_text(best_model2,feature_names=column_names,show_weights=True))

In [None]:
importances2 = best_model2.feature_importances_
indices = np.argsort(importances2)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances2[indices], color='green', align='center')
plt.yticks(range(len(indices)), [column_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

* Income is the most important feature to predict if the customer bought a personal loan. 
* Education_2, Family_4 and CCAvg are the next most important predictor features.

## Comparison of all Models for Personal_Loan prediction

In [None]:
All_models = {
    'Model': ['Logistic Regression Model-sklearn',
              'Logistic Regression-Statsmodel-multicollinearity removed',
              'Logistic Regression-Optimal Threshold =0.2017',
              'Logistic Regression-Optimal Threshold =0.25',
              'Sequential Feature Selection Method',
              'Initial Decision Tree',
              'Decision tree- hyperparameter tuning(pre-pruning)',
              'Decision tree- Cost Complexity post-pruning'],
    'Train_Accuracy': [0.9380,0.9640,0.9497,0.9560,0.944,1.0,0.8070,0.9749],
    'Test_Accuracy': [0.9387,0.9660,0.9400,0.9493,0.944,0.9880,0.80,0.972],
    'Train_Recall': [0.5245,0.7289,0.8484,0.8367,0.4817,1.0,0.9854,0.9417],
    'Test_Recall': [0.4599,0.6788,0.8321,0.8102,0.5131,0.9051,0.9343,0.9343],
}
comparison = pd.DataFrame(All_models)

comparison

# Misclassification of model:
## Analysing predictions that were off the mark

In [None]:
df2=df.copy() # making a new copy from the dataset without outlier treatment
A = df2.drop(['Personal_Loan'], axis=1) #dropping the dependent variable
B = df2[['Personal_Loan']]

In [None]:
A = pd.get_dummies(A, drop_first=True) #create dummy variables 
# Splitting data into training and test set:
A_train,A_test, B_train, B_test =train_test_split(A,B, test_size=0.3,random_state=1)
#split data
print(A_train.shape,A_test.shape)

In [None]:
A_test.head()

In [None]:
#apply the final model best_model2 to the test set
final_pred_test = best_model2.predict(A_test)

In [None]:
data = df2.loc[A_test.index] #selecting rows with same index as test set
data['Predicted'] = final_pred_test
data.head()

In [None]:
# True where the prediction matches the actual label, False where the row is misclassified
comparison_column = np.where(data["Predicted"] == data["Personal_Loan"], True, False)
data['Misclassification'] = comparison_column
data['Misclassification'].value_counts()

* There are 49 misclassified rows in the test set.

In [None]:
incorrect =data[data['Misclassification']== False] # Grouping only the misidentified rows 
incorrect.sample(5)

In [None]:
#Creating a Pandas Profile report to identify patterns
from pandas_profiling import ProfileReport
profile  = ProfileReport(incorrect,title = 'Misclassification Pattern Profile',minimal=True) 
profile.to_widgets()

**OBSERVATIONS**:
* About 3% of the data from the test set has been misclassified, i.e. the predicted value of the model was not the same as the Personal_Loan variable in the dataset.

* The misclassification seems to be spread across all variables, but it is more significant for some.
* Income and CCAvg have high misclassifications. This is understandable, as the model highlighted these two features as very important. Hence the model seems to have classified customers with high Income and CCAvg as potential loan borrowers.

* Among the categorical variables, the misclassification is again high for customers with a CD_Account, an important feature for the model.
* The model has targeted all of its important feature combinations as potential loan-borrowing customers.
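
A natural follow-up is to split the misclassified rows by error type, since false negatives (missed buyers) are the costlier mistake under the recall criterion. A minimal sketch with hypothetical labels follows; in the notebook the same comparison would be made between the `Personal_Loan` and `Predicted` columns.

```python
from collections import Counter

def error_types(actual, predicted):
    """Count false positives and false negatives among mismatched rows."""
    kinds = Counter()
    for a, p in zip(actual, predicted):
        if a == p:
            continue  # correctly classified
        kinds["false_positive" if p == 1 else "false_negative"] += 1
    return kinds

# Hypothetical actual and predicted labels (illustration only)
actual = [0, 0, 1, 1, 0, 1, 0, 0]
predicted = [0, 1, 1, 0, 0, 1, 1, 0]

errors = error_types(actual, predicted)
print(dict(errors))
```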