## Supervised Learning - Project

### Steps and tasks:
1. Read the column description and ensure you understand each attribute well
2. Study the data distribution in each attribute, share your findings (15 marks)
3. Get the target column distribution. Your comments (5 marks)
4. Split the data into training and test set in the ratio of 70:30 respectively (5 marks)
5. Use different classification models (Logistic, K-NN and Naïve Bayes) to predict the likelihood of a customer buying personal loans (15 marks)
6. Print the confusion matrix for all the above models (5 marks)
7. Give your reasoning on which is the best model in this case and why it performs better? (5 marks)


## 1. Read the column description and ensure you understand each attribute well

**There are a total of 14 attributes in the dataset. In the context of the given problem, the target (dependent) attribute is "Personal Loan"; the remaining attributes are independent.**

**Attribute Information:**
* ID : Customer ID, unique for each customer and assigned by the bank.
* Age : Age of the customer in completed years.
* Experience : Professional experience in years.
* Income : Annual income of the customer. (\$000)
* ZIP Code : Home Address ZIP code of the customer.
* Family : Size of the customer's Family.
* CCAvg : Average spending on credit cards per month by customers. (\$000)
* Education : Education Level of the customers. 1: Undergrad; 2: Graduate; 3: Advanced/Professional.
* Mortgage : Value of the customer's house mortgage, if any. (\$000)
* Securities Account : Does the customer have a securities account with the bank?
* CD Account : Does the customer have a certificate of deposit (CD) account with the bank?
* Online : Does the customer use internet banking facilities?
* Credit card : Does the customer use a credit card issued by UniversalBank?
* Personal Loan : Did this customer accept the personal loan offered in the last campaign? (**Target Attribute**)

In [None]:
# Importing the necessary libraries
import math
import numpy                            as np                        # importing numpy library
import pandas                           as pd                        # importing pandas library
import seaborn                          as sns                       # For Data Visualization 
import matplotlib.pyplot                as plt                       # Necessary module for plotting purpose
import warnings                                                      # importing warning library

# render graphs inline in the Jupyter notebook
%matplotlib inline                             
warnings.filterwarnings('ignore')                                    # for ignoring warnings in notebook

import statsmodels.api                  as sm                        # importing statsmodel api
from sklearn.preprocessing              import MinMaxScaler          # importing MinMaxScaler for data scaling
from sklearn.model_selection            import train_test_split      # For train-test split
# getting methods for confusion matrix, F1 score, Accuracy Score
from sklearn.metrics                    import confusion_matrix,f1_score,accuracy_score,classification_report,roc_curve,auc
from sklearn.linear_model               import LogisticRegression    # For logistic Regression
from sklearn.naive_bayes                import GaussianNB            # For Naive Bayes classifier
from sklearn.neighbors                  import KNeighborsClassifier  # For K-NN Classifier
from sklearn.svm                        import SVC                   # For support vector machine based classifier

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# loading "Bank_Personal_Loan_Modelling.csv" data into loanDataOrg df using pandas read_csv function
loanDataOrg = pd.read_csv("../input/personal-loan-modeling/Bank_Personal_Loan_Modelling.csv")

# printing top 5 rows of the loanDataOrg df
loanDataOrg.head()

In [None]:
'''
To use columns of loanDataOrg df more conveniently following are some changes I have done
   a. pushing target column i.e 'Personal Loan' to last column
   b. converting all column names in lower case
   c. replacing spaces in the all column names with '_'
'''

loanData = loanDataOrg.copy()                                               # creating a copy of loanDataOrg into loanData

targetCol = 'Personal Loan'                                                 # defining target column
targetColDf = loanData.pop(targetCol)                                       # popping target column from loanData df
loanData.insert(len(loanData.columns),targetCol, targetColDf)               # inserting target column to last column

# deleting variables that were used for changing column position of target column
del targetCol 
del targetColDf

# converting column names into lower case and replacing spaces in column names with '_'
loanData.columns = [c.lower().replace(' ', '_') for c in loanData.columns]

# to check the above printing top 5 rows
loanData.head()

In [None]:
# Printing shape of the data i.e Rows and columns in the given dataset
print("\033[1mThere are {0} Rows and {1} Columns in the given Dataset.\033[0m".format(loanData.shape[0],loanData.shape[1]))

* **Setting 'id' as the index of the dataset, since the 'id' column carries no information about whether a customer opted for a personal loan (target variable - 'personal_loan').**

In [None]:
# setting id column as index column
loanData.set_index('id',inplace=True)

In [None]:
# after setting column 'id' as index the dataset has one column fewer; printing the number of rows and columns again to confirm
print("\033[1mAfter setting 'id' column as index of the Dataset,\033[0m now there are \033[1m{0}\033[0m Rows and \033[1m{1}\033[0m Columns in the given Dataset.".format(loanData.shape[0],loanData.shape[1]))

In [None]:
# printing top 5 rows once again to check
loanData.head()

In [None]:
# printing datatypes of each columns of the dataset

print("\033[1m*"*100)
print("a.\nColumn_Names        Data_Types")
print("*"*30)
print("\033[0m{0}\033[1m".format(loanData.dtypes))
print("*"*30)
print()

# printing No of Columns having different Types of Datatype

print("*"*100)
print("b.\nNumber of Columns with each DataTypes as follows :")
print("*"*50)
print("Column_Names     No_of_Columns\033[0m")
print("*"*30)
print(loanData.dtypes.value_counts())
print("\033[1m*"*30)
print("\033[0m")

# printing Different Column Names of the dataset

print("\033[1m*"*100)
print("c.\nEach Column Names of the dataset")
print("*"*80)
print("\033[0m{0}\033[1m".format(loanData.columns))
print("*"*80)
print("\033[0m")

- **After observing the dataset and the column description given, we can conclude the following:**
    * **The columns have only two datatypes: int64 and float64.**
    * **Only column 'ccavg' has the float64 datatype; all remaining columns are int64.**
    * **Columns 'age', 'experience', 'income', 'mortgage' and 'ccavg' are numeric columns.**
    * **Columns 'zip_code', 'securities_account', 'cd_account', 'online', 'creditcard' and 'personal_loan' are basically nominal categorical columns.**
    * **Columns 'family' and 'education' are ordinal categorical columns.**

In [None]:
# checking missing values in dataset for each attributes / columns 

print("\033[1m*"*100)
print("Column_Name       No_of_Missing_Values")
print("*"*50)
print("\033[0m{0}".format(loanData.isnull().sum()))
print("\033[1m*"*50)
print()

# checking if any duplicate rows available in the dataset

print("*"*100)
print("Showing Duplicate rows if any in the dataset: ")
print("*"*50)
print("\033[0m{0}".format(loanData[loanData.duplicated()]))
print("\033[1m*"*100)
print("\033[0m")

**As shown above, (a) there are no missing values and (b) there are no duplicate rows in the given dataset.**
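
The two checks above can also be condensed into a pair of counts. The following is a minimal sketch on a small made-up frame; the `toy` data and the variable names are assumptions for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy frame standing in for loanData (values are made up for illustration)
toy = pd.DataFrame({
    "age": [25, 40, 40, 58],
    "income": [49, 100, 100, None],
})

missing_per_column = toy.isnull().sum()          # number of missing values per column
total_missing = int(missing_per_column.sum())    # missing values in the whole frame
duplicate_rows = int(toy.duplicated().sum())     # rows identical to an earlier row

print("missing values :", total_missing)
print("duplicate rows :", duplicate_rows)
```

Both counts being zero on the real dataframe would correspond to findings (a) and (b).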


In [None]:
# Five point summary of each attribute
loanData.describe().transpose()

- **Following are the findings after looking into the above 5-point summary of the given dataset (loanData df)**
    * Numerical columns 'age' and 'experience' have no outliers.
    * Numerical columns 'income', 'ccavg' and 'mortgage' have outliers.
    * Nominal categorical columns 'zip_code', 'securities_account', 'cd_account', 'online', 'creditcard' and 'personal_loan' have no outliers and require no data cleaning.
    * Ordinal categorical columns 'family' and 'education' also have no outliers and are clean.
    * **Interestingly, the minimum value in the 'experience' column is -3.0 (negative three), which needs to be rectified or dropped.**

## 2. Study the data distribution in each attribute, share your findings (15 marks)

### 1. 'age' column : 

In [None]:
# 5 point summary of age column
loanData.age.describe()

The 'age' column has a mean of 45.338 and an Inter Quartile Range (IQR) of 20 (Q3 - Q1 = 55 - 35 = 20). The 1.5 × IQR fences are therefore 35 - 30 = 5 and 55 + 30 = 85, and no age falls outside them, so there are no outliers in the 'age' column.
- Let's check some facts: 

In [None]:
plt.figure(figsize=(10,5))                     # setting figure size with width = 10 and height = 5
sns.histplot(loanData.age, kde=True)           # seaborn histplot to examine distribution of the age
plt.title("Distribution of column : 'age'")    # setting title of the figure

**Age seems to be distributed quite uniformly.**
* **Let's check mean age w.r.t. loan status (personal_loan column)**

In [None]:
# plotting bar graph to see mean age with personal loan status i.e 0 = Not taken Loan and 1 = Taken Loan
loanData.groupby('personal_loan').age.mean().plot(kind='bar')
plt.title("Mean Age w.r.t Loan Status")          # setting title of the figure

**The plot above shows mean age on the y-axis, grouped by the target variable. There is not much difference in mean age between customers who took a loan and those who did not.**
* **Let's see how age is distributed among customers who took a loan and those who did not:**

In [None]:
plt.figure(figsize=(12,8))                         # setting figure size with width = 12 and height = 8
# plotting histogram of age column where customers not opted for loan
sns.histplot(loanData[loanData.personal_loan == 0].age,kde=False, bins=5, color='b', label='Personal Loan = 0 (No)')
# plotting histogram of age column where customers opted for loan
sns.histplot(loanData[loanData.personal_loan == 1].age,kde=False, bins=5, color='r', label='Personal Loan = 1 (Yes)')
plt.legend()                                       # plotting legend on the figure
plt.title("Distribution of column : 'age'")        # setting title of the figure

**From the above distribution, we can say customers in almost every age group have bought personal loans.**<br>
* **Checking the percentage of customers in each age group who bought personal loans:**<br>
For this we will use pandas.cut(), which segments data values into categorical bins, and then pandas.crosstab() and DataFrame.div() to plot a stacked bar graph:

In [None]:
bins = [20,30,40,50,60,70]                                         # defining age bins
# defining labels of age groups as per bins defined as above
ageGroup = ['Age : 20-30', 'Age : 30-40', 'Age : 40-50', 'Age : 50-60', 'Age : 60-70']
loanDataAgeBin = pd.cut(loanData.age,bins,labels=ageGroup)         # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to ageGroupCol variable
ageGroupCol = pd.crosstab(loanDataAgeBin,loanData.personal_loan).apply(lambda r: r/r.sum()*100, axis=1)
print(ageGroupCol)                                                 # printing above crosstab

# plotting a stacked bar chart to show loan status for different age group
ageGroupCol.div(ageGroupCol.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("Loan Status with different Age group")                  # setting title of the figure

- We can find out the following from the above crosstab:
    * **The 20-30 and 60-70 age groups have the highest loan conversion rates, 10.577% and 10.786% respectively.**
    * **The 50-60 age group has the lowest loan conversion rate, 8.692%.**
    * **The 40-50 and 50-60 age groups have loan conversion rates of 9.547% and 9.606% respectively.**
- Normally we would expect customers aged 30-40 to be more likely to take a loan than other age groups, **but the analysis above contradicts that expectation.**
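
The row percentages in the crosstab can also be obtained with the `normalize` option of `pandas.crosstab`. The sketch below uses made-up ages and loan flags (the `toy` frame and the short bin labels are assumptions for illustration) to check that both routes give the same table.

```python
import numpy as np
import pandas as pd

# Toy data: every age bin is populated so no row sums to zero
toy = pd.DataFrame({
    "age":           [25, 28, 35, 38, 45, 55, 65, 66],
    "personal_loan": [1,  0,  0,  0,  1,  0,  1,  0],
})
bins = [20, 30, 40, 50, 60, 70]
labels = ["20-30", "30-40", "40-50", "50-60", "60-70"]
age_bin = pd.cut(toy["age"], bins, labels=labels)

# Row percentages with the apply/lambda pattern used in the notebook
by_apply = pd.crosstab(age_bin, toy["personal_loan"]).apply(
    lambda r: r / r.sum() * 100, axis=1
)
# Row percentages with crosstab's own normalisation
by_normalize = pd.crosstab(age_bin, toy["personal_loan"], normalize="index") * 100

print(by_normalize)
print("same result:", np.allclose(by_apply.to_numpy(), by_normalize.to_numpy()))
```

Because each row already sums to 100, the extra `.div(...)` step before plotting only rescales the bars from percentages to proportions.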

### 2. 'experience' column : 

In [None]:
loanData.experience.describe()

From the 5-point summary, the 'experience' column has a mean of 20.105 and an Inter Quartile Range (IQR) of 20 (Q3 - Q1 = 30 - 10 = 20). The 1.5 × IQR fences are 10 - 30 = -20 and 30 + 30 = 60, so there are no outliers in the 'experience' column. **But the minimum value in the 'experience' column is -3.0 (negative three), which needs to be rectified or dropped.**

I will rectify the negative values of the 'experience' column. To do that, I'll first check whether it is correlated with any other column:

In [None]:
plt.figure(figsize=(12,10))
sns.heatmap(loanData.corr(),annot=True)

Above we can see that the **'experience' column is highly correlated with the 'age' column ($\rho = 0.99$)**.<br> 
I'll use the 'age' column to rectify the rows of the 'experience' column where negative values exist.

In [None]:
# casting 'experience' to float first so that the fractional means below can be stored without a dtype mismatch
loanData['experience'] = loanData['experience'].astype(float)
# looping through distinct negative values of 'experience' column
for oddExp in loanData[loanData.experience <0].experience.unique():
    # listing all ages at which this negative 'experience' value occurs
    ageForOddExp = loanData[loanData.experience == oddExp].age.value_counts().index.tolist()
    # looping through locations where this negative value exists in 'experience' column
    for i in loanData[loanData.experience == oddExp].experience.index.tolist():
        # replacing the negative value with the mean positive experience of customers having those ages
        loanData.loc[i,'experience'] = loanData[(loanData.age.isin(ageForOddExp)) & (loanData.experience > 0)].experience.mean()
    print("{0} values in experience column are replaced with mean experience of the same age group.\n".format(oddExp))

**Algorithm for the above code:**
1. Find the unique negative experience values.
2. For each distinct negative value:<br>
    a. Find the list of ages at which that negative value occurs.<br>
    b. Loop through all locations of that negative experience value:<br>
        i. Find the mean experience of customers of those ages whose experience is greater than zero.<br>
        ii. Replace the negative value at that location with this mean.
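
The steps above can be sketched on a tiny made-up frame. This is an illustrative variant, not the notebook's exact code: it replaces all occurrences of a negative value in one assignment instead of looping over rows, and the `toy` data and variable names are assumptions.

```python
import pandas as pd

# Toy data: two customers carry the invalid experience value -1
toy = pd.DataFrame({
    "age":        [24, 24, 25, 25, 25, 40],
    "experience": [-1.0, 1.0, -1.0, 1.0, 3.0, 15.0],
})

for odd_exp in toy.loc[toy.experience < 0, "experience"].unique():
    is_odd = toy.experience == odd_exp                 # rows holding this negative value
    ages = toy.loc[is_odd, "age"].unique()             # ages at which it occurs
    # mean positive experience among customers of those ages
    mean_exp = toy.loc[toy.age.isin(ages) & (toy.experience > 0), "experience"].mean()
    toy.loc[is_odd, "experience"] = mean_exp           # overwrite the negative values

print(toy)
```

Inserting a value equal to the mean of a group does not change that group's mean, so replacing the rows one at a time or all at once should lead to the same result.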

In [None]:
# checking for any negative value in 'experience' column
loanData[loanData.experience < 0].experience.value_counts()

In [None]:
loanData.experience.describe()

From the above 5-point summary we can confirm negative values of 'experience' column are replaced with suitable values
* **Now we will check the distribution of 'experience' column**


In [None]:
plt.figure(figsize=(10,5))                         # setting figure size with width = 10 and height = 5
sns.histplot(loanData.experience, kde=True)        # seaborn histplot to examine distribution of the experience
plt.title("Distribution of column : 'experience'") # setting title of the figure

**Experience seems to be distributed quite uniformly.**
* **Let's check mean experience w.r.t. loan status (personal_loan column)**

In [None]:
# plotting bar graph to see mean experience with personal loan status i.e 0 = Not taken Loan and 1 = Taken Loan
loanData.groupby('personal_loan').experience.mean().plot(kind='bar')
plt.title("Mean Experience w.r.t Loan Status")      # setting title of the figure

**The plot shows mean experience on the y-axis, grouped by the target variable. There is not much difference in mean experience between customers who took a loan and those who did not.**
* **Let's see how experience is distributed among customers who took a loan and those who did not:**

In [None]:
plt.figure(figsize=(12,8))                                # setting figure size with width = 12 and height = 8
# plotting histogram of experience column where customers have not opted for loan
sns.histplot(loanData[loanData.personal_loan == 0].experience,kde=False, bins=5, color='b', label='Personal Loan = 0 (No)')
# plotting histogram of experience column where customers opted for loan
sns.histplot(loanData[loanData.personal_loan == 1].experience,kde=False, bins=5, color='r', label='Personal Loan = 1 (Yes)')
plt.legend()                                              # plotting legend on the figure
plt.title("Distribution of column : 'experience'")        # setting title of the figure

**From the above distribution, we can say customers in almost every experience group have opted for personal loans.**<br>
* **Checking the percentage of customers in each experience group who opted for personal loans:**<br>
As with the 'age' column, we will use pandas.cut(), pandas.crosstab() and DataFrame.div() to plot a stacked bar graph:

In [None]:
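bins = [0,10,20,30,40,50]                                          # defining experience bins
# note: pd.cut bins are right-closed by default, i.e. (0, 10], (10, 20], ...; an experience of exactly 0 falls outside the first bin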
# defining labels of experience groups as per bins defined as above
expGroup = ['Experience : 0-10', 'Experience : 10-20', 'Experience : 20-30', 'Experience : 30-40', 'Experience : 40-50']
loanDataExpBin = pd.cut(loanData.experience,bins,labels=expGroup)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to expGroupCol variable
expGroupCol = pd.crosstab(loanDataExpBin,loanData.personal_loan).apply(lambda r: r/r.sum()*100, axis=1)
print(expGroupCol)                                                 # printing above crosstab

# plotting a stacked bar chart to show loan status for different experience group
expGroupCol.div(expGroupCol.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("Loan Status with different Experience group")           # setting title of the figure

- We can find out the following from the above crosstab:
    * **The 40-50 experience group has the highest loan conversion rate, 12.963%.**
    * **The 0-10 experience group has the second highest loan conversion rate, 10.303%.**
    * **The 10-20, 20-30 and 30-40 experience groups have loan conversion rates of 9.417%, 9.147% and 9.338% respectively.**
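
One detail of the binning is worth checking when reading these percentages: `pandas.cut` builds right-closed intervals by default, so with a first edge of 0 a value of exactly 0 is not assigned to any bin. The sketch below uses a made-up series (an assumption for illustration) to show the default behaviour next to `include_lowest=True`.

```python
import pandas as pd

exp = pd.Series([0, 5, 10, 11, 50])       # made-up experience values
bins = [0, 10, 20, 30, 40, 50]
labels = ["0-10", "10-20", "20-30", "30-40", "40-50"]

default_cut = pd.cut(exp, bins, labels=labels)                       # (0, 10], (10, 20], ...
closed_cut = pd.cut(exp, bins, labels=labels, include_lowest=True)   # first bin also takes 0

print("unbinned with default bins  :", int(default_cut.isna().sum()))
print("unbinned with include_lowest:", int(closed_cut.isna().sum()))
```

If the column contains zeros, those customers are left out of the crosstab unless `include_lowest=True` (or a lower first edge) is used.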

### 3. Analyzing 'income' column : 

In [None]:
loanData.income.describe()

From the 5-point summary, the 'income' column has a mean of 73.774 and an Inter Quartile Range (IQR) of 59 (Q3 - Q1 = 98 - 39 = 59). The upper fence is Q3 + 1.5 × IQR = 98 + 88.5 = 186.5, so any income above 186.5 is an outlier, and **there are many such outliers in the 'income' column.**<br>
* Let's check which outliers are present in the 'income' column

In [None]:
Q1 = loanData.income.quantile(0.25)        # evaluating lower / first quartile
Q3 = loanData.income.quantile(0.75)        # evaluating upper / third quartile
IQR = Q3 - Q1                              # evaluating Inter Quartile Range i.e IQR
'''
flagging outliers by the 1.5 * IQR rule: values below (lower quartile - 1.5 times IQR) or
above (upper quartile + 1.5 times IQR)
'''
outliers = loanData[((loanData.income < (Q1 - 1.5 * IQR)) |(loanData.income > (Q3 + 1.5 * IQR)))].income
plt.figure(figsize=(15,3))                 # setting figure size with width = 15 and height = 3
print("*"*30)
print("\033[1mBoxplot of income column : \033[0m")
print("*"*30)
ax = sns.boxplot(x=loanData.income)        # seaborn boxplot to examine outliers of the feature
# printing mean, median and IQR for the feature
print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
      .format('income',round(np.mean(loanData.income),3),round(np.median(loanData.income),3),round(IQR,3))
     )
if(outliers.shape[0] == 0):                # comparing if number of outlier is zero
    print("There are \033[1mno outliers\033[0m in \033[1m'income'\033[0m feature.")
else:                                      # if the above condition is false i.e number of outlier is not zero
    # printing No of outliers, percentage of the data points are outliers and the values of the outliers
    print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
    .format(outliers.shape[0],round(((outliers.shape[0]/loanData.income.shape[0])*100),3),'income',outliers.tolist()))
print("*"*125)

**As above we can see there are 96 outliers (1.92 % of the datapoints) out of 5000 datapoints in 'income' column.**
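
The outlier rule used above can be wrapped in a small reusable helper. This is a minimal sketch: the function name `iqr_outliers`, its parameter `k` and the toy series are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5):
    """Return the lower fence, upper fence and the values lying outside them."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (values < lower) | (values > upper)
    return lower, upper, values[mask]

# Made-up incomes with one obvious extreme value
toy_income = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 300])
lower, upper, outliers = iqr_outliers(toy_income)

print("fences  :", lower, upper)
print("outliers:", outliers.tolist())
```

Calling the helper on any numeric column gives the same fences and outlier list as the inline code, without repeating the block for each feature.
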
* Let's check the distribution of the 'income' column:

In [None]:
plt.figure(figsize=(10,5))                     # setting figure size with width = 10 and height = 5
sns.histplot(loanData.income, kde=True)        # seaborn histplot to examine distribution of the income
plt.title("Distribution of column : 'income'") # setting title of the figure

The plot above shows that the 'income' column is slightly right-skewed (positively skewed).<br>
* Skewness of the 'income' column:

In [None]:
loanData.income.skew()

* **Let's check mean income w.r.t. loan status (personal_loan column)**

In [None]:
# plotting bar graph to see mean income with personal loan status i.e 0 = Not taken Loan and 1 = Taken Loan
loanData.groupby('personal_loan').income.mean().plot(kind='bar')
plt.title("Mean Income w.r.t Loan Status")        # setting title of the figure

**The plot shows mean income on the y-axis, grouped by the target variable. Customers with a high annual income are more inclined to take a personal loan.** Without this insight, one might have assumed that customers with a low annual income are more inclined to opt for a personal loan to fulfil their needs.
* **Let's see how income is distributed among customers who took a loan and those who did not:**

In [None]:
plt.figure(figsize=(12,8))                                # setting figure size with width = 12 and height = 8
# plotting histogram of income column where customers not opted for loan
sns.histplot(loanData[loanData.personal_loan == 0].income,kde=False, bins=5, color='b', label='Personal Loan = 0 (No)')
# plotting histogram of income column where customers opted for loan
sns.histplot(loanData[loanData.personal_loan == 1].income,kde=False, bins=5, color='r', label='Personal Loan = 1 (Yes)')
plt.legend()                                              # plotting legend on the figure
plt.title("Distribution of column : 'income'")            # setting title of the figure

The plot above confirms that customers with a high annual income are more likely to opt for a personal loan.<br>
**One interesting observation is that customers with an annual income between 0 and 60 (approx.) have not taken any personal loan.**
* **Checking the percentage of customers in each income group who bought personal loans:**<br>
As with the 'age' and 'experience' columns, we will use pandas.cut(), pandas.crosstab() and DataFrame.div() to plot a stacked bar graph:

In [None]:
bins = [0,50,100,150,200,250]                                         # defining income bins,
# defining labels of income groups as per bins defined as above
incGroup = ['Income : 0-50', 'Income : 50-100', 'Income : 100-150', 'Income : 150-200', 'Income : 200-250']
loanDataIncBin = pd.cut(loanData.income,bins,labels=incGroup)         # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to incGroupCol variable
incGroupCol = pd.crosstab(loanDataIncBin,loanData.personal_loan).apply(lambda r: r/r.sum()*100, axis=1)
print(incGroupCol)                                                    # printing above crosstab

# plotting a stacked bar chart to show loan status for different income group
incGroupCol.div(incGroupCol.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("Loan Status with different Income group")                  # setting title of the figure

- We can find out the following from the above crosstab:
    * **The 150-200 income group has the highest loan conversion rate, 50.470%.**
    * **The 100-150 income group has the second highest loan conversion rate, 28.571%, followed by the 200-250 income group at 18.750%.**
    * **The 0-50 income group has no loan conversions, and the 50-100 income group has a loan conversion rate of only 2.241%.**

### 4. Analyzing 'zip_code' column : 

Since zip_code is just a numeric code we could drop it, but before dropping it let us see whether location gives any advantage for taking a personal loan.<br>
* Let's quickly check how many times each unique zip code appears in the dataset.

In [None]:
loanData.zip_code.value_counts()

As shown above, some zip codes appear more often than others. There are 467 unique zip codes in the dataset. <br>
* Let's find out the conversion of loans w.r.t. zip code:

In [None]:
pd.crosstab(loanData.zip_code,loanData.personal_loan, values=loanData.personal_loan, aggfunc='count').sort_values([1],ascending=False)

The table above shows that some zip codes have more customers taking personal loans, but those zip codes also have more customers overall. A plausible explanation is that banking outlets are more accessible in those zip codes than in zip codes with few customers.<br>
* On that basis, zip_code does not appear to have played a role in customers taking a personal loan in this dataset, so we will proceed to drop this column.
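
The reasoning above compares raw loan counts with customer counts per zip code. One way to make that comparison explicit is to compute a conversion rate per zip code. The sketch below uses a made-up frame; the zip codes, loan flags and column names of `per_zip` are assumptions for illustration.

```python
import pandas as pd

# Toy data: one zip code has twice as many customers and twice as many loans
toy = pd.DataFrame({
    "zip_code":      [94720, 94720, 94720, 94720, 90095, 90095],
    "personal_loan": [1, 1, 0, 0, 1, 0],
})

per_zip = (
    toy.groupby("zip_code")["personal_loan"]
       .agg(customers="count", loans="sum", rate="mean")   # rate = loans / customers
       .sort_values("loans", ascending=False)
)
print(per_zip)
```

In this toy case the larger zip code has more loans but the same rate, which is the pattern the paragraph describes. Rates computed from only a handful of customers are noisy, so very small zip codes should be read with care.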

In [None]:
print("*"*40)
print("loanData dataframe before dropping zip_code :")
print("*"*80)
print(loanData.head())
print("*"*80)
print()
loanData.drop('zip_code',axis=1,inplace=True)
print("*"*40)
print("loanData dataframe after dropping zip_code :")
print("*"*80)
print(loanData.head())
print("*"*80)

### 5. Analyzing 'family' column : 

In [None]:
plt.figure(figsize=(10,5))                                  # setting figure size with width = 10 and height = 5
# axes-level countplot draws on the figure created above (figure-level catplot would open its own figure)
ax = sns.countplot(x='family', data=loanData)               # seaborn countplot to examine distribution of the family
plt.title("Distribution of column : 'family'")              # setting title of the figure

As shown above, customers with a family size of 1 have the highest count, followed by family sizes of 2, 4 and 3.
* Let's check the family column w.r.t. loan status:

In [None]:
plt.figure(figsize=(10,5))                        # setting figure size with width = 10 and height = 5
# seaborn countplot (axes-level, so it uses the figure above) to examine family size split by loan status
ax = sns.countplot(x='family',hue='personal_loan', data=loanData)
plt.title("Distribution of column : 'family'")    # setting title of the figure

Surprisingly, customers with family sizes of 3 and 4 have the highest number of personal loans.
* Again, let's see the numbers and percentages of loan conversion:

In [None]:
print("*"*70)
print("\033[1mNo of Customers taken loan or not w.r.t Family member :\033[0m")
print("*"*70)
famGroupCol = pd.crosstab(loanData.family,loanData.personal_loan)
print(famGroupCol)                                                 # printing above crosstab
print("*"*40)
print()
print("*"*70)
print("\033[1mPercentage of Customers taken loan or not w.r.t Family member :\033[0m")
print("*"*70)
famGroupPer = pd.crosstab(loanData.family,loanData.personal_loan).apply(lambda r: r/r.sum()*100, axis=1)
print(famGroupPer)                                                 # printing above crosstab
print("*"*40)
print("*"*100)
# plotting a stacked bar chart to show loan status for different no of family member group
famGroupCol.div(famGroupCol.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("Loan Status with different Family Member")              # setting title of the figure

- We can find out the following from the above crosstab:
    * **Customers with a family size of 3 have the highest loan conversion rate, 13.168%.**
    * **Customers with a family size of 4 have the second highest loan conversion rate, 10.966%, followed by customers with family sizes of 2 and 1 at 8.179% and 7.269% respectively.**

### 6. Analyzing 'ccavg' column : 

In [None]:
loanData.ccavg.describe()

From the 5-point summary, the 'ccavg' column has a mean of 1.938 and an Inter Quartile Range (IQR) of 1.8 (Q3 - Q1 = 2.5 - 0.7 = 1.8). The upper fence is Q3 + 1.5 × IQR = 2.5 + 2.7 = 5.2, so any value above 5.2 is an outlier, and **there are some such outliers in the 'ccavg' column.**<br>
* Let's check which outliers are present in the 'ccavg' column

In [None]:
Q1 = loanData.ccavg.quantile(0.25)        # evaluating lower / first quartile
Q3 = loanData.ccavg.quantile(0.75)        # evaluating upper / third quartile
IQR = Q3 - Q1                             # evaluating Inter Quartile Range i.e IQR
'''
flagging outliers by the 1.5 * IQR rule: values below (lower quartile - 1.5 times IQR) or
above (upper quartile + 1.5 times IQR)
'''
outliers = loanData[((loanData.ccavg < (Q1 - 1.5 * IQR)) |(loanData.ccavg > (Q3 + 1.5 * IQR)))].ccavg
plt.figure(figsize=(15,3))                # setting figure size with width = 15 and height = 3
print("*"*30)
print("\033[1mBoxplot of ccavg column : \033[0m")
print("*"*30)
ax = sns.boxplot(x=loanData.ccavg)        # seaborn boxplot to examine outliers of the feature
# printing mean, median and IQR for the feature
print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
      .format('ccavg',round(np.mean(loanData.ccavg),3),round(np.median(loanData.ccavg),3),round(IQR,3))
     )
if(outliers.shape[0] == 0):                # comparing if number of outlier is zero
    print("There are \033[1mno outliers\033[0m in \033[1m'ccavg'\033[0m feature.")
else:                                      # if the above condition is false i.e number of outlier is not zero
    # printing No of outliers, percentage of the data points are outliers and the values of the outliers
    print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
    .format(outliers.shape[0],round(((outliers.shape[0]/loanData.ccavg.shape[0])*100),3),'ccavg',outliers.tolist()))
print("*"*125)

**As above we can see there are 324 outliers (6.48 % of the datapoints) out of 5000 datapoints in 'ccavg' column.**
* Let's check the distribution of the 'ccavg' column:

In [None]:
plt.figure(figsize=(10,5))                     # setting figure size with width = 10 and height = 5
sns.histplot(loanData.ccavg, kde=True)         # seaborn histplot to examine distribution of the ccavg
plt.title("Distribution of column : 'ccavg'")  # setting title of the figure

The plot above shows that the 'ccavg' column is highly right-skewed (positively skewed).<br>
* Skewness of the 'ccavg' column:

In [None]:
loanData.ccavg.skew()

**ccavg is highly right-skewed (positively skewed), with a skewness of 1.598.**
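
For reference, the value returned by `Series.skew()` is, to my understanding, the bias-adjusted Fisher-Pearson coefficient, $G_1 = \frac{\sqrt{n(n-1)}}{n-2} \cdot \frac{m_3}{m_2^{3/2}}$, where $m_2$ and $m_3$ are the second and third central moments. The sketch below recomputes it by hand on a made-up right-skewed series (an assumption for illustration) and compares it with pandas.

```python
import numpy as np
import pandas as pd

# Made-up spending values with a long right tail
x = pd.Series([0.2, 0.5, 0.7, 1.0, 1.5, 2.5, 8.0])

n = len(x)
d = x - x.mean()                 # deviations from the mean
m2 = (d ** 2).mean()             # second central moment
m3 = (d ** 3).mean()             # third central moment
g1 = m3 / m2 ** 1.5              # biased sample skewness
G1 = np.sqrt(n * (n - 1)) / (n - 2) * g1   # bias-adjusted skewness

print("manual G1 matches pandas:", bool(np.isclose(G1, x.skew())))
```

A positive value means the right tail is longer than the left, which matches the shape of the histogram above.
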
* **Let's check mean ccavg w.r.t. loan status (personal_loan column)**

In [None]:
# plotting bar graph to see mean ccavg with personal loan status i.e 0 = Not taken Loan and 1 = Taken Loan
loanData.groupby('personal_loan').ccavg.mean().plot(kind='bar')
plt.title("Mean Credit Card Spending (monthly) w.r.t Loan Status")  # setting title of the figure

**The plot shows mean ccavg (average monthly credit card spending) on the y-axis, grouped by the target variable. Customers with a high ccavg are more likely to buy a personal loan.**
* **Let's see how ccavg is distributed among customers who took a loan and those who did not:**

In [None]:
plt.figure(figsize=(12,8))                                # setting figure size with width = 12 and height = 8
# plotting histogram of ccavg column where customers not opted for loan
sns.histplot(loanData[loanData.personal_loan == 0].ccavg,kde=False, bins=5, color='b', label='Personal Loan = 0 (No)')
# plotting histogram of ccavg column where customers opted for loan
sns.histplot(loanData[loanData.personal_loan == 1].ccavg,kde=False, bins=5, color='r', label='Personal Loan = 1 (Yes)')
plt.legend()                                              # plotting legend on the figure
plt.title("Distribution of column : 'ccavg'")             # setting title of the figure

Earlier we saw that customers with high ccavg tend to buy a personal loan. Here we can see that **customers who took a personal loan are spread almost evenly across all values of ccavg**, while the percentage of customers who took a loan is higher in the higher ccavg bins.
* **Checking the percentage of customers under each ccavg group who bought personal Loans :**<br>
Similar to what we have done in 'age', 'experience' and 'income' column we will use pandas.cut(), pandas.crosstab() & DataFrame.div() function to plot the stacked bar graph which can be shown as:

In [None]:
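bins = [0,2,4,6,8,10]                                              # defining ccavg bins
# note: pd.cut builds right-closed intervals by default, so values outside (0, 10] (e.g. ccavg == 0) are left unbinned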
# defining labels of ccavg groups as per bins defined as above
ccavgGroup = ['CcAvg : 0-2', 'CcAvg : 2-4', 'CcAvg : 4-6', 'CcAvg : 6-8', 'CcAvg : 8-10']
loanDataCcavgBin = pd.cut(loanData.ccavg,bins,labels=ccavgGroup)   # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to ccavgGroupCol variable
ccavgGroupCol = pd.crosstab(loanDataCcavgBin,loanData.personal_loan).apply(lambda r: r/r.sum()*100, axis=1)
print(ccavgGroupCol)                                               # printing above crosstab

# plotting a stacked bar chart to show loan status for different ccavg group
ccavgGroupCol.div(ccavgGroupCol.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("Loan Status with different CcAvg group")                # setting title of the figure

- We can find out the following from the above crosstab:
    * **The CcAvg group 4-6 has the highest loan conversion rate, at 46.926 %.**
    * **The CcAvg group 8-10 has the second highest conversion rate at 35.897 %, followed by the CcAvg group 6-8 at 30.693 %.**
    * **The CcAvg group 2-4 has a low conversion rate of 13.549 %, and the CcAvg group 0-2 converts at only 3.025 %.**
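
The binning above relies on `pandas.cut`, which by default builds right-closed intervals such as (0, 2], so a value exactly equal to the lowest edge stays unbinned unless `include_lowest=True` is passed. The following is only a minimal sketch on made-up spending values (the `toy` frame and its numbers are illustrative assumptions, not rows from the bank dataset). It also uses `normalize="index"` in `pandas.crosstab` as a possible alternative to the lambda:

```python
import pandas as pd

# Hypothetical toy data: monthly card spending (in $000) and loan flag
toy = pd.DataFrame({
    "ccavg": [0.0, 1.5, 2.0, 3.2, 4.8, 6.1, 9.9, 10.0],
    "personal_loan": [0, 0, 0, 1, 1, 0, 1, 0],
})
bins = [0, 2, 4, 6, 8, 10]
labels = ["0-2", "2-4", "4-6", "6-8", "8-10"]

# Default behaviour: intervals are (0, 2], (2, 4], ... so ccavg == 0.0 becomes NaN
default_bins = pd.cut(toy.ccavg, bins, labels=labels)
# include_lowest=True closes the first interval on the left: [0, 2]
closed_bins = pd.cut(toy.ccavg, bins, labels=labels, include_lowest=True)

n_unbinned_default = int(default_bins.isna().sum())
n_unbinned_closed = int(closed_bins.isna().sum())

# Row-wise percentages without a lambda
rate_table = pd.crosstab(closed_bins, toy.personal_loan, normalize="index") * 100
print("unbinned rows (default / include_lowest):", n_unbinned_default, n_unbinned_closed)
print(rate_table)
```

Whether zero-spend customers should count towards the first group is a modelling choice; the sketch only makes the difference visible.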

### 7. Analyzing 'education' column : 

In [None]:
plt.figure(figsize=(10,5))                                     # setting figure size with width = 10 and height = 5
ax = sns.catplot(x='education', kind="count", data=loanData)   # seaborn count catplot to examine distribution of education
plt.title("Distribution of column : 'education'")              # setting title of the figure

As we can see above, customers with education level 1: Undergrad have the highest count, followed by 3: Advanced/Professional and then 2: Graduate.
* Lets check education column w.r.t loan status:

In [None]:
plt.figure(figsize=(10,5))                                 # setting figure size with width = 10 and height = 5
# seaborn count catplot to examine distribution of the education
ax = sns.catplot(x='education',hue='personal_loan', kind="count", data=loanData)
plt.title("Distribution of column : 'education'")          # setting title of the figure

As we can conclude from the above, customers with a higher education level are more likely to take a personal loan.
* Again lets see the numbers and percentage of loan conversion :

In [None]:
print("*"*70)
print("\033[1mNo of Customers taken loan or not w.r.t Education :\033[0m")
print("*"*70)
eduGroupCol = pd.crosstab(loanData.education,loanData.personal_loan)
print(eduGroupCol)                                                 # printing above crosstab
print("*"*40)
print()
print("*"*70)
print("\033[1mPercentage of Customers taken loan or not w.r.t Education :\033[0m")
print("*"*70)
eduGroupPer = pd.crosstab(loanData.education,loanData.personal_loan).apply(lambda r: r/r.sum()*100, axis=1)
print(eduGroupPer)                                                 # printing above crosstab
print("*"*40)
print("*"*100)
# ploting a stacked bar chart to show loan status for different Educational background
eduGroupCol.div(eduGroupCol.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("Loan Status with different Educational Background")     # setting title of the figure

- We can find out the following from the above crosstab:
    * **Customers with education 3: Advanced/Professional have the highest loan conversion rate, at 13.658 %.**
    * **Customers with education 2: Graduate have the second highest conversion rate at 12.972 %, while customers with education 1: Undergrad have the lowest at 4.437 %.**
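
The count and percentage crosstabs are rebuilt in the same way for every categorical column, so they could be wrapped in a small helper. This is a sketch only: the function name `conversion_table` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def conversion_table(df, feature, target="personal_loan"):
    """Return (counts, row-wise percentages) of `target` for each level of `feature`."""
    counts = pd.crosstab(df[feature], df[target])
    percent = pd.crosstab(df[feature], df[target], normalize="index") * 100
    return counts, percent.round(3)

# Hypothetical toy data, not taken from the bank dataset
toy = pd.DataFrame({
    "education":     [1, 1, 1, 1, 2, 2, 3, 3],
    "personal_loan": [0, 0, 0, 1, 0, 1, 1, 1],
})

counts, percent = conversion_table(toy, "education")
print(counts)
print(percent)
```

With the real data, the same call would presumably be `conversion_table(loanData, 'education')`, assuming the column names used elsewhere in this notebook.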

### 8. Analyzing 'mortgage' column : 

In [None]:
loanData.mortgage.describe()

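From the five-number summary, the 'mortgage' column has a mean of 56.499 and an Inter Quartile Range (IQR) of 101.0 (Q3 - Q1 = 101.0 - 0.0 = 101.0). With Q1 at 0.0, at least a quarter of the customers have no mortgage at all, so the values are concentrated at zero, which suggests **there are many outliers present in the 'mortgage' column.**<br>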
* Lets check what are the outliers present in 'mortgage' column

In [None]:
Q1 = loanData.mortgage.quantile(0.25)        # evaluating lower / first quartile
Q3 = loanData.mortgage.quantile(0.75)        # evaluating upper / third quartile
IQR = Q3 - Q1                                # evaluating Inter Quartile Range i.e IQR
'''
finding outliers with the 1.5 * IQR rule, i.e. values below (Lower quartile - 1.5 times IQR) or
above (Upper quartile + 1.5 times IQR)
'''
outliers = loanData[((loanData.mortgage < (Q1 - 1.5 * IQR)) |(loanData.mortgage > (Q3 + 1.5 * IQR)))].mortgage
plt.figure(figsize=(15,3))                   # setting figure size with width = 15 and height = 3
print("*"*30)
print("\033[1mBoxplot of mortgage column : \033[0m")
print("*"*30)
ax = sns.boxplot(x=loanData.mortgage)        # seaborn boxplot to examine outliers of the feature
# printing mean, median and IQR for the feature
print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
      .format('mortgage',round(np.mean(loanData.mortgage),3),round(np.median(loanData.mortgage),3),round(IQR,3))
     )
if(outliers.shape[0] == 0):                  # comparing if number of outlier is zero
    print("There are \033[1mno outliers\033[0m in \033[1m'mortgage'\033[0m feature.")
else:                                        # if the above condition is false i.e number of outlier is not zero
    # printing No of outliers, percentage of the data points are outliers and the values of the outliers
    print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
    .format(outliers.shape[0],round(((outliers.shape[0]/loanData.mortgage.shape[0])*100),3),'mortgage',outliers.tolist()))
print("*"*125)

**As above we can see there are 291 outliers (5.82 % of the datapoints) out of 5000 datapoints in 'mortgage' column.**
* Lets check distribution of the 'mortgage' column :

In [None]:
plt.figure(figsize=(10,5))                       # setting figure size with width = 10 and height = 5
sns.histplot(loanData.mortgage, kde=True)        # seaborn histplot to examine distribution of the mortgage
plt.title("Distribution of column : 'mortgage'") # setting title of the figure

Clearly, the 'mortgage' column is highly right-skewed / positively skewed. Customers without a house mortgage have a value of 0, while customers with a mortgage have much larger values, which is why this column is so strongly skewed.<br>
* Skewness of 'mortgage' column as below:

In [None]:
loanData.mortgage.skew()

**mortgage is highly right / positive skewed with value of 2.104 .**
* **Lets check mean mortgage w.r.t Loan Status (personal_loan column)**

In [None]:
# plotting bar graph to see mean mortgage with personal loan status i.e 0 = Not taken Loan and 1 = Taken Loan
loanData.groupby('personal_loan').mortgage.mean().plot(kind='bar')
plt.title("Mean of mortgage w.r.t Loan Status")     # setting title of the figure

**We have taken mean mortgage on the y-axis and grouped it by the target variable. As we can see, customers with a higher mortgage are more likely to take a personal loan.**
* **Lets see how mortgage is distributed between customers taken loan and not taken loan :**

In [None]:
plt.figure(figsize=(12,8))                                # setting figure size with width = 12 and height = 8
# plotting histogram of mortgage column where customers not opted for loan
sns.histplot(loanData[loanData.personal_loan == 0].mortgage,kde=False, bins=5, color='b', label='Personal Loan = 0 (No)')
# plotting histogram of mortgage column where customers opted for loan
sns.histplot(loanData[loanData.personal_loan == 1].mortgage,kde=False, bins=5, color='r', label='Personal Loan = 1 (Yes)')
plt.legend()                                              # plotting legend on the figure
plt.title("Distribution of column : 'mortgage'")          # setting title of the figure

Earlier we saw that customers with a high mortgage tend to buy a personal loan, but here we can see that, in absolute numbers, most customers who took a personal loan fall in the lowest mortgage range.
* **Checking the percentage of customers under each mortgage group who bought personal Loans :**<br>
Similar to what we have done in 'age', 'experience','income' and 'ccavg' column we will use pandas.cut(), pandas.crosstab() & DataFrame.div() function to plot the stacked bar graph which can be shown as:

In [None]:
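bins = [0,100,200,300,400,500,600]                                         # defining mortgage bins
# note: pd.cut builds right-closed intervals by default, so values outside (0, 600] (e.g. mortgage == 0) are left unbinned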
# defining labels of mortgage groups as per bins defined as above
mortgageGroup = ['Mortgage : 0-100', 'Mortgage : 100-200', 'Mortgage : 200-300', 'Mortgage : 300-400', 'Mortgage : 400-500', 'Mortgage : 500-600']
loanDataMortgageBin = pd.cut(loanData.mortgage,bins,labels=mortgageGroup)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to mortgageGroupCol variable
mortgageGroupCol = pd.crosstab(loanDataMortgageBin,loanData.personal_loan).apply(lambda r: r/r.sum()*100, axis=1)
print(mortgageGroupCol)                                                    # printing above crosstab

# ploting a stacked bar chart to show loan status for different mortgage group
mortgageGroupCol.div(mortgageGroupCol.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("Loan Status with different Mortgage group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **The Mortgage group 500-600 has the highest loan conversion rate, at 66.667 %.**
    * **The Mortgage group 400-500 has the second highest conversion rate at 41.667 %, followed by the Mortgage group 300-400 at 31.250 %.**
    * **The Mortgage group 200-300 has a low conversion rate of 13.468 %.**
    * **The Mortgage groups 100-200 and 0-100 convert at only 5.145 % and 4.610 % respectively.**
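
The outlier check used for 'ccavg' and 'mortgage' follows the usual IQR rule, which could be factored into one function. The sketch below is an illustration under stated assumptions: the helper name `iqr_outliers`, its `k` parameter and the toy values are hypothetical. Using `k=3` gives the wider fences often used for "extreme" outliers.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical mortgage-like values (in $000), heavy on zeros
toy = pd.Series([0, 0, 0, 0, 50, 90, 101, 120, 300, 600], name="mortgage")

mild = iqr_outliers(toy)          # fences at 1.5 * IQR
extreme = iqr_outliers(toy, k=3)  # fences at 3 * IQR
print("1.5 * IQR rule:", mild.tolist())
print("3 * IQR rule  :", extreme.tolist())
```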

### 9. Analyzing 'securities_account' column : 

In [None]:
plt.figure(figsize=(10,5))                                   # setting figure size with width = 10 and height = 5
# seaborn count catplot to examine distribution of the securities_account
ax = sns.catplot(x='securities_account', kind="count", data=loanData)
plt.title("Distribution of column : 'securities_account'")   # setting title of the figure

As we can see above, far fewer customers have a securities account than do not.
* Lets check securities account column w.r.t loan status:

In [None]:
plt.figure(figsize=(10,5))                                   # setting figure size with width = 10 and height = 5
# seaborn count catplot to examine distribution of the securities_account
ax = sns.catplot(x='securities_account',hue='personal_loan', kind="count", data=loanData)
plt.title("Distribution of column : 'securities_account'")   # setting title of the figure

As we can see above, in absolute numbers most customers who took a personal loan have no securities account with the bank, which is expected because that group is much larger.
* Again lets see the numbers and percentage of loan conversion :

In [None]:
print("*"*70)
print("\033[1mNo of Customers taken loan or not w.r.t securities_account :\033[0m")
print("*"*70)
secGroupCol = pd.crosstab(loanData.securities_account,loanData.personal_loan)
print(secGroupCol)                                                 # printing above crosstab
print("*"*40)
print()
print("*"*70)
print("\033[1mPercentage of Customers taken loan or not w.r.t securities_account :\033[0m")
print("*"*70)
secGroupPer = pd.crosstab(loanData.securities_account,loanData.personal_loan).apply(lambda r: r/r.sum()*100, axis=1)
print(secGroupPer)                                                 # printing above crosstab
print("*"*40)
print("*"*100)
# ploting a stacked bar chart to show loan status w.r.t securities_account
secGroupCol.div(secGroupCol.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("Loan Status with different securities_account")         # setting title of the figure

- We can find out the following from the above crosstab:
    * **Customers with a securities account have a loan conversion rate of 11.494 %.**
    * **Customers without a securities account have a loan conversion rate of 9.379 %.**
    * Customers with a securities account are therefore slightly more likely to take the personal loan than customers without one.

### 10. Analyzing 'cd_account' column : 

In [None]:
plt.figure(figsize=(10,5))                           # setting figure size with width = 10 and height = 5
# seaborn count catplot to examine distribution of the cd_account
ax = sns.catplot(x='cd_account', kind="count", data=loanData)
plt.title("Distribution of column : 'cd_account'")   # setting title of the figure

As we can see above, far fewer customers have a CD (Certificate of Deposit) account than do not.
* Lets check cd account column w.r.t loan status:

In [None]:
plt.figure(figsize=(10,5))                             # setting figure size with width = 10 and height = 5
# seaborn count catplot to examine distribution of the cd_account
ax = sns.catplot(x='cd_account',hue='personal_loan', kind="count", data=loanData)
plt.title("Distribution of column : 'cd_account'")     # setting title of the figure

As we can see above, in absolute numbers most customers who took a personal loan have no CD account with the bank, which is expected because that group is much larger.
* Again lets see the numbers and percentage of loan conversion :

In [None]:
print("*"*70)
print("\033[1mNo of Customers taken loan or not w.r.t cd_account :\033[0m")
print("*"*70)
cdGroupCol = pd.crosstab(loanData.cd_account,loanData.personal_loan)
print(cdGroupCol)                                                 # printing above crosstab
print("*"*40)
print()
print("*"*70)
print("\033[1mPercentage of Customers taken loan or not w.r.t cd_account :\033[0m")
print("*"*70)
cdGroupPer = pd.crosstab(loanData.cd_account,loanData.personal_loan).apply(lambda r: r/r.sum()*100, axis=1)
print(cdGroupPer)                                                 # printing above crosstab
print("*"*40)
print("*"*100)
# ploting a stacked bar chart to show loan status w.r.t cd_account
cdGroupCol.div(cdGroupCol.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("Loan Status with different cd_account")                # setting title of the figure

- We can find out the following from the above crosstab:
    * **Customers with a CD account have a loan conversion rate of 46.358 %.**
    * **Customers without a CD account have a loan conversion rate of 7.237 %.**
    * Customers with a CD account are therefore far more likely to take the personal loan than customers without one.

### 11. Analyzing 'online' column : 

In [None]:
plt.figure(figsize=(10,5))                           # setting figure size with width = 10 and height = 5
# seaborn count catplot to examine distribution of the online
ax = sns.catplot(x='online', kind="count", data=loanData)
plt.title("Distribution of column : 'online'")       # setting title of the figure

As we can see above, more customers use online (Internet Banking) facilities than do not.
* Lets check online column w.r.t loan status:

In [None]:
plt.figure(figsize=(10,5))                            # setting figure size with width = 10 and height = 5
# seaborn count catplot to examine distribution of the online
ax = sns.catplot(x='online',hue='personal_loan', kind="count", data=loanData)
plt.title("Distribution of column : 'online'")        # setting title of the figure

As we can see above, in absolute numbers most customers who took a personal loan use Internet Banking, which is expected because that group is larger.
* Again lets see the numbers and percentage of loan conversion :

In [None]:
print("*"*70)
print("\033[1mNo of Customers taken loan or not w.r.t online :\033[0m")
print("*"*70)
ibGroupCol = pd.crosstab(loanData.online,loanData.personal_loan)
print(ibGroupCol)                                                 # printing above crosstab
print("*"*40)
print()
print("*"*70)
print("\033[1mPercentage of Customers taken loan or not w.r.t online :\033[0m")
print("*"*70)
ibGroupPer = pd.crosstab(loanData.online,loanData.personal_loan).apply(lambda r: r/r.sum()*100, axis=1)
print(ibGroupPer)                                                 # printing above crosstab
print("*"*40)
print("*"*100)
# ploting a stacked bar chart to show loan status w.r.t online
ibGroupCol.div(ibGroupCol.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("Loan Status with different online")                    # setting title of the figure

- We can find out the following from the above crosstab:
    * **Customers who use online banking have a loan conversion rate of 9.752 %.**
    * **Customers who do not use online banking have a loan conversion rate of 9.375 %.**
    * Customers who use online banking are therefore only slightly more likely to take the personal loan than customers who do not.

### 12. Analyzing 'creditcard' column : 

In [None]:
plt.figure(figsize=(10,5))                            # setting figure size with width = 10 and height = 5
# seaborn count catplot to examine distribution of the creditcard
ax = sns.catplot(x='creditcard', kind="count", data=loanData)
plt.title("Distribution of column : 'creditcard'")    # setting title of the figure

As we can see above, fewer customers have a Credit Card than do not.
* Lets check creditcard column w.r.t loan status:

In [None]:
plt.figure(figsize=(10,5))                             # setting figure size with width = 10 and height = 5
# seaborn count catplot to examine distribution of the creditcard
ax = sns.catplot(x='creditcard',hue='personal_loan', kind="count", data=loanData)
plt.title("Distribution of column : 'creditcard'")     # setting title of the figure

As we can see above, in absolute numbers most customers who took a personal loan have no Credit Card with the bank, which is expected because that group is larger.
* Again lets see the numbers and percentage of loan conversion :

In [None]:
print("*"*70)
print("\033[1mNo of Customers taken loan or not w.r.t creditcard :\033[0m")
print("*"*70)
ccGroupCol = pd.crosstab(loanData.creditcard,loanData.personal_loan)
print(ccGroupCol)                                                 # printing above crosstab
print("*"*40)
print()
print("*"*70)
print("\033[1mPercentage of Customers taken loan or not w.r.t creditcard :\033[0m")
print("*"*70)
ccGroupPer = pd.crosstab(loanData.creditcard,loanData.personal_loan).apply(lambda r: r/r.sum()*100, axis=1)
print(ccGroupPer)                                                 # printing above crosstab
print("*"*40)
print("*"*100)
# ploting a stacked bar chart to show loan status w.r.t creditcard
ccGroupCol.div(ccGroupCol.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("Loan Status with different creditcard")                # setting title of the figure

- We can find out the following from the above crosstab:
    * **Customers with a Credit Card have a loan conversion rate of 9.728 %.**
    * **Customers without a Credit Card have a loan conversion rate of 9.547 %.**
    * Customers with a Credit Card are therefore only slightly more likely to take the personal loan than customers without one.
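
Because the target is coded 0 / 1, the mean of `personal_loan` within a group equals that group's conversion rate, so the binary columns can be compared in a single table. The sketch below uses a hypothetical toy frame and hypothetical output column labels; it is meant as one possible shortcut, not as the notebook's own method.

```python
import pandas as pd

# Hypothetical toy data, not taken from the bank dataset
toy = pd.DataFrame({
    "online":        [1, 1, 1, 0, 0, 1, 0, 1],
    "creditcard":    [0, 1, 0, 0, 1, 0, 0, 1],
    "personal_loan": [1, 0, 0, 0, 1, 0, 0, 1],
})

binary_cols = ["online", "creditcard"]

# mean of a 0/1 target per group = conversion rate of that group (in %)
summary = pd.DataFrame({
    col: toy.groupby(col)["personal_loan"].mean() * 100
    for col in binary_cols
}).T
summary.columns = ["rate_if_0 (%)", "rate_if_1 (%)"]
print(summary.round(3))
```

A rate table like this complements the count plots: a group can contain most of the loan takers in absolute numbers and still have the lower conversion rate.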

### 3. Get the target column distribution. Your comments (5 marks) 

In [None]:
plt.figure(figsize=(10,5))                                 # setting figure size with width = 10 and height = 5
# seaborn count catplot to examine distribution of the personal_loan
ax = sns.catplot(x='personal_loan', kind="count", data=loanData)
plt.title("Distribution of column : 'personal_loan'")      # setting title of the figure
y = []                                                     # creating an empty list
for val in range(loanData.personal_loan.nunique()):        # looping for number of unique values in the personal_loan
    # appending count of each unique values from personal_loan to array y
    y.append(loanData.groupby(loanData.personal_loan,sort=False)['personal_loan'].count()[val])
for i, v in enumerate(y):                                  # looping count of each unique value in the personal_loan
    # including count of each unique values in the plot 
    plt.annotate(str(v), xy=(i,float(v)), xytext=(i-0.1, v+40), color='black', fontweight='bold')

* Lets check the percentage and plot a pie chart to show :

In [None]:
plt.figure(figsize=(5,5))                               # setting figure size with width = 5 and height = 5
# pandas pie chart to examine distribution of the personal loan
loanData.groupby(['personal_loan']).personal_loan.count().plot(kind='pie',labels=['No Personal Loan : 0','Personal Loan : 1'],
                                                               startangle=90, autopct='%1.1f%%')
plt.title("Distribution of column : 'personal_loan'")   # setting title of the figure

**From above we can see that out of 5000 customers, 480 (9.6 %) took a personal loan, which we can say is a very healthy conversion rate. It also means the target classes are imbalanced (90.4 % vs 9.6 %), so accuracy alone will not be a sufficient measure of model quality.**
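
To put later accuracy figures in context, it may help to work out what a trivial model would score on this target. The sketch below only uses the two counts quoted above (5000 customers, 480 loans); the variable names are illustrative.

```python
# Counts quoted in the text above
n_total = 5000
n_loan = 480
n_no_loan = n_total - n_loan

positive_rate = n_loan / n_total          # share of customers who took the loan
baseline_accuracy = n_no_loan / n_total   # accuracy of always predicting "no loan"
imbalance_ratio = n_no_loan / n_loan      # non-buyers per buyer

print("Positive class share : {:.1%}".format(positive_rate))
print("Majority-class accuracy baseline : {:.1%}".format(baseline_accuracy))
print("Imbalance ratio : {:.2f} to 1".format(imbalance_ratio))
```

A classifier therefore has to beat roughly 90.4 % accuracy to be better than always predicting "no loan", which is why recall and f1-score for class 1 matter in the model comparison.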

### \*\*\* Feature Analysis :\*\*\*

* **Before moving further, let us plot the pairplot using all attributes :**

In [None]:
sns.pairplot(loanData,hue='personal_loan',diag_kind='hist')

- **From above we can conclude :**
    * **age has a positive linear relationship with experience.**
    * **income, ccavg, mortgage histograms are not normally distributed.**

In [None]:
# checking correlation of independent variable with dependent variable i.e personal_loan
loanData.corr().personal_loan.sort_values(ascending=False)

- **From the above we can conclude the following:**
    * **'income', 'ccavg' and 'cd_account' have the strongest correlation with the target, so they are the most important features for deciding whether a customer will take a personal loan or not.**
    * **Other features such as 'mortgage', 'education', 'family', 'securities_account', 'online', 'creditcard' are also of some importance, in the order listed.**
    * **Since age and experience have very low correlation with the target attribute, they seem to be ineffective for predicting whether a customer will take a personal loan or not.**
- **Let's check their statistical significance in predicting the target attribute :**

In [None]:
# taking the columns for deriving significance to predict target attribute
signiLoanData = loanData.loc[:,['personal_loan', 'income', 'ccavg', 'cd_account', 'mortgage', 'education', 'family', 'securities_account', 'age','experience']]
signiLoanData['intercept'] = 1               # setting intercept to 1

# Significance test for numerical columns
logit = sm.Logit(signiLoanData['personal_loan'], signiLoanData[['intercept', 'income', 'ccavg', 'mortgage', 'age']]).fit()
logit.summary()

- **We can see that,**
    * **The $p$-values for Income and CCAvg are less than $\alpha = 0.05$. Hence, with 95% confidence, we can say that they are significant for predicting the target attribute class.**
    * **The $p$-value for mortgage is less than $\alpha = 0.10$. Hence, with 90% confidence, we can say that the mortgage attribute is significant for predicting the target attribute class.**
    * **The Age attribute seems to be insignificant for predicting the target attribute at $\alpha = 0.10$.**

In [None]:
# Let see the statistical significance of ordinal categorical variables Family and Education
logit = sm.Logit(signiLoanData['personal_loan'], signiLoanData[['intercept','family', 'education']]).fit()
logit.summary()

**Since the $p$-values of Family and Education are less than $\alpha = 0.05$, we can say with 95% confidence that both attributes are significant for predicting the target attribute class.**

**Among the independent categorical variables with a binary response, the correlations with the target attribute show that only CD Account is clearly correlated, whereas Securities Account is only slightly correlated. Let us test both:**

In [None]:
logit = sm.Logit(signiLoanData['personal_loan'], signiLoanData[['intercept','cd_account', 'securities_account']]).fit()
logit.summary()

**Since the $p$-values are less than 0.05, both the attributes are useful for target class prediction**

### 4. Split the data into training and test set in the ratio of 70:30 respectively (5 marks)

In [None]:
'''
We will drop 'age' from the dataset as 'age' has no significant effect on the dependent variable i.e personal_loan
and also 'age' and 'experience' are highly correlated with each other.
Note : id column is already shifted to index, so the slice below assumes 'age' is the first column
and the target 'personal_loan' is the last column of loanData.
'''
X = loanData.iloc[:,1:-1]
y = loanData.personal_loan

# splitting the data into training and test set in the ratio of 70:30 respectively
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=7)
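
With only about 9.6 % positives, a purely random split can leave the training and test sets with slightly different class ratios. One common option, not used in the split above, is the `stratify` argument of `train_test_split`. The sketch below shows it on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data with the same 9.6 % positive share as the target column
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 96 + [0] * 904)

# stratify=y_demo keeps the class ratio (almost) identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=7, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print("Positive share in train : {:.3f}".format(train_rate))
print("Positive share in test  : {:.3f}".format(test_rate))
```

Switching the notebook's split to a stratified one would change the reported metrics, so the numbers discussed below would have to be regenerated.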

### 5. Use different classification models (Logistic, K-NN and Naïve Bayes) to predict the likelihood of a customer buying personal loans (15 marks)

**For each model we will first evaluate performance on unscaled data and then on scaled data. For this, let's scale the dataset using MinMaxScaler and save the result in variables for later use.**

In [None]:
# Let us scale train as well as test data using MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn min / max from the training set only
X_test_scaled = scaler.transform(X_test)         # reuse the training min / max for the test set
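
A scaler is normally fitted on the training set only and then reused for the test set. Fitting it a second time on the test set maps the same raw value to a different scaled value, so the model sees test features on a different footing than the ones it was trained on. A minimal sketch with made-up numbers (the arrays below are illustrative, not the bank data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data
X_train_demo = np.array([[0.0], [50.0], [100.0]])
X_test_demo = np.array([[20.0], [60.0]])

# Fit on train, then transform test with the SAME min / max (0 and 100)
demo_scaler = MinMaxScaler().fit(X_train_demo)
consistent = demo_scaler.transform(X_test_demo)

# Re-fitting on the test set uses the test min / max (20 and 60) instead
refit = MinMaxScaler().fit_transform(X_test_demo)

print("Scaled with training statistics :", consistent.ravel())
print("Scaled with test statistics     :", refit.ravel())
```

In the first case a raw value of 20 stays at 0.2 on the training scale, while in the second case it becomes 0.0.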

### Logistic Regression

1. **On unscaled data**

In [None]:
logRegModel = LogisticRegression(solver='liblinear')          # creating Logistic Regression model using constructor
logRegModel.fit(X_train,y_train)                              # fit the model to training set
y_pred = logRegModel.predict(X_test)                          # predict the test data to get y_pred

lrAccScore = accuracy_score(y_test,y_pred)                    # get accuracy of model
print("-"*70)
print("Accuracy of the Logistic Regression model is {} %".format(lrAccScore*100))
print("-"*70)
print()
lrF1Score = f1_score(y_test,y_pred)                           # get F1-score of model
print("-"*70)
print("f1-score of the Logistic Regression model is {} %".format(lrF1Score*100))
print("-"*70)
print()
lrConMatrix = confusion_matrix(y_test,y_pred)                 # get the confusion matrix
print("-"*70)
print("Confusion matrix of the Logistic Regression model is: \n",lrConMatrix)
print("-"*70)
print()
lrClassReport = classification_report(y_test,y_pred)          # get the classification report
print("-"*70)
print("Detailed classification report for logistic regression is: \n",lrClassReport)
print("-"*70)
print()
print("-"*70)
lrProb = logRegModel.predict_proba(X_test)
lrProbFpr, lrProbTpr, lrProbThreshold = roc_curve(y_test,lrProb[:,1])
lrProbAUC = auc(lrProbFpr,lrProbTpr)
print("Area under the ROC curve for Logistic Regression is :{}".format(round(lrProbAUC,3)))
print("-"*70)

- **We can see from the above that accuracy with Logistic Regression on the unscaled dataset is 95%, which is good. Precision for class 1 is good at 83%, meaning that 83% of the customers predicted to take the loan actually did. However, the f1-score and recall for class 1 are somewhat lower, at 71% and 62% respectively. The area under the ROC curve is 0.963.**
2. **Lets check on scaled data :**

In [None]:
logRegModelScaled = LogisticRegression()                            # creating Logistic Regression model (default solver, not 'liblinear' as above)
logRegModelScaled.fit(X_train_scaled,y_train)                       # fit the model to training set
y_pred = logRegModelScaled.predict(X_test_scaled)                   # predict the test data to get y_pred

lrAccScoreScaled = accuracy_score(y_test,y_pred)                    # get accuracy of model
print("-"*70)
print("Accuracy of the Logistic Regression model is {} %".format(lrAccScoreScaled*100))
print("-"*70)
print()
lrF1ScoreScaled = f1_score(y_test,y_pred)                           # get F1-score of model
print("-"*70)
print("f1-score of the Logistic Regression model is {} %".format(lrF1ScoreScaled*100))
print("-"*70)
print()
lrConMatrixScaled = confusion_matrix(y_test,y_pred)                 # get the confusion matrix
print("-"*70)
print("Confusion matrix of the Logistic Regression model is: \n",lrConMatrixScaled)
print("-"*70)
print()
lrClassReportScaled = classification_report(y_test,y_pred)          # get the classification report
print("-"*70)
print("Detailed classification report for logistic regression is: \n",lrClassReportScaled)
print("-"*70)
print()
print("-"*70)
lrProbScaled = logRegModelScaled.predict_proba(X_test_scaled)
lrProbFprScaled, lrProbTprScaled, lrProbThresholdScaled = roc_curve(y_test,lrProbScaled[:,1])
lrProbAUCScaled = auc(lrProbFprScaled,lrProbTprScaled)
print("Area under the ROC curve for Logistic Regression is :{}".format(round(lrProbAUCScaled,3)))
print("-"*70)

* **Compared with logistic regression on unscaled data, accuracy is slightly lower here but the f1-score has increased. Recall for class 1 is 69%.**
* **A recall of 69% means that, out of all the customers who actually bought the loan, 69% were correctly predicted to be positive (would buy the personal loan).**
* **The area under the ROC curve is 0.966, which is almost equal to that of the earlier logistic regression model on unscaled data.**
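
Precision, recall and f1-score for class 1 can all be read off the confusion matrix. The sketch below recomputes them by hand on a tiny made-up set of labels (not the model output above), using the fact that `confusion_matrix(...).ravel()` returns `tn, fp, fn, tp` for a binary problem:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up labels: 6 non-buyers followed by 4 buyers
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)   # of those predicted to buy, how many actually bought
recall = tp / (tp + fn)      # of those who actually bought, how many were found
f1 = 2 * precision * recall / (precision + recall)

print("TN={}, FP={}, FN={}, TP={}".format(tn, fp, fn, tp))
print("precision={:.3f}, recall={:.3f}, f1={:.3f}".format(precision, recall, f1))
```

The hand-computed values should agree with `precision_score`, `recall_score` and `f1_score` from scikit-learn on the same labels.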

### Naive Bayes Classifier

1. **On unscaled data**

In [None]:
nbModel = GaussianNB()                                        # creating Gaussian Naive Bayes model using constructor
nbModel.fit(X_train,y_train)                                  # fit the model to training set
y_pred = nbModel.predict(X_test)                              # predict the test data to get y_pred

nbAccScore = accuracy_score(y_test,y_pred)                    # get accuracy of model
print("-"*70)
print("Accuracy of the Naive Bayes Classifier model is {} %".format(nbAccScore*100))
print("-"*70)
print()
nbF1Score = f1_score(y_test,y_pred)                           # get F1-score of model
print("-"*70)
print("f1-score of the Naive Bayes Classifier model is {} %".format(nbF1Score*100))
print("-"*70)
print()
nbConMatrix = confusion_matrix(y_test,y_pred)                 # get the confusion matrix
print("-"*70)
print("Confusion matrix of the Naive Bayes Classifier is: \n",nbConMatrix)
print("-"*70)
print()
nbClassReport = classification_report(y_test,y_pred)          # get the classification report
print("-"*70)
print("Detailed classification report for Naive Bayes Classifier is: \n",nbClassReport)
print("-"*70)
print()
print("-"*70)
nbProb = nbModel.predict_proba(X_test)
nbProbFpr, nbProbTpr, nbProbThreshold = roc_curve(y_test,nbProb[:,1])
nbProbAUC = auc(nbProbFpr,nbProbTpr)
print("Area under the ROC curve for Naive Bayes Classifier is :{}".format(round(nbProbAUC,3)))
print("-"*70)

- **Accuracy is slightly lower than for logistic regression on unscaled data, and the f1-score and the recall for class 1 are lower as well. The area under the ROC curve for the Naive Bayes Classifier is 0.926.**

2. **On scaled data:**

In [None]:
nbModelScaled = GaussianNB()                                        # creating Gaussian Naive Bayes model using constructor
nbModelScaled.fit(X_train_scaled,y_train)                           # fit the model to training set
y_pred = nbModelScaled.predict(X_test_scaled)                       # predict the test data to get y_pred

nbAccScoreScaled = accuracy_score(y_test,y_pred)                    # get accuracy of model
print("-"*70)
print("Accuracy of the Naive Bayes Classifier model is {} %".format(nbAccScoreScaled*100))
print("-"*70)
print()
nbF1ScoreScaled = f1_score(y_test,y_pred)                           # get F1-score of model
print("-"*70)
print("f1-score of the Naive Bayes Classifier model is {} %".format(nbF1ScoreScaled*100))
print("-"*70)
print()
nbConMatrixScaled = confusion_matrix(y_test,y_pred)                 # get the confusion matrix
print("-"*70)
print("Confusion matrix of the Naive Bayes Classifier is: \n",nbConMatrixScaled)
print("-"*70)
print()
nbClassReportScaled = classification_report(y_test,y_pred)          # get the classification report
print("-"*70)
print("Detailed classification report for Naive Bayes Classifier is: \n",nbClassReportScaled)
print("-"*70)
print()
print("-"*70)
nbProbScaled = nbModelScaled.predict_proba(X_test_scaled)
nbProbFprScaled, nbProbTprScaled, nbProbThresholdScaled = roc_curve(y_test,nbProbScaled[:,1])
nbProbAUCScaled = auc(nbProbFprScaled,nbProbTprScaled)
print("Area under the ROC curve for Naive Bayes Classifier is :{}".format(round(nbProbAUCScaled,3)))
print("-"*70)

**Scaling actually decreases the performance of the Naive Bayes classifier: there is no significant change in accuracy, but the f1-score is merely 3%, which is a below-par performance. The area under the ROC curve for the Naive Bayes Classifier is 0.925.**

### K-Nearest Neighbour Classifier

1. **On unscaled data**

In [None]:
knnModel = KNeighborsClassifier()                              # creating K-NN model using constructor
knnModel.fit(X_train,y_train)                                  # fit the model to training set
y_pred = knnModel.predict(X_test)                              # predict the test data to get y_pred

knnAccScore = accuracy_score(y_test,y_pred)                    # get accuracy of model
print("-"*70)
print("Accuracy of the K-NN Classifier model is {} %".format(knnAccScore*100))
print("-"*70)
print()
knnF1Score = f1_score(y_test,y_pred)                           # get F1-score of model
print("-"*70)
print("f1-score of the K-NN Classifier model is {} %".format(knnF1Score*100))
print("-"*70)
print()
knnConMatrix = confusion_matrix(y_test,y_pred)                 # get the confusion matrix
print("-"*70)
print("Confusion matrix of the K-NN Classifier is: \n",knnConMatrix)
print("-"*70)
print()
knnClassReport = classification_report(y_test,y_pred)          # get the classification report
print("-"*70)
print("Detailed classification report for K-NN Classifier is: \n",knnClassReport)
print("-"*70)
print()
print("-"*70)
knnProb = knnModel.predict_proba(X_test)
knnProbFpr, knnProbTpr, knnProbThreshold = roc_curve(y_test,knnProb[:,1])
knnProbAUC = auc(knnProbFpr,knnProbTpr)
print("Area under the ROC curve for K-NN Classifier is :{}".format(round(knnProbAUC,3)))
print("-"*70)

- **Accuracy is lower than for logistic regression, and the f1-score is the lowest among the compared algorithms. The recall for class 1 is merely 0.33, i.e., out of all the customers who actually took the loan, only 33% were correctly predicted to be positive.**
- **Area under the ROC curve for the K-NN Classifier is 0.885.**
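
The area under the ROC curve summarises how well the predicted scores rank class-1 customers above class-0 customers across all thresholds. As a minimal sketch on hypothetical labels and scores (not the notebook's data), the block below builds the ROC points by hand and integrates them with the trapezoidal rule, which is the rule `auc` applies to the output of `roc_curve`. The helper names `roc_points` and `trapezoid_area` are illustrative only.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by sweeping the threshold over the scores."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    fpr, tpr = [0.0], [0.0]
    # Visit the distinct score values from highest to lowest
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        tpr.append(tp / pos)
        fpr.append(fp / neg)
    return fpr, tpr


def trapezoid_area(x, y):
    """Area under the piecewise-linear curve through the points (x[i], y[i])."""
    return sum((x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2 for i in range(1, len(x)))


# Hypothetical labels and predicted class-1 scores (illustration only)
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.6]

fpr, tpr = roc_points(y_true, scores)
area = trapezoid_area(fpr, tpr)
print("Area under the ROC curve: {:.3f}".format(area))
```

A value of 1.0 would mean every class-1 example is scored above every class-0 example, while 0.5 corresponds to a random ranking.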

#### Accuracy and f1-score for the K-NN model with different numbers of neighbors on the unscaled dataset

In [None]:
# K-NN accuracy and f1-score for neighbors = 1,3,...,99
# Loop-local names are used so that knnModel, knnAccScore and knnF1Score of the
# 5-neighbor model above are not overwritten (they feed the comparison table later)
knnAcc = []
knnF1 = []
for i in range(1,100,2):
    print("Calculating the K-NN classifier accuracy for {} neighbors.".format(i))
    knnModelK = KNeighborsClassifier(n_neighbors=i)             # K-NN model with i neighbors
    knnModelK.fit(X_train,y_train)                              # fit the model to training set
    y_pred = knnModelK.predict(X_test)                          # predict the test data to get y_pred
    knnAcc.append(accuracy_score(y_test,y_pred)*100)            # accuracy of model in %
    knnF1.append(f1_score(y_test,y_pred)*100)                   # f1-score of model in %
dfKnn = pd.DataFrame({'n_neighbors':list(range(1,100,2)), 'Accuracy':knnAcc,'f1-score':knnF1})
dfKnn

* **We can observe that varying the number of neighbors does not give the K-NN algorithm much improvement in accuracy or f1-score over the best-performing Naive Bayes Classifier algorithm on unscaled data.**

2. **On Scaled Data:**

In [None]:
knnModelScaled = KNeighborsClassifier()                              # creating K-NN model using constructor
knnModelScaled.fit(X_train_scaled,y_train)                           # fit the model to training set
y_pred = knnModelScaled.predict(X_test_scaled)                       # predict the test data to get y_pred

knnAccScoreScaled = accuracy_score(y_test,y_pred)                    # get accuracy of model
print("-"*70)
print("Accuracy of the K-NN Classifier model is {} %".format(knnAccScoreScaled*100))
print("-"*70)
print()
knnF1ScoreScaled = f1_score(y_test,y_pred)                           # get F1-score of model
print("-"*70)
print("f1-score of the K-NN Classifier model is {} %".format(knnF1ScoreScaled*100))
print("-"*70)
print()
knnConMatrixScaled = confusion_matrix(y_test,y_pred)                 # get the confusion matrix
print("-"*70)
print("Confusion matrix of the K-NN Classifier is: \n",knnConMatrixScaled)
print("-"*70)
print()
knnClassReportScaled = classification_report(y_test,y_pred)          # get the classification report
print("-"*70)
print("Detailed classification report for K-NN Classifier is: \n",knnClassReportScaled)
print("-"*70)
print()
print("-"*70)
knnProbScaled = knnModelScaled.predict_proba(X_test_scaled)
knnProbFprScaled, knnProbTprScaled, knnProbThresholdScaled = roc_curve(y_test,knnProbScaled[:,1])
knnProbAUCScaled = auc(knnProbFprScaled,knnProbTprScaled)
print("Area under the ROC curve for K-NN Classifier is :{}".format(round(knnProbAUCScaled,3)))
print("-"*70)

* **Evidently K-NN (5 neighbors) on scaled data performs very well. The accuracy is 97%, and the f1-score and recall are also very good.**
* **Area under the ROC curve for the K-NN Classifier is 0.951.**

#### Accuracy and f1-score for the K-NN model with different numbers of neighbors on the scaled dataset

In [None]:
# K-NN accuracy and f1-score for neighbors = 1,3,...,99
# Loop-local names are used so that knnModel, knnAccScore and knnF1Score of the
# unscaled 5-neighbor model are not overwritten (they feed the comparison table later)
knnAcc = []
knnF1 = []
for i in range(1,100,2):
    print("Calculating the K-NN classifier accuracy for {} neighbors.".format(i))
    knnModelK = KNeighborsClassifier(n_neighbors=i)             # K-NN model with i neighbors
    knnModelK.fit(X_train_scaled,y_train)                       # fit the model to training set
    y_pred = knnModelK.predict(X_test_scaled)                   # predict the test data to get y_pred
    knnAcc.append(accuracy_score(y_test,y_pred)*100)            # accuracy of model in %
    knnF1.append(f1_score(y_test,y_pred)*100)                   # f1-score of model in %
dfKnn = pd.DataFrame({'n_neighbors':list(range(1,100,2)), 'Accuracy':knnAcc,'f1-score':knnF1})
dfKnn

**We can see that K-NN with 5 neighbors performs best in terms of both accuracy (97%) and f1-score (81.30%).**
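
Rather than scanning a 50-row table by eye, the best row can be read off programmatically. The sketch below shows one way to do this with pandas on a small table of the same shape as `dfKnn`; the frame `dfKnnDemo` and its scores are hypothetical and only illustrate the lookup.

```python
import pandas as pd

# Hypothetical scores for illustration only (assumption: NOT the notebook's results)
dfKnnDemo = pd.DataFrame({'n_neighbors': [1, 3, 5, 7, 9],
                          'Accuracy':    [96.1, 96.7, 96.9, 96.5, 96.2],
                          'f1-score':    [78.0, 80.5, 81.3, 78.9, 76.4]})

# Row with the highest f1-score
bestRow = dfKnnDemo.loc[dfKnnDemo['f1-score'].idxmax()]
print("Best n_neighbors by f1-score:", int(bestRow['n_neighbors']))

# The three best settings, sorted by f1-score
top3 = dfKnnDemo.sort_values('f1-score', ascending=False).head(3)
print(top3)
```

Applied to the real table, the same two lines would be `dfKnn.loc[dfKnn['f1-score'].idxmax()]` and `dfKnn.sort_values('f1-score', ascending=False).head(3)`.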

- **Accuracy, f1-score and Area under ROC for different Classification Algorithms used on unscaled and scaled dataset:**

In [None]:
dfComp = pd.DataFrame({'Classification Algorithm':['Logistic Regression', 'Naive Bayes', 'K-Nearest Neighbor'],
                       'Accuracy (%) on Unscaled':[lrAccScore*100, nbAccScore*100, knnAccScore*100],
                       'f1-score (%) on Unscaled':[lrF1Score*100, nbF1Score*100, knnF1Score*100],
                       'Area under ROC on Unscaled':[lrProbAUC, nbProbAUC, knnProbAUC],
                       'Accuracy (%) on Scaled':[lrAccScoreScaled*100, nbAccScoreScaled*100, knnAccScoreScaled*100],
                       'f1-score (%) on Scaled':[lrF1ScoreScaled*100, nbF1ScoreScaled*100, knnF1ScoreScaled*100],
                       'Area under ROC on Scaled':[lrProbAUCScaled, nbProbAUCScaled, knnProbAUCScaled] 
                      })

print("Following table shows comparison of the classification algorithms: ")
dfComp

### 6. Print the confusion matrix for all the above models (5 marks)

* **Logistic Regression confusion matrix on unscaled dataset**

In [None]:
# Logistic Regression confusion matrix on unscaled dataset
dfConMat = pd.DataFrame(lrConMatrix, index=["No","Yes"], columns=["No","Yes"])   # rows: actual class, columns: predicted class
plt.figure(figsize = (7,5))
sns.heatmap(dfConMat, annot=True ,fmt='g')

* **Logistic Regression confusion matrix on scaled dataset**

In [None]:
# Logistic Regression confusion matrix on scaled dataset
dfConMat = pd.DataFrame(lrConMatrixScaled, index=["No","Yes"], columns=["No","Yes"])   # rows: actual class, columns: predicted class
plt.figure(figsize = (7,5))
sns.heatmap(dfConMat, annot=True ,fmt='g')

* **Naive Bayes Classifier confusion matrix on unscaled dataset**

In [None]:
# Naive Bayes Classifier confusion matrix on unscaled dataset
dfConMat = pd.DataFrame(nbConMatrix, index=["No","Yes"], columns=["No","Yes"])   # rows: actual class, columns: predicted class
plt.figure(figsize = (7,5))
sns.heatmap(dfConMat, annot=True ,fmt='g')

* **Naive Bayes Classifier confusion matrix on scaled dataset**

In [None]:
# Naive Bayes Classifier confusion matrix on scaled dataset
dfConMat = pd.DataFrame(nbConMatrixScaled, index=["No","Yes"], columns=["No","Yes"])   # rows: actual class, columns: predicted class
plt.figure(figsize = (7,5))
sns.heatmap(dfConMat, annot=True ,fmt='g')

* **K-NN Classifier confusion matrix on unscaled dataset**

In [None]:
# K-NN Classifier confusion matrix on unscaled dataset
dfConMat = pd.DataFrame(knnConMatrix, index=["No","Yes"], columns=["No","Yes"])   # rows: actual class, columns: predicted class
plt.figure(figsize = (7,5))
sns.heatmap(dfConMat, annot=True ,fmt='g')

* **K-NN Classifier confusion matrix on scaled dataset**

In [None]:
# K-NN Classifier confusion matrix on scaled dataset
dfConMat = pd.DataFrame(knnConMatrixScaled, index=["No","Yes"], columns=["No","Yes"])   # rows: actual class, columns: predicted class
plt.figure(figsize = (7,5))
sns.heatmap(dfConMat, annot=True ,fmt='g')

### 7. Give your reasoning on which is the best model in this case and why it performs better? (5 marks)

* **On unscaled dataset :** Comparing the algorithms on unscaled data, Logistic Regression works best, with an accuracy of 95.333% and an f1-score of 70.833%. The reason it performs better is that Logistic Regression assumes that **(a)** the target / dependent variable is binary, i.e., it falls into one of two clear-cut categories, and **(b)** the independent variables are independent of each other, meaning there is no multicollinearity among them. In this particular case both assumptions hold, as can be seen in the correlation heatmap mentioned below. **Please note that we had earlier dropped the 'age' attribute, which had a correlation of $\rho = 0.99$ with the 'experience' attribute.**
* **On scaled dataset :** With the scaled dataset, the K-NN classifier works best, with an accuracy of 96.933% and an f1-score of 81.300%. The reason it performs better is that K-NN is a distance-based algorithm: after rescaling with the MinMaxScaler method, all attributes lie on a common range, so attributes with large values no longer dominate the distance computation, and the K-NN classifier's performance increased dramatically.
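
The effect of scaling on a distance-based method can be seen on a toy example. The sketch below uses four hypothetical customers described by (Income in \$000, Family size) and compares Euclidean distances before and after min-max scaling. The helper `min_max_scale` is a hand-written stand-in that applies the same formula as `MinMaxScaler`, $(x - \min) / (\max - \min)$ per column; the values are illustrative and not taken from the dataset.

```python
import math

# Hypothetical customers: (Income in $000, Family size), illustration only
a = (40.0, 1.0)
b = (48.0, 4.0)    # income close to a, family size very different
c = (55.0, 1.0)    # same family size as a, income somewhat higher
d = (200.0, 2.0)   # high-income customer that widens the income range
points = [a, b, c, d]


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def min_max_scale(rows):
    """Rescale every column to [0, 1] using its own minimum and maximum."""
    cols = list(zip(*rows))
    lows = [min(col) for col in cols]
    highs = [max(col) for col in cols]
    return [tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
            for row in rows]


sa, sb, sc, sd = min_max_scale(points)

print("Unscaled: d(a,b) = {:.3f}, d(a,c) = {:.3f}".format(euclidean(a, b), euclidean(a, c)))
print("Scaled  : d(a,b) = {:.3f}, d(a,c) = {:.3f}".format(euclidean(sa, sb), euclidean(sa, sc)))
```

On the raw values the nearest neighbor of `a` is `b`, because the comparison is driven almost entirely by income. After scaling, the family-size difference carries comparable weight and `c` becomes the nearest neighbor.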

### Let us check the performance of SVM (Support Vector Machine) on the scaled dataset:

In [None]:
svmModel = SVC()                                               # creating SVM model using constructor
svmModel.fit(X_train_scaled,y_train)                           # fit the model to training set
y_pred = svmModel.predict(X_test_scaled)                       # predict the test data to get y_pred

svmAccScore = accuracy_score(y_test,y_pred)                    # get accuracy of model
print("-"*70)
print("Accuracy of the SVM model is {} %".format(svmAccScore*100))
print("-"*70)
print()
svmF1Score = f1_score(y_test,y_pred)                           # get F1-score of model
print("-"*70)
print("f1-score of the SVM model is {} %".format(svmF1Score*100))
print("-"*70)
print()
svmConMatrix = confusion_matrix(y_test,y_pred)                 # get the confusion matrix
print("-"*70)
print("Confusion matrix of the SVM is: \n",svmConMatrix)
print("-"*70)
print()
svmClassReport = classification_report(y_test,y_pred)          # get the classification report
print("-"*70)
print("Detailed classification report for SVM is: \n",svmClassReport)
print("-"*70)
print()
print("-"*70)
svmScore = svmModel.decision_function(X_test_scaled)           # SVC() has no predict_proba by default, so use decision scores
svmFpr, svmTpr, svmThreshold = roc_curve(y_test,svmScore)
svmAUC = auc(svmFpr,svmTpr)
print("Area under the ROC curve for SVM is :{}".format(round(svmAUC,3)))
print("-"*70)

**For different values of C in the SVC constructor**

In [None]:
svmAcc=[]
svmF1 = []
for i in range(1,1000,100):
    print("Calculating the SVM classifier accuracy for C : {} ".format(i))
    svmModel = SVC(C=i)                                                # SVM model with regularization parameter C = i
    svmModel.fit(X_train_scaled,y_train)                               # fit the model to training set
    y_pred = svmModel.predict(X_test_scaled)                           # Predict the test data to get y_pred
    svmAccScore = accuracy_score(y_test,y_pred)                 # get accuracy of model
    svmAcc.append(svmAccScore*100)
    # get F1-score of model
    svmF1Score = f1_score(y_test,y_pred) 
    svmF1.append(svmF1Score*100)
dfSvm = pd.DataFrame({'C':list(range(1,1000,100)), 'Accuracy':svmAcc,'f1-score':svmF1})
dfSvm

**For the Support Vector Classifier with C = 201, we get an accuracy of 97.400% and an f1-score of around 86.598%, which is the best result for the given problem in comparison with the Logistic Regression, Naive Bayes and K-NN classifiers used earlier.**
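
The comparison above rests on accuracy and f1-score. To put the SVM on the same footing as the other models, its area under the ROC curve can be computed as well. `SVC` does not provide `predict_proba` unless it is constructed with `probability=True`, so the sketch below uses the `decision_function` scores instead; `roc_curve` accepts any score for which larger values mean "more likely class 1". The data is synthetic (generated with `make_classification`) and stands in for the scaled bank dataset, so the printed number is not a result for this project. All `*Demo` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (assumption: NOT the bank dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=8,
                                     weights=[0.9, 0.1], random_state=1)
X_trDemo, X_teDemo, y_trDemo, y_teDemo = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1, stratify=y_demo)

# Fit the scaler on the training split only, then apply it to both splits
scalerDemo = MinMaxScaler().fit(X_trDemo)
X_trScaledDemo = scalerDemo.transform(X_trDemo)
X_teScaledDemo = scalerDemo.transform(X_teDemo)

svmDemo = SVC(C=1.0)                                       # probability=False by default
svmDemo.fit(X_trScaledDemo, y_trDemo)

svmScoresDemo = svmDemo.decision_function(X_teScaledDemo)  # signed distance to the separating surface
svmFprDemo, svmTprDemo, _ = roc_curve(y_teDemo, svmScoresDemo)
svmAUCDemo = auc(svmFprDemo, svmTprDemo)
print("Area under the ROC curve for SVM (synthetic data): {}".format(round(svmAUCDemo, 3)))
```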