## Mini Project | Supervised Learning and Ensembles
----
#### Submitted By: GROUP 5

**Team Members**

> **Sandeep Keswani**

> **Vivek V Krishnan**

> **Seeju Kumar** 

> **Shashank Shekhar**

#### Case Study
**Campaign for selling personal loans**

This case is about a bank (Thera Bank) which has a growing customer base. The majority of these customers are liability customers (depositors) with deposits of varying size. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and, in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors). A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise better-targeted campaigns that increase the success ratio on a minimal budget.
The department wants to build a model that will help them identify the potential customers who have a higher probability of purchasing the loan. This will increase the success ratio while at the same time reducing the cost of the campaign.
The file given below contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.
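
The figures quoted above imply a heavily imbalanced target. As a quick sanity check that uses only the counts stated in the case, the sketch below reproduces the 9.6% acceptance rate and the accuracy that a trivial "nobody accepts" prediction would reach. The variable names are illustrative and do not come from the data set.

```python
# Counts taken from the case description
total_customers = 5000
accepted = 480

# Share of customers who accepted the loan in the earlier campaign
acceptance_rate = accepted / total_customers

# Accuracy of a trivial model that predicts "no loan" for every customer
majority_baseline_accuracy = 1 - acceptance_rate

print(f"Acceptance rate: {acceptance_rate:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

Because this baseline is already high, accuracy alone is unlikely to separate a useful model from a useless one; recall and precision on the accepted class are probably more informative here.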

Data Set: **Bank_Personal_Loan_Modelling-1.xlsx**

#### Considering the information provided above, follow the steps given below:

    1. Read the column description and ensure you understand each attribute well. 
    2. Study the data distribution in each attribute and share your findings.
    3. Get the target column distribution and comment on it.
    4. Split the data into training and test set in the ratio of 70:30 respectively.
    5. Use a classification model to predict the likelihood of a liability customer buying personal loans. 
    6. Explain why you chose one model over the other (do not use ensemble techniques yet).
    7. Use ensemble techniques to improve the performance.
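
Steps 3 and 4 above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's actual split: the feature matrix is random, the label vector only mimics the 480-in-5000 ratio from the case, and the use of `stratify` is an assumption (the brief asks only for a 70:30 ratio).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 5000 rows, 480 positives (9.6%)
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))   # three made-up features
y = np.zeros(5000, dtype=int)
y[:480] = 1                      # 480 "accepted" labels

# Step 3: target distribution as class shares
classes, counts = np.unique(y, return_counts=True)
for cls, cnt in zip(classes, counts):
    print(f"class {cls}: {cnt} rows ({cnt / len(y):.1%})")

# Step 4: 70:30 split; stratify=y keeps the class ratio equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(f"train positive rate: {y_train.mean():.3f}")
print(f"test positive rate:  {y_test.mean():.3f}")
```

Stratifying is a design choice worth considering with a rare positive class, since a purely random split could leave the test set with noticeably more or fewer loan takers than the training set.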

#### Please Note:

    - Total marks allotted for the Mini Project is 50.
    - Only one person per group should make the submission for their group.
    - Please mention the group number and members on the first page of the submission.
    - Please submit working code and output (of each step) along with it in PDF, HTML and ipynb format.
    - Please add necessary comments in all the files and make a managerial report based on that.
    - Name format of files should always be Group #_PGPBDA.B.AUG16_SLE_MiniPro.extension

#### Marks Distribution is as follows –

    1. Step 1 : Nil Points 
    2. Step 2 : 10 points 
    3. Step 3 : 5 points 
    4. Step 4 : 5 points 
    5. Step 5 : 10 points 
    6. Step 6 : 10 points 
    7. Step 7 : 10 points

In [1]:
%matplotlib inline

In [2]:
import pandas as pd

In [3]:
import sklearn as sk

In [4]:
import seaborn as sns

In [5]:
import matplotlib.pyplot as plt

In [6]:
import numpy as np

In [7]:
from scipy.stats import zscore

In [8]:
# sheet_name=1 selects the second sheet of the workbook (0-based index)
df = pd.read_excel("data/Bank_Personal_Loan_Modelling-1.xlsx", sheet_name=1)

In [39]:
df_columns = ['ID', 'Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg', 'Education', 'Mortgage', 'PersonalLoan', 'SecuritiesAccount', 'CDAccount', 'Online', 'CreditCard']

In [40]:
df.columns = df_columns

In [41]:
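def highlight_zero(s):
    """Return per-cell CSS that flags zero or negative values in a Series."""
    is_zero = s <= 0  # True for zero and for negative values
    return ['background-color: pink; color: red' if v else '' for v in is_zero]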

In [42]:
df.head(10).style.apply(highlight_zero)

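| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | PersonalLoan | SecuritiesAccount | CDAccount | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5 | 6 | 37 | 13 | 29 | 92121 | 4 | 0.4 | 2 | 155 | 0 | 0 | 0 | 1 | 0 |
| 6 | 7 | 53 | 27 | 72 | 91711 | 2 | 1.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 7 | 8 | 50 | 24 | 22 | 93943 | 1 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 1 |
| 8 | 9 | 35 | 10 | 81 | 90089 | 3 | 0.6 | 2 | 104 | 0 | 0 | 0 | 1 | 0 |
| 9 | 10 | 34 | 9 | 180 | 93023 | 1 | 8.9 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |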


In [43]:
df.describe().transpose().style.apply(highlight_zero)

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ID | 5000 | 2500.5 | 1443.52 | 1 | 1250.75 | 2500.5 | 3750.25 | 5000 |
| Age | 5000 | 45.3384 | 11.4632 | 23 | 35.0 | 45.0 | 55.0 | 67 |
| Experience | 5000 | 20.1046 | 11.468 | -3 | 10.0 | 20.0 | 30.0 | 43 |
| Income | 5000 | 73.7742 | 46.0337 | 8 | 39.0 | 64.0 | 98.0 | 224 |
| ZIPCode | 5000 | 93152.5 | 2121.85 | 9307 | 91911.0 | 93437.0 | 94608.0 | 96651 |
| Family | 5000 | 2.3964 | 1.14766 | 1 | 1.0 | 2.0 | 3.0 | 4 |
| CCAvg | 5000 | 1.93791 | 1.74767 | 0 | 0.7 | 1.5 | 2.5 | 10 |
| Education | 5000 | 1.881 | 0.839869 | 1 | 1.0 | 2.0 | 3.0 | 3 |
| Mortgage | 5000 | 56.4988 | 101.714 | 0 | 0.0 | 0.0 | 101.0 | 635 |
| PersonalLoan | 5000 | 0.096 | 0.294621 | 0 | 0.0 | 0.0 | 0.0 | 1 |
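
The summary lists a minimum of -3 for `Experience`, which cannot be a valid number of years. A minimal sketch of how such rows might be counted and treated is given below. It runs on a hypothetical four-row frame rather than on `df`, and clipping to zero is only one possible treatment, assumed here for illustration and not necessarily the one used in this project.

```python
import pandas as pd

# Hypothetical mini-frame; the -3 mirrors the minimum shown in the summary
sample = pd.DataFrame({
    "Age": [25, 24, 45, 29],
    "Experience": [1, -3, 19, -1],
})

# Boolean mask of rows whose Experience is below zero
negative_mask = sample["Experience"] < 0
print(f"Rows with negative Experience: {negative_mask.sum()}")

# One possible treatment (an assumption, not the authors' choice):
# clip negative values to zero, leaving the original frame untouched
cleaned = sample.assign(Experience=sample["Experience"].clip(lower=0))
print(cleaned)
```

Alternatives such as taking the absolute value or imputing from customers of similar age would also be defensible; which one fits depends on how the negative values arose.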


In [44]:
df.isnull().any()

ID                   False
Age                  False
Experience           False
Income               False
ZIPCode              False
Family               False
CCAvg                False
Education            False
Mortgage             False
PersonalLoan         False
SecuritiesAccount    False
CDAccount            False
Online               False
CreditCard           False
dtype: bool

In [45]:
df.isna().sum()

ID                   0
Age                  0
Experience           0
Income               0
ZIPCode              0
Family               0
CCAvg                0
Education            0
Mortgage             0
PersonalLoan         0
SecuritiesAccount    0
CDAccount            0
Online               0
CreditCard           0
dtype: int64