## Bank Telemarketing Campaign Case Study

#### Problem Statement:

 

The bank provides financial services and products such as savings accounts, current accounts and debit cards to its customers. To increase its overall revenue, the bank conducts various marketing campaigns for financial products such as credit cards, term deposits and loans. These campaigns are intended for the bank’s existing customers. However, the campaigns need to be cost-efficient so that the bank increases not only its overall revenue but also its total profit. You need to apply your knowledge of EDA to the given dataset to analyse the patterns and provide inferences/solutions for future marketing campaigns.

The bank conducted a telemarketing campaign for one of its financial products ‘Term Deposits’ to help foster long-term relationships with existing customers. The dataset contains information about all the customers who were contacted during a particular year to open term deposit accounts.


**What is a term deposit?**

Term deposits, also called fixed deposits, are cash investments made for a specific period, ranging from 1 month to 5 years, at a predetermined fixed interest rate. The fixed interest rates offered for term deposits are higher than the regular interest rates for savings accounts. The customer receives the total amount (the investment plus the interest) at the end of the maturity period. The money is meant to be withdrawn only at the end of the maturity period; withdrawing it earlier incurs a penalty, and the customer will not receive any interest returns.

Your goal is to perform end-to-end EDA on this bank telemarketing campaign dataset to identify where the bank should focus its effort to improve its positive response rate.

#### Importing the libraries.

In [None]:
#import the warnings.
import warnings
warnings.filterwarnings('ignore')

In [None]:
#import the useful libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Session- 2, Data Cleaning 

### Segment- 2, Data Types 

The dataset contains several kinds of variables: some are numerical and some are categorical. After reading the data frame, inspect the data type of each column.

The following are some of the variable types:
- **Numeric data type**: banking dataset: salary, balance, duration and age.
- **Categorical data type**: banking dataset: education, job, marital, poutcome and month etc.
- **Ordinal data type**: banking dataset: Age group.
- **Time and date type** 
- **Coordinates type of data**: latitude and longitude type.
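
A minimal sketch of how these variable types might be checked once a data frame is loaded. The toy frame below is hypothetical and only borrows column names from the banking dataset; `select_dtypes` is used to separate numeric columns from the rest.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the banking dataset.
toy = pd.DataFrame({
    "age": [34, 51, 27],
    "salary": [60000, 100000, 20000],
    "job": ["management", "technician", "student"],
    "marital": ["married", "single", "single"],
})

# Numeric columns can be summarised with mean, median, quantiles, etc.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Every non-numeric column is treated here as categorical.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

Ordinal variables such as an age group usually have to be declared by the analyst, since pandas cannot infer an ordering from text labels alone.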


#### Read in the Data set. 

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#read the data set of "bank telemarketing campaign" in inp0.
inp0= pd.read_csv(r'/kaggle/input/bank_marketing_updated_v1.csv')
inp0

In [None]:
#Print the head of the data frame.
inp0.head()

### Segment- 3, Fixing the Rows and Columns 

Checklist for fixing rows:
- **Delete summary rows**: Total and Subtotal rows
- **Delete incorrect rows**: Header row and footer row
- **Delete extra rows**: Column number, indicators, Blank rows, Page No.

Checklist for fixing columns:
- **Merge columns for creating unique identifiers**, if needed, for example, merge the columns State and City into the column Full address.
- **Split columns to get more data**: Split the Address column to get State and City columns to analyse each separately. 
- **Add column names**: Add column names if missing.
- **Rename columns consistently**: Abbreviations, encoded columns.
- **Delete columns**: Delete unnecessary columns.
- **Align misaligned columns**: The data set may have shifted columns, which you need to align correctly.
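
The checklist above can be sketched on a small in-memory file. Everything here is hypothetical (the junk rows, the `address` column and its contents); it only illustrates skipping extra rows, splitting a column and deleting columns that are no longer needed.

```python
import io
import pandas as pd

# Hypothetical raw file: two junk rows sit above the real header row.
raw = io.StringIO('''report title,
page 1,
customerid,address
1,"Pune, Maharashtra"
2,"Kochi, Kerala"
''')

# Delete extra rows: skip the two lines above the header.
toy = pd.read_csv(raw, skiprows=2)

# Split a column to get more data: address -> city and state.
parts = toy["address"].str.split(",", expand=True)
toy["city"] = parts[0].str.strip()
toy["state"] = parts[1].str.strip()

# Delete columns that are no longer needed.
toy = toy.drop(columns=["customerid", "address"])
print(toy)
```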


#### Read the file without unnecessary headers.

In [None]:
#read the file into inp0, skipping the first two rows, which are not part of the data.
inp0= pd.read_csv(r'/kaggle/input/bank_marketing_updated_v1.csv',skiprows=2)
inp0

In [None]:
#print the head of the data frame.
inp0.head()

In [None]:
#print the information of variables to check their data types.
#info() prints its summary directly and returns None, so it is called on its own.
inp0.info()

In [None]:
#convert the age variable from float to pandas' nullable integer type (Int64),
#which keeps missing ages as <NA> instead of raising an error.
inp0['age']=inp0['age'].astype('Int64')
inp0

In [None]:
#print the average age of customers.
avg=np.mean(inp0['age'])
print(round(avg,2))

#### Dropping customer id column. 

In [None]:
#drop the customer id as it is of no use.
inp0=inp0.drop(['customerid'],axis=1)
inp0

#### Dividing "jobedu" column into job and education categories. 

In [None]:
#Extract job in newly created 'job' column from "jobedu" column.
#str.split with expand=True returns one column per part of the string.
jobedu=inp0['jobedu'].str.split(',',expand=True)
inp0['job']=jobedu[0]
inp0

In [None]:
#Extract education in newly created 'education' column from "jobedu" column.
inp0['education']= jobedu[1]
inp0

In [None]:
#drop the "jobedu" column from the dataframe.
inp0=inp0.drop(['jobedu'],axis=1)
inp0

### Segment- 4, Impute/Remove missing values 

Takeaways from the lecture on missing values:

- **Set values as missing values**: Identify values that indicate missing data, for example, treat blank strings, "NA", "XX", "999", etc., as missing.
- **Adding is good, exaggerating is bad**: You should try to get information from reliable external sources as much as possible, but if you can’t, then it is better to retain missing values rather than exaggerating the existing rows/columns.
- **Delete rows and columns**: Rows can be deleted if the number of missing values is insignificant, as this would not impact the overall analysis results. Columns can be removed if the missing values are quite significant in number.
- **Fill partial missing values using business judgement**: Such values include missing time zone, century, etc. These values can be identified easily.

Types of missing values:
- **MCAR**: It stands for Missing completely at random (the reason behind the missing value is not dependent on any other feature).
- **MAR**: It stands for Missing at random (the reason behind the missing value may be associated with some other features).
- **MNAR**: It stands for Missing not at random (there is a specific reason behind the missing value).
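
As a hedged illustration of the first takeaway, the sketch below marks placeholder values as missing and then measures how much of each column is missing. The toy values and the choice of "NA", blank strings and 999 as placeholders are assumptions for the example, not facts about the bank dataset.

```python
import pandas as pd

# Hypothetical toy data in which "NA", "" and 999 stand for missing values.
toy = pd.DataFrame({
    "month": ["may", "NA", "jun", ""],
    "pdays": [12, 999, 5, 999],
})

# Set values as missing: mask() replaces matching entries with NaN.
toy["month"] = toy["month"].mask(toy["month"].isin(["NA", ""]))
toy["pdays"] = toy["pdays"].mask(toy["pdays"] == 999)

# Percentage of missing values per column, used to decide between
# dropping rows, dropping the column, or imputing.
missing_pct = toy.isna().mean() * 100
print(missing_pct)
```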


#### Handling missing values in age column.

In [None]:
#count the missing values in age column.
inp0.age.isnull().sum(axis=0)

In [None]:
#print the shape of dataframe inp0
inp0.shape

In [None]:
#calculate the percentage of missing values in age column.
round(100*(inp0.age.isnull().sum(axis=0)/len(inp0.index)),2)

Drop the records with age missing. 

In [None]:
#drop the records with age missing in inp0 and copy in inp1 dataframe.
inp1= inp0[~inp0.age.isnull()].copy()
inp1

#### Handling missing values in month column

In [None]:
#count the missing values in month column in inp1.
inp1.month.isnull().sum(axis=0)

In [None]:
#print the percentage of each month in the data frame inp1.
inp1.month.value_counts(normalize=True) *100

In [None]:
#find the mode of month in inp1
mode_month=inp1.month.mode()[0]
mode_month

In [None]:
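# fill the missing values with mode value of month in inp1.
# assign the result back: calling fillna(inplace=True) on a column accessed
# as an attribute may not update the data frame in recent pandas versions.
inp1['month']=inp1['month'].fillna(mode_month)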

In [None]:
#let's see the null values in the month column.
inp1.month.isna().sum()

#### Handling missing values in response column 

In [None]:
#count the missing values in response column in inp1.
inp1.response.isna().sum()

In [None]:
#calculate the percentage of missing values in response column. 
round(100*(inp1.response.isna().sum())/len(inp1.index),3)

The target variable is better left unimputed.
- Drop the records with missing values.

In [None]:
#drop the records with response missing in inp1.
#copy() makes inp1 an independent frame so that later column assignments are safe.
inp1=inp1[~inp1.response.isna()].copy()
inp1

In [None]:
#calculate the missing values in each column of data frame: inp1.
inp1.isna().sum()

#### Handling pdays column. 

In [None]:
#describe the pdays column of inp1.
inp1.pdays.describe()

In pdays, -1 indicates a missing value.
Missing values are not always stored as nulls.
How to handle it:

The objective is to:
- ignore the missing values in the calculations;
- simply mark them as missing by replacing -1 with NaN;
- compute all summary statistics (mean, median, etc.) without the missing values of pdays.
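
A small sketch of why this matters, using made-up numbers rather than the real pdays column: leaving -1 in place drags the mean down, whereas pandas skips NaN when computing summary statistics.

```python
import pandas as pd

# Hypothetical values: three customers never contacted before (-1), two who were.
pdays = pd.Series([-1, -1, -1, 10, 20])

# With the placeholder left in, the mean mixes real values with -1.
mean_with_placeholder = pdays.mean()

# where() keeps values that satisfy the condition and sets the rest to NaN.
cleaned = pdays.where(pdays >= 0)

# mean() skips NaN by default, so only the two real values are averaged.
mean_without_placeholder = cleaned.mean()

print(mean_with_placeholder, mean_without_placeholder)
```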

In [None]:
#replace the -1 values with NaN, then describe the pdays column without them.
#np.nan is used because the np.NaN alias was removed in NumPy 2.0.
inp1.loc[inp1.pdays<0,"pdays"]= np.nan
inp1.pdays.describe()

### Segment- 5, Handling Outliers 

Major approaches to treating outliers:

- **Imputation**
- **Deletion of outliers**
- **Binning of values**
- **Cap the outlier**
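
One possible way to flag and cap outliers is sketched below, assuming the common 1.5 × IQR rule as the threshold. The rule and the toy balances are assumptions made for illustration; the notebook itself does not prescribe a specific threshold.

```python
import pandas as pd

# Hypothetical balances with one extreme value.
balance = pd.Series([100, 250, 300, 400, 500, 650, 700, 800, 900, 100000])

# Upper fence from the interquartile range (assumed 1.5 * IQR rule).
q1, q3 = balance.quantile(0.25), balance.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

# Values above the fence are flagged as outliers.
outliers = balance[balance > upper_fence]

# Cap the outlier: clip() replaces anything above the fence with the fence.
capped = balance.clip(upper=upper_fence)

print(upper_fence, outliers.tolist(), capped.max())
```

Whether to cap, delete, bin or impute depends on whether the extreme values are errors or genuine observations, which is a business judgement rather than a purely statistical one.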


#### Age variable 

In [None]:
#describe the age variable in inp1.
inp1.age.describe()

In [None]:
#plot the histogram of age variable.
import matplotlib.pyplot as plt
inp1.age.plot.hist()
plt.show()

In [None]:
#plot the boxplot of age variable.
#pass the column as x= so the call is unambiguous across seaborn versions.
import seaborn as sns
sns.boxplot(x=inp1.age)
plt.show()

#### Salary variable 

In [None]:
#describe the salary variable of inp1.
inp1.salary.describe()

In [None]:
#plot the boxplot of salary variable.
sns.boxplot(x=inp1.salary)
plt.show()

#### Balance variable 

In [None]:
#describe the balance variable of inp1.
inp1.balance.describe()

In [None]:
#plot the boxplot of balance variable.
sns.boxplot(x=inp1.balance)
plt.show()

In [None]:
#plot the boxplot of balance variable on a wider 8x2 figure.
plt.figure(figsize=(8,2))
sns.boxplot(x=inp1.balance)
plt.show()

In [None]:
#print the quantile (0.5, 0.7, 0.9, 0.95 and 0.99) of balance variable
inp1.balance.quantile([0.5, 0.7, 0.9, 0.95,0.99])


### Segment- 6, Standardising values 

Checklist for data standardization exercises:
- **Standardise units**: Ensure all observations under one variable are expressed in a common and consistent unit, e.g., convert lbs to kg, miles/hr to km/hr, etc.
- **Scale values if required**: Make sure all the observations under one variable have a common scale.
- **Standardise precision** for better presentation of data, e.g., change 4.5312341 kg to 4.53 kg.
- **Remove extra characters** such as common prefixes/suffixes, leading/trailing/multiple spaces, etc. These are irrelevant to analysis.
- **Standardise case**: String variables may take various casing styles, e.g., UPPERCASE, lowercase, Title Case, Sentence case, etc.
- **Standardise format**: It is important to standardise the format of other elements such as date, name, etc., e.g., change 23/10/16 to 2016/10/23, “Modi, Narendra” to “Narendra Modi”, etc.
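
A few of the checklist items can be sketched with pandas string and datetime helpers. The toy columns below are hypothetical, and the date format `%d/%m/%y` is an assumption about how the raw dates are written.

```python
import pandas as pd

# Hypothetical messy values.
toy = pd.DataFrame({
    "job": ["  Admin ", "ADMIN", "admin"],
    "weight_kg": [4.5312341, 70.128, 55.0],
    "date": ["23/10/16", "01/02/17", "15/08/16"],
})

# Remove extra characters and standardise case.
toy["job"] = toy["job"].str.strip().str.lower()

# Standardise precision to two decimal places.
toy["weight_kg"] = toy["weight_kg"].round(2)

# Standardise format: parse day/month/2-digit-year, then write year/month/day.
toy["date"] = pd.to_datetime(toy["date"], format="%d/%m/%y").dt.strftime("%Y/%m/%d")

print(toy)
```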

#### Duration variable

In [None]:
#describe the duration variable of inp1
inp1.duration.describe()


In [None]:
#convert the duration variable into a single unit, i.e. minutes, and remove the sec or min suffix.
inp1.duration=inp1.duration.apply(lambda x: float(x.split()[0])/60 if x.find("sec")>0 else float(x.split()[0]))


In [None]:
#describe the duration variable
inp1.duration.describe()

## Session- 3, Univariate Analysis 

### Segment- 2, Categorical unordered univariate analysis 

Unordered data have no notion of high-low, more-less, etc. Examples:
- Type of loan taken by a person = home, personal, auto, etc.
- Organisation of a person = Sales, marketing, HR, etc.
- Job category of a person.
- Marital status of a person.


#### Marital status 

In [None]:
#calculate the percentage of each marital status category. 
marital=inp1.marital.value_counts(normalize=True)*100
marital

In [None]:
#plot the bar graph of percentage marital status categories
marital.plot.barh(color='r')
plt.show()

#### Job  

In [None]:
#calculate the percentage of each job status category.
job=inp1.job.value_counts(normalize=True)*100
job

In [None]:
#plot the bar graph of percentage job categories
job.plot.barh(color='g')
plt.show()

### Segment- 3, Categorical ordered univariate analysis 

Ordered variables have a natural ordering. Some examples from the bank marketing dataset are:
- Age group= <30, 30-40, 40-50 and so on.
- Month = Jan-Feb-Mar etc.
- Education = primary, secondary and so on.
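
Pandas does not know the order of text labels unless it is declared. A minimal sketch, using a hypothetical three-month sample, of how an ordered categorical type might be used so that counts come out in calendar order rather than alphabetical or frequency order:

```python
import pandas as pd

# Hypothetical sample of month labels.
months = pd.Series(["mar", "jan", "feb", "jan"])

# Declare the intended order explicitly.
month_order = ["jan", "feb", "mar"]
months = months.astype(pd.CategoricalDtype(categories=month_order, ordered=True))

# sort_index() now follows the declared order, not the alphabet.
counts = months.value_counts().sort_index()
print(counts)
```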

#### Education

In [None]:
#calculate the percentage of each education category.
education=inp1.education.value_counts(normalize=True)*100
education

In [None]:
#plot the pie chart of education categories
education.plot.pie()
plt.show()


#### poutcome 

In [None]:
#calculate the percentage of each poutcome category.
poutcome=inp1.poutcome.value_counts(normalize=True)*100
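poutcome

In [None]:
#plot the pie chart of poutcome categories.
poutcome.plot.pie()
plt.show()

In [None]:
#calculate the percentage of each poutcome category, excluding 'unknown'.
poutcometarget=inp1[~(inp1.poutcome=='unknown')].poutcome.value_counts(normalize=True)*100
poutcometarget

In [None]:
#plot the bar graph of poutcome categories, excluding 'unknown'.
poutcometarget.plot.bar()
plt.show()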
plt.show()

#### Response, the target variable 

In [None]:
#calculate the percentage of each response category.
response=inp1.response.value_counts(normalize=True)*100
response

In [None]:
#plot the pie chart of response categories
response.plot.pie()
plt.show()

## Session- 4, Bivariate and Multivariate Analysis

### Segment-2, Numeric- numeric analysis 

There are three ways to analyse two numeric variables together.
- **Scatter plot**: shows how one variable varies with another.
- **Correlation matrix**: measures the strength of the linear relationship between pairs of numeric variables.
- **Pair plot**: a grid of scatter plots for all pairs of numeric variables in the data frame.
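
To make the idea of a correlation matrix concrete, here is a small sketch on made-up numbers in which balance is an exact linear function of salary, so their correlation is 1, while age is unrelated by construction.

```python
import pandas as pd

# Hypothetical values; balance is exactly salary / 200.
toy = pd.DataFrame({
    "salary": [20000, 40000, 60000, 80000],
    "balance": [100, 200, 300, 400],
    "age": [50, 30, 45, 25],
})

# corr() returns the pairwise Pearson correlation by default.
corr = toy.corr()
print(corr)
```

A value near 0 only rules out a linear relationship; a scatter plot is still worth checking for non-linear patterns.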

In [None]:
#plot the scatter plot of balance and salary variable in inp0
inp0.plot.scatter(x='salary',y='balance')
plt.show()

In [None]:
#plot the scatter plot of balance and age variable in inp1
plt.scatter(inp1.age,inp1.balance)
plt.show()

In [None]:
#plot the pair plot of salary, balance and age in inp1 dataframe.
sns.pairplot(data=inp1,vars=['salary','balance','age'])
plt.show()

#### Correlation heat map 

In [None]:
#compute the correlation matrix of salary, balance and age in inp1 dataframe.
inp1[['salary','balance','age']].corr()

In [None]:
#plot the heat map of the correlation matrix.
sns.heatmap(inp1[['salary','balance','age']].corr(),annot=True,cmap='Reds')
plt.show()

### Segment- 4, Numerical categorical variable

#### Salary vs response 

In [None]:
#group by response to find the mean salary for the no & yes responses separately.
inp1.groupby("response")["salary"].mean()

In [None]:
#group by response to find the median salary for the no & yes responses separately.
inp1.groupby("response")["salary"].median()

In [None]:
#plot the box plot of salary for yes & no responses.
sns.boxplot(data=inp1,x='response',y='salary')
plt.show()

#### Balance vs response 

In [None]:
#plot the box plot of balance for yes & no responses.
sns.boxplot(data=inp1,x='response',y='balance')
plt.show()

In [None]:
#group by response to find the mean balance for the no & yes responses separately.
inp1.groupby("response")['balance'].mean()

In [None]:
#group by response to find the median balance for the no & yes responses separately.
inp1.groupby("response")['balance'].median()

##### 75th percentile 

In [None]:
#function to find the 75th percentile.
def p75(x):
    return np.quantile(x, 0.75)

In [None]:
#calculate the mean, median and 75th percentile of balance with response
inp1.groupby("response")['balance'].aggregate(["mean","median",p75])

In [None]:
#plot the bar graph of balance's mean and median with response.
inp1.groupby("response")['balance'].aggregate(["mean","median"]).plot.bar()
plt.show()

#### Education vs salary 

In [None]:
#group by education to find the mean of the salary for each education category.
inp1.groupby("education")["salary"].mean()

In [None]:
#groupby the education to find the median of the salary for each education category.
inp1.groupby("education")["salary"].median()

#### Job vs salary

In [None]:
#groupby the job to find the mean of the salary for each job category.
inp1.groupby("job")["salary"].mean()

### Segment- 5, Categorical categorical variable 

In [None]:
#create response_flag of numerical data type where response "yes"= 1, "no"= 0
inp1['response_flag']=np.where(inp1.response=='yes',1,0)
inp1.response_flag.value_counts()
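
The reason for building a 0/1 flag is that its mean within a group equals that group's positive response rate. A minimal sketch on hypothetical rows:

```python
import numpy as np
import pandas as pd

# Hypothetical customers: 1 of 2 singles and 1 of 3 married said yes.
toy = pd.DataFrame({
    "marital": ["single", "single", "married", "married", "married"],
    "response": ["yes", "no", "no", "no", "yes"],
})

# Same encoding as in the cell above: "yes" -> 1, anything else -> 0.
toy["response_flag"] = np.where(toy["response"] == "yes", 1, 0)

# The mean of a 0/1 column is the share of ones, i.e. the response rate.
rate = toy.groupby("marital")["response_flag"].mean()
print(rate)
```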

#### Education vs response rate

In [None]:
#calculate the mean of response_flag with different education categories.
inp1.groupby("education")["response_flag"].mean()

#### Marital vs response rate 

In [None]:
#calculate the mean of response_flag with different marital status categories.
inp1.groupby("marital")["response_flag"].mean()

In [None]:
#plot the bar graph of marital status with average value of response_flag
(inp1.groupby("marital")["response_flag"].mean()).plot.barh()

#### Loans vs response rate 

In [None]:
#plot the bar graph of personal loan status with average value of response_flag
(inp1.groupby("loan")["response_flag"].mean()*100).plot.bar()
plt.show()

#### Housing loans vs response rate 

In [None]:
#plot the bar graph of housing loan status with average value of response_flag
(inp1.groupby("housing")["response_flag"].mean()*100).plot.bar()


#### Age vs response 

In [None]:
#plot the boxplot of age with response_flag
sns.boxplot(data=inp1,x='response_flag',y='age')

##### making buckets from age columns 

In [None]:
#create the buckets of <30, 30-40, 40-50, 50-60 and 60+ from age column.
inp1["age_group"]= pd.cut(inp1.age,[0,30,40,50,60,120],labels=["<30", "30-40", "40-50","50-60","60+"])
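
One detail worth checking with `pd.cut` is which side of each bin is closed. By default the right edge is included, so with the bin edges used above an age of exactly 30 lands in the "<30" bucket. The sketch below compares the default with `right=False` on a few hypothetical ages; which convention is preferable is a judgement call.

```python
import pandas as pd

# Hypothetical ages sitting on or near the bin edges.
ages = pd.Series([29, 30, 31, 60, 61])
edges = [0, 30, 40, 50, 60, 120]
labels = ["<30", "30-40", "40-50", "50-60", "60+"]

# Default: bins are (0, 30], (30, 40], ... so 30 falls in "<30".
default_bins = pd.cut(ages, edges, labels=labels)

# right=False: bins are [0, 30), [30, 40), ... so 30 falls in "30-40".
left_closed = pd.cut(ages, edges, labels=labels, right=False)

print(default_bins.tolist())
print(left_closed.tolist())
```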

In [None]:
#plot the percentage of each buckets and average values of response_flag in each buckets. plot in subplots.
plt.figure(figsize=[10,4])
plt.subplot(1,2,1)
(inp1.age_group.value_counts(normalize=True)*100).plot.bar()
plt.subplot(1,2,2)
(inp1.groupby(['age_group'])['response_flag'].mean()*100).plot.bar()
plt.show()

In [None]:
#plot the bar graph of job categories with response_flag mean value.
(inp1.groupby(['job'])['response_flag'].mean()*100).plot.barh()
plt.show()

### Segment-6, Multivariate analysis 

#### Education vs marital vs response 

In [None]:
#create heat map of education vs marital vs response_flag

ax=pd.pivot_table(data=inp1,index="education",columns='marital',values='response_flag')
sns.heatmap(ax,annot=True,cmap='PiYG')
plt.show()
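
Each cell of the heat map above is the response rate for one combination of categories, because `pd.pivot_table` aggregates with the mean by default. A small sketch on hypothetical rows shows the table behind such a heat map, including how a combination with no customers appears as NaN:

```python
import pandas as pd

# Hypothetical rows; there is no "primary" + "married" customer.
toy = pd.DataFrame({
    "education": ["primary", "primary", "tertiary", "tertiary"],
    "marital": ["single", "single", "single", "married"],
    "response_flag": [1, 0, 1, 0],
})

# The default aggregation is the mean, i.e. the response rate per cell.
table = pd.pivot_table(toy, index="education", columns="marital",
                       values="response_flag")
print(table)
```

Cells built from very few customers can show extreme rates, so it may help to look at the counts per cell before drawing conclusions.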


#### Job vs marital vs response 

In [None]:
#create the heat map of Job vs marital vs response_flag.
ax=pd.pivot_table(data=inp1,index='job',columns='marital',values='response_flag')
sns.heatmap(ax,annot=True,cmap='rainbow')
plt.show()

#### Education vs poutcome vs response

In [None]:
#create the heat map of education vs poutcome vs response_flag.
ax=pd.pivot_table(data=inp1,index='education',columns='poutcome',values='response_flag')
sns.heatmap(ax,annot=True,cmap='plasma_r')
plt.show()