# This Notebook is part of an assignment submission made to BITS Pilani

<img src = "https://www.bits-pilani.ac.in/Uploads/Campus/BITS_university_logo.gif" style="height:50px">

Work Integrated Learning Programmes Division<br>
M.Tech (Data Science and Engineering)<br> Data Visualization And Interpretation (DSECL ZG555)<br>
Second Semester, 2020-21


##  Download and Prep the Data

Import the modules needed

In [None]:
# Pandas for managing and using dataframes
import pandas as pd

# Matplotlib is a basic plotting library in python
import matplotlib.pyplot as plt

# Built over matplotlib, seaborn offers many more features
import seaborn as sns

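# Import the constants submodule explicitly; a bare `import scipy` does not
# guarantee that `scipy.constants` is loaded
from scipy import constants

# Notebook context with a white, grid-free style
sns.set_theme(context='notebook', style='white')

# Golden ratio, used later to size figures
golden_ratio = constants.golden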

from pprint import pprint

Download the dataset and read it into a dataframe

In [None]:
# The name of the data file
datafile = '../input/bank-marketing-dataset-bits/dvi-a2-ps2-data.csv'

# Reading the data file into a dataframe object
data = pd.read_csv(datafile)

First five items in the dataset.

In [None]:
data.head(5)

Find out how many entries there are in the dataset.

In [None]:
print(f'Number of records in the dataset: {data.shape[0]}')

- Clean up data.
- Remove unnecessary columns. Mention the reasons.
- Show the data.

In [None]:
# Getting unique values for each column as a dictionary
def unique_dictionary(data):
    columns = data.columns
    unique_dict = dict()
    for column in columns:
        unique_dict[column] = '  '.join(data[column].unique().astype('str').tolist())
    return unique_dict

def dict_prettyprint(dictionary):
    for key in dictionary.keys():
        print('{:<12} : {}'.format(key, dictionary[key]))

def dataframe_print_unique(data):
    dict_prettyprint(unique_dictionary(data))


In [None]:
print('Unique Values before cleaning')
print()
dataframe_print_unique(data[['job','marital','education','default','housing','loan','contact','day_of_week','poutcome','subscription']])

In [None]:
# Cleaning

# Job type has a value 'admin.'. Changing it into 'admin'.
# regex=False matches the dot literally instead of treating it as a wildcard
data['job'] = data['job'].str.replace('admin.', 'admin', regex=False)

# Changing 'nonexistent' in poutcome to 'unknown' to match the syntax of the rest of the data
data['poutcome'] = data['poutcome'].str.replace('nonexistent','unknown')

In [None]:
print('Unique Values after cleaning')
print()
dataframe_print_unique(data[['job','marital','education','default','housing','loan','contact','day_of_week','poutcome','subscription']])
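
The cleaning step replaces the label `admin.`, which contains a dot. As a small illustration of why the matching mode matters, the sketch below compares a literal replacement with a regex replacement. The job labels here are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical job labels, chosen to show the difference between the two modes
jobs = pd.Series(['admin.', 'administrator', 'blue-collar'])

# Literal match: only the exact substring 'admin.' is replaced
literal = jobs.str.replace('admin.', 'admin', regex=False)

# Regex match: '.' matches any character, so 'admini' is replaced as well
pattern = jobs.str.replace('admin.', 'admin', regex=True)

print(literal.tolist())
print(pattern.tolist())
```

With a regex match, any label that merely starts with `admin` followed by one more character is altered too, so a literal match is the safer choice for this kind of cleanup.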

# Visualisation Questions

### Question 1
#### Find the correlation and plot the heat map for the correlation between features.
#### Write the python code in the below cell to create appropriate visual to perform the above task.
#### Answer in markdown cells below the visual
1. Summarise your findings from the visual.
2. The reason for selecting the chart type you did.
3. Mention the pre-attentive attributes used (at least 2).
4. Mention the gestalt principles used (at least 2).

In [None]:
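# Correlate numeric columns only; text columns such as 'job' have no Pearson correlation
correlation = data.select_dtypes(include='number').corr()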
fig, ax = plt.subplots(figsize=(12,10))
ax = sns.heatmap(correlation, annot=True, cmap='RdBu', square=True)
plt.xticks(rotation=45)
ax.set_xlabel('Features', fontweight='bold', fontsize=14)
ax.set_ylabel('Features', fontweight='bold', fontsize=14)
fig.suptitle('Feature Heatmap', fontsize=25)
plt.show()

**1. Findings**
- Several pairs of features show strong positive or strong negative correlation.
- `subscription` is only weakly correlated with the other features.

**2. Reason**

A heatmap shows every pairwise correlation in a single view, which makes it easy to spot patterns and dependencies between features in the data.

**3. Pre-attentive attributes**
- Visual hierarchy of information
- Using a color theme (red and blue)

**4. Gestalt Principles**
- Closure (there is a virtual bounding box of the figure)
- Similarity (closely correlated values have similar shades)
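
To read the strongest relationships off a correlation matrix without scanning the whole heatmap, the pairs can be ranked by absolute correlation. This is a minimal sketch on a made-up frame; the column names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical mini-frame mixing numeric and text columns
toy = pd.DataFrame({
    'age': [25, 35, 45, 55],
    'duration': [100, 200, 300, 400],
    'job': ['admin', 'technician', 'admin', 'services'],
    'subscription': [0, 0, 1, 1],
})

# Keep numeric columns only, then compute Pearson correlations
numeric = toy.select_dtypes(include='number')
corr = numeric.corr()

# Flatten the matrix and keep each unordered pair once (no diagonal)
pairs = corr.stack()
first = pairs.index.get_level_values(0)
second = pairs.index.get_level_values(1)
pairs = pairs[first < second]

# Strongest relationships first, regardless of sign
print(pairs.sort_values(key=abs, ascending=False))
```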

### Question 2
Find the age distribution and plot a histogram for it. Then check which age group is most likely to subscribe to the bank's offer.

Write the python code in the below cell to create the appropriate visual to perform the above task.

#### Answer in markdown cells below the visual
1. Summarise your findings from the visual.
2. The reason for selecting the chart type you did.
3. Mention the pre-attentive attributes used (at least 2).
4. Mention the gestalt principles used (at least 2).


In [None]:
height = 10
fig, ax = plt.subplots(figsize=(golden_ratio*height,height))

ax.set_title('Age distribution histogram', fontsize=24)
N0, bins0, patches0 = ax.hist(data[data['subscription'] == 0]['age'], 20, histtype='bar', label='Not Subscribed', color='#ffdccc')
N1, bins1, patches1 = ax.hist(x=data[data['subscription'] == 1]['age'], bins=20, label='Subscribed', color='#ccffcc')

ax.annotate(text='Most calls made to younger people', xy=(36,6000),xytext=(40,6000),fontsize=14,
           arrowprops=dict(arrowstyle='<-',edgecolor='black'))
for p in patches0:
    if (p.get_height() >= 5000):
        p.set_facecolor('tab:orange')
for p in patches1:
    if (p.get_height() >= 600):
        p.set_facecolor('tab:green')


ax.set_xlabel('Age', fontweight='bold', fontsize=14)
ax.set_ylabel('Count', fontweight='bold', fontsize=14)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['bottom'].set_visible(False)
plt.xlim(10,80)
ax.legend(['Not Subscribed','Subscribed'], fontsize=14, title='Subscription').get_title().set_fontsize('14')

plt.show()

**1. Findings**
- The organisation appears to follow a sensible pattern by calling individuals aged between 30 and 35 most often.
- In the right tail, beyond about 62 (retirement age), the ratio of unsubscribed to subscribed customers has increased.

**2. Reason**

We want to see how the age of customers relates to their interest in subscribing to a plan from the bank. A histogram shows the age distribution of the customers the bank contacted, split by whether or not they subscribed.

**3. Pre-attentive attributes**
- Visual hierarchy of information
- Using a color theme
- Highlighting important values

**4. Gestalt Principles**
- Closure (there is a virtual bounding box of the figure)
- Proximity (the histogram bars of same group are very close in value to each other)
- Focal Point (2 bars in `Subscribed` and `Not Subscribed` are highlighted)
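
The histogram compares counts, while the question asks which age group is most likely to subscribe. One way to check this directly is to compute the share of subscribers per age band. The sketch below uses invented records and band edges that we assume purely for illustration.

```python
import pandas as pd

# Hypothetical records: age and a 0/1 subscription flag
toy = pd.DataFrame({
    'age': [22, 28, 33, 34, 41, 47, 58, 63, 67, 71],
    'subscription': [1, 0, 0, 0, 0, 1, 0, 1, 1, 0],
})

# Assumed age bands; the edges are illustrative, not taken from the assignment
bins = [17, 30, 45, 60, 100]
labels = ['18-30', '31-45', '46-60', '61+']
toy['age_group'] = pd.cut(toy['age'], bins=bins, labels=labels)

# The mean of a 0/1 flag is the share of subscribers in each band
rate = toy.groupby('age_group', observed=False)['subscription'].mean()
print(rate)
```

A rate per band separates "contacted often" from "likely to subscribe", which raw counts cannot do on their own.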

### Question 3
Visualize the number of contacts made in each month.

Write the python code in the below cell to create the appropriate visual to perform the above task.
#### Answer in markdown cells below the visual
1. Summarise your findings from the visual.
2. The reason for selecting the chart type you did.
3. Mention the pre-attentive attributes used (at least 2).
4. Mention the gestalt principles used (at least 2).


In [None]:
data_month_grouped = data.groupby('month').count().reset_index()
data_month_grouped["Month"] = pd.to_datetime(data_month_grouped.month, format='%b', errors='coerce').dt.month
data_month_grouped = data_month_grouped.sort_values(by="Month").reset_index(drop=True)

custom_palette = ['lightgrey']*9
custom_palette.insert(2,'tab:orange')

fig, ax = plt.subplots(figsize=(12,8))
sns.barplot(ax = ax, data=data_month_grouped, y='month',x='subscription', palette=custom_palette)
ax.set_title('Contacts made each month', fontsize=24)
ax.set_ylabel('Month', fontweight='bold', fontsize=14)
ax.set_xlabel('Number of Contacts', fontweight='bold', fontsize=14)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.tick_params(axis='y', which='both', length=0)

plt.show()

**1. Findings**
- A month or two into the financial year, the number of contacts rises considerably.
- Budgeting cannot be assumed to be the reason for the rise in calls in May. The onset of summer may be correlated with it, but that does not imply causation. Data from more than one year would be needed to draw firmer conclusions.

**2. Reason**

A horizontal bar chart can
- accommodate the names of the months
- occupy less space and impose a lower cognitive load

The bars are not sorted by value because we are looking at the number of calls made in each month, so calendar order is kept.

**3. Pre-attentive attributes**
- Highlighting important values
- Bar length

**4. Gestalt Principles**
- Closure (there is a virtual bounding box of the figure)
- Focal Point (the bar for `may` is emphasised by color)
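
Keeping the bars in calendar order requires telling pandas how months are ordered, since month abbreviations sort alphabetically by default. As an alternative sketch to parsing the abbreviations as dates, an ordered categorical can carry the calendar order. The contact log below is invented for the example.

```python
import pandas as pd

# Hypothetical contact log with three-letter month abbreviations
toy = pd.DataFrame({'month': ['may', 'mar', 'dec', 'may', 'jul', 'mar', 'may']})

month_order = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
               'jul', 'aug', 'sep', 'oct', 'nov', 'dec']

# An ordered categorical makes sorting follow the calendar, not the alphabet
toy['month'] = pd.Categorical(toy['month'], categories=month_order, ordered=True)

# Count contacts per month, sort by calendar order, drop months with no contacts
contacts = toy['month'].value_counts().sort_index()
contacts = contacts[contacts > 0]
print(contacts)
```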

### Question 4
Categorize the data by grouping into education divisions and check which sector is more likely to subscribe.

Write the python code in the below cell to create the appropriate visual to perform the above task.
#### Answer in markdown cells below the visual
1. Summarise your findings from the visual.
2. The reason for selecting the chart type you did.
3. Mention the pre-attentive attributes used (at least 2).
4. Mention the gestalt principles used (at least 2).


In [None]:
data_education_grouped = data.groupby(['education', 'subscription']).count()[['age']].reset_index()
education_df = pd.DataFrame({'education': ['illiterate','basic.4y','basic.6y','basic.9y','high.school','professional.course','university.degree','unknown']})
education_df = education_df.reset_index().set_index('education')
data_education_grouped['education_index'] = data_education_grouped['education'].map(education_df['index'])
data_education_grouped = data_education_grouped.sort_values(by='education_index')

custom_palette = ['lightgrey']*7
custom_palette.insert(6, 'tab:orange')

fig, ax = plt.subplots(figsize=(12,8))
sns.barplot(ax = ax, data=data_education_grouped[data_education_grouped['subscription'] == 1], y='education',x='age', palette = custom_palette)
ax.set_title('Education division vs Subscriptions', fontsize=24)
ax.set_ylabel('Education', fontweight='bold', fontsize=14)
ax.set_xlabel('Subscription count', fontweight='bold', fontsize=14)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.tick_params(axis='y', which='both', length=0)

plt.show()

**1. Findings**
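- Education appears to help people understand how a banking system works.
- People with higher education subscribe to a term deposit or insurance with the bank in larger numbers. Because the chart shows raw counts, part of this gap may reflect how many people in each group were contacted.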

**2. Reason**

A horizontal bar chart can
- accommodate the long names of educational qualifications
- occupy less space and impose a lower cognitive load

The bars are sorted by level of education rather than by value, to show how subscriptions change as education increases.

**3. Pre-attentive attributes**
- Highlighting important values
- Bar length

**4. Gestalt Principles**
- Closure (there is a virtual bounding box of the figure)
- Focal Point (the bar for `university.degree` is emphasised by color)
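
Subscription counts depend on how many people in each education group were contacted. To compare likelihood rather than volume, counts and rates can be put side by side. This is a sketch on made-up records; the real column has more levels and different numbers.

```python
import pandas as pd

# Hypothetical records; values are invented for illustration
toy = pd.DataFrame({
    'education': ['basic.9y', 'basic.9y', 'basic.9y', 'basic.9y',
                  'university.degree', 'university.degree',
                  'high.school', 'high.school'],
    'subscription': [0, 0, 0, 1, 1, 0, 0, 0],
})

summary = toy.groupby('education')['subscription'].agg(
    contacted='count',   # number of people contacted
    subscribed='sum',    # number who subscribed
    rate='mean',         # share who subscribed
)
print(summary.sort_values('rate', ascending=False))
```

In this toy data both `basic.9y` and `university.degree` have one subscriber, yet their rates differ because the groups differ in size.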

### Question 5
Plot the chart to show the total number of clients subscribed to the deposit.

Write the python code in the below cell to create the appropriate visual to perform the above task.
#### Answer in markdown cells below the visual
1. Summarise your findings from the visual.
2. The reason for selecting the chart type you did.
3. Mention the pre-attentive attributes used (at least 2).
4. Mention the gestalt principles used (at least 2).

In [None]:
data_subscription_grouped = data.groupby('subscription').count()[['age']].reset_index()
data_subscription_grouped['subscription'] = data_subscription_grouped['subscription'].astype('str').str.replace('0','Not Subscribed')
data_subscription_grouped['subscription'] = data_subscription_grouped['subscription'].astype('str').str.replace('1','Subscribed')

custom_palette = ['lightgrey', 'tab:orange']

fig, ax = plt.subplots(figsize=(12,2))
sns.barplot(ax = ax, data=data_subscription_grouped, y='subscription',x='age', palette = custom_palette)
ax.set_title('Clients Subscribed', fontsize=24)
ax.set_ylabel('', fontweight='bold', fontsize=14)
ax.set_xlabel('Subscription count', fontweight='bold', fontsize=14)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.tick_params(axis='y', which='both', length=0)

plt.show()

**1. Findings**
- Most of the calls made by the bank to prospective customers get a negative response. Only about 12% of calls are successful.

**2. Reason**

A horizontal bar chart can
- accommodate the long labels
- occupy less space and impose a lower cognitive load

**3. Pre-attentive attributes**
- Highlighting important values
- Bar length

**4. Gestalt Principles**
- Closure (there is a virtual bounding box of the figure)
- Proximity (the label for each bar sits very close to the bar, showing what information the bar conveys)
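
A success percentage like the one quoted in the findings can be computed directly from the 0/1 flag. The sketch below uses ten invented flags, so its output is not the dataset's actual figure.

```python
import pandas as pd

# Hypothetical 0/1 subscription flags
flags = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 0, 0], name='subscription')

# Map the numeric codes to readable labels before counting
labels = {0: 'Not Subscribed', 1: 'Subscribed'}
share = flags.map(labels).value_counts(normalize=True)

# Express the shares as percentages
print((share * 100).round(1))
```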

#### Frame 1 more question which will help in the EDA (Exploratory Data Analysis) of the given data set and answer it using the best visual.

 1. Write the question in a markdown cell.
 2. Below the question, in a coding cell, write the python code to create the visual to answer the question.

#### Answer in markdown cells below the visual
1. Summarise your findings from the visual.
2. The reason for selecting the chart type you did.
3. Mention the pre-attentive attributes used (at least 2).
4. Mention the gestalt principles used (at least 2).
  

#### Q. Which job types are most common among the people contacted about a term deposit?
In today's world, where we are stuck at home due to the pandemic, many have also lost their jobs. They now rely on the money they saved and the returns on the investments they had made. We have seen above that education is an important criterion in deciding how one invests and whether one subscribes to a term deposit.

With this question we wanted to investigate which jobs are the most popular and what level of education their holders have. Because this data comes from calls made by a bank to potential customers, we cannot be sure that it is demographically representative. Nonetheless, we can get some interesting insights from it.

To answer this we used two bar plots: one showing the most popular jobs among prospective customers, and one showing how much education they received before landing those jobs.

In [None]:
data_job_marital_subscription_grouped = data.groupby(by=['job', 'subscription']).count()[['age']].reset_index()

# data_job_marital_subscription_grouped = data_job_marital_subscription_grouped[data_job_marital_subscription_grouped['subscription'] == 1]
data_job_marital_subscription_grouped = data_job_marital_subscription_grouped[data_job_marital_subscription_grouped['job'] != 'unknown']
data_job_marital_subscription_grouped = data_job_marital_subscription_grouped[data_job_marital_subscription_grouped['job'] != 'retired']

data_job_marital_subscription_grouped = data_job_marital_subscription_grouped.sort_values(by='age', ascending=False)

custom_palette = ['#f38a20'] + ['#0d8bf2'] + ['lightgrey']*8

fig, ax = plt.subplots(1,2,figsize=(8*golden_ratio*2,8))
sns.barplot(ax=ax[0], data=data_job_marital_subscription_grouped, y='job', x='age', ci=None, palette=custom_palette)
sns.despine(left=True, bottom=True)
ax[0].set_title('Most popular Job', fontsize=24)
ax[0].set_ylabel('', fontweight='bold', fontsize=14)
ax[0].set_xlabel('Count', fontweight='bold', fontsize=14)



data_job_marital_subscription_grouped = data.groupby(by=['education', 'job', 'subscription']).count()[['age']].reset_index()

data_job_marital_subscription_grouped = data_job_marital_subscription_grouped[(data_job_marital_subscription_grouped['job'] == 'admin') | (data_job_marital_subscription_grouped['job'] == 'blue-collar')]
data_job_marital_subscription_grouped = data_job_marital_subscription_grouped[data_job_marital_subscription_grouped['education'] != 'unknown']
data_job_marital_subscription_grouped = data_job_marital_subscription_grouped[data_job_marital_subscription_grouped['education'] != 'illiterate']

custom_palette=['#fbd9b6', '#b6dcfb']

sns.barplot(ax=ax[1], data=data_job_marital_subscription_grouped, y='education', x='age',hue='job', ci=None, palette=custom_palette)
sns.despine(left=True, bottom=True)
ax[1].set_title('Job with Education', fontsize=24)
ax[1].set_ylabel('', fontweight='bold', fontsize=14)
ax[1].set_xlabel('Count', fontweight='bold', fontsize=14)
ax[1].legend(['admin','blue-collar'], fontsize=14, title='Job Type').get_title().set_fontsize('14')

ax[1].patches[8].set_facecolor('#0d8bf2')
ax[1].patches[5].set_facecolor('#f38a20')

ax[1].text(1500,1.8, 
'''Most Blue-collar job holders have
completed only basic 9 year education''', fontsize=14)

ax[1].text(2000,4.5, 
'''While University degree holders have
an admin job''', fontsize=14)

**1. Findings**
- `admin` and `blue-collar` are the most common jobs among the people contacted.
- Most `blue-collar` workers completed only `basic 9 year` education, while most `admin` workers hold a `university degree`.
- The data looks skewed: far more `university degree` holders were contacted than people with other `education` levels.

**2. Reason**

A horizontal bar chart can
- accommodate the long labels
- occupy less space and impose a lower cognitive load

**3. Pre-attentive attributes**
- Highlighting important values
- Bar length

**4. Gestalt Principles**
- Proximity (text describing the bars is quite close to the bar)
- Foreground and Background
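
The relationship between job and education can also be tabulated before plotting. As a minimal sketch, a cross-tabulation gives the counts per combination, and normalising by column gives the education mix within each job. The records below are invented for the example.

```python
import pandas as pd

# Hypothetical records; values are invented for illustration
toy = pd.DataFrame({
    'job': ['admin', 'admin', 'admin',
            'blue-collar', 'blue-collar', 'blue-collar'],
    'education': ['university.degree', 'university.degree', 'high.school',
                  'basic.9y', 'basic.9y', 'basic.6y'],
})

# Raw counts for every education and job combination
counts = pd.crosstab(toy['education'], toy['job'])

# Normalising by column gives the education mix within each job
mix = pd.crosstab(toy['education'], toy['job'], normalize='columns')

print(counts)
print(mix.round(2))
```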