# Exploratory data analysis on the bank marketing data set with Pandas and Seaborn

Import libraries (very important)

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns


The data set, bank.csv, is loaded with Pandas using the command shown below:

In [None]:
bank_data = pd.read_csv("../input/bank-marketing-dataset/bank.csv")
bank_data.head(5)

A quick glance at the data set reveals that there are 17 columns in total, namely 'age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome' and 'deposit'.

The shape of the data set, i.e. its dimensions as a (rows, columns) pair, is given by the 'shape' attribute in pandas as illustrated below.
There are 11162 rows (records) and 17 columns (attributes) in the bank data set.

In [None]:
bank_data.shape
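
Since 'shape' is a plain tuple, it can be unpacked into separate row and column counts. A minimal sketch, assuming a small made-up frame (`toy` is hypothetical and only stands in for the bank data):

```python
import pandas as pd

# Hypothetical three-row frame standing in for bank_data
toy = pd.DataFrame({
    "age": [59, 56, 41],
    "job": ["admin.", "admin.", "technician"],
    "deposit": ["yes", "yes", "no"],
})

# shape is a (rows, columns) tuple, so it unpacks directly
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```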


Using the 'info' command, details about each attribute in the data set can be obtained, namely its non-null count and its data type.
For example, in the bank data set used, 'age' is a non-null attribute with type integer, while 'job'
is a non-null attribute with type object. In pandas, type object usually means that the column holds text; in this data set those text columns are the categorical variables.

In [None]:
bank_data.info()
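
The non-null counts reported by 'info' can also be checked column by column. The sketch below uses a made-up frame with a few deliberately missing values (`toy` and its contents are hypothetical, not taken from the bank data):

```python
import pandas as pd

# Hypothetical frame with two missing values planted on purpose
toy = pd.DataFrame({
    "age": [59, 56, None],
    "job": ["admin.", None, "technician"],
    "balance": [2343, 45, 1270],
})

# Number of missing entries in each column
missing_per_column = toy.isna().sum()
# Data type pandas inferred for each column
column_types = toy.dtypes

print(missing_per_column)
print(column_types)
```

A column whose missing count is 0 corresponds to what 'info' reports as fully non-null.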

Using the 'describe' function from pandas, further details about the numerical variables can be
obtained, such as the count, mean, standard deviation, and minimum and maximum values.

In [None]:
bank_data.describe()

From the above output, we can make the following conclusions:
age: the minimum age of the bank's clients is 18, while the maximum age is 95. The average age of the customers is about 41.

balance: the mean customer balance is 1528.54, while the minimum balance is -6847.00. The maximum balance stands at 81204.00.

duration: the longest single contact lasted 3881 seconds, while the shortest contact with a client lasted 2 seconds.

campaign: the maximum number of contacts made in the campaign to a single client is 63, while the minimum number of contacts is 1. The average number of contacts made was about 2.5.

pdays: a maximum of 854 days passed after a client was last contacted in a previous campaign. The minimum value of -1 is not a real day count; in this data set it conventionally marks clients who were not previously contacted.

previous: a maximum of 58 contacts were made to a single client before the current campaign, while the minimum was 0.
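
Because a value of -1 in 'pdays' is usually read as a placeholder for "not previously contacted" rather than a real day count, it can pull the summary statistics down. A hedged sketch, with made-up values, of masking the placeholder before summarizing (the -1 convention is an assumption to confirm against the data set description):

```python
import pandas as pd

# Hypothetical pdays values; -1 is assumed to mean "not previously contacted"
toy = pd.DataFrame({"pdays": [-1, -1, 90, 180, -1, 30]})

# Replace the placeholder with NaN so it is skipped by the statistics
contacted_days = toy["pdays"].where(toy["pdays"] != -1)

raw_mean = toy["pdays"].mean()         # still includes the -1 placeholders
masked_mean = contacted_days.mean()    # NaN entries are ignored
never_contacted = (toy["pdays"] == -1).sum()

print(raw_mean, masked_mean, never_contacted)
```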


Some specifics now. First we will look at how some variables relate to the target variable, 'deposit'.
Age and Deposit: the bar plot below compares the mean age of clients who subscribed to the term deposit with
the mean age of those who did not. Each bar shows a group mean, not the number of clients in an age range.

In [None]:
sns.set_style('whitegrid')
plt.figure(figsize=(14,7))
sns.barplot(x=bank_data['deposit'], y=bank_data['age'])


Balance and Deposit: the bar plot below compares the mean balance of clients who
subscribed to the term deposit with the mean balance of those who did not.

In [None]:
sns.set_style('whitegrid')
plt.figure(figsize=(14,7))
sns.barplot(x=bank_data['deposit'], y=bank_data['balance'])

Campaign and Deposit: the bar plot below compares the mean number of contacts made to clients who
subscribed to the term deposit with the mean number made to those who did not.

In [None]:
sns.set_style('whitegrid')
plt.figure(figsize=(14,7))
sns.barplot(x=bank_data['deposit'], y=bank_data['campaign'])
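
By default, a Seaborn bar plot draws the mean of the y variable for each x category, so the bar heights in the three plots above can be reproduced as plain numbers with a group-by. A minimal sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical frame with the target and three numeric columns
toy = pd.DataFrame({
    "deposit": ["yes", "yes", "no", "no"],
    "age": [30, 50, 40, 44],
    "balance": [1000, 3000, 500, 1500],
    "campaign": [1, 3, 2, 6],
})

# Mean of each numeric column within each deposit group
group_means = toy.groupby("deposit")[["age", "balance", "campaign"]].mean()
print(group_means)
```

Reading the table alongside the plots makes it clear that the bars compare averages between the two groups, not how many clients fall in a given range.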

Now let us look at some categorical variables. The categorical variables in our data set are shown below. We will use countplot from Seaborn to visualize them.

In [None]:
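# Boolean mask: True for columns stored with dtype object
s = (bank_data.dtypes == 'object')
# Names of the columns where the mask is True
objectcols = list(s[s].index)
bank_data_object = bank_data[objectcols]
bank_data_object.head(5)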
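
An alternative way to split the columns is 'select_dtypes', which filters by type directly. The sketch below is one possible approach on a made-up frame (`toy` is hypothetical); it treats every non-numeric column as categorical, which is an assumption that happens to suit this kind of data:

```python
import pandas as pd

# Hypothetical frame mixing numeric and text columns
toy = pd.DataFrame({
    "age": [59, 56],
    "job": ["admin.", "technician"],
    "balance": [2343, 45],
    "deposit": ["yes", "no"],
})

# Non-numeric columns are treated as categorical here (assumption)
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print(categorical_cols)
print(numeric_cols)
```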


Job: from the plot below, we can see that clients with management jobs form the largest group contacted in the campaign.

In [None]:
sns.set_style('whitegrid')
plt.figure(figsize=(14,7))
sns.countplot(x='job', data=bank_data)

Marital Status: married clients form the largest group in the campaign.

In [None]:
sns.set_style('whitegrid')
plt.figure(figsize=(14,7))
sns.countplot(x='marital', data=bank_data)

Education: clients with a secondary education form the largest group in the campaign.

In [None]:
sns.set_style('whitegrid')
plt.figure(figsize=(14,7))
sns.countplot(x='education', data=bank_data)

Loan: most clients had not taken a personal loan.

In [None]:
sns.set_style('whitegrid')
plt.figure(figsize=(14,7))
sns.countplot(x='loan', data=bank_data)

Housing: most clients did not have a housing loan.

In [None]:
sns.set_style('whitegrid')
plt.figure(figsize=(14,7))
sns.countplot(x='housing', data=bank_data)
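
The bar heights in a count plot are simply the number of rows per category, so the same information is available as numbers from 'value_counts'. A small sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical categorical columns
toy = pd.DataFrame({
    "marital": ["married", "single", "married", "divorced", "married"],
    "housing": ["no", "no", "yes", "no", "yes"],
})

# Rows per category, sorted from most to least frequent
marital_counts = toy["marital"].value_counts()
# normalize=True returns shares of the total instead of raw counts
housing_share = toy["housing"].value_counts(normalize=True)

print(marital_counts)
print(housing_share)
```

Shares are often easier to compare than raw counts when the categories differ a lot in size.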

We can also use a heatmap to visualize the correlation between the numerical variables.

In [None]:
plt.figure(figsize=(14,7))
# Correlate only the numeric columns; text columns cannot be correlated directly
cor = bank_data.select_dtypes(include='number').corr()
sns.heatmap(cor, annot=True)
plt.show()
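
The target 'deposit' is a text column, so it does not appear in a correlation matrix of numeric columns. One possible way to include it, sketched here on made-up data, is to encode it as a 0/1 flag first (`toy` and the column name `deposit_flag` are hypothetical):

```python
import pandas as pd

# Hypothetical frame with two numeric columns and the text target
toy = pd.DataFrame({
    "duration": [100, 200, 300, 400],
    "campaign": [4, 3, 2, 1],
    "deposit": ["no", "no", "yes", "yes"],
})

# Keep the numeric columns and add the target as a 0/1 flag
numeric = toy.select_dtypes(include="number").copy()
numeric["deposit_flag"] = (toy["deposit"] == "yes").astype(int)

# Pairwise Pearson correlation, the default method of corr()
cor = numeric.corr()
print(cor)
```

The resulting matrix can be passed to a heatmap in the same way as any other correlation matrix.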

So, was the campaign successful? Out of 11162 records, there were 5289 subscriptions to the term deposit (about 47%), while 5873 clients did not subscribe. Not too bad.

In [None]:
(bank_data['deposit']=='yes').sum()

In [None]:
sns.set_style('whitegrid')
plt.figure(figsize=(14,7))
sns.countplot(x='deposit', data=bank_data)
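
The subscription rate follows directly from the two counts reported above. A short worked calculation, using only those figures (the variable names are illustrative):

```python
# Counts taken from the text above
subscribed = 5289
not_subscribed = 5873

total = subscribed + not_subscribed
rate = subscribed / total

print(f"{total} records, subscription rate {rate:.1%}")
```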