**Attributes Description:**
1. Bank client data:
1 - age: (numeric)
2 - job: type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital: marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education: (categorical: 'primary','secondary','tertiary','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
8 - balance: account balance of the client (numeric)
2. Related to the last contact of the current campaign:
9 - contact: contact communication type (categorical: 'cellular','telephone')
10 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
11 - day: last contact day. The original description lists days of the week ('mon','tue','wed','thu','fri'), but the analysis below treats this column as the day of the month.
12 - duration: last contact duration, in seconds (numeric). Important note: this attribute strongly affects the output target (e.g., if duration=0 then y='no'). However, the duration is not known before a call is made, and once the call ends y is obviously known. This input should therefore be included only for benchmark purposes and discarded if the goal is a realistic predictive model.
3. Other attributes:
13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14 - pdays: number of days that passed after the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted)
15 - previous: number of contacts performed before this campaign and for this client (numeric)
16 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
4. Output variable (desired target):
17 - y: has the client subscribed to a term deposit? (binary: 'yes','no'; stored in the `deposit` column of bank.csv)

First things first, let's import the libraries that we need and then load the dataset "bank.csv".

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
sns.set(style='whitegrid')
%matplotlib inline

In [None]:
df = pd.read_csv('../input/bank-marketing-dataset/bank.csv')

## Part 1: Let's do some data wrangling

In [None]:
df.head()

In [None]:
df.info()

In [None]:
#It seems that there is no null data. Just to be sure, let's check again
df.isnull().sum()

In [None]:
#Numerical exploration
df.describe()

In [None]:
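# Check the mean of the numerical attributes above
# (numeric_only=True skips the categorical columns, which recent pandas versions refuse to average)
df.mean(numeric_only=True)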

## Part 2: Let's plot these numerical attributes to see the distribution

In [None]:
df.hist(figsize=(14,10),bins=15,color='g')

Let's examine the Balance a little bit

In [None]:
plt.figure(figsize=(15,5))
sns.violinplot(x='job',y='balance',data=df,palette='Set2')
plt.title('Distribution of balance by job')

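There are outliers among the retired clients with very high balances (>800,000). Since `balance` is the client's account balance rather than an amount owed, these are more plausibly retirees with large accumulated savings than clients who borrowed heavily and cannot repay their debt.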

In [None]:
plt.figure(figsize=(8,4))
sns.violinplot(x='education',y='balance',data=df,palette='Set2')
plt.title('Distribution of balance by education')

In [None]:
plt.figure(figsize=(10,6))
sns.scatterplot(x='age',y='balance',data=df,palette='Set2',hue='marital')
plt.title('Age vs Balance')

The scatter plot shows no clear relationship between age and balance.
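
A visual impression like this can be backed up with a correlation coefficient. The following is a minimal sketch on a small made-up frame; `toy` and its values are assumptions for illustration only and stand in for `df[['age', 'balance']]`.

```python
import pandas as pd

# Hypothetical stand-in for df[['age', 'balance']]; the values are made up.
toy = pd.DataFrame({
    'age':     [25, 32, 41, 47, 53, 60, 68],
    'balance': [1200, 150, 3000, 90, 2500, 400, 1800],
})

# Pearson measures linear association.
pearson = toy['age'].corr(toy['balance'])

# Spearman is the Pearson correlation of the ranks; it captures monotonic
# association and is less sensitive to extreme balances.
spearman = toy['age'].rank().corr(toy['balance'].rank())

print(f'Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}')
```

On the notebook data, replacing `toy` with `df` gives the same two numbers for the real columns. Coefficients close to 0 would support the reading of the scatter plot.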

## Part 3: Let's also check the categorical attributes a little bit

In [None]:
df.head()

In [None]:
#Plot the categorical attributes
plt.figure(figsize = (20,15))

plt.subplot(331)
df["job"].value_counts().plot.barh()
plt.title('Job Categories')

plt.subplot(332)
df["marital"].value_counts().plot.barh()
plt.title('Marital Status')

plt.subplot(333)
df["education"].value_counts().plot.barh()
plt.title('Education Levels')

plt.subplot(334)
df["default"].value_counts().plot.barh()
plt.title('Has Credit in Default')


plt.subplot(335)
df["housing"].value_counts().plot.barh()
plt.title('Has Housing Loan')

plt.subplot(336)
df["loan"].value_counts().plot.barh()
plt.title('Has Personal Loan')

plt.subplot(337)
df["contact"].value_counts().plot.barh()
plt.title('Contact Communication Type')

plt.subplot(338)
df["month"].value_counts().plot.barh()
plt.title('Months Value Counts')

plt.subplot(339)
df["poutcome"].value_counts().plot.barh()
plt.title('Outcome of Previous Marketing Campaign')

# Avoid overlapping titles and tick labels, then render the figure
plt.tight_layout()
plt.show()

## Part 4: Now let's plot some of the attributes against the target `deposit` and draw some insights from them

In [None]:
#Check how many customers open Deposit
plt.figure(figsize=(8,4))
sns.countplot(x='deposit',data=df,palette='Set2')
plt.title('How many customers open the Term Deposit')

=> Luckily, the data set looks well balanced, so we can use it directly for classification modeling.
If the data set weren't balanced, we would have to use a resampling technique such as SMOTE or random over-sampling to adjust the samples.
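
The count plot can be complemented with the actual class shares. This is a minimal sketch; the series `deposit` below is a made-up stand-in for `df['deposit']`, and its 47/53 split is an assumption chosen only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df['deposit']; the counts are made up.
deposit = pd.Series(['yes'] * 47 + ['no'] * 53, name='deposit')

# Share of each class; values near 0.5 indicate a balanced binary target.
class_share = deposit.value_counts(normalize=True)
print(class_share)

# Ratio of the rarest to the most common class (1.0 means perfectly balanced).
balance_ratio = class_share.min() / class_share.max()
print(f'minority/majority ratio: {balance_ratio:.2f}')
```

In the notebook, `df['deposit'].value_counts(normalize=True)` gives the shares for the real data.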

In [None]:
#Marital, education and contact, Default, housing and loan vs Y
plt.figure(figsize=[18,8])

plt.subplot(231)
sns.countplot(x='marital', hue='deposit', data=df,palette="Set2")

plt.subplot(232)
sns.countplot(x='education', hue='deposit', data=df,palette="Set2")

plt.subplot(233)
sns.countplot(x='contact', hue='deposit', data=df,palette="Set2")

plt.subplot(234)
sns.countplot(x='default', hue='deposit', data=df,palette="Set2")

plt.subplot(235)
sns.countplot(x='housing', hue='deposit', data=df,palette="Set2")

plt.subplot(236)
sns.countplot(x='loan', hue='deposit', data=df,palette="Set2")

In [None]:
#Job and Month vs Y
plt.figure(figsize=(14,12))

plt.subplot(211)
sns.countplot(y='job',data=df,hue='deposit',palette='Set2')
plt.title('Job vs Term Deposit')

plt.subplot(212)
sns.countplot(x='month',data=df,hue='deposit',palette='Set2')
plt.title('Last contact month vs Term Deposit')

In [None]:
#Last contact day vs Y
plt.figure(figsize=(17,5))
sns.countplot(x='day',data=df,hue='deposit',palette='Set2')
plt.title('Last contact day vs Term Deposit')

In [None]:
#Poutcome vs Y
plt.figure(figsize=(17,5))
sns.countplot(x='poutcome',data=df,hue='deposit',palette='Set2')
plt.title('Outcome of the previous campaign vs Term Deposit')

In [None]:
#Age against Y
g = sns.FacetGrid(data=df,hue='deposit',height=4,aspect=2)
g.map(sns.kdeplot,'age',fill=True,legend=True)  # `fill` replaces the deprecated `shade`
g.add_legend()
plt.title('Age against Y')

In [None]:
#Balance against Y
g = sns.FacetGrid(data=df,hue='deposit',height=4,aspect=2)
g.map(sns.kdeplot,'balance',fill=True,legend=True)  # `fill` replaces the deprecated `shade`
g.add_legend()
plt.title('Balance against Y')

In [None]:
#Number of contact performed for this campaign against Y
g = sns.FacetGrid(data=df,hue='deposit',height=4,aspect=2)
g.map(sns.kdeplot,'campaign',fill=True,legend=True)  # `fill` replaces the deprecated `shade`
g.add_legend()
plt.title('Number of contact performed during this campaign')

In [None]:
#Duration of the last contact against Y
g = sns.FacetGrid(data=df,hue='deposit',height=4,aspect=2)
g.map(sns.kdeplot,'duration',fill=True,legend=True)  # `fill` replaces the deprecated `shade`
g.add_legend()
plt.title('Duration of the last contact')

In [None]:
sns.kdeplot(df[df['deposit']=='yes']['pdays'])

In [None]:
sns.kdeplot(df[df['deposit']=='no']['pdays'])

In [None]:
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df[df['deposit']=='no']['pdays'], kde=True)

In [None]:
#Pdays against Y
g = sns.FacetGrid(data=df,hue='deposit',height=4,aspect=2)
g.map(sns.kdeplot,'pdays',fill=True,legend=True)  # `fill` replaces the deprecated `shade`
g.add_legend()
plt.title('Number of days that passed by after the client was last contacted')

## Part 5: Insights

Clients who are more likely to open a term deposit are:
- Marital status: Single
- Education: Tertiary
- Age: below 30 or above 60
- Job: Management, Retired, Student, Unemployed

The most effective ways to conduct marketing are:
- Contact: by cellular. Telephone could also work, but this needs more investigation.
- Month: May has the most contacts and therefore the highest number of successful cases, but its success rate is only about 50%. We should spend more effort in Feb, Mar, Apr (especially high), Sep, Oct and Dec.
- Day: 1-4, 10, 12-13, 15-16, 22-25, 27 and 30. These are the days on which we should make more contacts in the next campaign.
- Duration of the last contact: above 300 s. If clients spend more than 5 minutes talking to you, there is a higher chance that they will open a deposit.
- Number of contacts performed during this campaign: below 3. We shouldn't contact a client more than 3 times, or they may get annoyed.
- Number of days since the client was last contacted: fewer than 25 days, ideally 0 (the same day). Clients who have never been contacted before also have a good chance of opening a term deposit.
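
Statements about success rates, as opposed to raw counts, can be checked by computing the share of 'yes' outcomes per category. Below is a minimal sketch; the helper `subscription_rate` and the `toy` frame are hypothetical and not part of the original notebook, and the rows are made up.

```python
import pandas as pd

# Hypothetical stand-in for df[['month', 'deposit']]; the rows are made up.
toy = pd.DataFrame({
    'month':   ['may', 'may', 'may', 'may', 'mar', 'mar', 'oct', 'oct'],
    'deposit': ['no',  'no',  'yes', 'no',  'yes', 'yes', 'yes', 'no'],
})

def subscription_rate(frame, column, target='deposit'):
    """Return the contact count and the share of 'yes' outcomes per category."""
    subscribed = frame[target].eq('yes')
    summary = subscribed.groupby(frame[column]).agg(contacts='size', rate='mean')
    return summary.sort_values('rate', ascending=False)

rates = subscription_rate(toy, 'month')
print(rates)
```

Calling `subscription_rate(df, 'month')` or `subscription_rate(df, 'contact')` on the notebook data would show the number of contacts and the success rate side by side, which separates months with many contacts from months with a high success rate.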

Other findings:
- Housing loan: People without a housing loan tend to open a term deposit, perhaps because they have more cash available for a deposit.
- Outcome of the previous campaign: If the previous campaign was a success, there is a very high chance that the client will open a term deposit this time as well. If the result was "Other", the chance of success this time is slightly higher, but not by much. Even if the result of the previous campaign was "Failure", the success rate this time is still around 50%, so we shouldn't skip these clients.

## Part 6: Classification and top 5 important features

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
#Replace "yes", "no" in deposit with 1,0
df2=df.copy()
df2['deposit'] = df2['deposit'].map({'yes': 1, 'no': 0})
df2

In [None]:
# Pre-processing data
df2 = pd.get_dummies(df2,drop_first=True)
df2

In [None]:
# As stated in the attribute description, duration should not be used for a realistic model
X = df2.drop(['deposit','duration'],axis=1)
y = df2['deposit']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
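
A split without a fixed seed gives slightly different scores on every run. One option is to fix `random_state` and stratify on the target so that both parts keep the same class share. The sketch below uses made-up data; `X_toy`, `y_toy` and the seed value 42 are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for X and y: 100 rows with a 60/40 class split.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({'feature': rng.normal(size=100)})
y_toy = pd.Series([0] * 60 + [1] * 40, name='deposit')

# random_state fixes the shuffle; stratify keeps the class share
# the same in the training and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print('train share of class 1:', y_tr.mean())
print('test share of class 1:', y_te.mean())
```

The same two keyword arguments could be passed to the split on `X` and `y` in the notebook.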

In [None]:
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train,y_train)
y_pred = rfc.predict(X_test)

In [None]:
print('Report:\n',classification_report(y_test, y_pred))
# confusion_matrix expects the true labels first, then the predictions
print('Confusion matrix:\n',confusion_matrix(y_test,y_pred))
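
When reading the confusion matrix, note that scikit-learn's `confusion_matrix` takes the true labels as its first argument and the predictions as its second; rows then correspond to true classes and columns to predicted classes. A minimal sketch with made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical):

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: four true negatives/positives arranged asymmetrically
# so that swapping the arguments would visibly change the matrix.
y_true_toy = [0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 1, 0, 0]

cm = confusion_matrix(y_true_toy, y_pred_toy)

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f'tn={tn}, fp={fp}, fn={fn}, tp={tp}')
```

Swapping the two arguments transposes the matrix, which exchanges false positives and false negatives.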

In [None]:
#Top 5 important features

importances=rfc.feature_importances_
feature_importances=pd.Series(importances, index=X_train.columns).sort_values(ascending=False)
sns.barplot(x=feature_importances[0:5], y=feature_importances.index[0:5])
plt.title('Feature Importance')
plt.ylabel("Features")
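
Because the categorical attributes were one-hot encoded, the ranking above lists individual dummy columns rather than the original attributes. One way to get a per-attribute view is to sum the importances of all dummies that share a prefix. The sketch below uses a made-up series; `toy_importances` and its values are assumptions standing in for `feature_importances`.

```python
import pandas as pd

# Hypothetical importances keyed by dummy-encoded column names; values are made up.
toy_importances = pd.Series({
    'balance': 0.30,
    'age': 0.20,
    'poutcome_success': 0.20,
    'poutcome_other': 0.06,
    'month_may': 0.10,
    'month_oct': 0.14,
})

# get_dummies names columns '<original>_<category>', so the text before the
# first underscore identifies the source column. This assumes the original
# column names contain no underscores, as in the attribute list above.
source_column = toy_importances.index.str.split('_').str[0]
grouped = toy_importances.groupby(source_column).sum().sort_values(ascending=False)
print(grouped)
```

Impurity-based importances from a random forest also tend to favor numeric features with many distinct values, so the grouped totals are best read as a rough guide rather than a definitive ranking.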

# The end.