Predicting the outcome of bank marketing campaigns with machine learning can substantially change this segment, because it helps save resources while maximising results in the form of profit.

In this kernel I will use the **CatBoost algorithm** to predict whether a client subscribes to a term deposit. The dataset contains many categorical variables, and with CatBoost we don't have to convert them all into dummy variables, which is the main selling point of this algorithm.

## Catboost Algorithm
CatBoost lets you pass the indices of the categorical columns through `cat_features`. Columns with few distinct values can then be one-hot encoded using `one_hot_max_size` (one-hot encoding is used for every feature whose number of distinct values is less than or equal to the given parameter value).

If you don’t pass anything in the `cat_features` argument, CatBoost will treat all the columns as numerical variables.
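
As a rough illustration of the `one_hot_max_size` rule described above, the sketch below counts the distinct values of each categorical column and reports which columns fall at or under a threshold. It is plain Python on made-up rows and only mimics the selection rule; it is not CatBoost's internal code, and the helper name `columns_under_threshold` and the sample values are hypothetical.

```python
# Illustrative only: mimics the selection rule, not CatBoost internals.
def columns_under_threshold(rows, categorical_columns, max_size):
    """Return the categorical columns whose distinct-value count is <= max_size."""
    selected = []
    for column in categorical_columns:
        distinct_values = {row[column] for row in rows}
        if len(distinct_values) <= max_size:
            selected.append(column)
    return selected


# Hypothetical sample rows (not taken from bank.csv).
sample_rows = [
    {"job": "admin.", "marital": "married", "contact": "cellular"},
    {"job": "services", "marital": "single", "contact": "telephone"},
    {"job": "student", "marital": "divorced", "contact": "cellular"},
    {"job": "retired", "marital": "married", "contact": "cellular"},
]

one_hot_columns = columns_under_threshold(
    sample_rows, ["job", "marital", "contact"], max_size=3
)
print(one_hot_columns)
```

In this toy data `job` has four distinct values, so with a threshold of 3 it would not be one-hot encoded, while `marital` and `contact` would.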

The generalized comparison chart of XGBoost, LightGBM and CatBoost is as follows:
![](https://i.ibb.co/q9wdt6M/chart.png)

As the chart above suggests, CatBoost compares favourably with its peers on both speed and accuracy, although such comparisons depend on the dataset and the parameter settings.

# About Dataset

The dataset relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether a client of the bank will subscribe to a **term deposit** product or not. This raises the question: **what is a term deposit?**

## Term deposit 
A term deposit is a cash investment held at a financial institution. Your money is invested for an agreed rate of interest over a fixed amount of time, or term. Term deposits can be invested into a bank, building society or credit union.

When the money is deposited, the customer understands that the money is there for the pre-determined period which usually ranges from 1 month to 5 years and the interest rate is guaranteed not to change for that nominated period of time.  Typically, the money can only be withdrawn at the end of the period – or earlier with a penalty attached.

Term deposits are popular with investors who prefer capital security and a set return as opposed to the fluctuations of, say, the share market. Many investors also use term deposits as a part of their investment mix.

## Dataset Attributes

### Input variables:
#### bank client data:
* **age** (numeric)
* **job :** type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
* **marital :** marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
* **education** (categorical:'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
* **default:** has credit in default? (categorical: 'no','yes','unknown')
* **balance:** how much balance the client holds in their account with the bank (numeric)
* **housing:** has housing loan? (categorical: 'no','yes','unknown')
* **loan:** has personal loan? (categorical: 'no','yes','unknown')

#### related with the last contact of the current campaign:

* **contact:** contact communication type (categorical: 'cellular','telephone')
* **month:** last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
* **day:** last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
* **duration:** last contact duration, in seconds (numeric). Important note: this attribute highly affects the output 

#### other attributes:

* **campaign:** number of contacts performed during this campaign and for this client (numeric, includes last contact)
* **pdays:** number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
* **previous:** number of contacts performed before this campaign and for this client (numeric)
* **poutcome:** outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
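
The `pdays` description above uses a sentinel value for clients who were never contacted before. A minimal sketch of one way to handle such a sentinel, assuming pandas and the value 999 quoted in the description (worth checking against the actual file, since the notebook itself does not apply this step); the frame and the new column names are made up for illustration.

```python
import pandas as pd

NOT_CONTACTED = 999  # sentinel as stated in the attribute description above

# Hypothetical values for illustration only.
sample = pd.DataFrame({"pdays": [999, 12, 999, 180]})

# Flag clients with an earlier contact, then blank out the sentinel so that
# averages and plots of pdays only reflect real day counts.
sample["previously_contacted"] = sample["pdays"] != NOT_CONTACTED
sample["pdays_clean"] = sample["pdays"].where(sample["pdays"] != NOT_CONTACTED)

print(sample)
print(sample["pdays_clean"].mean())
```

Keeping the flag as a separate column preserves the information that a client was never contacted, which would otherwise be lost when the sentinel is replaced.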

### Output variable (desired target):

* **deposit:** has the client subscribed a term deposit? (binary: 'yes','no')

![](http://i.ibb.co/jyyVFdR/970x404-Friendship-between-artificial-and-real-man.jpg)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from pandas import plotting
%matplotlib inline
from time import time
from IPython.display import display # Allows the use of display() for DataFrames
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score
import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

# Step 1: Data Reading

In [None]:
df = pd.read_csv('../input/bank.csv')

In [None]:
# Now let's see the first 5 samples to get an overview of the dataset
df.head()

**Looks like there are many categorical attributes.**

In [None]:
# Now let's see the structure of the data
df.info()

In [None]:
# Let's see summary statistics of the dataset (mean, std, min, max, etc.)
df.describe(include='all')

In [None]:
# Let's see only the categorical variables
df.describe(include='object')

# Step 2 : Data Cleaning

Cleaning the data is a crucial step. Most of the time the data contains missing values or inappropriate values that we have to find and handle.

In [None]:
# Checking Missing values or null entries in the dataset
df.isna().sum()

**Now we have to check whether each feature has a datatype that matches its values. For example, an age column with numerical entries may erroneously have the datatype object. Such instances have to be removed or replaced.**

In [None]:
print(df.dtypes)

So, it seems that every feature has an appropriate datatype.
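
When a column does turn out to have the wrong datatype, one possible way to spot and repair it is sketched below. This assumes pandas; the frame, the helper name `mostly_numeric_text_columns` and the 50% cut-off are all hypothetical choices for illustration, not part of the original analysis.

```python
import pandas as pd

# Hypothetical frame: "age" holds numbers but was read in as text.
sample = pd.DataFrame({
    "age": ["34", "51", "unknown", "29"],
    "job": ["admin.", "services", "student", "retired"],
})


def mostly_numeric_text_columns(frame, min_share=0.5):
    """List non-numeric columns where at least min_share of values parse as numbers."""
    flagged = []
    for column in frame.columns:
        if pd.api.types.is_numeric_dtype(frame[column]):
            continue  # already numeric, nothing to check
        parsed = pd.to_numeric(frame[column], errors="coerce")
        if parsed.notna().mean() >= min_share:
            flagged.append(column)
    return flagged


suspect_columns = mostly_numeric_text_columns(sample)
print(suspect_columns)

# Coerce the flagged column; unparseable entries become NaN for later handling.
sample["age"] = pd.to_numeric(sample["age"], errors="coerce")
print(sample.dtypes)
```

The entries that could not be parsed end up as missing values, so they show up in the same `isna().sum()` check used earlier.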

In [None]:
df.shape

So it has 17 features including the target variable, and 11,162 records.

# Step 3: Exploratory data analysis

In [None]:
sns.countplot(x='deposit',data=df)

So the dataset has an almost equal distribution of the target variable deposit (yes or no).

In [None]:
sns.countplot(x='deposit',hue='housing',data=df)

This is an interesting pattern: among customers who have subscribed to a term deposit, more have not taken a housing loan, so the bank can target these customers with housing loan offers.

Conversely, many customers who have taken a housing loan from the bank have not subscribed to a term deposit, so the bank can offer the term deposit scheme to them.

In [None]:
sns.countplot(x='deposit',hue='loan',data=df)

In [None]:
# boolean mask for customers who subscribed to a term deposit
filter1 = df["deposit"]=="yes"

# keep only the subscribed customers (boolean indexing preserves the column dtypes)
df_subscribed = df[filter1].copy()

df_subscribed.head()

In [None]:
sns.countplot(x='deposit', hue='education',data=df_subscribed)

It seems that most subscribers to the term deposit scheme have secondary education, whereas customers with unknown or primary education subscribe comparatively rarely.

So the bank can design other products that are aligned towards the needs of highly educated individuals.

In [None]:
sns.countplot(x='deposit',hue='marital',data=df_subscribed)

The plot above suggests that married people tend to save more than single people, so the bank could also design deposit products that suit the needs of the younger generation or single customers.

## Pairplot of Numerical features

In [None]:
dataset2=df[['age','balance','duration','campaign','pdays']]

sns.pairplot(dataset2)
plt.show()

# Distribution of Age

In [None]:
plt.rcParams['figure.figsize'] = (20, 8)
sns.countplot(x='age', data=df, palette = 'hsv')  # pass the column by keyword; recent seaborn versions reject positional data vectors
plt.title('Distribution of Age', fontsize = 20)
plt.show()

The plot above shows that most of the data falls in the 25 - 60 age group.

Now let's look at the age distribution of customers who have already subscribed to the bank's term deposit scheme.

In [None]:
plt.rcParams['figure.figsize'] = (25, 8)
sns.countplot(x='age', data=df_subscribed, palette = 'rainbow')  # keyword arguments for compatibility with recent seaborn
plt.title('Distribution of Age of Subscribed Customers', fontsize = 25)
plt.show()

From the plot above it is evident that customers who have already subscribed to the bank's term deposit scheme are mostly in the 25 to 40 age range.

# Distribution of Balance

In [None]:

sns.distplot(df['balance'], hist=True,kde_kws={"color": "k", "lw": 3, "label": "KDE"}, kde=True,bins=50,hist_kws={"histtype": "step", "linewidth": 3,"alpha": 1, "color": "g"})
plt.title('Distribution of Balance in Account', fontsize = 20)
plt.show()

In [None]:
sns.distplot(df_subscribed['balance'], hist=True,kde_kws={"color": "k", "lw": 3, "label": "KDE"}, kde=True,bins=50,hist_kws={"histtype": "step", "linewidth": 3,"alpha": 1, "color": "g"})
plt.title('Distribution of Balance of already subscribed account', fontsize = 20)
plt.show()

# Distribution of duration

In [None]:

sns.distplot(df['duration'], hist=True,kde_kws={"color": "k", "lw": 3, "label": "KDE"}, kde=True,bins=50,hist_kws={"histtype": "step", "linewidth": 3,"alpha": 1, "color": "g"})
plt.title('Distribution of Duration', fontsize = 20)
plt.show()

In [None]:
sns.distplot(df_subscribed['duration'], hist=True,kde_kws={"color": "k", "lw": 3, "label": "KDE"}, kde=True,bins=50,hist_kws={"histtype": "step", "linewidth": 3,"alpha": 1, "color": "g"})
plt.title('Distribution of Duration of already subscribed account', fontsize = 20)
plt.show()

In [None]:
labels = ['Normal', 'Default']
size = df['default'].value_counts()
colors = ['lightgreen', 'orange']
explode = [0, 0.1]

plt.rcParams['figure.figsize'] = (9, 9)
plt.pie(size, colors = colors, explode = explode, labels = labels, shadow = True, autopct = '%.2f%%')
plt.title('Default Loans Status', fontsize = 20)
plt.axis('off')
plt.legend()
plt.show()

In [None]:
labels = ['No Housing Loan','Housing loan taken' ]
size = df['housing'].value_counts()
colors = ['blue', 'yellow']
explode = [0, 0.1]

plt.rcParams['figure.figsize'] = (9, 9)
plt.pie(size, colors = colors, explode = explode, labels = labels, shadow = True, autopct = '%.2f%%')
plt.title('Status of Housing Loan', fontsize = 20)
plt.axis('off')
plt.legend()
plt.show()

In [None]:
labels = ['No Loan Taken','Has Taken Loan']
size = df['loan'].value_counts()
colors = ['green', 'blue']
explode = [0, 0.1]

plt.rcParams['figure.figsize'] = (9, 9)
plt.pie(size, colors = colors, explode = explode, labels = labels, shadow = True, autopct = '%.2f%%')
plt.title('Status of Loan customer', fontsize = 20)
plt.axis('off')
plt.legend()
plt.show()

In [None]:
labels = ['No deposit','Deposit in Bank']
size = df['deposit'].value_counts()
colors = ['blue', 'orange']
explode = [0, 0.1]

plt.rcParams['figure.figsize'] = (9, 9)
plt.pie(size, colors = colors, explode = explode, labels = labels, shadow = True, autopct = '%.2f%%')
plt.title('Status of Deposit customer', fontsize = 20)
plt.axis('off')
plt.legend()
plt.show()

In [None]:
plt.rcParams['figure.figsize'] = (18, 7)
sns.boxenplot(x='housing', y='balance', hue='deposit', data=df, palette = 'Blues')  # x/y must be keywords in recent seaborn
plt.title('Housing vs Balance vs Deposit', fontsize = 20)
plt.show()

In [None]:
plt.rcParams['figure.figsize'] = (18, 7)
sns.boxenplot(x='loan', y='balance', hue='deposit', data=df, palette = 'rainbow')  # x/y must be keywords in recent seaborn
plt.title('loan vs Balance vs Deposit', fontsize = 20)
plt.show()


In [None]:
plt.rcParams['figure.figsize'] = (18, 7)
sns.boxenplot(x='housing', y='balance', hue='default', data=df)  # x/y must be keywords in recent seaborn
plt.title('Housing vs Balance vs Default', fontsize = 20)
plt.show()

In [None]:
plt.rcParams['figure.figsize'] = (18, 7)
sns.boxenplot(x='loan', y='balance', hue='default', data=df, palette="Set1")  # x/y must be keywords in recent seaborn
plt.title('Personal Loan vs Balance vs Default', fontsize = 20)
plt.show()

**From the plot above it is evident that customers who have defaulted hold a very low balance in their account with the bank. This is a useful pattern for keeping tabs on such defaulters.**

In [None]:
# Split the dataframe by target value
deposit_yes = df.loc[df['deposit'] == 'yes']
deposit_no = df.loc[df['deposit'] == 'no']
fig = plt.figure(figsize=(20,8))
sns.distplot(deposit_yes[['duration']], hist=False, rug=True)
sns.distplot(deposit_no[['duration']], hist=False, rug=True)
plt.title('Duration of Deposit vs Non deposit', fontsize = 20)
fig.legend(labels=['Deposit','Non deposit'])
plt.show()


Customers who have not subscribed to a deposit tend to have shorter call durations, whereas those who have subscribed tend to have longer ones.

In [None]:
sns.countplot(x='poutcome', data=df)

In [None]:
import warnings
warnings.filterwarnings('ignore')

plt.rcParams['figure.figsize'] = (18, 8)

plt.subplot(1, 2, 1)
sns.set(style = 'whitegrid')
sns.distplot(df['previous'])
plt.title('Distribution of Previous', fontsize = 20)
plt.xlabel('Range of Previous')
plt.ylabel('Count')


plt.subplot(1, 2, 2)
sns.set(style = 'whitegrid')
sns.distplot(df['campaign'], color = 'red')
plt.title('Distribution of Campaign', fontsize = 20)
plt.xlabel('Range of Campaign')
plt.ylabel('Count')
plt.show()

In [None]:
df["deposit"] = df.deposit.apply(lambda  x:1 if x=="yes" else 0)
df["loan"] = df.loan.apply(lambda  x:1 if x=="yes" else 0)
df["housing"] = df.housing.apply(lambda  x:1 if x=="yes" else 0)
df["default"] = df.default.apply(lambda  x:1 if x=="yes" else 0)

# Step 4: Correlation Analysis

In [None]:
df1=df.drop(['deposit'],axis=1)

plt.figure(figsize=(20,10)) 
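# correlations are only defined for numeric columns, so select those first
sns.heatmap(df1.select_dtypes(include='number').corr(), annot=True)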

## Correlation with target Variable

In [None]:
df1.select_dtypes(include='number').corrwith(df.deposit).plot.bar(
        figsize = (20, 10), title = "Correlation with Deposit", fontsize = 20,
        rot = 45, grid = True)

# Step 4 : Segregation of features & target variable 

In [None]:
X = df.drop(['deposit'],axis=1) # Feature 

y=df['deposit'] # Target variable

# Step 5 : Splitting into training & testing sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,test_size = 0.20, random_state=0)
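
To make the effect of `stratify=y` concrete, here is a minimal pure-Python sketch of a stratified split. It is not scikit-learn's implementation; the function name `stratified_split` and the label counts are hypothetical, and the only point is to show that each class contributes the same share to the test set.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx) so that each class keeps roughly the same share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class before cutting
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 60 "no" and 40 "yes".
labels = ["no"] * 60 + ["yes"] * 40
train_idx, test_idx = stratified_split(labels, test_size=0.20)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts)
print(test_counts)
```

Both parts keep the 60/40 class ratio of the full label list, which is what keeps the test metrics comparable to the training distribution.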

# Step 6: convert categorical columns to integers

In [None]:
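# convert categorical columns to integers, using one shared category list per
# column so that the same value gets the same code in train and test
category_cols = ['job','marital','education','contact','month','poutcome']
X_train, X_test = X_train.copy(), X_test.copy()  # work on copies, not on slices of X
for header in category_cols:
    shared_type = pd.CategoricalDtype(categories=sorted(X[header].unique()))
    X_train[header] = X_train[header].astype(shared_type).cat.codes
    X_test[header] = X_test[header].astype(shared_type).cat.codes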
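
Integer codes only line up between two frames when both are encoded against the same category list. A small sketch of the difference, assuming pandas and made-up values (the variable names are hypothetical):

```python
import pandas as pd

# Hypothetical train/test columns; "student" is missing from the test part.
train_jobs = pd.Series(["admin.", "student", "technician"])
test_jobs = pd.Series(["technician", "admin."])

# Encoding each part on its own: the codes depend on the values present.
separate_train = train_jobs.astype("category").cat.codes.tolist()
separate_test = test_jobs.astype("category").cat.codes.tolist()
print(separate_train, separate_test)

# Encoding both parts against one shared category list keeps the codes aligned.
shared_type = pd.CategoricalDtype(
    categories=sorted(set(train_jobs) | set(test_jobs))
)
shared_train = train_jobs.astype(shared_type).cat.codes.tolist()
shared_test = test_jobs.astype(shared_type).cat.codes.tolist()
print(shared_train, shared_test)
```

With separate encoding, "technician" gets one code in the training part and a different one in the test part; with the shared list it gets the same code in both.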

In [None]:
print(X_train.dtypes)

In [None]:
# column positions of the categorical features: every column of X that is not int64
categorical_features_indices = np.where(X.dtypes != np.int64)[0]

# Step 7 : Model Building

This is the most fun part, in which we build the model using the CatBoost algorithm.

In [None]:
model = CatBoostClassifier(eval_metric='Accuracy',use_best_model=True,random_seed=42)

In [None]:
model.fit(X_train,y_train,cat_features=categorical_features_indices,eval_set=(X_test,y_test))

In [None]:
y_predict = model.predict(X_test)
from sklearn.metrics import  accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
roc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])  # AUC is computed from scores, not hard labels
acc = accuracy_score(y_test, y_predict)
prec = precision_score(y_test, y_predict)
rec = recall_score(y_test, y_predict)
f1 = f1_score(y_test, y_predict)

results = pd.DataFrame([['CatBoost', acc,prec,rec, f1,roc]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results
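
For reference, the four label-based scores in the table can be derived from the counts of true/false positives and negatives. The sketch below does this by hand in plain Python on made-up labels; the function name `binary_metrics` is hypothetical, and it is only meant to show the formulas, not to replace the scikit-learn functions used above.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels, from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels, not model output.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 0, 1]

demo_scores = binary_metrics(y_true_demo, y_pred_demo)
print(demo_scores)
```

Precision answers "of the clients predicted to subscribe, how many did?", while recall answers "of the clients who subscribed, how many were found?".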

# Step 8: Cross Validation
In this step we will perform cross validation with 10 folds.

In [None]:
from catboost import cv,Pool
cv_data = cv(Pool(X,y,cat_features=categorical_features_indices),model.get_params(),fold_count=10)


In [None]:
print('Best validation accuracy score: {:.2f}±{:.2f} on step {}'.format(
    np.max(cv_data['test-Accuracy-mean']), 
    cv_data['test-Accuracy-std'][cv_data['test-Accuracy-mean'].idxmax(axis=0)],
    cv_data['test-Accuracy-mean'].idxmax(axis=0)
))
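
The idea behind k-fold cross validation can be sketched in a few lines of plain Python: the data is cut into k parts, and each part serves once as the test set while the rest is used for training. This is a simplified illustration with a hypothetical helper `k_fold_indices`; real implementations usually shuffle the data and often stratify by class, which this sketch does not do.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=25, n_folds=10))
fold_test_sizes = [len(test_idx) for _, test_idx in folds]
print(fold_test_sizes)
```

The mean and standard deviation reported above are then taken over the scores of the individual folds.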

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_predict)
sns.heatmap(cm, annot=True)

In [None]:
from sklearn import metrics
plt.figure()

# Add the models to the list that you want to view on the ROC plot
models = [
    {
    'label': 'CATBOOST',
    'model': CatBoostClassifier(eval_metric='Accuracy',use_best_model=True,random_seed=42),        
    }
]

# Iterate through the models list
for m in models:
    model = m['model'] # select the model
    model.fit(X_train,y_train,cat_features=categorical_features_indices,eval_set=(X_test,y_test)) # train the model
    y_proba = model.predict_proba(X_test)[:,1] # predicted probability of the positive class
    # Compute false positive rate and true positive rate
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_proba)
    # Area under the curve, computed from the same probabilities as the curve
    auc = metrics.roc_auc_score(y_test, y_proba)
    # Now, plot the computed values
    plt.plot(fpr, tpr, label='%s ROC (area = %0.2f)' % (m['label'], auc))
# Custom settings for the plot 
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('1-Specificity(False Positive Rate)')
plt.ylabel('Sensitivity(True Positive Rate)')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
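
The area under the ROC curve can also be read as the share of (positive, negative) pairs in which the positive example receives the higher score, with ties counting half. The plain-Python sketch below uses that pairwise definition on made-up numbers to show that the value changes when hard 0/1 predictions are used instead of probabilities. The function name `pairwise_roc_auc` and the data are hypothetical, and this is not scikit-learn's implementation.

```python
def pairwise_roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities.
y_true_demo = [0, 0, 1, 1, 0, 1]
probabilities = [0.10, 0.40, 0.35, 0.80, 0.45, 0.70]
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

auc_from_probabilities = pairwise_roc_auc(y_true_demo, probabilities)
auc_from_labels = pairwise_roc_auc(y_true_demo, hard_labels)
print(auc_from_probabilities, auc_from_labels)
```

Thresholding collapses the ranking into two groups, so the score computed from labels describes a single operating point rather than the whole curve.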

# Model explainability

In [None]:
features ='duration'
res =model.get_feature_statistics(X_train, y_train,features, plot=True)

In [None]:
features ='balance'
res =model.get_feature_statistics(X_train, y_train,features, plot=True)

In [None]:
import shap
shap_values = model.get_feature_importance(Pool(X_test, label=y_test,cat_features=categorical_features_indices), 
                                                                     type="ShapValues")
expected_value = shap_values[0,-1]
shap_values = shap_values[:,:-1]

shap.initjs()
shap.force_plot(expected_value, shap_values[3,:], X_test.iloc[3,:])
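
The slicing above relies on the layout of the returned array: each row holds one contribution per feature, followed by the expected value in the last position. For a binary classifier these values live on the raw (log-odds) scale, so the contributions plus the expected value add up to the raw score, which a sigmoid turns into a probability. A tiny sketch with made-up numbers (not output from the model above):

```python
import math

# Hypothetical row: three feature contributions followed by the expected value.
shap_row = [0.80, -0.30, 0.10, -0.40]

contributions = shap_row[:-1]   # per-feature SHAP values
base_value = shap_row[-1]       # expected value (average raw score)

# Additivity: base value plus all contributions gives the raw score.
raw_score = base_value + sum(contributions)

# Convert the raw log-odds score into a probability.
probability = 1.0 / (1.0 + math.exp(-raw_score))
print(raw_score, probability)
```

This is why the force plot starts at the expected value and each feature pushes the prediction up or down from there.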

# Feature Importance

In [None]:
feature_score = pd.DataFrame(list(zip(X.dtypes.index, model.get_feature_importance(Pool(X, label=y, cat_features=categorical_features_indices)))),
                columns=['Feature','Score'])

feature_score = feature_score.sort_values(by='Score', ascending=False, inplace=False, kind='quicksort', na_position='last')

In [None]:
plt.rcParams["figure.figsize"] = (12,7)
ax = feature_score.plot('Feature', 'Score', kind='bar', color='b')
ax.set_title("Catboost Feature Importance Ranking", fontsize = 14)
ax.set_xlabel('')

rects = ax.patches

labels = feature_score['Score'].round(2)

for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 0.35, label, ha='center', va='bottom')

plt.show()