# Bank Marketing

#### Task

The data relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

#### Data description
Dataset `bank-additional-full.csv` has 41188 examples and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014].

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


## Data Overview

In [None]:
df = pd.read_csv('../input/bank-marketing-analysis/bank-additional-full.csv', header=0, sep=";")
df.shape

In [None]:
df.head(10)

In [None]:
df.info()

## Exploratory Data Analysis

#### Find Missing Values (NaN)

In [None]:
df.isna().any()

No missing values (NaN) found.
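Note that `isna()` only detects true NaN entries. In this dataset, several categorical attributes use the string label `unknown` for missing information, so a separate count can be informative. A minimal sketch on a toy frame (the rows below are made up for illustration; on the real data the same expression would be applied to `df`):

```python
import pandas as pd

# Toy stand-in for df; the values are invented for this example.
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "technician", "unknown"],
    "marital": ["married", "single", "unknown", "married"],
    "age": [30, 41, 52, 38],
})

# Count the "unknown" placeholder in every column.
# Numeric columns never match the string, so their count is 0.
unknown_counts = (toy == "unknown").sum()
print(unknown_counts)
```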

#### Find Features with One Value

In [None]:
for column in df.columns:
    print(column,df[column].nunique())

No feature with only one value

#### Visualizing the categorical features

In [None]:
cat_features = [col for col, dtype in df.dtypes.items() if dtype == 'object']

Check count based on categorical features

In [None]:
plt.figure(figsize=(25, 80), facecolor='white')
plotnumber =1
for cat_feature in cat_features:
    ax = plt.subplot(12,3,plotnumber)
    sns.countplot(y=cat_feature, data=df, palette='pastel')
    plt.xlabel(cat_feature)
    plt.title(cat_feature)
    plotnumber+=1
plt.show()

Relationship between each categorical variable and the target value

In [None]:
plt.figure(figsize=(25, 80), facecolor='white')
plotnumber =1
for cat_feature in cat_features:
    ax = plt.subplot(12,3,plotnumber)
    sns.countplot(y=cat_feature, hue='y', palette='pastel', edgecolor='.6', data=df)
    plt.xlabel(cat_feature)
    plt.title(cat_feature)
    plotnumber+=1
plt.show()

<li>Customers working in admin, technician and blue-collar jobs are more inclined towards a term deposit;</li>
<li>Married customers show high interest in a deposit;</li>
<li>Customers with university_degree are more inclined towards a term deposit;</li>
<li>Customers who don't have credit in default are more inclined towards a term deposit;</li>
<li>During the summer season (May to August) customers show high interest in a deposit;</li>
<li>Customers who have a personal loan seem to be less interested in a deposit;</li>
<li>Customers who were contacted via 'cellular' are more inclined towards a term deposit.</li>
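
The count plots show absolute numbers, which are dominated by how large each category is. Comparing the share of subscribers within each category may give a clearer picture of inclination. A small sketch on made-up rows (the toy frame and its values are assumptions for illustration only):

```python
import pandas as pd

# Toy data: one categorical feature and the target.
toy = pd.DataFrame({
    "contact": ["cellular", "cellular", "telephone",
                "telephone", "cellular", "telephone"],
    "y": ["yes", "no", "no", "no", "yes", "yes"],
})

# Row-normalised cross-tabulation: share of "yes" within each category.
rates = pd.crosstab(toy["contact"], toy["y"], normalize="index")["yes"]
print(rates.sort_values(ascending=False))
```

On the real data, the same call could be repeated for each entry of `cat_features` other than `y`.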

#### Visualizing the numerical features

In [None]:
num_features = [col for col, dtype in df.dtypes.items() if dtype == 'int64' or dtype == 'float64']

In [None]:
plt.figure(figsize=(20,60), facecolor='white')
plotnumber =1
for num_feature in num_features:
    ax = plt.subplot(12,3,plotnumber)
    sns.kdeplot(df[num_feature], bw_method=1.5)  # `bw` is deprecated in recent seaborn versions
    plt.xlabel(num_feature)
    plotnumber+=1
plt.show()

#### Find Outliers in numerical features

Boxplot on numerical features to find outliers

In [None]:
plt.figure(figsize=(20,60))
plotnumber =1
for num_feature in num_features:
    ax = plt.subplot(12,3,plotnumber)
    sns.boxplot(data = df, x = num_feature, palette='pastel')
    plt.xlabel(num_feature)
    plotnumber+=1
plt.show()

Age, duration, campaign, pdays, previous and cons.conf.idx have some outliers.
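
The points drawn beyond the whiskers of a boxplot follow, by default, the 1.5 × IQR rule. The same rule can be applied numerically to count outliers per feature. A minimal sketch, where `count_iqr_outliers` is a hypothetical helper and the series is toy data:

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(mask.sum())

# Toy values standing in for a column such as 'campaign'.
toy_campaign = pd.Series([1, 1, 2, 2, 2, 3, 3, 4, 40])
n_outliers = count_iqr_outliers(toy_campaign)
print(n_outliers)  # 1 (only the value 40 lies outside the bounds)
```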

#### Check if the Data set is balanced or not based on target values

In [None]:
plt.figure(figsize = (6, 4))
sns.countplot(data = df, x = 'y')
plt.tight_layout()

In [None]:
df['y'].value_counts()

The dataset is highly imbalanced.

In [None]:
# numeric_only=True skips the text columns, which cannot be averaged
df.groupby('y').mean(numeric_only=True)

## Data Preparation

#### Categorical Feature Encoding

In [None]:
# Categorical boolean mask
categorical_feature_mask = df.dtypes==object
# filter categorical columns using mask
categorical_cols = df.columns[categorical_feature_mask].tolist()

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [None]:
# apply le on categorical feature columns
df[categorical_cols] = df[categorical_cols].apply(lambda col: le.fit_transform(col))
df[categorical_cols].head(10)

In [None]:
df.head()

Correlation between features

In [None]:
plt.figure(figsize=(20, 20))
corr = df.corr()
sns.heatmap(corr, fmt='.2f',annot=True)

From this correlation matrix we can see that duration, pdays, emp.var.rate, euribor3m and nr.employed are the features most correlated with the target column.
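
Reading the strongest correlations off a large heatmap is error-prone, so it can help to sort the target column of the correlation matrix directly. A sketch on toy data (column values are invented; on the real data `toy` would be the encoded `df`). Keep in mind that Pearson correlation on label-encoded nominal columns depends on the arbitrary integer codes.

```python
import pandas as pd

# Toy encoded frame with a binary target.
toy = pd.DataFrame({
    "duration": [100, 200, 300, 400, 500, 600],
    "age":      [30, 45, 30, 45, 30, 45],
    "y":        [0, 0, 0, 1, 1, 1],
})

# Absolute correlation of every feature with the target, strongest first.
target_corr = toy.corr()["y"].drop("y").abs().sort_values(ascending=False)
print(target_corr)
```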

In [None]:
df1=df.copy()

#### Remove Outliers

Removing outliers in feature 'pdays'

In [None]:
df1.groupby(['y','pdays']).size()

Dropping the pdays column because more than 90% of its rows hold the value 999 (meaning the client was not previously contacted).
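
If the information "was this client contacted before?" should be kept while still removing the 999 sentinel, one possible alternative is a binary indicator. This is only a sketch on toy data; the column name `previously_contacted` is hypothetical and not part of the original dataset.

```python
import pandas as pd

toy = pd.DataFrame({"pdays": [999, 3, 999, 6, 999]})

# 999 is the sentinel for "not previously contacted".
toy["previously_contacted"] = (toy["pdays"] != 999).astype(int)

# Share of rows carrying the sentinel value.
share_never = (toy["pdays"] == 999).mean()
print(toy)
print(share_never)
```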

In [None]:
df1.drop(['pdays'],axis=1, inplace=True)

Removing outliers in feature 'campaign'

In [None]:
df1.groupby(['y','campaign'],sort=True)['campaign'].count()

Treating campaign counts of 37 or more as outliers

In [None]:
df2 = df1[df1['campaign'] < 37]

#### X and y preparation

In [None]:
X = df2.iloc[:, :-1]
X.head()

In [None]:
y = df2.iloc[:, -1]
y

#### Feature importance

In [None]:
from sklearn.ensemble import RandomForestRegressor
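# max_features=1.0 uses all features, which is what 'auto' meant for regressors
# ('auto' is no longer accepted by recent scikit-learn versions)
rf = RandomForestRegressor(n_estimators=80, max_features=1.0)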
rf.fit(X, y)
print('Training done using Random Forest')

ranking = np.argsort(-rf.feature_importances_)
f, ax = plt.subplots(figsize=(11, 11))
sns.barplot(x=rf.feature_importances_[ranking], y=X.columns.values[ranking], orient='h')
ax.set_xlabel("feature importance")
plt.tight_layout()
plt.show()

Important note: duration highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
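
Following this note, a realistic model would exclude `duration` explicitly by name, which does not rely on where the column ends up in an importance ranking. A minimal sketch on a toy feature frame (values invented; `X_realistic` is an illustrative name):

```python
import pandas as pd

toy_X = pd.DataFrame({
    "age": [30, 41, 52],
    "duration": [120, 0, 310],
    "campaign": [1, 3, 2],
})

# Remove the leaky column by name; errors="ignore" keeps the call safe
# if the column has already been dropped.
X_realistic = toy_X.drop(columns=["duration"], errors="ignore")
print(X_realistic.columns.tolist())
```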

In [None]:
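# Skip the top-ranked feature (expected to be duration, see the note above)
# and keep the next ten features by importance
X_main = X.iloc[:, ranking[1:11]]
y_main = y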

#### Handling Imbalanced Dataset

Oversampling using the SMOTE method

In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE()
# fit_resample replaces fit_sample, which was removed from imbalanced-learn
X_sm, y_sm = smote.fit_resample(X_main, y_main)


In [None]:
plt.figure(figsize = (20, 5))
plt.subplot(1, 2, 1)
sns.countplot(x = y_main, palette='pastel')
plt.title('Class distribution before SMOTE')
plt.subplot(1, 2, 2)
sns.countplot(x = y_sm, palette='pastel')
plt.title('Class distribution after SMOTE')
plt.show()

#### Splitting Data Into Training and Test Sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.2, random_state=0)
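
In the cells above, oversampling happens before the split, so synthetic points derived from the same original rows can end up in both the training and the test set, which tends to inflate test scores. One possible alternative ordering is sketched below on synthetic data. Random duplication via `sklearn.utils.resample` is used here as a simple stand-in for SMOTE, and all variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([1] * 20 + [0] * 180)  # 10% positives

# 1) Split first, keeping the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# 2) Oversample the minority class in the training part only.
#    (Random duplication stands in for SMOTE in this sketch.)
n_majority = int((y_tr == 0).sum())
minority_up = resample(
    X_tr[y_tr == 1], replace=True, n_samples=n_majority, random_state=0
)

X_tr_bal = np.vstack([X_tr[y_tr == 0], minority_up])
y_tr_bal = np.concatenate(
    [np.zeros(n_majority, dtype=int), np.ones(n_majority, dtype=int)]
)

# The training set is balanced; the test set keeps the original ratio.
print(np.bincount(y_tr_bal), np.bincount(y_te))
```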

## Predictive Model Preparation

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import GaussianNB
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

#### Cross-Validation

In [None]:
def k_fold_fit_and_evaluate(X, y, model, scoring_method, n_splits=5):
    # define evaluation procedure
    cv = KFold(n_splits=n_splits, random_state=42, shuffle=True)
    # evaluate model on the data passed in (not on the global training set)
    scores = cross_validate(model, X, y, scoring=scoring_method, cv=cv, n_jobs=-1)
    return scores["test_score"]

# weighted F1: per-class F1 averaged by class support
scoring_method_f1 = make_scorer(f1_score, average="weighted")

#### Average F1-Score For Different Models

In [None]:
random_state = 42
models = {
    "GaussianNB": GaussianNB(),
    "DummyClassifier": DummyClassifier(strategy="most_frequent"),
    "DecisionTreeClassifier": DecisionTreeClassifier(max_depth=32, min_samples_leaf=1, random_state=random_state),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=1, weights="uniform"),
    "LogisticRegression": LogisticRegression(C=8, random_state=random_state),
    # 'log_loss' was called 'deviance' before scikit-learn 1.1
    "GradientBoostingClassifier": GradientBoostingClassifier(loss='log_loss', n_estimators=20),
    "XGBClassifier": XGBClassifier(objective='binary:logistic', learning_rate=0.1, max_depth=22, n_estimators=300)
}


dict_f1 = {}
for name, model in models.items():
    metrics_f1 = k_fold_fit_and_evaluate(X_train, y_train, model, scoring_method_f1, n_splits=5) 
    dict_f1[name] = np.mean(metrics_f1)

val = list(dict_f1.values())
keys = list(dict_f1.keys())

plt.figure(figsize=(13,5))
plt.barh(keys, val)
for index, value in enumerate(val):
    plt.text(value, index, str(round(value,3)))
plt.title("mean F1")
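
A mean score alone hides how much the folds disagree. Reporting the standard deviation as well, and using stratified folds so each fold keeps the class ratio, may make the comparison between models more reliable. A minimal sketch on synthetic data (the dataset and the choice of model are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification problem standing in for the real data.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Stratified folds preserve the class proportions in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_toy, y_toy,
    scoring="f1_weighted", cv=cv,
)
print(f"F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```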


#### Grid-Search

In [None]:
from sklearn.model_selection import GridSearchCV

random_state = 42
n_splits = 5
# make_scorer calls the metric as score_func(y_true, y_pred), so f1_score is passed directly
scoring_method = make_scorer(f1_score, average="weighted")

model_parameters = {
    "GaussianNB": {
    
    },
    "DummyClassifier": {
        'strategy':['stratified','most_frequent','prior','uniform']
    },
    "DecisionTreeClassifier": {
        'max_depth': [20, 22, 28, 32, 37, 38, 42, 45, 50, 70],
        'min_samples_leaf':[1, 2, 3, 4, 5]
    },
    "KNeighborsClassifier": {
        'n_neighbors':[1, 2, 3, 4], 
        'weights':["uniform", "distance"]
    },
    "LogisticRegression": {
        'C':[7, 8, 10, 15, 30, 40, 50, 70],
        'max_iter':[1000]
        
    },
    "GradientBoostingClassifier": {
        'loss': ["log_loss", "exponential"],  # 'log_loss' was 'deviance' before scikit-learn 1.1
        'n_estimators': [1, 2, 10, 20]
    }
}

for model_name, parameters in model_parameters.items():
    model = models[model_name]
    
    cv = KFold(n_splits=n_splits, random_state=random_state, shuffle=True)
    grid_search = GridSearchCV(model, parameters, cv=cv, n_jobs=-1, verbose=False, scoring=scoring_method).fit(X_train, y_train)

    best_score = grid_search.best_score_
    best_params = grid_search.best_params_
    
    print(model_name)
    print("- best_score =", best_score)
    print("best parameters:")
    for k,v in best_params.items():
        print("-", k, v)

## Predictive Model Application

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import sklearn.metrics as metrics

y_true = y_test

#### Gradient Boosting

In [None]:
# 'log_loss' was called 'deviance' before scikit-learn 1.1
clf_gbc = GradientBoostingClassifier(loss='log_loss', n_estimators=20, random_state=42)

clf_gbc.fit(X_train, y_train)
y_predicted_gbc = clf_gbc.predict(X_test)
print(classification_report(y_true, y_predicted_gbc, zero_division = 0))

In [None]:
cm = confusion_matrix(y_test, y_predicted_gbc)
sns.heatmap(cm, annot=True,fmt='g')
plt.xlabel('Predicted')
plt.ylabel('True Value')
plt.show()

In [None]:
probs = clf_gbc.predict_proba(X_test)
preds = probs[:,1]

fpr, tpr, threshold = metrics.roc_curve(y_true, preds)
roc_auc = metrics.auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

#### Decision Tree

In [None]:
clf_dt = DecisionTreeClassifier(max_depth=32, min_samples_leaf=1, random_state=0)
clf_dt.fit(X_train, y_train)

y_predicted_dt = clf_dt.predict(X_test)
print(classification_report(y_true, y_predicted_dt, zero_division = 0))

In [None]:
cm = confusion_matrix(y_test, y_predicted_dt)
sns.heatmap(cm, annot=True,fmt='g')
plt.xlabel('Predicted')
plt.ylabel('True Value')
plt.show()

In [None]:
probs = clf_dt.predict_proba(X_test)
preds = probs[:,1]

fpr, tpr, threshold = metrics.roc_curve(y_true, preds)
roc_auc = metrics.auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

#### KNeighbors

In [None]:
clf_knn=KNeighborsClassifier(n_neighbors=1, weights="uniform")
clf_knn.fit(X_train, y_train)

y_predicted_knn = clf_knn.predict(X_test)

print(classification_report(y_true, y_predicted_knn, zero_division = 0))

In [None]:
cm = confusion_matrix(y_test, y_predicted_knn)
sns.heatmap(cm, annot=True,fmt='g')
plt.xlabel('Predicted')
plt.ylabel('True Value')
plt.show()

In [None]:
probs = clf_knn.predict_proba(X_test)
preds = probs[:,1]

fpr, tpr, threshold = metrics.roc_curve(y_true, preds)
roc_auc = metrics.auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

*Normalization*

In [None]:
from sklearn.preprocessing import MinMaxScaler

norm = MinMaxScaler().fit(X_train)
X_train_norm = norm.transform(X_train)

In [None]:
# reuse the scaler fitted on the training data; refitting on the test set
# would scale the two sets inconsistently
X_test_main_norm = norm.transform(X_test)

*Standardization*

In [None]:
from sklearn import preprocessing

std_scale = preprocessing.StandardScaler().fit(X_train)
X_train_std = std_scale.transform(X_train)

In [None]:
# reuse the scaler fitted on the training data; refitting on the test set
# would scale the two sets inconsistently
X_test_main_std = std_scale.transform(X_test)
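
A scikit-learn pipeline is one way to keep the scaler and the classifier together, so that the scaling statistics are always learned from the training data and then reused at prediction time. A minimal sketch on synthetic data (dataset and variable names are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# fit() learns mean and standard deviation from the training data only;
# predict() and score() reuse them on the test data.
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
knn_pipe.fit(X_tr, y_tr)

test_accuracy = knn_pipe.score(X_te, y_te)
print(test_accuracy)
```

The same pipeline object can also be passed to `GridSearchCV` or `cross_validate`, in which case the scaler is refitted inside every fold.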

#### KNeighbors With Normalization

In [None]:
clf_knn=KNeighborsClassifier(n_neighbors=1, weights="uniform")
clf_knn.fit(X_train_norm, y_train)

y_predicted_knn = clf_knn.predict(X_test_main_norm)

print(classification_report(y_true, y_predicted_knn, zero_division = 0))

In [None]:
cm = confusion_matrix(y_test, y_predicted_knn)
sns.heatmap(cm, annot=True,fmt='g')
plt.xlabel('Predicted')
plt.ylabel('True Value')
plt.show()

In [None]:
probs = clf_knn.predict_proba(X_test_main_norm)
preds = probs[:,1]

fpr, tpr, threshold = metrics.roc_curve(y_true, preds)
roc_auc = metrics.auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

#### KNeighbors With Standardization

In [None]:
clf_knn=KNeighborsClassifier(n_neighbors=1, weights="uniform")
clf_knn.fit(X_train_std, y_train)

y_predicted_knn = clf_knn.predict(X_test_main_std)

print(classification_report(y_true, y_predicted_knn, zero_division = 0))

In [None]:
cm = confusion_matrix(y_test,y_predicted_knn)
sns.heatmap(cm, annot=True,fmt='g')
plt.xlabel('Predicted')
plt.ylabel('True Value')
plt.show()

In [None]:
probs = clf_knn.predict_proba(X_test_main_std)
preds = probs[:,1]

fpr, tpr, threshold = metrics.roc_curve(y_true, preds)
roc_auc = metrics.auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

#### XGBoost

In [None]:
clf_xgb=XGBClassifier(objective='binary:logistic',learning_rate=0.1,max_depth=22,n_estimators=300)
clf_xgb.fit(X_train, y_train)

y_predicted_xgb = clf_xgb.predict(X_test)

print(classification_report(y_true, y_predicted_xgb, zero_division = 0))

In [None]:
cm = confusion_matrix(y_test,y_predicted_xgb)
sns.heatmap(cm, annot=True,fmt='g')
plt.xlabel('Predicted')
plt.ylabel('True Value')
plt.show()

In [None]:
probs = clf_xgb.predict_proba(X_test)
preds = probs[:,1]

fpr, tpr, threshold = metrics.roc_curve(y_true, preds)
roc_auc = metrics.auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

#### CatBoost

In [None]:
from catboost import CatBoostClassifier

In [None]:
params = {'loss_function':'Logloss', # objective function
          'eval_metric':'AUC', # metric
          'verbose': 1000,
         }

clf_cat = CatBoostClassifier(**params)
clf_cat.fit(X_train, y_train)

y_predicted_cat = clf_cat.predict(X_test)

print(classification_report(y_true, y_predicted_cat, zero_division = 0))

In [None]:
cm = confusion_matrix(y_test,y_predicted_cat)
sns.heatmap(cm, annot=True,fmt='g')
plt.xlabel('Predicted')
plt.ylabel('True Value')
plt.show()

In [None]:
probs = clf_cat.predict_proba(X_test)
preds = probs[:,1]

fpr, tpr, threshold = metrics.roc_curve(y_true, preds)
roc_auc = metrics.auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

## Conclusion

The model with the best score is the <em>XGBoost Classifier</em>.  
It achieved a ROC AUC of 0.97 on the test split of the SMOTE-resampled data.
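
To back a statement like this with a side-by-side comparison, the AUC of each fitted classifier can be collected into one sorted table. The sketch below uses synthetic data and two scikit-learn models as stand-ins; the dictionary `fitted` and the model choice are illustrative, not the notebook's actual results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# Stand-in models; in the notebook these would be clf_gbc, clf_dt, clf_xgb, ...
fitted = {
    "LogisticRegression": LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
    "DecisionTree": DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr),
}

# ROC AUC from the predicted probability of the positive class, best first.
auc_table = pd.Series(
    {name: roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
     for name, clf in fitted.items()}
).sort_values(ascending=False)
print(auc_table)
```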