# Bank Marketing Campaign 

**The aim of this notebook is to build a machine learning model that helps ABC Bank identify the right customers to target
in its marketing campaign by classifying whether a customer will buy a term deposit.**


# Exploratory Data Analysis

In this notebook, we will go through the following steps. 

1. **Data Preparation**

   - Importing libraries
   - Data ingestion
   - Data overview
   - Categorical & numerical variables

2. **Model Preparation**

3. **Model Building**

4. **Communication**

===================================================================================================

# Data Preparation

## Import Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# from feature-engine
from feature_engine.imputation import (
    AddMissingIndicator,
    MeanMedianImputer,
    CategoricalImputer,
)

from feature_engine.transformation import (
    LogTransformer,
    YeoJohnsonTransformer,
)

from sklearn.preprocessing import MinMaxScaler,LabelEncoder

from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

In [2]:
import warnings
warnings.filterwarnings('ignore')

## Data Ingestion

### Testutility file

In [3]:
%%writefile testutility.py
import logging
import os
import subprocess
import yaml 
import pandas as pd
import datetime
import gc
import re

#################
# File Reading #
#################

def read_config_file(filepath):
  with open(filepath, "r") as stream:
    try:
      return yaml.safe_load(stream)
    except yaml.YAMLError as exc:
      logging.error(exc)

Writing testutility.py


### Yaml file

In [4]:
%%writefile file.yaml
file_type: csv
dataset_name: testfile
file_name: bank-additional-full
table_name: edsurv
inbound_delimeter: ";"
outbound_delimeter: "|"

Writing file.yaml


In [5]:
# Read Config file
import testutility as util
config_data = util.read_config_file("file.yaml")

In [6]:
config_data["file_name"]

'bank-additional-full'
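
Every later cell indexes `config_data` by key, so it may help to check the keys right after loading. The following is a minimal sketch, assuming that all six keys written to `file.yaml` are required. `validate_config` and `REQUIRED_KEYS` are hypothetical names and are not part of `testutility.py`.

```python
# Hypothetical helper: fail early if the loaded config lacks an expected key.
REQUIRED_KEYS = (
    "file_type",
    "dataset_name",
    "file_name",
    "table_name",
    "inbound_delimeter",   # spelling follows file.yaml
    "outbound_delimeter",
)


def validate_config(config):
    """Return the config unchanged, or raise if it is unusable."""
    if config is None:
        # read_config_file returns None when the YAML cannot be parsed
        raise ValueError("config could not be loaded")
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"missing config keys: {missing}")
    return config


# Example dictionary with the same values as file.yaml
example_config = {
    "file_type": "csv",
    "dataset_name": "testfile",
    "file_name": "bank-additional-full",
    "table_name": "edsurv",
    "inbound_delimeter": ";",
    "outbound_delimeter": "|",
}
checked = validate_config(example_config)
print(checked["file_name"])
```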

In [7]:
# read the files using config file
file_type = config_data["file_type"]
source_file = "./" + config_data["file_name"] + f".{file_type}"

# print("", source_file)
# pass the delimiter by keyword; recent pandas versions no longer accept it positionally
df = pd.read_csv(source_file, sep=config_data["inbound_delimeter"])

FileNotFoundError: [Errno 2] No such file or directory: './bank-additional-full.csv'
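
The `FileNotFoundError` above means that `bank-additional-full.csv` was not in the working directory when the cell ran. One way to make this easier to diagnose is to check the path before reading. The sketch below assumes a hypothetical helper named `load_delimited` and uses a small invented sample file only to demonstrate the call.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_delimited(path, delimiter):
    """Read a delimited text file, with a clearer error if the file is absent."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected the data file at {path.resolve()}; "
            "place it there or adjust file_name in the config."
        )
    return pd.read_csv(path, sep=delimiter)


# Demonstration on a tiny invented file that uses the same ";" delimiter
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.csv"
    sample_path.write_text("age;job;y\n56;housemaid;no\n37;services;yes\n")
    sample = load_delimited(sample_path, ";")

print(sample.shape)
```

In the notebook itself the call would be `load_delimited(source_file, config_data["inbound_delimeter"])`.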

## Data Overview

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.info()

In [None]:
import os
def summary(df, source_file):
    rows = len(df)
    columns = len(df.columns)
    print(f"Number of Rows: {rows}")
    print(f"Number of Columns: {columns}")
    file_size = os.path.getsize(source_file)
    print(f"Size: {file_size} bytes")

In [None]:
summary(df, source_file)

In [None]:
df.columns

In [None]:
df.duplicated().sum()

In [None]:
# Remove Duplicates
df = df.drop_duplicates()

In [None]:
df.duplicated().sum()

In [None]:
df.dtypes

Data types are correctly assigned

In [None]:
df.isna().sum()

After removing the duplicates, the results above show that the dataset contains no duplicate rows and no null values.

## Categorical & Numerical variables

### Variable Types

#### Target Variable

In [None]:
target_var = ['y']
target = df[target_var]
target_var

In [None]:
df["y"].unique()

#### Categorical Variables

In [None]:
categorical_var = [var for var in df.columns 
                   if df[var].dtype=="O" and 
                   var not in target_var and 
                   var not in ["month", "day_of_week", "default"]]
categorical_var

In [None]:
# let's explore the values of these categorical variables
for var in categorical_var:
    print(var, df[var].unique())
    print()

#### Numeric Variables

In [None]:
# select by dtype so that the object column "default" is not counted as numeric
numerical_var = [var for var in df.columns if df[var].dtype != "O"]
numerical_var

#### Temporal Variables 

In [None]:
temporal_var = [var for var in df.columns if var =="month" or var=="day_of_week"]
temporal_var

In [None]:
# let's explore the values of these temporal variables

for var in temporal_var:
    print(var, df[var].unique())
    print()

#### Discrete Variables

In [None]:
discrete_var = [var for var in numerical_var if len(df[var].unique()) < 32 and var not in temporal_var]
discrete_var

In [None]:
# let's explore the values of these discrete variables

for var in discrete_var:
    print(var, df[var].unique())
    print()

#### Continuous Variables

In [None]:
# make list of continuous variables
continuous_var = [
    var for var in numerical_var if var not in discrete_var+temporal_var]
continuous_var

In [None]:
# let's explore the values of these continuous variables

for var in continuous_var:
    print(var, df[var].unique())
    print()

## Model Preparation

NOTES:
- `duration` can be explored during EDA for business purposes, for example to decide whether agents should try to keep a person on the call longer. The model should be built both with and without `duration`, because the value is only known after the call has been made; for a potential client who has not yet been called, the feature is unavailable.
- There are no missing or duplicate values and the data types are correctly assigned, so no changes are needed there.
- The value 999 in the `pdays` column means the customer has not been contacted before. We should change it to 0.
- Outliers need to be handled in the deposit column
- Age column can be log transformed
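
The `pdays` note above can be sketched as follows. This is only an illustration on invented values, not a step the notebook runs. Recoding 999 to 0 on its own would make "never contacted" look the same as a contact zero days earlier, so the sketch also keeps an indicator column; the name `previously_contacted` is hypothetical.

```python
import pandas as pd

# Invented values; 999 is the sentinel for "not contacted before"
toy = pd.DataFrame({"pdays": [999, 3, 999, 6]})

# Create the flag first so the information in the sentinel is not lost
toy["previously_contacted"] = (toy["pdays"] != 999).astype(int)

# Then recode the sentinel to 0 as the note suggests
toy["pdays"] = toy["pdays"].replace(999, 0)

print(toy)
```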

In [None]:
# drop "default": it has very few "yes" values, so it gives the model little predictive power
# drop "duration": it is only known after the call has been made
df = df.drop(columns=["default", "duration"])

## Train-Test Split

In [None]:
target = "y"
X = df.drop(columns=target)
y = df[target]

print("X shape:", X.shape)
print("y shape:", y.shape)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)
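
Because the target classes are imbalanced, a plain random split can leave slightly different class proportions in the train and test sets. The sketch below shows the `stratify` argument of `train_test_split` on invented data; it is an optional variation, not what the cell above does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data: 90 "no" and 10 "yes" labels
toy_X = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series(["no"] * 90 + ["yes"] * 10, name="y")

# stratify keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```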

# Feature Engineering

### Missing Values

In [None]:
df.isnull().sum()

Fortunately, there are no missing values in the dataset. However, exploring the data shows that 6 of the categorical variables contain an "unknown" value. These are the only missing values, and they do not strictly need handling because "unknown" already acts as its own category. Still, where the proportion of "unknown" values is small (less than 10%), they can be replaced with the most frequent value of the column, while variables with a high proportion (more than 10%) can keep "unknown" as a separate category.

Since we removed the `default` column, only 5 categorical variables with "unknown" values remain, and none of them has more than 10% "unknown" values. Note that the cells below apply a stricter 5% cut-off when selecting variables for most-frequent imputation.
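
The rule described above can be sketched on a toy frame. The data and the names `impute_with_mode` and `keep_as_category` are invented for illustration, and the 10% threshold is taken from the paragraph above.

```python
import pandas as pd

# Invented data: "loan" has 5% unknown values, "default" has 25%
toy = pd.DataFrame({
    "loan": ["no"] * 15 + ["yes"] * 4 + ["unknown"],
    "default": ["no"] * 15 + ["unknown"] * 5,
})

# Share of "unknown" values per column
unknown_share = (toy == "unknown").mean()
threshold = 0.10

# Small share: candidates for most-frequent imputation
impute_with_mode = unknown_share[
    (unknown_share > 0) & (unknown_share < threshold)
].index.tolist()

# Large share: keep "unknown" as its own category
keep_as_category = unknown_share[unknown_share >= threshold].index.tolist()

print(unknown_share)
print(impute_with_mode, keep_as_category)
```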

In [None]:
# To handle the missing values, we first turn the 'unknown' observations to NaNs
X_train = X_train.replace('unknown', np.nan)
X_test = X_test.replace('unknown', np.nan)
X_train.isnull().mean()

In [None]:
# make a list of the categorical variables that contain missing values

cat_vars_with_na = [
    var for var in categorical_var
    if X_train[var].isnull().sum() > 0
]

# print percentage of missing values per variable
X_train[cat_vars_with_na ].isnull().mean().sort_values(ascending=False)

In [None]:
cat_vars_with_na

In [None]:
# # variables to impute with the string "unknown"
# with_string_missing = [
#     var for var in cat_vars_with_na if X_train[var].isnull().mean() > 0.05]

# variables to impute with the most frequent category
with_frequent_category = [
    var for var in cat_vars_with_na if X_train[var].isnull().mean() < 0.05]

In [None]:
# # I print the values here, because it makes it easier for
# # later when we need to add this values to a config file for 
# # deployment

# with_string_missing

In [None]:
with_frequent_category

In [None]:
# # replace missing values with new label: "unknown"

# # set up the class
# cat_imputer_missing = CategoricalImputer(
#     imputation_method='missing', fill_value = "unknown", variables=with_string_missing)

# # fit the class to the train set
# cat_imputer_missing.fit(X_train)

# # the class learns and stores the parameters
# cat_imputer_missing.imputer_dict_

In [None]:
# # replace NA by missing
# X_train = cat_imputer_missing.transform(X_train)
# X_test = cat_imputer_missing.transform(X_test)

In [None]:
# replace missing values with most frequent category

# set up the class
cat_imputer_frequent = CategoricalImputer(
    imputation_method='frequent', variables=with_frequent_category)

# fit the class to the train set
cat_imputer_frequent.fit(X_train)

# the class learns and stores the parameters
cat_imputer_frequent.imputer_dict_

In [None]:
# replace NA with the most frequent category
X_train = cat_imputer_frequent.transform(X_train)
X_test = cat_imputer_frequent.transform(X_test)

In [None]:
# check that we have no missing information in the engineered variables

X_train[cat_vars_with_na].isnull().sum()

In [None]:
# check that test set does not contain null values in the engineered variables

[var for var in cat_vars_with_na if X_test[var].isnull().sum() > 0]

## Numerical variable transformation

### Logarithmic transformation

In the previous notebook, we observed that the numerical variables are not normally distributed.

We will apply a logarithmic transformation to the "age" variable to obtain a more Gaussian-like distribution.

In [None]:
log_transformer = LogTransformer(
    variables=["age"])

X_train = log_transformer.fit_transform(X_train)
X_test = log_transformer.transform(X_test)
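
To see why a log transform helps with a right-skewed variable, the sketch below compares skewness before and after on synthetic, strictly positive values. The sample is generated, not taken from the bank data, and it uses `np.log` directly as a stand-in for what `LogTransformer` applies to the selected column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed, strictly positive values (a stand-in for a variable like "age")
values = pd.Series(rng.lognormal(mean=3.6, sigma=0.5, size=2000))

skew_before = values.skew()
skew_after = np.log(values).skew()

print("skew before log:", round(skew_before, 3))
print("skew after log:", round(skew_after, 3))
```

A skewness close to zero after the transform indicates a more symmetric distribution. The logarithm is only defined for positive values, which holds for age.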

### Yeo-Johnson transformation

The Yeo-Johnson transformation was intended to give `duration` a more Gaussian-like distribution. Because `duration` has been dropped from the data, the cell below is left commented out.

In [None]:
# yeo_transformer = YeoJohnsonTransformer(
#     variables=['duration'])

# X_train = yeo_transformer.fit_transform(X_train)
# X_test = yeo_transformer.transform(X_test)

# # the learned parameter
# yeo_transformer.lambda_dict_

In [None]:
X_train.head()

## Categorical feature encoding

### Replace Binary Features with 1 and 0

In [None]:
binary_features = ['housing', 'loan']

In [None]:
for col in binary_features:
    X_train[col] = X_train[col].replace(["yes"], 1)
    X_train[col] = X_train[col].replace(["no"], 0)
    X_test[col] = X_test[col].replace(["yes"], 1)
    X_test[col] = X_test[col].replace(["no"], 0)

In [None]:
X_train.info()

In [None]:
from sklearn.preprocessing import OrdinalEncoder
# list the categories from lowest to highest education level (assumed ordering)
# so that the integer codes carry an ordinal meaning
ord_enc = OrdinalEncoder(categories = [['illiterate', 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'professional.course', 'university.degree']])
X_train["education"] = ord_enc.fit_transform(X_train[["education"]]).ravel()
X_test["education"] = ord_enc.transform(X_test[["education"]]).ravel()
X_train.head()

In [None]:
y_train

In [None]:
X_test.head()

In [None]:
le=LabelEncoder()
y_train = y_train.to_frame()
y_test = y_test.to_frame()
y_train["y"]=le.fit_transform(y_train["y"])
y_test["y"] = le.transform(y_test["y"])

In [None]:
X_train.info()

In [None]:
X_test.info()

In [None]:
cols_to_encode_Xtrain = X_train.select_dtypes(include="object")

In [None]:
cols_to_encode_Xtest = X_test.select_dtypes(include="object")

In [None]:
# one-hot encode the remaining object columns in each split
X_train = pd.get_dummies(X_train, columns=list(cols_to_encode_Xtrain.columns))
X_test = pd.get_dummies(X_test, columns=list(cols_to_encode_Xtest.columns))

# the splits are encoded separately, so give the test set exactly the training columns:
# categories missing from the test set become all-zero columns, unseen ones are dropped
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

In [None]:
X_train.info()

In [None]:
X_test.info()

In [None]:
y_train.head()

In [None]:
y_test.head()

### Resample

In [None]:
over_sampler = RandomOverSampler(random_state=42)
X_train_over, y_train_over = over_sampler.fit_resample(X_train,y_train)
print("X_train_over shape:", X_train_over.shape)
X_train_over.head()
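
Random oversampling duplicates rows of the minority class, drawn with replacement, until the classes are the same size. It is applied to the training set only, so the test set keeps its original class balance. The sketch below reproduces the idea by hand with pandas on invented data; it illustrates the concept and is not the `RandomOverSampler` implementation.

```python
import pandas as pd

# Invented data: 8 rows of class 0 and 2 rows of class 1
toy = pd.DataFrame({"x": range(10), "y": [0] * 8 + [1] * 2})

majority_size = toy["y"].value_counts().max()

parts = []
for label, group in toy.groupby("y"):
    shortfall = majority_size - len(group)
    if shortfall > 0:
        # draw extra minority rows with replacement to close the gap
        extra = group.sample(n=shortfall, replace=True, random_state=42)
        group = pd.concat([group, extra])
    parts.append(group)

balanced = pd.concat(parts, ignore_index=True)
print(balanced["y"].value_counts())
```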

## Build Model

In [None]:
acc_baseline = y_train.value_counts(normalize=True).max()
print("Baseline Accuracy:", round(acc_baseline, 4))
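
The baseline computed above is the accuracy of a model that always predicts the most frequent class. A small sketch on invented labels shows that the two quantities coincide:

```python
import pandas as pd

# Invented labels: 80% class 0, 20% class 1
toy_y = pd.Series([0] * 8 + [1] * 2)

# Share of the most frequent class
baseline = toy_y.value_counts(normalize=True).max()

# Accuracy of always predicting the most frequent class
majority_pred = pd.Series([toy_y.mode()[0]] * len(toy_y))
accuracy = (majority_pred == toy_y).mean()

print(baseline, accuracy)
```

On imbalanced data this baseline is already high, which is why a model's accuracy should be read against it rather than against 50%.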

In [None]:
clf = make_pipeline(
    RandomForestClassifier(random_state=42)
)
print(clf)

In [None]:
cv_acc_scores = cross_val_score(clf, X_train_over, y_train_over, cv=5, n_jobs=-1)
print(cv_acc_scores)

In [None]:
params = {
    "randomforestclassifier__n_estimators": range(25,100,25),
    "randomforestclassifier__max_depth": range(10,50,10),

}
params

In [None]:
model = GridSearchCV(
    clf,
    param_grid = params,
    cv = 5,
    n_jobs = -1,
    verbose= 1
)
model
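
To estimate how much work the grid search does, the sketch below enumerates the same two ranges with `itertools.product` and multiplies by the number of folds. The variable names are chosen for illustration.

```python
from itertools import product

n_estimators_values = list(range(25, 100, 25))  # 25, 50, 75
max_depth_values = list(range(10, 50, 10))      # 10, 20, 30, 40

# Every pairing of the two hyperparameters is one candidate
combinations = list(product(n_estimators_values, max_depth_values))

n_folds = 5
total_fits = len(combinations) * n_folds

print("candidates:", len(combinations))
print("cross-validation fits:", total_fits)
```

With the default `refit=True`, `GridSearchCV` trains one more model on the full training data using the best candidate.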

In [None]:
# Train model
model.fit(X_train_over, y_train_over)

In [None]:
cv_results = pd.DataFrame(model.cv_results_)
cv_results.head(10)

In [None]:
# Create mask
mask = cv_results["param_randomforestclassifier__max_depth"] == 40
# Plot fit time vs n_estimators
plt.plot( cv_results[mask]["param_randomforestclassifier__n_estimators"],
         cv_results[mask]["mean_fit_time"]    
)
# Label axes
plt.xlabel("Number of Estimators")
plt.ylabel("Mean Fit Time [seconds]")
plt.title("Training Time vs Estimators (max_depth=40)");

In [None]:
# Create mask
mask = cv_results["param_randomforestclassifier__n_estimators"] == 50
# Plot fit time vs max_depth
plt.plot( cv_results[mask]["param_randomforestclassifier__max_depth"],
         cv_results[mask]["mean_fit_time"]
    
)

# Label axes
plt.xlabel("Max Depth")
plt.ylabel("Mean Fit Time [seconds]")
plt.title("Training Time vs Max Depth (n_estimators=50)");

In [None]:
cv_results[mask][["mean_fit_time", "param_randomforestclassifier__max_depth"]]

In [None]:
# Extract best hyperparameters
model.best_params_

In [None]:
model.best_score_

### Evaluate 

In [None]:
X_test.head()

In [None]:
acc_train = model.score(X_train_over,y_train_over)
acc_test = model.score(X_test, y_test)



print("Training Accuracy:", round(acc_train, 4))
print("Test Accuracy:", round(acc_test, 4))

We beat the baseline! Just barely, but we beat it.

In [None]:
y_test.value_counts()

In [None]:
# Plot confusion matrix
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test);
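
The counts in a confusion matrix can be summarised as precision and recall for the positive class, which are often more informative than accuracy when the classes are imbalanced. The sketch below uses invented labels and predictions, not the model's output.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Invented example: 8 negatives and 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # share of predicted positives that are correct
recall = recall_score(y_true, y_pred)        # share of actual positives that are found

print(accuracy, precision, recall)
```

In the notebook, the same functions could be called with `y_test` and `model.predict(X_test)`.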

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

ns_probs = [0 for _ in range(len(y_test))]
# predict probabilities
lr_probs = model.predict_proba(X_test)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]
# calculate scores
ns_auc = roc_auc_score(y_test, ns_probs)
lr_auc = roc_auc_score(y_test, lr_probs)
# summarize scores
print('No Skill: ROC AUC=%.3f' % (ns_auc))
print('Random Forest: ROC AUC=%.3f' % (lr_auc))
# calculate roc curves
ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs)
# plot the roc curve for the model
plt.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
plt.plot(lr_fpr, lr_tpr, marker='.', label='Random Forest')
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# show the legend
plt.legend()
# show the plot
plt.show()

### Communicate

In [None]:
# Get feature names from training data
features = X_train_over.columns

# Extract importances from model
importances = model.best_estimator_.named_steps["randomforestclassifier"].feature_importances_
# Create a series with feature names and importances
feat_imp = pd.Series(importances, index=features).sort_values()
# Plot 10 most important features
feat_imp.tail(10).plot(kind = "barh")
plt.xlabel("Gini Importance")
plt.ylabel("Feature")
plt.title("Feature Importance");