# Predicting the Success of a Bank Marketing Campaign using Machine Learning

Richard Kaldenhoven  
11/28/2021


## 1. Introduction

The objective of this notebook is to develop a machine learning model that can predict whether a bank telemarketing campaign will be successful for a particular customer. The dataset used in this notebook is taken from the UCI Machine Learning Repository, and can be accessed at the following link (for this project the `bank-additional-full.csv` file is used):

[https://archive.ics.uci.edu/ml/datasets/Bank%2BMarketing](https://archive.ics.uci.edu/ml/datasets/Bank%2BMarketing)

The citation for this dataset is given below:

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, In press, http://dx.doi.org/10.1016/j.dss.2014.03.001

### 1.1 Problem Description

The machine learning problem is a binary classification where the target variable to predict is either:
* yes, for when the campaign is successful and the client has subscribed for a new term deposit
* no, for when the campaign is not successful and the client does not subscribe for a new term deposit.

The data description of the features and target is given in the next section. 

One issue with the dataset is the severe class imbalance: only 4,640 of the 41,188 observations (about 11%) are successful subscriptions, so the overwhelming majority represent failed attempts at getting a client to subscribe to a new term deposit.

### 1.2 Data Description

#### Input variables:
   **bank client data:**  
   1 - age (numeric)  
   2 - job : type of job (categorical: "admin.","blue-collar","entrepreneur","housemaid","management","retired","self-employed","services","student","technician","unemployed","unknown")  
   3 - marital : marital status (categorical: "divorced","married","single","unknown"; note: "divorced" means divorced or widowed)  
   4 - education (categorical: "basic.4y","basic.6y","basic.9y","high.school","illiterate","professional.course","university.degree","unknown")  
   5 - default: has credit in default? (categorical: "no","yes","unknown")  
   6 - housing: has housing loan? (categorical: "no","yes","unknown")  
   7 - loan: has personal loan? (categorical: "no","yes","unknown")  

   **related with the last contact of the current campaign:**  
   8 - contact: contact communication type (categorical: "cellular","telephone")   
   9 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")  
  10 - day_of_week: last contact day of the week (categorical: "mon","tue","wed","thu","fri")  
  11 - duration: last contact duration, in seconds (numeric). Important note:  this attribute highly affects the output target (e.g., if duration=0 then y="no"). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.  

   **other attributes:**  
  12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)  
  13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)  
  14 - previous: number of contacts performed before this campaign and for this client (numeric)  
  15 - poutcome: outcome of the previous marketing campaign (categorical: "failure","nonexistent","success")  

   **social and economic context attributes**  
  16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)  
  17 - cons.price.idx: consumer price index - monthly indicator (numeric)   
  18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)   
  19 - euribor3m: euribor 3 month rate - daily indicator (numeric)  
  20 - nr.employed: number of employees - quarterly indicator (numeric)  

  #### Output variable (desired target):
  21 - y - has the client subscribed a term deposit? (binary: "yes","no")

## 2. Model Performance Metric and Evaluation Protocol

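Models will be evaluated on a single train/test split of the data (60% train, 40% test), with the macro-averaged F1 score as the performance metric to maximize. Macro averaging weights both classes equally, so the minority "yes" class counts as much as the majority "no" class.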
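The evaluation code later in this notebook calls `f1_score(..., average='macro')`. As a rough illustration of what that metric computes, the sketch below implements macro F1 by hand in plain Python and applies it to a hypothetical toy label vector (the names `y_true_toy` and `y_pred_toy` are made up for this example). It shows why a model that always predicts "no" can reach high accuracy but a poor macro F1.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class (0.0 when undefined)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# Hypothetical toy data: 9 "no" (0) observations and 1 "yes" (1) observation
y_true_toy = [0] * 9 + [1]
y_pred_toy = [0] * 10  # a model that always predicts "no"

accuracy_toy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
macro_f1_toy = macro_f1(y_true_toy, y_pred_toy)

print('accuracy:', accuracy_toy)
print('macro F1:', round(macro_f1_toy, 3))
```

On this toy data the accuracy is 0.9, while the macro F1 is pulled below 0.5 because the F1 score for the "yes" class is zero.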

## 3. EDA and Preliminary Data Processing

In [36]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

In [37]:
df = pd.read_csv('bank-additional-full.csv', sep=';')

df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [38]:
# drop the duration column, as recommended in the data description
df.drop('duration', axis=1, inplace=True)

In [39]:
df.isnull().sum().max()

0

In [40]:
df['y'] = df['y'].apply(lambda x: 0 if x=='no' else 1)

In [41]:
df['y'].value_counts()

0    36548
1     4640
Name: y, dtype: int64

In [42]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,41188.0,40.02406,10.42125,17.0,32.0,38.0,47.0,98.0
campaign,41188.0,2.567593,2.770014,1.0,1.0,2.0,3.0,56.0
pdays,41188.0,962.475454,186.910907,0.0,999.0,999.0,999.0,999.0
previous,41188.0,0.172963,0.494901,0.0,0.0,0.0,0.0,7.0
emp.var.rate,41188.0,0.081886,1.57096,-3.4,-1.8,1.1,1.4,1.4
cons.price.idx,41188.0,93.575664,0.57884,92.201,93.075,93.749,93.994,94.767
cons.conf.idx,41188.0,-40.5026,4.628198,-50.8,-42.7,-41.8,-36.4,-26.9
euribor3m,41188.0,3.621291,1.734447,0.634,1.344,4.857,4.961,5.045
nr.employed,41188.0,5167.035911,72.251528,4963.6,5099.1,5191.0,5228.1,5228.1
y,41188.0,0.112654,0.316173,0.0,0.0,0.0,0.0,1.0


In [43]:
def get_num_cat_col_names(df, target='y'):
    '''
    Function to get the names of the numerical and categorical columns in a dataframe.
    
    Arguments:
    df - DataFrame (DataFrame)
    target - name of the target variable in the dataframe (string)

    Returns:
    numerical_columns - list of numerical columns
    categorical_columns - list of categorical columns
    '''
    
    # select_dtypes is the public pandas API for picking out the numeric columns
    numerical_columns = df.select_dtypes(include='number').columns.to_list()
    numerical_columns.remove(target)

    categorical_columns = list(set(df.columns.to_list()) - set(numerical_columns))
    categorical_columns.remove(target)

    return numerical_columns, categorical_columns

numerical_columns, categorical_columns = get_num_cat_col_names(df, target='y')

print(len(numerical_columns), 'Numerical Columns')
print(numerical_columns)

print(len(categorical_columns), 'Categorical Columns')
print(categorical_columns)

9 Numerical Columns
['age', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']
10 Categorical Columns
['day_of_week', 'contact', 'marital', 'loan', 'housing', 'default', 'month', 'job', 'education', 'poutcome']


In [44]:
def plot_num_features(df, numerical_columns):
    '''
    Function to plot distribution plots for each numerical column in a dataframe.
    
    Arguments:
    df - DataFrame (DataFrame)
    numerical_columns - list of numerical columns (list)

    Returns:
    None
    '''

    subscribed_df = df[df['y'] == 1]
    no_sub_df = df[df['y'] == 0]

    for i, col in enumerate(numerical_columns):
        plt.figure(i)
        # histplot replaces distplot, which is deprecated since seaborn 0.11
        sns.histplot(no_sub_df[col], kde=False, label='0')
        sns.histplot(subscribed_df[col], kde=False, label='1')
        plt.ylabel('Count')
        plt.legend()

def plot_cat_features(df, categorical_columns):
    '''
    Function to plot countplots for each categorical column in a dataframe.
    
    Arguments:
    df - DataFrame (DataFrame)
    categorical_columns - list of categorical columns (list)

    Returns:
    None
    '''
    for i, col in enumerate(categorical_columns):
        plt.figure(i, figsize=(14,4))
        sns.countplot(x=col, data=df, hue='y')
        plt.legend()

In [45]:
# hide plots for now to save space
#plot_num_features(df, numerical_columns)

In [46]:
# hide plots for now to save space
#plot_cat_features(df, categorical_columns)

In [47]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def process_data(dataframe, target='y', test_size=0.4, random_state=42):
    '''
    Function to scale the numerical columns and one hot encode the categorical columns in a dataframe,
    then perform a train/test split on the data.
    Requires previous function get_num_cat_col_names to be defined.
    
    Arguments:
    dataframe - DataFrame (DataFrame)
    target - target variable (string)
    test_size - fraction of data used for the test data set (float)
    random_state - random state used to ensure reproducibility in the train/test split (int)

    Returns:
    X_train, X_test, y_train, y_test - x and y pairs for the train and test sets (DataFrame) 
    '''

    # use the dataframe passed in as an argument rather than the global df
    numerical_columns, categorical_columns = get_num_cat_col_names(dataframe, target=target)
    
    X = dataframe.drop(target, axis=1)
    y = dataframe[target]

    X = pd.get_dummies(X, columns=categorical_columns, drop_first=True)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

    X_train_num = X_train[numerical_columns]
    X_test_num = X_test[numerical_columns]

    scaler = StandardScaler()
    scaler.fit(X_train_num)

    X_train_num_scaled = pd.DataFrame(scaler.transform(X_train_num), columns=numerical_columns, index=X_train_num.index)
    X_test_num_scaled = pd.DataFrame(scaler.transform(X_test_num), columns=numerical_columns, index=X_test_num.index)

    X_train = X_train_num_scaled.join(X_train.drop(numerical_columns, axis=1))
    X_test = X_test_num_scaled.join(X_test.drop(numerical_columns, axis=1))

    return X_train, X_test, y_train, y_test

In [48]:
X_train, X_test, y_train, y_test = process_data(dataframe=df, target='y')
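`process_data` fits the `StandardScaler` on the training rows only and then reuses the fitted statistics on the test rows, which keeps test-set information out of the preprocessing. The sketch below mimics that idea in plain Python on made-up numbers; `fit_standardizer` and `standardize` are hypothetical helper names, not part of scikit-learn.

```python
from statistics import mean, pstdev


def fit_standardizer(values):
    """Learn the mean and population standard deviation from training values."""
    mu = mean(values)
    sigma = pstdev(values)
    # guard against constant columns, where the standard deviation is zero
    return mu, (sigma if sigma > 0 else 1.0)


def standardize(values, mu, sigma):
    """Apply previously learned statistics to any set of values."""
    return [(v - mu) / sigma for v in values]


# Hypothetical ages for a tiny train and test set
train_age = [30, 40, 50]
test_age = [40, 60]

mu, sigma = fit_standardizer(train_age)              # statistics come from train only
train_scaled = standardize(train_age, mu, sigma)
test_scaled = standardize(test_age, mu, sigma)       # test reuses the train statistics

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in test_scaled])
```

The scaled test values are not centered on zero, which is expected: they are expressed relative to the training distribution.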

## 4. Baseline Model

### 4.1 Non-machine learning baseline

In [49]:
no_ml_pred = pd.Series(np.zeros((y_test.shape[0])))
no_ml_pred.shape

(16476,)

In [50]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score, precision_score, recall_score

experiment_logs = []

def evaluate_model(y_true, y_pred, experiment_name, show_reports=False, logs=experiment_logs):
    '''
    Function to evaluate a model and save performance metrics as an experiment log into a predefined variable.
    
    Arguments:
    y_true - true values for the target variable (Series)
    y_pred - predicted values from the model (numpy array or Series)
    experiment_name - name for the experiment log (string)
    show_reports - setting to display output from this function, default is False (boolean)
    logs - predefined variable to store experiment logs (list)

    Returns:
    metrics_dict - performance metrics (dictionary)
    '''

    if show_reports:
        class_report = classification_report(y_true, y_pred)
        
        # confusion_matrix orders rows (actual) and columns (predicted) by sorted label: 0, then 1
        conf_matrix = confusion_matrix(y_true, y_pred)
        conf_matrix_df = pd.DataFrame(conf_matrix, 
                                    index=[['Actual','Actual'],['0','1']], 
                                    columns=[['Predicted','Predicted'],['0','1']]
                                    )

        print(class_report)
        print('Confusion matrix: \n', conf_matrix_df, '\n')

    prec = precision_score(y_true, y_pred, average='macro')
    recall = recall_score(y_true, y_pred, average='macro')
    f1_score_macro = f1_score(y_true, y_pred, average='macro')
    
    metrics_dict = {'experiment name': experiment_name, 'results': {'precision': prec, 'recall': recall, 'f1 score': f1_score_macro}}
    
    logs.append(metrics_dict)
    return metrics_dict

no_ml_baseline_metrics_dict = evaluate_model(y_true=y_test, y_pred=no_ml_pred, experiment_name='No ML baseline')



### 4.2 Machine learning baseline

In [51]:
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(max_iter=500)
lr_model.fit(X_train, y_train)

lr_model_pred = lr_model.predict(X_test)

In [52]:
baseline_metrics_dict = evaluate_model(y_true=y_test, y_pred=lr_model_pred, experiment_name='ML baseline')

## 5. Improving the Model: Correcting Imbalanced Classes

### 5.1 Random Undersampling

In [53]:
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)

X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)
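As a rough picture of what random undersampling does, the sketch below keeps every minority-class row and draws an equally sized random subset of the majority class. It is a simplified plain-Python illustration on made-up data, not imbalanced-learn's implementation; `random_undersample` is a hypothetical helper name.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=42):
    """Randomly drop rows from larger classes until every class matches the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    n_min = min(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # sample without replacement; the smallest class is kept in full
        kept = members if len(members) == n_min else rng.sample(members, n_min)
        out_rows.extend(kept)
        out_labels.extend([label] * len(kept))
    return out_rows, out_labels


# Hypothetical data: 8 majority rows (label 0) and 2 minority rows (label 1)
toy_rows = [[i] for i in range(10)]
toy_labels = [0] * 8 + [1] * 2

rus_rows, rus_labels = random_undersample(toy_rows, toy_labels)
print(Counter(rus_labels))
```

The trade-off is that most majority-class rows are discarded, so the model trains on far less data.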

In [54]:
from sklearn.linear_model import LogisticRegression

lr_model_rus = LogisticRegression(max_iter=500)
lr_model_rus.fit(X_train_rus, y_train_rus)

lr_model_rus_pred = lr_model_rus.predict(X_test)

In [55]:
rus_metrics_dict = evaluate_model(y_true=y_test, y_pred=lr_model_rus_pred, experiment_name='RUS')

### 5.2 Oversampling with SMOTE

In [56]:
from imblearn.over_sampling import SMOTE

ros = SMOTE(random_state=42)

X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)
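Unlike plain random oversampling, which duplicates existing minority rows, SMOTE creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step for a single pair of made-up points; the neighbour search is omitted, and `smote_like_sample` is a hypothetical helper name rather than imbalanced-learn's code.

```python
import random


def smote_like_sample(x, neighbor, rng):
    """Create one synthetic point on the line segment between x and its neighbor."""
    u = rng.random()  # random position along the segment, in [0, 1)
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(42)

# Two hypothetical minority-class points (already scaled features)
minority_a = [1.0, 2.0]
minority_b = [3.0, 6.0]

synthetic = smote_like_sample(minority_a, minority_b, rng)
print(synthetic)
```

Because the synthetic point always lies between two real minority points, SMOTE fills in the minority region instead of repeating the same rows.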

In [57]:
from sklearn.linear_model import LogisticRegression

lr_model_ros = LogisticRegression(max_iter=500)
lr_model_ros.fit(X_train_ros, y_train_ros)

lr_model_ros_pred = lr_model_ros.predict(X_test)

In [58]:
ros_metrics_dict = evaluate_model(y_true=y_test, y_pred=lr_model_ros_pred, experiment_name='ROS')

### 5.3 Combining SMOTE and Random Undersampling

In [59]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

oversample = SMOTE(random_state=42, sampling_strategy=0.2)

undersample = RandomUnderSampler(random_state=42, sampling_strategy=0.5)

model = LogisticRegression(max_iter=500)

steps = [('o', oversample), ('u', undersample), ('m', model)]
pipeline = Pipeline(steps=steps)
pipeline.fit(X_train, y_train)

pipeline_pred = pipeline.predict(X_test)
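For a binary problem, a float `sampling_strategy` is the desired ratio of minority to majority samples after resampling. The arithmetic sketch below works through the two steps of the pipeline on hypothetical class counts (1,000 majority and 50 minority rows, not the real training counts). `resampled_counts` is a made-up helper that only approximates the bookkeeping and does not call imbalanced-learn.

```python
def resampled_counts(n_majority, n_minority, over_ratio=0.2, under_ratio=0.5):
    """Approximate class counts after oversampling and then undersampling."""
    # Step 1 (SMOTE): grow the minority class to over_ratio * majority
    n_minority_after = max(n_minority, int(over_ratio * n_majority))
    # Step 2 (undersampling): shrink the majority class so that
    # minority / majority equals under_ratio
    n_majority_after = min(n_majority, int(n_minority_after / under_ratio))
    return n_majority_after, n_minority_after


n_maj_after, n_min_after = resampled_counts(n_majority=1000, n_minority=50)
print('majority:', n_maj_after, 'minority:', n_min_after)
```

With these settings the minority class ends up at half the size of the majority class, so the data is rebalanced without fully equalizing the classes.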


In [60]:
resample_metrics_dict = evaluate_model(y_true=y_test, y_pred=pipeline_pred, experiment_name='ROS + RUS')

## 6. Experimenting with a More Complex ML Model

### 6.1 Random Forest

In [61]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

oversample = SMOTE(random_state=42, sampling_strategy=0.2)

undersample = RandomUnderSampler(random_state=42, sampling_strategy=0.5)

model = RandomForestClassifier()

steps = [('o', oversample), ('u', undersample), ('m', model)]
pipeline = Pipeline(steps=steps)
pipeline.fit(X_train, y_train)

pipeline_pred = pipeline.predict(X_test)

In [62]:
rfc_metrics_dict = evaluate_model(y_true=y_test, y_pred=pipeline_pred, experiment_name='Random Forest')

### 6.2 Feature Importances and Feature Selection

In [63]:
def generate_rf_importances(train_data, model_object):
    '''
    Function to calculate and display the feature importances from a Random Forest model.
    
    Arguments:
    train_data - training data used for the model (DataFrame)
    model_object - Random Forest model object (Object)

    Returns:
    imp_df - table of feature importances (DataFrame) 
    '''

    imp_dict = {'Feature':np.asarray(train_data.columns), 'Random Forest Importance':model_object.feature_importances_}

    imp_df = pd.DataFrame(imp_dict)
    imp_df.set_index('Feature', inplace=True)
    imp_df.sort_values(by='Random Forest Importance', inplace=True, ascending=False)
    # importances are kept as floats so they can be compared against a numeric threshold later

    return imp_df

In [64]:
rfimp_df = generate_rf_importances(train_data=X_train, model_object=model)
rfimp_df

Feature,Random Forest Importance
euribor3m,0.157507
age,0.141565
campaign,0.084052
nr.employed,0.069462
emp.var.rate,0.043425
cons.conf.idx,0.035279
housing_yes,0.026397
cons.price.idx,0.026129
pdays,0.024064
poutcome_success,0.020937


In [65]:
features_to_drop = rfimp_df[rfimp_df['Random Forest Importance'] < 0.025].index.to_list()

In [66]:
X_train_new = X_train.drop(features_to_drop, axis=1)
X_test_new = X_test.drop(features_to_drop, axis=1)

print(X_train_new.columns.to_list())

['age', 'campaign', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed', 'housing_yes']
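The thresholding step can be checked by hand against the ten importances displayed above. The sketch below copies those values into a plain dictionary (the variable names are chosen for this example) and splits them at the same 0.025 cut-off.

```python
# Importances copied from the table shown above
importances = {
    'euribor3m': 0.157507,
    'age': 0.141565,
    'campaign': 0.084052,
    'nr.employed': 0.069462,
    'emp.var.rate': 0.043425,
    'cons.conf.idx': 0.035279,
    'housing_yes': 0.026397,
    'cons.price.idx': 0.026129,
    'pdays': 0.024064,
    'poutcome_success': 0.020937,
}

threshold = 0.025

# features at or above the threshold are kept, the rest are dropped
kept_features = [name for name, score in importances.items() if score >= threshold]
dropped_features = [name for name, score in importances.items() if score < threshold]

print('kept:', kept_features)
print('dropped:', dropped_features)
```

The eight kept names match the columns printed for `X_train_new`, with `pdays` and `poutcome_success` falling just below the cut-off.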


In [67]:
pipeline.fit(X_train_new, y_train)

pipeline_pred_new = pipeline.predict(X_test_new)

In [68]:
rfc_metrics_dict_new = evaluate_model(y_true=y_test, y_pred=pipeline_pred_new, experiment_name='Random Forest + Feat Select')

### 6.3 Reviewing Results

In [69]:
def collect_logs(experiment_logs):
    '''
    Function to collect and display the performance metrics from each experiment log stored in the experiment log variable

    Arguments:
    experiment_logs - variable containing previously created experiment logs (list)

    Returns:
    logs_df - table of performance metrics (DataFrame)
    '''

    results = []

    for log in experiment_logs:
        metrics = log['results']
        results.append([log['experiment name'], metrics['precision'], metrics['recall'], metrics['f1 score']])

    # build the table once, after all logs are collected (this also handles an empty list)
    logs_df = pd.DataFrame(results, columns=['Experiment Name', 'Precision', 'Recall', 'F1 Score'])
    return logs_df

In [70]:
logs = collect_logs(experiment_logs)
logs

Unnamed: 0,Experiment Name,Precision,Recall,F1 Score
0,No ML baseline,0.443767,0.5,0.470208
1,ML baseline,0.785873,0.605002,0.640616
2,RUS,0.648088,0.738979,0.672574
3,ROS,0.603764,0.660747,0.618585
4,ROS + RUS,0.677262,0.704989,0.689434
5,Random Forest,0.680145,0.712164,0.693958
6,Random Forest + Feat Select,0.634596,0.687498,0.652653


## 7. Future Work

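The results so far show that, with logistic regression, combining SMOTE oversampling with random undersampling is the best of the resampling strategies tried (macro F1 of 0.689, versus 0.673 for random undersampling alone and 0.619 for SMOTE alone). The Random Forest model trained with the same resampling yields the highest F1 score overall (0.694), while dropping the low-importance features lowered it to 0.653. There are several steps that can be taken in the future: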

* Further experimentation with feature selection, to see if a different combination of features can improve the F1 score
* Experimenting with a more complex model, such as gradient boosting, to improve the F1 score
* Implementing a grid search or random search to tune the hyperparameters of a gradient boosting model and improve the F1 score
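One possible starting point for the last two items is sketched below, assuming scikit-learn's `GradientBoostingClassifier` and `GridSearchCV` with the built-in `'f1_macro'` scorer. The data is a small synthetic imbalanced set from `make_classification`, standing in for the bank data, and the parameter grid is an arbitrary illustrative choice rather than a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the bank data: roughly 90% class 0, 10% class 1
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=42
)

# Small, arbitrary grid to keep the example fast
param_grid = {
    'max_depth': [2, 3],
    'learning_rate': [0.05, 0.1],
}

search = GridSearchCV(
    estimator=GradientBoostingClassifier(n_estimators=30, random_state=42),
    param_grid=param_grid,
    scoring='f1_macro',  # same metric family used throughout this notebook
    cv=3,
)
search.fit(X_demo, y_demo)

print('best parameters:', search.best_params_)
print('best cross-validated macro F1:', round(search.best_score_, 3))
```

In the real notebook the search would be fit on `X_train` and `y_train`, and any resampling would need to happen inside each cross-validation fold (for example via an imbalanced-learn `Pipeline`) so that synthetic samples do not leak into the validation folds.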