# Predicting Subscription to Term Deposit
Notebook by Garima Mittal

## Aim
The aim is to develop a machine learning model that predicts whether a customer will subscribe to a term deposit. An adapted version of a bank marketing dataset is used for this. More about the original dataset [here](https://archive.ics.uci.edu/ml/datasets/bank+marketing).

### Importing modules and loading data

In [None]:
#Importing required modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import time
plt.style.use('ggplot')

# ignore warnings 
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

#to include matplotlib graphs within the notebook, next to the code
%matplotlib inline  

In [None]:
#Loading the dataset
df = pd.read_csv('../input/bank-customer-dataset/data.csv')

----
## 1. Data Visualisation

In [None]:
df.head()

In [None]:
df.dtypes

Many columns, such as 'job', 'marital', 'education' and 'housing', hold a finite set of string values. Let's get a better overview of these columns.

In [None]:
obj_columns = ['job', 'marital', 'education', 'housing', 'loan', 'poutcome']
df[obj_columns].describe()

In [None]:
plt.figure(figsize=(20,10))
for i, col in enumerate(obj_columns):
    ax = plt.subplot(2,3,i+1)
    ax = df[col].value_counts().plot(kind='bar')
    ax.set_title(col)
plt.subplots_adjust(hspace=1)
plt.show()

## 2. Data Cleaning & Pre-processing

### Observation 1

The string variables 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'poutcome', 'y' in the given dataset take on a limited and fixed number of possible values. These can therefore be converted to categorical type. This has the following advantages:
- Storing variables as numerical categories saves memory in comparison to storing long strings.
- It allows application of suitable statistical methods and plotting techniques.

However, some ML algorithms interpret numerically encoded data by its magnitude. This can mislead the model when the numeric codes of a categorical variable follow no order of precedence. It is therefore best to transform these variables using One Hot Encoding, where each new column represents one distinct value of the given variable.
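
The difference between integer codes and one hot encoding can be sketched on a toy column. The values below are made up for illustration and are not taken from the bank dataset.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the bank dataset)
toy = pd.DataFrame({'marital': ['single', 'married', 'divorced', 'married']})

# Integer codes impose an arbitrary order: divorced=0 < married=1 < single=2
codes = toy['marital'].astype('category').cat.codes

# One hot encoding creates one indicator column per distinct value instead
one_hot = pd.get_dummies(toy, columns=['marital'], prefix_sep='_', dtype=int)

print(codes.tolist())
print(one_hot)
```

Each row of `one_hot` has exactly one indicator set, so no ordering between the categories is implied.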

### One Hot Encoding of (Non-Binary) Categorical Text Variables

In [None]:
#Selecting binary variables from category dict, since we do not want to apply One Hot encoding on those.
#These can directly be transformed into binary categories
binary_cols = df.columns[df.nunique() == 2]
binary_cols

In [None]:
#Non-binary categorical columns for one hot encoding
cat_columns = ['job', 'marital', 'education', 'housing', 'loan', 'poutcome']
df = pd.get_dummies(df, prefix_sep="_",columns=cat_columns)

### Factorizing the Binary Text Variables

In [None]:
df[binary_cols] = df[binary_cols].apply(lambda x: pd.factorize(x)[0])

### Observation 2

The variables 'month', 'day_of_week' can be converted to numerical values by converting these to datetime format and then getting the corresponding numerical values. This allows datetime operations or time based analysis to be performed on these variables.

**Note:** The initial idea was to one hot encode the **month** and **day_of_week** variables along with the categorical variables above. However, the classifier trained on the dataset **without** one hot encoded month and day_of_week performed better than the one trained **with** them. A likely reason is that one hot encoding these two variables adds 13 extra features to the dataset, and too many features can lead to overfitting. This is covered in the Discussion section at the end of the notebook.

#### Converting *month* and *day_of_week* to numeric values

In [None]:
#First checking the month  and day_of_week columns for null values
print(df.month.isnull().sum())
print(df.day_of_week.isnull().sum())

In [None]:
#The conversion to datetime does not support null values, so we remove the row with a null month value
df = df.dropna(axis=0, subset=['month'])

In [None]:
#We now convert month and day_of_week to datetime format and extract their respective numerical values and assign
#these to the columns

month = []
weekday = []

for m, wd in zip(df.month, df.day_of_week):
    
    #Extracting month number
    mth = datetime.strptime(m, '%b')
    month.append(mth.strftime('%m')) #Appending to list
    
    #Extracting weekday number starting Monday = 0
    wkdy = time.strptime(wd, "%a")
    weekday.append(wkdy.tm_wday) #Appending to list


In [None]:
df['month'] = month
df['day_of_week'] = weekday 

In [None]:
#Converting 'month' and 'day_of_week' to int type
df['month'] = df['month'].astype('int')
df['day_of_week'] = df['day_of_week'].astype('int')
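
The loop above could also be written as a vectorised lookup. The sketch below is an alternative, not the method used in this notebook. It assumes an English locale (so that `calendar` yields 'Jan', 'Mon', etc.) and uses a small made-up dataframe `toy` in place of `df`.

```python
import calendar
import pandas as pd

# Toy stand-ins for the 'month' and 'day_of_week' columns (hypothetical values)
toy = pd.DataFrame({'month': ['may', 'jun', 'nov'],
                    'day_of_week': ['mon', 'thu', 'fri']})

# Lookup tables built from the standard library: 'jan' -> 1, ..., 'dec' -> 12
month_to_num = {name.lower(): num
                for num, name in enumerate(calendar.month_abbr) if name}
# 'mon' -> 0, ..., 'sun' -> 6, the same convention as tm_wday
weekday_to_num = {name.lower(): num
                  for num, name in enumerate(calendar.day_abbr)}

# Series.map replaces every abbreviation by its number in one step
toy['month'] = toy['month'].str.lower().map(month_to_num)
toy['day_of_week'] = toy['day_of_week'].str.lower().map(weekday_to_num)
print(toy)
```

Abbreviations that are missing from the lookup tables would become NaN, which makes unexpected values easy to spot.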

In [None]:
#Dataset with transformed variables
df.head()

### Observation 3

The variables age, pdays and previous are currently of float type, although they are integer variables that never take decimal values. They should therefore be converted to int type. To do that, we first have to take care of the null values in these columns, which is done in the next section (Handling Missing Values). Then we can convert these variables to int type.

## Handling Missing Values

In [None]:
#The row containing null value in 'month' column was already removed earlier

#Checking for null values in other columns
df.isnull().sum()

We can impute the missing values in the rest of the columns by replacing them with the median of the respective column. The median is largely unaffected by outliers in the data and is hence more reliable than the mean.

Also, if a variable has many missing values, imputing them with the variable's mode would only skew the data further towards the mode. Hence it is best to use the median.
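
The robustness of the median can be illustrated with a small sketch. The numbers are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy column with one extreme value and one missing value (hypothetical numbers)
s = pd.Series([10.0, 12.0, 11.0, 13.0, 1000.0, np.nan])

print(s.mean())    # pulled towards the extreme value
print(s.median())  # stays near the bulk of the data

# fillna replaces only the missing entries and leaves the rest untouched
filled = s.fillna(s.median())
print(filled.tolist())
```

Here the missing entry is filled with a value that is typical for the column, whereas the mean would have inserted a value far away from almost all observations.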

In [None]:
#Replacing missing values with median
df = df.fillna(df.median(numeric_only=True))

In [None]:
df.isnull().sum()

There are no more missing values now.

## Outlier Removal

#### The first step is to assign the correct type to variables

In [None]:
#Continuous/float variables
df.columns[df.dtypes == 'float']

As mentioned earlier the variables age, pdays, previous are integers with the wrong type i.e. float. We now convert these to int type

In [None]:
df['age'] = df['age'].astype('int')
df['pdays'] = df['pdays'].astype('int')
df['previous'] = df['previous'].astype('int')

We can now look at the outliers in the remaining float variables

In [None]:
#Checking again for the remaining float variables
df.columns[df.dtypes == 'float']

In [None]:
#Analysis of the continuous variables
df[['duration', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx',
       'euribor3m', 'nr.employed']].describe()

The above table shows that there are outliers in the 'duration' column, because the maximum value (4199.0) lies extremely far above the inter-quartile range (the values between Q1 and Q3).

The rest of the continuous variables appear to be without outliers. However it is practical to visualize the data to verify this hypothesis.

In [None]:
#Plots to verify the above impression

fig, axs = plt.subplots(2,3, figsize=(20,8))
sns.boxplot(x= 'duration', data=df, ax=axs[0,0])
sns.boxplot(x= 'emp.var.rate', data=df, ax=axs[0,1])
sns.boxplot(x= 'cons.price.idx', data=df, ax=axs[0,2])
sns.boxplot(x= 'cons.conf.idx', data=df, ax=axs[1,0])
sns.boxplot(x= 'euribor3m', data=df, ax=axs[1,1])
sns.boxplot(x= 'nr.employed', data=df, ax=axs[1,2])
plt.show()

In [None]:
#Plot for 'duration'
ax = df.duration.plot(style = 'o')
ax.set_ylabel('Duration')
plt.show()

### Z-score for Outlier Removal
We need to decide on a Z-score threshold for removing the outliers. The Z-score of a point is its distance from the mean, measured in standard deviations. Points whose absolute Z-score exceeds the threshold are considered outliers and are removed from the dataset.
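
As a minimal sketch of the idea, the Z-score filter can be written out by hand on synthetic data. The durations below are randomly generated and only stand in for the real 'duration' column. Like `scipy.stats.zscore`, NumPy's `std` uses the population standard deviation by default.

```python
import numpy as np

# Synthetic call durations in seconds (hypothetical) plus one extreme value
rng = np.random.default_rng(0)
duration = np.append(rng.normal(250, 50, size=200), 4000.0)

# Z-score: distance from the mean measured in standard deviations
z = (duration - duration.mean()) / duration.std()

# Keep only the points within `thresh` standard deviations of the mean
thresh = 3
kept = duration[np.abs(z) < thresh]
print(len(duration) - len(kept), 'point(s) removed')
```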

In [None]:
#Deciding the Z-score threshold for outlier removal

thresh_list = [2,2.5,3]
print(np.multiply(thresh_list, df['duration'].std()))

Looking at the plot above and the different Z-score thresholds, the threshold of 3 looks reasonable.

In [None]:
#Checking for outliers

from scipy import stats

thresh = 3

df_to_clean = df[['duration', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx',
       'euribor3m', 'nr.employed']]

# 1. Select a threshold for a Z-score to identify and remove outliers
df_Z = df_to_clean[(np.abs(stats.zscore(df_to_clean)) < thresh).all(axis=1)]
ix_keep = df_Z.index

# 2. Subset the raw dataframe with the indexes you'd like to keep
df_without_outlier = df.loc[ix_keep]

In [None]:
print('Number of outliers removed: ',df.shape[0] - df_without_outlier.shape[0])

In [None]:
#Descriptive stats of 'duration' after outlier removal
df_without_outlier[['duration']].describe()

----
## 3. Training of a Machine Learning Classifier

In [None]:
#Modules required for training a classifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, KFold, StratifiedKFold
from sklearn.metrics import accuracy_score
from hyperopt import hp, tpe, fmin, space_eval

**NOTE: Rescaling of Data**

The initial thought was to rescale the data before training the classifier. However, both Standardization and Normalization lowered the accuracy of the classifier in comparison to the unscaled dataset. This is covered in the Discussion section at the end of the notebook.


### Partitioning the Dataset

The dataset is partitioned into a training set and a test set. The test set is 25% of the original dataset.

In [None]:
X = df_without_outlier.drop(['Unnamed: 0','y'],axis=1)
y = df_without_outlier['y']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=2018)

### Choosing a Machine Learning Model

It is best to try out different classifier models on the given dataset and then select the one with the best classification accuracy. Since the dataset contains both continuous and categorical variables, tree-based models like Decision Tree and Random Forest are a good choice as they are robust in such cases.

The strategy applied here is to select the hyperparameters for both classifiers using Stratified K-Fold Cross Validation and Bayesian Optimization. Both classifiers are then trained with their selected hyperparameters on the training data, and their accuracy on the test data decides which classifier is chosen.
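
To illustrate what stratification does, here is a minimal sketch on synthetic, imbalanced data. It uses scikit-learn's `cross_val_score` helper instead of the hand-written loop used in this notebook, and `X_toy`, `y_toy` are made-up stand-ins for the bank data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank data (not the real dataset)
X_toy, y_toy = make_classification(n_samples=500, n_features=8,
                                   weights=[0.9, 0.1], random_state=0)

# Each validation fold keeps roughly the same class ratio as the full dataset
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
shares = [y_toy[val_idx].mean() for _, val_idx in skf.split(X_toy, y_toy)]
print('positive share per fold:', np.round(shares, 3))

# Mean accuracy over the five folds for one fixed set of hyperparameters
scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=0),
                         X_toy, y_toy, cv=skf, scoring='accuracy')
print('mean CV accuracy:', scores.mean())
```

Fixing `random_state` in `StratifiedKFold` makes the folds, and therefore the scores, reproducible between runs.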

### 1. Decision Trees

#### Hyperparameter Selection

In [None]:
def StratifiedKFold_dtree(X, y, params):
    '''
    CV for hyperopt 
    '''
    skf = StratifiedKFold(n_splits=5, shuffle=True)
    result = []
    # Loop through the indices the split() method returns
    for index, (train_indices, val_indices) in enumerate(skf.split(X, y)):
        # Generate batches from indices
        xtrain, xtest = X.values[train_indices], X.values[val_indices]
        ytrain, ytest = y.values[train_indices], y.values[val_indices]
        # Create the Decision Tree model
        model = tree.DecisionTreeClassifier(**params)
        model.fit(xtrain, ytrain)
        y_hat = model.predict(xtest)
        score = accuracy_score(ytest, y_hat)
        result.append(score)
    return np.mean(result)

In [None]:
def objective(params):
    """
    Objective function to minimize
    """
    return -StratifiedKFold_dtree(X,y, params) #Maximize D-Tree accuracy score

In [None]:
#Decision Tree Parameter Space
from hyperopt.pyll.base import scope

dtree_space = {
    'max_depth': scope.int(hp.quniform('max_depth', 1, 10, 1)), #cast to int, since quniform returns floats
    
    'min_samples_split': hp.choice('min_samples_split', np.arange(2, 10, 1, dtype=int) ),
    
    'criterion': hp.choice('criterion', ('entropy',
                                         'gini',)),
    'random_state': 42,
}

In [None]:
dtree_best = fmin(objective, space=dtree_space, algo=tpe.suggest, max_evals=50)
dtree_params_from_hyperopt = space_eval(dtree_space, dtree_best)
print('Best Decision Tree Hyperparameters:\n', dtree_params_from_hyperopt)

In [None]:
#Stratified K-Fold Cross Validation Accuracy Score
mean_acc_dtree = StratifiedKFold_dtree(X, y, dtree_params_from_hyperopt)
print('Stratified K-Fold Cross Validation Mean Accuracy:\n',mean_acc_dtree)

### 2. Random Forests

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The n_estimators parameter specifies the number of trees to fit. With all common hyperparameters kept the same, a Random Forest classifier usually delivers a higher accuracy than a single decision tree.
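
The averaging can be made concrete with a small sketch on synthetic data. `X_toy` and `y_toy` are made-up stand-ins, and the hyperparameter values are chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the bank dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=15, max_depth=4, random_state=42)
forest.fit(X_toy, y_toy)

# The fitted forest holds one decision tree per estimator
print(len(forest.estimators_))

# The forest's class probabilities are the average over its trees
tree_avg = np.mean([t.predict_proba(X_toy) for t in forest.estimators_], axis=0)
print(np.allclose(forest.predict_proba(X_toy), tree_avg))
```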

#### Hyperparameter Selection

In [None]:
def StratifiedKFold_randFor(X, y, params):
    '''
    CV for hyperopt 
    '''
    skf = StratifiedKFold(n_splits=5, shuffle=True)
    result = []
    # Loop through the indices the split() method returns
    for index, (train_indices, val_indices) in enumerate(skf.split(X, y)):
        # Generate batches from indices
        xtrain, xtest = X.values[train_indices], X.values[val_indices]
        ytrain, ytest = y.values[train_indices], y.values[val_indices]
        # Create the Random Forest model
        model = RandomForestClassifier(**params)
        model.fit(xtrain, ytrain)
        y_hat = model.predict(xtest)
        score = accuracy_score(ytest, y_hat)
        result.append(score)
    return np.mean(result)

In [None]:
def objective(params):
    """
    Objective function to minimize
    """
    return -StratifiedKFold_randFor(X,y, params) #Maximize Random Forest accuracy score

In [None]:
#Random Forest Parameter Space
#All parameters except n_estimators have the same values as D-Tree parameters (for accuracy comparison)

from hyperopt.pyll.base import scope

rf_space = {
    'max_depth': scope.int(hp.quniform('max_depth', 1, 10, 1)), #cast to int, since quniform returns floats
    
    'n_estimators': scope.int(hp.quniform('n_estimators', 10,20,1)), #max 20 estimators because of runtime concerns
    
    'min_samples_split': hp.choice('min_samples_split', np.arange(2, 10, 1, dtype=int) ),
    
    'criterion': hp.choice('criterion', ('entropy',
                                         'gini',)),
    'random_state': 42,
}

In [None]:
rf_best = fmin(objective, space=rf_space, algo=tpe.suggest, max_evals=50)
rf_params_from_hyperopt = space_eval(rf_space, rf_best)
print('Best Random Forest Hyperparameters:\n', rf_params_from_hyperopt)

In [None]:
#Stratified K-Fold Cross Validation Accuracy Score 
mean_acc_randFor = StratifiedKFold_randFor(X, y, rf_params_from_hyperopt)
print('Stratified K-Fold Cross Validation Mean Accuracy:\n',mean_acc_randFor)

**As the Stratified K-Fold cross validation above shows, the Random Forest classifier performed better than the Decision Tree. We verify this further by training both classifiers with their selected hyperparameters on the training set created in the 'Partitioning the Dataset' step and applying them to the corresponding test set.**

## Decision Tree Model

In [None]:
dtree = tree.DecisionTreeClassifier(**dtree_params_from_hyperopt)
dtree.fit(X_train, y_train)
y_hat_dtree = dtree.predict(X_test)

In [None]:
dtree_test_acc =  accuracy_score(y_test,y_hat_dtree)
print('Accuracy of Decision Tree Model on Test Set:\n', dtree_test_acc)

## Random Forest Model

In [None]:
rf = RandomForestClassifier(**rf_params_from_hyperopt)
rf.fit(X_train, y_train)
y_hat_rf = rf.predict(X_test)

In [None]:
rf_test_acc = accuracy_score(y_test,y_hat_rf)
print('Accuracy of Random Forest Model on Test Set:\n', rf_test_acc)

In [None]:
print('RESULT SUMMARY:\n\n')
print('1. Decision Tree Model:\n')
print('Cross Validation Mean Accuracy: %f \n'%mean_acc_dtree)
print('Accuracy on Test Set: %f \n\n'%dtree_test_acc)
print('2. Random Forest Model:\n')
print('Cross Validation Mean Accuracy: %f\n' %mean_acc_randFor)
print('Accuracy on Test Set: %f' %rf_test_acc)

**As evident from the above results, Random Forest outperforms Decision Tree in both cases.**

***
# Discussion

In this section we look at some further pre-processing approaches that were tried on the data, their effect on the result, and consequently why they were not applied to the data finally used for modelling.

### 1. One Hot Encoding of 'month' , 'day_of_week'

As mentioned in Section 2, the random forest model performs worse when 'month' and 'day_of_week' are one hot encoded along with the other categorical variables than when only the other categorical variables are one hot encoded. A likely cause is the 13 extra features that one hot encoding of month and day_of_week adds to the dataset, which could have led to overfitting and worsened the model performance.

In [None]:
#One hot encoding of 'month' and 'day_of_week'
#get_dummies returns a new dataframe, so df_without_outlier itself stays unchanged
df_one_hot = pd.get_dummies(df_without_outlier, prefix_sep="_", columns=['month','day_of_week'])

In [None]:
X_oh = df_one_hot.drop(['Unnamed: 0','y'],axis=1)
y_oh = df_one_hot['y']

#### Using One-Hot encoded variables for hyperparameter selection for Random Forest Model

In [None]:
def objective(params):
    """
    Objective function to minimize
    """
    return -StratifiedKFold_randFor(X_oh,y_oh, params) 
#Same Stratified CV function as before is called with dataset with one hot encoded 'month' and 'day_of_week' variables

In [None]:
rf_oh_best = fmin(objective, space=rf_space, algo=tpe.suggest, max_evals=50) #Same parameter space used as before
rf_oh_params_from_hyperopt = space_eval(rf_space, rf_oh_best)
print('Best Random Forest Hyperparameters:\n', rf_oh_params_from_hyperopt)

In [None]:
#Comparing CV Mean Accuracy Scores for both cases
mean_acc_rf_one_hot = StratifiedKFold_randFor(X_oh, y_oh, rf_oh_params_from_hyperopt)
print('Cross Validation Mean Accuracy with one hot encoding categorical variables AND month, day_of_week: %f' %mean_acc_rf_one_hot)
print('Cross Validation Mean Accuracy with one hot encoding categorical variables WITHOUT month, day_of_week: %f' %mean_acc_randFor)

***
### 2. Rescaling of Dataset
Both Standardization and Normalization of the dataset lowered the random forest accuracy in most cases compared to the unscaled dataset. My hypothesis is that, after one hot encoding of the categorical variables, the continuous features are far outnumbered by the added binary features, and rescaling these few continuous features probably skews the model even more towards the binary features.
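
For reference, the two scikit-learn helpers used below rescale along different axes. The following sketch shows this on a tiny made-up matrix that is not taken from the dataset.

```python
import numpy as np
from sklearn import preprocessing

# Toy feature matrix: rows are samples, columns are features (hypothetical numbers)
X_toy = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])

# scale works column by column: zero mean and unit variance per feature
X_std = preprocessing.scale(X_toy)
print(X_std.mean(axis=0), X_std.std(axis=0))

# normalize works row by row: each sample is rescaled to unit L2 norm
X_unit = preprocessing.normalize(X_toy)
print(np.linalg.norm(X_unit, axis=1))
```

Because `normalize` divides each row by its own length, columns with large magnitudes dominate the result, which is worth keeping in mind when comparing the two approaches.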

#### 2.1 Standardization

In [None]:
#Working on a copy so that df_without_outlier stays unscaled
df_standardized = df_without_outlier.copy()
feature_cols = df_standardized.columns.drop('y')
df_standardized[feature_cols] = preprocessing.scale(df_standardized[feature_cols].astype(float))


In [None]:
X_st = df_standardized.drop(['Unnamed: 0','y'],axis=1)
y_st = df_standardized['y']

In [None]:
def objective(params):
    """
    Objective function to minimize
    """
    return -StratifiedKFold_randFor(X_st,y_st, params)
#Same Stratified CV function as before is called with standardised dataset

In [None]:
rf_st_best = fmin(objective, space=rf_space, algo=tpe.suggest, max_evals=50) #Same parameter space used as before
rf_st_params_from_hyperopt = space_eval(rf_space, rf_st_best)
print('Best Random Forest Hyperparameters:\n', rf_st_params_from_hyperopt)

In [None]:
#Comparing CV Mean Accuracy Scores for both cases: standardized vs unstandardized
mean_acc_rf_st = StratifiedKFold_randFor(X_st, y_st, rf_st_params_from_hyperopt)
print('Cross Validation Mean Accuracy with standardized dataset: %f' %mean_acc_rf_st)
print('Cross Validation Mean Accuracy with unscaled dataset: %f'%mean_acc_randFor)

#### 2.2 Normalization

In [None]:
#Working on a copy so that df_without_outlier stays unchanged
df_normalized = df_without_outlier.copy()
feature_cols = df_normalized.columns.drop('y')
df_normalized[feature_cols] = preprocessing.normalize(df_normalized[feature_cols].astype(float))


In [None]:
X_norm = df_normalized.drop(['Unnamed: 0','y'],axis=1)
y_norm = df_normalized['y']

In [None]:
def objective(params):
    """
    Objective function to minimize
    """
    return -StratifiedKFold_randFor(X_norm,y_norm, params)
#Same Stratified CV function as before is called with normalised dataset

In [None]:
rf_norm_best = fmin(objective, space=rf_space, algo=tpe.suggest, max_evals=50) #Same parameter space used as before
rf_norm_params_from_hyperopt = space_eval(rf_space, rf_norm_best)
print('Best Random Forest Hyperparameters:\n', rf_norm_params_from_hyperopt)

In [None]:
#Comparing CV Mean Accuracy Scores for both cases: normalized vs unnormalized
mean_acc_rf_norm = StratifiedKFold_randFor(X_norm, y_norm, rf_norm_params_from_hyperopt)
print('Cross Validation Mean Accuracy with normalized dataset: %f' %mean_acc_rf_norm)
print('Cross Validation Mean Accuracy with unscaled dataset: %f'%mean_acc_randFor)

**As these results show, the pre-processing approaches above did not yield better results. Standardized data sometimes fared marginally better on mean accuracy, but since the improvement was only marginal, standardization was left out to save on runtime. In situations where even a marginal improvement adds value, standardization could be an option.**