BANK MARKETING CALL SUCCESS PREDICTION
====================
Consumers decide whether to deposit money in a bank based on factors such as

FACTORS USED FOR DECISION MAKING
-------------------------------------

### A. Personal Factors
   
1. age (numeric)
2. job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3. marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4. education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5. default: has credit in default? (categorical: 'no','yes','unknown')
6. housing: has housing loan? (categorical: 'no','yes','unknown')
7. loan: has personal loan? (categorical: 'no','yes','unknown')

### B. related with the last contact of the current campaign

8. contact: contact communication type (categorical: 'cellular','telephone') 
9. month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10. day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

### C. other attributes
12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14. previous: number of contacts performed before this campaign and for this client (numeric)
15. poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

### D. social and economic context attributes
   
16. emp.var.rate: employment variation rate - quarterly indicator (numeric)
17. cons.price.idx: consumer price index - monthly indicator (numeric) 
18. cons.conf.idx: consumer confidence index - monthly indicator (numeric) 
19. euribor3m: euribor 3 month rate - daily indicator (numeric)
20. nr.employed: number of employees - quarterly indicator (numeric)

DATA SOURCE
-------------------------------------
Publicly available data from the [UC Irvine Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Bank+Marketing#) was used


Importing Libraries and reading in the file
-------------------------------------------

In [1]:
#imports and input file
model_list = []
column_list = ['model name', 'type', 'sampling technique','overall accuracy', 'yes accuracy','difference from baseline'] 

#import files
from io import StringIO
import requests
import json
import pandas as pd
from pandas import Series, DataFrame

from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score

#import balancing techniques
import imblearn
from imblearn.over_sampling import ADASYN

#neural network model imports
import keras as K
from keras.models import Sequential
from keras.layers import Dense
import numpy as np

#fix random seed for reproducibility
np.random.seed(7)
from sklearn.preprocessing import StandardScaler

#random forrest imports
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score

#from sklearn.preprocessing import balance_weights #to balance

#logistic l1 regularization imports
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

#decision tree
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

def get_object_storage_file_with_credentials_26adc31ece2740d9a52b41db6ee3541b(container, filename):
    """This functions returns a StringIO object containing
    the file content from Bluemix Object Storage."""

    url1 = ''.join(['https://identity.open.softlayer.com', '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': 'member_9ba378027bf714f2ca74f0acfb809adb7535884b','domain': {'id': 'ce2b15e5802c44918d4ed1a3c22e867e'},
            'password': 'j1Pa-u3b7o/OR,cQ'}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    for e1 in resp1_body['token']['catalog']:
        if(e1['type']=='object-store'):
            for e2 in e1['endpoints']:
                        if(e2['interface']=='public'and e2['region']=='dallas'):
                            url2 = ''.join([e2['url'],'/', container, '/', filename])
    s_subject_token = resp1.headers['x-subject-token']
    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
    resp2 = requests.get(url=url2, headers=headers2)
    return StringIO(resp2.text)
#read input
ba = pd.read_csv(get_object_storage_file_with_credentials_26adc31ece2740d9a52b41db6ee3541b('DefaultProjectrohanchakraborty1ibmcom', 'bank-additional-full.csv'), sep = ';')

Using TensorFlow backend.


Data Preparation
---------------
- Dummy encoding categorical variables
- Dividing data into training and test
- using ADASYN to synthetically generate data to solve the class imbalance between the yes and no class.
- finding the baseline to compare accuracy of the model with

In [2]:
#bank_reorder contains all numerical values
bank_reorder = (ba.select_dtypes(exclude=['object'])) #dataframe containing columns with numerical values
del bank_reorder['duration'] #deleting duration since that can not be known beforehand
categorical_bank = (ba.select_dtypes(include=['object'])) #dataframe containing columns with all non-numerical values

#1 hot encoding/dummy encoding 
for col in list(categorical_bank):#for each column (col) in the list of columns of the non-numerical columns
    one_hot = pd.get_dummies((categorical_bank[str(col)]))
    bank_reorder = bank_reorder.join(one_hot, rsuffix=('_'+str(col)))#adding each encoding 

del bank_reorder['no_y']#delete  y no since no unknown y's and y no is simply inverse of y_yes
x = bank_reorder.iloc[:,:(bank_reorder.shape[1]-1)]#all columns except last column are part of x
y = bank_reorder.iloc[:,(bank_reorder.shape[1]-1):bank_reorder.shape[1]]#only last column is y

#splitting into training and test values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)#.values
#fixing class imbalance

#balancing techniques
ada = ADASYN()#using synthetic data generation
x_resampled, y_resampled = ada.fit_sample(x_train.values, y_train.values.ravel())
y_resampled = y_resampled.reshape(y_resampled.shape[0],1)

#finding baseline
baseline = (ba['y'].value_counts()/ba.shape[0])
base = baseline[0]

Call Duration Effect
-------------------

    Average call duration
    
    - The duration of the call cannot be known before making the call, hence this variable was removed from the model
    - It is interesting to know the average call duration of calls that were successful and unsuccessful
        - Note: The additional time may have been spent on purchase intructions after the deal was closed
        

In [3]:
yes_mean = (ba["duration"][ba.y=="yes"]).mean()
no_mean = (ba["duration"][ba.y=="no"]).mean()
pd.DataFrame([[yes_mean, no_mean]], columns=["yes duration(s)","no duration(s)"])

Unnamed: 0,yes duration(s),no duration(s)
0,553.191164,220.844807


Correlation between duration and degree of success

In [4]:
#yes = (ba["duration"][ba.y=="yes"]).mean()
#no = (ba["duration"][ba.y=="no"]).mean()
#duration_frame = (ba["duration"].to_frame)

duration_frame = pd.DataFrame(ba["duration"])
duration_and_yes = duration_frame.join(y)
duration_and_yes.corr().iloc[:1, 1:2]

Unnamed: 0,yes_y
duration,0.405274


Correlation matrix
-----------------
- printing correlation matrix to see how much each factor affects the decision to deposit money
    - Month to make calls to maximize success: march marginally is the best while may the worst
    - Day to make calls to maximize success: thursday is the best while Monday is the worst

#### top 3 factors that had a positive correlation
    - success i.e. success of previous campiagns
    - previous i.e. when previous contact was made
    - cellular i.e. whether contact was made on their cellphone or landline

In [5]:
correlation_matrix = bank_reorder.corr()
correlation_col= correlation_matrix.iloc[: ,correlation_matrix.shape[1]-1 :correlation_matrix.shape[1]]
correlation_col = correlation_col.sort_values(by=['yes_y'], ascending=False)
correlation_col.iloc[1:4,:]

Unnamed: 0,yes_y
success,0.316269
previous,0.230181
cellular,0.144773


### top 4 factor that had a negative correlation
    - emp.var.rate employment variation rate
    - euribor3m	i.e. euribor 3 month rate
    - pdays	ie. number of days that passed by when a consumer was contacted for a previous campaign
    - nr.employed i.e. number of employees
    
Interestingly 3 out of 4 that majorly affected the decision negatively were socio-economic contexts

Hence the top factors both positively and negatively that affect  a decision were not related personally to the customer, but relied around the soci-economic conditions and the way the campaign was carried out

In [17]:
correlation_col.tail(4)

Unnamed: 0,yes_y
emp.var.rate,-0.298334
euribor3m,-0.307771
pdays,-0.324914
nr.employed,-0.354678


Using Model : Neural Networks
---------------------------
Adjusting
    - epochs (number of times the network sees the data) :
        - 100
        - 300
    - method of solving class imbalance : 
        - none
        - ADASYN
        - Class weights

In [9]:
#neural networks creation. 

def neural_net(x_vals, y_vals,balancing, eps):
    #neural network 
    model_neural = Sequential()

    # Add an input layer 18 is the number of hidden units (earlier 12) more units = more patterns
    model_neural.add(Dense(18, activation='relu', input_shape=(x_vals.shape[1],)))

    # Add one hidden layer 
    model_neural.add(Dense(12, activation='relu'))

    # Add an output layer 
    model_neural.add(Dense(1, activation='sigmoid'))

    #compile
    model_neural.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics= ['binary_accuracy'])
    
    if balancing=='class_weights':
        model_neural.fit(x_vals, y_vals, epochs =eps, batch_size=1, class_weight={0:1, 1:8})
    else:
        model_neural.fit(x_vals, y_vals, epochs =eps, batch_size=1) 
    
    y_preds = model_neural.predict_classes(x_test.values)
    accuracy =  accuracy_score(y_test, y_preds)
    model_list.append(('neural','epochs:'+str(eps), balancing , accuracy ,recall_score(y_test, y_preds),(accuracy-base) ))

neural_net(x_train.values, y_train.values, 'none', 10)#100
neural_net(x_train.values, y_train.values, 'none', 40)#300
neural_net(x_train.values, y_train.values, 'class_weights', 10)
neural_net(x_train.values, y_train.values, 'class_weights', 40)
neural_net(x_resampled, y_resampled, 'ADASYN', 10)
neural_net(x_resampled, y_resampled, 'ADASYN', 40)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40


Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40


Epoch 39/40
Epoch 40/40

Using Model : Decision tree
---------------------------
Adjusting
    - type :
        - gini coefficient
        - entropy information gain
    - method of solving class imbalance : 
        - none
        - ADASYN
        - Class weights

In [13]:
#decision tree gini, entropy
def decision_tree(x_vals, y_vals,balancing, type):
    #clf_gini
    
    if type =='gini':
        if balancing=='class_weights':
            clf_gini = DecisionTreeClassifier( class_weight = "balanced", random_state = 10)
        else:
            clf_gini = DecisionTreeClassifier()
           
        clf_gini.fit(x_vals, y_vals)
        y_pred_gini = clf_gini.predict(x_test)
        accuracy = accuracy_score(y_test, y_pred_gini)
        model_list.append(('decision tree',type, balancing, accuracy ,recall_score(y_test, y_pred_gini), (accuracy-base) ))
    else:
         if balancing=='none':
            clf_entropy = DecisionTreeClassifier(criterion = "entropy", random_state = 10, class_weight = "balanced")    
         
         else:
            clf_entropy = DecisionTreeClassifier(criterion = "entropy", random_state = 10)
         clf_entropy.fit(x_vals, y_vals)
         y_pred_en = clf_entropy.predict(x_test)
         accuracy = accuracy_score(y_test, y_pred_en)
         model_list.append(('decision tree',type, balancing, accuracy,recall_score(y_test, y_pred_en),(accuracy-base)  ))
        
decision_tree(x_train, y_train, 'none', 'gini')
decision_tree(x_resampled, y_resampled, 'ADASYN', 'gini')
decision_tree(x_train, y_train, 'class_weights', 'entropy')
decision_tree(x_train, y_train, 'none', 'entropy')
decision_tree(x_resampled, y_resampled, 'ADASYN', 'entropy')
decision_tree(x_resampled, y_resampled, 'class_weights', 'entropy')

Using Model : Random Forrest
---------------------------
Adjusting
    - method of solving class imbalance : 
        - none
        - ADASYN
        - Class weights

In [7]:
# Create a new random forest classifier for the most important features

def random_forr(x_vals, y_vals,balancing):
    if balancing=='class_weights':
        clf_forr = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1, class_weight = "balanced")
    else:
        clf_forr = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
#n_estimators = number of trees

# Train the new classifier on the new dataset containing the most important features
    clf_forr.fit(x_vals, y_vals)
    y_pred_forr = clf_forr.predict(x_test)
    accuracy = accuracy_score(y_test, y_pred_forr)
    model_list.append(('Random forrest','NA' , balancing, accuracy ,recall_score(y_test, y_pred_forr), (accuracy-base) ))
    
random_forr(x_train, y_train.values.ravel(),'none')
random_forr(x_train, y_train.values.ravel(),'class_weights')
random_forr(x_resampled, y_resampled.ravel(),'ADASYN')

Using Model : Logistic Regression
---------------------------
Adjusting:
    - Regulatization parameter
        - 1
        - 10
        - 100
        - 1000
    - method of solving class imbalance : 
        - none
        - ADASYN
        - Class weights

In [8]:
#L1 regularization
def logistic_regression(x_vals, y_vals,balancing, penal):
    sc = StandardScaler()
    x_test_std = sc.fit_transform(x_test) 

    # Fit the scaler to the training data and transform
    x_train_std = sc.fit_transform(x_vals)

    # Apply the scaler to the test data

    C = [1,1e1, 1e2, 1e3]

    for c in C:
        if balancing=='class_weights':     
                clf = LogisticRegression(penalty=penal, C=c, class_weight='balanced')
        else:          
                clf = LogisticRegression(penalty=penal, C=c)    
            
        clf.fit(x_train_std,y_vals)
        y_pred_log = clf.predict(x_test_std)
        accuracy = accuracy_score(y_test, y_pred_log)
        model_list.append(("log regression l2 penalty", 'regularization strength inverse:'+str(c), balancing,accuracy,recall_score(y_test, y_pred_log),(accuracy-base) ))
logistic_regression(x_resampled, y_resampled.ravel(),'ADASYN', 'l2')
logistic_regression(x_train.values, y_train.values.ravel(),'none', 'l1')
logistic_regression(x_train, y_train.values.ravel(),'none', 'l2')
logistic_regression(x_train, y_train.values.ravel(),'class_weights', 'l2')

Printing out a comparison of all the models
---------------------------

In [14]:
df = pd.DataFrame(model_list, columns = column_list)
df.sort_values(by=['yes accuracy'], ascending=False)


Unnamed: 0,model name,type,sampling technique,overall accuracy,yes accuracy,difference from baseline
27,neural,epochs:10,class_weights,0.112325,1.0,-0.775021
26,neural,epochs:40,none,0.112325,1.0,-0.775021
25,neural,epochs:10,none,0.112325,1.0,-0.775021
9,log regression l2 penalty,regularization strength inverse:1,ADASYN,0.719268,0.711816,-0.168077
10,log regression l2 penalty,regularization strength inverse:10.0,ADASYN,0.719511,0.711816,-0.167835
11,log regression l2 penalty,regularization strength inverse:100.0,ADASYN,0.719511,0.711816,-0.167835
12,log regression l2 penalty,regularization strength inverse:1000.0,ADASYN,0.71943,0.711816,-0.167916
21,log regression l2 penalty,regularization strength inverse:1,class_weights,0.827143,0.621758,-0.060203
22,log regression l2 penalty,regularization strength inverse:10.0,class_weights,0.827709,0.621037,-0.059637
24,log regression l2 penalty,regularization strength inverse:1000.0,class_weights,0.827709,0.621037,-0.059637


Data citation
------------
[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, In press, http://dx.doi.org/10.1016/j.dss.2014.03.001
