## Classification Problem: Bank Marketing

### Problem Description - Bank Marketing Decision

Our goal is to predict, before a call is made, whether a client will subscribe to the product (a bank term deposit): 'yes' or 'no'.

    The data is related to the direct marketing campaigns of a banking institution
    The marketing campaigns were based on phone calls
    Often, more than one contact to the same client was required

#### Data

    age: age of the Client (numeric)
    
    job: type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
    
    marital: marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
    
    education: education level (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
    
    credit_default: has credit in default? (categorical: 'no','yes','unknown')
    
    housing: has housing loan? (categorical: 'no','yes','unknown')
    
    loan: has personal loan? (categorical: 'no','yes','unknown')
    
    contact: contact communication type (categorical: 'cellular','telephone')
    
    contacted_month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
    
    day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
    
    duration: last contact duration, in seconds (numeric)
    
    campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact; the column is spelled 'compaign' in the loaded data)
    
    pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
    
    previous: number of contacts performed before this campaign and for this client (numeric)
    
    poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
    
    Social and economic context attributes:
    
    emp_var_rate: employment variation rate quarterly indicator (numeric)
    
    cons_price_idx: consumer price index monthly indicator (numeric)
    
    cons_conf_idx: consumer confidence index monthly indicator (numeric)
    
    euribor3m: euribor 3 month rate - daily indicator (numeric)
    
    nr_employed: number of employees quarterly indicator (numeric; the column is named 'nr_employees' in the loaded data)

#### Objective

Predict whether a customer will subscribe to the product or not. 

        Supervised learning --> Classification --> Binary Classification. 

### Import all required libraries

In [1]:
import os
import joblib

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer 
from sklearn.pipeline import Pipeline

from sklearn.metrics import confusion_matrix, accuracy_score

from xgboost import XGBClassifier

import warnings
warnings.filterwarnings('ignore')

#### Set the current working directory and data file location

In [2]:
PATH = os.getcwd()
DATA_FILE = "gs://bankapp_gs/bank_data.csv"

### Load the data

In [3]:
data = pd.read_csv(DATA_FILE)

### Understanding the data

#### Number of rows and columns

In [4]:
data.shape

(41188, 21)

#### Column or Attribute names

In [5]:
data.columns

Index(['age', 'job', 'marital', 'education', 'credit_default', 'housing',
       'loan', 'contact', 'contacted_month', 'day_of_week', 'duration',
       'compaign', 'pdays', 'previous', 'poutcome', 'emp_var_rate',
       'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employees', 'y'],
      dtype='object')

#### Display first 5 and last 5 records

In [6]:
data.head()

Unnamed: 0,age,job,marital,education,credit_default,housing,loan,contact,contacted_month,day_of_week,...,compaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employees,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [7]:
data.tail()

Unnamed: 0,age,job,marital,education,credit_default,housing,loan,contact,contacted_month,day_of_week,...,compaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employees,y
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,...,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
41187,74,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,3,999,1,failure,-1.1,94.767,-50.8,1.028,4963.6,no


#### Summary Statistics

In [8]:
data.describe()

Unnamed: 0,age,duration,compaign,pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employees
count,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0
mean,40.02406,258.28501,2.567593,962.475454,0.172963,0.081886,93.575664,-40.5026,3.621291,5167.035911
std,10.42125,259.279249,2.770014,186.910907,0.494901,1.57096,0.57884,4.628198,1.734447,72.251528
min,17.0,0.0,1.0,0.0,0.0,-3.4,92.201,-50.8,0.634,4963.6
25%,32.0,102.0,1.0,999.0,0.0,-1.8,93.075,-42.7,1.344,5099.1
50%,38.0,180.0,2.0,999.0,0.0,1.1,93.749,-41.8,4.857,5191.0
75%,47.0,319.0,3.0,999.0,0.0,1.4,93.994,-36.4,4.961,5228.1
max,98.0,4918.0,56.0,999.0,7.0,1.4,94.767,-26.9,5.045,5228.1


In [9]:
data.describe(include='all')

Unnamed: 0,age,job,marital,education,credit_default,housing,loan,contact,contacted_month,day_of_week,...,compaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employees,y
count,41188.0,41188,41188,41188,41188,41188,41188,41188,41188,41188,...,41188.0,41188.0,41188.0,41188,41188.0,41188.0,41188.0,41188.0,41188.0,41188
unique,,12,4,8,3,3,3,2,10,5,...,,,,3,,,,,,2
top,,admin.,married,university.degree,no,yes,no,cellular,may,thu,...,,,,nonexistent,,,,,,no
freq,,10422,24928,12168,32588,21576,33950,26144,13769,8623,...,,,,35563,,,,,,36548
mean,40.02406,,,,,,,,,,...,2.567593,962.475454,0.172963,,0.081886,93.575664,-40.5026,3.621291,5167.035911,
std,10.42125,,,,,,,,,,...,2.770014,186.910907,0.494901,,1.57096,0.57884,4.628198,1.734447,72.251528,
min,17.0,,,,,,,,,,...,1.0,0.0,0.0,,-3.4,92.201,-50.8,0.634,4963.6,
25%,32.0,,,,,,,,,,...,1.0,999.0,0.0,,-1.8,93.075,-42.7,1.344,5099.1,
50%,38.0,,,,,,,,,,...,2.0,999.0,0.0,,1.1,93.749,-41.8,4.857,5191.0,
75%,47.0,,,,,,,,,,...,3.0,999.0,0.0,,1.4,93.994,-36.4,4.961,5228.1,


In [10]:
data.dtypes

age                  int64
job                 object
marital             object
education           object
credit_default      object
housing             object
loan                object
contact             object
contacted_month     object
day_of_week         object
duration             int64
compaign             int64
pdays                int64
previous             int64
poutcome            object
emp_var_rate       float64
cons_price_idx     float64
cons_conf_idx      float64
euribor3m          float64
nr_employees       float64
y                   object
dtype: object

#### Observations

Several attributes (job, marital, education, credit_default, housing, loan, contact, contacted_month, day_of_week, poutcome and y) are categorical but are read in as the generic object type.

#### Typecasting - Convert the attributes into appropriate types

Use astype('category') to convert the job, marital, education, credit_default, housing, loan, contact, contacted_month, day_of_week, poutcome and y attributes from the object dtype to the category dtype. The remaining attributes are numeric and are cast to float64.

In [11]:
cat_Attr_Names =  ['job', 'marital', 'education', 'credit_default', 'housing', 'loan', 
                   'contact', 'contacted_month', 'day_of_week', 'poutcome', 'y']

num_Attr_Names = list(set(data.columns) - set(cat_Attr_Names))

In [12]:
data[cat_Attr_Names] = data[cat_Attr_Names].apply(lambda col: col.astype('category'))
data[num_Attr_Names] = data[num_Attr_Names].apply(lambda col: col.astype('float64'))

In [13]:
data.dtypes

age                 float64
job                category
marital            category
education          category
credit_default     category
housing            category
loan               category
contact            category
contacted_month    category
day_of_week        category
duration            float64
compaign            float64
pdays               float64
previous            float64
poutcome           category
emp_var_rate        float64
cons_price_idx      float64
cons_conf_idx       float64
euribor3m           float64
nr_employees        float64
y                  category
dtype: object

#### Summary Statistics

In [14]:
data.describe(include='all')

Unnamed: 0,age,job,marital,education,credit_default,housing,loan,contact,contacted_month,day_of_week,...,compaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employees,y
count,41188.0,41188,41188,41188,41188,41188,41188,41188,41188,41188,...,41188.0,41188.0,41188.0,41188,41188.0,41188.0,41188.0,41188.0,41188.0,41188
unique,,12,4,8,3,3,3,2,10,5,...,,,,3,,,,,,2
top,,admin.,married,university.degree,no,yes,no,cellular,may,thu,...,,,,nonexistent,,,,,,no
freq,,10422,24928,12168,32588,21576,33950,26144,13769,8623,...,,,,35563,,,,,,36548
mean,40.02406,,,,,,,,,,...,2.567593,962.475454,0.172963,,0.081886,93.575664,-40.5026,3.621291,5167.035911,
std,10.42125,,,,,,,,,,...,2.770014,186.910907,0.494901,,1.57096,0.57884,4.628198,1.734447,72.251528,
min,17.0,,,,,,,,,,...,1.0,0.0,0.0,,-3.4,92.201,-50.8,0.634,4963.6,
25%,32.0,,,,,,,,,,...,1.0,999.0,0.0,,-1.8,93.075,-42.7,1.344,5099.1,
50%,38.0,,,,,,,,,,...,2.0,999.0,0.0,,1.1,93.749,-41.8,4.857,5191.0,
75%,47.0,,,,,,,,,,...,3.0,999.0,0.0,,1.4,93.994,-36.4,4.961,5228.1,


#### Handling of missing data

In [15]:
data.isnull().sum()

age                0
job                0
marital            0
education          0
credit_default     0
housing            0
loan               0
contact            0
contacted_month    0
day_of_week        0
duration           0
compaign           0
pdays              0
previous           0
poutcome           0
emp_var_rate       0
cons_price_idx     0
cons_conf_idx      0
euribor3m          0
nr_employees       0
y                  0
dtype: int64

In [16]:
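# Class distribution of the target, in percent (Series.value_counts replaces the deprecated pd.value_counts)
data['y'].value_counts(normalize=True) * 100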

no     88.734583
yes    11.265417
Name: y, dtype: float64

### Train-Test Split

Using sklearn.model_selection.train_test_split

    Split the data into train and test subsets

In [17]:
X = data.drop(columns=['y'])
y = data['y']

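# Drop the target from the categorical feature list; {'y'} is an explicit one-element set
# (set('y') only worked because the name is a single character)
cat_Attr_Names = list(set(cat_Attr_Names) - {'y'})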

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

In [18]:
print(X_train.shape)
print(X_test.shape)

(28831, 20)
(12357, 20)
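
The class distribution above is heavily skewed (roughly 89% 'no'), and the split in this notebook is not stratified. As a minimal sketch on synthetic data (the names `X_toy` and `y_toy` are made up for illustration and are not part of the notebook), passing `stratify` to `train_test_split` keeps the yes/no ratio the same in both subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the notebook's data: 1000 rows, 11% 'yes' (hypothetical values).
rng = np.random.default_rng(123)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array(['yes'] * 110 + ['no'] * 890)

# stratify=y_toy asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123, stratify=y_toy)

train_yes_rate = (y_tr == 'yes').mean()
test_yes_rate = (y_te == 'yes').mean()
print(f"train yes rate: {train_yes_rate:.3f}")
print(f"test yes rate:  {test_yes_rate:.3f}")
```

Whether stratification changes the results reported here has not been checked; it mainly guards against an unlucky split when the positive class is rare.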


### Data pre-process using pipelines

#### Numeric Attributes:

    Impute and standardize numeric attributes.

#### Categorical Attributes:

    Impute and convert categorical attributes to numeric using one-hot encoding.
    
Concatenate the transformed numeric and categorical attributes.

In [19]:
numeric_transformer = Pipeline(steps=[
    ('num_imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('cat_imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocess = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_Attr_Names),
        ('cat', categorical_transformer, cat_Attr_Names)])

preprocess_pipeline = Pipeline([('preprocess', preprocess)])

In [20]:
preprocess.fit(X_train)  

X_train_trans = preprocess.transform(X_train)
X_test_trans = preprocess.transform(X_test)

In [21]:
print(X_train.shape)
print(X_train_trans.shape)

(28831, 20)
(28831, 63)
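
The jump from 20 to 63 columns can be checked by hand: each categorical attribute expands into one column per category, while each numeric attribute stays a single column. The short calculation below uses the unique-value counts reported by `data.describe(include='all')` earlier, and assumes every category also occurs in the training split:

```python
# Unique-category counts taken from the data.describe(include='all') output.
n_categories = {
    'job': 12, 'marital': 4, 'education': 8, 'credit_default': 3,
    'housing': 3, 'loan': 3, 'contact': 2, 'contacted_month': 10,
    'day_of_week': 5, 'poutcome': 3,
}
n_numeric = 10  # age, duration, compaign, pdays, previous and the five economic indicators

n_onehot = sum(n_categories.values())  # one column per category
n_total = n_numeric + n_onehot
print(n_onehot, n_total)  # 53 63
```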


In [22]:
pd.DataFrame(X_train_trans)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,53,54,55,56,57,58,59,60,61,62
0,-0.573111,-0.348328,0.640729,0.703768,0.195559,0.716223,0.479897,0.324164,0.879444,0.233079,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.268589,-0.348328,0.832074,0.767418,0.195559,-0.233354,-1.058517,0.840284,0.944309,7.050209,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.163569,-0.348328,0.640729,0.706082,0.195559,0.716223,0.576048,0.324164,0.879444,-0.475595,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,-0.573111,-0.348328,0.640729,0.704346,0.195559,0.716223,-0.289310,0.324164,0.879444,0.999524,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.163569,-0.348328,0.640729,0.705504,0.195559,0.716223,1.345255,0.324164,0.879444,-0.356199,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28826,0.163569,-0.348328,0.832074,0.709554,0.195559,1.529407,-1.346969,0.840284,-0.288114,-0.718238,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
28827,-0.573111,-0.348328,0.832074,0.762789,0.195559,0.585009,0.287595,0.840284,-0.482707,-0.125110,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
28828,-0.573111,-0.348328,0.832074,0.765103,0.195559,0.585009,1.345255,0.840284,-0.482707,-0.155922,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
28829,-0.204771,3.684053,-1.208936,-1.247389,0.195559,-0.870434,-0.097008,-0.954310,-1.434051,1.307643,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [23]:
preprocess

ColumnTransformer(transformers=[('num',
                                 Pipeline(steps=[('num_imputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('scaler', StandardScaler())]),
                                 ['compaign', 'previous', 'emp_var_rate',
                                  'euribor3m', 'pdays', 'cons_price_idx', 'age',
                                  'nr_employees', 'cons_conf_idx',
                                  'duration']),
                                ('cat',
                                 Pipeline(steps=[('cat_imputer',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('onehot',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['contact', 'credit_default', 'housing',
                                  '

In [24]:
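# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out (scikit-learn >= 1.0) replaces it
cat_Attr_Names_trans = preprocess.transformers_[1][1].named_steps['onehot'].get_feature_names_out(cat_Attr_Names)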

In [25]:
ind_Attr_Names_Trans = num_Attr_Names + cat_Attr_Names_trans.tolist()
ind_Attr_Names_Trans

['compaign',
 'previous',
 'emp_var_rate',
 'euribor3m',
 'pdays',
 'cons_price_idx',
 'age',
 'nr_employees',
 'cons_conf_idx',
 'duration',
 'contact_cellular',
 'contact_telephone',
 'credit_default_no',
 'credit_default_unknown',
 'credit_default_yes',
 'housing_no',
 'housing_unknown',
 'housing_yes',
 'education_basic.4y',
 'education_basic.6y',
 'education_basic.9y',
 'education_high.school',
 'education_illiterate',
 'education_professional.course',
 'education_university.degree',
 'education_unknown',
 'poutcome_failure',
 'poutcome_nonexistent',
 'poutcome_success',
 'marital_divorced',
 'marital_married',
 'marital_single',
 'marital_unknown',
 'loan_no',
 'loan_unknown',
 'loan_yes',
 'day_of_week_fri',
 'day_of_week_mon',
 'day_of_week_thu',
 'day_of_week_tue',
 'day_of_week_wed',
 'contacted_month_apr',
 'contacted_month_aug',
 'contacted_month_dec',
 'contacted_month_jul',
 'contacted_month_jun',
 'contacted_month_mar',
 'contacted_month_may',
 'contacted_month_nov',
 'c

In [26]:
X_train_DF = pd.DataFrame(X_train_trans, columns=num_Attr_Names+cat_Attr_Names_trans.tolist())
X_test_DF = pd.DataFrame(X_test_trans, columns=num_Attr_Names+cat_Attr_Names_trans.tolist())

#### Using LabelEncoder to convert target attribute 'y' to Numerical
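
The notebook imports `LabelEncoder` but then passes the string labels directly to the model. As a hedged sketch of what this step could look like (the sample list `y_example` is invented for illustration), the encoder maps the sorted class names to integers and can map predictions back again. Some newer XGBoost releases may require integer labels, in which case an encoding like this would be needed before fitting:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of target values; the notebook's y holds the same two classes.
y_example = ['no', 'yes', 'no', 'no', 'yes']

le = LabelEncoder()
y_encoded = le.fit_transform(y_example)  # classes are sorted, so 'no' -> 0, 'yes' -> 1

print(le.classes_)  # ['no' 'yes']
print(y_encoded)    # [0 1 0 0 1]

# inverse_transform recovers the original string labels, e.g. for reporting predictions.
decoded = le.inverse_transform(y_encoded)
```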

#### Target attribute distribution

## XGBoost Model

In [27]:
xgb_pipeline = Pipeline([('preprocess', preprocess),
                         ('xgboost', XGBClassifier(learning_rate=0.1, n_estimators=20, subsample=0.9))])

In [28]:
xgb_pipeline.fit(X_train, y_train)
    
y_train_Pred = xgb_pipeline.predict(X_train)
y_test_Pred = xgb_pipeline.predict(X_test)



In [29]:
print('========Train=======')
print(f"Confusion Matrix \n{confusion_matrix(y_train, y_train_Pred)}")
print(f"Accuracy \n{accuracy_score(y_train, y_train_Pred)}")

print('========Test=======')
print(f"Confusion Matrix \n{confusion_matrix(y_test, y_test_Pred)}")
print(f"Accuracy \n{accuracy_score(y_test, y_test_Pred)}")

Confusion Matrix 
[[24798   789]
 [ 1328  1916]]
Accuracy 
0.9265720925392806
Confusion Matrix 
[[10524   437]
 [  596   800]]
Accuracy 
0.9164036578457554
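
Accuracy alone can flatter a model on data this imbalanced, so it helps to read precision and recall for the 'yes' class off the test confusion matrix. The sketch below recomputes them from the four counts printed above, assuming scikit-learn's convention of rows as actual classes and columns as predicted classes, in the order no, yes:

```python
# Test confusion matrix as printed above: rows = actual (no, yes), columns = predicted (no, yes).
tn, fp = 10524, 437
fn, tp = 596, 800
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_yes = tp / (tp + fp)           # of the predicted 'yes', how many were right
recall_yes = tp / (tp + fn)              # of the actual 'yes', how many were found
majority_baseline = (tn + fp) / total    # accuracy of always predicting 'no'

print(f"accuracy:          {accuracy:.4f}")
print(f"precision (yes):   {precision_yes:.4f}")
print(f"recall (yes):      {recall_yes:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
```

On these counts the model finds about 57% of the actual subscribers at about 65% precision, while always predicting 'no' would already score about 88.7% accuracy on the test set.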


In [30]:
model=xgb_pipeline.named_steps['xgboost']
model

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.1, max_delta_step=0,
              max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=20, n_jobs=2,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=0.9,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [31]:
feature_imp = pd.DataFrame({'Value':model.feature_importances_,'Feature':ind_Attr_Names_Trans})

feature_imp.sort_values(by="Value", ascending=False)

Unnamed: 0,Value,Feature
7,0.556433,nr_employees
9,0.070974,duration
8,0.058271,cons_conf_idx
49,0.023561,contacted_month_oct
4,0.019408,pdays
...,...,...
41,0.000000,contacted_month_apr
58,0.000000,job_services
35,0.000000,loan_yes
27,0.000000,poutcome_nonexistent


In [32]:
joblib.dump(xgb_pipeline, 'model.joblib')

['model.joblib']

In [33]:
!gsutil cp ./model.joblib gs://bankapp_gs/model.joblib

Copying file://./model.joblib [Content-Type=application/octet-stream]...
/ [1 files][163.9 KiB/163.9 KiB]                                                
Operation completed over 1 objects/163.9 KiB.                                    


In [34]:
model = joblib.load("./model.joblib")

In [36]:
instance = [56, "housemaid", "married", "basic.4y", "no", "no", "no", "telephone", "may", "mon", 261, 1, 999, 0, "nonexistent", 1.1, 93.994, -36.4, 4.857, 5191]
COLUMN_NAMES = ['age', 'job', 'marital', 'education', 'credit_default', 'housing', 'loan', 'contact', 'contacted_month', 'day_of_week', 'duration', 'compaign', 'pdays', 'previous', 'poutcome', 'emp_var_rate', 'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employees']

In [37]:
# Wrap the single instance in a DataFrame so the pipeline can select columns by name
model.predict(pd.DataFrame(data=[instance], columns=COLUMN_NAMES))

array(['no'], dtype=object)

In [38]:
model.predict(X_test)

array(['no', 'no', 'no', ..., 'no', 'no', 'no'], dtype=object)