# Bank Marketing Campaign

## Step 1: Frame the Problem

- Objective: Predict whether a client will subscribe to a term deposit.
- Supervised classification problem.


- Description: 
    - Marketing campaigns for a product in any domain focus on advertising it in a way that highlights customer needs and problems and captures their attention. Various factors are involved in a marketing campaign, and how they are handled decides whether the campaign will be successful.
    - A few important factors in a campaign are:
        1. Characteristics of the targeted customer base: these include location, age category, overall attitude towards new products entering the market, overall financial condition, etc.
        2. Medium of the marketing campaign: the channels used, such as TV advertisements, pamphlets, social media marketing, etc.
        3. Price: the cost of the product or service to the customer.
        4. Promotional strategy: the timing of the campaign, who will be involved in it, and how its finances are managed.
  
  
  
- What is a term deposit?
    - It is an investment in which you deposit money in the bank for a fixed period, on which the bank offers a fixed rate. At the end of the period you get back your deposited money plus an amount equal to the fixed rate (in %) of the deposited money.
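
The payout described above can be written as a one-line calculation. This is a minimal sketch that follows the simple description given here (principal plus a fixed percentage of it); the function name and the numbers are hypothetical and only serve as an illustration.

```python
def maturity_value(principal: float, rate_percent: float) -> float:
    """Amount paid out at the end of the term: the principal plus rate_percent % of it."""
    return principal + principal * rate_percent / 100.0


# Hypothetical numbers, chosen only for illustration.
payout = maturity_value(principal=10_000, rate_percent=5)
print(payout)
```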
  
  
- The data relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The dataset has 17 attributes, each described below.
- Columns giving information about client:
    - Age
    - Job: type of job (admin, bluecollar, entrepreneur, housemaid, management, retired, selfemployed, services, student, technician, unemployed, unknown)
    - Marital: marital status (divorced, married, single, unknown)
    - Education: education level (primary, secondary, tertiary, unknown)
    - Default: has credit in default? (no, yes, unknown)
    - Housing: has housing loan? (no, yes, unknown)
    - Loan: has personal loan?
    - Balance: account balance of the individual


- Related to the last contact of the current campaign:
    - Contact: contact communication type   (cellular, telephone)
    - Month: last contact month of year 
    - Day: last contact day of the month
    - Duration: last contact duration, in seconds


- Related to contacts of the previous campaign:
    - Campaign: number of contacts performed during this campaign for this client
    - Pdays: number of days that passed after the client was last contacted in a previous campaign
        - Note: 999 means the client was not previously contacted.
    - Previous: number of contacts performed before this campaign for this client
    - Poutcome: outcome of the previous marketing campaign (failure, nonexistent, success)


- Output variable:
    - deposit: has the client subscribed to a term deposit? (yes, no)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

In [None]:
data = pd.read_csv("../input/bank-marketing-dataset/bank.csv")
print(data.shape)
data.head()

## Step 2: Data Exploration

### Statistical Overview

In [None]:
data_explore = data.copy()

In [None]:
data_explore.info()

- No column contains null values.

In [None]:
data_explore.describe()

- The balance and pdays columns contain negative values. At this point we don't know what those negative values represent.
- The minimum age of the targeted customers is 18, which indicates that this campaign is aimed at adults. This makes sense because children are not going to be able to afford term deposits.

- We don't know what a pdays value of -1 represents, and it accounts for more than 70% of the data. A day count cannot be negative, and the description above gives no meaning for -1 (it may be a marker for "not previously contacted", but that is not confirmed here), so I will drop this column.
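
As an alternative to dropping the column, the information could be kept as a binary flag. The sketch below rests on the assumption (not confirmed by the description above) that -1 marks clients with no previous contact; the toy frame and the column name `contacted_before` are made up for illustration.

```python
import pandas as pd

# Tiny made-up frame; -1 is ASSUMED to mean "no previous contact".
toy = pd.DataFrame({"pdays": [-1, -1, 91, -1, 180]})

# Share of rows that carry the -1 value.
sentinel_share = (toy["pdays"] == -1).mean()

# Hypothetical replacement feature: keep only whether an earlier contact exists.
toy["contacted_before"] = (toy["pdays"] != -1).astype(int)

print(sentinel_share)
print(toy)
```

Whether such a flag adds anything beyond the `previous` and `poutcome` columns would have to be checked on the real data.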

In [None]:
data_explore['pdays'].value_counts()[-1]

In [None]:
data_explore = data_explore.drop(columns=['pdays'])

In [None]:
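# Quantiles are only defined for numeric columns, so restrict the check to them.
num_explore = data_explore.select_dtypes(include='number')
Q1 = num_explore.quantile(0.25)
Q3 = num_explore.quantile(0.75)
IQR = Q3 - Q1
# Number of values per column that fall outside the 1.5 * IQR whiskers.
((num_explore < (Q1 - 1.5 * IQR)) | (num_explore > (Q3 + 1.5 * IQR))).sum()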

In [None]:
plt.figure(figsize=(10,5))
# Pass each column as y so the boxes are drawn vertically.
plt.subplot(1, 4, 1)
sns.boxplot(y='age', data=data_explore)
plt.subplot(1, 4, 2)
sns.boxplot(y='balance', data=data_explore)
plt.subplot(1, 4, 3)
sns.boxplot(y='campaign', data=data_explore)
plt.subplot(1, 4, 4)
sns.boxplot(y='previous', data=data_explore)
plt.tight_layout()

- Almost 10% of the values in the balance and previous attributes are outliers, and about 5% in the campaign attribute.
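
The outlier shares quoted above come from the 1.5 * IQR rule. A minimal sketch of that rule as a reusable function, run on a made-up series (the helper name `iqr_outlier_share` and the toy values are illustrative, not part of the notebook):

```python
import pandas as pd


def iqr_outlier_share(values: pd.Series) -> float:
    """Fraction of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return float(is_outlier.mean())


# Made-up values with a single extreme entry.
toy_balance = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
share = iqr_outlier_share(toy_balance)
print(share)
```

On the real data the same function could be applied column by column, e.g. to `data_explore['balance']`.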

In [None]:
features = list(data_explore.columns)
cat_attrs = [ col for col in features if data_explore[col].dtype=='O' ]
cat_attrs

### Histograms

In [None]:
plt.figure(figsize=(13, 9))
for k in range(len(features)):
    plt.subplot(4, 4, k+1)
    plt.hist(data_explore[features[k]])
    plt.title(features[k], fontsize=12)
# Adjust the layout once, after all subplots have been drawn.
plt.tight_layout()

In [None]:
plt.figure(figsize=(12, 5))
plt.hist(data_explore['job'])
plt.title('Jobs', fontsize=14)
plt.tight_layout()

In [None]:
plt.figure(figsize=(10, 5))
plt.hist(data_explore['month'])
plt.title('Months', fontsize=14)
plt.tight_layout()

- The dataset is fairly balanced in terms of deposit.
- The majority of customers are between 25 and 45 years old.
- The campaign was most active from May to August.
- People with admin, services, management and blue-collar jobs were contacted most often for the term deposit.
- Many people have a negative bank balance.

### Analysis of Deposit w.r.t. Client Information

In [None]:
sns.boxplot(x="deposit", y="age", hue="deposit", data=data_explore, palette="RdBu")
plt.tight_layout()

In [None]:
plt.figure(figsize=(10, 5))
ax = sns.countplot(x="loan", hue="deposit", data=data_explore)
for p in ax.patches:
        ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
plt.xlabel("Personal Loan?", fontsize=14)
plt.show()

In [None]:
plt.figure(figsize=(10, 5))
ax = sns.countplot(x="housing", hue="deposit", data=data_explore)
for p in ax.patches:
        ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
plt.xlabel("Housing Loan?", fontsize=14)
plt.show()

In [None]:
plt.figure(figsize=(10, 5))
ax = sns.countplot(x="default", hue="deposit", data=data_explore)
for p in ax.patches:
        ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
plt.xlabel("Has Default on Credit?", fontsize=14)
plt.show()

- We can see that those who have some kind of loan or debt to pay generally do not subscribe to the term deposit.

In [None]:
plt.figure(figsize=(10, 5))
ax = sns.countplot(x="marital", hue="deposit", data=data_explore)
for p in ax.patches:
        ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
plt.xlabel('Marital Status', fontsize=14)
plt.show()

In [None]:
plt.figure(figsize=(10, 5))
ax = sns.countplot(x="education", hue="deposit", data=data_explore)
for p in ax.patches:
        ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+50))
plt.show()

- Although people with secondary education account for the most term deposit subscribers among all education levels, it is the highly educated people who show a net positive response to the term deposit compared with the others.

In [None]:
plt.figure(figsize=(15, 6))
ax = sns.countplot(x="job", hue="deposit", data=data_explore)
for p in ax.patches:
        ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
plt.xlabel('Job Occupation', fontsize=14)
plt.ylabel("Count", fontsize=14)
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)
sns.boxplot(y="balance", data=data_explore)
plt.ylim(-3000, 6000)
plt.tight_layout()
plt.subplot(1, 2, 2)
sns.boxplot(x="deposit", y="balance", data=data_explore)
plt.ylim(-3000, 6000)
plt.tight_layout()

In [None]:
plt.figure(figsize=(13, 6))
sns.boxplot(x="job", y="balance", hue="deposit", data=data_explore)
plt.ylim(-4000, 8000)
plt.tight_layout()

- For almost all job categories, the median balance of people who subscribed to the term deposit is higher than that of those who haven't subscribed.

- Let's look more closely at people who have some kind of loan and people who are loan-free.

In [None]:
def has_loan(loans):
    a, b, c = loans
    if a=='yes' or b=='yes' or c=='yes':
        return 1
    else:
        return 0
    
data_explore['has_loans'] = data_explore[['default', 'housing', 'loan']].apply(has_loan, axis=1)
data_explore.head()

In [None]:
plt.figure(figsize=(10, 5))
ax = sns.countplot(x="has_loans", data=data_explore)
for p in ax.patches:
        ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
plt.xlabel("Has Loan?", fontsize=14)
plt.show()

In [None]:
plt.figure(figsize=(12, 5))
ax = sns.countplot(x="has_loans", hue="deposit", data=data_explore)
for p in ax.patches:
        ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
plt.xlabel("Has Loan?", fontsize=14)
plt.show()

- There are some people who have a loan and also subscribed to the term deposit. Let's explore them further.

In [None]:
data_explore_has_loan_deposit = data_explore[(data_explore['has_loans']==1) & (data_explore['deposit']=='yes')]
data_explore_has_loan_deposit.shape

In [None]:
plt.figure(figsize=(15, 15))
plt.subplot(3, 1, 1)
plt.title("People Who Have a Loan", fontsize=16)
ax = sns.countplot(x='education', hue='deposit', data=data_explore[data_explore['has_loans']==1])
for p in ax.patches:
        ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))

plt.subplot(3, 1, 2)
ax = sns.countplot(x='job', hue='deposit', data=data_explore[data_explore['has_loans']==1])
for p in ax.patches:
        ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
        
plt.subplot(3, 1, 3)
# Boxplot patches are not bars (they have no get_height), so no count annotations are added here.
ax = sns.boxplot(x='job', y='balance', hue='deposit', data=data_explore[data_explore['has_loans']==1], orient='v')
plt.ylim(-2500, 7000)
plt.show()

- For every education level and job category, only approximately 35-40% of the people who carry a loan subscribed to the term deposit.
- There is also not much difference in the median balance between people who subscribed to the term deposit and those who didn't.
- Although this does not hold in all job categories, the overall picture gives the impression that those who have a loan and also subscribed to the term deposit may have done so because of a good financial condition.

- It would have been very beneficial to have information about the loan amount that remains to be paid off. This would have given a clearer picture of which people with a loan are potential subscribers. In my opinion, if a person is close to repaying the loan, there is some possibility that the person is willing to look for some kind of safe investment.

- We also saw that there are people who are loan-free but didn't subscribe to the deposit.
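
Percentages like the ones quoted above can be read off the count plots, but they can also be computed directly. A minimal sketch on a made-up frame (the values are invented; only the column names `has_loans` and `deposit` follow the notebook):

```python
import pandas as pd

# Made-up rows, only to show the mechanics.
toy = pd.DataFrame({
    "has_loans": [1, 1, 1, 1, 0, 0, 0, 0],
    "deposit":   ["yes", "no", "no", "no", "yes", "yes", "no", "yes"],
})

# Mean of a boolean column is the share of True values, i.e. the subscription rate.
rate_by_loan = (toy["deposit"] == "yes").groupby(toy["has_loans"]).mean()
print(rate_by_loan)
```

Grouping by `['has_loans', 'job']` or `['has_loans', 'education']` instead would give the per-category rates discussed in the bullets.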

In [None]:
plt.figure(figsize=(15, 15))
plt.subplot(3, 1, 1)
plt.title("People Who Don't Have a Loan", fontsize=16)
ax = sns.countplot(x='education', hue='deposit', data=data_explore[data_explore['has_loans']==0])
for p in ax.patches:
        ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))

plt.subplot(3, 1, 2)
ax = sns.countplot(x='job', hue='deposit', data=data_explore[data_explore['has_loans']==0])
for p in ax.patches:
        ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
        
plt.subplot(3, 1, 3)
# Boxplot patches are not bars (they have no get_height), so no count annotations are added here.
ax = sns.boxplot(x='job', y='balance', hue='deposit', data=data_explore[data_explore['has_loans']==0], orient='v')
plt.ylim(-2500, 7000)
plt.show()

- If we look at job categories such as management, technician, blue-collar, admin etc., we see that the people who didn't subscribe to the deposit despite having no loans are the ones with a lower median balance.


- Up to this point we can say the following are important attributes of term deposit subscribers:
    1. Loan-free, or having a loan with a sufficiently good balance.
    2. Good educational background.
    3. A sufficiently good balance and one of the following job categories: admin., management, technician, retired, student.

### Analysis of Deposit w.r.t. Communication Type

In [None]:
plt.figure(figsize=(10, 5))
ax = sns.countplot(x="contact", hue="deposit", data=data_explore)
for p in ax.patches:
        ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
plt.ylabel("Count", fontsize=14)
plt.xlabel("Communication Type", fontsize=14)
plt.show()

### Analysis of Deposit w.r.t. Previous Campaign Results

In [None]:
print("Average contact duration (minutes) with persons who subscribed: ", data_explore[data_explore['deposit']=='yes']['duration'].mean()/60)
print("Average contact duration (minutes) with persons who haven't subscribed: ", data_explore[data_explore['deposit']=='no']['duration'].mean()/60)

In [None]:
plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)
sns.boxplot(y="duration", data=data_explore, palette="RdBu")
plt.ylim(top=1600)
plt.tight_layout()
plt.subplot(1, 2, 2)
sns.boxplot(x="deposit", y="duration", data=data_explore, palette="RdBu")
plt.ylim(top=1600)
plt.tight_layout()

In [None]:
plt.figure(figsize=(12, 10))
plt.subplot(2, 1, 1)
sns.countplot(x='campaign', hue='deposit', data=data_explore)
plt.xlim(right=10)
plt.xlabel('')
plt.subplot(2, 1, 2)
sns.countplot(x='campaign', hue='deposit', data=data_explore)
plt.xlim(left=11)
plt.ylim(top=30)
plt.xlabel('# of Campaign', fontsize=14)
plt.show()

- The campaign attribute indicates the number of contacts made with the customer during this campaign.
- We can see that as the number of contacts increases, fewer customers subscribe to the deposit.

In [None]:
plt.figure(figsize=(10, 5))
ax = sns.countplot(x="poutcome", hue="deposit", data=data_explore)
for p in ax.patches:
        ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
plt.show()

- Focusing on success and failure, it is evident that if the previous contact resulted in success, there is a very high chance that the customer will subscribe to the term deposit at the next contact.
- Another interesting result is that even if the previous contact ended in failure, there is still a chance that the customer subscribes to the term deposit.

In [None]:
plt.figure(figsize=(15, 6))
# Two panels side by side that share the y-axis (number of contacts).
ax1 = plt.subplot(1, 2, 1)
sns.scatterplot(x='age', y='campaign', hue='poutcome', data=data_explore[(data_explore['deposit']=='no') & (data_explore['poutcome']!='unknown')])
plt.title("Unsubscribed", fontsize=14)
plt.subplot(1, 2, 2, sharey=ax1)
sns.scatterplot(x='age', y='campaign', hue='poutcome', data=data_explore[(data_explore['deposit']=='yes') & (data_explore['poutcome']!='unknown')])
plt.title("Subscribed", fontsize=14)

- The chart above clearly shows the importance of the outcome of the previous contact. The right-hand chart contains more green points, indicating that the previous contact with customers who subscribed to the deposit was mostly a success.
- The left-hand chart represents the customers who haven't subscribed to the deposit; for them, the previous contact resulted in failure.
- We have already seen that many retired people subscribed to the term deposit. Looking to the right of age 60, there are many people who subscribed to the term deposit.

In [None]:
plt.figure(figsize=(12, 5))
ax = sns.countplot(x="month", hue="deposit", data=data_explore)
for p in ax.patches:
        ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
plt.show()

- From the chart above, it seems that the campaign was successful from February to April and in September and October.
- The majority of the campaign activity took place between May and August, which didn't result in success.
- Very little activity took place in December and January.

- Up to this point it is evident that duration, campaign, education, job, balance, loan and month of contact are some crucial parameters that need to be taken into account for a successful campaign.
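
The month comparison is easier to read when months appear in calendar order and as rates rather than raw counts. A minimal sketch, assuming the `month` column holds lowercase three-letter abbreviations (an assumption about the file; the toy rows are made up):

```python
import pandas as pd

# ASSUMED encoding of the month column: lowercase three-letter abbreviations.
MONTH_ORDER = ["jan", "feb", "mar", "apr", "may", "jun",
               "jul", "aug", "sep", "oct", "nov", "dec"]

# Made-up rows, only to show the mechanics.
toy = pd.DataFrame({
    "month":   ["may", "feb", "may", "oct", "feb", "may"],
    "deposit": ["no", "yes", "no", "yes", "yes", "yes"],
})

# An ordered categorical makes groupby (and seaborn) follow calendar order.
toy["month"] = pd.Categorical(toy["month"], categories=MONTH_ORDER, ordered=True)
rate_by_month = (toy["deposit"] == "yes").groupby(toy["month"], observed=True).mean()
print(rate_by_month)
```

For the count plot itself, passing `order=MONTH_ORDER` to `sns.countplot` gives the same ordering without changing the column type.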

In [None]:
from sklearn.preprocessing import LabelEncoder

label_enc = LabelEncoder()
for cat in cat_attrs:
    data_explore[cat] = label_enc.fit_transform(data_explore[cat])

data_explore.head()

In [None]:
corr_matrix = data_explore.corr()

plt.figure(figsize=(17, 12))
# np.bool was removed from NumPy; the built-in bool is the equivalent dtype.
sns.heatmap(corr_matrix, mask=np.zeros_like(corr_matrix, dtype=bool), square=True, annot=True, cbar=False)

- It is evident that the duration attribute is highly correlated with the target variable. The longer the contact with a customer, the higher the chances of them subscribing to the term deposit.
- Other than duration, no attribute has a strong correlation with the target variable.
- The newly created attribute has_loans is slightly more correlated with the target variable than housing, loan and default.
- previous, campaign and poutcome are a few other attributes that show a slightly better correlation with the target variable.
- Some independent attributes are strongly correlated with each other, e.g. poutcome and previous.
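
A correlation matrix is symmetric, so half of a heatmap repeats the other half. The sketch below shows one common way to hide the upper triangle with a boolean mask; the toy frame is made up, and this is an optional presentation choice rather than what the cell above does (its all-zeros mask hides nothing).

```python
import numpy as np
import pandas as pd

# Made-up numeric frame, only to get a small correlation matrix.
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 9], "c": [4, 3, 2, 1]})
corr = toy.corr()

# True marks cells to hide: the diagonal and everything above it.
mask = np.triu(np.ones_like(corr, dtype=bool))

# The same mask can be passed as sns.heatmap(corr, mask=mask, annot=True).
lower = corr.mask(mask)
print(lower)
```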

## Step 3: Data Preprocessing

- I will add a new attribute 'has_loan' and remove the 'loan', 'housing' and 'default' attributes.
- The data is clean, but I still add cleaning steps to the preprocessing pipeline.
- For numeric columns, null values will be replaced by the mean; for categorical columns, null values will be replaced by the most frequent value.
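
The two imputation strategies mentioned above can be seen in isolation on small arrays. This is a minimal sketch with made-up values; it only illustrates what `SimpleImputer` does for each strategy.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up columns with one missing entry each.
num = np.array([[1.0], [np.nan], [3.0]])
cat = np.array([["married"], ["single"], [np.nan], ["married"]], dtype=object)

# Numeric column: the gap is filled with the column mean (2.0 here).
num_filled = SimpleImputer(strategy="mean").fit_transform(num)

# Categorical column: the gap is filled with the most frequent value.
cat_filled = SimpleImputer(strategy="most_frequent").fit_transform(cat)

print(num_filled.ravel(), cat_filled.ravel())
```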

In [None]:
X = data.drop(columns=['deposit'])
y = data['deposit'].copy()
y = y.apply(lambda x: 0 if x=='no' else 1)

In [None]:
list(X.columns)

In [None]:
feature_columns = list(X.columns)
cat_attrs = [ col for col in feature_columns if X[col].dtype=='O' ]
num_attrs = [ col for col in feature_columns if col not in cat_attrs ]
num_attrs.remove('pdays')
cat_attrs, num_attrs

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [None]:
class AddCustomAttribute(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        default_idx, housing_idx, loan_idx = cat_attrs.index('default'), cat_attrs.index('housing'), cat_attrs.index('loan')
        has_loan_attr = (X[:, default_idx]=='yes') | (X[:, housing_idx]=='yes') | (X[:, loan_idx]=='yes')
        X = np.delete(X, (default_idx, housing_idx, loan_idx), axis=1)
        return np.c_[X, has_loan_attr]
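
The transformer above looks up column positions in the global `cat_attrs` list each time `transform` runs. A variant that receives the positions through its constructor avoids that dependency. The sketch below is a hypothetical alternative (the class name `AddHasLoan` and the parameter `loan_idx` are made up), not the notebook's own implementation.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class AddHasLoan(BaseEstimator, TransformerMixin):
    """Hypothetical variant: replaces the given yes/no columns by one has_loan flag."""

    def __init__(self, loan_idx=(0, 1, 2)):
        # Positions of the default, housing and loan columns in the input array.
        self.loan_idx = loan_idx

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=object)
        idx = list(self.loan_idx)
        # True if any of the selected columns equals 'yes' for that row.
        has_loan = (X[:, idx] == "yes").any(axis=1)
        return np.c_[np.delete(X, idx, axis=1), has_loan]


# Made-up rows: default, housing, loan, job.
toy = np.array([["no", "yes", "no", "admin."],
                ["no", "no", "no", "student"]], dtype=object)
out = AddHasLoan(loan_idx=(0, 1, 2)).fit_transform(toy)
print(out)
```

Inside the pipeline the positions could be computed once, e.g. from `cat_attrs.index(...)`, and passed in when the step is created.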

In [None]:
cat_pipeline = Pipeline([('cat_imputer', SimpleImputer(strategy='most_frequent')),
                        ('add_attrs', AddCustomAttribute()),
                        ('encoder', OneHotEncoder(handle_unknown='ignore'))])

pre_process = ColumnTransformer([('drop_attrs', 'drop', ['pdays']),
                                 ('cat_process', cat_pipeline, cat_attrs),
                                 ('num_process', SimpleImputer(strategy='mean'), num_attrs)], remainder='passthrough')

In [None]:
X_train_transformed = pre_process.fit_transform(X_train)
X_test_transformed = pre_process.transform(X_test)

In [None]:
X_train_transformed.shape, X_test_transformed.shape

In [None]:
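# Build the post-transform column list on a copy, so the global cat_attrs
# (still used by AddCustomAttribute.transform) is left unchanged.
encoded_cat_attrs = [col for col in cat_attrs if col not in ('loan', 'housing', 'default')]
encoded_cat_attrs.append('has_loan')

# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out replaces it.
all_cat_attrs = list(pre_process.transformers_[1][1]['encoder'].get_feature_names_out(encoded_cat_attrs))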

In [None]:
feature_columns = all_cat_attrs + num_attrs
len(feature_columns), feature_columns

## Step 4: Modelling

- Approach:
    - I will be implementing following algorithms:
        1. Logistic Regression
        2. Random Forest
        3. Gradient Boosting
        4. XGBoost
    - Using GridSearchCV, the best model for each algorithm will be obtained. Since the dataset is fairly balanced, I will select the best model by observing the ROC curve and accuracy.
    - I will use following metrics to evaluate performance of each model:
        1. Accuracy score
        2. Precision and Recall score
        3. ROC Curve
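
For reference, the listed metrics follow directly from the four confusion-matrix counts. A minimal sketch in plain Python (the function name `binary_scores` and the label vectors are made up for illustration; in the notebook itself the scikit-learn scorers are used):

```python
def binary_scores(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels, from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 1 = subscribed, 0 = not subscribed.
scores = binary_scores([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
print(scores)
```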
    

In [None]:
from sklearn.model_selection import GridSearchCV, KFold

In [None]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)

def grid_search(model, grid_param):
    print("Obtaining Best Model for {}".format(model.__class__.__name__))
    # Named `search` so the local variable does not shadow this function's own name.
    search = GridSearchCV(model, grid_param, cv=kf, scoring='roc_auc', return_train_score=True, n_jobs=-1)
    search.fit(X_train_transformed, y_train)
    
    print("Best Parameters: ", search.best_params_)
    print("Best Score: ", search.best_score_)
    
    cvres = search.cv_results_
    print("\nResults for each run of {}...".format(model.__class__.__name__))
    for train_mean_score, test_mean_score, params in zip(cvres["mean_train_score"], cvres["mean_test_score"], cvres["params"]):
        print(train_mean_score, test_mean_score, params)
        
    return search.best_estimator_

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_score

results = dict()

np.set_printoptions(precision=4)

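def plot_roc_curve(model, X=X_test_transformed, y_true=y_test):
    # Use the predicted probability of the positive class; hard labels from
    # predict() would give only a single operating point on the curve.
    y_scores = model.predict_proba(X)[:, 1]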
    auc_score = np.round(roc_auc_score(y_true, y_scores), 4)
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    plt.plot(fpr, tpr, linewidth=2, label=model.__class__.__name__+"(AUC Score: "+str(auc_score)+")")
    plt.plot([0, 1], [0, 1], 'k--') # Dashed diagonal
    plt.axis([0, 1, 0, 1])
    plt.xlabel("FPR", fontsize=16)
    plt.ylabel("TPR", fontsize=16)
    plt.legend()
    

    
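def performance_measures(model, store_results=True):
    # Note: cross_val_score clones the estimator and refits it on folds of the
    # test split, so these scores do not come from the model fitted on the training set.
    test_acc = cross_val_score(model, X_test_transformed, y_test, cv=kf, n_jobs=-1, scoring='accuracy')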
    test_acc = np.around(test_acc, decimals=4)
    mean_test_acc = np.around(np.mean(test_acc), decimals=4)
    sd_test_acc = np.around(np.std(test_acc), decimals=4)
    print("CV Test Accuracy Scores: ", test_acc)
    print("Mean Accuracy: {} (S.D = {})".format(mean_test_acc, sd_test_acc))
    
    test_f1 = cross_val_score(model, X_test_transformed, y_test, cv=kf, n_jobs=-1, scoring='f1')
    test_f1 = np.around(test_f1, decimals=4)
    mean_test_f1 = np.around(np.mean(test_f1), decimals=4)
    sd_test_f1 = np.around(np.std(test_f1), decimals=4)
    print("\nCV Test F1 Scores: ", test_f1)
    print("Mean F1: {} (S.D = {})".format(mean_test_f1, sd_test_f1))
     
    if store_results:
        results[model.__class__.__name__] = (mean_test_acc*100, sd_test_acc*100,  mean_test_f1*100, sd_test_f1*100)

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
# n_jobs has no effect with the liblinear solver, so it is omitted.
logistic_clf = LogisticRegression(solver='liblinear', random_state=42)
logistic_param_grid = [{'C':[0.01, 0.1, 1, 10], 'penalty':['l1', 'l2']}]

In [None]:
logistic_clf = grid_search(logistic_clf, logistic_param_grid)

In [None]:
feature_importance = []
for feature_imp in zip(feature_columns, logistic_clf.coef_[0]):
    feature_importance.append(feature_imp)
    
feature_importance.sort(key=lambda a:a[1], reverse=True)
feature_importance[:10]

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
forest_clf = RandomForestClassifier(n_estimators=250, random_state=42, n_jobs=-1)
# 'auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3.
forest_param_grid = [{'max_depth':[8, 12, 16, 20], 'max_features':[None, 'sqrt']}]

In [None]:
forest_clf = grid_search(forest_clf, forest_param_grid)

- As depth increases, the model starts to overfit. At depth=8 the model is not overfitting and has a better ROC AUC score.
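
Overfitting shows up as a growing gap between mean train and mean validation score in the grid-search output. The sketch below tabulates that gap from a dictionary shaped like `GridSearchCV.cv_results_`; the helper name `train_test_gap` is hypothetical and the scores are made-up placeholders, not results from this notebook.

```python
import pandas as pd


def train_test_gap(cv_results) -> pd.DataFrame:
    """One row per parameter setting, sorted by the train-minus-validation gap."""
    table = pd.DataFrame({
        "params": [str(p) for p in cv_results["params"]],
        "train": cv_results["mean_train_score"],
        "test": cv_results["mean_test_score"],
    })
    table["gap"] = table["train"] - table["test"]
    return table.sort_values("gap").reset_index(drop=True)


# MADE-UP scores, only to show the shape of the input and output.
fake_cv_results = {
    "params": [{"max_depth": 16}, {"max_depth": 8}],
    "mean_train_score": [1.00, 0.93],
    "mean_test_score": [0.92, 0.91],
}
gap_table = train_test_gap(fake_cv_results)
print(gap_table)
```

To use it with the notebook's helper, `grid_search` would have to return the fitted search object (or its `cv_results_`) in addition to the best estimator.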

In [None]:
forest_clf.max_depth=8
forest_clf.max_features='sqrt'  # same behaviour as the removed 'auto' option
forest_clf.fit(X_train_transformed, y_train)

In [None]:
feature_importance = []
for feature_imp in zip(feature_columns, forest_clf.feature_importances_):
    feature_importance.append(feature_imp)
    
feature_importance.sort(key=lambda a:a[1], reverse=True)
feature_importance[:10]

### Gradient Boost Classifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
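# loss='deviance' and max_features='auto' were removed in scikit-learn 1.3;
# the default loss is the same logistic loss and 'auto' was an alias of 'sqrt'.
gb_clf = GradientBoostingClassifier(n_estimators=250, random_state=42)
gb_param_grid = [{'max_depth':[3, 8, 16], 'max_features':[None, 'sqrt']}]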

In [None]:
gb_clf = grid_search(gb_clf, gb_param_grid)
gb_clf

In [None]:
gb_clf.max_depth=3
gb_clf.max_features='sqrt'  # same behaviour as the removed 'auto' option
gb_clf.fit(X_train_transformed, y_train)

In [None]:
feature_importance = []
for feature_imp in zip(feature_columns, gb_clf.feature_importances_):
    feature_importance.append(feature_imp)
    
feature_importance.sort(key=lambda a:a[1], reverse=True)
feature_importance[:10]

### XGBoost Classifier

In [None]:
from xgboost import XGBClassifier

In [None]:
xgb_clf = XGBClassifier(n_estimators=250, random_state=42, n_jobs=-1)
xgb_param_grid = [{'max_depth':[4, 8, 16], 'learning_rate':[0.01, 0.1, 1]}]

In [None]:
xgb_clf = grid_search(xgb_clf, xgb_param_grid)

In [None]:
feature_importance = []
for feature_imp in zip(feature_columns, xgb_clf.feature_importances_):
    feature_importance.append(feature_imp)
    
feature_importance.sort(key=lambda a:a[1], reverse=True)
feature_importance[:10]

## Step 5: Model Evaluation

In [None]:
print('\n Logistic Regression : CV Results')
performance_measures(logistic_clf)

print("--"*30)
print('\n Random Forest : CV Results')
performance_measures(forest_clf)

print("--"*30)
print('\n Gradient Boost : CV Results')
performance_measures(gb_clf)

print("--"*30)
print('\n XGBoost : CV Results')
performance_measures(xgb_clf)

In [None]:
models =  list(results.keys())
result = list(results.values())
test_mean_acc=[]
test_sd_acc=[]
test_mean_f1=[]
test_sd_f1=[]

for res in result:
    test_mean_acc.append(res[0])
    test_sd_acc.append(res[1])
    test_mean_f1.append(res[2])
    test_sd_f1.append(res[3])

In [None]:
plt.figure(figsize=(7, 4))
plot_roc_curve(logistic_clf)
plot_roc_curve(forest_clf)
plot_roc_curve(gb_clf)
plot_roc_curve(xgb_clf)
plt.title("ROC Curve", fontsize=14)
plt.show()

In [None]:
plt.figure(figsize=(12, 4))
x_indexes = np.arange(len(models))     
width = 0.15                            

plt.bar(x_indexes - width,  test_mean_acc, label="Mean Test Accuracy (S.D.)", width=width)
for i in range(len(x_indexes)):
    label=str(test_mean_acc[i])[:6]+" ({:.3f})".format(test_sd_acc[i])
    plt.text(x=x_indexes[i]-width, y=test_mean_acc[i]+0.3, s=label, fontsize=12)

plt.bar(x_indexes,  test_mean_f1, label="Mean F1 Score (S.D.)", width=width)
for i in range(len(x_indexes)):
    label=str(test_mean_f1[i])[:6]+"({:.3f})".format(test_sd_f1[i])
    plt.text(x=x_indexes[i], y=test_mean_f1[i]+0.1, s=label, fontsize=12)
    
plt.ylim(75, 85)
plt.ylabel("%", fontsize=14)
plt.legend(loc="upper left", fontsize=12)
plt.xticks(ticks=x_indexes, labels=models, fontsize=12)
plt.show()

- Observations:
    - All models give good accuracy as well as F1 score on the test dataset.
    - The ROC curves of the ensemble models are very close to each other.
    - Among all models, XGBoost has a slightly better AUC score.


- There is not much difference between the performance of the Gradient Boosting and XGBoost classifiers. I will stick with the ROC AUC score and select XGBoost as the final model.
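
Since the final choice rests on the AUC score, it is worth recalling that a ROC curve is traced by sweeping a threshold over continuous scores. The sketch below contrasts hard labels with predicted probabilities on synthetic data; the dataset and the logistic model are stand-ins chosen for illustration and are unrelated to the bank data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the bank dataset.
X_toy, y_toy = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

hard_labels = clf.predict(X_te)              # 0/1 predictions
pos_proba = clf.predict_proba(X_te)[:, 1]    # probability of class 1

# Hard labels collapse the curve to a single operating point between (0,0) and (1,1).
fpr_hard, tpr_hard, _ = roc_curve(y_te, hard_labels)
fpr_soft, tpr_soft, _ = roc_curve(y_te, pos_proba)

auc_hard = roc_auc_score(y_te, hard_labels)
auc_soft = roc_auc_score(y_te, pos_proba)
print(len(fpr_hard), len(fpr_soft), auc_hard, auc_soft)
```

The probability-based AUC uses the full ranking of the test rows, so it is usually the more informative number when comparing models.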

## Step 6: Introspection of Model Performance

Now let's analyse the predictions made by the model on the overall dataset. This will help us understand where the model is not performing well.

- Let's find out which are the most important features according to the selected model.

In [None]:
feature_importance = []
for feature_imp in zip(feature_columns, xgb_clf.feature_importances_):
    feature_importance.append(feature_imp)
    
feature_importance.sort(key=lambda a:a[1], reverse=True)
feature_importance[:10]

- The most important parameters according to the XGBoost model are the outcome of the previous contact, whether the person has a loan, and the month in which the last contact was made.
- The 'unknown' value of the contact and poutcome variables is not really going to be helpful for determining steps to improve the success of the campaign.

In [None]:
y_train_pred = xgb_clf.predict(X_train_transformed)
y_test_pred = xgb_clf.predict(X_test_transformed)
y_pred = np.concatenate([y_train_pred, y_test_pred], axis=0)

y_true = np.concatenate([y_train, y_test], axis=0)
y_pred.shape, y_true.shape

In [None]:
combine_data = pd.concat([X_train, X_test], axis=0)
combine_data.shape

In [None]:
combine_data['deposit'] = y_true
combine_data['predictions'] = y_pred
combine_data['has_loan'] = combine_data[['default', 'housing', 'loan']].apply(has_loan, axis=1)

In [None]:
combine_data.head()

In [None]:
plt.figure(figsize=(15, 4))
plt.subplot(1, 2, 1)
ax = sns.countplot(x='deposit', data=combine_data)
for p in ax.patches:
        ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+50))
plt.title("Observed Subscribers")
plt.subplot(1, 2, 2)
ax = sns.countplot(x='predictions', data=combine_data)
for p in ax.patches:
        ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+50))
plt.title("Predicted Subscribers")
plt.show()

In [None]:
plt.figure(figsize=(15, 4))
plt.subplot(1, 2, 1)
ax = sns.countplot(x='has_loan', hue='deposit', data=combine_data)
for p in ax.patches:
        ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+50))
plt.title("Observed Subscribers", fontsize=14)

plt.subplot(1, 2, 2)
ax = sns.countplot(x='has_loan', hue='predictions', data=combine_data)
for p in ax.patches:
        ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+50))
plt.title("Predicted Subscribers", fontsize=14)
plt.show()

- There is not much difference between observed and predicted values for people who carry some kind of loan.
- For people who are loan-free, the difference between observed and predicted values is slightly larger.

In [None]:
plt.figure(figsize=(14, 10))
plt.subplot(2, 1, 1)
ax = sns.countplot(x='job', hue='deposit', data=combine_data)
for p in ax.patches:
        ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
plt.title("Observed Subscribers", fontsize=14)
plt.subplot(2, 1, 2)
ax = sns.countplot(x='job', hue='predictions', data=combine_data)
for p in ax.patches:
        ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
plt.title("Predicted Subscribers", fontsize=14)
plt.show()

- Again, there is not much difference between predicted and observed values across all job categories.

In [None]:
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
ax = sns.countplot(x='education', hue='deposit', data=combine_data)
for p in ax.patches:
        ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
plt.title("Observed Subscribers", fontsize=14)
plt.subplot(1, 2, 2)
ax = sns.countplot(x='education', hue='predictions', data=combine_data)
for p in ax.patches:
        ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
plt.title("Predicted Subscribers", fontsize=14)
plt.show()

- Here, there is considerable misclassification among people with either secondary or tertiary education.

In [None]:
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
ax = sns.countplot(x='contact', hue='deposit', data=combine_data)
plt.title("Observed Subscribers", fontsize=14)
for p in ax.patches:
        ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
plt.subplot(1, 2, 2)
ax = sns.countplot(x='contact', hue='predictions', data=combine_data)
for p in ax.patches:
        ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
plt.title("Predicted Subscribers", fontsize=14)
plt.show()

- Again there is considerable misclassification when the communication type was cellular.

### Final Thoughts:
- The following are some important parameters that should receive more emphasis in order to improve the success rate of the campaign:
    1. <b>Duration of contact:</b> The focus should be on increasing the contact time with the target person. The more engaging the communication, the higher the chances of the person becoming a potential customer.
    2. <b>Month:</b> According to the analysis, most subscribers were contacted from February to April and in September and October. So cutting down the activities in other months and focusing more on these periods will be beneficial.
    3. <b>Loan & Balance:</b> We saw that many loan-free people subscribed to the term deposit. There are also some people who have a loan but still subscribed to the deposit, and there are loan-free people who didn't subscribe. The main reason for this is the balance: those with a sufficiently good balance tend to subscribe to the term deposit.
    4. <b>Contacts:</b> In the analysis we saw that the number of contacts made with most of the subscribers is less than 4. So contacting a person more than 3-4 times should be avoided.
    5. <b>Jobs:</b> Targeting students, retired people or people from a management background will bring more success. The data also showed that there is very little positive response from people with job occupations such as blue-collar, entrepreneur, services, housemaid etc. Contacting such people should be avoided, or at least other factors such as balance, loan and education should be taken into consideration before approaching them. Overall, this factor should be weighed together with balance and education.
    6. <b>Education:</b> Not a massive but still an important factor to consider. In the analysis we saw that there are many subscribers who have a good educational background. I believe that an educated person with some knowledge about investment will at least give some extra time to understand the offer made.