<a href="https://colab.research.google.com/github/shashi6352/BOston-Consulting-Group-virtual-Internship/blob/main/Predict_Term_Deposit_in_a_Bank_Marketing_Campaign.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
janiobachmann_bank_marketing_dataset_path = kagglehub.dataset_download('janiobachmann/bank-marketing-dataset')

print('Data source import complete.')


![bankkunden.jpg](attachment:74612c5e-2977-462c-8338-a67d8e51a643.jpg)

---

### SUMMARY

1. [Introduction](#1)
2. [Read the Data](#2)
3. [Bank Client Segmentation](#3)
4. [Binary Classification](#4)
5. [Term Deposit and Success of the Marketing Campaign](#5)
6. [Final Suggestions](#6)

---

# 1. Introduction
<a id="1"></a>

In this notebook, I am dealing with the data of a marketing campaign. Marketing campaigns are sets of strategic activities that promote a business's goal or objective. A marketing campaign could be used to promote a product, a service, or the brand as a whole. To achieve the most effective results, campaigns are carefully planned and the activities are varied. Marketing campaigns make use of different channels, platforms, and mediums to maximize impact.

A business could run campaigns that utilize print media, social media, online ads, email, in-person demos, and more. Each campaign will vary depending on the intended purpose. However, the messaging and tone of any given campaign will closely link to the tone of the business’s brand.

The four Ps represent four key elements of a campaign. For example, see: [*4 Ps of Marketing: What They Are & How to Use Them Successfully* by Alexandra Twin, Investopedia](https://www.investopedia.com/terms/f/four-ps.asp).

- **Product.** Creating a marketing campaign starts with an understanding of the product itself. Who needs it and why? What does it do that no competitor's product can do? The job of the marketer is to define the product and its qualities and introduce it to the consumer - the basic marketing of a product (or service).

- **Price.** Price is the amount that consumers will be willing to pay for a product. Marketers must link the price point to the product's real and perceived value, while also considering supply costs, seasonal discounts, competitors' prices, and retail markup. In some cases, business decision-makers may raise the price of a product to give it the appearance of luxury or exclusivity. Or, they may lower the price so more consumers will try it.

- **Place.** Place is the consideration of where the product should be available and how it will be displayed. The decision is key: The makers of a luxury cosmetic product would want to be displayed in Sephora and Neiman Marcus, not in Walmart or Family Dollar. The goal of business executives is always to get their products in front of the consumers who are the most likely to buy them.

- **Promotion.** The goal of promotion is to communicate to consumers that they need this product and that it is priced appropriately. Promotion encompasses advertising, public relations, and the overall media strategy for introducing a product.

![Marketing-recirc-blue-77cc4c488cf14d4686691e82219f80cf.jpeg](attachment:325a0ab6-b1c4-45ee-9f0a-9685d248f4df.jpeg)

# 2. Read the Data
<a id="2"></a>

In [None]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns',None)
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import KFold,cross_val_score,StratifiedKFold
from sklearn.model_selection import validation_curve
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
from sklearn.metrics import confusion_matrix,ConfusionMatrixDisplay
from sklearn.metrics import roc_auc_score,f1_score,accuracy_score,roc_curve
from sklearn.preprocessing import LabelEncoder,StandardScaler,MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.inspection import permutation_importance

from xgboost import XGBClassifier

from time import time

from scipy.stats import randint
import random

from warnings import simplefilter
simplefilter("ignore")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
data = pd.read_csv('/kaggle/input/bank-marketing-dataset/bank.csv')

data.head()

In [None]:
print(f'The dataset has {data.shape[0]} rows and {data.shape[1]} columns.')

## 2.1 Null and Duplicate Values

In [None]:
data.isnull().sum()

In [None]:
data.duplicated().sum()

The dataset has zero null and zero duplicate values.

In [None]:
data.dtypes

Every column holds either integer or string (object) values.

## 2.2 Columns and Their Meaning

**1. age.** Customer's age. Let's have a brief look at the values within this column.

In [None]:
data['age'].nunique(), data.age.min(), data.age.max()

The column *age* has 76 different values, ranging from 18 (the age of majority in many countries) to 95.

**2. job.** Customer's job. All the possible (categorical) values are listed below.

In [None]:
data['job'].unique()

**3. marital.** Customer's marital status.

In [None]:
data['marital'].unique()

**4. education.** Customer's education level.

In [None]:
data['education'].unique()

**5. default.** Does the customer have credit in default?

In [None]:
data['default'].unique()

**6. balance.** Customer's balance. Can be either positive or negative.

In [None]:
data['balance'].min(), data['balance'].max()

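**7. housing.** Does the customer have a housing loan? The separate *loan* column (not listed here) records whether the customer has a personal loan.

In [None]:
data['housing'].unique()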

**8. contact.** Contact communication type.

In [None]:
data['contact'].unique()

**9. day.** Last contact day of the month.

In [None]:
data['day'].unique()

**10. month.** Last contact month of the year.

In [None]:
data['month'].unique()

**11. duration.** Last contact (telephone call) duration in seconds (numeric).

In [None]:
data['duration'].min(), data['duration'].max()

The calls between marketers and customers span a very large interval: from 2 to 3881 seconds.

**12. campaign.** Number of contacts performed during this campaign and for this client.

In [None]:
data['campaign'].min(), data['campaign'].max(), data['campaign'].mean()

**13. pdays.** Number of days that passed by after the client was last contacted from a previous campaign (numeric).

In [None]:
data['pdays'].min(), data['pdays'].max(), data['pdays'].mean()

We will have to find out what the value of -1 means.
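
One way to investigate the -1 values is to cross-tabulate them against *previous*: if -1 only ever appears together with `previous == 0`, it most likely flags clients who were never contacted before. The sketch below runs on a small invented frame (`toy` and its numbers are made up for illustration); on the real data one would use `data` instead.

```python
import pandas as pd

# Toy stand-in for the real columns; the numbers are invented for illustration
toy = pd.DataFrame({
    'pdays':    [-1, -1, 92, -1, 180, 7],
    'previous': [ 0,  0,  2,  0,   1, 3],
})

# Rows: is pdays equal to -1?  Columns: is previous equal to 0?
table = pd.crosstab(toy['pdays'] == -1, toy['previous'] == 0)
print(table)

# True when the two conditions select exactly the same rows
same_rows = ((toy['pdays'] == -1) == (toy['previous'] == 0)).all()
print(same_rows)
```

If the off-diagonal cells of the table are zero on the real data, the two columns agree and -1 can be read as "no previous contact".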

**14. previous.** Number of contacts performed before this campaign and for this client.

In [None]:
data['previous'].min(), data['previous'].max(), data['previous'].mean()

**15. poutcome.** Outcome of the previous marketing campaign.

In [None]:
data['poutcome'].unique()

**16. deposit.** Has the client subscribed to a term deposit?

In [None]:
data['deposit'].unique()

*deposit* is the target variable. A term deposit is a deposit that a bank or financial institution offers at a fixed rate (often better than that of an ordinary deposit account) and whose money is returned at a specified maturity date. For more information about term deposits, see: [*Term Deposit: Definition, How It's Used, Rates, and How to Invest* by James Chen](https://www.investopedia.com/terms/t/termdeposit.asp).

Let's have a look at the distribution of the target variable classes ('yes' or 'no').

In [None]:
sns.histplot(data=data,x='deposit')
plt.title('Deposit Distribution',size=20)

plt.tight_layout()
plt.gcf().patch.set_facecolor('lightsteelblue')
plt.gca().set_facecolor('lemonchiffon')
plt.show()
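
Besides the plot, the class balance can be read off as numbers. The sketch below uses an invented label series (`deposit` here is a made-up stand-in for `data['deposit']`) and computes the share of each class plus a simple balance ratio, which is my own illustrative measure.

```python
import pandas as pd

# Invented labels standing in for data['deposit']
deposit = pd.Series(['yes', 'no', 'no', 'yes', 'no', 'yes', 'no', 'no'])

# Relative frequency of each class
shares = deposit.value_counts(normalize=True)
print(shares)

# Ratio between the rarest and the most common class (1.0 = perfectly balanced)
balance_ratio = shares.min() / shares.max()
print(round(balance_ratio, 2))
```

A ratio close to 1 suggests that plain accuracy is a reasonable metric; a small ratio would call for metrics such as recall or F1.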

# 3. Bank Client Segmentation
<a id="3"></a>

## 3.1 Basic Descriptive Statistics

Before I start my exploratory data analysis, I want to have a look at the descriptive statistics of numerical ...

In [None]:
data.describe()

... and also categorical variables.

In [None]:
data.describe(include='object')

It is worth noting that:
- The average customer's age is 41.
- The average balance is 1529$\$$. The dataset does not state the currency, so the dollar sign is my assumption. The minimum balance is -6847$\$$ (meaning that the customer is in debt) and the maximum is 81204$\$$.

## 3.2 Distribution of the Numerical Variables

In [None]:
features = ['age','balance','day','duration','campaign','pdays','previous']

# Plot the features four per figure; sns.kdeplot replaces the deprecated sns.distplot(hist=False)
for start in range(0,len(features),4):
    fig,axes = plt.subplots(ncols=4,figsize=(12,5))
    for ax,feature in zip(axes,features[start:start+4]):
        sns.kdeplot(x=data[feature],ax=ax)
        ax.set_title('Distribution of '+feature,fontsize=18)
    # Colour every panel, including the unused one in the second figure
    for ax in axes:
        ax.set_facecolor('lemonchiffon')

    fig.suptitle("Distribution of the Numerical Variables",fontsize=24)

    plt.tight_layout()
    fig.set_facecolor('lightsteelblue')

It is worth noting that:
- None of the distributions is bell-shaped.
- The *age* and *day* distributions are more spread out than the others. The first is (almost) single-peaked and right-skewed, while the second is multi-peaked (with 3 main peaks).
- The remaining numerical distributions have a high peak near the origin, meaning that large values are much less frequent than small ones.
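
The visual impression of skewness can be backed by a number. As a minimal sketch, pandas' `Series.skew()` returns the sample skewness: close to 0 for a symmetric sample and positive for a long right tail. The two series below are invented for illustration; on the real data one could call `data[features].skew()`.

```python
import pandas as pd

# Invented samples: one symmetric, one with a long right tail
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7])
right_tailed = pd.Series([0, 0, 1, 1, 2, 3, 40])

# Sample skewness: about 0 for symmetric data, positive for a right tail
skew_symmetric = symmetric.skew()
skew_right = right_tailed.skew()
print(skew_symmetric, skew_right)
```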

In [None]:
for i in range(2):
    fig,(ax1,ax2,ax3,ax4) = plt.subplots(ncols=4,figsize=(12,5))
    ax1 = sns.boxplot(data[features[i*4]],ax=ax1)
    ax1.set_title('Boxplot of '+str(features[i*4]),fontsize=20)
    ax1.set_facecolor('lemonchiffon')
    ax2 = sns.boxplot(data[features[i*4+1]],ax=ax2)
    ax2.set_title('Boxplot of '+str(features[i*4+1]),fontsize=20)
    ax2.set_facecolor('lemonchiffon')
    ax3 = sns.boxplot(data[features[i*4+2]],ax=ax3)
    ax3.set_title('Boxplot of '+str(features[i*4+2]),fontsize=20)
    ax3.set_facecolor('lemonchiffon')
    if i < 1:
        ax4 = sns.boxplot(data[features[i*4+3]],ax=ax4)
        ax4.set_title('Boxplot of '+str(features[i*4+3]),fontsize=20)
        ax4.set_facecolor('lemonchiffon')
    else:
        ax4.set_facecolor('lemonchiffon')

    fig.suptitle("Boxplots of the Outliers",fontsize=24)

    plt.tight_layout()
    fig.set_facecolor('lightsteelblue')

There are many outliers. Let's see what percentage of each column they make up.

In [None]:
outliers_perc = []

for k,v in data.items():
    # Column must be of numeric type (not object)
    if data[k].dtype != 'O':
        q1 = v.quantile(0.25)
        q3 = v.quantile(0.75)
        iqr = q3 - q1
        # Outliers lie strictly outside the fences [q1 - 1.5*IQR, q3 + 1.5*IQR]
        v_col = v[(v < q1 - 1.5 * iqr) | (v > q3 + 1.5 * iqr)]
        perc = len(v_col) * 100.0 / len(data)
        out_tuple = (k,int(perc))
        outliers_perc.append(out_tuple)
        print("Column %s outliers = %.2f%%" % (k,perc))
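
To make the IQR rule concrete, here is a small worked example on invented numbers. The helper `iqr_outlier_share` is a hypothetical name of mine, and the use of strict inequalities at the fences is an assumption of this sketch.

```python
import pandas as pd

def iqr_outlier_share(values: pd.Series, k: float = 1.5) -> float:
    """Percentage of values lying strictly outside the Tukey fences."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return 100.0 * outside.mean()

# Invented sample: nine ordinary values and one extreme one
sample = pd.Series([10, 11, 12, 12, 13, 13, 14, 15, 16, 100])
share = iqr_outlier_share(sample)
print(f'{share:.1f}% outliers')
```

Here Q1 = 12 and Q3 = 14.75, so the fences are 7.875 and 18.875, and only the value 100 falls outside them.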

## 3.3 Histograms of the Categorical Variables

In [None]:
features = ['job','marital','education','default','housing','loan','contact','month','poutcome','deposit']

for i in range(3):
    fig,(ax1,ax2,ax3,ax4) = plt.subplots(ncols=4,figsize=(12,5))
    ax1.hist(data[features[i*4]])
    if data[features[i*4]].nunique() > 5:
        ax1.set_xticklabels(ax1.get_xticklabels(),rotation=75,fontsize=8)
    ax1.set_title('Histogram of '+str(features[i*4]),fontsize=18)
    ax1.set_facecolor('lemonchiffon')
    ax2.hist(data[features[i*4+1]])
    ax2.set_title('Histogram of '+str(features[i*4+1]),fontsize=18)
    ax2.set_facecolor('lemonchiffon')
    if i < 2:
        ax3.hist(data[features[i*4+2]])
        ax3.set_title('Histogram of '+str(features[i*4+2]),fontsize=18)
        ax3.set_facecolor('lemonchiffon')
        ax4.hist(data[features[i*4+3]])
        ax4.set_title('Histogram of '+str(features[i*4+3]),fontsize=18)
        ax4.set_facecolor('lemonchiffon')
        if data[features[i*4+3]].nunique() > 5:
            ax4.set_xticklabels(ax4.get_xticklabels(),rotation=75,fontsize=8)
    else:
        ax3.set_facecolor('lemonchiffon')
        ax4.set_facecolor('lemonchiffon')

    fig.suptitle("Histograms of the Categorical Variables",fontsize=24)

    plt.tight_layout()
    fig.set_facecolor('lightsteelblue')

It is worth noting that:
- The majority of clients work as administrators, in services, or in blue-collar jobs.
- The majority of clients are married.
- They mainly communicate with the marketing staff by mobile phone (*cellular*).
- The majority of them have either secondary (high school) or tertiary (college) education.
- The large majority of clients have no credit in default and no personal loan.

## 3.4 Age vs Other Features

In [None]:
fig,(ax1,ax2,ax3) = plt.subplots(ncols=3,figsize=(12,5))

# sns.kdeplot replaces the deprecated sns.distplot(hist=False)
ax1 = sns.kdeplot(x=data[data['education'] == 'primary']['age'],ax=ax1,linewidth=2)
ax1 = sns.kdeplot(x=data[data['education'] == 'secondary']['age'],ax=ax1,linewidth=3)
ax1 = sns.kdeplot(x=data[data['education'] == 'tertiary']['age'],ax=ax1,linestyle='--',linewidth=2)
ax1.set_title('Age vs Education Level',fontsize=18)
ax1.legend(labels=['primary','secondary','tertiary'])
ax1.set_facecolor('lemonchiffon')

ax2 = sns.kdeplot(x=data[data['marital'] == 'single']['age'],ax=ax2,linewidth=2)
ax2 = sns.kdeplot(x=data[data['marital'] == 'married']['age'],ax=ax2,linewidth=3)
ax2 = sns.kdeplot(x=data[data['marital'] == 'divorced']['age'],ax=ax2,linestyle='--',linewidth=2)
ax2.set_title('Age vs Marital Status',fontsize=18)
ax2.legend(labels=['single','married','divorced'])
ax2.set_facecolor('lemonchiffon')

ax3 = sns.boxplot(data,x='age',y='job',ax=ax3)
ax3.set_title("Job vs Age",fontsize=20)
ax3.set_facecolor('lemonchiffon')

fig.suptitle("Age vs Education, Marital Status and Job",fontsize=24)

plt.tight_layout()
fig.set_facecolor('lightsteelblue')

It is worth noting that:
- Younger clients generally have either secondary or tertiary education, while older clients generally have primary education. This means that younger clients are, on average, more educated than older ones.
- As expected, single clients are generally the youngest. The distributions of married and divorced clients are not very different, even though divorcees tend to be a little older.
- Finally, students are usually the youngest clients and retirees the oldest.

Below I am creating age clusters.

In [None]:
data.loc[data['age'] <= 25, 'age cluster'] = 'age <= 25'
data.loc[(data['age'] > 25) & (data['age'] <= 35), 'age cluster'] = '25 < age <= 35'
data.loc[(data['age'] > 35) & (data['age'] <= 45), 'age cluster'] = '35 < age <= 45'
data.loc[(data['age'] > 45) & (data['age'] <= 55), 'age cluster'] = '45 < age <= 55'
data.loc[(data['age'] > 55) & (data['age'] <= 65), 'age cluster'] = '55 < age <= 65'
data.loc[data['age'] > 65, 'age cluster'] = 'age > 65'

# Grouping by age
age_balance_groups = data.groupby('age cluster',as_index=False)['balance'].median().sort_values(by=['balance'],ascending=False)
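
The same binning can be written more compactly with `pd.cut`. This is a sketch of an alternative, not the approach used in the cell above; the `ages` series is invented, and the labels are copied from the cluster names used in this notebook.

```python
import pandas as pd

# Invented ages standing in for data['age']
ages = pd.Series([18, 25, 26, 35, 45, 55, 56, 65, 66, 95])

# Right-closed bins: (-inf, 25], (25, 35], ..., (65, inf)
edges = [-float('inf'), 25, 35, 45, 55, 65, float('inf')]
labels = ['age <= 25', '25 < age <= 35', '35 < age <= 45',
          '45 < age <= 55', '55 < age <= 65', 'age > 65']

clusters = pd.cut(ages, bins=edges, labels=labels, right=True)
print(clusters.tolist())
```

`pd.cut` returns an ordered categorical, so plots and group-bys can keep the clusters in age order instead of alphabetical order.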

I am also grouping the data to make some plots.

In [None]:
default_yes = data[data['default'] == 'yes'].groupby('age cluster')['default'].value_counts().reset_index(name='count yes')

default_no = data[data['default'] == 'no'].groupby('age cluster')['default'].value_counts().reset_index(name='count no')

default_yes_perc = default_yes.merge(default_no,on='age cluster')

default_yes_perc['default percentage'] = (default_yes_perc['count yes'] / (default_yes_perc['count yes'] + default_yes_perc['count no'])) * 100

default_yes_perc.drop(['default_x','default_y','count yes','count no'],axis=1,inplace=True)

default_yes_perc = default_yes_perc.sort_values(by='default percentage',ascending=False)

In [None]:
loan_yes = data[data['loan'] == 'yes'].groupby('age cluster')['loan'].value_counts().reset_index(name='count yes')

loan_no = data[data['loan'] == 'no'].groupby('age cluster')['loan'].value_counts().reset_index(name='count no')

loan_yes_perc = loan_yes.merge(loan_no,on='age cluster')

loan_yes_perc['loan percentage'] = loan_yes_perc['count yes'] / (loan_yes_perc['count yes'] + loan_yes_perc['count no']) * 100

loan_yes_perc.drop(['loan_x','loan_y','count yes','count no'],axis=1,inplace=True)

loan_yes_perc = loan_yes_perc.sort_values(by='loan percentage',ascending=False)
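
A shorter route to the same percentages relies on the fact that the mean of a boolean column is the share of `True` values. The sketch below uses an invented frame (`toy`), so the numbers are illustrative only.

```python
import pandas as pd

# Invented rows standing in for the real 'age cluster' and 'loan' columns
toy = pd.DataFrame({
    'age cluster': ['age <= 25', 'age <= 25', 'age > 65', 'age > 65', 'age > 65', '25 < age <= 35'],
    'loan':        ['yes',       'no',        'no',       'no',       'no',       'yes'],
})

# The mean of a boolean column is the share of True values
loan_rate = (
    toy['loan'].eq('yes')
    .groupby(toy['age cluster'])
    .mean()
    .mul(100)
    .rename('loan percentage')
    .sort_values(ascending=False)
    .reset_index()
)
print(loan_rate)
```

One difference from the merge-based version: the default inner merge drops any group that has no 'yes' rows (or no 'no' rows), whereas this version keeps such groups at 0% or 100%.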

In [None]:
fig,(ax1,ax2,ax3) = plt.subplots(ncols=3,figsize=(12,7))

ax1 = sns.histplot(data=data,x='housing',multiple='stack',hue='age cluster',ax=ax1)
ax1.set_title('Housing (Count) vs Age Clusters',fontsize=13)
ax1.set_facecolor('lemonchiffon')

ax2 = sns.barplot(data=default_yes_perc,x='age cluster',y='default percentage',ax=ax2)
ax2.set_xticklabels(ax2.get_xticklabels(),rotation=45,fontsize=8)
ax2.set_title('Default (Yes Percentage) vs Age Clusters',fontsize=13)
ax2.set_facecolor('lemonchiffon')

ax3 = sns.barplot(data=loan_yes_perc,x='age cluster',y='loan percentage',ax=ax3)
ax3.set_xticklabels(ax3.get_xticklabels(),rotation=45,fontsize=8)
ax3.set_title('Loan (Yes Percentage) vs Age Clusters',fontsize=13)
ax3.set_facecolor('lemonchiffon')

fig.suptitle("Housing, Default and Loan vs Age Clusters",fontsize=24)

plt.tight_layout()
fig.set_facecolor('lightsteelblue')

It is worth noting that:
- A slight majority of clients do not have a housing loan. The split between clients with and without a housing loan varies with age group: people older than 65 generally do not have one, while clients in the other age groups (e.g. those younger than 25) generally do.
- The percentage of clients with credit in default is relatively small (less than 2%). It varies considerably across age groups: clients between 35 and 45 years old have the highest default percentage, and clients older than 65 the lowest.
- The percentage of clients with a personal loan also varies considerably with age group. Almost 16% of clients with 45 < age <= 55 have a loan, while less than 1% of clients older than 65 have one.

## 3.5 Marital Status + Education Clusters vs Other Features

Below I am creating marital status + education clusters.

In [None]:
### Creating a new column 'marital_edu' ###
# Only the nine combinations below get a label; any other value
# (e.g. an unrecognised education level) is left as NaN
known = (data['marital'].isin(['single','married','divorced'])
         & data['education'].isin(['primary','secondary','tertiary']))
data['marital_edu'] = (data['marital'] + '+' + data['education']).where(known)

### Grouping by ###
marital_edu_groups = data.groupby('marital_edu',as_index=False)['balance'].median().sort_values(by=['balance'],ascending=False)

I am also grouping the data to make some plots.

In [None]:
default_yes = data[data['default'] == 'yes'].groupby('marital_edu')['default'].value_counts().reset_index(name='count yes')

default_no = data[data['default'] == 'no'].groupby('marital_edu')['default'].value_counts().reset_index(name='count no')

default_yes_perc = default_yes.merge(default_no,on='marital_edu')

default_yes_perc['default percentage'] = (default_yes_perc['count yes'] / (default_yes_perc['count yes'] + default_yes_perc['count no'])) * 100

default_yes_perc.drop(['default_x','default_y','count yes','count no'],axis=1,inplace=True)

default_yes_perc = default_yes_perc.sort_values(by='default percentage',ascending=False)

In [None]:
loan_yes = data[data['loan'] == 'yes'].groupby('marital_edu')['loan'].value_counts().reset_index(name='count yes')

loan_no = data[data['loan'] == 'no'].groupby('marital_edu')['loan'].value_counts().reset_index(name='count no')

loan_yes_perc = loan_yes.merge(loan_no,on='marital_edu')

loan_yes_perc['loan percentage'] = loan_yes_perc['count yes'] / (loan_yes_perc['count yes'] + loan_yes_perc['count no']) * 100

loan_yes_perc.drop(['loan_x','loan_y','count yes','count no'],axis=1,inplace=True)

loan_yes_perc = loan_yes_perc.sort_values(by='loan percentage',ascending=False)

In [None]:
fig,(ax1,ax2,ax3) = plt.subplots(ncols=3,figsize=(12,8))

ax1 = sns.histplot(data=data,x='housing',multiple='stack',hue='marital_edu',ax=ax1)
ax1.set_title('Housing (Count) vs Marital+Education',fontsize=12)
ax1.set_facecolor('lemonchiffon')

ax2 = sns.barplot(data=default_yes_perc,x='marital_edu',y='default percentage',ax=ax2)
ax2.set_xticklabels(ax2.get_xticklabels(),rotation=60,fontsize=8)
ax2.set_title('Default (Yes Percentage) vs Marital+Education',fontsize=12)
ax2.set_facecolor('lemonchiffon')

ax3 = sns.barplot(data=loan_yes_perc,x='marital_edu',y='loan percentage',ax=ax3)
ax3.set_xticklabels(ax3.get_xticklabels(),rotation=60,fontsize=8)
ax3.set_title('Loan (Yes Percentage) vs Marital+Education',fontsize=12)
ax3.set_facecolor('lemonchiffon')

fig.suptitle("Housing, Default and Loan vs Marital Status + Education",fontsize=24)

plt.tight_layout()
fig.set_facecolor('lightsteelblue')

It is worth noting that:
- Some marital + education categories are less likely to have a housing loan: married + tertiary and single + tertiary. Married + secondary clients are more likely to have one.
- The percentage of clients with credit in default is relatively small (less than 3%). It varies considerably across the marital + education groups: divorcees with primary education have the highest default percentage, and singles with tertiary education the lowest.
- The percentage of clients with a personal loan varies considerably across the marital + education groups. Almost 20% of divorcees with secondary education have a loan, while less than 7% of singles with tertiary education have one.

## 3.6 Balance vs Categorical Features

Here, I am having a look at the clients' balance depending on their age and marital + education groups and also on their job.

In [None]:
job_groups = data.groupby('job',as_index=False)['balance'].median().sort_values(by=['balance'],ascending=False)

### Creating the plot ###
fig,(ax1,ax2,ax3) = plt.subplots(ncols=3,figsize=(14,6))

ax1 = sns.barplot(data=marital_edu_groups,x='marital_edu',y='balance',ax=ax1)
ax1.set_xticklabels(ax1.get_xticklabels(),rotation=45,fontsize=7)
ax1.set_xlabel('marital status + education')
ax1.set_ylabel('balance (median)')
ax1.set_title('Balance vs Marital Status + Education',fontsize=14)
ax1.set_facecolor('lemonchiffon')

ax2 = sns.barplot(data=job_groups,x='job',y='balance',ax=ax2)
ax2.set_xticklabels(ax2.get_xticklabels(),rotation=45,fontsize=8)
ax2.set_ylabel('balance (median)')
ax2.set_title("Balance vs Job",fontsize=20)
ax2.set_facecolor('lemonchiffon')

ax3 = sns.barplot(data=age_balance_groups,x='age cluster',y='balance',ax=ax3)
ax3.set_xticklabels(ax3.get_xticklabels(),rotation=45,fontsize=8)
ax3.set_ylabel('balance (median)')
ax3.set_title("Balance vs Age Cluster",fontsize=20)
ax3.set_facecolor('lemonchiffon')

fig.suptitle("Balance vs Job, Marital Status + Education and Age Group",fontsize=24)

plt.tight_layout()
fig.set_facecolor('lightsteelblue')

It is worth noting that:
- Singles with tertiary education have the highest (median) balance; divorcees with secondary education have the lowest. This mirrors the default and loan plots, where singles with tertiary education had the lowest default and loan percentages and divorcees with lower education had the highest.
- Retirees have the highest (median) balance. Surprisingly, unemployed clients have the second highest (median) balance.
- In terms of age group, people older than 65 (the retirees) have the highest (median) balance, clients between 55 and 65 years the second highest, and so on.
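
The plots in this section use the median rather than the mean. A small sketch on invented balances shows why that is a sensible choice for a heavily skewed variable: a single very large value moves the group mean a lot but barely affects the median.

```python
import pandas as pd

# Invented balances: one very large value in the 'retired' group
toy = pd.DataFrame({
    'job':     ['retired', 'retired', 'retired', 'student', 'student', 'student'],
    'balance': [500,       700,       60000,     300,       400,       500],
})

# Mean and median side by side for each job
summary = toy.groupby('job')['balance'].agg(['mean', 'median'])
print(summary)
```

In this toy frame the 'retired' mean is 20400 while the median stays at 700; for 'student', where there is no extreme value, both are 400.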

## 3.7 Conclusions of the Client Segmentation

- **Age and Education.** The average customer's age is 41. Younger clients generally have either secondary or tertiary education, while older clients generally have primary education. Single clients are generally the youngest.
- **Housing Loan.** A slight majority of clients do not have a housing loan. The fraction of clients with a housing loan varies with age group: people above 65 generally do not have one, while clients in the other age groups (e.g. those younger than 25) generally do.
Married clients and singles, both with tertiary education, are less likely to have a housing loan. Married clients with secondary education are more likely to have one.
- **Default Credit.** The percentage of clients with credit in default is less than 3%. It varies considerably across age groups (clients between 35 and 45 years old have the highest default percentage, clients older than 65 the lowest) and across marital status + education groups (divorcees with primary education have the highest percentage, singles with tertiary education the lowest).
- **Loan.** The percentage of clients with a personal loan varies considerably with both age group and marital + education group. Almost 16% of clients with 45 < age <= 55 have a loan, while less than 1% of clients older than 65 have one.
Almost 20% of divorcees with secondary education have a loan, while less than 7% of singles with tertiary education have one.
- **Balance.** The average client's balance is 1529$\$$. Singles with tertiary education have the highest (median) balance; divorcees with secondary education have the lowest.
In terms of age, clients older than 65 have the highest (median) balance, clients between 55 and 65 the second highest.



# 4. Binary Classification
<a id="4"></a>

Here, I am performing a binary classification of the clients, with *deposit* as the (binary) target variable. The definition of *deposit* is: "Has the client subscribed to a term deposit?".

## 4.1 Outliers Capping

Outliers are data points that stand out significantly from the rest of the data. They can be extremely high or low values compared to the other observations and can be caused by measurement errors, natural variations in the data, or even unexpected discoveries.
Outliers in a dataset can lower the predictive performance of classification or regression models, so they need to be dealt with.

First, I am capping the outliers ...

In [None]:
def outlier_imputer(data,features):

    data_out = data.copy()

    for column in features:

        # First define the first and third quartiles
        Q1 = (data_out[column].quantile(0.25)).astype(int)
        Q3 = (data_out[column].quantile(0.75)).astype(int)
        # Define the inter-quartile range
        IQR = Q3 - Q1
        # ... and the lower/higher threshold values
        lowerL = (Q1 - 1.5 * IQR).astype(int)
        higherL = (Q3 + 1.5 * IQR).astype(int)

        # Impute 'left' outliers
        data_out.loc[data_out[column] < lowerL,column] = lowerL
        # Impute 'right' outliers
        data_out.loc[data_out[column] > higherL,column] = higherL

    return data_out

features = ['age','balance','day','duration','campaign','pdays','previous']

capped_data = outlier_imputer(data,features)
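
For comparison, the same capping idea can be expressed with `Series.clip`. This is a sketch of an alternative, not a drop-in copy of `outlier_imputer`: the helper name `cap_with_clip` is mine, the frame is invented, and the thresholds are kept as floats instead of being truncated to integers.

```python
import pandas as pd

def cap_with_clip(frame: pd.DataFrame, columns: list, k: float = 1.5) -> pd.DataFrame:
    """Return a copy of `frame` with each listed column clipped to its Tukey fences."""
    capped = frame.copy()
    for column in columns:
        q1, q3 = capped[column].quantile([0.25, 0.75])
        iqr = q3 - q1
        capped[column] = capped[column].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return capped

# Invented durations with one extreme call
toy = pd.DataFrame({'duration': [100, 120, 130, 150, 160, 180, 200, 3000]})
toy_capped = cap_with_clip(toy, ['duration'])
print(toy_capped['duration'].max())
```

Here Q1 = 127.5 and Q3 = 185, so the upper fence is 271.25 and the 3000-second call is pulled down to that value while the other rows are unchanged.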

... then, I am checking that the capping procedure was successful: there should be no outliers left in the boxplots.

In [None]:
for i in range(2):
    fig,(ax1,ax2,ax3,ax4) = plt.subplots(ncols=4,figsize=(12,5))
    ax1 = sns.boxplot(capped_data[features[i*4]],ax=ax1)
    ax1.set_title('Boxplot of '+str(features[i*4]),fontsize=20)
    ax1.set_facecolor('lemonchiffon')
    ax2 = sns.boxplot(capped_data[features[i*4+1]],ax=ax2)
    ax2.set_title('Boxplot of '+str(features[i*4+1]),fontsize=20)
    ax2.set_facecolor('lemonchiffon')
    ax3 = sns.boxplot(capped_data[features[i*4+2]],ax=ax3)
    ax3.set_title('Boxplot of '+str(features[i*4+2]),fontsize=20)
    ax3.set_facecolor('lemonchiffon')
    if i < 1:
        ax4 = sns.boxplot(capped_data[features[i*4+3]],ax=ax4)
        ax4.set_title('Boxplot of '+str(features[i*4+3]),fontsize=20)
        ax4.set_facecolor('lemonchiffon')
    else:
        ax4.set_facecolor('lemonchiffon')

    fig.suptitle("Boxplots After Outlier Capping",fontsize=24)

    plt.tight_layout()
    fig.set_facecolor('lightsteelblue')

I am also plotting one of the distributions (*duration*) before and after the outlier imputation.

In [None]:
fig,(ax1,ax2) = plt.subplots(ncols=2,figsize=(12,6))

# sns.kdeplot replaces the deprecated sns.distplot(hist=False)
ax1 = sns.kdeplot(x=data['duration'],ax=ax1)
ax1.set_title('Distribution of Duration Before Imputation',fontsize=17)
ax1.set_facecolor('lemonchiffon')

ax2 = sns.kdeplot(x=capped_data['duration'],ax=ax2)
ax2.set_title('Distribution of Duration After Imputation',fontsize=17)
ax2.set_facecolor('lemonchiffon')

fig.suptitle("Distribution of Duration Before/After Imputation",fontsize=24)

plt.tight_layout()
fig.set_facecolor('lightsteelblue')

One can notice that, after outlier imputation, the x-range has shrunk from roughly [-1000, 4000] to roughly [0, 1200]. Moreover, the capped values have accumulated on the right-hand side of the distribution and formed a second peak.

## 4.2 Feature Importance and Correlation Heatmap

I want to have a look at the feature importance and correlation between the variables.

**Feature Importance**

Before running the calculation, let's have a look at the variables.

In [None]:
capped_data.head()

Some of these variables are not necessary anymore (they were only used for the EDA). For example, I cannot keep both 'age' and 'age cluster'; the same holds for 'marital' and 'education' vs 'marital_edu'. This is why I will have to drop some of them.

In [None]:
capped_data.drop(['education','marital','age cluster'],axis=1,inplace=True)

Now, I can look at the relative importance of the features by means of a random forest classifier with a max depth ('max_depth') of 100. One can check that for values of 'max_depth' larger than 20, the results for the feature importance are substantially the same.

**Please Note**
1. scikit-learn's random forest needs numerical input, so the categorical variables have to be encoded first.
2. Even though (at a later stage) I will perform the binary classification with one-hot-encoded categorical columns, here I am using label encoding because (for this task) I do not want to split each categorical column into one column per category.
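
The difference between the two encodings is easy to see on a tiny invented column. In this sketch I use pandas category codes as a stand-in for `LabelEncoder` (both number the categories in sorted order) and `pd.get_dummies` as a stand-in for one-hot encoding.

```python
import pandas as pd

# Invented values standing in for a categorical column such as data['contact']
toy = pd.DataFrame({'contact': ['cellular', 'telephone', 'unknown', 'cellular']})

# Label encoding: one integer column (categories numbered in sorted order)
label_encoded = toy['contact'].astype('category').cat.codes
print(label_encoded.tolist())

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy['contact'], prefix='contact')
print(one_hot.columns.tolist())
```

With label encoding each original column keeps a single importance score, whereas one-hot encoding would spread it over several indicator columns.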

In [None]:
data_feature = capped_data.copy()

## Label encoding ##
LABELS  = data_feature.columns
encoder = LabelEncoder()

for col in LABELS:
    # Check if object
    if data_feature[col].dtype == 'O':
        # Fit label encoder and return encoded labels
        data_feature[col] = encoder.fit_transform(data_feature[col])

X = data_feature.drop('deposit',axis=1)
y = data_feature['deposit']

# Random Forest Model
random_forest = RandomForestClassifier(random_state=1,max_depth=100)
random_forest.fit(X,y)

importances = pd.DataFrame({'feature':X.columns,'importance':np.round(random_forest.feature_importances_,3)})
importances = importances.sort_values('importance',ascending=False)

importances

In [None]:
plt.figure(figsize=(12,6))

sns.barplot(importances[importances['importance'] > 0.025],x='feature',y='importance')

plt.title('Feature Importances > 0.025',fontsize=25)
plt.xlabel('feature',fontsize=15)
plt.xticks(fontsize=8,rotation=45)
plt.ylabel('relative importance',fontsize=15)

plt.tight_layout()
plt.gcf().patch.set_facecolor('lightsteelblue')
plt.gca().set_facecolor('lemonchiffon')
plt.show()

One can notice that:
* 'duration' is expected to be (by far) the most important predictor.
* Then we have second-tier features (like 'balance', 'month', 'day' and 'age') and third-tier ones (like 'contact' and 'poutcome').
* The features not displayed in this plot (like 'default' or 'loan') are not expected to be so important in predicting the outcome of the marketing campaign.

**Correlation heatmap**

In [None]:
plt.figure(figsize=(15,10))

sns.heatmap(X.corr(method='pearson'),vmin=-1,vmax=1,annot=True,cmap='coolwarm')
plt.title('Correlation heatmap',fontsize=25)

plt.tight_layout()
plt.gcf().patch.set_facecolor('lightsteelblue')
plt.show()

Some variables are strongly correlated with one another: 'pdays' vs 'previous', 'pdays' vs 'poutcome', 'previous' vs 'poutcome'. At least one pair ('pdays' vs 'previous'), with a Pearson coefficient > 0.9, should be dealt with.

This is why I am dropping 'pdays' and 'previous'. 'poutcome' has a higher feature importance compared to theirs.
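
The pairs above were read off the heatmap. As a cross-check, they can also be listed programmatically. The helper below is a hypothetical sketch (its name and the synthetic data are not part of the original analysis); it returns every pair of columns whose absolute Pearson coefficient exceeds a threshold.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, r) for every pair with |Pearson r| above threshold."""
    corr = df.corr(method='pearson')
    cols = corr.columns
    pairs = []
    # Only the upper triangle is scanned, so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Synthetic data: 'b' is almost a copy of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'b': a + rng.normal(scale=0.1, size=200),
                    'c': rng.normal(size=200)})

pairs = highly_correlated_pairs(toy)
print(pairs)
```

On the label-encoded features of this notebook, the same helper could be called as `highly_correlated_pairs(X)`.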

In [None]:
capped_data.drop(['pdays','previous'],axis=1,inplace=True)

data_feature.drop(['pdays','previous'],axis=1,inplace=True)

capped_data.columns

**Pairplot**

Let's also plot a pairplot in order to display the relation between the variables. We will also use it to find out whether there is multi-collinearity between them.

Multi-collinearity is a state of the dataset in which two or more independent variables are highly correlated with one another. This is a problem because it can distort the outcomes of machine learning models and compromise the reliability of the results. In this respect, logistic regression requires little or no multi-collinearity among the independent variables.

In [None]:
sns.set_style("whitegrid")
# 'size' was renamed to 'height' in seaborn
sns.pairplot(data_feature,height=3,corner=True)

plt.gcf().patch.set_facecolor('lightsteelblue')
plt.show()

We can see no sign of multi-collinearity.
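
A pairplot only shows pairwise relations. A complementary numerical check, which is not part of the original analysis, is the variance inflation factor (VIF): each feature is regressed on all the others and VIF = 1 / (1 - R²). The sketch below is a NumPy-only implementation on synthetic data; values well above a rule-of-thumb threshold (often quoted as 5 or 10) hint at multi-collinearity.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all the other columns (plus an intercept)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = []
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_samples), others])
        # Ordinary least squares fit of column j on the remaining columns
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs

# Synthetic features (illustrative only)
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)   # nearly collinear with x1
x3 = rng.normal(size=500)                   # independent of the others

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs)
```

In this toy case the two nearly collinear columns get a large VIF, while the independent one stays close to 1.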

## 4.3 Label Encoding and Scaling

**Defining X and y, train-test splitting**

In [None]:
X = capped_data.drop('deposit',axis=1)

y = capped_data['deposit']

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,stratify=y,random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

**Label encoding and scaling**

First, I am encoding the target variable.

In [None]:
encoder = LabelEncoder()

y_train = encoder.fit_transform(y_train)
y_test  = encoder.transform(y_test)

y_train

Then, I am checking the cardinality of the categorical variables in X_train (and X_test).

In [None]:
cat_cols = [col for col in X_train.columns if X_train[col].dtypes == 'O']

bin_cols = []
multi_cols = []

for col in cat_cols:
    print(f'feature = {col}; cardinality = {X_train[col].nunique()}')
    if X_train[col].nunique() <= 2:
        bin_cols.append(col)
    else:
        multi_cols.append(col)

print()
print(f'binary columns: {bin_cols}')
print()
print(f'multi-class columns: {multi_cols}')

I am label-encoding the binary features.

In [None]:
lb_encoder = LabelEncoder()

for col in bin_cols:
    X_train[col] = lb_encoder.fit_transform(X_train[col])
    X_test[col]  = lb_encoder.transform(X_test[col])

Before one-hot encoding the multi-class categorical variables, I want to make sure that none of their classes is rare, i.e. that every class has a relative frequency of at least 0.05 in the training set. Classes below this threshold are relabelled as 'other'.

In [None]:
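def remove_005(train,test,column):
    '''Relabel as 'other' the classes of `column` whose relative frequency
    in the training set is below 0.05 (train and test are modified in place).'''

    # Class proportions are computed on the training set only
    props = train[column].value_counts(normalize=True)
    le05_list = props[props < 0.05].index.to_list()

    train.loc[train[column].isin(le05_list),column] = 'other'
    test.loc[test[column].isin(le05_list),column] = 'other'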


for col in multi_cols:
    remove_005(X_train,X_test,col)

for col in multi_cols:
    print(X_train[col].value_counts(normalize=True).reset_index())
    print()

Before moving on, I want to check again if there are columns with missing values. These nulls may have been created from the operations that I carried out in the previous stages.

In [None]:
list_nulls = X_train.columns[X_train.isnull().any()].tolist()

list_nulls

I am imputing the null values in the 'marital_edu' column with 'other'.

In [None]:
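# Assign the result back instead of using inplace=True on a column selection
X_train['marital_edu'] = X_train['marital_edu'].fillna('other')
X_test['marital_edu']  = X_test['marital_edu'].fillna('other')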

X_train.columns[X_train.isnull().any()].tolist()

Now, I can one-hot-encode the multi-class categorical variables in X_train and X_test ...

In [None]:
### One-hot encoding ###
oh_encoder = OneHotEncoder(sparse_output=False,handle_unknown='ignore').set_output(transform="pandas")

# Fit and transform the categorical columns
OHE_train = pd.DataFrame(oh_encoder.fit_transform(X_train[multi_cols]))
OHE_test  = pd.DataFrame(oh_encoder.transform(X_test[multi_cols]))

# One-hot encoding removed index; put it back
OHE_train.index = X_train.index
OHE_test.index  = X_test.index

# Remove categorical columns (will replace with one-hot encoding)
num_train = X_train.drop(multi_cols,axis=1)
num_test  = X_test.drop(multi_cols,axis=1)

# Add one-hot encoded columns to numerical features
OHE_X_train = pd.concat([OHE_train,num_train],axis=1)
OHE_X_test  = pd.concat([OHE_test,num_test],axis=1)

... and then scale the numerical variables.

In [None]:
print(OHE_X_train.shape)
OHE_X_train.head()

In [None]:
# List of numerical features
num_features = [col for col in OHE_X_train.columns if OHE_X_train[col].dtypes != 'O']

# Instantiate the scaler
scaler = MinMaxScaler()

# Scaling the numerical columns
OHE_X_train[num_features] = scaler.fit_transform(OHE_X_train[num_features])
OHE_X_test[num_features]  = scaler.transform(OHE_X_test[num_features])

OHE_X_train.head()
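
`MinMaxScaler` rescales each feature as (x - min) / (max - min). Since the scaler is fitted on the training set only, the minimum and maximum come from the training data, and test values outside that range are mapped outside [0, 1]. A minimal sketch of the arithmetic with made-up numbers:

```python
import numpy as np

# Illustrative values for one numerical feature
train = np.array([20.0, 35.0, 50.0, 80.0])
test = np.array([10.0, 50.0, 95.0])

# "Fit": the minimum and maximum come from the training data only
train_min, train_max = train.min(), train.max()

def min_max(x):
    """Apply the training-set min-max transformation to x."""
    return (x - train_min) / (train_max - train_min)

train_scaled = min_max(train)
test_scaled = min_max(test)

print(train_scaled)
print(test_scaled)
```

The training values span exactly [0, 1], whereas the smallest and largest test values fall slightly outside this interval because they lie beyond the training range.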

## 4.4 Comparing the Performance of Four Classifiers

I want to compare the performance of four classifiers to decide which one is best suited to this problem. To do this, I am calculating the train and test accuracies (and recalls) of each classifier when it is trained on 1%, 10%, 25%, 50%, 75% and 100% of the training data.

In [None]:
def train_predict(learner, sample_size, X_train, y_train, X_test, y_test):
    '''
    inputs:
       - learner: the learning algorithm to be trained and predicted on
       - sample_size: the number of samples to be drawn from the training set
       - X_train: features training set
       - y_train: target training set
       - X_test: features testing set
       - y_test: target testing set

    returns:
       - results: dict with the training/prediction times and the accuracy and recall scores
    '''

    results = {}

    # Fit the learner to the training data using slicing with 'sample_size'
    start = time() # Get start time
    learner = learner.fit(X_train[:sample_size],y_train[:sample_size])
    end = time() # Get end time

    # Calculate the training time
    results['train_time'] = end - start

    #  Get the predictions on the test set,
    #  then get predictions on the first 300 training samples
    start = time() # Get start time
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    end = time() # Get end time

    # Calculate the total prediction time
    results['pred_time'] = end - start

    # Compute accuracy score on the first 300 training samples
    results['accuracy_train'] = accuracy_score(y_train[:300],predictions_train)

    # Compute accuracy score on test set
    results['accuracy_test'] = accuracy_score(y_test,predictions_test)

    # Compute recall score on the first 300 training samples
    results['recall_train'] = recall_score(y_train[:300],predictions_train,average='macro')

    # Compute recall score on test set
    results['recall_test'] = recall_score(y_test,predictions_test,average='macro')

    # Success
    print("{} trained on {} samples.".format(learner.__class__.__name__,sample_size))

    # Return the results
    return results

In [None]:
# Initialize the four models
clf_A = GradientBoostingClassifier(random_state=42)
# scikit-learn >= 1.2 calls this argument 'estimator' (formerly 'base_estimator')
clf_B = AdaBoostClassifier(estimator=DecisionTreeClassifier(random_state=42))
clf_C = RandomForestClassifier(random_state=42)
clf_D = XGBClassifier(random_state=42)

# Calculate the number of samples for 1%, 10%, 25%, 50%, 75% and 100% of the training data
samples_1   = int(round(len(OHE_X_train) / 100))
samples_10  = int(round(len(OHE_X_train) / 10))
samples_25  = int(round(len(OHE_X_train) / 4))
samples_50  = int(round(len(OHE_X_train) / 2))
samples_75  = int(round(len(OHE_X_train) * 0.75))
samples_100 = len(OHE_X_train)

# Collect results on the learners
results = {}
for clf in [clf_A,clf_B,clf_C,clf_D]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i,samples in enumerate([samples_1,samples_10,samples_25,samples_50,samples_75,samples_100]):
        results[clf_name][i] = \
        train_predict(clf,samples,OHE_X_train,y_train,OHE_X_test,y_test)

In [None]:
# Printing out the values
for i in results.items():
    print(i[0])
    display(pd.DataFrame(i[1]).rename(columns={0:'1%',1:'10%',2:'25%',3:'50%',4:'75%',5:'100%'}))

In [None]:
test_results = [['GradientBoostingClassifier',results['GradientBoostingClassifier'][5]['accuracy_test'],results['GradientBoostingClassifier'][5]['recall_test']],
                ['AdaBoostClassifier',results['AdaBoostClassifier'][5]['accuracy_test'],results['AdaBoostClassifier'][5]['recall_test']],
                ['RandomForestClassifier',results['RandomForestClassifier'][5]['accuracy_test'],results['RandomForestClassifier'][5]['recall_test']],
                ['XGBClassifier',results['XGBClassifier'][5]['accuracy_test'],results['XGBClassifier'][5]['recall_test']]]

df_test_results = pd.DataFrame(test_results,columns=['Classifier','Test_Accuracy','Test_Recall'])

fig,(ax1,ax2) = plt.subplots(ncols=2,figsize=(12,6))

ax1 = sns.barplot(df_test_results,x='Classifier',y='Test_Accuracy',ax=ax1)
ax1.set_xticklabels(ax1.get_xticklabels(),fontsize=8)
ax1.axhline(y=results['RandomForestClassifier'][5]['accuracy_test'],color='black',linestyle='--')
ax1.text(0,0.865,'RandomForestClassifier',fontsize=10)
ax1.set_title('Accuracy Score on the Test Data',size=20)
ax1.set_facecolor('lemonchiffon')

ax2 = sns.barplot(df_test_results,x='Classifier',y='Test_Recall',ax=ax2)
ax2.set_xticklabels(ax2.get_xticklabels(),fontsize=8)
ax2.axhline(y=results['RandomForestClassifier'][5]['recall_test'],color='black',linestyle='--')
ax2.text(0,0.865,'RandomForestClassifier',fontsize=10)
ax2.set_title('Recall Score on the Test Data',size=20)
ax2.set_facecolor('lemonchiffon')

fig.suptitle('Scores on the Test Data of Different Models',size=30)

fig.set_facecolor('lightsteelblue')
plt.tight_layout()

The above plots show that the RandomForestClassifier performed slightly better than the other algorithms on the test data.
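
Accuracy and macro-averaged recall, the two scores compared above, can be worked out by hand on a small example. The labels below are made up for illustration; macro recall is the unweighted mean of the recall of each class.

```python
# Illustrative labels: 1 = term deposit, 0 = no term deposit
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

def recall_for(label, y_true, y_pred):
    """Fraction of the samples of class `label` that were predicted as `label`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)

# Accuracy: fraction of all samples that were predicted correctly
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

recall_pos = recall_for(1, y_true, y_pred)
recall_neg = recall_for(0, y_true, y_pred)

# Macro average: every class counts the same, whatever its size
macro_recall = (recall_pos + recall_neg) / 2

print(f'accuracy     = {accuracy:.3f}')
print(f'recall (1)   = {recall_pos:.3f}')
print(f'recall (0)   = {recall_neg:.3f}')
print(f'macro recall = {macro_recall:.3f}')
```

Since macro recall weights both classes equally, it is less affected than accuracy by one class being more frequent than the other.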

Now, let's have a look at the learning curves of the four classifiers.

In [None]:
########################################################
GB_learning_curve = [[samples_1,results['GradientBoostingClassifier'][0]['accuracy_test'],results['GradientBoostingClassifier'][0]['accuracy_train']],
                    [samples_10,results['GradientBoostingClassifier'][1]['accuracy_test'],results['GradientBoostingClassifier'][1]['accuracy_train']],
                    [samples_25,results['GradientBoostingClassifier'][2]['accuracy_test'],results['GradientBoostingClassifier'][2]['accuracy_train']],
                    [samples_50,results['GradientBoostingClassifier'][3]['accuracy_test'],results['GradientBoostingClassifier'][3]['accuracy_train']],
                    [samples_75,results['GradientBoostingClassifier'][4]['accuracy_test'],results['GradientBoostingClassifier'][4]['accuracy_train']],
                    [samples_100,results['GradientBoostingClassifier'][5]['accuracy_test'],results['GradientBoostingClassifier'][5]['accuracy_train']]]

AB_learning_curve = [[samples_1,results['AdaBoostClassifier'][0]['accuracy_test'],results['AdaBoostClassifier'][0]['accuracy_train']],
                    [samples_10,results['AdaBoostClassifier'][1]['accuracy_test'],results['AdaBoostClassifier'][1]['accuracy_train']],
                    [samples_25,results['AdaBoostClassifier'][2]['accuracy_test'],results['AdaBoostClassifier'][2]['accuracy_train']],
                    [samples_50,results['AdaBoostClassifier'][3]['accuracy_test'],results['AdaBoostClassifier'][3]['accuracy_train']],
                    [samples_75,results['AdaBoostClassifier'][4]['accuracy_test'],results['AdaBoostClassifier'][4]['accuracy_train']],
                    [samples_100,results['AdaBoostClassifier'][5]['accuracy_test'],results['AdaBoostClassifier'][5]['accuracy_train']]]

RF_learning_curve = [[samples_1,results['RandomForestClassifier'][0]['accuracy_test'],results['RandomForestClassifier'][0]['accuracy_train']],
                    [samples_10,results['RandomForestClassifier'][1]['accuracy_test'],results['RandomForestClassifier'][1]['accuracy_train']],
                    [samples_25,results['RandomForestClassifier'][2]['accuracy_test'],results['RandomForestClassifier'][2]['accuracy_train']],
                    [samples_50,results['RandomForestClassifier'][3]['accuracy_test'],results['RandomForestClassifier'][3]['accuracy_train']],
                    [samples_75,results['RandomForestClassifier'][4]['accuracy_test'],results['RandomForestClassifier'][4]['accuracy_train']],
                    [samples_100,results['RandomForestClassifier'][5]['accuracy_test'],results['RandomForestClassifier'][5]['accuracy_train']]]

XGB_learning_curve = [[samples_1,results['XGBClassifier'][0]['accuracy_test'],results['XGBClassifier'][0]['accuracy_train']],
                    [samples_10,results['XGBClassifier'][1]['accuracy_test'],results['XGBClassifier'][1]['accuracy_train']],
                    [samples_25,results['XGBClassifier'][2]['accuracy_test'],results['XGBClassifier'][2]['accuracy_train']],
                    [samples_50,results['XGBClassifier'][3]['accuracy_test'],results['XGBClassifier'][3]['accuracy_train']],
                    [samples_75,results['XGBClassifier'][4]['accuracy_test'],results['XGBClassifier'][4]['accuracy_train']],
                    [samples_100,results['XGBClassifier'][5]['accuracy_test'],results['XGBClassifier'][5]['accuracy_train']]]

df_GB_LC = pd.DataFrame(GB_learning_curve,columns=['Samples','Test_Accuracy','Train_Accuracy'])

df_AB_LC = pd.DataFrame(AB_learning_curve,columns=['Samples','Test_Accuracy','Train_Accuracy'])

df_RF_LC = pd.DataFrame(RF_learning_curve,columns=['Samples','Test_Accuracy','Train_Accuracy'])

df_XGB_LC = pd.DataFrame(XGB_learning_curve,columns=['Samples','Test_Accuracy','Train_Accuracy'])
########################################################

# Plot the four learning curves on a single 2x2 grid
curves = [('GradientBoostingClassifier',df_GB_LC),
          ('AdaBoostClassifier',df_AB_LC),
          ('XGBClassifier',df_XGB_LC),
          ('RandomForestClassifier',df_RF_LC)]

fig,axes = plt.subplots(nrows=2,ncols=2,figsize=(12,9.5))

for ax,(name,df_LC) in zip(axes.ravel(),curves):
    sns.lineplot(x=df_LC['Samples'],y=df_LC['Test_Accuracy'],ax=ax,label='Test Accuracy')
    sns.lineplot(x=df_LC['Samples'],y=df_LC['Train_Accuracy'],ax=ax,label='Train Accuracy')
    ax.set_ylabel('Accuracy')
    ax.legend(loc='lower right',fontsize=10)
    ax.set_facecolor('lemonchiffon')
    ax.set_title(f'Learning Curve of the {name}')

fig.suptitle('Learning Curves of the Four Classifiers',size=25)

fig.set_facecolor('lightsteelblue')
plt.tight_layout()

These curves show that:
* Both the AdaBoost and RandomForest classifiers overfit. Their training accuracies, when considering only 10% of the samples, jump to 100% and never go down.
* The learning curves of the XGBoost and GradientBoosting classifiers look better. In particular, the GradientBoosting classifier curve shows the best compromise between bias and variance.

In light of these findings, I will use the GradientBoosting classifier in the next phase of this study: the fine tuning of its model parameters.

## 4.5 Validation Curves of the Gradient Boosting Classifier Parameters

The overall parameters of this ensemble model can be divided into 3 categories:
* **Tree-Specific Parameters.** They affect each individual tree in the model.
* **Boosting Parameters.** They affect the boosting operation in the model.
* **Miscellaneous Parameters.** Other parameters for overall functioning.

I am defining a function to plot the validation curve for a given classifier and one of its parameters.

In [None]:
def plot_validation_curve(clf,X,y,CV,param_name,param_range,y_lim=[0.8, 0.95]):

    train_scores, test_scores = validation_curve(
                estimator = clf,
                X = X,
                y = y,
                param_name = param_name,
                param_range = param_range,
                cv = CV)

    train_mean = np.mean(train_scores,axis=1)
    train_std = np.std(train_scores,axis=1)
    test_mean = np.mean(test_scores,axis=1)
    test_std = np.std(test_scores,axis=1)

    plt.plot(param_range, train_mean,
         color='blue', marker='o',
         markersize=5, label='training accuracy')

    plt.fill_between(param_range, train_mean + train_std,
                 train_mean - train_std, alpha=0.15,
                 color='blue')

    plt.plot(param_range, test_mean,
         color='green', linestyle='--',
         marker='s', markersize=5,
         label='validation accuracy')

    plt.fill_between(param_range,
                 test_mean + test_std,
                 test_mean - test_std,
                 alpha=0.15, color='green')

    plt.xlim([param_range[0], param_range[-1]])
    plt.ylim(y_lim)

    plt.grid()
    plt.legend(loc='lower right')
    plt.xlabel(f'{param_name}')
    plt.ylabel('Accuracy')
    plt.title(f'Validation Curve of {param_name}')

    plt.tight_layout()
    plt.gcf().patch.set_facecolor('lightsteelblue')
    plt.gca().set_facecolor('khaki')
    plt.show()

The classifier that I want to focus on is the 'GradientBoostingClassifier' and the parameter that I want to study is 'n_estimators'.

**n_estimators** represents the number of boosting stages, i.e. the number of trees in the ensemble. Usually, the higher the number of trees, the better the model fits the training data.

In [None]:
clf_GB = GradientBoostingClassifier(random_state=42)

plot_validation_curve(clf_GB,OHE_X_train,y_train,10,'n_estimators',[25,50,100,200,300,400,500],y_lim=[0.8,0.95])

The plot shows that:
* 'n_estimators' is a very important parameter because it has a great influence on the accuracies.
* The GradientBoostingClassifier underfits for small values of 'n_estimators' (approx. < 75).
* The classifier starts to overfit for 'n_estimators' > 200, where the gap between the training and validation curves starts to widen.

**max_features** represents the number of features to consider when looking for the best split.

In [None]:
plot_validation_curve(clf_GB,OHE_X_train,y_train,10,'max_features',[10,20,30,40,50,60,100],y_lim=[0.82,0.88])

The classifier seems to be almost insensitive to the value of 'max_features'.

**Learning rate**

In [None]:
plot_validation_curve(clf_GB,OHE_X_train,y_train,10,'learning_rate',[0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1.])

This plot shows that the optimal values of the learning rate are small (< 0.05). When they start growing, then the model begins to overfit.

**max_depth** indicates how deep each tree can be. The deeper the tree, the more splits it has and the more information about the data it captures.

In [None]:
plot_validation_curve(clf_GB,OHE_X_train,y_train,10,'max_depth',[1,2,3,4,5,6,7])

This plot is really interesting. It shows that one should not use values of 'max_depth' larger than 3 or 4. For 'max_depth' > 5, the model clearly overfits.

**min_samples_split** represents the minimum number of samples required to split an internal node.

In [None]:
plot_validation_curve(clf_GB,OHE_X_train,y_train,10,'min_samples_split',[0.1,0.2,0.4,0.6,0.8,1.],y_lim=[0.8,0.86])

The plot shows that the optimal values for 'min_samples_split' are in the range [0.1, 0.2].
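
Because the values of 'min_samples_split' passed above are floats, scikit-learn interprets them as fractions of the number of training samples: according to its documentation, the minimum number of samples for a split is ceil(min_samples_split * n_samples). A small sketch of this conversion, with a hypothetical sample count:

```python
import math

def min_samples_from_fraction(fraction, n_samples):
    """Number of samples implied by a float 'min_samples_split',
    following the documented rule ceil(fraction * n_samples)."""
    return math.ceil(fraction * n_samples)

n_samples = 1234  # hypothetical number of training samples

for fraction in [0.1, 0.2, 0.4]:
    needed = min_samples_from_fraction(fraction, n_samples)
    print(f'min_samples_split = {fraction} -> at least {needed} samples per split')
```

A fraction of 0.1 therefore already demands fairly large nodes, which limits how finely the trees can partition the data.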

**min_samples_leaf** is the minimum number of samples required to be at a leaf node. It is similar to 'min_samples_split'.

In [None]:
plot_validation_curve(clf_GB,OHE_X_train,y_train,10,'min_samples_leaf',[0.1,0.2,0.4,0.6,0.8,1.],y_lim=[0.5,0.85])

One should choose small values of 'min_samples_leaf' to prevent underfitting.

Based on these findings, I will fine tune the parameters:
* 'n_estimators'
* 'max_depth'
* 'min_samples_split'

For parameters like 'min_samples_leaf' and 'learning_rate', to which the model is very sensitive, I prefer to keep the default values.

## 4.6 Fine Tuning the Gradient Boosting Classifier Parameters

Before fine tuning the Gradient Boosting Classifier, I want to define a function that returns the test scores in a table.

In [None]:
def get_test_scores(model_name:str,preds,y_test_data):
    '''
    Generate a table of test scores.

    In:
        model_name (string): Your choice: how the model will be named in the output table
        preds: numpy array of test predictions
        y_test_data: numpy array of y_test data

    Out:
        table: a pandas df of precision, recall, f1, and accuracy scores for your model
    '''
    accuracy  = accuracy_score(y_test_data,preds)
    precision = precision_score(y_test_data,preds,average='macro')
    recall    = recall_score(y_test_data,preds,average='macro')
    f1        = f1_score(y_test_data,preds,average='macro')

    table = pd.DataFrame({'model': [model_name],'precision': [precision],'recall': [recall],
                          'F1': [f1],'accuracy': [accuracy]})

    return table

Now, I am tuning the classifier on a grid of parameters via **Grid Search + Cross Validation**.

In [None]:
gbrt_clf = GradientBoostingClassifier(random_state=42)

cv_params = {'n_estimators':[100,150,200,250,300],
             'max_depth':[3,4,5],
             'min_samples_split':[0.01,0.05]}

gbrt_grid = GridSearchCV(estimator=gbrt_clf,param_grid=cv_params,cv=10)

gbrt_grid.fit(OHE_X_train,y_train)

test_preds_gbrt = gbrt_grid.predict(OHE_X_test)

gbrt_test_GSCV_results = get_test_scores('GradientBoosting + GridSearchCV (test)',test_preds_gbrt,y_test)

print(gbrt_grid.best_params_)
print()
print(gbrt_grid.best_estimator_)
print()
gbrt_test_GSCV_results
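
To get a feeling for the cost of this search, one can count the model fits it requires. The sketch below reuses the grid and the number of folds from the cell above; `GridSearchCV` then refits the best combination once more on the whole training set (its default `refit=True`).

```python
from itertools import product

# Same grid and number of folds as in the grid search above
cv_params = {'n_estimators': [100, 150, 200, 250, 300],
             'max_depth': [3, 4, 5],
             'min_samples_split': [0.01, 0.05]}
n_folds = 10

# Every combination of parameter values is tried on every fold
combinations = list(product(*cv_params.values()))
n_fits = len(combinations) * n_folds

print(f'parameter combinations: {len(combinations)}')
print(f'cross-validation fits:  {n_fits}')
```

Adding one more value to any of the three lists multiplies the run time accordingly, which is why the grid is kept small.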

**Confusion matrix**

In [None]:
# Generate array of values for confusion matrix
cm = confusion_matrix(y_test,test_preds_gbrt,labels=gbrt_grid.classes_)

ax = sns.heatmap(cm,annot=True,fmt='.4g')
ax.set_title('Confusion Matrix [GradientBoosting (test)]',fontsize=15)
ax.xaxis.set_ticklabels(['No Deposit','Deposit'])
ax.yaxis.set_ticklabels(['No Deposit','Deposit'])
ax.set_xlabel("Predicted")
ax.set_ylabel("Target")

plt.gcf().patch.set_facecolor('lightsteelblue')
plt.tight_layout()

**ROC-AUC score and ROC curve**

Then, I am computing the ROC-AUC score. This is another metric to evaluate the performance of binary classifiers.

In [None]:
prob_test = gbrt_grid.predict_proba(OHE_X_test)[:,1]

print("Test ROC-AUC (GradientBoostingClassifier + GridSearchCV):",roc_auc_score(y_test,prob_test))
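
The ROC-AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The function below is an illustrative, brute-force sketch of this definition on made-up scores; it is not how `roc_auc_score` is implemented.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample gets the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, y_score)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.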

To plot the ROC curve, I need to define a plotting function.

In [None]:
def plot_roc_curve(true_y,y_prob,text):
    """
    plots the ROC curve based on the probabilities
    """
    fpr,tpr,thresholds = roc_curve(true_y,y_prob)
    plt.plot(fpr,tpr)
    plt.title(f'ROC Curve {text}')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')

    plt.gcf().patch.set_facecolor('lightsteelblue')
    plt.gca().set_facecolor('lemonchiffon')
    plt.tight_layout()

plot_roc_curve(y_test,prob_test,'(GradientBoostingClassifier + GridSearchCV)');

# 5. Term Deposit and Success of the Marketing Campaign
<a id="5"></a>

## 5.1 Feature Importance

**Feature Importance with a random forest classifier**

By having a look at the feature importance plot, we can find out what are the most important variables in determining the success of the campaign.

In [None]:
# Random Forest Model
random_forest = RandomForestClassifier(random_state=42,max_depth=100)
random_forest.fit(OHE_X_train,y_train)

importances = pd.DataFrame({'feature':OHE_X_train.columns,'importance':np.round(random_forest.feature_importances_,3)})
importances = importances.sort_values('importance',ascending=False)

importances.head()

In [None]:
plt.figure(figsize=(12,6))

sns.barplot(importances[importances['importance'] > 0.025],x='feature',y='importance')

plt.title('Feature Importances > 0.025',fontsize=25)
plt.xlabel('feature',fontsize=15)
plt.xticks(fontsize=8,rotation=45)
plt.ylabel('relative importance',fontsize=15)

plt.tight_layout()
plt.gcf().patch.set_facecolor('lightsteelblue')
plt.gca().set_facecolor('lemonchiffon')
plt.show()

They are *duration*, *balance*, *day* and *age*, with *duration* being the dominant one.

**Permutation Based Feature Importance**

There are other ways to compute the feature importance. One of them is permutation-based importance.
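
The idea is to shuffle one feature at a time and to measure how much the score drops: the larger the drop, the more the model relies on that feature. The sketch below illustrates the mechanism with NumPy on synthetic data and a hand-written stand-in for a fitted model (both are assumptions for illustration). scikit-learn's `permutation_importance` repeats the shuffling several times per feature and averages the drops.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic data: only the first column carries information about y
informative = rng.normal(size=n)
noise = rng.normal(size=n)
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)

def predict(X):
    """Stand-in for a fitted model: it only looks at the first column."""
    return (X[:, 0] > 0).astype(int)

def accuracy(y_true, y_pred):
    return float((y_true == y_pred).mean())

baseline = accuracy(y, predict(X))

drops = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    # Shuffling column j breaks its link with the target
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    drops.append(baseline - accuracy(y, predict(X_perm)))

print(f'baseline accuracy = {baseline:.3f}')
print(f'accuracy drops    = {drops}')
```

Shuffling the informative column hurts the accuracy considerably, while shuffling the noise column leaves it unchanged.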

In [None]:
perm_importance = permutation_importance(random_forest,OHE_X_train,y_train)

sorted_idx = (-perm_importance.importances_mean).argsort()

list_of_tuples  = list(zip(OHE_X_train.columns[sorted_idx],
                           perm_importance.importances_mean[sorted_idx]))

perm_importance = pd.DataFrame(list_of_tuples,
                  columns=['feature','permutation importance'])

perm_importance.head()

In [None]:
plt.figure(figsize=(12,6))

sns.barplot(perm_importance[perm_importance['permutation importance'] > 0.015],x='feature',y='permutation importance')

plt.title('Permutation Importances > 0.015',fontsize=25)
plt.xlabel('feature',fontsize=15)
plt.xticks(fontsize=8,rotation=45)
plt.ylabel('relative importance',fontsize=15)

plt.tight_layout()
plt.gcf().patch.set_facecolor('lightsteelblue')
plt.gca().set_facecolor('lemonchiffon')
plt.show()

These results are slightly different, even though 'duration' is still the dominant feature.

## 5.2 Impact of Duration on the Choice of Opening a Term Deposit

Let's find out what is the average value of duration. Duration represents the length (in seconds) of the last telephone call between a marketing company employee and the client.

In [None]:
print(f'Mean Duration   = {data.duration.mean():.1f} secs')
print(f'Median Duration = {data.duration.median():.1f} secs')

print()

mean_duration   = data.duration.mean()
median_duration = data.duration.median()

Q1 = np.percentile(data['duration'],25)
Q2 = np.percentile(data['duration'],50)
Q3 = np.percentile(data['duration'],75)

print(f'First Quartile of Duration  = {Q1:.1f} secs')
print(f'Second Quartile of Duration = {Q2:.1f} secs')
print(f'Third Quartile of Duration  = {Q3:.1f} secs')

Create a new column with tags 'above' or 'below' the *duration* median.

In [None]:
### Creating a new column 'duration_above_below_median' ###
data['duration_above_below_median'] = np.where(data['duration'] > median_duration,
                                               'duration above median',
                                               'duration below median')

Now I am doing almost the same, but the *duration* values will be divided into quartiles.

In [None]:
### Creating a new column 'duration_quartiles' ###
# The column is created on 'data', the dataframe used in the plots below.
# np.select picks the first condition that holds, so each row gets one quartile label.
conditions = [data['duration'] <= Q1,
              data['duration'] <= Q2,
              data['duration'] <= Q3]
labels = ['duration: Q1','duration: Q2','duration: Q3']

data['duration_quartiles'] = np.select(conditions,labels,default='duration: Q4')

In [None]:
fig,(ax1,ax2) = plt.subplots(ncols=2,figsize=(12,6))

ax1 = sns.histplot(data=data,x='duration_above_below_median',
             multiple="dodge",hue='deposit',shrink=0.9,stat='probability',ax=ax1)
ax1.set_title('Duration Above or Below Median',fontsize=20)
ax1.set_xlabel('')
ax1.set_facecolor('lemonchiffon')
ax1.legend(labels=["no term deposit","term deposit"],loc='upper center')

ax2 = sns.histplot(data=data,x='duration_quartiles',
             multiple="dodge",hue='deposit',shrink=0.9,stat='probability',ax=ax2)
ax2.set_title('Duration Quartiles',fontsize=20)
ax2.set_xlabel('')
ax2.set_facecolor('lemonchiffon')
ax2.legend(labels=["no term deposit","term deposit"],loc='upper center')

fig.suptitle("Term Deposit vs Call Duration",fontsize=28)

plt.tight_layout()
fig.set_facecolor('lightsteelblue')

One can notice that:
- Longer (last) call durations favor the success of the campaign.
- Call durations below the median *duration* value (255 secs) are generally unsuccessful, while those above the median value are generally successful.
- More specifically, the most successful calls belong to the fourth quartile of *duration* (*duration* > Q3 = 496 secs), the least successful belong to the first quartile (*duration* < Q1 = 138 secs).

## 5.3 Impact of Contact (Communication Type) on Opening a Term Deposit

In [None]:
data.contact.unique()

The marketing company can contact the client either on his/her mobile or on the landline.

In [None]:
data[data['contact'] != 'unknown'].groupby('deposit')['contact'].value_counts()

In [None]:
sns.histplot(data=data[data['contact'] != 'unknown'],x='contact',
             multiple="dodge",hue='deposit',shrink=0.9,stat='probability')
plt.title('Probability of Opening a Term Deposit vs Contact Type',fontsize=15)

plt.tight_layout()
plt.gcf().patch.set_facecolor('lightsteelblue')
plt.gca().set_facecolor('lemonchiffon')
plt.show()

Calling the client on the mobile phone gives a slightly higher probability of success.

In [None]:
print(f'Probability of success on landline = {390 / (390 + 384) * 100:.0f}%')
print(f'Probability of success on mobile   = {4369 / (4369 + 3673) * 100:.0f}%')

The difference is rather small and might not be statistically significant. This is worth checking.
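
One simple way to check it is a two-proportion z-test on the counts printed above. The sketch below uses only the standard library and assumes that the calls are independent observations; it is one possible test among others (a chi-square test of independence would be an alternative).

```python
import math

# Counts taken from the cell above: (successes, failures) per contact type
landline_success, landline_failure = 390, 384
mobile_success, mobile_failure = 4369, 3673

n1 = landline_success + landline_failure
n2 = mobile_success + mobile_failure
p1 = landline_success / n1
p2 = mobile_success / n2

# Pooled success rate under the null hypothesis of equal proportions
p_pool = (landline_success + mobile_success) / (n1 + n2)
std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

z = (p2 - p1) / std_err

# Two-sided p-value from the standard normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f'success rate on landline = {p1:.3f}')
print(f'success rate on mobile   = {p2:.3f}')
print(f'z = {z:.2f}, two-sided p-value = {p_value:.3f}')
```

The p-value can then be compared with the chosen significance level (0.05 is a common choice) to decide whether the difference between the two contact types is more than noise.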

## 5.4 Impact of *poutcome* on Opening a Term Deposit

The variable *poutcome* represents the outcome of the previous marketing campaign. Its possible values are:

In [None]:
data['poutcome'].unique()

I am excluding the values 'other' and 'unknown'.

In [None]:
data[data['poutcome'].isin(['failure','success'])].groupby('deposit')['poutcome'].value_counts()

In [None]:
sns.histplot(data=data[data['poutcome'].isin(['failure','success'])],x='poutcome',
             multiple="dodge",hue='deposit',shrink=0.9,stat='probability')
plt.title("Probability of Opening a Term Deposit vs 'poutcome'",fontsize=15)

plt.tight_layout()
plt.gcf().patch.set_facecolor('lightsteelblue')
plt.gca().set_facecolor('lemonchiffon')
plt.show()

This plot shows that the probability of success in the present campaign is very high whenever the previous campaign was also a success. In other words, clients who were persuaded in the previous marketing campaign usually open a term deposit in the present one as well.
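With `stat='probability'` and a `hue`, `sns.histplot` by default normalizes over all bars together (`common_norm=True`), so each bar is a share of the whole sample rather than a success rate within its group. To read the conditional success rate directly, a row-normalized cross-tabulation can help. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in for the notebook's `data`, and its values are not taken from the dataset).

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's `data` frame
toy = pd.DataFrame({
    'poutcome': ['success', 'success', 'success', 'failure',
                 'failure', 'failure', 'failure', 'success'],
    'deposit':  ['yes', 'yes', 'no', 'no',
                 'yes', 'no', 'no', 'yes'],
})

# Row-normalized table: each row sums to 1, so the 'yes' column
# is the share of clients with that 'poutcome' who opened a deposit
rates = pd.crosstab(toy['poutcome'], toy['deposit'], normalize='index')
print(rates)
```

The same call with another column in place of `'poutcome'` would give the corresponding rates for *contact*, *housing* or *loan*.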

## 5.5 Impact of *housing* and *loan* on Opening a Term Deposit

The *housing* and *loan* variables record whether the client has a housing loan or a personal loan, respectively.

In [None]:
fig,(ax1,ax2) = plt.subplots(ncols=2,figsize=(12,6))

ax1 = sns.histplot(data=data,x='housing',multiple="dodge",hue='deposit',shrink=0.9,stat='probability',ax=ax1)
ax1.set_title("Term Deposit vs 'housing'",fontsize=20)
ax1.set_facecolor('lemonchiffon')

ax2 = sns.histplot(data=data,x='loan',multiple="dodge",hue='deposit',shrink=0.9,stat='probability',ax=ax2)
ax2.set_title("Term Deposit vs 'loan'",fontsize=20)
ax2.set_facecolor('lemonchiffon')

fig.suptitle("Probability of Opening a Term Deposit vs 'housing' and 'loan'",fontsize=28)

plt.tight_layout()
fig.set_facecolor('lightsteelblue')

- The first plot shows that the probability of success is higher when the targeted client does not have a housing loan.
- The second plot shows that clients with a personal loan usually do not open a term deposit.

## 5.6 Impact of Age on Opening a Term Deposit

In [None]:
sns.histplot(data=data,x='age cluster',multiple="dodge",hue='deposit',shrink=0.9,stat='probability')
plt.title("Probability of Opening a Term Deposit vs Age",fontsize=15)
plt.xticks(fontsize=8,rotation=45)

plt.tight_layout()
plt.gcf().patch.set_facecolor('lightsteelblue')
plt.gca().set_facecolor('lemonchiffon')
plt.show()

Young (age < 25) and older (age > 65) clients are more likely to open a term deposit.

## 5.7 Impact of *balance*, *pdays* and *previous* on Opening a Term Deposit

To plot the probability of opening a term deposit vs the client's balance, first I have to segment the clients' balance into balance groups (4 quartiles).
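As a compact illustration of this kind of quartile binning, the sketch below uses `pd.qcut` on a few made-up balances. The `balances` series is hypothetical and only stands in for the *balance* column; this is an alternative way of expressing the segmentation, not the approach taken in the next cell.

```python
import pandas as pd

# Hypothetical balances standing in for the 'balance' column
balances = pd.Series([-200, 0, 50, 120, 300, 450,
                      900, 1500, 2200, 4000, 7000, 12000])

labels = ['balance: Q1', 'balance: Q2', 'balance: Q3', 'balance: Q4']

# qcut cuts at the sample quartiles and returns an ordered categorical,
# so plots and sorts follow Q1 -> Q4 without an explicit sorter
balance_quartiles = pd.qcut(balances, q=4, labels=labels)
print(balance_quartiles.value_counts().sort_index())
```

Note that `pd.qcut` raises an error when two quartile edges coincide (for instance when many clients share the same balance), in which case explicit boundaries as in the next cell are the simpler choice.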

In [None]:
features = ['balance','pdays','previous']
data2 = outlier_imputer(data,features)

### Creating a new column 'balance_quartiles' ###
# Quartile boundaries are computed on 'balance' itself, so that they do not
# silently reuse Q1, Q2 and Q3 values computed earlier for another variable
bal_Q1,bal_Q2,bal_Q3 = data2['balance'].quantile([0.25,0.5,0.75])

data2.loc[data2['balance'] <= bal_Q1,'balance_quartiles'] = 'balance: Q1'
data2.loc[(data2['balance'] > bal_Q1) & (data2['balance'] <= bal_Q2),'balance_quartiles'] = 'balance: Q2'
data2.loc[(data2['balance'] > bal_Q2) & (data2['balance'] <= bal_Q3),'balance_quartiles'] = 'balance: Q3'
data2.loc[data2['balance'] > bal_Q3,'balance_quartiles'] = 'balance: Q4'

sorter = ['balance: Q1','balance: Q2','balance: Q3','balance: Q4']

# Store the column as an ordered categorical so that plots follow Q1 -> Q4
data2['balance_quartiles'] = data2['balance_quartiles'].astype("category").cat.set_categories(sorter,ordered=True)

The variable *pdays* is the number of days that passed since the client was last contacted in a previous campaign, while *previous* is the number of contacts performed for this client before the current campaign.

In [None]:
fig,(ax1,ax2,ax3) = plt.subplots(ncols=3,figsize=(12,6))

ax1 = sns.histplot(data=data2,x='balance_quartiles',multiple="dodge",
                   hue='deposit',shrink=0.9,stat='probability',ax=ax1)
ax1.tick_params(axis='x',labelrotation=45,labelsize=7)
ax1.set_title("Term Deposit vs Balance",fontsize=18)
ax1.set_facecolor('lemonchiffon')

# Density of 'pdays' per outcome (kdeplot replaces the deprecated distplot)
sns.kdeplot(x=data[data['deposit'] == 'yes']['pdays'],ax=ax2,label='term deposit')
sns.kdeplot(x=data[data['deposit'] == 'no']['pdays'],ax=ax2,color='orange',label='no term deposit')
ax2.set_xlim(-100,300)
ax2.set_title("Term Deposit vs 'pdays'",fontsize=18)
ax2.set_facecolor('lemonchiffon')
ax2.legend()

# Density of 'previous' per outcome
sns.kdeplot(x=data[data['deposit'] == 'yes']['previous'],ax=ax3,label='term deposit')
sns.kdeplot(x=data[data['deposit'] == 'no']['previous'],ax=ax3,color='orange',label='no term deposit')
ax3.set_xlim(-2,10)
ax3.set_title("Term Deposit vs 'previous'",fontsize=18)
ax3.set_facecolor('lemonchiffon')
ax3.legend()

fig.suptitle("Probability of Opening a Term Deposit vs Balance, 'pdays' and 'previous'",fontsize=24)

plt.tight_layout()
fig.set_facecolor('lightsteelblue')

## 5.8 Impact of *month* on Opening a Term Deposit

In [None]:
# Number of clients per (month, outcome) pair
data_month = data2.groupby(['month','deposit']).size().reset_index(name='count')

sorter = ['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec']

# Rank the months chronologically and sort the rows by that rank
sorterIndex = dict(zip(sorter,range(len(sorter))))

data_month['month_rank'] = data_month['month'].map(sorterIndex)
data_month.sort_values('month_rank',ascending=True,inplace=True)
data_month.drop('month_rank',axis=1,inplace=True)

sns.barplot(data=data_month,x='month',y='count',hue='deposit')
plt.ylabel('count')
plt.title("Impact of Month on Opening a Term Deposit",fontsize=18)

plt.tight_layout()
plt.gcf().patch.set_facecolor('lightsteelblue')
plt.gca().set_facecolor('lemonchiffon')
plt.show()

The plot shows that May is the month with the most contact attempts by the marketing company. It is also a month in which the ratio of successes to failures is very low. By contrast, in months such as March, September and October the ratio of successes to failures is considerably higher.
<br>The company could therefore consider shifting its main effort from the late spring and summer months towards the autumn and winter months.
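Since the bar plot shows raw counts, the ratio has to be read off by eye. A per-month success rate next to the number of attempts makes the comparison explicit. The sketch below runs on a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset).

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's frame
toy = pd.DataFrame({
    'month':   ['may', 'may', 'may', 'may', 'mar', 'mar', 'oct', 'oct'],
    'deposit': ['no', 'no', 'no', 'yes', 'yes', 'yes', 'yes', 'no'],
})

month_order = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
               'jul', 'aug', 'sep', 'oct', 'nov', 'dec']

monthly = (
    toy.assign(success=toy['deposit'].eq('yes'))
       .groupby('month')['success']
       .agg(attempts='count', success_rate='mean')
       .reindex(month_order)  # chronological order; months without calls become NaN
       .dropna()              # keep only the months that actually appear
)
print(monthly)
```

Months with few attempts give noisy rates, so the `attempts` column should be read together with `success_rate` before drawing conclusions.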

## 5.9 Impact of Job, Marital Status and Education on Opening a Term Deposit

In [None]:
sns.histplot(data=data,x='deposit',multiple='stack',hue='marital_edu')
plt.title('Deposit (Count) vs Marital + Education Clusters',fontsize=16)

plt.tight_layout()
plt.gcf().patch.set_facecolor('lightsteelblue')
plt.gca().set_facecolor('lemonchiffon')
plt.show()

The fraction of clients that opened a term deposit changes notably depending on their marital + education group. For example, the majority of married clients with a secondary education chose not to open a term deposit, while the majority of singles with a tertiary education opened one.

In [None]:
fig = plt.figure(figsize=(12,8))

sns.violinplot(x="balance",y="job",hue="deposit",data=data);

plt.title("Client's Balance vs Job vs Deposit",fontsize=28)

plt.tight_layout()
plt.gcf().patch.set_facecolor('lightsteelblue')
plt.gca().set_facecolor('lemonchiffon')
plt.show()

# 6. Final Suggestions
<a id="6"></a>

These are the final suggestions that can be derived from the data.

**1. Phone Calls and Loyal Clients**

Given that longer calls strongly favor the success of the campaign, the campaign should focus mainly on clients who appear more interested than average in the bank's product. Clients who do not ask for more details about the term deposit and try to keep the phone call as short as possible are unlikely to subscribe.
<br> Moreover, the probability of success is very high whenever the previous campaign was also a success. In other words, clients who were persuaded in the previous marketing campaign usually open a term deposit in the present one as well.

We can conclude that such loyal and/or 'positive' clients are a valuable asset to the bank, which should make every effort to retain them.

**2. Clients with Loans and Clients' Age Group, Marital Status and Education**

The probability of campaign success is higher when the targeted client has neither a personal loan nor a housing loan.
<br>Young (age < 25) and older (age > 65) clients are more likely to open a term deposit.
<br>The fraction of clients that opened a term deposit changes notably depending on their marital status and education. For example, the majority of married clients with secondary education chose not to open a term deposit, while the majority of singles with a tertiary education opened one.

The campaign should thus be more focused on clients within these specific categories.

**3. Most and Least Favorable Months**

May is the month with the most contact attempts by the marketing company. It is also a month in which the ratio of successes to failures is low. By contrast, in months such as March, September and October the ratio of successes to failures is much higher, even though campaign activity in these months is limited.

Therefore, the marketing company should shift its main effort from late spring and summer months towards autumn and winter months.

# References

1. Jacopo Ferretti, [*How to TARGET DONORS in Electoral Campaigns*](https://www.kaggle.com/code/jacopoferretti/how-to-target-donors-in-electoral-campaigns), notebook on Kaggle.
2. Mohtadi Ben Fraj, [*In Depth: Parameter tuning for Gradient Boosting*](https://medium.com/all-things-ai/in-depth-parameter-tuning-for-gradient-boosting-3363992e9bae), article on medium.com. Here one can find an interesting discussion on the parameters of Gradient Boosting Classifier.
3. Sebastian Raschka, *Machine Learning con Python* (English: *Python Machine Learning*), Apogeo Editore.