# Term Deposit Marketing - An Apziva Project

By Samuel Alter

Apziva: G3SuQYZYrFt9dwF3

## Project Overview

Using phone call data from a European bank, this project builds a model that predicts whether a customer will subscribe to a term deposit, a type of financial product. This project is a partnership with a startup focused on providing ML solutions for European banks.

### Goals

The startup is hoping that I can **achieve ≥81% accuracy** using a 5-fold cross validation strategy, taking the average performance score.

Bonus goals include:
* Determine which customers are most likely to buy the term deposit loan
  * Which segments of customers should the client prioritize?
* Determine what makes the customer buy the loan
  * Which feature should the startup focus on?

### The dataset

The dataset consists of the following columns:
* `age`
  * Numeric
  * The age of the customer
* `job`
  * Categorical
  * The job category of the customer
* `marital`
  * Categorical
  * The customer's marital status
* `education`
  * Categorical
  * The customer's level of education
* `default`
  * Binary
  * If the customer has credit in default or not
* `balance`
  * Numeric
  * Average yearly balance in Euros
* `housing`
  * Binary
  * If the customer has a housing loan or not
* `loan`
  * Binary
  * If the customer has a personal loan
* `contact`
  * Categorical
  * The type of contact communication
* `day`
  * Numeric
  * Last contact day of the month
* `month`
  * Categorical
  * Last contact month of the year
* `duration`
  * Numeric
  * Duration of the last phone call with the customer
* `campaign`
  * Numeric
  * The number of contacts performed during this campaign and for this client, which includes the last contact

The final column, `y`, is the target of the dataset and shows whether the client subscribed to a term deposit.

## Table of Contents

1. [EDA](#eda)
 * [Non-visual data analysis](#neda) of the data: check `dtype`, look at broad trends in data
 * [Visualization](#viz) of the data
   * [Figure 1: Barplots of **categorical** features](#fig1)
   * [Figure 2: Histograms of **continuous** features](#fig2)
   * [Figure 3: Boxplots of **continuous** features](#fig3)
   * [Figure 4: Correlation matrix of **continuous** features](#fig4)
   * [Figure 5: Correlation matrix of **categorical** features](#fig5)
   * What about [scatterplots?](#scat)
2. [Modeling](#mod)

## Imports and Helper Functions

In [1]:
# ignore warnings for seaborn
import warnings
warnings.filterwarnings("ignore", module="seaborn")

In [1]:
!python --version

/usr/bin/sh: 1: python: not found


In [5]:
!pip list

Package              Version
-------------------- -----------
argon2-cffi          21.3.0
argon2-cffi-bindings 21.2.0
asttokens            2.0.8
attrs                22.1.0
auto-sklearn         0.15.0
backcall             0.2.0
beautifulsoup4       4.11.1
black                22.8.0
bleach               5.0.1
certifi              2022.9.14
cffi                 1.15.1
cfgv                 3.3.1
charset-normalizer   2.1.1
click                8.1.3
cloudpickle          2.2.0
ConfigSpace          0.4.21
contourpy            1.0.5
coverage             6.4.4
cycler               0.11.0
Cython               0.29.32
dask                 2022.9.1
debugpy              1.6.3
decopatch            1.4.10
decorator            5.1.1
defusedxml           0.7.1
distlib              0.3.6
distributed          2022.9.1
distro               1.7.0
emcee                3.1.2
entrypoints          0.4
execnet              1.9.0
executing            1.0.0
fastjsonschema       2.16.2
filelock             3.8.0

In [6]:
import subprocess

# Run `pip list` and capture the output
pip_list_output = subprocess.check_output(['pip', 'list']).decode('utf-8')

# List of packages to check
required_packages = ['numpy', 'pandas', 'matplotlib', 'seaborn', 'scikit-learn', 'auto-sklearn']

# Check if the required packages are in the pip list output
for package in required_packages:
    if package in pip_list_output:
        print(f"{package} is installed.")
    else:
        print(f"{package} is NOT installed.")

numpy is installed.
pandas is installed.
matplotlib is installed.
seaborn is installed.
scikit-learn is installed.
auto-sklearn is installed.



In [3]:
!pip install optuna

Collecting optuna
  Downloading optuna-3.6.1-py3-none-any.whl (380 kB)
Collecting colorlog
  Downloading colorlog-6.8.2-py3-none-any.whl (11 kB)
Collecting tqdm
  Downloading tqdm-4.66.5-py3-none-any.whl (78 kB)
Collecting sqlalchemy>=1.3.0
  Downloading SQLAlchemy-2.0.32-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Collecting alembic>=1.5.0
  Downloading alembic-1.13.2-py3-none-any.whl (232 kB)
Collecting Ma

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import copy

from scipy.stats import chi2_contingency

from sklearn.model_selection import train_test_split
# from pycaret.classification import setup,compare_models,create_model,plot_model,evaluate_model
# from pycaret.regression import *

import optuna
import autosklearn

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
# import json
# from datetime import datetime
# from pathlib import Path
# import inspect

# from sklearn.datasets import make_classification

# from sklearn.metrics import accuracy_score
# from sklearn.model_selection import cross_val_score
# from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
# from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# from sklearn.metrics import roc_curve,auc,roc_auc_score

# from xgboost import XGBClassifier
# from sklearn.ensemble import ExtraTreesClassifier
# from sklearn.feature_selection import RFE
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.linear_model import LogisticRegression

# import lightgbm as lgb
# from lightgbm import LGBMClassifier
# from lightgbm import plot_importance

# from sklearn.ensemble import StackingClassifier
# from sklearn.ensemble import VotingClassifier
# from sklearn.model_selection import RepeatedStratifiedKFold
# from numpy import mean
# from numpy import std

In [4]:
# simple function to generate random integers

def rand_gen(low=1,high=1e4):
    '''
    Generates a pseudo-random integer
    consisting of up to four digits
    '''
    rng=np.random.default_rng()
    random_state=int(rng.integers(low=low,high=high))
    return random_state

In [5]:
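# standard-library imports used by the helpers below
import inspect
import json
from datetime import datetime
from pathlib import Path

def get_variable_name(var):
    '''Return the names in the caller's local scope that are bound to `var`.'''
    callers_local_vars = inspect.currentframe().f_back.f_locals.items()
    return [name for name, val in callers_local_vars if val is var]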

def fileDaterSaver(location: str,
                   filetype: str,
                   object_,
                   extra: str = '',
                   verbose: bool = True):

    '''
    Function that gets a timestamped filename and saves it
    to a user-specified location.

    Parameters:
    -----------
    location: str - The location where the file will be saved.
    filetype: str - The type of the file to save ('csv' or 'json').
    object_: The object to be saved. Should be a pandas DataFrame
        for 'csv' or serializable for 'json'.
    extra: str - Additional string to include in the filename.
    verbose: bool - Whether to print verbose messages.
    '''

    # get current date and time
    current_datetime = datetime.now()

    # print current date and time to check
    if verbose:
        print('current_datetime:', current_datetime)

    # format the datetime for a filename
    datetime_suffix = current_datetime.strftime("%Y-%m-%d_%H-%M-%S")

    # create filename with the datetime suffix
    if extra != '':
        file_name = f'{location}{extra}_{datetime_suffix}.{filetype}'
    else:
        file_name = f'{location}{datetime_suffix}.{filetype}'

    # print file name
    if verbose:
        print(file_name)

    # save object
    if filetype == 'csv':
        object_.to_csv(file_name, index=True)
    elif filetype == 'json':
        with open(file_name, 'w') as file:
            file.write(json.dumps(object_, default=str))
    else:
        raise ValueError("Unsupported file type. Use 'csv' or 'json'.")

    # confirm save
    file_path = Path(file_name)
    if file_path.exists():
        variable_name = get_variable_name(object_)
        if variable_name:
            print(f'Successfully saved {variable_name[0]} to {file_path}')
        else:
            print(f'Successfully saved object to {file_path}')
    else:
        print("File save error.")

In [6]:
seed=rand_gen()
seed

7387

In [7]:
test_size=0.2
test_size

0.2

In [None]:
# from google.colab import files
# uploaded=files.upload()

In [None]:
from google.colab import drive
drive.mount('/content/drive')
df=pd.read_csv('/content/drive/MyDrive/2_data.csv')
df.head(3)

In [None]:
# url=''
# df=pd.read_csv(url)
# df.head(3)

In [None]:
# if not in Google Colab:

# read in data
df=pd.read_csv('../data/2_data.csv')
df.head(3)

In [None]:
# !pip install -U -q PyDrive2
# from pydrive.auth import GoogleAuth
# from pydrive.drive import GoogleDrive
# from google.colab import auth
# from oauth2client.client import GoogleCredentials
# # Authenticate and create the PyDrive client.
# auth.authenticate_user()
# gauth = GoogleAuth()
# gauth.credentials = GoogleCredentials.get_application_default()
# drive = GoogleDrive(gauth)

In [None]:
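link='https://drive.google.com/file/d/1uIUzYWMQA_hl1odTnBTfwCV4m8K8BqMt/view?usp=share_link'
# the share link serves an HTML page, so build a direct-download URL from the file ID
# (assumes the file is shared publicly)
file_id=link.split('/d/')[1].split('/')[0]
df=pd.read_csv(f'https://drive.google.com/uc?export=download&id={file_id}')
df.head(3)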

## EDA <a name='eda'></a>

### Non-visual data analysis <a name='neda'></a>

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,no


In [None]:
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns in the dataset")

In [None]:
df.info()

There are no nulls in the dataset, which makes our lives easier.

In [None]:
df.describe()

We can glean the following insights from this table:
* The mean values for the `age`, `day`, and `campaign` columns are about equal to the 50th percentile
  * The distribution of the data may be symmetric
* The max value in each column besides `age` and `day` is much larger than the column's 75th percentile
  * This suggests there could be outliers
  * `age` and `day` are more or less categorical, so it makes sense that the max age is 95 and max day is 31
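The two checks above (mean versus median, and maximum versus the 75th percentile) can be sketched on made-up numbers. The `toy` balances and the helper name `iqr_outlier_share` are hypothetical, and the 1.5×IQR fence is a common convention, not something this notebook applies elsewhere.

```python
import pandas as pd

# made-up balances with one very large value (not taken from the bank dataset)
toy = pd.DataFrame({'balance': [0, 50, 120, 300, 450, 800, 1500, 40000]})

def iqr_outlier_share(s):
    '''Share of values above the upper fence Q3 + 1.5 * IQR.'''
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    upper_fence = q3 + 1.5 * (q3 - q1)
    return (s > upper_fence).mean()

mean_balance = toy['balance'].mean()
median_balance = toy['balance'].median()
share = iqr_outlier_share(toy['balance'])

# a mean far above the median hints at right skew; a non-zero share hints at outliers
print(f"mean={mean_balance}, median={median_balance}, outlier share={share}")
```

A mean close to the median is consistent with a symmetric distribution, while a maximum far above the upper fence is the pattern flagged for `balance`, `duration`, and `campaign`.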

What if we compare the subset of the data that had a positive `y` outcome to those that had a negative outcome?

In [None]:
# customers in the dataset who subscribed to a term deposit
df_yes = df[df['y'] == 'yes']
df_yes.describe()

In [None]:
# customers in the dataset who did not subscribe to a term deposit
df_no = df[df['y'] == 'no']
df_no.describe()

In [None]:
print(f"{round(df_yes.shape[0]/df.shape[0]*100,2)}% of the dataset has positive outcomes, while {round(df_no.shape[0]/df.shape[0]*100,2)}% of the dataset has negative outcomes")

We can see:
* There is a large class imbalance in the dataset
* The mean values are roughly the same across the numerical columns and loan outcomes
  * Except for `duration`, whose mean is roughly one-third as long for calls that don't end with a sale (`y`=no)
* The max values for `balance` and `campaign` are about 2.25x and 2x as large for `y`=no
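To make the class imbalance concrete, here is a minimal sketch on a made-up target column. The 93/7 split is an assumption for illustration only, not the measured ratio in this dataset. It shows how the share of the majority class doubles as the accuracy of a model that always predicts "no".

```python
import pandas as pd

# hypothetical target column: 93 'no' and 7 'yes' (illustrative numbers only)
y_toy = pd.Series(['no'] * 93 + ['yes'] * 7)

# proportion of each class
class_share = y_toy.value_counts(normalize=True)

# accuracy of always predicting the most common class
baseline_accuracy = class_share.max()

print(class_share)
print(f"majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

Comparing any model's accuracy against this baseline gives a sense of how much the model adds beyond the imbalance itself.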

Let's do some more aggregations to tease apart the differences between the different classes within each column.

In [None]:
# functions to compute the quantiles

def q25(x):
    return x.quantile(0.25)

def q50(x):
    return x.quantile(0.50)

def q75(x):
    return x.quantile(0.75)

def iqr(x):
    return q75(x)-q25(x)

We can edit this code to slice and dice the dataset as we please:

In [None]:
df.columns

In [None]:
groupby_list=['count','mean','std','min',q25,q50,q75,iqr]
df.groupby([
    # 'age',
    'job',
    # 'marital',
    # 'education',
    # 'default',
    # 'housing',
    # 'loan',
    # 'contact',
    # 'day',
    # 'month',
    # 'y'
]).agg(
    {
        'balance':groupby_list,
        # 'duration':groupby_list,
        # 'campaign':groupby_list
    })

There are different mean balances depending on what job the customer is in, which is to be expected: a blue-collar worker is not typically making the same amount of money that someone in management makes, so the balance in their bank account would be different too.

In [None]:
# create dictionary of unique values
dict_unique={col:df[col].nunique() for col in df.columns}

# this is a little unwieldy
# but it will give us a sense of
# HOW MANY unique values there are
dict_unique

We can tell which columns are categorical. For example, there are...
* 12 kinds of jobs
* 4 education levels  

...in the dataset.

In [None]:
# create dictionary of unique values
dict_nunique={col:df[col].unique() for col in df.columns}

# this is a little unwieldy
# but it will give us a sense of
# WHAT the unique values are
dict_nunique

Nothing too surprising jumps out at us here, so we'll need to use more... _visual_ methods to understand the dataset.

### Visualization <a name='viz'></a>

#### Figure 1: Barplots of Categorical features <a name='fig1'></a>

Make a big figure with all the categorical features:
* `job`
* `marital`
* `education`
* `default`
* `housing`
* `loan`
* `contact`
* `day`
* `month`

In [None]:
# make dictionary of just the categorical variables
cat_nunique=copy.deepcopy(dict_nunique)
del cat_nunique['age']
del cat_nunique['balance']
del cat_nunique['duration']
del cat_nunique['campaign']
del cat_nunique['y']
cat_nunique={key: sorted(value) for key, value in cat_nunique.items()}
print(cat_nunique)

In [None]:
job_order=list(df['job'].unique())
job_order.sort()
job_order

In [None]:
cat_nunique['month']=['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec']

In [None]:
# get total number of plots
num_plots=len(cat_nunique)*2

# create subplots
fig,axes=plt.subplots(num_plots,1,figsize=(15,num_plots*4))
plt.suptitle(t='Counts of Categorical Variables in Dataset',y=.999)
plt.tight_layout()

# flatten axes for easy indexing
axes=axes.flatten()

# plot each column
for i, (col, order) in enumerate(cat_nunique.items()):
#     plot 'no' part
    ax1=sns.countplot(data=df_no,x=col,palette='colorblind', dodge=True, order=order,ax=axes[i*2])
    # ax1.set_title(f'{col.capitalize()} Distribution for Failed Campaigns')
    for p in ax1.patches:
        ax1.annotate(format(p.get_height(), '.0f'),
                     (p.get_x() + p.get_width() / 2., p.get_height()),
                     ha = 'center', va = 'center',
                     xytext = (0, 4),
                     textcoords = 'offset points')
    ax1.text(ax1.get_xlim()[1]+(ax1.get_xlim()[1])*(-0.17),
             ax1.get_ylim()[1] - (ax1.get_ylim()[1])*(1/5),
             f'Variable: {col.capitalize()}\nFailed Campaigns', bbox=dict(facecolor='white', alpha=0.5))

    # Plot 'yes' part
    ax2 = sns.countplot(data=df_yes,x=col,palette='colorblind',dodge=True,order=order,ax=axes[i*2 + 1])
    # ax2.set_title(f'{col.capitalize()} Distribution for Successful Campaigns')
    for p in ax2.patches:
        ax2.annotate(format(p.get_height(), '.0f'),
                     (p.get_x() + p.get_width() / 2., p.get_height()),
                     ha = 'center', va = 'center',
                     xytext = (0, 4),
                     textcoords = 'offset points')
    ax2.text(ax2.get_xlim()[1]+(ax2.get_xlim()[1])*(-0.17),
             ax2.get_ylim()[1] - (ax2.get_ylim()[1])*(1/5),
             f'Variable: {col.capitalize()}\nSuccessful Campaigns', bbox=dict(facecolor='white', alpha=0.5))

plt.savefig('../figs/2_countcategorical.pdf')
plt.savefig('../figs/2_countcategorical.png')
plt.show()

There is a lot to observe here, but note that although the values differ drastically between successful and failed campaigns, the patterns are similar for most of the features.

Also notable is that there were no calls made to customers in the month of September.
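One way to confirm an observation like this without reading it off the plot is to compare the months present in the data against the full calendar. The sketch below uses a made-up `toy` frame rather than `df`; the same list comprehension would apply to `df['month']`.

```python
import pandas as pd

all_months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
              'jul', 'aug', 'sep', 'oct', 'nov', 'dec']

# made-up contact months (not the bank dataset)
toy = pd.DataFrame({'month': ['may', 'jun', 'may', 'nov', 'jan']})

# months of the year that never appear in the column
present = set(toy['month'])
missing = [m for m in all_months if m not in present]

print(missing)
```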

#### Figure 2: Histograms of Continuous Features <a name='fig2'></a>

In [None]:
# make dictionary of just the continuous variables
num_nunique=copy.deepcopy(dict_nunique)
del num_nunique['job']
del num_nunique['marital']
del num_nunique['education']
del num_nunique['default']
del num_nunique['housing']
del num_nunique['loan']
del num_nunique['contact']
del num_nunique['day']
del num_nunique['month']
del num_nunique['y']
num_nunique

In [None]:
# get total number of plots
num_plots=len(num_nunique)*2

# create subplots
fig,axes=plt.subplots(num_plots,1,figsize=(15,num_plots*4))
plt.suptitle(t='Histograms of Continuous Variables in Dataset',y=.999)
plt.tight_layout()

# flatten axes for easy indexing
axes=axes.flatten()

# plot each column
for i, (col, order) in enumerate(num_nunique.items()):
#     plot 'no' part
    ax1=sns.histplot(data=df_no,x=col,color='cornflowerblue',ax=axes[i*2])
    ax1.text(ax1.get_xlim()[1]+(ax1.get_xlim()[1])*(-0.17),
             ax1.get_ylim()[1] - (ax1.get_ylim()[1])*(1/5),
             f'Variable: {col.capitalize()}\nFailed Campaigns', bbox=dict(facecolor='white', alpha=0.5))

    # Plot 'yes' part
    ax2 = sns.histplot(data=df_yes,x=col,color='orange',ax=axes[i*2 + 1])
    ax2.text(ax2.get_xlim()[1]+(ax2.get_xlim()[1])*(-0.17),
             ax2.get_ylim()[1] - (ax2.get_ylim()[1])*(1/5),
             f'Variable: {col.capitalize()}\nSuccessful Campaigns', bbox=dict(facecolor='white', alpha=0.5))

plt.savefig('../figs/2_histograms.pdf')
plt.savefig('../figs/2_histograms.png')

The patterns in the continuous data are mostly similar between successful and failed campaigns, although the X and Y axis ranges differ. The one difference I see is that the distribution of `duration` is wider for successful campaigns than for failed ones. Boxplots may clear this up for us.

#### Figure 3: Boxplots of Continuous Features <a name='fig3'></a>

In [None]:
order

In [None]:
# get total number of plots
num_plots=len(num_nunique)*2

# create subplots
fig,axes=plt.subplots(num_plots,1,figsize=(15,num_plots*4))
plt.suptitle(t='Boxplots of Continuous Variables in Dataset',y=.999)
plt.tight_layout()

# flatten axes for easy indexing
axes=axes.flatten()

# plot each column
for i, (col, order) in enumerate(num_nunique.items()):
#     plot 'no' part
    ax1=sns.boxplot(data=df_no,x=col,color='cornflowerblue',ax=axes[i*2])
    ax1.text(ax1.get_xlim()[1]+(ax1.get_xlim()[1])*(-0.17),
             ax1.get_ylim()[1] - (ax1.get_ylim()[1])*(1/5),
             f'Variable: {col.capitalize()}\nFailed Campaigns', bbox=dict(facecolor='white', alpha=0.5))

    # Plot 'yes' part
    ax2 = sns.boxplot(data=df_yes,x=col,color='orange',ax=axes[i*2 + 1])
    ax2.text(ax2.get_xlim()[1]+(ax2.get_xlim()[1])*(-0.17),
             ax2.get_ylim()[1] - (ax2.get_ylim()[1])*(1/5),
             f'Variable: {col.capitalize()}\nSuccessful Campaigns', bbox=dict(facecolor='white', alpha=0.5))

plt.savefig('../figs/2_boxplots.pdf')
plt.savefig('../figs/2_boxplots.png')

Duration does indeed seem different, though recall that this feature is describing how long the last phone call was with the customer. It may not tell us that much.

#### Figure 4: Correlation Matrix of Continuous Features <a name='fig4'></a>

In [None]:
df_num=df[['age','balance','duration','campaign']]

In [None]:
# compute correlation matrix
corr=df_num[['age','balance','duration','campaign']].corr()

# generate mask for the upper triangle
mask=np.triu(np.ones_like(corr, dtype=bool))

# set up matplotlib figure
f,ax = plt.subplots(figsize=(5, 4))

# draw heatmap with the mask and correct aspect ratio
sns.heatmap(corr,mask=mask,cmap='coolwarm',#vmax=1,vmin=-1,
            center=0,
            square=True,linewidths=.5,annot=True,
            fmt='.2f',cbar_kws={"shrink":.5})
plt.title('Correlation Matrix of Numerical Features\n$Higher$ $absolute$ $value$ $indicates$ $stronger$ $correlation$')
plt.tight_layout()

# save fig
plt.savefig('../figs/2_corrmatrix_num.pdf')
plt.savefig('../figs/2_corrmatrix_num.png')

It's good to see that there are no strong correlations among the numerical features. The `age`:`balance` relationship makes sense: the older you are, the longer you have had to accrue money.

Let's now look at the categorical data:

#### Figure 5: Correlation Matrix of Categorical Features <a name='fig5'></a>

In [None]:
# make a df of just the categorical values
df_cat=df[['job','marital','education','default','housing','loan','contact','day','month','y']]

In [None]:
def cramers_v(x, y):
    """Calculate Cramér's V statistic for categorical-categorical association."""
    confusion_matrix = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1)) / (n-1))
    rcorr = r - ((r-1)**2) / (n-1)
    kcorr = k - ((k-1)**2) / (n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))

def cramers_v_matrix(df):
    """Compute a matrix of Cramér's V statistics for all pairs of categorical columns in a DataFrame."""
    cols = df.columns
    n = len(cols)
    cv_matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            cv_matrix[i, j] = cramers_v(df[cols[i]], df[cols[j]])
    return pd.DataFrame(cv_matrix, index=cols, columns=cols)

# Compute Cramér's V matrix
cv_matrix = cramers_v_matrix(df_cat)

# generate mask for the upper triangle
mask = np.triu(np.ones_like(cv_matrix, dtype=bool))

# Plot the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(cv_matrix, annot=True, cmap='coolwarm', #vmin=-1, vmax=1,
            mask=mask, cbar_kws={"shrink": .8},fmt='.2f')

plt.title("Cramér's V Correlation Matrix")

# save fig
plt.savefig('../figs/2_corrmatrix_categorical.pdf')
plt.savefig('../figs/2_corrmatrix_categorical.png')

plt.show()

This is a great figure. Most correlations are very slight, but there are a few stronger correlations, like `contact`:`month`, `housing`:`month`, `job`:`education`, and `day`:`month`. These correlations mostly make sense.
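For reference, the `cramers_v` function above appears to implement the bias-corrected form of Cramér's V. Written out directly from the code, for a contingency table with $r$ rows, $k$ columns, and $n$ observations:

$$
\begin{aligned}
\varphi^2 &= \frac{\chi^2}{n}, \\
\tilde{\varphi}^2 &= \max\!\left(0,\; \varphi^2 - \frac{(k-1)(r-1)}{n-1}\right), \\
\tilde{r} &= r - \frac{(r-1)^2}{n-1}, \qquad \tilde{k} = k - \frac{(k-1)^2}{n-1}, \\
V &= \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k}-1,\; \tilde{r}-1)}}
\end{aligned}
$$

Values lie between 0 (no association) and 1 (perfect association), so unlike a Pearson correlation there is no sign to interpret.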

#### What about Scatterplots? <a name='scat'></a>

Scatterplots do not seem to give us much insight. The data points are very dispersed and a pattern does not readily emerge:

In [None]:
sns.pairplot(data=df,hue='y')

plt.savefig('../figs/2_pairplot.pdf')
plt.savefig('../figs/2_pairplot.png')

In [None]:
# reinstate warning labels
import warnings
warnings.filterwarnings("default", module="seaborn")

## Modeling <a name='mod'></a>

### Goals recap

To achieve this project's goals, we have to run models. As a reminder, this project is aiming to predict customer behavior. Specifically, we are training models to determine if a customer will buy a term deposit loan.

We are aiming to achieve ≥81% accuracy with the modeling
  * Use a 5-fold cross validation strategy and take the average performance score.
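As a minimal sketch of what "5-fold cross validation, averaged" means in scikit-learn terms, the example below scores a classifier on synthetic data. The data from `make_classification`, the variable names, and the choice of `StratifiedKFold` are assumptions for illustration; this is not the evaluation run in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# synthetic, imbalanced stand-in data (not the bank dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     weights=[0.9, 0.1], random_state=0)

# five folds that each keep roughly the same class ratio
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# one accuracy score per fold
scores = cross_val_score(GradientBoostingClassifier(random_state=0),
                         X_demo, y_demo, cv=cv, scoring='accuracy')

# the reported figure is the average over the five folds
mean_accuracy = scores.mean()
print(scores, mean_accuracy)
```

Stratified folds are a reasonable default when the classes are imbalanced, since each fold then contains some positive cases.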

Bonus goals include:
* Determine which customers are most likely to buy the term deposit loan
  * Which segments of customers should the client prioritize?
* Determine what makes the customer buy the loan
  * Which feature should the startup focus on?

### PyCaret

[PyCaret](https://pycaret.gitbook.io/docs) is a library that makes it easy to compare the performance of different ML algorithms, so that we can spend our time optimizing the best one.

Classification using the OOP syntax, building on the example from [pycaret.gitbook.io](https://pycaret.gitbook.io):

The results of PyCaret show that Gradient Boosting Classifier gave the best accuracy, at almost 94%!

In [None]:
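# PyCaret functions used in this section (the earlier import cell has them commented out)
from pycaret.classification import setup, pull, compare_models, create_model, plot_model, evaluate_model

clf1 = setup(df,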
             target = 'y',
             session_id=seed,
             log_experiment=True,
             experiment_name='clf1')

In [None]:
# save setup results
setup_results=pull()
# print(setup_results)
# setup_results.to_csv('../joblib/2_pycaret_setupresults.csv')
from google.colab import files
setup_results.to_csv('2_pycaret_setupresults.csv',encoding='utf-8-sig')
files.download('2_pycaret_setupresults.csv')

In [None]:
best_model=compare_models(fold=5)

# save the model comparison results
# (stored under a separate name so the best model object is not overwritten)
# best_model.to_csv('../joblib/2_pycaret_bestmodel.csv')
compare_results=pull()
compare_results.to_csv('2_pycaret_bestmodel.csv',encoding='utf-8-sig')
files.download('2_pycaret_bestmodel.csv')

In [None]:
gbc_model=create_model('gbc')

# save gbc_model
# gbc_model.pull()
# gbc_model.to_csv('../joblib/2_pycaret_gbcmodel.csv')
# gbc_model.to_csv('2_pycaret_gbcmodel.csv',encoding='utf-8-sig')
# files.download('2_pycaret_gbcmodel.csv')

In [None]:
feature_importances=plot_model(gbc_model,plot='feature',save=True)
feature_importances=plot_model(gbc_model,plot='feature')
# save feature_importances
# feature_importances.pull()
# feature_importances.to_csv('2_pycaret_featureimportances.csv',encoding='utf-8-sig')
# files.download('2_pycaret_featureimportances.csv')

According to PyCaret, `duration` is the most important feature for predicting campaign success.

In [None]:
evaluate_model(gbc_model)

In [None]:
# plot AUC
plot_model(gbc_model, plot = 'auc')

The AUC-ROC curve is looking pretty healthy: an AUC of over 90% is great. As this is just the base model, let's move to a more rigorous modeling strategy to help us answer this project's questions.

The results of the PyCaret experimentation show that the Gradient Boosting Classifier algorithm is best suited for the data. Let's use that for our modeling efforts.

### Will a customer purchase a loan?

To answer this question, we need to prepare the dataset so that we have a training and a testing set.

#### Data Preparation

##### Process categorical and continuous columns

In [9]:
df.head(3)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,no


In [10]:
# make copy to preserve our progress
df_modeling=copy.deepcopy(df)

First, let's convert the binary categorical columns to 0/1. This will help me keep track of my progress, as I'll be able to clearly see which columns still need to be processed. Others, like `job` and `education`, will need to be one-hot encoded.

In [11]:
df_modeling['loan'].value_counts()

loan
no     33070
yes     6930
Name: count, dtype: int64

In [12]:
df_modeling['y']=df_modeling['y'].map({'yes': 1, 'no': 0})
df_modeling['default']=df_modeling['default'].map({'yes': 1, 'no': 0})
df_modeling['housing']=df_modeling['housing'].map({'yes': 1, 'no': 0})
df_modeling['loan']=df_modeling['loan'].map({'yes': 1, 'no': 0})

In [13]:
df_modeling.head(3)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,y
0,58,management,married,tertiary,0,2143,1,0,unknown,5,may,261,1,0
1,44,technician,single,secondary,0,29,1,0,unknown,5,may,151,1,0
2,33,entrepreneur,married,secondary,0,2,1,1,unknown,5,may,76,1,0


Those are all the binary features. We still have to separate out the continuous variables and one-hot encode the rest of the categorical columns.
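As a small illustration of the encoding step, the sketch below applies `pd.get_dummies` with `drop_first=True` to a made-up `marital` column. The `toy` frame is hypothetical; the arguments mirror the ones used later in this notebook.

```python
import pandas as pd

# made-up categorical column (not the bank dataset)
toy = pd.DataFrame({'marital': ['married', 'single', 'divorced', 'single']})

# one 0/1 column per category, dropping the first (alphabetical) category
encoded = pd.get_dummies(toy, prefix=['marital'], drop_first=True, dtype='int')

print(encoded)
```

With `drop_first=True`, the dropped category (`divorced` here) is represented by a row of all zeros, which avoids a redundant column.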

In [14]:
# define the categorical columns
cat_cols=['job','marital','education','default','housing','loan','contact','month']
# take a copy so later edits to df_cat do not touch df_modeling
df_cat=df_modeling[cat_cols].copy()

In [15]:
df_cat.head(3)

Unnamed: 0,job,marital,education,default,housing,loan,contact,month
0,management,married,tertiary,0,1,0,unknown,may
1,technician,single,secondary,0,1,0,unknown,may
2,entrepreneur,married,secondary,0,1,1,unknown,may


In [16]:
# make dataframe of continuous variables

# set of categorical columns
df_cat_set = set(df_cat.columns)
# set of all columns
df_modeling_set = set(df_modeling.columns)

# Find columns that are in df_modeling but not in df_cat
difference = df_modeling_set - df_cat_set

# Convert the set to list and name it the continuous
cont_cols = list(difference)
cont_cols.remove('y')

# print("Columns in DataFrame but not in the list:\n",cont_cols)

df_cont=df_modeling[cont_cols]
df_cont.head(3)

Unnamed: 0,campaign,balance,day,age,duration
0,1,2143,5,58,261
1,1,29,5,44,151
2,1,2,5,33,76


In [17]:
# convert the categorical columns to the 'category' dtype
# (astype returns a new DataFrame, which avoids chained-assignment warnings)
df_cat=df_cat.astype('category')

In [18]:
df_cat.head(3)

Unnamed: 0,job,marital,education,default,housing,loan,contact,month
0,management,married,tertiary,0,1,0,unknown,may
1,technician,single,secondary,0,1,0,unknown,may
2,entrepreneur,married,secondary,0,1,1,unknown,may


In [19]:
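# one-hot encode the non-binary categorical columns
# note: df_cat is overwritten below with only the encoded columns, so the
# binary columns ('default', 'housing', 'loan') do not carry over into df_cat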
already_encoded=['default','housing','loan','day']
columns_to_encode = [col for col in df_cat.columns if col not in already_encoded]
prefixes=columns_to_encode

# apply get_dummies
df_cat=pd.get_dummies(data=df_cat[columns_to_encode],prefix=prefixes,drop_first=True,dtype='int')

# confirm
df_cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   job_blue-collar      40000 non-null  int64
 1   job_entrepreneur     40000 non-null  int64
 2   job_housemaid        40000 non-null  int64
 3   job_management       40000 non-null  int64
 4   job_retired          40000 non-null  int64
 5   job_self-employed    40000 non-null  int64
 6   job_services         40000 non-null  int64
 7   job_student          40000 non-null  int64
 8   job_technician       40000 non-null  int64
 9   job_unemployed       40000 non-null  int64
 10  job_unknown          40000 non-null  int64
 11  marital_married      40000 non-null  int64
 12  marital_single       40000 non-null  int64
 13  education_secondary  40000 non-null  int64
 14  education_tertiary   40000 non-null  int64
 15  education_unknown    40000 non-null  int64
 16  contact_telephone    4
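
The 28 columns reported above contain only the dummy variables, which suggests the binary `default`, `housing`, and `loan` columns did not make it into the encoded frame. A minimal sketch of one way to carry them along, using a tiny made-up frame (`toy` and its values are illustrative assumptions, not the project data):

```python
import pandas as pd

# Toy stand-in for df_cat (values are illustrative only)
toy = pd.DataFrame({
    "job": ["management", "technician", "admin"],
    "marital": ["married", "single", "married"],
    "default": [0, 0, 1],
    "housing": [1, 1, 0],
    "loan": [0, 1, 0],
})

already_encoded = ["default", "housing", "loan"]
columns_to_encode = [c for c in toy.columns if c not in already_encoded]

# One-hot encode only the string-valued columns
dummies = pd.get_dummies(
    toy[columns_to_encode],
    prefix=columns_to_encode,
    drop_first=True,
    dtype="int",
)

# Re-attach the binary columns that the encoder skipped
toy_encoded = pd.concat([dummies, toy[already_encoded]], axis=1)
print(toy_encoded.columns.tolist())
```

Whether the binary flags should be kept is a modeling decision; the sketch only shows how they could be kept if wanted.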

In [20]:
# add the continuous and categorical columns together
df_x=pd.concat([df_cat,df_cont],axis=1)
df_x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 33 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   job_blue-collar      40000 non-null  int64
 1   job_entrepreneur     40000 non-null  int64
 2   job_housemaid        40000 non-null  int64
 3   job_management       40000 non-null  int64
 4   job_retired          40000 non-null  int64
 5   job_self-employed    40000 non-null  int64
 6   job_services         40000 non-null  int64
 7   job_student          40000 non-null  int64
 8   job_technician       40000 non-null  int64
 9   job_unemployed       40000 non-null  int64
 10  job_unknown          40000 non-null  int64
 11  marital_married      40000 non-null  int64
 12  marital_single       40000 non-null  int64
 13  education_secondary  40000 non-null  int64
 14  education_tertiary   40000 non-null  int64
 15  education_unknown    40000 non-null  int64
 16  contact_telephone    4

##### Define X and y

In [21]:
X=df_x
# select the target column as a 1-D Series
y=df_modeling['y']

##### `train_test_split`

In [22]:
# train/test split, stratified on y to preserve the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=test_size,
    stratify=y,
    random_state=seed,
)

Now we have the dataset ready to go.
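
As a quick illustration of what `stratify=y` does in the split, the sketch below splits a synthetic imbalanced target and compares the share of positives in each part. The data, `test_size=0.2`, and `seed=42` are assumptions for the example; the notebook sets its own values.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

seed = 42  # assumed value for this sketch

# Synthetic target with 10% positives
y_toy = pd.Series([1] * 100 + [0] * 900, name="y")
X_toy = pd.DataFrame({"x": np.arange(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=seed
)

# With stratification, both parts keep the 10% positive rate
print(y_tr.mean(), y_te.mean())
```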

In [23]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 33 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   job_blue-collar      40000 non-null  int64
 1   job_entrepreneur     40000 non-null  int64
 2   job_housemaid        40000 non-null  int64
 3   job_management       40000 non-null  int64
 4   job_retired          40000 non-null  int64
 5   job_self-employed    40000 non-null  int64
 6   job_services         40000 non-null  int64
 7   job_student          40000 non-null  int64
 8   job_technician       40000 non-null  int64
 9   job_unemployed       40000 non-null  int64
 10  job_unknown          40000 non-null  int64
 11  marital_married      40000 non-null  int64
 12  marital_single       40000 non-null  int64
 13  education_secondary  40000 non-null  int64
 14  education_tertiary   40000 non-null  int64
 15  education_unknown    40000 non-null  int64
 16  contact_telephone    4

In [6]:
df.head(3)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,no


In [None]:
# preprocess the data
X=

#### `Optuna`

To find good hyperparameters for our model, we will use [Optuna](https://optuna.readthedocs.io/en/stable/index.html). It is similar to frameworks like [Hyperopt](http://hyperopt.github.io/hyperopt/), which are designed to search a hyperparameter space efficiently. The first attempt below is written against Hyperopt's API (`hp`, `fmin`, `tpe`, `Trials`).

In [None]:
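from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

def objective(params):
    # `preprocessor` is the ColumnTransformer defined in the preprocessing cell;
    # it must exist before this function is called
    model = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', GradientBoostingClassifier(**params))
    ])

    # Perform 3-fold cross-validation on the training set
    scores = cross_val_score(model, X_train, y_train, cv=3, scoring='accuracy')
    # Hyperopt minimizes the loss, so return the negative mean accuracy
    return {'loss': -scores.mean(), 'status': STATUS_OK}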

# Define the hyperparameter space
space = {
    'learning_rate': hp.uniform('learning_rate', 0.01, 0.2),
    'n_estimators': hp.choice('n_estimators', range(50, 300)),
    'max_depth': hp.choice('max_depth', range(3, 15)),
    'min_samples_split': hp.choice('min_samples_split', range(2, 10)),
    'min_samples_leaf': hp.choice('min_samples_leaf', range(1, 10)),
    'subsample': hp.uniform('subsample', 0.5, 1.0)
}

# Run the optimization
trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=100,
            trials=trials)

We'll build a pipeline whose preprocessor scales the continuous features and one-hot encodes the categorical ones:

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# first, separate categorical from continuous features
categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()
continuous_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
print('features separated')

# create the preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), continuous_features),
        ('cat', OneHotEncoder(), categorical_features)
    ])
print('preprocessor created')
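
If `X` is still the one-hot encoded frame built earlier, every column is an integer, so the dtype-based selection would find no categorical columns and would scale the dummy variables too. The sketch below shows how the same kind of preprocessor behaves on a raw, unencoded frame. The `raw` frame and its values are made up for illustration, and `handle_unknown="ignore"` is an added assumption so that categories unseen during fitting do not raise an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy raw frame with the same kinds of columns as the bank data
raw = pd.DataFrame({
    "age": [58, 44, 33, 47],
    "balance": [2143, 29, 2, 1506],
    "job": ["management", "technician", "entrepreneur", "management"],
    "marital": ["married", "single", "married", "married"],
})

# Split columns by dtype: numeric ones get scaled, the rest get encoded
continuous_features = raw.select_dtypes(include="number").columns.tolist()
categorical_features = raw.select_dtypes(exclude="number").columns.tolist()

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), continuous_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

transformed = preprocessor.fit_transform(raw)
# 2 scaled columns + 3 job dummies + 2 marital dummies
print(transformed.shape)
```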

Then, using the parameters `hyperopt` found, we'll train the model and test its accuracy:

In [None]:
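from sklearn.metrics import accuracy_score

# fmin returns the *index* of each hp.choice option rather than its value,
# so map the indices back onto the ranges used in `space`
best_params = {
    'learning_rate': best['learning_rate'],
    'n_estimators': range(50, 300)[best['n_estimators']],
    'max_depth': range(3, 15)[best['max_depth']],
    'min_samples_split': range(2, 10)[best['min_samples_split']],
    'min_samples_leaf': range(1, 10)[best['min_samples_leaf']],
    'subsample': best['subsample']
}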

# Create a pipeline with the preprocessor and the classifier
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(**best_params))
])
print('model created')

# Train the model
model.fit(X_train, y_train)
print('model trained')

# Make predictions and evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy*100:.2f}%')
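
The project goal is stated as the average accuracy over 5-fold cross-validation rather than accuracy on a single test split. A minimal sketch of that evaluation, run on synthetic imbalanced data instead of the bank dataset (the data, the `n_estimators=50` setting, and the `scaler` step are assumptions for the example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with roughly 10% positives
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", GradientBoostingClassifier(n_estimators=50, random_state=0)),
])

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
print(f"Mean 5-fold accuracy: {scores.mean() * 100:.2f}%")
```

Because the preprocessing lives inside the pipeline, it is refit on the training portion of each fold, which avoids leaking statistics from the held-out fold.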

## Sklearn, AutoSklearn, Optuna

In [None]:
# from chatgpt

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import autosklearn.classification
import optuna

# Create a synthetic dataset
data = {
    'age': [25, 32, 47, 51, 62, 20, 27, 40, 34, 55],
    'salary': [50000, 60000, 120000, 85000, 95000, 40000, 45000, 80000, 75000, 90000],
    'gender': ['male', 'female', 'female', 'male', 'male', 'female', 'female', 'male', 'male', 'female'],
    'bought_insurance': [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Split into features and target
X = df.drop('bought_insurance', axis=1)
y = df['bought_insurance']

# Identify categorical and continuous columns
categorical_cols = ['gender']
continuous_cols = ['age', 'salary']

# Preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), continuous_cols),
        ('cat', OneHotEncoder(), categorical_cols)
    ])

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define Auto-sklearn classifier and pipeline
automl = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=60, per_run_time_limit=30)
pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', automl)])

# Train the pipeline
pipeline.fit(X_train, y_train)

# Evaluate the model
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Auto-sklearn accuracy: {accuracy:.2f}")

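# Define Optuna objective function for hyperparameter tuning
# note: each trial is scored on the test set, so the final accuracy
# reported below is an optimistic estimate
def objective(trial):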
    params = {
        'ensemble_size': trial.suggest_int('ensemble_size', 10, 50),
        'initial_configurations_via_metalearning': trial.suggest_int('initial_configurations_via_metalearning', 0, 25)
    }
    
    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=60, per_run_time_limit=30, 
        ensemble_size=params['ensemble_size'], 
        initial_configurations_via_metalearning=params['initial_configurations_via_metalearning']
    )
    
    pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', automl)])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    return accuracy

# Create Optuna study and optimize
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=10)

# Print the best hyperparameters
print(f"Best hyperparameters: {study.best_params}")

# Use the best hyperparameters found by Optuna to fit and evaluate the final model
best_params = study.best_params
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60, per_run_time_limit=30, 
    ensemble_size=best_params['ensemble_size'], 
    initial_configurations_via_metalearning=best_params['initial_configurations_via_metalearning']
)

pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', automl)])
pipeline.fit(X_train, y_train)

# Evaluate the final model
y_pred = pipeline.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)
print(f"Final accuracy: {final_accuracy:.2f}")
