<h1><center>Bank marketing analysis using Machine Learning</center></h1>

## Problem Statement

Improve a bank's marketing by analyzing its past marketing campaign data and recommending which customers to target.

The aim of this project is to devise a machine learning prediction algorithm so that the bank can better target its customers and channel its marketing efforts.

### Data Attributes 

The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).


**Attribute Information:**

1 - age (numeric)

2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')

3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)

4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')

5 - default: has credit in default? (categorical: 'no','yes','unknown')

6 - housing: has housing loan? (categorical: 'no','yes','unknown')

7 - loan: has personal loan? (categorical: 'no','yes','unknown')

8 - contact: contact communication type (categorical: 'cellular','telephone')

9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')

10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')

11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

16 - balance: balance of the customer

17 - y - has the client subscribed to a term deposit? (binary: 'yes','no'; the code below reads this target from the `deposit` column)
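
The note on `duration` (attribute 11) can be turned into a small preprocessing step. The following is a minimal sketch, assuming a toy DataFrame with made-up values and a target column named `y`; the names `toy`, `X_benchmark` and `X_realistic` are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for the campaign data; the values are made up for illustration.
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "duration": [120, 0, 560],
    "campaign": [1, 3, 2],
    "y": ["no", "no", "yes"],
})

# Benchmark feature set: keeps 'duration', which is only known after the call.
X_benchmark = toy.drop(columns=["y"])

# Realistic feature set: drops 'duration' so the model only sees
# information that is available before a customer is contacted.
X_realistic = toy.drop(columns=["y", "duration"])

print(list(X_benchmark.columns))
print(list(X_realistic.columns))
```

Comparing a model trained on each feature set would show how much of the score depends on information that is not available at prediction time.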

<a id='import_lib'></a>
## 1. Import Libraries

In [None]:
# import 'Pandas' 
import pandas as pd 

# import 'Numpy' 
import numpy as np

# import subpackage of Matplotlib
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# import 'Seaborn' 
import seaborn as sns

# to suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

# import train-test split 
from sklearn.model_selection import train_test_split

# import StandardScaler to perform scaling
from sklearn.preprocessing import StandardScaler 
from sklearn.preprocessing import MinMaxScaler 


# import various functions from sklearn 
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn import tree
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import StackingClassifier

# import XGBoost and the notebook image helper
from xgboost import XGBClassifier
from IPython.display import Image 

<a id='set_options'></a>
## 2. Set Options

In [None]:
# display all columns of the dataframe
pd.options.display.max_columns = None

# display all rows of the dataframe
pd.options.display.max_rows = None

# display float values with up to 6 decimals
pd.options.display.float_format = '{:.6f}'.format

<a id='Read_Data'></a>
## 3. Read Data

In [None]:
df=pd.read_csv("/kaggle/input/bank-marketing-dataset/bank.csv")

In [None]:
df.head()

<a id='data_preparation'></a>
## 4. Data Analysis and Preparation

<a id='Data_Understanding'></a>
### 4.1 Understand the Dataset

<a id='Data_Shape'></a>
### 4.1.1 Data Dimension

In [None]:
df.shape

<a id='Data_Types'></a>
### 4.1.2 Data Types
Data has a variety of data types. The main types stored in pandas dataframes are object, float, int64, bool and datetime64. In order to learn about each attribute, it is always good for us to know the data type of each column.

**1. Check data types**

In [None]:
df.info()

<a id='Summary_Statistics'></a>
### 4.1.3 Summary Statistics
**1. For numerical variables, we use .describe()**

In [None]:
df.describe()

**2. For categorical features, we use .describe(include=object)**

In [None]:
df.describe(include='object')

<a id='Missing_Values'></a>
### 4.1.4 Missing Values

In [None]:
df.isnull().sum()

**There are no missing values in the dataset**

### Visualize Missing Values using Heatmap

In [None]:
plt.figure(figsize=(15, 8))

sns.heatmap(df.isnull(), cbar=False)

plt.show()

<a id='correlation'></a>
### 4.1.5 Correlation

#### Correlation heatmap

<ul>
    <li>Correlation is the extent of linear relationship among numeric variables</li>
    <li>It indicates the extent to which two variables increase or decrease in parallel</li>
    <li>The value of a correlation coefficient ranges between -1 and 1</li>
    <li> Correlation among multiple variables can be represented in the form of a matrix. This allows us to see which pairs are correlated</li>
    </ul>
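
The points above can be illustrated with a small worked example. This is a sketch on made-up numbers, not on the bank data; `pearson_r` is a hypothetical helper written here only to show the definition of the Pearson coefficient, and `np.corrcoef` builds the corresponding correlation matrix.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation: covariance scaled by both standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()  # centre each variable on its mean
    b_c = b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum()))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r_up = pearson_r(x, 2 * x + 1)   # both variables increase together
r_down = pearson_r(x, -3 * x)    # one increases while the other decreases

# Correlation matrix of the three variables (each row is one variable).
r_matrix = np.corrcoef(np.vstack([x, 2 * x + 1, -3 * x]))

print(r_up, r_down)
print(r_matrix)
```

An exact linear relationship gives a coefficient at one of the two ends of the range; real features usually fall somewhere in between.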
    

In [None]:
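# restrict the correlation to numeric columns; recent pandas versions raise an error on object columns
cor=df.corr(numeric_only=True)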
plt.figure(figsize=(15, 8))

sns.heatmap(cor,annot=True)

plt.show()

**This reveals a clear relationship among age, balance, duration, and campaign.**

To investigate the correlation further, a correlation matrix was plotted with all qualitative variables. Clearly, “campaign outcome” has a strong correlation with “duration”, a moderate correlation with “previous contacts”, and mild correlations with “balance”, “month of contact” and “number of campaign”. Their influence on the campaign outcome will be investigated further in the machine learning part.

<a id='categorical'></a>
### 4.1.6 Analyze Categorical Variables

Categorical variables are those in which the values are labeled categories. The values, distribution, and dispersion of categorical variables are best understood with bar plots.

In [None]:
df.describe(include=object)

In [None]:
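# categorical predictors (the target 'deposit' is excluded)
df_categoric_features = df.select_dtypes(include='object').drop(['deposit'], axis=1)
# size the grid from the number of features so that zip() does not silently skip any of them
n_rows = int(np.ceil(df_categoric_features.shape[1] / 2))
fig, ax = plt.subplots(n_rows, 2, figsize=(25, 7 * n_rows))
for variable, subplot in zip(df_categoric_features, ax.flatten()):
    countplot = sns.countplot(y=df[variable], ax=subplot )
    countplot.set_ylabel(variable, fontsize = 30) 
plt.tight_layout()   
plt.show()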

In [None]:
for i in df_categoric_features:
    print(i.upper())
    print(df[i].value_counts())
    print( )

<a id='numerical'></a>
### 4.1.6 Analyze Numerical Variables

In [None]:
df_num=df.select_dtypes(include=np.number)
for i in df_num:
    # create one figure per numeric column so that every boxplot gets the same size
    plt.figure(figsize=(15, 8))
    sns.boxplot(x=df[i])
    plt.show()

In [None]:
plt.rcParams['figure.figsize'] = [15,8]
df.drop('deposit', axis = 1).hist()
plt.tight_layout()
plt.show()  
print('Skewness:')
# skewness is only defined for numeric columns
df.drop('deposit', axis = 1).skew(numeric_only=True)

<a id='Scaling the data'></a>
###  4.1.7 Scaling The Data

In [None]:
df_num=df.select_dtypes(include=np.number)
X_scaler = StandardScaler()
num_scaled = X_scaler.fit_transform(df_num)
X = pd.DataFrame(num_scaled, columns = df_num.columns)
X.head()

In [None]:
X.shape

In [None]:
X.skew()

<a id='encoding'></a>
### Encoding the categorical variable

In [None]:
df_cat=df.select_dtypes(exclude=np.number)

In [None]:
df_cat.head()

In [None]:
df_cat=df_cat.drop('deposit',axis=1)

In [None]:
df_cat.head()

In [None]:
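# dtype=int keeps the dummy columns numeric (0/1) so they can be combined with the scaled features
X_encode=pd.get_dummies(df_cat,columns=df_cat.columns,dtype=int)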
X_encode.head()

In [None]:
X_encode.shape

In [None]:
X.shape

In [None]:
x=pd.concat([X,X_encode],axis=1)

In [None]:
x.shape

In [None]:
x.head()

<a id='Target variable'></a>
### 4.1.8 Target variable

In [None]:
plt.figure(figsize=(20,10))
sns.countplot(x=df['deposit'])
plt.show()

In [None]:
df['deposit'].value_counts()

In [None]:
# encode the target as 1 ('yes') / 0 ('no') without modifying df in place
y=(df['deposit'] == 'yes').astype('int')
y.value_counts()

<a id='imbalance data'></a>
# Handling the imbalanced data

<a id='Train test split'></a>
# Train Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size =   0.3, random_state = 10) 

<a id='log_full model'></a>
# Logistic Regression Full Model

In [None]:
score_card = pd.DataFrame(columns=["Model Name",'Prob.Cutoff',"Stability","r2_score", 'AUC', 'Precision', 'Recall',
                                       'Accuracy', 'Kappa', 'f1-score'])
# score a statsmodels Logit model: its predict() returns probabilities, so a numeric cutoff is required
def update_score_card(Model_name,model,cutoff=0.5,stability="Stable"):
    y_pred_prob = model.predict(X_test)
    y_pred = [ 0 if x < cutoff else 1 for x in y_pred_prob]
    global score_card
    # for a Logit model the "r2_score" column holds the pseudo R-squared (model.prsquared)
    new_row = pd.DataFrame([{"Model Name":Model_name,
                             "Prob.Cutoff":cutoff,
                             'Stability': stability,
                             "r2_score":model.prsquared,
                             'AUC' : metrics.roc_auc_score(y_test, y_pred_prob),
                             'Precision': metrics.precision_score(y_test, y_pred),
                             'Recall': metrics.recall_score(y_test, y_pred),
                             'Accuracy': metrics.accuracy_score(y_test, y_pred),
                             'Kappa':metrics.cohen_kappa_score(y_test, y_pred),
                             'f1-score': metrics.f1_score(y_test, y_pred)}])
    # DataFrame.append was removed in pandas 2.0, so the row is added with pd.concat
    score_card = pd.concat([score_card, new_row], ignore_index = True)
    return score_card

In [None]:
def get_test_report(model):
    test_pred = model.predict(X_test)
    return(classification_report(y_test, test_pred))

In [None]:
def logisticRegression(x,y,lr):
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size =   0.3, random_state = 10) 
    
    # describes info about train and test set 
    print("Shape of X_train: ", X_train.shape) 
    print("Shape of y_train: ", y_train.shape) 
    print("Shape of X_test: ", X_test.shape) 
    print("Shape of y_test: ", y_test.shape) 
    
    # train the model on train set 
    lr.fit(X_train, y_train) 
    
    predictions = lr.predict(X_test) 

    # print classification report 
    print(classification_report(y_test, predictions)) 

    cm = confusion_matrix(y_test, predictions, labels=lr.classes_)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=lr.classes_)
    disp.plot() 

In [None]:
def plot_roc(model):
    y_pred_prob = model.predict_proba(X_test)[:,1]
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
    plt.plot(fpr, tpr)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.plot([0, 1], [0, 1],'r--')
    plt.title('ROC curve for Bank marketing Classifier', fontsize = 15)
    plt.xlabel('False positive rate (1-Specificity)', fontsize = 15)
    plt.ylabel('True positive rate (Sensitivity)', fontsize = 15)
    # format the label as a string; passing a tuple prints its raw representation on the plot
    plt.text(x = 0.02, y = 0.9, s = f'AUC Score: {round(roc_auc_score(y_test, y_pred_prob),4)}')
    plt.grid(True)

In [None]:
def plot_confusion_matrix(model):
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    conf_matrix = pd.DataFrame(data = cm,columns = ['Predicted:0','Predicted:1'], index = ['Actual:0','Actual:1'])
    sns.heatmap(conf_matrix, annot = True, fmt = 'd', cmap = ListedColormap(['lightskyblue']), cbar = False, 
                linewidths = 0.1, annot_kws = {'size':25})
    plt.xticks(fontsize = 20)
    plt.yticks(fontsize = 20)
    plt.show()

In [None]:
import statsmodels.api as sm
logreg = sm.Logit(y_train, X_train).fit()
print(logreg.summary())

In [None]:
print('AIC: ',logreg.aic)

In [None]:
df_odds = pd.DataFrame(np.exp(logreg.params), columns= ['Odds']) 
df_odds

In [None]:
y_pred_prob = logreg.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
youdens_table = pd.DataFrame({'TPR': tpr,
                             'FPR': fpr,
                             'Threshold': thresholds})
youdens_table['Difference'] = youdens_table.TPR - youdens_table.FPR
youdens_table = youdens_table.sort_values('Difference', ascending = False).reset_index(drop = True)
youdens_table.head()

In [None]:
y_pred_prob = logreg.predict(X_test)
y_pred_prob.head()

In [None]:
y_pred_prob = logreg.predict(X_test)
y_pred = [ 0 if x < 0.69 else 1 for x in y_pred_prob]
print(classification_report(y_test, y_pred))

## Confusion matrix

In [None]:
cm = confusion_matrix(y_test, y_pred)
conf_matrix = pd.DataFrame(data = cm,columns = ['Predicted:0','Predicted:1'], index = ['Actual:0','Actual:1'])
sns.heatmap(conf_matrix, annot = True, fmt = 'd', cmap = ListedColormap(['lightskyblue']), cbar = False,linewidths = 0.1, annot_kws = {'size':25})
plt.xticks(fontsize = 20)
plt.yticks(fontsize = 20)
plt.show()

In [None]:
TN = cm[0,0]
TP = cm[1,1]
FP = cm[0,1]
FN = cm[1,0]

In [None]:
precision = TP / (TP+FP)
print('Precision:',precision)
recall = TP / (TP+FN)
print('Recall:',recall)
specificity = TN / (TN+FP)
print('Specificity:',specificity)
f1_score = 2*((precision*recall)/(precision+recall))
print('f1_score:',f1_score)
accuracy = (TN+TP) / (TN+FP+FN+TP)
print('Accuracy:',accuracy)

In [None]:
kappa = cohen_kappa_score(y_test, y_pred)
print('kappa value:',kappa)

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.plot([0, 1], [0, 1],'r--')
plt.title('ROC curve for Bank marketing Classifier (Full Model)', fontsize = 15)
plt.xlabel('False positive rate (1-Specificity)', fontsize = 15)
plt.ylabel('True positive rate (Sensitivity)', fontsize = 15)
# format the label as a string; passing a tuple prints its raw representation on the plot
plt.text(x = 0.02, y = 0.9, s = f'AUC Score: {round(metrics.roc_auc_score(y_test, y_pred_prob),4)}')
plt.grid(True)

In [None]:
update_score_card("Simple Logistic Regression",logreg,cutoff=0.69,stability="Stable")

<a id='guassiannb_model'></a>
# Gaussian Naive Bayes model

In [None]:
# redefine the score card helper for scikit-learn style classifiers (predict / predict_proba)
def update_score_card(Model_name,model,cutoff="-",stability="Stable"):
    y_pred_prob = model.predict_proba(X_test)[:,1]
    y_pred=model.predict(X_test)
    global score_card
    new_row = pd.DataFrame([{"Model Name":Model_name,
                             "Prob.Cutoff":cutoff,
                             'Stability': stability,
                             "r2_score":metrics.r2_score(y_test, y_pred),
                             'AUC' : metrics.roc_auc_score(y_test, y_pred_prob),
                             'Precision': metrics.precision_score(y_test, y_pred),
                             'Recall': metrics.recall_score(y_test, y_pred),
                             'Accuracy': metrics.accuracy_score(y_test, y_pred),
                             'Kappa':metrics.cohen_kappa_score(y_test, y_pred),
                             'f1-score': metrics.f1_score(y_test, y_pred)}])
    # DataFrame.append was removed in pandas 2.0, so the row is added with pd.concat
    score_card = pd.concat([score_card, new_row], ignore_index = True)
    return score_card

In [None]:
gnb = GaussianNB()
gnb_model = gnb.fit(X_train, y_train)
plot_confusion_matrix(gnb_model)

In [None]:
test_report = get_test_report(gnb_model)
print(test_report)

In [None]:
plot_roc(gnb_model)

In [None]:
update_score_card("gNB Classifier",gnb_model,stability="Moderate")

<a id='knn_model'></a>
# KNN model

In [None]:
knn_classification = KNeighborsClassifier(n_neighbors = 3)
knn_model = knn_classification.fit(X_train, y_train)
plot_confusion_matrix(knn_model)

In [None]:
test_report = get_test_report(knn_model)
print(test_report)

In [None]:
plot_roc(knn_model)

In [None]:
update_score_card("KNN Classifier",knn_model,stability="Moderate")

<a id='decisiontree'></a>
# DECISION TREE

In [None]:
decision_tree_classification =DecisionTreeClassifier(criterion = 'entropy', random_state = 10)
decision_tree = decision_tree_classification.fit(X_train, y_train)

In [None]:
test_report = get_test_report(decision_tree)
print(test_report)

In [None]:
plot_roc(decision_tree)

In [None]:
update_score_card("Decision Tree Classifier",decision_tree,stability="Good")

<a id='randomforest'></a>

# RANDOM FOREST

In [None]:
rf_classification = RandomForestClassifier(n_estimators = 10, random_state = 10)
rf_model = rf_classification.fit(X_train, y_train)

In [None]:
test_report = get_test_report(rf_model)
print(test_report) 

In [None]:
plot_roc(rf_classification)

In [None]:
update_score_card("Random Forest Classifier",rf_model,stability="Good")

<a id='boosting'></a>

# BOOSTING TECHNIQUES

<a id='ADAboost'></a>

## ADABoost

In [None]:
ada_model = AdaBoostClassifier(n_estimators = 40, random_state = 10)
ada_model.fit(X_train, y_train)

In [None]:
plot_confusion_matrix(ada_model)

In [None]:
test_report = get_test_report(ada_model)
print(test_report) 

In [None]:
plot_roc(ada_model)

In [None]:
update_score_card("ADAboost classifier",ada_model,stability="Moderate")

<a id='gradboost'></a>

## Gradient Boost

In [None]:
gboost_model = GradientBoostingClassifier(n_estimators = 150, max_depth = 10, random_state = 10)
gboost_model.fit(X_train, y_train)

In [None]:
plot_confusion_matrix(gboost_model)

In [None]:
test_report = get_test_report(gboost_model)
print(test_report) 

In [None]:
plot_roc(gboost_model)

In [None]:
update_score_card("Gradient boost classifier",gboost_model,stability="Moderate")

<a id='xgboost'></a>

## XG Boost

In [None]:
xgb_model = XGBClassifier(max_depth = 10, gamma = 1)
xgb_model.fit(X_train, y_train)

In [None]:
plot_confusion_matrix(xgb_model)

In [None]:
test_report = get_test_report(xgb_model)
print(test_report) 

In [None]:
plot_roc(xgb_model)

In [None]:
update_score_card("XGBClassifier",xgb_model,stability="Moderate")

## Stacking Classifier

In [None]:
base_learners = [('rf_model', RandomForestClassifier(criterion = 'entropy', max_depth = 10, max_features = 'sqrt', 
                                                     max_leaf_nodes = 8, min_samples_leaf = 5, min_samples_split = 2, 
                                                     n_estimators = 50, random_state = 10)),
                 ('KNN_model', KNeighborsClassifier(n_neighbors = 17, metric = 'euclidean')),
                 ('NB_model', GaussianNB())]
stack_model = StackingClassifier(estimators = base_learners, final_estimator = GaussianNB())
stack_model.fit(X_train, y_train)

In [None]:
plot_confusion_matrix(stack_model)

In [None]:
test_report = get_test_report(stack_model)
print(test_report)

In [None]:
plot_roc(stack_model)

In [None]:
update_score_card("Stacking classifier",stack_model,stability="Moderate")

<a id='knngrid'></a>
# KNN model Tuned

In [None]:
# scikit-learn metric names are lowercase, so 'chebyshev' is used instead of 'Chebyshev'
tuned_parameters = {'n_neighbors': np.arange(1, 25, 2),
                    'metric': ['hamming','euclidean','manhattan','chebyshev']}
knn_classification = KNeighborsClassifier()
knn_grid = GridSearchCV(estimator = knn_classification, 
                        param_grid = tuned_parameters, 
                        cv = 5, 
                        scoring = 'accuracy')
knn_grid.fit(X_train, y_train)
print('Best parameters for KNN Classifier: ', knn_grid.best_params_, '\n')

In [None]:
knn_classification = KNeighborsClassifier(metric='euclidean',n_neighbors=19)
knn_model_tuned = knn_classification.fit(X_train, y_train)
plot_confusion_matrix(knn_model_tuned)

In [None]:
print(get_test_report(knn_model_tuned))

In [None]:
plot_roc(knn_model_tuned)

In [None]:
update_score_card("KNN classifier tuned",knn_model_tuned,stability="Moderate")

In [None]:
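# overlay the ROC curve of every fitted classifier on one set of axes,
# labelling each curve so that plt.legend() has entries to show
fitted_models = {'Gaussian NB': gnb_model,
                 'KNN': knn_model,
                 'KNN (tuned)': knn_model_tuned,
                 'Decision Tree': decision_tree,
                 'Random Forest': rf_model,
                 'AdaBoost': ada_model,
                 'Gradient Boost': gboost_model,
                 'XGBoost': xgb_model,
                 'Stacking': stack_model}
for name, model in fitted_models.items():
    y_pred_prob = model.predict_proba(X_test)[:,1]
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
    plt.plot(fpr, tpr, label = name)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])

# diagonal reference line of a random classifier
plt.plot([0, 1], [0, 1],'r--')

plt.title('ROC curve analysis', fontsize = 15)
plt.xlabel('False positive rate (1-Specificity)', fontsize = 15)
plt.ylabel('True positive rate (Sensitivity)', fontsize = 15)

plt.legend(prop={'size':13}, loc='lower right')
plt.grid(True)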
plt.grid(True)

**Compare the performance of the different models built**

1. Accuracy scores for the selected features range between 0.75 and 0.94 across the different models.

2. Precision scores for the selected features range between 0.74 and 0.95 across the different models.

3. The Gaussian Naïve Bayes classifier has lower precision, recall, accuracy and kappa scores overall than the other models built.

4. Simple Logistic Regression for the selected features is stable, whereas both Random Forest and Decision Tree show good stability.

**Which metric did we choose and why?**

The next step after implementing a machine learning algorithm is to find out how effective the model is, based on a chosen metric and dataset. Different performance metrics are used to evaluate different machine learning algorithms. For example, for a classifier used to distinguish between images of different objects, we can use classification performance metrics such as precision, accuracy, recall and cross-validation score.

A machine learning model cannot simply be tested on the training set, because the result would be biased: training has already tuned the model's predictions to the training dataset. Therefore, to estimate the generalization error, the model must be tested on a dataset it has not seen yet, which is called the testing dataset.

To test the model, we therefore need a labelled dataset. This can be obtained by splitting the available labelled data into a training dataset and a testing dataset, or by techniques such as k-fold cross-validation.
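
The two evaluation strategies described above can be sketched as follows. This is only an illustration on synthetic data generated with `make_classification`, not on the bank data; the names `X_demo`, `y_demo` and `clf`, and the choice of 5 folds, are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic labelled data standing in for a real dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=10)

# Hold out 30% of the rows; the model never sees them during training.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10
)

clf = LogisticRegression(max_iter=1000)

# 5-fold cross-validation on the training part: each fold is used once
# for validation while the other four folds are used for fitting.
cv_scores = cross_val_score(clf, X_tr, y_tr, cv=5, scoring="accuracy")

# Final check on the held-out test set.
clf.fit(X_tr, y_tr)
test_accuracy = clf.score(X_te, y_te)

print("Accuracy per fold:", cv_scores)
print("Mean CV accuracy:", cv_scores.mean())
print("Test accuracy:", test_accuracy)
```

A large gap between the mean cross-validation accuracy and the test accuracy would suggest that the estimate of the generalization error is unstable.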

**Which model has better performance on the test set?**

For binary classification model evaluation between random forest and logistic regression, our work focused on four distinct simulated datasets:    
(1) increasing the variance in the explanatory and noise variables,     
(2) increasing the number of noise variables,     
(3) increasing the number of explanatory variables,     
(4) increasing the number of observations.

To benchmark and compare the classification models built, metrics such as accuracy, area under the curve, true positive rate, false positive rate, and precision were analyzed.
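
As a reminder of how these metrics relate to the four cells of a confusion matrix, here is a minimal sketch on made-up labels. The arrays `y_true` and `y_hat` are hypothetical and are not results from the models in this notebook.

```python
import numpy as np

# Made-up ground truth and predictions for eight customers.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Count the four confusion-matrix cells.
tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # true positives
tn = int(np.sum((y_true == 0) & (y_hat == 0)))  # true negatives
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
true_positive_rate = tp / (tp + fn)   # also called recall or sensitivity
false_positive_rate = fp / (fp + tn)  # equals 1 - specificity

print(accuracy, precision, true_positive_rate, false_positive_rate)
```

The ROC curve plots the true positive rate against the false positive rate as the probability cutoff varies, and the area under that curve summarizes it in one number.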

The tuned KNN classifier achieved a better accuracy score than the other models, so by this metric it has the better performance on the test set.

# Conclusion 

According to the analysis above, a target customer profile can be established. The most responsive customers have the following features:

-> Feature 1: age < 30 or age > 60      
-> Feature 2: students or retired people       
-> Feature 3: specific months (dec, may, oct)     

By applying supervised learning classification techniques, including ensemble learning models and boosting techniques, the estimation models were successfully built. With these models, the bank will be able to predict a customer's response to the telemarketing campaign before calling the customer. In this way, the bank can allocate more marketing effort to the clients who are classified as highly likely to accept term deposits, and place fewer calls to those who are unlikely to make term deposits.
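
One way to turn the predictions into a call list is to rank customers by predicted probability and contact the top of the list first. The following is a rough sketch on synthetic data, not the procedure used in this notebook; `call_budget` and the other names are hypothetical, and for brevity the model is scored on the same rows it was trained on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic customers and responses standing in for the campaign data.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=10)

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Predicted probability of the positive class ("subscribes") per customer.
proba = model.predict_proba(X_demo)[:, 1]

# Hypothetical number of calls the bank can afford to make.
call_budget = 20

# Sort customers from most to least likely and keep the top of the list.
call_order = np.argsort(proba)[::-1]
to_call = call_order[:call_budget]

print("Customers to call first:", to_call)
print("Lowest probability on the call list:", proba[to_call].min())
```

In practice the ranking would be computed for customers who were not part of the training data.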