Problem Statement:

**Suppose you are a data scientist working for a bank that recently conducted a marketing campaign to promote term deposits to its clients. The bank collected data on various client characteristics, such as age, job type, marital status, education level, and more. Your task is to analyze this dataset and build a machine learning model to predict whether a client will subscribe to a term deposit or not.**

--------------------------------------------------------------------------------


**By accurately predicting client subscription behavior, your model will enable the bank to optimize its marketing efforts. It will help identify potential clients who are more likely to subscribe to the term deposit, allowing the bank to focus its resources on targeting these individuals. This targeted approach will not only increase the effectiveness of the marketing campaign but also maximize the bank's return on investment.**

# Importing required libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
#import all necessary libraries

In [None]:
%ls

## Upload data to colab and Read data

In [None]:
# Write your code here: read the CSV file. Hint: to load it correctly into a DataFrame you will need a separator.
df = pd.read_csv('bank-additional-full.csv', delimiter=';')
df.head()

## Data Introduction

## Understand the data columns
1. Check if there are missing values and decide whether to impute or drop them
2. Understand descriptive statistics of each column using pandas `describe`:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html

By default, `describe` summarizes only numerical columns. How can you also obtain descriptive statistics for non-numerical columns?
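One possible answer, sketched on a small made-up frame rather than the bank data: `describe` accepts `include` and `exclude` arguments that control which dtypes are summarized. The `toy` frame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the bank data (values are made up).
toy = pd.DataFrame({
    "age": [30, 41, 56, 41],
    "job": ["admin.", "technician", "admin.", "unknown"],
    "marital": ["married", "single", "married", "married"],
})

# Excluding numeric dtypes leaves the text columns; for those, describe()
# reports count, unique, top (most frequent value) and freq (its count).
categorical_summary = toy.describe(exclude=[np.number])
print(categorical_summary)

# include="all" summarizes every column; statistics that do not apply
# to a given column are filled with NaN.
full_summary = toy.describe(include="all")
print(full_summary)
```

On the real data the same calls would be `df.describe(exclude=[np.number])` and `df.describe(include="all")`.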

In [None]:
df.columns

In [None]:
df.shape

In [None]:
# Write your code here: check whether any feature has missing values.
df.isna().sum()

In [None]:
#your code here
# Keep only those features with less than 20% of missing values
missing_report = df.isna().sum()/len(df)
features_ss1 = missing_report[missing_report<0.2].index
print(features_ss1) # SS1 stands for Subset 1
df = df[features_ss1]
df.head()
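`isna()` only counts real `NaN` values. The EDA findings later in this notebook note many "unknown" entries in the categorical variables, and those would not be caught by the check above. A minimal sketch of how such placeholder strings could be counted, using a made-up `toy` frame (an assumption, not the bank data):

```python
import pandas as pd

# Hypothetical toy frame; "unknown" plays the role of a missing-value placeholder.
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "unknown"],
    "default": ["no", "unknown", "no", "no"],
    "age": [30, 41, 56, 38],
})

# isna() reports zero missing values because "unknown" is an ordinary string.
na_counts = toy.isna().sum()
print(na_counts)

# Share of "unknown" entries per non-numeric column.
text_columns = toy.select_dtypes(exclude="number")
unknown_share = (text_columns == "unknown").mean()
print(unknown_share)
```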

## Please do the following steps

1. Begin by conducting exploratory data analysis (EDA) to gain a comprehensive understanding of the dataset. Visualize the data, compute summary statistics, and identify any patterns or insights.

-------------------------------------------------------------------------------


2. Preprocess the dataset by handling missing values, addressing categorical variables, and performing necessary data transformations. This step ensures that the data is in a suitable format for machine learning algorithms.

-------------------------------------------------------------------------------

3. Split the dataset into training and testing sets for model evaluation purposes

## EDA

My EDA will have 4 steps:
1) Outcome Exploration
2) Univariate Exploration of Quantitative Input Variables
3) Univariate Exploration of Categorical Input Variables
4) Bivariate Exploration of Quantitative Input Variables

With more time we could do Bivariate Exploration with Outcome vs All Inputs.

### 1) Outcome

In [None]:
# Write your code here: a countplot with "y" (from the data) on the x-axis.
%matplotlib inline
fig, ax = plt.subplots(figsize=(10,4))
sns.countplot(data=df, x="y")
ax.set(xlabel='Term Deposit', ylabel='')
ax.set_title('Subscribe Y Variable', size=20)

yes_cases = (df['y']=='yes').sum()
print(f'y=yes represents {round((yes_cases/len(df))*100,2)}% of the cases')

### 2) Quantitative X's

In [None]:
# Split Quantitative from Categorical
x_quantitative = ['age', 'duration', 'campaign', 'pdays','previous', 'emp.var.rate',
                 'cons.price.idx','cons.conf.idx', 'euribor3m', 'nr.employed']
y = ['y']
x_categorical = [feature for feature in df.columns if ((feature not in x_quantitative) and (feature not in y)) ]

In [None]:
df.describe()

In [None]:
def histplot_visual(data: pd.DataFrame, columns: list[str]) -> None:
  """Create a histogram plot using a subset of variables specified.

  Args:
    data: Input data-frame containing variables we wish to plot.
    columns: Listing of column-names we wish to plot (must be contained within data).
  """
  fig, ax = plt.subplots(2, 5, figsize=(15, 6))
  fig.suptitle('Histogram for each numeric variable in our data',y=1, size=20)
  ax=ax.flatten()
  for i,feature in enumerate(columns):
    # Setting option `kde=True` allows for a Kernel Density Estimate (i.e. PDF).
    sns.histplot(data=data[feature],ax=ax[i], kde=True)
  plt.tight_layout()

# Invoke our function defined above.
histplot_visual(data=df, columns=x_quantitative)

### 3) Categorical Variables

In [None]:
for x in x_categorical:
    print(df[x].value_counts(normalize=True))

In [None]:
def count_plots(data: pd.DataFrame, columns: list[str]) -> None:
  """Create multiple plots using a subset of variables specified.

  Args:
    data: Input data-frame containing variables we wish to plot.
    columns: Listing of column-names we wish to plot (must be contained within data).
  """
  fig, axes = plt.subplots(2, 5, figsize=(15, 6))
  fig.suptitle('Countplot for each categorical variable in our data',y=1, size=20)
  axes=axes.flatten()
  for i,feature in enumerate(columns):
    # One bar per category level, with bar height equal to the number of rows.
    sns.countplot(data=data, x=feature, ax=axes[i])
  plt.tight_layout()

# Invoke our function defined above.
count_plots(data=df, columns=x_categorical)

### 4) Correlation

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(df.corr(numeric_only=True),annot=True, cmap='coolwarm')
plt.show()

In [None]:
plt.figure(figsize=(5,5))
sns.heatmap(df[['emp.var.rate','euribor3m', 'nr.employed']].corr(numeric_only=True),annot=True, cmap='coolwarm')
plt.show()

### EDA Findings
- y=yes -> about 11% of the cases
- No Missings :)
- pdays has too many 999
- campaign, duration and previous are very skewed (log or categorization can help logistic regression)
- A lot of Unknown in categorical variables
- emp.var.rate is very correlated with euribor3m (Euro Interbank Offered Rate 3 months) and nr.employed
### Proposed Actions
- pdays=999 becomes a binary variable
- remove euribor3m
- one-hot encoding for categorical variables
### Extra proposed actions if we have time
- Apply transformations to campaign, duration and previous
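
The extra action above is not implemented in this notebook. As a hedged sketch of what a log transformation could look like, the snippet below applies `np.log1p` to a made-up, right-skewed column named `campaign` (the values are assumptions, not taken from the dataset) and compares skewness before and after.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed counts, loosely imitating `campaign` or `previous`.
toy = pd.DataFrame({"campaign": [1, 1, 2, 2, 3, 5, 8, 40]})

# log1p computes log(1 + x), so zeros stay defined (log1p(0) == 0).
toy["campaign_log"] = np.log1p(toy["campaign"])

# Sample skewness before and after the transform.
skew_before = toy["campaign"].skew()
skew_after = toy["campaign_log"].skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

`log1p` is chosen over a plain `log` here because count-like variables such as `previous` can contain zeros.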

# Feature transformation / Pre processing

In [None]:
df['pdays999']=(df['pdays']==999)
df = df.drop(['pdays','euribor3m'],axis=1)
df.head()

In [None]:
df.columns

In [None]:
x_categorical.append('pdays999')
x_categorical

In [None]:
df = pd.get_dummies(data=df, columns=x_categorical, drop_first=True)

In [None]:
df.head()

In [None]:
df.columns

In [None]:
X = df.drop('y', axis=1)
y = df['y']
# Splitting our dataset between training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=100)

Train and evaluate various classification models, such as logistic regression, support vector machines etc. Compare the performance of these models to identify the most accurate one for the task at hand.

-------------------------------------------------------------------------------

Fine-tune the selected model by adjusting hyperparameters. Use regularization techniques.

Ensure you transform the labels and feature data before you do this step
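
A minimal sketch of how these steps might fit together, assuming synthetic data from `make_classification` in place of the bank data: labels are encoded as 0/1, features are standardized inside a pipeline, and the regularization strength `C` of a logistic regression is tuned with `GridSearchCV`. The grid values and the choice of F1 as the scoring metric are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the bank data (roughly 11% positives).
X_demo, y_raw = make_classification(
    n_samples=600, n_features=8, weights=[0.89], random_state=0
)
# Imitate the notebook's string labels, then encode them as 0/1.
y_text = np.where(y_raw == 1, "yes", "no")
y_encoded = (y_text == "yes").astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_encoded, test_size=0.25, stratify=y_encoded, random_state=0
)

# Scaling lives inside the pipeline, so each CV fold is scaled using
# statistics from its own training part only.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Smaller C means stronger regularization (L2 is the default penalty).
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X_tr, y_tr)

best_C = search.best_params_["logisticregression__C"]
test_f1 = search.score(X_te, y_te)
print("best C:", best_C, "| test F1:", round(test_f1, 3))
```

Keeping the scaler inside the pipeline avoids leaking test-fold statistics into training, which matters once cross-validation is used for tuning.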

## Model

**Logistic Regression**
(Example provided)
Please ensure you use the right metric to evaluate the classifier
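
To see why accuracy alone can mislead here, consider a hypothetical baseline that always predicts "no" on data where 11% of the cases are "yes" (the proportion found in the EDA). The labels below are made up to match that proportion.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    recall_score,
)

# Hypothetical labels: 11 positives ("yes" = 1) and 89 negatives ("no" = 0).
y_true = np.array([1] * 11 + [0] * 89)
# Baseline that never predicts a subscription.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)
bal = balanced_accuracy_score(y_true, y_pred)

# High accuracy, yet not a single subscriber is found.
print(f"accuracy={acc:.2f} recall={rec:.2f} f1={f1:.2f} balanced_acc={bal:.2f}")
```

Recall, F1 and balanced accuracy expose the failure that plain accuracy hides, so they are reasonable candidates to report alongside it.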

In [None]:
# Fit a logistic regression model to the training data
model1 = LogisticRegression(random_state = 42, max_iter = 1000)
model1.fit(X_train, y_train)
pred_test = model1.predict(X_test)
accuracy = accuracy_score(y_test, pred_test)
print('Accuracy:', round(accuracy,4))

In [None]:
# Compute the accuracy of the model
from sklearn.metrics import accuracy_score, classification_report, roc_curve, auc

def accuracy_report(model,values_list):
    '''This function will assess model performance. Given a sklearn model it will Predict, and measure performance for both Test and Train Data'''
    #Train
    print('Train Data:\n-----------')
    pred_train = model.predict(X_train)
    accuracy_train = accuracy_score(y_train, pred_train)
    print('Accuracy:', round(accuracy_train,4))
    print(classification_report(y_train, pred_train, target_names = values_list))
    roc_plot(model,X_train,y_train,values_list)
    
    print('Test Data:\n----------')
    pred_test = model.predict(X_test)
    accuracy_test = accuracy_score(y_test, pred_test)
    print('Accuracy:', round(accuracy_test,4))
    print(classification_report(y_test, pred_test, target_names = values_list))
    roc_plot(model,X_test,y_test,values_list)


def roc_plot(model,X_data,y_data,values_list):
    y_scores = model.predict_proba(X_data)[:, 1]
    y_data = y_data.map({values_list[0]:0,values_list[1]:1})
    fpr, tpr, thresholds = roc_curve(y_data, y_scores)
    roc_auc = auc(fpr, tpr)
    plt.figure(figsize=(3, 2))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()

In [None]:
accuracy_report(model1,['no','yes'])

**SVM**

In [None]:
from sklearn.svm import SVC
model2 = SVC(kernel='linear', C=1, probability=True)
model2.fit(X_train, y_train)
accuracy_report(model2,['no','yes'])

In [None]:
from sklearn.svm import SVC
model3 = SVC(kernel='linear', C=0.5, probability=True)
model3.fit(X_train, y_train)
accuracy_report(model3,['no','yes'])

In [None]:
model4 = SVC(kernel='poly', C=1, probability=True)
model4.fit(X_train, y_train)
accuracy_report(model4,['no','yes'])

In [None]:
model5 = SVC(kernel='rbf', C=1, probability=True)
model5.fit(X_train, y_train)
accuracy_report(model5,['no','yes'])

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model6 = KNeighborsClassifier(n_neighbors=3)
model6.fit(X_train, y_train)
accuracy_report(model6,['no','yes'])

There is a known [issue](https://github.com/scikit-learn/scikit-learn/issues/26768) with `predict` and KNN. I will create a second version of our assessment function that passes NumPy arrays (`X.values`) to the model instead of DataFrames.


In [None]:
def accuracy_report_v2(model,values_list):
    '''This function will assess model performance. Given a sklearn model it will Predict, and measure performance for both Test and Train Data'''
    #Train
    print('Train Data:\n-----------')
    pred_train = model.predict(X_train.values)
    accuracy_train = accuracy_score(y_train, pred_train)
    print('Accuracy:', round(accuracy_train,4))
    print(classification_report(y_train, pred_train, target_names = values_list))
    roc_plot_v2(model,X_train,y_train,values_list)
    
    print('Test Data:\n----------')
    pred_test = model.predict(X_test.values)
    accuracy_test = accuracy_score(y_test, pred_test)
    print('Accuracy:', round(accuracy_test,4))
    print(classification_report(y_test, pred_test, target_names = values_list))
    roc_plot_v2(model,X_test,y_test,values_list)


def roc_plot_v2(model,X_data,y_data,values_list):
    y_scores = model.predict_proba(X_data.values)[:, 1]
    y_data = y_data.map({values_list[0]:0,values_list[1]:1})
    fpr, tpr, thresholds = roc_curve(y_data, y_scores)
    roc_auc = auc(fpr, tpr)
    plt.figure(figsize=(3, 2))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()

In [None]:
accuracy_report_v2(model6,['no','yes'])

In [None]:
model7 = KNeighborsClassifier(n_neighbors=5, algorithm = 'ball_tree')
model7.fit(X_train, y_train)
accuracy_report_v2(model7,['no','yes'])

In [None]:
model8 = KNeighborsClassifier(n_neighbors=8, algorithm = 'ball_tree') #Finding the optimal k raises a discussion of validation data set
model8.fit(X_train, y_train)
accuracy_report_v2(model8,['no','yes'])
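
On the validation question raised in the comment above: one common option is to pick `k` by cross-validation on the training data, so the test set is not used for model selection. A minimal sketch on synthetic data (the candidate values of `k` and the F1 scoring are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training split.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)

# Mean 5-fold cross-validated F1 for each candidate k.
cv_scores = {}
for k in [3, 5, 8, 15]:
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(knn, X_demo, y_demo, cv=5, scoring="f1").mean()

best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores, "-> best k:", best_k)
```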

In [None]:
from sklearn.naive_bayes import GaussianNB
model9 = GaussianNB()
model9.fit(X_train, y_train)
accuracy_report(model9,['no','yes'])

**Write a report**


Assess the model's performance using appropriate evaluation metrics such as accuracy, precision, recall, and F1-score. This evaluation will provide insights into how well the model can predict client subscription behavior.


Finally, present your findings and recommendations in a comprehensive report. Include details about the model's predictions, feature importance, and any potential insights gained from the analysis. Conclude the report with actionable recommendations for the bank based on the developed model.
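
For the feature importance part of the report, one simple option with logistic regression is to rank the coefficients fitted on standardized features. The sketch below uses synthetic data and hypothetical feature names (`feature_0` ... `feature_4`); it illustrates the idea and is not a result from the bank data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 5 features, of which 2 carry signal.
X_arr, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(5)]  # hypothetical names

# Standardize so that coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X_arr)
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y_demo)

# Rank features by absolute coefficient; the sign gives the direction of the effect.
coef_table = pd.Series(clf.coef_[0], index=feature_names).sort_values(
    key=abs, ascending=False
)
print(coef_table)
```

Coefficients of correlated features should be read with care, which is one reason the notebook drops `euribor3m` before modeling.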


As a data scientist, your deliverables will consist of a well-documented Jupyter Notebook or Python script that showcases your analysis, modeling approach, evaluation results, and conclusions. Additionally, prepare a comprehensive report summarizing your findings and recommendations for the bank based on the insights gained from the developed model.