# Practical Application III: Comparing Classifiers

**Overview**: In this practical application, your goal is to compare the performance of the classifiers we encountered in this section, namely K Nearest Neighbor, Logistic Regression, Decision Trees, and Support Vector Machines.  We will utilize a dataset related to marketing bank products over the telephone.  



### Getting Started

Our dataset comes from the UCI Machine Learning repository [link](https://archive.ics.uci.edu/ml/datasets/bank+marketing).  The data is from a Portuguese banking institution and is a collection of the results of multiple marketing campaigns.  We will make use of the article accompanying the dataset [here](CRISP-DM-BANK.pdf) for more information on the data and features.



### Problem 1: Understanding the Data

To gain a better understanding of the data, please read the information provided in the UCI link above, and examine the **Materials and Methods** section of the paper.  How many marketing campaigns does this data represent?

##### From the Materials and Methods section:
- GOAL: Increase the efficiency of directed campaigns for long-term deposit subscriptions by reducing the number of contacts required.
- Lift is the most commonly used metric to evaluate prediction models (Coppock 2002). In particular, the cumulative Lift curve is a percentage graph that divides the population into deciles, in which population members are placed based on their predicted probability of response. The deciles are sorted so that the highest responders fall in the first decile.
- 17 campaigns between May 2008 and Nov 2010, for a total of 79354 contacts.
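
To make the cumulative Lift description concrete, here is a minimal sketch of how a decile table could be computed. The function name `cumulative_lift_by_decile` and the synthetic scores and labels are assumptions made for illustration; they are not taken from the paper or from this dataset.

```python
import numpy as np
import pandas as pd

def cumulative_lift_by_decile(y_true, y_score, n_bins=10):
    """Return contacts, responders and cumulative lift per decile (hypothetical helper)."""
    frame = pd.DataFrame({"y": np.asarray(y_true), "score": np.asarray(y_score)})
    # Rank by predicted probability, highest first, then cut into equal-sized bins.
    frame = frame.sort_values("score", ascending=False).reset_index(drop=True)
    frame["decile"] = (np.arange(len(frame)) * n_bins // len(frame)) + 1

    table = frame.groupby("decile")["y"].agg(contacts="count", responders="sum")
    overall_rate = frame["y"].mean()
    # Response rate among everyone contacted up to and including this decile.
    cum_rate = table["responders"].cumsum() / table["contacts"].cumsum()
    table["cumulative_lift"] = cum_rate / overall_rate
    return table

# Synthetic scores and labels, used only for illustration.
rng = np.random.default_rng(0)
scores = rng.uniform(size=1000)
labels = (rng.uniform(size=1000) < scores * 0.3).astype(int)
lift_table = cumulative_lift_by_decile(labels, scores)
print(lift_table)
```

By construction the cumulative lift of the last decile is 1, since at that point the whole population has been contacted. A useful model shows values above 1 in the first deciles.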


### Problem 2: Read in the Data

Use pandas to read in the dataset `bank-additional-full.csv` and assign to a meaningful variable name.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

In [None]:
df = pd.read_csv('data/bank-additional-full.csv', sep = ';')

In [None]:
df.head()

In [None]:
df['campaign'].value_counts()

### Problem 3: Understanding the Features


Examine the data description below, and determine if any of the features are missing values or need to be coerced to a different data type.


```
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
```
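
The description above implies that missing values are not stored as `NaN` but as the placeholder `'unknown'`, and that `pdays` uses `999` as a sentinel. A short sketch of how these could be audited follows. The `sample` frame is hand-made for illustration, and the column names `previously_contacted` and `pdays_clean` are hypothetical.

```python
import pandas as pd

# Tiny hand-made frame that mimics a few columns of the description above.
sample = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "technician"],
    "default": ["no", "unknown", "unknown", "no"],
    "pdays": [999, 3, 999, 999],
})

# Share of 'unknown' placeholders per categorical column.
placeholder_cols = ["job", "default"]
unknown_share = (sample[placeholder_cols] == "unknown").mean()
print(unknown_share)

# 999 is a sentinel, not a real day count: split it into a flag plus a nullable number.
sample["previously_contacted"] = sample["pdays"] != 999
sample["pdays_clean"] = sample["pdays"].where(sample["pdays"] != 999)
print(sample)
```

On the real data the same comparison can be run over every categorical column of `df` to see which features carry the most placeholders.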



In [None]:
df['education'].value_counts()

### Problem 4: Understanding the Task

After examining the description and data, your goal now is to clearly state the *Business Objective* of the task.  State the objective below.

## Goal
Increase efficiency of directed campaigns for long term deposit subscriptions by optimizing the number of contacts made.

The aim is not to specify an efficiency percentage or the optimal / minimum number of contacts, but to identify which customer profiles are best suited for the campaign.

- Target outcome (dependent variable): represented as 'y'. The value 'yes' indicates the client has subscribed to a long-term deposit; 'no' indicates the client has not.
- Independent variables: the data elements captured as part of the campaign, as well as from internal bank records, listed above.

In [None]:
df.info()

In [None]:
# proportion of yes and no in the target variable
df['y'].value_counts(normalize=True)
# --- NOTE: the output shows the target variable is imbalanced, so the train/test split should be stratified

In [None]:
sns.pairplot(df, hue='y')
plt.show()

### Problem 5: Engineering Features

Now that you understand your business objective, we will build a basic model to get started.  Before we can do this, we must work to encode the data.  Using just the bank information features, prepare the features and target column for modeling with appropriate encoding and transformations.
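
As one possible reading of "just the bank information features", the sketch below wires features 1-7 into a single preprocessing pipeline. One-hot encoding is swapped in here for simplicity, whereas the cells that follow use James-Stein and target encoding. The `toy` rows are invented for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows standing in for the bank-client columns (features 1-7 in the description).
toy = pd.DataFrame({
    "age": [25, 41, 58, 33, 47, 29],
    "job": ["student", "admin.", "retired", "services", "admin.", "student"],
    "marital": ["single", "married", "married", "single", "divorced", "single"],
    "education": ["high.school", "university.degree", "basic.4y",
                  "high.school", "unknown", "university.degree"],
    "default": ["no", "no", "unknown", "no", "no", "no"],
    "housing": ["yes", "no", "yes", "unknown", "yes", "no"],
    "loan": ["no", "no", "yes", "no", "no", "no"],
    "y": ["no", "yes", "no", "no", "yes", "no"],
})

numeric_cols = ["age"]
categorical_cols = ["job", "marital", "education", "default", "housing", "loan"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    # handle_unknown='ignore' keeps transform() from failing on unseen categories.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

clf = Pipeline([("prep", preprocess), ("model", LogisticRegression(max_iter=1000))])

X_toy = toy.drop(columns="y")
y_toy = (toy["y"] == "yes").astype(int)  # binary target: 1 = subscribed
clf.fit(X_toy, y_toy)
toy_preds = clf.predict(X_toy)
print(toy_preds)
```

Keeping the encoders inside a `Pipeline` means they are fitted on the training data only, which avoids leaking test information into the encoding step.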

In [None]:
# let pandas infer the best-supported dtype for each column
df = df.convert_dtypes()
print("Education count= ", df['education'].value_counts())
# --- NOTE: convert_dtypes() turned the object columns into the string dtype, so the categorical columns still need explicit numeric encoding (done below)



In [None]:

from sklearn.model_selection import train_test_split

# -- Create X and y and split the data before feature encoding / engineering.
# -- NOTE: As stated in "Understanding the Features", and as the pairplot above also shows, duration is strongly tied to y and is only known after the call.
# Hence we remove the column for prediction purposes. The social and economic context attributes are dropped as well.
X = df.drop(columns=['emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed', 'duration', 'y'])
y = df['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

In [None]:
import category_encoders as ce
from sklearn.preprocessing import StandardScaler

# -- NOTE: The target is encoded as a pandas object as well; passing a NumPy array to JamesSteinEncoder raised
# "AttributeError: 'numpy.ndarray' object has no attribute 'groupby'"
y_train_enc = pd.get_dummies(y_train, drop_first=True, prefix='y')
y_test_enc = pd.get_dummies(y_test, drop_first=True, prefix='y')

# ---- boolean data : Use Dummy Encoder - Contact
contact_train_enc = pd.get_dummies(X_train['contact'], drop_first=True, prefix='contact')
contact_test_enc = pd.get_dummies(X_test['contact'], drop_first=True, prefix='contact')
X_train_enc = pd.concat([X_train, contact_train_enc], axis=1)
X_test_enc = pd.concat([X_test, contact_test_enc], axis=1)
X_train_enc = X_train_enc.drop(columns=['contact'])
X_test_enc = X_test_enc.drop(columns=['contact'])


# --- Ordinal encoding for education:
# 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown'
# 'unknown' is a tricky value, so it is filled with an approximate average of the ordinal values of the other levels.
# Mapping without an 'unknown' entry would turn those rows into NaN, so all levels are mapped in one operation.
education_map = {'basic.4y':4,'basic.6y':6,'basic.9y':9,'high.school':12,'illiterate':-1,'professional.course':14,'university.degree':16, 'unknown':11}
X_train_enc['education'] = X_train_enc['education'].map(education_map)
X_test_enc['education'] = X_test_enc['education'].map(education_map)

# unknown_val = int(X_test_enc['education'].mean())
# ed_all_map = {4:4, 6:6, 9:9, 12:12, -1:-1, 14:14, 16:16, 'unknown':unknown_val}
# X_train_enc['education'] = X_train_enc['education'].map(ed_all_map)
# numerical_cols = ['age', 'campaign', 'pdays', 'previous', 'education']
# std_scaler = StandardScaler()
# X_train_enc = std_scaler.fit_transform(X_train_enc[numerical_cols])
# X_test_enc = std_scaler.transform(X_test_enc[numerical_cols])

# Remaining categorical columns ('contact' and 'education' are already encoded above)
categorical_cols = ['job', 'marital', 'default', 'housing', 'loan', 'month', 'day_of_week', 'poutcome']


# James-Stein encoder for job, marital, default, housing, loan, month, day_of_week, poutcome
js_encoder = ce.JamesSteinEncoder(cols=categorical_cols, random_state=42)
js_encoder.fit(X_train_enc, y_train_enc)
X_train_enc = js_encoder.transform(X_train_enc)
X_test_enc = js_encoder.transform(X_test_enc)

X_train_enc.info()
print(y_train_enc.value_counts())


In [None]:
# -- encoding using Label Encoder and Target Encoder
from category_encoders import TargetEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, StandardScaler

le = LabelEncoder()
y_train_target_enc = le.fit_transform(y_train)
y_test_target_enc = le.transform(y_test)

categorical_cols = ['job', 'marital', 'default', 'housing', 'loan', 'month', 'day_of_week', 'poutcome', 'contact', 'education']
numerical_cols = ['age', 'campaign', 'pdays', 'previous']
col_transformer = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('categorical', TargetEncoder(), categorical_cols),
        #('target', LabelEncoder(), 'y')
    ]
)

print(type(y_train_target_enc))
X_train_target_enc = col_transformer.fit_transform(X_train, y_train_target_enc)
X_test_target_enc = col_transformer.transform(X_test)

cat_col_names = col_transformer.named_transformers_['categorical'].get_feature_names_out(categorical_cols)
all_col_names = list(numerical_cols) + list(cat_col_names)

X_train_target_enc_df = pd.DataFrame(X_train_target_enc, columns=all_col_names)
X_test_target_enc_df = pd.DataFrame(X_test_target_enc, columns=all_col_names)

print(X_train_target_enc_df.info())
X_train_target_enc_df.head()


### Problem 6: Train/Test Split

With your data prepared, split it into a train and test set.

In [None]:
# Since it is good practice to split train and test data before feature engineering, the split was performed above at the start of the feature engineering step.

# Flatten the single-column targets into 1-D arrays, the shape scikit-learn estimators expect.
y_train_enc = y_train_enc.to_numpy().ravel()
y_test_enc = y_test_enc.to_numpy().ravel()

### Problem 7: A Baseline Model

Before we build our first model, we want to establish a baseline.  What is the baseline performance that our classifier should aim to beat?

For this we use the DummyClassifier as a baseline: a model that predicts from simple rules (here, by sampling labels according to the class distribution) rather than by learning from the features. Any other model we use must beat this model's accuracy.
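
The choice of `strategy` changes what the baseline looks like. The sketch below compares two of scikit-learn's built-in strategies on synthetic labels. The 900/100 class split is an assumption chosen only to mimic an imbalanced target, not the real class counts.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic imbalanced labels: 900 negatives, 100 positives (illustrative only).
y_demo = np.array([0] * 900 + [1] * 100)
X_demo = np.zeros((len(y_demo), 1))  # DummyClassifier ignores the features

baseline_scores = {}
for strategy in ("most_frequent", "stratified"):
    dummy = DummyClassifier(strategy=strategy, random_state=42)
    dummy.fit(X_demo, y_demo)
    preds = dummy.predict(X_demo)
    baseline_scores[strategy] = {
        "accuracy": accuracy_score(y_demo, preds),
        "recall": recall_score(y_demo, preds),
    }

print(baseline_scores)
```

`most_frequent` always predicts the majority class, so it scores high accuracy but zero recall on the positive class. `stratified` samples predictions according to the class frequencies, so it finds some positives at the cost of lower accuracy.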

In [None]:
import time
from sklearn.dummy import DummyClassifier
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

start_time = time.time()
dummy_classifier = DummyClassifier(strategy='stratified', random_state=42)
dummy_classifier.fit(X_train_enc, y_train_enc)
# X_train_transform = dummy_classifier.transform(X_train_enc)
y_test_preds = dummy_classifier.predict(X_test_enc)
end_time = time.time()

dummy_accuracy = accuracy_score(y_test_enc, y_test_preds)
dummy_precision = precision_score(y_test_enc, y_test_preds)
dummy_recall = recall_score(y_test_enc, y_test_preds)
dummy_f1score = f1_score(y_test_enc, y_test_preds)
dummy_conf_matrix = confusion_matrix(y_test_enc, y_test_preds)

print("Total time taken by Dummy classifier: ", end_time - start_time)
print('Dummy Classifier scores : \n'
      f'Accuracy: {dummy_accuracy}\n'
      f'Precision: {dummy_precision}\n'
      f'Recall: {dummy_recall}\n'
      f'F1 score: {dummy_f1score}\n')



#### Findings : 
<div class="alert alert-block alert-info">
The baseline model (Dummy Classifier) returns an accuracy of ~80%, which is what we expect our more sophisticated models to beat.
</div>
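
The ~80% figure is in line with what a stratified dummy is expected to score. As a worked sketch, let $p$ be the share of 'yes' labels. The value $p = 0.11$ used below is an assumed, illustrative figure; the real share can be read from `df['y'].value_counts(normalize=True)`. A stratified dummy predicts 'yes' with probability $p$ independently of the true label, so its expected accuracy is the chance that prediction and label agree:

```latex
\begin{aligned}
\mathbb{E}[\text{accuracy}_{\text{stratified}}] &= p^2 + (1-p)^2 \\
\text{accuracy}_{\text{most\_frequent}} &= \max(p,\ 1-p) \\
p = 0.11 \;\Rightarrow\; 0.11^2 + 0.89^2 &= 0.0121 + 0.7921 = 0.8042,
\qquad \max(0.11,\ 0.89) = 0.89
\end{aligned}
```

Under that assumption, a classifier that always predicts 'no' would already reach about 89% accuracy, which is worth keeping in mind when reading accuracy figures on imbalanced data.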

In [None]:
# import plotly.express as px
from sklearn.metrics import ConfusionMatrixDisplay


cmd = ConfusionMatrixDisplay(dummy_conf_matrix, display_labels=['not enrolled','enrolled'])
cmd.plot()

# dummy_conf_matrix_plot = ConfusionMatrixDisplay(dummy_conf_matrix)
# fig = px.imshow(dummy_conf_matrix)
# fig.update_layout()
# fig.update_xaxes(side='bottom') # move x axes tick labels to bottom
# fig.update_traces(text=dummy_conf_matrix, texttemplate="%{text}")
# fig.show()



### Problem 8: A Simple Model

Use Logistic Regression to build a basic model on your data.  

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

results = {}
# ravel() returns a new array, so the result must be assigned back
y_train_enc = y_train_enc.ravel()
y_test_enc = y_test_enc.ravel()

le_start_time = time.time()
le_gridSearch = GridSearchCV(estimator = LogisticRegression(random_state=42),
                             param_grid={'C': [0.01, 0.1, 1, 10, 100]}
                             )
le_gridSearch.fit(X_train_enc, y_train_enc)
le_y_enc_pred = le_gridSearch.predict(X_test_enc)
le_end_time = time.time()

le_accuracy = accuracy_score(y_test_enc, le_y_enc_pred)
le_precision = precision_score(y_test_enc, le_y_enc_pred)
le_recall = recall_score(y_test_enc, le_y_enc_pred)
le_f1score = f1_score(y_test_enc, le_y_enc_pred)
le_conf_matrix = confusion_matrix(y_test_enc, le_y_enc_pred)

results['LogisticRegression'] = {
    'Accuracy Score': le_accuracy,
    'Precision Score': le_precision,
    'Recall': le_recall,
    'F1 Score': le_f1score,
    'Confusion Matrix': le_conf_matrix,
    'Best Params': le_gridSearch.best_params_,
    'Time Taken': (le_end_time - le_start_time)
}

In [None]:
print(f"Logistic Regression Accuracy score: {results['LogisticRegression']['Accuracy Score']}")
cmd = ConfusionMatrixDisplay(le_conf_matrix, display_labels=['not enrolled','enrolled'])
cmd.plot()


### Problem 9: Score the Model

What is the accuracy of your model?

In [None]:

results_df = pd.DataFrame(results).T
results_df

#### Findings : 
<div class="alert alert-block alert-info">
From an accuracy standpoint, Logistic Regression reaches close to 90% accuracy, well above the baseline (Dummy) accuracy of ~80%. Hence Logistic Regression can be considered a viable model.
</div>

### Problem 10: Model Comparisons

Now, we aim to compare the performance of the Logistic Regression model to our KNN algorithm, Decision Tree, and SVM models.  Using the default settings for each of the models, fit and score each.  Also, be sure to compare the fit time of each of the models.  Present your findings in a `DataFrame` similar to that below:

| Model | Train Time | Train Accuracy | Test Accuracy |
| ----- | ---------- | -------------- | ------------- |
|       |            |                |               |

In [None]:
# -- Declare models and their params

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


models = {
    'knn' : {
        'model': KNeighborsClassifier(),
        'params': {
            'n_neighbors': range(1, 10)
        }
    },
    'decision_tree': {
        'model': DecisionTreeClassifier(random_state=42),
        'params': {
            'max_depth': range(1, 10)
        }
    },
    'SVM': {
        'model': SVC(coef0=1, random_state=42),
        'params': {
            'gamma': [0.1, 1.0, 10.0, 100.0]
        }
    }

}

In [None]:
# Run the models without hyperparameter tuning
# NOTE: the timed span below covers fit() plus both predict() calls, not fit() alone

basic_results = {}
for model_name, model_params in models.items():
    model = model_params['model']
    start_time = time.time()

    model.fit(X_train_enc, y_train_enc)

    y_train_preds = model.predict(X_train_enc)
    y_test_preds = model.predict(X_test_enc)

    end_time = time.time()

    basic_results[model_name] = {
        'Train Time': (end_time - start_time),
        'Train Accuracy': accuracy_score(y_train_enc, y_train_preds),
        'Test Accuracy' : accuracy_score(y_test_enc, y_test_preds)
    }

basic_results_df = pd.DataFrame(basic_results).T.sort_values(by='Test Accuracy', ascending=False)
basic_results_df

#### Findings / Result :
<div class="alert alert-block alert-info">
For comparison, we focus primarily on Test Accuracy, since it indicates how the model would perform on unseen data.

All of the above models did significantly better than the Dummy Classifier. In terms of accuracy, however, Logistic Regression with hyperparameter tuning did better than any of these three untuned models, so this may not be an apples-to-apples comparison.

In terms of time, Decision Tree was the fastest, while its Test Accuracy was the lowest. For a good balance between Train Time and Test Accuracy, the "knn" model seems to be the best option: its accuracy is almost the same as SVM's, but it takes approximately one-twelfth of the time.
</div>
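
When comparing run times it can help to time `fit()` and `predict()` separately, because some models (KNN in particular) do most of their work at prediction time. The sketch below shows one way to do this. The synthetic data from `make_classification` is only a stand-in for the encoded features, and the dictionary keys are hypothetical names.

```python
import time
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to demonstrate the timing pattern.
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=42)

demo_models = {
    "knn": KNeighborsClassifier(),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}

timings = {}
for name, model in demo_models.items():
    t0 = time.perf_counter()
    model.fit(X_demo, y_demo)
    t1 = time.perf_counter()
    model.predict(X_demo)
    t2 = time.perf_counter()
    timings[name] = {"fit_seconds": t1 - t0, "predict_seconds": t2 - t1}

print(timings)
```

`time.perf_counter()` is used instead of `time.time()` because it is a monotonic, high-resolution clock intended for measuring short durations.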

### Problem 11: Improving the Model

Now that we have some basic models on the board, we want to try to improve these.  Below, we list a few things to explore in this pursuit.

- More feature engineering and exploration.  For example, should we keep the gender feature?  Why or why not?
- Hyperparameter tuning and grid search.  All of our models have additional hyperparameters to tune and explore.  For example the number of neighbors in KNN or the maximum depth of a Decision Tree.  
- Adjust your performance metric

In [None]:
# - check the correlation among the features - Categorical vars encoded using JamesStein
X_train_enc.corr()

In [None]:
# -- check the correlation among features - Categorical vars encoded using Target Encoder
X_train_target_enc_df.corr()

# -- NOTE: The correlations in the two tables (James-Stein encoded and target encoded) are largely similar.
# One difference is education. In the former case, where values were manually assigned to each level (ordinal encoding), corr = -0.1786.
# In the latter case, where Target Encoder was used to encode the values, corr = -0.013653.

#### Using Label and Target Encoders for Categorical data

In [None]:
# Run the model using Label and Target Encoded features
target_results = {}
for model_name, model_params in models.items():
    model = model_params['model']
    start_time = time.time()

    # gsearch = GridSearchCV(estimator=model,
    #                        scoring='accuracy',
    #                        cv=5,
    #                        verbose=1)
    model.fit(X_train_target_enc, y_train_target_enc)

    y_train_preds = model.predict(X_train_target_enc)
    y_test_preds = model.predict(X_test_target_enc)

    end_time = time.time()

    target_results[model_name] = {
        'Train Time': (end_time - start_time),
        'Train Accuracy': accuracy_score(y_train_target_enc, y_train_preds),
        'Test Accuracy' : accuracy_score(y_test_target_enc, y_test_preds)
    }

target_results_df = pd.DataFrame(target_results).T.sort_values(by='Test Accuracy', ascending=False)
target_results_df

#### Findings
<div class="alert alert-block alert-info">
Compared with the previous run, Decision Tree Test Accuracy improved marginally; otherwise the numbers are very similar.
</div>
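
For intuition on why the two encodings behave similarly: both James-Stein and target encoding replace each category with an estimate of the target mean that is shrunk toward the overall mean. The sketch below does this by hand with simple additive smoothing. It is a simplified illustration, not the exact formula used by `category_encoders`, and the toy data and smoothing weight `m` are hypothetical.

```python
import pandas as pd

# Toy training data: one categorical feature and a binary target.
train = pd.DataFrame({
    "job": ["admin.", "admin.", "student", "student", "student", "retired"],
    "y":   [0,        1,        1,         1,         0,         1],
})

prior = train["y"].mean()   # global positive rate
m = 2.0                     # hypothetical smoothing weight

stats = train.groupby("job")["y"].agg(["sum", "count"])
# Shrink each category mean toward the prior; rare categories move the most.
encoding = (stats["sum"] + m * prior) / (stats["count"] + m)

train["job_encoded"] = train["job"].map(encoding)

# Categories unseen during fitting fall back to the prior.
new_jobs = pd.Series(["student", "housemaid"])
encoded_new = new_jobs.map(encoding).fillna(prior)
print(encoding)
print(encoded_new)
```

Because the encoding is computed from the target, it should be fitted on the training split only and then applied to the test split, as the notebook does with `fit` on `X_train` and `transform` on `X_test`.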

#### Using Hyper Parameter Tuning

In [None]:
# Running for features encoded using Label Encoder and Target Encoder.
tune_results = {}
for model_name, model_params in models.items():
    model = model_params['model']
    start_time = time.time()

    gsearch = GridSearchCV(estimator=model,
                           param_grid=model_params['params'],
                           scoring='accuracy',
                           cv=5,
                           verbose=1)
    gsearch.fit(X_train_target_enc, y_train_target_enc)

    y_train_preds = gsearch.predict(X_train_target_enc)
    y_test_preds = gsearch.predict(X_test_target_enc)

    end_time = time.time()

    tune_results[model_name] = {
        'Train Time': (end_time - start_time),
        'Train Accuracy': accuracy_score(y_train_target_enc, y_train_preds),
        'Test Accuracy' : accuracy_score(y_test_target_enc, y_test_preds),
        'Confusion Matrix' : confusion_matrix(y_test_target_enc, y_test_preds),
        'Best Params': gsearch.best_params_
    }

tune_results_df = pd.DataFrame(tune_results).T.drop(columns=['Confusion Matrix']).sort_values(by='Test Accuracy', ascending=False)
tune_results_df


### Findings - Conclusion
<div class="alert alert-block alert-info">
After hyperparameter tuning, Decision Tree performs significantly better, although it is marginally less accurate than SVM. But given that the time taken is less than 1 sec, it would be the model of choice.
</div>

In [None]:
# Plot
import math

# add Logistic Regression results to the results
tune_results['Logistic Regression'] = {
    'Train Time': results['LogisticRegression']['Time Taken'],
    # the stored accuracy score was computed on the test set
    'Test Accuracy': results['LogisticRegression']['Accuracy Score'],
    'Confusion Matrix' : results['LogisticRegression']['Confusion Matrix'],
    'Best Params': results['LogisticRegression']['Best Params'],
}

# calc the # of rows and columns
n_models = len(tune_results)
n_cols = 2 # default for now
n_rows = math.ceil(n_models / n_cols)

fig, axes = plt.subplots(nrows=n_rows, ncols = n_cols, figsize=(n_cols * 3, n_rows*3))
axes = axes.flatten()

# One way to plot it
for idx, (model_name, metrics) in enumerate(tune_results.items()):
    conf_matrix = metrics['Confusion Matrix']
    disp = ConfusionMatrixDisplay(conf_matrix)
    disp.plot(ax=axes[idx])
    axes[idx].set_title(model_name)
plt.tight_layout()


##### Questions

1. How would the models perform if other hyperparameter search estimators (other than GridSearchCV), such as RandomizedSearchCV, HalvingGridSearchCV, or HalvingRandomSearchCV, were used?
2. Can we run PCA as part of feature engineering? What would the outcome look like?
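
As a starting point for exploring both questions, the sketch below puts PCA inside a pipeline and tunes it together with a decision tree using `RandomizedSearchCV`. The synthetic data, the step names (`scale`, `pca`, `tree`) and the candidate values are all assumptions chosen for illustration; results on the bank data may look quite different.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded feature matrix.
X_demo, y_demo = make_classification(n_samples=1000, n_features=14,
                                     n_informative=6, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),   # PCA is sensitive to feature scale
    ("pca", PCA()),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "pca__n_components": [2, 4, 6, 8, 10],
        "tree__max_depth": list(range(1, 10)),
    },
    n_iter=10,            # samples 10 of the 45 possible combinations
    cv=5,
    scoring="accuracy",
    random_state=42,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

`HalvingGridSearchCV` and `HalvingRandomSearchCV` accept a similar estimator and parameter specification, but they are still experimental in scikit-learn and require `from sklearn.experimental import enable_halving_search_cv` before they can be imported.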