# Practical Application III: Comparing Classifiers

**Overview**: In this practical application, your goal is to compare the performance of the classifiers we encountered in this section, namely K-Nearest Neighbors, Logistic Regression, Decision Trees, and Support Vector Machines. We will use a dataset related to marketing bank products over the telephone.



### Getting Started

Our dataset comes from the UCI Machine Learning repository [link](https://archive.ics.uci.edu/ml/datasets/bank+marketing). The data is from a Portuguese banking institution and is a collection of the results of multiple marketing campaigns. We will make use of the article accompanying the dataset [here](CRISP-DM-BANK.pdf) for more information on the data and features.



### Problem 1: Understanding the Data

To gain a better understanding of the data, please read the information provided in the UCI link above, and examine the **Materials and Methods** section of the paper.  How many marketing campaigns does this data represent?

### Problem 2: Read in the Data

Use pandas to read in the dataset `bank-additional-full.csv` and assign to a meaningful variable name.

In [36]:
import pandas as pd

In [37]:
df = pd.read_csv('data/bank-additional-full.csv', sep = ';')

In [38]:
df.head()

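|   | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
| - | --- | --- | ------- | --------- | ------- | ------- | ---- | ------- | ----- | ----------- | --- | -------- | ----- | -------- | -------- | ------------ | -------------- | ------------- | --------- | ----------- | - |
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |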


### Problem 3: Understanding the Features


Examine the data description below, and determine if any of the features are missing values or need to be coerced to a different data type.


```
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
```
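The description above encodes missing categorical values as the string `'unknown'` and uses `999` in `pdays` as a "never contacted" sentinel, so a plain null count will not reveal either. The following is a minimal sketch of how both could be counted. The five-row `sample` frame and the column subset are hypothetical stand-ins for the real file.

```python
import pandas as pd

# Hypothetical five-row sample that mimics a few columns; not the real data.
sample = pd.DataFrame({
    "age": [56, 57, 37, 40, 56],
    "job": ["housemaid", "services", "unknown", "admin.", "services"],
    "default": ["no", "unknown", "no", "unknown", "no"],
    "pdays": [999, 999, 3, 999, 6],
})

# Count the 'unknown' placeholder in each categorical column.
categorical_cols = ["job", "default"]
unknown_counts = (sample[categorical_cols] == "unknown").sum()

# Share of rows where pdays carries the 999 "never contacted" sentinel.
never_contacted_share = (sample["pdays"] == 999).mean()

print(unknown_counts)
print(f"pdays == 999 in {never_contacted_share:.0%} of rows")
```

On the full dataset the same two checks would be run over every categorical column listed in the description.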



### Problem 4: Understanding the Task

After examining the description and data, your goal now is to clearly state the *Business Objective* of the task.  State the objective below.

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

**Business objective:** Predict whether a client will subscribe to a term deposit, using the bank client data, contact information, campaign details, and socio-economic context.

The target variable `y` (`yes` or `no`) indicates whether the client subscribed to a term deposit.

### Problem 5: Engineering Features

Now that you understand your business objective, we will build a basic model to get started.  Before we can do this, we must work to encode the data.  Using just the bank information features, prepare the features and target column for modeling with appropriate encoding and transformations.

In [101]:
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

np.random.seed(42)

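def prepare_feat(df):
    df_processed = df.copy()

    # Flag clients who were contacted in a previous campaign (pdays != 999)
    df_processed['previously_contacted'] = (df_processed['pdays'] != 999).astype(int)

    # Replace the 999 sentinel with NaN; where() upcasts the integer column to float
    df_processed['pdays'] = df_processed['pdays'].where(df_processed['pdays'] != 999)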

    categorical_cols = ['job', 'marital','education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome']

    numerical_cols = ['age', 'duration', 'campaign','previous', 'emp.var.rate', 'cons.price.idx','cons.conf.idx',
                      'euribor3m', 'nr.employed', 'previously_contacted']

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numerical_cols),
            ('cat', OneHotEncoder(handle_unknown='ignore', drop='first'), categorical_cols)
        ])

    df_processed['y'] = df_processed['y'].map({'yes': 1, 'no': 0 })

    return df_processed, preprocessor

X = df.drop('y', axis=1)
y = df['y'].map({'yes': 1, 'no': 0 })

X_processed, preprocessor = prepare_feat(df)

X_transformed = preprocessor.fit_transform(X_processed.drop('y', axis=1))
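The prompt for this problem asks for just the bank information features, which the data description lists as items 1 to 7 (`age` through `loan`). As a hedged sketch of that narrower setup, the block below encodes only those seven columns. The `toy` frame is hypothetical, and the names `bank_numeric`, `bank_categorical`, and `bank_preprocessor` are illustrative rather than part of the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Features 1-7 of the data description ("bank client data").
bank_numeric = ["age"]
bank_categorical = ["job", "marital", "education", "default", "housing", "loan"]

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "age": [56, 57, 37, 40],
    "job": ["housemaid", "services", "services", "admin."],
    "marital": ["married", "married", "single", "married"],
    "education": ["basic.4y", "high.school", "high.school", "basic.6y"],
    "default": ["no", "unknown", "no", "no"],
    "housing": ["no", "no", "yes", "no"],
    "loan": ["no", "no", "no", "yes"],
    "duration": [261, 149, 226, 151],  # present in the frame, deliberately not used
    "y": ["no", "no", "yes", "no"],
})

# remainder="drop" is the default, so duration and y never reach the model.
bank_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), bank_numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), bank_categorical),
    ]
)

X_bank = bank_preprocessor.fit_transform(toy)
y_bank = toy["y"].map({"yes": 1, "no": 0})

print(X_bank.shape)
```

Leaving `duration` out also follows the note in the data description that it should be discarded for a realistic predictive model.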

### Problem 6: Train/Test Split

With your data prepared, split it into a train and test set.

In [102]:
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

def prepare_and_split(df):
    data = df.copy()

    data['y'] = data['y'].map({'yes': 1, 'no': 0 })

    # Flag prior contact, then turn the 999 sentinel into NaN (where() upcasts to float)
    data['previously_contacted'] = (data['pdays'] != 999).astype(int)
    data['pdays'] = data['pdays'].where(data['pdays'] != 999)

    X = data.drop('y', axis =1)
    y = data['y']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

    return X_train, X_test, y_train, y_test, data
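
The split above passes `stratify=y`, which keeps the class proportions the same in both partitions. A small sketch of that effect follows, using synthetic labels with roughly the 11% positive rate seen in this dataset. The arrays `X_demo` and `y_demo` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with about 11% positives; not the real target column.
rng = np.random.default_rng(42)
y_demo = (rng.random(10_000) < 0.11).astype(int)
X_demo = np.arange(y_demo.size).reshape(-1, 1)  # placeholder feature

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"overall positive rate: {y_demo.mean():.4f}")
print(f"train positive rate:   {y_tr.mean():.4f}")
print(f"test positive rate:    {y_te.mean():.4f}")
```

Without stratification the minority share in a test set of this size can drift noticeably, which makes accuracy comparisons between runs harder to read.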

In [103]:
from sklearn.impute import SimpleImputer

def preprocessing_pipeline(X_train):
    categorical_cols = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
    numerical_cols = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()

    numerical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])
    
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])
    
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_cols),
            ('cat', categorical_transformer, categorical_cols)
        ])
    
    return preprocessor


### Problem 7: A Baseline Model

Before we build our first model, we want to establish a baseline.  What is the baseline performance that our classifier should aim to beat?

In [104]:
def create_baseline(y_train):
    """
    Establish a baseline performance based on majority class classifier
    """
    print("\nProblem 7: Establishing a baseline model")
    
    # Most frequent class in the training set
    majority_class = pd.Series(y_train).value_counts().idxmax()
    baseline_accuracy = (y_train == majority_class).mean()
    
    print(f"Majority class: {majority_class}")
    print(f"Baseline accuracy (always predicting the majority class): {baseline_accuracy:.4f}")
    
    return baseline_accuracy
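
The same majority-class baseline can also be expressed with scikit-learn's `DummyClassifier`, which has the advantage of fitting into the same pipeline and scoring code as the real models. This is a sketch on synthetic labels (about 11% positives), not a replacement for the function above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic imbalanced labels; the features are irrelevant to this baseline.
rng = np.random.default_rng(0)
y_demo = (rng.random(5_000) < 0.11).astype(int)
X_demo = np.zeros((y_demo.size, 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_accuracy = baseline.score(X_demo, y_demo)
majority_share = max(y_demo.mean(), 1 - y_demo.mean())

print(f"DummyClassifier accuracy: {baseline_accuracy:.4f}")
print(f"Majority class share:     {majority_share:.4f}")
```

The two printed numbers agree, since always predicting the majority class is correct exactly as often as that class occurs.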


### Problem 8: A Simple Model

Use Logistic Regression to build a basic model on your data.  

In [105]:
from time import time

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def build_logistic_regression(X_train, X_test, y_train, y_test, preprocessor):
    """
    Build a logistic regression model
    """
    print("\nProblem 8: Building a Logistic Regression model")
    
    # Create the pipeline with preprocessing and logistic regression
    log_reg_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression(max_iter=1000, random_state=42))
    ])
    
    # Train the model and time it
    start_time = time()
    log_reg_pipeline.fit(X_train, y_train)
    train_time = time() - start_time
    
    # Make predictions on train and test sets
    y_train_pred = log_reg_pipeline.predict(X_train)
    y_test_pred = log_reg_pipeline.predict(X_test)
    
    # Calculate accuracy
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)
    
    print(f"Logistic Regression - Training time: {train_time:.4f} seconds")
    print(f"Logistic Regression - Training accuracy: {train_accuracy:.4f}")
    print(f"Logistic Regression - Test accuracy: {test_accuracy:.4f}")
    
    # Return the model and metrics
    return log_reg_pipeline, train_time, train_accuracy, test_accuracy

### Problem 9: Score the Model

What is the accuracy of your model?

In [106]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

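import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, confusion_matrix, classification_report)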

def evaluate_model(model, X_test, y_test):
    """
    Evaluate the model with more detailed metrics
    """
    print("\nProblem 9: Detailed model evaluation")
    
    # Make predictions
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    # Create and print classification report
    report = classification_report(y_test, y_pred)
    
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print(f"ROC AUC: {roc_auc:.4f}")
    print("\nClassification Report:")
    print(report)
    
    # Plot confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                xticklabels=['No', 'Yes'],
                yticklabels=['No', 'Yes'])
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix')
    plt.tight_layout()
    plt.savefig('confusion_matrix.png')
    plt.close()
    
    # Plot ROC curve
    from sklearn.metrics import RocCurveDisplay
    RocCurveDisplay.from_estimator(model, X_test, y_test)
    plt.grid(True)
    plt.plot([0, 1], [0, 1], 'r--')
    plt.title('ROC Curve')
    plt.savefig('roc_curve.png')
    plt.close()
    
    return accuracy, precision, recall, f1, roc_auc

### Problem 10: Model Comparisons

Now, we aim to compare the performance of the Logistic Regression model to our KNN algorithm, Decision Tree, and SVM models.  Using the default settings for each of the models, fit and score each.  Also, be sure to compare the fit time of each of the models.  Present your findings in a `DataFrame` similar to that below:

| Model | Train Time | Train Accuracy | Test Accuracy |
| ----- | ---------- | -------------  | -----------   |
|     |    |.     |.     |

In [107]:
def compare_models(X_train, X_test, y_train, y_test, preprocessor):
    """
    Compare different machine learning models
    """
    print("\nProblem 10: Comparing different models")
    
    # Dictionary to store models
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
        'KNN': KNeighborsClassifier(),
        'Decision Tree': DecisionTreeClassifier(random_state=42),
        'SVM': SVC(probability=True, random_state=42)
    }
    
    # Collect one result row per model
    rows = []
    
    # Train and evaluate each model
    for name, model in models.items():
        # Create pipeline
        pipeline = Pipeline(steps=[
            ('preprocessor', preprocessor),
            ('classifier', model)
        ])
        
        # Train the model and time it
        start_time = time()
        pipeline.fit(X_train, y_train)
        train_time = time() - start_time
        
        # Make predictions
        y_train_pred = pipeline.predict(X_train)
        y_test_pred = pipeline.predict(X_test)
        
        # Calculate accuracy
        train_accuracy = accuracy_score(y_train, y_train_pred)
        test_accuracy = accuracy_score(y_test, y_test_pred)
        
        # Store results
        rows.append({
            'Model': name,
            'Train Time': f"{train_time:.4f}s",
            'Train Accuracy': f"{train_accuracy:.4f}",
            'Test Accuracy': f"{test_accuracy:.4f}"
        })
    
    # Results DataFrame, built once instead of concatenating inside the loop
    results = pd.DataFrame(rows, columns=['Model', 'Train Time', 'Train Accuracy', 'Test Accuracy'])
    
    # Display results
    print("Model Comparison Results:")
    print(results)
    
    return results

### Problem 11: Improving the Model

Now that we have some basic models on the board, we want to try to improve these.  Below, we list a few things to explore in this pursuit.

- More feature engineering and exploration.  For example, should we keep the gender feature?  Why or why not?
- Hyperparameter tuning and grid search.  All of our models have additional hyperparameters to tune and explore.  For example the number of neighbors in KNN or the maximum depth of a Decision Tree.  
- Adjust your performance metric
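
As one possible way to combine the last two bullets, the sketch below tunes the number of neighbors in KNN while selecting the winner by F1 instead of accuracy. It runs on synthetic data from `make_classification` with about 11% positives. The grid values and the choice of F1 are assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with ~11% positives; swap in the real training split.
X_demo, y_demo = make_classification(
    n_samples=2_000, n_features=10, n_informative=5,
    weights=[0.89, 0.11], random_state=42,
)

knn_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", KNeighborsClassifier()),
])

# Small illustrative grid over the number of neighbors.
param_grid = {"classifier__n_neighbors": [3, 5, 11, 21]}

search = GridSearchCV(
    knn_pipeline, param_grid, cv=3,
    scoring={"accuracy": "accuracy", "f1": "f1", "recall": "recall"},
    refit="f1",  # pick the winner by F1 rather than accuracy
)
search.fit(X_demo, y_demo)

print("Best n_neighbors by F1:", search.best_params_["classifier__n_neighbors"])
for metric in ("accuracy", "f1", "recall"):
    best = search.cv_results_[f"mean_test_{metric}"][search.best_index_]
    print(f"mean CV {metric}: {best:.4f}")
```

Reporting several metrics side by side makes it visible when a setting that looks good on accuracy does poorly on the minority class.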

In [108]:
import seaborn as sns

def improve_model(X_train, X_test, y_train, y_test, preprocessor):
    """
    Improve the model through hyperparameter tuning and feature engineering
    """
    print("\nProblem 11: Improving the model")
    
    # 1. Feature Importance Analysis
    # Let's use a Random Forest to assess feature importance
    print("Analyzing feature importance with Random Forest...")
    
    # Train a Random Forest
    rf_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
    ])
    rf_pipeline.fit(X_train, y_train)
    
    # Get feature names from the preprocessor
    # This is complex due to OneHotEncoder creating multiple columns
    cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
    num_cols = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
    
    # Get onehotencoder categories
    cat_features = []
    try:
        ohe_step = rf_pipeline.named_steps['preprocessor'].named_transformers_['cat'].named_steps['encoder']
        for i, col in enumerate(cat_cols):
            categories = ohe_step.categories_[i]
            for cat in categories:
                cat_features.append(f"{col}_{cat}")
    except Exception:
        # Simplified approach if the above fails
        for col in cat_cols:
            cat_features.append(col)
    
    feature_names = num_cols + cat_features
    
    # Get feature importances (limit to the number of actual features)
    if hasattr(rf_pipeline.named_steps['classifier'], 'feature_importances_'):
        importances = rf_pipeline.named_steps['classifier'].feature_importances_
        if len(importances) == len(feature_names):
            # Create a DataFrame of feature importances
            feature_importance_df = pd.DataFrame({
                'Feature': feature_names,
                'Importance': importances
            }).sort_values('Importance', ascending=False)
            
            print("Top 10 Most Important Features:")
            print(feature_importance_df.head(10))
            
            # Plot feature importance
            plt.figure(figsize=(12, 8))
            sns.barplot(x='Importance', y='Feature', data=feature_importance_df.head(15))
            plt.title('Feature Importance')
            plt.tight_layout()
            plt.savefig('feature_importance.png')
            plt.close()
        else:
            print("Feature names and importances dimension mismatch - skipping detailed feature importance")
    else:
        print("Model doesn't provide feature importances - skipping feature importance analysis")
    
    # 2. Hyperparameter Tuning
    print("\nPerforming hyperparameter tuning...")
    
    # Grid search for Random Forest
    param_grid = {
        'classifier__n_estimators': [50, 100],
        'classifier__max_depth': [None, 10, 20],
        'classifier__min_samples_split': [2, 5],
        'classifier__min_samples_leaf': [1, 2]
    }
    
    # Fresh pipeline for the grid search (the small grid above keeps the run time manageable)
    rf_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(random_state=42))
    ])
    
    grid_search = GridSearchCV(rf_pipeline, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_train, y_train)
    
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best cross-validation score: {grid_search.best_score_:.4f}")
    
    # Evaluate the tuned model
    best_model = grid_search.best_estimator_
    y_test_pred = best_model.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_test_pred)
    
    print(f"Tuned Random Forest - Test accuracy: {test_accuracy:.4f}")
    
    # 3. More detailed evaluation of the best model
    print("\nDetailed evaluation of the best model:")
    final_accuracy, final_precision, final_recall, final_f1, final_roc_auc = evaluate_model(best_model, X_test, y_test)
    
    return best_model, test_accuracy

In [109]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report


def bank_marketing_analysis(df):
    """
    Run the entire analysis pipeline
    """
    X_train, X_test, y_train, y_test, processed_data = prepare_and_split(df)
    preprocessor = preprocessing_pipeline(X_train)
    
    baseline_accuracy = create_baseline(y_train)
    
    log_reg_model, log_reg_time, log_reg_train_acc, log_reg_test_acc = build_logistic_regression(
        X_train, X_test, y_train, y_test, preprocessor
    )
    
    log_reg_accuracy, log_reg_precision, log_reg_recall, log_reg_f1, log_reg_roc_auc = evaluate_model(
        log_reg_model, X_test, y_test
    )
    
    model_comparison_results = compare_models(X_train, X_test, y_train, y_test, preprocessor)
    
    best_model, best_model_accuracy = improve_model(X_train, X_test, y_train, y_test, preprocessor)
    
    print("\nSummary of Findings:")
    print(f"Baseline Accuracy: {baseline_accuracy:.4f}")
    print(f"Logistic Regression Accuracy: {log_reg_accuracy:.4f}")
    print(f"Best Model Accuracy: {best_model_accuracy:.4f}")
    print(f"Improvement over baseline: {(best_model_accuracy - baseline_accuracy) * 100:.2f} percentage points")
    
    return processed_data, best_model, model_comparison_results


processed_data, best_model, model_comparison_results = bank_marketing_analysis(df)


Problem 7: Establishing a baseline model
Majority class: 0
Baseline accuracy (always predicting the majority class): 0.8873

Problem 8: Building a Logistic Regression model
Logistic Regression - Training time: 0.2607 seconds
Logistic Regression - Training accuracy: 0.9100
Logistic Regression - Test accuracy: 0.9165

Problem 9: Detailed model evaluation
Accuracy: 0.9165
Precision: 0.7120
Recall: 0.4343
F1 Score: 0.5395
ROC AUC: 0.9425

Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.98      0.95      7310
           1       0.71      0.43      0.54       928

    accuracy                           0.92      8238
   macro avg       0.82      0.71      0.75      8238
weighted avg       0.91      0.92      0.91      8238
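
The figures in this report can be cross-checked with a few lines of arithmetic. The sketch below recomputes F1 from the printed precision and recall, and computes the share of the majority class in the test set from the printed supports. All four input values are copied from the output above.

```python
# Values copied from the evaluation output above.
precision = 0.7120
recall = 0.4343
n_negative = 7310   # support of class 0 in the test set
n_total = 8238      # total number of test rows

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Accuracy obtained by always predicting "no" on this test set.
majority_share = n_negative / n_total

print(f"F1 from precision and recall: {f1:.4f}")
print(f"Test-set majority share:      {majority_share:.4f}")
```

The recomputed F1 matches the reported 0.5395. The majority share is about 0.887, so the 0.9165 test accuracy is under three percentage points above always predicting "no", while a recall of 0.43 means the model misses more than half of the actual subscribers.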


Problem 10: Comparing different models
Model Comparison Results:
                 Model Train Time Train Accuracy Test Accuracy
0  Logistic Regression    0.3260s         0.9100        0.9165
1                 

##### Questions