# Practical Application III: Comparing Classifiers

**Overview**: In this practical application, your goal is to compare the performance of the classifiers we encountered in this section, namely K Nearest Neighbor, Logistic Regression, Decision Trees, and Support Vector Machines.  We will utilize a dataset related to marketing bank products over the telephone.  



### Getting Started

Our dataset comes from the UCI Machine Learning repository [link](https://archive.ics.uci.edu/ml/datasets/bank+marketing).  The data is from a Portugese banking institution and is a collection of the results of multiple marketing campaigns.  We will make use of the article accompanying the dataset [here](CRISP-DM-BANK.pdf) for more information on the data and features.



### Problem 1: Understanding the Data

To gain a better understanding of the data, please read the information provided in the UCI link above, and examine the **Materials and Methods** section of the paper.  How many marketing campaigns does this data represent?

According to the Materials and Methods writeup, the data covers **17** campaigns for the bank. The campaigns occurred between May 2008 and November 2010, corresponding to a total of 79354 contacts, with a 8% success rate.   

### Problem 2: Read in the Data

Use pandas to read in the dataset `bank-additional-full.csv` and assign to a meaningful variable name.

In [None]:
import pandas as pd
import time
import numpy as np
from IPython.display import display

In [72]:
df = pd.read_csv('data/bank-additional-full.csv', sep = ';')

In [73]:
df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


### Problem 3: Understanding the Features


Examine the data description below, and determine if any of the features are missing values or need to be coerced to a different data type.


```
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
```



In [74]:
# 1. Check for missing values (including 'unknown' values in categorical features)
print("=== MISSING VALUES ANALYSIS ===")

# Standard missing values (.isnull())
print("1. Standard missing values (NaN/null):")
missing_counts = df.isnull().sum()
if missing_counts.sum() == 0:
    print("No standard missing values found in any column.")
else:
    print(missing_counts[missing_counts > 0])

# Check for 'unknown' values in categorical columns
print("2. 'Unknown' values in categorical features:")
categorical_cols = df.select_dtypes(include=['object']).columns
unknown_counts = {}

for col in categorical_cols:
    unknown_count = (df[col] == 'unknown').sum()
    if unknown_count > 0:
        unknown_counts[col] = unknown_count
        unknown_pct = (unknown_count / len(df)) * 100
        print(f"   {col}: {unknown_count} ({unknown_pct:.2f}%)")

if not unknown_counts:
    print("   No 'unknown' values found in categorical columns")

# 2. Examine current data types
print("=== DATA TYPE ANALYSIS ===\n")
print("Current data types:")
for col in df.columns:
    print(f"   {col}: {df[col].dtype}")

# 3. Identify features that need type conversion
print("\n=== DATA TYPE RECOMMENDATIONS ===\n")

# Check if categorical features are properly typed
categorical_features = ['job', 'marital', 'education', 'default', 'housing', 'loan', 
                       'contact', 'month', 'day_of_week', 'poutcome', 'y']
print("1. Categorical features that should remain as object/string type:")
for feat in categorical_features:
    if feat in df.columns:
        print(f"   {feat}: Currently {df[feat].dtype} - OK")

# Check numeric features
numeric_features = ['age', 'duration', 'campaign', 'pdays', 'previous', 
                   'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 
                   'euribor3m', 'nr.employed']
print("2. Numeric features:")
for feat in numeric_features:
    if feat in df.columns:
        print(f"   {feat}: Currently {df[feat].dtype} - OK")

# Other
print("3. Other:")
print("   - 'pdays': Value of 999 indicates 'not previously contacted' - maybe needs transformed to something more reasonable?")
print("   - 'duration': drop?")
print("   - Target variable 'y': Binary categorical ('yes'/'no') - will need encoding")

=== MISSING VALUES ANALYSIS ===
1. Standard missing values (NaN/null):
No standard missing values found in any column.
2. 'Unknown' values in categorical features:
   job: 330 (0.80%)
   marital: 80 (0.19%)
   education: 1731 (4.20%)
   default: 8597 (20.87%)
   housing: 990 (2.40%)
   loan: 990 (2.40%)
=== DATA TYPE ANALYSIS ===

Current data types:
   age: int64
   job: object
   marital: object
   education: object
   default: object
   housing: object
   loan: object
   contact: object
   month: object
   day_of_week: object
   duration: int64
   campaign: int64
   pdays: int64
   previous: int64
   poutcome: object
   emp.var.rate: float64
   cons.price.idx: float64
   cons.conf.idx: float64
   euribor3m: float64
   nr.employed: float64
   y: object

=== DATA TYPE RECOMMENDATIONS ===

1. Categorical features that should remain as object/string type:
   job: Currently object - OK
   marital: Currently object - OK
   education: Currently object - OK
   default: Currently object - OK

In [75]:
# Cleanup - Drop all rows with 'unknown' values
print(f"Original dataset shape: {df.shape}")

# Drop rows where any column contains 'unknown'
df_cleaned = df[~df.isin(['unknown']).any(axis=1)]

print(f"Cleaned dataset shape: {df_cleaned.shape}")
print(f"Rows removed: {len(df) - len(df_cleaned)} ({((len(df) - len(df_cleaned))/len(df)*100):.2f}%)")

# Show which columns had 'unknown' values
print("\nColumns that had 'unknown' values:")
for col in df.columns:
    unknown_count = (df[col] == 'unknown').sum()
    if unknown_count > 0:
        print(f"  {col}: {unknown_count} rows")

# Update df to use the cleaned version
df = df_cleaned

# Drop euribor3m column
print(f"\nDropping 'euribor3m' column...")
df = df.drop('euribor3m', axis=1)
print(f"Final dataset shape: {df.shape}")

Original dataset shape: (41188, 21)
Cleaned dataset shape: (30488, 21)
Rows removed: 10700 (25.98%)

Columns that had 'unknown' values:
  job: 330 rows
  marital: 80 rows
  education: 1731 rows
  default: 8597 rows
  housing: 990 rows
  loan: 990 rows

Dropping 'euribor3m' column...
Final dataset shape: (30488, 20)


### Problem 4: Understanding the Task

After examining the description and data, your goal now is to clearly state the *Business Objective* of the task.  State the objective below.

In [76]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 30488 entries, 0 to 41187
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             30488 non-null  int64  
 1   job             30488 non-null  object 
 2   marital         30488 non-null  object 
 3   education       30488 non-null  object 
 4   default         30488 non-null  object 
 5   housing         30488 non-null  object 
 6   loan            30488 non-null  object 
 7   contact         30488 non-null  object 
 8   month           30488 non-null  object 
 9   day_of_week     30488 non-null  object 
 10  duration        30488 non-null  int64  
 11  campaign        30488 non-null  int64  
 12  pdays           30488 non-null  int64  
 13  previous        30488 non-null  int64  
 14  poutcome        30488 non-null  object 
 15  emp.var.rate    30488 non-null  float64
 16  cons.price.idx  30488 non-null  float64
 17  cons.conf.idx   30488 non-null  floa

# Objective

The business objective is to predict whether a client will subscribe to a term deposit based on data from marketing campaigns. By accurately identifying clients who are more likely to subscribe, the bank can focus their telemarketing efforts on high-potential customers, reducing wasted calls and improving conversion rates. This predictive model would help the marketing team allocate resources more efficiently and increase the overall ROI of their campaigns.

### Problem 5: Engineering Features

Now that you understand your business objective, we will build a basic model to get started.  Before we can do this, we must work to encode the data.  Using just the bank information features, prepare the features and target column for modeling with appropriate encoding and transformations.

In [77]:
# Bank features and category encoding

# Select only bank client features
# According to the problem description, bank client features are:
# age, job, marital, education, default, housing, loan

bank_features = ['age', 'job', 'marital', 'education', 'default', 'housing', 'loan']
X_bank = df[bank_features].copy()
y = df['y'].copy()

print(f"Feature columns: {bank_features}")
print(X_bank.head())

Feature columns: ['age', 'job', 'marital', 'education', 'default', 'housing', 'loan']
   age        job  marital            education default housing loan
0   56  housemaid  married             basic.4y      no      no   no
2   37   services  married          high.school      no     yes   no
3   40     admin.  married             basic.6y      no      no   no
4   56   services  married          high.school      no      no  yes
6   59     admin.  married  professional.course      no      no   no


In [78]:
# Encode categorical variables

# Task 2 & 3: Encode categorical variables and target variable
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# First encode the target variable 'y' from yes/no to 1/0
y_encoded = (y == 'yes').astype(int)

# Separate numeric and categorical features
numeric_features = ['age']
categorical_features = ['job', 'marital', 'education', 'default', 'housing', 'loan']

# One-hot encode categorical features
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_categorical_encoded = encoder.fit_transform(X_bank[categorical_features])

# Get feature names for encoded columns
feature_names = encoder.get_feature_names_out(categorical_features)

# Create DataFrame with encoded categorical features
X_categorical_df = pd.DataFrame(X_categorical_encoded, columns=feature_names, index=X_bank.index)

# Combine numeric and encoded categorical features using pandas concat
X_encoded = pd.concat([X_bank[numeric_features], X_categorical_df], axis=1)

print(f"Encoded feature matrix shape: {X_encoded.shape}")
print(f"Target variable distribution: {y_encoded.value_counts().to_dict()}")

Encoded feature matrix shape: (30488, 28)
Target variable distribution: {0: 26629, 1: 3859}


In [79]:
# Verify no missing values and create final feature matrix

print("Checking for missing values in encoded features...")
missing_values = X_encoded.isnull().sum()
if missing_values.sum() == 0:
    print("✓ No missing values found in the encoded features")
else:
    print("Missing values found:")
    print(missing_values[missing_values > 0])

# Create final feature matrix X and target vector y
X = X_encoded
y = y_encoded

print(f"\nFinal feature matrix X shape: {X.shape}")
print(f"Final target vector y shape: {y.shape}")
print(f"Number of features: {X.shape[1]}")
print(f"Number of samples: {X.shape[0]}")

Checking for missing values in encoded features...
✓ No missing values found in the encoded features

Final feature matrix X shape: (30488, 28)
Final target vector y shape: (30488,)
Number of features: 28
Number of samples: 30488


### Problem 6: Train/Test Split

With your data prepared, split it into a train and test set.

In [80]:
# Train/Test Split
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
# Using 80/20 split with stratification to maintain class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")
print(f"\nClass distribution in training set:")
print(y_train.value_counts(normalize=True))
print(f"\nClass distribution in test set:")
print(y_test.value_counts(normalize=True))

Training set size: 24390 samples
Test set size: 6098 samples

Class distribution in training set:
y
0    0.873432
1    0.126568
Name: proportion, dtype: float64

Class distribution in test set:
y
0    0.873401
1    0.126599
Name: proportion, dtype: float64


### Problem 7: A Baseline Model

Before we build our first model, we want to establish a baseline.  What is the baseline performance that our classifier should aim to beat?

In [81]:
# Baseline Model using DummyClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Create a dummy classifier that always predicts the most frequent class
dummy_clf = DummyClassifier(strategy='most_frequent', random_state=42)

# Fit the dummy classifier
dummy_clf.fit(X_train, y_train)

# Make predictions
y_pred_dummy = dummy_clf.predict(X_test)

# Calculate baseline accuracy
baseline_accuracy = accuracy_score(y_test, y_pred_dummy)

print("Baseline Model Performance using DummyClassifier:")
print(f"DummyClassifier using the most_frequent strategy found class to be: ('{dummy_clf.classes_[0]}')")
print(f"Accuracy: {baseline_accuracy:.4f}")


Baseline Model Performance using DummyClassifier:
DummyClassifier using the most_frequent strategy found class to be: ('0')
Accuracy: 0.8734


### Problem 8: A Simple Model

Use Logistic Regression to build a basic model on your data.  

In [82]:
# Logistic Regression Model
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Scale the features for logistic regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and train logistic regression model
log_reg = LogisticRegression(random_state=42, max_iter=1000)

# Measure training time
start_time = time.time()
log_reg.fit(X_train_scaled, y_train)
train_time = time.time() - start_time

# Make predictions
y_pred_train = log_reg.predict(X_train_scaled)
start_time = time.time()
y_pred_test = log_reg.predict(X_test_scaled)
test_time = time.time() - start_time


# Calculate accuracies
train_accuracy = accuracy_score(y_train, y_pred_train)
test_accuracy = accuracy_score(y_test, y_pred_test)

print("Logistic Regression Model Performance:")
print(f"Training time: {train_time:.3f} seconds")
print(f"Training accuracy: {train_accuracy:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
print(f"\nImprovement over baseline: {test_accuracy - baseline_accuracy:.4f}")

Logistic Regression Model Performance:
Training time: 0.015 seconds
Training accuracy: 0.8734
Test accuracy: 0.8734

Improvement over baseline: 0.0000


### Problem 10: Model Comparisons

Now, we aim to compare the performance of the Logistic Regression model to our KNN algorithm, Decision Tree, and SVM models.  Using the default settings for each of the models, fit and score each.  Also, be sure to compare the fit time of each of the models.  Present your findings in a `DataFrame` similar to that below:

| Model | Train Time | Train Accuracy | Test Accuracy |
| ----- | ---------- | -------------  | -----------   |
|     |    |.     |.     |

In [89]:
# K-Nearest Neighbors (KNN) Model
from sklearn.neighbors import KNeighborsClassifier

# Create and train KNN model with default settings (n_neighbors=5)
knn = KNeighborsClassifier()

# Measure training time
start_time = time.time()
knn.fit(X_train_scaled, y_train)
knn_train_time = time.time() - start_time

# Make predictions
y_pred_train_knn = knn.predict(X_train_scaled)
start_time = time.time()
y_pred_test_knn = knn.predict(X_test_scaled)
knn_test_time = time.time() - start_time


# Calculate accuracies
knn_train_accuracy = accuracy_score(y_train, y_pred_train_knn)
knn_test_accuracy = accuracy_score(y_test, y_pred_test_knn)

print("K-Nearest Neighbors Model Performance:")
print(f"Training time: {knn_train_time:.3f} seconds")
print(f"Training accuracy: {knn_train_accuracy:.4f}")
print(f"Test accuracy: {knn_test_accuracy:.4f}")
print(f"Improvement over baseline: {knn_test_accuracy - baseline_accuracy:.4f}")

K-Nearest Neighbors Model Performance:
Training time: 0.004 seconds
Training accuracy: 0.8780
Test accuracy: 0.8622
Improvement over baseline: -0.0112


In [90]:
# Decision Tree Model
from sklearn.tree import DecisionTreeClassifier

# Create and train Decision Tree model with default settings
dt = DecisionTreeClassifier(random_state=42)

# Measure training time
start_time = time.time()
dt.fit(X_train, y_train)  # Decision trees don't require scaled features
dt_train_time = time.time() - start_time

# Make predictions
y_pred_train_dt = dt.predict(X_train)
start_time = time.time()
y_pred_test_dt = dt.predict(X_test)
dt_test_time = time.time() - start_time

# Calculate accuracies
dt_train_accuracy = accuracy_score(y_train, y_pred_train_dt)
dt_test_accuracy = accuracy_score(y_test, y_pred_test_dt)

print("Decision Tree Model Performance:")
print(f"Training time: {dt_train_time:.3f} seconds")
print(f"Test time: {dt_test_time:.3f} seconds")
print(f"Training accuracy: {dt_train_accuracy:.4f}")
print(f"Test accuracy: {dt_test_accuracy:.4f}")
print(f"Improvement over baseline: {dt_test_accuracy - baseline_accuracy:.4f}")

Decision Tree Model Performance:
Training time: 0.039 seconds
Test time: 0.002 seconds
Training accuracy: 0.9005
Test accuracy: 0.8526
Improvement over baseline: -0.0208


In [91]:
# Support Vector Machine (SVM) Model
from sklearn.svm import SVC

# Create and train SVM model with default settings
svm = SVC(random_state=42)

# Measure training time
start_time = time.time()
svm.fit(X_train_scaled, y_train)  # SVM requires scaled features
svm_train_time = time.time() - start_time

# Make predictions
y_pred_train_svm = svm.predict(X_train_scaled)
y_pred_test_svm = svm.predict(X_test_scaled)

# Calculate accuracies
svm_train_accuracy = accuracy_score(y_train, y_pred_train_svm)
start_time = time.time()
svm_test_accuracy = accuracy_score(y_test, y_pred_test_svm)
svm_test_time = time.time() - start_time


print("Support Vector Machine Model Performance:")
print(f"Training time: {svm_train_time:.3f} seconds")
print(f"Training accuracy: {svm_train_accuracy:.4f}")
print(f"Test accuracy: {svm_test_accuracy:.4f}")
print(f"Improvement over baseline: {svm_test_accuracy - baseline_accuracy:.4f}")

Support Vector Machine Model Performance:
Training time: 6.455 seconds
Training accuracy: 0.8738
Test accuracy: 0.8737
Improvement over baseline: 0.0003


In [None]:
# Model Comparison Summary


# Create a DataFrame to compare all models (with numeric values)
from turtle import mode

model_comparison = pd.DataFrame({
    'Model': ['Logistic Regression', 'K-Nearest Neighbors', 'Decision Tree', 'Support Vector Machine'],
    'Train Time': [train_time, knn_train_time, dt_train_time, svm_train_time],
    'Train Accuracy': [train_accuracy, knn_train_accuracy, dt_train_accuracy, svm_train_accuracy],
    'Test Time': [test_time, knn_test_time, dt_test_time, svm_test_time],
    'Test Accuracy': [test_accuracy, knn_test_accuracy, dt_test_accuracy, svm_test_accuracy]
})
#| Model | Train Time | Train Accuracy | Test Accuracy |
display_df = model_comparison.loc[:, ['Model', 'Train Time', 'Train Accuracy', 'Test Accuracy']].copy()
display(display_df)


Unnamed: 0,Model,Train Time,Train Accuracy,Test Accuracy
0,Logistic Regression,0.015045,0.873432,0.873401
1,K-Nearest Neighbors,0.003697,0.878024,0.86225
2,Decision Tree,0.03916,0.900492,0.852575
3,Support Vector Machine,6.45535,0.873801,0.873729


In [None]:
# Lets see if we can get a ranking done just to figure out what we could pick instead of just accuracy
# Calculate ranking score based on: accuracy (most important), test time, train time
# Normalize test accuracy (0-1 scale, higher is better)

max_acc = model_comparison['Test Accuracy'].max()
min_acc = model_comparison['Test Accuracy'].min()
if max_acc != min_acc:
    model_comparison['Accuracy_Score'] = (model_comparison['Test Accuracy'] - min_acc) / (max_acc - min_acc)
else:
    model_comparison['Accuracy_Score'] = 1.0

# Normalize test time (0-1 scale, lower is better so we invert)
max_test_time = model_comparison['Test Time'].max()
min_test_time = model_comparison['Test Time'].min()
if max_test_time != min_test_time:
    model_comparison['Test_Time_Score'] = 1 - (model_comparison['Test Time'] - min_test_time) / (max_test_time - min_test_time)
else:
    model_comparison['Test_Time_Score'] = 1.0

# Normalize training time (0-1 scale, lower is better so we invert)
max_train_time = model_comparison['Train Time'].max()
min_train_time = model_comparison['Train Time'].min()
if max_train_time != min_train_time:
    model_comparison['Train_Time_Score'] = 1 - (model_comparison['Train Time'] - min_train_time) / (max_train_time - min_train_time)
else:
    model_comparison['Train_Time_Score'] = 1.0

# Calculate weighted ranking score (accuracy > test time > train time)
accuracy_weight = 0.6
test_time_weight = 0.3
train_time_weight = 0.1
model_comparison['Ranking_Score'] = (model_comparison['Accuracy_Score'] * accuracy_weight + 
                                     model_comparison['Test_Time_Score'] * test_time_weight +
                                     model_comparison['Train_Time_Score'] * train_time_weight)

# Sort by ranking score (descending)
model_comparison = model_comparison.sort_values('Ranking_Score', ascending=False)
model_comparison['Rank'] = range(1, len(model_comparison) + 1)

# Format the DataFrame for better display
model_comparison['Train Time'] = model_comparison['Train Time'].apply(lambda x: f"{x:.3f}")
model_comparison['Train Accuracy'] = model_comparison['Train Accuracy'].apply(lambda x: f"{x:.4f}")
model_comparison['Test Time'] = model_comparison['Test Time'].apply(lambda x: f"{x:.4f}")
model_comparison['Test Accuracy'] = model_comparison['Test Accuracy'].apply(lambda x: f"{x:.4f}")
model_comparison['Ranking_Score'] = model_comparison['Ranking_Score'].apply(lambda x: f"{x:.4f}")

# Select columns to display
display_cols = ['Rank', 'Model', 'Train Time', 'Train Accuracy', 'Test Time', 'Test Accuracy', 'Ranking_Score']
display_df = model_comparison[display_cols]

# Also display as a formatted table
display(display_df)

(4, 5)


Unnamed: 0,Rank,Model,Train Time,Train Accuracy,Test Time,Test Accuracy,Ranking_Score
0,1,Logistic Regression,0.015,0.8734,0.0004,0.8734,0.9905
3,2,Support Vector Machine,6.455,0.8738,0.0004,0.8737,0.8999
2,3,Decision Tree,0.039,0.9005,0.0018,0.8526,0.3969
1,4,K-Nearest Neighbors,0.004,0.878,0.1723,0.8622,0.3744


### Problem 11: Improving the Model

Now that we have some basic models on the board, we want to try to improve these.  Below, we list a few things to explore in this pursuit.

- More feature engineering and exploration.  For example, should we keep the gender feature?  Why or why not?
- Hyperparameter tuning and grid search.  All of our models have additional hyperparameters to tune and explore.  For example the number of neighbors in KNN or the maximum depth of a Decision Tree.  
- Adjust your performance metric

##### Questions