# scikit-learn Model Training and Improvement

This is a binary classification project (a patient either has heart disease or does not) whose overall goal is model accuracy. The dataset, [obtained from Kaggle](https://www.kaggle.com/danimal/heartdiseaseensembleclassifier?select=Heart_Disease_Data.csv), consists of 303 observations and [13 features](https://www.kaggle.com/iamkon/ml-models-performance-on-risk-prediction#Complete-attribute-documentation). Of the 303 observations, 160 are without heart disease and the remaining 143 have some degree of heart disease. The study from which the data are drawn distinguishes between not having heart disease (0, in the 'pred_attribute' column) and 4 degrees of having the disease (1, 2, 3, 4). [Experiments with this dataset have concentrated on simply distinguishing between having and not having the disease](https://www.kaggle.com/iamkon/ml-models-performance-on-risk-prediction), and I have done the same here.
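
The collapse from five target values to two can be sketched as follows. This is a minimal illustration on a made-up five-row frame, not the project's actual preprocessing code: only the column name `pred_attribute` comes from the dataset, and `has_disease` is a hypothetical name chosen for this example.

```python
import pandas as pd

# Toy stand-in for the 'pred_attribute' column:
# 0 = no heart disease, 1-4 = increasing degrees of disease.
df = pd.DataFrame({'pred_attribute': [0, 2, 0, 4, 1]})

# Collapse the four disease degrees into a single positive class.
# 'has_disease' is a hypothetical column name used only in this sketch.
df['has_disease'] = (df['pred_attribute'] > 0).astype(int)

print(df['has_disease'].tolist())
```

Any value above 0 becomes the positive class, so the model only has to separate "no disease" from "some disease".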

The dataset is already fairly clean, which gave me an opportunity to spend more time comparing the performance of a variety of models (a logistic regression model, SVCs, a KNN model, decision trees and Random Forests, and AdaBoost and XGBoost) as well as to try putting together an ensemble of my own.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pydotplus
import collections
import pickle

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, recall_score, classification_report, confusion_matrix

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from xgboost import XGBClassifier

%matplotlib inline

In [2]:
with open('../data/preprocessed/X_train.pkl', 'rb') as fp:
    X_train = pickle.load(fp)
    
with open('../data/preprocessed/X_test.pkl', 'rb') as fp:
    X_test = pickle.load(fp)
    
with open('../data/preprocessed/y_train.pkl', 'rb') as fp:
    y_train = pickle.load(fp)
    
with open('../data/preprocessed/y_test.pkl', 'rb') as fp:
    y_test = pickle.load(fp)

## Baseline Models for Processed Data

In [3]:
baseline_classifiers = {'LogReg': LogisticRegressionCV(random_state=42),
                        'KNN': KNeighborsClassifier(n_neighbors=3),
                        'SVC': SVC(gamma='auto', random_state=42),
                        'DT': DecisionTreeClassifier(random_state=42),
                        'RF': RandomForestClassifier(random_state=42),
                        'Ada': AdaBoostClassifier(DecisionTreeClassifier(random_state=42), random_state=42), 
                        'XGB': XGBClassifier(random_state=42)}

# baseline = {}
# for clf in baseline_classifiers:
#     name = clf
#     clf = baseline_classifiers[clf]
#     clf.fit(X_train, y_train)
#     preds = clf.predict(X_test)
#     acc = accuracy_score(y_test, preds)
#     baseline[name] = acc
# print(pd.Series(baseline).sort_values(ascending=False))

baseline_acc = {}
baseline_recall = {}
for name, clf in baseline_classifiers.items():
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    baseline_acc[name] = accuracy_score(y_test, preds)
    baseline_recall[name] = recall_score(y_test, preds)
    
print('Recall:')
print(pd.Series(baseline_recall).sort_values(ascending=False))
print('')
print('Accuracy:')
print(pd.Series(baseline_acc).sort_values(ascending=False))

Recall:
LogReg    0.885714
RF        0.857143
SVC       0.857143
KNN       0.828571
XGB       0.800000
Ada       0.771429
DT        0.771429
dtype: float64

Accuracy:
LogReg    0.853333
RF        0.840000
KNN       0.840000
SVC       0.813333
XGB       0.800000
Ada       0.746667
DT        0.746667
dtype: float64


In [4]:
for name, clf in baseline_classifiers.items():
    print(f'\n{name}')
    print(classification_report(y_test, clf.predict(X_test)))


LogReg
              precision    recall  f1-score   support

           0       0.89      0.82      0.86        40
           1       0.82      0.89      0.85        35

    accuracy                           0.85        75
   macro avg       0.85      0.86      0.85        75
weighted avg       0.86      0.85      0.85        75


KNN
              precision    recall  f1-score   support

           0       0.85      0.85      0.85        40
           1       0.83      0.83      0.83        35

    accuracy                           0.84        75
   macro avg       0.84      0.84      0.84        75
weighted avg       0.84      0.84      0.84        75


SVC
              precision    recall  f1-score   support

           0       0.86      0.78      0.82        40
           1       0.77      0.86      0.81        35

    accuracy                           0.81        75
   macro avg       0.82      0.82      0.81        75
weighted avg       0.82      0.81      0.81        75




**Most models improve considerably: every model's recall is up, the KNN model's accuracy increased from 59% to 84%, and the SVC's accuracy increased from 55% to 81%. The exception is dear XGBoost, which saw a 4% decrease in accuracy. I'll use these models as my baseline for comparing models with fine-tuned hyperparameters.**

## Tuning Model Hyperparameters

I used [LogisticRegressionCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html) in my baseline rather than LogisticRegression. Because it chooses its regularization strength by cross-validation, it is effectively tuned already, which is probably why it performed so well out of the box. I carry it over unchanged and start tuning with the KNN model.
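
To make the difference between the two estimators concrete, here is a small sketch on synthetic data (an assumption for illustration; the notebook itself uses the pickled heart disease splits). It shows that `LogisticRegression` keeps one fixed `C`, while `LogisticRegressionCV` searches a grid of `C` values and keeps the best one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

# Synthetic stand-in data, not the heart disease dataset.
X, y = make_classification(n_samples=200, n_features=7, random_state=42)

# LogisticRegression uses a single fixed regularization strength (C=1.0 by default).
plain = LogisticRegression().fit(X, y)

# LogisticRegressionCV evaluates a grid of C values (10 by default) with
# cross-validation and refits on the full data using the best one.
tuned = LogisticRegressionCV(cv=5, random_state=42).fit(X, y)

# np.ravel handles C_ whether it is stored as an array or a scalar.
selected_C = float(np.ravel(tuned.C_)[0])

print(f'Fixed C in LogisticRegression: {plain.C}')
print(f'Candidate C values tried:      {len(tuned.Cs_)}')
print(f'C selected by cross-validation: {selected_C}')
```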

In [5]:
best_models = {}
best_models['lr'] = baseline_classifiers['LogReg']

### KNN

In [6]:
clf_knn = KNeighborsClassifier()
param_grid = {
    'n_neighbors': list(range(1, 100)),
    'weights': ['uniform', 'distance'],
    'p': [1, 2, 3],
    
}

gs_knn = GridSearchCV(clf_knn, param_grid, scoring='recall', cv=10)
gs_knn.fit(X_train, y_train)

test_recalls = {}
test_recalls['KNN'] = recall_score(y_test, gs_knn.predict(X_test))

test_accuracies = {}
test_accuracies['KNN'] = accuracy_score(y_test, gs_knn.predict(X_test))

print(f'Train Recall: {gs_knn.best_score_}')
print(f'Test Recall: {test_recalls["KNN"]}')
print(f'Train Accuracy: {accuracy_score(y_train, gs_knn.predict(X_train))}')
print(f'Test Accuracy: {test_accuracies["KNN"]}')
print(gs_knn.best_params_)

Train Recall: 0.7781818181818182
Test Recall: 0.8285714285714286
Train Accuracy: 1.0
Test Accuracy: 0.84
{'n_neighbors': 7, 'p': 2, 'weights': 'distance'}


In [7]:
best_models['knn']=gs_knn.best_estimator_

### Support Vector Classifier (Linear)

In [8]:
clf_svcl = SVC(random_state=42, probability=True)
param_grid = {
    'kernel': ['linear'], 
    'C': np.linspace(.1, 1, 10), 
    'gamma': ['scale', 'auto'],  # has no effect with a linear kernel; both values fit identical models
}
gs_svcl = GridSearchCV(clf_svcl, param_grid, scoring='recall', cv=10)
gs_svcl.fit(X_train, y_train)

test_recalls['Linear SVC'] = recall_score(y_test, gs_svcl.predict(X_test))
test_accuracies['Linear SVC'] = accuracy_score(y_test, gs_svcl.predict(X_test))

print(f'Train Recall: {gs_svcl.best_score_}')
print(f'Test Recall: {test_recalls["Linear SVC"]}')
print(f'Train Accuracy: {accuracy_score(y_train, gs_svcl.predict(X_train))}')
print(f'Test Accuracy: {test_accuracies["Linear SVC"]}')
print(gs_svcl.best_params_)

Train Recall: 0.749090909090909
Test Recall: 0.8857142857142857
Train Accuracy: 0.8198198198198198
Test Accuracy: 0.8533333333333334
{'C': 0.7000000000000001, 'gamma': 'scale', 'kernel': 'linear'}


In [9]:
best_models['svcl']=gs_svcl.best_estimator_

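Test accuracy (85%) is somewhat higher than train accuracy (82%), which suggests the model isn't overfitting, although with only 75 test observations a gap of this size is within the range of sampling noise.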
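
When reading the train and test figures printed by these cells, it may help to keep in mind that `GridSearchCV.best_score_` is the mean cross-validated score of the best parameter setting, not a score computed on the full training set. The sketch below illustrates this on synthetic data (an assumption; it does not use the notebook's splits) by reproducing `best_score_` with `cross_val_score` and comparing it with recall measured directly on the training data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data, not the heart disease dataset.
X, y = make_classification(n_samples=200, n_features=7, random_state=42)

gs = GridSearchCV(SVC(kernel='linear'), {'C': [0.1, 1.0]}, scoring='recall', cv=5)
gs.fit(X, y)

# best_score_ is the mean recall over the held-out folds for the best C ...
cv_recall = cross_val_score(SVC(kernel='linear', **gs.best_params_), X, y,
                            scoring='recall', cv=5).mean()

# ... whereas recall on the training data scores the refit model on data it has seen.
train_recall = recall_score(y, gs.predict(X))

print(f'best_score_ (mean CV recall): {gs.best_score_:.3f}')
print(f'cross_val_score mean:         {cv_recall:.3f}')
print(f'recall on training data:      {train_recall:.3f}')
```

Read this way, the "Train Recall" lines in this notebook are closer to a validation estimate, which is one reason they can sit below the test recall.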

### Support Vector Classifier (Polynomial)

In [10]:
clf_svcp = SVC(random_state=42, probability=True)
param_grid = {
    'kernel': ['poly'], 
    'degree': list(range(2, 6)), 
    'coef0': np.linspace(.1, 1, 10), 
    'C': np.linspace(.1, 1, 10), 
    'gamma': ['scale', 'auto'], 
}
gs_svcp = GridSearchCV(clf_svcp, param_grid, scoring='recall', cv=10)
gs_svcp.fit(X_train, y_train)

test_recalls['Polynomial SVC'] = recall_score(y_test, gs_svcp.predict(X_test))
test_accuracies['Polynomial SVC'] = accuracy_score(y_test, gs_svcp.predict(X_test))

print(f'Train Recall: {gs_svcp.best_score_}')
print(f'Test Recall: {test_recalls["Polynomial SVC"]}')
print(f'Train Accuracy: {accuracy_score(y_train, gs_svcp.predict(X_train))}')
print(f'Test Accuracy: {test_accuracies["Polynomial SVC"]}')
print(gs_svcp.best_params_)

Train Recall: 0.7290909090909091
Test Recall: 0.8571428571428571
Train Accuracy: 0.8288288288288288
Test Accuracy: 0.8133333333333334
{'C': 0.30000000000000004, 'coef0': 0.7000000000000001, 'degree': 2, 'gamma': 'scale', 'kernel': 'poly'}


In [11]:
best_models['svcp']=gs_svcp.best_estimator_

### Support Vector Classifier (Sigmoid)

In [12]:
clf_svcs = SVC(random_state=42, probability=True)
param_grid = {
    'kernel': ['sigmoid'],
    'coef0': np.linspace(1, 50, 10), 
    'C': np.linspace(.01, .1, 10), 
    'gamma': ['scale', 'auto'], 
}
gs_svcs = GridSearchCV(clf_svcs, param_grid, scoring='recall', cv=10)
gs_svcs.fit(X_train, y_train)

test_recalls['Sigmoid SVC'] = recall_score(y_test, gs_svcs.predict(X_test))
test_accuracies['Sigmoid SVC'] = accuracy_score(y_test, gs_svcs.predict(X_test))

print(f'Train Recall: {gs_svcs.best_score_}')
print(f'Test Recall: {test_recalls["Sigmoid SVC"]}')
print(f'Train Accuracy: {accuracy_score(y_train, gs_svcs.predict(X_train))}')
print(f'Test Accuracy: {test_accuracies["Sigmoid SVC"]}')
print(gs_svcs.best_params_)

Train Recall: 0.7181818181818181
Test Recall: 0.8571428571428571
Train Accuracy: 0.8243243243243243
Test Accuracy: 0.88
{'C': 0.07, 'coef0': 1.0, 'gamma': 'scale', 'kernel': 'sigmoid'}


In [13]:
best_models['svcs']=gs_svcs.best_estimator_

Like the linear SVC, the sigmoid SVC scores higher on the test data (88% accuracy) than on the training data (82%).

### Support Vector Classifier (Radial Basis Function)

In [14]:
clf_svcrbf = SVC(random_state=42, probability=True)
param_grid = {
    'kernel': ['rbf'],
    'coef0': np.linspace(.001, .01, 10),  # coef0 only matters for the poly and sigmoid kernels; it has no effect here
    'C': np.linspace(.01, .1, 10), 
    'gamma': ['scale', 'auto'], 
}
gs_svcrbf = GridSearchCV(clf_svcrbf, param_grid, scoring='recall', cv=10)
gs_svcrbf.fit(X_train, y_train)

test_recalls['RBF SVC'] = recall_score(y_test, gs_svcrbf.predict(X_test))
test_accuracies['RBF SVC'] = accuracy_score(y_test, gs_svcrbf.predict(X_test))

print(f'Train Recall: {gs_svcrbf.best_score_}')
print(f'Test Recall: {test_recalls["RBF SVC"]}')
print(f'Train Accuracy: {accuracy_score(y_train, gs_svcrbf.predict(X_train))}')
print(f'Test Accuracy: {test_accuracies["RBF SVC"]}')
print(gs_svcrbf.best_params_)

Train Recall: 0.7281818181818182
Test Recall: 0.8857142857142857
Train Accuracy: 0.8378378378378378
Test Accuracy: 0.8266666666666667
{'C': 0.1, 'coef0': 0.001, 'gamma': 'scale', 'kernel': 'rbf'}


In [15]:
best_models['svcrbf']=gs_svcrbf.best_estimator_

### Decision Tree

In [16]:
clf_dt = DecisionTreeClassifier(random_state=42)
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 3, 5, 7, 9, 11, 13, 15], 
    'min_samples_split': [10, 20, 30, 40, 50, 60], 
    'min_samples_leaf': [5, 10, 15, 20, 25, 30, 35, 40],
}
gs_dt = GridSearchCV(clf_dt, param_grid, scoring='recall', cv=10)
gs_dt.fit(X_train, y_train)

test_recalls['Decision Tree'] = recall_score(y_test, gs_dt.predict(X_test))
test_accuracies['Decision Tree'] = accuracy_score(y_test, gs_dt.predict(X_test))

print(f'Train Recall: {gs_dt.best_score_}')
print(f'Test Recall: {test_recalls["Decision Tree"]}')
print(f'Train Accuracy: {accuracy_score(y_train, gs_dt.predict(X_train))}')
print(f'Test Accuracy: {test_accuracies["Decision Tree"]}')
print(gs_dt.best_params_)

Train Recall: 0.7672727272727273
Test Recall: 0.8571428571428571
Train Accuracy: 0.8558558558558559
Test Accuracy: 0.8533333333333334
{'criterion': 'gini', 'max_depth': 7, 'min_samples_leaf': 5, 'min_samples_split': 20}


In [17]:
best_models['dt']=gs_dt.best_estimator_

In [18]:
'''
Used this code to produce a .png file with visualization of decision tree.
Commented out now so as not to produce multiple files.
'''

# dot_data = tree.export_graphviz(gs_dt.best_estimator_,
#                                 feature_names=X_train.columns,
#                                 out_file=None,
#                                 filled=True,
#                                 rounded=True)
# graph = pydotplus.graph_from_dot_data(dot_data)

# colors = ('turquoise', 'orange')
# edges = collections.defaultdict(list)

# for edge in graph.get_edge_list():
#     edges[edge.get_source()].append(int(edge.get_destination()))

# for edge in edges:
#     edges[edge].sort()    
#     for i in range(2):
#         dest = graph.get_node(str(edges[edge][i]))[0]
#         dest.set_fillcolor(colors[i])

# graph.write_png('tree.png')


### Random Forest

In [19]:
clf_rf = RandomForestClassifier(random_state=42)
param_grid = {
    'n_estimators': [50, 100, 500, 1000], 
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 3, 7, 20], 
    'min_samples_split': [10, 40, 100], 
    'min_samples_leaf': [10, 100],
}
gs_rf = GridSearchCV(clf_rf, param_grid, verbose=1, scoring='recall', cv=3)
gs_rf.fit(X_train, y_train)

test_recalls['Random Forest'] = recall_score(y_test, gs_rf.predict(X_test))
test_accuracies['Random Forest'] = accuracy_score(y_test, gs_rf.predict(X_test))

print(f'Train Recall: {gs_rf.best_score_}')
print(f'Test Recall: {test_recalls["Random Forest"]}')
print(f'Train Accuracy: {accuracy_score(y_train, gs_rf.predict(X_train))}')
print(f'Test Accuracy: {test_accuracies["Random Forest"]}')
print(gs_rf.best_params_)

Fitting 3 folds for each of 192 candidates, totalling 576 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Train Recall: 0.7647058823529411
Test Recall: 0.8571428571428571
Train Accuracy: 0.8288288288288288
Test Accuracy: 0.84
{'criterion': 'gini', 'max_depth': 3, 'min_samples_leaf': 10, 'min_samples_split': 40, 'n_estimators': 50}


[Parallel(n_jobs=1)]: Done 576 out of 576 | elapsed:  2.7min finished


In [20]:
best_models['rf']=gs_rf.best_estimator_

### AdaBoost

In [21]:
# Written for an older scikit-learn: release 1.2 renamed `base_estimator` to `estimator`.
clf_ab = AdaBoostClassifier(algorithm='SAMME.R', random_state=42)
param_grid = {
    'base_estimator': [DecisionTreeClassifier(max_depth=1), LogisticRegression(solver='lbfgs', multi_class='auto')], 
    'n_estimators': [10, 30, 50, 1000], 
    'learning_rate': [.0001, .001, .01, .1]
}
gs_ab = GridSearchCV(clf_ab, param_grid, scoring='recall', cv=5)
gs_ab.fit(X_train, y_train)

test_recalls['AdaBoost'] = recall_score(y_test, gs_ab.predict(X_test))
test_accuracies['AdaBoost'] = accuracy_score(y_test, gs_ab.predict(X_test))

print(f'Train Recall: {gs_ab.best_score_}')
print(f'Test Recall: {test_recalls["AdaBoost"]}')
print(f'Train Accuracy: {accuracy_score(y_train, gs_ab.predict(X_train))}')
print(f'Test Accuracy: {test_accuracies["AdaBoost"]}')
print(gs_ab.best_params_)

Train Recall: 0.7847619047619048
Test Recall: 0.8571428571428571
Train Accuracy: 0.8558558558558559
Test Accuracy: 0.84
{'base_estimator': DecisionTreeClassifier(max_depth=1), 'learning_rate': 0.1, 'n_estimators': 30}


In [22]:
best_models['ab']=gs_ab.best_estimator_

### XGBoost

In [24]:
'''
XGB removed for the time being until I can figure out what's changed with the newest updates.
'''

# clf_xgb = XGBClassifier(random_state=42)  # `probability=True` is an SVC option, not an XGBClassifier parameter
# param_grid = {
#     'max_depth': [1, 3, 5],
#     'learning_rate': [.05, .1, .15], 
#     'subsample': [.7, .8, .9],
#     'colsample_bytree': np.linspace(.1, 1, 10),
#     'min_child_weight': [10, 20, 30], 
#     'n_estimators': [10, 100, 500]
# }
# gs_xgb = GridSearchCV(clf_xgb, param_grid, scoring='recall', cv=5)
# gs_xgb.fit(X_train, y_train)

# test_recalls['XGBoost'] = recall_score(y_test, gs_xgb.predict(X_test))
# test_accuracies['XGBoost'] = accuracy_score(y_test, gs_xgb.predict(X_test))

# print(f'Train Recall: {gs_xgb.best_score_}')
# print(f'Test Recall: {test_recalls["XGBoost"]}')
# print(f'Train Accuracy: {accuracy_score(y_train, gs_xgb.predict(X_train))}')
# print(f'Test Accuracy: {test_accuracies["XGBoost"]}')
# print(gs_xgb.best_params_)


In [25]:
# best_models['xgb']=gs_xgb.best_estimator_

### Summary

In [26]:
test_recalls['Logistic Regression'] = baseline_recall['LogReg']
test_accuracies['Logistic Regression'] = baseline_acc['LogReg']
print('Recall')
print(pd.Series(test_recalls).sort_values(ascending=False))
print('')
print('Accuracy')
print(pd.Series(test_accuracies).sort_values(ascending=False))

Recall
Logistic Regression    0.885714
RBF SVC                0.885714
Linear SVC             0.885714
AdaBoost               0.857143
Random Forest          0.857143
Decision Tree          0.857143
Sigmoid SVC            0.857143
Polynomial SVC         0.857143
KNN                    0.828571
dtype: float64

Accuracy
Sigmoid SVC            0.880000
Logistic Regression    0.853333
Decision Tree          0.853333
Linear SVC             0.853333
AdaBoost               0.840000
Random Forest          0.840000
KNN                    0.840000
RBF SVC                0.826667
Polynomial SVC         0.813333
dtype: float64


In [27]:
for model in best_models:
    with open(f'../models/{model}.pkl', 'wb') as fp:
        pickle.dump(best_models[model], fp)

## Creating an Ensemble Classifier

Here I combined three of the tuned models into a single voting classifier: the linear SVC, a logistic regression (`LogisticRegressionCV`), and AdaBoost. Although the variable below is named `clf_votehard`, it is configured with `voting='soft'`, so it averages the models' predicted class probabilities rather than counting majority votes.

In [28]:
clf_votehard = VotingClassifier(
    estimators=[('svcl', gs_svcl.best_estimator_),
                ('lr', LogisticRegressionCV(random_state=42)), 
                ('ab', gs_ab.best_estimator_)],
    voting='soft')
clf_votehard.fit(X_train, y_train)
recall_score(y_test, clf_votehard.predict(X_test))

0.8857142857142857

In [29]:
confusion_matrix(y_test, clf_votehard.predict(X_test))

array([[33,  7],
       [ 4, 31]])

This ensemble reaches a test recall of 0.886 and, from the confusion matrix, a test accuracy of 85% (64 of 75 correct). That matches the linear SVC and logistic regression on their own but falls short of the sigmoid SVC's 88% accuracy. I then tried a larger soft-voting ensemble of five models: the KNN model, the sigmoid SVC (the best single model on accuracy), the decision tree, AdaBoost, and logistic regression, hoping the extra diversity in models might produce a more powerful ensemble.
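
For readers unfamiliar with the two voting schemes, the sketch below shows how hard and soft voting can disagree on the same inputs. The probabilities are made-up numbers for three hypothetical classifiers, not outputs from the models in this notebook.

```python
import numpy as np

# Hypothetical P(class = 1) from three classifiers for four patients (made-up numbers).
probs = np.array([
    [0.90, 0.40, 0.45],   # one confident "yes", two weak "no" votes
    [0.60, 0.55, 0.10],   # two weak "yes" votes, one confident "no"
    [0.20, 0.30, 0.40],
    [0.51, 0.52, 0.05],
])

# Hard voting: each model casts a 0/1 vote and the majority wins.
votes = (probs > 0.5).astype(int)
hard = (votes.sum(axis=1) >= 2).astype(int)

# Soft voting: average the probabilities, then threshold the mean.
soft = (probs.mean(axis=1) > 0.5).astype(int)

print('hard:', hard.tolist())   # hard: [0, 1, 0, 1]
print('soft:', soft.tolist())   # soft: [1, 0, 0, 0]
```

Soft voting lets one confident model outweigh two hesitant ones, which is why it requires estimators that expose `predict_proba` (hence `probability=True` on the SVCs above).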

In [30]:
clf_votesoft = VotingClassifier(
    estimators=[('knn', gs_knn.best_estimator_),
                ('svcs', gs_svcs.best_estimator_), 
                ('dt', gs_dt.best_estimator_), 
                ('ab', gs_ab.best_estimator_), 
                ('lr', LogisticRegressionCV(random_state=42))],
    voting='soft')
clf_votesoft.fit(X_train, y_train)
recall_score(y_test, clf_votesoft.predict(X_test))

0.8571428571428571

In [31]:
confusion_matrix(y_test, clf_votesoft.predict(X_test))

array([[33,  7],
       [ 5, 30]])

Again, the result is unimpressive: a recall of 0.857 and an accuracy of 84% (63 of 75 correct) are exactly what AdaBoost achieves alone, and the recall is the same as the decision tree's.
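
The accuracy and recall of both ensembles can be read directly off the two confusion matrices printed above. The short sketch below does that arithmetic; the dictionary keys are descriptive labels chosen for this example.

```python
import numpy as np

# Confusion matrices copied from the two cells above.
# scikit-learn's layout: rows are true classes, columns are predicted classes.
matrices = {
    'three-model ensemble': np.array([[33, 7], [4, 31]]),
    'five-model ensemble': np.array([[33, 7], [5, 30]]),
}

results = {}
for name, cm in matrices.items():
    tn, fp, fn, tp = cm.ravel()
    results[name] = {
        'accuracy': (tp + tn) / cm.sum(),   # share of all 75 predictions that are correct
        'recall': tp / (tp + fn),           # share of the 35 true positives that are found
    }
    print(f"{name}: accuracy={results[name]['accuracy']:.3f}, "
          f"recall={results[name]['recall']:.3f}")
```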

In [32]:
param_grid = {
    'ab__base_estimator': [DecisionTreeClassifier(max_depth=1), LogisticRegression(solver='lbfgs', multi_class='auto')], 
    'ab__n_estimators': [10, 30, 50, 1000], 
    'ab__learning_rate': [.0001, .001, .01, .1],
    'svcl__kernel': ['linear'], 
    'svcl__C': np.linspace(.1, 1, 10), 
    'svcl__gamma': ['scale', 'auto'] 
}
gs_hard = GridSearchCV(clf_votehard, param_grid, scoring='recall', cv=5)
gs_hard.fit(X_train, y_train)

print(f'Train Recall: {gs_hard.best_score_}')
print(f'Test Recall: {recall_score(y_test, gs_hard.predict(X_test))}')
print(f'Train Accuracy: {accuracy_score(y_train, gs_hard.predict(X_train))}')
print(f'Test Accuracy: {accuracy_score(y_test, gs_hard.predict(X_test))}')
print(gs_hard.best_params_)

Train Recall: 0.7461904761904761
Test Recall: 0.8857142857142857
Train Accuracy: 0.8288288288288288
Test Accuracy: 0.8533333333333334
{'ab__base_estimator': DecisionTreeClassifier(max_depth=1), 'ab__learning_rate': 0.01, 'ab__n_estimators': 1000, 'svcl__C': 0.30000000000000004, 'svcl__gamma': 'scale', 'svcl__kernel': 'linear'}


In [None]:
'''
Here I tried a lot of different weight combos in a soft vote classifier 
with 4 of the best performing models, but wasn't able to beat the 87%-89% accuracy
above. Already the algorithm is time consuming, and it is inelegant, so I have
commented it out. Possibly future work.
'''

# w = [0, 1, 2, 3]
# combos = []
# scores = []
# for a in w:
#     for b in w:
#         for c in w:
#             for d in w:
#                 if a==0 and b==0 and c==0 and d==0: 
#                     continue
#                 else:
#                     clf_votesoft = VotingClassifier(
#                         estimators=[('lr', baseline_classifiers['LogReg']), 
#                                     ('svcs', gs_svcs.best_estimator_),
#                                     ('dt', gs_dt.best_estimator_), 
#                                     ('ab', gs_ab.best_estimator_)],
#                         voting='soft', 
#                         weights=[a, b, c, d])
#                     clf_votesoft.fit(X_train, y_train)
#                     combos.append([a, b, c, d])
#                     scores.append(accuracy_score(y_test, clf_votesoft.predict(X_test)))
# df = pd.DataFrame([combos, scores]).T
# df.columns = ['combos', 'accuracy']
# df.sort_values(by='combos', ascending=True).head(15)
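
One way to make the weight search above less unwieldy is to generate the weight combinations with `itertools.product` instead of four nested loops. The sketch below covers only the enumeration step and fits no models; it is a suggestion, not the code that was originally run.

```python
from itertools import product

weights = [0, 1, 2, 3]

# Every 4-tuple of weights except the all-zero one, which would leave
# the soft vote with nothing to average.
combos = [combo for combo in product(weights, repeat=4) if any(combo)]

print(len(combos))    # 4**4 - 1 = 255
print(combos[:3])     # [(0, 0, 0, 1), (0, 0, 0, 2), (0, 0, 0, 3)]
```

Each tuple could then be passed as `weights=list(combo)` to a `VotingClassifier`. Scoring the candidates with cross-validation on the training data, rather than on the test set, would keep the final test accuracy an honest estimate.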

## Conclusion

1. There was little preprocessing necessary for this project as the [dataset](https://www.kaggle.com/danimal/heartdiseaseensembleclassifier) was already quite clean. I used a **heatmap of Pearson correlation coefficients** to identify correlated features (which I dropped) and used **feature importance** from a Random Forest to select the 7 most important features of the original 13. Lastly, I **standardized** all features. 


2. After tuning, accuracy on test data improved for all models except logistic regression, which lost 2 percentage points:

| Model | Initial Test Accuracy | Final Test Accuracy |
|-|-|-|
| Ensemble Classifier (Soft, Unprocessed) | -- | **89%** |
| Sigmoid SVC | 55% | **88%** |
| AdaBoost | 72% | **87%** |
| Decision Tree | 77% | **87%** |
| Ensemble Classifier (Soft) | -- | **87%** |
| Ensemble Classifier (Hard) | -- | **87%** |
| Logistic Regression | 87% | **85%** |
| Random Forest | 83% | **85%** |
| XGBoost | 81% | **84%** |
| KNN | 59% | **82%** |


3. On the preprocessed, feature-selected data, hard- and soft-vote ensemble classifiers both achieved 87% accuracy on test data. The hard-vote ensemble consisted of a sigmoid SVC, a decision tree, and an AdaBoost classifier; the soft-vote classifier additionally included a KNN classifier.


4. **The model with the highest accuracy on test data (89%) was a soft-vote ensemble classifier using a sigmoid SVC, a decision tree, an AdaBoost classifier, a KNN classifier, and a logistic regression classifier. It was only able to achieve this accuracy on data *that had not undergone dimensionality reduction or standardization* (but had been cleaned of missing values).**
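
The preprocessing steps summarized in item 1 (dropping correlated features, ranking by Random Forest importance, standardizing) can be sketched roughly as below. This is an illustration on synthetic data under stated assumptions, not the project's preprocessing notebook: the 0.9 correlation cutoff, the toy column names, and keeping the top 2 features are all hypothetical choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in data: 'b' is nearly a copy of 'a', so the pair is highly correlated.
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=['a', 'b', 'c', 'd'])
X['b'] = X['a'] + rng.normal(scale=0.05, size=200)
y = (X['a'] + X['c'] > 0).astype(int)

# 1. Drop one feature from each highly correlated pair (0.9 is an assumed cutoff).
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)

# 2. Rank the remaining features by Random Forest importance and keep the top k.
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_reduced, y)
importances = pd.Series(rf.feature_importances_, index=X_reduced.columns)
top_k = importances.sort_values(ascending=False).head(2).index.tolist()

# 3. Standardize the selected features to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_reduced[top_k])

print('dropped:', to_drop)
print('kept:', top_k)
```

In a real pipeline the scaler would be fit on the training split only and then applied to the test split, so that no test-set statistics leak into training.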