<a href="https://www.kaggle.com/code/vinikuhlmann/model-comparison-examples?scriptVersionId=172905029" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

This work was developed by Matheus Bermudes Viana, Rafael Kuhn Takano, Vinicius Silva Fernandes Kuhlmann and Vitor Souza Amim for the Artificial Intelligence course of the University of São Paulo's Computer Science BS Program. It aims to compare the following ML models:

- Decision tree: a hierarchical model that recursively partitions the feature space based on simple decision rules, enabling classification or regression by sequentially splitting data into increasingly homogeneous subsets.

- K-Nearest Neighbors (KNN): a non-parametric algorithm that classifies data points based on the majority vote of their nearest neighbors in a feature space, where the value of K determines the number of neighbors considered.

- Naive Bayes: a probabilistic classification algorithm based on Bayes' theorem that assumes the features are conditionally independent given the class. It scores each class by multiplying the class prior by the conditional probability of each feature given that class, then selects the class with the highest score.

- Multilayer Perceptron (MLP): a feedforward artificial neural network composed of multiple layers of interconnected neurons, capable of learning complex patterns and relationships in data through forward propagation and backpropagation algorithms.
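
To make the KNN description above concrete, here is a minimal from-scratch sketch of the majority vote. It is only an illustration: the helper name `knn_predict` and the toy points are made up for this example, and the notebook itself relies on scikit-learn's `KNeighborsClassifier`.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two well-separated clusters
X_train = np.array(
    [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
)
y_train = np.array(["a", "a", "a", "b", "b", "b"])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))
print(knn_predict(X_train, y_train, np.array([5.1, 5.0]), k=3))
```

A larger `k` smooths the decision boundary at the cost of blurring small clusters, which is why `n_neighbors` is one of the hyperparameters searched later.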

# Setup

In [1]:
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.exceptions import ConvergenceWarning
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

sns.set_theme(
    context="notebook",
    style="darkgrid",
    palette="colorblind",
    font="sans-serif",
    font_scale=1,
    rc=None,
)

# Define models and their hyperparameters
models = [
    {
        "name": "KNN",
        "model": KNeighborsClassifier(),
        "params": {
            "n_neighbors": [3, 5, 10, 20],
            "weights": ["uniform", "distance"],
            "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
            "leaf_size": [10, 30, 50],
            "p": [1, 2],
        },
    },
    {"name": "Naive Bayes", "model": GaussianNB(), "params": {}},
    {
        "name": "Decision Tree",
        "model": DecisionTreeClassifier(random_state=0),
        "params": {"criterion": ["gini", "entropy"], "splitter": ["best", "random"]},
    },
    {
        "name": "MLP",
        "model": MLPClassifier(
            max_iter=10000,
            early_stopping=True,
            n_iter_no_change=1,
            tol=0.001,
            random_state=0,
        ),
        "params": {
            "hidden_layer_sizes": [(15,), (50,), (100,), (15, 15), (50, 50)],
            "learning_rate_init": [0.001, 0.01, 0.1],
            "learning_rate": ["constant", "adaptive"],
        },
    },
]


warnings.simplefilter("ignore", category=ConvergenceWarning)


def analyze_models_on_dataset(dataset_name, X, y):
    """Grid-search every model on an 80/20 split of (X, y) and print CV and test results."""
    print(f"\n ---- DATASET: {dataset_name} ---- ")

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    # Perform grid search for each model
    for model in models:
        print(f"\nGrid search for {model['name']}...")
        clf = GridSearchCV(model["model"], model["params"], cv=5, n_jobs=-1)
        clf.fit(X_train, y_train)

        print(f"Mean fit time: {clf.cv_results_['mean_fit_time'].mean() * 1000:.2f} ms")
        print(f"Mean score time: {clf.cv_results_['mean_score_time'].mean() * 1000:.2f} ms")
        print("Best parameters set found on development set:")
        print(clf.best_params_)
        print("Best 5 grid scores on development set:")
        means = clf.cv_results_["mean_test_score"]
        stds = clf.cv_results_["std_test_score"]
        params = clf.cv_results_["params"]
        best_values = sorted(
            zip(means, stds, params), key=lambda x: x[0], reverse=True
        )[:5]
        # Use a distinct loop name so the full `params` list above is not shadowed
        for mean, std, candidate in best_values:
            params_str = ", ".join(f"{k}={v}" for k, v in candidate.items())
            print(f"{mean:.4f} (+/-{std:.4f}) for ({params_str})")

        # Evaluate best model on test data
        print("Detailed classification report:")
        y_true, y_pred = y_test, clf.predict(X_test)
        print(classification_report(y_true, y_pred))
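
To get a feel for how much work each grid search above does, the sketch below counts the candidates per model with scikit-learn's `ParameterGrid` and multiplies by the 5 folds. The `grids` dictionary is copied by hand from the `models` list; the variable names are ours. With the default `refit=True`, `GridSearchCV` also performs one extra fit of the best candidate on the full training split, which is not included in these counts.

```python
from sklearn.model_selection import ParameterGrid

# Hyperparameter grids copied from the `models` list defined above
grids = {
    "KNN": {
        "n_neighbors": [3, 5, 10, 20],
        "weights": ["uniform", "distance"],
        "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
        "leaf_size": [10, 30, 50],
        "p": [1, 2],
    },
    "Naive Bayes": {},
    "Decision Tree": {
        "criterion": ["gini", "entropy"],
        "splitter": ["best", "random"],
    },
    "MLP": {
        "hidden_layer_sizes": [(15,), (50,), (100,), (15, 15), (50, 50)],
        "learning_rate_init": [0.001, 0.01, 0.1],
        "learning_rate": ["constant", "adaptive"],
    },
}
cv_folds = 5

fits_per_model = {}
for name, grid in grids.items():
    # Number of distinct hyperparameter combinations in this grid
    n_candidates = len(ParameterGrid(grid))
    fits_per_model[name] = n_candidates * cv_folds
    print(f"{name}: {n_candidates} candidates x {cv_folds} folds = {fits_per_model[name]} fits")
```

This also puts the reported "mean fit time" in context: it is an average over every candidate in the grid, not the fit time of the best model alone.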

# Dataset 1: Iris

From [Mathnerd on Kaggle](https://www.kaggle.com/datasets/arshid/iris-flower-dataset):

> The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. The data set consists of 50 samples from each of three species of Iris (Iris Setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

In [2]:
iris_df = pd.read_csv("/kaggle/input/iris-flower-dataset/IRIS.csv")
iris_df.head()

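|   | sepal_length | sepal_width | petal_length | petal_width | species     |
|---|--------------|-------------|--------------|-------------|-------------|
| 0 | 5.1          | 3.5         | 1.4          | 0.2         | Iris-setosa |
| 1 | 4.9          | 3.0         | 1.4          | 0.2         | Iris-setosa |
| 2 | 4.7          | 3.2         | 1.3          | 0.2         | Iris-setosa |
| 3 | 4.6          | 3.1         | 1.5          | 0.2         | Iris-setosa |
| 4 | 5.0          | 3.6         | 1.4          | 0.2         | Iris-setosa |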


## Testing the models

In [3]:
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(iris_df.drop("species", axis=1)))
X.columns = iris_df.columns[:-1]
y = iris_df["species"]
analyze_models_on_dataset("Iris", X, y)


 ---- DATASET: Iris ---- 

Grid search for KNN...
Mean fit time: 3.60 ms
Mean score time: 5.80 ms
Best parameters set found on development set:
{'algorithm': 'auto', 'leaf_size': 10, 'n_neighbors': 10, 'p': 1, 'weights': 'uniform'}
Best 5 grid scores on development set:
0.9500 (+/-0.0486) for (algorithm=auto, leaf_size=10, n_neighbors=10, p=1, weights=uniform)
0.9500 (+/-0.0486) for (algorithm=auto, leaf_size=30, n_neighbors=10, p=1, weights=uniform)
0.9500 (+/-0.0486) for (algorithm=auto, leaf_size=50, n_neighbors=10, p=1, weights=uniform)
0.9500 (+/-0.0486) for (algorithm=ball_tree, leaf_size=10, n_neighbors=10, p=1, weights=uniform)
0.9500 (+/-0.0486) for (algorithm=ball_tree, leaf_size=30, n_neighbors=10, p=1, weights=uniform)
Detailed classification report:
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      1.00      1.00        13
 Iris-virginica       1.00      1.00      1.00        

## Conclusion

Both KNN and Decision Tree achieved a perfect score on the test set. This is a well-known and relatively simple dataset with only four features and three classes, making it inherently easier for models to learn and classify accurately. Naive Bayes also performed relatively well, while MLP performed poorly, particularly given its long fit time.

# Dataset 2: Titanic Survivors

From [Kaggle](https://www.kaggle.com/code/nadintamer/titanic-survival-predictions-beginner):

> The sinking of the Titanic is one of the most infamous shipwrecks in history.
>
> On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
>
> While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
>
> In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

| Variable | Definition | Key |
|----------|------------|-----|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex |  |
| Age | Age in years |  |
| sibsp | # of siblings / spouses aboard the Titanic |  |
| parch | # of parents / children aboard the Titanic |  |
| ticket | Ticket number |  |
| fare | Passenger fare |  |
| cabin | Cabin number |  |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |


In [4]:
titanic_df = pd.read_csv("/kaggle/input/titanic/train.csv", index_col=0)
titanic_df.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Preprocessing

First, we must remove columns that are not useful for predicting the target.

In [5]:
titanic_df = titanic_df.drop(["Ticket", "Cabin"], axis=1)
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 1 to 891
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Fare      891 non-null    float64
 8   Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(3)
memory usage: 69.6+ KB


Then, we must impute missing values. As this project involves comparing different ML models instead of getting the best possible predictions, we will do a simple median imputation instead of a regression or KNN one.

In [6]:
titanic_df["Age"] = SimpleImputer(strategy="median").fit_transform(titanic_df[["Age"]])
titanic_df["Embarked"] = titanic_df["Embarked"].replace({"NaN": np.nan})
titanic_df["Embarked"] = (
    SimpleImputer(strategy="most_frequent")
    .set_output(transform="pandas")
    .fit_transform(titanic_df[["Embarked"]])
)

# We could use plain pandas here, but SimpleImputer is consistent, reproducible
# and compatible with Pipelines, so it is good practice to use it.

print("Missing values after imputation:")
titanic_df.isna().sum()

Missing values after imputation:


Survived    0
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64
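
As a small illustration of what the median strategy used above does, the sketch below imputes a toy `Age` column. The values are made up for this example; only `SimpleImputer(strategy="median")` comes from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column (values made up for illustration) with two missing ages
toy = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0]})

imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(toy[["Age"]]).ravel()

# The median is learned from the observed values only (22, 35, 38, 54)
print(imputer.statistics_)
# Every missing entry receives that same value
print(filled)
```

Because every gap is filled with the same number, this approach is cheap but shrinks the column's variance, which is the trade-off accepted here in exchange for simplicity.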

There is interesting information to be found in the names: titles. We can extract and list them below:

In [7]:
titanic_df["Title"] = titanic_df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
titanic_df.drop("Name", axis=1, inplace=True)
pd.crosstab(titanic_df["Title"], titanic_df["Sex"])

| Title    | female | male |
|----------|--------|------|
| Capt     | 0      | 1    |
| Col      | 0      | 2    |
| Countess | 1      | 0    |
| Don      | 0      | 1    |
| Dr       | 1      | 6    |
| Jonkheer | 0      | 1    |
| Lady     | 1      | 0    |
| Major    | 0      | 2    |
| Master   | 0      | 40   |
| Miss     | 182    | 0    |
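
The title pattern used above, `" ([A-Za-z]+)\."`, captures the first run of letters that follows a space and ends with a period. A quick check on a few of the names shown in the first rows of the dataset (a sketch, with the sample list chosen by us):

```python
import pandas as pd

# Names taken from the first rows of the Titanic data shown earlier
names = pd.Series(
    [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ]
)

# expand=False returns a Series holding the single capture group
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```

Since `str.extract` keeps only the first match, a later abbreviation in the same name does not override the title.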


We already have the sex information, so the new information the titles give us is that some passengers held a higher social rank than others.

In [8]:
important_titles = {
    "Countess",
    "Lady",
    "Sir",
    "Don",
    "Jonkheer",
    "Col",
    "Capt",
    "Major",
    "Master",
    "Rev",
    "Dr",
}

titanic_df["Title"] = titanic_df["Title"].map(
    lambda title: "Important" if title in important_titles else "Other"
)

titanic_df[["Title", "Survived"]].groupby(["Title"], as_index=False).mean()

|   | Title     | Survived |
|---|-----------|----------|
| 0 | Important | 0.492063 |
| 1 | Other     | 0.375604 |


There seems to be a positive bias towards important people surviving.

Finally, we must encode all categorical features as numerical values so that the KNN and MLP algorithms can use them. Scaling the numerical features also keeps features with larger ranges from dominating the distances computed by the KNN model.

In [9]:
titanic_df["Sex"] = LabelEncoder().fit_transform(titanic_df["Sex"])
titanic_df["Title"] = LabelEncoder().fit_transform(titanic_df["Title"])

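mat = OneHotEncoder().fit_transform(titanic_df[["Embarked"]])
# Caution: this new frame gets a default 0-based index while titanic_df is indexed by
# PassengerId (1..891), so join() pairs each passenger with the next row's port and
# leaves the last passenger with NaN (removed by dropna below). Passing
# index=titanic_df.index to pd.DataFrame would keep the rows aligned.
titanic_df = titanic_df.join(pd.DataFrame(mat.toarray(), columns=["C", "Q", "S"]))
titanic_df.drop("Embarked", axis=1, inplace=True)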

titanic_df[["Age", "Fare"]] = StandardScaler().fit_transform(
    titanic_df[["Age", "Fare"]]
)

titanic_df = titanic_df.dropna()
titanic_df.head()

| PassengerId | Survived | Pclass | Sex | Age       | SibSp | Parch | Fare      | Title | C   | Q   | S   |
|-------------|----------|--------|-----|-----------|-------|-------|-----------|-------|-----|-----|-----|
| 1           | 0        | 3      | 1   | -0.565736 | 1     | 0     | -0.502445 | 1     | 1.0 | 0.0 | 0.0 |
| 2           | 1        | 1      | 0   | 0.663861  | 1     | 0     | 0.786845  | 1     | 0.0 | 0.0 | 1.0 |
| 3           | 1        | 3      | 0   | -0.258337 | 0     | 0     | -0.488854 | 1     | 0.0 | 0.0 | 1.0 |
| 4           | 1        | 1      | 0   | 0.433312  | 1     | 0     | 0.42073   | 1     | 0.0 | 0.0 | 1.0 |
| 5           | 0        | 3      | 1   | 0.433312  | 0     | 0     | -0.486337 | 1     | 0.0 | 1.0 | 0.0 |
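
One detail in the output above is worth a closer look: passenger 1 embarked at Southampton (`S`) in the raw data, yet the encoded row shows `C = 1.0`. A likely explanation is index alignment. `DataFrame.join` matches rows by index label, and the one-hot frame is built with a default 0-based index while `titanic_df` is indexed by `PassengerId` starting at 1. The sketch below reproduces the effect on a made-up three-row frame and shows one way to avoid it; the names `shifted` and `aligned` are ours.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame (made up) indexed from 1, like PassengerId
df = pd.DataFrame({"Embarked": ["S", "C", "Q"]}, index=[1, 2, 3])

encoded = OneHotEncoder().fit_transform(df[["Embarked"]]).toarray()

# Default RangeIndex (0, 1, 2): join() matches on labels, so rows shift by one
shifted = df.join(pd.DataFrame(encoded, columns=["C", "Q", "S"]))

# Reusing the original index keeps each row's dummy columns with that row
aligned = df.join(pd.DataFrame(encoded, columns=["C", "Q", "S"], index=df.index))

print(shifted)
print(aligned)
```

In the shifted version the last row has no partner and comes back as `NaN`, which matches the need for a `dropna()` call after the join.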


## Testing the models

In [10]:
analyze_models_on_dataset(
    "Titanic", titanic_df.drop("Survived", axis=1), titanic_df["Survived"]
)


 ---- DATASET: Titanic ---- 

Grid search for KNN...
Mean fit time: 4.31 ms
Mean score time: 13.18 ms
Best parameters set found on development set:
{'algorithm': 'auto', 'leaf_size': 10, 'n_neighbors': 10, 'p': 1, 'weights': 'uniform'}
Best 5 grid scores on development set:
0.8104 (+/-0.0368) for (algorithm=auto, leaf_size=10, n_neighbors=10, p=1, weights=uniform)
0.8104 (+/-0.0368) for (algorithm=auto, leaf_size=30, n_neighbors=10, p=1, weights=uniform)
0.8104 (+/-0.0368) for (algorithm=auto, leaf_size=50, n_neighbors=10, p=1, weights=uniform)
0.8104 (+/-0.0368) for (algorithm=ball_tree, leaf_size=10, n_neighbors=10, p=1, weights=uniform)
0.8104 (+/-0.0368) for (algorithm=ball_tree, leaf_size=30, n_neighbors=10, p=1, weights=uniform)
Detailed classification report:
              precision    recall  f1-score   support

           0       0.81      0.84      0.82       112
           1       0.71      0.67      0.69        66

    accuracy                           0.78       178
   m

## Conclusion

In this case, MLP seems to be the best-performing model, with the same accuracy as KNN but better f1-scores.
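
The f1-score used in this comparison is the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes it from the rounded precision and recall values printed in the KNN report above; the helper name `f1_from` is ours and is not part of scikit-learn.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the KNN report on the Titanic test split
knn_report = {0: (0.81, 0.84), 1: (0.71, 0.67)}

for label, (precision, recall) in knn_report.items():
    print(label, round(f1_from(precision, recall), 2))
```

Because the harmonic mean is pulled toward the smaller of the two values, two models with equal accuracy can still differ in f1-score when one of them trades precision against recall less evenly.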

# Dataset 3: Cardiovascular Diseases

From [Fedesoriano on Kaggle](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction):

>Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of 5 CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs and this dataset contains 11 features that can be used to predict a possible heart disease.
>
>People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.

| Variable       | Definition                                                     | Key                                                                                    |
|----------------|----------------------------------------------------------------|----------------------------------------------------------------------------------------|
| Age            | Age of the patient                                             |                                                                                        |
| Sex            | Sex of the patient                                             | M = Male, F = Female                                                                   |
| ChestPainType  | Chest pain type                                                | TA = Typical Angina, ATA = Atypical Angina, NAP = Non-Anginal Pain, ASY = Asymptomatic |
| RestingBP      | Resting blood pressure (mm Hg)                                 |                                                                                        |
| Cholesterol    | Serum cholesterol (mg/dl)                                      |                                                                                        |
| FastingBS      | Fasting blood sugar                                            | 1 if FastingBS > 120 mg/dl, 0 otherwise                                                |
| RestingECG     | Resting electrocardiogram results                              | Normal = Normal, ST = having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH = showing probable or definite left ventricular hypertrophy by Estes' criteria |
| MaxHR          | Maximum heart rate achieved (bpm)                              |                                                                                        |
| ExerciseAngina | Exercise-induced angina                                        | Y = Yes, N = No                                                                        |
| Oldpeak        | ST depression (numeric value)                                  |                                                                                        |
| ST_Slope       | The slope of the peak exercise ST segment                      | Up = upsloping, Flat = flat, Down = downsloping                                        |
| HeartDisease   | Target class                                                   | 1 = heart disease, 0 = Normal                                                          |


In [11]:
cardio_df = pd.read_csv("/kaggle/input/heart-failure-prediction/heart.csv")
cardio_df.head()

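|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|-----|-----|---------------|-----------|-------------|-----------|------------|-------|----------------|---------|----------|--------------|
| 0 | 40  | M   | ATA           | 140       | 289         | 0         | Normal     | 172   | N              | 0.0     | Up       | 0            |
| 1 | 49  | F   | NAP           | 160       | 180         | 0         | Normal     | 156   | N              | 1.0     | Flat     | 1            |
| 2 | 37  | M   | ATA           | 130       | 283         | 0         | ST         | 98    | N              | 0.0     | Up       | 0            |
| 3 | 48  | F   | ASY           | 138       | 214         | 0         | Normal     | 108   | Y              | 1.5     | Flat     | 1            |
| 4 | 54  | M   | NAP           | 150       | 195         | 0         | Normal     | 122   | N              | 0.0     | Up       | 0            |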


## Preprocessing

In [12]:
# Encoding: map each categorical column to integer codes
categorical_cols = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]
for col in categorical_cols:
    cardio_df[col] = LabelEncoder().fit_transform(cardio_df[col])

# Scaling: standardize each continuous column to zero mean and unit variance
numeric_cols = ["Age", "RestingBP", "Cholesterol", "MaxHR"]
cardio_df[numeric_cols] = StandardScaler().fit_transform(cardio_df[numeric_cols])

cardio_df.head()

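|   | Age       | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR     | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|-----------|-----|---------------|-----------|-------------|-----------|------------|-----------|----------------|---------|----------|--------------|
| 0 | -1.43314  | 1   | 1             | 0.410909  | 0.82507     | 0         | 1          | 1.382928  | 0              | 0.0     | 2        | 0            |
| 1 | -0.478484 | 0   | 2             | 1.491752  | -0.171961   | 0         | 1          | 0.754157  | 0              | 1.0     | 1        | 1            |
| 2 | -1.751359 | 1   | 1             | -0.129513 | 0.770188    | 0         | 2          | -1.525138 | 0              | 0.0     | 2        | 0            |
| 3 | -0.584556 | 0   | 0             | 0.302825  | 0.13904     | 0         | 1          | -1.132156 | 1              | 1.5     | 1        | 1            |
| 4 | 0.051881  | 1   | 2             | 0.951331  | -0.034755   | 0         | 1          | -0.581981 | 0              | 0.0     | 2        | 0            |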
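
The integer codes above follow alphabetical order (for example `ASY` = 0, `ATA` = 1, `NAP` = 2), which distance-based models such as KNN will read as an ordering even though chest pain types have none. One possible alternative, sketched below under our own assumptions, is to one-hot encode nominal columns and fit the scaler inside a `Pipeline`, so that preprocessing statistics are learned only from the training portion of each fold. The toy rows are made up and reuse only a few of the dataset's column names; this is not the preprocessing used in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows (made up) that reuse a few column names from the heart dataset
toy = pd.DataFrame(
    {
        "Age": [40, 49, 37, 48, 54, 39, 45, 58],
        "MaxHR": [172, 156, 98, 108, 122, 170, 130, 99],
        "ChestPainType": ["ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ASY"],
        "HeartDisease": [0, 1, 0, 1, 0, 0, 0, 1],
    }
)
X, y = toy.drop("HeartDisease", axis=1), toy["HeartDisease"]

# Scale continuous columns and one-hot encode the nominal one
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["Age", "MaxHR"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["ChestPainType"]),
    ]
)

# Preprocessing is fitted together with the model, on training data only
pipeline = Pipeline(
    [("preprocess", preprocess), ("knn", KNeighborsClassifier(n_neighbors=3))]
)
pipeline.fit(X, y)
print(pipeline.predict(X.head(2)))
```

Such a pipeline can be passed to `GridSearchCV` directly; hyperparameter names then take a step prefix, for example `knn__n_neighbors`.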


## Testing the models

In [13]:
analyze_models_on_dataset(
    "Heart Failure", cardio_df.drop("HeartDisease", axis=1), cardio_df["HeartDisease"]
)


 ---- DATASET: Heart Failure ---- 

Grid search for KNN...
Mean fit time: 5.18 ms
Mean score time: 14.54 ms
Best parameters set found on development set:
{'algorithm': 'auto', 'leaf_size': 10, 'n_neighbors': 20, 'p': 1, 'weights': 'distance'}
Best 5 grid scores on development set:
0.8733 (+/-0.0283) for (algorithm=auto, leaf_size=10, n_neighbors=20, p=1, weights=distance)
0.8733 (+/-0.0283) for (algorithm=auto, leaf_size=30, n_neighbors=20, p=1, weights=distance)
0.8733 (+/-0.0283) for (algorithm=auto, leaf_size=50, n_neighbors=20, p=1, weights=distance)
0.8733 (+/-0.0283) for (algorithm=ball_tree, leaf_size=10, n_neighbors=20, p=1, weights=distance)
0.8733 (+/-0.0283) for (algorithm=ball_tree, leaf_size=30, n_neighbors=20, p=1, weights=distance)
Detailed classification report:
              precision    recall  f1-score   support

           0       0.82      0.82      0.82        77
           1       0.87      0.87      0.87       107

    accuracy                           0.85   

## Conclusion

KNN, Naive Bayes and MLP performed very similarly, with all three being valid choices for this dataset.