# CONSTANTS

- **RANDOM_STATE** is the seed used to initialize the random number generator of any library or algorithm that relies on randomness. With RANDOM_STATE fixed, the same pseudo-random sequence is produced on every run, so results such as cross-validation splits are reproducible.

In [None]:
RANDOM_STATE = 0
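
As a minimal sketch of what fixing the seed buys, the snippet below draws numbers from NumPy generators. The name `SEED` and the choice of `np.random.default_rng` are illustrative assumptions; the notebook itself only passes RANDOM_STATE to scikit-learn.

```python
import numpy as np

SEED = 0  # hypothetical name; same value as RANDOM_STATE in this notebook

# Two generators created from the same seed yield identical draws.
first_draw = np.random.default_rng(SEED).random(5)
second_draw = np.random.default_rng(SEED).random(5)

# A generator created from a different seed yields a different sequence.
other_draw = np.random.default_rng(SEED + 1).random(5)

print(np.array_equal(first_draw, second_draw))
print(np.array_equal(first_draw, other_draw))
```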

- **ALPHA** is the significance level used in statistical hypothesis testing. It is typically set to 0.05 (5%), which means accepting a 5% probability of rejecting the null hypothesis when it is actually true (a Type I error). The p-value calculated from the data is compared with ALPHA: the null hypothesis is rejected when the p-value is below ALPHA.

In [None]:
ALPHA = 0.05

# LIBRARIES

In [None]:
import glob
import os
import warnings

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import ttest_ind, mannwhitneyu, chi2_contingency, friedmanchisquare, shapiro, wilcoxon, kruskal

from sklearn.model_selection import cross_validate, RepeatedStratifiedKFold
from joblib import load

warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=FutureWarning)

# READ FILES

In [None]:
X_train = pd.read_csv('../dataset/X_train.csv', index_col=0)
y_train = pd.read_csv('../dataset/y_train.csv', index_col=0)

# READ MODELS

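The code below loads the trained models. It initializes an empty dictionary called `models`, sets the folder path where the models are stored, and walks through every subdirectory of that folder. Each file found is loaded with `joblib.load` and stored in the dictionary under its file name without the extension.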

In [None]:
# Initialize the dictionary
models = {}

# Set the folder path
folder_path = '../models/'

# List all subdirectories in the folder
subdirectories = [subdir for subdir in os.listdir(folder_path) if os.path.isdir(os.path.join(folder_path, subdir))]

# Iterate over the subdirectories
for subdir in subdirectories:
    subdirectory_path = os.path.join(folder_path, subdir)
    
    # List all files in the subdirectory
    files = os.listdir(subdirectory_path)
    
    # Iterate over the files
    for file in files:
        file_path = os.path.join(subdirectory_path, file)
        
        # Extract the file name without extension
        model_name = os.path.splitext(file)[0]
        
        # Load the model using joblib.load and add it to the dictionary
        models[model_name] = joblib.load(file_path)

# RESULTS

The code below evaluates each loaded model with cross-validation. For the models named `TRACK` and `TRUST`, only the column of the same name is used as input; for every other model, those two columns are dropped from the training data. `RepeatedStratifiedKFold` with its default settings (5 folds repeated 10 times) yields 50 scores per metric. The accuracy, precision, recall, F1, and ROC AUC scores of each model are stored in a dictionary named `results`.

In [None]:
results = {}

for model_name in models:
    
    print(model_name)
    
    if model_name in ['TRACK','TRUST']:
        X_train_model = X_train[model_name].values.reshape(-1, 1)
    else:
        X_train_model = X_train.drop(['TRACK','TRUST'],axis=1)
    
    cv_results = cross_validate(
        models[model_name], 
        X_train_model, 
        y_train.values.ravel(), 
        cv=RepeatedStratifiedKFold(random_state=RANDOM_STATE), 
        scoring=['accuracy','precision','recall','f1','roc_auc']
    )
    
    results[model_name] = {
        'precision': cv_results['test_precision'],
        'accuracy': cv_results['test_accuracy'],
        'roc_auc': cv_results['test_roc_auc'],
        'recall': cv_results['test_recall'],
        'f1': cv_results['test_f1']
    }
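
Each entry in `results` holds one score per cross-validation split. The sketch below, on made-up toy data (`X_toy` and `y_toy` are assumptions, not the notebook's dataset), shows how many splits the default `RepeatedStratifiedKFold` produces and that each test fold keeps the class ratio.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy data (hypothetical): 20 samples with a balanced binary target.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

# Default settings: n_splits=5 and n_repeats=10.
cv = RepeatedStratifiedKFold(random_state=0)

# Total number of train/test splits, i.e. scores per metric and model.
n_scores = cv.get_n_splits(X_toy, y_toy)
print(n_scores)

# Stratification keeps the class ratio of y_toy inside each test fold.
train_idx, test_idx = next(cv.split(X_toy, y_toy))
print(len(test_idx), y_toy[test_idx].mean())
```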

## - Summary

A comprehensive summary of the model performance is provided in the form of a dataframe that includes the relevant performance metrics.

In [None]:
# Creating a list of results for each model
all_results = []
for model_name in results:
    # Creating a list of dictionaries containing the means and standard deviations of each metric for the current model
    data = []
    for metric in ['precision', 'accuracy', 'roc_auc', 'recall', 'f1']:
        mean = results[model_name][metric].mean()
        std = results[model_name][metric].std()
        data.append({'Model': model_name, 'Metric': metric, 'Mean_Std': f'{mean:.4f} ± {std:.4f}'})
    
    # Creating a DataFrame from the data for the current model
    df = pd.DataFrame(data)
    
    # Adding the DataFrame for the current model to the list of results
    all_results.append(df)

# Concatenating all the DataFrames into a single DataFrame
results_df = pd.concat(all_results, axis=0)

# Setting the index to the name of the model
results_df.set_index('Model', inplace=True)

# Reorganizing the DataFrame
results_df = results_df.pivot(columns='Metric', values='Mean_Std')

# Displaying the DataFrame with the metrics in the desired order
results_df.loc[:,['accuracy', 'precision', 'recall', 'f1', 'roc_auc']]

- **Accuracy:** It measures the percentage of correct predictions made out of all predictions. It is defined as the ratio of the number of correct predictions to the total number of predictions made. It is commonly used as an evaluation metric when the number of positive and negative samples is roughly the same.

- **Precision:** It measures the percentage of true positives (correctly identified positives) out of all positive predictions made. It is defined as the ratio of true positives to true positives plus false positives. It is useful when the cost of false positives is high.

- **Recall** (Sensitivity or True Positive Rate): It measures the percentage of true positives identified out of all actual positive samples. It is defined as the ratio of true positives to true positives plus false negatives. It is useful when the cost of false negatives is high.

- **F1:** It is the harmonic mean of precision and recall, i.e., `F1 = 2 * (precision * recall) / (precision + recall)`. Unlike an arithmetic mean, it is pulled toward the lower of the two values, so it is high only when both precision and recall are high. It is a useful metric when both precision and recall are important.

- **ROC AUC** (Receiver Operating Characteristic Area Under the Curve): It measures the ability of a model to distinguish between positive and negative classes. It is calculated by plotting the true positive rate against the false positive rate for different thresholds and calculating the area under the curve. A perfect model would have an ROC AUC of 1.0, while a model that performs no better than random guessing would have an ROC AUC of 0.5.
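
The threshold-based definitions above can be sketched directly from confusion-matrix counts. The function name and the counts are made up for illustration; ROC AUC is left out because it depends on the full ranking of scores rather than on a single confusion matrix.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # correct positives among predicted positives
    recall = tp / (tp + fn)      # correct positives among actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}


# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives and 30 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f'{name}: {value:.4f}')
```

With these counts, precision (0.8) and recall (about 0.67) differ, and F1 falls between them, closer to the lower value.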

## - Boxplot

A boxplot visualizes the distribution of a dataset. The box represents the interquartile range (IQR), which contains the middle 50% of the data, and the median is shown as a horizontal line within the box. With matplotlib's default settings, the whiskers extend to the most extreme values that lie within 1.5 × IQR of the box, and values beyond the whiskers are drawn as individual points (outliers). Boxplots can be used to compare the distributions of multiple groups, identify skewness or symmetry, and detect unusual observations that might need further investigation.
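
The quantities a boxplot draws can be computed by hand. The sketch below uses made-up scores (`toy_scores` is an assumption) and the 1.5 × IQR rule that matplotlib applies by default.

```python
import numpy as np

# Hypothetical scores with one low and one high extreme value.
toy_scores = np.array([0.62, 0.70, 0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.90])

# Box: first quartile, median and third quartile.
q1, median, q3 = np.percentile(toy_scores, [25, 50, 75])
iqr = q3 - q1

# Fences: points outside them are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_inside = (toy_scores >= lower_fence) & (toy_scores <= upper_fence)

# Whiskers end at the most extreme points that are still inside the fences.
whisker_low = toy_scores[is_inside].min()
whisker_high = toy_scores[is_inside].max()
outliers = toy_scores[~is_inside]

print(q1, median, q3)
print(whisker_low, whisker_high)
print(outliers)
```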

In [None]:
def boxplot(metric):
    scores = [results[model][metric] for model in models]
    fig, ax = plt.subplots()
    ax.boxplot(scores,
               boxprops={'linewidth': 2, 'color': 'blue'},
               whiskerprops={'linewidth': 2, 'color': 'blue'},
               medianprops={'linewidth': 2, 'color': 'red'})
    ax.set_xticklabels(models, fontsize=10, rotation='vertical')
    ax.set_ylabel(metric.capitalize(), fontsize=10)
    ax.set_title('Results', fontsize=12)
    ax.grid(True)
    ax.tick_params(axis='both', which='major', labelsize=10)
    plt.show()    

In [None]:
boxplot('accuracy')

In [None]:
boxplot('precision')

In [None]:
boxplot('recall')

In [None]:
boxplot('f1')

In [None]:
boxplot('roc_auc')

The AUC (Area Under the Curve) is a common metric for evaluating binary classification models, such as predicting whether a patient will survive or not. It measures how well the model's scores rank positive examples (e.g., patients who died) above negative examples (e.g., patients who survived) across all decision thresholds, and it is often reported when the class distribution is imbalanced. The F1 score, on the other hand, is computed at a single threshold and combines precision and recall. Precision is the proportion of predicted positives that are actually positive, while recall is the proportion of actual positives that the model identifies. As their harmonic mean, the F1 score is useful when the balance between precision and recall is important.

## - Confidence Interval

A confidence interval is a range of plausible values for a population parameter, computed from a sample at a chosen confidence level. It is commonly used in statistical inference to express the uncertainty around a point estimate. Here, the interval is built around the mean cross-validation score of each model using Student's t distribution.

In [None]:
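def mean_confidence_interval(model_name, results, metric='roc_auc', confidence_level=0.95, decimal_place=4):
    """Return a one-row DataFrame with the t-based confidence interval of the mean score."""

    # Cross-validation scores of this model for the chosen metric
    scores = results[model_name][metric]
    n = len(scores)
    # stats.sem uses the sample standard deviation (ddof=1); np.std defaults to ddof=0
    mean, sem, std = np.mean(scores), stats.sem(scores), np.std(scores)

    # Student's t interval with n-1 degrees of freedom, centered on the mean
    confidence_interval = tuple(round(x, decimal_place) for x in stats.t.interval(confidence_level, n-1, mean, sem))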

    return pd.DataFrame({
        'Model Name': model_name,
        'Metric': metric,
        'Point Estimate': round(mean, decimal_place),
        'Standard Deviation': round(std, decimal_place),
        'Lower CI': confidence_interval[0],
        'Upper CI': confidence_interval[1],
        'Confidence Level': confidence_level,
        'Sample Size': n
    }, index=[0])

confidence_interval = pd.concat([mean_confidence_interval(model_name, results, metric) for metric in results_df.columns for model_name in models]).reset_index(drop=True)

# Set the metric as index
confidence_interval.set_index('Metric', inplace=True)
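
The interval returned by `stats.t.interval` in the function above is the mean plus or minus a critical t value times the standard error. A minimal sketch with made-up scores (`toy_scores` is an assumption) reproduces it by hand:

```python
import numpy as np
from scipy import stats

# Hypothetical cross-validation scores for one model and one metric.
toy_scores = np.array([0.80, 0.82, 0.78, 0.85, 0.81, 0.79, 0.83, 0.84])
confidence_level = 0.95

n = len(toy_scores)
mean = toy_scores.mean()
# Standard error of the mean, based on the sample standard deviation (ddof=1).
sem = toy_scores.std(ddof=1) / np.sqrt(n)

# Two-sided critical value of Student's t with n-1 degrees of freedom.
t_crit = stats.t.ppf(1 - (1 - confidence_level) / 2, df=n - 1)
lower, upper = mean - t_crit * sem, mean + t_crit * sem

# Reference values from the same call used in the notebook.
ref_lower, ref_upper = stats.t.interval(confidence_level, n - 1, loc=mean, scale=sem)
print(round(lower, 4), round(upper, 4))
```

Scores from repeated cross-validation share training data and are therefore not fully independent, so an interval computed this way is best read as an approximation.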

In [None]:
confidence_interval.loc['accuracy'].sort_values(by='Point Estimate', ascending=False)

In [None]:
confidence_interval.loc['precision'].sort_values(by='Point Estimate', ascending=False)

In [None]:
confidence_interval.loc['recall'].sort_values(by='Point Estimate', ascending=False)

In [None]:
confidence_interval.loc['f1'].sort_values(by='Point Estimate', ascending=False)

In [None]:
confidence_interval.loc['roc_auc'].sort_values(by='Point Estimate', ascending=False)

# STATISTICAL TESTS

## Normality Test

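A normality test is important because many statistical methods assume that the data are normally distributed. If a distribution is not normal, these methods may not be appropriate or may require adjustments. The Shapiro-Wilk test is used here: its null hypothesis is that the sample comes from a normal distribution, so a p-value below ALPHA is evidence against normality.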
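
As a sketch of how to read the Shapiro-Wilk p-value, the snippet below tests two synthetic samples built from evenly spaced quantiles, one bell-shaped and one right-skewed. The sample construction and variable names are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import expon, norm, shapiro

ALPHA = 0.05  # same significance level as in this notebook

# Deterministic synthetic samples of size 50 built from evenly spaced quantiles.
quantiles = (np.arange(50) + 0.5) / 50
bell_shaped = norm.ppf(quantiles)     # follows a normal shape
right_skewed = expon.ppf(quantiles)   # follows an exponential shape

_, p_bell = shapiro(bell_shaped)
_, p_skewed = shapiro(right_skewed)

# The null hypothesis is normality, so it is kept when p >= ALPHA.
bell_looks_normal = p_bell >= ALPHA
skewed_looks_normal = p_skewed >= ALPHA

print(bell_looks_normal, skewed_looks_normal)
```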

In [None]:
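def run_normality_test(results, metric):
    normality_dict = {}
    for model_name in models:
        score_values = results[model_name][metric]
        # Shapiro-Wilk: the null hypothesis is that the scores are normally distributed
        stat, p = shapiro(score_values)
        # Normality is not rejected when p >= ALPHA
        normality_dict[model_name] = {'Statistic': stat, 'p-value': p, 'Normality': p >= ALPHA}
    
    return pd.DataFrame.from_dict(normality_dict, orient='index').assign(Metric=metric)

# The function has its own name so that the resulting DataFrame does not overwrite it
normality_test = pd.concat([run_normality_test(results, metric) for metric in results_df.columns], axis=0)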

In [None]:
normality_test.query('Metric == "accuracy"')

In [None]:
normality_test.query('Metric == "precision"')

In [None]:
normality_test.query('Metric == "recall"')

In [None]:
normality_test.query('Metric == "f1"')

In [None]:
normality_test.query('Metric == "roc_auc"')

## Friedman

The Friedman test is a nonparametric test for paired (repeated-measures) data, such as when the same set of subjects is measured under different conditions. Here, each cross-validation fold is a subject and each model is a condition. It tests the null hypothesis that all groups have the same distribution against the alternative that at least one group differs. The test statistic is calculated by ranking the values within each subject across the groups and summing the ranks for each group. Under the null hypothesis, the statistic approximately follows a chi-squared distribution with degrees of freedom equal to the number of groups minus one. If the p-value is less than the chosen significance level, the null hypothesis is rejected, and it can be concluded that there is a significant difference between the groups.
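
The statistic can be worked out by hand on a small example. In the sketch below, the score table `toy` is made up (rows are folds, columns are three hypothetical models) and contains no ties within a row, so the textbook formula without tie correction applies; the result is compared with `scipy.stats.friedmanchisquare`.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Hypothetical scores: one row per fold, one column per model.
toy = np.array([
    [0.80, 0.75, 0.70],
    [0.82, 0.78, 0.71],
    [0.79, 0.80, 0.72],
    [0.85, 0.81, 0.74],
    [0.83, 0.79, 0.73],
    [0.81, 0.77, 0.69],
])
n_blocks, k_groups = toy.shape

# Rank the models within each fold (1 = lowest score), then sum per model.
ranks = np.apply_along_axis(rankdata, 1, toy)
rank_sums = ranks.sum(axis=0)

# Friedman chi-squared statistic (no ties, so no tie correction is needed).
statistic = (12 / (n_blocks * k_groups * (k_groups + 1)) * np.sum(rank_sums ** 2)
             - 3 * n_blocks * (k_groups + 1))

# SciPy expects one array per model, hence the transpose.
ref_statistic, p_value = friedmanchisquare(*toy.T)
print(rank_sums, statistic, ref_statistic, p_value)
```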

In [None]:
# Create a dictionary of statistics
data = {'metric': [], 'statistic': [], 'p-value': [], 'significant difference': []}

# Loop through the metrics
for metric in results_df.columns:
    # One array of cross-validation scores per model, all for the same metric
    metric_scores = [results[model_name][metric] for model_name in results]
    stat, p_value = friedmanchisquare(*metric_scores)
    data['metric'].append(metric)
    data['statistic'].append(stat)
    data['p-value'].append(p_value)
    data['significant difference'].append('Yes' if p_value < ALPHA else 'No')

# Create the dataframe
friedman = pd.DataFrame(data)

# Set the metric as index
friedman.set_index('metric', inplace=True)

# Show the dataframe
friedman

For all the metrics (accuracy, f1, precision, recall, and roc_auc), the p-value is less than the commonly adopted significance level of 0.05. Therefore, we can assert that there are statistically significant differences between the samples for all these metrics.

The fact that the p-value is very close to zero provides strong evidence against the null hypothesis that there are no differences between the samples. This suggests that at least one of the samples is significantly different from the others in terms of each evaluated metric.

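## Wilcoxon Signed-Rank

The Wilcoxon signed-rank test is a nonparametric test used to determine whether two paired samples differ in location. It is used as an alternative to the paired t-test when the assumption of normality is not met. It applies here because every model is scored on the same cross-validation folds, so the scores of two models form pairs; the Wilcoxon-Mann-Whitney (rank-sum) test, by contrast, is designed for independent samples.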
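
The heatmap code in this section calls `scipy.stats.wilcoxon`, which is the paired signed-rank test, whereas `scipy.stats.mannwhitneyu` treats the two samples as independent. A minimal sketch with made-up scores (the arrays and the size of the differences are assumptions for illustration) shows how the two can disagree when one model is consistently but only slightly better on every fold:

```python
import numpy as np
from scipy.stats import mannwhitneyu, wilcoxon

# Hypothetical per-fold scores of two models evaluated on the same ten folds.
model_a = np.array([0.70, 0.75, 0.80, 0.85, 0.90, 0.72, 0.78, 0.83, 0.88, 0.93])
gains = np.array([0.010, 0.012, 0.008, 0.015, 0.011, 0.009, 0.013, 0.007, 0.014, 0.016])
model_b = model_a + gains  # model_b is slightly better on every fold

# Paired test: looks at the fold-by-fold differences.
_, p_paired = wilcoxon(model_a, model_b)

# Independent-samples test: ignores the pairing and compares the two pools.
_, p_independent = mannwhitneyu(model_a, model_b, alternative='two-sided')

print(p_paired < 0.05, p_independent < 0.05)
```

The fold-to-fold spread is much larger than the gain, so only the paired test detects the difference.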

In [None]:
def plot_wilcoxon_heatmap(metric):

    MODEL_NAMES = list(models.keys())

    n_models = len(MODEL_NAMES)

    # Calculate the p-value of the Wilcoxon test for each pair of classifiers;
    # cells that are never filled (diagonal, lower triangle) stay NaN and are left blank
    wilcoxon_matrix = np.full((n_models, n_models), np.nan)
    for i, clf1 in enumerate(MODEL_NAMES):
        for j, clf2 in enumerate(MODEL_NAMES[i+1:], i+1):
            _, p_value = wilcoxon(results[clf1][metric], results[clf2][metric])
            wilcoxon_matrix[i, j] = p_value

    # Plot the matrix using a color scale
    fig, ax = plt.subplots()
    im = ax.imshow(wilcoxon_matrix, cmap='coolwarm', vmin=0, vmax=1)

    # Add classifier labels
    ax.set_xticks(range(n_models))
    ax.set_yticks(range(n_models))
    ax.set_xticklabels(MODEL_NAMES, fontsize=10, rotation=90)
    ax.set_yticklabels(MODEL_NAMES, fontsize=10)

    # Add p-values to each cell
    for i in range(n_models):
        for j in range(i+1, n_models):
            p_value = wilcoxon_matrix[i, j]
            text_color = 'white' if p_value < ALPHA else 'black'
            text = ax.text(j, i, '{:.2f}'.format(p_value), ha='center', va='center', color=text_color, fontsize=10, fontweight='bold' if p_value < ALPHA else 'normal')

    # Configure color bar
    cbar = ax.figure.colorbar(im, ax=ax)
    cbar.ax.set_ylabel('p-value', rotation=-90, va='bottom', fontsize=10)

    # Set plot title
    plt.title('Wilcoxon Test - {}'.format(metric), fontsize=12)

    # Display plot
    plt.show()

In [None]:
plot_wilcoxon_heatmap(metric='accuracy')

In [None]:
plot_wilcoxon_heatmap('precision')

In [None]:
plot_wilcoxon_heatmap('recall')

In [None]:
plot_wilcoxon_heatmap('f1')

In [None]:
plot_wilcoxon_heatmap('roc_auc')