## Supervised models
This notebook introduces the supervised ML models that can be used for COVID-19 detection.

So that this notebook can import the modules created for this project, we add the project root directory to the module search path.

In [1]:
# Auto reload modules
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append("../")
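
The relative `"../"` entry depends on the directory the notebook kernel was started from. A minimal sketch of a slightly more defensive variant, assuming the notebook lives one level below the project root; `add_project_root` is a hypothetical helper and not part of this project:

```python
import sys
from pathlib import Path


def add_project_root(start: Path, levels_up: int = 1) -> Path:
    """Append the directory `levels_up` levels above `start` to sys.path once."""
    resolved = start.resolve()
    parents = resolved.parents
    # Fall back to `start` itself if there are not enough parent directories.
    root = parents[levels_up - 1] if len(parents) >= levels_up else resolved
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.append(root_str)
    return root


# Assumption: the current working directory is the folder that holds this notebook.
project_root = add_project_root(Path.cwd())
print(project_root)
```

Using an absolute path and checking for duplicates keeps `sys.path` stable when the cell is re-run.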

<img src="../images/Supervised_Models.png" width="800"/>

## Loading packages and dependencies

In [3]:
import numpy as np

from src.features.extract_features import load_extracted_features
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from src.models.build_model import train_basic_supervised_model, evaluate_model


# Path to the raw data
raw_data_dir = '../data/raw/COVID-19_Radiography_Dataset/'

## Extracting features from images

In [4]:
X_normal, y_normal, _ = load_extracted_features(images_dir=raw_data_dir+'{}/images',
                                                    category='NORMAL', dataset_label=0)
X_covid, y_covid, _ = load_extracted_features(images_dir=raw_data_dir+'{}/images',
                                                    category='COVID', dataset_label=1, random_seed=42, samples=6576, augmentor=True) 
X_pneumonia, y_pneumonia, _ = load_extracted_features(images_dir=raw_data_dir+'{}/images',
                                                    category='Viral Pneumonia', dataset_label=2, random_seed=42, samples=8847, augmentor=True) 
X_opacity, y_opacity, _ = load_extracted_features(images_dir=raw_data_dir+'{}/images',
                                                    category='Lung_Opacity', dataset_label=3, random_seed=42, samples=4180, augmentor=True) 

Loaded images for NORMAL: 10192 resized images, 10192 features, and 10192 labels.
Loaded images for COVID: 10192 resized images, 10192 features, and 10192 labels.
Loaded images for Viral Pneumonia: 10192 resized images, 10192 features, and 10192 labels.
Loaded images for Lung_Opacity: 10192 resized images, 10192 features, and 10192 labels.
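
The `samples` values passed above look like they were chosen so that every class reaches the size of the NORMAL class (10192 images). The arithmetic can be sketched as follows. The original class counts are an assumption taken from the public dataset release, not from this notebook, and reading `samples` as "number of augmented images to add" is also an assumption about `load_extracted_features`:

```python
# Original class sizes: assumed from the public dataset release, not stated in this notebook.
original_counts = {
    "NORMAL": 10192,
    "COVID": 3616,
    "Viral Pneumonia": 1345,
    "Lung_Opacity": 6012,
}


def augmentation_needed(counts):
    """Return how many extra samples each class needs to match the largest class."""
    target = max(counts.values())
    return {name: target - n for name, n in counts.items()}


needed = augmentation_needed(original_counts)
for name, extra in needed.items():
    print(f"{name}: add {extra} augmented samples")
```

Under these assumptions the computed values match the `samples` arguments used in the cell above (6576, 8847 and 4180).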


## Normalizing features

In [5]:
# Combine datasets
X = np.vstack((X_normal, X_covid, X_pneumonia, X_opacity))
y = np.concatenate((y_normal, y_covid, y_pneumonia, y_opacity))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")

# Normalize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

X_train shape: (32614, 14), y_train shape: (32614,)
X_test shape: (8154, 14), y_test shape: (8154,)
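
The scaler above is fitted on the training split only and then reused on the test split, so no test statistics leak into training. A minimal sketch of that pattern on synthetic data (the `*_demo` names and the generated dataset are illustrative assumptions, not the project's features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 14 extracted features and 4 classes.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=14, n_informative=6, n_classes=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Learn mean and standard deviation from the training split only.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
# Apply the same training statistics to the test split.
X_te_scaled = scaler_demo.transform(X_te)

print("train means ~ 0:", np.allclose(X_tr_scaled.mean(axis=0), 0.0))
print("train stds  ~ 1:", np.allclose(X_tr_scaled.std(axis=0), 1.0))
print("test shape:", X_te_scaled.shape)
```

The test split is generally not exactly zero-mean after scaling, which is expected because its statistics were never used.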


## Training and evaluating models

### Logistic regression

✅ Strengths:
* Simple, fast, and interpretable.
* Works well when the classes are approximately linearly separable in feature space.

❌ Weaknesses:
* Struggles with complex, non-linear relationships.
* Sensitive to outliers.

The hyperparameters tuned with GridSearchCV on the features used in this notebook are:

{'C': 0.1, 'class_weight': None, 'max_iter': 100, 'penalty': 'l1', 'solver': 'liblinear'}
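
The search that produced these values is not shown in this notebook. A minimal sketch of what such a search could look like, on synthetic data and with an assumed, much smaller grid. Because `liblinear` handles multi-class problems one-vs-rest, the sketch wraps the estimator in `OneVsRestClassifier` explicitly, which is why the parameter names carry the `estimator__` prefix:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the 14 extracted features and 4 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=14, n_informative=6, n_classes=4, random_state=0
)

# Illustrative grid only; the grid actually used for tuning is not given here.
param_grid = {
    "estimator__C": [0.01, 0.1, 1.0],
    "estimator__penalty": ["l1", "l2"],
    "estimator__class_weight": [None, "balanced"],
}

base = OneVsRestClassifier(LogisticRegression(solver="liblinear", max_iter=100))
search = GridSearchCV(base, param_grid=param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```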

In [6]:
model = train_basic_supervised_model(X_train, y_train, model_type='Logistic Regression')

accuracy_score, report = evaluate_model("Multi-class classification [Normal, COVID, Viral Pneumonia, Lung_Opacity] for images without masks", model, X_test, y_test, model_type='Logistic Regression')

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Registered model 'sklearn-Logistic Regression-2025-03-08' already exists. Creating a new version of this model...
2025/03/08 15:56:05 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-Logistic Regression-2025-03-08, version 2


🏃 View run Logistic Regression-2025-03-08 15:56:00.952831 at: http://localhost:8080/#/experiments/747560239450198032/runs/0647b5a66934420ebfa104b5c2007555
🧪 View experiment at: http://localhost:8080/#/experiments/747560239450198032
Classification Accuracy: 0.6202
Classification Report:
               precision    recall  f1-score   support

           0       0.61      0.66      0.63      2073
           1       0.59      0.51      0.55      1998
           2       0.67      0.70      0.69      2056
           3       0.60      0.62      0.61      2027

    accuracy                           0.62      8154
   macro avg       0.62      0.62      0.62      8154
weighted avg       0.62      0.62      0.62      8154



Created version '2' of model 'sklearn-Logistic Regression-2025-03-08'.
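
To read the classification report, it helps to recompute a few entries by hand. The sketch below derives F1 from the reported precision and recall and the macro average from the per-class F1 scores. `f1_from` is an illustrative helper, and since the inputs are already rounded to two decimals, a recomputed value can differ from the report in the last digit:

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) for classes 0 and 1, copied from the report above.
reported = {0: (0.61, 0.66), 1: (0.59, 0.51)}
for label, (p, r) in reported.items():
    print(f"class {label}: F1 = {f1_from(p, r):.3f}")

# Macro average: unweighted mean of the per-class F1 scores from the report.
per_class_f1 = [0.63, 0.55, 0.69, 0.61]
macro_f1 = sum(per_class_f1) / len(per_class_f1)
print(f"macro F1 = {macro_f1:.3f}")
```

Because the classes have nearly equal support, the macro and weighted averages in the report coincide.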


### SVM

✅ Strengths:

* Works well on high-dimensional data.
* Effective on small datasets.
* Handles outliers better than logistic regression.

❌ Weaknesses:

* Slow on large datasets (especially with RBF kernel).
* Sensitive to hyperparameters (C, γ, degree).
* Difficult to interpret compared to logistic regression.
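
The effect of the kernel choice can be illustrated on a toy problem that is not linearly separable. This is a sketch on synthetic concentric circles, not on the notebook's features, and the `C` and `gamma` values are scikit-learn defaults rather than tuned settings:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two concentric rings: no straight line separates the classes.
X_demo, y_demo = make_circles(n_samples=600, factor=0.5, noise=0.1, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

scores = {}
for kernel in ("linear", "rbf"):
    # Scaling inside the pipeline keeps the preprocessing tied to the training split.
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0, gamma="scale"))
    clf.fit(X_tr, y_tr)
    scores[kernel] = clf.score(X_te, y_te)
    print(f"{kernel}: test accuracy = {scores[kernel]:.3f}")
```

On data like this the RBF kernel can bend the decision boundary around the inner ring, while the linear kernel cannot.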

#### RBF kernel

In [7]:
model = train_basic_supervised_model(X_train, y_train, model_type='SVM RBF')

accuracy_score, report = evaluate_model("Multi-class classification [Normal, COVID, Viral Pneumonia, Lung_Opacity] for images without masks", model, X_test, y_test, model_type='SVM RBF')

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Successfully registered model 'sklearn-SVM RBF-2025-03-08'.
2025/03/08 15:56:30 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-SVM RBF-2025-03-08, version 1


🏃 View run SVM RBF-2025-03-08 15:56:22.028763 at: http://localhost:8080/#/experiments/747560239450198032/runs/acc9add0d33c4705a46c6585882f060b
🧪 View experiment at: http://localhost:8080/#/experiments/747560239450198032
Classification Accuracy: 0.7563
Classification Report:
               precision    recall  f1-score   support

           0       0.74      0.76      0.75      2073
           1       0.74      0.71      0.73      1998
           2       0.82      0.84      0.83      2056
           3       0.72      0.71      0.71      2027

    accuracy                           0.76      8154
   macro avg       0.76      0.76      0.76      8154
weighted avg       0.76      0.76      0.76      8154



Created version '1' of model 'sklearn-SVM RBF-2025-03-08'.


#### Linear kernel

In [8]:
model = train_basic_supervised_model(X_train, y_train, model_type='SVM Linear')

accuracy_score, report = evaluate_model("Multi-class classification [Normal, COVID, Viral Pneumonia, Lung_Opacity] for images without masks", model, X_test, y_test, model_type='SVM Linear')

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Successfully registered model 'sklearn-SVM Linear-2025-03-08'.
2025/03/08 15:58:02 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-SVM Linear-2025-03-08, version 1


🏃 View run SVM Linear-2025-03-08 15:57:57.917372 at: http://localhost:8080/#/experiments/747560239450198032/runs/8b1dbf5db5d34a668008193700cf6ee2
🧪 View experiment at: http://localhost:8080/#/experiments/747560239450198032
Classification Accuracy: 0.6310
Classification Report:
               precision    recall  f1-score   support

           0       0.62      0.66      0.64      2073
           1       0.60      0.54      0.57      1998
           2       0.68      0.70      0.69      2056
           3       0.62      0.61      0.61      2027

    accuracy                           0.63      8154
   macro avg       0.63      0.63      0.63      8154
weighted avg       0.63      0.63      0.63      8154



Created version '1' of model 'sklearn-SVM Linear-2025-03-08'.


### Random Forest

✅ Strengths:
* High Accuracy – Performs well on complex datasets.
* Robust to Noise – Handles missing data and outliers well.
* Works with Categorical & Numerical Features.

❌ Weaknesses:
* Slow on Large Datasets – Many trees increase computation time.
* Less Interpretable – Harder to understand than Logistic Regression.
* Memory Intensive – Requires more RAM compared to simpler models.

The hyperparameters tuned with GridSearchCV on the features used in this notebook are:

{'class_weight': None, 'max_depth': 20, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 500}
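
As a rough illustration of how these settings map onto scikit-learn, the sketch below passes the same dictionary to `RandomForestClassifier` on synthetic data. Whether `train_basic_supervised_model` applies exactly these values internally is an assumption; the data and `*_demo` names are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Tuned values quoted above.
tuned_params = {
    "class_weight": None,
    "max_depth": 20,
    "max_features": "sqrt",
    "min_samples_leaf": 1,
    "min_samples_split": 2,
    "n_estimators": 500,
}

# Synthetic stand-in for the 14 extracted features and 4 classes.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=14, n_informative=6, n_classes=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

forest = RandomForestClassifier(**tuned_params, random_state=42)
forest.fit(X_tr, y_tr)
print(f"test accuracy on synthetic data: {forest.score(X_te, y_te):.3f}")

# Impurity-based importances sum to 1 and give a first look at which features matter.
importances = forest.feature_importances_
top_features = importances.argsort()[::-1][:3]
print("three most important feature indices:", top_features.tolist())
```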

In [10]:
model = train_basic_supervised_model(X_train, y_train, model_type='Random Forest')

accuracy_score, report = evaluate_model("Multi-class classification [Normal, COVID, Viral Pneumonia, Lung_Opacity] for images without masks", model, X_test, y_test, model_type='Random Forest')

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Successfully registered model 'sklearn-Random Forest-2025-03-08'.
2025/03/08 15:58:43 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-Random Forest-2025-03-08, version 1


🏃 View run Random Forest-2025-03-08 15:58:39.199318 at: http://localhost:8080/#/experiments/747560239450198032/runs/c48e4e51a769426480ac8da8dd31e77c
🧪 View experiment at: http://localhost:8080/#/experiments/747560239450198032
Classification Accuracy: 0.7546
Classification Report:
               precision    recall  f1-score   support

           0       0.74      0.75      0.75      2073
           1       0.75      0.74      0.75      1998
           2       0.82      0.84      0.83      2056
           3       0.70      0.69      0.70      2027

    accuracy                           0.75      8154
   macro avg       0.75      0.75      0.75      8154
weighted avg       0.75      0.75      0.75      8154



Created version '1' of model 'sklearn-Random Forest-2025-03-08'.


### CatBoost

✅ Strengths:
* Handles categorical features natively (no need for one-hot encoding).
* Well suited to imbalanced data (supports class weighting in its loss functions).
* Reduces overfitting through ordered boosting.
* Training speed is often competitive with XGBoost and LightGBM.
* Works well with small datasets (better than deep learning in low-data settings).
* Automatically handles missing values.
* Requires minimal hyperparameter tuning.

❌ Weaknesses:
* Slower inference than LightGBM (not ideal for real-time applications).
* Higher memory usage (uses more RAM than XGBoost).
* Smaller community support (troubleshooting is harder than XGBoost).
* Limited GPU acceleration (only supports specific settings).
* Not the best for highly sparse data (LightGBM may be better).

In [12]:
model = train_basic_supervised_model(X_train, y_train, model_type='CatBoost_Multi')

accuracy_score, report = evaluate_model("Multi-class classification [Normal, COVID, Viral Pneumonia, Lung_Opacity] for images without masks", model, X_test, y_test, model_type='CatBoost')

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

0:	learn: 1.3577824	total: 64.2ms	remaining: 32s
100:	learn: 0.8310135	total: 384ms	remaining: 1.52s
200:	learn: 0.7399226	total: 707ms	remaining: 1.05s
300:	learn: 0.6881848	total: 1.03s	remaining: 683ms
400:	learn: 0.6484206	total: 1.35s	remaining: 334ms
499:	learn: 0.6210390	total: 1.67s	remaining: 0us


Successfully registered model 'sklearn-CatBoost-2025-03-08'.
2025/03/08 16:02:34 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-CatBoost-2025-03-08, version 1


🏃 View run CatBoost-2025-03-08 16:02:31.653633 at: http://localhost:8080/#/experiments/747560239450198032/runs/b62afd53df464cebabc362b02356b0e9
🧪 View experiment at: http://localhost:8080/#/experiments/747560239450198032
Classification Accuracy: 0.7350
Classification Report:
               precision    recall  f1-score   support

           0       0.72      0.74      0.73      2073
           1       0.73      0.69      0.71      1998
           2       0.80      0.82      0.81      2056
           3       0.68      0.68      0.68      2027

    accuracy                           0.73      8154
   macro avg       0.73      0.73      0.73      8154
weighted avg       0.73      0.73      0.73      8154



Created version '1' of model 'sklearn-CatBoost-2025-03-08'.
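
For a quick side-by-side view, the accuracies printed by the cells above can be collected and ranked. The numbers are copied from this notebook's outputs; the variable names are illustrative:

```python
# Test accuracies copied from the cell outputs above.
reported_accuracy = {
    "Logistic Regression": 0.6202,
    "SVM RBF": 0.7563,
    "SVM Linear": 0.6310,
    "Random Forest": 0.7546,
    "CatBoost": 0.7350,
}

# Sort from highest to lowest accuracy.
ranking = sorted(reported_accuracy.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranking:
    print(f"{name:<20} {acc:.4f}")

best_name, best_acc = ranking[0]
gap_to_second = best_acc - ranking[1][1]
print(f"best: {best_name}, ahead of the runner-up by {gap_to_second:.4f}")
```

On this single train/test split, the non-linear models (SVM with RBF kernel, Random Forest, CatBoost) sit roughly 10 to 14 percentage points above the two linear models, and the gap between SVM RBF and Random Forest is too small to separate them without repeated splits or cross-validation.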
