## Supervised models
This notebook introduces the supervised ML models that can be used for COVID detection.

For this notebook to find the modules created for this project, the project root directory must be added to the Python path.

In [1]:
# Auto reload modules
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append("../")
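A slightly more defensive variant of the path setup could look like the sketch below. It assumes the notebook is run from a folder one level below the project root (as `"../"` implies); `add_project_root` is a hypothetical helper, not part of the project.

```python
import sys
from pathlib import Path


def add_project_root(levels_up: int = 1) -> Path:
    """Add the directory `levels_up` above the working directory to sys.path, once."""
    root = Path.cwd().resolve()
    for _ in range(levels_up):
        root = root.parent  # the parent of the filesystem root is itself
    root_str = str(root)
    if root_str not in sys.path:
        # Insert at the front so project modules win over same-named packages
        sys.path.insert(0, root_str)
    return root


project_root = add_project_root()
print(f"Project root on sys.path: {project_root}")
```

Unlike a bare `sys.path.append("../")`, this resolves to an absolute path and does not add a duplicate entry when the cell is re-run.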

<img src="../images/Supervised_Models.png" width="800"/>

## Loading packages and dependencies

In [3]:
import numpy as np

from src.features.extract_features import load_extracted_features
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from src.models.build_model import train_basic_supervised_model, evaluate_model


# Path to the raw image data
raw_data_dir = '../data/raw/dataset/images'

## Extracting features from images

In [4]:
X_normal, y_normal, _ = load_extracted_features(images_dir=raw_data_dir+'/{}',
                                                    category='NORMAL', dataset_label=0)
X_covid, y_covid, _ = load_extracted_features(images_dir=raw_data_dir+'/{}',
                                                    category='COVID', dataset_label=1, samples=6576, augmentor=True) 
X_pneumonia, y_pneumonia, _ = load_extracted_features(images_dir=raw_data_dir+'/{}',
                                                    category='Viral Pneumonia', dataset_label=2, samples=8847, augmentor=True) 
X_opacity, y_opacity, _ = load_extracted_features(images_dir=raw_data_dir+'/{}',
                                                    category='Lung_Opacity', dataset_label=3, samples=4180, augmentor=True) 

Loaded images for NORMAL: 10192 resized images, 10192 features, and 10192 labels.
Loaded images for COVID: 10192 resized images, 10192 features, and 10192 labels.
Loaded images for Viral Pneumonia: 10192 resized images, 10192 features, and 10192 labels.
Loaded images for Lung_Opacity: 10192 resized images, 10192 features, and 10192 labels.


## Normalizing features

In [5]:
# Combine datasets
X = np.vstack((X_normal, X_covid, X_pneumonia, X_opacity))
y = np.concatenate((y_normal, y_covid, y_pneumonia, y_opacity))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")

# Normalize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

X_train shape: (32614, 14), y_train shape: (32614,)
X_test shape: (8154, 14), y_test shape: (8154,)
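The split above is random but not stratified, which is why the per-class test supports reported later differ slightly (2073, 1998, 2056, 2027). As a sketch on synthetic stand-in data (the `*_demo` names and the random features are assumptions, not the project's data), passing `stratify` keeps class proportions identical in both splits, and fitting the scaler on the training split only avoids leaking test statistics:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in: 4 balanced classes, 14 features per sample
X_demo = rng.normal(loc=5.0, scale=3.0, size=(4000, 14))
y_demo = np.repeat(np.arange(4), 1000)

# stratify=y_demo preserves the 25/25/25/25 class mix in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit on the training split only, then reuse the same statistics for test
scaler_demo = StandardScaler()
X_tr_s = scaler_demo.fit_transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("train per class:", train_counts)
print("test per class: ", test_counts)
```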


## Training and evaluating models

### Logistic regression

✅ Strengths:
* Simple, fast, and interpretable.
* Works well when the classes are (close to) linearly separable in feature space.

❌ Weaknesses:
* Struggles with complex, non-linear relationships.
* Sensitive to outliers.

Using GridSearchCV, the tuned hyperparameters based on the features used in this notebook are:

{'C': 0.1, 'class_weight': None, 'max_iter': 100, 'penalty': 'l1', 'solver': 'liblinear'}
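A search of this kind can be sketched as follows. The grid values and the synthetic data are assumptions for illustration; the notebook does not show the grid that was actually searched. Because recent scikit-learn releases warn when `liblinear` is applied directly to a multiclass target, the sketch wraps the estimator in `OneVsRestClassifier`, so the parameter names carry an `estimator__` prefix.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

# Small synthetic 4-class problem with 14 features (stand-in data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=14, n_informative=8, n_classes=4, random_state=42
)

# Hypothetical grid; it merely contains the values reported above
param_grid = {
    "estimator__C": [0.01, 0.1, 1.0],
    "estimator__penalty": ["l1", "l2"],
    "estimator__class_weight": [None, "balanced"],
}

base = LogisticRegression(solver="liblinear", max_iter=100)
search = GridSearchCV(
    OneVsRestClassifier(base), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```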

In [6]:
model = train_basic_supervised_model(X_train, y_train, model_type='Logistic Regression')

accuracy_score, report = evaluate_model("Multi-label classification [Normal, COVID, Viral Pneumonia, Lung_Opacity] for images without masks", model, X_test, y_test, model_type='Logistic Regression', classification_type="multiclass")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Successfully registered model 'sklearn-Logistic Regression-multiclass'.
2025/03/09 14:16:00 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-Logistic Regression-multiclass, version 1


🏃 View run Logistic Regression-multiclass at: http://localhost:8080/#/experiments/316063285991046342/runs/b7c1bcadf5824bf397e9d3261ead22b5
🧪 View experiment at: http://localhost:8080/#/experiments/316063285991046342
Classification Accuracy: 0.6269
Classification Report:
               precision    recall  f1-score   support

           0       0.62      0.65      0.64      2073
           1       0.60      0.52      0.55      1998
           2       0.69      0.71      0.70      2056
           3       0.60      0.63      0.61      2027

    accuracy                           0.63      8154
   macro avg       0.63      0.63      0.62      8154
weighted avg       0.63      0.63      0.63      8154



Created version '1' of model 'sklearn-Logistic Regression-multiclass'.
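The accuracy and report above have the format produced by scikit-learn's `accuracy_score` and `classification_report`. Whether `evaluate_model` calls these functions internally is an assumption; the toy labels below are made up purely to show how the numbers are derived.

```python
from sklearn.metrics import accuracy_score, classification_report

# Toy labels for the four classes (0=Normal, 1=COVID, 2=Viral Pneumonia, 3=Lung_Opacity)
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 3, 3, 3]

# Accuracy is the fraction of exact matches: 6 of 8 here
acc = accuracy_score(y_true, y_pred)

# Per-class precision, recall, F1 and support, plus macro/weighted averages
report = classification_report(y_true, y_pred, digits=2)

print(f"Classification Accuracy: {acc:.4f}")
print("Classification Report:\n", report)
```

Note that the notebook cells store the result in a variable named `accuracy_score`; importing the scikit-learn function of the same name into that namespace would be shadowed by it.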


### SVM

✅ Strengths:

* Works well on high-dimensional data.
* Effective on small datasets.
* Handles outliers better than logistic regression.

❌ Weaknesses:

* Slow on large datasets (especially with RBF kernel).
* Sensitive to hyperparameters (C, γ, degree).
* Difficult to interpret compared to logistic regression.
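To illustrate why the kernel choice matters, here is a minimal sketch comparing a linear and an RBF kernel on a synthetic, non-linearly separable dataset (`make_moons`). The data and the hyperparameters (`C=1.0`, `gamma="scale"`) are assumptions and are unrelated to the project's features.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half-circles: a linear boundary cannot separate them well
X_demo, y_demo = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scores = {}
for kernel in ("linear", "rbf"):
    # Scaling inside the pipeline keeps test statistics out of training
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0, gamma="scale"))
    clf.fit(X_tr, y_tr)
    scores[kernel] = clf.score(X_te, y_te)

for kernel, score in scores.items():
    print(f"{kernel:<6} test accuracy: {score:.3f}")
```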

#### RBF kernel

In [7]:
model = train_basic_supervised_model(X_train, y_train, model_type='SVM RBF')

accuracy_score, report = evaluate_model("Multi-label classification [Normal, COVID, Viral Pneumonia, Lung_Opacity] for images without masks", model, X_test, y_test, model_type='SVM RBF', classification_type="multiclass")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Successfully registered model 'sklearn-SVM RBF-multiclass'.
2025/03/09 14:16:23 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-SVM RBF-multiclass, version 1


🏃 View run SVM RBF-multiclass at: http://localhost:8080/#/experiments/316063285991046342/runs/dcb4b55b1076421b94aa62e3e102bbe7
🧪 View experiment at: http://localhost:8080/#/experiments/316063285991046342
Classification Accuracy: 0.7546
Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.75      0.74      2073
           1       0.74      0.71      0.73      1998
           2       0.82      0.85      0.83      2056
           3       0.72      0.70      0.71      2027

    accuracy                           0.75      8154
   macro avg       0.75      0.75      0.75      8154
weighted avg       0.75      0.75      0.75      8154



Created version '1' of model 'sklearn-SVM RBF-multiclass'.


#### Linear kernel

In [8]:
model = train_basic_supervised_model(X_train, y_train, model_type='SVM Linear')

accuracy_score, report = evaluate_model("Multi-label classification [Normal, COVID, Viral Pneumonia, Lung_Opacity] for images without masks", model, X_test, y_test, model_type='SVM Linear', classification_type="multiclass")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Successfully registered model 'sklearn-SVM Linear-multiclass'.
2025/03/09 14:17:54 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-SVM Linear-multiclass, version 1


🏃 View run SVM Linear-multiclass at: http://localhost:8080/#/experiments/316063285991046342/runs/87a0342f28494c0abd64fda477aed5f8
🧪 View experiment at: http://localhost:8080/#/experiments/316063285991046342
Classification Accuracy: 0.6342
Classification Report:
               precision    recall  f1-score   support

           0       0.63      0.67      0.65      2073
           1       0.60      0.55      0.57      1998
           2       0.68      0.70      0.69      2056
           3       0.62      0.62      0.62      2027

    accuracy                           0.63      8154
   macro avg       0.63      0.63      0.63      8154
weighted avg       0.63      0.63      0.63      8154



Created version '1' of model 'sklearn-SVM Linear-multiclass'.


### Random Forest

✅ Strengths
* High Accuracy: Performs well on complex datasets.
* Robust to Noise: Handles missing data and outliers well.
* Works with Categorical & Numerical Features.

❌ Weaknesses
* Slow on Large Datasets: Many trees increase computation time.
* Less Interpretable: Harder to understand than Logistic Regression.
* Memory Intensive: Requires more RAM compared to simpler models.

Using GridSearchCV, the tuned hyperparameters based on the features used in this notebook are:

{'class_weight': None, 'max_depth': 20, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 500}
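As a sketch, the reported hyperparameters map directly onto `RandomForestClassifier` keyword arguments. Whether `train_basic_supervised_model` passes them in exactly this way is an assumption, and the synthetic data, `random_state` and `n_jobs` below are illustrative additions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic 4-class problem with 14 features (stand-in data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=14, n_informative=8, n_classes=4, random_state=42
)

# Tuned values as reported above
rf_params = {
    "class_weight": None,
    "max_depth": 20,
    "max_features": "sqrt",
    "min_samples_leaf": 1,
    "min_samples_split": 2,
    "n_estimators": 500,
}

forest = RandomForestClassifier(**rf_params, random_state=42, n_jobs=-1)
forest.fit(X_demo, y_demo)

# Impurity-based importances sum to 1 across the 14 features
importances = forest.feature_importances_
top3 = importances.argsort()[::-1][:3]
print("Top-3 feature indices:", top3.tolist())
```

Inspecting `feature_importances_` partly offsets the interpretability weakness listed above, although impurity-based importances should be read with care.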

In [9]:
model = train_basic_supervised_model(X_train, y_train, model_type='Random Forest')

accuracy_score, report = evaluate_model("Multi-label classification [Normal, COVID, Viral Pneumonia, Lung_Opacity] for images without masks", model, X_test, y_test, model_type='Random Forest', classification_type="multiclass")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Successfully registered model 'sklearn-Random Forest-multiclass'.
2025/03/09 14:18:31 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-Random Forest-multiclass, version 1


🏃 View run Random Forest-multiclass at: http://localhost:8080/#/experiments/316063285991046342/runs/11051af9368b4ccf808f0ce51a201bd1
🧪 View experiment at: http://localhost:8080/#/experiments/316063285991046342
Classification Accuracy: 0.7597
Classification Report:
               precision    recall  f1-score   support

           0       0.75      0.76      0.75      2073
           1       0.77      0.74      0.75      1998
           2       0.82      0.84      0.83      2056
           3       0.70      0.69      0.70      2027

    accuracy                           0.76      8154
   macro avg       0.76      0.76      0.76      8154
weighted avg       0.76      0.76      0.76      8154



Created version '1' of model 'sklearn-Random Forest-multiclass'.


### CatBoost

✅ Strengths
* Handles categorical features natively (no need for one-hot encoding).
* Great for imbalanced data (built-in loss functions).
* Reduces overfitting through ordered boosting.
* Training speed is often competitive with XGBoost & LightGBM, depending on the dataset and settings.
* Works well with small datasets (better than deep learning in low-data settings).
* Automatically handles missing values.
* Requires minimal hyperparameter tuning.

❌ Weaknesses
* Slower inference than LightGBM (not ideal for real-time applications).
* Higher memory usage (uses more RAM than XGBoost).
* Smaller community support (troubleshooting is harder than XGBoost).
* Limited GPU acceleration (only supports specific settings).
* Not the best for highly sparse data (LightGBM may be better).

In [10]:
model = train_basic_supervised_model(X_train, y_train, model_type='CatBoost_Multi')

accuracy_score, report = evaluate_model("Multi-label classification [Normal, COVID, Viral Pneumonia, Lung_Opacity] for images without masks", model, X_test, y_test, model_type='CatBoost', classification_type="multiclass")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

0:	learn: 1.3578011	total: 61.1ms	remaining: 30.5s
100:	learn: 0.8279687	total: 379ms	remaining: 1.5s
200:	learn: 0.7353582	total: 697ms	remaining: 1.04s
300:	learn: 0.6810604	total: 1.02s	remaining: 674ms
400:	learn: 0.6436564	total: 1.34s	remaining: 330ms
499:	learn: 0.6146901	total: 1.66s	remaining: 0us


Successfully registered model 'sklearn-CatBoost-multiclass'.
2025/03/09 14:18:35 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-CatBoost-multiclass, version 1


🏃 View run CatBoost-multiclass at: http://localhost:8080/#/experiments/316063285991046342/runs/6a0c9ccaf4474c95a87ba6cb1e7597d4
🧪 View experiment at: http://localhost:8080/#/experiments/316063285991046342
Classification Accuracy: 0.7407
Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.74      0.74      2073
           1       0.74      0.70      0.72      1998
           2       0.80      0.84      0.82      2056
           3       0.69      0.68      0.69      2027

    accuracy                           0.74      8154
   macro avg       0.74      0.74      0.74      8154
weighted avg       0.74      0.74      0.74      8154



Created version '1' of model 'sklearn-CatBoost-multiclass'.
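To compare the models side by side, the test accuracies printed in the cells above can be collected and ranked. The values are copied from the outputs in this notebook; the dictionary itself is just a convenience for illustration.

```python
# Test accuracies as printed by the cells above
reported_accuracy = {
    "Logistic Regression": 0.6269,
    "SVM RBF": 0.7546,
    "SVM Linear": 0.6342,
    "Random Forest": 0.7597,
    "CatBoost": 0.7407,
}

# Sort from best to worst
ranking = sorted(reported_accuracy.items(), key=lambda kv: kv[1], reverse=True)

for name, acc in ranking:
    print(f"{name:<20} {acc:.4f}")

best_name, best_acc = ranking[0]
```

On these 14 features, the non-linear models (Random Forest, SVM with RBF kernel, CatBoost) reach roughly 0.74 to 0.76 accuracy, while the two linear models stay near 0.63.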
