## Supervised models
This notebook introduces the supervised ML models that can be used for COVID-19 detection from radiography images.

For this notebook to import the modules created for this project, the project root directory must be on Python's module search path.

In [1]:
# Auto reload modules
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append("../")
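
The relative `"../"` entry above depends on the directory the kernel was started from. A slightly more defensive variant is sketched below; it assumes the notebook lives one level below the project root, and `project_root` is a name introduced here for illustration only.

```python
import sys
from pathlib import Path

# Assumption: the notebook's working directory is one level below the project root.
project_root = Path.cwd().parent.resolve()

# Add the absolute path once, so re-running the cell does not create duplicates.
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

print("Project root on sys.path:", project_root)
```

Using an absolute path makes the entry in `sys.path` independent of later changes to the working directory.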

## Loading packages and dependencies

In [3]:
import numpy as np

from src.features.extract_features import load_extracted_features
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from src.models.build_model import train_basic_supervised_model, evaluate_model


# Path to the raw data
raw_data_dir = '../data/raw/COVID-19_Radiography_Dataset/'

## Extracting features from images

In [4]:
# Healthy class (label 0)
X_healthy, y_healthy, _ = load_extracted_features(
    images_dir=raw_data_dir + '{}/images', category='NORMAL', dataset_label=0,
    random_seed=42, samples=781, augmentor=True)
# Sick class (label 1): COVID, viral pneumonia, and lung opacity combined
X_sick, y_sick, _ = load_extracted_features(
    images_dir=raw_data_dir + '{}/images',
    category=['COVID', 'Viral Pneumonia', 'Lung_Opacity'], dataset_label=1)

Loaded images for NORMAL: 10973 resized images, 10973 features, and 10973 labels.
Loaded images for ['COVID', 'Viral Pneumonia', 'Lung_Opacity']: 10973 resized images, 10973 features, and 10973 labels.


## Splitting and normalizing features

In [5]:
# Combine datasets
X = np.vstack((X_healthy, X_sick))
y = np.concatenate((y_healthy, y_sick))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")

# Normalize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

X_train shape: (17556, 14), y_train shape: (17556,)
X_test shape: (4390, 14), y_test shape: (4390,)
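
The cell above fits the scaler on the training split only and then reuses the learned statistics on the test split, which keeps test information out of training. The sketch below illustrates that behaviour on synthetic data (not the notebook's features); all names ending in `_demo`, `_tr`, or `_te` are placeholders introduced here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a 14-column feature matrix (assumption: not the real data).
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(1000, 14))
y_demo = rng.integers(0, 2, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # mean/std learned from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # same mean/std applied to the test rows

# Training columns are exactly standardized; test columns are only approximately so.
print("train mean ~ 0:", np.allclose(X_tr_scaled.mean(axis=0), 0.0, atol=1e-8))
print("train std  ~ 1:", np.allclose(X_tr_scaled.std(axis=0), 1.0, atol=1e-8))
print("test mean (first 3 features):", X_te_scaled.mean(axis=0)[:3].round(3))
```

The test-set means are close to zero but not exactly zero, which is expected: the test rows never contributed to the scaler's statistics.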


## Training and evaluating models

### Logistic regression

✅ Strengths:
* Simple, fast, and interpretable.
* Works well when the classes are (close to) linearly separable in feature space.

❌ Weaknesses:
* Struggles with complex, non-linear relationships.
* Sensitive to outliers.
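
The internals of `train_basic_supervised_model` are not shown in this notebook, so the following is only a minimal sketch of what a logistic regression baseline could look like in plain scikit-learn. It runs on synthetic data, and the hyperparameters (`max_iter=1000`) and variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score as sk_accuracy
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem with 14 features (assumption: stands in for the real features).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=14, n_informative=6,
    n_clusters_per_class=1, random_state=42)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Scaling and the classifier are chained so the scaler is fit on training data only.
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_a, y_a)

pred = logreg.predict(X_b)
acc = sk_accuracy(y_b, pred)
print(f"Accuracy: {acc:.4f}")
print(classification_report(y_b, pred))

# One coefficient per feature: sign and magnitude give the interpretability noted above.
coefs = logreg.named_steps["logisticregression"].coef_.ravel()
print("Coefficients:", coefs.round(2))
```

Because the features are standardized, the coefficient magnitudes are roughly comparable across features.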

In [7]:
model = train_basic_supervised_model(X_train, y_train, model_type='Logistic Regression')

accuracy_score, report = evaluate_model("Binary classification [Normal, Others] for images without masks", model, X_test, y_test, model_type='Logistic Regression')

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

2025/03/02 14:55:18 INFO mlflow.tracking.fluent: Experiment with name 'Basic Supervised Models' does not exist. Creating a new experiment.
Successfully registered model 'sklearn-Logistic Regression-2025-03-02'.
2025/03/02 14:55:24 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-Logistic Regression-2025-03-02, version 1


🏃 View run Logistic Regression-2025-03-02 14:55:18.392808 at: http://localhost:8080/#/experiments/747560239450198032/runs/92decb4e97e2485282012c92c047f870
🧪 View experiment at: http://localhost:8080/#/experiments/747560239450198032
Classification Accuracy: 0.7264
Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.72      0.73      2204
           1       0.72      0.73      0.73      2186

    accuracy                           0.73      4390
   macro avg       0.73      0.73      0.73      4390
weighted avg       0.73      0.73      0.73      4390



Created version '1' of model 'sklearn-Logistic Regression-2025-03-02'.


### SVM

✅ Strengths:

* Works well on high-dimensional data.
* Effective on small datasets.
* Correctly classified points far from the decision boundary do not influence it, since only the support vectors do.

❌ Weaknesses:

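* Slow on large datasets: training time for kernel SVMs grows at least quadratically with the number of samples.
* Sensitive to hyperparameters (C for all kernels, γ for the RBF kernel, degree for the polynomial kernel).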
* Difficult to interpret compared to logistic regression.
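
To make the kernel choice concrete, the sketch below compares a linear and an RBF kernel on a synthetic dataset of two concentric circles, where no straight line can separate the classes. The dataset, the default-like settings (`C=1.0`, `gamma="scale"`), and all variable names are assumptions for illustration and are unrelated to the project's own training function.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two concentric rings: a deliberately non-linear toy problem.
X_c, y_c = make_circles(n_samples=1000, noise=0.05, factor=0.3, random_state=42)
X_c_tr, X_c_te, y_c_tr, y_c_te = train_test_split(
    X_c, y_c, test_size=0.2, random_state=42)

kernel_scores = {}
for kernel in ("linear", "rbf"):
    # gamma is ignored by the linear kernel and only affects the RBF kernel.
    svm = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0, gamma="scale"))
    svm.fit(X_c_tr, y_c_tr)
    kernel_scores[kernel] = svm.score(X_c_te, y_c_te)
    print(f"{kernel:>6} kernel accuracy: {kernel_scores[kernel]:.3f}")
```

On data like this the RBF kernel separates the rings while the linear kernel cannot. A gap between the two kernels on real features can hint that the class boundary is not linear.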

#### RBF kernel

In [8]:
model = train_basic_supervised_model(X_train, y_train, model_type='SVM RBF')

accuracy_score, report = evaluate_model("Binary classification [Normal, Others] for images without masks", model, X_test, y_test, model_type='SVM RBF')

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Successfully registered model 'sklearn-SVM RBF-2025-03-02'.
2025/03/02 14:56:07 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-SVM RBF-2025-03-02, version 1


🏃 View run SVM RBF-2025-03-02 14:56:03.328186 at: http://localhost:8080/#/experiments/747560239450198032/runs/e0ceab257bd84fa483ad0c2fcd44af64
🧪 View experiment at: http://localhost:8080/#/experiments/747560239450198032
Classification Accuracy: 0.8043
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.79      0.80      2204
           1       0.80      0.81      0.81      2186

    accuracy                           0.80      4390
   macro avg       0.80      0.80      0.80      4390
weighted avg       0.80      0.80      0.80      4390



Created version '1' of model 'sklearn-SVM RBF-2025-03-02'.


#### Linear kernel

In [9]:
model = train_basic_supervised_model(X_train, y_train, model_type='SVM Linear')

accuracy_score, report = evaluate_model("Binary classification [Normal, Others] for images without masks", model, X_test, y_test, model_type='SVM Linear')

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Successfully registered model 'sklearn-SVM Linear-2025-03-02'.
2025/03/02 14:56:34 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-SVM Linear-2025-03-02, version 1


🏃 View run SVM Linear-2025-03-02 14:56:31.388497 at: http://localhost:8080/#/experiments/747560239450198032/runs/34edb7ca68fa42a8b0cc15642319e8ce
🧪 View experiment at: http://localhost:8080/#/experiments/747560239450198032
Classification Accuracy: 0.7305
Classification Report:
               precision    recall  f1-score   support

           0       0.75      0.70      0.72      2204
           1       0.72      0.76      0.74      2186

    accuracy                           0.73      4390
   macro avg       0.73      0.73      0.73      4390
weighted avg       0.73      0.73      0.73      4390



Created version '1' of model 'sklearn-SVM Linear-2025-03-02'.


### Linear Regression

✅ Strengths
* Simple and Fast – Easy to implement and interpret.
* Works Well for Linearly Related Data.
* Low Computational Cost – Cheap to fit, since ordinary least squares has a closed-form solution.

❌ Weaknesses
* Not a Classifier by Design – It predicts continuous values, which must be thresholded to obtain class labels.
* Assumes a Linear Relationship – Struggles with non-linear patterns.
* Sensitive to Outliers – A few extreme values can skew results.
* Multicollinearity Issues – Highly correlated features make the coefficient estimates unstable.
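
Linear regression predicts a continuous value rather than a class label, so reporting accuracy requires mapping its predictions to labels. How `evaluate_model` does this is not shown in this notebook; one common approach, sketched here on synthetic data, is to threshold the prediction at 0.5. The threshold and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression

# Synthetic binary problem (assumption: stands in for the real features).
X_lr, y_lr = make_classification(
    n_samples=1000, n_features=14, n_informative=6,
    n_clusters_per_class=1, random_state=0)

# Fit ordinary least squares directly on the 0/1 labels.
reg = LinearRegression().fit(X_lr, y_lr)

scores = reg.predict(X_lr)            # continuous values, not confined to [0, 1]
labels = (scores >= 0.5).astype(int)  # assumed decision rule: threshold at 0.5

print("score range:", scores.min().round(2), "to", scores.max().round(2))
print("training accuracy after thresholding:", (labels == y_lr).mean().round(4))
```

The raw scores are not probabilities, which is one reason logistic regression is usually preferred for classification.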

In [10]:
model = train_basic_supervised_model(X_train, y_train, model_type='Linear Regression')

accuracy_score, report = evaluate_model("Binary classification [Normal, Others] for images without masks", model, X_test, y_test, model_type='Linear Regression')

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Successfully registered model 'sklearn-Linear Regression-2025-03-02'.
2025/03/02 14:56:37 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-Linear Regression-2025-03-02, version 1


🏃 View run Linear Regression-2025-03-02 14:56:34.609219 at: http://localhost:8080/#/experiments/747560239450198032/runs/540daf8f5e4b40b6879504cd86f20403
🧪 View experiment at: http://localhost:8080/#/experiments/747560239450198032
Classification Accuracy: 0.7244
Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.71      0.72      2204
           1       0.72      0.73      0.73      2186

    accuracy                           0.72      4390
   macro avg       0.72      0.72      0.72      4390
weighted avg       0.72      0.72      0.72      4390



Created version '1' of model 'sklearn-Linear Regression-2025-03-02'.


### Random Forest

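✅ Strengths
* High Accuracy – Often performs well on complex, non-linear datasets with little tuning.
* Robust to Noise – Averaging over many trees dampens the effect of noisy samples and outliers; native handling of missing values depends on the implementation.
* Works with Categorical & Numerical Features – In scikit-learn, categorical features must be numerically encoded first.

❌ Weaknesses
* Slow on Large Datasets – Many trees increase computation time.
* Less Interpretable – Harder to understand than Logistic Regression.
* Memory Intensive – Requires more RAM compared to simpler models.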
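
Although a forest is harder to read than a single linear model, scikit-learn exposes impurity-based feature importances that give a rough view of which features drive the predictions. The sketch below uses synthetic data; `n_estimators=100` and the variable names are assumptions, not the settings used by `train_basic_supervised_model`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary problem with 14 features (assumption: not the real data).
X_rf, y_rf = make_classification(
    n_samples=2000, n_features=14, n_informative=6, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
forest.fit(X_rf, y_rf)

# Impurity-based importances: one non-negative value per feature, summing to 1.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]

for idx in ranking[:5]:
    print(f"feature {idx:2d}: importance {importances[idx]:.3f}")
```

Impurity-based importances can be biased toward features with many distinct values, so they are best treated as a first look rather than a definitive ranking.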

In [11]:
model = train_basic_supervised_model(X_train, y_train, model_type='Random Forest')

accuracy_score, report = evaluate_model("Binary classification [Normal, Others] for images without masks", model, X_test, y_test, model_type='Random Forest')

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Successfully registered model 'sklearn-Random Forest-2025-03-02'.
2025/03/02 14:56:43 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-Random Forest-2025-03-02, version 1


🏃 View run Random Forest-2025-03-02 14:56:41.223507 at: http://localhost:8080/#/experiments/747560239450198032/runs/6d205337298b46b6bb385f2bf6d082ee
🧪 View experiment at: http://localhost:8080/#/experiments/747560239450198032
Classification Accuracy: 0.8068
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.80      0.81      2204
           1       0.80      0.81      0.81      2186

    accuracy                           0.81      4390
   macro avg       0.81      0.81      0.81      4390
weighted avg       0.81      0.81      0.81      4390



Created version '1' of model 'sklearn-Random Forest-2025-03-02'.
