## Supervised models
This notebook introduces the supervised ML models that can be used for Covid detection.

So that this notebook can import the modules created for this project, the project root directory is added to the Python module search path.

In [1]:
# Auto reload modules
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append("../")
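
A relative `"../"` entry depends on the directory the notebook kernel was started from. A minimal sketch of a more explicit variant, assuming the notebook lives one level below the project root; the helper name `add_project_root` is hypothetical and not part of this project:

```python
import sys
from pathlib import Path


def add_project_root(levels_up: int = 1) -> Path:
    """Append the directory `levels_up` levels above the working directory to sys.path."""
    root = Path.cwd().resolve()
    for _ in range(levels_up):
        root = root.parent  # the parent of the filesystem root is the root itself
    root_str = str(root)
    if root_str not in sys.path:  # avoid duplicate entries when the cell is re-run
        sys.path.append(root_str)
    return root


project_root = add_project_root()
print(project_root)
```

Using an absolute path makes it easier to see which directory is actually searched when an import fails.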

<img src="../images/Supervised_Models.png" width="800"/>

## Loading packages and dependencies

In [3]:
import numpy as np

from src.features.extract_features import load_extracted_features
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from src.models.build_model import train_basic_supervised_model, evaluate_model


# Path to the raw image data
raw_data_dir = '../data/raw/dataset/images'

## Extracting features from images

In [4]:
# Label 0: healthy images (NORMAL)
X_healthy, y_healthy, _ = load_extracted_features(
    images_dir=raw_data_dir + '/{}', category='NORMAL', dataset_label=0,
    random_seed=42, samples=781, augmentor=True)
# Label 1: all other categories, grouped as "sick"
X_sick, y_sick, _ = load_extracted_features(
    images_dir=raw_data_dir + '/{}',
    category=['COVID', 'Viral Pneumonia', 'Lung_Opacity'], dataset_label=1)

Loaded images for NORMAL: 10973 resized images, 10973 features, and 10973 labels.
Loaded images for ['COVID', 'Viral Pneumonia', 'Lung_Opacity']: 10973 resized images, 10973 features, and 10973 labels.


## Splitting and normalizing features

In [5]:
# Combine datasets
X = np.vstack((X_healthy, X_sick))
y = np.concatenate((y_healthy, y_sick))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")

# Standardize features: fit the scaler on the training split only, then apply it to the test split
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

X_train shape: (17556, 14), y_train shape: (17556,)
X_test shape: (4390, 14), y_test shape: (4390,)
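
The scaler above is fitted on the training split and only applied to the test split, so no test statistics leak into training. A small sketch of that behavior on synthetic data (the random array is a stand-in for the 14 extracted features; `stratify` is an optional addition that the cell above does not use):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for the feature matrix: 1000 samples, 14 features
X_demo = rng.normal(loc=5.0, scale=3.0, size=(1000, 14))
y_demo = rng.integers(0, 2, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learns mean and std from the training split
X_te_scaled = scaler_demo.transform(X_te)      # reuses the training statistics

# Training columns are centered with unit variance by construction
print(np.allclose(X_tr_scaled.mean(axis=0), 0.0))
print(np.allclose(X_tr_scaled.std(axis=0), 1.0))
# Test columns are only approximately centered, which is expected
print(np.abs(X_te_scaled.mean(axis=0)).max())
```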


## Training and evaluating models

### Logistic regression

✅ Strengths:
* Simple, fast, and interpretable.
* Works well when the classes are (close to) linearly separable in feature space.

❌ Weaknesses:
* Struggles with complex, non-linear relationships.
* Sensitive to outliers.

The hyperparameters below were tuned with GridSearchCV on the features used in this notebook:

{'C': 0.1, 'class_weight': None, 'max_iter': 100, 'penalty': 'l1', 'solver': 'liblinear'}
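
The grid that produced these values is not shown in this notebook. As a rough sketch of what such a search could look like, the following runs `GridSearchCV` over a hypothetical grid on synthetic data; the grid values, the 5-fold setting, and the scoring metric are all assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in with the same number of features as this notebook
X_demo, y_demo = make_classification(
    n_samples=400, n_features=14, n_informative=6, random_state=42
)

# Hypothetical search space; liblinear supports both l1 and l2 penalties
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],
    "class_weight": [None, "balanced"],
    "max_iter": [100],
}

search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 4))
```

The best parameters found on synthetic data will generally differ from the ones reported above.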

In [6]:
model = train_basic_supervised_model(X_train, y_train, model_type='Logistic Regression')

accuracy_score, report = evaluate_model("Binary classification [Normal, Others] for images without masks", model, X_test, y_test, model_type='Logistic Regression', classification_type="binary")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

2025/03/09 14:07:18 INFO mlflow.tracking.fluent: Experiment with name 'Basic Supervised Models' does not exist. Creating a new experiment.
Successfully registered model 'sklearn-Logistic Regression-binary'.
2025/03/09 14:07:22 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-Logistic Regression-binary, version 1


🏃 View run Logistic Regression-binary at: http://localhost:8080/#/experiments/316063285991046342/runs/ca9cbc2ce799442ea7e61dbf60ae903f
🧪 View experiment at: http://localhost:8080/#/experiments/316063285991046342
Classification Accuracy: 0.7301
Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.73      0.73      2204
           1       0.73      0.73      0.73      2186

    accuracy                           0.73      4390
   macro avg       0.73      0.73      0.73      4390
weighted avg       0.73      0.73      0.73      4390



Created version '1' of model 'sklearn-Logistic Regression-binary'.
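
The precision, recall, and accuracy columns in the report above all derive from the confusion matrix. A minimal sketch on a toy set of labels (the labels are made up for illustration and are unrelated to the dataset):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels: 0 = Normal, 1 = Others
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```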


### SVM

✅ Strengths:

* Works well on high-dimensional data.
* Effective on small datasets.
* Can be less sensitive than logistic regression to points far from the decision boundary, because only the support vectors determine the fit.

❌ Weaknesses:

* Slow on large datasets (especially with RBF kernel).
* Sensitive to hyperparameters (C, γ, degree).
* Difficult to interpret compared to logistic regression.
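
The choice of kernel matters most when the classes are not linearly separable. A small sketch on synthetic concentric circles, comparing a linear and an RBF kernel with scikit-learn's `SVC` (the dataset and the values of `C` and `gamma` are illustrative assumptions, not the settings used by `train_basic_supervised_model`):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate them
X_demo, y_demo = make_circles(n_samples=600, noise=0.1, factor=0.3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scores = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0, gamma="scale")  # gamma is ignored by the linear kernel
    clf.fit(X_tr, y_tr)
    scores[kernel] = clf.score(X_te, y_te)  # mean accuracy on the held-out split

print(scores)
```

On this kind of data the RBF kernel should clearly outperform the linear one, which mirrors the direction of the gap between the two kernels in the results below.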

#### RBF kernel

In [7]:
model = train_basic_supervised_model(X_train, y_train, model_type='SVM RBF')

accuracy_score, report = evaluate_model("Binary classification [Normal, Others] for images without masks", model, X_test, y_test, model_type='SVM RBF', classification_type="binary")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Successfully registered model 'sklearn-SVM RBF-binary'.
2025/03/09 14:07:30 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-SVM RBF-binary, version 1


🏃 View run SVM RBF-binary at: http://localhost:8080/#/experiments/316063285991046342/runs/9b38ed0eb3514dedae3e358ba97bcb4b
🧪 View experiment at: http://localhost:8080/#/experiments/316063285991046342
Classification Accuracy: 0.8048
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.80      0.80      2204
           1       0.80      0.81      0.81      2186

    accuracy                           0.80      4390
   macro avg       0.80      0.80      0.80      4390
weighted avg       0.80      0.80      0.80      4390



Created version '1' of model 'sklearn-SVM RBF-binary'.


#### Linear kernel

In [8]:
model = train_basic_supervised_model(X_train, y_train, model_type='SVM Linear')

accuracy_score, report = evaluate_model("Binary classification [Normal, Others] for images without masks", model, X_test, y_test, model_type='SVM Linear', classification_type="binary")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Successfully registered model 'sklearn-SVM Linear-binary'.
2025/03/09 14:07:56 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-SVM Linear-binary, version 1


🏃 View run SVM Linear-binary at: http://localhost:8080/#/experiments/316063285991046342/runs/54cff4162e1f4b4a9d1abcbf731ea349
🧪 View experiment at: http://localhost:8080/#/experiments/316063285991046342
Classification Accuracy: 0.7321
Classification Report:
               precision    recall  f1-score   support

           0       0.75      0.70      0.72      2204
           1       0.72      0.76      0.74      2186

    accuracy                           0.73      4390
   macro avg       0.73      0.73      0.73      4390
weighted avg       0.73      0.73      0.73      4390



Created version '1' of model 'sklearn-SVM Linear-binary'.


### Linear Regression

✅ Strengths
* Simple and Fast – Easy to implement and interpret.
* Works Well for Linearly Related Data.
* Low Computational Cost – Efficient on small datasets.

❌ Weaknesses
* Assumes a Linear Relationship – Struggles with non-linear patterns.
* Sensitive to Outliers – A few extreme values can skew results.
* Multicollinearity Issues – Highly correlated features make the coefficient estimates unstable and can reduce accuracy.
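
Linear regression predicts a continuous value, so its output has to be mapped to a class label before accuracy can be reported. How `train_basic_supervised_model` and `evaluate_model` do this is not shown in this notebook; the sketch below assumes a simple 0.5 threshold on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression

X_demo, y_demo = make_classification(
    n_samples=400, n_features=14, n_informative=6, random_state=42
)

# Fit a regressor directly on the 0/1 labels
reg = LinearRegression().fit(X_demo, y_demo)
raw_scores = reg.predict(X_demo)  # continuous values, not restricted to [0, 1]

# Assumed decision rule: predict class 1 when the score reaches 0.5
labels = (raw_scores >= 0.5).astype(int)

print(raw_scores.min(), raw_scores.max())
print((labels == y_demo).mean())  # training accuracy of the thresholded model
```

Because the raw scores are unbounded, they should not be read as probabilities; logistic regression is usually the more natural choice when calibrated class probabilities are needed.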

In [9]:
model = train_basic_supervised_model(X_train, y_train, model_type='Linear Regression')

accuracy_score, report = evaluate_model("Binary classification [Normal, Others] for images without masks", model, X_test, y_test, model_type='Linear Regression', classification_type="binary")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Successfully registered model 'sklearn-Linear Regression-binary'.
2025/03/09 14:07:59 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-Linear Regression-binary, version 1


🏃 View run Linear Regression-binary at: http://localhost:8080/#/experiments/316063285991046342/runs/26c000e31ac040c1a1a4f1b5ef7e21f4
🧪 View experiment at: http://localhost:8080/#/experiments/316063285991046342
Classification Accuracy: 0.7248
Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.72      0.72      2204
           1       0.72      0.73      0.73      2186

    accuracy                           0.72      4390
   macro avg       0.72      0.72      0.72      4390
weighted avg       0.72      0.72      0.72      4390



Created version '1' of model 'sklearn-Linear Regression-binary'.


### Random Forest

✅ Strengths
* High Accuracy – Performs well on complex datasets.
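* Robust to Noise – Averaging over many trees dampens the effect of outliers and noisy samples.
* Works with Categorical & Numerical Features – In scikit-learn, categorical features must be encoded numerically first.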

❌ Weaknesses
* Slow on Large Datasets – Many trees increase computation time.
* Less Interpretable – Harder to understand than Logistic Regression.
* Memory Intensive – Requires more RAM compared to simpler models.

The hyperparameters below were tuned with GridSearchCV on the features used in this notebook:

{'class_weight': None, 'max_depth': 20, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 500}
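
As an illustration, these parameters map directly onto scikit-learn's `RandomForestClassifier`. The sketch below fits such a forest on synthetic data and lists the most important features; the data, `random_state`, and `n_jobs` are assumptions, and whether `train_basic_supervised_model` builds its model this way is not shown here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Tuned values reported above
params = {
    "class_weight": None,
    "max_depth": 20,
    "max_features": "sqrt",
    "min_samples_leaf": 1,
    "min_samples_split": 2,
    "n_estimators": 500,
}

X_demo, y_demo = make_classification(
    n_samples=300, n_features=14, n_informative=6, random_state=42
)

forest = RandomForestClassifier(**params, random_state=42, n_jobs=-1)
forest.fit(X_demo, y_demo)

# Impurity-based importances, normalized to sum to 1
importances = forest.feature_importances_
top3 = importances.argsort()[::-1][:3]
print(top3)
```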

In [10]:
model = train_basic_supervised_model(X_train, y_train, model_type='Random Forest')

accuracy_score, report = evaluate_model("Binary classification [Normal, Others] for images without masks", model, X_test, y_test, model_type='Random Forest', classification_type="binary")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Successfully registered model 'sklearn-Random Forest-binary'.
2025/03/09 14:08:18 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-Random Forest-binary, version 1


🏃 View run Random Forest-binary at: http://localhost:8080/#/experiments/316063285991046342/runs/3aa87ce010aa4097a20963bf4a1e2be7
🧪 View experiment at: http://localhost:8080/#/experiments/316063285991046342
Classification Accuracy: 0.8041
Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.79      0.80      2204
           1       0.79      0.82      0.81      2186

    accuracy                           0.80      4390
   macro avg       0.80      0.80      0.80      4390
weighted avg       0.80      0.80      0.80      4390



Created version '1' of model 'sklearn-Random Forest-binary'.


### CatBoost

✅ Strengths
* Handles categorical features natively (no need for one-hot encoding).
* Supports imbalanced data (class weighting options and built-in loss functions).
* Reduces overfitting through ordered boosting.
* Training speed is often competitive with XGBoost & LightGBM, depending on the data and settings.
* Works well with small datasets (often better than deep learning in low-data settings).
* Automatically handles missing values.
* Often performs well with little hyperparameter tuning.

❌ Weaknesses
* Inference can be slower than LightGBM in some settings (check before using it in real-time applications).
* Can use more memory than XGBoost.
* Smaller community support (troubleshooting can be harder than with XGBoost).
* GPU training does not support every option available on CPU.
* Not always the best choice for highly sparse data (LightGBM may be better).

In [11]:
model = train_basic_supervised_model(X_train, y_train, model_type='CatBoost')

accuracy_score, report = evaluate_model("Binary classification [Normal, Others] for images without masks", model, X_test, y_test, model_type='CatBoost', classification_type="binary")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

0:	learn: 0.6818489	total: 62.3ms	remaining: 31.1s
100:	learn: 0.4837275	total: 347ms	remaining: 1.37s
200:	learn: 0.4469189	total: 628ms	remaining: 934ms
300:	learn: 0.4172246	total: 909ms	remaining: 601ms
400:	learn: 0.3930853	total: 1.19s	remaining: 294ms
499:	learn: 0.3730736	total: 1.46s	remaining: 0us


Successfully registered model 'sklearn-CatBoost-binary'.
2025/03/09 14:08:22 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-CatBoost-binary, version 1


🏃 View run CatBoost-binary at: http://localhost:8080/#/experiments/316063285991046342/runs/ff36dbdde8764d10ad9adfa72232d614
🧪 View experiment at: http://localhost:8080/#/experiments/316063285991046342
Classification Accuracy: 0.8048
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.80      0.80      2204
           1       0.80      0.81      0.81      2186

    accuracy                           0.80      4390
   macro avg       0.80      0.80      0.80      4390
weighted avg       0.80      0.80      0.80      4390



Created version '1' of model 'sklearn-CatBoost-binary'.
