## Supervised models
This notebook introduces the supervised ML models that can be used for COVID-19 detection.

For this notebook to find the modules created for this project, we add the project root directory to the Python module search path.

In [1]:
import sys
sys.path.append("../")
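The relative `"../"` entry above is resolved against the current working directory, so it only works when the notebook is started from its own folder. A minimal sketch of a more explicit variant follows, assuming the notebook lives one level below the project root; the helper name `add_project_root` is hypothetical and not part of the project.

```python
import sys
from pathlib import Path


def add_project_root(start: Path, levels_up: int = 1) -> Path:
    """Add the directory `levels_up` levels above `start` to sys.path and return it."""
    root = start.resolve()
    for _ in range(levels_up):
        root = root.parent
    root_str = str(root)
    # Avoid duplicate entries when the cell is re-run.
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
    return root


# Assumption: the working directory is the notebook folder, one level below the root.
project_root = add_project_root(Path.cwd())
print(f"project root on sys.path: {project_root}")
```

Using an absolute path makes the import behaviour independent of later `os.chdir` calls, and the duplicate check keeps `sys.path` clean across repeated executions of the cell.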

## Loading packages and dependencies

In [2]:
import numpy as np

from src.features.extract_features import load_extracted_features
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from src.models.build_model import train_basic_supervised_model, evaluate_model


# Path to the raw data
raw_data_dir = '../data/raw/COVID-19_Radiography_Dataset/'

## Extracting features from images

In [3]:
X_healthy, y_healthy, _ = load_extracted_features(raw_data_dir+'{}/images','NORMAL', 0)
X_sick, y_sick, _ = load_extracted_features(raw_data_dir+'{}/images',['COVID','Viral Pneumonia','Lung_Opacity'], 1)

Loaded images for NORMAL: 0 resized images, 10192 features, and 10192 labels.
Loaded images for ['COVID', 'Viral Pneumonia', 'Lung_Opacity']: 0 resized images, 10973 features, and 10973 labels.
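Before training, it can help to check how balanced the two classes are, because the majority-class share is the accuracy a trivial classifier would reach. The sketch below rebuilds a label vector from the counts printed above (10192 healthy, 10973 sick); it does not call the project's loader.

```python
import numpy as np

# Label counts taken from the loader output above.
n_healthy, n_sick = 10192, 10973
y_all = np.concatenate((np.zeros(n_healthy, dtype=int), np.ones(n_sick, dtype=int)))

counts = np.bincount(y_all)
fractions = counts / counts.sum()
# Accuracy obtained by always predicting the larger class.
majority_baseline = fractions.max()

print(f"class counts: {counts.tolist()}")
print(f"class fractions: {np.round(fractions, 4).tolist()}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

The classes are close to balanced, so plain accuracy is a reasonable first metric here, and any model should be compared against this baseline.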


## Splitting and normalizing features

In [4]:
# Combine datasets
X = np.vstack((X_healthy, X_sick))
y = np.concatenate((y_healthy, y_sick))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")

# Normalize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

X_train shape: (16932, 14), y_train shape: (16932,)
X_test shape: (4233, 14), y_test shape: (4233,)
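The cell above fits the scaler on the training split only and reuses its statistics for the test split, which keeps test information out of the preprocessing. The sketch below illustrates this on synthetic data (the array names and the `stratify` argument are additions for illustration, not part of the notebook).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for the real feature matrix: 1000 samples, 14 features.
X_demo = rng.normal(loc=5.0, scale=3.0, size=(1000, 14))
y_demo = rng.integers(0, 2, size=1000)

# stratify keeps the class ratio identical in both splits (optional addition).
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # statistics come from the training split only
X_te_scaled = demo_scaler.transform(X_te)      # the test split reuses those statistics

print("train mean is 0:", np.allclose(X_tr_scaled.mean(axis=0), 0.0))
print("train std is 1:", np.allclose(X_tr_scaled.std(axis=0), 1.0))
print("test mean is exactly 0:", np.allclose(X_te_scaled.mean(axis=0), 0.0))
```

The scaled test split is only approximately centered, which is expected: its mean and variance were never seen by the scaler.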


## Training and evaluating models

### Logistic regression

✅ Strengths:
* Simple, fast, and interpretable.
* Works well when the classes are (close to) linearly separable in feature space.

❌ Weaknesses:
* Struggles with complex, non-linear relationships.
* Sensitive to outliers.
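To make the interpretability point concrete, the sketch below fits scikit-learn's `LogisticRegression` on synthetic data and prints the learned coefficients. It is an illustration only: the internals of `train_basic_supervised_model` are not shown in this notebook, so the assumption that it wraps this estimator is unverified.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic data: only feature 0 carries signal; features 1 and 2 are noise.
X_demo = rng.normal(size=(500, 3))
y_demo = (X_demo[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Each coefficient is the change in log-odds of class 1 per unit change of a feature.
for i, coef in enumerate(clf.coef_[0]):
    print(f"feature {i}: coefficient {coef:+.3f}")
```

On standardized inputs, the coefficient magnitudes give a rough ranking of how strongly each feature pushes the prediction toward one class.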

In [5]:
model = train_basic_supervised_model(X_train, y_train, model_type='Logistic Regression')

accuracy_score, report = evaluate_model(model, X_test, y_test, model_type='Logistic Regression')
print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Classification Accuracy: 0.7413
Classification Report:
               precision    recall  f1-score   support

           0       0.75      0.70      0.72      2056
           1       0.73      0.78      0.76      2177

    accuracy                           0.74      4233
   macro avg       0.74      0.74      0.74      4233
weighted avg       0.74      0.74      0.74      4233



### SVM

✅ Strengths:

* Works well on high-dimensional data.
* Effective on small datasets.
* Often reasonably robust to outliers, since the decision boundary depends only on the support vectors.

❌ Weaknesses:

* Slow on large datasets (especially with RBF kernel).
* Sensitive to hyperparameters (C, γ, degree).
* Difficult to interpret compared to logistic regression.
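Because the RBF kernel is sensitive to `C` and `gamma`, a small cross-validated grid search is a common way to choose them. The sketch below runs one on synthetic data with a circular class boundary; the grid values are arbitrary examples, not tuned settings for this dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic two-class data with a circular (non-linear) boundary.
X_demo = rng.normal(size=(300, 2))
y_demo = (np.sum(X_demo**2, axis=1) > 1.4).astype(int)

# Example grid: 3 values of C times 3 values of gamma = 9 candidates.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

On the real feature set the search should be run on the training split only, so that the test split remains an unbiased estimate of performance.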

#### RBF kernel

In [6]:
model = train_basic_supervised_model(X_train, y_train, model_type='SVM RBF')

accuracy_score, report = evaluate_model(model, X_test, y_test, model_type='SVM')
print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Classification Accuracy: 0.8146
Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.79      0.80      2056
           1       0.81      0.84      0.82      2177

    accuracy                           0.81      4233
   macro avg       0.82      0.81      0.81      4233
weighted avg       0.81      0.81      0.81      4233



#### Linear kernel

In [7]:
model = train_basic_supervised_model(X_train, y_train, model_type='SVM Linear')

accuracy_score, report = evaluate_model(model, X_test, y_test, model_type='SVM')
print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Classification Accuracy: 0.7470
Classification Report:
               precision    recall  f1-score   support

           0       0.77      0.68      0.72      2056
           1       0.73      0.81      0.77      2177

    accuracy                           0.75      4233
   macro avg       0.75      0.75      0.75      4233
weighted avg       0.75      0.75      0.75      4233



### Linear Regression

✅ Strengths
* Simple and Fast – Easy to implement and interpret.
* Works Well for Linearly Related Data.
* Low Computational Cost – Efficient on small datasets.

❌ Weaknesses
* Assumes a Linear Relationship – Struggles with non-linear patterns.
* Sensitive to Outliers – A few extreme values can skew results.
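* Multicollinearity Issues – Highly correlated features make the coefficient estimates unstable.
* Not a Classifier – Outputs are continuous and unbounded, so they must be thresholded to obtain class labels.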

In [8]:
model = train_basic_supervised_model(X_train, y_train, model_type='Linear Regression')

accuracy_score, report = evaluate_model(model, X_test, y_test, model_type='Linear Regression')
print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Mean Squared Error (MSE): 0.2592
R-squared (R²) Score: -0.0375
Classification Accuracy: 0.7408
Classification Report:
               precision    recall  f1-score   support

           0       0.75      0.70      0.72      2056
           1       0.73      0.78      0.76      2177

    accuracy                           0.74      4233
   macro avg       0.74      0.74      0.74      4233
weighted avg       0.74      0.74      0.74      4233
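Linear regression produces continuous outputs, so a threshold is needed to turn them into class labels. In the output above, the MSE (0.2592) equals 1 minus the accuracy (0.7408), and the negative R² is also consistent with metrics computed on thresholded 0/1 predictions. This is an inference from the numbers, not something confirmed by the source of `evaluate_model`. The sketch below reproduces the relationship on synthetic data, assuming a 0.5 threshold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score as sk_accuracy
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] + rng.normal(size=400) > 0).astype(int)

reg = LinearRegression().fit(X_demo, y_demo)
y_cont = reg.predict(X_demo)          # continuous, unbounded outputs
y_pred = (y_cont >= 0.5).astype(int)  # assumed 0.5 decision threshold

acc = sk_accuracy(y_demo, y_pred)
# For 0/1 labels, each wrong prediction contributes a squared error of exactly 1.
mse_labels = mean_squared_error(y_demo, y_pred)

print(f"accuracy: {acc:.4f}, MSE on thresholded labels: {mse_labels:.4f}")
print("MSE equals 1 - accuracy:", np.isclose(mse_labels, 1 - acc))
```

Computed this way, MSE and R² carry no information beyond the accuracy; evaluating them on the continuous outputs instead would describe how well the regression fits the labels.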



### Random Forest

✅ Strengths
* High Accuracy – Performs well on complex datasets.
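* Robust to Noise – Tolerates noisy features and outliers reasonably well.
* Works with Categorical & Numerical Features – Most implementations, including scikit-learn, expect categorical features to be encoded numerically first.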

❌ Weaknesses
* Slow on Large Datasets – Many trees increase computation time.
* Less Interpretable – Harder to understand than Logistic Regression.
* Memory Intensive – Requires more RAM compared to simpler models.
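Although a forest is harder to read than a linear model, impurity-based feature importances give a partial view of what it relies on. The sketch below uses scikit-learn's `RandomForestClassifier` on synthetic data with a non-linear rule; whether `train_basic_supervised_model` uses this estimator or these settings is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
# Non-linear rule: the label depends only on features 0 and 1 (circular boundary).
y_demo = (X_demo[:, 0] ** 2 + X_demo[:, 1] ** 2 > 1.4).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances are non-negative and sum to 1 across features.
importances = forest.feature_importances_
for i, value in enumerate(importances):
    print(f"feature {i}: importance {value:.3f}")
```

Impurity-based importances can be biased toward features with many distinct values, so they are best treated as a rough guide rather than a definitive ranking.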

In [9]:
model = train_basic_supervised_model(X_train, y_train, model_type='Random Forest')

accuracy_score, report = evaluate_model(model, X_test, y_test, model_type='Random Forest')
print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Classification Accuracy: 0.8157
Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.79      0.81      2056
           1       0.81      0.84      0.82      2177

    accuracy                           0.82      4233
   macro avg       0.82      0.81      0.82      4233
weighted avg       0.82      0.82      0.82      4233

