## Basic models - Full images, multiclass
This notebook trains and tests our models on the full images (without masks) to classify four classes: `['COVID','NORMAL','Viral Pneumonia','Lung_Opacity']`.

So that this notebook can import the modules created for this project, we add the project root directory to the Python path.

In [None]:
# Auto reload modules
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append("../")
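A slightly more defensive variant of the path setup, as a sketch. It assumes, as the relative `"../"` above implies, that the notebook runs from a directory one level below the project root. The name `project_root` is introduced here for illustration only.

```python
import sys
from pathlib import Path

# Assumption: the working directory sits one level below the project root.
project_root = Path.cwd().resolve().parent

# Add the root only once so that re-running the cell does not grow sys.path.
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))
```

Using an absolute path makes it easier to see, when an import fails, which directory was actually searched.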

## Loading packages and dependencies

In [3]:
import numpy as np

from src.features.extract_features import load_extracted_features
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from src.models.build_model import train_basic_supervised_model, evaluate_model


# Path to the raw image data
raw_data_dir = '../data/raw/dataset/images'

## Extracting features from images

In [4]:
X_normal, y_normal, _ = load_extracted_features(
    images_dir=raw_data_dir+'/{}', category='NORMAL', dataset_label=0)
X_covid, y_covid, _ = load_extracted_features(
    images_dir=raw_data_dir+'/{}', category='COVID', dataset_label=1)
X_pneumonia, y_pneumonia, _ = load_extracted_features(
    images_dir=raw_data_dir+'/{}', category='Viral Pneumonia', dataset_label=2)
X_opacity, y_opacity, _ = load_extracted_features(
    images_dir=raw_data_dir+'/{}', category='Lung_Opacity', dataset_label=3)

Loaded images for NORMAL: 10192 resized images, 10192 features, and 10192 labels.
Loaded images for COVID: 3616 resized images, 3616 features, and 3616 labels.
Loaded images for Viral Pneumonia: 1345 resized images, 1345 features, and 1345 labels.
Loaded images for Lung_Opacity: 6012 resized images, 6012 features, and 6012 labels.
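The counts printed above show a clear class imbalance, which matters when reading the accuracy figures later on. The short calculation below makes the shares explicit; it uses only the numbers from the loader output.

```python
# Image counts per class, copied from the loader output above.
counts = {"NORMAL": 10192, "COVID": 3616, "Viral Pneumonia": 1345, "Lung_Opacity": 6012}

total = sum(counts.values())
proportions = {name: n / total for name, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for name, share in proportions.items():
    print(f"{name:<16} {counts[name]:>6}  {share:.1%}")
print(f"Total images: {total}, largest/smallest class ratio: {imbalance_ratio:.2f}")
```

NORMAL makes up close to half of the images, while Viral Pneumonia is the smallest class by a wide margin.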


## Splitting and normalizing features

In [5]:
# Combine datasets
X = np.vstack((X_normal, X_covid, X_pneumonia, X_opacity))
y = np.concatenate((y_normal, y_covid, y_pneumonia, y_opacity))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")

# Normalize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

X_train shape: (16932, 14), y_train shape: (16932,)
X_test shape: (4233, 14), y_test shape: (4233,)
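The split above is not stratified. With classes this imbalanced, passing `stratify` to `train_test_split` is one optional way to keep the class shares nearly identical in both subsets. The sketch below is an illustration, not what the notebook does: it rebuilds a label vector from the reported class counts and uses placeholder features (`X_demo`, `y_demo` are hypothetical names).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Rebuild a label vector with the class counts reported by the loader (labels 0..3).
class_counts = [10192, 3616, 1345, 6012]
y_demo = np.repeat(np.arange(4), class_counts)
X_demo = np.zeros((y_demo.size, 1))  # placeholder features; only the labels matter here

# stratify=y_demo keeps each class's share nearly identical in both subsets.
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

full_share = np.bincount(y_demo) / y_demo.size
test_share = np.bincount(y_te) / y_te.size
print("share in full data:", np.round(full_share, 4))
print("share in test set: ", np.round(test_share, 4))
```

Note that the scaler in the cell above is fitted on the training set only and then applied to the test set, which avoids leaking test statistics into training; a stratified split would not change that step.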


## Training and evaluating models

### Logistic regression

✅ Strengths:
* Simple, fast, and interpretable.
* Works well when the classes are (approximately) linearly separable in feature space.

❌ Weaknesses:
* Struggles with complex, non-linear relationships.
* Sensitive to outliers.

The hyperparameters tuned with GridSearchCV on the features used in this notebook are:

{'C': 0.1, 'class_weight': None, 'max_iter': 100, 'penalty': 'l1', 'solver': 'liblinear'}
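The grid that produced these values is not shown in this notebook. As a minimal sketch of how such a search could look, the example below runs `GridSearchCV` on synthetic data with the same shape of problem (14 features, 4 classes). The grid values, the synthetic data, and the `OneVsRestClassifier` wrapper are assumptions made for illustration; the wrapper is used because `liblinear` fits binary problems and handles multiclass targets only one-vs-rest.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the notebook's data: 14 features, 4 classes.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=14, n_informative=6, n_classes=4, random_state=0
)

# liblinear fits binary problems, so the 4-class task is wrapped one-vs-rest.
base = OneVsRestClassifier(LogisticRegression(solver="liblinear", max_iter=100))

# Illustrative grid; it contains the reported optimum (C=0.1, penalty='l1').
param_grid = {
    "estimator__C": [0.01, 0.1, 1.0],
    "estimator__penalty": ["l1", "l2"],
}

search = GridSearchCV(base, param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

On the synthetic data the selected values will generally differ from the ones quoted above, since the optimum depends on the features.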

In [None]:
model = train_basic_supervised_model(X_train, y_train, model_type='Logistic Regression')

accuracy_score, report = evaluate_model("Multiclass classification [Normal, COVID, Viral Pneumonia, Lung_Opacity] for images without masks",
                                        model, X_test, y_test, model_type='Logistic Regression', classification_type="multiclass")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Computed Class Weights:{0: 0.5202802359882006, 1: 1.4667359667359667, 2: 3.9450139794967383, 3: 0.875129212321687} labels: [0 1 2 3]


Successfully registered model 'sklearn-Logistic Regression-multiclass'.
2025/04/14 07:37:22 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-Logistic Regression-multiclass, version 1


🏃 View run Logistic Regression-multiclass at: http://localhost:8080/#/experiments/629108935222992872/runs/d4d99b2856d546728ee4e2174f4f7d42
🧪 View experiment at: http://localhost:8080/#/experiments/629108935222992872
Classification Accuracy: 0.6551
Classification Report:
               precision    recall  f1-score   support

           0       0.75      0.68      0.71      2056
           1       0.57      0.54      0.56       730
           2       0.46      0.78      0.58       272
           3       0.64      0.65      0.64      1175

    accuracy                           0.66      4233
   macro avg       0.60      0.66      0.62      4233
weighted avg       0.67      0.66      0.66      4233



Created version '1' of model 'sklearn-Logistic Regression-multiclass'.
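The class weights printed in the output above match scikit-learn's `'balanced'` heuristic, `n_samples / (n_classes * count_per_class)`. How `train_basic_supervised_model` computes and applies them is not shown here, so the sketch below is only a reconstruction: it derives the training counts as total images minus the test-set support from the report.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Training-set counts per class: total images minus the test-set support
# shown in the classification report (e.g. 10192 - 2056 for class 0).
train_counts = np.array([10192 - 2056, 3616 - 730, 1345 - 272, 6012 - 1175])
y_train_demo = np.repeat(np.arange(4), train_counts)

# 'balanced' weights are n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced", classes=np.arange(4), y=y_train_demo)
manual = y_train_demo.size / (4 * train_counts)

print(dict(enumerate(weights)))
```

The rarest class (2, Viral Pneumonia) receives the largest weight, about 3.95, and the most common class (0, NORMAL) the smallest, about 0.52.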


### SVM

✅ Strengths:

* Works well on high-dimensional data.
* Effective on small datasets.
* Handles outliers better than logistic regression.

❌ Weaknesses:

* Slow on large datasets (especially with RBF kernel).
* Sensitive to hyperparameters (C, γ, degree).
* Difficult to interpret compared to logistic regression.
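To illustrate why the kernel choice matters, the sketch below compares a linear and an RBF kernel on a toy dataset of two concentric rings, which no straight line can separate. The data and settings are illustrative and unrelated to the image features used in this notebook.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate the classes.
X_demo, y_demo = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

scores = {}
for kernel in ("linear", "rbf"):
    # Scaling is part of the pipeline so it is fitted on the training data only.
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0, gamma="scale"))
    clf.fit(X_tr, y_tr)
    scores[kernel] = clf.score(X_te, y_te)

print(scores)
```

On data like this the RBF kernel separates the rings while the linear kernel stays near chance level, which is the kind of gap the two SVM variants below are meant to probe on the real features.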

#### RBF kernel

In [None]:
model = train_basic_supervised_model(X_train, y_train, model_type='SVM RBF')

accuracy_score, report = evaluate_model("Multiclass classification [Normal, COVID, Viral Pneumonia, Lung_Opacity] for images without masks",
                                        model, X_test, y_test, model_type='SVM RBF', classification_type="multiclass")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Computed Class Weights:{0: 0.5202802359882006, 1: 1.4667359667359667, 2: 3.9450139794967383, 3: 0.875129212321687} labels: [0 1 2 3]


Successfully registered model 'sklearn-SVM RBF-multiclass'.
2025/04/14 07:37:32 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-SVM RBF-multiclass, version 1


🏃 View run SVM RBF-multiclass at: http://localhost:8080/#/experiments/629108935222992872/runs/978f1b30aae2477e973b60af8dd592bc
🧪 View experiment at: http://localhost:8080/#/experiments/629108935222992872
Classification Accuracy: 0.7427
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.72      0.78      2056
           1       0.72      0.78      0.75       730
           2       0.55      0.93      0.69       272
           3       0.69      0.72      0.70      1175

    accuracy                           0.74      4233
   macro avg       0.70      0.79      0.73      4233
weighted avg       0.76      0.74      0.75      4233



Created version '1' of model 'sklearn-SVM RBF-multiclass'.
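The `macro avg` and `weighted avg` rows of the report above can be recomputed from the per-class rows. The macro average treats every class equally, while the weighted average weights each class by its support, so the large NORMAL class dominates it. The inputs below are the rounded values from the report, so the results can differ from the printed rows in the last digit.

```python
import numpy as np

# Per-class values copied from the SVM RBF report above (classes 0..3).
precision = np.array([0.84, 0.72, 0.55, 0.69])
recall = np.array([0.72, 0.78, 0.93, 0.72])
support = np.array([2056, 730, 272, 1175])

# Macro average: plain mean, so every class counts equally.
macro_precision = precision.mean()
macro_recall = recall.mean()

# Weighted average: mean weighted by the number of test samples per class.
weighted_precision = np.average(precision, weights=support)
weighted_recall = np.average(recall, weights=support)

print(f"macro    precision={macro_precision:.2f} recall={macro_recall:.2f}")
print(f"weighted precision={weighted_precision:.2f} recall={weighted_recall:.2f}")
```

The weighted recall is, up to rounding, the overall accuracy, because each class's recall times its support is the number of correctly classified samples in that class.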


#### Linear kernel

In [None]:
model = train_basic_supervised_model(X_train, y_train, model_type='SVM Linear')

accuracy_score, report = evaluate_model("Multiclass classification [Normal, COVID, Viral Pneumonia, Lung_Opacity] for images without masks",
                                        model, X_test, y_test, model_type='SVM Linear', classification_type="multiclass")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Computed Class Weights:{0: 0.5202802359882006, 1: 1.4667359667359667, 2: 3.9450139794967383, 3: 0.875129212321687} labels: [0 1 2 3]


Successfully registered model 'sklearn-SVM Linear-multiclass'.
2025/04/14 07:38:04 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-SVM Linear-multiclass, version 1


🏃 View run SVM Linear-multiclass at: http://localhost:8080/#/experiments/629108935222992872/runs/3499c7fb15c34b4c985350c62c073ffe
🧪 View experiment at: http://localhost:8080/#/experiments/629108935222992872
Classification Accuracy: 0.6494
Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.60      0.69      2056
           1       0.54      0.66      0.60       730
           2       0.41      0.85      0.55       272
           3       0.64      0.68      0.66      1175

    accuracy                           0.65      4233
   macro avg       0.60      0.70      0.62      4233
weighted avg       0.69      0.65      0.66      4233



Created version '1' of model 'sklearn-SVM Linear-multiclass'.


### Random Forest

✅ Strengths
* High Accuracy – Performs well on complex datasets.
* Robust to Noise – Handles missing data and outliers well.
* Works with Categorical & Numerical Features.

❌ Weaknesses
* Slow on Large Datasets – Many trees increase computation time.
* Less Interpretable – Harder to understand than Logistic Regression.
* Memory Intensive – Requires more RAM compared to simpler models.

The hyperparameters tuned with GridSearchCV on the features used in this notebook are:

{'class_weight': None, 'max_depth': 20, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 500}
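As a sketch of how these values map onto scikit-learn, the example below passes them to `RandomForestClassifier` and inspects the feature importances. It runs on synthetic data with 14 features and 4 classes; whether `train_basic_supervised_model` builds its model this way is an assumption, and `random_state` and `n_jobs` are added here only for reproducibility and speed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the notebook's data: 14 features, 4 classes.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=14, n_informative=6, n_classes=4, random_state=0
)

# Hyperparameters copied from the GridSearchCV result quoted above.
best_params = {
    "class_weight": None, "max_depth": 20, "max_features": "sqrt",
    "min_samples_leaf": 1, "min_samples_split": 2, "n_estimators": 500,
}
forest = RandomForestClassifier(**best_params, random_state=42, n_jobs=-1)
forest.fit(X_demo, y_demo)

# Impurity-based importances, one value per feature, summing to 1.
importances = forest.feature_importances_
print(importances.round(3))
```

Feature importances give a rough view of which inputs the forest relies on, which partly offsets the interpretability weakness listed above.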

In [None]:
model = train_basic_supervised_model(X_train, y_train, model_type='Random Forest')

accuracy_score, report = evaluate_model("Multiclass classification [Normal, COVID, Viral Pneumonia, Lung_Opacity] for images without masks",
                                        model, X_test, y_test, model_type='Random Forest', classification_type="multiclass")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Computed Class Weights:{0: 0.5202802359882006, 1: 1.4667359667359667, 2: 3.9450139794967383, 3: 0.875129212321687} labels: [0 1 2 3]


Successfully registered model 'sklearn-Random Forest-multiclass'.
2025/04/14 07:38:24 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-Random Forest-multiclass, version 1


🏃 View run Random Forest-multiclass at: http://localhost:8080/#/experiments/629108935222992872/runs/de2903fea2d94c029e1041590721fef9
🧪 View experiment at: http://localhost:8080/#/experiments/629108935222992872
Classification Accuracy: 0.7576
Classification Report:
               precision    recall  f1-score   support

           0       0.77      0.83      0.80      2056
           1       0.82      0.71      0.76       730
           2       0.84      0.67      0.74       272
           3       0.68      0.68      0.68      1175

    accuracy                           0.76      4233
   macro avg       0.78      0.72      0.75      4233
weighted avg       0.76      0.76      0.76      4233



Created version '1' of model 'sklearn-Random Forest-multiclass'.


### CatBoost

✅ Strengths
* Handles categorical features natively (no need for one-hot encoding).
* Great for imbalanced data (built-in loss functions).
* Reduces overfitting through ordered boosting.
* Training speed is often competitive with XGBoost & LightGBM, depending on the dataset and settings.
* Works well with small datasets (better than deep learning in low-data settings).
* Automatically handles missing values.
* Requires minimal hyperparameter tuning.

❌ Weaknesses
* Slower inference than LightGBM (not ideal for real-time applications).
* Higher memory usage (uses more RAM than XGBoost).
* Smaller community support (troubleshooting is harder than XGBoost).
* Limited GPU acceleration (only supports specific settings).
* Not the best for highly sparse data (LightGBM may be better).

In [None]:
model = train_basic_supervised_model(X_train, y_train, model_type='CatBoost_Multi')

accuracy_score, report = evaluate_model("Multiclass classification [Normal, COVID, Viral Pneumonia, Lung_Opacity] for images without masks",
                                        model, X_test, y_test, model_type='CatBoost', classification_type="multiclass")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Computed Class Weights:{0: 0.5202802359882006, 1: 1.4667359667359667, 2: 3.9450139794967383, 3: 0.875129212321687} labels: [0 1 2 3]
0:	learn: 1.3545795	total: 62.7ms	remaining: 31.3s
100:	learn: 0.7762778	total: 331ms	remaining: 1.31s
200:	learn: 0.6739704	total: 591ms	remaining: 880ms
300:	learn: 0.6108766	total: 854ms	remaining: 565ms
400:	learn: 0.5658499	total: 1.11s	remaining: 275ms
499:	learn: 0.5323763	total: 1.37s	remaining: 0us


Successfully registered model 'sklearn-CatBoost-multiclass'.
2025/04/14 07:38:28 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-CatBoost-multiclass, version 1


🏃 View run CatBoost-multiclass at: http://localhost:8080/#/experiments/629108935222992872/runs/58c84dd9913b46c18d7261a89a7e4d18
🧪 View experiment at: http://localhost:8080/#/experiments/629108935222992872
Classification Accuracy: 0.7279
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.70      0.77      2056
           1       0.68      0.78      0.73       730
           2       0.55      0.89      0.68       272
           3       0.67      0.70      0.68      1175

    accuracy                           0.73      4233
   macro avg       0.68      0.77      0.71      4233
weighted avg       0.75      0.73      0.73      4233



Created version '1' of model 'sklearn-CatBoost-multiclass'.
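To compare the five models side by side, the snippet below collects the accuracy and macro F1 values reported in the outputs of this notebook and ranks them by accuracy. It only restates numbers already shown above.

```python
# (accuracy, macro F1) copied from the outputs in this notebook.
results = {
    "Logistic Regression": (0.6551, 0.62),
    "SVM RBF": (0.7427, 0.73),
    "SVM Linear": (0.6494, 0.62),
    "Random Forest": (0.7576, 0.75),
    "CatBoost": (0.7279, 0.71),
}

# Sort by accuracy, best model first.
ranked = sorted(results.items(), key=lambda item: item[1][0], reverse=True)

print(f"{'model':<20} {'accuracy':>8} {'macro F1':>8}")
for name, (acc, f1) in ranked:
    print(f"{name:<20} {acc:>8.4f} {f1:>8.2f}")
```

Random Forest has the highest accuracy and macro F1 on this split, followed by SVM RBF and CatBoost; the two linear models trail by roughly ten percentage points of accuracy.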
