## Basic models - Masked lungs, multiclass
This notebook trains and evaluates our baseline models on masked lung images for a four-class classification task. The classes are `['NORMAL', 'COVID', 'Viral Pneumonia', 'Lung_Opacity']`, encoded as labels 0 to 3 in that order.

So that this notebook can import the modules created for this project (the `src` package), we add the parent directory, which is the project root, to the Python path.

In [1]:
# Auto reload modules
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append("../")
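Appending `"../"` relies on the notebook's working directory sitting exactly one level below the project root. A slightly more robust variant, sketched here, walks up from a starting directory until it finds the `src` package. The helper name `add_project_root` and its `marker` parameter are illustrative and not part of this project.

```python
import sys
from pathlib import Path


def add_project_root(start: Path, marker: str = "src") -> Path:
    """Add the nearest ancestor of `start` that contains `marker` to sys.path."""
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            root = str(candidate)
            # Avoid adding the same entry twice when the cell is re-run.
            if root not in sys.path:
                sys.path.insert(0, root)
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}' above {start}")
```

In a notebook this would typically be called as `add_project_root(Path.cwd())`.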

## Loading packages and dependencies

In [3]:
import numpy as np

from src.features.extract_features import load_extracted_features
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from src.models.build_model import train_basic_supervised_model, evaluate_model


# Path to the raw masked images
raw_data_dir = '../data/raw/dataset/masked_images'

## Extracting features from images

In [4]:
X_normal, y_normal, _ = load_extracted_features(images_dir=raw_data_dir+'/{}',
                                                    category='NORMAL', dataset_label=0)
X_covid, y_covid, _ = load_extracted_features(images_dir=raw_data_dir+'/{}',
                                                    category='COVID', dataset_label=1)
X_pneumonia, y_pneumonia, _ = load_extracted_features(images_dir=raw_data_dir+'/{}',
                                                    category='Viral Pneumonia', dataset_label=2)
X_opacity, y_opacity, _ = load_extracted_features(images_dir=raw_data_dir+'/{}',
                                                    category='Lung_Opacity', dataset_label=3)

Loaded images for NORMAL: 10192 resized images, 10192 features, and 10192 labels.
Loaded images for COVID: 3616 resized images, 3616 features, and 3616 labels.
Loaded images for Viral Pneumonia: 1345 resized images, 1345 features, and 1345 labels.
Loaded images for Lung_Opacity: 6012 resized images, 6012 features, and 6012 labels.
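The counts above show a clear class imbalance, which matters when reading the accuracy figures later in the notebook. The sketch below quantifies it from the printed counts. The majority-class baseline is an added reference point, not something the project code computes.

```python
# Image counts per class, copied from the loader output above.
counts = {"NORMAL": 10192, "COVID": 3616, "Viral Pneumonia": 1345, "Lung_Opacity": 6012}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name:<16} {counts[name]:>6}  {share:.1%}")

# Always predicting the largest class would reach this accuracy on the full dataset.
majority_class = max(counts, key=counts.get)
majority_baseline = shares[majority_class]
print(f"Majority-class baseline ({majority_class}): {majority_baseline:.1%}")
```

Any model below should be judged against this baseline rather than against 25% (uniform guessing over four classes).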


## Splitting and normalizing features

In [5]:
# Combine datasets
X = np.vstack((X_normal, X_covid, X_pneumonia, X_opacity))
y = np.concatenate((y_normal, y_covid, y_pneumonia, y_opacity))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")

# Standardize features: fit the scaler on the training set only, then reuse it on the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

X_train shape: (16932, 14), y_train shape: (16932,)
X_test shape: (4233, 14), y_test shape: (4233,)
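The scaler is fitted on the training split and only applied to the test split, so no test-set statistics leak into training. The NumPy sketch below shows what `fit_transform` and `transform` amount to, using synthetic data in place of the real features. The `*_demo` arrays are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the real feature matrices (14 features, as above).
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 14))
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(50, 14))

# "fit": learn per-feature statistics from the training set only.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": apply the same training statistics to both splits.
X_train_scaled = (X_train_demo - mean) / std
X_test_scaled = (X_test_demo - mean) / std
```

After this step the training features have zero mean and unit variance per column. The test features are only approximately standardized, which is expected.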


## Training and evaluating models

### Logistic regression

✅ Strengths:
* Simple, fast, and interpretable.
* Works well when the classes are close to linearly separable in feature space.

❌ Weaknesses:
* Struggles with complex, non-linear relationships.
* Sensitive to outliers.

The hyperparameters selected with GridSearchCV for the features used in this notebook are:

{'C': 0.1, 'class_weight': None, 'max_iter': 100, 'penalty': 'l1', 'solver': 'liblinear'}

In [None]:
model = train_basic_supervised_model(X_train, y_train, model_type='Logistic Regression')

accuracy_score, report = evaluate_model("Multi-label classification [Normal, COVID, Viral Pneumonia, Lung_Opacity] for masked images", 
                                        model, X_test, y_test, model_type='Logistic Regression', 
                                        classification_type="multiclass")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Computed Class Weights:{0: 0.5202802359882006, 1: 1.4667359667359667, 2: 3.9450139794967383, 3: 0.875129212321687} labels: [0 1 2 3]


Registered model 'sklearn-Logistic Regression-multiclass' already exists. Creating a new version of this model...
2025/04/14 07:40:38 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-Logistic Regression-multiclass, version 2


🏃 View run Logistic Regression-multiclass at: http://localhost:8080/#/experiments/629108935222992872/runs/09bcaeffca8d4ed69632d40685f52f8e
🧪 View experiment at: http://localhost:8080/#/experiments/629108935222992872
Classification Accuracy: 0.6097
Classification Report:
               precision    recall  f1-score   support

           0       0.72      0.69      0.70      2056
           1       0.42      0.29      0.34       730
           2       0.36      0.76      0.49       272
           3       0.63      0.63      0.63      1175

    accuracy                           0.61      4233
   macro avg       0.53      0.59      0.54      4233
weighted avg       0.62      0.61      0.61      4233



Created version '2' of model 'sklearn-Logistic Regression-multiclass'.
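The "Computed Class Weights" line in the output is consistent with the usual "balanced" heuristic, `n_samples / (n_classes * class_count)`, applied to the training split. The sketch below reproduces the printed values from the counts shown in this notebook. Whether the project code computes them this way internally is an assumption.

```python
# Training-set class counts: total images per class minus the test-set support.
totals = {0: 10192, 1: 3616, 2: 1345, 3: 6012}
test_support = {0: 2056, 1: 730, 2: 272, 3: 1175}
train_counts = {label: totals[label] - test_support[label] for label in totals}

n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# "Balanced" heuristic: n_samples / (n_classes * count of that class).
class_weights = {
    label: n_samples / (n_classes * count) for label, count in train_counts.items()
}
print(class_weights)
```

The rarest class (label 2, Viral Pneumonia) gets the largest weight, so errors on it cost more during training.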


### SVM

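✅ Strengths:

* Works well on high-dimensional data.
* Effective on small to medium-sized datasets.
* Can be less sensitive than logistic regression to points far from the decision boundary, since only the support vectors determine the solution.

❌ Weaknesses:

* Slow on large datasets (especially with the RBF kernel).
* Sensitive to hyperparameters (C, γ for the RBF kernel, degree for polynomial kernels).
* Difficult to interpret compared to logistic regression.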
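To illustrate why γ matters, the sketch below evaluates the standard RBF kernel, k(x, y) = exp(-γ‖x - y‖²), for one pair of points at several γ values. The function name `rbf_kernel` and the sample points are illustrative only.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


x, y = [0.0, 0.0], [1.0, 1.0]
for gamma in (0.01, 0.1, 1.0):
    print(f"gamma={gamma}: k(x, y) = {rbf_kernel(x, y, gamma):.4f}")
```

A larger γ makes the similarity fall off faster with distance, which gives a more local and more flexible decision boundary that is also more prone to overfitting.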

#### RBF kernel

In [None]:
model = train_basic_supervised_model(X_train, y_train, model_type='SVM RBF')

accuracy_score, report = evaluate_model("Multi-label classification [Normal, COVID, Viral Pneumonia, Lung_Opacity] for masked images", 
                                        model, X_test, y_test, model_type='SVM RBF', 
                                        classification_type="multiclass")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Computed Class Weights:{0: 0.5202802359882006, 1: 1.4667359667359667, 2: 3.9450139794967383, 3: 0.875129212321687} labels: [0 1 2 3]


Registered model 'sklearn-SVM RBF-multiclass' already exists. Creating a new version of this model...
2025/04/14 07:40:50 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-SVM RBF-multiclass, version 2


🏃 View run SVM RBF-multiclass at: http://localhost:8080/#/experiments/629108935222992872/runs/29edf54b0f8d42d6b65f6bfe03a4ef1b
🧪 View experiment at: http://localhost:8080/#/experiments/629108935222992872
Classification Accuracy: 0.6171
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.60      0.70      2056
           1       0.43      0.64      0.51       730
           2       0.33      0.85      0.47       272
           3       0.70      0.58      0.63      1175

    accuracy                           0.62      4233
   macro avg       0.57      0.67      0.58      4233
weighted avg       0.70      0.62      0.64      4233



Created version '2' of model 'sklearn-SVM RBF-multiclass'.
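The `macro avg` and `weighted avg` rows in these reports differ because of the class imbalance. The sketch below recomputes both for recall, using the per-class values from the SVM RBF report. Because those inputs are already rounded to two decimals, the results are approximate.

```python
# Per-class recall and support, copied from the SVM RBF report above.
recall = [0.60, 0.64, 0.85, 0.58]
support = [2056, 730, 272, 1175]

# Macro average: every class counts equally.
macro_recall = sum(recall) / len(recall)

# Weighted average: each class counts in proportion to its support.
weighted_recall = sum(r * n for r, n in zip(recall, support)) / sum(support)

print(f"macro recall:    {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

Support-weighted recall is by construction the same quantity as overall accuracy, which is why both show 0.62 in the report. The macro average is higher here because the small Viral Pneumonia class has the highest recall.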


#### Linear kernel

In [None]:
model = train_basic_supervised_model(X_train, y_train, model_type='SVM Linear')

accuracy_score, report = evaluate_model("Multi-label classification [Normal, COVID, Viral Pneumonia, Lung_Opacity] for masked images", 
                                        model, X_test, y_test, model_type='SVM Linear', 
                                        classification_type="multiclass")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Computed Class Weights:{0: 0.5202802359882006, 1: 1.4667359667359667, 2: 3.9450139794967383, 3: 0.875129212321687} labels: [0 1 2 3]


Registered model 'sklearn-SVM Linear-multiclass' already exists. Creating a new version of this model...
2025/04/14 07:41:22 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-SVM Linear-multiclass, version 2


🏃 View run SVM Linear-multiclass at: http://localhost:8080/#/experiments/629108935222992872/runs/1710b94f5bc64f8fa5a7378b524e01a1
🧪 View experiment at: http://localhost:8080/#/experiments/629108935222992872
Classification Accuracy: 0.5941
Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.60      0.68      2056
           1       0.39      0.53      0.45       730
           2       0.32      0.86      0.47       272
           3       0.71      0.56      0.63      1175

    accuracy                           0.59      4233
   macro avg       0.55      0.64      0.56      4233
weighted avg       0.66      0.59      0.61      4233



Created version '2' of model 'sklearn-SVM Linear-multiclass'.


### Random Forest

✅ Strengths:
* Strong accuracy – often performs well on complex, non-linear datasets.
* Robust to noise – averaging many trees dampens the effect of outliers; support for missing values depends on the implementation and version.
* Works with categorical and numerical features (scikit-learn expects categorical features to be numerically encoded first).

❌ Weaknesses:
* Slow on large datasets – many trees increase computation time.
* Less interpretable – harder to understand than logistic regression.
* Memory intensive – requires more RAM than simpler models.

The hyperparameters selected with GridSearchCV for the features used in this notebook are:

{'class_weight': None, 'max_depth': 20, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 500}
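With only 14 features, `'max_features': 'sqrt'` means each split considers a small random subset of them. The quick sketch below assumes the setting is resolved as the integer part of the square root of the feature count, which is how scikit-learn is understood to handle it.

```python
import math

n_features = 14  # number of columns in X_train, from the shapes printed earlier

# Assumption: 'sqrt' is resolved as max(1, floor(sqrt(n_features))).
features_per_split = max(1, int(math.sqrt(n_features)))
print(features_per_split)
```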

In [None]:
model = train_basic_supervised_model(X_train, y_train, model_type='Random Forest')

accuracy_score, report = evaluate_model("Multi-label classification [Normal, COVID, Viral Pneumonia, Lung_Opacity] for masked images", 
                                        model, X_test, y_test, model_type='Random Forest', 
                                        classification_type="multiclass")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Computed Class Weights:{0: 0.5202802359882006, 1: 1.4667359667359667, 2: 3.9450139794967383, 3: 0.875129212321687} labels: [0 1 2 3]


Registered model 'sklearn-Random Forest-multiclass' already exists. Creating a new version of this model...
2025/04/14 07:41:41 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-Random Forest-multiclass, version 2


🏃 View run Random Forest-multiclass at: http://localhost:8080/#/experiments/629108935222992872/runs/988e220d1c5a428093d05dc38f8f07c8
🧪 View experiment at: http://localhost:8080/#/experiments/629108935222992872
Classification Accuracy: 0.6856
Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.85      0.78      2056
           1       0.60      0.42      0.49       730
           2       0.56      0.41      0.47       272
           3       0.66      0.63      0.64      1175

    accuracy                           0.69      4233
   macro avg       0.64      0.58      0.60      4233
weighted avg       0.68      0.69      0.67      4233



Created version '2' of model 'sklearn-Random Forest-multiclass'.


### CatBoost

✅ Strengths:
* Handles categorical features natively (no need for one-hot encoding).
* Supports imbalanced data through class weighting (e.g. `class_weights`, `auto_class_weights`).
* Reduces overfitting through ordered boosting.
* Training speed is competitive with XGBoost and LightGBM, although which library is fastest depends on the dataset and settings.
* Works well with small datasets (often better than deep learning in low-data settings).
* Automatically handles missing values.
* Often works well with little hyperparameter tuning.

❌ Weaknesses:
* Inference can be slower than LightGBM in some setups, which may matter for real-time applications.
* Memory usage can be higher than with XGBoost.
* Smaller community than XGBoost, so troubleshooting can be harder.
* GPU training supports only a subset of settings.
* Not the best choice for highly sparse data (LightGBM may be better).

In [10]:
model = train_basic_supervised_model(X_train, y_train, model_type='CatBoost_Multi')

accuracy_score, report = evaluate_model("Multi-label classification [Normal, COVID, Viral Pneumonia, Lung_Opacity] for masked images",
                                        model, X_test, y_test, model_type='CatBoost',
                                        classification_type="multiclass")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Computed Class Weights:{0: 0.5202802359882006, 1: 1.4667359667359667, 2: 3.9450139794967383, 3: 0.875129212321687} labels: [0 1 2 3]
0:	learn: 1.3665902	total: 62.9ms	remaining: 31.4s
100:	learn: 1.0100573	total: 344ms	remaining: 1.36s
200:	learn: 0.9261526	total: 626ms	remaining: 931ms
300:	learn: 0.8677810	total: 902ms	remaining: 596ms
400:	learn: 0.8240651	total: 1.18s	remaining: 291ms
499:	learn: 0.7914557	total: 1.46s	remaining: 0us


Registered model 'sklearn-CatBoost-multiclass' already exists. Creating a new version of this model...
2025/04/14 07:41:45 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-CatBoost-multiclass, version 2


🏃 View run CatBoost-multiclass at: http://localhost:8080/#/experiments/629108935222992872/runs/dab9ecb347a84f3eba0a9be57400c7c3
🧪 View experiment at: http://localhost:8080/#/experiments/629108935222992872
Classification Accuracy: 0.6199
Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.64      0.72      2056
           1       0.44      0.58      0.50       730
           2       0.32      0.80      0.45       272
           3       0.68      0.57      0.62      1175

    accuracy                           0.62      4233
   macro avg       0.56      0.65      0.57      4233
weighted avg       0.68      0.62      0.64      4233



Created version '2' of model 'sklearn-CatBoost-multiclass'.
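To compare the five models side by side, the sketch below collects the headline numbers from the reports in this notebook and ranks them. The values are copied from the outputs above, and the choice of columns is only one possible summary.

```python
# Test-set metrics copied from the reports above.
# Columns: accuracy, macro F1, recall for class 2 (Viral Pneumonia).
results = {
    "Logistic Regression": (0.6097, 0.54, 0.76),
    "SVM RBF": (0.6171, 0.58, 0.85),
    "SVM Linear": (0.5941, 0.56, 0.86),
    "Random Forest": (0.6856, 0.60, 0.41),
    "CatBoost": (0.6199, 0.57, 0.80),
}

print(f"{'model':<20} {'accuracy':>8} {'macro F1':>8} {'recall[2]':>9}")
ranked = sorted(results.items(), key=lambda item: item[1][0], reverse=True)
for name, (acc, macro_f1, recall_2) in ranked:
    print(f"{name:<20} {acc:>8.4f} {macro_f1:>8.2f} {recall_2:>9.2f}")

best_accuracy = max(results, key=lambda m: results[m][0])
best_minority_recall = max(results, key=lambda m: results[m][2])
```

Random Forest leads on accuracy and macro F1 but has by far the lowest recall on the smallest class, while the SVMs and CatBoost recover most Viral Pneumonia cases at the cost of precision. Which trade-off is preferable depends on the cost of missing that class.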
