## Basic models - Full image binary classes
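This notebook trains and tests our models on the full images (without masks) for binary classification with the two classes `['Normal/Healthy', 'SICK']`. The `SICK` class groups the `COVID`, `Viral Pneumonia`, and `Lung_Opacity` categories.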

For this notebook to import the modules created for this project, the project root directory must be on the Python path.

In [1]:
# Auto reload modules
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append("../")
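
A slightly more defensive variant of the cell above, as a sketch. It assumes the notebook is launched from a folder one level below the project root (the same assumption `"../"` makes); the name `project_root` is illustrative only.

```python
import sys
from pathlib import Path

# Assumption: the working directory is one level below the project root.
project_root = Path.cwd().resolve().parent

# Add the root only once so re-running the cell does not grow sys.path.
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))
```

Using an absolute path keeps the import working even if the working directory changes later in the session.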

## Loading packages and dependencies

In [3]:
import numpy as np

from src.features.extract_features import load_extracted_features
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from src.models.build_model import train_basic_supervised_model, evaluate_model


# Path to the raw image data
raw_data_dir = '../data/raw/dataset/images'

## Extracting features from images

In [None]:
X_healthy, y_healthy, _ = load_extracted_features(images_dir=raw_data_dir+'/{}',
                                                       category='NORMAL', dataset_label=0)
X_sick, y_sick, _ = load_extracted_features(images_dir=raw_data_dir+'/{}',
                                            category=['COVID','Viral Pneumonia','Lung_Opacity'], dataset_label=1)

## Splitting and normalizing features

In [5]:
# Combine datasets
X = np.vstack((X_healthy, X_sick))
y = np.concatenate((y_healthy, y_sick))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")

# Normalize features: fit the scaler on the training set only, then reuse it on the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

X_train shape: (16932, 14), y_train shape: (16932,)
X_test shape: (4233, 14), y_test shape: (4233,)
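
The split-then-scale order above matters: the scaler statistics come from the training rows only, so nothing about the test set leaks into training. A minimal sketch of the same idea with a scikit-learn pipeline follows. It uses synthetic data in place of the 14 extracted features, and the names `X_demo`, `y_demo`, and `pipe` are illustrative. The `stratify` argument is an optional addition that is not used in the cell above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 14 extracted features (not the real dataset).
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 14))
y_demo = (X_demo[:, 0] > 5.0).astype(int)

# stratify keeps the class ratio the same in both splits (optional).
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training split only and reuses
# those statistics whenever it transforms new data.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

scaler_demo = pipe.named_steps["standardscaler"]
train_means = scaler_demo.transform(X_tr).mean(axis=0)
test_accuracy = pipe.score(X_te, y_te)
print("Largest absolute training mean after scaling:", np.abs(train_means).max())
print("Test accuracy:", test_accuracy)
```

Wrapping the scaler and the model in one pipeline also makes cross-validation safe, because the scaler is refit inside each fold.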


## Training and evaluating models

### Logistic regression

✅ Strengths:
* Simple, fast, and interpretable.
* Works well when features are linearly separable.

❌ Weaknesses:
* Struggles with complex, non-linear relationships.
* Sensitive to outliers.

The hyperparameters tuned with GridSearchCV on the features used in this notebook are:

{'C': 0.1, 'class_weight': None, 'max_iter': 100, 'penalty': 'l1', 'solver': 'liblinear'}
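
The search itself is not shown in this notebook. The following is a hedged sketch of what such a search might look like: the grid values, the 5-fold setting, and the synthetic data are assumptions chosen for illustration, not the grid actually used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data with 14 features, standing in for the extracted features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=14, n_informative=6, random_state=42
)

# Hypothetical grid; liblinear supports both the l1 and l2 penalties.
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],
    "class_weight": [None, "balanced"],
    "max_iter": [100],
}

search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 4))
```

`best_params_` returns a dictionary with the same keys as the grid, which matches the shape of the result reported above.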

In [None]:
model = train_basic_supervised_model(X_train, y_train, model_type='Logistic Regression')

accuracy_score, report = evaluate_model("Binary classification [Normal, Others] for images without masks", 
                                        model, X_test, y_test, model_type='Logistic Regression', classification_type="binary")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Computed Class Weights:{0: 1.0405604719764012, 1: 0.9624829467939973} labels: [0 1]


2025/04/14 07:33:54 INFO mlflow.tracking.fluent: Experiment with name 'Basic Supervised Models' does not exist. Creating a new experiment.
Successfully registered model 'sklearn-Logistic Regression-binary'.
2025/04/14 07:33:59 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-Logistic Regression-binary, version 1


🏃 View run Logistic Regression-binary at: http://localhost:8080/#/experiments/629108935222992872/runs/412790f062ec46e59d1ed15f79122733
🧪 View experiment at: http://localhost:8080/#/experiments/629108935222992872
Classification Accuracy: 0.7368
Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.73      0.73      2056
           1       0.74      0.75      0.74      2177

    accuracy                           0.74      4233
   macro avg       0.74      0.74      0.74      4233
weighted avg       0.74      0.74      0.74      4233



Created version '1' of model 'sklearn-Logistic Regression-binary'.
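
The `Computed Class Weights` line in the output is consistent with scikit-learn's "balanced" heuristic, `n_samples / (n_classes * count_per_class)`. The sketch below reproduces the logged values. The class counts (8136 and 8796) are inferred from the weights and the 16932 training rows; they are not printed anywhere in the notebook, and whether the project calls `compute_class_weight` internally is an assumption.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Inferred training-set class counts (an assumption, see the text above).
y_counts = np.array([0] * 8136 + [1] * 8796)

classes = np.unique(y_counts)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_counts)

# Same formula written out by hand for comparison.
manual = len(y_counts) / (len(classes) * np.bincount(y_counts))

print(dict(zip(classes.tolist(), weights.tolist())))
```

Because both weights are close to 1, the training set is nearly balanced, which fits the tuned `class_weight: None` setting.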


### SVM

✅ Strengths:

* Works well on high-dimensional data.
* Effective on small datasets.
* Can be less sensitive to outliers than logistic regression, depending on the choice of C.

❌ Weaknesses:

* Slow on large datasets (especially with RBF kernel).
* Sensitive to hyperparameters (C, γ, degree).
* Difficult to interpret compared to logistic regression.
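
To illustrate why the kernel choice matters, here is a small sketch comparing a linear and an RBF kernel on a synthetic, non-linearly separable dataset (`make_moons`). The data, `C=1.0`, and `gamma="scale"` are illustrative assumptions and say nothing about the settings used inside `train_basic_supervised_model`.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half-moons: a linear boundary cannot separate them well.
X_demo, y_demo = make_moons(n_samples=600, noise=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

scores = {}
for kernel in ("linear", "rbf"):
    # Scaling first, because SVMs are sensitive to feature scale.
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0, gamma="scale"))
    clf.fit(X_tr, y_tr)
    scores[kernel] = clf.score(X_te, y_te)

print(scores)
```

The same pattern appears in the results below, where the RBF kernel scores noticeably higher than the linear kernel on the extracted features.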

#### RBF kernel

In [None]:
model = train_basic_supervised_model(X_train, y_train, model_type='SVM RBF')

accuracy_score, report = evaluate_model("Binary classification [Normal, Others] for images without masks", 
                                        model, X_test, y_test, model_type='SVM RBF', classification_type="binary")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Computed Class Weights:{0: 1.0405604719764012, 1: 0.9624829467939973} labels: [0 1]


Successfully registered model 'sklearn-SVM RBF-binary'.
2025/04/14 07:34:08 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-SVM RBF-binary, version 1


🏃 View run SVM RBF-binary at: http://localhost:8080/#/experiments/629108935222992872/runs/a91ef95de82045cdb22b635e110d4375
🧪 View experiment at: http://localhost:8080/#/experiments/629108935222992872
Classification Accuracy: 0.8148
Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.80      0.81      2056
           1       0.81      0.83      0.82      2177

    accuracy                           0.81      4233
   macro avg       0.81      0.81      0.81      4233
weighted avg       0.81      0.81      0.81      4233



Created version '1' of model 'sklearn-SVM RBF-binary'.


#### Linear kernel

In [None]:
model = train_basic_supervised_model(X_train, y_train, model_type='SVM Linear')

In [None]:

accuracy_score, report = evaluate_model("Binary classification [Normal, Others] for images without masks",
                                        model, X_test, y_test, model_type='SVM Linear', classification_type="binary")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Computed Class Weights:{0: 1.0405604719764012, 1: 0.9624829467939973} labels: [0 1]


Successfully registered model 'sklearn-SVM Linear-binary'.
2025/04/14 07:34:35 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-SVM Linear-binary, version 1


🏃 View run SVM Linear-binary at: http://localhost:8080/#/experiments/629108935222992872/runs/26d71f39f1e94a64b32de0f016c55b49
🧪 View experiment at: http://localhost:8080/#/experiments/629108935222992872
Classification Accuracy: 0.7432
Classification Report:
               precision    recall  f1-score   support

           0       0.75      0.70      0.73      2056
           1       0.74      0.78      0.76      2177

    accuracy                           0.74      4233
   macro avg       0.74      0.74      0.74      4233
weighted avg       0.74      0.74      0.74      4233



Created version '1' of model 'sklearn-SVM Linear-binary'.


### Linear Regression

✅ Strengths
* Simple and Fast – Easy to implement and interpret.
* Works Well for Linearly Related Data.
* Low Computational Cost – Efficient on small datasets.

❌ Weaknesses
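* Assumes a Linear Relationship – Struggles with non-linear patterns.
* Sensitive to Outliers – A few extreme values can skew results.
* Multicollinearity Issues – Highly correlated features make the coefficient estimates unstable.
* Not Designed for Classification – Predictions are continuous values and must be thresholded to obtain class labels.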
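
Linear regression predicts continuous values, so using it as a binary classifier needs a decision rule. A common choice, sketched below, is to threshold the prediction at 0.5. Whether `train_basic_supervised_model` and `evaluate_model` do exactly this is an assumption; the data and variable names are synthetic and illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, roughly balanced binary problem with a linear boundary.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

reg = LinearRegression().fit(X_demo, y_demo)

# Raw predictions are continuous and not bounded to [0, 1].
raw_scores = reg.predict(X_demo)

# Assumed decision rule: label 1 when the prediction reaches 0.5.
y_pred = (raw_scores >= 0.5).astype(int)
accuracy = (y_pred == y_demo).mean()
print("Training accuracy:", accuracy)
```

The 0.5 cutoff suits 0/1 labels with roughly balanced classes; a different class balance may call for a different threshold.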

In [None]:
model = train_basic_supervised_model(X_train, y_train, model_type='Linear Regression')

accuracy_score, report = evaluate_model("Binary classification [Normal, Others] for images without masks", 
                                        model, X_test, y_test, model_type='Linear Regression', classification_type="binary")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Computed Class Weights:{0: 1.0405604719764012, 1: 0.9624829467939973} labels: [0 1]


Successfully registered model 'sklearn-Linear Regression-binary'.
2025/04/14 07:34:39 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-Linear Regression-binary, version 1


🏃 View run Linear Regression-binary at: http://localhost:8080/#/experiments/629108935222992872/runs/e56e45a2ced0452da575754691598ecf
🧪 View experiment at: http://localhost:8080/#/experiments/629108935222992872
Classification Accuracy: 0.7323
Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.72      0.72      2056
           1       0.74      0.75      0.74      2177

    accuracy                           0.73      4233
   macro avg       0.73      0.73      0.73      4233
weighted avg       0.73      0.73      0.73      4233



Created version '1' of model 'sklearn-Linear Regression-binary'.


### Random Forest

✅ Strengths
* High Accuracy – Performs well on complex datasets.
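* Robust to Noise – Averaging over many trees dampens the effect of noisy samples and outliers; missing-value support depends on the implementation and version.
* Works with Categorical & Numerical Features – Categorical features must be encoded numerically in scikit-learn.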

❌ Weaknesses
* Slow on Large Datasets – Many trees increase computation time.
* Less Interpretable – Harder to understand than Logistic Regression.
* Memory Intensive – Requires more RAM compared to simpler models.

The hyperparameters tuned with GridSearchCV on the features used in this notebook are:

{'class_weight': None, 'max_depth': 20, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 500}
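
As a sketch, the tuned values above map directly onto the `RandomForestClassifier` constructor. The synthetic data, `random_state`, and `n_jobs` below are illustrative additions, and it is an assumption that `train_basic_supervised_model` builds its model this way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 14 features, standing in for the extracted features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=14, n_informative=6, random_state=42
)

rf = RandomForestClassifier(
    n_estimators=500,        # number of trees
    max_depth=20,            # maximum depth of each tree
    max_features="sqrt",     # features considered at each split
    min_samples_leaf=1,
    min_samples_split=2,
    class_weight=None,
    random_state=42,         # added for reproducibility (assumption)
    n_jobs=-1,               # use all cores (assumption)
)
rf.fit(X_demo, y_demo)

# Impurity-based importances, one value per feature, summing to 1.
importances = rf.feature_importances_
print(importances.round(3))
```

The feature importances give a partial view into an otherwise hard-to-interpret model.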

In [None]:
model = train_basic_supervised_model(X_train, y_train, model_type='Random Forest')

accuracy_score, report = evaluate_model("Binary classification [Normal, Others] for images without masks",
                                        model, X_test, y_test, model_type='Random Forest', classification_type="binary")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Computed Class Weights:{0: 1.0405604719764012, 1: 0.9624829467939973} labels: [0 1]


Successfully registered model 'sklearn-Random Forest-binary'.
2025/04/14 07:34:58 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-Random Forest-binary, version 1


🏃 View run Random Forest-binary at: http://localhost:8080/#/experiments/629108935222992872/runs/05676bec71874f3b930e7bdd9dbbe913
🧪 View experiment at: http://localhost:8080/#/experiments/629108935222992872
Classification Accuracy: 0.8134
Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.78      0.80      2056
           1       0.80      0.84      0.82      2177

    accuracy                           0.81      4233
   macro avg       0.81      0.81      0.81      4233
weighted avg       0.81      0.81      0.81      4233



Created version '1' of model 'sklearn-Random Forest-binary'.


### CatBoost

✅ Strengths
* Handles categorical features natively (no need for one-hot encoding).
* Supports imbalanced data through class-weighting options.
* Reduces overfitting through ordered boosting.
* Training speed is competitive with XGBoost & LightGBM, depending on the dataset and settings.
* Works well with small datasets (often better than deep learning in low-data settings).
* Automatically handles missing values in numerical features.
* Often performs well with little hyperparameter tuning.

❌ Weaknesses
* Inference can be slower than LightGBM in some settings (check before using it in real-time applications).
* Can have higher memory usage than XGBoost.
* Smaller community support (troubleshooting is harder than XGBoost).
* GPU training does not support every option available on CPU.
* Not the best for highly sparse data (LightGBM may be better).

In [None]:
model = train_basic_supervised_model(X_train, y_train, model_type='CatBoost')

accuracy_score, report = evaluate_model("Binary classification [Normal, Others] for images without masks", 
                                        model, X_test, y_test, model_type='CatBoost', classification_type="binary")

print(f"Classification Accuracy: {accuracy_score:.4f}")
print("Classification Report:\n", report)

Computed Class Weights:{0: 1.0405604719764012, 1: 0.9624829467939973} labels: [0 1]
0:	learn: 0.6798811	total: 61ms	remaining: 30.5s
100:	learn: 0.4838238	total: 340ms	remaining: 1.34s
200:	learn: 0.4467777	total: 617ms	remaining: 918ms
300:	learn: 0.4163232	total: 894ms	remaining: 591ms
400:	learn: 0.3914075	total: 1.17s	remaining: 289ms
499:	learn: 0.3709681	total: 1.44s	remaining: 0us


Successfully registered model 'sklearn-CatBoost-binary'.
2025/04/14 07:35:02 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-CatBoost-binary, version 1


🏃 View run CatBoost-binary at: http://localhost:8080/#/experiments/629108935222992872/runs/0a92d1c05d9a48ceaadc832380387f68
🧪 View experiment at: http://localhost:8080/#/experiments/629108935222992872
Classification Accuracy: 0.8160
Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.80      0.81      2056
           1       0.81      0.83      0.82      2177

    accuracy                           0.82      4233
   macro avg       0.82      0.82      0.82      4233
weighted avg       0.82      0.82      0.82      4233



Created version '1' of model 'sklearn-CatBoost-binary'.
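
To compare the models at a glance, the test accuracies printed in the outputs above can be collected and ranked. This is only a convenience sketch: the numbers are copied from this notebook's outputs, and the dictionary name `results` is illustrative.

```python
# Test-set accuracies copied from the outputs in this notebook.
results = {
    "Logistic Regression": 0.7368,
    "SVM RBF": 0.8148,
    "SVM Linear": 0.7432,
    "Linear Regression": 0.7323,
    "Random Forest": 0.8134,
    "CatBoost": 0.8160,
}

# Sort from highest to lowest accuracy.
ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)

for name, acc in ranked:
    print(f"{name:<20} {acc:.4f}")
```

The three non-linear models (CatBoost, SVM RBF, Random Forest) land within about 0.3 percentage points of each other, while the three linear models trail by roughly 7 to 8 points.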
