# Data for ML: Binary classification

Train a binary classifier for detecting "disease".

* Naive classifier
* Logistic regression


### Metrics

We'll evaluate:
* **Precision**: how many predicted positives are actually positive
* **Recall**: how many actual positives are correctly detected
* **Accuracy**: the proportion of correct predictions (both true positives and true negatives) out of all predictions made
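
As a quick illustration of these definitions, here is a minimal sketch that computes the three metrics from confusion-matrix counts. The counts are hypothetical, chosen only to make the arithmetic easy to follow; they are not results from this notebook.

```python
# Hypothetical confusion-matrix counts (illustrative only, not from this notebook)
tp = 80    # actual "disease", predicted "disease"
fp = 40    # actual "no disease", predicted "disease"
fn = 20    # actual "disease", predicted "no disease"
tn = 1860  # actual "no disease", predicted "no disease"

precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that are detected
accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"Accuracy:  {accuracy:.3f}")
```

With these counts accuracy is much higher than precision or recall, because the large number of true negatives dominates it.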




## Project initialization

Create a project: a dedicated space where we can manage functions, artifacts and executions.

The project name includes the current username (read from the ``USER`` environment variable), so each user gets a personal project in the shared space.

In [None]:
%pip install scikit-learn matplotlib

In [None]:
import digitalhub as dh
import os

project = dh.get_or_create_project(f"my-test-project-{os.environ['USER']}")
project

## Dataset

In [None]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

In [None]:
# 1. Generate imbalanced synthetic data: 95% "no disease", 5% "disease"
X, y_num = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=5,
    n_redundant=2,
    n_clusters_per_class=1,
    weights=[0.95, 0.05],  # class 0: 95%, class 1: 5%
    flip_y=0,
    random_state=42,
)

# Map numeric labels to string labels
label_map = {0: "no disease", 1: "disease"}
y = np.vectorize(label_map.get)(y_num)

In [None]:
import matplotlib.pyplot as plt

# Count each class
classes, counts = np.unique(y, return_counts=True)

# Pie chart
plt.figure(figsize=(6, 6))
plt.pie(counts, labels=classes, autopct="%1.1f%%", startangle=90)
plt.title("Class Distribution")
plt.show()

It is good practice to track the dataset used for training. As an exercise, let's persist it in the project.

In [None]:
import pandas as pd

# Create a DataFrame with the features
dataset = pd.DataFrame(X)

# Add the label as a column
dataset["y"] = y

In [None]:
dataset.head()

In [None]:
# save the dataset in the project
#TODO


## Model training

Let's try training a model for binary classification:
* define the model
* split the dataset into training and test sets
* (train)
* evaluate the model

In [None]:
# 2. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

In [None]:
# Count the samples of each class in both splits
print(f"train set #disease: {np.sum(y_train == 'disease')}")
print(f"train set #no disease: {np.sum(y_train == 'no disease')}")

print(f"test set #disease: {np.sum(y_test == 'disease')}")
print(f"test set #no disease: {np.sum(y_test == 'no disease')}")
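
The ``stratify=y`` argument used in the split keeps the class proportions the same in both subsets, which matters when the positive class is rare. A minimal sketch on a small hypothetical label array (950 "no disease", 50 "disease") illustrates the effect; the ``*_toy`` names are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 95% "no disease", 5% "disease"
y_toy = np.array(["no disease"] * 950 + ["disease"] * 50)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Share of "disease" samples in each split
train_share = np.mean(y_tr == "disease")
test_share = np.mean(y_te == "disease")
print(f"disease share in train: {train_share:.3f}")
print(f"disease share in test:  {test_share:.3f}")
```

Without stratification, a random split of such a small minority class could leave the test set with noticeably more or fewer positives than the training set.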

In [None]:
# 3. Define and train a naive majority-class classifier
clf = DummyClassifier(strategy="most_frequent")  # always predicts majority class
clf.fit(X_train, y_train)


In [None]:
from sklearn.metrics import classification_report

# 4. Evaluate
y_pred = clf.predict(X_test)


print(f"predict set #disease: {np.sum(y_pred == 'disease')}")
print(f"predict set #no disease: {np.sum(y_pred == 'no disease')}")

train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)

print("Training accuracy:", train_acc)
print("Test accuracy:", test_acc)
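print("Predicted classes on test set:", np.unique(y_pred))
print("\nClassification Report:")
# The naive classifier never predicts "disease", so precision for that class is
# undefined; zero_division=0 reports it as 0 instead of emitting a warning
print(classification_report(y_test, y_pred, zero_division=0))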

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt


# Create confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=["no disease", "disease"])

# Plot confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["no disease", "disease"])
disp.plot(cmap=plt.cm.Blues)
plt.title("Confusion Matrix: Naive Majority Classifier")
plt.show()

print("\nConfusion Matrix (raw counts):")
print(cm)
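
The naive classifier shows why accuracy alone is misleading on imbalanced data. Here is a minimal, self-contained sketch with hypothetical labels (95 "no disease", 5 "disease") and a prediction that is always "no disease":

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical ground truth with the same 95/5 imbalance
y_true = np.array(["no disease"] * 95 + ["disease"] * 5)
# A "classifier" that always answers with the majority class
y_always_negative = np.full(y_true.shape, "no disease")

acc = accuracy_score(y_true, y_always_negative)
rec = recall_score(y_true, y_always_negative, pos_label="disease")
# No positive predictions exist, so precision is undefined; report it as 0
prec = precision_score(
    y_true, y_always_negative, pos_label="disease", zero_division=0
)

print(f"accuracy={acc:.2f} recall={rec:.2f} precision={prec:.2f}")
```

Accuracy matches the share of the majority class, while recall for "disease" is zero: the model never detects a single sick patient.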

## Logistic regression model

Let's swap the dummy classifier for a logistic regression model, which can use class weights to penalize mistakes on the minority class more heavily.
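
The weight of 19 used for "disease" in the next cell corresponds to the 95:5 class ratio. As a sketch of where that number comes from, scikit-learn's ``compute_class_weight`` with the ``"balanced"`` heuristic gives the same ratio on a hypothetical label array with that imbalance (``y_toy`` is made up for this example):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with a 95/5 imbalance
y_toy = np.array(["no disease"] * 9500 + ["disease"] * 500)

classes = np.unique(y_toy)  # sorted alphabetically: "disease", "no disease"
# "balanced" weights: n_samples / (n_classes * count_of_class)
weights = compute_class_weight("balanced", classes=classes, y=y_toy)
balanced = dict(zip(classes, weights))

ratio = balanced["disease"] / balanced["no disease"]
print(balanced)
print(f"disease / no disease weight ratio: {ratio:.1f}")
```

Passing ``class_weight="balanced"`` to ``LogisticRegression`` applies this heuristic automatically, as an alternative to writing the dictionary by hand.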

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
import matplotlib.pyplot as plt

# Define and train logistic regression with class weights to emphasize minority class
clf = LogisticRegression(class_weight={"no disease": 1, "disease": 19}, max_iter=1000, random_state=42)
clf.fit(X_train, y_train)

# Evaluate
y_pred = clf.predict(X_test)


print(f"predict set #disease: {np.sum(y_pred == 'disease')}")
print(f"predict set #no disease: {np.sum(y_pred == 'no disease')}")

train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)

print("Training accuracy:", train_acc)
print("Test accuracy:", test_acc)
print("\nClassification Report:")
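# Use labels= (not target_names=) to set the row order: string labels are sorted
# alphabetically, so target_names=["no disease", "disease"] would swap the rows
print(classification_report(y_test, y_pred, labels=["no disease", "disease"]))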

In [None]:
# Confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=["no disease", "disease"])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["no disease", "disease"])
disp.plot(cmap=plt.cm.Blues)
plt.title("Confusion Matrix: Logistic Regression with Class Weights")
plt.show()

Try changing class weights and evaluate the results.
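
One possible way to run this experiment is sketched below. It is self-contained and uses a smaller synthetic dataset with the same 95/5 imbalance; the weight values, the ``*_demo`` names and the ``results`` dictionary are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Smaller synthetic dataset with the same 95/5 imbalance (illustrative only)
X_demo, y_num_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=5, n_redundant=2,
    n_clusters_per_class=1, weights=[0.95, 0.05], flip_y=0, random_state=0,
)
y_demo = np.where(y_num_demo == 1, "disease", "no disease")
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Fit one model per candidate weight and record precision/recall for "disease"
results = {}
for w in [1, 5, 19, 50]:
    model = LogisticRegression(
        class_weight={"no disease": 1, "disease": w}, max_iter=1000
    )
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[w] = (
        precision_score(y_te, pred, pos_label="disease", zero_division=0),
        recall_score(y_te, pred, pos_label="disease"),
    )

for w, (prec, rec) in results.items():
    print(f"disease weight {w:>2}: precision={prec:.2f} recall={rec:.2f}")
```

Typically, a larger weight on "disease" raises recall at the cost of precision; which trade-off is acceptable depends on the relative cost of missed cases versus false alarms.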