# AI4Health - 04 – Histology Image Classification

---

## Introduction

Histopathology is the study of microscopic images of tissue, and it plays a crucial role in diagnosing diseases like cancer. Pathologists examine stained tissue samples under a microscope to look for abnormal cells. This process, while highly effective, is also time-consuming and requires years of expertise. With the growing number of cases and a shortage of specialists, there is a strong need for computational tools that can assist in diagnosis.

In this notebook, you will learn how to use machine learning to classify small patches of histology images as either cancerous or non-cancerous. Instead of using advanced deep learning models (like Convolutional Neural Networks), we will start with a simple approach: flattening each image into a 1D array of pixel values and using a **Logistic Regression** model. This method is easy to understand and implement, making it a great starting point for those new to medical image analysis.

You will learn how to:
- Prepare image data for machine learning
- Train and evaluate a simple classification model
- Interpret model results and identify challenges in medical image classification

By the end of this notebook, you’ll have hands-on experience with a real-world medical dataset and a better understanding of how machine learning can support healthcare professionals.

### Learning Objectives:
- Understand the basics of histopathology images and their role in medical diagnosis.
- Learn how to load, preprocess, and visualise medical image data in Python.
- Apply logistic regression to classify image patches as cancerous or non-cancerous.
- Evaluate model performance and interpret classification results.
- Reflect on the challenges and limitations of classical machine learning for medical images.

---

## Additional Context

### What Are Histology Images?

Histology images are high-resolution scans of stained tissue samples. Staining (e.g., Hematoxylin and Eosin) highlights cellular structures to help pathologists detect abnormalities. In digital pathology, these images are divided into smaller **patches** for analysis.

We will use the **IDC Breast Histopathology dataset**, which contains 50x50-pixel patches labeled as either:
- `0`: Non-IDC (no invasive ductal carcinoma in the patch)
- `1`: Invasive Ductal Carcinoma (IDC)

### Why Use Classical Machine Learning Instead of Deep Learning?

While modern deep learning models (such as Convolutional Neural Networks, CNNs) excel at image classification, they require large datasets and significant computational resources. As a baseline, this notebook uses a simpler approach:
- **Flattening** each image into a 1D array of pixel values
- **Normalising** pixel values to a standard range (0–1)
- Training a **Logistic Regression** model

This classical approach is accessible, interpretable, and suitable for small datasets or limited hardware, helping you focus on the core machine learning workflow.
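
As a rough illustration of the first two bullets, the sketch below flattens and normalises a single synthetic patch. The random array is only a stand-in for a real 50x50 grayscale image, and the variable names are illustrative rather than taken from the later cells.

```python
import numpy

# Hypothetical stand-in for one 50x50 grayscale patch (random values, not real tissue)
rng = numpy.random.default_rng(0)
patch = rng.integers(0, 256, size=(50, 50), dtype=numpy.uint8)

# Flatten: 50 x 50 pixels become a single row of 2500 features
features = patch.flatten()

# Normalise: map 8-bit intensities from 0-255 to 0-1
features = features / 255.0

print(features.shape)  # (2500,)
print(features.min() >= 0.0, features.max() <= 1.0)
```

Each patch becomes one row of 2500 numbers, which is the tabular form a logistic regression model expects.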

### Key Concepts in Medical Image Analysis

Understanding the foundational concepts in medical image analysis is essential for building effective and trustworthy machine learning models. These concepts guide how we prepare, represent, and interpret image data, especially when working with sensitive clinical information. Below are some of the most important considerations:

- **Preprocessing**: Standardising image size, color channels (e.g., converting to grayscale), and normalisation are crucial for consistent analysis.
- **Feature Representation**: Flattening images allows classical models to process them, but may lose spatial information compared to CNNs.
- **Class Imbalance**: Medical datasets often have more normal than abnormal samples, which can bias models if not addressed.
- **Interpretability**: Simple models like logistic regression are easier to interpret, which is important for clinical trust and adoption.
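
To make the class imbalance point concrete, here is a minimal sketch of one common reweighting heuristic, computed on made-up labels. The 80/20 split is an assumption chosen for illustration; scikit-learn's `class_weight='balanced'` option on `LogisticRegression` applies the same formula.

```python
import numpy

# Hypothetical labels: 80 patches of class 0 and 20 of class 1
y_demo = numpy.array([0] * 80 + [1] * 20)

counts = numpy.bincount(y_demo)  # samples per class
n_samples, n_classes = len(y_demo), len(counts)

# "Balanced" heuristic: weight each class inversely to its frequency
class_weights = n_samples / (n_classes * counts)

for label, weight in enumerate(class_weights):
    print(f"class {label}: count={counts[label]}, weight={weight:.3f}")
```

With these counts the rarer class receives four times the weight of the common one, so its errors count for more during training.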

### Challenges in Medical Image Classification

Medical image classification presents unique challenges that stem from both the complexity of biological tissues and the technical aspects of image acquisition. Recognising these challenges helps in designing robust models and understanding their limitations:

- **Subtle Differences**: Cancerous and non-cancerous tissue may look very similar, especially in small patches.
- **Image Artifacts**: Variability in staining, scanning, or tissue preparation can introduce noise.
- **Data Volume**: Large whole-slide images are often split into thousands of patches, creating challenges for storage and computation.
- **Label Quality**: Accurate labeling requires expert pathologists, and errors can impact model performance.
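
The data volume point can be sketched with a toy example that cuts one image into non-overlapping patches. The function name `extract_patches` and the tiny 200x170 "slide" are hypothetical; real whole-slide images are far larger and are usually read with dedicated tools.

```python
import numpy

def extract_patches(image, patch_size=50):
    """Cut a 2D image into non-overlapping square patches, dropping partial edges."""
    height, width = image.shape[:2]
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patches.append(image[top:top + patch_size, left:left + patch_size])
    return numpy.array(patches)

# Hypothetical "slide": 200 x 170 pixels of zeros, far smaller than a real scan
slide = numpy.zeros((200, 170), dtype=numpy.uint8)
patches = extract_patches(slide)
print(patches.shape)  # (12, 50, 50)
```

Even this small array yields 12 patches, which hints at why real slides produce thousands.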

### Clinical and Ethical Considerations

Applying machine learning to medical images requires careful attention to clinical and ethical issues to ensure models are safe, fair, and trustworthy. Clinicians must be able to understand and trust predictions, so transparency is essential. Rigorous validation on independent data is needed before clinical use. Datasets should represent diverse populations to prevent bias, and sensitive medical images must be handled securely to protect privacy.

---

## Related Guides

- *MatPlotLib - Pyplot:* https://matplotlib.org/stable/tutorials/pyplot.html
- *SciKit-Learn - Confusion Matrix:* https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix
- *SciKit-Learn - Cross Validation (train/test split):* https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation
- *SciKit-Learn - Tuning the hyperparameters (grid search):* https://scikit-learn.org/stable/modules/grid_search.html
- *SciKit-Learn - Logistic Regression:* https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

---

## Step 1: Load Required Libraries

To begin our analysis, we import the essential Python libraries for image processing, visualisation, and machine learning. These include tools for handling file paths, manipulating arrays, displaying images, and building classification models. Loading these libraries ensures we have everything needed to work with image data and apply machine learning techniques in the following steps.

In [None]:
import matplotlib.pyplot as plt
import numpy
import os
import random

from PIL import Image
from scipy.ndimage import rotate
from skimage.util import random_noise
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split, GridSearchCV

print("OK")

**Questions:**

- **1.1.** Why do we need specialised libraries for working with medical images?
- **1.2.** What roles do libraries like PIL and matplotlib play in image analysis?
- **1.3.** How does the choice of machine learning library affect the types of models you can build?

---

## Step 2: Image Handling Tips: PIL and matplotlib

This section provides practical guidance on working with medical images in Python. We demonstrate how to load, preprocess, and visualise images using the PIL and matplotlib libraries. These basic image handling techniques are essential for inspecting data, debugging preprocessing steps, and ensuring that images are correctly formatted for analysis. Mastery of these tools is a key skill for anyone working in medical image analysis.

- *Pillow - Image:* https://pillow.readthedocs.io/en/latest/reference/Image.html

In [None]:
img = Image.open("datasets/breast_histopathology_image_samples/8863/1/8863_idx5_x1001_y801_class1.png").convert("L")  # Load grayscale
img = img.resize((50, 50))  # Resize
arr = numpy.array(img) / 255.0  # Normalise

plt.imshow(arr, cmap='gray')
plt.title("Sample Image")
plt.show()
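
Alongside plotting, a quick numeric summary can help confirm that an image is formatted as expected. The helper below is a hypothetical convenience function, not part of PIL or matplotlib, and it runs on a synthetic gradient rather than a real patch.

```python
import numpy

def summarise_array(arr):
    """Return basic facts about an image array for quick sanity checks."""
    return {
        "shape": arr.shape,
        "dtype": str(arr.dtype),
        "min": float(arr.min()),
        "max": float(arr.max()),
    }

# Hypothetical patch: a horizontal gradient standing in for a loaded image
raw = numpy.tile(numpy.linspace(0, 255, 50), (50, 1)).astype(numpy.uint8)
scaled = raw / 255.0

print(summarise_array(raw))     # 8-bit values between 0 and 255
print(summarise_array(scaled))  # floating-point values between 0 and 1
```

Comparing the summary before and after normalisation makes it easy to spot an image that was scaled twice or not at all.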

**Questions:**

- **2.1.** How can visualisation help you debug preprocessing steps?
- **2.2.** Why is it important to ensure consistent image formatting before analysis?

---

## Step 3: Load the Dataset

In this step, we load a sample of histology image patches from the dataset. Each image is labeled as either cancerous or non-cancerous. We read the image files from disk, resize them for consistency, convert them to grayscale to simplify the analysis, and flatten each image into a one-dimensional array of pixel values. This transformation prepares the images for use with classical machine learning models, which expect tabular input rather than raw image files. After loading, we combine the images and their labels into arrays for further processing and print out the dataset shape to confirm everything loaded as expected.

In [None]:
# Load a small sample of images for demonstration
def load_images(base_path, label, max_images=100):
    images = []
    labels = []
    path = os.path.join(base_path, str(label))
    files = os.listdir(path)
    for fname in random.sample(files, min(len(files), max_images)):
        img = Image.open(os.path.join(path, fname)).resize((50, 50)).convert('L')
        images.append(numpy.array(img).flatten())
        labels.append(label)
    return images, labels

# Example path structure
base_path = "datasets/breast_histopathology_image_samples/8863/"
class0_imgs, class0_labels = load_images(base_path, label=0, max_images=1000)
class1_imgs, class1_labels = load_images(base_path, label=1, max_images=1000)

X = numpy.array(class0_imgs + class1_imgs)
y = numpy.array(class0_labels + class1_labels)

print(f"Dataset shape: {X.shape}")
print(f"Class distribution: {numpy.bincount(y)}")
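
The loader above draws files with `random.sample`, so each run may pick a different subset. One possible way to make the selection repeatable is to use a seeded generator, sketched here with hypothetical file names and a hypothetical helper `sample_files`.

```python
import random

# Hypothetical file names standing in for the contents of one class folder
files = [f"patch_{i:03d}.png" for i in range(20)]

def sample_files(files, max_images, seed=42):
    """Draw a repeatable random subset using a private, seeded generator."""
    rng = random.Random(seed)
    return rng.sample(files, min(len(files), max_images))

first = sample_files(files, max_images=5)
second = sample_files(files, max_images=5)
print(first == second)  # True
```

Because `os.listdir` returns names in arbitrary order, the file list would also need sorting before sampling for the subset to repeat across machines.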

**Questions:**

- **3.1.** Why do we convert images to grayscale and resize them before analysis?
- **3.2.** What are the advantages and limitations of flattening images into 1D arrays for machine learning?
- **3.3.** How can you verify that your dataset has been loaded and labeled correctly?
- **3.4.** What challenges might arise when working with real-world medical image datasets?

---

## Step 4: Visualising Sample Images

Before training a model, it is important to visualise some of the image patches from the dataset. By displaying a grid of images along with their labels, we can better understand the kinds of patterns and features present in the data. This step helps us see what the model will be learning from and can reveal challenges such as subtle differences between classes or noisy samples. Visualisation also provides an opportunity to check that the data has been loaded and preprocessed correctly.

In [None]:
def plot_random_images(n=8):
    plt.figure(figsize=(12, 6))
    for i in range(n):
        # Randomly select an index for each class
        index0 = random.randint(0, len(class0_imgs) - 1)
        index1 = random.randint(0, len(class1_imgs) - 1)
        # Select images from both classes
        if i < n/2:
            image = class0_imgs[index0]
            label = class0_labels[index0]
        else:
            image = class1_imgs[index1]
            label = class1_labels[index1]
        # Plot the image
        plt.subplot(2, n//2, i+1)
        img = image.reshape(50, 50)
        plt.imshow(img, cmap='gray')
        plt.title(f"Label: {label}")
        plt.axis("off")
    # Adjust layout and show the plot
    plt.tight_layout()
    plt.show()

plot_random_images()

**Questions:**

- **4.1.** What patterns or differences can you observe between cancerous and non-cancerous patches?
- **4.2.** How might image quality or artifacts affect model performance?
- **4.3.** What can visualisation reveal about potential data issues?

---

## Step 5: Data Augmentation

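To improve the diversity of our dataset and help the model generalise better, we can apply basic data augmentation techniques such as flipping, rotating, and adding noise to the images. The cell below applies a random rotation followed by Gaussian noise. Data augmentation is especially useful when working with small datasets, as it artificially increases the number of training samples. Note that this notebook augments before splitting the data; in practice, augmentation should be applied only to the training set, after the train/test split, so that altered copies of a training image cannot appear in the test set and inflate the reported scores.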

- *SciPy - rotate:* https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.rotate.html

In [None]:
def augment_image(image):
    # Apply random rotation
    rotated = rotate(image, angle=random.choice([90, 180, 270]), reshape=False)
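    # Add random noise; random_noise returns floats in [0, 1], so rescale to 0-255
    # to match the unaugmented images before everything is normalised in Step 6
    noisy = random_noise(rotated, mode='gaussian', var=0.01) * 255.0
    return noisy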

# Apply augmentation to a subset of images (e.g., first 10 images of class 0)
augmented_images = [augment_image(img.reshape(50, 50)).flatten() for img in class0_imgs[:10]]
augmented_labels = [0] * len(augmented_images)

# Add augmented data to the dataset
X = numpy.vstack([X, augmented_images])
y = numpy.hstack([y, augmented_labels])

print(f"Augmented dataset shape: {X.shape}")
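
The text mentions flipping, which the cell above does not apply. The sketch below shows one way flips and a 90-degree rotation could be added with NumPy alone, applied to a training-only array. The arrays `X_train_demo` and `y_train_demo` and the helper `flip_and_rotate` are hypothetical stand-ins, filled with random values.

```python
import numpy

def flip_and_rotate(flat_image, side=50):
    """Return three augmented copies of a flattened square image."""
    image = flat_image.reshape(side, side)
    variants = [numpy.fliplr(image), numpy.flipud(image), numpy.rot90(image)]
    return [variant.flatten() for variant in variants]

# Hypothetical training data: 4 flattened patches with alternating labels
rng = numpy.random.default_rng(0)
X_train_demo = rng.integers(0, 256, size=(4, 2500), dtype=numpy.uint8)
y_train_demo = numpy.array([0, 1, 0, 1])

augmented, augmented_labels = [], []
for flat, label in zip(X_train_demo, y_train_demo):
    for variant in flip_and_rotate(flat):
        augmented.append(variant)
        augmented_labels.append(label)  # an augmented copy keeps its source label

X_aug = numpy.vstack([X_train_demo, numpy.array(augmented)])
y_aug = numpy.hstack([y_train_demo, augmented_labels])
print(X_aug.shape)  # (16, 2500)
```

These transforms only rearrange pixels, so intensities stay in the original 0-255 range and both classes are augmented equally.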

**Questions:**

- **5.1.** How does data augmentation improve model performance?
- **5.2.** What are the risks of over-augmenting a dataset?
- **5.3.** How can you ensure that augmented images still represent the original data distribution?

---

## Step 6: Normalise, Split, and Train the Model

With the data prepared, we proceed to normalise the pixel values so that they fall within a standard range, which helps the model train more effectively. We then split the dataset into training and testing sets to ensure that we can fairly evaluate the model’s performance on unseen data. Next, we train a logistic regression model using the flattened image data. After training, we generate predictions on the test set and assess the model’s performance using standard classification metrics and a confusion matrix. This step demonstrates how a simple, interpretable machine learning model can be applied to image classification tasks, even without deep learning.

- *SciKit-Learn - ConfusionMatrixDisplay:* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html
- *SciKit-Learn - LogisticRegression:* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
- *SciKit-Learn - train_test_split:* https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
# Normalise and split
X = X / 255.0
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Train Logistic Regression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluation
print(classification_report(y_test, y_pred))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()
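
In a clinical setting the confusion matrix is often summarised as sensitivity and specificity. The sketch below derives both from made-up labels and predictions; the arrays are assumptions for illustration and do not reflect results from this dataset.

```python
import numpy

# Hypothetical true labels and predictions (1 = IDC, 0 = non-IDC)
y_true_demo = numpy.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = numpy.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

tp = int(numpy.sum((y_true_demo == 1) & (y_pred_demo == 1)))  # cancers found
fn = int(numpy.sum((y_true_demo == 1) & (y_pred_demo == 0)))  # cancers missed
tn = int(numpy.sum((y_true_demo == 0) & (y_pred_demo == 0)))  # correct negatives
fp = int(numpy.sum((y_true_demo == 0) & (y_pred_demo == 1)))  # false alarms

sensitivity = tp / (tp + fn)  # same as recall for class 1
specificity = tn / (tn + fp)  # same as recall for class 0

print(f"Sensitivity: {sensitivity:.2f}")
print(f"Specificity: {specificity:.2f}")
```

In the classification report, these two values correspond to the recall reported for class `1` and class `0` respectively.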


**Questions:**

- **6.1.** Why do we normalise pixel values before training a machine learning model?
- **6.2.** What are the strengths and limitations of using logistic regression for image classification?
- **6.3.** How do you interpret the classification report and confusion matrix in a clinical context?

---

## Step 7: What’s Going Wrong?

To gain deeper insight into the model’s limitations, we examine some of the images that were misclassified. By visualising these challenging cases, we can look for patterns or features that may have confused the model. This analysis helps us reflect on the difficulty of the task, especially when working with low-resolution images or subtle visual differences. Understanding where and why the model makes mistakes is crucial for improving future models and for interpreting results in a clinical context.

- *Numpy - where:* https://numpy.org/doc/stable/reference/generated/numpy.where.html

In [None]:
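def plot_sample_images(X, y, n=8):
    n = min(n, len(X))  # Guard against fewer than n images (e.g., few misclassifications)
    if n == 0:
        print("No images to plot.")
        return
    cols = (n + 1) // 2  # Enough columns for two rows, even when n is odd
    plt.figure(figsize=(12, 6))
    for i in range(n):
        plt.subplot(2, cols, i+1)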
        img = X[i].reshape(50, 50)
        plt.imshow(img, cmap='gray')
        plt.title(f"Label: {y[i]}")
        plt.axis("off")
    # Adjust layout and show the plot
    plt.tight_layout()
    plt.show()

# Identify misclassified images by comparing predictions with true labels
misclassified = numpy.where(y_pred != y_test)[0]
print(f"Misclassified images found: {len(misclassified)}")

plot_sample_images(X_test[misclassified], y_test[misclassified])
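
It can also help to look first at the errors the model was most confident about. The sketch below ranks misclassified samples by predicted probability using made-up values; in this notebook such probabilities could come from `model.predict_proba(X_test)[:, 1]`, and the 0.5 threshold is an assumption.

```python
import numpy

# Hypothetical true labels and predicted probabilities of class 1 (IDC)
y_true_demo = numpy.array([0, 1, 1, 0, 1, 0])
proba_idc = numpy.array([0.92, 0.15, 0.55, 0.40, 0.45, 0.70])

# Threshold the probabilities to obtain hard predictions
y_hat = (proba_idc >= 0.5).astype(int)

# Confidence in whichever class was predicted
confidence = numpy.where(y_hat == 1, proba_idc, 1 - proba_idc)

# Indices of the errors, sorted from most to least confident
wrong = numpy.where(y_hat != y_true_demo)[0]
ranked = wrong[numpy.argsort(-confidence[wrong])]

for index in ranked:
    print(f"sample {index}: true={y_true_demo[index]}, "
          f"predicted={y_hat[index]}, confidence={confidence[index]:.2f}")
```

Confident mistakes are often the most informative to inspect, since they may point to mislabeled patches or to patterns the model cannot represent.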

**Questions:**

- **7.1.** What patterns do you notice in the misclassified images?
- **7.2.** How might low image resolution or subtle features contribute to misclassification?
- **7.3.** Why is it important to analyse model errors in medical applications?
- **7.4.** How could you address the limitations revealed by misclassified examples?

---

## Step 8: Hyperparameter Tuning

Hyperparameter tuning is a critical step in optimising machine learning models. For logistic regression, we can adjust parameters such as the inverse regularisation strength (`C`, where smaller values mean stronger regularisation) and the solver type. This step demonstrates how to systematically search for the best combination of hyperparameters to improve model performance.

- *SciKit-Learn - GridSearchCV:* https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
- *SciKit-Learn - LogisticRegression:* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [None]:
# Define hyperparameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'solver': ['liblinear', 'lbfgs']
}

# Perform grid search
grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Display best parameters and score
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Cross-Validation Score: {grid_search.best_score_}")
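
The cross-validation score is computed on the training data only, so a natural follow-up is to score the refitted best model on the held-out test set. The sketch below does this on synthetic data from `make_classification`; the dataset, the variable names, and the reduced grid are assumptions chosen so that the example runs quickly on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the flattened image features
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=42
)

# Search over C only; the test split is never seen during the search
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# best_estimator_ is the model refitted on all training data with the best C
test_accuracy = search.best_estimator_.score(X_te, y_te)
print(f"Best C: {search.best_params_['C']}")
print(f"Cross-validation accuracy: {search.best_score_:.3f}")
print(f"Held-out test accuracy: {test_accuracy:.3f}")
```

A large gap between the cross-validation and test accuracy would suggest that the tuned model does not generalise as well as the search implied.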

**Questions:**

- **8.1.** Why is cross-validation important during hyperparameter tuning?
- **8.2.** How do you decide which hyperparameters to tune?
- **8.3.** What are the trade-offs between different solvers in logistic regression?

---

## Step 9: Feature Importance Analysis

Understanding which features (pixels) contribute most to the model's predictions can provide valuable insights. In logistic regression, each pixel has one coefficient: its sign shows which class brighter values of that pixel push the prediction towards, and its magnitude shows how strongly, given that all pixels share the same 0–1 scale. By visualising these coefficients, we can identify which regions of the image are most influential in distinguishing between classes.

In [None]:
# Reshape coefficients to match image dimensions
coefficients = model.coef_.reshape(50, 50)

# Visualise feature importance
plt.imshow(coefficients, cmap='coolwarm', interpolation='nearest')
plt.colorbar(label="Coefficient value (positive favours class 1)")
plt.title("Logistic Regression Coefficients")
plt.show()
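
Beyond the heat map, the most influential pixels can be listed by absolute coefficient. The sketch below uses a random array as a hypothetical stand-in for `model.coef_`, with two large weights planted on purpose so that the result is easy to check.

```python
import numpy

# Hypothetical stand-in for model.coef_ with shape (1, 2500)
rng = numpy.random.default_rng(0)
coef = rng.normal(size=(1, 2500))
coef[0, 1275] = 9.0   # planted strong positive weight at row 25, column 25
coef[0, 60] = -8.0    # planted strong negative weight at row 1, column 10

weights = coef.reshape(50, 50)

# Rank pixels by absolute weight, largest first, and keep the top two
top = numpy.argsort(numpy.abs(weights).ravel())[::-1][:2]
rows, cols = numpy.unravel_index(top, weights.shape)

for row, col in zip(rows, cols):
    print(f"pixel ({row}, {col}): weight={weights[row, col]:+.2f}")

# A symmetric colour limit keeps zero at the centre of a diverging colormap
limit = float(numpy.abs(weights).max())
print(f"Suggested colour range: -{limit:.1f} to +{limit:.1f}")
```

Passing `vmin=-limit` and `vmax=limit` to `plt.imshow` would centre the `coolwarm` colormap on zero, so positive and negative weights are directly comparable by colour.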

**Questions:**

- **9.1.** What do the most important features reveal about the dataset?
- **9.2.** How can feature importance analysis help improve model performance?
- **9.3.** What are the limitations of interpreting feature importance in logistic regression?

---

## Step 10: Summary and Reflection

In this notebook, you completed the full workflow of histology image classification using classical machine learning methods. You began by loading and preprocessing digital pathology images, converting them to grayscale and flattening them into feature vectors suitable for traditional models. By visualising sample images, you gained insight into the types of patterns present in the data and the challenges of distinguishing between cancerous and non-cancerous tissue.

You then augmented and normalised the data, split it into training and testing sets, and trained a logistic regression classifier. Model performance was evaluated using standard metrics and confusion matrices, allowing you to identify both strengths and weaknesses in the predictions. By examining misclassified images, you reflected on the limitations of simple models when faced with subtle or complex visual differences. Finally, you tuned the model's hyperparameters with a grid search and visualised its coefficients to see which pixels influence predictions.

Throughout the process, you also practiced essential image handling techniques in Python, building a foundation for more advanced work in medical image analysis. This exercise highlighted the importance of careful data preparation, visualisation, and critical evaluation of results, especially in a clinical context where accuracy and interpretability are crucial.

### Summary

- Classical machine learning can be applied to medical image classification by flattening images into feature vectors.
- Visualisation and normalisation are key steps in preparing image data for analysis.
- Logistic regression provides a simple, interpretable baseline for image classification tasks.
- Examining misclassified examples helps reveal model limitations and guides future improvements.

### What's next?

- **10.1.** How could deep learning methods improve performance on this task?
- **10.2.** What additional preprocessing or feature engineering might help classical models?
- **10.3.** How can explainable AI techniques make image-based predictions more trustworthy for clinicians?
- **10.4.** What are the challenges of scaling this approach to larger, more complex medical image datasets?

---

## Explore Further

### Datasets

- [Breast Histopathology Images (IDC)](https://www.kaggle.com/paultimothymooney/breast-histopathology-images)

### Articles

- **Machine Learning Methods for Histopathological Image Analysis**
<br>*Computational and Structural Biotechnology Journal*
  - https://www.csbj.org/article/S2001-0370(17)30086-7/fulltext

- **Machine learning methods for histopathological image analysis: Updates in 2024**
<br>*Computational and Structural Biotechnology Journal*
  - https://www.csbj.org/article/S2001-0370(24)00454-9/fulltext

- **A survey on artificial intelligence in histopathology image analysis**
<br>*WIREs Data Mining and Knowledge Discovery*
  - https://wires.onlinelibrary.wiley.com/doi/full/10.1002/widm.1474

- **Conventional Machine Learning versus Deep Learning for Magnification Dependent Histopathological Breast Cancer Image Classification: A Comparative Study with Visual Explanation**
<br>*Diagnostics (Basel)*
  - https://pmc.ncbi.nlm.nih.gov/articles/PMC8001768/