# Python Practice: Blood Cell Classification System

**Objective:** Build a simplified system that can extract features from blood cell images and use those features to train a machine learning classifier capable of distinguishing among different cell types.

**General Instructions:**

- Open this notebook in Google Colab.
- Run the code cells in order. This project is sequential.
- Discuss the questions in each section with your group.
- Modify the code in the exercises to experiment and understand the concepts.


## Part 0: Setup and Data Loading
**Brief Explanation:**
Let’s begin by installing and importing the required libraries. We will use the `BloodMNIST` dataset from the MedMNIST suite, which contains images of 8 different blood cell types. The code below will download the data and prepare the images and labels for use.

**How the Code Works:**

1. Installation: `pip install medmnist` installs the library that simplifies dataset downloading.
2. Imports: Imports all libraries we will use throughout this practice.
3. Download and Loading:
- `BloodMNIST(split='train', download=True)`: Downloads the dataset (if not already present) and loads it.
- We extract images (`images`) and labels (`labels`).
- `info['label']`: Contains the mapping from numeric labels (0-7) to cell class names. The keys are strings, which is why the code looks up `class_names[str(i)]`.
4. Visualization: A `plot_many_images` function is defined to display a batch of sample images from different classes.


In [None]:
# --- 0.1 Install medmnist library ---
!pip install -q medmnist

In [None]:
# --- 0.2 Required Imports ---
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cv2
from scipy import ndimage
from medmnist import BloodMNIST
from skimage import (img_as_float, img_as_ubyte, data, util, exposure, filters,
                     morphology, measure, color, transform, feature)
from skimage.color import rgb2gray, rgb2hsv, hsv2rgb, label2rgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

# --- 0.3 Helper Plotting Function ---
def plot_many_images(images, titles, rows=3, cols=4, figsize=(15,10)):
    fig, axes = plt.subplots(rows, cols, figsize=figsize, squeeze=False)
    axes_flat = axes.ravel()
    for i in range(len(axes_flat)):
        if i < len(images):
            axes_flat[i].imshow(images[i], cmap='gray' if images[i].ndim==2 else None)
            axes_flat[i].set_title(titles[i])
            axes_flat[i].axis('off')
        else:
            axes_flat[i].axis('off')
    plt.tight_layout()
    plt.show()

# --- 0.4 Load Dataset ---
print("Downloading and loading BloodMNIST dataset...")
try:
    dataset = BloodMNIST(split='train', download=True)
    images = dataset.imgs
    labels = dataset.labels.flatten()
    class_names = dataset.info['label']
    print("Dataset loaded successfully!")
    print("Class Mapping:", class_names)

    # Show some sample images
    sample_indices = [np.where(labels == i)[0][0] for i in range(len(class_names))]
    sample_images = images[sample_indices]
    sample_titles = [f"Class {i}: {class_names[str(i)]}" for i in range(len(class_names))]
    plot_many_images(sample_images, sample_titles, rows=2, cols=4)

except Exception as e:
    print(f"Error downloading/loading dataset: {e}")
    print("The practice cannot continue without data.")

**Interpreting the Results (Part 0):**

- The cell should download the dataset and print the class mapping (e.g., '0': 'basophil', '1': 'eosinophil', etc.).
- A grid with 8 images will be displayed, showing one example of each blood cell type. Observe their visual differences (size, shape, nucleus color, etc.).
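
Before moving on, it can be useful to check how many images each class contributes, since class imbalance affects how accuracy should be read later. The following is a minimal sketch: `demo_labels` and `demo_class_names` are hypothetical stand-ins for the notebook's `labels` and `class_names`, with made-up values.

```python
import numpy as np

# Hypothetical stand-ins for `labels` and `class_names` (made-up values).
demo_labels = np.array([0, 1, 1, 0, 1, 1])
demo_class_names = {'0': 'basophil', '1': 'eosinophil'}

# np.bincount counts how often each integer label occurs.
demo_counts = np.bincount(demo_labels, minlength=len(demo_class_names))

for class_id, count in enumerate(demo_counts):
    # The mapping is keyed by strings, so the integer id must be converted.
    print(f"{demo_class_names[str(class_id)]}: {count} images")
```

In the notebook, the same two calls should work on the real `labels` array and `class_names` dictionary.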


## Part 1: Image Fundamentals and Simple Preprocessing


**Brief Explanation:**
Every analysis starts with understanding the data. We will analyze basic image properties and apply preprocessing techniques (enhancement and noise reduction) to prepare images for segmentation.

**How the Code Works:**

1. Image Analysis: We select an example image and check its dimensions, data type, and bit depth.
1. Grayscale Conversion: We convert the RGB image to grayscale (`rgb2gray`), since many segmentation and morphology techniques operate on single-channel images.
1. Enhancement: We apply `exposure.equalize_hist` to improve grayscale contrast.
1. Restoration: We simulate adding Gaussian noise and apply a median filter to remove it, showing why noise handling matters.
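
To make the enhancement step less of a black box, here is a minimal NumPy sketch of the idea behind histogram equalization: each intensity is remapped through the image's normalized cumulative histogram. The function name `equalize_sketch` and the synthetic image are assumptions made for illustration; this is not a replacement for `exposure.equalize_hist`.

```python
import numpy as np

def equalize_sketch(img, nbins=256):
    """Map intensities through the normalized cumulative histogram (CDF)."""
    hist, bin_edges = np.histogram(img.ravel(), bins=nbins)
    cdf = hist.cumsum() / hist.sum()                    # rises from ~0 to 1
    bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    # Each pixel is replaced by the CDF value at its intensity.
    return np.interp(img.ravel(), bin_centers, cdf).reshape(img.shape)

# Synthetic low-contrast image: intensities squeezed into [0.4, 0.6].
rng = np.random.default_rng(0)
low_contrast = 0.4 + 0.2 * rng.random((28, 28))
equalized = equalize_sketch(low_contrast)

print(f"Range before: {low_contrast.min():.2f} to {low_contrast.max():.2f}")
print(f"Range after:  {equalized.min():.2f} to {equalized.max():.2f}")
```

The narrow input range is stretched to cover almost the full [0, 1] interval, which is why nucleus and cytoplasm can become easier to tell apart after equalization.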


In [None]:
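# --- 1.1 Basic Analysis and Preprocessing ---

# Extended version of the Part 0 helper. Redefining plot_many_images here adds
# optional per-image colormaps (cmaps) and a figure title (main_title).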

def plot_many_images(images, titles, rows=1, cols=None, cmaps=None, figsize=(15,10), main_title=None):
    """
    Plots a list of images in a grid.

    Args:
        images (list): List of images (NumPy arrays).
        titles (list): List of titles for each image.
        rows (int): Number of rows in subplot grid.
        cols (int): Number of columns in grid. If None, calculated automatically.
        cmaps (list or str, optional): List of colormaps. If a string, it is used for all.
                                      If None, matplotlib default is used.
        figsize (tuple): Figure size.
        main_title (str, optional): Optional main title for the figure.
    """
    num_images = len(images)
    if cols is None:
        # Compute the number of columns needed to fit all images
        cols = (num_images + rows - 1) // rows

    # squeeze=False ensures 'axes' is always a 2D array, which avoids indexing errors
    fig, axes = plt.subplots(rows, cols, figsize=figsize, squeeze=False)
    axes_flat = axes.ravel() # Flatten Axes array to simplify iteration

    # Handle cmaps argument flexibly
    if cmaps is None:
        cmaps_list = [None] * num_images # Let matplotlib choose the colormap
    elif isinstance(cmaps, str):
        cmaps_list = [cmaps] * num_images # Use the same colormap for all images
    else: # Assume it is a list of cmaps
        cmaps_list = list(cmaps)
        # If cmaps list is shorter than number of images, fill the rest with None
        if len(cmaps_list) < num_images:
            cmaps_list.extend([None] * (num_images - len(cmaps_list)))

    # Iterate over axes (subplots) and plot images
    for i in range(len(axes_flat)):
        ax = axes_flat[i]
        if i < num_images:
            img = images[i]
            title = titles[i]
            cmap_val = cmaps_list[i]

            # imshow handles color images (RGB/RGBA) without cmap.
            # For 2D images (grayscale), cmap is applied.
            if img.ndim == 2:
                ax.imshow(img, cmap=cmap_val)
            else:
                ax.imshow(img) # For color images, cmap is ignored

            ax.set_title(title)
            ax.axis('off')
        else:
            ax.axis('off') # Turn off extra axes that will not be used

    if main_title:
        fig.suptitle(main_title, fontsize=16)

    # Adjust layout to avoid title overlap
    fig.tight_layout(rect=[0, 0, 1, 0.96] if main_title else None)
    plt.show()

# Select a sample image for analysis
idx_analise = 5 # Choose any index from 0 to len(images) - 1
img_rgb_exemplo = images[idx_analise]
label_exemplo = labels[idx_analise]
print(f"Analyzing Image {idx_analise} - Class: {class_names[str(label_exemplo)]}")

# Task 1.1: Fundamentals
print(f"Image dimensions: {img_rgb_exemplo.shape}")
print(f"Data type: {img_rgb_exemplo.dtype}")
print(f"Bit depth (per channel): {img_rgb_exemplo.dtype.itemsize * 8} bits")

# Conversion to Grayscale
img_cinza_exemplo = rgb2gray(img_rgb_exemplo)

# Task 1.2: Contrast Enhancement
img_eq_exemplo = exposure.equalize_hist(img_cinza_exemplo)

# Task 1.3: Noise Reduction
# Add simulated Gaussian noise, then reduce it with a median filter
img_ruidosa_exemplo = util.random_noise(img_cinza_exemplo, mode='gaussian', var=0.01)
img_denoised_exemplo = filters.median(img_as_ubyte(img_ruidosa_exemplo), footprint=morphology.disk(1))

# Step-by-step Visualization
plot_many_images(
    [img_rgb_exemplo, img_cinza_exemplo, img_eq_exemplo, img_ruidosa_exemplo, img_denoised_exemplo],
    ["Original RGB", "Grayscale", "Enhancement (Hist. Eq.)", "With Simulated Noise", "After Median Filter"],
    rows=2, cols=3, cmaps=[None, 'gray', 'gray', 'gray', 'gray'], figsize=(18,10)
)

**Interpreting the Results (Part 1):**

- Grayscale: The black-and-white version of the cell image.
- Enhancement (Hist. Eq.): The image with improved contrast. Differences between nucleus and cytoplasm may become more visible.
- With Simulated Noise / After Median Filter: Shows how noise degrades an image and how a median filter reduces it, at the cost of some fine detail at this low resolution.
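
To see why a median filter is robust to outlier pixels, the sketch below compares a 3x3 median with a 3x3 mean on a flat image containing a single bright pixel. This is an illustration under simplifying assumptions: the helper `median_filter_3x3` is hypothetical, it uses a square 3x3 window rather than the `morphology.disk(1)` footprint from the cell above, and the test image contains an isolated outlier rather than Gaussian noise.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def median_filter_3x3(img):
    """Replace each pixel by the median of its 3x3 neighborhood."""
    padded = np.pad(img, 1, mode='edge')             # repeat border pixels
    windows = sliding_window_view(padded, (3, 3))    # shape (H, W, 3, 3)
    return np.median(windows, axis=(2, 3))

# Flat gray image with one bright outlier pixel.
demo_img = np.full((7, 7), 0.5)
demo_img[3, 3] = 1.0

median_out = median_filter_3x3(demo_img)
# 3x3 mean filter for comparison, built from the same windows.
mean_out = sliding_window_view(np.pad(demo_img, 1, mode='edge'), (3, 3)).mean(axis=(2, 3))

print("Median at outlier:", median_out[3, 3])
print("Mean at outlier:  ", round(float(mean_out[3, 3]), 3))
```

The median discards the outlier entirely, while the mean spreads part of it over the neighborhood. The same contrast is worth keeping in mind when comparing `filters.median` with `filters.gaussian`.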


**Exercise (Part 1):**

1. Fundamentals: If the images were 14×14 pixels instead of 28×28, how would that affect your ability to visually distinguish different cell types? Discuss the importance of spatial resolution.
2. Enhancement: In this example, we used histogram equalization. Try applying a gamma transform (`exposure.adjust_gamma`) to `img_cinza_exemplo`. Would `γ<1` or `γ>1` better highlight the cell nucleus?
3. Restoration: Replace `filters.median` with `filters.gaussian`. Which filter seems to preserve cell boundaries better while removing noise? Why?


## Part 2: Cell Segmentation (Ch. 10)
**Brief Explanation:**
The most critical step is segmentation, which isolates the target cell from the image background. We will use a combination of techniques: thresholding to create an initial mask and mathematical morphology to clean and refine that mask.

**How the Code Works:**

1. Otsu Thresholding (Ch. 10.3): `filters.threshold_otsu()` is applied to the grayscale image (after slight Gaussian blur to reduce noise) to find an optimal global threshold separating cell (usually darker) from background. This creates an initial binary mask.
2. Morphology for Cleaning (Ch. 9):
- `remove_small_objects()`: Removes small white "noise" objects captured by thresholding.
- `ndimage.binary_fill_holes()`: Fills holes inside the main object (e.g., brighter nucleus parts classified as background).
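
Otsu's method chooses the threshold that maximizes the variance between the two resulting classes. The sketch below implements that criterion in plain NumPy on synthetic bimodal data. The name `otsu_sketch` and the pixel values are assumptions made for illustration; in the notebook, `filters.threshold_otsu` does this job.

```python
import numpy as np

def otsu_sketch(values, nbins=256):
    """Return the threshold that maximizes between-class variance."""
    hist, bin_edges = np.histogram(values.ravel(), bins=nbins)
    centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    hist = hist.astype(float)

    # Class sizes for every candidate split point.
    w0 = np.cumsum(hist)                   # pixels in bins 0..i
    w1 = np.cumsum(hist[::-1])[::-1]       # pixels in bins i..end
    # Class means for every candidate split point.
    m0 = np.cumsum(hist * centers) / np.maximum(w0, 1e-12)
    m1 = (np.cumsum((hist * centers)[::-1]) / np.maximum(w1[::-1], 1e-12))[::-1]

    # Between-class variance for a split between bin i and bin i+1.
    variance = w0[:-1] * w1[1:] * (m0[:-1] - m1[1:]) ** 2
    return centers[np.argmax(variance)]

# Synthetic bimodal intensities: dark "cell" pixels and bright "background" pixels.
rng = np.random.default_rng(0)
dark = rng.normal(0.3, 0.03, 300)
bright = rng.normal(0.8, 0.03, 500)
demo_pixels = np.concatenate([dark, bright])

demo_threshold = otsu_sketch(demo_pixels)
demo_mask = demo_pixels < demo_threshold   # the cell is darker than the background
print(f"Threshold: {demo_threshold:.2f}, pixels below threshold: {demo_mask.sum()}")
```

The threshold lands in the gap between the two intensity clusters. Real cell images are rarely this cleanly bimodal, which is one reason the morphological cleanup steps follow.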


In [None]:
# --- 2.1 Cell Segmentation (Ch. 10) ---

# We use the grayscale image 'img_cinza_exemplo' from Part 1
print(f"Segmenting image from class: {class_names[str(label_exemplo)]}")

# A slight Gaussian blur can make thresholding more robust
img_suavizada_seg = filters.gaussian(img_cinza_exemplo, sigma=1)

# Task 2.1: Otsu Thresholding (Ch. 10)
limiar_otsu_celula = filters.threshold_otsu(img_suavizada_seg)
# The cell is darker than the background, so the mask is where image < threshold
mascara_inicial = img_suavizada_seg < limiar_otsu_celula

# Task 2.2: Morphological Cleaning (Ch. 9)
# Remove small white objects (noise)
mascara_limpa = morphology.remove_small_objects(mascara_inicial, 60) # Removes objects with fewer than 60 pixels

# Fill holes inside main object
mascara_final = ndimage.binary_fill_holes(mascara_limpa)

# Step-by-step Segmentation Visualization
plot_many_images(
    [img_cinza_exemplo, mascara_inicial, mascara_limpa, mascara_final],
    ["Original Grayscale", "Initial Otsu Mask", "Clean Mask (noise removed)", "Final Mask (filled)"],
    rows=1, cols=4, cmaps='gray'
)

**Interpreting the Results (Part 2):**

- Initial Otsu Mask: Shows the first segmentation result. It may contain small holes or spurious objects.
- Clean Mask: After removing small objects, the mask should contain less "noise".
- Final Mask: The cell object should appear as a solid white blob, ready for feature extraction.
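
The two cleanup steps can be reproduced on a toy mask to see exactly which pixels change. This is a sketch under stated assumptions: `remove_small_sketch` is a hypothetical helper that mimics the idea of `morphology.remove_small_objects` using `scipy.ndimage.label`, and the 9×9 mask is made up.

```python
import numpy as np
from scipy import ndimage

def remove_small_sketch(mask, min_size):
    """Keep only connected components with at least `min_size` pixels."""
    labeled, _ = ndimage.label(mask)          # 0 = background, 1..N = objects
    sizes = np.bincount(labeled.ravel())      # pixel count per label
    keep = sizes >= min_size
    keep[0] = False                           # never keep the background
    return keep[labeled]

# Toy mask: a 5x5 block with a one-pixel hole, plus one stray pixel.
toy_mask = np.zeros((9, 9), dtype=bool)
toy_mask[1:6, 1:6] = True
toy_mask[3, 3] = False        # hole inside the main object
toy_mask[7, 7] = True         # isolated "noise" pixel

toy_cleaned = remove_small_sketch(toy_mask, min_size=5)
toy_filled = ndimage.binary_fill_holes(toy_cleaned)

print("Pixels: initial", toy_mask.sum(),
      "| cleaned", toy_cleaned.sum(),
      "| filled", toy_filled.sum())
```

Removing small objects drops the stray pixel, and hole filling restores the missing pixel inside the block. The order matters: filling first would not change the stray pixel, but cleaning with too large a `min_size` could remove the cell itself.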


**Exercise (Part 2):**

- Thresholding (Ch. 10): Try using `filters.threshold_local` instead of `threshold_otsu`. Does segmentation improve or worsen for this image? Why?
- Morphology (Ch. 9): In the `morphology.remove_small_objects` step, change `60` to `10` and then `200`. How does this affect the clean mask? What does this parameter control?


## Part 3: Feature Extraction
**Brief Explanation:**
Now that we have isolated the cell with a mask, we can describe it quantitatively by extracting features (descriptors). We will extract shape, color, and texture features.

**How the Code Works:**

1. `measure.label` and `measure.regionprops`: We use the final mask to label object(s). `regionprops` then computes several properties for each labeled region.
1. Shape Descriptors: We extract area, perimeter, eccentricity (how elongated the shape is), solidity (how "solid" vs. irregular the shape is), and Hu moments (a set of 7 values invariant to translation, scale, and rotation that describe shape).
1. Color Descriptors: We use the mask to select cell pixels in the original RGB image and compute mean of each color channel (R, G, B).
1. Texture Descriptors: We use the mask on the grayscale image to compute mean and standard deviation of cell pixel intensities.
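
As an illustration of what a shape descriptor measures, the sketch below computes eccentricity from the second moments of the mask's pixel coordinates. The helper `eccentricity_sketch` and the two toy masks are assumptions made for this example; in the notebook, `regionprops` provides the value directly.

```python
import numpy as np

def eccentricity_sketch(mask):
    """Eccentricity of the ellipse with the same second moments as the mask."""
    rows, cols = np.nonzero(mask)
    coords = np.stack([rows, cols]).astype(float)
    cov = np.cov(coords)                       # 2x2 covariance of pixel coordinates
    small, large = np.linalg.eigvalsh(cov)     # eigenvalues in ascending order
    return np.sqrt(1 - small / large)

# 12x12 block: no preferred direction.
square_mask = np.zeros((28, 28), dtype=bool)
square_mask[8:20, 8:20] = True

# 4x24 block: strongly elongated.
bar_mask = np.zeros((28, 28), dtype=bool)
bar_mask[12:16, 2:26] = True

ecc_square = eccentricity_sketch(square_mask)
ecc_bar = eccentricity_sketch(bar_mask)
print(f"Square: {ecc_square:.3f}, bar: {ecc_bar:.3f}")
```

Values near 0 indicate a compact, roughly circular shape, and values near 1 indicate an elongated one.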


In [None]:
# --- 3.1 Feature Extraction (Ch. 11) ---

# Use mascara_final from previous step and original image (RGB and grayscale)

# Label region in final mask for regionprops
labels_celula = measure.label(mascara_final)

# Extract region properties (assuming only 1 main object)
if labels_celula.max() == 0: # If segmentation fails and finds no object
    print("No object found in segmentation. Skipping feature extraction.")
    features_dict = None
else:
    propriedades = measure.regionprops(labels_celula, intensity_image=img_cinza_exemplo)
    # regionprops returns regions in label order, not by size, so pick the largest explicitly
    prop_obj_principal = max(propriedades, key=lambda p: p.area)

    # --- Shape Descriptors ---
    area = prop_obj_principal.area
    perimetro = prop_obj_principal.perimeter
    excentricidade = prop_obj_principal.eccentricity
    solidez = prop_obj_principal.solidity

    # Hu moments (using cv2)
    # Image for moments must be uint8
    mascara_ubyte_hu = img_as_ubyte(mascara_final)
    momentos_espaciais = cv2.moments(mascara_ubyte_hu)
    momentos_hu = cv2.HuMoments(momentos_espaciais).flatten()

    # --- Color Descriptors ---
    pixels_celula_rgb = img_rgb_exemplo[mascara_final]
    media_R, media_G, media_B = np.mean(pixels_celula_rgb, axis=0)

    # --- Texture Descriptors ---
    pixels_celula_cinza = img_cinza_exemplo[mascara_final]
    media_intensidade = np.mean(pixels_celula_cinza)
    std_intensidade = np.std(pixels_celula_cinza)

    # Compile all features into a dictionary
    features_dict = {
        'area': area,
        'perimetro': perimetro,
        'excentricidade': excentricidade,
        'solidez': solidez,
        'media_R': media_R,
        'media_G': media_G,
        'media_B': media_B,
        'media_intensidade': media_intensidade,
        'std_intensidade': std_intensidade,
    }
    # Add Hu moments
    for i, hu_val in enumerate(momentos_hu):
        features_dict[f'hu_{i}'] = hu_val

    print("\n--- Extracted Features from Segmented Cell ---")
    for nome, valor in features_dict.items():
        print(f"  {nome}: {valor:.4f}")

# Visualize segmentation overlay on original image
segmentacao_sobreposta = color.label2rgb(labels_celula, image=img_rgb_exemplo, bg_label=0, alpha=0.3)
plot_many_images(
    [img_rgb_exemplo, mascara_final, segmentacao_sobreposta],
    ["Original", "Final Mask", "Overlay Segmentation"],
    1, 3, cmaps=[None, 'gray', None], figsize=(15,5)
)

**Interpreting the Results (Part 3):**

- The printed output shows descriptor values for the example cell. These numbers are what the computer "sees."
- The "Overlay Segmentation" visualization helps confirm whether the final mask truly isolated the cell correctly.
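
One thing to watch for in the printed output: Hu moments typically span many orders of magnitude, so the higher-order ones can show up as `0.0000` when formatted with four decimals. A common option (an assumption here, not part of the original pipeline) is a signed log transform. The sketch below uses made-up values and a hypothetical helper `log_scale_hu`.

```python
import numpy as np

def log_scale_hu(hu):
    """Signed log10 transform: keeps the sign, compresses the magnitude."""
    hu = np.asarray(hu, dtype=float)
    out = np.zeros_like(hu)
    nonzero = hu != 0                      # log10(0) is undefined, so zeros stay 0
    out[nonzero] = -np.sign(hu[nonzero]) * np.log10(np.abs(hu[nonzero]))
    return out

# Made-up values with a wide spread of magnitudes (not from a real cell).
hu_demo = np.array([1.6e-1, 3.2e-4, -4.5e-7, 0.0])
hu_scaled = log_scale_hu(hu_demo)

print("Four decimals:", [f"{v:.4f}" for v in hu_demo])
print("Log-scaled:   ", np.round(hu_scaled, 2))
```

After the transform, all moments fall in a comparable numeric range, which makes them easier to inspect.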

**Exercise (Part 3):**

1. Descriptors (Ch. 11): Looking at dataset cell types (basophil, eosinophil, lymphocyte, etc.), which features (area, eccentricity, color, texture) do you think would be most useful for distinguishing a "neutrophil" from a "lymphocyte"? (This may require a quick review of cell morphology).
1. The Fourier Transform is mentioned in Chapter 4. How could Fourier Descriptors (mentioned in Ch. 11) be used here? What would they describe? (Hint: they describe the cell contour in the frequency domain).


## Part 4: Object Classification
**Brief Explanation:**
We will now use the extracted features to train a machine learning classifier. The process is:

1. Process a subset of images, applying segmentation and feature extraction to each one.
1. Split our feature dataset into train and test sets.
1. Train a classifier (we will use a simple MLP neural network) on the training data.
1. Evaluate classifier performance on test data.


**How the Code Works:**

1. Feature Extraction Loop:
- Iterates over `num_amostras_total` images from the dataset.
- For each image, applies the full segmentation and feature extraction pipeline we built.
- Stores feature dictionary and corresponding label in a list.
2. DataFrame Creation: List of dictionaries is converted into a Pandas DataFrame, a convenient tabular data structure.
3. Training Preparation:
- `X` contains features, `y` contains labels.
- `train_test_split` divides `X` and `y` into train and test sets.
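- `StandardScaler` standardizes each feature to zero mean and unit variance. This matters for neural networks because features on very different scales (e.g., area vs. Hu moments) make gradient-based training converge poorly. The scaler is fitted on the training set only and then applied to the test set.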
4. Training and Evaluation:
- `MLPClassifier(...)`: Defines neural network architecture.
- `.fit(X_train, y_train)`: Trains classifier.
- `.predict(X_test)`: Predicts on test set.
- `accuracy_score` and `confusion_matrix`: Evaluate model performance.
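
The scaling step can be written out by hand to show why the scaler is fitted on the training set only. The sketch below uses made-up data for two features on very different scales; all `demo_*` names are hypothetical and chosen so they do not overwrite notebook variables.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two made-up features on very different scales (think area vs. eccentricity).
demo_train = np.column_stack([rng.normal(300, 50, 70), rng.normal(0.5, 0.1, 70)])
demo_test = np.column_stack([rng.normal(300, 50, 30), rng.normal(0.5, 0.1, 30)])

# "fit": the statistics come from the training set only.
train_mean = demo_train.mean(axis=0)
train_std = demo_train.std(axis=0)

# "transform": the same statistics are applied to both sets.
demo_train_scaled = (demo_train - train_mean) / train_std
demo_test_scaled = (demo_test - train_mean) / train_std

print("Train mean:", demo_train_scaled.mean(axis=0).round(3))
print("Train std: ", demo_train_scaled.std(axis=0).round(3))
print("Test mean: ", demo_test_scaled.mean(axis=0).round(3))
```

The scaled training set has zero mean and unit standard deviation by construction. The test set is only approximately centered, which is expected: computing statistics from test data would leak information about it into the model.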


In [None]:
# --- 4.1 Object Classification (Ch. 12) ---
print("\n--- Part 4: Object Classification ---")

# Function to encapsulate segmentation and extraction pipeline for one image
def extrair_features_de_imagem(img_rgb):
    try:
        # Preprocessing and Segmentation
        img_cinza = rgb2gray(img_rgb)
        img_suavizada = filters.gaussian(img_cinza, sigma=1)
        limiar_otsu = filters.threshold_otsu(img_suavizada)
        mascara_inicial = img_suavizada < limiar_otsu
        mascara_limpa = morphology.remove_small_objects(mascara_inicial, 60)
        mascara_final = ndimage.binary_fill_holes(mascara_limpa)

        # Feature Extraction
        labels = measure.label(mascara_final)
        if labels.max() == 0: return None # Segmentation failed

        # Find largest object if there is more than one
        props = measure.regionprops(labels, intensity_image=img_cinza)
        maior_obj_idx = np.argmax([p.area for p in props])
        prop_obj_principal = props[maior_obj_idx]

        # Descriptors
        area = prop_obj_principal.area
        perimetro = prop_obj_principal.perimeter
        excentricidade = prop_obj_principal.eccentricity
        solidez = prop_obj_principal.solidity

        # Use the region's own label instead of relying on list position
        mascara_obj_principal = labels == prop_obj_principal.label

        pixels_celula_rgb = img_rgb[mascara_obj_principal]
        media_R, media_G, media_B = np.mean(pixels_celula_rgb, axis=0) if pixels_celula_rgb.size > 0 else (0,0,0)

        pixels_celula_cinza = img_cinza[mascara_obj_principal]
        media_intensidade = np.mean(pixels_celula_cinza) if pixels_celula_cinza.size > 0 else 0
        std_intensidade = np.std(pixels_celula_cinza) if pixels_celula_cinza.size > 0 else 0

        momentos_espaciais = cv2.moments(img_as_ubyte(mascara_obj_principal))
        momentos_hu = cv2.HuMoments(momentos_espaciais).flatten()

        features_dict = {
            'area': area, 'perimetro': perimetro, 'excentricidade': excentricidade,
            'solidez': solidez, 'media_R': media_R, 'media_G': media_G,
            'media_B': media_B, 'media_intensidade': media_intensidade, 'std_intensidade': std_intensidade,
        }
        for i, hu_val in enumerate(momentos_hu):
            features_dict[f'hu_{i}'] = hu_val

        return features_dict

    except Exception as e:
        # print(f"Error processing an image: {e}")
        return None

# Process a subset of images to keep the runtime short
num_amostras_total = 1000 # Reduce (e.g., to 500) if extraction takes too long
all_features = []
all_labels = []

print(f"Extracting features from {num_amostras_total} images. This may take a minute...")
for i in range(min(num_amostras_total, len(images))):
    img_atual = images[i]
    label_atual = labels[i]

    features = extrair_features_de_imagem(img_atual)
    if features is not None:
        all_features.append(features)
        all_labels.append(label_atual)

# Create a Pandas DataFrame
df_features = pd.DataFrame(all_features)
df_features['label'] = all_labels

print(f"Processing completed. {len(df_features)} cells were segmented and had their features extracted.")
print("Feature DataFrame sample:")
print(df_features.head())

# --- Classifier Training and Evaluation ---
if not df_features.empty:
    X = df_features.drop('label', axis=1) # All columns except label
    y = df_features['label']

    # Split into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

    # Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Train an MLP classifier (Neural Network)
    print("\nTraining Neural Network (MLPClassifier)...")
    mlp = MLPClassifier(hidden_layer_sizes=(50, 25), max_iter=500, random_state=42, early_stopping=True, n_iter_no_change=20)
    mlp.fit(X_train_scaled, y_train)

    # Evaluate on test set
    y_pred = mlp.predict(X_test_scaled)
    acuracia = accuracy_score(y_test, y_pred)
    print(f"MLP classifier accuracy on test set: {acuracia:.4f}")

    # Confusion Matrix
    print("Generating Confusion Matrix...")
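    classes_presentes = sorted(y.unique())
    nomes_classes_str = [class_names[str(i)] for i in classes_presentes]
    # Pass labels explicitly so matrix rows/columns always match the display names
    cm = confusion_matrix(y_test, y_pred, labels=classes_presentes)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=nomes_classes_str)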

    fig, ax = plt.subplots(figsize=(10, 10))
    disp.plot(ax=ax, xticks_rotation='vertical')
    plt.title("Confusion Matrix")
    plt.show()
else:
    print("No features were extracted, classification step was skipped.")

**Interpreting the Results (Part 4):**

- DataFrame Sample: Shows a table with extracted features for the first cells.
- Accuracy: The fraction of test-set cells that were correctly classified. An accuracy of 0.85 means 85% were correct.


- Confusion Matrix: A visual table showing classifier performance in detail.
  - The main diagonal (top-left to bottom-right) shows correct classifications.
  - Off-diagonal values are errors. For example, the cell at row "basophil" and column "eosinophil" indicates how many basophils were incorrectly classified as eosinophils.
  - It helps identify which classes are hardest for the model to distinguish.
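
To connect the matrix to the accuracy figure, the sketch below builds a small confusion matrix by hand and derives overall accuracy and per-class recall from it. The labels are made up for a 3-class problem, and `confusion_matrix_sketch` is a hypothetical helper that follows the same convention as above (rows are true classes, columns are predicted classes).

```python
import numpy as np

def confusion_matrix_sketch(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(matrix, (y_true, y_pred), 1)    # add 1 per (true, predicted) pair
    return matrix

# Made-up labels for a 3-class problem (not real BloodMNIST results).
demo_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
demo_pred = np.array([0, 0, 1, 1, 1, 2, 2, 0, 2, 2])

demo_cm = confusion_matrix_sketch(demo_true, demo_pred, n_classes=3)
demo_accuracy = np.trace(demo_cm) / demo_cm.sum()          # diagonal sum / total
demo_recall = np.diag(demo_cm) / demo_cm.sum(axis=1)       # per true class

print(demo_cm)
print("Accuracy:", demo_accuracy)
print("Recall per class:", demo_recall.round(2))
```

Accuracy is the diagonal sum divided by the total, so a single number can hide a class that is recognized poorly. Per-class recall, computed row by row, makes such a class visible.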


**Final Exercise and Discussion:**

1. Accuracy and Confusion Matrix: Was the achieved accuracy good? Looking at the confusion matrix, which classes did the model struggle most to distinguish (largest off-diagonal values)?
1. Improving the System: How could you improve this system’s performance? Discuss at least three ideas based on concepts from the chapters covered, such as using more features (Ch. 11), a more robust segmentation method (Ch. 10), a different classifier or architecture (Ch. 12), or more training data.
1. Ethical Implications: Discuss ethical implications, challenges, and responsibilities of building and deploying a computer-aided diagnostic system like this in a real clinical environment. What would happen if the system incorrectly classified a malignant cell as benign, or vice versa?
