
# Histopathology Pipeline Prototype (2024 biopsy matches)

This notebook explores the 2024 biopsy matching dataset and prepares it for training a histopathology classifier. The focus is on turning the raw diagnostic reports into machine-readable labels and outlining a patch-extraction workflow for `.svs` whole-slide images (WSI).



## Objectives

1. Inspect and clean the biopsy metadata exported from Excel/CSV.
2. Consolidate heterogeneous diagnostic texts into consistent disease labels.
3. Define a modelling target that is feasible with the available annotations.
4. Produce stratified train/validation/test splits for downstream modelling.
5. Sketch a patch extraction pipeline for `.svs` slides, including sanity checks for missing files and simple tissue filtering heuristics.

> **Note:** The repository currently only ships the metadata CSV. The actual WSI `.svs` files need to be mounted separately before running the patch extractor cells.


In [None]:

from __future__ import annotations

import json
import re
from collections import Counter
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, List, Optional

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# The Korean part of the file name reads "Biopsy result matching (2024)".
DATA_PATH = Path('Data') / '조직검사 결과 매칭(2024)_utf8_pruned.csv'
assert DATA_PATH.exists(), f"Missing dataset: {DATA_PATH}"

raw_df = pd.read_csv(DATA_PATH)
print(f"Loaded {len(raw_df):,} biopsy records with {raw_df.shape[1]} columns")
raw_df.head()


### Missing value overview

In [None]:

missing_summary = (
    raw_df.isna()
    .mean()
    .sort_values(ascending=False)
    .to_frame(name='missing_ratio')
)
missing_summary.head(10)



## Diagnosis normalisation

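The `DIAGNOSIS` column contains verbose free-text strings. To use them as machine-learning targets, we:

1. Keep only the portion before the first comma, which drops staging or margin comments.
2. Remove parenthesised remarks and any remaining punctuation, then collapse whitespace.
3. Lowercase the result and map common synonyms to a canonical label by prefix match. Strings that match no rule keep their cleaned form, and empty or non-string values become `unknown`.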
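
As a worked illustration of the three steps, the sketch below traces one made-up diagnosis string through each stage. The input string and the single-entry synonym table are hypothetical and only stand in for the real `DIAGNOSIS` values and rules.

```python
import re

# Hypothetical input; real DIAGNOSIS strings may look different.
raw = "Cutaneous Mast Cell Tumor (Kiupel low grade), completely excised"

# Step 1: keep the text before the first comma.
step1 = raw.split(",")[0]

# Step 2: drop parenthesised remarks and punctuation, then collapse whitespace.
step2 = re.sub(r"\([^)]*\)", "", step1)
step2 = re.sub(r"[^a-zA-Z0-9\s]", " ", step2)
step2 = re.sub(r"\s+", " ", step2).strip()

# Step 3: lowercase, then map a known prefix to its canonical label.
step3 = step2.lower()
synonyms = [(r"^cutaneous mast cell tumor", "mast cell tumor")]  # illustrative subset
label = next((canon for pat, canon in synonyms if re.match(pat, step3)), step3)

print(step1)
print(step2)
print(label)
```

Because the rules are matched with `re.match`, only the start of the cleaned string has to agree with a pattern, so trailing qualifiers do not prevent a match.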


In [None]:

NORMALISATION_RULES = [
    (r'^mast cell tumor', 'mast cell tumor'),
    (r'^cutaneous mast cell tumor', 'mast cell tumor'),
    (r'^subcutaneous mast cell tumor', 'mast cell tumor'),
    (r'^mammary gland adenoma', 'mammary adenoma'),
    (r'^mammary complex adenoma', 'mammary complex adenoma'),
    (r'^mammary benign mixed tumor', 'mammary benign mixed tumor'),
    (r'^mammary carcinoma', 'mammary carcinoma'),
    (r'^mammary duct carcinoma', 'mammary carcinoma'),
    (r'^mammary adenoma', 'mammary adenoma'),
    (r'^lipoma', 'lipoma'),
    (r'^subcutaneous lipoma', 'lipoma'),
    (r'^hepatic lipoma', 'lipoma'),
    (r'^sebaceous adenoma', 'sebaceous adenoma'),
    (r'^sebaceous epithelioma', 'sebaceous epithelioma'),
    (r'^soft tissue sarcoma', 'soft tissue sarcoma'),
    (r'^trichoblastoma', 'trichoblastoma'),
]


def normalise_diagnosis(value: str) -> str:
    if not isinstance(value, str) or not value.strip():
        return 'unknown'

    base = value.split(',')[0]
    base = re.sub(r'\([^)]*\)', '', base)
    base = re.sub(r'[^a-zA-Z0-9\s]', ' ', base)
    base = re.sub(r'\s+', ' ', base).strip().lower()

    for pattern, canonical in NORMALISATION_RULES:
        if re.match(pattern, base):
            return canonical

    return base


raw_df['disease_family'] = raw_df['DIAGNOSIS'].map(normalise_diagnosis)
print('Unique disease families:', raw_df['disease_family'].nunique())
raw_df['disease_family'].value_counts().head(20)



### Choosing a modelling target

The dataset covers thousands of diagnostic phrases. A multi-class classifier across all 5,000+ labels would leave most classes with too few examples to learn from. Instead, we keep the most common disease families and group every family with fewer than `MIN_CASES` records into an `other` bucket. Raising `MIN_CASES` yields fewer, better-supported classes and a larger `other` bucket; lowering it does the opposite.
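
To make the effect of the threshold concrete, here is a minimal sketch on toy data. The helper name `bucket_rare_labels` and the label counts are assumptions for illustration; the notebook itself does the same thing with `value_counts` and `np.where`.

```python
from collections import Counter


def bucket_rare_labels(labels, min_cases, other="other"):
    """Replace labels seen fewer than `min_cases` times with `other`."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_cases else other for lab in labels]


# Toy labels (hypothetical counts, far smaller than the real dataset).
toy = ["lipoma"] * 5 + ["mast cell tumor"] * 3 + ["trichoblastoma"] * 1

for threshold in (1, 3, 5):
    bucketed = Counter(bucket_rare_labels(toy, threshold))
    print(threshold, dict(bucketed))
```

As the threshold grows, more families fall into `other`, so it is worth checking that the bucket does not end up dominating the label distribution.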


In [None]:

MIN_CASES = 250
value_counts = raw_df['disease_family'].value_counts()
major_labels = value_counts[value_counts >= MIN_CASES].index.tolist()
print(f"Keeping {len(major_labels)} frequent labels (≥{MIN_CASES} cases)")

raw_df['target_label'] = np.where(
    raw_df['disease_family'].isin(major_labels),
    raw_df['disease_family'],
    'other'
)

raw_df['target_label'].value_counts()



### Train/validation/test split metadata

We create stratified splits (70% train, 15% validation, 15% test) based on the consolidated target labels. These splits can later be joined with patch-level data after extraction.
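
One thing that may be worth checking, assuming a single `INSP_RQST_NO` can reference several slides: a row-level split can place slides from the same request in different splits. The helper below is a hypothetical sketch (the function name and the toy IDs are not part of the notebook) that reports request IDs shared between splits.

```python
from itertools import combinations


def shared_request_ids(splits):
    """Return {(split_a, split_b): ids present in both} for every pair of splits."""
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(splits.items(), 2):
        common = set(ids_a) & set(ids_b)
        if common:
            overlaps[(name_a, name_b)] = common
    return overlaps


# Toy request IDs (hypothetical); "R2" appears in both train and test.
toy_splits = {
    "train": ["R1", "R2", "R3"],
    "valid": ["R4"],
    "test": ["R2", "R5"],
}
print(shared_request_ids(toy_splits))
```

With the notebook's frames, the input would be the `INSP_RQST_NO` column of each split. If overlaps show up, a group-aware splitter such as scikit-learn's `GroupShuffleSplit` is one option, at the cost of less exact stratification.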


In [None]:

META_COLUMNS = ['INSP_RQST_NO', 'FOLDER', 'FILE_NAME', 'target_label']
meta_df = raw_df[META_COLUMNS].drop_duplicates().reset_index(drop=True)

train_df, temp_df = train_test_split(
    meta_df,
    test_size=0.3,
    random_state=42,
    stratify=meta_df['target_label']
)
valid_df, test_df = train_test_split(
    temp_df,
    test_size=0.5,
    random_state=42,
    stratify=temp_df['target_label']
)

print('Train size:', len(train_df))
print('Valid size:', len(valid_df))
print('Test size :', len(test_df))

split_summary = {
    'train': train_df['target_label'].value_counts(normalize=True).to_dict(),
    'valid': valid_df['target_label'].value_counts(normalize=True).to_dict(),
    'test': test_df['target_label'].value_counts(normalize=True).to_dict(),
}
# print() renders the indentation; a bare expression would show the escaped string repr.
print(json.dumps(split_summary, indent=2))



## WSI availability check

The metadata references `.svs` files stored per request `FOLDER`. Update `WSI_ROOT` to the directory where slides are mounted.
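
Spot-checking a few rows is quick, but a full count of unresolved files gives a better picture before extraction. The following is a minimal sketch under the assumption that slides live at `<root>/<FOLDER>/<FILE_NAME>`; the helper `count_missing` and the folder and file names in the demo are made up.

```python
from pathlib import Path
import tempfile


def count_missing(root, entries):
    """Count (folder, file_name) pairs that do not resolve to a file under root."""
    missing = [
        (folder, name)
        for folder, name in entries
        if not (Path(root) / str(folder) / str(name)).is_file()
    ]
    return len(missing), missing


# Self-contained demo on a temporary directory with made-up names.
with tempfile.TemporaryDirectory() as tmp:
    slide_dir = Path(tmp) / "F001"
    slide_dir.mkdir()
    (slide_dir / "a.svs").touch()
    entries = [("F001", "a.svs"), ("F001", "b.svs"), ("F002", "c.svs")]
    n_missing, missing = count_missing(tmp, entries)
    print(f"{n_missing} of {len(entries)} slides missing:", missing)
```

On the real metadata, `entries` could be built with `zip(raw_df['FOLDER'], raw_df['FILE_NAME'])`.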


In [None]:

WSI_ROOT = Path('Data/WSI')  # TODO: update to actual location

if not WSI_ROOT.exists():
    print(f"WSI root missing at {WSI_ROOT.resolve()}. Patch extraction will be skipped.")
else:
    sample_rows = raw_df.head(3)
    for _, row in sample_rows.iterrows():
        # str() guards against folder names that pandas parsed as numbers.
        candidate = WSI_ROOT / str(row['FOLDER']) / str(row['FILE_NAME'])
        print(candidate, 'exists' if candidate.exists() else 'missing')



## Patch extraction utilities

We rely on [OpenSlide](https://openslide.org/) to stream high-resolution patches from WSI files. Install prerequisites before running the next cell:

```bash
sudo apt-get install openslide-tools
pip install openslide-python
```

If OpenSlide cannot be imported, the next cell sets `HAS_OPENSLIDE = False` and the extraction demo is skipped instead of failing.
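
A detail that is easy to miss: OpenSlide's `read_region` takes its `location` in the level-0 reference frame, while `size` is given in pixels of the requested level. When a patch grid is laid out over a downsampled level, the grid coordinates therefore need to be scaled by that level's downsample factor. The sketch below is plain Python with made-up dimensions; on a real slide the factors come from `slide.level_downsamples`, and the helper names are hypothetical.

```python
def level_to_base(x, y, downsample):
    """Map pixel coordinates at a pyramid level to level-0 coordinates."""
    return int(round(x * downsample)), int(round(y * downsample))


def grid_origins(width, height, patch_size, stride):
    """Yield top-left corners of full patches on a level of the given size."""
    for y in range(0, height - patch_size + 1, stride):
        for x in range(0, width - patch_size + 1, stride):
            yield x, y


# Hypothetical level: 1024 x 512 pixels at 4x downsample.
origins = list(grid_origins(1024, 512, patch_size=256, stride=256))
base = [level_to_base(x, y, 4.0) for x, y in origins]
print(len(origins), origins[:3], base[:3])
```

At level 0 the downsample is 1.0, so both coordinate systems coincide and the mapping is a no-op.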


In [None]:

try:
    import openslide
    from PIL import Image
    HAS_OPENSLIDE = True
    print('OpenSlide version:', openslide.__library_version__)
except Exception as exc:  # noqa: BLE001
    HAS_OPENSLIDE = False
    print('OpenSlide not available:', exc)


In [None]:

@dataclass
class PatchSpec:
    slide_id: str
    x: int
    y: int
    level: int
    size: int
    label: str
    source_path: Path


def resolve_slide_path(row: pd.Series, root: Path) -> Optional[Path]:
    """Resolve the absolute slide path using folder/name hints."""
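    # str() guards against folder or file names that pandas parsed as numbers.
    folder, name = str(row['FOLDER']), str(row['FILE_NAME'])
    candidates = [root / folder / name, root / name]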
    for cand in candidates:
        if cand.exists():
            return cand
    return None


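def tissue_fraction(tile: Image.Image) -> float:
    """Estimate the share of non-background pixels in a tile."""
    # Cast to float so the arithmetic below does not stay in uint8.
    arr = np.asarray(tile.convert('L'), dtype=np.float32)
    # np.ptp() replaces ndarray.ptp(), which was removed in NumPy 2.0.
    norm = (arr - arr.min()) / (np.ptp(arr) + 1e-6)
    # Pixels darker than 85% of the tile's own gray range count as tissue.
    return float((norm < 0.85).mean())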


def extract_patches_for_slide(
    slide_path: Path,
    label: str,
    level: int = 0,
    patch_size: int = 256,
    stride: Optional[int] = None,
    max_patches: int = 200,
    tissue_threshold: float = 0.2,
) -> List[PatchSpec]:
    if not HAS_OPENSLIDE:
        raise RuntimeError('OpenSlide support is required to extract patches.')

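    slide = openslide.OpenSlide(str(slide_path))
    stride = stride or patch_size

    patches: List[PatchSpec] = []
    try:
        width, height = slide.level_dimensions[level]
        # read_region() expects its location in level-0 coordinates, so grid
        # positions at `level` are scaled by that level's downsample factor.
        downsample = slide.level_downsamples[level]
        for y in range(0, height - patch_size + 1, stride):
            for x in range(0, width - patch_size + 1, stride):
                location = (int(x * downsample), int(y * downsample))
                region = slide.read_region(location, level, (patch_size, patch_size))
                if tissue_fraction(region) < tissue_threshold:
                    continue
                # x and y are stored in the coordinates of `level`.
                patches.append(PatchSpec(
                    slide_id=slide_path.stem,
                    x=x,
                    y=y,
                    level=level,
                    size=patch_size,
                    label=label,
                    source_path=slide_path,
                ))
                if len(patches) >= max_patches:
                    break
            if len(patches) >= max_patches:
                break
    finally:
        # Release the file handle even if reading a region fails.
        slide.close()
    return patches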


def build_patch_index(
    split_df: pd.DataFrame,
    root: Path,
    level: int = 0,
    patch_size: int = 256,
    stride: Optional[int] = None,
    max_patches_per_slide: int = 200,
    tissue_threshold: float = 0.2,
) -> pd.DataFrame:
    records = []
    for _, row in split_df.iterrows():
        slide_path = resolve_slide_path(row, root)
        if slide_path is None:
            continue
        try:
            patches = extract_patches_for_slide(
                slide_path=slide_path,
                label=row['target_label'],
                level=level,
                patch_size=patch_size,
                stride=stride,
                max_patches=max_patches_per_slide,
                tissue_threshold=tissue_threshold,
            )
        except Exception as exc:  # noqa: BLE001
            print(f"Skipping {row['FILE_NAME']}: {exc}")
            continue

        for patch in patches:
            records.append({
                'slide_id': patch.slide_id,
                'x': patch.x,
                'y': patch.y,
                'level': patch.level,
                'size': patch.size,
                'label': patch.label,
                'source_path': str(patch.source_path),
            })

    return pd.DataFrame.from_records(records)



### Patch extraction demo

The following cell attempts to build a small patch index on the first two slides of the training split. When OpenSlide or the slide directory is unavailable, it reports which prerequisite is missing and skips extraction.
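
Before relying on the tissue filter, it may help to see how it behaves on synthetic tiles. `tissue_fraction` rescales each tile to its own minimum and maximum before thresholding, so a nearly blank tile with faint noise can still score high. The sketch below compares that per-tile rescaling with a fixed gray-level cut-off; the cut-off of 220, the function names, and the synthetic tiles are assumptions for illustration, not tuned values.

```python
import numpy as np


def fraction_rescaled(gray, cutoff=0.85):
    """Share of pixels below `cutoff` after per-tile min-max rescaling."""
    gray = gray.astype(np.float32)
    norm = (gray - gray.min()) / (np.ptp(gray) + 1e-6)
    return float((norm < cutoff).mean())


def fraction_absolute(gray, cutoff=220):
    """Share of pixels darker than a fixed gray level (assumed cut-off)."""
    return float((gray < cutoff).mean())


rng = np.random.default_rng(0)
# Synthetic background: near-white with faint noise (values 250..255).
blank = rng.integers(250, 256, size=(64, 64)).astype(np.uint8)
# Synthetic tissue: left half dark, right half white.
tissue = np.full((64, 64), 255, dtype=np.uint8)
tissue[:, :32] = 120

for name, tile in (("blank", blank), ("tissue", tissue)):
    print(name, round(fraction_rescaled(tile), 2), fraction_absolute(tile))
```

If many near-empty patches pass the filter on real slides, a fixed cut-off or a saturation-based mask may be a more robust choice than per-tile rescaling.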


In [None]:

if not HAS_OPENSLIDE:
    print('OpenSlide missing — skipping patch extraction demo.')
elif not WSI_ROOT.exists():
    print('WSI directory not found — mount slides and re-run.')
else:
    demo_df = train_df.head(2)
    patch_index = build_patch_index(
        split_df=demo_df,
        root=WSI_ROOT,
        level=0,
        patch_size=256,
        stride=256,
        max_patches_per_slide=16,
        tissue_threshold=0.25,
    )
    print('Extracted', len(patch_index), 'patches')
    # A bare expression inside a block is not displayed, so print explicitly.
    print(patch_index.head())



## Next steps

* Augment the normalisation rules once additional diagnostic variants are observed.
* Persist the split metadata (`train.csv`, `valid.csv`, `test.csv`) and the generated `patch_index.parquet` for downstream training scripts.
* Implement patch-level filtering (e.g. blur detection, stain normalisation) prior to model ingestion.
* Connect a slide-level cache so that repeated patch extraction is avoided during experimentation.
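
For the patch-level filtering item, one common blur heuristic is the variance of a Laplacian response: sharp tiles contain strong edges and produce a high variance, while blurred or flat tiles produce a low one. The sketch below uses NumPy only and synthetic arrays. The function name is hypothetical, and no threshold is suggested because it would have to be tuned on real patches.

```python
import numpy as np


def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian over the interior of a 2-D gray image."""
    g = gray.astype(np.float64)
    lap = (
        g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
        - 4.0 * g[1:-1, 1:-1]
    )
    return float(lap.var())


rng = np.random.default_rng(0)
sharp = rng.integers(0, 256, size=(64, 64)).astype(np.float64)

# Crude blur: average every 2x2 neighbourhood of the sharp tile.
blurred = (sharp[:-1, :-1] + sharp[1:, :-1] + sharp[:-1, 1:] + sharp[1:, 1:]) / 4.0

# A constant tile has no edges at all.
flat = np.full((64, 64), 200.0)

print(laplacian_variance(sharp) > laplacian_variance(blurred))
print(laplacian_variance(flat))
```

A score like this could be stored as an extra column in the patch index so that the cut-off can be chosen later without re-reading the slides.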
