# Medical Segmentation Decathlon: Lung Tumor Data Ingestion

This notebook downloads the Medical Segmentation Decathlon (MSD) Task06_Lung dataset and ingests it into RadiObject for ML training.

## Dataset Overview

| Property | Value |
|----------|-------|
| **Task** | Task06_Lung |
| **Modality** | CT |
| **Subjects** | 63 training + 32 test |
| **Labels** | Lung tumor segmentation masks |
| **Size** | ~8.5GB (compressed tar) |
| **License** | CC-BY-SA 4.0 |

We derive a binary classification label (`has_tumor`) from the segmentation masks.

## Configuration

Edit `config.py` to change the target URI:

```python
# S3 storage (default)
MSD_LUNG_URI = "s3://souzy-scratch/msd-lung/radiobject"

# Local storage alternative (uncomment to use instead of S3)
# MSD_LUNG_URI = "./data/msd-lung"
```
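
If editing `config.py` on every machine is inconvenient, one option is to let an environment variable override the default. The sketch below is an assumption about how `config.py` could be written, not how it is written today; the helper names `resolve_uri` and `is_s3_uri`, and the use of an `MSD_LUNG_URI` environment variable, are hypothetical.

```python
import os

DEFAULT_URI = "s3://souzy-scratch/msd-lung/radiobject"


def resolve_uri(env=None, default=DEFAULT_URI):
    """Return the target URI, preferring a non-empty environment override."""
    env = os.environ if env is None else env
    value = env.get("MSD_LUNG_URI", "").strip()
    return value or default


def is_s3_uri(uri):
    """True when the URI points at S3 rather than a local path."""
    return uri.startswith("s3://")


MSD_LUNG_URI = resolve_uri()
```

With this in place, `MSD_LUNG_URI=./data/msd-lung jupyter lab` would switch the notebook to local storage without touching the file.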

**Reference**: [Medical Segmentation Decathlon](http://medicaldecathlon.com/)

## 1. Setup

In [None]:
import sys
sys.path.insert(0, '..')

import json
import shutil
import subprocess
import tarfile
from pathlib import Path

import numpy as np
import pandas as pd
import nibabel as nib
import matplotlib.pyplot as plt

from config import MSD_LUNG_URI, S3_REGION
from radiobject.radi_object import RadiObject
from radiobject.ctx import configure, S3Config

print(f"NumPy: {np.__version__}")
print(f"NiBabel: {nib.__version__}")
print(f"Target URI: {MSD_LUNG_URI}")

In [None]:
# Configure S3 access if using S3 URI
if MSD_LUNG_URI.startswith("s3://"):
    configure(s3=S3Config(region=S3_REGION, max_parallel_ops=8))

# Configure ISOTROPIC tile layout for ML training (64³ chunks for efficient patch extraction)
from radiobject.ctx import TileConfig, SliceOrientation
configure(tile=TileConfig(orientation=SliceOrientation.ISOTROPIC))

# S3 URIs for backup (optional)
S3_BUCKET = "s3://souzy-scratch/msd-lung"
S3_NIFTI_URI = f"{S3_BUCKET}/nifti"  # Backup of raw NIfTIs

# Local paths (temporary, for processing)
DATA_DIR = Path("../data/msd_lung")
TAR_PATH = DATA_DIR / "Task06_Lung.tar"
TASK_DIR = DATA_DIR / "Task06_Lung"

DATA_DIR.mkdir(parents=True, exist_ok=True)
print(f"Local data directory: {DATA_DIR.resolve()}")
print(f"Target RadiObject URI: {MSD_LUNG_URI}")
print("Tile orientation: ISOTROPIC (64³ chunks for ML patch extraction)")

## 2. Check if RadiObject Exists

In [None]:
def uri_exists(uri: str) -> bool:
    """Return True if a RadiObject can be opened at `uri`; any error (including missing credentials) counts as not found."""
    try:
        radi = RadiObject(uri)
        _ = radi.collection_names  # Force validation by accessing group metadata
        return True
    except Exception:
        return False

if uri_exists(MSD_LUNG_URI):
    print(f"RadiObject already exists at {MSD_LUNG_URI}")
    print("Skipping ingestion. Delete the URI to re-ingest.")
    SKIP_INGESTION = True
else:
    print(f"No RadiObject found at {MSD_LUNG_URI}")
    print("Proceeding with ingestion...")
    SKIP_INGESTION = False

## 3. Download MSD Task06_Lung

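The original MSD archive is hosted on AWS S3 with anonymous access (no authentication required). The cell below instead syncs the already-extracted NIfTI files from the S3 backup at `S3_NIFTI_URI`, which requires the AWS CLI and read access to that bucket.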
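
As a rough illustration of the download step, the sketch below assembles an `aws s3 sync` command and optionally appends the AWS CLI's `--no-sign-request` flag, which is what anonymous access relies on. The helper name `build_sync_command` and the example bucket are hypothetical, and whether a given bucket accepts unsigned requests depends on its policy.

```python
def build_sync_command(source_uri, dest_dir, anonymous=False):
    """Build an `aws s3 sync` argument list without running it."""
    cmd = ["aws", "s3", "sync", source_uri, str(dest_dir)]
    if anonymous:
        # Skip credential lookup; only works for publicly readable buckets.
        cmd.append("--no-sign-request")
    return cmd


# Hypothetical bucket name, used only to show the resulting command line.
cmd = build_sync_command("s3://example-public-bucket/nifti", "data/Task06_Lung", anonymous=True)
print(" ".join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, capture_output=True, text=True)`.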

In [None]:
if not SKIP_INGESTION:
    # Download from S3 backup (faster and more reliable than original MSD source)
    if not TASK_DIR.exists():
        print(f"Downloading NIfTIs from S3 backup: {S3_NIFTI_URI}...")
        TASK_DIR.mkdir(parents=True, exist_ok=True)
        result = subprocess.run(
            ["aws", "s3", "sync", S3_NIFTI_URI, str(TASK_DIR)],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            raise RuntimeError(f"Download failed: {result.stderr}")
        print("Download complete.")
    else:
        print(f"Task directory already exists at {TASK_DIR}")

In [None]:
if not SKIP_INGESTION:
    # Verify download (no extraction needed - we downloaded NIfTIs directly)
    dataset_json = TASK_DIR / "dataset.json"
    if not dataset_json.exists():
        raise FileNotFoundError(f"dataset.json not found at {dataset_json}")
    print(f"Found dataset.json at {dataset_json}")

## 4. Parse Dataset Metadata

In [None]:
if not SKIP_INGESTION:
    # Load dataset.json
    with open(TASK_DIR / "dataset.json") as f:
        metadata = json.load(f)

    print(f"Dataset name: {metadata.get('name', 'N/A')}")
    print(f"Description: {metadata.get('description', 'N/A')}")
    print(f"Modality: {metadata.get('modality', 'N/A')}")
    print(f"Labels: {metadata.get('labels', 'N/A')}")
    print(f"Tensor image size: {metadata.get('tensorImageSize', 'N/A')}")
    print(f"Training samples: {metadata.get('numTraining', len(metadata.get('training', [])))}")
    print(f"Test samples: {metadata.get('numTest', len(metadata.get('test', [])))}")

In [None]:
if not SKIP_INGESTION:
    # Parse training samples
    training = metadata["training"]
    print("\nFirst 3 training entries:")
    for entry in training[:3]:
        print(f"  Image: {entry['image']}, Label: {entry['label']}")
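
Before deriving labels, it can help to confirm that every image and label referenced in `dataset.json` actually exists on disk, since a partial sync would otherwise only surface as a load error later. The following is a minimal sketch; `find_missing_files` is a hypothetical helper, and the demo builds a fake directory layout rather than touching the real dataset.

```python
import tempfile
from pathlib import Path


def find_missing_files(task_dir, training):
    """Return relative paths from `training` that do not exist under task_dir."""
    task_dir = Path(task_dir)
    missing = []
    for entry in training:
        for key in ("image", "label"):
            rel = entry[key]
            if not (task_dir / rel).is_file():
                missing.append(rel)
    return missing


# Tiny self-contained demo: the image exists, the label does not.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "imagesTr").mkdir()
    (root / "imagesTr" / "lung_001.nii.gz").touch()
    demo_training = [
        {"image": "./imagesTr/lung_001.nii.gz", "label": "./labelsTr/lung_001.nii.gz"}
    ]
    demo_missing = find_missing_files(root, demo_training)
    print(demo_missing)
```

In the notebook this would be called as `find_missing_files(TASK_DIR, metadata["training"])`, and an empty list means the download is complete.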

## 5. Create obs_meta with Binary Labels

We derive `has_tumor` by checking if any voxel in the segmentation mask is non-zero.
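
The rule can be tried on synthetic arrays before running it on real masks. This is a minimal sketch in which the function name `has_tumor` and the toy arrays are illustrative only.

```python
import numpy as np


def has_tumor(mask):
    """Return 1 if any voxel in the mask is labeled (> 0), else 0."""
    return int(np.any(np.asarray(mask) > 0))


empty_mask = np.zeros((4, 4, 3), dtype=np.uint8)
tumor_mask = empty_mask.copy()
tumor_mask[1, 2, 0] = 1  # a single labeled voxel is enough

print(has_tumor(empty_mask), has_tumor(tumor_mask))
```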

In [None]:
if not SKIP_INGESTION:
    obs_data = []

    for i, entry in enumerate(training):
        # Drop a literal "./" prefix if present (str.removeprefix requires Python 3.9+)
        image_rel = entry["image"].removeprefix("./")
        label_rel = entry["label"].removeprefix("./")
        
        image_path = TASK_DIR / image_rel
        label_path = TASK_DIR / label_rel
        
        # Derive subject_id from filename (e.g., "lung_001" from "imagesTr/lung_001.nii.gz")
        filename = Path(image_rel).name
        subject_id = filename.replace(".nii.gz", "").replace(".nii", "")
        
        # Load label mask to determine has_tumor
        label_img = nib.load(label_path)
        label_data = np.asarray(label_img.dataobj)
        has_tumor = int(np.any(label_data > 0))
        
        # Get image metadata
        img = nib.load(image_path)
        shape = img.shape
        zooms = img.header.get_zooms()
        
        obs_data.append({
            "obs_subject_id": subject_id,
            "obs_id": subject_id,
            "has_tumor": has_tumor,
            "shape_x": shape[0],
            "shape_y": shape[1],
            "shape_z": shape[2],
            "spacing_x": float(zooms[0]),
            "spacing_y": float(zooms[1]),
            "spacing_z": float(zooms[2]),
            "image_path": str(image_path),
            "label_path": str(label_path),
        })
        
        if (i + 1) % 10 == 0:
            print(f"Processed {i + 1}/{len(training)} samples...")

    obs_df = pd.DataFrame(obs_data)
    print(f"\nTotal samples: {len(obs_df)}")
    print(f"Label distribution: {obs_df['has_tumor'].value_counts().to_dict()}")
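
The path normalization and subject-ID derivation from the loop above can be isolated into small helpers that are easy to test. One detail worth knowing is that `str.lstrip("./")` strips any leading run of `.` and `/` characters rather than the literal prefix, so the sketch checks for the prefix explicitly. The helper names are hypothetical.

```python
from pathlib import PurePosixPath


def normalize_rel_path(rel):
    """Drop a single leading "./" without touching other leading dots."""
    return rel[2:] if rel.startswith("./") else rel


def subject_id_from_path(rel):
    """Derive e.g. "lung_001" from "./imagesTr/lung_001.nii.gz"."""
    name = PurePosixPath(normalize_rel_path(rel)).name
    for suffix in (".nii.gz", ".nii"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


print(subject_id_from_path("./imagesTr/lung_001.nii.gz"))
```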

In [None]:
if not SKIP_INGESTION:
    # Display sample metadata
    display(obs_df.head(10))

## 6. Explore Sample Data

In [None]:
if not SKIP_INGESTION:
    # Load and visualize a sample with tumor
    tumor_samples = obs_df[obs_df['has_tumor'] == 1]
    sample = tumor_samples.iloc[0] if len(tumor_samples) > 0 else obs_df.iloc[0]

    img = nib.load(sample["image_path"])
    img_data = np.asarray(img.dataobj)

    label = nib.load(sample["label_path"])
    label_data = np.asarray(label.dataobj)

    print(f"Subject: {sample['obs_subject_id']}")
    print(f"Has tumor: {sample['has_tumor']}")
    print(f"Image shape: {img_data.shape}")
    print(f"Image dtype: {img_data.dtype}")
    print(f"Value range: [{img_data.min():.0f}, {img_data.max():.0f}]")
    print(f"Label unique values: {np.unique(label_data)}")
    print(f"Voxel spacing: {img.header.get_zooms()}")

In [None]:
if not SKIP_INGESTION:
    # Find slice with tumor
    if sample['has_tumor'] == 1:
        # Find z-slice with maximum tumor area
        tumor_area_per_slice = (label_data > 0).sum(axis=(0, 1))
        best_z = int(np.argmax(tumor_area_per_slice))
    else:
        best_z = img_data.shape[2] // 2

    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    # CT image
    axes[0].imshow(img_data[:, :, best_z].T, cmap="gray", origin="lower")
    axes[0].set_title(f"CT - Axial (z={best_z})")
    axes[0].axis("off")

    # Segmentation mask
    axes[1].imshow(label_data[:, :, best_z].T, cmap="hot", origin="lower")
    axes[1].set_title("Tumor Mask")
    axes[1].axis("off")

    # Overlay
    axes[2].imshow(img_data[:, :, best_z].T, cmap="gray", origin="lower")
    mask_overlay = np.ma.masked_where(label_data[:, :, best_z].T == 0, label_data[:, :, best_z].T)
    axes[2].imshow(mask_overlay, cmap="Reds", alpha=0.5, origin="lower")
    axes[2].set_title("Overlay")
    axes[2].axis("off")

    plt.suptitle(f"Subject: {sample['obs_subject_id']} (has_tumor={sample['has_tumor']})", fontsize=14)
    plt.tight_layout()
    plt.show()
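
The slice-selection logic used for the figure can be checked on a synthetic mask. The sketch below assumes the same (x, y, z) axis order as the cell above; `best_tumor_slice` is a hypothetical helper that falls back to the middle slice when the mask is empty.

```python
import numpy as np


def best_tumor_slice(label_data):
    """Index of the z-slice with the most labeled voxels (middle slice if none)."""
    area_per_slice = (label_data > 0).sum(axis=(0, 1))
    if area_per_slice.max() == 0:
        return label_data.shape[2] // 2
    return int(np.argmax(area_per_slice))


mask = np.zeros((8, 8, 5), dtype=np.uint8)
mask[2:4, 2:4, 1] = 1  # 4 labeled voxels on slice 1
mask[1:5, 1:5, 3] = 1  # 16 labeled voxels on slice 3
print(best_tumor_slice(mask))
```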

## 7. Backup Raw NIfTIs to S3 (Optional)

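This step would upload the raw NIfTI files to S3 so they can be accessed from any machine. Because this notebook already downloads the NIfTIs from that S3 backup, the cell below only prints a message and uploads nothing.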

In [None]:
if not SKIP_INGESTION:
    # Skip S3 backup - we already downloaded from S3 backup
    print("Data loaded from S3 backup - no additional backup needed.")
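
If the NIfTIs had come from a different source, a backup upload could be built along these lines. This is a sketch under assumptions: `build_backup_command` is a hypothetical helper, and the filter set (NIfTI files plus `dataset.json`) is one reasonable choice rather than something the notebook prescribes. `--exclude`, `--include`, and `--dryrun` are standard `aws s3 sync` options.

```python
def build_backup_command(local_dir, s3_uri, dry_run=True):
    """Build an `aws s3 sync` upload command restricted to NIfTI and JSON files."""
    cmd = [
        "aws", "s3", "sync", str(local_dir), s3_uri,
        "--exclude", "*",           # start by excluding everything...
        "--include", "*.nii.gz",    # ...then re-include the volumes
        "--include", "dataset.json",
    ]
    if dry_run:
        # List what would be transferred without uploading anything.
        cmd.append("--dryrun")
    return cmd


print(" ".join(build_backup_command("../data/msd_lung/Task06_Lung", "s3://souzy-scratch/msd-lung/nifti")))
```

Running with `dry_run=True` first makes it easy to review the file list before committing to a multi-gigabyte upload.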

## 8. Create RadiObject from NIfTIs

In [None]:
if not SKIP_INGESTION:
    # Prepare NIfTI list for ingestion
    nifti_list = [
        (row["image_path"], row["obs_subject_id"])
        for _, row in obs_df.iterrows()
    ]

    # Subject-level metadata (exclude file paths, keep shape and spacing info)
    obs_meta_df = obs_df[[
        "obs_subject_id", "obs_id", "has_tumor",
        "shape_x", "shape_y", "shape_z",
        "spacing_x", "spacing_y", "spacing_z",
    ]].copy()

    print(f"Creating RadiObject with {len(nifti_list)} volumes...")
    print(f"Target URI: {MSD_LUNG_URI}")
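
Since each volume is paired with a subject ID, it may be worth checking for duplicate IDs before ingestion. Whether `from_niftis` rejects duplicates on its own is not something this notebook shows, so the check below is a precaution; `duplicate_subject_ids` is a hypothetical helper.

```python
from collections import Counter


def duplicate_subject_ids(nifti_list):
    """Return subject IDs that appear more than once in (path, subject_id) pairs."""
    counts = Counter(subject_id for _, subject_id in nifti_list)
    return sorted(sid for sid, n in counts.items() if n > 1)


demo_list = [
    ("imagesTr/lung_001.nii.gz", "lung_001"),
    ("imagesTr/lung_003.nii.gz", "lung_003"),
    ("imagesTr/lung_003_copy.nii.gz", "lung_003"),
]
print(duplicate_subject_ids(demo_list))
```

An empty result for `duplicate_subject_ids(nifti_list)` means every subject maps to exactly one volume.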

In [None]:
if not SKIP_INGESTION:
    # Create RadiObject (stored on S3 or local)
    radi = RadiObject.from_niftis(
        uri=MSD_LUNG_URI,
        niftis=nifti_list,
        obs_meta=obs_meta_df,
        reorient=True,
    )

    print(f"\nRadiObject created: {radi}")

## 9. Validate & Explore RadiObject

In [None]:
if not SKIP_INGESTION:
    # Validate integrity
    radi.validate()
    print("Validation passed")

In [None]:
if not SKIP_INGESTION:
    # Explore the RadiObject
    print(f"Number of subjects: {len(radi)}")
    print(f"Collections: {radi.collection_names}")
    print("\nobs_meta:")
    display(radi.obs_meta.read())

In [None]:
if not SKIP_INGESTION:
    # Verify data by reading a sample
    collection_name = radi.collection_names[0]
    vc = radi.collection(collection_name)

    print(f"Collection: {collection_name}")
    print(f"Shape: {vc.shape}")
    print(f"Number of volumes: {len(vc)}")

    # Read a slice
    sample_vol = vc.iloc[0]
    mid_z = sample_vol.shape[2] // 2
    axial_slice = sample_vol.axial(z=mid_z)

    plt.figure(figsize=(8, 8))
    plt.imshow(axial_slice.T, cmap="gray", origin="lower")
    plt.title(f"RadiObject sample: {collection_name} (z={mid_z})")
    plt.axis("off")
    plt.show()

## 10. Cleanup Local Data

Optionally remove the local NIfTI files once the RadiObject has been written and validated. When `MSD_LUNG_URI` points to S3, the data remains accessible from any machine via the S3 URIs.

In [None]:
if not SKIP_INGESTION:
    # Remove local data (data is now in S3)
    CLEANUP_LOCAL = False  # Set to True to remove local files

    if CLEANUP_LOCAL:
        shutil.rmtree(DATA_DIR)
        print(f"Removed local data: {DATA_DIR}")
    else:
        print(f"Local data retained at: {DATA_DIR.resolve()}")
        print("Set CLEANUP_LOCAL = True to remove after confirming S3 upload.")
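
One way to make the cleanup safer is to tie deletion to a verification callback, so local files are removed only when the stored copy can actually be opened. This is a sketch, not part of the notebook's workflow; `cleanup_if_verified` is a hypothetical helper, and the demo uses a temporary directory instead of real data.

```python
import shutil
import tempfile
from pathlib import Path


def cleanup_if_verified(data_dir, verify):
    """Delete data_dir only if verify() returns True; report whether it was removed."""
    data_dir = Path(data_dir)
    if not data_dir.is_dir() or not verify():
        return False
    shutil.rmtree(data_dir)
    return True


demo_dir = Path(tempfile.mkdtemp()) / "msd_lung"
demo_dir.mkdir()
kept = cleanup_if_verified(demo_dir, verify=lambda: False)    # verification fails: nothing deleted
removed = cleanup_if_verified(demo_dir, verify=lambda: True)  # verification passes: directory deleted
demo_dir.parent.rmdir()  # remove the now-empty temporary parent
```

In this notebook the callback could be `lambda: uri_exists(MSD_LUNG_URI)`.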

## 11. Verify RadiObject

Load the RadiObject from the URI to verify it was created correctly.

In [None]:
# Load from URI (works whether we just created it or it already existed)
radi = RadiObject(MSD_LUNG_URI)

print(f"Loaded: {radi}")
print(f"Collections: {radi.collection_names}")
print(f"Subjects: {len(radi)}")

# Quick data check
collection_name = radi.collection_names[0]
vol = radi.collection(collection_name).iloc[0]
print(f"\nSample volume: {vol}")
print(f"Axial slice shape: {vol.axial(z=vol.shape[2]//2).shape}")

## Summary

This notebook:

1. **Checked** if RadiObject already exists (skip if so)
2. **Downloaded** the MSD Task06_Lung NIfTIs from the S3 backup
3. **Verified** the download and parsed `dataset.json` (no tar extraction needed)
4. **Derived** binary `has_tumor` labels from segmentation masks
5. **Captured** full metadata: shape, voxel spacing, labels
6. **Skipped** the separate S3 backup, since the raw NIfTIs were downloaded from it
7. **Created** RadiObject with NIfTI volumes and metadata

### S3 Locations (default)

| Resource | URI |
|----------|-----|
| RadiObject | `s3://souzy-scratch/msd-lung/radiobject` |
| Raw NIfTIs | `s3://souzy-scratch/msd-lung/nifti/` |

### Dataset Statistics

| Property | Value |
|----------|-------|
| Total subjects | 63 (training set) |
| Modality | CT |
| Labels | `has_tumor` (binary) |
| Metadata | shape, voxel spacing |
| Storage | S3 with TileDB compression |

### Next Steps

- [06_ml_training.ipynb](./06_ml_training.ipynb) - Train a tumor classifier using this RadiObject

### Loading from URI on Another Machine

```python
from config import MSD_LUNG_URI, S3_REGION
from radiobject.radi_object import RadiObject
from radiobject.ctx import configure, S3Config

if MSD_LUNG_URI.startswith("s3://"):
    configure(s3=S3Config(region=S3_REGION))

radi = RadiObject(MSD_LUNG_URI)
```