# BraTS 2020 Data Ingestion

Ingests BraTS 2020 challenge data into a RadiObject. Run this **once** before notebooks 01-04.

- Checks whether a RadiObject already exists at the target URI and skips ingestion if so
- Creates a RadiObject with 5 collections: FLAIR, T1w, T1gd, T2w, seg
- Attaches subject metadata: age, survival days, resection status

**Data Source:** [BraTS 2020 Challenge](https://www.med.upenn.edu/cbica/brats2020/data.html) via [Kaggle](https://www.kaggle.com/datasets/awsaf49/brats20-dataset-training-validation). Requires Kaggle API setup ([instructions](https://www.kaggle.com/docs/api)). For a no-auth alternative, set `with_metadata=False` in the download cell to use the public MSD bucket (no clinical metadata).

**Configuration:** See [S3 Setup](https://srdsam.github.io/RadiObject/how-to/s3-setup/) for cloud storage options.

In [1]:
import json
from pathlib import Path

import pandas as pd

from radiobject import (
    CompressionConfig,
    Compressor,
    RadiObject,
    S3Config,
    SliceOrientation,
    TileConfig,
    WriteConfig,
    configure,
    uri_exists,
)
from radiobject.data import get_brats_nifti_path

# ── Storage URI ──────────────────────────────────────────────────
# Default: S3 (requires AWS credentials)
BRATS_URI = "s3://souzy-scratch/radiobject/brats-tutorial"
# For local storage, comment out the line above and uncomment:
# BRATS_URI = "./data/brats_radiobject"
# ─────────────────────────────────────────────────────────────────

print(f"Target URI: {BRATS_URI}")

Target URI: s3://souzy-scratch/radiobject/brats-tutorial


In [2]:
# Configure TileDB storage
configure(
    s3=S3Config(region="us-east-2"),
    write=WriteConfig(
        tile=TileConfig(orientation=SliceOrientation.AXIAL),
        compression=CompressionConfig(algorithm=Compressor.ZSTD, level=3),
    ),
)

In [3]:
if uri_exists(BRATS_URI):
    print(f"RadiObject already exists at {BRATS_URI}")
    print("Skipping ingestion. Delete the URI to re-ingest.")
    SKIP_INGESTION = True
else:
    print(f"No RadiObject found at {BRATS_URI}")
    print("Proceeding with ingestion...")
    SKIP_INGESTION = False

RadiObject already exists at s3://souzy-scratch/radiobject/brats-tutorial
Skipping ingestion. Delete the URI to re-ingest.


In [4]:
if not SKIP_INGESTION:
    # Get BraTS 2020 data with metadata (downloads from Kaggle if not cached)
    # Set with_metadata=False to use MSD version without Kaggle API
    NIFTI_DIR = get_brats_nifti_path(with_metadata=True)

    # Load manifest - contains paths and metadata for each subject
    manifest_path = NIFTI_DIR / "manifest.json"
    with open(manifest_path) as f:
        manifest = json.load(f)

    print(f"Found {len(manifest)} BraTS subjects")
    print("Sample entry:")
    print(json.dumps(manifest[0], indent=2))
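
Because ingestion was skipped in this run, no sample entry is printed. As a rough illustration, the sketch below writes and reloads a manifest with a single hypothetical entry. The key names are the ones the later cells read (`sample_id`, the five `*_path` keys, and the clinical fields); the values, the relative path layout, and the use of strings for numeric fields are assumptions made for illustration, not the real manifest contents.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical manifest entry: keys mirror those read by the later cells,
# values are made up for illustration.
example_entry = {
    "sample_id": "subject_001",
    "t1_path": "subject_001/t1.nii.gz",
    "t1ce_path": "subject_001/t1ce.nii.gz",
    "t2_path": "subject_001/t2.nii.gz",
    "flair_path": "subject_001/flair.nii.gz",
    "seg_path": "subject_001/seg.nii.gz",
    "age": "61.2",
    "survival_days": "300",
    "resection_status": "GTR",
}

with tempfile.TemporaryDirectory() as tmp:
    manifest_path = Path(tmp) / "manifest.json"
    # Assumption: the manifest is a JSON list with one dict per subject.
    manifest_path.write_text(json.dumps([example_entry], indent=2))

    with open(manifest_path) as f:
        loaded_manifest = json.load(f)

# Collect the keys that point at NIfTI files.
path_keys = sorted(k for k in loaded_manifest[0] if k.endswith("_path"))
print(f"Found {len(loaded_manifest)} subject(s)")
print(f"Path keys: {path_keys}")
```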

## Subject Metadata (obs_meta)

The `obs_meta` DataFrame holds subject-level metadata, one row per patient.

- `obs_subject_id`: Unique subject identifier (required); links volumes across modalities
- Additional columns: `age`, `survival_days`, and `resection_status` from the BraTS challenge, plus a constant `dataset` label
- `obs_ids`: System-managed JSON list of the volume `obs_id` values for each subject (auto-populated after ingestion)

The BraTS 2020 dataset includes real clinical metadata for survival prediction tasks.
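
The one-row-per-patient requirement and the numeric coercion used in the next cell can be sketched with plain pandas. The entries below are invented, and the uniqueness check is a hand-written sanity check, not something the source says RadiObject performs.

```python
import pandas as pd

# Made-up entries; "unknown" stands in for a value that cannot be parsed as a number.
entries = [
    {"sample_id": "subject_001", "age": "60.5", "survival_days": "289"},
    {"sample_id": "subject_002", "age": "52.3", "survival_days": "unknown"},
    {"sample_id": "subject_003", "age": None, "survival_days": "464"},
]

obs_meta_sketch = pd.DataFrame(
    {
        "obs_subject_id": [e["sample_id"] for e in entries],
        "age": [e.get("age") for e in entries],
        "survival_days": [e.get("survival_days") for e in entries],
    }
)

# errors="coerce" turns unparseable or missing values into NaN instead of raising.
for col in ("age", "survival_days"):
    obs_meta_sketch[col] = pd.to_numeric(obs_meta_sketch[col], errors="coerce")

# One row per patient means the identifier column must not repeat.
ids_are_unique = obs_meta_sketch["obs_subject_id"].is_unique

print(obs_meta_sketch.dtypes)
print(obs_meta_sketch.isna().sum())
print(f"Unique subject IDs: {ids_are_unique}")
```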

In [5]:
if not SKIP_INGESTION:
    # Filter to subjects with complete data (all modalities + segmentation files exist)
    def has_complete_files(entry: dict, base_dir: Path) -> bool:
        """Check that all required NIfTI files exist for this subject."""
        required_keys = ["t1_path", "t1ce_path", "t2_path", "flair_path", "seg_path"]
        # Every key must be present and point at an existing file
        return all(key in entry and (base_dir / entry[key]).exists() for key in required_keys)

    complete_entries = [e for e in manifest if has_complete_files(e, NIFTI_DIR)]
    print(f"Using {len(complete_entries)} subjects with complete data")

    # Build obs_meta from manifest metadata
    obs_meta = pd.DataFrame(
        {
            "obs_subject_id": [entry["sample_id"] for entry in complete_entries],
            "age": [entry.get("age") for entry in complete_entries],
            "survival_days": [entry.get("survival_days") for entry in complete_entries],
            "resection_status": [
                entry.get("resection_status", "") or "" for entry in complete_entries
            ],
            "dataset": "BraTS2020",
        }
    )

    # Convert numeric columns
    obs_meta["age"] = pd.to_numeric(obs_meta["age"], errors="coerce")
    obs_meta["survival_days"] = pd.to_numeric(obs_meta["survival_days"], errors="coerce")

    print(f"Created obs_meta with {len(obs_meta)} subjects")
    print("Metadata summary:")
    age_col = obs_meta["age"].dropna()
    print(f"  Age: {age_col.min():.0f} - {age_col.max():.0f} years")
    resection_counts = obs_meta["resection_status"].value_counts().to_dict()
    print(f"  Resection status: {resection_counts}")
    display(obs_meta.head(10))

In [6]:
if not SKIP_INGESTION:
    # Build images dict mapping collection names to (path, subject_id) lists
    images = {
        "T1w": [(NIFTI_DIR / entry["t1_path"], entry["sample_id"]) for entry in complete_entries],
        "T1gd": [
            (NIFTI_DIR / entry["t1ce_path"], entry["sample_id"]) for entry in complete_entries
        ],
        "T2w": [(NIFTI_DIR / entry["t2_path"], entry["sample_id"]) for entry in complete_entries],
        "FLAIR": [
            (NIFTI_DIR / entry["flair_path"], entry["sample_id"]) for entry in complete_entries
        ],
        "seg": [(NIFTI_DIR / entry["seg_path"], entry["sample_id"]) for entry in complete_entries],
    }

    print("Collections to ingest:")
    for name, paths in images.items():
        print(f"  {name}: {len(paths)} volumes")
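
Before handing the dict to the ingestion call, it can be useful to confirm that every collection lists the same subjects. The sketch below is a hand-rolled pre-check on a toy dict of the same shape. The helper name `subjects_missing_per_collection` is hypothetical, and this is not a description of what `validate_alignment=True` does internally.

```python
from pathlib import Path

# Toy stand-in for the `images` dict: collection name -> [(path, subject_id), ...]
images_sketch = {
    "T1w": [(Path("s1/t1.nii.gz"), "s1"), (Path("s2/t1.nii.gz"), "s2")],
    "FLAIR": [(Path("s1/flair.nii.gz"), "s1"), (Path("s2/flair.nii.gz"), "s2")],
    "seg": [(Path("s1/seg.nii.gz"), "s1")],  # s2 is deliberately missing
}


def subjects_missing_per_collection(images: dict) -> dict:
    """Return, per collection, the subject IDs present elsewhere but absent here."""
    ids_by_collection = {
        name: {subject_id for _, subject_id in pairs} for name, pairs in images.items()
    }
    all_ids = set().union(*ids_by_collection.values())
    return {
        name: sorted(all_ids - ids)
        for name, ids in ids_by_collection.items()
        if all_ids - ids
    }


missing = subjects_missing_per_collection(images_sketch)
print(missing)  # {'seg': ['s2']}
```

An empty result means every collection covers the same set of subjects.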

In [7]:
if not SKIP_INGESTION:
    print(f"Creating RadiObject at: {BRATS_URI}")

    radi = RadiObject.from_images(
        uri=BRATS_URI,
        images=images,
        obs_meta=obs_meta,
        validate_alignment=True,
        progress=True,
    )

    print(f"Created: {radi}")

In [8]:
if not SKIP_INGESTION:
    radi.validate()
    print("Validation passed")

    print(f"Collections: {radi.collection_names}")
    print(f"Subjects: {len(radi)}")

In [9]:
# Load from URI (works whether we just created it or it already existed)
radi = RadiObject(BRATS_URI)

print(f"Loaded: {radi}")
print(f"Collections: {radi.collection_names}")
print(f"Subjects: {len(radi)}")

# Quick data check
vol = radi.FLAIR.iloc[0]
print(f"Sample volume: {vol}")
print(f"Axial slice shape: {vol.axial(z=77).shape}")

Loaded: RadiObject(368 subjects, 5 collections: [seg, T2w, FLAIR, T1gd, T1w])


Collections: ('seg', 'T2w', 'FLAIR', 'T1gd', 'T1w')
Subjects: 368


Sample volume: Volume(shape=240x240x155, dtype=int16, obs_id='BraTS20_Training_001_FLAIR')


Axial slice shape: (240, 240)
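
The slice shape follows directly from the volume shape. The NumPy sketch below uses a zero-filled array with the reported shape and dtype. It assumes that the third axis is the axial (z) direction, which is consistent with the printed shapes but is not stated by the source.

```python
import numpy as np

# Zero-filled stand-in with the shape and dtype reported for the sample volume.
volume = np.zeros((240, 240, 155), dtype=np.int16)

# Assumption: the third axis is the axial (z) direction.
axial_slice = volume[:, :, 77]

# Fixing an index on the first axis instead gives a slice that spans z.
first_axis_slice = volume[120, :, :]

print(axial_slice.shape)       # (240, 240)
print(first_axis_slice.shape)  # (240, 155)
```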


## obs_meta vs obs: Subject vs Volume Metadata

RadiObject has two levels of metadata:

| Level | Accessor | Scope | Example Fields |
|-------|----------|-------|----------------|
| **Subject** | `radi.obs_meta` | One row per patient | obs_subject_id, age, survival_days, obs_ids (system-managed) |
| **Volume** | `radi.FLAIR.obs` | One row per volume | obs_id, obs_subject_id, voxel_spacing, dimensions |

The `obs_subject_id` column links the two levels: each subject can have multiple volumes across collections. The `obs_ids` column in `obs_meta` is an auto-populated JSON list of every volume `obs_id` linked to that subject.
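
The relationship between the two levels can be sketched with two toy pandas tables. The IDs and values below are invented, and the tables are plain DataFrames, not RadiObject accessors. The sketch only shows how a shared `obs_subject_id` key and a JSON-encoded `obs_ids` column let you move between subjects and volumes.

```python
import json

import pandas as pd

# Toy subject-level table; obs_ids is stored as a JSON string.
subject_table = pd.DataFrame(
    {
        "obs_subject_id": ["subject_001", "subject_002"],
        "age": [60.5, 52.3],
        "obs_ids": [
            json.dumps(["subject_001_FLAIR", "subject_001_T1w"]),
            json.dumps(["subject_002_FLAIR", "subject_002_T1w"]),
        ],
    }
)

# Toy volume-level table for one collection.
flair_table = pd.DataFrame(
    {
        "obs_id": ["subject_001_FLAIR", "subject_002_FLAIR"],
        "obs_subject_id": ["subject_001", "subject_002"],
    }
)

# Volume -> subject: attach subject metadata through the shared key.
volumes_with_age = flair_table.merge(
    subject_table[["obs_subject_id", "age"]], on="obs_subject_id", how="left"
)

# Subject -> volumes: decode the JSON list.
ids_for_first_subject = json.loads(subject_table.loc[0, "obs_ids"])

print(volumes_with_age)
print(ids_for_first_subject)
```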

In [10]:
# Subject-level metadata (one row per patient)
print("Subject metadata (obs_meta):")
display(radi.obs_meta.read().head())

# Volume-level metadata (one row per volume in a collection)
print("Volume metadata (FLAIR.obs):")
display(
    radi.FLAIR.obs.read(columns=["obs_id", "obs_subject_id", "dimensions", "voxel_spacing"]).head()
)

Subject metadata (obs_meta):


| | obs_subject_id | age | survival_days | resection_status | dataset | obs_ids |
|---|----------------|-----|---------------|------------------|---------|---------|
| 0 | BraTS20_Training_001 | 60.463 | 289.0 | GTR | BraTS2020 | `["BraTS20_Training_001_FLAIR", "BraTS20_Traini...` |
| 1 | BraTS20_Training_002 | 52.263 | 616.0 | GTR | BraTS2020 | `["BraTS20_Training_002_FLAIR", "BraTS20_Traini...` |
| 2 | BraTS20_Training_003 | 54.301 | 464.0 | GTR | BraTS2020 | `["BraTS20_Training_003_FLAIR", "BraTS20_Traini...` |
| 3 | BraTS20_Training_004 | 39.068 | 788.0 | GTR | BraTS2020 | `["BraTS20_Training_004_FLAIR", "BraTS20_Traini...` |
| 4 | BraTS20_Training_005 | 68.493 | 465.0 | GTR | BraTS2020 | `["BraTS20_Training_005_FLAIR", "BraTS20_Traini...` |


Volume metadata (FLAIR.obs):


| | obs_subject_id | obs_id | dimensions | voxel_spacing |
|---|----------------|--------|------------|---------------|
| 0 | BraTS20_Training_001 | BraTS20_Training_001_FLAIR | (240, 240, 155) | (1.0, 1.0, 1.0) |
| 1 | BraTS20_Training_002 | BraTS20_Training_002_FLAIR | (240, 240, 155) | (1.0, 1.0, 1.0) |
| 2 | BraTS20_Training_003 | BraTS20_Training_003_FLAIR | (240, 240, 155) | (1.0, 1.0, 1.0) |
| 3 | BraTS20_Training_004 | BraTS20_Training_004_FLAIR | (240, 240, 155) | (1.0, 1.0, 1.0) |
| 4 | BraTS20_Training_005 | BraTS20_Training_005_FLAIR | (240, 240, 155) | (1.0, 1.0, 1.0) |


In [11]:
# Filter subjects by clinical metadata
# Example: subjects over 50 with gross total resection (GTR)
filtered = radi.filter("age > 50 and resection_status == 'GTR'")
print(f"Subjects over 50 with GTR: {len(filtered)}")
subject_ids = filtered.obs_subject_ids[:5]
print(f"Subject IDs: {subject_ids}...")

Subjects over 50 with GTR: 101
Subject IDs: ['BraTS20_Training_001', 'BraTS20_Training_002', 'BraTS20_Training_003', 'BraTS20_Training_005', 'BraTS20_Training_006']...
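
To see what the filter expression selects, the same condition can be applied to the five `obs_meta` rows shown earlier using pandas `DataFrame.query`. That `radi.filter` evaluates its expression in the same way is an assumption. The sketch only reproduces the predicate on a plain DataFrame.

```python
import pandas as pd

# First five rows of the obs_meta output shown earlier in this notebook.
head = pd.DataFrame(
    {
        "obs_subject_id": [f"BraTS20_Training_00{i}" for i in range(1, 6)],
        "age": [60.463, 52.263, 54.301, 39.068, 68.493],
        "resection_status": ["GTR"] * 5,
    }
)

# Same predicate as the filter call: older than 50 and gross total resection.
selected = head.query("age > 50 and resection_status == 'GTR'")
selected_ids = selected["obs_subject_id"].tolist()
print(selected_ids)
```

Subject `BraTS20_Training_004` (age 39.068) is dropped, which matches its absence from the filtered IDs printed above.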


## Next Steps

The RadiObject is now available at `BRATS_URI`. Proceed to the tutorial notebooks:

- [01_explore_data.ipynb](./01_explore_data.ipynb) - Explore RadiObject, collections, and volumes
- [02_configuration.ipynb](./02_configuration.ipynb) - Write settings, read tuning, S3 config