# RadiObject - The Top-Level Container

RadiObject is the top-level container for multi-collection radiology data. This notebook covers:

- Loading RadiObject from URI (S3 or local)
- Subject-level metadata (`obs_meta`)
- Collection discovery and access
- Subject indexing (`iloc`, `loc`, boolean masks)
- EDA filtering with `filter()`, `head()`, `tail()`, `sample()`
- Views and materialization
- **Pipeline Mode** with lazy `Query` builder for ETL and ML training

## Key Concepts

| Class | Description |
|-------|-------------|
| **RadiObject** | Top-level container organizing subject metadata (`obs_meta`) and multiple VolumeCollections |
| **VolumeCollection** | A group of Volumes with consistent X/Y/Z dimensions (e.g., all FLAIR scans) |
| **Volume** | A single 3D or 4D radiology acquisition backed by TileDB |
| **Query** | Lazy filter builder for pipeline mode (ETL, streaming, ML training) |

**Prerequisites:** Run [00_ingest_brats.ipynb](./00_ingest_brats.ipynb) first to create the RadiObject.

**Related notebooks:**
- [02_volume_collection.ipynb](./02_volume_collection.ipynb) - Collection-level operations
- [03_volume.ipynb](./03_volume.ipynb) - Single volume operations and partial reads

## Setup & Configuration

In [None]:
import sys
sys.path.insert(0, '..')

import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

from config import BRATS_URI, S3_REGION
from src.radi_object import RadiObject, RadiObjectView
from src.query import Query  # Pipeline mode query builder
from src.ctx import configure, S3Config, TileConfig, SliceOrientation, CompressionConfig, Compressor

print(f"RadiObject URI: {BRATS_URI}")

In [None]:
# Configure S3 if using S3 URI
if BRATS_URI.startswith("s3://"):
    configure(s3=S3Config(region=S3_REGION))

# Configure TileDB for axial-optimized storage
configure(
    tile=TileConfig(orientation=SliceOrientation.AXIAL),
    compression=CompressionConfig(algorithm=Compressor.ZSTD, level=3)
)

## Load RadiObject from URI

Load the pre-ingested BraTS data. The same code works for S3 or local paths.

In [None]:
radi = RadiObject(BRATS_URI)
radi

## Subject-Level Metadata: `obs_meta`

Access subject demographics and clinical information.

In [None]:
# Read all subject metadata
radi.obs_meta.read()

In [None]:
# Read specific columns
radi.obs_meta.read(columns=["obs_subject_id", "tumor_grade", "age"])

In [None]:
# Filter rows with a QueryCondition expression string
radi.obs_meta.read(value_filter="tumor_grade == 'HGG'")

## Collection Discovery

Discover and access VolumeCollections.

In [None]:
print(f"Collection names: {radi.collection_names}")
print(f"Number of collections: {radi.n_collections}")

In [None]:
# Access via attribute or method
flair = radi.FLAIR                    # Attribute access
flair_alt = radi.collection("FLAIR")  # Method access

# Display the VolumeCollection
flair

In [None]:
# Iterate over collection names
for name in radi:
    coll = radi.collection(name)
    print(f"{name}: {coll}")

## Subject Indexing

Select subjects using pandas-like indexing patterns.

In [None]:
# Integer-location indexing (iloc) - returns RadiObjectView
view_single = radi.iloc[0]           # First subject
view_slice = radi.iloc[0:3]          # First 3 subjects
view_list = radi.iloc[[0, 2, 4]]     # Specific positions

print(f"iloc[0]: {view_single}")
print(f"iloc[0:3]: {view_slice}")
print(f"iloc[[0, 2, 4]]: {view_list}")

In [None]:
# Label-based indexing (loc)
subject_ids = radi.obs_subject_ids
view_by_id = radi.loc[subject_ids[0]]
view_by_ids = radi.loc[[subject_ids[0], subject_ids[2]]]

print(f"loc['{subject_ids[0]}']: {view_by_id}")
print(f"loc[multiple]: {view_by_ids}")

In [None]:
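# Boolean mask indexing: one True/False per subject
# (assumes obs_meta.read() returns rows in the same order as iloc positions)
meta = radi.obs_meta.read()
mask = (meta["age"] > 40).to_numpy()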

view_filtered = radi.iloc[mask]
print(f"Subjects with age > 40: {view_filtered.obs_subject_ids}")
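
The three indexing styles above mirror pandas. As a rough, self-contained illustration (plain pandas on a made-up metadata table, not RadiObject itself), the sketch below shows what each style selects and why a boolean mask must have one entry per row, in row order. The subject IDs and ages are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a subject metadata table; all values are made up.
meta = pd.DataFrame(
    {
        "obs_subject_id": ["sub-01", "sub-02", "sub-03", "sub-04"],
        "age": [35, 52, 47, 29],
    }
)

# Positional selection: picks rows by position, regardless of labels.
first_two = meta.iloc[0:2]

# Label selection: index by subject ID, then look rows up by label.
by_id = meta.set_index("obs_subject_id").loc[["sub-01", "sub-03"]]

# Boolean mask: one True/False per row, in the same order as the rows.
mask = (meta["age"] > 40).to_numpy()
over_40 = meta.iloc[mask]

print(first_two["obs_subject_id"].tolist())
print(by_id.index.tolist())
print(over_40["obs_subject_id"].tolist())
```

If the mask were built from a table sorted differently from the object being indexed, the wrong subjects would be selected without any error, which is why the mask and the indexed object should come from the same ordering.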

## EDA Filtering with `filter()`, `head()`, `tail()`, `sample()`

Quick filtering for interactive exploration. All methods return `RadiObjectView`.

In [None]:
# Filter subjects by a metadata expression (EDA mode)
hgg_view = radi.filter("tumor_grade == 'HGG'")
hgg_view

In [None]:
# Compound filter
filtered_view = radi.filter("tumor_grade == 'HGG' and age > 40")
print(f"HGG and age > 40: {filtered_view.obs_subject_ids}")

In [None]:
# Head and tail
print(f"head(2): {radi.head(2).obs_subject_ids}")
print(f"tail(2): {radi.tail(2).obs_subject_ids}")

In [None]:
# Random sample (reproducible with seed)
sampled = radi.sample(n=3, seed=42)
print(f"sample(n=3, seed=42): {sampled.obs_subject_ids}")
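
For readers who know pandas, the EDA helpers above behave much like `DataFrame.query()`, `head()`, `tail()`, and `sample()`. The sketch below is a pandas analogue on a made-up table, offered only to make the semantics concrete; it does not show how RadiObject evaluates expressions internally, and pandas' `random_state` merely plays the role that `seed` plays above.

```python
import pandas as pd

# Toy metadata table; the values are invented for illustration.
meta = pd.DataFrame(
    {
        "obs_subject_id": ["sub-01", "sub-02", "sub-03", "sub-04", "sub-05"],
        "tumor_grade": ["HGG", "LGG", "HGG", "HGG", "LGG"],
        "age": [35, 52, 47, 61, 29],
    }
)

# Compound expression: both conditions must hold for a row to be kept.
hgg_over_40 = meta.query("tumor_grade == 'HGG' and age > 40")

# First and last rows in table order.
first_two = meta.head(2)
last_two = meta.tail(2)

# A fixed seed makes the random draw repeatable across runs.
sample_a = meta.sample(n=3, random_state=42)
sample_b = meta.sample(n=3, random_state=42)

print(hgg_over_40["obs_subject_id"].tolist())
print(sample_a.equals(sample_b))
```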

## Filtering Collections with `select_collections()`

In [None]:
# Select specific collections
tumor_view = radi.select_collections(["FLAIR", "T2w"])

print(f"Original: {radi.collection_names}")
print(f"Filtered: {tumor_view.collection_names}")

In [None]:
# Chain filters
chained = radi.iloc[0:3].select_collections(["FLAIR", "T1w"])

print(f"Subjects: {chained.obs_subject_ids}")
print(f"Collections: {chained.collection_names}")

## Views & Materialization

Every filtering operation returns a `RadiObjectView`: a lazy, immutable view that references the source data without copying it. Call `to_radi_object()` to write the selection to new storage.

In [None]:
# Views are lazy - they don't copy data
view = radi.iloc[0:2]
view
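
To make "lazy view" concrete, here is a minimal sketch of the general idea using a NumPy array as the backing store. `ToyView` is a hypothetical class written for this example and is not how `RadiObjectView` is implemented: it only stores row positions, narrowing a view just composes positions, and data is copied only when `materialize()` is called.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True, eq=False)
class ToyView:
    """Hypothetical lazy view: a shared source plus selected row positions."""

    source: np.ndarray      # shared reference, never copied by the view
    rows: tuple[int, ...]   # positions into `source`

    def select(self, positions):
        # Narrowing composes positions; no array data is read or copied.
        return ToyView(self.source, tuple(self.rows[p] for p in positions))

    def materialize(self) -> np.ndarray:
        # Only here is data actually gathered into a new array.
        return self.source[list(self.rows)].copy()


source = np.arange(12).reshape(4, 3)
view = ToyView(source, rows=(0, 1, 2, 3)).select([0, 2])
subset = view.materialize()

print(view.rows)
print(subset)
```

Because the view is frozen, each selection produces a new view object and the original stays usable, which is the property that makes chained filters safe.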

In [None]:
# Materialize view to new storage
TEMP_DIR = tempfile.mkdtemp(prefix="radi_demo_")
subset_uri = str(Path(TEMP_DIR) / "subset")
subset_view = radi.iloc[0:2].select_collections(["FLAIR"])

subset_radi = subset_view.to_radi_object(subset_uri)
subset_radi

In [None]:
# Verify data integrity: compare the same axial slice from the source and the materialized copy
orig = radi.FLAIR.iloc[0].axial(z=77)
copy = subset_radi.FLAIR.iloc[0].axial(z=77)
print(f"Data matches: {np.allclose(orig, copy)}")

# Cleanup
import shutil
shutil.rmtree(TEMP_DIR)
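
A note on the comparison used in the integrity check: `np.allclose` tolerates small floating-point differences but treats `NaN` as unequal to `NaN` unless `equal_nan=True` is passed. The sketch below uses a small made-up array to show the difference between the tolerant and exact checks. Whether a given slice contains `NaN` values, and whether an exact match is expected after a round trip, depends on the data and storage settings, so treat this as a general NumPy illustration rather than a statement about RadiObject.

```python
import numpy as np

# Made-up slice containing a NaN, plus an identical copy of it.
original = np.array([[0.0, 1.5], [np.nan, 3.0]], dtype=np.float32)
roundtrip = original.copy()

# Default comparisons treat NaN as different from NaN.
close_default = np.allclose(original, roundtrip)
exact_default = np.array_equal(original, roundtrip)

# equal_nan=True treats NaNs in matching positions as equal.
close_nan_ok = np.allclose(original, roundtrip, equal_nan=True)
exact_nan_ok = np.array_equal(original, roundtrip, equal_nan=True)

print(close_default, exact_default, close_nan_ok, exact_nan_ok)
```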

## S3 Configuration

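RadiObject reads from and writes to S3 directly. Configure the region and upload settings once with `configure()`, then pass `s3://` URIs wherever a local path is accepted.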

```python
from src.ctx import configure, S3Config
from src.radi_object import RadiObject

configure(
    s3=S3Config(
        region="us-east-2",
        max_parallel_ops=16,
        multipart_part_size_mb=100,
    )
)

# Load from S3 (bucket and prefix are placeholders)
radi = RadiObject("s3://bucket/study")

# Materialize a view to S3
view = radi.iloc[0:2]
view.to_radi_object("s3://bucket/subset")
```

## Validation

In [None]:
radi.validate()
print("Validation passed")

## Next Steps

- [02_volume_collection.ipynb](./02_volume_collection.ipynb) - Working with volume groups
- [03_volume.ipynb](./03_volume.ipynb) - Single volume operations and partial reads
- [04_storage_configuration.ipynb](./04_storage_configuration.ipynb) - Tile orientation, compression, S3

## Pipeline Mode with `query()`

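For ETL pipelines and ML training, use the lazy `Query` builder returned by `radi.query()`. Where EDA mode (`iloc`, `loc`, `filter()`, `head()`) hands back a `RadiObjectView` for interactive use, a `Query` only records the requested filters; no data is accessed until you call a method such as `count()`, `to_obs_meta()`, `iter_volumes()`, `iter_batches()`, or `to_radi_object()`.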

| Mode | Entry Point | Returns | Use Case |
|------|-------------|---------|----------|
| **EDA** | `radi.iloc[]`, `radi.loc[]`, `radi.head()` | `RadiObjectView` | Interactive exploration |
| **Pipeline** | `radi.query()` | `Query` (lazy) | ETL, streaming, ML training |
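
The builder pattern behind a lazy query can be sketched in a few lines. `ToyQuery` below is a hypothetical class over a pandas table, written only to illustrate the idea of recording steps and running them later; it is not the library's `Query` and it simplifies ordering (the row limit is always applied after all filters).

```python
from __future__ import annotations

from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True, eq=False)
class ToyQuery:
    """Hypothetical lazy builder: stores filter expressions, runs them on demand."""

    table: pd.DataFrame
    exprs: tuple[str, ...] = ()
    limit: int | None = None

    def filter(self, expr: str) -> ToyQuery:
        # Record the expression; nothing is evaluated yet.
        return ToyQuery(self.table, self.exprs + (expr,), self.limit)

    def head(self, n: int) -> ToyQuery:
        # Record the row limit (simplification: applied after all filters).
        return ToyQuery(self.table, self.exprs, n)

    def to_frame(self) -> pd.DataFrame:
        # Terminal step: only now are the recorded filters executed.
        out = self.table
        for expr in self.exprs:
            out = out.query(expr)
        return out if self.limit is None else out.head(self.limit)


# Made-up metadata for the demonstration.
meta = pd.DataFrame(
    {
        "obs_subject_id": ["sub-01", "sub-02", "sub-03", "sub-04"],
        "tumor_grade": ["HGG", "LGG", "HGG", "HGG"],
        "age": [35, 52, 47, 61],
    }
)

q = ToyQuery(meta).filter("tumor_grade == 'HGG'").filter("age > 40").head(1)
result = q.to_frame()

print(q.exprs)
print(result["obs_subject_id"].tolist())
```

Keeping each builder step cheap and side-effect free is what lets a pipeline assemble, inspect, and reuse a query before paying for any reads.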

In [None]:
# Create a lazy Query - no data access yet
q = radi.query()
print(f"Query type: {type(q).__name__}")
print(f"Query: {q}")

In [None]:
# Chain filters - still lazy, no data access
filtered = (
    q.filter("tumor_grade == 'HGG'")
    .select_collections(["FLAIR", "T1w"])
    .head(3)
)

# Inspect query without materializing (only reads metadata)
print(f"Filtered Query: {filtered}")
counts = filtered.count()
print(f"Subject count: {counts.n_subjects}")
print(f"Volume counts: {counts.n_volumes}")

In [None]:
# Materialize to DataFrame (triggers metadata read)
filtered.to_obs_meta()

### Streaming Iteration

Iterate over volumes one at a time without loading them all into memory.

In [None]:
# Iterate over volumes (streaming - memory efficient)
for vol in filtered.iter_volumes():
    print(f"Volume: {vol.obs_id}, shape: {vol.shape}")
    break  # Just show first

In [None]:
# Batch iteration for ML training - returns stacked numpy arrays
for batch in filtered.iter_batches(batch_size=2):
    print(f"Batch subjects: {batch.subject_ids}")
    for coll_name, arr in batch.volumes.items():
        print(f"  {coll_name} shape: {arr.shape}")
    break  # Just show first batch
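
The memory benefit of batch iteration comes from holding only one batch at a time. The sketch below shows the general generator pattern with NumPy; `iter_stacked_batches` and `fake_volumes` are hypothetical helpers for this example, not part of RadiObject, and the tiny volume shape is chosen only to keep the demo fast.

```python
from typing import Iterable, Iterator

import numpy as np


def iter_stacked_batches(
    volumes: Iterable[np.ndarray], batch_size: int
) -> Iterator[np.ndarray]:
    """Yield arrays of shape (batch, *volume_shape), one batch at a time."""
    buffer = []
    for vol in volumes:
        buffer.append(vol)
        if len(buffer) == batch_size:
            yield np.stack(buffer)
            buffer = []  # drop references so the previous batch can be freed
    if buffer:
        # Final partial batch when the count is not a multiple of batch_size.
        yield np.stack(buffer)


def fake_volumes(n: int, shape=(4, 4, 2)) -> Iterator[np.ndarray]:
    # Stand-in for a streaming source: each volume is created on demand.
    for i in range(n):
        yield np.full(shape, i, dtype=np.float32)


shapes = [b.shape for b in iter_stacked_batches(fake_volumes(5), batch_size=2)]
print(shapes)
```

Stacking requires every volume in a batch to share the same shape, which is consistent with a VolumeCollection holding volumes of uniform X/Y/Z dimensions.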

### Streaming Export

Export query results to a new RadiObject with memory-efficient streaming.

In [None]:
# Export query results to new RadiObject (streaming=True for memory efficiency)
TEMP_DIR_QUERY = tempfile.mkdtemp(prefix="radi_query_")
subset_uri_query = str(Path(TEMP_DIR_QUERY) / "query_subset")

subset_from_query = filtered.to_radi_object(subset_uri_query, streaming=True)
print(f"Exported: {subset_from_query}")

# Cleanup
import shutil
shutil.rmtree(TEMP_DIR_QUERY)