# Storage & Configuration

This notebook explains how RadiObject uses TileDB for efficient radiology data storage. Understanding these concepts helps you:

- **Optimize storage** for your analysis patterns
- **Choose the right tile orientation** for your workflow
- **Configure S3** for cloud-native storage
- **Understand fragments** and when to consolidate

**Prerequisites:** [01_radi_object.ipynb](./01_radi_object.ipynb), [03_volume.ipynb](./03_volume.ipynb)

## TileDB Fundamentals

TileDB stores data in **tiles** (chunks). When you read data, TileDB fetches only the tiles that overlap your query region.

```
        3D Volume (240 x 240 x 155)
    +-------------------------------------+
    |  +--+--+--+--+--+--+--+--+--+--+    |
    |  |  |  |  |  |  |  |  |  |  |  |    |  <- Each small box is a TILE
    |  +--+--+--+--+--+--+--+--+--+--+    |
    |  |  |  |  |  |  |  |  |  |  |  |    |     Tiles are independently
Z   |  +--+--+--+--+--+--+--+--+--+--+    |     compressed and stored
    |  |  |  |  |  |  |  |  |  |  |  |    |
    |  +--+--+--+--+--+--+--+--+--+--+    |
    |  |  |  |  |  |  |  |  |  |  |  |    |
    |  +--+--+--+--+--+--+--+--+--+--+    |
    +-------------------------------------+
                X x Y
```

**Key insight:** Tile shape determines read efficiency. If your tiles match your query pattern, you read exactly what you need.
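
As a rough illustration of that overlap rule, the sketch below counts how many tiles a box-shaped query touches for a given tile shape. It is plain Python, not RadiObject or TileDB code: `tiles_touched` is a hypothetical helper, and it assumes the tile grid is aligned to the array origin.

```python
from math import prod


def tiles_touched(query, extents):
    """Count tiles overlapped by a half-open box query.

    query   -- one (start, stop) pair per dimension, stop exclusive
    extents -- tile length per dimension
    Assumes the tile grid starts at index 0 in every dimension.
    """
    per_dim = []
    for (start, stop), extent in zip(query, extents):
        first_tile = start // extent
        last_tile = (stop - 1) // extent
        per_dim.append(last_tile - first_tile + 1)
    return prod(per_dim)


# One axial slice (z = 77) of a 240 x 240 x 155 volume
axial_slice = [(0, 240), (0, 240), (77, 78)]

print(tiles_touched(axial_slice, (240, 240, 1)))  # slice-shaped tiles -> 1
print(tiles_touched(axial_slice, (64, 64, 64)))   # cubic tiles -> 16
```

With slice-shaped tiles the query maps to a single tile; with 64-voxel cubes the same slice cuts through a 4 x 4 grid of tiles, each of which has to be fetched and decompressed in full.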

## Tile Orientations

RadiObject supports four tile orientations, each optimized for a different access pattern. The tile shapes below are for a 240 x 240 x 155 volume:

| Orientation | Tile Shape | Best For |
|-------------|------------|----------|
| `AXIAL` | 240 x 240 x 1 | Slice-by-slice viewing (neuroimaging, CT review) |
| `SAGITTAL` | 1 x 240 x 155 | Sagittal plane analysis |
| `CORONAL` | 240 x 1 x 155 | Coronal plane analysis |
| `ISOTROPIC` | 64 x 64 x 64 | 3D ROI extraction (ML training, tumor analysis) |

```
    AXIAL (XY slices)          SAGITTAL (YZ slices)       ISOTROPIC (64^3 cubes)
    +--------------+           +--------------+           +--------------+
    |==============| Z=0       | ||           |           | +--+--+--+   |
    |==============| Z=1       | ||           | X=0       | +--+--+--+   |
    |==============| Z=2       | ||           | X=1       | +--+--+--+   |
    |      ...     |           | || ...       |           |    ...       |
    |==============| Z=n       | ||           |           | (3D chunks)  |
    +--------------+           +--------------+           +--------------+
    
    Reading Z=77:              Reading X=120:             Reading ROI:
    Reads 1 tile               Reads 1 tile               Reads ~8 tiles
```
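
The read counts in the diagram can be reproduced with a short calculation. The sketch below is a simplified model, not RadiObject code: the `EXTENTS` table copies the tile shapes listed above for a 240 x 240 x 155 volume, `read_cost` is a hypothetical helper, and edge tiles are counted at full size, so the isotropic numbers are slight overestimates.

```python
from math import prod

# Tile shapes from the table above, for a 240 x 240 x 155 volume
EXTENTS = {
    "axial": (240, 240, 1),
    "sagittal": (1, 240, 155),
    "coronal": (240, 1, 155),
    "isotropic": (64, 64, 64),
}


def read_cost(query, extents):
    """Return (tiles read, voxels fetched / voxels requested)."""
    tiles = prod((stop - 1) // e - start // e + 1
                 for (start, stop), e in zip(query, extents))
    requested = prod(stop - start for start, stop in query)
    fetched = tiles * prod(extents)
    return tiles, fetched / requested


# Half-open (start, stop) ranges per dimension
QUERIES = {
    "axial slice z=77": [(0, 240), (0, 240), (77, 78)],
    "sagittal slice x=120": [(120, 121), (0, 240), (0, 155)],
    "64^3 ROI": [(100, 164), (100, 164), (40, 104)],
}

for query_name, query in QUERIES.items():
    for orient, extents in EXTENTS.items():
        tiles, amplification = read_cost(query, extents)
        print(f"{query_name:22s} {orient:10s} tiles={tiles:4d} "
              f"amplification={amplification:7.1f}x")
```

In this model an axial slice costs one tile under `AXIAL` but 240 tiles under `SAGITTAL`, and an unaligned 64^3 ROI touches 8 isotropic tiles, which matches the diagram.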

In [None]:
import sys
sys.path.insert(0, '..')

import tempfile
import shutil
from pathlib import Path
import time
import os

import numpy as np
import matplotlib.pyplot as plt

from config import BRATS_URI, S3_REGION
from src.radi_object import RadiObject
from src.volume import Volume
from src.ctx import (
    configure, get_config,
    TileConfig, CompressionConfig, IOConfig, S3Config,
    SliceOrientation, Compressor
)

TEMP_DIR = tempfile.mkdtemp(prefix="storage_tutorial_")
print(f"Working directory: {TEMP_DIR}")

## Loading Real Data from URI

Before exploring storage configuration, let's load real data from the configured URI.

In [None]:
# Configure S3 if using S3 URI
if BRATS_URI.startswith("s3://"):
    configure(s3=S3Config(region=S3_REGION))

# Load RadiObject from configured URI
radi = RadiObject(BRATS_URI)
print(f"Loaded: {radi}")
print(f"Collections: {radi.collection_names}")

# Quick data access example
vol = radi.FLAIR.iloc[0]
print(f"\nSample volume shape: {vol.shape}")

## Configuring RadiObject

Use `configure()` to set global storage options before creating volumes.

In [None]:
# Configure for axial slice access (default)
configure(
    tile=TileConfig(orientation=SliceOrientation.AXIAL),
    compression=CompressionConfig(algorithm=Compressor.ZSTD, level=3),
)

# View current configuration
config = get_config()
print(f"Tile orientation: {config.tile.orientation}")
print(f"Compression: {config.compression.algorithm}, level={config.compression.level}")

In [None]:
# See how tile extents are computed for different orientations
shape = (240, 240, 155)

for orient in SliceOrientation:
    tile_cfg = TileConfig(orientation=orient)
    extents = tile_cfg.extents_for_shape(shape)
    print(f"{orient.value:10s} -> tile extents: {extents}")

## Orientation Benchmark

Let's compare read performance for different orientations using synthetic data.

In [None]:
# Create test data
test_data = np.random.randn(240, 240, 155).astype(np.float32)

# Create volumes with different tile orientations
volumes = {}
for orient in [SliceOrientation.AXIAL, SliceOrientation.SAGITTAL, SliceOrientation.ISOTROPIC]:
    configure(tile=TileConfig(orientation=orient))
    uri = str(Path(TEMP_DIR) / f"vol_{orient.value}")
    volumes[orient.value] = Volume.from_numpy(uri, test_data)
    print(f"Created {orient.value}: {volumes[orient.value]}")

In [None]:
# Benchmark: read axial slices from each volume
z_indices = range(0, 150, 3)  # every 3rd slice
n_reads = len(z_indices)      # derived from the range so the printout stays accurate
results = {}

for name, vol in volumes.items():
    start = time.perf_counter()
    for z in z_indices:
        _ = vol.axial(z)
    elapsed = time.perf_counter() - start
    results[name] = elapsed
    print(f"{name:10s}: {elapsed*1000:.1f}ms for {n_reads} axial reads")

print(f"\nAxial-tiled is {results['isotropic']/results['axial']:.1f}x faster than isotropic-tiled for axial reads")

In [None]:
# Benchmark: read sagittal slices from each volume
x_indices = range(0, 240, 5)  # every 5th slice
results_sag = {}

for name, vol in volumes.items():
    start = time.perf_counter()
    for x in x_indices:
        _ = vol.sagittal(x)
    elapsed = time.perf_counter() - start
    results_sag[name] = elapsed
    print(f"{name:10s}: {elapsed*1000:.1f}ms for {len(x_indices)} sagittal reads")

print(f"\nSagittal-tiled is {results_sag['axial']/results_sag['sagittal']:.1f}x faster than axial-tiled for sagittal reads")
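
Single-pass timings like the ones above can be skewed by cold caches and background load. One possible refinement, shown here as a sketch, is to repeat each measurement and report the median. `median_read_time` is a hypothetical helper, not part of RadiObject; it only assumes the read function takes a single index.

```python
import statistics
import time


def median_read_time(read_fn, indices, repeats=5):
    """Median wall-clock seconds to read all `indices`, over `repeats` passes."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for i in indices:
            read_fn(i)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in for vol.axial so the sketch runs without any data on disk
def fake_read(z):
    return [z] * 1000


elapsed = median_read_time(fake_read, range(0, 150, 3))
print(f"median: {elapsed * 1000:.3f} ms for 50 reads")
```

With the volumes from the benchmark cells, the call would look like `median_read_time(vol.axial, range(0, 150, 3))`.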

## Fragments & Consolidation

TileDB uses **fragments** for efficient writes. Each write creates a new fragment.

```
    Array on disk:
    +-------------------------------------+
    |  fragment_1/    (initial write)     |
    |  fragment_2/    (update)            |
    |  fragment_3/    (another update)    |
    |  __meta/        (array metadata)    |
    |  __schema       (array schema)      |
    +-------------------------------------+
    
    After consolidation and vacuuming (schematic):
    +-------------------------------------+
    |  __consolidated/  (merged data)     |
    |  __meta/                            |
    |  __schema                           |
    +-------------------------------------+
```

**When to consolidate:**
- After bulk ingestion (many writes)
- Before long-term storage
- When read performance degrades

RadiObject's `from_niftis()` creates one fragment per volume, which is efficient. For incremental updates, consider periodic consolidation.
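
To make the fragment idea concrete, here is a toy model in plain Python. It is a conceptual sketch only and does not use TileDB: each write is stored as a separate dictionary of cell values, reads overlay the fragments from oldest to newest, and consolidation merges them into one. The class and method names are hypothetical.

```python
class ToyArray:
    """Toy model of fragment-based writes (not TileDB's actual on-disk format)."""

    def __init__(self):
        self.fragments = []  # one dict of {cell: value} per write, oldest first

    def write(self, cells):
        # Writes never modify earlier fragments; they append a new one
        self.fragments.append(dict(cells))

    def read(self, cell):
        # The newest fragment containing the cell wins
        for fragment in reversed(self.fragments):
            if cell in fragment:
                return fragment[cell]
        return None

    def consolidate(self):
        # Overlay fragments oldest to newest into a single fragment
        merged = {}
        for fragment in self.fragments:
            merged.update(fragment)
        self.fragments = [merged]


arr = ToyArray()
arr.write({(0, 0): 1.0, (0, 1): 2.0})  # initial write
arr.write({(0, 1): 5.0})               # update one cell
arr.write({(1, 0): 3.0})               # another update

print(len(arr.fragments), arr.read((0, 1)))  # 3 5.0
arr.consolidate()
print(len(arr.fragments), arr.read((0, 1)))  # 1 5.0
```

Reads return the same values before and after; only the number of fragments to check changes, which is why consolidation helps read performance. In TileDB's Python API the corresponding calls are `tiledb.consolidate(uri)` followed by `tiledb.vacuum(uri)` to delete the superseded fragments; whether RadiObject wraps them is not covered here.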

## Memory Configuration

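TileDB stages tiles in memory buffers during reads and writes. The memory budget caps the size of those buffers, and the concurrency setting controls how many I/O threads run in parallel.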

In [None]:
# Configure memory budget
configure(
    io=IOConfig(
        memory_budget_mb=1024,  # 1 GB memory budget
        concurrency=4,          # 4 parallel I/O threads
    )
)

config = get_config()
print(f"Memory budget: {config.io.memory_budget_mb} MB")
print(f"I/O concurrency: {config.io.concurrency} threads")

```
    Read Operation Flow:
    
    +--------------+    +--------------+    +--------------+
    |   Storage    |--->|  I/O Buffer  |--->|    Result    |
    |  (S3/Local)  |    |  (RAM)       |    |  (NumPy)     |
    +--------------+    +--------------+    +--------------+
           |                   |
           |    Decompression  |
           |    happens here   |
           +-------------------+
    
    Memory budget controls the I/O buffer size.
    Larger buffer = more tiles read in parallel.
```
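
For a feel of the numbers, the back-of-the-envelope sketch below estimates how many decompressed tiles fit in a given budget. It is a rough upper bound under two assumptions: the whole budget holds decompressed float32 tiles, and `memory_budget_mb` is counted in units of 1024 x 1024 bytes. TileDB's real accounting is more involved, and `tiles_in_budget` is a hypothetical helper.

```python
from math import prod


def tiles_in_budget(budget_mb, extents, bytes_per_voxel=4):
    """Upper bound on decompressed tiles that fit in `budget_mb` (MiB assumed)."""
    tile_bytes = prod(extents) * bytes_per_voxel
    return (budget_mb * 1024 * 1024) // tile_bytes


print(tiles_in_budget(1024, (240, 240, 1)))  # axial float32 tiles -> 4660
print(tiles_in_budget(1024, (64, 64, 64)))   # isotropic float32 tiles -> 1024
```

A 64^3 float32 tile is exactly 1 MiB, so a 1024 MB budget holds at most 1024 of them, versus 4660 axial tiles of 225 KiB each.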

## S3 Configuration

RadiObject reads and writes S3 URIs through the same API it uses for local paths. Only the connection settings differ.

In [None]:
# Configure S3 (example - requires AWS credentials)
configure(
    s3=S3Config(
        region="us-east-2",
        max_parallel_ops=16,       # Parallel S3 requests
        multipart_part_size_mb=50, # Multipart upload chunk size
    )
)

print("S3 configuration:")
print(f"  Region: {get_config().s3.region}")
print(f"  Max parallel ops: {get_config().s3.max_parallel_ops}")
print(f"  Multipart size: {get_config().s3.multipart_part_size_mb} MB")

```
    Local vs S3 Storage:
    
    LOCAL                              S3
    +-----------------+               +-----------------+
    | /data/study/    |               | s3://bucket/    |
    |   collections/  |               |   study/        |
    |     FLAIR/      |     Same      |     collections/|
    |       obs       |<----API------>|       FLAIR/    |
    |       volumes/  |               |         ...     |
    |         0/      |               |                 |
    |         1/      |               |                 |
    +-----------------+               +-----------------+
    
    # Both work identically:
    radi = RadiObject("/data/study")        # Local
    radi = RadiObject("s3://bucket/study")  # S3
```

**S3 Credentials:** Set `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables, or use IAM roles on AWS infrastructure.
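
A small pre-flight check can surface missing credentials before the first S3 request fails. The sketch below uses only the standard library. `missing_aws_credentials` is a hypothetical helper, and it deliberately does not cover IAM roles or credential files, which need no environment variables.

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_credentials(env=None):
    """Return the names of required AWS variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_aws_credentials()
if missing:
    print(f"Not set: {', '.join(missing)} (fine if an IAM role is used)")
else:
    print("AWS credentials found in environment")
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.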

## Compression Options

| Algorithm | Speed | Ratio | Use Case |
|-----------|-------|-------|----------|
| `ZSTD` | Medium | High | Default, balanced |
| `LZ4` | Fast | Medium | Real-time analysis |
| `GZIP` | Slow | Highest | Archival storage |
| `NONE` | Fastest | None | Testing, already compressed data |
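
How much any of these algorithms saves depends heavily on the data. The sketch below illustrates this with `zlib` from the Python standard library (the DEFLATE algorithm that GZIP uses), chosen so the example runs without third-party ZSTD or LZ4 packages. The two byte buffers are synthetic stand-ins, not radiology data.

```python
import random
import zlib

N_BYTES = 200_000

# Piecewise-constant bytes: long runs, loosely like uniform image background
smooth = bytes((i // 64) % 256 for i in range(N_BYTES))

# Seeded pseudo-random bytes: no structure for the compressor to exploit
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(N_BYTES))


def ratio(data, level):
    """Uncompressed size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data, level))


for level in (1, 6, 9):
    print(f"level {level}: smooth {ratio(smooth, level):7.1f}x, "
          f"noise {ratio(noise, level):.2f}x")
```

The structured buffer shrinks by a large factor while the random one stays near 1.0x at every level. That is worth keeping in mind for any comparison run on `np.random.randn` data, which is mostly noise.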

In [None]:
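# Compare compression sizes
# Note: Gaussian noise is barely compressible, so expect ratios only slightly
# above 1x here; real scans with smooth regions typically compress much better.
test_data = np.random.randn(120, 120, 60).astype(np.float32)
uncompressed_size = test_data.nbytes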

for compressor in [Compressor.NONE, Compressor.LZ4, Compressor.ZSTD, Compressor.GZIP]:
    configure(
        compression=CompressionConfig(algorithm=compressor, level=3),
        tile=TileConfig(orientation=SliceOrientation.AXIAL)
    )
    uri = str(Path(TEMP_DIR) / f"vol_{compressor.value}")
    vol = Volume.from_numpy(uri, test_data)
    
    # Get the on-disk size of the array directory
    total_size = sum(
        p.stat().st_size for p in Path(uri).rglob("*") if p.is_file()
    )
    ratio = uncompressed_size / total_size if total_size > 0 else 0
    print(f"{compressor.value:6s}: {total_size/1024:.1f} KB (ratio: {ratio:.2f}x)")

## Practical Guidelines

### Choosing Tile Orientation

| Your Workflow | Recommended Orientation |
|---------------|------------------------|
| Slice-by-slice viewing (radiologist review) | `AXIAL` |
| 3D ROI extraction for ML training | `ISOTROPIC` |
| Sagittal plane analysis (spine imaging) | `SAGITTAL` |
| Coronal plane analysis | `CORONAL` |
| Mixed access patterns | `ISOTROPIC` |

### Storage Strategy

```
    INGESTION                    ANALYSIS
    ---------                    --------
    
    Raw NIfTI/DICOM              RadiObject (TileDB)
    +--------------+             +------------------+
    | Patient 1    |--+          | Optimized for    |
    | Patient 2    |--+- ingest->| your access      |
    | Patient 3    |--+          | pattern          |
    +--------------+             +------------------+
         |                              |
         |                              | Partial reads
         |                              | (only load what you need)
         v                              v
    Full file reads              Efficient tile access
```

In [None]:
# Final example: configure for ML training workflow
configure(
    tile=TileConfig(orientation=SliceOrientation.ISOTROPIC),  # 64^3 cubes for ROI
    compression=CompressionConfig(algorithm=Compressor.ZSTD, level=3),
    io=IOConfig(memory_budget_mb=2048, concurrency=8),  # More memory for training
)

print("Configuration for ML training:")
config = get_config()
print(f"  Tiles: {config.tile.orientation.value}")
print(f"  Compression: {config.compression.algorithm.value}")
print(f"  Memory: {config.io.memory_budget_mb} MB")

## Cleanup

In [None]:
shutil.rmtree(TEMP_DIR)
print(f"Cleaned up: {TEMP_DIR}")

## Summary

| Concept | Key Points |
|---------|------------|
| **Tile Orientation** | Match tiles to your access pattern for best performance |
| **Compression** | ZSTD is default; LZ4 for speed, GZIP for size |
| **Fragments** | Each write creates a fragment; consolidate after bulk ingestion |
| **S3** | Same API as local; configure region and parallelism |
| **Memory** | Increase budget for large datasets or parallel training |

### Quick Reference

```python
from src.ctx import configure, TileConfig, SliceOrientation, CompressionConfig, Compressor

# For slice viewing
configure(tile=TileConfig(orientation=SliceOrientation.AXIAL))

# For ML training
configure(tile=TileConfig(orientation=SliceOrientation.ISOTROPIC))

# For archival
configure(compression=CompressionConfig(algorithm=Compressor.GZIP, level=9))
```