# Get to Know a Dataset: PCNSL (Primary CNS Lymphoma) MRI Dataset

This notebook serves as a guided tour of the [PCNSL MRI Dataset](https://registry.opendata.aws/jhu-pcnsl). More usage examples, tutorials, and documentation for this dataset and others can be found at the [Registry of Open Data on AWS](https://registry.opendata.aws/).

### Q: How have you organized your dataset? Help us understand the key prefix structure of your S3 bucket.

Our dataset is organized in BIDS (Brain Imaging Data Structure) format at the top level of our S3 bucket. The structure contains:

1. Subject directories (`sub-XXXX/`) containing session subdirectories (`ses-YYYY/`)
2. Each session contains:
   - An `anat/` folder with three MRI sequences:
     - **T1w**: T1-weighted structural image
     - **ce-gadolinium_T1w**: Gadolinium-enhanced (post-contrast) T1-weighted image
     - **FLAIR**: Fluid-attenuated inversion recovery image
   - A `dwi/` folder with one MRI sequence:
     - **DWI_ADC**: Apparent diffusion coefficient map from diffusion-weighted imaging
3. A `derivatives/pyalfe/` folder containing processed outputs for each subject/session:
   - `statistics/` - CSV files with lesion measurements
   - `skullstripped/` - Brain-extracted images
   - `masks/` - Lesion segmentation masks

Full documentation for this dataset can be found in the dataset's README and associated publications.
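
As a rough illustration of the layout above, the sketch below builds the relative path (or S3 key) for each of the four sequences. The helper `bids_image_key` and the `SEQUENCES` table are hypothetical names introduced here. The FLAIR pattern matches the key used later in this notebook; applying the same pattern to the other three sequences is an assumption, so check the actual filenames before relying on it.

```python
from pathlib import PurePosixPath


def bids_image_key(subject: str, session: str, datatype: str, suffix: str) -> str:
    """Build the relative path (or S3 key) of one NIfTI image.

    Assumes the pattern <subject>/<session>/<datatype>/<subject>_<session>_<suffix>.nii.gz.
    """
    filename = f"{subject}_{session}_{suffix}.nii.gz"
    return str(PurePosixPath(subject, session, datatype, filename))


# The four sequences listed above, each paired with the folder that holds it.
SEQUENCES = {
    "T1w": "anat",
    "ce-gadolinium_T1w": "anat",
    "FLAIR": "anat",
    "DWI_ADC": "dwi",
}

keys = {
    suffix: bids_image_key("sub-0001", "ses-0001", datatype, suffix)
    for suffix, datatype in SEQUENCES.items()
}
for suffix, key in keys.items():
    print(f"{suffix:>18}: {key}")
```

The same string works as a path relative to a local BIDS directory or as an object key in the bucket, which is why the loaders in this notebook take a single `key` argument.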

In [None]:
# This notebook requires Python 3.10-3.13 and the following libraries
# (please install using the preferred method for your environment, e.g. pip, conda, uv):
#
# boto3 >= 1.38
# nibabel >= 5.0
# nilearn >= 0.10
# pandas >= 2.0
# numpy >= 1.24
# matplotlib >= 3.7
#
# Or install all dependencies with: uv sync

First we will import the Python libraries required throughout this notebook.

In [None]:
# Import the libraries required for this notebook
# Built-ins
import io
import tempfile
from pathlib import Path

# =============================================================================
# DATA SOURCE CONFIGURATION (set this before the conditional S3 imports below)
# =============================================================================
# Set to True to use local files, False to use S3
USE_LOCAL_DATA = True

# Installed libraries (always needed)
import nibabel as nib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from nilearn import plotting

# For Jupyter notebooks, try to use inline backend
try:
    from IPython import get_ipython
    if get_ipython() is not None:
        get_ipython().run_line_magic('matplotlib', 'inline')
except Exception:  # not running under IPython; keep the default backend
    pass

# S3 libraries (only imported if needed)
if not USE_LOCAL_DATA:
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

print("All imports successful!")
print(f"Data source: {'Local filesystem' if USE_LOCAL_DATA else 'AWS S3'}")

## Data Paths Configuration

The data source (`USE_LOCAL_DATA`) was configured in the imports cell above. Now we'll set up the paths for accessing the data.

In [None]:
# =============================================================================
# DATA PATHS CONFIGURATION
# =============================================================================
# Local paths (only used if USE_LOCAL_DATA = True)
LOCAL_ROOT = Path("")  # update local root here
LOCAL_BIDS_DIR = LOCAL_ROOT / "bids_dir_for_aws_anon"
LOCAL_CSV_DIR = LOCAL_ROOT / "csvs_for_amazon_anonymized"

# S3 configuration (only used if USE_LOCAL_DATA = False)
bucket = "ucsf-pcnsl"

# Initialize S3 client only if needed
s3 = None
if not USE_LOCAL_DATA:
    s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
    
    # Print the items in the top-level prefixes (subject directories)
    response = s3.list_objects_v2(Bucket=bucket, Delimiter='/', MaxKeys=20)
    print("Top-level prefixes in the S3 bucket:")
    if 'CommonPrefixes' in response:
        for item in response['CommonPrefixes'][:10]:
            print(f"  {item['Prefix']}")
        print(f"  ... and more subjects")
else:
    print(f"Using local data from:")
    print(f"  BIDS directory: {LOCAL_BIDS_DIR}")
    print(f"  CSV directory:  {LOCAL_CSV_DIR}")
    
    # List local subjects
    subjects = sorted([d.name for d in LOCAL_BIDS_DIR.iterdir() if d.is_dir() and d.name.startswith('sub-')])
    print(f"\nFound {len(subjects)} subjects:")
    for subj in subjects[:10]:
        print(f"  {subj}/")
    if len(subjects) > 10:
        print(f"  ... and {len(subjects) - 10} more subjects")

Looking into a subject's directory, we can see the BIDS-compliant structure with session folders.

In [None]:
# List the contents of a subject directory
subject = "sub-0001"

if USE_LOCAL_DATA:
    subject_path = LOCAL_BIDS_DIR / subject
    print(f"Contents of {subject}/")
    for item in sorted(subject_path.iterdir()):
        print(f"  {item.name}/")
else:
    response = s3.list_objects_v2(Bucket=bucket, Prefix=f'{subject}/', Delimiter='/')
    print(f"Contents of {subject}/")
    if 'CommonPrefixes' in response:
        for item in response['CommonPrefixes']:
            print(f"  {item['Prefix']}")

Within each session, we find the anatomy folder containing the MRI sequences.

In [None]:
# List anatomy files for a subject/session
session = "ses-0001"

if USE_LOCAL_DATA:
    anat_path = LOCAL_BIDS_DIR / subject / session / "anat"
    print(f"Anatomy files for {subject}/{session}:")
    for item in sorted(anat_path.iterdir()):
        print(f"  {item.name}")
    
    # Also show DWI files
    dwi_path = LOCAL_BIDS_DIR / subject / session / "dwi"
    if dwi_path.exists():
        print(f"\nDWI files for {subject}/{session}:")
        for item in sorted(dwi_path.iterdir()):
            print(f"  {item.name}")
else:
    response = s3.list_objects_v2(Bucket=bucket, Prefix=f'{subject}/{session}/anat/')
    print(f"Anatomy files for {subject}/{session}:")
    if 'Contents' in response:
        for item in response['Contents']:
            print(f"  {item['Key'].split('/')[-1]}")

### Q: What data formats are present in your dataset? What kinds of data are stored using these formats? Can you give any advice for how you work with these data formats?

Our dataset contains two primary data formats:

**1. NIfTI (Neuroimaging Informatics Technology Initiative) - `.nii.gz` files**

NIfTI is the standard format for neuroimaging data. Our dataset uses NIfTI for:
- Raw MRI images (T1w, gadolinium-enhanced T1w, FLAIR) and ADC maps
- Skullstripped (brain-extracted) images
- Lesion segmentation masks

NIfTI files store:
- 3D volumetric image data as a multidimensional array
- Header information including voxel dimensions, orientation, and coordinate system
- Affine transformation matrix for mapping voxel to world coordinates
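
To make the role of the affine concrete, here is a minimal sketch in plain Python. The matrix is a made-up example for a 1 x 1 x 5 mm acquisition, not one read from this dataset, and the helper names `voxel_to_world` and `voxel_sizes` are chosen for this illustration.

```python
import math

# Hypothetical 4x4 affine: 1 x 1 x 5 mm voxels plus a translation (values invented).
affine = [
    [1.0, 0.0, 0.0, -90.0],
    [0.0, 1.0, 0.0, -126.0],
    [0.0, 0.0, 5.0, -72.0],
    [0.0, 0.0, 0.0, 1.0],
]


def voxel_to_world(affine, ijk):
    """Map voxel indices (i, j, k) to world coordinates (x, y, z) in mm."""
    homogeneous = (ijk[0], ijk[1], ijk[2], 1.0)
    return tuple(
        sum(row[c] * homogeneous[c] for c in range(4)) for row in affine[:3]
    )


def voxel_sizes(affine):
    """Voxel edge lengths in mm: the norm of each of the first three columns."""
    return tuple(
        math.sqrt(sum(affine[r][c] ** 2 for r in range(3))) for c in range(3)
    )


world = voxel_to_world(affine, (10, 20, 3))
sizes = voxel_sizes(affine)
voxel_volume_mm3 = math.prod(sizes)  # multiply by a voxel count to get a volume in mm^3

print(f"World coordinates: {world}")
print(f"Voxel sizes (mm): {sizes}")
print(f"Voxel volume (mm^3): {voxel_volume_mm3}")
```

In practice nibabel provides the same operations as `nibabel.affines.apply_affine` and `nibabel.affines.voxel_sizes`, applied to `img.affine`.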

We recommend using:
- **nibabel**: Python library for reading/writing NIfTI files
- **nilearn**: Built on nibabel, provides neuroimaging-specific visualization and analysis tools
- **ITK-SNAP** or **3D Slicer**: Desktop applications for interactive visualization

**2. CSV (Comma-Separated Values) - `.csv` files**

CSV files contain quantitative measurements extracted from the images:
- **SummaryLesions**: Aggregate lesion statistics per subject (total volume, tissue distribution)
- **IndividualLesions**: Per-lesion measurements (one row per lesion)
- **radiomics**: PyRadiomics texture features for machine learning applications

Microsoft Excel, the Python pandas library, and the Polars library, among others, can all be used to explore CSVs.
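
The `pivot` call later in this notebook suggests that each SummaryLesions file is stored in long form: an unnamed label column and a value column named `0`. Assuming that layout, the sketch below parses one such file with the standard library and reshapes it to one row per subject. The file contents and the subject ID `sub-A` are synthetic, and the helpers `read_key_value_csv` and `to_wide` are hypothetical names.

```python
import csv
import io

# Synthetic file contents in the assumed layout: an unnamed label column and a
# value column named "0". The numbers are made up for illustration.
csv_text = """,0
total_lesion_volume,1500.0
lesion_volume_in_white_matter,900.0
"""


def read_key_value_csv(text):
    """Parse a two-column label/value CSV into a {label: float} dict."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return {label: float(value) for label, value in reader}


def to_wide(per_subject):
    """Turn {subject: {label: value}} into one row (dict) per subject."""
    labels = sorted({label for stats in per_subject.values() for label in stats})
    return [
        {"subject": subject, **{label: stats.get(label) for label in labels}}
        for subject, stats in sorted(per_subject.items())
    ]


# "sub-A" is a placeholder ID, not a subject from the dataset.
wide_rows = to_wide({"sub-A": read_key_value_csv(csv_text)})
print(wide_rows)
```

With pandas, the equivalent reshaping is a single `pivot` on the concatenated long-form frames.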

### Q: Can you show us an example of downloading and loading data from your dataset?

Let's load a FLAIR MRI image and its associated lesion statistics, from S3 or from the local filesystem depending on `USE_LOCAL_DATA`.

In [None]:
# Helper functions to load data from S3 or local filesystem

def load_nifti_from_s3(bucket, key, s3_client):
    """Load a NIfTI file from S3 into a nibabel image object."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    file_content = response['Body'].read()
    
    # nibabel reads from a file path, so write the bytes to a temporary file first
    with tempfile.NamedTemporaryFile(suffix='.nii.gz', delete=False) as tmp:
        tmp.write(file_content)
        tmp_path = Path(tmp.name)

    try:
        img = nib.load(tmp_path)
        # Load data into memory so we can delete the temp file
        img = nib.Nifti1Image(img.get_fdata(), img.affine, img.header)
    finally:
        tmp_path.unlink()  # Clean up the temp file even if loading fails
    return img

def load_nifti(key):
    """Load a NIfTI file from either local filesystem or S3."""
    if USE_LOCAL_DATA:
        local_path = LOCAL_BIDS_DIR / key
        print(f"Loading: {local_path}")
        return nib.load(local_path)
    else:
        print(f"Loading: s3://{bucket}/{key}")
        return load_nifti_from_s3(bucket, key, s3)

def load_csv(key):
    """Load a CSV file from either local filesystem or S3."""
    if USE_LOCAL_DATA:
        local_path = LOCAL_BIDS_DIR / key
        print(f"Loading: {local_path}")
        return pd.read_csv(local_path)
    else:
        print(f"Loading: s3://{bucket}/{key}")
        response = s3.get_object(Bucket=bucket, Key=key)
        return pd.read_csv(io.BytesIO(response['Body'].read()))

# Load the FLAIR image for subject sub-0001
subject = "sub-0001"
session = "ses-0001"
flair_key = f"{subject}/{session}/anat/{subject}_{session}_FLAIR.nii.gz"

flair_img = load_nifti(flair_key)
print(f"FLAIR image loaded successfully!")

Let's examine the properties of the loaded NIfTI image.

In [None]:
# Display image properties
print(f"Image shape: {flair_img.shape}")
print(f"Voxel dimensions (mm): {flair_img.header.get_zooms()}")
print(f"Data type: {flair_img.get_data_dtype()}")
print(f"Affine matrix:\n{flair_img.affine}")

Now let's load the lesion statistics CSV file.

In [None]:
# Load summary lesion statistics
stats_key = f"derivatives/pyalfe/{subject}/{session}/statistics/lesions_SummaryLesions/{subject}_{session}_FLAIR_SummaryLesions.csv"

summary_stats = load_csv(stats_key)

print(f"\nSummary statistics columns:")
print(summary_stats.columns.tolist())
print(f"\nStatistics for {subject}:")
summary_stats

### Q: A picture is worth a thousand words. Show us a visual (or several!) from your dataset that either illustrates something informative about your dataset, or that you think might excite someone to dig in further.

Let's visualize the MRI images and lesion segmentations from our dataset.

In [None]:
# Visualize the FLAIR image using nilearn
plotting.plot_anat(flair_img, title=f"{subject} FLAIR Image", display_mode='ortho')
plt.show()

In [None]:
# Load the skullstripped FLAIR and lesion mask
# Note: Derivative files use anonymized hash prefixes in filenames

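from fnmatch import fnmatch

def find_derivative_file(subject, session, subdir, pattern):
    """Find a derivative file matching a glob pattern (handles hash-based filenames).

    Returns a Path (local) or an S3 key string, or None if nothing matches.
    """
    if USE_LOCAL_DATA:
        search_dir = LOCAL_BIDS_DIR / "derivatives" / "pyalfe" / subject / session / subdir
        matches = sorted(search_dir.glob(pattern))
        return matches[0] if matches else None
    # For S3, list the keys under the prefix and match the pattern against each filename
    prefix = f"derivatives/pyalfe/{subject}/{session}/{subdir}/"
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    keys = sorted(obj['Key'] for obj in response.get('Contents', []))
    matches = [k for k in keys if fnmatch(k.split('/')[-1], pattern)]
    return matches[0] if matches else None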

# Find skullstripped FLAIR
flair_ss_dir = "skullstripped/lesions_FLAIR_space"
if USE_LOCAL_DATA:
    flair_ss_path = find_derivative_file(subject, session, flair_ss_dir, "*_FLAIR_to_FLAIR_skullstripped.nii.gz")
    print(f"Loading skullstripped FLAIR: {flair_ss_path.name}")
    flair_ss = nib.load(flair_ss_path)
else:
    flair_ss_key = f"derivatives/pyalfe/{subject}/{session}/skullstripped/lesions_FLAIR_space/{subject}_{session}_FLAIR_skullstripped.nii.gz"
    print("Loading skullstripped FLAIR...")
    flair_ss = load_nifti_from_s3(bucket, flair_ss_key, s3)

# Find lesion mask
mask_dir = "masks/lesions_seg_comp"
if USE_LOCAL_DATA:
    mask_path = find_derivative_file(subject, session, mask_dir, "*_FLAIR_abnormal_seg_comp.nii.gz")
    print(f"Loading lesion mask: {mask_path.name}")
    lesion_mask = nib.load(mask_path)
else:
    mask_key = f"derivatives/pyalfe/{subject}/{session}/masks/lesions_seg_comp/{subject}_{session}_FLAIR_lesions.nii.gz"
    print("Loading lesion mask...")
    lesion_mask = load_nifti_from_s3(bucket, mask_key, s3)

print(f"\nSkullstripped FLAIR shape: {flair_ss.shape}")
print(f"Lesion mask shape: {lesion_mask.shape}")

In [None]:
# Overlay lesion segmentation on the FLAIR image
# Note: The mask contains component labels (each lesion has a unique ID)
# Binarize for uniform color display
binary_mask = nib.Nifti1Image(
    (lesion_mask.get_fdata() > 0).astype(float), 
    lesion_mask.affine, 
    lesion_mask.header
)

plotting.plot_roi(
    binary_mask,
    bg_img=flair_ss,
    title=f"{subject} FLAIR with Lesion Overlay",
    display_mode='ortho',
    alpha=0.5,
    cmap="Reds",
    resampling_interpolation='nearest',
    vmin=0,
    vmax=1
)
plt.show()

In [None]:
# Mosaic view showing multiple slices
# Uses the binarized mask from the previous cell so that vmin=0 / vmax=1 cover every lesion label
plotting.plot_roi(
    binary_mask,
    bg_img=flair_ss,
    title=f"{subject} FLAIR Lesions (Mosaic View)",
    display_mode='mosaic',
    cut_coords=8,
    alpha=0.5,
    cmap='Reds',
    resampling_interpolation='nearest',
    vmin=0,
    vmax=1
)
plt.show()

In [None]:
# Load and aggregate statistics for multiple subjects to show distribution

def load_all_summary_stats_local(bids_dir, max_subjects=50):
    """Load summary statistics for multiple subjects from local filesystem."""
    all_stats = []
    subjects = sorted([d.name for d in bids_dir.iterdir() 
                      if d.is_dir() and d.name.startswith('sub-')])
    
    for subj in subjects[:max_subjects]:
        try:
            stats_path = bids_dir / "derivatives" / "pyalfe" / subj / "ses-0001" / "statistics" / "lesions_SummaryLesions" / f"{subj}_ses-0001_FLAIR_SummaryLesions.csv"
            if stats_path.exists():
                df = pd.read_csv(stats_path)
                df['subject'] = subj
                all_stats.append(df)
        except Exception as e:
            continue
    
    return pd.concat(all_stats, ignore_index=True) if all_stats else pd.DataFrame()

def load_all_summary_stats_s3(bucket, s3_client, max_subjects=50):
    """Load summary statistics for multiple subjects from S3."""
    all_stats = []
    
    # List all subjects
    response = s3_client.list_objects_v2(Bucket=bucket, Delimiter='/')
    subjects = [p['Prefix'].rstrip('/') for p in response.get('CommonPrefixes', []) 
                if p['Prefix'].startswith('sub-')]
    
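    for subj in subjects[:max_subjects]:
        try:
            # Same key layout as the single-subject example and the local loader
            stats_key = f"derivatives/pyalfe/{subj}/ses-0001/statistics/lesions_SummaryLesions/{subj}_ses-0001_FLAIR_SummaryLesions.csv"
            response = s3_client.get_object(Bucket=bucket, Key=stats_key)
            df = pd.read_csv(io.BytesIO(response['Body'].read()))
            df['subject'] = subj
            all_stats.append(df)
        except Exception:
            # Skip subjects without a readable summary file
            continue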
    
    return pd.concat(all_stats, ignore_index=True) if all_stats else pd.DataFrame()

print("Loading statistics for multiple subjects...")
if USE_LOCAL_DATA:
    all_summary = load_all_summary_stats_local(LOCAL_BIDS_DIR)
else:
    all_summary = load_all_summary_stats_s3(bucket, s3)

# Convert from long form (one row per measurement) to wide form (one row per subject)
if not all_summary.empty:
    all_summary = all_summary.pivot(index="subject", columns="Unnamed: 0", values="0")

print(f"Loaded statistics for {len(all_summary)} subjects")

In [None]:
# Plot the distribution of lesion volumes
if 'total_lesion_volume' in all_summary.columns and len(all_summary) > 0:
    plt.figure(figsize=(12, 7), dpi=100, facecolor='white')
    
    plt.hist(all_summary['total_lesion_volume'], 
             bins=30,
             color='#3498db',
             edgecolor='white',
             linewidth=1.2,
             alpha=0.8)
    
    plt.title('Distribution of Total Lesion Volume in PCNSL Patients', 
             fontsize=16, pad=20, fontweight='bold')
    plt.xlabel('Total Lesion Volume (mm³)', fontsize=12, labelpad=10)
    plt.ylabel('Number of Subjects', fontsize=12, labelpad=10)
    
    plt.grid(True, linestyle='--', alpha=0.3, color='gray')
    
    ax = plt.gca()
    ax.set_facecolor('#f8f9fa')
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_linewidth(0.5)
    ax.spines['bottom'].set_linewidth(0.5)
    
    plt.tight_layout()
    plt.show()
    
    # Print summary statistics
    print(f"\nLesion Volume Statistics:")
    print(f"  Mean: {all_summary['total_lesion_volume'].mean():.2f} mm³")
    print(f"  Median: {all_summary['total_lesion_volume'].median():.2f} mm³")
    print(f"  Min: {all_summary['total_lesion_volume'].min():.2f} mm³")
    print(f"  Max: {all_summary['total_lesion_volume'].max():.2f} mm³")

In [None]:
# Box plot showing lesion distribution by tissue type
tissue_cols = [
    'lesion_volume_in_white_matter',
    'lesion_volume_in_Cortical Gray Matter',
    'lesion_volume_in_Deep Gray Matter',
    'lesion_volume_in_CorpusCallosum'
]

available_cols = [c for c in tissue_cols if c in all_summary.columns]

if available_cols and len(all_summary) > 0:
    plt.figure(figsize=(12, 7), dpi=100, facecolor='white')
    
    tissue_data = all_summary[available_cols].copy()
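    # Map each column to its label by name so labels stay correct if a column is missing
    label_map = dict(zip(tissue_cols, ['White Matter', 'Cortical GM', 'Deep GM', 'Corpus Callosum']))
    tissue_data = tissue_data.rename(columns=label_map)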
    
    tissue_data.boxplot()
    
    plt.title('PCNSL Lesion Volume by Brain Tissue Type', 
             fontsize=16, pad=20, fontweight='bold')
    plt.ylabel('Lesion Volume (mm³)', fontsize=12, labelpad=10)
    plt.xlabel('Tissue Type', fontsize=12, labelpad=10)
    
    plt.xticks(rotation=45)
    plt.grid(True, linestyle='--', alpha=0.3, color='gray', axis='y')
    
    ax = plt.gca()
    ax.set_facecolor('#f8f9fa')
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    
    plt.tight_layout()
    plt.show()

In [None]:
# Custom visualization comparing FLAIR lesions
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Get data arrays
flair_data = flair_ss.get_fdata()
mask_data = lesion_mask.get_fdata()

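# Find the axial slice with the largest lesion area
# (binarize first: the mask stores component labels, so summing raw values would weight lesions by ID)
lesion_counts = (mask_data > 0).sum(axis=(0, 1))
best_slice = np.argmax(lesion_counts)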

# FLAIR image
axes[0].imshow(np.rot90(flair_data[:, :, best_slice]), cmap='gray')
axes[0].set_title('FLAIR Image', fontsize=14)
axes[0].axis('off')

# Lesion mask
axes[1].imshow(np.rot90(mask_data[:, :, best_slice]), cmap='hot')
axes[1].set_title('Lesion Segmentation', fontsize=14)
axes[1].axis('off')

# Overlay
axes[2].imshow(np.rot90(flair_data[:, :, best_slice]), cmap='gray')
masked = np.ma.masked_where(mask_data[:, :, best_slice] == 0, 
                            mask_data[:, :, best_slice])
axes[2].imshow(np.rot90(masked), cmap='hot', alpha=0.6)
axes[2].set_title('FLAIR with Lesion Overlay', fontsize=14)
axes[2].axis('off')

plt.suptitle(f'{subject} PCNSL Lesion Visualization', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

### Q: What is one question that you have answered using these data? Can you show us how you came to that answer?

**Question: What is the typical lesion burden and anatomical distribution in primary CNS lymphoma patients?**

Using this dataset, we analyzed the distribution of lesion volumes across different brain tissue types. PCNSL has a characteristic predilection for white matter, and this dataset allowed us to quantify this pattern across a cohort of patients.

Key findings from our analysis:
1. PCNSL lesions show significant involvement of white matter
2. Total lesion volume varies considerably between patients, reflecting the heterogeneous nature of the disease
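
One way to put a number on the white matter pattern is to compute, per subject, the share of total lesion volume that falls in white matter. The sketch below does this on synthetic rows that reuse the column names from the plots above. The values and subject IDs are invented, and `white_matter_fraction` is a hypothetical helper, not part of the published analysis.

```python
from statistics import median

# Synthetic wide-form rows (values invented for illustration, in mm^3).
cohort = [
    {"subject": "sub-A", "total_lesion_volume": 2000.0, "lesion_volume_in_white_matter": 1500.0},
    {"subject": "sub-B", "total_lesion_volume": 500.0, "lesion_volume_in_white_matter": 200.0},
    {"subject": "sub-C", "total_lesion_volume": 0.0, "lesion_volume_in_white_matter": 0.0},
]


def white_matter_fraction(row):
    """Share of total lesion volume located in white matter, or None if there is no lesion."""
    total = row["total_lesion_volume"]
    if total <= 0:
        return None  # avoid dividing by zero for subjects with no segmented lesion
    return row["lesion_volume_in_white_matter"] / total


fractions = {row["subject"]: white_matter_fraction(row) for row in cohort}
valid = [f for f in fractions.values() if f is not None]
print(f"Median white-matter fraction: {median(valid):.3f}")
```

Reporting a fraction rather than a raw volume keeps subjects with very different lesion burdens comparable.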

### Q: What is one unanswered question that you think could be answered using these data? Do you have any recommendations or advice for someone wanting to answer this question?

**Unanswered Question: Do radiomic features correlate with certain genomic differences in PCNSL patients?**

This dataset includes PyRadiomics texture features extracted from both FLAIR and post-contrast T1-weighted images. These features capture subtle patterns in image intensity and texture that may correlate with tumor biology and treatment outcomes.

**Recommendations for tackling this question:**

1. **Data Loading**: The radiomics CSV files are in `derivatives/pyalfe/sub-XXXX/ses-YYYY/statistics/`, and the UCSF500 data is in `ucsf500_mutations.csv`.

2. **Feature Selection for radiomics**: The radiomics files contain hundreds of features, so reduce them before modeling. Options include:
   - Correlation-based feature selection to remove redundant features
   - LASSO or elastic net regularization for automatic selection
   - Domain knowledge to focus on clinically relevant feature categories

3. **Start small when looking at genetic data**:
   - Begin by looking at the `gene` column in the UCSF500 dataset and selecting a single mutation to investigate at a time

4. **Modeling Approach**:
   - Start with interpretable models (logistic regression, random forest)
   - Use proper cross-validation given the relatively small sample size
   - Consider combining FLAIR and T1-Post features for multimodal analysis

5. **Validation**: External validation on an independent cohort would strengthen any predictive model

This research direction could contribute to personalized treatment planning in PCNSL.

---

## Summary

This tutorial covered:

1. **Dataset Organization**: BIDS-formatted MRI data with subjects, sessions, and derivatives
2. **Data Formats**: NIfTI files for imaging data, CSV files for statistics
3. **Loading Data**: Using boto3/nibabel for S3 access, or direct filesystem access for local data
4. **Visualization**: Using nilearn and matplotlib to display MRI images and lesion overlays
5. **Analysis Examples**: Lesion volume distributions and tissue-specific patterns

### Data Source Options

This notebook supports two data access methods:
- **S3 (AWS)**: Set `USE_LOCAL_DATA = False` to access data from the public S3 bucket
- **Local**: Set `USE_LOCAL_DATA = True` and configure `LOCAL_BIDS_DIR` and `LOCAL_CSV_DIR` paths

### Key Resources

- **nibabel documentation**: https://nipy.org/nibabel/
- **nilearn documentation**: https://nilearn.github.io/
- **BIDS specification**: https://bids-specification.readthedocs.io/
- **PyRadiomics**: https://pyradiomics.readthedocs.io/