# ModSSC | Data loader

Load a dataset end to end with the ModSSC data loader.

## Objective
- Show the minimal steps to run this component in a notebook setting.
- Provide the exact objects to look at (outputs, shapes, metrics) to confirm it worked.

## Prerequisites
- Python 3.11+.
- `pip install modssc`.
- Optional dependencies depend on datasets and backends. If an import fails, install the matching extra and rerun.
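
One way to see which optional dependencies are absent before running the cells is to probe for them without importing anything. The sketch below uses the standard library's `importlib.util.find_spec`. The helper name `missing_modules` is hypothetical and not part of ModSSC, and the module list is only illustrative: it reuses the backend names mentioned later in this notebook and says nothing about which extra provides which module.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    # find_spec returns None for an absent top-level module and does not import it.
    return [name for name in names if find_spec(name) is None]


# Illustrative list only; adjust to the datasets and backends you plan to use.
candidates = ["pandas", "sklearn", "datasets", "torchvision", "torchaudio", "torch_geometric"]
print("missing:", missing_modules(candidates))
```

An empty list means every probed module is importable; anything listed is a candidate for the "install the matching extra and rerun" step.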

## Outline
1) Imports and configuration
2) Core run (the part that does the work)
3) Sanity checks and outputs



## Notebook notes

This notebook demonstrates how to use ModSSC's data loading capabilities to access, download, and manage datasets.

## Installation

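To use the data loader, install the package. The commands below are editable installs and assume you are in the root of a ModSSC source checkout; otherwise use `pip install modssc` as listed in the prerequisites.

Base install (lightweight, no heavy dependencies):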
```bash
pip install -e .
```

For full data loading capabilities (including pandas, etc.), install with the `data` extra:
```bash
pip install -e ".[data]"
```

## List available datasets

ModSSC comes with a curated catalog of datasets. We can list them and inspect their metadata.

## Imports and configuration



```python
from modssc.data_loader import available_datasets, dataset_info

# List all dataset keys in the catalog
keys = available_datasets()
print("Catalog:", keys)

# Inspect metadata for the 'toy' dataset
# This returns a DatasetSpec object containing modality, size, etc.
print("Toy spec:", dataset_info("toy").as_dict())
```
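
Once the catalog grows, it can help to group dataset keys by modality. The sketch below works on plain dictionaries so it runs without ModSSC installed. It assumes that the dictionary returned by `as_dict()` has a `"modality"` entry, which this notebook suggests but does not confirm; the helper `group_by_modality` and the example specs are made up for illustration.

```python
from collections import defaultdict


def group_by_modality(specs):
    """Group dataset keys by the 'modality' entry of their spec dictionaries."""
    groups = defaultdict(list)
    for key, spec in specs.items():
        # Fall back to "unknown" when a spec has no modality entry.
        groups[spec.get("modality", "unknown")].append(key)
    return {modality: sorted(keys) for modality, keys in groups.items()}


# Made-up specs standing in for {key: dataset_info(key).as_dict() for key in keys}.
example_specs = {
    "example_tabular": {"modality": "tabular"},
    "example_text": {"modality": "text"},
    "example_tabular_2": {"modality": "tabular"},
    "example_no_modality": {},
}
print(group_by_modality(example_specs))
```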

## Download and load (toy)

We can download a dataset with `download_dataset` and load it with `load_dataset`. Here we use the small `toy` dataset for demonstration purposes.

```python
from modssc.data_loader import download_dataset, load_dataset

# Download the 'toy' dataset.
# force=True ensures we re-download/re-generate it even if it exists.
ds = download_dataset("toy", force=True)

# The loaded dataset object contains splits (train, val, test) and metadata.
print("train X:", ds.train.X.shape)
print("train y:", ds.train.y.shape)
print("test present:", ds.test is not None)

# We can also load without downloading if we know it's cached.
ds_offline = load_dataset("toy", download=False)
print("offline ok:", ds_offline.train.X.shape)
```
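
Printing shapes confirms that something loaded; a slightly stricter check is that features and labels line up. The sketch below is a hypothetical helper, not part of ModSSC. It only relies on `len()` and an optional `.shape` attribute, so it should work with lists as well as array-like containers, and plain lists stand in for `ds.train.X` and `ds.train.y`.

```python
def describe(values):
    """Return .shape when the container has one, otherwise its length as a 1-tuple."""
    return tuple(values.shape) if hasattr(values, "shape") else (len(values),)


def check_split(name, X, y=None):
    """Check that X is non-empty and, if labels exist, that they align with X."""
    n_samples = len(X)
    if n_samples == 0:
        raise ValueError(f"{name}: X is empty")
    if y is not None and len(y) != n_samples:
        raise ValueError(f"{name}: X has {n_samples} rows but y has {len(y)}")
    return {"split": name, "X": describe(X), "y": None if y is None else describe(y)}


# Plain lists stand in for ds.train.X and ds.train.y here.
print(check_split("train", [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]], [0, 1, 0]))
```

In the notebook, the same call would be `check_split("train", ds.train.X, ds.train.y)`.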

## Bulk download (best effort)

We can bulk download every dataset in the catalog with the `download_all_datasets` function. The call is best effort: it attempts each dataset and, instead of stopping at the first problem, records the ones that fail or lack optional dependencies in the returned report.


```python
from modssc.data_loader import download_all_datasets

# Attempt to download all datasets in the catalog.
# ignore_missing_extras=True: Don't fail if we lack dependencies for some datasets (e.g. audio libs).
# skip_cached=True: Don't re-download if already present.
report = download_all_datasets(ignore_missing_extras=True, skip_cached=True)

print(report.summary())
print("missing_extras:", report.missing_extras)
print("failed:", report.failed)
```
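
The best-effort pattern itself is easy to reproduce for your own download loops. The sketch below is not how `download_all_datasets` is implemented; `BestEffortReport`, `best_effort`, and `fake_fetch` are hypothetical names, and `ImportError` is used as a stand-in for a missing optional dependency.

```python
from dataclasses import dataclass, field


@dataclass
class BestEffortReport:
    """Hypothetical report type; not the object returned by download_all_datasets."""

    ok: list = field(default_factory=list)
    missing_extras: dict = field(default_factory=dict)
    failed: dict = field(default_factory=dict)

    def summary(self):
        return (
            f"ok={len(self.ok)} "
            f"missing_extras={len(self.missing_extras)} "
            f"failed={len(self.failed)}"
        )


def best_effort(keys, fetch):
    """Call fetch(key) for every key and record the outcome instead of stopping on errors."""
    report = BestEffortReport()
    for key in keys:
        try:
            fetch(key)
        except ImportError as exc:
            # Stand-in for "an optional dependency is not installed".
            report.missing_extras[key] = str(exc)
        except Exception as exc:
            report.failed[key] = f"{type(exc).__name__}: {exc}"
        else:
            report.ok.append(key)
    return report


def fake_fetch(key):
    """Stand-in downloader that fails in two different ways for the demo."""
    if key == "needs_extra":
        raise ImportError("optional backend not installed")
    if key == "broken":
        raise RuntimeError("download failed")


report = best_effort(["a", "needs_extra", "broken", "b"], fake_fetch)
print(report.summary())
```

Separating missing dependencies from other failures keeps the report actionable: the first group is fixed by installing an extra, the second needs investigation.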

## Cache inspection (CLI)

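We can inspect the cache using the ModSSC CLI. The cell below runs `python -m modssc datasets cache ls` in a subprocess, which lists all cached datasets and their details, and prints the return code, stdout, and stderr as a tuple: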

```python
import subprocess
import sys


def run_cli(*args):
    cmd = [sys.executable, "-m", "modssc", *args]
    res = subprocess.run(cmd, text=True, capture_output=True)
    return res.returncode, res.stdout.strip(), res.stderr.strip()


print(run_cli("datasets", "cache", "ls"))
```
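
If you would rather fail loudly than inspect a tuple, a variant of the helper can raise on a nonzero exit code and bound the run time. This is a sketch under assumptions: `run_python` is a hypothetical name, and it invokes a trivial `-c` command so that it runs without ModSSC installed.

```python
import subprocess
import sys


def run_python(*args, timeout=60):
    """Run the current interpreter with args and return stdout, raising on failure."""
    cmd = [sys.executable, *args]
    res = subprocess.run(cmd, text=True, capture_output=True, timeout=timeout)
    if res.returncode != 0:
        raise RuntimeError(f"{' '.join(cmd)} exited with {res.returncode}: {res.stderr.strip()}")
    return res.stdout.strip()


# A trivial command so the sketch runs anywhere; for the ModSSC CLI the arguments
# would be ("-m", "modssc", "datasets", "cache", "ls"), as in the cell above.
print(run_python("-c", "print('cache check placeholder')"))
```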

## Provider URIs (requires extras)

ModSSC can also load datasets that are not in the curated catalog through generic providers, addressed by a URI of the form `provider:dataset_name`. Each provider needs its own optional dependency; the cell below tries one URI per provider and skips any whose extra is not installed.


```python
import socket

from modssc.data_loader import download_dataset, load_dataset
from modssc.data_loader.errors import OptionalDependencyError

# Increase default socket timeout to 10 minutes for large/slow downloads (e.g. OpenML)
socket.setdefaulttimeout(600)

# ModSSC supports generic providers to load datasets NOT in the curated catalog.
# The URI format is "provider:dataset_name".
#
# Supported providers:
# - openml:ID -> fetches by OpenML ID via sklearn
# - hf:name/config -> fetches via HuggingFace datasets
# - torchvision:ClassName -> instantiates torchvision.datasets.ClassName
# - torchaudio:ClassName -> instantiates torchaudio.datasets.ClassName
# - pyg:ClassName -> instantiates torch_geometric.datasets.ClassName

uris = [
    "openml:31",  # Credit-g (Tabular)
    "hf:rotten_tomatoes",  # Rotten Tomatoes (Text)
    "torchvision:FashionMNIST",  # FashionMNIST (Vision)
    "pyg:KarateClub",  # KarateClub (Graph)
    "torchaudio:CMUARCTIC",  # CMU ARCTIC (Audio)
]

for u in uris:
    try:
        print(f"--- Processing {u} ---")
        # download_dataset will cache the processed data locally
        download_dataset(u)

        # load_dataset returns the standard LoadedDataset object
        ds = load_dataset(u)
        print(f"[OK] Loaded {u}")

        # Inspect the data shape/length
        # Note: Some providers return lists (Audio/Text) or Tensors/Arrays (Vision/Tabular)
        train_len = ds.train.X.shape if hasattr(ds.train.X, "shape") else len(ds.train.X)
        print(f"  Train X shape/len: {train_len}")

        if ds.train.y is not None:
            y_len = ds.train.y.shape if hasattr(ds.train.y, "shape") else len(ds.train.y)
            print(f"  Train y shape/len: {y_len}")

        print(f"  Meta: {ds.meta}")

    except OptionalDependencyError as e:
        print(f"[SKIP] {u} missing extra: {e.extra}")
    except Exception as e:
        print(f"[FAIL] {u} {type(e).__name__}: {e}")
```
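
To make the `provider:dataset_name` format concrete, the sketch below splits a URI at the first colon and validates the provider prefix before any download is attempted. It is not ModSSC's own parser: `split_provider_uri` and `KNOWN_PROVIDERS` are hypothetical names, and the provider set is copied from the comment in the cell above.

```python
KNOWN_PROVIDERS = {"openml", "hf", "torchvision", "torchaudio", "pyg"}


def split_provider_uri(uri):
    """Split 'provider:reference' into (provider, reference), validating the provider."""
    # partition splits at the first colon only, so the reference may contain "/" or ":".
    provider, sep, reference = uri.partition(":")
    if not sep or not reference:
        raise ValueError(f"expected 'provider:dataset_name', got {uri!r}")
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown provider {provider!r} in {uri!r}")
    return provider, reference


for uri in ["openml:31", "hf:rotten_tomatoes", "torchvision:FashionMNIST"]:
    print(split_provider_uri(uri))
```

A check like this catches typos such as a bare catalog key passed where a provider URI is expected, before any network access happens.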

## Outputs

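- The cells above should print the catalog keys, the `toy` split shapes, the bulk download report, the CLI cache listing, and an `[OK]`, `[SKIP]`, or `[FAIL]` status for each provider URI.
- If something fails early, the error should point to a missing optional dependency.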


## Next steps
- Explore the adjacent notebooks in this folder for the other pipeline components.
- If you hit an optional dependency error, install the suggested extra and rerun.
