# ModSSC | Sampling

Create and validate semi-supervised splits (labeled, unlabeled, val, test).

## Objective
- Show the minimal steps to run this component in a notebook setting.
- Provide the exact objects to look at (outputs, shapes, metrics) to confirm it worked.

## Prerequisites
- Python 3.11+.
- `pip install modssc`.
- Optional dependencies depend on datasets and backends. If an import fails, install the matching extra and rerun.

## Outline
1) Imports and configuration
2) Core run (the part that does the work)
3) Sanity checks and outputs



## Notebook notes

This notebook demonstrates how to use ModSSC's sampling capabilities to create reproducible data splits (train/val/test) and label sets.

## Installation

Base install (editable, run from the repository root):
```bash
pip install -e .
```

## Load dataset

First, load a dataset and read its fingerprint. Passing the fingerprint to the sampler ties each split to one exact version of the data, which keeps the sampling reproducible.

## Imports and configuration



In [None]:
from modssc.data_loader import load_dataset

ds = load_dataset("toy", download=True)
ds_fp = ds.meta.get("dataset_fingerprint")
ds_fp
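
To illustrate why a fingerprint helps reproducibility, here is a minimal sketch of a content hash over features and labels. The function `toy_fingerprint` is hypothetical and is not how ModSSC computes `dataset_fingerprint`; it only shows that any change to the data changes the identifier a saved split would be tied to.

```python
import hashlib
import json


def toy_fingerprint(X, y):
    """Hypothetical content hash; NOT ModSSC's actual fingerprint algorithm."""
    # Serialize deterministically so identical data always yields identical bytes.
    payload = json.dumps({"X": X, "y": y}, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


X = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
y = [0, 1, 0]

fp_a = toy_fingerprint(X, y)
fp_b = toy_fingerprint(X, y)          # same data -> same fingerprint
fp_c = toy_fingerprint(X, [0, 1, 1])  # one label changed -> different fingerprint

print(fp_a == fp_b, fp_a == fp_c)
```

A split saved against `fp_a` could then be rejected when the dataset on disk hashes to something else.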

## Holdout + labeling


In [None]:
from modssc.sampling import HoldoutSplitSpec, LabelingSpec, SamplingPlan, sample

# Define a Sampling Plan
# 1. Split: How to divide the data (Holdout, K-Fold, etc.)
# 2. Labeling: How many labels to reveal (Active Learning / Semi-Supervised setting)
plan = SamplingPlan(
    split=HoldoutSplitSpec(test_fraction=0.3, val_fraction=0.1, stratify=True),
    labeling=LabelingSpec(mode="fraction", value=0.1, per_class=True, min_per_class=1),
)

# Run the sampling
# - dataset_fingerprint: Ensures we are sampling the exact same data version
# - save=True: Saves the result to disk for later use
# - overwrite=True: Re-runs sampling even if the file exists (useful for notebooks)
result, path = sample(ds, plan=plan, seed=0, dataset_fingerprint=ds_fp, save=True, overwrite=True)

print("Split Fingerprint:", result.split_fingerprint)
print("Saved to:", path)

In [None]:
result.stats
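
To make the two stages of the plan concrete, the following standard-library sketch performs a stratified holdout split and then reveals a fraction of labels per class. It is an illustration under stated assumptions, not ModSSC's implementation: the helper names are hypothetical, and the validation fraction is assumed to be taken from each whole class (ModSSC may instead take it from the remaining training portion).

```python
import math
import random
from collections import defaultdict


def stratified_holdout(y, test_fraction, val_fraction, seed):
    """Split indices into train/val/test class by class (illustrative only)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)

    train, val, test = [], [], []
    for label in sorted(by_class):
        idx = by_class[label][:]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        n_val = round(len(idx) * val_fraction)  # assumption: fraction of the whole class
        test += idx[:n_test]
        val += idx[n_test:n_test + n_val]
        train += idx[n_test + n_val:]
    return sorted(train), sorted(val), sorted(test)


def label_fraction_per_class(y, train_idx, fraction, min_per_class, seed):
    """Reveal a fraction of the training labels in each class, with a floor."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in train_idx:
        by_class[y[i]].append(i)

    labeled = []
    for label in sorted(by_class):
        idx = by_class[label][:]
        rng.shuffle(idx)
        n = max(min_per_class, math.ceil(len(idx) * fraction))
        labeled += idx[:min(n, len(idx))]
    return sorted(labeled)


y = [0] * 60 + [1] * 40  # toy labels: 100 samples, 2 classes
train, val, test = stratified_holdout(y, test_fraction=0.3, val_fraction=0.1, seed=0)
labeled = label_fraction_per_class(y, train, fraction=0.1, min_per_class=1, seed=0)

print(len(train), len(val), len(test), len(labeled))
```

Because each class is split separately, the 60/40 class ratio of the toy labels is preserved in every subset, which is what `stratify=True` is meant to achieve.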

In [None]:
import numpy as np


def print_class_dist(y, idx, name="subset"):
    """Helper to print class distribution of a subset."""
    if idx is None or len(idx) == 0:
        print(f"{name}: empty")
        return
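    classes, counts = np.unique(y[idx], return_counts=True)
    # Convert NumPy scalars to native Python values so the printed dict stays readable.
    dist = dict(zip(classes.tolist(), counts.tolist(), strict=True))
    print(f"{name} distribution: {dist} (total={len(idx)})")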


# Inspect the resulting indices
# The result object provides .train_idx, .val_idx, .test_idx, .labeled_idx
print_class_dist(ds.train.y, result.train_idx, "Train")
print_class_dist(ds.train.y, result.val_idx, "Val")
print_class_dist(ds.train.y, result.test_idx, "Test")
print_class_dist(ds.train.y, result.labeled_idx, "Labeled")
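
Beyond class distributions, a few structural checks can confirm that a split is usable. The sketch below is a hypothetical helper, not part of ModSSC. It assumes that the train, val, and test index sets should be pairwise disjoint and that labeled indices are drawn from the training partition; verify both assumptions against the library's documentation.

```python
def check_split(n_samples, train_idx, val_idx, test_idx, labeled_idx):
    """Return a list of problems; an empty list means the split looks consistent."""

    def as_set(idx):
        # Accept None (empty subset) as well as lists or NumPy arrays of integers.
        return set() if idx is None else {int(i) for i in idx}

    parts = {"train": as_set(train_idx), "val": as_set(val_idx), "test": as_set(test_idx)}
    problems = []

    # Partitions must not share any sample.
    names = list(parts)
    for pos, a in enumerate(names):
        for b in names[pos + 1:]:
            overlap = parts[a] & parts[b]
            if overlap:
                problems.append(f"{a} and {b} overlap on {len(overlap)} indices")

    # Every index must refer to an existing sample.
    all_idx = set().union(*parts.values())
    if any(i < 0 or i >= n_samples for i in all_idx):
        problems.append("index out of range")

    # Assumption: labeled samples are a subset of the training partition.
    if not as_set(labeled_idx) <= parts["train"]:
        problems.append("labeled indices fall outside train")

    return problems


good = check_split(10, [0, 1, 2, 3, 4, 5], [6, 7], [8, 9], [0, 3])
bad = check_split(10, [0, 1, 2, 3, 4, 5], [5, 6], [8, 9], [7])
print(good)
print(bad)
```

In this notebook the call would look like `check_split(len(ds.train.y), result.train_idx, result.val_idx, result.test_idx, result.labeled_idx)`.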

## K-Fold Cross Validation

We can also use K-Fold splitting. In that case `val_fraction` is applied within the training portion of each fold.


In [None]:
from modssc.sampling import KFoldSplitSpec

# K-Fold Cross Validation
# We specify k=5 and fold=0 (the first fold)
# Note: val_fraction is applied *within* the training portion of the fold
k_plan = SamplingPlan(
    split=KFoldSplitSpec(k=5, fold=0, stratify=True, val_fraction=0.1),
    labeling=LabelingSpec(mode="fraction", value=1.0),  # Use all train data as labeled
)

k_res, _ = sample(ds, plan=k_plan, seed=0, dataset_fingerprint=ds_fp)
print("Fold 0 stats:", k_res.stats)
print_class_dist(ds.train.y, k_res.test_idx, "Fold 0 Test")
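
The fold mechanics can be sketched with the standard library alone. The hypothetical helper below is a simplified, non-stratified illustration rather than ModSSC's implementation: it shuffles once with a fixed seed, deals the indices into `k` disjoint folds, uses one fold as the test set, and carves the validation set out of the remaining training portion.

```python
import random


def kfold_indices(n_samples, k, fold, val_fraction, seed):
    """Return (train, val, test) for one fold of a shuffled k-fold partition."""
    rng = random.Random(seed)
    order = list(range(n_samples))
    rng.shuffle(order)

    # Fold i takes every k-th element starting at offset i, so folds are
    # disjoint and their sizes differ by at most one.
    folds = [order[i::k] for i in range(k)]
    test = folds[fold]
    rest = [i for j, f in enumerate(folds) if j != fold for i in f]

    # Validation comes out of the remaining training portion only.
    n_val = round(len(rest) * val_fraction)
    val, train = rest[:n_val], rest[n_val:]
    return sorted(train), sorted(val), sorted(test)


n = 50
splits = [kfold_indices(n, k=5, fold=f, val_fraction=0.1, seed=0) for f in range(5)]
train, val, test = splits[0]
print(len(train), len(val), len(test))

# Across all folds, every sample is used for testing exactly once.
all_test = sorted(i for _, _, t in splits for i in t)
print(all_test == list(range(n)))
```

Reusing the same seed for every fold is what keeps the folds consistent with each other: only the `fold` argument changes between runs.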

## Imbalance Simulation

We can simulate class imbalance (e.g. a long-tailed class distribution) or subsample classes.


In [None]:
from modssc.sampling import ImbalanceSpec

# Imbalance Simulation
# We can simulate a "long_tail" distribution on the training set.
imb_plan = SamplingPlan(
    split=HoldoutSplitSpec(test_fraction=0.2),
    labeling=LabelingSpec(mode="fraction", value=1.0),
    imbalance=ImbalanceSpec(kind="long_tail", alpha=0.5, min_per_class=1, apply_to="train"),
)

imb_res, _ = sample(ds, plan=imb_plan, seed=0, dataset_fingerprint=ds_fp)
print("Imbalance stats:", imb_res.stats)
print_class_dist(ds.train.y, imb_res.train_idx, "Imbalanced Train")
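
For intuition, here is one way a long-tailed subsample could be produced. This is a sketch under a loud assumption: the meaning of `alpha` in `ImbalanceSpec` is not documented in this notebook, so the hypothetical helper below simply treats it as a geometric decay factor, where the class at rank `r` keeps a fraction `alpha ** r` of its samples. ModSSC's actual formula may differ.

```python
import random
from collections import defaultdict


def long_tail_subsample(y, idx, alpha, min_per_class, seed):
    """Keep fewer samples for each successive class (assumed geometric decay)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in idx:
        by_class[y[i]].append(i)

    kept = []
    # Classes are ranked by sorted label; rank 0 is kept in full.
    for rank, label in enumerate(sorted(by_class)):
        members = by_class[label][:]
        rng.shuffle(members)
        n_keep = max(min_per_class, int(len(members) * alpha ** rank))
        kept += members[:min(n_keep, len(members))]
    return sorted(kept)


y = [0] * 40 + [1] * 40 + [2] * 40  # balanced toy labels
kept = long_tail_subsample(y, range(len(y)), alpha=0.5, min_per_class=1, seed=0)
counts = {c: sum(1 for i in kept if y[i] == c) for c in (0, 1, 2)}
print(counts)  # {0: 40, 1: 20, 2: 10}
```

The `min_per_class` floor prevents a steep decay from removing a class entirely.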

## Load split from disk


In [None]:
from modssc.sampling import load_split

loaded = load_split(path)
loaded.split_fingerprint == result.split_fingerprint
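
Comparing fingerprints after a reload is a cheap way to confirm that nothing changed on disk. The following sketch shows the idea with a JSON file and a hypothetical `split_digest` function; it does not reflect ModSSC's storage format or how `split_fingerprint` is actually computed.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def split_digest(split):
    """Hypothetical digest over a split's sorted indices; not ModSSC's fingerprint."""
    canonical = {name: sorted(idx) for name, idx in split.items()}
    data = json.dumps(canonical, sort_keys=True).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


toy_split = {"train": [0, 2, 4, 5], "val": [1], "test": [3, 6], "labeled": [0, 5]}

# Write the split to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp) / "split.json"
    tmp_path.write_text(json.dumps(toy_split))
    reloaded = json.loads(tmp_path.read_text())

same = split_digest(reloaded) == split_digest(toy_split)
print(same)
```

Sorting the indices before hashing makes the digest independent of the order in which they were stored.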

## CLI

This mirrors the Python API above using the `modssc sampling` subcommand, invoked here through `python -m modssc` so that the notebook's own interpreter is used.


In [None]:
import subprocess
import sys


def run_cli(*args):
    cmd = [sys.executable, "-m", "modssc", *args]
    res = subprocess.run(cmd, text=True, capture_output=True)
    return res.returncode, res.stdout.strip(), res.stderr.strip()


print(run_cli("sampling", "show", str(path)))
print(run_cli("sampling", "validate", str(path), "--dataset", "toy"))
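
The helper above returns the exit code without acting on it, so a failed command can pass unnoticed in a notebook. A stricter variant, sketched below with the hypothetical name `run_module_checked`, raises when the exit code is non-zero. It is demonstrated with the standard-library `json.tool` module so that it runs without ModSSC installed.

```python
import subprocess
import sys


def run_module_checked(module, *args):
    """Run `python -m <module> <args>` and raise if the exit code is non-zero."""
    cmd = [sys.executable, "-m", module, *args]
    res = subprocess.run(cmd, text=True, capture_output=True)
    if res.returncode != 0:
        raise RuntimeError(
            f"{' '.join(cmd)} failed with code {res.returncode}: {res.stderr.strip()}"
        )
    return res.stdout.strip()


# Standard-library module used purely for demonstration.
help_text = run_module_checked("json.tool", "--help")
print(help_text.splitlines()[0])
```

For this notebook, the equivalent call would be `run_module_checked("modssc", "sampling", "show", str(path))`.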

## Graph example (optional)


In [None]:
from modssc.data_loader import load_dataset
from modssc.data_loader.errors import OptionalDependencyError
from modssc.sampling import sample
from modssc.sampling.plan import HoldoutSplitSpec, LabelingSpec, SamplingPlan

# Graph sampling example
# Graphs are often split transductively (masks over nodes) rather than inductively (index sets).
# ModSSC handles both cases through the same sampling API.

try:
    gds = load_dataset("cora", download=True)
    gfp = gds.meta.get("dataset_fingerprint")

    gplan = SamplingPlan(
        split=HoldoutSplitSpec(test_fraction=0.0, val_fraction=0.0, stratify=False),
        labeling=LabelingSpec(mode="per_class", value=20, min_per_class=1),
    )

    # overwrite=True allows re-running
    gres, gpath = sample(
        gds, plan=gplan, seed=0, dataset_fingerprint=gfp, save=True, overwrite=True
    )
    print("graph split:", gres.stats)

except OptionalDependencyError as e:
    print("SKIP graph, missing extra:", e.extra)
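
To show what a transductive split looks like in practice, the sketch below picks a fixed number of labeled nodes per class and turns the resulting indices into boolean node masks. Both helpers are hypothetical and use only the standard library; they are not ModSSC's implementation, and the toy labels stand in for a real graph dataset.

```python
import random
from collections import defaultdict


def pick_per_class(y, candidates, n_per_class, seed):
    """Choose up to n_per_class indices from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in candidates:
        by_class[y[i]].append(i)

    picked = []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        picked += members[:n_per_class]
    return sorted(picked)


def indices_to_mask(idx, n_nodes):
    """Convert an index list into a boolean mask over all nodes."""
    mask = [False] * n_nodes
    for i in idx:
        mask[i] = True
    return mask


node_labels = [0, 1, 2] * 30  # 90 toy nodes, 3 classes
labeled_nodes = pick_per_class(node_labels, range(len(node_labels)), n_per_class=20, seed=0)
labeled_mask = indices_to_mask(labeled_nodes, len(node_labels))
unlabeled_mask = [not m for m in labeled_mask]

print(sum(labeled_mask), sum(unlabeled_mask))  # 60 30
```

All nodes stay in the graph; the masks only decide which labels a method is allowed to see.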

## Outputs

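- The cells above should print the split fingerprint, the path of the saved split, the `stats` of each result, and the class distribution of each subset.
- If a cell fails early, the most likely cause is a missing optional dependency; the graph cell reports the name of the missing extra.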


## Next steps
- Explore the adjacent notebooks in this folder for the other pipeline components.
- If you hit an optional dependency error, install the suggested extra and rerun.
