# ModSSC | Preprocessing

Run preprocessing plans and inspect available steps and models.

## Objective
- Show the minimal steps to run this component in a notebook setting.
- Provide the exact objects to look at (outputs, shapes, metrics) to confirm it worked.

## Prerequisites
- Python 3.11+.
- `pip install modssc`.
- Optional dependencies vary by dataset and backend. If an import fails, install the matching extra and rerun.

## Outline
1) Imports and configuration
2) Core run (the part that does the work)
3) Sanity checks and outputs



## Notebook notes

This notebook shows how to build a deterministic, cacheable preprocessing pipeline and (optionally) compute embeddings with pretrained models.

## Installation

Base install from a source checkout (no heavy extras):
```bash
pip install -e .
```

With embedding extras (optional):
```bash
pip install -e ".[preprocess]"
```

## Imports and configuration



In [1]:
import numpy as np

from modssc.data_loader import load_dataset
from modssc.preprocess import PreprocessPlan, StepConfig, preprocess

## Load a dataset

We use the built-in `toy` dataset (deterministic, small, offline).


In [None]:
ds = load_dataset("toy")
ds.train.X.shape, ds.train.y.shape

((48, 4), (48,))

## List available preprocessing steps

Let's list the available preprocessing steps in ModSSC:

In [10]:
from modssc.preprocess.registry import available_steps

print("\n".join(available_steps()))

audio.wav2vec2
core.cast_dtype
core.ensure_2d
core.pca
core.random_projection
embeddings.auto
graph.attach_edge_weight
graph.edge_sparsify
labels.ensure_onehot
text.ensure_strings
text.hash_tokenizer
text.sentence_transformer
text.tfidf
vision.channels_order
vision.ensure_num_channels
vision.normalize
vision.openclip
vision.resize
vision.zca_whitening


The next cell prints each available step together with its metadata:

In [12]:
from modssc.preprocess.registry import available_steps, step_info

for sid in available_steps():
    info = step_info(sid)
    print(sid)
    print("  kind:", info["kind"])
    print("  modalities:", info["modalities"])
    print("  consumes:", info["consumes"])
    print("  produces:", info["produces"])
    print("  required_extra:", info["required_extra"])

audio.wav2vec2
  kind: featurizer
  modalities: ['audio']
  consumes: ['raw.X']
  produces: ['features.X']
  required_extra: preprocess-audio
core.cast_dtype
  kind: transform
  modalities: []
  consumes: ['features.X']
  produces: ['features.X']
  required_extra: None
core.ensure_2d
  kind: transform
  modalities: []
  consumes: ['raw.X']
  produces: ['features.X']
  required_extra: None
core.pca
  kind: fittable
  modalities: []
  consumes: ['features.X']
  produces: ['features.X']
  required_extra: None
core.random_projection
  kind: fittable
  modalities: []
  consumes: ['features.X']
  produces: ['features.X']
  required_extra: None
embeddings.auto
  kind: featurizer
  modalities: []
  consumes: ['raw.X']
  produces: ['features.X']
  required_extra: None
graph.attach_edge_weight
  kind: transform
  modalities: ['graph']
  consumes: ['graph.edge_index']
  produces: ['graph.edge_weight']
  required_extra: None
graph.edge_sparsify
  kind: transform
  modalities: ['graph']
  consumes:
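
The `consumes` and `produces` fields make it possible to reason about step order before running anything. The sketch below is not ModSSC's own validation logic. `check_plan` is a hypothetical helper, the metadata is copied by hand from the output above, and the assumption that `raw.X` is available at the start is ours.

```python
# Metadata copied from the step_info() output above (only the keys needed here).
STEP_META = {
    "core.ensure_2d": {"consumes": ["raw.X"], "produces": ["features.X"]},
    "core.cast_dtype": {"consumes": ["features.X"], "produces": ["features.X"]},
    "core.pca": {"consumes": ["features.X"], "produces": ["features.X"]},
}


def check_plan(step_ids, initial=("raw.X",)):
    """Hypothetical check: return a list of problems (empty means the order is consistent)."""
    available = set(initial)  # assumption: the raw inputs exist before any step runs
    problems = []
    for sid in step_ids:
        meta = STEP_META[sid]
        missing = [key for key in meta["consumes"] if key not in available]
        if missing:
            problems.append(f"{sid}: missing {missing}")
        available.update(meta["produces"])
    return problems


ok = check_plan(["core.ensure_2d", "core.cast_dtype", "core.pca"])
bad = check_plan(["core.pca", "core.ensure_2d"])
print(ok)   # []
print(bad)  # ["core.pca: missing ['features.X']"]
```

Under these assumptions, `core.ensure_2d` has to come before any step that consumes `features.X`.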

## CLI

You can access the same registries via the CLI.


In [None]:
import subprocess
import sys


def run_cli(*args):
    cmd = [sys.executable, "-m", "modssc", *args]
    res = subprocess.run(cmd, text=True, capture_output=True)
    return res.returncode, res.stdout.strip(), res.stderr.strip()


print(run_cli("preprocess", "steps", "list"))
print(run_cli("preprocess", "models", "list"))
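
Printing the tuple returned by `run_cli` shows the output as one escaped string and hides failures unless you read the return code. A possible variant, sketched here with a hypothetical helper named `run_checked`, returns only stdout and raises on a non-zero exit code. To keep the sketch runnable without ModSSC installed, it calls the Python interpreter itself as a stand-in command.

```python
import subprocess
import sys


def run_checked(cmd):
    """Run a command, return its stdout, and raise if the exit code is non-zero."""
    res = subprocess.run(cmd, text=True, capture_output=True)
    if res.returncode != 0:
        raise RuntimeError(f"{cmd!r} exited with {res.returncode}: {res.stderr.strip()}")
    return res.stdout.strip()


# Stand-in command; in the notebook, cmd would be
# [sys.executable, "-m", "modssc", "preprocess", "steps", "list"].
out = run_checked([sys.executable, "-c", "print('ok')"])
print(out)  # ok
```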

## Build a plan

For tabular data, we typically:
- create a 2D numeric matrix (`features.X`)
- cast dtype
- optionally reduce dimension


In [None]:
# Define a Preprocessing Plan
# A plan consists of a sequence of steps.
# Each step is defined by its ID (e.g., "core.pca") and optional parameters.
# output_key defines which artifact is considered the "main" output (usually "features.X").

plan = PreprocessPlan(
    steps=(
        StepConfig("core.ensure_2d"),
        StepConfig("core.cast_dtype", params={"dtype": "float32"}),
        StepConfig("core.pca", params={"n_components": 3}),
    ),
    output_key="features.X",
)
plan

PreprocessPlan(steps=(StepConfig(step_id='core.ensure_2d', params={}, modalities=(), requires_fields=(), enabled=True), StepConfig(step_id='core.cast_dtype', params={'dtype': 'float32'}, modalities=(), requires_fields=(), enabled=True), StepConfig(step_id='core.pca', params={'n_components': 3}, modalities=(), requires_fields=(), enabled=True)), output_key='features.X')

## Run preprocessing

Fittable steps such as `core.pca` need `fit_indices`, which index into the training split and select the rows used for fitting. Here we fit on the full training split.


In [None]:
# Run the preprocessing pipeline
# - fit_indices: Specifies which samples are used to FIT the steps (e.g., PCA).
#   Usually, this is the training set indices to avoid data leakage.
# - cache=True: Enables caching of intermediate and final results.

fit_idx = np.arange(ds.train.y.shape[0], dtype=np.int64)
res = preprocess(ds, plan, seed=0, fit_indices=fit_idx, cache=True)

out = res.dataset
print("Processed X shape:", out.train.X.shape)
print("Processed y shape:", out.train.y.shape)

Processed X shape: (48, 3)
Processed y shape: (48,)
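
The role of `fit_indices` can be illustrated without ModSSC: parameters are learned from the selected rows only and then applied to every row. The sketch below uses mean-centering as a simple stand-in for PCA. `fit_center` and `transform_center` are hypothetical names, not library functions.

```python
from statistics import fmean


def fit_center(rows, fit_indices):
    """Learn per-column means from the rows selected by fit_indices only."""
    selected = [rows[i] for i in fit_indices]
    return [fmean(column) for column in zip(*selected)]


def transform_center(rows, means):
    """Apply the learned means to every row, including rows not used for fitting."""
    return [[value - mean for value, mean in zip(row, means)] for row in rows]


rows = [[1.0, 10.0], [3.0, 30.0], [100.0, 1000.0]]
means = fit_center(rows, fit_indices=[0, 1])  # the third row is held out
centered = transform_center(rows, means)
print(means)        # [2.0, 20.0]
print(centered[2])  # [98.0, 980.0]
```

The held-out row does not influence the learned means, which is the property that prevents leakage when `fit_indices` covers only training rows.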

## Cache

When `cache=True`, each step stores its outputs under a directory named after the dataset fingerprint. `res.cache_dir` reports that location:


In [15]:
res.cache_dir

'/Users/melvin/Library/Caches/modssc/preprocess/dataset:c132a331e6104ef264c05eb27e89e26bd5380f1bc447110f50f75fc1276a5408'
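
How ModSSC names the per-step entries inside that directory is not shown here. As a rough illustration of fingerprint-keyed caching in general, the sketch below derives a directory from the dataset fingerprint plus a hash of the step id and its parameters. The layout and the `step_cache_dir` helper are assumptions made for this example.

```python
import hashlib
import json
from pathlib import Path


def step_cache_dir(root, dataset_fingerprint, step_id, params):
    """Hypothetical layout: <root>/<dataset fingerprint>/<hash of step id + params>."""
    payload = json.dumps({"step": step_id, "params": params}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return Path(root) / dataset_fingerprint / digest


a = step_cache_dir("cache", "dataset_abc", "core.pca", {"n_components": 3})
b = step_cache_dir("cache", "dataset_abc", "core.pca", {"n_components": 3})
c = step_cache_dir("cache", "dataset_abc", "core.pca", {"n_components": 5})
print(a == b)  # True: same step and params map to the same directory
print(a == c)  # False: changing a parameter changes the directory
```

Serializing with `sort_keys=True` keeps the key independent of dictionary ordering, so identical configurations always hit the same entry.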

## Tabular Preprocessing

Note: Requires `pip install -e ".[tabular]"` and internet access to download datasets.

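For tabular data, we typically ensure a 2D numeric matrix, cast the dtype, and optionally reduce the dimension. The plan below also applies a random projection and one-hot encodes the labels.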

In [16]:
import numpy as np

from modssc.data_loader.types import LoadedDataset, Split
from modssc.preprocess.api import preprocess
from modssc.preprocess.plan import PreprocessPlan, StepConfig

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10)).astype(np.float32)
y = rng.integers(0, 3, size=(64,), dtype=np.int64)
ds = LoadedDataset(
    train=Split(X=X, y=y), meta={"modality": "tabular", "dataset_fingerprint": "synthetic:tab64"}
)

plan = PreprocessPlan(
    steps=(
        StepConfig("core.ensure_2d"),
        StepConfig("core.cast_dtype"),
        StepConfig("core.pca", params={"n_components": 5}),
        StepConfig("core.random_projection", params={"n_components": 7}),
        StepConfig("labels.ensure_onehot"),
    )
)

res = preprocess(ds, plan, fit_indices=np.arange(0, 32, dtype=np.int64), cache=False)
print("X", res.dataset.train.X.shape)
print("onehot", res.train_artifacts.require("labels.y_onehot").shape)

X (64, 7)
onehot (64, 3)
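
The `(64, 3)` shape of the one-hot artifact follows from 64 samples and 3 classes. A minimal pure-Python sketch of one-hot encoding, independent of how `labels.ensure_onehot` is implemented, looks like this (the `one_hot` helper is hypothetical and assumes labels are integers in `0..K-1`):

```python
def one_hot(labels, num_classes=None):
    """Encode integer class labels as rows containing a single 1.0 entry."""
    if num_classes is None:
        num_classes = max(labels) + 1  # assumes labels are 0..K-1
    return [[1.0 if k == y else 0.0 for k in range(num_classes)] for y in labels]


encoded = one_hot([0, 2, 1, 2])
print(len(encoded), len(encoded[0]))  # 4 3
print(encoded[1])                     # [0.0, 0.0, 1.0]
```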


## Vision Preprocessing

Note: Requires `pip install -e ".[vision]"` and internet access to download datasets.

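For vision data, a typical plan fixes the channel order and channel count, resizes and normalizes the images, and then computes features. The plan below also applies ZCA whitening and ends with `embeddings.auto` followed by a random projection.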

In [17]:
import numpy as np

from modssc.data_loader.types import LoadedDataset, Split
from modssc.preprocess.api import preprocess
from modssc.preprocess.plan import PreprocessPlan, StepConfig

rng = np.random.default_rng(0)
X = rng.random(size=(32, 16, 16, 1)).astype(np.float32)
y = rng.integers(0, 2, size=(32,), dtype=np.int64)
ds = LoadedDataset(
    train=Split(X=X, y=y), meta={"modality": "vision", "dataset_fingerprint": "synthetic:vision32"}
)

plan = PreprocessPlan(
    steps=(
        StepConfig("vision.channels_order", params={"order": "NCHW"}),
        StepConfig("vision.ensure_num_channels", params={"num_channels": 3}),
        StepConfig("vision.resize", params={"height": 8, "width": 8}),
        StepConfig("vision.normalize"),
        StepConfig("vision.zca_whitening", params={"max_features": 4096}),
        StepConfig("embeddings.auto"),
        StepConfig("core.random_projection", params={"n_components": 6}),
    )
)

res = preprocess(ds, plan, fit_indices=np.arange(0, 16, dtype=np.int64), cache=False)
print("raw.X", np.asarray(res.train_artifacts.require("raw.X")).shape)
print("X", res.dataset.train.X.shape)

raw.X (32, 3, 8, 8)
X (32, 6)
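
The `raw.X` shape printed above can be followed step by step. The sketch below only does shape bookkeeping. It assumes the synthetic input is channel-last (`N, H, W, C`) and that `vision.normalize` and `vision.zca_whitening` leave the shape unchanged, which is consistent with the printed result. `trace_vision_shape` is a hypothetical helper.

```python
def trace_vision_shape(shape_nhwc, num_channels, height, width):
    """Follow an (N, H, W, C) shape through the shape-changing vision steps."""
    n, h, w, c = shape_nhwc
    return [
        ("input (NHWC)", (n, h, w, c)),
        ("vision.channels_order -> NCHW", (n, c, h, w)),
        ("vision.ensure_num_channels", (n, num_channels, h, w)),
        ("vision.resize", (n, num_channels, height, width)),
    ]


trace = trace_vision_shape((32, 16, 16, 1), num_channels=3, height=8, width=8)
for name, shape in trace:
    print(name, shape)
# The last line printed is: vision.resize (32, 3, 8, 8)
```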


## Graph Preprocessing

Note: Requires `pip install -e ".[graph]"` and internet access to download datasets.

For graph data, the plan below attaches edge weights and then sparsifies the edge set.

In [18]:
import numpy as np

from modssc.data_loader.types import LoadedDataset, Split
from modssc.preprocess.api import preprocess
from modssc.preprocess.plan import PreprocessPlan, StepConfig

rng = np.random.default_rng(0)
N, F = 50, 8
X = rng.normal(size=(N, F)).astype(np.float32)
y = rng.integers(-1, 3, size=(N,), dtype=np.int64)
E = 200
edge_index = np.stack(
    [rng.integers(0, N, size=(E,), dtype=np.int64), rng.integers(0, N, size=(E,), dtype=np.int64)],
    axis=0,
)
masks = {
    "train": (np.arange(N) < 25),
    "val": ((np.arange(N) >= 25) & (np.arange(N) < 35)),
    "test": (np.arange(N) >= 35),
}
ds = LoadedDataset(
    train=Split(X=X, y=y, edges=edge_index, masks=masks),
    meta={"modality": "graph", "dataset_fingerprint": "synthetic:graph50"},
)

plan = PreprocessPlan(
    steps=(
        StepConfig("graph.attach_edge_weight"),
        StepConfig("graph.edge_sparsify", params={"keep_fraction": 0.3}),
    )
)

res = preprocess(ds, plan, fit_indices=np.arange(0, 25, dtype=np.int64), cache=False)
edges = res.dataset.train.edges
print(
    "edge_index", edges["edge_index"].shape if isinstance(edges, dict) else np.asarray(edges).shape
)
print("edge_weight", edges["edge_weight"].shape if isinstance(edges, dict) else "none")

edge_index (2, 64)
edge_weight (64,)


## Audio Preprocessing

Note: Requires `pip install -e ".[audio]"` and internet access to download datasets.

For audio data, the plan below computes embeddings with `embeddings.auto` and then reduces them with a random projection.

In [19]:
import numpy as np

from modssc.data_loader.types import LoadedDataset, Split
from modssc.preprocess.api import preprocess
from modssc.preprocess.plan import PreprocessPlan, StepConfig

rng = np.random.default_rng(0)
X = [rng.standard_normal(800, dtype=np.float32) for _ in range(40)]
y = rng.integers(0, 4, size=(40,), dtype=np.int64)
ds = LoadedDataset(
    train=Split(X=X, y=y), meta={"modality": "audio", "dataset_fingerprint": "synthetic:audio40"}
)

plan = PreprocessPlan(
    steps=(
        StepConfig("embeddings.auto"),
        StepConfig("core.random_projection", params={"n_components": 5}),
    )
)

res = preprocess(ds, plan, fit_indices=np.arange(0, 20, dtype=np.int64), cache=False)
print("X", res.dataset.train.X.shape)

X (40, 5)
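
Several plans in this notebook end with `core.random_projection`. As a rough picture of the technique, the sketch below implements a Gaussian random projection in pure Python with a seeded generator. It is not ModSSC's implementation. The `random_projection` function and its `seed` argument are assumptions made for illustration.

```python
import random


def random_projection(rows, n_components, seed=0):
    """Project rows onto n_components random Gaussian directions (seeded, so repeatable)."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    scale = n_components ** -0.5  # entries drawn from N(0, 1 / n_components)
    matrix = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(n_components)]
        for _ in range(n_features)
    ]
    return [
        [sum(x * matrix[j][k] for j, x in enumerate(row)) for k in range(n_components)]
        for row in rows
    ]


rows = [[float(i + j) for j in range(10)] for i in range(4)]
a = random_projection(rows, n_components=5, seed=0)
b = random_projection(rows, n_components=5, seed=0)
print(len(a), len(a[0]))  # 4 5
print(a == b)             # True: the same seed gives the same projection
```

Fixing the seed is what makes such a step repeatable across runs, which matters for a pipeline that is meant to be deterministic and cacheable.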


## Text Preprocessing

Note: Requires `pip install -e ".[text]"` and internet access to download datasets.

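For text data, we typically ensure the inputs are strings, tokenize them, and compute embeddings. The plan below also applies a random projection and one-hot encodes the labels.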

In [20]:
import numpy as np

from modssc.data_loader.types import LoadedDataset, Split
from modssc.preprocess.api import preprocess
from modssc.preprocess.plan import PreprocessPlan, StepConfig

X = np.array(["a b c", "hello world", "x y", ""] * 16, dtype=object)
y = np.array([0, 1, 0, 1] * 16, dtype=np.int64)
ds = LoadedDataset(
    train=Split(X=X, y=y), meta={"modality": "text", "dataset_fingerprint": "synthetic:text64"}
)

plan = PreprocessPlan(
    steps=(
        StepConfig("text.ensure_strings"),
        StepConfig("text.hash_tokenizer", params={"max_length": 16, "vocab_size": 2000}),
        StepConfig("embeddings.auto"),
        StepConfig("core.random_projection", params={"n_components": 8}),
        StepConfig("labels.ensure_onehot"),
    )
)

res = preprocess(ds, plan, fit_indices=np.arange(0, 32, dtype=np.int64), cache=False)
print("X", res.dataset.train.X.shape)
print("tokens", res.train_artifacts.require("tokens.input_ids").shape)

X (64, 8)
tokens (64, 16)
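
The `(64, 16)` token shape reflects `max_length=16`: each text becomes a fixed-length row of ids. The sketch below shows one way a hashing tokenizer can work. It is an assumption about the general technique, not the code behind `text.hash_tokenizer`. The whitespace splitting, the padding id `0`, and the `hash_tokenize` helper are all hypothetical.

```python
import hashlib


def hash_tokenize(text, max_length=16, vocab_size=2000, pad_id=0):
    """Map whitespace tokens to ids in [1, vocab_size) and pad or truncate to max_length."""
    ids = []
    for token in text.split()[:max_length]:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # A stable hash keeps ids identical across processes, unlike the built-in hash().
        ids.append(1 + int.from_bytes(digest[:8], "big") % (vocab_size - 1))
    return ids + [pad_id] * (max_length - len(ids))


token_rows = [hash_tokenize(t) for t in ["a b c", "hello world", "x y", ""]]
print(len(token_rows), len(token_rows[0]))  # 4 16
print(token_rows[3] == [0] * 16)            # True: the empty string is all padding
```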


## Custom Steps

You can define custom steps and register them dynamically.


In [None]:
from dataclasses import dataclass
from typing import Any

import numpy as np

from modssc.data_loader.types import LoadedDataset, Split
from modssc.preprocess.api import preprocess
from modssc.preprocess.plan import PreprocessPlan, StepConfig
from modssc.preprocess.registry import default_step_registry
from modssc.preprocess.store import ArtifactStore
from modssc.preprocess.types import StepSpec


# --- 1. Define a custom step class ---
# The class must implement a `transform` method.
# It can optionally implement `fit` if it needs to learn parameters.
@dataclass
class MyCustomStep:
    factor: float = 2.0

    def transform(self, store: ArtifactStore, *, rng: np.random.Generator) -> dict[str, Any]:
        # Access artifacts from the store
        # We assume "features.X" is already available (e.g. from ensure_2d)
        X = store.require("features.X")
        return {"features.X": X * self.factor}


# --- 2. Register it dynamically ---
# We need a StepSpec to tell the registry how to instantiate and use it.
# import_path="__main__:MyCustomStep" works because we defined it in the notebook (main module).
my_spec = StepSpec(
    step_id="custom.multiply",
    import_path="__main__:MyCustomStep",
    kind="transform",
    description="Multiply features by a factor.",
    required_extra=None,
    modalities=(),
    consumes=("features.X",),
    produces=("features.X",),
)

# Get the default registry and add our spec
reg = default_step_registry()
reg.specs["custom.multiply"] = my_spec

# --- 3. Create a simple numeric dataset for demonstration ---
# (Previous cells might have left 'ds' as text or audio, which ensure_2d can't handle directly)
X_custom = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], dtype=np.float32)
y_custom = np.array([0, 1, 0], dtype=np.int64)
ds_custom = LoadedDataset(
    train=Split(X=X_custom, y=y_custom),
    meta={"modality": "tabular", "dataset_fingerprint": "custom_demo"},
)

# --- 4. Build and run a plan using the custom step ---
plan_custom = PreprocessPlan(
    steps=(
        StepConfig("core.ensure_2d"),
        StepConfig("custom.multiply", params={"factor": 10.0}),
    ),
    output_key="features.X",
)

# Pass the registry explicitly so it finds "custom.multiply"
res_custom = preprocess(ds_custom, plan_custom, seed=0, fit_indices=None, registry=reg)

print("Original:\n", X_custom)
print("Processed:\n", res_custom.dataset.train.X)

Original:
 [[1. 2.]
 [3. 4.]
 [5. 6.]]
Processed:
 [[10. 20.]
 [30. 40.]
 [50. 60.]]
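
The `import_path="__main__:MyCustomStep"` string follows a `module:attribute` pattern. How ModSSC resolves it internally is not shown in this notebook. The sketch below illustrates the general idea with `importlib`. `resolve_import_path` is a hypothetical helper, and a standard-library class stands in for a step class.

```python
import importlib


def resolve_import_path(import_path):
    """Split 'module:attribute', import the module, and return the named attribute."""
    module_name, sep, attr = import_path.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:attribute', got {import_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Stand-in target so the sketch runs anywhere; a step class would be resolved the same way.
cls = resolve_import_path("collections:OrderedDict")
instance = cls(factor=10.0)  # params passed as keyword arguments
print(cls.__name__, instance["factor"])  # OrderedDict 10.0
```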


## Using Pretrained Embeddings

This section demonstrates how to use pretrained models (like Sentence Transformers, OpenCLIP, Wav2Vec2) to generate embeddings.

### Sentence-Transformers (Text)

In [30]:
import numpy as np

from modssc.data_loader.types import LoadedDataset, Split
from modssc.preprocess.api import preprocess
from modssc.preprocess.plan import PreprocessPlan, StepConfig

X = np.array(["bonjour le monde", "exemple court"] * 64, dtype=object)
y = np.array([0, 1] * 64, dtype=np.int64)
ds = LoadedDataset(
    train=Split(X=X, y=y), meta={"modality": "text", "dataset_fingerprint": "synthetic:text_st"}
)

plan = PreprocessPlan(steps=(StepConfig("text.sentence_transformer"),))
res = preprocess(ds, plan, fit_indices=np.arange(0, 10, dtype=np.int64), cache=False)

print("X_shape", X.shape)
print("X_embedded_shape", res.dataset.train.X.shape)

X_shape (128,)
X_embedded_shape (128, 384)


### OpenCLIP (Vision)

In [None]:
import numpy as np

from modssc.data_loader.types import LoadedDataset, Split
from modssc.preprocess.api import preprocess
from modssc.preprocess.plan import PreprocessPlan, StepConfig

rng = np.random.default_rng(0)
X = (rng.random(size=(8, 64, 64, 3)) * 255).astype(np.uint8)
y = rng.integers(0, 2, size=(8,), dtype=np.int64)
ds = LoadedDataset(
    train=Split(X=X, y=y),
    meta={"modality": "vision", "dataset_fingerprint": "synthetic:vision_clip"},
)

plan = PreprocessPlan(steps=(StepConfig("vision.openclip"),))
res = preprocess(ds, plan, fit_indices=np.arange(0, 4, dtype=np.int64), cache=False)
print("X_shape", X.shape)
print("X_embedded_shape", res.dataset.train.X.shape)



X_shape (8, 64, 64, 3)
X_embedded_shape (8, 512)


### Wav2Vec2 (Audio)

In [36]:
import numpy as np

from modssc.data_loader.types import LoadedDataset, Split
from modssc.preprocess.api import preprocess
from modssc.preprocess.plan import PreprocessPlan, StepConfig

rng = np.random.default_rng(0)
X = [rng.standard_normal(16000, dtype=np.float32) for _ in range(4)]
y = rng.integers(0, 2, size=(4,), dtype=np.int64)
ds = LoadedDataset(
    train=Split(X=X, y=y), meta={"modality": "audio", "dataset_fingerprint": "synthetic:audio_w2v"}
)

plan = PreprocessPlan(steps=(StepConfig("audio.wav2vec2"),))
res = preprocess(ds, plan, fit_indices=np.arange(0, 2, dtype=np.int64), cache=False)
print("X_shape", len(X), [x.shape for x in X])
print("X_embedded_shape", res.dataset.train.X.shape)

X_shape 4 [(16000,), (16000,), (16000,), (16000,)]
X_embedded_shape (4, 768)


## Outputs

- The last cells should print key shapes and a minimal metric or artifact summary.
- If something fails early, the error should point to a missing optional dependency.


## Next steps
- Explore the adjacent notebooks in this folder for the other pipeline components.
- If you hit an optional dependency error, install the suggested extra and rerun.
