# Data‑related

This tutorial provides a brief introduction to the **data‑related components** of PhyAGI and shows you **how to load** pre‑processed data with `phyagi-sdk`.

After completing this tutorial, you will be able to:

- Understand the data‑handling features of **PhyAGI**.
- Tokenize a dataset from scratch (class‑style *and* functional‑style).
- Load pre‑processed data from configuration objects or YAML files.

## Datasets

In PhyAGI (as in PyTorch), a **dataset** is a *class* that implements the `__len__` and `__getitem__` methods. In PhyAGI, each call to `__getitem__` should return a **dictionary** representing one sample.

A dataset is responsible for:

- Loading the raw data.
- Applying processing or augmentation.
- Returning each example as a dictionary with the following keys:
  - `input_ids`: sequence of input tokens.
  - `labels`: sequence of target tokens.

*(Optional)* You can attach other fields (e.g. `attention_mask`) directly in the dataset class or add them later via a **data collator**.

The following example builds an `LMDataset` from an array of zeros. Each call to `dataset[i]` returns the dictionary described above:


In [1]:
import numpy as np
from phyagi.datasets.train.lm.lm_dataset import LMDataset

inputs = np.zeros(2048)
dataset = LMDataset(inputs, seq_len=128)
print(dataset[0])

{'input_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0]), 'labels': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])}
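
To make the `__len__`/`__getitem__` contract concrete, here is a minimal pure-Python sketch. `ToyLMDataset` is a hypothetical name, not part of PhyAGI, and it returns plain lists rather than tensors. Slicing a flat token stream into non-overlapping `seq_len` blocks is an assumption made for illustration, not a description of how `LMDataset` is implemented.

```python
class ToyLMDataset:
    """Hypothetical stand-in for a language-modeling dataset (not PhyAGI's LMDataset)."""

    def __init__(self, tokens, seq_len):
        self.tokens = list(tokens)
        self.seq_len = seq_len

    def __len__(self):
        # Number of complete, non-overlapping blocks (assumed slicing scheme).
        return len(self.tokens) // self.seq_len

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        start = idx * self.seq_len
        block = self.tokens[start : start + self.seq_len]
        # One sample is a dictionary with `input_ids` and `labels`.
        return {"input_ids": block, "labels": list(block)}


toy_dataset = ToyLMDataset([0] * 2048, seq_len=128)
sample = toy_dataset[0]
print(len(toy_dataset), len(sample["input_ids"]))  # 16 128
```

Raising `IndexError` for out-of-range indices matters in practice: it is what lets a plain `for sample in toy_dataset` loop terminate.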


## Dataset providers

When you work with large corpora or multiple files, it is convenient to put all data‑loading logic inside a **dataset provider**. A provider knows how to:

1. Locate the raw data (local path, S3 bucket, or Hugging Face Hub).
2. Load and cache it.
3. Preprocess and tokenize it.
4. Return ready‑to‑use *training* and *validation* datasets.

The snippet below downloads a dataset from the Hugging Face Hub **once**, preprocesses it, and then reuses the same provider instance to obtain both splits:

In [2]:
from phyagi.datasets.train.lm.lm_dataset_provider import LMDatasetProvider

dataset_provider = LMDatasetProvider.from_hub(
    dataset_path="wikitext",
    dataset_name="wikitext-2-raw-v1",
    tokenizer="gpt2",
)

train_dataset = dataset_provider.get_train_dataset()
val_dataset = dataset_provider.get_val_dataset()
print(train_dataset[0], val_dataset[0])

[phyagi] [2025-05-19 09:05:03,655] [INFO] [lm_dataset_provider.py:252:from_hub] Loading non-tokenized dataset...
[phyagi] [2025-05-19 09:05:08,094] [INFO] [lm_dataset_provider.py:285:from_hub] Creating validation split (if necessary)...


Processing dataset with shared memory...:   0%|          | 0/5 [00:00<?, ? examples/s]

Processing dataset with shared memory...:   0%|          | 0/37 [00:00<?, ? examples/s]

Processing dataset with shared memory...:   0%|          | 0/4 [00:00<?, ? examples/s]

[phyagi] [2025-05-19 09:05:09,982] [INFO] [lm_dataset_provider.py:309:from_hub] Saving tokenized dataset: cache
{'input_ids': tensor([  796,   569, 18354,  ...,   373,  3194, 16385]), 'labels': tensor([  796,   569, 18354,  ...,   373,  3194, 16385])} {'input_ids': tensor([  796,  8074, 20272,  ...,  7423,  1267,   837]), 'labels': tensor([  796,  8074, 20272,  ...,  7423,  1267,   837])}
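
The four responsibilities listed above can be sketched without any PhyAGI code. Everything in the block below is hypothetical: `ToyProvider`, its whitespace "tokenizer", and the `val_fraction` argument are illustrative assumptions; only the method names `get_train_dataset` and `get_val_dataset` mirror the provider API used above.

```python
class ToyProvider:
    """Hypothetical provider: load once, tokenize once, serve both splits."""

    def __init__(self, documents, val_fraction=0.1):
        # Steps 1-2: "locate and load" the raw data (an in-memory list here).
        vocab = {}
        # Step 3: toy whitespace tokenizer that assigns IDs on first sight.
        tokenized = [
            [vocab.setdefault(word, len(vocab)) for word in doc.split()]
            for doc in documents
        ]
        # Step 4: hold out the last documents as a validation split.
        n_val = max(1, int(len(tokenized) * val_fraction))
        self._train = tokenized[:-n_val]
        self._val = tokenized[-n_val:]

    def get_train_dataset(self):
        return [{"input_ids": ids, "labels": list(ids)} for ids in self._train]

    def get_val_dataset(self):
        return [{"input_ids": ids, "labels": list(ids)} for ids in self._val]


provider = ToyProvider(["a b c", "b c d", "c d e", "d e f"], val_fraction=0.25)
train_split = provider.get_train_dataset()
val_split = provider.get_val_dataset()
print(len(train_split), len(val_split))  # 3 1
```

Because tokenization happens once in the constructor, both getters are cheap, which is the same reason a single provider instance is reused for both splits above.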


## Data collators

Just before data enters the model you often need to perform *batch‑level* adjustments such as padding or masking. In PhyAGI you do this with a **data collator**.

The example below constructs an `LMDataCollator` that ignores the token with `token_id = 796`:

In [3]:
from phyagi.datasets.train.train_data_collator import LMDataCollator

data_collator = LMDataCollator(ignore_token_ids=[796], ignore_index=-100)
samples = [train_dataset[0], train_dataset[1]]

collated_samples = data_collator(samples)
assert collated_samples["labels"][0][0] == -100
print(collated_samples)

{'input_ids': tensor([[  796,   569, 18354,  ...,   373,  3194, 16385],
        [11308,   575, 10546,  ..., 13460,  5670,   319]]), 'labels': tensor([[ -100,   569, 18354,  ...,   373,  3194, 16385],
        [11308,   575, 10546,  ..., 13460,  5670,   319]])}


*Note: `796` is used here only as an example; you can instruct the collator to ignore any token ID, or none at all.*
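
The masking visible in the output above can be mimicked in pure Python. `toy_collate` is a hypothetical helper: it assumes all samples have the same length and only reproduces the observed behavior (labels equal to an ignored ID are replaced by `ignore_index`, inputs are left untouched). It is not the `LMDataCollator` implementation.

```python
def toy_collate(samples, ignore_token_ids=(), ignore_index=-100):
    """Stack equal-length samples and mask ignored token IDs in the labels only."""
    ignored = set(ignore_token_ids)
    return {
        "input_ids": [list(s["input_ids"]) for s in samples],
        "labels": [
            [ignore_index if tok in ignored else tok for tok in s["labels"]]
            for s in samples
        ],
    }


toy_samples = [
    {"input_ids": [796, 569, 18354], "labels": [796, 569, 18354]},
    {"input_ids": [11308, 575, 10546], "labels": [11308, 575, 10546]},
]
toy_batch = toy_collate(toy_samples, ignore_token_ids=[796])
print(toy_batch["labels"])  # [[-100, 569, 18354], [11308, 575, 10546]]
```

The value `-100` is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so positions masked this way do not contribute to the loss.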

For more details about datasets, providers, and collators, consult the [official documentation](https://microsoft.github.io/phyagi/api/datasets.html).

# Tokenizing data

PhyAGI offers **two approaches** to tokenize a dataset:

1. **Class‑style** – wrap the entire workflow in an `LMDatasetProvider`. *Recommended for most users.*  
2. **Functional‑style** – call low-level helper functions directly. *Useful when you need complete control over preprocessing.*

The sections below illustrate both approaches.

## Class-style

`LMDatasetProvider` encapsulates the entire tokenization pipeline. Under the hood it leverages the Hugging Face `datasets` library, caches the processed data, and exposes a clean API:

In [4]:
from phyagi.datasets.train.lm.lm_dataset_provider import LMDatasetProvider

dataset_provider = LMDatasetProvider.from_hub(
    dataset_path="wikitext",
    dataset_name="wikitext-2-raw-v1",
    tokenizer="gpt2",
    cache_dir="cache",
)

[phyagi] [2025-05-19 09:05:10,858] [INFO] [lm_dataset_provider.py:252:from_hub] Loading non-tokenized dataset...
[phyagi] [2025-05-19 09:05:15,890] [INFO] [lm_dataset_provider.py:285:from_hub] Creating validation split (if necessary)...


Processing dataset with shared memory...:   0%|          | 0/5 [00:00<?, ? examples/s]

Processing dataset with shared memory...:   0%|          | 0/37 [00:00<?, ? examples/s]

Processing dataset with shared memory...:   0%|          | 0/4 [00:00<?, ? examples/s]

[phyagi] [2025-05-19 09:05:18,372] [INFO] [lm_dataset_provider.py:309:from_hub] Saving tokenized dataset: cache


The previous snippet tokenizes the entire WikiText-2 dataset and stores the result in the `cache/` directory:

- `train.npy`: Tokenized training split.
- `validation.npy`: Tokenized validation split (if available). 
- `tokenizer.pkl`: Serialized tokenizer object.

You can customize the provider by passing the corresponding arguments, for example to change the text column name (default: `text`), adjust the block size, or supply a different tokenizer:

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
dataset_provider = LMDatasetProvider.from_hub(
    dataset_path="glue",
    dataset_name="sst2",
    tokenizer=tokenizer,
    mapping_column_name="sentence",
    cache_dir="cache",
)

[phyagi] [2025-05-19 09:05:18,925] [INFO] [lm_dataset_provider.py:252:from_hub] Loading non-tokenized dataset...
[phyagi] [2025-05-19 09:05:20,719] [INFO] [lm_dataset_provider.py:285:from_hub] Creating validation split (if necessary)...


Processing dataset with shared memory...:   0%|          | 0/68 [00:00<?, ? examples/s]

Processing dataset with shared memory...:   0%|          | 0/1 [00:00<?, ? examples/s]

Processing dataset with shared memory...:   0%|          | 0/2 [00:00<?, ? examples/s]

[phyagi] [2025-05-19 09:05:21,556] [INFO] [lm_dataset_provider.py:309:from_hub] Saving tokenized dataset: cache


## Functional-style

If you prefer *not* to use a provider, you can call the low-level tokenization utilities directly. The snippet below shows how to tokenize a dataset using plain functions:

In [6]:
import numpy as np
from transformers import AutoTokenizer

from datasets import load_dataset
from phyagi.datasets.train.lm.lm_dataset_provider import _tokenize_concatenated

dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

mapping_column_name = ["text"]
dtype = np.uint16 if tokenizer.vocab_size < 64 * 1024 else np.int32
cache_dir = "cache"

dataset = dataset.map(
    _tokenize_concatenated,
    fn_kwargs={
        "tokenizer": tokenizer,
        "mapping_column_name": mapping_column_name,
        "use_eos_token": True,
        "dtype": dtype,
    },
    batched=True,
    num_proc=1,
    remove_columns=dataset["train"].column_names,
)
dataset.save_to_disk(cache_dir)

Saving the dataset (0/1 shards):   0%|          | 0/5 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/37 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/4 [00:00<?, ? examples/s]

Internally, the functional helpers still rely on the Hugging Face `datasets` library, so the tokenized dataset is saved as an **Arrow** file.
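
As a rough mental model of the cell above, the sketch below concatenates documents into one token stream, optionally closing each document with an end-of-sequence ID, and picks the narrowest integer type that can hold every token ID. All names are hypothetical, and the byte-level "tokenizer" is a toy. How `_tokenize_concatenated` actually handles batching and EOS placement is not shown in this tutorial, so treat the EOS logic as an assumption.

```python
def toy_tokenize_concatenated(documents, encode, eos_token_id=None):
    """Encode each document and join the results into one flat token stream."""
    stream = []
    for doc in documents:
        stream.extend(encode(doc))
        if eos_token_id is not None:
            stream.append(eos_token_id)  # mark the document boundary
    return stream


def smallest_dtype_name(vocab_size):
    # Same rule as the cell above: uint16 can hold IDs 0..65535.
    return "uint16" if vocab_size < 64 * 1024 else "int32"


def toy_encode(text):
    # Toy byte-level "tokenizer", purely for illustration.
    return list(text.encode("utf-8"))


toy_stream = toy_tokenize_concatenated(["ab", "c"], toy_encode, eos_token_id=256)
print(toy_stream)                  # [97, 98, 256, 99, 256]
print(smallest_dtype_name(50257))  # uint16 (50257 is the GPT-2 vocabulary size)
```

Storing IDs as `uint16` instead of `int32` halves the size of the tokenized files, which adds up quickly for large corpora.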

If you need a NumPy array you can:

1. Load the Arrow dataset into memory with `process_dataset_to_memory`.  
2. Save it with `save_memory_dataset`.

Both helpers live in `phyagi.datasets.shared_memory_utils`.

In [7]:
from phyagi.datasets.shared_memory_utils import process_dataset_to_memory, save_memory_dataset

processed_dataset_dict = process_dataset_to_memory(
    dataset,
    cache_dir,
    dtype,
    ["input_ids"],
    num_workers=1,
    use_shared_memory=True,
)
save_memory_dataset(processed_dataset_dict, tokenizer, cache_dir, use_shared_memory=True)

Processing dataset with shared memory...:   0%|          | 0/5 [00:00<?, ? examples/s]

Processing dataset with shared memory...:   0%|          | 0/37 [00:00<?, ? examples/s]

Processing dataset with shared memory...:   0%|          | 0/4 [00:00<?, ? examples/s]

{'test_file': PosixPath('cache/test.npy'),
 'train_file': PosixPath('cache/train.npy'),
 'validation_file': PosixPath('cache/validation.npy'),
 'tokenizer_file': PosixPath('cache/tokenizer.pkl')}

# Loading data

PhyAGI ships with a collection of **providers** and their respective **configuration classes**. These classes allow you to load pre‑tokenized datasets with a single call, without repeating the preprocessing steps.

## From a configuration object

The example below uses a `DatasetProviderConfig` with custom `cache_dir` and `seq_len` values to load the dataset tokenized in the previous section. Every provider has its own config class, whose parameters can be customized in the same way:

In [8]:
from phyagi.datasets.dataset_provider import DatasetProviderConfig
from phyagi.datasets.train.lm.lm_dataset_provider import LMDatasetProvider

dataset_config = DatasetProviderConfig(cache_dir="cache", seq_len=512)
dataset_provider = LMDatasetProvider.from_config(dataset_config)

dataset = dataset_provider.get_train_dataset()
print(dataset[0])
print(len(dataset))

{'input_ids': tensor([  796,   569, 18354,  7496, 17740,  6711,   796,   220,   198, 50256,
         2311,    73, 13090,   645,   569, 18354,  7496,   513,  1058,   791,
        47398, 17740,   357,  4960,  1058, 10545,   230,    99,   161,   254,
          112,  5641, 44444,  9202, 25084, 24440, 12675, 11839,    18,   837,
         6578,   764,   569, 18354,  7496,   286,   262, 30193,   513,  1267,
          837,  8811,  6412,   284,   355,   569, 18354,  7496, 17740,  6711,
         2354,  2869,   837,   318,   257, 16106,  2597,  2488,    12,    31,
         2712,  2008,   983,  4166,   416, 29490,   290,  6343,    13, 44206,
          329,   262, 14047, 44685,   764, 28728,   287,  3269,  2813,   287,
         2869,   837,   340,   318,   262,  2368,   983,   287,   262,   569,
        18354,  7496,  2168,   764, 12645,   278,   262,   976, 21748,   286,
        16106,   290,  1103,  2488,    12,    31,   640, 11327,   355,   663,
        27677,   837,   262,  1621,  4539, 10730, 

## From a configuration file

You can also store the configuration as YAML and reload it later. The snippet below uses an inline dictionary in place of the parsed contents of `config.yaml` and builds a `DatasetProviderConfig` from it:

In [9]:
# Load the configuration and use it to instantiate the dataset configuration
# config = load_config("config.yaml")
config = {
    "cache_dir": "cache",
    "seq_len": 512,
}

dataset_config = DatasetProviderConfig.from_dict(config)
dataset_provider = LMDatasetProvider.from_config(dataset_config)

dataset = dataset_provider.get_train_dataset()
print(dataset[0])
print(len(dataset))

{'input_ids': tensor([  796,   569, 18354,  7496, 17740,  6711,   796,   220,   198, 50256,
         2311,    73, 13090,   645,   569, 18354,  7496,   513,  1058,   791,
        47398, 17740,   357,  4960,  1058, 10545,   230,    99,   161,   254,
          112,  5641, 44444,  9202, 25084, 24440, 12675, 11839,    18,   837,
         6578,   764,   569, 18354,  7496,   286,   262, 30193,   513,  1267,
          837,  8811,  6412,   284,   355,   569, 18354,  7496, 17740,  6711,
         2354,  2869,   837,   318,   257, 16106,  2597,  2488,    12,    31,
         2712,  2008,   983,  4166,   416, 29490,   290,  6343,    13, 44206,
          329,   262, 14047, 44685,   764, 28728,   287,  3269,  2813,   287,
         2869,   837,   340,   318,   262,  2368,   983,   287,   262,   569,
        18354,  7496,  2168,   764, 12645,   278,   262,   976, 21748,   286,
        16106,   290,  1103,  2488,    12,    31,   640, 11327,   355,   663,
        27677,   837,   262,  1621,  4539, 10730, 
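
The dictionary-to-config step can be pictured with a small dataclass. This is a sketch only: `ToyProviderConfig` is a hypothetical class, its defaults are arbitrary, and its fields simply reuse the keys that appear in this tutorial. It is not the definition of `DatasetProviderConfig`.

```python
from dataclasses import asdict, dataclass


@dataclass
class ToyProviderConfig:
    """Hypothetical config; field names follow the keys used in this tutorial."""

    cache_dir: str
    seq_len: int = 2048  # default chosen arbitrarily for this sketch
    weight: float = 1.0

    @classmethod
    def from_dict(cls, config):
        # Unknown keys raise TypeError, which surfaces typos in the file early.
        return cls(**config)


toy_config = ToyProviderConfig.from_dict({"cache_dir": "cache", "seq_len": 512})
print(asdict(toy_config))  # {'cache_dir': 'cache', 'seq_len': 512, 'weight': 1.0}
```

Keeping the configuration as a plain dictionary at the boundary means the same code path works whether the values come from YAML, JSON, or a literal in a notebook.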

Alternatively, PhyAGI exposes a set of **getter functions** that handle provider creation, caching, and concatenation of multiple datasets behind the scenes.

`get_dataset` is especially handy when your training or fine‑tuning job involves more than one dataset. You can even oversample or undersample each dataset via the `weight` parameter, as demonstrated below:

In [10]:
from phyagi.datasets.registry import get_dataset

config = {
    "cache_dir": "cache",
    "seq_len": 512,
    "weight": 1.5,
}

dataset, _ = get_dataset([config, config], dataset_concat="random", dataset_provider="lm")
print(dataset[0])
print(len(dataset))

[phyagi] [2025-05-19 09:05:29,029] [INFO] [dataset.py:108:get_dataset] Loading datasets...
[phyagi] [2025-05-19 09:05:29,032] [INFO] [dataset.py:114:get_dataset] Datasets: [DatasetProviderConfig(cache_dir=PosixPath('/home/gderosa/phyagi-sdk/docs/tutorials/cache'), train_file=PosixPath('/home/gderosa/phyagi-sdk/docs/tutorials/cache/train.npy'), validation_file=PosixPath('/home/gderosa/phyagi-sdk/docs/tutorials/cache/validation.npy'), validation_split=None, tokenizer_file=None, seq_len=512, shift_labels=False, weight=1.5, label='', ignore_token_id=-100, seed=42, random_mask_prob=None), DatasetProviderConfig(cache_dir=PosixPath('/home/gderosa/phyagi-sdk/docs/tutorials/cache'), train_file=PosixPath('/home/gderosa/phyagi-sdk/docs/tutorials/cache/train.npy'), validation_file=PosixPath('/home/gderosa/phyagi-sdk/docs/tutorials/cache/validation.npy'), validation_split=None, tokenizer_file=None, seq_len=512, shift_labels=False, weight=1.5, label='', ignore_token_id=-100, seed=42, random_mask_pro

Internally, `get_dataset` instantiates each provider, applies the requested sampling strategy, and concatenates the resulting datasets according to `dataset_concat`.

*In the example above, the final length **14154** reflects that each dataset was oversampled by **1.5×** before concatenation.*
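
If the effective length of each dataset is its original length multiplied by `weight` (an assumption; the rounding rule is not spelled out here), the reported total implies 14154 / (2 × 1.5) = 4718 samples per source dataset. The sketch below reproduces that arithmetic with hypothetical helpers. It repeats indices cyclically and concatenates sequentially, which is a simplification of whatever `dataset_concat="random"` does internally.

```python
def toy_weighted_indices(length, weight):
    """Index list that repeats (weight > 1) or truncates (weight < 1) a dataset."""
    target = int(length * weight)
    return [i % length for i in range(target)]


def toy_weighted_concat(lengths, weights):
    # Pairs of (dataset_id, sample_index) over all resampled datasets.
    return [
        (dataset_id, idx)
        for dataset_id, (length, weight) in enumerate(zip(lengths, weights))
        for idx in toy_weighted_indices(length, weight)
    ]


combined = toy_weighted_concat([4718, 4718], [1.5, 1.5])
print(len(combined))  # 14154
```

With `weight=0.5` the same helper would keep only the first half of each dataset, which corresponds to undersampling.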