# Big data? 🤗 Datasets to the rescue!

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [2]:
!pip install datasets evaluate transformers[sentencepiece]

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.6


In [3]:
!pip install zstandard



In [37]:
from datasets import load_dataset, DownloadConfig, disable_progress_bar
from tqdm import tqdm

tqdm.disable = True
disable_progress_bar()

download_config = DownloadConfig(disable_tqdm=True)

# This takes a few minutes to run, so go grab a tea or coffee while you wait :)
data_files = "https://huggingface.co/datasets/qualis2006/PUBMED_title_abstracts_2020_baseline/resolve/main/PUBMED_title_abstracts_2020_baseline.jsonl.zst"
pubmed_dataset = load_dataset("json", data_files=data_files, split="train", download_config=download_config)
pubmed_dataset

Dataset({
    features: ['meta', 'text'],
    num_rows: 17722096
})

In [38]:
pubmed_dataset[0]

{'meta': {'pmid': 1673585, 'language': 'eng'},
 'text': 'Cardiac beta-adrenoceptor regulation and the effects of partial agonism.\nThe in vivo effects of xamoterol on the regulation of rat cardiac beta adrenoceptors were investigated. Rats were implanted subcutaneously with osmotic minipumps and exposed to the following treatment regimens: (1) subcutaneous infusion of saline (control), isoprenaline or xamoterol for 6 days, (2) subcutaneous infusion of isoprenaline with co-administration of xamoterol for various periods up to 96 hours, and (3) subcutaneous infusion of xamoterol for up to 96 hours after previous treatment with isoprenaline for 72 hours. Xamoterol did not induce beta-adrenoceptor down-regulation after short-term (72-hour) or long-term (6-day) infusions. When coadministered with isoprenaline xamoterol did not affect the rate or extent of down-regulation induced by isoprenaline alone. In addition, recovery of beta adrenoceptors down-regulated by isoprenaline treatment was n

In [39]:
!pip install psutil



In [40]:
import psutil

# Process.memory_info is expressed in bytes, so convert to megabytes
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

RAM used: 1728.12 MB


In [41]:
print(f"Dataset size in bytes : {pubmed_dataset.dataset_size}")
size_gb = pubmed_dataset.dataset_size / (1024**3)
print(f"Dataset size (cache file) : {size_gb:.2f} GB")

Dataset size in bytes : 24453015916
Dataset size (cache file) : 22.77 GB


In [42]:
import timeit

code_snippet = """batch_size = 1000

for idx in range(0, len(pubmed_dataset), batch_size):
    _ = pubmed_dataset[idx:idx + batch_size]
"""

time = timeit.timeit(stmt=code_snippet, number=1, globals=globals())
print(
    f"Iterated over {len(pubmed_dataset)} examples (about {size_gb:.1f} GB) in "
    f"{time:.1f}s, i.e. {size_gb/time:.3f} GB/s"
)

Iterated over 17722096 examples (about 22.8 GB) in 274.2s, i.e. 0.083 GB/s


In [43]:
pubmed_dataset_streamed = load_dataset(
    "json", data_files=data_files, split="train", streaming=True
)

In [44]:
next(iter(pubmed_dataset_streamed))


{'meta': {'pmid': 1673585, 'language': 'eng'},
 'text': 'Cardiac beta-adrenoceptor regulation and the effects of partial agonism.\nThe in vivo effects of xamoterol on the regulation of rat cardiac beta adrenoceptors were investigated. Rats were implanted subcutaneously with osmotic minipumps and exposed to the following treatment regimens: (1) subcutaneous infusion of saline (control), isoprenaline or xamoterol for 6 days, (2) subcutaneous infusion of isoprenaline with co-administration of xamoterol for various periods up to 96 hours, and (3) subcutaneous infusion of xamoterol for up to 96 hours after previous treatment with isoprenaline for 72 hours. Xamoterol did not induce beta-adrenoceptor down-regulation after short-term (72-hour) or long-term (6-day) infusions. When coadministered with isoprenaline xamoterol did not affect the rate or extent of down-regulation induced by isoprenaline alone. In addition, recovery of beta adrenoceptors down-regulated by isoprenaline treatment was n

In [45]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_dataset = pubmed_dataset_streamed.map(lambda x: tokenizer(x["text"]))
next(iter(tokenized_dataset))

{'meta': {'pmid': 1673585, 'language': 'eng'},
 'text': 'Cardiac beta-adrenoceptor regulation and the effects of partial agonism.\nThe in vivo effects of xamoterol on the regulation of rat cardiac beta adrenoceptors were investigated. Rats were implanted subcutaneously with osmotic minipumps and exposed to the following treatment regimens: (1) subcutaneous infusion of saline (control), isoprenaline or xamoterol for 6 days, (2) subcutaneous infusion of isoprenaline with co-administration of xamoterol for various periods up to 96 hours, and (3) subcutaneous infusion of xamoterol for up to 96 hours after previous treatment with isoprenaline for 72 hours. Xamoterol did not induce beta-adrenoceptor down-regulation after short-term (72-hour) or long-term (6-day) infusions. When coadministered with isoprenaline xamoterol did not affect the rate or extent of down-regulation induced by isoprenaline alone. In addition, recovery of beta adrenoceptors down-regulated by isoprenaline treatment was n

In [46]:
shuffled_dataset = pubmed_dataset_streamed.shuffle(buffer_size=10_000, seed=42)
next(iter(shuffled_dataset))

{'meta': {'pmid': 1675166, 'language': 'ita'},
 'text': '[Benzodiazepine withdrawal syndrome].\nBenzodiazepines (BDZ) are widely prescribed in clinical practice for many pathological conditions, because of their anxiolytic, sedative, myorelaxant and anticonvulsant properties. The effectiveness, specificity and rapidity of action, the few side effects and the virtual absence of toxicity, have contributed to the widespread use of these compounds. In the last decade, however, the attitude towards BDZ has greatly changed, due to growing awareness and concern about dependence liability, withdrawal phenomena, and long-term side effects. Withdrawal symptoms have been singled out and specified in the contest of a well-defined syndrome with foreseeable onset, duration and remission. Psychic and physical symptoms and disorders of sensory perception can be observed. These manifestations can be suppressed by resuming treatment. The symptomatic and developmental aspects of BDZ withdrawal syndrome a
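The `buffer_size` argument hints at how a stream can be shuffled without ever seeing the whole dataset. The sketch below is a plain-Python illustration of that general idea, not the library's actual implementation: fill a fixed-size buffer, then repeatedly emit a random buffered item and replace it with the next item from the stream. `buffer_shuffle` is a hypothetical helper name introduced only for this example.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def buffer_shuffle(stream: Iterable[T], buffer_size: int, seed: int = 42) -> Iterator[T]:
    """Approximately shuffle a stream using a fixed-size buffer (illustrative only)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything
            buffer.append(item)
            continue
        # Emit a random buffered item and store the new item in its slot
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The stream is exhausted: emit the remaining items in random order
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffer_shuffle(range(20), buffer_size=5))
print(shuffled)
```

In this sketch the first emitted item can only come from the first `buffer_size` items of the stream, which suggests why a larger buffer mixes examples better at the cost of more memory.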

In [47]:
dataset_head = pubmed_dataset_streamed.take(5)
list(dataset_head)

[{'meta': {'pmid': 1673585, 'language': 'eng'},
  'text': 'Cardiac beta-adrenoceptor regulation and the effects of partial agonism.\nThe in vivo effects of xamoterol on the regulation of rat cardiac beta adrenoceptors were investigated. Rats were implanted subcutaneously with osmotic minipumps and exposed to the following treatment regimens: (1) subcutaneous infusion of saline (control), isoprenaline or xamoterol for 6 days, (2) subcutaneous infusion of isoprenaline with co-administration of xamoterol for various periods up to 96 hours, and (3) subcutaneous infusion of xamoterol for up to 96 hours after previous treatment with isoprenaline for 72 hours. Xamoterol did not induce beta-adrenoceptor down-regulation after short-term (72-hour) or long-term (6-day) infusions. When coadministered with isoprenaline xamoterol did not affect the rate or extent of down-regulation induced by isoprenaline alone. In addition, recovery of beta adrenoceptors down-regulated by isoprenaline treatment was

In [48]:
# Skip the first 1,000 examples and include the rest in the training set
train_dataset = shuffled_dataset.skip(1000)
# Take the first 1,000 examples for the validation set
validation_dataset = shuffled_dataset.take(1000)
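To see why `skip` and `take` give non-overlapping splits, it can help to mimic them on a small, deterministic stream. The sketch below uses `itertools.islice` as a stand-in for the streaming methods; `take`, `skip`, and `make_stream` are hypothetical helpers for illustration, not the library API.

```python
from itertools import islice


def take(stream, n):
    """Yield only the first n items of the stream."""
    return islice(stream, n)


def skip(stream, n):
    """Yield everything after the first n items of the stream."""
    return islice(stream, n, None)


def make_stream():
    # A deterministic stand-in for a dataset shuffled with a fixed seed
    return iter(range(10))


validation = list(take(make_stream(), 3))
train = list(skip(make_stream(), 3))
print(validation)  # [0, 1, 2]
print(train)       # [3, 4, 5, 6, 7, 8, 9]
```

In this toy version the two splits are disjoint only because both passes see the stream in the same order, which is one reason to fix the shuffle seed before splitting.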

In [49]:
!curl -I https://the-eye.eu/public/AI/pile_preliminary_components/FreeLaw_Opinions.jsonl.zst

HTTP/2 404
server: nginx/1.24.0 (Ubuntu)
date: Sun, 18 Jan 2026 22:36:41 GMT
content-type: text/html
content-length: 162



The `curl -I` request returns an `HTTP/2 404` status code, so `FreeLaw_Opinions.jsonl.zst` is no longer available at `https://the-eye.eu/public/AI/pile_preliminary_components/FreeLaw_Opinions.jsonl.zst`.

The next step is to look for a working URL or an alternative source for the FreeLaw Opinions data. The Hugging Face Hub is a natural place to check, since it hosts many publicly available datasets, including ones related to the Pile project.

In [50]:
from huggingface_hub import HfApi

hf_api = HfApi()
datasets = hf_api.list_datasets(search="FreeLaw Opinions")
for dataset in datasets:
    print(dataset.id)

The search for 'FreeLaw Opinions' on the Hugging Face Hub returned no results. A broader search term, 'FreeLaw', may still find the data, since the original component might be named differently or be part of a larger dataset.



In [51]:
from huggingface_hub import HfApi

hf_api = HfApi()
datasets = hf_api.list_datasets(search="FreeLaw")
for dataset in datasets:
    print(dataset.id)

free-law/florida
free-law/florida_embeddings
suolyer/pile_freelaw
awettig/Pile-FreeLaw-0.5B-6K-opt
awettig/Pile-FreeLaw-0.5B-8K-opt
datajuicer/the-pile-freelaw-refined-by-data-juicer
free-law/wikitext-2-v1-with-embeddings
free-law/PM_links
free-law/md
free-law/sd
free-law/ri
free-law/sc
free-law/mo
free-law/colorado
free-law/nh
free-law/nc
free-law/uk
free-law/nj
free-law/Arizona_Caselaw_Access_Project
free-law/tenn
free-law/n_mar_i
free-law/vt
free-law/guam
free-law/navajo_nation
free-law/dc
free-law/va
free-law/vi
free-law/minn
free-law/nm
free-law/pa
free-law/arkansas
free-law/ga
free-law/utah
free-law/tribal
free-law/miss
free-law/ny
free-law/us
free-law/tex
free-law/nd
free-law/pr
free-law/ill
free-law/alaska
free-law/iowa
free-law/delaware
free-law/kan
free-law/la
free-law/okla
free-law/ky
free-law/california
free-law/idaho
free-law/me
free-law/w_va
free-law/mass
free-law/mont
free-law/conn
free-law/wyo
free-law/mich
free-law/wis
free-law/haw
free-law/ind
free-law/neb
free-law/oh

*   The original URL `https://the-eye.eu/public/AI/pile_preliminary_components/FreeLaw_Opinions.jsonl.zst` for the "FreeLaw_Opinions" dataset was confirmed to be invalid, returning an `HTTP/2 404` (Not Found) status code.
*   An initial search on Hugging Face Hub for "FreeLaw Opinions" did not yield direct results.
*   A broader search for "FreeLaw" on Hugging Face identified `suolyer/pile_freelaw` as a suitable alternative dataset.
*   The `suolyer/pile_freelaw` dataset was successfully loaded in streaming mode using the `validation` split after an attempt to load the non-existent `train` split failed.
*   The first record of the `suolyer/pile_freelaw` dataset was successfully retrieved and printed, confirming its accessibility and structure.

### Insights or Next Steps

*   The `suolyer/pile_freelaw` dataset (validation split) is a viable replacement for the previously inaccessible "FreeLaw_Opinions" dataset for subsequent analysis steps.
*   When loading datasets from Hugging Face Hub, it's crucial to verify the available splits to prevent `ValueError` due to incorrect split specifications.
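One way to guard against a missing split is to look up the available split names first and pick from them. The sketch below assumes `datasets.get_dataset_split_names` for the lookup; `choose_split` and `load_first_available_split` are hypothetical helpers, and the preference order is an arbitrary choice made for this example.

```python
def choose_split(available, preferred=("train", "validation", "test")):
    """Return the first preferred split that exists, else the first available one."""
    if not available:
        raise ValueError("The dataset exposes no splits")
    for name in preferred:
        if name in available:
            return name
    return available[0]


def load_first_available_split(repo_id):
    """Stream whichever split exists (needs network access and the datasets library)."""
    # Imported lazily so the pure helper above works without the library installed
    from datasets import get_dataset_split_names, load_dataset

    split = choose_split(get_dataset_split_names(repo_id))
    return load_dataset(repo_id, split=split, streaming=True)


print(choose_split(["validation"]))
print(choose_split(["test", "train"]))
```

For a repository with several configurations, the lookup would probably also need a configuration name, which this sketch does not handle.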


In [52]:
from datasets import load_dataset

# Attempt to load the 'suolyer/pile_freelaw' dataset from Hugging Face Hub
law_dataset_streamed = load_dataset("suolyer/pile_freelaw", split="validation", streaming=True)

# Print the first example to verify its content and structure
print(next(iter(law_dataset_streamed)))

{'text': '\n315 A.2d 483 (1974)\nSEAWAY SHOPPING CENTER CORPORATION\nv.\nThe GRAND UNION STORES, INC., OF VERMONT, and the Grand Union Company.\nNo. 3-73.\nSupreme Court of Vermont, Chittenden.\nFebruary 5, 1974.\n*484 James D. Foley, of Yandell, Page & Archer, Burlington, for plaintiff.\nWilson, Curtis, Bryan, Quinn & Jenkins, Burlington, for defendants.\nBefore SHANGRAW, C. J., and BARNEY, SMITH, KEYSER and DALEY, JJ.\nSHANGRAW, Chief Justice.\nThis is an appeal from a judgment entered in a civil action tried by the Chittenden County Court on December 6, 1972. Jury trial was waived and following a hearing by the court plaintiff was awarded damages in the amount of $14,839.05 and its costs. A judgment for this amount followed, and the defendants have appealed therefrom.\nPlaintiff, a Vermont corporation, owns and operates the Seaway Shopping Center in South Burlington, Vermont. Its principal stockholder and officer is Thomas Farrell, who developed the shopping center.\nThe Grand Union

In [53]:
# law_dataset_streamed = load_dataset(
#     "json",
#     data_files="suolyer/pile_freelaw",
#     split="validation",
#     streaming=True,
# )
next(iter(law_dataset_streamed))

{'text': '\n315 A.2d 483 (1974)\nSEAWAY SHOPPING CENTER CORPORATION\nv.\nThe GRAND UNION STORES, INC., OF VERMONT, and the Grand Union Company.\nNo. 3-73.\nSupreme Court of Vermont, Chittenden.\nFebruary 5, 1974.\n*484 James D. Foley, of Yandell, Page & Archer, Burlington, for plaintiff.\nWilson, Curtis, Bryan, Quinn & Jenkins, Burlington, for defendants.\nBefore SHANGRAW, C. J., and BARNEY, SMITH, KEYSER and DALEY, JJ.\nSHANGRAW, Chief Justice.\nThis is an appeal from a judgment entered in a civil action tried by the Chittenden County Court on December 6, 1972. Jury trial was waived and following a hearing by the court plaintiff was awarded damages in the amount of $14,839.05 and its costs. A judgment for this amount followed, and the defendants have appealed therefrom.\nPlaintiff, a Vermont corporation, owns and operates the Seaway Shopping Center in South Burlington, Vermont. Its principal stockholder and officer is Thomas Farrell, who developed the shopping center.\nThe Grand Union

In [54]:
from itertools import islice
from datasets import interleave_datasets

# Remove the 'meta' column from both datasets to make their schemas compatible
pubmed_dataset_streamed_cleaned = pubmed_dataset_streamed.remove_columns(["meta"])
law_dataset_streamed_cleaned = law_dataset_streamed.remove_columns(["meta"])

combined_dataset = interleave_datasets([pubmed_dataset_streamed_cleaned, law_dataset_streamed_cleaned])
list(islice(combined_dataset, 2))

[{'text': 'Cardiac beta-adrenoceptor regulation and the effects of partial agonism.\nThe in vivo effects of xamoterol on the regulation of rat cardiac beta adrenoceptors were investigated. Rats were implanted subcutaneously with osmotic minipumps and exposed to the following treatment regimens: (1) subcutaneous infusion of saline (control), isoprenaline or xamoterol for 6 days, (2) subcutaneous infusion of isoprenaline with co-administration of xamoterol for various periods up to 96 hours, and (3) subcutaneous infusion of xamoterol for up to 96 hours after previous treatment with isoprenaline for 72 hours. Xamoterol did not induce beta-adrenoceptor down-regulation after short-term (72-hour) or long-term (6-day) infusions. When coadministered with isoprenaline xamoterol did not affect the rate or extent of down-regulation induced by isoprenaline alone. In addition, recovery of beta adrenoceptors down-regulated by isoprenaline treatment was not influenced by xamoterol treatment. In all s

In [55]:
base_url = "https://the-eye.eu/public/AI/pile/"
data_files = {
    "train": [base_url + "train/" + f"{idx:02d}.jsonl.zst" for idx in range(30)],
    "validation": base_url + "val.jsonl.zst",
    "test": base_url + "test.jsonl.zst",
}
pile_dataset = load_dataset("json", data_files=data_files, streaming=True)
next(iter(pile_dataset["train"]))

FileNotFoundError: Unable to find 'https://the-eye.eu/public/AI/pile/train/00.jsonl.zst'

Use one of the large Common Crawl corpora like `mc4` or `oscar` to create a streaming multilingual dataset that represents the spoken proportions of languages in a country of your choice. For example, the four national languages in Switzerland are German, French, Italian, and Romansh, so you could try creating a Swiss corpus by sampling the OSCAR subsets according to their spoken proportions.

In [56]:
from huggingface_hub import HfApi

hf_api = HfApi()
datasets = hf_api.list_datasets(search="oscar")
for dataset in datasets:
    print(dataset.id)

oscarqjh/ViewSpatial_lmmseval
oscar-corpus/oscar
TurkuNLP/register_oscar
bs-modeling-metadata/OSCAR_Entity_13_000
flax-community/dummy-oscar-als-32
nlpufg/oscar-pt
oscar-corpus/OSCAR-2109
stas/oscar-en-10k
nthngdy/oscar-mini
sasha/pii-oscar-sample
oscar-corpus/OSCAR-2201
nthngdy/oscar-small
oscarfossey/NLP_Pole_emploi
acul3/Oscar_Indo_May_2022
jamescalam/oscar-en-minilm-2m
CaoHaiNam/vi-oscar
oscarmutante/oscar
lcw99/oscar-ko-only
datablations/oscar-filter-small
ola13/small-oscar-dedup
ola13/small-oscar-repetitions
oscar127/ImagenesOscar
oscar127/Images
datablations/oscar-filter
datablations/oscar-dedup-expanded
mohiron/nva-Oscar
5w4n/OSCAR-2201-Burmese-fix
5w4n/OSCAR-2019-Burmese-fix
teven/oscar_brackets
teven/oscar_no_brackets
5w4n/processed_oscar_bert_dataset
oscar-corpus/OSCAR-2301
djstrong/oscar-small
musabg/wikipedia-oscar-tr
oscar-corpus/oscar-2301-hpc
ClementRomac/cleaned_deduplicated_oscar
teven/oscar_sanity
AnanthZeke/oscar_tamil_2201
AnanthZeke/oscar_tamil_clean
livinNector/t

# Using 'yiyic/oscar_LANG_train' datasets

In [57]:
from datasets import load_dataset

# Define the languages for Switzerland and their approximate spoken proportions
swiss_languages = {
    "deu": 0.62,  # German
    "fra": 0.23,  # French
    "ita": 0.08,  # Italian
    "rm": 0.005,  # Romansh - might be hard to find a dedicated large dataset
}

# Dataset IDs for each language from the Hugging Face Hub search results
dataset_names = {
    "deu": "yiyic/oscar_deu_Latn_train",
    "fra": "yiyic/oscar_fr_train",
    "ita": "yiyic/oscar_it_train",
    # No obvious direct 'yiyic/oscar_rm_train' for Romansh, so we will handle this later if not found
}

# Load individual language datasets in streaming mode
lang_datasets = {}
for lang_code, hf_dataset_name in dataset_names.items():
    print(f"Loading {hf_dataset_name}...")
    try:
        lang_datasets[lang_code] = load_dataset(hf_dataset_name, split="train", streaming=True)
        print(f"Successfully loaded {hf_dataset_name}")
    except Exception as e:
        print(f"Could not load {hf_dataset_name}: {e}")

# Check the first example of a loaded dataset (e.g., German) to verify
if "deu" in lang_datasets:
    print("\nFirst example from German dataset:")
    print(next(iter(lang_datasets["deu"])))

Loading yiyic/oscar_deu_Latn_train...
Successfully loaded yiyic/oscar_deu_Latn_train
Loading yiyic/oscar_fr_train...
Successfully loaded yiyic/oscar_fr_train
Loading yiyic/oscar_it_train...
Successfully loaded yiyic/oscar_it_train

First example from German dataset:
{'text': 'In unserem Themenverzeichnis finden Sie alle wichtigen Informationen zum Thema Conveniencestore. Die Artikel sind nach Relevanz sortiert und bieten Ihnen somit einen kompletten Überblick über alle Nachrichten und Galerien. Sollten Sie nach einem bestimmten Thema in Verbindung mit "Conveniencestore" suchen oder sollte sich Ihre Suche auf einen bestimmten Zeitraum oder Artikeltyp beziehen, dann benutzen Sie bitte unsere interne Suche: Conveniencestore\nArtikel zum Thema\n-\nSpar\nNeuer Spar Conveniencestore bringt Food in the City\n"Mit unserem neuen Konzept \'Food in the City\' haben wir den Begriff Nahversorgung neu interpretiert", erläutert Spar Österreich-Vorstandsdirektor Hans K. Reisch die Beweggrü mehr'}

# Using lang_datasets


The `interleave_datasets` function streams from the individual language datasets and draws each example's source according to the given probabilities. The combined dataset therefore approximates the target language proportions without loading all the data into memory, which makes it suitable for further processing or for training large language models.

In [58]:
from datasets import interleave_datasets
from itertools import islice

# Define the languages for Switzerland and their approximate spoken proportions
swiss_languages = {
    "deu": 0.62,  # German
    "fra": 0.23,  # French
    "ita": 0.08,  # Italian
    "rm": 0.005,  # Romansh - not included in this dataset due to lack of readily available large dataset
}

# Calculate the sum of proportions for normalization, excluding Romansh as it's not loaded
total_proportion_loaded = sum(swiss_languages[lang] for lang in lang_datasets.keys())

# Calculate sampling probabilities for the loaded languages
sampling_probabilities = [
    swiss_languages[lang] / total_proportion_loaded
    for lang in lang_datasets.keys()
]

# Create the interleaved dataset
# Ensure the order of datasets matches the order of probabilities
interleaved_swiss_dataset = interleave_datasets(
    list(lang_datasets.values()),
    probabilities=sampling_probabilities,
    seed=42
)

print(f"Created interleaved dataset with sampling probabilities: {sampling_probabilities}")
print("First 5 examples from the interleaved dataset (alternating between languages based on probabilities):")
for i, example in enumerate(islice(interleaved_swiss_dataset, 5)):
    print(f"Example {i+1}: {example}")

Created interleaved dataset with sampling probabilities: [0.6666666666666666, 0.24731182795698925, 0.08602150537634408]
First 5 examples from the interleaved dataset (alternating between languages based on probabilities):
Example 1: {'text': "Conçue par Apple pour compléter l’iPhone 13 Pro, la coque en cuir avec MagSafe est une élégante façon d’offrir une protection supplémentaire à votre iPhone tout en lui apportant une touche de style.\nFabriquée dans un cuir au tannage et à la finition uniques, la matière est remarquablement douce au toucher et acquiert, au fil du temps, une patine naturelle. La coque se met rapidement en place et épouse les courbes de votre iPhone sans en compromettre la finesse.\nDotée d’aimants intégrés qui s’ajustent parfaitement à l’iPhone, la coque se fixe et se détache avec une aisance presque magique. Ce parfait positionnement des aimants permet une recharge sans fil plus rapide et plus facile que jamais. Et pour recharger votre iPh

This cell takes the proportions for German, French, and Italian, normalizes them so they sum to 1, and passes them to `interleave_datasets` to create a single streaming dataset. The output confirms the sampling probabilities and shows the first examples drawn from the interleaved stream. The dataset is now ready for downstream steps such as tokenization or model training.
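To make the normalization step concrete, the following sketch recomputes the sampling probabilities from the proportions above and simulates probability-weighted sampling with Python's `random` module. The simulation only illustrates sampling by probability and is not how `interleave_datasets` is implemented; `simulate_interleave` is a hypothetical helper.

```python
import random
from collections import Counter

# Proportions from the cell above; Romansh is left out because no dataset was loaded
proportions = {"deu": 0.62, "fra": 0.23, "ita": 0.08}

# Rescale so the probabilities of the loaded languages sum to 1
total = sum(proportions.values())
probabilities = {lang: share / total for lang, share in proportions.items()}


def simulate_interleave(probabilities, n_samples, seed=42):
    """Draw the source language of each example according to the probabilities."""
    rng = random.Random(seed)
    langs = list(probabilities)
    weights = [probabilities[lang] for lang in langs]
    return Counter(rng.choices(langs, weights=weights, k=n_samples))


n_samples = 10_000
counts = simulate_interleave(probabilities, n_samples)
for lang, count in counts.most_common():
    print(f"{lang}: {count / n_samples:.3f} (target {probabilities[lang]:.3f})")
```

The observed shares only approach the targets as the number of samples grows, so a short prefix of the stream (such as the first five examples) can look quite different from the nominal proportions.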