# Dataset preparation

This note explores how data is prepared for BERT, which is trained in two distinct phases using two different data pipelines.

1.  **Pretraining:** The model learns "language" (grammar, context, facts) from massive unlabeled text corpora.
2.  **Finetuning:** The model adapts its knowledge to solve a specific task (e.g., Sentiment Analysis, Question Answering, etc.) using labeled data.

## Phase 1: Pretraining

Pretraining is self-supervised (often loosely called "unsupervised"): there are no human labels, so we generate the labels from the text itself.

### Task A: Next Sentence Prediction (NSP)
BERT needs to understand the relationship *between* sentences to handle tasks like Question Answering (where the answer follows the question). To teach it this, we feed the model the following input:

`[CLS] Sentence A [SEP] Sentence B [SEP]`

and:
* **50% of the time:** `Sentence B` is the actual sentence that follows `Sentence A`. Generate label: `IS_NEXT`.
* **50% of the time:** `Sentence B` is a random sentence from the corpus. Generate label: `NOT_NEXT`.

The NSP prediction is read from the `[CLS]` token embedding, so the model is forced to encode there whether the two sentences flow logically.
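
The pair construction described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in _src/datasets_: the function names `make_nsp_pairs` and `format_pair` are hypothetical, labels are kept as strings rather than integers, negatives are drawn from a *different* document (one common way to make sure a "random" sentence is never the true next sentence), and the last sentence of each document is skipped because it has no successor.

```python
import random


def make_nsp_pairs(documents, rng):
    """Return (sentence_a, sentence_b, label) triples for NSP."""
    pairs = []
    for doc_idx, doc in enumerate(documents):
        # Sentences from every other document form the pool of negatives.
        negatives = [s for i, d in enumerate(documents) if i != doc_idx for s in d]
        for sent_idx in range(len(doc) - 1):
            sentence_a = doc[sent_idx]
            if rng.random() < 0.5:
                # Positive pair: the sentence that really follows A.
                pairs.append((sentence_a, doc[sent_idx + 1], "IS_NEXT"))
            else:
                # Negative pair: a random sentence from another document.
                pairs.append((sentence_a, rng.choice(negatives), "NOT_NEXT"))
    return pairs


def format_pair(sentence_a, sentence_b):
    """Lay the pair out in the input format shown above."""
    return f"[CLS] {sentence_a} [SEP] {sentence_b} [SEP]"


toy_documents = [
    ["Mix the flour and sugar.", "Whisk the eggs.", "Bake for thirty minutes."],
    ["The moon orbits Earth.", "Mars is farther away.", "Saturn has rings."],
]
pairs = make_nsp_pairs(toy_documents, random.Random(0))
for sentence_a, sentence_b, label in pairs:
    print(label, "|", format_pair(sentence_a, sentence_b))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which is handy when inspecting individual samples.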

### Task B: Masked Language Modeling (MLM)
Standard language models (like GPT) predict the *next* token (left-to-right), so each prediction only sees the context to its left.\
BERT is **Bidirectional**: it looks at the whole sentence at once. To prevent it from "cheating" and seeing the answer, we hide / mask some tokens.

1.  Select **15%** of the tokens as prediction targets.
2.  Apply the **80-10-10 Rule** to these selected tokens:
    * **80%:** Replace with `[MASK]`. The model must predict the original token based on context.
    * **10%:** Replace with a **Random Token**. This forces the model to keep checking the input (it can't just trust that non-masked tokens are correct).
    * **10%:** Keep the **Original Token**. This biases the model towards the real token so it doesn't always predict "the input is wrong."
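
The selection and the 80-10-10 rule can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's implementation: the function name `apply_mlm_masking` and the toy token ids are hypothetical, each token is selected independently with probability 0.15 (so the share is 15% only in expectation), and unselected positions get the label `-100`, the value that PyTorch's `CrossEntropyLoss` ignores by default.

```python
import random

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def apply_mlm_masking(token_ids, mask_id, vocab_size, special_ids, rng, select_prob=0.15):
    """Return (input_ids, labels) after applying the 80-10-10 rule."""
    input_ids = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    # Random replacements are drawn from non-special tokens only.
    replaceable = [t for t in range(vocab_size) if t not in special_ids]
    for pos, original in enumerate(token_ids):
        # Never select special tokens; select the rest with probability 0.15.
        if original in special_ids or rng.random() >= select_prob:
            continue
        labels[pos] = original  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            input_ids[pos] = mask_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            # 10%: replace with a random token (may coincide with the original)
            input_ids[pos] = rng.choice(replaceable)
        # remaining 10%: keep the original token in the input
    return input_ids, labels


# Hypothetical toy vocabulary: ids 0-3 stand for [PAD], [CLS], [SEP], [MASK].
SPECIAL_IDS = {0, 1, 2, 3}
MASK_ID = 3
token_ids = [1] + list(range(10, 40)) + [2]
input_ids, labels = apply_mlm_masking(token_ids, MASK_ID, 100, SPECIAL_IDS, random.Random(0))
print(input_ids)
print(labels)
```

Note that the label is set for all three branches: even when the input token is left unchanged, the position still counts as a prediction target.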


Let's see both of these data pipeline tasks in action.\
As before, let's start by adding _src_ to the Python system path, in case your notebook is not already running with it added.

In [1]:
import sys
from pathlib import Path
sys.path.append((Path('').resolve().parent / 'src').as_posix())

We will start by training a BPE tokenizer with our fast Rust implementation on the same [wikitext](https://huggingface.co/datasets/Salesforce/wikitext) dataset from Hugging Face that we used in the tokenization notes.

In [2]:
from settings import TokenizerSettings
from token_encoders.rust.bpe import RustBPETokenizer

corpus_file = Path("").resolve().parent / "data" / "wikitext_103.txt"
text = corpus_file.read_text()

tokenizer = RustBPETokenizer(TokenizerSettings())
tokenizer.train([text])

Now that we have a tokenizer, let's create some dummy documents to see how NSP and MLM affect the data. In practice these documents would be loaded from a file, but for illustration we define them inline as variables. For implementation details, see the _src/datasets_ module, which contains the corpora, dataloaders and datasets for pretraining, finetuning and inference.

In [3]:
import random

from datasets.pretraining import NSPLabel, PretrainingDataset
from datasets.types.inputs.pretraining import PretrainingCorpusData
from settings import LoaderSettings

documents = [
    # Document 1: Artificial Intelligence & BERT (Logical Explanation Flow)
    [
        "BERT models are pretrained on large text corpora to understand language.",
        "They use masked language modeling to learn context from these texts.",
        "In this process, random words are hidden and the model guesses them.",
        "Next sentence prediction helps with logical coherence between clauses.",
        "This requires the model to understand if two sentences belong together.",
        "Transformer architectures rely heavily on self-attention mechanisms to do this.",
        "Attention heads allow the model to focus on different parts of a sentence.",
        "The bidirectional nature of BERT provides a deeper understanding than previous models.",
        "Fine-tuning allows these pretrained models to adapt to specific problems.",
        "Deep learning has thus revolutionized natural language processing tasks."
    ],
    # Document 2: Baking & Cooking (Chronological Steps)
    [
        "To bake a perfect cake you need fresh flour and sugar.",
        "First, mix the dry ingredients together in a large ceramic bowl.",
        "Next, whisk the eggs and butter until the mixture is smooth.",
        "Pour the wet mixture into the dry bowl and stir gently.",
        "Preheat your oven to 350 degrees before you start baking.",
        "Grease the baking pan thoroughly to prevent the cake from sticking.",
        "Pour the batter evenly into the prepared cake tin.",
        "Bake in the center of the oven for approximately thirty minutes.",
        "Insert a toothpick to check if the cake is fully cooked inside.",
        "Finally, let the cake cool on a wire rack before applying frosting."
    ],
    # Document 3: Space Exploration (Spatial Flow: Earth -> Outwards)
    [
        "Astronomy begins with observing the sky from our home planet.",
        "The moon is Earth's only natural satellite and the closest celestial body.",
        "Beyond the moon, our solar system consists of eight planets orbiting the sun.",
        "Mars is the next target for human exploration due to its proximity.",
        "Future missions aim to land humans on the Red Planet within a decade.",
        "Further out, gas giants like Saturn display spectacular ring systems.",
        "Our sun is just one of billions of stars in the Milky Way galaxy.",
        "Stars are born in giant clouds of gas and dust called nebulae.",
        "When massive stars die, they can collapse into black holes.",
        "The universe is vast, expanding, and filled with infinite mysteries."
    ],
    # Document 4: Oceanography (Depth Flow: Surface -> Deep Sea)
    [
        "The ocean covers more than seventy percent of the Earth's surface.",
        "The surface is shaped by tides caused by the moon's gravity.",
        "Ocean currents regulate the global climate by transporting this heat.",
        "Just below the surface, plankton serves as the base of the food web.",
        "Coral reefs thrive in these shallow waters, hosting diverse ecosystems.",
        "Larger animals like whales migrate thousands of miles through these waters.",
        "As we go deeper, sunlight fades and the water becomes much colder.",
        "Deep sea trenches contain some of the most mysterious creatures on Earth.",
        "Many species in this dark abyss generate their own light through bioluminescence.",
        "Exploring this ocean floor is as difficult as exploring outer space."
    ],
    # Document 5: History of Civilization (Chronological Timeline)
    [
        "Early humans lived as hunter-gatherers, moving constantly to find food.",
        "The discovery of agriculture allowed these groups to settle in one place.",
        "Settlements grew into cities, requiring new ways to organize society.",
        "Ancient civilizations like Egypt developed writing to record their history.",
        "Later, the Greek empire laid the foundation for Western philosophy and democracy.",
        "The Roman Empire expanded these ideas across a vast network of roads.",
        "After Rome fell, the Middle Ages brought a period of feudalism.",
        "The Renaissance later sparked a rebirth of art, science, and exploration.",
        "The Industrial Revolution eventually changed how humans lived and worked forever.",
        "Today, we live in a globalized world connected by digital technology."
    ]
]

corpus_data = PretrainingCorpusData(documents=documents)
loader_settings = LoaderSettings(max_seq_len=48, batch_size=2, shuffle=True)
ds = PretrainingDataset(corpus_data, tokenizer, loader_settings)
print(f"Dataset Initialized. Total samples available: {len(ds)}\n")

Dataset Initialized. Total samples available: 50



The dataset exposes one sample per sentence (5 documents × 10 sentences = 50). Let's pick a random sentence and see how NSP and MLM affect it.

In [4]:
sample_idx = random.randint(0, len(ds) - 1)
print(f"--- Sample Analysis (Index {sample_idx}) ---")
doc_idx, sent_idx = ds.samples_index[sample_idx]
print(f"Original sentence:\n{ds.data[doc_idx][sent_idx]}")

sample = ds[sample_idx]
label_name = NSPLabel(sample.nsp_labels.item()).name
print(f"\n[1] NSP Label: {sample.nsp_labels.item()} ({label_name})")
token_ids = sample.input_ids.tolist()
decoded_text = tokenizer.decode(token_ids)
print(f"Input Sequence:\n'{decoded_text}'")
print(f"\n[2] MLM Analysis (Comparing Input vs Labels)")
print(f"{'Input Token':<20} | {'Target Label':<20} | {'Status'}")
print("-" * 65)

labels = sample.mlm_labels.tolist()

for inp_id, lbl_id in zip(token_ids, labels):
    # Skip padding positions: they carry no information.
    if inp_id == tokenizer.pad_token_id:
        continue
    inp_token = tokenizer.inverse_vocab.get(inp_id, f"[{inp_id}]")
    if lbl_id == -100:
        # A label of -100 marks a position that was not selected for prediction.
        lbl_token = "-"
        status = "Ignored"
    else:
        lbl_token = tokenizer.inverse_vocab.get(lbl_id, f"[{lbl_id}]")
        if inp_id == tokenizer.mask_token_id:
            status = "MASKED [80%]"
        elif inp_id != lbl_id:
            status = "RANDOM WORD [10%]"
        else:
            status = "UNCHANGED [10%]"
    print(f"{inp_token:<20} | {lbl_token:<20} | {status}")

--- Sample Analysis (Index 6) ---
Original sentence:
Attention heads allow the model to focus on different parts of a sentence.

[1] NSP Label: 1 (NOT_NEXT)
Input Sequence:
'[CLS] Attention[MASK] allow the model to focus on different[MASK] of a sentence.[SEP] Deep sea trenches contain some ing the most mysterious creatures on Earth.[SEP][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD]'

[2] MLM Analysis (Comparing Input vs Labels)
Input Token          | Target Label         | Status
-----------------------------------------------------------------
[CLS]                | -                    | Ignored
ĠAtt                 | -                    | Ignored
ention               | -                    | Ignored
[MASK]               | Ġheads               | MASKED [80%]
Ġallow               | -                    | Ignored
Ġthe                 | -                    | Ignored
Ġmodel               | -                    | Ignored
Ġto                  | -  

## Phase 2: Finetuning

Once the model understands language (via pretraining), we adapt it to a downstream task.
The finetuning data pipeline is much simpler:
1.  **Input:** Usually a single sequence; sentence-pair tasks such as Question Answering keep the `[CLS] A [SEP] B [SEP]` format.
2.  **Output:** A specific label (Sentiment 0/1, Category A/B/C).
3.  **No Masking:** We want the model to see the whole text.
4.  **No Random Next Sentence:** The input is just the document we want to classify.

For classification, we usually take the embedding of the `[CLS]` token from the last layer and pass it through a simple classifier layer (typically a linear layer followed by a softmax).
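
To make the classification step concrete, here is a dependency-free sketch of such a head. It is an illustration under assumptions rather than the project's model code: `classify_from_cls` is a hypothetical name, the hidden states and weights are random placeholders instead of real BERT outputs, and the head is a single linear layer followed by a softmax.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_from_cls(last_hidden_states, weights, bias):
    """Apply a linear layer + softmax to the [CLS] vector (position 0)."""
    cls_vector = last_hidden_states[0]
    logits = [
        sum(w * h for w, h in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


rng = random.Random(0)
hidden_size, num_classes, seq_len = 8, 2, 5
# Placeholder values standing in for the encoder's last-layer output.
last_hidden_states = [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(seq_len)]
weights = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)] for _ in range(num_classes)]
bias = [0.0] * num_classes

probs = classify_from_cls(last_hidden_states, weights, bias)
print(probs)  # one probability per class
```

Only the first position is read, which is why the same `[CLS]` convention is kept in the finetuning inputs.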