
# Assignment 1 — Data Collection & Preprocessing for Foundation Model Pre‑Training

This notebook implements an end‑to‑end pipeline to:
1) **Collect** large‑scale text data (streaming with Hugging Face Datasets)  
2) **Clean & deduplicate** documents  
3) **Tokenize & chunk** into fixed‑length blocks for transformer pretraining  
4) **Build a PyTorch `Dataset`/`DataLoader`** and **save sample batches** (`.pt`)  

**Deliverables produced** (when you run all cells):
- `data/sample_dataset.pt` — first 5–10 sample batches (tokenized)
- Token blocks saved under `data/tokenized/`
- Cleaned JSONL shards under `data/clean/`
- Raw JSONL shards under `data/raw/`
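
The JSONL shard deliverables could be written with a small helper along the following lines. This is a minimal sketch rather than code taken from the cells below: the function name `write_jsonl_shards`, the `docs_per_shard` parameter, the `shard-00000.jsonl` naming scheme, and the one-`{"text": ...}`-object-per-line layout are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable


def write_jsonl_shards(texts: Iterable[str], out_dir: Path,
                       docs_per_shard: int = 10_000,
                       prefix: str = "shard") -> list[Path]:
    """Write one JSON object per line, starting a new file every `docs_per_shard` docs."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: list[Path] = []
    fh = None
    try:
        for i, text in enumerate(texts):
            if i % docs_per_shard == 0:          # time to start a new shard
                if fh is not None:
                    fh.close()
                path = out_dir / f"{prefix}-{len(paths):05d}.jsonl"
                fh = path.open("w", encoding="utf-8")
                paths.append(path)
            # ensure_ascii=False keeps non-ASCII characters readable in the file
            fh.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
    finally:
        if fh is not None:
            fh.close()
    return paths


# Toy run in a temporary directory (hypothetical inputs)
with tempfile.TemporaryDirectory() as tmp:
    shard_paths = write_jsonl_shards(["doc a", "doc b", "doc c"], Path(tmp), docs_per_shard=2)
    shard_names = [p.name for p in shard_paths]
    first_shard = [json.loads(line)
                   for line in shard_paths[0].read_text(encoding="utf-8").splitlines()]
```

In the notebook, the same helper might be pointed at `RAW_DIR` for the raw texts and at `CLEAN_DIR` for the cleaned ones.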




## 1. Imports & Paths


In [1]:

import os, json, hashlib, re, html
from pathlib import Path
from typing import Iterator, Dict, List, Tuple
from dataclasses import dataclass

import numpy as np
from tqdm import tqdm
from datasets import load_dataset
from transformers import AutoTokenizer

import torch
from torch.utils.data import Dataset, DataLoader

# Project directories
ROOT = Path.cwd()
DATA_DIR = ROOT / "data"
RAW_DIR = DATA_DIR / "raw"
CLEAN_DIR = DATA_DIR / "clean"
TOK_DIR = DATA_DIR / "tokenized"
for d in [DATA_DIR, RAW_DIR, CLEAN_DIR, TOK_DIR]:
    d.mkdir(parents=True, exist_ok=True)

print("Project root:", ROOT)
print("Data dir:", DATA_DIR)




Project root: /Users/venkateshkelam/Desktop/CSYE7374
Data dir: /Users/venkateshkelam/Desktop/CSYE7374/data



## 2. Configuration

- `TARGET_BYTES`: the minimum total raw text size to collect, in bytes (1 GB as configured; raise it slightly for a safety margin)
- `BLOCK_SIZE`: token block length for training (adjust for your future model)


In [2]:

# === CONFIG ===
TARGET_BYTES = 1_000_000_000   # 1.0 GB (10^9 bytes)
BLOCK_SIZE   = 1024            # e.g., GPT-style 1024-token blocks
TOKENIZER_MODEL = "gpt2"       # or "bert-base-uncased", "roberta-base", etc.

print(f"TARGET_BYTES={TARGET_BYTES:,}  BLOCK_SIZE={BLOCK_SIZE}")
print(f"TOKENIZER_MODEL={TOKENIZER_MODEL}")


TARGET_BYTES=1,000,000,000  BLOCK_SIZE=1024
TOKENIZER_MODEL=gpt2



## 3. Data Collection (Streaming)

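The pipeline is meant to draw on multiple public datasets for **diversity**. The cells in this section load only the English Wikipedia dump (`20231101.en`), in non-streaming mode, and sample 150,000 articles from it. The other sources are listed as options; edit this list to fit your resources: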
- **Wikipedia** (encyclopedic text)
- **OpenWebText** (general web)
- **C4** English (large web crawl; keep streaming)
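
A byte budget such as `TARGET_BYTES` can be enforced on any document iterator, streamed or not. The sketch below is an illustration under stated assumptions, not the method used in the following cells: `take_until_bytes` is a hypothetical helper, each document is assumed to be a dict with a `"text"` field, and a small in-memory list stands in for the stream so the example runs offline.

```python
from typing import Dict, Iterable, Iterator


def take_until_bytes(docs: Iterable[Dict], target_bytes: int,
                     text_key: str = "text") -> Iterator[Dict]:
    """Yield documents until their cumulative UTF-8 size reaches `target_bytes`."""
    total = 0
    for doc in docs:
        if total >= target_bytes:
            break
        total += len(doc[text_key].encode("utf-8"))
        yield doc


# With Hugging Face Datasets, `docs` could be a streamed split, for example:
#   stream = load_dataset("wikimedia/wikipedia", "20231101.en",
#                         split="train", streaming=True)
#   collected = list(take_until_bytes(stream, TARGET_BYTES))
toy_stream = [{"text": "a" * 40}, {"text": "b" * 40},
              {"text": "c" * 40}, {"text": "d" * 40}]
collected = list(take_until_bytes(toy_stream, target_bytes=100))
print(len(collected))  # 3: the third document pushes the total past 100 bytes
```

Because the budget is checked before each document is yielded, the collected total ends at or slightly above the target rather than below it.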



In [3]:
from datasets import load_dataset

ds = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")

In [4]:
random_dataset = ds.shuffle(seed=42).select(range(150000))

In [5]:
relevant_texts = list(random_dataset)   # materialize the sampled rows as dicts

In [7]:
relevant_texts[14]

{'id': '63739215',
 'url': 'https://en.wikipedia.org/wiki/Circus%20of%20Books%20%28film%29',
 'title': 'Circus of Books (film)',
 'text': 'Circus of Books is a 2019 American documentary film directed by Rachel Mason, written by Rachel Mason and Kathryn Robson and starring Karen Mason, Barry Mason and Rachel Mason. The premise revolves around Circus of Books, a bookstore and gay pornography shop in West Hollywood, California, and in the Silver Lake neighborhood of Los Angeles.\n\nThe film premiered at the 2019 Tribeca Film Festival, and was released on Netflix in the United States on April 22, 2020.\n\nCast\n Karen Mason\n Barry Mason\n Rachel Mason\n Josh Mason\n Micah Mason\n Alexei Romanoff\n Billy Miller\n Don Norman\n Freddie Bercovitz\n Paulo Morillo\n Ellen Winer\n Larry Flynt\n David Gregory\n Fernando Aguilar\n Alaska Thunderfuck\n Jeff Stryker\n\nRelease \nThe film premiered at the 2019 Tribeca Film Festival. It went on to show at several film festivals, including the Framelin


## 4. Cleaning & Deduplication

- Lowercasing, whitespace normalization  
- Light removal of HTML/markdown artifacts  
- Drop **very short** docs (< 50 words)  
- **Exact dedup** via SHA1 of normalized text
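
The length filter and exact deduplication can be seen on a toy input. This is a simplified sketch: `normalize` is a stand-in that only lowercases and collapses whitespace (the notebook's `clean_text` does more), `dedup_exact` is a hypothetical name, and the word threshold is lowered to 3 purely so the short example strings survive.

```python
import hashlib
import re

_ws = re.compile(r"\s+")


def normalize(text: str) -> str:
    """Simplified stand-in for clean_text: lowercase and collapse whitespace."""
    return _ws.sub(" ", text.lower()).strip()


def dedup_exact(texts, min_words: int = 50):
    """Keep the first occurrence of each normalized text with at least `min_words` words."""
    seen, kept = set(), []
    for raw in texts:
        norm = normalize(raw)
        if len(norm.split()) < min_words:
            continue                      # too short to keep
        digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()
        if digest not in seen:            # first time this exact text appears
            seen.add(digest)
            kept.append(norm)
    return kept


docs = ["Hello   World again", "hello world AGAIN", "hello world once more", "hi"]
kept = dedup_exact(docs, min_words=3)     # low threshold only for this toy input
print(kept)  # ['hello world again', 'hello world once more']
```

Hashing the normalized text is what lets the first two strings collapse into one entry. Documents that differ by even one character after normalization still count as distinct.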


In [8]:

_whitespace_re = re.compile(r"\s+")
_html_tag_re = re.compile(r"<[^>]+>")

def clean_text(raw: str) -> str:
    t = html.unescape(raw)
    t = _html_tag_re.sub(" ", t)                # strip HTML tags
    t = t.replace("#", " ").replace("*", " ").replace("`", " ")  # light markdown cleanup
    t = t.lower()
    t = t.replace("“", '"').replace("”", '"').replace("’", "'").replace("–", "-")
    t = _whitespace_re.sub(" ", t).strip()
    return t

cleaned_dataset = [clean_text(ele["text"]) for ele in random_dataset]

In [9]:
cleaned_dataset

['hmp hull is a category b men\'s local prison located in kingston upon hull in england. the term \'local\' means that this prison holds people on remand to the local courts. the prison is operated by his majesty\'s prison service. history hull prison opened in 1870, and is of a typical victorian design. ethel major was the last person and only woman to be executed at hull in 1934. she had been convicted of the murder of her husband. an exhibition "within these walls" follows the prison\'s history from 1299 to 1934. the exhibition was designed and created by officer rob nicholson and officially opened by lawrence major, ethel\'s grandson. in 1976 hull prison was involved in a three-day riot by inmates of the prison. over 100 prisoners were involved in a protest that erupted over staff brutality. the riot ended peacefully on 3 september 1976 but over two thirds of the prison was destroyed, with an estimated repair cost of £3 - £4 million. the prison was closed for a year while repairs w

In [11]:
def is_low_quality(t: str) -> bool:
    return len(t.split()) < 50

filtered_dataset = []
for ele in cleaned_dataset:
    if not is_low_quality(ele):
        filtered_dataset.append(ele)

def sha1(s: str) -> str:
    return hashlib.sha1(s.encode("utf-8")).hexdigest()


seen_hashes = set()
deduplicated_dataset = []
for ele in filtered_dataset:
    digest = sha1(ele)                 # hash each document once
    if digest not in seen_hashes:
        seen_hashes.add(digest)
        deduplicated_dataset.append(ele)

In [12]:
len(deduplicated_dataset)

129150


## 5. Tokenization & Chunking

- Uses a transformer-compatible tokenizer (default: **GPT‑2 BPE**).  
- Concatenates token streams and splits into **fixed blocks** of `BLOCK_SIZE` tokens.  
- Saves the stacked blocks as a PyTorch tensor file (`tokenized_blocks.pt`) in `data/tokenized/`.
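
The concatenate-and-split step is easiest to follow with toy token IDs. The sketch below assumes plain Python lists in place of tokenizer output and uses `0` as a made-up EOS ID; `pack_blocks` is a hypothetical name for illustration.

```python
from typing import Iterable, Iterator, List


def pack_blocks(token_lists: Iterable[List[int]], block_size: int,
                eos_id: int) -> Iterator[List[int]]:
    """Concatenate token lists (EOS after each) and yield full blocks; the tail is dropped."""
    buf: List[int] = []
    for ids in token_lists:
        if not ids:
            continue                      # skip empty documents
        buf.extend(ids)
        buf.append(eos_id)                # separator between documents
        while len(buf) >= block_size:
            yield buf[:block_size]
            buf = buf[block_size:]


# Toy IDs stand in for tokenizer output; 0 plays the role of the EOS token.
docs = [[1, 2, 3], [4, 5], [6, 7, 8]]
blocks = list(pack_blocks(docs, block_size=4, eos_id=0))
print(blocks)  # [[1, 2, 3, 0], [4, 5, 0, 6]]
```

The stream here holds 11 tokens including separators, so two full blocks of 4 are emitted and the trailing `[7, 8, 0]` is discarded. A block can begin or end in the middle of a document, which is the usual trade-off for avoiding padding.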


In [13]:
import torch
from transformers import AutoTokenizer

assert isinstance(deduplicated_dataset, list) and len(deduplicated_dataset) > 0, \
    "deduplicated_dataset must be a non-empty list of cleaned strings."

# 1) Transformer-compatible tokenizer (GPT-2 uses BPE)
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_MODEL, use_fast=True)

In [14]:
if tokenizer.pad_token is None:
    if tokenizer.eos_token is not None:
        tokenizer.pad_token = tokenizer.eos_token  # reuse EOS for padding
    else:
        tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
pad_id = tokenizer.pad_token_id
eos_id = tokenizer.eos_token_id
assert eos_id is not None, "Tokenizer must have an eos_token_id"


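# 2) Pack the token stream into fixed-length blocks
def iter_blocks_from_list(texts, block_size: int):
    """Tokenize each document, append EOS, and yield full `block_size`-token tensors."""
    buf = []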
    for doc in texts:
        ids = tokenizer(doc, add_special_tokens=False, return_attention_mask=False)["input_ids"]
        if not ids:
            continue
        ids.append(eos_id)            # separator between docs
        buf.extend(ids)
        while len(buf) >= block_size: # emit fixed-length blocks
            yield torch.tensor(buf[:block_size], dtype=torch.long)
            buf = buf[block_size:]
    # we drop the trailing remainder (common for pretraining)

# 3) Build blocks and save as tensors (efficient data structure)
TOK_DIR.mkdir(parents=True, exist_ok=True)
blocks = list(iter_blocks_from_list(deduplicated_dataset, BLOCK_SIZE))
input_ids = torch.stack(blocks) if blocks else torch.empty(0, BLOCK_SIZE, dtype=torch.long)
attention_mask = torch.ones_like(input_ids, dtype=torch.long)

Token indices sequence length is longer than the specified maximum sequence length for this model (3281 > 1024). Running this sequence through the model will result in indexing errors


In [15]:
full_path = TOK_DIR / "tokenized_blocks.pt"
torch.save({"input_ids": input_ids, "attention_mask": attention_mask}, full_path)

# 4) Small sample batch for your submission/report
sample_k = min(10, input_ids.size(0))
sample_path = TOK_DIR / "sample_dataset.pt"
torch.save(
    {"input_ids": input_ids[:sample_k], "attention_mask": attention_mask[:sample_k]},
    sample_path
)

print(f"[OK] Tokenization complete.")
print(f" - Model: {TOKENIZER_MODEL} (BPE) | Block size: {BLOCK_SIZE}")
print(f" - Blocks: {input_ids.size(0)}  Shape: {tuple(input_ids.shape)}  Dtype: {input_ids.dtype}")
print(f" - Saved full:  {full_path}")
print(f" - Saved sample:{sample_path}")


[OK] Tokenization complete.
 - Model: gpt2 (BPE) | Block size: 1024
 - Blocks: 102414  Shape: (102414, 1024)  Dtype: torch.int64
 - Saved full:  /Users/venkateshkelam/Desktop/CSYE7374/data/tokenized/tokenized_blocks.pt
 - Saved sample:/Users/venkateshkelam/Desktop/CSYE7374/data/tokenized/sample_dataset.pt


In [16]:
type(input_ids)

torch.Tensor

In [17]:
type(deduplicated_dataset)

list


## 6. PyTorch Dataset & DataLoader

Loads fixed-length token blocks. For variable-length sequences, replace with a pad‑aware `collate_fn`.
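
For reference, a pad-aware `collate_fn` might look like the sketch below. It is an illustration under assumptions, not part of this notebook's pipeline: `collate_padded` is a hypothetical name, each example is assumed to be a dict holding a 1-D `input_ids` tensor, and the mask is built from sequence lengths rather than from `pad_id` so that genuine EOS tokens are not masked when the pad token reuses the EOS ID.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def collate_padded(batch, pad_id: int = 0):
    """Right-pad variable-length `input_ids` to the longest sequence in the batch."""
    seqs = [b["input_ids"] for b in batch]
    input_ids = pad_sequence(seqs, batch_first=True, padding_value=pad_id)
    lengths = torch.tensor([len(s) for s in seqs])
    # 1 for real tokens, 0 for padding: compare each position with the row's length
    positions = torch.arange(input_ids.size(1))
    attention_mask = (positions[None, :] < lengths[:, None]).long()
    labels = input_ids.clone()
    labels[attention_mask == 0] = -100     # padded positions are ignored by the loss
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = collate_padded(
    [{"input_ids": torch.tensor([5, 6, 7])}, {"input_ids": torch.tensor([8])}],
    pad_id=0,
)
print(batch["input_ids"].shape)  # torch.Size([2, 3])
```

With a `DataLoader`, the pad ID can be bound using `functools.partial(collate_padded, pad_id=pad_id)` and passed as `collate_fn`.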


In [18]:
from torch.utils.data import Dataset, DataLoader

class BlockDataset(Dataset):
    """
    Expects:
      - input_ids: LongTensor [N, L]
      - attention_mask: LongTensor [N, L] (optional; if None we'll derive from pad_id)
    Yields:
      dict with input_ids, attention_mask, labels for next-token prediction.
    """
    def __init__(self, input_ids: torch.Tensor,
                 attention_mask: torch.Tensor | None = None,
                 pad_id: int | None = None):
        assert input_ids.ndim == 2, "input_ids must be [N, L]"
        self.input_ids = input_ids
        self.L = input_ids.size(1)
        if attention_mask is None:
            if pad_id is None:
                # no padding in fixed blocks → mask is all ones
                self.attention_mask = torch.ones_like(input_ids, dtype=torch.long)
            else:
                self.attention_mask = (input_ids != pad_id).long()
        else:
            self.attention_mask = attention_mask

    def __len__(self):
        return self.input_ids.size(0)

    def __getitem__(self, idx):
        ids = self.input_ids[idx]                  # [L]
        attn = self.attention_mask[idx]            # [L]
        labels = ids.clone()
        # ignore pads in loss if any
        labels[attn == 0] = -100
        return {"input_ids": ids, "attention_mask": attn, "labels": labels}

def collate_blocks(batch):
    # All examples are same length already → simple stack
    return {
        "input_ids": torch.stack([b["input_ids"] for b in batch], dim=0),
        "attention_mask": torch.stack([b["attention_mask"] for b in batch], dim=0),
        "labels": torch.stack([b["labels"] for b in batch], dim=0),
    }

# --- build the DataLoader ---
BATCH_SIZE = 32  # tune to your GPU
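# attention_mask is passed explicitly, so pad_id is not used here. Deriving the
# mask from pad_id would also mask the EOS separators, since pad == EOS for GPT-2.
dataset = BlockDataset(input_ids=input_ids,
                       attention_mask=attention_mask,
                       pad_id=pad_id)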

In [19]:
train_loader = DataLoader(dataset,
                          batch_size=BATCH_SIZE,
                          shuffle=True,       # ok for map-style
                          drop_last=True,     # stable batch shapes
                          num_workers=0,      # adjust
                          pin_memory=torch.cuda.is_available(),  # pinning only helps CUDA transfers
                          collate_fn=collate_blocks)

# sanity check
# batch = next(iter(train_loader))
# print({k: v.shape for k, v in batch.items()})  # → {'input_ids': [B, L], ...}

In [20]:
for bat in train_loader:
    print(bat['input_ids'].shape)
    break

torch.Size([32, 1024])
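
The `labels` field returned by `BlockDataset` is an unshifted copy of `input_ids`. Hugging Face causal language models such as GPT-2 shift the labels internally when `labels` is passed to the model. A custom training loop would need to do the shift itself, roughly as sketched here. `next_token_loss`, the tiny vocabulary, and the random logits are assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def next_token_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy where position t predicts token t+1; labels of -100 are skipped."""
    shift_logits = logits[:, :-1, :]       # last position has no next token to predict
    shift_labels = labels[:, 1:]           # first token is never a prediction target
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )


torch.manual_seed(0)
vocab_size = 11                                # hypothetical tiny vocabulary
labels = torch.tensor([[1, 2, 3, 4], [5, 6, -100, -100]])
logits = torch.randn(2, 4, vocab_size)         # random stand-in for model output
loss = next_token_loss(logits, labels)
print(loss.shape)  # torch.Size([]) : a scalar
```

Positions labeled `-100` contribute nothing to the loss, which is why the dataset writes that value wherever the attention mask is 0.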




In [21]:

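def save_sample_batches(dataloader, n_batches=5, out_path=DATA_DIR / "sample_dataset.pt"):
    """Save the first `n_batches` batches from `dataloader` to `out_path` as CPU tensors."""
    samples = []
    it = iter(dataloader)
    for _ in range(n_batches):
        try:
            batch = next(it)
        except StopIteration:
            break
        samples.append({k: v.clone().cpu() for k, v in batch.items()})
    torch.save({"batches": samples}, out_path)
    print("Saved:", out_path)

# Produce the `data/sample_dataset.pt` deliverable listed at the top of the notebook
save_sample_batches(train_loader, n_batches=5)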
