# Task 2 - N-gram Language Models with BPE

In this task we implement and evaluate n-gram language models based on a cleaned Shakespeare corpus.  
The corpus is split into training, validation, and test sets (predefined).  

Key requirements:
- Models over BPE subword tokens
- Unigram → Bigram → Trigram → 4-gram
- Intrinsic evaluation: Perplexity
- Bigram analysis across different BPE merge counts *k*
- Laplace (add-one) smoothing
- Simple interpolation/backoff
- Extrinsic evaluation: sentence generation


## Data Loading and Tokenization

We start by loading the cleaned Shakespeare dataset and applying **Byte-Pair Encoding (BPE)**.  
The dataset is split into **train, validation, and test** to ensure consistent comparison across models.  
The number of BPE merges (*k*, e.g. 1600) determines the vocabulary size and granularity of subword units.
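
To make the role of the merge list concrete, the following minimal sketch applies an ordered list of merges to a single word. The function name `apply_merges_sketch` and the toy merge list are assumptions for illustration only; the encoder actually used in this notebook comes from Task 1 and `src.data_utils` and may differ in detail.

```python
from typing import List, Tuple

WORD_END = "</w>"  # same end-of-word marker as in the setup cell


def apply_merges_sketch(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Split a word into characters plus WORD_END, then apply merges in order."""
    symbols = list(word) + [WORD_END]
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            # Fuse every adjacent (left, right) pair into one symbol
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merges, listed in the order they would have been learned
toy_merges = [("t", "h"), ("th", "e"), ("the", WORD_END)]
print(apply_merges_sketch("the", toy_merges))   # ['the</w>']
print(apply_merges_sketch("then", toy_merges))  # ['the', 'n', '</w>']
```

With more merges, frequent words collapse into single tokens while rare words stay split into smaller pieces, which is why *k* controls both vocabulary size and granularity.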


In [None]:
# =============================================================================
# TASK 2 — BLOCK 1: IMPORTS AND SETUP
# =============================================================================

import math
import os
import random
import re
import time
from collections import Counter
from typing import Any, Dict, Iterable, List, Optional, Tuple

import matplotlib.pyplot as plt
import pandas as pd

# Essential constants and functions from Task 1
CORPUS_DIR = "Corpus"
GENERATED_DIR = "Generated_tokens"
TRAIN_CLEAN = os.path.join(CORPUS_DIR, "Shakespeare_clean_train.txt")
VALID_CLEAN = os.path.join(CORPUS_DIR, "Shakespeare_clean_valid.txt")
TEST_CLEAN = os.path.join(CORPUS_DIR, "Shakespeare_clean_test.txt")
K_LIST = [1000, 1200, 1400, 1600, 1800, 2000]
WORD_END = "</w>"

_wsre = re.compile(r"\s+")

# Task 2 specific tokens
EOS = "<eos>"
BOS = "<bos>"
random.seed(42)


## Utility Functions

This block defines helper functions that are reused across the notebook.  
They cover tasks such as:

- Handling tokenization and decoding  
- Managing probability calculations  
- Supporting text generation routines  

By centralizing these functions, the implementation of n-gram models remains cleaner and easier to extend.


In [None]:
# =============================================================================
# TASK 2 — BLOCK 2: UTILITY FUNCTIONS
# =============================================================================


# sorted out 

import src.data_utils as du




## N-gram Building Functions

- `add_bos_context(n)`: pads each line with `<bos>` tokens for n-gram context.
- `build_ngrams`: builds vocabulary, n-gram counts, and context counts from tokenized lines.  
These counts are the basis for probability estimation and perplexity.
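
Since the implementation lives in `src.ngram` and is not shown here, the following is a minimal sketch of what the two helpers could look like. The `_sketch` names, the return layout, and the assumption that each line already ends in `<eos>` are illustrative and not taken from the module.

```python
from collections import Counter
from typing import List

BOS, EOS = "<bos>", "<eos>"


def add_bos_context_sketch(tokens: List[str], n: int) -> List[str]:
    """Prepend n-1 BOS tokens so the first real token has a full context."""
    return [BOS] * (n - 1) + tokens


def build_ngrams_sketch(lines: List[List[str]], n: int):
    """Return (vocab, ngram_counts, context_counts) for order n."""
    vocab = set()
    ngram_counts: Counter = Counter()
    context_counts: Counter = Counter()
    for line in lines:
        padded = add_bos_context_sketch(line, n)
        vocab.update(line)  # BOS is context only, so it is not added to the vocabulary
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])  # the n-1 preceding tokens
            ngram_counts[context + (padded[i],)] += 1
            context_counts[context] += 1
    return vocab, ngram_counts, context_counts


lines = [["to", "be", EOS], ["to", "do", EOS]]
vocab, ngrams, contexts = build_ngrams_sketch(lines, n=2)
print(ngrams[("to", "be")], contexts[("to",)], contexts[(BOS,)])  # 1 2 2
```

The maximum-likelihood estimate is then `ngrams[context + (w,)] / contexts[context]`, and the smoothed estimators adjust these same counts.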


In [None]:
# all sorted out 

import src.ngram as ngram

## N-gram Language Model

`NGramLM(n)` implements several estimators:
- **ML** (`p_ml`) and **Laplace** (`p_laplace`) smoothing.  
- **Linear interpolation** (`p_interpolated`) over orders 1…n (defaults to Laplace components).  
- **Katz-like backoff** (`p_backoff_katz`, simplified) and **stupid backoff** (`p_backoff`, not normalized).  
Models are chained recursively so lower-order distributions are available.
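
The differences between these estimators are easiest to see on a toy bigram example. The sketch below is not the `NGramLM` implementation; the function names, the weight order (unigram first), and the backoff factor `alpha = 0.4` are assumptions chosen for illustration.

```python
from collections import Counter

# Toy token stream; in the notebook these would be BPE tokens from the training split.
tokens = ["<bos>", "to", "be", "or", "not", "to", "be", "<eos>"]
vocab = sorted(set(tokens) - {"<bos>"})  # <bos> is context only, never predicted
V = len(vocab)

unigram = Counter(t for t in tokens if t != "<bos>")
bigram = Counter(zip(tokens, tokens[1:]))
context = Counter(tokens[:-1])           # how often each token occurs as a left context
N = sum(unigram.values())


def p_ml(prev, w):
    """Maximum likelihood: zero for unseen bigrams."""
    return bigram[(prev, w)] / context[prev] if context[prev] else 0.0


def p_laplace(prev, w):
    """Add-one smoothing: every bigram gets one pseudo-count."""
    return (bigram[(prev, w)] + 1) / (context[prev] + V)


def p_laplace_unigram(w):
    return (unigram[w] + 1) / (N + V)


def p_interp(prev, w, lambdas=(0.2, 0.8)):
    """Linear interpolation of Laplace unigram and bigram (weights sum to 1)."""
    l1, l2 = lambdas
    return l1 * p_laplace_unigram(w) + l2 * p_laplace(prev, w)


def score_stupid_backoff(prev, w, alpha=0.4):
    """Relative frequency if the bigram was seen, else a discounted unigram score."""
    if bigram[(prev, w)] > 0:
        return bigram[(prev, w)] / context[prev]
    return alpha * unigram[w] / N


for name, f in [("laplace", p_laplace), ("interp", p_interp), ("backoff", score_stupid_backoff)]:
    print(name, round(sum(f("to", w) for w in vocab), 4))
```

Laplace and interpolation sum to 1 over the vocabulary, whereas the stupid-backoff scores for the context `to` sum to more than 1, which is why they are scores rather than probabilities.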


In [None]:
# sorted out

## Perplexity

`perplexity(model, token_lines, mode)` computes PPL over BPE tokens, resetting context at `<eos>`.  
Supported modes: `ml`, `laplace`, `interp`, `backoff`, `katz`.  
**Note:** Stupid backoff does not yield a normalized probability distribution, so treat its PPL as a *relative* score only.
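
Perplexity is the exponential of the average negative log-probability per token, $\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid \text{context}_i)\right)$. The sketch below illustrates the computation with a context reset at `<eos>`. It is not the code in `src.evaluation`; the name `perplexity_sketch` and the `prob(context, token)` callback are assumptions.

```python
import math
from typing import Callable, List, Tuple


def perplexity_sketch(token_lines: List[List[str]], n: int,
                      prob: Callable[[Tuple[str, ...], str], float]) -> float:
    """Compute PPL for an order-n model given prob(context, token)."""
    total_logp, count = 0.0, 0
    for line in token_lines:
        history = ["<bos>"] * (n - 1)
        for tok in line:
            context = tuple(history[-(n - 1):]) if n > 1 else ()
            p = prob(context, tok)
            if p <= 0.0:
                return float("inf")  # an unsmoothed model assigns zero to unseen events
            total_logp += math.log(p)
            count += 1
            # Reset the context after <eos>, otherwise extend the history
            history = ["<bos>"] * (n - 1) if tok == "<eos>" else history + [tok]
    return math.exp(-total_logp / count) if count else float("inf")


# Sanity check: a uniform model over 4 tokens should have perplexity close to 4.
print(perplexity_sketch([["a", "b", "<eos>"]], n=2, prob=lambda ctx, w: 0.25))
```

The zero-probability branch is the reason the `ml` mode can report infinite perplexity on held-out data, while the smoothed modes always return a finite value.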


In [None]:
# PERPLEXITY CALCULATION

import src.evaluation as ev

## Data Loading Functions

- `find_merges_file(k)`: locates a merges file in `Generated_tokens/` (flexible naming).  
- `load_token_lines_for_k(k)`: loads splits, applies merges, and returns tokenized lines for train/valid/test.  
Per the task, the **validation set is used for tuning interpolation weights, not for choosing `k`**.


In [None]:
# sorted out

corpus = GENERATED_DIR
train = TRAIN_CLEAN
valid = VALID_CLEAN
test = TEST_CLEAN

## Training & Evaluation Utilities

- `grid_simplex_lambdas(n, step)`: enumerates interpolation weights that sum to 1.  
- `train_and_eval_for_k(k, n_max, tune_interp)`: trains n=1…n_max; reports PPL for ML/Laplace/Backoff;  
  tunes **interpolation weights on the validation set** and then evaluates on test.  
- `bigram_vs_k(k_list)`: evaluates bigram performance across different BPE merge counts.
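
One way to enumerate the interpolation weights is to work in integer multiples of `step`, which avoids floating-point drift when checking that the weights sum to 1. This is a sketch under that assumption; the name `grid_simplex_lambdas_sketch` is hypothetical and the notebook's own function may be implemented differently.

```python
from itertools import product
from typing import Iterator, Tuple


def grid_simplex_lambdas_sketch(n: int, step: float = 0.1) -> Iterator[Tuple[float, ...]]:
    """Yield all length-n weight vectors on a grid of size `step` that sum to 1."""
    units = round(1.0 / step)  # number of grid steps that make up a total weight of 1
    for combo in product(range(units + 1), repeat=n - 1):
        rest = units - sum(combo)  # the last weight is determined by the others
        if rest >= 0:
            yield tuple(c / units for c in combo) + (rest / units,)


print(list(grid_simplex_lambdas_sketch(2, step=0.5)))
# [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
```

Tuning then amounts to evaluating validation perplexity for each candidate and keeping the minimum, for example `min(grid_simplex_lambdas_sketch(n), key=valid_ppl)` for some function `valid_ppl` (hypothetical here) that scores a weight vector on the validation set.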


In [None]:
# TRAINING AND EVALUATION FUNCTIONS




## Text Generation (Extrinsic Evaluation)

- **Encoding/decoding:** `bpe_encode_words`, `bpe_decode_to_words` convert between words and BPE tokens.  
- **Fallbacks:** `_unigram_fallback('most'|'avg')` handles unseen contexts.  
- **Decoding:** `_next_token_argmax_or_sample` supports argmax or temperature sampling.  
- **Driver:** `generate_sentence(...)` continues from a prompt until `<eos>` or a word budget is reached.
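
The difference between argmax and temperature sampling can be sketched as follows. This is an illustration, not the notebook's `_next_token_argmax_or_sample`; the function name, its signature, and the toy distribution are assumptions.

```python
import math
import random
from typing import Dict, Optional


def next_token_sketch(probs: Dict[str, float], sample: bool,
                      temperature: float = 1.0,
                      rng: Optional[random.Random] = None) -> str:
    """Pick the next token by argmax or by sampling from p(w)^(1/T)."""
    if not sample:
        return max(probs, key=probs.get)  # deterministic: most probable token
    items = [(t, p) for t, p in probs.items() if p > 0.0]
    # Rescale in log space: T < 1 sharpens the distribution, T > 1 flattens it
    weights = [math.exp(math.log(p) / temperature) for _, p in items]
    chooser = rng if rng is not None else random
    return chooser.choices([t for t, _ in items], weights=weights, k=1)[0]


toy = {"the": 0.6, "and": 0.3, "moor": 0.1}
print(next_token_sketch(toy, sample=False))
print(next_token_sketch(toy, sample=True, temperature=0.7, rng=random.Random(42)))
```

`random.choices` normalizes the weights internally, so no explicit renormalization is needed. As the temperature approaches 0, sampling converges to argmax.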


In [None]:

# sorted out


## Model Preparation

`prepare_models(k, n_max)` trains n-gram models (1…n_max) for a fixed `k` and prints their Laplace perplexities on train/valid/test.  
The generation suite calls it once and reuses the returned models to avoid retraining.


In [None]:
def prepare_models(k: int, n_max: int = 4) -> Tuple[List[Tuple[str, str]], Dict[int, ngram.NGramLM]]:
    """
    Load merges and train n-gram models up to order n_max.
    Also print Laplace perplexities on train, validation, and test.
    Returns:
      merges: BPE merges list
      models: dict {n: NGramLM}
    """
    # Load merges + tokenized splits
    merges, tr_tok, va_tok, te_tok = du.load_token_lines_for_k(k, corpus, train, valid, test)
    models = {}

    for n in range(1, n_max + 1):
        print(f"\n[prepare_models] Training {n}-gram model for k={k}")
        model = ngram.NGramLM(n, tr_tok)

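        # Evaluate perplexities (an empty split yields None)
        ppl_train = ev.perplexity(model, tr_tok, mode="laplace")
        ppl_valid = ev.perplexity(model, va_tok, mode="laplace") if va_tok else None
        ppl_test  = ev.perplexity(model, te_tok, mode="laplace") if te_tok else None

        # Format defensively: applying ":.2f" to None would raise a TypeError
        ppl_strs = [f"{p:.2f}" if p is not None else "n/a" for p in (ppl_train, ppl_valid, ppl_test)]
        print(f"[n={n}] train ppl={ppl_strs[0]} | valid ppl={ppl_strs[1]} | test ppl={ppl_strs[2]}")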

        models[n] = model

    return merges, models


## Helpers for Reporting

Lightweight analysis for generated text:
- Diversity: `distinct-1/2`.  
- Repetition: adjacent duplicates and longest repeat run.  
- Optional self-scoring hook (`score_sequence_logprob`) to compute avg NLL / PPL if available.  
Also maps external mode names (e.g., `"simple"` → stupid backoff) and optionally uses a fast generator.


In [None]:
# =============================== Helpers ======================================
def _clip(s: str, max_chars: int = 160) -> str:
    """We trim and single-line the sample text so the console stays readable."""
    s = (s or "").strip().replace("\n", " ")
    return s if len(s) <= max_chars else s[:max_chars - 3] + "..."

def _ensure_lambdas(n: int, lambdas: Optional[List[float]], mode: str):
    """
    We provide interpolation weights:
      - if explicit weights are given, we use them,
      - else if default_lambdas_for(n) exists, we use that,
      - else we fall back to uniform weights.
    """
    if mode != "interp":
        return None
    if lambdas is not None:
        return lambdas
    f = globals().get("default_lambdas_for", None)
    if callable(f):
        return f(n)
    return [1.0 / n] * n

def _distinct_ratio(seq, n=1):
    """We compute distinct-n / total-n ratio as a simple diversity metric."""
    if n == 1:
        total = len(seq)
        return (len(set(seq)) / total) if total else 0.0
    ngrams = list(zip(*[seq[i:] for i in range(n)]))
    total = len(ngrams)
    return (len(set(ngrams)) / total) if total else 0.0

def _max_repeat_run(seq):
    """We measure the longest run of identical consecutive tokens."""
    if not seq:
        return 0
    mx = cur = 1
    for i in range(1, len(seq)):
        if seq[i] == seq[i - 1]:
            cur += 1
            mx = max(mx, cur)
        else:
            cur = 1
    return mx

def _summarize_text(txt: str) -> Dict[str, Any]:
    """We summarize the generated text with lightweight quality indicators."""
    tokens = txt.strip().split()
    n_tok = len(tokens)
    d1 = _distinct_ratio(tokens, 1)
    d2 = _distinct_ratio(tokens, 2)
    rep_pairs = sum(1 for i in range(1, n_tok) if tokens[i] == tokens[i - 1])
    rep_ratio = rep_pairs / max(1, (n_tok - 1))
    return {
        "len_words": n_tok,
        "distinct1": round(d1, 4),
        "distinct2": round(d2, 4),
        "repeat_ratio": round(rep_ratio, 4),
        "max_repeat_run": _max_repeat_run(tokens),
        "ends_with_eos": (tokens[-1] == "<eos>") if tokens else False,
    }

def _maybe_score_ppl(txt: str, merges, models, n: int, mode: str, lambdas: Optional[List[float]]):
    """
    We optionally self-score the generated text if a scorer is available:
    expects score_sequence_logprob(tokens, merges, models, n, mode, lambdas) → log p.
    Returns (avg_nll, ppl) or (None, None) if not available.
    """
    scorer = globals().get("score_sequence_logprob", None)
    if not callable(scorer):
        return None, None
    tokens = txt.strip().split()
    if not tokens:
        return None, None
    try:
        logp = scorer(tokens, merges=merges, models=models, n=n, mode=mode, lambdas=lambdas)
        avg_nll = -logp / len(tokens)
        ppl = math.exp(avg_nll)
        return round(avg_nll, 4), round(ppl, 4)
    except Exception:
        return None, None

def _map_mode(mode: str) -> str:
    """
    We accept external mode names and map them to the implementation:
      - "simple" → "backoff" (stupid backoff)
      - "interp", "ml", "laplace" pass through
    We do not use Katz here.
    """
    if mode == "simple":
        return "backoff"
    return mode

def _call_generate(merges, models, k: int, n: int,
                   prompt_words: List[str], mode: str,
                   lambdas: Optional[List[float]], sample: bool,
                   temperature: float, max_new_words: int = 20,
                   fallback_strategy: str = "most") -> str:
    """
    We prefer generate_sentence_fast(...) if present (uses prebuilt models),
    and fall back to generate_sentence(...) otherwise.
    """
    impl_mode = _map_mode(mode)
    gen_fast = globals().get("generate_sentence_fast", None)
    if callable(gen_fast):
        return gen_fast(
            merges, models, k, n,
            prompt_words=prompt_words, mode=impl_mode,
            lambdas=lambdas, sample=sample, temperature=temperature,
            fallback_strategy=fallback_strategy, max_new_words=max_new_words
        )
    return generate_sentence( corpus, train, valid, test,
        k=k, n=n, prompt_words=prompt_words, mode=impl_mode,
        lambdas=lambdas, max_new_words=max_new_words,
        sample=sample, temperature=temperature,
        fallback_strategy=fallback_strategy
    )

# =============================== Main suite ===================================
def run_generation_suite(k: int = 1600,
                         temperatures = (0.5, 0.7, 1.0),
                         max_new_words: int = 20,
                         show_text: bool = True,
                         text_max_chars: int = 160) -> List[Dict[str, Any]]:
    """
    We run a compact generation & reporting suite:
      1) Two fixed examples for quick inspection,
      2) A small grid over (prompt, mode, n) × {argmax, sample} × temperatures,
      3) Speed + lightweight quality metrics (len, distinct-1/2, repetition).
    If show_text=True, we also print each generated sentence (truncated).
    Returns a list of dicts (ready for DataFrame/CSV).
    """
    # We assume prepare_models(k) exists and returns (merges, {order: NGramLM})
    merges, models = prepare_models(k)
    results = []

    # (A) Two fixed examples we show up front
    special_tests = [
        dict(prompt=["to","be","or","not"], mode="interp",  n=2, lambdas=[0.2, 0.8], sample=False, temperature=1.0, label="spec_interp_argmax"),
        dict(prompt=["my","lord"],          mode="simple",  n=3, lambdas=None,       sample=True,  temperature=0.7, label="spec_simple_sample"),
    ]

    # (B) Grid similar to the earlier quick test
    grid_tests = [
        dict(prompt=["to","be","or"],     mode="simple",  n=2, lambdas=None),
        dict(prompt=["the","king","of"],  mode="laplace", n=2, lambdas=None),
        dict(prompt=["fair","is","foul"], mode="interp",  n=3, lambdas=None),
    ]

    def run_one(test_cfg: Dict[str, Any], temperature: float, sample: bool, label: str):
        prompt = test_cfg["prompt"]
        mode   = test_cfg["mode"]        # we keep the external label ("simple" etc.)
        n      = test_cfg["n"]
        lamb   = _ensure_lambdas(n, test_cfg.get("lambdas"), mode)

        t0 = time.time()
        txt = _call_generate(
            merges, models, k, n,
            prompt_words=prompt, mode=mode, lambdas=lamb,
            sample=sample, temperature=temperature,
            max_new_words=max_new_words,
            fallback_strategy=test_cfg.get("fallback_strategy", "most"),
        )
        dt = time.time() - t0

        summary = _summarize_text(txt)
        avg_nll, ppl = _maybe_score_ppl(txt, merges, models, n, mode, lamb)

        rec = {
            "label": label,
            "prompt": " ".join(prompt),
            "mode": mode,  # external name ("simple" not Katz)
            "n": n,
            "sample": sample,
            "temperature": temperature,
            "lambdas": lamb,
            "text": txt,
            "gen_time_sec": round(dt, 4),
            "tok_per_sec_est": round(summary["len_words"]/dt, 2) if dt > 0 else None,
            "len_words": summary["len_words"],
            "distinct1": summary["distinct1"],
            "distinct2": summary["distinct2"],
            "repeat_ratio": summary["repeat_ratio"],
            "max_repeat_run": summary["max_repeat_run"],
            "ends_with_eos": summary["ends_with_eos"],
            "avg_nll": avg_nll,
            "ppl_self": ppl
        }
        results.append(rec)

        # We print one example line per run (truncated) so the instructor sees outputs.
        if show_text:
            dec = "sample" if sample else "argmax"
            print(f"\n[{label}] mode={mode} | n={n} | T={temperature} | {dec}")
            print("  " + _clip(txt, text_max_chars))

    # (A) run the two fixed examples
    for cfg in special_tests:
        run_one(cfg, cfg["temperature"], cfg["sample"], cfg["label"])

    # (B) grid: argmax and sample across temperatures
    for cfg in grid_tests:
        for T in temperatures:
            run_one(cfg, T, False, f"grid_argmax_T{T}")
        for T in temperatures:
            run_one(cfg, T, True,  f"grid_sample_T{T}")

    # Short summary
    print("\n=== (label | mode | n | temp | sample | len | d1 | d2 | rep) ===")
    for r in results:
        print(f"{r['label']:18s} | {r['mode']:7s} | {r['n']} | {r['temperature']:>3} | "
              f"{'S' if r['sample'] else 'A'} | {r['len_words']:3d} | "
              f"{r['distinct1']:.2f} | {r['distinct2']:.2f} | {r['repeat_ratio']:.2f}")

    return results

if __name__ == "__main__":
    # The suite defaults to k=1600; here we run it with k=1000. Tweak temperatures/max_new_words as needed.
    _ = run_generation_suite(k=1000)


[Found] Using merges file: Generated_tokens\bpe_merges with k = 1000.txt

[prepare_models] Training 1-gram model for k=1000
[n=1] train ppl=282.43 | valid ppl=278.58 | test ppl=277.44

[prepare_models] Training 2-gram model for k=1000
[n=2] train ppl=72.68 | valid ppl=80.55 | test ppl=80.21

[prepare_models] Training 3-gram model for k=1000
[n=3] train ppl=164.65 | valid ppl=217.90 | test ppl=221.55

[prepare_models] Training 4-gram model for k=1000
[n=4] train ppl=306.63 | valid ppl=464.27 | test ppl=479.42
[Found] Using merges file: Generated_tokens\bpe_merges with k = 1000.txt

[spec_interp_argmax] mode=interp | n=2 | T=1.0 | argmax
  to the  and the  and the  and the  and the  and the  and the
[Found] Using merges file: Generated_tokens\bpe_merges with k = 1000.txt

[spec_simple_sample] mode=simple | n=3 | T=0.7 | sample
  othello i do estate upon me. desdemona i am cruel cassio. othello o ialack the couples that lives in the
[Found] Using merges file: Generated_tokens\bpe_merges w

## Results

We evaluated continuations generated with **k = 1600 BPE merges** for three estimators: stupid backoff (n = 2), Laplace (n = 2), and interpolation (n = 3), each under argmax and sampling decoding.  
Each run generated up to 20 new words (`max_new_words = 20`).

### Diversity
- **Sampling** yielded high diversity:  
  - distinct-1 ≈ 0.90–1.00  
  - distinct-2 ≈ 0.95–1.00  
  - Example: simple backoff, T = 0.7 → distinct-1 = 0.95, distinct-2 = 1.00
- The same pattern held for Laplace (n = 2) and interpolation (n = 3) under sampling.

### Degeneracy
- **Argmax** reduced diversity and caused loops:  
  - Laplace bigrams: distinct-1 ≈ 0.20, distinct-2 ≈ 0.21  
  - Simple bigrams: distinct-1 ≈ 0.30, distinct-2 ≈ 0.32  
  - Interpolation (n = 3) improved slightly (≈ 0.35/0.37) but still repeated phrases.

### Repetition Metric
- Adjacent token repetition = 0.00 in all cases.  
- However, phrase-level loops (e.g., “the matters of the matters of …”) were common, so the adjacent-duplicate metric underestimates degeneracy.
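
The gap between the two views of repetition can be reproduced on the looping phrase quoted above. The sketch below compares the adjacent-duplicate ratio with a repeated n-gram ratio; the latter is a hypothetical supplementary metric and was not part of the reported results.

```python
from collections import Counter
from typing import List


def adjacent_repeat_ratio(tokens: List[str]) -> float:
    """Share of positions where a token equals its predecessor."""
    pairs = sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)
    return pairs / max(1, len(tokens) - 1)


def repeated_ngram_ratio(tokens: List[str], n: int = 3) -> float:
    """Share of n-gram occurrences whose n-gram appears more than once."""
    grams = list(zip(*[tokens[i:] for i in range(n)]))
    if not grams:
        return 0.0
    counts = Counter(grams)
    return sum(c for c in counts.values() if c > 1) / len(grams)


loop = "the matters of the matters of the matters of".split()
print(adjacent_repeat_ratio(loop))    # 0.0
print(repeated_ngram_ratio(loop, 3))  # 1.0
```

No token ever follows itself in the loop, so the adjacent metric stays at zero, while every trigram in it occurs at least twice.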

### Qualitative Examples
- Argmax: repeated clause fragments (e.g., “and the moor: I am not …”).  
- Sampling (T = 0.7): varied, syntactically richer lines (e.g., “polonius give him directions …”).

**Conclusion:**  
Deterministic decoding collapses onto frequent n-grams, reducing diversity.  
Stochastic sampling restores lexical richness without immediate repetition.  
Interpolation alleviates loops under argmax but does not eliminate them.
