# Pynini vs NLTK PTB Tokenizer — Adversarial Diff

This notebook exercises the fixed and remaining differences between the pynini-built FST
and NLTK's `TreebankWordTokenizer`.

**Fixed** (sections 1-5): `#` punctuation, ALL CAPS contractions, `d'ye`/`more'n`,
`'tis`/`'twas`, `''` as opening quotes, closing-quote context (`"B +"`).

**Remaining** (section 6): double contractions, `it's'`, `""`, double comma/colon,
tabs. None observed in 2000 wikitext paragraphs.

In [None]:
import sys; sys.path.insert(0, '..')

from nltk.tokenize import TreebankWordTokenizer
from transduction.applications.ptb import build_ptb_fst_pynini, string_to_byte_strs, SEP
from transduction.fsa import EPSILON as NATIVE_EPSILON

nltk_tok = TreebankWordTokenizer()
fst = build_ptb_fst_pynini()

In [2]:
from transduction.fst import FST


def fst_tokenize(fst, text):
    """Tokenize text using the pynini-built FST."""
    byte_strs = string_to_byte_strs(text)
    input_fst = FST.from_string(byte_strs)
    output_fsa = fst(input_fst, None)
    try:
        output = next(output_fsa.language(tuple=True))
    except StopIteration:
        return None  # FST rejected input
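    # Group the output symbols into tokens: SEP ends the current token, and
    # byte-valued symbols (< 256) accumulate into it. Any other symbol is dropped.
    tokens = []
    current = []
    for sym in output:
        if sym == SEP:
            if current:
                tokens.append(bytes(int(b) for b in current).decode('utf-8', errors='replace'))
                current = []
        elif int(sym) < 256:
            current.append(sym)
    # Flush a final token that was not followed by SEP.
    if current:
        tokens.append(bytes(int(b) for b in current).decode('utf-8', errors='replace'))
    return tokens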


def compare(text):
    """Compare NLTK and FST tokenization, highlighting differences."""
    n = nltk_tok.tokenize(text)
    f = fst_tokenize(fst, text)
    match = '✓' if n == f else '✗'
    print(f'{match} {text!r}')
    if n != f:
        print(f'  NLTK:  {n}')
        print(f'  FST:   {f}')

## 1. `#` in special punctuation (FIXED)

NLTK separates `[;@#$%&]`. Previously pynini only had `[;@%&$]`.
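
As a rough illustration of the NLTK side, the separation can be approximated with a single substitution. This is a sketch using only `re`, not the tokenizer itself: the helper name `separate_special` is hypothetical, and the real tokenizer applies this as one of several punctuation rules.

```python
import re

# Assumed approximation of the NLTK rule: pad each special character with spaces.
SPECIAL = re.compile(r"[;@#$%&]")


def separate_special(text):
    """Pad every special punctuation character with spaces, then split on whitespace."""
    return SPECIAL.sub(r" \g<0> ", text).split()


print(separate_special("Price is #100"))   # ['Price', 'is', '#', '100']
print(separate_special("##heading"))       # ['#', '#', 'heading']
print(separate_special("C# programming"))  # ['C', '#', 'programming']
```

Dropping `#` from the character class reproduces the old behaviour, where `#100` would stay a single token.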

In [3]:
compare('Price is #100')
compare('##heading')
compare('C# programming')

✓ 'Price is #100'
✓ '##heading'
✓ 'C# programming'


## 2. ALL CAPS contractions (FIXED, except arbitrary mixed case)

NLTK's CONTRACTIONS2 uses `(?i)` (case-insensitive). Pynini now handles the lowercase, Title, and UPPER forms of each contraction.
Arbitrary mixed case (e.g. `cAnNoT`) would require 2^n variants for an n-letter word and is not supported.
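
To make the 2^n figure concrete, the sketch below enumerates the three supported casings and counts every upper/lower combination for one word. Both helper names are hypothetical and are not part of the pynini builder.

```python
from itertools import product


def supported_variants(word):
    """The three casings described above: lowercase, Title, UPPER."""
    return [word.lower(), word.capitalize(), word.upper()]


def all_case_variants(word):
    """Every upper/lower combination: 2**n strings for n letters."""
    choices = [(c.lower(), c.upper()) if c.isalpha() else (c,) for c in word]
    return ["".join(p) for p in product(*choices)]


print(supported_variants("cannot"))             # ['cannot', 'Cannot', 'CANNOT']
print(len(all_case_variants("cannot")))         # 64
print("cAnNoT" in all_case_variants("cannot"))  # True
```

For the six-letter `cannot` that is 64 strings against the 3 that are enumerated, which is why the mixed-case example in this section is left as a known mismatch.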

In [4]:
compare('CANNOT STOP')
compare('I WANNA GO')
compare('GONNA BE GREAT')
compare('GOTTA RUN')
compare('LEMME SEE')
compare('GIMME THAT')
compare('cAnNoT stop')  # mixed case

✓ 'CANNOT STOP'
✓ 'I WANNA GO'
✓ 'GONNA BE GREAT'
✓ 'GOTTA RUN'
✓ 'LEMME SEE'
✓ 'GIMME THAT'
✗ 'cAnNoT stop'
  NLTK:  ['cAn', 'NoT', 'stop']
  FST:   ['cAnNoT', 'stop']


## 3. `d'ye`, `more'n` contractions (FIXED)

Added to pynini's contraction list with lowercase/Title/UPPER variants.

In [5]:
compare("D'ye think so?")
compare("d'ye know")
compare("more'n enough")
compare("More'n I expected")

✓ "D'ye think so?"
✓ "d'ye know"
✓ "more'n enough"
✓ "More'n I expected"


## 4. `'tis`, `'twas` — CONTRACTIONS3 (FIXED)

Added as a separate contraction stage in the pynini pipeline.

In [6]:
compare("'Tis the season")
compare("'Twas the night")
compare("'tis nothing")
compare("'twas long ago")

✓ "'Tis the season"
✓ "'Twas the night"
✓ "'tis nothing"
✓ "'twas long ago"


## 5. Double single-quotes `''` as opening quote (FIXED)

NLTK's STARTING_QUOTES rule 3 converts `''` after a space or opening bracket to `` `` ``.
Also fixed: any remaining `"` now becomes `''` (not `` `` ``), matching NLTK's ENDING_QUOTES.
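
The opening-quote rule described above can be sketched with plain `re`. The pattern is an assumed transcription of the NLTK rule rather than a copy taken from a specific NLTK version, and `mark_opening_quotes` is a hypothetical helper name.

```python
import re

# Assumed transcription: a `"` or `''` that follows a space or an opening
# bracket is rewritten as the opening-quote token ``.
OPENING_AFTER_SPACE = re.compile(r"([ (\[{<])(\"|'{2})")


def mark_opening_quotes(text):
    """Rewrite quotes in opening position; quotes after a word are left alone."""
    return OPENING_AFTER_SPACE.sub(r"\1 `` ", text)


print(mark_opening_quotes("She said ''hello'' there").split())
# ['She', 'said', '``', "hello''", 'there']
```

Only the first `''` is rewritten here because it follows a space; the second follows `o` and is left for the closing-quote rules.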

In [7]:
compare("She said ''hello'' there")
compare("''Hello,'' she replied")

✓ "She said ''hello'' there"
✓ "''Hello,'' she replied"


## 6. Remaining exotic differences

These differences are real and accepted as-is. None of them appears in the 2000 wikitext paragraphs compared in section 8.

In [8]:
# Double contraction: NLTK only splits a clitic that is followed by a space, so
# n't (followed by 've) stays attached and only the final 've is split off.
# Pynini splits both n't and 've.
compare("wouldn't've")

# Clitic + trailing apostrophe: for the same reason NLTK leaves 's attached
# (it is followed by ', not a space) and splits only the final '.
# Pynini splits 's as well, leaving a bare trailing '.
compare("it's'")

✗ "wouldn't've"
  NLTK:  ["wouldn't", "'ve"]
  FST:   ['would', "n't", "'ve"]
✗ "it's'"
  NLTK:  ["it's", "'"]
  FST:   ['it', "'s", "'"]
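
Both NLTK results above are consistent with clitic rules that require a literal space after the clitic. The sketch below uses assumed transcriptions of those two rules with plain `re`; `split_clitics` is a hypothetical helper and skips every other stage of the tokenizer.

```python
import re

# Assumed transcriptions of NLTK's two clitic rules; both end in a literal space.
POSSESSIVE = re.compile(r"([^' ])('[sS]|'[mM]|'[dD]|') ")
CLITIC = re.compile(r"([^' ])('ll|'LL|'re|'RE|'ve|'VE|n't|N'T) ")


def split_clitics(text):
    """Pad with spaces, apply both rules once each, and split on whitespace."""
    padded = f" {text} "
    padded = POSSESSIVE.sub(r"\1 \2 ", padded)
    padded = CLITIC.sub(r"\1 \2 ", padded)
    return padded.split()


print(split_clitics("wouldn't"))     # ['would', "n't"]
print(split_clitics("wouldn't've"))  # ["wouldn't", "'ve"]
print(split_clitics("it's'"))        # ["it's", "'"]
```

Because each rule runs as a single pass, the space inserted before `'ve` arrives too late for `n't` to match.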


In [9]:
# Two adjacent double quotes: NLTK makes both ``, pynini makes the first `` and
# the second ''. In NLTK the leading " matches ^" and becomes ``; padding that ``
# with spaces puts a space in front of the second ", so the quote-after-space
# rule turns it into `` as well before ENDING_QUOTES ever sees it.
compare('""')

# Double comma: NLTK's regex ([:,])([^\d]) consumes the second , as its
# non-digit second group (it is not a lookahead), so the scan resumes after it
# and only the first comma is separated.
compare("a,,b")

# Double colon: same regex interaction as double comma.
compare("::colon")

# Tab: NLTK's .split() treats tabs as whitespace. Pynini only maps space to separator.
compare("hello\tworld")

✗ '""'
  NLTK:  ['``', '``']
  FST:   ['``', "''"]
✗ 'a,,b'
  NLTK:  ['a', ',', ',b']
  FST:   ['a', ',', ',', 'b']
✗ '::colon'
  NLTK:  [':', ':colon']
  FST:   [':', ':', 'colon']
✗ 'hello\tworld'
  NLTK:  ['hello', 'world']
  FST:   ['hello\tworld']
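
The double-comma and double-colon results above come down to whether the character after the punctuation mark is consumed. The sketch below contrasts the pattern quoted in the comments with a hypothetical lookahead variant. The variant is for illustration only and is not how the FST is built, although it gives the same tokens as the FST on these inputs.

```python
import re

# Pattern quoted in the comments above: the second group consumes one character.
CONSUMING = re.compile(r"([:,])([^\d])")
# Hypothetical variant: a lookahead inspects the next character without consuming it.
LOOKAHEAD = re.compile(r"([:,])(?=[^\d])")

for text in ["a,,b", "::colon", "1,000 people"]:
    consumed = CONSUMING.sub(r" \1 \2", text).split()
    peeked = LOOKAHEAD.sub(r" \1 ", text).split()
    print(f"{text!r}: consuming={consumed} lookahead={peeked}")
```

With the consuming pattern, the second `,` in `a,,b` is swallowed as group 2, so the scan never tests it as a separator in its own right. Both patterns leave `1,000` intact.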


## 7. Sanity checks (should all match)

In [10]:
compare("I can't do it.")
compare("It's a test -- really!")
compare("Don't you think?")
compare("I'll go there.")
compare("We've been here.")
compare('1,000 people')
compare('at 3:00 PM')
compare('items: none')
compare('foo;bar')
compare('$100 & more')
compare('50% off @ store')
compare('"Hello," she said.')
compare('He said, "go!"')
compare('"a"')
compare('She said "don\'t"')
compare('Hello world. Goodbye.')
compare('a.')
compare("the kids' toys")
compare("James' book")
compare("CAN'T STOP WON'T STOP")
compare('a "B +" grade on average.')

✓ "I can't do it."
✓ "It's a test -- really!"
✓ "Don't you think?"
✓ "I'll go there."
✓ "We've been here."
✓ '1,000 people'
✓ 'at 3:00 PM'
✓ 'items: none'
✓ 'foo;bar'
✓ '$100 & more'
✓ '50% off @ store'
✓ '"Hello," she said.'
✓ 'He said, "go!"'
✓ '"a"'
✓ 'She said "don\'t"'
✓ 'Hello world. Goodbye.'
✓ 'a.'
✓ "the kids' toys"
✓ "James' book"
✓ "CAN'T STOP WON'T STOP"
✓ 'a "B +" grade on average.'


## 8. Wikitext bulk comparison

In [None]:
from benchmark.data import load_wikitext, wikitext_detokenize

dataset = load_wikitext("test")
n_tested = n_match = n_error = 0
diffs = []

for item in dataset:
    text = item["text"].strip()
    if not text or text.startswith("="):
        continue
    text = wikitext_detokenize(text)[:500]
    if len(text) < 10:
        continue

    n = nltk_tok.tokenize(text)
    f = fst_tokenize(fst, text)
    n_tested += 1

    if f is None:
        n_error += 1
    elif n == f:
        n_match += 1
    else:
        for i in range(max(len(n), len(f))):
            nt = n[i] if i < len(n) else "<END>"
            ft = f[i] if i < len(f) else "<END>"
            if nt != ft:
                diffs.append((n_tested, text[:80], nt, ft))
                break

    if n_tested >= 2000:
        break

print(f"Tested:  {n_tested}")
pct = 100 * n_match / n_tested if n_tested else 0.0  # avoid dividing by zero on an empty run
print(f"Match:   {n_match} ({pct:.1f}%)")
print(f"Errors:  {n_error}")
print(f"Diffs:   {len(diffs)}")
for idx, txt, nt, ft in diffs[:10]:
    print(f"  #{idx}: NLTK={nt!r} FST={ft!r}  text={txt!r}")