# Pynini vs NLTK PTB Tokenizer — Adversarial Diff

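This notebook probes inputs where the pynini-built FST could diverge from NLTK's `TreebankWordTokenizer`. Each `compare` call prints ✓ when the two tokenizations agree and ✗, followed by both outputs, when they differ.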

In [1]:
import sys; sys.path.insert(0, '..')

from nltk.tokenize import TreebankWordTokenizer
from benchmark.fsts.ptb_pynini import build_ptb_fst_pynini, string_to_byte_strs, SEP
from transduction.fsa import EPSILON as NATIVE_EPSILON

nltk_tok = TreebankWordTokenizer()
fst = build_ptb_fst_pynini()

Composing PTB rules...
Core PTB FST: 317 states
Final pynini FST: 303 states
Converting to native FST...
Native FST: 303 states, 24545 arcs
  eps: 111 in, 430 out
  MARKER: 0 in, 0 out
  [EOS]: 0 in, 0 out


In [2]:
from transduction.fst import FST


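def fst_tokenize(fst, text):
    """Tokenize text using the pynini-built FST.

    Returns a list of token strings, or None if the FST rejects the input.
    """
    byte_strs = string_to_byte_strs(text)
    input_fst = FST.from_string(byte_strs)
    output_fsa = fst(input_fst, None)
    try:
        # Take the first output sequence the FST produces for this input.
        output = next(output_fsa.language(tuple=True))
    except StopIteration:
        return None  # FST rejected input
    tokens = []
    current = []  # byte symbols of the token currently being built
    for sym in output:
        if sym == SEP:
            # Separator: flush the pending bytes as one token.
            if current:
                tokens.append(bytes(int(b) for b in current).decode('utf-8', errors='replace'))
                current = []
        elif int(sym) < 256:
            # Keep byte-valued symbols only; larger values are skipped.
            current.append(sym)
    if current:
        # Flush the final token when the output does not end with SEP.
        tokens.append(bytes(int(b) for b in current).decode('utf-8', errors='replace'))
    return tokens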


def compare(text):
    """Compare NLTK and FST tokenization, highlighting differences."""
    n = nltk_tok.tokenize(text)
    f = fst_tokenize(fst, text)
    match = '✓' if n == f else '✗'
    print(f'{match} {text!r}')
    if n != f:
        print(f'  NLTK:  {n}')
        print(f'  FST:   {f}')
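
As the list of adversarial inputs grows, it may be handy to collect mismatches instead of only printing them. The sketch below is one way to do that; `find_mismatches` is a hypothetical helper, and the two stand-in tokenizers exist only to keep the example self-contained. In this notebook the arguments would be `nltk_tok.tokenize` and `lambda t: fst_tokenize(fst, t)`.

```python
def find_mismatches(tokenize_a, tokenize_b, texts):
    """Return (text, tokens_a, tokens_b) for every text where the two disagree."""
    mismatches = []
    for text in texts:
        a = tokenize_a(text)
        b = tokenize_b(text)
        if a != b:
            mismatches.append((text, a, b))
    return mismatches


# Stand-in tokenizers (assumptions for illustration only).
def whitespace_tok(text):
    return text.split()


def lowercase_tok(text):
    return text.lower().split()


demo_mismatches = find_mismatches(whitespace_tok, lowercase_tok, ["all lower", "Mixed Case"])
for text, a, b in demo_mismatches:
    print(f"{text!r}: {a} vs {b}")
```

Returning the mismatches as data makes it possible to count them or assert on them in a regression test.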

## 1. `#` in special punctuation

NLTK separates every character in `[;@#$%&]`. The cases below check whether the pynini FST also separates `#`; all three match NLTK.

In [3]:
compare('Price is #100')
compare('##heading')
compare('C# programming')

✓ 'Price is #100'
✓ '##heading'
✓ 'C# programming'
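
For reference, the punctuation rule can be reproduced in isolation with the standard `re` module. This is a minimal sketch that applies only the `[;@#$%&]` character class quoted above; `pad_special_punct` is a hypothetical name, and the full tokenizer applies other rules as well, so its output on richer inputs may differ.

```python
import re

# Character class quoted in this section; `#` is the member under test.
SPECIAL_PUNCT = re.compile(r"[;@#$%&]")


def pad_special_punct(text):
    """Surround each special punctuation character with spaces, then split."""
    return SPECIAL_PUNCT.sub(r" \g<0> ", text).split()


for sample in ("Price is #100", "##heading", "C# programming"):
    print(sample, "->", pad_special_punct(sample))
```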


## 2. ALL CAPS contractions

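NLTK's CONTRACTIONS2 patterns use `(?i)` (case-insensitive), so they split contractions in any casing. In the run below the pynini FST agrees with NLTK on the ALL CAPS inputs and diverges only on arbitrary mixed case (`cAnNoT`), which it leaves unsplit.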

In [4]:
compare('CANNOT STOP')
compare('I WANNA GO')
compare('GONNA BE GREAT')
compare('GOTTA RUN')
compare('LEMME SEE')
compare('GIMME THAT')
compare('cAnNoT stop')  # mixed case

✓ 'CANNOT STOP'
✓ 'I WANNA GO'
✓ 'GONNA BE GREAT'
✓ 'GOTTA RUN'
✓ 'LEMME SEE'
✓ 'GIMME THAT'
✗ 'cAnNoT stop'
  NLTK:  ['cAn', 'NoT', 'stop']
  FST:   ['cAnNoT', 'stop']
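
The mixed-case mismatch can be illustrated with two regexes. The first uses `(?i)` as NLTK's rule does. The second enumerates a fixed set of casings; it is a hypothetical stand-in chosen to show how an enumeration can miss `cAnNoT`, and it is not a description of how the pynini FST is actually built.

```python
import re

# Case-insensitive split of "cannot" into "can" + "not".
CANNOT_ANY_CASE = re.compile(r"(?i)\b(can)(not)\b")

# Hypothetical enumeration of lowercase, Title case, and ALL CAPS only.
CANNOT_ENUMERATED = re.compile(r"\b(can|Can|CAN)(not|NOT)\b")


def split_cannot(pattern, text):
    """Apply one contraction pattern and split the result on whitespace."""
    return pattern.sub(r" \1 \2 ", text).split()


for sample in ("CANNOT STOP", "cAnNoT stop"):
    print(sample, "->", split_cannot(CANNOT_ANY_CASE, sample),
          "|", split_cannot(CANNOT_ENUMERATED, sample))
```

Closing the gap in an FST would mean accepting either case at every character position of the contraction, which multiplies the number of paths the rule has to cover.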


## 3. Rare contractions: `d'ye`, `more'n`

Both are present in NLTK's CONTRACTIONS2. The cases below check whether the pynini FST covers them too; all four match NLTK.

In [5]:
compare("D'ye think so?")
compare("d'ye know")
compare("more'n enough")
compare("More'n I expected")

✓ "D'ye think so?"
✓ "d'ye know"
✓ "more'n enough"
✓ "More'n I expected"
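
A sketch of the two rules in isolation, assuming they follow the same `(?i)\b(...)(...)\b` shape as the other CONTRACTIONS2 entries. `split_rare_contractions` is a hypothetical helper, and because no other tokenizer rule runs here, the result is not necessarily what the full tokenizer returns.

```python
import re

# Assumed shapes of the two rules: split before the apostrophe.
RARE_CONTRACTIONS = [
    re.compile(r"(?i)\b(d)('ye)\b"),
    re.compile(r"(?i)\b(more)('n)\b"),
]


def split_rare_contractions(text):
    """Apply both rules in order, then split on whitespace."""
    for pattern in RARE_CONTRACTIONS:
        text = pattern.sub(r" \1 \2 ", text)
    return text.split()


for sample in ("d'ye know", "More'n I expected"):
    print(sample, "->", split_rare_contractions(sample))
```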


## 4. CONTRACTIONS3: `'tis`, `'twas`

NLTK splits `'tis` → `'t is` and `'twas` → `'t was`. The cases below check whether the pynini FST does the same; all four match NLTK.

In [6]:
compare("'Tis the season")
compare("'Twas the night")
compare("'tis nothing")
compare("'twas long ago")

✓ "'Tis the season"
✓ "'Twas the night"
✓ "'tis nothing"
✓ "'twas long ago"
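
The split described above can be sketched with two regexes. The patterns here require a leading space before the apostrophe, so the helper pads the input first; that padding, the exact pattern shapes, and the name `split_tis_twas` are assumptions made for this illustration.

```python
import re

# Assumed rule shapes: a space, then 't, then "is" or "was".
TIS_TWAS = [
    re.compile(r"(?i) ('t)(is)\b"),
    re.compile(r"(?i) ('t)(was)\b"),
]


def split_tis_twas(text):
    """Pad with spaces so a match at the start is possible, then apply both rules."""
    text = " " + text + " "
    for pattern in TIS_TWAS:
        text = pattern.sub(r" \1 \2 ", text)
    return text.split()


for sample in ("'Tis the season", "'twas long ago"):
    print(sample, "->", split_tis_twas(sample))
```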


## 5. Double single-quotes (`''`) as opening quote

NLTK converts `''` to ` `` ` when it is used as an opening quote (STARTING_QUOTES rule 3). The cases below check whether the pynini FST does the same; both match NLTK.

In [7]:
compare("She said ''hello'' there")
compare("''Hello,'' she replied")

✓ "She said ''hello'' there"
✓ "''Hello,'' she replied"
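
A sketch of the opening-quote rule on its own, assuming it fires when `"` or `''` follows a space or an opening bracket. `convert_opening_quotes` is a hypothetical name. Only this one rule is applied, so a `''` at the very start of the string or in closing position is left untouched here and would be handled by other rules in the full tokenizer.

```python
import re

# Assumed rule shape: a space or opening bracket, then " or two single quotes.
OPENING_QUOTES = re.compile(r"([ (\[{<])(\"|'{2})")


def convert_opening_quotes(text):
    """Rewrite a qualifying opening quote as `` and split on whitespace."""
    return OPENING_QUOTES.sub(r"\1 `` ", text).split()


for sample in ("She said ''hello'' there", "''Hello,'' she replied"):
    print(sample, "->", convert_opening_quotes(sample))
```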


## 6. Sanity checks (should all match)

In [8]:
compare("I can't do it.")
compare("It's a test -- really!")
compare("Don't you think?")
compare("I'll go there.")
compare("We've been here.")
compare('1,000 people')
compare('at 3:00 PM')
compare('items: none')
compare('foo;bar')
compare('$100 & more')
compare('50% off @ store')

✓ "I can't do it."
✓ "It's a test -- really!"
✓ "Don't you think?"
✓ "I'll go there."
✓ "We've been here."
✓ '1,000 people'
✓ 'at 3:00 PM'
✓ 'items: none'
✓ 'foo;bar'
✓ '$100 & more'
✓ '50% off @ store'
