[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/1.words/HW1_Tokenization.ipynb)

# Homework 1: Tokenization

In this homework, you'll compare the tokenization outputs from different classes of tokenizers. This homework is also an opportunity for you to check in on your Python proficiency; for all of the operations below (downloading a file, reading it in, counting objects), you should either be comfortable implementing them already or know how to find out how to do so yourself (if you find yourself struggling with them, we encourage you to take this class at a later date, with a bit more Python experience under your belt).

We've added some space for you to write the code for each section, but feel free to create more code cells if you'd like.

## Part 1

Tokenize the following document with each of these models. Feel free to use the documentation linked (and AI assistance) to do so for this low-level operation (but again remember that you have to be able to explain what your code is doing). For each of the tokenizers below, we want to see a list of tokens for this document (not numeric token IDs, but legible words) -- e.g., \["London", ".", ...\]

* NLTK `word_tokenize` (https://www.nltk.org/book/ch03.html)
* spaCy `tokenize` (https://spacy.io/usage/spacy-101#annotations-token)
* Tiktoken BPE tokenization (https://github.com/openai/tiktoken) -- cl100k_base (GPT-3.5, GPT-4).



In [None]:
document = "London. Michaelmas term lately over, and the Lord Chancellor sitting in Lincoln’s Inn Hall. Implacable November weather. As much mud in the streets as if the waters had but newly retired from the face of the earth, and it would not be wonderful to meet a Megalosaurus, forty feet long or so, waddling like an elephantine lizard up Holborn Hill. Smoke lowering down from chimney-pots, making a soft black drizzle, with flakes of soot in it as big as full-grown snowflakes—gone into mourning, one might imagine, for the death of the sun. Dogs, undistinguishable in mire. Horses, scarcely better; splashed to their very blinkers. Foot passengers, jostling one another’s umbrellas in a general infection of ill temper, and losing their foot-hold at street-corners, where tens of thousands of other foot passengers have been slipping and sliding since the day broke (if this day ever broke), adding new deposits to the crust upon crust of mud, sticking at those points tenaciously to the pavement, and accumulating at compound interest."

In [24]:
# word_tokenize relies on NLTK's Punkt models; if they are missing, run
# nltk.download("punkt") once ("punkt_tab" on newer NLTK versions).
import nltk
from nltk import word_tokenize
import spacy
import tiktoken

# NLTK word_tokenize
nltk_tokens = word_tokenize(document)

# spaCy tokenizer: iterate over the Doc and keep each token's text
nlp = spacy.load("en_core_web_sm")
doc = nlp(document)
spacy_tokens = [token.text for token in doc]

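# tiktoken BPE (cl100k_base)
enc = tiktoken.get_encoding("cl100k_base")
# encode() returns integer token IDs, not strings
tiktoken_tokens = enc.encode(document)
# Decode each ID on its own to get one readable string per token. A token that
# holds only part of a multi-byte character (which may happen for ’ or —)
# decodes to the Unicode replacement character.
tiktoken_decoded = [enc.decode([t]) for t in tiktoken_tokens]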



print(nltk_tokens)
print(spacy_tokens)
print(tiktoken_decoded)

['London', '.', 'Michaelmas', 'term', 'lately', 'over', ',', 'and', 'the', 'Lord', 'Chancellor', 'sitting', 'in', 'Lincoln', '’', 's', 'Inn', 'Hall', '.', 'Implacable', 'November', 'weather', '.', 'As', 'much', 'mud', 'in', 'the', 'streets', 'as', 'if', 'the', 'waters', 'had', 'but', 'newly', 'retired', 'from', 'the', 'face', 'of', 'the', 'earth', ',', 'and', 'it', 'would', 'not', 'be', 'wonderful', 'to', 'meet', 'a', 'Megalosaurus', ',', 'forty', 'feet', 'long', 'or', 'so', ',', 'waddling', 'like', 'an', 'elephantine', 'lizard', 'up', 'Holborn', 'Hill', '.', 'Smoke', 'lowering', 'down', 'from', 'chimney-pots', ',', 'making', 'a', 'soft', 'black', 'drizzle', ',', 'with', 'flakes', 'of', 'soot', 'in', 'it', 'as', 'big', 'as', 'full-grown', 'snowflakes—gone', 'into', 'mourning', ',', 'one', 'might', 'imagine', ',', 'for', 'the', 'death', 'of', 'the', 'sun', '.', 'Dogs', ',', 'undistinguishable', 'in', 'mire', '.', 'Horses', ',', 'scarcely', 'better', ';', 'splashed', 'to', 'their', 'very

## Part 2

Examine the different tokenizations for the passage above -- i.e., actually read through them and see how they differ. In a paragraph or two, characterize the salient differences in tokenization between a.) NLTK and Spacy and b.) NLTK and BPE.  Reference real examples in the text. At the end of this homework, you want to be able to discuss the practical differences between tokenization methods.

**Response**:

a) NLTK and spaCy produce nearly identical outputs: both treat words as whole units and split punctuation marks off as separate tokens. They differ in two places. First, the possessive in "Lincoln’s": NLTK splits the ending into two tokens, '’' and 's', whereas spaCy keeps it as the single token '’s'. Second, words joined by a hyphen or dash: NLTK keeps "full-grown" and "snowflakes—gone" as single tokens, whereas spaCy splits them at the hyphen or dash. Aside from these two salient differences, the two tokenizations are highly similar.

b) In contrast, the BPE tokenizer used by tiktoken splits words into subword units when they are infrequent or missing from its learned vocabulary. For example, 'Michaelmas' is split into 'Michael' and 'mas' rather than kept as one token. This reflects the tradeoff in BPE: any word, even a rare one, can be represented from smaller building blocks, at the cost of splitting some uncommon words into several units. In my view, this makes NLTK the better option when whole words and punctuation matter, whereas BPE is optimised for neural models, where compact vocabulary coverage and handling of unseen words are crucial.
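
One way to surface differences like these without reading both lists by eye is to align the two token sequences and print only the spans that disagree. The following is a minimal sketch using the standard library's `difflib`; the helper name `token_differences` and the two short token lists are hypothetical stand-ins modelled on the "Lincoln’s" example, not the actual tokenizer output.

```python
from difflib import SequenceMatcher


def token_differences(tokens_a, tokens_b):
    """Return (span_from_a, span_from_b) pairs where the two sequences disagree."""
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    return [
        (tokens_a[i1:i2], tokens_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replaced, inserted, or deleted spans
    ]


# Hypothetical token lists illustrating the possessive split described above.
nltk_like = ["Lincoln", "’", "s", "Inn", "Hall", "."]
spacy_like = ["Lincoln", "’s", "Inn", "Hall", "."]

differences = token_differences(nltk_like, spacy_like)
for span_a, span_b in differences:
    print(span_a, "vs", span_b)
```

The same helper could be called on the full token lists from Part 1, assuming they are plain lists of strings.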



## Part 3

Download the full text of *Pride and Prejudice* (https://raw.githubusercontent.com/dbamman/anlp25/main/data/1342_pride_and_prejudice.txt) and tokenize it using each of the methods above. How many word types (in the formal sense we discussed in class) does each tokenization method have for that complete file?
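
As a reminder of the distinction, tokens are the individual items a tokenizer emits and types are the distinct items among them. A minimal sketch of the counting step, assuming the tokens are already in a list: the function name `count_types`, the `lowercase` option, and the toy token list are all hypothetical, and whether case should be folded depends on the definition of type used in class.

```python
def count_types(tokens, lowercase=False):
    """Count distinct tokens (word types) in a token sequence.

    `lowercase` is an assumption of this sketch: enable it only if your
    definition of a type ignores case.
    """
    if lowercase:
        tokens = (t.lower() for t in tokens)
    return len(set(tokens))


# Hypothetical toy token list standing in for real tokenizer output.
sample_tokens = ["It", "is", "a", "truth", ",", "that", "a", "man", "is", "it"]

n_tokens = len(sample_tokens)                              # 10 tokens
n_types = count_types(sample_tokens)                       # "is" and "a" repeat
n_types_folded = count_types(sample_tokens, lowercase=True)  # "It" and "it" merge
print(n_tokens, n_types, n_types_folded)
```

For the BPE tokenizer, the same count could be taken over token IDs or over decoded strings; the sketch works on either.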

In [None]:
# Your code here:


## Part 4

Which text has the greater type-token ratio, *Pride and Prejudice* (https://raw.githubusercontent.com/dbamman/anlp25/main/data/1342_pride_and_prejudice.txt) or *Emma* (https://raw.githubusercontent.com/dbamman/anlp25/main/data/158_emma.txt)?  Calculate the TTR for both texts using the NLTK tokenizer, but only use the first 1,000 tokens from each text when calculating its TTR.
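
The type-token ratio is the number of types divided by the number of tokens. Because TTR tends to fall as a text gets longer, restricting both texts to the same number of tokens keeps the comparison fair. A minimal sketch of the calculation, assuming a list of tokens is already available: the function name `type_token_ratio` and the toy lists are hypothetical, and the real inputs would be the NLTK tokens of each novel.

```python
def type_token_ratio(tokens, limit=1000):
    """Return types / tokens over the first `limit` tokens (0.0 if empty)."""
    window = list(tokens[:limit])
    if not window:
        return 0.0
    return len(set(window)) / len(window)


# Hypothetical toy token lists standing in for tokenized texts.
toy_a = ["the", "cat", "sat", "on", "the", "mat"]  # 5 types over 6 tokens
toy_b = ["a", "a", "a", "b"]                       # 2 types over 4 tokens

ttr_a = type_token_ratio(toy_a)
ttr_b = type_token_ratio(toy_b)
print(ttr_a, ttr_b)
```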

In [None]:
# Your code here:


In [None]:
pp_ttr = 0.0  # fill this in!
emma_ttr = 0.0  # fill this in!
answer = "???"  # fill this in!

print("The TTR for 'Pride and Prejudice' is", pp_ttr)
print("The TTR for 'Emma' is", emma_ttr)
print(f"{answer} has the higher TTR.")