
EE 690/EE 790 Large Language Models

Assignment #1 (Tokenizer)

Blazer ID: Szandian

PhD Student: Somayeh Zandian


In [1]:
!pip install tiktoken



a) Tokenize and Decode Sentences (30 points)
Write Python code that does the following:
- Imports the tiktoken library and loads the GPT-2 tokenizer.
- Takes three sentences as input (hard-coded is fine).
- Tokenizes each sentence into token IDs using tokenizer.encode().
- Decodes the token IDs back into text using tokenizer.decode().
- Print: Original sentences, Token ID sequences, Decoded text from token IDs

In [4]:

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Bioinformatics combines biology, computer science, and statistics.",
    "Tokenization is the first step in many NLP pipelines."
]

for i, sent in enumerate(sentences, 1):
    token_ids = tokenizer.encode(sent)
    decoded = tokenizer.decode(token_ids)
    print(f"Sentence {i}: {sent!r}")
    print(f"  Token IDs: {token_ids}")
    print(f"  Decoded : {decoded!r}\n")

Sentence 1: 'The quick brown fox jumps over the lazy dog.'
  Token IDs: [464, 2068, 7586, 21831, 18045, 625, 262, 16931, 3290, 13]
  Decoded : 'The quick brown fox jumps over the lazy dog.'

Sentence 2: 'Bioinformatics combines biology, computer science, and statistics.'
  Token IDs: [42787, 259, 18982, 873, 21001, 17219, 11, 3644, 3783, 11, 290, 7869, 13]
  Decoded : 'Bioinformatics combines biology, computer science, and statistics.'

Sentence 3: 'Tokenization is the first step in many NLP pipelines.'
  Token IDs: [30642, 1634, 318, 262, 717, 2239, 287, 867, 399, 19930, 31108, 13]
  Decoded : 'Tokenization is the first step in many NLP pipelines.'



b) Analyze Token Behavior (20 points)
- Use the tokenizer to encode the following words:
"hello", "Hello", "HELLO", "hElLo"
- Print the token ID sequences for each.
- Use decode() to convert the token IDs back to strings.
- Reflection Questions (to answer in comments or a separate text file):
  - How does the tokenizer treat different capitalizations?
  - Why do you think subword tokenization is used instead of word-level?

In [7]:
words = ["hello", "Hello", "HELLO", "hElLo"]

print("=== Part (b): Token Behavior ===\n")
for w in words:
    ids = tokenizer.encode(w)
    back = tokenizer.decode(ids)
    print(f"Word {w!r}:")
    print(f"  Token IDs: {ids}")
    print(f"  Decoded : {back!r}\n")

=== Part (b): Token Behavior ===

Word 'hello':
  Token IDs: [31373]
  Decoded : 'hello'

Word 'Hello':
  Token IDs: [15496]
  Decoded : 'Hello'

Word 'HELLO':
  Token IDs: [13909, 3069, 46]
  Decoded : 'HELLO'

Word 'hElLo':
  Token IDs: [71, 9527, 27654]
  Decoded : 'hElLo'



**Q: How does the tokenizer treat different capitalizations?**

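A: The tokenizer is case-sensitive: each capitalization produces a different token ID sequence. "hello" and "Hello" each map to a single token (31373 and 15496), presumably because both forms were frequent enough in the training data to get their own vocabulary entries. The rarer forms "HELLO" and "hElLo" are each split into three tokens.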

**Q: Why do you think subword tokenization is used instead of word-level?**

A: Subword tokenization lets rare or unseen words be broken into known pieces, so no input is ever out of vocabulary,
while the vocabulary stays far smaller than a word-level vocabulary would need to be.
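
The splitting behavior described in these answers can be illustrated with a toy tokenizer. This is only a sketch: `TOY_VOCAB` and `toy_tokenize` are hypothetical names, the vocabulary is made up for illustration, and greedy longest-match is a simplification, not the BPE merge procedure that GPT-2 actually uses.

```python
# Toy vocabulary (hypothetical, chosen for illustration; not GPT-2's real vocabulary).
TOY_VOCAB = {"hello", "Hello", "HE", "LL", "El", "Lo"}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split; single characters are always allowed."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            # A single character always matches, so no word is ever "unknown".
            if piece in vocab or j == i + 1:
                pieces.append(piece)
                i = j
                break
    return pieces


for word in ["hello", "Hello", "HELLO", "hElLo", "zqx"]:
    print(f"{word!r} -> {toy_tokenize(word)}")
```

Because matching is exact, "hello" and "Hello" hit different entries, and a string with no multi-character match falls back to single characters instead of failing.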



c) Vocabulary and Unknown Tokens (20 points)
- Print the total vocabulary size using:
- Try encoding a nonsense word like "zqxjklmno" and print its token ID sequence and
decoded form.

- How many subword units is it broken into?
- What does this say about how tiktoken handles unknown or rare words?

In [10]:
# 1. Total vocabulary size
# tiktoken Encoding objects expose the vocabulary size as n_vocab
vocab_size = tokenizer.n_vocab
print(f"Total GPT-2 vocab size: {vocab_size}\n")

# 2. Nonsense word
nonsense = "zqxjklmno"
n_ids = tokenizer.encode(nonsense)
n_decoded = tokenizer.decode(n_ids)
print(f"Nonsense word {nonsense!r}:")
print(f"  Token IDs      : {n_ids}")
print(f"  Decoded        : {n_decoded!r}")
print(f"  # subword units: {len(n_ids)}\n")

Total GPT-2 vocab size: 50257

Nonsense word 'zqxjklmno':
  Token IDs      : [89, 80, 87, 73, 41582, 76, 3919]
  Decoded        : 'zqxjklmno'
  # subword units: 7




Q: How many subword units is it broken into?
What does this say about how tiktoken handles unknown or rare words?


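A: It is broken into 7 subword units. GPT-2's byte-level BPE has no unknown token: a rare or unseen word is split
into smaller known pieces (down to single bytes if necessary), so the model can still represent it,
and decoding the IDs reproduces the input exactly.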
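
A quick check on the single-character fallback, without needing tiktoken: several IDs in the printed outputs appear to be single-character tokens, and they are consistent with printable ASCII characters having the ID `ord(c) - 33`. That offset rule is an assumption inferred from the printed IDs (for example, every sentence ends in "." and every ID list ends in 13), and `ascii_byte_token_id` is a hypothetical helper. Which ID belongs to which character is also inferred from position.

```python
def ascii_byte_token_id(ch):
    """Assumed GPT-2 ID for one printable ASCII character ('!' through '~')."""
    code = ord(ch)
    if not 33 <= code <= 126:
        raise ValueError("the offset rule is only assumed for '!' through '~'")
    return code - 33


# (character, ID printed in the notebook outputs); pairing inferred from position.
observed = [
    (".", 13), (",", 11),                        # punctuation in part (a)
    ("h", 71), ("O", 46),                        # first ID of 'hElLo', last ID of 'HELLO'
    ("z", 89), ("q", 80), ("x", 87), ("j", 73), ("m", 76),  # nonsense word
]

for ch, printed_id in observed:
    predicted = ascii_byte_token_id(ch)
    print(f"{ch!r}: predicted {predicted}, printed {printed_id}, match={predicted == printed_id}")
```

Under the same assumed rule, "!" would map to ID 0, which is worth keeping in mind when 0 is reused as a padding value.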


d) Token Batching and Padding (30 points)

Using a list of short sentences:
sentences = ["Short.", "This one is longer.", "The longest sentence of all three."]
- Tokenize each sentence.
- Pad all sequences to match the length of the longest sequence (use 0 as the pad token).
- Create an attention mask indicating real tokens (1) and padding (0).
- Print: Padded token sequences, Attention masks

In [13]:
sentences2 = [
    "Short.",
    "This one is longer.",
    "The longest sentence of all three."
]

# 1. Tokenize each
tokenized = [tokenizer.encode(s) for s in sentences2]
max_len = max(len(t) for t in tokenized)
pad_id = 0  # pad value required by the assignment; GPT-2 has no pad token, and ID 0 is really '!'

padded_seqs = []
attention_masks = []

for t in tokenized:
    pad_length = max_len - len(t)
    padded = t + [pad_id] * pad_length
    mask   = [1] * len(t) + [0] * pad_length
    padded_seqs.append(padded)
    attention_masks.append(mask)

for i, (p, m) in enumerate(zip(padded_seqs, attention_masks), 1):
    # Use the mask, not the pad value, to drop padding: ID 0 can also be a real token
    decoded = tokenizer.decode([tok for tok, keep in zip(p, m) if keep])
    print(f"Sentence {i}:")
    print(f"  Padded IDs    : {p}")
    print(f"  Attention mask: {m}")
    print(f"  Decoded (no pad): {decoded!r}\n")


Sentence 1:
  Padded IDs    : [16438, 13, 0, 0, 0, 0, 0]
  Attention mask: [1, 1, 0, 0, 0, 0, 0]
  Decoded (no pad): 'Short.'

Sentence 2:
  Padded IDs    : [1212, 530, 318, 2392, 13, 0, 0]
  Attention mask: [1, 1, 1, 1, 1, 0, 0]
  Decoded (no pad): 'This one is longer.'

Sentence 3:
  Padded IDs    : [464, 14069, 6827, 286, 477, 1115, 13]
  Attention mask: [1, 1, 1, 1, 1, 1, 1]
  Decoded (no pad): 'The longest sentence of all three.'
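
The padding and masking steps can also be wrapped in a reusable function. The sketch below is one possible shape, not part of the assignment: `pad_batch` and its `side` parameter are hypothetical, and the token IDs are copied from the output above so that it runs without tiktoken.

```python
def pad_batch(sequences, pad_id=0, side="right"):
    """Pad token-ID lists to a common length and build matching attention masks."""
    if side not in ("right", "left"):
        raise ValueError("side must be 'right' or 'left'")
    max_len = max((len(seq) for seq in sequences), default=0)
    padded, masks = [], []
    for seq in sequences:
        pad = [pad_id] * (max_len - len(seq))
        zeros = [0] * len(pad)   # mask entries for padding positions
        ones = [1] * len(seq)    # mask entries for real tokens
        if side == "right":
            padded.append(list(seq) + pad)
            masks.append(ones + zeros)
        else:
            padded.append(pad + list(seq))
            masks.append(zeros + ones)
    return padded, masks


# Token IDs copied from the notebook output above.
batch = [
    [16438, 13],
    [1212, 530, 318, 2392, 13],
    [464, 14069, 6827, 286, 477, 1115, 13],
]

padded, masks = pad_batch(batch)
for p, m in zip(padded, masks):
    print(p, m)

# Recover the real tokens from the mask rather than by comparing against pad_id.
recovered = [[tok for tok, keep in zip(p, m) if keep] for p, m in zip(padded, masks)]
print(recovered == batch)
```

Selecting by mask keeps the round trip correct even if a real token happens to share the pad value. Passing `side="left"` puts the padding in front, a layout often used when batching inputs for generation with decoder-only models.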

