<a href="https://colab.research.google.com/github/urness/CS167Fall2025/blob/main/Day24_Intro_to_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS167: Day24
## Intro to Transformers

#### CS167: Machine Learning, Fall 2025



__Credit__:

Much of the code and lecture material is adapted from [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)

Free online course: [How Transformers Work](https://learn.deeplearning.ai/courses/how-transformer-llms-work)


## __Put the Model on Training Device (GPU or CPU)__


A GPU is not required for this notebook, but it won't hurt.
A graphics processing unit (GPU) speeds up model computations, and Colab gives you access to one. To enable it, go to _Runtime (or click the down arrow near RAM & DISK in the upper right) --> Change runtime type --> GPU or TPU_.

Professor Urness tested this code with the GPU option: T4

Some necessary import statements:

In [None]:
# Quote the requirement so the shell does not treat ">=" as a redirect
# !pip install "transformers>=4.46.1"

# Warning control
import warnings
warnings.filterwarnings('ignore')


## Tokenizing Text

In this section, you will tokenize the sentence "Hello world!" using the tokenizer of the [`bert-base-cased` model](https://huggingface.co/google-bert/bert-base-cased).

Let's import the `AutoTokenizer` class, define the sentence to tokenize, and instantiate the tokenizer.

<p style="background-color:#fff1d7; padding:15px; "> <b>FYI: </b> The transformers library has a set of Auto classes, like AutoConfig, AutoModel, and AutoTokenizer. Given a model name, an Auto class looks up that model's configuration and loads the matching tokenizer or model class for you.</p>

In [None]:
from transformers import AutoTokenizer

# define the sentence to tokenize
sentence = "Hello world!"

In [None]:
# load the pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

You'll now apply the tokenizer to the sentence. The tokenizer splits the sentence into tokens and returns the ID of each token.
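
To make "splits the sentence into tokens" more concrete, here is a minimal sketch of the greedy longest-match-first idea behind WordPiece, the subword scheme BERT uses. The function name `wordpiece_tokenize`, the toy vocabulary, and the toy IDs are all invented for illustration; the real tokenizer has a far larger vocabulary and extra normalization steps, so treat this as a rough picture rather than the library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary and IDs, made up for this example (not BERT's real vocabulary)
toy_vocab = {"token", "##izer", "##s", "play", "##ing", "!"}
toy_ids = {piece: i for i, piece in enumerate(sorted(toy_vocab))}

for word in ["tokenizers", "playing", "xyz"]:
    pieces = wordpiece_tokenize(word, toy_vocab)
    print(word, "->", pieces, [toy_ids.get(p) for p in pieces])
```

A word that is in the vocabulary stays whole, while a rarer word is covered by several pieces, which is why one word can cost more than one token.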

In [None]:
# apply the tokenizer to the sentence and extract the token ids
token_ids = tokenizer(sentence).input_ids

In [None]:
# print out the token ids
print(token_ids)

In [None]:
# To map each token ID to its corresponding token, you can use the `decode` method of the tokenizer.
# (the loop variable is named token_id to avoid shadowing Python's built-in `id`)
for token_id in token_ids:
    print(tokenizer.decode(token_id))

__What is `[SEP]`?__

In the `bert-base-cased` tokenizer, the special token `[SEP]` is a separator token used to mark the end of a sentence or to separate the two segments of a paired input. Its counterpart, `[CLS]`, is added at the start of every input.
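
As a rough illustration of where these markers go, the sketch below wraps one or two token lists the way BERT-style inputs are laid out. The helper `add_special_tokens` is a hypothetical name written for this example; it only mimics the layout and is not how the library builds its inputs internally.

```python
def add_special_tokens(tokens_a, tokens_b=None):
    """Wrap one or two token lists in BERT-style [CLS]/[SEP] markers."""
    wrapped = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    if tokens_b is not None:
        # A second segment gets its own closing [SEP]
        wrapped += list(tokens_b) + ["[SEP]"]
    return wrapped


# Single sentence: [CLS] ... [SEP]
print(add_special_tokens(["Hello", "world", "!"]))

# Sentence pair: [CLS] ... [SEP] ... [SEP]
print(add_special_tokens(["How", "are", "you", "?"], ["Fine", "."]))
```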

# Exercise #1
Using the `bert-base-cased` tokenizer, what token ID is assigned to the word "Thanksgiving"?

# Exercise #2

Using the `bert-base-cased` tokenizer, how many tokens are required to represent the word "discombobulated" (not including the `[CLS]` or `[SEP]` tokens)?


## Visualizing Tokenization

In this section, you'll use the provided function `show_tokens`. The function takes in a text and the model name, and prints the vocabulary length of the tokenizer and a colored list of the tokens.
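
The coloring in `show_tokens` relies on ANSI escape sequences, which most terminals and Colab output cells understand. The sketch below isolates that one idea: the helper name `colorize` and the sample tokens are assumptions made for this example, while the escape codes themselves match the ones used in the function.

```python
def colorize(token, rgb):
    """Wrap text in ANSI codes: black text on a 24-bit RGB background."""
    r, g, b = rgb
    # 0 = reset attributes, 30 = black foreground, 48;2;R;G;B = RGB background
    start = f"\x1b[0;30;48;2;{r};{g};{b}m"
    reset = "\x1b[0m"  # return to the default style after the token
    return start + token + reset


# Cycle through a small palette so neighboring tokens get different colors
palette = [(102, 194, 165), (252, 141, 98), (141, 160, 203)]
for i, tok in enumerate(["Hello", "world", "!", "again"]):
    print(colorize(tok, palette[i % len(palette)]), end=" ")
print()
```

The modulo on the palette index is what lets a short color list cover an arbitrarily long token sequence.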

In [None]:
# A list of colors in RGB for representing the tokens
colors = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]

def show_tokens(sentence: str, tokenizer_name: str):
    """ Show the tokens each separated by a different color """

    # Load the tokenizer and tokenize the input
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids

    # Extract vocabulary length
    print(f"Vocab length: {len(tokenizer)}")

    # Print a colored list of tokens
    for idx, t in enumerate(token_ids):
        print(
            f'\x1b[0;30;48;2;{colors[idx % len(colors)]}m' +
            tokenizer.decode(t) +
            '\x1b[0m',
            end=' '
        )

Here's the text that you'll use to explore the different tokenization strategies of each model. Notice how complicated it is, including strange characters and icons. This string will illustrate how different tokenizers handle these cases.
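
One reason the unusual characters matter: byte-level tokenizers (GPT-2 style) work on UTF-8 bytes, so they can always fall back to raw bytes, whereas a tokenizer with a fixed character set may have to emit an unknown token instead. The sketch below only shows how many bytes each character occupies; the helper name `utf8_bytes` is made up for this example, and it says nothing about how any particular model actually splits these characters.

```python
def utf8_bytes(ch):
    """Return the UTF-8 byte values of a string as a list of ints."""
    return list(ch.encode("utf-8"))


samples = {
    "letter A": "A",
    "musical note emoji": "\U0001F3B5",  # written as an escape to avoid encoding issues
    "Chinese character for bird": "\u9e1f",
}

for label, ch in samples.items():
    raw = utf8_bytes(ch)
    print(f"{label}: {len(raw)} byte(s) {raw}")
```

A plain ASCII letter fits in one byte, while the emoji and the Chinese character take several, so a byte-level tokenizer may spend multiple tokens on a single visible character.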

In [None]:
text = """
English and CAPITALIZATION
🎵 鸟
show_tokens False None elif == >= else: two tabs:"    " Three tabs: "       "
12.0*50=600
"""

In [None]:
# note how the bert-base-cased tokenizer breaks words into smaller pieces
show_tokens(text, "bert-base-cased")

# Exercise #3
Go to https://huggingface.co/

- Search for models
- Explore running the `show_tokens(text, "model_name_goes_here")` on various models

What is the vocabulary length of the `Qwen/Qwen2-VL-7B-Instruct` tokenizer?

In [None]:
show_tokens(text, "openai/gpt-oss-20b")

# __Word Embeddings__



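The cells below load pretrained GloVe word vectors and compute the cosine similarity between two words. Experiment with different pairs of words.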
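
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude. Here is a minimal sketch in plain Python, assuming made-up 3-dimensional vectors (real GloVe vectors have 50 or more dimensions, and these numbers are not taken from any model):

```python
import math


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy vectors: "dog" and "puppy" point in a similar direction, "car" does not
dog = [0.9, 0.8, 0.1]
puppy = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

print(f"dog vs puppy: {cosine_similarity(dog, puppy):.4f}")
print(f"dog vs car:   {cosine_similarity(dog, car):.4f}")
```

Values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, and scaling either vector does not change the result.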

In [None]:
import gensim.downloader as api
import torch
import torch.nn.functional as F

# ----------------------------------------------------------
# 1. Load a small pretrained embedding model from Gensim
# ----------------------------------------------------------
w2v = api.load("glove-wiki-gigaword-50")   # 50-dimensional GloVe vectors; 100, 200, and 300 are also available

In [None]:

# ----------------------------------------------------------
# 2. Words
# ----------------------------------------------------------
word1 = "dog"
word2 = "puppy"

vec1 = torch.tensor(w2v[word1])
vec2 = torch.tensor(w2v[word2])

print("Vec1 shape:", vec1.shape)
print("Vec2 shape:", vec2.shape)

# ----------------------------------------------------------
# 3. Cosine similarity
# ----------------------------------------------------------
cos_sim = F.cosine_similarity(vec1, vec2, dim=0).item()
print(f"\nCosine similarity: {cos_sim:.4f}")

# __Word Embeddings with Context__

In [None]:
from transformers import BertTokenizer, BertModel
import torch
import torch.nn.functional as F

# ----------------------------------------------------------
# 1. Load a pretrained tokenizer + model (BERT)
# ----------------------------------------------------------
# The tokenizer maps text → tokens → token IDs
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# The model produces contextual embeddings for each token
# (each token gets a 768-dimensional vector based on the surrounding words)
model = BertModel.from_pretrained('bert-base-uncased')


# ----------------------------------------------------------
# 2. Two sentences where "bank" means different things
# ----------------------------------------------------------
s1 = "He sat by the bank of the river."
s2 = "I went to the bank to deposit money."


# ----------------------------------------------------------
# 3. Tokenize and get embeddings for the target word
# ----------------------------------------------------------
def get_word_embedding(sentence, word):
    # Tokenize the sentence, adding CLS, SEP, and returning PyTorch tensors
    tokens = tokenizer(sentence, return_tensors='pt')

    # Run the tokens through BERT without tracking gradients (inference only).
    # outputs.last_hidden_state has shape [1, seq_len, 768]
    with torch.no_grad():
        outputs = model(**tokens)

    # Remove the batch dimension → [seq_len, 768]
    embeddings = outputs.last_hidden_state.squeeze(0)

    # Convert token IDs back to readable token strings
    tokens_decoded = tokenizer.convert_ids_to_tokens(tokens['input_ids'].squeeze(0).tolist())

    # Find the index where the target word first appears in the token list.
    # The word is lowercased because this uncased model lowercases its input.
    # A word split into several subword pieces is not found and raises ValueError.
    idx = tokens_decoded.index(word.lower())

    # Return the 768-dimensional embedding vector and the token list
    return embeddings[idx], tokens_decoded

# ----------------------------------------------------------
# 4. Get the contextual embedding for "bank" in both sentences
# ----------------------------------------------------------

emb1, tokens1 = get_word_embedding(s1, "bank")
emb2, tokens2 = get_word_embedding(s2, "bank")


# ----------------------------------------------------------
# 5. Compare the two contextual embeddings
# ----------------------------------------------------------
# Cosine similarity close to 1 → vectors are similar (same meaning)
# Cosine similarity close to 0 → vectors are different
similarity = F.cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0)).item()

print("Tokens 1:", tokens1)
print("Tokens 2:", tokens2)
print(f"\nCosine similarity between 'bank' embeddings: {similarity:.4f}")


# __Context Window__

In [None]:
# Install the libraries if needed (already present on Colab), then import them
!pip install transformers torch --quiet

from transformers import (
    AutoTokenizer,
    AutoModel,
    AutoModelForCausalLM,
    AutoModelForSeq2SeqLM
)
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)

In [None]:
model_name = "distilgpt2"  # small GPT-style decoder-only model

dec_tokenizer = AutoTokenizer.from_pretrained(model_name)
dec_model     = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# For GPT2-like models we often set pad_token = eos_token
dec_tokenizer.pad_token = dec_tokenizer.eos_token
dec_model.config.pad_token_id = dec_tokenizer.eos_token_id

prompt_short = "My favorite color is blue. Question: What is my favorite color? Answer:"
prompt_long  = "My favorite color is blue. " + "blah " * 80 + "Question: What is my favorite color? Answer:"

def generate_and_print(prompt):
    inputs = dec_tokenizer(prompt, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = dec_model.generate(
            **inputs,
            max_new_tokens=10,
            do_sample=False
        )

    print("PROMPT:\n", prompt)
    print("COMPLETION:\n", dec_tokenizer.decode(outputs[0], skip_special_tokens=True))
    print("-" * 60)

generate_and_print(prompt_short)
generate_and_print(prompt_long)
