
# Large Language Model

In [1]:
# ============================================================
# BEGINNER NOTE: WHAT IS ACTUALLY HAPPENING IN LLMs
# (READ SLOWLY — THIS IS NOT CODE TO MEMORIZE)
# ============================================================

# A computer is NOT a human.
# It does NOT understand:
# - words
# - language
# - meaning
# - emotions
#
# A computer understands ONLY:
# NUMBERS

# ------------------------------------------------------------
# PROBLEM:
# Humans use TEXT
# Computers use NUMBERS
# ------------------------------------------------------------

# So we need a way to convert:
# TEXT  →  NUMBERS
# and later convert back:
# NUMBERS → TEXT

# ------------------------------------------------------------
# STEP 1: TEXT → SMALL PIECES (TOKENS)
# ------------------------------------------------------------

# Example sentence:
# "I am sorry"

# The computer breaks this into SMALL pieces:
# "I" | " am" | " sorry"

# These small pieces are called TOKENS.
# Think of tokens like LEGO blocks of text.

# IMPORTANT:
# - Tokens are NOT always full words
# - Sometimes they are parts of words
# - This keeps the vocabulary small and lets the computer handle words it has never seen

# ------------------------------------------------------------
# STEP 2: TOKENS → NUMBERS
# ------------------------------------------------------------

# The computer STILL cannot understand tokens.
# So each token is given a NUMBER.

# Example:
# "I"      → 45
# " am"    → 302
# " sorry" → 987

# The sentence now becomes:
# [45, 302, 987]

# NOW the computer is happy,
# because it ONLY sees numbers.

# ------------------------------------------------------------
# STEP 3: NUMBERS → MATH
# ------------------------------------------------------------

# The AI model is NOT thinking.
# It is a HUGE math machine.

# It takes the numbers and does math:
# - addition
# - multiplication
# - probability calculations

# It asks ONE question:
# "Given these numbers, what number usually comes next?"

# ------------------------------------------------------------
# STEP 4: MATH → NEW NUMBERS
# ------------------------------------------------------------

# After doing math, the model guesses the NEXT number.

# Example:
# [45, 302, 987] → 654

# This guessing happens:
# - one number at a time
# - again and again
# - very fast

# ------------------------------------------------------------
# STEP 5: NUMBERS → TOKENS
# ------------------------------------------------------------

# Now we reverse the process.

# Example:
# 654 → " for"

# The computer looks up the number
# and finds the matching token.

# ------------------------------------------------------------
# STEP 6: TOKENS → TEXT
# ------------------------------------------------------------

# All tokens are joined back together:

# "I am sorry for ..."

# NOW humans can read it.

# ------------------------------------------------------------
# MOST IMPORTANT TRUTH (REMEMBER THIS)
# ------------------------------------------------------------

# The AI NEVER understands language.
# It NEVER knows meaning.
# It NEVER knows emotions.

# It is ONLY very good at guessing:
# "What text usually comes next?"

# ------------------------------------------------------------
# ONE-LINE SUMMARY:
# ------------------------------------------------------------

# TEXT → TOKENS → NUMBERS → MATH → NUMBERS → TOKENS → TEXT

# THAT IS ALL A LANGUAGE MODEL DOES.
# ============================================================
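
The round trip described above can be sketched with a toy vocabulary. This is only an illustration: the four tokens and their IDs are the made-up numbers from the note, and the `encode` / `decode` helpers are hypothetical stand-ins for what a real tokenizer does with a far larger vocabulary.

```python
# Toy vocabulary: every token gets an ID. The tokens and IDs mirror the
# made-up example in the note; a real tokenizer has tens of thousands of entries.
vocab = {"I": 45, " am": 302, " sorry": 987, " for": 654}
id_to_token = {token_id: token for token, token_id in vocab.items()}


def encode(text):
    """TEXT -> NUMBERS: greedily match the longest known token at each position."""
    ids = []
    position = 0
    while position < len(text):
        match = max(
            (token for token in vocab if text.startswith(token, position)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no token covers {text[position:]!r}")
        ids.append(vocab[match])
        position += len(match)
    return ids


def decode(ids):
    """NUMBERS -> TEXT: look up each ID and join the pieces."""
    return "".join(id_to_token[token_id] for token_id in ids)


ids = encode("I am sorry")
print(ids)                  # [45, 302, 987]
print(decode(ids + [654]))  # I am sorry for
```

The "math" step is deliberately missing here: appending `654` by hand stands in for the model's prediction.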

In [33]:
# Token Embeddings - Easy Explanation
# -----------------------------------
# 1. Tokens are like tiny pieces of words or characters that a model can understand.
# 2. Each token is turned into a number list called an "embedding" so the computer can work with it.
# 3. These embeddings help the model know:
#    - Which words go together
#    - What words mean in context
#    - Patterns in sentences
# 4. Basically, embeddings are the model’s way of "understanding" language.
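
As a rough sketch of point 2, an embedding can be pictured as a lookup table with one row of numbers per token ID. The sizes below and the random values are assumptions for illustration only; in a trained model the rows are learned and much longer.

```python
import random

random.seed(0)  # fixed seed so the toy numbers are reproducible

vocab_size = 1000   # hypothetical: number of tokens the tokenizer knows
embedding_dim = 4   # hypothetical: real models use hundreds or thousands

# One row of numbers per token ID. In a real model these values are learned
# during training; here they are random placeholders.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

token_ids = [45, 302, 987]  # the made-up IDs for "I am sorry"
embeddings = [embedding_table[token_id] for token_id in token_ids]

for token_id, vector in zip(token_ids, embeddings):
    print(token_id, [round(value, 2) for value in vector])
```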

In [2]:
# STEP 0: Install compatible versions of torch and transformers
#!pip install --upgrade torch torchvision torchaudio --quiet
# !pip install --upgrade transformers --quiet

In [3]:
# ============================================================
# LLMs, TOKENS, EMBEDDINGS
# ============================================================

# WHAT IS HAPPENING (IN SIMPLE WORDS):
#   Text → small pieces (TOKENS)
#   Tokens → numbers
#   Numbers → math
#   Math → new numbers
#   Numbers → tokens
#   Tokens → text

# ------------------------------------------------------------
# STEP 0: Install required library (this gives us AI tools)
# ------------------------------------------------------------
# !pip install -q transformers torch accelerate

# ------------------------------------------------------------
# STEP 1: Import tools
# ------------------------------------------------------------
# TRANSFORMERS
# - download AI models
# - load them
# - use them

# WITHOUT this library:
# - You would need thousands of lines of code
# - You would need years of math knowledge

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# "From the transformers library,
#  give me two tools:
#  1) AutoTokenizer
#  2) AutoModelForCausalLM"

# AutoTokenizer is NOT the tokenizer itself.
# It is a FACTORY.

# A factory means:
# - You give it a model name
# - It automatically figures out:
#   which tokenizer is needed
#   how to load it
#   how to configure it

# You DON'T manually build a tokenizer.
# AutoTokenizer does it FOR YOU.

# WHY THE NAME "AutoModelForCausalLM"?
# ============================================================

# This is the ACTUAL AI MODEL.
# This is the "brain".

# But IMPORTANT:
# It is NOT a thinking brain.
# It is a NUMBER-PREDICTING MACHINE.

# "LM" = Language Model
# Means: works with language

# "Causal" means:
# - looks ONLY at past text
# - predicts the NEXT token
# - does NOT see the future

# Hugging Face has MANY model types.
# You don’t want to manually pick the right one.

# "AutoModelForCausalLM" means:
# - Automatically choose the correct model class
# - Automatically load the right architecture
# - Automatically configure it

# YOU DO NOT BUILD THE MODEL.
# YOU DO NOT TRAIN THE MODEL.
# YOU ONLY LOAD IT.


# ------------------------------------------------------------
# STEP 2: Load TOKENIZER
# Tokenizer = translator between HUMAN TEXT and AI NUMBERS
# ------------------------------------------------------------
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct"
)

# What just happened:
# - Python contacted the Hugging Face Hub (or reused a local cache)
# - Downloaded tokenizer files for this model
# - Loaded a vocabulary (huge table: token ↔ number)
# - Stored it in the variable called `tokenizer`
# From now on, `tokenizer`:
# - splits text into tokens
# - converts those tokens into numbers
# - converts numbers BACK into text

# IMPORTANT:
# The tokenizer MUST match the model.
# If you pair a tokenizer with a different model, the numbers point at
# the wrong vocabulary entries and the output becomes nonsense.


# ------------------------------------------------------------
# STEP 3: Load MODEL
# Model = the "brain" that predicts the next word piece
# ------------------------------------------------------------
# A MODEL is a HUGE file full of numbers.

# Those numbers are called WEIGHTS.
# They were learned by training on massive text data.

# The model does ONE thing only:
# Given some numbers,
# it PREDICTS the NEXT number.

# That’s it.
# No understanding.
# No thinking.
# Just prediction

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="auto",                                 # "If there is a GPU, use it; otherwise, use the CPU"
    dtype=torch.float16                                # "Use 16-bit numbers": about half the memory, usually faster on GPU, slightly less precise (older transformers versions call this argument torch_dtype)
)
# 1) Python downloaded a VERY LARGE file (the model weights)
# 2) These weights are just NUMBERS (millions/billions of them)
# 3) They were trained earlier by Microsoft (not by us)
# 4) The model was loaded into RAM / GPU memory


# ------------------------------------------------------------
# STEP 4: Write INPUT TEXT (PROMPT)
# <|assistant|> tells the model: "now YOU answer"
# ------------------------------------------------------------
prompt = "Write a short apology email to Sarah about a gardening mistake.<|assistant|>"

# ------------------------------------------------------------
# STEP 5: TEXT → TOKENS → NUMBERS
# This is what the model ACTUALLY sees
# ------------------------------------------------------------
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Print the raw text we gave to the model
print("\nRAW INPUT TEXT:")
print(prompt)


# Print the numbers that model actually sees
# These are the input_ids from the tensor
# Model ONLY understands these numbers
print("\nTOKEN NUMBERS (AI ONLY UNDERSTANDS THIS):")
print(inputs["input_ids"])

# 1️⃣ tokenizer(prompt, return_tensors="pt")
# -------------------------------------------
# tokenizer takes our text ("prompt") and:
# a) breaks it into small pieces (tokens)
# b) converts tokens into numbers
# c) puts numbers into a tensor (a special number table)
#
# return_tensors="pt" means:
# - Use PyTorch format (PyTorch = torch)
# - A tensor is just a fancy multi-dimensional array
#   that can hold all the numbers in a format the model likes

# Example (conceptual; the ID is made up for illustration, every tokenizer has its own):
# Text: "Hello"
# Tokens: ["Hello"]
# Numbers: [15496]
# Tensor: tensor([[15496]])

# 2️⃣ .to(model.device)
# --------------------
# The model can run on:
# - GPU (fast)
# - CPU (slow)
#
# .to(model.device) moves the tensor to the same place
# where the model is loaded. This is REQUIRED or the model
# cannot read the numbers.

# ------------------------------------------------------------
# STEP 6: Show TOKENS (small text pieces)
# ------------------------------------------------------------

# We already converted text into numbers:
# inputs["input_ids"] looks like this (illustrative IDs): tensor([[15496, 703, 389, 1029]])

# Each number = a small piece of text (a TOKEN)
# Tokens are like LEGO blocks of words

# This line converts numbers BACK to text tokens:
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Example conceptually (illustrative IDs):
# [15496, 703, 389, 1029] → ["Hello", " how", " are", " you"]

# Print it so we can see the pieces
print("\nTOKENS (TEXT BROKEN INTO SMALL PIECES):")
print(tokens)


# ------------------------------------------------------------
# STEP 7: MODEL GENERATES NEW TOKENS (ONE BY ONE)
# It is guessing the next token again and again
# ------------------------------------------------------------
output = model.generate(
    **inputs,
    max_new_tokens=50
)
# What this does:
# 1️⃣ Takes the input numbers (tensor) we prepared
# 2️⃣ Looks at the numbers
# 3️⃣ Predicts the next token (number) that usually comes after
# 4️⃣ Adds that number to the sequence
# 5️⃣ Repeats until max_new_tokens is reached (here 50 tokens) or the model emits an end-of-sequence token
# IMPORTANT:
# - Model NEVER sees words
# - Model ONLY guesses numbers (tokens)
# - It does this ONE TOKEN AT A TIME

# ------------------------------------------------------------
# STEP 8: NUMBERS → TOKENS → HUMAN TEXT
# ------------------------------------------------------------
# Now we need to convert ALL the numbers back to readable text
# tokenizer.decode() does that for us
final_text = tokenizer.decode(output[0], skip_special_tokens=True)

# skip_special_tokens=True removes any <|special|> markers the model uses internally

print("\nFINAL OUTPUT (WHAT HUMANS READ):")
print(final_text)

# ============================================================
# FINAL SIMPLE TRUTH (REMEMBER ONLY THIS):
#
# 1. Model NEVER sees words
# 2. Model ONLY sees numbers
# 3. Tokens are small pieces of text
# 4. Embeddings are numbers that represent meaning
# 5. Model predicts next token again and again
#
# THAT'S IT. THAT IS AN LLM.
# ============================================================



RAW INPUT TEXT:
Write a short apology email to Sarah about a gardening mistake.<|assistant|>

TOKEN NUMBERS (AI ONLY UNDERSTANDS THIS):
tensor([[14350,   263,  3273,  3095,  3002,  4876,   304, 19235,  1048,   263,
         16423,   292, 10171, 29889, 32001]])

TOKENS (TEXT BROKEN INTO SMALL PIECES):
['▁Write', '▁a', '▁short', '▁ap', 'ology', '▁email', '▁to', '▁Sarah', '▁about', '▁a', '▁garden', 'ing', '▁mistake', '.', '<|assistant|>']

FINAL OUTPUT (WHAT HUMANS READ):
Write a short apology email to Sarah about a gardening mistake. Subject: My Sincere Apologies for the Gardening Oversight


Dear Sarah,


I hope this message finds you well. I am writing to express my deepest apologies for the recent incident in your
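
The "guess the next token again and again" loop from STEP 7 can be sketched without any model. Everything below is a toy under stated assumptions: `toy_logits` is a hypothetical score table keyed only by the last token, whereas a real LLM computes its scores from the whole sequence, and always picking the most probable token (greedy decoding) is just one simple strategy among those `model.generate` supports.

```python
import math

# Hypothetical "model": for each current token, raw scores (logits) for
# possible next tokens. A real LLM computes these scores from its weights
# and from the whole preceding sequence, not just the last token.
toy_logits = {
    "I": {" am": 2.0, " was": 1.0},
    " am": {" sorry": 1.5, " happy": 1.2},
    " sorry": {" for": 2.5, "<eos>": 0.5},
    " for": {"<eos>": 3.0},
}


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores.values())
    exps = {token: math.exp(score - largest) for token, score in scores.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


def generate(prompt_tokens, max_new_tokens=50):
    """Greedy decoding: always append the most probable next token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probabilities = softmax(toy_logits[tokens[-1]])
        next_token = max(probabilities, key=probabilities.get)
        if next_token == "<eos>":  # the model signals that it is done
            break
        tokens.append(next_token)
    return tokens


print("".join(generate(["I"])))
```

The loop stops for one of two reasons: the token budget runs out, or the end-of-sequence token wins the prediction.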


In [8]:
# run the cell where you define your prompt and tokenize it:
prompt = "Hello, how are you?"

# Tokenize the input and move to model device
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Save the token numbers
input_ids = inputs["input_ids"]

In [10]:
# run your decode loop

# Print each token separately
for token_id in input_ids[0]:  # `token_id` avoids shadowing Python's built-in id()
    print(tokenizer.decode(token_id))
# This prints (each token is on a separate line)

Hello
,
how
are
you
?


In [14]:
# Show tokens as text pieces
for t in tokens:
    print(t)

▁Write
▁a
▁short
▁ap
ology
▁email
▁to
▁Sarah
▁about
▁a
▁garden
ing
▁mistake
.
<|assistant|>


In [15]:
# show token numbers
print("\nTOKEN NUMBERS (what the model sees):")
print(input_ids)


TOKEN NUMBERS (what the model sees):
tensor([[15043, 29892,   920,   526,   366, 29973]])


In [18]:
# You can decode multiple IDs at once:
print(tokenizer.decode(3323))
print(tokenizer.decode(622))
print(tokenizer.decode([3323, 622]))
print(tokenizer.decode(29901))


Sub
ject
Subject
:


In [19]:
# ============================================================
# INSPECTING TOKENS AND TOKEN IDs
# ============================================================

# 1️⃣ Tokenizer.decode() lets us convert token IDs back into text humans can read
#    Example: for id in input_ids[0]: print(tokenizer.decode(id))
#    This prints each token on a separate line, e.g.:
#    <s>
#    Write
#    an
#    email
#    apolog
#    izing
#    to
#    Sarah
#    for
#    the
#    trag
#    ic
#    garden
#    ing
#    m
#    ish
#    ap
#    .
#    Explain
#    how
#    it
#    happened
#    .
#    <|assistant|>

# 2️⃣ How input tokens are broken down:
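#    - First token <s> = special token marking start of text
#      (added by some tokenizer versions; the printed outputs in this notebook do not include it)
#    - Some tokens = full words (Write, an, email)
#    - Some tokens = partial words (apolog + izing, trag + ic)
#    - Punctuation = its own token (., ,)
#    - Spaces are not separate tokens; a token that starts a new word carries
#      a hidden "▁" marker (e.g. ▁garden) that stands for the space before it
#    - Tokens without the marker (e.g. ing) attach directly to the previous token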

# 3️⃣ On the output side, the model generates token IDs as well
#    - Example: tensor([[1, 14350, 385, 4876, ..., 3323, 622, 29901]])
#    - Tokens 3323 ('Sub') + 622 ('ject') form 'Subject'
#    - Token 29901 = ':'
#    - To read output as text, use tokenizer.decode() on individual IDs or lists

# 4️⃣ Examples:
#    print(tokenizer.decode(3323))         # Sub
#    print(tokenizer.decode(622))          # ject
#    print(tokenizer.decode([3323, 622]))  # Subject
#    print(tokenizer.decode(29901))        # :
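
To see how token pieces turn back into readable text, here is a minimal sketch that uses the token list printed earlier in this notebook (with the `<|assistant|>` special token left out). The helper `join_tokens` is hypothetical and only mimics, in a simplified way, what `tokenizer.decode()` does for this tokenizer's "▁" word-start marker.

```python
# Tokens copied from the notebook output (special token left out).
tokens = ['▁Write', '▁a', '▁short', '▁ap', 'ology', '▁email', '▁to',
          '▁Sarah', '▁about', '▁a', '▁garden', 'ing', '▁mistake', '.']

WORD_START = "\u2581"  # the "▁" marker this tokenizer puts on word-initial tokens


def join_tokens(pieces):
    """Glue tokens together, turning each word-start marker into a space."""
    return "".join(pieces).replace(WORD_START, " ").lstrip()


print(join_tokens(tokens))
```

Pieces such as `ology` and `ing` carry no marker, so they attach directly to the token before them.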

In [21]:
# ============================================================
# HOW TOKENIZERS BREAK DOWN TEXT
# ============================================================

# 1️⃣ What a tokenizer does:
# - Converts human text → numbers (token IDs) that a model can understand
# - Works both on input (text → tokens → numbers) and output (numbers → tokens → text)

# 2️⃣ How tokenizers decide to split text:
# - Depends on 3 main factors:
#   1. Tokenization method chosen by model creators (e.g., BPE for GPT, WordPiece for BERT)
#   2. Design choices: vocabulary size, special tokens (like <s> or <|assistant|>)
#   3. Training dataset: a tokenizer trained on English text differs from one trained on code or multilingual text

# 3️⃣ Types of tokenization:
# - Word-level tokens:
#   * Each token = full word
#   * Hard to handle new words
#   * Big vocabulary, lots of similar words (apology, apologize, apologetic)
# - Subword tokens (most common for modern LLMs):
#   * Tokens can be full words or parts of words
#   * Can represent new words by combining subwords
#   * More expressive, more efficient than character tokens
# - Character tokens:
#   * Each token = single character
#   * Can handle any new word
#   * Less efficient, uses more tokens for same text, harder for model to learn sequences
# - Byte tokens:
#   * Each token = single byte of a character
#   * Useful for multilingual or rare characters
#   * Some subword tokenizers include bytes as fallback for unknown characters

# 4️⃣ Key advantages of subword tokenization:
# - Fits more text into model context (roughly 3x more text than character tokens)
# - Handles new words by breaking them into familiar pieces
# - Balances vocabulary size vs flexibility

# ✅ Takeaway:
# - Tokenizers are essential for turning text into numbers the model understands
# - Subword tokenization is the standard for LLMs because it's flexible, efficient, and handles new words well
# - Other methods (word, char, byte) exist and are useful in specific cases
# ============================================================
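
The trade-off between the token types can be made concrete by counting tokens for the same text. This sketch covers only the word, character, and byte schemes, which need no trained vocabulary; the sample text is an assumption chosen to include a non-ASCII character.

```python
text = "apologizing \U0001F3B5"  # the word plus a musical-note emoji

word_tokens = text.split()                # word-level: split on whitespace
char_tokens = list(text)                  # character-level: one token per character
byte_tokens = list(text.encode("utf-8"))  # byte-level: one token per UTF-8 byte

print(len(word_tokens), word_tokens)
print(len(char_tokens), char_tokens)
print(len(byte_tokens), byte_tokens)
```

The emoji is a single character but four bytes, which is why byte-level schemes can represent anything at the cost of longer sequences.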


In [22]:
sentence = "Write an email apologizing to Sarah for the tragic gardening mishap."

input_ids = tokenizer(sentence).input_ids

# Print tokens normally
for token_id in input_ids:  # `token_id` avoids shadowing Python's built-in id()
    print(tokenizer.decode(token_id))

Write
an
email
apolog
izing
to
Sarah
for
the
trag
ic
garden
ing
m
ish
ap
.


In [24]:
# ============================================================
# COMPARING TRAINED LLM TOKENIZERS
# ============================================================

# 1️⃣ What influences a tokenizer?
# - 3 main factors determine how text is split into tokens:
#    1. Tokenization method (BPE, WordPiece, etc.)
#    2. Tokenizer parameters and special tokens (like <s> for start, <|endoftext|>)
#    3. Dataset used to train the tokenizer (English, code, multilingual, etc.)
# - These choices affect how the model sees and processes text.

# 2️⃣ Why compare tokenizers?
# - To see how different tokenizers handle the same text.
# - Newer tokenizers often improve performance.
# - Specialized models (like code generation) sometimes need specialized tokenizers.

# 3️⃣ Example text to test tokenizers:
# text = """
# English and CAPITALIZATION
# 🎵鸟
# show_tokens False None elif == >= else: two tabs:" " Three tabs: " "
# 12.0*50=600
# """
# - Contains:
#    * Capital letters
#    * Non-English characters (like Chinese or emojis)
#    * Programming code and keywords
#    * Whitespaces and indentation
#    * Numbers
#    * Special tokens (start/end markers)

# 4️⃣ What the show_tokens function does:
# - Loads a tokenizer by name
# - Converts text into token IDs
# - Decodes each token ID back into text
# - Prints tokens with colors to visualize how text is broken down

# 5️⃣ Key Learning Points:
# - Different tokenizers split the same text differently
# - Special tokens help the model understand structure (start, end, etc.)
# - Tokenizers handle text, code, numbers, emojis, and multiple languages differently
# - Visualizing tokens (even with colors) helps understand how a model “sees” text

# ✅ Takeaway:
# - Comparing tokenizers teaches you about model behavior
# - Helps in choosing the right tokenizer for your task (general language, code, or multilingual)
# ============================================================
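
A possible shape for the `show_tokens` helper described in point 4 is sketched below. Two things differ from that description and are assumptions: the function takes an already loaded tokenizer object instead of a name, and `ToyTokenizer` is a hypothetical stand-in so the sketch runs without downloading anything. With the real library one would pass a tokenizer obtained from `AutoTokenizer.from_pretrained(...)`, which offers the same two calls used here.

```python
import re
from types import SimpleNamespace

# ANSI background colors (R;G;B), cycled so neighboring tokens are distinguishable.
COLORS = ["102;194;165", "252;141;98", "141;160;203", "231;138;195"]


def show_tokens(sentence, tokenizer):
    """Print each token of `sentence` on a colored background; return the tokens."""
    token_ids = tokenizer(sentence).input_ids
    pieces = [tokenizer.decode(token_id) for token_id in token_ids]
    for index, piece in enumerate(pieces):
        color = COLORS[index % len(COLORS)]
        print(f"\x1b[0;30;48;2;{color}m{piece}\x1b[0m", end=" ")
    print()
    return pieces


class ToyTokenizer:
    """Hypothetical stand-in: splits into words and punctuation, no subwords."""

    def __init__(self):
        self.pieces = []

    def __call__(self, text):
        start = len(self.pieces)
        self.pieces.extend(re.findall(r"\w+|[^\w\s]", text))
        return SimpleNamespace(input_ids=list(range(start, len(self.pieces))))

    def decode(self, token_id):
        return self.pieces[token_id]


pieces = show_tokens("show_tokens False None elif == >= else:", ToyTokenizer())
```

Running the same sentence through several real tokenizers and comparing the colored output is the comparison this cell describes.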

In [26]:
# ============================================================
# COMPARING LLM TOKENIZERS
# ============================================================

# 1️⃣ BERT (2018)
# - Tokenization: WordPiece
# - Vocabulary: ~30k
# - Special tokens: [CLS], [SEP], [PAD], [MASK], [UNK]
# - Cased vs Uncased:
#    • Uncased = lowercase, removes newlines, unknown chars become [UNK]
#    • Cased = preserves capitalization
# - Subwords: e.g., "capitalization" → "capital ##ization" (uncased)

# 2️⃣ GPT-2 (2019)
# - Tokenization: Byte Pair Encoding (BPE)
# - Vocabulary: ~50k
# - Special tokens: <|endoftext|>
# - Preserves newlines, capitalization, emojis, spaces
# - Words split into smaller subwords

# 3️⃣ Flan-T5 (2022)
# - Tokenization: SentencePiece
# - Vocabulary: ~32k
# - Special tokens: <unk>, <pad>, </s>
# - Ignores newlines/whitespaces, replaces unknown chars with <unk>
# - Not ideal for code or formatting-sensitive tasks

# 4️⃣ GPT-4 (2023)
# - Tokenization: BPE
# - Vocabulary: ~100k
# - Special tokens: <|endoftext|>, <|fim_prefix|>, <|fim_middle|>, <|fim_suffix|>
# - Focuses on code and text, handles whitespace sequences as single tokens
# - Reduces number of tokens per word (more efficient)

# 5️⃣ StarCoder2 (2024)
# - Tokenization: BPE, code-focused
# - Vocabulary: ~49k
# - Special tokens for code repo context: <filename>, <reponame>, <gh_stars>
# - Encodes numbers digit-by-digit (600 → 6 0 0)
# - Encodes whitespace like GPT-4

# 6️⃣ Galactica
# - Tokenization: BPE, science-focused
# - Vocabulary: ~50k
# - Special tokens for citations, reasoning, DNA/amino acids
# - Encodes whitespace/tabs as single tokens
# - Code & math-friendly

# 7️⃣ Phi-3 (2024) / Llama 2 (2023)
# - Tokenization: BPE (Phi-3 reuses the Llama 2 tokenizer)
# - Vocabulary: ~32k
# - Special tokens for chat (added by Phi-3): <|user|>, <|assistant|>, <|system|>
# - Encodes whitespace/tab sequences efficiently
# - Code & math-friendly

# ============================================================
# KEY TAKEAWAYS:
# - Tokenizers vary based on purpose: text, code, science, or chat
# - Methods: WordPiece, BPE, SentencePiece
# - Special tokens help models understand structure (start/end, roles, citations)
# - Handling of capitalization, emojis, whitespace, numbers differs per model
# - Code-focused models optimize whitespace and numbers for programming tasks
# ============================================================
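
The BERT example "capital ##ization" uses the WordPiece convention that a leading `##` means "continues the previous token". A minimal sketch of reading such output back into words follows; `merge_wordpiece` is a hypothetical helper written for illustration, not a library function.

```python
def merge_wordpiece(pieces):
    """Rebuild words from WordPiece tokens, where '##' marks a continuation."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]  # glue onto the previous word
        else:
            words.append(piece)     # start a new word
    return " ".join(words)


print(merge_wordpiece(["english", "and", "capital", "##ization"]))
```

Note the contrast with Phi-3 and Llama 2 style tokens, where the marker sits on the token that starts a word rather than on the continuation.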

In [27]:
# ============================================================
# “Efficient Training of Language Models to Fill in the Middle”
# (a research paper about teaching LLMs to fill in missing text;
#  related to the <|fim_*|> special tokens listed for GPT-4 above)
# ============================================================

# 1️⃣ What the paper is about:
# - Traditional autoregressive language models (like GPT‑style models)
#   generate text from left to right only.
# - They are *not naturally good* at generating text when you give both
#   a beginning and an ending and want them to fill in the missing middle.
# - This paper shows a simple way to train such models to also learn how
#   to “fill in the middle” (FIM) of a document.

# 2️⃣ How the paper does it:
# - They take a text document and randomly split it into three parts:
#     prefix (start), middle, suffix (end).
# - Then they rearrange it so that the middle comes last:
#     (prefix, suffix, middle).
# - They train the model on many examples like this mixed with normal
#   left‑to‑right examples. This teaches the model both tasks at once.

# 3️⃣ Why this works:
# - The model learns to generate the missing “middle” text given
#   both the beginning and the ending.
# - This does *not* hurt the model’s ability to generate left‑to‑right text.
#   They find it still keeps its original capabilities.

# 4️⃣ Important findings:
# ✅ The model can fill in missing text in the middle after training.
# ✅ Training on a mix of normal and FIM data (“FIM‑for‑free”) does not
#   reduce its original performance left‑to‑right.
# ✅ They test many variations of how often and how to split text, and
#   provide recommendations for best practice.

# 5️⃣ Practical impact:
# - Models trained this way are useful for tasks like:
#   • Completing partial code (e.g., in coding assistants)
#   • Filling in document gaps
#   • Generating missing content that logically fits between given text
# - This approach can be used in future LLM training by default.

# 6️⃣ Why this matters:
# - Left‑to‑right text generation is powerful but limited.
# - Infilling (FIM) capability makes models better for editing,
#   rewriting, and completion tasks where context on both sides matters.
# - It teaches a model to consider *what comes before AND after* the missing part.

# ============================================================
# SIMPLE ONE‑LINE TRUTH:
# ============================================================
# This paper shows a simple training trick that teaches language models
# to fill in missing text between a given beginning and ending without
# losing their original generative abilities.
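
The data transformation from point 2 (split into prefix, middle, suffix, then move the middle to the end) can be sketched in a few lines. The sentinel strings are assumptions borrowed from the `<|fim_*|>` token names listed for GPT-4 earlier in this notebook, and the character-level random split is a simplification; the helper name `to_fim_example` is hypothetical.

```python
import random

# Sentinel strings are assumptions for illustration; they follow the
# <|fim_*|> token names listed for GPT-4 earlier in this notebook.
PREFIX, SUFFIX, MIDDLE = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"


def to_fim_example(document, rng):
    """Split a document at two random points and move the middle to the end."""
    first, second = sorted(rng.sample(range(len(document) + 1), 2))
    prefix = document[:first]
    middle = document[first:second]
    suffix = document[second:]
    return f"{PREFIX}{prefix}{SUFFIX}{suffix}{MIDDLE}{middle}", (prefix, middle, suffix)


rng = random.Random(0)  # fixed seed so the split is reproducible
document = "def add(a, b):\n    return a + b\n"
example, (prefix, middle, suffix) = to_fim_example(document, rng)
print(example)
```

Because the middle now comes last, an ordinary left-to-right model can learn to produce it after having seen both the prefix and the suffix.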


In [29]:
# ============================================================
# TOKENIZER PROPERTIES
# ============================================================

# 1️⃣ What determines how a tokenizer breaks down text?
# There are three main factors:
#  1. Tokenization method
#  2. Tokenizer parameters
#  3. Domain of the data the tokenizer targets

# ============================================================
# 2️⃣ Tokenization Methods
# ============================================================
# - Tokenization is the process of splitting text into "tokens"
#   that the model can understand.
# - Popular methods include:
#   • Byte Pair Encoding (BPE)  → widely used by GPT models
#   • WordPiece  → used by BERT
#   • SentencePiece (a library that implements BPE and unigram tokenization) → used by T5/Flan-T5
# - Each method has its own algorithm to decide which pieces of
#   text should be separate tokens.
# - The choice affects how efficiently the model can represent text.

# ============================================================
# 3️⃣ Tokenizer Parameters
# ============================================================
# After choosing a tokenization method, the model designer must
# decide on several parameters:

# (a) Vocabulary size
# - How many tokens the tokenizer can recognize.
# - Common sizes: 30K, 50K, sometimes up to 100K tokens.
# - Larger vocab can capture more words directly, but makes the embedding table bigger (more parameters and memory).

# (b) Special tokens
# Special tokens are not regular words. Instead, they have a specific function
# to help the model understand the structure of the text or perform a task.
# They are essential for language models to handle tasks like classification,
# text generation, masking, or padding input sequences.

                    # Common Special Tokens:
                    # ----------------------
                    # 1. <s>  → Beginning of text
                    #    - Marks the start of the input.
                    #    - Helps the model know where the text begins.
                    #    - Often used in models that generate text step by step.
                    #
                    # 2. </s> → End of text
                    #    - Marks the end of the input.
                    #    - Tells the model: "stop generating here" for text generation tasks.
                    #
                    # 3. <pad> → Padding token
                    #    - Used to fill sequences to a fixed length.
                    #    - Needed because all sequences in one batch must have the same length.
                    #    - Example: if max length = 10 and input has 7 tokens, add 3 <pad> tokens.
                    #
                    # 4. <unk> → Unknown token
                    #    - Represents text the tokenizer cannot map to any token in its vocabulary.
                    #    - Example: a subword tokenizer splits an unseen word like "quantumleap" into
                    #      known pieces; only characters it cannot represent at all become <unk>.
                    #
                    # 5. [CLS] → Classification token (used in BERT)
                    #    - Stands for "classification".
                    #    - Placed at the beginning of the input.
                    #    - The model uses the output of this token for tasks like sentiment analysis
                    #      or any classification problem.
                    #
                    # 6. [MASK] → Masking token (used in BERT pretraining)
                    #    - Replaces a token during training for the model to predict it.
                    #    - Helps the model learn context and fill in missing words.
                    #    - Example: "I love [MASK] pizza" → the model predicts "eating" or "cheese".
                    #
                    # Custom Special Tokens:
                    # ----------------------
                    # LLM designers can add special tokens for domain-specific tasks.
                    # Examples:
                    # - Galactica (a scientific LLM) uses:
                    #   • <work>      → Signals step-by-step reasoning in a scientific explanation.
                    #   • [START_REF] → Marks the start of a reference or citation.
                    # - Code models may add tokens for:
                    #   • <filename>, <reponame> → To differentiate code from multiple files.
                    #   • Indentation tokens → To better handle Python-style code formatting.
                    #
                    # Why special tokens matter:
                    # --------------------------
                    # - They guide the model to understand structure and tasks.
                    # - Without them, models would treat everything as normal words,
                    #   making tasks like classification, code generation, or reasoning harder.
                    # - They also make training more efficient by giving explicit cues to the model.

# (c) Capitalization handling
# - Decide how to treat uppercase vs lowercase letters.
# - Options:
#   • Convert everything to lowercase (uncased BERT)
#   • Keep capitalization (cased BERT)
# - Capitalization can carry important info (like names) but increases vocabulary usage.

# ============================================================
# 4️⃣ Summary
# ============================================================
# - A tokenizer’s behavior depends on its method, parameters, and training data.
# - Proper choices affect:
#   • Vocabulary efficiency
#   • Ability to handle special text (like code, citations, emojis)
#   • Model performance on specific domains
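
The padding token is easiest to understand with a small batch. In this sketch `PAD_ID = 0` is a hypothetical value (each real tokenizer defines its own), `pad_batch` is an illustrative helper, and the two sequences reuse token IDs printed earlier in this notebook.

```python
PAD_ID = 0  # hypothetical ID of the <pad> token; real tokenizers define their own


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build attention masks."""
    longest = max(len(sequence) for sequence in sequences)
    padded, masks = [], []
    for sequence in sequences:
        missing = longest - len(sequence)
        padded.append(list(sequence) + [pad_id] * missing)
        masks.append([1] * len(sequence) + [0] * missing)  # 1 = real token, 0 = padding
    return padded, masks


batch = [[15043, 29892, 920, 526, 366, 29973], [14350, 263, 3273]]
padded, masks = pad_batch(batch)
print(padded)
print(masks)
```

The mask tells the model which positions hold real tokens, so the padding does not influence the result.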

In [31]:
# Domain of the Data and Its Impact on Tokenizers
# -----------------------------------------------
# The dataset a tokenizer is trained on strongly affects how it splits text into tokens.
# Even with the same tokenization method and parameters, tokenizers behave differently
# depending on the data domain (e.g., natural language, code, or multilingual text).

# Key Points:
# 1. Tokenizers optimize their vocabulary based on the training dataset.
# 2. Text-focused tokenizers may handle code poorly:
#    - Example: indentation spaces in Python may be split into multiple tokens.
#    - This can make it harder for the model to understand code structure.
# 3. Code-focused tokenizers handle code differently:
#    - They preserve indentation and important code patterns as single tokens.
#    - This makes the model’s job easier and improves performance on code tasks.
# 4. Domain-specific tokenization can also help with:
#    - Multilingual datasets
#    - Scientific or structured data
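
The indentation point above can be sketched with a toy greedy longest-match tokenizer. The two vocabularies and the `greedy_tokenize` helper are hypothetical and far smaller than anything a real tokenizer learns; the only difference between them is that the "code" vocabulary contains a four-space token.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match split; unknown characters become single-character tokens."""
    max_len = max(len(piece) for piece in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabularies: the code-focused one adds a four-space indentation token.
text_vocab = {"return", "x", " "}
code_vocab = {"return", "x", " ", "    "}

line = "    return x"
text_tokens = greedy_tokenize(line, text_vocab)
code_tokens = greedy_tokenize(line, code_vocab)

print(len(text_tokens), text_tokens)   # indentation spread over several tokens
print(len(code_tokens), code_tokens)   # indentation kept as one token
```

With the text-focused vocabulary the model has to count individual space tokens to recover the indentation level, while the code-focused vocabulary hands it over as a single unit.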

##### Resources for deeper learning:
# - Hugging Face course: Tokenizers section
# - Book: Natural Language Processing with Transformers, Revised Edition

#### Contextualized Word Embeddings

In [34]:
# =========================================================
# 🌟 Understanding Word Vectors and Embeddings
# =========================================================

# Computers cannot understand English words directly. They only understand numbers.
# So, we convert words into "vectors" (lists of numbers) called embeddings.

# Example of a word vector (static embedding):
# The word "cat" might be represented as:
cat_vector = [0.2, 0.5, -0.1, 0.9]
dog_vector = [0.3, 0.4, -0.2, 1.0]

# These numbers are like coordinates in a multi-dimensional space.
# Words with similar meanings are "close" in this space.

# --------------------------
# 🔹 Static Embeddings
# --------------------------
# A static embedding always gives the same numbers for a word,
# no matter where it appears in a sentence.

# Example: word "bank"
bank_static = [0.5, 0.1, -0.3]
# In "I went to the bank to deposit money" -> bank_static
# In "The river bank was steep" -> bank_static
# ❌ Problem: It can't understand the difference between "money bank" and "river bank"

# --------------------------
# 🔹 Contextual Embeddings
# --------------------------
# Contextual embeddings change depending on the sentence.
# This lets them tell apart different senses of the same word, which static embeddings cannot do.

# Example: word "bank"
bank_money_context = [0.9, 0.2, -0.1]   # "money bank"
bank_river_context = [-0.3, 0.8, 0.1]   # "river bank"
# ✅ Now the model knows which meaning of "bank" to use based on context

# --------------------------
# 🔹 Analogy
# --------------------------
# Static embedding = your home address (never changes)
# Contextual embedding = your mood (changes depending on situation)

# --------------------------
# 🌟 Summary
# --------------------------
# 1. Words are converted into vectors (numbers) for computers to understand.
# 2. Static embeddings = one fixed vector per word, context ignored.
# 3. Contextual embeddings = vector changes based on sentence context.
# 4. Contextual embeddings are used in modern language models (BERT, GPT, etc.).
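
One common way to measure how "close" two vectors are is cosine similarity. The sketch below applies it to the made-up vectors from this cell; the `cosine_similarity` helper is written here for illustration and the numbers carry no real linguistic meaning.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same illustrative (made-up) vectors as in the cell above.
cat_vector = [0.2, 0.5, -0.1, 0.9]
dog_vector = [0.3, 0.4, -0.2, 1.0]
bank_static = [0.5, 0.1, -0.3]
bank_money_context = [0.9, 0.2, -0.1]
bank_river_context = [-0.3, 0.8, 0.1]

# Similar words point in similar directions.
sim_cat_dog = cosine_similarity(cat_vector, dog_vector)

# A static embedding is identical in both sentences, so it cannot separate the two senses.
sim_bank_static = cosine_similarity(bank_static, bank_static)

# Contextual embeddings for the two senses of "bank" point in different directions.
sim_bank_senses = cosine_similarity(bank_money_context, bank_river_context)

print(round(sim_cat_dog, 3), round(sim_bank_static, 3), round(sim_bank_senses, 3))
```

A value near 1 for "cat" and "dog" matches the idea that related words sit close together, while the much lower value for the two contextual "bank" vectors shows how context pulls the senses apart.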

In [35]:
# ================================================

# 1. Token embeddings (from before) are static vectors for words.
# 2. Contextualized embeddings change depending on the word's sentence.
#    Example: "bank" in "river bank" vs "bank account" → different vectors!
# 3. These embeddings help AI understand meaning better and power tasks like:
#    - Named-entity recognition (finding names, places, etc.)
#    - Summarization (highlighting important text)
#    - AI image generation (e.g., DALL·E)

# -----------------------------------------------
# Example with DeBERTa: contextualized word embeddings
# -----------------------------------------------

# First, let's import the tools we need from Hugging Face
from transformers import AutoModel, AutoTokenizer

# -------------------------------
# Step 1: Load a tokenizer
# -------------------------------
# A tokenizer splits text into "tokens" (like words or pieces of words)
# This is like chopping a sentence into small building blocks that a computer can understand
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")

# -------------------------------
# Step 2: Load a language model
# -------------------------------
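# The language model (DeBERTa here) turns token IDs into vectors
# These numbers represent the meaning of words in a way the computer can work with
# Note: this checkpoint is not the one the tokenizer was loaded from ("microsoft/deberta-base"),
# and the two use different vocabularies. The code runs, but for meaningful embeddings the
# tokenizer and the model should normally come from the same checkpoint.
model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")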

# -------------------------------
# Step 3: Tokenize a sentence
# -------------------------------
sentence = "Hello world"
# The tokenizer converts the sentence into token IDs (numbers for each word piece)
tokens = tokenizer(sentence, return_tensors='pt')

# -------------------------------
# Step 4: Generate embeddings
# -------------------------------
# The model processes these tokens and outputs vectors (arrays of numbers)
# Each word/token gets its own vector
# Vectors are like a list of numbers that encode the "meaning" of the word
# Example: "Hello" -> [0.12, -0.45, 0.88, ...] (imagine hundreds of numbers)
output = model(**tokens)[0]

# -------------------------------
# Step 5: Inspect the output
# -------------------------------
# Let's see the shape (dimensions) of the output
# It is usually [batch_size, number_of_tokens, embedding_size]
# batch_size = number of sentences processed at once
# number_of_tokens = how many tokens in the sentence
# embedding_size = length of the vector representing each token
print(output.shape)

# -------------------------------
# WHAT IS HAPPENING INSIDE:
# -------------------------------
# 1. Each token is converted to a vector (list of numbers)
# 2. These vectors are contextualized, meaning:
#    - The vector for "world" in "Hello world" will be different from "world" in "Hello cruel world"
#    - The model knows the surrounding words and changes the meaning slightly
# 3. Computers use these vectors to "understand" text for tasks like:
#    - Summarizing text
#    - Answering questions
#    - Generating new text
#    - Even helping AI make images from text prompts (like DALL·E)
# -------------------------------

# -------------------------------
# SUMMARY:
# -------------------------------
# Tokens -> Token IDs (numbers) -> Vectors (embeddings)
# Vectors = numbers that capture the meaning of words
# Contextualized embeddings = vectors that change depending on surrounding words
# These embeddings are the "magic numbers" that make AI understand text better


torch.Size([1, 4, 384])
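
To see why the vector for "world" depends on its neighbours, here is a deliberately simplified sketch of attention-style mixing. It is not DeBERTa's actual computation (a real model uses learned projections, many layers, and position information); the static vectors and the `contextualize` helper are invented for illustration.

```python
import math

# Hypothetical static embeddings (made-up numbers, not taken from DeBERTa).
static = {
    "hello": [1.0, 0.0, 0.5],
    "cruel": [-0.5, 1.0, 0.0],
    "world": [0.2, 0.3, 1.0],
}


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextualize(words):
    """Replace each word's static vector with a weighted mix of all vectors in the sentence."""
    vectors = [static[w] for w in words]
    mixed = []
    for query in vectors:
        # Dot product = how strongly this word "attends" to each word in the sentence.
        scores = [sum(q * k for q, k in zip(query, key)) for key in vectors]
        weights = softmax(scores)
        mixed.append([
            sum(w * vec[d] for w, vec in zip(weights, vectors))
            for d in range(len(query))
        ])
    return mixed


world_a = contextualize(["hello", "world"])[-1]
world_b = contextualize(["hello", "cruel", "world"])[-1]

print("static world:           ", static["world"])
print("world in 'hello world': ", [round(x, 3) for x in world_a])
print("world in 'hello cruel world':", [round(x, 3) for x in world_b])
```

The static entry for "world" never changes, yet the two mixed vectors differ because "cruel" contributes to the second one. That is the basic mechanism behind contextualized embeddings.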


In [36]:
# =========================================================
# 🌟 Understanding LM Output Vectors
# =========================================================

# After processing a sentence like "Hello world", the model output has shape:
# torch.Size([1, 4, 384])

# --------------------------
# 🔹 Breaking down the dimensions
# --------------------------
# First dimension (1) → batch size
#   - This is used when processing multiple sentences at once.
# Second dimension (4) → number of tokens
#   - Our sentence became 4 tokens:
#     1. [CLS]  → special start token
#     2. Hello
#     3. world
#     4. [SEP]  → special end token
# Third dimension (384) → embedding size
#   - Each token is represented by a vector of 384 numbers

# --------------------------
# 🔹 Inspecting tokens
# --------------------------
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))
# Output:
# [CLS]
# Hello
#  world   (the leading space is part of the token)
# [SEP]

# ✅ The tokenizer adds [CLS] at the start and [SEP] at the end automatically.

# --------------------------
# 🔹 What the vectors look like
# --------------------------
# Each token is now represented as a 384-length vector.
# Example (illustrative; the exact values depend on the checkpoint and library versions):
print(output)
# tensor([[[-3.3060, -0.0507, -0.1098, ..., -0.1704, -0.1618, 0.6932],
#          [ 0.8918,  0.0740, -0.1583, ..., 0.1869,  1.4760, 0.0751],
#          [ 0.0871,  0.6364, -0.3050, ..., 0.4729, -0.1829, 1.0157],
#          [-3.1624, -0.1436, -0.0941, ..., -0.0290, -0.1265, 0.7954]]],
#        grad_fn=<NativeLayerNormBackward0>)

# --------------------------
# 🔹 Explanation
# --------------------------
# 1. Each row corresponds to one token ([CLS], Hello, world, [SEP])
# 2. Each row has 384 numbers → these are embeddings, the model's way of "understanding" the token.
# 3. These embeddings are the building blocks for all LLM tasks like:
#    - Text classification
#    - Summarization
#    - Question answering
# 4. Inside the model, the first step is looking up a static embedding for each token ID;
#    the layers that follow mix in context. The output printed above comes from the last layer.

# 🌟 Key takeaway:
# A language model converts each token into a vector (list of numbers) that captures its meaning in context.
# The vectors printed here are the contextualized output of the final layer,
# not the raw, static embeddings that the model looks up at its input.


[CLS]
Hello
 world
[SEP]
tensor([[[-3.4816,  0.0861, -0.1819,  ..., -0.0612, -0.3911,  0.3017],
         [ 0.1898,  0.3208, -0.2315,  ...,  0.3714,  0.2478,  0.8048],
         [ 0.2071,  0.5036, -0.0485,  ...,  1.2175, -0.2292,  0.8582],
         [-3.4278,  0.0645, -0.1427,  ...,  0.0658, -0.4367,  0.3834]]],
       grad_fn=<NativeLayerNormBackward0>)
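
The `[batch_size, number_of_tokens, embedding_size]` layout and the ID-to-vector lookup described above can be sketched without any ML library. Everything here is a stand-in: the token IDs are hypothetical, the table holds random numbers, and the embedding size is 8 instead of the 384 seen in the real run.

```python
import random

random.seed(0)  # reproducible made-up numbers

# Hypothetical token IDs; a real tokenizer assigns its own.
vocab = {"[CLS]": 0, "Hello": 1, " world": 2, "[SEP]": 3}
embedding_size = 8  # kept small for readability

# Static embedding table: one row of numbers per token ID.
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_size)]
    for _ in vocab
]

# Tokens -> token IDs -> rows of the table.
token_ids = [vocab[t] for t in ["[CLS]", "Hello", " world", "[SEP]"]]
batch = [[embedding_table[i] for i in token_ids]]  # a batch holding one sentence

shape = (len(batch), len(batch[0]), len(batch[0][0]))
print(shape)  # (batch_size, number_of_tokens, embedding_size)
```

A plain lookup like this always returns the same row for the same ID, so it yields static vectors; the tensor printed by the model has the same layout but holds values that have already passed through the model's layers.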
