<a href="https://colab.research.google.com/github/vikashkodati/mygig/blob/main/Demystifyting_AI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align="center">


# Demystifying AI: A Hands-On Exploration of Language Models

## An Interactive Guide to Understanding How AI Works
### Pate Motter, PhD
AI Performance Engineer, Google


---

</div>

## About Pate
##### Current Role
  - Improving the performance of open source model inference on Google's TPU hardware. I work in and around code like Google's [MaxText](https://github.com/AI-Hypercomputer/maxtext), [JAX](https://jax.readthedocs.io/en/latest/), [Flax](https://flax.readthedocs.io/en/latest/), and [Pallas](https://jax.readthedocs.io/en/latest/pallas/index.html).

##### Background
  - PhD in computer science from CU Boulder (2017) focused on high-performance computing.
  - I've been working on AI code since 2018 and math on GPUs since 2011.

##### Interests
- Making code go fast.
- Gaining and sharing knowledge.

##### Experience
* Amazon Alexa (ASR and WBQA)
* AWS High-Performance Computing
* Lawrence Livermore National Lab

##### Websites
* [LinkedIn](https://www.linkedin.com/in/patemotter)
* [GitHub](https://github.com/patemotter)

---




## Getting Started
This notebook provides an interactive exploration of how AI language models work internally. To run it in Google Colab you will need:
- A Google account
- About 60 minutes to go through the material


NOTES:
1. This colab is designed to run in the free tier of Google Colab using its T4 GPU runtime. This code can be used on more powerful Colab hardware at a cost. I am not responsible for any charges you incur by doing this. You can also run all of this code locally on your own machine if desired.
2. You are free to take this notebook and do whatever you want to. Take it apart, change it, break it, fix it, have fun.

Follow the instructions below to run this Colab on a T4 GPU.


<details>
<summary>1. Click Runtime -> Change runtime type</summary>


![Screenshot](https://drive.google.com/uc?export=view&id=13tysKrMzwMkGRQo8qmll1-YvUeabQEh5)

</details>


<details>
<summary>2. Change selection to T4 GPU</summary>

![Screenshot](https://drive.google.com/uc?export=view&id=1gjPWs_DqB5cgAtW0mG_yEDuUWa6TVo_O)

</details>

<details>
<summary>3. Click Runtime -> Run all</summary>

![Screenshot](https://drive.google.com/uc?export=view&id=1q0X-Rtzt3KgOnGPiM_uyGbA_kSFj4mlG)

</details>

---

<details>
<summary>License Information</summary>

MIT License

Copyright (c) 2024 Pate Motter

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

</details>

## What You'll Learn in this Notebook
This interactive notebook will take you inside an AI language model to understand how it works. You'll learn:
- How models break down and understand text
- How models pay attention to relationships between words
- How models decide what words should come next

----

## The Phi-3 Model

We'll use Microsoft's Phi-3 Mini model (`microsoft/Phi-3-mini-4k-instruct`) for our exploration. It is a relatively small but capable language model, which makes it a good fit for learning AI fundamentals while still running on basic hardware.

### Model Architecture
- **Size**: 3.8 billion parameters
- **Layers**: 32 transformer layers
- **Hidden Size**: 3,072 dimensions (size of each token embedding)
- **Attention Heads**: 32 heads per layer
- **Context Window**: 4,096 tokens
- **Vocabulary Size**: 32,064 tokens
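As a rough sanity check on the 3.8 billion figure, the parameter count can be reproduced with simple arithmetic. The sketch below assumes values from the published Phi-3 Mini configuration (hidden size 3,072, feed-forward size 8,192, vocabulary of 32,064 entries, separate input and output embedding tables). Treat them as assumptions and inspect `model.config` after loading if you need the exact numbers.

```python
# Back-of-the-envelope parameter count for a Phi-3-Mini-sized transformer.
# All sizes are assumptions taken from the published model configuration.
hidden_size = 3072
num_layers = 32
feed_forward_size = 8192
vocab_size = 32064


def count_parameters(hidden, layers, feed_forward, vocab):
    attention = 4 * hidden * hidden    # query, key, value, and output projections
    mlp = 3 * hidden * feed_forward    # gate, up, and down projections
    norms = 2 * hidden                 # two normalization layers per block
    per_layer = attention + mlp + norms
    embeddings = 2 * vocab * hidden    # input embedding table + output projection
    final_norm = hidden
    return layers * per_layer + embeddings + final_norm


total = count_parameters(hidden_size, num_layers, feed_forward_size, vocab_size)
print(f"Estimated parameters: {total / 1e9:.2f} billion")
```

Almost all of the parameters sit in the 32 transformer layers; the embedding tables account for only about 5% of the total.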

### Key Features
- **Size**: Smaller than a lot of LLMs but still powerful enough for many tasks
- **Context Window**: Can process up to 4,096 tokens at once
- **Efficiency**: Designed to run on consumer hardware (like Colab's T4 GPUs)
- **Open Source**: Code and weights are publicly available for learning and research

### Why We're Using It
- **Educational**: Perfect size for understanding AI concepts
- **Quality**: Produces high-quality results despite smaller size
- **Accessibility**: Free and open for everyone to use and learn from

---

# Set up the environment

We're using several key libraries:
- `transformers`: The main library for models, tokenizers, and pipelines
- `torch`: Handles the mathematical operations
- `bitsandbytes`: Enables efficient 4-bit quantization

The model runs on a GPU to handle the complex matrix operations involved in text processing.
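To see why 4-bit quantization matters on a 16 GB T4, here is a minimal estimate of the memory needed just to store the weights at different precisions. It is only a sketch: the helper name `weight_memory_gb` is made up for this example, and real usage is higher because some layers stay in higher precision and activations need memory too.

```python
def weight_memory_gb(num_parameters, bits_per_parameter):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9


num_parameters = 3.8e9  # model size quoted above

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(num_parameters, bits):5.1f} GB")
```

At 4 bits per weight the model needs roughly an eighth of the float32 footprint, which leaves room on the GPU for the rest of the notebook.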

In [None]:
# @title Install our Python libraries via `pip`
!pip install -q transformers seaborn torch pandas scikit-learn pytorch_lightning 'accelerate>=0.27.2' bitsandbytes

In [None]:
# @title Setup common imports

import warnings
warnings.filterwarnings("ignore")

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
from IPython.display import display, HTML, clear_output, Markdown
import torch
import ipywidgets as widgets
from textwrap import fill

In [None]:
# @title Cleanup Function (expand to see code)
import gc
import torch

def clear_gpu_memory():
    print(f"GPU memory allocated before cleanup: {torch.cuda.memory_allocated()/1024**2:.2f} MB")
    print(f"GPU memory reserved before cleanup: {torch.cuda.memory_reserved()/1024**2:.2f} MB")
    # Clear all tensors from GPU
    gc.collect()

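    # Drop references to CUDA tensors found by the garbage collector.
    # Note: `del obj` only removes this loop's local name; the memory is
    # released once no other references to the tensor remain.
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                if obj.is_cuda:
                    del obj
            elif hasattr(obj, 'data') and torch.is_tensor(obj.data):
                if obj.data.is_cuda:
                    del obj
        except Exception:
            pass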

    # Clear CUDA cache
    torch.cuda.empty_cache()

    # Reset peaks and clear memory stats
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.reset_accumulated_memory_stats()
    print(f"GPU memory allocated after cleanup: {torch.cuda.memory_allocated()/1024**2:.2f} MB")
    print(f"GPU memory reserved after cleanup: {torch.cuda.memory_reserved()/1024**2:.2f} MB")

### Define our Model, Tokenizer, and Pipeline

Note: on the first run, downloading the model can take a few minutes.

In [None]:
# Save some space if needed
clear_gpu_memory()

# Quantization config for reducing model size
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# Load our Phi-3 model onto the GPU and use quantization
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    quantization_config=quantization_config,
    trust_remote_code=True,
)

# Define the tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Create a pipeline using our model, tokenizer, and additional options
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=128,
    do_sample=False,
    eos_token_id=tokenizer.eos_token_id,
    early_stopping=True,
    use_cache=True,
)

GPU memory allocated before cleanup: 2167.51 MB
GPU memory reserved before cleanup: 2578.00 MB
GPU memory allocated after cleanup: 2167.51 MB
GPU memory reserved after cleanup: 2360.00 MB



Device set to use cuda


---

# Demo: End-to-end example

In [None]:
prompt = "Is AI the key to unlocking the potential of humanity?"

messages = [
    {"role": "user", "content": prompt}
]
input_text = messages[0]["content"]

print(f"Input: {fill(input_text, width=120)}\n")

output = generator(messages)
generated_text = output[0]["generated_text"]
print(f"\nOutput: {fill(generated_text, width=120)}\n")

Input: Is AI the key to unlocking the potential of humanity?


Output:  The question of whether AI is the key to unlocking the potential of humanity is a complex and multifaceted one. AI has
the potential to greatly enhance human capabilities in various fields such as healthcare, education, transportation, and
more. However, it is not the sole key to unlocking human potential.  Human potential is influenced by a wide range of
factors, including education, access to resources, social and economic conditions, and individual motivation and drive.
AI can certainly play a significant role in enhancing human potential, but it is not the only factor.



---

# 1. Tokenization

## What You'll Learn in This Section
- How AI breaks text into manageable pieces
- Why some words are split while others aren't
- How different AI models tokenize text differently
- How tokens are converted into numbers

## What is tokenization?
Tokenization breaks text into pieces an AI can work with. While humans naturally read "unbreakable" as one word, an AI might break it into "un", "break", and "able": pieces it sees often enough to handle well.
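To make the idea concrete, here is a toy subword splitter. It is a minimal sketch, not how Phi-3 tokenizes: real tokenizers learn their vocabularies from large text corpora and typically use algorithms such as byte-pair encoding. The tiny vocabulary and the function name `greedy_subword_tokenize` are invented for this example, which simply takes the longest known piece at each position.

```python
def greedy_subword_tokenize(word, vocab):
    """Split a word by repeatedly taking the longest piece found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: emit a placeholder for this character
            pieces.append("<unk>")
            start += 1
        else:
            pieces.append(word[start:end])
            start = end
    return pieces


# Hypothetical hand-written vocabulary
toy_vocab = {"un", "like", "able", "crypto", "currency"}

print(greedy_subword_tokenize("unlikeable", toy_vocab))      # ['un', 'like', 'able']
print(greedy_subword_tokenize("cryptocurrency", toy_vocab))  # ['crypto', 'currency']
```

Neither full word is in the vocabulary, yet both can be represented by combining pieces that are.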

## Why do we need it?
- Computers need consistent, manageable pieces to process
- A fixed vocabulary of tens of thousands of tokens (about 32,000 for Phi-3) can handle millions of possible words by combining pieces
- New words like "cryptocurrency" can be understood as "crypto" + "currency"
- Saves memory by reusing common pieces instead of storing every possible word

## Common questions:
- **Why split words?** The word "unlikeable" might be rare, but "un", "like", and "able" are common. By splitting it, the model can handle it even if it never saw this exact word during training.
- **What's with the weird symbols?** Special tokens such as `[CLS]` (used by BERT) or `<s>` (used by Phi-3) mark the start of text, helping the model know where a sequence begins and ends, much like capital letters and periods do for humans.
- **Why different across models?** GPT models might keep common words whole but split rare ones, while BERT might split more aggressively to handle specialized text better.
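The step from tokens to numbers is a plain table lookup. The sketch below uses a made-up seven-entry vocabulary (real ones have tens of thousands of entries and different IDs) to show how pieces map to IDs, how a start-of-text marker can be prepended, and how unknown pieces fall back to a placeholder.

```python
# Hypothetical miniature vocabulary: each piece has a fixed integer ID.
toy_vocab = {"<s>": 0, "<unk>": 1, "un": 2, "like": 3, "able": 4,
             "crypto": 5, "currency": 6}
id_to_token = {token_id: token for token, token_id in toy_vocab.items()}


def encode(pieces, vocab, add_start_token=True):
    """Map token strings to IDs, using <unk> for anything not in the vocabulary."""
    ids = [vocab.get(piece, vocab["<unk>"]) for piece in pieces]
    return [vocab["<s>"]] + ids if add_start_token else ids


def decode(ids, id_to_token):
    """Map IDs back to token strings."""
    return [id_to_token[token_id] for token_id in ids]


ids = encode(["un", "like", "able"], toy_vocab)
print(ids)                       # [0, 2, 3, 4]
print(decode(ids, id_to_token))  # ['<s>', 'un', 'like', 'able']
```

Because each tokenizer has its own vocabulary, the same text produces different IDs (and often a different number of tokens) for different models.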




## Tokenization overview

In [None]:
# @title Tokenization overview function (expand to see code)
from IPython.display import HTML, display

# Function to create an educational visualization of the tokenization process
def tokenization_visualization(text, tokenizer):
    """
    Creates a visualization showing tokenization steps including final token IDs.
    """
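    # Get tokens and their IDs. Converting the tokens directly keeps the two lists
    # aligned; calling tokenizer(text) may prepend special tokens such as <s>,
    # which would shift every ID by one position relative to its token.
    tokens = tokenizer.tokenize(text)
    token_ids = tokenizer.convert_tokens_to_ids(tokens)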

    style = """
    <style>
    .container {
        background-color: #2d2d2d;
        padding: 10px;
        border-radius: 10px;
        color: #ffffff;
        font-family: system-ui, -apple-system, sans-serif;
        max-width: 1000px;
        margin: 20px auto;
        font-size: 14px;
    }
    .stage {
        margin: 10px 0;
        padding: 10px;
        background: #363636;
        border-radius: 12px;
        box-shadow: 0 2px 8px rgba(0,0,0,0.2);
    }
    .token-container {
        background: #2d2d2d;
        padding: 10px;
        border-radius: 8px;
        font-size: 14px;
        line-height: 1.6;
        border: 1px solid #555;
        margin-top: 15px;
        font-family: monospace;
    }
    .token {
        display: inline-block;
        padding: 4px 4px;
        margin: 0 4px;
    }
    .highlight {
        background: #264F78;
        border-radius: 4px;
        border: 1px solid #3794FF;
    }
    .token-id {
        color: #4EC9B0;
        font-size: 1em;
    }
    .token-pair {
        display: inline-flex;
        flex-direction: column;
        align-items: center;
        margin: 0 0px;
    }
    .explanation {
        color: #e0e0e0;
        margin: 0 0 15px 0;
        font-size: 16px;
        line-height: 1.5;
    }
    .header {
        color: #3794FF;
        font-size: 24px;
        margin-bottom: 25px;
        font-weight: 600;
    }
    .stage-title {
        color: #3794FF;
        font-size: 20px;
        margin-bottom: 10px;
        font-weight: 500;
    }
    </style>
    """

    # Create token-id pairs HTML for later use
    token_id_pairs = []
    for token, id in zip(tokens, token_ids):
        token_id_pairs.append(f"""
            <div class='token-pair'>
                <span class='token highlight'>{token}</span>
                <span class='token-id'>{id}</span>
            </div>
        """)

    html = f"""
    <div class='container'>
        <div class='header'>How Tokenizers Process Text</div>

        <div class='stage'>
            <div class='stage-title'>Original Text</div>
            <div class='explanation'>This is the raw text that will be fed into the model</div>
            <div class='token-container'>{text}</div>
        </div>

        <div class='stage'>
            <div class='stage-title'>Step 1: Word Separation</div>
            <div class='explanation'>For illustration, the text is first split on whitespace into individual words</div>
            <div class='token-container'>                {"".join([f"<span class='token highlight'>{word}</span>" for word in text.split()])}</div>
        </div>

        <div class='stage'>
            <div class='stage-title'>Step 2: Subword Tokenization</div>
            <div class='explanation'>Words are broken down into smaller pieces called subword tokens</div>
            <div class='token-container'>
                {"".join([f"<span class='token highlight'>{t}</span>" for t in tokens])}
            </div>
        </div>

        <div class='stage'>
            <div class='stage-title'>Step 3: Token Visualization</div>
            <div class='explanation'>Each token is highlighted to show how the model sees the text</div>
            <div class='token-container'>
                {"".join(token_id_pairs)}
            </div>
        </div>

        <div class='stage'>
            <div class='stage-title'>Step 4: Token IDs</div>
            <div class='explanation'>Finally, each token is converted to a number (token ID) that the model uses for processing</div>
            <div class='token-container'>
                {"".join([f"<span class='token highlight'>{id}</span>" for id in token_ids])}
            </div>
        </div>
    </div>
    """

    display(HTML(style + html))

### Demo: Tokenization overview

In [None]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
prompt = "Is AI the key to unlocking the potential of humanity?"
tokenization_visualization(prompt, tokenizer)

## Comparing tokenizers

In [None]:
# @title Compare tokenizers function (expand to see code)
from IPython.display import HTML, display
from transformers import AutoTokenizer
import pandas as pd

def compare_tokenizers(text, tokenizers=None):
    """
    Create a visually enhanced comparison of different tokenizers with a dark theme.

    Args:
        text (str): Input text to tokenize
        tokenizers (dict, optional): Dictionary of tokenizer configurations
    """
    print(f"Input: {text}")
    if tokenizers is None:
        tokenizers = {
            "Phi-3": "microsoft/Phi-3-mini-4k-instruct",
            "GPT-2": "gpt2",
            "BERT": "bert-base-uncased",
            "T5": "t5-small",
            "XLM-R": "xlm-roberta-base"
        }

    # Initialize tokenizers
    tokenizers = {name: AutoTokenizer.from_pretrained(model)
                 for name, model in tokenizers.items()}

    # Get tokens and find maximum length
    token_data = {}
    max_tokens = 0
    for name, tok in tokenizers.items():
        tokens = tok.tokenize(text)
        token_data[name] = tokens
        max_tokens = max(max_tokens, len(tokens))

    # Create styled HTML output with dark theme
    html = '''
    <style>
        .tokenizer-container {
            display: grid;
            grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
            gap: 20px;
            max-width: 1200px;
            margin: 20px auto;
            background: #1a1a1a;
            padding: 20px;
            border-radius: 12px;
        }
        .tokenizer-column {
            background: #2d2d2d;
            border-radius: 8px;
            box-shadow: 0 4px 6px rgba(0,0,0,0.2);
            overflow: hidden;
            border: 1px solid #3d3d3d;
        }
        .tokenizer-header {
            background: #363636;
            color: #e1e1e1;
            padding: 12px;
            text-align: center;
            font-weight: bold;
            border-bottom: 1px solid #404040;
        }
        .token-list {
            padding: 8px;
        }
        .token {
            padding: 8px 12px;
            margin: 4px 0;
            background: #404040;
            border-radius: 4px;
            font-family: monospace;
            font-size: 14px;
            color: #e1e1e1;
            border: 1px solid #4a4a4a;
        }
        .token:nth-child(even) {
            background: #363636;
        }
        .stats {
            margin-top: 20px;
            padding: 15px;
            background: #2d2d2d;
            border-radius: 8px;
            font-family: monospace;
            color: #e1e1e1;
        }
    </style>
    <div class="tokenizer-container">
    '''

    # Add columns for each tokenizer
    for name, tokens in token_data.items():
        html += f'''
        <div class="tokenizer-column">
            <div class="tokenizer-header">{name} ({len(tokens)} tokens)</div>
            <div class="token-list">
        '''
        for token in tokens:
            html += f'<div class="token">{token}</div>'
        html += '</div></div>'

    html += '</div>'

    display(HTML(html))

# Example usage:
tokenizers = {
    "Phi-3": "microsoft/Phi-3-mini-4k-instruct",
    "GPT-2": "gpt2",
    "BERT": "bert-base-uncased",
    "T5": "t5-small",
    "XLM-R": "xlm-roberta-base"
}

### Demo: Compare different tokenizers

In [None]:
prompt = "AI is the key to unlocking the potential of humanity"

tokenizers = {
    "Phi-3": "microsoft/Phi-3-mini-4k-instruct",
    "GPT-2": "gpt2",
    "BERT": "bert-base-uncased",
    "T5": "t5-small",
    "XLM-R": "xlm-roberta-base",
}

compare_tokenizers(prompt, tokenizers)

Input: AI is the key to unlocking the potential of humanity


## Try This!

1. Experiment with technical terms or abbreviations. How does the model handle words like "AI", "GPU", or "neural network"?
2. Try words with prefixes or suffixes. How are words like "unchangeable" or "multiplication" broken down?
3. Compare how different languages are tokenized. What patterns do you notice?



## Think About

- Why might some common words be kept whole while others are split?
- How might tokenization affect AI's understanding of technical jargon?
- What challenges might arise when tokenizing multiple languages?


---

# 2. Embeddings

## What You'll Learn in This Section
- How AI represents words as numbers
- Why similar words have similar number patterns
- How AI captures relationships between words
- How 2D visualizations help us understand high-dimensional word relationships

## What are embeddings?
Each token becomes a list of numbers (a vector) that captures meaning. In this number-space, "king - man + woman" lands close to "queen" because the numbers encode these relationships.
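Mechanically, "becoming a list of numbers" is a row lookup in a big table. The sketch below uses a made-up table with five rows of four numbers each; in the real model the table is the learned `embed_tokens` layer with one row per vocabulary entry.

```python
# Hypothetical embedding table: row i holds the vector for token ID i.
# The values are invented; a real table is learned during training.
embedding_table = [
    [0.00, 0.00, 0.00, 0.00],    # ID 0
    [0.10, -0.20, 0.30, 0.05],   # ID 1
    [0.70, 0.10, -0.40, 0.20],   # ID 2
    [0.65, 0.15, -0.35, 0.25],   # ID 3
    [-0.50, 0.90, 0.10, -0.30],  # ID 4
]


def embed(token_ids, table):
    """Look up one vector per token ID."""
    return [table[token_id] for token_id in token_ids]


vectors = embed([2, 3, 4], embedding_table)
for token_id, vector in zip([2, 3, 4], vectors):
    print(token_id, vector)
```

Everything the model does afterwards operates on these vectors, never on the original text.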

## Why do we need them?
- Numbers let us calculate relationships: "good" should be closer to "great" than to "terrible"
- Similar words get similar patterns: "cat" and "kitten" share number patterns
- We can measure similarity precisely: "happy" and "joyful" might have a cosine similarity of 0.9
- Math operations can reveal relationships: "Paris" - "France" + "Italy" lands near "Rome"
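Both ideas, measuring similarity and doing arithmetic on words, can be sketched in a few lines. The three-number vectors below are invented for illustration (think of the dimensions as "royalty", "male", "female"); real embeddings are learned and their dimensions have no such clean labels.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, not taken from any model
toy_embeddings = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.8, 0.8, 0.0],
}

# king - man + woman, computed one dimension at a time
target = [k - m + w for k, m, w in zip(toy_embeddings["king"],
                                       toy_embeddings["man"],
                                       toy_embeddings["woman"])]

# Rank the remaining words by similarity to the result
candidates = [w for w in toy_embeddings if w not in ("king", "man", "woman")]
nearest = max(candidates, key=lambda w: cosine_similarity(target, toy_embeddings[w]))
print(nearest)  # queen
```

With real embeddings the result of the arithmetic rarely equals another word's vector exactly, which is why the answer is read off as the nearest neighbor.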

## Common questions:
- **Why so many dimensions?** With just 2 dimensions, you can only capture simple relationships like "good vs bad" and "big vs small". With 3,072 dimensions, you can also capture "formal vs casual", "modern vs ancient", "literal vs figurative", and thousands more subtle distinctions simultaneously.
- **How are they learned?** If words often appear in similar contexts ("eat banana" and "eat apple"), their number patterns become similar. The model adjusts these numbers during training to better predict what words go together.
- **What do individual numbers mean?** A single number might contribute to many aspects of meaning - like how a pixel contributes to many parts of an image. It's the pattern that matters, not individual values.




In [None]:
import torch
from IPython.display import HTML, display
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import ipywidgets as widgets

def embeddings_visualizer(text, model, tokenizer, max_tokens=20):
    """
    Display tokens, their IDs, and visualize their embeddings for Phi-3 model.
    """
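    # Get tokens and IDs. Convert the tokens directly so that each ID (and later
    # each embedding row) lines up with its token label; tokenizer.encode(text)
    # may add special tokens, which would misalign the plot labels.
    tokens = tokenizer.tokenize(text)
    token_ids = tokenizer.convert_tokens_to_ids(tokens)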

    # Truncate if needed
    if len(tokens) > max_tokens:
        tokens = tokens[:max_tokens]
        token_ids = token_ids[:max_tokens]

    # Get model's device
    device = next(model.parameters()).device

    # Get embeddings
    with torch.no_grad():
        inputs = torch.tensor([token_ids]).to(device)
        embeddings = model.model.embed_tokens(inputs)
        embeddings = embeddings[0].to(torch.float32).cpu().numpy()

    # Create PCA projection
    pca = PCA(n_components=2)
    embeddings_2d = pca.fit_transform(embeddings)

    # Set the dark style for matplotlib
    plt.style.use('dark_background')
    plt.figure(figsize=(12, 8))

    # Create scatterplot with neon points
    plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], c='#00b4d8', s=100, alpha=0.7)

    # Add labels with improved visibility
    base_offset = 12
    for i, (x, y) in enumerate(embeddings_2d):
        # Smart label positioning
        others_right = sum(1 for j, (xj, yj) in enumerate(embeddings_2d) if j != i and xj > x)
        others_above = sum(1 for j, (xj, yj) in enumerate(embeddings_2d) if j != i and yj > y)

        x_offset = -base_offset if others_right > len(embeddings_2d)/2 else base_offset
        y_offset = -base_offset if others_above > len(embeddings_2d)/2 else base_offset

        plt.annotate(tokens[i],
                    (x, y),
                    xytext=(x_offset, y_offset),
                    textcoords='offset points',
                    color='#e1e1e1',
                    fontsize=9,
                    bbox=dict(facecolor='#2d2d2d',
                            edgecolor='#00b4d8',
                            alpha=0.8,
                            pad=2),
                    zorder=3)

    # Enhanced dark theme styling
    plt.title('Word Embeddings Visualization (2D Projection)',
             color='#e1e1e1',
             pad=20,
             fontsize=14)
    plt.grid(True, linestyle='--', alpha=0.2, color='#404040')
    plt.xlabel('First Principal Component', color='#e1e1e1')
    plt.ylabel('Second Principal Component', color='#e1e1e1')

    # Set background colors
    plt.gca().set_facecolor('#1e1e1e')
    plt.gcf().set_facecolor('#1e1e1e')

    # Style the ticks
    plt.tick_params(colors='#e1e1e1', grid_alpha=0.2)

    plt.show()
    return embeddings

def create_embedding_visualizer(model, tokenizer, prompt=None, examples=None):
    """
    Creates an interactive token embedding visualizer with dropdown selection and custom text input.
    """
    if examples is None:
        examples = {
            "Temperature Words": "hot cold warm cool fresh burning",
            "Direction Words": "up down left right back front near far high low top deep flat",
            "Movement Words": "fast slow quick jump walk run skip hop roll slide float",
            "Weather Words": "rain snow storm cloud clear dry wet fair",
            "Emotion Words": "happy sad glad mad angry calm brave proud tired glad afraid",
            "Color Words": "red blue green black white gray brown gold dark light bright pale",
            "Taste Words": "sweet salt bland rich fresh raw dry cool hot",
            "Texture Words": "soft hard rough smooth wet dry clean sharp flat firm thick thin",
            "Sound Words": "loud quiet soft sharp deep low high swift crash hum",
            "Time Words": "fast slow quick long short swift brief past now then soon late early",
            "Value Words": "rich poor high low good bad best worst great fine fair pure true false",
            "Strength Words": "strong weak firm soft wild calm rough smooth hard soft light",
            "Clarity Words": "clear dark dim bright sharp clean pure raw fog",
            "Mixed Words": "red blue green up down left right fast quick slow"
        }
    if prompt is not None:
        examples["Prompt"] = prompt

    # Create interface container
    interface_html = HTML(f"""
    <div id="embedding-interface" style="margin: 20px 0; background-color: #1e1e1e; padding: 20px; border-radius: 8px; border: 1px solid #333;">
        <h2 style="color: #e1e1e1; margin-top: 0;">Interactive Word Embedding Explorer</h2>
        <p style="color: #00b4d8;">Select a predefined group of words or enter your own custom words.</p>
        <div style="margin: 15px 0;">
            <label for="word-group" style="color: #e1e1e1; display: block; margin-bottom: 5px;">Select Word Group:</label>
            <div id="dropdown-container"></div>
        </div>
        <div style="margin: 15px 0;">
            <label for="custom-words" style="color: #e1e1e1; display: block; margin-bottom: 5px;">Or Enter Custom Words:</label>
            <div id="textbox-container"></div>
        </div>
        <div style="margin-top: 15px; padding-top: 15px; border-top: 1px solid #333;">
            <h3 style="color: #e1e1e1; margin-top: 0;">Words being analyzed:</h3>
            <p style="font-size: 1.1em; color: #00b4d8;" id="current-words"></p>
        </div>
    </div>
    """)

    # Create dropdown with dark theme styling
    dropdown = widgets.Dropdown(
        options=examples,
        style={'description_width': 'initial'},
        layout=widgets.Layout(width='80%')
    )

    # Create text input for custom words
    text_input = widgets.Text(
        placeholder='Enter space-separated words...',
        layout=widgets.Layout(width='80%')
    )

    # Create output widget for the visualization
    output = widgets.Output()

    # Define update function for the words display
    def update_words_display(text):
        display(HTML(f"""
        <script>
        var elem = document.getElementById('current-words');
        if (elem) elem.textContent = {repr(text)};
        </script>
        """))

    # Define callbacks
    def on_dropdown_change(change):
        if change['type'] == 'change' and change['name'] == 'value':
            text_input.value = ''  # Clear text input when dropdown is used
            update_words_display(change['new'])
            with output:
                output.clear_output()
                embeddings_visualizer(change['new'], model, tokenizer)

    def on_text_submit(sender):
        if text_input.value.strip():
            dropdown.unobserve(on_dropdown_change, names='value')  # Temporarily disconnect dropdown
            dropdown.value = list(examples.values())[0]  # Reset dropdown
            dropdown.observe(on_dropdown_change, names='value')  # Reconnect dropdown

            update_words_display(text_input.value)
            with output:
                output.clear_output()
                embeddings_visualizer(text_input.value, model, tokenizer)

    # Connect callbacks
    dropdown.observe(on_dropdown_change, names='value')
    text_input.on_submit(on_text_submit)

    # Display the interface
    display(interface_html)

    # Display and position the widgets
    dropbox_output = widgets.Output()
    textbox_output = widgets.Output()
    with dropbox_output:
        display(dropdown)
    with textbox_output:
        display(text_input)
    display(dropbox_output)
    display(textbox_output)

    # Move widgets into containers
    display(HTML("""
    <script>
    setTimeout(function() {
        var dropdownElem = document.querySelector('.widget-dropdown');
        var textboxElem = document.querySelector('.widget-text');
        var dropdownContainer = document.getElementById('dropdown-container');
        var textboxContainer = document.getElementById('textbox-container');
        if (dropdownElem && dropdownContainer) {
            dropdownContainer.appendChild(dropdownElem);
        }
        if (textboxElem && textboxContainer) {
            textboxContainer.appendChild(textboxElem);
        }
    }, 100);
    </script>
    """))

    # Display visualization output
    display(output)

    # Show initial words and visualization
    update_words_display(dropdown.value)
    with output:
        embeddings_visualizer(dropdown.value, model, tokenizer)

## Demo: Visualize embeddings

### Understanding the Embedding Visualization

This 2D visualization is a simplified view of how AI understands word relationships. Here's what you're seeing:

- **The Axes**: The x and y axes represent the two most important directions of difference between the words, mathematically determined through Principal Component Analysis (PCA). These aren't fixed meanings like "positive/negative" or "singular/plural" - they're mathematical combinations of many different aspects of meaning.

- **Distances**: Words that appear closer together are more similar in the AI's understanding. This similarity could be based on:
  - Meaning (happy/joyful)
  - Function (articles like a/the)
  - Context (often appear in similar situations)

- **Clusters**: Groups of words that cluster together often share some aspect of meaning or usage

Remember: This is a dramatic simplification of the actual 3,072-dimensional space where these words live. Some relationships that exist in the full space might not be visible in this 2D view.
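To get a feel for how much a 2D projection can lose, here is a minimal PCA sketch written with NumPy's SVD. It is similar in spirit to the `PCA(n_components=2)` call used by the visualizer, but it is not that implementation, and the random vectors stand in for real embeddings.

```python
import numpy as np


def pca_project(vectors, n_components=2):
    """Project vectors onto their top principal components.

    Returns the projected points and the fraction of total variance they keep.
    """
    centered = vectors - vectors.mean(axis=0)
    # Rows of `components` are the principal directions, strongest first
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ components[:n_components].T
    variance = singular_values ** 2
    kept_fraction = variance[:n_components].sum() / variance.sum()
    return projected, kept_fraction


# Stand-in data: 10 fake "embeddings" with 64 dimensions each
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(10, 64))

points_2d, kept = pca_project(fake_embeddings)
print(points_2d.shape)
print(f"Variance kept in 2D: {kept:.0%}")
```

Whatever share of the variance is not kept is structure the plot cannot show, so two words that look close in 2D may still be far apart in the full space.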

In [None]:
prompt = "AI is the key to unlocking the potential of humanity"

examples = {
    "Mixed Words": "red blue green up down left right fast quick slow",
    "Temperature Words": "hot cold warm cool fresh burning",
    "Direction Words": "up down left right back front near far high low top deep flat",
    "Movement Words": "fast slow quick jump walk run skip hop roll slide float",
    "Weather Words": "rain snow storm cloud clear dry wet fair",
    "Emotion Words": "happy sad glad mad angry calm brave proud tired glad afraid",
    "Color Words": "red blue green black white gray brown gold dark light bright pale",
    "Taste Words": "sweet salt bland rich fresh raw dry cool hot",
    "Texture Words": "soft hard rough smooth wet dry clean sharp flat firm thick thin",
    "Sound Words": "loud quiet soft sharp deep low high swift crash hum",
    "Time Words": "fast slow quick long short swift brief past now then soon late early",
    "Value Words": "rich poor high low good bad best worst great fine fair pure true false",
    "Strength Words": "strong weak firm soft wild calm rough smooth hard soft light",
    "Clarity Words": "clear dark dim bright sharp clean pure raw fog",
}

create_embedding_visualizer(model, tokenizer, examples=examples)


## Try This!
1. Compare synonyms and antonyms. Are they positioned as you'd expect?
2. Explore numbers or dates. How does the model represent sequential information?
3. Test domain-specific terms (e.g., colors, emotions, or technical terms)



## Think About
- Why do some similar words cluster together while others don't?
- How might this representation affect AI's understanding of analogies?
- What limitations might this spatial representation have?

---

# 3. Attention

## What You'll Learn in This Section
- How AI connects related words in a sentence
- Why some word connections are stronger than others
- How different attention heads capture different types of relationships
- How attention patterns reveal AI's understanding of text

## What is attention?
When processing "The cat sat on the mat", attention helps the model connect "cat" with "sat" more strongly than "mat" with "the". It's like highlighting connections between words based on their relationships.
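
The weights behind those connections can be sketched with scaled dot-product attention and a causal mask. This is an illustration under stated assumptions: the query and key vectors are random stand-ins for what a trained model would compute, and `causal_attention_weights` is a hypothetical helper, not part of this notebook.

```python
import numpy as np

def causal_attention_weights(queries, keys):
    """Return a (tokens x tokens) matrix of attention weights."""
    n_tokens, dim = queries.shape
    # Similarity of every query with every key, scaled by sqrt(dim)
    scores = queries @ keys.T / np.sqrt(dim)
    # Causal mask: a token may not look at tokens that come after it
    future = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Softmax over each row so the weights for a token sum to 1
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

tokens = ["The", "cat", "sat", "on", "the", "mat"]
rng = np.random.default_rng(0)
# Made-up query/key vectors; a real model learns how to produce these
queries = rng.normal(size=(len(tokens), 8))
keys = rng.normal(size=(len(tokens), 8))

weights = causal_attention_weights(queries, keys)
print(np.round(weights, 2))
```

Each row is one token's attention over itself and the tokens before it. The first row always puts all of its weight on the first token because nothing precedes it.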

## Why do we need it?
- Some word pairs matter more than others: In "The bank of the river", connecting "bank" with "river" helps avoid confusion with financial banks
- Different relationships need different focus:
  - Grammar: Connecting articles ("the") with their nouns
  - Logic: Linking pronouns ("it") back to what they refer to
  - Context: Understanding modifiers ("red" modifies "house" in "the red house")
- Long sentences need selective focus: In "John, who met Mary at the store, gave her the book", connecting "her" back to "Mary" requires jumping over other words

## Common questions:
- **Why multiple heads?** Different heads can specialize: one might focus on grammar, another on subject-verb relationships, another on pronouns and their references.
- **What are layers?** Think of understanding levels:
  - Early layers: Basic patterns like finding word pairs ("hot dog") and simple grammar
  - Middle layers: Sentence structure and local context
  - Deep layers: Complex relationships like cause-and-effect or comparing different parts of long text
- **What do the colors mean?** Darker colors show stronger connections. In "The fast car", you'll see dark colors connecting "fast" to "car" but lighter ones connecting "the" to "fast".

In [None]:
#@title Attention visualizer (expand to see code)
def get_attention_patterns(model, tokenizer, text):
    """
    Extract attention patterns from the model for visualization.
    """
    # Tokenize input
    inputs = tokenizer(text, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    # Get model outputs with attention
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)

    # Extract attention weights and convert to numpy
    attention_weights = []
    for layer_attentions in outputs.attentions:
        layer_attentions = layer_attentions.squeeze(0)
        layer_attentions = layer_attentions.cpu().numpy()
        attention_weights.append(layer_attentions)

    # Get tokens
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

    return attention_weights, tokens

def create_attention_visualizer(model, tokenizer, examples=None):
    # Create informative explanation of attention visualization
    explanation = """
    <div style="background-color: #1e1e1e; padding: 20px; border-radius: 8px; margin-bottom: 20px; color: #e1e1e1;">
        <h3 style="color: #00b4d8; margin-top: 0;">How to Read This Attention Heatmap</h3>

        <p>This visualization shows how each token (word piece) pays attention to other tokens and itself when processing text:</p>

        <ul style="margin-bottom: 15px;">
            <li><strong style="color: #00b4d8;">Reading the Grid:</strong> Each row shows how much attention a token pays to previous tokens and itself (columns)</li>
            <li><strong style="color: #00b4d8;">Numbers & Colors:</strong> Darker blue and higher numbers (0-1) indicate stronger attention</li>
            <li><strong style="color: #00b4d8;">Blank Upper Area:</strong> Tokens can only attend to themselves and previous tokens, so the upper triangle is always blank</li>
        </ul>

        <p><strong style="color: #00b4d8;">Example Pattern:</strong> In the sentence "The cat sits", you might see:</p>
        <ul>
            <li>"cat" paying strong attention to "The" (article-noun relationship)</li>
            <li>"sits" paying attention to "cat" (subject-verb relationship)</li>
            <li>Each token typically paying some attention to itself</li>
        </ul>
    </div>
    """

    if examples is None:
        examples = {
            "Complex Relationship": "The dog chased its tail because it was wagging",
            "Question-Answer": "Q: What is the capital of France? A: Paris",
            "Local Grammar": "The red and blue car",
            "Grammar Pattern": "The red and blue car drove fast",
            "Completion": "The students studied hard for their final",
            "Simple Example": "The cat sits on the mat",
            "Comparison": "Although it was expensive, the quality was excellent",
        }

    # Create widgets
    text_input = widgets.Text(
        value='',
        placeholder='Enter custom text...',
        description='Input:',
        layout=widgets.Layout(width='80%')
    )

    examples_dropdown = widgets.Dropdown(
        options=examples,
        description='Examples:',
        layout=widgets.Layout(width='80%')
    )

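    # Read the layer and head counts from the model config rather than hard-coding 32
    n_layers = model.config.num_hidden_layers
    n_heads = model.config.num_attention_heads

    layer_dropdown = widgets.Dropdown(
        options=[(f'Layer {i}', i) for i in range(n_layers)],
        value=0,
        description='Layer:',
        layout=widgets.Layout(width='200px')
    )

    head_dropdown = widgets.Dropdown(
        options=[(f'Head {i}', i) for i in range(n_heads)],
        value=0,
        description='Head:',
        layout=widgets.Layout(width='200px')
    )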

    viz_output = widgets.Output()
    rec_output = widgets.Output()

    def update_display(change=None):
        """Update both visualization and recommendations"""
        text = text_input.value if text_input.value else examples_dropdown.value
        layer = layer_dropdown.value
        head = head_dropdown.value

        attention_weights, tokens = get_attention_patterns(model, tokenizer, text)

        with rec_output:
            clear_output(wait=True)
            interesting_patterns = find_interesting_patterns(attention_weights, tokens)
            print("\nRecommended interesting patterns to explore:")
            for l, h, score, reason in interesting_patterns:
                print(f"Layer {l}, Head {h}: {reason} (score: {score:.2f})")

        with viz_output:
            clear_output(wait=True)

            n_tokens = len(tokens)
            fig_size = max(6, n_tokens * 0.5)
            plt.figure(figsize=(fig_size, fig_size))

            # Create mask for upper triangle
            mask = np.triu(np.ones_like(attention_weights[layer][head]), k=1)

            # Plot heatmap with mask
            sns.heatmap(attention_weights[layer][head],
                       xticklabels=tokens,
                       yticklabels=tokens,
                       cmap='Blues',
                       vmin=0, vmax=1,  # attention weights always lie in [0, 1]
                       square=True,
                       fmt='.2f',
                       annot=True,
                       mask=mask,  # Apply mask to hide upper triangle
                       annot_kws={'size': 8},
                       cbar_kws={'shrink': .8})

            plt.title(f'Attention Pattern (Layer {layer}, Head {head})')
            plt.xticks(rotation=45, ha='right')
            plt.yticks(rotation=0)
            plt.tight_layout()
            plt.show()

    # Connect callbacks
    text_input.observe(update_display, names='value')
    examples_dropdown.observe(update_display, names='value')
    layer_dropdown.observe(update_display, names='value')
    head_dropdown.observe(update_display, names='value')

    # Create layout with dropdowns side by side
    controls = widgets.VBox([
        widgets.HTML(explanation),  # Add explanation at the top
        examples_dropdown,
        text_input,
        widgets.HBox([layer_dropdown, head_dropdown])
    ])

    # Display everything
    display(widgets.HTML("<h2>Attention Pattern Visualizer</h2>"))
    display(controls)
    display(viz_output)
    display(rec_output)

    # Show initial visualization
    update_display()

#@title Function to find interesting patterns in attention heatmaps
def find_interesting_patterns(attention_weights, tokens, top_k=5):
    scores = []
    n_layers = len(attention_weights)
    n_heads = attention_weights[0].shape[0]

    for layer in range(n_layers):
        for head in range(n_heads):
            matrix = attention_weights[layer][head]

            # Look for patterns where first token gets consistent attention
            first_token_pattern = np.mean(matrix[:, 0]) > 0.5

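            # Look for adjective-noun patterns: row-mean attention never rises by 0.1 or more
            # (rows sum to about 1, so each row mean is near 1/n and this loose check usually passes)
            decreasing_pattern = np.all(np.diff(matrix.mean(axis=1)) < 0.1)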

            # Look for distributed attention in nouns
            last_token_distribution = matrix[-1].std() < 0.3 and matrix[-1].mean() > 0.1

            # Calculate linguistic pattern score
            linguistic_score = (
                first_token_pattern * 0.4 +
                decreasing_pattern * 0.3 +
                last_token_distribution * 0.3
            )

            # Determine pattern type and score
            if linguistic_score > 0.6:
                reason = "Shows grammatical structure patterns"
                score = linguistic_score
            elif first_token_pattern:
                reason = "Shows article-word relationships"
                score = linguistic_score
            else:
                # Calculate general interest metrics as fallback
                attention_spread = len(matrix[matrix > 0.05]) / matrix.size
                peak_contrast = np.max(matrix) - np.mean(matrix)
                weighted_distances = np.mean(np.abs(np.arange(len(tokens))[:, None] - np.arange(len(tokens))) * matrix)

                score = (
                    attention_spread * 0.4 +
                    peak_contrast * 0.3 +
                    weighted_distances * 0.3
                )

                if attention_spread > 0.3:
                    reason = "Shows distributed attention"
                elif peak_contrast > 0.4:
                    reason = "Shows focused attention peaks"
                elif weighted_distances > len(tokens)/3:
                    reason = "Shows long-range connections"
                else:
                    continue

            scores.append((layer, head, score, reason))

    # Sort by score and return top_k unique patterns
    scores.sort(key=lambda x: x[2], reverse=True)
    seen_reasons = set()
    filtered_scores = []
    for score in scores:
        if score[3] not in seen_reasons and len(filtered_scores) < top_k:
            filtered_scores.append(score)
            seen_reasons.add(score[3])

    return filtered_scores

## Demo: Attention visualizer

In [None]:
examples = {
    "Grammar Pattern": "The red and blue car drove fast",
    "Complex Relationship": "The dog chased its tail because it was wagging",
    "Question-Answer": "Q: What is the capital of France? A: Paris",
    "Completion": "The students studied hard for their final",
    "Simple Example": "The cat sits on the mat",
    "Comparison": "Although it was expensive, the quality was excellent",
}

create_attention_visualizer(model, tokenizer, examples=examples)


## Try This!
1. Test sentences with pronouns ("he", "she", "it"). How does attention track references?
2. Compare simple and complex sentences. How does attention pattern complexity change?
3. Experiment with questions and answers. What attention patterns emerge?

## Think About
- How do different attention heads specialize in different patterns?
- Why might some connections be stronger than others?
- How does sentence length affect attention patterns?

---

# 4. Predicting the next token

## What You'll Learn in This Section
- How AI predicts what word comes next
- Why AI considers multiple possibilities
- How temperature affects AI's creativity
- How context influences prediction probabilities

## What is the prediction process?
After analyzing relationships, the model estimates what might come next. For "The cat sat on the ___", it might suggest "mat", "floor", "chair" with different probabilities.
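
The step from raw model scores (logits) to probabilities is a softmax. A minimal sketch, assuming three made-up scores for the candidate tokens named above:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens
candidates = ["mat", "floor", "chair"]
logits = [2.0, 1.0, 0.5]

probs = softmax(logits)
for token, p in zip(candidates, probs):
    print(f"{token:6} {p:.1%}")
```

A real model does the same thing over its whole vocabulary, which is why the demo further down only shows the top few entries.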

## Why not just pick the highest probability?
- Natural variation: Always completing "as cold as" with "ice" gets boring
- Examples of repetitive patterns to avoid:
  - Word repetition: "The big big big house"
  - Phrase loops: "and then he went to the store and then he went to the store..."
  - Stuck patterns: Always using the same sentence structure

## Temperature
- Models use a "temperature" setting to control randomness in selection
- Temperature of 0.0: Always picks the highest probability token
- Temperature between 0.0 and 1.0: Sharpens the distribution, so likely tokens become even more likely
- Temperature of 1.0: Samples exactly according to the calculated probabilities
- Temperature > 1.0: Flattens the distribution, so the probabilities of the possible tokens become more similar
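
Temperature is commonly applied by dividing the logits by the temperature before the softmax. The sketch below assumes that convention and uses made-up scores; `apply_temperature` is a hypothetical helper, and temperature 0.0 is handled as a special case because dividing by zero is undefined.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Return next-token probabilities after temperature scaling."""
    if temperature == 0.0:
        # Limit case: all probability goes to the highest-scoring token
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    largest = max(scaled)
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three tokens
for t in (0.0, 0.5, 1.0, 2.0):
    probs = apply_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

# Sampling draws a token index according to the adjusted probabilities
random.seed(0)
choice = random.choices(range(len(logits)), weights=apply_temperature(logits, 1.0))[0]
print("sampled index:", choice)
```

As the temperature rises, the top token's share shrinks and the others grow, which is where the extra variety comes from.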

## Common questions:
- **Why multiple options?** "The cat sat on the" could logically continue with "mat" (rhyming), "couch" (common), or "laptop" (modern) - context determines what's most appropriate
- **What affects probabilities?**
  - Previous context: "the bank" appearing in text near the word "river" vs near the word "money"
  - Common patterns: "salt and pepper" is more likely than "pepper and salt"
  - Topic consistency: Space-related words become more likely after mentioning "astronaut"
- **How far back can it look?** Phi-3 has a context window of 4,096 tokens, so each prediction can draw on at most that many preceding tokens

In [None]:
#@title Next Token Probability Function (expand to see code)
def show_topk_token_probabilities(model, tokenizer, prompt="AI models are used for generating", top_k=10):
    """
    Show the probability distribution for the next token given an input prompt.

    Args:
        model: The language model
        tokenizer: The tokenizer
        prompt: Prompt to analyze
        top_k: Number of top tokens to display
    """
    # Get the device the model is on
    device = next(model.parameters()).device

    # Tokenize and move to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate logits for next token
    with torch.no_grad():
        outputs = model(**inputs)

    # Get probabilities for next token
    logits = outputs.logits[0, -1]
    probs = torch.softmax(logits, dim=-1)

    # Get top-k tokens and their probabilities
    top_probs, top_indices = torch.topk(probs, top_k)

    # Move results back to CPU for processing
    top_probs = top_probs.cpu()
    top_indices = top_indices.cpu()

    print(f"\nInput: '{prompt}'")
    print(f"\nTop {top_k} most likely next tokens:")
    print("-" * 40)

    # Create a visualization of the probability distribution
    max_bar_width = 40  # Maximum width of probability bars

    for prob, idx in zip(top_probs, top_indices):
        token = tokenizer.decode([idx.item()])  # convert the 0-dim tensor to a plain int
        prob_value = prob.item()

        # Create a visual bar representing the probability
        bar_width = int(prob_value * max_bar_width)
        bar = "█" * bar_width

        # Format the probability as a percentage
        prob_percent = f"{prob_value*100:.1f}%"

        # Print the token and its probability with a visual bar
        print(f"{token:15} {prob_percent:>7} |{bar}")

## Demo: Next token prediction

In [None]:
prompt = "AI models are used for generating"
show_topk_token_probabilities(model, tokenizer, prompt=prompt)


Input: 'AI models are used for generating'

Top 10 most likely next tokens:
----------------------------------------
text              18.2% |███████
content            6.3% |██
images             4.7% |█
music              3.0% |█
and                2.6% |█
art                2.5% |█
personal           2.2% |
responses          2.0% |
cre                1.9% |
ca                 1.9% |


## Try This!
1. Start common phrases and see top predictions
2. Compare predictions with and without context
3. Test how technical or domain-specific context affects predictions

## Think About
- Why might unlikely but creative completions be valuable?
- How does context length affect prediction quality?
- What biases might emerge in predictions?

---

# AI Glossary

## Core AI Concepts
- **Artificial Intelligence (AI)**: Computer systems designed to perform tasks that typically require human intelligence
- **Language Model**: An AI system trained to understand, process, and generate human language
- **Neural Network**: A computing system inspired by biological brains, made up of interconnected nodes that process information
- **Machine Learning**: The process by which AI systems learn from examples rather than following explicit instructions
- **Training**: The process of teaching an AI model by showing it many examples
- **Inference**: When a trained AI model processes new inputs to generate outputs

## Language Model Components
- **Token**: The basic unit of text that AI processes (can be whole words, parts of words, or even single characters)
- **Tokenizer**: Software that breaks text into tokens the model can understand
- **Embedding**: A list of numbers that represents a token's meaning and relationships to other tokens
- **Attention**: The mechanism that helps the model connect and relate different parts of text
- **Context Window**: The maximum amount of text the model can process at once (4,096 tokens for Phi-3)
- **Layer**: A processing level in the model where different types of understanding happen
- **Hidden State**: The internal representation of processed information at each layer
- **Head**: A component that focuses on specific types of relationships between words (like grammar or topic)

## Generation Terms
- **Prompt**: The input text given to the model
- **Temperature**: A setting that controls how random or predictable the model's outputs are
  - 0.0: Always picks the most likely next word
  - 1.0: Samples from the model's probabilities as calculated
  - \>1.0: More random and creative outputs
- **Top-k**: Limiting word choices to the k most likely options
- **Top-p/Nucleus Sampling**: Choosing from the smallest set of most likely words whose probabilities add up to at least p
- **Beam Search**: Tracking several candidate continuations in parallel and keeping the highest-scoring sequence
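
Top-k and top-p can be read as filters applied to the probability list before sampling. A minimal sketch with a made-up distribution; the helper names are hypothetical, and real libraries typically apply these filters to logits rather than probabilities.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose probabilities reach p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

probs = [0.5, 0.25, 0.125, 0.125]  # made-up distribution over four tokens
print(top_k_filter(probs, 2))      # token index -> renormalized probability
print(top_p_filter(probs, 0.7))
```

Top-k always keeps the same number of tokens, while top-p keeps fewer when the model is confident and more when it is not.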

## Technical Visualization Terms
- **Principal Component Analysis (PCA)**: Mathematical technique that simplifies high-dimensional data (like embeddings) into fewer dimensions while preserving important relationships
- **Dimensionality Reduction**: Process of simplifying high-dimensional data for visualization
- **Heatmap**: Visualization where colors represent numbers (darker usually means stronger connection)
- **Vector**: A list of numbers representing a point in multi-dimensional space
- **Cosine Similarity**: Measure of similarity between two vectors (used to compare word embeddings)
- **Projection**: Converting high-dimensional data into fewer dimensions for visualization
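
Cosine similarity can be computed directly from its definition. The three-number vectors below are invented for illustration; real embeddings have thousands of entries.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors, not taken from any model
happy = [0.9, 0.8, 0.1]
joyful = [0.8, 0.9, 0.2]
cold = [-0.7, 0.1, 0.9]

print(cosine_similarity(happy, joyful))  # close to 1: similar direction
print(cosine_similarity(happy, cold))    # negative: pointing apart
```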

## Model Architecture Terms
- **Transformer**: The fundamental architecture used in modern language models
- **Feed-Forward Network**: Neural network component that processes information in one direction
- **Residual Connection**: Connection that helps information flow through deep networks
- **Normalization**: Process of keeping numbers in a helpful range for the model
- **Dropout**: Technique to prevent the model from overfitting by randomly ignoring some connections during training

## Hardware Terms
- **GPU**: Graphics Processing Unit, specialized for parallel computations used in AI
- **VRAM**: Video RAM, the memory available on a GPU
- **Tensor**: Multi-dimensional array of numbers processed by the model
- **Batch**: Group of examples processed together for efficiency
- **Quantization**: Technique to reduce model size by using fewer bits to represent numbers

## Model Terms
- **Phi-3**: Microsoft's efficient language model designed for learning and research
- **4-bit Quantization**: Compression technique used to run Phi-3 efficiently
- **Mini Version**: Smaller variant of Phi-3 optimized for limited computing resources
- **Instruction-Tuned**: Model specifically trained to follow instructions and commands

## Language Concepts
- **Grammar Attention**: How the model recognizes and processes grammatical structures
- **Semantic Meaning**: The actual meaning of words and phrases (versus just their spelling or sound)
- **Context**: The surrounding text that helps determine meaning
- **Prompt Engineering**: The art of writing effective inputs to get desired outputs from AI
- **Zero-shot Learning**: Model's ability to handle tasks it wasn't explicitly trained on

## Common Metrics
- **Probability**: Likelihood of a token being chosen next (0-100%)
- **Attention Score**: Strength of connection between two tokens (0-1)
- **Similarity Score**: How closely related two tokens are (cosine similarity ranges from -1 to 1, with higher meaning more similar)
- **Perplexity**: Measure of how uncertain the model is about its predictions (lower means more confident)
- **Token Length**: Number of tokens in a piece of text
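
Perplexity is the exponential of the average negative log-probability the model assigned to the tokens that actually appeared. A minimal sketch with invented probabilities:

```python
import math

def perplexity(token_probs):
    """Perplexity from the probability the model gave each actual token."""
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

# Made-up probabilities for two short passages
confident = [0.9, 0.8, 0.95]
unsure = [0.2, 0.1, 0.25]

print(perplexity(confident))  # near 1: the model expected these tokens
print(perplexity(unsure))     # larger: the model was often surprised
```

A perplexity of 4 means the model was, on average, as uncertain as if it were choosing evenly among four tokens.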

## Model Behavior
- **Hallucination**: When AI generates plausible but incorrect information
- **Repetition**: When AI gets stuck repeating similar phrases
- **Coherence**: How well the generated text maintains consistent meaning
- **Fluency**: How natural and flowing the generated text sounds
- **Context Break**: When the model loses track of earlier context

## Colab Terms
- **Runtime**: The environment where code is executed (like a T4 GPU)
- **Notebook**: Interactive document combining code, text, and visualizations
- **Cell**: Individual code or text block in a notebook
