<a href="https://colab.research.google.com/github/suyash1574/GEN-AI-Workshop/blob/main/src/day1/notebooks/03_text_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Day 1: Text Pipeline - Your First Language Model

Welcome to hands-on text processing! Now that you understand neural networks, let's explore how they work with text data.

## 🎯 Learning Objectives
By the end of this notebook, you will:
- Understand how text becomes numbers (tokenization)
- Load and use a pre-trained language model
- Experiment with text generation parameters
- Compare different prompt engineering techniques
- Build your first text generation pipeline

## 📚 Research Focus
This notebook emphasizes **discovery learning**. You'll:
1. Research concepts before implementing
2. Experiment with parameters to see their effects
3. Compare different approaches
4. Build understanding through hands-on exploration

---

## 1. From Text to Numbers

Neural networks work with numbers, but we have text. How do we bridge this gap?

🔍 **RESEARCH TASK 1**:
- What is tokenization in NLP?
- What is the difference between word-level and sub-word tokenization?
- Research "BPE" (Byte Pair Encoding) - how does it work?
- Why can't we just assign each word a number?
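
Once you have read about BPE, a toy version can help check your understanding. The sketch below learns merges on a four-word corpus with made-up frequencies; the function names and the tie-breaking rule are assumptions for illustration. GPT-2's actual tokenizer works on bytes and ships with a much larger, pre-learned merge table.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    if not pairs:
        return None
    # Ties are broken alphabetically so the result is deterministic (an assumption).
    return max(pairs, key=lambda pair: (pairs[pair], pair))

def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus_counts, num_merges):
    """Start from single characters and repeatedly merge the most frequent pair."""
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_corpus, num_merges=4)
print("Merges learned:", merges)
print("Segmented words:", list(segmented))
```

Frequent character sequences such as `est` end up as single symbols, while rarer words stay split into smaller pieces.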

In [1]:
# Import required libraries
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline
import torch
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from collections import Counter
import seaborn as sns

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


### Exploring Tokenization

🔍 **RESEARCH TASK 2**:
- Look up the GPT-2 tokenizer documentation
- What is a "vocabulary size"?
- What happens when the model encounters a word it's never seen?
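
To make the unseen-word question concrete, here is a small sketch that contrasts a word-level vocabulary with a byte-level fallback. The helper names and the `<unk>` convention are assumptions for illustration, not part of the GPT-2 tokenizer API.

```python
def build_word_vocab(corpus):
    """Assign one ID per distinct word; ID 0 is reserved for unknown words."""
    vocab = {"<unk>": 0}
    for sentence in corpus:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode_words(text, vocab):
    """Word-level encoding: any word missing from the vocabulary collapses to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def encode_bytes(text):
    """Byte-level encoding: every string maps to values in 0..255, so nothing is unknown."""
    return list(text.encode("utf-8"))

corpus = ["the cat sat", "the dog ran"]
vocab = build_word_vocab(corpus)

print("Vocabulary size:", len(vocab))
print("Word-level IDs:", encode_words("the zebra sat", vocab))
print("Byte-level IDs:", encode_bytes("zebra"))
```

The word-level encoder loses all information about `zebra`, while the byte-level one keeps it at the cost of a longer sequence. Sub-word tokenizers sit between these two extremes.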

In [None]:
# TODO: Load the GPT-2 tokenizer
# Hint: Use GPT2Tokenizer.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Test sentences to explore tokenization
test_sentences = [
    "Hello world!",
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence is revolutionizing technology.",
    "GPT-2 uses transformer architecture.",
    "Supercalifragilisticexpialidocious"  # Long word to see sub-word tokenization
]

print("🔍 Exploring Tokenization:")
print("=" * 50)

for sentence in test_sentences:
    # TODO: Tokenize the sentence
    # Hint: Use tokenizer.encode() to get token IDs
    # Use tokenizer.tokenize() to see the actual tokens
    tokens = ____  # Get the actual token strings
    token_ids = ____  # Get the numerical IDs

    print(f"\nOriginal: {sentence}")
    print(f"Tokens: {tokens}")
    print(f"Token IDs: {token_ids}")
    print(f"Number of tokens: {len(tokens)}")

# TODO: Print tokenizer vocabulary size
print(f"\n📊 Tokenizer vocabulary size: {len(tokenizer)}")  # Hint: len(tokenizer)

### Understanding Token Patterns

🔍 **RESEARCH TASK 3**:
- Why do some words get split into multiple tokens?
- What does the 'Ġ' symbol represent in GPT-2 tokens?
- How might tokenization affect model performance?
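
After you have looked up the `Ġ` symbol, the sketch below can be used to check what you found. It re-implements, as an illustration, the byte-to-printable-character table that GPT-2's byte-level BPE is generally described as using; the function names here are my own, and the real tokenizer should be treated as the authority.

```python
def byte_to_printable_table():
    """Map each of the 256 byte values to a printable character.

    Bytes that are already printable keep their own character; the rest
    (including space and newline) are shifted to code points from 256 upward.
    """
    printable = (
        list(range(0x21, 0x7F))     # '!' through '~'
        + list(range(0xA1, 0xAD))   # '¡' through '¬'
        + list(range(0xAE, 0x100))  # '®' through 'ÿ'
    )
    codepoints = printable[:]
    shift = 0
    for byte in range(256):
        if byte not in printable:
            printable.append(byte)
            codepoints.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(printable, codepoints)}

table = byte_to_printable_table()

def show_bytes(text):
    """Render a string the way its bytes would appear in the token strings."""
    return "".join(table[b] for b in text.encode("utf-8"))

print(show_bytes("Hello world"))
print(show_bytes(" world"))
```

Compare the second line with the output of `tokenizer.tokenize("Hello world")` in the cell below.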

In [None]:
# Analyze tokenization patterns
analysis_texts = [
    "running",
    "runner",
    "run",
    "unhappiness",
    "ChatGPT",
    "COVID-19",
    "2023",
    "programming",
    "antidisestablishmentarianism"
]

print("🔍 Token Pattern Analysis:")
print("=" * 60)

token_analysis = []

for text in analysis_texts:
    # TODO: Analyze each text
    tokens = ____  # Tokenize the text
    token_count = ____  # Count the tokens

    token_analysis.append({
        'text': text,
        'tokens': tokens,
        'token_count': token_count,
        'chars_per_token': len(text) / token_count
    })

    print(f"{text:30} → {tokens} ({token_count} tokens)")

# TODO: Create a DataFrame and analyze patterns
df = pd.DataFrame(token_analysis)
print(f"\n📊 Average characters per token: {____:.2f}")  # Calculate mean
print(f"📊 Highest token count for a single word: {____}")  # Find max token_count

## 2. Loading Your First Language Model

Now let's load GPT-2 and understand its architecture.

🔍 **RESEARCH TASK 4**:
- What is GPT-2 and when was it released?
- How many parameters does GPT-2 have? (Compare different sizes)
- What is "autoregressive" text generation?
- How does GPT-2 relate to the neural network you built in the previous notebook?
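
As a rough picture of autoregressive generation, the sketch below produces one token at a time and feeds each choice back in as context. The bigram table and its probabilities are invented for illustration. GPT-2 conditions on the whole preceding sequence rather than only the last token, but the outer loop has the same shape.

```python
import random

# Hypothetical next-token probabilities, keyed by the previous token.
BIGRAMS = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"</s>": 1.0},
    "ran": {"</s>": 1.0},
}

def generate(max_new_tokens=10, seed=0):
    """Sample tokens one by one until the end marker or the length limit."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(max_new_tokens):
        distribution = BIGRAMS[tokens[-1]]          # look at the context so far
        next_token = rng.choices(
            list(distribution), weights=list(distribution.values()), k=1
        )[0]                                        # pick the next token
        if next_token == "</s>":
            break
        tokens.append(next_token)                   # the choice becomes new context
    return tokens[1:]

for seed in range(3):
    print(" ".join(generate(seed=seed)))
```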

In [None]:
# TODO: Load GPT-2 model
# Hint: Use GPT2LMHeadModel.from_pretrained('gpt2')
print("🔄 Loading GPT-2 model (this may take a moment)...")
model = ____

# TODO: Set model to evaluation mode
# Hint: Use model.eval()
____

print("✅ GPT-2 model loaded successfully!")

# Explore model architecture
print("\n🏗️ Model Architecture:")
print(f"Model type: {type(model).__name__}")

# TODO: Count model parameters
# Hint: sum(p.numel() for p in model.parameters())
total_params = ____
print(f"Total parameters: {total_params:,}")
print(f"Model size: ~{total_params / 1e6:.1f}M parameters")

### Understanding Model Architecture

🔍 **RESEARCH TASK 5**:
- What are "transformer blocks" in GPT-2?
- What is "attention" in the context of neural networks?
- How does this compare to the simple network you built earlier?
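
One way to see what a transformer block contains is to tally its parameters by hand. The sketch below assumes the standard layout of the smallest GPT-2 checkpoint (token and position embeddings, 12 blocks each with two layer norms, an attention layer and a 4x-wide MLP, plus a final layer norm) and an output layer that shares its weights with the token embedding. The default values are assumptions that the `model.config` printout in the next cell should confirm.

```python
def gpt2_parameter_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12):
    """Estimate the parameter count of a GPT-2 style model from its config values."""
    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * n_embd + n_positions * n_embd

    # Each layer norm has a scale and a shift vector.
    layer_norm = 2 * n_embd

    # Attention: one projection to queries/keys/values, one output projection.
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)

    # MLP: expand to 4 * n_embd, then project back down.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)

    block = 2 * layer_norm + attention + mlp
    final_layer_norm = 2 * n_embd

    # The output layer is assumed to reuse the token embedding matrix,
    # so it adds no parameters of its own.
    return embeddings + n_layer * block + final_layer_norm

estimate = gpt2_parameter_count()
print(f"Estimated parameters: {estimate:,}")  # 124,439,808
```

If your `total_params` from the previous cell differs, check which GPT-2 size you loaded. Note that the number of attention heads does not change the count, because the heads split the same projection matrices.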

In [None]:
# Explore model structure
print("🔍 Model Structure Analysis:")
print("=" * 50)

# TODO: Print model configuration
# Hint: Use model.config
config = ____

print(f"Vocabulary size: {config.vocab_size}")
print(f"Maximum sequence length: {config.n_positions}")
print(f"Number of transformer layers: {config.n_layer}")
print(f"Number of attention heads: {config.n_head}")
print(f"Hidden size: {config.n_embd}")

# Compare to your simple network
# Assumes the earlier 2 → 4 → 1 network had one bias per neuron
simple_params = (2 * 4 + 4) + (4 * 1 + 1)  # weights + biases = 17
print("\n🤔 Comparison to Your Neural Network:")
print("Your network had: 2 inputs → 4 hidden → 1 output")
print(f"GPT-2 has: {config.vocab_size} inputs → {config.n_embd} hidden → {config.vocab_size} outputs")
print(f"Your network: {simple_params} parameters")
print(f"GPT-2: {total_params:,} parameters")
print(f"GPT-2 is ~{total_params / simple_params:,.0f}x larger!")

## 3. Text Generation Experiments

Let's generate text and understand how different parameters affect the output.

🔍 **RESEARCH TASK 6**:
- What is "temperature" in text generation?
- What is "top-p" (nucleus) sampling?
- What's the difference between greedy decoding and sampling?
- How do these parameters affect creativity vs. coherence?
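
The difference between greedy decoding and sampling can be seen on a single next-token distribution. The three candidate words and their probabilities below are invented for illustration.

```python
import random
from collections import Counter

# Hypothetical probabilities for the token after "artificial intelligence will".
next_token_probs = {"help": 0.5, "replace": 0.3, "surprise": 0.2}

def greedy_pick(probs):
    """Greedy decoding: always take the single most likely token."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Sampling: draw a token at random, in proportion to its probability."""
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

rng = random.Random(0)
greedy = [greedy_pick(next_token_probs) for _ in range(5)]
sampled = [sample_pick(next_token_probs, rng) for _ in range(1000)]

print("Greedy picks: ", greedy)
print("Sampled counts:", Counter(sampled))
```

Greedy decoding returns the same token every time, while sampling visits the less likely tokens roughly as often as their probabilities suggest.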

In [None]:
# TODO: Create a text generation pipeline
# Hint: Use pipeline('text-generation', model=model, tokenizer=tokenizer)
generator = ____

# Base prompt for experiments
base_prompt = "In the future, artificial intelligence will"

print(f"🤖 Base prompt: '{base_prompt}'")
print("=" * 60)

### Temperature Experiments

🔍 **RESEARCH TASK 7**:
- What happens when temperature = 0?
- What happens when temperature > 1?
- Why might you want different temperatures for different tasks?
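
Temperature is commonly described as dividing the model's raw scores (logits) by a constant before the softmax. The sketch below applies that idea to three made-up logits; it is meant as a numerical illustration, not as the exact code path inside `transformers`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        # Dividing by zero is undefined; in this sketch we simply refuse.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens

for temperature in (0.1, 0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(f"T={temperature}: {[round(p, 3) for p in probs]}")
```

Watch how the probability of the top-scoring token changes as the temperature rises, then compare that with the texts you generate in the next cell.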

In [None]:
# Experiment with different temperatures
temperatures = [0.1, 0.7, 1.0, 1.5]

print("🌡️ Temperature Experiments:")
print("=" * 50)

for temp in temperatures:
    print(f"\n🔥 Temperature: {temp}")
    print("-" * 30)

    # TODO: Generate text with different temperatures
    # Hint: Use generator() with temperature parameter
    result = generator(
        ____,  # prompt
        max_length=____,  # try 60 (counts prompt tokens plus generated tokens)
        temperature=____,  # use the temp variable
        do_sample=____,  # should be True for sampling
        pad_token_id=____  # use tokenizer.eos_token_id
    )

    # TODO: Print the generated text
    generated_text = ____  # Extract from result
    print(generated_text)

print("\n🤔 Discussion Questions:")
print("• Which temperature produced the most coherent text?")
print("• Which was most creative/surprising?")
print("• When might you use each temperature setting?")

### Top-p (Nucleus) Sampling Experiments

🔍 **RESEARCH TASK 8**:
- How does top-p sampling work?
- What's the difference between top-k and top-p sampling?
- Why might top-p be better than just using temperature?
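
The two filtering strategies can be compared on one toy distribution. The sketch below assumes the usual definitions: top-k keeps a fixed number of candidates, while top-p keeps the smallest set of top candidates whose cumulative probability reaches the threshold. The tokens and probabilities are invented.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

probs = {"help": 0.5, "change": 0.25, "replace": 0.15, "sing": 0.07, "melt": 0.03}

print("top_k=2  :", top_k_filter(probs, 2))
print("top_p=0.3:", top_p_filter(probs, 0.3))
print("top_p=0.7:", top_p_filter(probs, 0.7))
```

With top-p, the number of surviving candidates adapts to how confident the distribution is; with top-k it is always the same.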

In [None]:
# Experiment with top-p sampling
top_p_values = [0.3, 0.7, 0.9, 1.0]

print("🎯 Top-p Sampling Experiments:")
print("=" * 50)

for top_p in top_p_values:
    print(f"\n🎲 Top-p: {top_p}")
    print("-" * 30)

    # TODO: Generate text with different top-p values
    result = generator(
        base_prompt,
        max_length=60,
        temperature=0.8,  # Keep temperature constant
        top_p=____,  # Use the top_p variable
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

    generated_text = result[0]['generated_text']
    print(generated_text)

print("\n🤔 Discussion Questions:")
print("• How did the outputs change with different top-p values?")
print("• What's the trade-off between diversity and quality?")

## 4. Prompt Engineering Experiments

The way you phrase your prompt dramatically affects the output.

🔍 **RESEARCH TASK 9**:
- What is "prompt engineering"?
- What are "few-shot" prompts?
- How can prompt structure influence model behavior?
- Research common prompt engineering techniques
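
A few-shot prompt is often just a string that repeats one pattern a few times and leaves the last instance open. The helper below is a hypothetical sketch of that idea; the `Input:`/`Output:` labels are one possible convention, not a requirement of GPT-2.

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble a prompt from a task line, worked examples, and an open query."""
    lines = [task, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # left open for the model to complete
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    task="Rewrite each phrase as a product slogan.",
    examples=[
        ("fast laptop", "Speed that keeps up with you."),
        ("quiet fan", "Cool air, zero noise."),
    ],
    query="long-lasting battery",
)
print(prompt)
```

You could pass `prompt` to the `generator` from Section 3 and compare the result with a zero-shot version that contains only the task line and the query.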

In [None]:
# Different prompt styles to experiment with
prompts_to_test = {
    "Direct": "Write about artificial intelligence:",
    "Question": "What is artificial intelligence and how will it change the world?",
    "Story_Start": "Once upon a time, in a world where artificial intelligence was everywhere,",
    "List_Format": "Here are 5 ways artificial intelligence will change our lives:\n1.",
    "Expert_Persona": "As a leading AI researcher, I believe that artificial intelligence will",
    "Few_Shot": "Technology predictions:\n• The internet will connect everyone (1990s)\n• Smartphones will be everywhere (2000s)\n• Artificial intelligence will"
}

print("✍️ Prompt Engineering Experiments:")
print("=" * 60)

# TODO: Test each prompt style
for style, prompt in prompts_to_test.items():
    print(f"\n📝 Style: {style}")
    print(f"Prompt: '{prompt}'")
    print("-" * 40)

    # TODO: Generate text for each prompt
    result = generator(
        ____,  # use the prompt variable
        max_length=80,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

    generated_text = result[0]['generated_text']
    print(generated_text)
    print("\n" + "="*60)

### Analyzing Prompt Effectiveness

🔍 **RESEARCH TASK 10**:
- Which prompt style produced the most useful output?
- How did the model's "behavior" change with different prompts?
- What makes a good prompt?
- How might this apply to chatbots or AI assistants?
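
Character length is only one way to compare outputs. As an optional extra, the sketch below computes a simple repetition score: the share of word n-grams in a text that are unique. The function name and the choice of whitespace splitting are assumptions for illustration.

```python
def distinct_n(text, n=2):
    """Return the fraction of word n-grams that are unique (1.0 means no repetition)."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

print(distinct_n("the cat the cat the cat"))  # 0.4
print(distinct_n("the cat sat on the mat"))   # 1.0
```

A low score on generated text is a hint that the model is looping, which tends to happen more often at low temperatures.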

In [None]:
# Let's analyze the generated text more systematically
print("📊 Prompt Analysis Exercise:")
print("=" * 50)

# TODO: For each prompt style, generate multiple outputs and analyze
analysis_results = []

for style, prompt in list(prompts_to_test.items())[:3]:  # Only the first 3 styles, to save time
    # Generate 3 outputs for each prompt
    outputs = []

    for i in range(3):
        # TODO: Generate text
        result = generator(
            prompt,
            max_length=60,
            temperature=0.8,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )

        output = result[0]['generated_text']
        outputs.append(output)

    # TODO: Analyze the outputs
    lengths = ____  # Character length of each output
    avg_length = ____  # Average length of the outputs

    analysis_results.append({
        'style': style,
        'prompt': prompt,
        'avg_length': avg_length,
        'outputs': outputs
    })

    print(f"\n{style}:")
    print(f"  Average length: {avg_length:.1f} characters")
    print(f"  Sample output: {outputs[0][:100]}...")

print("\n🤔 Reflection Questions:")
print("• Which prompt style was most consistent?")
print("• Which produced the most relevant outputs?")
print("• How might you improve these prompts?")

## 5. Building Your Text Generation Pipeline

Now let's create a customizable text generation function.

🔍 **RESEARCH TASK 11**:
- What parameters should a good text generation function have?
- How can you make text generation more controllable?
- What are the trade-offs between different generation strategies?
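
One possible design, shown here only as a sketch, keeps the settings in lookup tables and rejects unknown names instead of silently falling back to a default. The temperature and top-p numbers are illustrative assumptions, so try your own values in the exercise below; the lengths follow the hints in the exercise.

```python
# Illustrative values only; tune them based on your own experiments.
STYLE_PRESETS = {
    "creative": {"temperature": 1.1, "top_p": 0.95},
    "balanced": {"temperature": 0.8, "top_p": 0.9},
    "conservative": {"temperature": 0.4, "top_p": 0.7},
}
LENGTH_PRESETS = {"short": 40, "medium": 70, "long": 100}

def resolve_generation_settings(style="balanced", length="medium"):
    """Translate human-friendly names into keyword arguments for a generator."""
    if style not in STYLE_PRESETS:
        raise ValueError(f"Unknown style {style!r}; choose from {sorted(STYLE_PRESETS)}")
    if length not in LENGTH_PRESETS:
        raise ValueError(f"Unknown length {length!r}; choose from {sorted(LENGTH_PRESETS)}")
    return {
        **STYLE_PRESETS[style],
        "max_length": LENGTH_PRESETS[length],
        "do_sample": True,
    }

print(resolve_generation_settings("creative", "short"))
```

Raising an error on a typo such as `"creativ"` makes mistakes visible early, whereas an `else` branch would quietly treat it as balanced. Which behavior is better depends on who calls the function.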

In [None]:
def custom_text_generator(prompt, style="balanced", length="medium"):
    """
    TODO: Create a customizable text generation function

    Args:
        prompt (str): The input prompt
        style (str): "creative", "balanced", or "conservative"
        length (str): "short", "medium", or "long"

    Returns:
        str: Generated text
    """

    # TODO: Set parameters based on style
    if style == "creative":
        temperature = ____  # Higher for creativity
        top_p = ____        # Higher for diversity
    elif style == "conservative":
        temperature = ____  # Lower for consistency
        top_p = ____        # Lower for focus
    else:  # balanced
        temperature = ____  # Medium values
        top_p = ____

    # TODO: Set length based on parameter
    if length == "short":
        max_length = ____  # Try 40
    elif length == "long":
        max_length = ____  # Try 100
    else:  # medium
        max_length = ____  # Try 70

    # TODO: Generate text with the parameters
    result = generator(
        ____,  # prompt
        max_length=____,
        temperature=____,
        top_p=____,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

    return result[0]['generated_text']

# Test your function
test_prompt = "The future of education will be"

print("🧪 Testing Your Text Generator:")
print("=" * 50)

# TODO: Test different combinations
test_combinations = [
    ("creative", "short"),
    ("balanced", "medium"),
    ("conservative", "long")
]

for style, length in test_combinations:
    print(f"\n📝 Style: {style}, Length: {length}")
    print("-" * 30)

    # TODO: Use your function
    output = custom_text_generator(____)
    print(output)
    print(f"Characters: {len(output)}")

## 6. Creative Applications

Let's explore some creative uses of text generation.

🔍 **RESEARCH TASK 12**:
- How is GPT-2 being used in creative writing?
- What are some potential applications for businesses?
- What ethical considerations should we keep in mind?
- How might this technology evolve?

In [None]:
# Creative applications to try
creative_prompts = {
    "Poetry": "Roses are red, violets are blue, artificial intelligence",
    "Story": "It was a dark and stormy night when the AI finally",
    "Product Description": "Introducing the revolutionary new smartphone that",
    "Email": "Dear valued customer, we are excited to announce",
    "Recipe": "How to make the perfect AI-inspired cookies:\nIngredients:\n-",
    "News Headline": "Breaking: Scientists discover that artificial intelligence"
}

print("🎨 Creative Applications:")
print("=" * 50)

# TODO: Generate creative content
for app_type, prompt in creative_prompts.items():
    print(f"\n🖼️ {app_type}:")
    print(f"Prompt: '{prompt}'")
    print("-" * 40)

    # TODO: Choose appropriate style for each application
    if app_type in ["Poetry", "Story"]:
        style = ____  # Should be creative
    elif app_type in ["Product Description", "Email"]:
        style = ____  # Should be conservative
    else:
        style = ____  # Should be balanced

    output = custom_text_generator(prompt, style=style, length="medium")
    print(output)
    print("\n" + "="*50)

## 7. Understanding Limitations

It's important to understand what language models can and cannot do.

🔍 **RESEARCH TASK 13**:
- What is "hallucination" in language models?
- Why might GPT-2 generate biased or incorrect information?
- What are the limitations of autoregressive generation?
- How do these limitations affect real-world applications?
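
Some claims can be checked automatically. The sketch below verifies an arithmetic answer by extracting the first number after the phrase `The answer is` and comparing it with the real product. The function name, the marker phrase, and the sample strings are assumptions for illustration; the sample strings are not real model outputs.

```python
import re

def check_product(generated_text, a, b, marker="The answer is"):
    """Return (expected, claimed, is_correct) for a generated multiplication answer."""
    expected = a * b
    # Only look at the text after the marker, so numbers in the prompt are ignored.
    continuation = generated_text.split(marker, 1)[-1]
    match = re.search(r"-?\d[\d,]*", continuation)
    if match is None:
        return expected, None, False
    claimed = int(match.group().replace(",", ""))
    return expected, claimed, claimed == expected

# Hand-written sample strings standing in for generated text.
print(check_product("What is 47 * 83? The answer is 3901.", 47, 83))
print(check_product("What is 47 * 83? The answer is 4,000 because", 47, 83))
print(check_product("What is 47 * 83? The answer is not known.", 47, 83))
```

Try it on the output of the "Math" test below to see how often GPT-2 gets the product right.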

In [None]:
# Test model limitations
limitation_tests = {
    "Factual Knowledge": "The capital of Fakelandia is",
    "Recent Events": "In 2023, the most important AI breakthrough was",
    "Math": "What is 47 * 83? The answer is",
    "Logic": "If all A are B, and all B are C, then all A are",
    "Consistency": "My favorite color is blue. Later in the conversation, my favorite color is"
}

print("⚠️ Understanding Model Limitations:")
print("=" * 50)

for test_type, prompt in limitation_tests.items():
    print(f"\n🧪 Testing: {test_type}")
    print(f"Prompt: '{prompt}'")
    print("-" * 40)

    # TODO: Generate responses to test limitations
    output = custom_text_generator(
        ____,  # prompt
        style="conservative",  # Use conservative for factual tasks
        length="short"
    )

    print(output)

    # TODO: Analyze the output
    print("🤔 Analysis: Does this look correct/reasonable?")
    print("\n" + "="*50)

print("\n⚠️ Important Reminders:")
print("• Language models can generate plausible-sounding but incorrect information")
print("• Always verify factual claims from AI-generated content")
print("• Be aware of potential biases in training data")
print("• Use AI as a tool to assist, not replace, human judgment")

## 8. Reflection and Next Steps

### What You've Accomplished
✅ **Understood tokenization and text preprocessing**
✅ **Loaded and used a pre-trained language model**
✅ **Experimented with generation parameters**
✅ **Explored prompt engineering techniques**
✅ **Built a customizable text generation pipeline**
✅ **Understood model limitations and ethical considerations**

### Key Insights
🔍 **Discussion Questions**:
- What surprised you most about text generation?
- Which prompt engineering technique was most effective?
- How might you use this in a real project?
- What limitations concerned you most?

In [None]:
# Final experiment: Design your own use case
print("🎯 FINAL CHALLENGE:")
print("Design your own text generation use case!")
print("=" * 50)

# TODO: Create your own application
# Ideas: Story generator, email assistant, creative writing helper, etc.

your_use_case = "____"  # Describe your use case
your_prompt = "____"   # Design your prompt
your_style = "____"    # Choose your style
your_length = "____"   # Choose your length

print(f"📝 Your use case: {your_use_case}")
print(f"📝 Your prompt: '{your_prompt}'")
print(f"📝 Your settings: {your_style}, {your_length}")
print("-" * 50)

# TODO: Generate with your custom settings
your_output = custom_text_generator(____)
print("🎉 Your generated content:")
print(your_output)

print("\n📈 Next Steps:")
print("• Experiment with different prompt formats")
print("• Try combining multiple generation calls")
print("• Think about how to validate or improve outputs")
print("• Consider user interface design for your application")

## 🎉 Congratulations!

You've successfully:
- ✅ Mastered text tokenization and preprocessing
- ✅ Used a pre-trained transformer language model
- ✅ Discovered the art and science of prompt engineering
- ✅ Built your own text generation pipeline
- ✅ Understood the capabilities and limitations of AI text generation
- ✅ Explored creative applications

### Prepare for the Next Notebook
Next, we'll explore computer vision and image processing, applying similar principles to visual data!

**Share with your partner**: What was your most successful text generation experiment?

---
*Text Pipeline Complete - Ready for Computer Vision! 🖼️*