## This notebook requires GPU

This lab must be run in Google Colab to use GPU acceleration. Click the button below to open this notebook in Colab, then set your runtime to GPU:

**Runtime > Change Runtime Type > T4 GPU**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/scott2b/coursera-msds-public/blob/main/notebooks/2_llm_fundamentals_theory.ipynb)

# 🧠 LLM Fundamentals: Generative AI for Content Analysis

## From Textbook Theory to Modern AI Implementation

Welcome to the comprehensive LLM Classification course! This notebook bridges foundational generative AI concepts from **"The Computational Content Analyst"** with cutting-edge LLM architectures and applications.

## 📚 Textbook Integration (Chapter 6 + Appendices)

### Chapter 6: Leveraging Generative AI for Content Analysis
- **LLM Foundations**: Understanding how large language models learn and predict
- **Generative Capabilities**: Beyond classification to content generation and analysis
- **Content Construction**: How LLMs construct meaning from text data
- **Practical Applications**: Real-world content analysis with generative AI
- **Research Methodology**: Integrating LLMs into computational content analysis

### Appendix A: Codebook and Conceptual Definitions
- **Content Categories**: Identity-based, inflammatory, obscene, and threatening language
- **Annotation Guidelines**: Systematic coding schemes for content analysis
- **Ethical Frameworks**: Responsible classification and content moderation
- **Research Standards**: Best practices for reliable content analysis

### Appendix B: Content Moderation Guidelines
- **Platform Policies**: Understanding content deletion and moderation criteria
- **Ethical Boundaries**: Balancing free speech with platform responsibility
- **Research Ethics**: Navigating sensitive topics in computational analysis
- **Practical Considerations**: Real-world content moderation challenges

## 🚀 Technical Excellence with Modern Tools

### What's New in This Update:
- ✅ **State-of-the-Art Architectures**: Recent LLMs (GPT-4, LLaMA 2, PaLM)
- ✅ **Advanced Optimization**: Quantization, distillation, efficient inference
- ✅ **Production Ready**: Scaling, deployment, and monitoring solutions
- ✅ **Ethical AI**: Bias detection, fairness analysis, responsible deployment
- ✅ **Research Integration**: Connecting textbook methodology with modern AI
- ✅ **Comprehensive Evaluation**: Metrics, benchmarking, and validation

## 🎯 Learning Objectives (Textbook + Modern AI)

### Foundational (From Textbook):
1. **Generative AI Theory**: Understand LLM capabilities for content analysis (Chapter 6)
2. **Content Coding**: Master systematic annotation and classification schemes
3. **Ethical Frameworks**: Apply responsible AI practices in content analysis
4. **Research Methodology**: Design studies using generative AI techniques

### Technical (Modern Implementation):
5. **LLM Architectures**: Master transformer models and modern language models
6. **Efficient Inference**: Implement optimized inference pipelines
7. **Advanced Techniques**: Fine-tuning, quantization, and model compression
8. **Production Systems**: Build scalable, monitored LLM applications
9. **Evaluation Methods**: Comprehensive assessment and benchmarking
10. **Ethical Deployment**: Responsible AI practices and bias mitigation

## 📖 Textbook-Modern AI Synergy

This course uniquely combines:
- **📚 Theoretical Foundations** from "The Computational Content Analyst"
- **🤖 Cutting-edge Technology** with latest LLM architectures
- **⚖️ Ethical AI** with comprehensive bias detection and fairness
- **🏭 Production Engineering** with deployment and scaling
- **🔬 Research Methodology** connecting academic and industry applications

> **Key Insight**: Chapter 6 provides the "generative AI foundation" while modern tools provide the "implementation framework"; together they form a complete content analysis workflow.

## 🎓 Course Structure

### Part 1: Foundations (Textbook Chapter 6)
- Generative AI concepts and LLM capabilities
- Content analysis with modern language models
- Ethical considerations and responsible AI
- Research methodology for computational content analysis

### Part 2: Modern Implementation (Appendices + SOTA)
- Advanced LLM architectures and optimization
- Content classification and analysis techniques
- Bias detection and fairness analysis
- Production deployment and scaling

### Part 3: Advanced Applications (Beyond Textbook)
- Multi-modal content analysis
- Real-time inference optimization
- Model interpretability and explainability
- Integration with existing workflows

## 🔬 Key Technical Concepts

### From Chapter 6:
- **LLM Prediction**: Understanding next-token prediction and sequence modeling
- **Content Construction**: How LLMs build meaning from text sequences
- **Generative Applications**: Beyond classification to content creation
- **Research Integration**: Connecting LLMs with content analysis methodology

### From Appendices:
- **Content Categories**: Systematic classification of language types
- **Ethical Boundaries**: Platform policies and moderation guidelines
- **Annotation Standards**: Reliable coding schemes and definitions
- **Research Ethics**: Navigating sensitive content in analysis

## 🏆 Success Metrics

By the end of this course, you will be able to:
- ✅ Apply generative AI for content analysis (Chapter 6 methodology)
- ✅ Implement systematic content coding using textbook frameworks
- ✅ Build ethical LLM applications with bias detection
- ✅ Deploy production-ready content analysis systems
- ✅ Connect academic research with modern AI implementation
- ✅ Navigate ethical challenges in computational content analysis

## 📊 Industry Applications

### Content Analysis Research:
- **Social Media Monitoring**: Automated content classification and analysis
- **News Analysis**: AI-powered news categorization and sentiment analysis
- **Academic Research**: Large-scale content analysis with LLMs
- **Policy Analysis**: Automated policy document analysis and classification

### Business Applications:
- **Content Moderation**: Automated detection of harmful content
- **Brand Monitoring**: Real-time brand mention analysis and classification
- **Customer Service**: Automated ticket classification and routing
- **Market Research**: Consumer feedback analysis and trend identification

## ⚖️ Ethical AI Framework

### From Textbook Appendices:
- **Content Categories**: Identity-based, inflammatory, obscene, threatening language
- **Moderation Guidelines**: Platform policies and deletion criteria
- **Research Ethics**: Handling sensitive topics responsibly
- **Bias Mitigation**: Systematic approaches to fair content analysis

### Modern Implementation:
- **Bias Detection**: Automated identification of discriminatory content
- **Fairness Analysis**: Ensuring equitable treatment across user groups
- **Transparency**: Explainable AI decisions and model interpretability
- **Accountability**: Auditing and monitoring AI system behavior

---

**Ready to leverage generative AI for content analysis? Let's begin with LLM fundamentals! 🚀**

In [None]:
# Install required packages (cross-platform)
import sys, platform, os, subprocess

def pipi(args): subprocess.check_call([sys.executable, "-m", "pip", "install", *args])

gpu_linux = platform.system() == "Linux" and os.path.exists("/proc/driver/nvidia/version")
if gpu_linux:
    pipi(["torch", "torchvision", "torchaudio", "--index-url", "https://download.pytorch.org/whl/cu121"])
else:
    pipi(["torch", "torchvision", "torchaudio"])

pipi(["transformers>=4.43", "datasets", "accelerate>=0.33", "torchinfo", "plotly", "matplotlib", "seaborn", "bertviz"])

if gpu_linux:
    pipi(["bitsandbytes"])


In [None]:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import (
    AutoTokenizer, AutoModel, AutoModelForCausalLM,
    GPT2Tokenizer, GPT2Model, GPT2LMHeadModel,
    BertTokenizer, BertModel, BertForSequenceClassification
)
import plotly.graph_objects as go
import plotly.express as px
from torchinfo import summary
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("🚀 LLM Fundamentals Environment Ready!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

## 📈 Evolution of Language Models

Let's explore how language models have evolved over time and understand the key breakthroughs.

In [None]:
# Evolution timeline of language models.
# Stored as a list of records rather than a dict keyed by year: two entries
# share 2023, and a duplicate dict key would silently drop the first one.
# Parameter counts are rough. The GPT-4 size and architecture are unconfirmed
# public estimates, and embedding-model sizes depend on vocabulary and dimension.
import pandas as pd

lm_evolution = [
    {"year": 2013, "name": "Word2Vec", "params": "~300M", "architecture": "Shallow NN"},
    {"year": 2014, "name": "GloVe", "params": "~400M", "architecture": "Matrix Factorization"},
    {"year": 2015, "name": "LSTM Language Models", "params": "~10M", "architecture": "RNN"},
    {"year": 2017, "name": "Transformer", "params": "~65M", "architecture": "Attention"},
    {"year": 2018, "name": "BERT", "params": "~340M", "architecture": "Bidirectional Transformer"},
    {"year": 2019, "name": "GPT-2", "params": "~1.5B", "architecture": "Autoregressive Transformer"},
    {"year": 2020, "name": "T5", "params": "~11B", "architecture": "Encoder-Decoder"},
    {"year": 2022, "name": "PaLM", "params": "~540B", "architecture": "Dense Transformer"},
    {"year": 2023, "name": "LLaMA 2", "params": "~70B", "architecture": "Optimized Transformer"},
    {"year": 2023, "name": "GPT-4", "params": "~1.7T", "architecture": "Mixture of Experts"},
    {"year": 2024, "name": "Grok-1", "params": "~314B", "architecture": "Mixture of Experts"},
]

# Create visualization
years = [m["year"] for m in lm_evolution]
names = [m["name"] for m in lm_evolution]
params = [m["params"] for m in lm_evolution]

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=years,
    y=list(range(len(years))),
    mode='lines+markers+text',
    text=names,
    textposition="top center",
    name='Model Evolution'
))

fig.update_layout(
    title='Evolution of Language Models (2013-2024)',
    xaxis_title='Year',
    yaxis_title='Progression',
    showlegend=False,
    height=600
)
fig.show()

# Display evolution table
df_evolution = pd.DataFrame(lm_evolution).set_index('year')
df_evolution.index.name = 'Year'
print("\nLanguage Model Evolution:")
print(df_evolution.to_string())

## 🏗️ Transformer Architecture Deep Dive

Let's examine the core components of the transformer architecture, which powers nearly all modern LLMs.

In [None]:
# Load a small transformer model to examine its architecture
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2Model.from_pretrained(model_name)

# Add pad token for GPT-2
tokenizer.pad_token = tokenizer.eos_token

print("GPT-2 Model Architecture:")
print(f"Number of parameters: {model.num_parameters():,}")
print(f"Number of layers: {model.config.n_layer}")
print(f"Hidden size: {model.config.n_embd}")
print(f"Number of attention heads: {model.config.n_head}")
print(f"Vocabulary size: {model.config.vocab_size}")

# Examine model structure
print("\nModel Structure:")
for name, module in model.named_modules():
    if len(name.split('.')) <= 2 and name:  # Show modules up to two levels deep
        print(f"  {name}: {type(module).__name__}")

In [None]:
# Detailed model summary
print("\nDetailed Model Summary:")
print(summary(model, input_size=(1, 10), dtypes=[torch.long]))

## 🎯 Attention Mechanism Deep Dive

The attention mechanism is the heart of transformer models. Let's understand how it works.
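
In the notation of the original transformer paper (listed under Additional Resources), scaled dot-product attention can be written as:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

Here $Q$, $K$, and $V$ are the query, key, and value projections of the input, and $d_k$ is the key dimension. The single-head implementation in the next cell uses `embed_dim` as $d_k$. Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the dimension and saturating the softmax.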

In [None]:
# Implement a simple attention mechanism from scratch
class SimpleAttention(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.embed_dim = embed_dim
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.scale = embed_dim ** 0.5

    def forward(self, x):
        # x shape: (batch_size, seq_len, embed_dim)
        Q = self.query(x)  # (batch, seq, embed)
        K = self.key(x)    # (batch, seq, embed)
        V = self.value(x)  # (batch, seq, embed)

        # Attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        attention_weights = torch.softmax(scores, dim=-1)

        # Apply attention
        attended = torch.matmul(attention_weights, V)

        return attended, attention_weights

# Test the attention mechanism
attention = SimpleAttention(embed_dim=64)
test_input = torch.randn(2, 10, 64)  # batch=2, seq=10, embed=64
output, weights = attention(test_input)

print(f"Input shape: {test_input.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")

# Visualize attention weights
plt.figure(figsize=(8, 6))
sns.heatmap(weights[0].detach().numpy(), cmap='viridis')
plt.title('Attention Weights Visualization')
plt.xlabel('Key Position')
plt.ylabel('Query Position')
plt.show()

## 🔬 Comparing Modern LLM Architectures

Let's compare different LLM architectures and their strengths.

In [None]:
# Compare different model architectures
architectures = {
    'GPT-style (Autoregressive)': {
        'strengths': ['Strong text generation', 'Efficient inference', 'Good at creative tasks'],
        'weaknesses': ['Unidirectional', 'Limited understanding of full context'],
        'use_cases': ['Content generation', 'Code completion', 'Chatbots'],
        'examples': ['GPT-3', 'GPT-4', 'LLaMA']
    },
    'BERT-style (Bidirectional)': {
        'strengths': ['Strong understanding', 'Good at classification', 'Context-aware'],
        'weaknesses': ['Complex training', 'Slower inference', 'Less creative'],
        'use_cases': ['Sentiment analysis', 'Question answering', 'Named entity recognition'],
        'examples': ['BERT', 'RoBERTa', 'ALBERT']
    },
    'T5-style (Encoder-Decoder)': {
        'strengths': ['Versatile', 'Good at translation', 'Structured generation'],
        'weaknesses': ['Complex architecture', 'Higher computational cost'],
        'use_cases': ['Translation', 'Summarization', 'Question answering'],
        'examples': ['T5', 'BART', 'Pegasus']
    },
    'Mixture of Experts (MoE)': {
        'strengths': ['Scalable', 'Efficient routing', 'High performance'],
        'weaknesses': ['Complex routing', 'Higher latency', 'Training challenges'],
        'use_cases': ['Large-scale applications', 'Multilingual tasks'],
        'examples': ['Grok-1', 'Switch Transformer', 'GShard']
    }
}

# Create comparison table
comparison_data = []
for arch, details in architectures.items():
    comparison_data.append({
        'Architecture': arch,
        'Strengths': ', '.join(details['strengths'][:2]),
        'Use Cases': ', '.join(details['use_cases'][:2]),
        'Examples': ', '.join(details['examples'])
    })

df_comparison = pd.DataFrame(comparison_data)
print("LLM Architecture Comparison:")
print(df_comparison.to_string(index=False))
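
The "unidirectional" versus "bidirectional" distinction in the comparison above comes down to which positions each token is allowed to attend to. The following is a minimal sketch in plain Python (no PyTorch, so the arithmetic stays visible). The function name `attention_weights` and the toy score matrix are illustrative assumptions, not part of any library.

```python
import math

def attention_weights(scores, causal=False):
    """Row-wise softmax over a square score matrix given as a list of lists.

    With causal=True, position i may only attend to positions j <= i,
    which is the constraint used by autoregressive (GPT-style) models.
    """
    weights = []
    for i, row in enumerate(scores):
        # Masked positions are set to -inf so they contribute exp(-inf) == 0.
        masked = [s if (not causal or j <= i) else float("-inf")
                  for j, s in enumerate(row)]
        peak = max(masked)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical 3-token score matrix with identical scores everywhere.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
bidirectional = attention_weights(scores)             # BERT-style: every token sees all tokens
causal = attention_weights(scores, causal=True)       # GPT-style: no attention to later tokens

for row in causal:
    print([round(w, 3) for w in row])
```

With equal scores, the bidirectional rows spread weight evenly over all three positions, while each causal row spreads weight only over the current and earlier positions. In PyTorch the same effect is usually obtained by filling the upper triangle of the score matrix with `-inf` before the softmax.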

## 📊 Scaling Laws and Model Sizes

Understanding how model performance scales with size and compute.

In [None]:
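# Scaling laws visualization
# NOTE: the numbers below are illustrative placeholders chosen to show the shape
# of the trend. They are not measured benchmark results.
model_sizes = [0.1, 0.3, 1, 3, 7, 13, 30, 65, 175, 540]  # Billions of parameters
performance = [65, 70, 75, 78, 80, 82, 84, 86, 88, 90]  # Illustrative performance scores
compute_cost = [10**x for x in range(len(model_sizes))]  # Illustrative relative compute (one decade per step)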

# Create scaling plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Performance vs Model Size
ax1.plot(model_sizes, performance, 'bo-', linewidth=2, markersize=8)
ax1.set_xlabel('Model Size (Billions of Parameters)')
ax1.set_ylabel('Performance Score')
ax1.set_title('Scaling Laws: Performance vs Model Size')
ax1.grid(True, alpha=0.3)
ax1.set_xscale('log')

# Performance vs Compute Cost
ax2.plot(compute_cost, performance, 'ro-', linewidth=2, markersize=8)
ax2.set_xlabel('Compute Cost (Relative)')
ax2.set_ylabel('Performance Score')
ax2.set_title('Scaling Laws: Performance vs Compute')
ax2.grid(True, alpha=0.3)
ax2.set_xscale('log')

plt.tight_layout()
plt.show()

# Key scaling insights
print("\n🔑 Key Scaling Law Insights:")
print("1. Test loss falls roughly as a power law in model size, dataset size, and compute")
print("2. Larger models are more sample-efficient")
print("3. Because the trend is smooth, loss at larger scale can be extrapolated from smaller runs")
print("4. Compute-optimal training balances model size against the number of training tokens")
print("5. For transformers, loss depends only weakly on shape (depth vs. width) at a fixed parameter count")
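
To make the trade-off between model size and training tokens concrete, here is a rough back-of-the-envelope sketch. It assumes the common approximation that training compute is about 6 FLOPs per parameter per token (C ≈ 6·N·D), and it assumes a budget of about 20 training tokens per parameter. That ratio is a widely quoted rule of thumb rather than something established in this notebook. The function names are illustrative.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute with the rule of thumb C ≈ 6 * N * D."""
    return 6 * n_params * n_tokens

def compute_optimal_tokens(n_params, tokens_per_param=20):
    """Heuristic token budget; the default ratio of 20 is an assumed rule of thumb."""
    return n_params * tokens_per_param

for billions in [1, 7, 70]:
    n = billions * 10**9
    d = compute_optimal_tokens(n)
    print(f"{billions:>3}B params | {d:.1e} tokens | {training_flops(n, d):.1e} FLOPs")
```

Under these assumptions the token budget grows in proportion to the parameter count, so estimated compute grows with the square of model size.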

## 🗜️ Quantization and Model Compression

Learn how to reduce model size while maintaining performance.

In [None]:
# Demonstrate quantization techniques
def demonstrate_quantization():
    """Show different quantization approaches"""

    # Original model
    model_fp32 = GPT2Model.from_pretrained("gpt2")
    original_size = sum(p.numel() * p.element_size() for p in model_fp32.parameters())

    print(f"Original model size: {original_size / (1024**2):.2f} MB")

    # Theoretical weight-only sizes at each bit width (no model is actually quantized here)
    quantization_levels = {
        'FP32': 32,
        'FP16': 16,
        'INT8': 8,
        'INT4': 4,
        'INT2': 2,
        'INT1': 1
    }

    print("\nQuantization Comparison:")
    print("Level  | Bits | Size Reduction | Memory Savings")
    print("-" * 45)

    for level, bits in quantization_levels.items():
        size_reduction = 32 / bits
        memory_savings = (1 - 1/size_reduction) * 100
        quantized_size = original_size / size_reduction
        print(f"{level:6} | {bits:4} | {size_reduction:2.1f}x          | {memory_savings:5.1f}%")
        print(f"       |      | {quantized_size/(1024**2):6.2f} MB     |")
        print()

demonstrate_quantization()

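# Load a model with 4-bit NF4 weights via bitsandbytes.
# This needs a CUDA GPU; otherwise the except branch below reports the failure.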
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

print("\n🚀 Loading quantized model...")
try:
    quantized_model = GPT2Model.from_pretrained(
        "gpt2",
        quantization_config=quantization_config,
        device_map="auto"
    )
    print("✅ Quantized model loaded successfully!")
    print(f"Model device: {next(quantized_model.parameters()).device}")
except Exception as e:
    print(f"Note: Quantization demo requires compatible hardware: {e}")
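
The size table above only counts bits. To see what quantization does to the values themselves, here is a minimal sketch of symmetric "absmax" 8-bit quantization in plain Python. It is an illustration of the general idea under simple assumptions (one scale for the whole tensor, round to nearest). It is not how the NF4 scheme in bitsandbytes works, and the function names and weight values are hypothetical.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using a single per-tensor scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [q * scale for q in quantized]

weights = [0.12, -0.53, 0.99, -1.27, 0.0]  # hypothetical weight values
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(f"scale={scale:.4f}, max round-trip error={max_error:.4f}")
```

Each value is stored as one small integer plus a shared scale, and the round-trip error is bounded by half the scale. Lower bit widths use fewer integer levels, so the same idea gives larger errors, which is the trade-off behind the size savings.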

## 🎨 Tokenization and Embeddings

Understanding how text is converted to numerical representations.

In [None]:
# Explore tokenization in depth
text = "Hello, I'm learning about Large Language Models! This is fascinating. 🤖"

# Compare different tokenizers
tokenizers = {
    'GPT-2': GPT2Tokenizer.from_pretrained('gpt2'),
    'BERT': BertTokenizer.from_pretrained('bert-base-uncased')
}

for name, tokenizer in tokenizers.items():
    print(f"\n🔤 {name} Tokenization:")
    print(f"Original text: {text}")

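    # Tokenize. encode() also adds any special tokens the tokenizer defines
    # (BERT wraps the text in [CLS] ... [SEP]), so token_ids may be longer than tokens.
    tokens = tokenizer.tokenize(text)
    token_ids = tokenizer.encode(text)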

    print(f"Tokens: {tokens}")
    print(f"Token IDs: {token_ids}")
    print(f"Number of tokens: {len(tokens)}")
    print(f"Vocabulary size: {tokenizer.vocab_size}")

    # Decode back
    decoded = tokenizer.decode(token_ids)
    print(f"Decoded: {decoded}")

# Visualize tokenization differences
gpt2_tokens = tokenizers['GPT-2'].tokenize(text)
bert_tokens = tokenizers['BERT'].tokenize(text)

print("\n📊 Tokenization Comparison:")
print(f"GPT-2 tokens: {len(gpt2_tokens)} - {gpt2_tokens}")
print(f"BERT tokens: {len(bert_tokens)} - {bert_tokens}")
print(f"Difference: {len(gpt2_tokens) - len(bert_tokens)} tokens")
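
The two tokenizers split the same text differently because they learn their vocabularies with different subword algorithms (GPT-2 uses byte-level BPE, BERT uses WordPiece). As a rough illustration of the BPE idea only, the sketch below repeatedly merges the most frequent adjacent pair of symbols in a toy corpus. The corpus, counts, and function names are hypothetical, and real tokenizers add byte-level handling, pre-tokenization, and many more merges.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Hypothetical toy corpus: each word is a tuple of characters with a count.
words = {tuple("lower"): 2, tuple("lowest"): 1, tuple("low"): 5}
for step in range(2):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(f"merge {step + 1}: {pair} -> {list(words)}")
```

After two merges the frequent string "low" has become a single symbol, while rarer endings remain split into characters. This is why common words tend to map to one token and rare words to several.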

In [None]:
# Explore embeddings
model = GPT2Model.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

# Get embeddings for a simple sentence
sentence = "The cat sat on the mat"
inputs = tokenizer(sentence, return_tensors='pt', padding=True, truncation=True)

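with torch.no_grad():
    outputs = model(**inputs)
    # last_hidden_state holds contextual representations from the final layer,
    # not the static input embeddings stored in model.wte
    embeddings = outputs.last_hidden_state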

print(f"Input sentence: {sentence}")
print(f"Tokenized: {tokenizer.tokenize(sentence)}")
print(f"Embedding shape: {embeddings.shape}")
print(f"Embedding dimension: {embeddings.shape[-1]}")

# Visualize embedding similarities
tokens = tokenizer.tokenize(sentence)
token_embeddings = embeddings[0]  # Remove batch dimension

# Calculate cosine similarities between token embeddings
cosine_similarities = torch.nn.functional.cosine_similarity(
    token_embeddings.unsqueeze(1),
    token_embeddings.unsqueeze(0),
    dim=-1
)

plt.figure(figsize=(10, 8))
sns.heatmap(cosine_similarities.numpy(),
            xticklabels=tokens,
            yticklabels=tokens,
            cmap='coolwarm',
            center=0)
plt.title('Token Embedding Similarities')
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

## 🎯 Hands-on Exercise: Building a Simple Transformer

Let's build a simplified transformer from scratch to understand the internals.

In [None]:
class SimpleTransformerBlock(nn.Module):
    """A simplified transformer block"""
    def __init__(self, embed_dim, num_heads, ff_dim):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim)
        )

        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        # Self-attention with residual connection
        attn_output, _ = self.attention(x, x, x)
        x = self.norm1(x + self.dropout(attn_output))

        # Feed-forward with residual connection
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))

        return x

# Test the transformer block
transformer = SimpleTransformerBlock(embed_dim=64, num_heads=4, ff_dim=128)
test_input = torch.randn(2, 10, 64)  # batch=2, seq=10, embed=64
output = transformer(test_input)

print(f"Input shape: {test_input.shape}")
print(f"Output shape: {output.shape}")
print("Transformer block created successfully!")

# Count parameters
total_params = sum(p.numel() for p in transformer.parameters())
print(f"Total parameters: {total_params:,}")

# Model summary
print("\nModel Architecture:")
print(transformer)
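
As a cross-check on the parameter count printed above, the total can be worked out by hand. The sketch below assumes the defaults used in `SimpleTransformerBlock` (bias terms on every linear layer, a packed Q/K/V projection plus an output projection inside `nn.MultiheadAttention`, and a weight and bias per `LayerNorm`). The helper name `transformer_block_params` is illustrative.

```python
def transformer_block_params(embed_dim, ff_dim):
    """Count learnable parameters of a block with biases on every linear layer."""
    d = embed_dim
    # Packed Q/K/V projection (3*d*d weights + 3*d biases) plus output projection.
    attention = (3 * d * d + 3 * d) + (d * d + d)
    # Two linear layers: d -> ff_dim and ff_dim -> d.
    feed_forward = (d * ff_dim + ff_dim) + (ff_dim * d + d)
    # Weight and bias vectors for each of the two LayerNorm modules.
    layer_norms = 2 * (2 * d)
    return {
        "attention": attention,
        "feed_forward": feed_forward,
        "layer_norms": layer_norms,
        "total": attention + feed_forward + layer_norms,
    }

counts = transformer_block_params(embed_dim=64, ff_dim=128)
for part, n in counts.items():
    print(f"{part:>12}: {n:,}")
```

The number of heads does not appear in the formula: splitting the embedding across heads changes how the projections are used, not how many weights they contain.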

## 📚 Key Takeaways

1. **Evolution**: LLMs have evolved from simple word embeddings to massive transformer models
2. **Architecture**: Nearly all modern LLMs are based on the transformer architecture
3. **Attention**: Attention lets every token attend directly to every other token, which makes long-range dependencies easier to model than in RNNs
4. **Scaling**: Loss falls predictably, roughly as a power law, with model size, data, and compute
5. **Quantization**: Storing weights at lower precision shrinks models enough to deploy on resource-constrained devices, at some cost in accuracy
6. **Tokenization**: The tokenizer determines how text is split into model inputs, which affects sequence length and vocabulary coverage
7. **Embeddings**: Dense vector representations capture semantic meaning

## 🚀 Next Steps

Now that you understand LLM fundamentals, proceed to:
- **Notebook 2**: Efficient Inference with vLLM
- **Notebook 3**: Advanced Fine-tuning with Unsloth
- **Notebook 4**: Production Deployment and Scaling
- **Notebook 5**: Evaluation, Benchmarking, and Ethics

## 🔗 Additional Resources

- [Attention Is All You Need](https://arxiv.org/abs/1706.03762) - Original transformer paper
- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165) - GPT-3 paper
- [Scaling Laws for Neural Language Models](https://arxiv.org/abs/2001.08361)
- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/index)

## 🎯 Quiz Time!

Test your understanding of LLM fundamentals:
1. What are the three main components of a transformer block?
2. How does attention differ from traditional RNN approaches?
3. What are the trade-offs between different quantization levels?
4. Why do larger models generally perform better?
5. What role does tokenization play in LLM performance?