# LLM Basics: Understanding Large Language Models

Welcome to your first hands-on experience with Large Language Models! This notebook will guide you through the fundamental concepts and provide practical examples.

## Learning Objectives
By the end of this notebook, you will:
1. Understand what LLMs are and how they work
2. Learn about model sizes and parameter counts
3. Explore different model formats (GGUF, SafeTensors)
4. Calculate memory requirements for different models
5. Set up your environment for LLM work

In [None]:
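# Import required libraries
# (install any that are missing with: pip install torch transformers numpy matplotlib psutil gputil)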
import torch
import numpy as np
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModel
import psutil
import GPUtil

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA version:", torch.version.cuda)
    print("GPU:", torch.cuda.get_device_name(0))

## 1. Understanding Model Parameters

Let's start by understanding what we mean by "parameters" in LLMs and how model size affects memory requirements.
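To make the idea of a "parameter" concrete, the sketch below counts the weights and biases of a single fully connected layer by hand and converts that count into memory. The helper name `linear_layer_params` and the layer sizes (768 inputs, 3072 outputs) are illustrative assumptions, not values taken from a specific model in this notebook.

```python
def linear_layer_params(in_features, out_features, bias=True):
    """Number of learnable values in a dense layer y = Wx + b."""
    weights = in_features * out_features   # one weight per input/output pair
    biases = out_features if bias else 0   # one bias per output
    return weights + biases

# Hypothetical layer sizes chosen only for illustration
n_params = linear_layer_params(768, 3072)

# Each parameter is stored as one number: 4 bytes in float32, 2 bytes in float16
fp32_mb = n_params * 4 / (1024 ** 2)
fp16_mb = n_params * 2 / (1024 ** 2)

print(f"Parameters: {n_params:,}")
print(f"FP32: {fp32_mb:.2f} MB, FP16: {fp16_mb:.2f} MB")
```

A full LLM is built from many such layers, so its parameter count is the sum of counts like this one across the whole network.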

In [None]:
def calculate_model_memory(num_parameters, precision_bits=32):
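    """
    Calculate the memory needed to store a model's weights
    
    This covers the weights only. Activations, the KV cache, and
    optimizer state (during training) need additional memory.
    
    Args:
        num_parameters: Number of parameters in the model
        precision_bits: Bits per parameter (32 for float32, 16 for float16, etc.)
    
    Returns:
        Tuple of (bytes, MB, GB), using binary units (1 MB = 1024**2 bytes)
    """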
    bytes_per_param = precision_bits / 8
    total_bytes = num_parameters * bytes_per_param
    total_mb = total_bytes / (1024 * 1024)
    total_gb = total_mb / 1024
    
    return total_bytes, total_mb, total_gb

# Calculate memory for popular model sizes
model_sizes = {
    "Small Model": 1.5e9,    # 1.5B parameters
    "CodeLlama-7B": 7e9,     # 7B parameters  
    "Llama-13B": 13e9,       # 13B parameters
    "Llama-30B": 30e9,       # 30B parameters
    "GPT-3": 175e9           # 175B parameters
}

print("Memory Requirements for Different Model Sizes:")
print("=" * 60)
print(f"{'Model':<15} {'Parameters':<12} {'FP32 (GB)':<10} {'FP16 (GB)':<10} {'INT8 (GB)':<10}")
print("-" * 60)

for name, params in model_sizes.items():
    _, _, gb_32 = calculate_model_memory(params, 32)
    _, _, gb_16 = calculate_model_memory(params, 16)
    _, _, gb_8 = calculate_model_memory(params, 8)
    
    print(f"{name:<15} {params/1e9:>8.1f}B    {gb_32:>6.1f}     {gb_16:>6.1f}     {gb_8:>6.1f}")

## 2. Working with Hugging Face Models

Let's explore how to download and inspect actual models from Hugging Face Hub.

In [None]:
# Check system resources
def check_system_resources():
    """Display current system resources"""
    print("System Resources:")
    print(f"CPU cores: {psutil.cpu_count()}")
    print(f"RAM: {psutil.virtual_memory().total / (1024**3):.1f} GB")
    
    if torch.cuda.is_available():
        gpus = GPUtil.getGPUs()
        for i, gpu in enumerate(gpus):
            print(f"GPU {i}: {gpu.name}")
            print(f"  Memory: {gpu.memoryTotal} MB total, {gpu.memoryFree} MB free")
    else:
        print("No CUDA GPUs available")

check_system_resources()

In [None]:
# Download and inspect a small model
model_name = "microsoft/DialoGPT-small"  # Small model for demonstration

print(f"Loading tokenizer for {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Add padding token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"Vocabulary size: {tokenizer.vocab_size:,}")
print(f"Model max length: {tokenizer.model_max_length:,}")

# Test tokenization
sample_text = "Hello, how are you today?"
tokens = tokenizer.encode(sample_text)
print(f"\nSample text: '{sample_text}'")
print(f"Tokens: {tokens}")
print(f"Decoded: '{tokenizer.decode(tokens)}'")

## 3. Understanding Model Architecture

Let's examine the structure of a transformer model and count its parameters.
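Before loading real weights, it can help to estimate the parameter count from a handful of hyperparameters. The sketch below does this for a GPT-2 style decoder. It assumes the model follows the GPT-2 small layout (vocabulary 50,257, 1,024 positions, hidden size 768, 12 layers); that is an assumption about `microsoft/DialoGPT-small`, not something stated in this notebook, and `estimate_gpt2_params` is a hypothetical helper name.

```python
def estimate_gpt2_params(vocab_size, n_positions, d_model, n_layers):
    """Estimate parameters of a GPT-2 style decoder (base model, no separate output head)."""
    # Token embeddings plus learned position embeddings
    embeddings = vocab_size * d_model + n_positions * d_model
    # Attention: fused Q/K/V projection, then the output projection (weights + biases)
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward block: expand to 4 * d_model, then project back (weights + biases)
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two layer norms per block, each with a scale and a shift vector
    layer_norms = 2 * (2 * d_model)
    per_layer = attention + mlp + layer_norms
    final_norm = 2 * d_model
    return embeddings + n_layers * per_layer + final_norm

# Assumed GPT-2 small hyperparameters
estimate = estimate_gpt2_params(vocab_size=50257, n_positions=1024, d_model=768, n_layers=12)
print(f"Estimated parameters: {estimate:,}")
```

If the assumed hyperparameters match the loaded model, this estimate should be close to the total reported by `count_parameters`. Most of the count comes from the repeated transformer blocks and the token embedding table.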

In [None]:
# Load a small model to examine its architecture
# (AutoModel loads the base transformer without a task-specific head)
print(f"Loading model {model_name}...")
model = AutoModel.from_pretrained(model_name)

# Count parameters
def count_parameters(model):
    """Count trainable and total parameters in a model"""
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    return trainable_params, total_params

trainable, total = count_parameters(model)
print(f"Trainable parameters: {trainable:,}")
print(f"Total parameters: {total:,}")

# Examine model structure
print("\nModel architecture:")
print(model)

## 4. Model Formats and Quantization

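Model files come in different formats: SafeTensors is the tensor storage format commonly used on Hugging Face Hub, and GGUF is the format used by llama.cpp, often for quantized models. Quantization stores each weight with fewer bits, which shrinks the model at the cost of some precision. Let's simulate how that precision loss grows as the bit width drops.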
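As a small worked example of the idea, the sketch below quantizes four made-up weights to signed integer codes and then restores them, using one shared scale for the whole list (symmetric quantization). The function name `quantize_symmetric` and the sample values are assumptions for illustration. Real quantization schemes typically add refinements such as per-block scales.

```python
def quantize_symmetric(weights, bits):
    """Map floats to signed integer codes and back, using one scale for all weights."""
    max_abs = max(abs(w) for w in weights)
    levels = 2 ** (bits - 1) - 1                 # e.g. 127 for 8 bits, 7 for 4 bits
    scale = levels / max_abs
    codes = [round(w * scale) for w in weights]  # integers that would be stored
    restored = [c / scale for c in codes]        # approximate floats after dequantizing
    return codes, restored, scale

weights = [0.12, -0.05, 0.30, -0.27]  # made-up example values

for bits in (8, 4):
    codes, restored, scale = quantize_symmetric(weights, bits)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit codes: {codes}, worst-case error: {worst:.5f}")
```

The rounding error per weight is at most half a quantization step (`0.5 / scale`), so halving the bit width makes the steps much coarser and the error correspondingly larger.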

In [None]:
# Simulate quantization effects
def simulate_quantization_effects():
    """Demonstrate how quantization bit width affects weight reconstruction error"""
    
    # Create sample weights (simulating a small portion of a model)
    np.random.seed(42)
    original_weights = np.random.normal(0, 0.1, 1000)
    
    # Quantize to different precisions
    def quantize_weights(weights, bits):
        """Symmetric integer quantization: snap weights to a grid with 2**(bits-1) - 1 steps per sign"""
        max_val = np.max(np.abs(weights))
        scale = (2**(bits-1) - 1) / max_val
        quantized = np.round(weights * scale) / scale
        return quantized
    
    # Test different bit widths (16 here means a 16-bit integer grid, not the FP16 float format)
    precisions = [32, 16, 8, 4]
    results = {}
    
    for bits in precisions:
        if bits == 32:
            quantized = original_weights  # No quantization
        else:
            quantized = quantize_weights(original_weights, bits)
        
        # Calculate error
        mse = np.mean((original_weights - quantized)**2)
        results[bits] = {'weights': quantized, 'mse': mse}
    
    # Plot results
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    
    # Plot weight distributions
    ax1.hist(original_weights, bins=50, alpha=0.7, label='Original (FP32)', density=True)
    ax1.hist(results[8]['weights'], bins=50, alpha=0.7, label='Quantized (8-bit)', density=True)
    ax1.set_xlabel('Weight Value')
    ax1.set_ylabel('Density')
    ax1.set_title('Weight Distribution: Original vs Quantized')
    ax1.legend()
    
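    # Plot quantization error. The unquantized 32-bit baseline has an MSE of
    # exactly 0, which cannot be drawn on a logarithmic axis, so it is skipped.
    bits_list = [bits for bits in results if results[bits]['mse'] > 0]
    mse_list = [results[bits]['mse'] for bits in bits_list]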
    
    ax2.plot(bits_list, mse_list, 'bo-')
    ax2.set_xlabel('Bits per Parameter')
    ax2.set_ylabel('Mean Squared Error')
    ax2.set_title('Quantization Error vs Precision')
    ax2.set_yscale('log')
    ax2.grid(True)
    
    plt.tight_layout()
    plt.show()
    
    return results

quantization_results = simulate_quantization_effects()

## 5. Practical Exercise: Model Selection

Based on what we've learned, let's create a function to help choose the right model for your hardware.
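A quick way to reason about fit is to compare the weight memory, plus some headroom, against the memory you have. The sketch below does that for a 7B-parameter model at several precisions. The helper `fits_in_memory` and its `overhead=1.2` factor are assumptions for illustration: actual overhead depends on context length, batch size, and the inference library.

```python
def fits_in_memory(num_parameters, precision_bits, available_gb, overhead=1.2):
    """Return (fits, required_gb) for model weights plus an assumed overhead factor."""
    weights_gb = num_parameters * precision_bits / 8 / (1024 ** 3)
    required_gb = weights_gb * overhead  # overhead=1.2 is an assumed rule of thumb
    return required_gb <= available_gb, required_gb

# Would a 7B-parameter model fit on hypothetical 8 GB and 16 GB GPUs?
for available in (8, 16):
    for bits in (16, 8, 4):
        ok, need = fits_in_memory(7e9, bits, available)
        status = "fits" if ok else "too large"
        print(f"{available} GB GPU, {bits}-bit: needs ~{need:.1f} GB -> {status}")
```

Under these assumptions, lowering the precision is often what moves a model from "too large" to "fits" on a smaller GPU.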

In [None]:
def recommend_model(available_memory_gb, task_type="general"):
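    """
    Recommend suitable models based on available GPU memory
    
    Args:
        available_memory_gb: Available GPU memory in GB
        task_type: Type of task ("general", "coding", "vision"), or "any"
            to list every model that fits. "general" models are always included.
    
    Returns:
        List of (model_name, size_gb, capability, notes) tuples
    """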
    
    models = {
        # Model name: (size_gb_fp16, capability, notes)
        "microsoft/DialoGPT-small": (0.5, "general", "Great for learning"),
        "nvidia/Nemotron-Research-Reasoning-Qwen-1.5B": (3, "coding", "Perfect for beginners"),
        "codellama/CodeLlama-7b-hf": (14, "coding", "Professional coding tasks"),
        "mistralai/Mistral-7B-Instruct-v0.2": (14, "general", "Efficient instruction following"),
        "llava-hf/llava-v1.6-mistral-7b-hf": (16, "vision", "Vision + text understanding")
    }
    
    recommendations = []
    
    print(f"Recommendations for {available_memory_gb}GB GPU memory:")
    print("=" * 50)
    
    for model_name, (size, capability, notes) in models.items():
        if size <= available_memory_gb:
            if task_type == "any" or capability == task_type or capability == "general":
                recommendations.append((model_name, size, capability, notes))
    
    if recommendations:
        for model_name, size, capability, notes in recommendations:
            print(f"✅ {model_name}")
            print(f"   Size: {size}GB, Capability: {capability}")
            print(f"   Note: {notes}\n")
    else:
        print("❌ No suitable models found for your hardware configuration.")
        print("Consider using model quantization or cloud-based solutions.")
    
    return recommendations

# Test with different memory configurations
for memory in [4, 8, 16, 24]:
    print(f"\n{'='*60}")
    recommend_model(memory, "any")

## 6. Next Steps

Congratulations! You've learned the fundamentals of LLMs. Here's what to explore next:

### Immediate Next Steps:
1. **Try the second notebook**: `02_model_inference.ipynb` - Learn to run actual model inference
2. **Experiment with quantization**: Try loading models with different precision levels
3. **Explore Hugging Face Hub**: Browse different models and their documentation

### Advanced Topics (for later):
1. **Fine-tuning**: Adapt models to specific tasks
2. **LoRA**: Efficient fine-tuning techniques  
3. **Model deployment**: Serve models in production
4. **Custom architectures**: Build your own transformer models

### Resources for Continued Learning:
- [Hugging Face Course](https://huggingface.co/course/)
- [PyTorch Tutorials](https://pytorch.org/tutorials/)
- [Papers With Code](https://paperswithcode.com/area/natural-language-processing)

Happy learning! 🚀

In [None]:
# Final exercise: Calculate what you could run on your system
print("Your Learning Summary:")
print("=" * 40)

# If GPU is available, show actual specs
if torch.cuda.is_available():
    gpu_memory_mb = torch.cuda.get_device_properties(0).total_memory / (1024**2)
    gpu_memory_gb = gpu_memory_mb / 1024
    print(f"Your GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {gpu_memory_gb:.1f} GB")
    
    print(f"\nWhat you can run:")
    recommend_model(gpu_memory_gb * 0.8, "any")  # Use 80% of memory for safety
else:
    print("No GPU detected - you can still experiment with:")
    print("- Small models on CPU")
    print("- Cloud platforms (Google Colab, Kaggle)")
    print("- Quantized models")

print("\n🎉 Congratulations on completing LLM Basics!")