# Week 1: vLLM & Qwen2.5 - Complete Learning Guide

## 🎯 Learning Objectives

By completing this notebook, you will:
1. **Set up and verify** your vLLM environment with GPU support
2. **Download and cache** the Qwen2.5-7B-Instruct model from HuggingFace
3. **Perform basic inference** using vLLM's LLM class
4. **Understand sampling parameters** and how they affect text generation
5. **Experiment systematically** with different parameter combinations
6. **Analyze results** to determine best practices for different use cases

## 📚 Structure

This notebook covers 6 main sections (matching the week1-setup scripts):
1. **Environment Verification** (`verify_environment.py`)
2. **Model Download** (`download_model.py`)
3. **Baseline Inference** (`baseline_inference.py`)
4. **Sampling Parameters** (`experiment_sampling_params.py`)
5. **Results Comparison** (`compare_results.py`)
6. **Custom Testing** (`test_custom_params.py`)

## 🚨 Learning Approach

- **Learn by implementing**: Don't copy from the original scripts!
- **Follow step-by-step guides**: Each function has detailed instructions
- **Test incrementally**: Run cells as you complete them
- **Check expected outputs**: Verify your implementation
- **Learn from pitfalls**: Common mistakes are highlighted

---

# Part 1: Environment Verification

## 🎓 What You'll Learn
- Programmatically check Python version
- Verify PyTorch and CUDA installation
- Inspect GPU properties using PyTorch
- Check required libraries (vLLM, HuggingFace Hub)
- Create comprehensive validation reports

## 📖 Background
Before running LLM inference, ensure:
- Python 3.9+ is installed
- PyTorch with CUDA support
- CUDA can access your GPU(s)
- vLLM and HuggingFace Hub are available

---

## Function 1.1: `check_python_version()`

### 🎯 Goal
Create a function that checks if Python version is 3.9 or higher.

### 📋 Step-by-Step Instructions

**Step 1:** Import the `sys` module at the top of your notebook
- Hint: `sys` is a built-in module, no installation needed

**Step 2:** Inside the function, access the Python version information
- Use `sys.version_info` to get version details
- This returns an object with attributes: `major`, `minor`, `micro`

**Step 3:** Print a header message
- Use an emoji like 🔍 to make it visually clear
- Print "Checking Python version..."

**Step 4:** Print the current Python version
- Format: "Python version: {major}.{minor}.{micro}"
- Use f-strings for formatting

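**Step 5:** Check if version meets requirements
- Requirement: Python 3.9 or higher
- Check: `version >= (3, 9)` (`sys.version_info` compares like a tuple)
- Why compare both parts? Checking only `minor >= 9` would accept a hypothetical Python 2.9, and `major == 3 and minor >= 9` would reject a future 4.x; the tuple comparison handles both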

**Step 6:** Print success or failure message
- If compatible: Print "✓ Python version is compatible (3.9+)"
- If not: Print "✗ Python version must be 3.9 or higher"
- Use proper indentation (3 spaces) to align with other output

**Step 7:** Return the result
- Return `True` if compatible
- Return `False` if not compatible

### ✅ Expected Output
```
🔍 Checking Python version...
   Python version: 3.10.12
   ✓ Python version is compatible (3.9+)
```

### ⚠️ Common Pitfalls
1. **Forgetting to check major version**: Checking only `minor >= 9` would wrongly accept Python 2.9 (if it existed)
2. **Wrong attribute names**: It's `version_info.major`, not `version_info.version`
3. **Not returning a boolean**: The function should return True/False for later use

### 🧪 Test Your Function
After implementing, run:
```python
result = check_python_version()
assert isinstance(result, bool), "Function should return a boolean"
print(f"\nFunction returned: {result}")
```
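
For comparison once you have written your own version, here is a minimal sketch of the steps above. It assumes a tuple comparison against `sys.version_info`; the `minimum` parameter is a hypothetical addition that makes the function easy to test and is not part of the original script.

```python
import sys


def check_python_version(minimum=(3, 9)):
    """Return True if the running interpreter is at least `minimum`."""
    version = sys.version_info
    print("🔍 Checking Python version...")
    print(f"   Python version: {version.major}.{version.minor}.{version.micro}")

    # sys.version_info compares like a tuple, so 3.10+ and any 4.x also pass.
    compatible = version >= minimum
    if compatible:
        print(f"   ✓ Python version is compatible ({minimum[0]}.{minimum[1]}+)")
    else:
        print(f"   ✗ Python version must be {minimum[0]}.{minimum[1]} or higher")
    return compatible
```

Calling `check_python_version()` with no arguments applies the 3.9 requirement from this guide.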

In [None]:
# TODO: Import necessary modules
import sys

# TODO: Implement check_python_version() function
def check_python_version():
    """Check Python version is 3.9+"""
    # Step 1: Access version info
    
    # Step 2: Print header
    
    # Step 3: Print current version
    
    # Step 4: Check requirements
    
    # Step 5: Print result message
    
    # Step 6: Return boolean
    pass

# TODO: Test your function
# result = check_python_version()
# assert isinstance(result, bool)
# print(f"\nTest passed! Result: {result}")

## Function 1.2: `check_pytorch()`

### 🎯 Goal
Check if PyTorch is installed and display its version.

### 📋 Step-by-Step Instructions

**Step 1:** Print a header with proper spacing
- Use `\n` before the emoji for spacing
- Print "🔍 Checking PyTorch..."

**Step 2:** Try to import PyTorch
- Use a try-except block to handle ImportError
- Try: `import torch`
- This is the safe way to check if a library is installed

**Step 3:** If import succeeds, get version info
- Access `torch.__version__` to get version string
- Print: "PyTorch version: {version}"
- Print: "✓ PyTorch is installed"
- Return `True`

**Step 4:** If import fails, provide helpful guidance
- Catch `ImportError`
- Print: "✗ PyTorch is not installed"
- Print installation command: "Run: pip install torch --index-url https://download.pytorch.org/whl/cu121"
- Return `False`

### ✅ Expected Output (if PyTorch installed)
```
🔍 Checking PyTorch...
   PyTorch version: 2.1.0+cu121
   ✓ PyTorch is installed
```

### ⚠️ Common Pitfalls
1. **Importing at module level**: Import inside the function so ImportError can be caught
2. **Not providing installation instructions**: Users need to know how to fix the issue
3. **Forgetting indentation**: Error messages should be indented with 3 spaces for alignment
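
The try/except import pattern is the same for every library checked in this part, so it can be sketched once in a generic form. The helper name `check_library` and its parameters are hypothetical (they do not appear in the original scripts); it uses the standard library's `importlib.import_module` in place of a literal `import` statement.

```python
import importlib


def check_library(module_name, display_name, install_hint):
    """Try to import `module_name`; report its version or an install hint."""
    print(f"\n🔍 Checking {display_name}...")
    try:
        # Importing inside the function keeps ImportError catchable here.
        module = importlib.import_module(module_name)
    except ImportError:
        print(f"   ✗ {display_name} is not installed")
        print(f"   Run: {install_hint}")
        return False

    # Not every module defines __version__, so fall back to a placeholder.
    version = getattr(module, "__version__", "unknown")
    print(f"   {display_name} version: {version}")
    print(f"   ✓ {display_name} is installed")
    return True
```

With this helper, `check_pytorch()` could reduce to `return check_library("torch", "PyTorch", "pip install torch --index-url https://download.pytorch.org/whl/cu121")`, and the vLLM and HuggingFace Hub checks follow the same shape.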

In [None]:
# TODO: Implement check_pytorch() function
def check_pytorch():
    """Check PyTorch installation"""
    # Step 1: Print header
    
    # Step 2: Try to import torch
    
    # Step 3: If success, get and print version
    
    # Step 4: If fail, print error and instructions
    
    pass

# TODO: Test your function
# result = check_pytorch()
# print(f"\nPyTorch available: {result}")

## Function 1.3: `check_cuda()`

### 🎯 Goal
Check CUDA availability and display GPU information.

### 📋 Step-by-Step Instructions

**Step 1:** Print the header
- Print "\n🔍 Checking CUDA..."

**Step 2:** Wrap everything in a try-except block
- Catch general `Exception` to handle any unexpected errors

**Step 3:** Import torch (inside the try block)
- `import torch`

**Step 4:** Check if CUDA is available
- Use `torch.cuda.is_available()` - returns True/False
- Store result in a variable

**Step 5:** If CUDA is available, gather detailed information
- Print: "CUDA available: True"
- Get CUDA version: `torch.version.cuda`
- Print: "CUDA version: {version}"
- Get number of GPUs: `torch.cuda.device_count()`
- Print: "Number of GPUs: {count}"

**Step 6:** Loop through each GPU and get details
- Use `for i in range(torch.cuda.device_count()):`
- Get GPU name: `torch.cuda.get_device_name(i)`
- Get GPU memory: `torch.cuda.get_device_properties(i).total_memory`
- Convert memory from bytes to GB: divide by `1024**3`
- Print: "GPU {i}: {name} ({memory:.1f} GB)"

**Step 7:** Print success message and return True
- Print: "✓ CUDA is available and working"
- Return `True`

**Step 8:** If CUDA is not available (in the else block)
- Print: "✗ CUDA is not available"
- Print helpful note about CPU fallback
- Return `False`

**Step 9:** Handle exceptions in the except block
- Print: "✗ Error checking CUDA: {error_message}"
- Return `False`

### ✅ Expected Output (with GPU)
```
🔍 Checking CUDA...
   CUDA available: True
   CUDA version: 12.1
   Number of GPUs: 1
   GPU 0: NVIDIA A100-SXM4-40GB (40.0 GB)
   ✓ CUDA is available and working
```

### ⚠️ Common Pitfalls
1. **Memory unit conversion**: `total_memory` is in bytes, need to divide by `1024**3` for GB
2. **Forgetting the .1f format**: Shows memory with 1 decimal place
3. **Not handling the case when device_count is 0**: The for loop handles this automatically
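
One possible shape for this function is sketched below. The byte-to-GB conversion is pulled out into a small helper, `format_gpu_line`, which is a hypothetical name introduced here so the formatting can be checked without a GPU. The wording of the CPU-fallback note is also an assumption.

```python
def format_gpu_line(index, name, total_memory_bytes):
    """Format one GPU summary line, converting bytes to GB (1024**3 bytes)."""
    memory_gb = total_memory_bytes / 1024**3
    return f"   GPU {index}: {name} ({memory_gb:.1f} GB)"


def check_cuda():
    """Return True if PyTorch reports a usable CUDA device."""
    print("\n🔍 Checking CUDA...")
    try:
        import torch  # inside the try so a missing install is reported, not raised

        if not torch.cuda.is_available():
            print("   ✗ CUDA is not available")
            print("   Note: without a GPU, inference falls back to CPU or may not run")
            return False

        print("   CUDA available: True")
        print(f"   CUDA version: {torch.version.cuda}")
        print(f"   Number of GPUs: {torch.cuda.device_count()}")
        for i in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(i)
            print(format_gpu_line(i, torch.cuda.get_device_name(i), props.total_memory))
        print("   ✓ CUDA is available and working")
        return True
    except Exception as error:
        # Covers a missing torch install as well as driver or runtime errors.
        print(f"   ✗ Error checking CUDA: {error}")
        return False
```

Because `ImportError` is a subclass of `Exception`, the function returns `False` rather than raising when PyTorch is absent.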

In [None]:
# TODO: Implement check_cuda() function
def check_cuda():
    """Check CUDA availability"""
    # Step 1: Print header
    
    # Step 2: Start try-except block
    
    # Step 3: Import torch
    
    # Step 4: Check CUDA availability
    
    # Step 5: If available, get details
    
    # Step 6: Loop through GPUs
    
    # Step 7: Print success and return True
    
    # Step 8: Handle not available case
    
    # Step 9: Handle exceptions
    
    pass

# TODO: Test your function
# result = check_cuda()
# print(f"\nCUDA available: {result}")

## Function 1.4: `check_vllm()`

### 🎯 Goal
Check if vLLM library is installed.

### 📋 Step-by-Step Instructions

**Step 1:** Follow the same pattern as `check_pytorch()`
- Print header: "\n🔍 Checking vLLM..."

**Step 2:** Use try-except for import
- Try to `import vllm`
- Get version: `vllm.__version__`
- Print version and success message
- Return `True`

**Step 3:** Handle ImportError
- Print error and installation command: "Run: pip install vllm"
- Return `False`

### ✅ Expected Output
```
🔍 Checking vLLM...
   vLLM version: X.X.X
   ✓ vLLM is installed
```

In [None]:
# TODO: Implement check_vllm() function
def check_vllm():
    """Check vLLM installation"""
    # Your code here
    pass

# TODO: Test your function
# result = check_vllm()

## Function 1.5: `check_huggingfacehub()`

### 🎯 Goal
Check if HuggingFace Hub library is installed.

### 📋 Step-by-Step Instructions

**Step 1:** Follow the same pattern as `check_pytorch()`
- Print header: "\n🔍 Checking HuggingFace Hub..."

**Step 2:** Use try-except for import
- Try to `import huggingface_hub`
- Get version: `huggingface_hub.__version__`
- Print version and success message
- Return `True`

**Step 3:** Handle ImportError
- Print error and installation command: "Run: pip install huggingface-hub"
- Return `False`

### ✅ Expected Output
```
🔍 Checking HuggingFace Hub...
   HuggingFace Hub version: X.X.X
   ✓ HuggingFace Hub is installed
```

In [None]:
# TODO: Implement check_huggingfacehub() function
def check_huggingfacehub():
    """Check HuggingFace Hub installation"""
    # Your code here
    pass

# TODO: Test your function
# result = check_huggingfacehub()

## Function 1.6: `print_summary(results)`

### 🎯 Goal
Create a summary of all environment checks.

### 📝 Function Signature
```python
def print_summary(results):
    """Print summary of all checks
    
    Args:
        results: Dictionary with check names as keys and boolean results as values
        
    Returns:
        bool: True if all checks passed, False otherwise
    """
    pass
```

### 📋 Step-by-Step Instructions

**Step 1:** Print a formatted header
- Print a blank line
- Print "="*60 (creates a line of 60 equal signs)
- Print "SUMMARY"
- Print "="*60 again

**Step 2:** Determine if all checks passed
- Use `all(results.values())` to check if all values are True
- Store in variable `all_passed`

**Step 3:** Loop through results and print each one
- Use `for check, passed in results.items():`
- Set status emoji: "✓" if passed, else "✗"
- Use ternary operator: `status = "✓" if passed else "✗"`
- Print: "{status} {check}"

**Step 4:** Print footer
- Print "="*60

**Step 5:** Print conditional final message
- If `all_passed`:
  - Print "\n🎉 All checks passed! Your environment is ready."
  - Print "\nNext steps:"
  - Print "1. Continue to Part 2: Model Download"
- Else:
  - Print "\n⚠️  Some checks failed. Please fix the issues above."

**Step 6:** Print blank line and return result
- Print ""
- Return `all_passed`

### ✅ Expected Output (all passing)
```
============================================================
SUMMARY
============================================================
✓ Python 3.9+
✓ PyTorch
✓ CUDA
✓ vLLM
✓ HuggingFace Hub
============================================================

🎉 All checks passed! Your environment is ready.

Next steps:
1. Continue to Part 2: Model Download

```

### ⚠️ Common Pitfalls
1. **Using `any()` instead of `all()`**: We need ALL checks to pass
2. **Forgetting `.values()`**: `results` is a dict, need to get just the boolean values
3. **Wrong number of `=` signs**: Should be exactly 60 for proper formatting
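
The six steps above translate almost line for line into code. The following is a sketch to check your own version against, not the original script.

```python
def print_summary(results):
    """Print one line per check and return True only if every check passed."""
    print()
    print("=" * 60)
    print("SUMMARY")
    print("=" * 60)

    all_passed = all(results.values())
    for check, passed in results.items():
        status = "✓" if passed else "✗"
        print(f"{status} {check}")
    print("=" * 60)

    if all_passed:
        print("\n🎉 All checks passed! Your environment is ready.")
        print("\nNext steps:")
        print("1. Continue to Part 2: Model Download")
    else:
        print("\n⚠️  Some checks failed. Please fix the issues above.")
    print()
    return all_passed
```

One edge case worth knowing: `all()` of an empty collection is `True`, so an empty `results` dictionary would be reported as fully passing.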

In [None]:
# TODO: Implement print_summary() function
def print_summary(results):
    """Print summary of all checks"""
    # Step 1: Print header
    
    # Step 2: Check if all passed
    
    # Step 3: Loop through and print each result
    
    # Step 4: Print footer
    
    # Step 5: Print conditional message
    
    # Step 6: Return result
    
    pass

# TODO: Test your function with sample data
# sample_results = {
#     "Python 3.9+": True,
#     "PyTorch": True,
#     "CUDA": True,
#     "vLLM": True,
#     "HuggingFace Hub": True,
# }
# print_summary(sample_results)

## 🧪 Run Complete Environment Verification

Now let's put it all together and run a complete environment check!

In [None]:
# TODO: Run all checks and create summary
def run_environment_verification():
    """Run all environment checks"""
    print("="*60)
    print("Week 1: Environment Verification")
    print("="*60)
    
    # TODO: Create a dictionary with check names and results
    # Call each check function and store the result
    results = {
        "Python 3.9+": check_python_version(),
        "PyTorch": check_pytorch(),
        "CUDA": check_cuda(),
        "vLLM": check_vllm(),
        "HuggingFace Hub": check_huggingfacehub(),
    }
    
    # TODO: Print summary
    success = print_summary(results)
    return success

# TODO: Run the verification
# run_environment_verification()

---

# Part 2: Model Download

## 🎓 What You'll Learn
- How to download models from HuggingFace Hub programmatically
- How to use `snapshot_download` for large models
- How to check if a model is already cached
- How to handle download interruptions gracefully
- How to work with Path objects for file system operations

## 📖 Background
The Qwen2.5-7B-Instruct model is approximately 15GB. HuggingFace Hub provides `snapshot_download` which:
- Downloads all model files to a local cache
- Supports resumable downloads (can be interrupted and continued)
- Caches files in `~/.cache/huggingface/hub/` by default
- Allows models to be loaded quickly on subsequent uses

---

## Function 2.1: `check_model_cache()`

### 🎯 Goal
Check if the Qwen2.5-7B-Instruct model is already cached locally.

### 📋 Step-by-Step Instructions

**Step 1:** Import required modules at the top of the notebook
- `from pathlib import Path`
- `import os`

**Step 2:** Construct the cache directory path
- HuggingFace cache location: `~/.cache/huggingface/hub`
- Use `Path.home()` to get the user's home directory
- Chain with: `/ ".cache" / "huggingface" / "hub"`
- Store in variable `cache_dir`

**Step 3:** Check if cache directory exists
- Use `cache_dir.exists()` method
- If it doesn't exist, return `False` immediately
- No cache directory = no cached models

**Step 4:** Look for Qwen model directories
- Use `cache_dir.iterdir()` to iterate through all items
- For each item, check if "qwen2.5-7b-instruct" is in the name (lowercase)
- Use `item.name.lower()` to make the search case-insensitive
- If found, return `True`

**Step 5:** If no Qwen directory found
- Return `False`

### ✅ Expected Behavior
- Returns `True` if a directory containing "qwen2.5-7b-instruct" exists in the cache
- Returns `False` otherwise

### ⚠️ Common Pitfalls
1. **Case sensitivity**: Model directory names might have different casing, always use `.lower()`
2. **Not checking if directory exists first**: `iterdir()` will fail if directory doesn't exist
3. **Using string paths instead of Path objects**: Path objects are more robust and cross-platform

### 💡 Why This Matters
Checking the cache before downloading:
- Saves time (no need to re-download 15GB)
- Saves bandwidth
- Provides better user experience (can prompt before re-downloading)
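
A minimal sketch of the cache check follows. The `cache_dir` and `needle` parameters are hypothetical additions so the function can be pointed at a temporary directory for testing; by default it looks in `~/.cache/huggingface/hub` as described above. Note that it only detects that a matching directory exists, not that the download finished, and it does not account for a cache relocated through environment variables such as `HF_HOME`.

```python
from pathlib import Path


def check_model_cache(cache_dir=None, needle="qwen2.5-7b-instruct"):
    """Return True if a directory whose name contains `needle` exists in the cache."""
    if cache_dir is None:
        cache_dir = Path.home() / ".cache" / "huggingface" / "hub"
    cache_dir = Path(cache_dir)

    # iterdir() raises if the directory is missing, so check first.
    if not cache_dir.exists():
        return False

    # Case-insensitive match on directory names only.
    return any(
        item.is_dir() and needle in item.name.lower()
        for item in cache_dir.iterdir()
    )
```

The Hub stores each repository under a name like `models--Qwen--Qwen2.5-7B-Instruct`, which is why a lowercase substring match finds it.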

In [None]:
# TODO: Import required modules
from pathlib import Path
import os

# TODO: Implement check_model_cache() function
def check_model_cache():
    """Check if model is already in cache"""
    # Step 1: Construct cache directory path
    
    # Step 2: Check if directory exists
    
    # Step 3: Look for Qwen model directories
    
    # Step 4: Return result
    
    pass

# TODO: Test your function
# is_cached = check_model_cache()
# print(f"Model cached: {is_cached}")

## Function 2.2: `download_model()`

### 🎯 Goal
Download the Qwen2.5-7B-Instruct model from HuggingFace Hub with proper error handling.

### 📋 Step-by-Step Instructions

**Step 1:** Print a formatted header
- Print "="*60
- Print "Week 1: Downloading Qwen2.5-7B-Instruct Model"
- Print "="*60
- Print blank line

**Step 2:** Try to import snapshot_download
- Wrap in try-except to catch ImportError
- `from huggingface_hub import snapshot_download`
- If ImportError: print error message and return (don't use sys.exit in notebooks!)

**Step 3:** Define model name and print info
- `model_name = "Qwen/Qwen2.5-7B-Instruct"`
- Print: "📦 Model: {model_name}"
- Print: "📏 Expected size: ~15 GB"
- Print: "📁 Cache location: {os.path.expanduser('~/.cache/huggingface')}"
- Print blank line
- Print: "⏳ This may take 10-30 minutes depending on your internet speed..."
- Print: "    You can cancel and run again later - progress is saved."
- Print blank line

**Step 4:** Download the model with proper error handling
- Wrap in try-except block with TWO exception types:
  1. `KeyboardInterrupt` (user presses Ctrl+C)
  2. `Exception` (any other error)

**Step 5:** In the try block, call snapshot_download
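- Use these parameters:
  - `repo_id=model_name`
  - `cache_dir=None` (use default)
  - `resume_download=True` (requests resumable downloads; recent `huggingface_hub` releases resume automatically and deprecate this argument, so it can be omitted there)
- Store returned path in `cache_dir` variable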

**Step 6:** Print success message
- Print blank line
- Print "="*60
- Print "✅ Model downloaded successfully!"
- Print "="*60
- Print: "📁 Cache location: {cache_dir}"
- Print blank line
- Print "Next steps:"
- Print "1. Continue to Part 3: Baseline Inference"
- Print blank line

**Step 7:** Handle KeyboardInterrupt (in except block)
- Print: "\n\n⚠️  Download interrupted by user."
- Print: "Progress has been saved. Run this cell again to resume."
- Return from function (in notebooks, don't use sys.exit)

**Step 8:** Handle general exceptions (in except block)
- Catch `Exception as e`
- Print: "\n❌ Error downloading model: {e}"
- Print "\nTroubleshooting:"
- Print "1. Check your internet connection"
- Print "2. Ensure you have enough disk space (~20GB free)"
- Print "3. Try again - downloads are resumable"
- Print "4. If issues persist, the model will auto-download on first use"
- Return from function

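### ⚠️ Common Pitfalls
1. **Using sys.exit() in notebooks**: It raises `SystemExit`, which aborts the cell with a confusing traceback instead of a clean message. Use `return` instead.
2. **Not handling KeyboardInterrupt separately**: Users should know their progress is saved.
3. **Forgetting resume_download=True**: Large downloads should be resumable. On recent `huggingface_hub` versions resuming is the default and the argument is deprecated.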
4. **Not providing troubleshooting steps**: Users need guidance when errors occur.
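
The error-handling structure described in Steps 4 to 8 can be sketched as follows. The `downloader` parameter is a hypothetical seam added here so the control flow can be exercised without downloading 15 GB; when it is left as `None`, the function falls back to `huggingface_hub.snapshot_download`. This sketch omits `resume_download`, on the assumption that a recent `huggingface_hub` resumes by default, and it trims the success banner and troubleshooting list for brevity.

```python
import os


def download_model(downloader=None):
    """Download Qwen2.5-7B-Instruct into the local cache; return its path or None."""
    model_name = "Qwen/Qwen2.5-7B-Instruct"
    print("=" * 60)
    print("Week 1: Downloading Qwen2.5-7B-Instruct Model")
    print("=" * 60)
    print()

    if downloader is None:
        try:
            from huggingface_hub import snapshot_download as downloader
        except ImportError:
            print("❌ huggingface_hub is not installed. Run: pip install huggingface-hub")
            return None  # return instead of sys.exit() so the notebook keeps running

    print(f"📦 Model: {model_name}")
    print("📏 Expected size: ~15 GB")
    print(f"📁 Cache location: {os.path.expanduser('~/.cache/huggingface')}")
    print()

    try:
        local_path = downloader(repo_id=model_name)
    except KeyboardInterrupt:
        # Handled separately so the user learns that progress is kept.
        print("\n\n⚠️  Download interrupted by user.")
        print("Progress has been saved. Run this cell again to resume.")
        return None
    except Exception as error:
        print(f"\n❌ Error downloading model: {error}")
        print("\nTroubleshooting:")
        print("1. Check your internet connection")
        print("2. Ensure you have enough disk space (~20GB free)")
        print("3. Try again - downloads are resumable")
        return None

    print("\n✅ Model downloaded successfully!")
    print(f"📁 Cache location: {local_path}")
    return local_path
```

`KeyboardInterrupt` does not inherit from `Exception`, which is why it needs its own `except` clause rather than being caught by the general handler.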

In [None]:
# TODO: Implement download_model() function
def download_model():
    """Download Qwen2.5-7B-Instruct model from HuggingFace Hub"""
    # Step 1: Print header
    
    # Step 2: Try to import snapshot_download
    
    # Step 3: Define model name and print info
    
    # Step 4-6: Download with error handling
    
    # Step 7-8: Handle exceptions
    
    pass

# TODO: Uncomment to run the download (WARNING: Large download!)
# First, check if model is already cached
# if not check_model_cache():
#     download_model()
# else:
#     print("✅ Model is already cached!")

---

# Part 3: Baseline Inference

## 🎓 What You'll Learn
- How to initialize vLLM's LLM class
- How to configure SamplingParams for text generation
- How to perform single and batch inference
- How to measure inference time and throughput
- How to extract and display generated text from outputs

## 📖 Background
vLLM's `LLM` class provides a high-level interface for:
- Loading large language models into GPU memory
- Running inference with automatic batching
- Efficient memory management with PagedAttention

`SamplingParams` controls how text is generated:
- `max_tokens`: Maximum number of tokens to generate
- `temperature`: Controls randomness (0=deterministic, higher=more random)
- `top_p`: Nucleus sampling threshold
- `top_k`: Number of top tokens to consider

---

## Function 3.1: `run_baseline_inference()`

### 🎯 Goal
Load the Qwen model and perform a simple inference test.

### 📋 Key Steps

**1. Import and Setup**
- Import: `from vllm import LLM, SamplingParams`
- Import: `import time` for measuring performance
- Define model name: `"Qwen/Qwen2.5-7B-Instruct"`

**2. Load the Model**
- Create LLM instance with `trust_remote_code=True`
- Measure loading time with `time.time()`
- Handle exceptions (GPU memory, CUDA availability, etc.)

**3. Configure Sampling**
- Create SamplingParams with:
  - `max_tokens=20` (generate up to 20 tokens)
  - `temperature=0.7` (balanced randomness)
  - `top_p=0.9` (nucleus sampling)
  - `top_k=50` (top-k sampling)

**4. Generate Text**
- Define test prompt: `"Hello, my name is"`
- Call: `outputs = llm.generate([test_prompt], sampling_params)`
- **Important**: Pass prompt as a list, even for single prompt!

**5. Extract Results**
- Access generated text: `outputs[0].outputs[0].text`
- Note the structure: outputs[request_index].outputs[completion_index].text

**6. Calculate and Display Metrics**
- Generation time (seconds)
- Tokens generated (approximate with `.split()`)
- Tokens per second

### ✅ Expected Output
```
============================================================
Week 1: Baseline Inference with vLLM
============================================================

📦 Loading model: Qwen/Qwen2.5-7B-Instruct
⏳ This may take 30-60 seconds on first load...

✅ Model loaded successfully in 45.3 seconds

🧪 Running inference test...
📝 Prompt: "Hello, my name is"

✨ Generated text:
------------------------------------------------------------
Hello, my name is Alex and I'm here to help you with any questions
------------------------------------------------------------

📊 Statistics:
   Generation time: 0.89 seconds
   Tokens generated: 13
   Tokens per second: 14.6
```

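### ⚠️ Common Pitfalls
1. **Not passing prompt as a list**: `generate()` expects a list of prompts
2. **Wrong output access path**: It's `outputs[0].outputs[0].text`, not `outputs[0].text`
3. **Dividing by zero**: Check generation_time > 0 before calculating tokens/sec
4. **Token counting**: Using `.split()` counts whitespace-separated words, not tokens; for an exact count use `len(outputs[0].outputs[0].token_ids)`
5. **Not setting trust_remote_code=True**: Required for models that ship custom modeling code in their repository (such as the first-generation Qwen models); Qwen2.5 is supported natively by recent `transformers`, so the flag is harmless but optional here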

### 💡 Why This Matters
- **Baseline measurement**: This gives us a reference point for future optimizations
- **Understanding the pipeline**: Loading → Tokenization → Generation → Detokenization
- **Performance awareness**: Measuring time helps identify bottlenecks
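
Putting the key steps together, a condensed sketch might look like the one below. It imports vLLM lazily inside the function so the cell can be defined on a machine without a GPU. The helper `tokens_per_second` is a hypothetical name introduced here, and the sketch counts tokens through `token_ids` rather than `.split()`. It omits the banner and import error handling from the expected output above.

```python
import time


def tokens_per_second(token_count, elapsed_seconds):
    """Throughput in tokens/s, returning 0.0 when no time was measured."""
    return token_count / elapsed_seconds if elapsed_seconds > 0 else 0.0


def run_baseline_inference(model_name="Qwen/Qwen2.5-7B-Instruct"):
    """Load the model, generate from one prompt, and print basic statistics."""
    from vllm import LLM, SamplingParams  # imported lazily: needs vLLM and a GPU

    load_start = time.time()
    llm = LLM(model=model_name, trust_remote_code=True)
    print(f"✅ Model loaded successfully in {time.time() - load_start:.1f} seconds")

    sampling_params = SamplingParams(max_tokens=20, temperature=0.7, top_p=0.9, top_k=50)
    test_prompt = "Hello, my name is"

    start = time.time()
    outputs = llm.generate([test_prompt], sampling_params)  # prompts passed as a list
    elapsed = time.time() - start

    completion = outputs[0].outputs[0]       # first request, first completion
    token_count = len(completion.token_ids)  # exact count, unlike text.split()

    print(f"📝 Prompt: {test_prompt!r}")
    print(f"✨ Generated text: {completion.text}")
    print(f"   Generation time: {elapsed:.2f} seconds")
    print(f"   Tokens generated: {token_count}")
    print(f"   Tokens per second: {tokens_per_second(token_count, elapsed):.1f}")
    return llm
```

Returning the `llm` object lets later cells reuse the loaded model instead of paying the load time again.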

In [None]:
# TODO: Import necessary modules
import time

# TODO: Implement run_baseline_inference() function
def run_baseline_inference():
    """Run a basic inference test with Qwen2.5-7B-Instruct"""
    # Step 1: Print header
    
    # Step 2: Import vLLM (with error handling)
    
    # Step 3: Load model with timing
    
    # Step 4: Configure sampling parameters
    
    # Step 5: Generate text with timing
    
    # Step 6: Extract and display results
    
    # Step 7: Print statistics and educational summary
    
    pass

# TODO: Run the inference test (WARNING: This loads a 15GB model!)
# run_baseline_inference()

---

# Part 4: Sampling Parameters Experiments

## 🎓 What You'll Learn
- How different sampling parameters affect generation quality
- How to design systematic experiments
- How to collect and structure experimental data
- How to analyze performance vs quality trade-offs

## 📖 Sampling Parameters Explained:

### max_tokens: Output Length
- Controls how many tokens to generate
- Directly affects generation time (roughly linear in the number of tokens generated)
- More tokens = more complete answers but slower

### temperature: Randomness
- 0.0 = Deterministic (always pick most likely token)
- 0.7 = Balanced (a common choice for chat)
- 1.0+ = Creative (more random, diverse)
- Higher = more variety but potentially less coherent

### top_p (nucleus sampling): Probability Mass Threshold
- Sample only from the smallest set of most likely tokens whose cumulative probability reaches P
- 0.9 = Consider tokens covering 90% probability
- Lower = more focused, Higher = more diverse

### top_k: Vocabulary Restriction
- Only consider the K most likely tokens
- 50 = Only look at top 50 tokens
- Lower = more conservative, Higher = more variety
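
To make these three knobs concrete, here is a toy illustration in plain Python on a four-token distribution. It is a simplified sketch of the ideas, not vLLM's actual sampler; all function names are hypothetical.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by `temperature` (must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _keep_and_renormalize(probs, keep):
    """Zero out every index not in `keep`, then rescale to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [p / total if i in keep else 0.0 for i, p in enumerate(probs)]


def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _keep_and_renormalize(probs, set(ranked[:k]))


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose mass reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in ranked:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _keep_and_renormalize(probs, keep)


probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, 2))    # only the two most likely tokens survive
print(top_p_filter(probs, 0.9))  # three tokens are needed to cover 90% of the mass
print(apply_temperature([2.0, 1.0, 0.0], 0.5))  # sharper than at temperature 1.0
```

Lowering the temperature concentrates probability on the top token, while `top_k` and `top_p` remove the tail before sampling; in practice the filters are combined.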

---

## Experiment Design

You'll implement 4 key functions for experimentation:

### 1. `create_results_dir()` - Setup
- Create a "results" directory for storing experiment data
- Use `Path("results").mkdir(exist_ok=True)`

### 2. `test_parameter()` - Single Test
- Tests one parameter value
- Measures generation time and tokens per second
- Returns structured results dictionary

### 3. `experiment_max_tokens()` - Experiment Runner
- Tests multiple values: [10, 25, 50, 100, 200]
- Prints formatted analysis table
- Shows how generation time scales linearly

### 4. `save_all_results()` - Save Results
- Save experiment data to JSON with timestamp
- Format: `sampling_experiments_YYYYMMDD_HHMMSS.json`
- Enables comparison and analysis later
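
The setup and save steps need no GPU, so they are easy to sketch and test in isolation. The version below assumes that `all_results` is any JSON-serializable object (for example a dictionary keyed by experiment name). The `path` and `results_dir` parameters are hypothetical additions that default to the `results` directory named above.

```python
import json
from datetime import datetime
from pathlib import Path


def create_results_dir(path="results"):
    """Create the results directory if needed and return it as a Path."""
    results_dir = Path(path)
    results_dir.mkdir(parents=True, exist_ok=True)
    return results_dir


def save_all_results(all_results, results_dir="results"):
    """Write `all_results` to a timestamped JSON file and return its path."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_file = create_results_dir(results_dir) / f"sampling_experiments_{timestamp}.json"
    with open(output_file, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII generated text readable in the file.
        json.dump(all_results, f, indent=2, ensure_ascii=False)
    return output_file
```

The timestamp in the filename keeps earlier runs from being overwritten, which is what makes later comparison possible.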

---

## Function 4.1: `test_parameter()` - Core Testing Function

### 🎯 Goal
Test a specific parameter value and collect comprehensive results.

### 📝 Function Signature
```python
def test_parameter(llm, param_name, param_value, base_params, test_prompt):
    """
    Test a specific parameter value.
    
    Args:
        llm: vLLM LLM instance (already loaded)
        param_name: Name of parameter being tested (e.g., 'temperature')
        param_value: Value to test (e.g., 0.7)
        base_params: Dict of base parameters (others stay constant)
        test_prompt: Prompt to use for testing
    
    Returns:
        Dict with results including generated text and metrics
    """
```

### 📋 Step-by-Step Instructions

**Step 1:** Create a copy of base_params and update with test value
- Use `.copy()` to avoid modifying original
- Set: `params_dict[param_name] = param_value`

**Step 2:** Create SamplingParams object
- `from vllm import SamplingParams`
- `sampling_params = SamplingParams(**params_dict)`
- The `**` unpacks the dictionary as keyword arguments

**Step 3:** Print what's being tested
- Print: f"\n📝 Testing {param_name}={param_value}"

**Step 4:** Generate text with timing
- Record start time: `start = time.time()`
- Generate: `outputs = llm.generate([test_prompt], sampling_params)`
- Calculate elapsed time: `elapsed = time.time() - start`

**Step 5:** Extract results
- Get text: `generated_text = outputs[0].outputs[0].text`
- Count tokens: `tokens = len(generated_text.split())`

**Step 6:** Print preview (truncated if long)
- If text > 80 chars: print first 80 chars + "..."
- Else: print full text
- Print time and tokens

**Step 7:** Return structured results dictionary
```python
return {
    "parameter": param_name,
    "value": param_value,
    "prompt": test_prompt,
    "generated_text": generated_text,
    "generation_time": elapsed,
    "tokens_generated": tokens,
    "tokens_per_second": tokens / elapsed if elapsed > 0 else 0,
}
```

### ⚠️ Common Pitfalls
1. **Modifying base_params directly**: Always use `.copy()`
2. **Division by zero**: Check `elapsed > 0` before calculating tokens/sec
3. **Not using `**params_dict`**: This unpacks the dictionary into kwargs
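
A sketch of the seven steps follows. It is named `measure_parameter` (a hypothetical name) so it does not shadow your own `test_parameter`, and it adds a hypothetical `params_factory` argument so the logic can be dry-run without vLLM; left as `None`, it uses `vllm.SamplingParams`. `FakeLLM` is likewise an illustrative stand-in that only mimics the shape of vLLM's output.

```python
import time
from types import SimpleNamespace


def measure_parameter(llm, param_name, param_value, base_params, test_prompt,
                      params_factory=None):
    """Generate once with one overridden sampling parameter and time it."""
    if params_factory is None:
        from vllm import SamplingParams  # lazy import so a dry run works without vLLM
        params_factory = SamplingParams

    params_dict = base_params.copy()  # leave the caller's dict untouched
    params_dict[param_name] = param_value
    sampling_params = params_factory(**params_dict)

    print(f"\n📝 Testing {param_name}={param_value}")
    start = time.time()
    outputs = llm.generate([test_prompt], sampling_params)
    elapsed = time.time() - start

    generated_text = outputs[0].outputs[0].text
    tokens = len(generated_text.split())  # word count, an approximation of tokens
    preview = generated_text[:80] + "..." if len(generated_text) > 80 else generated_text
    print(f"   {preview}")
    print(f"   Time: {elapsed:.2f}s, tokens (approx.): {tokens}")

    return {
        "parameter": param_name,
        "value": param_value,
        "prompt": test_prompt,
        "generated_text": generated_text,
        "generation_time": elapsed,
        "tokens_generated": tokens,
        "tokens_per_second": tokens / elapsed if elapsed > 0 else 0,
    }


class FakeLLM:
    """Stand-in with the same output shape as vLLM, for a dry run without a GPU."""

    def generate(self, prompts, sampling_params):
        completion = SimpleNamespace(text="four words of output")
        return [SimpleNamespace(outputs=[completion])]
```

A dry run such as `measure_parameter(FakeLLM(), "temperature", 0.0, {"max_tokens": 20}, "Hi", params_factory=dict)` exercises the bookkeeping before any GPU time is spent.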

In [None]:
# TODO: Import required modules
import json
from datetime import datetime
from pathlib import Path

# TODO: Implement create_results_dir()
def create_results_dir():
    """Create results directory if it doesn't exist"""
    # Your code here
    pass

# TODO: Implement test_parameter()
def test_parameter(llm, param_name, param_value, base_params, test_prompt):
    """Test a specific parameter value"""
    # Step 1: Copy and update parameters
    
    # Step 2: Create SamplingParams
    
    # Step 3: Print test info
    
    # Step 4: Generate with timing
    
    # Step 5: Extract results
    
    # Step 6: Print preview
    
    # Step 7: Return structured results
    
    pass

---

# Part 5: Results Comparison & Analysis

## 🎓 What You'll Learn
- How to load and parse JSON experiment results
- How to create comparison tables
- How to extract insights from experimental data
- How to make data-driven recommendations

## 📖 Key Functions:

### 1. `load_latest_results()` - Load Experiment Data
- Finds the most recent results file
- Uses `glob()` to find matching files
- Returns JSON data or None if not found

### 2. `compare_max_tokens()` - Analysis Function
- Creates formatted comparison table
- Calculates average speeds
- Shows time scaling relationships

### 3. `show_recommendations()` - Best Practices
- Provides parameter combinations for different use cases:
  - Factual/Technical: temp=0.2, top_p=0.8
  - Balanced: temp=0.7, top_p=0.9
  - Creative: temp=1.0, top_p=0.95
  - Deterministic: temp=0.0, top_p=1.0
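
As one possible shape for `compare_max_tokens()`, the sketch below assumes its input is a list of result dictionaries with the keys returned by `test_parameter()` (`value`, `generation_time`, `tokens_generated`, `tokens_per_second`). The column layout and the returned average are illustrative choices, not the original script's format.

```python
def compare_max_tokens(results):
    """Print a max_tokens comparison table and return the average tokens/s."""
    rows = sorted(results, key=lambda r: r["value"])
    print(f"{'max_tokens':>10} {'time (s)':>10} {'tokens':>8} {'tok/s':>8}")
    print("-" * 40)
    for row in rows:
        print(
            f"{row['value']:>10} {row['generation_time']:>10.2f} "
            f"{row['tokens_generated']:>8} {row['tokens_per_second']:>8.1f}"
        )

    speeds = [row["tokens_per_second"] for row in rows]
    average_speed = sum(speeds) / len(speeds) if speeds else 0.0
    print(f"\nAverage speed: {average_speed:.1f} tokens/s")

    # Time scaling relative to the run with the smallest max_tokens.
    if rows and rows[0]["generation_time"] > 0:
        baseline = rows[0]
        for row in rows[1:]:
            ratio = row["generation_time"] / baseline["generation_time"]
            print(f"max_tokens {baseline['value']} -> {row['value']}: {ratio:.1f}x time")
    return average_speed
```

If generation time really is close to linear in output length, the printed ratios should track the ratios of the `max_tokens` values whenever the model uses its full budget.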

---

## Function 5.1: `load_latest_results()`

### 🎯 Goal
Load the most recent experiment results from the results directory.

### 📋 Step-by-Step Instructions

**Step 1:** Define results directory path
- `results_dir = Path("results")`

**Step 2:** Check if directory exists
- If not exists:
  - Print "❌ No results directory found. Run experiments first!"
  - Return `None`

**Step 3:** Find all result files
- Use `results_dir.glob("sampling_experiments_*.json")`
- Convert to list: `result_files = list(...)`

**Step 4:** Check if any files found
- If empty list:
  - Print error message
  - Return `None`

**Step 5:** Get the most recent file
- Use `max()` with a key function:
```python
latest_file = max(result_files, key=lambda f: f.stat().st_mtime)
```
- `st_mtime` is the file modification time

**Step 6:** Load and return JSON data
- Print: f"📂 Loading: {latest_file.name}"
- Open file: `with open(latest_file) as f:`
- Load JSON: `return json.load(f)`

### ⚠️ Common Pitfalls
1. **Not converting glob to list**: `glob()` returns an iterator
2. **Wrong stat attribute**: It's `st_mtime`, not `modified_time`
3. **Not handling None returns**: Calling code should check
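
The six steps can be sketched as below. The `results_dir` parameter is a hypothetical addition (defaulting to `"results"`) that lets the function be tested against a temporary directory, and the wording of the second error message is an assumption since the steps leave it unspecified.

```python
import json
from pathlib import Path


def load_latest_results(results_dir="results"):
    """Return the parsed JSON of the newest results file, or None if absent."""
    results_dir = Path(results_dir)
    if not results_dir.exists():
        print("❌ No results directory found. Run experiments first!")
        return None

    # glob() is lazy, so materialize it before testing for emptiness.
    result_files = list(results_dir.glob("sampling_experiments_*.json"))
    if not result_files:
        print("❌ No result files found. Run experiments first!")
        return None

    # Newest file by modification time.
    latest_file = max(result_files, key=lambda f: f.stat().st_mtime)
    print(f"📂 Loading: {latest_file.name}")
    with open(latest_file, encoding="utf-8") as f:
        return json.load(f)
```

Callers should check for `None` before indexing into the result, for example `data = load_latest_results()` followed by `if data is not None:`.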

In [None]:
# TODO: Implement load_latest_results()
def load_latest_results():
    """Load the most recent experiment results"""
    # Step 1: Define results directory
    
    # Step 2: Check if exists
    
    # Step 3: Find all result files
    
    # Step 4: Check if any found
    
    # Step 5: Get most recent
    
    # Step 6: Load and return
    
    pass

---

# Part 6: Custom Parameter Testing

## 🎓 What You'll Learn
- How to create reusable testing functions
- How to use default parameters effectively
- How to provide good user experience

## 📖 Interactive Testing

The `test_generation()` function allows quick testing of any parameter combination:

```python
# Test with defaults
result = test_generation("Explain AI")

# Test with custom parameters
result = test_generation(
    "Explain AI", 
    temperature=0.0,  # Deterministic
    max_tokens=100
)
```

### Default Parameters Make Functions User-Friendly:
- Can call with just a prompt
- Or override specific params
- Users don't need to specify everything

---

## Function 6.1: `test_generation()` - Quick Testing Tool

### 🎯 Goal
Test text generation with any combination of parameters, with sensible defaults.

### 📝 Function Signature
```python
def test_generation(prompt, max_tokens=50, temperature=0.7, top_p=0.9, top_k=50):
    """
    Test text generation with specific parameters.
    
    Args:
        prompt: Input prompt (required)
        max_tokens: Maximum tokens to generate (default: 50)
        temperature: Sampling temperature 0.0-2.0 (default: 0.7)
        top_p: Nucleus sampling 0.0-1.0 (default: 0.9)
        top_k: Top-k sampling (default: 50)
    
    Returns:
        Dict with results and metrics
    """
```

### 📋 Implementation Steps

**1. Load Model**
- Import vLLM components
- Load Qwen model with trust_remote_code=True

**2. Configure and Generate**
- Create SamplingParams with all provided parameters
- Print configuration being used
- Generate text with timing

**3. Display Results**
- Print generated text in a formatted box
- Show all metrics (time, tokens, speed)

**4. Return Structured Result**
- Include timestamp, prompt, parameters, generated text, metrics
- Return as dictionary for potential further use

### 💡 Why Default Parameters?
- Easier to use: `test_generation("Hello")` just works
- Flexible: Can override any parameter
- Self-documenting: Defaults show recommended values
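
The default-parameter idea can be tried without loading a model. The sketch below uses a hypothetical helper, `build_sampling_config`, that applies the same defaults as `test_generation()` and returns a dictionary that could be passed on as `SamplingParams(**config)`. The validation ranges are assumptions about what vLLM accepts, so check them against your installed version.

```python
def build_sampling_config(max_tokens=50, temperature=0.7, top_p=0.9, top_k=50):
    """Validate the values and return kwargs suitable for SamplingParams(**config)."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in the interval (0, 1]")
    if top_k != -1 and top_k < 1:
        raise ValueError("top_k must be -1 (disabled) or at least 1")
    return {
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
    }


print(build_sampling_config())                 # every default applies
print(build_sampling_config(temperature=0.0))  # override one value, keep the rest
```

Because loading the model usually dominates the cost of a quick test, it may also be worth creating the `LLM` instance once and reusing it across calls instead of loading it inside every `test_generation()` call.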

In [None]:
# TODO: Implement test_generation()
def test_generation(prompt, max_tokens=50, temperature=0.7, top_p=0.9, top_k=50):
    """Test text generation with specific parameters"""
    # Step 1: Load model
    
    # Step 2: Configure sampling
    
    # Step 3: Generate with timing
    
    # Step 4: Display results
    
    # Step 5: Return structured result
    
    pass

# TODO: Test with different parameter combinations
# Example 1: Deterministic
# result = test_generation("Explain machine learning:", temperature=0.0)

# Example 2: Creative
# result = test_generation("Write a story:", temperature=1.2, max_tokens=100)

---

# 🎉 Congratulations!

## What You've Accomplished

You've successfully implemented a complete Week 1 learning system for vLLM and Qwen2.5! You now understand:

### ✅ Environment Setup
- How to verify Python, PyTorch, CUDA, and library installations
- How to check GPU availability programmatically
- How to create comprehensive validation scripts

### ✅ Model Management
- How to download large models from HuggingFace Hub
- How to check local cache
- How to handle download interruptions and resumption

### ✅ Inference Basics
- How to initialize vLLM's LLM class
- How to configure SamplingParams
- How to perform single and batch inference
- How to measure inference performance

### ✅ Parameter Experimentation
- How max_tokens affects output length and speed
- How temperature controls randomness and creativity
- How top_p and top_k influence diversity
- How to design systematic experiments

### ✅ Data Analysis
- How to save and load experiment results
- How to compare parameter effects
- How to derive insights from experimental data
- How to make informed recommendations

### ✅ Software Engineering Skills
- Error handling and user feedback
- File system operations with pathlib
- JSON serialization and deserialization
- Creating reusable, well-documented functions

---

## 🔜 Next Steps

1. **Experiment Further**: Try your own prompts and parameter combinations
2. **Compare Results**: Run the same prompt with different parameters
3. **Document Findings**: Keep notes on what works well for your use cases
4. **Week 2**: Move on to performance profiling and optimization!

---

## 📖 Quick Reference: Sampling Parameters

```python
# Factual/Technical
SamplingParams(max_tokens=100, temperature=0.2, top_p=0.8, top_k=40)

# Balanced/Chat
SamplingParams(max_tokens=150, temperature=0.7, top_p=0.9, top_k=50)

# Creative
SamplingParams(max_tokens=200, temperature=1.0, top_p=0.95, top_k=100)

# Deterministic (for testing/reproducibility)
SamplingParams(max_tokens=100, temperature=0.0, top_p=1.0, top_k=1)
```

---

## 🙏 Great Work!

You've learned by doing - the best way to truly understand a technology. The code you've written is now yours to modify, extend, and apply to your own projects.

**Key Takeaway**: Understanding how to tune sampling parameters is crucial for getting the best results from LLMs. Different tasks require different settings!

Keep experimenting and happy coding! 🚀