# NanoGPT Model Evaluation

This notebook loads a trained NanoGPT checkpoint and evaluates it on the `train.bin` and `val.bin` splits using three metrics:
- **Perplexity**: The exponential of the average next-token cross-entropy loss; lower values mean better prediction
- **BLEU Score**: N-gram precision of generated text against the reference continuation
- **ROUGE Scores**: Overlap of n-grams (ROUGE-1, ROUGE-2) and of the longest common subsequence (ROUGE-L) between generated and reference text

## Usage
1. Configure the paths and parameters in the configuration cell
2. Run all cells to perform the evaluation
3. View results in the final summary

## 1. Import Required Libraries

In [75]:
import os
import sys
import math
import time
from pathlib import Path
from typing import List, Tuple, Dict, Any

import numpy as np
import torch
import torch.nn.functional as F

# Add the required paths for importing
current_dir = Path.cwd()
sys.path.append(str(current_dir / "baselines"))
sys.path.append(str(current_dir / "notebooks"))

print(f"Current directory: {current_dir}")
print(f"Python path updated with: {current_dir / 'baselines'}, {current_dir / 'notebooks'}")

Current directory: c:\Users\hayk_\OneDrive\Desktop\05_LMU_Masters\04_applied_dl\adl-bnn-textgen\notebooks
Python path updated with: c:\Users\hayk_\OneDrive\Desktop\05_LMU_Masters\04_applied_dl\adl-bnn-textgen\notebooks\baselines, c:\Users\hayk_\OneDrive\Desktop\05_LMU_Masters\04_applied_dl\adl-bnn-textgen\notebooks\notebooks


In [76]:
import evaluate
from rouge_score import rouge_scorer

## 2. Configuration

Set your model paths and evaluation parameters here:

In [None]:
# Configuration
CONFIG = {
    # 'data_dir': 'nanoGPT/data/shakespeare_char',
    # 'model_path': '../checkpoints/baseline_nanogpt/baseline_nanogpt.pt',
    # 'meta_path': '../checkpoints/baseline_nanogpt/nanogpt_meta.pkl',
    'data_dir': 'nanoGPT/data/shakespeare',
    'model_path': '../checkpoints/baseline_token_level_nano/token_level_1500_iter.pt',
    # NOTE: meta_path should describe the same tokenizer the checkpoint was trained with
    'meta_path': '../checkpoints/baseline_nanogpt/nanogpt_meta.pkl',

    'batch_size': 16,
    'max_eval_samples': 1000,
    'device': 'auto',  # 'auto', 'cpu', or 'cuda'
    'splits': ['val', 'train'],  # Dataset splits to evaluate
    'num_text_samples': 50,  # Number of text samples for BLEU/ROUGE
    'prompt_length': 20,  # Length of prompt for text generation
    'generation_length': 30  # Length of generated text
}

print("Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

# Check if paths exist
for path_key in ['data_dir', 'model_path', 'meta_path']:
    path = Path(CONFIG[path_key])
    if path.exists():
        print(f"Path exists {path_key}: {path}")
    else:
        print(f"Path not found {path_key}: {path}")

Configuration:
  data_dir: nanoGPT/data/shakespeare
  model_path: ../checkpoints/baseline_token_level_nano/token_level_1500_iter.pt
  meta_path: ../checkpoints/baseline_nanogpt/nanogpt_meta.pkl
  batch_size: 16
  max_eval_samples: 1000
  device: auto
  splits: ['val', 'train']
  num_text_samples: 50
  prompt_length: 20
  generation_length: 30
✓ data_dir: nanoGPT\data\shakespeare exists
✓ model_path: ..\checkpoints\baseline_token_level_nano\token_level_1500_iter.pt exists
✓ meta_path: ..\checkpoints\baseline_nanogpt\nanogpt_meta.pkl exists


## 3. Utility Functions

The helpers for loading the model, loading the tokenizer, and decoding token ids are imported from the project's `utils` module, which must be importable from the paths added above:

In [78]:
from utils import load_model, load_tokenizer, decode
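
If `utils` cannot be imported, helpers along the following lines could stand in for `load_tokenizer` and `decode`. This is a minimal sketch, not the project's implementation: it assumes the meta file is a pickled dict with `stoi` and `itos` entries, and the `fallback_` names are hypothetical. `load_model` is not sketched because it depends on the checkpoint format.

```python
import pickle
from pathlib import Path
from typing import Dict, List, Tuple


def fallback_load_tokenizer(meta_path: Path) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Read the string-to-id and id-to-string tables from a pickled meta file."""
    # Assumption: the pickle holds a dict with 'stoi' and 'itos' keys.
    with open(meta_path, "rb") as f:
        meta = pickle.load(f)
    return meta["stoi"], meta["itos"]


def fallback_decode(tokens: List[int], itos: Dict[int, str]) -> str:
    """Map token ids back to text; raises KeyError for ids missing from itos."""
    return "".join(itos[t] for t in tokens)
```

Because `fallback_decode` raises `KeyError` on unknown ids, a tokenizer that does not match the data fails loudly instead of silently producing wrong text.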

## 4. NanoGPT Evaluator Class

This class handles model loading and evaluation with multiple metrics:

In [None]:
class NanoGPTEvaluator:
    """Evaluator for NanoGPT models with multiple metrics"""
    
    def __init__(self, model_path: str, meta_path: str, device: str = 'auto'):
        """
        Initialize the evaluator
        
        Args:
            model_path: Path to the model checkpoint
            meta_path: Path to the meta.pkl file containing tokenizer info
            device: Device to use ('cpu', 'cuda', or 'auto')
        """
        # Set device
        if device == 'auto':
            self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        else:
            self.device = device
        
        print(f"Using device: {self.device}")
        

        self.model, self.checkpoint = load_model(Path(model_path), self.device)
        self.stoi, self.itos = load_tokenizer(Path(meta_path))
        
        self.vocab_size = len(self.itos)
        
        # Set model to evaluation mode
        self.model.eval()
        
        # Default every metric handle to None so later checks never hit a missing attribute
        self.bleu_metric = None
        self.rouge_metric = None
        self.perplexity_metric = None
        self.rouge_scorer = None
        try:
            # Load evaluation metrics from HuggingFace evaluate
            self.bleu_metric = evaluate.load("bleu")
            self.rouge_metric = evaluate.load("rouge")
            self.perplexity_metric = evaluate.load("perplexity", module_type="metric")
            print("HuggingFace evaluation metrics loaded successfully")
        except Exception as e:
            print(f"Warning: Could not load HuggingFace metrics: {e}")
            print("Falling back to individual metric libraries...")
            # Discard any HuggingFace metrics that loaded before the failure
            self.bleu_metric = None
            self.rouge_metric = None
            self.perplexity_metric = None
            try:
                # Fallback to rouge_score library
                self.rouge_scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
                print("ROUGE scorer initialized")
            except Exception:
                self.rouge_scorer = None

print("NanoGPTEvaluator class defined")

✓ NanoGPTEvaluator class defined


In [None]:
# Add data loading methods to the evaluator
def load_data(self, data_dir: str, split: str = 'val') -> np.ndarray:
    """
    Load train.bin or val.bin data
    
    Args:
        data_dir: Directory containing the data files
        split: 'train' or 'val'
        
    Returns:
        Numpy array of token indices
    """
    filename = f"{split}.bin"
    filepath = os.path.join(data_dir, filename)
    
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"Data file not found: {filepath}")
    
    data = np.memmap(filepath, dtype=np.uint16, mode='r')
    print(f"Loaded {split} data: {len(data):,} tokens")
    return data

def get_batch(self, data: np.ndarray, batch_size: int, block_size: int) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Get a random batch of data for evaluation
    
    Args:
        data: Token data array
        batch_size: Number of sequences in the batch
        block_size: Length of each sequence
        
    Returns:
        Tuple of (input_tokens, target_tokens)
    """
    if len(data) <= block_size:
        # Data is no longer than one block: repeat the single available window
        max_len = len(data) - 1
        x = torch.stack([torch.from_numpy(data[0:max_len].astype(np.int64)) for _ in range(batch_size)])
        y = torch.stack([torch.from_numpy(data[1:max_len+1].astype(np.int64)) for _ in range(batch_size)])
    else:
        ix = torch.randint(len(data) - block_size, (batch_size,))
        x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
        y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
    
    x, y = x.to(self.device), y.to(self.device)
    return x, y

# Add methods to the class
NanoGPTEvaluator.load_data = load_data
NanoGPTEvaluator.get_batch = get_batch

print("Data loading methods added to NanoGPTEvaluator")

✓ Data loading methods added to NanoGPTEvaluator
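
The relationship between the inputs and targets that `get_batch` returns is easiest to see on a toy sequence. The sketch below uses plain Python lists in place of the memory-mapped array and tensors; `make_window` is a hypothetical helper, not part of the evaluator.

```python
from typing import List, Sequence, Tuple


def make_window(data: Sequence[int], start: int, block_size: int) -> Tuple[List[int], List[int]]:
    """Return (inputs, targets) where targets are the inputs shifted by one token."""
    x = list(data[start:start + block_size])
    y = list(data[start + 1:start + 1 + block_size])
    return x, y


tokens = [10, 11, 12, 13, 14, 15]
x, y = make_window(tokens, start=1, block_size=4)
print(x)  # [11, 12, 13, 14]
print(y)  # [12, 13, 14, 15]
```

The largest start index that still yields a full target window is `len(data) - block_size - 1`, which matches the range `torch.randint(len(data) - block_size, ...)` draws from.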


In [None]:
# Add perplexity calculation method
@torch.no_grad()
def calculate_perplexity(self, data: np.ndarray, batch_size: int = 16, max_batches: int = 100) -> float:
    """
    Calculate perplexity on the given dataset
    
    Args:
        data: Token data array
        batch_size: Batch size for evaluation
        max_batches: Maximum number of batches to evaluate
        
    Returns:
        Perplexity score
    """
    print(f"Calculating perplexity with {batch_size} batch size...")
    
    # Note: the HuggingFace `evaluate` perplexity metric scores text with the
    # model named by `model_id` (for example GPT-2), not with self.model, so it
    # cannot measure this checkpoint. Perplexity is therefore always computed
    # from the model's own cross-entropy loss.
    
    # Manual calculation from the model's own loss
    total_loss = 0.0
    total_tokens = 0
    batches_processed = 0
    
    block_size = self.model.config.block_size
    
    # Calculate number of possible batches
    if len(data) > block_size:
        max_possible_batches = (len(data) - block_size) // batch_size
    else:
        max_possible_batches = 1
    
    num_batches = min(max_batches, max_possible_batches, 100)  # Limit to reasonable number
    
    print(f"Processing {num_batches} batches for perplexity calculation...")
    
    for batch_idx in range(num_batches):
        try:
            x, y = self.get_batch(data, batch_size, block_size)
            
            # Forward pass
            logits, loss = self.model(x, y)
            
            total_loss += loss.item() * x.numel()
            total_tokens += x.numel()
            batches_processed += 1
            
            if batch_idx % 20 == 0:
                print(f"  Processed batch {batch_idx + 1}/{num_batches}")
                
        except Exception as e:
            print(f"Error in batch {batch_idx}: {e}")
            continue
    
    if total_tokens == 0:
        return float('inf')
    
    avg_loss = total_loss / total_tokens
    perplexity = math.exp(avg_loss)
    
    print(f"Processed {batches_processed} batches, {total_tokens:,} tokens")
    return perplexity

# Add method to the class
NanoGPTEvaluator.calculate_perplexity = calculate_perplexity

print("Perplexity calculation method added")

✓ Perplexity calculation method added
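
The manual calculation reduces to a token-weighted mean of the per-batch losses followed by an exponential. A minimal sketch with made-up loss values (the numbers and the `perplexity_from_batches` name are illustrative only):

```python
import math
from typing import Iterable, Tuple


def perplexity_from_batches(batches: Iterable[Tuple[float, int]]) -> float:
    """Compute exp(mean loss) from (mean_loss, token_count) pairs."""
    total_loss = 0.0
    total_tokens = 0
    for mean_loss, n_tokens in batches:
        # Weight each batch's mean loss by how many tokens it covers
        total_loss += mean_loss * n_tokens
        total_tokens += n_tokens
    if total_tokens == 0:
        return float("inf")
    return math.exp(total_loss / total_tokens)


# Two hypothetical batches: mean cross-entropy in nats and token count
ppl = perplexity_from_batches([(2.0, 100), (3.0, 300)])
print(f"{ppl:.4f}")
```

A model that assigns uniform probability over a vocabulary of size V has loss ln V and therefore perplexity V, which gives a reference point for reading the reported values.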


In [None]:
# Add text generation and metric calculation methods
def generate_samples_for_metrics(self, data: np.ndarray, num_samples: int = 50, 
                               prompt_length: int = 20, generation_length: int = 30) -> Tuple[List[str], List[str]]:
    """
    Generate text samples for BLEU/ROUGE evaluation
    
    Args:
        data: Token data array
        num_samples: Number of samples to generate
        prompt_length: Length of prompt in tokens
        generation_length: Length of generated text in tokens
        
    Returns:
        Tuple of (references, predictions)
    """
    print(f"Generating {num_samples} samples for BLEU/ROUGE evaluation...")
    
    references = []
    predictions = []
    
    # Limit samples based on data size
    max_possible_samples = max(1, (len(data) - prompt_length - generation_length) // 100)
    num_samples = min(num_samples, max_possible_samples)
    
    print(f"Generating {num_samples} text samples...")
    
    for i in range(num_samples):
        try:
            # Select a random starting position
            if len(data) > prompt_length + generation_length + 10:
                start_idx = np.random.randint(0, len(data) - prompt_length - generation_length - 10)
            else:
                start_idx = 0
            
            # Extract prompt and reference
            prompt_tokens = data[start_idx:start_idx + prompt_length].astype(np.int64)
            reference_tokens = data[start_idx + prompt_length:start_idx + prompt_length + generation_length].astype(np.int64)
            
            # Decode reference
            reference_text = decode(reference_tokens.tolist(), self.itos)
            
            # Generate prediction
            x = torch.tensor(prompt_tokens, dtype=torch.long, device=self.device)[None, ...]
            
            with torch.no_grad():
                generated_tokens = []
                for _ in range(generation_length):
                    # Crop if sequence gets too long
                    x_cond = x if x.size(1) <= self.model.config.block_size else x[:, -self.model.config.block_size:]
                    
                    # Forward pass
                    logits, _ = self.model(x_cond)
                    logits = logits[:, -1, :] / 0.8  # temperature
                    
                    # Sample next token
                    probs = F.softmax(logits, dim=-1)
                    next_token = torch.multinomial(probs, num_samples=1)
                    generated_tokens.append(next_token.item())
                    
                    # Append to sequence
                    x = torch.cat((x, next_token), dim=1)
            
            # Decode prediction
            prediction_text = decode(generated_tokens, self.itos)
            
            # Clean up texts
            reference_text = reference_text.strip()
            prediction_text = prediction_text.strip()
            
            if len(reference_text) > 0 and len(prediction_text) > 0:
                references.append(reference_text)
                predictions.append(prediction_text)
            
            if (i + 1) % 10 == 0:
                print(f"  Generated {i + 1}/{num_samples} samples")
                
        except Exception as e:
            # Include the exception type: a bare KeyError prints only the missing key
            print(f"Error generating sample {i}: {type(e).__name__}: {e}")
            continue
    
    print(f"Successfully generated {len(references)} sample pairs")
    return references, predictions

# Add method to the class
NanoGPTEvaluator.generate_samples_for_metrics = generate_samples_for_metrics

print("Text generation method added")

✓ Text generation method added
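
The sampling step divides the logits by a temperature (0.8 in the method above) before the softmax. The pure-Python sketch below shows the effect on a made-up three-token distribution; `softmax_with_temperature` is a hypothetical helper, and the real method does the same with `F.softmax` and `torch.multinomial`.

```python
import math
import random
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float = 0.8) -> List[float]:
    """Convert logits to probabilities after scaling them by 1/temperature."""
    scaled = [value / temperature for value in logits]
    # Subtract the maximum before exponentiating for numerical stability
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for a three-token vocabulary
probs_sharp = softmax_with_temperature(logits, temperature=0.8)
probs_plain = softmax_with_temperature(logits, temperature=1.0)

# Draw one token id according to the tempered distribution
rng = random.Random(0)
next_token = rng.choices(range(len(logits)), weights=probs_sharp, k=1)[0]
print(probs_sharp, probs_plain, next_token)
```

Temperatures below 1 concentrate probability on the highest-scoring tokens. Because generation is sampled, BLEU and ROUGE will vary between runs unless the random seed is fixed.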


In [None]:
# Add BLEU and ROUGE calculation methods
def calculate_bleu_score(self, references: List[str], predictions: List[str]) -> float:
    """
    Calculate BLEU score using evaluate library
    
    Args:
        references: List of reference texts
        predictions: List of predicted texts
        
    Returns:
        BLEU score
    """
    
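    if self.bleu_metric is None:
        # The HuggingFace metric failed to load, so there is nothing to compute with
        print("BLEU metric not available; returning 0.0")
        return 0.0
    
    try: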
        # Use HuggingFace evaluate library for BLEU
        # Format for evaluate library: predictions and references should be lists
        results = self.bleu_metric.compute(
            predictions=predictions, 
            references=[[ref] for ref in references]
        )
        return results['bleu']
    except Exception as e:
        print(f"Error calculating BLEU: {e}")
        return 0.0

def calculate_rouge_score(self, references: List[str], predictions: List[str]) -> Dict[str, float]:
    """
    Calculate ROUGE scores using evaluate library
    
    Args:
        references: List of reference texts
        predictions: List of predicted texts
        
    Returns:
        Dictionary of ROUGE scores
    """
 
    # Try HuggingFace evaluate library first
    if self.rouge_metric is not None:
        try:
            results = self.rouge_metric.compute(
                predictions=predictions, 
                references=references
            )
            return {
                'rouge1': results.get('rouge1', 0.0),
                'rouge2': results.get('rouge2', 0.0), 
                'rougeL': results.get('rougeL', 0.0)
            }
        except Exception as e:
            print(f"Error with HuggingFace ROUGE: {e}")
    
    # Fallback to rouge_score library
    if self.rouge_scorer is not None:
        try:
            rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}
            
            for ref, pred in zip(references, predictions):
                if len(ref.strip()) > 0 and len(pred.strip()) > 0:
                    scores = self.rouge_scorer.score(ref, pred)
                    rouge_scores['rouge1'].append(scores['rouge1'].fmeasure)
                    rouge_scores['rouge2'].append(scores['rouge2'].fmeasure)
                    rouge_scores['rougeL'].append(scores['rougeL'].fmeasure)
            
            # Calculate averages
            avg_scores = {}
            for key, values in rouge_scores.items():
                if values:
                    avg_scores[key] = sum(values) / len(values)
                else:
                    avg_scores[key] = 0.0
            
            return avg_scores
        except Exception as e:
            print(f"Error calculating ROUGE: {e}")
    
    return {'rouge1': 0.0, 'rouge2': 0.0, 'rougeL': 0.0}

# Add methods to the class
NanoGPTEvaluator.calculate_bleu_score = calculate_bleu_score
NanoGPTEvaluator.calculate_rouge_score = calculate_rouge_score

print("BLEU and ROUGE calculation methods added")

✓ BLEU and ROUGE calculation methods added
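
To make the reported ROUGE numbers easier to interpret, here is a simplified ROUGE-1 F-measure in plain Python. It is only a sketch: it lowercases and splits on whitespace, with no stemming, so its values will not exactly match those of `rouge_score`, which uses its own tokenizer and (with `use_stemmer=True`) stemming. The `rouge1_f` name is hypothetical.

```python
from collections import Counter


def rouge1_f(reference: str, prediction: str) -> float:
    """Unigram-overlap F-measure between a reference and a prediction."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Four of six words overlap, so precision and recall are both 4/6
print(rouge1_f("the cat sat on the mat", "the cat lay on a mat"))
```

With only 30 generated tokens per sample, a handful of shared words moves the score noticeably, so averages over few samples should be read with care.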


In [None]:
# Add main evaluation method
def evaluate_dataset(self, data_dir: str, split: str = 'val', batch_size: int = 16, 
                    max_eval_samples: int = 1000, num_text_samples: int = 50,
                    prompt_length: int = 20, generation_length: int = 30) -> Dict[str, Any]:
    """
    Evaluate the model on a dataset split
    
    Args:
        data_dir: Directory containing train.bin and val.bin
        split: 'train' or 'val'
        batch_size: Batch size for evaluation
        max_eval_samples: Maximum number of samples for evaluation
        num_text_samples: Number of text samples for BLEU/ROUGE
        prompt_length: Length of prompt for text generation
        generation_length: Length of generated text
        
    Returns:
        Dictionary of evaluation metrics
    """
    print(f"\n{'='*50}")
    print(f"Evaluating on {split} set")
    print(f"{'='*50}")
    
    # Load data
    data = self.load_data(data_dir, split)
    
    results = {'split': split, 'total_tokens': len(data)}
    
    # Calculate perplexity
    print("\n1. Calculating Perplexity...")
    start_time = time.time()
    perplexity = self.calculate_perplexity(data, batch_size, max_batches=min(100, max_eval_samples//batch_size))
    results['perplexity'] = perplexity
    print(f"Perplexity: {perplexity:.4f} (took {time.time() - start_time:.2f}s)")
    
    # Generate samples and calculate BLEU/ROUGE
    if len(data) > 100:  # Only if we have enough data
        print("\n2. Generating samples for BLEU/ROUGE evaluation...")
        start_time = time.time()
        num_samples = min(num_text_samples, max_eval_samples//20, len(data)//100)  # Reasonable number of samples
        references, predictions = self.generate_samples_for_metrics(
            data, num_samples, prompt_length, generation_length
        )
        
        if references and predictions:
            print("\n3. Calculating BLEU score...")
            bleu_score = self.calculate_bleu_score(references, predictions)
            results['bleu'] = bleu_score
            print(f"BLEU Score: {bleu_score:.4f}")
            
            print("\n4. Calculating ROUGE scores...")
            rouge_scores = self.calculate_rouge_score(references, predictions)
            results.update(rouge_scores)
            print(f"ROUGE-1: {rouge_scores['rouge1']:.4f}")
            print(f"ROUGE-2: {rouge_scores['rouge2']:.4f}")
            print(f"ROUGE-L: {rouge_scores['rougeL']:.4f}")
            
            # Show some example generations
            print("\n5. Example generations:")
            for i in range(min(3, len(references))):
                print(f"\nExample {i+1}:")
                print(f"Reference: {references[i][:100]}...")
                print(f"Generated: {predictions[i][:100]}...")
        else:
            print("Could not generate samples for BLEU/ROUGE evaluation")
            results['bleu'] = 0.0
            results.update({'rouge1': 0.0, 'rouge2': 0.0, 'rougeL': 0.0})
    else:
        print("Dataset too small for text generation evaluation")
        results['bleu'] = 0.0
        results.update({'rouge1': 0.0, 'rouge2': 0.0, 'rougeL': 0.0})
    
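    # start_time is reset before each timed stage, so this is the last stage's duration, not the total
    print(f"\nEvaluation of {split} split finished (last stage took {time.time() - start_time:.2f}s)")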
    return results

# Add method to the class
NanoGPTEvaluator.evaluate_dataset = evaluate_dataset

print("Main evaluation method added")
print("\nNanoGPTEvaluator class is ready!")

✓ Main evaluation method added

🎉 NanoGPTEvaluator class is ready!


## 5. Initialize the Evaluator

Load the model and initialize the evaluator:

In [None]:
# Initialize evaluator
print("Initializing NanoGPT Evaluator...")
print("=" * 50)
print(f"Model: {CONFIG['model_path']}")
print(f"Data: {CONFIG['data_dir']}")
print(f"Meta: {CONFIG['meta_path']}")
print(f"Batch size: {CONFIG['batch_size']}")
print(f"Max eval samples: {CONFIG['max_eval_samples']}")
print(f"Splits: {CONFIG['splits']}")

try:
    evaluator = NanoGPTEvaluator(
        CONFIG['model_path'], 
        CONFIG['meta_path'], 
        CONFIG['device']
    )
    print("\nEvaluator initialized successfully!")
except Exception as e:
    print(f"Error initializing evaluator: {e}")
    evaluator = None

Initializing NanoGPT Evaluator...
Model: ../checkpoints/baseline_token_level_nano/token_level_1500_iter.pt
Data: nanoGPT/data/shakespeare
Meta: ../checkpoints/baseline_nanogpt/nanogpt_meta.pkl
Batch size: 16
Max eval samples: 1000
Splits: ['val', 'train']
Using device: cpu
Loading model from: ..\checkpoints\baseline_token_level_nano\token_level_1500_iter.pt
Model arguments: {'n_layer': 8, 'n_head': 8, 'n_embd': 512, 'block_size': 512, 'bias': False, 'vocab_size': 50304, 'dropout': 0.1}
number of parameters: 50.93M
Model loaded successfully!
Number of parameters: 51,192,320
Falling back to individual metric libraries...
ROUGE scorer initialized

✅ Evaluator initialized successfully!


## 6. Run Evaluation

Evaluate the model on the specified dataset splits:

In [None]:
# Run evaluation on all specified splits
if evaluator is not None:
    all_results = {}
    
    for split in CONFIG['splits']:
        print(f"\nEvaluating {split} split...")
        try:
            results = evaluator.evaluate_dataset(
                CONFIG['data_dir'], 
                split, 
                CONFIG['batch_size'], 
                CONFIG['max_eval_samples'],
                CONFIG['num_text_samples'],
                CONFIG['prompt_length'],
                CONFIG['generation_length']
            )
            all_results[split] = results
            print(f"{split} evaluation completed")
        except Exception as e:
            print(f"Error evaluating {split} split: {e}")
            continue
else:
    print("Cannot run evaluation - evaluator not initialized")
    all_results = {}


🔄 Evaluating val split...

Evaluating on val set
Loaded val data: 36,059 tokens

1. Calculating Perplexity...
Calculating perplexity with 16 batch size...
Processing 62 batches for perplexity calculation...
  Processed batch 1/62
  Processed batch 21/62
  Processed batch 41/62
  Processed batch 61/62
Processed 62 batches, 507,904 tokens
Perplexity: 142.1983 (took 1254.42s)

2. Generating samples for BLEU/ROUGE evaluation...
Generating 50 samples for BLEU/ROUGE evaluation...
Generating 50 text samples...
Error generating sample 0: 16398
Error generating sample 1: 198
Error generating sample 2: 1498
Error generating sample 3: 612
Error generating sample 4: 198
Error generating sample 5: 4957
Error generating sample 6: 198
Error generating sample 7: 391
Error generating sample 8: 440
Error generating sample 9: 1549
Error generating sample 10: 198
Error generating sample 11: 290
Error generating sample 12: 7206
Error generating sample 13: 278
Error generating sample 14: 1334
Error gene
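
Each `Error generating sample N: <number>` line above has the shape Python produces when a `KeyError` is printed on its own: the message is just the missing key. One plausible cause, which the log alone does not confirm, is that the `itos` table loaded from `meta_path` lacks some token ids present in the data, for example if the tokenizer file belongs to a different model than the checkpoint, whose logged `vocab_size` is 50304. A pre-flight check along these lines would surface that before generation starts; `find_unknown_tokens` is a hypothetical helper and the sample values are made up.

```python
from typing import Dict, Iterable, List


def find_unknown_tokens(tokens: Iterable[int], itos: Dict[int, str], limit: int = 10) -> List[int]:
    """Return up to `limit` distinct token ids that have no entry in itos."""
    unknown: List[int] = []
    seen = set()
    for t in tokens:
        t = int(t)
        if t not in itos and t not in seen:
            seen.add(t)
            unknown.append(t)
            if len(unknown) >= limit:
                break
    return unknown


# Toy example: a three-symbol vocabulary checked against data with larger ids
itos = {0: "a", 1: "b", 2: "c"}
data = [0, 2, 198, 1, 16398, 198]
missing = find_unknown_tokens(data, itos)
if missing:
    print(f"Tokenizer does not cover token ids such as {missing}")
```

In this notebook the check could be run on a slice of the loaded split, for instance `find_unknown_tokens(data[:10000], evaluator.itos)`, before calling `generate_samples_for_metrics`.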

## 7. Results Summary

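Display a summary table of the results for every evaluated split. The evaluator also records BLEU and ROUGE as 0.0 when sample generation fails, so zeros in this table should be checked against the run log before being read as a statement about model quality: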

In [None]:
# Print comprehensive summary
print(f"\n{'='*60}")
print("EVALUATION SUMMARY")
print(f"{'='*60}")

if all_results:
    # Build one summary row per split
    summary_data = []
    for split, results in all_results.items():
        summary_data.append({
            'Split': split.upper(),
            'Total Tokens': f"{results.get('total_tokens', 0):,}",
            'Perplexity': f"{results.get('perplexity', 0):.4f}",
            'BLEU': f"{results.get('bleu', 0):.4f}",
            'ROUGE-1': f"{results.get('rouge1', 0):.4f}",
            'ROUGE-2': f"{results.get('rouge2', 0):.4f}",
            'ROUGE-L': f"{results.get('rougeL', 0):.4f}"
        })
    
    try:
        # Import inside the try block so a missing pandas triggers the fallback below
        import pandas as pd
        df = pd.DataFrame(summary_data)
        print(df.to_string(index=False))
    except Exception:
        # Fallback if pandas is not available
        for split, results in all_results.items():
            print(f"\n{split.upper()} SET:")
            print(f"  Total tokens: {results.get('total_tokens', 0):,}")
            print(f"  Perplexity:   {results.get('perplexity', 0):.4f}")
            print(f"  BLEU:         {results.get('bleu', 0):.4f}")
            print(f"  ROUGE-1:      {results.get('rouge1', 0):.4f}")
            print(f"  ROUGE-2:      {results.get('rouge2', 0):.4f}")
            print(f"  ROUGE-L:      {results.get('rougeL', 0):.4f}")
    
    print(f"\nEvaluation completed successfully!")
    
    # Store results for further analysis
    evaluation_results = all_results
    print(f"\nResults stored in 'evaluation_results' variable for further analysis")
else:
    print("No evaluation results to display")
    evaluation_results = {}


📊 EVALUATION SUMMARY
Split Total Tokens Perplexity   BLEU ROUGE-1 ROUGE-2 ROUGE-L
  VAL       36,059   142.1983 0.0000  0.0000  0.0000  0.0000
TRAIN      301,966    15.7473 0.0000  0.0000  0.0000  0.0000

✅ Evaluation completed successfully!

💾 Results stored in 'evaluation_results' variable for further analysis


## 8. Additional Analysis (Optional)

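This cell compares train and validation perplexity and labels BLEU and ROUGE-1 with rough quality bands. The thresholds are heuristics, so adjust them as needed: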

In [None]:
# Additional analysis cell - customize as needed

if evaluation_results:
    print("Additional Analysis:")
    print("=" * 30)
    
    # Compare train vs validation performance
    if 'train' in evaluation_results and 'val' in evaluation_results:
        train_ppl = evaluation_results['train'].get('perplexity', 0)
        val_ppl = evaluation_results['val'].get('perplexity', 0)
        
        print(f"\nPerplexity Comparison:")
        print(f"  Training:   {train_ppl:.4f}")
        print(f"  Validation: {val_ppl:.4f}")
        
        if train_ppl > 0 and val_ppl > 0:
            ratio = val_ppl / train_ppl
            print(f"  Val/Train ratio: {ratio:.4f}")
            
            if ratio > 1.5:
                print("  High validation perplexity suggests overfitting")
            elif ratio < 1.1:
                print(f"  Good generalization - low overfitting")
            else:
                print(f"  Moderate generalization gap")
    
    # Text generation quality assessment
    for split in evaluation_results:
        results = evaluation_results[split]
        bleu = results.get('bleu', 0)
        rouge1 = results.get('rouge1', 0)
        
        print(f"\nText Generation Quality ({split}):")
        if bleu > 0.3:
            print(f"  BLEU {bleu:.4f}: Good text similarity")
        elif bleu > 0.1:
            print(f"  BLEU {bleu:.4f}: Moderate text similarity")
        else:
            print(f"  BLEU {bleu:.4f}: Low text similarity")
        
        if rouge1 > 0.3:
            print(f"  ROUGE-1 {rouge1:.4f}: Good word overlap")
        elif rouge1 > 0.15:
            print(f"  ROUGE-1 {rouge1:.4f}: Moderate word overlap")
        else:
            print(f"  ROUGE-1 {rouge1:.4f}: Low word overlap")
else:
    print("No results available for analysis")

🔍 Additional Analysis:

📈 Perplexity Comparison:
  Training:   15.7473
  Validation: 142.1983
  Val/Train ratio: 9.0300
  ⚠️  High validation perplexity suggests overfitting

📝 Text Generation Quality (val):
  BLEU 0.0000: ❌ Low text similarity
  ROUGE-1 0.0000: ❌ Low word overlap

📝 Text Generation Quality (train):
  BLEU 0.0000: ❌ Low text similarity
  ROUGE-1 0.0000: ❌ Low word overlap


## 9. Export Results (Optional)

Save the evaluation results to a file for later analysis:

In [None]:
# Export results to JSON file
import json
from datetime import datetime

if evaluation_results:
    # Add metadata
    export_data = {
        'timestamp': datetime.now().isoformat(),
        'config': CONFIG,
        'results': evaluation_results,
        'model_info': {
            'model_path': CONFIG['model_path'],
            'meta_path': CONFIG['meta_path'],
            'vocab_size': evaluator.vocab_size if evaluator else None,
            'device': evaluator.device if evaluator else None
        }
    }
    
    # Save to file
    output_file = f"evaluation_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
    
    try:
        with open(output_file, 'w') as f:
            json.dump(export_data, f, indent=2)
        print(f"Results exported to: {output_file}")
    except Exception as e:
        print(f"Error exporting results: {e}")
else:
    print("No results to export")

📁 Results exported to: evaluation_results_20250912_162939.json
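
`calculate_perplexity` can return `float('inf')`, and Python's `json` module writes non-finite floats as `Infinity` or `NaN` by default, which strict JSON parsers reject. If the exported file needs to be read by other tools, one option is to replace such values with `null` before dumping. The sketch below assumes that trade-off is acceptable; `sanitize_for_json` is a hypothetical helper and the results dict is made up.

```python
import json
import math
from typing import Any


def sanitize_for_json(value: Any) -> Any:
    """Recursively replace non-finite floats with None so the output is strict JSON."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {k: sanitize_for_json(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize_for_json(v) for v in value]
    return value


# Hypothetical results dict with a non-finite perplexity
results = {"val": {"perplexity": float("inf"), "bleu": 0.0}}
text = json.dumps(sanitize_for_json(results), allow_nan=False)
print(text)  # {"val": {"perplexity": null, "bleu": 0.0}}
```

Passing `allow_nan=False` makes `json.dumps` raise a `ValueError` if any non-finite value slips through, rather than writing output that other parsers may refuse.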
