# Optional LLM Enrichment for Enhanced Metadata

This notebook demonstrates how to use a local LLM (Ollama) to enhance metadata extraction.

## Why LLM Enrichment?

LLMs can:
- **Decode abbreviations**: sst → sea surface temperature
- **Infer context**: Guess institution from project names
- **Suggest use cases**: Identify potential applications
- **Add domain knowledge**: Recognize scientific domain
- **Generate summaries**: Create human-readable descriptions
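
To make the list above concrete, here is a rough sketch of what a request for this kind of enrichment could look like when sent to Ollama's `/api/generate` endpoint. The prompt wording and the helper name `build_enrichment_request` are illustrative assumptions; the notebook does not show how `LLMEnricher` builds its prompts.

```python
import json

def build_enrichment_request(metadata: dict, model: str = "llama3.2:3b") -> dict:
    """Build a JSON payload for Ollama's /api/generate endpoint (hypothetical helper)."""
    prompt = (
        "You are helping catalogue scientific datasets.\n"
        "Given this metadata, return JSON with the keys "
        "'domain', 'variable_descriptions' and 'use_cases'.\n\n"
        f"Metadata:\n{json.dumps(metadata, indent=2)}"
    )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,   # ask for one complete response instead of chunks
        "format": "json",  # ask Ollama to constrain the output to valid JSON
    }

payload = build_enrichment_request({"filename": "test.nc", "variables": ["sst"]})
print(payload["prompt"])
```

The payload would be sent with an HTTP POST to `http://localhost:11434/api/generate`; the generated text comes back in the `response` field of the reply.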

## Important Notes

- ✅ **Completely Optional** - System works fine without it
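- ✅ **Privacy-preserving** - Uses a local Ollama server, so no data is sent to external services
- ⚠️ **Slower** - Adds roughly 2-14 seconds per file, depending on hardware and model (the timing run later in this notebook measured 13.6 seconds)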
- ⚠️ **Requires Ollama** - Install separately

## Setup Ollama

First, install Ollama if you haven't already:

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (choose one)
ollama pull llama3.2:3b      # Fast, 2GB, good quality
# ollama pull llama3.1:8b      # Slower, ~5GB, better quality
# ollama pull mistral:7b       # Alternative

# Test it (prints one reply and exits; omit the prompt for an interactive session)
ollama run llama3.2:3b "Say hello"
```

In [1]:
# Setup
import sys
from pathlib import Path
sys.path.insert(0, str(Path.cwd().parent))

from llm_enricher import LLMEnricher, DataInspector
from metadata_extractors import MetadataExtractor
from search_engine import FAIRSearchEngine

## Example 1: Test Ollama Connection

In [2]:
# Test connection to Ollama
try:
    enricher = LLMEnricher(model="llama3.2:3b")
    print("✓ Connected to Ollama")
except Exception as e:
    enricher = None  # later cells need a working enricher; rerun this cell once Ollama is up
    print(f"✗ Could not connect to Ollama: {e}")
    print("\nMake sure Ollama is running:")
    print("  ollama serve")

Connected to Ollama at http://localhost:11434
✓ Connected to Ollama
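
If the connection fails, it can help to check the server independently of `LLMEnricher`. The sketch below queries Ollama's `/api/tags` endpoint, which lists locally pulled models. The helper name `list_local_models` is an assumption for illustration and is not part of this project's modules.

```python
import json
import urllib.error
import urllib.request

def list_local_models(base_url: str = "http://localhost:11434", timeout: float = 2.0):
    """Return names of locally pulled models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None
    return [m.get("name", "") for m in payload.get("models", [])]

models = list_local_models()
if models is None:
    print("Ollama is not reachable; start it with `ollama serve`.")
elif "llama3.2:3b" not in models:
    print("Server is up, but llama3.2:3b has not been pulled yet.")
else:
    print("Ollama is ready.")
```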


## Example 2: Enrich Basic Metadata

In [3]:
# Extract basic metadata first
sample_file = Path("sample_data/ocean_temperature.nc")

if sample_file.exists():
    extractor = MetadataExtractor()
    metadata = extractor.extract(sample_file)
    
    print("Original Metadata:")
    print("=" * 60)
    print(f"Filename: {metadata.get('filename')}")
    print(f"Title: {metadata.get('title')}")
    print(f"Institution: {metadata.get('institution')}")
    print(f"Variables: {list(metadata.get('variables', {}).keys())}")
else:
    print("Sample file not found. Run notebook 00 first.")

Original Metadata:
Filename: ocean_temperature.nc
Title: Sample Ocean Temperature Data
Institution: Demo University
Variables: ['time', 'lat', 'lon', 'sea_surface_temperature']


In [4]:
# Enrich with LLM
if sample_file.exists():
    print("\nEnriching metadata with LLM...")
    print("(This may take a few seconds)\n")
    
    enriched = enricher.enrich_metadata(metadata)
    
    if 'llm_enrichment' in enriched:
        llm_data = enriched['llm_enrichment']
        
        print("LLM-Enriched Metadata:")
        print("=" * 60)
        
        if 'domain' in llm_data:
            print(f"Domain: {llm_data['domain']}")
        
        if 'source_institution' in llm_data:
            print(f"Inferred Institution: {llm_data['source_institution']}")
        
        if 'variable_descriptions' in llm_data:
            print(f"\nVariable Descriptions:")
            for var, desc in llm_data['variable_descriptions'].items():
                print(f"  {var}: {desc}")
        
        if 'use_cases' in llm_data:
            print(f"\nPotential Use Cases:")
            for use_case in llm_data['use_cases']:
                print(f"  - {use_case}")
        
        if 'quality_notes' in llm_data:
            print(f"\nQuality Notes: {llm_data['quality_notes']}")
    else:
        print(f"Enrichment error: {enriched.get('llm_error')}")


Enriching metadata with LLM...
(This may take a few seconds)

LLM-Enriched Metadata:
Domain: Oceanography
Inferred Institution: Demo University

Variable Descriptions:
  time: Date and time of observation (e.g., year, month, day, hour)
  lat: Latitude coordinate (-90° to +90° N/S)
  lon: Longitude coordinate (-180° to +180° E/W)
  sea_surface_temperature: Temperature at the sea surface in degrees Celsius

Potential Use Cases:
  - Climate modeling and analysis
  - Ocean current research and tracking
  - Sea ice formation and melting studies

Quality Notes: 
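
Because LLM output can be incomplete (note the empty quality notes above), a quick sanity check on the enrichment dict may be worthwhile. This is a minimal sketch using the keys printed in the cell above; `check_enrichment` is a hypothetical helper, not part of `llm_enricher`.

```python
def check_enrichment(llm_data: dict, file_variables: list[str]) -> list[str]:
    """Return human-readable warnings about an LLM enrichment dict."""
    warnings = []
    described = set(llm_data.get("variable_descriptions", {}))
    known = set(file_variables)
    # Descriptions for variables that are not in the file suggest hallucination.
    for name in sorted(described - known):
        warnings.append(f"description for unknown variable: {name}")
    for name in sorted(known - described):
        warnings.append(f"no description for variable: {name}")
    for key in ("domain", "use_cases", "quality_notes"):
        if not llm_data.get(key):
            warnings.append(f"empty or missing field: {key}")
    return warnings

example = {
    "domain": "Oceanography",
    "variable_descriptions": {"time": "Time of observation", "sst": "Temperature"},
    "use_cases": ["Climate modeling and analysis"],
    "quality_notes": "",
}
found = check_enrichment(example, ["time", "sea_surface_temperature"])
for warning in found:
    print("warning:", warning)
```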


## Example 3: Data Inspection Tools

Inspect actual data values to validate metadata.

In [5]:
# Inspect variable statistics
if sample_file.exists():
    inspector = DataInspector()
    
    print("Variable Statistics:")
    print("=" * 60)
    
    stats = inspector.get_variable_statistics(
        str(sample_file), 
        'sea_surface_temperature'
    )
    
    if 'error' not in stats:
        print(f"Variable: sea_surface_temperature")
        print(f"  Min: {stats['min']:.2f}")
        print(f"  Max: {stats['max']:.2f}")
        print(f"  Mean: {stats['mean']:.2f}")
        print(f"  Std Dev: {stats['std']:.2f}")
        print(f"  Shape: {stats['shape']}")
        print(f"  Data Type: {stats['dtype']}")
    else:
        print(f"Error: {stats['error']}")

Variable Statistics:
Variable: sea_surface_temperature
  Min: -2.41
  Max: 35.17
  Mean: 14.91
  Std Dev: 4.97
  Shape: (10, 20, 30)
  Data Type: float32
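
Statistics like these can be used to cross-check what the metadata (or the LLM) claims. As a rough sketch, the function below guesses whether a sea surface temperature range looks like degrees Celsius or Kelvin. The thresholds are illustrative assumptions, and `plausible_sst_units` is not part of `DataInspector`.

```python
def plausible_sst_units(vmin: float, vmax: float) -> str:
    """Guess whether sea surface temperatures look like Celsius or Kelvin."""
    # Sea water freezes near -2 degC; the bounds below are loose, assumed limits.
    if -5.0 <= vmin and vmax <= 40.0:
        return "degC"
    if 268.0 <= vmin and vmax <= 313.0:
        return "K"
    return "unknown"

print(plausible_sst_units(-2.41, 35.17))  # degC
print(plausible_sst_units(271.0, 308.0))  # K
```

For the sample file, the range of -2.41 to 35.17 is consistent with the LLM's description of the variable as degrees Celsius.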


In [6]:
# Check temporal coverage
if sample_file.exists():
    print("\nTemporal Coverage:")
    print("=" * 60)
    
    temporal = inspector.check_temporal_coverage(str(sample_file))
    
    if 'error' not in temporal:
        print(f"Number of timesteps: {temporal['num_timesteps']}")
        print(f"Time units: {temporal['units']}")
        
        if 'start_date' in temporal:
            print(f"Start: {temporal['start_date']}")
            print(f"End: {temporal['end_date']}")
    else:
        print(f"Error: {temporal['error']}")


Temporal Coverage:
Number of timesteps: 10
Time units: days since 2020-01-01
Start: 2020-01-01 00:00:00
End: 2020-01-10 00:00:00
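
The start and end dates follow from the time units and the stored offsets. The sketch below shows that arithmetic for the simple `days since ...` case only, assuming the file stores offsets 0 through 9. Real files can use other units and calendars, for which `netCDF4.num2date` or `cftime` is the safer choice. How `DataInspector` decodes time internally is not shown in this notebook.

```python
from datetime import datetime, timedelta

def decode_days_since(units: str, offsets: list[float]) -> list[datetime]:
    """Decode 'days since YYYY-MM-DD' offsets using the standard calendar."""
    unit, _, reference = units.partition(" since ")
    if unit.strip() != "days":
        raise ValueError(f"only 'days since ...' is handled here, got {units!r}")
    origin = datetime.fromisoformat(reference.strip())
    return [origin + timedelta(days=value) for value in offsets]

dates = decode_days_since("days since 2020-01-01", list(range(10)))
print(dates[0], "to", dates[-1])
# 2020-01-01 00:00:00 to 2020-01-10 00:00:00
```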


In [7]:
# Check spatial coverage
if sample_file.exists():
    print("\nSpatial Coverage:")
    print("=" * 60)
    
    spatial = inspector.check_spatial_coverage(str(sample_file))
    
    if 'error' not in spatial:
        print(f"Latitude: {spatial['lat_min']:.2f}° to {spatial['lat_max']:.2f}°")
        print(f"Longitude: {spatial['lon_min']:.2f}° to {spatial['lon_max']:.2f}°")
        
        if spatial['lat_resolution']:
            print(f"Lat resolution: {spatial['lat_resolution']:.2f}°")
        if spatial['lon_resolution']:
            print(f"Lon resolution: {spatial['lon_resolution']:.2f}°")
    else:
        print(f"Error: {spatial['error']}")


Spatial Coverage:
Latitude: -90.00° to 90.00°
Longitude: -180.00° to 180.00°
Lat resolution: 9.47°
Lon resolution: 12.41°
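
The resolutions above are consistent with evenly spaced grids: 20 latitude points spanning 180° and 30 longitude points spanning 360°, matching the variable shape `(10, 20, 30)`. A minimal sketch of that arithmetic follows; `grid_resolution` is a hypothetical helper, and the actual `check_spatial_coverage` implementation may compute it differently.

```python
def grid_resolution(vmin: float, vmax: float, n_points: int):
    """Mean spacing of n_points evenly spaced values covering [vmin, vmax]."""
    if n_points < 2:
        return None  # a single point has no spacing
    return (vmax - vmin) / (n_points - 1)

print(f"Lat resolution: {grid_resolution(-90, 90, 20):.2f}°")    # 9.47°
print(f"Lon resolution: {grid_resolution(-180, 180, 30):.2f}°")  # 12.41°
```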


## Example 4: Enrich Search Results

Add LLM-generated summaries to search results.

In [8]:
# Search first
try:
    engine = FAIRSearchEngine(load_existing=True)
    results = engine.search("ocean temperature", top_k=3)
    
    if results:
        print("Original Search Results:")
        print("=" * 60)
        for i, r in enumerate(results, 1):
            print(f"\n{i}. {Path(r['filepath']).name}")
            print(f"   Score: {r['similarity_score']:.3f}")
            print(f"   Title: {r.get('title', 'N/A')}")
        
        # Enrich with LLM summaries
        print("\n" + "=" * 60)
        print("Generating LLM Summaries...")
        print("=" * 60)
        
        enriched_results = enricher.enrich_search_results(results)
        
        print("\nEnriched Results:")
        print("=" * 60)
        for i, r in enumerate(enriched_results, 1):
            print(f"\n{i}. {Path(r['filepath']).name}")
            print(f"   Score: {r['similarity_score']:.3f}")
            
            if 'llm_summary' in r:
                print(f"   Summary: {r['llm_summary']}")
    else:
        print("No results found. Index some data first.")

except FileNotFoundError:
    print("No index found. Run indexing first.")

Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
Model loaded. Embedding dimension: 384
Loaded embedding cache: 23 entries
Loading existing index...
Initialized FAISS index (dim=384)
Index loaded: {'total_vectors': 11, 'total_metadata': 11, 'unique_files': 11, 'embedding_dim': 384, 'index_type': 'IndexFlatIP'}
Original Search Results:

1. ocean_temp_feb_2023.nc
   Score: 0.548
   Title: Test temperature data

2. ocean_temp_jan_2023.nc
   Score: 0.529
   Title: Test temperature data

3. ocean_temp_mar_2023.nc
   Score: 0.524
   Title: Test temperature data

Generating LLM Summaries...

Enriched Results:

1. ocean_temp_feb_2023.nc
   Score: 0.548
   Summary: This dataset contains test temperature readings collected by the Demo Lab, providing a comprehensive record of thermal data across three dimensions. It could be used to inform quality control measures

2. ocean_temp_jan_2023.nc
   Score: 0.529
   Summary: This dataset contains test temperature data collected at Demo La

## Example 5: Compare With/Without Enrichment

In [9]:
# Create minimal metadata file
import netCDF4
import numpy as np

minimal_file = Path("sample_data/test_minimal.nc")

with netCDF4.Dataset(minimal_file, 'w') as ds:
    # NO global attributes, only variable name
    ds.createDimension('x', 100)
    ds.createDimension('y', 100)
    ds.createDimension('t', 10)
    
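    var = ds.createVariable('wspd', 'f4', ('t', 'y', 'x'))
    # Synthetic wind speeds centred on 10 with spread 5; clip at 0 because speeds cannot be negative
    var[:] = np.clip(np.random.randn(10, 100, 100) * 5 + 10, 0, None)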

print(f"Created minimal file: {minimal_file}")

Created minimal file: sample_data/test_minimal.nc


In [10]:
# Extract without enrichment
extractor = MetadataExtractor()
basic_metadata = extractor.extract(minimal_file)

print("Without LLM Enrichment:")
print("=" * 60)
print(f"Title: {basic_metadata.get('title') or '(none)'}")
print(f"Institution: {basic_metadata.get('institution') or '(none)'}")
print(f"Variables: {list(basic_metadata.get('variables', {}).keys())}")
print(f"\nSearchable text length: {len(extractor.create_searchable_text(basic_metadata))} chars")

Without LLM Enrichment:
Title: (none)
Institution: (none)
Variables: ['wspd']

Searchable text length: 74 chars


In [11]:
# Enrich with LLM
enriched_metadata = enricher.enrich_metadata(basic_metadata)

print("\nWith LLM Enrichment:")
print("=" * 60)

if 'llm_enrichment' in enriched_metadata:
    llm_data = enriched_metadata['llm_enrichment']
    
    print(f"Inferred domain: {llm_data.get('domain', 'N/A')}")
    print(f"\nVariable interpretation:")
    for var, desc in llm_data.get('variable_descriptions', {}).items():
        print(f"  {var} → {desc}")
    
    print(f"\nPotential uses:")
    for use in llm_data.get('use_cases', []):
        print(f"  - {use}")
    
    # Create enriched searchable text
    base_text = extractor.create_searchable_text(basic_metadata)
    enriched_text = base_text + " " + str(llm_data)
    print(f"\nSearchable text length: {len(enriched_text)} chars (increased by {len(enriched_text) - len(base_text)})")
else:
    print(f"Error: {enriched_metadata.get('llm_error')}")


With LLM Enrichment:
Inferred domain: Atmospheric Science

Variable interpretation:
  wspd → Wind speed

Potential uses:
  - Climate modeling
  - Weather forecasting

Searchable text length: 262 chars (increased by 188)
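
The cell above appends `str(llm_data)`, which also embeds Python dict punctuation and key names in the searchable text. One possible alternative is to flatten the enrichment into plain sentences first. This is only a sketch: `enrichment_to_text` is a hypothetical helper, and whether it improves search quality has not been measured here.

```python
def enrichment_to_text(llm_data: dict) -> str:
    """Flatten an enrichment dict into plain text suitable for embedding."""
    parts = []
    if llm_data.get("domain"):
        parts.append(str(llm_data["domain"]))
    for name, desc in llm_data.get("variable_descriptions", {}).items():
        parts.append(f"{name}: {desc}")
    parts.extend(str(use) for use in llm_data.get("use_cases", []))
    return ". ".join(parts)

sample_enrichment = {
    "domain": "Atmospheric Science",
    "variable_descriptions": {"wspd": "Wind speed"},
    "use_cases": ["Climate modeling", "Weather forecasting"],
}
flat_text = enrichment_to_text(sample_enrichment)
print(flat_text)
# Atmospheric Science. wspd: Wind speed. Climate modeling. Weather forecasting
```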


## Performance Considerations

In [12]:
import time

# Time basic extraction (perf_counter is a monotonic, high-resolution clock suited to short intervals)
start = time.perf_counter()
basic = extractor.extract(sample_file)
basic_time = time.perf_counter() - start

# Time with LLM enrichment
start = time.perf_counter()
enriched = enricher.enrich_metadata(basic)
llm_time = time.perf_counter() - start

print("Performance Comparison:")
print("=" * 60)
print(f"Basic extraction: {basic_time:.3f} seconds")
print(f"LLM enrichment: {llm_time:.3f} seconds")
print(f"Total with LLM: {basic_time + llm_time:.3f} seconds")
print(f"\nSpeed difference: {((basic_time + llm_time) / basic_time):.1f}x slower with LLM")

print("\n💡 Recommendation:")
if llm_time < 5:
    print("  LLM enrichment is reasonably fast for your use case")
else:
    print("  Consider using LLM enrichment selectively for important datasets")

Performance Comparison:
Basic extraction: 0.002 seconds
LLM enrichment: 13.586 seconds
Total with LLM: 13.587 seconds

Speed difference: 8636.0x slower with LLM

💡 Recommendation:
  Consider using LLM enrichment selectively for important datasets
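
To put the per-file cost in perspective, the measured 13.586 seconds can be extrapolated to batch sizes. The sketch below assumes files are enriched one after another and that every file takes about as long as the sample, which is a simplification.

```python
def batch_hours(n_files: int, seconds_per_file: float) -> float:
    """Estimated wall-clock hours to enrich n_files sequentially."""
    return n_files * seconds_per_file / 3600

for n in (10, 100, 1000):
    print(f"{n:>5} files: {batch_hours(n, 13.586):.2f} h")
#    10 files: 0.04 h
#   100 files: 0.38 h
#  1000 files: 3.77 h
```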


## When to Use LLM Enrichment

### ✅ Good Use Cases
- Files with minimal or cryptic metadata
- Variable names that are abbreviations
- Data from unfamiliar sources
- High-value datasets worth extra processing
- Interactive exploration of a few files, where waiting several seconds per file is acceptable

### ⚠️ Consider Skipping
- Large batch processing (1000s of files)
- Well-documented CF-compliant files
- Time-sensitive indexing workflows
- When basic search already works well
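
The two lists above can be turned into a simple gate in an indexing pipeline. The heuristic below is a sketch under stated assumptions: it assumes each entry in `metadata['variables']` maps a variable name to a dict of its attributes (the notebook only shows that `variables` is a dict), and `needs_enrichment` is a hypothetical helper.

```python
def needs_enrichment(metadata: dict) -> bool:
    """Heuristic: enrich only files whose own metadata is thin."""
    if not metadata.get("title") or not metadata.get("institution"):
        return True
    variables = metadata.get("variables", {})
    # Assumption: each variable maps to a dict of its attributes.
    undocumented = [
        name for name, attrs in variables.items()
        if not (attrs or {}).get("long_name") and not (attrs or {}).get("standard_name")
    ]
    return len(undocumented) > 0

minimal = {"variables": {"wspd": {}}}
documented = {
    "title": "Sample Ocean Temperature Data",
    "institution": "Demo University",
    "variables": {"sst": {"long_name": "sea surface temperature"}},
}
print(needs_enrichment(minimal))     # True
print(needs_enrichment(documented))  # False
```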

## Best Practices

1. **Index without LLM first** - Get basic search working quickly
2. **Enrich selectively** - Use LLM for important datasets
3. **Cache results** - Store enriched metadata
4. **Use local models** - Keep data private
5. **Monitor quality** - Verify LLM outputs make sense
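
For practice 3, one simple way to cache enriched metadata is a JSON file per dataset, keyed by the file's path, size, modification time, and the model name. This is a minimal sketch, not a feature of `LLMEnricher`; the helper names and the cache layout are assumptions. The demo at the end uses a stand-in function instead of a real LLM call.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cache_key(filepath: Path, model: str) -> str:
    """Key that changes when the file's size or mtime, or the model, changes."""
    stat = filepath.stat()
    raw = f"{filepath.resolve()}|{stat.st_size}|{stat.st_mtime_ns}|{model}"
    return hashlib.sha256(raw.encode()).hexdigest()

def cached_enrich(filepath: Path, model: str, enrich, cache_dir: Path) -> dict:
    """Return the cached enrichment if present, otherwise call enrich() and store it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{cache_key(filepath, model)}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = enrich()
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result

# Demo with a stand-in for the (slow) LLM call.
calls = []
def fake_enrich() -> dict:
    calls.append(1)
    return {"domain": "Oceanography"}

with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "example.nc"
    data_file.write_bytes(b"placeholder")
    first = cached_enrich(data_file, "llama3.2:3b", fake_enrich, Path(tmp) / "cache")
    second = cached_enrich(data_file, "llama3.2:3b", fake_enrich, Path(tmp) / "cache")

print(first == second, len(calls))  # True 1
```

In this notebook, the stand-in could be replaced with `lambda: enricher.enrich_metadata(metadata)`, provided the returned dict is JSON-serializable.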

## Next Steps

- **Notebook 99**: See complete workflows with optional enrichment
- Try different Ollama models for quality/speed tradeoffs
- Integrate enrichment into your indexing pipeline selectively