# Metadata Extraction from Scientific Data Files

This notebook demonstrates how to extract metadata from scientific data files. The extractors target NetCDF, HDF5, and GRIB; the examples below use NetCDF.

## What Gets Extracted?

- **Global Attributes**: Title, institution, conventions, etc.
- **Dimensions**: Time, latitude, longitude, depth, etc.
- **Variables**: Names, units, descriptions, data types
- **Filename Hints**: Dates, versions, variable names from filename

## Challenge: Minimal Metadata

Many real-world files have minimal embedded metadata. The system extracts what it can and creates searchable text from variable names, dimensions, and filename patterns.
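
As a rough illustration of the filename-pattern idea, the sketch below pulls a date, a version tag, and name tokens out of a filename using regular expressions. The function `filename_hints` and both patterns are hypothetical and are not taken from `metadata_extractors`; real naming schemes vary by data provider, so treat this as one possible approach under those assumptions.

```python
import re
from datetime import datetime

# Hypothetical patterns for illustration only.
DATE_PATTERN = re.compile(r"(?<!\d)(\d{8})(?!\d)")  # e.g. 20230515
VERSION_PATTERN = re.compile(r"(?<![A-Za-z0-9])v(\d+(?:\.\d+)*)", re.IGNORECASE)


def filename_hints(filename: str) -> dict:
    """Guess a date, a version, and name tokens from a filename."""
    stem = filename.rsplit(".", 1)[0]  # drop the last extension
    hints = {}

    for match in DATE_PATTERN.finditer(stem):
        try:
            parsed = datetime.strptime(match.group(1), "%Y%m%d").date()
        except ValueError:
            continue  # eight digits that are not a valid calendar date
        hints["date"] = parsed.isoformat()
        break

    version = VERSION_PATTERN.search(stem)
    if version:
        hints["version"] = version.group(1)

    # Purely alphabetic tokens are candidate variable or product names.
    tokens = [t for t in re.split(r"[_\-.]+", stem) if t.isalpha()]
    if tokens:
        hints["tokens"] = tokens
    return hints


hints = filename_hints("minimal_sst_20230515.nc")
print(hints)
```

Tokens such as `sst` could then be matched against a lookup table of common abbreviations, which is one way a variable hint might be derived.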

In [None]:
# Setup
import json
import sys
from pathlib import Path

# Make the project modules in the parent directory importable
sys.path.insert(0, str(Path.cwd().parent))

from metadata_extractors import MetadataExtractor, NetCDFExtractor, HDF5Extractor
from utils import pretty_print_dict

## Example 1: Extract from NetCDF

In [None]:
# Extract metadata from our sample file
sample_file = Path("sample_data/ocean_temperature.nc")

if sample_file.exists():
    extractor = MetadataExtractor()
    metadata = extractor.extract(sample_file)
    
    print("Extracted Metadata:")
    print("=" * 60)
    
    # Show key fields
    print(f"\nFile: {metadata.get('filename')}")
    print(f"Format: {metadata.get('format')}")
    print(f"Title: {metadata.get('title')}")
    print(f"Institution: {metadata.get('institution')}")
    print(f"Source: {metadata.get('source')}")
    
    # Dimensions
    print("\nDimensions:")
    for dim, size in metadata.get('dimensions', {}).items():
        print(f"  {dim}: {size}")
    
    # Variables
    print(f"\nVariables ({metadata.get('num_variables', 0)} total):")
    for var_name, var_info in metadata.get('variables', {}).items():
        attrs = var_info.get('attributes', {})
        units = attrs.get('units', 'no units')
        long_name = attrs.get('long_name', var_name)
        print(f"  {var_name}: {long_name} [{units}]")
else:
    print("Sample file not found. Run notebook 00 first.")

## Example 2: Create Searchable Text

The extracted metadata is converted to searchable text for embedding generation.
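
To make the conversion concrete, here is a minimal sketch of how a metadata dictionary might be flattened into text. It assumes the dictionary layout used in Example 1 (`dimensions`, and `variables` with nested `attributes`). The function `build_searchable_text` and the example values are hypothetical; the actual `create_searchable_text` method may format its output differently.

```python
def build_searchable_text(metadata: dict) -> str:
    """Flatten a metadata dictionary into one line of text for embedding."""
    parts = []

    # Descriptive global fields first, skipping any that are missing or empty.
    for key in ("title", "institution", "source"):
        value = metadata.get(key)
        if value:
            parts.append(f"{key}: {value}")

    dimensions = metadata.get("dimensions", {})
    if dimensions:
        dims = ", ".join(f"{name}={size}" for name, size in dimensions.items())
        parts.append(f"dimensions: {dims}")

    # One entry per variable, with long name and units when available.
    for name, info in metadata.get("variables", {}).items():
        attrs = info.get("attributes", {})
        long_name = attrs.get("long_name")
        units = attrs.get("units")
        text = f"{name} ({long_name})" if long_name else name
        if units:
            text += f" [{units}]"
        parts.append(f"variable: {text}")

    return "; ".join(parts)


# Made-up example values, not read from any real file.
example = {
    "title": "Ocean Temperature",
    "dimensions": {"time": 1, "lat": 180, "lon": 360},
    "variables": {
        "sst": {"attributes": {"long_name": "sea surface temperature", "units": "degC"}},
        "lat": {"attributes": {}},
    },
}
print(build_searchable_text(example))
```

Because missing fields are skipped rather than written as empty placeholders, a file with no global attributes still yields text built from its dimensions and variable names.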

In [None]:
if sample_file.exists():
    # Create searchable text
    searchable_text = extractor.create_searchable_text(metadata)
    
    print("Searchable Text:")
    print("=" * 60)
    print(searchable_text)
    print("\n" + "=" * 60)
    print(f"Length: {len(searchable_text)} characters")

## Example 3: Handling Files with Minimal Metadata

Let's create a file with minimal metadata to show how the system handles it.

In [None]:
import netCDF4
import numpy as np

# Create a file with minimal metadata (common in real-world data)
minimal_file = Path("sample_data/minimal_sst_20230515.nc")
minimal_file.parent.mkdir(parents=True, exist_ok=True)  # ensure the output directory exists

with netCDF4.Dataset(minimal_file, 'w') as ds:
    # NO global attributes
    
    # Just dimensions and variables
    ds.createDimension('time', 1)
    ds.createDimension('lat', 180)
    ds.createDimension('lon', 360)
    
    time = ds.createVariable('time', 'f8', ('time',))
    time[:] = [0]
    
    lat = ds.createVariable('lat', 'f4', ('lat',))
    lat[:] = np.linspace(-89.5, 89.5, 180)
    
    lon = ds.createVariable('lon', 'f4', ('lon',))
    lon[:] = np.linspace(-179.5, 179.5, 360)
    
    # Variable with minimal metadata
    sst = ds.createVariable('sst', 'f4', ('time', 'lat', 'lon'))
    sst[:] = np.random.randn(1, 180, 360) * 5 + 15

print(f"Created minimal metadata file: {minimal_file}")

In [None]:
# Extract from minimal file (create the extractor here in case Example 1 was skipped)
extractor = MetadataExtractor()
metadata_minimal = extractor.extract(minimal_file)

print("\nMetadata from Minimal File:")
print("=" * 60)
print(f"Title: {metadata_minimal.get('title') or '(none)'}")
print(f"Institution: {metadata_minimal.get('institution') or '(none)'}")
print(f"\nVariables: {list(metadata_minimal.get('variables', {}).keys())}")
print(f"Dimensions: {metadata_minimal.get('dimensions')}")

# Check filename hints
if 'date_from_filename' in metadata_minimal:
    print(f"Date from filename: {metadata_minimal['date_from_filename']}")
if 'variables_hint' in metadata_minimal:
    print(f"Variable hints: {metadata_minimal['variables_hint']}")

In [None]:
# Even with minimal metadata, we can create searchable text
searchable_minimal = extractor.create_searchable_text(metadata_minimal)

print("\nSearchable Text from Minimal File:")
print("=" * 60)
print(searchable_minimal)
print("\n" + "=" * 60)

print("\nKey Observations (compare with the output above):")
print("- Variable name 'sst' is extracted (sea surface temperature)")
print("- Date '2023-05-15' is parsed from the filename")
print("- Dimensions are identified (180x360 suggests a global 1-degree grid)")
print("- Variable name 'sst' is expanded to 'sea_surface_temperature'")

## Example 4: Format-Specific Extraction

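Different formats store metadata differently. NetCDF keeps file-level global attributes plus per-variable attributes, HDF5 attaches attributes to groups and datasets in a hierarchy, and GRIB encodes metadata as coded keys in each message. The format-specific extractors expose these details directly.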
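
Before a format-specific extractor can be chosen, the file format has to be identified. One common approach, sketched below, is to inspect the first bytes of the file instead of trusting the extension. The function `sniff_format` is hypothetical, and whether `MetadataExtractor` dispatches this way is an assumption. The byte signatures themselves are the standard ones for these formats.

```python
import tempfile
from pathlib import Path

# Leading byte signatures. NetCDF-4 files are HDF5 containers,
# so they carry the HDF5 signature.
SIGNATURES = [
    (b"\x89HDF\r\n\x1a\n", "HDF5 (or NetCDF-4)"),
    (b"CDF\x01", "NetCDF classic"),
    (b"CDF\x02", "NetCDF 64-bit offset"),
    (b"CDF\x05", "NetCDF CDF-5"),
    (b"GRIB", "GRIB"),
]


def sniff_format(path) -> str:
    """Guess the file format from the bytes at offset 0."""
    with open(path, "rb") as fh:
        header = fh.read(8)
    for signature, label in SIGNATURES:
        if header.startswith(signature):
            return label
    # Only offset 0 is checked: HDF5 files with a user block and GRIB
    # files with leading padding would need a wider search.
    return "unknown"


# Demonstrate on a fake file that only contains a GRIB signature.
with tempfile.TemporaryDirectory() as tmp:
    fake = Path(tmp) / "fake_message.bin"
    fake.write_bytes(b"GRIB" + bytes(12))
    detected = sniff_format(fake)
print(detected)
```

Checking content rather than the extension matters in practice because a `.nc` file may be either classic NetCDF or an HDF5-based NetCDF-4 file.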

In [None]:
# NetCDF-specific extractor
if sample_file.exists():
    nc_extractor = NetCDFExtractor()
    nc_metadata = nc_extractor.extract(sample_file)

    print("NetCDF-Specific Metadata:")
    print("\nGlobal Attributes:")
    for key, value in nc_metadata.get('global_attributes', {}).items():
        print(f"  {key}: {value}")
else:
    print("Sample file not found. Run notebook 00 first.")

## Example 5: Save Metadata as JSON

The extracted metadata dictionary can be saved as JSON so that other tools can consume it.
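
Metadata dictionaries often contain values that the `json` module cannot serialize on its own, such as paths, dates, byte strings, or NumPy scalars. The sketch below shows one way to handle them with the `default` hook of `json.dumps`. The helper `to_jsonable` is hypothetical, and it is not known whether `save_json` from `utils` does something similar.

```python
import json
from datetime import date, datetime
from pathlib import Path


def to_jsonable(value):
    """Fallback for json.dumps: convert common non-JSON types."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, Path):
        return value.as_posix()
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    if hasattr(value, "tolist"):
        # NumPy scalars and arrays expose tolist(), which yields plain Python values.
        return value.tolist()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


# Made-up record for illustration.
record = {
    "filename": Path("sample_data/ocean_temperature.nc"),
    "created": date(2023, 5, 15),
    "history": b"generated",
}
text = json.dumps(record, default=to_jsonable, indent=2)
print(text)
```

Raising `TypeError` for unknown types keeps unexpected values visible instead of silently writing them as strings.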

In [None]:
from utils import save_json

if sample_file.exists():
    # Save the metadata extracted in Example 1
    output_file = Path("sample_data/ocean_temperature_metadata.json")
    save_json(metadata, output_file)
    print(f"Metadata saved to: {output_file}")

    # Pretty print a sample
    print("\nMetadata Structure (partial):")
    print(json.dumps({
        'format': metadata.get('format'),
        'dimensions': metadata.get('dimensions'),
        'num_variables': metadata.get('num_variables')
    }, indent=2))
else:
    print("Sample file not found. Run notebook 00 first.")

## Key Takeaways

1. **Metadata varies widely** - Some files are rich, others minimal
2. **Multiple extraction strategies** - Format-specific + filename hints
3. **Searchable text creation** - Combines all available information
4. **Graceful degradation** - Works even with minimal metadata

## Next Steps

- **Notebook 03**: Discover companion documentation
- **Notebook 04**: Generate embeddings and search