# FAIR Scientific Data Discovery System - Setup

This notebook guides you through installing and setting up the system.

## What is FAIR?

**FAIR** stands for:
- **F**indable: Datasets can be discovered through rich metadata.
- **A**ccessible: Data can be retrieved through clear, documented access paths.
- **I**nteroperable: Data is stored in standard formats.
- **R**eusable: Data is documented well enough for future use.

## System Overview

This system makes scientific datasets (NetCDF, HDF5, GRIB) more discoverable by:
- ✅ Extracting rich metadata automatically.
- ✅ Handling even poorly documented files.
- ✅ Discovering and linking companion documentation (READMEs, scripts, citations).
- ✅ Supporting optional LLM enrichment to add context to sparse metadata.
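
To make the companion-documentation idea above concrete, here is a minimal sketch that scans a dataset's directory for files that look like READMEs, scripts, or citation files. The function name `find_companions` and the filename rules are assumptions made for this illustration; they are not the project's `CompanionDocFinder` API.

```python
from pathlib import Path

# Hypothetical filename rules, chosen for illustration only.
COMPANION_RULES = {
    "readme": lambda p: p.stem.lower().startswith("readme"),
    "script": lambda p: p.suffix.lower() in {".py", ".sh", ".r", ".m"},
    "citation": lambda p: p.stem.lower().startswith("citation")
    or p.suffix.lower() == ".bib",
}


def find_companions(data_file):
    """Group the sibling files of data_file by the first rule they match."""
    data_file = Path(data_file)
    found = {kind: [] for kind in COMPANION_RULES}
    for sibling in sorted(data_file.parent.iterdir()):
        # Skip directories and the dataset itself.
        if not sibling.is_file() or sibling.name == data_file.name:
            continue
        for kind, matches in COMPANION_RULES.items():
            if matches(sibling):
                found[kind].append(sibling.name)
                break
    return found
```

Calling `find_companions("some_dir/ocean_temperature.nc")` would return a dictionary with one list of filenames per category. A real implementation would likely also search parent directories and inspect file contents.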

## Step 1: Install Dependencies

First, let's install the required packages.

In [1]:
# Install all dependencies into the running kernel's environment
%pip install -q netCDF4 h5py numpy tqdm requests pygrib

## Step 2: Verify Installation

In [2]:
# Test imports
import importlib
import sys
from pathlib import Path

# Try each required package in turn and report the result
for name in ("netCDF4", "h5py", "pygrib", "requests"):
    try:
        importlib.import_module(name)
        print(f"✓ {name} installed")
    except ImportError:
        print(f"✗ {name} not installed")

✓ netCDF4 installed
✓ h5py installed
✓ pygrib installed
✓ requests installed
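
The check above confirms that each package can be imported, but not which version is present. When reporting a problem it can help to list versions as well. The sketch below uses the standard-library `importlib.metadata` module; the helper name `installed_versions` is an assumption made for this example.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


packages = ["netCDF4", "h5py", "pygrib", "requests", "numpy", "tqdm"]
for name, version in installed_versions(packages).items():
    print(f"{name}: {version or 'not installed'}")
```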


## Step 3: Test System Components

In [3]:
# Add parent directory to path to import our modules
sys.path.insert(0, str(Path.cwd().parent))

# Test imports
try:
    import config
    print("✓ config module loaded")
    
    from file_validator import FileValidator
    print("✓ file_validator module loaded")
    
    from metadata_extractors import MetadataExtractor
    print("✓ metadata_extractors module loaded")
    
    from companion_finder import CompanionDocFinder
    print("✓ companion_finder module loaded")
    
    print("\n✓ All core modules loaded successfully!")
    
except ImportError as e:
    print(f"✗ Error importing modules: {e}")
    print("Make sure all .py files are in the parent directory of the notebook's working directory")

✓ config module loaded
✓ file_validator module loaded
✓ metadata_extractors module loaded
✓ companion_finder module loaded

✓ All core modules loaded successfully!
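
If the import cell above fails, the `ImportError` names only the first module that could not be loaded. A small check like the following can list every expected file that is missing. It is a sketch: the module names come from the import cell, while the helper `missing_modules` is hypothetical and assumes each module is either a single `.py` file or a package directory.

```python
from pathlib import Path

# Module names taken from the import cell above.
EXPECTED_MODULES = ["config", "file_validator", "metadata_extractors", "companion_finder"]


def missing_modules(project_dir, module_names=EXPECTED_MODULES):
    """Return module names with no matching .py file or package directory."""
    project_dir = Path(project_dir)
    missing = []
    for name in module_names:
        as_file = project_dir / f"{name}.py"
        as_package = project_dir / name / "__init__.py"
        if not as_file.is_file() and not as_package.is_file():
            missing.append(name)
    return missing


# The notebook adds the parent of the working directory to sys.path.
print(missing_modules(Path.cwd().parent) or "All expected module files found")
```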


## Step 4: Create Sample Data

Let's create a simple NetCDF file for testing.

In [4]:
import netCDF4
import numpy as np

# Create sample directory in generated/
sample_dir = Path("generated/sample_data")
sample_dir.mkdir(parents=True, exist_ok=True)

# Create a simple NetCDF file
filepath = sample_dir / "ocean_temperature.nc"

with netCDF4.Dataset(filepath, 'w') as ds:
    ds.title = "Sample Ocean Temperature Data"
    ds.institution = "Demo University"
    ds.source = "Simulated data for testing"
    ds.Conventions = "CF-1.8"
    
    ds.createDimension('time', 10)
    ds.createDimension('lat', 20)
    ds.createDimension('lon', 30)
    
    time = ds.createVariable('time', 'f8', ('time',))
    time.units = 'days since 2020-01-01'
    time[:] = np.arange(10)
    
    lat = ds.createVariable('lat', 'f4', ('lat',))
    lat.units = 'degrees_north'
    lat[:] = np.linspace(-90, 90, 20)
    
    lon = ds.createVariable('lon', 'f4', ('lon',))
    lon.units = 'degrees_east'
    lon[:] = np.linspace(-180, 180, 30)
    
    temp = ds.createVariable('sea_surface_temperature', 'f4', ('time', 'lat', 'lon'))
    temp.units = 'celsius'
    temp.long_name = 'Sea Surface Temperature'
    temp.standard_name = 'sea_surface_temperature'
    temp[:] = np.random.randn(10, 20, 30) * 5 + 15

print(f"✓ Sample file created: {filepath}")
print(f"  Size: {filepath.stat().st_size} bytes")

✓ Sample file created: generated/sample_data/ocean_temperature.nc
  Size: 34266 bytes
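
A quick way to confirm what kind of file was written, without any scientific libraries, is to look at its leading bytes. The sketch below checks well-known format signatures at offset 0 only. The function `sniff_format` is an illustrative assumption, not the project's `FileValidator`, and it will miss files whose signature sits at a later offset.

```python
# Leading bytes that identify each container format.
SIGNATURES = [
    (b"\x89HDF\r\n\x1a\n", "HDF5 (also used by NetCDF-4)"),
    (b"CDF\x01", "NetCDF classic"),
    (b"CDF\x02", "NetCDF 64-bit offset"),
    (b"CDF\x05", "NetCDF 64-bit data (CDF-5)"),
    (b"GRIB", "GRIB"),
]


def sniff_format(path):
    """Guess a file's format from its first eight bytes."""
    with open(path, "rb") as fh:
        header = fh.read(8)
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"
```

Applied to the sample file with `sniff_format(filepath)`, this would be expected to report the HDF5 signature, because `netCDF4.Dataset` writes the NETCDF4 format by default and that format is stored as HDF5.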


## Step 5: Quick Test

Let's do a quick test of the metadata extraction.

In [5]:
from metadata_extractors import MetadataExtractor

# Initialize extractor
print("Initializing metadata extractor...")
extractor = MetadataExtractor()

# Extract metadata from the sample file
print("Extracting metadata from sample file...")
metadata = extractor.extract(filepath)

if 'error' not in metadata:
    print("\n--- Extracted Metadata ---")
    print(f"Title: {metadata.get('title', 'N/A')}")
    print(f"Institution: {metadata.get('institution', 'N/A')}")
    print(f"Format: {metadata.get('format', 'N/A')}")
    print(f"Variables: {', '.join(metadata.get('variables', {}).keys())}")
    print("--------------------------")
    # Report success only when extraction actually succeeded
    print("\n🎉 System test complete!")
else:
    print(f"✗ Error during extraction: {metadata['error']}")

Initializing metadata extractor...
Extracting metadata from sample file...

--- Extracted Metadata ---
Title: Sample Ocean Temperature Data
Institution: Demo University
Format: NetCDF
Variables: time, lat, lon, sea_surface_temperature
--------------------------

🎉 System test complete!
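
The `'N/A'` fallbacks in the cell above hide which fields a poorly documented file is missing. One way to surface them is a small completeness check like the sketch below. It assumes the extracted metadata is a flat dictionary with global attributes as top-level keys, as the cell suggests for `title` and `institution`. The `source` and `Conventions` keys and the helper `missing_attributes` are assumptions made for this example.

```python
# Attribute names assumed for this example; they mirror the
# global attributes set on the sample file in Step 4.
RECOMMENDED_ATTRS = ("title", "institution", "source", "Conventions")


def missing_attributes(metadata, required=RECOMMENDED_ATTRS):
    """Return recommended attributes that are absent or blank."""
    return [key for key in required if not str(metadata.get(key) or "").strip()]


example = {
    "title": "Sample Ocean Temperature Data",
    "institution": "Demo University",
    "source": "  ",  # whitespace only, so treated as missing
}
print(missing_attributes(example))  # ['source', 'Conventions']
```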


## Next Steps

Now that the system is set up, you can:

1. **Explore Components**: Check out the other notebooks to learn about each component.
2. **Process Your Data**: Use the `fair_index.py` script to process your own scientific datasets.
3. **Try Full Workflow**: See notebook 99 for end-to-end examples.

## Install Ollama (for LLM Enrichment)

To enable LLM-based metadata enrichment:

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.2:3b
```

See notebook 06 for LLM enrichment examples.
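
Once Ollama is running locally, a request to its HTTP API might look like the sketch below. It targets Ollama's `/api/generate` endpoint on the default port 11434 and uses only the standard library; the `requests` package installed in Step 1 would work equally well. The prompt wording and the helper names `build_request` and `describe` are assumptions made for this example and do not come from the project's enrichment code.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(metadata, model="llama3.2:3b"):
    """Build a non-streaming generate request describing a metadata dict."""
    prompt = (
        "Write a one-sentence description of this scientific dataset.\n"
        f"Metadata: {json.dumps(metadata, sort_keys=True)}"
    )
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def describe(metadata, timeout=60):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(metadata), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Keeping request construction separate from the network call makes the prompt easy to inspect before anything is sent to the model.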