# FAIR Scientific Data Discovery System - Setup

This notebook guides you through installing and setting up the system.

## What is FAIR?

**FAIR** stands for:
- **F**indable: Datasets can be discovered through search
- **A**ccessible: Clear paths to access data
- **I**nteroperable: Works with standard formats
- **R**eusable: Well-documented for future use

## System Overview

This system makes scientific datasets (NetCDF, HDF5, GRIB) searchable using:
- ✅ Local semantic search (no API keys needed)
- ✅ Works offline
- ✅ Handles minimal metadata
- ✅ Discovers companion documentation
- ✅ Optional LLM enrichment
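
To make "local semantic search" concrete, here is a minimal sketch of ranking documents by cosine similarity between embedding vectors. The vectors are made-up 3-dimensional stand-ins for real embeddings, and `top_k_cosine` is a hypothetical helper for illustration, not part of this system's modules.

```python
import numpy as np

def top_k_cosine(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k rows of doc_vecs most similar to query_vec."""
    q = np.asarray(query_vec, dtype=np.float32)
    docs = np.asarray(doc_vecs, dtype=np.float32)
    # Normalize so that a plain dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ q
    order = np.argsort(-scores)[:k]  # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy "embeddings" (made up for illustration only).
doc_vecs = [
    [0.9, 0.1, 0.0],  # e.g. an ocean temperature dataset
    [0.0, 1.0, 0.0],  # e.g. a wind speed dataset
    [0.1, 0.0, 0.9],  # e.g. a soil moisture dataset
]
query_vec = [1.0, 0.0, 0.0]

ranked = top_k_cosine(query_vec, doc_vecs, k=2)
print(ranked)
```

In a real run, the vectors would come from the embedding model set up in Step 2, and a vector index would replace the brute-force dot product.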

## Step 1: Install Dependencies

First, let's install the required packages.

In [None]:
# Install core dependencies (%pip installs into the environment of the running kernel)
%pip install -q netCDF4 h5py sentence-transformers faiss-cpu numpy tqdm

In [None]:
# Optional: Install pygrib for GRIB support (may require system dependencies)
# %pip install -q pygrib

In [None]:
# Optional: Install the ollama Python client for LLM enrichment
# %pip install -q ollama

## Step 2: Download Embedding Model

The system uses a local sentence-transformer model to generate embeddings.
The first run downloads roughly 80 MB and caches the model locally; later runs load it from the cache.

In [None]:
from sentence_transformers import SentenceTransformer

# Download and cache the model
model_name = "sentence-transformers/all-MiniLM-L6-v2"
print(f"Downloading model: {model_name}")
model = SentenceTransformer(model_name)
print(f"✓ Model loaded! Embedding dimension: {model.get_sentence_embedding_dimension()}")
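
Once the model is cached, later sessions can be kept off the network. The sketch below sets the `HF_HUB_OFFLINE` environment variable, which tells the Hugging Face Hub client to use only the local cache. It assumes the download above has already succeeded; whether the rest of this system needs any other network access is not something this snippet checks.

```python
import os

# Assumption: the model was already downloaded by the cell above.
# Set this in a fresh kernel BEFORE importing sentence_transformers,
# because the setting is read when the Hub client is imported.
os.environ["HF_HUB_OFFLINE"] = "1"

offline = os.environ.get("HF_HUB_OFFLINE") == "1"
print(f"Offline mode requested: {offline}")
```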

## Step 3: Verify Installation

In [None]:
# Test imports
import sys
from pathlib import Path

try:
    import netCDF4
    print("✓ netCDF4 installed")
except ImportError:
    print("✗ netCDF4 not installed")

try:
    import h5py
    print("✓ h5py installed")
except ImportError:
    print("✗ h5py not installed")

try:
    import faiss
    print("✓ faiss installed")
except ImportError:
    print("✗ faiss not installed")

try:
    import sentence_transformers
    print("✓ sentence-transformers installed")
except ImportError:
    print("✗ sentence-transformers not installed")

try:
    import pygrib
    print("✓ pygrib installed")
except ImportError:
    print("⚠ pygrib not installed (optional)")
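
When reporting a problem, it helps to know which versions are installed, not only whether the imports succeed. A minimal sketch using the standard library's `importlib.metadata` follows; `installed_version` is a hypothetical helper name, and the distribution names are the ones from the pip command in Step 1.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(dist_name):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None

# Distribution names as used in the pip command in Step 1 (pygrib is optional).
for name in ["netCDF4", "h5py", "sentence-transformers", "faiss-cpu", "numpy", "tqdm", "pygrib"]:
    found = installed_version(name)
    print(f"{name}: {found if found else 'not installed'}")
```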

## Step 4: Test System Components

In [None]:
# Add parent directory to path to import our modules
import sys
from pathlib import Path

sys.path.insert(0, str(Path.cwd().parent))

# Test imports
try:
    import config
    print("✓ config module loaded")
    
    from file_validator import FileValidator
    print("✓ file_validator module loaded")
    
    from metadata_extractors import MetadataExtractor
    print("✓ metadata_extractors module loaded")
    
    from embedding_generator import EmbeddingGenerator
    print("✓ embedding_generator module loaded")
    
    from vector_index import VectorIndex
    print("✓ vector_index module loaded")
    
    from search_engine import FAIRSearchEngine
    print("✓ search_engine module loaded")
    
    print("\n✓ All modules loaded successfully!")
    
except ImportError as e:
    print(f"✗ Error importing modules: {e}")
    print("Make sure all .py files are in the parent directory")
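
If the import cell fails, the error names only the first module that could not be loaded. The sketch below lists every expected file that is missing from the parent directory. It assumes each module is a single `.py` file named after the imports above; `missing_module_files` is a hypothetical helper.

```python
from pathlib import Path

# Module names taken from the import cell above.
EXPECTED_MODULES = [
    "config", "file_validator", "metadata_extractors",
    "embedding_generator", "vector_index", "search_engine",
]

def missing_module_files(directory, names=EXPECTED_MODULES):
    """Return the names that have no matching <name>.py file in directory."""
    directory = Path(directory)
    return [n for n in names if not (directory / f"{n}.py").is_file()]

missing = missing_module_files(Path.cwd().parent)
print("Missing files:", missing if missing else "none")
```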

## Step 5: Create Sample Data (Optional)

Let's create a simple NetCDF file for testing.

In [None]:
from pathlib import Path

import netCDF4
import numpy as np

# Create sample directory
sample_dir = Path("sample_data")
sample_dir.mkdir(exist_ok=True)

# Create a simple NetCDF file
filepath = sample_dir / "ocean_temperature.nc"

with netCDF4.Dataset(filepath, 'w') as ds:
    # Global attributes
    ds.title = "Sample Ocean Temperature Data"
    ds.institution = "Demo University"
    ds.source = "Simulated data for testing"
    ds.Conventions = "CF-1.8"
    
    # Create dimensions
    ds.createDimension('time', 10)
    ds.createDimension('lat', 20)
    ds.createDimension('lon', 30)
    
    # Create variables
    time = ds.createVariable('time', 'f8', ('time',))
    time.units = 'days since 2020-01-01'
    time[:] = np.arange(10)
    
    lat = ds.createVariable('lat', 'f4', ('lat',))
    lat.units = 'degrees_north'
    lat[:] = np.linspace(-90, 90, 20)
    
    lon = ds.createVariable('lon', 'f4', ('lon',))
    lon.units = 'degrees_east'
    # endpoint=False avoids storing both -180 and 180, which are the same meridian
    lon[:] = np.linspace(-180, 180, 30, endpoint=False)
    
    temp = ds.createVariable('sea_surface_temperature', 'f4', ('time', 'lat', 'lon'))
    temp.units = 'celsius'
    temp.long_name = 'Sea Surface Temperature'
    temp.standard_name = 'sea_surface_temperature'
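    # Seeded generator so the simulated values are reproducible across runs
    temp[:] = np.random.default_rng(42).normal(loc=15, scale=5, size=(10, 20, 30))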

print(f"✓ Sample file created: {filepath}")
print(f"  Size: {filepath.stat().st_size} bytes")
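
The attributes written above are what a search system has to work with. As a rough illustration, the sketch below flattens file-level and variable-level attributes into one text string that could be passed to an embedding model. Both functions are hypothetical and are not the system's actual `MetadataExtractor`; the choice of fields is an assumption.

```python
def build_description(global_attrs, variables):
    """Flatten metadata into one text string suitable for embedding.

    global_attrs: dict of file-level attributes.
    variables: dict mapping variable name -> dict of that variable's attributes.
    """
    parts = []
    for key in ("title", "institution", "source"):
        if key in global_attrs:
            parts.append(f"{key}: {global_attrs[key]}")
    for name, attrs in variables.items():
        label = attrs.get("long_name", name)  # fall back to the raw variable name
        units = attrs.get("units")
        parts.append(f"variable {label} ({units})" if units else f"variable {label}")
    return "; ".join(parts)


def read_netcdf_metadata(path):
    """Read global and per-variable attributes from a NetCDF file."""
    import netCDF4  # imported lazily so build_description works without it

    with netCDF4.Dataset(str(path), "r") as ds:
        global_attrs = {k: ds.getncattr(k) for k in ds.ncattrs()}
        variables = {
            name: {k: var.getncattr(k) for k in var.ncattrs()}
            for name, var in ds.variables.items()
        }
    return global_attrs, variables


example = build_description(
    {"title": "Sample Ocean Temperature Data"},
    {"sea_surface_temperature": {"long_name": "Sea Surface Temperature", "units": "celsius"}},
)
print(example)
```

For the sample file, `build_description(*read_netcdf_metadata(filepath))` would produce a similar string from the real attributes. Files with almost no metadata still yield a usable description, because variable names serve as a fallback.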

## Step 6: Quick Test

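Let's run a quick end-to-end test: index the sample file, search for it, and save the index. Run Step 5 first, because the cell below uses its `filepath` variable.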

In [None]:
from search_engine import FAIRSearchEngine

# Initialize search engine
print("Initializing search engine...")
engine = FAIRSearchEngine(load_existing=False)

# Index the sample file
print("\nIndexing sample file...")
result = engine.index_file(filepath)
print(f"Result: {result}")

# Search for it
print("\nSearching for 'ocean temperature'...")
results = engine.search("ocean temperature", top_k=1)

if results:
    print(f"✓ Found {len(results)} result(s)")
    print(f"  File: {results[0]['filepath']}")
    print(f"  Score: {results[0]['similarity_score']:.3f}")
else:
    print("✗ No results found")

# Save index
print("\nSaving index...")
engine.save()
print("✓ Index saved")

print("\n🎉 System test complete!")
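
With only one file indexed, a `top_k=1` search returns that file for any query, so the score matters more than the fact that something was found. The sketch below formats results and drops those under a threshold. It assumes results are dicts with the `filepath` and `similarity_score` keys used above and that a higher score means a closer match; `format_results` and the threshold value are hypothetical.

```python
def format_results(results, min_score=0.0):
    """Return one display line per result whose score is at least min_score."""
    lines = []
    for rank, hit in enumerate(results, start=1):
        score = hit.get("similarity_score", 0.0)
        if score < min_score:
            continue
        lines.append(f"{rank}. {hit.get('filepath', '<unknown>')} (score {score:.3f})")
    return lines

# Made-up results in the shape printed by the cell above.
fake_results = [
    {"filepath": "sample_data/ocean_temperature.nc", "similarity_score": 0.712},
    {"filepath": "sample_data/other.nc", "similarity_score": 0.105},
]
for line in format_results(fake_results, min_score=0.3):
    print(line)
```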

## Next Steps

Now that the system is set up, you can:

1. **Explore Components**: Check out notebooks 01-06 to learn about each component
2. **Index Your Data**: Use notebook 05 to index your scientific datasets
3. **Try Full Workflow**: See notebook 99 for end-to-end examples

## Optional: Install Ollama (for LLM Enrichment)

To enable LLM-based metadata enrichment:

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.2:3b
```

See notebook 06 for LLM enrichment examples.
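
Before trying enrichment, you can check from Python whether a local Ollama server is running. This sketch queries the `/api/tags` endpoint on Ollama's default port 11434 using only the standard library. `ollama_available` is a hypothetical helper, and the base URL is an assumption that holds only for a default local install.

```python
import http.client
import urllib.request

def ollama_available(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an Ollama server answers at base_url, else False."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # Covers refused connections, timeouts, HTTP errors, and malformed URLs.
        return False

print("Ollama reachable:", ollama_available())
```

If this prints `False`, the rest of the system should still work, since LLM enrichment is optional.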