# Complete FAIR Data Discovery Workflow

This notebook demonstrates end-to-end workflows for making scientific data FAIR (Findable, Accessible, Interoperable, Reusable).

## Workflows Covered

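1. **Sample Dataset Creation**: Generate three small NetCDF files to work with
2. **Quality Control**: Validate files before indexing
3. **Directory Batch Indexing**: Index all files in a directory
4. **Search and Discovery**: Find datasets by topic
5. **Similar Datasets**: Find datasets related to a reference file
6. **Archive Processing**: Handle .zip files with data
7. **Index Statistics and Management**: Inspect and save the index
8. **Project Data Discovery**: Combine several queries in a realistic research scenario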

In [1]:
# Setup
import sys
from pathlib import Path
import zipfile
import shutil
sys.path.insert(0, str(Path.cwd().parent))

from search_engine import FAIRSearchEngine
from file_validator import FileValidator
from archive_handler import ArchiveHandler
import netCDF4
import numpy as np

## Workflow 1: Create Sample Dataset Collection

In [2]:
# Create a diverse sample dataset
sample_dir = Path("sample_dataset_collection")
sample_dir.mkdir(exist_ok=True)

datasets = [
    {
        'filename': 'global_sst_2023_01.nc',
        'title': 'Global Sea Surface Temperature - January 2023',
        'institution': 'NOAA',
        'variable': 'sst',
        'var_long_name': 'Sea Surface Temperature',
        'units': 'celsius'
    },
    {
        'filename': 'wind_speed_atlantic_2023.nc',
        'title': 'Atlantic Wind Speed Measurements 2023',
        'institution': 'European Space Agency',
        'variable': 'wind_speed',
        'var_long_name': 'Wind Speed at 10m',
        'units': 'm/s'
    },
    {
        'filename': 'chlorophyll_pacific_2023.nc',
        'title': 'Pacific Ocean Chlorophyll Concentration',
        'institution': 'NASA',
        'variable': 'chlor_a',
        'var_long_name': 'Chlorophyll-a Concentration',
        'units': 'mg/m^3'
    }
]

for ds_info in datasets:
    filepath = sample_dir / ds_info['filename']
    
    with netCDF4.Dataset(filepath, 'w') as ds:
        # Global attributes
        ds.title = ds_info['title']
        ds.institution = ds_info['institution']
        ds.Conventions = 'CF-1.8'
        ds.source = 'Simulated data for demonstration'
        
        # Dimensions
        ds.createDimension('time', 30)
        ds.createDimension('lat', 50)
        ds.createDimension('lon', 100)
        
        # Variables
        time = ds.createVariable('time', 'f8', ('time',))
        time.units = 'days since 2023-01-01'
        time[:] = np.arange(30)
        
        lat = ds.createVariable('lat', 'f4', ('lat',))
        lat.units = 'degrees_north'
        lat[:] = np.linspace(-60, 60, 50)
        
        lon = ds.createVariable('lon', 'f4', ('lon',))
        lon.units = 'degrees_east'
        lon[:] = np.linspace(-180, 180, 100)
        
        var = ds.createVariable(ds_info['variable'], 'f4', ('time', 'lat', 'lon'))
        var.units = ds_info['units']
        var.long_name = ds_info['var_long_name']
        var[:] = np.random.randn(30, 50, 100) * 5 + 20

print(f"✓ Created {len(datasets)} sample datasets in {sample_dir}/")
for ds_info in datasets:
    print(f"  - {ds_info['filename']}")

✓ Created 3 sample datasets in sample_dataset_collection/
  - global_sst_2023_01.nc
  - wind_speed_atlantic_2023.nc
  - chlorophyll_pacific_2023.nc


## Workflow 2: Quality Control - Validate Before Indexing

In [3]:
# Validate all files
validator = FileValidator()
results = validator.validate_directory(sample_dir)

print("Quality Control Results:")
print("=" * 60)
print(f"Total files checked: {results['total_files']}")
print(f"✓ Valid: {len(results['valid'])}")
print(f"✗ Invalid: {len(results['invalid'])}")

if results['invalid']:
    print("\nInvalid files (will be skipped):")
    for inv in results['invalid']:
        print(f"  - {Path(inv['filepath']).name}: {inv['issues']}")
else:
    print("\n✓ All files passed validation!")

Quality Control Results:
Total files checked: 3
✓ Valid: 3
✗ Invalid: 0

✓ All files passed validation!
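
The checks that `FileValidator` runs are not shown in this notebook. As a rough illustration of what a cheap pre-indexing check can look like, the sketch below inspects only the leading bytes of each file. `sniff_format` and `precheck` are hypothetical helpers, not part of the project, and the sketch only looks at offset 0 of the file.

```python
from pathlib import Path

# Leading bytes ("magic numbers") of the two on-disk NetCDF families.
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"  # NetCDF-4 files are HDF5 containers
CLASSIC_MAGICS = (b"CDF\x01", b"CDF\x02", b"CDF\x05")  # classic, 64-bit offset, CDF-5


def sniff_format(path):
    """Return 'netcdf4/hdf5', 'netcdf-classic', or None (hypothetical helper)."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(HDF5_MAGIC):
        return "netcdf4/hdf5"
    if head[:4] in CLASSIC_MAGICS:
        return "netcdf-classic"
    return None


def precheck(paths):
    """Split paths into (valid, invalid) using only the signature check."""
    valid, invalid = [], []
    for p in map(Path, paths):
        (valid if sniff_format(p) else invalid).append(p)
    return valid, invalid
```

A signature check like this only rejects files that are obviously not NetCDF or HDF5. It says nothing about whether the metadata inside is complete, so it would complement a fuller validator rather than replace it.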


## Workflow 3: Batch Index Directory

In [4]:
# Initialize search engine
print("Initializing FAIR Search Engine...")
engine = FAIRSearchEngine(load_existing=False)

# Batch index all files
print(f"\nIndexing directory: {sample_dir}")
print("=" * 60)

result = engine.index_directory(
    sample_dir,
    validate=True,
    include_companions=True,
    extract_archives=True,
    show_progress=True
)

print("\nIndexing Results:")
print(f"  ✓ Successfully indexed: {result['indexed']}")
print(f"  ✗ Errors: {result['errors']}")
print(f"  📦 Archives processed: {result['archives_processed']}")

if result['errors'] > 0:
    print("\nError details:")
    for error in result['details']['errors'][:5]:
        print(f"  - {error}")

Initializing FAIR Search Engine...
Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
Model loaded. Embedding dimension: 384
Loaded embedding cache: 23 entries
Creating new index...
Initialized FAISS index (dim=384)

Indexing directory: sample_dataset_collection


Indexing files: 100%|█| 3/3 [00:00<00:00,  4.


Indexing Results:
  ✓ Successfully indexed: 3
  ✗ Errors: 0
  📦 Archives processed: 0
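
The log above shows a sentence-embedding model being loaded, which means each file's metadata has to be turned into text before it can be embedded. How `FAIRSearchEngine` does this is not shown here. The sketch below is one plausible approach under that assumption; `metadata_to_text` and the shape of the `meta` dict are hypothetical.

```python
def metadata_to_text(meta):
    """Flatten a metadata dict into one searchable string (hypothetical helper)."""
    parts = []
    # Global attributes first, in a fixed order so the text is reproducible.
    for key in ("title", "institution", "source"):
        value = meta.get(key)
        if value:
            parts.append(f"{key}: {value}")
    # Then one short phrase per variable, including units when present.
    for name, attrs in meta.get("variables", {}).items():
        long_name = attrs.get("long_name", name)
        units = attrs.get("units")
        phrase = f"variable {name} ({long_name})"
        if units:
            phrase += f" [{units}]"
        parts.append(phrase)
    return ". ".join(parts)


example = {
    "title": "Global Sea Surface Temperature - January 2023",
    "institution": "NOAA",
    "variables": {"sst": {"long_name": "Sea Surface Temperature", "units": "celsius"}},
}
print(metadata_to_text(example))
```

Including `long_name` in the text matters for queries such as "ocean temperature", since short variable names like `sst` carry little meaning for a general-purpose embedding model.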





## Workflow 4: Search and Discovery

In [5]:
# Define search scenarios
search_scenarios = [
    {
        'query': 'ocean temperature',
        'description': 'Finding ocean temperature datasets'
    },
    {
        'query': 'wind measurements',
        'description': 'Finding wind data'
    },
    {
        'query': 'satellite ocean color chlorophyll',
        'description': 'Finding ocean color/productivity data'
    },
    {
        'query': 'data from NASA',
        'description': 'Finding by institution'
    }
]

for scenario in search_scenarios:
    print(f"\n{'='*70}")
    print(f"Scenario: {scenario['description']}")
    print(f"Query: '{scenario['query']}'")
    print('='*70)
    
    results = engine.search(scenario['query'], top_k=3)
    
    if results:
        for i, result in enumerate(results, 1):
            print(f"\n{i}. {Path(result['filepath']).name}")
            print(f"   📊 Relevance: {result['similarity_score']:.3f}")
            print(f"   📝 Title: {result.get('title', 'N/A')}")
            print(f"   🏛️ Institution: {result.get('institution', 'N/A')}")
            
            if 'variables' in result:
                var_names = list(result['variables'].keys())[:3]
                print(f"   📈 Variables: {', '.join(var_names)}")
    else:
        print("\n❌ No results found")


Scenario: Finding ocean temperature datasets
Query: 'ocean temperature'

1. global_sst_2023_01.nc
   📊 Relevance: 0.488
   📝 Title: Global Sea Surface Temperature - January 2023
   🏛️ Institution: NOAA
   📈 Variables: time, lat, lon

2. chlorophyll_pacific_2023.nc
   📊 Relevance: 0.404
   📝 Title: Pacific Ocean Chlorophyll Concentration
   🏛️ Institution: NASA
   📈 Variables: time, lat, lon

3. wind_speed_atlantic_2023.nc
   📊 Relevance: 0.289
   📝 Title: Atlantic Wind Speed Measurements 2023
   🏛️ Institution: European Space Agency
   📈 Variables: time, lat, lon

Scenario: Finding wind data
Query: 'wind measurements'

1. wind_speed_atlantic_2023.nc
   📊 Relevance: 0.514
   📝 Title: Atlantic Wind Speed Measurements 2023
   🏛️ Institution: European Space Agency
   📈 Variables: time, lat, lon

2. global_sst_2023_01.nc
   📊 Relevance: 0.252
   📝 Title: Global Sea Surface Temperature - January 2023
   🏛️ Institution: NOAA
   📈 Variables: time, lat, lon

3. chlorophyll_pacific_2023.nc
   📊

## Workflow 5: Find Similar Datasets

In [6]:
# Pick a reference dataset
reference = sample_dir / "global_sst_2023_01.nc"

if reference.exists():
    print(f"Finding datasets similar to: {reference.name}")
    print("=" * 60)
    
    similar = engine.find_similar(reference, top_k=3)
    
    # Drop the reference file itself first so the printed ranks start at 1
    others = [r for r in similar if Path(r['filepath']).name != reference.name]

    for i, result in enumerate(others, 1):
        filename = Path(result['filepath']).name

        print(f"\n{i}. {filename}")
        print(f"   Similarity: {result['similarity_score']:.3f}")
        # Static label for this demo, not computed from the result
        print("   Why similar: Same format and ocean-related")

        if 'variables' in result:
            vars_list = list(result['variables'].keys())[:3]
            print(f"   Variables: {', '.join(vars_list)}")

Finding datasets similar to: global_sst_2023_01.nc

1. wind_speed_atlantic_2023.nc
   Similarity: 0.746
   Why similar: Same format and ocean-related
   Variables: time, lat, lon

2. chlorophyll_pacific_2023.nc
   Similarity: 0.595
   Why similar: Same format and ocean-related
   Variables: time, lat, lon


## Workflow 6: Working with Archives

In [7]:
# Create a .zip archive with data and README
archive_path = Path("sample_research_project.zip")

with zipfile.ZipFile(archive_path, 'w') as zf:
    # Add data files
    for nc_file in sample_dir.glob("*.nc"):
        zf.write(nc_file, f"data/{nc_file.name}")
    
    # Add README
    readme_content = """# Research Project Data Archive

This archive contains oceanographic measurements from 2023.

## Files
- global_sst_2023_01.nc: Sea surface temperature
- wind_speed_atlantic_2023.nc: Wind measurements
- chlorophyll_pacific_2023.nc: Ocean color data

## Citation
Please cite: Demo et al. (2023). Ocean Data Collection.
DOI: 10.1234/demo.2023
"""
    zf.writestr("README.md", readme_content)

print(f"✓ Created archive: {archive_path}")
print(f"  Size: {archive_path.stat().st_size / 1024:.1f} KB")

✓ Created archive: sample_research_project.zip
  Size: 1788.8 KB


In [8]:
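# Inspect the archive structure (nothing is extracted or indexed in this cell)
print(f"\nIndexing archive: {archive_path}")
print("=" * 60)

# The data files were already indexed in Workflow 3, so this cell only lists what
# the archive contains. During indexing, the engine extracts archives and indexes
# their data files (see extract_archives=True in Workflow 3).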

with ArchiveHandler() as handler:
    structure = handler.get_archive_structure(archive_path)
    
    print(f"Archive contains:")
    print(f"  📦 Total files: {len(structure['files'])}")
    print(f"  📊 Data files: {len(structure['data_files'])}")
    print(f"  📄 Docs: {len(structure['companion_files'])}")
    
    print(f"\nData files in archive:")
    for df in structure['data_files']:
        print(f"  - {df}")
    
    print(f"\nDocumentation:")
    for cf in structure['companion_files']:
        print(f"  - {cf}")


Indexing archive: sample_research_project.zip
Archive contains:
  📦 Total files: 4
  📊 Data files: 3
  📄 Docs: 1

Data files in archive:
  - data/chlorophyll_pacific_2023.nc
  - data/wind_speed_atlantic_2023.nc
  - data/global_sst_2023_01.nc

Documentation:
  - README.md
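
For readers without the project's `ArchiveHandler`, a similar listing can be produced with the standard library alone. The sketch below classifies archive members by file extension without extracting them. `summarize_zip` and the two extension sets are assumptions made for illustration, and the real handler may classify files differently.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Assumed extension sets; the real ArchiveHandler may use different ones.
DATA_EXTS = {".nc", ".nc4", ".h5", ".hdf5"}
DOC_EXTS = {".md", ".txt", ".pdf"}


def summarize_zip(zip_source):
    """Classify archive members by extension without extracting anything."""
    summary = {"files": [], "data_files": [], "companion_files": []}
    with zipfile.ZipFile(zip_source) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            name = info.filename
            summary["files"].append(name)
            ext = PurePosixPath(name).suffix.lower()
            if ext in DATA_EXTS:
                summary["data_files"].append(name)
            elif ext in DOC_EXTS:
                summary["companion_files"].append(name)
    return summary


# Build a tiny in-memory archive to exercise the function.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data/global_sst_2023_01.nc", b"placeholder")
    zf.writestr("README.md", "# notes")
buf.seek(0)
print(summarize_zip(buf))
```

Reading only the archive's central directory keeps the inspection cheap, which is useful when deciding whether a large archive is worth extracting at all.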


## Workflow 7: Index Statistics and Management

In [9]:
# Get comprehensive statistics
stats = engine.get_stats()

print("Search Engine Statistics:")
print("=" * 60)
print(f"📊 Total datasets indexed: {stats['total_vectors']}")
print(f"📁 Unique files: {stats['unique_files']}")
print(f"🔢 Embedding dimension: {stats['embedding_dim']}")
print(f"🤖 Model: {stats['model']}")
print(f"💾 Cache size: {stats['cache_size']} embeddings")
print(f"\n🔍 Index type: {stats['index_type']}")

Search Engine Statistics:
📊 Total datasets indexed: 3
📁 Unique files: 3
🔢 Embedding dimension: 384
🤖 Model: sentence-transformers/all-MiniLM-L6-v2
💾 Cache size: 30 embeddings

🔍 Index type: IndexFlatIP
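
`IndexFlatIP` is FAISS's exact, brute-force inner-product index: every stored vector is scored against the query. If the embeddings are L2-normalized (an assumption here, since the notebook does not show that step), the inner product equals cosine similarity. The pure-Python sketch below mirrors that idea on toy vectors. It is for illustration only and does not use FAISS.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def flat_ip_search(index_vectors, query, top_k=3):
    """Exhaustively score every stored vector against the query, best first."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(normalize(v), q)), i)
        for i, v in enumerate(index_vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:top_k]]


# Toy 3-dimensional "embeddings"; the real ones in this notebook have 384 dimensions.
vectors = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
hits = flat_ip_search(vectors, [1.0, 0.1, 0.0], top_k=2)
print(hits)
```

Exhaustive search costs time proportional to the number of stored vectors, which is negligible for three datasets but is the main thing to revisit as a collection grows.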


In [10]:
# Save index for future use
print("\nSaving index...")
engine.save()
print("✓ Index saved successfully")
print("\nIndex files created:")
print(f"  - FAISS index: {engine.vector_index.index.ntotal} vectors")
print(f"  - Metadata store: {len(engine.vector_index.metadata_store)} entries")
print(f"  - File map: {len(engine.vector_index.filepath_map)} unique files")


Saving index...
Index saved to /home/vastdata/vast-fair-stack/lib/indexes/faiss_index.bin
✓ Index saved successfully

Index files created:
  - FAISS index: 3 vectors
  - Metadata store: 3 entries
  - File map: 3 unique files


## Workflow 8: Real-World Scenario - Project Data Discovery

Simulating a researcher looking for specific data:

In [11]:
print("Research Scenario: Looking for ocean temperature data for climate study")
print("=" * 70)

# Natural language queries a researcher might use
research_queries = [
    "I need sea surface temperature data",
    "SST measurements for 2023",
    "ocean thermal data",
    "satellite temperature observations"
]

all_results = {}
for query in research_queries:
    results = engine.search(query, top_k=1)
    if results:
        best_match = results[0]
        filepath = best_match['filepath']
        score = best_match['similarity_score']
        
        if filepath not in all_results or score > all_results[filepath]:
            all_results[filepath] = score

print(f"\nFound {len(all_results)} relevant dataset(s):\n")
for filepath, score in sorted(all_results.items(), key=lambda x: x[1], reverse=True):
    print(f"  📊 {Path(filepath).name}")
    print(f"     Relevance: {score:.3f}")
    print()

Research Scenario: Looking for ocean temperature data for climate study

Found 1 relevant dataset(s):

  📊 global_sst_2023_01.nc
     Relevance: 0.608
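
The loop above keeps only the top hit per query. The same "best score per file" merge extends naturally to `top_k` greater than 1, as in the sketch below. It assumes each hit is a dict with the `filepath` and `similarity_score` keys used throughout this notebook, and `merge_best_scores` is a hypothetical helper.

```python
def merge_best_scores(result_lists):
    """Collapse several ranked result lists into one, keeping each file's best score."""
    best = {}
    for results in result_lists:
        for hit in results:
            path, score = hit["filepath"], hit["similarity_score"]
            if score > best.get(path, float("-inf")):
                best[path] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Made-up scores for illustration only; these are not outputs of the engine.
per_query = [
    [{"filepath": "a.nc", "similarity_score": 0.61},
     {"filepath": "b.nc", "similarity_score": 0.30}],
    [{"filepath": "a.nc", "similarity_score": 0.45},
     {"filepath": "c.nc", "similarity_score": 0.52}],
]
print(merge_best_scores(per_query))
```

Taking the maximum rewards a file that matches any one phrasing strongly. Averaging scores across queries would instead favor files that match every phrasing moderately well.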



## Performance Summary

In [12]:
import time

# Measure end-to-end performance
print("Performance Benchmarks:")
print("=" * 60)

# Search speed
query = "ocean data"
start = time.time()
for _ in range(50):
    engine.search(query)
avg_search_ms = (time.time() - start) / 50 * 1000

print(f"Average search time: {avg_search_ms:.2f} ms")
print(f"Target: <200ms " + ("✓" if avg_search_ms < 200 else "✗"))

print(f"\nDatasets indexed: {stats['total_vectors']}")
print(f"Ready for production use: ✓")

Performance Benchmarks:
Average search time: 0.23 ms
Target: <200ms ✓

Datasets indexed: 3
Ready for production use: ✓
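
A single average over 50 calls can hide slow outliers and first-call effects. The sketch below is a slightly more careful timing helper that uses `time.perf_counter`, a warm-up phase, and median and 95th-percentile figures. `benchmark` is a hypothetical helper, and the stand-in workload would be replaced by something like `engine.search` in this notebook.

```python
import statistics
import time


def benchmark(fn, *args, repeats=50, warmup=5):
    """Time fn(*args) and return latency statistics in milliseconds."""
    for _ in range(warmup):  # let caches and lazy initialisation settle
        fn(*args)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "mean_ms": statistics.fmean(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
    }


# Stand-in workload; in the notebook this would be something like
# benchmark(engine.search, "ocean data").
print(benchmark(sorted, list(range(1000))))
```

Timings taken on a 3-dataset index mostly reflect query embedding and call overhead, so they are best rerun on a collection of realistic size before drawing conclusions about production readiness.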


## Cleanup (Optional)

In [13]:
# Uncomment to clean up sample files
# shutil.rmtree(sample_dir)
# archive_path.unlink()
# print("✓ Cleanup complete")
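
If the cleanup cell may be run more than once, a variant that tolerates missing paths avoids errors on the second run. The sketch below uses only the standard library, and `cleanup` is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def cleanup(directory, archive):
    """Remove the sample directory and archive, tolerating missing paths."""
    shutil.rmtree(directory, ignore_errors=True)  # no error if already gone
    Path(archive).unlink(missing_ok=True)         # missing_ok needs Python 3.8+


# In this notebook the call would be: cleanup(sample_dir, archive_path)
```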

## Summary: FAIR Principles Achieved

✅ **Findable**
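- Natural-language queries return ranked datasets (Workflows 4 and 8)
- Titles, institutions, and variables are extracted and indexed
- Similarity search averaged 0.23 ms per query on this 3-dataset index

✅ **Accessible**
- File paths are preserved in every search result
- Metadata is returned with each result
- .zip archives are supported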

✅ **Interoperable**
- Standard formats (NetCDF, HDF5)
- CF conventions supported
- Companion docs included

✅ **Reusable**
- Full metadata preserved
- Citations discoverable
- Context maintained

## Next Steps

1. **Use with Your Data**: Replace sample files with real datasets
2. **Optional LLM Enhancement**: See Notebook 06
3. **Command-Line Tools**: Use provided scripts for automation
4. **Integration**: Connect with your analysis workflows