# Vector Search with Embeddings

This notebook demonstrates how semantic search works using:
1. **Sentence Transformers**: Convert text to vector embeddings
2. **FAISS**: Fast similarity search in vector space
3. **Cosine Similarity**: Measure how similar queries are to datasets

## Why Vector Search?

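Traditional keyword search matches only the literal terms in a query. Vector search compares meaning, so related phrasings can match even when they share few or no words:
- "ocean temp" finds "sea surface temperature"
- "climate data" finds "atmospheric measurements"
- Abbreviations and synonyms can still match, because similar phrases map to nearby vectors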
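
To make the limitation of exact matching concrete, here is a minimal sketch of a naive keyword matcher. The function `keyword_match` is hypothetical and exists only for illustration; it is not part of this project's code.

```python
def keyword_match(query: str, document: str) -> bool:
    """Return True only if every query token appears verbatim in the document."""
    doc_tokens = set(document.lower().split())
    return all(token in doc_tokens for token in query.lower().split())


document = "sea surface temperature"

# "ocean temp" shares no exact token with the document, so keyword search misses it.
print(keyword_match("ocean temp", document))
# A query made of the document's own words does match.
print(keyword_match("surface temperature", document))
```

The first query fails even though it means nearly the same thing, which is the gap that embedding-based search is meant to close.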

In [1]:
# Setup
import sys
from pathlib import Path
import numpy as np
sys.path.insert(0, str(Path.cwd().parent))

from embedding_generator import EmbeddingGenerator
from vector_index import VectorIndex
from search_engine import FAIRSearchEngine

## Example 1: Generate Embeddings

In [2]:
# Initialize embedding generator
generator = EmbeddingGenerator()

# Example texts
texts = [
    "sea surface temperature measurements from satellite",
    "ocean temperature data from MODIS",
    "atmospheric pressure at sea level",
    "wind speed and direction measurements"
]

# Generate embeddings
embeddings = generator.encode(texts)

print(f"Generated {len(embeddings)} embeddings")
print(f"Embedding shape: {embeddings.shape}")
print(f"Embedding dimension: {embeddings.shape[1]}")
print("\nFirst embedding (first 10 values):")
print(embeddings[0][:10])

Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
Model loaded. Embedding dimension: 384
Generated 4 embeddings
Embedding shape: (4, 384)
Embedding dimension: 384

First embedding (first 10 values):
[-0.03911696  0.00684475 -0.00633144  0.01558108 -0.04790843 -0.06568947
  0.01690703 -0.00748422  0.03100628  0.03318202]


## Example 2: Compute Similarity

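Embeddings can be compared using cosine similarity: the cosine of the angle between two vectors, which ranges from -1 (opposite directions) to 1 (same direction).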

In [3]:
# Compute similarity between texts
from numpy.linalg import norm

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (assumes neither is all zeros)."""
    return np.dot(a, b) / (norm(a) * norm(b))

# Compare first two texts (both about ocean temperature)
sim_ocean = cosine_similarity(embeddings[0], embeddings[1])
print(f"Similarity between ocean temperature texts: {sim_ocean:.3f}")

# Compare ocean temp with atmospheric pressure (different topics)
sim_different = cosine_similarity(embeddings[0], embeddings[2])
print(f"Similarity between ocean temp and pressure: {sim_different:.3f}")

print("\nNote: Similar texts have scores closer to 1.0")

Similarity between ocean temperature texts: 0.655
Similarity between ocean temp and pressure: 0.424

Note: Similar texts have scores closer to 1.0
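
When more than two texts are involved, it can be convenient to compute all pairwise similarities at once. The sketch below assumes a hypothetical helper `pairwise_cosine` and uses toy 3-dimensional vectors in place of real 384-dimensional embeddings.

```python
import numpy as np


def pairwise_cosine(matrix: np.ndarray) -> np.ndarray:
    """Return the cosine similarity between every pair of rows."""
    # Scale each row to unit length; the dot products are then cosines.
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / norms
    return unit @ unit.T


# Toy vectors standing in for real embeddings.
toy = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
])
sims = pairwise_cosine(toy)
print(np.round(sims, 3))
```

In this notebook the same function could be applied to `embeddings` to get a 4 x 4 matrix, with each diagonal entry equal to 1.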


## Example 3: Build a Vector Index

In [4]:
# Create vector index
index = VectorIndex(embedding_dim=generator.get_embedding_dim())

# Create fake metadata for our example texts
metadata_list = [
    {'filepath': 'data/sst_satellite.nc', 'title': 'SST from Satellite', 'format': 'NetCDF'},
    {'filepath': 'data/ocean_temp_modis.nc', 'title': 'Ocean Temp MODIS', 'format': 'NetCDF'},
    {'filepath': 'data/pressure_sea_level.nc', 'title': 'Sea Level Pressure', 'format': 'NetCDF'},
    {'filepath': 'data/wind_measurements.nc', 'title': 'Wind Speed', 'format': 'NetCDF'}
]

# Add to index
index.add(embeddings, metadata_list)

print("Index Statistics:")
print(index.get_stats())

Initialized FAISS index (dim=384)
Index Statistics:
{'total_vectors': 4, 'total_metadata': 4, 'unique_files': 4, 'embedding_dim': 384, 'index_type': 'IndexFlatIP'}
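
The `index_type` reported above, `IndexFlatIP`, is FAISS's exact inner-product index. Inner product equals cosine similarity only when the vectors have unit length. Whether `EmbeddingGenerator` or `VectorIndex` normalizes vectors is not shown in this notebook, so the following NumPy sketch only illustrates the relationship under that assumption; it is not the project's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384)
b = rng.normal(size=384)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# The raw inner product depends on vector length, so it generally differs from cosine.
raw_ip = np.dot(a, b)

# After scaling both vectors to unit length, inner product and cosine agree.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
unit_ip = np.dot(a_unit, b_unit)

print(f"cosine:             {cosine:.6f}")
print(f"raw inner product:  {raw_ip:.6f}")
print(f"unit inner product: {unit_ip:.6f}")
```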


## Example 4: Search the Index

In [5]:
# Search queries
queries = [
    "ocean surface temperature",
    "atmospheric data",
    "wind information"
]

for query in queries:
    print(f"\nQuery: '{query}'")
    print("=" * 60)
    
    # Generate query embedding
    query_embedding = generator.encode_single(query)
    
    # Search
    results = index.search(query_embedding, top_k=2)
    
    for i, result in enumerate(results, 1):
        print(f"{i}. {result['title']}")
        print(f"   Score: {result['similarity_score']:.3f}")
        print(f"   File: {result['filepath']}")


Query: 'ocean surface temperature'
1. SST from Satellite
   Score: 0.740
   File: data/sst_satellite.nc
2. Ocean Temp MODIS
   Score: 0.662
   File: data/ocean_temp_modis.nc

Query: 'atmospheric data'
1. Ocean Temp MODIS
   Score: 0.548
   File: data/ocean_temp_modis.nc
2. Sea Level Pressure
   Score: 0.526
   File: data/pressure_sea_level.nc

Query: 'wind information'
1. Wind Speed
   Score: 0.710
   File: data/wind_measurements.nc
2. Ocean Temp MODIS
   Score: 0.354
   File: data/ocean_temp_modis.nc
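
Conceptually, an exhaustive ("flat") inner-product index scores the query against every stored vector and keeps the best `top_k`. The sketch below reproduces that idea in plain NumPy; `top_k_inner_product` is a hypothetical name, and this is not how `VectorIndex.search` is necessarily written.

```python
import numpy as np


def top_k_inner_product(query: np.ndarray, vectors: np.ndarray, k: int):
    """Score every stored vector against the query and return the best k."""
    scores = vectors @ query            # one inner product per stored vector
    order = np.argsort(-scores)[:k]     # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]


# Toy unit-length vectors standing in for indexed embeddings.
vectors = np.array([
    [1.0, 0.0],
    [0.6, 0.8],
    [0.0, 1.0],
])
query = np.array([1.0, 0.0])

for rank, (idx, score) in enumerate(top_k_inner_product(query, vectors, k=2), 1):
    print(f"{rank}. vector {idx}  score={score:.3f}")
```

Because every vector is scored, results are exact, but the work per query grows linearly with the number of indexed vectors.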


## Example 5: Real Search Engine Test

In [6]:
# Initialize search engine with our sample data
engine = FAIRSearchEngine(load_existing=False)

# Index the sample files we created
sample_files = [
    "sample_data/ocean_temperature.nc",
    "sample_data/minimal_sst_20230515.nc"
]

for filepath in sample_files:
    path = Path(filepath)
    if not path.exists():
        print(f"✗ Missing: {path.name}")
        continue
    result = engine.index_file(path, include_companions=True)
    if result.get('success'):
        print(f"✓ Indexed: {path.name}")
    else:
        print(f"✗ Failed to index: {path.name}")

print(f"\nIndex stats: {engine.get_stats()}")

Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
Model loaded. Embedding dimension: 384
Loaded embedding cache: 2 entries
Creating new index...
Initialized FAISS index (dim=384)
✓ Indexed: ocean_temperature.nc
✓ Indexed: minimal_sst_20230515.nc

Index stats: {'total_vectors': 2, 'total_metadata': 2, 'unique_files': 2, 'embedding_dim': 384, 'index_type': 'IndexFlatIP', 'model': 'sentence-transformers/all-MiniLM-L6-v2', 'cache_size': 4}


In [7]:
# Try various search queries
test_queries = [
    "ocean temperature",
    "sea surface",
    "satellite measurements",
    "MODIS data",
    "temperature from 2023"
]

for query in test_queries:
    print(f"\n{'='*60}")
    print(f"Query: '{query}'")
    print('='*60)
    
    results = engine.search(query, top_k=3)
    
    if results:
        for i, result in enumerate(results, 1):
            print(f"\n{i}. {Path(result['filepath']).name}")
            print(f"   Score: {result['similarity_score']:.3f}")
            print(f"   Title: {result.get('title', 'N/A')}")
            
            # Show variables if available
            if 'variables' in result:
                var_names = list(result['variables'].keys())[:3]
                print(f"   Variables: {', '.join(var_names)}")
    else:
        print("No results found")


Query: 'ocean temperature'

1. minimal_sst_20230515.nc
   Score: 0.618
   Title: 
   Variables: time, lat, lon

2. ocean_temperature.nc
   Score: 0.596
   Title: Sample Ocean Temperature Data
   Variables: time, lat, lon

Query: 'sea surface'

1. ocean_temperature.nc
   Score: 0.414
   Title: Sample Ocean Temperature Data
   Variables: time, lat, lon

2. minimal_sst_20230515.nc
   Score: 0.404
   Title: 
   Variables: time, lat, lon

Query: 'satellite measurements'

1. minimal_sst_20230515.nc
   Score: 0.307
   Title: 
   Variables: time, lat, lon

2. ocean_temperature.nc
   Score: 0.305
   Title: Sample Ocean Temperature Data
   Variables: time, lat, lon

Query: 'MODIS data'

1. ocean_temperature.nc
   Score: 0.373
   Title: Sample Ocean Temperature Data
   Variables: time, lat, lon

2. minimal_sst_20230515.nc
   Score: 0.298
   Title: 
   Variables: time, lat, lon

Query: 'temperature from 2023'

1. minimal_sst_20230515.nc
   Score: 0.443
   Title: 
   Variables: time, lat, lon

2. 

## Example 6: Find Similar Datasets

In [8]:
# Find datasets similar to a reference file
reference_file = Path("sample_data/ocean_temperature.nc")

if reference_file.exists():
    print(f"Finding datasets similar to: {reference_file.name}")
    print("=" * 60)
    
    similar = engine.find_similar(reference_file, top_k=3)
    
    for i, result in enumerate(similar, 1):
        print(f"\n{i}. {Path(result['filepath']).name}")
        print(f"   Score: {result['similarity_score']:.3f}")
        
        # Show the result's first few variables (a hint at overlap, not a computed intersection)
        if 'variables' in result:
            vars_list = list(result['variables'].keys())[:3]
            print(f"   Shared variables: {', '.join(vars_list)}")

Finding datasets similar to: ocean_temperature.nc

1. ocean_temperature.nc
   Score: 0.855
   Shared variables: time, lat, lon

2. minimal_sst_20230515.nc
   Score: 0.726
   Shared variables: time, lat, lon
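
In the output above, the reference file itself comes back as the top hit. If only other datasets are of interest, the results can be filtered afterwards. This is a minimal sketch: `exclude_reference` is a hypothetical helper, the result dictionaries are trimmed to the two fields used here, and the `sample_data/` prefix on `filepath` is an assumption about how paths are stored.

```python
from pathlib import Path


def exclude_reference(results, reference_file):
    """Drop any hit whose file name equals the reference file's name."""
    reference_name = Path(reference_file).name
    return [r for r in results if Path(r["filepath"]).name != reference_name]


# Hits shaped like the output above (only the fields needed here).
similar = [
    {"filepath": "sample_data/ocean_temperature.nc", "similarity_score": 0.855},
    {"filepath": "sample_data/minimal_sst_20230515.nc", "similarity_score": 0.726},
]

others = exclude_reference(similar, "sample_data/ocean_temperature.nc")
for hit in others:
    print(Path(hit["filepath"]).name, f"{hit['similarity_score']:.3f}")
```

Comparing by file name keeps the sketch short, but it would also drop a different file with the same name in another directory; comparing resolved paths would be stricter.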


## Understanding Search Scores

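These bands are rough rules of thumb for this notebook's model (`all-MiniLM-L6-v2`), not calibrated thresholds. In the examples above, the clearly relevant top hits scored between about 0.60 and 0.74.

- **Score > 0.8**: Very relevant, likely what you're looking for
- **Score 0.6-0.8**: Relevant, related datasets
- **Score 0.4-0.6**: Somewhat related
- **Score < 0.4**: Probably not what you want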
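
The score bands listed above can be turned into a small labeling helper. The function name `score_band` and the label strings are illustrative choices, and the boundary handling (0.8 counted as "relevant", 0.4 as "somewhat related") is an assumption, since the list does not say which band the edges belong to.

```python
def score_band(score: float) -> str:
    """Map a similarity score to the rough bands listed above."""
    if score > 0.8:
        return "very relevant"
    if score >= 0.6:
        return "relevant"
    if score >= 0.4:
        return "somewhat related"
    return "probably not relevant"


# Scores taken from the search outputs earlier in this notebook.
for score in (0.855, 0.740, 0.548, 0.354):
    print(f"{score:.3f} -> {score_band(score)}")
```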

## Performance

Let's measure search speed:

In [9]:
import time

# Benchmark search speed
query = "ocean temperature measurements"
num_searches = 100

start = time.perf_counter()  # monotonic, high-resolution timer suited to short intervals
for _ in range(num_searches):
    engine.search(query, top_k=10)
end = time.perf_counter()

avg_time_ms = (end - start) / num_searches * 1000
print(f"Average search time: {avg_time_ms:.2f} ms")
print(f"Searches per second: {1000/avg_time_ms:.0f}")
status = "✓" if avg_time_ms < 200 else "✗"
print(f"\nTarget: < 200ms per search {status}")

Average search time: 0.21 ms
Searches per second: 4865

Target: < 200ms per search ✓
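
A single average can hide warm-up effects and outliers. The sketch below shows one way to collect a few more statistics; `benchmark` is a hypothetical helper, and the workload is a stand-in so the block runs on its own.

```python
import statistics
import time


def benchmark(fn, runs: int = 100, warmup: int = 5) -> dict:
    """Time fn() over several runs and return summary statistics in milliseconds."""
    for _ in range(warmup):          # warm caches before measuring
        fn()
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000)
    samples_ms.sort()
    return {
        "mean_ms": statistics.mean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[int(0.95 * (len(samples_ms) - 1))],
    }


# Stand-in workload; in this notebook it could be
# lambda: engine.search(query, top_k=10)
stats = benchmark(lambda: sum(range(1000)), runs=50)
print({name: round(value, 4) for name, value in stats.items()})
```

Keep in mind that the figure measured above comes from an index holding only two vectors, so it says little about latency on a large index.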


## Save and Load Index

In [10]:
# Save index
engine.save()
print("✓ Index saved")

# Can be loaded later
print("\nIndex can be loaded with:")
print("  engine = FAIRSearchEngine(load_existing=True)")

Index saved to /home/vastdata/vast-fair-stack/lib/indexes/faiss_index.bin
✓ Index saved

Index can be loaded with:
  engine = FAIRSearchEngine(load_existing=True)
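
The engine writes a FAISS binary file, as the output above shows. The idea behind persistence can be sketched without FAISS: store the vectors and their metadata side by side and read both back. Everything here (`save_index`, `load_index`, the file names `vectors.npy` and `metadata.json`) is hypothetical and is not the project's storage format.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def save_index(directory: Path, vectors: np.ndarray, metadata: list) -> None:
    """Write vectors and metadata side by side so row i matches metadata[i]."""
    directory.mkdir(parents=True, exist_ok=True)
    np.save(directory / "vectors.npy", vectors)
    (directory / "metadata.json").write_text(json.dumps(metadata))


def load_index(directory: Path):
    """Read back the vectors and metadata written by save_index."""
    vectors = np.load(directory / "vectors.npy")
    metadata = json.loads((directory / "metadata.json").read_text())
    return vectors, metadata


vectors = np.eye(2, dtype=np.float32)
metadata = [{"title": "SST from Satellite"}, {"title": "Wind Speed"}]

with tempfile.TemporaryDirectory() as tmp:
    save_index(Path(tmp) / "index", vectors, metadata)
    loaded_vectors, loaded_metadata = load_index(Path(tmp) / "index")

print(loaded_vectors.shape, loaded_metadata[0]["title"])
```

Keeping the row order of the vectors aligned with the metadata list is the key invariant: a search returns row positions, and the metadata is looked up by the same position.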


## Key Takeaways

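1. **Semantic Understanding**: Vector search matches on meaning, not just exact keywords
2. **Fast Performance**: The benchmark above averaged 0.21 ms per query on the two-file sample index; because `IndexFlatIP` compares the query against every vector, expect latency to grow with index size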
3. **Local and Free**: Everything runs on your machine, no API calls
4. **Persistent Storage**: Index saves to disk for reuse

## Next Steps

- **Notebook 05**: Batch indexing of large datasets
- **Notebook 06**: Optional LLM enrichment
- **Notebook 99**: Complete workflow examples