# 17. Embeddings and Approximate Nearest Neighbors (ANN)

Modern information retrieval (IR) often uses **dense vector retrieval**: instead of matching keywords, we compare vectors that encode semantic meaning.

## 1. Loading Vocabulary and Vectors
We use the dummy vectors generated in `00_data_expansion.ipynb`.

In [1]:
import json
import math
import random
import time
from pathlib import Path

DATA_DIR = Path('../data')

def load_vectors():
    path = DATA_DIR / 'word_vectors.json'
    if not path.exists():
        print("⚠️ Vectors not found! Please run 00_data_expansion.ipynb first.")
        return {}
    
    with open(path, 'r', encoding='utf-8') as f:
        vectors = json.load(f)
    return vectors

word_vectors = load_vectors()
print(f"Loaded {len(word_vectors)} word vectors (Dim={len(next(iter(word_vectors.values()))) if word_vectors else 0}).")

Loaded 50 word vectors (Dim=50).


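## 2. Vector Similarity (Cosine)

Cosine similarity measures the angle between two vectors and ignores their magnitudes. It ranges from $-1$ (opposite directions) to $1$ (same direction).

$$ \text{Cosine}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} $$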
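
As a quick sanity check on the formula, here is a small worked example. The vectors $A = (1, 2, 2)$ and $B = (2, 1, 2)$ are made up for illustration and do not come from the dataset.

$$ A \cdot B = 1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2 = 8 $$

$$ \|A\| = \sqrt{1^2 + 2^2 + 2^2} = 3, \qquad \|B\| = \sqrt{2^2 + 1^2 + 2^2} = 3 $$

$$ \text{Cosine}(A, B) = \frac{8}{3 \cdot 3} = \frac{8}{9} \approx 0.889 $$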

In [2]:
def cosine_similarity(v1, v2):
    dot = sum(a*b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a*a for a in v1))
    norm2 = math.sqrt(sum(b*b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

def get_nearest_neighbors(query_word, k=5):
    if query_word not in word_vectors:
        return []
    
    query_vec = word_vectors[query_word]
    scores = []
    
    for word, vec in word_vectors.items():
        if word == query_word: continue
        score = cosine_similarity(query_vec, vec)
        scores.append((word, score))
        
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:k]

# Demo: Find semantically similar words
test_word = "सरकार"
if test_word in word_vectors:
    print(f"\nNeighbors of '{test_word}':")
    for w, s in get_nearest_neighbors(test_word):
        print(f"  {w} ({s:.4f})")


Neighbors of 'सरकार':
  नेता (0.9805)
  संसद (0.9792)
  लोकतन्त्र (0.9785)
  मन्त्री (0.9776)
  जनता (0.9776)


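## 3. Approximate Nearest Neighbor (ANN) Concept
Scanning every vector costs $O(N)$ per query, which becomes too slow for millions of documents. ANN indexes trade a little accuracy for speed; graph-based methods such as HNSW aim for roughly $O(\log N)$ search time.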
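
To make the linear cost concrete, the following is a minimal sketch of an exhaustive scan that counts how many similarity computations one query needs. The helper names `cosine` and `brute_force_top_k`, the 8-dimensional random vectors, and the `doc{i}` keys are all assumptions made for illustration; they are not part of the notebook's data.

```python
import heapq
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def brute_force_top_k(query, vectors, k=5):
    """Score every stored vector against the query; return (top_k, comparisons)."""
    comparisons = 0
    scored = []
    for key, vec in vectors.items():
        scored.append((cosine(query, vec), key))
        comparisons += 1
    # nlargest keeps the k highest (score, key) pairs
    return heapq.nlargest(k, scored), comparisons


rng = random.Random(0)  # fixed seed so the synthetic data is repeatable
for n in (100, 1_000, 10_000):
    # Hypothetical synthetic collection, not the notebook's word vectors
    synthetic = {f"doc{i}": [rng.gauss(0, 1) for _ in range(8)] for i in range(n)}
    query = [rng.gauss(0, 1) for _ in range(8)]
    top, comparisons = brute_force_top_k(query, synthetic, k=3)
    print(f"N={n:>6}  similarity computations={comparisons}")
```

The comparison count always equals N, so a tenfold larger collection means tenfold more work per query. That linear growth is what an ANN index tries to avoid.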
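
### Simulated HNSW (Hierarchical Navigable Small World)
HNSW stacks several graph layers; here we implement only a simplified, single-layer **Navigable Small World (NSW)** graph.
1. Build a graph in which each vector is connected to its nearest vectors.
2. Search by greedy traversal: from the current node, move to the neighbor most similar to the query, and stop when no neighbor improves on the current node.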
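
The two steps above can be traced by hand on a tiny graph. The sketch below uses five made-up 2-D vectors and a hand-wired chain graph (`toy_vectors`, `toy_graph`, and `greedy_walk` are hypothetical names for this illustration only), and records the path the walk takes.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy data: five points sweeping from the x-axis to the y-axis
toy_vectors = {
    "a": (1.0, 0.0),
    "b": (0.9, 0.3),
    "c": (0.6, 0.6),
    "d": (0.3, 0.9),
    "e": (0.0, 1.0),
}

# Step 1: a hand-built graph linking each point to its closest points
toy_graph = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}


def greedy_walk(query, start, vectors, graph, max_steps=10):
    """Step 2: repeatedly move to the neighbor most similar to the query."""
    current = start
    path = [current]
    for _ in range(max_steps):
        current_sim = cosine(query, vectors[current])
        best = max(graph[current], key=lambda n: cosine(query, vectors[n]), default=None)
        # Stop at a local optimum: no neighbor beats the current node
        if best is None or cosine(query, vectors[best]) <= current_sim:
            break
        current = best
        path.append(current)
    return path


path = greedy_walk((0.1, 1.0), "a", toy_vectors, toy_graph)
print(" -> ".join(path))  # a -> b -> c -> d -> e
```

Here the walk reaches the best node because the chain leads straight to it. On a sparser or poorly connected graph the same stopping rule can halt early at a node that is only locally best, which is why the result of such a search is approximate.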

In [3]:
class SimpleNSW:
    def __init__(self, vectors, k_neighbors=3):
        self.vectors = vectors
        self.graph = {w: [] for w in vectors}
        self.k = k_neighbors
        self.build()
        
    def build(self):
        print("Building NSW Graph (this might take a moment)...")
        # Naive build: connect each node to its true k nearest neighbors.
        # This brute-force construction is O(N^2), so it only suits a small demo.
        # Note: get_nearest_neighbors reads the global word_vectors, so this
        # assumes self.vectors is that same dictionary.
        for word in self.vectors:
            neighbors = get_nearest_neighbors(word, k=self.k)
            self.graph[word] = [n for n, _ in neighbors]
            
    def greedy_search(self, query_vec, entry_point, steps=10):
        """Walk the graph greedily; return the node where no neighbor improves."""
        current = entry_point
        current_score = cosine_similarity(query_vec, self.vectors[current])

        for _ in range(steps):
            # Find the neighbor most similar to the query
            best_neighbor = None
            best_score = current_score
            for neighbor in self.graph[current]:
                score = cosine_similarity(query_vec, self.vectors[neighbor])
                if score > best_score:
                    best_neighbor, best_score = neighbor, score

            # No neighbor beats the current node: we are at a local optimum
            if best_neighbor is None:
                break

            # Greedy jump. The score strictly increases, so no node is revisited.
            current, current_score = best_neighbor, best_score

        return current

# Build Graph
nsw = SimpleNSW(word_vectors)

# Search
target_word = "फुटबल"
if target_word in word_vectors:
    target_vec = word_vectors[target_word]
    # Start from random point
    start_node = random.choice(list(word_vectors.keys()))
    
    print(f"\nGreedy Graph Search for '{target_word}':")
    print(f"  Start: {start_node}")
    
    result = nsw.greedy_search(target_vec, start_node)
    print(f"  End:   {result}")
    
    if result == target_word:
        print("  ✓ Found exact match!")
    else:
        print("  ~ Found approximate match.")

Building NSW Graph (this might take a moment)...

Greedy Graph Search for 'फुटबल':
  Start: इन्टरनेट
  End:   वेबसाइट
  ~ Found approximate match.
