# Workout: Embeddings

## Setup
```bash
uv add openai sentence-transformers numpy
```

---
## Drill 1: OpenAI Embedding 🟢
**Task:** Create an embedding using the OpenAI API

```python
from openai import OpenAI

client = OpenAI()

text = "Python is a great programming language"

# Create embedding using text-embedding-3-small
# Print the dimension and first 5 values
```

---
## Drill 2: Batch Embeddings 🟢
**Task:** Embed multiple texts in a single API call

```python
from openai import OpenAI

client = OpenAI()

texts = [
    "Machine learning is fascinating",
    "Deep learning uses neural networks",
    "Natural language processing handles text"
]

# Embed all texts in one API call
# Print the shape of results
```

---
## Drill 3: Sentence Transformers 🟢
**Task:** Use a local embedding model

```python
from sentence_transformers import SentenceTransformer

# Load all-MiniLM-L6-v2 model
# Embed "Hello, world!"
# Print dimension
```

---
## Drill 4: Cosine Similarity 🟡
**Task:** Implement cosine similarity from scratch

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Calculate cosine similarity between two vectors."""
    pass

# Test with these vectors
a = np.array([1, 2, 3])
b = np.array([1, 2, 3])  # Same -> should be 1.0
c = np.array([-1, -2, -3])  # Opposite -> should be -1.0
d = np.array([3, 2, 1])  # Similar -> should be ~0.71 (10 / 14)

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
print(cosine_similarity(a, d))
```
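
To check your NumPy implementation against known values, here is a minimal dependency-free reference that uses only the standard library. The name `cosine_similarity_reference` is hypothetical, and raising on zero vectors is one possible design choice rather than a requirement of the drill.

```python
import math
from collections.abc import Sequence


def cosine_similarity_reference(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The angle to a zero vector is undefined, so refuse to guess.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(round(cosine_similarity_reference([1, 2, 3], [1, 2, 3]), 3))     # 1.0
print(round(cosine_similarity_reference([1, 2, 3], [-1, -2, -3]), 3))  # -1.0
print(round(cosine_similarity_reference([1, 2, 3], [3, 2, 1]), 3))     # 0.714
```

Your NumPy version should agree with these values up to floating-point rounding.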

---
## Drill 5: Semantic Similarity 🟡
**Task:** Compare text similarity using embeddings

```python
from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Compare these pairs and print similarity scores:
pairs = [
    ("I love Python", "Python is my favorite"),  # Very similar
    ("I love Python", "I hate Python"),  # Opposite sentiment
    ("I love Python", "The weather is nice"),  # Unrelated
]

# What do you notice about the scores?
```

---
## Drill 6: Find Most Similar 🟡
**Task:** Find the most similar document to a query

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

documents = [
    "Python is a programming language",
    "Machine learning uses algorithms",
    "Cats are popular pets",
    "Neural networks are inspired by brains",
    "Dogs are loyal companions"
]

query = "artificial intelligence"

# Embed all documents and query
# Find the most similar document
# Print the result with similarity score
```

---
## Drill 7: Embedding Cache 🟡
**Task:** Implement a simple embedding cache

```python
import json
import hashlib
from pathlib import Path

class SimpleCache:
    def __init__(self, cache_file: str = "embeddings.json"):
        self.cache_file = Path(cache_file)
        self._cache = self._load()

    def _load(self) -> dict:
        pass

    def _save(self):
        pass

    def get(self, text: str) -> list[float] | None:
        pass

    def set(self, text: str, embedding: list[float]):
        pass

# Test the cache
# cache = SimpleCache()
# cache.set("hello", [0.1, 0.2, 0.3])
# result = cache.get("hello")
# assert result == [0.1, 0.2, 0.3]
```
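
If you get stuck, the sketch below shows one way the pieces could fit together. It assumes a single JSON file keyed by the SHA-256 hash of the text. The class name `JsonEmbeddingCache` and the `_key` helper are hypothetical, and rewriting the whole file on every `set` is a simplification that suits a drill but not a large cache.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Optional


class JsonEmbeddingCache:
    """Hypothetical sketch: embeddings stored in one JSON file, keyed by text hash."""

    def __init__(self, cache_file: str = "embeddings.json"):
        self.cache_file = Path(cache_file)
        self._cache = self._load()

    @staticmethod
    def _key(text: str) -> str:
        # Hash the text so keys have a fixed length regardless of input size.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def _load(self) -> dict:
        # Start with an empty cache when no file exists yet.
        if not self.cache_file.exists():
            return {}
        return json.loads(self.cache_file.read_text(encoding="utf-8"))

    def _save(self) -> None:
        # Rewrites the whole file on each call.
        self.cache_file.write_text(json.dumps(self._cache), encoding="utf-8")

    def get(self, text: str) -> Optional[list[float]]:
        return self._cache.get(self._key(text))

    def set(self, text: str, embedding: list[float]) -> None:
        self._cache[self._key(text)] = embedding
        self._save()


# Use a temporary directory so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = str(Path(tmp) / "embeddings.json")
    cache = JsonEmbeddingCache(path)
    cache.set("hello", [0.1, 0.2, 0.3])
    reloaded = JsonEmbeddingCache(path)  # a fresh instance reads from disk
    print(reloaded.get("hello"))    # [0.1, 0.2, 0.3]
    print(reloaded.get("missing"))  # None
```

Hashing the text keeps keys short even for long documents. If you cache embeddings from more than one model, consider including the model name in the hashed string so entries do not collide.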

---
## Drill 8: Dimension Reduction 🟡
**Task:** Use the `dimensions` parameter of the OpenAI embeddings API to request shorter embeddings

```python
from openai import OpenAI

client = OpenAI()

text = "Dimension reduction can save storage space"

# Create embeddings at different dimensions
# Compare: 1536 (default), 512, 256

dims = [1536, 512, 256]
for dim in dims:
    # Get embedding with that dimension
    # Print dimension and first 3 values
    pass
```

---
## Drill 9: Model Comparison 🔴
**Task:** Compare OpenAI and Sentence Transformers quality

```python
from openai import OpenAI
from sentence_transformers import SentenceTransformer
import numpy as np

test_pairs = [
    ("A dog is running", "A canine is jogging"),
    ("The cat sleeps", "Python programming"),
    ("I love coffee", "Coffee is my favorite drink"),
]

# Get similarities using both:
# 1. OpenAI text-embedding-3-small
# 2. Sentence Transformers all-MiniLM-L6-v2

# Compare the rankings - are they consistent?
```

---
## Drill 10: Semantic Search Function 🔴
**Task:** Build a complete semantic search function

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class SearchResult:
    text: str
    score: float
    index: int

def semantic_search(
    query: str,
    documents: list[str],
    embed_fn: Callable[[str], list[float]],
    top_k: int = 3
) -> list[SearchResult]:
    """
    Perform semantic search.

    Args:
        query: Search query
        documents: List of documents to search
        embed_fn: Function to create embeddings
        top_k: Number of results to return

    Returns:
        List of SearchResult sorted by score (highest first)
    """
    pass

# Test with this data (get_embedding is the function defined in Drill 5)
docs = [
    "The quick brown fox jumps over the lazy dog",
    "Python is a popular programming language",
    "Machine learning models learn from data",
    "The lazy dog sleeps all day",
    "Deep learning is a subset of machine learning",
]

# results = semantic_search(
#     query="AI and neural networks",
#     documents=docs,
#     embed_fn=get_embedding,
#     top_k=2
# )

# for r in results:
#     print(f"{r.score:.3f}: {r.text}")
```
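
The overall shape of a solution (embed, score, sort, truncate) can be sketched in pure Python so that it runs without an API key. Everything named here is hypothetical: `semantic_search_sketch`, `cosine`, `toy_embed`, and `VOCAB` are illustrative, and `toy_embed` is a word-count stand-in for a real embedding model, so it only matches exact words.

```python
import math
from dataclasses import dataclass
from typing import Callable


@dataclass
class SearchResult:
    text: str
    score: float
    index: int


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # treat zero vectors as unrelated


def semantic_search_sketch(
    query: str,
    documents: list[str],
    embed_fn: Callable[[str], list[float]],
    top_k: int = 3,
) -> list[SearchResult]:
    query_vec = embed_fn(query)
    # Score every document against the query, remembering its position.
    results = [
        SearchResult(text=doc, score=cosine(query_vec, embed_fn(doc)), index=i)
        for i, doc in enumerate(documents)
    ]
    results.sort(key=lambda r: r.score, reverse=True)  # highest score first
    return results[:top_k]


# Toy stand-in for an embedding model: counts of a few hand-picked words.
VOCAB = ["dog", "lazy", "python", "learning", "machine"]


def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


docs = [
    "The quick brown fox jumps over the lazy dog",
    "Python is a popular programming language",
    "Machine learning models learn from data",
    "The lazy dog sleeps all day",
    "Deep learning is a subset of machine learning",
]

for r in semantic_search_sketch("machine learning", docs, toy_embed, top_k=2):
    print(f"{r.score:.3f}: {r.text}")
# 1.000: Machine learning models learn from data
# 0.949: Deep learning is a subset of machine learning
```

This sketch calls `embed_fn` once per document on every search. With a real API you would likely embed the documents once, or in a single batch as in Drill 2, and reuse the vectors across queries.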

---
## Self-Check

- [ ] Can create embeddings with OpenAI and Sentence Transformers
- [ ] Understand cosine similarity and when to use it
- [ ] Can implement semantic search
- [ ] Know how to cache embeddings for efficiency
- [ ] Understand trade-offs between different models