# Day 8: Understanding Embeddings

So far, we've been sending text to LLMs and getting text back.

But what if we need to **compare** texts? Or find **similar** documents?

That's where **embeddings** come in.

## What is an Embedding?

An embedding converts text into a **vector**: a list of numbers.

```
"Hello World" → [0.012, -0.034, 0.056, ..., 0.089]  (3072 numbers)
```

These numbers capture the **meaning** of the text, not just the words.

Two sentences with similar meaning will have similar vectors.

## Setup

In [18]:
from google import genai
import os
from dotenv import load_dotenv

load_dotenv(dotenv_path='../.env')
API_KEY = os.environ["GEMINI_API_KEY"]
client = genai.Client(api_key=API_KEY)

## Generate an Embedding

In [19]:
text = "Hello World"

response = client.models.embed_content(
    model="gemini-embedding-001",
    contents=text
)

embedding = response.embeddings[0].values

print(f"Input: '{text}'")
print(f"Vector dimensions: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")

Input: 'Hello World'
Vector dimensions: 3072
First 5 values: [-0.015046939, 0.007224771, 0.010408387, -0.06416951, -0.003296465]

## What Do These Numbers Mean?

Each number represents a **feature** of the text's meaning.

- The model learned these features during training
- Individual numbers don't have human-readable labels
- But combined, they form a unique "fingerprint" of meaning

Think of it like GPS coordinates:
- `(37.7749, -122.4194)` doesn't tell you "San Francisco"
- But similar coordinates mean nearby locations
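
The analogy can be made concrete with a few lines of Python. This is a minimal sketch: the Oakland and New York coordinates are approximate values added here for illustration, and plain Euclidean distance on latitude and longitude is not a true geographic distance. It only shows that "close numbers" means "close places".

```python
import math

# Illustrative (latitude, longitude) pairs. San Francisco comes from the text;
# Oakland and New York are approximate values assumed for this comparison.
san_francisco = (37.7749, -122.4194)
oakland = (37.8044, -122.2712)
new_york = (40.7128, -74.0060)

def straight_line_distance(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    return math.dist(p, q)

near = straight_line_distance(san_francisco, oakland)
far = straight_line_distance(san_francisco, new_york)

print(f"San Francisco vs Oakland:  {near:.3f}")
print(f"San Francisco vs New York: {far:.3f}")
print(f"Oakland is closer: {near < far}")
```

Embeddings work on the same idea, except each "location" has 3072 coordinates instead of two, and closeness reflects meaning rather than geography.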

## Embedding Multiple Texts

In [20]:
texts = [
    "Hello World",
    "Hi there, how are you?",
    "Machine learning is fascinating",
    "Deep learning uses neural networks"
]

embeddings = []
for text in texts:
    response = client.models.embed_content(
        model="gemini-embedding-001",
        contents=text
    )
    embeddings.append(response.embeddings[0].values)
    print(f"✅ Generated embedding for: '{text}'")

print(f"\n📊 Total embeddings generated: {len(embeddings)}")
print(f"📐 Each embedding has {len(embeddings[0])} dimensions")

✅ Generated embedding for: 'Hello World'
✅ Generated embedding for: 'Hi there, how are you?'
✅ Generated embedding for: 'Machine learning is fascinating'
✅ Generated embedding for: 'Deep learning uses neural networks'

📊 Total embeddings generated: 4
📐 Each embedding has 3072 dimensions


## Comparing Two Texts

In [None]:
text_a = "The cat sat on the mat"
text_b = "A feline rested on the rug"

# Get embeddings
emb_a = client.models.embed_content(model="gemini-embedding-001", contents=text_a).embeddings[0].values
emb_b = client.models.embed_content(model="gemini-embedding-001", contents=text_b).embeddings[0].values

print(f"Text A: '{text_a}'")
print(f"Text B: '{text_b}'")
print(f"\nBoth have {len(emb_a)} dimensions")

## Measuring Similarity with Cosine

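**Cosine similarity** measures the angle between two vectors, ignoring their length:

- `1.0` = vectors point in the same direction (near-identical meaning)
- `0.0` = vectors are perpendicular (unrelated)
- `-1.0` = vectors point in opposite directions (opposite meaning, rare in practice)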

In [None]:
import numpy as np

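def cosine_similarity(vec1, vec2):
    """Return the cosine of the angle between two equal-length vectors."""
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)
    # Dot product divided by the product of the vector lengths (L2 norms)
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))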

similarity = cosine_similarity(emb_a, emb_b)
print(f"Similarity: {similarity:.4f}")
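
To see where the reference values `1.0`, `0.0`, and `-1.0` come from, it can help to try the formula on vectors small enough to check by hand. The sketch below is a pure-Python version of the same calculation, so it runs without NumPy or an API key. The function name `cosine` and the 2-D vectors are hypothetical, chosen only for illustration; they are not model output.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-picked 2-D toy vectors (hypothetical, not real embeddings).
base = [1.0, 2.0]
same_direction = [2.0, 4.0]   # base scaled by 2
perpendicular = [-2.0, 1.0]   # dot product with base is 0
opposite = [-1.0, -2.0]       # base negated

print(f"same direction: {cosine(base, same_direction):+.4f}")
print(f"perpendicular:  {cosine(base, perpendicular):+.4f}")
print(f"opposite:       {cosine(base, opposite):+.4f}")
```

Note that `same_direction` is twice as long as `base` yet still scores as a perfect match: cosine similarity compares direction only, not magnitude.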

## Similar vs Different Texts

In [None]:
sentences = [
    "Machine learning is a subset of AI",
    "ML is part of artificial intelligence",
    "The weather is nice today"
]

# Generate embeddings
embeddings = []
for s in sentences:
    emb = client.models.embed_content(model="gemini-embedding-001", contents=s).embeddings[0].values
    embeddings.append(emb)

# Compare
print("Comparing sentences:\n")
print(f"1: '{sentences[0]}'")
print(f"2: '{sentences[1]}'")
print(f"3: '{sentences[2]}'")

print(f"\n1 vs 2 (similar meaning): {cosine_similarity(embeddings[0], embeddings[1]):.4f}")
print(f"1 vs 3 (different topic):  {cosine_similarity(embeddings[0], embeddings[2]):.4f}")
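
The same comparison extends to the "find similar documents" case from the introduction: score every candidate against a query and sort by similarity. The sketch below makes no API calls. The 3-dimensional vectors and the `query_vector` are made-up stand-ins for real embeddings, and `cosine` is a hypothetical pure-Python helper.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-D vectors standing in for real embeddings (hypothetical values).
documents = {
    "Machine learning is a subset of AI": [0.9, 0.1, 0.0],
    "ML is part of artificial intelligence": [0.8, 0.2, 0.1],
    "The weather is nice today": [0.0, 0.1, 0.9],
}
query_vector = [0.85, 0.15, 0.05]  # pretend embedding of an AI-related query

# Sort documents from most to least similar to the query.
ranked = sorted(
    documents.items(),
    key=lambda item: cosine(query_vector, item[1]),
    reverse=True,
)

for text, vector in ranked:
    print(f"{cosine(query_vector, vector):.4f}  {text}")
```

With real embeddings, the only change would be replacing the toy vectors with the values returned by `embed_content`.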

## Key Takeaways

1. **Embeddings** convert text to vectors (lists of numbers)
2. **Similar meanings** produce similar vectors
3. **Cosine similarity** measures how close two vectors are
4. `gemini-embedding-001` produced **3072-dimensional** vectors in every example here

---

**Next:** Day 9: Using embeddings for semantic search