# 🔢 Understanding Word Embeddings

Welcome to the world of embeddings! Before we dive into building vector databases, let's understand the fundamental concept that makes it all possible: **word embeddings**.

Think of embeddings as a way to represent the *meaning* of words as lists of numbers, so that words with similar meanings end up with similar numbers. This simple idea underpins how modern AI systems compare and search through text.

Let's explore how this magic works! ✨

## 📦 Install Required Packages

Let's install the basic tools we need to create and work with embeddings.

In [None]:
# Install the embedding library (sentence-transformers) and scikit-learn for the similarity metric
!pip install -q sentence-transformers==3.0.1 scikit-learn==1.4.2

## 📚 Import Libraries

Import the tools we'll use to create embeddings and measure similarity.

In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

print("🧠 Ready to explore embeddings!")

## 🔢 From Words to Numbers: A Simple Example

Here's how different words become different vectors (simplified to just 4 dimensions for illustration):

- `"dog"` → [0.2, 0.8, 0.1, 0.6] 
- `"puppy"` → [0.3, 0.7, 0.2, 0.5] (similar to dog - both canines)
- `"cat"` → [0.1, 0.6, 0.8, 0.3] (different but some similarity - still a pet)
- `"car"` → [0.9, 0.1, 0.2, 0.1] (completely different - not an animal)

In real embeddings, these vectors have hundreds or thousands of dimensions (the `all-MiniLM-L6-v2` model used in this notebook produces 384), capturing subtle meaning relationships that make semantic search possible!
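To make "similar" concrete, here is a minimal sketch that scores the toy 4-dimensional vectors above with cosine similarity, written in plain Python so the arithmetic stays visible. The `toy_vectors` dictionary and the `cosine` helper are illustrative names, not part of any library, and the numbers are the made-up values from the list, not real model output.

```python
import math

# Toy 4-dimensional vectors copied from the list above (illustrative values only)
toy_vectors = {
    "dog":   [0.2, 0.8, 0.1, 0.6],
    "puppy": [0.3, 0.7, 0.2, 0.5],
    "cat":   [0.1, 0.6, 0.8, 0.3],
    "car":   [0.9, 0.1, 0.2, 0.1],
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Score every other word against "dog"
toy_scores = {
    word: cosine(toy_vectors["dog"], vec)
    for word, vec in toy_vectors.items()
    if word != "dog"
}

for word, score in toy_scores.items():
    print(f"dog vs {word}: {score:.3f}")
```

With these toy values, `puppy` scores highest against `dog`, `cat` lands in the middle, and `car` scores lowest, which matches the intuition in the list.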

## 🧪 Let's Create Real Embeddings!

Now let's see this in action! We'll convert those same words into actual embedding vectors.

In [None]:
# Load a pre-trained embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Our example words from the list above
words = ["dog", "puppy", "cat", "car"]

# Convert words to embeddings (vectors)
embeddings = model.encode(words)

print("🔢 Converting words to vectors:")
for word, embedding in zip(words, embeddings):
    print(f"'{word}' → vector with {len(embedding)} dimensions")
    print(f"   First 5 values: {embedding[:5]}")
    print()

## ✨ The Vector Magic

1. **Text → Numbers**: Every piece of text gets converted into a list of numbers (a vector) that captures its meaning
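2. **Similarity Search**: When you ask a question, it is converted into a vector the same way, and the system finds the stored vectors closest to it
3. **Lightning Fast**: With a suitable index (typically approximate nearest-neighbor search), searches over millions of documents can return in milliseconds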
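The steps above can be sketched as a brute-force search: take a query vector, score it against every stored vector, and keep the best matches. This is only an illustration under stated assumptions: `toy_index`, `cosine`, and `search` are hypothetical names, the stored vectors are made-up 4-dimensional values, and the query vector is invented by hand rather than produced by a model. A linear scan like this is not what makes search fast at scale; vector databases typically rely on approximate nearest-neighbor indexes for that.

```python
import math

# Hypothetical in-memory "index": text -> toy vector (illustrative values only)
toy_index = {
    "dog":   [0.2, 0.8, 0.1, 0.6],
    "puppy": [0.3, 0.7, 0.2, 0.5],
    "cat":   [0.1, 0.6, 0.8, 0.3],
    "car":   [0.9, 0.1, 0.2, 0.1],
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vector, index, top_k=2):
    """Return the top_k (text, score) pairs most similar to query_vector."""
    scored = [(text, cosine(query_vector, vec)) for text, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hand-made query vector standing in for an embedded question about dogs
query = [0.25, 0.75, 0.15, 0.55]

results = search(query, toy_index, top_k=2)
for text, score in results:
    print(f"{text}: {score:.3f}")
```

For this hand-made query, the two best matches are the two canine entries, while `cat` and `car` fall outside the top two.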

## 🔍 Testing Similarity

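Let's measure how similar our words are to each other using cosine similarity, which scores a pair of vectors from -1 to 1 (higher means more similar). The scores will show which words the model considers most related!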

In [None]:
# Calculate similarity between 'dog' and all other words
dog_embedding = embeddings[0].reshape(1, -1)  # 'dog' is first word
other_embeddings = embeddings[1:]  # puppy, cat, car

# Calculate cosine similarity (higher = more similar)
similarities = cosine_similarity(dog_embedding, other_embeddings)[0]

print("🐕 How similar is 'dog' to other words?")
print()
for word, similarity in zip(words[1:], similarities):
    print(f"dog ↔ {word}: {similarity:.3f}")
    
print()
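# Rank the words by their measured similarity instead of assuming the order
ranked = sorted(zip(words[1:], similarities), key=lambda pair: pair[1], reverse=True)
print("📊 Ranked from most to least similar to 'dog':")
for word, similarity in ranked:
    print(f"• {word}: {similarity:.3f}")
print("Expected pattern: 'puppy' (another canine), then 'cat' (another pet), then 'car' (unrelated)")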

## 🎮 Try It Yourself!

Want to test similarity between your own words or phrases? Run the cell below and experiment!

In [None]:
# Try your own words or phrases!
# Change these to anything you want to compare
test_words = [
    "machine learning",
    "artificial intelligence", 
    "deep learning",
    "cooking recipes"
]

# Create embeddings
test_embeddings = model.encode(test_words)

# Compare first phrase with all others
base_embedding = test_embeddings[0].reshape(1, -1)
other_test_embeddings = test_embeddings[1:]
test_similarities = cosine_similarity(base_embedding, other_test_embeddings)[0]

print(f"🧪 How similar is '{test_words[0]}' to other phrases?")
print()
for phrase, similarity in zip(test_words[1:], test_similarities):
    print(f"'{test_words[0]}' ↔ '{phrase}': {similarity:.3f}")

print()
print("💡 Try changing the test_words list above and rerun to experiment!")

## 🚀 What's Next?

Now that you understand how embeddings work conceptually, you're ready to:

- **Build a real vector database** using Milvus
- **Generate actual embeddings** from text
- **Implement semantic search** for educational content
- **See the magic in action** with real similarity scores

Continue to the next notebook: **`1-vector-databases.ipynb`** to start building your production vector database! 🎯