# Embeddings Exploration

Embeddings are how LLMs represent **meaning**. While tokenization breaks text into pieces, embeddings capture what those pieces *mean* by mapping them to vectors in a high-dimensional space.

The key insight: **similar meanings = nearby vectors**. "Dog" and "puppy" end up close together, while "dog" and "refrigerator" end up far apart.

This notebook lets you experiment hands-on with embeddings using `sentence-transformers`, a free, open-source library that runs on your own machine once the model has been downloaded.

## 1. Setup

First, let's install and import the embedding model.

In [9]:
# Install sentence-transformers if you don't have it
# (%pip installs into the environment of the running kernel)
%pip install -q sentence-transformers

In [10]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a small, fast model that produces good embeddings
# This will download the model on first run (~90MB)
model = SentenceTransformer('all-MiniLM-L6-v2')

print("Model loaded!")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")

Model loaded!
Embedding dimension: 384


## 2. Getting Embeddings

Let's see what an embedding actually looks like - it's just a list of numbers (a vector).

In [11]:
# Get the embedding for a word
word = "hello"
embedding = model.encode(word)

print(f"Word: '{word}'")
print(f"Embedding type: {type(embedding)}")
print(f"Embedding shape: {embedding.shape}")
print(f"\nFirst 10 values: {embedding[:10]}")
print(f"\nMin value: {embedding.min():.4f}")
print(f"Max value: {embedding.max():.4f}")

Word: 'hello'
Embedding type: <class 'numpy.ndarray'>
Embedding shape: (384,)

First 10 values: [-0.06277174  0.05495885  0.05216482  0.08579    -0.08274895 -0.07457299
  0.06855471  0.01839639 -0.08201139 -0.03738479]

Min value: -0.1444
Max value: 0.2981


In [12]:
# The embedding is a 384-dimensional vector
# Meaning is spread across all dimensions; a single value is rarely interpretable on its own
print(f"This word is represented by {len(embedding)} numbers.")
print(f"\nThink of it as a point in 384-dimensional space.")
print(f"Words with similar meanings are points that are close together.")

This word is represented by 384 numbers.

Think of it as a point in 384-dimensional space.
Words with similar meanings are points that are close together.


## 3. Cosine Similarity

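To measure how similar two embeddings are, we use **cosine similarity**: the cosine of the angle between two vectors. It depends only on the direction of the vectors, not on their length:

- **1.0** = identical direction (very similar meaning)
- **0.0** = perpendicular (unrelated)
- **-1.0** = opposite direction (rare for text embeddings, and not the same as opposite meaning)

In practice, most pairs of text embeddings score between 0 and 1, although slightly negative values can occur.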

In [13]:
def cosine_similarity(vec1, vec2):
    """
    Calculate cosine similarity between two vectors.
    
    Formula: cos(theta) = (A . B) / (||A|| * ||B||)
    
    Where:
    - A . B is the dot product
    - ||A|| and ||B|| are the magnitudes (lengths) of the vectors
    """
    dot_product = np.dot(vec1, vec2)
    magnitude1 = np.linalg.norm(vec1)
    magnitude2 = np.linalg.norm(vec2)
    
    return dot_product / (magnitude1 * magnitude2)

# Test it with a simple example
vec_a = np.array([1, 0, 0])
vec_b = np.array([1, 0, 0])  # Same direction
vec_c = np.array([0, 1, 0])  # Perpendicular

print("Simple 3D vectors:")
print(f"  Same direction: {cosine_similarity(vec_a, vec_b):.2f}")
print(f"  Perpendicular:  {cosine_similarity(vec_a, vec_c):.2f}")

Simple 3D vectors:
  Same direction: 1.00
  Perpendicular:  0.00
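
The test above covers the 1.0 and 0.0 cases. The following is a minimal sketch of the remaining reference case (opposite direction) and of the fact that cosine similarity ignores vector length. The function is repeated so the block runs on its own, and the toy vectors are illustrative assumptions, not model output.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors (same formula as above)."""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Toy 3D vectors chosen for illustration; they are not model embeddings.
vec_a = np.array([1.0, 2.0, 3.0])
vec_opposite = -vec_a        # points the opposite way
vec_scaled = 10.0 * vec_a    # same direction, ten times longer

sim_opposite = cosine_similarity(vec_a, vec_opposite)
sim_scaled = cosine_similarity(vec_a, vec_scaled)

print(f"Opposite direction: {sim_opposite:.2f}")
print(f"Scaled copy:        {sim_scaled:.2f}")
```

Because length is ignored, a long and a short vector pointing the same way count as identical under this measure.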


## 4. Semantic Similarity in Action

Now for the magic - let's see how well the model captures semantic relationships.

In [14]:
def compare_words(word1, word2):
    """Get embeddings for two words and compute their similarity."""
    emb1 = model.encode(word1)
    emb2 = model.encode(word2)
    similarity = cosine_similarity(emb1, emb2)
    return similarity

def show_similarity(word1, word2, expected=""):
    """Display the similarity between two words."""
    sim = compare_words(word1, word2)
    bar = "#" * int(sim * 30)  # Visual bar
    note = f" ({expected})" if expected else ""
    print(f"'{word1}' vs '{word2}': {sim:.3f} {bar}{note}")

In [15]:
# Expected: high similarity - same concept
print("Same concept, different words:")
show_similarity("dog", "puppy", "high - same animal")
show_similarity("car", "automobile", "high - synonyms")
show_similarity("happy", "joyful", "high - synonyms")
show_similarity("big", "large", "high - synonyms")

Same concept, different words:
'dog' vs 'puppy': 0.804 ######################## (high - same animal)
'car' vs 'automobile': 0.865 ######################### (high - synonyms)
'happy' vs 'joyful': 0.684 #################### (high - synonyms)
'big' vs 'large': 0.807 ######################## (high - synonyms)


In [16]:
# Expected: medium similarity - related but different
print("\nRelated concepts:")
show_similarity("dog", "cat", "medium - both pets")
show_similarity("coffee", "tea", "medium - both beverages")
show_similarity("doctor", "nurse", "medium - both medical")
show_similarity("king", "queen", "medium - both royalty")


Related concepts:
'dog' vs 'cat': 0.661 ################### (medium - both pets)
'coffee' vs 'tea': 0.616 ################## (medium - both beverages)
'doctor' vs 'nurse': 0.608 ################## (medium - both medical)
'king' vs 'queen': 0.681 #################### (medium - both royalty)


In [17]:
# Expected: low similarity - unrelated
print("\nUnrelated concepts:")
show_similarity("dog", "refrigerator", "low - unrelated")
show_similarity("banana", "democracy", "low - unrelated")
show_similarity("laptop", "elephant", "low - unrelated")
show_similarity("music", "mathematics", "low-medium?")


Unrelated concepts:
'dog' vs 'refrigerator': 0.246 ####### (low - unrelated)
'banana' vs 'democracy': 0.186 ##### (low - unrelated)
'laptop' vs 'elephant': 0.370 ########### (low - unrelated)
'music' vs 'mathematics': 0.381 ########### (low-medium?)


In [18]:
# Interesting cases - opposites and antonyms
print("\nInteresting cases (opposites):")
show_similarity("hot", "cold", "often high! - same concept, opposite ends")
show_similarity("love", "hate", "often high! - same concept, opposite ends")
show_similarity("up", "down", "often high! - same concept, opposite ends")

print("\n(Note: Opposites often have HIGH similarity because they relate to the same concept!")
print("This is a limitation of simple embedding similarity.)")


Interesting cases (opposites):
'hot' vs 'cold': 0.519 ############### (often high! - same concept, opposite ends)
'love' vs 'hate': 0.488 ############## (often high! - same concept, opposite ends)
'up' vs 'down': 0.673 #################### (often high! - same concept, opposite ends)

(Note: Opposites often have HIGH similarity because they relate to the same concept!
This is a limitation of simple embedding similarity.)


**Try it yourself:** What other word pairs would you like to compare? Add some below!

In [19]:
# Your turn - try your own word pairs
show_similarity("python", "snake")
show_similarity("python", "programming")
show_similarity("apple", "fruit")
show_similarity("apple", "microsoft")

'python' vs 'snake': 0.444 #############
'python' vs 'programming': 0.613 ##################
'apple' vs 'fruit': 0.537 ################
'apple' vs 'microsoft': 0.493 ##############


## 5. Sentences vs Words

The real power of modern embedding models is that they work on **whole sentences**, not just words. The meaning of a sentence is more than the sum of its words.

In [20]:
# Sentence embeddings - same word ("bank"), different meanings
sent1_emb = model.encode("I need to deposit money at the bank")
sent2_emb = model.encode("We had a picnic by the river bank")

print("The word 'bank' has multiple meanings.")
print(f"\nSentence 1 (financial) vs Sentence 2 (river): {cosine_similarity(sent1_emb, sent2_emb):.3f}")
print("\nThe sentences are somewhat different because the context differs!")

The word 'bank' has multiple meanings.

Sentence 1 (financial) vs Sentence 2 (river): 0.312

The sentences are somewhat different because the context differs!


In [21]:
# Same meaning, different words
sentences = [
    "The cat sat on the mat",
    "A feline was resting on the rug",
    "The dog ran in the park",
    "I love programming in Python",
]

embeddings = model.encode(sentences)

print("Sentence similarities:\n")
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        sim = cosine_similarity(embeddings[i], embeddings[j])
        print(f"{sim:.3f}: '{sentences[i][:40]}...' vs '{sentences[j][:40]}...'")

Sentence similarities:

0.577: 'The cat sat on the mat...' vs 'A feline was resting on the rug...'
0.072: 'The cat sat on the mat...' vs 'The dog ran in the park...'
0.020: 'The cat sat on the mat...' vs 'I love programming in Python...'
0.118: 'A feline was resting on the rug...' vs 'The dog ran in the park...'
0.040: 'A feline was resting on the rug...' vs 'I love programming in Python...'
0.055: 'The dog ran in the park...' vs 'I love programming in Python...'
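
The nested loop above computes one pair at a time. As a rough sketch, the same scores can be obtained for all pairs at once by scaling each row to length 1 and taking a matrix product. `pairwise_cosine` is a hypothetical helper name, and the toy array only stands in for the output of `model.encode(sentences)`.

```python
import numpy as np

def pairwise_cosine(embeddings):
    """Return an (n, n) matrix whose [i, j] entry is the cosine similarity of rows i and j."""
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms    # scale each row to length 1
    return unit @ unit.T         # dot products of unit vectors are cosines

# Toy stand-in for the array returned by model.encode(sentences); values are made up.
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 2.0],
])

sim_matrix = pairwise_cosine(toy_embeddings)
print(np.round(sim_matrix, 3))
```

Passing the notebook's `embeddings` array to `pairwise_cosine` should give a matrix whose upper triangle matches the six scores printed above.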


In [22]:
# Paraphrases - same meaning, completely different words
paraphrase_pairs = [
    ("How old are you?", "What is your age?"),
    ("The movie was great", "I really enjoyed the film"),
    ("It's raining outside", "The weather is wet"),
]

print("Paraphrase detection (same meaning, different words):\n")
for sent1, sent2 in paraphrase_pairs:
    sim = compare_words(sent1, sent2)
    print(f"{sim:.3f}: '{sent1}' <-> '{sent2}'")

Paraphrase detection (same meaning, different words):

0.761: 'How old are you?' <-> 'What is your age?'
0.805: 'The movie was great' <-> 'I really enjoyed the film'
0.791: 'It's raining outside' <-> 'The weather is wet'
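
Turning these scores into a yes/no paraphrase decision requires a cutoff. The sketch below applies one to scores copied from the outputs in this notebook. The threshold of 0.7 and the helper `is_paraphrase` are assumptions made for illustration, not values recommended by the library.

```python
# Similarity scores copied from the outputs in this notebook.
reported_scores = {
    ("How old are you?", "What is your age?"): 0.761,
    ("The movie was great", "I really enjoyed the film"): 0.805,
    ("It's raining outside", "The weather is wet"): 0.791,
    ("The cat sat on the mat", "The dog ran in the park"): 0.072,
    ("happy", "joyful"): 0.684,
    ("up", "down"): 0.673,
}

PARAPHRASE_THRESHOLD = 0.7  # hypothetical cutoff, chosen only for this illustration

def is_paraphrase(score, threshold=PARAPHRASE_THRESHOLD):
    """Label a pair as a paraphrase when its similarity reaches the threshold."""
    return score >= threshold

labels = {pair: is_paraphrase(score) for pair, score in reported_scores.items()}
for (text1, text2), label in labels.items():
    print(f"{label!s:5}  '{text1}' <-> '{text2}'")
```

With this cutoff the three paraphrase pairs pass. The synonyms 'happy' and 'joyful' (0.684) fall just short, and the antonyms 'up' and 'down' (0.673) score almost the same, so lowering the cutoff to 0.65 would admit both. A single threshold cannot separate synonyms from antonyms here.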


## 6. Practical Example: Semantic Search

One of the most useful applications of embeddings is **semantic search** - finding content by meaning, not just keywords.

Instead of exact string matching, we find documents whose embeddings are closest to the query embedding.

In [23]:
# Our "document" collection
documents = [
    "Python is a popular programming language for data science",
    "Machine learning models can predict future outcomes",
    "The weather forecast shows rain tomorrow",
    "Neural networks are inspired by the human brain",
    "I made a delicious pasta for dinner last night",
    "Deep learning requires large amounts of training data",
    "The stock market showed gains today",
    "Transformers have revolutionized natural language processing",
    "My cat loves to sleep in sunny spots",
    "GPT models generate human-like text",
]

# Pre-compute embeddings for all documents (in practice, you'd store these)
doc_embeddings = model.encode(documents)

print(f"Indexed {len(documents)} documents")
print(f"Each document is represented by a {doc_embeddings.shape[1]}-dimensional vector")

Indexed 10 documents
Each document is represented by a 384-dimensional vector


In [24]:
def semantic_search(query, top_k=3):
    """
    Find the most similar documents to a query.
    
    This is the basic algorithm behind semantic search:
    1. Embed the query
    2. Compare to all document embeddings
    3. Return the closest matches
    """
    # Embed the query
    query_embedding = model.encode(query)
    
    # Calculate similarity to each document
    similarities = []
    for i, doc_emb in enumerate(doc_embeddings):
        sim = cosine_similarity(query_embedding, doc_emb)
        similarities.append((sim, i))
    
    # Sort by similarity (highest first)
    similarities.sort(reverse=True)
    
    # Return top results
    print(f"Query: '{query}'\n")
    print("Top results:")
    for rank, (sim, idx) in enumerate(similarities[:top_k], 1):
        print(f"  {rank}. [{sim:.3f}] {documents[idx]}")
    
    return similarities[:top_k]

In [25]:
# Try some searches!
semantic_search("AI and artificial intelligence")

Query: 'AI and artificial intelligence'

Top results:
  1. [0.454] Neural networks are inspired by the human brain
  2. [0.321] Machine learning models can predict future outcomes
  3. [0.215] Transformers have revolutionized natural language processing


[(np.float32(0.45375124), 3),
 (np.float32(0.32095325), 1),
 (np.float32(0.21459457), 7)]

In [26]:
# Notice: we don't need exact keyword matches
semantic_search("coding")

Query: 'coding'

Top results:
  1. [0.287] Python is a popular programming language for data science
  2. [0.277] GPT models generate human-like text
  3. [0.267] Neural networks are inspired by the human brain


[(np.float32(0.28735948), 0),
 (np.float32(0.27661642), 9),
 (np.float32(0.26702595), 3)]

In [27]:
# Works with questions too
semantic_search("How do large language models work?")

Query: 'How do large language models work?'

Top results:
  1. [0.403] Transformers have revolutionized natural language processing
  2. [0.393] GPT models generate human-like text
  3. [0.294] Deep learning requires large amounts of training data


[(np.float32(0.40281165), 7),
 (np.float32(0.39343098), 9),
 (np.float32(0.2939294), 5)]

In [28]:
# Completely different domain
semantic_search("pets and animals")

Query: 'pets and animals'

Top results:
  1. [0.243] My cat loves to sleep in sunny spots
  2. [0.171] Neural networks are inspired by the human brain
  3. [0.123] Python is a popular programming language for data science


[(np.float32(0.2425824), 8),
 (np.float32(0.17082085), 3),
 (np.float32(0.122809336), 0)]

In [29]:
# Food-related query
semantic_search("cooking and food")

Query: 'cooking and food'

Top results:
  1. [0.367] I made a delicious pasta for dinner last night
  2. [0.136] My cat loves to sleep in sunny spots
  3. [0.097] Machine learning models can predict future outcomes


[(np.float32(0.3670238), 4),
 (np.float32(0.13638993), 8),
 (np.float32(0.0970244), 1)]
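
`semantic_search` always returns `top_k` results, even when most of them are weak matches, as in the last two queries. A possible variant, sketched below with toy vectors, scores all documents in one vectorized step and drops matches below a cutoff. `top_k_matches` and its `min_score` parameter are hypothetical names introduced here, not part of `sentence-transformers`.

```python
import numpy as np

def top_k_matches(query_embedding, doc_embeddings, top_k=3, min_score=float("-inf")):
    """Return (score, index) pairs for the best-matching rows, highest score first.

    min_score is a hypothetical cutoff: matches scoring below it are dropped.
    """
    query = np.asarray(query_embedding, dtype=float)
    docs = np.asarray(doc_embeddings, dtype=float)
    # Cosine similarity of the query against every document in one operation.
    scores = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    best = np.argsort(scores)[::-1][:top_k]   # indices sorted by descending score
    return [(float(scores[i]), int(i)) for i in best if scores[i] >= min_score]

# Toy 2D vectors standing in for real embeddings; values are made up.
toy_docs = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([1.0, 0.0])

all_hits = top_k_matches(toy_query, toy_docs, top_k=3)
filtered_hits = top_k_matches(toy_query, toy_docs, top_k=3, min_score=0.5)
print(all_hits)
print(filtered_hits)
```

In the "pets and animals" search above, a cutoff of 0.2 (a value picked for illustration) would have kept only the cat document and dropped the two unrelated results.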

**Key insight:** Semantic search finds relevant content even when the exact words don't match. This is why it's so powerful for:

- Search engines
- Recommendation systems
- RAG (Retrieval Augmented Generation) for LLMs
- Finding similar documents
- Clustering content by topic

## Summary

Key takeaways:

1. **Embeddings are vectors** - Text gets converted to lists of numbers (384 dimensions in our model)

2. **Meaning = Position** - Similar meanings end up as nearby points in the vector space

3. **Cosine similarity** - Measures how similar two embeddings are (ranges from -1 to 1; text pairs mostly score between 0 and 1, where 1 = same direction)

4. **Sentences, not just words** - Modern models capture full sentence meaning, including context

5. **Semantic search** - Find relevant content by meaning, not just keyword matching

6. **Limitations** - Antonyms often have high similarity (they're related concepts). A similarity score says that two texts are related, not how they are related, so embeddings alone don't capture logic or reasoning.
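
Takeaways 2 and 3 describe closeness in two ways: distance between points and angle between vectors. For vectors of length 1 the two agree, because the squared Euclidean distance equals 2 - 2 * cosine. The sketch below checks this on random unit vectors. They only stand in for embeddings, and whether a given model returns unit-length vectors is an assumption to verify (for example with `np.linalg.norm(embedding)`).

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Two random vectors, rescaled to length 1. They stand in for embeddings.
a = rng.normal(size=384)
b = rng.normal(size=384)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = float(np.dot(a, b))
squared_distance = float(np.sum((a - b) ** 2))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(theta)
print(f"squared distance: {squared_distance:.6f}")
print(f"2 - 2 * cosine:   {2 - 2 * cosine:.6f}")
```

For unit-length vectors, ranking by highest cosine similarity and ranking by smallest Euclidean distance therefore give the same order.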

## Playground

Use these cells to experiment with your own text!

In [30]:
# Compare any two pieces of text
text1 = "Your first text here"
text2 = "Your second text here"

show_similarity(text1, text2)

'Your first text here' vs 'Your second text here': 0.843 #########################


In [31]:
# Add your own documents and search them
my_documents = [
    "Add your own documents here",
    "Make them about topics you care about",
    "Then search for related content",
]

# Embed your documents
my_doc_embeddings = model.encode(my_documents)

# Now search
my_query = "your search query"
query_emb = model.encode(my_query)

print(f"Query: '{my_query}'\n")
for i, doc in enumerate(my_documents):
    sim = cosine_similarity(query_emb, my_doc_embeddings[i])
    print(f"  [{sim:.3f}] {doc}")

Query: 'your search query'

  [0.228] Add your own documents here
  [0.061] Make them about topics you care about
  [0.510] Then search for related content
