# Exploring Embeddings using OpenAI

<div align="left">
  <a href="https://colab.research.google.com/github/simonguest/dp-applied-genai/blob/main/src/01/embeddings_using_openai.ipynb" target="_blank">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
  </a>
</div>

## OpenAI's Text Embeddings

OpenAI provides text embedding models that convert text into high-dimensional vectors. These embeddings capture semantic meaning and can be used for tasks such as:

- **Semantic search**: Finding similar documents or passages
- **Clustering**: Grouping similar texts together
- **Classification**: Using embeddings as features for ML models
- **Recommendation systems**: Finding similar content

In this notebook, we'll explore OpenAI's `text-embedding-ada-002` model, which returns a 1536-dimensional vector for each input text.

## Setup and API Key

Before running this notebook, make sure you have:
1. An OpenAI API key (set as environment variable `OPENAI_API_KEY`)
2. The Python libraries used in this notebook installed: `pip install openai scipy numpy pandas`

**Note**: Using OpenAI's API incurs costs. The embedding model is relatively inexpensive, but be mindful of your usage.
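
Before making any API calls, it can help to confirm that the key is actually visible to Python. The check below is a minimal sketch, assuming the key lives in the `OPENAI_API_KEY` environment variable as described above. The helper name `has_api_key` is hypothetical and is not part of the OpenAI library; it only inspects a mapping such as `os.environ`.

```python
import os
from typing import Mapping


def has_api_key(env: Mapping[str, str], name: str = "OPENAI_API_KEY") -> bool:
    """Return True if `name` is present in `env` and not blank."""
    return bool(env.get(name, "").strip())


# Hypothetical pre-flight check; it does not validate the key with OpenAI.
if not has_api_key(os.environ):
    print("OPENAI_API_KEY is not set; set it before creating the client.")
```

This only tells you that a value is present, not that the key is valid.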

## Generate Your First Embedding

Let's start by generating an embedding for a simple text string. The embedding will be a list of 1536 floating-point numbers that represent the semantic meaning of the text.

In [None]:
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    input="Your text string goes here",
    model="text-embedding-ada-002"
)

print(f"Embedding dimensions: {len(response.data[0].embedding)}")
print(f"First 10 values: {response.data[0].embedding[:10]}")
print(f"Full embedding: {response.data[0].embedding}")

## Understanding the Output

**Try this**: Change the input text in the cell above and run it again. Notice how:
- The embedding always has 1536 dimensions
- Different texts produce different embeddings
- The individual values are small, typically well inside the range -1 to 1

**Question to consider**: What happens when you use the exact same text twice? Do you get identical embeddings?
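
One way to check these observations is to summarize a vector numerically instead of reading the raw numbers. The sketch below uses a made-up 4-value vector as a stand-in for `response.data[0].embedding`, so it runs without an API call. The helper name `summarize_vector` is an assumption for illustration. Note that any vector with an L2 norm of 1 cannot contain a value outside the range -1 to 1.

```python
import numpy as np


def summarize_vector(values):
    """Return the dimension count, value range, and L2 norm of a vector."""
    vec = np.asarray(values, dtype=float)
    return {
        "dimensions": int(vec.shape[0]),
        "min": float(vec.min()),
        "max": float(vec.max()),
        "l2_norm": float(np.linalg.norm(vec)),
    }


# Toy stand-in for response.data[0].embedding (a real one has 1536 values).
toy_embedding = [0.5, -0.5, 0.5, -0.5]
summary = summarize_vector(toy_embedding)
print(summary)

# To compare two runs on the same text, test closeness rather than exact equality.
same_within_tolerance = np.allclose(toy_embedding, [0.5, -0.5, 0.5, -0.5], atol=1e-6)
print(same_within_tolerance)
```

To answer the question above, you could embed the same text twice and pass the two real embeddings to `np.allclose`. The tolerance of `1e-6` is an arbitrary choice.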

## Comparing Embeddings: Semantic Similarity

The real power of embeddings comes from comparing them. We can measure how similar two pieces of text are by calculating the cosine similarity between their embeddings.

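**Cosine similarity** measures the angle between two vectors and ranges from -1 to 1:
- 1 = the vectors point in the same direction (very similar meaning)
- 0 = the vectors are perpendicular (little or no relationship)
- -1 = the vectors point in opposite directions (rare for text embeddings in practice)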
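
To make the definition concrete, here is a minimal sketch of cosine similarity written directly with NumPy: the dot product of the two vectors divided by the product of their lengths. The function name `cosine_similarity` and the 2-D toy vectors are chosen for illustration. Real embeddings work the same way, just with 1536 values.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


print(cosine_similarity([1, 0], [2, 0]))   # same direction
print(cosine_similarity([1, 0], [0, 3]))   # perpendicular
print(cosine_similarity([1, 0], [-1, 0]))  # opposite direction
```

Because the result depends only on direction, scaling a vector (as in the first pair) does not change the score.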

In [None]:
from scipy.spatial.distance import cosine

# Generate embeddings for two similar sentences
response = client.embeddings.create(
    input=["The cat sat on the mat.",
           "A feline rested on a rug."],
    model="text-embedding-ada-002"
)

embedding_a = response.data[0].embedding
embedding_b = response.data[1].embedding

# Calculate cosine similarity (1 - cosine distance)
similarity_score = 1 - cosine(embedding_a, embedding_b)
print(f"Cosine similarity: {similarity_score:.4f}")

## Experiment with Different Text Pairs

**Try these experiments** by modifying the input texts in the cell above:

1. **Synonymous sentences**: 
   - "The dog is running" vs "The canine is jogging"
   
2. **Related but different topics**:
   - "I love pizza" vs "Italian food is delicious"
   
3. **Completely unrelated**:
   - "The weather is sunny" vs "Mathematics is challenging"
   
4. **Opposite meanings**:
   - "I am happy" vs "I am sad"

**Questions to explore**:
- What similarity scores do you get for each pair?
- Do the scores align with your intuition about semantic similarity?
- How does this compare to simple word matching?
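
For the last question, it helps to have a simple word-matching baseline to compare against. The sketch below uses Jaccard overlap (shared words divided by total distinct words) as one possible baseline. The helper name `word_overlap` and the tokenization rule are assumptions for illustration, not something the OpenAI library provides.

```python
import re


def word_overlap(text_a: str, text_b: str) -> float:
    """Jaccard similarity between the sets of lowercase words in two texts."""
    words_a = set(re.findall(r"[a-z']+", text_a.lower()))
    words_b = set(re.findall(r"[a-z']+", text_b.lower()))
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


score = word_overlap("The cat sat on the mat.", "A feline rested on a rug.")
print(f"Word overlap: {score:.3f}")
```

The cat and feline sentences share only the word "on", so their overlap is low even though they mean nearly the same thing. "I am happy" and "I am sad" share two of their four distinct words, so word matching rates them as fairly similar despite the opposite sentiment.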

## Batch Processing Multiple Texts

For efficiency, you can generate embeddings for multiple texts in a single API call. This avoids a separate network round trip for each text, so it is usually faster than making individual calls.
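
If your list of texts grows large, one option is to split it into smaller batches and send each batch as its own request. The sketch below shows only the splitting step, so it runs without an API call. The helper name `chunked` is hypothetical, and the batch size of 2 is an arbitrary placeholder, not a documented API limit.

```python
from typing import Iterator, List


def chunked(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


sample_texts = [f"text {i}" for i in range(5)]
batches = list(chunked(sample_texts, batch_size=2))
print(batches)
```

Each batch could then be passed as the `input` argument to `client.embeddings.create`, and the order of the texts is preserved across batches.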

In [None]:
# Multiple texts to embed
texts = [
    "The cat sat on the mat.",
    "A feline rested on a rug.",
    "Dogs are loyal companions.",
    "The weather is beautiful today.",
    "Machine learning is fascinating."
]

# Generate embeddings for all texts
response = client.embeddings.create(
    input=texts,
    model="text-embedding-ada-002"
)

# Extract embeddings
embeddings = [data.embedding for data in response.data]

print(f"Generated {len(embeddings)} embeddings")
print(f"Each embedding has {len(embeddings[0])} dimensions")

## Creating a Similarity Matrix

Let's create a matrix showing the similarity between all pairs of texts. This helps visualize which texts are most similar to each other.

In [None]:
import numpy as np
import pandas as pd

# Calculate similarity matrix
n_texts = len(embeddings)
similarity_matrix = np.zeros((n_texts, n_texts))

for i in range(n_texts):
    for j in range(n_texts):
        if i == j:
            similarity_matrix[i][j] = 1.0  # Perfect similarity with itself
        else:
            similarity_matrix[i][j] = 1 - cosine(embeddings[i], embeddings[j])

# Create a DataFrame for better visualization
similarity_df = pd.DataFrame(
    similarity_matrix, 
    index=[f"Text {i+1}" for i in range(n_texts)],
    columns=[f"Text {i+1}" for i in range(n_texts)]
)

print("Similarity Matrix:")
print(similarity_df.round(3))

print("\nOriginal texts:")
for i, text in enumerate(texts):
    print(f"Text {i+1}: {text}")
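
The nested loop above is easy to follow, but the same matrix can be computed in one step with NumPy. This is a sketch of an alternative, not a replacement for the cell above: it scales every vector to unit length and then takes all pairwise dot products. The function name `cosine_similarity_matrix` and the toy 2-D vectors are assumptions for illustration.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities for the rows of a 2-D array."""
    matrix = np.asarray(vectors, dtype=float)
    # Scale each row to unit length so a dot product equals cosine similarity.
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return unit_rows @ unit_rows.T


toy_vectors = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_matrix = cosine_similarity_matrix(toy_vectors)
print(toy_matrix.round(3))
```

With real data you would pass the `embeddings` list from the batch cell instead of `toy_vectors`.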

## Analysis Questions

Looking at the similarity matrix above:

1. **Which two texts are most similar?** Why do you think this is?
2. **Which texts are least similar?** Does this make sense semantically?
3. **How do the similarity scores compare to what you would expect intuitively?**

**Advanced exploration**: Try adding more texts to the list and see how the similarity patterns change. Consider texts from different domains (sports, technology, cooking, etc.).
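
As the list of texts grows, finding the highest and lowest scores by eye gets harder. The sketch below picks out the most and least similar pairs from a similarity matrix, skipping the diagonal where each text is compared with itself. The helper name `extreme_pairs` is hypothetical, and the scores in `toy_similarity` are made up for illustration.

```python
import numpy as np


def extreme_pairs(similarity):
    """Return ((i, j), (k, l)): the most and least similar distinct pairs."""
    sim = np.asarray(similarity, dtype=float)
    # Indices of the upper triangle, excluding the diagonal of self-matches.
    rows, cols = np.triu_indices(sim.shape[0], k=1)
    scores = sim[rows, cols]
    best, worst = int(np.argmax(scores)), int(np.argmin(scores))
    return (int(rows[best]), int(cols[best])), (int(rows[worst]), int(cols[worst]))


# Made-up scores for three texts, for illustration only.
toy_similarity = [
    [1.00, 0.90, 0.75],
    [0.90, 1.00, 0.80],
    [0.75, 0.80, 1.00],
]
most, least = extreme_pairs(toy_similarity)
print(f"Most similar pair: {most}, least similar pair: {least}")
```

The returned indices are zero-based, so `(0, 1)` corresponds to "Text 1" and "Text 2" in the table above.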

## Key Takeaways

From this exploration, you should understand:

1. **Embeddings are dense vector representations** of text that capture semantic meaning
2. **OpenAI's embeddings capture meaning beyond word overlap**, so they can detect semantic similarity even when the words are different
3. **Cosine similarity** is a standard way to measure how similar two embeddings are
4. **Batch processing** embeds multiple texts in a single API call, which reduces request overhead
5. **Embeddings enable many AI applications** like search, recommendation, and classification

**Next steps**: Try using these embeddings in a real application, such as building a semantic search system or a text classifier!
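
As a starting point for semantic search, the core step is ranking documents by their cosine similarity to a query. The sketch below shows that ranking step with toy 2-D vectors standing in for real embeddings, so it runs without an API call. The function name `rank_by_similarity` is an assumption for illustration, and a real system would first embed the query and the documents with the API.

```python
import numpy as np


def rank_by_similarity(query_vector, document_vectors):
    """Return (index, score) pairs sorted from most to least similar to the query."""
    query = np.asarray(query_vector, dtype=float)
    docs = np.asarray(document_vectors, dtype=float)
    # Cosine similarity between the query and every document row.
    scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    order = np.argsort(-scores)
    return [(int(i), float(scores[i])) for i in order]


# Toy 2-D vectors standing in for real embeddings.
toy_documents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [2.0, 1.0]
ranking = rank_by_similarity(toy_query, toy_documents)
for index, score in ranking:
    print(f"document {index}: {score:.3f}")
```

The first entry in `ranking` is the best match for the query.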