# OpenAI Embeddings Lab

This lab will help you understand how embeddings work and how to visualize semantic relationships between different pieces of text using OpenAI's embedding model.

## Lab Structure

The lab consists of 5 main exercises and 1 bonus exercise:

1. **Generate Embeddings**: Learn how to use OpenAI's API to convert text into vector representations
2. **Find Similar Sentences**: Apply embeddings to find semantically similar text
3. **Find Different Sentences**: Identify the most semantically distant sentences
4. **Compare to a New Sentence**: Given a new sentence, search the already-embedded sentences for the most similar one
5. **Use a Larger Corpus**: Test your work on a larger corpus

**Bonus**: Reduce the dimensionality of your embeddings.

## Tips

- Read the OpenAI embeddings documentation carefully
- Pay attention to the dimensionality of the vectors
- Test with small examples first
- Try modifying the corpus with your own sentences

## Extension Ideas

1. Try using a different embedding model
2. Implement alternative similarity metrics
3. Experiment with larger text corpora

## Resources

- [OpenAI Embeddings Documentation](https://platform.openai.com/docs/guides/embeddings)
- [Cosine Similarity Explanation](https://en.wikipedia.org/wiki/Cosine_similarity)


## Getting Started

Let's begin by importing the necessary packages. We'll also set up our API key.

In [None]:
import os
import numpy
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()  # Load environment variables from .env file
API_KEY = os.getenv('OPENAI_API_KEY')

# Confirm the key loaded without printing the secret itself
print("API key loaded:", API_KEY is not None)

client = OpenAI(api_key=API_KEY)

## Exercise 1: Generate Embeddings for the corpus

Write a function that takes a list of strings and returns a dictionary with the sentences as keys and their embeddings as values. Use OpenAI's `text-embedding-3-small` model.
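As a hint for the API call itself, here is a minimal sketch of one batched request. The helper name `embed_texts` and its parameters are hypothetical and not part of the lab. It assumes the client exposes `embeddings.create(model=..., input=...)` and that each item in `response.data` has an `index` and an `embedding` attribute, as the OpenAI Python SDK does.

```python
def embed_texts(client, texts, model="text-embedding-3-small"):
    """Return {text: embedding} using a single batched embeddings request.

    `client` is assumed to be an OpenAI client (or any object with the
    same `embeddings.create` interface).
    """
    response = client.embeddings.create(model=model, input=texts)
    # Each returned item carries the index of the input string it belongs to,
    # so the result can be matched back to the original text.
    return {texts[item.index]: item.embedding for item in response.data}
```

With the `client` created above, this would be called as `embed_texts(client, corpus)`, which needs a valid key and network access. Because the result is a dictionary keyed by sentence, duplicate sentences collapse into one entry.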

In [None]:
corpus = [
    "The quick brown fox jumps over the lazy dog",
    "A lazy dog sleeps in the sun",
    "The brown fox is quick and clever",
    "Dogs and cats are common pets",
    "Foxes are wild animals that hunt",
]


def generate_embeddings(texts):
    """
    Generate embeddings for a list of text strings.

    Args:
        texts (list): List of strings to generate embeddings for

    Returns: Dictionary with sentences as keys and embeddings as values
    """

    # TODO

embeddings = generate_embeddings(corpus)
for sentence, embedding in embeddings.items():
    print(f"{sentence}: {embedding}")

## Exercise 2: Find the most similar pair of sentences

Write a function that finds and returns the most similar pair of sentences from the corpus based on their embedding similarities.

You can reuse the `cosine_similarity` function from the previous lab:

```python
def cosine_similarity(vec1, vec2):
    dot = numpy.dot(vec1, vec2)
    norm1 = numpy.linalg.norm(vec1)
    norm2 = numpy.linalg.norm(vec2)
    return dot / (norm1 * norm2)
```
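To build intuition for the scores this function returns, the sketch below runs it on a few hand-picked 2-D vectors. The vectors are made up for illustration and are not real embeddings.

```python
import numpy


def cosine_similarity(vec1, vec2):
    dot = numpy.dot(vec1, vec2)
    norm1 = numpy.linalg.norm(vec1)
    norm2 = numpy.linalg.norm(vec2)
    return dot / (norm1 * norm2)


a = numpy.array([1.0, 0.0])
b = numpy.array([2.0, 0.0])   # same direction as a, different length
c = numpy.array([0.0, 3.0])   # perpendicular to a
d = numpy.array([-1.0, 0.0])  # opposite direction to a

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0
print(cosine_similarity(a, d))  # -1.0
```

The score depends only on direction, not length, which is why `a` and `b` count as identical.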

In [None]:
def find_most_similar_pair(embeddings):
    """
    Find the most similar pair of sentences in the corpus.

    Args:
        embeddings (dict): Dictionary with sentences as keys and embeddings as values

    Returns: Dictionary with the most similar pair of sentences and their similarity score
    """

    # TODO

    pass


sentence_pair = find_most_similar_pair(embeddings)
print(sentence_pair)

## Exercise 3: Find the most semantically distant sentences

Write a function that finds the pair of sentences with the lowest similarity.
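One way to visit every unordered pair exactly once is `itertools.combinations`. The sketch below shows the idea on a hypothetical dictionary of 2-D vectors standing in for real embeddings. The `cosine` helper is a dependency-free stand-in for `cosine_similarity`, and all names here are illustrative.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity for plain Python lists."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy "embeddings"; real ones have far more dimensions.
toy_embeddings = {
    "up": [0.0, 1.0],
    "mostly up": [1.0, 2.0],
    "right": [1.0, 0.0],
}

# Score each unordered pair of keys once.
scored_pairs = [
    (cosine(toy_embeddings[s1], toy_embeddings[s2]), s1, s2)
    for s1, s2 in combinations(toy_embeddings, 2)
]

# The lowest score marks the most different pair.
lowest = min(scored_pairs, key=lambda entry: entry[0])
print(lowest)  # (0.0, 'up', 'right')
```

Swapping `min` for `max` gives the most similar pair instead.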

In [None]:
def find_most_different_pair(embeddings):
    """
    Find the most semantically different pair of sentences.

    Args:
        embeddings (dict): Dictionary with sentences as keys and embeddings as values

    Returns: Dictionary with the most different pair of sentences and their similarity score
    """

    # TODO

    pass


sentence_pair = find_most_different_pair(embeddings)
print(sentence_pair)

## Exercise 4: Compare to a New Sentence

Create a function that takes a new sentence and finds the most similar sentence from our corpus.
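The comparison step can also be vectorized: if every vector is scaled to unit length, a single matrix-vector product yields all cosine similarities at once. The sketch below uses made-up 2-D vectors, and `query` stands in for the embedding of the new sentence. In the real exercise, the new sentence would need to be embedded with the same model as the corpus for the scores to be meaningful.

```python
import numpy

# Hypothetical toy data standing in for stored sentences and their embeddings.
sentences = ["up", "mostly up", "right"]
matrix = numpy.array([[0.0, 1.0], [1.0, 2.0], [1.0, 0.0]])
query = numpy.array([3.0, 1.0])  # stand-in for the new sentence's embedding

# Scale every row and the query to unit length so a dot product equals cosine similarity.
unit_rows = matrix / numpy.linalg.norm(matrix, axis=1, keepdims=True)
unit_query = query / numpy.linalg.norm(query)

scores = unit_rows @ unit_query  # one similarity score per stored sentence
best = int(numpy.argmax(scores))
print(sentences[best], float(scores[best]))
```

Here the query points mostly to the right, so `"right"` gets the highest score.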

In [None]:
def find_similar_to_new_sentence(new_sentence, embeddings):
    """
    Find the most similar sentence to a new sentence.

    Args:
        new_sentence (str): The new sentence to find the most similar sentence for
        embeddings (dict): Dictionary with sentences as keys and embeddings as values

    Returns: Dictionary with the most similar sentence and its similarity score
    """

    # TODO

    pass

similar_sentence = find_similar_to_new_sentence("Gertrude is relaxing by the pool.", embeddings)
print(similar_sentence)

## Exercise 5: Use a larger corpus

Now that we're comfortable with the basics, let's use a larger corpus. Find a long piece of text from Wikipedia, a blog post, a news site, or any other source that interests you. Aim for roughly four pages of text.

Split the text into sentences and embed the sentences as we did before.
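For the splitting step, a naive regular-expression splitter is often enough for a lab like this. The helper `split_into_sentences` below is a hypothetical sketch that breaks after `.`, `!`, or `?` when followed by whitespace. It will mis-split abbreviations such as "Dr. Smith", so a dedicated tokenizer may be a better fit for messy text.

```python
import re


def split_into_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings, which appear when the input is empty.
    return [part for part in parts if part]


sample = "Foxes hunt at night. Do dogs sleep all day? Not always! Cats do."
print(split_into_sentences(sample))
# ['Foxes hunt at night.', 'Do dogs sleep all day?', 'Not always!', 'Cats do.']
```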

Finally, test out the functions you wrote above using this larger corpus.


In [None]:
# TODO : Create embeddings with a larger corpus
# TODO : Test above functions with larger corpus

## Bonus: Reduce Dimensionality

The project you're working on is running short on storage space and money. You've been tasked with reducing the dimensionality of the embeddings you've generated. At 1536 dimensions per vector, they are simply too large. You've been given these requirements:

- Reduce the dimensionality of the embeddings to 256.
- DO NOT re-generate the embeddings. We can't afford the tokens!

You need to figure out how this is possible and write the code to do so.
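If you get stuck, one possible direction is sketched below: keep only the leading components of each vector and rescale the result to unit length. This relies on an assumption worth checking against the OpenAI embeddings guide, namely that the `text-embedding-3` models are trained so that a truncated and re-normalized vector remains a usable embedding. The helper name `shorten_embedding` is hypothetical, and the example uses a 4-D toy vector rather than a real 1536-D embedding.

```python
import math


def shorten_embedding(vector, target_dim):
    """Keep the first target_dim components, then rescale to unit length."""
    truncated = list(vector[:target_dim])
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0.0:
        # An all-zero vector cannot be normalized; return it unchanged.
        return truncated
    return [x / norm for x in truncated]


full = [3.0, 4.0, 0.0, 12.0]        # hypothetical toy vector
short = shorten_embedding(full, 2)  # for the lab, target_dim would be 256
print(short)  # [0.6, 0.8]
```

Re-normalizing matters because truncation changes the vector's length. Cosine similarity is unaffected by length, but any code that assumes unit-length vectors is not.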