## Creating an Embedding request

OpenAI provides access to their embedding models via the Embeddings endpoint, and requests to it take a very similar form to other OpenAI endpoints. 

We'll use the `openai` library to create requests to the OpenAI API, which also requires an OpenAI API key.

```python
from openai import OpenAI

# Replace the placeholder with your own API key
client = OpenAI(api_key="<OPENAI_API_KEY>")

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Embeddings are a numerical representation of text that can be used to measure the relatedness between two pieces of text."
)
```

We have specified input here as a text string, but the argument also accepts a list of strings.

Finally, we'll call the `.model_dump()` method on the response to convert it into a dictionary, which is easier to work with, and print the result.

```python
response_dict = response.model_dump()
print(response_dict)
```
<img src='./images/embedding_response.png' width=60% height=60% >

The response from the API is extremely long, as the embedding model outputs 1536 numbers to represent the input string. Because we converted the response into a dictionary, we can dig into it using list and dictionary subsetting.

Here we can print the full list of 1536 numbers representing our text:

```python
print(response_dict["data"][0]["embedding"])
```

Output: `[0.0023064255, ..., -0.0028842222]`
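
To make the subsetting concrete, here is a minimal sketch that uses a hand-written dictionary mimicking the general shape of the dumped response. The numeric values and token counts are made up for illustration, and the embedding is truncated to three numbers; `mock_response_dict` is a hypothetical name, and no API call is made.

```python
# A mock of the dumped response; all values are illustrative, not real API output.
mock_response_dict = {
    "object": "list",
    "data": [
        {
            "object": "embedding",
            "index": 0,
            "embedding": [0.0023, -0.0112, 0.0057],  # truncated; real vectors are much longer
        }
    ],
    "model": "text-embedding-3-small",
    "usage": {"prompt_tokens": 5, "total_tokens": 5},
}

# "data" is a list with one dictionary per input, so index into the list first...
first_item = mock_response_dict["data"][0]

# ...then use a dictionary key to reach the vector itself.
embedding = first_item["embedding"]
print(len(embedding))

# The same subsetting pattern reaches other fields, such as token usage.
total_tokens = mock_response_dict["usage"]["total_tokens"]
print(total_tokens)
```

The pattern is always the same: index into lists by position, and into dictionaries by key.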

## Investigating the vector space

### Example: Embedding headlines

We'll be working with a dataset of news articles stored in a list of dictionaries:

```python
articles = [
    {"headline": "Economic Growth Continues Amid Global Uncertainty", "topic": "Business"},
    {"headline": "Interest rates fall to historic lows", "topic": "Business"},
    {"headline": "Scientists Make Breakthrough Discovery in Renewable Energy", "topic": "Science"},
    {"headline": "India Successfully Lands Near Moon's South Pole", "topic": "Science"},
    {"headline": "New Particle Discovered at CERN", "topic": "Science"},
    {"headline": "Tech Company Launches Innovative Product to Improve Online Accessibility", "topic": "Tech"},
    {"headline": "Tech Giant Buys 49% Stake In AI Startup", "topic": "Tech"},
    {"headline": "New Social Media Platform Has Everyone Talking!", "topic": "Tech"},
    {"headline": "The Blues get promoted on the final day of the season!", "topic": "Sport"},
    {"headline": "1.5 Billion Tune-in to the World Cup Final", "topic": "Sport"}
]
```

**Embedding multiple inputs**

We'll extract each article's headline using a list comprehension, accessing the headline key from each dictionary. 

```python
headline_text = [article["headline"] for article in articles]  # list comprehension
headline_text
```

Output: `["Economic Growth Continues Amid Global Uncertainty", ..., "1.5 Billion Tune-in to the World Cup Final"]`

To compute the embeddings, we can pass this entire list as an input to the create method. Batching the embeddings in this way is much more efficient than making API calls for each input.

__Response__

```python
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=headline_text
)

response_dict = response.model_dump()
```

The response output only differs from the single-input case in one way: where before, the list under the data key contained a single dictionary for the embeddings, in the multiple-input case, there is one dictionary for each input.

<img src='./images/multi-embedding_response.png' width=60% height=60%>
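
The batching described above works as-is for a short list like ours. For a much longer list, one option is to split it into chunks and send one request per chunk. The sketch below shows only the chunking step; `chunked` and the batch size of 4 are hypothetical choices for illustration, and no API call is made.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of items, each holding at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Ten placeholder strings standing in for headlines.
texts = [f"headline {i}" for i in range(10)]

batches = list(chunked(texts, batch_size=4))
for batch in batches:
    # In real use, each batch would be passed as `input` to client.embeddings.create.
    print(len(batch), batch[0])
```

Because the slices are taken in order, concatenating the per-batch results gives embeddings in the same order as the original list.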

**Extracting the embeddings**

To extract these embeddings from the response and store them in the articles list of dictionaries, we loop over the indexes and articles using enumerate. For each article, we assign the embedding at the same index in the response to the article's embedding key.

```python
for i, article in enumerate(articles):
    article["embedding"] = response_dict["data"][i]["embedding"]
```

__Note that each `article` is a dictionary inside the `articles` list, so we can assign the embedding to its `embedding` key in the same way we would assign any other value. The index `i` corresponds both to the position of the article in the `articles` list and to the position of its embedding in the `response_dict["data"]` list, so each embedding is assigned to the correct article.__
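
The pairing above relies on the embeddings being listed in the same order as the inputs. Each item under `data` also carries an `index` field, so a slightly more defensive variant can use that field instead of the loop counter. The sketch below runs on a mock response with made-up two-number vectors that are deliberately out of order; `mock_articles` and `mock_response_dict` are hypothetical names.

```python
mock_articles = [
    {"headline": "Interest rates fall to historic lows"},
    {"headline": "New Particle Discovered at CERN"},
]

# Mock response items, shuffled on purpose to show that the index field drives the pairing.
mock_response_dict = {
    "data": [
        {"index": 1, "embedding": [0.3, 0.4]},
        {"index": 0, "embedding": [0.1, 0.2]},
    ]
}

for item in mock_response_dict["data"]:
    # item["index"] is the position of the matching input in the original list.
    mock_articles[item["index"]]["embedding"] = item["embedding"]

print(mock_articles[0])
print(mock_articles[1])
```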

Let's print the first two articles:

```python
print(articles[:2])
```

<img src='./images/milti-embedded-articles-dict_first-two.png' width=60% height=60%>

We successfully created embeddings for multiple inputs!

Let's investigate these numbers more closely.

**How long is the embedding vector?**

* "Economic Growth Continues Amid Global Uncertainty"
```python
print(len(articles[0]["embedding"]))
```

Output: `1536`

For the first article in the list, the embedding model returns 1536 numbers representing the semantic meaning of its headline, or in other words, its _position_, or _vector_, in the _vector space_.

Another, longer headline:

* "Tech Company Launches Innovative Product to Improve Online Accessibility"

```python
print(len(articles[5]["embedding"]))
```
Output: `1536`

We get the same number again!

This is a key property of embedding models: for a given model, the output length is fixed no matter how long the input is. By default, `text-embedding-3-small` always returns 1536 numbers.

**Dimensionality reduction and t-SNE**

Let's visualize our embeddings to better understand the model's results.

We'll first need to reduce the number of dimensions from 1536 to something more manageable, like 2! There are lots of techniques for dimensionality _reduction_, but we'll use __t-SNE (t-distributed Stochastic Neighbor Embedding)__, which is a popular choice for visualizing high-dimensional data.

We'll implement t-SNE using scikit-learn, a popular Python library for machine learning tasks.

First, we import `TSNE` from `sklearn.manifold` and `numpy` as `np`:

```python
from sklearn.manifold import TSNE
import numpy as np
```

Next, we'll extract the embeddings from our articles list of dictionaries using a list comprehension. To implement t-SNE, we create a `TSNE` instance and assign it to the `tsne` variable. We specify two arguments:
* `n_components`: the number of dimensions we want to reduce to
* `perplexity`: used by the algorithm in the transformation. The default value is normally fine, but for smaller datasets, it must be reduced to a number less than the number of data points. Since we have 10 articles, we'll set it to 5.

```python
embeddings = np.array([article["embedding"] for article in articles])

tsne = TSNE(n_components=2, perplexity=5)
```
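
Since perplexity has to stay below the number of data points, one way to avoid hard-coding it is to derive it from the dataset size. The helper below is a hypothetical heuristic, not part of scikit-learn: it keeps scikit-learn's default of 30 for larger datasets and otherwise falls back to roughly half the number of points.

```python
def choose_perplexity(n_samples, default=30.0):
    """Return a perplexity strictly below n_samples (hypothetical heuristic)."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two data points")
    if n_samples > default:
        return default
    # For small datasets, use about half the number of points, but at least 1.
    return max(1.0, n_samples / 2)


print(choose_perplexity(10))    # 5.0, the same value we chose by hand for 10 articles
print(choose_perplexity(1000))  # 30.0
```

The result could then be passed as `perplexity=choose_perplexity(len(embeddings))` when creating the `TSNE` instance.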

Finally, to perform the t-SNE transformation, we call the `fit_transform` method on the `tsne` object, passing in the embeddings as a NumPy array. This returns the transformed embeddings in a NumPy array with `n_components` dimensions, which we can now visualize.

```python
transformed_embeddings = tsne.fit_transform(embeddings)
```

Although t-SNE is useful for exploring and visualizing higher dimensions, it will result in the loss of some information in the transformation, so it should be used with caution.

To visualize these transformed embeddings, we call `plt.scatter` from Matplotlib on the first and second columns of the `transformed_embeddings` array. We'll also include some code to extract the article topics, annotate the plot with them, and display the plot.

```python
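import matplotlib.pyplot as plt

# Plot the first t-SNE dimension against the second
plt.scatter(transformed_embeddings[:, 0], transformed_embeddings[:, 1])

# Label each point with its article's topic
topics = [article["topic"] for article in articles]
for i, topic in enumerate(topics):
    plt.annotate(topic, (transformed_embeddings[i, 0], transformed_embeddings[i, 1]))

plt.show()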
```

The full code is as follows:

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np
from openai import OpenAI

articles = [
    {"headline": "Economic Growth Continues Amid Global Uncertainty", "topic": "Business"},
    {"headline": "Interest rates fall to historic lows", "topic": "Business"},
    {"headline": "Scientists Make Breakthrough Discovery in Renewable Energy", "topic": "Science"},
    {"headline": "India Successfully Lands Near Moon's South Pole", "topic": "Science"},
    {"headline": "New Particle Discovered at CERN", "topic": "Science"},
    {"headline": "Tech Company Launches Innovative Product to Improve Online Accessibility", "topic": "Tech"},
    {"headline": "Tech Giant Buys 49% Stake In AI Startup", "topic": "Tech"},
    {"headline": "New Social Media Platform Has Everyone Talking!", "topic": "Tech"},
    {"headline": "The Blues get promoted on the final day of the season!", "topic": "Sport"},
    {"headline": "1.5 Billion Tune-in to the World Cup Final", "topic": "Sport"}
]

# Extract headlines with a list comprehension
headline_text = [article["headline"] for article in articles]

# Initialize OpenAI client
client = OpenAI(api_key="<OPENAI_API_KEY>")

# Generate embeddings for all headlines
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=headline_text
)

# Convert response to dictionary and store embeddings in articles
response_dict = response.model_dump()
for i, article in enumerate(articles):
    article["embedding"] = response_dict["data"][i]["embedding"]

# Convert embeddings to NumPy array for t-SNE
embeddings = np.array([article["embedding"] for article in articles])

# Initialize t-SNE with 2 components and perplexity of 5
tsne = TSNE(n_components=2, perplexity=5)

# Transform high-dimensional embeddings to 2D
transformed_embeddings = tsne.fit_transform(embeddings)

# Create scatter plot of transformed embeddings
plt.scatter(transformed_embeddings[:, 0], transformed_embeddings[:, 1])

# Add topic labels to the points
topics = [article["topic"] for article in articles]
for i, topic in enumerate(topics):
    plt.annotate(topic, (transformed_embeddings[i, 0], transformed_embeddings[i, 1]))

plt.show()
```

Here is the plot.

<img src= './images/scatter-of-transformed-embeddings.png' width=50% height=50%>

Notice that headlines with the same topic are clustered more closely together. In other words, the model captured the semantic meaning of the headlines and mapped them based on it.

### Exercise

You've been provided with a list of dictionaries called `products`, which contains product information for different products sold by an online retailer. It's your job to embed the `'short_description'` for each product to enable semantic search for the retailer's website.

Here's a preview of the products list of dictionaries:

```python
products = [
    {
        "title": "Smartphone X1",
        "short_description": "The latest flagship smartphone with AI-powered features and 5G connectivity.",
        "price": 799.99,
        "category": "Electronics",
        "features": [
            "6.5-inch AMOLED display",
            "Quad-camera system with 48MP main sensor",
            "Face recognition and fingerprint sensor",
            "Fast wireless charging"
        ]
    },
    ...
]
```

An OpenAI client has already been created and assigned to `client`.

__Instructions__

* Create a list called `product_descriptions` containing the `'short_description'` for each product in `products` using a list comprehension.
* Create embeddings for each product `'short_description'` using __batching__, passing the input to the `text-embedding-3-small` model.
* Extract the embeddings for each product from `response_dict` and store them in `products` under a new key called `'embedding'`.

```python
products = [
    {
        "title": "Smartphone X1",
        "short_description": "The latest flagship smartphone with AI-powered features and 5G connectivity.",
        "price": 799.99,
        "category": "Electronics",
        "features": [
            "6.5-inch AMOLED display",
            "Quad-camera system with 48MP main sensor",
            "Face recognition and fingerprint sensor",
            "Fast wireless charging"
        ]
    },
    ...
]

# Extract a list of product short descriptions from products
product_descriptions = [product['short_description'] for product in products]

# Create embeddings for each product description
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=product_descriptions
)
response_dict = response.model_dump()

# Extract the embeddings from response_dict and store in products
for i, product in enumerate(products):
    product['embedding'] = response_dict['data'][i]['embedding']

print(products[0].items())
```

* Create two lists by extracting information from `products` using list comprehensions: `categories`, containing the `'category'` of each product, and `embeddings`, containing the embedded short description.
* Reduce the number of embeddings dimensions from 1,536 to two using the `tsne` model provided.
* Create a scatter plot of the 2D embeddings, plotting the first column from `embeddings_2d` on the x-axis and the second column on the y-axis.

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np

# Create categories and embeddings lists using list comprehensions
categories = [product['category'] for product in products]
embeddings = [product['embedding'] for product in products]

# Reduce the number of embeddings dimensions to two using t-SNE
tsne = TSNE(n_components=2, perplexity=5)
embeddings_2d = tsne.fit_transform(np.array(embeddings))

# Create a scatter plot from embeddings_2d
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1])

# Label each point with its product category
for i, category in enumerate(categories):
    plt.annotate(category, (embeddings_2d[i, 0], embeddings_2d[i, 1]))

plt.show()
```

Output: 

<img src='./images/plot-of-exercise.png' width=50% height=50%>

## Text Similarity

Recall that embedding models map semantically similar texts more closely together in the vector space. This means that we can measure how semantically similar two pieces of text are by computing the distance between the vectors in the vector space.

### Measuring similarity

**Cosine distance**

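The cosine distance uses linear algebra to evaluate the similarity between two vectors: it is one minus the cosine of the angle between them, so it depends on the vectors' directions rather than their lengths.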

To compute the cosine distance between two vectors, we can use the `cosine` function from the `scipy.spatial.distance` module. First we import `distance` from `scipy.spatial` and then we compute the cosine distance between the vectors.

```python
from scipy.spatial import distance

distance.cosine(vector1, vector2)
```

Here, `vector1` and `vector2` are n-dimensional vectors. In 2D, a vector has the form (x, y), so `vector1` could be (0, 1) and `vector2` could be (1, 0). When passing them to the `cosine` function, we write them as lists, with __square brackets__ instead of parentheses.

```python
from scipy.spatial import distance

print(distance.cosine([0,1], [1,0]))
print(distance.cosine([0,1], [0,-1]))
print(distance.cosine([0,1], [0,1]))
```

Output: <br>
`1.0`<br>
`2.0`<br>
`0.0`


Note that the distance ranges from 0 to 2, where smaller numbers indicate higher similarity.
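
To see what the SciPy function computes, the cosine distance can be written out by hand as one minus the cosine of the angle between the two vectors. The following is a minimal pure-Python sketch for illustration only; `cosine_distance` is a hypothetical name, and it assumes neither vector is all zeros.

```python
import math


def cosine_distance(u, v):
    """Return 1 - (u . v) / (|u| * |v|) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)


print(cosine_distance([0, 1], [1, 0]))   # perpendicular vectors
print(cosine_distance([0, 1], [0, -1]))  # opposite directions
print(cosine_distance([0, 1], [0, 1]))   # same direction
```

For these three pairs the sketch gives 1.0, 2.0, and 0.0, which covers the middle and both ends of the 0 to 2 range.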

Let's try this on text embeddings.

To create embeddings in a more repeatable way, we'll define a custom function to send a request to the API, and extract and return embeddings from the response.

```python
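def create_embeddings(text):
    # Send the text (a string or a list of strings) to the Embeddings endpoint
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    response_dict = response.model_dump()

    # Return one embedding (a list of numbers) per input
    return [data['embedding'] for data in response_dict['data']]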
```

This function can be called on a single string or a list of strings, and always returns a list of lists.

To just return a single list of embeddings for the single string case, make sure to 0-index the function's result.

```python
print(create_embeddings(["Python is the best!", "R is the best"]))
print(create_embeddings("Datacamp is awesome!")[0])
```
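
The reason for the 0-index is that the function returns one vector per input, even when there is only one input. The sketch below mimics that shape with a stand-in that makes no API call; `fake_create_embeddings` and its two-number "vectors" are purely hypothetical.

```python
def fake_create_embeddings(text):
    """Stand-in for create_embeddings: returns one made-up vector per input string."""
    # Treat a single string as a list of one, so the output is always a list of lists.
    texts = [text] if isinstance(text, str) else list(text)
    # Each fake "embedding" is just [character count, word count].
    return [[float(len(t)), float(len(t.split()))] for t in texts]


many = fake_create_embeddings(["Python is the best!", "R is the best"])
single = fake_create_embeddings("DataCamp is awesome!")

print(len(many))   # two inputs, two vectors
print(single)      # still a list of lists, holding one vector
print(single[0])   # 0-indexing gives the vector itself
```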

First, we'll import `distance` from `scipy.spatial` for the cosine distance calculations, and NumPy to access its `argmin` function, which returns the index of the minimum value in an array.

```python
from scipy.spatial import distance
import numpy as np
```

Let's start with a piece of text to compare to our embedded headlines: "computer".<br>
We'll start by embedding this text using our `create_embeddings` custom function, remembering to 0-index the result.

```python
search_text = "computer"
search_embedding = create_embeddings(search_text)[0]
```

To find the most similar headline to this text, we'll loop over each article, calculating the cosine distance between each embedded headline and the embedded query. We start by creating an empty list to store the distances, and loop over each article in our articles list of dictionaries. Next, we calculate the cosine distance between the text and headline by calling `distance.cosine`, passing it the embedded text and headline. Finally, we append this distance to the distances list.

```python
distances = []
for article in articles:
    # Cosine distance between the query and this headline's embedding
    dist = distance.cosine(search_embedding, article["embedding"])
    distances.append(dist)
```

The most similar headline will have the smallest cosine distance, so we can use NumPy's `argmin` function to return the index of the smallest value in the distances list; then, use it to subset the article at this index and return its headline.

```python
min_dist_ind = np.argmin(distances)
print(f"Most similar headline: {articles[min_dist_ind]['headline']}")
```
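
`argmin` returns only the single closest match. To rank several matches, one option is to sort the indexes by distance instead. The sketch below uses made-up distances and placeholder headlines rather than real embeddings, and `top_k` is a hypothetical parameter.

```python
# Made-up cosine distances, one per headline; real values would come from distance.cosine.
headlines = ["Headline A", "Headline B", "Headline C", "Headline D"]
distances = [0.82, 0.41, 0.67, 0.55]

top_k = 2

# Sort the indexes by their distance, smallest (most similar) first.
ranked = sorted(range(len(distances)), key=lambda i: distances[i])

for i in ranked[:top_k]:
    print(f"{headlines[i]}: {distances[i]}")
```

The first entry of `ranked` is the same index that `argmin` would return, so this is a generalization of the approach above rather than a replacement.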


The full code is as follows:
```python
from scipy.spatial import distance
import numpy as np

# The articles list (with embeddings) and client are defined above.

def create_embeddings(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    response_dict = response.model_dump()

    return [data['embedding'] for data in response_dict['data']]

# Embed the search text, 0-indexing to get a single vector
search_text = "computer"
search_embedding = create_embeddings(search_text)[0]

# Compute the cosine distance from the query to every headline
distances = []
for article in articles:
    dist = distance.cosine(search_embedding, article["embedding"])
    distances.append(dist)

# The smallest distance marks the most similar headline
min_dist_ind = np.argmin(distances)
print(f"Most similar headline: {articles[min_dist_ind]['headline']}")
```

Output: <br>
`Most similar headline: Tech Company Launches Innovative Product to Improve Online Accessibility`