# Embeddings Part 3
# Going Through the Use Cases
Here we show some representative use cases. We will use the Amazon fine-food reviews dataset for the following examples.

## Obtaining the Embeddings
The dataset contains a total of 568,454 food reviews that Amazon users left up to October 2012. We will use a subset of the 1,000 most recent reviews for illustration purposes. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text). For example:

| Product ID | User ID | Score | Summary | Text |
|---|---|---|---|---|
| B001E4KFG0 | A3SGXH7AUHU8GW | 5 | Good Quality Dog Food | I have bought several of the Vitality canned... |
| B00813GRG4 | A1D87F6ZCVE5NK | 1 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |

We will combine the review summary and review text into a single combined text. The model will encode this combined text and output a single vector embedding.


In [32]:
# import our packages
import os
import pandas as pd
import numpy as np
from openai import OpenAI

In [33]:
# Create an instance of the OpenAI client
client = OpenAI()

In [34]:
# utility function to get the file size
def get_file_size(file_path):
    """ Returns the size of the file in megabytes. """
    size_bytes = os.path.getsize(file_path)
    size_mb = size_bytes / (1024 * 1024)  # Convert from bytes to megabytes
    return size_mb

# main utility function to get the embeddings
def get_embedding(text, model="text-embedding-3-small"):
    # Replace newlines in the text with spaces for consistent formatting
    text = text.replace("\n", " ")
    # Request the embedding for the cleaned text and return the embedding
    return client.embeddings.create(input=[text], model=model).data[0].embedding

# utility function to get the embeddings with reduced dimensions
def get_embedding_reduced_dims(text, model="text-embedding-3-large"):
    # Replace newlines in the text with spaces for consistent formatting
    text = text.replace("\n", " ")
    # Request a 1024-dimensional embedding for the cleaned text and return it
    return client.embeddings.create(input=[text], model=model, dimensions=1024).data[0].embedding

In [35]:

# Define the path to the input data file
input_data_path = "./EmbeddingsDemoAssets/fine_food_reviews_1k.csv"

# Read the CSV file using pandas and set the first column as the index
df = pd.read_csv(input_data_path, index_col=0)

# Select only the relevant columns from the dataframe
df = df[["Time", "ProductId", "UserId", "Score", "Summary", "Text"]]

# Drop rows with any missing values
df = df.dropna()

# Drop duplicate rows
df = df.drop_duplicates()

# Combine the 'Summary' and 'Text' columns into a new column with a formatted string
df["combined"] = (
    "Title: " + df.Summary.str.strip() + "; Content: " + df.Text.str.strip()
)

# Display the first 5 rows of the modified dataframe
df.head(5)


Unnamed: 0,Time,ProductId,UserId,Score,Summary,Text,combined
0,1351123200,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...
1,1351123200,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, mos...",Title: Arrived in pieces; Content: Not pleased...
2,1351123200,B000JMBE7M,AQX1N6A51QOKG,4,"It isn't blanc mange, but isn't bad . . .",I'm not sure that custard is really custard wi...,"Title: It isn't blanc mange, but isn't bad . ...."
3,1351123200,B004AHGBX4,A2UY46X0OSNVUQ,3,These also have SALT and it's not sea salt.,I like the fact that you can see what you're g...,Title: These also have SALT and it's not sea s...
4,1351123200,B001BORBHO,A1AFOYZ9HSM2CZ,5,Happy with the product,My dog was suffering with itchy skin. He had ...,Title: Happy with the product; Content: My dog...


In [36]:

# Apply the `get_embedding` function to each entry in the 'combined' column and store the results
df['embedding'] = df['combined'].apply(lambda x: get_embedding(x))

# Save the dataframe with embeddings to a CSV file, omitting the index
df.to_csv('./EmbeddingsDemoAssets/fine_food_reviews_with_embeddings_1k.csv', index=False)

# Display the first 5 rows of the modified dataframe
df.head(5)


Unnamed: 0,Time,ProductId,UserId,Score,Summary,Text,combined,embedding
0,1351123200,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...,"[0.0359942801296711, -0.02117965929210186, -0...."
1,1351123200,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, mos...",Title: Arrived in pieces; Content: Not pleased...,"[0.01144526805728674, 0.0342588946223259, -0.0..."
2,1351123200,B000JMBE7M,AQX1N6A51QOKG,4,"It isn't blanc mange, but isn't bad . . .",I'm not sure that custard is really custard wi...,"Title: It isn't blanc mange, but isn't bad . ....","[0.003300713375210762, 0.012644506990909576, -..."
3,1351123200,B004AHGBX4,A2UY46X0OSNVUQ,3,These also have SALT and it's not sea salt.,I like the fact that you can see what you're g...,Title: These also have SALT and it's not sea s...,"[-0.002887785667553544, 0.014631008729338646, ..."
4,1351123200,B001BORBHO,A1AFOYZ9HSM2CZ,5,Happy with the product,My dog was suffering with itchy skin. He had ...,Title: Happy with the product; Content: My dog...,"[0.012054475955665112, -0.05606513097882271, 0..."
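
The cell above sends one API request per review. The embeddings endpoint also accepts a list of strings as `input`, so the same work can be done in far fewer requests. The following is a minimal sketch under assumptions: `batched`, `embed_in_batches` and `openai_embed_batch` are hypothetical helper names, the batch size of 100 is an arbitrary choice, and `embed_batch` stands for any callable that maps a list of texts to a list of vectors in the same order.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,  # arbitrary value chosen for illustration
) -> list[list[float]]:
    """Embed `texts` batch by batch, preserving the input order."""
    embeddings: list[list[float]] = []
    for chunk in batched(texts, batch_size):
        # Same newline cleanup as get_embedding
        cleaned = [t.replace("\n", " ") for t in chunk]
        embeddings.extend(embed_batch(cleaned))
    return embeddings


def openai_embed_batch(client, texts: list[str], model: str = "text-embedding-3-small"):
    """Embed many texts in one request; `client` is an OpenAI() instance."""
    response = client.embeddings.create(input=texts, model=model)
    # Sort by the returned index so results line up with the inputs
    return [item.embedding for item in sorted(response.data, key=lambda d: d.index)]
```

With the notebook's `client` and `df`, this could be used as `df['embedding'] = embed_in_batches(df['combined'].tolist(), lambda batch: openai_embed_batch(client, batch))`.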


In [37]:
# Get the size of the data file
file_path = './EmbeddingsDemoAssets/fine_food_reviews_with_embeddings_1k.csv'
size_mb = get_file_size(file_path)
print(f"The size of the file is {size_mb:.2f} MB.")

The size of the file is 33.43 MB.


### Loading the Data
Here we show how to load the saved embeddings from the CSV file back into a dataframe, so they can be reused without calling the API again.

In [38]:

import ast

# Load the CSV file into a pandas DataFrame
df_load = pd.read_csv('./EmbeddingsDemoAssets/fine_food_reviews_with_embeddings_1k.csv')

# The CSV stores each embedding as a string such as "[0.0359, -0.0211, ...]".
# Parse it with ast.literal_eval (which only accepts Python literals, unlike eval)
# and convert it to a numpy array so the data is ready for any subsequent analysis.
df_load['embedding'] = df_load['embedding'].apply(ast.literal_eval).apply(np.array)

# Display the first 5 rows of the loaded dataframe
df_load.head(5)

Unnamed: 0,Time,ProductId,UserId,Score,Summary,Text,combined,embedding
0,1351123200,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...,"[0.0359942801296711, -0.02117965929210186, -0...."
1,1351123200,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, mos...",Title: Arrived in pieces; Content: Not pleased...,"[0.01144526805728674, 0.0342588946223259, -0.0..."
2,1351123200,B000JMBE7M,AQX1N6A51QOKG,4,"It isn't blanc mange, but isn't bad . . .",I'm not sure that custard is really custard wi...,"Title: It isn't blanc mange, but isn't bad . ....","[0.003300713375210762, 0.012644506990909576, -..."
3,1351123200,B004AHGBX4,A2UY46X0OSNVUQ,3,These also have SALT and it's not sea salt.,I like the fact that you can see what you're g...,Title: These also have SALT and it's not sea s...,"[-0.002887785667553544, 0.014631008729338646, ..."
4,1351123200,B001BORBHO,A1AFOYZ9HSM2CZ,5,Happy with the product,My dog was suffering with itchy skin. He had ...,Title: Happy with the product; Content: My dog...,"[0.012054475955665112, -0.05606513097882271, 0..."


### Reducing the Dimensions
Using larger embeddings, for example storing them in a vector store for retrieval, generally costs more and consumes more compute, memory and storage than using smaller embeddings.

Both text-embedding-3-small and text-embedding-3-large were trained with a technique that lets developers trade off performance against the cost of using embeddings. Specifically, developers can shorten embeddings (i.e. remove some numbers from the end of the sequence) without the embedding losing its concept-representing properties. The API exposes this through the `dimensions` parameter, which the `get_embedding_reduced_dims` helper sets to 1024.
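
To make the shortening concrete, here is a minimal sketch of doing it by hand: keep the first `k` values of a full-length vector and rescale the result to unit length so that cosine and dot-product comparisons stay meaningful. The helper name `shorten_embedding` and the toy vector are illustrative assumptions; the notebook itself asks the API for shortened vectors directly via `dimensions=1024`.

```python
import numpy as np


def shorten_embedding(embedding, k: int) -> np.ndarray:
    """Keep the first k components and L2-normalize the result."""
    vec = np.asarray(embedding, dtype=float)
    if not 0 < k <= vec.shape[0]:
        raise ValueError("k must be between 1 and the embedding length")
    truncated = vec[:k]
    norm = np.linalg.norm(truncated)
    # Guard against an all-zero prefix to avoid dividing by zero
    return truncated if norm == 0 else truncated / norm


# Toy 6-dimensional vector standing in for a real embedding
full = np.array([0.5, -0.5, 0.5, 0.5, 0.1, -0.1])
short = shorten_embedding(full, 4)
print(short, np.linalg.norm(short))
```

Renormalizing matters because a truncated vector no longer has unit length, which would skew any similarity measure that relies on plain dot products.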

In [39]:

# Define the path to the input data file
input_data_path = "./EmbeddingsDemoAssets/fine_food_reviews_1k.csv"

# Read the CSV file using pandas and set the first column as the index
df_reduced_dims = pd.read_csv(input_data_path, index_col=0)

# Select only the relevant columns from the dataframe
df_reduced_dims = df_reduced_dims[["Time", "ProductId", "UserId", "Score", "Summary", "Text"]]

# Drop rows with any missing values
df_reduced_dims = df_reduced_dims.dropna()

# Drop duplicate rows
df_reduced_dims = df_reduced_dims.drop_duplicates()

# Combine the 'Summary' and 'Text' columns into a new column with a formatted string
df_reduced_dims["combined"] = (
    "Title: " + df_reduced_dims.Summary.str.strip() + "; Content: " + df_reduced_dims.Text.str.strip()
)

# Display the first 5 rows of the modified dataframe
df_reduced_dims.head(5)

Unnamed: 0,Time,ProductId,UserId,Score,Summary,Text,combined
0,1351123200,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...
1,1351123200,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, mos...",Title: Arrived in pieces; Content: Not pleased...
2,1351123200,B000JMBE7M,AQX1N6A51QOKG,4,"It isn't blanc mange, but isn't bad . . .",I'm not sure that custard is really custard wi...,"Title: It isn't blanc mange, but isn't bad . ...."
3,1351123200,B004AHGBX4,A2UY46X0OSNVUQ,3,These also have SALT and it's not sea salt.,I like the fact that you can see what you're g...,Title: These also have SALT and it's not sea s...
4,1351123200,B001BORBHO,A1AFOYZ9HSM2CZ,5,Happy with the product,My dog was suffering with itchy skin. He had ...,Title: Happy with the product; Content: My dog...


In [40]:

# Apply the `get_embedding_reduced_dims` function to each entry in the 'combined' column and store the results
df_reduced_dims['embedding'] = df_reduced_dims['combined'].apply(lambda x: get_embedding_reduced_dims(x))

# Save the dataframe with embeddings to a CSV file, omitting the index
df_reduced_dims.to_csv('./EmbeddingsDemoAssets/fine_food_reviews_with_embeddings_reduced_dims_1k.csv', index=False)

# Display the first 5 rows of the modified dataframe
df_reduced_dims.head(5)


Unnamed: 0,Time,ProductId,UserId,Score,Summary,Text,combined,embedding
0,1351123200,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...,"[-0.01375417411327362, 0.007851662114262581, -..."
1,1351123200,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, mos...",Title: Arrived in pieces; Content: Not pleased...,"[0.009859263896942139, -0.03595048189163208, 0..."
2,1351123200,B000JMBE7M,AQX1N6A51QOKG,4,"It isn't blanc mange, but isn't bad . . .",I'm not sure that custard is really custard wi...,"Title: It isn't blanc mange, but isn't bad . ....","[-0.003976356703788042, 0.007836353965103626, ..."
3,1351123200,B004AHGBX4,A2UY46X0OSNVUQ,3,These also have SALT and it's not sea salt.,I like the fact that you can see what you're g...,Title: These also have SALT and it's not sea s...,"[0.016051892191171646, -0.01706559769809246, -..."
4,1351123200,B001BORBHO,A1AFOYZ9HSM2CZ,5,Happy with the product,My dog was suffering with itchy skin. He had ...,Title: Happy with the product; Content: My dog...,"[-0.01650797203183174, 0.015522805042564869, 0..."


In [41]:

# Get the size of the data file
file_path = './EmbeddingsDemoAssets/fine_food_reviews_with_embeddings_reduced_dims_1k.csv'
size_mb = get_file_size(file_path)
print(f"The size of the file is {size_mb:.2f} MB.")

The size of the file is 22.46 MB.
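
The two file sizes are consistent with the dimension counts: 22.46 / 33.43 ≈ 0.67, which matches 1024 / 1536 ≈ 0.67 (1536 being the default output size of `text-embedding-3-small`). CSV stores each float as text of around twenty characters, so a binary format would be much smaller. The sketch below estimates the raw size of the same vectors as 32-bit floats. The row count of 1,000 is an assumption (the notebook drops missing and duplicate rows, so the real count may be slightly lower), and `raw_size_mb` is a hypothetical helper.

```python
def raw_size_mb(n_rows: int, n_dims: int, bytes_per_value: int = 4) -> float:
    """Size in megabytes of n_rows vectors of n_dims values stored as binary floats."""
    return n_rows * n_dims * bytes_per_value / (1024 * 1024)


n_rows = 1000  # assumed; the cleaned dataframe may hold slightly fewer rows
for label, dims in [
    ("text-embedding-3-small (default 1536 dims)", 1536),
    ("text-embedding-3-large (dimensions=1024)", 1024),
]:
    print(f"{label}: {raw_size_mb(n_rows, dims):.2f} MB as float32")
```

Saving the stacked vectors with `np.save` instead of CSV should land close to these figures, at the cost of keeping the text columns in a separate file.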


## Question Answering Using Embeddings-Based Search
In many common cases the model was not trained on data containing the key facts you want it to draw on when responding to a user query. One way to solve this is to put the additional information directly into the model's context window. This is effective in many use cases but leads to higher token costs. Another way is retrieval-augmented generation (RAG): embed the query, use embeddings-based search to find only the most relevant passages, and pass just those to the model.
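
As a rough sketch of the retrieval-based approach, the function below assembles a prompt from retrieved passages until a size budget is reached. It assumes the passages are already sorted from most to least related. The name `build_prompt`, the prompt wording and the character budget (a crude stand-in for a token budget) are all illustrative assumptions.

```python
def build_prompt(question: str, passages: list[str], max_chars: int = 2000) -> str:
    """Build a prompt containing as many top-ranked passages as fit in max_chars."""
    header = "Use the reviews below to answer the question.\n"
    footer = f"\nQuestion: {question}"
    body = ""
    for passage in passages:
        entry = f"\nReview:\n{passage}\n"
        # Stop before the prompt would exceed the budget
        if len(header) + len(body) + len(entry) + len(footer) > max_chars:
            break
        body += entry
    return header + body + footer


# Toy passages standing in for search results
prompt = build_prompt(
    "Do dogs like this treat?",
    ["Title: Great; Content: My dog loves it.", "Title: Meh; Content: My dog ignored it."],
    max_chars=200,
)
print(prompt)
```

Because only the best-ranked passages are included, the prompt stays short even when the underlying collection is large, which is what keeps token costs down compared with pasting everything into the context window.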

In [42]:

import ast

# Load the CSV file into a pandas DataFrame
df_search = pd.read_csv('./EmbeddingsDemoAssets/fine_food_reviews_with_embeddings_reduced_dims_1k.csv')

# The CSV stores each embedding as a string such as "[-0.0137, 0.0078, ...]".
# Parse it with ast.literal_eval (which only accepts Python literals, unlike eval)
# and convert it to a numpy array so the data is ready for the search below.
df_search['embedding'] = df_search['embedding'].apply(ast.literal_eval).apply(np.array)

# Display the first 5 rows of the loaded dataframe
df_search.head(5)

Unnamed: 0,Time,ProductId,UserId,Score,Summary,Text,combined,embedding
0,1351123200,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...,"[-0.01375417411327362, 0.007851662114262581, -..."
1,1351123200,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, mos...",Title: Arrived in pieces; Content: Not pleased...,"[0.009859263896942139, -0.03595048189163208, 0..."
2,1351123200,B000JMBE7M,AQX1N6A51QOKG,4,"It isn't blanc mange, but isn't bad . . .",I'm not sure that custard is really custard wi...,"Title: It isn't blanc mange, but isn't bad . ....","[-0.003976356703788042, 0.007836353965103626, ..."
3,1351123200,B004AHGBX4,A2UY46X0OSNVUQ,3,These also have SALT and it's not sea salt.,I like the fact that you can see what you're g...,Title: These also have SALT and it's not sea s...,"[0.016051892191171646, -0.01706559769809246, -..."
4,1351123200,B001BORBHO,A1AFOYZ9HSM2CZ,5,Happy with the product,My dog was suffering with itchy skin. He had ...,Title: Happy with the product; Content: My dog...,"[-0.01650797203183174, 0.015522805042564869, 0..."


In [48]:
from scipy.spatial import distance
import pandas as pd

# Search function
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - distance.cosine(x, y),
    top_n: int = 100,
    threshold: float = 0.001  # Minimum score gap for two results to count as different
) -> tuple[tuple[str, ...], tuple[float, ...]]:
    """Returns strings and relatednesses, sorted from most related to least."""
    # Embed the query with the same model and dimensions as the stored embeddings
    query_embedding_response = client.embeddings.create(
        model="text-embedding-3-large",
        input=query,
        dimensions=1024,
    )
    query_embedding = query_embedding_response.data[0].embedding
    strings_and_relatednesses = [
        (row["combined"], relatedness_fn(query_embedding, row["embedding"]))
        for _, row in df.iterrows()
    ]
    # Sort by relatedness, highest first
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    # Skip results whose score is within `threshold` of the previously kept one;
    # such near-identical scores are treated here as duplicates
    filtered = []
    last_score = None
    for item in strings_and_relatednesses:
        if last_score is None or abs(item[1] - last_score) > threshold:
            filtered.append(item)
            last_score = item[1]
        if len(filtered) >= top_n:
            break
    # zip(*[]) cannot be unpacked into two names, so handle the empty case explicitly
    if not filtered:
        return (), ()
    strings, relatednesses = zip(*filtered)
    return strings, relatednesses
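
The function above computes one cosine similarity per row inside a Python loop. For larger tables the same ranking can be computed in a single matrix operation. This is a sketch under assumptions: `rank_by_cosine` is a hypothetical helper, it takes the query vector as an argument instead of calling the API, and it omits the score-threshold filter.

```python
import numpy as np


def rank_by_cosine(query_vec, doc_vecs, top_n: int = 5):
    """Return (indices, scores) of the top_n rows of doc_vecs most similar to query_vec."""
    q = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_vecs, dtype=float)
    # Cosine similarity = dot product divided by the product of the vector norms
    scores = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    # Negate so that argsort returns the highest scores first
    order = np.argsort(-scores)[:top_n]
    return order, scores[order]


# Toy 2-D vectors standing in for real embeddings
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
indices, scores = rank_by_cosine([1.0, 0.0], docs, top_n=2)
print(indices, scores)
```

With the notebook's data, `doc_vecs` could be built with `np.vstack(df_search['embedding'].to_numpy())`, and the returned indices would select the matching rows via `df_search.iloc[indices]`.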


In [49]:
# do a search for the top 5 related items
strings, relatednesses = strings_ranked_by_relatedness("dogs dislike product", df_search, top_n=5)
for string, relatedness in zip(strings, relatednesses):
    print(f"\n========================\n{relatedness=:.3f}")
    display(string)


relatedness=0.573


'Title: My Chihuahuas HATE This Product; Content: My two Chihuahuas loved "Cadet Gourmet Pet Treats Duck Breast Fillets", the 32-Ounce package I had been buying at Costco.  However, I discovered it was made in China and feared this product could make them sick or cause death.  So I started searching for a company in the USA that made dog jerky treats.  I found Plato Natural Duck Strips and ordered a bag.  This product does not look like or smell like duck jerky strips.  My Chihuahuas sniffed it and just walked away.  It doesn\'t even smell like duck.  Rather, it smells somewhat fishy.  A waste of my money!'


relatedness=0.545


'Title: Good product; Content: This is a good product but our dogs did not seem to enjoy this as much as Holistic Select or Merrick.'


relatedness=0.509


'Title: Triggered strange vomit response to my dog; Content: I can\'t deny that these smell amazing- all the fruitables that I ordered smelled like human-counterpart treats...blueberries, yum.<br /><br />I was excited to give my dog a "real" treat, and I have him one of this and one of another fruitables soft chew. A few hours later, we were shocked at the dog spitting up everything in this stomach- he threw up about 3 times, and then continued to spit up mucous and bile fluid. Gross, and really unfortunate :(<br /><br />I contacted both fruitables and amazon, who both graciously accepted the open bags as returns, and fruitables checked the lot numbers of the bags I had. I guess my dog had a particular sensitivity to something in these, which stinks because I really wanted him to love these nice smelling treats.<br /><br />Hopefully others have better luck with them!'


relatedness=0.490


'Title: Not sure how to feel; Content: So, my dog likes it. I\'ll just get that out there right away.<br />But, it\'s awkward to carry and can be messy (which I didn\'t think it would be) but since you have to sometimes give it a little encouragement to get anything to come out, sometimes it just decides "Oh you want ALL of it!" and I have a sticky snouted puppy.<br />And, she has since lost interest in it and prefers treats she can smell better.'


relatedness=0.487


"Title: My Dog's Favorite Snack!; Content: I was first introduced to this snack at my dog's training classes at petco. He really enjoys them! The pieces are really tiny but you can tell that it is made from quality ingredients (it also smells really bad), but my dog loves it. I highly recommend this product as it is a quality product that is actually healthy for your pup!"