## Get embeddings

+ The function `get_embedding` returns an embedding for an input text.
+ This section shows two ways to create text embeddings with OpenAI's API: a direct call and a function wrapped in retry logic.
+ Both approaches use the same model (`text-embedding-ada-002`) to transform the text into a numerical vector that represents its semantic content.

In [2]:
import os

import openai
from tenacity import retry, wait_random_exponential, stop_after_attempt

# Read the OpenAI API key from the environment
# Note: openai.Embedding.create (used below) is the pre-1.0 interface of the openai package
openai.api_key = os.getenv("OPENAI_API_KEY")

In [3]:
embedding = openai.Embedding.create(
    input="Toy Story (1995)", model="text-embedding-ada-002"
)["data"][0]["embedding"]
len(embedding)

1536

In [4]:
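# Retry with random exponential backoff (waits capped at 20 s), giving up after 6 attempts
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def get_embedding(text: str, model="text-embedding-ada-002") -> list[float]:
    """Return the embedding of `text` as a list of floats."""
    return openai.Embedding.create(input=[text], model=model)["data"][0]["embedding"]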


embedding = get_embedding("Your text goes here", model="text-embedding-ada-002")
print(len(embedding))


1536
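
An embedding becomes useful once it is compared with another one. As a minimal sketch (not part of the original notebook), cosine similarity between two vectors can be computed as follows. The helper name `cosine_similarity` and the toy 3-dimensional vectors are illustrative assumptions; real `text-embedding-ada-002` vectors have 1536 dimensions. `scipy.spatial.distance.cosine`, which is imported in a later cell, returns the corresponding distance, that is, 1 minus this value.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (assumption)
vec_a = [1.0, 0.0, 0.0]
vec_b = [0.0, 1.0, 0.0]
print(cosine_similarity(vec_a, vec_a))  # same direction
print(cosine_similarity(vec_a, vec_b))  # orthogonal directions
```

With real data, the two arguments would be the lists returned by `get_embedding` for two different texts.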


In [10]:
import os
import pandas as pd
import openai
from scipy.spatial import distance
import plotly.express as px
from sklearn.cluster import KMeans
from umap.umap_ import UMAP

# Data Overview

In [11]:
# Read the dataset
dataset_path = "../data/ml-latest-small/merged_data.csv"
movie_data = pd.read_csv(dataset_path)
movie_data.info()
movie_data.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3476 entries, 0 to 3475
Data columns (total 8 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  3476 non-null   int64  
 1   imdbId   3476 non-null   int64  
 2   tmdbId   3476 non-null   float64
 3   title    3476 non-null   object 
 4   genres   3476 non-null   object 
 5   userId   3476 non-null   int64  
 6   rating   3476 non-null   float64
 7   tag      3476 non-null   object 
dtypes: float64(2), int64(3), object(3)
memory usage: 217.4+ KB


|   | movieId | imdbId | tmdbId | title | genres | userId | rating | tag |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 114709 | 862.0 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 336 | 4.0 | pixar |
| 1 | 1 | 114709 | 862.0 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 474 | 4.0 | pixar |
| 2 | 1 | 114709 | 862.0 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 567 | 3.5 | fun |


In [12]:
# Note: this redefinition replaces the retry-wrapped get_embedding defined earlier
def get_embedding(text_to_embed):
    # Embed a line of text
    response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=[text_to_embed],
    )
    # Extract the embedding from the response as a list of floats
    embedding = response["data"][0]["embedding"]

    return embedding

In [17]:
title_df = movie_data[['title']]
print("Data shape: {}".format(title_df.shape))
display(title_df.head())

Data shape: (3476, 1)


|   | title |
|---|---|
| 0 | Toy Story (1995) |
| 1 | Toy Story (1995) |
| 2 | Toy Story (1995) |
| 3 | Jumanji (1995) |
| 4 | Jumanji (1995) |


In [19]:
# title_df = title_df.sample(100)
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
title_df = title_df.assign(
    embedding=title_df["title"].astype(str).apply(get_embedding)
)

# Make the index start from 0 (reset_index returns a new DataFrame, so reassign it)
title_df = title_df.reset_index(drop=True)

title_df.head(10)


|   | title | embedding |
|---|---|---|
| 0 | Toy Story (1995) | [-0.001893474836833775, -0.037952445447444916,... |
| 1 | Toy Story (1995) | [-0.0019398860167711973, -0.037929877638816833... |
| 2 | Toy Story (1995) | [-0.001885014702565968, -0.037917181849479675,... |
| 3 | Jumanji (1995) | [8.057472587097436e-05, -0.023652777075767517,... |
| 4 | Jumanji (1995) | [0.00010395599383627996, -0.023625079542398453... |
| 5 | Jumanji (1995) | [3.986759838880971e-05, -0.02368870936334133, ... |
| 6 | Jumanji (1995) | [0.00010395599383627996, -0.023625079542398453... |
| 7 | Grumpier Old Men (1995) | [0.007774407975375652, -0.046490442007780075, ... |
| 8 | Grumpier Old Men (1995) | [0.007800175808370113, -0.046567048877477646, ... |
| 9 | Father of the Bride Part II (1995) | [0.010245592333376408, -0.03074953705072403, -... |
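
The output above shows that every row is embedded separately, so `Toy Story (1995)` is sent to the API three times. One possible way to reduce the number of calls, sketched here under assumptions, is to embed each distinct string once and reuse the result. `embed_unique` and `fake_embed` are hypothetical names; `fake_embed` is a stand-in for `get_embedding` so the sketch runs without an API key.

```python
from typing import Callable


def embed_unique(
    texts: list[str], embed_fn: Callable[[str], list[float]]
) -> list[list[float]]:
    """Embed each distinct text once and return one embedding per input text."""
    cache: dict[str, list[float]] = {}
    for text in texts:
        if text not in cache:
            cache[text] = embed_fn(text)
    return [cache[text] for text in texts]


calls = []


def fake_embed(text: str) -> list[float]:
    """Stand-in for get_embedding (assumption): records the call, returns a dummy vector."""
    calls.append(text)
    return [float(len(text))]


titles = ["Toy Story (1995)", "Toy Story (1995)", "Jumanji (1995)"]
embeddings = embed_unique(titles, fake_embed)
print(len(embeddings), "embeddings from", len(calls), "calls")
```

In the notebook, the inputs would be `title_df["title"].astype(str).tolist()` and `get_embedding`. Note that rows with the same title then share one list object, which is fine as long as the embeddings are only read.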


In [21]:
# Cluster the title data
kmeans = KMeans(n_clusters=3, n_init=10)
kmeans.fit(title_df["embedding"].tolist())
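
After fitting, `kmeans.labels_` holds one cluster index per row. To see what each cluster contains, the labels can be grouped back onto the titles. The sketch below is one assumed way to inspect the result; `titles_by_cluster` is a hypothetical helper, and the labels are made up for illustration rather than taken from a real run.

```python
from collections import defaultdict


def titles_by_cluster(titles, labels):
    """Map each cluster label to the sorted distinct titles assigned to it."""
    if len(titles) != len(labels):
        raise ValueError("titles and labels must have the same length")
    groups = defaultdict(set)
    for title, label in zip(titles, labels):
        groups[int(label)].add(title)
    return {label: sorted(members) for label, members in sorted(groups.items())}


# Illustrative inputs (assumption): in the notebook these would be
# title_df["title"].tolist() and kmeans.labels_
example_titles = [
    "Toy Story (1995)",
    "Toy Story (1995)",
    "Jumanji (1995)",
    "Grumpier Old Men (1995)",
]
example_labels = [0, 0, 0, 1]
for label, members in titles_by_cluster(example_titles, example_labels).items():
    print(label, members)
```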

In [None]:
# Reduce the title embeddings to 2 dimensions for plotting
reducer = UMAP()
embeddings_2d = reducer.fit_transform(title_df["embedding"].tolist())

In [None]:
# Visualize the clusters (labels cast to str so each cluster gets a discrete color)
fig = px.scatter(
    x=embeddings_2d[:, 0], y=embeddings_2d[:, 1], color=kmeans.labels_.astype(str)
)
fig.show()

# References

+ https://www.datacamp.com/tutorial/introduction-to-text-embeddings-with-the-open-ai-api