## Regression using the embeddings

Regression means predicting a number rather than a category. Here we predict a movie's rating from the embedding of its title. For each approach below, we split the dataset into a training set and a test set so that performance is evaluated on rows the model has not seen during training.
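As a small illustration of this evaluation setup, the sketch below splits a toy dataset and computes the two error metrics used in this section. All values are made up for illustration and are not taken from the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Toy features and ratings (hypothetical values, for illustration only)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 4.0, 1.5, 5.0, 3.5])

# Hold out 20% of the rows; a fixed random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Metrics on two hand-written predictions
y_true = np.array([4.0, 3.0])
y_pred = np.array([3.5, 4.0])
mse_toy = mean_squared_error(y_true, y_pred)   # mean of squared errors
mae_toy = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
print(f"mse={mse_toy:.3f}, mae={mae_toy:.3f}")
```

MSE penalizes large errors more heavily than MAE, which is why both are reported.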

In [2]:
import pandas as pd
import numpy as np
import os
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
import openai
from tenacity import retry, wait_random_exponential, stop_after_attempt
from sklearn.feature_extraction.text import TfidfVectorizer

# Set API Key
openai.api_key = os.getenv("OPENAI_API_KEY")


In [3]:
datafile_path = "../data/ml-latest-small/merged_data.csv"
df = pd.read_csv(datafile_path)
df.head(3)

Unnamed: 0,movieId,imdbId,tmdbId,title,genres,userId,rating,tag
0,1,114709,862.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,336,4.0,pixar
1,1,114709,862.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,474,4.0,pixar
2,1,114709,862.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,567,3.5,fun


### Traditional Approach

We use TF-IDF to turn the movie titles into sparse feature vectors, which stand in for embeddings here, and train a RandomForestRegressor model on them.
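To see what the vectorizer produces, here is a minimal sketch on three titles written in the style of the dataset. It assumes scikit-learn's default `TfidfVectorizer` settings, which lowercase the text and keep only tokens of two or more characters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three titles in the style of the dataset (illustrative only)
titles = ["Toy Story (1995)", "Toy Story 2 (1999)", "Heat (1995)"]

vec = TfidfVectorizer()
X_small = vec.fit_transform(titles)  # sparse matrix, one row per title

# The learned vocabulary: one column per distinct token
vocab = sorted(vec.vocabulary_)
print(X_small.shape)
print(vocab)
```

Each title becomes a row of token weights, so two titles look similar only when they share words. Nothing in this representation captures meaning beyond token overlap.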

In [4]:
# TF-IDF embeddings
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['title'])

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, df['rating'], test_size=0.2, random_state=42)

# Train RandomForest
rfr = RandomForestRegressor(n_estimators=100)
rfr.fit(X_train, y_train)

# Predict and evaluate
preds = rfr.predict(X_test)
mse = mean_squared_error(y_test, preds)
mae = mean_absolute_error(y_test, preds)

print(f"TF-IDF embedding performance: mse={mse:.2f}, mae={mae:.2f}")

TF-IDF embedding performance: mse=0.34, mae=0.32
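To put error values like these in context, it can help to compare against a trivial baseline that always predicts the mean training rating. The sketch below shows the idea on made-up numbers using scikit-learn's `DummyRegressor`. The toy arrays are assumptions for illustration, and the notebook itself does not run this baseline.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Hypothetical ratings, for illustration only
y_train_toy = np.array([4.0, 3.0, 5.0, 4.0])
y_test_toy = np.array([3.0, 5.0])

# DummyRegressor ignores the features, so placeholder zeros are enough
baseline = DummyRegressor(strategy="mean")
baseline.fit(np.zeros((4, 1)), y_train_toy)
base_preds = baseline.predict(np.zeros((2, 1)))  # every prediction is the training mean

base_mse = mean_squared_error(y_test_toy, base_preds)
base_mae = mean_absolute_error(y_test_toy, base_preds)
print(f"Mean baseline: mse={base_mse:.2f}, mae={base_mae:.2f}")
```

A model is only useful to the extent that its errors fall below such a baseline computed on the same split.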


### OpenAI API Approach

We use OpenAI's embedding API to generate an embedding for each movie title and train a second RandomForestRegressor model on those embeddings.

In [5]:
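# Note: openai.Embedding.create is the interface of the pre-1.0 `openai` Python SDK.
# Retry with randomized exponential backoff to ride out rate limits and transient errors.
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def get_embeddings(texts: list[str], model="text-embedding-ada-002") -> list[list[float]]:
    response = openai.Embedding.create(input=texts, model=model)
    return [item["embedding"] for item in response["data"]]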

# Get embeddings in batches
batch_size = 100  # Number of titles sent per API request
embeddings = []

for i in range(0, len(df['title']), batch_size):
    batch_texts = df['title'].iloc[i:i+batch_size].tolist()
    embeddings.extend(get_embeddings(batch_texts))

X_openai = np.array(embeddings)

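# Splitting the dataset (same random_state and row count as above, so the
# train/test rows match the TF-IDF split and y_train/y_test are unchanged)
X_train_openai, X_test_openai, y_train, y_test = train_test_split(X_openai, df['rating'], test_size=0.2, random_state=42)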

# Train RandomForest on OpenAI embeddings
rfr_openai = RandomForestRegressor(n_estimators=100)
rfr_openai.fit(X_train_openai, y_train)

# Predict and evaluate
preds_openai = rfr_openai.predict(X_test_openai)
mse_openai = mean_squared_error(y_test, preds_openai)
mae_openai = mean_absolute_error(y_test, preds_openai)

print(f"OpenAI embedding performance: mse={mse_openai:.2f}, mae={mae_openai:.2f}")


OpenAI embedding performance: mse=0.30, mae=0.31
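Each movie title appears on many rows of this dataset (one per user and tag), so embedding every row requests the same vector repeatedly. One possible refinement, sketched below, is to embed each distinct title once and map the vectors back to the rows. The helper `embed_unique` and the stand-in `fake_embed` are hypothetical names introduced for this illustration and are not part of the notebook or of any library.

```python
import numpy as np
import pandas as pd

def fake_embed(texts):
    """Stand-in for a real embedding call (hypothetical); one 4-d vector per text."""
    return [[float(len(t)), 0.0, 0.0, 1.0] for t in texts]

def embed_unique(titles: pd.Series, embed_fn, batch_size: int = 100) -> np.ndarray:
    """Embed each distinct title once, then map the vectors back to every row."""
    unique_titles = titles.drop_duplicates().tolist()
    vectors = []
    for i in range(0, len(unique_titles), batch_size):
        vectors.extend(embed_fn(unique_titles[i:i + batch_size]))
    lookup = dict(zip(unique_titles, vectors))
    return np.array([lookup[t] for t in titles])

# Toy data: three rows, two distinct titles
toy_titles = pd.Series(["Toy Story (1995)", "Toy Story (1995)", "Heat (1995)"])

calls = []  # records how many texts were sent per call

def counting_embed(texts):
    calls.append(len(texts))
    return fake_embed(texts)

X_demo = embed_unique(toy_titles, counting_embed)
print(X_demo.shape, sum(calls))
```

In the notebook, passing `get_embeddings` as `embed_fn` and `df['title']` as `titles` should yield the same feature matrix as the row-by-row loop, assuming the API returns the same vector for the same input text, while sending far fewer texts to the API.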
