## Regression using the embeddings

Regression means predicting a number rather than a category. Here we predict a movie's rating from an embedding of its title text (in the code below, TF-IDF vectors). We split the dataset into a training set and a test set for all of the following tasks, so we can realistically evaluate performance on unseen data.

In [7]:
import pandas as pd
import numpy as np
from ast import literal_eval

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

datafile_path = "../data/ml-latest-small/merged_data.csv"

df = pd.read_csv(datafile_path)

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Represent each title as a TF-IDF vector (a simple sparse stand-in for an embedding)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['title'])

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, df['rating'], test_size=0.2, random_state=42)

# Train RandomForest
rfr = RandomForestRegressor(n_estimators=100, random_state=42)  # fixed seed so reruns give the same forest
rfr.fit(X_train, y_train)

# Predict and evaluate
preds = rfr.predict(X_test)
mse = mean_squared_error(y_test, preds)
mae = mean_absolute_error(y_test, preds)

print(f"Title embedding performance: mse={mse:.2f}, mae={mae:.2f}")


Title embedding performance: mse=0.34, mae=0.32
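
To judge whether error figures like these are good, it helps to compare them with a trivial baseline that always predicts the mean training rating. The sketch below shows how MSE and MAE are computed and what such a baseline looks like. The ratings in `y_train_demo` and `y_test_demo` are made up for illustration and are not taken from the dataset above, and the helper names `mse` and `mae` are hypothetical.

```python
from statistics import mean

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute differences."""
    return mean([abs(t - p) for t, p in zip(y_true, y_pred)])

# Hypothetical ratings, invented for this example only
y_train_demo = [4.0, 3.5, 5.0, 2.0, 3.0, 3.5]
y_test_demo = [4.0, 2.5, 3.5, 5.0]

# Baseline: predict the mean training rating for every test item
baseline = mean(y_train_demo)
baseline_preds = [baseline] * len(y_test_demo)

baseline_mse = mse(y_test_demo, baseline_preds)
baseline_mae = mae(y_test_demo, baseline_preds)

print(f"Mean baseline: mse={baseline_mse:.2f}, mae={baseline_mae:.2f}")
```

In the notebook itself, the same comparison could be made by passing `np.full(len(y_test), y_train.mean())` to the `mean_squared_error` and `mean_absolute_error` functions already imported. A model is only useful if it beats this baseline on the same test split.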
