# CA9.1: Creating Predictions from Text

**Objective:**

The goal of this assignment is to build and evaluate a model that predicts the numerical star rating (e.g., 1 to 5 stars) of an Amazon product review based *only* on the text content of the review (title and body).

**Dataset:**

You will use the processed Amazon electronics review dataset created in Module 9.1.
* **File:** `processed_electronics_reviews_openrouter_threaded.jsonl`
* **Source:** This file contains the original review data along with the LLM-extracted insights (sentiment, categories, etc.). For this assignment, you primarily need the `rating`, `title`, and `text` fields.

**Task:**

This is a **regression task**. You need to predict the numerical value in the `rating` column using the information in the `title` and `text` columns.

**Your Choices (Choose at least one approach):**

You can choose one or both of the following approaches, similar to those explored in Module 9.3 (which focused on classification), but adapted for **regression**:

1.  **Approach 1: Fine-Tuning a Pre-trained Language Model (PLM) for Regression:**
    * Take a pre-trained transformer model (like `distilbert-base-uncased` used in M9.3).
    * Fine-tune it directly to predict the numerical rating based on the review text.
    * This involves modifying the model's final layer(s) for regression output and training it on the review text and corresponding ratings.

2.  **Approach 2: Feature Extraction (Embeddings) + Traditional Regression Model:**
    * Use a pre-trained model (like `distilbert-base-uncased` or a `SentenceTransformer` model like `all-MiniLM-L6-v2` from M9.2) to generate fixed numerical embeddings (feature vectors) for each review text.
    * Train a standard machine learning *regression* model (e.g., Linear Regression, Ridge, SVR, Gradient Boosting Regressor, etc.) using these embeddings as input features to predict the rating.

**Steps & Instructions:**

1.  **Setup:** Import necessary libraries (pandas, numpy, torch, transformers, datasets, evaluate, scikit-learn, etc.).
2.  **Load Data:** Load the `processed_electronics_reviews_openrouter_threaded.jsonl` file into a pandas DataFrame.
3.  **Prepare Data:**
    * Clean the data: Filter out any rows that had processing errors in Module 9.1 (check `error` or `llm_analysis_error` fields). Ensure the `text` and `rating` columns are present and valid (not null/empty). You might combine `title` and `text` for a richer input.
    * You may choose to focus on reviews for a specific product (ASIN) first, as done in M9.2, to manage dataset size, or use a sample of the full dataset.
    * Split your data into training and testing sets (or use the existing train/test splits if you loaded the original dataset and processed both parts).
4.  **Implement Your Chosen Approach(es):**
    * Follow the specific hints below for either fine-tuning or feature extraction.
5.  **Train:** Train your model(s) on the training data.
6.  **Evaluate:** Evaluate your model(s) on the test data using appropriate regression metrics.
7.  **Report:** Summarize your findings.
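The optional product-level subset mentioned in step 3 can be sketched as follows. This is a minimal sketch, assuming the DataFrame has a product identifier column named `asin`. Both the column name and the helper `most_reviewed_subset` are assumptions for illustration, so check `df.columns` in your own file first.

```python
import pandas as pd

def most_reviewed_subset(df: pd.DataFrame, id_col: str = "asin") -> pd.DataFrame:
    """Return only the rows for the product with the most reviews."""
    if id_col not in df.columns:
        raise KeyError(f"Column '{id_col}' not found; adjust id_col to your data.")
    top_id = df[id_col].value_counts().idxmax()  # most frequent product ID
    return df[df[id_col] == top_id].copy()

# Tiny illustrative frame (made-up rows, not real reviews)
demo = pd.DataFrame({
    "asin": ["A1", "A1", "B2", "A1", "B2"],
    "rating": [5, 4, 1, 3, 2],
})
subset = most_reviewed_subset(demo)
print(len(subset))
```

A single product may leave too few reviews to train on; in that case a random sample of the full dataset is the simpler route.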

**Hints:**

**Setup and Data Loading:**

```python
import json
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm # Or standard tqdm
import os

# --- Configuration ---
PROCESSED_FILE_PATH = r"processed_electronics_reviews_openrouter_threaded.jsonl"
# Optional: Set a limit for faster testing, None to use all
MAX_REVIEWS_TO_LOAD = 10000 # Example: Load first 10k

# --- Load Data ---
print(f"Loading data from {PROCESSED_FILE_PATH}...")
if not os.path.exists(PROCESSED_FILE_PATH):
    raise FileNotFoundError(f"Input file not found: {PROCESSED_FILE_PATH}")

all_data = []
with open(PROCESSED_FILE_PATH, 'r', encoding='utf-8') as f:
    for i, line in enumerate(tqdm(f, desc="Reading JSONL")):
        if MAX_REVIEWS_TO_LOAD is not None and i >= MAX_REVIEWS_TO_LOAD:
            print(f"\nReached limit of {MAX_REVIEWS_TO_LOAD} reviews.")
            break
        try:
            all_data.append(json.loads(line))
        except json.JSONDecodeError:
            print(f"Warning: Skipping invalid JSON on line {i+1}")

if not all_data:
    raise ValueError("No valid data loaded.")

df = pd.json_normalize(all_data, sep='_')
print(f"Loaded {len(df)} total entries into DataFrame.")

# --- Data Preparation & Cleaning ---
print("Preparing text and filtering valid reviews...")

# Combine title and text
df['input_text'] = df['title'].fillna('') + ' - ' + df['text'].fillna('')

# Define potential error columns based on M9.1 output
error_cols = [col for col in ['error', 'llm_analysis_error'] if col in df.columns]

# Basic filtering: Keep rows with valid text, valid rating, and no errors recorded.
# Check the review body itself: input_text always contains the ' - ' separator,
# so it is never empty even when both title and text are missing.
filter_condition = df['text'].fillna('').astype(str).str.strip().astype(bool)
filter_condition &= df['rating'].notna() # Ensure rating is present

# Check for errors - keep rows where *all* existing error columns are NA
for col in error_cols:
    filter_condition &= df[col].isna()

df_valid = df[filter_condition].copy()

# Ensure rating is numeric (e.g., float)
df_valid['rating'] = pd.to_numeric(df_valid['rating'], errors='coerce')
df_valid.dropna(subset=['rating'], inplace=True) # Drop rows where rating couldn't be converted

print(f"Filtered down to {len(df_valid)} valid reviews for modeling.")

if df_valid.empty:
    raise ValueError("No valid reviews found after filtering. Cannot proceed.")

# --- Optional: Select a subset for faster iteration ---
# df_sample = df_valid.sample(n=5000, random_state=42) # Example: Use 5000 reviews
# X = df_sample['input_text'].tolist()
# y = df_sample['rating'].values

# --- Use the full valid dataset ---
X = df_valid['input_text'].tolist() # Features (text)
y = df_valid['rating'].values # Target (rating)

# --- Split Data (Example using sklearn) ---
from sklearn.model_selection import train_test_split

X_train_text, X_test_text, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set size: {len(X_train_text)}")
print(f"Test set size: {len(X_test_text)}")
```
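If the ratings turn out to be unevenly distributed (worth checking with `df_valid['rating'].value_counts()`), a stratified split keeps the share of each star value similar in the training and test sets. The following is a minimal sketch, not part of the original hints; the helper name `stratified_rating_split` and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_rating_split(texts, ratings, test_size=0.2, seed=42):
    """Split so each star value keeps roughly the same share in train and test."""
    ratings = np.asarray(ratings, dtype=float)
    # stratify needs at least two samples of every distinct rating value
    return train_test_split(
        texts, ratings, test_size=test_size, random_state=seed, stratify=ratings
    )

# Synthetic, deliberately imbalanced example: 80 five-star and 20 one-star reviews
demo_texts = [f"review {i}" for i in range(100)]
demo_ratings = [5.0] * 80 + [1.0] * 20
tr_x, te_x, tr_y, te_y = stratified_rating_split(demo_texts, demo_ratings)
print(len(te_x), int((te_y == 1.0).sum()))
```

Stratifying treats the rating as a category only for the purpose of splitting; the model still predicts it as a number.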

**Approach 1: Fine-Tuning Hints (Refer to M9.3 Section 3)**

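* **Model:** Use `AutoModelForSequenceClassification`, configured for regression by setting `num_labels=1`. With a single label, the model outputs one unbounded value per review and is trained with a mean squared error (MSE) loss.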

    ```python
    from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
    import torch

    model_checkpoint = "distilbert-base-uncased" # Or another suitable model
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

    # Load model for REGRESSION (num_labels=1)
    model_finetune = AutoModelForSequenceClassification.from_pretrained(
        model_checkpoint,
        num_labels=1 # Key change for regression!
    ).to(device)
    ```

* **Tokenization:** The tokenization process is the same as in M9.3. You'll need to tokenize `X_train_text` and `X_test_text`. Create a Dataset object compatible with the `Trainer`.

    ```python
    # Tokenization function similar to M9.3, shown for reference.
    # The Dataset class below tokenizes internally and does not call it.
    def tokenize_function(texts):
        return tokenizer(texts, padding="max_length", truncation=True, max_length=512)

    # Create datasets (example structure)
    class RegressionDataset(torch.utils.data.Dataset):
        def __init__(self, texts, labels, tokenizer):
            self.encodings = tokenizer(texts, padding="max_length", truncation=True, max_length=512)
            self.labels = labels

        def __getitem__(self, idx):
            item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
            item['labels'] = torch.tensor(self.labels[idx], dtype=torch.float) # Use float for regression
            return item

        def __len__(self):
            return len(self.labels)

    train_dataset = RegressionDataset(X_train_text, y_train, tokenizer)
    test_dataset = RegressionDataset(X_test_text, y_test, tokenizer)
    ```

* **Metrics:** Adapt the `compute_metrics` function for regression. Use metrics like Mean Absolute Error (MAE) or Mean Squared Error (MSE). The `evaluate` library provides both (`evaluate.load('mae')`, `evaluate.load('mse')`). Lower values are better.

    ```python
    import evaluate
    import numpy as np

    mae_metric = evaluate.load("mae")
    mse_metric = evaluate.load("mse")

    def compute_metrics_regression(eval_pred):
        logits, labels = eval_pred
        # Logits are the direct output of the regression model (shape: batch_size, 1)
        predictions = logits.squeeze(-1) # Remove the last dimension
        mae = mae_metric.compute(predictions=predictions, references=labels)
        mse = mse_metric.compute(predictions=predictions, references=labels)
        return {
            "mae": mae['mae'],
            "mse": mse['mse'],
            "rmse": np.sqrt(mse['mse']) # Calculate RMSE from MSE
        }
    ```
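If loading metrics through `evaluate` is inconvenient (for example, when working offline), the same three numbers can be computed directly with NumPy. This is a minimal sketch of an equivalent function; the name `compute_metrics_numpy` and the demo values are made up for illustration.

```python
import numpy as np

def compute_metrics_numpy(eval_pred):
    """Regression metrics from (logits, labels) without the evaluate library."""
    logits, labels = eval_pred
    preds = np.asarray(logits, dtype=float).reshape(-1)   # (batch, 1) -> (batch,)
    labels = np.asarray(labels, dtype=float).reshape(-1)
    errors = preds - labels
    mse = float(np.mean(errors ** 2))
    return {
        "mae": float(np.mean(np.abs(errors))),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
    }

# Sanity check on made-up numbers
demo_logits = np.array([[4.5], [2.0], [3.0]])
demo_labels = np.array([5.0, 1.0, 3.0])
demo_metrics = compute_metrics_numpy((demo_logits, demo_labels))
print(demo_metrics)
```

It takes the same `(logits, labels)` pair as `compute_metrics_regression`, so it should be usable as a drop-in `compute_metrics` argument.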

* **Training Arguments:** Set `evaluation_strategy` (named `eval_strategy` in recent `transformers` releases, so use whichever your installed version accepts), `save_strategy`, and `logging_strategy`. Crucially, set `metric_for_best_model` to your primary evaluation metric (e.g., `"mae"` or `"rmse"`) and set `greater_is_better=False`.

    ```python
    training_args = TrainingArguments(
        output_dir="review_rating_regressor_finetuned",
        num_train_epochs=3, # Adjust as needed
        learning_rate=2e-5,
        per_device_train_batch_size=16, # Adjust based on GPU memory
        per_device_eval_batch_size=16,
        weight_decay=0.01,
        evaluation_strategy="epoch", # Use eval_strategy="epoch" if your transformers version rejects this name
        save_strategy="epoch",
        logging_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="mae", # Or "rmse"
        greater_is_better=False, # Lower error is better
        push_to_hub=False,
        report_to="none",
        save_total_limit=1
    )
    ```

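* **Trainer:** Instantiate the `Trainer` with the model, args, datasets, tokenizer, and the `compute_metrics_regression` function. (Recent `transformers` releases take the tokenizer through the `processing_class` argument instead of `tokenizer`.)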
* **Train & Evaluate:** Run `trainer.train()` and `trainer.evaluate()`.
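The two steps above can be sketched with a small helper that builds the `Trainer` and passes the tokenizer under whichever keyword the installed version accepts. This is a sketch under stated assumptions: `build_trainer` is a hypothetical helper, not a `transformers` function, and it takes the `Trainer` class as an argument so that the helper itself has no heavy imports.

```python
import inspect

def build_trainer(trainer_cls, model, args, train_dataset, eval_dataset,
                  tokenizer, compute_metrics):
    """Instantiate a Trainer-like class, passing the tokenizer under the
    keyword its constructor accepts (processing_class or tokenizer)."""
    params = inspect.signature(trainer_cls.__init__).parameters
    tok_key = "processing_class" if "processing_class" in params else "tokenizer"
    return trainer_cls(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
        **{tok_key: tokenizer},
    )
```

With the objects from the hints above, usage would look like `trainer = build_trainer(Trainer, model_finetune, training_args, train_dataset, test_dataset, tokenizer, compute_metrics_regression)`, followed by `trainer.train()` and `results = trainer.evaluate()`. The reported keys carry an `eval_` prefix, e.g. `results["eval_mae"]`.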

**Approach 2: Feature Extraction Hints (Refer to M9.3 Section 4 / M9.2)**

* **Embedding Model:** You can use `AutoModel` (like M9.3) to get hidden states (e.g., `[CLS]` token) or `SentenceTransformer` (like M9.2) for sentence embeddings. `SentenceTransformer` is often simpler for this purpose.

    ```python
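    import torch
    from sentence_transformers import SentenceTransformer

    # Define the device here as well, in case the Approach 1 cells were skipped
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Option 1: Sentence Transformer (often easier)
    embedding_model_name = 'all-MiniLM-L6-v2' # Or another SentenceTransformer model
    embedder = SentenceTransformer(embedding_model_name, device=device)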

    print("Generating embeddings for train set...")
    X_train_embeddings = embedder.encode(X_train_text, show_progress_bar=True, batch_size=64)
    print("Generating embeddings for test set...")
    X_test_embeddings = embedder.encode(X_test_text, show_progress_bar=True, batch_size=64)

    # Option 2: Using AutoModel (like M9.3 Section 4b/4c)
    # (Requires loading AutoModel, AutoTokenizer, defining extract_hidden_states function,
    # and mapping it over the text lists - potentially more complex setup)
    # X_train_embeddings = ... # Result of mapping extract_hidden_states
    # X_test_embeddings = ...
    ```

* **Prepare Data:** You now have `X_train_embeddings` and `X_test_embeddings` (NumPy arrays) and your target arrays `y_train` and `y_test`.
* **Train Regression Model:** Use scikit-learn. A pipeline with `StandardScaler` is recommended. Choose a *regression* model.

    ```python
    from sklearn.linear_model import Ridge # Example: Ridge Regression
    from sklearn.ensemble import GradientBoostingRegressor # Example: Gradient Boosting
    from sklearn.svm import SVR # Example: Support Vector Regressor
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Pick exactly one regressor: Gradient Boosting is active here;
    # comment it out and uncomment Ridge or SVR to swap models.
    regressor_pipeline = Pipeline([
        ('scaler', StandardScaler()),
        # ('regressor', Ridge(alpha=1.0, random_state=42))
        # ('regressor', SVR(C=1.0, epsilon=0.1))
        ('regressor', GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42))
    ])

    print("Training the regressor...")
    regressor_pipeline.fit(X_train_embeddings, y_train)
    print("Training complete.")
    ```

* **Evaluate:** Use scikit-learn metrics for regression.

    ```python
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

    print("Evaluating the regressor...")
    y_pred = regressor_pipeline.predict(X_test_embeddings)

    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred) # R-squared

    print(f"\nFeature Extraction + Regressor Results:")
    print(f"  Mean Absolute Error (MAE): {mae:.4f}")
    print(f"  Root Mean Squared Error (RMSE): {rmse:.4f}")
    print(f"  R-squared (R2): {r2:.4f}")
    ```
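Because a regressor's output is unbounded, some predictions can fall outside the star scale. Clamping them to the valid range before scoring never increases the error as long as the true ratings lie within that range. Below is a minimal sketch assuming a 1 to 5 scale; the helper `clip_to_star_range` and the demo numbers are made up for illustration.

```python
import numpy as np

def clip_to_star_range(y_pred, low=1.0, high=5.0):
    """Clamp continuous predictions to the valid star range."""
    return np.clip(np.asarray(y_pred, dtype=float), low, high)

# Made-up predictions, two of which fall outside the 1-5 scale
demo_true = np.array([5.0, 1.0, 4.0])
demo_pred = np.array([5.6, 0.4, 3.5])
demo_clipped = clip_to_star_range(demo_pred)

mae_raw = np.mean(np.abs(demo_true - demo_pred))
mae_clipped = np.mean(np.abs(demo_true - demo_clipped))
print(f"MAE raw: {mae_raw:.3f}, clipped: {mae_clipped:.3f}")
```

The same step applies to the fine-tuned model's outputs. If you report clipped scores, say so in your summary so the numbers stay comparable.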

**Evaluation:**

* The primary metrics for this regression task are **Mean Absolute Error (MAE)** and **Root Mean Squared Error (RMSE)**.
    * MAE tells you how many stars off your predictions are on average.
    * RMSE penalizes larger errors more heavily.
* You can also report **R-squared ($R^2$)** to understand the proportion of variance explained by your model.
* Compare the performance of your chosen approach(es).
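
For reference, the three metrics above can be written out as follows, with $y_i$ the true rating, $\hat{y}_i$ the predicted rating, $\bar{y}$ the mean of the true test ratings, and $n$ the number of test reviews:

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

As a small worked example with made-up numbers, true ratings $(5, 1, 4)$ and predictions $(4.5, 2, 4)$ give absolute errors $(0.5, 1, 0)$, so $\mathrm{MAE} = 1.5/3 = 0.5$ and $\mathrm{RMSE} = \sqrt{1.25/3} \approx 0.645$. A model that always predicts $\bar{y}$ scores $R^2 = 0$, which makes the mean rating a useful baseline to beat.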

**Submission:**

* Submit your Python code, preferably as a Jupyter Notebook (`.ipynb`) file.
* Include the output of your code cells, showing the training process (if applicable) and final evaluation metrics.
* Add a brief markdown section in your notebook summarizing:
    * Which approach(es) you implemented.
    * The final MAE, RMSE, and R2 scores on the test set for each approach.
    * Any challenges you faced or interesting observations.
    * (Optional) If you tried both, a brief comparison of their performance and potential reasons for differences.

**Optional Challenges (Extra Credit):**

* Implement and compare both the fine-tuning and feature-extraction approaches.
* Experiment with different pre-trained models (e.g., `RoBERTa`, other `SentenceTransformer` models).
* For the feature extraction approach, try different traditional regression models (e.g., compare Linear Regression vs. Gradient Boosting vs. SVR).
* Perform basic hyperparameter tuning for either the fine-tuning process (e.g., learning rate, epochs) or the traditional ML model (e.g., regularization strength for Ridge, `n_estimators` for Gradient Boosting).
* Analyze the errors: Look at examples where your model's prediction was significantly wrong. Are there patterns?
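
The error analysis in the last item can be sketched as below: compute the absolute error per review, average it per true star value, and list the worst predictions for manual inspection. The helper `error_report` and the demo rows are assumptions for illustration, not part of the course materials.

```python
import numpy as np
import pandas as pd

def error_report(texts, y_true, y_pred, top_n=5):
    """Return (MAE per true rating, the top_n worst predictions)."""
    report = pd.DataFrame({
        "text": list(texts),
        "rating": np.asarray(y_true, dtype=float),
        "predicted": np.asarray(y_pred, dtype=float),
    })
    report["abs_error"] = (report["rating"] - report["predicted"]).abs()
    mae_by_rating = report.groupby("rating")["abs_error"].mean()
    worst = report.sort_values("abs_error", ascending=False).head(top_n)
    return mae_by_rating, worst

# Made-up examples for illustration only
demo_texts = ["great", "broke fast", "fine", "love it"]
mae_by_rating, worst = error_report(
    demo_texts, [5, 1, 3, 5], [4.5, 3.0, 3.2, 4.9], top_n=2
)
print(mae_by_rating)
print(worst[["text", "rating", "predicted", "abs_error"]])
```

On real data you would call it as `error_report(X_test_text, y_test, y_pred)` and read through the worst rows for patterns.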

Good luck!