# **1. Data Loading and Preprocessing**

## **Introduction**
The initial phase of this study involves **data loading and preprocessing**, which lays the foundation for effective analysis and machine learning modeling. The dataset under consideration contains **568,454 user-generated reviews**, capturing textual feedback along with metadata such as numerical ratings, timestamps, and helpfulness scores. Given the sheer volume and diversity of the dataset, meticulous preprocessing is critical to ensure reliability and accuracy in downstream tasks.

## **Objectives of this Section**
- **Efficiently load the dataset** while minimizing memory usage.
- **Verify dataset integrity** by inspecting shape and column structure.
- **Handle missing values and duplicates** to prevent distortions in analysis.
- **Select only relevant features** that contribute to the classification task.
- **Engineer a new feature (`Combined_Text`)** by merging short (`Summary`) and long (`Text`) reviews.
- **Transform numerical ratings (`Score`) into categorical satisfaction levels (`User_Satisfaction`)** for supervised classification.
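
As a minimal sketch of the first objective, memory use can be reduced by reading only the needed columns with compact integer dtypes. The in-memory CSV and its two rows are hypothetical stand-ins for `Reviews.csv`, and the chosen dtypes are an assumption: they only work if the integer columns contain no missing values.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for Reviews.csv (hypothetical rows, same column names).
sample_csv = io.StringIO(
    "Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,"
    "HelpfulnessDenominator,Score,Time,Summary,Text\n"
    "1,P1,U1,alice,1,1,5,1303862400,Good,Tasty food\n"
    "2,P2,U2,bob,0,0,1,1346976000,Bad,Stale peanuts\n"
)

# Read only the columns used later, with compact integer dtypes.
wanted = ["HelpfulnessNumerator", "HelpfulnessDenominator", "Score", "Summary", "Text"]
dtypes = {"HelpfulnessNumerator": "int32", "HelpfulnessDenominator": "int32", "Score": "int8"}
small = pd.read_csv(sample_csv, usecols=wanted, dtype=dtypes)

print(small.dtypes)
print(f"Memory: {small.memory_usage(deep=True).sum()} bytes")
```

Whether this is worthwhile depends on the available RAM; the notebook itself loads all ten columns with default dtypes.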

## **Dataset Structure (Before Processing)**
Upon loading the dataset, we retrieve the following properties:

In [1]:
import pandas as pd

# Load the data
data = pd.read_csv('Reviews.csv', encoding='ISO-8859-1', low_memory=False)

# Display dataset shape to confirm full loading
print(f"Dataset shape: {data.shape}")

# Show column names to verify
print(f"Columns: {data.columns}")


Dataset shape: (568454, 10)
Columns: Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')


The dataset comprises **568,454 rows** and **10 columns**, capturing a mix of numerical, categorical, and textual data. 

## **Dataset Column Descriptions**

The dataset consists of user-generated reviews, capturing essential details about products, users, ratings, and feedback. Each row represents an individual review, accompanied by metadata that provides context for the evaluation.

---


### **1. Id**
- **Definition**: A unique identifier assigned to each row in the dataset.
- **Type**: Integer

### **2. ProductId**
- **Definition**: A unique alphanumeric identifier associated with each product.
- **Type**: String

### **3. UserId**
- **Definition**: A unique identifier assigned to each user who submitted a review.
- **Type**: String

### **4. ProfileName**
- **Definition**: The display name of the user who submitted the review.
- **Type**: String

### **5. HelpfulnessNumerator**
- **Definition**: The number of users who found the review helpful.
- **Type**: Integer

### **6. HelpfulnessDenominator**
- **Definition**: The total number of users who provided feedback on whether a review was helpful.
- **Type**: Integer
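
The two helpfulness columns are often easier to interpret as a ratio. The sketch below is illustrative only: `HelpfulnessRatio` is a hypothetical column that the notebook does not create, and treating reviews without votes as ratio 0 is an assumption.

```python
import pandas as pd

votes = pd.DataFrame({
    "HelpfulnessNumerator": [1, 0, 3, 2],
    "HelpfulnessDenominator": [1, 0, 3, 4],
})

# Mask zero denominators so the division yields NaN instead of dividing by zero,
# then treat "no votes" as a ratio of 0.
den = votes["HelpfulnessDenominator"]
votes["HelpfulnessRatio"] = (votes["HelpfulnessNumerator"] / den.where(den > 0)).fillna(0.0)

print(votes)
```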

### **7. Score**
- **Definition**: The rating provided by the user, ranging from 1 to 5.
- **Type**: Integer (1-5)
- **Purpose**: Represents the user's sentiment about the product.
  - **5** → Highly Satisfied
  - **4** → Satisfied
  - **3** → Neutral
  - **2** → Not Satisfied
  - **1** → Very Bad

### **8. Time**
- **Definition**: A Unix timestamp representing the date and time when the review was posted.
- **Type**: Integer
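
For readability, a Unix timestamp in seconds can be converted to a calendar date with `pd.to_datetime`. This is a small sketch using two `Time` values that appear in the data preview; the notebook itself does not perform this conversion.

```python
import pandas as pd

# Two Time values taken from the first rows of the dataset preview.
times = pd.Series([1303862400, 1346976000], name="Time")

# unit="s" tells pandas the integers are seconds since 1970-01-01 (UTC).
review_dates = pd.to_datetime(times, unit="s")
print(review_dates.dt.date.tolist())
```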

### **9. Summary**
- **Definition**: A brief, high-level summary of the review provided by the user.
- **Type**: String

### **10. Text**
- **Definition**: The full-length detailed review written by the user.
- **Type**: String
---

## **Data Cleaning Steps**
1. **Handling Missing Values**:
   - Since the `Score` column is central to our classification task, all rows with `NaN` values in `Score` are removed.
   - Missing values in textual fields (`Summary` and `Text`) are replaced with empty strings (`""`) to ensure consistency in text processing.

2. **Removing Duplicates**:
   - Identical reviews can distort classification models by **artificially inflating the occurrence of certain labels**. Therefore, we eliminate duplicate rows.

3. **Feature Selection**:
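   - To reduce computational complexity and keep the model interpretable, we retain only the columns relevant to the task (the cleaning code also carries `Time` through this step; it is dropped before the train-test split):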
     - `HelpfulnessNumerator`: Number of users who found the review helpful.
     - `HelpfulnessDenominator`: Total number of users who rated helpfulness.
     - `Score`: The numerical rating given by the reviewer (1 to 5).
     - `Summary`: A brief overview of the review.
     - `Text`: The full review content.

4. **Feature Engineering (`Combined_Text`)**:
   - The `Summary` column contains **concise expressions of user sentiment**, whereas the `Text` column provides **detailed qualitative feedback**.
   - To **capture both short- and long-form sentiment**, we concatenate them into a new column:  
     ```
     Combined_Text = Summary + " " + Text
     ```
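
The cleaning steps above can be sketched on a toy DataFrame (hypothetical rows). The sketch also shows why missing text must be filled before concatenation: in pandas, `NaN + " " + text` is `NaN`, so an unfilled `Summary` would wipe out the whole combined string. The final `.str.strip()` is an optional extra, not part of the original recipe.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Score": [5, np.nan, 2, 5],
    "Summary": ["Great", "Meh", np.nan, "Great"],
    "Text": ["Loved it", "It was fine", "Too salty", "Loved it"],
})

toy = toy.dropna(subset=["Score"])  # drop rows without a rating
toy = toy.drop_duplicates()         # drop exact duplicate rows

# Concatenating before filling propagates NaN into the combined text.
without_fill = toy["Summary"] + " " + toy["Text"]

# Fill missing text first, then combine.
toy[["Summary", "Text"]] = toy[["Summary", "Text"]].fillna("")
toy["Combined_Text"] = (toy["Summary"] + " " + toy["Text"]).str.strip()

print(without_fill.isna().sum(), "combined value(s) lost without fillna")
print(toy["Combined_Text"].tolist())
```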

## **Transforming the `Score` Column**
The numerical `Score` column is mapped to categorical `User_Satisfaction` levels as follows:

In [2]:
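# Remove rows where 'Score' is missing
data = data[data["Score"].notna()]

# Drop exact duplicate rows. Note: this runs while the unique `Id` column is
# still present, so no two rows are identical and all 568,454 rows are kept.
data = data.drop_duplicates()

# Select only the relevant columns
selected_columns = [
    "HelpfulnessNumerator",
    "HelpfulnessDenominator",
    "Score",
    "Time",
    "Summary",
    "Text"
]

# Keep only the selected columns (.copy() avoids pandas' SettingWithCopyWarning)
data = data[selected_columns].copy()

# Ensure 'Score' is numeric; unparseable values become NaN
data["Score"] = pd.to_numeric(data["Score"], errors='coerce')

# Replace missing text with empty strings so the concatenation never yields NaN
data[["Summary", "Text"]] = data[["Summary", "Text"]].fillna("")
data["Combined_Text"] = data["Summary"] + " " + data["Text"]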

# Define function to categorize User_Satisfaction
def classify_satisfaction(rating):
    if rating == 5:
        return "Highly Satisfied"
    elif rating == 4:
        return "Satisfied"
    elif rating == 3:
        return "Neutral"
    elif rating == 2:
        return "Not Satisfied"
    elif rating <= 1:
        return "Very Bad"
    else:
        return None  # Handle missing values gracefully

# Apply function to create new column
data["User_Satisfaction"] = data["Score"].apply(classify_satisfaction)

data

| | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text | Combined_Text | User_Satisfaction |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... | Good Quality Dog Food I have bought several of... | Highly Satisfied |
| 1 | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... | Not as Advertised Product arrived labeled as J... | Very Bad |
| 2 | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... | "Delight" says it all This is a confection tha... | Satisfied |
| 3 | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... | Cough Medicine If you are looking for the secr... | Not Satisfied |
| 4 | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... | Great taffy Great taffy at a great price. The... | Highly Satisfied |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 568449 | 0 | 0 | 5 | 1299628800 | Will not do without | Great for sesame chicken..this is a good if no... | Will not do without Great for sesame chicken..... | Highly Satisfied |
| 568450 | 0 | 0 | 2 | 1331251200 | disappointed | I'm disappointed with the flavor. The chocolat... | disappointed I'm disappointed with the flavor.... | Not Satisfied |
| 568451 | 2 | 2 | 5 | 1329782400 | Perfect for our maltipoo | These stars are small, so you can give 10-15 o... | Perfect for our maltipoo These stars are small... | Highly Satisfied |
| 568452 | 1 | 1 | 5 | 1331596800 | Favorite Training and reward treat | These are the BEST treats for training and rew... | Favorite Training and reward treat These are t... | Highly Satisfied |


This transformation recasts the task from predicting a **numeric rating** to a **five-class classification problem**.
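
An equivalent and more compact way to express the same mapping is a dictionary lookup with `Series.map`. This is a sketch, not the notebook's method, and the name `satisfaction_by_score` is hypothetical. One behavioural difference: the function above labels any value `<= 1` as "Very Bad", whereas the dictionary maps only the exact values 1 to 5 and yields `NaN` for anything else.

```python
import pandas as pd

# Hypothetical lookup table mirroring the satisfaction levels of the notebook.
satisfaction_by_score = {
    5: "Highly Satisfied",
    4: "Satisfied",
    3: "Neutral",
    2: "Not Satisfied",
    1: "Very Bad",
}

scores = pd.Series([5, 1, 4, 2, 3, 7], name="Score")  # 7 is deliberately out of range
labels = scores.map(satisfaction_by_score)            # unmapped values become NaN

print(labels.tolist())
```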

## **Class Distribution (After Processing)**


In [3]:
user_satisfaction_counts = data["User_Satisfaction"].value_counts()

# Display the counts
print(user_satisfaction_counts)

User_Satisfaction
Highly Satisfied    363122
Satisfied            80655
Very Bad             52268
Neutral              42640
Not Satisfied        29769
Name: count, dtype: int64
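
The counts above are strongly imbalanced. As a small sketch, the class shares and one common weighting heuristic, `n_samples / (n_classes * count)`, can be derived directly from them. The heuristic is shown for illustration only; it is not necessarily how class weights are chosen later in this notebook.

```python
import pandas as pd

# Class counts copied from the output above.
counts = pd.Series({
    "Highly Satisfied": 363122,
    "Satisfied": 80655,
    "Very Bad": 52268,
    "Neutral": 42640,
    "Not Satisfied": 29769,
})

shares = counts / counts.sum()                             # fraction of reviews per class
balanced_weights = counts.sum() / (len(counts) * counts)   # inverse-frequency weights

print(shares.round(3))
print(balanced_weights.round(2))
```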


# **2. Data Splitting for Machine Learning**

## **Why Splitting the Data?**
Before training a classification model, it is crucial to **partition the dataset into training and testing subsets**. This ensures:
- **Generalizability**: The model is evaluated on unseen data.
- **Detecting Overfitting**: Reveals when the model has memorized patterns in the training data instead of learning ones that generalize.
- **Measuring Model Performance**: Provides an unbiased estimate of accuracy.

## **Data Preparation Steps**
1. **Label Encoding**:
   - Since `User_Satisfaction` is a categorical variable, it is transformed into **numerical labels** via `LabelEncoder()`.
   - `LabelEncoder` assigns codes in sorted (alphabetical) order of the class names, so the mapping is:
     ```
     "Highly Satisfied" → 0
     "Neutral" → 1
     "Not Satisfied" → 2
     "Satisfied" → 3
     "Very Bad" → 4
     ```

2. **Train-Test Split**:
   - We perform an **80-20 split**, where **80%** of the data is used for training and **20%** for evaluation.
   - **Stratified Sampling** ensures that the **class distribution remains proportional** in both training and test sets.
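
Both steps can be checked on a toy label column (hypothetical data, 10 rows per class). The sketch prints the code that `LabelEncoder` assigns to each class, which follows the sorted order of the class names, and confirms that a stratified 80-20 split keeps the classes balanced in the test set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

levels = ["Highly Satisfied", "Satisfied", "Neutral", "Not Satisfied", "Very Bad"]
toy_labels = pd.Series(levels * 10)  # 50 rows, 10 per class

# LabelEncoder sorts the class names before assigning integer codes.
encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)
mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)

# Stratified split: each class keeps its share in both subsets.
train_y, test_y = train_test_split(
    encoded, test_size=0.2, random_state=42, stratify=encoded
)
test_counts = pd.Series(test_y).value_counts().sort_index().tolist()
print(test_counts)
```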

## **Resulting Data Structure**


In [4]:
import pandas as pd
import numpy as np
import re
import string
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# `data` is the cleaned DataFrame produced in the previous cell

# Select the required columns
selected_columns = [
    "HelpfulnessNumerator",
    "HelpfulnessDenominator",
    "Summary",
    "Text",
    "User_Satisfaction"
]
# .copy() yields an independent DataFrame and avoids pandas' SettingWithCopyWarning
data = data[selected_columns].copy()
# Fill missing values with an empty string
data[["Summary", "Text"]] = data[["Summary", "Text"]].fillna("")

# Combine 'Summary' + 'Text' for text processing
data["Combined_Text"] = data["Summary"] + " " + data["Text"]

# Encode 'User_Satisfaction' as numerical labels
label_encoder = LabelEncoder()
data["User_Satisfaction"] = label_encoder.fit_transform(data["User_Satisfaction"])

# Train-Test Split (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(
    data[["HelpfulnessNumerator", "HelpfulnessDenominator", "Combined_Text"]],
    data["User_Satisfaction"],
    test_size=0.2,
    random_state=42,
    stratify=data["User_Satisfaction"]
)

# Display Data Splitting Results
print(f"Training Set: {X_train.shape}, Test Set: {X_test.shape}")



Training Set: (454763, 3), Test Set: (113691, 3)


The feature matrix consists of three columns:
- `HelpfulnessNumerator`
- `HelpfulnessDenominator`
- `Combined_Text`

The target variable is `User_Satisfaction`.

# Baseline model

### Random Forest classifier
For our baseline we use a Random Forest classifier. It was chosen because it scales to large datasets and, by averaging the predictions of many decision trees, is less prone to overfitting than a single tree.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from scipy.sparse import hstack

# Convert text data into numerical features using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train['Combined_Text'])
X_test_tfidf = vectorizer.transform(X_test['Combined_Text'])

# Combine numerical features with text features
X_train_combined = hstack([X_train_tfidf, X_train[['HelpfulnessNumerator', 'HelpfulnessDenominator']].values])
X_test_combined = hstack([X_test_tfidf, X_test[['HelpfulnessNumerator', 'HelpfulnessDenominator']].values])

# Train Random Forest Model
rf_model = RandomForestClassifier(n_estimators=100,oob_score=True, random_state=42, n_jobs=-1)
rf_model.fit(X_train_combined, y_train)
rf_preds = rf_model.predict(X_test_combined)

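# Evaluate Random Forest Model
rf_acc = accuracy_score(y_test, rf_preds)
# Note: "1".."5" are display names for the encoded labels 0..4, i.e. the
# alphabetically sorted classes in label_encoder.classes_ (Highly Satisfied,
# Neutral, Not Satisfied, Satisfied, Very Bad). They are not star ratings.
rf_report = classification_report(y_test, rf_preds, target_names=["1", "2", "3", "4", "5"])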

# Display Results
print("Accuracy:",rf_acc) 
print(rf_report)


Accuracy: 0.8156142526673176
              precision    recall  f1-score   support

           1       0.80      0.99      0.88     72624
           2       0.90      0.44      0.59      8528
           3       0.97      0.39      0.56      5954
           4       0.92      0.44      0.59     16131
           5       0.83      0.71      0.76     10454

    accuracy                           0.82    113691
   macro avg       0.88      0.59      0.68    113691
weighted avg       0.83      0.82      0.79    113691



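### Baseline model scores:
- The baseline model reaches an accuracy of 81.6%.
- One major concern is that class 1 clearly outperforms the others: its recall is 0.99, while the remaining classes reach only 0.39 to 0.71. Class 1 is the majority class (72,624 of the 113,691 test reviews), so the model is biased toward predicting it. The macro-averaged recall of 0.59 reflects this imbalance better than the overall accuracy does.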
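
The gap between accuracy and macro-averaged recall can be reproduced on a deliberately imbalanced toy example (hypothetical labels). The sketch shows how a classifier that favours the majority class still scores a high accuracy, and how a confusion matrix exposes where the minority class is lost.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# 8 majority-class samples (label 0) and 2 minority-class samples (label 1).
y_true = [0] * 8 + [1] * 2
# The classifier gets every majority sample right but misses one minority sample.
y_pred = [0] * 8 + [0, 1]

acc = accuracy_score(y_true, y_pred)
macro_recall = recall_score(y_true, y_pred, average="macro")
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(f"Accuracy: {acc:.2f}, macro recall: {macro_recall:.2f}")
print(cm)
```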

## Verifying convergence of the baseline model
A Random Forest classifier does not require a convergence check in the way logistic regression or neural networks do. It has no iterative optimization such as gradient descent; the forest simply stops growing once the predefined number of trees has been built.

One way to assess the "convergence" of a Random Forest is to monitor the out-of-bag (OOB) error as the number of trees increases. The OOB error is the prediction error computed for each data point using only the trees that did not see that point during training. When the OOB error stops decreasing significantly or flattens out, the model can be considered to have converged. For the final 100-tree forest, the OOB score (0.815) is close to the test accuracy (0.816).
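
A minimal sketch of such monitoring, on synthetic data rather than the review features: with `warm_start=True`, refitting after raising `n_estimators` adds trees to the existing forest, so the OOB error can be recorded at several forest sizes. The tree counts and the `make_classification` data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real feature matrix.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=42
)

# warm_start=True keeps already-fitted trees and only adds the new ones on refit.
forest = RandomForestClassifier(warm_start=True, oob_score=True, random_state=42)

tree_counts = [25, 50, 100, 150]
oob_errors = []
for n_trees in tree_counts:
    forest.set_params(n_estimators=n_trees)
    forest.fit(X_demo, y_demo)
    oob_errors.append(1.0 - forest.oob_score_)  # OOB error = 1 - OOB accuracy

for n_trees, err in zip(tree_counts, oob_errors):
    print(f"{n_trees:>4} trees: OOB error = {err:.4f}")
```

A curve that has flattened by the largest forest size suggests that adding more trees would bring little further benefit.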



In [6]:
print("OOB Score:", rf_model.oob_score_)

OOB Score: 0.8151058903208924


# Foundational Model 

### BERT

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained transformer-based deep learning model developed by Google. It is designed to understand contextual meaning in text by processing words bidirectionally, meaning it considers both preceding and succeeding words to derive meaning. Unlike traditional NLP models that read text left to right or right to left, BERT captures the full context of a word in a sentence. This makes it highly effective for tasks like sentiment analysis, where word relationships significantly impact meaning. Instead of training a model from scratch, we can leverage transfer learning by fine-tuning BERT on our specific dataset, allowing it to adapt to sentiment-specific language patterns.

BERT was pre-trained on large-scale text data from Wikipedia and BooksCorpus using two tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). In MLM, BERT learns to predict masked words in a sentence from their surrounding context. In NSP, BERT learns to determine whether one sentence naturally follows another. These pre-training tasks give BERT a strong general understanding of language, which transfers well to sentiment classification. Since our dataset consists of Amazon Fine Food Reviews, which pair natural language reviews with user ratings, BERT’s pre-trained knowledge can be fine-tuned to classify reviews into the five satisfaction levels defined earlier. By applying transfer learning, BERT can adapt to domain-specific nuances in customer reviews while requiring far less training time and computation than training from scratch.
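
To make the MLM objective concrete, the sketch below applies the masking recipe described in the BERT paper to a toy sentence: about 15% of positions are selected; of those, 80% are replaced by `[MASK]`, 10% by a random token, and 10% are left unchanged. It is a simplification that works on whitespace tokens instead of WordPiece tokens, and `mask_tokens` is a hypothetical helper, not part of any library.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, targets), where targets maps position -> original token."""
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        targets[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, targets

tokens = "great taffy at a great price with a wide assortment of yummy flavors".split()
vocab = sorted(set(tokens))
masked, targets = mask_tokens(tokens, vocab)
print(masked)
print(targets)
```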

In [11]:
import torch
import torch.nn as nn
from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated; PyTorch's is used instead
from transformers import BertTokenizer, BertForSequenceClassification, get_scheduler
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
from sklearn.metrics import accuracy_score, classification_report

#GPU Check
print(torch.__version__)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # only valid when a CUDA device is present
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

2.6.0+cu118
True
NVIDIA GeForce RTX 3060 Ti
Using device: cuda


In [None]:
#Load Pre-trained BERT Tokenizer and Model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Dropout rates are passed at load time: changing model.config after the model
# has been built does not alter the already-constructed dropout layers.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(label_encoder.classes_),
    hidden_dropout_prob=0.3,
    attention_probs_dropout_prob=0.3,
)
model.to(device)

#Dataset class for BERT
class ReviewDataset(Dataset):
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, padding="max_length", truncation=True, max_length=256, return_tensors="pt")
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

#Put the train and test data into sets    
train_dataset = ReviewDataset(X_train["Combined_Text"].tolist(), y_train.tolist())
test_dataset = ReviewDataset(X_test["Combined_Text"].tolist(), y_test.tolist())

#Data loaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

#Loss Function/Optimizer
# Weights follow the order of label_encoder.classes_ (alphabetical):
# Highly Satisfied, Neutral, Not Satisfied, Satisfied, Very Bad
class_weights = torch.tensor([0.5, 1.5, 1.8, 1.2, 1.3], dtype=torch.float).to(device)
loss_fn = nn.CrossEntropyLoss(weight=class_weights)
optimizer_grouped_parameters = [
    {"params": model.bert.encoder.layer[:-4].parameters(), "lr": 1e-6},  # Lower encoder layers (very low LR)
    {"params": model.bert.encoder.layer[-4:].parameters(), "lr": 1e-5},  # Last four encoder layers (low LR)
    {"params": model.classifier.parameters(), "lr": 3e-5}  # Classifier head (moderate LR)
]

optimizer = AdamW(optimizer_grouped_parameters, weight_decay=0.01)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:

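# Freeze all BERT parameters so that only the classifier head is trained at first
for param in model.bert.parameters():
    param.requires_grad = False

num_training_steps = len(train_loader) * 5  # 5 epochs, matching the train_model(..., epochs=5) call
lr_scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=500, num_training_steps=num_training_steps)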
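
To see why the number of scheduled steps should match the number of epochs actually trained, here is a pure-Python sketch of the shape a linear warmup and decay schedule takes. `linear_schedule_multiplier` and the figure of 1,000 steps per epoch are hypothetical and chosen for illustration; the function approximates the shape of the "linear" schedule and is not the library implementation.

```python
def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 down to 0 at num_training_steps, clamped at 0 afterwards.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

steps_per_epoch = 1000  # hypothetical value for illustration
warmup = 500

# Schedule planned for 3 epochs, but training continues into epoch 5 (step 4000).
short_plan = linear_schedule_multiplier(4000, warmup, 3 * steps_per_epoch)
# Schedule planned for the full 5 epochs.
full_plan = linear_schedule_multiplier(4000, warmup, 5 * steps_per_epoch)

print(f"3-epoch plan at step 4000: {short_plan:.3f}")
print(f"5-epoch plan at step 4000: {full_plan:.3f}")
```

With a schedule that is too short, the learning rate multiplier reaches zero before training ends, so the remaining epochs no longer update the model.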


In [None]:

def train_model(model, train_loader, optimizer, loss_fn, epochs=3, freeze=True, patience=2):
    if freeze:
        for param in model.bert.parameters():
            param.requires_grad = False  # Freeze BERT layers

    model.train()
    best_loss = float("inf")  # Track best loss
    patience_counter = 0  # Track early stopping

    for epoch in range(epochs):
        loop = tqdm(train_loader, leave=True)
        total_loss = 0
        
        for batch in loop:
            optimizer.zero_grad()

            input_ids, attention_mask, labels = (
                batch["input_ids"].to(device),
                batch["attention_mask"].to(device),
                batch["labels"].to(device),
            )

            outputs = model(input_ids, attention_mask=attention_mask)
            loss = loss_fn(outputs.logits, labels)
            loss.backward()
            optimizer.step()
            lr_scheduler.step()

            total_loss += loss.item()

            loop.set_description(f"Epoch {epoch+1}")
            loop.set_postfix(loss=loss.item())

        avg_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1} Average Loss: {avg_loss:.4f}")

        # Early stopping on the average training loss (no validation split is used here)
        if avg_loss < best_loss:
            best_loss = avg_loss
            patience_counter = 0  # Reset patience if loss improves
        else:
            patience_counter += 1  # Increase patience counter

        if patience_counter >= patience:  # Stop training if no improvement for 'patience' epochs
            print("Early stopping triggered. Training stopped.")
            break  # Stop training early

train_model(model, train_loader, optimizer, loss_fn, epochs=5, freeze=True, patience=2)



ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.26.0`: Please run `pip install transformers[torch]` or `pip install 'accelerate>=0.26.0'`

In [None]:
def evaluate_model(model, test_loader):
    model.eval()
    preds, true_labels = [], []

    with torch.no_grad():
        for batch in test_loader:
            input_ids, attention_mask, labels = (
                batch["input_ids"].to(device),
                batch["attention_mask"].to(device),
                batch["labels"].to(device),
            )

            outputs = model(input_ids, attention_mask=attention_mask)
            preds.extend(torch.argmax(outputs.logits, dim=1).cpu().numpy())
            true_labels.extend(labels.cpu().numpy())

    acc = accuracy_score(true_labels, preds)
    print(f"Model Accuracy: {acc:.4f}")
    print("Classification Report:\n", classification_report(true_labels, preds, target_names=label_encoder.classes_))

evaluate_model(model, test_loader)

In [None]:
#Fine Tuning
for param in model.bert.encoder.layer[-2:].parameters():
    param.requires_grad = True  # Unfreeze last two layers

#Optimizer
# Parameters that stay frozen (requires_grad=False) receive no gradients, so their
# learning rates have no effect; only the last two encoder layers and the classifier are updated.
optimizer_grouped_parameters = [
    {"params": model.bert.encoder.layer[:-4].parameters(), "lr": 1e-6},  # Lower encoder layers (still frozen)
    {"params": model.bert.encoder.layer[-4:].parameters(), "lr": 1e-5},  # Last four encoder layers (only the last two are unfrozen)
    {"params": model.classifier.parameters(), "lr": 3e-5}  # Classifier (higher LR)
]

optimizer = AdamW(optimizer_grouped_parameters, weight_decay=0.01)
num_training_steps = len(train_loader) * 5  # 5 epochs
lr_scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=500, num_training_steps=num_training_steps)


train_model(model, train_loader, optimizer, loss_fn, epochs=5, freeze=False)


evaluate_model(model, test_loader)