Step 1: Data Loading and Initial Inspection

In [2]:
from google.colab import files

# This command will open a file selection dialog.
uploaded = files.upload()

Saving Reviews.csv to Reviews.csv


In [3]:
import pandas as pd
import numpy as np

# Assuming the file 'Reviews.csv' is in the same folder as your notebook
file_path = 'Reviews.csv'
try:
    df = pd.read_csv(file_path)
    print("✅ Data loaded successfully.")

    # Display initial structure
    print("\n--- Initial Data Snapshot ---")
    print(df.head())
    print("\nTotal rows before cleaning:", len(df))

except FileNotFoundError:
    print(f"❌ Error: File not found at '{file_path}'. Please ensure 'Reviews.csv' is in the correct directory.")

✅ Data loaded successfully.

--- Initial Data Snapshot ---
   Id   ProductId          UserId                      ProfileName  \
0   1  B001E4KFG0  A3SGXH7AUHU8GW                       delmartian   
1   2  B00813GRG4  A1D87F6ZCVE5NK                           dll pa   
2   3  B000LQOCH0   ABXLMWJIXXAIN  Natalia Corres "Natalia Corres"   
3   4  B000UA0QIQ  A395BORC6FGVXV                             Karl   
4   5  B006K2ZZ7K  A1UQRSCLF8GW1T    Michael D. Bigham "M. Wassir"   

   HelpfulnessNumerator  HelpfulnessDenominator  Score        Time  \
0                     1                       1      5  1303862400   
1                     0                       0      1  1346976000   
2                     1                       1      4  1219017600   
3                     3                       3      2  1307923200   
4                     0                       0      5  1350777600   

                 Summary                                               Text  
0  Good Quality Dog F

Step 2: Data Cleaning and Target Definition

In [4]:
# --- 2.1 Drop Missing Values and Filter Columns ---
df.dropna(subset=['Summary', 'Text', 'Score'], inplace=True)
df = df.filter(items=['Score', 'Text']) # Keep only the necessary columns

# --- 2.2 Target Definition (Sentiment Mapping) ---
def map_sentiment(score):
    if score in [1, 2]:
        return 'Negative'
    elif score == 3:
        return 'Neutral'
    else: # 4 or 5
        return 'Positive'

df['Sentiment'] = df['Score'].apply(map_sentiment)

# Remove the original 'Score' column as it's no longer needed
df.drop('Score', axis=1, inplace=True)

# --- 2.3 Check Class Distribution ---
print("\n--- Final Sentiment Class Distribution ---")
print(df['Sentiment'].value_counts(normalize=True))
print(f"\nTotal Rows after cleaning: {len(df)}")


--- Final Sentiment Class Distribution ---
Sentiment
Positive    0.780711
Negative    0.144279
Neutral     0.075011
Name: proportion, dtype: float64

Total Rows after cleaning: 568427
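
The distribution above is heavily skewed toward Positive. One common way to compensate is inverse-frequency class weighting. The sketch below applies the n_samples / (n_classes * class_count) heuristic (the one scikit-learn documents for class_weight='balanced') to the proportions printed above. It is an illustration only: the models in this notebook are trained without class weights, and the variable names are assumptions.

```python
# Class proportions copied from the output above
proportions = {"Positive": 0.780711, "Negative": 0.144279, "Neutral": 0.075011}
n_classes = len(proportions)

# weight_c = n_samples / (n_classes * count_c) = 1 / (n_classes * proportion_c)
class_weights = {label: 1.0 / (n_classes * p) for label, p in proportions.items()}

# Print from the most common class (smallest weight) to the rarest (largest weight)
for label, weight in sorted(class_weights.items(), key=lambda kv: kv[1]):
    print(f"{label:<9} weight = {weight:.2f}")
```

The rarest class (Neutral) receives the largest weight, so its errors would count the most in a weighted loss.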


Step 3: Text Preprocessing (Cleaning Function)

In [5]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from tqdm import tqdm
tqdm.pandas()

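# --- 3.1 Setup ---
# Download the NLTK resources; nltk.download() skips packages that are already up to date.
# No bare try/except here, so a failed download is not silently ignored.
# Note: 'punkt' is not used by clean_text below, which tokenizes with str.split().
for resource in ('stopwords', 'wordnet', 'punkt'):
    nltk.download(resource)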

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))


# --- 3.2 Defining the Cleaning Function ---
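def clean_text(text):
    text = text.lower()
    text = re.sub(r'<.*?>', '', text)       # Remove HTML tags
    # Keep only lowercase letters and whitespace (drops punctuation and digits).
    # Substituting '' fuses words separated only by punctuation
    # ("peanuts...the" -> "peanutsthe"); substituting ' ' would keep them apart.
    text = re.sub(r'[^a-z\s]', '', text)
    words = text.split()
    words = [word for word in words if word not in stop_words] # Stop Word Removal
    words = [lemmatizer.lemmatize(word) for word in words]      # Lemmatization
    return ' '.join(words)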


# --- 3.3 Applying the Function ---
print("Starting text cleaning and normalization...")

# Apply the cleaning function (this will take a few minutes)
df['cleaned_text'] = df['Text'].progress_apply(clean_text)

print("Text cleaning complete.")

# Display a comparison
print("\n--- Raw vs. Cleaned Comparison ---")
print("Original Text:", df['Text'].iloc[1])
print("Cleaned Text:", df['cleaned_text'].iloc[1])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Starting text cleaning and normalization...


100%|██████████| 568427/568427 [01:39<00:00, 5694.76it/s]

Text cleaning complete.

--- Raw vs. Cleaned Comparison ---
Original Text: Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".
Cleaned Text: product arrived labeled jumbo salted peanutsthe peanut actually small sized unsalted sure error vendor intended represent product jumbo
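
The cleaned sample above contains the fused token "peanutsthe", because the punctuation between "Peanuts" and "the" was deleted rather than replaced. A minimal sketch of an alternative, assuming it is acceptable to change the cleaning behaviour: substitute a space instead of an empty string and then collapse repeated whitespace. The helper name and the sample string are hypothetical, and stop-word removal and lemmatization are left out to keep the example free of NLTK downloads.

```python
import re

def strip_markup_and_punct(text: str) -> str:
    """Lowercase, then replace HTML tags and non-letters with spaces."""
    text = text.lower()
    text = re.sub(r'<.*?>', ' ', text)      # tags become a space, not ''
    text = re.sub(r'[^a-z\s]', ' ', text)   # punctuation/digits become a space
    return ' '.join(text.split())           # collapse runs of whitespace

sample = 'Jumbo Salted Peanuts...the peanuts<br />were "small".'
cleaned = strip_markup_and_punct(sample)
print(cleaned)  # jumbo salted peanuts the peanuts were small
```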





Step 4: Feature Extraction (Splitting and TF-IDF)

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# X is the cleaned text, y is the sentiment label
X = df['cleaned_text']
y = df['Sentiment']

# Split the data (80% for training, 20% for testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer(max_features=5000)

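# Learn the vocabulary and IDF weights from the training data only, then transform it
X_train_vec = vectorizer.fit_transform(X_train)

# Transform the test data with the already-fitted vectorizer (no refitting, so no test-set leakage)
X_test_vec = vectorizer.transform(X_test)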

print(f"\nTraining set size: {len(X_train)} reviews")
print(f"Testing set size: {len(X_test)} reviews")
print(f"Shape of X_train_vec (Training Features): {X_train_vec.shape}")
print(f"Shape of X_test_vec (Testing Features): {X_test_vec.shape}")


Training set size: 454741 reviews
Testing set size: 113686 reviews
Shape of X_train_vec (Training Features): (454741, 5000)
Shape of X_test_vec (Testing Features): (113686, 5000)
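
To make the numbers inside those matrices less opaque, here is a small pure-Python sketch of the weighting that TfidfVectorizer applies with its default settings: raw term counts, smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The three toy documents and all names are assumptions for illustration; the real vectorizer also handles tokenization and the max_features cut-off.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already cleaned and whitespace-tokenizable
docs = ["great product great taste", "bad taste", "great price"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

vocab = sorted({tok for doc in tokenized for tok in doc})
# Document frequency: in how many documents each term appears
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
# Smoothed inverse document frequency
idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}

def tfidf_vector(tokens):
    """Return the L2-normalized TF-IDF row for one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[tok] * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, "->", [round(v, 3) for v in vec])
```

Terms that occur in fewer documents ("bad", "price") get a higher IDF than terms that occur in many ("great", "taste"), which is what lets rarer, more distinctive words dominate a review's vector.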


Step 5: Baseline Model (Logistic Regression)

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

print("\nStarting training of Logistic Regression baseline model...")

# Initialize and train the model
logreg = LogisticRegression(solver='lbfgs', max_iter=500, random_state=42, n_jobs=-1)
logreg.fit(X_train_vec, y_train)

# Make predictions and evaluate
y_pred_logreg = logreg.predict(X_test_vec)
accuracy = accuracy_score(y_test, y_pred_logreg)
report = classification_report(y_test, y_pred_logreg)

print("Logistic Regression training complete.")
print("\n--- Logistic Regression Baseline Model Performance ---")
print(f"Overall Accuracy: {accuracy:.4f}")
print("\nClassification Report:\n", report)


Starting training of Logistic Regression baseline model...
Logistic Regression training complete.

--- Logistic Regression Baseline Model Performance ---
Overall Accuracy: 0.8640

Classification Report:
               precision    recall  f1-score   support

    Negative       0.74      0.66      0.70     16402
     Neutral       0.51      0.17      0.26      8528
    Positive       0.89      0.97      0.93     88756

    accuracy                           0.86    113686
   macro avg       0.71      0.60      0.63    113686
weighted avg       0.84      0.86      0.85    113686
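
The gap between the macro and weighted averages in the report comes from the class imbalance. As a quick check, the snippet below recomputes both F1 averages from the rounded per-class values and supports printed above (so the results match the report only to two decimals).

```python
# Per-class F1 and support copied from the classification report above
f1 = {"Negative": 0.70, "Neutral": 0.26, "Positive": 0.93}
support = {"Negative": 16402, "Neutral": 8528, "Positive": 88756}

# Macro average: every class counts equally
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: every class counts in proportion to its support
total = sum(support.values())
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(f"macro F1    = {macro_f1:.2f}")
print(f"weighted F1 = {weighted_f1:.2f}")
```

The weighted figure is dominated by the Positive class, while the macro figure exposes the weak Neutral recall.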



Step 6: Data Prep for Deep Learning (LSTM variables)

In [8]:
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

# Convert sentiment labels to numbers (0, 1, 2)
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Convert numerical labels to one-hot vectors for LSTM (Positive -> [0, 0, 1])
y_train_one_hot = to_categorical(y_train_encoded)
y_test_one_hot = to_categorical(y_test_encoded)

print(f"Label Mapping: {list(label_encoder.classes_)}")
print(f"Shape of One-Hot Encoded Training Labels: {y_train_one_hot.shape}")

Label Mapping: ['Negative', 'Neutral', 'Positive']
Shape of One-Hot Encoded Training Labels: (454741, 3)
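
For reference, the two encoding steps above can be reproduced by hand. The sketch below assumes, as the printed label mapping suggests, that the classes are indexed in sorted order; the four-item label list is a hypothetical example.

```python
labels = ["Positive", "Negative", "Neutral", "Positive"]

# Integer encoding: classes are indexed in sorted (alphabetical) order
classes = sorted(set(labels))
index_of = {name: i for i, name in enumerate(classes)}
encoded = [index_of[name] for name in labels]

# One-hot encoding: a 1.0 at the class index, 0.0 elsewhere
one_hot = [
    [1.0 if i == idx else 0.0 for i in range(len(classes))]
    for idx in encoded
]

print(classes)     # ['Negative', 'Neutral', 'Positive']
print(encoded)     # [2, 0, 1, 2]
print(one_hot[0])  # [0.0, 0.0, 1.0]
```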


Step 7: Final Model Strategy (BERT Data Prep)

In [9]:
from transformers import AutoTokenizer
import torch

# --- 7.1 Initialize BERT Tokenizer ---
tokenizer_name = 'distilbert-base-uncased'
bert_tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
MAX_LEN = 100

# --- 7.2 Tokenize and Encode Data ---
print(f"Tokenizing and encoding training data using {tokenizer_name}...")

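# Encode the training data (PyTorch tensors).
# Note: X_train holds the cleaned text (lowercased, stop words removed, lemmatized), not the raw reviews.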
X_train_enc = bert_tokenizer(
    X_train.tolist(),
    padding='max_length',
    truncation=True,
    max_length=MAX_LEN,
    return_tensors='pt'
)

# Encode the testing data
X_test_enc = bert_tokenizer(
    X_test.tolist(),
    padding='max_length',
    truncation=True,
    max_length=MAX_LEN,
    return_tensors='pt'
)

# --- 7.3 Prepare Labels ---
# Convert the existing encoded labels (0, 1, 2) to PyTorch tensors
y_train_tensor = torch.tensor(y_train_encoded, dtype=torch.long)
y_test_tensor = torch.tensor(y_test_encoded, dtype=torch.long)

print("\nBERT Data Preparation Complete.")
print(f"Training Input IDs shape: {X_train_enc['input_ids'].shape}")
print(f"Testing Input IDs shape: {X_test_enc['input_ids'].shape}")

Tokenizing and encoding training data using distilbert-base-uncased...

BERT Data Preparation Complete.
Training Input IDs shape: torch.Size([454741, 100])
Testing Input IDs shape: torch.Size([113686, 100])
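
Every row in those tensors has exactly MAX_LEN entries because of padding='max_length' and truncation=True. The toy function below illustrates the idea on plain lists of token IDs. It is a simplified, hypothetical helper: the real tokenizer also inserts special tokens and keeps them when truncating, and the pad ID of 0 is an assumption here.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    ids = token_ids[:max_len]         # cut sequences that are too long
    mask = [1] * len(ids)             # 1 marks a real token
    n_pad = max_len - len(ids)        # 0 when the sequence was truncated
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

short_ids, short_mask = pad_or_truncate([7, 8, 9], max_len=5)
long_ids, long_mask = pad_or_truncate(list(range(1, 9)), max_len=5)

print(short_ids, short_mask)  # [7, 8, 9, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
```

The attention mask is what lets the model ignore the padded positions.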


Step 8: BERT Model Definition and Training (Transfer Learning)

In [10]:
from transformers import DistilBertForSequenceClassification
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from torch.optim import AdamW  # Use PyTorch's AdamW implementation
import torch
import numpy as np
from sklearn.metrics import classification_report, accuracy_score
import time
# Assuming X_train_enc, X_test_enc, y_train_tensor, y_test_tensor, and label_encoder are defined from previous steps

# --- 8.1 Setup: Device and Model ---
# Use GPU if available, otherwise CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f'Using GPU: {torch.cuda.get_device_name(0)}')
else:
    device = torch.device("cpu")
    print('Using CPU.')

# 3 classes: Negative (0), Neutral (1), Positive (2)
NUM_LABELS = 3

# Load the pre-trained DistilBERT model for sequence classification
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=NUM_LABELS,
    output_attentions=False,
    output_hidden_states=False
)

# Move the model to the selected device (GPU/CPU)
model.to(device)
print("\nDistilBERT Model loaded successfully.")

# --- 8.2 Create DataLoaders ---
# Combine tokenized features and labels into PyTorch Datasets
train_dataset = TensorDataset(
    X_train_enc['input_ids'],
    X_train_enc['attention_mask'],
    y_train_tensor
)

test_dataset = TensorDataset(
    X_test_enc['input_ids'],
    X_test_enc['attention_mask'],
    y_test_tensor
)

BATCH_SIZE = 64

train_dataloader = DataLoader(
    train_dataset,
    sampler=RandomSampler(train_dataset),
    batch_size=BATCH_SIZE
)

test_dataloader = DataLoader(
    test_dataset,
    sampler=SequentialSampler(test_dataset),
    batch_size=BATCH_SIZE
)

# --- 8.3 Optimizer and Training Configuration ---
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)
EPOCHS = 2

# --- 8.4 Training Loop ---
print(f"\nStarting Fine-tuning for {EPOCHS} epochs...")

for epoch_i in range(EPOCHS):
    print(f'======== Epoch {epoch_i + 1} / {EPOCHS} ========')
    t0 = time.time()
    total_loss = 0
    model.train() # Set the model to training mode

    for step, batch in enumerate(train_dataloader):
        # Move batch data to the device
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        model.zero_grad() # Clear previous gradients

        # Forward pass
        outputs = model(
            b_input_ids,
            attention_mask=b_input_mask,
            labels=b_labels
        )

        loss = outputs.loss
        total_loss += loss.item()

        # Backward pass
        loss.backward()

        # Update parameters
        optimizer.step()

    avg_train_loss = total_loss / len(train_dataloader)
    print(f"  Average training loss: {avg_train_loss:.2f}")
    print(f"  Training epoch took: {time.time() - t0:.2f}s")


# --- 8.5 Evaluation Loop ---
print("\nStarting Model Evaluation on Test Set...")
model.eval() # Set model to evaluation mode

predictions, true_labels = [], []

for batch in test_dataloader:
    # Move batch data to the device
    b_input_ids = batch[0].to(device)
    b_input_mask = batch[1].to(device)
    b_labels = batch[2].to(device)

    with torch.no_grad(): # Disable gradient calculation
        outputs = model(
            b_input_ids,
            attention_mask=b_input_mask
        )

    logits = outputs.logits # Get the raw output scores

    # Move results back to CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()

    # Store predictions and true labels
    predictions.append(logits)
    true_labels.append(label_ids)

# --- 8.6 Final Results ---
# Concatenate the per-batch arrays into one array each
predictions = np.concatenate(predictions, axis=0)
true_labels = np.concatenate(true_labels, axis=0)

# Get the predicted class (index with the highest score); argmax over axis=1 is already 1-D
y_pred_encoded = np.argmax(predictions, axis=1)

# Get the original string labels
class_names = list(label_encoder.classes_)
y_pred_bert = label_encoder.inverse_transform(y_pred_encoded)
y_test_bert = label_encoder.inverse_transform(true_labels)


accuracy = accuracy_score(y_test_bert, y_pred_bert)
report = classification_report(y_test_bert, y_pred_bert, target_names=class_names)

print("\nBERT Fine-tuning and Evaluation Complete.")
print("=" * 40)
print("--- DistilBERT Model Performance ---")
print(f"Overall Accuracy: {accuracy:.4f}")
print("\nClassification Report (BERT):\n", report)

Using GPU: Tesla T4


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



DistilBERT Model loaded successfully.

Starting Fine-tuning for 2 epochs...
  Average training loss: 0.34
  Training epoch took: 3476.98s
  Average training loss: 0.24
  Training epoch took: 3484.68s

Starting Model Evaluation on Test Set...

BERT Fine-tuning and Evaluation Complete.
--- DistilBERT Model Performance ---
Overall Accuracy: 0.9104

Classification Report (BERT):
               precision    recall  f1-score   support

    Negative       0.83      0.81      0.82     16402
     Neutral       0.57      0.56      0.57      8528
    Positive       0.96      0.96      0.96     88756

    accuracy                           0.91    113686
   macro avg       0.79      0.78      0.78    113686
weighted avg       0.91      0.91      0.91    113686
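
To put the two reports side by side, the snippet below computes the per-class F1 gain of DistilBERT over the logistic regression baseline, using only the rounded values printed in the two classification reports.

```python
# Per-class F1 copied from the two classification reports in this notebook
baseline_f1 = {"Negative": 0.70, "Neutral": 0.26, "Positive": 0.93}
distilbert_f1 = {"Negative": 0.82, "Neutral": 0.57, "Positive": 0.96}

gains = {
    label: round(distilbert_f1[label] - baseline_f1[label], 2)
    for label in baseline_f1
}
largest_gain = max(gains, key=gains.get)

for label, gain in gains.items():
    print(f"{label:<9} F1 gain = {gain:+.2f}")
print("Largest improvement:", largest_gain)
```

The largest gain is on the Neutral class, which was the weakest class for the baseline.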

