<h1 align="center">
    Improving Sentiment Prediction in the Pet Market <br> 
    with Active Learning and BERTimbau
</h1>

*******************************************************************************************************************************

<h2>1. Introduction</h2>

In this project, we aim to build a sentiment analysis model for reviews collected from businesses in the **pet care sector** in Santo André, SP, using **Active Learning** and **BERTimbau**. The primary goal is to **predict the ratings** (from 1 to 5 stars) based on the sentiment expressed in the reviews. We leverage BERTimbau, a **transformer model** specifically trained on Brazilian Portuguese, to capture the nuances of the language used in the reviews.

The dataset is **small** and highly **imbalanced**, with many reviews rated 1 and 5 stars, and fewer in the middle. Additionally, the rating provided by the customer often does not reflect the sentiment in the review, as the text tends to describe their experience in a **subjective** manner. For this reason, BERTimbau was chosen, as it is capable of capturing **complex sentiments**, enabling the model to better understand the full range of emotions expressed.

To address the challenge of scarce and imbalanced labeled data, we implement Active Learning, a technique that iteratively selects the most **uncertain examples** for manual labeling and retraining. Starting with a small set of labeled data and progressively adding the most informative examples lets the model learn more efficiently and improve with each iteration.

Overall, this project shows that combining BERTimbau with Active Learning is an efficient way to train sentiment analysis models when labeled data is limited. By concentrating labeling effort on the most informative examples, the model improves with each iteration and produces more accurate sentiment predictions. This allows businesses in the pet care sector to better interpret **customer feedback**, turning subjective reviews into **reliable information** that can support **service improvements** and customer engagement.

<h2>2. Initialization</h2>

In [None]:
# Library Imports
import os
import time
import json
import random
import csv

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.ticker import MaxNLocator

from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.metrics import (
    accuracy_score, 
    classification_report, 
    confusion_matrix, 
    mean_absolute_error, 
    mean_squared_error
)

from scipy.stats import entropy

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset

from transformers import (
    BertTokenizer, 
    BertForSequenceClassification, 
    Trainer, 
    TrainingArguments, 
    AdamW
)

import accelerate
from IPython.display import clear_output

In [None]:
# Configure Pandas display: show all columns and suppress chained assignment warnings
pd.set_option("display.max_columns", None)
pd.set_option('display.max_colwidth', None)
pd.options.mode.chained_assignment = None

<h2>3. Load the Dataset</h2>

In [None]:
PATH = os.path.abspath(os.path.join("..", "data", "processed", "reviews_processed.csv"))

In [None]:
reviews_df = pd.read_csv(PATH, sep=";", header=0, encoding="utf-8")

In [None]:
reviews_df.shape

In [None]:
reviews_df.columns

In [None]:
reviews_df.head(3)

<h2>4. Explore the Data</h2>

<h3>4.1 Pre-analysis of the Data</h3>

In [None]:
# Summary of the DataFrame (data types, non-null count, memory)
reviews_df.info()

In [None]:
# Summary statistics for numerical columns (mean, std, min, max, etc.)
reviews_df.describe()

<h3>4.2 Distribution of Ratings</h3>

In [None]:
def save_plot(fig, filename="rating_distribution.png"):
    """
    Saves the given figure to a file.

    Parameters:
    fig (matplotlib.figure.Figure): The figure to save.
    filename (str): The name of the file to save the plot to (default is 'rating_distribution.png').
    """
    # Save the figure to the specified file
    fig.savefig(filename, bbox_inches='tight')  # Save with tight bounding box to avoid clipping

    # Close the figure (so it doesn't display in the notebook)
    plt.close(fig)

In [None]:
def plot_bar(x, y):
    """
    Creates a bar plot to visualize the distribution of ratings.

    Parameters:
    x (list or array-like): The categories for the x-axis (e.g., rating values).
    y (list or array-like): The corresponding counts or frequencies for each category.
    """

    # Create a figure and axis with a predefined size
    fig, ax = plt.subplots(figsize=(6, 4))
    
    # Define the width of the bars
    bar_width = 0.4  

    # Set the title and label for the x-axis
    ax.set_title("Distribution of Ratings", fontfamily='Arial Rounded MT Bold', fontsize=16, pad=15)
    ax.set_xlabel("Rating", fontsize=12)

    # Add a grid for better readability (dashed lines, gray color, semi-transparent)
    ax.grid(visible=True, linestyle='--', color='gray', alpha=0.7)

    # Convert the x values to a NumPy array to ensure compatibility with plotting
    x = np.array(x)  

    # Create the bar plot with a custom color and border
    bars = ax.bar(x, y, width=bar_width, color="#ff6d0a", edgecolor='#e6e6e6')

    # Annotate each bar with its height (value)
    for bar in bars:
        yval = bar.get_height()  # Retrieve the height of the bar
        ax.text(
            bar.get_x() + bar.get_width() / 2, yval, round(yval, 2), 
            ha='center', va='bottom', color='black', fontsize=10
        )  

    # Set the x-axis tick labels based on the provided x values
    ax.set_xticks(x)

    # Display the plot
    plt.show()

    # Save the plot
    save_plot(fig, os.path.join("..", "results", "figures", "rating_distribution.png")) # Save as a PNG file

In [None]:
# Plot bar chart of rating distribution
plot_bar(
    reviews_df['Rating'].value_counts().sort_index().index,
    reviews_df['Rating'].value_counts().sort_index().values
)
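
The bar chart shows the imbalance visually; the same information can also be summarized as numbers. The sketch below is a minimal, standard-library illustration on a made-up list of ratings. The example values and the helper name `rating_shares` are assumptions for illustration and are not taken from the dataset.

```python
from collections import Counter


def rating_shares(ratings):
    """Return {rating: share of all reviews} for ratings 1-5, including empty classes."""
    counts = Counter(ratings)
    total = len(ratings)
    return {star: counts.get(star, 0) / total for star in range(1, 6)}


# Hypothetical ratings, chosen only to mimic a distribution heavy at 1 and 5 stars
example_ratings = [5, 5, 5, 5, 1, 1, 1, 3, 4, 2]
shares = rating_shares(example_ratings)

# Ratio between the largest and the smallest non-empty class
non_empty = [share for share in shares.values() if share > 0]
imbalance_ratio = max(non_empty) / min(non_empty)

print(shares)
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

With the real data, `reviews_df['Rating'].tolist()` could be passed in place of the example list.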

<h3>4.3 Distribution of Word Count</h3>

In [None]:
def plot_hist(x):
    """
    Plots a histogram showing the distribution of word count in reviews with a logarithmic y-axis.

    Parameters:
    - x (list or np.array): List or array of word counts from the dataset.

    Returns:
    - Displays a histogram with improved styling and a log-scaled y-axis, and saves the plot.
    """

    # Convert input data to a NumPy array for better performance and compatibility
    x = np.array(x)  

    # Create a figure and axis with a specified size
    fig, ax = plt.subplots(figsize=(8, 5))

    # Plot the histogram with 100 bins, custom color, transparency, and black edges
    ax.hist(x, bins=100, color="#1f77b4", alpha=0.75, edgecolor="black")

    # Set the y-axis to a logarithmic scale for better visualization of frequency distribution
    ax.set_yscale("log")

    # Set title and axis labels with appropriate fonts and sizes
    ax.set_title("Distribution of Word Count in Reviews (Log Scale on Y)", 
                 fontfamily="Arial Rounded MT Bold", fontsize=16, pad=15)
    ax.set_xlabel("Number of Words", fontsize=12)
    ax.set_ylabel("Frequency (log scale)", fontsize=12)

    # Add a dashed grid on the y-axis for better readability
    ax.grid(axis="y", linestyle="--", color="gray", alpha=0.6)

    # Adjust layout to prevent overlapping elements
    plt.tight_layout()

    # Display the histogram
    plt.show()

    # Save the plot
    save_plot(fig, os.path.join("..", "results", "figures", "word_count_distribution.png")) # Save as a PNG file

In [None]:
# Plot histogram of word count distribution
plot_hist(reviews_df['Word Count'])

<h3>4.4 Outliers in Word Count</h3>

In [None]:
def plot_outliers(x):
    """
    Plots a boxplot showing outliers in the distribution of word count in reviews,
    with quartile values displayed in a legend instead of on the graph.

    Parameters:
    - x (list or np.array): List of word counts from the dataset.

    Returns:
    - Displays a boxplot highlighting the outliers and quartile values in a legend.
    """
    x = np.array(x)  # Convert input to numpy array for better performance

    # Compute quartiles and median
    Q1 = np.percentile(x, 25)  # First quartile (25%)
    Q3 = np.percentile(x, 75)  # Third quartile (75%)
    median = np.median(x)      # Median (50%)

    # Create figure and axis
    fig, ax = plt.subplots(figsize=(8, 5))

    # Plot boxplot
    box = ax.boxplot(x, vert=False, patch_artist=True, 
                     boxprops=dict(facecolor="#87ceeb", color="black"), 
                     whiskerprops=dict(color="black"),
                     capprops=dict(color="black"),
                     medianprops=dict(color="#d62728", linewidth=2),
                     flierprops=dict(marker='o', markerfacecolor='#d62728', markersize=6, linestyle='none'))

    # Titles and labels
    ax.set_title("Outliers in Word Count Distribution", fontfamily="Arial Rounded MT Bold", fontsize=16, pad=15)
    ax.set_xlabel("Number of Words", fontsize=12)

    # Grid styling
    ax.grid(axis="x", linestyle="--", color="gray", alpha=0.6)

    # Create a legend to display quartile values
    legend_text = f"Q1: {Q1:.1f}\nMedian: {median:.1f}\nQ3: {Q3:.1f}"
    ax.legend([legend_text], loc="upper right", fontsize=10, frameon=True, edgecolor="black")

    # Adjust layout for better spacing
    plt.tight_layout()
    plt.show()

    # Save the plot
    save_plot(fig, os.path.join("..", "results", "figures", "word_count_outliers.png")) # Save as a PNG file

In [None]:
# Plot outliers in the 'Word Count' column
plot_outliers(reviews_df['Word Count'])
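
By default, a matplotlib boxplot draws its whiskers up to 1.5 times the interquartile range (IQR) beyond the quartiles and marks anything further out as an outlier. The sketch below reproduces that rule with the standard library so the cut-off values can be inspected directly. The helper name `tukey_fences` and the sample word counts are assumptions for illustration.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # method="inclusive" interpolates linearly between data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical word counts with one very long review
word_counts = [1, 2, 3, 4, 5, 6, 7, 8, 100]
lower, upper = tukey_fences(word_counts)
outliers = [count for count in word_counts if count < lower or count > upper]

print(lower, upper, outliers)
```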

<h2>5. Preprocessing</h2>

<h3>5.1 Adding 'Predicted Rating' Column</h3>

In [None]:
# Check if the "Predicted Rating" column exists; if not, insert it after the "Rating" column
if "Predicted Rating" not in reviews_df.columns:
    reviews_df.insert(reviews_df.columns.get_loc("Rating") + 1, "Predicted Rating", pd.NA)

In [None]:
# Ensure "Predicted Rating" is of type Int64 (nullable integer)
reviews_df["Predicted Rating"] = reviews_df["Predicted Rating"].astype("Int64")

<h3>5.2 Selecting Initial Labeled Data</h3>

In [None]:
# Define the output path
LABELED_DATA_PATH = os.path.join("..", "data", "active_learning", "initial_labeled_data.csv")

**Note:**  
Run the following code only once at the beginning to select the initial labeled data.

In [None]:
'''# Select an initial sample of 500 reviews for manual labeling  
initial_labeled_data = reviews_df.sample(n=500, random_state=42)  

# Save the selected data to a CSV file for manual labeling  
initial_labeled_data.to_csv(LABELED_DATA_PATH, sep=";", index=False, encoding="utf-8")'''

In [None]:
def manual_prediction_entry(file_path):
    """
    Iterates through each row of the DataFrame, displaying the 'Text' and 'Rating' columns.
    Allows the user to input a 'Predicted Rating' value, which is then stored in the DataFrame.
    The file is updated after each entry to avoid data loss.
    The user can exit by typing 'q' or 'exit'.

    Parameters:
        file_path (str): Path to the CSV file containing the dataset.
    
    Returns:
        None
    """

    # Load the dataset from the specified CSV file
    # Ensure 'Predicted Rating' is read as an integer, allowing missing values (NaN)
    df = pd.read_csv(file_path, sep=";", encoding="utf-8", dtype={"Predicted Rating": "Int64"})

    # Get total number of reviews for progress tracking
    total_reviews = len(df)

    # Iterate through each row in the DataFrame
    for idx, (index, row) in enumerate(df.iterrows(), start=1):

        # Skip rows where 'Predicted Rating' is already filled
        if pd.notna(row["Predicted Rating"]):
            continue

        # Clear the screen to improve readability
        try:
            clear_output(wait=False)  # Works in Jupyter Notebook
        except NameError:
            os.system("cls" if os.name == "nt" else "clear")  # Works in Terminal

        time.sleep(0.5)  # Small delay for better user experience
        
        # Display review information
        print(f"\nReview {idx} of {total_reviews}")  
        print(f"Review Text: {row['Text']}")
        print(f"Actual Rating: {row['Rating']}")
        
        # Loop until the user provides a valid input
        while True:
            user_input = input("Enter Predicted Rating (1-5) or type 'q' to exit: ").strip()

            # Allow user to exit by typing 'q' or 'exit'
            if user_input.lower() in ["q", "exit"]:
                print("Exiting...")
                df.to_csv(file_path, index=False, sep=";", encoding="utf-8")  # Save progress before exiting
                return  # Stop execution

            try:
                predicted = int(user_input)  # Convert input to integer
                if 1 <= predicted <= 5:  # Ensure rating is within the valid range
                    df.at[index, "Predicted Rating"] = predicted
                    break  # Exit the loop if input is valid
                else:
                    print("Invalid input! Please enter a number between 1 and 5.")
            except ValueError:
                print("Invalid input! Please enter a number or 'q' to exit.")
        
        # Save updated DataFrame to CSV after each valid input to prevent data loss
        df.to_csv(file_path, index=False, sep=";", encoding="utf-8")
        print("Data saved successfully!")

    print("\nAll reviews have been processed.")  # Message displayed when all reviews are labeled

In [None]:
# Run manual prediction entry using LABELED_DATA_PATH.
manual_prediction_entry(LABELED_DATA_PATH)
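
The validation inside the labeling loop can also be expressed as a small pure function, which makes it easy to check without typing into a prompt. This is only a sketch of one possible refactoring; `parse_rating` is a hypothetical helper and is not part of the notebook.

```python
def parse_rating(user_input, low=1, high=5):
    """Return the rating as an int, or None if the input is not a whole number in [low, high]."""
    try:
        value = int(user_input.strip())
    except ValueError:
        return None
    return value if low <= value <= high else None


# A few representative inputs
for raw in [" 4 ", "6", "abc", "3.5"]:
    print(repr(raw), "->", parse_rating(raw))
```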

<h2>6. Sentiment Analysis Model</h2>

<h3>6.1 Load BERTimbau Model and Tokenizer</h3>

In [None]:
# Define the model name and load the corresponding tokenizer
MODEL_NAME = "neuralmind/bert-base-portuguese-cased"  # Pretrained BERT model for Portuguese
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)  # Load the tokenizer for the model

<h3>6.2 Saving and Loading Model</h3>

In [None]:
# Define the file path to save or load the intermediate model
SAVED_MODEL_PATH = os.path.join("..", "models", "sentiment_model.pt")  # Path for saving the model

In [None]:
def save_model(model, model_path):
    """ 
    Saves the model's state dictionary to a specified file path.

    Args:
        model (nn.Module): The model to be saved.
        model_path (str): The path where the model will be saved.

    Returns:
        None
    """
    # Save the model's state dictionary
    torch.save(model.state_dict(), model_path)
    print(f"Model saved to {model_path}")

In [None]:
def load_model(model, model_path):
    """ 
    Loads a pretrained model from a specified file path.

    This function loads the model's state dictionary from the given path and sets the model to evaluation mode.

    Args:
        model (nn.Module): The model to be loaded.
        model_path (str): The path to the saved model file.

    Returns:
        nn.Module: The loaded model set to evaluation mode.
    """
    # Load the model's state dictionary
    model.load_state_dict(torch.load(model_path, weights_only=True))
    
    # Set the model to evaluation mode
    model.eval()
    print(f"Model loaded from {model_path}")
    
    return model

<h3>6.3 Saving and Loading Training State</h3>

In [None]:
# Define the file path to save or load the active learning state
STATE_FILE = os.path.join("..", "data", "active_learning", "active_learning_state.json")  # Path for saving the active learning state

In [None]:
def save_state(labeled_texts, labeled_labels, remaining_indices, current_iteration, state_file=STATE_FILE):
    """ 
    Saves the current state of the Active Learning process.

    This function creates a dictionary containing the current labeled texts, labels, remaining indices, 
    and the current iteration. It then attempts to save this state as a JSON file.

    Args:
        labeled_texts (list): List of texts that have been labeled.
        labeled_labels (list): Corresponding labels for the texts.
        remaining_indices (list): Indices of the remaining unlabeled data.
        current_iteration (int): The current iteration of the Active Learning loop.
        state_file (str): Path to the file where the state will be saved (default is STATE_FILE).
    """
    # Create a dictionary with the current state
    state = {
        "labeled_texts": labeled_texts,
        "labeled_labels": labeled_labels,
        "remaining_indices": remaining_indices,
        "current_iteration": current_iteration
    }
    
    try:
        # Save the state as a JSON file
        with open(state_file, "w", encoding="utf-8") as f:
            json.dump(state, f, ensure_ascii=False, indent=4)
        print("State successfully saved!")
    except Exception as e:
        # Handle errors during the saving process
        print(f"Error saving state: {e}")

In [None]:
def load_state():
    """ 
    Loads the last saved state of the Active Learning process, if available.
    
    Checks if the state file exists, and loads its contents if so. Returns None if the file doesn't exist.

    Returns:
        dict or None: The saved state as a dictionary, or None if no saved state exists.
    """
    # Check if the state file exists
    if os.path.exists(STATE_FILE):
        # Open and load the JSON content of the state file
        with open(STATE_FILE, "r", encoding="utf-8") as f:
            return json.load(f)
    # Return None if the state file doesn't exist
    return None
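
Because the state file is rewritten on every iteration, an interruption in the middle of a write could leave a truncated JSON file. One common way to reduce that risk is to write to a temporary file and rename it over the target. The sketch below illustrates the idea under that assumption; `save_state_atomic` is a hypothetical name, and the example state is made up.

```python
import json
import os
import tempfile


def save_state_atomic(state, state_file):
    """Write `state` as JSON to a temporary file, then rename it over `state_file`."""
    directory = os.path.dirname(os.path.abspath(state_file))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f, ensure_ascii=False, indent=4)
        os.replace(tmp_path, state_file)  # Rename within the same directory
    except BaseException:
        # Clean up the temporary file if anything went wrong, then re-raise
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Round trip in a throwaway directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "active_learning_state.json")
    example_state = {
        "labeled_texts": ["ótimo atendimento"],
        "labeled_labels": [4],
        "remaining_indices": [0, 2, 3],
        "current_iteration": 1,
    }
    save_state_atomic(example_state, path)
    with open(path, "r", encoding="utf-8") as f:
        restored = json.load(f)

print(restored == example_state)
```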

<h3>6.4 Loading Labeled Data</h3>

In [None]:
def load_labeled_data():
    """
    Loads labeled data from a CSV file.

    If the labeled data file exists, it loads the rows that already have a 'Predicted Rating'
    and returns their review IDs, texts, and ratings.
    If the file does not exist, it prints a warning and returns three empty lists.

    Returns:
        review_ids: List of review IDs.
        texts: List of text reviews.
        labels: List of corresponding ratings (1-5; callers shift them to 0-4).
    """
    if os.path.exists(LABELED_DATA_PATH):
        df = pd.read_csv(LABELED_DATA_PATH, sep=";", encoding="utf-8", dtype={"Predicted Rating": "Int64"})
        
        # Remove NaN values from labeled data
        df = df.dropna(subset=["Predicted Rating"])
        
        print(f"{len(df)} labeled examples loaded.")
        
        return df["Review ID"].tolist(), df["Text"].tolist(), df["Predicted Rating"].astype(int).tolist()
    else:
        print("No labeled data found. Starting from scratch.")
        return [], [], []  # Three values, matching the unpacking used by callers

<h3>6.5 Tokenization and Dataset Preparation</h3>

In [None]:
class ReviewDataset(Dataset):
    """
    Custom Dataset for text reviews, prepares data for tokenization and model input.

    Args:
        texts: List of text reviews.
        labels: List of labels corresponding to each review (optional).
        tokenizer: Tokenizer used for text preprocessing (default: `tokenizer`).
        max_length: Maximum length for the tokenized sequences (default: 512).
    """
    def __init__(self, texts, labels=None, tokenizer=tokenizer, max_length=512):
        self.texts = texts  # Store the texts
        self.labels = labels  # Store the labels (optional)
        self.tokenizer = tokenizer  # Store the tokenizer
        self.max_length = max_length  # Store max length for padding/truncating

    def __len__(self):
        """
        Returns the number of samples in the dataset.
        """
        return len(self.texts)

    def __getitem__(self, idx):
        """
        Retrieves a single item (text and its label) and tokenizes it.
        
        Args:
            idx: Index of the sample to fetch.

        Returns:
            item: Dictionary containing the tokenized input and its label (if available).
        """
        # Tokenize the text at index `idx`
        encoding = self.tokenizer(
            self.texts[idx], 
            truncation=True, 
            padding='max_length', 
            max_length=self.max_length, 
            return_tensors="pt"  # Return as PyTorch tensors
        )
        
        # Squeeze the tensor to remove unnecessary dimensions
        item = {key: val.squeeze(0) for key, val in encoding.items()}
        
        # If labels are provided, add the label to the item
        if self.labels is not None:
            item["labels"] = torch.tensor(self.labels[idx], dtype=torch.long)
        
        return item  # Return the tokenized item (text and label)

<h3>6.6 Define Active Learning Strategy (Entropy-Based)</h3>

In [None]:
@torch.no_grad()
def compute_entropy(model, dataset, batch_size=16, device="cuda"):
    """
    Computes entropy for each example in the dataset using the model's predictions.
    Entropy is used to measure uncertainty, helping to identify uncertain examples for Active Learning.

    Args:
        model: Pre-trained model used for predictions.
        dataset: Dataset containing the text samples for which entropy is calculated.
        batch_size: Number of samples per batch for processing.
        device: The device for computation (e.g., 'cuda').

    Returns:
        np.array: An array of entropy values for each sample.
    """
    model.eval()  # Set model to evaluation mode
    dataloader = DataLoader(dataset, batch_size=batch_size)  # Create DataLoader for batching
    entropy_values = []

    # Loop through each batch in the dataset
    for batch in dataloader:
        # Move batch data to the specified device
        batch = {key: val.to(device) for key, val in batch.items() if key != "labels"}
        outputs = model(**batch)  # Get model predictions
        probs = F.softmax(outputs.logits, dim=-1)  # Apply softmax to get probabilities
        entropy = -torch.sum(probs * torch.log(probs + 1e-10), dim=-1)  # Compute entropy (avoid log(0))
        entropy_values.extend(entropy.cpu().numpy())  # Collect entropy values for each example

        # Free memory
        del batch, outputs, probs, entropy
        torch.cuda.empty_cache()

    return np.array(entropy_values)  # Return entropy values for all samples
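
For a probability vector $p$ over the five rating classes, the quantity computed above is $H(p) = -\sum_i p_i \log p_i$ (natural logarithm). It is largest, $\log 5 \approx 1.609$, when all classes are equally likely, and close to zero when the model is confident. The short standard-library sketch below illustrates this on two made-up distributions; `shannon_entropy` is a hypothetical helper name.

```python
import math


def shannon_entropy(probs, eps=1e-10):
    """Entropy in nats of a discrete distribution; eps mirrors the guard against log(0)."""
    return -sum(p * math.log(p + eps) for p in probs)


uniform = [0.2] * 5                         # Model has no preference among the 5 classes
confident = [0.96, 0.01, 0.01, 0.01, 0.01]  # Model strongly favors one class

print(f"uniform:   {shannon_entropy(uniform):.3f}")
print(f"confident: {shannon_entropy(confident):.3f}")
print(f"maximum for 5 classes: {math.log(5):.3f}")
```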

In [None]:
def select_uncertain_examples(model, unlabeled_texts, n_samples=100, batch_size=16, device="cuda"):
    """
    Selects the most uncertain examples based on entropy for Active Learning.

    Args:
        model: Pre-trained model used for predictions.
        unlabeled_texts: List of texts that have not been labeled yet.
        n_samples: Number of uncertain samples to select.
        batch_size: Number of samples per batch for processing.
        device: The device for computation (e.g., 'cuda').

    Returns:
        np.array: Indices of the most uncertain examples in the dataset.
    """
    dataset = ReviewDataset(unlabeled_texts, tokenizer=tokenizer)  # Prepare dataset
    entropy_values = compute_entropy(model, dataset, batch_size=batch_size, device=device)  # Get entropy for each text
    uncertain_indices = np.argsort(entropy_values)[-n_samples:]  # Select indices with highest entropy (uncertainty)
    return uncertain_indices  # Return indices of uncertain examples
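
Note that `np.argsort(entropy_values)[-n_samples:]` returns the selected positions in ascending order of entropy, so the most uncertain example comes last. If the ordering matters, a "most uncertain first" selection can be sketched with the standard library as below. The helper name `top_k_uncertain` and the scores are assumptions for illustration.

```python
import heapq


def top_k_uncertain(entropy_values, k):
    """Return the positions of the k largest entropy values, most uncertain first."""
    return heapq.nlargest(k, range(len(entropy_values)), key=entropy_values.__getitem__)


# Hypothetical entropy scores for six unlabeled reviews
scores = [0.10, 1.55, 0.80, 1.60, 0.05, 1.20]

print(top_k_uncertain(scores, 3))
```

If `k` exceeds the number of scores, all positions are returned, which matches the behavior of the slicing approach on a small pool.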

<h3>6.7 Fine-Tuning the Model</h3>

In [None]:
def fine_tune_model(model, train_texts, train_labels, batch_size=16, epochs=3, device="cuda"):
    """
    Fine-tunes the pre-trained model on the provided training data.

    Args:
        model: Pre-trained model to be fine-tuned.
        train_texts: List of training texts.
        train_labels: List of labels for training data.
        batch_size: Number of samples per batch for training.
        epochs: Number of epochs to train the model.
        device: The device for computation (e.g., 'cuda').

    Returns:
        model: The fine-tuned model.
    """
    train_dataset = ReviewDataset(train_texts, train_labels)  # Prepare the training dataset

    # Set up training arguments
    training_args = TrainingArguments(
        output_dir="./results",  # Directory to save results
        # Evaluation during training is left at its default ("no"); the name of this argument differs across transformers versions
        per_device_train_batch_size=batch_size,  # Batch size for training
        num_train_epochs=epochs,  # Number of training epochs
        save_strategy="no",  # No saving during training
        logging_dir="./logs",  # Directory to save logs
        logging_steps=10,  # Log every 10 steps
        report_to="none"  # Disable reporting
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset  # Dataset to train on
    )

    trainer.train()  # Start training
    return model # Return the fine-tuned model

<h3>6.8 Active Learning Loop</h3>

In [None]:
def active_learning_loop(reviews_df, iterations=20, samples_per_iteration=100, batch_size=16, device="cuda", model_path=SAVED_MODEL_PATH):
    """
    Run the Active Learning loop, selecting uncertain examples and fine-tuning the model iteratively.
    After each iteration, manually label new examples, update the training set, and re-train the model.

    Args:
        reviews_df: DataFrame with the unlabeled reviews.
        iterations: Number of Active Learning iterations.
        samples_per_iteration: Number of uncertain examples to select each iteration.
        batch_size: Number of examples per batch during fine-tuning.
        device: The device for computation (e.g., 'cuda').
        model_path: Path for saving and loading the model.

    Returns:
        model: The fine-tuned model after completing the Active Learning loop.
    """
    _ , labeled_texts, labeled_labels = load_labeled_data()  # Load labeled data
    labeled_labels = [label - 1 for label in labeled_labels]  # Adjust labels to range 0-4

    saved_state = load_state()  # Check if there is a saved state to resume
    if saved_state:
        print("\nResuming from last saved state...")
        labeled_texts = saved_state["labeled_texts"]
        labeled_labels = saved_state["labeled_labels"]
        remaining_indices = saved_state["remaining_indices"]
        start_iteration = saved_state["current_iteration"] + 1
    else:
        remaining_indices = reviews_df.index.tolist()  # Start with all reviews
        start_iteration = 0

    model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=5).to(device)

    # If a saved model exists, load it
    if os.path.exists(model_path):
        model = load_model(model, model_path)
    else:
        print("No saved model found, starting from the pretrained BERTimbau weights.")

    for i in range(start_iteration, iterations):
        print(f"\nIteration {i+1}/{iterations}")

        # Fine-tune the model
        model = fine_tune_model(model, labeled_texts, labeled_labels, batch_size=batch_size, device=device)

        # Load or create labeled data CSV
        if os.path.exists(LABELED_DATA_PATH):
            labeled_df = pd.read_csv(LABELED_DATA_PATH, sep=";", encoding="utf-8", dtype={"Predicted Rating": "Int64"})
        else:
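            # De-duplicate column names: reviews_df may already contain "Predicted Rating"
            labeled_df = pd.DataFrame(columns=list(dict.fromkeys(reviews_df.columns.tolist() + ["Predicted Rating"])))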

        # Check how many examples from the last batch are still unlabeled
        unlabeled_in_batch = labeled_df[labeled_df["Predicted Rating"].isna()]

        if len(unlabeled_in_batch) > 0:
            print(f"{len(unlabeled_in_batch)} examples still need labeling before selecting new ones.")
            manual_prediction_entry(LABELED_DATA_PATH)  # Ensure user completes labeling before proceeding
            labeled_df = pd.read_csv(LABELED_DATA_PATH, sep=";", encoding="utf-8", dtype={"Predicted Rating": "Int64"})

        # Select new uncertain examples **only if all previous ones are labeled**
        if labeled_df["Predicted Rating"].isna().sum() == 0:
            if remaining_indices:
                # Select up to samples_per_iteration of the most uncertain examples
                uncertain_indices = select_uncertain_examples(
                    model, reviews_df.loc[remaining_indices, "Text"].tolist(), samples_per_iteration, batch_size, device
                )[:samples_per_iteration]  # Guard: never more than samples_per_iteration indices
                
                new_indices = [remaining_indices[idx] for idx in uncertain_indices]

                # Avoid duplicates before adding
                new_examples = reviews_df.loc[new_indices].copy()
                new_examples["Predicted Rating"] = None

                existing_review_ids = set(labeled_df["Review ID"].dropna().tolist())  # Avoid re-adding existing IDs
                new_examples = new_examples[~new_examples["Review ID"].isin(existing_review_ids)]

                # Cap the batch at samples_per_iteration examples
                if len(new_examples) > samples_per_iteration:
                    new_examples = new_examples.sample(n=samples_per_iteration, random_state=42)

                labeled_df = pd.concat([labeled_df, new_examples], ignore_index=True)
                labeled_df.to_csv(LABELED_DATA_PATH, index=False, sep=";", encoding="utf-8")

                print(f"Added {len(new_examples)} new examples for manual labeling.")
        
        # Perform manual labeling
        manual_prediction_entry(LABELED_DATA_PATH)

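        # Reload labeled data
        labeled_df = pd.read_csv(LABELED_DATA_PATH, sep=";", encoding="utf-8", dtype={"Predicted Rating": "Int64"})
        all_labeled = labeled_df.dropna(subset=["Predicted Rating"])

        # Rebuild the training set from the full labeled file; extending the lists here
        # would duplicate every previously labeled example on each iteration
        labeled_texts = all_labeled["Text"].tolist()
        labeled_labels = [int(label) - 1 for label in all_labeled["Predicted Rating"].tolist()]  # Map ratings 1-5 to classes 0-4

        # Remove labeled examples from the remaining list, matching on "Review ID"
        # (the row index of the labeled file does not correspond to the index of reviews_df)
        labeled_ids = set(all_labeled["Review ID"].tolist())
        remaining_indices = [idx for idx in remaining_indices if reviews_df.at[idx, "Review ID"] not in labeled_ids]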

        # Save progress
        save_state(labeled_texts, labeled_labels, remaining_indices, i)
        print("Progress saved!")

        # Save the model after each iteration
        save_model(model, model_path)

        # Free memory
        torch.cuda.empty_cache()

    print("\nActive Learning process completed!")
    return model
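
The selection step works on positions within the current pool of unlabeled texts, while the DataFrame is addressed by its own index, so each iteration has to translate one into the other. The following standalone sketch shows that bookkeeping on made-up values; none of the numbers come from the dataset.

```python
# Original DataFrame indices of the reviews that are still unlabeled (hypothetical values)
remaining_indices = [4, 9, 17, 23, 42]

# Positions within the pool returned by the selection step (hypothetical values)
uncertain_positions = [3, 0]

# Map pool positions back to DataFrame indices
new_indices = [remaining_indices[pos] for pos in uncertain_positions]

# Drop the selected reviews from the pool; a set keeps the membership test O(1)
selected = set(new_indices)
remaining_indices = [idx for idx in remaining_indices if idx not in selected]

print(new_indices)
print(remaining_indices)
```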

In [None]:
# Run the active learning loop to train and update the model iteratively
final_model = active_learning_loop(reviews_df, iterations=20, samples_per_iteration=100)

<h2>7. Model Evaluation</h2>

<h3>7.1 Load the Trained Model</h3>

In [None]:
def load_trained_model(model_path, device="cuda"):
    """
    Loads a pre-trained BERT model from the specified path and moves it to the specified device.
    
    Args:
        model_path (str): Path to the saved model weights.
        device (str): Device to load the model on ("cuda" for GPU, "cpu" for CPU).
    
    Returns:
        model (BertForSequenceClassification): The loaded BERT model ready for inference.
    """
    model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=5)
    model.load_state_dict(torch.load(model_path, weights_only=True, map_location=device))  # Load saved weights
    model.to(device)  # Move the model to the specified device (GPU/CPU)
    model.eval()  # Set the model to evaluation mode (disables dropout)
    return model

<h3>7.2 Load Data</h3>

In [None]:
# Load labeled data
_ , texts, labels = load_labeled_data()

# Adjust labels (BERT expects labels to be 0-indexed)
labels = [label - 1 for label in labels]

# Convert texts and labels to numpy arrays
texts = np.array(texts)
labels = np.array(labels)

<h3>7.3 Initialize Tokenizer</h3>

In [None]:
# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = load_trained_model(SAVED_MODEL_PATH)

<h3>7.4 Preprocess Texts</h3>

In [None]:
# Function to preprocess the texts
def preprocess_texts(texts, tokenizer, max_length=256):
    """
    Tokenizes and preprocesses input texts to be compatible with BERT's input format.
    
    Args:
        texts (list): A list of text samples to be tokenized.
        tokenizer (BertTokenizer): A pre-trained BERT tokenizer.
        max_length (int): The maximum length to pad/truncate the sequences to.
    
    Returns:
        encodings (BatchEncoding): Tokenized inputs as PyTorch tensors.
    """
    encodings = tokenizer(
        texts,
        truncation=True,  # Truncate texts that exceed max_length
        padding=True,  # Pad shorter texts to the longest text in the batch
        max_length=max_length,  # Upper bound applied by truncation
        return_tensors="pt"  # Return tensors for PyTorch
    )
    return encodings
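
To make the padding behaviour concrete, the toy sketch below mimics `truncation=True, padding=True` on plain lists of token IDs, without the real tokenizer. `pad_and_truncate` and the pad ID `0` are hypothetical stand-ins for illustration, and the slice-based truncation is a simplification (a real BERT tokenizer also keeps the closing special token).

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Cut every sequence to max_length, then pad to the longest sequence left."""
    truncated = [seq[:max_length] for seq in token_ids]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return input_ids, attention_mask


batch = [[101, 7, 8, 102], [101, 9, 102], [101, 4, 5, 6, 3, 2, 102]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=5)
print(input_ids)
print(attention_mask)
```

The batch is padded to length 5 here only because the longest truncated sequence has 5 tokens; a batch of shorter texts would be padded to a shorter length.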

<h3>7.5 Stratified Shuffle Split (Train/Test Split)</h3>

In [None]:
# Perform Stratified Shuffle Split so the test set preserves the class proportions of the labeled data
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)

# Split data
for _, test_index in sss.split(texts, labels):
    X_test = [texts[i] for i in test_index]
    y_test = [labels[i] for i in test_index]
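
The idea behind a stratified split can be sketched in plain Python: sample the same fraction from every class so that the test set mirrors the original label distribution. `stratified_test_indices` is a hypothetical helper written for this illustration; scikit-learn's `StratifiedShuffleSplit` differs in how it allocates and shuffles samples.

```python
import random
from collections import Counter, defaultdict


def stratified_test_indices(labels, test_size=0.3, seed=42):
    """Pick roughly test_size of the indices from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    test_idx = []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
    return sorted(test_idx)


# Imbalanced toy labels (0-indexed): 50 one-star, 10 three-star, 40 five-star
toy_labels = [0] * 50 + [2] * 10 + [4] * 40
test_idx = stratified_test_indices(toy_labels)
test_counts = Counter(toy_labels[i] for i in test_idx)
print(test_counts)
```

The test set stays imbalanced in the same 50/10/40 ratio as the full data, which is what "stratified" guarantees; it does not balance the classes.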

<h3>7.6 Model Prediction</h3>

In [None]:
def predict(model, tokenizer, texts, batch_size=16, device="cuda"):
    """
    Makes predictions on the input texts using the trained BERT model.
    
    Args:
        model (BertForSequenceClassification): The trained BERT model.
        tokenizer (BertTokenizer): A pre-trained BERT tokenizer.
        texts (list): A list of input texts to predict on.
        batch_size (int): The batch size for processing the texts in smaller chunks.
        device (str): Device to run the model on ("cuda" for GPU, "cpu" for CPU).
    
    Returns:
        predictions (list): The list of predicted labels for the input texts.
    """
    model.eval()  # Set the model to evaluation mode
    predictions = []
    
    # Iterate over the texts in batches
    with torch.no_grad():  # No need to track gradients during inference
        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i:i+batch_size]
            encodings = preprocess_texts(batch_texts, tokenizer).to(device)  # Preprocess and move to device
            outputs = model(**encodings)  # Make prediction
            preds = torch.argmax(outputs.logits, dim=1).cpu().numpy()  # Get predicted class (highest logit)
            predictions.extend(preds)  # Collect predictions
    
    return predictions
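
The batching and argmax steps in `predict` can be illustrated without a GPU or a trained model. In the sketch below, `fake_model` is a hypothetical, length-based stand-in that returns one row of five logits per text; it exists only to make the example deterministic and has no relation to sentiment.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def fake_model(batch):
    """Hypothetical classifier: puts logit 1.0 on class len(text) % 5, 0.0 elsewhere."""
    return [[float(len(text) % 5 == k) for k in range(5)] for text in batch]


texts = ["bom", "ruim", "otimo", "pessimo", "ok"]
predictions, batch_sizes = [], []
for batch in batched(texts, batch_size=2):
    batch_sizes.append(len(batch))
    predictions.extend(argmax(row) for row in fake_model(batch))

stars = [p + 1 for p in predictions]  # shift 0-4 class indices to 1-5 stars
print(batch_sizes, predictions, stars)
```

The last batch is smaller than `batch_size`, which the slicing handles without special casing.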

In [None]:
# Get predictions
pred_labels = predict(model, tokenizer, X_test)

<h3>7.7 Evaluate the Model</h3>

In [None]:
# Calculate accuracy
accuracy = accuracy_score(y_test, pred_labels)
print(f"Accuracy: {accuracy:.4f}\n")

In [None]:
# Shift class labels from 0-4 to 1-5
y_test_adjusted = np.array([y + 1 for y in y_test])
pred_labels_adjusted = np.array([pred + 1 for pred in pred_labels])

# Generate classification report with adjusted labels (1 to 5)
print("Classification Report:")
print(classification_report(y_test_adjusted, pred_labels_adjusted, digits=4))

In [None]:
def plot_confusion_matrix(y_true, y_pred, title='Model Evaluation - Confusion Matrix', filename="model_evaluation_confusion_matrix.png"):
    """
    Creates a confusion matrix heatmap to visualize the model's performance and saves it as a PNG.

    Parameters:
    y_true (array-like): The true labels.
    y_pred (array-like): The predicted labels.
    title (str): The title of the plot. Default is 'Model Evaluation - Confusion Matrix'.
    filename (str): Name of the PNG file saved under ../results/figures.
    """

    # Compute the confusion matrix
    cm = confusion_matrix(y_true, y_pred)

    # Create a figure and axis with a predefined size
    fig, ax = plt.subplots(figsize=(6, 5))

    # Plot the heatmap with annotations
    sns.heatmap(cm, annot=True, fmt="d", cmap="YlOrBr", 
                xticklabels=range(1, 6), yticklabels=range(1, 6),
                linewidths=0.5, linecolor='lightgray', ax=ax)

    # Set the title and labels
    ax.set_title(title, fontfamily='Arial Rounded MT Bold', fontsize=16, pad=15)
    ax.set_xlabel("Predicted", fontsize=12)
    ax.set_ylabel("Actual", fontsize=12)

    # Show the plot
    plt.show()

    # Save the plot
    save_plot(fig, os.path.join("..", "results", "figures", filename)) # Save as a PNG file

In [None]:
# Plot confusion matrix to evaluate model performance
plot_confusion_matrix(y_test, pred_labels, "Model Evaluation - Confusion Matrix", "model_evaluation_confusion_matrix.png")
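
A confusion matrix is a table of counts indexed by (actual, predicted) class. The sketch below builds one by hand with an explicit label list, so the matrix stays 5x5 even when a class never occurs in the sample. `confusion_counts` is a hypothetical helper for illustration; scikit-learn's `confusion_matrix` offers a `labels` argument for the same purpose.

```python
def confusion_counts(y_true, y_pred, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


y_true = [1, 1, 3, 5, 5, 5]
y_pred = [1, 2, 3, 5, 5, 4]
cm = confusion_counts(y_true, y_pred, labels=[1, 2, 3, 4, 5])
for row in cm:
    print(row)
```

Fixing the label list matters when the heatmap ticks are hard-coded to 1 through 5: without it, a class missing from both arrays would shrink the matrix and shift the tick labels.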

<h2>8. Predicting Ratings for Unlabeled Reviews</h2>

<h3>8.1 Load Labeled Data</h3>

In [None]:
# Load labeled data
labeled_ids, labeled_texts, labeled_labels = load_labeled_data()

<h3>8.2 Map Predicted Ratings to Reviews</h3>

In [None]:
# Create a dictionary mapping review IDs to their manually assigned labels (1-5)
id_to_pred = {review_id: pred for review_id, pred in zip(labeled_ids, labeled_labels)}

In [None]:
# Labeled reviews keep their manual label as the "Predicted Rating"; unlabeled reviews stay NaN for now
reviews_df["Predicted Rating"] = reviews_df["Review ID"].map(id_to_pred)

<h3>8.3 Identify Unlabeled Reviews</h3>

In [None]:
# Get texts and IDs of unlabeled reviews
unlabeled_reviews = reviews_df[reviews_df["Predicted Rating"].isna()]["Text"].tolist()
unlabeled_review_ids = reviews_df[reviews_df["Predicted Rating"].isna()]["Review ID"].tolist()

<h3>8.4 Predict Ratings for Unlabeled Reviews</h3>

In [None]:
# Predict ratings for unlabeled reviews
unlabeled_predictions = predict(model, tokenizer, unlabeled_reviews)

In [None]:
# Adjust the predictions to be on a scale of 1-5
unlabeled_predictions = [pred + 1 for pred in unlabeled_predictions]

<h3>8.5 Map Predicted Ratings to Unlabeled Reviews</h3>

In [None]:
# Create a dictionary for unlabeled review predictions
id_to_pred_unlabeled = {review_id: pred for review_id, pred in zip(unlabeled_review_ids, unlabeled_predictions)}

In [None]:
# Map the predicted ratings to the reviews dataframe
reviews_df.loc[reviews_df["Review ID"].isin(id_to_pred_unlabeled.keys()), "Predicted Rating"] = \
    reviews_df["Review ID"].map(id_to_pred_unlabeled)

In [None]:
# Ensure the 'Predicted Rating' column is of type Int64 (with support for NaN values)
reviews_df["Predicted Rating"] = reviews_df["Predicted Rating"].astype("Int64")

<h3>8.6 Save the Results to a CSV File</h3>

In [None]:
# Define the path for saving the CSV file
REVIEW_PATH = os.path.join("..", "results", "predictions", "reviews_with_predictions.csv")

In [None]:
# Save the dataframe to a CSV file
reviews_df.to_csv(REVIEW_PATH, index=False, sep=";", encoding="utf-8")

<h2>9. Comparative Analysis: Actual vs. Predicted Ratings</h2>

<h3>9.1 Distribution of Actual and Predicted Ratings</h3>

In [None]:
def plot_rating_comparison(reviews_df):
    """
    Creates a histogram to compare the distribution of actual ratings vs. predicted ratings.

    Parameters:
    reviews_df (DataFrame): A DataFrame containing the 'Rating' and 'Predicted Rating' columns.
    """
    
    # Define figure size for the plot
    fig, ax = plt.subplots(figsize=(10, 5))  

    # Plot the distribution of actual ratings with KDE
    sns.histplot(reviews_df["Rating"].dropna(), bins=5, kde=True, color="#1f77b4", label="Actual Rating", 
                 alpha=0.6, discrete=True, ax=ax)

    # Plot the distribution of predicted ratings with KDE
    sns.histplot(reviews_df["Predicted Rating"].dropna(), bins=5, kde=True, color="#ff7f0e", label="Predicted Rating", 
                 alpha=0.6, discrete=True, ax=ax)

    # Set labels and title for the axes
    ax.set_xlabel("Rating", fontsize=12)
    ax.set_ylabel("Count", fontsize=12)
    ax.set_title("Distribution of Actual vs. Predicted Ratings", fontfamily='Arial Rounded MT Bold', fontsize=16, pad=15)

    # Set x-axis ticks to represent the possible rating values (1 to 5)
    ax.set_xticks(range(1, 6))

    # Add a grid for better readability (dashed lines, gray color, semi-transparent)
    ax.grid(visible=True, linestyle='--', color='gray', alpha=0.7)

    # Display legend to distinguish the actual and predicted ratings
    ax.legend()

    # Show the plot
    plt.show()

    # Save the plot
    save_plot(fig, os.path.join("..", "results", "figures", "actual_vs_predicted_ratings.png")) # Save as a PNG file

In [None]:
# Plot comparison of actual vs predicted ratings
plot_rating_comparison(reviews_df)

<h3>9.2 Performance Metrics</h3>

In [None]:
# Compute metrics on rows that have both an actual and a predicted rating
scored_df = reviews_df.dropna(subset=["Rating", "Predicted Rating"])
y_actual = scored_df["Rating"].astype(int)
y_predicted = scored_df["Predicted Rating"].astype(int)
accuracy = accuracy_score(y_actual, y_predicted)
mae = mean_absolute_error(y_actual, y_predicted)
rmse = np.sqrt(mean_squared_error(y_actual, y_predicted))

In [None]:
# Print metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
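
For intuition about how the three metrics relate, here is a hand computation on a small set of made-up ratings (the numbers are illustrative only, not taken from the dataset).

```python
import math

actual = [5, 5, 1, 3, 4, 5, 1, 2]
predicted = [5, 5, 1, 4, 4, 3, 1, 2]

errors = [abs(a - p) for a, p in zip(actual, predicted)]

accuracy = sum(e == 0 for e in errors) / len(errors)  # share of exact matches
mae = sum(errors) / len(errors)  # average distance in stars
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # penalizes large misses more

print(f"Accuracy: {accuracy:.4f}, MAE: {mae:.4f}, RMSE: {rmse:.4f}")
```

Accuracy treats a one-star miss and a two-star miss the same, MAE counts them by their size, and RMSE gives the two-star miss four times the weight of the one-star miss.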

<h3>9.3 Confusion Matrix</h3>

In [None]:
# Remove NaN values for comparison; copy so new columns can be added without a pandas warning
valid_df = reviews_df.dropna(subset=["Rating", "Predicted Rating"]).copy()

In [None]:
# Plot confusion matrix
plot_confusion_matrix(valid_df["Rating"], 
                      valid_df["Predicted Rating"], 
                      "Prediction vs Actual - Confusion Matrix", 
                      "prediction_vs_actual_confusion_matrix.png")

<h3>9.4 Error Analysis</h3>

In [None]:
# Compute absolute errors
valid_df["Error"] = abs(valid_df["Rating"] - valid_df["Predicted Rating"])

In [None]:
def plot_average_error(valid_df):
    """
    Creates a bar plot to visualize the mean absolute error by rating category,
    drawn over a dashed background grid.

    Parameters:
    valid_df (DataFrame): A DataFrame containing the 'Rating' and 'Error' columns.
    """

    # Set the style before creating the figure so it applies to the new axes
    sns.set_style("whitegrid")  # Light background with grid

    # Define figure size
    fig, ax = plt.subplots(figsize=(8, 5))

    # Bar height is the mean error per rating (seaborn's default estimator); rocket color palette
    sns.barplot(x="Rating", y="Error", data=valid_df, hue="Rating", 
                palette="rocket", legend=False, ax=ax)

    # Set the labels and title
    ax.set_xlabel("Actual Rating", fontsize=12)
    ax.set_ylabel("Mean Absolute Error", fontsize=12)
    ax.set_title("Average Error by Rating Category", fontsize=14, fontfamily='Arial Rounded MT Bold', pad=15)

    # Force grid appearance
    ax.grid(True, linestyle='--', alpha=0.7, color='gray')  # Dashed light-gray grid
    ax.set_axisbelow(True)  # Ensures the grid is behind the bars

    # Show the plot
    plt.show()

    # Save the plot
    save_plot(fig, os.path.join("..", "results", "figures", "average_error_by_rating_category.png")) # Save as a PNG file

In [None]:
# Plot the mean absolute error for each actual rating
plot_average_error(valid_df)
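
The quantity shown by each bar is the mean absolute error within one actual-rating group. A minimal sketch of that aggregation in plain Python follows; `mean_abs_error_by_rating` is a hypothetical helper and the ratings are made up for illustration.

```python
from collections import defaultdict


def mean_abs_error_by_rating(actual, predicted):
    """Average |actual - predicted| separately for each actual rating."""
    totals = defaultdict(lambda: [0, 0])  # rating -> [sum of errors, count]
    for a, p in zip(actual, predicted):
        totals[a][0] += abs(a - p)
        totals[a][1] += 1
    return {rating: total / count for rating, (total, count) in sorted(totals.items())}


actual = [1, 1, 3, 3, 5, 5, 5, 5]
predicted = [1, 1, 4, 5, 5, 5, 5, 4]
per_rating = mean_abs_error_by_rating(actual, predicted)
print(per_rating)
```

In this toy data the middle rating has the largest average error, which is the kind of pattern the bar plot is meant to expose.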

<h3>9.5 Correct and Incorrect Predictions</h3>

In [None]:
# Correct predictions
correct_preds = valid_df[valid_df["Rating"] == valid_df["Predicted Rating"]].sample(5, random_state=42)  # Fixed seed for reproducible examples

In [None]:
# Displaying 5 examples of correct predictions
print("Examples of Correct Predictions:")
correct_preds[["Text", "Rating", "Predicted Rating"]]

In [None]:
# Incorrect predictions
incorrect_preds = valid_df[valid_df["Rating"] != valid_df["Predicted Rating"]].sample(5, random_state=42)  # Fixed seed for reproducible examples

In [None]:
# Displaying 5 examples of incorrect predictions
print("Examples of Incorrect Predictions:")
incorrect_preds[["Text", "Rating", "Predicted Rating"]]

<h2>10. Conclusion</h2>

### **Model Performance Overview**  

On the stratified 30% sample of the labeled reviews (Section 7), the model reached an **accuracy of 99.54%**, with strong performance across all sentiment classes. The detailed metrics show:  

- **Precision remains close to 1.0 across all classes**, indicating minimal false positives.  
- **Recall is consistently high**, with perfect recall (1.0) for classes **2, 4, and 5**, meaning that all actual instances in these classes were correctly identified.  
- **The Macro F1-score is 0.9910**, an unweighted average over the five classes, showing that the model maintains a balanced performance across all classes, despite the dataset being highly imbalanced.  
- **The Weighted F1-score is 0.9954**, which weights each class by its number of examples and is therefore dominated by the majority classes. Because the macro score is almost as high, the result does not appear to be carried by the majority classes alone: the model maintains strong predictive accuracy even for underrepresented ones.  
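
The difference between the two averages is easiest to see on a small imbalanced example. The sketch below computes per-class F1 by hand and then both averages; `f1_per_class` is a hypothetical helper and the labels are made up, not taken from the evaluation above.

```python
def f1_per_class(y_true, y_pred, labels):
    """Return per-class F1 scores and per-class support (count of true examples)."""
    scores, support = {}, {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
        support[label] = sum(t == label for t in y_true)
    return scores, support


# Eight five-star reviews, two three-star reviews; one three-star is misread as five
y_true = [5] * 8 + [3] * 2
y_pred = [5] * 8 + [5, 3]

scores, support = f1_per_class(y_true, y_pred, labels=[3, 5])
macro_f1 = sum(scores.values()) / len(scores)
weighted_f1 = sum(scores[c] * support[c] for c in scores) / sum(support.values())
print(f"Macro F1: {macro_f1:.4f}, Weighted F1: {weighted_f1:.4f}")
```

The weighted average comes out higher because the well-predicted majority class counts four times as much as the minority class; the macro average exposes the weaker minority score.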

These results highlight the effectiveness of combining **BERTimbau with Active Learning** for sentiment analysis, especially when dealing with limited and imbalanced labeled data. The model demonstrates the ability to generalize well across different sentiment levels, ensuring reliable predictions for both majority and minority classes.  

### **Comparison with Actual Ratings**  

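When comparing the actual ratings provided by customers with the predicted ratings (the manual label for reviews labeled during Active Learning, the model’s prediction for all others), the results are as follows:  

- **Accuracy: 94.77%**, indicating that the predicted ratings align well with customer evaluations.  
- **Mean Absolute Error (MAE): 0.0690**, demonstrating that the predictions have a small average deviation from the customers’ ratings. Since about 5.2% of the predictions differ from the customer rating, an MAE above 0.052 also implies that some of them miss by more than one star.  
- **Root Mean Squared Error (RMSE): 0.3380**, which is higher than the MAE because squaring gives extra weight to those multi-star misses; overall, discrepancies remain small.  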

Despite the overall strong performance, **larger errors were observed in intermediate classes (2, 3, and 4).** These ratings often reflect more nuanced experiences and can be more subjective, making them inherently harder to classify. This suggests that additional labeled data and further fine-tuning may be necessary to improve the model’s ability to differentiate these sentiment levels. One possible approach is to leverage Active Learning to prioritize the labeling of uncertain cases in these classes, helping refine the model’s understanding of subtle differences in sentiment.  

### **Final Thoughts**  

These findings confirm that **BERTimbau combined with Active Learning** is a highly effective approach for sentiment analysis in customer reviews. The model not only captures the sentiment nuances across different rating levels but also demonstrates **robust generalization**, even in a **highly imbalanced dataset**. This makes it a valuable tool for **understanding customer feedback in the pet care sector**, providing insights that can support business decisions and service improvements.  

<h2>11. Next Steps</h2>

While the model has demonstrated **excellent performance**, there are several areas that can be explored further to improve its accuracy and applicability. Some potential next steps include:  

### **Fine-Tune Model Performance**  
- **Explore Hyperparameter Tuning:** Adjusting hyperparameters such as learning rate, batch size, and the number of layers could help optimize performance, particularly in improving precision and recall for specific classes.  
- **Test with Different Models:** While BERTimbau has shown strong results, experimenting with other transformer models like **DistilBERT, RoBERTa, or domain-specific transformers** could reveal alternative solutions that improve computational efficiency or predictive accuracy.  
- **Compare with Simpler Models:** Evaluating **Logistic Regression, Random Forest, or SVM** as baselines might offer insights into whether simpler models can achieve competitive performance with less computational cost.  

### **Enhance Active Learning Strategies**  
- **Evaluate Training and Validation Curves:** Analyzing the model's performance over time as more labeled data is added can help detect signs of **overfitting or underfitting**. Plotting these curves will provide insights into when additional training data brings diminishing returns and when to stop the Active Learning process.  
- **Experiment with Different Active Learning Strategies:** While entropy-based sampling has been effective, testing **least confidence, margin sampling, or other uncertainty-based approaches** could refine the model’s ability to select the most informative examples for labeling.  
- **Assess the Impact of Additional Labeled Data:** Monitoring how accuracy, precision, and recall evolve with each batch of newly labeled data will help determine whether collecting more labels continues to enhance the model’s performance or if the gains plateau.  
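
The uncertainty measures named above can be compared on two hand-made probability vectors. This is a minimal sketch, not the notebook's sampling code: the function names and the probabilities are assumptions chosen for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy: high when probability is spread over many classes."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def least_confidence(probs):
    """One minus the top probability: high when the best guess is weak."""
    return 1.0 - max(probs)


def margin(probs):
    """Gap between the two most likely classes: LOW means more uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


confident = [0.90, 0.04, 0.03, 0.02, 0.01]  # clearly a 1-star review
torn = [0.45, 0.44, 0.05, 0.03, 0.03]  # undecided between 1 and 2 stars

for name, probs in [("confident", confident), ("torn", torn)]:
    print(name, round(entropy(probs), 3), round(least_confidence(probs), 3), round(margin(probs), 3))
```

All three measures rank the `torn` example as the better candidate for manual labeling, but they can disagree on less clear-cut cases: margin sampling looks only at the top two classes, while entropy takes the whole distribution into account.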

By exploring these next steps, the model can be further refined to **deliver even more precise and actionable predictions**. This will enable businesses in the **pet care sector** to better understand and respond to customer feedback, improving their services based on deeper sentiment insights.  
