In [1]:
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
import warnings

warnings.filterwarnings("ignore")
FORCE = False

# Sentiment Analysis 
This stage uses a **DistilBERT-based sentiment classification model**, `tabularisai/robust-sentiment-analysis`, to label the sentiment of each review in the dataset. The resulting labels support **Data Quality Assessment (DQA)** and **Exploratory Data Analysis (EDA)**.

## Model Overview
- **Model Name**: `tabularisai/robust-sentiment-analysis`
- **Base Model**: `distilbert/distilbert-base-uncased`
- **Task**: Text Classification (Sentiment Analysis)
- **Language**: English
- **Number of Classes**: 5 sentiment categories:
  - **Very Negative**
  - **Negative**
  - **Neutral**
  - **Positive**
  - **Very Positive**

## Model Description
This model is a fine-tuned version of `distilbert-base-uncased`, optimized for sentiment analysis using synthetic data generated by large language models such as **Llama3.1** and **Gemma2**. Training exclusively on synthetic data exposes the model to a diverse range of sentiment expressions, which is intended to improve its ability to generalize across different use cases.



## Imports

In [2]:
import pandas as pd
from tqdm import tqdm

from discover.container import DiscoverContainer
from discover.flow.data_prep.sentiment.stage import SentimentAnalysisStage
from discover.core.flow import PhaseDef, StageDef
from discover.infra.config.flow import FlowConfigReader

# Register `tqdm` with pandas
tqdm.pandas()

pd.options.display.max_colwidth = None

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.data_prep.base.stage",
    ],
)

## Sentiment Analysis Task
The `SentimentAnalysisTask` class performs sentiment analysis on text data using the `tabularisai/robust-sentiment-analysis` pre-trained transformer model. It is built to handle large-scale text data efficiently and is optimized for execution on GPU when available.

**Key Technical Aspects**:

1. **Model Loading**: The transformer is loaded using the Hugging Face `transformers` library, leveraging both the `AutoTokenizer` for text tokenization and `AutoModelForSequenceClassification` for sentiment classification.
2. **Hardware Optimization**: The class supports GPU acceleration through PyTorch. It checks for the availability of a CUDA-compatible GPU and moves the model and data to the GPU if available. This significantly speeds up inference, making it suitable for large datasets.
3. **Text Preprocessing and Tokenization**: Text is lowercased and tokenized with the `AutoTokenizer`, which converts it into input tensors that the model can process. Inputs longer than 512 tokens are truncated. Because each review is tokenized on its own, `padding=True` pads only to the length of that single sequence, so no padding tokens are added in practice.
4. **Memory Management**: The class calls `torch.cuda.empty_cache()` before loading the model to release unused cached GPU memory, which reduces the risk of out-of-memory errors on the GPU.
5. **Sentiment Prediction**: The `predict_sentiment` method performs inference using `torch.no_grad()` to disable gradient calculation, reducing memory consumption and speeding up computations. It calculates class probabilities using the `softmax` function and maps the predicted class index to a sentiment label.
6. **Caching Mechanism**: Sentiment analysis results are cached using environment-specific settings, so repeated runs can reuse stored predictions instead of recomputing them. The cache-path logic does not appear in the task code shown in this notebook, but the stage run reports `Cached Result | True` when a stored result is reused.
7. **Integration with DataFrames**: The class operates on pandas DataFrames, applying sentiment analysis to each entry in the specified text column using the `progress_apply` method, which provides a progress bar for monitoring the processing status.
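
The prediction step in item 5 can be illustrated without a model. The following is a minimal sketch in plain Python, assuming made-up logit values; the helper names `softmax` and `label_from_logits` are hypothetical, and only the label order is taken from the `sentiment_map` in the task code.

```python
import math
from typing import List, Sequence, Tuple

# Label order mirrors the sentiment_map used by SentimentAnalysisTask.
SENTIMENT_LABELS = (
    "Very Negative",
    "Negative",
    "Neutral",
    "Positive",
    "Very Positive",
)


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_logits(logits: Sequence[float]) -> Tuple[str, float]:
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return SENTIMENT_LABELS[idx], probs[idx]


# Made-up logits for illustration only; real values come from the model.
example_logits = [-1.2, -0.4, 0.3, 2.1, 0.9]
label, confidence = label_from_logits(example_logits)
print(label, round(confidence, 3))
```

Because softmax preserves the ordering of the logits, taking the argmax of the probabilities selects the same class as taking the argmax of the logits. The probabilities matter only if a confidence score is also needed.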

The code is included in the following expandable cell.


In [4]:
# %load -r 19-139 discover/flow/data_prep/sentiment/task.py
import os
import warnings

import pandas as pd
import torch
from tqdm import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer

from discover.flow.base.task import Task
from discover.infra.service.logging.task import task_logger

# ------------------------------------------------------------------------------------------------ #
warnings.filterwarnings("ignore")
os.environ["PYTHONWARNINGS"] = "ignore"
tqdm.pandas()


# ------------------------------------------------------------------------------------------------ #
class SentimentAnalysisTask(Task):
    """Task for performing sentiment analysis on text data.

    This class uses a pre-trained transformer model to analyze the sentiment
    of text in a specified column and appends the predicted sentiment labels
    to a new column in the DataFrame.

    Args:
        column (str): The name of the column containing the text data. Defaults to "content".
        new_column (str): The name of the column to store sentiment predictions. Defaults to "sentiment".
        model_name (str): The name of the pre-trained sentiment analysis model. Defaults to "tabularisai/robust-sentiment-analysis".
    """

    def __init__(
        self,
        column="content",
        new_column="sentiment",
        model_name: str = "tabularisai/robust-sentiment-analysis",
    ):
        super().__init__(
            column=column,
            new_column=new_column,
        )
        self._model_name = model_name

        # Model, tokenizer, and device are initialized as None and will be loaded later
        self._model = None
        self._tokenizer = None
        self._device = None

    @task_logger
    def run(self, data: pd.DataFrame) -> pd.DataFrame:
        """Executes sentiment analysis on the given DataFrame.

        Args:
            data (pd.DataFrame): The input DataFrame containing text data.

        Returns:
            pd.DataFrame: The DataFrame with a new column containing sentiment predictions.
        """
        # Clear CUDA memory to ensure enough space is available for the model
        torch.cuda.empty_cache()

        # Load the device, model, and tokenizer
        self._load_model_tokenizer_to_device()

        # Apply sentiment prediction to each text entry in the specified column
        data[self._new_column] = data[self._column].progress_apply(
            self.predict_sentiment
        )
        return data

    def predict_sentiment(self, text):
        """Predicts the sentiment of a given text using the loaded model.

        Args:
            text (str): The input text for sentiment analysis.

        Returns:
            str: The predicted sentiment label.
        """
        with torch.no_grad():
            # Tokenize and prepare the input text for the model
            inputs = self._tokenizer(
                text.lower(),
                return_tensors="pt",
                truncation=True,
                padding=True,
                max_length=512,
            )
            # Move inputs to the appropriate device (CPU or GPU)
            inputs = {key: value.to(self._device) for key, value in inputs.items()}
            # Get model outputs and calculate probabilities
            outputs = self._model(**inputs)
            probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)

            # Determine the predicted class
            predicted_class = torch.argmax(probabilities, dim=-1).item()

        # Map the predicted class index to a sentiment label
        sentiment_map = {
            0: "Very Negative",
            1: "Negative",
            2: "Neutral",
            3: "Positive",
            4: "Very Positive",
        }
        return sentiment_map[predicted_class]

    def _load_model_tokenizer_to_device(self) -> None:
        """Loads the device, tokenizer, and model for sentiment analysis."""
        # Select GPU if available, otherwise use CPU
        self._device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        # Load the tokenizer and model from the pre-trained model name
        self._tokenizer = AutoTokenizer.from_pretrained(self._model_name)
        self._model = AutoModelForSequenceClassification.from_pretrained(
            self._model_name
        )
        # Move the model to the selected device
        self._model.to(self._device)
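
The task above applies `predict_sentiment` to one review at a time. For large datasets, a common alternative is to score several texts per forward pass. The following is a minimal sketch of the batching logic only, not the author's implementation: `iter_batches`, `predict_in_batches`, and the stand-in scorer `fake_predict_batch` are hypothetical names, and a real scorer would tokenize the whole batch and run the model once per batch.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(texts: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `texts` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        yield texts[start : start + batch_size]


def predict_in_batches(
    texts: Sequence[str],
    predict_batch: Callable[[Sequence[str]], List[str]],
    batch_size: int = 32,
) -> List[str]:
    """Score texts batch by batch and return labels in the input order."""
    labels: List[str] = []
    for batch in iter_batches(texts, batch_size):
        labels.extend(predict_batch(batch))
    return labels


def fake_predict_batch(batch: Sequence[str]) -> List[str]:
    """Hypothetical stand-in for a model call; labels every text Neutral."""
    return ["Neutral" for _ in batch]


labels = predict_in_batches(["a", "b", "c"], fake_predict_batch, batch_size=2)
```

With batched inputs, `padding=True` becomes meaningful, because shorter texts are padded to the longest text in each batch.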

## Sentiment Analysis Pipeline
As with the previous ingestion pipeline, we obtain the configuration using `FlowConfigReader` and build the `SentimentAnalysisStage` from the stage configuration for the specified phase and stage. The stage is then executed, and `dataset` holds the resulting data asset.


In [5]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=StageDef.SENTIMENT
)

# Build and run Data Sentiment Analysis Stage
stage = SentimentAnalysisStage.build(stage_config=stage_config, force=FORCE)
dataset = stage.run()



                            Sentiment Analysis Stage                            
                           Stage Started | Tue, 19 Nov 2024 10:01:06
                         Stage Completed | Tue, 19 Nov 2024 10:01:07
                           Stage Runtime | 1.24 seconds
                           Cached Result | True
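
The `Cached Result | True` line indicates that the stage reused a stored result rather than running the model again, which explains the runtime of about one second. The following is a minimal sketch of that pattern, assuming a JSON file per environment and stage. The helpers `cache_path` and `run_with_cache` are hypothetical and are not the `discover` package's API; the `force` parameter only mirrors the name of the flag passed to `SentimentAnalysisStage.build`.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple


def cache_path(cache_dir: Path, env: str, stage: str) -> Path:
    """Build an environment-specific cache file path (hypothetical layout)."""
    return cache_dir / env / f"{stage}.json"


def run_with_cache(
    path: Path, compute: Callable[[], List[str]], force: bool = False
) -> Tuple[List[str], bool]:
    """Return (result, cached). Recompute when no cache exists or force is set."""
    if path.exists() and not force:
        return json.loads(path.read_text(encoding="utf-8")), True
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result, False


calls: List[int] = []


def compute_labels() -> List[str]:
    """Stand-in for the expensive model run; records each invocation."""
    calls.append(1)
    return ["Positive", "Neutral"]


with tempfile.TemporaryDirectory() as tmp:
    path = cache_path(Path(tmp), env="dev", stage="sentiment")
    first, first_cached = run_with_cache(path, compute_labels)
    second, second_cached = run_with_cache(path, compute_labels)
    third, third_cached = run_with_cache(path, compute_labels, force=True)
```

In this sketch the second call reads the stored file, and the third call recomputes because `force=True` bypasses the cache.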





## Inspect Results
This sample illustrates sentiment vis-a-vis ratings, revealing the complexity and nuance in user opinion.

In [6]:
dataset.content[["id", "content", "rating", "sa_sentiment"]].sample(n=5, random_state=8)

| | id | content | rating | sa_sentiment |
|---|---|---|---|---|
| 56610 | 9912730443 | Mooncycle | 4 | Very Positive |
| 72533 | 9522246562 | Would be 5 stars, but missing this feature. Please bring back the option to remove contact from non-friends, so we can have some privacy. I think there’s use to be this option years ago. Now once you chat with non friends you will see each other active. For example if you contact thru Facebook marketplace you it will automatically add to your contact which you can’t remove and can only block. Which I don’t want to see anyone on my block list. So overall 3 stars unless fixed with future updates. | 3 | Neutral |
| 41259 | 8455560762 | I thought that the payout would have been higher for surveys. | 3 | Negative |
| 51489 | 7358923363 | I havent been able to stay in connection with what is happening in Palestine because instagram keeps blocking most of the videos and posts and banning their spread. As a user that has the total freedom to follow whatever page i want and that expects to recieve the news i select, this have been useless lately because i no longer have the right nor the ability to chose the platforms i would like to be connected to. Only ridiculously useless pages on the top of the feeds with very unimportant content while disasterous unhumatarian events are happening all over the world with zero coverage and transparency. Hypocrite | 1 | Very Negative |
| 29621 | 7043042797 | They give you so much unlimited information that other apps do not definitely worth downloading I have downloaded multiple maybe all and this was the best one !!! It’s awesome 🤰🏾🤰🏼🤰🏿🤰🏽🤰🏻🤰 | 5 | Very Positive |


### Summary of Sentiment vs. Ratings
1. **Entry 1: Mooncycle**
   - **Rating**: 4
   - **Sentiment Analysis**: Very Positive
   - **Comment**: The user gave a high rating (4 stars) and the model predicted Very Positive, so the label and the rating agree. Note that the review is a single word ("Mooncycle"), so the text itself carries little sentiment signal.

2. **Entry 2: Privacy Concern**
   - **Rating**: 3
   - **Sentiment Analysis**: Neutral
   - **Comment**: The review raises significant concerns about privacy features but still gives a moderate rating of 3 stars. The sentiment analysis classified this as Neutral, which seems reasonable given the mix of positive and negative feedback. However, one might argue that a somewhat more negative label would better capture the overall tone.

3. **Entry 3: Survey Payouts**
   - **Rating**: 3
   - **Sentiment Analysis**: Negative
   - **Comment**: The user was disappointed with survey payouts, rating the experience as 3 stars. The sentiment analysis classified this as Negative, which reflects the user's dissatisfaction. The rating, however, seems higher than expected for a purely negative sentiment, suggesting potential leniency or mixed feelings not fully captured by the text.

4. **Entry 4: Instagram Censorship**
   - **Rating**: 1
   - **Sentiment Analysis**: Very Negative
   - **Comment**: This review strongly criticizes Instagram's content policies, and the user gave the lowest possible rating (1 star). The sentiment analysis accurately labeled this as Very Negative, showing a clear alignment between sentiment and rating.

5. **Entry 5: Informative App**
   - **Rating**: 5
   - **Sentiment Analysis**: Very Positive
   - **Comment**: The review is overwhelmingly positive, emphasizing the app's usefulness and unique features, and the user gave a 5-star rating. The sentiment analysis correctly labeled it as Very Positive, demonstrating alignment between the rating and sentiment.

### Observations
- **Alignment**: In this five-review sample, the predicted sentiment tracks the user ratings closely. Very Positive predictions accompany high ratings (4 or 5), and the Very Negative prediction accompanies the lowest rating (1).
- **Mixed Reviews**: The Neutral prediction for the privacy concern review highlights the complexity of mixed feedback, where both positives and negatives are present. Such reviews might require more nuanced classification.
- **Moderate Ratings**: Both 3-star reviews received Neutral or Negative predictions. The ratings suggest the users still see value in the app, while the text expresses dissatisfaction with specific features or limitations.
- **Sentiment Outliers**: No significant mismatches are observed in this sample. Entry 2 nonetheless shows that a moderate rating paired with a Neutral prediction can still reflect unmet expectations.

Although five reviews are too few to generalize from, this sample suggests that predicted sentiment generally aligns with user ratings, while also surfacing specific areas of satisfaction or dissatisfaction that numerical ratings alone might miss.
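
One way to put a number on this alignment is to map each label to a 1-5 score and compare it with the star rating. The following is a minimal sketch using the five sampled reviews. The mapping `LABEL_TO_SCORE` is an assumption made for illustration (the model does not define a star equivalence), and the variable names are hypothetical.

```python
# Assumed mapping from sentiment label to a 1-5 score (illustrative only).
LABEL_TO_SCORE = {
    "Very Negative": 1,
    "Negative": 2,
    "Neutral": 3,
    "Positive": 4,
    "Very Positive": 5,
}

# (rating, predicted sentiment) pairs from the sample shown above.
sample = [
    (4, "Very Positive"),
    (3, "Neutral"),
    (3, "Negative"),
    (1, "Very Negative"),
    (5, "Very Positive"),
]

# Absolute gap between the star rating and the mapped sentiment score.
gaps = [abs(rating - LABEL_TO_SCORE[label]) for rating, label in sample]

exact_match_rate = sum(gap == 0 for gap in gaps) / len(gaps)
mean_abs_gap = sum(gaps) / len(gaps)

print(gaps)              # [1, 0, 1, 0, 0]
print(exact_match_rate)  # 0.6
print(mean_abs_gap)      # 0.4
```

Under this assumed mapping, three of the five reviews match exactly and no review differs by more than one step.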

In the next section, we'll add perplexity, a measure of relevance, to the dataset.