In [None]:
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
import warnings

warnings.filterwarnings("ignore")
FORCE = False

# Sentiment Analysis
This stage uses `tabularisai/robust-sentiment-analysis`, a **DistilBERT-based Sentiment Classification Model**, to assign a sentiment label to each review in the dataset. The labels support **Data Quality Assessment (DQA)** and **Exploratory Data Analysis (EDA)**.

## Model Overview
- **Model Name**: `tabularisai/robust-sentiment-analysis`
- **Base Model**: `distilbert/distilbert-base-uncased`
- **Task**: Text Classification (Sentiment Analysis)
- **Language**: English
- **Number of Classes**: 5 sentiment categories:
  - **Very Negative**
  - **Negative**
  - **Neutral**
  - **Positive**
  - **Very Positive**
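
The five categories correspond to output indices 0 through 4, in the order used by the task code later in this notebook. As a minimal, dependency-free sketch of how a vector of five raw scores (logits) becomes one of these labels, the example below applies softmax and picks the most probable class. The logit values are made up for illustration, and the helper names `softmax` and `label_from_logits` are hypothetical; the task itself performs this step with PyTorch.

```python
import math

# Index-to-label order, matching the mapping in the task code of this notebook.
SENTIMENT_LABELS = [
    "Very Negative",
    "Negative",
    "Neutral",
    "Positive",
    "Very Positive",
]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_logits(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return SENTIMENT_LABELS[best], probs[best]


# Hypothetical logits for one review (illustrative values only).
label, confidence = label_from_logits([-1.2, 0.3, 0.1, 2.4, 1.0])
print(label, round(confidence, 3))
```

Keeping the probability alongside the label can be useful when low-confidence predictions need to be reviewed separately.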

## Model Description
This model is a fine-tuned version of `distilbert-base-uncased`, optimized for sentiment analysis using synthetic data generated by language models such as **Llama3.1** and **Gemma2**. By training exclusively on synthetic data, the model has been exposed to a diverse range of sentiment expressions, which enhances its ability to generalize across different use cases.



## Imports

In [2]:
import pandas as pd
from tqdm import tqdm

from discover.container import DiscoverContainer
from discover.flow.stage.model.sentiment import SentimentAnalysisStage
from discover.core.flow import PhaseDef, StageDef
from discover.infra.config.flow import FlowConfigReader

# Register `tqdm` with pandas
tqdm.pandas()

pd.options.display.max_colwidth = None

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.stage.base",
    ],
)

## Sentiment Analysis Task
The `SentimentAnalysisTask` class performs sentiment analysis on text data using the `tabularisai/robust-sentiment-analysis` pre-trained transformer model. It is built to handle large-scale text data efficiently and is optimized for execution on GPU when available.

**Key Technical Aspects**:

1. **Model Loading**: The transformer is loaded using the Hugging Face `transformers` library, leveraging both the `AutoTokenizer` for text tokenization and `AutoModelForSequenceClassification` for sentiment classification.
2. **Hardware Optimization**: The class supports GPU acceleration through PyTorch. It checks for the availability of a CUDA-compatible GPU and moves the model and data to the GPU if available. This significantly speeds up inference, making it suitable for large datasets.
3. **Text Preprocessing and Tokenization**: Text is lowercased and tokenized using the `AutoTokenizer`, which converts it into input tensors that the model can process. Inputs longer than 512 tokens are truncated. Because each review is tokenized on its own, `padding=True` has no effect here; padding only applies when several sequences are tokenized together.
4. **Memory Management**: The class calls `torch.cuda.empty_cache()` before loading the model. This releases cached GPU memory that PyTorch is holding but not using, which reduces the chance of out-of-memory errors.
5. **Sentiment Prediction**: The `predict_sentiment` method performs inference using `torch.no_grad()` to disable gradient calculation, reducing memory consumption and speeding up computations. It calculates class probabilities using the `softmax` function and maps the predicted class index to a sentiment label.
6. **Caching Mechanism**: The class receives a cache file path (`cache_filepath`) and writes the `id` and sentiment columns to it after inference. Later runs read the cache and merge it back into the data on `id`, which avoids repeating inference.
7. **Integration with DataFrames**: The class operates on pandas DataFrames, applying sentiment analysis to each entry in the specified text column using the `progress_apply` method, which provides a progress bar for monitoring the processing status.
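
The cache-first behaviour described in item 6 can be sketched without the project's classes. This is a simplified illustration under stated assumptions, not the task's implementation: the CSV cache format, the helper names (`load_cache`, `save_cache`, `get_labels`), and the `allow_inference` flag (standing in for `device_local`) are all hypothetical, and `fake_predict` replaces the model.

```python
import csv
import os
import tempfile


def load_cache(path):
    """Read an id -> label mapping; raises FileNotFoundError if the file is absent."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row["id"]: row["sentiment"] for row in csv.DictReader(f)}


def save_cache(path, labels):
    """Write the id -> label mapping as a two-column CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "sentiment"])
        writer.writeheader()
        for review_id, label in labels.items():
            writer.writerow({"id": review_id, "sentiment": label})


def get_labels(reviews, cache_path, predict, allow_inference):
    """Use the cache when present; otherwise run inference or fail loudly."""
    try:
        return load_cache(cache_path)
    except FileNotFoundError:
        if not allow_inference:
            raise FileNotFoundError(
                f"No cache at {cache_path} and inference is disabled."
            )
        labels = {review_id: predict(text) for review_id, text in reviews.items()}
        save_cache(cache_path, labels)
        return labels


def fake_predict(text):
    """Stand-in for the model (assumption): a trivial keyword rule."""
    return "Positive" if "great" in text else "Negative"


# Ids are strings so that values read back from the cache compare equal.
reviews = {"1": "great app", "2": "keeps crashing"}
cache_path = os.path.join(tempfile.mkdtemp(), "sentiment_cache.csv")

first = get_labels(reviews, cache_path, fake_predict, allow_inference=True)
second = get_labels(reviews, cache_path, fake_predict, allow_inference=False)
print(first == second)  # the second call is served from the cache
```

Storing ids as strings mirrors the `astype("string")` cast in `run`, which keeps the merge key types consistent between the cache and the data.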

The code is included in the following expandable cell.


In [4]:
# %load -r 19-210 discover/flow/task/model/sentiment.py
import os
import warnings

import pandas as pd
import torch
from tqdm import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer

from discover.flow.task.base import Task
from discover.infra.service.logging.task import task_logger
from discover.infra.utils.file.io import IOService

# ------------------------------------------------------------------------------------------------ #
warnings.filterwarnings("ignore")
os.environ["PYTHONWARNINGS"] = "ignore"
tqdm.pandas()


# ------------------------------------------------------------------------------------------------ #
class SentimentAnalysisTask(Task):
    """
    Task for performing sentiment analysis on text data in a specified column of a Pandas DataFrame.

    This task uses a pre-trained model to predict sentiment for text in the specified column and
    stores the sentiment predictions in a new column. Results are cached to a file to avoid reprocessing.
    It supports execution on GPUs or local devices depending on the configuration.

    Args:
        cache_filepath (str): Path to the cache file for storing or loading sentiment predictions.
        column (str): The name of the column in the DataFrame containing text data for sentiment analysis.
            Defaults to "content".
        new_column (str): The name of the column to store sentiment predictions. Defaults to "sentiment".
        model_name (str): The name of the pre-trained model to use for sentiment analysis. Defaults to
            "tabularisai/robust-sentiment-analysis".
        device_local (bool): Indicates whether to execute the task on local devices. Defaults to False.

    Methods:
        run(data: pd.DataFrame) -> pd.DataFrame:
            Executes the sentiment analysis task, using a cache if available. If not, it predicts sentiment
            for the text column and caches the results.
        predict_sentiment(text: str) -> str:
            Predicts sentiment for a given text string.
        _load_model_tokenizer_to_device() -> None:
            Loads the model, tokenizer, and device for performing sentiment analysis.
        _run(data: pd.DataFrame) -> pd.DataFrame:
            Executes the model inference for sentiment prediction and writes the results to the cache.
    """

    def __init__(
        self,
        cache_filepath: str,
        column="content",
        new_column="sentiment",
        model_name: str = "tabularisai/robust-sentiment-analysis",
        device_local: bool = False,
        io_cls: type[IOService] = IOService,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self._column = column
        self._new_column = f"{self.stage.id}_{new_column}"
        self._model_name = model_name
        self._cache_filepath = cache_filepath
        self._device_local = device_local
        self._io = io_cls()

        # Model, tokenizer, and device are initialized as None and will be loaded later
        self._model = None
        self._tokenizer = None
        self._device = None

    @task_logger
    def run(self, data: pd.DataFrame) -> pd.DataFrame:
        """
        Executes the sentiment analysis task on the input DataFrame.

        This method first attempts to read sentiment predictions from a cache file. If the cache
        is not available or not valid, it performs sentiment analysis using the pre-trained model
        and writes the results to the cache. Sentiment predictions are stored in the specified
        `new_column` of the DataFrame.

        Args:
            data (pd.DataFrame): The input DataFrame containing the text data.

        Returns:
            pd.DataFrame: The DataFrame with sentiment predictions added to the specified column.

        Raises:
            FileNotFoundError: If the cache is not found and `device_local` is False.
            Exception: For any other unexpected errors.
        """
        try:
            cache = self._io.read(filepath=self._cache_filepath, lineterminator="\n")
            cache["id"] = cache["id"].astype("string")
            data = data.merge(cache[["id", self._new_column]], how="left", on="id")
            return data
        except (FileNotFoundError, TypeError):
            if self._device_local:
                return self._run(data=data)
            else:
                msg = (
                    f"Cache not found or not available. {self.__class__.__name__} is not "
                    "supported on local devices. Try running on Kaggle, Colab, or AWS."
                )
                self._logger.error(msg)
                raise FileNotFoundError(msg)
        except Exception as e:
            msg = f"Unknown exception encountered.\n{e}"
            self._logger.exception(msg)
            raise

    def _run(self, data: pd.DataFrame) -> pd.DataFrame:
        """
        Executes model inference for sentiment analysis and writes results to the cache.

        This method processes the input DataFrame by applying sentiment predictions for each entry
        in the specified text column. Rows are processed one at a time via `progress_apply`, and
        the `id` and prediction columns are then written to the cache file.

        Args:
            data (pd.DataFrame): The input DataFrame containing the text data.

        Returns:
            pd.DataFrame: The DataFrame with sentiment predictions added to the specified column.
        """
        torch.cuda.empty_cache()  # Clear CUDA memory to ensure sufficient space

        # Load the device, model, and tokenizer
        self._load_model_tokenizer_to_device()

        # Apply sentiment prediction to each text entry
        data[self._new_column] = data[self._column].progress_apply(
            self.predict_sentiment
        )

        # Write results to the cache file
        self._write_file(
            filepath=self._cache_filepath, data=data[["id", self._new_column]]
        )

        return data

    def predict_sentiment(self, text: str) -> str:
        """
        Predicts the sentiment of a given text string.

        This method uses the loaded model and tokenizer to predict the sentiment of the input
        text. It maps the model's output to a sentiment label.

        Args:
            text (str): The input text string.

        Returns:
            str: The predicted sentiment label, one of "Very Negative", "Negative", "Neutral",
                "Positive", or "Very Positive".
        """
        with torch.no_grad():
            inputs = self._tokenizer(
                text.lower(),
                return_tensors="pt",
                truncation=True,
                padding=True,
                max_length=512,
            )
            inputs = {key: value.to(self._device) for key, value in inputs.items()}
            outputs = self._model(**inputs)
            probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predicted_class = torch.argmax(probabilities, dim=-1).item()

        sentiment_map = {
            0: "Very Negative",
            1: "Negative",
            2: "Neutral",
            3: "Positive",
            4: "Very Positive",
        }
        return sentiment_map[predicted_class]

    def _load_model_tokenizer_to_device(self) -> None:
        """
        Loads the pre-trained model, tokenizer, and device for sentiment analysis.

        This method selects the appropriate device (GPU or CPU), loads the tokenizer and model
        based on the specified model name, and moves the model to the selected device.
        """
        self._device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self._tokenizer = AutoTokenizer.from_pretrained(self._model_name)
        self._model = AutoModelForSequenceClassification.from_pretrained(
            self._model_name
        )
        self._model.to(self._device)
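
The task above calls the model once per review. For larger datasets, one common alternative is to send reviews to the model in batches. The sketch below shows only the batching logic, with the model replaced by a stand-in: `batched`, `predict_in_batches`, and `fake_predict_batch` are hypothetical names, not part of the `discover` package. In a real version, `predict_batch` would tokenize the list of texts with padding enabled and take the argmax of each row of logits.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def predict_in_batches(
    texts: Sequence[str],
    predict_batch: Callable[[List[str]], List[str]],
    batch_size: int = 32,
) -> List[str]:
    """Label every text, calling `predict_batch` once per batch, preserving order."""
    labels: List[str] = []
    for chunk in batched(texts, batch_size):
        labels.extend(predict_batch(list(chunk)))
    return labels


def fake_predict_batch(chunk: List[str]) -> List[str]:
    """Stand-in for batched model inference (assumption): a keyword rule."""
    return ["Positive" if "good" in text else "Negative" for text in chunk]


texts = ["good", "bad", "good stuff", "meh", "so good"]
print(predict_in_batches(texts, fake_predict_batch, batch_size=2))
```

Because the results come back in input order, the returned list can be assigned directly to a new DataFrame column.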

## Sentiment Analysis Pipeline
Similar to the previous Ingestion pipeline, we obtain the configuration using `FlowConfigReader` and set up the `SentimentAnalysisStage` with the specified phase and stage definitions. The stage is then built and executed, with the `asset_id` capturing the resulting data asset.


In [5]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=StageDef.SENTIMENT
)

# Build and run Data Sentiment Analysis Stage
stage = SentimentAnalysisStage.build(
    stage_config=stage_config, return_dataset=True, force=FORCE
)
dataset = stage.run()



#                            Sentiment Analysis Stage                            #



                             SentimentAnalysisTask                              
                             ---------------------                              
                          Start Datetime | Sun, 24 Nov 2024 15:49:43


[11/24/2024 03:49:44 PM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [_remove_dataset_file_by_filepath] : Removed dataset file at workspace/dev/dataset/01_dataprep/appvocai_discover-01_dataprep-02_sentiment-review-dataset.parquet from repository.
[11/24/2024 03:49:44 PM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [remove] : Removed dataset dataset-dev-dataprep-sentiment-review from the repository.


                       Complete Datetime | Sun, 24 Nov 2024 15:49:44
                                 Runtime | 0.5 seconds


                            Sentiment Analysis Stage                            
                           Stage Started | Sun, 24 Nov 2024 15:49:43
                         Stage Completed | Sun, 24 Nov 2024 15:49:44
                           Stage Runtime | 1.08 seconds
                           Cached Result | True





## Inspect Results
This sample of five reviews compares predicted sentiment with star ratings, revealing the complexity and nuance in user opinion.

In [6]:
dataset.content[["id", "content", "rating", "sa_sentiment"]].sample(n=5, random_state=8)

| | id | content | rating | sa_sentiment |
| --- | --- | --- | --- | --- |
| 56610 | 9912730443 | Mooncycle | 4 | Very Positive |
| 72533 | 9522246562 | Would be 5 stars, but missing this feature. Please bring back the option to remove contact from non-friends, so we can have some privacy. I think there’s use to be this option years ago. Now once you chat with non friends you will see each other active. For example if you contact thru Facebook marketplace you it will automatically add to your contact which you can’t remove and can only block. Which I don’t want to see anyone on my block list. So overall 3 stars unless fixed with future updates. | 3 | Neutral |
| 41259 | 8455560762 | I thought that the payout would have been higher for surveys. | 3 | Negative |
| 51489 | 7358923363 | I havent been able to stay in connection with what is happening in Palestine because instagram keeps blocking most of the videos and posts and banning their spread. As a user that has the total freedom to follow whatever page i want and that expects to recieve the news i select, this have been useless lately because i no longer have the right nor the ability to chose the platforms i would like to be connected to. Only ridiculously useless pages on the top of the feeds with very unimportant content while disasterous unhumatarian events are happening all over the world with zero coverage and transparency. Hypocrite | 1 | Very Negative |
| 29621 | 7043042797 | They give you so much unlimited information that other apps do not definitely worth downloading I have downloaded multiple maybe all and this was the best one !!! It’s awesome 🤰🏾🤰🏼🤰🏿🤰🏽🤰🏻🤰 | 5 | Very Positive |


### Summary of Sentiment vs. Ratings
1. **Entry 1: Mooncycle**
   - **Rating**: 4
   - **Sentiment Analysis**: Very Positive
   - **Comment**: The user provided a high rating (4 stars), and the model assigned a Very Positive label. The label agrees with the rating, but the review text is a single word with no explicit sentiment, so the text alone does not support the prediction.

2. **Entry 2: Privacy Concern**
   - **Rating**: 3
   - **Sentiment Analysis**: Neutral
   - **Comment**: The review mentions significant concerns about privacy features but still gives a moderate rating of 3 stars. The sentiment analysis classified this as Neutral, which seems reasonable given the mix of positive and negative feedback. However, one might argue that the model's Negative label could better capture the overall tone.

3. **Entry 3: Survey Payouts**
   - **Rating**: 3
   - **Sentiment Analysis**: Negative
   - **Comment**: The user was disappointed with survey payouts, rating the experience as 3 stars. The sentiment analysis classified this as Negative, which reflects the user's dissatisfaction. The rating, however, seems higher than expected for a purely negative sentiment, suggesting potential leniency or mixed feelings not fully captured by the text.

4. **Entry 4: Instagram Censorship**
   - **Rating**: 1
   - **Sentiment Analysis**: Very Negative
   - **Comment**: This review strongly criticizes Instagram's content policies, and the user gave the lowest possible rating (1 star). The sentiment analysis accurately labeled this as Very Negative, showing a clear alignment between sentiment and rating.

5. **Entry 5: Informative App**
   - **Rating**: 5
   - **Sentiment Analysis**: Very Positive
   - **Comment**: The review is overwhelmingly positive, emphasizing the app's usefulness and unique features, and the user gave a 5-star rating. The sentiment analysis correctly labeled it as Very Positive, demonstrating alignment between the rating and sentiment.

### Observations
- **Alignment Between Sentiment and Rating**: In this sample, sentiment labels track the ratings closely. Very Positive labels accompany ratings of 4 and 5, and the Very Negative label accompanies the lowest rating of 1.
- **Mixed Reviews**: The Neutral sentiment for the privacy concern review highlights the complexity of mixed feedback, where both positives and negatives are present. This might require more nuanced classification.
- **Neutral Sentiment vs. Moderate Rating**: Both 3-star reviews received a Neutral or Negative label. Their ratings reflect appreciation for the app's core value, while the text reveals dissatisfaction with specific features or limitations.
- **Sentiment Outliers**: No significant mismatches between label and rating are observed in this sample. However, cases like Entry 2 show how a Neutral label can accompany a moderate rating that stems from unfulfilled expectations.

In this small sample, predicted sentiment generally agrees with user ratings, and the review text offers insight into specific areas of satisfaction or dissatisfaction that numerical ratings alone do not convey.
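
The comparison between labels and ratings can also be expressed numerically. The sketch below uses the five (rating, label) pairs from the sample above and scores each label on a 1 to 5 scale. Treating the labels as an ordinal scale comparable to star ratings is an assumption made for illustration, and `LABEL_SCORE` and `alignment` are hypothetical names.

```python
# Ordinal score per label. Mapping labels onto the 1-5 star scale is an assumption.
LABEL_SCORE = {
    "Very Negative": 1,
    "Negative": 2,
    "Neutral": 3,
    "Positive": 4,
    "Very Positive": 5,
}

# (rating, predicted label) pairs taken from the five sampled reviews.
SAMPLE = [
    (4, "Very Positive"),
    (3, "Neutral"),
    (3, "Negative"),
    (1, "Very Negative"),
    (5, "Very Positive"),
]


def alignment(pairs):
    """Return the share of exact matches and the mean absolute rating-label gap."""
    gaps = [abs(rating - LABEL_SCORE[label]) for rating, label in pairs]
    exact = sum(gap == 0 for gap in gaps) / len(gaps)
    mean_gap = sum(gaps) / len(gaps)
    return exact, mean_gap


exact, mean_gap = alignment(SAMPLE)
print(f"exact matches: {exact:.0%}, mean absolute gap: {mean_gap:.1f}")
# prints: exact matches: 60%, mean absolute gap: 0.4
```

On the full dataset, a cross-tabulation such as `pd.crosstab(df["rating"], df["sa_sentiment"])` would give a fuller view of where labels and ratings diverge.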

In the next section, we evaluate data quality and requirements for data cleaning.