In [None]:
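import os
import warnings

# When run from inside a "jbook" directory, move two levels up so relative paths resolve.
if "jbook" in os.getcwd():
    os.chdir(os.path.abspath("../.."))

# Suppress library warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")

# Passed to the stage builder below as its `force` argument.
FORCE = False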

# Sentiment Classification 
This notebook uses a **DistilBERT-based sentiment classification model**, `tabularisai/robust-sentiment-analysis`, to classify the sentiment of each text in the dataset in support of **Data Quality Assessment (DQA)** and **Exploratory Data Analysis (EDA)**. An off-the-shelf, pre-trained model gives us a sense of the sentiment class balance and early insight into the data at low computational cost.

## Model Overview
- **Model Name**: `tabularisai/robust-sentiment-analysis`
- **Base Model**: `distilbert/distilbert-base-uncased`
- **Task**: Text Classification (Sentiment Analysis)
- **Language**: English
- **Number of Classes**: 5 sentiment categories:
  - **Very Negative**
  - **Negative**
  - **Neutral**
  - **Positive**
  - **Very Positive**
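
To make the five-class output concrete, the following minimal sketch turns a vector of raw logits into one of the labels above. The index order in `ID2LABEL` (0 for Very Negative through 4 for Very Positive) is an assumption and should be checked against the model's own configuration; the logits are hypothetical values used only for illustration.

```python
import math

# Assumed index-to-label order; verify against the model's config before use.
ID2LABEL = {
    0: "Very Negative",
    1: "Negative",
    2: "Neutral",
    3: "Positive",
    4: "Very Positive",
}


def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Hypothetical logits for a single review, for illustration only.
label, confidence = predict_label([-2.1, -1.3, 0.2, 1.8, 3.4])
print(label, round(confidence, 3))
```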

## Model Description
This model is a fine-tuned version of `distilbert-base-uncased`, optimized for sentiment analysis using synthetic data generated by language models such as **Llama3.1** and **Gemma2**. Because it was trained exclusively on synthetic data, the model has been exposed to a diverse range of sentiment expressions, which is intended to help it generalize across different use cases.

## Purpose of the Notebook
1. **Data Quality Assessment (DQA)**: By running sentiment analysis on the dataset, we can assess sentiment distribution and identify any potential biases or issues in the data that may impact subsequent analysis.
2. **Exploratory Data Analysis (EDA)**: Understanding the overall sentiment landscape of the dataset provides critical context for deeper analysis, revealing trends, patterns, or anomalies in the data.
3. **Pre-Tuned Efficiency**: Using an off-the-shelf model keeps the analysis quick and inexpensive, allowing us to focus on insights rather than model optimization. This is particularly valuable because we will later fine-tune a more specialized model for aspect-based sentiment analysis (ABSA).
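
As a rough illustration of the class-balance check described in the first point, the sketch below tallies predicted labels into counts and shares. The helper `sentiment_distribution` is a hypothetical name, not part of the `discover` package, and the input is the five labels from the sample inspected later in this notebook.

```python
from collections import Counter

LABELS = ["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]


def sentiment_distribution(labels):
    """Return {label: (count, share)} for every class, including empty ones."""
    counts = Counter(labels)
    total = sum(counts[label] for label in LABELS)
    return {
        label: (counts[label], counts[label] / total if total else 0.0)
        for label in LABELS
    }


# Labels from the five-row sample inspected later in this notebook.
sample = ["Very Positive", "Very Positive", "Very Positive", "Negative", "Neutral"]
for label, (count, share) in sentiment_distribution(sample).items():
    print(f"{label:<14} {count:>3} {share:6.1%}")
```

On the full DataFrame, `df["dqp_sentiment"].value_counts(normalize=True)` gives the same shares directly, although it omits classes that never occur.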

## Workflow Outline
1. **Loading and Preprocessing Data**:
   - Import the necessary libraries and load the dataset.
   - Perform any required preprocessing, such as cleaning text data and handling missing values.

2. **Model Setup**:
   - Load the `tabularisai/robust-sentiment-analysis` model from Hugging Face.
   - Configure the model for efficient sentiment classification.

3. **Sentiment Analysis**:
   - Use the model to predict sentiment for each text entry in the dataset.
   - Classify sentiments into one of the five categories: Very Negative, Negative, Neutral, Positive, or Very Positive.
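
The steps above can be sketched outside the `discover` pipeline as follows. This is a minimal illustration, not the implementation used by `SentimentClassificationStage`: `load_classifier` and `classify_texts` are hypothetical helpers, the batch size is arbitrary, and the label strings returned depend on the model's configuration.

```python
def load_classifier(model_name="tabularisai/robust-sentiment-analysis"):
    """Build a Hugging Face text-classification pipeline (downloads the model)."""
    # Imported lazily so classify_texts can be used without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_name)


def classify_texts(texts, classifier, batch_size=32):
    """Label each text; empty or missing entries get None instead of a guess."""
    results = [None] * len(texts)
    # Keep only non-empty strings, remembering their original positions.
    valid = [(i, t) for i, t in enumerate(texts) if isinstance(t, str) and t.strip()]
    for start in range(0, len(valid), batch_size):
        batch = valid[start:start + batch_size]
        # truncation=True asks the tokenizer to cut reviews that exceed the model's limit.
        predictions = classifier([t for _, t in batch], truncation=True)
        for (i, _), pred in zip(batch, predictions):
            results[i] = pred["label"]
    return results
```

Calling `classify_texts(df["content"].tolist(), load_classifier())` would then return one label (or `None`) per row. Passing the classifier in as an argument keeps the batching logic testable with a stand-in function.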


## Imports

In [None]:
import pandas as pd
from discover.container import DiscoverContainer
from discover.flow.data_prep.sentiment.stage import SentimentClassificationStage
from discover.infra.config.flow import FlowConfigReader
from discover.core.flow import DataPrepStageDef, PhaseDef

pd.options.display.max_colwidth = None

In [None]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.data_prep.stage",
    ],
)

## Sentiment Classification Pipeline

In [4]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=DataPrepStageDef.SENTIMENT
)
# Build and run the stage
stage = SentimentClassificationStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()

[11/13/2024 02:27:17 AM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [_remove_dataset_file_by_filepath] : Removed dataset file at workspace/dev/dataset/01_dataprep/appvocai_discover-01_dataprep-02_sentiment-review-dataset.parquet from repository.
[11/13/2024 02:27:17 AM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [remove] : Removed dataset dataset-dev-dataprep-sentiment-review from the repository.




#                         Sentiment Classification Stage                         #



                              MergeSentimentsTask                               
                              -------------------                               
                          Start Datetime | Wed, 13 Nov 2024 02:27:18
                       Complete Datetime | Wed, 13 Nov 2024 02:27:18
                                 Runtime | 0.64 seconds


                         Sentiment Classification Stage                         
                           Stage Started | Wed, 13 Nov 2024 02:27:17
                         Stage Completed | Wed, 13 Nov 2024 02:27:19
                           Stage Runtime | 1.31 seconds
                           Cached Result | True





## Check Results

In [None]:
# Instantiate the repository
repo = container.repo.dataset_repo()
# Load the dataset from the repository
df = repo.get(asset_id=asset_id).content
# Inspect a few rows
df[["id", "app_name", "content", "rating", "dqp_sentiment"]].sample(
    n=5, random_state=22
)

| | id | app_name | content | rating | dqp_sentiment |
|---|---|---|---|---|---|
| 76544 | 10201072201 | Ad Block One: Tube Ad Blocker | Awesome | 5 | Very Positive |
| 82700 | 8833410900 | Cleanup: Phone Storage Cleaner | Save time and space | 5 | Very Positive |
| 26968 | 8183002438 | sweetgreen | I’ve used Chipotle’s and other restaurants’ apps and this is by far the easiest to use and best interface. Not to mention it is similar in price to get a salad delivered and the food is absolutely amazing!! I do have two suggestions: (I) allow the user to add more than two bases and (ii) allow for the use of Apple Pay at checkout. Thanks :)!!! | 5 | Very Positive |
| 2161 | 9187505053 | OwO Novel - Read Romance Story | The app is not worth 5 stars and the cost for chapters keeps going up | 5 | Negative |
| 60842 | 9288815829 | Bible | I use this app every day. Easy and intuitive. I like all the different versions. I would love to see a chronological version and a Reference to Jesus version. I want plans that are for one day a week. | 5 | Neutral |


From this sample, several observations are notable:

1. **Ad Block One: Tube Ad Blocker ("Awesome")**: The 5-star rating and "Very Positive" sentiment are well aligned, as the single-word feedback conveys a clear and enthusiastic endorsement. No further action is needed here.

2. **Cleanup: Phone Storage Cleaner ("Save time and space")**: The sentiment analysis again correctly identifies the positive tone of the review, which matches the 5-star rating. The short, impactful statement reflects high user satisfaction with the app's functionality.

3. **sweetgreen**: The review content justifies both the "Very Positive" sentiment and the 5-star rating. The user expresses enthusiasm about the app's interface, ease of use, and the quality of the food. Despite the suggested improvements (allowing more than two bases and supporting Apple Pay), the overall sentiment is overwhelmingly positive. This is a good example of how constructive feedback can coexist with high satisfaction, and the sentiment label captures the overall positive tone.

4. **OwO Novel - Read Romance Story**: The negative content conflicts with the 5-star rating. The user explicitly states that the app "is not worth 5 stars" and criticizes the rising cost for chapters. The sentiment model is likely correct in detecting negativity here, and the high rating contradicts the written review. Users may sometimes give ratings that do not reflect their written feedback, which shows the risk of relying on ratings alone as a proxy for sentiment.

5. **Bible**: The review content provides constructive feedback alongside a description of regular app use. The suggestions for additional features, like a chronological version and specific plans, are not emotionally charged, which supports the "Neutral" sentiment label. However, the 5-star rating indicates a high level of satisfaction despite the neutral tone of the review. This suggests that the user is content overall but expressed feedback in a more factual manner. The model’s labeling is understandable, but incorporating more contextual understanding might help align sentiment labels more closely with ratings in cases like this.

### Key Takeaways and Recommendations
- **sweetgreen**: The model captures the overall positivity despite the suggestions for improvement, which indicates that it can handle mixed feedback in at least this case.
- **OwO Novel - Read Romance Story**: This highlights a potential gap in understanding user intent behind ratings. Further investigation into user behavior (such as high ratings paired with negative comments) may provide insights into refining sentiment analysis models.
- **Bible**: This review underscores the challenge of interpreting reviews that are positive overall but expressed in a neutral tone. Sentiment analysis might benefit from additional heuristics or metadata to better align with user ratings.
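
One way to follow up on the rating and sentiment disagreements noted above is to map each label onto the 1 to 5 star scale and flag reviews where the two differ by a chosen margin. The sketch below is only an illustration: the label-to-score mapping, the `min_gap` threshold of 2, and the helper name `flag_mismatches` are all assumptions, and the rows are the five sampled reviews with their text omitted.

```python
# Assumed mapping from sentiment label to a 1-5 scale comparable to star ratings.
SENTIMENT_SCORE = {
    "Very Negative": 1,
    "Negative": 2,
    "Neutral": 3,
    "Positive": 4,
    "Very Positive": 5,
}


def flag_mismatches(rows, min_gap=2):
    """Return rows whose rating and sentiment score differ by at least min_gap."""
    flagged = []
    for row in rows:
        gap = row["rating"] - SENTIMENT_SCORE[row["dqp_sentiment"]]
        if abs(gap) >= min_gap:
            flagged.append({**row, "gap": gap})
    return flagged


# The five sampled reviews from the table above (review text omitted).
sample_rows = [
    {"app_name": "Ad Block One: Tube Ad Blocker", "rating": 5, "dqp_sentiment": "Very Positive"},
    {"app_name": "Cleanup: Phone Storage Cleaner", "rating": 5, "dqp_sentiment": "Very Positive"},
    {"app_name": "sweetgreen", "rating": 5, "dqp_sentiment": "Very Positive"},
    {"app_name": "OwO Novel - Read Romance Story", "rating": 5, "dqp_sentiment": "Negative"},
    {"app_name": "Bible", "rating": 5, "dqp_sentiment": "Neutral"},
]

for row in flag_mismatches(sample_rows):
    print(row["app_name"], row["rating"], row["dqp_sentiment"], row["gap"])
```

A positive gap means the star rating is more favorable than the text. Raising `min_gap` to 3 would keep only the strongest contradictions, such as the OwO Novel review.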

Overall, these examples illustrate the complexities of sentiment analysis when ratings and content do not align perfectly. In this sample, the model appears to capture the general sentiment conveyed by the text.

In the next section, we will evaluate how much noise is present in the review text.