# Sentiment Classification 
This notebook uses the DistilBERT-based `tabularisai/robust-sentiment-analysis` model to classify the sentiment of each text entry in the dataset, in support of **Data Quality Assessment (DQA)** and **Exploratory Data Analysis (EDA)**. Using the pre-trained model off the shelf yields sentiment labels without any fine-tuning; the more nuanced **Aspect-Based Sentiment Analysis (ABSA)** will be handled separately in a later stage.

## Model Overview
- **Model Name**: `tabularisai/robust-sentiment-analysis`
- **Base Model**: `distilbert/distilbert-base-uncased`
- **Task**: Text Classification (Sentiment Analysis)
- **Language**: English
- **Number of Classes**: 5 sentiment categories:
  - **Very Negative**
  - **Negative**
  - **Neutral**
  - **Positive**
  - **Very Positive**

## Model Description
This model is a fine-tuned version of `distilbert-base-uncased`, trained for sentiment analysis on synthetic data generated by language models such as **Llama3.1** and **Gemma2**. Training exclusively on synthetic data exposes the model to a diverse range of sentiment expressions, which is intended to help it generalize across different use cases.

## Purpose of the Notebook
1. **Data Quality Assessment (DQA)**: The distribution of predicted sentiment across the dataset can expose imbalance, bias, or other data issues that may affect subsequent analysis.
2. **Exploratory Data Analysis (EDA)**: The overall sentiment landscape provides context for deeper analysis by revealing trends, patterns, and anomalies in the data.
3. **Off-the-Shelf Efficiency**: Using a pre-trained model without fine-tuning keeps the analysis quick, so the focus stays on insights rather than model optimization. A more specialized model will be fine-tuned later for ABSA.

## Workflow Outline
1. **Loading and Preprocessing Data**:
   - Import the necessary libraries and load the dataset.
   - Perform any required preprocessing, such as cleaning text data and handling missing values.

2. **Model Setup**:
   - Load the `tabularisai/robust-sentiment-analysis` model from Hugging Face.
   - Configure the model for efficient sentiment classification.

3. **Sentiment Analysis**:
   - Use the model to predict sentiment for each text entry in the dataset.
   - Classify sentiments into one of the five categories: Very Negative, Negative, Neutral, Positive, or Very Positive.
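
The preprocessing in step 1 might look roughly like the sketch below. It is a minimal illustration, not the notebook's actual cleaning logic: the helper name `clean_text` and the specific rules (collapse whitespace, treat empty strings as missing) are assumptions.

```python
import math
import re


def clean_text(value):
    """Return a whitespace-normalised string, or None when the entry is unusable."""
    # Treat None and float NaN (how pandas marks missing values) as missing.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    # Collapse runs of whitespace, including newlines, into single spaces.
    text = re.sub(r"\s+", " ", str(value)).strip()
    # An empty string carries no sentiment, so treat it as missing too.
    return text or None


# Made-up entries for illustration only.
raw_entries = ["  Great app!\n\nLove it. ", None, float("nan"), "   "]
cleaned = [clean_text(entry) for entry in raw_entries]
print(cleaned)
```

In the notebook, a helper like this could be applied with `df["content"].map(clean_text)`, followed by `df.dropna(subset=["content"])` so that the classifier never receives a missing value.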


In [None]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from pandarallel import pandarallel

pandarallel.initialize(nb_workers=18, verbose=False, progress_bar=True)

## Load Data

In [None]:
FP = "workspace/dev/dataset/01_dataprep/appvocai_discover-01_dataprep-03_tqa-review-dataset.parquet"
df = pd.read_parquet(FP)

## Load Model and Tokenizer

In [4]:
# Load model and tokenizer
model_name = "tabularisai/robust-sentiment-analysis"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

## Create Classifier

In [5]:
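# Function to predict sentiment
def predict_sentiment(text):
    """Return one of the five sentiment labels for a single text string."""
    # Lowercase to match the uncased base model; truncate to its 512-token limit.
    inputs = tokenizer(
        text.lower(), return_tensors="pt", truncation=True, padding=True, max_length=512
    )
    # Inference only, so gradient tracking is not needed.
    with torch.no_grad():
        outputs = model(**inputs)

    # Convert logits to probabilities and pick the most likely class index.
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(probabilities, dim=-1).item()

    # Class indices run from most negative (0) to most positive (4).
    sentiment_map = {
        0: "Very Negative",
        1: "Negative",
        2: "Neutral",
        3: "Positive",
        4: "Very Positive",
    }
    return sentiment_map[predicted_class]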
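
To make the softmax and argmax step in `predict_sentiment` concrete, here is a small pure-Python sketch of the same arithmetic. The logits are made-up numbers chosen for illustration, and the helper names `softmax` and `label_from_logits` are not part of the notebook.

```python
import math

# Same ordering as the sentiment_map used by predict_sentiment.
SENTIMENT_LABELS = ["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum first keeps the exponentials numerically stable.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_logits(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return SENTIMENT_LABELS[best], probs[best]


# Hypothetical logits; in the notebook they would come from outputs.logits.
example_logits = [-1.2, -0.3, 0.1, 2.4, 1.0]
label, confidence = label_from_logits(example_logits)
print(label, round(confidence, 3))
```

Because softmax preserves the ordering of its inputs, the argmax of the logits and the argmax of the probabilities always agree. The softmax only matters if the probability is kept alongside the label, for example to flag low-confidence predictions during DQA.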

## Perform Inference

In [None]:
df["sentiment"] = df["content"].parallel_apply(predict_sentiment)
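
As an alternative to row-by-row `parallel_apply`, texts can be grouped into batches so that each model call handles several entries at once. The sketch below shows only the batching logic: `iter_batches`, `predict_in_batches`, and the stand-in `fake_predict_batch` are hypothetical names, and the stand-in replaces the real model so the example runs without downloading anything.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(texts, predict_batch, batch_size=32):
    """Apply `predict_batch` (list of texts -> list of labels) one batch at a time."""
    labels = []
    for batch in iter_batches(texts, batch_size):
        batch_labels = predict_batch(batch)
        if len(batch_labels) != len(batch):
            raise ValueError("predict_batch must return one label per input text")
        labels.extend(batch_labels)
    return labels


# Stand-in predictor (an assumption for illustration, not the real model).
def fake_predict_batch(batch):
    return ["Positive" if "good" in text.lower() else "Neutral" for text in batch]


reviews = ["Good app", "It opens", "Really good", "Crashes sometimes", "good enough"]
print(predict_in_batches(reviews, fake_predict_batch, batch_size=2))
```

With the real model, the predictor would tokenize the whole list with `padding=True` and `truncation=True` and take the argmax over each row of logits. Whether this is faster than the parallel approach depends on the hardware and would need to be measured.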

## Check Results

In [None]:
df[["content", "sentiment"]].head()
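
Beyond inspecting the first rows, the DQA goal of this notebook suggests looking at how the predicted labels are distributed. The sketch below computes each label's share from a plain list; the sample labels are made up, and `sentiment_distribution` is a hypothetical helper rather than part of the notebook.

```python
from collections import Counter

# Ordered from most negative to most positive, matching the model's five classes.
LABEL_ORDER = ["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]


def sentiment_distribution(labels):
    """Return {label: share of entries} for all five labels, in LABEL_ORDER."""
    counts = Counter(labels)
    total = sum(counts[label] for label in LABEL_ORDER)
    # Labels that never occur are reported with a share of 0.0.
    return {
        label: (counts[label] / total if total else 0.0) for label in LABEL_ORDER
    }


# Made-up predictions for illustration only.
sample = ["Positive", "Very Positive", "Positive", "Neutral", "Very Negative"]
distribution = sentiment_distribution(sample)
for label, share in distribution.items():
    print(f"{label:<14}{share:6.1%}")
```

On the DataFrame itself, `df["sentiment"].value_counts(normalize=True)` gives the same shares, although it omits labels that never occur and sorts by frequency rather than by sentiment order.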