In [1]:
import warnings

warnings.filterwarnings("ignore")
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
FORCE = True

# Text Quality Analysis (TQA) for Aspect-Based Sentiment Analysis (ABSA)
---
In the upcoming **Aspect-Based Sentiment Analysis (ABSA)** modeling phase, we evaluate the cost-performance tradeoffs between custom LLMs and fine-tuned foundational models for identifying, extracting, and classifying sentiments associated with product and service aspects in the AppVoCAI Dataset. A consistent trend in ABSA literature underscores the primacy of text quality over sheer volume, where high-quality input data improves the granularity and reliability of extracted sentiment signals.

This Text Quality Analysis (TQA) effort assesses review suitability for ABSA model development, emphasizing syntactic and lexical features that influence the clarity, richness, and relevance of aspects, opinions, and sentiment signals. The weighted text quality scoring method described below will guide instance selection for ABSA model training and fine-tuning.

## Measures of Text Quality and Their Weights
---
The following **syntactic features** and their **weights** quantify the quality of a given text for ABSA. Each measure contributes to the overall **Text Quality Analysis (TQA)** score, which is a weighted sum of these features.

$$
\text{TQA Score}=\sum_{i} \left( \text{Measure Score}_i \times \text{Weight}_i \right)
$$

### High Importance (Coefficients 3.0 and above):    
- **aspect_verb_pairs (4.0)**: These pairs directly link an aspect (noun or noun phrase) with an action or state (verb) related to that aspect. This is highly informative for understanding how users feel about specific aspects. "The battery drains quickly" clearly links "battery" (aspect) with a negative sentiment via "drains quickly" (verb phrase).
- **noun_adjective_pairs (3.0)**: These pairs are strong indicators of sentiment towards an aspect. "Excellent screen" directly expresses positive sentiment towards "screen." While slightly less direct than aspect-verb pairs, they are still highly valuable.
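
To make the two pair types concrete, here is a minimal sketch of how they could be read off a dependency parse. It is an illustration under stated assumptions, not the extraction logic used by the pipeline (which is not shown here): the `Tok` class is a hypothetical stand-in for a parsed token, with fields modeled on spaCy's `pos_`, `dep_`, and `head` attributes, and the matching rules (`nsubj` for aspect-verb, `amod` for noun-adjective) are one plausible choice.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical stand-in for a parsed token (not a spaCy class)."""
    text: str
    pos: str   # Universal POS tag, e.g. "NOUN", "VERB", "ADJ"
    dep: str   # dependency label, e.g. "nsubj", "amod"
    head: int  # index of the syntactic head within the sentence


def aspect_verb_pairs(tokens):
    """Nouns acting as the subject of a verb, returned as (aspect, verb)."""
    return [
        (t.text, tokens[t.head].text)
        for t in tokens
        if t.pos == "NOUN" and t.dep == "nsubj" and tokens[t.head].pos == "VERB"
    ]


def noun_adjective_pairs(tokens):
    """Adjectives that directly modify a noun, returned as (noun, adjective)."""
    return [
        (tokens[t.head].text, t.text)
        for t in tokens
        if t.pos == "ADJ" and t.dep == "amod" and tokens[t.head].pos == "NOUN"
    ]


# Hand-annotated parse of "The battery drains quickly"
s1 = [
    Tok("The", "DET", "det", 1),
    Tok("battery", "NOUN", "nsubj", 2),
    Tok("drains", "VERB", "ROOT", 2),
    Tok("quickly", "ADV", "advmod", 2),
]
# Hand-annotated parse of "Excellent screen"
s2 = [
    Tok("Excellent", "ADJ", "amod", 1),
    Tok("screen", "NOUN", "ROOT", 1),
]

print(aspect_verb_pairs(s1))     # [('battery', 'drains')]
print(noun_adjective_pairs(s2))  # [('screen', 'Excellent')]
```

In practice the token annotations would come from a parser such as spaCy rather than being written by hand.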

### Medium Importance (Coefficients between 1.5 and 2.5):    
- **noun_phrases (2.5)**: Noun phrases often represent aspects themselves, even without an accompanying adjective or verb. "The camera quality" is a clear aspect, even if the sentiment is expressed elsewhere in the sentence.
- **verb_phrases (2.0)**: Verb phrases can provide context and nuance to the sentiment. "The phone performs well" is more informative than just "phone" or "performs."
- **adjective_count (1.5)**: The sheer number of adjectives can give a general indication of the sentiment's intensity. A review with many positive adjectives is likely more positive overall. However, individual adjectives within noun-adjective pairs are more informative.
- **lexical_density (1.5)**: Lexical density (the proportion of content words) can be a proxy for the information content of a review. Higher lexical density might suggest more specific and detailed feedback, which could be useful for ABSA.
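
As a rough illustration of lexical density, the sketch below computes the proportion of content words from a list of Universal POS tags. The set of tags treated as content words (`CONTENT_POS`) and the function name are assumptions made for this example; definitions vary, and the pipeline may use a different tag set or scale.

```python
# Assumed set of content-word tags (Universal POS); definitions vary.
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}


def lexical_density(pos_tags):
    """Share of content-word tags among all non-punctuation tags."""
    words = [p for p in pos_tags if p not in {"PUNCT", "SPACE"}]
    if not words:
        return 0.0
    return sum(p in CONTENT_POS for p in words) / len(words)


# "The phone performs well" tagged as DET NOUN VERB ADV
tags = ["DET", "NOUN", "VERB", "ADV"]
print(lexical_density(tags))  # 0.75
```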

### Lower Importance (Coefficients below 1.5):   
- **adverb_count (0.75)**: Adverbs can modify adjectives or verbs, adding nuance. However, their contribution to ABSA might be less direct compared to adjectives or verbs themselves.
- **noun_count (1.0)**: The raw count of nouns is less informative than noun phrases or nouns in noun-adjective pairs. It's more of a general indicator of review length and complexity.
- **verb_count (1.0)**: Similar to noun count, the raw count of verbs is less directly related to ABSA than verb phrases or verbs in aspect-verb pairs.
- **adverbial_phrases (0.5)**: Similar to adverbs, adverbial phrases can add detail but are less directly related to aspect-based sentiment.
- **review_length (1.0)**: Review length is a general metric and doesn't directly contribute to ABSA. Longer reviews might contain more information, but they can also be rambling.
- **dependency_depth (1.0)**: Dependency depth can be a measure of syntactic complexity, but its relationship to ABSA is not as clear. Complex sentences aren't necessarily more or less sentiment-bearing than simpler ones.
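
One common way to quantify dependency depth is the longest path from any token up to the root of the parse tree. The sketch below assumes the parse is given as a list of head indices in which the root points to itself (the convention spaCy uses for `token.head`); whether the pipeline measures depth this way is an assumption.

```python
def dependency_depth(heads):
    """Longest token-to-root path length; assumes a valid tree whose root is its own head."""
    def depth(i):
        d = 0
        while heads[i] != i:  # climb until we reach the root
            i = heads[i]
            d += 1
        return d

    return max((depth(i) for i in range(len(heads))), default=0)


# "The battery drains quickly": The -> battery -> drains (root), quickly -> drains
print(dependency_depth([1, 2, 2, 2]))  # 2
```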

**Summary of Weights:**

| Measure              | Weight |
|----------------------|--------|
| aspect_verb_pairs    | 4      |
| noun_adjective_pairs | 3      |
| noun_phrases         | 2.5    |
| verb_phrases         | 2      |
| adjective_count      | 1.5    |
| lexical_density      | 1.5    |
| noun_count           | 1      |
| verb_count           | 1      |
| review_length        | 1      |
| dependency_depth     | 1      |
| adverb_count         | 0.75   |
| adverbial_phrases    | 0.5    |

**Important Note:** These weights represent a starting hypothesis. The TQA Exploratory Data Analysis will evaluate the degree to which this weighting regime reflects the quality of reviews for the ABSA task.
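
The weighted sum can be sketched in a few lines of Python. The `WEIGHTS` dictionary mirrors the summary table above; the function name `weighted_tqa_score` is illustrative and not part of the `genailab` API. The example feature values are copied from row 1339 of the sample output later in this notebook, so they are assumed to already be on the scale the pipeline produces.

```python
# Weights copied from the summary table above.
WEIGHTS = {
    "aspect_verb_pairs": 4.0,
    "noun_adjective_pairs": 3.0,
    "noun_phrases": 2.5,
    "verb_phrases": 2.0,
    "adjective_count": 1.5,
    "lexical_density": 1.5,
    "noun_count": 1.0,
    "verb_count": 1.0,
    "review_length": 1.0,
    "dependency_depth": 1.0,
    "adverb_count": 0.75,
    "adverbial_phrases": 0.5,
}


def weighted_tqa_score(features):
    """Weighted sum of measure scores; measures missing from `features` count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())


# Feature values for row 1339 of the sample output.
example = {
    "noun_count": 4.682131,
    "verb_count": 1.386294,
    "adjective_count": 0.0,
    "adverb_count": 0.0,
    "aspect_verb_pairs": 1.098612,
    "noun_adjective_pairs": 0.0,
    "noun_phrases": 0.0,
    "verb_phrases": 0.693147,
    "adverbial_phrases": 0.0,
    "review_length": 4.718499,
    "lexical_density": 4.615121,
    "dependency_depth": 4.718499,
}

print(round(weighted_tqa_score(example), 4))  # 28.2088
```

The result agrees with the `tqa_score` of 28.208848 reported for that row, up to rounding of the displayed feature values.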

## Text Quality Analysis Pipeline
The text quality analysis (TQA) processing pipeline computes these text quality metrics at scale.

The pipeline leverages:     
- Dask for distributed data processing, enabling efficient computation over large text datasets.
- spaCy for NLP tasks, including dependency parsing and part-of-speech (POS) tagging.

The pipeline follows these steps:    
- Dataset Configuration: Define the source and target dataset configurations.
- Pipeline Construction: Instantiate the TQAStageBuilder and configure it for Dask processing.
- Execution: Run the TQAStage, applying text analysis and feature extraction.



### Import Libraries
---

In [2]:

from genailab.setup import auto_wire_container
from genailab.core.dtypes import DFType
from genailab.infra.utils.file.fileset import FileFormat
from genailab.asset.dataset.config import DatasetConfig
from genailab.flow.dataprep.tqa.builder import TQAStageBuilder
from genailab.core.flow import PhaseDef, StageDef


# Wire container
container = auto_wire_container()

### Define the Source and Target Dataset Configurations
---
The source dataset represents the cleaned text data, while the target dataset will store the extracted text quality features.

In [3]:
# Source Dataset Configuration
source_config = DatasetConfig(
    phase=PhaseDef.DATAPREP,
    stage=StageDef.CLEAN,
    name="review",
    file_format=FileFormat.PARQUET,
    asset_type="dataset",
    dftype=DFType.PANDAS,
)

# Target Dataset Configuration
target_config = DatasetConfig(
    phase=PhaseDef.DATAPREP,
    stage=StageDef.TQA,
    name="review",
    file_format=FileFormat.PARQUET,
    asset_type="dataset",
    dftype=DFType.PANDAS,
)


### Construct the TQA Pipeline
---
We use the TQAStageBuilder to configure a Dask-powered text quality analysis pipeline with:

- Normalization enabled (ensures robust feature scaling).
- Batch processing (improves efficiency for large datasets).

In [4]:
stage = (
    TQAStageBuilder()
        .analyze_text()
        .build(source_config=source_config, target_config=target_config))


### Run the Pipeline
---
Once the pipeline is built, we execute it to compute text quality features.

In [5]:
dataset = stage.run(force=FORCE)



#             Text Quality Analysis Stage Sun, 09 Feb 2025 20:23:34              #

____________________________________________________________________________________________________
Text Quality Analysis Stage             20:23:34    20:25:16    1.0 minute and 41.91 seconds                       





## TQA Scoring Evaluation
---
Do `tqa_score` and `tqa_rating` adequately reflect the quality of review text for ABSA? Let's inspect samples stratified by `tqa_rating`.

In [None]:
df = dataset.dataframe

# Columns to inspect: review text, the twelve measures, and the resulting score and rating
cols = [
    "content",
    "noun_count",
    "verb_count",
    "adjective_count",
    "adverb_count",
    "aspect_verb_pairs",
    "noun_adjective_pairs",
    "noun_phrases",
    "verb_phrases",
    "adverbial_phrases",
    "review_length",
    "lexical_density",
    "dependency_depth",
    "tqa_score",
    "tqa_rating",
]

# Draw ten random reviews from each rating level
tqa5 = df.loc[df["tqa_rating"] == 5, cols].sample(n=10)
tqa4 = df.loc[df["tqa_rating"] == 4, cols].sample(n=10)
tqa3 = df.loc[df["tqa_rating"] == 3, cols].sample(n=10)
tqa2 = df.loc[df["tqa_rating"] == 2, cols].sample(n=10)
tqa1 = df.loc[df["tqa_rating"] == 1, cols].sample(n=10)


## Highest Quality Reviews (TQA Rating = 5)
---


Now that we've confirmed the dataset has been successfully processed, we have a set of **text quality analysis** metrics to use for instance selection during the feature engineering stage.
Next, we transition to **sentiment analysis at the review level**, where we will analyze the overall **sentiment** of each review.

In [7]:
tqa5

|      | content | noun_count | verb_count | adjective_count | adverb_count | aspect_verb_pairs | noun_adjective_pairs | noun_phrases | verb_phrases | adverbial_phrases | review_length | lexical_density | dependency_depth | tqa_score | tqa_rating |
|------|---------|------------|------------|-----------------|--------------|-------------------|----------------------|--------------|--------------|-------------------|---------------|-----------------|------------------|-----------|------------|
| 1339 | i have only had this happen twice but this las... | 4.682131 | 1.386294 | 0.0 | 0.0 | 1.098612 | 0.0 | 0.0 | 0.693147 | 0.0 | 4.718499 | 4.615121 | 4.718499 | 28.208848 | 5 |
| 1049 | i used to really like this app but recently it... | 4.219508 | 1.098612 | 0.0 | 0.0 | 0.0 | 0.0 | 0.693147 | 0.0 | 0.0 | 4.26268 | 4.615121 | 4.26268 | 22.499028 | 5 |
| 1643 | is time for my new years resolutions and becom... | 4.356709 | 1.386294 | 0.0 | 0.0 | 0.0 | 0.0 | 0.693147 | 0.0 | 0.0 | 4.406719 | 4.615121 | 4.406719 | 23.21199 | 5 |
| 4242 | it is very shameful for myanmar military terro... | 4.488636 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.693147 | 0.0 | 0.0 | 4.49981 | 4.615121 | 4.49981 | 22.143804 | 5 |
| 649 | the inapp menu does not match the pamphlet the... | 3.044522 | 1.098612 | 0.0 | 0.0 | 0.693147 | 0.0 | 0.693147 | 0.0 | 0.0 | 3.135494 | 4.615121 | 3.135494 | 21.842261 | 5 |
| 953 | hilarious content i dont know what youre doing... | 2.944439 | 1.386294 | 0.0 | 0.0 | 0.693147 | 0.0 | 0.693147 | 0.0 | 0.0 | 3.091042 | 4.615121 | 3.091042 | 21.940956 | 5 |
| 1722 | ive been on ww many times in my life starting ... | 4.189655 | 0.693147 | 0.0 | 0.0 | 0.0 | 0.0 | 0.693147 | 0.0 | 0.0 | 4.204693 | 4.615121 | 4.204693 | 21.947736 | 5 |
| 1779 | i absolutely love this app it has reps sets an... | 5.105945 | 2.079442 | 0.0 | 0.0 | 0.0 | 0.0 | 0.693147 | 0.0 | 0.0 | 5.159055 | 4.615121 | 5.159055 | 26.159046 | 5 |
| 1577 | okay so i have adhd and i need to play with st... | 4.304065 | 1.098612 | 0.0 | 0.0 | 0.693147 | 0.0 | 0.693147 | 0.0 | 0.0 | 4.343805 | 4.615121 | 4.343805 | 25.518426 | 5 |
| 3456 | i have for years been somewhat skeptical of ai... | 4.219508 | 1.609438 | 0.0 | 0.0 | 0.693147 | 0.0 | 0.0 | 0.693147 | 0.0 | 4.276666 | 4.615121 | 4.276666 | 25.463842 | 5 |
