In [1]:
import warnings

warnings.filterwarnings("ignore")
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
FORCE = True

# Text Quality Analysis (TQA) for Aspect-Based Sentiment Analysis (ABSA)
---
In **Aspect-Based Sentiment Analysis (ABSA)**, the goal is to extract the sentiment expressed toward specific aspects of an entity, such as a product's features or a service. Accurate extraction depends on the quality of the input text, so assessing text quality is a prerequisite for reliable aspect-level sentiment predictions.

The **Text Quality Analysis (TQA)** process evaluates various features of the text that may affect ABSA performance. It focuses on syntactic and lexical features that help determine the relevance, richness, and clarity of the content in relation to specific aspects.

Key **requirements for ABSA-based text quality** are:

- **Aspect Identification**: Clear identification of aspects (e.g., product features or services).
- **Aspect-Verb Pairing**: The relationship between aspects and verbs (actions related to aspects).
- **Text Coherence and Complexity**: Well-formed sentences with a manageable level of complexity.
- **Lexical Density and Content Richness**: The degree to which the content reflects substantive, content-rich words.

These requirements are captured by **syntactic measures** like noun phrases, verb phrases, aspect-verb pairs, and additional features like review length, lexical density, and dependency depth. Together, these features inform the quality of text in the context of ABSA tasks.

## Measures of Text Quality and Their Weights
---
The following **syntactic features** and their **weights** quantify the quality of a given text for ABSA. Each measure contributes to the overall **Text Quality Analysis (TQA)** score, which is a weighted sum of these features.

$$
\text{TQA Score}=\sum_{i} \left( \text{Measure Score}_i \times \text{Weight}_i \right)
$$

Each measure, its weight, and the rationale for that weight are described below:

### High Importance (Coefficients 3.0 and above):
- **aspect_verb_pairs (4.0)**: These pairs link an aspect (a noun or noun phrase) to the verb that describes its action or state, which makes them highly informative about how users feel about specific aspects. For example, "The battery drains quickly" links "battery" (aspect) to "drains", and the verb phrase "drains quickly" carries the negative sentiment.
- **noun_adjective_pairs (3.0)**: These pairs are strong indicators of sentiment toward an aspect. For example, "Excellent screen" directly expresses positive sentiment toward "screen." Although slightly less direct than aspect-verb pairs, they are still highly valuable.
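
The source does not show how the pipeline extracts these pairs, so the following is only a minimal sketch of one plausible rule set. The `Token` class is hypothetical; its fields mirror the `pos_`, `dep_`, and `head` attributes that spaCy exposes on parsed tokens, and the two example parses are annotated by hand rather than produced by a parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical parsed token; fields mirror spaCy's pos_, dep_, and head.i."""
    text: str
    pos: str   # coarse part-of-speech tag, e.g. "NOUN"
    dep: str   # dependency label, e.g. "nsubj"
    head: int  # index of the syntactic head within the sentence


def aspect_verb_pairs(tokens):
    """Return (aspect, verb) for each noun that is the subject of a verb."""
    return [
        (tok.text, tokens[tok.head].text)
        for tok in tokens
        if tok.pos == "NOUN" and tok.dep == "nsubj" and tokens[tok.head].pos == "VERB"
    ]


def noun_adjective_pairs(tokens):
    """Return (noun, adjective) for each adjective that modifies a noun."""
    return [
        (tokens[tok.head].text, tok.text)
        for tok in tokens
        if tok.pos == "ADJ" and tok.dep == "amod" and tokens[tok.head].pos == "NOUN"
    ]


# Hand-annotated parses (assumed, not produced by a parser here).
battery = [
    Token("The", "DET", "det", 1),
    Token("battery", "NOUN", "nsubj", 2),
    Token("drains", "VERB", "ROOT", 2),
    Token("quickly", "ADV", "advmod", 2),
]
screen = [
    Token("Excellent", "ADJ", "amod", 1),
    Token("screen", "NOUN", "ROOT", 1),
]

print(aspect_verb_pairs(battery))    # [('battery', 'drains')]
print(noun_adjective_pairs(screen))  # [('screen', 'Excellent')]
```

A production rule set would likely also cover passive subjects, objects, and predicate adjectives ("the screen is excellent"); those cases are left out here to keep the sketch short.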

### Medium Importance (Coefficients between 1.5 and 2.5):    
- **noun_phrases (2.5)**: Noun phrases often represent aspects themselves, even without an accompanying adjective or verb. "The camera quality" is a clear aspect, even if the sentiment is expressed elsewhere in the sentence.
- **verb_phrases (2.0)**: Verb phrases can provide context and nuance to the sentiment. "The phone performs well" is more informative than just "phone" or "performs."
- **adjective_count (1.5)**: The sheer number of adjectives can give a general indication of the sentiment's intensity. A review with many positive adjectives is likely more positive overall. However, individual adjectives within noun-adjective pairs are more informative.
- **lexical_density (1.5)**: Lexical density (the proportion of content words) can be a proxy for the information content of a review. Higher lexical density might suggest more specific and detailed feedback, which could be useful for ABSA.
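
As a rough illustration of lexical density, the sketch below computes the share of content words from a list of part-of-speech tags. Which tags count as content words is an assumption made here (nouns, proper nouns, verbs, adjectives, and adverbs); the pipeline's own definition may differ.

```python
# Assumed set of content-word tags (Universal POS labels).
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}
# Tags ignored when counting words (punctuation and whitespace tokens).
IGNORED_POS = {"PUNCT", "SPACE"}


def lexical_density(pos_tags):
    """Return the fraction of word tokens that are content words."""
    words = [tag for tag in pos_tags if tag not in IGNORED_POS]
    if not words:
        return 0.0
    return sum(tag in CONTENT_POS for tag in words) / len(words)


# "The phone performs well" -> DET NOUN VERB ADV
tags = ["DET", "NOUN", "VERB", "ADV"]
density = lexical_density(tags)
print(density)  # 0.75
```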

### Lower Importance (Coefficients below 1.5):   
- **noun_count (1.0)**: The raw count of nouns is less informative than noun phrases or nouns in noun-adjective pairs. It's more of a general indicator of review length and complexity.
- **verb_count (1.0)**: Similar to noun count, the raw count of verbs is less directly related to ABSA than verb phrases or verbs in aspect-verb pairs.
- **review_length (1.0)**: Review length is a general metric and doesn't directly contribute to ABSA. Longer reviews might contain more information, but they can also be rambling.
- **dependency_depth (1.0)**: Dependency depth can be a measure of syntactic complexity, but its relationship to ABSA is not as clear. Complex sentences aren't necessarily more or less sentiment-bearing than simpler ones.
- **adverb_count (0.75)**: Adverbs can modify adjectives or verbs, adding nuance. However, their contribution to ABSA might be less direct compared to adjectives or verbs themselves.
- **adverbial_phrases (0.5)**: Similar to adverbs, adverbial phrases can add detail but are less directly related to aspect-based sentiment.
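
Dependency depth can be made concrete with a small sketch. It assumes a sentence is given as a list of head indices in which the root token points to itself (the convention spaCy uses), and it defines depth as the number of edges on the longest path from any token up to the root. Both the input format and that definition are assumptions for illustration.

```python
def dependency_depth(heads):
    """Return the longest token-to-root path length, in edges."""
    def depth(i):
        steps = 0
        while heads[i] != i:  # the root is its own head
            i = heads[i]
            steps += 1
            if steps > len(heads):
                raise ValueError("cycle detected in head indices")
        return steps

    return max((depth(i) for i in range(len(heads))), default=0)


# "The battery drains quickly": The -> battery -> drains (root) <- quickly
heads = [1, 2, 2, 2]
print(dependency_depth(heads))  # 2
```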

**Summary of Weights:**

| Measure              | Weight |
|----------------------|--------|
| aspect_verb_pairs    | 4      |
| noun_adjective_pairs | 3      |
| noun_phrases         | 2.5    |
| verb_phrases         | 2      |
| adjective_count      | 1.5    |
| lexical_density      | 1.5    |
| noun_count           | 1      |
| verb_count           | 1      |
| review_length        | 1      |
| dependency_depth     | 1      |
| adverb_count         | 0.75   |
| adverbial_phrases    | 0.5    |

**Important Note:** These weights are a starting hypothesis. The TQA Exploratory Data Analysis will evaluate the degree to which this weighting regime reflects the quality of reviews for the ABSA task.

## Text Quality Analysis Pipeline
The TQA processing pipeline computes these text quality metrics at scale.

The pipeline leverages:     
- Dask for distributed data processing, enabling efficient computation over large text datasets.
- spaCy for NLP tasks, including dependency parsing and part-of-speech (POS) tagging.

The pipeline follows these steps:    
- Dataset Configuration: Define the source and target dataset configurations.
- Pipeline Construction: Instantiate the TQAStageBuilder and configure it for Dask processing.
- Execution: Run the TQAStage, applying text analysis and feature extraction.



### Import Libraries
---

In [2]:

from genailab.setup import auto_wire_container
from genailab.core.dtypes import DFType
from genailab.infra.utils.file.fileset import FileFormat
from genailab.asset.dataset.config import DatasetConfig
from genailab.flow.dataprep.tqa.builder import TQAStageBuilder
from genailab.core.flow import PhaseDef, StageDef


# Wire container
container = auto_wire_container()

### Define the Source and Target Dataset Configurations
---
The source dataset represents the cleaned text data, while the target dataset will store the extracted text quality features.

In [3]:
# Source Dataset Configuration
source_config = DatasetConfig(
    phase=PhaseDef.DATAPREP,
    stage=StageDef.CLEAN,
    name="review",
    file_format=FileFormat.PARQUET,
    asset_type="dataset",
    dftype=DFType.PANDAS,
)

# Target Dataset Configuration
target_config = DatasetConfig(
    phase=PhaseDef.DATAPREP,
    stage=StageDef.TQA,
    name="review",
    file_format=FileFormat.PARQUET,
    asset_type="dataset",
    dftype=DFType.PANDAS,
)


### Construct the TQA Pipeline
---
We use the TQAStageBuilder to configure a Dask-powered text quality analysis pipeline with:

- Normalization enabled (ensures robust feature scaling).
- Batch processing (improves efficiency for large datasets).

In [4]:
stage = (
    TQAStageBuilder()
        .analyze_text()
        .build(source_config=source_config, target_config=target_config))


### Run the Pipeline
---
Once the pipeline is built, we execute it to compute text quality features.

In [5]:
dataset = stage.run(force=FORCE)



#             Text Quality Analysis Stage Sun, 09 Feb 2025 00:28:11              #

____________________________________________________________________________________________________
Text Quality Analysis Stage             00:28:11    00:29:51    1.0 minute and 40.04 seconds                       





### Validate the Dataset
---
Let's ensure that the text quality measures have been added and the dataset is in the repository.

#### Dataset Profile

In [6]:
dataset.profile

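|   | Column | DataType | Complete | Null | Completeness | Unique | Duplicate | Uniqueness | Size (Bytes) |
|---|--------|----------|----------|------|--------------|--------|-----------|------------|--------------|
| 0 | id | object | 4939 | 0 | 1.0 | 4939 | 0 | 1.0 | 331387 |
| 1 | app_id | object | 4939 | 0 | 1.0 | 1930 | 3009 | 0.390767 | 328454 |
| 2 | app_name | object | 4939 | 0 | 1.0 | 1930 | 3009 | 0.390767 | 394959 |
| 3 | category_id | object | 4939 | 0 | 1.0 | 14 | 4925 | 0.002835 | 301279 |
| 4 | author | object | 4939 | 0 | 1.0 | 4937 | 2 | 0.999595 | 380303 |
| 5 | rating | object | 4939 | 0 | 1.0 | 5 | 4934 | 0.001012 | 177804 |
| 6 | content | object | 4939 | 0 | 1.0 | 4936 | 3 | 0.999393 | 1237730 |
| 7 | vote_sum | object | 4939 | 0 | 1.0 | 14 | 4925 | 0.002835 | 159020 |
| 8 | vote_count | object | 4939 | 0 | 1.0 | 18 | 4921 | 0.003644 | 159272 |
| 9 | date | datetime64[ns] | 4939 | 0 | 1.0 | 4939 | 0 | 1.0 | 39512 |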


#### Dataset Sample

In [7]:
dataset.dataframe.head()

|   | id | app_id | app_name | category_id | author | rating | content | vote_sum | vote_count | date | ... | aspect_verb_pairs | noun_adjective_pairs | noun_phrases | verb_phrases | adverbial_phrases | review_length | lexical_density | dependency_depth | tqa_score | tqa_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10019409512 | 1380362212 | GALATEA: Novels & Audiobooks | 6018 | c011c66aae3e668b150e | 5 | i love it but the chapter and waiting hours fo... | 0 | 0 | 2023-06-10 15:09:00 | ... | 0.0 | 0.0 | 0.693147 | 0.0 | 0.0 | 2.70805 | 4.615121 | 2.70805 | 16.779699 | 2 |
| 1 | 10027124164 | 1380362212 | GALATEA: Novels & Audiobooks | 6018 | 5a2741393dd20358b609 | 5 | i like the books that i have read so far if th... | 0 | 0 | 2023-06-12 20:14:00 | ... | 0.0 | 0.0 | 0.693147 | 0.0 | 0.0 | 3.931826 | 4.615121 | 3.931826 | 21.776695 | 4 |
| 2 | 10036938913 | 1076402606 | Libby, by OverDrive | 6018 | 46117640263dddac9294 | 5 | i have read dozens upon dozens of books after ... | 0 | 0 | 2023-06-15 17:01:00 | ... | 0.693147 | 0.0 | 0.0 | 0.693147 | 0.0 | 3.850148 | 4.615121 | 3.850148 | 23.259196 | 5 |
| 3 | 10047764706 | 1076402606 | Libby, by OverDrive | 6018 | a0e95f8868233439444d | 5 | happy with the app i use it primarily for audi... | 0 | 0 | 2023-06-18 19:40:00 | ... | 0.0 | 0.0 | 0.693147 | 0.0 | 0.0 | 3.583519 | 4.615121 | 3.583519 | 20.447559 | 4 |
| 4 | 10064456025 | 1535748732 | Storyroom - Webnovel & Story | 6018 | bb43c451a876165c2abf | 1 | im going to be honest the books are really gre... | 0 | 0 | 2023-06-23 15:23:00 | ... | 0.0 | 0.0 | 0.693147 | 0.0 | 0.0 | 4.477337 | 4.615121 | 4.477337 | 23.163182 | 5 |
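
The fractional feature values in this sample are consistent with a `log1p` transform of integer counts: 0.693147 equals ln(1 + 1), and 2.70805 equals ln(1 + 14). The source does not state which normalization the stage applies, so the raw counts named here are inferred, and the sketch below is only a way to check that reading, not a description of the pipeline's implementation.

```python
import math


def log1p_normalize(count):
    """Map a non-negative raw count to ln(1 + count)."""
    return math.log1p(count)


# Inferred raw counts (assumption) and the values they would produce.
for raw in (0, 1, 14):
    print(raw, round(log1p_normalize(raw), 6))
```

A transform of this kind compresses large counts, so a very long review does not dominate the weighted score purely through its length.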


#### Dataset Repository

In [8]:
repo = container.io.repo()
ds = repo.get(asset_id=dataset.asset_id)
assert ds == dataset

Great! The dataset has been processed and persisted, and it now carries a set of **text quality analysis** metrics that we can use for instance selection during the feature engineering stage.

Next, we transition to **sentiment analysis at the review level**, where we analyze the overall sentiment of each review.