In [1]:
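import os
import warnings

# Suppress library warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")

# When the notebook runs from inside the jbook directory, move the working directory up two levels.
if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

# Passed to stage.run(force=...) when the pipeline is executed below.
FORCE = False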

# Text Quality Analysis (TQA) for Aspect-Based Sentiment Analysis (ABSA)
---
In **Aspect-Based Sentiment Analysis (ABSA)**, the primary goal is to extract the sentiment expressed toward specific aspects within a text, such as products, services, or features. Accurate extraction requires text of sufficient quality: text quality directly influences the effectiveness of ABSA models, so assessing it is a key step toward better aspect-level sentiment predictions.

The **Text Quality Analysis (TQA)** process evaluates various features of the text that may affect ABSA performance. It focuses on syntactic and lexical features that help determine the relevance, richness, and clarity of the content in relation to specific aspects.

Key **requirements for ABSA-based text quality** are:

- **Aspect Identification**: Clear identification of aspects (e.g., product features or services).
- **Aspect-Verb Pairing**: The relationship between aspects and verbs (actions related to aspects).
- **Text Coherence and Complexity**: Well-formed sentences with a manageable level of complexity.
- **Lexical Density and Content Richness**: The degree to which the content reflects substantive, content-rich words.

These requirements are captured by **syntactic measures** like noun phrases, verb phrases, aspect-verb pairs, and additional features like review length, lexical density, and dependency depth. Together, these features inform the quality of text in the context of ABSA tasks.

## Measures of Text Quality and Their Weights
---
The following **syntactic and lexical features** and their **weights** quantify the quality of a given text for ABSA. Each measure contributes to the overall **Text Quality Analysis (TQA)** score, which is the weighted sum of the measure scores:

$$
\text{TQA Score}=\sum_{i} \left( \text{Measure Score}_i \times \text{Weight}_i \right)
$$

Each measure, its weight, and its contribution to the overall TQA score are described below:

- **Aspect-Verb Pairs**: This measure identifies the relationship between **aspects** (e.g., product features) and **verbs** (actions related to those aspects). A higher number of aspect-verb pairs indicates that the text is highly relevant to ABSA, making it the most heavily weighted feature in TQA with a weight of **4**.    
- **Noun Phrases**: Noun phrases help identify **key entities** and aspects in the text. The weight of **2.5** reflects the importance of rich, aspect-relevant content in generating accurate sentiment scores.    
- **Verb Phrases**: Verb phrases identify the **actions or states** related to aspects. This has a weight of **2**, highlighting the importance of these phrases for understanding sentiment.    
- **Adjective Count**: Adjectives capture **descriptive sentiment**, reflecting the **quality and intensity** of the sentiments expressed. It is weighted at **1.5**, indicating its moderate to high importance in sentiment analysis.    
- **Lexical Density**: This measure quantifies the amount of **content-rich vocabulary** used in the text. High lexical density indicates that the text contains more **substantive content**, making it more informative for sentiment analysis. It is weighted at **1.5**.    
- **Noun Count**: The count of **nouns** provides insights into the **syntactic structure** and **content** of the text. It is weighted at **1**, reflecting its importance in the analysis of textual richness.    
- **Verb Count**: The count of **verbs** helps capture the **action-oriented components** of the text. Like the noun count, it is weighted at **1**.    
- **Review Length**: Longer reviews generally provide more context for sentiment analysis, but excessive length can introduce noise. It is weighted at **1**, reflecting its moderate importance in sentiment prediction.    
- **Dependency Depth**: This measure indicates the **complexity** of sentence structure. Although sentence complexity can affect sentiment extraction, it plays a secondary role in ABSA and is weighted at **1**.    
- **Adverb Count**: Adverbs modify verbs or adjectives, helping capture nuances in sentiment. With a weight of **0.75**, it plays a supportive role in sentiment analysis.    
- **Adverbial Phrases**: These phrases provide additional **modification** of actions, aspects, and sentiments. They are weighted at **0.5**, as they are considered less important than noun and verb phrases in ABSA tasks.
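
To make a few of these measures concrete, here is a minimal sketch that derives part-of-speech counts, lexical density, dependency depth, and aspect-verb pairs from a hand-annotated sentence. The token format, the function names, the definition of lexical density as the share of content words, and the rule that pairs a noun with its verb head are all assumptions made for illustration; they are not taken from the pipeline's implementation, which obtains tags and parses from spaCy.

```python
from collections import Counter

# Each token is (text, coarse POS tag, index of its head token).
# The root token points to itself. Tags and heads are written by hand here;
# in practice they would come from a dependency parser.
TOKENS = [
    ("The", "DET", 1),
    ("battery", "NOUN", 2),
    ("lasts", "VERB", 2),
    ("incredibly", "ADV", 4),
    ("long", "ADV", 2),
]

# Assumption: content words are nouns, verbs, adjectives, and adverbs.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def pos_counts(tokens):
    """Count tokens per POS tag (noun count, verb count, and so on)."""
    return Counter(tag for _, tag, _ in tokens)


def lexical_density(tokens):
    """Share of tokens that are content words."""
    if not tokens:
        return 0.0
    content = sum(1 for _, tag, _ in tokens if tag in CONTENT_TAGS)
    return content / len(tokens)


def dependency_depth(tokens):
    """Longest path, in arcs, from any token up to the root."""
    def depth(i):
        steps = 0
        while tokens[i][2] != i:  # climb until the self-referencing root
            i = tokens[i][2]
            steps += 1
        return steps

    return max((depth(i) for i in range(len(tokens))), default=0)


def aspect_verb_pairs(tokens):
    """Pair each noun with its head when that head is a verb."""
    return [
        (text, tokens[head][0])
        for text, tag, head in tokens
        if tag == "NOUN" and tokens[head][1] == "VERB"
    ]


counts = pos_counts(TOKENS)
print(counts["NOUN"], counts["VERB"], counts["ADV"])
print(lexical_density(TOKENS))
print(dependency_depth(TOKENS))
print(aspect_verb_pairs(TOKENS))
```

For this sentence, four of the five tokens are content words, the deepest token sits two arcs below the root, and the single aspect-verb pair is `("battery", "lasts")`.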

**Summary of Weights:**

| Measure           | Weight |
|-------------------|--------|
| Aspect-Verb Pairs | 4      |
| Noun Phrases      | 2.5    |
| Verb Phrases      | 2      |
| Adjective Count   | 1.5    |
| Lexical Density   | 1.5    |
| Noun Count        | 1      |
| Verb Count        | 1      |
| Review Length     | 1      |
| Dependency Depth  | 1      |
| Adverb Count      | 0.75   |
| Adverbial Phrases | 0.5    |
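
The weighted sum can be sketched in a few lines of Python using the weights from the table above. The dictionary keys and the `tqa_score` function are hypothetical names chosen for this example, and the measure scores are assumed to be already computed (and, where applicable, normalized) for each review.

```python
# Weights from the summary table; the key names are illustrative assumptions.
WEIGHTS = {
    "aspect_verb_pairs": 4.0,
    "noun_phrases": 2.5,
    "verb_phrases": 2.0,
    "adjective_count": 1.5,
    "lexical_density": 1.5,
    "noun_count": 1.0,
    "verb_count": 1.0,
    "review_length": 1.0,
    "dependency_depth": 1.0,
    "adverb_count": 0.75,
    "adverbial_phrases": 0.5,
}


def tqa_score(measures, weights=WEIGHTS):
    """Return the weighted sum of measure scores for one review."""
    missing = set(weights) - set(measures)
    if missing:
        raise KeyError(f"Missing measures: {sorted(missing)}")
    return sum(measures[name] * weight for name, weight in weights.items())


# A review that scores 1.0 on every measure receives the sum of all weights.
uniform = {name: 1.0 for name in WEIGHTS}
print(tqa_score(uniform))  # 16.75

# A review that scores only on aspect-verb pairs.
sparse = {name: 0.0 for name in WEIGHTS}
sparse["aspect_verb_pairs"] = 0.5
print(tqa_score(sparse))  # 2.0
```

Because aspect-verb pairs carry the largest weight, a review scoring 0.5 on that measure alone already earns 2.0 of the maximum 16.75 points available when every measure lies in the range 0 to 1.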

## Text Quality Analysis Pipeline
The text quality analysis (TQA) pipeline computes these text quality metrics at scale.

The pipeline leverages:     
- Dask for distributed data processing, enabling efficient computation over large text datasets.
- spaCy for NLP tasks, including dependency parsing and part-of-speech (POS) tagging.

The pipeline follows these steps:    
- Dataset Configuration: Define the source and target dataset configurations.
- Pipeline Construction: Instantiate the TQAStageBuilder and configure it for Dask processing.
- Execution: Run the TQAStage, applying text analysis and feature extraction.



### Import Libraries
---

In [2]:

from genailab.setup import auto_wire_container
from genailab.core.dtypes import DFType
from genailab.infra.utils.file.fileset import FileFormat
from genailab.asset.dataset.config import DatasetConfig
from genailab.flow.dataprep.tqa.builder import TQAStageBuilder
from genailab.core.flow import PhaseDef, StageDef


# Wire container
container = auto_wire_container()

### Define the Source and Target Dataset Configurations
---
The source dataset represents the cleaned text data, while the target dataset will store the extracted text quality features.

In [3]:
# Source Dataset Configuration
source_config = DatasetConfig(
    phase=PhaseDef.DATAPREP,
    stage=StageDef.CLEAN,
    name="review",
    file_format=FileFormat.PARQUET,
    asset_type="dataset",
    dftype=DFType.PANDAS,
)

# Target Dataset Configuration
target_config = DatasetConfig(
    phase=PhaseDef.DATAPREP,
    stage=StageDef.TQA,
    name="review",
    file_format=FileFormat.PARQUET,
    asset_type="dataset",
    dftype=DFType.PANDAS,
)


### Construct the TQA Pipeline
---
We use the TQAStageBuilder to configure a Dask-powered text quality analysis pipeline with:

- Normalization enabled (ensures robust feature scaling).
- Batch processing (improves efficiency for large datasets).
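
The source does not describe how `normalized=True` and `batched=True` are implemented, so the following is only a conceptual sketch of the two ideas. It assumes min-max scaling as the normalization method and fixed-size chunks as the batching strategy; `min_max_normalize` and `batches` are hypothetical helpers, not part of `TQAStageBuilder`.

```python
def min_max_normalize(values):
    """Rescale values to the range [0, 1]; constant inputs map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def batches(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Raw noun counts for three reviews, rescaled so they are comparable
# with measures that live on a different scale.
print(min_max_normalize([2, 4, 6]))  # [0.0, 0.5, 1.0]

# Five reviews processed two at a time.
print(list(batches(["r1", "r2", "r3", "r4", "r5"], 2)))
```

Scaling each measure to a common range keeps a raw count such as review length from dominating the weighted sum, and batching amortizes per-call overhead when many texts are parsed.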

In [None]:
stage = (
    TQAStageBuilder()
    .with_dask(normalized=True, batched=True)
    .build(source_config=source_config, target_config=target_config)
)


### Run the Pipeline
---
Once the pipeline is built, we execute it to compute text quality features.

In [None]:
dataset = stage.run(force=FORCE)

### Validate the Dataset
---
Let's verify that the text quality measures have been added and that the dataset has been persisted to the repository.

#### Dataset Profile

In [None]:
dataset.profile

#### Dataset Sample

In [None]:
dataset.dataframe.head()

#### Dataset Repository

In [8]:
repo = container.io.repo()
ds = repo.get(asset_id=dataset.asset_id)
assert ds == dataset

Great! Now that we've confirmed the dataset has been processed and stored, we have a set of **text quality analysis** metrics that we can use for instance selection during the feature engineering stage.

Next, we turn to **sentiment analysis at the review level**, where we analyze the overall **sentiment** of each review.