In [1]:
import os
import warnings

warnings.filterwarnings("ignore")

# When running from inside the Jupyter Book directory, move up to the project root.
if "jbook" in os.getcwd():
    os.chdir(os.path.abspath("../.."))

# Text Quality Scoring Framework for App Reviews in Aspect-Based Sentiment Analysis (ABSA)
In Aspect-Based Sentiment Analysis (ABSA), high-quality text enables models to accurately parse and infer nuanced sentiment across diverse aspects, particularly within user-generated content such as app reviews. To maximize ABSA efficacy, reviews should exhibit richness in lexical diversity, structural depth, sentiment clarity, and readability. 

Here, we present a text quality scoring framework that leverages multiple linguistic and structural dimensions to filter app reviews, ensuring that only those meeting high-quality standards are selected for analysis. 

### Quality Scoring Framework
The framework assigns a weighted quality score to each review based on six key components:
1. **POS Count (Content Volume)**
2. **POS Diversity (Lexical Variety)**
3. **Structural Complexity**
4. **POS Intensity (Sentiment Focus)**
5. **Readability**
6. **TQA Check (Formal Quality Indicators)**

Each component contributes to a comprehensive understanding of review quality, reflecting various linguistic and stylistic aspects that support ABSA objectives.

### Overall Quality Score Formula

The quality score $Q$ is defined as a weighted sum of the six components:

$$
Q = w_1 \times \text{POS Count} + w_2 \times \text{POS Diversity} + w_3 \times \text{Structural Complexity} + w_4 \times \text{POS Intensity} + w_5 \times \text{Readability} + w_6 \times \text{TQA Check}
$$

where $w_1, w_2, \ldots, w_6$ are weights representing the relative importance of each component. The weights sum to 1 and are calibrated to prioritize content-rich reviews conducive to multi-aspect sentiment inference.

#### Weight Allocation and Justification

**POS Count** holds the highest weight because it ensures sufficient lexical volume, enabling richer and more varied sentiment and aspect extraction. POS Diversity and Structural Complexity are also assigned significant weights, as they broaden the scope and depth of analysis by favoring reviews that cover multiple aspects with well-structured expression. POS Intensity, Readability, and TQA Check receive supporting weights, contributing to focus, interpretability, and quality consistency without disproportionately influencing the quality assessment.

### Components and Calculation

Each component is designed to capture a specific dimension of review quality, relevant to ABSA’s emphasis on aspect coverage, sentiment clarity, and text readability.

#### 1. POS Count (Content Volume) — $(w_1 = 0.4)$

**Definition**: POS Count is a measure of the absolute quantity of content-bearing parts of speech, specifically nouns, verbs, adjectives, and adverbs. This component prioritizes reviews with a substantial volume of lexical material, directly supporting ABSA tasks by providing sufficient substance for identifying aspects and inferring nuanced sentiment.

**Formula**:
$$
\text{POS Count} = \text{nouns} + \text{verbs} + \text{adjectives} + \text{adverbs}
$$

**Rationale**: In the context of ABSA, reviews with high POS Count are more likely to provide in-depth feedback, encompassing multiple app features and varied sentiment. This component, weighted at 0.4, serves as the primary driver of the quality score, reflecting its fundamental role in content adequacy.
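As an illustration, the count can be sketched over a sequence of POS tags. This is a minimal sketch that assumes the tags follow the Universal POS tag set (`NOUN`, `VERB`, `ADJ`, `ADV`); the helper name `pos_count` and the sample tags are hypothetical and not taken from the pipeline.

```python
from collections import Counter

# Universal POS tags treated as content-bearing (an assumption for this sketch).
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def pos_count(tags):
    """Return the number of content-bearing tags in a sequence of POS tags."""
    counts = Counter(tag for tag in tags if tag in CONTENT_TAGS)
    return sum(counts.values())


# Hypothetical tags for "The app crashes constantly and the new layout is ugly".
tags = ["DET", "NOUN", "VERB", "ADV", "CCONJ", "DET", "ADJ", "NOUN", "AUX", "ADJ"]
print(pos_count(tags))  # 6
```

Auxiliaries, determiners, and conjunctions are ignored here, so only the four content-bearing categories from the formula contribute.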

#### 2. POS Diversity (Lexical Variety) — $( w_2 = 0.2 )$

**Definition**: POS Diversity assesses the variety and balance of POS tags within the text. It calculates how evenly nouns, verbs, adjectives, and adverbs are distributed, using a normalized entropy score to reflect balance across content-bearing parts of speech.

**Formula**:
$$
\text{POS Diversity} = - \sum_{i} \left( p_{i} \times \log(p_{i}) \right)
$$
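where $p_i$ represents the proportion of each POS type (nouns, verbs, adjectives, adverbs) within the text. A normalized version, as mentioned in the definition, would divide this entropy by its maximum value, $\log 4$ for four POS types.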

**Rationale**: Lexical variety enhances ABSA performance by increasing the probability that multiple aspects and sentiments are represented. Reviews with balanced POS usage are more likely to contain a mixture of descriptive, evaluative, and action-oriented language, which provides a richer substrate for aspect and sentiment extraction.
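The entropy calculation can be sketched as follows. The function name `pos_diversity` and the `normalize` flag are illustrative assumptions; dividing by $\log 4$ is one plausible reading of the "normalized entropy" mentioned in the definition, not a confirmed detail of the pipeline.

```python
import math


def pos_diversity(n_nouns, n_verbs, n_adjectives, n_adverbs, normalize=True):
    """Shannon entropy of the content-POS distribution, optionally scaled to [0, 1]."""
    counts = [n_nouns, n_verbs, n_adjectives, n_adverbs]
    total = sum(counts)
    if total == 0:
        return 0.0  # No content words: treat diversity as zero.
    # Terms with zero count contribute nothing (the limit of p * log(p) is 0).
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    # Maximum entropy for four categories is log(4) (assumed normalization).
    return entropy / math.log(len(counts)) if normalize else entropy


print(pos_diversity(5, 5, 5, 5))   # perfectly balanced usage
print(pos_diversity(10, 0, 0, 0))  # a single POS type only
```

A perfectly balanced review scores at the maximum, while a review made up of a single POS type scores zero.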

#### 3. Structural Complexity — $( w_3 = 0.18 )$

**Definition**: Structural Complexity evaluates the variability and engagement level within each review, capturing sentence length variability, unique word usage, and punctuation proportion. This component encourages reviews with intricate sentence structures and unique tokens, both of which signal a higher depth of content and engagement.

**Formula**:
$$
\text{Structural Complexity} = 0.25 \times \text{sentence\_length\_std} + 0.25 \times \text{p\_unique\_tokens} + 0.25 \times \text{p\_punctuation} + 0.25 \times \text{n\_sentences}
$$

where:
- **sentence\_length\_std**: Standard deviation of sentence lengths, capturing complexity.
- **p\_unique\_tokens**: Proportion of unique tokens, indicating lexical variety.
- **p\_punctuation**: Proportion of punctuation, reflecting sentence structuring.
- **n\_sentences**: Number of sentences, indicating text depth.

**Rationale**: This component complements POS Count and POS Diversity by capturing sentence-level structure and engagement, both of which improve ABSA’s ability to parse and infer multi-dimensional sentiment. Weighted at 0.18, Structural Complexity balances the need for structured yet diverse content.
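A rough sketch of the four features and their equally weighted sum is shown below. It assumes a naive regex-based sentence and word split and the population standard deviation; the actual pipeline may tokenize differently, and the function name `structural_complexity` is hypothetical.

```python
import re
import statistics
import string


def structural_complexity(text):
    """Equal-weight combination of four structural features of a review."""
    # Naive sentence split on terminal punctuation (assumption for this sketch).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"\w+", text.lower())
    if not sentences or not tokens:
        return 0.0
    sentence_lengths = [len(re.findall(r"\w+", s)) for s in sentences]
    sentence_length_std = statistics.pstdev(sentence_lengths)
    p_unique_tokens = len(set(tokens)) / len(tokens)
    p_punctuation = sum(ch in string.punctuation for ch in text) / len(text)
    n_sentences = len(sentences)
    return (
        0.25 * sentence_length_std
        + 0.25 * p_unique_tokens
        + 0.25 * p_punctuation
        + 0.25 * n_sentences
    )


print(structural_complexity("Great app. It crashes a lot!"))
```

The four features live on different scales (proportions versus counts), so in this unscaled sketch the sentence count and length deviation can dominate the two proportions.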

#### 4. POS Intensity (Sentiment Focus) — $( w_4 = 0.07 )$

**Definition**: POS Intensity measures the density of content-bearing words relative to total word count, emphasizing reviews where sentiment-laden and aspect-rich language predominates.

**Formula**:
$$
\text{POS Intensity} = \frac{\text{nouns} + \text{verbs} + \text{adjectives} + \text{adverbs}}{\text{total words}}
$$

**Rationale**: High POS Intensity suggests a focused, sentiment-rich review, where descriptive and evaluative language outweighs filler or redundant text. This complements POS Count by ensuring that content volume is matched with density, focusing on the substance of each review.
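The ratio can be computed directly from the POS counts and the total word count. The function name `pos_intensity` and the zero-word guard are assumptions added for this sketch.

```python
def pos_intensity(n_nouns, n_verbs, n_adjectives, n_adverbs, n_words):
    """Share of words that are nouns, verbs, adjectives, or adverbs."""
    if n_words == 0:
        return 0.0  # Avoid division by zero for empty reviews.
    return (n_nouns + n_verbs + n_adjectives + n_adverbs) / n_words


# Hypothetical review with 10 words, 6 of which are content-bearing.
print(pos_intensity(2, 1, 2, 1, 10))
```

Unlike POS Count, this value is bounded between 0 and 1, so a short but dense review can score as high as a long one.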

#### 5. Readability — $( w_5 = 0.05 )$

**Definition**: Readability assesses the ease with which a review can be read and interpreted, using the Flesch Reading Ease score to gauge accessibility. High readability indicates that a review is well-structured and easily parsed, which benefits both ABSA models and human readers.

**Formula**:
$$
\text{Readability} = \text{Flesch Reading Ease}
$$

**Rationale**: Readability supports ABSA by ensuring that sentiment and aspect-related language is presented in a straightforward manner. However, readability alone does not guarantee substantive content, so it is weighted lower, serving as a secondary quality indicator.
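For reference, the standard Flesch Reading Ease formula is $206.835 - 1.015 \times (\text{words} / \text{sentences}) - 84.6 \times (\text{syllables} / \text{words})$. The sketch below implements it from raw counts; the vowel-group syllable counter is a crude heuristic assumed here for illustration, and both function names are hypothetical.

```python
import re


def count_syllables(word):
    """Approximate syllables as runs of vowels (crude heuristic, an assumption)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Standard Flesch Reading Ease score computed from raw counts."""
    if n_words == 0 or n_sentences == 0:
        return 0.0
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


# Hypothetical counts: 10 words, 2 sentences, 14 syllables.
print(round(flesch_reading_ease(10, 2, 14), 2))  # 83.32
```

Because the score typically ranges over roughly 0 to 100, it would likely need rescaling before being combined with components bounded by 1; the source does not state how this is handled.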

#### 6. TQA Check (Formal Quality Indicators) — $( w_6 = 0.1 )$

**Definition**: The TQA Check evaluates the formal quality of the text, including punctuation and digit ratio, which signal professionalism and readability. TQA Check serves as a filter for overly simplistic or noisy text.

**Formula**:
$$
\text{TQA Check} = 0.3 \times (1 - \text{high\_digit\_ratio}) + 0.3 \times (1 - \text{high\_punctuation\_ratio}) + 0.4 \times \text{has\_terminal\_punctuation}
$$

**Rationale**: By enforcing basic formal standards, the TQA Check helps maintain a minimum quality threshold, reducing noise without strongly influencing content richness. It’s particularly useful for filtering out low-quality reviews that might otherwise distort ABSA performance.
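The check can be sketched with three binary indicators. The 0.25 thresholds for what counts as a "high" digit or punctuation ratio are assumptions made for this example, as is the function name `tqa_check`; the source does not specify either.

```python
import string


def tqa_check(text, digit_threshold=0.25, punctuation_threshold=0.25):
    """Weighted combination of three formal quality indicators."""
    stripped = text.strip()
    if not stripped:
        return 0.0
    digit_ratio = sum(ch.isdigit() for ch in text) / len(text)
    punctuation_ratio = sum(ch in string.punctuation for ch in text) / len(text)
    # Binary flags; the thresholds are illustrative assumptions.
    high_digit_ratio = float(digit_ratio > digit_threshold)
    high_punctuation_ratio = float(punctuation_ratio > punctuation_threshold)
    has_terminal_punctuation = float(stripped.endswith((".", "!", "?")))
    return (
        0.3 * (1 - high_digit_ratio)
        + 0.3 * (1 - high_punctuation_ratio)
        + 0.4 * has_terminal_punctuation
    )


print(tqa_check("Great app, works well."))  # passes all three checks
print(tqa_check("1234567890"))              # mostly digits, no terminal punctuation
```

With binary indicators the score takes only a handful of discrete values, which matches its role as a coarse filter rather than a fine-grained measure.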

### Final Quality Score Formula

Based on the above rationale, the complete quality score formula is as follows:

$$
Q = 0.4 \times \text{POS Count} + 0.2 \times \text{POS Diversity} + 0.18 \times \text{Structural Complexity} + 0.07 \times \text{POS Intensity} + 0.05 \times \text{Readability} + 0.1 \times \text{TQA Check}
$$

This quality scoring framework provides a multi-dimensional approach to filtering app reviews for ABSA. By weighting content volume, diversity, structural complexity, intensity, readability, and formal quality checks, the score favors reviews rich in content, balanced in lexical usage, and clear in sentiment expression. 
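The weighted sum can be sketched as below. The weights are those given in the formula; the dictionary keys, the function name `quality_score`, and the sample component values are hypothetical, and the example assumes each component has already been scaled to a comparable range.

```python
# Weights from the final quality score formula.
WEIGHTS = {
    "pos_count": 0.40,
    "pos_diversity": 0.20,
    "structural_complexity": 0.18,
    "pos_intensity": 0.07,
    "readability": 0.05,
    "tqa_check": 0.10,
}


def quality_score(components, weights=WEIGHTS):
    """Weighted sum of component scores; raises if a component is missing."""
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"Missing components: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)


# Hypothetical component values, assumed pre-scaled to [0, 1].
review = {
    "pos_count": 0.5,
    "pos_diversity": 0.8,
    "structural_complexity": 0.4,
    "pos_intensity": 0.6,
    "readability": 0.7,
    "tqa_check": 1.0,
}
print(round(quality_score(review), 3))
```

Since the weights sum to 1, a review that scores 1 on every scaled component receives an overall score of 1.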

## Import Libraries

In [2]:
from discover.container import DiscoverContainer
from discover.infra.config.flow import FlowConfigReader
from discover.flow.data_prep.tqa.stage import TQAStage

ImportError: cannot import name 'Review' from 'discover.app.univariate' (/home/john/projects/appvocai-discover/discover/app/univariate.py)

## Parameters

In [3]:
FORCE = True

## Dependency Container

In [4]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.data_prep.stage",
    ],
)

## Text Quality Scoring Pipeline
The text quality scoring pipeline begins by obtaining a configuration tailored to the text quality assessment (TQA) stage. Using the `FlowConfigReader`, we retrieve and apply a configuration specific to TQA from the overall data preparation phase. This configuration includes the parameters and settings required to assess the quality of the text data.

### Pipeline Steps:

1. **Configuration Retrieval**: The pipeline starts by reading the configuration via `FlowConfigReader`. By specifying "phases" as the target configuration, we isolate the required settings, focusing on the TQA stage.

2. **Stage Initialization**: The `TQAStage` is built using the retrieved `stage_config`, which defines parameters such as thresholds, quality components, and weightings for text quality scoring. Setting the `force` parameter allows us to re-run this stage if necessary.

3. **Execution and Asset Creation**: Finally, running the `TQAStage` initiates the text quality assessment, where each review's quality score is computed based on the defined formula. Once completed, this produces an asset identifier, `asset_id`, which corresponds to the processed dataset with text quality scores applied.

This pipeline ensures that the text quality assessment is structured and reproducible, setting the stage for further analysis and filtering based on quality thresholds.

In [None]:
# Obtain the configuration
reader = FlowConfigReader()
config = reader.get_config("phases", namespace=False)
stage_config = config["dataprep"]["stages"]["tqa"]

# Build and run the TQA stage
stage = TQAStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()

Upon completing the text quality scoring pipeline, the dataset is now enriched with quality scores that reflect the structural and linguistic richness of each review. This dataset provides a foundation for further selection, where high-quality samples can be identified for pseudolabeling and other downstream tasks. Next, we will examine how the scores and data are distributed at different quality thresholds. 