In [1]:
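import os
import warnings

warnings.filterwarnings("ignore")

# When the notebook runs from inside the Jupyter Book tree, move up two levels
# so that relative paths resolve from the project root.
if "jbook" in os.getcwd():
    os.chdir(os.path.abspath("../.."))

# Passed as `force` when the TQA stage is built below.
FORCE = True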

# Text Quality Analysis (TQA) 
Review text quality is an indicator of the content's richness, coherence, and informativeness. In this section, we integrate two complementary quality assessment measures—a lexical/syntactic score and a perplexity-based score—into a weighted sum. This approach provides a balanced evaluation, capturing both the structural diversity and natural language fluency of the reviews.

## Lexical and Syntactic Complexity Assessment
The lexical and syntactic complexity assessment evaluates review quality using a composite score derived from five syntactic and lexical measures, weighted as follows:

- **Syntactic Extent Score** (40%): Reflects the richness of content using counts of nouns, verbs, adjectives, and adverbs.
- **Syntactic Diversity Score** (20%): Captures variety in language using an entropy-based calculation.
- **Syntactic Complexity Score** (10%): Assesses the density of key parts of speech relative to total word count.
- **Lexical Complexity Score** (20%): Evaluates text complexity using unique word proportion, special character usage, and word length variation.
- **Typography Score** (10%): Incorporates quality signals such as limited digit use, minimal special characters, and proper terminal punctuation.
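
As a minimal sketch of how the weights above combine, the snippet below computes the composite as a plain weighted sum. It assumes each component score has already been normalized to the range [0, 1]; the dictionary keys and the function name `composite_complexity_score` are hypothetical and are not taken from the `discover` package.

```python
# Weights as listed above; they sum to 1.0.
COMPONENT_WEIGHTS = {
    "syntactic_extent": 0.40,
    "syntactic_diversity": 0.20,
    "syntactic_complexity": 0.10,
    "lexical_complexity": 0.20,
    "typography": 0.10,
}


def composite_complexity_score(components: dict) -> float:
    """Weighted sum of the five component scores.

    Assumption: every component is already scaled to [0, 1].
    """
    missing = COMPONENT_WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"Missing component scores: {sorted(missing)}")
    return sum(
        weight * components[name] for name, weight in COMPONENT_WEIGHTS.items()
    )


# Illustrative (made-up) component scores for a single review.
example = {
    "syntactic_extent": 0.5,
    "syntactic_diversity": 0.5,
    "syntactic_complexity": 0.5,
    "lexical_complexity": 0.5,
    "typography": 0.5,
}
score = composite_complexity_score(example)
```

Because the weights sum to one, the composite stays on the same scale as its inputs.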

A high Lexical and Syntactic Complexity Score typically indicates a text rich in linguistic features, with varied sentence structures and a well-balanced mix of nouns, verbs, and modifiers (like adjectives and adverbs). This variety is particularly valuable for tasks like Aspect-Based Sentiment Analysis (ABSA), where structural complexity can signal content with nuanced aspects and sentiments.
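
The entropy-based calculation behind the Syntactic Diversity Score can be pictured with Shannon entropy over the part-of-speech tag distribution of a review. This is an illustrative sketch only: the exact formulation used by the pipeline is described in the appendix, and the function name `pos_entropy` and the tag labels are assumptions.

```python
import math
from collections import Counter


def pos_entropy(pos_tags: list) -> float:
    """Shannon entropy (in bits) of a review's part-of-speech tag distribution."""
    if not pos_tags:
        return 0.0
    counts = Counter(pos_tags)
    total = len(pos_tags)
    return -sum(
        (count / total) * math.log2(count / total) for count in counts.values()
    )


# A review that uses a single part of speech has no diversity; one that
# spreads evenly over several tags has higher entropy.
uniform = pos_entropy(["NOUN", "NOUN", "NOUN"])
varied = pos_entropy(["NOUN", "VERB", "ADJ", "ADV"])
```

Entropy grows both with the number of distinct tags and with how evenly they are used, which is why it serves as a measure of variety.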

## Coherence - Perplexity-Based Quality Assessment
This measure evaluates review quality by applying 13 linguistic and structural filters, each assigned a weight derived from relative perplexity differences between the full dataset and filtered subsets. The filters assess features like adjective presence, punctuation ratios, word repetition, and special character use. Weights are computed to emphasize filters that most reduce perplexity, thus enhancing text fluency and coherence. The final score is a weighted sum of these filter indicators.
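
One way to read the weighting scheme described above is sketched below: each filter's weight is its relative perplexity reduction compared with the full dataset, and the review score is the sum of the weights of the filters it passes. Clipping negative reductions to zero and normalizing the weights to sum to one are assumptions made for this illustration, as are the function names and the filter names; the pipeline's actual formula is given in the appendix.

```python
def filter_weights(pp_full: float, pp_filtered: dict) -> dict:
    """Derive filter weights from relative perplexity differences.

    pp_full: perplexity measured on the full dataset.
    pp_filtered: perplexity measured on the subset kept by each filter.
    Assumption: filters that do not reduce perplexity receive zero weight,
    and the remaining weights are normalized to sum to 1.
    """
    raw = {
        name: max(0.0, (pp_full - pp) / pp_full)
        for name, pp in pp_filtered.items()
    }
    total = sum(raw.values())
    if total == 0:
        return {name: 0.0 for name in raw}
    return {name: value / total for name, value in raw.items()}


def coherence_score(indicators: dict, weights: dict) -> float:
    """Weighted sum of binary filter indicators for one review."""
    return sum(weights[name] for name, passed in indicators.items() if passed)


# Made-up perplexities for three hypothetical filters.
weights = filter_weights(
    pp_full=100.0,
    pp_filtered={
        "has_adjective": 80.0,
        "low_repetition": 90.0,
        "few_specials": 110.0,
    },
)
review_score = coherence_score(
    {"has_adjective": True, "low_repetition": False, "few_specials": True},
    weights,
)
```

In this toy example the filter whose subset has higher perplexity than the full dataset contributes nothing, so passing it does not raise the review's score.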

Lower perplexity implies higher fluency, coherence, and grammatical correctness, which are key indicators of text quality. This component is useful for flagging low-quality or noisy text that may be unpredictable or deviate significantly from standard linguistic norms.
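
For reference, perplexity is the exponential of the average negative log-likelihood a language model assigns to the tokens of a text. The sketch below computes it from per-token natural-log probabilities; how those probabilities are obtained (which language model is used) is outside the scope of this example, and the sample values are made up.

```python
import math


def perplexity(token_log_probs: list) -> float:
    """Perplexity from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("At least one token log-probability is required.")
    avg_neg_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_likelihood)


# Tokens the model finds likely yield lower perplexity than unlikely ones.
fluent = perplexity([math.log(0.5), math.log(0.4), math.log(0.6)])
noisy = perplexity([math.log(0.05), math.log(0.01), math.log(0.02)])
```

A text whose tokens each receive probability 0.25 has a perplexity of about 4, meaning the model is as uncertain as if it were choosing uniformly among four options at every step.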

## Weighted Scoring Approach
To create a balanced quality score, the Lexical and Syntactic Complexity Score and the Perplexity-Based Score are combined with tailored weights that emphasize their respective strengths.

- **Lexical and Syntactic Complexity Weight**: Typically given more weight when the task demands detailed and linguistically rich text, such as ABSA, where richer syntactic content improves aspect and sentiment extraction.
- **Perplexity-Based Weight**: Often assigned a moderate weight to capture coherence and fluency, ensuring that only grammatically sound and predictable text is prioritized without sacrificing syntactic diversity.

The final **Text Quality Score** is a weighted average of these two components, providing a single score that balances both syntactic richness and linguistic fluency. The remainder of this notebook will execute the text quality scoring pipeline, computing and integrating the two quality measures into a final text quality score. The specifics of the measures are provided in {ref}`appendix:tqa`.
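
The weighted average described above can be sketched as follows. The default weight of 0.6 for the complexity component is purely illustrative and is not the value configured for this pipeline; both inputs are assumed to lie on the same [0, 1] scale, and the function name `text_quality_score` is hypothetical.

```python
def text_quality_score(
    complexity: float, coherence: float, complexity_weight: float = 0.6
) -> float:
    """Weighted average of the two quality components.

    complexity_weight is an illustrative default, not the pipeline's setting;
    the coherence component receives the remaining weight.
    """
    if not 0.0 <= complexity_weight <= 1.0:
        raise ValueError("complexity_weight must lie in [0, 1].")
    return complexity_weight * complexity + (1.0 - complexity_weight) * coherence


# A linguistically rich but less fluent review (made-up scores).
tqa = text_quality_score(complexity=0.8, coherence=0.5)
```

Raising `complexity_weight` favors linguistically rich reviews, which suits tasks such as ABSA; lowering it favors fluent, predictable text.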

## Import Libraries

In [None]:
from discover.container import DiscoverContainer
from discover.core.flow import DataPrepStageDef, PhaseDef
from discover.flow.data_prep.agg.stage import AggregationStage
from discover.flow.data_prep.dqm.stage import TextQualityDetectionStage
from discover.flow.data_prep.edp.stage import QuantStage
from discover.flow.data_prep.sentiment.stage import SentimentClassificationStage
from discover.flow.data_prep.tqa.stage import TQAStage
from discover.infra.config.flow import FlowConfigReader

## Dependency Container

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.data_prep.stage",
        "discover.flow.data_prep.aggregation.stage",
    ],
)

## Text Quality Analysis (TQA) Pipeline 
The **Text Quality Analysis (TQA) Pipeline** proceeds in four stages. The first stage executes a *SparkNLP* pipeline that performs tokenization and part-of-speech tagging to capture the syntactic elements of the review text. The second stage derives the **lexical and syntactic complexity** scores from the tokens and part-of-speech tags. The third stage computes perplexity-based weights for 13 text quality heuristics and combines them into a score that reflects the fluency and coherence of the text. Finally, the fourth stage combines the **lexical and syntactic complexity scores** and the **perplexity-based coherence scores** into a weighted composite text quality score: a holistic measure that captures both the linguistic richness and the overall clarity of the writing.

In [None]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=DataPrepStageDef.TQA
)

# Build and run the TQA stage
stage = TQAStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()

## Summary and Transition to the Data Quality Analysis (DQA)
With the **Text Quality Analysis (TQA) Pipeline** now complete, we have the linguistic measures needed for a holistic assessment of text quality for NLP applications. These text quality measures are key inputs to our next stage: the **Data Quality Analysis (DQA)**.

In the DQA, we'll widen our lens, integrating sentiment, typographical, and linguistic metrics across several dimensions of data quality to uncover areas of concern and devise further data processing interventions.