In [1]:
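import os
import warnings

warnings.filterwarnings("ignore")

# When executed from inside the jbook directory, change to the directory two levels up.
if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

# Passed to TQAStage.build(force=...) when the stage is built below.
FORCE = True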

# Text Quality Analysis (TQA) 
Aspect-based sentiment analysis (ABSA) model self-training and fine-tuning require text saturated with explicit aspect and opinion words. Dense, unambiguous, aspect-rich reviews are especially vital during ABSA model self-training and pseudo-labeling, where explicit aspect-opinion pairs minimize noise and reinforce the model’s understanding of aspect-sentiment associations. Here, we assess the degree to which each review manifests this richness, enabling targeted sample selection for self-training and ABSA model fine-tuning. 

## Text Quality Scoring 
Text quality in our context is less about fluency or lexical sophistication in the traditional linguistic sense; we're not grading essays. Instead, we focus on whether reviews contain clear aspects and opinions, reflected through specific syntactic features, such as the density of nouns, adjectives, verbs, and adverbs. Our scoring method assigns weighted values to key syntactic components that drive aspect-based sentiment analysis. We calculate the **Syntactic Score** using the formula:

$$
\text{Syntactic Score} = 0.3 \times \text{Noun Density} + 0.3 \times \text{Adjective Density} + 0.2 \times \text{Verb Density} + 0.2 \times \text{Adverb Density}
$$

Here, nouns ($w_N = 0.3$) anchor aspect identification, while adjectives ($w_A = 0.3$) capture sentiment polarity. Verbs ($w_V = 0.2$) add contextual sentiment nuances, and adverbs ($w_{ADV} = 0.2$) convey intensity. We combine this **Syntactic Score** with a **Lexical Diversity Score**, measured as the type-token ratio (TTR), to derive a comprehensive **Text Quality Score**:

$$
\text{Text Quality Score} = \alpha \cdot \text{Syntactic Score} + \beta \cdot \text{Lexical Diversity Score (TTR)}
$$

where $\alpha=0.5$ and $\beta=0.5$ set the relative importance of syntactic richness and lexical diversity, respectively; here the two components are weighted equally.
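
The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code that `TQAStage` runs: the function names are hypothetical, each density is assumed to be the count of a part-of-speech tag divided by the number of tokens in the review, TTR is the usual ratio of unique tokens to total tokens, and the tag names (`NOUN`, `ADJ`, `VERB`, `ADV`) assume Universal Dependencies labels.

```python
from collections import Counter

# Weights taken from the formulas above.
POS_WEIGHTS = {"NOUN": 0.3, "ADJ": 0.3, "VERB": 0.2, "ADV": 0.2}
ALPHA, BETA = 0.5, 0.5


def syntactic_score(pos_tags):
    """Weighted sum of POS densities (assumed: tag count / token count)."""
    if not pos_tags:
        return 0.0
    counts = Counter(pos_tags)
    n = len(pos_tags)
    return sum(w * counts[tag] / n for tag, w in POS_WEIGHTS.items())


def type_token_ratio(tokens):
    """Unique (case-folded) tokens divided by total tokens."""
    if not tokens:
        return 0.0
    return len({t.lower() for t in tokens}) / len(tokens)


def text_quality_score(tokens, pos_tags):
    """Combine syntactic richness and lexical diversity."""
    return ALPHA * syntactic_score(pos_tags) + BETA * type_token_ratio(tokens)


# Hypothetical review, tagged by hand for illustration.
tokens = ["The", "battery", "drains", "really", "fast"]
pos_tags = ["DET", "NOUN", "VERB", "ADV", "ADV"]

score = text_quality_score(tokens, pos_tags)
print(round(score, 2))  # 0.59
```

For this review the syntactic score is $0.3 \times 0.2 + 0.2 \times 0.2 + 0.2 \times 0.4 = 0.18$ and the TTR is $1.0$, so the combined score is $0.5 \times 0.18 + 0.5 \times 1.0 = 0.59$. Note that on short reviews TTR tends toward 1 and dominates the result under these assumptions.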

## Import Libraries

In [2]:
from discover.container import DiscoverContainer
from discover.core.flow import PhaseDef, StageDef
from discover.flow.data_prep.tqa.stage import TQAStage
from discover.infra.config.flow import FlowConfigReader

## Dependency Container

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.data_processing.base.stage",
    ],
)

## Text Quality Analysis Pipeline
The pattern is the familiar one: we obtain the stage configuration, then build and run the `TQAStage` object. 

In [20]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(phase=PhaseDef.DATAPREP, stage=StageDef.TQA)

# Build and run the Text Quality Analysis stage
stage = TQAStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()

[11/18/2024 03:32:29 PM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [_remove_dataset_file_by_filepath] : Removed dataset file at workspace/dev/dataset/01_dataprep/appvocai_discover-01_dataprep-05_tqa-review-dataset.parquet from repository.
[11/18/2024 03:32:29 PM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [remove] : Removed dataset dataset-dev-dataprep-tqa-review from the repository.




#                          Text Quality Analysis Stage                           #



:: loading settings :: url = jar:file:/home/john/miniconda3/envs/appvocai/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/john/.ivy2/cache
The jars for the packages stored in: /home/john/.ivy2/jars
com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-93e7ee7e-8401-4978-85ea-57bd430595d6;1.0
	confs: [default]
	found com.johnsnowlabs.nlp#spark-nlp_2.12;5.3.3 in central
	found com.typesafe#config;1.4.2 in central
	found org.rocksdb#rocksdbjni;6.29.5 in central
	found com.amazonaws#aws-java-sdk-s3;1.12.500 in central
	found com.amazonaws#aws-java-sdk-kms;1.12.500 in central
	found com.amazonaws#aws-java-sdk-core;1.12.500 in central
	found commons-logging#commons-logging;1.1.3 in central
	found commons-codec#commons-codec;1.15 in central
	found org.apache.httpcomponents#httpclient;4.5.13 in central
	found org.apache.httpcomponents#httpcore;4.4.13 in central
	found software.amazon.ion#ion-java;1.0.2 in central
	found joda-time#joda-time;2.8.1 in central
	found com.amazonaws#jmespath-java;1.12.500 in central
	f



                                    NLPTask                                     
                                    -------                                     
                          Start Datetime | Mon, 18 Nov 2024 15:32:57
pos_ud_ewt download started this may take some time.
Approximate size to download 2.2 MB
Download done! Loading the resource.
[OK!]
                       Complete Datetime | Mon, 18 Nov 2024 15:33:24
                                 Runtime | 27.07 seconds


                             ComputeTextQualityTask                             
                             ----------------------                             
                          Start Datetime | Mon, 18 Nov 2024 15:33:24
                       Complete Datetime | Mon, 18 Nov 2024 15:33:25
                                 Runtime | 0.88 seconds






                          Text Quality Analysis Stage                           
                           Stage Started | Mon, 18 Nov 2024 15:32:29
                         Stage Completed | Mon, 18 Nov 2024 15:34:52
                           Stage Runtime | 2.0 minutes and 23.04 seconds





                                                                                

## Summary and Transition to the Data Quality Analysis (DQA)
With the **Text Quality Analysis (TQA) Pipeline** now complete, we have the linguistic measures needed for a holistic assessment of text quality for NLP applications. These text quality measures are key inputs to our next stage: the **Data Quality Analysis (DQA)**. 

In the DQA, we widen the lens, integrating sentiment, typographical, and linguistic metrics across several dimensions of data quality to uncover areas of concern and devise further data processing interventions. 