In [1]:
import warnings

warnings.filterwarnings("ignore")
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
FORCE = True

# Review Text Quality Assessment (TQA) 
Self-training and fine-tuning an aspect-based sentiment analysis (ABSA) model require text saturated with explicit aspect and opinion words. Dense, unambiguous, aspect-rich reviews are especially vital during self-training and pseudo-labeling, where explicit aspect-opinion pairs minimize noise and reinforce the model’s understanding of aspect-sentiment associations. Here, we assess the degree to which each review manifests this richness, enabling targeted sample selection for self-training and fine-tuning.

## Text Quality Scoring 
Text quality in our context is less about fluency or lexical sophistication in the traditional linguistic sense; we're not grading essays. Instead, we focus on whether reviews contain clear aspects and opinions, reflected through specific syntactic features, such as the density of nouns, adjectives, verbs, and adverbs. Our scoring method assigns weighted values to key syntactic components that drive aspect-based sentiment analysis. We calculate the **Syntactic Score** using the formula:

$$
\text{Syntactic Score} = 0.3 \times \text{Noun Density} + 0.3 \times \text{Adjective Density} + 0.2 \times \text{Verb Density} + 0.2 \times \text{Adverb Density}
$$

Here, nouns ($w_N = 0.3$) anchor aspect identification, while adjectives ($w_A = 0.3$) capture sentiment polarity. Verbs ($w_V = 0.2$) add contextual sentiment nuance, and adverbs ($w_{ADV} = 0.2$) convey intensity. We combine this **Syntactic Score** with a **Lexical Diversity Score**, measured as the type-token ratio (TTR), to derive a raw **Text Quality Score**:

$$
\text{Text Quality Score Raw} = \alpha \cdot \text{Syntactic Score} + \beta \cdot \text{Lexical Diversity Score (TTR)}
$$

where $\alpha=0.5$ and $\beta=0.5$ adjust the relative importance of syntactic richness and lexical diversity, respectively.
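The two formulas above can be sketched in plain Python. This is a minimal illustration, not the pipeline's implementation: it assumes each density is the tag count divided by the total token count, that TTR is computed case-insensitively, and that tokens and Universal POS tags (`NOUN`, `ADJ`, `VERB`, `ADV`) are supplied by an upstream tagger. The function names are hypothetical.

```python
from collections import Counter

# Weights taken from the formulas above.
POS_WEIGHTS = {"NOUN": 0.3, "ADJ": 0.3, "VERB": 0.2, "ADV": 0.2}
ALPHA, BETA = 0.5, 0.5


def syntactic_score(pos_tags):
    """Weighted sum of POS densities (assumed: tag count / total tokens)."""
    n = len(pos_tags)
    if n == 0:
        return 0.0
    counts = Counter(pos_tags)
    return sum(w * counts[tag] / n for tag, w in POS_WEIGHTS.items())


def type_token_ratio(tokens):
    """Unique tokens divided by total tokens (assumed case-insensitive)."""
    if not tokens:
        return 0.0
    lowered = [t.lower() for t in tokens]
    return len(set(lowered)) / len(lowered)


def raw_text_quality(tokens, pos_tags):
    """alpha * Syntactic Score + beta * TTR."""
    return ALPHA * syntactic_score(pos_tags) + BETA * type_token_ratio(tokens)


# Hand-tagged toy review; a real run would take tags from a POS tagger.
tokens = ["The", "camera", "is", "really", "great"]
pos_tags = ["DET", "NOUN", "AUX", "ADV", "ADJ"]
print(round(syntactic_score(pos_tags), 2))           # 0.16
print(round(raw_text_quality(tokens, pos_tags), 2))  # 0.58
```

In the toy review, one noun, one adjective, and one adverb out of five tokens give $0.3 \times 0.2 + 0.3 \times 0.2 + 0.2 \times 0.2 = 0.16$, and five distinct tokens give a TTR of 1.0, so the raw score is $0.5 \times 0.16 + 0.5 \times 1.0 = 0.58$.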

Finally, we scale the raw Text Quality Score to the range $[0,100]$.
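The text does not state which scaling method is used. As one possibility, the sketch below applies min-max scaling across the dataset; both the choice of min-max and the helper name `scale_to_0_100` are assumptions made for illustration.

```python
def scale_to_0_100(raw_scores):
    """Min-max scale a list of raw scores to [0, 100] (assumed method)."""
    lo, hi = min(raw_scores), max(raw_scores)
    if hi == lo:
        # All scores identical: no spread to scale, so map everything to 0.
        return [0.0 for _ in raw_scores]
    return [100.0 * (s - lo) / (hi - lo) for s in raw_scores]


raw = [0.25, 0.5, 0.75]
scaled = scale_to_0_100(raw)
print(scaled)  # [0.0, 50.0, 100.0]
```

Min-max scaling makes scores relative to the dataset at hand, so the same review can receive a different scaled score if the minimum or maximum raw score in the dataset changes.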

## Import Libraries

In [2]:
import pandas as pd

from discover.app.tqa import DatasetEvaluation
from discover.app.tqa import TextQualityAnalysis
from discover.container import DiscoverContainer
from discover.assets.idgen import DatasetIDGen
from discover.core.flow import PhaseDef, DataPrepStageDef

pd.options.display.max_colwidth = 200
pd.options.display.max_rows = None

## Dependency Container

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.stage.base",
    ],
)

In [4]:
repo = container.persist.dataset_repo()
asset_id = DatasetIDGen.get_asset_id(
    asset_type="dataset",
    phase=PhaseDef.DATAPREP,
    stage=DataPrepStageDef.ENRICH_REVIEW,
    name="review",
)
dataset = repo.get(asset_id=asset_id, distributed=False, nlp=False)

In [None]:
df = dataset.content
df.loc[df["tqa_score"] > 1][
    ["app_name", "tqa_score", "sa_sentiment", "content"]
].sample(n=5)

| | app_name | tqa_score | sa_sentiment | content |
|---|---|---|---|---|
| 11755 | TEUIDA Learn Korean & Japanese | 1.520173 | Very Positive | I’ll be frank I found this app in an ad and thought why not. When I say it was the best decision I mean it. It’s interactive the lessons are helpful. There are stories to help you feel like your s... |
| 74015 | Xfinity | 1.068535 | Neutral | What is the purpose of all these xfinity apps? What is the purpose of the app if we are going to be redirected to sign in everytime we want to do something within the app😭 |
| 12386 | Procare: Childcare App | 1.029619 | Positive | Great app for getting updates and pictures of my little while at day care. |
| 56458 | Bible | 1.111658 | Positive | Love the ease of this app! The plans are great and whether you have 2 minutes or 22 minutes, you will be able to take something away from the app. |
| 36567 | Sweatcoin Walking Step Counter | 1.903294 | Negative | When I first downloaded this app I read the goal was to monetize it like bitcoin. I walk my dog every day around 2 miles so figured It would be an easy way to earn sweat coins since I was outside.... |


In [None]:
tqa = TextQualityAnalysis(df=df)
tqa.distribution()

Analysis here

### Low Quality Reviews

In [None]:
tqa.select(
    n=10,
    x="tqa_score",
    sort_by="tqa_score",
    ascending=True,
    cols=["id", "app_name", "tqa_score", "content"],
)

### High Quality Reviews

In [None]:
tqa.select(
    n=10,
    x="tqa_score",
    sort_by="tqa_score",
    ascending=False,
    cols=["id", "app_name", "tqa_score", "content"],
)
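For readers without access to the project code, the two selections above can be approximated in plain pandas. The sketch assumes that `select` sorts by the given column and returns the first `n` rows restricted to `cols`; that behavior is inferred from the call signature, not confirmed by the source. The helper `select_by_score` and the toy data are hypothetical.

```python
import pandas as pd


def select_by_score(df, n, sort_by="tqa_score", ascending=True, cols=None):
    """Sort by a score column and return the first n rows (assumed behavior)."""
    ordered = df.sort_values(by=sort_by, ascending=ascending)
    if cols is not None:
        ordered = ordered[cols]
    return ordered.head(n)


# Toy data for illustration only; not drawn from the review dataset.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "app_name": ["A", "B", "C", "D"],
        "tqa_score": [0.2, 1.9, 0.7, 1.1],
        "content": ["ok", "detailed review", "fine app", "works well"],
    }
)
cols = ["id", "app_name", "tqa_score", "content"]
lowest = select_by_score(toy, n=2, ascending=True, cols=cols)
highest = select_by_score(toy, n=2, ascending=False, cols=cols)
print(lowest["id"].tolist())   # [1, 3]
print(highest["id"].tolist())  # [2, 4]
```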

## Summary and Transition to the Data Quality Analysis (DQA)
With the **Text Quality Analysis (TQA) Pipeline** now complete, we have the linguistic measures needed for a holistic assessment of text quality for NLP applications. These text quality measures are key inputs to our next stage: the **Data Quality Analysis (DQA)**.

In the DQA, we’ll widen our lens, integrating sentiment, typographical, and linguistic metrics across several dimensions of data quality to uncover areas of concern and devise further data processing interventions.