In [1]:
import warnings

warnings.filterwarnings("ignore")
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
FORCE = False

# Text Quality Detection
Noise removal is a standard part of many NLP pipelines, and for good reason: machine learning models often struggle with inconsistencies and irrelevant artifacts in text data, which can degrade performance. Traditional models, in particular, were highly sensitive to noise, requiring rigorous preprocessing to function effectively.

However, transformer models, such as BERT, which was trained on a corpus of approximately **3.3 billion words** from sources like BooksCorpus and English Wikipedia, have demonstrated remarkable robustness to linguistic noise. Their ability to handle variations in text, such as abbreviations, emojis, slang, internet jargon, grammatical errors, informal word forms, and misspellings, stems from a combination of subword tokenization techniques and the innovative self-attention mechanism.

1. **Misspellings**: Subword tokenization techniques, such as WordPiece (used by BERT), Byte Pair Encoding (BPE), and SentencePiece, break misspelled words into recognizable subword components. This allows models to leverage existing subword embeddings and infer the intended meaning even when spelling deviations occur. The model can often recover the intended meaning because it assembles it from familiar patterns rather than requiring an exact vocabulary match.

2. **Slang and Informal Language**: Transformers are trained on diverse, real-world text that includes slang and informal expressions, making them adept at understanding and processing these variations. Subword tokenization decomposes these unconventional words into smaller units that the model has encountered in other contexts, enabling generalization. Additionally, transformers’ vast training data captures the distribution and use of slang, embedding these linguistic nuances effectively.

3. **Emojis and Special Characters**: Emojis and special symbols that the tokenizer’s vocabulary covers are kept as distinct tokens, preserving their semantic value. The self-attention mechanism allows the model to integrate these elements contextually, understanding their contribution to sentiment or meaning within the text. By attending to the relationships between emojis and surrounding words, the model can interpret and generate text that accurately reflects emotional tone or emphasis.

4. **Abbreviations and Internet Jargon**: Abbreviations and internet-specific language are broken down into meaningful subword segments, allowing transformers to recognize patterns and relate them to standard language forms. The self-attention mechanism plays a crucial role here by dynamically assigning importance to different parts of the input sequence, enabling the model to understand the intended message despite the use of abbreviations.

5. **Grammatical Errors and Informal Word Forms**: The self-attention mechanism is a fundamental innovation in transformer models. It enables the model to establish contextual relationships between words even when word order or grammar departs from standard usage. By weighing the relevance of each word in relation to others, the model captures the overarching meaning even in the presence of syntax errors or informal language structures. This flexibility makes transformers robust to variations that would otherwise disrupt traditional models.
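
To make the subword argument concrete, here is a minimal sketch of greedy longest-match-first splitting in the style of WordPiece. The function name `wordpiece_tokenize` and the toy vocabulary are hypothetical and exist only for illustration; real tokenizers learn vocabularies of tens of thousands of pieces and add normalization steps that are omitted here.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it appears in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption made for this example only).
TOY_VOCAB = {"amazing", "amaz", "##zing", "##ing", "##z"}

print(wordpiece_tokenize("amazing", TOY_VOCAB))   # ['amazing']
print(wordpiece_tokenize("amazzing", TOY_VOCAB))  # ['amaz', '##zing']
print(wordpiece_tokenize("xyz", TOY_VOCAB))       # ['[UNK]']
```

The misspelled word still maps onto pieces the model has embeddings for, whereas a string with no matching pieces falls back to the unknown token.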

Moreover, studies have shown that some types of "useful" noise, such as informal language and emojis, can enhance model performance and generalizability, as they better simulate real-world text scenarios {cite}`languageandmultimodalailamalabimperialcollegelondonukBetterUnderstandingNoise2021`. By preserving or even embracing this *useful* noise, models become more adaptable and effective in practical applications, demonstrating the nuanced trade-offs in handling linguistic noise.

Therefore, we take a nuanced, task-specific approach to noise handling, isolating and removing only *harmful* noise. We define harmful noise as artifacts that carry no meaning or that distort the intended meaning of the text. To ensure high data quality, we assess and flag observations along several dimensions:

- **Personally Identifiable Information (PII)**: This includes URLs, emails, and phone numbers, which can compromise privacy.
- **Language Noise**: We flag non-English text, which can hinder language-specific models from accurately interpreting content.
- **Accuracy Noise**: Artifacts such as control characters, HTML tags, excessive whitespace, elongation of characters, non-ASCII characters, and certain special characters that disrupt text consistency are flagged and managed.
- **Validity**: We identify review length and perplexity outliers that may indicate irregular, fake, or otherwise invalid content. For `review_length` and `perplexity` we take a conservative approach: an outlier is defined as a value that lies more than $3\times\text{IQR}$ beyond the first or third quartile.
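
The outlier rule in the last bullet can be sketched as follows. This is a minimal illustration under two assumptions: the fences are measured from the first and third quartiles (Tukey's convention), and quartiles use the standard library's "inclusive" method. The function name `iqr_outlier_flags` and the sample lengths are hypothetical, and the pipeline itself runs on Spark rather than plain Python lists.

```python
from statistics import quantiles


def iqr_outlier_flags(values, k=3.0):
    """Flag values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]


# Hypothetical review lengths (illustrative data, not from the dataset).
review_lengths = [10, 12, 14, 15, 16, 18, 20, 22, 400]
flags = iqr_outlier_flags(review_lengths)

print([v for v, is_outlier in zip(review_lengths, flags) if is_outlier])  # [400]
```

With `k=3.0` the fences are wider than the common `1.5` rule, so only extreme values are flagged, which matches the conservative intent.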

By addressing these dimensions, we optimize data quality in a way that enhances model performance without sacrificing the rich, real-world variability that some noise can provide.
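
Several of the PII and accuracy-noise checks listed above reduce to regular-expression tests. The sketch below uses Python's `re` module with patterns that mirror those visible in the stage's logged query plan later in this section. The `PATTERNS` dictionary and `flag_text` function are hypothetical names, and the real tasks evaluate these expressions in Spark, so behavior may differ in edge cases.

```python
import re

# Flag name -> compiled pattern. Names follow the tqd_* columns in the stage output.
PATTERNS = {
    "tqd_email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "tqd_ctrl_chars": re.compile(r"[\x00-\x1F\x7F]"),
    "tqd_html_chars": re.compile(r"&[#A-Za-z0-9]+;"),
    "tqd_excess_whitespace": re.compile(r"\s{2,}"),
    "tqd_elongation": re.compile(r"(.)\1{3,}"),  # same character 4+ times in a row
}


def flag_text(text):
    """Return a dict of boolean quality flags for one piece of text."""
    return {name: bool(pattern.search(text)) for name, pattern in PATTERNS.items()}


noisy = "Soooooo good!!  Email me at a.b@example.com &amp; thanks"
clean = "Great app, works well."

print(flag_text(noisy))
print(flag_text(clean))
```

Each flag only marks the observation; deciding whether to repair, replace, or drop the text is left to downstream steps.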

## Import Libraries

In [None]:
from discover.container import DiscoverContainer
from discover.core.flow import PhaseDef, StageDef
from discover.flow.data_prep.dqd.stage import TextQualityDetectionStage
from discover.infra.config.flow import FlowConfigReader

## Dependency Container

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.data_prep.base.stage",
    ],
)

## Text Quality Detection (TQD) Pipeline
The text quality detection pipeline flags observations containing *harmful* noise artifacts for downstream transformation, replacement, or removal. Following a pattern similar to the **Ingestion Pipeline**, the pipeline configuration is loaded using the `FlowConfigReader`. During construction of the `TextQualityDetectionStage`, the text quality detection tasks are dynamically instantiated from that configuration and attached to the stage object. The `run` method then orchestrates the entire workflow.
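
The idea of building tasks from configuration and running them in sequence can be sketched in plain Python. Everything below is hypothetical: `RegexFlagTask`, `TASK_REGISTRY`, `ToyStage`, and the config layout are illustrative assumptions, not the `discover` package's actual classes or configuration schema.

```python
import re
from dataclasses import dataclass, field


@dataclass
class RegexFlagTask:
    """Hypothetical task: adds a boolean column when a pattern matches."""
    column: str
    new_column: str
    pattern: str

    def run(self, rows):
        regex = re.compile(self.pattern)
        for row in rows:
            row[self.new_column] = bool(regex.search(row[self.column]))
        return rows


# Maps class names used in the config to task classes (assumed layout).
TASK_REGISTRY = {"RegexFlagTask": RegexFlagTask}


@dataclass
class ToyStage:
    tasks: list = field(default_factory=list)

    @classmethod
    def build(cls, stage_config):
        # Instantiate each task from its configured class name and parameters.
        tasks = [
            TASK_REGISTRY[spec["class_name"]](**spec["params"])
            for spec in stage_config["tasks"]
        ]
        return cls(tasks=tasks)

    def run(self, rows):
        # Apply the tasks in order; each one adds its flag column.
        for task in self.tasks:
            rows = task.run(rows)
        return rows


toy_config = {
    "tasks": [
        {
            "class_name": "RegexFlagTask",
            "params": {
                "column": "content",
                "new_column": "tqd_elongation",
                "pattern": r"(.)\1{3,}",
            },
        }
    ]
}

rows = ToyStage.build(toy_config).run([{"content": "Loooove it"}, {"content": "Love it"}])
print([row["tqd_elongation"] for row in rows])  # [True, False]
```

Keeping the task list in configuration means a check can be added or removed without changing the stage code.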

In [4]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(phase=PhaseDef.DATAPREP, stage=StageDef.TQD)
# Build and run the stage
stage = TextQualityDetectionStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()



#                          Text Quality Detection Stage                          #






                             DetectOrRepairURLTask                              
                             ---------------------                              
                          Start Datetime | Mon, 18 Nov 2024 16:15:46
                       Complete Datetime | Mon, 18 Nov 2024 16:15:46
                                 Runtime | 0.2 seconds


                         DetectOrRepairEmailAddressTask                         
                         ------------------------------                         
                          Start Datetime | Mon, 18 Nov 2024 16:15:46
                       Complete Datetime | Mon, 18 Nov 2024 16:15:46
                                 Runtime | 0.06 seconds


                         DetectOrRepairPhoneNumberTask                          
                         -----------------------------                          
                          Start Datetime | Mon, 18 Nov 2024 16:15:46
                       Complete Datetime | Mon, 18 N

                                                                                

                       Complete Datetime | Mon, 18 Nov 2024 16:15:51
                                 Runtime | 3.54 seconds


                           DetectOrRepairOutliersTask                           
                           --------------------------                           
                          Start Datetime | Mon, 18 Nov 2024 16:15:51


[11/18/2024 04:15:51 PM] [ERROR] [DetectOrRepairTask.run] [wrapper] : Exception occurred in DetectOrRepairOutliersTask called with data=DataFrame[id: string, app_id: string, app_name: string, category_id: string, author: string, rating: smallint, content: string, vote_sum: bigint, vote_count: bigint, date: timestamp_ntz, review_length: bigint, voc_sentiment: string, category: string, tqd_url: boolean, tqd_email: boolean, tqd_phone: boolean, tqd_ctrl_chars: boolean, tqd_accents: boolean, tqd_html_chars: boolean, tqd_excess_whitespace: boolean, tqd_non_english_app_name: boolean, tqd_non_english_text: boolean, tqd_excess_special_chars: boolean, tqd_duplicate_review_id: boolean, tqd_elongation: boolean, tqd_non_ascii_chars: boolean, tqd_excess_non_ascii_chars: boolean, tqd_excess_sequence_repetition: boolean, tqd_excess_word_repetition: boolean, tqd_excess_phrase_repetition: boolean, tqd_review_length_outlier: boolean]
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with

AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `an_perplexity` cannot be resolved. Did you mean one of the following? [`app_id`, `app_name`, `category_id`, `content`, `review_length`].;
'Project ['an_perplexity]
+- Project [id#0, app_id#1, app_name#2, category_id#3, author#4, rating#5, content#6, vote_sum#7L, vote_count#8L, date#9, review_length#10L, voc_sentiment#11, category#12, tqd_url#26, tqd_email#42, tqd_phone#58, tqd_ctrl_chars#75, tqd_accents#93, tqd_html_chars#112, tqd_excess_whitespace#132, tqd_non_english_app_name#177, tqd_non_english_text#224, tqd_excess_special_chars#247, tqd_duplicate_review_id#272, ... 7 more fields]
   +- Project [id#0, app_id#1, app_name#2, category_id#3, author#4, rating#5, content#6, vote_sum#7L, vote_count#8L, date#9, review_length#10L, voc_sentiment#11, category#12, tqd_url#26, tqd_email#42, tqd_phone#58, tqd_ctrl_chars#75, tqd_accents#93, tqd_html_chars#112, tqd_excess_whitespace#132, tqd_non_english_app_name#177, tqd_non_english_text#224, tqd_excess_special_chars#247, tqd_duplicate_review_id#272, ... 6 more fields]
      +- Project [id#0, app_id#1, app_name#2, category_id#3, author#4, rating#5, content#6, vote_sum#7L, vote_count#8L, date#9, review_length#10L, voc_sentiment#11, category#12, tqd_url#26, tqd_email#42, tqd_phone#58, tqd_ctrl_chars#75, tqd_accents#93, tqd_html_chars#112, tqd_excess_whitespace#132, tqd_non_english_app_name#177, tqd_non_english_text#224, tqd_excess_special_chars#247, tqd_duplicate_review_id#272, ... 5 more fields]
         +- Project [id#0, app_id#1, app_name#2, category_id#3, author#4, rating#5, content#6, vote_sum#7L, vote_count#8L, date#9, review_length#10L, voc_sentiment#11, category#12, tqd_url#26, tqd_email#42, tqd_phone#58, tqd_ctrl_chars#75, tqd_accents#93, tqd_html_chars#112, tqd_excess_whitespace#132, tqd_non_english_app_name#177, tqd_non_english_text#224, tqd_excess_special_chars#247, tqd_duplicate_review_id#272, ... 4 more fields]
            +- Project [id#0, app_id#1, app_name#2, category_id#3, author#4, rating#5, content#6, vote_sum#7L, vote_count#8L, date#9, review_length#10L, voc_sentiment#11, category#12, tqd_url#26, tqd_email#42, tqd_phone#58, tqd_ctrl_chars#75, tqd_accents#93, tqd_html_chars#112, tqd_excess_whitespace#132, tqd_non_english_app_name#177, tqd_non_english_text#224, tqd_excess_special_chars#247, tqd_duplicate_review_id#272, ... 3 more fields]
               +- Project [id#0, app_id#1, app_name#2, category_id#3, author#4, rating#5, content#6, vote_sum#7L, vote_count#8L, date#9, review_length#10L, voc_sentiment#11, category#12, tqd_url#26, tqd_email#42, tqd_phone#58, tqd_ctrl_chars#75, tqd_accents#93, tqd_html_chars#112, tqd_excess_whitespace#132, tqd_non_english_app_name#177, tqd_non_english_text#224, tqd_excess_special_chars#247, tqd_duplicate_review_id#272, ... 2 more fields]
                  +- Project [id#0, app_id#1, app_name#2, category_id#3, author#4, rating#5, content#6, vote_sum#7L, vote_count#8L, date#9, review_length#10L, voc_sentiment#11, category#12, tqd_url#26, tqd_email#42, tqd_phone#58, tqd_ctrl_chars#75, tqd_accents#93, tqd_html_chars#112, tqd_excess_whitespace#132, tqd_non_english_app_name#177, tqd_non_english_text#224, tqd_excess_special_chars#247, tqd_duplicate_review_id#272, cast((regexp_count(content#6, (.)\1{3,}) > 0) as boolean) AS tqd_elongation#298]
                     +- Project [id#0, app_id#1, app_name#2, category_id#3, author#4, rating#5, content#6, vote_sum#7L, vote_count#8L, date#9, review_length#10L, voc_sentiment#11, category#12, tqd_url#26, tqd_email#42, tqd_phone#58, tqd_ctrl_chars#75, tqd_accents#93, tqd_html_chars#112, tqd_excess_whitespace#132, tqd_non_english_app_name#177, tqd_non_english_text#224, tqd_excess_special_chars#247, tqd_duplicate_review_id#272]
                        +- Project [id#0, app_id#1, app_name#2, category_id#3, author#4, rating#5, content#6, vote_sum#7L, vote_count#8L, date#9, review_length#10L, voc_sentiment#11, category#12, tqd_url#26, tqd_email#42, tqd_phone#58, tqd_ctrl_chars#75, tqd_accents#93, tqd_html_chars#112, tqd_excess_whitespace#132, tqd_non_english_app_name#177, tqd_non_english_text#224, tqd_excess_special_chars#247, _we0#273L, cast((_we0#273L > cast(1 as bigint)) as boolean) AS tqd_duplicate_review_id#272]
                           +- Window [count(id#0) windowspecdefinition(id#0, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS _we0#273L], [id#0]
                              +- Project [id#0, app_id#1, app_name#2, category_id#3, author#4, rating#5, content#6, vote_sum#7L, vote_count#8L, date#9, review_length#10L, voc_sentiment#11, category#12, tqd_url#26, tqd_email#42, tqd_phone#58, tqd_ctrl_chars#75, tqd_accents#93, tqd_html_chars#112, tqd_excess_whitespace#132, tqd_non_english_app_name#177, tqd_non_english_text#224, tqd_excess_special_chars#247]
                                 +- Project [id#0, app_id#1, app_name#2, category_id#3, author#4, rating#5, content#6, vote_sum#7L, vote_count#8L, date#9, review_length#10L, voc_sentiment#11, category#12, tqd_url#26, tqd_email#42, tqd_phone#58, tqd_ctrl_chars#75, tqd_accents#93, tqd_html_chars#112, tqd_excess_whitespace#132, tqd_non_english_app_name#177, tqd_non_english_text#224, cast(((cast(regexp_count(content#6, [#<>~]) as double) / cast(length(content#6) as double)) > 0.3) as boolean) AS tqd_excess_special_chars#247]
                                    +- Project [id#0, app_id#1, app_name#2, category_id#3, author#4, rating#5, content#6, vote_sum#7L, vote_count#8L, date#9, review_length#10L, voc_sentiment#11, category#12, tqd_url#26, tqd_email#42, tqd_phone#58, tqd_ctrl_chars#75, tqd_accents#93, tqd_html_chars#112, tqd_excess_whitespace#132, tqd_non_english_app_name#177, CASE WHEN tqd_non_english_text#200 THEN _run_lingua(content#6)#223 ELSE false END AS tqd_non_english_text#224]
                                       +- Project [id#0, app_id#1, app_name#2, category_id#3, author#4, rating#5, content#6, vote_sum#7L, vote_count#8L, date#9, review_length#10L, voc_sentiment#11, category#12, tqd_url#26, tqd_email#42, tqd_phone#58, tqd_ctrl_chars#75, tqd_accents#93, tqd_html_chars#112, tqd_excess_whitespace#132, tqd_non_english_app_name#177, _run_fasttext(content#6)#199 AS tqd_non_english_text#200]
                                          +- Project [id#0, app_id#1, app_name#2, category_id#3, author#4, rating#5, content#6, vote_sum#7L, vote_count#8L, date#9, review_length#10L, voc_sentiment#11, category#12, tqd_url#26, tqd_email#42, tqd_phone#58, tqd_ctrl_chars#75, tqd_accents#93, tqd_html_chars#112, tqd_excess_whitespace#132, CASE WHEN tqd_non_english_app_name#154 THEN _run_lingua(app_name#2)#176 ELSE false END AS tqd_non_english_app_name#177]
                                             +- Project [id#0, app_id#1, app_name#2, category_id#3, author#4, rating#5, content#6, vote_sum#7L, vote_count#8L, date#9, review_length#10L, voc_sentiment#11, category#12, tqd_url#26, tqd_email#42, tqd_phone#58, tqd_ctrl_chars#75, tqd_accents#93, tqd_html_chars#112, tqd_excess_whitespace#132, _run_fasttext(app_name#2)#153 AS tqd_non_english_app_name#154]
                                                +- Project [id#0, app_id#1, app_name#2, category_id#3, author#4, rating#5, content#6, vote_sum#7L, vote_count#8L, date#9, review_length#10L, voc_sentiment#11, category#12, tqd_url#26, tqd_email#42, tqd_phone#58, tqd_ctrl_chars#75, tqd_accents#93, tqd_html_chars#112, cast((regexp_count(content#6, \s{2,}) > 0) as boolean) AS tqd_excess_whitespace#132]
                                                   +- Project [id#0, app_id#1, app_name#2, category_id#3, author#4, rating#5, content#6, vote_sum#7L, vote_count#8L, date#9, review_length#10L, voc_sentiment#11, category#12, tqd_url#26, tqd_email#42, tqd_phone#58, tqd_ctrl_chars#75, tqd_accents#93, cast((regexp_count(content#6, &[#A-Za-z0-9]+;) > 0) as boolean) AS tqd_html_chars#112]
                                                      +- Project [id#0, app_id#1, app_name#2, category_id#3, author#4, rating#5, content#6, vote_sum#7L, vote_count#8L, date#9, review_length#10L, voc_sentiment#11, category#12, tqd_url#26, tqd_email#42, tqd_phone#58, tqd_ctrl_chars#75, cast((regexp_count(content#6, [\u00C0-\u024F]) > 0) as boolean) AS tqd_accents#93]
                                                         +- Project [id#0, app_id#1, app_name#2, category_id#3, author#4, rating#5, content#6, vote_sum#7L, vote_count#8L, date#9, review_length#10L, voc_sentiment#11, category#12, tqd_url#26, tqd_email#42, tqd_phone#58, cast((regexp_count(content#6, [\x00-\x1F\x7F]) > 0) as boolean) AS tqd_ctrl_chars#75]
                                                            +- Project [id#0, app_id#1, app_name#2, category_id#3, author#4, rating#5, content#6, vote_sum#7L, vote_count#8L, date#9, review_length#10L, voc_sentiment#11, category#12, tqd_url#26, tqd_email#42, cast((regexp_count(content#6, (\+?\d{1,3})?[\s.-]?\(?\d{2,4}\)?[\s.-]?\d{3,4}[\s.-]?\d{4}) > 0) as boolean) AS tqd_phone#58]
                                                               +- Project [id#0, app_id#1, app_name#2, category_id#3, author#4, rating#5, content#6, vote_sum#7L, vote_count#8L, date#9, review_length#10L, voc_sentiment#11, category#12, tqd_url#26, cast((regexp_count(content#6, [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}) > 0) as boolean) AS tqd_email#42]
                                                                  +- Project [id#0, app_id#1, app_name#2, category_id#3, author#4, rating#5, content#6, vote_sum#7L, vote_count#8L, date#9, review_length#10L, voc_sentiment#11, category#12, cast((regexp_count(content#6, (https?:\/\/)?(www\.)?[\w\-_]+(\.[\w\-_]+)+([\/\w\-_\.]*)*) > 0) as boolean) AS tqd_url#26]
                                                                     +- Relation [id#0,app_id#1,app_name#2,category_id#3,author#4,rating#5,content#6,vote_sum#7L,vote_count#8L,date#9,review_length#10L,voc_sentiment#11,category#12] parquet


With the **Text Quality Detection** step complete, our dataset is now enriched with quality signals for a meaningful **Data Quality Analysis (DQA)**, where we’ll examine the broader impact of these artifacts on overall data quality. In the next section, we will analyze review text through the lens of syntactic complexity, lexical diversity, and coherence, the linguistic characteristics associated with high-quality reviews.