In [1]:
import warnings

warnings.filterwarnings("ignore")
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
FORCE = False

# Data Quality Anomaly Detection (DQAD)
Data Quality Anomaly Detection and cleaning are standard parts of many NLP pipelines, and for good reason: machine learning models often struggle with inconsistencies and irrelevant artifacts in text data, which can degrade performance. Traditional models, in particular, were highly sensitive to noise, requiring rigorous preprocessing to function effectively.

However, transformer models, such as BERT, which was trained on a corpus of approximately **3.3 billion words** from sources like BooksCorpus and English Wikipedia, have demonstrated remarkable robustness to linguistic noise. Their ability to handle variations in text, such as abbreviations, emojis, slang, internet jargon, grammatical errors, informal word forms, and misspellings, stems from a combination of subword tokenization techniques and the innovative self-attention mechanism.   

1. **Misspellings**: Subword tokenization techniques, such as WordPiece (used by BERT), Byte Pair Encoding (BPE), and SentencePiece, break misspelled words into recognizable subword components. This allows models to leverage existing subword embeddings and infer the intended meaning, even when spelling deviations occur. The model can often recover the intended meaning because it assembles it from familiar subword patterns rather than requiring an exact vocabulary match.
2. **Slang and Informal Language**: Transformers are trained on diverse, real-world text that includes slang and informal expressions, making them adept at understanding and processing these variations. Subword tokenization decomposes these unconventional words into smaller units that the model has encountered in other contexts, enabling generalization. Additionally, transformers’ vast training data captures the distribution and use of slang, embedding these linguistic nuances effectively. 
3. **Emojis and Special Characters**: When the tokenizer's vocabulary covers them, emojis and special symbols are represented as tokens of their own, preserving their semantic value. The self-attention mechanism allows the model to integrate these elements contextually, understanding their contribution to sentiment or meaning within the text. By attending to the relationships between emojis and surrounding words, the model can interpret and generate text that accurately reflects emotional tone or emphasis.
4. **Abbreviations and Internet Jargon**: Abbreviations and internet-specific language are broken down into meaningful subword segments, allowing transformers to recognize patterns and relate them to standard language forms. The self-attention mechanism plays a crucial role here by dynamically assigning importance to different parts of the input sequence, enabling the model to understand the intended message despite the use of abbreviations. 
5. **Grammatical Errors and Informal Word Forms**: The self-attention mechanism is a fundamental innovation in transformer models. It enables the model to establish contextual relationships between words regardless of their order or grammatical correctness. By weighing the relevance of each word in relation to others, the model captures the overarching meaning even in the presence of syntax errors or informal language structures. This flexibility makes transformers robust to variations that would otherwise disrupt traditional models.
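
The decomposition of misspelled and informal words described above can be illustrated with a minimal sketch of greedy longest-match-first segmentation, in the style of WordPiece. The function name `wordpiece_tokenize` and the tiny `TOY_VOCAB` are hypothetical and exist only for illustration; real tokenizers learn vocabularies of tens of thousands of pieces from data.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word.

    Pieces that continue a word carry a '##' prefix, following the
    WordPiece convention. Returns [unk] if any part cannot be matched.
    """
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption, not a real model's vocabulary).
TOY_VOCAB = {"amazing", "amaz", "##zing", "##ing", "gr", "##8"}

print(wordpiece_tokenize("amazing", TOY_VOCAB))   # known word stays whole
print(wordpiece_tokenize("amazzing", TOY_VOCAB))  # misspelling splits into known pieces
print(wordpiece_tokenize("gr8", TOY_VOCAB))       # internet jargon splits into known pieces
```

In this sketch the misspelling `amazzing` is never seen as a whole word, yet it still maps to pieces the vocabulary already contains, which is the property the list above relies on.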

Moreover, studies have shown that some types of "useful" noise, such as informal language and emojis, can enhance model performance and generalizability, as they better simulate real-world text scenarios {cite}`languageandmultimodalailamalabimperialcollegelondonukBetterUnderstandingNoise2021`. By preserving or even embracing this *useful* noise, models become more adaptable and effective in practical applications, demonstrating the nuanced trade-offs in handling linguistic noise.

Therefore, we take a nuanced, task-specific approach to data quality assessment and anomaly detection, isolating and removing only *harmful* noise. We define harmful noise as artifacts that either carry no meaning or distort the intended meaning of the text. To ensure high data quality, we assess and flag observations along several dimensions of data quality.

## Accuracy Dimension
The **Accuracy** dimension in text data quality focuses on the correctness and reliability of textual information, ensuring that the content represents what is intended without introducing errors or distortions. In the context of Natural Language Processing (NLP), accuracy checks are particularly crucial as they help maintain the integrity of the text data that models rely on to make predictions or derive insights.

1. **Excessive Special Characters**: The presence of excessive or random special characters can corrupt the intended meaning of text and make it harder for models to interpret context. Accuracy checks ensure that these characters are only present when they add legitimate semantic value, such as in programming-related text or stylized writing. 
2. **Non-ASCII Characters**: While transformers can process non-ASCII characters, they may introduce unintended complexities or errors, especially when non-ASCII content is mixed into primarily English text without a clear purpose. Accuracy checks flag these occurrences to determine if they are contextually appropriate or represent an error in the data. 
3. **Control Characters**: Control characters, which are non-printable characters like tabs or line breaks embedded in text data, can disrupt text parsing and processing. Ensuring their absence or appropriate use maintains the structural accuracy needed for smooth NLP operations.
4. **HTML Characters**: Text data sourced from the web may contain HTML tags or character entities that interfere with the text's readability and model understanding. Accuracy checks sanitize or transform these elements to their intended textual form. 
5. **Excessive Whitespace**: Extra spaces or line breaks, though seemingly minor, can affect text tokenization and representation in models. Normalizing whitespace ensures text is processed in a consistent, meaningful way. 
6. **Accented and Diacritic Characters**: While accented characters are valid in many languages, their unintended presence in primarily non-accented text can indicate data entry errors. Checks for these characters verify if they are linguistically appropriate or require correction. 
7. **Elongation**: Text elongation, like in "sooo coool," is often used to emphasize words but may not be handled uniformly by models. Accuracy checks flag or normalize elongation to ensure consistent semantic interpretation. 
8. **Low Perplexity**: In the context of language models, low perplexity often signals repetitive or predictable patterns that may not carry substantive meaning. Ensuring text has appropriate complexity and variability is crucial for high-quality, informative data.
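
The character-level checks in this list (special characters, non-ASCII, control, and accented characters) could be sketched with the standard library alone. This is an illustrative sketch, not the pipeline's implementation: the function name `accuracy_char_flags` and the `0.3` special-character threshold are assumptions chosen for the example.

```python
import unicodedata


def accuracy_char_flags(text, max_special_ratio=0.3):
    """Return character-level anomaly flags for one piece of text.

    The 0.3 threshold is an illustrative assumption, not a tuned value.
    """
    # "Special" here means neither alphanumeric nor whitespace.
    special = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return {
        "excess_special_chars": bool(text)
        and special / len(text) > max_special_ratio,
        "non_ascii": any(ord(ch) > 127 for ch in text),
        # Unicode category "Cc" covers control characters, including tabs
        # and line breaks.
        "control_chars": any(unicodedata.category(ch) == "Cc" for ch in text),
        # A character is treated as accented if its canonical decomposition
        # contains a combining mark.
        "accents": any(
            unicodedata.combining(c)
            for ch in text
            for c in unicodedata.normalize("NFD", ch)
        ),
    }


print(accuracy_char_flags("Great app!!!"))
print(accuracy_char_flags("caf\u00e9"))
print(accuracy_char_flags("$$$###@@@ ok"))
```

Flags of this kind only mark observations for review; whether a flagged character is an error or a legitimate part of the text remains a task-specific decision.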

In short, the **Accuracy** dimension addresses the integrity of text content, ensuring that linguistic artifacts and patterns do not distort the meaning or introduce errors that could mislead models or downstream applications.
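
For the pattern-based accuracy checks (HTML remnants, excessive whitespace, and elongation), a regular-expression sketch might look like the following. The names `pattern_flags` and `normalize`, and the choice to collapse elongated runs to two letters, are assumptions made for illustration.

```python
import html
import re

HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")
HTML_ENTITY = re.compile(r"&(?:[A-Za-z]+|#\d+);")
EXCESS_WHITESPACE = re.compile(r"\s{2,}")
# A letter repeated three or more times in a row, as in "sooo".
ELONGATION = re.compile(r"([A-Za-z])\1{2,}")


def pattern_flags(text):
    """Flag HTML remnants, runs of whitespace, and elongated words."""
    return {
        "html": bool(HTML_TAG.search(text) or HTML_ENTITY.search(text)),
        "excess_whitespace": bool(EXCESS_WHITESPACE.search(text)),
        "elongation": bool(ELONGATION.search(text)),
    }


def normalize(text):
    """Strip tags, decode entities, collapse whitespace and elongation."""
    text = HTML_TAG.sub(" ", text)   # remove tags before decoding entities
    text = html.unescape(text)       # "&amp;" becomes "&"
    text = re.sub(r"\s+", " ", text).strip()
    # Collapsing to two letters is an assumption; it keeps words like "cool".
    return ELONGATION.sub(r"\1\1", text)


sample = "sooo   coool &amp; <b>fun</b>"
print(pattern_flags(sample))
print(normalize(sample))  # soo cool & fun
```

Tags are removed before entities are decoded so that an escaped `&lt;b&gt;` in the source text is preserved as literal text rather than stripped as markup.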

## Relevance Dimension
The **Relevance** dimension in text data quality ensures that the content is contextually appropriate and meaningful for the specific NLP task or analysis at hand. In other words, the text must be pertinent to the domain, language, or focus of the project. Relevance checks filter out content that could mislead models or degrade the performance of algorithms by introducing off-topic or linguistically inconsistent information.
1. **Non-English App Names**: In datasets where the primary focus is on English-language content, non-English app names can be a source of confusion or skew analysis results. Relevance checks flag these instances, allowing us to either exclude or process them separately to maintain linguistic consistency. 
2. **Non-English Review Text**: Similar to non-English app names, reviews written in languages other than English may be irrelevant to models trained specifically on English text. Relevance checks identify non-English text, helping ensure the data aligns with the model's language capabilities and task requirements. 
3. **Review Length < 3**: Very short reviews, typically fewer than three words, often lack substantive information or context. These reviews are unlikely to provide meaningful insights and may act as noise, affecting sentiment analysis or topic modeling performance. Relevance checks filter these short reviews to maintain a focus on text that contributes valuable content to the analysis.

By assessing relevance, we ensure that the text data are appropriate, meaningful, and aligned with the goals of the analysis. This dimension helps avoid the inclusion of extraneous or off-topic content that could distort model training or analysis results.
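
The relevance checks can be approximated with a minimal sketch. The word-count test follows the "fewer than three words" rule directly. The second function is only a crude script-based proxy for non-English text, not language identification, which in practice would call for a dedicated language detection model. Both function names are hypothetical.

```python
def is_short_review(text, min_words=3):
    """True when the review has fewer than `min_words` whitespace-separated words."""
    return len(text.split()) < min_words


def ascii_letter_share(text):
    """Share of alphabetic characters that are ASCII letters.

    A low share suggests a non-Latin script. This is a rough heuristic:
    it cannot tell English from other languages written in Latin script.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isascii() for ch in letters) / len(letters)


print(is_short_review("Love it"))        # True: two words
print(is_short_review("Love this app"))  # False: three words
print(ascii_letter_share("Great app"))   # 1.0
```

A threshold on `ascii_letter_share` would have to be chosen per dataset, since reviews that mix English with emojis or occasional accented names are still relevant.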

## Validity Dimension
The **Validity** dimension in text data quality ensures that the content adheres to expected formats, structures, and rules, making it suitable for processing and analysis. Validity checks identify and flag content that deviates from these established norms, as such deviations can hinder the performance of NLP models and introduce inaccuracies. 
1. **URLs**: Reviews containing URLs may not provide meaningful textual content for analysis and can disrupt language models. Validity checks identify and flag URLs, allowing for their removal or replacement to maintain textual coherence. 
2. **Phone Numbers**: Similar to URLs, phone numbers are often irrelevant to the semantic content of a review and may interfere with text processing. Validity checks detect phone numbers, ensuring that they are either masked or removed to avoid skewing the analysis. 
3. **Email Addresses**: Email addresses can introduce noise and potentially violate privacy policies. Detecting and handling these elements helps maintain data integrity and privacy while ensuring the text remains analyzable. 
4. **Repeated Sequences**: Reviews with excessive repetition of sequences, such as repeated letters, words, or patterns, can indicate spam or low-quality content. Validity checks identify such sequences, enabling corrective measures to ensure high-quality input for NLP models. 
5. **Repeated Words**: Similar to repeated sequences, the presence of redundant words may indicate automated or spam-like content. Detecting and addressing these issues helps maintain the linguistic integrity of the dataset. 
6. **Repeated Phrases**: Repeated phrases can dilute the semantic richness of the text and may signify low-quality or irrelevant content. Validity checks ensure these phrases are flagged for removal or further examination. 

By incorporating these **Validity** checks, we verify that the textual data adheres to expected norms and formats, reducing the risk of disruptions during text analysis. 
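
As a rough illustration of the validity checks, the sketch below uses simplified regular expressions. The patterns are assumptions for this example: they favor readability over coverage, the phone pattern targets North American formats only, and the repetition thresholds (three repeated words, four repeated sequences) are arbitrary.

```python
import re

URL = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
# Simplified North American phone formats, e.g. 555-123-4567 or (555) 123 4567.
PHONE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)
# The same word three or more times in a row.
REPEATED_WORDS = re.compile(r"\b(\w+)(?:\s+\1\b){2,}", re.IGNORECASE)
# Any sequence of two or more characters repeated four or more times in a row.
REPEATED_SEQUENCE = re.compile(r"(.{2,}?)\1{3,}")


def validity_flags(text):
    """Return one boolean flag per validity check."""
    return {
        "url": bool(URL.search(text)),
        "email": bool(EMAIL.search(text)),
        "phone": bool(PHONE.search(text)),
        "repeated_words": bool(REPEATED_WORDS.search(text)),
        "repeated_sequence": bool(REPEATED_SEQUENCE.search(text)),
    }


print(validity_flags("Visit https://example.com now"))
print(validity_flags("good good good app"))
print(validity_flags("hahahahaha"))
```

Keeping each check as a separate flag, rather than a single pass/fail value, allows URLs to be masked while repeated content is examined or removed.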

## Uniqueness Dimension
The **Uniqueness** dimension in text data quality emphasizes the importance of having distinct and non-duplicative content within a dataset. Ensuring uniqueness is crucial to prevent redundancy and to maintain the integrity and reliability of analytical results. In text processing, repeated or duplicated content can skew analysis, reduce the diversity of linguistic features, and lead to misleading insights.  
1. **Duplicate Review ID**: Duplicate review identifiers indicate that the same review appears more than once in the dataset. This can artificially inflate the perceived frequency of specific sentiments or topics, impacting statistical analysis and model performance. Uniqueness checks detect and flag duplicate review IDs, allowing for the removal of redundant entries and ensuring that each review contributes uniquely to the analysis.

By enforcing the **Uniqueness** dimension, we ensure that analyses are based on a diverse and representative sample of the text.
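
Duplicate identifiers can be flagged in a single pass. The sketch below works on a plain list of IDs and uses hypothetical function names; it shows one way to flag every member of a duplicate group, and one way to mark which rows to keep if only the first occurrence should survive.

```python
from collections import Counter


def flag_duplicate_ids(review_ids):
    """True for every ID that occurs more than once in the list."""
    counts = Counter(review_ids)
    return [counts[rid] > 1 for rid in review_ids]


def keep_first_occurrence(review_ids):
    """True for the first occurrence of each ID, False for later repeats."""
    seen = set()
    keep = []
    for rid in review_ids:
        keep.append(rid not in seen)
        seen.add(rid)
    return keep


ids = ["r1", "r2", "r1", "r3"]
print(flag_duplicate_ids(ids))     # [True, False, True, False]
print(keep_first_occurrence(ids))  # [True, True, False, True]
```

Flagging all members of a group supports analysis of how duplicates arose, while the keep-first mask supports the eventual removal step.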

Next, we construct and execute the **Data Quality Anomaly Detection** pipeline, adding indicators of data accuracy, validity, uniqueness, and relevance.

## Import Libraries

In [2]:
from discover.container import DiscoverContainer
from discover.infra.config.flow import FlowConfigReader
from discover.core.flow import PhaseDef, StageDef
from discover.flow.data_prep.dqd.stage import DataQualityDetectionStage

## Dependency Container

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.data_prep.base.stage",
    ],
)

## Data Quality Anomaly Detection (DQAD) Pipeline
Following our standard orchestration process, we load the configuration using the `FlowConfigReader`, then construct and execute the **DataQualityDetectionStage** pipeline.

In [4]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(phase=PhaseDef.DATAPREP, stage=StageDef.DQD)
# Build and run the stage
stage = DataQualityDetectionStage.build(
    stage_config=stage_config, return_dataset=False, force=FORCE
)
dataset = stage.run()

[11/20/2024 07:49:23 AM] [DEBUG] [discover.flow.data_prep.base.stage.DataQualityDetectionStage] [run] : Data prep execution path: RUN
[11/20/2024 07:49:23 AM] [DEBUG] [discover.infra.service.spark.session.SparkSessionPool] [_create_session] : Creating a spark session.
[11/20/2024 07:49:23 AM] [DEBUG] [discover.infra.service.spark.session.SparkSessionPool] [_create_session] : Creating an Spark session. log4j Configuration: file:/home/john/projects/appvocai-discover/log4j.properties




#                      Data Quality Anomaly Detection Stage                      #



your 131072x1 screen size is bogus. expect trouble
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
                                                                                



                            DetectOrRepairPercentile                            
                            ------------------------                            
                          Start Datetime | Wed, 20 Nov 2024 07:49:37


[11/20/2024 07:49:40 AM] [DEBUG] [DetectOrRepairTask.run] [wrapper] : Task: DetectOrRepairPercentile
[11/20/2024 07:49:40 AM] [DEBUG] [DetectOrRepairTask.run] [wrapper] : Started: Wed, 20 Nov 2024 07:49:37
[11/20/2024 07:49:40 AM] [DEBUG] [DetectOrRepairTask.run] [wrapper] : Completed: Wed, 20 Nov 2024 07:49:40
[11/20/2024 07:49:40 AM] [DEBUG] [DetectOrRepairTask.run] [wrapper] : Runtime: 3.42 seconds


                       Complete Datetime | Wed, 20 Nov 2024 07:49:40
                                 Runtime | 3.42 seconds


                         DetectOrRepairMinimumValueTask                         
                         ------------------------------                         
                          Start Datetime | Wed, 20 Nov 2024 07:49:40


[11/20/2024 07:49:41 AM] [DEBUG] [DetectOrRepairTask.run] [wrapper] : Task: DetectOrRepairMinimumValueTask
[11/20/2024 07:49:41 AM] [DEBUG] [DetectOrRepairTask.run] [wrapper] : Started: Wed, 20 Nov 2024 07:49:40
[11/20/2024 07:49:41 AM] [DEBUG] [DetectOrRepairTask.run] [wrapper] : Completed: Wed, 20 Nov 2024 07:49:41
[11/20/2024 07:49:41 AM] [DEBUG] [DetectOrRepairTask.run] [wrapper] : Runtime: 0.83 seconds


                       Complete Datetime | Wed, 20 Nov 2024 07:49:41
                                 Runtime | 0.83 seconds


[11/20/2024 07:49:44 AM] [DEBUG] [DataPrepStage.run] [wrapper] : Stage: Data Quality Anomaly Detection Stage
[11/20/2024 07:49:44 AM] [DEBUG] [DataPrepStage.run] [wrapper] : Stage Started: Wed, 20 Nov 2024 07:49:23
[11/20/2024 07:49:44 AM] [DEBUG] [DataPrepStage.run] [wrapper] : Stage Completed: Wed, 20 Nov 2024 07:49:44
[11/20/2024 07:49:44 AM] [DEBUG] [DataPrepStage.run] [wrapper] : Stage Runtime: 21.12 seconds




                      Data Quality Anomaly Detection Stage                      
                           Stage Started | Wed, 20 Nov 2024 07:49:23
                         Stage Completed | Wed, 20 Nov 2024 07:49:44
                           Stage Runtime | 21.12 seconds





With **Data Quality Anomaly Detection** complete, we move on to **Data Quality Analysis (DQA)**.

In [5]:
from discover.assets.dataset import Dataset


print(dataset)

dataset-test-dataprep-dqd-review
