In [1]:
import warnings

warnings.filterwarnings("ignore")
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

# Data Quality Assessment
In the previous section, we provided an overview of the AppVoCAI dataset, including its structure, features, and distributions. The goals of this stage are to

1. identify any unwanted artifacts in the dataset that could compromise the integrity and accuracy of our downstream modeling efforts, and
2. design the data cleaning pipeline and text cleaning interventions to remove or otherwise treat these artifacts.

It is important to clarify that this stage is focused on **data quality assessment** and **text cleaning**, rather than data preprocessing. While preprocessing tasks such as tokenization, lemmatization, stopword removal, and text normalization are crucial steps in preparing the dataset for model training, they are not the focus here. Instead, our goal is to address **anomalies**—such as duplicates, invalid characters, non-ASCII text, and other artifacts—that could compromise the integrity and reliability of downstream analysis. By ensuring the dataset is clean and free of unwanted noise, we lay the foundation for accurate, meaningful preprocessing and modeling in later stages.

## Data Quality Context
Downstream tasks such as sentiment analysis, classification, text summarization, and generation will leverage transformer-based models (like BERT, RoBERTa, and GPT), which have proven to be highly robust in handling various data anomalies and linguistic variations. Unlike Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, which process tokens sequentially, and Convolutional Neural Networks (CNNs), which capture mostly local context, transformers attend to entire sequences in parallel. This allows them to capture long-range dependencies and uncover subtle nuances and contextual relationships in language.

However, research demonstrates that preprocessing can still significantly improve the performance of transformer models {cite}`siinoTextPreprocessingStill2024`. This synergy between preprocessing and model architecture suggests that a balanced approach is ideal. Our data preparation methodology focuses on addressing critical data quality issues that could undermine the integrity of downstream analyses, while preserving the text as close to its original form as possible. By adopting this conservative approach, we tackle key issues without sacrificing the nuance and representativeness of the data, ensuring the models are presented with rich, authentic input.

## Data Quality Approach
Although our data quality and cleaning approach incorporates many of the preprocessing techniques commonly found in the literature {cite}`symeonidisComparativeEvaluationPreprocessing2018`, three questions motivated the data quality design process:

1. What’s essential to remove, and what can be left intact to preserve meaning?
2. How do we best preserve text richness and nuance?
3. How can the data cleaning process best exploit model strengths towards optimal model performance?

These Key Evaluation Questions (KEQs) crystallized our approach, which balances data quality with model sophistication.

---

### Noise Removal
Noise refers to characters or tokens that either distort the content or contribute little to understanding sentiment, intent, or behavior. Our approach to noise removal starts with the simpler and more common issues, followed by more nuanced decisions around special characters and punctuation.

#### Encoding and Common Noise
The first set of noise to address involves issues that are relatively easy to detect and remove, but can have a big impact on the clarity of the dataset:

- **Encoding and Control Characters:** Often appearing as artifacts from different text encoding formats, these include characters that serve formatting or invisible functions in the text. They will be removed.
- **Accents and Diacritics:** These will be normalized (e.g., converting `é` to `e`) to reduce unnecessary variation in the text.
- **HTML Characters:** Common in scraped data, HTML entities such as `&amp;` and `&#39;` will be removed, as they do not add any meaningful content.
- **Line Breaks and Excessive Whitespace:** Multiple line breaks and extra spaces will be condensed into a single space to ensure consistency and readability.
- **Non-ASCII Characters:** While some non-ASCII characters (like certain symbols or non-English characters) may carry meaning, most introduce unnecessary complexity and will be removed. 
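
The treatments listed above can be sketched with the Python standard library. This is a minimal illustration, not the project's cleaning pipeline: the function name `clean_common_noise`, the entity pattern, and the order of the steps are all assumptions made for the example.

```python
import re
import unicodedata

# Named and numeric HTML entities, e.g. &amp; or &#39; (simplified pattern).
HTML_ENTITY = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def clean_common_noise(text: str) -> str:
    """Apply the five treatments listed above, in one possible order."""
    # HTML entities are removed, not decoded.
    text = HTML_ENTITY.sub("", text)
    # Accents and diacritics: decompose, then drop combining marks (é -> e).
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Line breaks and runs of whitespace collapse to a single space.
    text = re.sub(r"\s+", " ", text)
    # Control and other non-printing characters (Unicode categories C*).
    text = "".join(ch for ch in text if unicodedata.category(ch)[0] != "C")
    # Remaining non-ASCII characters are dropped.
    text = text.encode("ascii", "ignore").decode("ascii")
    # Removing characters can leave new gaps, so tidy whitespace once more.
    return re.sub(r"\s+", " ", text).strip()


print(clean_common_noise("Caf\u00e9  is\n\ngreat &amp; fast\x00!"))
```

Whitespace is collapsed before control characters are dropped so that line breaks become spaces instead of gluing words together. As written, the non-ASCII step also strips emojis and curly apostrophes, so a pipeline that keeps those would need to exempt them.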

#### Special Characters
Special characters can vary in importance, depending on their context in app reviews. To systematically handle special characters, we can categorize them into two groups: **those that should be removed** (because they add noise or don't carry useful information) and **those that should be retained** (because they might contribute to the meaning or sentiment in the text). Here's a breakdown:

##### Special Characters to Remove 
These characters generally do not add meaningful content to app reviews and are mostly used for formatting or random emphasis. 

|     Special   Character     |          Examples          |                                                                  Rationale                                                                  |
|:---------------------------:|:--------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------:|
| Currency Symbols            | $, €, ¥, £                 | Unlikely to be relevant in app reviews unless specifically analyzing   price-related content, usually captured in text rather than symbols. |
| Mathematical Symbols        | +, =, *, /, ^              | Rarely convey sentiment or meaning in reviews and can clutter   tokenization.                                                               |
| Logical/Programming Symbols | {}, [], <>, \|, \, ~, ;, : | Generally irrelevant unless for technical feedback; uncommon in reviews.                                                                    |
| Ampersand                   | &                          | Often shorthand for 'and,' normalization improves consistency without   altering meaning.                                                   |
| At Symbol                   | @                          | Primarily used for tagging or mentions; usually irrelevant in app   reviews.                                                                |
| Hashtag                     | #                          | Adds noise unless analyzing hashtags or trends; generally unnecessary in   reviews.                                                         |
| Percentage                  | %                          | Typically written as words in reviews (e.g., “80% battery” as 'eighty   percent battery'), unnecessary unless focusing on numbers.          |
| Underscore                  | _                          | Found in URLs or filenames, generally meaningless in app reviews.                                                                           |
| Pipe                        | \|                         | Used as a separator or in technical contexts, generally adds no value in   app reviews.                                                     |
| Backslash                   | \                          | Part of escape sequences or formatting, unnecessary for understanding   text content.                                                       |

##### Special Characters to Retain
These characters might carry semantic or emotional weight and could enhance analyses such as sentiment, emphasis, or intensity detection:

| Special   Character | Examples |                                                                    Rationale                                                                   |
|:-------------------:|:--------:|:----------------------------------------------------------------------------------------------------------------------------------------------:|
| Apostrophe          | '        | Important for contractions (e.g., “don’t”, “can’t”), which can affect   meaning if removed.                                                    |
| Quotation Marks     | "        | Useful for retaining structure when users quote something directly in   their review.                                                          |
| Parentheses         | ()       | Often used to add clarifying details or side comments in reviews. They   can add meaning that shouldn’t be stripped.                           |
| Hyphen              | -        | Important for words that are hyphenated or when users break up long   thoughts. It can also appear in numerical ranges (e.g., '5-10 minutes'). |


#### Punctuation
Punctuation presents more nuanced challenges, as it often affects sentence structure, tone, and emphasis. Standard punctuation such as periods, commas, exclamation points, and question marks will be kept. They are important for understanding sentence boundaries and emotional emphasis. Multiple punctuation marks (e.g., "!!!", "???") are also retained, as they often signify strong emotion or sentiment.
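
One way to express the remove/retain split is a single character class built from the "remove" table, leaving everything else (apostrophes, quotation marks, parentheses, hyphens, and standard punctuation) untouched. The sketch below assumes removed characters are replaced by a space; the names `REMOVE_CHARS` and `strip_special_characters` are illustrative only.

```python
import re

# Characters from the "Special Characters to Remove" table.
REMOVE_CHARS = "$€¥£+=*/^{}[]<>|\\~;:&@#%_"
REMOVE_PATTERN = re.compile("[" + re.escape(REMOVE_CHARS) + "]")


def strip_special_characters(text: str) -> str:
    """Replace listed special characters with a space and tidy the gaps."""
    text = REMOVE_PATTERN.sub(" ", text)
    # Collapse the runs of spaces left behind by the substitution.
    return re.sub(r" {2,}", " ", text).strip()


print(strip_special_characters("Costs $5 & it's 100% worth it (really) - wow!!!"))
```

Replacing with a space instead of an empty string avoids fusing neighbouring words (for example `fast&easy`), at the cost of a second pass to collapse whitespace.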

### Personally Identifiable Information (PII)
In this data cleaning phase, personally identifiable information (PII), such as emails, URLs, and phone numbers, will undergo masking to ensure privacy and compliance with ethical standards for data use. Emails, for instance, are highly sensitive and can reveal specific user identities or contact details, risking the exposure of personally sensitive information. To mitigate this, email addresses will be systematically replaced with the marker `[EMAIL]`. Similarly, URLs—often including specific domains or personal resources—may unintentionally disclose identifiable or private information. Masking URLs as `[URL]` prevents unintended data leakage while retaining content structure for analysis. Additionally, phone numbers, inherently identifiable and private, will be marked as `[PHONE]`. Given their nature as direct contact points, the masking of phone numbers is essential to uphold confidentiality and meet data privacy regulations.

These masking protocols enable comprehensive content analysis while upholding data privacy obligations and reducing risks of re-identification in sensitive datasets. This approach allows the dataset to retain its structural and contextual integrity, facilitating meaningful analysis without compromising user privacy.
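
The masking described above could be sketched with regular expressions. The patterns here are deliberately simplified assumptions (the phone pattern targets North American style numbers only) and are not the patterns used by the project; `mask_pii` is a hypothetical helper name.

```python
import re

# Simplified, illustrative patterns; real-world PII detection needs more care.
URL = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PHONE = re.compile(
    r"(?<!\w)(?:\+?\d{1,3}[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}(?!\w)"
)


def mask_pii(text: str) -> str:
    """Replace URLs, emails, and phone numbers with their markers."""
    # URLs go first so an address embedded in a link is not half-masked.
    text = URL.sub("[URL]", text)
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


print(mask_pii("Email jane.doe@example.com or call (555) 123-4567, see https://example.com/help"))
```

Because the URL pattern consumes everything up to the next whitespace, trailing punctuation attached to a link is masked along with it; that trade-off is acceptable for a sketch but may need refinement in practice.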

### Language
As part of the data quality assessment strategy, non-English app names and reviews will be systematically identified and removed to ensure linguistic consistency within the dataset. This process involves the application of advanced language detection models, which will analyze app names and review text to detect non-English content. Entries flagged as non-English will be excluded from the dataset, ensuring that only English-language data remains for analysis.

The rationale for this step lies in the need to maintain coherence in language-based tasks such as sentiment analysis, aspect-based sentiment analysis (ABSA), and emotion detection, all of which rely on clear, uniform input. By removing non-English content, we aim to prevent noise, misinterpretation, or inconsistencies that could undermine the accuracy of insights. This approach is designed to focus the analysis on the English-speaking market, aligning the dataset with the target audience and improving the overall relevance and quality of the findings.
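
A language check of this kind can be sketched independently of any particular detector. The notebook imports `fasttext` below, so the example assumes a fastText-style predictor that returns `(labels, probabilities)` with labels such as `__label__en`; whether the project uses fastText for this step, and the `threshold` parameter, are assumptions.

```python
from typing import Callable, Sequence, Tuple

# Shape of a fastText-style prediction: (labels, probabilities).
Prediction = Tuple[Sequence[str], Sequence[float]]


def is_english(
    text: str,
    predict: Callable[[str], Prediction],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the source
) -> bool:
    """Return True when the top predicted language is English."""
    # fastText predicts one line at a time, so newlines are flattened first.
    labels, probs = predict(text.replace("\n", " "))
    return bool(labels and labels[0] == "__label__en" and probs[0] >= threshold)


# Possible wiring, assuming fastText and its language identification model
# (lid.176.bin) are available locally; the path is a placeholder:
#   import fasttext
#   model = fasttext.load_model("lid.176.bin")
#   is_english("Great app, love it", model.predict)
```

Passing the predictor in as a callable keeps the check testable without the model file. Very short reviews tend to receive low-confidence predictions, so the threshold would need tuning against a reviewed sample.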

### Data Normalization
Standard NLP normalization includes key processes such as converting emoticons and emojis to text, correcting spelling, expanding abbreviations and acronyms, expanding contractions, and removing elongation. The following data quality analysis and cleaning strategies aim to preserve meaning, text richness, and nuance while leveraging the strengths of transformer models to achieve optimal performance. By focusing on essential elements, we ensure that unnecessary noise is removed without compromising the depth and context of the original content, allowing for more accurate and insightful analysis in downstream NLP tasks.

#### Emoticons and Emojis Conversion

Emoticons and emojis are frequently used in app reviews to express emotions, sentiments, or nuanced meanings that words alone may not fully capture. The challenge in handling these symbols lies in balancing the need to preserve their meaning with the ability of transformer models to interpret them effectively. While transformers, particularly models with advanced tokenizers such as BPE (Byte Pair Encoding) or WordPiece, are capable of recognizing emojis as individual tokens, there are trade-offs in terms of interpretability and model performance. Below, we outline two potential approaches: conversion of emojis to text and leaving them as-is, along with the rationale for each.

#### 1. **Convert Emoticons and Emojis to Text**
   **Approach**: In this strategy, emoticons and emojis are converted into their corresponding textual descriptions (e.g., "😊" becomes "smiling face" or "happy"). This process can be done using predefined emoji dictionaries that map each symbol to a meaningful word or phrase. 

   **Justification**:
   - **Improved Interpretability**: Converting emojis to text ensures that their emotional or symbolic meaning is explicitly captured, which may enhance sentiment analysis or emotion detection tasks. Models will then treat these symbols as regular words, improving semantic understanding.
   - **Preserving Sentiment**: Emojis often carry significant emotional weight, which can be missed if the model treats them as independent, isolated tokens. By converting them to text, we ensure that the full emotional context of the review is retained and understood by the model.
   - **Better Alignment with Text-Based Models**: Transformer models, particularly those trained on text data (e.g., BERT, RoBERTa), may perform better when they process complete words rather than unfamiliar symbols, as text-based tokens align with the model’s pretraining. This conversion provides uniform input for the model to interpret, reducing potential ambiguity.

   **Challenges**:
   - **Loss of Brevity and Flow**: In informal text like app reviews, converting "😊" to "smiling face" might disrupt the brevity or stylistic tone of the content. This change could affect the naturalness of user-generated content, especially in sentiment-heavy reviews.
   - **Limited Context in Text Conversion**: Some emojis have context-specific meanings (e.g., a heart emoji may indicate love, approval, or even sarcasm), and these meanings may not always be fully captured by a generic text replacement.

#### 2. **Leave Emojis and Emoticons As-Is**
   **Approach**: This strategy involves keeping emojis and emoticons in their original form and allowing the transformer model’s tokenizer to handle them natively. Tokenizers like BPE or WordPiece can break down emojis into subword units or treat them as individual tokens, based on the pretraining corpus.

   **Justification**:
   - **Capable Tokenizers**: Modern transformer models are designed to handle a wide variety of tokens, including emojis. Since these models are trained on large, diverse datasets that likely include emojis, they can recognize and process them without needing explicit conversion to text. For example, BERT can treat "😊" as a unique token, potentially learning its context from the surrounding text.
   - **Retaining Natural Language Flow**: In user-generated content like app reviews, leaving emojis intact preserves the natural flow and brevity of the language. This is particularly relevant for informal or sentiment-heavy reviews, where emojis are often integral to conveying tone or nuance.
   - **Model Pretraining**: Transformers pretrained on large-scale internet corpora (e.g., GPT models) may have already learned embeddings for commonly used emojis, allowing the model to understand the sentiment behind them without needing explicit conversion.

   **Challenges**:
   - **Ambiguity in Meaning**: While transformer models can process emojis, they may not always capture the full sentiment or emotion conveyed by them, particularly if an emoji has multiple or context-specific meanings. This can reduce the accuracy of sentiment or emotion detection tasks.
   - **Inconsistent Handling of Emojis**: Since not all emojis carry clear or universal meanings, models might struggle with low-frequency or niche emojis that were less represented in the training data.

### Recommended Approach: **As-Is for Transformer Models**
Given the advanced capabilities of modern transformer tokenizers and the likelihood that app reviews will contain common emojis that transformers have encountered during pretraining, leaving emojis **as-is** can be a more efficient and natural approach. This preserves the integrity of user-generated content, aligns well with the model’s existing knowledge, and allows the transformer’s contextual understanding to interpret these symbols effectively.

The **as-is approach** exploits the strength of transformers to process a wide variety of tokens without adding additional complexity through text conversion. This is particularly important when dealing with informal text such as app reviews, where brevity and emotional tone are often conveyed through symbols rather than words. While converting to text might slightly improve interpretability in niche cases, the cost in terms of disrupting the natural flow of language and potentially introducing noise outweighs the benefits for most applications.

However, for specialized tasks such as deep sentiment analysis, where capturing fine-grained emotional nuance is critical, a hybrid approach could be considered. In such cases, selectively converting only sentimentally significant emojis might enhance performance while keeping the overall text intact.
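
The hybrid approach can be illustrated with a small lookup table: only emojis present in the table are converted, and everything else passes through unchanged. The mapping below is a hypothetical four-entry example, not a curated emoji lexicon, and `convert_selected_emojis` is an illustrative name.

```python
# Hypothetical mapping of sentiment-bearing emojis to text descriptions.
EMOJI_TO_TEXT = {
    "😊": "smiling face",
    "😡": "angry face",
    "👍": "thumbs up",
    "💔": "broken heart",
}


def convert_selected_emojis(text: str, mapping: dict = EMOJI_TO_TEXT) -> str:
    """Convert only the emojis in `mapping`; leave all others as-is."""
    for symbol, description in mapping.items():
        # Pad with spaces so the description does not fuse with nearby words.
        text = text.replace(symbol, f" {description} ")
    # Collapse the extra spaces introduced by the padding.
    return " ".join(text.split())


print(convert_selected_emojis("Love this app😊 🚀"))
```

Passing an empty mapping reproduces the as-is strategy, so the same function covers both options.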


In this Data Quality Assessment (DQA), we will execute a series of checks designed to uncover noise, inconsistencies, or anomalies within the dataset:

1. **App ID/App Name Consistency**: We ensure that `app_id` and `app_name` align. A prior analysis revealed 14 more `app_id`s than `app_name`s, which will require further investigation.
2. **Duplicate Review IDs**: We identified 117 duplicate review `id`s. These entries must be flagged for closer examination.
3. **Duplicate Review Content**: Approximately 14% of the reviews were found to be duplicates. These reviews need to be reviewed for potential redundancy or noise.
4. **Review Length Anomalies**: Zero-length reviews will be removed. Extremely long reviews will be inspected for signs of repetition or low-quality content.
5. **Non-English Text**: Reviews and app names written in languages other than English may not be relevant to our analysis. We will identify and decide on appropriate handling for these entries.
6. **Inappropriate Content**: Content such as URLs, phone numbers, email addresses, or other personally identifiable information will be treated as spam and either removed or masked.
7. **Emojis**: Emojis can add valuable context in some cases but may introduce noise in others. We will assess whether to retain, remove, or convert emojis into textual equivalents.
8. **Formatting Anomalies**: We will flag any entries with excessive whitespace, HTML, or other markup artifacts that could interfere with analysis.

**Note on Exclusions**: While this assessment focuses on structural and formatting issues, certain aspects like profanity and spelling mistakes are not included in this phase. These may be addressed during the text quality assessment, which will focus on content and linguistic features.

By performing these data quality checks, we identify and mark anomalies that will be addressed in the subsequent data cleaning stage. This ensures that our dataset is flagged for inconsistencies and potential issues, laying the groundwork for clean, reliable data. Resolving these issues will enhance the accuracy of our analyses and lead to more robust conclusions in the later phases of our data processing pipeline.
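
To make the "flag now, clean later" idea concrete, the sketch below marks a few of these anomalies as boolean columns and then counts them. It is an assumption-laden illustration: the flag names mirror those in the pipeline output, but the detection rules, the `id` and `content` field names as dictionary keys, and the choice to flag every copy of a duplicate are ours, not the `DQAStage` implementation.

```python
import re
from collections import Counter


def flag_reviews(reviews):
    """Return copies of the review dicts with boolean dqa_* flags added."""
    id_counts = Counter(r["id"] for r in reviews)
    content_counts = Counter(r["content"] for r in reviews)
    flagged = []
    for r in reviews:
        content = r["content"]
        flagged.append({
            **r,
            # Every copy of a duplicate is flagged, not just the later ones.
            "dqa_identical_review_id": id_counts[r["id"]] > 1,
            "dqa_identical_review_content": content_counts[content] > 1,
            "dqa_contains_excessive_whitespace": bool(re.search(r"\s{2,}", content)),
            "dqa_contains_non_ascii_chars": not content.isascii(),
            "dqa_contains_HTML_chars": bool(re.search(r"&(?:[A-Za-z]+|#\d+);", content)),
        })
    return flagged


def summarize(flagged):
    """Count and percentage of flagged rows per check, largest first."""
    if not flagged:
        return []
    total = len(flagged)
    flags = [key for key in flagged[0] if key.startswith("dqa_")]
    counts = sorted(
        ((flag, sum(row[flag] for row in flagged)) for flag in flags),
        key=lambda item: item[1],
        reverse=True,
    )
    return [(flag, n, round(100 * n / total, 2)) for flag, n in counts]
```

Nothing is removed at this point; the flags only mark rows for the cleaning stage, which keeps the assessment itself reversible.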

## Import Libraries

In [2]:
import fasttext

from discover.analysis.dqa import DataQualityAnalysis
from discover.container import DiscoverContainer
from discover.infra.config.flow import FlowConfigReader
from discover.flow.data_prep.dqa.stage import DQAStage

fasttext.FastText.eprint = lambda x: None

## Dependency Container

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.data_prep.stage",
        "discover.flow.data_prep.dqa",
        "discover.analysis.base",
    ],
)

## Data Quality Assessment Pipeline
The data quality assessment process conducts the data quality checks, marking the observations that require attention. We begin with the configuration, then construct and run the DQAStage pipeline.

In [4]:
# Obtain the configuration
reader = FlowConfigReader()
config = reader.get_config("phases", namespace=False)
stage_config = config["dataprep"]["stages"]["dqa"]

# Build and run the Data Quality Assessment stage
stage = DQAStage.build(stage_config=stage_config, force=True)
asset_id = stage.run()

[10/24/2024 04:39:34 AM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [_remove_dataset_file_by_filepath] : Removed dataset file at workspace/dev/dataset/01_dataprep/appvocai_discover-01_dataprep-02_dqa-review-dataset.parquet from repository.
[10/24/2024 04:39:34 AM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [remove] : Removed dataset dataset-dev-dataprep-dqa-review from the repository.




#                         Data Quality Assessment Stage                          #



                              DetectDuplicateTask                               
                              -------------------                               
                          Start Datetime | Thu, 24 Oct 2024 04:39:34
                       Complete Datetime | Thu, 24 Oct 2024 04:39:34
                                 Runtime | 0.03 seconds
                               DQA Check | dqa_identical_rows
                      Anomalies Detected | 0 (0.0%) of 59021 records


                              DetectDuplicateTask                               
                              -------------------                               
                          Start Datetime | Thu, 24 Oct 2024 04:39:34
                       Complete Datetime | Thu, 24 Oct 2024 04:39:34
                                 Runtime | 0.03 seconds
                               DQA Check | dqa_identical_review_id


## Data Quality Impressions
Let's get a summary of the data quality issues by type.

In [5]:
dqa = DataQualityAnalysis()
dqa.summarize()

| DQA check                             |     n |         % |
|:--------------------------------------|------:|----------:|
| dqa_contains_non_ascii_chars          | 27252 | 46.173396 |
| dqa_contains_excessive_whitespace     |  7540 | 12.775114 |
| dqa_identical_review_content          |  3923 |  6.646787 |
| dqa_has_emoji                         |  3490 |   5.91315 |
| dqa_non_english_review                |  2276 |  3.856255 |
| dqa_non_english_app_name              |  1402 |  2.375426 |
| dqa_contains_excessive_numbers        |    41 |  0.069467 |
| dqa_contains_phone_number             |    26 |  0.044052 |
| dqa_contains_inconsistent_app_id_name |     8 |  0.013554 |
| dqa_contains_HTML_chars               |     5 |  0.008472 |


## Data Preprocessing Approach

The data quality assessment (DQA) conducted on the AppVoCAI dataset revealed several key issues that may require some treatment during the data cleaning process. That said, our review of the DQA results and our treatment of these anomalies must be contextualized within an overall data preprocessing approach.

Downstream tasks such as sentiment analysis, classification, text summarization, and generation will leverage **transformer-based models** (like BERT, RoBERTa, or GPT), which have been shown to be quite **robust** in handling many kinds of data anomalies and variations. Unlike traditional Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Convolutional Neural Networks (CNNs), which rely heavily on extensive preprocessing, transformers represent an evolution in language comprehension: they can capture relationships across distant positions in a sequence and infer intricate nuances in language without many of the traditional natural language processing (NLP) interventions. Here's why.

1. **Contextual Understanding**:
   - Transformer models leverage the **context** in which words appear. So, even if a word like "goooood" appears, the model can usually infer its meaning based on the surrounding text.
   - Models like BERT or GPT can often recognize that "goooood" is just an elongated form of "good" because of how they understand the sequence of tokens.

2. **Subword Tokenization**:
   - Most transformer models use **subword tokenizers** like WordPiece (used by BERT) or Byte-Pair Encoding (used by GPT), which break down rare or unknown words into smaller subword units. For example:
     - "goooood" might be split into pieces such as "go" + "##oo" + "##ood".
     - "loooove" might be split into pieces such as "lo" + "##oo" + "##ove".
   - This allows the model to handle rare or misspelled words by breaking them down into known subword pieces, which are still interpretable by the model.
   
3. **Pretraining Robustness**:
   - Transformer models are pretrained on *massive* amounts of diverse text data, which often includes typos, misspellings, and casual language (e.g., from web data, social media). This means they've likely been exposed to, and learned from, similar patterns, making them more robust to these irregularities.
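
The subword idea in point 2 can be demonstrated with a toy version of WordPiece's greedy longest-match-first lookup. The vocabulary below is hypothetical and tiny; a real tokenizer's vocabulary is learned from data, so its actual splits for these words may differ.

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB = {"good", "go", "love", "lo", "##oo", "##ood", "##ove", "##o", "##d"}


def wordpiece_tokenize(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No known piece covers this position: the whole word is unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


for word in ("good", "goooood", "loooove"):
    print(word, "->", wordpiece_tokenize(word))
```

With this vocabulary the elongated words fall back to smaller known pieces instead of becoming unknown tokens, which is the property the argument above relies on.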

In the transformer modeling paradigm, extracting rich, detailed sentiment, emotion, and intensity from customer reviews is best achieved through a conservative, light-touch approach to data preprocessing. This approach focuses on essential treatments—such as removing personally identifiable information (PII), correcting data errors, and converting emojis to text—while preserving the original text as much as possible.

Certain preprocessing techniques, such as reducing repeated characters, or spelling correction, might unintentionally remove emphasis or change the tone of the review. For example, "goooood" with the extra "o"s might indicate strong positivity or excitement, whereas just "good" might seem more neutral. Transformer models are generally capable of understanding the difference in sentiment or tone conveyed by elongated words. Character repetitions often convey **emotion, emphasis, or informality**, and this is part of the natural variation in user-generated content. Removing or normalizing them might lose some of this nuance.


This light-touch approach has two further advantages:

- **Efficient Processing**: Not over-processing the text avoids unnecessary complexity in the preprocessing pipeline, reducing the risk of introducing errors or losing information.
- **Transformers are Robust**: Since transformers handle these variations well, there's less pressure to "fix" these issues, especially when the primary goal is to capture the overall sentiment or meaning in the text.

### When to Preprocess:
There are cases where preprocessing may still be beneficial, such as:
- **Strict Standardization**: When building interpretable models or producing standardized outputs for downstream use cases (e.g., generating reports), it may be preferable to clean or normalize the text.
- **Highly Noisy Data**: If the dataset contains extreme noise or gibberish-like text (e.g., excessive repetition of random letters), it might still be worth cleaning those cases up.

### Conclusion:
For transformer models, there is a strong case for leaving misspellings and repeated characters as they are, since these models are generally robust to such variations. The context-awareness and tokenization strategies used by transformers can often deal with non-standard text better than more rigid traditional models.

### Suggested Approach:
- **Minimal Preprocessing**: Keep the text as close to the original form as possible, doing only necessary cleaning like removing URLs, emails, or extreme noise.
- **Let Transformers Do the Work**: Trust that transformer-based models can handle elongated words, typos, and misspellings through contextual understanding and subword tokenization.



## Data Quality Review
The following sections will present the anomalous observations for review prior to embarking on the cleaning stage.

### Anomalies to be Removed

#### Duplicate Review Content

In [6]:
summary, data = dqa.get_duplicate_review_content()
summary

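|     | content             | count |
|----:|:--------------------|------:|
| 151 | Good                |   236 |
| 313 | Love it             |   133 |
| 168 | Great               |   130 |
| 172 | Great app           |   114 |
| 230 | I love it           |    76 |
| ... | ...                 |   ... |
| 279 | It’s amazing!       |     2 |
| 273 | It’s Ight           |     2 |
| 270 | Its alright         |     2 |
| 269 | It's so easy to use |     2 |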


The data quality assessment (DQA) conducted on the AppVoCAI dataset revealed several key issues that may require some treatment during the data cleaning process. Let's take a look.

### Observations to Be Removed:
These cases represent data quality issues where the entire observation will be removed from the dataset.

- **Duplicate Review Content**: 13.93% of the reviews are duplicates and will be removed to ensure the dataset contains only unique user feedback.
- **Duplicate Review IDs**: A small fraction (0.0005%) of reviews with duplicate IDs will be removed to prevent inconsistencies.
- **Missing Reviews**: Any reviews flagged as missing (0.000009%) will be dropped from the dataset.
- **Inconsistent App ID-Name Pairs**: Inconsistent app ID-name pairs (0.0001%) will be removed to maintain consistency.
- **Non-English Reviews**: 3.20% of the reviews are non-English and will be removed to focus on English content.
- **Non-English App Names**: 2.76% of app names are non-English and will be removed.
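
A removal pass over flagged rows might look like the sketch below. It assumes rows are dictionaries carrying the boolean `dqa_*` flags produced by the assessment, and that one copy of each duplicated review is kept; `remove_flagged` and the exact rule set are illustrative, not the project's cleaning stage.

```python
# Flags that disqualify a row outright (names taken from the DQA summary).
REMOVAL_FLAGS = (
    "dqa_non_english_review",
    "dqa_non_english_app_name",
    "dqa_contains_inconsistent_app_id_name",
)


def remove_flagged(rows, removal_flags=REMOVAL_FLAGS):
    """Drop disqualified, missing, and duplicate rows, keeping first copies."""
    seen_ids, seen_content = set(), set()
    kept = []
    for row in rows:
        if any(row.get(flag, False) for flag in removal_flags):
            continue
        if not row["content"].strip():
            # Missing (empty or whitespace-only) review text.
            continue
        if row["id"] in seen_ids or row["content"] in seen_content:
            # A later copy of an id or of review text that was already kept.
            continue
        seen_ids.add(row["id"])
        seen_content.add(row["content"])
        kept.append(row)
    return kept
```

Keeping the first occurrence preserves one instance of common short reviews such as "Good", which matches the stated aim of retaining unique user feedback.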

### Observations to Be Cleaned (Replacing Sequences):
These issues involve replacing specific problematic sequences with predefined values, while keeping the rest of the observation intact.

- **Non-ASCII Characters (27.88%)**: Non-ASCII characters will be replaced with an empty string to clean the text.
- **Excessive Whitespace (12.97%)**: Whitespace sequences will be replaced with a single space to maintain proper formatting.
- **Emojis (4.90%)**: Emojis will be converted to their text equivalents (e.g., 😊 becomes "smiling face").
- **Phone Numbers (0.03%)**: Phone numbers will be replaced with "[PHONE]" to anonymize personal information.
- **Control Characters (0.02%)**: Control characters will be removed by replacing them with an empty string to avoid formatting issues.
- **HTML Characters (0.01%)**: HTML entities will be replaced with an empty string to strip unnecessary formatting.
- **URLs (0.0007%)**: URLs will be replaced with "[URL]" to anonymize web addresses.
- **Emails (0.0001%)**: Emails will be replaced with "[EMAIL]" to protect personal information.
- **Excessive Numbers (0.04%)**: Excessive numbers will be replaced with "[NUMBER]" to normalize content where numbers dominate the text.

### Risk Mitigation
This approach distinguishes observations that will be removed entirely from those that will undergo targeted sequence replacements. Separating the two treatments preserves both data integrity and readability without losing more information than necessary. However, before taking the irreversible cleaning steps, we'll review a sample of the anomalies to confirm that the flagged data is truly problematic.
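
Drawing a reproducible sample of flagged rows for manual review can be sketched as follows. The helper name `sample_flagged`, the default sample size, and the seed are assumptions chosen for the example.

```python
import random


def sample_flagged(rows, flag, n=5, seed=42):
    """Return up to `n` randomly chosen rows whose `flag` is set."""
    flagged = [row for row in rows if row.get(flag)]
    # A dedicated, seeded generator makes the sample repeatable across runs.
    rng = random.Random(seed)
    return rng.sample(flagged, min(n, len(flagged)))
```

Fixing the seed means reviewers looking at the same dataset see the same examples, which makes any disagreement about a flag easier to trace.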