In [1]:
import warnings

warnings.filterwarnings("ignore")
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

# Data Quality Assessment
In the previous section, we provided an overview of the AppVoCAI dataset, including its structure, features, and distributions. The goals of this stage are to:

1. identify any unwanted artifacts in the dataset that could compromise the integrity and accuracy of our downstream modeling efforts, and
2. design the data cleaning pipeline and text cleaning interventions to remove or otherwise treat these artifacts.

It is important to clarify that this stage is focused on **data quality assessment** and **text cleaning**, rather than data preprocessing. While preprocessing tasks such as tokenization, lemmatization, stopword removal, and text normalization are crucial steps in preparing the dataset for model training, they are not the focus here. Instead, our goal is to address **anomalies**—such as duplicates, invalid characters, non-ASCII text, and other artifacts—that could compromise the integrity and reliability of downstream analysis. By ensuring the dataset is clean and free of unwanted noise, we lay the foundation for accurate, meaningful preprocessing and modeling in later stages.

## Data Quality Context
Downstream tasks such as sentiment analysis, classification, text summarization, and generation will leverage transformer-based models (like BERT, RoBERTa, and GPT), which have proven robust to many data anomalies and linguistic variations. Unlike recurrent architectures such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, which process tokens one at a time, and Convolutional Neural Networks (CNNs), which aggregate context through local windows, transformers attend to the entire sequence in parallel. This allows them to capture long-range dependencies and subtle contextual relationships in language.

However, research demonstrates that preprocessing can still significantly improve the performance of transformer models {cite}`siinoTextPreprocessingStill2024`. This synergy between preprocessing and model architecture suggests that a balanced approach is ideal. Our data preparation methodology focuses on addressing critical data quality issues that could undermine the integrity of downstream analyses, while preserving the text as close to its original form as possible. By adopting this conservative approach, we tackle key issues without sacrificing the nuance and representativeness of the data, ensuring the models are presented with rich, authentic input.

## Data Quality Approach
Although our data quality and cleaning approach draws on many of the preprocessing techniques commonly found in the literature {cite}`symeonidisComparativeEvaluationPreprocessing2018`, three questions motivated the data quality design process.

1. What’s essential to remove, and what can be left intact to preserve meaning?
2. How do we best preserve text richness and nuance?
3. How can the data cleaning process best exploit model strengths towards optimal model performance?

These Key Evaluation Questions (KEQs) crystallized our approach, which balances data quality with model sophistication.

---

### Noise Removal Overview

In preparing the app review dataset for analysis, removing noise is crucial to improve the quality of the text while preserving valuable information. Noise refers to characters or tokens that either distort the content or contribute little to understanding sentiment, intent, or behavior. Our approach to noise removal starts with the simpler and more common issues, followed by more nuanced decisions around special characters and punctuation.

#### Encoding and Common Noise
The first set of noise to address involves issues that are relatively easy to detect and remove, but can have a big impact on the clarity of the dataset:

- **Encoding and Control Characters:** Often appearing as artifacts from different text encoding formats, these include characters that serve formatting or invisible functions in the text. They will be removed.
- **Accents and Diacritics:** These will be normalized (e.g., converting `é` to `e`) to reduce unnecessary variation in the text.
- **HTML Characters:** Common in scraped data, characters such as `&amp;` and `&#39;` will be removed as they do not add any meaningful content.
- **Line Breaks and Excessive Whitespace:** Multiple line breaks and extra spaces will be condensed into a single space to ensure consistency and readability.
- **Non-ASCII Characters:** While some non-ASCII characters (like certain symbols or non-English characters) may carry meaning, most introduce unnecessary complexity and will be removed. 
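
The treatments listed above can be sketched with the Python standard library alone. This is a minimal illustration rather than the pipeline's actual implementation: the function name `clean_common_noise` and the order of the steps are assumptions made for the example.

```python
import html  # imported for reference; entities are removed, not decoded, per the text above
import re
import unicodedata


def clean_common_noise(text: str) -> str:
    """Illustrative cleaner for HTML entities, accents, control characters, and whitespace."""
    # Remove HTML entities such as &amp; and &#39; (replaced by a space so words do not merge).
    text = re.sub(r"&(?:[A-Za-z]+|#\d+|#x[0-9A-Fa-f]+);", " ", text)
    # Decompose accented characters (é becomes e plus a combining mark), then drop the marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Drop control (Cc) and format (Cf) characters, keeping newlines, carriage returns,
    # and tabs so that the whitespace step below can collapse them.
    text = "".join(
        ch for ch in text
        if ch in "\n\r\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    # Remove any remaining non-ASCII characters.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Condense line breaks and repeated spaces into a single space.
    return re.sub(r"\s+", " ", text).strip()


print(clean_common_noise("Caf\u00e9 &amp; bar\x00\n\n  great\u200b!"))
```

Normalizing accents before stripping non-ASCII characters matters here: in the reverse order, `é` would be deleted outright instead of being reduced to `e`.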

#### Special Characters
The importance of a special character depends on its context in app reviews. To handle special characters systematically, we categorize them into two groups: **those that should be removed** (because they add noise or carry no useful information) and **those that should be retained** (because they may contribute to the meaning or sentiment of the text).

##### 1. Special Characters to Remove
These characters generally do not add meaningful content to app reviews and are mostly used for formatting or random emphasis:

- **Currency symbols**: `$`, `€`, `¥`, `£`, etc.  
  *Rationale*: These are unlikely to be relevant in app reviews outside of price-related analysis, and price-related content is typically expressed in words rather than symbols.
  
- **Mathematical symbols**: `+`, `=`, `*`, `/`, `^`, etc.  
  *Rationale*: These symbols rarely convey sentiment or meaning in reviews and can clutter tokenization.

- **Logical/Programming symbols**: `{}`, `[]`, `<>`, `|`, `\`, `~`, `;`, `:`  
  *Rationale*: Generally irrelevant to the content of app reviews, unless it's technical feedback, in which case specific symbols might appear (but still uncommon).

- **Ampersand (`&`)**  
  *Rationale*: Often used as shorthand for "and", which can be normalized to improve consistency without affecting the meaning.

- **At symbol (`@`)**  
  *Rationale*: Mostly used for tagging or social media mentions, which are unlikely to be relevant in app reviews.

- **Hashtag (`#`)**  
  *Rationale*: Hashtags and trends are not analyzed here, so the `#` symbol is generally noise in reviews and can be removed.

- **Percentage (`%`)**  
  *Rationale*: Percentages can also be written in words (e.g., "80% battery" as "eighty percent battery"). Unless numeric content is analyzed explicitly, this symbol can be removed.

- **Underscore (`_`)**  
  *Rationale*: Used in URLs, file names, or variables but not meaningful in app reviews.

- **Pipe (`|`)**  
  *Rationale*: Often used as a separator or in technical contexts, it generally adds no value in app reviews.

- **Backslash (`\`)**  
  *Rationale*: Often part of escape sequences or formatting and not necessary for understanding text content.

##### 2. Special Characters to Retain
These characters might carry semantic or emotional weight and could enhance analyses such as sentiment, emphasis, or intensity detection:

- **Exclamation mark (`!`)**  
  *Rationale*: Indicates excitement, urgency, or strong emotions, often helpful for sentiment or emotion analysis.

- **Question mark (`?`)**  
  *Rationale*: Could imply uncertainty, confusion, or a rhetorical question, useful for sentiment analysis.

- **Apostrophe (`'`)**  
  *Rationale*: Important for contractions (e.g., “don’t”, “can’t”), which can affect meaning if removed.

- **Quotation marks (`"`, `'`)**  
  *Rationale*: Useful for retaining structure when users quote something directly in their review.

- **Parentheses (`()`)**  
  *Rationale*: Often used to add clarifying details or side comments in reviews. They can add meaning that shouldn’t be stripped.

- **Period (`.`)**  
  *Rationale*: Useful for sentence boundaries and maintaining clarity in text structure. Removing it can merge sentences, causing misinterpretation.

- **Comma (`,`)**  
  *Rationale*: Helps to structure sentences and maintain readability.

- **Hyphen (`-`)**  
  *Rationale*: Important for words that are hyphenated or when users break up long thoughts. It can also appear in numerical ranges (e.g., "5-10 minutes").

##### Summary

###### To Remove
- `$`, `€`, `¥`, `£` (Currency)
- `+`, `=`, `*`, `/`, `^`, `%` (Math/Percent)
- `{}`, `[]`, `<>`, `|`, `\`, `~`, `;`, `:` (Programming/Logical)
- `&`, `@`, `#`, `_` (Other)

###### To Retain
- `!`, `?`, `'`, `"`, `()`, `.`, `,`, `-`

- **Retained Characters:** Certain special characters, such as **ellipsis (`...`)**, will be retained. Ellipsis often represents hesitation, emphasis, or unfinished thoughts, contributing to the tone and meaning of the review.
- **Removed Characters:** Special symbols (e.g., `@`, `#`, `^`) that do not contribute to the meaning of the review will be removed. These are considered noise unless they serve a specific function (such as tagging a user or keyword), which is uncommon in this dataset.
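
As a rough sketch of how the remove and retain lists could be applied, the following uses a single regular-expression character class. The constant `REMOVE_CHARS` is transcribed from the list above, the function name is hypothetical, and replacing `&` with the word "and" follows the normalization suggested in the ampersand rationale.

```python
import re

# Characters from the "to remove" list above (the ampersand is handled separately).
REMOVE_CHARS = "$€¥£+=*/^%{}[]<>|\\~;:@#_"
_REMOVE_PATTERN = re.compile("[" + re.escape(REMOVE_CHARS) + "]")


def strip_special_characters(text: str) -> str:
    """Remove noisy symbols while leaving ! ? ' " ( ) . , - and ellipses untouched."""
    # Normalize the ampersand to the word "and" rather than deleting it.
    text = re.sub(r"\s*&\s*", " and ", text)
    # Replace each unwanted symbol with a space so adjacent words do not merge.
    text = _REMOVE_PATTERN.sub(" ", text)
    # Collapse whitespace introduced by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


print(strip_special_characters("Love it!!! Costs $5 & it's #1 (really)... 5-10 min"))
```

Because the pattern lists only the characters to remove, anything not named in it, including repeated punctuation and the ellipsis, passes through unchanged.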


### Punctuation

Punctuation presents more nuanced challenges, as it often affects sentence structure, tone, and emphasis:

- **Retained Punctuation:** Standard punctuation such as periods, commas, exclamation points, and question marks will be kept. They are important for understanding sentence boundaries and emotional emphasis. Multiple punctuation marks (e.g., "!!!", "???") are also retained, as they often signify strong emotion or sentiment.
- **Excessive Numbers:** Any sequences of numbers deemed excessive or irrelevant will be replaced with the token `[NUMBER]` to maintain some context without cluttering the text with digits that have little meaning.
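
The text does not define when a number sequence counts as excessive. The sketch below assumes, purely for illustration, that a run of seven or more consecutive digits qualifies; the threshold and the function name are hypothetical.

```python
import re

# Assumption: seven or more consecutive digits counts as "excessive".
EXCESSIVE_DIGITS = re.compile(r"\d{7,}")


def mask_excessive_numbers(text: str, token: str = "[NUMBER]") -> str:
    """Replace long digit runs with a placeholder; short numbers and punctuation are kept."""
    return EXCESSIVE_DIGITS.sub(token, text)


print(mask_excessive_numbers("Order 123456789012 arrived in 3 days!!!"))
```

Short numbers such as ratings, durations, and version numbers stay intact, and repeated punctuation is left alone, in line with the retention rule above.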

---

### Rationale

Our approach to noise removal prioritizes tackling the simplest and most disruptive forms of noise first (encoding, control characters, and whitespace), followed by more nuanced decisions around special characters and punctuation. This ensures that the core structure of the text is preserved, while reducing irrelevant noise that might hinder further analysis.

---








The dataset overview confirmed the absence of null values and the validity of key variables such as ratings and review dates. 

In this Data Quality Assessment (DQA), we will execute a series of checks designed to uncover noise, inconsistencies, or anomalies within the dataset:

1. **App ID/App Name Consistency**: We ensure that `app_id` and `app_name` align. A prior analysis revealed 14 more `app_id`s than `app_name`s, which will require further investigation.
2. **Duplicate Review IDs**: We identified 117 duplicate review `id`s. These entries must be flagged for closer examination.
3. **Duplicate Review Content**: Approximately 14% of the reviews were found to be duplicates. These reviews need to be reviewed for potential redundancy or noise.
4. **Review Length Anomalies**: Zero-length reviews will be removed. Extremely long reviews will be inspected for signs of repetition or low-quality content.
5. **Non-English Text**: Reviews and app names written in languages other than English may not be relevant to our analysis. We will identify these entries and decide on appropriate handling.
6. **Inappropriate Content**: Content such as URLs, phone numbers, email addresses, or other personally identifiable information will be treated as spam and either removed or masked.
7. **Emojis**: Emojis can add valuable context in some cases but may introduce noise in others. We will assess whether to retain, remove, or convert emojis into textual equivalents.
8. **Formatting Anomalies**: We will flag any entries with excessive whitespace, HTML, or other markup artifacts that could interfere with analysis.

**Note on Exclusions**: While this assessment focuses on structural and formatting issues, certain aspects like profanity and spelling mistakes are not included in this phase. These may be addressed during the text quality assessment, which will focus on content and linguistic features.

By performing these data quality checks, we identify and mark anomalies that will be addressed in the subsequent data cleaning stage. This ensures that our dataset is flagged for inconsistencies and potential issues, laying the groundwork for clean, reliable data. Resolving these issues will enhance the accuracy of our analyses and lead to more robust conclusions in the later phases of our data processing pipeline.
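
To illustrate the "flag now, clean later" idea, here is a small standard-library sketch that adds boolean `dqa_*` indicators to each review without modifying the text. It is not the `DQAStage` implementation: the function name, the record layout (dicts with `id` and `content` keys), and the `dqa_contains_url` flag name are assumptions, and the other flag names simply mirror the check names reported by the pipeline.

```python
import re
from collections import Counter

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def flag_reviews(reviews):
    """Return copies of the review dicts with boolean dqa_* flags added."""
    id_counts = Counter(review["id"] for review in reviews)
    content_counts = Counter(review["content"] for review in reviews)
    flagged = []
    for review in reviews:
        content = review["content"]
        flagged.append(
            {
                **review,
                # Every copy of a duplicated id or text is flagged, including the first.
                "dqa_identical_review_id": id_counts[review["id"]] > 1,
                "dqa_identical_review_content": content_counts[content] > 1,
                "dqa_contains_non_ascii_chars": not content.isascii(),
                "dqa_contains_excessive_whitespace": bool(re.search(r"\s{2,}", content)),
                "dqa_contains_url": bool(URL_PATTERN.search(content)),
            }
        )
    return flagged


sample = [
    {"id": "1", "content": "Good"},
    {"id": "2", "content": "Good"},
    {"id": "2", "content": "Très  bien, see www.example.com"},
]
for row in flag_reviews(sample):
    print(row)
```

Keeping detection separate from treatment means each flag can be reviewed, counted, and sampled before any irreversible change is made to the data.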

## Import Libraries

In [2]:
import fasttext

from discover.analysis.dqa import DataQualityAnalysis
from discover.container import DiscoverContainer
from discover.infra.config.flow import FlowConfigReader
from discover.flow.data_prep.dqa.stage import DQAStage

fasttext.FastText.eprint = lambda x: None

## Dependency Container

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.data_prep.stage",
        "discover.flow.data_prep.dqa",
        "discover.analysis.base",
    ],
)

## Data Quality Assessment Pipeline
The data quality assessment process conducts the data quality checks, marking the observations that require attention. We begin with the configuration, then construct and run the DQAStage pipeline.

In [4]:
# Obtain the configuration
reader = FlowConfigReader()
config = reader.get_config("phases", namespace=False)
stage_config = config["dataprep"]["stages"]["dqa"]

# Build and run the Data Quality Assessment stage
stage = DQAStage.build(stage_config=stage_config, force=True)
asset_id = stage.run()

[10/24/2024 04:39:34 AM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [_remove_dataset_file_by_filepath] : Removed dataset file at workspace/dev/dataset/01_dataprep/appvocai_discover-01_dataprep-02_dqa-review-dataset.parquet from repository.
[10/24/2024 04:39:34 AM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [remove] : Removed dataset dataset-dev-dataprep-dqa-review from the repository.




#                         Data Quality Assessment Stage                          #



                              DetectDuplicateTask                               
                              -------------------                               
                          Start Datetime | Thu, 24 Oct 2024 04:39:34
                       Complete Datetime | Thu, 24 Oct 2024 04:39:34
                                 Runtime | 0.03 seconds
                               DQA Check | dqa_identical_rows
                      Anomalies Detected | 0 (0.0%) of 59021 records


                              DetectDuplicateTask                               
                              -------------------                               
                          Start Datetime | Thu, 24 Oct 2024 04:39:34
                       Complete Datetime | Thu, 24 Oct 2024 04:39:34
                                 Runtime | 0.03 seconds
                               DQA Check | dqa_identical_review_id


## Data Quality Impressions
Let's get a summary of the data quality issues by type.

In [5]:
dqa = DataQualityAnalysis()
dqa.summarize()

| DQA check | n | % |
|---|---|---|
| dqa_contains_non_ascii_chars | 27252 | 46.173396 |
| dqa_contains_excessive_whitespace | 7540 | 12.775114 |
| dqa_identical_review_content | 3923 | 6.646787 |
| dqa_has_emoji | 3490 | 5.91315 |
| dqa_non_english_review | 2276 | 3.856255 |
| dqa_non_english_app_name | 1402 | 2.375426 |
| dqa_contains_excessive_numbers | 41 | 0.069467 |
| dqa_contains_phone_number | 26 | 0.044052 |
| dqa_contains_inconsistent_app_id_name | 8 | 0.013554 |
| dqa_contains_HTML_chars | 5 | 0.008472 |
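
The `%` column in this summary is the flagged count divided by the 59021 records reported in the stage output. A quick check of that arithmetic, using a hypothetical helper named `anomaly_rate`:

```python
TOTAL_RECORDS = 59021  # record count reported in the DQA stage output


def anomaly_rate(n_flagged: int, total: int = TOTAL_RECORDS) -> float:
    """Share of records flagged by a check, expressed as a percentage."""
    return 100.0 * n_flagged / total


print(round(anomaly_rate(27252), 2))  # non-ASCII characters → 46.17
print(round(anomaly_rate(7540), 2))   # excessive whitespace → 12.78
```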


## Data Preprocessing Approach

The data quality assessment (DQA) conducted on the AppVoCAI dataset revealed several issues that require treatment during data cleaning. Both our review of the DQA results and our treatment of these anomalies must be grounded in an overall data preprocessing approach.

Downstream tasks such as sentiment analysis, classification, text summarization, and generation will leverage **transformer-based models** (like BERT, RoBERTa, or GPT), which have been shown to be quite **robust** to many kinds of data anomalies and variations. Unlike pipelines built around traditional Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Convolutional Neural Networks (CNNs), which typically rely on extensive preprocessing, transformers can capture relationships across distant parts of a sequence and infer nuances in language without many of the traditional natural language processing (NLP) interventions. Three properties explain this robustness.

1. **Contextual Understanding**:
   - Transformer models leverage the **context** in which words appear. So, even if a word like "goooood" appears, the model can usually infer its meaning based on the surrounding text.
   - Models like BERT or GPT can often recognize that "goooood" is just an elongated form of "good" because of how they understand the sequence of tokens.

2. **Subword Tokenization**:
   - Most transformer models use **subword tokenizers** like WordPiece (used by BERT) or Byte-Pair Encoding (used by GPT), which break down rare or unknown words into smaller subword units. The exact split depends on the tokenizer's learned vocabulary, but for example:
     - "goooood" might be tokenized as "goo" + "##o" + "##ood".
     - "loooove" might become "lo" + "##o" + "##o" + "##ove".
   - This allows the model to handle rare or misspelled words by breaking them down into known subword pieces, which are still interpretable by the model.
   
3. **Pretraining Robustness**:
   - Transformer models are pretrained on *massive* amounts of diverse text data, which often includes typos, misspellings, and casual language (e.g., from web data, social media). This means they've likely been exposed to, and learned from, similar patterns, making them more robust to these irregularities.
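
The greedy longest-match idea behind WordPiece-style tokenization can be sketched in a few lines. The vocabulary below is a toy invented for this example, not BERT's real vocabulary, so the splits are illustrative only; a real tokenizer will generally split these words differently.

```python
# Toy vocabulary (hypothetical); "##" marks a piece that continues a word.
TOY_VOCAB = {"good", "goo", "##o", "##ood", "love", "lo", "##ove"}


def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until a match is found.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word maps to the unknown token.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


print(wordpiece_tokenize("good", TOY_VOCAB))
print(wordpiece_tokenize("goooood", TOY_VOCAB))
print(wordpiece_tokenize("loooove", TOY_VOCAB))
```

Even though "goooood" is not in the vocabulary, it is still represented by known pieces rather than an unknown token, which is why elongated or misspelled words remain usable input.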

In the transformer modeling paradigm, extracting rich, detailed sentiment, emotion, and intensity from customer reviews is best achieved through a conservative, light-touch approach to data preprocessing. This approach focuses on essential treatments—such as removing personally identifiable information (PII), correcting data errors, and converting emojis to text—while preserving the original text as much as possible.

Certain preprocessing techniques, such as reducing repeated characters, or spelling correction, might unintentionally remove emphasis or change the tone of the review. For example, "goooood" with the extra "o"s might indicate strong positivity or excitement, whereas just "good" might seem more neutral. Transformer models are generally capable of understanding the difference in sentiment or tone conveyed by elongated words. Character repetitions often convey **emotion, emphasis, or informality**, and this is part of the natural variation in user-generated content. Removing or normalizing them might lose some of this nuance.
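
To make the trade-off concrete, the sketch below applies a common repeated-character rule, which is deliberately not part of our pipeline. The function name and the "collapse three or more to two" rule are assumptions chosen for illustration.

```python
import re


def collapse_repeats(text: str) -> str:
    """Reduce any run of three or more identical word characters to two."""
    return re.sub(r"(\w)\1{2,}", r"\1\1", text)


print(collapse_repeats("This app is goooood!!!"))  # → This app is good!!!
print(collapse_repeats("This app is good!!!"))     # → This app is good!!!
```

After normalization, the emphatic review and the neutral one are indistinguishable, which is exactly the loss of nuance described above.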


A light-touch approach has two further benefits:

- **Efficient Processing**: Avoiding over-processing keeps the preprocessing pipeline simple and reduces the risk of introducing errors or losing information.
- **Transformers are Robust**: Because transformers handle these variations well, there is less pressure to "fix" them, especially when the primary goal is to capture the overall sentiment or meaning of the text.

### When to Preprocess:
There are cases where preprocessing may still be beneficial, such as:
- **Strict Standardization**: When building interpretable models or producing standardized outputs for downstream use cases (e.g., generating reports), cleaning or normalizing the text may be preferable.
- **Highly Noisy Data**: If the dataset contains extreme noise or gibberish-like text (e.g., excessive repetition of random letters), it might still be worth cleaning those cases up.

### Conclusion:
**With transformer models,** there is a strong case for leaving misspellings and repeated characters as they are, since contextual representations and subword tokenization cope well with such variation. In practice, these models often handle non-standard text better than more rigid traditional models.

### Suggested Approach:
- **Minimal Preprocessing**: Keep the text as close to its original form as possible, performing only necessary cleaning such as masking URLs and emails or removing extreme noise.
- **Let Transformers Do the Work**: Rely on contextual understanding and subword tokenization to handle elongated words, typos, and misspellings.



## Data Quality Review
The following sections will present the anomalous observations for review prior to embarking on the cleaning stage.

### Anomalies to be Removed

#### Duplicate Review Content

In [6]:
summary, data = dqa.get_duplicate_review_content()
summary

| | content | count |
|---|---|---|
| 151 | Good | 236 |
| 313 | Love it | 133 |
| 168 | Great | 130 |
| 172 | Great app | 114 |
| 230 | I love it | 76 |
| ... | ... | ... |
| 279 | It’s amazing! | 2 |
| 273 | It’s Ight | 2 |
| 270 | Its alright | 2 |
| 269 | It's so easy to use | 2 |


The data quality assessment (DQA) conducted on the AppVoCAI dataset revealed several key issues that may require treatment during the data cleaning process. Let's take a look.

### Observations to Be Removed:
These cases represent data quality issues where the entire observation will be removed from the dataset.

- **Duplicate Review Content**: 13.93% of the reviews are duplicates and will be removed to ensure the dataset contains only unique user feedback.
- **Duplicate Review IDs**: A small fraction (0.0005%) of reviews with duplicate IDs will be removed to prevent inconsistencies.
- **Missing Reviews**: Any reviews flagged as missing (0.000009%) will be dropped from the dataset.
- **Inconsistent App ID-Name Pairs**: Inconsistent app ID-name pairs (0.0001%) will be removed to maintain consistency.
- **Non-English Reviews**: 3.20% of the reviews are non-English and will be removed to focus on English content.
- **Non-English App Names**: 2.76% of app names are non-English and will be removed.
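
A minimal sketch of the removal step follows, assuming each observation is a dict carrying boolean `dqa_*` flags and that the duplicate flags mark only the redundant copies, so one instance of each duplicated review survives. The function name and the `REMOVAL_FLAGS` tuple are hypothetical; the flag names follow the check names in the DQA summary.

```python
# Flags whose presence causes the whole observation to be dropped (assumed set).
REMOVAL_FLAGS = (
    "dqa_identical_review_content",
    "dqa_identical_review_id",
    "dqa_contains_inconsistent_app_id_name",
    "dqa_non_english_review",
    "dqa_non_english_app_name",
)


def drop_flagged(rows, flags=REMOVAL_FLAGS):
    """Keep only the rows for which none of the removal flags is set."""
    return [row for row in rows if not any(row.get(flag, False) for flag in flags)]


rows = [
    {"id": "1", "content": "Good", "dqa_identical_review_content": False},
    {"id": "2", "content": "Good", "dqa_identical_review_content": True},
    {"id": "3", "content": "Muy bueno", "dqa_non_english_review": True},
]
print(drop_flagged(rows))
```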

### Observations to Be Cleaned (Replacing Sequences):
These issues involve replacing specific problematic sequences with predefined values, while keeping the rest of the observation intact.

- **Non-ASCII Characters (27.88%)**: Non-ASCII characters will be replaced with an empty string to clean the text.
- **Excessive Whitespace (12.97%)**: Whitespace sequences will be replaced with a single space to maintain proper formatting.
- **Emojis (4.90%)**: Emojis will be converted to their text equivalents (e.g., 😊 becomes "smiling face").
- **Phone Numbers (0.03%)**: Phone numbers will be replaced with "[PHONE]" to anonymize personal information.
- **Control Characters (0.02%)**: Control characters will be removed by replacing them with an empty string to avoid formatting issues.
- **HTML Characters (0.01%)**: HTML entities will be replaced with an empty string to strip unnecessary formatting.
- **URLs (0.0007%)**: URLs will be replaced with "[URL]" to anonymize web addresses.
- **Emails (0.0001%)**: Emails will be replaced with "[EMAIL]" to protect personal information.
- **Excessive Numbers (0.04%)**: Excessive numbers will be replaced with "[NUMBER]" to normalize content where numbers dominate the text.
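
The sequence replacements above can be sketched with regular expressions. The patterns here are simplified assumptions for illustration (for example, the phone pattern only covers ten-digit numbers in the `555-123-4567` style), and the function name `mask_sequences` is hypothetical.

```python
import re

# Simplified, illustrative patterns; production rules would need to be broader.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_PATTERN = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")


def mask_sequences(text: str) -> str:
    """Replace URLs, emails, and phone numbers with placeholder tokens."""
    # URLs are masked first so that address-like fragments inside them are not
    # picked up by the email or phone patterns.
    text = URL_PATTERN.sub("[URL]", text)
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    text = PHONE_PATTERN.sub("[PHONE]", text)
    # Collapse excessive whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


print(mask_sequences("Call 555-123-4567 or mail me@example.com,  see https://example.com/x"))
```

Placeholders keep the position and type of the masked content visible to downstream models while removing the personal information itself.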

### Risk Mitigation
This approach distinguishes observations that will be removed entirely from those that will undergo targeted sequence replacements. Separating the treatments preserves data integrity and readability without discarding more information than necessary. However, before taking the irreversible cleaning steps, we'll review a sample of the anomalies to confirm that the flagged data is truly problematic.