In [1]:
import warnings

warnings.filterwarnings("ignore")
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

# Params

In [2]:
FORCE = False

# Data Cleaning
In the previous section, we provided an overview of the AppVoCAI dataset, including its structure, features, and distributions. The goal of this stage is to clean and condition the dataset in advance of downstream modeling efforts.

It is important to clarify that this stage is focused on **data cleaning**, rather than data preprocessing. While preprocessing tasks such as tokenization, lemmatization, stopword removal, and text normalization are crucial steps in preparing the dataset for model training, they are not the focus here. Instead, our goal is to address **anomalies**—such as duplicates, invalid characters, non-ASCII text, and other artifacts—that could compromise the integrity and reliability of downstream analysis. By ensuring the dataset is clean and free of unwanted noise, we lay the foundation for accurate, meaningful preprocessing and modeling in later stages.

## Data Cleaning Context
Downstream tasks such as sentiment analysis, classification, text summarization, and generation will leverage transformer-based models (like BERT, RoBERTa, and GPT), which have proven robust to many data anomalies and linguistic variations. Unlike recurrent architectures such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, which process tokens one at a time, and Convolutional Neural Networks (CNNs), which see only a fixed local window per layer, transformers attend over the entire sequence in parallel. This allows them to capture long-range dependencies and subtle contextual relationships in language.

However, research demonstrates that preprocessing can still significantly improve the performance of transformer models {cite}`siinoTextPreprocessingStill2024`. This synergy between preprocessing and model architecture suggests that a balanced approach is ideal. Our data preparation methodology focuses on addressing critical data quality issues that could undermine the integrity of downstream analyses, while preserving the text as close to its original form as possible. By adopting this conservative approach, we tackle key issues without sacrificing the nuance and representativeness of the data, ensuring the models are presented with rich, authentic input.

## Data Cleaning Key Evaluation Questions (KEQs)
Although this data cleaning approach includes many of the preprocessing techniques commonly found in the literature {cite}`symeonidisComparativeEvaluationPreprocessing2018`, it is motivated by three guiding questions.

1. What’s essential to remove, and what can be left intact to preserve meaning?
2. How do we best preserve text richness and nuance?
3. How can the data cleaning process best exploit model strengths towards optimal model performance?

These Key Evaluation Questions (KEQs) crystallized our approach, which balances data quality with model sophistication.

## Data Cleaning Strategy

### Review ID Uniqueness
When multiple reviews share the same `id`, we retain only the latest one, as it is most likely to reflect the user's current opinion.
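The keep-latest rule can be sketched in a few lines of plain Python. This is an illustration only, not the pipeline's `RemoveDuplicateReviewIdTask`; the `date` and `content` field names and the `keep_latest_by_id` helper are assumptions made for the example.

```python
from datetime import datetime


def keep_latest_by_id(reviews, id_key="id", date_key="date"):
    """Return one review per id, keeping the one with the latest date."""
    latest = {}
    for review in reviews:
        current = latest.get(review[id_key])
        # Replace the stored review only when this one is strictly newer.
        if current is None or review[date_key] > current[date_key]:
            latest[review[id_key]] = review
    return list(latest.values())


# Hypothetical records; the real dataset schema may differ.
reviews = [
    {"id": "r1", "date": datetime(2023, 1, 5), "content": "Crashes a lot."},
    {"id": "r2", "date": datetime(2023, 2, 1), "content": "Love it!"},
    {"id": "r1", "date": datetime(2023, 3, 9), "content": "Fixed now, great app."},
]
deduped = keep_latest_by_id(reviews)
```

With the sample records, `deduped` holds two reviews, and the surviving `r1` entry is the one dated March 2023.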

### Noise Removal
Noise refers to characters or tokens that either distort the content or contribute little to understanding sentiment, intent, or behavior. Our approach to noise removal starts with the simpler and more common issues, followed by more nuanced decisions around special characters and punctuation.

#### Encoding and Common Noise
The first set of noise to address involves issues that are relatively easy to detect and remove, but can have a big impact on the clarity of the dataset:

- **Encoding and Control Characters:** Often appearing as artifacts from different text encoding formats, these include characters that serve formatting or invisible functions in the text. They will be removed.
- **Accents and Diacritics:** These will be normalized (e.g., converting `é` to `e`) to reduce unnecessary variation in the text.
- **HTML Character Entities:** Common in scraped data, entities such as `&amp;` and `&#39;` will be removed as they do not add any meaningful content.
- **Line Breaks and Excessive Whitespace:** Multiple line breaks and extra spaces will be condensed into a single space to ensure consistency and readability.
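The four steps above could be combined roughly as follows, using only the Python standard library. This is a minimal sketch under stated assumptions, not the project's implementation: the function names, the regular expressions, and the order of operations are all illustrative choices.

```python
import re
import unicodedata

# Named, decimal, and hexadecimal HTML entities, e.g. &amp; &#39; &#x27;
HTML_ENTITY = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")
WHITESPACE = re.compile(r"\s+")


def strip_accents(text):
    # Decompose characters (NFD), then drop marks with a non-zero combining
    # class. Emoji variation selectors have class 0 and are left untouched.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


def remove_control_chars(text):
    # Drop control characters (category Cc) but keep whitespace such as
    # newlines and tabs, which is condensed in a later step.
    return "".join(
        ch for ch in text if ch.isspace() or unicodedata.category(ch) != "Cc"
    )


def clean_noise(text):
    text = HTML_ENTITY.sub("", text)
    text = strip_accents(text)
    text = remove_control_chars(text)
    # Condense line breaks and runs of spaces into a single space.
    return WHITESPACE.sub(" ", text).strip()


sample = "Caf\u00e9 &amp; bar\n\n  is  great\x07 \U0001F60A"
cleaned = clean_noise(sample)
```

Only category `Cc` is removed here; format characters (category `Cf`) are deliberately kept because the zero-width joiner used in compound emojis belongs to that category.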

#### Special Characters
In contrast to conventional text cleaning practices, which often involve the removal of special characters, we have opted to retain them in our pipeline. Special characters like emoticons provide additional context and convey subtle emotional nuances that are valuable for sentiment analysis. Furthermore, indicators such as brackets (e.g., `[EMAIL]`) serve as markers for sensitive information that has been anonymized. Removing these characters could strip valuable context from the reviews, potentially impacting the accuracy and depth of our analysis. This approach, though unconventional, preserves nuance and expressiveness and enhances our ability to capture the full meaning and sentiment embedded in user feedback.

#### Punctuation
Punctuation presents more nuanced challenges, as it often affects sentence structure, tone, and emphasis. Standard punctuation such as periods, commas, exclamation points, and question marks will be kept. They are important for understanding sentence boundaries and emotional emphasis. Multiple punctuation marks (e.g., "!!!", "???") are also retained, as they often signify strong emotion or sentiment.

### Personally Identifiable Information (PII)
In this data cleaning phase, personally identifiable information (PII), such as emails, URLs, and phone numbers, will undergo masking to ensure privacy and compliance with ethical standards for data use. Emails, for instance, are highly sensitive and can reveal specific user identities or contact details, risking the exposure of personally sensitive information. To mitigate this, email addresses will be systematically replaced with the marker `[EMAIL]`. Similarly, URLs—often including specific domains or personal resources—may unintentionally disclose identifiable or private information. Masking URLs as `[URL]` prevents unintended data leakage while retaining content structure for analysis. Additionally, phone numbers, inherently identifiable and private, will be marked as `[PHONE]`. Given their nature as direct contact points, the masking of phone numbers is essential to uphold confidentiality and meet data privacy regulations.

These masking protocols enable comprehensive content analysis while upholding data privacy obligations and reducing risks of re-identification in sensitive datasets. This approach allows the dataset to retain its structural and contextual integrity, facilitating meaningful analysis without compromising user privacy.
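As an illustration of the masking described above, the sketch below replaces URLs, email addresses, and phone numbers with their markers using regular expressions. The patterns are simplified assumptions (the phone pattern targets North American style numbers only) and are not the expressions used by the pipeline's masking tasks; `mask_pii` is a hypothetical helper name.

```python
import re

# Simplified, illustrative patterns; production patterns are usually stricter.
URL = re.compile(r"\b(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
PHONE = re.compile(
    r"(?<!\w)(?:\+?\d{1,3}[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}(?!\w)"
)


def mask_pii(text):
    # Mask URLs first so that addresses embedded in links are not
    # partially rewritten by the email pattern.
    text = URL.sub("[URL]", text)
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


sample = (
    "Email me at jane.doe@example.com or call (555) 123-4567, "
    "see https://example.com/help"
)
masked = mask_pii(sample)
```

Note that the URL pattern consumes everything up to the next whitespace, so punctuation directly attached to a link is masked along with it.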

### Language 
As part of the data quality assessment strategy, non-English app names and reviews will be systematically identified and removed to ensure linguistic consistency within the dataset. This process applies language detection models to app names and review text to detect non-English content. Entries flagged as non-English will be excluded from downstream analysis.

The rationale for this step lies in the need to maintain coherence in language-based tasks such as sentiment analysis, aspect-based sentiment analysis (ABSA), and emotion detection, all of which rely on clear, uniform input. By removing non-English content, we aim to prevent noise, misinterpretation, or inconsistencies that could undermine the accuracy of insights. This approach is designed to focus the analysis on the English-speaking market, aligning the dataset with the target audience and improving the overall relevance and quality of the findings.

### Emoticons and Emojis Conversion
Emoticons and emojis are frequently used in app reviews to express emotions, sentiments, or nuanced meanings that words alone may not fully capture. In the context of aspect-based sentiment analysis (ABSA) for app reviews, the decision to retain or convert emojis plays a significant role in shaping the accuracy and efficiency of downstream natural language processing (NLP) tasks. Given that modern transformer models are highly adept at handling a wide variety of tokens, including emojis, we take a **leave emojis as-is** approach during data cleaning. This strategy leverages the inherent strengths of transformer models in dealing with diverse text elements while maintaining the integrity and natural flow of user-generated content. 

#### Justification:
1. **Transformer Models’ Capabilities**:
   Transformer-based models, such as BERT and GPT, employ subword tokenization techniques like WordPiece and Byte Pair Encoding (BPE). Byte-level BPE tokenizers, as used by GPT-2 and RoBERTa, can represent any emoji as a sequence of byte-level tokens, and many pretraining corpora include emojis, so these models can learn the contextual meanings that emojis convey. WordPiece vocabularies, such as BERT's, may instead map emojis that are missing from the vocabulary to an unknown token, so emoji coverage should be confirmed for the chosen model. Where coverage is adequate, emojis can be treated as standard tokens without the need for conversion, preserving computational efficiency while ensuring sentiment and meaning are captured.

2. **Preservation of Natural Language Flow**:
   App reviews are inherently informal and often rely on emojis to express sentiment or emphasis succinctly. By retaining emojis in their original form, the natural tone and expressiveness of the reviews are preserved, allowing the model to analyze real-world feedback in its most authentic form. Converting emojis to text (e.g., "😊" to "happy face") could introduce unnecessary verbosity and disrupt the concise style typical in user reviews, potentially leading to a loss of nuance in the analysis.

3. **Efficiency in ABSA**:
   In ABSA, the focus is on extracting sentiment related to specific app features or aspects. Since transformer models can already interpret the sentiment behind common emojis based on the surrounding context, converting them to text is unlikely to provide substantial improvements in analysis. For example, the positive sentiment conveyed by "😊" or the negative tone of "😡" are naturally inferred from their usage alongside relevant aspects of the review (e.g., "support team" or "performance"). Retaining emojis allows the model to focus on both the aspect and the associated sentiment, without needing additional processing steps.

For app reviews processed in transformer-based models, our strategy is to leave emojis as-is during data cleaning. This approach capitalizes on the transformer’s robust tokenization and contextual learning capabilities while preserving the authentic, emotion-rich nature of user feedback. By retaining emojis in their original form, the model can efficiently capture sentiment and meaning without the unnecessary complexity introduced by conversion, ensuring both accuracy and interpretability in aspect-based sentiment analysis.

### Spelling, Abbreviations, and Acronyms in App Reviews

In processing user-generated content such as app reviews, the handling of spelling variations, abbreviations, and acronyms is a critical component of the data cleaning strategy. Given the powerful contextual understanding of transformer-based models, a **leave-as-is approach** for both spelling errors and abbreviations/acronyms is taken. This approach takes advantage of the transformer’s strengths in dealing with noisy text data while preserving the natural language flow and intent found in app reviews.

#### Justification:
1. **Transformer Models' Robustness to Spelling Variations and Abbreviations**:
   Transformer models, particularly those using **Byte Pair Encoding (BPE)** or **WordPiece** tokenization, are designed to handle subword units, allowing them to process incomplete or misspelled words as well as abbreviations and acronyms effectively. By breaking down words and abbreviations into smaller components, transformers can use surrounding context to infer meaning. For example:
   - **Misspelled words** such as "exellent" will be split into smaller subword pieces (for example, "ex", "##ell", and "##ent"; the exact split depends on the tokenizer's vocabulary), allowing the model to infer the intended meaning from context.
   - **Abbreviations or acronyms**, such as "AI" (Artificial Intelligence) or "UX" (User Experience), are commonly recognized by transformers due to their pretraining on vast datasets that include these forms of shorthand. The context of the sentence often clarifies the meaning, rendering explicit expansion unnecessary.

2. **Pretraining on Diverse, Noisy Data**:
   Transformer models like **BERT** and **GPT** are pretrained on large, diverse datasets that encompass various forms of natural language, including informal writing styles with abbreviations, slang, and misspellings. These models are already equipped to understand user-generated content that is not perfectly clean, making it redundant to introduce correction or expansion processes that could introduce unnecessary complexity without providing significant performance gains.

3. **Preserving Authenticity and User Intent**:
   User-generated app reviews often reflect natural, informal communication, where spelling variations, abbreviations, and acronyms contribute to the user’s tone and intent. Correcting or expanding these forms may risk altering the tone or authenticity of the review, particularly in cases where users employ creative or emphatic language. For instance, misspelling "awesome" as "awsome" or using "UX" as shorthand reflects natural usage, and correcting these may remove aspects of user expression that matter for sentiment analysis.

4. **Focus on Context and Meaning over Perfection**:
   The strength of transformers lies in their **contextual learning**—the ability to understand the meaning of words in relation to their surrounding context. In sentiment analysis, particularly for **aspect-based sentiment analysis (ABSA)**, the focus is on extracting user opinions related to specific app features, not on perfecting the language. Transformers excel in this task by interpreting the overall sentiment, even when the input includes abbreviations or spelling errors. Expanding abbreviations or correcting spelling would likely introduce marginal improvements, if any, at the cost of altering the text’s natural flow.

For app reviews analyzed using transformer-based models, we consider the **leave-as-is approach** to spelling errors, abbreviations, and acronyms the most effective strategy. This approach leverages the strengths of transformer models, particularly their ability to tokenize subword units and learn from context, while preserving the natural, authentic nature of user-generated content. By keeping spelling variations in the dataset, the data cleaning process remains streamlined and avoids unnecessary complexity, allowing the model to focus on capturing meaningful sentiment and insights without sacrificing user intent or tone.
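To make the subword argument concrete, here is a toy version of the greedy longest-match-first splitting used by WordPiece-style tokenizers. The vocabulary is invented for illustration, so the pieces shown are not what any particular pretrained tokenizer would produce.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
TOY_VOCAB = {"excellent", "ex", "##ell", "##ent", "ux"}

correct = wordpiece("excellent", TOY_VOCAB)
misspelled = wordpiece("exellent", TOY_VOCAB)
```

With this vocabulary the correctly spelled word stays whole, while the misspelling falls back to shorter known pieces rather than being discarded, which is what lets the surrounding context carry the meaning.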

### Contraction Expansion
For user-generated app reviews, the **leave-as-is strategy** for contractions (e.g., "don’t," "can’t") is taken to maintain the natural language style and ensure seamless handling by transformer-based models. Contractions are a common feature in informal language, and their preservation has specific benefits in the context of transformer processing for sentiment-rich text like app reviews.

#### Justification:
1. **Subword Tokenization**:
   - Transformers use subword tokenization (e.g., WordPiece, SentencePiece, or Byte-Pair Encoding), which breaks down rare or unknown words into **subword units**. This means that contractions like "can’t" are split into a small number of familiar pieces (e.g., "can", "'", and "t" with BERT's tokenizer, which splits on punctuation; the exact pieces differ in DeBERTa and T5).
   - Since these models **contextualize the meaning** of tokens based on their surrounding context, they are highly capable of understanding negation (e.g., "can’t") even if it's tokenized into separate parts. Therefore, expanding contractions is **not as critical** for these models because they can effectively capture the **semantic meaning** without needing the contraction to be expanded.

2. **Contextualized Understanding**:
   - Models like **DeBERTa** (which adds disentangled attention that encodes content and relative position separately), **BERT**, and **T5** excel at capturing **negation and emphasis** due to their self-attention mechanisms. They can distinguish "can" from "cannot" or "can’t" based on the context of the sentence.
   - Even when "can’t" is split into several tokens, the model understands the negation because it weighs the contextual relationships between tokens rather than relying on single-word embeddings. This makes contraction expansion less crucial.

3. **Negation Sensitivity**:
   - Transformer models are trained on large, diverse datasets, which means they are already familiar with contractions and negations. They are robust enough to handle **complex forms of negation** without needing explicit expansion (e.g., converting "can’t" to "cannot"). This is especially true for models like **T5**, which is designed for more flexible text generation and can parse a wide variety of text structures.

4. **Antonym Replacement**:
   - If downstream tasks involve replacing negations with antonyms, transformers can still handle the tokenized parts of a contraction appropriately. They rely on context rather than exact word forms, so however "can’t" is split, the model still recognizes that it expresses negation, and antonym replacement can proceed effectively.

5. **Preserving Tone and Authenticity**:
   - Contractions are integral to the casual, conversational tone typical of app reviews. Expanding contractions (e.g., converting "can't" to "cannot") can alter the text’s natural rhythm and diminish the informal style that characterizes user feedback. In sentiment analysis, this tone is essential, as contractions often convey emphasis or expressiveness that reflects the user's attitude.

6. **Contextual Clarity Without Expansion**:
   - Contractions rarely introduce ambiguity within a sentence’s context, and transformer models handle these forms effectively without expansion. In sentences like "I can’t believe how good this app is," the meaning is unambiguous, and the sentiment remains clear. Expanding contractions does not enhance model comprehension or add interpretative value, making the change unnecessary for ABSA and sentiment analysis.

7. **Alignment with Pretraining Data**:
   - Transformer models like BERT and GPT are pretrained on diverse datasets that include abundant contractions. This pretraining allows the models to naturally interpret contractions without needing expansion, making additional preprocessing steps redundant. Thus, a leave-as-is approach aligns well with how these models were designed to process text.

The leave-as-is approach for contractions preserves the authentic tone and conversational style of app reviews, maintains efficiency in tokenization, and aligns with transformer models' pretraining on natural, informal language. This strategy is optimized for capturing user sentiment and aspect-specific feedback without unnecessary preprocessing, ensuring that the natural expressiveness of contractions is leveraged fully in ABSA tasks.

### Elongation
For user-generated app reviews, a **leave-as-is strategy** for elongation (e.g., "soooo good," "loooove it") is adopted. Elongation is a common stylistic element in informal text that conveys emphasis or intensity, which can be especially valuable in sentiment-heavy content. Preserving these elongated forms supports both the nuance of user expression and the capabilities of transformer-based models.

#### Justification:
1. **Enhanced Sentiment Intensity**: Elongated words (e.g., "soooo good") amplify emotion and provide valuable sentiment cues that standard forms might miss.
2. **Preservation of Authentic Tone**: Elongation is a natural part of user expression in informal reviews, and retaining it preserves the genuine voice of the reviewer.
3. **Transformer Models' Contextual Understanding**: Transformer models, with subword tokenization, effectively interpret elongated words within context, eliminating the need for normalization.
4. **Alignment with Informal Language**: User-generated content frequently includes elongations to convey emphasis, and altering these forms could disrupt the text's natural flow and expressiveness. 
5. **Improved Aspect-Based Sentiment Analysis (ABSA)**: Elongations enhance the model’s ability to capture nuances in sentiment related to specific app features, supporting more accurate ABSA outcomes.

For ABSA and sentiment analysis in app reviews, a leave-as-is approach for elongation effectively preserves emotional intensity and authentic tone, while fully leveraging the transformer model’s contextual understanding. This approach avoids unnecessary preprocessing while capturing the sentiment-rich expressiveness that elongated words convey, ensuring that user feedback is analyzed in its most genuine and impactful form.

## Import Libraries

In [3]:
from discover.container import DiscoverContainer
from discover.infra.config.flow import FlowConfigReader
from discover.flow.data_prep.clean.stage import DataCleaningStage

## Dependency Container

In [4]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.data_prep.stage",
    ],
)

## Data Cleaning Pipeline
This code snippet demonstrates how to set up and run a data cleaning stage based on a configuration obtained from a configuration reader. Here’s a breakdown of the steps:

1. **Obtain the Configuration**: 
   - A `FlowConfigReader` instance (`reader`) is used to load the configuration. 
   - The `get_config` method retrieves the configuration for all phases, excluding namespaces, and then accesses the specific stage configuration for data preparation and cleaning.

2. **Build the Data Cleaning Stage**:
   - The `DataCleaningStage.build` method initializes the data cleaning stage with the provided `stage_config`. Setting `force=False` ensures the stage is only built if the endpoint doesn't already exist.

3. **Run the Data Cleaning Stage**:
   - Finally, the `run` method executes the data cleaning stage and returns an `asset_id`, which likely identifies the cleaned dataset or asset generated by this stage.



In [5]:
# Obtain the configuration
reader = FlowConfigReader()
config = reader.get_config("phases", namespace=False)
stage_config = config["dataprep"]["stages"]["clean"]

# Build and run Data Cleaning Stage
stage = DataCleaningStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()

[10/28/2024 05:13:09 PM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [_remove_dataset_file_by_filepath] : Removed dataset file at workspace/dev/dataset/01_dataprep/appvocai_discover-01_dataprep-01_clean-review-dataset.parquet from repository.
[10/28/2024 05:13:09 PM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [remove] : Removed dataset dataset-dev-dataprep-clean-review from the repository.




#                              Data Cleaning Stage                               #



                          RemoveDuplicateReviewIdTask                           
                          ---------------------------                           
                          Start Datetime | Mon, 28 Oct 2024 17:13:09
                       Complete Datetime | Mon, 28 Oct 2024 17:13:10
                                 Runtime | 0.61 seconds


                                  URLMaskTask                                   
                                  -----------                                   
                          Start Datetime | Mon, 28 Oct 2024 17:13:14
                       Complete Datetime | Mon, 28 Oct 2024 17:13:17
                                 Runtime | 2.43 seconds


                              EmailAddressMaskTask                              
                              --------------------                              
                          Start Da

## Closing
With data cleaning complete, we have addressed the critical anomalies that could impact downstream tasks, such as duplicates, non-ASCII text, and unwanted control characters, while preserving the natural richness and variability of user-generated content. This approach aligns with the guiding Key Evaluation Questions (KEQs) to ensure both data quality and the nuanced representation required for transformer-based models, which will be central to the AppVoCAI analyses. By addressing essential data quality issues without over-processing, we optimize the dataset to leverage transformers' strengths in handling subtle linguistic variations and context.

In the next section, we tackle text preprocessing and feature extraction.