In [None]:
import warnings

warnings.filterwarnings("ignore")
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
FORCE = False

# Data Cleaning
In the previous section, we provided an overview of the AppVoCAI dataset, evaluating its structure, features, and validity. This section covers data cleaning: deduplication, language filtering, text encoding, artifact removal, PII masking, and character normalization. By removing or masking artifacts that could compromise the integrity and reliability of downstream analysis, we ensure that the dataset supports nuanced, insightful discovery and strong model performance.

## Data Cleaning Context
Downstream tasks such as sentiment analysis, classification, text summarization, and generation will leverage transformer-based models (like BERT, RoBERTa, and GPT), which have proven to be highly robust in handling various data anomalies and linguistic variations. Unlike recurrent architectures such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, which process tokens one at a time, transformers attend to entire sequences in parallel. Convolutional Neural Networks (CNNs) also operate in parallel, but only over fixed local windows. Self-attention allows transformers to capture long-range dependencies and subtle contextual relationships in language.

However, research demonstrates that preprocessing can still significantly improve the performance of transformer models {cite}`siinoTextPreprocessingStill2024`. This synergy between preprocessing and model architecture suggests that a balanced approach is ideal. Our data preparation methodology focuses on addressing critical data quality issues that could undermine the integrity of downstream analyses, while preserving the text as close to its original form as possible. By adopting this conservative approach, we tackle key issues without sacrificing the nuance and representativeness of the data, ensuring the models are presented with rich, authentic input.

## Data Cleaning Key Evaluation Questions (KEQs)
Although this data cleaning approach comprises many of the preprocessing techniques commonly found in the literature {cite}`symeonidisComparativeEvaluationPreprocessing2018`, it is motivated by three guiding questions.

1. What’s essential to remove, and what can be left intact to preserve meaning?
2. How do we best preserve text richness and nuance?
3. How can the data cleaning process best exploit model strengths towards optimal model performance?

These Key Evaluation Questions (KEQs) crystallized our approach, which balances data quality with model sophistication.

## Data Cleaning Strategy
The following describes our data cleaning process and steps, executed in the order listed. We begin with 'safe' techniques that carry minimal risk of compromising downstream cleaning tasks. For instance, UTF-8 encoding can impact the accuracy of language detection algorithms, especially if characters carry language-specific information. Removing special characters may compromise the detection of Personally Identifiable Information (PII) such as URLs and email addresses. As the process progresses, steps carry a greater impact on the data, its expressiveness, and representation.

Our minimalist, *leave-as-is* approach can depart from data cleaning orthodoxy and standard practice. In such cases, we are transparent with our rationale. With that, our process is as follows:

### Basic Interventions
These measures ensure storage, I/O, and memory efficiency and provide reliable access to the data.

1. **Type Casting**: Data types are cast for optimal storage efficiency, memory utilization, and processing speed within the `pandas` framework. This ensures efficient handling and manipulation of the data throughout the pipeline.
2. **Remove Newlines from Review Text**: Newline characters in text can cause errors and unpredictable behavior in I/O operations and parsing within `pandas` DataFrames. This task removes these artifacts to ensure data stability and prevent disruptions in subsequent processing steps.
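
As a rough illustration of these two interventions, the sketch below casts columns to compact dtypes and replaces newlines in the review text. The column names (`category`, `rating`, `content`), the target dtypes, and the helper `cast_and_remove_newlines` are hypothetical and are not taken from the pipeline itself. Newlines are replaced with a space here so that adjacent words do not merge, which is a choice made for this sketch.

```python
import pandas as pd


def cast_and_remove_newlines(df: pd.DataFrame) -> pd.DataFrame:
    """Cast columns to compact dtypes and replace newlines in the review text.

    Column names and dtypes are hypothetical, chosen for illustration only.
    """
    # astype returns a new DataFrame, so the input is left untouched
    out = df.astype({"category": "category", "rating": "int8"})
    # Replace each run of carriage returns / line feeds with a single space
    out["content"] = out["content"].str.replace(r"[\r\n]+", " ", regex=True)
    return out


reviews = pd.DataFrame(
    {
        "category": ["Games", "Health"],
        "rating": [5, 2],
        "content": ["Great\napp", "Slow\r\nand buggy"],
    }
)
cleaned = cast_and_remove_newlines(reviews)
print(cleaned["content"].tolist())
```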

### Privacy
We ensure personally identifiable information is masked to protect data privacy.

3. **URL Masking**: Using regular expressions, we find and replace URLs with the placeholder '[URL]'. This allows us to uphold privacy standards while retaining the lexical context of the text.
4. **Email Address Masking**: Email addresses are similarly masked with '[EMAIL]', ensuring that sensitive contact information is not exposed.
5. **Phone Number Masking**: Phone numbers are detected and masked with '[PHONE]' using regular expressions. These masking protocols enable content analysis while meeting data privacy obligations and minimizing re-identification risks.
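
The masking steps above could be sketched with the standard `re` module as follows. The patterns and the helper `mask_pii` are illustrative assumptions, not the expressions used by the pipeline; in particular, the phone pattern only targets North American style numbers, and the URL pattern will also swallow trailing punctuation attached to a link.

```python
import re

# Hypothetical, deliberately simple patterns for illustration only
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)


def mask_pii(text: str) -> str:
    """Replace URLs, email addresses, and phone numbers with placeholders."""
    # URLs are masked first so that addresses embedded in links are not split
    text = URL_RE.sub("[URL]", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


print(mask_pii("Visit https://example.com/help or mail me@example.com, call 555-123-4567."))
```

Keeping a placeholder token in place of each match, rather than deleting the match outright, retains the sentence structure around the masked item.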

### Noise Removal
We remove or normalize artifacts that distort content or add little value to understanding sentiment, intent, or behavior.

6. **Control Characters**: We remove non-printable characters from the Unicode and ASCII character sets that are used to control text flow or hardware devices (e.g., newline, tab, or carriage return). These characters have no analytical value and can interfere with text processing.
7. **Accents and Diacritics**: We normalize accented characters (e.g., converting `é` to `e`) to reduce unnecessary text variation, which simplifies analysis without compromising the meaning of the content.
8. **HTML Characters**: Common in scraped data, HTML entities (e.g., `&amp;`, `&#39;`) are removed as they do not convey meaningful content. This ensures that the text is clean and ready for analysis.
9. **Excessive Whitespace**: Extra whitespace is condensed into a single space to ensure text consistency and improve readability. This also facilitates more efficient text parsing and analysis.
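
A minimal sketch of the four noise removal steps, using only the standard library, might look like the following. The function names are hypothetical, and two details are assumptions of this sketch rather than documented pipeline behavior: control characters and HTML entities are replaced with a space (so neighboring words do not merge) and the extra spaces are then collapsed in the final step.

```python
import re
import unicodedata

# Named, decimal, and hexadecimal HTML entities such as &amp; &#39; &#x27;
HTML_ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def remove_control_chars(text: str) -> str:
    # Unicode category "Cc" covers the ASCII and C1 control characters
    return "".join(" " if unicodedata.category(ch) == "Cc" else ch for ch in text)


def strip_accents(text: str) -> str:
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_html_entities(text: str) -> str:
    return HTML_ENTITY_RE.sub(" ", text)


def collapse_whitespace(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


def clean_noise(text: str) -> str:
    """Apply the four steps in the order listed above."""
    for step in (remove_control_chars, strip_accents, remove_html_entities, collapse_whitespace):
        text = step(text)
    return text


print(clean_noise("Caf\u00e9\tis  great &amp; fast"))
```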

### Language and Expression
These steps enhance linguistic consistency and preserve the expressive elements of the text.

10. **Remove Non-English Text**: We systematically identify and remove non-English app names and reviews to maintain linguistic uniformity within the dataset, which is crucial for consistent language-based analysis.
11. **Elongation Handling**: Elongated words (e.g., "soooo") convey emphasis in informal text, which is valuable for sentiment analysis. We use a threshold approach to limit characters that appear four or more times consecutively to a maximum of three (e.g., "soooo" becomes "sooo"), preserving emphasis while maintaining readability.
12. **Special Characters**: Excessive special characters can indicate SPAM, emotional intensity, or nonsensical content. We apply a threshold: if special characters make up more than 30% of the review text, the review is removed. This helps maintain the quality and relevance of the dataset.
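
The two threshold rules above can be sketched as follows. The function names are hypothetical, and the definition of a "special character" used here (anything that is neither alphanumeric nor whitespace) is an assumption; the pipeline may count special characters differently.

```python
import re

# A character followed by three or more repeats of itself (four or more in total)
ELONGATION_RE = re.compile(r"(.)\1{3,}")


def limit_elongation(text: str) -> str:
    """Cap runs of four or more identical characters at three."""
    return ELONGATION_RE.sub(lambda m: m.group(1) * 3, text)


def special_char_ratio(text: str) -> float:
    """Share of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 0.0
    special = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return special / len(text)


def exceeds_special_char_threshold(text: str, threshold: float = 0.30) -> bool:
    """True if the review would be removed under the 30% rule."""
    return special_char_ratio(text) > threshold


print(limit_elongation("soooo goooood"))
print(exceeds_special_char_threshold("$$$ win!!!"))
```

Note that under this definition emojis count as special characters, which is consistent with retaining them unless a review is dominated by such characters.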

### Data Integrity
We implement measures to ensure the integrity and uniqueness of the data.

13. **Review ID Deduplication**: For duplicate review IDs, our policy for retention is based on several criteria: the most recent review date is prioritized, followed by the longest review text, and, if all else is equal, the review with the lowest row index is retained. This ensures that we keep the most informative and relevant reviews.
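
The retention policy above can be expressed in `pandas` as a sort followed by `drop_duplicates`. This is a sketch under stated assumptions: the column names (`id`, `date`, `content`) and the helper `dedupe_reviews` are hypothetical, and the pipeline's own implementation may differ.

```python
import pandas as pd


def dedupe_reviews(
    df: pd.DataFrame, id_col: str = "id", date_col: str = "date", text_col: str = "content"
) -> pd.DataFrame:
    """Keep one row per review ID: most recent date, then longest text, then lowest row position."""
    ranked = df.assign(
        _len=df[text_col].fillna("").str.len(),
        _row=list(range(len(df))),
    )
    # Newest date first, longest text first, earliest row first
    ranked = ranked.sort_values(
        by=[date_col, "_len", "_row"], ascending=[False, False, True]
    )
    kept = ranked.drop_duplicates(subset=id_col, keep="first")
    # Restore the original row order and drop the helper columns
    return kept.sort_values("_row").drop(columns=["_len", "_row"])


reviews = pd.DataFrame(
    {
        "id": ["a", "a", "b", "b", "c", "c"],
        "date": pd.to_datetime(
            ["2024-01-01", "2024-02-01", "2024-03-01", "2024-03-01", "2024-01-05", "2024-01-05"]
        ),
        "content": ["a longer older review", "new", "hi", "hello there", "same", "same"],
    }
)
deduped = dedupe_reviews(reviews)
print(deduped)
```

In this toy example, review `a` is resolved by date, review `b` by text length, and review `c` by row position.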

### Encoding
Encoding steps are performed last to standardize the text format while preserving language-specific features throughout earlier processing.

14. **Unicode Normalization**: We normalize the text using `unicodedata.normalize` to standardize characters and ensure consistency, particularly for languages where characters have multiple valid representations.
15. **UTF-8 Encoding**: After normalization, we encode the text in UTF-8 format. This step converts the text into a consistent byte representation, suitable for storage or transmission, and ensures proper character encoding.
16. **Non-ASCII Character Removal**: Finally, we remove non-ASCII characters from the review text. This simplifies the text and ensures compatibility with systems that may not handle non-ASCII characters well.
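
The three encoding steps might be sketched as below. The normalization form is not stated above, so `NFKC` is an assumption of this example, as is the helper name `normalize_and_strip_non_ascii`.

```python
import unicodedata


def normalize_and_strip_non_ascii(text: str, form: str = "NFKC") -> str:
    """Normalize Unicode, round-trip through UTF-8, then drop non-ASCII characters."""
    # Step 14: unify characters that have multiple valid representations
    normalized = unicodedata.normalize(form, text)
    # Step 15: encode to UTF-8 bytes (the storage representation) and decode back
    utf8_text = normalized.encode("utf-8").decode("utf-8")
    # Step 16: discard any character outside the ASCII range
    return utf8_text.encode("ascii", errors="ignore").decode("ascii")


print(normalize_and_strip_non_ascii("\ufb01ne caf\u00e9"))
```

In this example the ligature `ﬁ` is normalized to `fi`, while the accented `é` has no ASCII form after `NFKC` and is dropped; this is why accent normalization runs earlier in the process.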

---

## Data Cleaning Techniques Not Implemented
In natural language processing (NLP), text cleaning measures such as lower-casing, contraction and abbreviation expansion, spelling correction, and the removal of emoticons, emojis, and other artifacts are considered standard practice. **However, our data cleaning strategy deliberately omits certain conventional techniques to maximize the quality and authenticity of app reviews for aspect-based sentiment analysis (ABSA).** We focus on leveraging the strengths of modern transformer models to handle natural language variations effectively.

### Emoticons and Emojis Conversion
Emoticons and emojis are frequently used in app reviews to express emotions, sentiments, or nuanced meanings that words alone may not fully capture. In the context of aspect-based sentiment analysis (ABSA) for app reviews, the decision to retain or convert emojis plays a significant role in shaping the accuracy and efficiency of downstream natural language processing (NLP) tasks. Given that modern transformer models are highly adept at handling a wide variety of tokens, including emojis, we take a **leave emojis as-is** approach during data cleaning unless the number of special characters exceeds a threshold. This strategy leverages the inherent strengths of transformer models in dealing with diverse text elements while maintaining the integrity and natural flow of user-generated content.

#### Justification:
1. **Transformer Models’ Capabilities**:
   Transformer-based models such as BERT and GPT use subword tokenizers like WordPiece and Byte Pair Encoding (BPE). Byte-level BPE, as used by GPT-style models, can represent any emoji as a sequence of byte-level tokens, while vocabulary-based tokenizers handle the emojis included in their vocabulary and may map others to an unknown token. Models pretrained on large web corpora have also seen emojis in context, which helps them learn the meanings emojis convey. Emojis can therefore generally be passed through without conversion, preserving computational efficiency while retaining sentiment and meaning.

2. **Preservation of Natural Language Flow**:
   App reviews are inherently informal and often rely on emojis to express sentiment or emphasis succinctly. By retaining emojis in their original form, the natural tone and expressiveness of the reviews are preserved, allowing the model to analyze real-world feedback in its most authentic form. Converting emojis to text (e.g., "😊" to "happy face") could introduce unnecessary verbosity and disrupt the concise style typical in user reviews, potentially leading to a loss of nuance in the analysis.

3. **Efficiency in ABSA**:
   In ABSA, the focus is on extracting sentiment related to specific app features or aspects. Since transformer models can already interpret the sentiment behind common emojis based on the surrounding context, converting them to text is unlikely to provide substantial improvements in analysis. For example, the positive sentiment conveyed by "😊" or the negative tone of "😡" are naturally inferred from their usage alongside relevant aspects of the review (e.g., "support team" or "performance"). Retaining emojis allows the model to focus on both the aspect and the associated sentiment, without needing additional processing steps.

For app reviews processed in transformer-based models, our strategy is to leave emojis as-is during data cleaning. This approach capitalizes on the transformer’s robust tokenization and contextual learning capabilities while preserving the authentic, emotion-rich nature of user feedback. By retaining emojis in their original form, the model can efficiently capture sentiment and meaning without the unnecessary complexity introduced by conversion, ensuring both accuracy and interpretability in aspect-based sentiment analysis.

### Spelling, Abbreviations, and Acronyms in App Reviews
In processing user-generated content such as app reviews, the handling of spelling variations, abbreviations, and acronyms is a critical component of the data cleaning strategy. **Given the powerful contextual understanding of transformer-based models, a leave-as-is approach for both spelling errors and abbreviations/acronyms is taken.** This approach takes advantage of the transformer’s strengths in dealing with noisy text data while preserving the natural language flow and intent found in app reviews.

#### Justification:
1. **Transformer Models' Robustness to Spelling Variations and Abbreviations**:
   Transformer models, particularly those using **Byte Pair Encoding (BPE)** or **WordPiece** tokenization, are designed to handle subword units, allowing them to process incomplete or misspelled words as well as abbreviations and acronyms effectively. By breaking down words and abbreviations into smaller components, transformers can use surrounding context to infer meaning. For example:
   - **Misspelled words** such as "exellent" are split into smaller subword pieces (the exact pieces depend on the tokenizer's vocabulary), allowing the model to infer the intended meaning from those pieces and the surrounding context.
   - **Abbreviations or acronyms**, such as "AI" (Artificial Intelligence) or "UX" (User Experience), are commonly recognized by transformers due to their pretraining on vast datasets that include these forms of shorthand. The context of the sentence often clarifies the meaning, rendering explicit expansion unnecessary.

2. **Pretraining on Diverse, Noisy Data**:
   Transformer models like **BERT** and **GPT** are pretrained on large, diverse datasets that encompass various forms of natural language, including informal writing styles with abbreviations, slang, and misspellings. These models are already equipped to understand user-generated content that is not perfectly clean, making it redundant to introduce correction or expansion processes that could introduce unnecessary complexity without providing significant performance gains.

3. **Preserving Authenticity and User Intent**:
   User-generated app reviews often reflect natural, informal communication, where spelling variations, abbreviations, and acronyms contribute to the user’s tone and intent. Correcting or expanding these forms may risk altering the tone or authenticity of the review, particularly in cases where users employ creative or emphatic language. For instance, misspelling "awesome" as "awsome" or using "UX" as shorthand reflects natural usage, and correcting or expanding these may remove key aspects of user expression that are crucial for sentiment analysis.

4. **Focus on Context and Meaning over Perfection**:
   The strength of transformers lies in their **contextual learning**—the ability to understand the meaning of words in relation to their surrounding context. In sentiment analysis, particularly for **aspect-based sentiment analysis (ABSA)**, the focus is on extracting user opinions related to specific app features, not on perfecting the language. Transformers excel in this task by interpreting the overall sentiment, even when the input includes abbreviations or spelling errors. Expanding abbreviations or correcting spelling would likely introduce marginal improvements, if any, at the cost of altering the text’s natural flow.

For app reviews analyzed using transformer-based models, the **leave-as-is approach** for spelling correction, abbreviations, and acronyms is the most effective strategy. This approach leverages the strengths of transformer models—particularly their ability to tokenize subword units and learn from context—while preserving the natural, authentic nature of user-generated content. By maintaining spelling variations in the dataset, the data cleaning process remains streamlined and avoids unnecessary complexity, allowing the model to focus on capturing meaningful sentiment and insights without sacrificing user intent or tone.

---

With our data cleaning rationale established, we now move on to the implementation: the following code runs the entire data cleaning pipeline, automating the steps described above.

## Import Libraries

In [2]:
from discover.container import DiscoverContainer
from discover.infra.config.flow import FlowConfigReader
from discover.flow.data_prep.clean.stage import DataCleaningStage
from discover.core.flow import PhaseDef, DataPrepStageDef

## Dependency Container

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.data_prep.stage",
    ],
)

## Data Cleaning Pipeline
This code snippet demonstrates how to set up and run a data cleaning stage based on a configuration obtained from a configuration reader. Here’s a breakdown of the steps:

1. **Obtain the Configuration**: 
   - A `FlowConfigReader` instance (`reader`) is used to load the configuration. 
   - The `get_stage_config` method retrieves the stage configuration for data preparation and cleaning.

2. **Build the Data Cleaning Stage**:
   - The `DataCleaningStage.build` method initializes the data cleaning stage with the provided `stage_config`. Setting `force=False` ensures the stage is only built if the endpoint doesn't already exist.

3. **Run the Data Cleaning Stage**:
   - Finally, the `run` method executes the data cleaning stage and returns an `asset_id`, which likely identifies the cleaned dataset or asset generated by this stage.



In [4]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=DataPrepStageDef.CLEAN
)

# Build and run the Data Cleaning Stage
stage = DataCleaningStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()



#                              Data Cleaning Stage                               #



                                CastDataTypeTask                                
                                ----------------                                
                          Start Datetime | Sat, 09 Nov 2024 18:28:52
                       Complete Datetime | Sat, 09 Nov 2024 18:30:24
                                 Runtime | 1.0 minutes and 32.65 seconds
                                 Summary | Modified 0 cells.


                               RemoveNewlinesTask                               
                               ------------------                               
                          Start Datetime | Sat, 09 Nov 2024 18:30:26
                       Complete Datetime | Sat, 09 Nov 2024 18:31:42
                                 Runtime | 1.0 minutes and 15.61 seconds
                                 Summary | Modified 1693576 cells.


                                  

## Closing
With data cleaning complete, we have addressed the critical anomalies that could impact downstream tasks, such as duplicates, non-ASCII text, and unwanted control characters, while preserving the natural richness and variability of user-generated content. This approach aligns with the guiding Key Evaluation Questions (KEQs) to ensure both data quality and the nuanced representation required for transformer-based models, which will be central to the AppVoCAI analyses. By addressing essential data quality issues without over-processing, we optimize the dataset to leverage transformers' strengths in handling subtle linguistic variations and context.

In the next section, we turn our attention to data enrichment. 