In [1]:
import warnings

warnings.filterwarnings("ignore")
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

# Data Quality Assessment
In the previous section, we provided an overview of the AppVoCAI dataset, including its structure, features, and distributions. That overview confirmed the absence of null values and the validity of key variables such as ratings and review dates. The goal of this stage is to identify any remaining data quality issues that could compromise the integrity and accuracy of our downstream modeling efforts.

In this Data Quality Assessment (DQA), we will execute a series of checks designed to uncover noise, inconsistencies, or anomalies within the dataset:

1. **App ID/App Name Consistency**: We ensure that `app_id` and `app_name` align. A prior analysis revealed 14 more `app_id`s than `app_name`s, which will require further investigation.
2. **Duplicate Review IDs**: We identified 117 duplicate review `id`s. These entries must be flagged for closer examination.
3. **Duplicate Review Content**: Approximately 14% of the reviews were found to be duplicates. These entries need to be examined for potential redundancy or noise.
4. **Review Length Anomalies**: Zero-length reviews will be removed. Extremely long reviews will be inspected for signs of repetition or low-quality content.
5. **Non-English Text**: Reviews and app names written in languages other than English may not be relevant to our analysis. We will identify these entries and decide how to handle them.
6. **Inappropriate Content**: Content such as URLs, phone numbers, email addresses, or other personally identifiable information will be treated as spam and either removed or masked.
7. **Emojis**: Emojis can add valuable context in some cases but may introduce noise in others. We will assess whether to retain, remove, or convert emojis into textual equivalents.
8. **Formatting Anomalies**: We will flag any entries with excessive whitespace, HTML, or other markup artifacts that could interfere with analysis.

**Note on Exclusions**: While this assessment focuses on structural and formatting issues, certain aspects like profanity and spelling mistakes are not included in this phase. These may be addressed during the text quality assessment, which will focus on content and linguistic features.

By performing these data quality checks, we identify and mark anomalies that will be addressed in the subsequent data cleaning stage. Flagging inconsistencies and potential issues at this stage lays the groundwork for clean, reliable data. Resolving the flagged issues will improve the accuracy of our analyses and support more robust conclusions in the later phases of our data processing pipeline.
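
To make the flag-then-clean idea concrete, the following is a minimal sketch in plain Python of how a few of the checks listed above might attach boolean `dqa_*` flags to each review. It is not the `DQAStage` implementation: the sample records, the regular expressions, the whitespace threshold, and the `dqa_contains_url` flag name are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical sample records; only the fields needed for the sketch are included.
reviews = [
    {"id": "1", "content": "Great app!!!"},
    {"id": "2", "content": "Great app!!!"},
    {"id": "2", "content": "Too    many     spaces here"},
    {"id": "3", "content": "Visit https://example.com now"},
]

# Illustrative patterns only; the expressions and the threshold are assumptions.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
EXCESSIVE_WHITESPACE = re.compile(r"\s{3,}")


def flag_reviews(records):
    """Return copies of the records with boolean dqa_* flags added."""
    id_counts = Counter(r["id"] for r in records)
    content_counts = Counter(r["content"] for r in records)
    flagged = []
    for r in records:
        text = r["content"]
        flagged.append(
            {
                **r,
                "dqa_identical_review_id": id_counts[r["id"]] > 1,
                "dqa_identical_review_content": content_counts[text] > 1,
                "dqa_contains_excessive_whitespace": bool(
                    EXCESSIVE_WHITESPACE.search(text)
                ),
                "dqa_contains_url": bool(URL_PATTERN.search(text)),  # hypothetical flag name
                "dqa_contains_non_ascii_chars": not text.isascii(),
            }
        )
    return flagged


flagged = flag_reviews(reviews)
```

Nothing is modified or dropped at this point. The records only gain flag fields, so the decision about how to treat each anomaly can be deferred to the cleaning stage.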

## Import Libraries

In [2]:
import fasttext

from discover.analysis.dqa import DataQualityAnalysis
from discover.container import DiscoverContainer
from discover.infra.config.flow import FlowConfigReader
from discover.flow.data_prep.dqa.stage import DQAStage

fasttext.FastText.eprint = lambda x: None

## Dependency Container

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.data_prep.stage",
        "discover.flow.data_prep.dqa",
        "discover.analysis.base",
    ],
)

## Data Quality Assessment Pipeline
The data quality assessment stage runs the data quality checks and marks the observations that require attention. We begin with the configuration, then construct and run the `DQAStage` pipeline.

In [4]:
# Obtain the configuration
reader = FlowConfigReader()
config = reader.get_config("phases", namespace=False)
stage_config = config["dataprep"]["stages"]["dqa"]

# Build and run the Data Quality Assessment stage
stage = DQAStage.build(stage_config=stage_config, force=True)
asset_id = stage.run()

[10/24/2024 03:58:16 AM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [_remove_dataset_file_by_filepath] : Removed dataset file at workspace/dev/dataset/01_dataprep/appvocai_discover-01_dataprep-02_dqa-review-dataset.parquet from repository.
[10/24/2024 03:58:16 AM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [remove] : Removed dataset dataset-dev-dataprep-dqa-review from the repository.




#                         Data Quality Assessment Stage                          #



                              DetectDuplicateTask                               
                              -------------------                               
                          Start Datetime | Thu, 24 Oct 2024 03:58:16
                       Complete Datetime | Thu, 24 Oct 2024 03:58:16
                                 Runtime | 0.03 seconds
                               DQA Check | dqa_identical_rows
                      Anomalies Detected | 0 (0.0%) of 59021 records


                              DetectDuplicateTask                               
                              -------------------                               
                          Start Datetime | Thu, 24 Oct 2024 03:58:16
                       Complete Datetime | Thu, 24 Oct 2024 03:58:16
                                 Runtime | 0.02 seconds
                               DQA Check | dqa_identical_review_id


## Data Quality Impressions
Let's get a summary of the data quality issues by type.

In [5]:
dqa = DataQualityAnalysis()
dqa.summarize()

| DQA Check | n | % |
|---|---|---|
| dqa_contains_non_ascii_chars | 27252 | 46.173396 |
| dqa_contains_excessive_whitespace | 7540 | 12.775114 |
| dqa_identical_review_content | 3923 | 6.646787 |
| dqa_has_emoji | 3490 | 5.91315 |
| dqa_non_english_review | 2276 | 3.856255 |
| dqa_non_english_app_name | 1402 | 2.375426 |
| dqa_contains_excessive_numbers | 41 | 0.069467 |
| dqa_contains_phone_number | 26 | 0.044052 |
| dqa_contains_inconsistent_app_id_name | 8 | 0.013554 |
| dqa_contains_HTML_chars | 5 | 0.008472 |
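
The percentages above appear to be each count divided by the 59021 records reported in the stage log (for example, 27252 / 59021 ≈ 46.17%). As a rough sketch of how such a summary could be derived from boolean `dqa_*` flags, consider the following. The `summarize_flags` helper and the sample records are hypothetical, and omitting checks with zero hits is an assumption; this is not the code behind `dqa.summarize()`.

```python
def summarize_flags(records):
    """Count flagged records per dqa_* check, sorted by count in descending order."""
    total = len(records)
    if total == 0:
        return []
    counts = {}
    for record in records:
        for key, value in record.items():
            if key.startswith("dqa_"):
                counts[key] = counts.get(key, 0) + (1 if value else 0)
    rows = [
        {"check": check, "n": n, "%": round(100 * n / total, 6)}
        for check, n in counts.items()
        if n > 0  # assumption: checks with no hits are left out of the summary
    ]
    return sorted(rows, key=lambda row: row["n"], reverse=True)


# Hypothetical flagged records for illustration.
sample = [
    {"id": "a", "dqa_has_emoji": True, "dqa_contains_phone_number": False},
    {"id": "b", "dqa_has_emoji": True, "dqa_contains_phone_number": True},
    {"id": "c", "dqa_has_emoji": False, "dqa_contains_phone_number": False},
    {"id": "d", "dqa_has_emoji": False, "dqa_contains_phone_number": False},
]
summary_rows = summarize_flags(sample)
```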


The data quality assessment (DQA) conducted on the AppVoCAI dataset revealed several issues that must be resolved, either by removing the affected observations or by replacing the offending sequences, to ensure the integrity and reliability of the analysis.

### Observations to Be Removed:
These cases represent data quality issues where the entire observation will be removed from the dataset.

- **Duplicate Review Content**: 13.93% of the reviews are duplicates and will be removed to ensure the dataset contains only unique user feedback.
- **Duplicate Review IDs**: A small fraction (0.0005%) of reviews with duplicate IDs will be removed to prevent inconsistencies.
- **Missing Reviews**: Any reviews flagged as missing (0.000009%) will be dropped from the dataset.
- **Inconsistent App ID-Name Pairs**: Inconsistent app ID-name pairs (0.0001%) will be removed to maintain consistency.
- **Non-English Reviews**: 3.20% of the reviews are non-English and will be removed to focus on English content.
- **Non-English App Names**: 2.76% of app names are non-English and will be removed.
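
A removal step of this kind might look like the sketch below, which drops any record carrying at least one removal flag. The flag names are taken from the summary and log output above, while the `drop_flagged` helper and the sample records are hypothetical.

```python
# Flags whose presence removes the whole observation.
REMOVAL_FLAGS = (
    "dqa_identical_review_content",
    "dqa_identical_review_id",
    "dqa_non_english_review",
    "dqa_non_english_app_name",
    "dqa_contains_inconsistent_app_id_name",
)


def drop_flagged(records, flags=REMOVAL_FLAGS):
    """Keep only the records for which none of the given flags is set."""
    return [r for r in records if not any(r.get(flag, False) for flag in flags)]


# Hypothetical flagged records for illustration.
flagged_records = [
    {"id": "1", "dqa_non_english_review": False},
    {"id": "2", "dqa_non_english_review": True},
    {"id": "3", "dqa_identical_review_id": True},
    {"id": "4"},
]
kept = drop_flagged(flagged_records)
```

Whether a duplicate flag marks every copy or only the repeats determines whether one instance of duplicated content survives this filter. The sketch does not settle that question, which would need to be confirmed against the actual flagging logic.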

### Observations to Be Cleaned (Replacing Sequences):
These issues involve replacing specific problematic sequences with predefined values, while keeping the rest of the observation intact.

- **Non-ASCII Characters (27.88%)**: Non-ASCII characters will be replaced with an empty string to clean the text.
- **Excessive Whitespace (12.97%)**: Whitespace sequences will be replaced with a single space to maintain proper formatting.
- **Emojis (4.90%)**: Emojis will be converted to their text equivalents (e.g., 😊 becomes "smiling face").
- **Phone Numbers (0.03%)**: Phone numbers will be replaced with "[PHONE]" to anonymize personal information.
- **Control Characters (0.02%)**: Control characters will be removed by replacing them with an empty string to avoid formatting issues.
- **HTML Characters (0.01%)**: HTML entities will be replaced with an empty string to strip unnecessary formatting.
- **URLs (0.0007%)**: URLs will be replaced with "[URL]" to anonymize web addresses.
- **Emails (0.0001%)**: Emails will be replaced with "[EMAIL]" to protect personal information.
- **Excessive Numbers (0.04%)**: Excessive numbers will be replaced with "[NUMBER]" to normalize content where numbers dominate the text.
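
The replacement rules above can be sketched with the standard `re` module. The patterns below are deliberately simple assumptions (for instance, the phone pattern only covers a US-style ten-digit format) and the `clean_text` helper is hypothetical, not the project's cleaning code. Emoji conversion and the excessive-numbers rule are omitted because their exact definitions are not shown here; note that emoji conversion would have to run before non-ASCII characters are stripped.

```python
import re

# Illustrative patterns; each one is an assumption rather than the project's definition.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")
HTML_ENTITY_RE = re.compile(r"&(?:[A-Za-z]+|#\d+);")
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Apply the sequence replacements in a fixed order and return the cleaned text."""
    text = URL_RE.sub("[URL]", text)        # mask URLs first so emails inside them are not matched
    text = EMAIL_RE.sub("[EMAIL]", text)    # mask email addresses
    text = PHONE_RE.sub("[PHONE]", text)    # mask phone numbers
    text = HTML_ENTITY_RE.sub("", text)     # strip HTML entities
    text = CONTROL_RE.sub("", text)         # strip control characters
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII characters
    return WHITESPACE_RE.sub(" ", text).strip()  # collapse whitespace last


cleaned = clean_text("Great   app!! Visit https://example.com")
```

Whitespace is collapsed last because the earlier substitutions can leave behind runs of spaces where sequences were removed.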

### Risk Mitigation
This approach distinguishes observations that will be removed entirely from those that will undergo targeted sequence replacements. Separating the treatments preserves both data integrity and readability without discarding more information than necessary. Because the cleaning steps are irreversible, we first review a sample of the anomalies to confirm that the flagged data is truly problematic.
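
One way to draw a reproducible sample of flagged observations for manual review is sketched below. The `sample_flagged` helper is hypothetical; its `n` and `random_state` parameters simply mirror the arguments passed to the `dqa` review methods in this notebook and say nothing about how those methods are implemented.

```python
import random


def sample_flagged(records, flag, n=20, random_state=None):
    """Return up to n records with the given flag set, sampled reproducibly."""
    candidates = [r for r in records if r.get(flag, False)]
    rng = random.Random(random_state)  # a local generator keeps the sample repeatable
    return rng.sample(candidates, k=min(n, len(candidates)))


# Hypothetical flagged records: every third record is marked as containing an emoji.
records = [{"id": i, "dqa_has_emoji": i % 3 == 0} for i in range(30)]
review_batch = sample_flagged(records, "dqa_has_emoji", n=5, random_state=10)
```

Fixing the seed means the same observations are shown on every run, so conclusions drawn from the review remain traceable to specific records.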

## Data Quality Review
The following sections will present the anomalous observations for review prior to embarking on the cleaning stage.

### Anomalies to be Removed

#### Duplicate Review Content

In [6]:
summary, data = dqa.get_duplicate_review_content(n=20, random_state=10)
summary

[                  id      app_id                       app_name category_id  \
 3392182   9713950361   576226288       Fun Phone Call - IntCall        6016   
 10957055  6865966523  1514434067  Funny Voice Effects & Changer        6012   
 
                         author  rating content  vote_sum  vote_count  \
 3392182   4d3ed46658aefbf9092e       5     !!!         0           0   
 10957055  85abf4cdb7b7b1a220e6       5     !!!         0           0   
 
                         date  review_length       category  
 3392182  2023-03-14 23:13:47              1  Entertainment  
 10957055 2021-01-13 19:46:25              1      Lifestyle  ,
                    id      app_id                        app_name category_id  \
 13258525  10181827730   544007664  YouTube: Watch, Listen, Stream        6008   
 16867834  10197531298  1040207175          Emoji Me Sticker Maker        6005   
 
                         author  rating content  vote_sum  vote_count  \
 13258525  d7bfb8063c90feb867