In [1]:
import warnings

warnings.filterwarnings("ignore")
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

# Data Quality Assessment
The second stage of data processing is the Data Quality Assessment. This stage ensures that our dataset is ready for subsequent analysis and modeling tasks. By identifying and rectifying data quality issues early, we can avoid potential pitfalls that might compromise the integrity and accuracy of our results.

## Data Quality Checks
In this stage, we employ a series of tasks designed to identify and address any noise or irregularities within the dataset. Each task focuses on a specific aspect of data quality, ranging from detecting duplicate entries to identifying profanity, special patterns, and other potential sources of bias or distortion.
1. **Duplicate Rows**: We identify and remove duplicate entries to ensure that each observation is unique, preventing skewed analyses and inflated metrics.
2. **Null Values**: We detect and handle missing data appropriately, which could involve imputation, deletion, or flagging incomplete records for further investigation.
3. **Non-English Text**: We check for and address non-English text in reviews and app names, as they may not be relevant to our analysis or could require special handling.
4. **Emojis**: Emojis can carry significant meaning in certain contexts but might also introduce noise. We identify and decide on their treatment—whether to retain, remove, or translate them into textual representations.
5. **Excessive Special Characters**: Special characters can disrupt text analysis and need to be managed, either by cleaning or encoding them appropriately.
6. **Invalid Dates**: We verify that date values fall within expected ranges and formats, correcting or flagging anomalies for further review.
7. **Invalid Ratings**: Ratings that fall outside the expected scale (e.g., 1 to 5) are identified and corrected or flagged.
8. **Profanity**: We detect and handle profane content to ensure that our dataset adheres to appropriate usage standards, especially if it's intended for public or sensitive applications.
9. **Special Patterns**: We identify and manage special patterns such as URLs, phone numbers, and emails. These patterns could be indicative of spam or need to be anonymized to protect privacy.

By conducting these data quality checks, we ensure that our dataset is clean, reliable, and ready for detailed analysis. This foundational step sets the stage for accurate insights and robust conclusions in the subsequent phases of our data processing pipeline.
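To make the structural checks concrete, here is a minimal standard-library sketch of how duplicates, null values, invalid ratings, and invalid dates might be flagged. It is not the `discover` implementation: the record layout, the field names (`id`, `rating`, `date`, `content`), the date format, and the helper names are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical review records; the field names are assumptions for illustration.
reviews = [
    {"id": "r1", "rating": 5, "date": "2023-01-15", "content": "Great app"},
    {"id": "r2", "rating": 7, "date": "2023-02-30", "content": "Crashes a lot"},
    {"id": "r1", "rating": 5, "date": "2023-01-15", "content": "Great app"},
    {"id": "r3", "rating": 3, "date": "2022-11-02", "content": None},
]


def is_valid_date(value, fmt="%Y-%m-%d"):
    """Return True if value parses as a real calendar date in the given format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except (TypeError, ValueError):
        return False


def flag_structural_issues(records):
    """Return one dict of boolean flags per record, in input order."""
    seen_ids = set()
    flags = []
    for record in records:
        flags.append(
            {
                # A record is a duplicate if its id appeared earlier in the input.
                "duplicate": record["id"] in seen_ids,
                "has_null": any(value is None for value in record.values()),
                # Assumes an integer rating scale of 1 to 5.
                "invalid_rating": record["rating"] not in range(1, 6),
                "invalid_date": not is_valid_date(record["date"]),
            }
        )
        seen_ids.add(record["id"])
    return flags


flags = flag_structural_issues(reviews)
for record, record_flags in zip(reviews, flags):
    print(record["id"], record_flags)
```

Flagging rather than dropping rows keeps the decision about treatment separate from detection, which matches the assess-then-treat order described here.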

In [2]:
import fasttext

from discover.app.dqa import DataQualityAnalysis
from discover.container import DiscoverContainer
from discover.infra.config.orchestration import OrchestrationConfigReader
from discover.orchestration.data_prep.stage import DataPrepStage

fasttext.FastText.eprint = lambda x: None

## Dependency Container

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.orchestration.data_prep.stage",
        "discover.orchestration.data_prep.dqa",
        "discover.app.base",
    ],
)

## Data Quality Assessment Pipeline
The data quality assessment process conducts the 9 data quality checks, marking the observations that require attention.
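As an illustration of what marking an observation could look like for the text-level checks, the sketch below flags URLs, email addresses, phone numbers, emojis, and an excessive share of special characters in a single review. The regular expressions are deliberately rough, the 0.3 threshold is an arbitrary assumption, and none of these names come from the `discover` package.

```python
import re

# Rough, illustrative patterns; production-grade detection needs more care.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Loosely matches North American style phone numbers only.
PHONE_RE = re.compile(r"(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")
# A few common emoji code point ranges; not an exhaustive definition.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def special_char_ratio(text):
    """Share of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 0.0
    special = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return special / len(text)


def flag_text(text, max_special_ratio=0.3):
    """Return boolean flags for one review; the threshold is an assumption."""
    return {
        "has_url": bool(URL_RE.search(text)),
        "has_email": bool(EMAIL_RE.search(text)),
        "has_phone": bool(PHONE_RE.search(text)),
        "has_emoji": bool(EMOJI_RE.search(text)),
        "excess_special_chars": special_char_ratio(text) > max_special_ratio,
    }


print(flag_text("Visit https://example.com or call 555-123-4567"))
```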

In [None]:
# Obtain the configuration
from discover.orchestration.data_prep.stage import DataPrepStageCache

reader = OrchestrationConfigReader()
config = reader.get_config("phases", namespace=False)
stage_config = config["dataprep"]["stages"][2]

# Build and run the Data Quality Assessment stage
stage = DataPrepStage.build(stage_config=stage_config, force=True)
asset_id = stage.run()

## Data Quality Impressions
Let's get a summary of the data quality issues by type.

In [None]:
dqa = DataQualityAnalysis()
dqa.summarize()

The data quality assessment (DQA) conducted on the AppVoC dataset revealed several key issues that need to be addressed to ensure the integrity and reliability of the analysis. These issues include:

- **Non-English**: Notable proportions of the app names (15%) and review content (4.9%) were flagged as being non-English.
- **Emoji**: Approximately 6% of the reviews have emoji characters.
- **Duplicates**: A small percentage (5.6%) of the reviews are duplicates.
- **Special Characters**: A small percentage (< 2%) of reviews contain an excessive proportion of special characters.
- **Profanity**: About 1% of the reviews have language considered profane.
- **Random Text**: Random text, indicated by high entropy scores, is present in less than 0.5% of the text.
- **Sensitive Information**: Sensitive information, such as email addresses and phone numbers, is relatively rare.

On the other hand:
- **Null Values**: Fortunately, there were no null values detected in the dataset.
- **Invalid Entries**: There were no invalid dates or invalid ratings found.
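The Random Text finding above refers to entropy scores. The notebook does not show how they are computed, so the following is only one plausible reading: character-level Shannon entropy, where text whose characters are spread evenly over many distinct symbols scores higher than repetitive, natural-looking text. The function name is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Character-level Shannon entropy in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


for sample in ["good good good", "xkqzvbnmwp"]:
    print(f"{sample!r}: {shannon_entropy(sample):.2f} bits/char")
```

Because the score is bounded by the logarithm of the text length, very short reviews cannot reach high values, so any cutoff would likely need to account for length.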

Given these findings, the next step is to visually inspect a sample of the anomalies, then identify and treat the high-impact data quality issues. 

## Data Anomalies

In [None]:
dqa.get_non_english_apps()

Treating these data quality issues may involve, without limitation:

- **Removing Duplicates**: Eliminating observations with duplicate review IDs.
- **Handling Outliers**: Identifying and appropriately managing outliers in vote sums and vote counts.
- **Addressing Non-English Text**: Filtering or translating non-English reviews.
- **Filtering Noise**: Removing excessive special characters from reviews.
- **Ensuring Clean Content**: Censoring or removing reviews containing profanity.
- **Removing Personal Data**: Removing personally identifiable information such as phone numbers and email addresses.
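Two of the treatments listed above, removing duplicate review IDs and removing personal data, can be sketched with the standard library alone. This is a hedged illustration rather than the project's cleaning code: the `id` key, the placeholder tokens, the helper names, and the simplified patterns are all assumptions.

```python
import re

# Simplified patterns for illustration; real-world formats vary widely.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def drop_duplicate_ids(records, key="id"):
    """Keep the first record seen for each id, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


def mask_personal_data(text):
    """Replace email addresses and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


print(mask_personal_data("Reach me at jane@example.com or 555-123-4567"))
```

Masking with a placeholder, rather than deleting the match, preserves the fact that contact details were present, which may itself be a useful spam signal.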

Cue the action!

In [7]:
from lingua import Language, LanguageDetectorBuilder

languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
detector = LanguageDetectorBuilder.from_languages(*languages).build()

In [None]:
import sklearn

print(sklearn.__version__)