In [1]:
import warnings

warnings.filterwarnings("ignore")
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

# Data Quality Assessment
By identifying and rectifying data quality issues early, we can avoid potential pitfalls that might compromise the integrity and accuracy of downstream modeling efforts. In this stage, we employ a series of tasks to identify and address any noise or irregularities within the dataset. 

2. **Non-English Text**: We check for and address non-English text in reviews and app names, as they may not be relevant to our analysis or could require special handling.
3. **Inappropriate Content**: URLs, phone numbers, email addresses, and other personally identifiable information are treated as spam and will be removed from the dataset or masked.
4. **Profanity**: We detect and handle profane content to ensure that our dataset adheres to appropriate usage standards, especially if it's intended for public or sensitive applications.
5. **Excessive Special Characters**: Special characters can disrupt text analysis and need to be managed, either by cleaning or encoding them appropriately.
6. **Emojis**: Emojis can carry significant meaning in certain contexts but might also introduce noise. We identify and decide on their treatment—whether to retain, remove, or translate them into textual representations.
7. **Invalid Dates**: We verify that date values fall within expected ranges and formats, correcting or flagging anomalies for further review.
8. **Invalid Ratings**: Ratings that fall outside the expected scale (e.g., 1 to 5) are identified and corrected or flagged.
9. **Duplicate Rows**: We identify and remove duplicate entries to ensure that each observation is unique, preventing skewed analyses and inflated metrics.
10. **Null Values**: We detect and handle missing data appropriately, which could involve imputation, deletion, or flagging incomplete records for further investigation.
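
Several of the text-level checks above (URLs, email addresses, phone numbers, excessive special characters) can be approximated with regular expressions. The following is a minimal sketch under stated assumptions, not the project's implementation: the patterns, the `flag_review` helper, and the `max_special_ratio` threshold of 0.3 are all hypothetical and chosen only for illustration.

```python
import re

# Illustrative patterns only; the expressions the pipeline actually uses are not shown in this notebook.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?:\+?\d{1,3}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def special_char_ratio(text: str) -> float:
    """Fraction of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 0.0
    special = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return special / len(text)


def flag_review(text: str, max_special_ratio: float = 0.3) -> dict:
    """Return boolean flags for one review (hypothetical flag names)."""
    return {
        "has_url": bool(URL_RE.search(text)),
        "has_email": bool(EMAIL_RE.search(text)),
        "has_phone": bool(PHONE_RE.search(text)),
        "excess_special_chars": special_char_ratio(text) > max_special_ratio,
    }
```

Note that this simple ratio also counts emoji as special characters, so a real pipeline would likely detect emoji separately before applying such a threshold.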


By conducting these data quality checks, we ensure that our dataset is clean, reliable, and ready for detailed analysis. This foundational step sets the stage for accurate insights and robust conclusions in the subsequent phases of our data processing pipeline.
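
The structural checks (invalid dates, invalid ratings, duplicate reviews, null values) amount to simple per-row predicates. As a rough sketch using only the standard library, and assuming hypothetical column names (`content`, `rating`, `date`) and a caller-supplied valid date range, they might look like this:

```python
from collections import Counter
from datetime import datetime


def structural_flags(rows, min_date, max_date, rating_range=(1, 5)):
    """Return one dict of boolean flags per row.

    Column names and flag names are assumptions made for this example.
    """
    low, high = rating_range
    # Count how often each review text occurs so repeated text can be flagged.
    content_counts = Counter(r.get("content") for r in rows)
    flags = []
    for r in rows:
        rating, date = r.get("rating"), r.get("date")
        flags.append(
            {
                "null_value": any(v is None for v in r.values()),
                "invalid_rating": rating is None or not (low <= rating <= high),
                "invalid_date": date is None or not (min_date <= date <= max_date),
                "duplicate_review": content_counts[r.get("content")] > 1,
            }
        )
    return flags
```

This version flags every copy of a repeated review, including the first occurrence; whether the pipeline's duplicate check does the same is not stated here.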

In [2]:
import fasttext

from discover.analysis.dqa import DataQualityAnalysis
from discover.container import DiscoverContainer
from discover.infra.config.flow import FlowConfigReader
from discover.flow.data_prep.dqa.stage import DQAStage
from discover.infra.service.cache.cache import DiscoverCache

fasttext.FastText.eprint = lambda x: None

## Parameters

In [3]:
RESET_CACHE = False

## Dependency Container

In [4]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.data_prep.stage",
        "discover.flow.data_prep.dqa",
        "discover.analysis.base",
    ],
)

## Cache Management
Changes to pipelines may necessitate resetting the cache. Here's the place where that happens.

In [5]:
if RESET_CACHE:
    cache = DiscoverCache()
    cache.reset()


## Data Quality Assessment Pipeline
The data quality assessment process conducts the 9 data quality checks, marking the observations that require attention.

In [None]:
# Obtain the configuration
reader = FlowConfigReader()
config = reader.get_config("phases", namespace=False)
stage_config = config["dataprep"]["stages"][2]

# Build and run the Data Quality Assessment stage
stage = DQAStage.build(stage_config=stage_config, force=True)
asset_id = stage.run()

[10/20/2024 01:12:32 AM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [_remove_dataset_file_by_filepath] : Removed dataset file at workspace/dev/dataset/01_dataprep/appvocai_discover-01_dataprep-02_dqa-review-dataset.parquet from repository.
[10/20/2024 01:12:32 AM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [remove] : Removed dataset dataset-dataprep-dqa-review from the repository.




#                        Data Quality Assessment Stage                         #

Starting Data Quality Assessment Stage Sun, 20 Oct 2024 01:12:32

	Starting EntropyTask Sun, 20 Oct 2024 01:12:33
	EntropyTask detected 193 anomalies in the dqa_entropy column, 0.33% of 59021 records.
	Completed EntropyTask Sun, 20 Oct 2024 01:12:33. Runtime: 0.95 seconds

	Starting DetectDuplicateTask Sun, 20 Oct 2024 01:12:33
	DetectDuplicateTask detected 0 anomalies in the dqa_duplicate_rows column, 0.0% of 59021 records.
	Completed DetectDuplicateTask Sun, 20 Oct 2024 01:12:34. Runtime: 0.16 seconds

	Starting DetectDuplicateTask Sun, 20 Oct 2024 01:12:34
	DetectDuplicateTask detected 0 anomalies in the dqa_duplicate_id column, 0.0% of 59021 records.
	Completed DetectDuplicateTask Sun, 20 Oct 2024 01:12:34. Runtime: 0.01 seconds

	Starting DetectDuplicateTask Sun, 20 Oct 2024 01:12:34
	DetectDuplicateTask detected 3282 anomalies in the dqa_duplicate_review column, 5.56% of 59021 records.
	Completed 

## Data Quality Impressions
Let's get a summary of the data quality issues by type.

In [7]:
dqa = DataQualityAnalysis()
dqa.summarize()

[10/20/2024 01:09:35 AM] [ERROR] [discover.infra.persistence.dal.fao.base.CentralizedFileSystemFAO] [_read] : Exception occurred while reading a Parquet file from workspace/dev/dataset/01_dataprep/appvocai_discover-01_dataprep-02_dqa-review-dataset.parquet.
Cannot yet unify dictionaries with nulls
Traceback (most recent call last):
  File "/home/john/projects/appvocai-discover/discover/infra/persistence/dal/fao/centralized.py", line 68, in _read
    return pd.read_parquet(path=filepath, **self._storage_config["read_kwargs"])
  File "/home/john/miniconda3/envs/appvocai/lib/python3.10/site-packages/pandas/io/parquet.py", line 667, in read_parquet
    return impl.read(
  File "/home/john/miniconda3/envs/appvocai/lib/python3.10/site-packages/pandas/io/parquet.py", line 281, in read
    result = pa_table.to_pandas(**to_pandas_kwargs)
  File "pyarrow/array.pxi", line 885, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 5002, in pyarrow.lib.Table._to_pandas
  File

DatasetIOError: Exception occurred while reading dataset dataset-dataprep-dqa-review contents from file.
Original exception: Exception occurred while reading a Parquet file from workspace/dev/dataset/01_dataprep/appvocai_discover-01_dataprep-02_dqa-review-dataset.parquet.
Cannot yet unify dictionaries with nulls
Original exception: Cannot yet unify dictionaries with nulls

The data quality assessment (DQA) conducted on the AppVoC dataset revealed several key issues that need to be addressed to ensure the integrity and reliability of the analysis. These issues include:

- **Non-English**: Notable proportions of the app names (15%) and review content (4.9%) were flagged as being non-English.
- **Emoji**: Approximately 6% of the reviews have emoji characters.
- **Duplicates**: About 5.6% of the reviews duplicate the text of another review; no duplicate rows or duplicate review IDs were detected.
- **Special Characters**: A small percentage (< 2%) of reviews were noted for the presence of special characters in excessive proportions.
- **Profanity**: About 1% of the reviews have language considered profane.
- **Random Text**: Random text, indicated by anomalous entropy scores, is present in less than 0.5% of the reviews.
- **Sensitive Information**: Sensitive information, such as email addresses and phone numbers, is relatively rare.

On the other hand:
- **Null Values**: Fortunately, there were no null values detected in the dataset.
- **Invalid Entries**: There were no invalid dates or invalid ratings found.
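
To make the entropy-based check for random text more concrete, here is a minimal sketch of character-level Shannon entropy, $H = \sum_i p_i \log_2 (1/p_i)$, where $p_i$ is the relative frequency of character $i$. This is an assumption about the general technique; the exact measure and threshold used by `EntropyTask` are not shown in this notebook.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    counts = Counter(text)
    # Each character contributes p * log2(1 / p), with p = count / n.
    return sum((c / n) * math.log2(n / c) for c in counts.values())
```

A review made of one repeated character scores 0 bits, while a string of many distinct, evenly used characters scores higher. Because this unnormalized score also tends to grow with the number of distinct characters, a practical threshold would need tuning against the data.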

Given these findings, the next step is to visually inspect a sample of the anomalies, then identify and treat the high-impact data quality issues. 

## Data Anomalies

In [None]:
dqa.get_non_english_apps()

Treating these issues may involve, without limitation:

- **Removing Duplicates**: Eliminating observations with duplicate review IDs.
- **Handling Outliers**: Identifying and appropriately managing outliers in vote sums and vote counts.
- **Addressing Non-English Text**: Filtering or translating non-English reviews.
- **Filtering Noise**: Filtering or removing excessive special characters from reviews.
- **Ensuring Clean Content**: Censoring or removing reviews containing profanity.
- **Removing Personal Data**: Removing or masking personally identifiable information such as phone numbers and email addresses.
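
Two of these treatments, masking personal data and dropping repeated reviews, can be sketched with the standard library alone. The regular expressions, the `[EMAIL]` and `[PHONE]` placeholders, and the `content` key are hypothetical choices for illustration, not the project's actual cleaning code.

```python
import re

# Illustrative patterns; real-world email and phone formats vary more widely.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?:\+?\d{1,3}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def mask_pii(text: str) -> str:
    """Replace email addresses and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def drop_duplicate_reviews(rows, key="content"):
    """Keep the first row for each distinct value of `key`, preserving order."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] in seen:
            continue
        seen.add(row[key])
        kept.append(row)
    return kept
```

Masking rather than deleting keeps the rest of the review available for analysis, which may matter when the surrounding text is otherwise informative.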

Cue the action!