In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
import warnings
warnings.filterwarnings("ignore")

# Cleaning

This stage involves systematically addressing anomalies identified during the data quality assessment.

## Approach to Addressing Anomalies

During the data quality assessment, we identified various anomalies within the dataset. Each anomaly was then evaluated to estimate its potential impact on subsequent analyses. This evaluation process categorized anomalies into four distinct levels of criticality:

1. **Critical**: These anomalies have a significant impact on the integrity and reliability of the data. If left unaddressed, they could severely distort the results of any analysis. Examples include duplicate records, non-English text (if the analysis is language-specific), and invalid ratings.

2. **High**: High-impact anomalies also pose a substantial threat to the validity of the analysis but are slightly less severe than critical issues. These include records with excessive special characters, profanity, and privacy-related issues such as email addresses or phone numbers embedded in the text.

3. **Medium**: Medium impact anomalies have a moderate effect on the analysis. While they do not necessarily distort results as severely as critical or high issues, they can still introduce noise and reduce the overall quality of insights. Examples include outliers in vote sums and vote counts.

4. **Low**: Low impact anomalies are considered minor issues that have minimal impact on the overall analysis. These include the presence of emojis and URLs in the text, which typically do not affect the analytical outcome significantly.

## Removal Criteria

Based on the criticality assessment, a systematic approach was adopted to handle these anomalies:

- **Critical and High Impact Issues**: Observations containing anomalies classified as critical or high impact were earmarked for removal. The rationale behind this strict approach is to eliminate any potential distortions in the analysis that could arise from these severe issues. By removing these observations, we ensure that the dataset maintains a high level of integrity and reliability.

- **Medium and Low Impact Issues**: Anomalies classified as medium or low impact were not grounds for removal. Observations with these issues were retained in the dataset to preserve as much data as possible while accepting a tolerable level of noise. This approach balances data quality against the need for a sufficient volume of data for robust analysis.
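
The removal rule above can be sketched with plain pandas. This is only an illustration, not the `DataCleaner` implementation: the first three flag names are taken from the pipeline's error output further down, their assignment to criticality levels is an assumption, and the medium and low column names (`dqa_vote_count_outlier`, `dqa_has_emoji`) as well as the helper `remove_flagged` are hypothetical.

```python
import pandas as pd

# Boolean flag columns grouped by criticality. The critical/high names appear
# in the pipeline's error message; their grouping here is an assumption, and
# the medium/low names are hypothetical placeholders.
FLAGS_BY_LEVEL = {
    "critical": ["dqa_is_duplicate_rating_id", "dqa_app_name_non_english"],
    "high": ["dqa_has_profanity"],
    "medium": ["dqa_vote_count_outlier"],  # hypothetical
    "low": ["dqa_has_emoji"],              # hypothetical
}


def remove_flagged(df, levels=("critical", "high")):
    """Drop rows flagged by any column belonging to the given levels."""
    cols = [c for level in levels for c in FLAGS_BY_LEVEL[level]]
    flagged = df[cols].any(axis=1)  # True if any selected flag is set
    return df.loc[~flagged].reset_index(drop=True)


reviews = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "dqa_is_duplicate_rating_id": [True, False, False, False],
    "dqa_app_name_non_english": [False, False, False, False],
    "dqa_has_profanity": [False, True, False, False],
    "dqa_vote_count_outlier": [False, False, True, False],
    "dqa_has_emoji": [False, False, False, True],
})

# Rows 1 and 2 carry critical/high flags and are dropped; rows 3 and 4 only
# carry medium/low flags and are retained.
cleaned = remove_flagged(reviews)
```

Keeping the level-to-column mapping in one place makes the removal policy easy to audit and to tighten or relax later.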

## Sorting
Reviews are sorted by date to support temporal analysis.
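
As a minimal sketch of the sorting step, assuming the review timestamp lives in a column named `date` (a hypothetical name; the actual column is not shown in this section):

```python
import pandas as pd

reviews = pd.DataFrame({
    "id": [1, 2, 3],
    "date": ["2023-05-01", "2021-01-15", "2022-09-30"],  # hypothetical column
})

# Parse to datetime first so the sort is chronological rather than lexical.
reviews["date"] = pd.to_datetime(reviews["date"])

# A stable sort keeps the original order of reviews that share a timestamp.
reviews_sorted = reviews.sort_values("date", kind="stable").reset_index(drop=True)
```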


In [2]:
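# Assumes the package is importable as appvocai_discover; hyphens are not valid in Python module names.
from appvocai_discover.data_prep.clean import DataCleaner, CleanConfig
from appvocai_discover.analysis.dqa import DataQualityAnalysis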

In [3]:
config = CleanConfig(force=True)
cleaner = DataCleaner(config=config)
data_clean = cleaner.run()



#                             DataCleaner Pipeline                             #



Error executing function 'DataCleaningTask.run': 'Column not found in the DataFrame: "[\'dqa_is_duplicate_rating_id\', \'dqa_app_name_non_english\', \'dqa_has_profanity\'] not in index"'


Task Reader completed successfully.


Exception occurred in DataCleaningTask.run called with <appvocai-discover.data_prep.clean.DataCleaningTask object at 0x7f0550ebb1c0>,                   id      app_id                        app_name category_id  \
0         1119912682   302584613                   Amazon Kindle        6018   
1          599135993   377951542           Crackle - Movies & TV        6016   
2          817378711   379693831    Audible: Audio Entertainment        6018   
3         1140598740   454638411                       Messenger        6005   
4         5104781144   912561374    Marco Polo - Video Messenger        6005   
...              ...         ...                             ...         ...   
22166586  7129296333  1075603018                      Funimation        6016   
22166587  8063238669  1492683521  TuckerMoji - Tucker Budzyn Dog        6016   
22166588  4677288812   316800034                         Workday        6000   
22166589  8030630370  1269081011      Zoe: Lesbian Dating & Chat  

KeyError: 'Column not found in the DataFrame: "[\'dqa_is_duplicate_rating_id\', \'dqa_app_name_non_english\', \'dqa_has_profanity\'] not in index"'

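Once the data cleaning stage completes successfully, the dataset is free from critical and high-impact anomalies. Note that the run captured above did not complete: `DataCleaningTask.run` raised a `KeyError` because the columns `dqa_is_duplicate_rating_id`, `dqa_app_name_non_english`, and `dqa_has_profanity` were not found in the DataFrame. The cleaner depends on these data quality flags, so they must be present in the input before the stage is rerun.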
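
A `KeyError` of the form "[...] not in index" is what pandas raises when a list of column labels is selected and some labels are absent. A small pre-flight check, sketched below, reports the missing columns before the cleaner runs. The helper `missing_columns` is hypothetical and not part of the project's API; the column names come from the error message.

```python
import pandas as pd

# Flag columns named in the KeyError raised by the cleaning task.
REQUIRED_FLAGS = [
    "dqa_is_duplicate_rating_id",
    "dqa_app_name_non_english",
    "dqa_has_profanity",
]


def missing_columns(df, required):
    """Return the required column names that are absent from df, in order."""
    return [c for c in required if c not in df.columns]


# Toy input that lacks two of the three flags.
df = pd.DataFrame({"id": [1, 2], "dqa_has_profanity": [False, True]})

missing = missing_columns(df, REQUIRED_FLAGS)
if missing:
    print(f"Run the data quality assessment first; missing columns: {missing}")
```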

## Validation
Let's verify that the critical and high impact issues have been addressed.

In [None]:

analyzer = DataQualityAnalysis()
results = analyzer.run_analysis(data=data_clean)
results
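
Independently of `DataQualityAnalysis`, the same check can be approximated by counting how many rows still carry each flag. This is a sketch under the assumption that the flags are boolean columns; `remaining_flags` is a hypothetical helper, and the toy frame stands in for `data_clean`.

```python
import pandas as pd


def remaining_flags(df, flag_columns):
    """Count rows still flagged, per flag column that exists in df."""
    present = [c for c in flag_columns if c in df.columns]
    return df[present].sum().astype(int).to_dict()


# Toy stand-in for a cleaned dataset: no critical or high flags remain.
data_clean_example = pd.DataFrame({
    "id": [3, 4],
    "dqa_is_duplicate_rating_id": [False, False],
    "dqa_has_profanity": [False, False],
})

counts = remaining_flags(
    data_clean_example,
    ["dqa_is_duplicate_rating_id", "dqa_has_profanity"],
)
# Every count should be zero after a successful cleaning run.
all_clear = all(n == 0 for n in counts.values())
```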

After a successful cleaning run, the results should show that the observations with critical and high-impact issues have been removed from the dataset. Next, we enrich the dataset with features that facilitate temporal and text analysis. Let's go!