In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
import warnings
warnings.filterwarnings("ignore")

# Cleaning

This stage involves systematically addressing anomalies identified during the data quality assessment.

## Approach to Addressing Anomalies

During the data quality assessment, we identified various anomalies within the dataset. Each anomaly was then evaluated to estimate its potential impact on the subsequent analysis. This evaluation process categorized anomalies into four distinct levels of criticality:

1. **Critical**: These anomalies have a significant impact on the integrity and reliability of the data. If left unaddressed, they could severely distort the results of any analysis. Examples include duplicate records, non-English text (if the analysis is language-specific), and invalid ratings.

2. **High**: High impact anomalies also pose a substantial threat to the validity of the analysis but are slightly less severe than critical issues. These include records with excessive special characters, profanity, and privacy-related issues such as email addresses or phone numbers embedded in the text.

3. **Medium**: Medium impact anomalies have a moderate effect on the analysis. While they do not necessarily distort results as severely as critical or high issues, they can still introduce noise and reduce the overall quality of insights. Examples include outliers in vote sums and vote counts, and unusually long reviews.

4. **Low**: Low impact anomalies are considered minor issues that have minimal impact on the overall analysis. These include the presence of emojis and URLs in the text, which typically do not affect the analytical outcome significantly.
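
One way to make these levels actionable is to encode them as an ordered enumeration and map each anomaly type to a level. The sketch below is illustrative only: the `Impact` enum, the `ANOMALY_IMPACT` mapping, and the anomaly keys are hypothetical names built from the examples in the list above, not the project's actual configuration.

```python
from enum import IntEnum


class Impact(IntEnum):
    """Criticality levels, ordered so that a higher value means more severe."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Hypothetical mapping of anomaly names to impact levels, taken from the
# examples in the list above; the real project may classify them differently.
ANOMALY_IMPACT = {
    "duplicate_record": Impact.CRITICAL,
    "non_english_text": Impact.CRITICAL,
    "invalid_rating": Impact.CRITICAL,
    "excessive_special_chars": Impact.HIGH,
    "profanity": Impact.HIGH,
    "contains_email": Impact.HIGH,
    "contains_phone": Impact.HIGH,
    "vote_sum_outlier": Impact.MEDIUM,
    "vote_count_outlier": Impact.MEDIUM,
    "review_length_outlier": Impact.MEDIUM,
    "contains_emoji": Impact.LOW,
    "contains_url": Impact.LOW,
}


def anomalies_at_or_above(level: Impact) -> list[str]:
    """Return the anomaly names whose impact is at least `level`."""
    return [name for name, impact in ANOMALY_IMPACT.items() if impact >= level]
```

Calling `anomalies_at_or_above(Impact.HIGH)` returns the seven critical and high entries, which is the set a removal step would act on.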

## Removal Criteria

Based on the criticality assessment, a systematic approach was adopted to handle these anomalies:

- **Critical and High Impact Issues**: Observations containing anomalies classified as critical or high impact were marked for removal. These issues are severe enough to distort the analysis, so removing the affected observations protects the integrity and reliability of the dataset.

- **Medium and Low Impact Issues**: Anomalies classified as medium or low impact were not grounds for removal. Observations with these issues were retained to preserve as much data as possible, accepting a tolerable level of noise. This approach balances data quality against the need for a sufficient volume of data for robust analysis.
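
The removal rule above can be sketched with pandas. This is a minimal illustration, not the `DataCleaner` implementation: it assumes each quality check has been recorded as a boolean flag column, and the column names (`is_duplicate`, `has_profanity`, `has_emoji`) are hypothetical.

```python
import pandas as pd

# Toy data with hypothetical boolean flag columns, one per anomaly check.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "is_duplicate": [False, True, False, False],   # critical
    "has_profanity": [False, False, True, False],  # high
    "has_emoji": [True, False, False, True],       # low
})

# Only critical and high impact flags trigger removal.
removal_flags = ["is_duplicate", "has_profanity"]

# A row is dropped if any removal flag is set; low impact flags are ignored.
to_remove = df[removal_flags].any(axis=1)
cleaned = df.loc[~to_remove].reset_index(drop=True)

print(f"Original: {len(df)} rows, cleaned: {len(cleaned)} rows, "
      f"removed: {int(to_remove.sum())} rows")
```

Rows 2 and 3 are dropped because they carry a critical or high impact flag, while rows 1 and 4 survive even though they contain emojis.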

## Sorting
Reviews are sorted by date to support temporal analysis.
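
As a rough sketch of this step, sorting a DataFrame chronologically in pandas might look like the following. The column name `date` and the sample values are assumptions for illustration; the actual schema may differ.

```python
import pandas as pd

# Toy reviews with a hypothetical `date` column stored as strings.
reviews = pd.DataFrame({
    "id": [1, 2, 3],
    "date": ["2023-05-01 10:00:00", "2021-01-15 08:30:00", "2022-11-20 17:45:00"],
})

# Parse to datetime first so the sort is chronological rather than lexical.
reviews["date"] = pd.to_datetime(reviews["date"])

# A stable sort keeps the original relative order of reviews with equal dates.
reviews_sorted = reviews.sort_values("date", kind="stable").reset_index(drop=True)
```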


In [2]:
from appvocai_discover.data_prep.clean import DataCleaner, CleanConfig
from appvocai_discover.analysis.dqa import DataQualityAnalysis

In [3]:
config = CleanConfig(force=True)
cleaner = DataCleaner(config=config)
data_clean = cleaner.execute()



#                             DataCleaner Pipeline                             #

Task ReadTask completed successfully.


                            AppInsight Data Cleaning                            
                      Original DataFrame | 18306 rows
                       Cleaned DataFrame | 15138 rows
                    Removed Observations | 3168 rows


Task DataCleaningTask completed successfully.
Task WriteTask completed successfully.


                                  DataCleaner                                   
                          Pipeline Start | 2024-06-07 02:51:26.286717
                           Pipeline Stop | 2024-06-07 02:51:26.548644
                        Pipeline Runtime | 00 Minutes 00.261927 Seconds

The cleaning stage removed 3,168 of the 18,306 original observations (about 17.3%), leaving 15,138 rows. These removals targeted the observations flagged with critical and high impact anomalies.

## Validation
Let's verify that the critical and high impact issues have been addressed.

In [4]:

analyzer = DataQualityAnalysis()
results = analyzer.run_analysis(data=data_clean)
results

Analyzing Duplication
Analyzing Duplication by id
Analyzing Null Values
Analyzing Non-English Reviews
Analyzing Non-English Reviews
Analyzing Emojis
Analyzing Excessive Special Characters
Analyzing Invalid Dates
Analyzing Invalid Ratings
Analyzing Profanity
Analyzing Emails in Reviews
Analyzing URLs in Reviews
Analyzing Phone Numbers in Reviews
Analyzing Outliers in vote_count
Analyzing Outliers in vote_sum
Analyzing Outliers in review_length


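|    | Characteristic               | Impact   | Count | Percent |
|----|------------------------------|----------|-------|---------|
| 0  | Duplicate Values             | Critical | 0     | 0.0     |
| 1  | Duplicate IDs                | Critical | 0     | 0.0     |
| 3  | Non-English Review           | Critical | 0     | 0.0     |
| 4  | Non-English App Name         | Critical | 0     | 0.0     |
| 8  | Invalid Ratings              | Critical | 0     | 0.0     |
| 2  | Null Values                  | High     | 0     | 0.0     |
| 6  | Excessive Special Characters | High     | 0     | 0.0     |
| 7  | Invalid Dates                | High     | 0     | 0.0     |
| 9  | Profanity                    | High     | 0     | 0.0     |
| 10 | Contains Email Address(es)   | High     | 0     | 0.0     |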
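
Rather than reading the table by eye, the same check could be automated. The sketch below assumes the analysis result is a pandas DataFrame with `Characteristic`, `Impact`, and `Count` columns, as the output above suggests; the `remaining_issues` helper is hypothetical, and the stand-in table reproduces only a few of the rows shown.

```python
import pandas as pd

# Stand-in for the results table above (a subset of its rows and columns).
results = pd.DataFrame({
    "Characteristic": ["Duplicate Values", "Invalid Ratings", "Null Values", "Profanity"],
    "Impact": ["Critical", "Critical", "High", "High"],
    "Count": [0, 0, 0, 0],
})


def remaining_issues(results: pd.DataFrame,
                     levels: tuple = ("Critical", "High")) -> pd.DataFrame:
    """Return rows at the given impact levels that still have a nonzero count."""
    mask = results["Impact"].isin(levels) & (results["Count"] > 0)
    return results.loc[mask]


# An empty result means no critical or high impact issue remains.
assert remaining_issues(results).empty
```

A check like this could run at the end of the pipeline so that a regression in the cleaning logic fails loudly instead of passing unnoticed.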


Every critical and high impact check listed above now reports a count of zero, confirming that the affected observations were removed from the dataset. Next, a spot of feature engineering.