In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
import warnings
warnings.filterwarnings("ignore")

# Cleaning

This stage systematically addresses the anomalies identified during the data quality assessment, a prerequisite for reliable and valid downstream analysis.

## Approach to Addressing Anomalies

During the data quality assessment, we identified various anomalies within the dataset. Each anomaly was then evaluated to estimate its potential impact on the subsequent analysis. This evaluation process categorized anomalies into four distinct levels of criticality:

1. **Critical**: These anomalies have a significant impact on the integrity and reliability of the data. If left unaddressed, they could severely distort the results of any analysis. Examples include duplicate records, non-English text (if the analysis is language-specific), and invalid ratings.

2. **High**: High-impact anomalies also pose a substantial threat to the validity of the analysis but are slightly less severe than critical issues. These include records with excessive special characters, profanity, and privacy-related issues such as email addresses or phone numbers embedded in the text.

3. **Medium**: Medium-impact anomalies have a moderate effect on the analysis. They do not distort results as severely as critical or high-impact issues, but they can still introduce noise and reduce the quality of insights. Examples include outliers in vote sums and vote counts, and unusually long reviews.

4. **Low**: Low-impact anomalies are minor issues with minimal effect on the overall analysis. These include emojis and URLs in the text, which typically do not change the analytical outcome significantly.

## Removal Criteria

Based on the criticality assessment, a systematic approach was adopted to handle these anomalies:

- **Critical and High-Impact Issues**: Observations containing anomalies classified as critical or high impact were earmarked for removal. This strict rule eliminates the distortions these severe issues could introduce into the analysis, at the cost of some data volume.

- **Medium and Low-Impact Issues**: Anomalies classified as medium or low impact were not grounds for removal. The affected observations were retained to preserve as much data as possible, accepting a tolerable level of noise. This balances data quality against the need for a sufficient volume of data for robust analysis.

Applying this rule consistently yields a dataset that is free of the most damaging anomalies while keeping as many observations as possible for downstream analysis.
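The removal rule can be sketched with pandas as follows. This is a minimal illustration, not the project's cleaning code: it assumes each anomaly is stored as a boolean `dqa_*` column (the three removal flags are borrowed from the configuration shown later in this page), and the sample rows, the `drop_flagged` helper, and the low-impact `dqa_has_emoji` column are hypothetical.

```python
import pandas as pd

# Critical / high-impact flags that trigger removal (subset, for illustration).
REMOVAL_FLAGS = ["dqa_is_duplicate", "dqa_non_english", "dqa_has_profanity"]


def drop_flagged(df: pd.DataFrame, flags: list[str]) -> pd.DataFrame:
    """Return only the rows for which none of the given boolean flags is set."""
    flagged = df[flags].any(axis=1)
    return df.loc[~flagged].reset_index(drop=True)


# Made-up sample data.
reviews = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "dqa_is_duplicate": [False, True, False, False],
        "dqa_non_english": [False, False, False, True],
        "dqa_has_profanity": [False, False, False, False],
        # Hypothetical low-impact flag: it never triggers removal.
        "dqa_has_emoji": [True, False, True, False],
    }
)

clean = drop_flagged(reviews, REMOVAL_FLAGS)
print(f"removed={len(reviews) - len(clean)}, retained={len(clean)}")
```

Rows 2 and 4 carry a critical flag and are dropped; rows 1 and 3 are kept even though they carry the low-impact flag.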

In [2]:
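# A hyphen is not valid in a Python module path; this assumes the package
# is importable as `appvocai_discover`.
from appvocai_discover.data.prep.clean import DataCleaner, CleanConfig
from appvocai_discover.analysis.dqa import DataQualityAnalysisConfig, DataQualityAnalysis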

ModuleNotFoundError: No module named 'appvocai-discover.data'

## Configuration
A configuration object was created to map each identified anomaly to its corresponding impact level. This configuration facilitated an organized and consistent approach to anomaly handling.

In [2]:
config = CleanConfig(force=False)
config.config

|   | Issue | Characteristic | Impact |
|---|-------|----------------|--------|
| 0 | dqa_is_duplicate | Duplicate Values | Critical |
| 1 | dqa_is_duplicate_rating_id | Duplicate IDs | Critical |
| 2 | dqa_non_english | Non-English Reviews | Critical |
| 3 | dqa_rating_invalid | Invalid Ratings | Critical |
| 4 | dqa_has_null | Null Values | High |
| 5 | dqa_has_excessive_special_chars | Excessive Special Characters | High |
| 6 | dqa_date_invalid | Invalid Dates | High |
| 7 | dqa_has_profanity | Profanity | High |
| 8 | dqa_contains_email | Contains Email Address(es) | High |
| 9 | dqa_contains_phone_number | Contains Phone Number(s) | High |
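One way to think about such a configuration is as a mapping from issue flag to an ordered impact level, from which the removal flags can be derived with a threshold. The sketch below is an assumption about how this could be modelled, not the internals of `CleanConfig`: the `Impact` enum, the `issues_at_or_above` helper, and the low-impact `dqa_has_url` entry are hypothetical, while the other flag names come from the output above.

```python
from enum import IntEnum


class Impact(IntEnum):
    """Ordered impact levels, so they can be compared against a threshold."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Subset of the issue-to-impact mapping shown in the output above.
ISSUE_IMPACT = {
    "dqa_is_duplicate": Impact.CRITICAL,
    "dqa_rating_invalid": Impact.CRITICAL,
    "dqa_has_null": Impact.HIGH,
    "dqa_contains_email": Impact.HIGH,
    # Hypothetical low-impact flag, included to show what gets filtered out.
    "dqa_has_url": Impact.LOW,
}


def issues_at_or_above(mapping: dict[str, Impact], threshold: Impact) -> list[str]:
    """Return the issue flags whose impact is at least `threshold`."""
    return [issue for issue, impact in mapping.items() if impact >= threshold]


removal_flags = issues_at_or_above(ISSUE_IMPACT, Impact.HIGH)
print(removal_flags)
```

Keeping the threshold in one place makes it easy to tighten or relax the removal rule without touching the list of issues.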


## Execution
The `DataCleaner` object encapsulates the data cleaning pipeline. Observations flagged with critical or high-impact anomalies were removed, while those with only medium or low-impact issues were retained. The process was designed to be transparent, reporting the number of observations removed and retained.


In [3]:
cleaner = DataCleaner(config=config)
data_clean = cleaner.execute()

DataCleaner endpoint already exists. Returning prior results.
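The message above, together with `force=False` in the configuration, suggests that the pipeline skips work when its output already exists. The following is only a sketch of that general pattern under stated assumptions, not the `DataCleaner` implementation: the `run_once` helper, the JSON file used as the endpoint, and the result values are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def run_once(endpoint: Path, compute, force: bool = False):
    """Run `compute` and persist its result, unless a prior result exists."""
    if endpoint.exists() and not force:
        print(f"{endpoint.name} already exists. Returning prior results.")
        return json.loads(endpoint.read_text())
    result = compute()
    endpoint.write_text(json.dumps(result))
    return result


calls = []  # records how many times the expensive step actually ran


def expensive_clean():
    calls.append(1)
    return {"retained": 2, "removed": 2}  # made-up counts


with tempfile.TemporaryDirectory() as tmp:
    endpoint = Path(tmp) / "clean.json"
    first = run_once(endpoint, expensive_clean)               # computes and stores
    second = run_once(endpoint, expensive_clean)              # served from disk
    third = run_once(endpoint, expensive_clean, force=True)   # recomputes

print(f"compute ran {len(calls)} times")
```

Under this pattern, rerunning a notebook is cheap, and setting `force=True` is the explicit way to refresh results after the inputs or the rules change.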


With the cleaning stage complete, the dataset should be free of critical and high-impact anomalies. The validation step below confirms this.

## Validation
Let's verify that the critical and high-impact issues have been addressed by rerunning the data quality analysis.

In [4]:
# Use a distinct name so the CleanConfig object bound to `config` is not shadowed.
dqa_config = DataQualityAnalysisConfig()
analyzer = DataQualityAnalysis(config=dqa_config)
results = analyzer.execute()
results

DataQualityAnalysis endpoint already exists. Returning prior results.


|    | Characteristic | Impact | Count | Percent |
|----|----------------|--------|-------|---------|
| 0  | Duplicate Values | Critical | 0 | 0.0 |
| 1  | Duplicate IDs | Critical | 0 | 0.0 |
| 3  | Non-English Reviews | Critical | 0 | 0.0 |
| 7  | Invalid Ratings | Critical | 0 | 0.0 |
| 2  | Null Values | High | 0 | 0.0 |
| 5  | Excessive Special Characters | High | 0 | 0.0 |
| 6  | Invalid Dates | High | 0 | 0.0 |
| 8  | Profanity | High | 0 | 0.0 |
| 9  | Contains Email Address(es) | High | 0 | 0.0 |
| 11 | Contains Phone Number(s) | High | 0 | 0.0 |
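Rather than reading the output by eye, the same check can be asserted programmatically. The sketch below assumes the results are a pandas DataFrame with the `Characteristic`, `Impact`, and `Count` columns shown above; the `remaining_issues` helper is hypothetical, and the frame is rebuilt by hand from a subset of those rows so the example runs on its own.

```python
import pandas as pd

# A subset of the validation output above, rebuilt by hand for illustration.
results = pd.DataFrame(
    {
        "Characteristic": ["Duplicate Values", "Invalid Ratings", "Null Values", "Profanity"],
        "Impact": ["Critical", "Critical", "High", "High"],
        "Count": [0, 0, 0, 0],
    }
)


def remaining_issues(results: pd.DataFrame, impacts=("Critical", "High")) -> pd.DataFrame:
    """Return rows for the given impact levels that still have a nonzero count."""
    mask = results["Impact"].isin(impacts) & (results["Count"] > 0)
    return results.loc[mask]


leftover = remaining_issues(results)
assert leftover.empty, f"Unresolved issues:\n{leftover}"
```

An assertion like this fails loudly if a later change to the cleaning rules lets a critical or high-impact issue slip through.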


The results confirm that no observations with critical or high-impact issues remain in the dataset: every count is zero. Next, a spot of feature engineering.