In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
import warnings
warnings.filterwarnings("ignore")

# Data Quality Assessment
The second stage of data processing is the Data Quality Assessment. This stage ensures that our dataset is ready for subsequent analysis and modeling tasks. By identifying and rectifying data quality issues early, we can avoid potential pitfalls that might compromise the integrity and accuracy of our results.

## Data Quality Checks
In this stage, we employ a series of tasks designed to identify and address any noise or irregularities within the dataset. Each task focuses on a specific aspect of data quality, ranging from detecting duplicate entries to identifying profanity, special patterns, and other potential sources of bias or distortion.
1. **Duplicate Rows**: We identify and remove duplicate entries to ensure that each observation is unique, preventing skewed analyses and inflated metrics.
2. **Null Values**: We detect and handle missing data appropriately, which could involve imputation, deletion, or flagging incomplete records for further investigation.
3. **Outliers**: We check numeric columns for outliers using the non-parametric Interquartile Range (IQR) method.
4. **Non-English Text**: We check for and address non-English text, as it may not be relevant to our analysis or could require special handling.
5. **Emojis**: Emojis can carry significant meaning in certain contexts but might also introduce noise. We identify and decide on their treatment—whether to retain, remove, or translate them into textual representations.
6. **Excessive Special Characters**: Special characters can disrupt text analysis and need to be managed, either by cleaning or encoding them appropriately.
7. **Invalid Dates**: We verify that date values fall within expected ranges and formats, correcting or flagging anomalies for further review.
8. **Invalid Ratings**: Ratings that fall outside the expected scale (e.g., 1 to 5) are identified and corrected or flagged.
9. **Profanity**: We detect and handle profane content to ensure that our dataset adheres to appropriate usage standards, especially if it's intended for public or sensitive applications.
10. **Special Patterns**: We identify and manage special patterns such as URLs, phone numbers, and emails. These patterns could be indicative of spam or need to be anonymized to protect privacy.
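
To make the IQR check from item 3 concrete, here is a minimal sketch using only the standard library. The function name `iqr_outlier_flags`, the conventional multiplier `k=1.5`, and the sample vote counts are illustrative assumptions; they are not taken from the `DataQualityAssessment` implementation.

```python
from statistics import quantiles


def iqr_outlier_flags(values, k=1.5):
    """Mark values that fall outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper for illustration; k=1.5 is the conventional
    multiplier, not a value confirmed by the source.
    """
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]


# Hypothetical vote counts: one review has far more votes than the rest.
vote_counts = [0, 1, 1, 2, 2, 3, 3, 4, 50]
flags = iqr_outlier_flags(vote_counts)
```

Because the fences are derived from quartiles rather than from the mean and standard deviation, a single extreme value does little to shift the thresholds, which is why the method suits heavily skewed counts such as votes.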

By conducting these data quality checks, we ensure that our dataset is clean, reliable, and ready for detailed analysis. This foundational step sets the stage for accurate insights and robust conclusions in the subsequent phases of our data processing pipeline.
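
For the text-oriented checks in the list above (special patterns and excessive special characters), the following sketch shows one way such flags could be computed. The regular expressions are deliberately simple, and the flag names, the `max_special_ratio=0.3` threshold, and the sample text are all hypothetical; the actual rules used by the pipeline may differ.

```python
import re

# Illustrative patterns only; production-grade matching needs more care.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")  # US-style numbers only


def special_char_ratio(text):
    """Share of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 0.0
    special = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return special / len(text)


def text_flags(text, max_special_ratio=0.3):
    """Return boolean quality flags for one piece of text (hypothetical names)."""
    return {
        "has_url": bool(URL_RE.search(text)),
        "has_email": bool(EMAIL_RE.search(text)),
        "has_phone": bool(PHONE_RE.search(text)),
        "has_excessive_special_chars": special_char_ratio(text) > max_special_ratio,
    }


sample = "Great app!!! Contact me at jane@example.com or visit https://example.com"
flags = text_flags(sample)
```

Returning flags rather than modifying the text keeps detection separate from treatment, so the decision to remove, mask, or retain a match can be made later.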

In [2]:
# Python module names cannot contain hyphens; the underscore form below is an assumed package name.
from appvocai_genailab.data.prep.dqm import DataQualityAssessment, DQAConfig

We've encapsulated the data quality assessment process in a `DataQualityAssessment` class. Configured with source and target files, this class conducts the 10 data quality checks, marking the observations that require attention.

In [2]:
config = DQAConfig(force=False)
dqa = DataQualityAssessment(config=config)
data = dqa.execute()

Let's get a summary of the data quality issues by type.

In [3]:
dqa.overview()

| Check | Count | Percent |
|---|---|---|
| dqa_is_duplicate | 4 | 0.02 |
| dqa_is_duplicate_rating_id | 5 | 0.03 |
| dqa_has_null | 5 | 0.03 |
| dqa_vote_sum_outlier | 749 | 4.09 |
| dqa_vote_count_outlier | 995 | 5.44 |
| dqa_eda_review_length_outlier | 1234 | 6.74 |
| dqa_non_english | 937 | 5.12 |
| dqa_has_excessive_special_chars | 204 | 1.11 |
| dqa_date_invalid | 0 | 0.0 |
| dqa_rating_invalid | 0 | 0.0 |
dqa_rating_invalid,0,0.0


The data quality assessment (DQA) revealed several issues. A handful of records were duplicates (4) or had duplicate rating IDs (5), and 5 records (0.03%) contained null values. No invalid dates or invalid ratings were found. Outliers were flagged in vote sums (4.09%), vote counts (5.44%), and review lengths (6.74%). About 5% of records were flagged as non-English and about 1% as containing excessive special characters. Instances of profanity and of email addresses, URLs, or phone numbers were minimal, although those checks are not itemized in the summary above.

Next, we move to the cleaning stage.