# ydata-quality
> The *ydata_quality* package aims to be for Data Quality what *sklearn* is for machine learning, *matplotlib* for visualization or *pandas* for data manipulation.

The `ydata_quality` package aims to be the go-to package for assessing Data Quality across the stages of data pipeline development. Once you have a dataset available, running `DataQuality(df=my_df).report()` provides a comprehensive overview of the details and intricacies of the data, from the perspective of the multiple modules available in the package.

In this tutorial, we will (1) load a dataset, (2) analyze its quality issues, (3) apply strategies to mitigate them, and (4) re-run the quality analysis on the post-processed (cleaned) data.

## Quick Start
Load a DataFrame and evaluate your data using `DataQuality`.
For more advanced analysis, we can provide additional arguments but we will get there in a minute.

In [29]:
%%capture
from ydata_quality import DataQuality
import pandas as pd

df = pd.read_csv('../datasets/transformed/census_10k_v3.csv') # load data
dq = DataQuality(df=df) # create the main class that holds all quality modules
results = dq.evaluate() # run the tests

In [30]:
dq.report() # Output a report of the quality issues found by the engines

	[DUPLICATE COLUMNS] Found 1 columns with exactly the same feature values as other columns. (Priority 1: heavy impact expected)
	[PREDEFINED VALUED MISSING VALUES] Found 1960 vmvs in the dataset. (Priority 2: usage allowed, limited human intelligibility)
	[FLATLINES] Found 4627 flatline events with a minimun length of 5 among the columns {'relationship', 'workclass2', 'workclass', 'marital-status', 'capital-gain', 'capital-loss', 'sex', 'education-num', 'income', 'education', 'native-country', 'occupation', 'hours-per-week', 'race'}. (Priority 2: usage allowed, limited human intelligibility)
	[EXACT DUPLICATES] Found 3 instances with exact duplicate feature values. (Priority 2: usage allowed, limited human intelligibility)


The report lists multiple warnings with different priorities (a lower number means higher importance). The warnings were generated automatically by the default tests implemented in each module. The report gives an overall sense of each quality warning, but it doesn't tell the whole story. To investigate the details, let's pick a `QualityWarning` and analyze it.

## Warnings
The warnings contain details for issues detected during the data quality analysis. For any given issue, the warning contains the information required by the Data Scientist to fully grasp the quality issue and how it is impacting the current dataset.

Warnings can be fetched with `.get_warnings()` as a list of `QualityWarning` objects (best for coding) or summarized with `.report()`, which prints an overall status (best for analysis and visualization).

Warnings are generated automatically for some tests but can be created by Data Scientists as well and added to existing engines.
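
To make the idea of a prioritized list of warnings concrete, here is a minimal sketch in plain Python. `ToyWarning` and `by_test` are hypothetical stand-ins written for illustration only; they are not the package's `QualityWarning` class or its `get_warnings` method, and the fields shown are an assumption based on what the report prints.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class ToyWarning:
    """Hypothetical stand-in for a quality warning (not the library class)."""
    test: str        # name of the test that raised the warning
    priority: int    # lower number means higher importance
    data: Any = None # the data that triggered the warning


toy_warnings = [
    ToyWarning("Flatlines", 2),
    ToyWarning("Duplicate Columns", 1, {"workclass": "workclass2"}),
    ToyWarning("Exact Duplicates", 2),
]


def by_test(items: List[ToyWarning], test: str) -> List[ToyWarning]:
    """Keep only the warnings raised by a given test."""
    return [w for w in items if w.test == test]


# Most important warnings first (sorted() is stable, so ties keep their order).
ranked = sorted(toy_warnings, key=lambda w: w.priority)
print([w.test for w in ranked])
print(by_test(toy_warnings, "Duplicate Columns")[0].data)
```

Working with a list of objects like this is what makes the "best for coding" path convenient: the same warnings can be filtered, sorted, or fed into later pipeline steps.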

In [5]:
dq.get_warnings(test='Duplicate Columns')[0].data # get_warnings returns a list; inspect the first match

{'workclass': 'workclass2'}

From the warning details, we know that the 'workclass2' feature is an exact copy of the original 'workclass'. In a typical machine learning pipeline, the duplicated feature 'workclass2' can be dropped: it adds no information and may hurt performance (e.g. due to collinearity effects).
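
For intuition about what an exact duplicate column means, the check can be sketched with plain pandas on a toy frame. This is an illustrative assumption about how such a test could work, not the package's implementation; `find_duplicate_columns` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private"],
    "workclass2": ["Private", "State-gov", "Private"],  # exact copy
    "age": [39, 50, 38],
})


def find_duplicate_columns(frame: pd.DataFrame) -> dict:
    """Map each duplicated column to the first column holding the same values."""
    originals = []
    duplicates = {}
    for col in frame.columns:
        # Series.equals compares values (and dtype), treating aligned NaNs as equal.
        match = next((o for o in originals if frame[o].equals(frame[col])), None)
        if match is None:
            originals.append(col)
        else:
            duplicates[col] = match
    return duplicates


dupes = find_duplicate_columns(toy)
print(dupes)
deduplicated = toy.drop(columns=list(dupes))
```

The pairwise comparison is quadratic in the number of columns, which is fine for a sketch but may be slow on very wide datasets.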

## Modules
A full picture of data quality requires multiple perspectives, which we deliver in a modular way: Bias & Fairness, Data Expectations, Data Relations, Drift Analysis, Labelling, Missing and Valued Missing Values. All of the engines are integrated into a single `DataQuality` class that allows you to run everything at once, providing a holistic perspective of your data. From the main `DataQuality` you can access the individual engines with the `.engines` property.

Some modules need mandatory information and will not run unless you pass the corresponding arguments (e.g. the target feature name for Labelling). By default, `DataQuality` only contains the engines that received valid arguments and drops those that were not sufficiently specified at initialization.

Since we didn't specify any sensitive features, the Bias & Fairness engine didn't run, but we can also run it now as a standalone engine.

In [34]:
from ydata_quality.bias_fairness import BiasFairness
bf = BiasFairness(df=df, sensitive_features=['race', 'sex'], label='income')
bf_results = bf.evaluate()
bf.report()

	[PROXY IDENTIFICATION] Found 1 feature pairs of correlation to sensitive attributes with values higher than defined threshold (0.5). (Priority 2: usage allowed, limited human intelligibility)
	[SENSITIVE ATTRIBUTE REPRESENTATIVITY] Found 2 values of 'race' sensitive attribute with low representativity in the dataset (below 1.00%). (Priority 2: usage allowed, limited human intelligibility)


In [35]:
bf_results

{'performance_discrimination': {'race': Amer-Indian-Eskimo    1.000000
  Asian-Pac-Islander    0.510216
  Other                 1.000000
  Black                 0.639949
  White                 0.562961
  dtype: float64,
  'sex': Male      0.591663
  Female    0.526722
  dtype: float64},
 'proxy_identification': features
 relationship_sex    0.650656
 Name: association, dtype: float64,
 'sensitive_predictability': race    0.121680
 sex     0.249346
 dtype: float64,
 'sensitive_representativity': {'race': White                 0.8537
  Black                 0.0978
  Asian-Pac-Islander    0.0303
  Other                 0.0092
  Amer-Indian-Eskimo    0.0090
  Name: race, dtype: float64,
  'sex': Male      0.6657
  Female    0.3343
  Name: sex, dtype: float64}}
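
The `sensitive_representativity` shares above can be approximated with plain pandas. The sketch below rebuilds a 'race' column from the shares printed in the output (scaled to 10,000 rows) and flags values below a 1% share. The `threshold` name and the use of `value_counts(normalize=True)` are assumptions for illustration, not the engine's internals.

```python
import pandas as pd

# Row counts derived from the shares printed above (10,000 rows in total).
counts = {
    "White": 8537,
    "Black": 978,
    "Asian-Pac-Islander": 303,
    "Other": 92,
    "Amer-Indian-Eskimo": 90,
}
race = pd.Series(
    [value for value, n in counts.items() for _ in range(n)], name="race"
)

threshold = 0.01  # hypothetical parameter: minimum acceptable share
shares = race.value_counts(normalize=True)  # sorted from most to least frequent
low = shares[shares < threshold]
print(low)
```

Two values fall below the threshold here, which is consistent with the "Found 2 values of 'race'" message in the report.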

From the report, we know that we may have a proxy feature leaking information about a sensitive attribute (cf. PROXY IDENTIFICATION) and severe under-representation of feature values of a sensitive attribute. To investigate, we can fetch the warnings with the `get_warnings` method filtering for a specific test.

In [36]:
# Looks like the 'relationship' and 'sex' features are highly correlated. Even if we removed the feature 'sex' from the data, 
# the 'relationship' feature could still leak information about the original sensitive attribute
bf.get_warnings(test='Proxy Identification')

relationship_sex    0.650656
Name: association, dtype: float64)

In [8]:
# From observing the data, we see that some relationship statuses (e.g. Husband, Wife) are gender-specific, which drives the correlation.
df[['relationship', 'sex']].value_counts().sort_index()

relationship    sex   
Husband         Male      4023
Not-in-family   Female    1221
                Male      1351
Other-relative  Female     132
                Male       163
Own-child       Female     712
                Male       904
Unmarried       Female     783
                Male       215
Wife            Female     495
                Male         1
dtype: int64

## Data Cleaning
After the data quality issues have been detected, we can build a data processing pipeline with the guidance from the warnings raised above.

In [16]:
def improve_quality(df: pd.DataFrame):
    "Clean the data based on the Data Quality issues found previously."
    # Bias & Fairness
    df = df.replace({'relationship': {'Husband': 'Married', 'Wife': 'Married'}}) # Substitute gender-based 'Husband'/'Wife' for generic 'Married'
    
    # Duplicates
    df = df.drop(columns=['workclass2']) # Remove the duplicated column
    df = df.drop_duplicates()            # Remove exact feature value duplicates

    return df

clean_df = improve_quality(df.copy())
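
Before applying a cleaning function to the full dataset, it can help to run the same steps on a tiny hand-built frame where the expected result is obvious. The sketch below repeats the three steps from `improve_quality` on made-up toy rows; the toy data is an assumption for illustration and is not taken from the census file.

```python
import pandas as pd


def improve_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Same three cleaning steps as above, repeated here to keep the sketch self-contained."""
    df = df.replace({'relationship': {'Husband': 'Married', 'Wife': 'Married'}})
    df = df.drop(columns=['workclass2'])
    return df.drop_duplicates()


toy = pd.DataFrame({
    'relationship': ['Husband', 'Wife', 'Own-child', 'Own-child'],
    'workclass': ['Private', 'Private', 'State-gov', 'State-gov'],
    'workclass2': ['Private', 'Private', 'State-gov', 'State-gov'],
    'sex': ['Male', 'Female', 'Male', 'Male'],
})

cleaned = improve_quality(toy)
print(cleaned)
```

The last two toy rows are identical, so only one of them should survive, and the 'workclass2' column should be gone.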

### Cleaned Data - DataQuality
To check the impact of our data cleaning pipeline, we create a new `DataQuality` instance based on the cleaned version of the original data.
Since we removed the duplicated column and the exact duplicate rows, those warnings are no longer raised by the new `DataQuality` engine.

In [27]:
%%capture
better_dq = DataQuality(df=clean_df) # main class on cleaned data
results = better_dq.evaluate() # run the tests

In [28]:
better_dq.report()

	[FLATLINES] Found 4165 flatline events with a minimun length of 5 among the columns {'relationship', 'workclass', 'marital-status', 'capital-gain', 'capital-loss', 'sex', 'education-num', 'income', 'education', 'native-country', 'occupation', 'hours-per-week', 'race'}. (Priority 2: usage allowed, limited human intelligibility)
	[PREDEFINED VALUED MISSING VALUES] Found 1360 vmvs in the dataset. (Priority 2: usage allowed, limited human intelligibility)


### Cleaned Data - Specific Module
For the specific analysis of Bias & Fairness, we see that the previous QualityWarning of "Proxy Identification" has disappeared. To check the new association results, we lower the threshold and observe that the association measure between 'relationship' and 'sex' features has dropped from 0.65 to 0.48.

In [38]:
# Specific analysis for Bias & Fairness with improved dataframe
better_bf = BiasFairness(df=clean_df, sensitive_features=['race', 'sex'], label='income')
_ = better_bf.evaluate()
better_bf.report()

	[SENSITIVE ATTRIBUTE REPRESENTATIVITY] Found 2 values of 'race' sensitive attribute with low representativity in the dataset (below 1.00%). (Priority 2: usage allowed, limited human intelligibility)


In [41]:
# Lower the threshold to inspect the new association values
better_bf.proxy_identification(th=0.45)

features
relationship_sex      0.475097
marital-status_sex    0.459768
Name: association, dtype: float64
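
To see why merging 'Husband' and 'Wife' lowers the association, the effect can be reproduced from the 'relationship' by 'sex' counts printed earlier. The sketch below uses plain Cramér's V as the association measure. That choice is an assumption made for illustration, not a statement about how `proxy_identification` is implemented, and the counts are the pre-cleaning ones (before exact duplicates were dropped).

```python
import numpy as np
import pandas as pd


def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V for a contingency table of observed counts."""
    observed = table.to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


# Counts copied from the value_counts output shown earlier in the tutorial.
before = pd.DataFrame(
    {"Female": [0, 1221, 132, 712, 783, 495],
     "Male": [4023, 1351, 163, 904, 215, 1]},
    index=["Husband", "Not-in-family", "Other-relative",
           "Own-child", "Unmarried", "Wife"],
)

# Merge the two gender-specific categories into a single 'Married' row.
after = (
    before.rename(index={"Husband": "Married", "Wife": "Married"})
    .groupby(level=0)
    .sum()
)

v_before = cramers_v(before)
v_after = cramers_v(after)
print(round(v_before, 3), round(v_after, 3))
```

On these counts the sketch lands close to the 0.65 and 0.48 reported above. The merged 'Married' category is still predominantly male, which is why the association drops rather than vanishes.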

## The End
That's it. In this quick tutorial, you learned how to use `ydata_quality` to assess the Data Quality of your dataset, either with the main `DataQuality` aggregator or through a specific module engine (e.g. `BiasFairness`). We introduced `QualityWarning` objects, which provide a high-level measure of severity (cf. Priority) and contain the original data that raised the alarm. Based on the data quality insights, we defined a data cleaning pipeline and observed how it resolved the warnings we targeted.