# YData Quality - DataQuality Tutorial
Time-to-Value: 4 minutes

This notebook provides a tutorial to run the `ydata_quality.DataQuality` main class that aggregates all the individual data quality engines, each focused on a main topic of data quality (e.g. duplicates, missing values).

**Structure:**

1. Load dataset
2. Distort dataset
3. Instantiate the Data Quality engine
4. Run the quality checks
5. Assess the warnings
6. (Extra) Detailed overview

In [1]:
import pandas as pd
from ydata_quality import DataQuality

## Load the example dataset
We will use a transformed version of the "Guerry" dataset available from the statsmodels package.

In [2]:
df = pd.read_csv('../datasets/transformed/guerry_histdata.csv')

## Create the main engine
The `DataQuality` class is the single entry point to the individual engines. To create a `DataQuality` object, you provide:
- df: target DataFrame, for which we will run the test suite
- label (optional): name of the target feature to be predicted in a supervised learning context
- entities (optional): list of feature names for which checking duplicates after grouping-by is applicable
- ed_extensions (optional): list of erroneous data values to append to the defaults
- sensitive_features (optional): list of feature names to treat as sensitive attributes
- random_state (optional): seed used to make randomized checks reproducible

In [3]:
ED_EXTENSIONS = ['a_custom_EDV', 999999999, '!', '', 'UNKNOWN']
SENSITIVE_FEATURES = ['Suicides', 'Crime_parents', 'Infanticide']

In [4]:
dq = DataQuality(df=df, label='Pop1831', ed_extensions=ED_EXTENSIONS, sensitive_features=SENSITIVE_FEATURES, random_state=42)
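To make the idea of erroneous data values more concrete, here is a minimal sketch of how such values could be located in a DataFrame with plain pandas. It is an illustration only, not the library's implementation; the `toy` DataFrame and its column names are made up for the example.

```python
import pandas as pd

# Same list as in the cell above.
ED_EXTENSIONS = ['a_custom_EDV', 999999999, '!', '', 'UNKNOWN']

# Hypothetical toy data, invented for this sketch.
toy = pd.DataFrame({
    "city": ["Paris", "UNKNOWN", "Lyon", "!"],
    "population": [2_100_000, 999999999, 500_000, 120_000],
})

# Boolean mask: True wherever a cell equals one of the flagged values.
flagged = toy.isin(ED_EXTENSIONS)

# Number of flagged cells per column.
counts = flagged.sum()
print(counts)
```

Placeholder values such as `999999999` or `'UNKNOWN'` are easy to miss because they are valid for the column's type, which is why listing them explicitly can help.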

### Full Evaluation
The easiest way to assess data quality is to call `.evaluate()`, which runs every quality check and reports the warnings each one raises.

In [5]:
full_results = dq.evaluate()

CRITICAL | Canceled Data Expectations engine execution due to dataset-expectation suite mismatch.

Priority 1 - heavy impact expected:
	* [LABELS - TEST NORMALITY] The label distribution failed to pass a normality test as-is and following a battery of transforms. It is possible that the data originates from an exotic distribution, there is heavy outlier presence or it is multimodal. Addressing this issue might prove critical for regressor performance.
	* [DUPLICATES - DUPLICATE COLUMNS] Found 1 columns with exactly the same feature values as other columns.
Priority 2 - usage allowed, limited human intelligibility:
	* [DATA RELATIONS - HIGH COLLINEARITY - NUMERICAL] Found 18 numerical variables with high Variance Inflation Factor (VIF>5.0). The variables listed in results are highly collinear with other variables in the dataset. These will make

## Check the status
After running the data quality checks, you can inspect the warnings raised by each individual test. The warnings are sorted by priority and carry additional details that give data scientists better insight.
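As an example of the detail behind a warning, the high-collinearity check reported by the evaluation refers to the Variance Inflation Factor. For feature j, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The sketch below computes it with NumPy least squares on synthetic data; the helper name `variance_inflation_factors` is hypothetical, and the library may compute the statistic differently.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of a 2-D array (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        y = X[:, j]
        # Regress column j on the remaining columns plus an intercept.
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residuals = y - A @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        # A perfect fit means the column is an exact linear combination.
        vifs.append(float("inf") if r_squared >= 1.0 else 1.0 / (1.0 - r_squared))
    return vifs

# Synthetic data: c is almost exactly a + b, so all three columns are collinear.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + b + 0.01 * rng.normal(size=200)

vif_independent = variance_inflation_factors(np.column_stack([a, b]))
vif_collinear = variance_inflation_factors(np.column_stack([a, b, c]))
```

With the VIF>5.0 threshold shown in the report, the two independent columns stay well under the threshold, while adding the near-duplicate third column pushes all three far above it.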

In [6]:
# Retrieve the warnings
warnings = dq.get_warnings()

In [7]:
# With get_warnings you can also filter the warning list by specific conditions
duplicate_quality_warnings = dq.get_warnings(category='Duplicates')
priority_2_warnings = dq.get_warnings(priority=2)
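The filtering shown above can be pictured as a simple match on warning attributes. The following is a minimal sketch of that idea in plain Python; `SketchWarning` and `filter_warnings` are hypothetical stand-ins written for this illustration and are not the library's classes or functions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SketchWarning:
    """Hypothetical stand-in for a quality warning (not the library's class)."""
    category: str
    test: str
    priority: int

def filter_warnings(warnings_list, category=None, priority=None):
    """Keep warnings matching every condition that is set, sorted by priority."""
    kept = [
        w for w in warnings_list
        if (category is None or w.category == category)
        and (priority is None or w.priority == priority)
    ]
    # sorted() is stable, so warnings of equal priority keep their input order.
    return sorted(kept, key=lambda w: w.priority)

# Sample warnings modeled on the report printed by the evaluation.
sample = [
    SketchWarning("Data Relations", "High Collinearity - Numerical", 2),
    SketchWarning("Labels", "Test normality", 1),
    SketchWarning("Duplicates", "Duplicate Columns", 1),
]

duplicates_only = filter_warnings(sample, category="Duplicates")
priority_2_only = filter_warnings(sample, priority=2)
```

Leaving a condition as `None` means "do not filter on this attribute", so calling `filter_warnings(sample)` returns every warning ordered by priority.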