# YData Quality - Data Expectations Tutorial
Time-to-Value: 8 minutes

This notebook provides a tutorial for the ydata_quality package integration of the Great Expectations library for managing data expectations.

**Structure:**

1. Load Great Expectations validation run
2. Instantiate the Data Quality engine
3. Run the quality checks
4. Assess the warnings
5. (Extra) Detailed overview

## A data expectations introduction
### What are data expectations?
Detecting inconsistencies or outright errors in data is occasionally trivial, but that is far from the norm. More often it requires meticulous inspection of many data structures, or domain knowledge deep enough to confidently label a shortcoming.

Consider __[Test-Driven-Development](https://en.wikipedia.org/wiki/Test-driven_development)__ (TDD) for a moment. In a TDD process, software requirements are turned into test cases before the software itself is developed. Software changes are constantly run against these test cases in order to, hopefully, detect any problem that might occur. A full software pipeline can be tested in this fashion to warrant a green light for a production push, to establish a quality assurance protocol with an end user, or to support refactoring.

But what about data? What if you could encode domain knowledge, and generally expected data behaviour, as checks on the datasets you manipulate, whether internally sourced or from third parties? Many teams already do this in one way or another. Taking the lesson from TDD, if we could easily develop a set of verifiable tests that work just like software test cases, we would get the same benefits. **Data Expectation** is the name we use for unit tests applied to data: to define an expectation about data is to develop a unit test that asserts a certain property of the data and provides an actionable output on any deviation.
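
As a minimal sketch of the idea, the cell below writes one expectation as a plain function over a pandas DataFrame. The helper `check_values_between`, the `ages` frame and the shape of the returned dictionary are all hypothetical and chosen for illustration; they are not part of ydata_quality or Great Expectations.

```python
import pandas as pd


def check_values_between(df, column, low, high):
    """Hypothetical helper: expect every value in `column` to lie in [low, high]."""
    # Series.between is inclusive on both ends by default
    out_of_range = df[~df[column].between(low, high)]
    return {
        "success": out_of_range.empty,
        "unexpected_count": len(out_of_range),
        "unexpected_rows": out_of_range.index.tolist(),
    }


# Illustrative data: two of the five ages are implausible
ages = pd.DataFrame({"age": [34, 51, -2, 27, 140]})
result = check_values_between(ages, "age", 0, 120)
print(result)
```

The returned dictionary is what makes the check actionable: it reports not only that the expectation failed, but which rows caused the failure.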

### What is Great Expectations?
__[Great Expectations](https://greatexpectations.io/)__ is a Python tool for creating and running data expectation suites, allowing you to validate and profile your data and to automate report creation in the form of HTML documents. Great Expectations offers a wide range of built-in expectations and also lets you define custom expectations that better fit your needs.

### How can I leverage my Great Expectations project with YData Quality?
It's simple!

Locate the validations directory of your Great Expectations project, which should be under `uncommitted`. There you will find a set of folders, one for each validation run that you executed. Choose a validation run you would like more insight into, and provide it to the DataExpectationsReporter either as a path or as loaded JSON (for example, using the `loads` method of the native json package). Instantiate the engine and run evaluate. Congratulations, you are all set!
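
To make the "loaded JSON" option concrete, here is a rough sketch that parses a validation run with the standard `json` package and lists the expectations that failed. The `raw` string is an illustrative, heavily abbreviated stand-in for a real validation file, and the keys it uses (`results`, `success`, `expectation_config`) are assumed to match the layout your Great Expectations version writes; check them against one of your own files before relying on them.

```python
import json

# Illustrative, abbreviated stand-in for the contents of a validation run file.
# In a real project you would read the JSON file found in the validations directory.
raw = """
{
  "success": false,
  "results": [
    {"success": true,
     "expectation_config": {"expectation_type": "expect_column_values_to_not_be_null",
                            "kwargs": {"column": "dept"}}},
    {"success": false,
     "expectation_config": {"expectation_type": "expect_column_values_to_be_unique",
                            "kwargs": {"column": "dept"}}}
  ]
}
"""

validation_run = json.loads(raw)

# Collect (expectation type, column) for every expectation that did not pass
failed = [
    (r["expectation_config"]["expectation_type"],
     r["expectation_config"]["kwargs"].get("column"))
    for r in validation_run["results"]
    if not r["success"]
]
print(failed)
```

A quick look like this can help you pick which validation run is worth a deeper analysis.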


In [None]:
import statsmodels.api as sm
from ydata_quality.data_errors import DataErrorSearcher

## Load the example dataset
We will use a dataset available from the statsmodels package.

In [None]:
df = sm.datasets.get_rdataset('Guerry', 'HistData').data

## Distort the original dataset
Apply transformations to highlight the data quality functionalities.

In [None]:
# Duplicate the first 20 rows (DataFrame.append was removed in pandas 2.0, so use pd.concat)
import pandas as pd
df = pd.concat([df, df[:20]], ignore_index=True)

In [None]:
# Duplicate the dept column
df["dept2"] = df["dept"]
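
Before handing the distorted data to the engine, it can be worth confirming that the distortions are really there. The sketch below repeats both transformations on a small, made-up `toy` frame (not the Guerry dataset) so that it runs without a download, and uses `pd.concat` for the row duplication.

```python
import pandas as pd

# Made-up stand-in for the real dataset
toy = pd.DataFrame({"dept": [1, 2, 3], "region": ["E", "N", "C"]})

# Same two distortions as in the tutorial, on a smaller scale
toy = pd.concat([toy, toy[:2]], ignore_index=True)  # duplicate the first 2 rows
toy["dept2"] = toy["dept"]                          # duplicate the dept column

# Count rows that repeat an earlier row, and compare the two columns
n_duplicate_rows = int(toy.duplicated().sum())
same_columns = toy["dept"].equals(toy["dept2"])
print(n_duplicate_rows, same_columns)
```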

## Create the engine
Each engine contains the checks and tests for each suite. To create a DataErrorSearcher, you provide:
- df: target DataFrame, for which we will run the test suite

In [None]:
des = DataErrorSearcher(df=df)

### Full Evaluation
The easiest way to assess the data quality analysis is to run `.evaluate()`, which runs every quality check and returns a mapping of results with one entry per check.

In [None]:
results = des.evaluate()
results.keys()

## Check the status
After running the data quality checks, you can check the warnings for each individual test. The warnings are sorted by priority and have additional details that can provide better insights for Data Scientists.

In [None]:
des.report()

### Quality Warning

In [None]:
# Get a sample warning
sample_warning = des.warnings[1]

In [None]:
# Check the details
sample_warning.test, sample_warning.description, sample_warning.priority

In [None]:
# Retrieve the relevant data from the warning
sample_warning_data = sample_warning.data
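
Since each warning exposes `test`, `description`, `priority` and `data`, a collection of warnings can be filtered like any other list of Python objects. The sketch below uses a hypothetical `StandInWarning` dataclass in place of the library's own warning class, with invented entries, and assumes that a lower priority number means a more urgent warning.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class StandInWarning:
    """Hypothetical stand-in with the attributes read above; not the library's class."""
    test: str
    description: str
    priority: int  # assumption: lower number means more urgent
    data: Any = None


# Invented entries, for illustration only
found = [
    StandInWarning("Duplicate Columns", "Found 1 duplicated column.", 1),
    StandInWarning("Exact Duplicates", "Found 20 duplicated rows.", 2),
    StandInWarning("Some Other Check", "Illustrative low-priority finding.", 3),
]

# Keep only the most urgent warnings and index them by test name
urgent = [w for w in found if w.priority <= 2]
by_test = {w.test: w.description for w in urgent}
print(by_test)
```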

## Full Test Suite
In this section, you will find a detailed overview of the available tests in the data errors module of ydata_quality.