# YData Quality - Data Expectations Tutorial
Time-to-Value: 8 minutes

This notebook is a tutorial on how the ydata_quality package integrates with the Great Expectations library to manage data expectations.

**Structure:**

1. A data expectations introduction
2. Load example dataset
3. Instantiate the Data Quality engine
4. Run the quality checks
5. Assess the warnings
6. (Extra) Detailed overview

## A data expectations introduction
### What are data expectations?
Detecting inconsistencies or even errors in data can sometimes be a trivial task, but this is far from the norm. Often the task requires meticulous inspection of many data structures, or advanced domain knowledge that allows a user to confidently label a shortcoming.

Consider __[Test-Driven Development](https://en.wikipedia.org/wiki/Test-driven_development)__ (TDD) for a moment. In a TDD process, software requirements are turned into test cases before the software itself is developed. Software changes are constantly run against these test cases in order to, hopefully, detect any problem that might occur. A full software pipeline can be tested in this fashion to establish a quality assurance protocol, warrant a green light for a production push, and support refactoring.

But what about data? What if you could encode domain knowledge, and generally expected data behaviour, as checks on the datasets you manipulate, whether sourced internally or from third parties? In fact, many teams already do this in one way or another, but usually through ad hoc processes that are hard to generalize. Taking the lesson from TDD, if we could develop a set of verifiable tests that work just like software test cases, we would get the same benefits.

**Data Expectation** is the name we use for a unit test applied to data. To define an expectation about data is to develop a unit test that asserts a certain property of the data and provides an actionable output for any deviation.
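
To make the idea concrete, here is a minimal sketch of a data expectation written as a plain Python function. The function name, the result keys, and the sample values are all hypothetical and chosen for illustration only; this is not the Great Expectations API, just the same idea in miniature: assert a property and return an actionable summary of any deviation.

```python
from typing import Any, Dict, Iterable


def expect_values_between(values: Iterable[float], min_value: float, max_value: float) -> Dict[str, Any]:
    """Hypothetical expectation: every value lies in [min_value, max_value]."""
    values = list(values)
    # Collect the offending values so the output is actionable, not just pass/fail
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    return {
        "success": not unexpected,
        "element_count": len(values),
        "unexpected_count": len(unexpected),
        "unexpected_values": unexpected,
    }


# Illustrative values only, not taken from the taxi dataset used later
passenger_counts = [1, 2, 0, 4, 9]
result = expect_values_between(passenger_counts, min_value=1, max_value=6)
print(result["success"], result["unexpected_values"])
```

Returning the unexpected values alongside the pass/fail flag is what separates an expectation from a bare assertion: a failed check tells you which records to look at.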

### What is Great Expectations?
__[Great Expectations](https://greatexpectations.io/)__ is a Python tool for creating and running data expectation suites. It lets you validate and profile your data, automate report creation in the form of HTML documents, and store validation logs. Great Expectations offers a wide range of built-in expectations and also allows you to define custom expectations that better fit your needs.

### How can I leverage my Great Expectations project with YData Quality?
It's simple!

1. Locate the validations directory of your Great Expectations project, which should be under the *uncommitted* directory. There you will find a set of folders, one for each validation run that you executed.
2. Choose a validation run into which you would like more insight, and copy the path to its JSON file.
3. Instantiate a `DataExpectationsReporter` engine and run `evaluate`, providing the JSON file path.

Congratulations, you are all set!
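
Steps 1 and 2 can also be scripted. The sketch below assumes the layout described above (a validations directory under *uncommitted*, with JSON logs nested in per-run folders); the helper name `list_validation_logs` and the project root `great_expectations` are hypothetical, so adjust them to your project.

```python
from pathlib import Path
from typing import List


def list_validation_logs(project_root: str) -> List[Path]:
    """Return the JSON logs under <project_root>/uncommitted/validations, newest first."""
    validations_dir = Path(project_root) / "uncommitted" / "validations"
    # Validation runs are nested in sub-folders, so search recursively
    logs = [p for p in validations_dir.rglob("*.json") if p.is_file()]
    return sorted(logs, key=lambda p: p.stat().st_mtime, reverse=True)


logs = list_validation_logs("great_expectations")  # hypothetical project root
if logs:
    # The most recently modified log is a reasonable default to inspect
    results_json_path = str(logs[0])
    print(results_json_path)
```

Sorting by modification time is only a convenience for picking the latest run; any path from the list can be passed to the reporter.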


In [1]:
import pandas as pd

from ydata_quality.data_expectations import DataExpectationsReporter

## Load the example dataset and path to Great Expectations validation run
We will use a demo from the GE tutorials: the taxi ride dataset and a log from a validation run on this dataset.

In [2]:
# This is the DataFrame used in the demo from GE tutorials
df = pd.read_csv('../src/ydata_quality/data_expectations/test_cases/taxi/yellow_tripdata_sample_2019-01.csv')

# This is a sample json log taken from a validation run
results_json_path = '../src/ydata_quality/data_expectations/test_cases/taxi/long.json'

## Create the engine
Each engine contains the checks and tests for each suite.

In [3]:
der = DataExpectationsReporter()

### Full Evaluation
The easiest way to assess the data quality analysis is to run `.evaluate()`, which returns a dictionary with the outputs of the operations performed.

In [4]:
results = der.evaluate(results_json_path, df)


results.keys()  

dict_keys(['Overall Assessment', 'Coverage Fraction', 'Expectation level assessment'])

## Check the status
After running the data quality checks, you can check the warnings for each individual operation over the GE validation log. The warnings are sorted by priority and have additional details that can provide better insights for Data Scientists.

In [6]:
der.report()

[EXPECTATION ASSESSMENT - VALUE BETWEEN] The observed value is outside of the expected range.
	- The observed value is -100% deviated from the nearest bound of the expected range. (Priority 3: minor impact, aesthetic)
[EXPECTATION ASSESSMENT - VALUE BETWEEN] The observed value is outside of the expected range.
	- The observed value is -14% deviated from the nearest bound of the expected range. (Priority 3: minor impact, aesthetic)
[EXPECTATION ASSESSMENT - VALUE BETWEEN] The observed value is outside of the expected range.
	- The observed value is 17% deviated from the nearest bound of the expected range. (Priority 3: minor impact, aesthetic)
[OVERALL ASSESSMENT] 10 expectations have failed, which is more than the implied absolute threshold of 0 failed expectations. (Priority 2: usage allowed, limited human intelligibility)
[EXPECTATION ASSESSMENT - VALUE BETWEEN] The observed value is outside of the expected range.
	- The observed value is -35% deviated from the nearest bound of the e

### Quality Warning

In [8]:
# Get a sample warning
sample_warning = list(der.warnings)[1]

In [9]:
# Check the details
sample_warning.test, sample_warning.description, sample_warning.priority

('Expectation assessment - Value Between',
 'The observed value is outside of the expected range.\n\t- The observed value is -14% deviated from the nearest bound of the expected range.',
 <Priority.P3: 3>)

In [10]:
# Retrieve the relevant data from the warning
sample_warning_data = sample_warning.data
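
Because every warning carries a priority, a common next step is to keep only the more severe ones. The sketch below uses stand-in classes so that it runs without the package: `StandInWarning`, the local `Priority` enum, and `at_least_as_severe` are hypothetical and only mirror the attributes shown above (`test`, `description`, `priority`, `data`). With a real reporter you would presumably pass `der.warnings` instead.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Any, Iterable, List


class Priority(IntEnum):
    """Stand-in priority levels; as in the report above, a lower number is more severe."""
    P1 = 1
    P2 = 2
    P3 = 3


@dataclass
class StandInWarning:
    """Hypothetical stand-in exposing the attributes inspected above."""
    test: str
    description: str
    priority: Priority
    data: Any = None


def at_least_as_severe(warnings: Iterable[StandInWarning], threshold: Priority) -> List[StandInWarning]:
    """Keep warnings whose priority is at or above the given severity, most severe first."""
    selected = [w for w in warnings if w.priority <= threshold]
    return sorted(selected, key=lambda w: w.priority)


warnings = [
    StandInWarning("Expectation assessment - Value Between", "Observed value outside range.", Priority.P3),
    StandInWarning("Overall Assessment", "Expectations have failed.", Priority.P2),
]
urgent = at_least_as_severe(warnings, Priority.P2)
print([w.test for w in urgent])
```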

## Full Test Suite
In this section, you will find a detailed overview of the available tests in the data expectations module of ydata_quality.