# YData Quality - Data Expectations Tutorial
Time-to-Value: 8 minutes

This notebook is a tutorial on how the ydata_quality package integrates with the Great Expectations library to manage data expectations.

**Structure:**

1. A data expectations introduction
2. Load example dataset
3. Instantiate the Data Quality engine
4. Run the quality checks
5. Assess the warnings
6. (Extra) Detailed overview

## A data expectations introduction
### What are data expectations?
Detecting inconsistencies or outright errors in data is sometimes trivial, but that is far from the norm. More often it requires meticulous inspection of many data structures, or advanced domain knowledge that lets a user confidently label a shortcoming.

Consider __[Test-Driven Development](https://en.wikipedia.org/wiki/Test-driven_development)__ (TDD) for a moment. In a TDD process, software requirements are turned into test cases before the software itself is developed. Software changes are constantly run against these test cases in order to detect any problem that might occur. A full software pipeline can be tested in this fashion to establish a quality assurance protocol, warrant a green light for a production push, and support refactoring.

But what about data? What if you could encode domain knowledge, and generally expected data behaviour, as checks on the datasets you manipulate, whether sourced internally or from third parties? In fact, many teams already do this in one way or another, but usually through ad hoc processes that are hard to generalize. Taking the lesson from TDD, if we could develop a set of verifiable tests that work just like software test cases, we would get the same benefits.

**Data Expectation** is the name we use for a unit test applied to data. To define an expectation about data is to develop a unit test that asserts a certain property of the data and provides an actionable output for any deviation.
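
As a minimal sketch of the idea, the plain-Python function below asserts one property (every value lies within a range) and reports which positions deviate. The function name, its arguments and the shape of the returned dictionary are illustrative assumptions, not part of Great Expectations or ydata_quality.

```python
from typing import Dict, List, Sequence


def expect_values_between(values: Sequence[float], min_value: float,
                          max_value: float) -> Dict[str, object]:
    """Hypothetical data expectation: every value must lie in [min_value, max_value]."""
    # Collect the positions of the values that violate the expectation
    unexpected_index: List[int] = [
        i for i, value in enumerate(values) if not min_value <= value <= max_value
    ]
    return {
        "success": not unexpected_index,       # True only if no value deviates
        "unexpected_index": unexpected_index,  # actionable output: where to look
    }


# Toy passenger counts; 0 and 9 fall outside the expected range of 1 to 6
passenger_count = [1, 2, 0, 4, 9]
outcome = expect_values_between(passenger_count, min_value=1, max_value=6)
print(outcome)
```

Just like a failing unit test, the failed expectation points at the offending positions instead of only reporting that something is wrong.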

### What is Great Expectations?
__[Great Expectations](https://greatexpectations.io/)__ is a Python tool for creating and running suites of data expectations. It allows you to validate and profile your data, automate report creation in the form of HTML documents, and store validation logs. Great Expectations offers a wide range of built-in expectations and also lets you define custom expectations that better fit your needs.

### How can I leverage my Great Expectations project with YData Quality?
It's simple!

1. Locate the validations directory of your Great Expectations project, which should be under the *uncommitted* directory. There you will find a set of folders, one for each validation run that you executed.
2. Choose a validation run that you would like more insight into, and copy the path to its JSON file.
3. Instantiate a `DataExpectationsReporter` engine and run `evaluate`, providing the JSON file path.

Congratulations, you are all set!


In [1]:
import pandas as pd

from ydata_quality.data_expectations import DataExpectationsReporter

## Load the example dataset and path to Great Expectations validation run
We will use a demo from the GE tutorials: the taxi ride dataset and a log from a validation run on this dataset.

In [2]:
# This is the DataFrame used in the demo from GE tutorials
df = pd.read_csv('../datasets/original/taxi_yellow_tripdata_sample_2019-01.csv')

# This is a sample json log taken from a validation run
results_json_path = '../datasets/original/taxi_long.json'

## Create the engine
Each engine contains the checks and tests for each suite.

In [3]:
der = DataExpectationsReporter()

### Full Evaluation
The easiest way to assess the data quality analysis is to run `.evaluate()`, which returns a dictionary with the outputs of each operation performed.

In [4]:
results = der.evaluate(results_json_path, df)



[38;5;11m[1mPriority 2[0m - [1musage allowed, limited human intelligibility[0m:
	[38;5;11m*[0m [1m[DATA EXPECTATIONS[0m - [4mOVERALL ASSESSMENT][0m 10 expectations have failed, which is more than the implied absolute threshold of 0 failed expectations.
	[38;5;11m*[0m [1m[DATA EXPECTATIONS[0m - [4mCOVERAGE FRACTION][0m The provided DataFrame has a total expectation coverage of 11% of its columns, which is below the expected coverage of 75%.
[38;5;69m[1mPriority 3[0m - [1mminor impact, aesthetic[0m:
	[38;5;69m*[0m [1m[DATA EXPECTATIONS[0m - [4mEXPECTATION ASSESSMENT - VALUE BETWEEN][0m Column passenger_count - The observed value is outside of the expected range.
	- The observed value is -100% deviated from the nearest bound of the expected range.
	[38;5;69m*[0m [1m[DATA EXPECTATIONS[0m - [4mEXPECTATION ASSESSMENT - VALUE BETWEEN][0m Column trip_distance - The observed value is outside of the expected range.
	- The observed value is 5% deviated from the



## Check the status
After running the data quality checks, you can check the warnings for each individual operation over the GE validation log. The warnings are sorted by priority and have additional details that can provide better insights for Data Scientists.

In [5]:
warnings = der.get_warnings()
warnings[0]



## Full Test Suite
In this section, you will find a detailed overview of the available tests in the data expectations module of ydata_quality. These are all run with the `evaluate` method, which centralizes input arguments and produces specific outputs in the returned results dictionary, structured by test.

In [6]:
# Results object structure
list(results.keys())

['Coverage Fraction', 'Overall Assessment', 'Expectation Level Assessment']

### Overall assessment

This method checks for errors at the expectation suite level.
It receives your `results_json_path` and 2 optional arguments.
By default, the tolerance is 0 failed expectations. You can configure this threshold using one of two arguments:
1. `error_tol`, an integer for the maximum number of expectations you tolerate as failures, or
2. `rel_error_tol`, the fraction of expectations you tolerate as failures.

If the number of failed expectations is greater than the one implied by these arguments, a warning is stored and becomes part of the report. The data object of this warning is a dictionary whose value is the list of indexes of the failed expectations, according to your expectation suite. The same list is stored in the returned results of `evaluate`. Note that the IDs are not arbitrary: they follow the order of your expectation suite, using zero-based numbering.

In [7]:
failed_expectations_ids = results['Overall Assessment']
failed_expectations_ids

[2, 4, 6, 7, 9, 12, 13, 14, 15, 17]
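
The thresholding logic described above can be sketched in a few lines. This is an illustration and not the library's implementation: the helper names are hypothetical, and the rule that a relative tolerance takes precedence and is rounded down to an absolute count is an assumption of the sketch.

```python
from typing import List, Optional, Sequence


def failed_expectation_ids(successes: Sequence[bool]) -> List[int]:
    """Zero-based positions of the failed expectations, in suite order."""
    return [i for i, ok in enumerate(successes) if not ok]


def exceeds_tolerance(n_failed: int, n_total: int, error_tol: int = 0,
                      rel_error_tol: Optional[float] = None) -> bool:
    """True when the number of failures is above the tolerated maximum."""
    if rel_error_tol is not None:
        # Assumption: convert the fraction to an absolute count, rounding down
        error_tol = int(rel_error_tol * n_total)
    return n_failed > error_tol


# Toy suite of five expectations, two of which failed
successes = [True, False, True, True, False]
failed = failed_expectation_ids(successes)
print(failed)
# Default tolerance of 0 failed expectations: any failure triggers a warning
print(exceeds_tolerance(len(failed), len(successes)))
# Tolerating up to half of the suite: two failures out of five are accepted
print(exceeds_tolerance(len(failed), len(successes), rel_error_tol=0.5))
```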

### Coverage fraction

This method checks the total expectation coverage of your dataset.
It receives your `results_json_path`, the dataset for which you ran your validation and 1 optional argument.
By default, the engine expects the expectation suite to cover at least 75% of your columns, as the evaluation output above reports. Only column-specific expectations count towards coverage; __[table expectations](https://docs.greatexpectations.io/docs/reference/glossary_of_expectations#table-shape)__ are not considered here. You can calibrate your desired minimum coverage by providing the argument `minimum_coverage` as a fraction.

If the coverage is below the one implied by this argument, a warning is stored and becomes part of the report. The data object of this warning is the set of columns of your dataset that are not covered by the expectation suite. The method stores in the results object the fraction of columns of your dataset that are covered by at least one expectation.

Additionally, if there are any expectations in your expectation suite meant for a column which cannot be found on the provided dataset, an exception will be raised which will be captured by the `evaluate` method.

In [8]:
coverage_fraction = results['Coverage Fraction']
coverage_fraction

0.1111111111111111
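
The coverage computation can be sketched with plain sets. The function below is a hypothetical illustration, not the ydata_quality implementation; its default of 0.75 mirrors the 75% threshold reported in the evaluation output above, and the column names in the example are a toy subset.

```python
from typing import Iterable, Set, Tuple


def coverage_fraction(dataset_columns: Iterable[str],
                      expectation_columns: Iterable[str],
                      minimum_coverage: float = 0.75) -> Tuple[float, Set[str], bool]:
    """Return (covered fraction, uncovered columns, whether a warning is due)."""
    dataset = set(dataset_columns)
    covered = set(expectation_columns)
    if not dataset:
        raise ValueError("The dataset has no columns.")
    # Expectations that target columns absent from the dataset are an error
    unknown = covered - dataset
    if unknown:
        raise ValueError(f"Columns not found in the dataset: {sorted(unknown)}")
    fraction = len(covered) / len(dataset)
    return fraction, dataset - covered, fraction < minimum_coverage


# Toy example: one of four columns has a column expectation
fraction, uncovered, below_minimum = coverage_fraction(
    ["vendor_id", "passenger_count", "trip_distance", "fare_amount"],
    ["passenger_count"],
)
print(fraction, sorted(uncovered), below_minimum)
```

The uncovered set plays the role of the warning's data object: it tells you which columns still need expectations.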

### Expectation level assessment

This method checks the success of your expectation suite at the expectation level.
It receives one argument, your `results_json_path`.


It stores no warnings directly but, depending on the failing expectations, it may call private methods to further digest the stored information. These expectation-specific methods can store warnings that give you additional insight into what is wrong.


After running, a tuple is returned with the following contents:
1. A report containing the status of all your expectations, successful or not, with additional information on the failed ones. Further information for interpreting the error metrics is provided in the descriptions of the stored warnings.
2. A dense representation of the expectations, including: 
    * Results format;
    * Success status;
    * Expectation type;
    * Flag indicating if it is a table expectation;
    * General kwargs;
    * Column kwargs.
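
To make the dense representation more concrete, the sketch below flattens one validation result entry into a compact dictionary. It is a guess at the general idea, not the library's code: the input keys (`expectation_config`, `expectation_type`, `kwargs`), the `expect_table_` prefix heuristic and the rule for picking column kwargs are all assumptions, and the results format field is left out.

```python
from typing import Any, Dict


def to_dense(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one validation result entry into a compact dictionary (assumed layout)."""
    config = entry.get("expectation_config", {})
    expectation_type = config.get("expectation_type", "")
    kwargs = dict(config.get("kwargs", {}))
    return {
        "success": entry.get("success"),
        "type": expectation_type,
        "kwargs": kwargs,
        "result": entry.get("result", {}),
        # Heuristic: table expectations are named "expect_table_..."
        "is_table_expectation": expectation_type.startswith("expect_table_"),
        # Keep only the keyword arguments that name columns
        "column_kwargs": {k: v for k, v in kwargs.items() if k.startswith("column")},
    }


# Toy entry shaped like the assumed input layout
entry = {
    "success": False,
    "expectation_config": {
        "expectation_type": "expect_column_min_to_be_between",
        "kwargs": {"column": "passenger_count", "min_value": 1, "max_value": 1},
    },
    "result": {"observed_value": 0},
}
dense = to_dense(entry)
print(dense["type"], dense["is_table_expectation"], dense["column_kwargs"])
```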

In [9]:
expectations_report, expectations_dense = results['Expectation Level Assessment']
expectations_report

| | Expectation type | Successful? | Error metric(s) |
|---|---|---|---|
| 0 | expect_table_columns_to_match_ordered_list | True | |
| 1 | expect_table_row_count_to_be_between | True | |
| 2 | expect_column_min_to_be_between | False | (None, -1.0) |
| 3 | expect_column_max_to_be_between | True | |
| 4 | expect_column_mean_to_be_between | False | (None, -0.13610333418172574) |
| 5 | expect_column_median_to_be_between | True | |
| 6 | expect_column_quantile_values_to_be_between | False | |
| 7 | expect_column_values_to_be_in_set | False | |
| 8 | expect_column_values_to_not_be_null | True | |
| 9 | expect_column_proportion_of_unique_values_to_b... | False | (None, 0.16666666666666677) |
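
The warnings shown earlier describe failed range expectations as a percentage deviation from the nearest bound. One plausible reading of such a metric is sketched below; it is an assumption made for illustration and is not taken from the ydata_quality source, so the values it produces may not match the library's error metrics. The function name and the toy inputs are hypothetical.

```python
from typing import Optional


def deviation_from_nearest_bound(observed: float, min_value: Optional[float],
                                 max_value: Optional[float]) -> Optional[float]:
    """Signed deviation of `observed` from the violated bound, as a fraction of that bound.

    Returns None when the value is inside the range or the bound is zero.
    """
    if min_value is not None and observed < min_value:
        bound = min_value
    elif max_value is not None and observed > max_value:
        bound = max_value
    else:
        return None  # inside the expected range, nothing to report
    if bound == 0:
        return None  # a relative deviation is undefined for a zero bound
    return (observed - bound) / abs(bound)


# Toy values: 0.0 is 100% below a lower bound of 1.0
print(deviation_from_nearest_bound(0.0, min_value=1.0, max_value=6.0))
# Toy values: 6.3 is roughly 5% above an upper bound of 6.0
print(deviation_from_nearest_bound(6.3, min_value=1.0, max_value=6.0))
```

A negative value means the observation fell below the lower bound, and a positive value means it exceeded the upper bound.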


In [10]:
# Retrieve a dense representation of an expectation
expectations_dense[0]

{'results_format': 'BASIC+',
 'success': True,
 'type': 'expect_table_columns_to_match_ordered_list',
 'kwargs': {'column_list': ['vendor_id',
   'pickup_datetime',
   'dropoff_datetime',
   'passenger_count',
   'trip_distance',
   'rate_code_id',
   'store_and_fwd_flag',
   'pickup_location_id',
   'dropoff_location_id',
   'payment_type',
   'fare_amount',
   'extra',
   'mta_tax',
   'tip_amount',
   'tolls_amount',
   'improvement_surcharge',
   'total_amount',
   'congestion_surcharge']},
 'result': {'observed_value': ['vendor_id',
   'pickup_datetime',
   'dropoff_datetime',
   'passenger_count',
   'trip_distance',
   'rate_code_id',
   'store_and_fwd_flag',
   'pickup_location_id',
   'dropoff_location_id',
   'payment_type',
   'fare_amount',
   'extra',
   'mta_tax',
   'tip_amount',
   'tolls_amount',
   'improvement_surcharge',
   'total_amount',
   'congestion_surcharge']},
 'is_table_expectation': True,
 'column_kwargs': {'column_list': ['vendor_id',
   'pickup_datetime'