# Great Expectations
[Great Expectations](https://docs.greatexpectations.io/docs/) is a tool for data validation. You can think of it as running unit tests on your data. For a first impression, we recommend the [quick-start tutorial](https://docs.greatexpectations.io/docs/tutorials/quickstart/) provided by Great Expectations.

In [1]:
import great_expectations as gx
from great_expectations.data_context import EphemeralDataContext, FileDataContext
from great_expectations.datasource.fluent import BatchRequest
from great_expectations.exceptions import DataContextError
from great_expectations.expectations.expectation import Expectation
from great_expectations.checkpoint.types.checkpoint_result import CheckpointResult

from pathlib import Path
import pandas as pd
from sklearn.model_selection import train_test_split

from IPython.display import display

In [2]:
WORKING_DIR = Path.cwd()
RANDOM_SEED = 42

We use the familiar wine quality dataset in the following example. 

In [3]:
csv_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = pd.read_csv(csv_url, sep=";")

# Split the data into training and test sets (the default split is 0.75 / 0.25).
train_df, test_df = train_test_split(data, random_state=RANDOM_SEED)

## 1. Instantiate Great Expectations
Let's first create an entry point for Great Expectations. Running the code cell below creates a folder `gx` in your working directory, which holds the configuration and artifacts that Great Expectations produces for this project.

To better understand the code below, you can read up on the following Great Expectations concepts in the [glossary](https://docs.greatexpectations.io/docs/glossary):
- Data Context
- Data Source
- Data Asset
- Batch Request
- Expectation and Expectation Suite
- Validator

In [4]:
wine_datasource_name = "wine_datasource"
wine_expectation_suite_name = "wine_expectation_suite"

# Instantiate a Data Context and persist it to the filesystem.
context = gx.get_context(project_root_dir=str(WORKING_DIR))

# Connect to data in our wine DataFrame.
try:
    # Create a new Data Source in the Data Context
    datasource = context.sources.add_pandas(name=wine_datasource_name)
except DataContextError:
    # The Data Source already exists in the Data Context
    datasource = context.get_datasource(wine_datasource_name)

try:
    # Create a new DataFrame Data Asset
    training_data_asset = datasource.add_dataframe_asset(name="training_data")
except ValueError:
    # The Data Asset already exists
    training_data_asset = datasource.get_asset("training_data")

# Request all data in the training DataFrame as a single batch
training_batch_request = training_data_asset.build_batch_request(dataframe=train_df)

# Create an Expectation Suite 
context.add_or_update_expectation_suite(wine_expectation_suite_name)

# Create a Validator
wine_validator = context.get_validator(
    batch_request=training_batch_request,
    expectation_suite_name=wine_expectation_suite_name
)
display(wine_validator.head())


|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.7 | 0.49 | 0.49 | 2.2 | 0.083 | 5.0 | 15.0 | 1.0 | 3.19 | 0.43 | 9.2 | 5 |
| 1 | 8.8 | 0.6 | 0.29 | 2.2 | 0.098 | 5.0 | 15.0 | 0.9988 | 3.36 | 0.49 | 9.1 | 5 |
| 2 | 7.1 | 0.59 | 0.0 | 2.1 | 0.091 | 9.0 | 14.0 | 0.99488 | 3.42 | 0.55 | 11.5 | 7 |
| 3 | 8.3 | 0.54 | 0.24 | 3.4 | 0.076 | 16.0 | 112.0 | 0.9976 | 3.27 | 0.61 | 9.4 | 5 |
| 4 | 9.3 | 0.775 | 0.27 | 2.8 | 0.078 | 24.0 | 56.0 | 0.9984 | 3.31 | 0.67 | 10.6 | 6 |


## 2. Create Expectations
After creating a Validator, we can define Expectations interactively by calling the Validator's `expect_*` methods (Great Expectations offers other ways to define them as well). Each method performs two tasks:
1) it stores the Expectation in the Expectation Suite,
2) it runs the Expectation against the data loaded into the Validator (`train_df` in this case) and returns an `ExpectationValidationResult` object, which behaves like a dictionary holding the test results. This lets you see right away whether an Expectation you propose agrees with the data loaded into the Validator.
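
The two tasks can be illustrated with a toy model in plain Python. This is a minimal sketch and not the Great Expectations implementation: `ToyValidator`, its `suite` list, and the layout of the returned dictionary are hypothetical and chosen only for illustration. Only the method name mirrors the real API.

```python
# Toy model of a Validator (hypothetical; NOT the Great Expectations implementation).
class ToyValidator:
    def __init__(self, rows):
        self.rows = rows   # the loaded batch: a list of dicts, one per row
        self.suite = []    # the in-memory "Expectation Suite"

    def expect_column_values_to_not_be_null(self, column):
        config = {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": column},
        }
        # Task 1: store the Expectation in the suite.
        self.suite.append(config)
        # Task 2: run it against the loaded batch and report the outcome.
        unexpected = sum(1 for row in self.rows if row.get(column) is None)
        return {
            "success": unexpected == 0,
            "expectation_config": config,
            "result": {
                "element_count": len(self.rows),
                "unexpected_count": unexpected,
            },
        }


toy_rows = [{"quality": 5}, {"quality": 7}, {"quality": None}]
toy_validator = ToyValidator(toy_rows)
toy_result = toy_validator.expect_column_values_to_not_be_null("quality")

print(toy_result["success"])     # False: one row has a null quality
print(len(toy_validator.suite))  # 1: the toy model stores the Expectation either way
```

Returning the outcome immediately is what makes the interactive workflow useful: a failing result signals that the Expectation does not match the data before it is ever used on new batches.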

Let's define an Expectation to make sure that there are no null values in the target column.

In [5]:
result = wine_validator.expect_column_values_to_not_be_null("quality")
print(f"Test status: {'Succeeded' if result['success'] else 'Failed'}")
print(f"Expectation: {result['expectation_config']['expectation_type']}")



Test status: Succeeded
Expectation: expect_column_values_to_not_be_null


The Expectation Suite we have configured so far exists only in memory and has to be persisted for future use. Let's save the Expectation Suite into our Data Context.

In [6]:
wine_validator.save_expectation_suite()
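
Persisting a suite amounts to serializing its Expectation configurations so that a later session can reload them. The sketch below imitates that idea with the standard `json` module and a temporary directory. The file name, the dictionary layout, and the helpers `save_suite` and `load_suite` are assumptions made for illustration; they do not reproduce what Great Expectations writes into the `gx` folder.

```python
import json
import tempfile
from pathlib import Path


def save_suite(suite: dict, directory: Path) -> Path:
    """Write the suite to <directory>/<suite name>.json and return the path."""
    path = directory / f"{suite['expectation_suite_name']}.json"
    path.write_text(json.dumps(suite, indent=2))
    return path


def load_suite(path: Path) -> dict:
    """Read a suite back from a JSON file."""
    return json.loads(path.read_text())


# Hypothetical in-memory suite with a single Expectation.
toy_suite = {
    "expectation_suite_name": "wine_expectation_suite",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "quality"},
        },
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    suite_path = save_suite(toy_suite, Path(tmp_dir))
    reloaded_suite = load_suite(suite_path)

print(reloaded_suite == toy_suite)  # True: the suite survives the round trip
```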

## 3. Validate new data against the defined Expectations
Let's run our newly defined Expectation Suite against the test data. This is done by creating and configuring a [Checkpoint](https://docs.greatexpectations.io/docs/terms/checkpoint).
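
Conceptually, a Checkpoint pairs a saved Expectation Suite with a batch of data, runs every Expectation against it, and aggregates the outcomes. A rough sketch of that idea in plain Python follows; `CHECKS` and `run_toy_checkpoint` are hypothetical names and not part of Great Expectations.

```python
# Map each expectation type to a function that checks a whole batch (illustrative only).
CHECKS = {
    "expect_column_values_to_not_be_null":
        lambda rows, column: all(row.get(column) is not None for row in rows),
}


def run_toy_checkpoint(suite, rows):
    """Run every Expectation in `suite` against `rows` and aggregate the outcomes."""
    results = []
    for expectation in suite:
        check = CHECKS[expectation["expectation_type"]]
        results.append({
            "expectation_type": expectation["expectation_type"],
            "success": check(rows, **expectation["kwargs"]),
        })
    # The run succeeds only if every individual Expectation succeeds.
    return {"success": all(r["success"] for r in results), "results": results}


saved_suite = [
    {
        "expectation_type": "expect_column_values_to_not_be_null",
        "kwargs": {"column": "quality"},
    },
]
clean_batch = [{"quality": 5}, {"quality": 6}]
dirty_batch = [{"quality": 5}, {"quality": None}]

print(run_toy_checkpoint(saved_suite, clean_batch)["success"])  # True
print(run_toy_checkpoint(saved_suite, dirty_batch)["success"])  # False
```

The real Checkpoint used in the cells below does considerably more, such as persisting the validation result, but the core loop of "saved suite plus new batch yields an aggregated result" is the same idea.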

In [7]:
try:
    # Create a new DataFrame Data Asset
    test_data_asset = datasource.add_dataframe_asset(name="test_data")
except ValueError:
    # The Data Asset already exists
    test_data_asset = datasource.get_asset("test_data")

# Request all data in the test DataFrame as a single batch
test_batch_request = test_data_asset.build_batch_request(dataframe=test_df)

In [8]:
checkpoint = context.add_or_update_checkpoint(
    name="test_checkpoint",
    batch_request=test_batch_request,
    expectation_suite_name=wine_expectation_suite_name
)
checkpoint_result = checkpoint.run(run_name="test_run1")



In [None]:
context.view_validation_result(checkpoint_result)

# The result is persisted in the `gx` directory. You can also open it directly at
# gx/uncommitted/data_docs/local_site/index.html