![](https://docs.greatexpectations.io/assets/images/GE_OSS_process-448174e3b55ae4dfd7fbb7a8c1a452e3.png)

## Set up the environment

In [None]:
!pip install sqlalchemy psycopg2-binary
!pip install great-expectations 
!apt-get --quiet install tree
!pip install ipython-sql
%reload_ext sql

In [None]:
import os
import pandas as pd
from sqlalchemy import create_engine
import psycopg2

In [None]:
HOST = ""
PASSWORD = ""

In [None]:
# Connect to the database; the queries below confirm that the data is loaded
%sql postgresql+psycopg2://postgres:{PASSWORD}@{HOST}/studentdb

'Connected: postgres@studentdb'
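
The connection string above interpolates `PASSWORD` directly into the URL. A password containing characters such as `@` or `/` would break URL parsing, so it may help to percent-encode it first. The sketch below uses only the standard library; the helper name `build_postgres_url` and the example host and password are hypothetical, not part of the original notebook.

```python
from urllib.parse import quote


def build_postgres_url(user: str, password: str, host: str, database: str) -> str:
    """Return a postgresql+psycopg2 URL with the password percent-encoded."""
    # safe="" makes quote() encode every reserved character, including "/" and "@".
    return f"postgresql+psycopg2://{user}:{quote(password, safe='')}@{host}/{database}"


# Hypothetical values, for illustration only.
url = build_postgres_url("postgres", "p@ss/word", "db.example.com", "studentdb")
print(url)
```

The resulting string can be passed to `%sql` or to `create_engine` in place of the hand-built URL.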

In [None]:
%sql select * from yellow_tripdata_sample_2019_01 limit 10;

   postgresql+psycopg2://postgres:***@database-1.ciykztisaaxg.us-east-1.rds.amazonaws.com/postgres
 * postgresql+psycopg2://postgres:***@database-1.ciykztisaaxg.us-east-1.rds.amazonaws.com/studentdb
10 rows affected.


index,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_location_id,dropoff_location_id,fare_amount
4380300,2,2019-02-18 11:57:22,2019-02-18 12:23:17,3.0,249,228,24.5
1180497,1,2019-02-05 17:00:39,2019-02-05 17:06:32,0.0,211,158,5.5
2454286,1,2019-02-10 10:40:18,2019-02-10 11:01:19,1.0,79,142,17.0
4848450,2,2019-02-20 12:34:17,2019-02-20 12:55:26,2.0,161,164,13.0
713150,1,2019-02-03 17:45:36,2019-02-03 18:05:45,1.0,237,79,14.0
308808,1,2019-02-02 00:37:19,2019-02-02 01:06:49,1.0,100,20,34.5
2972573,2,2019-02-12 16:59:26,2019-02-12 17:03:02,2.0,107,224,4.0
6613199,2,2019-02-27 15:08:31,2019-02-27 15:25:00,1.0,163,170,11.0
3147727,2,2019-02-13 10:51:50,2019-02-13 11:30:32,2.0,138,162,34.0
6826264,1,2019-02-28 11:32:10,2019-02-28 12:11:44,1.0,220,265,65.0


In [None]:
%sql select * from taxi_zone_lookup limit 10;

   postgresql+psycopg2://postgres:***@database-1.ciykztisaaxg.us-east-1.rds.amazonaws.com/postgres
 * postgresql+psycopg2://postgres:***@database-1.ciykztisaaxg.us-east-1.rds.amazonaws.com/studentdb
10 rows affected.


index,locationid,borough,zone,service_zone
0,1,EWR,Newark Airport,EWR
1,2,Queens,Jamaica Bay,Boro Zone
2,3,Bronx,Allerton/Pelham Gardens,Boro Zone
3,4,Manhattan,Alphabet City,Yellow Zone
4,5,Staten Island,Arden Heights,Boro Zone
5,6,Staten Island,Arrochar/Fort Wadsworth,Boro Zone
6,7,Queens,Astoria,Boro Zone
7,8,Queens,Astoria Park,Boro Zone
8,9,Queens,Auburndale,Boro Zone
9,10,Queens,Baisley Park,Boro Zone


## Set up the GE project

In this section, you will create a sample Great Expectations project using the package installed above. You will add a Datasource for a sample data set, use an Expectation Suite produced by the automated profiler, run validation on a data set, and generate Data Docs.

In [None]:
!mkdir ge_demo
%cd ge_demo
!great_expectations --v3-api init

/content/dbt_demo/ge_demo
Using v3 (Batch Request) API

  ___              _     ___                  _        _   _
 / __|_ _ ___ __ _| |_  | __|_ ___ __  ___ __| |_ __ _| |_(_)___ _ _  ___
| (_ | '_/ -_) _` |  _| | _|\ \ / '_ \/ -_) _|  _/ _` |  _| / _ \ ' \(_-<
 \___|_| \___\__,_|\__| |___/_\_\ .__/\___\__|\__\__,_|\__|_\___/_||_/__/
                                |_|
             ~ Always know what to expect from your data ~

Let's create a new Data Context to hold your project configuration.

Great Expectations will create a new directory with the following structure:

    great_expectations
    |-- great_expectations.yml
    |-- expectations
    |-- checkpoints
    |-- plugins
    |-- .gitignore
    |-- uncommitted
        |-- config_variables.yml
        |-- data_docs
        |-- validations

OK to proceed? [Y/n]: Y

Congratulations! You are now ready to customize your Great Expectations configuration.

You can customize your configuratio

## Create datasource

Create a new data source configuration

In [None]:
!mkdir -p scripts

In [None]:
%%writefile ./scripts/create_datasource.py
import great_expectations as ge
from great_expectations.cli.datasource import sanitize_yaml_and_save_datasource

context = ge.get_context()

config = f"""
name: my_datasource
class_name: Datasource
execution_engine:
  class_name: SqlAlchemyExecutionEngine
  credentials:
    host: <db>
    port: '5432'
    username: postgres
    password: <dbpass>
    database: studentdb
    drivername: postgresql
data_connectors:
  default_runtime_data_connector_name:
    class_name: RuntimeDataConnector
    batch_identifiers:
      - default_identifier_name
  default_inferred_data_connector_name:
    class_name: InferredAssetSqlDataConnector
    name: whole_table"""

sanitize_yaml_and_save_datasource(context, config, overwrite_existing=True)

Overwriting ./scripts/create_datasource.py
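
The script above leaves `<db>` and `<dbpass>` as placeholders to be edited by hand. One way to avoid writing credentials into the script is to fill them in from environment variables at run time. The following is a minimal sketch of that idea, trimmed to the credentials part of the configuration; the variable names `DB_HOST` and `DB_PASSWORD` and the helper `render_datasource_config` are assumptions made for illustration.

```python
import os

# Same credential fields as the datasource config above; only host and password vary.
DATASOURCE_TEMPLATE = """\
name: my_datasource
class_name: Datasource
execution_engine:
  class_name: SqlAlchemyExecutionEngine
  credentials:
    host: {host}
    port: '5432'
    username: postgres
    password: {password}
    database: studentdb
    drivername: postgresql
"""


def render_datasource_config(env=os.environ) -> str:
    """Fill host and password from environment variables (hypothetical names)."""
    missing = [key for key in ("DB_HOST", "DB_PASSWORD") if not env.get(key)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    # Note: a password with YAML-special characters would also need quoting.
    return DATASOURCE_TEMPLATE.format(host=env["DB_HOST"], password=env["DB_PASSWORD"])


# Illustration with an explicit mapping instead of the real environment.
example = render_datasource_config({"DB_HOST": "db.example.com", "DB_PASSWORD": "secret"})
print(example)
```

The rendered string would take the place of `config` in the script; the data connector section is omitted here for brevity.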


In [None]:
!python ./scripts/create_datasource.py

Confirm that the Datasource was added correctly to the configuration file by running the following command:

In [None]:
!great_expectations --v3-api datasource list

Using v3 (Batch Request) API
1 Datasource found:

 - name: my_datasource
   class_name: Datasource

The following file was generated by the built-in profiler, which inspected the data in the yellow_tripdata_sample_2019_01 table in the PostgreSQL database and created Expectations based on the types and values found in the data.

In [None]:
%%writefile ./great_expectations/expectations/my_suite.json
{
    "data_asset_type": null,
    "expectation_suite_name": "my_suite",
    "expectations": [{
            "expectation_type": "expect_table_columns_to_match_ordered_list",
            "kwargs": {
                "column_list": [
                    "index",
                    "vendor_id",
                    "pickup_datetime",
                    "dropoff_datetime",
                    "passenger_count",
                    "pickup_location_id",
                    "dropoff_location_id",
                    "fare_amount"
                ]
            },
            "meta": {}
        },
        {
            "expectation_type": "expect_table_row_count_to_be_between",
            "kwargs": {
                "max_value": 12000,
                "min_value": 8000
            },
            "meta": {}
        },
        {
            "expectation_type": "expect_column_values_to_be_in_set",
            "kwargs": {
                "column": "vendor_id",
                "value_set": [
                    1,
                    2,
                    4,
                    5
                ]
            },
            "meta": {}
        },
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {
                "column": "vendor_id"
            },
            "meta": {}
        },
        {
            "expectation_type": "expect_column_values_to_be_in_set",
            "kwargs": {
                "column": "passenger_count",
                "value_set": [
                    0,
                    1,
                    2,
                    3,
                    4,
                    5,
                    6
                ]
            },
            "meta": {}
        },
        {
            "expectation_type": "expect_column_mean_to_be_between",
            "kwargs": {
                "column": "passenger_count",
                "max_value": 1.61,
                "min_value": 1.55
            },
            "meta": {}
        }
    ],
    "meta": {
        "great_expectations_version": "0.13.19"
    }
}

Overwriting ./great_expectations/expectations/my_suite.json
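
To make the suite above more concrete, here is a simplified, pure-Python analogue of two of its Expectation types. This is not the Great Expectations implementation, only a sketch of the logic; it assumes, as column-level value checks in Great Expectations generally do, that null values are skipped by the set-membership check and reported only by the not-null check. The function names and sample values are made up for illustration.

```python
from typing import Any, Iterable


def values_in_set(values: Iterable[Any], value_set: set) -> dict:
    """Simplified analogue of expect_column_values_to_be_in_set (nulls are skipped)."""
    non_null = [v for v in values if v is not None]
    unexpected = [v for v in non_null if v not in value_set]
    return {"success": not unexpected, "unexpected_list": unexpected}


def values_not_null(values: Iterable[Any]) -> dict:
    """Simplified analogue of expect_column_values_to_not_be_null."""
    null_count = sum(v is None for v in values)
    return {"success": null_count == 0, "unexpected_count": null_count}


# Made-up sample: one out-of-set value (7) and one null.
passenger_count = [1, 2, None, 7, 1]
in_set_result = values_in_set(passenger_count, {0, 1, 2, 3, 4, 5, 6})
not_null_result = values_not_null(passenger_count)
print(in_set_result)
print(not_null_result)
```

Because my_suite only applies the not-null check to `vendor_id`, null `passenger_count` values would not by themselves fail the suite under this reading.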


## Generate Data Docs
Data Docs are HTML pages showing your Expectation Suites and validation results. Let's look at my_suite in Data Docs to see which Expectations it contains.

Run the following command to generate Data Docs:

In [None]:
!great_expectations --v3-api docs build --no-view

Using v3 (Batch Request) API

The following Data Docs sites will be built:

 - local_site: file:///content/dbt_demo/ge_demo/great_expectations/uncommitted/data_docs/local_site/index.html

Would you like to proceed? [Y/n]: y

Building Data Docs...

Done building Data Docs

In [None]:
import portpicker
from google.colab.output import eval_js
port = portpicker.pick_unused_port()
print(eval_js("google.colab.kernel.proxyPort({})".format(port)))
%cd ./great_expectations/uncommitted/data_docs/local_site
!nohup python3 -m http.server $port &

https://hjyfw2jvxx-496ff2e9c6d22116-15927-colab.googleusercontent.com/
/content/dbt_demo/ge_demo/great_expectations/uncommitted/data_docs/local_site
nohup: appending output to 'nohup.out'
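
Note that the `%cd` in the cell above changes the notebook's working directory, so later relative paths such as `./scripts` resolve from `local_site` unless you change back. As an alternative, the sketch below serves a directory from a background thread without changing the working directory. It uses only the standard library; the helper name `serve_directory` and the choice to bind to `127.0.0.1` are assumptions, and whether the Colab proxy reaches that address was not checked here.

```python
import functools
import http.server
import threading


def serve_directory(directory: str, port: int = 0) -> http.server.ThreadingHTTPServer:
    """Serve `directory` over HTTP in a daemon thread; port 0 picks a free port."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Example usage (path taken from the build output above):
# server = serve_directory("./great_expectations/uncommitted/data_docs/local_site")
# print(server.server_address[1])  # the port actually bound
# server.shutdown()                # stop serving when done
```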


## Set up a Checkpoint to Run Validation

In this step, you will use your newly generated Expectation Suite to validate a data asset. Recall that the Expectation Suite was created based on the data found in the yellow_tripdata_sample_2019_01 table. You will first create a Checkpoint that runs this suite against that same table, which gives a passing baseline before any new data is validated.

In [None]:
%%writefile ./scripts/create_checkpoint.py
from ruamel.yaml import YAML
import great_expectations as ge

yaml = YAML()
context = ge.get_context()

config = f"""
name: my_checkpoint
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: "%Y%m%d-%H%M%S-validation-run"
validations:
  - batch_request:
      datasource_name: my_datasource
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: yellow_tripdata_sample_2019_01
      data_connector_query:
        index: -1
    expectation_suite_name: my_suite
"""

context.add_checkpoint(**yaml.load(config))

Overwriting ./scripts/create_checkpoint.py
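
The `run_name_template` in the Checkpoint configuration looks like a `strftime` pattern. Assuming it is expanded with the run's timestamp in the usual `strftime` way, the following sketch shows what a run name would look like. The timestamp and the helper name `render_run_name` are made up for illustration.

```python
from datetime import datetime

# Same template string as in the Checkpoint configuration above.
RUN_NAME_TEMPLATE = "%Y%m%d-%H%M%S-validation-run"


def render_run_name(when: datetime) -> str:
    """Expand the strftime directives; the literal suffix passes through unchanged."""
    return when.strftime(RUN_NAME_TEMPLATE)


run_name = render_run_name(datetime(2019, 2, 1, 12, 30, 5))
print(run_name)  # 20190201-123005-validation-run
```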


In [None]:
!python ./scripts/create_checkpoint.py

This creates a configuration for a new Checkpoint called my_checkpoint and saves it to the Data Context of your project. To confirm that the Checkpoint was created correctly, list all Checkpoints in the project:

In [None]:
!great_expectations --v3-api checkpoint list

Using v3 (Batch Request) API
Found 1 Checkpoint.
 - my_checkpoint

## Run validation with a Checkpoint
To run the Checkpoint and validate the yellow_tripdata_sample_2019_01 table with my_suite, execute:

In [None]:
!great_expectations --v3-api checkpoint run my_checkpoint

Using v3 (Batch Request) API
Calculating Metrics: 100% 21/21 [00:04<00:00,  4.72it/s]
Validation succeeded!

Suite Name                                   Status     Expectations met
- my_suite                                   ✔ Passed   6 of 6 (100.0 %)

In [None]:
# TEST
!cp -r /content/dbt_demo/ge_demo/great_expectations /tmp
!great_expectations --v3-api --config /tmp/great_expectations checkpoint run my_checkpoint; exit 99;

Using v3 (Batch Request) API
Calculating Metrics: 100% 21/21 [00:04<00:00,  4.72it/s]
Validation succeeded!

Suite Name                                   Status     Expectations met
- my_suite                                   ✔ Passed   6 of 6 (100.0 %)

## Load new data

In [None]:
!wget -q --show-progress https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.parquet



In [None]:
yellow_tripdata_2019_df = pd.read_parquet('yellow_tripdata_2019-02.parquet')
yellow_tripdata_2019_df = yellow_tripdata_2019_df[['VendorID',
                                                   'tpep_pickup_datetime',
                                                   'tpep_dropoff_datetime',
                                                   'passenger_count',
                                                   'PULocationID',
                                                   'DOLocationID',
                                                   'fare_amount']]

yellow_tripdata_2019_df.columns = ['vendor_id',
                                   'pickup_datetime',
                                   'dropoff_datetime',
                                   'passenger_count',
                                   'pickup_location_id',
                                   'dropoff_location_id',
                                   'fare_amount']

In [None]:
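# Create the SQLAlchemy engine used for the load (HOST and PASSWORD are set above)
alchemyEngine = create_engine(f'postgresql+psycopg2://postgres:{PASSWORD}@{HOST}/studentdb')
postgreSQLConnection = alchemyEngine.connect()
# The default index=True writes the DataFrame index as the "index" column that my_suite expects
yellow_tripdata_2019_df.sample(10000).to_sql('yellow_tripdata_sample_2019_02', postgreSQLConnection, if_exists='replace')
postgreSQLConnection.close()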
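
The first Expectation in my_suite requires an exact, ordered column list, so the renaming step and the index column written by `to_sql` both matter. The sketch below checks the expected table layout without touching the database. It assumes that pandas names an unnamed index column `index` when writing it; the helper `table_columns` is hypothetical.

```python
# Source-to-target column names, as used in the renaming cell above.
RENAME = {
    "VendorID": "vendor_id",
    "tpep_pickup_datetime": "pickup_datetime",
    "tpep_dropoff_datetime": "dropoff_datetime",
    "passenger_count": "passenger_count",
    "PULocationID": "pickup_location_id",
    "DOLocationID": "dropoff_location_id",
    "fare_amount": "fare_amount",
}

# column_list from expect_table_columns_to_match_ordered_list in my_suite.
SUITE_COLUMNS = [
    "index", "vendor_id", "pickup_datetime", "dropoff_datetime",
    "passenger_count", "pickup_location_id", "dropoff_location_id", "fare_amount",
]


def table_columns(rename: dict, index_label: str = "index") -> list:
    """Columns the loaded table should have: the index column, then the renamed columns."""
    return [index_label] + list(rename.values())


columns_match = table_columns(RENAME) == SUITE_COLUMNS
print(columns_match)
```

If the load were done with `index=False`, the table would lack the `index` column and the ordered-list Expectation would be expected to fail.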

## Set up a Checkpoint to Run Validation

In this step, you will use the Expectation Suite to validate a new data asset. Recall that the Expectation Suite was created based on the data found in the yellow_tripdata_sample_2019_01 table. You will now add a second Checkpoint, my_checkpoint_taxi_data_load_2019_2, that uses this suite to validate the yellow_tripdata_sample_2019_02 table and identify any unexpected differences in the data.

In [None]:
%%writefile ./scripts/create_checkpoint.py
from ruamel.yaml import YAML
import great_expectations as ge

yaml = YAML()
context = ge.get_context()

config = f"""
name: my_checkpoint
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: "%Y%m%d-%H%M%S-validation-run"
validations:
  - batch_request:
      datasource_name: my_datasource
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: yellow_tripdata_sample_2019_01
      data_connector_query:
        index: -1
    expectation_suite_name: my_suite
"""

context.add_checkpoint(**yaml.load(config))

config = f"""
name: my_checkpoint_taxi_data_load_2019_2
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: "%Y%m%d-%H%M%S-validation-run"
validations:
  - batch_request:
      datasource_name: my_datasource
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: yellow_tripdata_sample_2019_02
      data_connector_query:
        index: -1
    expectation_suite_name: my_suite
"""

context.add_checkpoint(**yaml.load(config))

Overwriting ./scripts/create_checkpoint.py
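
The script above repeats the same YAML twice, changing only the Checkpoint name and the table. A small factory function can express that more compactly. The sketch below builds plain dictionaries with the same keys as the YAML; the helper name `checkpoint_config` is hypothetical, and it is an assumption that `context.add_checkpoint(**cfg)` accepts a plain dict in the same way it accepts the parsed YAML in the script.

```python
def checkpoint_config(name: str, table: str, suite: str = "my_suite") -> dict:
    """Build a SimpleCheckpoint configuration equivalent to the YAML in the script."""
    return {
        "name": name,
        "config_version": 1.0,
        "class_name": "SimpleCheckpoint",
        "run_name_template": "%Y%m%d-%H%M%S-validation-run",
        "validations": [
            {
                "batch_request": {
                    "datasource_name": "my_datasource",
                    "data_connector_name": "default_inferred_data_connector_name",
                    "data_asset_name": table,
                    "data_connector_query": {"index": -1},
                },
                "expectation_suite_name": suite,
            }
        ],
    }


configs = [
    checkpoint_config("my_checkpoint", "yellow_tripdata_sample_2019_01"),
    checkpoint_config("my_checkpoint_taxi_data_load_2019_2", "yellow_tripdata_sample_2019_02"),
]

for cfg in configs:
    # In the script this would be followed by: context.add_checkpoint(**cfg)
    print(cfg["name"], "->", cfg["validations"][0]["batch_request"]["data_asset_name"])
```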


In [None]:
!python ./scripts/create_checkpoint.py

This adds a configuration for a second Checkpoint called my_checkpoint_taxi_data_load_2019_2 (and re-adds my_checkpoint) and saves it to the Data Context of your project. To confirm that the Checkpoints were created correctly, list all Checkpoints in the project:

In [None]:
!great_expectations --v3-api checkpoint list

Using v3 (Batch Request) API
Found 2 Checkpoints.
 - my_checkpoint
 - my_checkpoint_taxi_data_load_2019_2

## Run validation with a Checkpoint
To run the Checkpoint and validate the yellow_tripdata_sample_2019_02 with my_suite, execute:

In [None]:
!great_expectations --v3-api checkpoint run my_checkpoint_taxi_data_load_2019_2

Using v3 (Batch Request) API
Calculating Metrics: 100% 21/21 [00:04<00:00,  4.71it/s]
Validation failed!

Suite Name                                   Status     Expectations met
- my_suite                                   ✖ Failed   5 of 6 (83.33 %)

The validation output shows "Failed", meaning that Great Expectations has detected some data in this table that does not meet the Expectations in my_suite.

## Inspect validation results in Data Docs

Open the Data Docs site again:

- You will now see an additional Validation Results tab on the index page, listing a timestamped entry for each validation run.
- Click into the first row to go to the validation results detail page.
- On the detail page, you will see that the validation run is marked as "Failed."
- Scroll down to see which Expectations failed and why.

In [None]:
%sql select distinct passenger_count from yellow_tripdata_sample_2019_02

   postgresql+psycopg2://postgres:***@database-1.ciykztisaaxg.us-east-1.rds.amazonaws.com/postgres
 * postgresql+psycopg2://postgres:***@database-1.ciykztisaaxg.us-east-1.rds.amazonaws.com/studentdb
9 rows affected.


passenger_count
2.0
5.0
""
0.0
3.0
4.0
1.0
6.0
7.0


In [None]:
%sql select passenger_count, count(*) from yellow_tripdata_sample_2019_02 group by passenger_count

   postgresql+psycopg2://postgres:***@database-1.ciykztisaaxg.us-east-1.rds.amazonaws.com/postgres
 * postgresql+psycopg2://postgres:***@database-1.ciykztisaaxg.us-east-1.rds.amazonaws.com/studentdb
9 rows affected.


passenger_count,count
2.0,1431
5.0,427
,44
0.0,184
3.0,408
4.0,179
1.0,7066
6.0,260
7.0,1
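
The grouped counts above are enough to work out which Expectation failed. The sketch below reuses those counts and the `passenger_count` rules from my_suite (allowed values 0 to 6, mean between 1.55 and 1.61). It assumes nulls are ignored by both checks, which is a simplification of how Great Expectations evaluates them.

```python
# Counts copied from the query output above; None stands for the NULL row.
COUNTS = {2.0: 1431, 5.0: 427, None: 44, 0.0: 184, 3.0: 408,
          4.0: 179, 1.0: 7066, 6.0: 260, 7.0: 1}
ALLOWED = {0, 1, 2, 3, 4, 5, 6}    # value_set from my_suite
MEAN_MIN, MEAN_MAX = 1.55, 1.61    # bounds from my_suite

non_null = {value: n for value, n in COUNTS.items() if value is not None}
unexpected = {value: n for value, n in non_null.items() if value not in ALLOWED}

total_rows = sum(COUNTS.values())
total_non_null = sum(non_null.values())
mean = sum(value * n for value, n in non_null.items()) / total_non_null

print("rows:", total_rows)
print("unexpected values:", unexpected)
print("mean within bounds:", MEAN_MIN <= mean <= MEAN_MAX)
```

Under these assumptions only the set-membership check fails, because of the single row with seven passengers, which is consistent with the "5 of 6" result reported by the Checkpoint.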


In [None]:
%sql delete from yellow_tripdata_sample_2019_02 where passenger_count=7;
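
The delete above removes only rows where `passenger_count` equals 7; rows with a NULL `passenger_count` are untouched because a comparison with NULL is never true. The sketch below reproduces that behaviour on a tiny in-memory SQLite table as a stand-in for the PostgreSQL table. The table name `trips` and its rows are made up for illustration.

```python
import sqlite3

# In-memory stand-in for the PostgreSQL table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (passenger_count REAL)")
conn.executemany("INSERT INTO trips VALUES (?)", [(1.0,), (2.0,), (7.0,), (None,)])

# The connection context manager commits on success and rolls back on error.
with conn:
    cursor = conn.execute("DELETE FROM trips WHERE passenger_count = ?", (7,))
    deleted = cursor.rowcount  # number of rows removed

remaining = conn.execute(
    "SELECT passenger_count, COUNT(*) FROM trips "
    "GROUP BY passenger_count ORDER BY passenger_count"
).fetchall()

print("deleted:", deleted)
print("remaining:", remaining)
conn.close()
```

Checking the affected row count before re-running validation is a cheap way to confirm the delete matched only the rows you intended.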

In [None]:
%sql select passenger_count, count(*) from yellow_tripdata_sample_2019_02 group by passenger_count

   postgresql+psycopg2://postgres:***@database-1.ciykztisaaxg.us-east-1.rds.amazonaws.com/postgres
 * postgresql+psycopg2://postgres:***@database-1.ciykztisaaxg.us-east-1.rds.amazonaws.com/studentdb
8 rows affected.


passenger_count,count
2.0,1431
5.0,427
,44
0.0,184
3.0,408
4.0,179
1.0,7066
6.0,260


In [None]:
!great_expectations --v3-api checkpoint run my_checkpoint_taxi_data_load_2019_2

Using v3 (Batch Request) API
Calculating Metrics: 100% 21/21 [00:04<00:00,  4.71it/s]
Validation succeeded!

Suite Name                                   Status     Expectations met
- my_suite                                   ✔ Passed   6 of 6 (100.0 %)