# How to use `DataAssistants`

* A `DataAssistant` enables you to quickly profile your data by providing a thin API over a pre-constructed `RuleBasedProfiler` configuration.
* As a result of the profiling, you get back a result object consisting of:
    * `Metrics` that describe the current state of the data
    * `Expectations` that are able to alert you if the data deviates from the expected state in the future
* `DataAssistant` results can also be plotted to help you understand your data visually.
* There are multiple `DataAssistants`, each centered around a theme (volume, nullity, etc.). This notebook walks you through an example using the `OnboardingDataAssistant`, which is the most general and extensive `DataAssistant`.

The `OnboardingDataAssistant` is considered to be the "starting point" for profiling and is generally applicable for numerical data. In our example, we will use `taxi_trip` data: we build our `ExpectationSuite` using data from 2018-2019 and run it against January 2020 data to see whether our more recent data falls within the range of previous years.

In our example, the `OnboardingDataAssistant` takes in a `batch_request` describing data from 2018-2019 and uses a bootstrapping step to calculate upper and lower bounds across the sample `Batches`. The bootstrapping step allows the `DataAssistant` to account for outliers and to obtain a more accurate estimate of the true ranges by taking the underlying distribution into account.

The `OnboardingDataAssistant` consists of 8 `Rules`, and each `Rule` produces `Expectations`. These include table-level and column-level `Expectations` such as the following:

* `expect_table_columns_to_match_set` 
* `expect_table_row_count_to_be_between`
* `expect_column_min_to_be_between`
* `expect_column_max_to_be_between`
* `expect_column_mean_to_be_between`
* `expect_column_median_to_be_between`
* `expect_column_stdev_to_be_between`
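
To make the bootstrapping idea more concrete, here is a minimal standalone sketch of estimating a lower and upper bound for a metric from its per-`Batch` values. It is not the Great Expectations implementation: the function names `quantile` and `bootstrap_range`, the number of resamples, the 5% false-positive rate, and the row counts are all assumptions chosen for illustration.

```python
import random
import statistics
from typing import List, Tuple


def quantile(sorted_values: List[float], q: float) -> float:
    """Linearly interpolated quantile of an already sorted list (0 <= q <= 1)."""
    position = q * (len(sorted_values) - 1)
    low = int(position)
    high = min(low + 1, len(sorted_values) - 1)
    fraction = position - low
    return sorted_values[low] + fraction * (sorted_values[high] - sorted_values[low])


def bootstrap_range(
    values: List[float],
    n_resamples: int = 1000,
    false_positive_rate: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Average the lower/upper quantiles over many resamples drawn with replacement."""
    rng = random.Random(seed)
    lowers, uppers = [], []
    for _ in range(n_resamples):
        resample = sorted(rng.choices(values, k=len(values)))
        lowers.append(quantile(resample, false_positive_rate / 2))
        uppers.append(quantile(resample, 1 - false_positive_rate / 2))
    return statistics.fmean(lowers), statistics.fmean(uppers)


# Hypothetical row counts, one value per monthly Batch.
row_counts = [10000, 10020, 9980, 10010, 9990, 10005, 9995, 10015, 9985, 10000, 10030, 9970]
lower, upper = bootstrap_range(row_counts)
print(f"estimated range for the row count: [{lower:.1f}, {upper:.1f}]")
```

The two numbers play the role of the `min_value` and `max_value` of an `Expectation` such as `expect_table_row_count_to_be_between`.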

In [1]:
import great_expectations as ge
from great_expectations.core.yaml_handler import YAMLHandler
from great_expectations.core.batch import BatchRequest
from great_expectations.core import ExpectationSuite
from great_expectations.core.expectation_configuration import ExpectationConfiguration
from great_expectations.validator.validator import Validator
from great_expectations.rule_based_profiler.data_assistant import (
    DataAssistant,
    VolumeDataAssistant,
)
from great_expectations.rule_based_profiler.data_assistant_result import (
    VolumeDataAssistantResult,
)
from typing import List
yaml = YAMLHandler()

## Set-up: Adding `taxi_data` `Datasource`
* Add `taxi_data_all_years` as a new `Datasource`.
* We use a `ConfiguredAssetFilesystemDataConnector` to connect to data in the `test_sets/taxi_yellow_tripdata_samples` folder and define two `DataAssets`: `yellow_tripdata_20182019`, which has 24 `Batches` (one per month for 2018-2019), and `yellow_tripdata_all_years`, which has 36 `Batches` (one per month for 2018-2020).
* We will later use the `yellow_tripdata_all_years` asset to run a `Checkpoint` on 2020 data.

In [2]:
data_context: ge.DataContext = ge.get_context()

In [3]:
data_path: str = "../../../../test_sets/taxi_yellow_tripdata_samples"

datasource_config: dict = {
    "name": "taxi_data_all_years",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "PandasExecutionEngine",
    },
    "data_connectors": {
        "configured_data_connector_multi_batch_asset": {
            "class_name": "ConfiguredAssetFilesystemDataConnector",
            "base_directory": data_path,
            "assets":{
                "yellow_tripdata_20182019":{
                    "group_names": ["year", "month"],
                    "pattern": "yellow_tripdata_sample_(2018|2019)-(\\d.*)\\.csv",
                },
                "yellow_tripdata_all_years":{
                    "group_names": ["year", "month"],
                    "pattern": "yellow_tripdata_sample_(2018|2019|2020)-(\\d.*)\\.csv",
                }
            },
        },
    },
}

data_context.test_yaml_config(yaml.dump(datasource_config))

Attempting to instantiate class from config...
	Instantiating as a Datasource, since class_name is Datasource
	Successfully instantiated Datasource


ExecutionEngine class name: PandasExecutionEngine
Data Connectors:
	configured_data_connector_multi_batch_asset : ConfiguredAssetFilesystemDataConnector

	Available data_asset_names (2 of 2):
		yellow_tripdata_20182019 (3 of 24): ['yellow_tripdata_sample_2018-01.csv', 'yellow_tripdata_sample_2018-02.csv', 'yellow_tripdata_sample_2018-03.csv']
		yellow_tripdata_all_years (3 of 36): ['yellow_tripdata_sample_2018-01.csv', 'yellow_tripdata_sample_2018-02.csv', 'yellow_tripdata_sample_2018-03.csv']

	Unmatched data_references (3 of 68):['.DS_Store', 'first_3_files', 'first_3_files/.DS_Store']



<great_expectations.datasource.new_datasource.Datasource at 0x7fc598e8a9d0>

In [4]:
# add_datasource only if it doesn't already exist in our configuration
try:
    data_context.get_datasource(datasource_config["name"])
except ValueError:
    data_context.add_datasource(**datasource_config)
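
To see how the `pattern` and `group_names` settings in the configuration above turn file names into batch identifiers, the following sketch applies the same two regular expressions using only Python's `re` module. It does not call Great Expectations. The helper `to_batch_identifiers` and the generated file listing are assumptions for illustration, and the real `DataConnector` may apply the pattern differently in detail.

```python
import re
from typing import Dict, Optional

GROUP_NAMES = ["year", "month"]
PATTERN_2018_2019 = r"yellow_tripdata_sample_(2018|2019)-(\d.*)\.csv"
PATTERN_ALL_YEARS = r"yellow_tripdata_sample_(2018|2019|2020)-(\d.*)\.csv"


def to_batch_identifiers(file_name: str, pattern: str) -> Optional[Dict[str, str]]:
    """Map the regex capture groups to GROUP_NAMES, or return None if there is no match."""
    match = re.match(pattern, file_name)
    if match is None:
        return None
    return dict(zip(GROUP_NAMES, match.groups()))


# Hypothetical directory listing: one file per month for 2018-2020, plus a stray file.
file_names = [
    f"yellow_tripdata_sample_{year}-{month:02d}.csv"
    for year in (2018, 2019, 2020)
    for month in range(1, 13)
] + [".DS_Store"]

matched_2018_2019 = [n for n in file_names if to_batch_identifiers(n, PATTERN_2018_2019)]
matched_all_years = [n for n in file_names if to_batch_identifiers(n, PATTERN_ALL_YEARS)]

print(len(matched_2018_2019), len(matched_all_years))
print(to_batch_identifiers("yellow_tripdata_sample_2019-07.csv", PATTERN_2018_2019))
```

Files that do not match any pattern, such as `.DS_Store`, correspond to the "Unmatched data_references" reported by `test_yaml_config`.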

# Configure `BatchRequest`

In this example, we will use a `BatchRequest` that returns the 24 `Batches` of 2018-2019 data from the `yellow_tripdata_20182019` asset. It refers to the `Datasource` and `DataConnector` configured in the previous step.

In [5]:
multi_batch_all_years_batch_request: BatchRequest = BatchRequest(
    datasource_name="taxi_data_all_years",
    data_connector_name="configured_data_connector_multi_batch_asset",
    data_asset_name="yellow_tripdata_20182019",
)

In [6]:
batch_request: BatchRequest = multi_batch_all_years_batch_request

In [7]:
batch_list = data_context.get_batch_list(batch_request=batch_request)

In [8]:
batch_list # len = 24

[<great_expectations.core.batch.Batch at 0x7fc598eb6fd0>,
 <great_expectations.core.batch.Batch at 0x7fc5977f8f70>,
 <great_expectations.core.batch.Batch at 0x7fc5974585e0>,
 <great_expectations.core.batch.Batch at 0x7fc5975ae9a0>,
 <great_expectations.core.batch.Batch at 0x7fc5975ae7c0>,
 <great_expectations.core.batch.Batch at 0x7fc5998bb1c0>,
 <great_expectations.core.batch.Batch at 0x7fc599092520>,
 <great_expectations.core.batch.Batch at 0x7fc597afc8e0>,
 <great_expectations.core.batch.Batch at 0x7fc599303ca0>,
 <great_expectations.core.batch.Batch at 0x7fc599307fa0>,
 <great_expectations.core.batch.Batch at 0x7fc598e5b8e0>,
 <great_expectations.core.batch.Batch at 0x7fc598e73100>,
 <great_expectations.core.batch.Batch at 0x7fc598e6c040>,
 <great_expectations.core.batch.Batch at 0x7fc5991dfc40>,
 <great_expectations.core.batch.Batch at 0x7fc59a85ff10>,
 <great_expectations.core.batch.Batch at 0x7fc59aa6d400>,
 <great_expectations.core.batch.Batch at 0x7fc59abc77c0>,
 <great_expect

# Run the `OnboardingDataAssistant`

* The `OnboardingDataAssistant` can be run directly from the `DataContext` by accessing `assistants` and then `onboarding`, and passing in the `BatchRequest` from the previous step.

In [9]:
data_context.assistants.onboarding.run

<function great_expectations.rule_based_profiler.data_assistant.data_assistant_runner.DataAssistantRunner.run_impl.<locals>.run(batch_request: Union[great_expectations.core.batch.BatchRequestBase, dict, NoneType], min_max_unexpected_values_proportion: Union[str, float, NoneType] = 0.975, include_column_names: Union[str, List[str], NoneType] = None, include_column_name_suffixes: Union[str, Iterable, List[str], NoneType] = None, exclude_column_names: Union[str, List[str], NoneType] = ['id'], semantic_type_filter_module_name: Union[str, NoneType] = None, max_unexpected_ratio: Union[str, float, NoneType] = None, max_unexpected_values: Union[str, int] = 0, semantic_type_filter_class_name: Union[str, NoneType] = None, cardinality_limit_mode: Union[str, great_expectations.rule_based_profiler.helpers.cardinality_checker.CardinalityLimitMode, dict, NoneType] = '$variables.cardinality_limit_mode', max_proportion_unique: Union[str, float, NoneType] = None, max_unique_values: Union[str, int, NoneT

In [10]:
result = data_context.assistants.onboarding.run(batch_request=batch_request)




Generating Expectations:   0%|          | 0/8 [00:00<?, ?it/s]

Profiling Dataset:         0%|          | 0/1 [00:00<?, ?it/s]

Profiling Dataset:         0%|          | 0/0 [00:00<?, ?it/s]

Profiling Dataset:         0%|          | 0/0 [00:00<?, ?it/s]

Profiling Dataset:         0%|          | 0/12 [00:00<?, ?it/s]

Profiling Dataset:         0%|          | 0/11 [00:00<?, ?it/s]

Profiling Dataset:         0%|          | 0/0 [00:00<?, ?it/s]

Profiling Dataset:         0%|          | 0/3 [00:00<?, ?it/s]

Profiling Dataset:         0%|          | 0/8 [00:00<?, ?it/s]

**Note**: There a

# Explore `DataAssistantResult` by plotting

The resulting `DataAssistantResult` is best explored by plotting. For each `Domain` considered (`Table` and `Column` in our case), the plots display the metric value for each `Batch` (24 in total).

In [11]:
result.plot_metrics()

interactive(children=(Dropdown(description='Select Plot Type: ', layout=Layout(margin='0px', width='max-conten…

PlotResult(charts=[alt.LayerChart(...), alt.Chart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...),

An additional layer of information that can be retrieved from the `DataAssistantResult` is the `prescriptive` information, which corresponds to the range values of the `Expectations` that result from the `DataAssistant` run. 

For example, the `vendor_id` plot shows that the number of distinct `vendor_id` values ranged from 2 to 3 across all of our `Batches`, as indicated by the blue band around the plotted values. The edges of the band correspond to the `min_value` and `max_value` of the resulting `Expectation`, `expect_column_unique_value_count_to_be_between`.

In [12]:
result.plot_expectations_and_metrics()

interactive(children=(Dropdown(description='Select Plot Type: ', layout=Layout(margin='0px', width='max-conten…

PlotResult(charts=[alt.LayerChart(...), alt.Chart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...), alt.LayerChart(...),
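
The band in the plot can be read as a simple range check. The sketch below uses plain dictionaries rather than Great Expectations objects: the per-`Batch` counts, including the out-of-range value for `2020-01`, are hypothetical, and `is_within_band` is an illustrative helper rather than a library function.

```python
from typing import Dict, List

# Hypothetical number of distinct `vendor_id` values per Batch.
unique_value_counts: Dict[str, int] = {
    "2018-01": 2,
    "2018-02": 3,
    "2018-03": 2,
    "2020-01": 4,
}

# Stand-in for the kwargs of expect_column_unique_value_count_to_be_between.
expectation_kwargs = {"column": "vendor_id", "min_value": 2, "max_value": 3}


def is_within_band(value: int, kwargs: dict) -> bool:
    """True if the value lies inside the inclusive [min_value, max_value] band."""
    return kwargs["min_value"] <= value <= kwargs["max_value"]


outside_band: List[str] = [
    batch for batch, count in unique_value_counts.items()
    if not is_within_band(count, expectation_kwargs)
]
print(outside_band)
```

A `Batch` whose value falls outside the band is the kind of deviation the `Expectation` is meant to flag.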

# Save `ExpectationSuite`

Finally, we can add the `ExpectationConfiguration` objects resulting from the `DataAssistant` run to an `ExpectationSuite`, and then pass the updated `ExpectationSuite` to the `DataContext`'s `save_expectation_suite()` method.

In [13]:
suite: ExpectationSuite = ExpectationSuite(expectation_suite_name="taxi_data_suite")

In [14]:
resulting_configurations: List[ExpectationConfiguration] = suite.add_expectation_configurations(expectation_configurations=result.expectation_configurations)

In [15]:
data_context.save_expectation_suite(expectation_suite=suite)
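
As a rough picture of what gets saved, the sketch below builds a trimmed-down, dictionary-only stand-in for a suite and round-trips it through JSON. The bounds are hypothetical, and the real `taxi_data_suite.json` written by Great Expectations contains more expectations and may contain additional fields not shown here.

```python
import json

# One expectation, reduced to its type, its arguments, and an empty meta section.
expectation_configuration = {
    "expectation_type": "expect_table_row_count_to_be_between",
    "kwargs": {"min_value": 9000, "max_value": 11000},  # hypothetical bounds
    "meta": {},
}

suite_as_dict = {
    "expectation_suite_name": "taxi_data_suite",
    "expectations": [expectation_configuration],
}

serialized = json.dumps(suite_as_dict, indent=2)
round_tripped = json.loads(serialized)
print(serialized)
```

Reading the saved JSON is a convenient way to review the `min_value` and `max_value` bounds the `DataAssistant` proposed before relying on them.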

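# Use the `ExpectationSuite` to run a `Checkpoint`

We now validate 2020 data against the saved `ExpectationSuite` by running a `Checkpoint`. Its `UpdateDataDocsAction` also updates the Data Docs, where the validation results can be viewed.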

In [16]:
single_batch_batch_request_from_multi: BatchRequest = BatchRequest(
    datasource_name="taxi_data_all_years",
    data_connector_name="configured_data_connector_multi_batch_asset",
    data_asset_name="yellow_tripdata_all_years",
)

In [17]:
batch_list = data_context.get_batch_list(batch_request=single_batch_batch_request_from_multi)

In [18]:
batch_list[0].batch_definition

{'datasource_name': 'taxi_data_all_years', 'data_connector_name': 'configured_data_connector_multi_batch_asset', 'data_asset_name': 'yellow_tripdata_all_years', 'batch_identifiers': {'year': '2018', 'month': '01'}}
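
Because this asset resolves to 36 `Batches`, it is worth checking which one a validation actually uses: the recorded `batch_spec` path further below ends in `yellow_tripdata_sample_2020-12.csv`. The sketch below uses plain dictionaries (not Great Expectations objects) to contrast selecting a `Batch` by its identifiers with simply taking the last one; `filter_batches` is a hypothetical helper. In Great Expectations itself, such a filter is typically expressed on the `BatchRequest` (for example via `data_connector_query` with `batch_filter_parameters`); treat that as an assumption to verify against your installed version.

```python
from typing import Dict, List

# Stand-ins for the batch_identifiers of the 36 monthly Batches, in file-name order.
batch_identifiers_list: List[Dict[str, str]] = [
    {"year": str(year), "month": f"{month:02d}"}
    for year in (2018, 2019, 2020)
    for month in range(1, 13)
]


def filter_batches(batches: List[Dict[str, str]], **wanted: str) -> List[Dict[str, str]]:
    """Keep the batches whose identifiers equal every requested key/value pair."""
    return [b for b in batches if all(b.get(k) == v for k, v in wanted.items())]


january_2020 = filter_batches(batch_identifiers_list, year="2020", month="01")
last_batch = batch_identifiers_list[-1]

print(january_2020)
print(last_batch)
```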

In [19]:
checkpoint_config = {
    "name": "my_checkpoint",
    "config_version": 1,
    "class_name": "SimpleCheckpoint",
    "validations": [
        {
            "batch_request": single_batch_batch_request_from_multi,
            "expectation_suite_name": "taxi_data_suite",
            "batch_identifiers":{"year": "2020", "month": "01"} # batch_identifier month is set to 01
        }
    ],
    "action_list": [
        {
            "name": "store_validation_result",
            "action": {
                "class_name": "StoreValidationResultAction",
            },
        },
        {
            "name": "store_evaluation_params",
            "action": {
                "class_name": "StoreEvaluationParametersAction",
            },
        },
        {
            "name": "update_data_docs",
            "action": {
                "class_name": "UpdateDataDocsAction",
            },
        },
    ],
}
data_context.add_checkpoint(**checkpoint_config)

{
  "action_list": [
    {
      "name": "store_validation_result",
      "action": {
        "class_name": "StoreValidationResultAction"
      }
    },
    {
      "name": "store_evaluation_params",
      "action": {
        "class_name": "StoreEvaluationParametersAction"
      }
    },
    {
      "name": "update_data_docs",
      "action": {
        "class_name": "UpdateDataDocsAction",
        "site_names": []
      }
    }
  ],
  "batch_request": {},
  "class_name": "Checkpoint",
  "config_version": 1.0,
  "evaluation_parameters": {},
  "module_name": "great_expectations.checkpoint",
  "name": "my_checkpoint",
  "profilers": [],
  "runtime_configuration": {},
  "validations": [
    {
      "batch_request": {
        "datasource_name": "taxi_data_all_years",
        "data_connector_name": "configured_data_connector_multi_batch_asset",
        "data_asset_name": "yellow_tripdata_all_years"
      },
      "expectation_suite_name": "taxi_data_suite",
      "batch_identifiers": {
     

In [20]:
# The Checkpoint was already added in the previous cell, so we only need to run it here.
results = data_context.run_checkpoint(checkpoint_name="my_checkpoint")

Calculating Metrics:   0%|          | 0/207 [00:00<?, ?it/s]

In [23]:
results.list_validation_results

<bound method CheckpointResult.list_validation_results of {
  "run_id": {
    "run_time": "2022-08-11T17:42:54.500568+00:00",
    "run_name": null
  },
  "run_results": {
    "ValidationResultIdentifier::taxi_data_suite/__none__/20220811T174254.500568Z/9b2e727e74edbe57d0bae443c10af899": {
      "validation_result": {
        "meta": {
          "great_expectations_version": "0.15.18+5.g880ae799b.dirty",
          "expectation_suite_name": "taxi_data_suite",
          "run_id": {
            "run_time": "2022-08-11T17:42:54.500568+00:00",
            "run_name": null
          },
          "batch_spec": {
            "path": "/Users/work/Development/great_expectations/tests/test_fixtures/rule_based_profiler/example_notebooks/great_expectations/../../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-12.csv"
          },
          "batch_markers": {
            "ge_load_time": "20220812T004256.078118Z",
            "pandas_data_fingerprint": "b1b9772f30b6c43a0a26daf

In [21]:
results.success

False
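
A result of `False` means that the validation did not fully pass. As a sketch of how such an overall flag can be derived and how to find the failing `Expectations`, the example below works on a hypothetical, heavily trimmed list of per-`Expectation` results. The entries and their outcomes are invented for illustration and are not taken from the run above.

```python
from typing import Dict, List

# Hypothetical, trimmed per-Expectation outcomes for one validated Batch.
expectation_results: List[Dict[str, object]] = [
    {"expectation_type": "expect_table_row_count_to_be_between", "success": True},
    {"expectation_type": "expect_column_max_to_be_between", "success": False},
    {"expectation_type": "expect_column_mean_to_be_between", "success": True},
]

# The overall flag is True only if every individual result succeeded.
overall_success = all(result["success"] for result in expectation_results)

failed_types = [
    result["expectation_type"]
    for result in expectation_results
    if not result["success"]
]

print(overall_success)
print(failed_types)
```

Listing the failed types first is usually quicker than reading the full validation result, and the Data Docs present the same information in a browsable form.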

## Optional: Clean-up Directory


As part of running this notebook, the `DataAssistant` creates a number of `ExpectationSuite` configurations in the `great_expectations/expectations/tmp` directory. Optionally, uncomment and run the following cell to clean up the directory.

In [22]:
#import shutil, os
#shutil.rmtree("great_expectations/expectations/tmp")
#os.remove("great_expectations/expectations/.ge_store_backend_id")
#os.remove("great_expectations/expectations/taxi_data_suite.json")
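
If you prefer a clean-up that does not fail when a path is already gone, a variant along the following lines may help. `remove_if_present` is a hypothetical helper, and the demonstration runs in a throwaway temporary directory that mimics the layout above, so it does not touch your project.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable, List


def remove_if_present(paths: Iterable[Path]) -> List[Path]:
    """Delete each existing directory or file and return the paths that were removed."""
    removed = []
    for path in paths:
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(path)
        elif path.is_file():
            path.unlink()
            removed.append(path)
    return removed


# Demonstration in a scratch directory; `.ge_store_backend_id` is deliberately missing.
with tempfile.TemporaryDirectory() as scratch:
    root = Path(scratch) / "expectations"
    (root / "tmp").mkdir(parents=True)
    (root / "taxi_data_suite.json").write_text("{}")
    removed = remove_if_present(
        [root / "tmp", root / ".ge_store_backend_id", root / "taxi_data_suite.json"]
    )
    removed_names = [path.name for path in removed]
    remaining = sorted(path.name for path in root.iterdir())

print(removed_names)
```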

# Appendix

Additional parameters for `run`

Which statistics are computed during profiling?