In [4]:
from ruamel import yaml
import great_expectations as ge
from great_expectations.core.batch import BatchRequest
from great_expectations.expectations.expectation import Expectation
from great_expectations.rule_based_profiler.config import RuleBasedProfilerConfig

# Self-Initializing Expectations
- Self-initializing `Expectations` utilize `RuleBasedProfilers` to automate parameter estimation for Expectations using a Batch or Batches that have been loaded into a `Validator`. 

### Do they work for all `Expectations`?
- No, not all `Expectations` have parameters that can be estimated. As an example, `ExpectColumnToExist` only takes in a `Domain` (which is the column name) and checks whether the column name is in the list of names in the table's metadata. It would be an example of an `Expectation` that would not work under the self-initializing framework. 
- `Expectations` that would work under the self-initializing framework include those that take numeric ranges, like `ExpectColumnMeanToBeBetween`, `ExpectColumnMaxToBeBetween`, and `ExpectColumnSumToBeBetween`.
- To check whether the `Expectation` you are interested in can be self-initialized, run the `is_expectation_self_initializing()` method on `Expectation`.

In [32]:
Expectation.is_expectation_self_initializing(name="expect_column_to_exist")

The Expectation expect_column_to_exist is not able to be self-initialized.


False

In [33]:
Expectation.is_expectation_self_initializing(name="expect_column_mean_to_be_between")

The Expectation expect_column_mean_to_be_between is able to be self-initialized. Please run by using the auto=True parameter.


True

# Set-up

* To set up an example use case for self-initializing `Expectations`, we will start from a new Great Expectations Data Context (i.e. the `great_expectations` folder created by running `great_expectations init`), and begin by adding the `Datasource` and configuring a `BatchRequest`.

In [34]:
data_context: ge.DataContext = ge.get_context()

### Adding `taxi_data` Datasource
We are using an `InferredAssetFilesystemDataConnector` (named `2018_data`) to connect to data in the `test_sets/taxi_yellow_tripdata_samples` folder and get one `DataAsset` (`yellow_tripdata_sample_2018`) that has 12 Batches (1 Batch/month).

In [35]:
data_path: str = "../../../../test_sets/taxi_yellow_tripdata_samples"

datasource_config = {
    "name": "taxi_multi_batch_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "PandasExecutionEngine",
    },
    "data_connectors": {
        "2018_data": {
            "class_name": "InferredAssetFilesystemDataConnector",
            "base_directory": data_path,
            "default_regex": {
                "group_names": ["data_asset_name", "month"],
                "pattern": "(yellow_tripdata_sample_2018)-(\\d.*)\\.csv",
            },
        },
    },
}

data_context.test_yaml_config(yaml.dump(datasource_config))


Attempting to instantiate class from config...
	Instantiating as a Datasource, since class_name is Datasource
	Successfully instantiated Datasource


ExecutionEngine class name: PandasExecutionEngine
Data Connectors:
	2018_data : InferredAssetFilesystemDataConnector

	Available data_asset_names (1 of 1):
		yellow_tripdata_sample_2018 (3 of 12): ['yellow_tripdata_sample_2018-01.csv', 'yellow_tripdata_sample_2018-02.csv', 'yellow_tripdata_sample_2018-03.csv']

	Unmatched data_references (3 of 30):['.DS_Store', 'first_3_files', 'random_subsamples']



<great_expectations.datasource.new_datasource.Datasource at 0x7fc8c860f4f0>

In [36]:
# add_datasource only if it doesn't already exist in our configuration
try:
    data_context.get_datasource(datasource_config["name"])
except ValueError:
    data_context.add_datasource(**datasource_config)
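
To see how the `default_regex` in the datasource configuration above maps file names to a `data_asset_name` and a `month`, the following is a minimal standalone sketch that uses only Python's `re` module. The file list is hypothetical (the 2019 file name is made up for illustration), `parse_file_name` is a helper introduced here, and the use of `fullmatch` is an assumption; the `DataConnector` may apply the pattern differently.

```python
import re

# Same pattern and group names as the "default_regex" in the datasource config.
pattern = re.compile(r"(yellow_tripdata_sample_2018)-(\d.*)\.csv")
group_names = ["data_asset_name", "month"]

# Hypothetical directory listing, mixing matching and non-matching entries.
file_names = [
    "yellow_tripdata_sample_2018-01.csv",
    "yellow_tripdata_sample_2018-12.csv",
    "yellow_tripdata_sample_2019-01.csv",
    ".DS_Store",
]


def parse_file_name(file_name):
    """Return a dict of group name -> captured value, or None if there is no match."""
    match = pattern.fullmatch(file_name)
    if match is None:
        return None
    return dict(zip(group_names, match.groups()))


parsed = {name: parse_file_name(name) for name in file_names}
for name, groups in parsed.items():
    print(name, "->", groups)
```

Entries that do not match (such as `.DS_Store`) correspond to the "Unmatched data_references" reported by `test_yaml_config()`.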

### Configuring BatchRequest
In this example, we will be using a `BatchRequest` that returns 12 `Batches` of data from the 2018 `taxi_data` dataset.

In [38]:
batch_request_2018_data: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_datasource",
    data_connector_name="2018_data",
    data_asset_name="yellow_tripdata_sample_2018",
)

### Get Validator

Load `taxi_data` into a `Validator` using the `BatchRequest` from the previous step.

In [39]:
suite = data_context.create_expectation_suite(
    expectation_suite_name="new_expectation_suite", overwrite_existing=True
)

In [40]:
validator = data_context.get_validator(expectation_suite=suite, batch_request=batch_request_2018_data)

Check that the number of batches in our validator is 12 (1 batch / month for 2018)

In [41]:
assert len(validator.batches) == 12

# Running Self-Initializing Expectation

Now we have all the components we need to build an ExpectationSuite by using a Validator. Let's first look at our data by running `validator.head()` which will output the first few rows of our most recent (December 2018) Batch.

In [42]:
validator.head()

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Unnamed: 0,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,rate_code_id,store_and_fwd_flag,pickup_location_id,dropoff_location_id,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,2,2018-12-22 18:30:39,2018-12-22 18:42:37,1,1.39,1,N,170,229,2,9.0,0.0,0.5,0.0,0.0,0.3,9.8,
1,2,2018-12-29 14:46:47,2018-12-29 15:07:41,1,3.77,1,N,68,140,1,16.0,0.0,0.5,5.04,0.0,0.3,21.84,
2,1,2018-12-01 16:04:05,2018-12-01 16:45:20,1,4.9,1,N,263,249,1,26.5,0.0,0.5,5.46,0.0,0.3,32.76,
3,1,2018-12-31 15:28:07,2018-12-31 15:28:16,0,0.0,5,N,132,132,1,70.0,0.0,0.0,0.0,0.0,0.3,70.3,
4,2,2018-12-31 18:13:34,2018-12-31 18:41:03,1,6.74,1,N,162,116,1,24.5,1.0,0.5,5.26,0.0,0.3,31.56,


#### The "old" way

Let's say that you were interested in constructing an `Expectation` that captured the average distance for taxi trips during a year and alerted you if the average trip distance fell out of the previous year's range. 

A good starting point would be `expect_column_mean_to_be_between()`, and a look at its signature reveals the following parameters: 

```
column (str): The column name.
min_value (float or None): The minimum value for the column mean.
max_value (float or None): The maximum value for the column mean.
strict_min (boolean): If True, the column mean must be strictly larger than min_value, default=False
strict_max (boolean): If True, the column mean must be strictly smaller than max_value, default=False
```

`column` and the boolean flags (`strict_min` and `strict_max`) seem straightforward enough, but how do you set the appropriate `min_value` and `max_value`?

Previously, this would involve loading each `Batch` (month's data) individually, calculating the mean value for `trip_distance` for each `Batch`, and using calculated `mean` values to determine the `min_value` and `max_value` parameters to pass to our `Expectation`. 
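
The manual procedure described above can be sketched in plain Python. The numbers are hypothetical stand-ins for three months of `trip_distance` values (not real taxi data), and taking the smallest and largest monthly mean as the bounds is just one simple choice among many.

```python
from statistics import mean

# Hypothetical trip_distance values for three monthly Batches (not real taxi data).
batches = {
    "2018-01": [1.0, 2.5, 4.0],
    "2018-02": [0.5, 3.5, 5.0],
    "2018-03": [2.0, 2.0, 5.0],
}

# Step 1: compute the mean of the column for each Batch separately.
batch_means = {month: mean(values) for month, values in batches.items()}

# Step 2: derive the Expectation parameters from the per-Batch means.
min_value = min(batch_means.values())
max_value = max(batch_means.values())

print(batch_means)
print(min_value, max_value)
```

With 12 `Batches` and several columns of interest, this bookkeeping quickly becomes tedious, which is the motivation for automating it.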

#### The "new" way

Self-initializing `Expectations` automate this sort of calculation across batches. To perform the same calculation described above (the mean ranges across the 12 `Batches` in the 2018 data), the only thing you need to do is run the `Expectation` with `auto=True`.

In [28]:
validator.expect_column_mean_to_be_between(column="trip_distance", auto=True)

Profiling Dataset:   0%|          | 0/1 [00:00<?, ?it/s]




Calculating Metrics:   0%|          | 0/4 [00:00<?, ?it/s]

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_mean_to_be_between",
    "kwargs": {
      "column": "trip_distance",
      "min_value": 2.83,
      "max_value": 3.06,
      "strict_min": false,
      "strict_max": false
    },
    "meta": {
      "auto_generated_at": "20220519T230312.066546Z",
      "great_expectations_version": "0.15.6+20.gd61afe072.dirty"
    }
  },
  "meta": {},
  "result": {
    "observed_value": 2.926081
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

Then the Expectation will calculate the `min_value` (`2.83`) and `max_value` (`3.06`) using all the `Batches` that are loaded into the Validator, in our case the 12 batches associated with 2018 `taxi_data`. 
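
To make the result above concrete, here is a small sketch of the bounds check implied by the documented parameters (`min_value`, `max_value`, `strict_min`, `strict_max`). The function `mean_is_in_range` is a hypothetical helper written for this illustration and is not the library's implementation.

```python
def mean_is_in_range(observed, min_value=None, max_value=None,
                     strict_min=False, strict_max=False):
    """Illustrative bounds check; a bound of None is not enforced."""
    if min_value is not None:
        if observed < min_value or (strict_min and observed == min_value):
            return False
    if max_value is not None:
        if observed > max_value or (strict_max and observed == max_value):
            return False
    return True


# Values taken from the validation result shown above.
result = mean_is_in_range(2.926081, min_value=2.83, max_value=3.06)
print(result)
```

With the estimated range of `2.83` to `3.06`, the observed mean of `2.926081` falls inside the bounds, which is consistent with `"success": true` in the output.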

Now the Expectation can be saved to the ExpectationSuite associated with the Validator, with the upper and lower bounds having come from the Batches.

In [29]:
validator.save_expectation_suite(discard_failed_expectations=False)

# How to write your own self-initializing Expectation

Inside each self-initializing `Expectation` is a `RuleBasedProfiler` configuration that is run by the `Validator` when building the `ExpectationConfiguration`. Writing your own self-initializing `Expectation` involves writing your own `RuleBasedProfiler` configuration (or adapting an existing configuration) to automatically estimate the parameters that the `Expectation` requires. For more information on `RuleBasedProfiler` components and their requirements, please refer to the [RBP Jupyter Notebook](https://github.com/great-expectations/great_expectations/blob/d91fe2e801879f8c407082dd4330dbe9a11d2d78/tests/test_fixtures/rule_based_profiler/example_notebooks/BasicExample_RBP_Instantiation_and_running.ipynb).


The following is the configuration that is part of `ExpectColumnMeanToBeBetween`, which can be found [here](https://github.com/great-expectations/great_expectations/blob/f53e27b068007471b819fc089f008d2a24864d20/great_expectations/expectations/core/expect_column_mean_to_be_between.py). Please also note that some `ENUM` values (like `DOMAIN_KWARGS_PARAMETER_FULLY_QUALIFIED_NAME`) have been translated into string values for readability.

In [31]:
default_profiler_config: RuleBasedProfilerConfig = RuleBasedProfilerConfig(
    name="expect_column_mean_to_be_between",
    config_version=1.0,
    variables={},
    rules={
    "default_expect_column_mean_to_be_between_rule": {
      "variables": {
        "strict_min": False,
        "strict_max": False,
        "false_positive_rate": 0.05,
        "quantile_statistic_interpolation_method": "auto",
        "quantile_bias_std_error_ratio_threshold": "0.25",
        "estimator": "bootstrap",
        "n_resamples": 9999,
        "include_estimator_samples_histogram_in_details": False,
        "truncate_values": {},
        "round_decimals": 2
      },
      "domain_builder": {
        "class_name": "ColumnDomainBuilder",
        "module_name": "great_expectations.rule_based_profiler.domain_builder"
      },
      "expectation_configuration_builders": [
        {
          "expectation_type": "expect_column_mean_to_be_between",
          "class_name": "DefaultExpectationConfigurationBuilder",
          "module_name": "great_expectations.rule_based_profiler.expectation_configuration_builder",
          "validation_parameter_builder_configs": [
            {
              "module_name": "great_expectations.rule_based_profiler.parameter_builder",
              "estimator": "$variables.estimator",
              "quantile_statistic_interpolation_method": "$variables.quantile_statistic_interpolation_method",
              "quantile_bias_std_error_ratio_threshold": "$variables.quantile_bias_std_error_ratio_threshold",
              "enforce_numeric_metric": True,
              "n_resamples": "$variables.n_resamples",
              "name": "mean_range_estimator",
              "metric_name": "column.mean",
              "class_name": "NumericMetricRangeMultiBatchParameterBuilder",
              "round_decimals": "$variables.round_decimals",
              "metric_domain_kwargs": "$domain.domain_kwargs",
              "reduce_scalar_metric": True,
              "include_estimator_samples_histogram_in_details": "$variables.include_estimator_samples_histogram_in_details",
              "truncate_values": "$variables.truncate_values",
              "false_positive_rate": "$variables.false_positive_rate",
              "replace_nan_with_zero": True
            }
          ],
          "column": "$domain.domain_kwargs.column",
          "min_value": "$parameter.mean_range_estimator.value[0]",
          "max_value": "$parameter.mean_range_estimator.value[1]",
          "strict_min": "$variables.strict_min",
          "strict_max": "$variables.strict_max",
          "meta": {
            "profiler_details": "$parameter.mean_range_estimator.details"
          }
        }
      ]
    }
  }
)

## More Details

## `variables`
Key-value pairs defined in this portion of the configuration are shared across `Rules` and `Rule` components, which helps you keep track of values without having to input them multiple times.

* `strict_min`: Used by `expect_column_mean_to_be_between` Expectation. Recognized values are `True` or `False`.
* `strict_max`: Used by `expect_column_mean_to_be_between` Expectation. Recognized values are `True` or `False`. 
* `false_positive_rate`: Used by `NumericMetricRangeMultiBatchParameterBuilder`. Typically a float between `0` and `1.0`.
* `quantile_statistic_interpolation_method`: Used by `NumericMetricRangeMultiBatchParameterBuilder`, which is used when estimating quantile values (not relevant in our case). Recognized values include `auto`, `nearest`, and `linear`.
* `quantile_bias_std_error_ratio_threshold`: Used by `NumericMetricRangeMultiBatchParameterBuilder`, which is used when estimating quantile bias (not relevant in our case). Accepts floating point number.
* `estimator`: Used by `NumericMetricRangeMultiBatchParameterBuilder`. Recognized values include `oneshot`, `bootstrap`, and `kde`. 
* `n_resamples`:  Used by `NumericMetricRangeMultiBatchParameterBuilder`. Integer values are expected. 
* `include_estimator_samples_histogram_in_details`: Used by `NumericMetricRangeMultiBatchParameterBuilder`. Recognized values are `True` or `False`.
* `truncate_values`: A value used by the `NumericMetricRangeMultiBatchParameterBuilder` to specify the `[lower_bound, upper_bound]` interval, where either boundary is numeric or None. In our case the value is an empty dictionary, and an equivalent configuration would have been `truncate_values : { lower_bound: None, upper_bound: None }`. 
* `round_decimals` : Used by `NumericMetricRangeMultiBatchParameterBuilder`, and determines how many digits after the decimal point to output (in our case 2). 
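
As a rough illustration of how `truncate_values` and `round_decimals` could shape an estimated range, consider the sketch below. The helper `postprocess_range` and the raw estimates passed to it are hypothetical, and clamping followed by rounding is an assumption about the order of operations; the actual `ParameterBuilder` may differ in detail.

```python
def postprocess_range(lower, upper, truncate_values=None, round_decimals=None):
    """Clamp an estimated (lower, upper) range and round it (illustrative only)."""
    truncate_values = truncate_values or {}
    lower_bound = truncate_values.get("lower_bound")
    upper_bound = truncate_values.get("upper_bound")

    def clamp(value):
        # A bound of None (or a missing key) leaves that side unrestricted.
        if lower_bound is not None:
            value = max(value, lower_bound)
        if upper_bound is not None:
            value = min(value, upper_bound)
        return value

    lower, upper = clamp(lower), clamp(upper)
    if round_decimals is not None:
        lower, upper = round(lower, round_decimals), round(upper, round_decimals)
    return lower, upper


# Hypothetical raw estimates; an empty dict means no truncation, as in our config.
untruncated = postprocess_range(2.8312, 3.0649, truncate_values={}, round_decimals=2)

# A lower bound of 0.0 keeps a negative estimate from leaking into the output.
truncated = postprocess_range(-0.4, 3.0649, {"lower_bound": 0.0}, round_decimals=2)

print(untruncated, truncated)
```

A lower bound of `0.0` is a natural choice for metrics that cannot be negative, such as a distance.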

## `domain_builder`
The `DomainBuilder` configuration requires a `class_name` and `module_name`:
- `class_name`: is `ColumnDomainBuilder` in our case. For examples of additional DomainBuilders, please refer to the Appendix of the [RBP Jupyter Notebook](https://github.com/great-expectations/great_expectations/blob/d91fe2e801879f8c407082dd4330dbe9a11d2d78/tests/test_fixtures/rule_based_profiler/example_notebooks/BasicExample_RBP_Instantiation_and_running.ipynb)
- `module_name`: is `great_expectations.rule_based_profiler.domain_builder`, which is common for all `DomainBuilders`. 
- The `ColumnDomainBuilder` outputs the column of interest (in our case `trip_distance`), which is accessed by the `ExpectationConfigurationBuilder` using the variable `$domain.domain_kwargs.column`.


## `validation_parameter_builder_configs`
Our list contains a configuration for 1 `ParameterBuilder`, a `NumericMetricRangeMultiBatchParameterBuilder`. For examples of additional `ParameterBuilders`, please refer to the Appendix of the [RBP Jupyter Notebook](https://github.com/great-expectations/great_expectations/blob/d91fe2e801879f8c407082dd4330dbe9a11d2d78/tests/test_fixtures/rule_based_profiler/example_notebooks/BasicExample_RBP_Instantiation_and_running.ipynb).
* `name`: `mean_range_estimator`
* `class_name`: `NumericMetricRangeMultiBatchParameterBuilder`
* `module_name`: `great_expectations.rule_based_profiler.parameter_builder` which is the same for all `ParameterBuilders`.
* `estimator`: choice of the estimation algorithm: "oneshot" (one observation), "bootstrap" (default), or "kde" (kernel density estimation). Value is pulled from `$variables.estimator`, which is set to "bootstrap" in our configuration.  
* `quantile_statistic_interpolation_method`:  Applicable for the "bootstrap" sampling method. Determines the value of interpolation "method" to `np.quantile()` statistic, which is used for confidence intervals. Value is pulled from `$variables.quantile_statistic_interpolation_method`, which is set to "auto" in our configuration.
* `quantile_bias_std_error_ratio_threshold`:  Applicable for the "bootstrap" sampling method. Specifies the value of quantile bias threshold, which is used for confidence intervals. Value is pulled from `$variables.quantile_bias_std_error_ratio_threshold`, which is set to "0.25" in our configuration.
* `enforce_numeric_metric`: used in `MetricConfiguration` to ensure that metric computations return numeric values. Set to `True`. 
* `n_resamples`: Applicable for the "bootstrap" and "kde" sampling methods -- if omitted (default), then 9999 is used.  Value is pulled from `$variables.n_resamples`, which is set to `9999` in our configuration.
* `round_decimals`: User-configured non-negative integer indicating the number of decimals of the rounding precision of the computed parameter values (i.e., `min_value`, `max_value`) prior to packaging them on output.  If omitted, then no rounding is performed, unless the computed value is already an integer. Value is pulled from `$variables.round_decimals` which is `2` in our configuration.
* `reduce_scalar_metric`: If `True` (default), then reduces computation of 1-dimensional metric to scalar value. This value is set to `True`.
* `include_estimator_samples_histogram_in_details`: For the "bootstrap" sampling method -- if True, then add 10-bin histogram of bootstraps to "details"; otherwise, omit this information (default). Value pulled from `$variables.include_estimator_samples_histogram_in_details`, which is `False` in our configuration.
* `truncate_values`: User-configured directive for whether or not to allow the computed parameter values (i.e., `lower_bound`, `upper_bound`) to take on values outside the specified bounds when packaged on output. Value pulled from `$variables.truncate_values`, which is an empty dictionary (no truncation) in our configuration.
* `false_positive_rate`: User-configured fraction between 0 and 1 expressing desired false positive rate for identifying unexpected values as judged by the upper- and lower- quantiles of the observed metric data. Value pulled from `$variables.false_positive_rate` and is `0.05` in our configuration.
* `replace_nan_with_zero`: If False, then if the computed metric gives `NaN`, then exception is raised; otherwise, if True (default), then if the computed metric gives NaN, then it is converted to the 0.0 (float) value. Set to `True` in our configuration.
* `metric_domain_kwargs`: Domain values for `ParameterBuilder`. Pulled from `$domain.domain_kwargs`, and is empty in our configuration.
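
To give some intuition for how `estimator`, `n_resamples`, and `false_positive_rate` interact, the following is a simplified bootstrap sketch using only the standard library. It is not the algorithm used by `NumericMetricRangeMultiBatchParameterBuilder` (which, for example, also applies a quantile bias correction); the helpers `quantile` and `bootstrap_range`, the two-sided split of the false positive rate, and the monthly means are all assumptions made for illustration.

```python
import random
from statistics import mean


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list, with 0 <= q <= 1."""
    position = q * (len(sorted_values) - 1)
    low = int(position)
    high = min(low + 1, len(sorted_values) - 1)
    fraction = position - low
    return sorted_values[low] + fraction * (sorted_values[high] - sorted_values[low])


def bootstrap_range(metric_values, false_positive_rate=0.05, n_resamples=9999, seed=0):
    """Estimate a (lower, upper) range for a metric from its per-Batch values."""
    rng = random.Random(seed)
    # Assumption: the false positive rate is split evenly between both tails.
    lower_q = false_positive_rate / 2
    upper_q = 1 - false_positive_rate / 2
    lowers, uppers = [], []
    for _ in range(n_resamples):
        # Resample the per-Batch values with replacement.
        resample = sorted(rng.choices(metric_values, k=len(metric_values)))
        lowers.append(quantile(resample, lower_q))
        uppers.append(quantile(resample, upper_q))
    # Average the quantile estimates over all resamples.
    return mean(lowers), mean(uppers)


# Hypothetical per-Batch means of trip_distance, one per month (not real output).
monthly_means = [2.84, 2.86, 2.88, 2.90, 2.91, 2.93, 2.94, 2.96, 2.98, 3.00, 3.03, 3.05]
lower, upper = bootstrap_range(monthly_means, n_resamples=999)
print(round(lower, 2), round(upper, 2))
```

A smaller `false_positive_rate` pushes the quantiles toward the extremes of the observed values and therefore widens the estimated range.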

## `expectation_configuration_builders`
Our Configuration contains 1 `ExpectationConfigurationBuilder`, for the `expect_column_mean_to_be_between` Expectation type. 

The `ExpectationConfigurationBuilder` configuration requires an `expectation_type`, `class_name` and `module_name`:

* `expectation_type`: `expect_column_mean_to_be_between`
* `class_name`: `DefaultExpectationConfigurationBuilder`
* `module_name`: `great_expectations.rule_based_profiler.expectation_configuration_builder` which is common for all `ExpectationConfigurationBuilders`

Also included are: 
* `validation_parameter_builder_configs`: A list of `ParameterBuilder` configurations. In our case it contains the single `ParameterBuilder` described in the previous section.

Next are the parameters that are specific to the `expect_column_mean_to_be_between` `Expectation`.
* `column`: Pulled from `DomainBuilder` using the parameter `$domain.domain_kwargs.column`
* `min_value`:  Pulled from the `ParameterBuilder` using `$parameter.mean_range_estimator.value[0]`
* `max_value`: Pulled from the `ParameterBuilder` using `$parameter.mean_range_estimator.value[1]`
* `strict_min`: Pulled from `$variables.strict_min`, which is `False`. 
* `strict_max`: Pulled from `$variables.strict_max`, which is `False`. 


Last is `meta` which contains `details` from our `parameter_builder`. 
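
The `$`-prefixed references used throughout this configuration (`$variables...`, `$domain...`, `$parameter...`) can be thought of as paths into nested dictionaries. The sketch below is a toy resolver written for illustration; `resolve` and the `context` layout are hypothetical and do not reflect how the `RuleBasedProfiler` stores or looks up these values internally.

```python
import re


def resolve(reference, context):
    """Resolve a "$a.b.c[0]" style reference against a nested dict."""
    if not (isinstance(reference, str) and reference.startswith("$")):
        return reference  # literal values pass through unchanged
    value = context
    for part in reference[1:].split("."):
        match = re.fullmatch(r"(\w+)(?:\[(\d+)\])?", part)
        if match is None:
            raise ValueError(f"Unsupported reference segment: {part!r}")
        key, index = match.groups()
        value = value[key]
        if index is not None:
            value = value[int(index)]
    return value


# Hypothetical state after the domain and parameter builders have run.
context = {
    "variables": {"strict_min": False, "strict_max": False},
    "domain": {"domain_kwargs": {"column": "trip_distance"}},
    "parameter": {"mean_range_estimator": {"value": [2.83, 3.06], "details": {}}},
}

# The Expectation-specific entries from the configuration above.
template = {
    "column": "$domain.domain_kwargs.column",
    "min_value": "$parameter.mean_range_estimator.value[0]",
    "max_value": "$parameter.mean_range_estimator.value[1]",
    "strict_min": "$variables.strict_min",
    "strict_max": "$variables.strict_max",
}

kwargs = {name: resolve(reference, context) for name, reference in template.items()}
print(kwargs)
```

The resolved dictionary has the same shape as the `kwargs` section of the `expectation_config` returned when the Expectation was run with `auto=True`.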


## Optional: Clean-up Directory


As part of running this notebook, the `DataAssistant` will create a number of ExpectationSuite configurations in the `great_expectations/expectations/tmp` directory. Optionally run the following cell to clean up the directory.

In [3]:
# import shutil, os
# try:
#     shutil.rmtree("great_expectations/expectations/tmp")
#     os.remove("great_expectations/expectations/.ge_store_backend_id")
#     os.remove("great_expectations/expectations/new_expectation_suite.json")
# except FileNotFoundError:
#     pass