# How to use `DataAssistants` - SQL

* A `DataAssistant` enables you to quickly profile your data by providing a thin API over a pre-constructed `RuleBasedProfiler` configuration.
* As a result of the profiling, you get back a result object consisting of 
    * `Metrics` that describe the current state of the data
    * `Expectations` that are able to alert you if the data deviates from the expected state in the future. 
    
* `DataAssistant` results can also be plotted to help you understand your data visually.
* There are multiple `DataAssistants`, each centered around a theme (volume, nullity, etc.). This notebook walks you through an example using the `OnboardingDataAssistant`, which is the most general and extensive `DataAssistant`.


**How do I run the code in this notebook?**
You will need to spin up a Docker container with a `postgres` database, and load in the example data used in this notebook. There is a section at the end of the notebook titled 'Loading Data into Postgresql Database'. Run the code there first and then start running the cells below. 


The `OnboardingDataAssistant` is considered to be the "starting point" for profiling and is generally applicable to numerical, categorical, or datetime data. In our example we will be using `taxi_trip` data, building our `ExpectationSuite` using data from 2019, and validating the Suite on January 2020 data, to see if our more recent data falls within the range of previous months.

In our example, the `OnboardingDataAssistant` will take in a `batch_request` describing data from 2019 and calculate upper and lower bounds for a set of `Expectations` across the sample `Batches`. The appendix lists the `Expectations` that are generated.

In [1]:
import great_expectations as gx
from great_expectations.core.batch import BatchRequest
from great_expectations.core import ExpectationSuite
from great_expectations.core.expectation_configuration import ExpectationConfiguration
from great_expectations.validator.validator import Validator
from great_expectations.rule_based_profiler.data_assistant import (
    DataAssistant,
    VolumeDataAssistant,
)
from great_expectations.rule_based_profiler.data_assistant_result import (
    VolumeDataAssistantResult,
)
from typing import List
import os
from ruamel import yaml



### Example Database

Imagine we have a database with 2 tables, `yellow_tripdata_sample_2019` and `yellow_tripdata_sample_2020`, containing 12 months of `taxi_trip` data for 2019 and 2020, respectively.

In [2]:
# connect to postgres DB, and print the existing tables
pg_hostname = os.getenv("GE_TEST_LOCAL_DB_HOSTNAME", "localhost")
CONNECTION_STRING = f"postgresql+psycopg2://postgres:@{pg_hostname}/test_ci"
from sqlalchemy import create_engine
from sqlalchemy import inspect

engine = create_engine(CONNECTION_STRING)
insp = inspect(engine)
print(insp.get_table_names())

['yellow_tripdata_sample_2019', 'yellow_tripdata_sample_2020']


## Set-up: Adding `taxi_data` `Datasource`
* Add `taxi_data` as a new `Datasource`
* We use a `CONNECTION_STRING` to connect to the tables in our `postgres` database and get 2 `DataAssets`, each with 12 Batches, corresponding to one Batch per month in 2019 and 2020.

In [3]:
data_context: gx.DataContext = gx.get_context()

In [4]:
datasource_config = {
    "name": "taxi_multi_batch_sql_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "SqlAlchemyExecutionEngine",
        "connection_string": CONNECTION_STRING,
    },
    "data_connectors": {
        "configured_data_connector_multi_batch_asset": {
            "class_name": "ConfiguredAssetSqlDataConnector",
            "assets": {
                "yellow_tripdata_sample_2019": {
                    "splitter_method": "split_on_year_and_month",
                    "splitter_kwargs": {
                        "column_name": "pickup_datetime",
                    },
                },
                "yellow_tripdata_sample_2020": {
                    "splitter_method": "split_on_year_and_month",
                    "splitter_kwargs": {
                        "column_name": "pickup_datetime",
                    },
                },
            },
        },
    },
}

data_context.test_yaml_config(yaml.dump(datasource_config))

Attempting to instantiate class from config...
	Instantiating as a Datasource, since class_name is Datasource
	Successfully instantiated Datasource


ExecutionEngine class name: SqlAlchemyExecutionEngine
Data Connectors:
	configured_data_connector_multi_batch_asset : ConfiguredAssetSqlDataConnector

	Available data_asset_names (2 of 2):
		yellow_tripdata_sample_2019 (3 of 12): [{'pickup_datetime': {'year': 2019, 'month': 2}}, {'pickup_datetime': {'year': 2019, 'month': 5}}, {'pickup_datetime': {'year': 2019, 'month': 6}}]
		yellow_tripdata_sample_2020 (3 of 12): [{'pickup_datetime': {'year': 2020, 'month': 2}}, {'pickup_datetime': {'year': 2020, 'month': 7}}, {'pickup_datetime': {'year': 2020, 'month': 12}}]

	Unmatched data_references (0 of 0):[]



<great_expectations.datasource.new_datasource.Datasource at 0x7fa9af4bc5b0>

The configuration for the `configured_data_connector_multi_batch_asset` DataConnector contains 2 Assets that use a `splitter_method` to split the table values into multiple Batches. The splitter we use is `split_on_year_and_month`, which creates Batches according to the `pickup_datetime` column, which is of type `timestamp` in the database schema.

We can see that the configuration was successful because the output shows 2 data assets, `yellow_tripdata_sample_2019` and `yellow_tripdata_sample_2020`, each with 12 Batches, where every Batch is associated with a different month in our `pickup_datetime` column. These become our `batch_identifiers` that distinguish one Batch from another.
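
To make the splitting concrete, here is a minimal pure-Python sketch of what splitting on year and month amounts to. It is not the Great Expectations implementation (which works against the SQL table); the helper `split_rows_by_year_and_month`, the sample rows, and the `fare_amount` column are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Tuple


def split_rows_by_year_and_month(
    rows: List[dict], column_name: str
) -> Dict[Tuple[int, int], List[dict]]:
    """Group rows into one bucket per (year, month) of `column_name`."""
    batches: Dict[Tuple[int, int], List[dict]] = defaultdict(list)
    for row in rows:
        timestamp: datetime = row[column_name]
        batches[(timestamp.year, timestamp.month)].append(row)
    return dict(batches)


# Hypothetical sample rows standing in for the taxi table.
sample_rows = [
    {"pickup_datetime": datetime(2019, 1, 15, 8, 30), "fare_amount": 12.5},
    {"pickup_datetime": datetime(2019, 1, 20, 17, 5), "fare_amount": 7.0},
    {"pickup_datetime": datetime(2019, 2, 3, 9, 45), "fare_amount": 21.0},
]

batches = split_rows_by_year_and_month(sample_rows, "pickup_datetime")

# Each (year, month) key plays the role of a batch identifier.
batch_identifiers = [
    {"pickup_datetime": {"year": year, "month": month}}
    for (year, month) in sorted(batches)
]
print(batch_identifiers)
```

Each distinct (year, month) pair yields one bucket, which is why a table covering a full year produces 12 Batches.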

In [5]:
# add_datasource only if it doesn't already exist in our configuration
try:
    data_context.get_datasource(datasource_config["name"])
except ValueError:
    data_context.add_datasource(**datasource_config)

# Configure `BatchRequest`

In this example, we will be using a `BatchRequest` that will return all 12 Batches of data from the `yellow_tripdata_sample_2019` table. We will refer to the `Datasource` and `DataConnector` configured in the previous step.

In [6]:
multi_batch_batch_request: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_sql_datasource",
    data_connector_name="configured_data_connector_multi_batch_asset",
    data_asset_name="yellow_tripdata_sample_2019",
)

In [7]:
batch_request: BatchRequest = multi_batch_batch_request

In [8]:
batch_list = data_context.get_batch_list(batch_request=batch_request)

In [9]:
len(batch_list)  # len = 12

12

# Run the `OnboardingDataAssistant`

* The `OnboardingDataAssistant` can be run directly from the `DataContext` by specifying `assistants` and `onboarding`, and passing in the `BatchRequest` from the previous step.

**Note**: The simplest way to run this is to pass in only the `batch_request` from the previous step, which corresponds to 2019 data. The appendix shows how the run can be further configured.

In [10]:
result = data_context.assistants.onboarding.run(batch_request=multi_batch_batch_request)




Generating Expectations:   0%|          | 0/8 [00:00<?, ?it/s]

Profiling Dataset:         0%|          | 0/1 [00:00<?, ?it/s]

Profiling Dataset:         0%|          | 0/2 [00:00<?, ?it/s]

Profiling Dataset:         0%|          | 0/0 [00:00<?, ?it/s]

Profiling Dataset:         0%|          | 0/12 [00:00<?, ?it/s]

Profiling Dataset:         0%|          | 0/11 [00:00<?, ?it/s]

Profiling Dataset:         0%|          | 0/2 [00:00<?, ?it/s]

Profiling Dataset:         0%|          | 0/1 [00:00<?, ?it/s]

Profiling Dataset:         0%|          | 0/14 [00:00<?, ?it/s]

If you would like to see the `ExpectationSuite` that was generated, then you can run the `get_expectation_suite()` method on the result.

In [12]:
new_suite = result.get_expectation_suite(expectation_suite_name="taxi_data_2019_suite")

# Explore `DataAssistantResult` by plotting

The resulting `DataAssistantResult` can be explored by plotting. The `plot_metrics()` function will plot the statistical metrics, and `plot_expectations_and_metrics()` will plot the generated min and max values overlaid on the statistical metrics.

In [13]:
result.plot_metrics()




In [14]:
result.plot_expectations_and_metrics()




# Save `ExpectationSuite`

Next, we can save the `ExpectationConfiguration` objects resulting from the `DataAssistant` by:

1. Creating an `ExpectationSuite`, which is `taxi_data_2019_suite` in our example
2. Adding the `ExpectationConfiguration` objects to the `ExpectationSuite`
3. Saving the `ExpectationSuite` using `DataContext`'s `add_expectation_suite()` method

In [15]:
suite: ExpectationSuite = ExpectationSuite(
    expectation_suite_name="taxi_data_2019_suite"
)

In [16]:
resulting_configurations: List[
    ExpectationConfiguration
] = suite.add_expectation_configurations(
    expectation_configurations=result.expectation_configurations
)

In [17]:
data_context.add_expectation_suite(expectation_suite=suite)

# Running Checkpoint

We have just trained our Expectation Suite on the 2019 data. Now we will run a validation using this ExpectationSuite on data from January of 2020.

In [18]:
single_batch_batch_request: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_sql_datasource",
    data_connector_name="configured_data_connector_multi_batch_asset",
    data_asset_name="yellow_tripdata_sample_2020",
    data_connector_query={
        "batch_filter_parameters": {"pickup_datetime": {"year": 2020, "month": 1}},
    },
)

In [19]:
batch_list = data_context.get_batch_list(batch_request=single_batch_batch_request)

In [20]:
batch_list[0].batch_definition

{'datasource_name': 'taxi_multi_batch_sql_datasource', 'data_connector_name': 'configured_data_connector_multi_batch_asset', 'data_asset_name': 'yellow_tripdata_sample_2020', 'batch_identifiers': {'pickup_datetime': {'year': 2020, 'month': 1}}}

Our `SimpleCheckpoint` configuration includes our `batch_request` and `expectation_suite_name` (which is `taxi_data_2019_suite` in our example).

We also include the `UpdateDataDocsAction`, so we can visualize the results of our Checkpoint.

In [21]:
checkpoint_config: dict = {
    "name": "my_checkpoint",
    "config_version": 1,
    "class_name": "SimpleCheckpoint",
    "validations": [
        {
            "batch_request": single_batch_batch_request,
            "expectation_suite_name": "taxi_data_2019_suite",
        }
    ],
    "action_list": [
        {
            "name": "update_data_docs",
            "action": {
                "class_name": "UpdateDataDocsAction",
            },
        },
    ],
}
data_context.add_checkpoint(**checkpoint_config)

{
  "action_list": [
    {
      "name": "store_validation_result",
      "action": {
        "class_name": "StoreValidationResultAction"
      }
    },
    {
      "name": "store_evaluation_params",
      "action": {
        "class_name": "StoreEvaluationParametersAction"
      }
    },
    {
      "name": "update_data_docs",
      "action": {
        "class_name": "UpdateDataDocsAction",
        "site_names": []
      }
    }
  ],
  "batch_request": {},
  "class_name": "Checkpoint",
  "config_version": 1.0,
  "evaluation_parameters": {},
  "module_name": "great_expectations.checkpoint",
  "name": "my_checkpoint",
  "profilers": [],
  "runtime_configuration": {},
  "validations": [
    {
      "batch_request": {
        "datasource_name": "taxi_multi_batch_sql_datasource",
        "data_connector_name": "configured_data_connector_multi_batch_asset",
        "data_asset_name": "yellow_tripdata_sample_2020",
        "data_connector_query": {
          "batch_filter_parameters": {
      

In [22]:
data_context.add_checkpoint(**checkpoint_config)
results = data_context.run_checkpoint(checkpoint_name="my_checkpoint")

Calculating Metrics:   0%|          | 0/391 [00:00<?, ?it/s]

### Examine results by looking at DataDocs

We can check the results by evaluating:

In [23]:
results.success

False

As you can see, the `Checkpoint` run was not successful, which means the January 2020 data did not fall within the ranges predicted by the 2019 data. To examine this more closely, you can [open Data Docs here](great_expectations/uncommitted/data_docs/local_site/index.html). If the link doesn't work, go to `great_expectations/uncommitted/data_docs/local_site/index.html`, relative to the location of this notebook.
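
As a rough illustration of why a single out-of-range metric is enough to make `results.success` evaluate to `False`: the overall outcome is successful only when every Expectation in the run passes. The sketch below uses hypothetical per-Expectation outcomes, not values taken from this run.

```python
from typing import Dict, List

# Hypothetical per-Expectation outcomes for one validated Batch.
expectation_outcomes: Dict[str, bool] = {
    "expect_table_row_count_to_be_between": True,
    "expect_column_min_to_be_between": True,
    "expect_column_mean_to_be_between": False,  # one metric out of range
}

# The run succeeds only if every Expectation succeeds.
overall_success: bool = all(expectation_outcomes.values())

# Collect the names of the Expectations that failed.
failed: List[str] = [name for name, ok in expectation_outcomes.items() if not ok]

print(overall_success)
print(failed)
```

Data Docs gives the same kind of breakdown for the real run, listing each Expectation alongside its observed value.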

## Optional: Clean-up Directory


As part of running this notebook, the `DataAssistant` will create a number of ExpectationSuite configurations in the `great_expectations/expectations/tmp` directory. Optionally run the following cell to clean up the directory.

In [None]:
# import shutil, os
# shutil.rmtree("great_expectations/expectations/tmp")
# os.remove("great_expectations/expectations/.ge_store_backend_id")
# os.remove("great_expectations/expectations/taxi_data_2019_suite.json")

# Appendix

## What Expectations are included in the `OnboardingDataAssistant`?


The `OnboardingDataAssistant` is built with the following `Rules` that are run internally by the `RuleBasedProfiler`. 

- `TableRule`
- `Column Uniqueness and Nullity Rules`
- `NumericColumnRule`
- `DateColumnRule`
- `TextColumnRule`
- `CategoricalColumnRule`


#### Table Rule
This Rule will take the data as a table and try to calculate the following parameter values for Expectations across batches that we pass in. 

* `expect_table_row_count_to_be_between`: 
    - `min_value` : minimum threshold for table row count.
    - `max_value` : maximum threshold for table row count.
* `expect_table_columns_to_match_set`:
    - `column_set`: Either a list or set of strings, that describe the columns of the Table
    - `exact_match`: Boolean (default=True) which determines whether the list of columns must exactly match the observed columns. 
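
A minimal sketch of the idea behind an exact estimate for `expect_table_row_count_to_be_between`, assuming that "exact" means taking the smallest and largest row count observed across the Batches. The monthly row counts are made up for illustration, and `exact_bounds` is a hypothetical helper rather than a Great Expectations function.

```python
from typing import List, Tuple


def exact_bounds(metric_values: List[float]) -> Tuple[float, float]:
    """Return the smallest and largest observed metric value."""
    return min(metric_values), max(metric_values)


# Hypothetical row counts, one per monthly Batch of 2019.
row_counts_2019 = [
    10000, 9800, 10250, 10100, 9900, 10050,
    9950, 10020, 9870, 10110, 10300, 9990,
]

min_value, max_value = exact_bounds(row_counts_2019)

# A new Batch passes if its row count falls inside the learned range.
january_2020_row_count = 9500  # hypothetical
passes = min_value <= january_2020_row_count <= max_value

print(min_value, max_value, passes)
```

With these made-up numbers the January 2020 count falls below the learned minimum, so the Expectation would fail.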
    
#### Column Uniqueness and Nullity
This is not a `Rule` in the strictest sense, but the `DataAssistant` will generate `ExpectationConfigurations` for the following `Expectations` for each column in our data. 

* `expect_column_values_to_be_unique`
* `expect_column_values_to_be_null`
* `expect_column_values_to_not_be_null`
 
They each take 2 parameters: `column`, which is the name of the column being validated, and `mostly`, which is an optional `float` value between `0` and `1` that specifies the fraction of values that must match the Expectation. The default for the `DataAssistant` is `1.0`. 
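
The effect of `mostly` can be sketched in a few lines of plain Python. This is an illustration of the semantics described above, not the library's implementation; `passes_not_null` is a hypothetical helper.

```python
from typing import List, Optional


def passes_not_null(values: List[Optional[str]], mostly: float = 1.0) -> bool:
    """True if at least a `mostly` fraction of the values are not None."""
    if not values:
        return True  # assumption for this sketch: an empty column passes
    non_null_fraction = sum(v is not None for v in values) / len(values)
    return non_null_fraction >= mostly


column = ["a", "b", None, "d"]  # 3 of 4 values are non-null

print(passes_not_null(column, mostly=1.0))
print(passes_not_null(column, mostly=0.75))
```

With `mostly=1.0` a single null value is enough to fail the check, while `mostly=0.75` tolerates up to a quarter of the values being null.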

#### NumericColumnRule
The `NumericColumnRule` will calculate the `min_value` and `max_value` for the following expectations. 

* `expect_column_min_to_be_between`
* `expect_column_max_to_be_between`
* `expect_column_values_to_be_between`
* `expect_column_median_to_be_between`
* `expect_column_mean_to_be_between`
* `expect_column_stdev_to_be_between`

By default, the estimation is done by the `exact` estimator, which takes in the values across all Batches in the batch list (in our example, 12 months of data from 2019).

If you call `data_assistant.run()` with the `estimation` parameter set to `flag_outliers`, then `bootstrapping` will be done behind the scenes to estimate outliers.

The parameters for `bootstrapping` are:

* `n_resamples`: Set by default to `9999`, which is the default in https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html
* `false_positive_rate`: A user-configured fraction between 0 and 1 expressing the desired false positive rate for
    identifying unexpected values, as judged by the upper and lower quantiles of the observed metric data. Set by default to `0.05`.
* `random_seed`: Seed for randomization. If omitted (which is the default), then we use `np.random.choice`. Otherwise, we use `np.random.Generator(np.random.PCG64(random_seed))`.
* `round_decimals`: A user-configured non-negative integer indicating the number of decimals of
    rounding precision of the computed parameter values (i.e., `min_value`, `max_value`) prior to packaging them
    on output. For `NumericColumnRule` the default is `15`, which means values are rounded to 15 digits after the decimal point.
* `mostly`: `1.0` by default.
* `strict_min` and `strict_max`: Both `False` by default. They determine whether the observed value must be strictly greater than `min_value` and strictly smaller than `max_value`, respectively.
* `truncate_values`: User-configured directive for whether to allow the computed parameter values (i.e., `lower_bound`, `upper_bound`) to take on values outside the specified bounds when packaged on output.
* `allow_relative_error`: Whether to allow relative error in quantile computations on backends that support or require it.
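
To give a feel for how `n_resamples`, `false_positive_rate`, and `random_seed` interact, here is a simplified bootstrap sketch using only the standard library. It is an assumption-laden illustration, not the estimator Great Expectations ships: the helper names, the nearest-rank quantile, and the averaging of per-resample quantiles are choices made for this sketch, and the metric values are made up.

```python
import random
import statistics
from typing import List, Tuple


def quantile(sorted_values: List[float], q: float) -> float:
    """Nearest-rank style quantile on an already sorted list."""
    index = round(q * (len(sorted_values) - 1))
    return sorted_values[index]


def bootstrap_bounds(
    metric_values: List[float],
    false_positive_rate: float = 0.05,
    n_resamples: int = 9999,
    random_seed: int = 0,
) -> Tuple[float, float]:
    """Estimate (lower, upper) bounds by resampling with replacement."""
    rng = random.Random(random_seed)
    # Split the false positive rate evenly between the two tails.
    lower_q = false_positive_rate / 2
    upper_q = 1 - false_positive_rate / 2
    lowers: List[float] = []
    uppers: List[float] = []
    for _ in range(n_resamples):
        resample = sorted(rng.choices(metric_values, k=len(metric_values)))
        lowers.append(quantile(resample, lower_q))
        uppers.append(quantile(resample, upper_q))
    # Average the per-resample quantiles to get the final bounds.
    return statistics.mean(lowers), statistics.mean(uppers)


# Hypothetical metric values (e.g. a column mean), one per monthly Batch.
monthly_means = [12.1, 12.4, 11.9, 12.0, 12.6, 12.3,
                 12.2, 11.8, 12.5, 12.0, 12.7, 12.1]

lower, upper = bootstrap_bounds(monthly_means, n_resamples=500, random_seed=42)
print(round(lower, 3), round(upper, 3))
```

Fixing `random_seed` makes the bounds reproducible between runs; a larger `false_positive_rate` narrows the range and therefore flags more values as unexpected.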

In addition, the `expect_column_quantile_values_to_be_between` Expectation takes in the following parameters: 

* `quantiles`: Quantiles and associated value ranges for the column, with the default being `[0.25, 0.5, 0.75]`.
* `quantile_statistic_interpolation_method`: Used when estimating quantile values. Recognized values include `auto`, `nearest`, and `linear` (default is `nearest`).
* `quantile_bias_correction`: Used when determining whether to correct for quantile bias. Recognized values are `True` and `False`, with the default being `False`.
* `quantile_bias_std_error_ratio_threshold`: If omitted
    (default), then 0.25 is used (as the minimum ratio of bias to standard error for applying bias correction).
* `include_estimator_samples_histogram_in_details`: Determines whether the estimator samples are included in the results (default is `False`).
  

#### DateColumnRule
The `DateColumnRule` will take a `datetime` column and calculate the `min_value` and `max_value` for the following Expectations.

* `expect_column_min_to_be_between`
* `expect_column_max_to_be_between`
* `expect_column_values_to_be_between`

Estimation will be done using the `exact` estimator by default, but if you select `flag_outliers` then `bootstrapping` will use the following parameters: 

* `n_resamples`: For bootstrapping. Set by default to `9999`, which is the default in https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html
* `false_positive_rate`: A user-configured fraction between 0 and 1 expressing the desired false positive rate for
    identifying unexpected values, as judged by the upper and lower quantiles of the observed metric data. Default is `0.05`.
* `random_seed`: Seed for randomization. If omitted (which is the default), then we use `np.random.choice`. Otherwise, we use `np.random.Generator(np.random.PCG64(random_seed))`.
* `round_decimals`: A user-configured non-negative integer indicating the number of decimals of
    rounding precision of the computed parameter values (i.e., `min_value`, `max_value`) prior to packaging them
    on output. The default for `DateColumnRule` is `1`, which means values are rounded to 1 digit after the decimal point.


#### TextColumnRule

The `TextColumnRule` will generate parameters for the following 2 `Expectations`.

* `expect_column_value_lengths_to_be_between`
* `expect_column_values_to_match_regex`

For `expect_column_value_lengths_to_be_between`, the parameters `min_value` and `max_value` will be estimated. 

Estimation will be done using the `exact` estimator by default, but if you select `flag_outliers` then the following parameters are set by default for `bootstrap` estimation:

* `n_resamples`: For bootstrapping. Set by default to `9999`, which is the default in https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html
* `false_positive_rate`: A user-configured fraction between 0 and 1 expressing the desired false positive rate for
    identifying unexpected values, as judged by the upper and lower quantiles of the observed metric data. Default is `0.05`.
* `random_seed`: Seed for randomization. If omitted (which is the default), then we use `np.random.choice`. Otherwise, we use `np.random.Generator(np.random.PCG64(random_seed))`.
* `round_decimals`: A user-configured non-negative integer indicating the number of decimals of
    rounding precision of the computed parameter values (i.e., `min_value`, `max_value`) prior to packaging them
    on output. The default for `TextColumnRule` is `0`, which means values are rounded to the nearest integer.

For `expect_column_values_to_match_regex`, the values in the column are matched against a candidate list of common `regex` values, which were built from the following sources: 

   - [20 Most Common Regular Expressions](https://regexland.com/most-common-regular-expressions/)
   - [Stackoverflow on how to test for valid uuid](https://stackoverflow.com/questions/7905929/how-to-test-valid-uuid-guid/13653180#13653180)

* This is the regex list used by the `TextColumnRule`.

```python
CANDIDATE_REGEX: Set[str] = {
    r"\d+",  # whole number with 1 or more digits
    r"-?\d+",  # negative whole numbers
    r"-?\d+(?:\.\d*)?",  # decimal numbers with . (period) separator
    r"[A-Za-z0-9\.,;:!?()\"'%\-]+",  # general text
    r"^\s+",  # leading space
    r"\s+$",  # trailing space
    r"https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b(?:[-a-zA-Z0-9@:%_\+.~#()?&//=]*)",  # Matching URL (including http(s) protocol)
    r"<\/?(?:p|a|b|img)(?: \/)?>",  # HTML tags
    r"(?:25[0-5]|2[0-4]\d|[01]\d{2}|\d{1,2})(?:.(?:25[0-5]|2[0-4]\d|[01]\d{2}|\d{1,2})){3}",  # IPv4 IP address
    r"(?:[A-Fa-f0-9]){0,4}(?: ?:? ?(?:[A-Fa-f0-9]){0,4}){0,7}",  # IPv6 IP address,
    r"\b[0-9a-fA-F]{8}\b-[0-9a-fA-F]{4}-[0-5][0-9a-fA-F]{3}-[089ab][0-9a-fA-F]{3}-\b[0-9a-fA-F]{12}\b ",  # UUID
    }
```

**Note**: The above list can be found in the [great_expectations repo here](https://github.com/great-expectations/great_expectations/blob/da2376b613843844afc041538269bb7683444b1f/great_expectations/rule_based_profiler/parameter_builder/regex_pattern_string_parameter_builder.py#L38)
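
As a rough sketch of how a candidate list like this could be scored against a column, the snippet below computes the fraction of values that each pattern matches. It assumes full-string matching and uses only two of the candidates; how Great Expectations actually evaluates and selects a pattern is not described here, and `match_fractions` is a hypothetical helper.

```python
import re
from typing import Dict, List

# A small subset of the candidate list above, for illustration only.
CANDIDATES: List[str] = [
    r"\d+",  # whole number with 1 or more digits
    r"-?\d+",  # whole number with an optional leading minus sign
]


def match_fractions(values: List[str], candidates: List[str]) -> Dict[str, float]:
    """Fraction of values fully matched by each candidate regex."""
    fractions: Dict[str, float] = {}
    for pattern in candidates:
        compiled = re.compile(pattern)
        matched = sum(1 for value in values if compiled.fullmatch(value))
        fractions[pattern] = matched / len(values)
    return fractions


values = ["12", "-7", "300", "45"]  # hypothetical column values
fractions = match_fractions(values, CANDIDATES)
print(fractions)
```

Here the second pattern matches every value while the first misses `-7`, so the second would be the better candidate for this column.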

#### CategoricalColumnsRule

The CategoricalColumnsRule will generate parameters for the following 3 Expectations:

* `expect_column_values_to_be_in_set`
* `expect_column_unique_value_count_to_be_between`
* `expect_column_proportion_of_unique_values_to_be_between`


Categorical columns are those that meet a certain cardinality threshold. This prevents us from calculating the set of `unique` values in a column with millions of rows where each value is slightly different from the others (which is theoretically possible). 

Great Expectations gets around this by only building the unique value set for columns that have fewer than a certain number of unique values, which is determined by the cardinality threshold. The default threshold is `FEW`, which means Great Expectations will generate parameters for `expect_column_values_to_be_in_set()` for columns where the number of unique values is less than or equal to 100. 

Other cardinality limits include: 

```python 
ZERO = AbsoluteCardinalityLimit("ZERO", 0)
ONE = AbsoluteCardinalityLimit("ONE", 1)
TWO = AbsoluteCardinalityLimit("TWO", 2)
VERY_FEW = AbsoluteCardinalityLimit("VERY_FEW", 10)
FEW = AbsoluteCardinalityLimit("FEW", 100)
SOME = AbsoluteCardinalityLimit("SOME", 1000)
MANY = AbsoluteCardinalityLimit("MANY", 10000)
VERY_MANY = AbsoluteCardinalityLimit("VERY_MANY", 100000)
UNIQUE = RelativeCardinalityLimit("UNIQUE", 1.0)

... # and more
```

The full list of cardinality values used by Great Expectations can be found in the [great_expectations repo here](https://github.com/great-expectations/great_expectations/blob/da2376b613843844afc041538269bb7683444b1f/great_expectations/rule_based_profiler/helpers/cardinality_checker.py#L55). 
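
The absolute limits can be read as a simple "at most N distinct values" check. The sketch below illustrates that reading with a few of the limits listed above; `within_cardinality_limit` and the sample columns are hypothetical, and the relative limits (such as `UNIQUE`) are not covered.

```python
from typing import Dict, Hashable, List

# A few absolute limits copied from the list above.
ABSOLUTE_LIMITS: Dict[str, int] = {"VERY_FEW": 10, "FEW": 100, "SOME": 1000}


def within_cardinality_limit(values: List[Hashable], limit_mode: str = "FEW") -> bool:
    """True if the number of distinct values does not exceed the limit."""
    return len(set(values)) <= ABSOLUTE_LIMITS[limit_mode]


payment_types = ["cash", "card", "card", "cash", "dispute"]  # hypothetical
trip_ids = list(range(5000))  # hypothetical column with 5000 distinct values

print(within_cardinality_limit(payment_types))
print(within_cardinality_limit(trip_ids))
print(within_cardinality_limit(trip_ids, limit_mode="SOME"))
```

Only the first column would be treated as categorical under these limits; the second has too many distinct values even for `SOME`.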

The three Expectations each have their own parameters.

`expect_column_values_to_be_in_set` requires `column` (column name) and `value_set`, which is calculated for all columns that meet the cardinality threshold. 

`expect_column_unique_value_count_to_be_between` has `column` (column name) and `min_value` and `max_value`.

Estimation will be done using the `exact` estimator by default, but if you select `flag_outliers` then the following parameters are set by default for `bootstrap` estimation:

* `n_resamples`: For bootstrapping. Set by default to `9999`, which is the default in https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html
* `false_positive_rate`: A user-configured fraction between 0 and 1 expressing the desired false positive rate for
    identifying unexpected values, as judged by the upper and lower quantiles of the observed metric data. Default is `0.05`.
* `random_seed`: Seed for randomization. If omitted (which is the default), then we use `np.random.choice`. Otherwise, we use `np.random.Generator(np.random.PCG64(random_seed))`.

`expect_column_proportion_of_unique_values_to_be_between` has the following parameters:

* `column`: column name
* `min_value`: minimum proportion of unique values, ranging from 0 to 1
* `max_value`: maximum proportion of unique values, ranging from 0 to 1
* `strict_min`: determines whether the proportion of unique values must be strictly greater than `min_value`, with `default=False`
* `strict_max`: determines whether the proportion of unique values must be strictly smaller than `max_value`, with `default=False`
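
The following sketch shows how these parameters combine, under the usual reading that the proportion is the number of distinct values divided by the number of values. The helper names and sample data are hypothetical.

```python
from typing import Hashable, List


def unique_proportion(values: List[Hashable]) -> float:
    """Number of distinct values divided by the total number of values."""
    return len(set(values)) / len(values)


def proportion_in_range(
    proportion: float,
    min_value: float,
    max_value: float,
    strict_min: bool = False,
    strict_max: bool = False,
) -> bool:
    """Check the proportion against the bounds, honoring strictness flags."""
    above_min = proportion > min_value if strict_min else proportion >= min_value
    below_max = proportion < max_value if strict_max else proportion <= max_value
    return above_min and below_max


values = ["a", "b", "b", "c"]  # 3 distinct values out of 4
proportion = unique_proportion(values)

print(proportion)
print(proportion_in_range(proportion, 0.5, 0.75))
print(proportion_in_range(proportion, 0.5, 0.75, strict_max=True))
```

The strictness flags only matter when the observed proportion sits exactly on a bound, as it does here with `max_value=0.75`.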

Estimation will be done using the `exact` estimator by default, but if you select `flag_outliers` then the following parameters are set by default for `bootstrap` estimation:

* `n_resamples`: For bootstrapping. Set by default to `9999`, which is the default in https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html
* `false_positive_rate`: A user-configured fraction between 0 and 1 expressing the desired false positive rate for
    identifying unexpected values, as judged by the upper and lower quantiles of the observed metric data. Default is `0.05`.
* `random_seed`: Seed for randomization. If omitted (which is the default), then we use `np.random.choice`. Otherwise, we use `np.random.Generator(np.random.PCG64(random_seed))`.
* `round_decimals`: A user-configured non-negative integer indicating the number of decimals of
    rounding precision of the computed parameter values (i.e., `min_value`, `max_value`) prior to packaging them
    on output. For `CategoricalColumnsRule` the default is `15`, which means values are rounded to 15 digits after the decimal point.
* `mostly`: `1.0` by default.
* `strict_min` and `strict_max`: Both `False` by default. They determine whether the proportion of unique values must be strictly greater than `min_value` and strictly smaller than `max_value`, respectively.
* `truncate_values`: User-configured directive for whether to allow the computed parameter values (i.e., `lower_bound`, `upper_bound`) to take on values outside the specified bounds when packaged on output.
* `allow_relative_error`: Whether to allow relative error in quantile computations on backends that support or require it.

# Adjusting `DataAssistant` Parameters

Now that you can run the profiler, you may want more control over the specifics of how the `DataAssistant` is run. 

The first major parameter that you may want to set is `estimation`, which you pass directly into the `run()` method.

* If set to `flag_outliers`, the `DataAssistant` will use `bootstrapping` on the Batches returned by the `batch_request` to estimate outliers. The parameters are `n_resamples`, `false_positive_rate`, etc., with more details in the section above.

In [None]:
# the following code will run the onboarding assistant with bootstrapping
data_context.assistants.onboarding.run(
    batch_request=single_batch_batch_request, estimation="flag_outliers"
)

Now what if you need more granular control? For instance, how would you set the cardinality threshold (`cardinality_limit_mode`) for the `CategoricalColumnsRule`? First, you can see the full list of parameters that are passed into the `DataAssistant` by evaluating the `run` method itself, without calling it. 


In [None]:
data_context.assistants.onboarding.run

If you look closely at the output, you will see that it is broken up into `rules`, one of which is the `categorical_columns_rule`. It is type hinted as a `dict`, with parameters like `cardinality_limit_mode`, `mostly`, etc. 

```python
categorical_columns_rule = {
    'cardinality_limit_mode': 'FEW',
    'mostly': 1.0,
    'strict_min': False,
    'strict_max': False,
    'false_positive_rate': 0.05,
    'estimator': 'bootstrap',
    'n_resamples': 9999,
    'random_seed': None,
    'quantile_statistic_interpolation_method': 'nearest',
    'quantile_bias_correction': False,
    'quantile_bias_std_error_ratio_threshold': None,
    'include_estimator_samples_histogram_in_details': False,
    'truncate_values': {
        'lower_bound': 0.0,
        'upper_bound': None,
    },
    'round_decimals': 15,
}

```


If you copy the dictionary and set the key-value pair you are interested in, then `categorical_columns_rule` can be passed into the `assistants.onboarding.run()` method as a dictionary. 

In the following example we have changed the `cardinality_limit_mode` from `FEW` to `VERY_MANY` ([more information in the great_expectations repo here](https://github.com/great-expectations/great_expectations/blob/da2376b613843844afc041538269bb7683444b1f/great_expectations/rule_based_profiler/helpers/cardinality_checker.py#L55)). The remaining parameters can either be kept or commented out. Both calls below will work. 

In [None]:
# this will work
data_context.assistants.onboarding.run(
    batch_request=multi_batch_batch_request,
    categorical_columns_rule={
        "cardinality_limit_mode": "VERY_MANY",
        #'mostly': 1.0,
        #'strict_min': False,
        #'strict_max': False,
        #'false_positive_rate': 0.05,
        #'estimator': 'bootstrap',
        #'n_resamples': 9999,
        #'random_seed': None,
        #'quantile_statistic_interpolation_method': 'nearest',
        #'quantile_bias_correction': False,
        #'quantile_bias_std_error_ratio_threshold': None,
        #'include_estimator_samples_histogram_in_details': False,
        #'truncate_values': {
        #    'lower_bound': 0.0,
        #   'upper_bound': None
        # },
        #'round_decimals': 15
    },
)

# and so will this
data_context.assistants.onboarding.run(
    batch_request=multi_batch_batch_request,
    categorical_columns_rule={
        "cardinality_limit_mode": "VERY_MANY",
    },
)
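
One way to picture why both calls behave the same: keys you leave out keep their defaults, and keys you pass replace them. The sketch below shows that merge with plain dictionaries. It illustrates the idea only and is not the code Great Expectations uses internally; `apply_overrides` is a hypothetical helper.

```python
from typing import Any, Dict

# A few of the defaults shown above for `categorical_columns_rule`.
DEFAULTS: Dict[str, Any] = {
    "cardinality_limit_mode": "FEW",
    "mostly": 1.0,
    "false_positive_rate": 0.05,
    "round_decimals": 15,
}


def apply_overrides(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `defaults` with the keys in `overrides` replaced."""
    merged = dict(defaults)  # copy so the defaults stay untouched
    merged.update(overrides)
    return merged


effective = apply_overrides(DEFAULTS, {"cardinality_limit_mode": "VERY_MANY"})
print(effective)
```

Only `cardinality_limit_mode` changes in the merged result; every other key keeps its default value.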

## Configuring Splitters at `DataConnector` and `Asset`-Level

For `ConfiguredAssetSqlDataConnectors`, the `splitter_method` and `splitter_kwargs` can be configured at the `DataConnector`-level or `Asset`-level. 

#### Configuration at `DataConnector`-level


Here is a configuration with the splitter method `split_on_year_and_month` configured at the `DataConnector`-level for a `DataConnector` with 2 `Assets`, `yellow_tripdata_sample_2020_by_year_and_month` and `yellow_tripdata_sample_2020`.

In [24]:
datasource_config = {
    "name": "taxi_multi_batch_sql_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "SqlAlchemyExecutionEngine",
        "connection_string": CONNECTION_STRING,
    },
    "data_connectors": {
        "configured_data_connector_multi_batch_asset": {
            "class_name": "ConfiguredAssetSqlDataConnector",
            "splitter_method": "split_on_year_and_month",
            "splitter_kwargs": {
                "column_name": "pickup_datetime",
            },
            "assets": {
                "yellow_tripdata_sample_2020_by_year_and_month": {
                    "table_name": "yellow_tripdata_sample_2020",
                    "schema_name": "public",
                },
                "yellow_tripdata_sample_2020": {
                    "table_name": "yellow_tripdata_sample_2020",
                    "schema_name": "public",
                },
            },
        },
    },
}

data_context.test_yaml_config(yaml.dump(datasource_config))

Attempting to instantiate class from config...
	Instantiating as a Datasource, since class_name is Datasource
	Successfully instantiated Datasource


ExecutionEngine class name: SqlAlchemyExecutionEngine
Data Connectors:
	configured_data_connector_multi_batch_asset : ConfiguredAssetSqlDataConnector

	Available data_asset_names (2 of 2):
		yellow_tripdata_sample_2020 (3 of 12): [{'pickup_datetime': {'year': 2020, 'month': 1}}, {'pickup_datetime': {'year': 2020, 'month': 10}}, {'pickup_datetime': {'year': 2020, 'month': 11}}]
		yellow_tripdata_sample_2020_by_year_and_month (3 of 12): [{'pickup_datetime': {'year': 2020, 'month': 1}}, {'pickup_datetime': {'year': 2020, 'month': 10}}, {'pickup_datetime': {'year': 2020, 'month': 11}}]

	Unmatched data_references (0 of 0):[]



<great_expectations.datasource.new_datasource.Datasource at 0x7fa9b0849a90>

As you can see, both `Assets`, `yellow_tripdata_sample_2020_by_year_and_month` **and** `yellow_tripdata_sample_2020`, have the splitter method applied to them, meaning each has 12 Batches as a result of splitting by `year` and `month`.
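To make the idea of splitting by `year` and `month` concrete, here is a minimal sketch in plain Python. It is not how Great Expectations implements `split_on_year_and_month` (the real splitter runs as SQL against the database); the sample rows and the helper `group_rows_by_year_and_month` are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows; only the `pickup_datetime` column matters here.
rows = [
    {"pickup_datetime": datetime(2020, 1, 3, 8, 15), "fare": 9.5},
    {"pickup_datetime": datetime(2020, 1, 20, 17, 40), "fare": 12.0},
    {"pickup_datetime": datetime(2020, 10, 5, 9, 0), "fare": 7.25},
    {"pickup_datetime": datetime(2020, 11, 30, 23, 59), "fare": 30.0},
]


def group_rows_by_year_and_month(rows, column_name):
    """Group rows into batches keyed by the (year, month) of `column_name`."""
    batches = defaultdict(list)
    for row in rows:
        value = row[column_name]
        batches[(value.year, value.month)].append(row)
    return dict(batches)


batches = group_rows_by_year_and_month(rows, column_name="pickup_datetime")

# Print one batch identifier per group, in the same shape as the output above.
for (year, month), batch_rows in sorted(batches.items()):
    print({"pickup_datetime": {"year": year, "month": month}}, len(batch_rows))
```

Each distinct `(year, month)` pair becomes one Batch, which is why a table covering all of 2020 yields 12 Batches.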

#### Configuration at `DataConnector`-level **and** `Asset`-level

Next is a similar example, but with a second `splitter_method`, `split_on_year_and_month_and_day`, configured at the `Asset`-level for the Asset `yellow_tripdata_sample_2020_by_year_and_month_and_day`. In this case, the `Asset`-level configuration will **override** the configuration at the `DataConnector`-level. With the full 2020 dataset loaded, splitting by `year`, `month` and `day` would produce 366 Batches.
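The override rule can be sketched as a simple precedence lookup. This is only an illustration of the behavior described above, not Great Expectations' internal code; the helper `resolve_splitter_config` is a hypothetical name.

```python
def resolve_splitter_config(connector_config, asset_config):
    """Return the splitter settings for one asset.

    Asset-level keys take precedence; anything the asset does not set
    is inherited from the DataConnector-level configuration.
    """
    keys = ("splitter_method", "splitter_kwargs")
    resolved = {key: connector_config[key] for key in keys if key in connector_config}
    resolved.update({key: asset_config[key] for key in keys if key in asset_config})
    return resolved


connector_config = {
    "splitter_method": "split_on_year_and_month",
    "splitter_kwargs": {"column_name": "pickup_datetime"},
}

# This asset sets no splitter, so it inherits the DataConnector-level one.
inherited = resolve_splitter_config(
    connector_config,
    {"table_name": "yellow_tripdata_sample_2020", "schema_name": "public"},
)

# This asset sets its own splitter, which overrides the DataConnector-level one.
overridden = resolve_splitter_config(
    connector_config,
    {
        "table_name": "yellow_tripdata_sample_2020",
        "schema_name": "public",
        "splitter_method": "split_on_year_and_month_and_day",
        "splitter_kwargs": {"column_name": "pickup_datetime"},
    },
)

print(inherited["splitter_method"])
print(overridden["splitter_method"])
```

The full `Datasource` configuration that exercises this behavior follows.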

In [27]:
datasource_config = {
    "name": "taxi_multi_batch_sql_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "SqlAlchemyExecutionEngine",
        "connection_string": CONNECTION_STRING,
    },
    "data_connectors": {
        "configured_data_connector_multi_batch_asset": {
            "class_name": "ConfiguredAssetSqlDataConnector",
            "splitter_method": "split_on_year_and_month",
            "splitter_kwargs": {
                "column_name": "pickup_datetime",
            },
            "assets": {
                "yellow_tripdata_sample_2020_by_year_and_month": {
                    "table_name": "yellow_tripdata_sample_2020",
                    "schema_name": "public",
                },
                "yellow_tripdata_sample_2020_by_year_and_month_and_day": {
                    "table_name": "yellow_tripdata_sample_2020",
                    "schema_name": "public",
                    "splitter_method": "split_on_year_and_month_and_day",
                    "splitter_kwargs": {
                        "column_name": "pickup_datetime",
                    },
                },
            },
        },
    },
}

data_context.test_yaml_config(yaml.dump(datasource_config))

Attempting to instantiate class from config...
	Instantiating as a Datasource, since class_name is Datasource
	Successfully instantiated Datasource


ExecutionEngine class name: SqlAlchemyExecutionEngine
Data Connectors:
	configured_data_connector_multi_batch_asset : ConfiguredAssetSqlDataConnector

	Available data_asset_names (2 of 2):
		yellow_tripdata_sample_2020_by_year_and_month (3 of 12): [{'pickup_datetime': {'year': 2020, 'month': 1}}, {'pickup_datetime': {'year': 2020, 'month': 10}}, {'pickup_datetime': {'year': 2020, 'month': 11}}]
		yellow_tripdata_sample_2020_by_year_and_month_and_day (3 of 94): [{'pickup_datetime': {'year': 2020, 'month': 10, 'day': 1}}, {'pickup_datetime': {'year': 2020, 'month': 10, 'day': 10}}, {'pickup_datetime': {'year': 2020, 'month': 10, 'day': 13}}]

	Unmatched data_references (0 of 0):[]



<great_expectations.datasource.new_datasource.Datasource at 0x7fa9b298dd30>

As you can see, `yellow_tripdata_sample_2020_by_year_and_month` and `yellow_tripdata_sample_2020_by_year_and_month_and_day` each have a different number of Batches resulting from their different `splitter` configurations. 

* `yellow_tripdata_sample_2020_by_year_and_month` has 12 Batches. 
* `yellow_tripdata_sample_2020_by_year_and_month_and_day` has 94 Batches. (It is not 366 because we loaded the data with `load_full_dataset=False`, so the database holds only part of the dataset and not every day of 2020 is present, but you get the picture :) )
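The Batch counts above follow from counting distinct `(year, month, day)` values. The sketch below checks that arithmetic with the standard library only: 2020 is a leap year, so a fully populated table would give 366 daily Batches, while a partial load gives one Batch per day that actually appears. The helper names and the sample timestamps are hypothetical.

```python
from datetime import date, datetime, timedelta


def days_in_year(year):
    """Number of calendar days in `year` (366 for leap years)."""
    return (date(year + 1, 1, 1) - date(year, 1, 1)).days


def count_daily_batches(timestamps):
    """Count distinct (year, month, day) values, i.e. one Batch per day present."""
    return len({(ts.year, ts.month, ts.day) for ts in timestamps})


# One timestamp for every day of 2020: the "full dataset" case.
full_year = [
    datetime(2020, 1, 1) + timedelta(days=offset)
    for offset in range(days_in_year(2020))
]

# A partial load: four rows that fall on only three distinct days.
partial_load = [
    datetime(2020, 10, 1, 8, 0),
    datetime(2020, 10, 1, 19, 30),
    datetime(2020, 10, 10, 12, 0),
    datetime(2020, 10, 13, 7, 45),
]

print(count_daily_batches(full_year))
print(count_daily_batches(partial_load))
```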

## Loading Data into Postgresql Database


* The following code can be used to build the postgres database used in this notebook. It is included (and commented out) for reference.
* In order to load the data into a local `postgresql` database, please feel free to use the `docker-compose.yml` file available at `great_expectations/assets/docker/postgresql/`. 

### To spin up the `postgresql` database
* Have [Docker Desktop](https://www.docker.com/products/docker-desktop/) running locally.
* Navigate to `great_expectations/assets/docker/postgresql/`
* Type `docker-compose up`
* Then uncomment and run the following snippet

In [None]:
# import os
# from tests.test_utils import load_data_into_test_database
# from typing import List
# import sqlalchemy as sa
# import pandas as pd
# pg_hostname = os.getenv("GE_TEST_LOCAL_DB_HOSTNAME", "localhost")
# CONNECTION_STRING = f"postgresql+psycopg2://postgres:@{pg_hostname}/test_ci"
#
# # 2019 first
# data_paths: List[str] = [
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2019-01.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2019-02.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2019-03.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2019-04.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2019-05.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2019-06.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2019-07.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2019-08.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2019-09.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2019-10.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2019-11.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2019-12.csv",
# ]
#
# # adding 2019 table
# engine = sa.create_engine(CONNECTION_STRING)
# connection = engine.connect()
# table_name = "yellow_tripdata_sample_2019"
# res = connection.execute(f"DROP TABLE IF EXISTS {table_name}")
#
# for data_path in data_paths:
#     # This utility is not for general use. It is only to support testing.
#     load_data_into_test_database(
#         table_name=table_name,
#         csv_path=data_path,
#         connection_string=CONNECTION_STRING,
#         load_full_dataset=False,
#         drop_existing_table=False,
#         convert_colnames_to_datetime=["pickup_datetime", "dropoff_datetime"]
#     )
#
#
# # 2020 next
#
# data_paths: List[str] = [
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-01.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-02.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-03.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-04.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-05.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-06.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-07.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-08.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-09.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-10.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-11.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-12.csv",
# ]
#
# engine = sa.create_engine(CONNECTION_STRING)
# connection = engine.connect()
# table_name = "yellow_tripdata_sample_2020"
# res = connection.execute(f"DROP TABLE IF EXISTS {table_name}")
#
# for data_path in data_paths:
#     # This utility is not for general use. It is only to support testing.
#     load_data_into_test_database(
#         table_name=table_name,
#         csv_path=data_path,
#         connection_string=CONNECTION_STRING,
#         load_full_dataset=False,
#         drop_existing_table=False,
#         convert_colnames_to_datetime=["pickup_datetime", "dropoff_datetime"]
#     )
#
#