# How to write multi-batch `BatchRequest` - `InferredAsset` Example
* A `BatchRequest` facilitates the return of a `Batch` of data from a configured `Datasource`. To learn more about `Batches`, refer to the [related documentation](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/how_to_get_a_batch_of_data_from_a_configured_datasource#1-construct-a-batchrequest).
* A `BatchRequest` can return zero or more Batches, depending on the underlying data and on how the request is configured. This guide shows how to configure `BatchRequests` that return multiple Batches, which can be used by:
   1. Self-initializing Expectations, to estimate parameters.
   2. DataAssistants, to profile your data and create an Expectation Suite with self-initialized parameters.

* Note: multi-batch `BatchRequests` are not supported by `RuntimeDataConnector`.

## FileSystem Example

### Example Directory

Imagine we have a directory of 12 CSV files, each containing one month of taxi ride data:

```
yellow_tripdata_sample_2020-01.csv
yellow_tripdata_sample_2020-02.csv
yellow_tripdata_sample_2020-03.csv
yellow_tripdata_sample_2020-04.csv
yellow_tripdata_sample_2020-05.csv
yellow_tripdata_sample_2020-06.csv
yellow_tripdata_sample_2020-07.csv
yellow_tripdata_sample_2020-08.csv
yellow_tripdata_sample_2020-09.csv
yellow_tripdata_sample_2020-10.csv
yellow_tripdata_sample_2020-11.csv
yellow_tripdata_sample_2020-12.csv
```


In [None]:
import great_expectations as ge
from ruamel import yaml
from great_expectations.core.batch import BatchRequest

* Load `DataContext`

In [None]:
data_context: ge.DataContext = ge.get_context()

### `InferredAssetDataConnector` Example

* Add a `Datasource` named `taxi_multi_batch_inferred_datasource` with two `InferredAssetDataConnectors`. The key difference between them is the `pattern` (and the matching `group_names`) used to build the `data_asset_name`. Depending on which `group_names` are used, we get either Data Assets with a single Batch each (one per CSV file) or a single Data Asset with 12 Batches (the 12 CSV files for 2020).

* The first DataConnector is called `inferred_data_connector_single_batch_asset`. It captures the entire file name before the `.csv` extension (`(.*)`) and maps it to the `data_asset_name` group.
    * For our directory, we get 12 Data Assets with 1 Batch each.
    * This can be seen in the output of `test_yaml_config()`, which lists 3 of the 12 Data Assets, each with 1 Batch.
    * Here is the output:

    ```
    Available data_asset_names (3 of 12):
        yellow_tripdata_sample_2020-01 (1 of 1): ['yellow_tripdata_sample_2020-01.csv']
        yellow_tripdata_sample_2020-02 (1 of 1): ['yellow_tripdata_sample_2020-02.csv']
        yellow_tripdata_sample_2020-03 (1 of 1): ['yellow_tripdata_sample_2020-03.csv']
    ```

* The second DataConnector is called `inferred_data_connector_multi_batch_asset`.
    * It captures `(yellow_tripdata_sample_2020)` and maps it to the `data_asset_name` group, then captures the month (`(\\d.*)`) as the second group (`month`).
    * For the files in our directory, this yields a single Data Asset named `yellow_tripdata_sample_2020`, with each of the 12 months corresponding to one Batch of that asset.
    * This can be seen in the output of `test_yaml_config()`, which shows 3 of the 12 Batches belonging to `yellow_tripdata_sample_2020`.
    * Here is the output:

    ```
    Available data_asset_names (1 of 1):
        yellow_tripdata_sample_2020 (3 of 12): ['yellow_tripdata_sample_2020-01.csv', 'yellow_tripdata_sample_2020-02.csv', 'yellow_tripdata_sample_2020-03.csv']
    ```
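
To see why the two patterns produce such different results, the grouping can be reproduced with Python's standard `re` module alone. This is only a rough sketch of the idea, not how the DataConnector is implemented internally; the helper `group_batches` is a hypothetical name introduced for illustration.

```python
import re
from collections import defaultdict

# The 12 file names from the example directory.
filenames = [f"yellow_tripdata_sample_2020-{month:02d}.csv" for month in range(1, 13)]


def group_batches(pattern, group_names, names):
    """Hypothetical helper: map each data_asset_name to its batch identifiers."""
    assets = defaultdict(list)
    for name in names:
        match = re.match(pattern, name)
        if match is None:
            continue  # files that do not match the pattern are ignored
        groups = dict(zip(group_names, match.groups()))
        asset_name = groups.pop("data_asset_name")
        # Whatever groups remain (e.g. "month") identify the batch within the asset.
        assets[asset_name].append(groups)
    return dict(assets)


single = group_batches(r"(.*)\.csv", ["data_asset_name"], filenames)
multi = group_batches(
    r"(yellow_tripdata_sample_2020)-(\d.*)\.csv",
    ["data_asset_name", "month"],
    filenames,
)

print(len(single))  # number of data assets under the first pattern
print(len(multi))   # number of data assets under the second pattern
print(multi["yellow_tripdata_sample_2020"][:3])
```

With the first pattern every file becomes its own asset; with the second, all files share one asset name and differ only in the captured `month`, which is always a string such as `"01"`.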

In [None]:
data_path: str = "../../../../test_sets/taxi_yellow_tripdata_samples/samples_2020"

datasource_config = {
    "name": "taxi_multi_batch_inferred_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "PandasExecutionEngine",
    },
    "data_connectors": {
        "inferred_data_connector_single_batch_asset": {
            "class_name": "InferredAssetFilesystemDataConnector",
            "base_directory": data_path,
            "default_regex": {
                "group_names": ["data_asset_name"],
                "pattern": "(.*)\\.csv",
            },
        },
        "inferred_data_connector_multi_batch_asset": {
            "class_name": "InferredAssetFilesystemDataConnector",
            "base_directory": data_path,
            "default_regex": {
                "group_names": ["data_asset_name", "month"],
                "pattern": "(yellow_tripdata_sample_2020)-(\\d.*)\\.csv",
            },
        },
        
    },
}

data_context.test_yaml_config(yaml.dump(datasource_config))

In [None]:
# add_datasource only if it doesn't already exist in our configuration
try:
    data_context.get_datasource(datasource_config["name"])
except ValueError:
    data_context.add_datasource(**datasource_config)

## BatchRequest

* Depending on which `DataConnector` you send a `BatchRequest` to, you will retrieve a different number of `Batches`.

* Single Batch returned by `inferred_data_connector_single_batch_asset` DataConnector.

In [None]:
single_batch_batch_request: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_inferred_datasource",
    data_connector_name="inferred_data_connector_single_batch_asset",
    data_asset_name="yellow_tripdata_sample_2020-01",
)

In [None]:
batch_list = data_context.get_batch_list(batch_request=single_batch_batch_request)

In [None]:
batch_list

* Multi Batch returned by `inferred_data_connector_multi_batch_asset` DataConnector.

In [None]:
multi_batch_batch_request: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_inferred_datasource",
    data_connector_name="inferred_data_connector_multi_batch_asset",
    data_asset_name="yellow_tripdata_sample_2020",
)

In [None]:
multi_batch_batch_list = data_context.get_batch_list(batch_request=multi_batch_batch_request)

In [None]:
multi_batch_batch_list

* You can also get a single Batch from a multi-batch DataConnector by passing a `data_connector_query` with an `index`. An index of `-1` returns the most recent Batch (month = `12`).

In [None]:
single_batch_batch_request_from_multi: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_inferred_datasource",
    data_connector_name="inferred_data_connector_multi_batch_asset",
    data_asset_name="yellow_tripdata_sample_2020",
    data_connector_query={
        "index": -1 
    }
)

In [None]:
batch_list = data_context.get_batch_list(batch_request=single_batch_batch_request_from_multi)

In [None]:
batch_list

In [None]:
batch_list[0].to_dict() # 'batch_identifiers': {'month': '12'}},
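
The `index` used in the query above behaves like ordinary Python list indexing over the Batches of the asset. The sketch below mimics this with plain dictionaries standing in for Batches, and it assumes the Batches are ordered by ascending `month`, as the results in this guide suggest.

```python
# Plain dictionaries standing in for the 12 monthly Batches (illustrative only).
batch_identifiers = [{"month": f"{month:02d}"} for month in range(1, 13)]

# Assumption: Batches are ordered by ascending month.
ordered = sorted(batch_identifiers, key=lambda batch: batch["month"])

latest = ordered[-1]   # what "index": -1 selects
earliest = ordered[0]  # what "index": 0 selects

print(latest)
print(earliest)
```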

## Using auto-initializing `Expectations` to generate parameters

* We will generate a `Validator` using our `multi_batch_batch_list`.

In [None]:
multi_batch_batch_list = data_context.get_batch_list(batch_request=multi_batch_batch_request)

In [None]:
example_suite = data_context.create_expectation_suite(expectation_suite_name="example_inferred_suite", overwrite_existing=True)

In [None]:
validator = data_context.get_validator_using_batch_list(batch_list=multi_batch_batch_list, expectation_suite=example_suite)

* When you run methods on the Validator, they typically run against the most recent Batch (index `-1`), even if the Validator has access to a longer Batch list. For example, notice that the `pickup_datetime` and `dropoff_datetime` values below all fall in December, indicating that they come from the most recent Batch.

In [None]:
validator.head()

### Typical Workflow
* A `batch_list` becomes really useful when you are calculating parameters for auto-initializing Expectations, as they use a `RuleBasedProfiler` under the hood to calculate parameters.

* Here is an example running `expect_column_median_to_be_between()` by "guessing" at the `min_value` and `max_value`. 

In [None]:
validator.expect_column_median_to_be_between(column="trip_distance", min_value=0, max_value=1)

* The observed value for the most recent Batch (December 2020) is `1.61`, which falls outside the guessed range of 0 to 1, so the Expectation fails.

* Now we run the same Expectation again, but this time with `auto=True`. This means the `median` values are calculated across the `batch_list` associated with the `Validator` (i.e. the 12 Batches for 2020), which gives a min value of `1.6` and a max value of `1.92`.

In [None]:
validator.expect_column_median_to_be_between(column="trip_distance", auto=True)

* With `auto=True`, the Expectation is also run automatically against the most recent Batch (which has an observed value of `1.61`), and it passes.

* You can now save the `ExpectationSuite`.

In [None]:
validator.save_expectation_suite()
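
To build intuition for the `auto=True` step above, here is a deliberately simplified picture: compute the median per Batch, then take the smallest and largest medians as the bounds. The numbers below are made up for illustration and are not the taxi data, and the actual `RuleBasedProfiler` may estimate the range differently, so treat this as a conceptual sketch only.

```python
from statistics import median

# Hypothetical trip_distance samples for three monthly Batches (made-up values).
monthly_trip_distances = {
    "01": [1.2, 1.7, 2.4],
    "02": [0.9, 1.9, 3.1],
    "03": [1.1, 1.6, 2.0],
}

# One median per Batch.
monthly_medians = {
    month: median(values) for month, values in monthly_trip_distances.items()
}

# Simplified estimate: the bounds are the extremes of the per-Batch medians.
min_value = min(monthly_medians.values())
max_value = max(monthly_medians.values())

# Validate the most recent Batch against the estimated range.
latest_month = max(monthly_medians)
observed = monthly_medians[latest_month]
passes = min_value <= observed <= max_value

print(min_value, max_value, observed, passes)
```

Because the most recent Batch contributes to the estimate in this simplified version, its median always lies within the resulting range.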

### Running the `ExpectationSuite` against a single `Batch`

* Now the ExpectationSuite can be used to validate single Batches using a Checkpoint. As before, we could use `data_connector_query` to specify the Batch that we would like to run the `Checkpoint` on, but the recommended way is to use the `batch_identifiers` parameter, where we have set `month` to `01` to select the January 2020 Batch.

In [None]:
multi_batch_batch_request: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_inferred_datasource",
    data_connector_name="inferred_data_connector_multi_batch_asset",
    data_asset_name="yellow_tripdata_sample_2020",
    #data_connector_query={
    #    "index": 0 # this one will correspond to Jan 2020
    #}
)


In [None]:
checkpoint_config = {
    "name": "my_checkpoint",
    "config_version": 1,
    "class_name": "SimpleCheckpoint",
    "validations": [
        {
            "batch_request": multi_batch_batch_request,
            "expectation_suite_name": "example_inferred_suite",
            "batch_identifiers":{"month": "01"} # batch_identifier month is set to 01
        }
    ],
}
data_context.add_checkpoint(**checkpoint_config)

In [None]:
results = data_context.run_checkpoint(
    checkpoint_name="my_checkpoint"
)

In [None]:
results.success
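
Selecting a Batch by `batch_identifiers`, as in the Checkpoint above, can be pictured as a key-value filter over the identifiers captured by the regex. The sketch below uses plain dictionaries and a hypothetical helper named `select_batches`; it is not the library's implementation.

```python
# Plain dictionaries standing in for the 12 monthly Batches (illustrative only).
batches = [{"month": f"{month:02d}"} for month in range(1, 13)]


def select_batches(candidates, batch_identifiers):
    """Hypothetical helper: keep Batches whose identifiers match every given key."""
    return [
        batch
        for batch in candidates
        if all(batch.get(key) == value for key, value in batch_identifiers.items())
    ]


january = select_batches(batches, {"month": "01"})
no_match = select_batches(batches, {"month": 1})  # an int does not equal the string "01"

print(january)
print(no_match)
```

Note that values captured by the regex are strings, so `month` is compared against `"01"` rather than the integer `1`.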