# Create a new pandas Datasource
Use this notebook to configure a new pandas Datasource and add it to your project.

In [None]:
from ruamel import yaml
import webbrowser

import great_expectations as gx
from great_expectations.core.batch import Batch, BatchRequest, RuntimeBatchRequest

In [None]:
context = gx.get_context()

## Customize Your Datasource Configuration

**If you are new to Great Expectations Datasources,** you should check out our [how-to documentation](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview).

**My configuration is not so simple - are there more advanced options?**
Glad you asked! Datasources are versatile. Please see our [How To Guides](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview)!

Give your Datasource a unique name. This notebook uses `my_s3_datasource`, set in the `name` field of the configuration below.

### For file-based Datasources:
Here we create an example configuration. It contains an **InferredAssetS3DataConnector**, which adds a Data Asset for each object in the configured S3 bucket whose key matches `default_regex`. It also contains a **RuntimeDataConnector**, which accepts file paths at runtime. This is just an example, and you may customize it as you wish!

If you would like to learn more about the **DataConnectors** used in this configuration, including other ways to organize assets, handle multi-file assets, and name assets based on parts of a filename, please see our docs on [InferredAssetDataConnectors](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/how_to_configure_an_inferredassetdataconnector) and [RuntimeDataConnectors](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/how_to_configure_a_runtimedataconnector).


In [None]:
# Raw string: the backslash in the regex pattern reaches the YAML parser unchanged
datasource_yaml = r"""
name: my_s3_datasource
class_name: Datasource
execution_engine:
    class_name: PandasExecutionEngine
data_connectors:
    default_runtime_data_connector_name:
        class_name: RuntimeDataConnector
        batch_identifiers:
            - default_identifier_name
    default_inferred_data_connector_name:
        class_name: InferredAssetS3DataConnector
        bucket: demo-gp-taxi-data
        default_regex:
            pattern: (.*)\.csv
            group_names:
                - data_asset_name
"""
print(datasource_yaml)
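
To see what `default_regex` does, the sketch below applies the same pattern and `group_names` to a few object keys using Python's `re` module. It only illustrates the matching idea and is not Great Expectations' own implementation. The helper `asset_names_from_keys`, the choice of full-match semantics, and every key other than `yellow_tripdata_sample_2019-01.csv` are assumptions made for this example.

```python
import re

PATTERN = r"(.*)\.csv"             # same pattern as default_regex.pattern
GROUP_NAMES = ["data_asset_name"]  # same as default_regex.group_names


def asset_names_from_keys(keys):
    """Map each matching key to the value captured for data_asset_name.

    Hypothetical helper: full-match semantics are an assumption here.
    """
    compiled = re.compile(PATTERN)
    assets = {}
    for key in keys:
        match = compiled.fullmatch(key)
        if match is None:
            continue  # keys that do not match the pattern are skipped
        captured = dict(zip(GROUP_NAMES, match.groups()))
        assets[key] = captured["data_asset_name"]
    return assets


example_keys = [
    "yellow_tripdata_sample_2019-01.csv",  # file used later in this notebook
    "yellow_tripdata_sample_2019-02.csv",  # hypothetical key
    "README.txt",                          # hypothetical key, not a CSV
]
asset_names = asset_names_from_keys(example_keys)
print(asset_names)
```

Each CSV key yields one asset name (the key without its `.csv` suffix), while the non-CSV key is ignored.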

## Test Your Datasource Configuration
Here we will test your Datasource configuration to make sure it is valid.

This `test_yaml_config()` function is meant to enable fast dev loops. **If your
configuration is correct, this cell will show you some snippets of the data
assets in the data source.** You can continually edit your Datasource config
yaml and re-run the cell to check until the new config is valid.

If you prefer to configure your Datasource in Python rather than YAML,
you can call `context.add_datasource()` and specify all the required parameters.

In [None]:
context.test_yaml_config(yaml_config=datasource_yaml)
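
For the Python route mentioned above, a minimal sketch is to express the same configuration as a dictionary and pass it to `context.add_datasource()` as keyword arguments. The variable name `datasource_config` is an assumption. The call itself is left as a comment so the block runs without a Data Context.

```python
# Same settings as the YAML configuration, written as a Python dictionary.
datasource_config = {
    "name": "my_s3_datasource",
    "class_name": "Datasource",
    "execution_engine": {"class_name": "PandasExecutionEngine"},
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        },
        "default_inferred_data_connector_name": {
            "class_name": "InferredAssetS3DataConnector",
            "bucket": "demo-gp-taxi-data",
            "default_regex": {
                "pattern": r"(.*)\.csv",  # raw string keeps the backslash intact
                "group_names": ["data_asset_name"],
            },
        },
    },
}

# With a Data Context available, the dictionary would be unpacked like this:
# context.add_datasource(**datasource_config)
print(sorted(datasource_config["data_connectors"]))
```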

## Save Your Datasource Configuration
Here we will save your Datasource in your Data Context once you are satisfied with the configuration. Note that `overwrite_existing` defaults to False, but you may change it to True if you wish to overwrite. Please note that if you wish to include comments you must add them directly to your `great_expectations.yml`.

In [None]:
context.add_datasource(**yaml.load(datasource_yaml))
context.list_datasources()

## Create A Batch of Data to Request

GX validates data in batches. A batch can be one or many assets. In this case, we create a batch from a specific file to validate.

In [None]:
# Here is a RuntimeBatchRequest using a path to a single CSV file
batch_request = RuntimeBatchRequest(
    datasource_name="my_s3_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="2019_taxi_data_january",  # this can be anything that identifies this data_asset for you
    runtime_parameters={"path": "s3://demo-gp-taxi-data/yellow_tripdata_sample_2019-01.csv"},  # Add your S3 path here.
    batch_identifiers={"default_identifier_name": "default_identifier"},
)
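
Before sending the request, it can help to catch typos in the S3 path. The sketch below splits the URI into bucket and key with the standard library. The helper `split_s3_uri` is hypothetical and is not part of Great Expectations, which does not require this check.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key); raise ValueError otherwise."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # netloc holds the bucket name; the path, minus its leading slash, is the key
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri(
    "s3://demo-gp-taxi-data/yellow_tripdata_sample_2019-01.csv"
)
print(bucket, key)
```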

## Create Placeholder Suite

Let's create a placeholder expectation suite (i.e. no validations/expectations) and let's run it on the batch we just created above. Note: since the expectation suite is empty, you may see warning messages about it. It's fine to ignore those for now.

In [None]:
exp_suite_name = "test_suite"
context.create_expectation_suite(
    expectation_suite_name=exp_suite_name, overwrite_existing=True
)
validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name=exp_suite_name
)
print(validator.head())

## Add Expectations To Suite

Now that we know the suite and the batch of data are correctly set up, we can add [Expectations](https://greatexpectations.io/expectations) to the suite.

In [None]:
validator.expect_column_values_to_not_be_null(column="passenger_count")

validator.expect_column_values_to_be_between(
    column="congestion_surcharge", min_value=0, max_value=1000
)

In [None]:
validator.save_expectation_suite(discard_failed_expectations=False)

## Create Checkpoint

[Checkpoints](https://docs.greatexpectations.io/docs/terms/checkpoint/) provide a convenient abstraction for bundling the Validation of a Batch (or Batches) of data against an Expectation Suite (or several), as well as the Actions that should be taken after the validation. Because our use case is straightforward, `SimpleCheckpoint` will suffice, but for more complex use cases `Checkpoint` should be used instead.

In [None]:
my_checkpoint_name = "taxi_data_validator"
checkpoint_config = {
    "name": my_checkpoint_name,
    "config_version": 1.0,
    "class_name": "SimpleCheckpoint",
    "run_name_template": "%Y%m%d-%H%M%S-my-run-name-template",
}
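
The `run_name_template` value uses `strftime`-style directives. Assuming the template is expanded with the run time in the usual `datetime.strftime` way, the sketch below shows what a resulting run name would look like. The timestamp is a made-up example, not output from a real run.

```python
from datetime import datetime

run_name_template = "%Y%m%d-%H%M%S-my-run-name-template"

# Hypothetical run time, chosen only to make the expansion visible
example_run_time = datetime(2019, 1, 31, 12, 30, 45)

# Directives such as %Y and %H are replaced; other characters pass through unchanged
run_name = example_run_time.strftime(run_name_template)
print(run_name)  # 20190131-123045-my-run-name-template
```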

## Test Checkpoint Configuration

Let's check that the configuration so far is correct.

In [None]:
my_checkpoint = context.test_yaml_config(yaml.dump(checkpoint_config))

## Add Checkpoint to DataContext

Let's add the skeleton Checkpoint to the DataContext.

In [None]:
context.add_or_update_checkpoint(**checkpoint_config)

## Run Checkpoint with Batch and Expectation Suite 

Now that we have all the pieces, let's put them together.

In [None]:
checkpoint_result = context.run_checkpoint(
    checkpoint_name=my_checkpoint_name,
    validations=[
        {
            "batch_request": batch_request,
            "expectation_suite_name": exp_suite_name,
        }
    ],
)
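
Once the Checkpoint has run, you will usually want to know whether validation passed. The sketch below summarizes a result dictionary shaped like the output of `checkpoint_result.to_json_dict()`. The helper `summarize_checkpoint_result` is hypothetical, the key layout (`success`, `run_results`, `validation_result`, `statistics`) is an assumption that may vary between GX versions, and the stand-in dictionary is invented purely to exercise the helper.

```python
def summarize_checkpoint_result(result):
    """Return overall success plus per-validation success and statistics."""
    summary = {"success": result["success"], "validations": []}
    for run_key, run_result in result["run_results"].items():
        validation = run_result["validation_result"]
        summary["validations"].append(
            {
                "id": str(run_key),
                "success": validation["success"],
                "statistics": validation.get("statistics", {}),
            }
        )
    return summary


# Hypothetical stand-in for checkpoint_result.to_json_dict(); not real output
stand_in_result = {
    "success": True,
    "run_results": {
        "example-validation-id": {
            "validation_result": {
                "success": True,
                "statistics": {
                    "evaluated_expectations": 2,
                    "successful_expectations": 2,
                },
            }
        }
    },
}
summary = summarize_checkpoint_result(stand_in_result)
print(summary["success"], len(summary["validations"]))
```

In the notebook, the call would be `summarize_checkpoint_result(checkpoint_result.to_json_dict())`, under the same assumptions about the result layout.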

In [None]:
webbrowser.open('https://demo-data-docs.s3.amazonaws.com/index.html')