Contracts v4 soda-core 3.3.9 (#817)
* Contracts v4 soda-core 3.3.9

* update image for experimental support

---------

Co-authored-by: Janet Revell <janet@soda.io>
tombaeyens and janet-can committed Jul 5, 2024
1 parent 5ce8b59 commit 595fbc0
Showing 6 changed files with 43 additions and 30 deletions.
Binary file modified assets/images/experimental.png
5 changes: 3 additions & 2 deletions soda/data-contracts-checks.md
@@ -6,13 +6,14 @@ parent: Create a data contract
---

# Data contract check reference <br />
![experimental](/assets/images/experimental.png){:height="300px" width="300px"} <br />
![experimental](/assets/images/experimental.png){:height="400px" width="400px"} <br />
*Last modified on {% last_modified_at %}*

Soda data contracts is a Python library that verifies data quality standards as early and often as possible in a data pipeline so as to prevent negative downstream impact. Learn more [About Soda data contracts]({% link soda/data-contracts.md %}#about-data-contracts).

<small>✖️ &nbsp;&nbsp; Requires Soda Core Scientific</small><br />
<small>✔️ &nbsp;&nbsp; Experimentally supported in Soda Core 3.3.3 or greater for PostgreSQL, Spark, and Snowflake</small><br />
<small>✖️ &nbsp;&nbsp; Supported in Soda Core CLI</small><br />
<small>✖️ &nbsp;&nbsp; Supported in Soda Library + Soda Cloud</small><br />
<small>✖️ &nbsp;&nbsp; Supported in Soda Cloud Agreements + Soda Agent</small><br />
<small>✖️ &nbsp;&nbsp; Supported by SodaGPT</small><br />
@@ -338,7 +339,7 @@ For a list of the available formats to use with the `valid_formats` column confi

Also known as a referential integrity or foreign key check, Soda executes a validity check with a `valid_values_reference_data` column configuration key as a separate query, relative to other validity queries. The query counts all values that exist in the named column which also *do not* exist in the column in the referenced dataset.

The referential dataset must exist in the same warehouse as the dataset identified by the contract.
The referential dataset must exist in the same data source as the dataset identified by the contract.
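To make the counting rule concrete, the following is a minimal sketch of the kind of query such a check implies, run against an in-memory SQLite database. The `customers` and `orders` tables and their columns are hypothetical, and the SQL is an illustration only, not the statement Soda generates. Ignoring `NULL` values is an assumption of this sketch.

```python
import sqlite3

# Hypothetical tables: orders.customer_id is expected to reference customers.id.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id TEXT);
    CREATE TABLE orders (customer_id TEXT);
    INSERT INTO customers VALUES ('c1'), ('c2');
    INSERT INTO orders VALUES ('c1'), ('c2'), ('c9'), (NULL);
    """
)

# Count non-null values in the checked column that have no match
# in the referenced column (assumption: NULLs are not counted as invalid).
invalid_count = conn.execute(
    """
    SELECT COUNT(*)
    FROM orders AS o
    LEFT JOIN customers AS c ON o.customer_id = c.id
    WHERE o.customer_id IS NOT NULL
      AND c.id IS NULL
    """
).fetchone()[0]
conn.close()

print(invalid_count)
```

Because both tables are joined in a single query, the sketch also shows why the referenced dataset has to be reachable through the same connection as the dataset under contract.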

{% include code-header.html %}
```yaml
35 changes: 18 additions & 17 deletions soda/data-contracts-verify.md
@@ -6,15 +6,16 @@ parent: Create a data contract
---

# Verify a data contract <br />
![experimental](/assets/images/experimental.png){:height="300px" width="300px"} <br />
![experimental](/assets/images/experimental.png){:height="400px" width="400px"} <br />
*Last modified on {% last_modified_at %}*

To verify a **Soda data contract** is to scan the data in a warehouse to execute the data contract checks you defined in a contracts YAML file. Available as a Python library, you run the scan programmatically, invoking Soda data contracts in a CI/CD workflow when you create a new pull request, or in a data pipeline after importing or transforming new data.
To verify a **Soda data contract** is to scan the data in a data source to execute the data contract checks you defined in a contracts YAML file. Available as a Python library, you run the scan programmatically, invoking Soda data contracts in a CI/CD workflow when you create a new pull request, or in a data pipeline after importing or transforming new data.

When deciding when to verify a data contract, consider that contract verification works best on new data as soon as it is produced so as to limit its exposure to other systems or users who might access it. The earlier in a pipeline or workflow, the better! Further, best practice suggests that you store batches of new data in a temporary table, verify a contract on the batches, then append the data to a larger table.

<small>✖️ &nbsp;&nbsp; Requires Soda Core Scientific</small><br />
<small>✔️ &nbsp;&nbsp; Experimentally supported in Soda Core 3.3.3 or greater for PostgreSQL, Spark, and Snowflake</small><br />
<small>✖️ &nbsp;&nbsp; Supported in Soda Core CLI</small><br />
<small>✖️ &nbsp;&nbsp; Supported in Soda Library + Soda Cloud</small><br />
<small>✖️ &nbsp;&nbsp; Supported in Soda Cloud Agreements + Soda Agent</small><br />
<small>✖️ &nbsp;&nbsp; Supported by SodaGPT</small><br />
@@ -24,7 +25,7 @@ When deciding when to verify a data contract, consider that contract verificatio
[Prerequisites](#prerequisites)<br />
[Verify a data contract via API](#verify-a-data-contract-via-api)<br />
[Review contract verification results](#review-contract-verification-results)<br />
[About warehouse configurations](#about-warehouse-configurations)<br />
[About data source configurations](#about-data-source-configurations)<br />
[Verify data contracts with Spark](#verify-data-contracts-with-spark)<br />
[Validate data contracts](#validate-data-contracts)<br />
[Add a check identity](#add-a-check-identity)<br />
@@ -35,13 +36,13 @@ When deciding when to verify a data contract, consider that contract verificatio
## Prerequisites
* Python 3.8 or greater
* a code or text editor
* your warehouse connection credentials and details
* a `soda-core-contracts` package and a `soda-core[package]` [installed]({% link soda/data-contracts.md %}) in a virtual environment. Refer to the list of warehouse-specific <a href="https://github.com/sodadata/soda-core/blob/main/docs/installation.md" target="_blank">Soda Core packages</a> available to use.
* your data source connection credentials and details
* a `soda-core-contracts` package and a `soda-core[package]` [installed]({% link soda/data-contracts.md %}) in a virtual environment. Refer to the list of data source-specific <a href="https://github.com/sodadata/soda-core/blob/main/docs/installation.md" target="_blank">Soda Core packages</a> available to use.
* a Soda data contracts YAML file; see [Write a data contract]({% link soda/data-contracts-write.md %})

## Verify a data contract via API
1. In your code or text editor, create a new file name `warehouse.yml` accessible from within your working directory in your virtual environment.
2. To that file, add a warehouse configuration for Soda to connect to your warehouse and access the data within it to verify the contract. The example that follows is for a PostgreSQL warehouse; see [warehouse configuration](#about-warehouse-configurations) for further details . <br />Best practice dictates that you store sensitive credential values as environment variables using uppercase and underscores for the variables.
1. In your code or text editor, create a new file named `data_source.yml` accessible from within your working directory in your virtual environment.
2. To that file, add a data source configuration for Soda to connect to your data source and access the data within it to verify the contract. The example that follows is for a PostgreSQL data source; see [data source configuration](#about-data-source-configurations) for further details. <br />Best practice dictates that you store sensitive credential values as environment variables using uppercase and underscores for the variables.
```yaml
name: local_postgres
type: postgres
@@ -51,21 +52,21 @@ When deciding when to verify a data contract, consider that contract verificatio
username: ${POSTGRES_USERNAME}
password: ${POSTGRES_PASSWORD}
```
Alternatively, you can use a YAML string or dict to define connection details; use one of the `with_warehouse_...(...)` methods.
3. Add the following block to your Python working environment. Replace the values of the file paths with your own warehouse YAML file and contract YAML file respectively.
Alternatively, you can use a YAML string or dict to define connection details; use one of the `with_data_source_...(...)` methods.
3. Add the following block to your Python working environment. Replace the values of the file paths with your own data source YAML file and contract YAML file respectively.
```python
from soda.contracts.contract_verification import ContractVerification, ContractVerificationResult

contract_verification_result: ContractVerificationResult = (
ContractVerification.builder()
.with_contract_yaml_file('soda/local_postgres/public/customers.yml')
.with_warehouse_yaml_file('soda/local_postgres/warehouse.yml')
.with_data_source_yaml_file('soda/local_postgres/data_source.yml')
.execute()
)

print(str(contract_verification_result))
```
4. At runtime, Soda connects with your warehouse and verifies the contract by executing the data contract checks in your file. Use `${SCHEMA}` syntax to provide any environment variable values in a contract YAML file. Soda returns results of the verification as pass or fail check results, or indicate errors if any exist; see below.
4. At runtime, Soda connects with your data source and verifies the contract by executing the data contract checks in your file. Use `${SCHEMA}` syntax to provide any environment variable values in a contract YAML file. Soda returns the results of the verification as pass or fail check results, or indicates errors if any exist; see below.

## Review contract verification results

@@ -80,27 +81,27 @@ When Soda surfaces a failed check or an execution error, you may wish to stop th
* Append `.assert_ok()` at the end of the contract verification result, which produces a SodaException when a check fails or when execution errors occur. The exception message includes a full report.
* Test for the result using `if not contract_verification_result.is_ok():` Use `str(contract_verification_result)` to get a report.

## About warehouse configurations
## About data source configurations

Soda data contracts connects to a warehouse to perform queries, and verify schemas and data quality checks on data stored in a warehouse. Notably, it does not extract or ingest data, it only scans your data to complete contract verification. If you are using the Contract API, you only need to provide one warehouse configuration in the contract verification which Soda uses to verify contracts.
Soda data contracts connects to a data source to perform queries and to verify schemas and data quality checks on the data stored there. Notably, it does not extract or ingest data; it only scans your data to complete contract verification. If you are using the Contract API, you only need to provide one data source configuration in the contract verification, which Soda uses to verify contracts.

Best practice dictates that you store sensitive credential values as environment variables that use uppercase and underscores, such as `password: ${WAREHOUSE_PASSWORD}`. Soda data contracts uses environment variables by default; you can pass extra variables via the API using `.with_variables({"WAREHOUSE_PASSWORD": "***"})`.
Best practice dictates that you store sensitive credential values as environment variables that use uppercase and underscores, such as `password: ${DATA_SOURCE_PASSWORD}`. Soda data contracts uses environment variables by default; you can pass extra variables via the API using `.with_variables({"DATA_SOURCE_PASSWORD": "***"})`.
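To show what `${...}` placeholder resolution amounts to, here is a minimal sketch that substitutes variables into YAML text using only the Python standard library. It is not Soda's implementation. The `resolve_variables` function is a hypothetical name, and letting the extra variables take precedence over environment variables is an assumption of this sketch.

```python
import os
import re

# Matches placeholders such as ${DATA_SOURCE_PASSWORD}.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_variables(text: str, extra: dict = None) -> str:
    """Replace each ${NAME} in text with its value (hypothetical helper).

    Values come from the environment, overridden by `extra`
    (the precedence is an assumption of this sketch).
    """
    variables = {**os.environ, **(extra or {})}

    def _substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"Unresolved variable: {name}")
        return variables[name]

    return _PLACEHOLDER.sub(_substitute, text)


resolved = resolve_variables(
    "password: ${DATA_SOURCE_PASSWORD}",
    {"DATA_SOURCE_PASSWORD": "***"},
)
print(resolved)
```

Failing loudly on an unresolved name, as the sketch does, avoids sending a literal `${...}` string to the data source as a credential.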


## Verify data contracts with Spark

Where you have a Spark session that potentially includes data frames that live in-memory, you can pass a Spark session into the contract verification API to verify
a data contract in data frames without persisting and reloading.

Use `with_warehouse_spark_session` to pass your Spark session into the contract verification, as in the example below.
Use `with_data_source_spark_session` to pass your Spark session into the contract verification, as in the example below.

```python
spark_session: SparkSession = ...

contract_verification: ContractVerification = (
ContractVerification.builder()
.with_contract_yaml_str(contract_yaml_str)
.with_warehouse_spark_session(spark_session=spark_session, warehouse_name="spark_ds")
.with_data_source_spark_session(spark_session=spark_session, data_source_name="spark_ds")
.execute()
)
```
@@ -138,7 +139,7 @@ During a contract verification, you can arrange skip checks using `check.skip` a
```python
contract_verification: ContractVerification = (
ContractVerification.builder()
.with_warehouse_yaml_file('soda/local_postgres/warehouse.yml')
.with_data_source_yaml_file('soda/local_postgres/data_source.yml')
.with_contract_yaml_file('soda/local_postgres/public/customers.yml')
.build()
)
17 changes: 9 additions & 8 deletions soda/data-contracts-write.md
@@ -6,7 +6,7 @@ parent: Create a data contract
---

# Write a data contract <br />
![experimental](/assets/images/experimental.png){:height="300px" width="300px"} <br />
![experimental](/assets/images/experimental.png){:height="400px" width="400px"} <br />
*Last modified on {% last_modified_at %}*

**Soda data contracts** is a Python library that uses checks to verify data. Contracts enforce data quality standards in a data pipeline so as to prevent negative downstream impact. To verify the data quality standards for a dataset, you prepare a data **contract YAML file**, which is a formal description of the data. In the data contract, you use checks to define your expectations for good-quality data. Using the Python API, you can add data contract verification ideally right after new data has been produced.
@@ -54,6 +54,7 @@ checks:

<small>✖️ &nbsp;&nbsp; Requires Soda Core Scientific</small><br />
<small>✔️ &nbsp;&nbsp; Experimentally supported in Soda Core 3.3.3 or greater for PostgreSQL, Spark, and Snowflake</small><br />
<small>✖️ &nbsp;&nbsp; Supported in Soda Core CLI</small><br />
<small>✖️ &nbsp;&nbsp; Supported in Soda Library + Soda Cloud</small><br />
<small>✖️ &nbsp;&nbsp; Supported in Soda Cloud Agreements + Soda Agent</small><br />
<small>✖️ &nbsp;&nbsp; Supported by SodaGPT</small><br />
@@ -72,12 +73,12 @@ checks:
1. After completing the Soda data contracts [install requirements]({% link soda/data-contracts.md %}), use a code or text editor to create a new YAML file named `dim_customer.contract.yml`.
2. In the `dim_customer.contract.yml` file, define the schema, or list of columns, that a data contract must verify, and any data contract checks you wish to enforce for your dataset. At a minimum, you must include the following required parameters; refer to [List of configuration keys](#list-of-configuration-keys) below.
```yaml
# an identifier for the table or view in the SQL warehouse
# an identifier for the table or view in the SQL data source
dataset: dim_customer

# a list of columns that represents the dataset's schema,
# each of which is identified by the name of a column
# in the SQL warehouse
# in the SQL data source
columns:
- name: first_name
- name: last_name
@@ -114,20 +115,20 @@ checks:

### Organize your data contracts

Best practice dictates that you structure your data contracts files in a way that resembles the structure of your warehouse.
Best practice dictates that you structure your data contracts files in a way that resembles the structure of your data source.
1. In your root git repository folder, create a `soda` folder.
2. In the `soda` folder, create one folder per warehouse, then add a `warehouse.yml` file in each.
3. In each warehouse folder, create folders in each schema, then add the contract files in the schema folders.
2. In the `soda` folder, create one folder per data source, then add a `data_source.yml` file in each.
3. In each data source folder, create folders in each schema, then add the contract files in the schema folders.
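One benefit of this layout is that a script can discover which contracts belong to which data source purely from the folder structure. The sketch below assumes the convention described in the steps above (`soda/<data source>/data_source.yml` with contracts one schema folder deeper). The `find_contracts` function is a hypothetical helper, not part of the Soda API.

```python
from pathlib import Path


def find_contracts(soda_root) -> dict:
    """Map each data source file to the contract files in its schema folders.

    Assumes the layout soda/<data source>/data_source.yml and
    soda/<data source>/<schema>/<contract>.yml (hypothetical helper).
    """
    contracts_by_data_source = {}
    for data_source_file in sorted(Path(soda_root).glob("*/data_source.yml")):
        # Contract files live exactly one folder (the schema) below the data source folder.
        contracts = sorted(data_source_file.parent.glob("*/*.yml"))
        contracts_by_data_source[data_source_file] = contracts
    return contracts_by_data_source
```

Each resulting pair of paths could then be passed to `with_data_source_yaml_file(...)` and `with_contract_yaml_file(...)` respectively.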

```shell
+ soda
| + postgres_local
| | + warehouse.yml
| | + data_source.yml
| | + public
| | | + customers.yml
| | | + suppliers.yml
| + snowflake_sales
| | warehouse.yml
| | data_source.yml
| | + RAW
| | | + opportunities.yml
| | | + contacts.yml
15 changes: 12 additions & 3 deletions soda/data-contracts.md
@@ -8,7 +8,7 @@ redirect_from:
---

# Set up data contracts <br />
![experimental](/assets/images/experimental.png){:height="300px" width="300px"} <br />
![experimental](/assets/images/experimental.png){:height="400px" width="400px"} <br />
<!--Linked to UI, access Shlink-->
*Last modified on {% last_modified_at %}*

@@ -43,6 +43,7 @@ checks:
```
<small>✖️ &nbsp;&nbsp; Requires Soda Core Scientific</small><br />
<small>✔️ &nbsp;&nbsp; Experimentally supported in Soda Core 3.3.3 or greater for PostgreSQL, Spark, and Snowflake</small><br />
<small>✖️ &nbsp;&nbsp; Supported in Soda Core CLI</small><br />
<small>✖️ &nbsp;&nbsp; Supported in Soda Library + Soda Cloud</small><br />
<small>✖️ &nbsp;&nbsp; Supported in Soda Cloud Agreements + Soda Agent</small><br />
<small>✖️ &nbsp;&nbsp; Supported by SodaGPT</small><br />
@@ -70,7 +71,7 @@ Soda Core 3.3.0 supports the newest, experimental version of `soda-contracts`. T
* Python 3.8 or greater
* Pip 21.0 or greater
* a code or text editor
* your PostgreSQL, Spark, or Snowflake warehouse connection credentials and details
* your PostgreSQL, Spark, or Snowflake data source connection credentials and details
* (optional) a local development environment in which to test data contract execution
* (optional) a git repository to store and control the versions of your data contract YAML files

@@ -81,7 +82,7 @@ Data contracts are only available for use in programmatic scans using Soda Core.
Soda Core CLI *does not* support data contracts.

1. Best practice dictates that you install data contracts in a virtual environment. In your command-line interface tool, create and activate a <a href="https://docs.python.org/3/tutorial/venv.html#creating-virtual-environments" target="_blank">Python virtual environment</a>.
2. Execute the following command, replacing the package name with the install package that matches the type of warehouse you use to store data; see the <a href="https://github.com/sodadata/soda-core/blob/main/docs/installation.md" target="_blank">complete list</a> of packages.
2. Execute the following command, replacing the package name with the install package that matches the type of data source you use to store data; see the <a href="https://github.com/sodadata/soda-core/blob/main/docs/installation.md" target="_blank">complete list</a> of packages.
```shell
pip install soda-core-postgres
```
@@ -97,6 +98,14 @@ soda --help
To exit the virtual environment, use the command `deactivate`.


## Upgrade data contracts

In the virtual environment in which you originally installed `soda-core-contracts`, use the following command to upgrade to the latest version of the package.

```shell
pip install soda-core-contracts -U
```


## Go further

1 change: 1 addition & 0 deletions soda/new-documentation.md
@@ -14,6 +14,7 @@ parent: Learning resources

#### June 28, 2024
* Added [release notes]({% link release-notes/all.md %}) documentation for Soda Agent 1.1.15 & 1.1.16, Soda Library 1.5.13, and Soda Core 3.3.7, 3.3.8 & 3.3.9.
* Published documentation to accompany data contracts version 4 release.

#### June 27, 2024
* Added [release notes]({% link release-notes/all.md %}) documentation for Soda Agent 1.1.14 and Soda Library 1.5.12.