# Testing

This example demonstrates using the [EqualityValidate](https://arc.tripl.ai/validate/#equalityvalidate) stage and the `ETL_CONF_ENVIRONMENT` environment variable to perform data assertions for testing. These assertions can then run in a Continuous Integration/Continuous Deployment (CI/CD) process to prevent jobs reaching `production` without at least some verification of their business logic.

## Standard Data Extract and Typing

Run the standard extract, data typing and typing verification process.

In [None]:
{
  "type": "DelimitedExtract",
  "name": "extract data from green_tripdata schema 0",
  "environments": ["production", "test"],
  "inputURI": "/home/jovyan/examples/tutorial/data/nyc-tlc/trip-data/green_tripdata_2013-*.csv.gz",
  "outputView": "green_tripdata0_raw",
  "delimiter": "Comma",
  "quote" : "DoubleQuote",
  "header": true
}

In [None]:
{
  "type": "TypingTransform",
  "name": "apply green_tripdata schema 0 data types",
  "environments": ["production", "test"],
  "schemaURI": "/home/jovyan/examples/testing/green_tripdata0.json",
  "inputView": "green_tripdata0_raw",
  "outputView": "green_tripdata0",
  "persist": true
}

In [None]:
{
  "type": "SQLValidate",
  "name": "ensure no errors exist after data typing",
  "environments": ["production", "test"],
  "inputURI": "/home/jovyan/examples/testing/sqlvalidate_errors.sql",
  "sqlParams": {
    "inputView": "green_tripdata0"
  }
}

## Apply business rules

Apply some business logic to the typed dataset (`green_tripdata0`) with a `SQLTransform` stage to simulate business rules. For example, this query calculates the percentage of trips paid with each payment method per month, which a business could use to decide whether to keep accepting cash.

```sql
-- this query calculates the percentage of different payment methods by month which could be used by a business to track whether to keep accepting cash etc.

-- get a count of all records so monthly percentage can be calculated
WITH green_tripdata_monthly_trips AS (
  SELECT 
    COUNT(payment_type) AS green_tripdata_count 
    ,DATE_TRUNC('MM', lpep_pickup_datetime) AS month
  FROM green_tripdata0
  GROUP BY month
)
-- use the count to calculate percentages
SELECT 
  CASE
    WHEN payment_type = '1' THEN 'Credit card'
    WHEN payment_type = '2' THEN 'Cash'
    WHEN payment_type = '3' THEN 'No charge'
    WHEN payment_type = '4' THEN 'Dispute'
    WHEN payment_type = '5' THEN 'Unknown'
    WHEN payment_type = '6' THEN 'Voided trip'
    ELSE 'Unknown'
  END AS payment_type
  ,DATE_TRUNC('MM', lpep_pickup_datetime) AS month
  ,COUNT(payment_type) / green_tripdata_count AS percent
FROM green_tripdata0
INNER JOIN green_tripdata_monthly_trips ON DATE_TRUNC('MM', green_tripdata0.lpep_pickup_datetime) = green_tripdata_monthly_trips.month
GROUP BY payment_type, DATE_TRUNC('MM', green_tripdata0.lpep_pickup_datetime), green_tripdata_count
ORDER BY payment_type, DATE_TRUNC('MM', green_tripdata0.lpep_pickup_datetime)
```

In [None]:
{
  "type": "SQLTransform",
  "name": "calculate payment method percent over time",
  "environments": [
    "production",
    "test"
  ],
  "inputURI": "/home/jovyan/examples/testing/payment_type_over_time.sql",
  "outputView": "payment_type_over_time"
}
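To make the calculation easier to reason about when agreeing on a 'correct' result with business owners, the same idea can be sketched in plain Python. This is only an illustration: the `trips` rows are hypothetical toy data, `month_start` and `payment_share_by_month` are names invented here, and for simplicity the sketch groups by the readable label rather than reproducing the SQL exactly.

```python
from collections import Counter
from datetime import datetime

# Hypothetical toy trips: (payment_type code, pickup timestamp).
trips = [
    ("1", datetime(2013, 8, 3, 9, 15)),
    ("2", datetime(2013, 8, 9, 18, 40)),
    ("2", datetime(2013, 8, 21, 7, 5)),
    ("2", datetime(2013, 8, 30, 23, 59)),
    ("1", datetime(2013, 9, 2, 12, 0)),
    ("2", datetime(2013, 9, 14, 16, 30)),
]

# Same code-to-label mapping as the CASE expression in the query.
PAYMENT_LABELS = {
    "1": "Credit card",
    "2": "Cash",
    "3": "No charge",
    "4": "Dispute",
    "5": "Unknown",
    "6": "Voided trip",
}


def month_start(ts):
    """Truncate a timestamp to the first instant of its month."""
    return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


def payment_share_by_month(trips):
    """Return sorted (label, month, share) rows, where share is the
    fraction of that month's trips paid with that payment method."""
    monthly_total = Counter(month_start(ts) for _, ts in trips)
    per_type = Counter(
        (PAYMENT_LABELS.get(code, "Unknown"), month_start(ts))
        for code, ts in trips
    )
    return sorted(
        (label, month, count / monthly_total[month])
        for (label, month), count in per_type.items()
    )


shares = payment_share_by_month(trips)
for label, month, share in shares:
    print(label, month.strftime("%Y-%m"), share)
```

With this toy input, August has four trips (three cash, one credit card) and September has two (one of each), so the shares within each month sum to 1.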

## Snapshot the data

Work with the business owners to define what a 'correct' result would be based on an unchanged input set. A snapshot of this data can then be taken either by running a `ParquetLoad` stage like:

```json
{
  "type": "ParquetLoad",
  "name": "write out payment_type_over_time",
  "environments": [],
  "inputView": "payment_type_over_time",
  "outputURI": "/home/jovyan/examples/testing/payment_type_over_time_correct.parquet",
  "saveMode": "Overwrite"
}
```

... or you can manually create the 'correct' result using a completely different tool, such as Python with PyArrow, to produce a dataset that the CI/CD process can compare the job output against:

```python
# imports
import datetime
import pytz
import pyarrow as pa
import pyarrow.parquet as pq

# helper: midnight UTC on the first day of the given month of 2013
def month_start(month):
    return datetime.datetime(2013, month, 1, 0, 0, 0, 0, tzinfo=pytz.UTC)

# create the result set as pyarrow arrays (one entry per payment_type/month row)
payment_type = pa.array(['Cash', 'Cash', 'Cash', 'Credit card', 'Credit card', 'Credit card', 'Dispute', 'Dispute', 'No charge', 'No charge', 'No charge'], type=pa.string())
month = pa.array([month_start(m) for m in (8, 9, 10, 8, 9, 10, 8, 9, 8, 9, 10)])
percent = pa.array([0.6964044336, 0.6551807814, 0.6619047619, 0.2977831846, 0.3384291839, 0.3357142857, 0.0018924034, 0.0026894767, 0.0039199784, 0.0037005581, 0.0023809524], type=pa.float64())

# create the arrow table
# we are using an arrow table rather than a dataframe to correctly align with spark datatypes
table = pa.Table.from_arrays([payment_type, month, percent], 
  ['payment_type', 'month', 'percent'])

# write table to disk
pq.write_table(table, '/home/jovyan/examples/testing/payment_type_over_time_correct.parquet', flavor='spark')
```

## Configure the Testing

Once a known 'correct' snapshot has been created, the job can be configured to extract it and test it for equality only when running in test mode (i.e. `ETL_CONF_ENVIRONMENT=test`). If the job is run in production mode (i.e. `ETL_CONF_ENVIRONMENT=production`), the next two stages are skipped.

Note:

```json
  "environments": [
    "test"
  ]
```
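The effect of the `environments` list can be pictured as a simple filter over the job's stages. The sketch below is a conceptual model only, not Arc's implementation: `stages_to_run` is a hypothetical helper, and the stage dictionaries are trimmed-down copies of the ones in this example.

```python
# Trimmed-down stage definitions: only the fields relevant to gating.
stages = [
    {"type": "SQLTransform", "environments": ["production", "test"]},
    {"type": "ParquetLoad", "environments": []},
    {"type": "ParquetExtract", "environments": ["test"]},
    {"type": "EqualityValidate", "environments": ["test"]},
    {"type": "DeltaLakeLoad", "environments": ["production", "test"]},
]


def stages_to_run(stages, environment):
    """Keep only the stages whose `environments` list names the active environment."""
    return [
        stage["type"]
        for stage in stages
        if environment in stage.get("environments", [])
    ]


print(stages_to_run(stages, "test"))
# ['SQLTransform', 'ParquetExtract', 'EqualityValidate', 'DeltaLakeLoad']
print(stages_to_run(stages, "production"))
# ['SQLTransform', 'DeltaLakeLoad']
```

Under this model, a stage with an empty `environments` list, like the snapshot `ParquetLoad` stage, is selected in neither environment.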

To prove that the test works, modify the `payment_type_over_time` SQL query, for example to `,COUNT(payment_type) / green_tripdata_count + 1 AS percent`, and watch the job fail.

In [None]:
{
  "type": "ParquetExtract",
  "name": "load confirmed correct payment_type_over_time_correct snapshot (test only)",
  "environments": [
    "test"
  ],
  "inputURI": "/home/jovyan/examples/testing/payment_type_over_time_correct.parquet",
  "outputView": "payment_type_over_time_correct"
}

In [None]:
{
  "type": "EqualityValidate",
  "name": "verify calculated payment_type_over_time data equals confirmed correct payment_type_over_time_correct (test only)",
  "environments": [
    "test"
  ],
  "leftView": "payment_type_over_time",
  "rightView": "payment_type_over_time_correct"
}
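As a rough mental model of what an equality check between the calculated and snapshot datasets involves, the sketch below compares two small datasets in plain Python. It is an assumption-laden illustration rather than Arc's implementation: `datasets_equal` is a hypothetical helper, row order is assumed not to matter, months are shortened to strings, and only two rows from the snapshot above are used.

```python
from collections import Counter


def datasets_equal(left, right):
    """Return True when both datasets have the same column names and
    the same rows (duplicates counted), regardless of row order."""
    left_columns, left_rows = left
    right_columns, right_rows = right
    if left_columns != right_columns:
        return False
    return Counter(left_rows) == Counter(right_rows)


columns = ("payment_type", "month", "percent")

# Two rows taken from the known-correct snapshot.
correct = (columns, [
    ("Cash", "2013-08", 0.6964044336),
    ("Credit card", "2013-08", 0.2977831846),
])

# The same rows as a job might calculate them, in a different order.
calculated = (columns, list(reversed(correct[1])))

# The deliberately broken query variant that adds 1 to every percentage.
broken = (columns, [(p, m, pct + 1) for p, m, pct in correct[1]])

print(datasets_equal(calculated, correct))  # True
print(datasets_equal(broken, correct))      # False
```

The second comparison fails because every `percent` value differs from the snapshot, which is the kind of regression the test-only stages are meant to catch before a job reaches production.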

In [None]:
{
  "type": "DeltaLakeLoad",
  "name": "write out green_tripdata0 dataset",
  "environments": ["production", "test"],
  "inputView": "green_tripdata0",
  "outputURI": "/home/jovyan/examples/tutorial/0/output/green_tripdata0.delta",
  "saveMode": "Overwrite",
  "partitionBy": [
    "vendor_id"
  ]
}