This repository includes a simple data project that tests the NYC Taxi dataset against 5 criteria of data quality and publishes the test results as an API endpoint that you can use in CI/CD pipelines for data QA.
First, clone the repo into a directory of your choice with:
git clone https://github.com/tinybirdco/data-quality-checks.git
cd data-quality-checks
Tinybird stores your auth details in a .tinyb file. Add this file to your .gitignore if you intend to push your work to GitHub.
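One way to do this from the repo root is sketched below. It assumes you want a plain `.tinyb` entry and appends it only if that exact line is not already present, so it is safe to run more than once.

```shell
# Append .tinyb to .gitignore unless that exact line already exists.
# grep -qxF: quiet, match the whole line, treat the pattern as a fixed string.
grep -qxF '.tinyb' .gitignore 2>/dev/null || echo '.tinyb' >> .gitignore
```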
If you want to follow along with the examples, go ahead and sign up for a free Tinybird account and create a Workspace.
Once you've done that, copy your user admin token from the Tinybird UI. It's the token labeled "Use it to authenticate with the CLI."
The easiest way to install the Tinybird CLI is with pip using a Python virtual environment. Run the following commands:
python3 -m venv .venv
source .venv/bin/activate
pip install tinybird-cli
Then go ahead and set your token as an environment variable for auth:
export TB_TOKEN=<paste your token here>
Authenticate to your Tinybird Workspace with:
tb auth
Now you can push the data project files from the repo to the Tinybird server.
Start by pushing the Data Source with:
tb push datasources/yellow_tripdata.datasource
This creates an empty Data Source server side with the correct schema.
Next, add data to the Data Source with:
tb datasource append yellow_tripdata https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet
Note: You might find that some rows go to quarantine. As we’ll see, this can be due to data quality issues in the source data! To learn more about data quarantine in Tinybird, see the Tinybird documentation. In the meantime, you can ignore this for the sake of following along.
The /endpoints folder contains a .pipe file. You can push this to the Tinybird server with:
tb push endpoints/yellow_trip_data_qa_measurements.pipe
By default, Tinybird publishes the last node of the Pipe as an API endpoint when you push with the CLI.
To create a token to read the API, you'll need to navigate to the Tinybird UI: https://ui.tinybird.co for EU or https://ui.us-east.tinybird.co for US-East.
Click "Auth Tokens", then "Add a new token", then "Add Pipe scope". Choose yellow_trip_data_qa_measurements and give it a Read scope.
Copy this token.
You can now call your API and get a JSON result with the following cURL:
curl --compressed -H 'Authorization: Bearer <TOKEN>' https://api.tinybird.co/v0/pipes/yellow_trip_data_qa_measurements.json
Note: For workspaces in US-EAST, use https://api.us-east.tinybird.co...
If you navigate back to the Tinybird UI, you can find additional sample usage for the endpoint on the API page.
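For the CI/CD use case mentioned at the start, the cURL call above could be wrapped in a small script along these lines. This is only a sketch: `API_HOST` and `TB_READ_TOKEN` are variable names chosen for this example, not something the repo or the Tinybird CLI defines, and the script checks only that the request succeeds. It does not inspect the returned rows, since the endpoint's columns are not described here.

```shell
#!/usr/bin/env bash
# Sketch of a CI step. API_HOST and TB_READ_TOKEN are hypothetical names
# chosen for this example; set them in your pipeline's environment.

# Build the endpoint URL for a given API host (api.tinybird.co for EU).
qa_url() {
  echo "https://$1/v0/pipes/yellow_trip_data_qa_measurements.json"
}

# Fetch the QA results as JSON. --fail makes curl exit non-zero on HTTP
# errors (for example a bad token), so the CI job can stop on failure.
fetch_qa() {
  curl --fail --silent --show-error --compressed \
    -H "Authorization: Bearer ${TB_READ_TOKEN}" \
    "$(qa_url "${API_HOST:-api.tinybird.co}")"
}

# In a pipeline you might then run:
#   fetch_qa > qa_results.json || exit 1
```

For a workspace in US-East, set `API_HOST=api.us-east.tinybird.co` before calling `fetch_qa`. Any pass/fail rule on the measurements themselves would need to be written against the columns your Pipe actually returns.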