Repo for the GCP-Tinybird workshop
For this first part of the workshop, the plan is to ingest dimensional data from BigQuery, historical data from Google Cloud Storage, and real-time data from Pub/Sub.
Then, combining the three sources, we will create a dynamic API Endpoint.
Go to https://ui.tinybird.co/signup to log in / sign up, and create a new Workspace.
Choose your region and go with the default empty Workspace; we will not be using any Starter Kits for this workshop.
Upload the products CSV file to BigQuery.
In your Tinybird Workspace, create a new BigQuery Data Source from that BigQuery table following the BigQuery Connector documentation.
Let's do this step first because the backfill may take a while.
First, let's create a Data Source from the small sample Parquet file we have. Just drag and drop the file into the Tinybird UI. You can adjust data types, the Sorting Key, and so on.
Now that the Data Source is created, we will ingest some larger Parquet files from GCS.
Copy your admin Token (or a new one with append rights to your newly created ecom_events Data Source), edit the backfill_gcs.sh script, and run it.
Note: for private files, or to ingest automatically whenever new files land in the bucket, you can follow the Ingest from GCS guide.
Follow the steps in the Ingest from Pub/Sub guide.
Note: do not use the sample script; use this one instead, editing lines 8 and 9 with your project ID and topic ID.
project_id = <YOUR_PROJECT>
topic_id = <YOUR_TOPIC_ID>
Note 2: do not create a Materialized View to decode the messages yet; we will do that at query time.
Let's create a Pipe with several nodes (there is a sketch of them right after this list):
- A first node to decode the messages from Pub/Sub. You'll need to use `base64()` and `JSONExtract` as shown in the example.
- A second node to filter only the sale events for long sleeve category products, querying the previous node where we decoded the Pub/Sub messages.
- A third node to apply the same filter to the historical data, keeping only today's sales.
- A fourth node to make a `union all` of nodes 2 and 3 and aggregate (a `count()` is fine) to get the number of sales.
- Finally, let's enrich the ranking to show the product name instead of the id, plus total_revenue (price * units sold).
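A minimal SQL sketch of those nodes, assuming the Pub/Sub Data Source is called ecom_events_pubsub with the raw payload in a `data` column, the GCS Data Source is ecom_events, the BigQuery one is products, and JSON keys such as product_id, event, category, price, and timestamp. All of these names are assumptions; adjust them to your Workspace:

```sql
-- Node 1 (decode_pubsub): decode the base64 Pub/Sub payload and extract fields.
-- Column and key names are assumptions; adjust them to your schema.
SELECT
    JSONExtractString(message, 'product_id') AS product_id,
    JSONExtractString(message, 'event') AS event,
    JSONExtractString(message, 'category') AS category,
    JSONExtractFloat(message, 'price') AS price,
    parseDateTimeBestEffort(JSONExtractString(message, 'timestamp')) AS ts
FROM (SELECT base64Decode(data) AS message FROM ecom_events_pubsub)

-- Node 2 (sales_rt): only sale events for the long sleeve category.
SELECT product_id, price, ts
FROM decode_pubsub
WHERE event = 'sale' AND category = 'long sleeve'

-- Node 3 (sales_historical): same filter over the GCS data, today only.
SELECT product_id, price, ts
FROM ecom_events
WHERE event = 'sale' AND category = 'long sleeve' AND toDate(ts) = today()

-- Node 4 (sales_count): union nodes 2 and 3 and count the sales.
SELECT count() AS sales
FROM (SELECT * FROM sales_rt UNION ALL SELECT * FROM sales_historical)

-- Node 5 (ranking): enrich with the product name and revenue, assuming each
-- sale row is one unit sold, so sum(price) approximates price * units sold.
SELECT p.name AS product_name, count() AS units_sold, sum(s.price) AS total_revenue
FROM (SELECT * FROM sales_rt UNION ALL SELECT * FROM sales_historical) AS s
JOIN products AS p ON p.id = s.product_id
GROUP BY product_name
ORDER BY total_revenue DESC
```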
And let's create an API Endpoint from there.
Make the endpoint accept query params with the templating language. Check the syntax here.
For example, let's make the `category` and `event` types dynamic, and let's document them so our frontend colleagues know what they can pass.
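For instance, a node along these lines (a sketch; the parameter names, defaults, and descriptions are assumptions) turns both filters into documented query parameters:

```sql
%
SELECT product_id, price, ts
FROM decode_pubsub
WHERE event = {{ String(event_type, 'sale', description='Event type to filter: sale, view or cart') }}
  AND category = {{ String(category, 'long sleeve', description='Product category to filter by') }}
```

Calling the endpoint with, say, ?event_type=view&category=shorts would then override the defaults.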
With Materialized Views we can take a Pipe and persist its result in a Data Source. Choose the Sorting Key and data types wisely. Recommended reads after the workshop: Best Practices for faster SQL and Thinking in Tinybird.
Compare processed data (using Service Data Sources like `tinybird.pipe_stats_rt`) to see the difference between querying the MV and having to decode at query time.
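Something like the following (a sketch; column names are taken from the Service Data Sources documentation, double-check them there) compares processed bytes and rows per Pipe over the last hour:

```sql
SELECT
    pipe_name,
    count() AS requests,
    sum(read_bytes) AS total_read_bytes,
    sum(read_rows) AS total_read_rows,
    avg(duration) AS avg_duration_s
FROM tinybird.pipe_stats_rt
WHERE start_datetime > now() - INTERVAL 1 HOUR
GROUP BY pipe_name
ORDER BY total_read_bytes DESC
```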
AggregatingMergeTrees 101. Check this guide to learn about the `State` and `Merge` modifiers.
Note that if you create the MV from the UI, Tinybird will add the `State` modifier for you, but you will still need to use `Merge` and group by at query time.
Create a MV that aggregates the sales, views, or carts per product and hour/day (tip: `toStartOfHour()` and `toDate()` are your allies here).
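As a sketch (the Data Source and column names are assumptions), the materializing node and the query-time read could look like this:

```sql
-- Materializing node: writes State aggregates per product and hour into an
-- AggregatingMergeTree Data Source, e.g. sales_per_product_hour_mv (assumed name).
SELECT
    product_id,
    toStartOfHour(ts) AS hour,
    countState() AS sales,
    sumState(price) AS revenue
FROM ecom_events
WHERE event = 'sale'
GROUP BY product_id, hour

-- Query time: Merge plus GROUP BY finishes the aggregation.
SELECT
    product_id,
    hour,
    countMerge(sales) AS sales,
    sumMerge(revenue) AS revenue
FROM sales_per_product_hour_mv
GROUP BY product_id, hour
ORDER BY hour DESC
```

Swap `toStartOfHour(ts)` for `toDate(ts)` to get the daily version.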
Compare the same queries against the raw data and against the aggregated MV.
You have already seen in the docs some resources (Data Sources and Pipes) in text format; let's download the CLI and start working with it.
tb auth              # authenticate the CLI with your Workspace admin Token
tb init              # initialize the data project folder structure
tb workspace current # check which Workspace you are working on
tb pull --auto       # pull the Workspace resources into the project folders
Edit a Pipe that ends in an endpoint and send it back to the Workspace with `tb push`.
Go to the branch called chart-branch and copy its content (/pipes, /datasources, and /endpoints) into your data project.
git checkout chart-branch
cp -r ./data-project/pipes ./pipes
cp -r ./data-project/datasources ./datasources
cp -r ./data-project/endpoints ./endpoints
Push the resources.
tb push pipes/events_by_*.pipe --push-deps --populate
tb push endpoints/api_*.pipe
Get your dashboard token, go to this webpage, and paste it in the Token input. You can start playing with the filters, hours...
Note that we are assuming the GCS Data Source is called prods, and some data types may not match. To see the demo fully working, you can check this repo.
- Apigee
- Kafka connector
- Snowflake connector, very similar to BQ.
- Tokens
- Multitenancy, sharing data sources between workspaces.
- Copy Pipes
- Time Series
- Visualizing in Grafana or sending data to Datadog using vector.dev