Create or update Google Cloud Data Catalog tags on BigQuery tables with Cloud Dataprep Metadata and Column's Profile via a Cloud Function.
The 2 Data Catalog tags created or updated are:
- Dataprep Job Metadata tag, attached to the BigQuery table and containing information from the Dataprep job used to create or update the BigQuery table: the user, Dataprep Job (id, name, url, timestamp), Dataprep Dataset (id, name, url), Dataprep Flow (id, name, url), Job Profile (url and number of valid, invalid and empty values) and the Dataflow job (id, url).
- Dataprep Job Column's Profile tag, attached to all BigQuery table columns and containing the number of valid, invalid and empty values for each column.
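To make the per-column tag concrete, here is a minimal sketch of how profile counts could be shaped into one tag body per column, using the JSON shape of a Data Catalog tag in the REST API (`template`, `column`, `fields`). The function name, the field IDs (`valid_values`, `invalid_values`, `empty_values`), the template name and the sample counts are all hypothetical; the repository's own code and templates may differ.

```python
from typing import Dict, List


def build_column_profile_tags(
    template_name: str,
    profile: Dict[str, Dict[str, int]],
) -> List[dict]:
    """Build one Data Catalog tag body (REST JSON shape) per column."""
    tags = []
    for column, counts in profile.items():
        tags.append({
            "template": template_name,
            # Setting "column" attaches the tag to that column, not the table.
            "column": column,
            "fields": {
                # Hypothetical field IDs, chosen for illustration only.
                "valid_values": {"doubleValue": counts.get("valid", 0)},
                "invalid_values": {"doubleValue": counts.get("invalid", 0)},
                "empty_values": {"doubleValue": counts.get("empty", 0)},
            },
        })
    return tags


# Made-up counts for two columns of an imaginary table.
example_profile = {
    "customer_id": {"valid": 980, "invalid": 5, "empty": 15},
    "signup_date": {"valid": 1000, "invalid": 0, "empty": 0},
}
column_tags = build_column_profile_tags(
    "projects/my-project/locations/us/tagTemplates/dataprep_column_profile",
    example_profile,
)
```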
To enable, learn about and use Cloud Data Catalog, go to https://cloud.google.com/data-catalog and https://console.cloud.google.com/datacatalog.
This repository contains the Cloud Function Python code triggered from a Dataprep Webhook to create or update 2 Data Catalog tags.
This Cloud Function uses:
Your Cloud Function needs these 5 files:
- main.py
- config.py, where you need to set your GCP project name (where the Tag Templates are created) and the Dataprep Access Token (used to call the Dataprep API). You can also change the 2 tag template IDs if needed.
- datacatalog_functions.py to get or update Data Catalog objects
- dataprep_metadata.py to get Cloud Dataprep metadata
- requirements.txt
Before running the Cloud Function (and creating or updating tags), you need to create the 2 Data Catalog Tag Templates for Dataprep (Job Metadata and Job Column Profile). You can use:
- Cloud Console, where you can manage your Tag Templates
- gcloud and the command gcloud data-catalog tag-templates create; the full command lines are in gcloud_tag-templates_create.sh, more details with an example and reference. Be aware that with the gcloud command line you cannot control the order of the tag template fields: they will be in alphabetical order.
- REST API with the 2 tag template JSON files dataprep_metadata_tag_template.json and dataprep_column_profile_tag_template.json, more details with an example and reference.
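As an illustration of the REST API option, the sketch below builds the URL and JSON body for a Data Catalog tagTemplates create call and gives each field an explicit order value, which is what the gcloud route does not let you control. The function name, field IDs, project and location are hypothetical; the real field definitions are the ones in the two JSON files of this repository. The API describes order loosely (a higher value can indicate a more important field), so assigning descending values here is an assumption about the display you want.

```python
import json
from typing import List, Tuple

API_ROOT = "https://datacatalog.googleapis.com/v1"


def build_tag_template_request(
    project: str,
    location: str,
    template_id: str,
    display_name: str,
    fields: List[Tuple[str, str, str]],
) -> Tuple[str, str]:
    """Return (url, json_body) for a tagTemplates create call.

    fields is a list of (field_id, display_name, primitive_type) tuples,
    given in the order they should appear.
    """
    total = len(fields)
    body = {
        "displayName": display_name,
        "fields": {
            field_id: {
                "displayName": label,
                "type": {"primitiveType": primitive_type},
                # Descending values: the first field gets the highest order.
                "order": total - index,
            }
            for index, (field_id, label, primitive_type) in enumerate(fields)
        },
    }
    url = (
        f"{API_ROOT}/projects/{project}/locations/{location}"
        f"/tagTemplates?tagTemplateId={template_id}"
    )
    return url, json.dumps(body)


# Hypothetical template with three fields, for illustration only.
url, payload = build_tag_template_request(
    "my-project",
    "us-central1",
    "dataprep_column_profile",
    "Dataprep Job Column Profile",
    [
        ("valid_values", "Valid values", "DOUBLE"),
        ("invalid_values", "Invalid values", "DOUBLE"),
        ("empty_values", "Empty values", "DOUBLE"),
    ],
)
```

The resulting body would be sent as an authenticated POST to the URL; authentication is left out of this sketch.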
To use the Cloud Function, you just have to pass the Dataprep Job ID in JSON format, like {"job_id":"7827359"}.
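For a manual call outside Dataprep, a minimal sketch of the request could look like this. It only builds the POST request with the standard library and does not send it. The endpoint URL is a placeholder, the helper name is hypothetical, and a function that requires authentication would also need an identity token header, which is not shown.

```python
import json
import urllib.request


def build_job_request(function_url: str, job_id: str) -> urllib.request.Request:
    """Build a POST request carrying the Dataprep job ID as JSON."""
    body = json.dumps({"job_id": str(job_id)}).encode("utf-8")
    return urllib.request.Request(
        function_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder endpoint: replace with your own Cloud Function URL.
request = build_job_request(
    "https://REGION-PROJECT.cloudfunctions.net/FUNCTION_NAME",
    "7827359",
)
# urllib.request.urlopen(request) would actually send it.
```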
To trigger it from a Cloud Dataprep flow, you can use a Webhook on the Cloud Function endpoint with {"job_id":"$jobId"} in the POST body.
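On the receiving side, it can help to validate the body before calling the Dataprep API, for example to catch a webhook where the $jobId variable was not substituted. The helper below is a hypothetical sketch, not the code in main.py, and it assumes job IDs are numeric as in the example above.

```python
from typing import Any, Mapping, Optional


def extract_job_id(payload: Optional[Mapping[str, Any]]) -> str:
    """Return the Dataprep job ID from a parsed JSON body, or raise ValueError."""
    if not payload or "job_id" not in payload:
        raise ValueError('expected a JSON body like {"job_id": "7827359"}')
    job_id = str(payload["job_id"]).strip()
    # Assumption: job IDs are numeric. A literal "$jobId" fails this check.
    if not job_id.isdigit():
        raise ValueError(f"job_id is not numeric: {job_id!r}")
    return job_id
```

In an HTTP-triggered Python Cloud Function, the payload would typically come from request.get_json(silent=True) on the Flask request object passed to the entry point.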
Once the Data Catalog tag templates are created and tags are created or updated on BigQuery tables, you can find all the results at https://console.cloud.google.com/datacatalog.
Finally, you can also search for BigQuery tables that have a Dataprep tag in Cloud Data Catalog from your own application, for example https://github.com/victorcouste/dataprep-datacatalog-explorer
Happy wrangling and tagging !