
Create an ETL data pipeline on Google Cloud Platform

Follow the steps laid out in the Medium story & clone the repository.

Useful BigQuery Python commands

Write to a BigQuery table

df.to_gbq('dataset.table_name', if_exists='replace')  # to_gbq is a DataFrame method; if_exists accepts 'fail', 'replace', or 'append'
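For example, a small frame can be written to a dataset like this (the dataset and table names below are placeholders, not objects from the repository):

import pandas as pd

df = pd.DataFrame({'date': ['2020-07-25'], 'cases': [1234]})
# project_id matches the PROJECT_ID used in the shell setup below
df.to_gbq('covid.daily_cases', project_id='covid-jul25', if_exists='append')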

Read from a BigQuery table using legacy syntax

pd.read_gbq(sql, dialect='legacy')
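A legacy-SQL read might look like the following; legacy syntax wraps the fully qualified table name in square brackets (again, the table name is a placeholder):

import pandas as pd

sql = 'SELECT date, cases FROM [covid-jul25:covid.daily_cases] LIMIT 10'
df = pd.read_gbq(sql, project_id='covid-jul25', dialect='legacy')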

Run queries on BigQuery directly from Jupyter

from google.cloud import bigquery
bigquery_client = bigquery.Client()
query_job = bigquery_client.query("""[SQL CODE]""")
results = query_job.result()
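If the result is needed in pandas, the rows can also be converted directly (supported by the google-cloud-bigquery client):

df = query_job.to_dataframe()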

Useful Google Cloud Shell commands

Set up working environment

export PROJECT_ID='covid-jul25'
gcloud config set project $PROJECT_ID
export REGION=us-west3
export ZONE=us-west3-a
export BUCKET_LINK=gs://us-west3-{BUCKET_NAME}
export BUCKET=us-west3-{BUCKET_NAME}
export TEMPLATE_ID=daily_update_template
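If the staging bucket referenced by $BUCKET_LINK does not exist yet, a sketch of creating it and uploading the files used later in the template (daily_update.py and pip-install.sh) could look like this:

gsutil mb -l $REGION $BUCKET_LINK
gsutil cp daily_update.py pip-install.sh $BUCKET_LINK/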

Name the cluster & create a template

export cluster_name=covid-cluster
gcloud dataproc workflow-templates create \
$TEMPLATE_ID --region $REGION

Delete existing workflow templates

gcloud dataproc workflow-templates delete {TEMPLATE_NAME} --region=us-west3

Attach a managed cluster + pandas to the template

gcloud dataproc workflow-templates set-managed-cluster \
$TEMPLATE_ID \
--region $REGION \
--zone $ZONE \
--cluster-name $cluster_name \
--optional-components=ANACONDA \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 20 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 20 \
--num-workers 2 \
--image-version 1.4 \
--metadata='PIP_PACKAGES=pandas google.cloud pandas-gbq' \
--initialization-actions gs://us-west3-{BUCKET_NAME}/pip-install.sh
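pip-install.sh is the initialization action that installs the PIP_PACKAGES listed in the metadata on every cluster node. The repository's script is not reproduced here; a hypothetical sketch of such a script, assuming it reads the PIP_PACKAGES metadata key, is:

#!/bin/bash
# Read the space-separated package list from cluster metadata and install it.
set -euxo pipefail
PACKAGES="$(/usr/share/google/get_metadata_value attributes/PIP_PACKAGES || true)"
if [[ -n "${PACKAGES}" ]]; then
  pip install ${PACKAGES}
fi

The official Dataproc initialization-actions repository also ships a ready-made python/pip-install.sh that can be copied into the bucket instead.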

Add a Python-based PySpark job

export STEP_ID=arima_update
gcloud dataproc workflow-templates add-job pyspark \
$BUCKET_LINK/daily_update.py \
--step-id $STEP_ID \
--workflow-template $TEMPLATE_ID \
--region $REGION
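daily_update.py is the PySpark entry point stored in the bucket; its actual contents (the ARIMA update) live in the repository. A minimal, hypothetical skeleton of such a job, assuming pandas and pandas-gbq were installed by pip-install.sh and using placeholder table names, might look like:

# daily_update.py (hypothetical skeleton)
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('daily_update').getOrCreate()

# Pull the source table, recompute the forecast, and write the results back.
source = pd.read_gbq('SELECT * FROM [covid-jul25:covid.daily_cases]', dialect='legacy')
# ... ARIMA fitting / forecasting on `source` would go here ...
source.to_gbq('covid.daily_forecast', project_id='covid-jul25', if_exists='replace')

spark.stop()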

See jobs in template

gcloud dataproc workflow-templates list --region $REGION

Instantiate & time the workflow

export REGION=us-east4
export TEMPLATE_ID=daily_update
time gcloud dataproc workflow-templates instantiate \
$TEMPLATE_ID --region $REGION #--async
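Uncommenting the --async flag makes the command return as soon as the workflow is submitted, instead of waiting for (and timing) the run to completion.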
