
Create an ETL data pipeline on Google Cloud Platform

Follow the steps laid out in the Medium story & clone the repository.

Useful BigQuery Python commands

Write to a BigQuery table

df.to_gbq('dataset.table_name', if_exists='replace')  # to_gbq is a DataFrame method; if_exists accepts 'fail', 'replace', or 'append'
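For example, a small frame can be written to a dataset like this (the dataset and table names below are placeholders, not objects from the repository):

import pandas as pd

df = pd.DataFrame({'date': ['2020-07-25'], 'cases': [1234]})
# project_id matches the PROJECT_ID used in the shell setup below
df.to_gbq('covid.daily_cases', project_id='covid-jul25', if_exists='append')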

Read from a BigQuery table using legacy syntax

pd.read_gbq(sql, dialect='legacy')
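A legacy-SQL read might look like the following; legacy syntax wraps the fully qualified table name in square brackets (again, the table name is a placeholder):

import pandas as pd

sql = 'SELECT date, cases FROM [covid-jul25:covid.daily_cases] LIMIT 10'
df = pd.read_gbq(sql, project_id='covid-jul25', dialect='legacy')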

Run queries on BigQuery directly from Jupyter

from google.cloud import bigquery
bigquery_client = bigquery.Client()
query_job = bigquery_client.query("""[SQL CODE]""")
results = query_job.result()
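If the result is needed in pandas, the rows can also be converted directly (supported by the google-cloud-bigquery client):

df = query_job.to_dataframe()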

Useful Google Cloud Shell commands

Set up working environment

export PROJECT_ID='covid-jul25'
gcloud config set project $PROJECT_ID
export REGION=us-west3
export ZONE=us-west3-a
export BUCKET_LINK=gs://us-west3-{BUCKET_NAME}
export BUCKET=us-west3-{BUCKET_NAME}
export TEMPLATE_ID=daily_update_template
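If the staging bucket referenced by $BUCKET_LINK does not exist yet, a sketch of creating it and uploading the files used later in the template (daily_update.py and pip-install.sh) could look like this:

gsutil mb -l $REGION $BUCKET_LINK
gsutil cp daily_update.py pip-install.sh $BUCKET_LINK/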

Name the cluster & create a template

export cluster_name=covid-cluster
gcloud dataproc workflow-templates create \
$TEMPLATE_ID --region $REGION

Delete existing workflow templates

gcloud dataproc workflow-templates delete {TEMPLATE_NAME} --region=us-west3

Attach a managed cluster + pandas to the template

gcloud dataproc workflow-templates set-managed-cluster \
$TEMPLATE_ID \
--region $REGION \
--zone $ZONE \
--cluster-name $cluster_name \
--optional-components=ANACONDA \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 20 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 20 \
--num-workers 2 \
--image-version 1.4 \
--metadata='PIP_PACKAGES=pandas google.cloud pandas-gbq' \
--initialization-actions gs://us-west3-{BUCKET_NAME}/pip-install.sh
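pip-install.sh is the initialization action that installs the PIP_PACKAGES listed in the metadata on every cluster node. The repository's script is not reproduced here; a hypothetical sketch of such a script, assuming it reads the PIP_PACKAGES metadata key, is:

#!/bin/bash
# Read the space-separated package list from cluster metadata and install it.
set -euxo pipefail
PACKAGES="$(/usr/share/google/get_metadata_value attributes/PIP_PACKAGES || true)"
if [[ -n "${PACKAGES}" ]]; then
  pip install ${PACKAGES}
fi

The official Dataproc initialization-actions repository also ships a ready-made python/pip-install.sh that can be copied into the bucket instead.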

Add a Python-based PySpark job

export STEP_ID=arima_update
gcloud dataproc workflow-templates add-job pyspark \
$BUCKET_LINK/daily_update.py \
--step-id $STEP_ID \
--workflow-template $TEMPLATE_ID \
--region $REGION
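daily_update.py is the PySpark entry point stored in the bucket; its actual contents (the ARIMA update) live in the repository. A minimal, hypothetical skeleton of such a job, assuming pandas and pandas-gbq were installed by pip-install.sh and using placeholder table names, might look like:

# daily_update.py (hypothetical skeleton)
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('daily_update').getOrCreate()

# Pull the source table, recompute the forecast, and write the results back.
source = pd.read_gbq('SELECT * FROM [covid-jul25:covid.daily_cases]', dialect='legacy')
# ... ARIMA fitting / forecasting on `source` would go here ...
source.to_gbq('covid.daily_forecast', project_id='covid-jul25', if_exists='replace')

spark.stop()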

See jobs in template

gcloud dataproc workflow-templates list --region $REGION

Instantiate & time the workflow

export REGION=us-east4
export TEMPLATE_ID=daily_update
time gcloud dataproc workflow-templates instantiate \
$TEMPLATE_ID --region $REGION #--async
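Uncommenting the --async flag makes the command return as soon as the workflow is submitted, instead of waiting for (and timing) the run to completion.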
