Scheduling Big Data Workloads and Data Pipelines in the Cloud with pyDag
Updated Sep 19, 2022 · Python
Run Jupyter Notebooks (and store data) on Google Cloud Platform.
An educational project that builds an end-to-end pipeline for near real-time and batch processing of data, later used for visualisation and a machine learning model.
Data Workflows with GCP Dataproc, Apache Airflow and Apache Spark
A PySpark job that runs on a Dataproc cluster and loads data from Cloud Storage into a BigQuery table.
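A job like this is typically a small PySpark script using the spark-bigquery connector. Below is a minimal sketch, not the listed repository's actual code: the function name, CSV input format, and bucket/table parameters are illustrative assumptions.

```python
def load_gcs_to_bigquery(gcs_path, bq_table, temp_bucket):
    """Read a CSV from Cloud Storage and append it to a BigQuery table.

    Hypothetical sketch: assumes the cluster has the spark-bigquery
    connector available (it ships with recent Dataproc images).
    """
    # Lazy import so the module can be imported/inspected without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gcs-to-bq").getOrCreate()

    # Load the source data from Cloud Storage, e.g. "gs://my-bucket/data.csv".
    df = spark.read.option("header", True).csv(gcs_path)

    # Write to BigQuery via the connector; it stages data in a temporary
    # GCS bucket before loading it into the target table.
    (df.write.format("bigquery")
       .option("table", bq_table)          # e.g. "my_dataset.my_table"
       .option("temporaryGcsBucket", temp_bucket)
       .mode("append")
       .save())
```

On Dataproc this script would be submitted with `gcloud dataproc jobs submit pyspark`; the cluster's service account needs read access to the bucket and write access to the BigQuery dataset.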