PyDataflow Template

PyDataflow Template is an ETL tool built on Cloud Dataflow. It is implemented as a Dataflow Flex Template.

By defining a JSON configuration file, you can execute various pipelines without writing any code. The pipeline is assembled from the configuration file and run as a Cloud Dataflow or Direct Runner job.

This project is implemented in Python as an alternative to the Java-based DataflowTemplate project, which you can refer to for comparison. (The number of available modules is still limited.)

The three module types, sources, transforms, and sinks, correspond to the Extract, Transform, and Load stages, respectively. Each module is a class that implements an Apache Beam PTransform, and modules can be combined to construct flexible ETL pipelines.
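
For illustration, a module is essentially a thin, config-driven wrapper around a PTransform. The sketch below is a minimal, hypothetical example; the class name, parameters, and interface are assumptions and do not come from this project's code base.

import apache_beam as beam


class FilterTransform(beam.PTransform):
    """Illustrative transform module: keep rows whose `field` equals `value`."""

    def __init__(self, field, value):
        super().__init__()
        self.field = field
        self.value = value

    def expand(self, pcoll):
        # A module hides ordinary Beam primitives behind a configuration-driven API.
        return pcoll | beam.Filter(lambda row: row.get(self.field) == self.value)

In a pipeline assembled from a configuration file, such a class would receive its "parameters" block as constructor arguments and be applied between a source and a sink.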

Usage Example

The following configuration file saves the result of a MySQL query to a specified BigQuery table.

{
  "name": "mysql-to-bigquery",
  "description": "Sample data load from MySQL to BigQuery.",
  "sources": [
    {
      "name": "mysqlInput",
      "module": "mysql",
      "parameters": {
        "query": "select * from test_db.test;",
        "profile": "test_mysql"
      }
    }
  ],
  "sinks": [
    {
      "name": "bigqueryOutput",
      "module": "bigquery",
      "input": "mysqlInput",
      "parameters": {
        "table": "py-dataflow:test.mysql_to_bigquery_sample_output",
        "create_disposition": "CREATE_IF_NEEDED"
      }
    }
  ]
}
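
For reference, the pipeline assembled from this configuration is conceptually similar to the hand-written Beam pipeline sketched below. ReadFromJdbc from the Beam Python SDK is used only as a stand-in for the project's MySQL source module, and the JDBC URL, credentials, and driver are placeholders (in the real pipeline they come from the "test_mysql" profile).

import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    options = PipelineOptions()  # runner, project, region, etc. come from the command line
    with beam.Pipeline(options=options) as p:
        rows = p | "mysqlInput" >> ReadFromJdbc(
            table_name="test",
            driver_class_name="com.mysql.cj.jdbc.Driver",
            jdbc_url="jdbc:mysql://localhost:3306/test_db",
            username="user",          # placeholder credentials
            password="password",
            query="select * from test_db.test",
        )
        (
            rows
            # ReadFromJdbc yields NamedTuple rows; WriteToBigQuery expects dicts.
            | beam.Map(lambda row: row._asdict())
            | "bigqueryOutput" >> beam.io.WriteToBigQuery(
                table="py-dataflow:test.mysql_to_bigquery_sample_output",
                # A schema would also be needed if the destination table does not exist yet.
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    run()

The configuration file removes the need to write and maintain code like this: the same pipeline is described declaratively and assembled at runtime.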

To run a pipeline with the configuration file you created, execute the following command:

make run_workflow config=path/to/config.json

Dataflow jobs are started via Workflows. You can check the execution status of the jobs in the Google Cloud console.

For more details, please refer to the documentation.
