This repo contains a sample Dataflow pipeline that parallelizes a scikit-learn preprocessing function over a BigQuery table.
Install the Python requirements and run the sample pipeline:

```
pip install -r requirements.txt
python sample_pipeline.py
```
To adapt this pipeline to your use case, here are some high-level steps:
- Modify parameters: all necessary parameters are defined as constants at the top of the `sample_pipeline.py` file.
- Enable Dataflow in your GCP project: the Dataflow API has to be enabled before you can start running jobs. For more info, check out the quickstart.
- Test run: do a test run of the current pipeline (without any significant changes to the code) to make sure all the settings are correct.
- Modify pipeline: once you're sure that your environment works, you can adapt the pipeline to your own use case (see the sketch below).
You can see the original input and output tables in the mle-exam GCP project.
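For orientation, the pipeline follows the standard Beam pattern: read rows from BigQuery, apply the preprocessing function element-wise, and write the results back to BigQuery. The sketch below illustrates that pattern only; it is not the exact contents of `sample_pipeline.py`, and the project ID, table names, region, and bucket are placeholders you would replace with the constants defined at the top of the file.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values; the real constants live at the top of sample_pipeline.py.
PROJECT = "your-gcp-project"
INPUT_TABLE = f"{PROJECT}:your_dataset.input_table"
OUTPUT_TABLE = f"{PROJECT}:your_dataset.output_table"


def preprocess(row):
    """Apply the scikit-learn preprocessing to a single BigQuery row dict."""
    # ... transform the row here ...
    return row


def run():
    options = PipelineOptions(
        runner="DataflowRunner",
        project=PROJECT,
        region="us-central1",
        temp_location=f"gs://{PROJECT}-temp/dataflow",
    )
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadFromBQ" >> beam.io.ReadFromBigQuery(table=INPUT_TABLE)
            | "Preprocess" >> beam.Map(preprocess)
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                OUTPUT_TABLE,
                # Assumes the output table already exists; otherwise pass a schema
                # and CREATE_IF_NEEDED instead.
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()
```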
Instead of initializing and fitting the One-Hot Encoder every time the preprocessing function runs:

```python
enc = preprocessing.OneHotEncoder()
enc.fit(BASE_SPECIES)
```
it would ideally be better to do the "training" once somewhere else, save the fitted encoder, and pass the saved model to the pipeline for parallelization. In this sample use case the shortcut is acceptable, since the OneHotEncoder fits very quickly.
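If the preprocessing model were more expensive to fit, one option is to fit and save it up front, then have each Dataflow worker load the saved artifact once (for example in a `DoFn.setup` method) instead of refitting per element. The sketch below is not part of this repo: the artifact path, the `PreprocessDoFn` class, and the `species` column name are placeholders, and `BASE_SPECIES` stands in for the constant defined in `sample_pipeline.py`.

```python
import apache_beam as beam
import joblib
from apache_beam.io.filesystems import FileSystems
from sklearn import preprocessing

# Placeholder for the constant defined in sample_pipeline.py.
BASE_SPECIES = [["setosa"], ["versicolor"], ["virginica"]]


def fit_and_save_encoder(path):
    """One-off 'training' step, run locally before launching the Dataflow job."""
    enc = preprocessing.OneHotEncoder()
    enc.fit(BASE_SPECIES)
    joblib.dump(enc, path)  # upload the file to GCS so workers can fetch it


class PreprocessDoFn(beam.DoFn):
    """Hypothetical DoFn that loads the pre-fitted encoder once per worker."""

    def __init__(self, encoder_path):
        self._encoder_path = encoder_path  # local path or gs:// URI
        self._encoder = None

    def setup(self):
        # Runs once per worker; FileSystems.open handles gs:// paths, so the
        # encoder is read from GCS instead of being re-fitted for every element.
        with FileSystems.open(self._encoder_path) as f:
            self._encoder = joblib.load(f)

    def process(self, row):
        one_hot = self._encoder.transform([[row["species"]]]).toarray()[0]
        row["species_onehot"] = one_hot.tolist()
        yield row
```

In the pipeline, the `Preprocess` step would then become `beam.ParDo(PreprocessDoFn("gs://your-bucket/onehot_encoder.joblib"))`, so the fitted encoder is shared by all elements a worker processes.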