This repo contains a sample Dataflow pipeline that parallelizes a scikit-learn preprocessing function over a BigQuery table.
Install the Python requirements and run the sample pipeline:

```
pip install -r requirements.txt
python sample_pipeline.py
```
To adapt this pipeline to your use case, here are some high-level steps:
- Modify parameters: all necessary parameters are defined as constants at the top of the `sample_pipeline.py` file.
- Enable Dataflow in your GCP project: the Dataflow API has to be enabled before you can start running jobs. For more info, check out the quickstart.
- Test run: do a test run of the current pipeline (without any significant changes to the code) to make sure all the settings are correct.
- Modify pipeline: once you're sure that your environment works, you can adapt the pipeline to your own use case (see the sketch below).
You can see the original input and output tables in the mle-exam GCP project.
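For orientation, the pipeline follows the standard Beam pattern: read rows from BigQuery, apply the preprocessing function element-wise, and write the results back to BigQuery. The sketch below illustrates that pattern only; it is not the exact contents of `sample_pipeline.py`, and the project ID, table names, region, and bucket are placeholders you would replace with the constants defined at the top of the file.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values; the real constants live at the top of sample_pipeline.py.
PROJECT = "your-gcp-project"
INPUT_TABLE = f"{PROJECT}:your_dataset.input_table"
OUTPUT_TABLE = f"{PROJECT}:your_dataset.output_table"


def preprocess(row):
    """Apply the scikit-learn preprocessing to a single BigQuery row dict."""
    # ... transform the row here ...
    return row


def run():
    options = PipelineOptions(
        runner="DataflowRunner",
        project=PROJECT,
        region="us-central1",
        temp_location=f"gs://{PROJECT}-temp/dataflow",
    )
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadFromBQ" >> beam.io.ReadFromBigQuery(table=INPUT_TABLE)
            | "Preprocess" >> beam.Map(preprocess)
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                OUTPUT_TABLE,
                # Assumes the output table already exists; otherwise pass a schema
                # and CREATE_IF_NEEDED instead.
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()
```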
Instead of initializing and fitting the One-Hot Encoder every time the preprocessing function runs:

```python
enc = preprocessing.OneHotEncoder()
enc.fit(BASE_SPECIES)
```
it would ideally be better to do the "training" once somewhere else, save the fitted encoder, and pass the saved model to the pipeline for parallelization. In this sample use case the shortcut is acceptable, since the OneHotEncoder fits very quickly.
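If the preprocessing model were more expensive to fit, one option is to fit and save it up front, then have each Dataflow worker load the saved artifact once (for example in a `DoFn.setup` method) instead of refitting per element. The sketch below is not part of this repo: the artifact path, the `PreprocessDoFn` class, and the `species` column name are placeholders, and `BASE_SPECIES` stands in for the constant defined in `sample_pipeline.py`.

```python
import apache_beam as beam
import joblib
from apache_beam.io.filesystems import FileSystems
from sklearn import preprocessing

# Placeholder for the constant defined in sample_pipeline.py.
BASE_SPECIES = [["setosa"], ["versicolor"], ["virginica"]]


def fit_and_save_encoder(path):
    """One-off 'training' step, run locally before launching the Dataflow job."""
    enc = preprocessing.OneHotEncoder()
    enc.fit(BASE_SPECIES)
    joblib.dump(enc, path)  # upload the file to GCS so workers can fetch it


class PreprocessDoFn(beam.DoFn):
    """Hypothetical DoFn that loads the pre-fitted encoder once per worker."""

    def __init__(self, encoder_path):
        self._encoder_path = encoder_path  # local path or gs:// URI
        self._encoder = None

    def setup(self):
        # Runs once per worker; FileSystems.open handles gs:// paths, so the
        # encoder is read from GCS instead of being re-fitted for every element.
        with FileSystems.open(self._encoder_path) as f:
            self._encoder = joblib.load(f)

    def process(self, row):
        one_hot = self._encoder.transform([[row["species"]]]).toarray()[0]
        row["species_onehot"] = one_hot.tolist()
        yield row
```

In the pipeline, the `Preprocess` step would then become `beam.ParDo(PreprocessDoFn("gs://your-bucket/onehot_encoder.joblib"))`, so the fitted encoder is shared by all elements a worker processes.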