Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/pred2bq bulk update #230

Open
wants to merge 23 commits into
base: main
Choose a base branch
from

Conversation

cfezequiel
Copy link
Contributor

@cfezequiel cfezequiel commented Apr 1, 2023

Fixes #78

Includes changes in PR #225 .

This pull request contains bulk updates to the Predictions to BigQuery component.

Changes:

  • Adds integration test (executor-only, local run and Vertex AI run) and test data.
  • Adds component readme.
  • Updates component code to pass the tests.

Checks:

  • [ X ] Tests pass
  • [ X ] Appropriate changes to README are included in PR

@github-actions
Copy link
Contributor

github-actions bot commented Apr 1, 2023

Thanks for the PR! 🚀

Instructions: Approve using /lgtm and mark for automatic merge by using /merge.

@cfezequiel
Copy link
Contributor Author

@michaelwsherman fyi

tfx_addons/version.py Outdated Show resolved Hide resolved
tfx_addons/predictions_to_bigquery/README.md Outdated Show resolved Hide resolved
tfx_addons/predictions_to_bigquery/README.md Show resolved Hide resolved
tfx_addons/predictions_to_bigquery/README.md Show resolved Hide resolved
schema=schema_gen.outputs['schema']
transform_graph=transform.outputs['transform_graph'],
bq_table_name='my_bigquery_table',
gcs_temp_dir='gs://bucket/temp-dir',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this just use the temp_dir from beam_pipeline_args instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I just realized that the WriteToBigQuery transform used by the Beam pipeline defaults to temp_location if custom_gcs_temp_location is not specified, and temp_location should be an argument supported by beam_pipeline_args. Although I'm not quite sure if temp_location needs to be explicitly set or if Beam will create one by default otherwise.
I think this would be a more involved change as the Vertex integration test needs a custom container having the pred2bq component in Artifact Registry to run. Perhaps we have it as a separate PR instead?

- Adds unit tests
- Also adds credits to original code author
Adds a test that runs the executor module's Beam pipeline using
a DirectRunner and exports prediction data to an actual BigQuery table.
Changes:
- Refactors the component integrate test and adds a test to run the
  component on Vertex AI Pipelines.

- Adds a Dockerfile to package the component code into
  a Docker image based on OSS TFX image. The image can then be used
  as the base image when running a pipeline in Vertex AI.

- Updates the `bigquery_export` output of the predictions-to-bigquery
  to store the generated BigQuery table name. This aids with checking
  the output of the component during testing, but also allows any
  downstream component receive this component's output.
Adds a test that integrates the transform component into the pipeline.
Test is implemented for local runner only.
Adds a container component stub to represent the TFX Transform component
for integration testing on Vertex AI.
Replaces create_tempfile and create_tempdir calls from abseil's
absltest.TestCase and parameterized.TestCase with equivalent
methods from the tempfile package.

The reason is that the abseil methods require parsing of the
FLAGs variable, which may not be executed if absltest.main() is
not invoked. This can happen when test filtering is performed, e.g.
```
python -m unittest path.to.test
```
Mentions the predictions-to-bigquery component in top-level readme.
- Fix issues in pred2bq readme
- Reverted version change in setup.py
- Add abls-py test prerequisite in setup.py
@cfezequiel cfezequiel force-pushed the feature/pred2bq-bulk-update branch from 20f4068 to d091922 Compare May 15, 2023 17:32
@cfezequiel
Copy link
Contributor Author

@casassg thanks for the comments! Made some changes and replied to your comments, ptal.

Copy link
Member

@casassg casassg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. nit are optional (good for later as well), undo chnage in version.py as instructed before merge. Also fixing CI so you can merge

README.md Outdated Show resolved Hide resolved
tfx_addons/predictions_to_bigquery/Dockerfile Show resolved Hide resolved
tfx_addons/predictions_to_bigquery/README.md Outdated Show resolved Hide resolved
tfx_addons/version.py Outdated Show resolved Hide resolved
@casassg
Copy link
Member

casassg commented May 15, 2023

may need to rebase to use #244 so we can run end to end all CI

Copy link
Contributor

@michaelwsherman michaelwsherman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. Carlos, I trust you to fix anything I flagged that's worth fixing so I'm approving this PR.

I didn't review or run the tests, and I didn't make sure that everything in utils is used. If you want the tests run/reviewed, let me know and I'll delegate it. If you've run them I'm good.

Overall this is great. I appreciate that you've written out a full example in the integration test, documented everything well, and included an example.

@@ -16,7 +16,7 @@

# We follow Semantic Versioning (https://semver.org/)
_MAJOR_VERSION = "0"
_MINOR_VERSION = "6"
_MINOR_VERSION = "7"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads up to other reviewers that this matches where we're currently at with the releases, not sure why github still shows it as a diff.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it may be due to missing to rebase the branch from master

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michaelwsherman why do we need to remove the Dockerfile? It's currently used to define the tfx-addons container that's needed by the integration test.

tfx_addons/predictions_to_bigquery/component.py Outdated Show resolved Hide resolved
tfx_addons/predictions_to_bigquery/component.py Outdated Show resolved Hide resolved
tfx_addons/predictions_to_bigquery/executor.py Outdated Show resolved Hide resolved
tfx_addons/predictions_to_bigquery/utils.py Outdated Show resolved Hide resolved
tfx_addons/predictions_to_bigquery/utils.py Show resolved Hide resolved
tfx_addons/predictions_to_bigquery/utils.py Outdated Show resolved Hide resolved
Copy link
Contributor Author

@cfezequiel cfezequiel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the additional comments @casassg and @michaelwsherman .
Made some fixes.

README.md Outdated Show resolved Hide resolved
tfx_addons/predictions_to_bigquery/Dockerfile Show resolved Hide resolved
tfx_addons/predictions_to_bigquery/README.md Outdated Show resolved Hide resolved


def _get_compress_type(file_path):
def _get_compress_type(file_path: str) -> Optional[str]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is from the original code by Hannes and I decided to reuse it. I saw a Python library that may provide a similar functionality: https://pypi.org/project/filetype
I could create an issue for it, and it might be good first issue to take on for a new contributor.

tfx_addons/predictions_to_bigquery/utils.py Show resolved Hide resolved
tfx_addons/predictions_to_bigquery/utils.py Outdated Show resolved Hide resolved
tfx_addons/predictions_to_bigquery/utils.py Outdated Show resolved Hide resolved
tfx_addons/version.py Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Upload Predictions to BigQuery
3 participants