Feature/pred2bq bulk update #230

Status: Open. Wants to merge 23 commits into base: main.
Showing changes from 21 commits.

Commits:
c5d3250
pred2bq: Update schema parsing from prediction results.
cfezequiel Mar 14, 2023
1ed693e
pred2bq: Add integration test.
cfezequiel Mar 15, 2023
9708b46
pred2bq: Refactor executor.py.
cfezequiel Feb 27, 2023
7f25633
pred2bq: Remove symlink to data folder - not needed.
cfezequiel Mar 3, 2023
f4ca221
pred2bq: Refactor executor.py.
cfezequiel Feb 27, 2023
6938e58
pred2bq: Add integration test - executor to BQ
cfezequiel Feb 21, 2023
cd17b95
pred2bq: Update component spec.
cfezequiel Mar 8, 2023
60faa2d
pred2bq: Update utils.py.
cfezequiel Mar 16, 2023
e4d78ce
pred2bq: Add component integration test.
cfezequiel Mar 17, 2023
497e3f9
Add Vertex AI Pipelines test.
cfezequiel Mar 21, 2023
c68667b
pred2bq: Add deps to version.py; update pkg version.
cfezequiel Mar 27, 2023
8f48517
pred2bq: Add integration test with transform.
cfezequiel Mar 28, 2023
f8a53d2
pred2bq: Add integration test with schema.
cfezequiel Mar 28, 2023
4fca0a9
pred2bq: Add Transform component in Vertex AI test.
cfezequiel Mar 30, 2023
b57d02a
pred2bq: Code cleanup and documentation.
cfezequiel Mar 31, 2023
a33b419
pred2bq: Add readme file.
cfezequiel Mar 31, 2023
6df1007
pred2bq: Replace abseil tempfile creation.
cfezequiel Mar 31, 2023
685cd60
Add tests to expand code coverage.
cfezequiel Mar 31, 2023
f68f0d7
Add project team to readme.
cfezequiel Apr 1, 2023
6035c17
Update top-level readme.
cfezequiel Apr 1, 2023
d091922
Update code based on reviewer comments.
cfezequiel May 15, 2023
a5de9d2
pred2bq: Update code based on code reviews.
cfezequiel May 26, 2023
91feb1c
Merge branch 'main' into feature/pred2bq-bulk-update
cfezequiel Jul 13, 2023
21 changes: 11 additions & 10 deletions README.md
@@ -10,25 +10,25 @@ SIG TFX-Addons is a community-led open source project. As such, the project depe
## Maintainership

The maintainers of TensorFlow Addons can be found in the [CODEOWNERS](https://github.com/tensorflow/tfx-addons/blob/main/CODEOWNERS) file of the repo. If you would
like to maintain something, please feel free to submit a PR. We encourage multiple
owners for all submodules.


## Installation

TFX Addons is available on PyPI for all OS. To install the latest version,
run the following:

```
pip install tfx-addons
```

To ensure you have a compatible version of dependencies for any given project,
you can specify the project name as an extra requirement during install:

```
pip install tfx-addons[feast_examplegen,schema_curation]
```

To use TFX Addons:

@@ -45,18 +45,19 @@ tfxa.feast_examplegen.FeastExampleGen(...)

## TFX Addons projects

* [tfxa.feast_examplegen](https://github.com/tensorflow/tfx-addons/tree/main/tfx_addons/feast_examplegen)
* [tfxa.feature_selection](https://github.com/tensorflow/tfx-addons/tree/main/tfx_addons/feature_selection)
* [tfxa.firebase_publisher](https://github.com/tensorflow/tfx-addons/tree/main/tfx_addons/firebase_publisher)
* [tfxa.huggingface_pusher](https://github.com/tensorflow/tfx-addons/tree/main/tfx_addons/huggingface_pusher)
* [tfxa.message_exit_handler](https://github.com/tensorflow/tfx-addons/tree/main/tfx_addons/message_exit_handler)
* [tfxa.mlmd_client](https://github.com/tensorflow/tfx-addons/tree/main/tfx_addons/mlmd_client)
* [tfxa.model_card_generator](https://github.com/tensorflow/tfx-addons/tree/main/tfx_addons/model_card_generator)
* [tfxa.pandas_transform](https://github.com/tensorflow/tfx-addons/tree/main/tfx_addons/pandas_transform)
* [tfxa.sampling](https://github.com/tensorflow/tfx-addons/tree/main/tfx_addons/sampling)
* [tfxa.schema_curation](https://github.com/tensorflow/tfx-addons/tree/main/tfx_addons/schema_curation)
* [tfxa.xgboost_evaluator](https://github.com/tensorflow/tfx-addons/tree/main/tfx_addons/xgboost_evaluator)

* [tfxa.predictions_to_bigquery](https://github.com/tensorflow/tfx-addons/tree/main/tfx_addons/predictions_to_bigquery)


Check out [proposals](https://github.com/tensorflow/tfx-addons/tree/main/proposals) for a list of existing or upcoming project proposals for TFX Addons.

2 changes: 1 addition & 1 deletion setup.py
@@ -57,7 +57,7 @@ def get_long_description():
    return fp.read()


-TESTS_REQUIRE = ["pytest", "pylint", "pre-commit", "isort", "yapf"]
+TESTS_REQUIRE = ["pytest", "pylint", "pre-commit", "isort", "yapf", "absl-py"]

PKG_REQUIRES = get_pkg_metadata()
EXTRAS_REQUIRE = PKG_REQUIRES.copy()
24 changes: 24 additions & 0 deletions tfx_addons/predictions_to_bigquery/Dockerfile
Review comment (Contributor):
Remove this file.

Reply (Contributor Author):
@michaelwsherman why do we need to remove the Dockerfile? It's currently used to define the tfx-addons container that's needed by the integration test.

@@ -0,0 +1,24 @@
# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the 'License');
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an 'AS IS' BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
ARG PLATFORM=cpu

FROM gcr.io/tfx-oss-public/tfx:latest

WORKDIR /tfx-addons
RUN mkdir -p /tfx-addons/tfx_addons
ADD __init__.py /tfx-addons/tfx_addons
COPY ./ ./tfx_addons/predictions_to_bigquery

ENV PYTHONPATH="/tfx-addons:${PYTHONPATH}"
129 changes: 129 additions & 0 deletions tfx_addons/predictions_to_bigquery/README.md
@@ -0,0 +1,129 @@
# Prediction results to BigQuery component

[![Python](https://img.shields.io/pypi/pyversions/tfx.svg?style=plastic)](https://github.com/tensorflow/tfx)
[![TensorFlow](https://img.shields.io/badge/TFX-orange)](https://www.tensorflow.org/tfx)

## Project Description

This component exports prediction results from BulkInferrer to a BigQuery
table.
The BigQuery table schema can be generated through one of the following sources:
1. From SchemaGen component output
2. From Transform component output
3. From BulkInferrer component output (i.e. prediction results)

If both SchemaGen and Transform outputs are passed to the component,
the SchemaGen output will take priority. It would be best to use SchemaGen
for generating the BigQuery schema.

If the Transform output channel is passed to the component, without the
SchemaGen output, the BigQuery schema will be derived from the pre-transform
metadata schema generated by Transform. Note that the metadata schema may
include a label key, which may not be present in the BulkInferrer prediction
results. Therefore, this option may not work for unlabeled data.

If neither the SchemaGen nor the Transform output is passed to the component,
the BigQuery schema will be parsed from the BulkInferrer prediction results
themselves, which contain tf.Example protos.
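
The precedence described above can be summarized in a short sketch. The function name `pick_schema_source` and the returned labels are hypothetical, chosen only for illustration; the component's actual selection logic may differ in detail.

```python
from typing import Optional


def pick_schema_source(schema: Optional[object],
                       transform_graph: Optional[object]) -> str:
    """Returns which input the BigQuery schema would be derived from.

    Hypothetical helper that only illustrates the documented precedence.
    """
    if schema is not None:
        # SchemaGen output takes priority, even if Transform output is passed.
        return 'schema_gen'
    if transform_graph is not None:
        # Fall back to the pre-transform metadata schema from Transform.
        return 'transform'
    # Neither was passed: parse the schema from the prediction results.
    return 'prediction_results'
```

Passing both inputs selects SchemaGen, passing only the Transform output selects Transform, and passing neither falls back to the prediction results.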

Prediction string labels may be derived from the BulkInferrer output by passing
a `vocab_label_file` execution parameter to the component. This only works
if the Transform component output is passed and if the `vocab_label_file`
is present.
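
As a rough illustration of what deriving a string label means, the sketch below maps a score vector to a label through a vocabulary. It assumes the vocabulary is a plain list of strings ordered by class index and that the predicted class is the arg-max of the scores. Both are assumptions made for illustration, and `scores_to_label` is a hypothetical helper, not part of the component.

```python
from typing import List, Sequence


def scores_to_label(scores: Sequence[float], vocab: List[str]) -> str:
    """Returns the vocabulary entry at the index of the highest score."""
    if len(scores) != len(vocab):
        raise ValueError('scores and vocab must have the same length')
    # Index of the largest score (first one wins on ties).
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return vocab[best_index]


print(scores_to_label([0.1, 0.7, 0.2], ['cat', 'dog', 'bird']))  # dog
```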

## Project Use-Case(s)

The main use case for this component is to enable export of model prediction
results into a BigQuery table for further data analysis. The exported table will
contain the model predictions and their corresponding inputs. If the input
data is labeled, this allows users to compare labels with the corresponding predictions.

## Project Implementation

The PredictionsToBigQuery component uses Beam to process the prediction results
from BulkInferrer and export them to a BigQuery table.

The BigQuery table name is passed as a parameter by the user. The user
can also choose to have the component append a timestamp to the end of the table name.
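
A minimal sketch of the timestamp option, assuming a `%Y%m%d%H%M%S` suffix joined to the name with an underscore. The suffix format and the helper name `with_timestamp` are assumptions; the component may format the name differently.

```python
import datetime
from typing import Optional


def with_timestamp(table_name: str,
                   now: Optional[datetime.datetime] = None) -> str:
    """Appends a timestamp suffix to a BigQuery table name.

    Hypothetical helper; uses the current UTC time if `now` is not given.
    """
    now = now or datetime.datetime.now(datetime.timezone.utc)
    return f"{table_name}_{now.strftime('%Y%m%d%H%M%S')}"


fixed = datetime.datetime(2023, 3, 31, 12, 0, 0)
print(with_timestamp('my_bigquery_table', fixed))
# my_bigquery_table_20230331120000
```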

The component output is the fully qualified name of the BigQuery table where the
inference results are stored. It can be accessed through the `bigquery_export`
key. The same table name is also stored as a custom property
of the `bigquery_export` artifact.

### Usage example

```python

from tfx import v1 as tfx
import tfx_addons as tfxa

...

predictions_to_bigquery = tfxa.predictions_to_bigquery.PredictionsToBigQuery(
    inference_results=bulk_inferrer.outputs['inference_result'],
    schema=schema_gen.outputs['schema'],
    transform_graph=transform.outputs['transform_graph'],
    bq_table_name='my_bigquery_table',
    gcs_temp_dir='gs://bucket/temp-dir',
    vocab_label_file='Label',
)
```

Review comment on `gcs_temp_dir` (Member):
Should this just use the temp_dir from beam_pipeline_args instead?

Reply (Contributor Author):
Yes, I just realized that the WriteToBigQuery transform used by the Beam pipeline defaults to temp_location if custom_gcs_temp_location is not specified, and temp_location should be an argument supported by beam_pipeline_args. Although I'm not quite sure if temp_location needs to be explicitly set or if Beam will create one by default otherwise.
I think this would be a more involved change as the Vertex integration test needs a custom container having the pred2bq component in Artifact Registry to run. Perhaps we have it as a separate PR instead?
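
The review thread on `gcs_temp_dir` raises the option of reading the temporary location from `beam_pipeline_args` instead. A sketch of how that lookup could work using only the standard library, assuming the args are a list of `--flag=value` strings. `find_temp_location` is a hypothetical helper, and this is not how the component behaves in this PR.

```python
import argparse
from typing import List, Optional


def find_temp_location(beam_pipeline_args: List[str]) -> Optional[str]:
    """Returns the value of --temp_location if present, else None."""
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    parser.add_argument('--temp_location', default=None)
    # parse_known_args skips flags that this parser does not define.
    known, _ = parser.parse_known_args(beam_pipeline_args)
    return known.temp_location


args = ['--runner=DirectRunner', '--temp_location=gs://bucket/temp-dir']
print(find_temp_location(args))  # gs://bucket/temp-dir
```

A caller could fall back to an explicit `gcs_temp_dir` when the helper returns `None`, which would keep the current parameter as an override.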

Refer to `integration_test.py` for tests that demonstrate how to use the
component.

For a description of the inputs and execution parameters of the component,
refer to the `component.py` file.

## Project Dependencies

See `version.py` in the top repo directory for component dependencies.

## Testing

Each Python module has a corresponding unit test file ending in `_test.py`.

An integration test is also available and requires use of a Google Cloud
project. Additional instructions for running the integration test can be found in `integration_test.py`.

Some tests use Abseil's `absltest` module.
Install the package using pip:
```bash
pip install absl-py
```

### Test coverage

Test coverage can be generated using the `coverage` package:
```bash
pip install coverage
```

To get test code coverage on the component code, run the following from the
top directory of the tfx-addons repository:

```bash
coverage run -m unittest discover -s tfx_addons/predictions_to_bigquery -p '*_test.py'
```

Generate a summary report in the terminal:
```bash
coverage report -m
```

Generate an HTML report that also details missed lines:
```bash
coverage html -d /tmp/htmlcov
```

If working on a remote machine, the HTML coverage report can be viewed
by launching a web server:
```bash
pushd /tmp/htmlcov
python -m http.server 8000 # or another unused port number
```

## Project team
- Hannes Hapke (@hanneshapke, Digits Financial Inc.)
- Carlos Ezequiel (@cfezequiel, Google)
- Michael Sherman (@michaelwsherman, Google)
- Robert Crowe (@rcrowe-google, Google)
- Gerard Casas Saez (@casassg, Cash App)