Feature/pred2bq bulk update #230
`Dockerfile` (new file):
```dockerfile
# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the 'License');
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an 'AS IS' BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
ARG PLATFORM=cpu

FROM gcr.io/tfx-oss-public/tfx:latest

WORKDIR /tfx-addons
RUN mkdir -p /tfx-addons/tfx_addons
ADD __init__.py /tfx-addons/tfx_addons
COPY ./ ./tfx_addons/predictions_to_bigquery

ENV PYTHONPATH="/tfx-addons:${PYTHONPATH}"
```
`README.md` (new file):
# Prediction results to BigQuery component

[![TensorFlow](https://img.shields.io/badge/TFX-orange)](https://www.tensorflow.org/tfx)
## Project Description

This component exports prediction results from BulkInferrer to a BigQuery
table. The BigQuery table schema can be generated from one of the following
sources:

1. From the SchemaGen component output
2. From the Transform component output
3. From the BulkInferrer component output (i.e. the prediction results)

If both SchemaGen and Transform outputs are passed to the component, the
SchemaGen output takes priority. Using SchemaGen to generate the BigQuery
schema is the recommended option.

If the Transform output channel is passed to the component without the
SchemaGen output, the BigQuery schema is derived from the pre-transform
metadata schema generated by Transform. Note that the metadata schema may
include a label key that is not present in the BulkInferrer prediction
results, so this option may not work for unlabeled data.

If neither the SchemaGen nor Transform outputs are passed to the component,
the BigQuery schema is parsed from the BulkInferrer prediction results
themselves, which contain tf.Example protos.
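
The schema-from-predictions idea boils down to mapping each tf.Example feature kind to a BigQuery column type. The sketch below illustrates that mapping with plain Python; `features_to_bq_schema` and its input format are hypothetical stand-ins, not the component's actual code.

```python
# Sketch: derive a BigQuery schema from tf.Example feature kinds.
# The three `kind` oneof names in tf.train.Feature map naturally to
# BigQuery column types. Illustration only, not the component's code.
_KIND_TO_BQ_TYPE = {
    'bytes_list': 'STRING',
    'float_list': 'FLOAT',
    'int64_list': 'INTEGER',
}

def features_to_bq_schema(feature_kinds):
    """Build a BigQuery schema (list of field dicts) from feature kinds.

    `feature_kinds` maps a feature name to its tf.train.Feature `kind`
    oneof name, e.g. {'score': 'float_list', 'category': 'bytes_list'}.
    """
    return [
        {'name': name, 'type': _KIND_TO_BQ_TYPE[kind], 'mode': 'NULLABLE'}
        for name, kind in sorted(feature_kinds.items())
    ]

schema = features_to_bq_schema({'score': 'float_list', 'category': 'bytes_list'})
```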

Prediction string labels for the BulkInferrer output may be derived by
passing the `vocab_label_file` execution parameter to the component. This
only works if the Transform component output is passed and the
`vocab_label_file` is present in it.
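
A vocabulary label file is, at heart, an index-to-label mapping: one label per line, with the line number serving as the class index. A minimal sketch of that lookup, with hypothetical helper names (not the component's internals):

```python
import os
import tempfile

def load_vocab(path):
    """Read a vocabulary file: one label per line, line number = class index."""
    with open(path) as f:
        return [line.rstrip('\n') for line in f]

def index_to_label(vocab, class_index):
    """Map a predicted class index back to its string label."""
    return vocab[class_index]

# Demo with a temporary vocab file standing in for a Transform-produced one.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('cat\ndog\nbird\n')
    vocab_path = f.name

vocab = load_vocab(vocab_path)
label = index_to_label(vocab, 1)
os.unlink(vocab_path)  # clean up the demo file
```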

## Project Use-Case(s)

The main use case for this component is to enable export of model prediction
results to a BigQuery table for further data analysis. The exported table
contains the model predictions and their corresponding inputs. If the input
data is labeled, this allows users to compare labels with their corresponding
predictions.

## Project Implementation

The PredictionsToBigQuery component uses Beam to process the prediction
results from BulkInferrer and export them to a BigQuery table.
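
As a rough illustration of the per-record conversion such a pipeline performs before writing to BigQuery (a simplified sketch with a hypothetical helper name, not the component's internals), each parsed prediction record can be flattened into a BigQuery row dict:

```python
def prediction_to_bq_row(parsed_features):
    """Flatten a parsed tf.Example-like dict into a BigQuery row dict.

    `parsed_features` maps feature names to lists of values, as produced by
    parsing a tf.Example; single-element lists become scalar column values,
    and multi-element lists stay as repeated values.
    """
    row = {}
    for name, values in parsed_features.items():
        row[name] = values[0] if len(values) == 1 else list(values)
    return row

row = prediction_to_bq_row({'score': [0.87], 'input_text': ['hello']})
```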

The BigQuery table name is passed in as a user-provided parameter; the user
can also choose to have the component append a timestamp to the end of the
table name.
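
A timestamp suffix of this kind is typically generated along the following lines. The `YYYYMMDDhhmmss` format and the helper name are assumptions for illustration; the component's actual format may differ.

```python
import datetime

def add_timestamp_suffix(table_name):
    """Append a UTC timestamp suffix (YYYYMMDDhhmmss) to a table name."""
    suffix = datetime.datetime.now(datetime.timezone.utc).strftime('%Y%m%d%H%M%S')
    return f'{table_name}_{suffix}'

name = add_timestamp_suffix('my_bigquery_table')
```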

The component's output is the fully qualified BigQuery table name where the
inference results are stored; it can be accessed through the
`bigquery_export` output key. The same table name is also stored as a custom
property of the `bigquery_export` artifact.

### Usage example
```python
from tfx import v1 as tfx
import tfx_addons as tfxa

...

predictions_to_bigquery = tfxa.predictions_to_bigquery.PredictionsToBigQuery(
    inference_results=bulk_inferrer.outputs['inference_result'],
    schema=schema_gen.outputs['schema'],
    transform_graph=transform.outputs['transform_graph'],
    bq_table_name='my_bigquery_table',
    gcs_temp_dir='gs://bucket/temp-dir',
    vocab_label_file='Label',
)
```

> Review comment on `gcs_temp_dir`: Should this just use the temp_dir from
> `beam_pipeline_args` instead?
> Reply: Yes, I just realized that the …

Refer to `integration_test.py` for tests that demonstrate how to use the
component.

For a description of the component's inputs and execution parameters, refer
to `component.py`.

## Project Dependencies

See `version.py` in the top-level repo directory for component dependencies.

## Testing

Each Python module has a corresponding unit test file ending in `_test.py`.

An integration test is also available and requires use of a Google Cloud
project. Additional instructions for running the integration test can be
found in `integration_test.py`.

Some tests use Abseil's `absltest` module. Install the package using pip:

```bash
pip install absl-py
```

### Test coverage

Test coverage can be generated using the `coverage` package:

```bash
pip install coverage
```

To get test code coverage on the component code, run the following from the
top directory of the tfx-addons repository (the glob is quoted so the shell
does not expand it before `unittest` sees the pattern):

```bash
coverage run -m unittest discover -s tfx_addons/predictions_to_bigquery -p '*_test.py'
```

Generate a summary report in the terminal:

```bash
coverage report -m
```
Generate an HTML report that also details missed lines:

```bash
coverage html -d /tmp/htmlcov
```

If working on a remote machine, the HTML coverage report can be viewed by
launching a web server:

```bash
pushd /tmp/htmlcov
python -m http.server 8000  # or another unused port number
```

## Project team

- Hannes Hapke (@hanneshapke, Digits Financial Inc.)
- Carlos Ezequiel (@cfezequiel, Google)
- Michael Sherman (@michaelwsherman, Google)
- Robert Crowe (@rcrowe-google, Google)
- Gerard Casas Saez (@casassg, Cash App)

> Review comment on the `Dockerfile`: Remove this file.
> Reply: @michaelwsherman why do we need to remove the Dockerfile? It's
> currently used to define the tfx-addons container that's needed by the
> integration test.