PandasTransform - Ready for review #155

Merged on Aug 18, 2022 (28 commits)

rcrowe-google
Collaborator

Project Proposal

This is an initial working version of the PandasTransform component. One feature not in the proposal is that it checks to see if a Beam custom component (BaseBeamComponent) is supported in the loaded version of TFX, and uses it if it is.
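
For readers following along, the version gate described above can be sketched roughly as follows. This is a minimal illustration, not the component's actual code: `parse_version` and `supports_beam_component` are hypothetical helper names, the 1.8.0 threshold is taken from the discussion further down in this thread, and pre-release suffixes such as `rc0` are simply ignored.

```python
import re
from typing import Tuple

# Matches the leading "major.minor[.patch]" part of a version string.
_VERSION_RE = re.compile(r"(\d+)\.(\d+)(?:\.(\d+))?")


def parse_version(version: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from a string such as '1.10.0rc0'."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError("Unrecognized version string: {}".format(version))
    major, minor, patch = match.groups()
    # A missing patch component ('1.8') is treated as 0.
    return int(major), int(minor), int(patch or 0)


def supports_beam_component(tfx_version: str) -> bool:
    """Hypothetical check: is TFX new enough for Beam custom components?"""
    # Assumption: 1.8.0 is the first release with the Beam-based component.
    return parse_version(tfx_version) >= (1, 8, 0)
```

Comparing integer tuples rather than raw strings matters here, because a plain string comparison would sort "1.10.0" before "1.8.0".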

@github-actions
Contributor

github-actions bot commented Jul 8, 2022

Thanks for the PR! 🚀

Instructions: Approve using /lgtm and mark for automatic merge by using /merge.

@rcrowe-google removed the request for review from theadactyl on July 8, 2022 23:34
@rcrowe-google
Collaborator Author

There's something failing in yapf, but other than that I think this is ready for review.

Member

@casassg left a comment

You can run YAPF locally by installing the pre-commit hooks with pre-commit install; see https://github.com/tensorflow/tfx-addons/blob/main/CONTRIBUTING.md#development-tips

It also seems a CI test is failing. I'm not sure why, but I can take a look.

statistics=statistics,
transformed_examples=transformed_examples,
module_file=module_file,
beam_pipeline=beam.Pipeline())
Member

nit: should we add beam_pipeline_args somehow?

Collaborator Author

They'll be added for TFX >= 1.8.0. I haven't really thought about how to add them for earlier versions.

Collaborator Author

Also I really need to add more tests and at least one example.

Collaborator Author

I added beam_pipeline_args for both the new and old versions:

beam_pipeline_args: A string with the argv options for creating a Beam pipeline.
Note that this is a string, not a list. It will be split on spaces to create
a list. If running TFX >= 1.8.0, if beam_pipeline_args are specified they will
override the pipeline beam args.
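
As a rough illustration of the string-to-list behavior that docstring describes, here is a minimal sketch. The helper name `split_beam_pipeline_args` is hypothetical, and it assumes plain whitespace splitting, so an option value that itself contains spaces would not survive intact.

```python
from typing import List, Optional


def split_beam_pipeline_args(beam_pipeline_args: Optional[str]) -> List[str]:
    """Hypothetical helper: turn the string parameter into an argv-style list."""
    if not beam_pipeline_args:
        # None or an empty string means no extra Beam options.
        return []
    # str.split() with no separator also collapses repeated whitespace.
    return beam_pipeline_args.split()


args = split_beam_pipeline_args("--runner=DirectRunner --direct_num_workers=2")
print(args)
```

Keeping the parameter a single string presumably fits the `Parameter[str]` annotation used in the component signature, at the cost of this extra splitting step.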

@rcrowe-google
Collaborator Author

Yapf is still complaining about something, but I can't quite see what it is.

@rcrowe-google changed the title from "PandasTransform - Initial commit" to "PandasTransform - Ready for review" on Aug 13, 2022
@casassg
Member

casassg commented Aug 17, 2022

Also, @rcrowe-google, I'd be curious whether you want to add more context to the CONTRIBUTING.md file. It would be nice to double-check that the pre-commit setup works fine as we get more contributors.

Comment on lines 242 to 244
if not os.path.exists(module_file):
  raise ImportError(
      'DoPandasTransform: Module file not found: {}'.format(module_file))
Member

nit: Should we try to check if it's a GCS file and download it if so?

Collaborator Author

It looks like that requires some parsing of the URI to extract the bucket name. I'm not finding any nice clean way to do that.

Member

to be fair, we use https://github.com/tensorflow/tfx/blob/b92695748e4ea35164f7791fc3abed0d1d4679e2/tfx/utils/io_utils.py#L38 for some internal use cases so it may work for this as well

Collaborator Author

Good idea, I sorta forgot about that whole thing.
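
On the URI-parsing question raised above, one possible standard-library approach is sketched below. This is only an illustration and not what the PR ended up doing: `parse_gcs_uri` is a hypothetical helper, it only recognizes the `gs://` scheme, and it does not download anything.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def parse_gcs_uri(uri: str) -> Optional[Tuple[str, str]]:
    """Return (bucket, object_path) for a gs:// URI, or None for anything else."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        # Local paths and other schemes are left for the caller to handle.
        return None
    # netloc holds the bucket; the path keeps a leading slash we do not need.
    return parsed.netloc, parsed.path.lstrip("/")


print(parse_gcs_uri("gs://my-bucket/modules/transform.py"))
print(parse_gcs_uri("/tmp/transform.py"))
```

A caller could use the `None` result to fall back to the existing `os.path.exists` check for local module files.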

CODEOWNERS Outdated
@@ -46,3 +46,5 @@
# Message Exit Handler
/tfx_addons/message_exit_handler @hanneshapke

# PandasTransform Component
/tfx_addons/pandas_transform @rcrowe-google
Contributor

@casassg Do we need to lint for blank lines at the end of files?

Member

We do lint for that in Python.


from tfx_addons.pandas_transform.component import PandasTransform

__version__ = '0.1dev'
Contributor

@rcrowe-google Shouldn't the number match the version in RELEASE.md?

Collaborator Author

Doh! Yes, it should. Fixed. Thanks!

module_file=module_file,
beam_pipeline=this_beam_pipeline)
else:
logging.info('TFX < 1.8.0')
Contributor

@rcrowe-google Nice pattern.

statistics: tfx.dsl.components.InputArtifact[ExampleStatistics] = None,
module_file: tfx.dsl.components.Parameter[str] = None,
beam_pipeline_args: tfx.dsl.components.Parameter[str] = None) -> None:
"""Placeholder Docstring"""
Contributor

@rcrowe-google Does this need an update before merging?

Collaborator Author

I clarified it here and in the other one. I thought about turning off the linting for it, but I think explaining it is better.

beam_pipeline_args: tfx.dsl.components.Parameter[str] = None,
beam_pipeline: BeamComponentParameter[beam.Pipeline] = None,
) -> None:
"""Placeholder Docstring"""
Contributor

@rcrowe-google Does this need an update before merging?

@hanneshapke
Contributor

@rcrowe-google I left some questions in the PR, but all are non-blocking.

@casassg
Member

casassg commented Aug 18, 2022

@rcrowe-google this is ready for merging, you may need to rebase to get CI fixes in main (or you can just merge and it should pick it up there)

@casassg
Member

casassg commented Aug 18, 2022

/merge

@github-actions
Contributor

Merge request received from @casassg! ✅

PR will be auto-merged once Test suite is green!

@github-actions bot merged commit 473315a into tensorflow:main on Aug 18, 2022
@github-actions
Contributor

Merged with approvals from casassg and hanneshapke - thanks for the contribution! 🎉

hanneshapke pushed a commit to digits/tfx-addons that referenced this pull request Apr 3, 2023
* Initial commit

* Fixing lint

* More lint

* Yet more lint

* Oops, whitespace

* Unused imports

* More cleanup

* Misc cleanup

* Fixed a couple of things

* Fixing astype

* Disabling unnecessary-comprehension

* Oops, wrong comprehension

* Trying to improve version handling

* Order of imports

* Removing reimport

* Adding version check to tests

* Missed the beam pipeline on the min

* Indents on the test params

* Lint wants the else formatted differently

* Adding an example and update to pydoc for component

* Adding README and release notes

* Added beam_pipeline_args, updated README

* Trying to fix lint

* Adding dependencies

* Lint, maybe yapf fixes

* Trying to fix formatting

* Capping Pandas version, updating CODEOWNERS

* Various updates pre-merge
github-actions bot pushed a commit that referenced this pull request Apr 11, 2023
* initial move of the existing mct code

* Added missing file

* Added (unreviewed) MCT example

* updated example

* extended doc

* add census helper files

* Create 20220513-PandasTransform.md

First draft of project proposal.

* Update 20220513-PandasTransform.md

Oops, forgot to add the dependencies!

* Update 20220513-PandasTransform.md

Adding a note about statistics and schema for the changed dataset.

* Rcrowe fixing filename (#140)

* Create 20220513-pandas_transform.md

* Delete 20220513-PandasTransform.md

* Update README.md (#142)

Fixing broken links on PyPI page.

* fix bad link in xgboost_evaluator

* Update CODEOWNERS

* Cherry pick bot to backport changes (#144)

* Cherry pick bot to backport changes

* Update RELEASE.md

* validate user is release manager

* Update examples for feature selection component (#116)

* Updated module file for compatibility

* Updated module file for compatibility

* Update iris_module_file.py

* Created Palmer Penguins example using Colaboratory

* Created Pima Indians Diabetes example pipeline using Colaboratory

* Created Iris example pipeline using Colaboratory

* Deleted outdated file + replaced with new examples

* Deleted outdated file

* Fix bad link in xgboost_evaluator (#145)

Commit 93fd006 didn't correctly fix the link.

* Increase supported version from TFX <1.8 to <1.9  (#147)

* include TFX 1.8

* increase CI

* switch cache key to constaint first

* remove shared cache

* fix exit handler

* fix patch issues

* reformating yapf

* Use feast from master git to avoid pip infinite backtrack (#152)

* pin pip

* Update ci.yml

* use requirements max and min instead

* add missing letter

* fix using -r option

* ammend triggers

* add tensorflow to the mix

* tf 2.6

* move constraints to version file

* pin api-core

* add protobuf

* protobuf 3.19

* add mlp-sdk

* add extras for api-core

* add it to requirements instead

* move feast dep

* use feast repo instead of release

* remove mlpipeline-sdk

* Updated readme with module file guidelines and example (#151)

* bump feast version support (#154)

* Add feature_selection to included pkg (#161)

* add feature_selection to included pkg

* upgrade tfx to avoid beam issue in sklearn example

* fix due to upgrade of tfma

* fix missing 1 types

* format inline

* update extractor test

* remove the header generator

* break down CI by component

* use file.json

* use python file

* use correct base_dir

* touch for feast_examplegen

* ensure it works fine

* run if certain dangerous files change

* removed non used files and update CONTRIBUTING

* improve a bit how variable is set

* revert change in feastexamplegen

* remove non used variable in version.py

* rename job phase

* cancel if new ones

* add some extra space

* add note on inspiration

* PandasTransform - Ready for review (#155)

* Initial commit

* Fixing lint

* More lint

* Yet more lint

* Oops, whitespace

* Unused imports

* More cleanup

* Misc cleanup

* Fixed a couple of things

* Fixing astype

* Disabling unnecessary-comprehension

* Oops, wrong comprehension

* Trying to improve version handling

* Order of imports

* Removing reimport

* Adding version check to tests

* Missed the beam pipeline on the min

* Indents on the test params

* Lint wants the else formatted differently

* Adding an example and update to pydoc for component

* Adding README and release notes

* Added beam_pipeline_args, updated README

* Trying to fix lint

* Adding dependencies

* Lint, maybe yapf fixes

* Trying to fix formatting

* Capping Pandas version, updating CODEOWNERS

* Various updates pre-merge

* support tfx 1.9 (#162)

* small fix for filter projects (#164)

* small fix for filter projects

* Update ci.yml

* Update README to add missing component (#165)

* Update README.md

* fix pandas_transform

* Added release notes for feature selection component (#150)

* Update CONTRIBUTING.md

* Create README.md

Adding a readme to the example

* Create requirements.txt

Adding requirements.txt to the example.  Only referencing the component, since that will keep all other dependencies in one place.

* Added tests for feature selection component (#149)

* Initial commit with working tests

* cleaned and ensured the file is working with pre-commit

* Testing checkpoint

* Added support for module_file

* Improved documentation

* Fixed test

* Fixed pre-commit errors

* Added data files for component_test.py

* Added tests for artifact count by type

* Fixed minor bug

* Added test to check if correct features are being selected

* Update dependencies for feature_selection

* Update tfx_addons/version.py

* Update tfx_addons/version.py

* Update tfx_addons/version.py

Co-authored-by: Gerard Casas Saez <gcasassaez@twitter.com>

* Improve examples CI to automatically pick up projects (#166)

* make ci_examples run only when needed

* remove non used init file

* only run those examples that have test files

* address comments and improve documentation

* add concurrency for ci-examples

* CI: Run all if event name is push (#167)

* if event name is push run all projects

* remove non-used RUN_ALL_FILE

* run ci-examples if changes in filter_examples

* add comment on why env name

* add better logging

* Move main to 0.4 after 0.3 release

* Update RELEASE.md (#170)

* update firebase publisher proposal (#171)

* update firebase publisher proposal

update firebase publisher proposal with more detailed descriptions

* remove vague word temporary

* replace custom_config with concrete parameters

* Update tfx_addons/firebase_publisher/README.md

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* clarify usage of SavedModel

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* HFModelPusher component proposal (#174)

* write HFModelPusher component proposal

* Update proposals/20220823-huggingface_model_pusher.md

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update proposals/20220823-huggingface_model_pusher.md

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update proposals/20220823-huggingface_model_pusher.md

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update proposals/20220823-huggingface_model_pusher.md

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update proposals/20220823-huggingface_model_pusher.md

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update proposals/20220823-huggingface_model_pusher.md

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update proposals/20220823-huggingface_model_pusher.md

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* commit_id in the outputs, more desc on the branching

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* fix: missing option --hook-type (#177)

fix: missing option `--hook-type` in "Install pre-commit hooks for push hooks: `pre-commit install --hook-type pre-push`"

* Scaling Sampler by using row level probability (#163)

* initial probabilistic sampling

* using sampled in batches

* use toDict

* use calculation for sampling by class

* implement using sampling at individual level

* fixup readme

* improve executor docs

* use patch to avoid flaky tests

* Add FirebasePublisher Implementation (#175)

* add empty __init__.py

* add component.py

* add component spec

* add executor.py

* unncessary codes from custom_config

* initial implementation done

* fix typo

* re-organize the runner

* pre-commit

* pre-commit

* add docstring for the component

* add basic unit tests

* add basic unit tests

* change private to public

* change function call name

* pass pre-commit

* Update tfx_addons/firebase_publisher/runner.py

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update tfx_addons/firebase_publisher/component.py

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update tfx_addons/firebase_publisher/component.py

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update tfx_addons/firebase_publisher/executor.py

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update tfx_addons/firebase_publisher/runner.py

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update tfx_addons/firebase_publisher/runner.py

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update tfx_addons/firebase_publisher/runner.py

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* remove mis-placed docstring in the component

* fix full-step consistency

* rename model_exist -> is_model_present

* rename upload_model -> upload_model_to_gcs

* more descriptions on update_model

* update log message on model_create

* fix case inconsistency

* fix pre-commit

* add copyright

* add import module in __init__.py

* add firebase_publisher version info

* fix typo

* remove ValueError in docstring

* update docstring

* fix: pre-commit

* fix: pre-commit

* update function name in the test module

* add copyrights

* reduce version requirement

* remove version info in __init__.py

* add constraint for firebase in CI

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Gerard Casas Saez <gcasassaez@twitter.com>

* update HFModelPusher proposal

* add space_url in the output

* Update proposals/20220823-huggingface_model_pusher.md

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update proposals/20220823-huggingface_model_pusher.md

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Archive penguins_sklearn example (#184)

* Update README.md

Archived and pointed to example in main TFX repo.

* Update penguin_pipeline_sklearn_gcp.py

Commenting out to avoid issues while archived.

* Update penguin_pipeline_sklearn_gcp_test.py

Commenting out to avoid CI issues while archived.

* Update penguin_pipeline_sklearn_local.py

Commenting out to avoid CI issues while archived.

* Update penguin_pipeline_sklearn_local_e2e_test.py

Commenting out to avoid CI issues while archived.

* Update penguin_pipeline_sklearn_local.py

* Update penguin_utils_sklearn.py

Commented out to avoid CI issues while archived

* Update sklearn_predict_extractor.py

Commented out to avoid CI issues while archived.

* Update sklearn_predict_extractor_test.py

Commented out to avoid CI issues while archived.

* Update requirements.txt

Commenting out to avoid CI issues while archived

* Update Dockerfile

Commenting out to avoid CI issues while archived

* Update penguin_utils_sklearn.py

Lint

* Update sklearn_predict_extractor.py

Lint

* Update sklearn_predict_extractor.py

Lint

* Update penguin_pipeline_sklearn_gcp.py

Lint

* Update penguin_pipeline_sklearn_gcp_test.py

Lint

* Update penguin_pipeline_sklearn_local.py

Lint

* Update penguin_pipeline_sklearn_local_e2e_test.py

Lint

* Update penguin_utils_sklearn.py

Lint

* Update sklearn_predict_extractor_test.py

Lint

* Update sklearn_predict_extractor_test.py

Lint

* Update penguin_utils_sklearn.py

Lint

* Update sklearn_predict_extractor.py

Lint

* Update sklearn_predict_extractor_test.py

Lint

* Update penguin_pipeline_sklearn_gcp_test.py

* Update penguin_pipeline_sklearn_local_e2e_test.py

Null test to satisfy CI

* Update sklearn_predict_extractor_test.py

Null test to satisfy CI

* Update penguin_pipeline_sklearn_gcp_test.py

Lint

* Update penguin_pipeline_sklearn_local_e2e_test.py

Lint

* Update sklearn_predict_extractor_test.py

Lint

* Update penguin_pipeline_sklearn_local_e2e_test.py

Lint

* Update penguin_pipeline_sklearn_gcp_test.py

Lint

* Update penguin_pipeline_sklearn_gcp_test.py

Lint

* Update penguin_pipeline_sklearn_local_e2e_test.py

Lint

* Update sklearn_predict_extractor_test.py

Lint

* remove files and include reference to existing code

Co-authored-by: Gerard Casas Saez <gcasassaez@twitter.com>

* remove old data (#185)

* PyTorch TFX proposal

* clean up of CODEOWNERS (#188)

* Support TFX 1.10 (#187)

* Initial release candidate testing for tfx==1.10.0rc0

* remove suffux from test_utils

* add missing extra line

* use tfx 1.10 constraint

* add firebase_publisher to readme

* Add HF pusher owners (#192)

* Add HuggingFace Pusher Implementation (#191)

* add hfpusher implementation

* update version.py

* add test for runner.py

* linting

* add dependency

* add pytest ignore protected access

* add one more test

* fix: test failure

* copy proposa as README

* add git-lfs dependency

* document about RuntimeError when git-lfs is not installed

* add _is_git_lfs_installed

* add test cases for executor

* change _executor -> exe

* yapf

* Update tfx_addons/huggingface_pusher/README.md

Co-authored-by: Gerard Casas Saez <gcasassaez@twitter.com>

* Update tfx_addons/huggingface_pusher/executor_test.py

Co-authored-by: Gerard Casas Saez <gcasassaez@twitter.com>

* add huggingface document reference for git-lfs

* space_config example added to component_test

* yapf

* add optional decrypt features

* add decrypt_fn in README

Co-authored-by: Gerard Casas Saez <gcasassaez@twitter.com>

* initial move of the existing mct code

* Added missing file

* Added (unreviewed) MCT example

* updated example

* extended doc

* add census helper files

* updated mct version dependency

* yapf + isort updates

* updated gitignore to ignore notebook generated files

* isort updated

* removed notebook example generated files

* removed model_card_pb2.SensitiveData() and model_card_pb2.ConfidenceInterval()

* updated formatting

* updated example notebook

* updated example notebook for the MCT component

* updated example notebook for the MCT component

* set the code owners for the MCT component

* Update RELEASE.md

* Update RELEASE.md

* added tfxtest

* added tfxtest

* updated .gitignore

* fix typo in CODEOWNERS

* Update .gitignore

---------

Co-authored-by: Robert Crowe <robertcrowe@google.com>
Co-authored-by: Gerard Casas Saez <gcasassaez@twitter.com>
Co-authored-by: Kshitijaa Jaglan <29124655+deutranium@users.noreply.github.com>
Co-authored-by: Jeroen Van Goey <jeroen.vangoey@gmail.com>
Co-authored-by: Chansung Park <deep.diver.csp@gmail.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>