GenerateStatistics API Change #75

Open
paulgc opened this issue Jul 20, 2019 · 0 comments


paulgc commented Jul 20, 2019

To support computing statistics over structured data (e.g., arbitrary protocol buffers, Parquet data), the GenerateStatistics API will take Arrow tables as input instead of Dict[FeatureName, ndarray]. The API will accept only Arrow tables whose columns are ListArrays of primitive types (e.g., int8, int16, int32, int64, uint8, uint16, uint32, uint64, float16, float32, float64, binary, string, unicode).

This change should be a no-op if you construct the pipeline using the default decoders (e.g., tfdv.DecodeTFExample and tfdv.DecodeCSV) or if you are using the utility methods to generate statistics (e.g., tfdv.generate_statistics_from_tfrecord, tfdv.generate_statistics_from_csv and tfdv.generate_statistics_from_dataframe).

TFDV 0.14 will have this new behavior. Let us know if you have any issues with migrating to the new API.

paulgc changed the title from "API Changes" to "GenerateStatistics API Change" on Jul 20, 2019