Support for manual numerical distribution constraints in schema/anomalies #221

Closed
rclough opened this issue Aug 25, 2022 · 5 comments

Comments

@rclough

rclough commented Aug 25, 2022

I've been looking through the detectable anomalies, and I don't think there's a way to do what I'd like: enforce a distribution constraint on a feature, independent of training/serving skew.

For a concrete example, let's say I have a regular retraining pipeline, and I want to enforce that a boolean feature has a roughly 50/50 distribution across all of my examples. We have a component that fails our training pipeline if anomalies are detected, but as far as I'm aware, there's no anomaly I can configure to enforce this distribution. The only option I see is to use the skew feature and create a "golden" set of statistics to compare against, which seems like a roundabout way of doing it.

To generalize: since I can already manually enforce things like min/max value, it would be useful to also enforce things like a feature's mean or standard deviation, or to have some similar way of thresholding a numeric distribution.
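To make the request concrete, here is a minimal sketch of the kind of single-dataset check described above, written in plain Python rather than against TFDV. The function name `check_distribution` and its parameters `mean_range` and `max_std` are hypothetical and only illustrate the idea; nothing here is part of the TFDV API.

```python
import statistics


def check_distribution(values, *, mean_range=None, max_std=None):
    """Return a list of violation messages for one numeric feature.

    Hypothetical helper: `mean_range` is an inclusive (low, high) pair and
    `max_std` is an upper bound on the population standard deviation.
    """
    problems = []
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if mean_range is not None:
        low, high = mean_range
        if not low <= mean <= high:
            problems.append(f"mean {mean:.3f} outside [{low}, {high}]")
    if max_std is not None and std > max_std:
        problems.append(f"std {std:.3f} exceeds {max_std}")
    return problems


# A boolean feature encoded as 0/1: its mean is the fraction of True values,
# so "roughly 50/50" becomes a bound on the mean.
balanced = [1, 0] * 50
skewed = [1] * 90 + [0] * 10

balanced_problems = check_distribution(balanced, mean_range=(0.45, 0.55))
skewed_problems = check_distribution(skewed, mean_range=(0.45, 0.55))
```

A pipeline component could fail the run whenever the returned list is non-empty, which mirrors how min/max constraints behave today.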

@singhniraj08

Hi @rclough,

Please go through these links 1 and 2 for thresholding the difference in numeric distribution using jensen_shannon_divergence and L-infinity distance. Let us know if this helps. Thank you!

@caveness
Collaborator

Note that Jensen-Shannon divergence and L-infinity distance cover only comparisons between two datasets and do not handle single-dataset validation (unless you were to use a golden dataset as noted in the original issue).

TFDV does not currently have a good way to do the type of single-dataset distribution validation noted in the issue, but we are aiming to expand our anomaly detection functionality and will take this feature request into account.
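For readers unfamiliar with the two measures, the sketch below shows why both need a second distribution to compare against. It is a plain-Python illustration, not TFDV's implementation: the function names, the base-2 logarithm, and the `golden` and `observed` vectors are assumptions chosen for the example.

```python
import math


def l_infinity(p, q):
    """Largest absolute difference between two aligned probability vectors."""
    return max(abs(a - b) for a, b in zip(p, q))


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence using base-2 logs, so it lies in [0, 1]."""

    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    midpoint = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, midpoint) + 0.5 * kl(q, midpoint)


golden = [0.5, 0.5]    # hypothetical target: 50% False, 50% True
observed = [0.1, 0.9]  # hypothetical batch, heavily skewed toward True

linf = l_infinity(observed, golden)
jsd = jensen_shannon_divergence(observed, golden)
```

Both functions take two arguments, which is the limitation in question: without a golden distribution there is nothing to pass as the second one.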

@rclough
Author

rclough commented Aug 26, 2022

@singhniraj08 Thanks. As I noted in the original ticket, and as @caveness clarifies, jensen_shannon_divergence and L-infinity distance only apply when there is a pre-existing example dataset to compare against. So while it is possible to use them as a workaround, it's really not ideal when you just want to manually enforce a specific distribution on a feature.

@singhniraj08

@caveness, Thank you for the clarification.
@rclough, I will make this a feature request as commented by @caveness. Thank you for reporting this issue.

@caveness
Collaborator

@rclough - TFDV recently added support for custom data validation using SQL. You should be able to use this functionality to do checks like the one described above. Please re-open this if you aren't able to do what you need with custom validation.

Please note that custom data validation is not supported on Windows.
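As a rough illustration of the general idea (a distribution constraint written as a SQL predicate over summary statistics), here is a sketch using Python's standard-library `sqlite3`. It deliberately does not use TFDV's custom validation API: the `feature_stats` table, its columns, the `failing_features` helper, and the sample row are all hypothetical, so consult the TFDV documentation for the actual configuration format.

```python
import sqlite3

# Hypothetical table of per-feature summary statistics for one dataset.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feature_stats (name TEXT PRIMARY KEY, mean REAL, std_dev REAL)"
)
conn.execute("INSERT INTO feature_stats VALUES ('is_positive', 0.9, 0.3)")


def failing_features(connection, predicate):
    """Names of features whose statistics do NOT satisfy the SQL predicate.

    The predicate is interpolated into the query, so it must come from
    trusted configuration, never from user input.
    """
    query = f"SELECT name FROM feature_stats WHERE NOT ({predicate})"
    return [row[0] for row in connection.execute(query)]


# "Roughly 50/50" for a 0/1 feature, expressed as a bound on the mean.
violations = failing_features(conn, "mean BETWEEN 0.45 AND 0.55")
```

The appeal of this style is that the constraint applies to a single dataset's statistics, so no golden dataset is needed.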
