Support for manual numerical distribution constraints in schema/anomalies #221
Comments
Note that using Jensen-Shannon and L-infinity will cover only comparisons between two datasets and will not handle single-dataset validation (unless you were to use a golden dataset as noted in the original issue). TFDV does not currently have a good way to do the type of single-dataset distribution validation noted in the issue, but we are aiming to expand our anomaly detection functionality and will take this feature request into account.
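To illustrate why these two measures need a second dataset, here is a minimal sketch in plain Python. It does not use TFDV; the function names and the example distributions are hypothetical, and the base-2 logarithm is an assumption chosen so the divergence falls between 0 and 1. Both functions take two categorical distributions, so the current data can only be checked against some reference such as a golden set.

```python
import math


def l_infinity_distance(p, q):
    """Largest absolute difference in probability over all categories."""
    keys = set(p) | set(q)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two distributions."""
    keys = set(p) | set(q)
    # Mixture distribution: the pointwise average of p and q.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        # Terms with zero probability contribute nothing to the sum.
        return sum(
            a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0.0
        )

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Hypothetical "golden" reference and current distributions of a boolean feature.
golden = {"true": 0.5, "false": 0.5}
current = {"true": 0.8, "false": 0.2}

linf = l_infinity_distance(golden, current)
jsd = jensen_shannon_divergence(golden, current)
print(f"L-infinity: {linf:.3f}, Jensen-Shannon: {jsd:.3f}")
```

A pipeline would then compare each value against a chosen threshold; the threshold is a tuning decision and is not specified anywhere in this thread.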
@singhniraj08 Thanks, as I noted in the original ticket, and as @caveness clarifies,
@rclough - TFDV recently added support for custom data validation using SQL. You should be able to use this functionality to do checks like the one described above. Please re-open this if you aren't able to do what you need with custom validation. Please note that custom data validation is not supported on Windows.
I've been looking through the detectable anomalies, and I don't think there's a way to do what I want: enforce a distribution constraint on a feature that isn't related to training/serving skew.
For a concrete example, let's say I have a regular retraining pipeline, and I want to enforce that my examples have a roughly 50/50 distribution of a boolean feature. We have a component that fails our training pipeline if anomalies are detected, but as far as I'm aware, there's no anomaly I can configure to enforce this distribution. The only option seems to be using the skew feature with a "golden" set of statistics to compare against, which is a roundabout way of doing it.
To generalize: since I can manually enforce things like min/max value, it would be useful to also be able to enforce things like a feature's mean/standard deviation, or some similar way of thresholding a difference in numeric distribution.
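As a rough illustration of the kind of single-dataset check being requested, here is a minimal sketch in plain Python. It is not a TFDV API: the function `check_numeric_distribution`, its parameters, and the thresholds are all hypothetical, and it uses the population standard deviation as an assumption.

```python
import statistics


def check_numeric_distribution(values, min_mean=None, max_mean=None, max_std=None):
    """Return a list of human-readable violations (empty if all checks pass)."""
    if not values:
        raise ValueError("values must not be empty")

    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation

    problems = []
    if min_mean is not None and mean < min_mean:
        problems.append(f"mean {mean:.3f} is below the minimum {min_mean}")
    if max_mean is not None and mean > max_mean:
        problems.append(f"mean {mean:.3f} is above the maximum {max_mean}")
    if max_std is not None and std > max_std:
        problems.append(f"std deviation {std:.3f} is above the maximum {max_std}")
    return problems


# A boolean feature encoded as 0/1: its mean is the fraction of True values,
# so a "roughly 50/50" constraint becomes a bound on the mean.
skewed_feature = [1] * 80 + [0] * 20
balanced_feature = [1] * 50 + [0] * 50

skewed_problems = check_numeric_distribution(
    skewed_feature, min_mean=0.45, max_mean=0.55
)
balanced_problems = check_numeric_distribution(
    balanced_feature, min_mean=0.45, max_mean=0.55
)
print(skewed_problems)
print(balanced_problems)
```

A pipeline component could fail the run whenever the returned list is non-empty, which mirrors how the existing anomaly check is described above.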