Support for manual numerical distribution constraints in schema/anomalies #221

rclough · 2022-08-25T14:49:17Z

I've been looking through the detectable anomalies, and I don't think there's a way to accomplish what I'd like, which is to enforce a distribution constraint on a feature that isn't related to training/serving skew.

For a concrete example, let's say I have a regular retraining pipeline and I want to enforce that a boolean feature has a roughly 50/50 distribution across all of my examples. We have a component that fails our training pipeline if anomalies are detected, but as far as I'm aware there's no anomaly I can configure to enforce this distribution. The only option I see is to use the skew feature with a "golden" set of statistics to compare against, which seems like a roundabout way of doing it.

To generalize it, if I can manually enforce things like min/max value, it would be useful to also be able to enforce things like feature mean/std deviation or some similar way of thresholding a difference in numeric distribution.
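To make the request concrete, here is a minimal sketch in plain Python of the kind of single-dataset check being asked for. It is not a TFDV API: the helper names (`check_boolean_balance`, `check_mean_std`), the default tolerance, and the use of population standard deviation are all assumptions made for illustration.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple


def check_boolean_balance(
    values: Sequence[bool], target: float = 0.5, tolerance: float = 0.05
) -> List[str]:
    """Return a list of violations; empty if the share of True is near target.

    Hypothetical helper, not part of TFDV.
    """
    if not values:
        raise ValueError("values must be non-empty")
    share = sum(1 for v in values if v) / len(values)
    if abs(share - target) > tolerance:
        return [f"share of True is {share:.3f}, expected {target} +/- {tolerance}"]
    return []


def check_mean_std(
    values: Sequence[float],
    mean_range: Tuple[float, float],
    std_range: Tuple[float, float],
) -> List[str]:
    """Return a list of violations for manually chosen mean/std bounds.

    Uses the population standard deviation (an assumption of this sketch).
    """
    problems = []
    mean = fmean(values)
    std = pstdev(values)
    if not mean_range[0] <= mean <= mean_range[1]:
        problems.append(f"mean {mean:.3f} outside {mean_range}")
    if not std_range[0] <= std <= std_range[1]:
        problems.append(f"std {std:.3f} outside {std_range}")
    return problems


if __name__ == "__main__":
    balanced = [True, False] * 50
    skewed = [True] * 90 + [False] * 10
    print(check_boolean_balance(balanced))  # no violations expected
    print(check_boolean_balance(skewed))    # one violation expected
```

A pipeline component could fail the run whenever either helper returns a non-empty list, which is the behavior the request is asking to express through schema constraints instead.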

singhniraj08 · 2022-08-26T10:04:15Z

Hi @rclough,

Please go through these links 1 and 2 for thresholding the difference in numeric distribution using jensen_shannon_divergence and L-infinity distance. Let us know if this helps. Thank you!

caveness · 2022-08-26T16:50:26Z

Note that using Jensen Shannon and L-infinity will cover only comparisons between two datasets and will not handle single-dataset validation (unless you were to use a golden dataset as noted in the original issue).
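As a rough illustration of why both measures need a second distribution, the following plain-Python sketch computes L-infinity distance and Jensen-Shannon divergence between two discrete distributions. It does not call TFDV; the function names are hypothetical, and the base-2 logarithm (which bounds the divergence to [0, 1]) is an assumption of this sketch rather than a statement about TFDV's implementation.

```python
import math
from typing import List, Sequence


def normalize(counts: Sequence[float]) -> List[float]:
    """Turn raw bucket counts into probabilities that sum to 1."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must have a positive sum")
    return [c / total for c in counts]


def _require_same_length(p: Sequence[float], q: Sequence[float]) -> None:
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of buckets")


def l_infinity_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Largest absolute difference between corresponding probabilities."""
    _require_same_length(p, q)
    return max(abs(a - b) for a, b in zip(p, q))


def _kl_divergence(p: Sequence[float], m: Sequence[float]) -> float:
    # Terms with p_i == 0 contribute nothing. m_i > 0 whenever p_i > 0
    # because m is the midpoint of p and another distribution.
    return sum(a * math.log2(a / b) for a, b in zip(p, m) if a > 0)


def jensen_shannon_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence with base-2 logs, so the result is in [0, 1]."""
    _require_same_length(p, q)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * _kl_divergence(p, m) + 0.5 * _kl_divergence(q, m)


if __name__ == "__main__":
    golden = [0.5, 0.5]             # the hand-written "golden" reference
    observed = normalize([90, 10])  # counts of True / False in new data
    print(l_infinity_distance(golden, observed))
    print(jensen_shannon_divergence(golden, observed))
```

Both functions take two distributions as input, so a single-dataset check is only possible by supplying the reference (here `golden`) by hand.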

TFDV does not currently have a good way to do the type of single-dataset distribution validation noted in the issue, but we are aiming to expand our anomaly detection functionality and will take this feature request into account.

rclough · 2022-08-26T16:55:38Z

@singhniraj08 Thanks. As I noted in the original ticket, and as @caveness clarifies, jensen_shannon_divergence and L-infinity distance only apply when there is a pre-existing example dataset to compare against. It is possible to use them as a workaround, but that's really not ideal when you just want to manually enforce a specific distribution on a feature.

singhniraj08 · 2022-08-29T05:22:38Z

@caveness, Thank you for the clarification.
@rclough, I will make this a feature request as commented by @caveness. Thank you for reporting this issue.

caveness · 2022-11-29T22:48:56Z

@rclough - TFDV recently added support for custom data validation using SQL. You should be able to use this functionality to do checks like the one described above. Please re-open this if you aren't able to do what you need with custom validation.

Please note that custom data validation is not supported on Windows.

singhniraj08 self-assigned this Aug 26, 2022

singhniraj08 added the type:support label Aug 26, 2022

singhniraj08 added the stat:awaiting response label Aug 26, 2022

singhniraj08 assigned caveness and unassigned singhniraj08 Aug 29, 2022

singhniraj08 added stat:awaiting tensorflower type:feature and removed stat:awaiting response type:support labels Aug 29, 2022

caveness closed this as completed Nov 29, 2022
