I have inferred a schema and computed statistics on a TFRecords dataset (unfortunately, I cannot share the dataset).
When I validate the statistics against "their own" schema, I get anomalies for some of my features, such as:
u"'country_of_origin'":
description: "Some examples have fewer values than expected."
severity: ERROR
short_description: "Missing values"
reason {
type: FEATURE_TYPE_LOW_NUMBER_VALUES
short_description: "Missing values"
description: "Some examples have fewer values than expected."
}
path {
step: "country_of_origin"
}
This is the relevant (inferred) part of my schema:
This is the representation of the relevant part of the statistics:
I have several problems with this:
It looks like something weird happened with the quotes around the key of the anomaly_info.
The anomaly says "Missing values", but my statistics say 0% missing values.
Generally in TFDV, it's not clear which numbers are compared in order to raise an anomaly. For example, here, is the error caused by the value_count or the presence.min_count part of my schema? In my statistics, which of these fields is used?
feature.string_stats.common_stats.min_num_values
feature.string_stats.common_stats.num_missing
feature.string_stats.common_stats.tot_num_values
feature.string_stats.common_stats.avg_num_values
I suspect it's the first one, but the only way for me to be sure is to dig into the C++ code of this repository, and since I don't know the codebase, I'm not 100% sure.
It also says: "Some examples have fewer values than expected." Ideally, I would like more information to solve the problem or to assess how severe it is: how many examples are affected, which ones, how many values are missing, etc.
Depending on these results, I might want to allow a portion of my dataset to have fewer values than required. But how could I define this?
I would expect the content of the anomaly to give me more information about what exactly the problem is and which comparison is failing.
Ideally, it would also suggest a way to "silence" the anomaly if I wish (could I change the severity for this error, for example?) or where to look in my dataset to find the root of the problem.
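To make the request concrete, here is a rough sketch of the kind of per-example detail I have in mind. It is plain Python and not part of TFDV: the helper `find_short_examples` is hypothetical, and it assumes the examples have already been parsed into dicts mapping feature names to lists of values (the sample data is made up).

```python
from typing import Dict, List, Sequence


def find_short_examples(
    examples: Sequence[Dict[str, List[str]]],
    feature: str,
    min_values: int,
) -> List[int]:
    """Return the indices of examples where `feature` has fewer than `min_values` values.

    Hypothetical helper for illustration; an absent feature is treated
    the same as an empty list of values.
    """
    short = []
    for index, example in enumerate(examples):
        values = example.get(feature) or []
        if len(values) < min_values:
            short.append(index)
    return short


# Made-up data: one example has an empty list, one lacks the feature entirely.
examples = [
    {"country_of_origin": ["FR"]},
    {"country_of_origin": []},
    {"country_of_origin": ["DE"]},
    {},
]

short = find_short_examples(examples, "country_of_origin", min_values=1)
fraction = len(short) / len(examples)
print(f"{len(short)} of {len(examples)} examples affected: indices {short}")
```

Something like the affected indices and the affected fraction, reported next to the anomaly, would tell me how severe the problem is.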
In this specific case, the anomaly means that feature.string_stats.common_stats.min_num_values is less than the value_count. Can you share the feature.string_stats.common_stats.min_num_values and feature.string_stats.common_stats.max_num_values?
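The comparison described above can be sketched in plain Python. This is an illustration only, not TFDV's actual implementation: the classes `CommonStats` and `ValueCount` and the function `has_low_number_values` are hypothetical stand-ins, and the assumption that the check is against the lower bound of value_count is mine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CommonStats:
    """Hypothetical stand-in for the common_stats fields named in this thread."""
    min_num_values: int
    max_num_values: int
    num_missing: int = 0


@dataclass
class ValueCount:
    """Hypothetical stand-in for the schema's value_count constraint."""
    min: Optional[int] = None
    max: Optional[int] = None


def has_low_number_values(stats: CommonStats, value_count: ValueCount) -> bool:
    """True when the smallest observed value count is below the schema minimum."""
    return value_count.min is not None and stats.min_num_values < value_count.min


# num_missing is 0, yet at least one example carries zero values.
stats = CommonStats(min_num_values=0, max_num_values=1, num_missing=0)
print(has_low_number_values(stats, ValueCount(min=1, max=1)))  # True
```

Under this reading, the check looks only at min_num_values, so it could fire even when num_missing is 0, for instance if some example has the feature present with an empty list of values. That is one possible explanation, not a confirmed diagnosis.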
@martin-laurent Can you please share the feature.string_stats.common_stats.min_num_values and feature.string_stats.common_stats.max_num_values as described by @paulgc. Thanks!