Unclear anomaly_info #62

Closed
martin-laurent opened this issue May 2, 2019 · 3 comments
@martin-laurent

I have inferred a schema and computed statistics on a TFRecord dataset (unfortunately, I cannot share the dataset).
When I validate the statistics against their "own" schema, I get anomalies for some of my features, such as:

u"'country_of_origin'": 
description: "Some examples have fewer values than expected."
severity: ERROR
short_description: "Missing values"
reason {
  type: FEATURE_TYPE_LOW_NUMBER_VALUES
  short_description: "Missing values"
  description: "Some examples have fewer values than expected."
}
path {
  step: "country_of_origin"
}

This is the relevant (inferred) part of my schema:

feature {
  name: "country_of_origin"
  value_count {
    min: 1
  }
  type: BYTES
  presence {
    min_count: 1
  }
}

This is the representation of the relevant part of the statistics:
[Screenshot of the statistics for this feature: "Screen Shot 2019-05-02 at 10 31 27 AM"]

I have several problems with this:

  1. Something odd seems to have happened with the quotes around the key of the anomaly_info.
  2. The anomaly says "Missing values", but my statistics say 0% missing values.
  3. In general, TFDV does not make clear which numbers are compared to raise an anomaly. For example, is this error caused by the value_count or the presence.min_count part of my schema? And which of these fields in my statistics is used?
    • feature.string_stats.common_stats.min_num_values
    • feature.string_stats.common_stats.num_missing
    • feature.string_stats.common_stats.tot_num_values
    • feature.string_stats.common_stats.avg_num_values
      I suspect it is the first one, but the only way for me to be sure is to dig into the C++ code of this repository, and since I don't know the codebase, I'm not 100% sure.
  4. It also says: "Some examples have fewer values than expected." Ideally, I would like more information to solve the problem or to assess how severe it is: how many examples, which ones, how many values are missing, etc.
  5. Depending on these results, I might want to allow a portion of my dataset to have fewer values than required. But how could I define this?

I would expect the content of the anomaly to give me more information about what exactly the problem is and which comparison is failing.

Ideally, it would also suggest a way to "silence" the anomaly if I wish (could I change the severity of this error, for example?) or where to look in my dataset to find the root of the problem.
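Points 4 and 5 above can be explored outside TFDV with a small sketch. The code below does not use the TFDV API; it assumes the examples have already been decoded into plain Python dicts mapping feature names to lists of values, and the helper name `find_low_value_count` is hypothetical. It lists which examples carry the feature with fewer values than the schema's `value_count.min`, and what fraction of the dataset they represent.

```python
from typing import Dict, List, Sequence


def find_low_value_count(
    examples: Sequence[Dict[str, List[bytes]]],
    feature: str,
    min_values: int = 1,
) -> List[int]:
    """Return indices of examples where `feature` is present but has
    fewer than `min_values` values (hypothetical helper, not TFDV API)."""
    offenders = []
    for index, example in enumerate(examples):
        values = example.get(feature)
        if values is None:
            # Feature absent altogether: that is a presence question,
            # not a value_count question, so it is skipped here.
            continue
        if len(values) < min_values:
            offenders.append(index)
    return offenders


# Toy data standing in for decoded TFRecord examples (an assumption).
examples = [
    {"country_of_origin": [b"FR"]},
    {"country_of_origin": []},          # present, but zero values
    {"other_feature": [b"x"]},          # feature absent
    {"country_of_origin": [b"DE", b"US"]},
]

offenders = find_low_value_count(examples, "country_of_origin", min_values=1)
fraction = len(offenders) / len(examples)
print(offenders, fraction)
```

Knowing the offending indices and their share of the dataset is what would let one decide whether to fix the data or relax the schema.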

@gowthamkpr gowthamkpr self-assigned this May 2, 2019
@paulgc
Member

paulgc commented May 2, 2019

@martin-laurent Thanks for the feedback. We are working towards making the documentation better and providing more information about the anomalous examples (e.g., statistics over anomalous examples for each type of anomaly).

In this specific case, the anomaly means that feature.string_stats.common_stats.min_num_values is less than the value_count.min in the schema. Can you share the feature.string_stats.common_stats.min_num_values and feature.string_stats.common_stats.max_num_values?
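The comparison described above can be sketched in plain Python. This is not TFDV's implementation; the `CommonStats` class merely mirrors a few `common_stats` field names, and `check_value_count` is a hypothetical helper. The sample numbers are invented to illustrate one way, stated here as an assumption, that 0% missing values and this anomaly could coexist: every example has the feature, but at least one has an empty value list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CommonStats:
    """Subset of common_stats fields relevant to this check."""
    num_non_missing: int
    num_missing: int
    min_num_values: int
    max_num_values: int


def check_value_count(stats: CommonStats, value_count_min: int) -> Optional[str]:
    """Return the anomaly type if the smallest per-example value count
    falls below the schema's value_count.min, else None (hypothetical)."""
    if stats.min_num_values < value_count_min:
        return "FEATURE_TYPE_LOW_NUMBER_VALUES"
    return None


# Invented numbers: no example is missing the feature (num_missing == 0),
# yet at least one example has zero values (min_num_values == 0).
stats = CommonStats(num_non_missing=100, num_missing=0,
                    min_num_values=0, max_num_values=1)

result = check_value_count(stats, value_count_min=1)
print(result)
```

Under this reading, `num_missing`, `tot_num_values`, and `avg_num_values` play no part in this particular anomaly; only `min_num_values` is compared.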

@gowthamkpr

@martin-laurent Can you please share the feature.string_stats.common_stats.min_num_values and feature.string_stats.common_stats.max_num_values, as requested by @paulgc? Thanks!

@gowthamkpr

Closing this issue, as it has been in 'awaiting response' for more than a week. Please add further comments and we can reopen it.
