Collect feature stats #30

Closed
nevillelyh opened this issue Aug 9, 2017 · 12 comments
Labels
enhancement New feature or request

Comments

@nevillelyh
Contributor

Could be useful for debugging. A couple of thoughts

  • Opt-in for user specified columns
  • Pre/post transformation
  • How do we deal with vectors, etc.?
@nevillelyh added the enhancement label Aug 9, 2017
@nevillelyh
Contributor Author

I'm thinking maybe an extra featureStatistics method on FeatureExtractor, so this is done post-transformation. We can easily build one Algebird Moments per column.

Optionally we could also let users opt in to a subset of transformers, but that's extra complexity.

OTOH I'm not sure we can compute stats pre-transformation, since that doesn't make sense for all inputs, e.g. strings or vectors.
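
To make the "one moment aggregator per column" idea concrete, here is a minimal sketch in plain Python. It is not Algebird's or featran's API: the `Moments` class and the `column_moments` helper are hypothetical names, and the sketch assumes each post-transformation row is a fixed-width list of floats. The merge step is the standard pairwise update for count, mean and sum of squared deviations, which is associative and therefore usable in a distributed reduce.

```python
from dataclasses import dataclass
from functools import reduce
import math


@dataclass(frozen=True)
class Moments:
    """Hypothetical per-column aggregate: count, mean, sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    @staticmethod
    def of(x: float) -> "Moments":
        # Aggregate for a single observation.
        return Moments(1, float(x), 0.0)

    def plus(self, other: "Moments") -> "Moments":
        # Pairwise merge of two partial aggregates; the empty Moments() is the identity.
        if self.n == 0:
            return other
        if other.n == 0:
            return self
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Moments(n, mean, m2)

    @property
    def variance(self) -> float:
        # Population variance; NaN when no values were seen.
        return self.m2 / self.n if self.n else float("nan")

    @property
    def stddev(self) -> float:
        return math.sqrt(self.variance)


def column_moments(rows):
    """Reduce fixed-width rows of floats into one Moments per column."""
    rows = list(rows)
    if not rows:
        return []
    zero = [Moments()] * len(rows[0])
    return reduce(
        lambda acc, row: [a.plus(Moments.of(x)) for a, x in zip(acc, row)],
        rows,
        zero,
    )
```

Because `plus` merges partial aggregates, the per-partition results of `column_moments` can be combined column by column afterwards, which is the property a reduce pass relies on.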

@yonromai
Contributor

yonromai commented Oct 4, 2017

FWIW here are a few things I've commonly checked in the past:

  • Plot distribution per feature
  • Descriptive stats on features: median, stddev, top_k_values, p05, p95, min, max
  • Count Missing values, NaNs, outliers...
  • Evolution of feature stats across batches (say, over several weeks)
  • Correlation between target and features, and between features themselves
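
As a rough illustration of the descriptive checks listed above, the following sketch summarises a single raw numeric column using only the Python standard library. The `describe` function and its output keys are hypothetical, and it assumes the column fits in memory, that `None` marks a missing value, and that p05/p95 mean the 5% and 95% quantiles.

```python
import math
import statistics
from collections import Counter


def describe(values, top_k=3):
    """Summarise one numeric column; None is missing, NaN is counted separately."""
    def is_nan(v):
        return isinstance(v, float) and math.isnan(v)

    missing = sum(1 for v in values if v is None)
    nans = sum(1 for v in values if is_nan(v))
    clean = sorted(v for v in values if v is not None and not is_nan(v))
    if len(clean) < 2:
        raise ValueError("need at least two non-missing values")

    # n=20 yields 19 cut points: the first is the 5% quantile, the last the 95%.
    cuts = statistics.quantiles(clean, n=20, method="inclusive")
    return {
        "count": len(clean),
        "missing": missing,
        "nans": nans,
        "min": clean[0],
        "max": clean[-1],
        "median": statistics.median(clean),
        "stddev": statistics.pstdev(clean),
        "p05": cuts[0],
        "p95": cuts[-1],
        "top_k": Counter(clean).most_common(top_k),
    }
```

This is an exact, single-machine computation; on large datasets the median, quantiles and top-k values would need approximate or sampled variants.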

@nevillelyh
Contributor Author

@yonromai questions:

  • Plot distribution per feature
    Is this pre- or post-transformation? Some transformers, e.g. binarizer, may change the distribution, so I guess pre is more useful in that case? In some cases we might want post, e.g. one-hot encoder?
  • Descriptive stats on features
    These should be easy. Not sure if p05/p95 make sense here. Did you mean the 5% & 95% quantiles?
  • Count missing values, NaNs, outliers...
    This should be pre-transformation right?
  • Evolution
    This is outside the scope of featran, but maybe suitable for the system one layer above?

@yonromai
Contributor

yonromai commented Oct 4, 2017

  • Plot distribution per feature
    Right, mostly pre transformation. This work can inform what transformations to apply
  • Descriptive stats on features
    Yes
  • Count missing values, NaNs, outliers...
    Yes
  • Evolution
    👍

@nevillelyh
Contributor Author

So the pre-transform stats are potentially doable in the same reduce pass as feature settings, while the post-transform stats definitely require another reduce pass. Both should be opt-in, obviously. We could also warn in cases of high-dimensional features like *hot encoders.

@richwhitjr do you think it's worth doing the pre-transform stats in the same reduce as feature settings? IMO it's complex and probably not worth it, since the user is most likely doing it ad hoc to explore data.

@richwhitjr
Contributor

Seems complex, and it would be hard to mix monoids as needed. For example, the transformation may want a QTree, but for stats you will need a Moment Monoid. An ad hoc "analysis" phase sounds promising though. I wonder if this will require another type of Spec, or if users will expect the same type of stats for the same transformers.

@marcromeyn

marcromeyn commented Oct 6, 2017

Could be nice to have it output the protobuf format that is required by Facets so that we get the feature visualizations for free.


See: https://github.com/PAIR-code/facets/blob/master/facets_overview/proto/feature_statistics.proto

@nevillelyh
Contributor Author

nevillelyh commented Oct 11, 2017

@marcromeyn Facets seems to support a lot more things than we discussed here. Just checking if we can drop some to narrow the scope.

  • Weighted feature - Right now we only have weighted labels for n-hot encoders. Do we need it for anything else? Can we drop it altogether?
  • Median, histogram, rank histogram, frequency and value - These are hard to brute force for large datasets, is approximation acceptable? If so do we need tunable precision (could make it more complex)?
  • Bytes input - There's no bytes type in featran, guess we can drop?
  • Vectors - How do vectors (Array[Double]) fit in here?

@nevillelyh
Contributor Author

Seems it could be a lot of work to replicate all the logic in Facets. I'm wondering if it's easier and better to just sample in featran and do the statistics summarization in Facets?

@marcromeyn @yonromai @richwhitjr thoughts?
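
One way the "just sample" option could look is a fixed-size uniform sample drawn in a single pass, which can then be handed to an external tool for summarization. The sketch below uses classic reservoir sampling (Algorithm R) in plain Python; `reservoir_sample` is a hypothetical helper, not part of featran, and it assumes records arrive as an iterable of unknown length.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return up to k records chosen uniformly at random from an iterable."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # both endpoints inclusive
            if j < k:
                sample[j] = record
    return sample
```

A single-pass sampler like this keeps memory bounded by `k`, at the cost of the downstream statistics being estimates rather than exact values.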

@richwhitjr
Contributor

I like the idea of keeping the statistic summarization internal but making it easy for someone to take the stats and dump them to something like Facets. In the future we could have a subproject to help do this in one step.

I just worry about serialization and dependency issues we may run into when introducing a new library, since featran has to support a lot of different distributed systems.

@yonromai
Contributor

It turns out that Facets has the ability to import TFRecord files and compute stats on them. I tried it on my data and it seems to work fine, although it is quite slow on big files.
So it should be really easy to sample TFRecord files from the featran output and import them into Facets.

@nevillelyh
Contributor Author

It's easier to do this in TFDV, closing.

4 participants