value_counts is slow for nested columns #115

Closed
manycoding opened this issue Jun 18, 2019 · 7 comments

Comments

@manycoding
Contributor

More data to follow.
Because value_counts is slow, any big df makes report_all awfully slow.

  1. See if it can be improved
  2. If not, exclude get_categories from report_all
    or add a parameter like (fast=True)
@manycoding manycoding added this to the 0.3.6 milestone Jun 18, 2019
@manycoding
Contributor Author

Introduced in #100

@manycoding
Contributor Author

manycoding commented Jun 18, 2019

The slowest are object columns, specifically nested data. The difference is 10-1000x.
@ejulio @alexandr1988, I am thinking about alternatives:

  1. Ignore object columns
  2. Add a fast argument, so the category method won't use object columns and schema validation uses fastjsonschema (see "Use fastjsonschema for huge jobs by default", #5): Arche(..fast=True). Default False.
  3. Leave it as it is (users are expected to use specific rules, but the rule has a progress bar)
  4. Use flat_df - very fast, but the categories picture in this case is not the same since the data is flattened. It doesn't make any sense since flat data is sparse.
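
For illustration, options 1 and 2 could be sketched roughly as below. This is only a sketch under assumptions: `get_categories` here is a hypothetical stand-in, not the actual arche method, the `max_uniques` threshold is invented, and the nested values are tuples so that they stay hashable in the example.

```python
import pandas as pd


def get_categories(df: pd.DataFrame, max_uniques: int = 10, fast: bool = False) -> dict:
    """Hypothetical helper: value counts for low-cardinality columns."""
    # With fast=True, skip object-dtype columns, which is where nested data lives.
    candidates = df.select_dtypes(exclude=["object"]) if fast else df
    categories = {}
    for name in candidates.columns:
        counts = candidates[name].value_counts(dropna=False)
        if len(counts) <= max_uniques:
            categories[name] = counts
    return categories


df = pd.DataFrame({
    "price": [10, 10, 20],
    # Assumption: tuples stand in for nested data; they give an object column.
    "photos": [("a.jpg",), ("a.jpg",), ("b.jpg", "c.jpg")],
})
fast_result = get_categories(df, fast=True)   # only "price" is counted
full_result = get_categories(df, fast=False)  # "photos" is counted as well
```

The trade-off is that with `fast=True` nested columns silently get no category report, so the flag would need to be documented clearly.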

@victor-torres
Contributor

If there's no cache, it seems like you're calculating value_counts twice.
See: https://github.com/scrapinghub/arche/pull/100/files#r295002793

@victor-torres
Contributor

You could use a generator to reduce the number of calls to the function.
See: https://github.com/scrapinghub/arche/pull/100/files#r295004052
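
A minimal sketch of the generator idea, assuming the goal is to compute value_counts once per column and reuse the result for both the size check and the report. The names `iter_value_counts`, `find_categories` and `max_uniques` are hypothetical and not taken from the arche code.

```python
import pandas as pd


def iter_value_counts(df: pd.DataFrame):
    """Yield (column, counts) pairs, calling value_counts once per column."""
    for name in df.columns:
        yield name, df[name].value_counts(dropna=False)


def find_categories(df: pd.DataFrame, max_uniques: int = 10) -> dict:
    # The same counts Series serves the cardinality check and the result,
    # so no column is counted twice.
    return {
        name: counts
        for name, counts in iter_value_counts(df)
        if len(counts) <= max_uniques
    }


df = pd.DataFrame({"color": ["red", "red", "blue"], "id": [1, 2, 3]})
categories = find_categories(df, max_uniques=2)  # keeps "color", drops "id"
```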

@manycoding
Contributor Author

manycoding commented Jun 18, 2019

@victor-torres It doesn't make much difference since object columns are that slow anyway. Pandas has its own caching.
But I compared performance:
[Screenshot 2019-06-18 at 16 35 41]

@ejulio

ejulio commented Jun 19, 2019

No major comments here.
Maybe I can dig a bit into pandas internals.
I'd only mention that we would need to be careful about fast argument because we already use something like that for fastjsonschema and we need to avoid misinterpretations :)

victor-torres added a commit to victor-torres/arche that referenced this issue Jun 19, 2019
@manycoding
Contributor Author

manycoding commented Jun 19, 2019

> I'd only mention that we would need to be careful about fast argument

I thought we could add fast and, if it's True, use the fastest validations:

  1. excluding nested data from categories (exclusion is done by keeping only the columns shared with flat_df, I think that's the fastest way. But we need to drop nan columns from flat_df first, for this dataset it takes 1 minute)
  2. fastjsonschema instead of jsonschema

But we need to find a solution, since right now value_counts degrades performance by 100-1000x if nested columns are present, which makes this particular dataset untestable.

Without nested data it takes around 30 seconds.
Full dataset - I stopped after 30 minutes. One nested column (photos) takes 13 minutes.
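
The exclusion by common columns could look roughly like this. It is a sketch under assumptions: `pd.json_normalize` stands in for however flat_df is actually built in arche, and the records are made up.

```python
import pandas as pd

# Made-up records; "photos" is the nested field.
records = [
    {"price": 10, "photos": {"url": "a.jpg", "caption": None}},
    {"price": 20, "photos": {"url": "b.jpg", "caption": None}},
]
df = pd.DataFrame(records)

# Assumption: json_normalize approximates flat_df ("photos.url", "photos.caption").
flat_df = pd.json_normalize(records)

# Drop columns that are entirely NaN before comparing column names.
flat_df = flat_df.dropna(axis=1, how="all")

# Columns present in both frames are the non-nested ones.
common = df.columns.intersection(flat_df.columns)
categories_input = df[common]  # nested "photos" is excluded
```

Nested columns drop out because flattening renames them (here to `photos.url`), so they no longer match a column name in the original df.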

@manycoding manycoding changed the title value_counts is slow value_counts is slow for nested columns Jun 19, 2019