value_counts is slow for nested columns #115

manycoding · 2019-06-18T00:35:30Z

More data to follow
Because value_counts is slow, any big df makes report_all awfully slow.

See if it can be improved.
If not, exclude get_categories from report_all
or add a parameter such as fast=True.

manycoding · 2019-06-18T19:23:49Z

Introduced in #100

manycoding · 2019-06-18T19:55:48Z

The slowest are object columns, specifically nested data. The difference is 10-1000x.
@ejulio @alexandr1988, I am thinking about alternatives:

- Ignore object columns.
- Add a fast argument, so the category method won't use object columns and schema validation uses fastjsonschema (see "Use fastjsonschema for huge jobs by default" #5): Arche(..fast=True). Default False.
- Leave it as it is (users are expected to use specific rules, but the rule has a progress bar).
- ~~Use flat_df: very fast, but the categories picture is not the same in this case since the data is flattened.~~ Doesn't make sense since flat data is sparse.
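A minimal sketch of the first two alternatives, assuming a hypothetical helper named `find_category_columns` with a hypothetical `fast` flag and `max_uniques` threshold. None of these names come from arche; they only illustrate skipping object columns before calling value_counts.

```python
import pandas as pd


def find_category_columns(df, max_uniques=10, fast=False):
    """Return {column: value_counts} for columns with few distinct values.

    Hypothetical helper for illustration; it is not arche's get_categories.
    """
    columns = df.columns
    if fast:
        # Skip object columns entirely, since they are the slow ones.
        columns = df.select_dtypes(exclude="object").columns
    categories = {}
    for name in columns:
        counts = df[name].value_counts(dropna=False)
        if len(counts) <= max_uniques:
            categories[name] = counts
    return categories


df = pd.DataFrame(
    {
        "price": [1, 1, 2, 2],
        # Explicit object dtype, as a text column would have in older pandas.
        "color": pd.Series(["red", "red", "blue", "red"], dtype=object),
    }
)
fast_result = find_category_columns(df, fast=True)
full_result = find_category_columns(df, fast=False)
```

The trade-off is visible here: with `fast=True` the plain string column `color` is skipped along with any nested data, because both are stored as object columns.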

victor-torres · 2019-06-18T20:03:35Z

If there's no cache, it seems like you're calculating value_counts twice.
See: https://github.com/scrapinghub/arche/pull/100/files#r295002793

victor-torres · 2019-06-18T20:06:33Z

You could use a generator to reduce the number of calls to the function.
See: https://github.com/scrapinghub/arche/pull/100/files#r295004052
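The two review suggestions (compute value_counts once per column, and use a generator) could look roughly like this. The function name `iter_category_counts` and the `max_uniques` threshold are assumptions made for the sketch, not code from the linked pull request.

```python
import pandas as pd


def iter_category_counts(df, max_uniques=10):
    """Lazily yield (column, counts) pairs for low-cardinality columns."""
    for name in df.columns:
        # The only value_counts call for this column; the result is reused
        # both for the size check and as the yielded value.
        counts = df[name].value_counts(dropna=False)
        if len(counts) <= max_uniques:
            yield name, counts


df = pd.DataFrame({"status": ["ok", "ok", "fail"], "id": [1, 2, 3]})
categories = dict(iter_category_counts(df, max_uniques=2))
```

This only removes the duplicate call. It does not make a single value_counts call on a nested column any cheaper.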

manycoding · 2019-06-18T20:36:26Z

@victor-torres It doesn't make much difference since object columns are that slow anyway. Pandas has its own caching.
But I compared performance

ejulio · 2019-06-19T11:59:34Z

No major comments here.
Maybe I can search a bit on pandas internals.
I'd only mention that we would need to be careful with the fast argument, because we already use something like that for fastjsonschema and we need to avoid misinterpretation :)

manycoding · 2019-06-19T17:19:46Z

> I'd only mention that we would need to be careful about fast argument

I thought we could add fast and, if it's True, use the fastest validations:

- excluding nested data from categories (the exclusion is done by keeping only the columns shared with flat_df, which I think is the fastest way; but we need to drop NaN columns from flat_df first, which takes 1 minute for this dataset)
- fastjsonschema instead of jsonschema
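The column-intersection idea can be sketched as follows. The `flatten` helper and the sample items are hypothetical stand-ins; how arche actually builds flat_df is not shown in this thread.

```python
import pandas as pd


def flatten(item, prefix=""):
    """Flatten nested dicts and lists into a one-level dict (hypothetical helper)."""
    if isinstance(item, dict):
        pairs = item.items()
    elif isinstance(item, list):
        pairs = enumerate(item)
    else:
        return {prefix: item}
    flat = {}
    for key, value in pairs:
        name = f"{prefix}_{key}" if prefix else str(key)
        flat.update(flatten(value, name))
    return flat


items = [
    {"name": "a", "price": 1, "photos": [{"url": "x"}, {"url": "y"}]},
    {"name": "b", "price": 1, "photos": [{"url": "z"}]},
]
df = pd.DataFrame(items)

# Drop all-NaN columns from the flattened frame first, as noted above.
flat_df = pd.DataFrame([flatten(i) for i in items]).dropna(axis=1, how="all")

# Nested columns such as "photos" only exist in flat_df under expanded names
# (photos_0_url, ...), so the intersection keeps just the non-nested columns.
simple_columns = df.columns.intersection(flat_df.columns)
```

Categories would then be computed only for `simple_columns`, so value_counts never runs on nested data.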

But we need to find a solution, since right now value_counts degrades performance by 100-1000x if nested columns are present, which makes this particular dataset untestable.

Without nested data it takes around 30 seconds.
Full dataset - I stopped after 30 minutes. One nested column (photos) takes 13 minutes.

manycoding added the Type: Performance label Jun 18, 2019

manycoding added this to the 0.3.6 milestone Jun 18, 2019

victor-torres added a commit to victor-torres/arche that referenced this issue Jun 18, 2019

avoid having to calculate value_counts twice when getting category; t…

49c3cec

…his should help with scrapinghub#115

victor-torres mentioned this issue Jun 18, 2019

Avoid having to calculate value_counts twice when getting category #116

Merged

victor-torres added a commit to victor-torres/arche that referenced this issue Jun 19, 2019

avoid having to calculate value_counts twice when getting category; t…

ef4d5e8

…his should help with scrapinghub#115

manycoding changed the title ~~value_counts is slow~~ value_counts is slow for nested columns Jun 19, 2019

manycoding added a commit that referenced this issue Jun 20, 2019

Infer categories from sampled df, closes #115

bc8222a

manycoding closed this as completed in a7dd6b3 Jul 12, 2019
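
The closing commit message mentions inferring categories from a sampled df. The thread does not show that code, so the following is only a sketch of the general idea, with a hypothetical function name and hypothetical `sample_size`, `max_uniques` and `seed` parameters.

```python
import pandas as pd


def candidate_category_columns(df, sample_size=1000, max_uniques=10, seed=0):
    """Pick likely category columns by inspecting a random sample of rows."""
    sample = df.sample(n=min(sample_size, len(df)), random_state=seed)
    # Values are stringified so that unhashable ones (lists, dicts) can be
    # counted; this is an assumption of the sketch, not arche's approach.
    return [
        name
        for name in sample.columns
        if sample[name].astype(str).nunique() <= max_uniques
    ]


rows = 5000
df = pd.DataFrame(
    {
        "id": range(rows),
        "status": ["ok", "fail"] * (rows // 2),
        "photos": [[{"url": str(i)}] for i in range(rows)],
    }
)
candidates = candidate_category_columns(df)
```

The cost is then bounded by the sample size rather than the full frame. The full value_counts would only need to run on the few columns that pass the check.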
