-
Notifications
You must be signed in to change notification settings - Fork 855
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance of apriori #619
Conversation
old_combination is sorted, thus its max() is its last element. Since items_types_in_previous_step is a Numpy array, we can find all valid elements with a single call, which makes inner loop shorter.
Let generate_new_combinations return ints instead of tuples, and collect them with np.fromiter. Slower with low_memory=True, this will be fixed by next commit.
Verbose output has to be modified, since we loop on valid combinations only. Performance is now equivalent to better than version with low_memory=False. Adjust test_fpbase.py output.
Here is the benchmark script I used to compare master and this branch:
Results:
Results with the smallest dataset are not really relevant IMO, times are too small, and it is unlikely that one would use sparse matrix or lowmem option. Anyway they are provided here to show that there is no dramatic regression. |
Thanks for the PR. The benchmarks look pretty convincing. And I agree with you, it will be unlikely that someone would apply that on small data frames. Also, then, the performance differences are negligible because in either case it would finish running in non-substantial time. |
If all columns are boolean, there is nothing to check. In apriori.py, call valid_input_check.
I found #549 when looking for other benchmarks, and ran the same tests with this script:
Here timings are made on the whole functions. I just pushed a commit to improve checks on input dataframe. Results are:
|
I forgot to mention that 6bd97d4 only improves timings for boolean dataframes; users should be advised to prefer them over integer dataframes, or there could be an option to disable this check, which takes most of the time. |
I agree with you. Maybe we could display a warning if people use integer arrays to subtly nudge people towards using bool arrays. Could be in the form of a deprecation warning, even, and then phase out integer arrays in the next versions. |
Was also just running the benchmarks (the 2nd set you posted, kosarak-50k.dat) using the old and the improved version(s).
Hm, for some reason I don't get the same improvements you got. I was running that on an MBP. I will run the same code on my workstation later (running some other stuff there now) to see what's going on. |
Pls ignore the numbers above ... somehow I had a clobbered environment. The improvements look great -- probably mostly due to the improved checking, but still, I think we should merge the whole PR. That's really great. Thanks! Could you add a changelog entry though? PS: my updated benchmark results are shown below
|
Replace 0/1 by False/True in docstrings of apriori, fpgrowth and fpmax to promote usage of boolean arrays.
Changelog entry added, please let me know if this is not clear. About your results, they look consistent with mine; there is a constant difference between 3056e4a and 6bd97d4, which depends only on input dataframe size, for instance 4.2s on your workstation. MBP is more volatile, like my laptop. Between master and 3056e4a, the major difference is with |
Looks good, thanks! I guess it is good to merge then? |
Okay for me. |
Great! I can open an issue reg warnings and encouraging users to use boolean type inputs and revisit this another day then :) |
Description
This PR improves performance of apriori for medium to large datasets, in particular with low_memory=True. Situation is less clear for very small datasets.
Related issues or pull requests
Pull Request Checklist
./docs/sources/CHANGELOG.md
file (if applicable)./mlxtend/*/tests
directories (if applicable)mlxtend/docs/sources/
(if applicable)PYTHONPATH='.' pytest ./mlxtend -sv
and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g.,PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv
)flake8 ./mlxtend