
Apriori optimization #567

Merged: 9 commits from jmayse:apriori_optimization into rasbt:master, Jul 19, 2019

Conversation

jmayse (Contributor) commented Jul 18, 2019

Description

Currently, the implementation of the apriori algorithm uses a slow iteration to examine each item combination for above-threshold support. This iteration can be replaced by matrix operations that are generally faster, but use slightly more memory in some cases.
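
To illustrate the idea (a rough sketch only, not the PR's actual code; the array shapes and the support_vectorized name are made up for this example): with a boolean one-hot transaction matrix, the support of many candidate itemsets can be computed in one vectorized step instead of looping over combinations.

import numpy as np

def support_vectorized(onehot, candidates):
    # onehot: bool array of shape (n_transactions, n_items)
    # candidates: int array of shape (n_candidates, itemset_size), where each
    #             row holds the column indices of one candidate itemset
    # onehot[:, candidates] -> (n_transactions, n_candidates, itemset_size);
    # a transaction supports an itemset only if every item in it is present.
    return np.all(onehot[:, candidates], axis=2).mean(axis=0)

onehot = np.array([[1, 1, 0, 1],
                   [1, 0, 1, 1],
                   [0, 1, 1, 1]], dtype=bool)
candidates = np.array([[0, 3], [1, 2]])        # two candidate 2-itemsets
print(support_vectorized(onehot, candidates))  # -> approximately [0.667, 0.333]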

While still generally slower than fpgrowth, this implementation of apriori is 3-6x faster than the current one (benchmarks below):

https://gist.github.com/jmayse/ad688d6a7fd842269996a701d7cecd4c

https://gist.github.com/jmayse/7c76a2d838ac164b923a47b29527f2ed

The current use of the verbose option doesn't really make sense without the iterator, so I have replaced its output with a statement that reports the number of combinations currently being compared.

The exit behavior for the main while loop has also changed from setting max_itemset = 0 to using break. If this is undesirable, we can easily reverse this change.
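
A simplified sketch of the two exit styles being compared (illustrative only, not the actual apriori source):

max_itemset = 1
while max_itemset:
    frequent = []    # pretend no candidates survived the support threshold
    if not frequent:
        # Old behavior (roughly): set max_itemset = 0 so the `while max_itemset`
        # condition fails on the next check. New behavior: exit directly.
        break
    max_itemset += 1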

Related issues or pull requests

Fixes #566

Pull Request Checklist

  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
  • Modify documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
  • Ran PYTHONPATH='.' pytest ./mlxtend -sv and made sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
  • Checked for style issues by running flake8 ./mlxtend

coveralls commented Jul 18, 2019

Coverage Status

Coverage increased (+0.05%) to 92.124% when pulling f0788a7 on jmayse:apriori_optimization into 0abf710 on rasbt:master.

rasbt (Owner) commented Jul 18, 2019

Thanks a lot for the PR! 3-6x is definitely a substantial improvement and very welcome :). Regarding the implementation, it'll probably be more memory hungry since everything is done within an array instead of iterating over a generator, but I suppose it's not an issue on big datasets based on your experience with this code? (I assume, big datasets where this would be prohibitive would be so large that the iterative version would run ~forever anyways).

Otherwise, the alternative would be to enable both options via a flag "in_memory=False" (old version) and "in_memory=True" (new version). But I guess that's overkill!?

jmayse (Contributor, Author) commented Jul 18, 2019

I don't really think that is overkill at all. The user should be able to specify a lightweight option in terms of memory usage for applications on low-memory platforms (a Raspberry Pi, for instance). Additionally, the previous method using the iterator will always work, given enough time; there's value in that on its own. I'll add it back in under a flag, with testing around the flag behavior.
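
(For reference, the flag discussed here ended up being called low_memory; the DataFrame below is made up purely to show what a call with the flag looks like.)

import pandas as pd
from mlxtend.frequent_patterns import apriori

# Illustrative one-hot encoded transactions.
df = pd.DataFrame({'apple':  [True, True, False, True],
                   'banana': [True, False, True, True],
                   'cherry': [False, True, True, True]})

# low_memory=True selects the more memory-frugal code path discussed above;
# low_memory=False (the default) uses the vectorized implementation.
frequent = apriori(df, min_support=0.5, use_colnames=True, low_memory=True)
print(frequent)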

Review comment on the diff (excerpt of the test code under discussion):

        print(out.getvalue())
    except TypeError:
        # If there is no low_memory argument, don't run the test.
        assert True

jmayse (Contributor, Author) commented:
I really do not like the try-except-else pattern, but it's common in Python and works here. I'm very, very open to suggestions...
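
Pieced together from the excerpt above and the alternative suggested below, the try/except version of the test presumably looked roughly like this (a reconstruction, not the exact diff):

def test_low_memory_flag(self):
    try:
        with captured_output() as (out, err):
            _ = self.fpalgo(self.df, low_memory=True, verbose=1)
            print(out.getvalue())
    except TypeError:
        # If there is no low_memory argument, don't run the test.
        assert True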

rasbt (Owner) replied Jul 18, 2019:
Hm, since it's a function, that's a tricky one ... one solution, which is not that much better, would be

import inspect

def test_low_memory_flag(self):
    if 'low_memory' in inspect.getargspec(self.fpalgo)[0]:
        with captured_output() as (out, err):
            _ = self.fpalgo(self.df, low_memory=True, verbose=1)
            print(out.getvalue())
    else:
        # If there is no low_memory argument, don't run the test.
        assert True

jmayse (Contributor, Author) replied:
Actually, I like that better. I've changed this test to use inspect.signature and removed that pesky debug print that snuck through
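
An inspect.signature variant of that test would presumably look something like this (a sketch, not the exact committed code):

import inspect

def test_low_memory_flag(self):
    if 'low_memory' in inspect.signature(self.fpalgo).parameters:
        with captured_output() as (out, err):
            _ = self.fpalgo(self.df, low_memory=True, verbose=1)
        # assertions on the captured verbose output would go here
    else:
        # If there is no low_memory argument, don't run the test.
        assert True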

rasbt (Owner) replied:
sounds good!

rasbt (Owner) commented Jul 18, 2019

> I don't really think that is overkill at all. The user should be able to specify a lightweight option in terms of memory usage for applications on low-memory platforms (a Raspberry Pi, for instance). Additionally, the previous method using the iterator will always work, given enough time; there's value in that on its own. I'll add it back in under a flag, with testing around the flag behavior.

Sounds good then. Also, the low-resource env makes sense. Thanks!

rasbt (Owner) left a review:
Everything looks great; there are just a few stylistic issues I noted below.

Review comments on mlxtend/frequent_patterns/apriori.py:
@@ -105,6 +111,19 @@ def apriori(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0):
    http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/

    """

    def _support(_x, _n_rows, _is_sparse):
rasbt (Owner) commented on this hunk:
Not really important since private functions don't show up in the docs, but we usually use the NumPy documentation style. That's because I wrote a Python docstring-to-Markdown parser that automatically generates the documentation for the HTML website. No need to change it here, though, because, like I said, it won't appear on the website.
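
For reference, a NumPy-style docstring for the new helper might look something like this (the parameter descriptions are guesses for illustration, not the final docstring):

def _support(_x, _n_rows, _is_sparse):
    """Compute the support of the itemsets encoded in `_x`.

    Parameters
    ----------
    _x : matrix of bools or binary values
        One-hot encoded transaction data for the candidate items.
    _n_rows : numeric
        Number of transactions (rows) in the original input.
    _is_sparse : bool
        Whether `_x` is a sparse matrix.

    Returns
    -------
    np.array, shape = (n_itemsets,)
        Fraction of transactions that contain each itemset.
    """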

jmayse (Contributor, Author) replied:
Well, this is definitely the time to change it if you want me to. I'm happy to do it to conform to the overall style.

jmayse (Contributor, Author) added:
In fact, if you don't mind, let me just do that real quick... even if it doesn't appear on the website, we should be consistent with the overall style. Why build in inconsistencies?

jmayse (Contributor, Author) commented Jul 19, 2019:
Done -> 04e1e31

rasbt (Owner) replied:
ok sure, please go ahead then. Thanks!

pep8speaks commented Jul 19, 2019

Hello @jmayse! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-07-19 15:22:32 UTC

rasbt (Owner) commented Jul 19, 2019

I think everything should be good to go now (once the tests pass). I just added an additional test so that the whole suite runs with both low_memory=True and low_memory=False. Besides that, I also added the Changelog entry.
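
As an illustration of the idea (not necessarily how the mlxtend test suite is actually organized; the DataFrame is made up), running the same checks under both modes could be done with pytest parametrization:

import pandas as pd
import pytest
from mlxtend.frequent_patterns import apriori

# Small made-up one-hot input for the sketch.
df = pd.DataFrame({'apple':  [True, True, False, True],
                   'banana': [True, False, True, True],
                   'cherry': [False, True, True, True]})

@pytest.mark.parametrize('low_memory', [False, True])
def test_apriori_low_memory_modes(low_memory):
    res = apriori(df, min_support=0.5, use_colnames=True, low_memory=low_memory)
    # Both code paths should report the same frequent itemsets.
    assert not res.empty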

Unless you have further suggestions, I would merge it once the tests pass. Will then also make a new version release later tonight or tomorrow -- it's about time ;).

Thanks a lot for this PR!

jmayse (Contributor, Author) commented Jul 19, 2019

Happy to! Thanks for maintaining the library!

rasbt merged commit fe8a022 into rasbt:master on Jul 19, 2019
Linked issue closed by this merge: #566, Optimization of apriori algorithm by replacing iterator with matrix operations.