
Apriori optimization #567

Merged: 9 commits from jmayse:apriori_optimization into rasbt:master, Jul 19, 2019

Conversation

jmayse (Contributor) commented Jul 18, 2019

Description

Currently, the implementation of the apriori algorithm uses a slow iteration to examine each item combination for above-threshold support. This iteration can be replaced by matrix operations that are generally faster, but use slightly more memory in some cases.
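
To illustrate the idea (a rough sketch only, not the PR's actual code; the array shapes and the support_vectorized name are made up for this example): with a boolean one-hot transaction matrix, the support of many candidate itemsets can be computed in one vectorized step instead of looping over combinations.

import numpy as np

def support_vectorized(onehot, candidates):
    # onehot: bool array of shape (n_transactions, n_items)
    # candidates: int array of shape (n_candidates, itemset_size), where each
    #             row holds the column indices of one candidate itemset
    # onehot[:, candidates] -> (n_transactions, n_candidates, itemset_size);
    # a transaction supports an itemset only if every item in it is present.
    return np.all(onehot[:, candidates], axis=2).mean(axis=0)

onehot = np.array([[1, 1, 0, 1],
                   [1, 0, 1, 1],
                   [0, 1, 1, 1]], dtype=bool)
candidates = np.array([[0, 3], [1, 2]])        # two candidate 2-itemsets
print(support_vectorized(onehot, candidates))  # -> approximately [0.667, 0.333]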

While still generally slower than fpgrowth, this implementation of apriori is 3-6x faster than the current one (benchmarks below):

https://gist.github.com/jmayse/ad688d6a7fd842269996a701d7cecd4c

https://gist.github.com/jmayse/7c76a2d838ac164b923a47b29527f2ed

The current use of the verbose option doesn't really make sense without the iterator, so I have replaced its output with a statement that reports the number of combinations currently being compared.

The exit behavior for the main while loop has also changed from setting max_itemset = 0 to using break. If this is undesirable, we can easily reverse this change.
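
A simplified sketch of the two exit styles being compared (illustrative only, not the actual apriori source):

max_itemset = 1
while max_itemset:
    frequent = []    # pretend no candidates survived the support threshold
    if not frequent:
        # Old behavior (roughly): set max_itemset = 0 so the `while max_itemset`
        # condition fails on the next check. New behavior: exit directly.
        break
    max_itemset += 1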

Related issues or pull requests

Fixes #566

Pull Request Checklist

  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
  • Modify documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
  • Ran PYTHONPATH='.' pytest ./mlxtend -sv and made sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
  • Checked for style issues by running flake8 ./mlxtend

coveralls commented Jul 18, 2019

Coverage Status

Coverage increased (+0.05%) to 92.124% when pulling f0788a7 on jmayse:apriori_optimization into 0abf710 on rasbt:master.

rasbt (Owner) commented Jul 18, 2019

Thanks a lot for the PR! 3-6x is definitely a substantial improvement and very welcome :). Regarding the implementation, it'll probably be more memory hungry since everything is done within an array instead of iterating over a generator, but I suppose it's not an issue on big datasets based on your experience with this code? (I assume, big datasets where this would be prohibitive would be so large that the iterative version would run ~forever anyways).

Otherwise, the alternative would be to enable both options via a flag "in_memory=False" (old version) and "in_memory=True" (new version). But I guess that's overkill!?

jmayse (Contributor, Author) commented Jul 18, 2019

I don't really think that is overkill at all. The user should be able to specify a lightweight option in terms of memory usage for applications on low-memory platforms (a Raspberry Pi, for instance). Additionally, the previous method using the iterator will always work, given enough time; there's value in that on its own. I'll add it back in under a flag, with testing around the flag behavior.
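
(For reference, the flag discussed here ended up being called low_memory; the DataFrame below is made up purely to show what a call with the flag looks like.)

import pandas as pd
from mlxtend.frequent_patterns import apriori

# Illustrative one-hot encoded transactions.
df = pd.DataFrame({'apple':  [True, True, False, True],
                   'banana': [True, False, True, True],
                   'cherry': [False, True, True, True]})

# low_memory=True selects the more memory-frugal code path discussed above;
# low_memory=False (the default) uses the vectorized implementation.
frequent = apriori(df, min_support=0.5, use_colnames=True, low_memory=True)
print(frequent)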

Review comment on the diff (excerpt of the test code under discussion):

        print(out.getvalue())
    except TypeError:
        # If there is no low_memory argument, don't run the test.
        assert True

jmayse (Contributor, Author) commented:
I really do not like the try-except-else pattern, but it's common in Python and works here. I'm very, very open to suggestions...
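
Pieced together from the excerpt above and the alternative suggested below, the try/except version of the test presumably looked roughly like this (a reconstruction, not the exact diff):

def test_low_memory_flag(self):
    try:
        with captured_output() as (out, err):
            _ = self.fpalgo(self.df, low_memory=True, verbose=1)
            print(out.getvalue())
    except TypeError:
        # If there is no low_memory argument, don't run the test.
        assert True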

rasbt (Owner) replied Jul 18, 2019:
Hm, since it's a function, that's a tricky one ... one solution, which is not that much better, would be

import inspect

def test_low_memory_flag(self):
    if 'low_memory' in inspect.getargspec(self.fpalgo)[0]:
        with captured_output() as (out, err):
            _ = self.fpalgo(self.df, low_memory=True, verbose=1)
            print(out.getvalue())
    else:
        # If there is no low_memory argument, don't run the test.
        assert True

jmayse (Contributor, Author) replied:
Actually, I like that better. I've changed this test to use inspect.signature and removed that pesky debug print that snuck through
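
An inspect.signature variant of that test would presumably look something like this (a sketch, not the exact committed code):

import inspect

def test_low_memory_flag(self):
    if 'low_memory' in inspect.signature(self.fpalgo).parameters:
        with captured_output() as (out, err):
            _ = self.fpalgo(self.df, low_memory=True, verbose=1)
        # assertions on the captured verbose output would go here
    else:
        # If there is no low_memory argument, don't run the test.
        assert True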

rasbt (Owner) replied:
sounds good!

rasbt (Owner) commented Jul 18, 2019

> I don't really think that is overkill at all. The user should be able to specify a lightweight option in terms of memory usage for applications on low-memory platforms (a Raspberry Pi, for instance). Additionally, the previous method using the iterator will always work, given enough time; there's value in that on its own. I'll add it back in under a flag, with testing around the flag behavior.

Sounds good then. Also, the low-resource env makes sense. Thanks!

rasbt (Owner) left a review:
Everything looks great; there are just a few stylistic issues I noted below.

Review comments on mlxtend/frequent_patterns/apriori.py:
@@ -105,6 +111,19 @@ def apriori(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0):
    http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/

    """

    def _support(_x, _n_rows, _is_sparse):
rasbt (Owner) commented on this hunk:
Not really important since private functions don't show up in the docs, but we usually use the NumPy documentation style. That's because I wrote a Python docstring-to-Markdown parser that automatically generates the documentation for the HTML website. No need to change it here, though, because, like I said, it won't appear on the website.
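
For reference, a NumPy-style docstring for the new helper might look something like this (the parameter descriptions are guesses for illustration, not the final docstring):

def _support(_x, _n_rows, _is_sparse):
    """Compute the support of the itemsets encoded in `_x`.

    Parameters
    ----------
    _x : matrix of bools or binary values
        One-hot encoded transaction data for the candidate items.
    _n_rows : numeric
        Number of transactions (rows) in the original input.
    _is_sparse : bool
        Whether `_x` is a sparse matrix.

    Returns
    -------
    np.array, shape = (n_itemsets,)
        Fraction of transactions that contain each itemset.
    """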

jmayse (Contributor, Author) replied:
Well, this is definitely the time to change it if you want me to. I'm happy to do it to conform to the overall style.

jmayse (Contributor, Author) added:
In fact, if you don't mind, let me just do that real quick... even if it doesn't appear on the website, we should be consistent with the overall style. Why build in inconsistencies?

jmayse (Contributor, Author) commented Jul 19, 2019:
Done -> 04e1e31

rasbt (Owner) replied:
ok sure, please go ahead then. Thanks!

pep8speaks commented Jul 19, 2019

Hello @jmayse! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-07-19 15:22:32 UTC

rasbt (Owner) commented Jul 19, 2019

I think everything should be good to go now (once the tests pass). I just added an additional test so that the whole suite runs with both low_memory=True and low_memory=False. Besides that, I also added the Changelog entry.
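
As an illustration of the idea (not necessarily how the mlxtend test suite is actually organized; the DataFrame is made up), running the same checks under both modes could be done with pytest parametrization:

import pandas as pd
import pytest
from mlxtend.frequent_patterns import apriori

# Small made-up one-hot input for the sketch.
df = pd.DataFrame({'apple':  [True, True, False, True],
                   'banana': [True, False, True, True],
                   'cherry': [False, True, True, True]})

@pytest.mark.parametrize('low_memory', [False, True])
def test_apriori_low_memory_modes(low_memory):
    res = apriori(df, min_support=0.5, use_colnames=True, low_memory=low_memory)
    # Both code paths should report the same frequent itemsets.
    assert not res.empty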

Unless you have further suggestions, I would merge it once the tests pass. Will then also make a new version release later tonight or tomorrow -- it's about time ;).

Thanks a lot for this PR!

jmayse (Contributor, Author) commented Jul 19, 2019

Happy to! Thanks for maintaining the library!

rasbt merged commit fe8a022 into rasbt:master on Jul 19, 2019
Linked issue closed by this merge: #566, Optimization of apriori algorithm by replacing iterator with matrix operations.