Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modified powerset function in EFS to reduce memory consumption #195

Merged
merged 4 commits into from
Sep 8, 2017

Conversation

admercs
Copy link
Contributor

@admercs admercs commented May 21, 2017

Description

I modified the powerset function and calculation of number of possible combinations to reduce memory overhead by not longer storing all combinations as a list.

Related issues or pull requests

Fixes # #194

Pull Request requirements

  • Added appropriate unit test functions in the ./mlxtend/*/tests directories
  • Ran nosetests ./mlxtend -sv and make sure that all unit tests pass
  • Checked the test coverage by running nosetests ./mlxtend --with-coverage
  • Checked for style issues by running flake8 ./mlxtend
  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file
  • Modify documentation in the appropriate location under mlxtend/docs/sources/ (optional)
  • Checked that the Travis-CI build passed at https://travis-ci.org/rasbt/mlxtend

Adam Erickson added 3 commits May 21, 2017 18:28
I modified the calculation of the combination of candidates so that it is no longer stored in a list. This also required updating the calculation of the total number of combinations using an efficient ncr function. Tests are needed.
@pep8speaks
Copy link

pep8speaks commented May 21, 2017

Hello @adam-erickson! Thanks for updating the PR.

Line 163:27: E128 continuation line under-indented for visual indent
Line 167:1: W293 blank line contains whitespace
Line 170:1: W293 blank line contains whitespace
Line 177:1: W293 blank line contains whitespace
Line 181:1: W293 blank line contains whitespace
Line 183:1: W293 blank line contains whitespace
Line 190:1: W293 blank line contains whitespace
Line 194:1: W293 blank line contains whitespace

Comment last updated on May 21, 2017 at 17:03 Hours UTC

@admercs admercs changed the title Modified powerset function to reduce memory consumption Modified powerset function in EFS to reduce memory consumption May 21, 2017
@rasbt
Copy link
Owner

rasbt commented May 21, 2017

Thanks for the PR! I just gave it a quick glance and there's already something like ncr (math.num_combinations); prob a bit more efficient since it doesn't evaluate range objects.

E.g., you could replace

all_comb = np.sum([ncr(n=X.shape[1], r=i)
                   for i in range(self.min_features,
                                  self.max_features + 1)])

by

from ..math import num_combinations
...
all_comb = np.sum([num_combinations(n=X.shape[1], k=i)
                   for i in range(self.min_features,
                                  self.max_features + 1)])

@admercs
Copy link
Contributor Author

admercs commented May 21, 2017

Supposedly, the implementation provided is ~25% more efficient than math.factorial: http://stackoverflow.com/questions/4941753/is-there-a-math-ncr-function-in-python

Update: I tested both on (n=100, r=10) and the one I provided was ~ 20% more efficient. I ran tests again on (n=10, r=5) and found the opposite, as your implementation was 3x faster here. However, the below version was twice as fast as yours for small datasets and also faster for large datasets:

def nCr(n,r):
    return factorial(n) // factorial(r) // factorial(n-r)

I'd consider this version optimal.

@rasbt
Copy link
Owner

rasbt commented May 21, 2017

Interesting! However, I think any of these functions runs so quickly (compared to the rest) that speed is really not a concern here. Sorry my previous comment was a bit misleading; I primarily thought that resusing the existing num_combinations function would be a good idea to avoid reimplementing this code here (in the sense of refactoring or duplication of effort :)).

@rasbt
Copy link
Owner

rasbt commented May 27, 2017

Hi, @adam-erickson . Just revisiting this discussion from last weekend ...
That's a really good fix. and I'd like to merge it. Can we use the internal from ..math import num_combinations for this though to keep the code simple and add a short message to the Changelog? I am happy to make those changes via "Allow edits from maintainers" if you don't mind :)

@admercs
Copy link
Contributor Author

admercs commented May 31, 2017

Hi @rasbt sorry for the slow reply, I was releasing gapfraction and its website. Of course, we can use the num_combinations function if you'd like for simplicity. What can I do to help facilitate these changes? I'm a bit new to this process. Should I close the pull request?

@rasbt
Copy link
Owner

rasbt commented May 31, 2017

No worries at all -- just wasn't sure if you'd okay with the proposed changes.

What can I do to help facilitate these changes? I'm a bit new to this process. Should I close the pull request?

Feel free to make another commit to this PR to change the code using the existing

from ..math import num_combinations

and add a note to the CHANGELOG.md file at ./docs/sources/CHANGELOG.md (in the bug fixes section).

Since GitHub now has this convenient "Squash and Merge" button, we don't have to worry about "too many" commits since that's all being taken care of at the end.

@admercs
Copy link
Contributor Author

admercs commented Jun 4, 2017

Hi @rabst, I did some testing and it appears that the results of the num_combinations function may be imprecise or incorrect? I tested it against the scipy.misc.comb function with exact=True for N=300, k=10 and found different outputs. The proposed nCr function produces identical results to scipy.misc.comb. You might want to verify this.

@rasbt
Copy link
Owner

rasbt commented Jun 4, 2017

Thanks for looking into that, @adam-erickson . I also find the same issue, when I

>>> num_combinations(n=200, k=10)
22451004309013280

>>> scipy.misc.comb(n=200, k=10)
22451004309013280

But for n=300, I get 1398320233241701888 & 1398320233241701770 (the later from scipy). I ran the same thing in Python 2.7 and noticed that Py 2.7 gives the correct results for num_combinations. Turns out to be an integer division problem. For instance, it should be

comb = numerator//denominator

instead of

comb = int(numerator/denominator)

Will fix that

@rasbt
Copy link
Owner

rasbt commented Jun 4, 2017

Should be fixed now in #200 (but you could also use scipy.misc (didn't know comb existed back then)

@rasbt
Copy link
Owner

rasbt commented Sep 8, 2017

Sorry, somehow, I forgot about this ... Merging now because the code is good and works perfectly fine :). Thanks for the contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants