Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added sparse option to onehot transform #328

Merged
merged 5 commits into from
Feb 18, 2018
Merged

Added sparse option to onehot transform #328

merged 5 commits into from
Feb 18, 2018

Conversation

jaksmid
Copy link

@jaksmid jaksmid commented Feb 14, 2018

Added option to create sparse one hot matrices on onehot transform

Should mitigate issue #298

@pep8speaks
Copy link

pep8speaks commented Feb 14, 2018

Hello @jaksmid! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 17, 2018 at 10:15 Hours UTC

@jaksmid
Copy link
Author

jaksmid commented Feb 14, 2018

Hello @rasbt , please check if the API changes are to your liking.

Wasnt sure about two possible format outcomes in the transform return doc.

Also not sure about the documentation notebooks though. Should I change them?

@jaksmid jaksmid changed the title Added sparse option to tranform Added sparse option to transform Feb 14, 2018
@jaksmid jaksmid changed the title Added sparse option to transform Added sparse option to onehot transform Feb 14, 2018
@coveralls
Copy link

coveralls commented Feb 14, 2018

Coverage Status

Coverage increased (+0.03%) to 91.588% when pulling cc105e2 on jaksmid:master into 98ba8a9 on rasbt:master.

Copy link
Owner

@rasbt rasbt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, just a few minor suggestions :)

Returns
------------
onehot : NumPy array [n_transactions, n_unique_items]
The NumPy one-hot encoded integer array of the input transactions,
onehot : NumPy array [n_transactions, n_unique_items] If not sparse
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest changing

"If not sparse"

to

"if sparse=False (default)
and SciPy Compressed Sparse Row (CSR) matrix, otherwise

@@ -119,12 +120,19 @@ def transform(self, X):
['Milk', 'Beer', 'Rice'],
['Milk', 'Beer'],
['Apple', 'Bananas']]

sparse: bool
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggested change

sparse: bool -> sparse: bool (default=False)

oht = OnehotTransactions()
oht.fit(dataset)
trans = oht.transform(dataset, sparse=True)
np.testing.assert_array_equal(expect, trans.todense())
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here, for readability, adding a line

assert (isinstance(trans, csr_matrix))

wouldn't hurt :)

@@ -32,6 +32,7 @@ The CHANGELOG for the current development version is available at
- Raises an informative error message if `predict` or `predict_meta_features` is called prior to calling the `fit` method in `StackingRegressor` and `StackingCVRegressor`. ([#315](https://github.com/rasbt/mlxtend/issues/315))
- The `plot_decision_regions` function now automatically determines the optimal setting based on the feature dimensions and supports anti-aliasing. The old `res` parameter has been deprecated. ([#309](https://github.com/rasbt/mlxtend/pull/309) by [Guillaume Poirier-Morency](https://github.com/arteymix))
- Apriori code is faster due to optimization in `onehot transformation` and the amount of candidates generated by the `apriori` algorithm. ([#327](https://github.com/rasbt/mlxtend/pull/327) by [Jakub Smid](https://github.com/jaksmid))
- `onehot transform` can be now called with `sparse` argument resulting in the sparse representation of the `onehot` matrix
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can append

(#328 by Jakub Smid)

@jaksmid
Copy link
Author

jaksmid commented Feb 15, 2018

Thanks for the code review @rasbt .

Incorporated your suggestions. What about the documentation in the notebooks though?

Should I address that as well?

@rasbt
Copy link
Owner

rasbt commented Feb 15, 2018

Good point. If you would like to add an example to the Jupyter Notebook, that would be great (but not 100% necessary). To update the API documentation in the notebooks, you need to go to

mlxtend/docs and execute

python make_api.py

in this directory. It will then update/generate the markdown documentation files. After that's completed, you can open the jupyter notebook and just rerun the last section of the notebook.

As a note, these markdown files are not added to the version control in GitHub because of redundancy (i.e., the API doc markdown gets included in the notebook at the bottom), and the markdown generated from the notebook for MKdocs has the same content as the notebook (so, for space reasons, only the notebooks are currently in version control on GitHub).

@jaksmid
Copy link
Author

jaksmid commented Feb 16, 2018

Will do.

Was also thinking about changing the dtype to bool in the onehot.py. This can further reduce the memory footprint

 * Onehot uses bools
@jaksmid
Copy link
Author

jaksmid commented Feb 16, 2018

Hello @rasbt ,

updated the doc, added a sparse example to the apriori notebook and changed the type of onehot transforms to bool. I manually verified that the apriori is still working as expected. Tests also reported no problem.

Looking forward to the feedback.

@rasbt
Copy link
Owner

rasbt commented Feb 17, 2018

Oh, I somehow assumed we were talking about the OneHotTransaction class regarding the Jupyter notebook documentation.

But yeah, modifying the apriori documentation is even better :).

However, it would maybe be nice to update the OneHotTransaction documentation as well (and then add a note about why it's using bool by default, and maybe showing an astype examples to get the classic binary integer representation -- sorry if that's too much asking, I would also be happy to do that :).

Overall, this PR is great, and it was a good idea to use sparse representations for this!

@rasbt
Copy link
Owner

rasbt commented Feb 17, 2018

Just a little suggestion, but can you modify the Changelog message a little bit to illustrate that the apriori func also "improves"?

E.g., changing the original

onehot transform is now
more memory efficient as it uses boolean instead of ints. Furthermore, onehot transform can be now called with sparse argument resulting in the sparse representation of the onehot matrix. (#328 by Jakub Smid)

to

The OnehotTransactions class (which is typically often in combination with the apriori function for associaton rule mining) is now more memory efficient as it uses boolan arrays instead of integer arrays. In addition, the OnehotTransactions class can be now be provided with sparse argument to generate sparse representations of the onehot matrix to further improve memory efficiency. (#328 by Jakub Smid)

Below is a rawtext version for convenience :)

The `OnehotTransactions` class (which is typically often in combination with the `apriori` function for associaton rule mining) is now more memory efficient as it uses boolan arrays instead of integer arrays. In addition, the `OnehotTransactions` class can be now be provided with `sparse` argument to generate sparse representations of the `onehot` matrix to further improve memory efficiency. ([#328](https://github.com/rasbt/mlxtend/pull/328) by [Jakub Smid](https://github.com/jaksmid))

@jaksmid
Copy link
Author

jaksmid commented Feb 17, 2018

Not a problem at all.

I tried to find all places that needs/deserves an update, but as I am not that familiar with the whole codebase, it may well happen that I unintentionally overlook something.

Really appreciate your suggestions.

@rasbt
Copy link
Owner

rasbt commented Feb 18, 2018

Looks great, thanks a lot! Will merge it now, thanks for the PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants