Added sparse option to onehot transform #328

jaksmid · 2018-02-14T13:53:15Z

Added option to create sparse one hot matrices on onehot transform

Should mitigate issue #298

pep8speaks · 2018-02-14T13:53:19Z

Hello @jaksmid! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 17, 2018 at 10:15 Hours UTC

jaksmid · 2018-02-14T13:56:18Z

Hello @rasbt , please check if the API changes are to your liking.

Wasn't sure how to describe the two possible return formats in the transform return doc.

Also not sure about the documentation notebooks though. Should I change them?

coveralls · 2018-02-14T14:06:30Z

Coverage increased (+0.03%) to 91.588% when pulling cc105e2 on jaksmid:master into 98ba8a9 on rasbt:master.

rasbt

Looks good to me, just a few minor suggestions :)

rasbt · 2018-02-14T17:21:29Z

mlxtend/preprocessing/onehot.py

        Returns
        ------------
-        onehot : NumPy array [n_transactions, n_unique_items]
-           The NumPy one-hot encoded integer array of the input transactions,
+        onehot : NumPy array [n_transactions, n_unique_items] If not sparse


I would suggest changing

"If not sparse"

to

"if sparse=False (default)
and SciPy Compressed Sparse Row (CSR) matrix, otherwise"

rasbt · 2018-02-14T17:22:28Z

mlxtend/preprocessing/onehot.py

@@ -119,12 +120,19 @@ def transform(self, X):
           ['Milk', 'Beer', 'Rice'],
           ['Milk', 'Beer'],
           ['Apple', 'Bananas']]
+
+        sparse: bool


suggested change

sparse: bool -> sparse: bool (default=False)
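As an illustration of the two docstring suggestions above, here is a minimal sketch of a one-hot encoder whose docstring states the default and both return types. Note that `encode_transactions` is a hypothetical stand-in written for this example, not mlxtend's actual implementation; it assumes only NumPy and SciPy.

```python
import numpy as np
from scipy.sparse import csr_matrix


def encode_transactions(transactions, sparse=False):
    """One-hot encode a list of transactions (hypothetical helper).

    Parameters
    ------------
    transactions : list of lists
        Each inner list holds the items of one transaction.
    sparse : bool (default=False)
        If True, return a SciPy CSR matrix instead of a NumPy array.

    Returns
    ------------
    onehot : NumPy array [n_transactions, n_unique_items]
        if sparse=False (default),
        and SciPy Compressed Sparse Row (CSR) matrix, otherwise.
    columns : list
        Sorted unique items, one per column of `onehot`.
    """
    columns = sorted({item for row in transactions for item in row})
    col_idx = {item: j for j, item in enumerate(columns)}
    dense = np.zeros((len(transactions), len(columns)), dtype=bool)
    for i, row in enumerate(transactions):
        for item in row:
            dense[i, col_idx[item]] = True
    # For brevity this sketch fills a dense array first; a memory-conscious
    # version would build the CSR matrix directly from index arrays.
    return (csr_matrix(dense) if sparse else dense), columns


dataset = [['Milk', 'Beer', 'Rice'],
           ['Milk', 'Beer'],
           ['Apple', 'Bananas']]
onehot, columns = encode_transactions(dataset)
onehot_sparse, _ = encode_transactions(dataset, sparse=True)
print(columns)
print(onehot.astype(int))
print(type(onehot_sparse).__name__)
```

Keeping the default at `sparse=False` means existing callers still get a NumPy array, which is why the docstring should state the default explicitly.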

rasbt · 2018-02-14T17:23:36Z

mlxtend/preprocessing/tests/test_onehot_transactions.py

+    oht = OnehotTransactions()
+    oht.fit(dataset)
+    trans = oht.transform(dataset, sparse=True)
+    np.testing.assert_array_equal(expect, trans.todense())


here, for readability, adding a line

assert (isinstance(trans, csr_matrix))

wouldn't hurt :)
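A self-contained sketch of what such a test could look like is below. The `trans` matrix here is built by hand as a stand-in for the result of `oht.transform(dataset, sparse=True)`, so the example runs without mlxtend; the expected values are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Expected one-hot matrix for three transactions over five items.
expect = np.array([[0, 0, 1, 1, 1],
                   [0, 0, 1, 1, 0],
                   [1, 1, 0, 0, 0]])

# Stand-in for `oht.transform(dataset, sparse=True)`.
trans = csr_matrix(expect)

# Check the return type first, so a dense array passed by mistake fails
# here with a clear message instead of at the `.toarray()` call below.
assert isinstance(trans, csr_matrix)
np.testing.assert_array_equal(expect, trans.toarray())
```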

rasbt · 2018-02-14T17:24:07Z

docs/sources/CHANGELOG.md

@@ -32,6 +32,7 @@ The CHANGELOG for the current development version is available at
 - Raises an informative error message if `predict` or `predict_meta_features` is called prior to calling the `fit` method in `StackingRegressor` and `StackingCVRegressor`. ([#315](https://github.com/rasbt/mlxtend/issues/315))
 - The `plot_decision_regions` function now automatically determines the optimal setting based on the feature dimensions and supports anti-aliasing. The old `res`  parameter has been deprecated. ([#309](https://github.com/rasbt/mlxtend/pull/309) by [Guillaume Poirier-Morency](https://github.com/arteymix))
 - Apriori code is faster due to optimization in `onehot transformation` and the amount of candidates generated by the `apriori` algorithm. ([#327](https://github.com/rasbt/mlxtend/pull/327) by [Jakub Smid](https://github.com/jaksmid))
+- `onehot transform` can be now called with `sparse` argument resulting in the sparse representation of the `onehot` matrix


You can append

(#328 by Jakub Smid)

jaksmid · 2018-02-15T09:04:30Z

Thanks for the code review @rasbt .

Incorporated your suggestions. What about the documentation in the notebooks though?

Should I address that as well?

rasbt · 2018-02-15T21:33:12Z

Good point. If you would like to add an example to the Jupyter Notebook, that would be great (but not 100% necessary). To update the API documentation in the notebooks, you need to go to `mlxtend/docs` and execute `python make_api.py` in this directory. It will then update/generate the markdown documentation files. After that's completed, you can open the Jupyter notebook and just rerun the last section of the notebook.

As a note, these markdown files are not added to version control on GitHub because of redundancy (i.e., the API doc markdown gets included at the bottom of the notebook), and the markdown generated from the notebook for MkDocs has the same content as the notebook (so, for space reasons, only the notebooks are currently in version control on GitHub).

jaksmid · 2018-02-16T09:05:35Z

Will do.

Was also thinking about changing the dtype to bool in onehot.py. This can further reduce the memory footprint.
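A rough sketch of the saving, assuming the integer array would otherwise be stored as 64-bit integers (the thread does not state the exact previous dtype, so `int64` is an assumption made for illustration):

```python
import numpy as np

n_transactions, n_items = 1000, 50

# Same shape, different element types.
onehot_int = np.zeros((n_transactions, n_items), dtype=np.int64)
onehot_bool = np.zeros((n_transactions, n_items), dtype=bool)

print(onehot_int.nbytes)   # 8 bytes per entry
print(onehot_bool.nbytes)  # 1 byte per entry
```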

* Onehot uses bools

jaksmid · 2018-02-16T15:35:42Z

Hello @rasbt ,

updated the doc, added a sparse example to the apriori notebook and changed the type of onehot transforms to bool. I manually verified that the apriori is still working as expected. Tests also reported no problem.

Looking forward to the feedback.

rasbt · 2018-02-17T03:02:27Z

Oh, I somehow assumed we were talking about the OnehotTransactions class regarding the Jupyter notebook documentation.

But yeah, modifying the apriori documentation is even better :).

However, it would maybe be nice to update the OnehotTransactions documentation as well (and then add a note about why it's using bool by default, and maybe show an astype example to get the classic binary integer representation -- sorry if that's asking too much, I would also be happy to do that :).
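For reference, a minimal sketch of such an astype conversion on a plain NumPy boolean array (the array is hand-written here as a stand-in for the transform output):

```python
import numpy as np

# Stand-in for a boolean one-hot array returned by the transform.
onehot_bool = np.array([[False, True, True],
                        [True, False, False]])

# Convert to the classic 0/1 integer representation.
onehot_int = onehot_bool.astype(int)
print(onehot_int)
```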

Overall, this PR is great, and it was a good idea to use sparse representations for this!

rasbt · 2018-02-17T03:11:01Z

Just a little suggestion, but can you modify the Changelog message a little bit to illustrate that the apriori func also "improves"?

E.g., changing the original

onehot transform is now more memory efficient as it uses boolean instead of ints. Furthermore, onehot transform can be now called with sparse argument resulting in the sparse representation of the onehot matrix. (#328 by Jakub Smid)

to

The OnehotTransactions class (which is typically used in combination with the apriori function for association rule mining) is now more memory efficient as it uses boolean arrays instead of integer arrays. In addition, the OnehotTransactions class can now be provided with a sparse argument to generate sparse representations of the onehot matrix to further improve memory efficiency. (#328 by Jakub Smid)

Below is a rawtext version for convenience :)

The `OnehotTransactions` class (which is typically used in combination with the `apriori` function for association rule mining) is now more memory efficient as it uses boolean arrays instead of integer arrays. In addition, the `OnehotTransactions` class can now be provided with a `sparse` argument to generate sparse representations of the `onehot` matrix to further improve memory efficiency. ([#328](https://github.com/rasbt/mlxtend/pull/328) by [Jakub Smid](https://github.com/jaksmid))
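To give a feel for the "further improve memory efficiency" part, here is a rough sketch comparing a dense boolean array with its CSR counterpart. The matrix size and the 2% fill rate are arbitrary assumptions chosen for illustration; actual savings depend on how sparse the transaction data is.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)

# Synthetic one-hot matrix: 1000 transactions, 200 items, about 2% nonzero.
dense = rng.random((1000, 200)) < 0.02
sparse = csr_matrix(dense)

# A CSR matrix stores three arrays: the nonzero values, their column
# indices, and one row pointer per row (plus one).
sparse_bytes = (sparse.data.nbytes
                + sparse.indices.nbytes
                + sparse.indptr.nbytes)

print(dense.nbytes, sparse_bytes)
```

For dense data the index arrays can make CSR larger than the boolean array, so the sparse option pays off mainly when most entries are zero.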

jaksmid · 2018-02-17T10:18:53Z

Not a problem at all.

I tried to find all the places that need or deserve an update, but as I am not that familiar with the whole codebase, I may well have unintentionally overlooked something.

Really appreciate your suggestions.

rasbt · 2018-02-18T02:40:04Z

Looks great, thanks a lot! Will merge it now, thanks for the PR!

Added sparse option to tranform

4641b61

Docs improvement

164a3b8

jaksmid changed the title ~~Added sparse option to tranform~~ Added sparse option to transform Feb 14, 2018

jaksmid changed the title ~~Added sparse option to transform~~ Added sparse option to onehot transform Feb 14, 2018

rasbt requested changes Feb 14, 2018

Incorporated Code Review

aeca225

rasbt approved these changes Feb 15, 2018

* Doc updates

b4590e5

* Onehot uses bools

* Further Doc update

cc105e2

rasbt merged commit 77c57d5 into rasbt:master Feb 18, 2018

rasbt mentioned this pull request Feb 18, 2018

Supporting out-of-core processing in the apriori function for DataFrames that don't fit into memory #298

Open
