Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Frequent Patterns Sparse Support #667

Merged
merged 9 commits into from
Feb 24, 2020
Merged

Frequent Patterns Sparse Support #667

merged 9 commits into from
Feb 24, 2020

Conversation

rasbt
Copy link
Owner

@rasbt rasbt commented Feb 21, 2020

Attempt to add support for Pandas 1.0 in frequent pattern mining functions.

It looks like pandas 1.0 had quite a revamp regarding sparse data frames. I attempted to use the migration guide at https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating but it's a bit tricky, because they don't have fill values anymore etc.

@dbarbier and @DanielMorales9 , do you have experience with porting code to pandas 1.0? I would really appreciate your feedback for how to handle these cases.

@dbarbier
Copy link
Contributor

I believe that e3ff3a7 fixes all issues with pandas 1.0, except failures with test_sparse_with_zero. But this is a because pandas 1.0 does not handle sparse dataframes with zeros, see pandas-dev/pandas#29814. You could either remove this test, or discard it with pandas >= 1.0.

@pep8speaks
Copy link

pep8speaks commented Feb 22, 2020

Hello @rasbt! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-02-24 20:39:52 UTC

@rasbt
Copy link
Owner Author

rasbt commented Feb 22, 2020

Thanks for the feedback! To prevent the codebase becoming more and more cluttered, I thought that even dropping SparseDataFrame support entirely may be a good idea (in favor of the new sparse data handling in Pandas >= 1.0). I added a check valid_input_check() (in fpcommon.py) to let users now that we only support the "new" way.

Thanks your points, most of the issues are now addressed. One thing though is that in some tests, we check columns and dtypes of the columns. However, when doing sth like

dfs = df.astype(pd.SparseDtype("int", np.nan)).sparse.to_coo()

this will produce a sparse scipy matrix, which doesn't have dtypes and columns. So we'd just need to do

dfs = df.astype(pd.SparseDtype("int", np.nan))

I guess. But is this now really "sparse" or do we need to do sth extra?

@dbarbier
Copy link
Contributor

Thanks for the feedback! To prevent the codebase becoming more and more cluttered, I thought that even dropping SparseDataFrame support entirely may be a good idea (in favor of the new sparse data handling in Pandas >= 1.0). I added a check valid_input_check() (in fpcommon.py) to let users now that we only support the "new" way.

Good idea, you can then remove if hasattr(df, "to_coo") blocks which become useless.

With dfs = df.astype(pd.SparseDtype(...)), all columns become sparse, yes.

@rasbt
Copy link
Owner Author

rasbt commented Feb 23, 2020

Good idea, you can then remove if hasattr(df, "to_coo") blocks which become useless.

Ah yes, thanks, will do!

dfs = df.astype(pd.SparseDtype(...))

Thanks! I was a bit suspicious for some reason ...

@rasbt rasbt changed the title Frequent Patterns Sparse Support [WIP] Frequent Patterns Sparse Support Feb 23, 2020
@rasbt
Copy link
Owner Author

rasbt commented Feb 23, 2020

I think I addressed everything now. Will make a new "bugfix" version so that there is a version that works for people who installed the latest pandas version. Thanks for your help!

(df.dtypes == bool)).all()
else:
all_bools = (df.dtypes == bool).all()
all_bools = (df.dtypes == bool).all()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should have kept

   all_bools = ((df.dtypes == pd.SparseDtype(bool)) |
                        (df.dtypes == bool)).all()

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why I deleted this! Now, after adding it back, for some reason

(df.dtypes == pd.SparseDtype(bool))

produces a deprecation warning for some reason.

DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
res_values = op(left, right)

Can't reproduce this issue outside the code though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, that's weird; I do not find how to avoid this warning.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to work; is_bool_dtype had been introduced in pandas 0.19, so there is no need to check version:

# Fast path: if all columns are boolean, there is nothing to checks
all_bools = df.dtypes.apply(pd.api.types.is_bool_dtype).all()

@rasbt
Copy link
Owner Author

rasbt commented Feb 24, 2020

Awesome, that works! Thanks a lot!

@rasbt rasbt merged commit 213fd02 into master Feb 24, 2020
@rasbt rasbt deleted the fix-pandas branch November 12, 2020 17:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants