Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT Add aggregate and join functions for Pandas and Polars #733

Merged

Conversation

Vincent-Maladiere
Copy link
Member

@Vincent-Maladiere Vincent-Maladiere commented Sep 5, 2023

References
This PR aims at simplifying #600 by adding the dataframe._pandas and dataframe._polars namespaces prior to it.
It also enables solving issues like #730.

This PR takes into account the latest reviews made in #600.

What does this PR implement?

  • Create a new submodule skrub.dataframe so that the join and aggregate functions for Pandas and Polars can be shared more easily across skrub.
  • Add the get_df_namespace function and the DataFrameLike, SeriesLike types in separate files within skrub.dataframe.
  • Add a specific documentation entry in the API section for the pandas and polars submodules.

Additional comments
This PR doesn't take into account the output of the discussion #719 for now.

Copy link
Member

@jovan-stojanovic jovan-stojanovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Vincent, great idea! I think some of these methods can be used inside the fuzzy_join/Joiner (perhaps to do when we add polars compatibility).
Some comments:

CHANGES.rst Outdated Show resolved Hide resolved
doc/api.rst Outdated Show resolved Hide resolved
@@ -84,6 +84,46 @@ This page lists all available functions and classes of `skrub`.

deduplicate

.. raw:: html
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the API page is getting crowded. Maybe we can try to add subsections with a dropdown menu on the sidebar for .datasets and .dataframe methods (as a follow up PR maybe)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting suggestion, WDYT @GaelVaroquaux?

skrub/dataframe/_pandas.py Show resolved Hide resolved
skrub/dataframe/tests/test_namespace.py Outdated Show resolved Hide resolved
skrub/dataframe/tests/test_namespace.py Show resolved Hide resolved
skrub/dataframe/_namespace.py Outdated Show resolved Hide resolved
Vincent-Maladiere and others added 2 commits September 6, 2023 11:26
Co-authored-by: Jovan Stojanovic <62058944+jovan-stojanovic@users.noreply.github.com>
@@ -0,0 +1,41 @@
import pandas as pd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need an __init__.py in this directory (at least with the current approach of having test modules be part of the package), otherwise for me locally pytest fails and complains it cannot import skrub.dataframe

@Vincent-Maladiere Vincent-Maladiere merged commit e595ec0 into skrub-data:main Sep 22, 2023
24 checks passed
@Vincent-Maladiere Vincent-Maladiere deleted the add_polars_pandas_utils branch September 22, 2023 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants