Conversation

@jeromedockes
Member

@jeromedockes jeromedockes commented Jul 29, 2024

ATM the AggJoiner and AggTarget, when using value counts or histogram, include an unwanted "index" column in their output.

After performing a groupby, the grouping column is in the index of the pandas result. To have it as a column and be able to join on it, we need to do a reset_index. skrub._dataframe._pandas.aggregate used to do it for histogram and value counts, but not for other aggregation functions. This went unnoticed (and was not a problem) because the AggJoiner and AggTarget used pandas.merge, which then performed the merge on the index rather than on a column: "right_on" can be either a column name or an index level name.
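To illustrate the point above, here is a minimal pandas-only sketch. It is not the actual skrub code, and the table contents are made up for the example. It shows that the grouping key ends up in the index after a groupby, and that pandas.merge still accepts that index level name in right_on:

```python
import pandas as pd

# Hypothetical toy tables, only for illustration.
main = pd.DataFrame({"airportId": [1, 2]})
aux = pd.DataFrame({
    "from_airport": [1, 1, 2],
    "total_passengers": [90, 120, 70],
})

# After the groupby, "from_airport" is the index of the result, not a column.
agg = aux.groupby("from_airport").mean()
print(agg.index.name)     # from_airport
print(list(agg.columns))  # ['total_passengers']

# right_on may name an index level, so the merge works without reset_index.
merged = main.merge(agg, left_on="airportId", right_on="from_airport")
print(merged["total_passengers"].tolist())  # [105.0, 70.0]
```

This is why the missing reset_index for functions such as mean() did not cause visible failures.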

In #945 a reset_index was added, but for both mean() and value_counts(). So before, the reset_index was done 0 times for mean() and once for value_counts(); after, it was done once for mean() and twice for value_counts(), the second one resulting in the "index" column.
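The effect of the extra reset_index can be reproduced with plain pandas. This is only a sketch with made-up data, using mean() for simplicity rather than the value counts / histogram code path in skrub:

```python
import pandas as pd

# Hypothetical toy table, only for illustration.
aux = pd.DataFrame({
    "from_airport": [1, 1, 2],
    "total_passengers": [90, 120, 70],
})

agg = aux.groupby("from_airport").mean()

# First reset_index: the grouping key moves from the index to a column.
once = agg.reset_index()
print(list(once.columns))   # ['from_airport', 'total_passengers']

# Second reset_index: the default RangeIndex is unnamed, so it gets
# inserted as a new column called "index".
twice = once.reset_index()
print(list(twice.columns))  # ['index', 'from_airport', 'total_passengers']
```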

What we want is to do reset_index exactly once in all cases.

At least that's what I understand from a quick look; it would be great if @TheooJ and @Vincent-Maladiere could confirm.

@Vincent-Maladiere
Member

Hey @jeromedockes, actually, I don't remember why we initially needed to use reset_index. If my main table has a custom index (e.g., due to a split), I might want to keep it. For example, resetting it could create silent errors in data wrangling use cases outside of a pipeline. WDYT?

import pandas as pd
from skrub import AggJoiner

main = pd.DataFrame({
    "airportId": [1, 2],
    "airportName": ["Paris CDG", "NY JFK"],
}, index=[2, 3])

aux = pd.DataFrame({
    "flightId": range(1, 7),
    "from_airport": [1, 1, 1, 2, 2, 2],
    "total_passengers": [90, 120, 100, 70, 80, 90],
    "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
}, index=[10, 11, 12, 13, 14, 15])

AggJoiner(
    aux_table=aux,
    main_key="airportId",
    aux_key="from_airport",
    cols=["total_passengers", "company"],
    operations=["hist(4)", "mode"],
).fit_transform(main).index.tolist()

# >>> [0, 1], instead of [2, 3]

@jeromedockes
Member Author

As discussed IRL with @Vincent-Maladiere, his comment above is correct, but we'll open a separate issue about it: this PR is about the index of the aux table, whereas the comment is about preserving the index of the main table.

@Vincent-Maladiere
Member

By the way, do we need to reset the index of the aux table? As we join on a key, the index shouldn't matter, should it?

@jeromedockes
Member Author

jeromedockes commented Jul 31, 2024 via email

Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Right, LGTM then

@jeromedockes
Copy link
Member Author

thanks @Vincent-Maladiere

@jeromedockes jeromedockes merged commit b78a5f2 into skrub-data:main Aug 1, 2024
@jeromedockes jeromedockes deleted the fix-index-in-aggtarget-output branch August 1, 2024 13:28