
FEAT Add the AggJoiner and AggTarget transformers #600

Merged: 54 commits, Oct 10, 2023

Conversation


@Vincent-Maladiere Vincent-Maladiere commented Jun 13, 2023

What does this PR introduce?

This draft proposes a first POC for the Join Aggregator. It aims at aggregating auxiliary tables before merging them on the base table, in a 1:N fashion.

Its API follows a logic consistent with that of the FeatureAugmenter:

agg_joiner = AggJoiner(
    tables=[
        (aux_1_large, "Country Name", ["GDP per capita (current US$)"]),
        (aux_2_large, "Country Name", ["Life expectancy at birth, total (years)", "country"])
    ],
    main_key="Country",
    agg_ops=["mean", "min", "max", "mode", "value_counts", "hist(4)"],
)
agg_joiner.fit_transform(df)

It currently supports pandas DataFrames, Polars DataFrames, and Polars LazyFrames, so it can be run lazily with Polars! My idea is to preserve the dataframe module of the input to avoid confusion, and to refuse inputs that mix backends. The Polars dependency is of course optional.
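The "preserve the input backend, refuse mixed backends" idea could be sketched like this. The helper names (`get_backend`, `check_same_backend`) are hypothetical, not skrub's actual API:

```python
# Hypothetical sketch: detect the dataframe module of the input and refuse
# inputs that mix backends. Polars stays an optional dependency.
import pandas as pd


def get_backend(df):
    """Return the backend name for a dataframe-like object."""
    if isinstance(df, pd.DataFrame):
        return "pandas"
    try:
        import polars as pl

        if isinstance(df, (pl.DataFrame, pl.LazyFrame)):
            return "polars"
    except ImportError:
        pass
    raise TypeError(f"Unsupported dataframe type: {type(df)}")


def check_same_backend(main, aux_tables):
    """Raise if the main and auxiliary tables use different backends."""
    backends = {get_backend(df) for df in [main, *aux_tables]}
    if len(backends) > 1:
        raise TypeError(f"Mixed dataframe backends are not supported: {backends}")
    return backends.pop()
```

The output of `check_same_backend` can then be used to pick the matching engine once, at `fit` time.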

My next step will be to benchmark the RAM consumption and time to run for these 3 different dataframes on bigger datasets.

Edit: here is a demo of the Join Aggregator applied to feature engineering for RecSys in a Kaggle competition!


How is it implemented?

The API of the Join Aggregator is straightforward and relies on specialized implementations of an abstract AssemblingEngine class to handle both pandas and Polars dataframes, namely PandasAssemblingEngine and PolarsAssemblingEngine.
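A minimal sketch of that abstract-engine design, with only the pandas side filled in. The method names (`agg`, `join`) and the column-flattening detail are assumptions for illustration, not the PR's exact internals:

```python
from abc import ABC, abstractmethod

import pandas as pd


class AssemblingEngine(ABC):
    """Backend-specific aggregation and join operations."""

    @abstractmethod
    def agg(self, table, key, cols, agg_ops):
        """Aggregate `cols` of `table`, grouped by `key`."""

    @abstractmethod
    def join(self, left, right, left_on, right_on):
        """Left-join `right` onto `left`."""


class PandasAssemblingEngine(AssemblingEngine):
    def agg(self, table, key, cols, agg_ops):
        out = table.groupby(key)[cols].agg(agg_ops)
        # Flatten the (column, op) MultiIndex into readable names.
        out.columns = ["_".join(c) for c in out.columns]
        return out.reset_index()

    def join(self, left, right, left_on, right_on):
        return left.merge(right, left_on=left_on, right_on=right_on, how="left")
```

A PolarsAssemblingEngine would implement the same two methods with `group_by`/`join` expressions, which is what lets the transformer stay backend-agnostic.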

In the absence of fully fledged tests, here is a working example:

import numpy as np
import pandas as pd
from skrub.datasets import fetch_world_bank_indicator
from skrub._join_aggregator import JoinAggregator


main = pd.read_csv(
    "https://raw.githubusercontent.com/dirty-cat/datasets/master/data/Happiness_report_2022.csv",
    thousands=",",
)
main = main[["Country", "Happiness score"]]

aux_1 = fetch_world_bank_indicator(indicator_id="NY.GDP.PCAP.CD").X
aux_2 = fetch_world_bank_indicator("SP.DYN.LE00.IN").X

# Duplicate rows to create 1:N conditions
def augmente(df, id_col, val_cols, n_repeat=5):
    dfs = []
    for val_col in val_cols:
        for id, el in df[[id_col, val_col]].values:
            repeated = np.random.normal(el, scale=el/100, size=n_repeat)
            df_ = pd.DataFrame({id_col: id, val_col: repeated})
            dfs.append(df_)
    return pd.concat(dfs)

aux_1_large = augmente(aux_1, "Country Name", ["GDP per capita (current US$)"])
aux_2_large = augmente(aux_2, "Country Name", ["Life expectancy at birth, total (years)"])

# Add a categorical column, arbitrarily
aux_2_large["country"] = aux_2_large["Country Name"]


# Pandas
join_agg = JoinAggregator(
    tables=[
        (aux_1_large, ["Country Name"], ["GDP per capita (current US$)"]),
        (aux_2_large, ["Country Name"], ["Life expectancy at birth, total (years)", "country"])
    ],
    main_key="Country",
    agg_ops=["mean", "min", "max", "mode"],
)
pandas_out = join_agg.fit_transform(main)

# Polars, eager
import polars as pl

join_agg = JoinAggregator(
    tables=[
        (pl.DataFrame(aux_1_large), ["Country Name"], ["GDP per capita (current US$)"]),
        (pl.DataFrame(aux_2_large), ["Country Name"], ["Life expectancy at birth, total (years)", "country"])
    ],
    main_key="Country",
    agg_ops=["mean", "min", "max", "mode"],
)
polars_eager_out = join_agg.fit_transform(pl.DataFrame(main))

# Polars, lazy
join_agg = JoinAggregator(
    tables=[
        (pl.DataFrame(aux_1_large).lazy(), ["Country Name"], ["GDP per capita (current US$)"]),
        (pl.DataFrame(aux_2_large).lazy(), ["Country Name"], ["Life expectancy at birth, total (years)", "country"])
    ],
    main_key="Country",
    agg_ops=["mean", "min", "max", "mode"],
)
polars_lazy_out = join_agg.fit_transform(pl.DataFrame(main).lazy())

WDYT?

cc @strayMat and his super helpful aggregation implementations here and here :)


@GaelVaroquaux GaelVaroquaux left a comment


Thanks for the PR. This is exciting!!

I must confess that I am having a hard time reviewing this PR, as the GitHub diff is riddled with warnings from codecov complaining about missing test coverage.

The user-level API feels right to me, although without using it, I don't have a perfect feeling of the user experience.

I think that we should try to find a consistent naming across the FeatureAugmenter and the JoinAggregator. This may call for renaming the FeatureAugmenter.

We need to write tests, as these will help us feel if we have the right API for the internal components: if things are easy to test, it's a good sign.

It would be nice to have an example where we don't duplicate rows to create the 1-to-many relation. Maybe @jovan-stojanovic can help here. Ideally, this example should also show that the aggregation is beneficial for prediction, and we should add it to the PR so that we can comment on it.

Points for later:

Actual support of Polars will require us to support it in every single function and class of skrub (else the user will be confused). This will be a bit of work; in particular, it will require defining impedance matching / adaptation for many functionalities of pandas / Polars.

I wonder if the pre-aggregation strategy is the right one. If my external table has many more entries on the common key than the main table, pre-aggregation will lead me to compute many aggregates that I don't need. We should note this and potentially address it later.

operators of the respective module.
"""
@classmethod
def get_for(cls, tables):
Member

I would prefer a function (a simple function, not a method or classmethod) as a factory to instantiate the engine. It leads to simpler code.
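The suggested factory function might look like this. The names are hypothetical, and the engine classes are stubbed out purely for illustration:

```python
import pandas as pd


class PandasAssemblingEngine:  # stub for illustration
    pass


class PolarsAssemblingEngine:  # stub for illustration
    pass


def get_assembling_engine(tables):
    """Return the engine matching the backend of the auxiliary tables.

    `tables` is a list of (dataframe, key, cols) tuples, as in the PR's API.
    """
    first = tables[0][0]
    if isinstance(first, pd.DataFrame):
        return PandasAssemblingEngine()
    try:
        import polars as pl

        if isinstance(first, (pl.DataFrame, pl.LazyFrame)):
            return PolarsAssemblingEngine()
    except ImportError:
        pass
    raise TypeError(f"unsupported dataframe type: {type(first)}")
```

A plain function keeps the dispatch logic out of the class hierarchy, which is the simplification being suggested here.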

return num_ops, categ_ops


class AssemblingEngine:
Member

Maybe this should be called a "Dispatcher", as IMHO this construct is related to dispatching and the Dispatcher pattern.

Member Author

Hmm, what would this Dispatcher do if, according to your suggestion above, we replace get_for with a function?

Member

The "get_for" would be a function "make_agg_dispatcher" that would return a dispatcher. Not very different from what you currently have, but as a function rather than a classmethod



def pandas_get_agg_ops(cols, agg_ops):
pandas_ops_mapping = {
Member

I think that you should make those dictionaries globals defined outside the function (and thus give them all-caps names like PANDAS_CAT_OPS_MAPPING). In the long term, having them accessible to other functions can be useful (let alone for testing).
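A sketch of the suggested refactor: the op-name mapping lives at module level under an all-caps name, so other functions and tests can reach it. The entries and the function body are illustrative, not the PR's exact code:

```python
# Module-level mapping from user-facing op names to pandas aggregation specs.
PANDAS_NUM_OPS_MAPPING = {
    "mean": "mean",
    "std": "std",
    "min": "min",
    "max": "max",
    "sum": "sum",
}


def pandas_get_agg_ops(cols, agg_ops):
    """Resolve user-facing op names to a pandas `agg` spec, per column."""
    return {col: [PANDAS_NUM_OPS_MAPPING[op] for op in agg_ops] for col in cols}
```

With the dictionary at module level, the function shrinks to one line, which is what makes inlining it attractive.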

Member

Once this is done, I believe that the present function can be inlined where it is called, as it is very short (many very short functions make code harder to read, as they require memorizing a lot of indirections).

Member Author

To stay consistent, we'd also need to inline the {pandas, polars}_get_agg_ops and {pandas, polars}_split_num_categ_cols functions.

However, some of these functions are long and might clutter the logic of the calling method.

Maybe we should define these functions as staticmethod? So that things are close to each other.

WDYT?

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 15, 2023 via email

@Vincent-Maladiere
Member Author

Thanks for this detailed feedback!

I must confess that I am having a hard time reviewing this PR, as the github diff is ridden with warnings from codecov complaining of missing test coverage.

Whoops, sorry, I didn't notice. Adding skip-ci to my commit messages should help here.

I think that we should try to find a consistent naming across the FeatureAugmenter and the JoinAggregator. This may call for renaming the FeatureAugmenter

Yes, that sounds good. Something like "FuzzyJoiner"?

We need to write tests, as these will help us feel if we have the right API for the internal components: if things are easy to test, it's a good sign.

Yes, that's next on my todo :)

It would be nice to have an example where we don't duplicate rows to create the 1-to-many relation.

I just made a demo on Kaggle, the base table is huge (30M rows) and needs Polars lazy mode :)

When using models like LambdaMart or BoostingTrees for Recommender Systems, you always end up aggregating the base table | user_id | product_id | timestamp | by user_id and also by product_id before joining these two aggregated tables back to the main one: this is an ideal and very useful application for the Join Aggregator!

Please have a look at it:

https://www.kaggle.com/code/vincentmaladiere/h-m-recsys-feature-engineering-with-skrub-polars

Of course, we also need a proper example in the documentation, as you mentioned. Why not use a lighter version of MovieLens and perform recommendations in our documentation? :)

Points for later:

Actual support of Polars will require us to support it in every single function and class of skrub (else the user will be confused). This will be a bit of work; in particular, it will require defining impedance matching / adaptation for many functionalities of pandas / Polars.

Yes, how do you rank it as a priority? There is so much value in actually offering this, I'd be thrilled to start thinking about it very soon.

I wonder if the pre-aggregation strategy is the right one. If my external table has many more entries on the common key than the main table, pre-aggregation will lead me to compute many aggregates that I don't need. We should note this and potentially address it later.

Very good point. We could filter the auxiliary tables on the main table's keys before aggregating, during fit!
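In pandas, that filtering idea could look like this (hypothetical helper, done at fit time): restrict the auxiliary table to keys present in the main table before aggregating, so no aggregate is computed that the final left join would discard anyway.

```python
import pandas as pd


def filter_then_agg(main, aux, main_key, aux_key, cols, ops):
    """Keep only aux rows whose key appears in the main table, then aggregate."""
    keys = main[main_key].unique()
    aux = aux[aux[aux_key].isin(keys)]
    return aux.groupby(aux_key)[cols].agg(ops).reset_index()
```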

@Vincent-Maladiere
Member Author

No, staticmethods are to be used only when they are important for inheritance reasons. Classes are not to be confused with modules. Classes serve to define inheritance. Objects serve to associate functions to data (attributes). Modules serve to group code (symbols) together.

What are you suggesting then?

@Vincent-Maladiere
Member Author

Vincent-Maladiere commented Jun 15, 2023

I've just noticed that our TableVectorizer already handles Polars input, hehe (not lazily, though):

import pandas as pd
import polars as pl
from skrub import TableVectorizer

main = pd.read_csv(
    "https://raw.githubusercontent.com/dirty-cat/datasets/master/data/Happiness_report_2022.csv",
    thousands=",",
)

tv = TableVectorizer()
tv.fit_transform(
    pl.DataFrame(main)
)

This is thanks to a fortunate cast to pandas during fit_transform:

        # Convert to pandas DataFrame if not already.
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        else:
            # Create a copy to avoid altering the original data.
            X = X.copy()

A pleasant surprise 😄

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 15, 2023 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 15, 2023 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 15, 2023 via email


@jovan-stojanovic jovan-stojanovic left a comment


Thanks for this great new feature @Vincent-Maladiere! I have a few comments on the example.


@jeromedockes jeromedockes left a comment


just a first batch of small comments

@jeromedockes
Contributor

Maybe we should define these functions as staticmethod? So that things are close to each other.

No, staticmethods are to be used only when they are important for inheritance reasons. Classes are not to be confused with modules. Classes serve to define inheritance. Objects serve to associate functions to data (attributes). Modules serve to group code (symbols) together.

just for the sake of argument, having a class, say skrub.DataFrame, would allow us to:

  • move around the methods together with the dataframe, without having to call get_df_namespace every time we need a specialized function
  • describe explicitly the interface that _polars and _pandas must implement
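What such a hypothetical skrub.DataFrame wrapper could look like: the wrapper binds backend-specific methods to the data, so callers never resolve the backend namespace themselves. Only a pandas branch is sketched, and all names are illustrative:

```python
import pandas as pd


class SkrubDataFrame:
    """Carries backend-specific methods together with the dataframe."""

    def __init__(self, df):
        self._df = df

    def aggregate(self, key, cols, ops):
        if isinstance(self._df, pd.DataFrame):
            out = self._df.groupby(key)[cols].agg(ops).reset_index()
            return SkrubDataFrame(out)
        raise NotImplementedError("only the pandas branch is sketched here")

    def to_native(self):
        return self._df
```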

@Vincent-Maladiere
Member Author

You're right. But it feels like we're going to reimplement the dataframe API in some way, WDYT?

@jeromedockes
Contributor

You're right. But it feels like we're going to reimplement the dataframe API in some way, WDYT?

I agree but isn't this already what we're doing, just with the dataframe submodules instead of DataFrame subclasses?

It is definitely true that it feels like we're doing something too similar to ibis's or the dataframe API's objectives. What I understood from IRL discussions is that we want to do it for a much more restricted/specialized set of operations, until the dataframe API covers all we need in skrub, but I may have misunderstood.

@Vincent-Maladiere
Member Author

I agree but isn't this already what we're doing, just with the dataframe submodules instead of DataFrame subclasses?

I understand your point, we're indeed already simulating our tiny dataframe API.

It is definitely true that it feels like we're doing something too similar to ibis's or the dataframe API's objectives. What I understood from IRL discussions is that we want to do it for a much more restricted/specialized set of operations, until the dataframe API covers all we need in skrub, but I may have misunderstood.

Yes, you're right, this is where we're heading. So I agree with what you say about skrub.DataFrame, provided it doesn't add too much complexity and cost on our side.

@Vincent-Maladiere
Member Author

@GaelVaroquaux, does this new version match your requirements?

@Vincent-Maladiere
Member Author

I guess we can merge this now since we've converged on the design. #734 lists the next TODOs.


@GaelVaroquaux GaelVaroquaux left a comment


Very very cool.

I left one small inline comment (I hope that it won't be too much to address), and I have a couple of comments on the docs / website:

Front page:

Replace on the front page:

"Joiner, a transformer for joining multiple tables together."

by

"Joiner, AggJoiner: transformers for joining multiple tables together."

Assembling docs

Rework the assembling narrative docs a tiny bit to list the AggJoiner and
AggTarget.

I think the way I would do this is by adding to the section "Joining
external tables for machine learning": where the Joiner is mentioned, I
would make a list with Joiner, AggJoiner, and AggTarget, quickly giving
the differences between them.

aux_key="userId",
cols=timestamp_cols,
main_key="userId",
suffix="_user",
Member

Is specifying the suffix necessary for the example to work (not only here, but all over the example)?

If yes, can we think of a default for the suffix that makes it work here (and is not too strange / magic)

If no, I think that we should remove it from the example, to make the example a bit lighter.

Member Author

This is a relatively complete example. In most simple settings, users won't need to set a suffix.

When a suffix is set, the logic is the following:

  1. This suffix is added to each column of the aux table after aggregating it.
  2. Then, the join procedure is performed between the main table and the aggregated aux table, using the default suffix values, e.g., ("_x", "_y") for pandas.

Therefore, we won't have errors if we don't set the suffix in this example. However, we will have columns suffixed with variations of _x and _y, which will be hard to decipher.

Note that if we input several tables without setting suffixes, we will automatically generate suffixes using the index of the auxiliary tables (_1, _2, ...).

However, if we chain several AggJoiners, like in this example, I'm afraid we can't generate good default suffixes.

WDYT?
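The suffix logic described above can be illustrated with plain pandas on tiny made-up tables: without an explicit suffix, colliding columns get pandas' default _x/_y names; with one, the aggregated columns stay readable.

```python
import pandas as pd

main = pd.DataFrame({"userId": [1, 2], "rating": [3.0, 4.0]})
agg_user = pd.DataFrame({"userId": [1, 2], "rating": [3.5, 4.2]})

# Default behavior: pandas disambiguates with _x / _y, hard to decipher.
collided = main.merge(agg_user, on="userId")

# With a suffix applied to the aggregated table before the join:
suffixed = agg_user.add_suffix("_user").rename(columns={"userId_user": "userId"})
readable = main.merge(suffixed, on="userId")
```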

Contributor

I think the meaningful suffixes "_movie" and "_user" actually help understand what we are doing and what we see in the cells' outputs (although that part is a bit hard still because the tables are quite wide)

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 2, 2023 via email

@Vincent-Maladiere
Member Author

Vincent-Maladiere commented Oct 2, 2023

This is a relatively complete example. In most simple settings, users won't need to set a suffix.

OK, but in the current situation the website will convey the impression that it is very complicated to use and will scare people away. I'm trying to act on this as much as I can by removing every possible element of complexity. It's a pity that we cannot have heuristics that avoid collisions and that we need to set suffixes in this example.

We have heuristics that avoid collisions, but they will be very uninformative in this example (what is _x, what is _y?). We could do magic stuff with metaclasses or caching to replace those with _1 and _2, but I'm not eager to go down that road.

@Vincent-Maladiere
Member Author

Alternatively, we can simplify the example by removing either the AggJoiner or the AggTarget, and hope it does not degrade the already weak performance too much.

@jeromedockes
Contributor

degrade the already weak performance too much.

as a baseline, have you tried the same estimator but without joining any auxiliary tables?

@Vincent-Maladiere
Member Author

degrade the already weak performance too much.

as a baseline, have you tried the same estimator but without joining any auxiliary tables?

Yes, it gives chance-level performance, with zero R².
We already have some baselines that bring predictive power.

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 2, 2023 via email

@Vincent-Maladiere
Member Author

Here is a quick ablation study.

Full pipeline

pipeline = make_pipeline(
    table_vectorizer,
    agg_joiner_user,
    agg_joiner_movie,
    agg_target_user,
    agg_target_movie,
    HistGradientBoostingRegressor(learning_rate=0.1, max_depth=4, max_iter=40),
)

Without agg_joiner_movie and agg_joiner_user

Without agg_joiner_movie and agg_joiner_user and agg_target_user

Without agg_joiner_movie and agg_joiner_user and agg_target_movie


Conclusions:

  • neither AggJoiner improves performance
  • both AggTargets play a significant role

So, we can remove the AggJoiners, but we have to keep both AggTargets.
Are we happy with this simplification? I know it doesn't showcase AggJoiner anymore, but we can find another example that will.

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 4, 2023 via email

@Vincent-Maladiere
Member Author

It's done! LMK what you think :)

@Vincent-Maladiere
Member Author

@GaelVaroquaux should we merge this now?


@GaelVaroquaux GaelVaroquaux left a comment


LGTM. Merging, thanks!

@GaelVaroquaux GaelVaroquaux merged commit f332ca6 into skrub-data:main Oct 10, 2023
22 checks passed
@GaelVaroquaux
Member

🎉

@Vincent-Maladiere Vincent-Maladiere deleted the add_join_agg branch November 9, 2023 16:34
Labels
enhancement New feature or request