MAINT use composition in TableVectorizer contn'd #761

glemaitre · 2023-09-27T15:45:58Z

closes #660
closes #675

The same as #675 but handles the split/merge parallelism. It seems that we can simplify the code quite a bit.

I need to add the tests and modify some as in the original PR.

Vincent-Maladiere · 2023-09-28T10:32:58Z

Thank you for this!

LeoGrin · 2023-09-28T11:16:57Z

Awesome, the split/merge logic is much simpler now!

glemaitre · 2023-09-28T14:02:44Z

Awesome, the split/merge logic is much simpler now!

And we don't need the _split method indeed.

skrub/_table_vectorizer.py

jeromedockes · 2023-10-02T17:33:15Z

@glemaitre with the PR I opened on your branch the conflicts are resolved and the tests should pass so it should be easier to move forward

fix conflicts & tests

glemaitre · 2023-10-03T08:13:44Z

Thanks @jeromedockes, I just merged the PR.

glemaitre · 2023-10-03T08:17:04Z

I assume that it is still missing an implementation of get_params/set_params, as done by @Vincent-Maladiere. I have not yet had the time to check those.

jeromedockes · 2023-10-03T09:33:42Z

@glemaitre should the merging of transformers be applied also to named_transformers_ and column_indices_? I think it might be confusing otherwise if the transformers_ don't map to anything in those attributes

also we might want these properties to check if the estimator is fitted, otherwise we get an AttributeError that is a bit harder to interpret
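A minimal sketch of the fitted check suggested above, written in plain Python so it runs without dependencies. `ToyVectorizer`, `ToyNotFittedError`, and the stored attribute contents are hypothetical names used only for illustration; scikit-learn provides `sklearn.utils.validation.check_is_fitted` for the same purpose.

```python
class ToyNotFittedError(AttributeError):
    """Hypothetical stand-in for scikit-learn's NotFittedError."""


class ToyVectorizer:
    """Hypothetical estimator exposing a fitted attribute through a property."""

    def fit(self, columns):
        # Fitted state lives on a private attribute that only exists after fit.
        self._transformers = [("passthrough", list(columns))]
        return self

    @property
    def transformers_(self):
        # Check the fitted state first so the user gets an explicit message
        # instead of an AttributeError about a private attribute.
        if not hasattr(self, "_transformers"):
            raise ToyNotFittedError(
                f"This {type(self).__name__} instance is not fitted yet. "
                "Call 'fit' before accessing 'transformers_'."
            )
        return self._transformers
```

Accessing `ToyVectorizer().transformers_` before `fit` then fails with a message that names the public attribute and says what to do, rather than pointing at an internal one.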

skrub/_table_vectorizer.py

jeromedockes · 2023-10-04T08:23:22Z

@glemaitre one option to avoid the split and merge stuff without nested parallelism is to rely on the encoder's built-in parallelization, and have no parallelization at the TableVectorizer level. Indeed, in most cases what takes a lot of time will be encoding the text columns, so performing the numerical passthrough at the same time has little benefit. Advantages would be much simpler code, avoiding the tight coupling between the TableVectorizer and the encoders, treating user-provided or skrub encoders in the same way, and focusing the parallelization & optimization efforts in one place only. For example, the minhash's parallelization is a bit more involved than just parallelizing over columns because it takes unique values across columns, so using its own specialized approach could be better. We can talk about it today but it would be great to also have @LeoGrin 's opinion on this
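To illustrate why deduplicating across columns does not reduce to a per-column loop, here is a rough sketch in plain Python. It is not the MinHashEncoder's implementation: `toy_encode` and `encode_columns` are hypothetical helpers, and the hash-based encoding is only a placeholder for an expensive per-string computation.

```python
import hashlib


def toy_encode(value, n_components=4):
    """Hypothetical stand-in for an expensive per-string encoding."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return tuple(digest[:n_components])


def encode_columns(columns, n_components=4):
    """Encode every cell, computing each distinct string only once.

    `columns` maps a column name to a list of strings. Returns the encoded
    columns and the number of strings that were actually encoded.
    """
    # Deduplicate across all columns, not per column, so a value that
    # appears in several columns is still encoded a single time.
    unique_values = {value for values in columns.values() for value in values}
    # This loop over unique values is the part that could be distributed.
    cache = {value: toy_encode(value, n_components) for value in unique_values}
    encoded = {
        name: [cache[value] for value in values]
        for name, values in columns.items()
    }
    return encoded, len(cache)
```

Splitting the work by column would encode a string shared by two columns twice, which is why an encoder working this way may prefer to distribute the unique values itself.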

jeromedockes · 2023-10-04T08:25:05Z

a small script to show it doesn't slow down a typical example:

import timeit

from skrub.datasets import fetch_employee_salaries
from skrub import TableVectorizer, GapEncoder, MinHashEncoder

dataset = fetch_employee_salaries()

X = dataset.X
y = dataset.y


print(X.iloc[0])
print(X.shape)

number = 10
n_jobs = 8
for encoder in GapEncoder, MinHashEncoder:
    for n_components in 10, 30, 90:
        print(f"\n{encoder.__name__}, {n_components} components")
        vectorizer = TableVectorizer(
            n_jobs=n_jobs,
            high_card_cat_transformer=encoder(n_components=n_components, n_jobs=1),
        )
        elapsed = timeit.timeit(
            "vectorizer.fit_transform(X)", number=number, globals=globals()
        )
        print(f"parallelize vectorizer: {elapsed / number:.2f}s")

        vectorizer = TableVectorizer(
            n_jobs=1,
            high_card_cat_transformer=encoder(n_components=n_components, n_jobs=n_jobs),
        )
        elapsed = timeit.timeit(
            "vectorizer.fit_transform(X)", number=number, globals=globals()
        )
        print(f"parallelize encoder: {elapsed / number:.2f}s")

gender                                                                     F
department                                                               POL
department_name                                         Department of Police
division                   MSB Information Mgmt and Tech Division Records...
assignment_category                                         Fulltime-Regular
employee_position_title                          Office Services Coordinator
date_first_hired                                                  09/22/1986
year_first_hired                                                        1986
Name: 0, dtype: object
(9228, 8)

GapEncoder, 10 components
parallelize vectorizer: 1.16s
parallelize encoder: 1.14s

GapEncoder, 30 components
parallelize vectorizer: 1.67s
parallelize encoder: 1.71s

GapEncoder, 90 components
parallelize vectorizer: 3.41s
parallelize encoder: 3.49s

MinHashEncoder, 10 components
parallelize vectorizer: 0.15s
parallelize encoder: 0.13s

MinHashEncoder, 30 components
parallelize vectorizer: 0.21s
parallelize encoder: 0.18s

MinHashEncoder, 90 components
parallelize vectorizer: 0.39s
parallelize encoder: 0.24s

We can do more benchmarks if needed

LeoGrin · 2023-10-04T08:44:34Z

We can talk about it today but it would be great to also have @LeoGrin 's opinion on this

Favouring the inner loop by removing the column transformer parallelism and parallelising each transformer was one of the things we considered, but it was deemed too surprising for the user (for instance the n_jobs attribute wouldn't match what the user provided), see #586. I hadn't thought about it, but with this PR (using composition), I think that this is less of a concern: users might be surprised by how we parallelize, but we can document it well. Split/merge would be faster than favouring the inner loop if we have multiple slow transformers (and maybe if we have a lot of normal transformers, I'm not sure), but in the current situation, I agree that the speed gain will probably be small. All in all I think it may actually be a good idea to go back to the simple solution :)

jeromedockes · 2023-10-04T10:02:08Z

All in all I think it may actually be a good idea to go back to the simple solution :)

Thanks for your insights! I had seen the original PR but missed that discussion in the issue (and of course the IRL one). Given that you @LeoGrin agree to parallelizing the encoder rather than the TableVectorizer, and so do @glemaitre @Vincent-Maladiere and @GaelVaroquaux, we'll do that. We'll do it in this PR to avoid adding code and tests that would be removed afterwards.

If I misunderstood someone's preference on this please LMK!
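One possible shape for passing `n_jobs` down to the encoders, sketched in plain Python under stated assumptions. `ToyEncoder` and `propagate_n_jobs` are hypothetical and do not reproduce the code in this PR; the sketch copies the encoder so that the object the user passed keeps the `n_jobs` value they set.

```python
import copy


class ToyEncoder:
    """Hypothetical encoder with its own n_jobs parameter."""

    def __init__(self, n_jobs=None):
        self.n_jobs = n_jobs


def propagate_n_jobs(encoder, n_jobs):
    """Return a copy of `encoder` that inherits `n_jobs` when it has none set."""
    # Work on a copy so the object passed by the user is left untouched.
    encoder = copy.copy(encoder)
    # Only fill in n_jobs when the encoder supports it and the user did not
    # choose a value explicitly.
    supports_n_jobs = hasattr(encoder, "n_jobs")
    if n_jobs is not None and supports_n_jobs and encoder.n_jobs is None:
        encoder.n_jobs = n_jobs
    return encoder
```

With this rule, an encoder created with an explicit `n_jobs=2` keeps that value, while one left at `None` picks up the value given to the outer estimator.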

glemaitre · 2023-10-04T11:43:00Z

I'm going to work on this in the afternoon.

skrub/_table_vectorizer.py

jeromedockes

LGTM!

Vincent-Maladiere

I have a few remarks and questions before it LGTM.

skrub/_table_vectorizer.py

Vincent-Maladiere · 2023-10-16T10:07:17Z

skrub/_table_vectorizer.py

+        # Note that when fitting on a dataframe and transforming on
+        # the same dataframe with different column names,
+        # _check_feature_names will raise an error.
+        self._check_feature_names(X, reset=reset)


Should this block be put before L699? So that feature_names_in is passed to the dataframe constructor instead of being set afterward. Furthermore, can feature_names be None since we convert X to a dataframe?

We can move those checks into fit_transform, as done in the ColumnTransformer.
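The behaviour described in the code comment above (names recorded at fit time, then enforced at transform time) can be sketched as follows. This is a simplified plain-Python illustration, not scikit-learn's `_check_feature_names`; `ToyNamesChecker` and its error message are hypothetical.

```python
class ToyNamesChecker:
    """Hypothetical estimator that records column names at fit time."""

    def _check_feature_names(self, columns, reset):
        columns = list(columns)
        if reset:
            # Fitting: remember the names seen on the training data.
            self.feature_names_in_ = columns
        elif columns != self.feature_names_in_:
            # Transforming: the same data under different names is rejected.
            raise ValueError(
                f"Column names {columns} do not match those seen during fit: "
                f"{self.feature_names_in_}."
            )

    def fit(self, columns):
        self._check_feature_names(columns, reset=True)
        return self

    def transform(self, columns):
        self._check_feature_names(columns, reset=False)
        return list(columns)
```

Here `reset=True` corresponds to fitting and `reset=False` to transforming, matching the `reset=reset` argument in the reviewed line.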

skrub/_table_vectorizer.py

Vincent-Maladiere · 2023-10-16T12:12:28Z

skrub/_table_vectorizer.py

-            return list(ct_feature_names)
-
-        return all_trans_feature_names
+            if name == "remainder" and len(columns) < 20:


Let's add a TODO to remove this when scikit-learn/scikit-learn#27533 is closed.

LeoGrin

LGTM!

As discussed during the meeting, #709 will be closed in another PR.

Vincent-Maladiere

LGTM! :)

jeromedockes · 2023-10-30T17:27:48Z

awesome, thanks a lot!! @glemaitre all reviewers approved it; can I merge it or were you still planning to push some changes?

glemaitre · 2023-10-30T17:36:05Z

Let's merge and iterate on the improvements.

MAINT use composition in TableVectorizer contn'd

fccb8d3

glemaitre marked this pull request as draft September 28, 2023 14:02

Vincent-Maladiere mentioned this pull request Sep 29, 2023

Support Polars dataframes across the library #769

Open

12 tasks

jeromedockes reviewed Oct 2, 2023

jeromedockes added 3 commits October 2, 2023 15:49

fix tests that don't focus on splitting & merging transformers

1b5d967

Merge remote-tracking branch 'upstream/main' into continue_761

7242b75

fix remaining failures

cd761fc

Merge pull request #1 from jeromedockes/continue_761

c412d61

fix conflicts & tests

jeromedockes reviewed Oct 3, 2023

glemaitre added 7 commits October 4, 2023 15:13

implement n_jobs propagation

e832af6

typo

f7230ba

do not split anymore

d5d182b

remove dead code for TableVectorizer

fe2e8ef

remove split/merge from encoder

fd9daca

add changelog

51a53ea

remove dead code

da41214

glemaitre marked this pull request as ready for review October 4, 2023 13:52

remove combine_lru_dicts

018a831

jeromedockes reviewed Oct 4, 2023

LeoGrin reviewed Oct 10, 2023

do not modify inplae

fe950b3

jeromedockes approved these changes Oct 13, 2023

Vincent-Maladiere reviewed Oct 16, 2023

LeoGrin approved these changes Oct 16, 2023

glemaitre mentioned this pull request Oct 17, 2023

MAINT use composition in TableVectorizer #675

Closed

glemaitre added 2 commits October 23, 2023 13:25

Merge remote-tracking branch 'origin/main' into revamp_675

6e71b01

stick to ColumnTransformer fit_transform implementation

336c831

Vincent-Maladiere approved these changes Oct 30, 2023

jeromedockes merged commit bcd7cd8 into skrub-data:main Oct 30, 2023
22 checks passed
