MAINT use composition in TableVectorizer contn'd#761

Merged
jeromedockes merged 16 commits into
skrub-data:mainfrom
glemaitre:revamp_675
Oct 30, 2023

Conversation

@glemaitre

Member

closes #660
closes #675

The same as #675 but handles the split/merge parallelism. It seems that we can simplify the code quite a bit.

I need to add the tests and modify some as in the original PR.

@Vincent-Maladiere

Member

Thank you for this!

@LeoGrin

LeoGrin commented Sep 28, 2023

Contributor

Awesome, the split/merge logic is much simpler now!

@glemaitre

Member Author

> Awesome, the split/merge logic is much simpler now!

And we don't need the _split method indeed.

@glemaitre glemaitre marked this pull request as draft September 28, 2023 14:02
Comment thread skrub/_table_vectorizer.py Outdated
@jeromedockes

Member

@glemaitre with the PR I opened on your branch, the conflicts are resolved and the tests should pass, so it should be easier to move forward.

@glemaitre

Member Author

Thanks @jeromedockes, I just merged the PR.

@glemaitre

Member Author

I assume it is still missing an implementation of get_params/set_params, as done by @Vincent-Maladiere. I have not had time to check those yet.

@jeromedockes

Member

@glemaitre should the merging of transformers also be applied to named_transformers_ and column_indices_? Otherwise it might be confusing if the entries in transformers_ don't map to anything in those attributes.

Also, we might want these properties to check whether the estimator is fitted; otherwise we get an AttributeError that is harder to interpret.
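
To illustrate the fitted check suggested above, here is a minimal sketch. `NotFittedError`, `InnerTransformer` and `Vectorizer` are hypothetical stand-ins defined only for this example (the first mirrors scikit-learn's exception of the same name); this is not the code in `skrub/_table_vectorizer.py`.

```python
class NotFittedError(ValueError, AttributeError):
    """Local stand-in for sklearn.exceptions.NotFittedError."""


class InnerTransformer:
    """Hypothetical stand-in for the wrapped ColumnTransformer."""

    def fit(self, columns):
        # Fitted attributes follow the trailing-underscore convention.
        self.transformers_ = [("passthrough", list(columns))]
        return self


class Vectorizer:
    """Hypothetical wrapper that owns an inner transformer (composition)."""

    def fit(self, columns):
        self._inner = InnerTransformer().fit(columns)
        return self

    def _fitted_inner(self):
        # Fail with an explicit message rather than a bare
        # AttributeError about the private `_inner` attribute.
        if not hasattr(self, "_inner"):
            raise NotFittedError(
                f"This {type(self).__name__} instance is not fitted yet. "
                "Call 'fit' before using this attribute."
            )
        return self._inner

    @property
    def transformers_(self):
        # Delegate to the inner object once we know it exists.
        return self._fitted_inner().transformers_
```

With this pattern, reading `Vectorizer().transformers_` before `fit` raises an error that names the estimator and the missing step, instead of pointing at an internal attribute.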

Comment thread skrub/_table_vectorizer.py Outdated
@jeromedockes

Member

@glemaitre one option to avoid the split-and-merge logic without introducing nested parallelism is to rely on the encoders' built-in parallelization and have no parallelization at the TableVectorizer level. In most cases, encoding the text columns is what takes most of the time, so performing the numerical passthrough at the same time brings little benefit. The advantages would be much simpler code, no tight coupling between the TableVectorizer and the encoders, identical treatment of user-provided and skrub encoders, and parallelization and optimization efforts focused in one place only. For example, the MinHashEncoder's parallelization is a bit more involved than just parallelizing over columns because it takes unique values across columns, so using its own specialized approach could be better. We can talk about it today but it would be great to also have @LeoGrin 's opinion on this
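
As a rough illustration of why an encoder may prefer its own parallelization strategy, the sketch below deduplicates strings across all columns and parallelizes over the unique values rather than over columns. `toy_encode` and `encode_table` are hypothetical names, the hash is a placeholder rather than skrub's MinHash, and a thread pool is used only to keep the example self-contained.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def toy_encode(value, n_components=4):
    """Toy deterministic encoding; a placeholder, NOT skrub's MinHash."""
    return [
        zlib.crc32(f"{seed}:{value}".encode()) % 1000
        for seed in range(n_components)
    ]


def encode_table(columns, n_jobs=2):
    """Encode string columns, computing each distinct string only once."""
    # Deduplicate across all columns, preserving first-seen order.
    unique_values = list(
        dict.fromkeys(v for col in columns.values() for v in col)
    )
    # Parallelize over unique values rather than over columns.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        encoded = dict(zip(unique_values, pool.map(toy_encode, unique_values)))
    # Map the cached encodings back onto each column.
    return {name: [encoded[v] for v in col] for name, col in columns.items()}


table = {"department": ["POL", "HHS", "POL"], "division": ["POL", "FRS"]}
result = encode_table(table)
```

A per-column split at the TableVectorizer level would encode "POL" once per column; working on the pooled unique values avoids that repeated work.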

@jeromedockes

Member

a small script to show it doesn't slow down a typical example:

```python
import timeit

from skrub.datasets import fetch_employee_salaries
from skrub import TableVectorizer, GapEncoder, MinHashEncoder

dataset = fetch_employee_salaries()

X = dataset.X
y = dataset.y


print(X.iloc[0])
print(X.shape)

number = 10  # repetitions per timing
n_jobs = 8
for encoder in GapEncoder, MinHashEncoder:
    for n_components in 10, 30, 90:
        print(f"\n{encoder.__name__}, {n_components} components")
        # Parallelize in the TableVectorizer; the encoder uses a single job.
        vectorizer = TableVectorizer(
            n_jobs=n_jobs,
            high_card_cat_transformer=encoder(n_components=n_components, n_jobs=1),
        )
        elapsed = timeit.timeit(
            "vectorizer.fit_transform(X)", number=number, globals=globals()
        )
        print(f"parallelize vectorizer: {elapsed / number:.2f}s")

        # Single job in the TableVectorizer; parallelize inside the encoder.
        vectorizer = TableVectorizer(
            n_jobs=1,
            high_card_cat_transformer=encoder(n_components=n_components, n_jobs=n_jobs),
        )
        elapsed = timeit.timeit(
            "vectorizer.fit_transform(X)", number=number, globals=globals()
        )
        print(f"parallelize encoder: {elapsed / number:.2f}s")
```

```
gender                                                                     F
department                                                               POL
department_name                                         Department of Police
division                   MSB Information Mgmt and Tech Division Records...
assignment_category                                         Fulltime-Regular
employee_position_title                          Office Services Coordinator
date_first_hired                                                  09/22/1986
year_first_hired                                                        1986
Name: 0, dtype: object
(9228, 8)

GapEncoder, 10 components
parallelize vectorizer: 1.16s
parallelize encoder: 1.14s

GapEncoder, 30 components
parallelize vectorizer: 1.67s
parallelize encoder: 1.71s

GapEncoder, 90 components
parallelize vectorizer: 3.41s
parallelize encoder: 3.49s

MinHashEncoder, 10 components
parallelize vectorizer: 0.15s
parallelize encoder: 0.13s

MinHashEncoder, 30 components
parallelize vectorizer: 0.21s
parallelize encoder: 0.18s

MinHashEncoder, 90 components
parallelize vectorizer: 0.39s
parallelize encoder: 0.24s
```

We can do more benchmarks if needed
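
To make the comparison easier to read, the timings reported above can be turned into relative changes. The numbers are copied from that output; the helper name `relative_change` is only for this sketch.

```python
# Mean seconds per fit_transform, copied from the output above:
# (encoder, n_components) -> (parallelize vectorizer, parallelize encoder)
timings = {
    ("GapEncoder", 10): (1.16, 1.14),
    ("GapEncoder", 30): (1.67, 1.71),
    ("GapEncoder", 90): (3.41, 3.49),
    ("MinHashEncoder", 10): (0.15, 0.13),
    ("MinHashEncoder", 30): (0.21, 0.18),
    ("MinHashEncoder", 90): (0.39, 0.24),
}


def relative_change(vectorizer_s, encoder_s):
    """Percent change when moving parallelism from vectorizer to encoder."""
    return 100 * (encoder_s - vectorizer_s) / vectorizer_s


changes = {key: relative_change(*pair) for key, pair in timings.items()}
for (encoder, n_components), pct in changes.items():
    # Negative values mean that parallelizing the encoder was faster.
    print(f"{encoder:>14} {n_components:>3}: {pct:+.1f}%")
```

On these figures, the GapEncoder runs stay within about 2.5% of each other in either direction, while parallelizing inside the MinHashEncoder is roughly 13% to 38% faster. These are single-machine measurements on one dataset, so they are indicative only.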

@LeoGrin

LeoGrin commented Oct 4, 2023

Contributor

> We can talk about it today but it would be great to also have @LeoGrin 's opinion on this

Favouring the inner loop by removing the column transformer parallelism and parallelising each transformer was one of the things we considered, but it was deemed too surprising for the user (for instance the n_jobs attribute wouldn't match what the user provided), see #586. I hadn't thought about it, but with this PR (using composition), I think that this is less of a concern: users might be surprised by how we parallelize, but we can document it well. Split/merge would be faster than favouring the inner loop if we have multiple slow transformers (and maybe if we have a lot of normal transformers, I'm not sure), but in the current situation, I agree that the speed gain will probably be small. All in all I think it may actually be a good idea to go back to the simple solution :)
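
One way the composition approach can keep the user's `n_jobs` intact while still parallelizing the inner loop is sketched below. This is an assumption about a possible design, not necessarily what this PR implements; `ToyEncoder` and `ToyVectorizer` are hypothetical classes, and `copy.deepcopy` stands in for scikit-learn's `clone`.

```python
import copy


class ToyEncoder:
    """Hypothetical encoder with its own n_jobs parameter (None = unset)."""

    def __init__(self, n_components=10, n_jobs=None):
        self.n_components = n_components
        self.n_jobs = n_jobs


class ToyVectorizer:
    """Hypothetical wrapper: no parallelism of its own, it forwards n_jobs."""

    def __init__(self, encoder, n_jobs=None):
        self.encoder = encoder
        self.n_jobs = n_jobs

    def fit(self):
        # Work on a copy so the user's encoder and parameters stay untouched.
        fitted = copy.deepcopy(self.encoder)
        # Forward n_jobs only if the user did not set it on the encoder.
        if fitted.n_jobs is None:
            fitted.n_jobs = self.n_jobs
        self.encoder_ = fitted
        return self


user_encoder = ToyEncoder()
vectorizer = ToyVectorizer(user_encoder, n_jobs=8).fit()
explicit = ToyVectorizer(ToyEncoder(n_jobs=2), n_jobs=8).fit()
```

Here `vectorizer.n_jobs` and `user_encoder.n_jobs` still hold what the user provided; only the fitted copy `encoder_` receives the forwarded value, and an encoder with an explicit `n_jobs` keeps it.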

@jeromedockes

Member

> All in all I think it may actually be a good idea to go back to the simple solution :)

Thanks for your insights! I had seen the original PR but missed that discussion in the issue (and of course the IRL one). Given that you @LeoGrin agree to parallelizing the encoder rather than the TableVectorizer, and so do @glemaitre @Vincent-Maladiere and @GaelVaroquaux, we'll do that. We'll do it in this PR to avoid adding code and tests that would be removed afterwards.

If I misunderstood someone's preference on this please LMK!

@glemaitre

Member Author

I'm going to work on this in the afternoon.

@glemaitre glemaitre marked this pull request as ready for review October 4, 2023 13:52

@jeromedockes jeromedockes left a comment

Member

LGTM!

@Vincent-Maladiere Vincent-Maladiere left a comment

Member

I have a few remarks and questions before it LGTM.

Comment thread skrub/_table_vectorizer.py
Comment thread skrub/_table_vectorizer.py Outdated
# Note that when fitting on a dataframe and transforming on
# the same dataframe with different column names,
# _check_feature_names will raise an error.
self._check_feature_names(X, reset=reset)

Member

Should this block be put before L699? So that feature_names_in is passed to the dataframe constructor instead of being set afterward. Furthermore, can feature_names be None since we convert X to a dataframe?

Member Author

We can move those checks into fit_transform, as is done in ColumnTransformer.

Comment thread skrub/_table_vectorizer.py
return list(ct_feature_names)

return all_trans_feature_names
if name == "remainder" and len(columns) < 20:

Member

Let's add a TODO to remove this when scikit-learn/scikit-learn#27533 is closed.

@LeoGrin LeoGrin left a comment

Contributor

LGTM!

As discussed during the meeting, #709 will be closed in another PR.

@Vincent-Maladiere Vincent-Maladiere left a comment

Member

LGTM! :)

@jeromedockes

Member

awesome, thanks a lot!! @glemaitre all reviewers approved it; can I merge it or were you still planning to push some changes?

@glemaitre

Member Author

Let's merge and iterate on the improvements.

@jeromedockes jeromedockes merged commit bcd7cd8 into skrub-data:main Oct 30, 2023

Development

Successfully merging this pull request may close these issues.

Use composition in the TableVectorizer

4 participants