Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add column-wise transforms & refactor TableVectorizer #902

Merged
merged 116 commits into from
May 28, 2024

Conversation

jeromedockes
Copy link
Member

@jeromedockes jeromedockes commented May 2, 2024

closes #874, #886, #894, #877, #848, #904, #905, #830, #626, #870

This is the last part of the changes outlined in #877 (the first two parts have been merged in #895 and #888)
The main addition is OnEachColumn, a transformer that applies a transformation independently to each column in a dataframe, and is used to refactor the TableVectorizer and ensure it does consistent operations across calls to transform.

@jeromedockes jeromedockes changed the title Add column-wise transforms & refactor TableVectorizer [WIP] Add column-wise transforms & refactor TableVectorizer May 6, 2024
Copy link
Contributor

@TheooJ TheooJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Third pass, thank you @jeromedockes !

skrub/_clean_categories.py Outdated Show resolved Hide resolved
skrub/tests/test_on_subframe.py Outdated Show resolved Hide resolved
skrub/tests/test_to_datetime.py Outdated Show resolved Hide resolved
skrub/tests/test_to_datetime.py Outdated Show resolved Hide resolved
skrub/tests/test_clean_null_strings.py Show resolved Hide resolved
skrub/tests/test_to_float32.py Outdated Show resolved Hide resolved
skrub/tests/test_to_str.py Show resolved Hide resolved
skrub/_table_vectorizer.py Outdated Show resolved Hide resolved
examples/01_encodings.py Outdated Show resolved Hide resolved
examples/01_encodings.py Outdated Show resolved Hide resolved
Co-authored-by: Théo Jolivet <57430673+TheooJ@users.noreply.github.com>
Copy link
Member Author

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @TheooJ

skrub/tests/test_clean_null_strings.py Outdated Show resolved Hide resolved
skrub/tests/test_to_str.py Show resolved Hide resolved
@glemaitre
Copy link
Member

I'll make a pass now.

@glemaitre glemaitre self-requested a review May 28, 2024 12:30
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just ignore the things for the example style. I think that we should in another PR.

#
# Let's first retrieve the dataset:
# Let's first retrieve the dataset, using one of the downloaders from the :mod:`skrub.datasets` module.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you to make the example black complient now (less than 88 characters) or make an automatic pass of the tool in another PR?

Comment on lines 41 to +42
###############################################################################
# A simple prediction pipeline
# ----------------------------
# Easily encoding a dataframe
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we change the example, I would probably use the # %% delimiter nowadays.

Suggested change
###############################################################################
# A simple prediction pipeline
# ----------------------------
# Easily encoding a dataframe
# %%
# Easily encoding a dataframe


from skrub.datasets import fetch_employee_salaries

dataset = fetch_employee_salaries()
employees, salaries = dataset.X, dataset.y
employees

###############################################################################
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
###############################################################################
# %%


X = dataset.X
y = dataset.y
###############################################################################
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
###############################################################################
# %%

Comment on lines 79 to 88
###############################################################################
# We observe diverse columns in the dataset:
# - binary (``'gender'``),
# - numerical (``'employee_annual_salary'``),
# - categorical (``'department'``, ``'department_name'``, ``'assignment_category'``),
# - datetime (``'date_first_hired'``)
# - dirty categorical (``'employee_position_title'``, ``'division'``).
#
# Using skrub's |TableVectorizer|, we can now already build a machine-learning
# pipeline and train it:
# From our 8 columns, the |TableVectorizer| has extracted 143 numerical
# features. Most of them are one-hot encoded representations of the categorical
# features. For example, we can see that 3 columns ``'gender_F'``, ``'gender_M'``,
# ``'gender_nan'`` were created to encode the ``'gender'`` column.

###############################################################################
# By performing appropriate transformations on our complex data, the |TableVectorizer| produced numeric features that we can use for machine-learning:

from sklearn.ensemble import HistGradientBoostingRegressor
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
###############################################################################
# We observe diverse columns in the dataset:
# - binary (``'gender'``),
# - numerical (``'employee_annual_salary'``),
# - categorical (``'department'``, ``'department_name'``, ``'assignment_category'``),
# - datetime (``'date_first_hired'``)
# - dirty categorical (``'employee_position_title'``, ``'division'``).
#
# Using skrub's |TableVectorizer|, we can now already build a machine-learning
# pipeline and train it:
# From our 8 columns, the |TableVectorizer| has extracted 143 numerical
# features. Most of them are one-hot encoded representations of the categorical
# features. For example, we can see that 3 columns ``'gender_F'``, ``'gender_M'``,
# ``'gender_nan'`` were created to encode the ``'gender'`` column.
###############################################################################
# By performing appropriate transformations on our complex data, the |TableVectorizer| produced numeric features that we can use for machine-learning:
from sklearn.ensemble import HistGradientBoostingRegressor
# %%
# From our 8 columns, the |TableVectorizer| has extracted 143 numerical
# features. Most of them are one-hot encoded representations of the categorical
# features. For example, we can see that 3 columns ``'gender_F'``, ``'gender_M'``,
# ``'gender_nan'`` were created to encode the ``'gender'`` column.
#
# By performing appropriate transformations on our complex data, the |TableVectorizer| produced numeric features that we can use for machine-learning:
from sklearn.ensemble import HistGradientBoostingRegressor

Comment on lines 231 to 235
###############################################################################
# The simple pipeline applied on this complex dataset gave us very good results.
# We can see that this new pipeline achieves a similar score but is fitted much faster.
# This is mostly due to replacing |GapEncoder| with |MinHashEncoder| (however this makes the features less interpretable).

###############################################################################
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
###############################################################################
# The simple pipeline applied on this complex dataset gave us very good results.
# We can see that this new pipeline achieves a similar score but is fitted much faster.
# This is mostly due to replacing |GapEncoder| with |MinHashEncoder| (however this makes the features less interpretable).
###############################################################################
# %%
# We can see that this new pipeline achieves a similar score but is fitted much faster.
# This is mostly due to replacing |GapEncoder| with |MinHashEncoder| (however this makes the features less interpretable).
#

pipeline = make_pipeline(TableVectorizer(), regressor)
pipeline.fit(X, y)
pipeline = make_pipeline(vectorizer, regressor)
pipeline.fit(employees, salaries)

###############################################################################
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
###############################################################################
# %%

remainder="drop",
)

X_enc = encoder.fit_transform(X)
pprint(encoder.get_feature_names_out())
# pprint(encoder.get_feature_names_out())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove it then.

@@ -85,6 +85,7 @@ def cols(*columns):
>>> s.all() & ['kind', 'ID']
(all() & cols('kind', 'ID'))

# noqa
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for noqa?

Here we can see the input to ``transform`` has been converted back to the
timezone used during ``fit`` and that we get the same result for "hour".

# noqa
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK so this is to avoid the check on the docstring. I assume that we can clean it afterwords

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is actually looking good.

@glemaitre glemaitre merged commit 5b30ddd into skrub-data:main May 28, 2024
26 checks passed
@jeromedockes
Copy link
Member Author

Oops so sorry @glemaitre I should have said so but I think @GaelVaroquaux was planning to review it as well ... @GaelVaroquaux , if you would like LMK if you want to review maybe the easiest way will be to revert the merge commit and open a new PR to un-revert

@jeromedockes
Copy link
Member Author

Thanks a lot for the review @glemaitre !

@GaelVaroquaux
Copy link
Member

No, no, it's good to have merged. I can give feedback via issues.

Hurray for merge. Thanks a lot to everyone involved!!

@glemaitre
Copy link
Member

Thanks @GaelVaroquaux. We will address the subsequent issues. Let's roll ;)

@jeromedockes
Copy link
Member Author

No, no, it's good to have merged. I can give feedback via issues.

ok, thanks. there will be a few follow-up PRs in any case, @TheooJ and I are going to open a couple of issues

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Basic regression problem raises exception on inference
4 participants