Attempt to speed up unique value discovery in _BaseEncoder for polars and pandas series #27911

Open · wants to merge 14 commits into main

Conversation

@jeromedockes (Contributor) commented Dec 7, 2023

Reference Issues/PRs

Follow-up to #27835, addressing this comment there.

What does this implement/fix? Explain your changes.

This relies on the pandas or polars Series.unique method rather than numpy.unique to identify categories, as it can be faster.
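A minimal sketch of the idea (the helper below is illustrative, not the PR's exact code; NaN/None handling is omitted, and the final np.sort is kept because the encoders need sorted categories):

import numpy as np

def _unique_values(column):
    # Illustrative dispatch: prefer the dataframe library's hash-based
    # unique (roughly O(n)) when the column is a pandas or polars Series,
    # and fall back to numpy's sort-based np.unique (O(n log n)) otherwise.
    module = type(column).__module__.split(".")[0]
    if module in ("pandas", "polars"):
        return np.sort(np.asarray(column.unique()))
    return np.unique(np.asarray(column))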

Any other comments?

So far I am seeing a speedup for pandas but not really for polars; unless I can make it faster for polars, it is probably not worth the added complexity.

Toy benchmark:

https://gist.github.com/jeromedockes/d2e1fc147b7ad0a6dfd686318cc9da57

Results:

branch: main                            branch: ordinal-encoder-pd-unique

polars                                  polars
========                                ========
ordinal encoder fit: 7.05e-02           ordinal encoder fit: 4.89e-02
gradient boosting fit: 3.74e-01         gradient boosting fit: 3.19e-01
_unique(array): 1.38e-02                _unique(series): 1.92e-04
series.unique(): 2.55e-05               series.unique(): 2.56e-05
np.unique(): 4.84e-01                   np.unique(): 4.69e-01

pandas                                  pandas
========                                ========
ordinal encoder fit: 1.40e-02           ordinal encoder fit: 5.72e-03
gradient boosting fit: 2.90e-01         gradient boosting fit: 2.55e-01
_unique(array): 1.03e-02                _unique(series): 2.39e-03
series.unique(): 2.36e-03               series.unique(): 2.38e-03
np.unique(): 3.25e-01                   np.unique(): 3.25e-01

If we change the type of the categorical column to contain integers rather than categories, we see a small speedup for polars but almost 10x for the OrdinalEncoder on pandas:

branch: main                            branch: ordinal-encoder-pd-unique

polars                                  polars
========                                ========
ordinal encoder fit: 2.46e-02           ordinal encoder fit: 1.23e-02
gradient boosting fit: 2.42e-01         gradient boosting fit: 2.67e-01
_unique(array): 2.09e-02                _unique(series): 1.11e-02
series.unique(): 8.90e-03               series.unique(): 8.44e-03
np.unique(): 2.08e-02                   np.unique(): 2.07e-02

pandas                                  pandas
========                                ========
ordinal encoder fit: 2.11e-02           ordinal encoder fit: 2.89e-03
gradient boosting fit: 2.49e-01         gradient boosting fit: 2.32e-01
_unique(array): 2.09e-02                _unique(series): 2.70e-03
series.unique(): 2.69e-03               series.unique(): 2.69e-03
np.unique(): 2.08e-02                   np.unique(): 2.07e-02

github-actions (bot) commented Dec 7, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit b21d596.

@adrinjalali (Member) commented:
This is adding quite a bit of complexity for something that I think should mostly live in the corresponding libraries, not in scikit-learn. I wonder why @ogrisel thinks it needs to be here.

@ogrisel (Member) commented Dec 11, 2023

> I wonder why @ogrisel thinks it needs to be here.

Because numpy.unique has O(n log n) complexity, which is quite catastrophic for large dataframes. Both polars and pandas have O(n) implementations.

@jeromedockes could you please run a quick benchmark of this PR vs main on random pandas & polars dataframes with 100 categorical columns and 10_000_000 rows just to confirm that this is worth the extra complexity?

EDIT: I just opened the details in the description, and indeed there is already a nice 10x speed-up at 1M samples. This should grow significantly beyond that.
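A rough sketch of the complexity difference (illustrative only; absolute timings vary with data, hardware, and library versions):

import numpy as np
import pandas as pd
from timeit import timeit

rng = np.random.default_rng(0)
# 10M rows drawn from 30 string categories
values = rng.choice([f"cat_{i}" for i in range(30)], size=10_000_000)
series = pd.Series(values)

# sort-based, O(n log n)
print("np.unique:     ", timeit(lambda: np.unique(values), number=3))
# hash-based, O(n)
print("Series.unique: ", timeit(lambda: series.unique(), number=3))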

@ogrisel (Member) commented Dec 11, 2023

@jeromedockes can you please check at 10M for polars to see if the speed-up grows or not? If not, maybe we can indeed remove the polars specialization for the sake of maintainability.

@adrinjalali (Member) left a comment

My feeling is that ideally we would be relying on the array API unique_values method instead.

Here we are modifying the default behavior of unique from pandas and polars, and this seems a bit inconsistent in the sense that we don't necessarily do the same for other methods (?)

The issue with np.unique is also not unique (pun intended) to encoders, and we have another PR (#26820) circumventing the inefficiency of np.unique in a different way.

In an in-person discussion with @glemaitre, we concluded that a better way forward is to improve np.unique itself, which would solve all of these issues in scikit-learn and elsewhere.

So I plan to work on that, which would in turn remove the need for this PR and the other one.
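For reference, the array API route mentioned above would look roughly like this (a sketch assuming the array-api-compat helper package):

import numpy as np
from array_api_compat import array_namespace

x = np.asarray([3, 1, 2, 1, 3])
xp = array_namespace(x)         # resolve the array's API namespace
uniques = xp.unique_values(x)   # unique_values is part of the array API spec
print(uniques)                  # [1 2 3] (order is implementation-defined)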

Comment on lines 105 to 110
if is_pandas:
    values = X.iloc[:, i]
elif is_polars:
    values = X[X.columns[i]]
else:
    values = Xi

Should we use _safe_indexing instead?
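For comparison, sklearn.utils._safe_indexing already dispatches on the container type, so the branching above could plausibly collapse to a single call (a sketch; whether polars is supported depends on the scikit-learn version):

import pandas as pd
from sklearn.utils import _safe_indexing

X = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
# The same call works for numpy arrays and pandas DataFrames
# (and polars in recent versions): select column 1 by position.
values = _safe_indexing(X, 1, axis=1)
print(values)  # the "b" column as a pandas Series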

@jeromedockes (Contributor, Author) commented Dec 13, 2023

> @jeromedockes can you please check at 10M for polars to see if the speed-up grows or not? If not, maybe we can indeed remove the polars specialization for the sake of maintainability.

Here are some plots for _unique with 1M and 10M samples. Actually the biggest speedup is for polars categorical columns, but I think that case will rarely occur in practice, see below. For object arrays, the main branch uses a Python set, which is faster than both np.unique and pandas.unique, so I turned off the use of pandas.unique for columns with dtype object.

[plots: timings of sklearn.utils._encode._unique, log scale, for 1,000,000 and 10,000,000 samples]

It seems polars categorical series store a metadata bit indicating whether the unique values in the series coincide with the full set of categories possible in that series. That is probably a frequent enough case, as it occurs whenever a categorical series is built by inferring the categories from the data. When it holds (and the series is in a single chunk), the unique values can be obtained by looking only at the metadata (the mapping from the categorical type to string values) and not at the data, which is why it is so fast in the example above. np.unique will therefore never be able to match that; however, it is a very specific case (polars, categorical, and fast unique possible) and probably won't apply whenever the polars series is sliced (e.g. when doing cross-validation):

>>> s.dtype, s.shape
(Categorical, (1000000,))
>>> s1 = s[:-1]
>>> s1.dtype, s1.shape
(Categorical, (999999,))
>>> %timeit s.unique()
24.2 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit s1.unique()
5.29 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So in practice we would almost never benefit from that optimization if this PR were merged.

For the other cases I agree that improving np.unique sounds like the best approach; a temporary optimization in scikit-learn's _unique could still make sense if the improvement to np.unique will take a long time to be released, but it does add some complexity.

@glemaitre self-requested a review January 9, 2024 09:27
@ogrisel (Member) left a comment

Given the benchmark results, the analysis above, and the limited complexity of the dataframe-specific code, I would be in favor of including the optimizations suggested in this PR (while contributing improvements to numpy in parallel).

Once the numpy implementation has been optimized, we can consider removing some of the dataframe-specific code in scikit-learn, but probably not all of it, so as to still avoid sorting columns for nothing when the result has already been computed as part of the dataframe dtype metadata.
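To illustrate the dtype-metadata shortcut: for a pandas categorical column the candidate categories can be read from the dtype without scanning the data (note they may be a superset of the values actually present):

import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category")
# Read from the dtype metadata, no O(n) scan of the values:
print(s.cat.categories)  # Index(['a', 'b'], dtype='object')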

:class:`preprocessing.OneHotEncoder` and :class:`preprocessing.TargetEncoder`
can be faster on pandas DataFrames by using a more efficient way of finding
unique values in the categorical columns. :pr:`27911` by :user:`Jérôme Dockès
<jeromedockes>`.

This will have to be moved to target 1.5 instead of 1.4.

@jeromedockes changed the title [WIP] attempt to speed up unique value discovery in _BaseEncoder for polars and pandas series Attempt to speed up unique value discovery in _BaseEncoder for polars and pandas series Jan 10, 2024
@jeromedockes marked this pull request as ready for review January 10, 2024 08:47
@thomasjpfan (Member) left a comment

Thanks for the PR!

import pandas as pd

if parse_version("1.4.0") <= parse_version(pd.__version__):
    return _unique_pandas(values, return_counts=return_counts)

At this point, if the pandas version is too old, then we return None.
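A sketch of how the fallback could be made explicit rather than implicitly returning None (_unique_pandas is the PR's helper shown above; _unique_np is assumed here as the name of the existing numpy-based path):

import pandas as pd
from sklearn.utils.fixes import parse_version

def _unique(values, *, return_counts=False):
    # Use the fast pandas path only when the installed pandas supports it.
    if isinstance(values, pd.Series) and parse_version(
        pd.__version__
    ) >= parse_version("1.4.0"):
        return _unique_pandas(values, return_counts=return_counts)
    # Otherwise fall through explicitly to the generic numpy-based path
    # instead of silently returning None.
    return _unique_np(values, return_counts=return_counts)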

if _is_polars_series(values):
    # polars unique, arg_sort not supported for polars.Object dtype.
    if str(values.dtype) != "Object":
        return _unique_polars(values, return_counts=return_counts)

Same here regarding return None.

Comment on lines +83 to +84
# unique returns a NumpyExtensionArray for extension dtypes and a numpy
# array for other dtypes

Is this comment true? For the categorical dtype, the output is neither a NumpyExtensionArray nor an np.ndarray:

import pandas as pd
from pandas.arrays import NumpyExtensionArray
import numpy as np

x = pd.Series(["a", "b", "c", "b", "a", "b"], dtype="category")
uniques = x.unique()

assert not isinstance(uniques, NumpyExtensionArray)
assert not isinstance(uniques, np.ndarray)

Comment on lines +88 to +89
try:
    unique = unique.reorder_categories(unique.categories.sort_values())

I do not see the need to reorder_categories here if we end up running unique.sort_values() afterwards. Can you provide an example where this is required?
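One possible reason (a guess, sketched here, not the author's confirmed answer): pandas sorts a Categorical by its stored category order, so if the categories are not stored lexicographically, sort_values() alone does not return lexicographically sorted values:

import pandas as pd

c = pd.Categorical(["b", "a"], categories=["b", "a"])
print(list(c.sort_values()))   # ['b', 'a'] -- follows the stored category order

c2 = c.reorder_categories(c.categories.sort_values())
print(list(c2.sort_values()))  # ['a', 'b'] -- lexicographic after reordering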

return unique.sort_values().to_numpy()
if unique.dtype != object:
    return np.sort(unique)
return _unique_python(unique, return_counts=False, return_inverse=False)

According to codecov, this code is not covered.

    return values
else:
    return values, counts
values = values[:-1]

According to codecov, everything below this line is not covered.
