Attempt to speed up unique value discovery in _BaseEncoder for polars and pandas series #27911

Open · wants to merge 14 commits into main

Conversation

@jeromedockes (Contributor) commented Dec 7, 2023

Reference Issues/PRs

Follow-up to #27835, addressing this comment there.

What does this implement/fix? Explain your changes.

This relies on the pandas or polars Series.unique method rather than numpy.unique to identify categories, as it can be faster.
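A minimal sketch of the idea (the helper below is illustrative, not the PR's exact code; NaN/None handling is omitted, and the final np.sort is kept because the encoders need sorted categories):

import numpy as np

def _unique_values(column):
    # Illustrative dispatch: prefer the dataframe library's hash-based
    # unique (roughly O(n)) when the column is a pandas or polars Series,
    # and fall back to numpy's sort-based np.unique (O(n log n)) otherwise.
    module = type(column).__module__.split(".")[0]
    if module in ("pandas", "polars"):
        return np.sort(np.asarray(column.unique()))
    return np.unique(np.asarray(column))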

Any other comments?

So far I am seeing a speedup for pandas but not really for polars; unless I can make it faster for polars, it is probably not worth the added complexity.

Toy benchmark:

https://gist.github.com/jeromedockes/d2e1fc147b7ad0a6dfd686318cc9da57

Results:

branch: main                            branch: ordinal-encoder-pd-unique

polars                                  polars
========                                ========
ordinal encoder fit: 7.05e-02           ordinal encoder fit: 4.89e-02
gradient boosting fit: 3.74e-01         gradient boosting fit: 3.19e-01
_unique(array): 1.38e-02                _unique(series): 1.92e-04
series.unique(): 2.55e-05               series.unique(): 2.56e-05
np.unique(): 4.84e-01                   np.unique(): 4.69e-01

pandas                                  pandas
========                                ========
ordinal encoder fit: 1.40e-02           ordinal encoder fit: 5.72e-03
gradient boosting fit: 2.90e-01         gradient boosting fit: 2.55e-01
_unique(array): 1.03e-02                _unique(series): 2.39e-03
series.unique(): 2.36e-03               series.unique(): 2.38e-03
np.unique(): 3.25e-01                   np.unique(): 3.25e-01

If we change the type of the categorical column to contain integers rather than categories, we see a small speedup for polars but almost 10x for the OrdinalEncoder on pandas:

branch: main                            branch: ordinal-encoder-pd-unique

polars                                  polars
========                                ========
ordinal encoder fit: 2.46e-02           ordinal encoder fit: 1.23e-02
gradient boosting fit: 2.42e-01         gradient boosting fit: 2.67e-01
_unique(array): 2.09e-02                _unique(series): 1.11e-02
series.unique(): 8.90e-03               series.unique(): 8.44e-03
np.unique(): 2.08e-02                   np.unique(): 2.07e-02

pandas                                  pandas
========                                ========
ordinal encoder fit: 2.11e-02           ordinal encoder fit: 2.89e-03
gradient boosting fit: 2.49e-01         gradient boosting fit: 2.32e-01
_unique(array): 2.09e-02                _unique(series): 2.70e-03
series.unique(): 2.69e-03               series.unique(): 2.69e-03
np.unique(): 2.08e-02                   np.unique(): 2.07e-02

github-actions (bot) commented Dec 7, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit b21d596.

@adrinjalali (Member) commented:
This is adding quite a bit of complexity for something that I think should mostly live in the corresponding libraries, not in scikit-learn. I wonder why @ogrisel thinks it needs to be here.

@ogrisel (Member) commented Dec 11, 2023

> I wonder why @ogrisel thinks it needs to be here.

Because numpy.unique has O(n log n) complexity, which is quite catastrophic for large dataframes. Both polars and pandas have O(n) implementations.

@jeromedockes could you please run a quick benchmark of this PR vs main on random pandas & polars dataframes with 100 categorical columns and 10_000_000 rows just to confirm that this is worth the extra complexity?

EDIT: I just opened the details in the description, and indeed there is already a nice 10x speed-up at 1M samples. This should grow significantly beyond that.
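A rough sketch of the complexity difference (illustrative only; absolute timings vary with data, hardware, and library versions):

import numpy as np
import pandas as pd
from timeit import timeit

rng = np.random.default_rng(0)
# 10M rows drawn from 30 string categories
values = rng.choice([f"cat_{i}" for i in range(30)], size=10_000_000)
series = pd.Series(values)

# sort-based, O(n log n)
print("np.unique:     ", timeit(lambda: np.unique(values), number=3))
# hash-based, O(n)
print("Series.unique: ", timeit(lambda: series.unique(), number=3))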

@ogrisel (Member) commented Dec 11, 2023

@jeromedockes can you please check at 10M for polars to see if the speed-up grows or not? If not, maybe we can indeed remove the polars specialization for the sake of maintainability.

@adrinjalali (Member) left a comment

My feeling is that ideally we would be relying on the array API unique_values method instead.

Here we are modifying the default behavior of unique from pandas and polars, and this seems a bit inconsistent in the sense that we don't necessarily do the same for other methods (?)

The issue with np.unique is also not unique (pun intended) to encoders, and we have another PR (#26820) circumventing the inefficiency of np.unique in a different way.

In an in-person discussion with @glemaitre, we concluded that a better way forward is to improve np.unique itself, which would solve all of these issues in scikit-learn and elsewhere.

So I plan to work on that, which would in turn remove the need for this PR and the other one.
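For reference, the array API route mentioned above would look roughly like this (a sketch assuming the array-api-compat helper package):

import numpy as np
from array_api_compat import array_namespace

x = np.asarray([3, 1, 2, 1, 3])
xp = array_namespace(x)         # resolve the array's API namespace
uniques = xp.unique_values(x)   # unique_values is part of the array API spec
print(uniques)                  # [1 2 3] (order is implementation-defined)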

Comment on lines 105 to 110
if is_pandas:
    values = X.iloc[:, i]
elif is_polars:
    values = X[X.columns[i]]
else:
    values = Xi

Should we use _safe_indexing instead?
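For comparison, sklearn.utils._safe_indexing already dispatches on the container type, so the branching above could plausibly collapse to a single call (a sketch; whether polars is supported depends on the scikit-learn version):

import pandas as pd
from sklearn.utils import _safe_indexing

X = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
# The same call works for numpy arrays and pandas DataFrames
# (and polars in recent versions): select column 1 by position.
values = _safe_indexing(X, 1, axis=1)
print(values)  # the "b" column as a pandas Series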

@jeromedockes (Contributor, Author) commented Dec 13, 2023

> @jeromedockes can you please check at 10M for polars to see if the speed-up grows or not? If not, maybe we can indeed remove the polars specialization for the sake of maintainability.

Here are some plots for _unique with 1M and 10M samples. Actually the biggest speedup is for polars categorical columns, but I think that case will rarely occur in practice, see below. For object arrays, the main branch uses a Python set, which is faster than both np.unique and pandas.unique, so I turned off the use of pandas.unique for columns with dtype object.

[plots: timings of sklearn.utils._encode._unique, log scale, for 1,000,000 and 10,000,000 samples]

It seems polars categorical series store a metadata bit indicating whether the unique values in the series coincide with the full set of categories possible in that series. That is probably a frequent enough case, as it occurs whenever a categorical series is built by inferring the categories from the data. When it holds (and the series is in a single chunk), the unique values can be obtained by looking only at the metadata (the mapping from the categorical type to string values) and not at the data, which is why it is so fast in the example above. np.unique will therefore never be able to match that; however, it is a very specific case (polars, categorical, and fast unique possible) and probably won't apply whenever the polars series is sliced (e.g. when doing cross-validation):

>>> s.dtype, s.shape
(Categorical, (1000000,))
>>> s1 = s[:-1]
>>> s1.dtype, s1.shape
(Categorical, (999999,))
>>> %timeit s.unique()
24.2 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit s1.unique()
5.29 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So in practice we would almost never benefit from that optimization if this PR were merged.

For the other cases I agree that improving np.unique sounds like the best approach; a temporary optimization in scikit-learn's _unique could still make sense if the improvement to np.unique will take a long time to be released, but it does add some complexity.

@glemaitre self-requested a review January 9, 2024 09:27
@ogrisel (Member) left a comment

Given the benchmark results, the analysis above, and the limited complexity of the dataframe-specific code, I would be in favor of including the optimizations suggested in this PR (while contributing improvements to numpy in parallel).

Once the numpy implementation has been optimized, we can consider removing some of the dataframe-specific code in scikit-learn, but probably not all of it, so as to still avoid sorting columns for nothing when the result has already been computed as part of the dataframe dtype metadata.
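To illustrate the dtype-metadata shortcut: for a pandas categorical column the candidate categories can be read from the dtype without scanning the data (note they may be a superset of the values actually present):

import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category")
# Read from the dtype metadata, no O(n) scan of the values:
print(s.cat.categories)  # Index(['a', 'b'], dtype='object')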

:class:`preprocessing.OneHotEncoder` and :class:`preprocessing.TargetEncoder`
can be faster on pandas DataFrames by using a more efficient way of finding
unique values in the categorical columns. :pr:`27911` by :user:`Jérôme Dockès
<jeromedockes>`.

This will have to be moved to target 1.5 instead of 1.4.

@jeromedockes changed the title [WIP] attempt to speed up unique value discovery in _BaseEncoder for polars and pandas series Attempt to speed up unique value discovery in _BaseEncoder for polars and pandas series Jan 10, 2024
@jeromedockes marked this pull request as ready for review January 10, 2024 08:47
@thomasjpfan (Member) left a comment

Thanks for the PR!

import pandas as pd

if parse_version("1.4.0") <= parse_version(pd.__version__):
    return _unique_pandas(values, return_counts=return_counts)

At this point, if the pandas version is too old, then we return None.
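A sketch of how the fallback could be made explicit rather than implicitly returning None (_unique_pandas is the PR's helper shown above; _unique_np is assumed here as the name of the existing numpy-based path):

import pandas as pd
from sklearn.utils.fixes import parse_version

def _unique(values, *, return_counts=False):
    # Use the fast pandas path only when the installed pandas supports it.
    if isinstance(values, pd.Series) and parse_version(
        pd.__version__
    ) >= parse_version("1.4.0"):
        return _unique_pandas(values, return_counts=return_counts)
    # Otherwise fall through explicitly to the generic numpy-based path
    # instead of silently returning None.
    return _unique_np(values, return_counts=return_counts)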

if _is_polars_series(values):
    # polars unique, arg_sort not supported for polars.Object dtype.
    if str(values.dtype) != "Object":
        return _unique_polars(values, return_counts=return_counts)

Same here regarding return None.

Comment on lines +83 to +84
# unique returns a NumpyExtensionArray for extension dtypes and a numpy
# array for other dtypes

Is this comment true? For the categorical dtype, the output is neither a NumpyExtensionArray nor an np.ndarray:

import pandas as pd
from pandas.arrays import NumpyExtensionArray
import numpy as np

x = pd.Series(["a", "b", "c", "b", "a", "b"], dtype="category")
uniques = x.unique()

assert not isinstance(uniques, NumpyExtensionArray)
assert not isinstance(uniques, np.ndarray)

Comment on lines +88 to +89
try:
    unique = unique.reorder_categories(unique.categories.sort_values())

I do not see the need to reorder_categories here if we end up running unique.sort_values() afterwards. Can you provide an example where this is required?
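One possible reason (a guess, sketched here, not the author's confirmed answer): pandas sorts a Categorical by its stored category order, so if the categories are not stored lexicographically, sort_values() alone does not return lexicographically sorted values:

import pandas as pd

c = pd.Categorical(["b", "a"], categories=["b", "a"])
print(list(c.sort_values()))   # ['b', 'a'] -- follows the stored category order

c2 = c.reorder_categories(c.categories.sort_values())
print(list(c2.sort_values()))  # ['a', 'b'] -- lexicographic after reordering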

return unique.sort_values().to_numpy()
if unique.dtype != object:
    return np.sort(unique)
return _unique_python(unique, return_counts=False, return_inverse=False)

According to codecov, this code is not covered.

    return values
else:
    return values, counts
values = values[:-1]

According to codecov, everything below this line is not covered.
