PERF speedup classification_report by attaching unique values to dtype.metadata#29738
PERF speedup classification_report by attaching unique values to dtype.metadata#29738adrinjalali merged 15 commits intoscikit-learn:mainfrom
Conversation
thomasjpfan
left a comment
There was a problem hiding this comment.
This is pretty cool. I did not know you can attached metadata to a dtype.
| unique = xp.unique_values(y) | ||
| try: | ||
| unique_dtype = np.dtype(y.dtype, metadata={"unique": unique}) | ||
| return y.view(dtype=unique_dtype) |
There was a problem hiding this comment.
When you pickle or np.save this view, does the dtype metadata get serialized as well?
There was a problem hiding this comment.
pickle:
In [1]: import numpy as np
In [2]: a = np.array([1])
In [4]: m_dtype = np.dtype(a.dtype, metadata={"unique": np.unique(a)})
In [5]: b = a.view(dtype=m_dtype)
In [7]: from pickle import loads, dumps
In [8]: loads(dumps(b)).dtype.metadata
Out[8]: mappingproxy({'unique': array([1])})However, np.save complains:
In [10]: from io import BytesIO
In [11]: f = BytesIO()
In [13]: np.save(f, b)
/home/adrin/micromamba/envs/sklearn/lib/python3.12/site-packages/numpy/lib/format.py:380: UserWarning: metadata on a dtype is not saved to an npy/npz. Use another format (such as pickle) to store it.
d['descr'] = dtype_to_descr(array.dtype)We're not gonna save these anyway, but interesting. cc @seberg
There was a problem hiding this comment.
Yes, we had to put this in because of h5py, you can't store metadata in npy format.
There was a problem hiding this comment.
We are not saving it, but a user may end up saving an ndarray with the metadata attached. In this case:
- If the unique values end up to be long, then it would increase their pickle size.
- For
np.save, they will get a warning that they can ignore. (But it's hard to tell if it's safe to ignore)
There was a problem hiding this comment.
We are not modifying user's arrays. We create a new view and these arrays are not returned.
There was a problem hiding this comment.
Looking through the code now, I see that nothing is returned. Moving forward, we'll need to be careful that the output of _check_targets is not returned by a public function.
There was a problem hiding this comment.
That's a very good point, and I think it's a good idea to not modify y in any function that returns them. So I've moved attach_unique to the callers of _check_targets instead, so that in the future we don't have to worry about this point.
thomasjpfan
left a comment
There was a problem hiding this comment.
Minor comment, otherwise LGTM
| def attach_unique(*ys, return_tuple=False): | ||
| """Attach unique values of ys to ys and return the results. | ||
|
|
||
| The result is a view of y, and the metadata (unique) is not attached to y. |
There was a problem hiding this comment.
Can we include a comment here that starts that the output of attach_unique should never be returned from a public function?
There was a problem hiding this comment.
Yep, added a comment for this.
|
LGTM as well. |
Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>
Fixes #26808
Closes #26820
This is alternative to #26820 where we attach unique values to the
dtype.metadataof a view ony.This gets the same speedup as reported in #26820 but is a lot cleaner IMO.
WDYT @ogrisel @glemaitre @thomasjpfan
(I'm still working on speeding up
np.uniqueindependent of this)