ENH: Reimplement DataFrame.lookup #61185
base: main
Conversation
np.random.Generator.random, not np.random.Generator
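For context, the distinction being drawn is between the `Generator` object itself and its `random` method; a minimal sketch:

```python
import numpy as np

# np.random.default_rng() returns a Generator; uniform floats come from
# its .random() method, not from the Generator object itself.
rng = np.random.default_rng(43)
vals = rng.random(5)  # array of 5 floats in [0.0, 1.0)
print(vals)
```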
I tested out three variants of subsetting the dataframe before converting to numpy.

Optimization testing script:

```python
import pandas as pd
import numpy as np
import timeit

np.random.seed(43)
for n in [100, 100_000]:
    for k in range(2, 6):
        print(k, n)
        cols = list('abcdef')
        df = pd.DataFrame(np.random.randint(0, 10, size=(n, len(cols))), columns=cols)
        df['col'] = np.random.choice(cols, n)
        sample_n = n // 10
        idx = np.random.choice(df['col'].index.to_numpy(), sample_n)
        cols = np.random.choice(df['col'].to_numpy(), sample_n)
        timeit.timeit(lambda: df.drop(columns='col').lookup(idx, cols), number=1000)
        str_col = cols[0]
        df[str_col] = df[str_col].astype(str)
        df[str_col] = str_col
        timeit.timeit(lambda: df.drop(columns='col').lookup(idx, cols), number=1000)
```

As a result of this testing I settled on the third option.
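The column-subsetting idea these variants explore can be sketched roughly as follows. This is a simplified stand-in, not the PR's actual code; `lookup_subset` is a hypothetical name, and the idea is to narrow the frame to only the requested columns before `to_numpy()`:

```python
import numpy as np
import pandas as pd

# Hypothetical sketch: subset to only the columns actually requested before
# converting to NumPy, so a wide mixed-dtype frame is not copied wholesale.
def lookup_subset(df, row_labels, col_labels):
    used = pd.unique(np.asarray(col_labels))  # columns actually needed
    sub = df[used]                            # narrow frame first
    rows = sub.index.get_indexer(row_labels)
    cols = sub.columns.get_indexer(col_labels)
    return sub.to_numpy()[rows, cols]

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': ['x', 'y', 'z']})
print(lookup_subset(df, [0, 2], ['a', 'b']))  # values at (0,'a') and (2,'b')
```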
If we are to move forward, this looks good; it should get a whatsnew entry in enhancements for 3.0.
Edit: Fixed link below

Is the implementation in #40140 (comment) not sufficient?

```python
size = 100_000
df = pd.DataFrame({'a': np.random.randint(0, 100, size), 'b': np.random.random(size), 'c': 'x'})
row_labels = np.repeat(np.arange(size), 2)
col_labels = np.tile(['a', 'b'], size)
%timeit df.lookup(row_labels, col_labels)
# 22.3 ms ± 391 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)   <--- this PR
# 13.4 ms ± 17 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <--- proposed implementation
```
The implementation: ``pandas.DataFrame.lookup``

Done
@stevenae - sorry, linked to the wrong comment. I've fixed my comment above. Ah, but I think I see. This avoids a large copy when only certain columns are used.
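The copy concern can be seen directly: on a mixed-dtype frame, `to_numpy()` on the full frame must upcast everything to a common dtype (object), copying every cell, while a subset of compatible columns stays a compact homogeneous array. An illustrative sketch, not code from the PR:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(3), 'b': np.arange(3.0), 'c': 'x'})

# Full-frame conversion must find a common dtype -> object, copying all cells.
full = df.to_numpy()
# Subsetting to the columns in use keeps a compact numeric dtype.
sub = df[['a', 'b']].to_numpy()
print(full.dtype, sub.dtype)  # object float64
```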
cc @pandas-dev/pandas-core

My take: this provides an implementation for what I think is a natural operation that is not straightforward for most users. It provides performance benefits that take into account columnar-based storage (subsetting columns prior to calling …).

Yes -- I ran a comparison (script at end) and found this PR's implementation beats the comment you referenced on large mixed-type lookups.

Metrics:
Script:

```python
import pandas as pd
import numpy as np
import timeit

np.random.seed(43)

def pd_lookup(df, row_labels, col_labels):
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    result = df.to_numpy()[rows, cols]
    return result

for n in [100, 100_000]:
    for k in range(2, 6):
        print(k, n)
        cols = list('abcdef')
        df = pd.DataFrame(np.random.randint(0, 10, size=(n, len(cols))), columns=cols)
        df['col'] = np.random.choice(cols, n)
        sample_n = n // 10
        idx = np.random.choice(df['col'].index.to_numpy(), sample_n)
        cols = np.random.choice(df['col'].to_numpy(), sample_n)
        timeit.timeit(lambda: df.drop(columns='col').lookup(idx, cols), number=1000)
        timeit.timeit(lambda: pd_lookup(df.drop(columns='col'), idx, cols), number=1000)
        str_col = cols[0]
        df[str_col] = df[str_col].astype(str)
        df[str_col] = str_col
        timeit.timeit(lambda: df.drop(columns='col').lookup(idx, cols), number=1000)
        timeit.timeit(lambda: pd_lookup(df.drop(columns='col'), idx, cols), number=1000)
```
pandas/core/frame.py (Outdated)

```
        Returns
        -------
        numpy.ndarray
            The found values.
        """
```
I think it would be really useful to have an example here in the docs for the API.
Added, please take a look.
expanded example
Nice example. I will let the other pandas developers handle the rest of the PR
```diff
-and column labels, this can be achieved by ``pandas.factorize`` and NumPy indexing.
-For instance:
+and column labels, and the ``lookup`` method allows for this and returns a
+NumPy array. For instance:
```
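For reference, the ``pandas.factorize`` plus NumPy-indexing recipe the old wording pointed to looks roughly like this (a sketch of the pattern with made-up data, not quoted from the docs):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['a', 'b', 'a'], 'a': [1, 2, 3], 'b': [4, 5, 6]})

# Per-row lookup of df.loc[row, df['col'][row]] without .lookup():
# factorize turns the column labels into integer codes plus the unique labels.
idx, cols = pd.factorize(df['col'])
values = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print(values)  # [1 5 3]
```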
Do we have other places in our API where we return a NumPy array? With the prevalence of the Arrow type system, it doesn't seem desirable to be locked into returning a NumPy array.
It looks like ``values`` also does this.
Agreed, I think this API should return an ExtensionArray or NumPy array depending on the initial type or result type.
``values`` only returns a NumPy array for NumPy types. For extension types or arrow-backed types you get something different:

```python
>>> pd.Series([1, 2, 3], dtype="int64[pyarrow]").values
<ArrowExtensionArray>
[1, 2, 3]
Length: 3, dtype: int64[pyarrow]
```

I don't think we should force a NumPy array return here; particularly for string data, that could be non-performant and expensive.
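For contrast, the NumPy-dtype side of the same call (assuming a plain ``int64`` Series):

```python
import numpy as np
import pandas as pd

# With a NumPy-backed dtype, .values is a plain ndarray.
s = pd.Series([1, 2, 3], dtype="int64")
print(type(s.values))  # <class 'numpy.ndarray'>
```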
Is the requested change as simple as removing the type-hint?
Optimization notes:

Most important change is the removal of:

```python
if not self._is_mixed_type or n > thresh
```

The old implementation slowed down when ``n < thresh``, with or without mixed types. Cases with ``n < thresh`` are now 10x faster.

The logic can be followed via Python operator precedence:
https://docs.python.org/3/reference/expressions.html#operator-precedence
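The precedence point being made: ``not`` binds tighter than ``or``, so the removed condition parses as ``(not self._is_mixed_type) or (n > thresh)``, not ``not (self._is_mixed_type or (n > thresh))``. A quick check with stand-in values (the names here are placeholders, not the PR's attributes):

```python
# Stand-in values, just to show how the expression groups.
is_mixed_type = True
n, thresh = 5, 10

grouped = (not is_mixed_type) or (n > thresh)
bare = not is_mixed_type or n > thresh
print(bare == grouped)  # True: `not` binds tighter than `or`
```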
Test notes:

I am unfamiliar with pytest and did not add parameterization.