
ENH: Reimplement DataFrame.lookup #61185

Open
wants to merge 18 commits into main
Conversation


@stevenae stevenae commented Mar 26, 2025

  • closes ENH: re-implement DataFrame.lookup. #40140
  • Tests added and passed
  • All code checks passed
  • Added type annotations
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Optimization notes:
The most important change is the removal of:
if not self._is_mixed_type or n > thresh

The old implementation slowed down when n < thresh, with or without mixed types. Cases with n < thresh are now ~10x faster.

The logic can be followed via Python operator precedence:

https://docs.python.org/3/reference/expressions.html#operator-precedence
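To spell out the grouping (this snippet is illustrative, not the PR's code): comparisons bind more tightly than `not`, which binds more tightly than `or`, so the removed condition parses as `(not self._is_mixed_type) or (n > thresh)`.

```python
# How the removed condition groups under Python's operator precedence:
# comparisons > `not` > `or`.
is_mixed_type, n, thresh = True, 50, 100

implicit = not is_mixed_type or n > thresh
explicit = (not is_mixed_type) or (n > thresh)
assert implicit == explicit  # both False for these values
```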

Test notes:
I am unfamiliar with pytest and did not add parametrization.

@stevenae
Author

stevenae commented Mar 27, 2025

I tested out three variants of subsetting the dataframe before converting to numpy:

  • subset column and row
  • subset only column
  • subset column, then subset row if types are mixed
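A minimal sketch of the third variant (the helper name and structure below are illustrative, not the PR's actual code; it assumes a unique row index):

```python
import numpy as np
import pandas as pd

def lookup_sketch(df, row_labels, col_labels):
    # Variant 3 (sketch): subset to the referenced columns first, and only
    # subset rows when that column subset is mixed-type.
    sub = df[pd.Index(col_labels).unique()]
    rows = sub.index.get_indexer(row_labels)
    cols = sub.columns.get_indexer(col_labels)
    if len(set(sub.dtypes)) > 1:
        # Mixed dtypes force an object-dtype conversion in to_numpy(),
        # so convert only the rows that are actually looked up.
        sub = sub.iloc[np.unique(rows)]
        rows = sub.index.get_indexer(row_labels)
    return sub.to_numpy()[rows, cols]
```

The point of the row subset is to bound the size of the object-dtype copy when the looked-up columns have mixed types.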

Optimization testing script:

import pandas as pd
import numpy as np
import timeit
np.random.seed(43)
for n in [100,100_000]:
	for k in range(2,6):
		print(k,n)
		cols = list('abcdef')
		df = pd.DataFrame(np.random.randint(0, 10, size=(n,len(cols))), columns=cols)
		df['col'] = np.random.choice(cols, n)
		sample_n = n//10
		idx = np.random.choice(df['col'].index.to_numpy(),sample_n)
		cols = np.random.choice(df['col'].to_numpy(),sample_n)
		print(timeit.timeit(lambda: df.drop(columns='col').lookup(idx, cols), number=1000))
		str_col = cols[0]
		df[str_col] = df[str_col].astype(str)
		df[str_col] = str_col
		print(timeit.timeit(lambda: df.drop(columns='col').lookup(idx, cols), number=1000))
Results (timeit totals in seconds for number=1000, rounded to 4 decimal places):

k  n       dtype    col+row  col-only  col+mixed-row
2  100     numeric  0.1917   0.2385    0.1946
2  100     mixed    0.1782   0.2371    0.2745
3  100     numeric  0.1534   0.2040    0.1501
3  100     mixed    0.1809   0.2427    0.2795
4  100     numeric  0.1566   0.2096    0.1543
4  100     mixed    0.1777   0.2328    0.2671
5  100     numeric  0.1558   0.2023    0.1539
5  100     mixed    0.1794   0.2375    0.2746
2  100000  numeric  0.6304   1.2773    0.8553
2  100000  mixed    4.4357   1.6796    1.9799
3  100000  numeric  0.6472   1.2483    0.8436
3  100000  mixed    4.3937   1.7129    1.9555
4  100000  numeric  0.6682   1.2452    0.8303
4  100000  mixed    4.3902   1.6384    1.9621
5  100000  numeric  0.6655   1.2772    0.8655
5  100000  mixed    4.4515   1.7425    2.0112

As a result of this testing I settled on the third option.

Member

@rhshadrach rhshadrach left a comment


If we are to move forward, this looks good; it should get a whatsnew entry in enhancements for 3.0.

Member

@rhshadrach rhshadrach left a comment


Edit: Fixed link below

Is the implementation in #40140 (comment) not sufficient?

size = 100_000
df = pd.DataFrame({'a': np.random.randint(0, 100, size), 'b': np.random.random(size), 'c': 'x'})
row_labels = np.repeat(np.arange(size), 2)
col_labels = np.tile(['a', 'b'], size)
%timeit df.lookup(row_labels, col_labels)
# 22.3 ms ± 391 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <--- this PR
# 13.4 ms ± 17 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <--- proposed implementation

@stevenae
Author

Is the implementation in #40140 (comment) not sufficient?

size = 100_000
df = pd.DataFrame({'a': np.random.randint(0, 100, size), 'b': np.random.random(size), 'c': 'x'})
row_labels = np.repeat(np.arange(size), 2)
col_labels = np.tile(['a', 'b'], size)
%timeit df.lookup(row_labels, col_labels)
# 22.3 ms ± 391 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <--- this PR
# 13.4 ms ± 17 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <--- proposed implementation

The implementation df[['a', 'b']].sum().sum() pulls up the entire column, and does not support lookup of individual values by row/column. Is that what you are referring to?

@stevenae
Author

If we are to move forward, this looks good; it should get a whatsnew entry in enhancements for 3.0.

Done

@rhshadrach
Member

rhshadrach commented Mar 28, 2025

@stevenae - sorry, linked to the wrong comment. I've fixed my comment above.

Ah, but I think I see. This avoids a large copy when only certain columns are used.
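A small illustration of that copy (my example, not from the PR): a whole-frame to_numpy() on a mixed-dtype frame upcasts everything to object, while a subset of only the numeric columns converts without touching the unused object column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5), "b": np.arange(5.0), "c": "x"})

# Whole-frame conversion must find a common dtype, so the string column
# drags everything to object and copies all columns.
assert df.to_numpy().dtype == object

# Subsetting the looked-up columns first keeps the conversion numeric
# and leaves the unused object column untouched.
assert df[["a", "b"]].to_numpy().dtype == np.float64
```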

@rhshadrach rhshadrach changed the title Enh lookup ENH: Reimplement DataFrame.lookup Mar 28, 2025
@rhshadrach rhshadrach added the Enhancement and Indexing (Related to indexing on series/frames, not to indexes themselves) labels Mar 28, 2025
@rhshadrach
Member

cc @pandas-dev/pandas-core

My take: this provides an implementation for what I think is a natural operation that is not straightforward for most users. It provides performance benefits that take into account columnar storage (subsetting columns prior to calling .to_numpy()). This seems like a worthy addition in my opinion, especially given the user feedback when the previous version was removed.

@stevenae
Author

@stevenae - sorry, linked to the wrong comment. I've fixed my comment above.

Ah, but I think I see. This avoids a large copy when only certain columns are used.

Yes -- I ran a comparison (script at end) and found that this PR's implementation beats the one in the comment you referenced on large mixed-type lookups.

Metrics

k  n       dtype    PR      #40140
2  100     numeric  0.1964  0.0907
2  100     mixed    0.2743  0.1101
3  100     numeric  0.1504  0.0891
3  100     mixed    0.2769  0.1103
4  100     numeric  0.1549  0.0903
4  100     mixed    0.2673  0.1064
5  100     numeric  0.1547  0.0897
5  100     mixed    0.2721  0.1116
2  100000  numeric  0.8096  0.4051
2  100000  mixed    1.9508  4.0646
3  100000  numeric  0.8243  0.4148
3  100000  mixed    1.9535  4.2412
4  100000  numeric  0.8303  0.4250
4  100000  mixed    1.9240  4.1462
5  100000  numeric  0.8654  0.4451
5  100000  mixed    2.0631  4.4090

Script

import pandas as pd
import numpy as np
import timeit
np.random.seed(43)

def pd_lookup(df, row_labels, col_labels):
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    result = df.to_numpy()[rows, cols]
    return result

for n in [100,100_000]:
	for k in range(2,6):
		print(k,n)
		cols = list('abcdef')
		df = pd.DataFrame(np.random.randint(0, 10, size=(n,len(cols))), columns=cols)
		df['col'] = np.random.choice(cols, n)
		sample_n = n//10
		idx = np.random.choice(df['col'].index.to_numpy(),sample_n)
		cols = np.random.choice(df['col'].to_numpy(),sample_n)
		print(timeit.timeit(lambda: df.drop(columns='col').lookup(idx, cols), number=1000))
		print(timeit.timeit(lambda: pd_lookup(df.drop(columns='col'), idx, cols), number=1000))
		str_col = cols[0]
		df[str_col] = df[str_col].astype(str)
		df[str_col] = str_col
		print(timeit.timeit(lambda: df.drop(columns='col').lookup(idx, cols), number=1000))
		print(timeit.timeit(lambda: pd_lookup(df.drop(columns='col'), idx, cols), number=1000))

Comment on lines 5153 to 5157
Returns
-------
numpy.ndarray
The found values.
"""
Contributor

I think it would be really useful to have an example here in the docs for the API.

Author

Added, please take a look.

Contributor

@Dr-Irv Dr-Irv left a comment

Nice example. I will let the other pandas developers handle the rest of the PR

-and column labels, this can be achieved by ``pandas.factorize`` and NumPy indexing.
-For instance:
+and column labels, and the ``lookup`` method allows for this and returns a
+NumPy array. For instance:
Member

Do we have other places in our API where we return a NumPy array? With the prevalence of the Arrow type system, it doesn't seem desirable to be locked into returning a NumPy array.

Author

It looks like values also does this.

Member

Agreed; I think this API should return an ExtensionArray or NumPy array depending on the initial type or result type.

Member

values only returns a NumPy array for NumPy dtypes. For extension types or Arrow-backed types you get something different:

>>> pd.Series([1, 2, 3], dtype="int64[pyarrow]").values
<ArrowExtensionArray>
[1, 2, 3]
Length: 3, dtype: int64[pyarrow]

I don't think we should force a NumPy array return here; particularly for string data, that could be non-performant and expensive
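The same distinction shows up without Arrow (illustrative example, not from the PR): pandas' nullable Int64 dtype keeps an ExtensionArray under .values, and forcing a NumPy array falls back to object dtype when NA is present.

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

# .values preserves the extension array...
assert type(s.values).__name__ == "IntegerArray"

# ...while forcing a NumPy array falls back to object dtype,
# because pd.NA has no native NumPy representation.
assert s.to_numpy().dtype == object
```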

Author

@stevenae stevenae Mar 31, 2025

Is the requested change as simple as removing the type-hint?

@stevenae stevenae requested a review from rhshadrach March 31, 2025 14:25
Labels
Enhancement, Indexing (Related to indexing on series/frames, not to indexes themselves)

Successfully merging this pull request may close these issues.

ENH: re-implement DataFrame.lookup.
5 participants