ENH: Reimplement DataFrame.lookup #61185
base: main
Conversation
np.random.Generator.random, not np.random.Generator
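For context, the distinction being drawn is between the `Generator` object itself and its `random` method; a minimal sketch:

```python
import numpy as np

# np.random.default_rng() returns a Generator; uniform floats come from
# its .random() method, not from the Generator object itself.
rng = np.random.default_rng(43)
vals = rng.random(5)  # array of 5 floats in [0.0, 1.0)
print(vals)
```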
I tested out three variants of subsetting the dataframe before converting to numpy.

Optimization testing script:

```python
import pandas as pd
import numpy as np
import timeit

np.random.seed(43)
for n in [100, 100_000]:
    for k in range(2, 6):
        print(k, n)
        cols = list('abcdef')
        df = pd.DataFrame(np.random.randint(0, 10, size=(n, len(cols))), columns=cols)
        df['col'] = np.random.choice(cols, n)
        sample_n = n // 10
        idx = np.random.choice(df['col'].index.to_numpy(), sample_n)
        cols = np.random.choice(df['col'].to_numpy(), sample_n)
        timeit.timeit(lambda: df.drop(columns='col').lookup(idx, cols), number=1000)
        str_col = cols[0]
        df[str_col] = df[str_col].astype(str)
        df[str_col] = str_col
        timeit.timeit(lambda: df.drop(columns='col').lookup(idx, cols), number=1000)
```

As a result of this testing I settled on the third option.
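The column-subsetting idea these variants explore can be sketched roughly as follows. This is a simplified stand-in, not the PR's actual code; `lookup_subset` is a hypothetical name, and the idea is to narrow the frame to only the requested columns before `to_numpy()`:

```python
import numpy as np
import pandas as pd

# Hypothetical sketch: subset to only the columns actually requested before
# converting to NumPy, so a wide mixed-dtype frame is not copied wholesale.
def lookup_subset(df, row_labels, col_labels):
    used = pd.unique(np.asarray(col_labels))  # columns actually needed
    sub = df[used]                            # narrow frame first
    rows = sub.index.get_indexer(row_labels)
    cols = sub.columns.get_indexer(col_labels)
    return sub.to_numpy()[rows, cols]

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': ['x', 'y', 'z']})
print(lookup_subset(df, [0, 2], ['a', 'b']))  # values at (0,'a') and (2,'b')
```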
If we are to move forward, this looks good; it should get a whatsnew entry in enhancements for 3.0.
Edit: Fixed link below

Is the implementation in #40140 (comment) not sufficient?

```python
size = 100_000
df = pd.DataFrame({'a': np.random.randint(0, 100, size), 'b': np.random.random(size), 'c': 'x'})
row_labels = np.repeat(np.arange(size), 2)
col_labels = np.tile(['a', 'b'], size)
%timeit df.lookup(row_labels, col_labels)
# 22.3 ms ± 391 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)   <--- this PR
# 13.4 ms ± 17 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <--- proposed implementation
```
The implementation: ``pandas.DataFrame.lookup``

Done
@stevenae - sorry, linked to the wrong comment. I've fixed my comment above. Ah, but I think I see. This avoids a large copy when only certain columns are used.
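The copy concern can be seen directly: on a mixed-dtype frame, `to_numpy()` on the full frame must upcast everything to a common dtype (object), copying every cell, while a subset of compatible columns stays a compact homogeneous array. An illustrative sketch, not code from the PR:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(3), 'b': np.arange(3.0), 'c': 'x'})

# Full-frame conversion must find a common dtype -> object, copying all cells.
full = df.to_numpy()
# Subsetting to the columns in use keeps a compact numeric dtype.
sub = df[['a', 'b']].to_numpy()
print(full.dtype, sub.dtype)  # object float64
```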
cc @pandas-dev/pandas-core

My take: this provides an implementation for what I think is a natural operation that is not straightforward for most users. It provides performance benefits that take into account columnar-based storage (subsetting columns prior to calling …).

Yes -- I ran a comparison (script at end) and found this PR's implementation beats the comment you referenced on large mixed-type lookups.

Metrics:
Script:

```python
import pandas as pd
import numpy as np
import timeit

np.random.seed(43)

def pd_lookup(df, row_labels, col_labels):
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    result = df.to_numpy()[rows, cols]
    return result

for n in [100, 100_000]:
    for k in range(2, 6):
        print(k, n)
        cols = list('abcdef')
        df = pd.DataFrame(np.random.randint(0, 10, size=(n, len(cols))), columns=cols)
        df['col'] = np.random.choice(cols, n)
        sample_n = n // 10
        idx = np.random.choice(df['col'].index.to_numpy(), sample_n)
        cols = np.random.choice(df['col'].to_numpy(), sample_n)
        timeit.timeit(lambda: df.drop(columns='col').lookup(idx, cols), number=1000)
        timeit.timeit(lambda: pd_lookup(df.drop(columns='col'), idx, cols), number=1000)
        str_col = cols[0]
        df[str_col] = df[str_col].astype(str)
        df[str_col] = str_col
        timeit.timeit(lambda: df.drop(columns='col').lookup(idx, cols), number=1000)
        timeit.timeit(lambda: pd_lookup(df.drop(columns='col'), idx, cols), number=1000)
```
pandas/core/frame.py (Outdated)

```
        Returns
        -------
        numpy.ndarray
            The found values.
        """
```
I think it would be really useful to have an example here in the docs for the API.
Added, please take a look.
expanded example
Nice example. I will let the other pandas developers handle the rest of the PR
```diff
-and column labels, this can be achieved by ``pandas.factorize`` and NumPy indexing.
-For instance:
+and column labels, and the ``lookup`` method allows for this and returns a
+NumPy array. For instance:
```
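For reference, the ``pandas.factorize`` plus NumPy-indexing recipe the old wording pointed to looks roughly like this (a sketch of the pattern with made-up data, not quoted from the docs):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['a', 'b', 'a'], 'a': [1, 2, 3], 'b': [4, 5, 6]})

# Per-row lookup of df.loc[row, df['col'][row]] without .lookup():
# factorize turns the column labels into integer codes plus the unique labels.
idx, cols = pd.factorize(df['col'])
values = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print(values)  # [1 5 3]
```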
Do we have other places in our API where we return a NumPy array? With the prevalence of the Arrow type system, it doesn't seem desirable to be locked into returning a NumPy array.
It looks like ``values`` also does this.
Agreed, I think this API should return an ExtensionArray or NumPy array depending on the initial type or result type.
``values`` only returns a NumPy array for NumPy types. For extension types or arrow-backed types you get something different:

```python
>>> pd.Series([1, 2, 3], dtype="int64[pyarrow]").values
<ArrowExtensionArray>
[1, 2, 3]
Length: 3, dtype: int64[pyarrow]
```

I don't think we should force a NumPy array return here; particularly for string data, that could be non-performant and expensive.
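For contrast, the NumPy-dtype side of the same call (assuming a plain ``int64`` Series):

```python
import numpy as np
import pandas as pd

# With a NumPy-backed dtype, .values is a plain ndarray.
s = pd.Series([1, 2, 3], dtype="int64")
print(type(s.values))  # <class 'numpy.ndarray'>
```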
Is the requested change as simple as removing the type-hint?
Optimization notes:

Most important change is the removal of:

```python
if not self._is_mixed_type or n > thresh
```

The old implementation slowed down when ``n < thresh``, with or without mixed types. Cases with ``n < thresh`` are now 10x faster.

The logic can be followed via Python operator precedence:
https://docs.python.org/3/reference/expressions.html#operator-precedence
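The precedence point being made: ``not`` binds tighter than ``or``, so the removed condition parses as ``(not self._is_mixed_type) or (n > thresh)``, not ``not (self._is_mixed_type or (n > thresh))``. A quick check with stand-in values (the names here are placeholders, not the PR's attributes):

```python
# Stand-in values, just to show how the expression groups.
is_mixed_type = True
n, thresh = 5, 10

grouped = (not is_mixed_type) or (n > thresh)
bare = not is_mixed_type or n > thresh
print(bare == grouped)  # True: `not` binds tighter than `or`
```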
Test notes:

I am unfamiliar with pytest and did not add parameterization.