
ENH: re-implement DataFrame.lookup. #40140

Closed

Description

@erfannariman
Member

DataFrame.lookup was deprecated in #35224 in 1.2. After some feedback (#39171) I opened this ticket to discuss re-implementation of lookup in a performant way. As mentioned in the discussion on #35224: "but it would have to be performant and not be yet another indexing api".

This ticket can be a starting point for proposed methods, although the old implementation was actually quite performant; see the tests in the discussion on #35224:

pandas/pandas/core/frame.py

Lines 3848 to 3861 in b5958ee

if not self._is_mixed_type or n > thresh:
    values = self.values
    ridx = self.index.get_indexer(row_labels)
    cidx = self.columns.get_indexer(col_labels)
    if (ridx == -1).any():
        raise KeyError("One or more row labels was not found")
    if (cidx == -1).any():
        raise KeyError("One or more column labels was not found")
    flat_index = ridx * len(self.columns) + cidx
    result = values.flat[flat_index]
else:
    result = np.empty(n, dtype="O")
    for i, (r, c) in enumerate(zip(row_labels, col_labels)):
        result[i] = self._get_value(r, c)
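
The fast path of the snippet above can be sketched as a standalone function. This is only an illustration, not the pandas implementation: the name `lookup_flat` is hypothetical, and it assumes unique index and column labels (which `get_indexer` requires).

```python
import numpy as np
import pandas as pd


def lookup_flat(df, row_labels, col_labels):
    """Label-based lookup via a flattened positional index (hypothetical helper)."""
    # Translate labels to integer positions; -1 marks a label that was not found.
    ridx = df.index.get_indexer(row_labels)
    cidx = df.columns.get_indexer(col_labels)
    if (ridx == -1).any():
        raise KeyError("One or more row labels was not found")
    if (cidx == -1).any():
        raise KeyError("One or more column labels was not found")
    values = df.to_numpy()
    # Row-major offset of (row, col) in a 2-D array with len(df.columns) columns.
    flat_index = ridx * len(df.columns) + cidx
    return values.flat[flat_index]


df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]},
    index=["x", "y", "z"],
)
# Pairs looked up: (x, B), (z, A), (y, B).
result = lookup_flat(df, ["x", "z", "y"], ["B", "A", "B"])
```

Because `.flat` always iterates in row-major order, the offset arithmetic holds regardless of how the underlying array is laid out in memory.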

Activity

changed the title [-]ENH: re-implement DataFrame.lookup in a performant way.[/-] [+]ENH: re-implement DataFrame.lookup.[/+] on Mar 1, 2021
added Needs Discussion (Requires discussion from core team before further action) and removed Needs Triage (Issue that has not been reviewed by a pandas team member) on Mar 1, 2021
challisd commented on Aug 4, 2021

I think there should definitely be a lookup function. Since the old one seems to work well, is un-deprecating it an option? I find the proposed alternative using melt unreadable, and based on the (sadly heated) discussion here, the old lookup function is faster than the suggested melt alternative. Pandas is a module used by many thousands of programmers and scientists who often have only a vague (or no) idea what the melt function does. Running a quick series of lookups from lists of row and column coordinates is a fairly ordinary task, but if you don't provide this lookup function, most users will likely fall back on a slow for loop; and if that's too slow for them, they will decide to forget it and just use NumPy, where you can do the_data[row_index_list, column_index_list].
Can we please keep this function?
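
To make the comparison in this comment concrete, here is a small sketch of the two fallbacks it mentions: a per-element loop and NumPy integer-array indexing. The frame and labels are made up for illustration; the variable names `the_data`, `row_index_list`, and `column_index_list` follow the comment.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=["r0", "r1", "r2", "r3"],
    columns=["A", "B", "C"],
)
row_labels = ["r0", "r2", "r3"]
col_labels = ["C", "A", "B"]

# Slow fallback: one scalar label lookup per (row, column) pair.
looped = [df.at[r, c] for r, c in zip(row_labels, col_labels)]

# NumPy route: translate labels to positions, then index with paired integer arrays.
the_data = df.to_numpy()
row_index_list = df.index.get_indexer(row_labels)
column_index_list = df.columns.get_indexer(col_labels)
vectorized = the_data[row_index_list, column_index_list]
```

Both routes return the values at (r0, C), (r2, A), and (r3, B); the second does so in a single vectorized indexing operation.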

berkgercek commented on May 9, 2022

Going to throw my voice in here and say that this is a pretty important feature for dataframes that allows for numpy-like behavior with labeled complex indexes and columns.

My personal use case for Pandas is often reliant on using it to keep labels and data together, and working with a method like lookup is a part of how I use it. It also fits in the scope of the package description provided in the documentation:

Intelligent label-based slicing, fancy indexing, and subsetting of large data sets

It's not something I use often, but the proposed solution linked in the deprecation notice is very inelegant.

One use case I have today is to do something similar to the following (obviously with meaningful data), which I have done with the above solution:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.normal(size=[1000, 3]), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.normal(size=[1000, 3]), columns=['A', 'B', 'C'])
# Column label of the row-wise maximum in df1.
maxidx = df1.idxmax(axis=1)
# The solution suggested by the deprecation notice is below.
idx, cols = pd.factorize(maxidx)
df2_lookup = pd.Series(df2.reindex(cols, axis=1).to_numpy()[np.arange(len(df2)), idx], index=df2.index)

This is not a readable solution for me and when others need to maintain this code it will be very much not obvious at a glance what I'm doing here.

If there is another solution that would work equally well but be more intuitive I am happy to use that instead, but I see no alternative to the .lookup method for this use case.
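
For readers trying to follow the factorize-based recipe quoted in this comment, the sketch below unpacks it step by step on a tiny deterministic frame. The data is made up for illustration; the variable names mirror the comment's snippet.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [1, 9, 2], "B": [5, 3, 4], "C": [0, 7, 8]})
df2 = pd.DataFrame({"A": [10, 20, 30], "B": [40, 50, 60], "C": [70, 80, 90]})

# Column label of the row-wise maximum in df1: B, A, C.
maxidx = df1.idxmax(axis=1)

# factorize maps each label to an integer code and returns the unique labels
# in order of first appearance: idx = [0, 1, 2], cols = ['B', 'A', 'C'].
idx, cols = pd.factorize(maxidx)

# Reorder df2's columns so that column position k holds label cols[k].
reordered = df2.reindex(cols, axis=1).to_numpy()

# Pair row i with column code idx[i] using NumPy integer-array indexing.
picked = reordered[np.arange(len(df2)), idx]
df2_lookup = pd.Series(picked, index=df2.index)
```

Each row of `df2` ends up contributing the value from the column in which the matching row of `df1` has its maximum.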

rhshadrach (Member) commented on May 11, 2022

+1 on re-implementing unless there is a more understandable alternative; I too find it hard to discern what the current alternative is doing.

added this to the 2.0 milestone on Nov 28, 2022
modified the milestones: 2.0, 3.0 on Feb 8, 2023
challisd commented on Feb 22, 2023

Any word on if or when this feature will be added back in, or has anyone figured out a viable alternative?

erfannariman (Member, Author) commented on Feb 22, 2023

Just to check before someone (or I) starts to work on it: is there agreement that this will be added back in if there's a viable PR? @jorisvandenbossche @mroeschke @rhshadrach

18 remaining items

challisd commented on Nov 7, 2023

I agree that we can't provide a ton of short combinations of pandas methods, but such a common and basic use case certainly should be included in my opinion. What about something like the following:

import warnings

import numpy as np

def lookup(df, row_labels, col_labels, dtype=None):
    if len(df.dtypes.unique()) > 1:
        warnings.warn("DataFrame contains mixed data types which may lead to unexpected type coercion and/or decreased performance")
    if np.dtype('O') in df.dtypes.values and dtype is not None and dtype != 'object':
        warnings.warn("DataFrames with columns of the 'object' data type may fail to be coerced to other data types")
    # Translate labels to integer positions, then index the converted array pairwise.
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    return df.to_numpy(dtype=dtype)[rows, cols]

It allows greater control over the output data type and gives warnings when dealing with mixed data types. It gives an additional warning if attempting to coerce an object column to a non-object data type, as this can easily lead to an exception.
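
A short usage sketch of the proposed function follows. The definition is repeated in condensed form (the object-dtype warning is omitted and the message is shortened) so that the block runs on its own; the example frame is made up.

```python
import warnings

import numpy as np
import pandas as pd


def lookup(df, row_labels, col_labels, dtype=None):
    # Condensed copy of the proposal above, kept only to make the example runnable.
    if len(df.dtypes.unique()) > 1:
        warnings.warn("DataFrame contains mixed data types")
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    return df.to_numpy(dtype=dtype)[rows, cols]


# One int64 column and one float64 column, so the mixed-dtype warning fires.
df = pd.DataFrame({"A": [1, 2, 3], "B": [0.5, 1.5, 2.5]}, index=["x", "y", "z"])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Pairs looked up: (x, A) and (z, B), coerced to float64.
    values = lookup(df, ["x", "z"], ["A", "B"], dtype="float64")
```

One caveat when adapting this: `get_indexer` returns -1 for labels it cannot find, and NumPy treats -1 as the last position, so a missing label silently returns a value from the last row or column. The `KeyError` checks from the original implementation would guard against that.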

MarcoGorelli (Member) commented on Nov 7, 2023

"such a common and basic use case certainly should be included"

just checking, are there any other dataframe libraries which include this?

challisd commented on Nov 7, 2023

Not sure, I've mostly only used Pandas in Python. I'm making the claim it's a basic feature based off the evidence that both R and Numpy support this functionality as part of the built-in [] indexing function.

challisd commented on Nov 8, 2023

Actually, it seems I remembered incorrectly and it is not a basic feature of R. Sorry for the mistake!

stevenae (Contributor) commented on Feb 21, 2025

Unless this has been deprioritized, I'll try out some optimizations and aim to put a PR up in the 1 week - 1 month time frame (sometime March 2025)

stevenae (Contributor) commented on Mar 26, 2025

take

stevenae (Contributor) commented on Mar 27, 2025

Hi, I put up a PR (#61185). There is one CI test failing* but it appears to be unrelated to the change itself (perhaps a flaky test).

*[Unit Tests / macos-13 actions-312.yaml (pull_request)](https://github.com/pandas-dev/pandas/actions/runs/14094479324/job/39478857250?pr=61185)

stevenae (Contributor) commented on Mar 27, 2025

Added another optimization for lookups on subset of columns.

stevenae (Contributor) commented on Mar 27, 2025

Added one final optimization -- subsetting rows as well when there is a mixture of types.

Also reduced complexity, on the assumption that lookup will be done for less than all rows and/or columns.
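
The idea of subsetting before converting can be illustrated roughly as follows. This is a hypothetical sketch and not the code from #61185: the name `lookup_subset` and the example frame are assumptions. It restricts the frame to the unique requested rows and columns so that unrelated columns (here an object column) do not force the whole array to object dtype.

```python
import numpy as np
import pandas as pd


def lookup_subset(df, row_labels, col_labels):
    """Hypothetical sketch: convert only the rows and columns that are requested."""
    # Codes index into the unique labels, so the sub-frame has no repeated labels.
    row_codes, row_uniques = pd.Index(row_labels).factorize()
    col_codes, col_uniques = pd.Index(col_labels).factorize()
    # .loc raises KeyError for unknown labels and keeps the requested order.
    sub = df.loc[row_uniques, col_uniques]
    return sub.to_numpy()[row_codes, col_codes]


df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [0.5, 1.5, 2.5], "C": ["p", "q", "r"]},
    index=["x", "y", "z"],
)
# Pairs looked up: (x, A), (z, B), (x, B). Column C is never converted.
result = lookup_subset(df, ["x", "z", "x"], ["A", "B", "B"])
```

Converting the full frame would yield an object array because of column C, whereas the subset containing only A and B converts to a float array.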

stevenae (Contributor) commented on May 21, 2025

Just a heads up: feedback on #61185 has led to a decision not to re-implement DataFrame.lookup. Instead I will add documentation recommending user code.

I recommend closing this issue.

stevenae (Contributor) commented on May 21, 2025

Adding documentation for user code instead, at #61471


Labels: Enhancement; Needs Discussion (Requires discussion from core team before further action); Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Participants: @challisd, @stevenae, @jorisvandenbossche, @datapythonista, @mroeschke