
BUG: loc[] returns object type instead of float #60600

Closed
@metazoic

Description


Pandas version checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandas.
  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
df = pd.DataFrame([['a',1.,2.],['b',3.,4.]])
df.loc[0,[1,2]].dtypes
df[[1,2]].loc[0].dtypes

Issue Description

df.loc[0,[1,2]] results in a Series of type dtype('O'), while df[[1,2]].loc[0] results in a Series of type dtype('float64').

Expected Behavior

I would expect df.loc[0,[1,2]] to be of type float64, same as df[[1,2]].loc[0]. The current behavior seems to encourage chaining instead of canonical referencing.
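
A minimal sketch of the comparison, for readers who want to run it. It uses the DataFrame from the report; the names direct, chained, and fixed are illustrative only. The infer_objects() call at the end is a possible stopgap rather than a fix, and the dtype printed for direct may depend on the pandas version.

```python
import pandas as pd

df = pd.DataFrame([["a", 1.0, 2.0], ["b", 3.0, 4.0]])

direct = df.loc[0, [1, 2]]    # row and columns selected in one .loc call
chained = df[[1, 2]].loc[0]   # columns first, then the row

# The selected values are the same either way; only the dtype may differ.
print(direct.tolist(), direct.dtype)
print(chained.tolist(), chained.dtype)

# Stopgap (an assumption, not an official recommendation): ask pandas to
# re-infer the dtype of the object Series after the selection.
fixed = direct.infer_objects()
print(fixed.dtype)
```

This keeps the single .loc call and only adds a conversion step, at the cost of an extra pass over the selected values.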

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.12.8
python-bits : 64
OS : Darwin
OS-release : 23.6.0
Version : Darwin Kernel Version 23.6.0: Thu Sep 12 23:35:10 PDT 2024; root:xnu-10063.141.1.701.1~1/RELEASE_ARM64_T6030
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.3
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
pip : 24.3.1
Cython : None
sphinx : 8.1.3
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : 1.4.2
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.12.0
html5lib : 1.1
hypothesis : None
gcsfs : None
jinja2 : 3.1.5
lxml.etree : 5.3.0
matplotlib : 3.10.0
numba : 0.60.0
numexpr : 2.10.2
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 18.1.0
pyreadstat : None
pytest : 8.3.4
python-calamine : None
pyxlsb : None
s3fs : 2024.12.0
scipy : 1.14.1
sqlalchemy : 2.0.36
tables : 3.10.1
tabulate : 0.9.0
xarray : 2024.11.0
xlrd : None
xlsxwriter : None
zstandard : 0.23.0
tzdata : 2024.2
qtpy : N/A
pyqt5 : None

Activity

Added label Needs Triage (issue that has not been reviewed by a pandas team member) on Dec 23, 2024
rhshadrach (Member) commented on Dec 26, 2024

Thanks for the report. I'd hazard a guess that we are determining the dtype of the result prior to column selection. Further investigations are welcome!
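
One way to probe that guess from user code, as a rough sketch: split the selection into its two steps by hand and look at the dtype after each. This only mimics the order of operations with public API calls; it does not show what .loc does internally, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([["a", 1.0, 2.0], ["b", 3.0, 4.0]])

# Row first: the whole row mixes a string with floats, so it is object.
row = df.loc[0]
row_then_cols = row.loc[[1, 2]]

# Columns first: only float columns remain before the row is taken.
cols_then_row = df[[1, 2]].loc[0]

print(row.dtype, row_then_cols.dtype, cols_then_row.dtype)
```

If the row is materialized before the columns are selected, the object dtype of the full row carries over to the result, which would match the reported behavior.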

Added labels Indexing (related to indexing on series/frames, not to indexes themselves) and Dtype Conversions (unexpected or buggy dtype conversions), and removed Needs Triage, on Dec 26, 2024
parthi-siva (Contributor) commented on Dec 28, 2024

take

DarthKitten2130 (Contributor) commented on Dec 30, 2024

take

sanggon6107 (Contributor) commented on Jan 12, 2025

Hi @parthi-siva and @DarthKitten2130 ,
Are you still working on this issue? I would like to work on this one if you don't mind.

parthi-siva (Contributor) commented on Jan 12, 2025

Hi @sanggon6107, I'm still working on this.

sanggon6107 (Contributor) commented on Jan 12, 2025

> Hi @sanggon6107 I'm still working on this..

Well noted. Thanks for the quick reply.

parthi-siva (Contributor) commented on Feb 28, 2025

For this input:

df = pd.DataFrame([['a',1.,2.],['b',3.,4.]])
df.loc[0,[1,2]].dtypes

At pandas.core.array_algos.take._take_preprocess_indexer_and_fill_value:131, we get the dtype for the resulting Series:

dtype, fill_value, mask_info = _take_preprocess_indexer_and_fill_value(
    arr, indexer, fill_value, allow_fill
)
arr = ['a', 1.0, 2.0]
indexer = [1, 2]

Because arr contains a string, it is an object array, so the dtype returned here is also object.

An empty NumPy array is then created with that dtype, so it is also of dtype object:

pandas.core.array_algos.take._take_preprocess_indexer_and_fill_value:155

out = np.empty(out_shape, dtype=dtype) 

Then the selection itself is done by a Cython function:

pandas.core.array_algos.take._take_preprocess_indexer_and_fill_value:160

func(arr, indexer, out, fill_value)

After the func(arr, indexer, out, fill_value) call, out is populated with the selected elements. However, out keeps the object dtype it was allocated with, even though the selected elements are all floats.
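
The allocate-then-fill pattern described above can be imitated in plain NumPy. This is only a sketch of the idea: fancy indexing stands in for the real take function, and nothing here is pandas' actual code.

```python
import numpy as np

arr = np.array(["a", 1.0, 2.0], dtype=object)  # the row, as an object array
indexer = np.array([1, 2])

# The output buffer is allocated with the source dtype before any element
# is looked at, so it is object regardless of what gets selected.
out = np.empty(indexer.shape, dtype=arr.dtype)
out[:] = arr[indexer]  # stand-in for the take function

print(out, out.dtype)
print([type(x).__name__ for x in out])
```

The buffer holds only Python floats, yet its dtype is still object, because the dtype was decided from arr rather than from the elements that ended up in out.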

I tried to add a step to check and adjust the dtype of out after the func call.

    # Check if the dtype of out matches the dtype of the elements in arr
    if out.size > 0:  # Only check if out is not empty
        first_element = out.flat[0]  # Get the first element

        # Check if the first element's type is different from out.dtype
        if isinstance(first_element, (int, float, np.number)) and out.dtype == object:
            # If the first element is numeric but out.dtype is object, update the dtype
            new_dtype = np.result_type(first_element)
            out = out.astype(new_dtype)

This fixed the OP's issue, but test cases are failing. I also feel this is not the right way to address the issue.

So I'm not sure how to infer the dtype programmatically before this point for df.loc[0,[1,2]].

@rhshadrach @sanggon6107

sanggon6107 (Contributor) commented on Feb 28, 2025

Hi @parthi-siva,
thanks for the comment.

I had also tried a similar thing, but it seems there could be side effects, including test failures, since many other pandas functions may call take_nd().
I would rather change code at a relatively outer level of the call stack so that we can minimize the impact.
Since this issue seems to appear only where the first axis key is an integer and the second is a list or slice, i.e. loc[int, list/slice], I think we could re-interpret the dtype of the output at the end of _LocationIndexer._getitem_lowerdim().

Proposed solution

    @final
    def _getitem_lowerdim(self, tup: tuple):

...

                # This is an elided recursive call to iloc/loc
                out = getattr(section, self.name)[new_key]
                # Re-interpret dtype of out.values for loc/iloc[int, list/slice]. # GH60600
                if i == 0 and isinstance(key, int) and isinstance(new_key, (list, slice)):
                    inferred_dtype = np.array(out.values.tolist()).dtype
                    if inferred_dtype != out.dtype:
                        out = out.astype(inferred_dtype)
                return out

There was only one failing test when I ran pytest locally, but that test should be revised to match this code change, since it currently expects loc[int, list] to return object dtype.
My concern is that we have to create a new np.array only to re-interpret the dtype. I'm not sure if there's a more elegant way to infer the output's dtype.
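
For comparison, here is a small sketch of two ways to infer a dtype from object values: the list round-trip used in the proposal, and Series.infer_objects(). It is not a patch for _getitem_lowerdim; the sample Series and variable names are illustrative. One thing it shows is that the list round-trip turns mixed string/float data into a NumPy string dtype, which may not be what is wanted for a row such as ['a', 1.0].

```python
import numpy as np
import pandas as pd

all_float = pd.Series([1.0, 2.0], dtype=object)
mixed = pd.Series(["a", 1.0], dtype=object)

# List round-trip, as in the proposal: NumPy picks a common dtype.
roundtrip_float = np.array(all_float.tolist()).dtype
roundtrip_mixed = np.array(mixed.tolist()).dtype  # a string dtype, not object

# Series.infer_objects: soft conversion that leaves mixed data as object.
inferred_float = all_float.infer_objects().dtype
inferred_mixed = mixed.infer_objects().dtype

print(roundtrip_float, roundtrip_mixed)
print(inferred_float, inferred_mixed)
```

Whether infer_objects() is cheap enough or appropriate inside the indexer is a separate question that this sketch does not answer.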

Please let me know what you think about the proposal. I'd be glad to co-author a commit and make a PR if you don't mind.

cc @rhshadrach

Thanks!

Labels

Bug; Dtype Conversions (unexpected or buggy dtype conversions); Indexing (related to indexing on series/frames, not to indexes themselves)

Participants

@parthi-siva, @metazoic, @rhshadrach, @sanggon6107, @DarthKitten2130
