BUG: loc[] returns object type instead of float #60600

Open
3 tasks done
metazoic opened this issue Dec 23, 2024 · 13 comments · May be fixed by #61054
Labels: Bug; Dtype Conversions (Unexpected or buggy dtype conversions); Indexing (Related to indexing on series/frames, not to indexes themselves)

Comments

@metazoic

metazoic commented Dec 23, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
df = pd.DataFrame([['a',1.,2.],['b',3.,4.]])
df.loc[0,[1,2]].dtypes
df[[1,2]].loc[0].dtypes

Issue Description

df.loc[0,[1,2]] results in a Series of type dtype('O'), while df[[1,2]].loc[0] results in a Series of type dtype('float64').

Expected Behavior

I would expect df.loc[0,[1,2]] to be of type float64, same as df[[1,2]].loc[0]. The current behavior seems to encourage chaining instead of canonical referencing.

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.12.8
python-bits : 64
OS : Darwin
OS-release : 23.6.0
Version : Darwin Kernel Version 23.6.0: Thu Sep 12 23:35:10 PDT 2024; root:xnu-10063.141.1.701.1~1/RELEASE_ARM64_T6030
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.3
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
pip : 24.3.1
Cython : None
sphinx : 8.1.3
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : 1.4.2
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.12.0
html5lib : 1.1
hypothesis : None
gcsfs : None
jinja2 : 3.1.5
lxml.etree : 5.3.0
matplotlib : 3.10.0
numba : 0.60.0
numexpr : 2.10.2
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 18.1.0
pyreadstat : None
pytest : 8.3.4
python-calamine : None
pyxlsb : None
s3fs : 2024.12.0
scipy : 1.14.1
sqlalchemy : 2.0.36
tables : 3.10.1
tabulate : 0.9.0
xarray : 2024.11.0
xlrd : None
xlsxwriter : None
zstandard : 0.23.0
tzdata : 2024.2
qtpy : N/A
pyqt5 : None

@metazoic metazoic added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 23, 2024
@rhshadrach
Member

Thanks for the report. I'd hazard a guess that we are determining the dtype of the result prior to column selection. Further investigations are welcome!
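A minimal sketch of that hypothesis, using only the frame from the report. It does not trace the actual loc code path; it just shows that if the row is materialized before the columns are selected, the result has to be object, while selecting the float columns first keeps float64. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([["a", 1.0, 2.0], ["b", 3.0, 4.0]])

# Row first: one Series has to hold 'a', 1.0 and 2.0, so its dtype is object.
row_first = df.loc[0]
# Selecting labels 1 and 2 afterwards does not re-infer the dtype.
row_then_cols = row_first[[1, 2]]

# Columns first: only float64 columns remain, so the row stays float64.
cols_first = df[[1, 2]].loc[0]

print(row_first.dtype, row_then_cols.dtype, cols_first.dtype)
```

If loc[0, [1, 2]] internally behaves like the "row first" path, that would explain the object result in the report.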

@rhshadrach rhshadrach added Indexing Related to indexing on series/frames, not to indexes themselves Dtype Conversions Unexpected or buggy dtype conversions and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 26, 2024
@parthi-siva
Contributor

take

@DarthKitten2130
Contributor

take

@sanggon6107

Hi @parthi-siva and @DarthKitten2130 ,
Are you still working on this issue? I would like to work on this one if you don't mind.

@parthi-siva
Contributor

Hi @sanggon6107 I'm still working on this..

@sanggon6107

> Hi @sanggon6107 I'm still working on this..

Well noted. Thanks for the quick reply.

@parthi-siva
Contributor

parthi-siva commented Feb 28, 2025

For this input:

df = pd.DataFrame([['a',1.,2.],['b',3.,4.]])
df.loc[0,[1,2]].dtypes

At pandas.core.array_algos.take._take_preprocess_indexer_and_fill_value:131 we get the dtype for the resulting Series:

dtype, fill_value, mask_info = _take_preprocess_indexer_and_fill_value(
    arr, indexer, fill_value, allow_fill
)

Here arr = ['a', 1.0, 2.0] and indexer = [1, 2].

Because arr contains a string, the dtype returned is object.

Then we create an empty NumPy array with that dtype, so it is also of type object:

pandas.core.array_algos.take._take_preprocess_indexer_and_fill_value:155

out = np.empty(out_shape, dtype=dtype) 

Then we take the selected elements using a Cython function:

pandas.core.array_algos.take._take_preprocess_indexer_and_fill_value:160

func(arr, indexer, out, fill_value)

After the func(arr, indexer, out, fill_value) call, the out array is populated with the selected elements. However, out keeps dtype object, which no longer matches the type of the elements it holds (all floats).
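The steps above can be imitated in plain NumPy. This is a sketch only: it replaces the internal take function with ordinary fancy indexing and is not the pandas code, but it shows the same effect, namely that an output buffer allocated with the source array's dtype stays object even when every selected element is a float.

```python
import numpy as np

# Stand-in for the row values seen by the take machinery.
arr = np.array(["a", 1.0, 2.0], dtype=object)
indexer = np.array([1, 2])

# The output buffer is allocated with the dtype of arr, i.e. object.
out = np.empty(indexer.shape, dtype=arr.dtype)

# Stand-in for func(arr, indexer, out, fill_value): copy the selected elements.
out[:] = arr[indexer]

# The elements are floats, but the container dtype is still object.
print(out.dtype, [type(x).__name__ for x in out])

# An explicit cast is needed to get a float64 array.
as_float = out.astype(np.float64)
```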

I tried adding a step that checks and adjusts the dtype of out after the func call.

# Check whether the dtype of out matches the type of its elements
if out.size > 0:  # only check when out is not empty
    first_element = out.flat[0]

    # If the first element is numeric but out.dtype is object, cast out
    if isinstance(first_element, (int, float, np.number)) and out.dtype == object:
        new_dtype = np.result_type(first_element)
        out = out.astype(new_dtype)

This fixed the OP's issue, but test cases are failing. I also feel this is not the right way to address the issue.

So I'm not sure how to infer the dtype programmatically before this point for df.loc[0,[1,2]].

@rhshadrach @sanggon6107

@sanggon6107

Hi @parthi-siva,
thanks for the comment.

I also tried something similar, but it seems there could be side effects, including test failures, since many other pandas functions call take_nd().
I would rather change code at a relatively outer level of the call stack so that we can minimize the impact.
Since this issue seems to appear only when the first indexer is an integer and the second is a list or slice, i.e. loc[int, list/slice], I think we could re-infer the dtype of the output at the end of _LocationIndexer._getitem_lowerdim().

Proposed solution

    @final
    def _getitem_lowerdim(self, tup: tuple):

...

                # This is an elided recursive call to iloc/loc
                out = getattr(section, self.name)[new_key]
                # Re-interpret dtype of out.values for loc/iloc[int, list/slice]. # GH60600
                if i == 0 and isinstance(key, int) and isinstance(new_key, (list, slice)):
                    inferred_dtype = np.array(out.values.tolist()).dtype
                    if inferred_dtype != out.dtype:
                        out = out.astype(inferred_dtype)
                return out

Only one test failed when I ran pytest locally, and that test would need to be revised for this change, since it currently expects loc[int, list] to return an object-dtype result.
My concern is that we have to create a new np.array only to re-infer the dtype. I'm not sure whether there is a more elegant way to infer the output's dtype.

Please let me know what you think about the proposal. I'd be glad to co-author a commit and make a PR if you don't mind.

cc @rhshadrach

Thanks!

@parthi-siva
Contributor

parthi-siva commented Mar 1, 2025

Hi @sanggon6107 ,

Thanks for the reply.

Please proceed with your proposal. I'm good!

I spent some time on your concern about creating an np.array just to find the dtype.

Can we try using np.result_type?

Either like this:

inferred_dtype = reduce(np.result_type, out)

or like this

inferred_dtype = np.result_type(out.values.tolist())

Please let me know if it helps.
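For comparison, here is a small sketch of the two inference approaches discussed so far, applied to a stand-in for the object Series that loc[0, [1, 2]] returns. The Series is constructed by hand, and the unpacking variant np.result_type(*values) is my own addition rather than something proposed in the thread; the list-argument form suggested above is not exercised here.

```python
from functools import reduce

import numpy as np
import pandas as pd

# Hand-built stand-in for the object-dtype result of df.loc[0, [1, 2]].
out = pd.Series([1.0, 2.0], index=[1, 2], dtype=object)
values = out.tolist()

# Approach from the proposed patch: build a new array and read its dtype.
via_array = np.array(values).dtype

# Approach suggested here: fold np.result_type over the elements.
via_reduce = reduce(np.result_type, values)

# Variant (assumption, not from the thread): pass the elements as arguments.
via_unpack = np.result_type(*values)

# Caveat: with a single element and no initializer, reduce returns the
# element itself rather than a dtype.
single = reduce(np.result_type, [1.0])

print(via_array, via_reduce, via_unpack, type(single).__name__)
```

All three approaches agree for purely numeric data. The reduce form needs care for one-element results, and none of this has been checked against non-numeric elements.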

@rhshadrach
Member

@sanggon6107 - it's not clear to me what the proposal is. Best to open a PR I think.

@sanggon6107

Hi @parthi-siva , your suggestion helped a lot!

I've also found that we could simplify the code by using infer_objects().
I'll make a PR based on this discussion.
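A quick sketch of what infer_objects() does on hand-built stand-ins for the two kinds of result. How it would be wired into the indexing code is not shown in this thread, so this only illustrates the method itself: an object Series holding only floats is converted to float64, while a genuinely mixed Series stays object.

```python
import pandas as pd

# Stand-in for loc[0, [1, 2]]: floats stored in an object Series.
numeric = pd.Series([1.0, 2.0], index=[1, 2], dtype=object)

# Stand-in for a selection that really is mixed, e.g. loc[0, [0, 1]].
mixed = pd.Series(["a", 1.0], index=[0, 1], dtype=object)

numeric_inferred = numeric.infer_objects()
mixed_inferred = mixed.infer_objects()

print(numeric_inferred.dtype, mixed_inferred.dtype)
```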

@sanggon6107

take

@parthi-siva
Contributor

> Hi @parthi-siva , your suggestion helped a lot!
>
> I've also found that we could simplify the code by using infer_objects().
> I'll make a PR based on this discussion.

Sure @sanggon6107 :)

@sanggon6107 sanggon6107 linked a pull request Mar 4, 2025 that will close this issue
5 tasks