Description
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame([['a',1.,2.],['b',3.,4.]])
df.loc[0,[1,2]].dtypes
df[[1,2]].loc[0].dtypes
Issue Description
df.loc[0,[1,2]] results in a Series of type dtype('O'), while df[[1,2]].loc[0] results in a Series of type dtype('float64').
Expected Behavior
I would expect df.loc[0,[1,2]] to be of type float64, same as df[[1,2]].loc[0]. The current behavior seems to encourage chaining instead of canonical referencing.
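For anyone comparing the two access paths side by side, here is a minimal sketch. The `infer_objects()` call at the end is only one possible workaround and is an assumption on my part, not something proposed in the report; the dtype of `direct` depends on the pandas version in use.

```python
import pandas as pd

df = pd.DataFrame([["a", 1.0, 2.0], ["b", 3.0, 4.0]])

# Column-first selection: the float columns are isolated before the row is taken.
chained = df[[1, 2]].loc[0]

# Single .loc call, as in the report; its dtype depends on the pandas version.
direct = df.loc[0, [1, 2]]

# Possible workaround: let pandas re-infer the dtype from the values.
converted = direct.infer_objects()

print(chained.dtype, direct.dtype, converted.dtype)
```

Note that `infer_objects()` inspects the values rather than the column dtypes, so it returns a new Series instead of fixing the selection itself.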
Installed Versions
INSTALLED VERSIONS
commit : 0691c5c
python : 3.12.8
python-bits : 64
OS : Darwin
OS-release : 23.6.0
Version : Darwin Kernel Version 23.6.0: Thu Sep 12 23:35:10 PDT 2024; root:xnu-10063.141.1.701.1~1/RELEASE_ARM64_T6030
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.3
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
pip : 24.3.1
Cython : None
sphinx : 8.1.3
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : 1.4.2
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.12.0
html5lib : 1.1
hypothesis : None
gcsfs : None
jinja2 : 3.1.5
lxml.etree : 5.3.0
matplotlib : 3.10.0
numba : 0.60.0
numexpr : 2.10.2
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 18.1.0
pyreadstat : None
pytest : 8.3.4
python-calamine : None
pyxlsb : None
s3fs : 2024.12.0
scipy : 1.14.1
sqlalchemy : 2.0.36
tables : 3.10.1
tabulate : 0.9.0
xarray : 2024.11.0
xlrd : None
xlsxwriter : None
zstandard : 0.23.0
tzdata : 2024.2
qtpy : N/A
pyqt5 : None
Activity
rhshadrach commented on Dec 26, 2024
Thanks for the report. I'd hazard a guess that we are determining the dtype of the result prior to column selection. Further investigations are welcome!
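A rough way to see what "determining the dtype prior to column selection" would look like from the outside is to do the two steps by hand. This is only an illustration of the guess above using public indexing, not a trace of the internal code path.

```python
import pandas as pd

df = pd.DataFrame([["a", 1.0, 2.0], ["b", 3.0, 4.0]])

# Taking the whole row first mixes a string with floats, so the row is object.
row = df.loc[0]

# Selecting labels from that row afterwards cannot recover float64:
# the dtype was fixed when the full row was built.
subset_of_row = row.loc[[1, 2]]

print(row.dtype, subset_of_row.dtype)
```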
parthi-siva commented on Dec 28, 2024
take
DarthKitten2130 commented on Dec 30, 2024
take
sanggon6107 commented on Jan 12, 2025
Hi @parthi-siva and @DarthKitten2130 ,
Are you still working on this issue? I would like to work on this one if you don't mind.
parthi-siva commented on Jan 12, 2025
Hi @sanggon6107, I'm still working on this.
sanggon6107 commented on Jan 12, 2025
Well noted. Thanks for the quick reply.
parthi-siva commented on Feb 28, 2025
For this input:
pandas.core.array_algos.take._take_preprocess_indexer_and_fill_value:131
Here we get the dtype for the resulting Series. Since arr contains a string, the dtype returned can only be object.
Then we create an empty NumPy array using that dtype, so it is also of type object.
pandas.core.array_algos.take._take_preprocess_indexer_and_fill_value:155
Then we do the slice using a Cython function.
pandas.core.array_algos.take._take_preprocess_indexer_and_fill_value:160
After the func(arr, indexer, out, fill_value) call, the out array is populated with the selected elements. However, the dtype of out (object) will not match the type of the selected elements of arr (float).
I tried adding a step to check and adjust the dtype of out after the func call.
This fixed the OP's issue, but test cases are failing. I also feel this is not the right way to address the issue.
So I'm not sure how to infer the dtype programmatically before this point for df.loc[0,[1,2]].
@rhshadrach @sanggon6107
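The allocate-then-fill pattern described above can be imitated with plain NumPy. This is a simplified stand-in, not the pandas internals: the names `arr`, `indexer`, and `out` mirror the comment, but the fill step here is ordinary fancy indexing rather than the Cython routine.

```python
import numpy as np
import pandas as pd

# Stand-in for the interleaved row values: one string and two floats.
arr = np.array(["a", 1.0, 2.0], dtype=object)
indexer = np.array([1, 2])

# The output buffer is allocated from the source dtype before any values are taken.
out = np.empty(len(indexer), dtype=arr.dtype)
out[...] = arr[indexer]

# The buffer stays object even though every selected element is a float.
print(out.dtype)
print(pd.api.types.infer_dtype(out))
```

The second print shows that the values themselves are all floats, which is the information that would be needed to pick a narrower dtype after the fact.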
sanggon6107 commented on Feb 28, 2025
Hi @parthi-siva,
thanks for the comment.
I had also tried something similar, but it seems there could be side effects, including test failures, since many other pandas functions call take_nd(). I would rather change the code at a relatively outer level of the call stack so that we can minimize the impact.
Since it seems this issue only appears when the first indexer is an integer and the second one is a list or slice, i.e. loc[int, list/slice], I think we could re-interpret the dtype of the output at the end of _LocationIndexer._getitem_lowerdim().
Proposed solution
There was only one failing test when I ran pytest locally, but the failing case should be revised according to this code change, since the test currently expects loc[int, list] to be an object dataframe.
My concern is that we have to create a new np.array only to re-interpret the dtype. I'm not sure if there's a more elegant way to infer the output's dtype.
Please let me know what you think about the proposal. I'd be glad to co-author a commit and make a PR if you don't mind.
cc @rhshadrach
Thanks!
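On the question of inferring the output dtype without inspecting values, one option might be to derive it from the dtypes of the selected columns. The sketch below is not the patch proposed in this comment: `loc_row_with_column_dtype` is a hypothetical user-level helper, it handles NumPy dtypes only, and `astype` still allocates a new array.

```python
import numpy as np
import pandas as pd


def loc_row_with_column_dtype(df, row_label, col_labels):
    """Hypothetical helper (not part of pandas): select one row and cast the
    result to the common dtype of the selected columns only."""
    selected = df.loc[row_label, col_labels]
    col_dtypes = list(df.dtypes.loc[col_labels])
    # np.result_type only understands NumPy dtypes; anything else is left as-is.
    if col_dtypes and all(isinstance(dt, np.dtype) for dt in col_dtypes):
        return selected.astype(np.result_type(*col_dtypes))
    return selected


df = pd.DataFrame([["a", 1.0, 2.0], ["b", 3.0, 4.0]])
result = loc_row_with_column_dtype(df, 0, [1, 2])
print(result.dtype)
```

If the selection includes the string column, the common dtype falls back to object, so the helper only narrows the result when the selected columns agree on something narrower.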