iterrows: when upcasting to object, values are converted to python types #13468

Open
jorisvandenbossche opened this issue Jun 16, 2016 · 10 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions

Comments

@jorisvandenbossche
Member

I know iterrows is not the most recommended function, but I noticed some strange behaviour (prompted by a geopandas user's problem: geopandas/geopandas#348). When using iterrows on a df with mixed dtypes (so the resulting series is of object dtype), the numeric values are converted to Python types, while with loc/iloc the numpy types are preserved:

In [254]: df = pd.DataFrame({'int':[0,1], 'float':[0.1,0.2], 'str':['a','b']})

In [255]: df
Out[255]:
   float  int str
0    0.1    0   a
1    0.2    1   b

In [256]: row1 = df.iloc[0]

In [257]: i, row2 = next(df.iterrows())

In [258]: row3 = next(df.itertuples())

In [260]: type(row1['float'])
Out[260]: numpy.float64

In [261]: type(row2['float'])
Out[261]: float

In [269]: type(row3.float)
Out[269]: numpy.float64

Is this intentional? It is a consequence of using self.values in the implementation: numpy converts the values to Python types when it builds an object array. If it is intentional, is it worth documenting?

(Note that it was actually the numpy types in an object-dtyped series that caused the issue for the geopandas user, because fiona couldn't handle numpy scalars in an object-dtyped column, but that's not something to blame pandas for.)
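The numpy conversion mentioned above can be reproduced without pandas. This is a minimal sketch, not the actual iterrows code path: it only shows that casting a float array to object dtype stores built-in Python floats. The variable names are illustrative.

```python
import numpy as np

# A homogeneous float array returns numpy scalars on element access.
floats = np.array([0.1, 0.2])
numpy_scalar = floats[0]

# Casting the same data to object dtype stores plain Python objects instead.
as_object = floats.astype(object)
python_scalar = as_object[0]

print(type(numpy_scalar))   # <class 'numpy.float64'>
print(type(python_scalar))  # <class 'float'>
```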

@jreback
Contributor

jreback commented Jun 16, 2016

see discussion #13236

They should be the same (i.e. Python types).

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Dtype Conversions Unexpected or buggy dtype conversions Difficulty Intermediate labels Jun 17, 2016
@jreback jreback added this to the Next Major Release milestone Jun 17, 2016
@jreback jreback modified the milestones: 0.21.0, Next Major Release Sep 12, 2017
@jreback
Contributor

jreback commented Sep 12, 2017

So after #17491, [269] is also float.

I think we could actually/should fix [260], but that's another item.

@jorisvandenbossche
Member Author

Yes, I think this can actually be closed now, apart from a doc update to iterrows / itertuples to make it clear that it boxes to python / custom pandas types.

@jreback
Contributor

jreback commented Sep 12, 2017

I think it's OK to keep this open for now.

I want to fix the scalar getting as well (will repurpose this issue for that).

@jreback jreback modified the milestones: 0.21.0, 1.0 Oct 2, 2017
@jreback jreback modified the milestones: 1.0, 0.24.0 Apr 23, 2018
@jreback jreback added the Indexing Related to indexing on series/frames, not to indexes themselves label Apr 23, 2018
@mitar
Contributor

mitar commented Apr 23, 2018

To fix [260], you can call item on the underlying numpy arrays, as I am doing in #20796. This seems to do the same thing that tolist does for a whole array, so you could call item on each cell in a row when constructing the result for df.iloc[0].
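A rough sketch of that idea, using only numpy. The helper to_python_scalars is hypothetical and is not the change made in #20796; it just shows that item on each numpy scalar gives the same built-in types as tolist on the whole array.

```python
import numpy as np


def to_python_scalars(values):
    """Hypothetical helper: unbox each numpy scalar in a 1-D array via item()."""
    return [v.item() if isinstance(v, np.generic) else v for v in values]


row_values = np.array([1, 2, 3], dtype=np.int64)

# Iterating the array yields np.int64 scalars; item() turns each into an int.
unboxed = to_python_scalars(row_values)

# tolist() performs the same conversion for the whole array at once.
via_tolist = row_values.tolist()
```

Non-numpy values (for example strings in an object array) are passed through unchanged in this sketch.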

@mitar
Contributor

mitar commented Apr 23, 2018

There is something strange going on here. Taking an example from documentation:

>>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])
>>> row = next(df.iterrows())[1]
>>> row
int      1.0
float    1.5
Name: 0, dtype: float64
>>> print(row['int'].dtype)
float64

But:

>>> df = pd.DataFrame([[1, 1.5, 'a']], columns=['int', 'float', 'str'])
>>> row = next(df.iterrows())[1]
>>> row
int        1
float    1.5
str        a
Name: 0, dtype: object
>>> print(row['int'].dtype)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'int' object has no attribute 'dtype'

So it seems conversion to Python types happens only if there is some object dtype present. Otherwise we get (and keep) numpy types, only upcast to a common dtype.

@bscheetz

bscheetz commented May 8, 2018

@jreback In the first example posted by @mitar, python type int should be returned because we're iterating, correct?

It also sounds like we want to fix the type returned by iloc: it should return the Python type int but instead returns numpy.int64.

@jorisvandenbossche
Member Author

@jreback In the first example posted by @mitar, python type int should be returned because we're iterating, correct?

I don't think so, as in that example there are only numeric dtypes, so it makes sense to keep the row / Series as float dtype.
And if we do that, this boils down to the fact that accessing a single element from a numerical Series gives a numpy scalar type (type(pd.Series([1.0, 2.0])[0]) == np.float64)

I agree it is a bit confusing that it depends on whether there is a string column or not. But I think the dtype of the resulting Series of float vs object makes sense.
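The scalar access described above can be checked directly. The sketch below also shows a detail that may explain why the difference often goes unnoticed for floats: np.float64 subclasses the built-in float, while np.int64 does not subclass int on Python 3. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Element access on numeric Series returns numpy scalars.
value = pd.Series([1.0, 2.0]).iloc[0]
count = pd.Series([1, 2]).iloc[0]

# np.float64 subclasses the built-in float, so this isinstance check passes...
float_is_python_float = isinstance(value, float)

# ...but numpy integer scalars do not subclass the built-in int.
int_is_python_int = isinstance(count, int)

print(type(value), float_is_python_float)
print(type(count), int_is_python_int)
```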

@TomAugspurger
Contributor

@jorisvandenbossche is the only remaining issue here documenting the behavior?

@TomAugspurger TomAugspurger modified the milestones: 1.0, Contributions Welcome Dec 30, 2019
@arw2019 arw2019 added the Docs label Nov 5, 2020
@MarioProjects

There is something strange going on here. Taking an example from documentation:

>>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])
>>> row = next(df.iterrows())[1]
>>> row
int      1.0
float    1.5
Name: 0, dtype: float64
>>> print(row['int'].dtype)
float64

But:

>>> df = pd.DataFrame([[1, 1.5, 'a']], columns=['int', 'float', 'str'])
>>> row = next(df.iterrows())[1]
>>> row
int        1
float    1.5
str        a
Name: 0, dtype: object
>>> print(row['int'].dtype)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'int' object has no attribute 'dtype'

So it seems conversion to Python types happens only if there is some object dtype present. Otherwise we get (and keep) numpy types, only upcast to a common dtype.

I found the same problem when printing a dataframe: an int column holding [1992, 1993, 1994] prints as [1992.0, 1993.0, 1994.0]. I tried

wm["Year"] = wm["Year"].astype(int)
wm.astype(int)

and nothing changed.
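One possible explanation, assuming the values were read through iterrows and that the frame also has a float column (neither is stated in the comment): each row is built as a single Series, so the int column is upcast to float64 regardless of the column's own dtype. The frame below is a hypothetical stand-in for wm, and the "Value" column is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for `wm`; the "Value" column is an assumption.
wm = pd.DataFrame({"Year": [1992, 1993, 1994], "Value": [0.5, 1.5, 2.5]})

# iterrows builds one Series per row, so int and float columns share float64.
years_from_rows = [row["Year"] for _, row in wm.iterrows()]

# itertuples reads each column separately, so the integers are not upcast.
years_from_tuples = [t.Year for t in wm.itertuples(index=False)]

print(years_from_rows)
print(years_from_tuples)
```

If that is the cause, casting the column with astype(int) cannot help, because the upcast happens when each row is assembled; iterating with itertuples or over the column itself avoids it.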

@mroeschke mroeschke added Bug and removed Docs Indexing Related to indexing on series/frames, not to indexes themselves Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels May 1, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
9 participants