BUG: update should try harder to preserve dtypes #4094

Closed
jreback opened this issue Jul 1, 2013 · 8 comments
Labels
Bug, Dtype Conversions (unexpected or buggy dtype conversions)

Comments

@jreback
Contributor

jreback commented Jul 1, 2013

more examples in #13957

http://stackoverflow.com/questions/17398216/unwanted-type-conversion-in-pandas-dataframe-update

import numpy as np
import pandas as pd

df = pd.DataFrame({'int': [1, 2], 'float': [np.nan, np.nan]})

print('Integer column:')
print(df['int'])

for _, df_sub in df.groupby('int'):
    df_sub['float'] = float(df_sub['int'])
    df.update(df_sub)

# after update(), the 'int' column no longer has an integer dtype
print('NO integer column:')
print(df['int'])
@gbrand-salesforce

Is this marked as fixed?

I'm using pandas 0.19.1

and I have the following problem:

a = pandas.DataFrame({'bool_column': [True, False]})
b = pandas.DataFrame({'bool_column': [False]})

Both a.bool_column and b.bool_column have dtype bool, but after
a.update(b)
the dtype of a.bool_column becomes object.
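
One plausible explanation can be sketched as follows. It assumes that update() first aligns the other frame to the caller's index, which the `reindex` call below only imitates; the name `aligned` is illustrative. Because b has no row for index 1, alignment introduces a missing value, and a bool column cannot hold NaN without being upcast to object.

```python
import pandas as pd

a = pd.DataFrame({'bool_column': [True, False]})
b = pd.DataFrame({'bool_column': [False]})

# Imitate the alignment step (an assumption about what update() does internally):
# b only has index 0, so index 1 becomes NaN after reindexing.
aligned = b.reindex(a.index)

# A NumPy bool column cannot represent NaN, so the aligned column is upcast.
aligned_dtype = aligned['bool_column'].dtype
print(aligned_dtype)
```

If this is what happens, the object dtype comes from the missing row rather than from the values being written.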

@jreback
Contributor Author

jreback commented Apr 27, 2017

it's marked as open
update could use some love if you would like to do a PR

@filippchistiakov

upd

@whnr

whnr commented Oct 27, 2018

Same issue for the category dtype. As a workaround, I now recast my columns after reading all my data and calling append() and update().
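
The recasting workaround might look roughly like the sketch below. `update_preserving_dtypes` is a hypothetical helper, not part of pandas, and the column name `grade` and its categories are made up for illustration. It assumes every updated value can be represented in the original dtype; a category not present in the original would become NaN on the recast.

```python
import pandas as pd


def update_preserving_dtypes(df: pd.DataFrame, other: pd.DataFrame) -> None:
    """Hypothetical helper: run df.update(other), then recast changed columns."""
    original = df.dtypes.copy()
    df.update(other)
    for col, dtype in original.items():
        # Only recast columns whose dtype was actually changed by update().
        if df[col].dtype != dtype:
            df[col] = df[col].astype(dtype)


categories = ["a", "b", "c"]
df = pd.DataFrame({"grade": pd.Categorical(["a", "b", "a"], categories=categories)})
other = pd.DataFrame({"grade": pd.Categorical(["c"], categories=categories)}, index=[1])

update_preserving_dtypes(df, other)
result_dtype = df["grade"].dtype
result_values = df["grade"].tolist()
print(result_dtype, result_values)
```

On pandas versions where update() already keeps the dtype, the loop simply does nothing.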

@ghost

ghost commented Feb 7, 2020

This is also a problem with the new nullable dtypes, shown here with the string dtype:

In [1]: df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['d', 'e', 'f']}, dtype='string')

In [2]: df.dtypes
Out[2]:
A    string
B    string
dtype: object

In [3]: df2 =  pd.DataFrame({'A': ['a2', 'b2', 'c2'], 'B': ['d', 'e', 'f']}, dtype='string')

In [4]: df2.dtypes
Out[4]:
A    string
B    string
dtype: object

In [5]: df.update(df2)

In [6]: df.dtypes
Out[6]:
A    object
B    object
dtype: object

@rpkilby

rpkilby commented Oct 26, 2022

Just confirming that this seems to be an issue with all the new-style dtypes. I wrote a quick test. Given:

new_types = {
    "Int64": pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3],}, dtype="Int64"),
    "UInt32": pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3],}, dtype="UInt32"),
    "boolean": pd.DataFrame({"A": [True, False], "B": [True, False],}, dtype="boolean"),
    "string": pd.DataFrame({"A": ["1", "2", "3"], "B": ["1", "2", "3"],}, dtype="string"),
}

old_types = {
    "int64": pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3],}, dtype="int64"),
    "float": pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3],}, dtype="float"),
    "bool": pd.DataFrame({"A": [True, False], "B": [True, False],}, dtype="bool"),
}

For each dtype/df pair, more or less:

for dtype, df in new_types.items():
    df2 = df.select_dtypes(dtype)
    df.update(df2)

I then printed the dtypes of the original df, the intermediate/selected df2 (as a sanity check to ensure it wasn't responsible for modifying the dtypes), and the updated df. The output was:

# New-style dtypes: -----------------------------

Source DF        |Selected         |Updated          
-------------------------------------------------
A    Int64       |A    Int64       |A    object      
B    Int64       |B    Int64       |B    object      
dtype: object    |dtype: object    |dtype: object    

Source DF        |Selected         |Updated          
-------------------------------------------------
A    UInt32      |A    UInt32      |A    object      
B    UInt32      |B    UInt32      |B    object      
dtype: object    |dtype: object    |dtype: object    

Source DF        |Selected         |Updated          
-------------------------------------------------
A    boolean     |A    boolean     |A    object      
B    boolean     |B    boolean     |B    object      
dtype: object    |dtype: object    |dtype: object    

Source DF        |Selected         |Updated          
-------------------------------------------------
A    string      |A    string      |A    object      
B    string      |B    string      |B    object      
dtype: object    |dtype: object    |dtype: object    


# Old-style dtypes: -----------------------------

Source DF        |Selected         |Updated          
-------------------------------------------------
A    int64       |A    int64       |A    int64       
B    int64       |B    int64       |B    int64       
dtype: object    |dtype: object    |dtype: object    

Source DF        |Selected         |Updated          
-------------------------------------------------
A    float64     |A    float64     |A    float64     
B    float64     |B    float64     |B    float64     
dtype: object    |dtype: object    |dtype: object    

Source DF        |Selected         |Updated          
-------------------------------------------------
A    bool        |A    bool        |A    bool        
B    bool        |B    bool        |B    bool        
dtype: object    |dtype: object    |dtype: object     

You can see that old-style dtypes are preserved, but update() demotes the new-style dtypes to object.
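
A self-contained variant of this check could be sketched as below. It is not the exact script used above: it updates each frame with a plain copy of itself instead of going through select_dtypes, and the names `frames` and `report` are illustrative. Whether the nullable dtypes survive depends on the installed pandas version, so the sketch only records what happens.

```python
import pandas as pd

frames = {
    "Int64": pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, dtype="Int64"),
    "boolean": pd.DataFrame({"A": [True, False], "B": [True, False]}, dtype="boolean"),
    "string": pd.DataFrame({"A": ["1", "2", "3"], "B": ["1", "2", "3"]}, dtype="string"),
    "int64": pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, dtype="int64"),
}

report = {}
for name, df in frames.items():
    before = [str(t) for t in df.dtypes]
    other = df.copy()  # same values and dtypes, so update() changes no data
    df.update(other)
    after = [str(t) for t in df.dtypes]
    report[name] = {"before": before, "after": after, "preserved": before == after}

for name, row in report.items():
    print(f"{name:8} {row['before']} -> {row['after']} preserved={row['preserved']}")
```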


Additionally, this looks to be the cause of update() sometimes issuing a FutureWarning (e.g., #48853 (comment)). My best guess is that the nullable-dtype demotion to object is somehow confusing the mask check at the end of update, resulting in the loc assignment which in turn issues the FutureWarning.

pandas/pandas/core/frame.py

Lines 8213 to 8217 in 9c9789c

            # don't overwrite columns unnecessarily
            if mask.all():
                continue

            self.loc[:, col] = expressions.where(mask, this, that)
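
To illustrate how an element-wise selection like this can lose a dtype, here is a minimal sketch that uses plain `np.where` as a stand-in for the pandas-internal `expressions.where` (that substitution is an assumption, as are the sample arrays). When one input is an object array, NumPy's result is an object array too.

```python
import numpy as np
import pandas as pd

this = np.array([1, 2, 3])                          # existing column values (int64)
that = np.array([10, np.nan, 30], dtype=object)     # aligned new values, one missing

# Keep the existing value wherever the new value is missing.
mask = pd.isna(that)

# Stand-in for expressions.where(mask, this, that).
result = np.where(mask, this, that)
print(result.dtype, list(result))
```

The selected values are all integers, yet the result carries the object dtype of the `that` input.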

@mroeschke
Member

The column stays int64 now, and I believe we already have a test for this, so closing.

In [5]: df = pd.DataFrame({'int': [1, 2], 'float': [np.nan, np.nan]})
   ...: 
   ...: print('Integer column:')
   ...: print(df['int'])
   ...: 
   ...: for _, df_sub in df.groupby('int'):
   ...:     df_sub['float'] = float(df_sub['int'])
   ...:     df.update(df_sub)
   ...: 
   ...: print('NO integer column:')
   ...: print(df['int'])
Integer column:
0    1
1    2
Name: int, dtype: int64
<ipython-input-5-9e5ccbb4c5fe>:7: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead
  df_sub['float'] = float(df_sub['int'])
NO integer column:
0    1
1    2
Name: int, dtype: int64
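
For reference, the same reproduction can be written without triggering the FutureWarning by following the warning's own suggestion. This is only a sketch; the explicit `.copy()` is an added precaution and not part of the original report.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'int': [1, 2], 'float': [np.nan, np.nan]})

for _, df_sub in df.groupby('int'):
    # Work on an explicit copy so the assignment never targets a view of df.
    df_sub = df_sub.copy()
    # Extract the scalar explicitly instead of calling float() on a Series.
    df_sub['float'] = float(df_sub['int'].iloc[0])
    df.update(df_sub)

print(df.dtypes)
```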

@ivanleoncz

ivanleoncz commented May 21, 2023

The issue seems to be with passing a single-element Series, as the warning states.

Indexing the only element of the Series, however, does not raise any warning:

In [182]: df.tail(5)                                                                        
Out[182]: 
        ID   Age  Gender  Height  Weight    BMI        Label
103  106.0  11.0    Male   175.0    10.0    3.9  Underweight
104  107.0  16.0  Female   160.0    10.0    3.9  Underweight
105  108.0  21.0    Male   180.0    15.0    5.6  Underweight
106  109.0  26.0  Female   150.0    15.0    5.6  Underweight
107  110.0  31.0    Male   190.0    20.0  200.0  Underweight

In [183]: df.at[107, "BMI"] = df["BMI"].mode()
<ipython-input-183-148156c2d3af>:1: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead
  df.at[107, "BMI"] = df["BMI"].mode()

In [184]: df["BMI"].mode()                                                                  
Out[184]: 
0    16.7
Name: BMI, dtype: float64

In [185]: type(df["BMI"].mode())                                                            
Out[185]: pandas.core.series.Series

In [186]: df.at[107, "BMI"] = df["BMI"].mode()[0]                                           

In [187]:                                                                                   
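
A small standalone sketch of the same idea, using made-up BMI values: mode() returns a Series even when there is only one mode, so taking the first entry positionally with `.iloc[0]` yields a scalar that can be assigned to a single cell.

```python
import pandas as pd

df = pd.DataFrame({"BMI": [16.7, 16.7, 3.9, 200.0]})

# mode() always returns a Series, even when there is a single mode.
modes = df["BMI"].mode()

# Take the first mode as a scalar before assigning to a single cell.
df.at[3, "BMI"] = modes.iloc[0]
print(df)
```

Using `.iloc[0]` rather than `[0]` keeps the lookup positional, which matters if the Series index does not start at 0.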
