BUG: update should try harder to preserve dtypes #4094

jreback · 2013-07-01T13:00:04Z

more examples in #13957

http://stackoverflow.com/questions/17398216/unwanted-type-conversion-in-pandas-dataframe-update

import numpy as np
import pandas as pd

df = pd.DataFrame({'int': [1, 2], 'float': [np.nan, np.nan]})

print('Integer column:')
print(df['int'])

for _, df_sub in df.groupby('int'):
    df_sub['float'] = float(df_sub['int'])
    df.update(df_sub)

print('NO integer column:')
print(df['int'])

gbrand-salesforce · 2017-04-27T10:08:12Z

Is this marked as fixed?

I'm using pandas 0.19.1

and I have the following problem:

a = pandas.DataFrame({'bool_column': [True, False]})
b = pandas.DataFrame({'bool_column': [False]})

Both a.bool_column and b.bool_column have dtype bool, but after
a.update(b)
the dtype of a.bool_column becomes object...
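One way to see whether a given pandas version is affected is to snapshot the dtypes before calling update() and compare them afterwards. The sketch below does that for the bool example above; changed_dtypes is a hypothetical helper written for this illustration, not part of pandas, and whether the report comes back empty depends on the installed version.

```python
import pandas as pd


def changed_dtypes(before: pd.Series, after: pd.Series) -> dict:
    """Hypothetical helper: map column -> (old dtype, new dtype) where they differ."""
    return {
        col: (before[col], after[col])
        for col in before.index
        if col in after.index and before[col] != after[col]
    }


a = pd.DataFrame({"bool_column": [True, False]})
b = pd.DataFrame({"bool_column": [False]})

before = a.dtypes.copy()          # snapshot taken before the in-place update
a.update(b)                       # row 0 is overwritten with False
report = changed_dtypes(before, a.dtypes)

# An empty dict means update() preserved every dtype on this pandas version.
print(report)
```

Printing the report rather than asserting on it is deliberate here, since the outcome is version dependent.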

jreback · 2017-04-27T10:53:01Z

it's marked as open
update could use some love if you would like to do a PR

filippchistiakov · 2018-10-18T15:17:16Z

upd

whnr · 2018-10-27T09:45:59Z

Same issue for the category dtype. For now I'm recasting the columns once all my data has been read, i.e. after the append() and update() calls.
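A minimal sketch of that recasting workaround, assuming the updated values are still valid for the original dtype (for a categorical column, that they are among its categories). update_and_restore_dtypes is a hypothetical name used only for this illustration.

```python
import pandas as pd


def update_and_restore_dtypes(df: pd.DataFrame, other: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical workaround: run update(), then cast each column back."""
    original = df.dtypes.to_dict()       # remember the dtypes before the update
    df.update(other)                     # in-place; may demote some columns
    for col, dtype in original.items():
        # Casting is a no-op when the dtype survived the update.
        df[col] = df[col].astype(dtype)
    return df


categories = ["a", "b", "c"]
left = pd.DataFrame({"grade": pd.Categorical(["a", "b", "c"], categories=categories)})
right = pd.DataFrame(
    {"grade": pd.Categorical(["c"], categories=categories)},
    index=[1],                           # only row 1 is updated
)

update_and_restore_dtypes(left, right)
print(left.dtypes)
```

The cast back will raise or introduce missing values if update() wrote something the original dtype cannot hold, so this only papers over the demotion rather than fixing it.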

ghost · 2020-02-07T00:15:57Z

This is also a problem with the new nullable integer types, and with the new string dtype, as this example shows:

In [1]: df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['d', 'e', 'f']}, dtype='string')

In [2]: df.dtypes
Out[2]:
A    string
B    string
dtype: object

In [3]: df2 =  pd.DataFrame({'A': ['a2', 'b2', 'c2'], 'B': ['d', 'e', 'f']}, dtype='string')

In [4]: df2.dtypes
Out[4]:
A    string
B    string
dtype: object

In [5]: df.update(df2)

In [6]: df.dtypes
Out[6]:
A    object
B    object
dtype: object

rpkilby · 2022-10-26T20:32:42Z

Just confirming that this seems to be an issue with all the new-style dtypes. I wrote a quick test. Given:

new_types = {
    "Int64": pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3],}, dtype="Int64"),
    "UInt32": pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3],}, dtype="UInt32"),
    "boolean": pd.DataFrame({"A": [True, False], "B": [True, False],}, dtype="boolean"),
    "string": pd.DataFrame({"A": ["1", "2", "3"], "B": ["1", "2", "3"],}, dtype="string"),
}

old_types = {
    "int64": pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3],}, dtype="int64"),
    "float": pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3],}, dtype="float"),
    "bool": pd.DataFrame({"A": [True, False], "B": [True, False],}, dtype="bool"),
}

For each dtype/df pair, more or less:

for dtype, df in new_types.items():
    df2 = df.select_dtypes(dtype)
    df.update(df2)

I then printed the dtypes of the original df, of the intermediate/selected df2 (as a sanity check that the selection itself wasn't modifying the dtypes), and of the updated df. The output was:

# New-style dtypes: -----------------------------

Source DF        |Selected         |Updated          
-------------------------------------------------
A    Int64       |A    Int64       |A    object      
B    Int64       |B    Int64       |B    object      
dtype: object    |dtype: object    |dtype: object    

Source DF        |Selected         |Updated          
-------------------------------------------------
A    UInt32      |A    UInt32      |A    object      
B    UInt32      |B    UInt32      |B    object      
dtype: object    |dtype: object    |dtype: object    

Source DF        |Selected         |Updated          
-------------------------------------------------
A    boolean     |A    boolean     |A    object      
B    boolean     |B    boolean     |B    object      
dtype: object    |dtype: object    |dtype: object    

Source DF        |Selected         |Updated          
-------------------------------------------------
A    string      |A    string      |A    object      
B    string      |B    string      |B    object      
dtype: object    |dtype: object    |dtype: object    


# Old-style dtypes: -----------------------------

Source DF        |Selected         |Updated          
-------------------------------------------------
A    int64       |A    int64       |A    int64       
B    int64       |B    int64       |B    int64       
dtype: object    |dtype: object    |dtype: object    

Source DF        |Selected         |Updated          
-------------------------------------------------
A    float64     |A    float64     |A    float64     
B    float64     |B    float64     |B    float64     
dtype: object    |dtype: object    |dtype: object    

Source DF        |Selected         |Updated          
-------------------------------------------------
A    bool        |A    bool        |A    bool        
B    bool        |B    bool        |B    bool        
dtype: object    |dtype: object    |dtype: object

You can see that old-style dtypes are preserved, but update() demotes the new-style dtypes to object.
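For anyone who wants to repeat this comparison on their own pandas version, here is a rough, self-contained sketch of the test. It is not the exact script used above: df.copy() stands in for the select_dtypes() step, only three of the dtypes are included, warnings are silenced, and dtype_report is a hypothetical helper name.

```python
import warnings

import pandas as pd

frames = {
    "Int64": pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, dtype="Int64"),
    "string": pd.DataFrame({"A": ["1", "2", "3"], "B": ["1", "2", "3"]}, dtype="string"),
    "int64": pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, dtype="int64"),
}


def dtype_report(df: pd.DataFrame):
    """Hypothetical helper: return (dtypes before, dtypes after) as plain strings."""
    before = {col: str(dtype) for col, dtype in df.dtypes.items()}
    other = df.copy()                    # assumption: stand-in for select_dtypes()
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # hide FutureWarnings from update()
        df.update(other)
    after = {col: str(dtype) for col, dtype in df.dtypes.items()}
    return before, after


reports = {name: dtype_report(df) for name, df in frames.items()}
for name, (before, after) in reports.items():
    print(f"{name}: {before} -> {after}")
```

The "after" side for the nullable dtypes is the part that differs between pandas releases.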

Additionally, this looks to be the cause of update() sometimes issuing a FutureWarning (e.g., #48853 (comment)). My best guess is that the nullable-dtype demotion to object is somehow confusing the mask check at the end of update, resulting in the loc assignment which in turn issues the FutureWarning.

pandas/pandas/core/frame.py

Lines 8213 to 8217 in 9c9789c

    
            # don't overwrite columns unnecessarily
            if mask.all():
                continue

            self.loc[:, col] = expressions.where(mask, this, that)
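To illustrate why a where-style merge can lose a nullable dtype, here is a small sketch of the same pattern outside of pandas internals. Two assumptions are made: np.where stands in for expressions.where, and mask marks the positions where the incoming value is missing (so the existing value is kept there). Neither is taken from the snippet above.

```python
import numpy as np
import pandas as pd

this = pd.array([1, 2, 3], dtype="Int64")        # existing column values
that = pd.array([10, None, 30], dtype="Int64")   # incoming values, one missing

mask = pd.isna(that)                  # assumption: True where there is nothing to copy
merged = np.where(mask, this, that)   # keep `this` where masked, else take `that`

# np.where converts both inputs to plain NumPy arrays, so the result is an
# ndarray and no longer an Int64 extension array.
print(type(merged), merged.dtype)

# Casting back recovers the nullable dtype for this data.
restored = pd.array(merged, dtype="Int64")
print(restored.dtype)
```

The exact NumPy dtype of merged varies between pandas versions, which is why only its type is printed here rather than stated.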

mroeschke · 2023-03-31T18:39:26Z

The 'int' column now stays int64 after update(), and I believe we already have a test for this, so closing.

In [5]: df = pd.DataFrame({'int': [1, 2], 'float': [np.nan, np.nan]})
   ...: 
   ...: print('Integer column:')
   ...: print(df['int'])
   ...: 
   ...: for _, df_sub in df.groupby('int'):
   ...:     df_sub['float'] = float(df_sub['int'])
   ...:     df.update(df_sub)
   ...: 
   ...: print('NO integer column:')
   ...: print(df['int'])
Integer column:
0    1
1    2
Name: int, dtype: int64
<ipython-input-5-9e5ccbb4c5fe>:7: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead
  df_sub['float'] = float(df_sub['int'])
NO integer column:
0    1
1    2
Name: int, dtype: int64
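The FutureWarning in that run comes from the reproduction script itself, not from update(). A version of the script that follows the warning's own suggestion (float(ser.iloc[0])) could look like the sketch below; the explicit .copy() on each group is an extra precaution added here, not something the original script did.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"int": [1, 2], "float": [np.nan, np.nan]})

for _, df_sub in df.groupby("int"):
    df_sub = df_sub.copy()                            # work on an independent frame
    # Each group holds one row, so take its scalar explicitly.
    df_sub["float"] = float(df_sub["int"].iloc[0])
    df.update(df_sub)

print(df.dtypes)
```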

ivanleoncz · 2023-05-21T20:10:09Z

The issue seems to be with passing a single-element Series, as the warning states.

Selecting the only element of the Series first, however, doesn't produce any warning in pandas:

In [182]: df.tail(5)                                                                        
Out[182]: 
        ID   Age  Gender  Height  Weight    BMI        Label
103  106.0  11.0    Male   175.0    10.0    3.9  Underweight
104  107.0  16.0  Female   160.0    10.0    3.9  Underweight
105  108.0  21.0    Male   180.0    15.0    5.6  Underweight
106  109.0  26.0  Female   150.0    15.0    5.6  Underweight
107  110.0  31.0    Male   190.0    20.0  200.0  Underweight

In [183]: df.at[107, "BMI"] = df["BMI"].mode()                                              
<ipython-input-183-148156c2d3af>:1: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead
  df.at[107, "BMI"] = df["BMI"].mode()

In [184]: df["BMI"].mode()                                                                  
Out[184]: 
0    16.7
Name: BMI, dtype: float64

In [185]: type(df["BMI"].mode())                                                            
Out[185]: pandas.core.series.Series

In [186]: df.at[107, "BMI"] = df["BMI"].mode()[0]                                           

In [187]:

jreback modified the milestones: 0.15.0, 0.14.0 Feb 18, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 1, 2015

jreback mentioned this issue Aug 10, 2016

DataFrame.update() changes type of boolean column to Object. #13957

Closed

jreback added Difficulty Intermediate labels Aug 10, 2016

jreback mentioned this issue Aug 11, 2016

BUG/DEPR: combine dtype fixes #13970

Closed

5 tasks

birdcolour mentioned this issue Feb 26, 2018

DataFrame.update silently does nothing when indices are of differing type #19905

Open

datapythonista modified the milestones: Contributions Welcome, Someday Jul 8, 2018

jbrockmendel removed Effort Medium labels Oct 21, 2019

minouHub mentioned this issue Feb 4, 2020

bug in pandas conversion to np.int64 radis/radis#65

Closed

4 tasks

Horstage mentioned this issue Mar 4, 2021

BUG: update should try harder to preserve dtypes #4094 #40219

Closed

4 tasks

mzeitlin11 mentioned this issue Aug 4, 2021

BUG: DataFrame.update changes type of updated values #42891

Closed

3 tasks

mroeschke removed this from the Someday milestone Oct 13, 2022

rpkilby mentioned this issue Oct 26, 2022

BUG: DataFrame.update do not preserve string dtype #44104

Open

3 tasks

chris-langfield mentioned this issue Jan 3, 2023

Fix Simulation.filter_indices bug ComputationalCryoEM/ASPIRE-Python#816

Closed

mroeschke closed this as completed Mar 31, 2023

aaron-robeson-8451 mentioned this issue Oct 13, 2023

BUG: DataFrame.update doesn't preserve dtypes #55509

Closed

3 tasks
