[BUG] Adding a string column to an empty dataframe changes all column dtypes to float64 #1172

beckernick · 2019-03-12T17:22:52Z

Describe the bug
Adding a column to an empty dataframe changes all column dtypes to float64.

Steps/Code to reproduce bug

import pandas as pd
import cudf

cols = ['a', 'b', 'c']
df = pd.DataFrame(columns=cols, dtype='int')

gdf = cudf.from_pandas(df)
print(gdf.dtypes)

gdf['a'] = [1,2,]
print(gdf.dtypes)
a    int64
b    int64
c    int64
dtype: object
a      int64
b    float64
c    float64
dtype: object

Expected behavior
I expect the columns of an empty dataframe that were created with specific dtypes to keep those dtypes when values are first assigned to one of the columns.

Environment details (please complete the following information):

***git***
commit b1f70a81f4597e4d6da475382f4d67fac2b27c55 (HEAD, kkraus14/fea-ext-string-support)
Merge: e39ab4d e7fdd4a
Author: Keith Kraus <keith.j.kraus@gmail.com>
Date:   Sun Mar 10 19:59:39 2019 -0400

    Merge branch 'branch-0.6' into fea-ext-string-support

***OS Information***
DGX_NAME="DGX Server"
DGX_PRETTY_NAME="NVIDIA DGX Server"
DGX_SWBUILD_DATE="2018-01-10"
DGX_SWBUILD_VERSION="3.1.4"
DGX_COMMIT_ID="660a5f359205297159909ff1631b15af9ecc3aef"
DGX_SERIAL_NUMBER=QTFCOU6430065-R1

DGX_OTA_VERSION="3.1.4"
DGX_OTA_DATE="Fri Feb  9 11:04:37 PST 2018"

DGX_OTA_VERSION="3.1.7"
DGX_OTA_DATE="Tue Jun 12 14:31:36 PDT 2018"
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"
NAME="Ubuntu"
VERSION="16.04.5 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.5 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
Linux dgx03 4.4.0-134-generic #160-Ubuntu SMP Wed Aug 15 14:58:00 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

***GPU Information***
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

***CPU***
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                80
On-line CPU(s) list:   0-79
Thread(s) per core:    2
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Stepping:              1
CPU MHz:               2706.945
CPU max MHz:           3600.0000
CPU min MHz:           1200.0000
BogoMIPS:              4392.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              51200K
NUMA node0 CPU(s):     0-19,40-59
NUMA node1 CPU(s):     20-39,60-79
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt ssbd ibrs ibpb stibp kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts flush_l1d

***CMake***
/usr/local/bin/cmake
cmake version 3.12.0

CMake suite maintained and supported by Kitware (kitware.com/cmake).

***g++***
/usr/bin/g++
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


***nvcc***
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Tue_Jun_12_23:07:04_CDT_2018
Cuda compilation tools, release 9.2, V9.2.148

***Python***
/usr/bin/python
Python 2.7.12

***Environment Variables***
PATH                            : /home/nfs/nicholasb/bin:/home/nfs/nicholasb/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/cuda/bin
LD_LIBRARY_PATH                 : 
NUMBAPRO_NVVM                   : 
NUMBAPRO_LIBDEVICE              : 
CONDA_PREFIX                    : 
PYTHON_PATH                     : 

conda not found
***pip packages***
/usr/local/bin/pip
Package                 Version               
----------------------- ----------------------
cached-property         1.4.3                 
certifi                 2018.4.16             
chardet                 3.0.4                 
cli-helpers             1.0.2                 
click                   6.7                   
cmake                   3.12.0                
command-not-found       0.3                   
configmanager           1.34.0                
configparser            3.5.0                 
docker                  3.5.0                 
docker-compose          1.22.0                
docker-pycreds          0.3.0                 
dockerpty               0.4.1                 
docopt                  0.6.2                 
ecdsa                   0.13                  
fail2ban                0.9.3                 
funcsigs                1.0.2                 
future                  0.16.0                
graphistry              2.12.0                
hookery                 1.4.0                 
humanize                0.5.1                 
idna                    2.6                   
Jinja2                  2.10                  
jsonschema              2.6.0                 
language-selector       0.1                   
MarkupSafe              1.0                   
pip                     18.0                  
pycurl                  7.43.0                
Pygments                2.2.0                 
pygobject               3.20.0                
python-apt              1.1.0b1+ubuntu0.16.4.2
python-debian           0.1.27                
PyYAML                  3.13                  
requests                2.18.4                
screen-resolution-extra 0.0.0                 
setuptools              40.0.0                
six                     1.10.0                
tabulate                0.8.2                 
terminaltables          3.1.0                 
texttable               0.9.1                 
unattended-upgrades     0.1                   
urllib3                 1.22                  
wcwidth                 0.1.7                 
websocket-client        0.52.0                
wheel                   0.31.1                
xkit                    0.0.0                 
You are using pip version 18.0, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


kayush2O6 · 2019-03-18T15:16:32Z

@beckernick I was looking at this issue and realised that pandas behaves the same way as cudf here. Here is a minimal repro:

In [1]: import pandas as pd
   ...: cols = ['a', 'b', 'c']
   ...: df = pd.DataFrame(columns=cols, dtype='int')
   ...: df['a']=[1,2,]
   ...: df.dtypes
Out[1]:
a      int64
b    float64
c    float64
dtype: object

I just want to know: are we going to differ from pandas behaviour in this scenario?

beckernick · 2019-03-18T17:16:00Z

This is interesting @AK-ayush. cc @kkraus14 for visibility.

We appear to be consistent with pandas 0.24 for numeric types, but not for string (object) type columns (examples below). From a quick glance it seems like this is due to the handling of NaNs. It looks like pandas does not coerce types to float64 when introducing NaNs into object-typed columns, but does when introducing NaNs into int-typed columns. Pandas did introduce support for missing values in integer columns in 0.24 through its nullable integer dtype, though.
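To make the NaN handling described above concrete, here is a small pandas-only sketch. It uses `reindex` to introduce missing values, which is an illustrative stand-in chosen here and not necessarily the code path that column assignment takes internally. The `Int64` dtype is the nullable integer dtype added in pandas 0.24.

```python
import pandas as pd

# Reindexing to a longer index introduces one missing value per Series.
new_index = [0, 1, 2]

int_s = pd.Series([1, 2], dtype="int64").reindex(new_index)
obj_s = pd.Series(["x", "y"], dtype="object").reindex(new_index)
nullable_s = pd.Series([1, 2], dtype="Int64").reindex(new_index)

print(int_s.dtype)       # float64: int64 cannot hold NaN, so it is upcast
print(obj_s.dtype)       # object: object columns can hold NaN as-is
print(nullable_s.dtype)  # Int64: the nullable integer dtype keeps its type
```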

I'm inclined to maintain consistency with pandas. I think the key inconsistency is that object columns are currently coerced to float64 in cuDF, and I'll update this issue to reflect that. On a related note, I'm going to learn more about the reasoning behind Dask concatenation requiring type consistency across series in the list of dataframes. We may discuss that in the future. Pandas allows us to concatenate columns with differing types, hiding the type coercion from users (which has pros and cons).

import pandas as pd
df1 = pd.DataFrame({'a':[1,2,], 'b':[2.,6.]})
df2 = pd.DataFrame({'a':['a', 'b'], 'b':[2.,6.]})
pd.concat([df1, df2]).dtypes
a     object
b    float64
dtype: object
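One way to see where such implicit coercion would occur before concatenating is to compare dtypes column by column. This is a minimal sketch, assuming a hypothetical helper named `find_dtype_mismatches`; it is not part of pandas, cuDF, or Dask, and it only inspects the `.dtypes` attribute of each frame.

```python
import pandas as pd


def find_dtype_mismatches(frames):
    """Return {column: set of dtype names} for columns whose dtype differs across frames.

    Hypothetical helper for illustration only.
    """
    seen = {}
    for frame in frames:
        for name, dtype in frame.dtypes.items():
            seen.setdefault(name, set()).add(str(dtype))
    # Keep only the columns that appear with more than one dtype.
    return {name: dtypes for name, dtypes in seen.items() if len(dtypes) > 1}


df1 = pd.DataFrame({'a': [1, 2], 'b': [2., 6.]})
df2 = pd.DataFrame({'a': ['a', 'b'], 'b': [2., 6.]})

mismatches = find_dtype_mismatches([df1, df2])
print(mismatches)  # only column 'a' is reported; 'b' is float64 in both frames
```

A check like this makes the coercion explicit to the caller, which is the trade-off mentioned above: pandas silently picks a common dtype, while a stricter library can refuse or warn.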

Pandas behavior: adding an object column to an empty object dataframe

cols = ['a', 'b', 'c']
df = pd.DataFrame(columns=cols, dtype='str')
print(df.dtypes)
df['a'] = ['x', 'y']
print(df.dtypes)
a    object
b    object
c    object
dtype: object
a    object
b    object
c    object
dtype: object

Pandas behavior: adding an int column to an empty int dataframe

cols = ['a', 'b', 'c']
df = pd.DataFrame(columns=cols, dtype='int')
print(df.dtypes)
df['a'] = [1,2,]
print(df.dtypes)
a    int64
b    int64
c    int64
dtype: object
a      int64
b    float64
c    float64
dtype: object

Pandas behavior: adding an int column to an empty object dataframe

cols = ['a', 'b', 'c']
df = pd.DataFrame(columns=cols, dtype='str')
print(df.dtypes)
df['a'] = [1, 2]
print(df.dtypes)
a    object
b    object
c    object
dtype: object
a     int64
b    object
c    object
dtype: object

cuDF behavior: adding an int column to an empty object dataframe

import pandas as pd
import cudf

cols = ['a', 'b', 'c']
data = {k:v for (k,v) in zip(cols, [['a'] for x in cols])}

gdf = cudf.DataFrame(data)
gdf = gdf[:0]
print(gdf.dtypes)
gdf['a'] = [1,]
print(gdf.dtypes)
a    object
b    object
c    object
dtype: object
a      int64
b    float64
c    float64
dtype: object
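The comparisons above can be condensed into a small dtype check. This is a sketch only: `changed_dtypes` is a hypothetical helper written for this illustration, and it is demonstrated with pandas. Because it relies only on `.dtypes`, it should in principle apply to a cuDF dataframe as well, but that is an assumption and is not exercised here.

```python
import pandas as pd


def changed_dtypes(before, after, assigned):
    """Map each column other than `assigned` whose dtype changed to (old, new) dtype names.

    `before` and `after` are the `.dtypes` of a dataframe taken before and
    after the assignment. Hypothetical helper for illustration only.
    """
    return {
        name: (str(before[name]), str(after[name]))
        for name in before.index
        if name != assigned and before[name] != after[name]
    }


cols = ['a', 'b', 'c']

# Empty object dataframe: the untouched columns are expected to stay object.
obj_df = pd.DataFrame(columns=cols, dtype=object)
obj_before = obj_df.dtypes
obj_df['a'] = [1, 2]
obj_changes = changed_dtypes(obj_before, obj_df.dtypes, 'a')
print(obj_changes)

# Empty int dataframe: pandas upcasts the untouched columns when NaNs appear.
int_df = pd.DataFrame(columns=cols, dtype='int64')
int_before = int_df.dtypes
int_df['a'] = [1, 2]
int_changes = changed_dtypes(int_before, int_df.dtypes, 'a')
print(int_changes)
```

An empty result for the object case and a non-empty one for the int case matches the pandas outputs shown earlier in this comment; the bug reported here is that cuDF produced float64 in the object case too.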

We also don't yet maintain object type consistency when using from_pandas in the same way we do with ints and floats, which I'll open a separate ticket for. As an example:

import pandas as pd
import cudf

cols = ['a', 'b', 'c']
df = pd.DataFrame(columns=cols, dtype='str')

gdf = cudf.from_pandas(df)
print(gdf.dtypes)

a    float64
b    float64
c    float64
dtype: object

import pandas as pd
import cudf

cols = ['a', 'b', 'c']
df = pd.DataFrame(columns=cols, dtype='int32')

gdf = cudf.from_pandas(df)
print(gdf.dtypes)
a    int32
b    int32
c    int32
dtype: object

kayush2O6 · 2019-03-19T11:18:37Z

I have created a PR that addresses the issue of adding a column to a str dataframe, but I was trying to fix the from_pandas() case as well. I made some changes to columnsops.py and that worked, but when I add any column to the resulting gdf, it gives a segmentation fault.

[REVIEW]Fix dtypes issue #1172 while adding a col to empty object dataframe

beckernick · 2019-04-03T15:22:39Z

This is resolved by #1233 . Closing.

beckernick added the Needs Triage (Need team to review and classify) and bug (Something isn't working) labels Mar 12, 2019

kkraus14 added the Python (Affects Python cuDF API.) label and removed the Needs Triage (Need team to review and classify) label Mar 13, 2019

beckernick changed the title ~~[BUG] Adding a column to an empty dataframe changes all column dtypes to float64~~ [BUG] Adding a string column to an empty dataframe changes all column dtypes to float64 Mar 18, 2019

kayush2O6 mentioned this issue Mar 19, 2019

[REVIEW]Fix dtypes issue #1172 while adding a col to empty object dataframe #1233

Merged

kkraus14 added a commit that referenced this issue Mar 20, 2019

Merge pull request #1233 from AK-ayush/fix-1172

5ec92d1

[REVIEW]Fix dtypes issue #1172 while adding a col to empty object dataframe

beckernick closed this as completed Apr 3, 2019
