[FEA] Readers report which specified types are unsupported #4957

Closed
wbadart opened this issue Apr 20, 2020 · 17 comments · Fixed by #6720
Labels
bug Something isn't working cuDF (Python) Affects Python cuDF API. cuIO cuIO issue doc Documentation feature request New feature or request

Comments

@wbadart
Contributor

wbadart commented Apr 20, 2020

Is your feature request related to a problem? Please describe.
Sometimes cudf.read_csv fails with

RuntimeError: cuDF failure at: /conda/conda-bld/libcudf_1581433420693/work/cpp/src/io/csv/legacy/csv_reader_impl.cu

when given the dtype=MY_TYPES argument. For example,

from io import StringIO
import cudf
import numpy as np

my_types = {
   'frame_time': str,
   'frame_number': int,
   'ip_src': str,
   'tcp_srcport': np.int32,
   'ip_dst': str,
   'tcp_dstport': np.int32,
   'frame_len': int,
   'tcp_flags_syn': bool,
   'tcp_flags_fin': bool,
}

s = StringIO("""
    "Jul  3, 2017 11:55:58.598308000 UTC","1","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:55:58.598312000 UTC","2","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:55:58.598313000 UTC","3","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:55:58.598314000 UTC","4","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:55:58.598315000 UTC","5","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:55:58.598316000 UTC","6","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:55:58.598317000 UTC","7","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:55:58.598318000 UTC","8","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:56:22.331018000 UTC","20","8.253.185.121","80","192.168.10.14","49486","60","0","1"
    "Jul  3, 2017 11:56:22.331021000 UTC","21","8.253.185.121","80","192.168.10.14","49486","60","0","1"
""")

print(cudf.read_csv(s, header=None, names=list(my_types.keys()), dtype=my_types).dtypes)

gives

Traceback (most recent call last):
  File "test.py", line 31, in <module>
    print(cudf.read_csv(s, header=None, names=list(dtypes.keys()), dtype=dtypes).dtypes)
  File "/home/wbadar/workspace/.miniconda3/envs/rapids14/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/home/wbadar/workspace/.miniconda3/envs/rapids14/lib/python3.7/site-packages/cudf/io/csv.py", line 84, in read_csv
    index_col=index_col,
  File "cudf/_lib/legacy/csv.pyx", line 37, in cudf._lib.legacy.csv.read_csv
  File "cudf/_lib/legacy/csv.pyx", line 227, in cudf._lib.legacy.csv.read_csv
RuntimeError: cuDF failure at: /conda/conda-bld/libcudf_1587234373268/work/cpp/src/io/csv/legacy/csv_reader_impl.cu:663: Unsupported data type

While swapping in pandas gives:

frame_time       object
frame_number      int64
ip_src           object
tcp_srcport       int32
ip_dst           object
tcp_dstport       int32
frame_len         int64
tcp_flags_syn    object
tcp_flags_fin    object
dtype: object

(I do wonder if this particular example is hitting a bug, or a problem in my data even; are any of bool, int64, int32 and str actually unsupported?)

Describe the solution you'd like
If it's possible, it would be nice to know which type in MY_TYPES is unsupported. Can

CUDF_EXPECTS(dtypes.back().id() != cudf::type_id::EMPTY,
"Unsupported data type");

and

CUDF_EXPECTS(dtypes.back().id() != cudf::type_id::EMPTY,
"Unsupported data type");

be extended to support this?

(And I guess also https://github.com/rapidsai/cudf/blob/branch-0.14/cpp/src/io/csv/legacy/csv_reader_impl.cu#L624 and https://github.com/rapidsai/cudf/blob/branch-0.14/cpp/src/io/csv/legacy/csv_reader_impl.cu#L638. There might be more spots; this is just what I surfaced with some quick grepping around.)
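To make the request concrete, here is a minimal Python-side sketch of the kind of per-column reporting being asked for. The helper names (`find_unsupported`, `check_dtypes`) and the `SUPPORTED` set are hypothetical and exist only for illustration; they are not part of cuDF, and the set is not cuDF's actual list of accepted type names.

```python
def find_unsupported(dtype_map, supported_names):
    """Return {column: dtype} for every entry that is not a supported name."""
    bad = {}
    for column, dtype in dtype_map.items():
        # Only plain strings found in the supported set are accepted here.
        if not (isinstance(dtype, str) and dtype in supported_names):
            bad[column] = dtype
    return bad


def check_dtypes(dtype_map, supported_names):
    """Raise an error that names each offending column and its dtype."""
    bad = find_unsupported(dtype_map, supported_names)
    if bad:
        details = ", ".join(f"{col!r}: {dt!r}" for col, dt in bad.items())
        raise TypeError(f"Unsupported data type for column(s): {details}")


# Hypothetical set of accepted names, for illustration only.
SUPPORTED = {"str", "int", "int32", "int64", "bool", "float64"}

example = {"frame_time": "str", "frame_number": int, "tcp_flags_syn": "bool"}
unsupported = find_unsupported(example, SUPPORTED)
print(unsupported)
```

An error message built this way would point at the offending column directly instead of only at a source line in the reader.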

Describe alternatives you've considered
One alternative would be to simply document supported dtypes. If this exists already, I apologize for not finding it (though if this is the case, could we perhaps link or otherwise include the list in the read_csv documentation?).

Additional context

`conda env export` for the above example
name: rapids14
channels:
  - rapidsai-nightly
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=1_llvm
  - aiohttp=3.6.2=py37h516909a_0
  - appdirs=1.4.3=py_1
  - arrow-cpp=0.15.0=py37h090bef1_2
  - async-timeout=3.0.1=py_1000
  - attrs=19.3.0=py_0
  - backcall=0.1.0=py_0
  - bleach=3.1.4=pyh9f0ad1d_0
  - bokeh=1.4.0=py37hc8dfbb8_1
  - boost=1.70.0=py37h9de70de_1
  - boost-cpp=1.70.0=h8e57a91_2
  - brotli=1.0.7=he1b5a44_1001
  - brotlipy=0.7.0=py37h8f50634_1000
  - bzip2=1.0.8=h516909a_2
  - c-ares=1.15.0=h516909a_1001
  - ca-certificates=2020.4.5.1=hecc5488_0
  - cairo=1.16.0=hcf35c78_1003
  - certifi=2020.4.5.1=py37hc8dfbb8_0
  - cffi=1.14.0=py37hd463f26_0
  - cfitsio=3.470=hb60a0a2_2
  - chardet=3.0.4=py37hc8dfbb8_1006
  - click=7.1.1=pyh8c360ce_0
  - click-plugins=1.1.1=py_0
  - cligj=0.5.0=py_0
  - cloudpickle=1.3.0=py_0
  - colorcet=2.0.1=py_0
  - cryptography=2.8=py37hb09aad4_2
  - cudatoolkit=10.1.243=h6bb024c_0
  - cudf=0.14.0a200418=py37_3339
  - cudnn=7.6.0=cuda10.1_0
  - cugraph=0.14.0a200418=py37_299
  - cuml=0.14.0a200418=cuda10.1_py37_1429
  - cupy=7.3.0=py37h0632833_0
  - curl=7.69.1=h33f0ec9_0
  - cusignal=0.14.0a200418=py37_179
  - cuspatial=0.14.0a200418=py37_169
  - cuxfilter=0.14.0a200418=py37_54
  - cycler=0.10.0=py_2
  - cytoolz=0.10.1=py37h516909a_0
  - dask=2.14.0=py_0
  - dask-core=2.14.0=py_0
  - dask-cuda=0.14.0a200418=py37_43
  - dask-cudf=0.14.0a200418=py37_3339
  - dask-xgboost=0.2.0.dev28=cuda10.1py36_0
  - datashader=0.10.0=py_0
  - datashape=0.5.4=py_1
  - decorator=4.4.2=py_0
  - defusedxml=0.6.0=py_0
  - distributed=2.14.0=py37hc8dfbb8_0
  - dlpack=0.2=he1b5a44_1
  - double-conversion=3.1.5=he1b5a44_2
  - entrypoints=0.3=py37hc8dfbb8_1001
  - expat=2.2.9=he1b5a44_2
  - fastavro=0.23.1=py37h8f50634_0
  - fastrlock=0.4=py37h3340039_1001
  - fiona=1.8.9.post2=py37hdff7cfa_0
  - fontconfig=2.13.1=h86ecdb6_1001
  - freetype=2.10.1=he06d7ca_0
  - freexl=1.0.5=h14c3975_1002
  - fsspec=0.7.2=py_0
  - gdal=2.4.4=py37h5f563d9_0
  - geopandas=0.7.0=py_1
  - geos=3.8.0=he1b5a44_1
  - geotiff=1.5.1=h38872f0_8
  - gettext=0.19.8.1=hc5be6a0_1002
  - gflags=2.2.2=he1b5a44_1002
  - giflib=5.1.7=h516909a_1
  - glib=2.64.2=h6f030ca_0
  - glog=0.4.0=h49b9bf7_3
  - grpc-cpp=1.23.0=h18db393_0
  - hdf4=4.2.13=hf30be14_1003
  - hdf5=1.10.5=nompi_h3c11f04_1104
  - heapdict=1.0.1=py_0
  - icu=64.2=he1b5a44_1
  - idna=2.9=py_1
  - imageio=2.8.0=py_0
  - importlib-metadata=1.6.0=py37hc8dfbb8_0
  - importlib_metadata=1.6.0=0
  - ipykernel=5.2.0=py37h43977f1_1
  - ipython=7.13.0=py37hc8dfbb8_2
  - ipython_genutils=0.2.0=py_1
  - jedi=0.17.0=py37hc8dfbb8_0
  - jinja2=2.11.2=pyh9f0ad1d_0
  - joblib=0.14.1=py_0
  - jpeg=9c=h14c3975_1001
  - json-c=0.13.1=h14c3975_1001
  - jsonschema=3.2.0=py37hc8dfbb8_1
  - jupyter-server-proxy=1.3.2=py_0
  - jupyter_client=6.1.3=py_0
  - jupyter_core=4.6.3=py37hc8dfbb8_1
  - kealib=1.4.13=hec59c27_0
  - kiwisolver=1.2.0=py37h99015e2_0
  - krb5=1.17.1=h2fd8d38_0
  - ld_impl_linux-64=2.34=h53a641e_0
  - libblas=3.8.0=16_openblas
  - libcblas=3.8.0=16_openblas
  - libcudf=0.14.0a200418=cuda10.1_3339
  - libcugraph=0.14.0a200418=cuda10.1_299
  - libcuml=0.14.0a200418=cuda10.1_1429
  - libcumlprims=0.14.0a200417=cuda10.1_22
  - libcurl=7.69.1=hf7181ac_0
  - libcuspatial=0.14.0a200418=cuda10.1_169
  - libdap4=3.20.4=hd3bb157_0
  - libedit=3.1.20170329=hf8c457e_1001
  - libevent=2.1.10=h72c5cf5_0
  - libffi=3.2.1=he1b5a44_1007
  - libgcc-ng=9.2.0=h24d8f2e_2
  - libgdal=2.4.4=h2b6fda6_0
  - libgfortran-ng=7.3.0=hdf63c60_5
  - libhwloc=2.1.0=h3c4fd83_0
  - libiconv=1.15=h516909a_1006
  - libkml=1.3.0=h4fcabce_1010
  - liblapack=3.8.0=16_openblas
  - libllvm8=8.0.1=hc9558a2_0
  - libnetcdf=4.7.3=nompi_h9f9fd6a_101
  - libnvstrings=0.14.0a200418=cuda10.1_3339
  - libopenblas=0.3.9=h5ec1e0e_0
  - libpng=1.6.37=hed695b0_1
  - libpq=12.2=h5513abc_1
  - libprotobuf=3.8.0=h8b12597_0
  - librmm=0.14.0a200418=cuda10.1_258
  - libsodium=1.0.17=h516909a_0
  - libspatialindex=1.9.3=he1b5a44_3
  - libspatialite=4.3.0a=ha48a99a_1034
  - libssh2=1.8.2=h22169c7_2
  - libstdcxx-ng=9.2.0=hdf63c60_2
  - libtiff=4.1.0=hfc65ed5_0
  - libuuid=2.32.1=h14c3975_1000
  - libxcb=1.13=h14c3975_1002
  - libxgboost=1.0.2dev.rapidsai0.13=cuda10.1_6
  - libxml2=2.9.10=hee79883_0
  - llvm-openmp=10.0.0=hc9558a2_0
  - llvmlite=0.31.0=py37h5202443_1
  - locket=0.2.0=py_2
  - lz4-c=1.8.3=he1b5a44_1001
  - markdown=3.2.1=py_0
  - markupsafe=1.1.1=py37h8f50634_1
  - matplotlib-base=3.2.1=py37h30547a4_0
  - mistune=0.8.4=py37h8f50634_1001
  - msgpack-python=1.0.0=py37h99015e2_1
  - multidict=4.7.5=py37h516909a_0
  - multipledispatch=0.6.0=py_0
  - munch=2.5.0=py_0
  - nbconvert=5.6.1=py37hc8dfbb8_1
  - nbformat=5.0.4=py_0
  - nccl=2.5.7.1=h51cf6c1_0
  - ncurses=6.1=hf484d3e_1002
  - networkx=2.4=py_1
  - notebook=6.0.3=py37_0
  - numba=0.48.0=py37hb3f55d8_0
  - numpy=1.18.1=py37h8960a57_1
  - nvstrings=0.14.0a200418=py37_3339
  - olefile=0.46=py_0
  - openjpeg=2.3.1=h981e76c_3
  - openssl=1.1.1f=h516909a_0
  - packaging=20.1=py_0
  - pandas=0.25.3=py37hb3f55d8_0
  - pandoc=2.9.2.1=0
  - pandocfilters=1.4.2=py_1
  - panel=0.6.4=0
  - param=1.9.3=py_0
  - parquet-cpp=1.5.1=2
  - parso=0.7.0=pyh9f0ad1d_0
  - partd=1.1.0=py_0
  - pcre=8.44=he1b5a44_0
  - pexpect=4.8.0=py37hc8dfbb8_1
  - pickleshare=0.7.5=py37hc8dfbb8_1001
  - pillow=7.1.1=py37h718be6c_0
  - pip=20.0.2=py_2
  - pixman=0.38.0=h516909a_1003
  - poppler=0.67.0=h14e79db_8
  - poppler-data=0.4.9=1
  - postgresql=12.2=h8573dbc_1
  - proj=6.3.0=hc80f0dc_0
  - prometheus_client=0.7.1=py_0
  - prompt-toolkit=3.0.5=py_0
  - psutil=5.7.0=py37h8f50634_1
  - pthread-stubs=0.4=h14c3975_1001
  - ptyprocess=0.6.0=py_1001
  - py-xgboost=1.0.2dev.rapidsai0.13=cuda10.1py37_6
  - pyarrow=0.15.0=py37h8b68381_1
  - pycparser=2.20=py_0
  - pyct=0.4.6=py_0
  - pyct-core=0.4.6=py_0
  - pyee=7.0.1=py_0
  - pygments=2.6.1=py_0
  - pynvml=8.0.4=py_0
  - pyopenssl=19.1.0=py_1
  - pyparsing=2.4.7=pyh9f0ad1d_0
  - pyppeteer=0.0.25=py_1
  - pyproj=2.5.0=py37h8ff28aa_0
  - pyrsistent=0.16.0=py37h8f50634_0
  - pysocks=1.7.1=py37hc8dfbb8_1
  - python=3.7.6=h8356626_5_cpython
  - python-dateutil=2.8.1=py_0
  - python_abi=3.7=1_cp37m
  - pytz=2019.3=py_0
  - pyviz_comms=0.7.4=pyh8c360ce_0
  - pywavelets=1.1.1=py37h03ebfcd_1
  - pyyaml=5.3.1=py37h8f50634_0
  - pyzmq=19.0.0=py37hac76be4_1
  - rapids=0.14.0=cuda10.1_py37_150
  - rapids-xgboost=0.14.0=cuda10.1_py37_150
  - re2=2020.04.01=he1b5a44_0
  - readline=8.0=hf8c457e_0
  - requests=2.23.0=pyh8c360ce_2
  - rmm=0.14.0a200418=py37_258
  - rtree=0.9.4=py37h8526d28_1
  - scikit-image=0.16.2=py37hb3f55d8_0
  - scikit-learn=0.22.2.post1=py37hcdab131_0
  - scipy=1.4.1=py37ha3d9a3c_3
  - send2trash=1.5.0=py_0
  - setuptools=46.1.3=py37hc8dfbb8_0
  - shapely=1.7.0=py37hb106bac_1
  - simpervisor=0.3=py_1
  - six=1.14.0=py_1
  - snappy=1.1.8=he1b5a44_1
  - sortedcontainers=2.1.0=py_0
  - sqlite=3.30.1=hcee41ef_0
  - tblib=1.6.0=py_0
  - terminado=0.8.3=py37hc8dfbb8_1
  - testpath=0.4.4=py_0
  - thrift-cpp=0.12.0=hf3afdfd_1004
  - tk=8.6.10=hed695b0_0
  - toolz=0.10.0=py_0
  - tornado=6.0.4=py37h8f50634_1
  - tqdm=4.45.0=pyh9f0ad1d_0
  - traitlets=4.3.3=py37hc8dfbb8_1
  - tzcode=2019a=h516909a_1002
  - ucx=1.7.0+g9d06c3a=cuda10.1_0
  - uriparser=0.9.3=he1b5a44_1
  - urllib3=1.25.9=py_0
  - wcwidth=0.1.9=pyh9f0ad1d_0
  - webencodings=0.5.1=py_1
  - websockets=8.1=py37h8f50634_1
  - wheel=0.34.2=py_1
  - xarray=0.15.1=py_0
  - xerces-c=3.2.2=h8412b87_1004
  - xgboost=1.0.2dev.rapidsai0.13=cuda10.1py37_6
  - xorg-kbproto=1.0.7=h14c3975_1002
  - xorg-libice=1.0.10=h516909a_0
  - xorg-libsm=1.2.3=h84519dc_1000
  - xorg-libx11=1.6.9=h516909a_0
  - xorg-libxau=1.0.9=h14c3975_0
  - xorg-libxdmcp=1.1.3=h516909a_0
  - xorg-libxext=1.3.4=h516909a_0
  - xorg-libxrender=0.9.10=h516909a_1002
  - xorg-renderproto=0.11.1=h14c3975_1002
  - xorg-xextproto=7.3.0=h14c3975_1002
  - xorg-xproto=7.0.31=h14c3975_1007
  - xz=5.2.5=h516909a_0
  - yaml=0.2.3=h516909a_0
  - yarl=1.3.0=py37h516909a_1000
  - zeromq=4.3.2=he1b5a44_2
  - zict=2.0.0=py_0
  - zipp=3.1.0=py_0
  - zlib=1.2.11=h516909a_1006
  - zstd=1.4.3=h3b9ef0a_0
  - pip:
    - ucx-py==0.14.0a0+133.ge9a2c92
prefix: /home/wbadar/workspace/.miniconda3/envs/rapids14
@wbadart wbadart added ? - Needs Triage Need team to review and classify feature request New feature or request labels Apr 20, 2020
@github-actions github-actions bot added this to Needs prioritizing in Feature Planning Apr 20, 2020
@OlivierNV OlivierNV added bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. libcudf++ and removed ? - Needs Triage Need team to review and classify labels Apr 21, 2020
@OlivierNV
Contributor

From the code below, I think the problematic type name would be np.int32 (renaming it to int should work).
https://github.com/rapidsai/cudf/blob/branch-0.14/cpp/src/io/legacy/cuio_common.cpp#L23
I'm surprised to see that it's still going through the legacy reader path in branch-0.14, though.

@OlivierNV OlivierNV added doc Documentation and removed libcudf Affects libcudf (C++/CUDA) code. libcudf++ labels Apr 21, 2020
@harrism
Member

harrism commented Apr 22, 2020

@OlivierNV based on your triage is this a bug or a new feature request? (labeled as both)

@wbadart
Contributor Author

wbadart commented Apr 22, 2020

Hi @OlivierNV - I'm not sure np.int32 is the culprit here. Even when I tell cudf to leave the types uninterpreted by setting them all to str, I still see the "Unsupported data type" exception.

@OlivierNV
Contributor

OlivierNV commented Apr 22, 2020

Could this be due to dtype not being a list of strings? (Maybe something like list(my_types.values()) instead of my_types.) For example, does it still fail with dtype=["str", "str", ..., "str"]?

@harrism At this point this is a feature request for more explicit error messages/doc, but a bug has not been ruled out yet, so intentionally added both labels.

@wbadart
Contributor Author

wbadart commented Apr 22, 2020

Hmm, no dice switching to a list:

In [2]: cudf.read_csv(s, header=None, names=list(my_types), dtype=list(my_types.values()))
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-2-13b1f0db46da> in <module>
----> 1 cudf.read_csv(s, header=None, names=list(my_types), dtype=list(my_types.values()))

~/workspace/.miniconda3/envs/rapids14/lib/python3.7/contextlib.py in inner(*args, **kwds)
     72         def inner(*args, **kwds):
     73             with self._recreate_cm():
---> 74                 return func(*args, **kwds)
     75         return inner
     76

~/workspace/.miniconda3/envs/rapids14/lib/python3.7/site-packages/cudf/io/csv.py in read_csv(filepath_or_buffer, lineterminator, quotechar, quoting, doublequote, header, mangle_dupe_cols, usecols, sep, delimiter, delim_whitespace, skipinitialspace, names, dtype, skipfooter, skiprows, dayfirst, compression, thousands, decimal, true_values, false_values, nrows, byte_range, skip_blank_lines, parse_dates, comment, na_values, keep_default_na, na_filter, prefix, index_col, **kwargs)
     82         na_filter=na_filter,
     83         prefix=prefix,
---> 84         index_col=index_col,
     85     )
     86

cudf/_lib/legacy/csv.pyx in cudf._lib.legacy.csv.read_csv()

cudf/_lib/legacy/csv.pyx in cudf._lib.legacy.csv.read_csv()

RuntimeError: cuDF failure at: /conda/conda-bld/libcudf_1587234373268/work/cpp/src/io/csv/legacy/csv_reader_impl.cu:638: Unsupported data type

@OlivierNV
Contributor

What's the output of print(list(my_types.values()))? (I think this needs to be a list of strings, iirc.)

@wbadart
Contributor Author

wbadart commented Apr 22, 2020

Ahh, I think I get your meaning now. Yeah, it's a list of classes (the str / np.int32 / bool objects themselves). Gimme one sec to try it with strings.

@wbadart
Contributor Author

wbadart commented Apr 22, 2020

Boom:

In [2]: t = {'frame_time': 'str', 'frame_numer': 'int', 'ip_src': 'str', 'tcp_srcport': 'int', 'ip_dst': 'str', 'tcp_dstport': 'int', 'frame_len': 'int', 'tcp_flags_syn': 'bool', 'tcp_flags_fin': 'bool'}

In [3]: cudf.read_csv(s, header=None, names=list(t), dtype=list(t.values()))
Out[3]:
                                  frame_time  frame_numer         ip_src  tcp_srcport         ip_dst  tcp_dstport  frame_len  tcp_flags_syn  tcp_flags_fin
0      "Jul  3, 2017 11:55:58.598308000 UTC"            1  8.254.250.126           80   192.168.10.5        49188         60          False           True
1      "Jul  3, 2017 11:55:58.598312000 UTC"            2  8.254.250.126           80   192.168.10.5        49188         60          False           True
2      "Jul  3, 2017 11:55:58.598313000 UTC"            3  8.254.250.126           80   192.168.10.5        49188         60          False           True
3      "Jul  3, 2017 11:55:58.598314000 UTC"            4  8.254.250.126           80   192.168.10.5        49188         60          False           True
4      "Jul  3, 2017 11:55:58.598315000 UTC"            5  8.254.250.126           80   192.168.10.5        49188         60          False           True
5      "Jul  3, 2017 11:55:58.598316000 UTC"            6  8.254.250.126           80   192.168.10.5        49188         60          False           True
6      "Jul  3, 2017 11:55:58.598317000 UTC"            7  8.254.250.126           80   192.168.10.5        49188         60          False           True
7      "Jul  3, 2017 11:55:58.598318000 UTC"            8  8.254.250.126           80   192.168.10.5        49188         60          False           True
8      "Jul  3, 2017 11:56:22.331018000 UTC"           20  8.253.185.121           80  192.168.10.14        49486         60          False           True
9      "Jul  3, 2017 11:56:22.331021000 UTC"           21  8.253.185.121           80  192.168.10.14        49486         60          False           True

In [4]: _.dtypes
Out[4]:
frame_time       object
frame_numer       int32
ip_src           object
tcp_srcport       int32
ip_dst           object
tcp_dstport       int32
frame_len         int32
tcp_flags_syn      bool
tcp_flags_fin      bool
dtype: object

Thanks for the suggestion @OlivierNV. If you think it'd be appropriate, I'd be happy to contribute some documentation to clarify the expected use of read_csv's dtype parameter, to reflect our discussion. Let me know!
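For anyone who already has a mapping of type objects, the workaround above can be sketched as a small conversion step. `dtype_to_name` and `normalize_dtypes` are hypothetical helper names, not cuDF API; the sketch only assumes that classes expose `__name__` and that dtype-like objects (such as `numpy.dtype` instances) expose a `.name` string.

```python
def dtype_to_name(dtype):
    """Best-effort conversion of a dtype-like object to a type-name string."""
    if isinstance(dtype, str):
        return dtype
    if isinstance(dtype, type):
        # Classes such as int, str, bool (and NumPy scalar types) have __name__.
        return dtype.__name__
    name = getattr(dtype, "name", None)
    if isinstance(name, str):
        # Objects such as numpy.dtype instances expose a .name string.
        return name
    raise TypeError(f"Cannot derive a type name from {dtype!r}")


def normalize_dtypes(dtype_map):
    """Return a copy of dtype_map whose values are all type-name strings."""
    return {column: dtype_to_name(dtype) for column, dtype in dtype_map.items()}


my_types = {"frame_time": str, "frame_number": int, "tcp_flags_syn": bool}
normalized = normalize_dtypes(my_types)
print(normalized)
```

The result could then be passed as `dtype=list(normalized.values())`, as in the session above. Note that in that session the name 'int' came back as int32, whereas pandas reported int64 for the same columns, so spelling out 'int64' may be safer when the width matters.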

@OlivierNV OlivierNV removed the bug Something isn't working label Apr 22, 2020
@OlivierNV
Contributor

@wbadart Sounds good to me, that'd be great (you can open a doc PR and link to this issue)

@wbadart
Contributor Author

wbadart commented Apr 23, 2020

I'll draft something up!

Also, here's our call to the legacy reader, since that came up:

return libcudf_legacy.csv.read_csv(

@OlivierNV
Contributor

Yeah, it looks like the legacy reader is still being used until the csv writer gets ported to libcudf++ (#4342), since they're both called from the same Python file.

@lazykyama

Hi team, I've encountered a very similar issue on NGC's latest container (nvcr.io/nvidia/rapidsai/rapidsai:0.14-cuda10.2-runtime-ubuntu18.04). Has this issue already been resolved?
My error appears to come from the non-legacy library, so is my case the same as this issue?

Just in case, let me share reproduction code and error message below.

Code

import numpy as np
import cudf


def main():
    filepath = './test.csv'

    df = cudf.DataFrame()
    df['col1'] = list(range(10))
    df['col2'] = np.random.random(10)
    
    cudf.io.csv.to_csv(df, path=filepath, header=False, index=False)

    names = ['col1', 'col2']
    # dtype = {'col1': 'int64', 'col2': 'float64'}  # <- It works!
    dtype = {'col1': np.int64, 'col2': np.float64}
    print(dtype)
    df = cudf.io.csv.read_csv(filepath, names=names, dtype=dtype, header=None)
    print(df)


if __name__ == "__main__":
    main()

Error

Traceback (most recent call last):
  File "smallest_read_csv_dtype.py", line 25, in <module>
    main()
  File "smallest_read_csv_dtype.py", line 20, in main
    df = cudf.io.csv.read_csv(filepath, names=names, dtype=dtype, header=None)
  File "/opt/conda/envs/rapids/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/cudf/io/csv.py", line 84, in read_csv
    index_col=index_col,
  File "cudf/_lib/csv.pyx", line 337, in cudf._lib.csv.read_csv
RuntimeError: cuDF failure at: /conda/conda-bld/libcudf_1591199376654/work/cpp/src/io/csv/reader_impl.cu:649: Unsupported data type

Launch command

sudo docker run --gpus=all --rm -it -v $(pwd):/ws nvcr.io/nvidia/rapidsai/rapidsai:0.14-cuda10.2-runtime-ubuntu18.04

@kkraus14
Collaborator

Can you try passing a string of int64 instead of np.int64 for the dtypes? Likely a bug on our end in handling the dtypes.

@lazykyama

Thanks for the reply, @kkraus14 !

Can you try passing a string of int64 instead of np.int64 for the dtypes?

Yes. Although that line is commented out in my code above, the program works when I pass the 'int64' and 'float64' strings instead of NumPy dtypes, like below.

dtype = {'col1': 'int64', 'col2': 'float64'}

@kkraus14 kkraus14 added bug Something isn't working cuDF (Python) Affects Python cuDF API. labels Jun 23, 2020
@Rogerh91

Rogerh91 commented Oct 9, 2020

Hey all, I saw a similar issue come up as I was playing around with cuDF's read_csv function with RAPIDS on Kaggle. My code runs fine, but it kept raising the following error:

RuntimeError: cuDF failure at: /opt/conda/envs/rapids/conda-bld/libcudf_1598487636199/work/cpp/src/io/csv/reader_impl.cu:651: Unsupported data type

After looking at this issue and a lot of guessing, I got my code to the point where I realized that 'int64' and 'str' work for dtypes. But I'm struggling with the last column, which should be a datetime and won't render properly as a string or an integer.

checkout_list = []

for filename in all_checkout_files:
    cu = cudf.io.csv.read_csv(filename, index_col = None, header = 0, dtype ={'BibNumber': 'int64', 'ItemBarcode': 'int64', 'ItemType': 'str', 'Collection': 'str', 'CallNumber': 'int64', 'CheckoutDateTime': 'datetime64'})
    checkout_list.append(cu)
    
checkout = cudf.core.reshape.concat(checkout_list, axis=0, ignore_index = True)

The code is meant to append a bunch of cuDF dataframes together that all follow a common pattern. I know that if I replace 'datetime64' with 'int64', this runs properly, at least at first glance. I'm wondering what the proper way for the function to accept datetime as a reference would be.

A basic point of frustration on this has been guessing at the proper way to render datatypes which came on top of data validation errors (columns were being misread which is why I had to set the dtypes at the read_csv level in the first place). I think this problem could be resolved by fixing the underlying bug -- but in the absence of that, correcting this documentation to be more accurate would help a lot.

@kkraus14
Collaborator

kkraus14 commented Oct 9, 2020

@Rogerh91 I believe that if you use timestamp, or something like timestamp[s], it should work.
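Applied to the dtype mapping from the question, the suggestion might look like the sketch below. `use_timestamp_name` is a hypothetical helper, not cuDF API, and only the 'timestamp[s]' spelling mentioned above is used; whether other units are accepted is not confirmed in this thread.

```python
def use_timestamp_name(dtype_map):
    """Replace 'datetime64' entries with the 'timestamp[s]' type name."""
    return {
        column: "timestamp[s]" if dtype == "datetime64" else dtype
        for column, dtype in dtype_map.items()
    }


checkout_dtypes = {
    "BibNumber": "int64",
    "ItemBarcode": "int64",
    "ItemType": "str",
    "Collection": "str",
    "CallNumber": "int64",
    "CheckoutDateTime": "datetime64",
}

fixed_dtypes = use_timestamp_name(checkout_dtypes)
print(fixed_dtypes["CheckoutDateTime"])
```

`fixed_dtypes` would then take the place of the literal dict passed as `dtype=` in the read_csv call above.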

We're actively working on refactoring this code and cleaning this up is definitely one of the things we're planning to tackle.

@Rogerh91

Rogerh91 commented Oct 10, 2020

Hey @kkraus14, thanks for the tip -- just wanted to report that it worked the first time I tried it. It doesn't seem to be anywhere in the documentation which most people will consult when they're stuck on this, but appreciate that you all are refactoring and cleaning things up. That seems like it might be a quick fix in the meantime though (clearing up documentation), or a blog post that will show up on SEO maybe.

@galipremsagar galipremsagar self-assigned this Nov 10, 2020
@galipremsagar galipremsagar added this to Issue-Needs prioritizing in v0.17 Release via automation Nov 10, 2020
@galipremsagar galipremsagar linked a pull request Nov 10, 2020 that will close this issue
3 tasks
@galipremsagar galipremsagar moved this from Issue-Needs prioritizing to Issue-P1 in v0.17 Release Nov 10, 2020
Feature Planning automation moved this from Needs prioritizing to Closed Nov 11, 2020
v0.17 Release automation moved this from Issue-P1 to Done Nov 11, 2020
galipremsagar added a commit that referenced this issue Nov 11, 2020
Fixes: #6606, #4957

This PR:

 Adds support for an arbitrary type to be passed as dtype.
 Adds support for a scalar type of input for dtype.
 Handles conversion of pandas nullable dtypes as well in dtype param.
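
As a rough illustration of what accepting a scalar dtype could mean, the sketch below expands a single type into a per-column mapping. This is an assumption made by analogy with pandas, where one dtype applies to every column; `expand_dtype` is a hypothetical helper and is not taken from the PR.

```python
def expand_dtype(names, dtype):
    """Expand a dict, list, or scalar dtype into a {column: dtype} mapping."""
    if isinstance(dtype, dict):
        return dict(dtype)
    if isinstance(dtype, (list, tuple)):
        if len(dtype) != len(names):
            raise ValueError("dtype list must have one entry per column name")
        return dict(zip(names, dtype))
    # Anything else is treated as a scalar applied to every column.
    return {name: dtype for name in names}


expanded_scalar = expand_dtype(["col1", "col2"], "str")
print(expanded_scalar)
```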