[FEA] Readers report which specified types are unsupported #4957
Comments
From the code below, I think the problematic type name would be |
@OlivierNV based on your triage is this a bug or a new feature request? (labeled as both) |
Hi @OlivierNV - I'm not sure |
Could this be due to dtypes not being a list of strings? (maybe something like
@harrism At this point this is a feature request for more explicit error messages/doc, but a bug has not been ruled out yet, so intentionally added both labels. |
Hmm, no dice switching to a list:
In [2]: cudf.read_csv(s, header=None, names=list(my_types), dtype=list(my_types.values()))
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-2-13b1f0db46da> in <module>
----> 1 cudf.read_csv(s, header=None, names=list(my_types), dtype=list(my_types.values()))
~/workspace/.miniconda3/envs/rapids14/lib/python3.7/contextlib.py in inner(*args, **kwds)
72 def inner(*args, **kwds):
73 with self._recreate_cm():
---> 74 return func(*args, **kwds)
75 return inner
76
~/workspace/.miniconda3/envs/rapids14/lib/python3.7/site-packages/cudf/io/csv.py in read_csv(filepath_or_buffer, lineterminator, quotechar, quoting, doublequote, header, mangle_dupe_cols, usecols, sep, delimiter, delim_whitespace, skipinitialspace, names, dtype, skipfooter, skiprows, dayfirst, compression, thousands, decimal, true_values, false_values, nrows, byte_range, skip_blank_lines, parse_dates, comment, na_values, keep_default_na, na_filter, prefix, index_col, **kwargs)
82 na_filter=na_filter,
83 prefix=prefix,
---> 84 index_col=index_col,
85 )
86
cudf/_lib/legacy/csv.pyx in cudf._lib.legacy.csv.read_csv()
cudf/_lib/legacy/csv.pyx in cudf._lib.legacy.csv.read_csv()
RuntimeError: cuDF failure at: /conda/conda-bld/libcudf_1587234373268/work/cpp/src/io/csv/legacy/csv_reader_impl.cu:638: Unsupported data type |
What's the output of |
Ahh I think I get your meaning now. Yeah, it's a list of classes (the |
Boom:
In [2]: t = {'frame_time': 'str', 'frame_numer': 'int', 'ip_src': 'str', 'tcp_srcport': 'int', 'ip_dst': 'str', 'tcp_dstport': 'int', 'frame_len': 'int', 'tcp_flags_syn': 'bool', 'tcp_flags_fin': 'bool'}
In [3]: cudf.read_csv(s, header=None, names=list(t), dtype=list(t.values()))
Out[3]:
frame_time frame_numer ip_src tcp_srcport ip_dst tcp_dstport frame_len tcp_flags_syn tcp_flags_fin
0 "Jul 3, 2017 11:55:58.598308000 UTC" 1 8.254.250.126 80 192.168.10.5 49188 60 False True
1 "Jul 3, 2017 11:55:58.598312000 UTC" 2 8.254.250.126 80 192.168.10.5 49188 60 False True
2 "Jul 3, 2017 11:55:58.598313000 UTC" 3 8.254.250.126 80 192.168.10.5 49188 60 False True
3 "Jul 3, 2017 11:55:58.598314000 UTC" 4 8.254.250.126 80 192.168.10.5 49188 60 False True
4 "Jul 3, 2017 11:55:58.598315000 UTC" 5 8.254.250.126 80 192.168.10.5 49188 60 False True
5 "Jul 3, 2017 11:55:58.598316000 UTC" 6 8.254.250.126 80 192.168.10.5 49188 60 False True
6 "Jul 3, 2017 11:55:58.598317000 UTC" 7 8.254.250.126 80 192.168.10.5 49188 60 False True
7 "Jul 3, 2017 11:55:58.598318000 UTC" 8 8.254.250.126 80 192.168.10.5 49188 60 False True
8 "Jul 3, 2017 11:56:22.331018000 UTC" 20 8.253.185.121 80 192.168.10.14 49486 60 False True
9 "Jul 3, 2017 11:56:22.331021000 UTC" 21 8.253.185.121 80 192.168.10.14 49486 60 False True
In [4]: _.dtypes
Out[4]:
frame_time object
frame_numer int32
ip_src object
tcp_srcport int32
ip_dst object
tcp_dstport int32
frame_len int32
tcp_flags_syn bool
tcp_flags_fin bool
dtype: object
Thanks for the suggestion @OlivierNV. If you think it'd be appropriate, I'd be happy to contribute some documentation to clarify the expected use of |
@wbadart Sounds good to me, that'd be great (you can open a doc PR and link to this issue) |
I'll draft something up! Also, here's our call to the legacy reader, since that came up: cudf/python/cudf/cudf/io/csv.py Line 52 in fff2bed
|
Yeah, it looks like the legacy reader is still being used until the csv writer gets ported to libcudf++ (#4342 ), since they're both called from the same python file. |
Hi team, I've encountered a very similar issue on NGC's latest container ( Just in case, let me share the reproduction code and error message below.
Code
import numpy as np
import cudf


def main():
    filepath = './test.csv'

    # Build a small two-column frame and write it out without header or index.
    df = cudf.DataFrame()
    df['col1'] = list(range(10))
    df['col2'] = np.random.random(10)
    cudf.io.csv.to_csv(df, path=filepath, header=False, index=False)

    names = ['col1', 'col2']
    # dtype = {'col1': 'int64', 'col2': 'float64'}  # <- It works!
    # NumPy classes as dtypes: this is the failing case.
    dtype = {'col1': np.int64, 'col2': np.float64}
    print(dtype)

    df = cudf.io.csv.read_csv(filepath, names=names, dtype=dtype, header=None)
    print(df)


if __name__ == "__main__":
    main()
Error
Launch command
|
Can you try passing a string of |
Thanks for the reply, @kkraus14!
Yes. Although that line is commented out in my code, the program works well when passing
|
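A minimal sketch of a workaround for the pattern above, assuming the reader only accepts dtype names given as strings (as the thread suggests, but this is not confirmed as documented behavior). `dtype_name` and `normalize_dtypes` are hypothetical helpers, not part of cuDF; they turn classes such as `np.int64`, `int`, or `bool` into their string names before the dict or list is handed to `read_csv`. Whether cuDF accepts every resulting name is not guaranteed.

```python
def dtype_name(spec):
    """Return a string dtype name for a string, a class, or an object with .name."""
    if isinstance(spec, str):
        # Already a string such as 'int64'; pass it through unchanged.
        return spec
    if isinstance(spec, type):
        # Plain classes (int, bool, str) and NumPy scalar classes such as
        # numpy.int64 carry their name in __name__.
        return spec.__name__
    name = getattr(spec, "name", None)
    if isinstance(name, str):
        # Objects such as numpy.dtype instances expose a .name string.
        return name
    raise TypeError(f"cannot derive a dtype name from {spec!r}")


def normalize_dtypes(dtypes):
    """Convert a dict or list of dtype specs to the same shape with string names."""
    if isinstance(dtypes, dict):
        return {col: dtype_name(spec) for col, spec in dtypes.items()}
    return [dtype_name(spec) for spec in dtypes]
```

With the reproduction above, the call would become `read_csv(filepath, names=names, dtype=normalize_dtypes(dtype), header=None)`, which passes `'int64'` and `'float64'` instead of the NumPy classes.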
Hey all, I saw a similar issue come up as I was playing around with cuDF's read_csv function with RAPIDS on Kaggle. My code runs fine, but it kept raising the following error:
RuntimeError: cuDF failure at: /opt/conda/envs/rapids/conda-bld/libcudf_1598487636199/work/cpp/src/io/csv/reader_impl.cu:651: Unsupported data type
After looking at this issue and a lot of guessing, I got my code to the point where I realized that 'int64' and 'str' work as dtypes. But I'm struggling with the last column, which should be a datetime and won't render properly as a string or an integer.
The code is meant to append a bunch of cuDF dataframes together that all follow a common pattern. I know that if I replace 'datetime64' with 'int64', this runs properly, at least at first glance. I'm wondering what the proper way for the function to accept datetime as a reference would be. A basic point of frustration on this has been guessing at the proper way to render datatypes which came on top of data validation errors (columns were being misread which is why I had to set the dtypes at the read_csv level in the first place). I think this problem could be resolved by fixing the underlying bug -- but in the absence of that, correcting this documentation to be more accurate would help a lot. |
@Rogerh91 I believe if you use We're actively working on refactoring this code and cleaning this up is definitely one of the things we're planning to tackle. |
Hey @kkraus14, thanks for the tip -- just wanted to report that it worked the first time I tried it. It doesn't seem to be anywhere in the documentation which most people will consult when they're stuck on this, but appreciate that you all are refactoring and cleaning things up. That seems like it might be a quick fix in the meantime though (clearing up documentation), or a blog post that will show up on SEO maybe. |
Is your feature request related to a problem? Please describe.
Sometimes cudf.read_csv fails with
when given the dtype=MY_TYPES argument. For example,
gives
While swapping in pandas gives:
(I do wonder if this particular example is hitting a bug, or a problem in my data even; are any of bool, int64, int32 and str actually unsupported?)
Describe the solution you'd like
If it's possible, it would be nice to know which type in MY_TYPES is unsupported. Can
cudf/cpp/src/io/csv/reader_impl.cu
Lines 627 to 628 in 8e90792
and
cudf/cpp/src/io/csv/reader_impl.cu
Lines 641 to 642 in 8e90792
be extended to support this?
(And I guess also https://github.com/rapidsai/cudf/blob/branch-0.14/cpp/src/io/csv/legacy/csv_reader_impl.cu#L624 and https://github.com/rapidsai/cudf/blob/branch-0.14/cpp/src/io/csv/legacy/csv_reader_impl.cu#L638. There might be more spots; this is just what I surfaced with some quick grepping around.)
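Until the readers report the offending type themselves, the request above can be approximated on the Python side. The following is only a sketch: `OBSERVED_WORKING`, `find_unsupported`, and `check_dtypes` are hypothetical names, and the allowlist contains only the dtype strings this thread reports as working. It is an assumption for illustration, not cuDF's actual list of supported types.

```python
# Illustrative allowlist: only names reported as working in this thread.
# This is NOT cuDF's actual list of supported dtypes.
OBSERVED_WORKING = {"str", "int", "bool", "int64", "float64"}


def find_unsupported(names, dtypes, supported=OBSERVED_WORKING):
    """Return (column, dtype) pairs whose dtype is not a string in `supported`."""
    if isinstance(dtypes, dict):
        pairs = list(dtypes.items())
    else:
        # A list of dtypes is matched to column names by position.
        pairs = list(zip(names, dtypes))
    return [
        (col, dt)
        for col, dt in pairs
        if not (isinstance(dt, str) and dt in supported)
    ]


def check_dtypes(names, dtypes, supported=OBSERVED_WORKING):
    """Raise a ValueError naming every column with an unrecognized dtype."""
    bad = find_unsupported(names, dtypes, supported)
    if bad:
        details = ", ".join(f"{col!r}: {dt!r}" for col, dt in bad)
        raise ValueError(f"Unsupported data type(s) for column(s): {details}")
```

Calling `check_dtypes(names, dtype)` before `read_csv` would fail early with the column and type spelled out, instead of the bare "Unsupported data type" message from the C++ reader.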
Describe alternatives you've considered
One alternative would be to simply document supported dtypes. If this exists already, I apologize for not finding it (though if this is the case, could we perhaps link or otherwise include the list in the read_csv documentation?).
Additional context
`conda env export` for the above example