Read_csv should raise error on bad dtypes #3795

Closed
hayd opened this issue Jun 7, 2013 · 7 comments · Fixed by #3797
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement IO Data IO issues that don't fit into a more specific label

Contributor

hayd commented Jun 7, 2013

OP's code from http://stackoverflow.com/questions/16988526/pandas-reading-csv-as-string-type

Add: auto conversion from a passed str or np.string_ to np.object might be ok

df
           A         B
1A  0.209059  0.275554
1B  0.742666  0.721165

df.to_csv(savefile)

df_read = pd.read_csv(savefile, dtype=str, index_col=0)
   A  B
B  (  <

Actually I get something different on dev:

In [101]: pd.read_csv('a', index_col=0, dtype=str)
Out[101]:
 A B
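For comparison, here is a minimal sketch of the same round trip that can be run against a current pandas install. It is an assumption that the reader is on a recent release; the outputs quoted above come from 2013 builds and will not reproduce. The CSV text is written inline (via `io.StringIO`) instead of going through `to_csv`, and the variable names are only illustrative.

```python
import io

import pandas as pd

# Same values as the frame in the report, written inline instead of via to_csv.
csv_text = ",A,B\n1A,0.209059,0.275554\n1B,0.742666,0.721165\n"

# dtype=str asks the parser to keep every field as text.
df_read = pd.read_csv(io.StringIO(csv_text), dtype=str, index_col=0)

# Look at one cell to check that it was not parsed into a float.
value = df_read.loc["1A", "A"]
print(type(value).__name__, value)
```

On a recent pandas the cell should come back as the text `0.209059` and not as a float, which is the behaviour the OP was asking for.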


Member

cpcloud commented Jun 7, 2013

@jreback are you suggesting it should return a frame with floats converted to str (which pandas will then convert to object)? or convert to object (which still leaves them as floats)...

Contributor

jreback commented Jun 7, 2013

@jreback

op wanted a string dtype.....

the answer is to do:

pd.read_csv('a', index_col=0, dtype=np.object)

I am just suggesting

a) if we see a dtype==str (string, str, or np.str), convert it to np.object
b) validate that dtype is not some weird thing (which I think actually happens already); it
just tries to coerce, and if that fails it's left as object

so this would just avoid the np.str trap

however I think there is also a case of

dtype='datetime64[ns]' (or variants of) which should be ignored/raise an error (as you really need to pass parse_dates)
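The proposal in this comment can be sketched roughly as below. This is a hypothetical helper, not the code that went into pandas: the name `normalize_dtype`, the set of accepted string spellings, and the choice to raise `TypeError` for datetime-like strings (the comment leaves open whether to ignore or raise) are all assumptions made for illustration. Step b), validating that the dtype is otherwise sensible, is left out.

```python
def normalize_dtype(dtype):
    """Hypothetical sketch of the special-casing discussed above."""
    # a) The builtin str type is mapped to object.
    if dtype is str:
        return object
    if isinstance(dtype, str):
        name = dtype.lower()
        # a) String spellings of a string dtype are mapped to object too.
        if name in ("str", "string"):
            return object
        # Datetime-like requests are rejected here (one of the two options
        # mentioned in the comment); parse_dates is the intended route.
        if name.startswith("datetime64"):
            raise TypeError(
                "datetime columns should be requested via parse_dates, not dtype"
            )
    # Anything else is passed through untouched.
    return dtype


print(normalize_dtype(str))
print(normalize_dtype("float64"))
```

The string check runs only on actual Python strings, so other dtype objects pass straight through without being compared against names.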

Contributor Author

hayd commented Jun 7, 2013

It has to be a valid type (at least) ... I like the idea of special casing str.

With regards to last point I guess conceivably you may know the things you're passing are datetime64[ns] already... ?

Contributor

jreback commented Jun 7, 2013

yes...I could see something like:

read_csv(....dtype = { 'A' : np.str, 'B' : 'datetime64' })

being equivalent to

read_csv(dtype = { 'A' : np.object }, parse_dates=['B'])

for consistency you could then just construct a single dtype dict (rather than having to separate out the date fields),
and then you could accept dtype-like strings that basically all go to datetime64[ns] (e.g. datetime64, which is not a valid dtype)

(this also shows the conversion of the 'string-like' dtypes)
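The equivalence described in this comment could be sketched as a small pre-processing step that splits one combined dict into the two arguments `read_csv` already accepts (`dtype` and `parse_dates`). The function names `split_dtype_spec` and `is_string_like` are hypothetical and are not part of pandas; the rule that any string starting with `datetime64` counts as a date request is likewise an assumption.

```python
def is_string_like(kind):
    """True for the builtin str type or a string spelling of it."""
    return kind is str or (
        isinstance(kind, str) and kind.lower() in ("str", "string")
    )


def split_dtype_spec(spec):
    """Hypothetical: split a combined dtype dict into (dtype, parse_dates)."""
    dtype = {}
    parse_dates = []
    for column, kind in spec.items():
        if isinstance(kind, str) and kind.lower().startswith("datetime64"):
            # Date-like entries are routed to parse_dates.
            parse_dates.append(column)
        elif is_string_like(kind):
            # String-like entries become object.
            dtype[column] = object
        else:
            dtype[column] = kind
    return dtype, parse_dates


dtype, parse_dates = split_dtype_spec({"A": str, "B": "datetime64"})
print(dtype, parse_dates)
```

The two results could then be handed to `pd.read_csv(..., dtype=dtype, parse_dates=parse_dates)`, which matches the second call shown in the comment.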

Member

cpcloud commented Jun 7, 2013

i think special casing str is probably the way to go. what happens if someone passes read_csv(dtype={'A': str, 'B': 'datetime64'}, parse_dates=['B'])?

@jreback what's the reason again for not using numpy's string type? i've sort of dogmatically accepted object just because it doesn't really affect anything that i do, but now i'm a bit curious. does it have to do with the fact that there's basically a different dtype for every string length?

Contributor

jreback commented Jun 7, 2013

hmmm...the first case should probably be an error

np.str is a special case that has a fixed width for each item, while object is better because it allows variable-length strings (and any object, actually). You get more perf by using np.str but less flexibility.
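The fixed-width versus object trade-off can be seen directly in NumPy. The sketch below uses the `'S3'` byte-string dtype as a stand-in for the fixed-width string type being discussed; the sample values are made up for illustration.

```python
import numpy as np

words = [b"a", b"abcdef"]

# Fixed-width: every element gets exactly 3 bytes, longer values are cut off.
fixed = np.array(words, dtype="S3")

# Object: each element is a reference to an arbitrary Python object.
flexible = np.array(words, dtype=object)

print(fixed.dtype.itemsize)  # bytes reserved per element
print(fixed[1])              # truncated to the declared width
print(flexible[1])           # kept at full length
```

This also bears on the question above about string lengths: `'S3'` and `'S10'` are distinct dtypes, so a fixed-width column has to settle on one width for all of its values.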

Contributor

jreback commented Jun 7, 2013

fyi...looking at this closer, passing 'str' as the dtype actually works, but since the width of the string is 0 it does weird stuff....so that's the case to handle

e.g. np.dtype('S10') 'works' (and returns object), but np.dtype('S') (which is what 'str' gives you) is an error
this is all in parser.pyx/_convert_with_dtype, pretty easy fix for the string issue
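The zero-width case can be checked in NumPy itself, and a fallback along the lines suggested here might look like the sketch below. The helper `coerce_string_dtype` is hypothetical and is not the change made in `parser.pyx`. Note also that on Python 3 `np.dtype(str)` resolves to a unicode (`'U'`) dtype, not the bytes (`'S'`) dtype it mapped to when this thread was written.

```python
import numpy as np

# An unsized string dtype has no room for any characters.
print(np.dtype("S").itemsize)    # width 0
print(np.dtype("S10").itemsize)  # width 10
print(np.dtype(str).itemsize)    # also unsized


def coerce_string_dtype(dtype):
    """Hypothetical: send any NumPy string dtype to object, keep the rest."""
    resolved = np.dtype(dtype)
    if resolved.kind in ("S", "U"):
        return np.dtype(object)
    return resolved


print(coerce_string_dtype(str))
print(coerce_string_dtype("float64"))
```

Routing every string kind to object, sized or not, mirrors the remark that `'S10'` already ends up as object; only the unsized case needs the extra handling.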
