Make it easy to catch mutations in the data by emitting col_types string #314

Closed
earino opened this Issue Nov 14, 2015 · 7 comments

earino commented Nov 14, 2015

One of the nice features of readr is that it introspects column types and gives you data frames with columns of "appropriate" data types. This is cool. It does so heuristically by reading a bunch of rows (but not all) and guessing.

This is very nice for interactive use, however I think it could be made into a neat feature to catch data schema mutation.

I would like to propose a new function that reads a whole CSV file and, based on the entire file, returns a col_types string. A user can then take that string and put it in their production script's read_csv call, so that in the future, if a new data file arrives with different data types, instead of (for example) silently returning a data frame which has chars where a date once was, it will give a proper error.
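
A minimal sketch of the production-side half of this idea, assuming a current readr where `read_csv()` accepts a compact `col_types` string and `stop_for_problems()` turns recorded parsing failures into an error. The file contents and the `pinned` string are made up for illustration, and the string is written by hand here rather than produced by the proposed function.

```r
library(readr)

# Hypothetical example data: a file whose second column holds dates.
good <- tempfile(fileext = ".csv")
writeLines(c("id,day", "1,2015-11-14", "2,2015-12-02"), good)

# A later delivery where one date has degraded into free text.
bad <- tempfile(fileext = ".csv")
writeLines(c("id,day", "1,yesterday", "2,2015-12-02"), bad)

# Pin the schema with a compact col_types string: i = integer, D = date.
pinned <- "iD"

ok <- read_csv(good, col_types = pinned)
stop_for_problems(ok)  # returns quietly: every value parsed as declared

# With pinned types, the unparseable value is recorded as a parsing problem
# instead of the column quietly being read as character.
drifted <- suppressWarnings(read_csv(bad, col_types = pinned))
result <- tryCatch(
  {
    stop_for_problems(drifted)
    "no error"
  },
  error = function(e) "schema mismatch"
)
```

Note that pinning `col_types` alone produces a warning and an `NA`, not an error; the explicit `stop_for_problems()` call is what makes the script fail.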

Member

hadley commented Dec 2, 2015

This is a great idea :)

Member

jennybc commented Dec 23, 2015

If implemented, it might be nice to comply with something like csvy: http://csvy.org, http://dataprotocols.org/json-table-schema/. The rio package just accommodated csvy (leeper/rio#52).

Added 2016-02-10: W3C also has a draft document on "standard ways to express useful metadata about CSV files and other kinds of tabular data": http://w3c.github.io/csvw/primer/.

Member

hadley commented Jun 2, 2016

I'm thinking that the default behaviour should be to spit out the code that you can copy and paste into the col_types in order to repeat the exact same parsing in the future.
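
A rough sketch of that copy-and-paste workflow, assuming a readr version that provides `spec()` for extracting the column specification from a data frame it has read. The example file is hypothetical.

```r
library(readr)

# Hypothetical example file.
path <- tempfile(fileext = ".csv")
writeLines(c("id,day", "1,2015-11-14", "2,2015-12-02"), path)

# First read: let readr guess the column types.
first <- read_csv(path, col_types = cols())

# Extract the specification that was actually used. Printing it shows a
# cols(...) call that can be pasted into a script as the col_types argument.
schema <- spec(first)
print(schema)

# Later reads can reuse the same specification to repeat the parsing.
again <- read_csv(path, col_types = schema)
```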

Member

jimhester commented Jul 7, 2016

Addressed by #437

@jimhester jimhester closed this Jul 7, 2016

@jimhester jimhester removed the ready label Jul 7, 2016

drolejoel commented Sep 9, 2016

Is it OK to add a comment to this completed enhancement, or should I add a new enhancement request?

While spitting out the column specification (I like this) is a great feature as a default, it gets a tad tedious to see the same red text on the screen once the spec has been figured out. My suggestion is to add a "silent = TRUE" option to suppress the column specifications if they are not wanted.
Thank you for your consideration.

Member

hadley commented Sep 9, 2016

It's always better to submit a new issue. (But this one is already handled: just do col_types = cols())

drolejoel commented Sep 9, 2016

ok thank you

@lock lock bot locked and limited conversation to collaborators Sep 25, 2018