Add a return_spec argument to read functions #437

jimhester · 2016-06-15T21:34:12Z

This tries to address #314 slightly differently than proposed.

It adds a return_spec argument to the read_* functions. When it is set, the function does not read the file; it just returns the col_spec object that is generated before reading.

This object can then be used in subsequent calls as the value for the col_types argument, which lets you enforce the same specification on other datasets that share it.

These objects can be saved and retrieved with saveRDS()/readRDS() and already have formatting methods written.

We could also write a function to take the object and produce an expression that could be used to recreate the object, but I am not sure that is more useful or informative than just working with the objects directly.

hadley · 2016-06-17T12:11:08Z

Reading this makes me realise that I forgot to describe an important part of the workflow that I was imagining. I was thinking you'd often run it once to make the column guesses explicit, and then you'd modify that code/text to fix any detection errors and specify the correct column types. So I think that makes having a code/text rendering quite important.

hadley · 2016-06-17T12:12:13Z

Also I think to start with I'd rather have this be a separate function, i.e. spec_csv, spec_tsv etc.

jimhester · 2016-06-17T13:26:45Z

I feel like that workflow works better with this implementation than with the proposed one: it is very easy to subset the object and change the columns after returning the col_spec object. I also added an as.character() method to print an R expression that would recreate the spec object.

write_csv(mtcars, "mtcars.csv")
spec <- read_csv("mtcars.csv", return_spec = TRUE)
spec
#> <col_spec>
#> * mpg: double
#> * cyl: integer
#> * disp: double
#> * hp: integer
#> * drat: double
#> * wt: double
#> * qsec: double
#> * vs: integer
#> * am: integer
#> * gear: integer
#> * carb: integer
#> * default: guess

## Oh actually cylinders should be a factor
spec$cols$cyl <- col_factor(c("4", "6", "8"))
spec
#> <col_spec>
#> * mpg: double
#> * cyl: factor
#> * disp: double
#> * hp: integer
#> * drat: double
#> * wt: double
#> * qsec: double
#> * vs: integer
#> * am: integer
#> * gear: integer
#> * carb: integer
#> * default: guess

data <- read_csv("mtcars.csv", col_types = spec)
data
#> <tibble [32 x 11]>
#>      mpg    cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <fctr> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#> 1   21.0      6 160.0   110  3.90 2.620 16.46     0     1     4     4
#> 2   21.0      6 160.0   110  3.90 2.875 17.02     0     1     4     4
#> 3   22.8      4 108.0    93  3.85 2.320 18.61     1     1     4     1
#> 4   21.4      6 258.0   110  3.08 3.215 19.44     1     0     3     1
#> 5   18.7      8 360.0   175  3.15 3.440 17.02     0     0     3     2
#> 6   18.1      6 225.0   105  2.76 3.460 20.22     1     0     3     1
#> 7   14.3      8 360.0   245  3.21 3.570 15.84     0     0     3     4
#> 8   24.4      4 146.7    62  3.69 3.190 20.00     1     0     4     2
#> 9   22.8      4 140.8    95  3.92 3.150 22.90     1     0     4     2
#> 10  19.2      6 167.6   123  3.92 3.440 18.30     1     0     4     4
#> ... with 22 more rows

# Alternatively, render the spec as R code with as.character()
cat(as.character(spec))
#> cols(
#>   col_double(),
#>   col_factor(levels = c("4", "6", "8"), ordered = FALSE),
#>   col_double(),
#>   col_integer(),
#>   col_double(),
#>   col_double(),
#>   col_double(),
#>   col_integer(),
#>   col_integer(),
#>   col_integer(),
#>   col_integer())

data2 <- read_csv("mtcars.csv",
  col_types = cols(
    col_double(),
    col_factor(levels = c("4", "6", "8"), ordered = FALSE),
    col_double(),
    col_integer(),
    col_double(),
    col_double(),
    col_double(),
    col_integer(),
    col_integer(),
    col_integer(),
    col_integer()))

all.equal(data, data2)
#> [1] TRUE

We can do separate spec_* functions, that was actually my first prototype of this.

I went with the argument mainly to decrease the maintenance burden, since you don't have to replicate (and keep in sync) the setup logic in two different places.

I also think the argument is slightly more discoverable than a separate function, but that is largely subjective.

hadley · 2016-06-17T14:09:24Z

I think most people will prefer manipulating the character strings, but I do like how this gives you both interfaces.

How hard would it be to generate a named list? That would make it easier to work with, and then if you wanted to only read a few cols, you could delete the lines you don't want and switch to cols_only().
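For illustration, here is a minimal sketch of the cols_only() idea, assuming readr is installed. The temporary file and the choice of columns are made up for the example and are not part of the PR.

```r
library(readr)

# Write a small example file to a temporary location (path is illustrative).
path <- tempfile(fileext = ".csv")
write_csv(mtcars, path)

# Keep only the named columns; every column not listed here is skipped.
only_two <- read_csv(
  path,
  col_types = cols_only(
    mpg = col_double(),
    cyl = col_integer()
  )
)

names(only_two)
```

Starting from a printed named spec, this amounts to deleting the unwanted lines and renaming cols() to cols_only().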

Maybe the default could be to print it out whenever any columns are guessed? That's nice because it tells you exactly how readr is reading it in, and if you're loading a big file and spot a problem you can ctrl + break, fix the problem and start again.

I really dislike arguments that change the return type of a function. But I agree that having spec_csv(), spec_tsv() etc is going to add quite a bit of code duplication. But there are only 5 functions we'd need to do it for. But it is a frequent source of copy and paste errors. I wonder if we could have a tokenizer helper that reached into the parent environment to pluck out the arguments it needed?

jimhester · 2016-06-17T14:38:43Z

Adding column names is easy, I should have done that originally.

The main issue with printing columns by default is I think it would get unwieldy when you are reading a file with 100+ columns.

That is a good point about the return type polymorphism. I will try some ideas to avoid too much duplication.

hadley · 2016-06-17T14:53:48Z

We could only print (say) the first twenty names by default.

Another option would be to return the spec as an attribute (like problems), but then of course you'd need to wait until you'd parsed the whole thing successfully.

jennybc · 2016-06-17T19:20:20Z

It seems like this is about #314, as much or more than #304 or even #237? BTW this will be so useful.

jimhester · 2016-06-17T19:44:46Z

@jennybc Sorry 304 was a typo, should be #314 as you said!

jimhester · 2016-06-17T20:58:37Z

All the read_*() functions include an attribute with the col_spec now, which can be retrieved with spec(). There are also spec_*() functions for each read_*() function.

I also added a guess_max parameter to allow users to specify exactly how many lines they wanted to use when guessing, and changed the default value of n_max from -1 to Inf, which allows an easy definition of the guess_max default.

The spec functions are all defined to call the read functions with n_max = 0, so they only end up reading the file once.
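As a rough sketch of how guess_max might be used with a spec function (assuming readr is installed; the file contents are invented for the example): the first rows of the column look numeric and a later row does not, so the guessed type can depend on how many rows are inspected. Which rows are sampled is up to the readr version in use, so no particular output is claimed here.

```r
library(readr)

# Illustrative file: the first rows of `x` look numeric, a later row does not.
path <- tempfile(fileext = ".csv")
writeLines(c("x", "1", "2", "3", "4", "not a number"), path)

# Guess from at most 2 rows versus up to 1000 rows.
spec_few  <- spec_csv(path, guess_max = 2)
spec_many <- spec_csv(path, guess_max = 1000)

# Compare the two guesses by printing them.
spec_few
spec_many
```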

write_csv(mtcars, "mtcars.csv")
data <- read_csv("mtcars.csv")
#> cols(
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_integer(),
#>   am = col_integer(),
#>   gear = col_integer(),
#>   carb = col_integer())
data
#> # A tibble: 32 x 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#> 1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
#> 2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
#> 3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
#> 4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
#> 5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
#> 6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
#> 7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
#> 8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
#> 9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
#> 10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
#> ... with 22 more rows

# Every table returned has a spec attribute
s <- spec(data)
s
#> cols(
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_integer(),
#>   am = col_integer(),
#>   gear = col_integer(),
#>   carb = col_integer())

# Alternatively you can use a spec function instead, which will only read the
# first 1000 rows (user configurable with guess_max)
s <- spec_csv("mtcars.csv")
#> cols(
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_integer(),
#>   am = col_integer(),
#>   gear = col_integer(),
#>   carb = col_integer())
s
#> cols(
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_integer(),
#>   am = col_integer(),
#>   gear = col_integer(),
#>   carb = col_integer())

# If the spec has a default of skip, it is printed using cols_only()
s$default <- col_skip()
s
#> cols_only(
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_integer(),
#>   am = col_integer(),
#>   gear = col_integer(),
#>   carb = col_integer())

# Otherwise set the default to the proper type
s$default <- col_character()
s
#> cols(.default = col_character(),
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_integer(),
#>   am = col_integer(),
#>   gear = col_integer(),
#>   carb = col_integer())

# The print method takes a n parameter to return only that number of columns
print(s, n=5)
#> cols(.default = col_character(),
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double()
#>   # ... with 6 more columns
#> )

# When reading this is set to 20 by default, set options("readr.num_columns" = x) to change
options("readr.num_columns" = 5)
data <- read_csv("mtcars.csv")
#> cols(
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double()
#>   # ... with 6 more columns
#> )

hadley · 2016-06-17T21:47:21Z

I like it! But I think it's sufficiently complex and useful that it probably needs to go in the column types vignette.

@earino would love your thoughts

earino · 2016-07-02T01:50:28Z

Cool! A few thoughts:

I think this is a better version of the thing I was imagining.
If I want to share spec objects between R processes, is the idea that I save them out as an rda and pass them around in that fashion?
The examples didn't have dates in them, does it work for that as well?

hadley · 2016-07-05T14:32:31Z

@earino you could either save as RDS or print and copy and paste into your code.
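A minimal sketch of the RDS route, which also touches on the date question above. It assumes readr is installed; the file names and the `id`/`when` columns are invented for the example.

```r
library(readr)

# Illustrative data with an ISO 8601 date column (names are made up).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,when", "1,2016-06-15", "2,2016-07-05"), csv_path)

# Guess the specification without keeping the data.
s <- spec_csv(csv_path)

# Save the spec in one R process ...
rds_path <- tempfile(fileext = ".rds")
saveRDS(s, rds_path)

# ... restore it in another, then enforce it when reading.
s2 <- readRDS(rds_path)
dat <- read_csv(csv_path, col_types = s2)

class(dat$when)
```

The copy-and-paste route is the same idea, except that the printed cols(...) call is pasted into the script as the col_types value instead of being loaded from disk.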

jimhester · 2016-07-05T20:25:34Z

Ok just added some tests, documentation and news for this, should be ready unless you spot something that needs to change.

hadley · 2016-07-05T21:32:30Z

One last thing - I realised it would be useful to reference spec() when there are any problems. Could you tweak the message?

Otherwise looks good to merge.

jimhester · 2016-07-06T14:24:16Z

Ok I added a line about using spec().

read_csv("a,b,c\nd,e,3", col_types='iii')
#> Warning: 2 parsing failures.
#> See spec(...) for column specifications used.
#> row col   expected actual
#>   1   a an integer      d
#>   1   b an integer      e
#> # A tibble: 1 x 3
#>       a     b     c
#>   <int> <int> <int>
#> 1    NA    NA     3
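A short sketch of how that hint might be followed up after a failed parse, assuming readr is installed. The temporary file simply reproduces the input from the example above; the warning is suppressed only to keep the sketch quiet.

```r
library(readr)

# Illustrative file reproducing the input from the example above.
path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "d,e,3"), path)

# Force all three columns to integer; "d" and "e" cannot be parsed.
df <- suppressWarnings(read_csv(path, col_types = "iii"))

# The failures are recorded on the result ...
problems(df)

# ... and spec() shows which column types were actually used.
spec(df)
```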

hadley · 2016-07-06T14:44:33Z

Perfect!

jimhester added the in progress label Jun 15, 2016

jimhester force-pushed the feature/314 branch from c64bd63 to 5c62790 Compare June 16, 2016 13:01

jimhester force-pushed the feature/314 branch from bb1ab1e to ff188ae Compare June 17, 2016 18:43

jimhester force-pushed the feature/314 branch from ff188ae to e261853 Compare June 17, 2016 19:45

jimhester added 14 commits July 5, 2016 16:16

Add a return_spec argument to read functions

8c895a1

Add as.character method for col_spec

6acde55

Print names as well

187ad2d

Fix signature of as.character.col_spec

04c17ca

Attach a spec to the returned object

7d9f733

Export spec_* functions for each read_* function

Explicitly set progress = FALSE when running tests

e852a26

Minor code cleanup

6d59d6c

Add a guess_max parameter and switch the default for n_max to Inf

bbf6796

Add some tests for col_spec printing

a8f4e68

Handle arguments properly

d8ef44f

Handle case with unnamed skipped columns

3c08694

printing a spec with n = 0 produces no output

361081a

Test for cols_only() printing

40d4bde

Add col_spec() information to the documentation and vignette

d036235

Add spec information to the column-type vignette

f1e6708

jimhester force-pushed the feature/314 branch from 0218247 to f1e6708 Compare July 5, 2016 20:17

Add note to NEWS

daf9588

Add reference to spec() in problem warning

11e9459

jimhester merged commit ce24a73 into tidyverse:master Jul 6, 2016

jimhester mentioned this pull request Jul 7, 2016

Make it easy to catch mutations in the data by emitting col_types string #314

Closed
