Add a return_spec argument to read functions #437

Merged
merged 17 commits into from Jul 6, 2016

jimhester
Collaborator

@jimhester jimhester commented Jun 15, 2016

This tries to address #314 slightly differently than proposed.

It adds a return_spec argument to the read_* functions which, rather than reading the file, just returns the col_spec object that is generated before the file is read.

This object can then be passed as the col_types argument in subsequent calls, which lets you enforce the same specification on other datasets that share the same column layout.

These objects can be saved and retrieved with saveRDS()/readRDS() and already have formatting methods written.

We could also write a function to take the object and produce an expression that could be used to recreate the object, but I am not sure that is more useful or informative than just working with the objects directly.

@hadley
Member

hadley commented Jun 17, 2016

Reading this makes me realise that I forgot to describe an important part of the workflow that I was imagining. I was thinking you'd often run it once to make the col-guesses explicit, and then you'd modify that code/text to fix any detection errors and specify the correct columns types. So I think that makes having a code/text rendering quite important.

@hadley
Member

hadley commented Jun 17, 2016

Also I think to start with I'd rather have this be a separate function, i.e. spec_csv, spec_tsv etc.

@jimhester
Collaborator Author

I feel like that workflow works better with this implementation than the proposed one: it is very easy to subset the object and change the columns after returning the col_spec object. I also added an as.character() method to print an R expression that would recreate the spec object.

write_csv(mtcars, "mtcars.csv")
spec <- read_csv("mtcars.csv", return_spec = TRUE)
spec
#> <col_spec>
#> * mpg: double
#> * cyl: integer
#> * disp: double
#> * hp: integer
#> * drat: double
#> * wt: double
#> * qsec: double
#> * vs: integer
#> * am: integer
#> * gear: integer
#> * carb: integer
#> * default: guess

## Oh actually cylinders should be a factor
spec$cols$cyl <- col_factor(c("4", "6", "8"))
spec
#> <col_spec>
#> * mpg: double
#> * cyl: factor
#> * disp: double
#> * hp: integer
#> * drat: double
#> * wt: double
#> * qsec: double
#> * vs: integer
#> * am: integer
#> * gear: integer
#> * carb: integer
#> * default: guess

data <- read_csv("mtcars.csv", col_types = spec)
data
#> <tibble [32 x 11]>
#>      mpg    cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <fctr> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#> 1   21.0      6 160.0   110  3.90 2.620 16.46     0     1     4     4
#> 2   21.0      6 160.0   110  3.90 2.875 17.02     0     1     4     4
#> 3   22.8      4 108.0    93  3.85 2.320 18.61     1     1     4     1
#> 4   21.4      6 258.0   110  3.08 3.215 19.44     1     0     3     1
#> 5   18.7      8 360.0   175  3.15 3.440 17.02     0     0     3     2
#> 6   18.1      6 225.0   105  2.76 3.460 20.22     1     0     3     1
#> 7   14.3      8 360.0   245  3.21 3.570 15.84     0     0     3     4
#> 8   24.4      4 146.7    62  3.69 3.190 20.00     1     0     4     2
#> 9   22.8      4 140.8    95  3.92 3.150 22.90     1     0     4     2
#> 10  19.2      6 167.6   123  3.92 3.440 18.30     1     0     4     4
#> ... with 22 more rows

# alternatively, render the spec as a character string
cat(as.character(spec))
#> cols(
#>   col_double(),
#>   col_factor(levels = c("4", "6", "8"), ordered = FALSE),
#>   col_double(),
#>   col_integer(),
#>   col_double(),
#>   col_double(),
#>   col_double(),
#>   col_integer(),
#>   col_integer(),
#>   col_integer(),
#>   col_integer())

data2 <- read_csv("mtcars.csv",
  col_types = cols(
    col_double(),
    col_factor(levels = c("4", "6", "8"), ordered = FALSE),
    col_double(),
    col_integer(),
    col_double(),
    col_double(),
    col_double(),
    col_integer(),
    col_integer(),
    col_integer(),
    col_integer()))

all.equal(data, data2)
#> [1] TRUE

We can do separate spec_* functions, that was actually my first prototype of this.

I went with the argument mainly to decrease the maintenance burden, since you don't have to replicate (and keep in sync) the setup logic in two different places.

I also think the argument is slightly more discoverable than a separate function, but that is largely subjective.

@hadley
Member

hadley commented Jun 17, 2016

I think most people will prefer manipulating the character strings, but I do like how this gives you both interfaces.

How hard would it be to generate a named list? That would make it easier to work with, and then if you wanted to only read a few cols, you could delete the lines you don't want and switch to cols_only().

Maybe the default could be to print it out whenever any columns are guessed? That's nice because it tells you exactly how readr is reading it in, and if you're loading a big file and spot a problem you can ctrl + break, fix the problem and start again.

I really dislike arguments that change the return type of a function. I agree that having spec_csv(), spec_tsv() etc. is going to add quite a bit of code duplication, although there are only 5 functions we'd need to do it for. Still, that kind of duplication is a frequent source of copy and paste errors. I wonder if we could have a tokenizer helper that reached into the parent environment to pluck out the arguments it needed?
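
The cols_only() suggestion above can be sketched as follows. This is a minimal illustration only, assuming a current readr release; the temporary file and the choice of columns are invented for the example and are not part of the pull request.

```r
library(readr)

# Write a small example file (a temporary path, purely for illustration).
path <- tempfile(fileext = ".csv")
write_csv(mtcars, path)

# Keep only the spec lines for the columns you want and switch
# cols() to cols_only(); every column not listed is skipped.
subset_spec <- cols_only(
  mpg = col_double(),
  cyl = col_integer()
)

subset_data <- read_csv(path, col_types = subset_spec)
names(subset_data)
```

Because the specification is a named list of collectors, dropping a column is just a matter of deleting its line.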

@jimhester
Collaborator Author

Adding column names is easy, I should have done that originally.

The main issue with printing columns by default is that I think it would get unwieldy when you are reading a file with 100+ columns.

That is a good point about the return type polymorphism. I will try some ideas to avoid too much duplication.

@hadley
Member

hadley commented Jun 17, 2016

We could only print (say) the first twenty names by default.

Another option would be to return the spec as an attribute (like problems), but then of course you'd need to wait until you'd parsed the whole thing successfully.

@jennybc
Member

jennybc commented Jun 17, 2016

It seems like this is about #314, as much or more than #304 or even #237? BTW this will be so useful.

@jimhester
Collaborator Author

@jennybc Sorry 304 was a typo, should be #314 as you said!

@jimhester
Collaborator Author

All the read_*() functions include an attribute with the col_spec now, which can be retrieved with spec(). There are also spec_*() functions for each read_*() function.

I also added a guess_max parameter to allow users to specify exactly how many lines they wanted to use when guessing, and changed the default value of n_max from -1 to Inf, which allows an easy definition of the guess_max default.

The spec functions are all defined to call the read functions with n_max = 0, so they only end up reading the file once.
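
As a rough sketch of how the pieces described above fit together, assuming the merged spec_csv() function and its guess_max argument behave as described in this comment (the temporary path and the value 10 are invented for the example):

```r
library(readr)

# Write a small example file (a temporary path, purely for illustration).
path <- tempfile(fileext = ".csv")
write_csv(mtcars, path)

# Guess the column specification without returning any data;
# guess_max caps how many rows are used for guessing.
quick_spec <- spec_csv(path, guess_max = 10)

# Reuse the guessed specification for the real read, so no
# further guessing takes place.
full_data <- read_csv(path, col_types = quick_spec)
```

A small guess_max is faster on large files but more likely to miss values that only appear further down, so it is a trade-off rather than a safe default.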

write_csv(mtcars, "mtcars.csv")
data <- read_csv("mtcars.csv")
#> cols(
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_integer(),
#>   am = col_integer(),
#>   gear = col_integer(),
#>   carb = col_integer())
data
#> # A tibble: 32 x 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#> 1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
#> 2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
#> 3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
#> 4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
#> 5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
#> 6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
#> 7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
#> 8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
#> 9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
#> 10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
#> ... with 22 more rows

# Every table returned has a spec attribute
s <- spec(data)
s
#> cols(
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_integer(),
#>   am = col_integer(),
#>   gear = col_integer(),
#>   carb = col_integer())

# Alternatively you can use a spec function instead, which will only read the
# first 1000 rows (user configurable with guess_max)
s <- spec_csv("mtcars.csv")
#> cols(
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_integer(),
#>   am = col_integer(),
#>   gear = col_integer(),
#>   carb = col_integer())
s
#> cols(
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_integer(),
#>   am = col_integer(),
#>   gear = col_integer(),
#>   carb = col_integer())

# If the spec has a default of col_skip(), it is printed as cols_only()
s$default <- col_skip()
s
#> cols_only(
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_integer(),
#>   am = col_integer(),
#>   gear = col_integer(),
#>   carb = col_integer())

# Otherwise set the default to the proper type
s$default <- col_character()
s
#> cols(.default = col_character(),
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_integer(),
#>   am = col_integer(),
#>   gear = col_integer(),
#>   carb = col_integer())

# The print method takes an n parameter to print only that number of columns
print(s, n=5)
#> cols(.default = col_character(),
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double()
#>   # ... with 6 more columns
#> )

# When reading, this is set to 20 by default; set options("readr.num_columns" = x) to change it
options("readr.num_columns" = 5)
data <- read_csv("mtcars.csv")
#> cols(
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double()
#>   # ... with 6 more columns
#> )

@hadley
Member

hadley commented Jun 17, 2016

I like it! But I think it's sufficiently complex and useful that it probably needs to go in the column types vignette.

@earino would love your thoughts

@earino

earino commented Jul 2, 2016

Cool! A few thoughts:

  1. I think this is a better version of the thing I was imagining.
  2. If I want to share spec objects between R processes, is the idea that I save them out as an rda and pass them around in that fashion?
  3. The examples didn't have dates in them, does it work for that as well?
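
The question about dates is not answered directly in the thread. A minimal sketch, assuming a current readr release in which col_date() collectors can sit in a specification like any other column type (the file contents and names are invented for the example):

```r
library(readr)

# A tiny CSV with an ISO 8601 date column (invented data).
path <- tempfile(fileext = ".csv")
writeLines(c("day,n", "2016-07-01,1", "2016-07-02,2"), path)

# An explicit specification that includes a date collector.
date_spec <- cols(
  day = col_date(format = "%Y-%m-%d"),
  n = col_integer()
)

dated <- read_csv(path, col_types = date_spec)
class(dated$day)

# The spec attached to the result still records the date column.
spec(dated)
```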

@hadley
Member

hadley commented Jul 5, 2016

@earino you could either save as RDS or print and copy and paste into your code.
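
Both options mentioned above can be sketched as follows. This is only an illustration, assuming the merged spec_csv() function; the temporary paths are invented for the example.

```r
library(readr)

# Write a small example file (a temporary path, purely for illustration).
path <- tempfile(fileext = ".csv")
write_csv(mtcars, path)

col_spec <- spec_csv(path)

# Option 1: serialise the spec object and load it in another R process.
spec_file <- tempfile(fileext = ".rds")
saveRDS(col_spec, spec_file)
restored <- readRDS(spec_file)
shared <- read_csv(path, col_types = restored)

# Option 2: print the spec and paste the resulting cols(...) call
# into the script that reads the data.
print(restored)
```

The RDS route keeps the spec out of the source code, while the copy-and-paste route makes the column types visible and editable in the script itself.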

@jimhester
Collaborator Author

Ok just added some tests, documentation and news for this, should be ready unless you spot something that needs to change.

@hadley
Member

hadley commented Jul 5, 2016

One last thing - I realised it would be useful to reference spec() when there are any problems. Could you tweak the message?

Otherwise looks good to merge.

@jimhester
Collaborator Author

Ok I added a line about using spec().

read_csv("a,b,c\nd,e,3", col_types='iii')
#> Warning: 2 parsing failures.
#> See spec(...) for column specifications used.
#> row col   expected actual
#>   1   a an integer      d
#>   1   b an integer      e
#> # A tibble: 1 x 3
#>       a     b     c
#>   <int> <int> <int>
#> 1    NA    NA     3
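
A short sketch of following up on that warning, assuming a current readr release; problems() is the existing accessor mentioned earlier in the thread, and the temporary file simply reproduces the data from the example above.

```r
library(readr)

# Same data as the example above, written to a temporary file.
path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "d,e,3"), path)

# Forcing integer columns produces parsing failures for "d" and "e".
parsed <- suppressWarnings(read_csv(path, col_types = "iii"))

problems(parsed)  # which cells failed to parse
spec(parsed)      # the column specification that was used
```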

@hadley
Member

hadley commented Jul 6, 2016

Perfect!


4 participants