Add a return_spec argument to read functions #437
Reading this makes me realise that I forgot to describe an important part of the workflow that I was imagining. I was thinking you'd often run it once to make the col-guesses explicit, and then you'd modify that code/text to fix any detection errors and specify the correct columns types. So I think that makes having a code/text rendering quite important. |
Also I think to start with I'd rather have this be a separate function, i.e. |
I feel like that workflow works better with this implementation than the proposed one; it is very easy to subset the object and change the columns after returning the col_spec object. I also added a
write_csv(mtcars, "mtcars.csv")
spec <- read_csv("mtcars.csv", return_spec = TRUE)
spec
#> <col_spec>
#> * mpg: double
#> * cyl: integer
#> * disp: double
#> * hp: integer
#> * drat: double
#> * wt: double
#> * qsec: double
#> * vs: integer
#> * am: integer
#> * gear: integer
#> * carb: integer
#> * default: guess
## Oh actually cylinders should be a factor
spec$cols$cyl <- col_factor(c("4", "6", "8"))
spec
#> <col_spec>
#> * mpg: double
#> * cyl: factor
#> * disp: double
#> * hp: integer
#> * drat: double
#> * wt: double
#> * qsec: double
#> * vs: integer
#> * am: integer
#> * gear: integer
#> * carb: integer
#> * default: guess
data <- read_csv("mtcars.csv", col_types = spec)
data
#> <tibble [32 x 11]>
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <fctr> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#> ... with 22 more rows
# alternatively can use the spec as a character
cat(as.character(spec))
#> cols(
#> col_double(),
#> col_factor(levels = c("4", "6", "8"), ordered = FALSE),
#> col_double(),
#> col_integer(),
#> col_double(),
#> col_double(),
#> col_double(),
#> col_integer(),
#> col_integer(),
#> col_integer(),
#> col_integer())
data2 <- read_csv("mtcars.csv",
col_types = cols(
col_double(),
col_factor(levels = c("4", "6", "8"), ordered = FALSE),
col_double(),
col_integer(),
col_double(),
col_double(),
col_double(),
col_integer(),
col_integer(),
col_integer(),
col_integer()))
all.equal(data, data2)
#> [1] TRUE
We can do separate I went with the argument mainly to decrease the maintenance burden, since you don't have to replicate (and keep in sync) the setup logic in two different places. I also think the argument is slightly more discoverable than a separate function, but that is largely subjective. |
I think most people will prefer manipulating the character strings, but I do like how this gives you both interfaces. How hard would it be to generate a named list? That would make it easier to work with, and then if you wanted to only read a few cols, you could delete the lines you don't want and switch to Maybe the default could be to print it out whenever any columns are guessed? That's nice because it tells you exactly how readr is reading it in, and if you're loading a big file and spot a problem you can ctrl + break, fix the problem and start again. I really dislike arguments that change the return type of a function. But I agree that having |
Adding column names is easy, I should have done that originally. The main issue with printing columns by default is I think it would get unwieldy when you are reading a file with 100+ columns. That is a good point about the return type polymorphism. I will try some ideas to avoid too much duplication. |
We could only print (say) the first twenty names by default. Another option would be to return the spec as an attribute (like problems), but then of course you'd need to wait until you'd parsed the whole thing successfully. |
All the I also added a The spec functions are all defined to call the read functions with n_max = 0, so they only end up reading the file once.
write_csv(mtcars, "mtcars.csv")
data <- read_csv("mtcars.csv")
#> cols(
#> mpg = col_double(),
#> cyl = col_integer(),
#> disp = col_double(),
#> hp = col_integer(),
#> drat = col_double(),
#> wt = col_double(),
#> qsec = col_double(),
#> vs = col_integer(),
#> am = col_integer(),
#> gear = col_integer(),
#> carb = col_integer())
data
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#> ... with 22 more rows
# Every table returned has a spec attribute
s <- spec(data)
s
#> cols(
#> mpg = col_double(),
#> cyl = col_integer(),
#> disp = col_double(),
#> hp = col_integer(),
#> drat = col_double(),
#> wt = col_double(),
#> qsec = col_double(),
#> vs = col_integer(),
#> am = col_integer(),
#> gear = col_integer(),
#> carb = col_integer())
# Alternatively you can use a spec function instead, which will only read the
# first 1000 rows (user configurable with guess_max)
s <- spec_csv("mtcars.csv")
#> cols(
#> mpg = col_double(),
#> cyl = col_integer(),
#> disp = col_double(),
#> hp = col_integer(),
#> drat = col_double(),
#> wt = col_double(),
#> qsec = col_double(),
#> vs = col_integer(),
#> am = col_integer(),
#> gear = col_integer(),
#> carb = col_integer())
s
#> cols(
#> mpg = col_double(),
#> cyl = col_integer(),
#> disp = col_double(),
#> hp = col_integer(),
#> drat = col_double(),
#> wt = col_double(),
#> qsec = col_double(),
#> vs = col_integer(),
#> am = col_integer(),
#> gear = col_integer(),
#> carb = col_integer())
# If the spec has a default of skip then it uses cols_only
s$default <- col_skip()
s
#> cols_only(
#> mpg = col_double(),
#> cyl = col_integer(),
#> disp = col_double(),
#> hp = col_integer(),
#> drat = col_double(),
#> wt = col_double(),
#> qsec = col_double(),
#> vs = col_integer(),
#> am = col_integer(),
#> gear = col_integer(),
#> carb = col_integer())
# Otherwise set the default to the proper type
s$default <- col_character()
s
#> cols(.default = col_character(),
#> mpg = col_double(),
#> cyl = col_integer(),
#> disp = col_double(),
#> hp = col_integer(),
#> drat = col_double(),
#> wt = col_double(),
#> qsec = col_double(),
#> vs = col_integer(),
#> am = col_integer(),
#> gear = col_integer(),
#> carb = col_integer())
# The print method takes an n parameter to print only that number of columns
print(s, n=5)
#> cols(.default = col_character(),
#> mpg = col_double(),
#> cyl = col_integer(),
#> disp = col_double(),
#> hp = col_integer(),
#> drat = col_double()
#> # ... with 6 more columns
#> )
# When reading this is set to 20 by default, set options("readr.num_columns" = x) to change
options("readr.num_columns" = 5)
data <- read_csv("mtcars.csv")
#> cols(
#> mpg = col_double(),
#> cyl = col_integer(),
#> disp = col_double(),
#> hp = col_integer(),
#> drat = col_double()
#> # ... with 6 more columns
#> ) |
I like it! But I think it's sufficiently complex and useful that it probably needs to go in the column types vignette. @earino would love your thoughts |
Cool! A few thoughts:
|
@earino you could either save as RDS or print and copy and paste into your code. |
Export spec_* functions for each read_* function
Ok just added some tests, documentation and news for this, should be ready unless you spot something that needs to change. |
One last thing - I realised it would be useful to reference Otherwise looks good to merge. |
Ok I added a line about using spec().
read_csv("a,b,c\nd,e,3", col_types='iii')
#> Warning: 2 parsing failures.
#> See spec(...) for column specifications used.
#> row col expected actual
#> 1 a an integer d
#> 1 b an integer e
#> # A tibble: 1 x 3
#> a b c
#> <int> <int> <int>
#> 1 NA NA 3 |
Perfect! |
This tries to address #314 slightly differently than proposed.
It adds a return_spec argument to the read_* functions which, rather than reading the files, just returns the col_spec object that is generated before reading the file.
This object can then be used in subsequent calls as the value for the col_types argument, which will allow you to enforce a specification on other datasets with the same specifications and such.
These objects can be saved and retrieved with saveRDS()/readRDS() and already have formatting methods written.
We could also write a function to take the object and produce an expression that could be used to recreate the object, but I am not sure that is more useful or informative than just working with the objects directly.
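A minimal sketch of the save-and-reuse round trip described above. It assumes the spec_csv() function that the thread settled on rather than the originally proposed return_spec argument, and the temporary file paths are hypothetical, chosen only for illustration.

```r
library(readr)

# Hypothetical temporary paths, used only for this illustration.
csv_path  <- tempfile(fileext = ".csv")
spec_path <- tempfile(fileext = ".rds")

write_csv(mtcars, csv_path)

# Guess the column specification without keeping the data.
s <- spec_csv(csv_path)

# Persist the specification, then restore it (for example in a later session).
saveRDS(s, spec_path)
restored <- readRDS(spec_path)

# Reuse the restored specification to read a file with the same layout.
data <- read_csv(csv_path, col_types = restored)
```

Because the restored object is an ordinary col_spec, individual columns can still be edited before it is passed to col_types, as in the examples earlier in the thread.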