Revise col-types vignette (#1448)
* Revise col-types vignette

* Apply suggestions from code review
hadley committed Nov 9, 2022
1 parent 41447ce commit ecbdc3b
Showing 8 changed files with 39 additions and 120 deletions.
1 change: 1 addition & 0 deletions R/read_delim.R
@@ -72,6 +72,7 @@ NULL
#' supplied any commented lines are ignored _after_ skipping.
#' @param n_max Maximum number of lines to read.
#' @param guess_max Maximum number of lines to use for guessing column types.
#'   Will never use more than the number of lines read.
#' See `vignette("column-types", package = "readr")` for more details.
#' @param progress Display a progress bar? By default it will only display
#' in an interactive session and not while knitting a document. The automatic
1 change: 1 addition & 0 deletions man/melt_table.Rd

1 change: 1 addition & 0 deletions man/read_delim.Rd

1 change: 1 addition & 0 deletions man/read_delim_chunked.Rd

1 change: 1 addition & 0 deletions man/read_fwf.Rd

1 change: 1 addition & 0 deletions man/read_table.Rd

1 change: 1 addition & 0 deletions man/spec_delim.Rd

152 changes: 32 additions & 120 deletions vignettes/column-types.Rmd
@@ -1,8 +1,8 @@
---
title: "Column type guessing"
title: "Column type"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Column type guessing}
%\VignetteIndexEntry{Column type}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
@@ -14,70 +14,29 @@ knitr::opts_chunk$set(
)
```

readr will guess column types from the data if the user does not specify the types.
The `guess_max` parameter controls how many rows of the input file are used to form these guesses.
Ideally, the column types would be completely obvious from the first non-header row and we could use `guess_max = 1`.
That would be very efficient!
But the situation is rarely so clear-cut.

By default, readr consults 1000 rows when type-guessing, i.e. `guess_max = 1000`.
Note that readr never consults rows that won't be part of the import, so the actual default is `guess_max = min(1000, n_max)`.

Sometimes you want to convey "use all of the data we're going to import to guess the column types", often without even knowing or specifying `n_max`.
How should you say that?
It's also worth discussing the possible downsides of such a request.
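
To make that default concrete, here is a minimal sketch (the file path is a placeholder and the row counts are illustrative): readr only ever guesses from rows it actually reads, so `n_max` caps the number of rows consulted.

```{r, eval = FALSE}
# Default behaviour: readr consults at most 1000 of the rows it reads
read_csv("path/to/your/file")

# If only 500 rows are read, at most 500 rows are available for guessing,
# i.e. the effective default is guess_max = min(1000, n_max) = 500
read_csv("path/to/your/file", n_max = 500)
```
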
This vignette provides an overview of column type specification with readr.
Currently it focuses on how automatic guessing works, but over time we expect to cover more topics.

```{r setup}
library(readr)
```

## readr >= 2.0.0

readr got a new parsing engine in version 2.0.0 (released July 2021).
In this so-called second edition, readr calls `vroom::vroom()` by default.
The vroom package, and therefore the second edition of readr, supports a very natural expression of "use all the data to guess": namely `guess_max = Inf`.

```{r, eval = FALSE}
read_csv("path/to/your/file", ..., guess_max = Inf)
```

Why isn't this the default?
Why not do this all the time?
Because column type guessing basically adds another pass through the data, in addition to the main parsing.

If you routinely use `guess_max = Inf`, you're basically processing every file twice, in its entirety.
If you only work with small files, this is fine.
But for larger files, this can be very costly, for relatively little benefit.
Often the column types guessed based on a subset of the file are "good enough".

Note also that `guess_max = n`, for finite `n`, works better in the second edition parser.
Due to its different design, vroom is able to sub-sample `n` rows throughout the file and it always includes the last row, whereas earlier versions of readr just consulted the first `n` rows.
In practice, the result is that the default of `guess_max = min(1000, n_max)` produces better column type guesses than it used to.
It should feel less necessary to fiddle with `guess_max` now.

<!--
https://github.com/r-lib/vroom/issues/352
## Automatic guessing

In fact vroom generally does this guessing better than readr currently does. Readr always uses the first guess_max number of rows for guessing, whereas vroom uses rows interspersed throughout the file if guess_max is less than the total number of rows.
If you don't explicitly specify column types with the `col_types` argument, readr will attempt to guess them using some simple heuristics.
By default, it will inspect 1000 values, evenly spaced from the first to the last row.
This is a heuristic designed to always be fast (no matter how large your file is) and, in our experience, does a good job in most cases.

However, as this example shows, we should probably always include the last row when guessing, so I have made a change to always include the last row in the guess in the future.
-->
If needed, you can request that readr use more rows by supplying the `guess_max` argument.
You can even supply `guess_max = Inf` to use every row to guess the column types.
You might wonder why this isn't the default.
That's because it's slow: readr has to go over the data twice, once to guess the column types and once to parse the values.
In most cases, you're best off supplying `col_types` yourself.

As always, remember that the best strategy is to provide explicit column types as any data analysis project matures past the exploratory phase.
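
For instance, here is a minimal sketch of an explicit specification (the file path and the column names `x` and `y` are placeholders, not a documented example):

```{r, eval = FALSE}
read_csv(
  "path/to/your/file",
  col_types = cols(
    x = col_double(),
    y = col_character()
  )
)
```

One convenient workflow is to let readr guess once, print the result with `spec()`, copy that specification into your script, and then adjust it as needed.
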
### Legacy behavior

<!-- Future link to a vignette on column specification or, if that content gets co-located here, link to later in this vignette. -->

## readr first edition and readr < 2.0.0

The parsing engine in readr versions prior to 2.0.0 is now called the first edition.
If you're using readr >= 2.0.0, you can still access first edition parsing via the functions `with_edition()` and `local_edition()`.
And, obviously, if you're using readr < 2.0.0, you will get first edition parsing, by definition, because that's all there is.
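
As a quick sketch of what that looks like (the file path is a placeholder):

```{r, eval = FALSE}
# Use the first edition parser for a single call
with_edition(1, read_csv("path/to/your/file"))

# Or switch the current scope (e.g. the rest of a function or script)
# to the first edition parser
local_edition(1)
read_csv("path/to/your/file")
```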

The first edition parser doesn't have a perfect way to convey "use all of the data to guess the column types".
(This is one of several reasons to prefer readr >= 2.0.0.)

Let's set up a slightly tricky file, so we can demonstrate different approaches.
The column `x` is mostly empty, but has some numeric data at the very end, in row 1001.
Column type guessing was substantially worse in the first edition of readr (meaning, prior to v2.0.0), because it always looked at the first 1000 rows, and through some application of Murphy's Law, it appears that many real csv files have lots of empty values at the start, followed by more "excitement" later in the file.
Let's demonstrate the problem with a slightly tricky file: the column `x` is mostly empty, but has some numeric data at the very end, in row 1001.

```{r}
tricky_dat <- tibble::tibble(
Expand All @@ -88,77 +47,30 @@ tfile <- tempfile("tricky-column-type-guessing-", fileext = ".csv")
write_csv(tricky_dat, tfile)
```

First, note that the second edition parser guesses the right type for `x`, even with the default `guess_max` behaviour.
The first edition parser doesn't guess the right type for `x`, so the `2` becomes an `NA`:

```{r}
tail(read_csv(tfile))
df <- with_edition(1, read_csv(tfile))
tail(df)
```

In contrast, the first edition parser doesn't guess the right type for `x` with the `guess_max` default.
`x` is imported as logical and the `2` becomes an `NA`.
For this specific case, we can fix the problem by marginally increasing `guess_max`:

```{r}
with_edition(
1,
tail(read_csv(tfile))
)
df <- with_edition(1, read_csv(tfile, guess_max = 1001))
tail(df)
```

Unlike the 2nd edition, we don't recommend using `guess_max = Inf` with the legacy parser, because the engine pre-allocates a large amount of memory in the face of this uncertainty.
This means that reading with `guess_max = Inf` can be extremely slow and might even crash your R session.
Instead, specify the column types explicitly with `col_types`:

```{r}
df <- with_edition(1, read_csv(tfile, col_types = list(x = col_double())))
tail(df)
```

There are three ways to proceed, each of which has some downside:

* Specify `guess_max = Inf`, just like we do for the second edition parser.

Since readr does not know how much data it will be processing, the first
edition engine pre-allocates a large amount of memory in the face of this
uncertainty.
This means that reading with `guess_max = Inf` can be extremely slow and
might even crash your R session.

```{r}
with_edition(
1,
tail(read_csv(tfile, guess_max = Inf))
)
```

* Specify an actual, non-infinite value for `guess_max`.

This is an awkward suggestion, because if you knew how many rows there were,
we wouldn't be having this conversation in the first place.
But sometimes you have a decent estimate and can choose a value of `guess_max`
that is "big enough".
This usually results in much better performance than `guess_max = Inf`.

```{r}
with_edition(
1,
tail(read_csv(tfile, guess_max = 1200))
)
```

* Read all columns as character, then use `type_convert()`.
This is a bit clunky, since it obligates you to do some post-processing once you've brought your data into R.

```{r}
dat_chr <- with_edition(
1,
read_csv(tfile, col_types = cols(.default = col_character()))
)
tail(dat_chr)
dat <- type_convert(dat_chr)
tail(dat)
```

<!--
https://github.com/tidyverse/readr/issues/982
https://github.com/tidyverse/readr/issues/588
Using type_convert() is another approach; however, this requires an extra step (or a few extra steps) after reading in the data.
-->

Clean up the temporary tricky csv file.
```{r}
#| include: false
file.remove(tfile)
```
