Rework `problems` example in "Getting Started" #1431

sbearrows · 2022-09-02T17:22:28Z

Modifies the challenge.csv file so that it actually reports some parsing problems. Also, updates some of the wording around spec_csv() and guess_max

This PR doesn't aim to fix all of the issues with the "Getting Started" page and in fact we already have a PR open for that #1364. It only deals with making it better than before and fixing things that directly relate to challenge.csv.

sbearrows · 2022-09-02T23:29:15Z

vignettes/readr.Rmd

@@ -142,15 +142,6 @@ mtcars_spec
 cols_condense(mtcars_spec)
 ```

-
-By default readr only looks at the first 1000 rows. This keeps file parsing speedy, but can generate incorrect guesses. For example, in `challenge.csv` the column types change in row 1001, so readr guesses the wrong types. One way to resolve the problem is to increase the number of rows:
-


This is no longer true so it's been removed. Now, readr (via vroom) selects semi-randomized rows to select for guess, and always includes the last row for guessing. So increasing guess_max won't necessarily resolve problems in this case.

https://github.com/tidyverse/vroom/blob/548344f4826818ca988e80b90cdf7fd837a90561/src/collectors.h#L161-L189

I'm not reading the whole vignette at once, just these changed snippets. So you have a better sense of flow and what's currently included / dropped / moved.

But I wonder if, instead of deleting this text entirely, we should update it to make it correct. I.e. still make sure the user knows we default to using 1000 rows, which helps with speed but can lead to incorrect guesses. And that increasing that number might help, but the best solution is to specify the column types.

The rows are not selected in a random fashion. They're basically evenly spaced within the file and the last row is always included for guessing.

https://github.com/tidyverse/vroom/blob/0e2df4cc8a07fbb007206f6bfcc6a3732e6c4613/src/collectors.h#L164

I know, but I felt like being overly specific would be more confusing. I can modify to say "interspersed through-out the file and always includes the first and last row" instead.

Actually, this isn't technically true since guess_max doesn't guess column types using rows it will never read and once we complete #1436 then n_max = guess_max. So if guess_max = 1500 then n_max = 1500 and it will guess using the first 1500 rows in the data. So spec_*() guessing will never be interspersed throughout the file, it'll always use all of the data it is given to guess. Here is an example using the original challenge.csv where the column y is filled with NA until row 1001.

# after we complete #1436 spec_csv(readr_example("challenge.csv"), guess_max = 1000) # n_max = 1000 #> cols( #> x = col_double(), #> y = col_logical() #> ) # increasing guess_max does change the spec spec_csv(readr_example("challenge.csv"), guess_max = 1001) # n_max = 1001 #> cols( #> x = col_double(), #> y = col_date(format = "") #> ) # compared to how guessing works in read_csv # since it's dispersed throughout the file, it guesses correctly spec(read_csv(readr_example("challenge.csv"), guess_max = 1000, show_col_types = FALSE)) # n_max = all the data #> cols( #> x = col_double(), #> y = col_date(format = "") #> )

So, how specific should the description for guessing be?

The statement should not be technically misleading (e.g. suggesting that it's random and might select different rows each time you call it), but otherwise it's fine to stay high-level. So, yes, something along these lines: "interspersed through-out the file and always includes the first and last row".

Recall that we did create an entire new vignette on column type guessing, in case it makes sense to link to that (or borrow test from that), for more details: https://readr.tidyverse.org/articles/column-types.html.

To zoom out, I think a guiding principle is to avoid making this text so specific that it's tightly coupled with detail work like #1436. What level of understanding and detail is useful for the typical user?

The observation above about n_max and guess_max in spec() brings up an important question to think about over in #1436. Should spec_*() produce the same column type guesses as read_*() at default settings?

vignettes/readr.Rmd

sbearrows · 2022-09-02T23:33:05Z

vignettes/readr.Rmd

-```{r}
-spec(df1)
-spec(df2)
+In this example, one of the dates is in an incorrect format.


This could use some garnishing

What do you have in mind?

vignettes/readr.Rmd

sbearrows · 2022-09-02T23:47:01Z

vignettes/readr.Rmd

+df1 <- read_csv(readr_example("challenge.csv"), col_types = list(
+  date = col_date(),
+  value = col_double()
+))


Note that this would still generate parsing errors even if we don't include/force col_types so we could just have the following if we think it makes for a better narrative:

df1 <- read_csv(readr_example("challenge.csv"))

But I think it does drive the point across that supplying a col_spec is safer.

vignettes/readr.Rmd

sbearrows · 2022-09-03T00:21:08Z

Notes:

No NEWS bullet was added because AFAIR we don't add NEWS bullets for updating docs
I believe everything I have touched and made changes to, is consistent with behaviors in vroom. That doesn't mean the entire document is consistent with vroom behaviors though.
There are a few other documentation issues we have coming up that we could tack on to this PR if we decide they should go in "Get Started": Read multiple file example #1369, Clarify docs for parse_number() behavior with non-numeric characters #1424, Doc suggestion: demonstrate use of col_select with col_types #1377

jennybc

I made some comments inline. Overall, it's looking good: removing lots of incorrect content and updating it.

Zooming out:

Can you describe what changes you made to challenge.csv? And how does that change what we see in the rendered vignette? I can't tell that from reading this PR, i.e. I would have to fetch the PR and render the vignette.

One other concern: changing a dataset is potentially a breaking change. I.e. if someone, somewhere, uses the existing challenge.csv dataset, this will break their code. Did you do any research into this? For example, searching CRAN packages for usage of this dataset and just googling for "challenge.csv".

I could imagine this is basically an internal matter and it's OK to change, but we should do some due diligence. It could also show up when we run revdeps, but it's nicer to learn this sooner rather than later.

vignettes/readr.Rmd

jennybc · 2022-09-06T15:51:52Z

vignettes/readr.Rmd

@@ -142,15 +142,6 @@ mtcars_spec
 cols_condense(mtcars_spec)
 ```

-
-By default readr only looks at the first 1000 rows. This keeps file parsing speedy, but can generate incorrect guesses. For example, in `challenge.csv` the column types change in row 1001, so readr guesses the wrong types. One way to resolve the problem is to increase the number of rows:
-


I'm not reading the whole vignette at once, just these changed snippets. So you have a better sense of flow and what's currently included / dropped / moved.

But I wonder if, instead of deleting this text entirely, we should update it to make it correct. I.e. still make sure the user knows we default to using 1000 rows, which helps with speed but can lead to incorrect guesses. And that increasing that number might help, but the best solution is to specify the column types.

jennybc · 2022-09-06T15:54:46Z

vignettes/readr.Rmd

-```{r}
-spec(df1)
-spec(df2)
+In this example, one of the dates is in an incorrect format.


What do you have in mind?

vignettes/readr.Rmd

Co-authored-by: Jennifer (Jenny) Bryan <jenny.f.bryan@gmail.com>

sbearrows · 2022-09-07T18:25:28Z

@jennybc The file has a column of dates and a column of doubles. One of the dates is formatted as 12032020. All the other dates in the file are of the form 12-03-2020. I had to select a specific line where guess_max wouldn't look to get the parsing error.

I've made changes to address most of your comments but I still need to look into whether other packages are using challenge.csv.

sbearrows · 2022-09-09T00:50:20Z

@jennybc I couldn't find any other packages that use challenge.csv. The only reference to it I could find is: https://search.r-project.org/CRAN/refmans/csvwr/html/datatype_to_type.html where they use it in an example but it's set to dontrun

jennybc

I left some inline comments.

High level comment about the new challenge.csv.

Does challenge.csv need to be 2000 lines long anymore?

Originally, I guess it needed to be over 1000 lines long to make a point about column type guessing for the y column, which was missing for the first 1000 lines and then started to have dates in it. (That still doesn't fully explain why the file was 2000 lines long, but I guess that will remain a mystery.)

What is the purpose of the value column in the new challenge.csv?

I notice that the original x column was integers only for the first 1000 lines and then started to have doubles in it. So I think it must be or have been used to demonstrate something about integers vs doubles.

But I can't figure out the purpose of value in the proposed challenge.csv, which doesn't seem to have this mix of integers and doubles.

Our modern approach to creating datasets in packages is to save the code that generates the dataset below data-raw/. It helps with exercises like this one, i.e. it shows how to regenerate a dataset and (hopefully) leave some breadcrumbs re: why the data is the way it is.

This is explained in the Data chapter of R Packages and, as also described there, supported by functions like use_data_raw().

jennybc · 2022-09-12T21:26:44Z

vignettes/readr.Rmd

@@ -127,7 +127,7 @@ parse_number("$1,234")

 There are two parsers that will never be guessed: `col_skip()` and `col_factor()`. You will always need to supply these explicitly.

-You can see the specification that readr would generate for a column file by using `spec_csv()`, `spec_tsv()` and so on:
+You can see the specification that readr would generate for a file by using `spec_csv()`, `spec_tsv()`, and so on.

 ```{r}
 x <- spec_csv(readr_example("challenge.csv"))


Wouldn't it make sense to reveal the spec? I.e. why are we assigning the result to x?

(I have fetched the PR and rendered the vignette now. I assume you are also looking at the rendered result as you go.)

This is how the example was when I started editing the file and I didn't see a reason to modify it since we aren't doing a total overhaul right now and the goal is to ensure that the problems() example has actual problems. Also, the behavior of spec_*() is currently in-flight #1436 so some of the content here might change.

jennybc · 2022-09-12T21:28:47Z

vignettes/readr.Rmd

@@ -127,7 +127,7 @@ parse_number("$1,234")

 There are two parsers that will never be guessed: `col_skip()` and `col_factor()`. You will always need to supply these explicitly.


Isn't it also true the col_integer() will not be guessed? I spotted this since we are working here in the immediate neighborhood.

No, I don't think so. Here is where the guessing happens https://github.com/tidyverse/vroom/blob/0c2423eb3bba77b05ef9fda1830f1f798cc2fedd/src/guess_type.cc#L128-L146

jennybc · 2022-09-12T21:34:57Z

vignettes/readr.Rmd

@@ -142,15 +142,6 @@ mtcars_spec
 cols_condense(mtcars_spec)
 ```

-
-By default readr only looks at the first 1000 rows. This keeps file parsing speedy, but can generate incorrect guesses. For example, in `challenge.csv` the column types change in row 1001, so readr guesses the wrong types. One way to resolve the problem is to increase the number of rows:
-


The rows are not selected in a random fashion. They're basically evenly spaced within the file and the last row is always included for guessing.

https://github.com/tidyverse/vroom/blob/0e2df4cc8a07fbb007206f6bfcc6a3732e6c4613/src/collectors.h#L164

jennybc · 2022-09-12T21:35:33Z

vignettes/readr.Rmd

@@ -142,14 +142,13 @@ mtcars_spec
 cols_condense(mtcars_spec)
 ```

-
-By default readr only looks at the first 1000 rows. This keeps file parsing speedy, but can generate incorrect guesses. For example, in `challenge.csv` the column types change in row 1001, so readr guesses the wrong types. One way to resolve the problem is to increase the number of rows:
+When guessing column types, readr selects rows to use for guessing semi-randomly. The default number of rows it selects is 1000. If the column type guess is incorrect, increasing `guess_max` may or may not improve the results.

 ```{r}
 x <- spec_csv(readr_example("challenge.csv"), guess_max = 1001)


Similar to above, if we're trying to make a point about the spec, we should reveal it.

vignettes/readr.Rmd

jennybc · 2022-09-12T21:42:15Z

vignettes/readr.Rmd

-#>   x = col_integer(),
-#>   y = col_character()
-#> )
+writeLines(read_lines(readr_example("challenge.csv"), skip = 1325, n_max = 3))


1325 feels a bit like a Magic Number here. It feels like we should probably get it programmatically, since we do have access to it and explain that we're inspecting the neighborhood right around the problem line.

jennybc · 2022-09-12T21:43:37Z

vignettes/readr.Rmd

 ```

-You can also access it after the fact using `spec()`:
+We can work around this parsing error by importing the date column as character and converting it to a column of dates, after the fact. The lubridate package is a great option that we show here. But since readr doesn't depend on lubridate, we won't execute this code chunk.


You actually could make this a static code chunk. That is, set it to eval = FALSE, but inline the current result. It's not a perfect solution, but the pros would seem to outweigh the cons.

I'm talking about showing folks this:

> df1[1325:1327, ] # A tibble: 3 × 2 date value <date> <dbl> 1 2008-03-16 762 2 2008-02-17 206 3 2019-01-28 1333

I reiterate this:

I think you should add some brief comment ... that it's indicative of a general type of workaround, where you import as character and use a more specialized package (or custom function) to parse a tricky column.

jennybc · 2022-09-12T21:48:57Z

vignettes/readr.Rmd

-    ```r
-    read_csv("iris.csv", col_types = list(
-      Sepal.Length = col_double(),
-      Sepal.Width = col_double(),
-      Petal.Length = col_double(),
-      Petal.Width = col_double(),
-      Species = col_factor(c("setosa", "versicolor", "virginica"))
-    ))
-    ```
+```{r, eval = FALSE}
+read_csv("iris.csv", col_types = list(
+  Sepal.Length = col_double(),
+  Sepal.Width = col_double(),
+  Petal.Length = col_double(),
+  Petal.Width = col_double(),
+  Species = col_factor(c("setosa", "versicolor", "virginica"))
+))
+```


This has messed up the intended indenting. Here and below.

Co-authored-by: Jennifer (Jenny) Bryan <jenny.f.bryan@gmail.com>

sbearrows · 2022-09-14T16:27:14Z

Does challenge.csv need to be 2000 lines long anymore?

I created the new challenge.csv by modify pieces of the old challenge.csv so that's one reason why it's also 2000 lines long. It also made it easier to insert a parsing error into the file and makes the data feel more realistic.

What is the purpose of the value column in the new challenge.csv?...I can't figure out the purpose of value in the proposed challenge.csv, which doesn't seem to have this mix of integers and doubles.

The value column isn't being used to teach anything, its purpose is to make the data feel real since data with only one column would be odd. We could make this column also teach some other lesson about problems() or parsing if that's what you're suggesting. But I think we should only have one type of parsing problem for the example otherwise it might feel overloaded with information.

Our modern approach to creating datasets in packages is to save the code that generates the dataset below data-raw/

I'll make a note to complete this TODO once we've landed on a dataset

jennybc · 2022-09-16T14:48:35Z

Given that the only column we're currently using for a concrete demo is a date column, I think we should just leave the legacy challenge.csv alone and create a new example file with exactly what you need.

As I understand it, you want a single date column with at least one strategically positioned entry in a nonstandard format. This tricky entry should not cause the column to be guessed as character (so that dictates its position), but it should create a parsing problem. Make it minimal and sufficient to support what you want to show. Store the .R script used to generate it below data-raw/.

shelby added 3 commits September 2, 2022 10:48

change challenge.csv

ad10eec

update text to reflect new example

bd3a4aa

restyle to match

025dae4