
Rework problems example in "Getting Started" #1431

Open

wants to merge 19 commits into base: main

Conversation

sbearrows (Contributor):

Fixes #1398

Modifies the challenge.csv file so that it actually reports some parsing problems. Also, updates some of the wording around spec_csv() and guess_max

This PR doesn't aim to fix all of the issues with the "Getting Started" page; in fact, we already have a PR open for that (#1364). It only deals with making things better than before and fixing what directly relates to challenge.csv.

@@ -142,15 +142,6 @@ mtcars_spec
cols_condense(mtcars_spec)
```


By default readr only looks at the first 1000 rows. This keeps file parsing speedy, but can generate incorrect guesses. For example, in `challenge.csv` the column types change in row 1001, so readr guesses the wrong types. One way to resolve the problem is to increase the number of rows:

sbearrows (Contributor, Author):

This is no longer true, so it's been removed. Now readr (via vroom) selects semi-randomized rows to use for guessing, and always includes the last row. So increasing `guess_max` won't necessarily resolve problems in this case.

https://github.com/tidyverse/vroom/blob/548344f4826818ca988e80b90cdf7fd837a90561/src/collectors.h#L161-L189

Member:

I'm not reading the whole vignette at once, just these changed snippets. So you have a better sense of flow and what's currently included / dropped / moved.

But I wonder if, instead of deleting this text entirely, we should update it to make it correct. I.e. still make sure the user knows we default to using 1000 rows, which helps with speed but can lead to incorrect guesses. And that increasing that number might help, but the best solution is to specify the column types.

Member:

The rows are not selected in a random fashion. They're basically evenly spaced within the file and the last row is always included for guessing.

https://github.com/tidyverse/vroom/blob/0e2df4cc8a07fbb007206f6bfcc6a3732e6c4613/src/collectors.h#L164
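The even spacing described here can be sketched roughly as follows. This is a language-agnostic Python approximation of the linked vroom logic, not the actual C++ implementation; the function name and the rounding details are illustrative assumptions:

```python
def guess_rows(n_rows, guess_max=1000):
    """Pick up to guess_max row indices, roughly evenly spaced through
    the file, always including the last row. A sketch of the behavior
    described above, not vroom's actual code."""
    if n_rows <= guess_max:
        # small files: every row is used for guessing
        return list(range(n_rows))
    step = n_rows / guess_max
    rows = [int(i * step) for i in range(guess_max)]
    rows[-1] = n_rows - 1  # the last row is always used for guessing
    return rows
```

Under this sketch, a 2000-row file with the default `guess_max` contributes every other row plus the final row, which is why a type change confined to a single mid-file row can still be missed.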

sbearrows (Contributor, Author):

I know, but I felt like being overly specific would be more confusing. I can modify it to say "interspersed throughout the file and always includes the first and last row" instead.

sbearrows (Contributor, Author), Sep 13, 2022:

Actually, this isn't technically true, since `guess_max` doesn't guess column types using rows it will never read, and once we complete #1436, `n_max = guess_max`. So if `guess_max = 1500` then `n_max = 1500` and it will guess using the first 1500 rows in the data. So `spec_*()` guessing will never be interspersed throughout the file; it will always use all of the data it is given to guess. Here is an example using the original challenge.csv, where the column y is filled with NA until row 1001.

```r
# after we complete #1436
spec_csv(readr_example("challenge.csv"), guess_max = 1000) # n_max = 1000
#> cols(
#>   x = col_double(),
#>   y = col_logical()
#> )
# increasing guess_max does change the spec
spec_csv(readr_example("challenge.csv"), guess_max = 1001) # n_max = 1001
#> cols(
#>   x = col_double(),
#>   y = col_date(format = "")
#> )

# compared to how guessing works in read_csv
# since it's dispersed throughout the file, it guesses correctly
spec(read_csv(readr_example("challenge.csv"), guess_max = 1000, show_col_types = FALSE)) # n_max = all the data
#> cols(
#>   x = col_double(),
#>   y = col_date(format = "")
#> )
```

So, how specific should the description for guessing be?

Member:

The statement should not be technically misleading (e.g. suggesting that it's random and might select different rows each time you call it), but otherwise it's fine to stay high-level. So, yes, something along these lines: "interspersed throughout the file and always includes the first and last row".

Recall that we did create an entire new vignette on column type guessing, in case it makes sense to link to that (or borrow text from it) for more details: https://readr.tidyverse.org/articles/column-types.html.

To zoom out, I think a guiding principle is to avoid making this text so specific that it's tightly coupled with detail work like #1436. What level of understanding and detail is useful for the typical user?

The observation above about n_max and guess_max in spec() brings up an important question to think about over in #1436. Should spec_*() produce the same column type guesses as read_*() at default settings?

```{r}
spec(df1)
spec(df2)
```

In this example, one of the dates is in an incorrect format.
sbearrows (Contributor, Author):

This could use some garnishing

Member:

What do you have in mind?

```{r}
df1 <- read_csv(readr_example("challenge.csv"), col_types = list(
  date = col_date(),
  value = col_double()
))
```
sbearrows (Contributor, Author):

Note that this would still generate parsing errors even if we don't include/force `col_types`, so we could just have the following if we think it makes for a better narrative:

```{r}
df1 <- read_csv(readr_example("challenge.csv"))
```

But I think it does drive the point home that supplying a column specification is safer.

sbearrows (Contributor, Author):

Notes:

@sbearrows sbearrows marked this pull request as ready for review September 3, 2022 00:29
jennybc (Member) left a comment:

I made some comments inline. Overall, it's looking good: removing lots of incorrect content and updating it.

Zooming out:

Can you describe what changes you made to challenge.csv? And how does that change what we see in the rendered vignette? I can't tell that from reading this PR, i.e. I would have to fetch the PR and render the vignette.

One other concern: changing a dataset is potentially a breaking change. I.e. if someone, somewhere, uses the existing challenge.csv dataset, this will break their code. Did you do any research into this? For example, searching CRAN packages for usage of this dataset and just googling for "challenge.csv".

I could imagine this is basically an internal matter and it's OK to change, but we should do some due diligence. It could also show up when we run revdeps, but it's nicer to learn this sooner rather than later.


sbearrows (Contributor, Author):

@jennybc The file has a column of dates and a column of doubles. One of the dates is formatted as 12032020; all the other dates in the file are of the form 12-03-2020. I had to place it on a specific line where `guess_max` wouldn't look in order to get the parsing error.

I've made changes to address most of your comments but I still need to look into whether other packages are using challenge.csv.
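The kind of malformed entry described here can be found mechanically by attempting to parse each value. A minimal Python sketch of the idea (the helper name and the sample values are illustrative, not readr code):

```python
from datetime import datetime

def parses_as_date(s, fmt="%d-%m-%Y"):
    """True if s matches the dd-mm-yyyy format used by the well-formed dates."""
    try:
        datetime.strptime(s, fmt)
        return True
    except ValueError:
        return False

# one entry is digits-only, with no dashes, like the bad date described above
dates = ["12-03-2020", "12032020", "01-01-2019"]
problems = [(i, s) for i, s in enumerate(dates) if not parses_as_date(s)]
```

Only the dash-less entry fails, which is the same shape of problem that readr's `problems()` would report for the modified file.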

sbearrows (Contributor, Author):

@jennybc I couldn't find any other packages that use challenge.csv. The only reference to it I could find is https://search.r-project.org/CRAN/refmans/csvwr/html/datatype_to_type.html, where they use it in an example, but it's wrapped in `\dontrun`.

jennybc (Member) left a comment:

I left some inline comments.


High level comment about the new challenge.csv.

Does challenge.csv need to be 2000 lines long anymore?

Originally, I guess it needed to be over 1000 lines long to make a point about column type guessing for the y column, which was missing for the first 1000 lines and then started to have dates in it. (That still doesn't fully explain why the file was 2000 lines long, but I guess that will remain a mystery.)

What is the purpose of the value column in the new challenge.csv?

I notice that the original x column was integers only for the first 1000 lines and then started to have doubles in it. So I think it must be or have been used to demonstrate something about integers vs doubles.

But I can't figure out the purpose of value in the proposed challenge.csv, which doesn't seem to have this mix of integers and doubles.


Our modern approach to creating datasets in packages is to save the code that generates the dataset below data-raw/. It helps with exercises like this one, i.e. it shows how to regenerate a dataset and (hopefully) leave some breadcrumbs re: why the data is the way it is.

This is explained in the Data chapter of R Packages and, as also described there, supported by functions like use_data_raw().

@@ -127,7 +127,7 @@ parse_number("$1,234")

There are two parsers that will never be guessed: `col_skip()` and `col_factor()`. You will always need to supply these explicitly.

You can see the specification that readr would generate for a column file by using `spec_csv()`, `spec_tsv()` and so on:
You can see the specification that readr would generate for a file by using `spec_csv()`, `spec_tsv()`, and so on.

```{r}
x <- spec_csv(readr_example("challenge.csv"))
```
Member:

Wouldn't it make sense to reveal the spec? I.e. why are we assigning the result to x?

(I have fetched the PR and rendered the vignette now. I assume you are also looking at the rendered result as you go.)

sbearrows (Contributor, Author), Sep 13, 2022:

This is how the example was when I started editing the file, and I didn't see a reason to modify it, since we aren't doing a total overhaul right now and the goal is to ensure that the `problems()` example has actual problems. Also, the behavior of `spec_*()` is currently in flight (#1436), so some of the content here might change.

@@ -127,7 +127,7 @@ parse_number("$1,234")

There are two parsers that will never be guessed: `col_skip()` and `col_factor()`. You will always need to supply these explicitly.
Member:

Isn't it also true that `col_integer()` will not be guessed? I spotted this since we are working here in the immediate neighborhood.



@@ -142,14 +142,13 @@ mtcars_spec
cols_condense(mtcars_spec)
```


By default readr only looks at the first 1000 rows. This keeps file parsing speedy, but can generate incorrect guesses. For example, in `challenge.csv` the column types change in row 1001, so readr guesses the wrong types. One way to resolve the problem is to increase the number of rows:
When guessing column types, readr selects rows to use for guessing semi-randomly. The default number of rows it selects is 1000. If the column type guess is incorrect, increasing `guess_max` may or may not improve the results.

```{r}
x <- spec_csv(readr_example("challenge.csv"), guess_max = 1001)
```
Member:

Similar to above, if we're trying to make a point about the spec, we should reveal it.

```{r}
#>   x = col_integer(),
#>   y = col_character()
#> )
writeLines(read_lines(readr_example("challenge.csv"), skip = 1325, n_max = 3))
```
Member:

1325 feels a bit like a magic number here. It feels like we should probably get it programmatically, since we do have access to it, and explain that we're inspecting the neighborhood right around the problem line.
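In readr itself, `problems()` returns a tibble with a `row` column, so the line number could come from there instead of being hardcoded. As a generic sketch of the same idea (Python, with a hypothetical helper; the dd-mm-yyyy format mirrors the example file):

```python
from datetime import datetime

def first_problem_row(lines, fmt="%d-%m-%Y"):
    """0-based index of the first line whose leading date field fails
    to parse, or None if every line parses. Sketch only, not readr."""
    for i, line in enumerate(lines):
        field = line.split(",")[0]
        try:
            datetime.strptime(field, fmt)
        except ValueError:
            return i  # found the problem line programmatically
    return None
```

The vignette could then show the neighborhood around `first_problem_row(...)` rather than around a literal 1325.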

You can also access it after the fact using `spec()`:
We can work around this parsing error by importing the date column as character and converting it to a column of dates after the fact. The lubridate package is a great option that we show here. But since readr doesn't depend on lubridate, we won't execute this code chunk.
Member:

You actually could make this a static code chunk. That is, set it to eval = FALSE, but inline the current result. It's not a perfect solution, but the pros would seem to outweigh the cons.

I'm talking about showing folks this:

```
> df1[1325:1327, ]
# A tibble: 3 × 2
  date       value
  <date>     <dbl>
1 2008-03-16   762
2 2008-02-17   206
3 2019-01-28  1333
```

Member:

I reiterate this:

I think you should add some brief comment ... that it's indicative of a general type of workaround, where you import as character and use a more specialized package (or custom function) to parse a tricky column.

Comment on lines 242 to 234
```r
read_csv("iris.csv", col_types = list(
Sepal.Length = col_double(),
Sepal.Width = col_double(),
Petal.Length = col_double(),
Petal.Width = col_double(),
Species = col_factor(c("setosa", "versicolor", "virginica"))
))
```
```{r, eval = FALSE}
read_csv("iris.csv", col_types = list(
Sepal.Length = col_double(),
Sepal.Width = col_double(),
Petal.Length = col_double(),
Petal.Width = col_double(),
Species = col_factor(c("setosa", "versicolor", "virginica"))
))
```
Member:

This has messed up the intended indenting. Here and below.

sbearrows (Contributor, Author):

Does challenge.csv need to be 2000 lines long anymore?

I created the new challenge.csv by modifying pieces of the old challenge.csv, so that's one reason why it's also 2000 lines long. It also made it easier to insert a parsing error into the file, and it makes the data feel more realistic.

What is the purpose of the value column in the new challenge.csv?...I can't figure out the purpose of value in the proposed challenge.csv, which doesn't seem to have this mix of integers and doubles.

The value column isn't being used to teach anything; its purpose is to make the data feel real, since data with only one column would be odd. We could make this column also teach some other lesson about `problems()` or parsing, if that's what you're suggesting. But I think we should have only one type of parsing problem in the example, otherwise it might feel overloaded with information.

Our modern approach to creating datasets in packages is to save the code that generates the dataset below data-raw/

I'll make a note to complete this TODO once we've landed on a dataset.

jennybc (Member) commented Sep 16, 2022:

Given that the only column we're currently using for a concrete demo is a date column, I think we should just leave the legacy challenge.csv alone and create a new example file with exactly what you need.

As I understand it, you want a single date column with at least one strategically positioned entry in a nonstandard format. This tricky entry should not cause the column to be guessed as character (so that dictates its position), but it should create a parsing problem. Make it minimal and sufficient to support what you want to show. Store the .R script used to generate it below data-raw/.
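A data-raw/-style generator for such a minimal file could look like the following. This is a Python sketch of the idea only; in the package itself it would be an R script under data-raw/, and the file shape, row position, and date formats are illustrative assumptions:

```python
import csv
import io
import random

def make_challenge_csv(n_rows=2000, bad_row=1325, seed=42):
    """Build a date,value CSV where exactly one date (at 0-based data
    row bad_row) is malformed: digits only, no dashes. The bad row sits
    past the default 1000-row guess window but still triggers a parsing
    problem on read. Illustrative sketch, not the actual data-raw script."""
    random.seed(seed)
    buf = io.StringIO()
    w = csv.writer(buf, lineterminator="\n")
    w.writerow(["date", "value"])
    for i in range(n_rows):
        date = "12032020" if i == bad_row else "12-03-2020"
        w.writerow([date, random.randint(0, 2000)])
    return buf.getvalue()
```

Keeping the generator alongside the data makes the strategic placement of the bad row self-documenting, which is exactly the breadcrumb the data-raw/ convention is meant to leave.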

Successfully merging this pull request may close: "readr no longer reproduces the problems from challenge.csv" (#1398).