Inconsistent behaviour with read_csv and skip > #rows #119

Closed
coolbutuseless opened this issue Apr 12, 2015 · 5 comments

Comments

@coolbutuseless

  1. When col_types is specified and skip is equal to or greater than the number of actual rows, read_csv() returns a data.frame with 1 row.
  2. When col_types is not specified and skip is equal to or greater than the number of actual rows, read_csv() throws an error.

I think Case 1 is wrong: it returns a row of results when there aren't any, and should probably return a zero-row data.frame.

EDIT: Case 2 is handled OK, i.e. if column types aren't specified and there are no rows from which to infer types, you can't really return anything sensible.

I found this inconsistency when doing chunked reads from a large CSV file, where a zero-row data.frame was going to be my indicator that I'd run out of data.

> read_csv("1,2\n3,4", col_names=c('a', 'b'), col_types='ii')
Source: local data frame [2 x 2]

  a b
1 1 2
2 3 4
> 
> read_csv("1,2\n3,4", col_names=c('a', 'b'), col_types='ii', skip=2)
Source: local data frame [1 x 2]

   a  b
1 NA NA
> 
> read_csv("1,2\n3,4", col_names=c('a', 'b'), skip=2)
Error: You have 2 column names, but 0 columns
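
A minimal sketch of the chunked-read pattern described above, for illustration only. It assumes read_csv() returns a zero-row data frame once skip reaches the end of the file (the behaviour requested in this issue), and that the file has no header row and no blank lines, so that skip counts data rows exactly. read_chunks and handle_chunk are hypothetical names, not part of readr.

```r
library(readr)

# Hypothetical helper: read `path` in chunks of `chunk_size` rows and pass each
# chunk to `handle_chunk`. Returns (invisibly) the total number of rows read.
read_chunks <- function(path, col_names, col_types, chunk_size, handle_chunk) {
  skip <- 0L
  repeat {
    chunk <- read_csv(path, col_names = col_names, col_types = col_types,
                      skip = skip, n_max = chunk_size)
    # A zero-row result is the signal that the data has run out.
    if (nrow(chunk) == 0L) break
    handle_chunk(chunk)
    # Advance past the rows just consumed.
    skip <- skip + nrow(chunk)
  }
  invisible(skip)
}
```

With the 1-row NA result shown in the transcript above, this loop would never terminate, which is why the zero-row return value matters here.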
@coolbutuseless
Author

This bit me again recently, so I boiled it down to an even simpler issue.

Issue: readr returns a single row of NA data when there are no rows in a CSV file.

To reproduce:

  • In a shell: echo "x,y" > buggy.csv
  • In R: readr::read_csv("buggy.csv", col_types="ii")
  • Expected result: An empty data frame with zero rows.
  • Actual result: A data.frame with a single row of NA values.

@hadley
Member

hadley commented Sep 3, 2015

If you don't supply the types, do you think it's best to return the most restrictive type (i.e. logical) or the least restrictive type (i.e. character)? I think either is better behaviour than throwing an error (esp. since it's useful to do read_csv(..., n_max = 0) to get just the column names).

@hadley
Member

hadley commented Sep 22, 2015

Here's what I have now:

read_csv("a,b\n1,2")
#> Source: local data frame [1 x 2]
#> 
#>       a     b
#>   (int) (int)
#> 1     1     2
read_csv("a,b\n1,2", c("a", "b"), "ii", skip = 2)
#> Source: local data frame [0 x 2]
#> 
#> Variables not shown: a (int), b (int)
read_csv("a,b\n1,2", c("a", "b"), skip = 2)
#> Warning: 1 parsing failure.
#> row col    expected      actual
#>  --  -- 0 col names 2 col names
#> Source: local data frame [0 x 0]
read_csv("a,b\n1,2", skip = 2)
#> Source: local data frame [0 x 0]
read_csv("a,b\n1,2", n_max = 0)
#> Source: local data frame [0 x 2]
#> 
#> Variables not shown: a (int), b (int)
read_csv("a,b\n")
#> Warning: 1 parsing failure.
#> row col    expected      actual
#>  --  -- 0 col names 2 col names
#> Source: local data frame [0 x 0]

That seems reasonably consistent to me.

@hadley
Member

hadley commented Sep 22, 2015

Hmmm, I think the main thing missing is that if you have column names, but no data and no column types, you get a 0 x 0 data frame - that's not quite right. I've tweaked it to make sure there are always enough col types, using character to pad out:

read_csv("a,b\n1,2")
#> Source: local data frame [1 x 2]
#> 
#>       a     b
#>   (int) (int)
#> 1     1     2
read_csv("a,b\n1,2", c("a", "b"), "ii", skip = 2)
#> Source: local data frame [0 x 2]
#> 
#> Variables not shown: a (int), b (int)
read_csv("a,b\n1,2", c("a", "b"), skip = 2)
#> Source: local data frame [0 x 2]
#> 
#> Variables not shown: a (chr), b (chr)
read_csv("a,b\n1,2", skip = 2)
#> Source: local data frame [0 x 0]
read_csv("a,b\n1,2", n_max = 0)
#> Source: local data frame [0 x 2]
#> 
#> Variables not shown: a (int), b (int)
read_csv("a,b\n")
#> Source: local data frame [0 x 2]
#> 
#> Variables not shown: a (chr), b (chr)
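
The padding rule described above ("always enough col types, using character to pad out") can be sketched in plain R. This is only an illustration of the idea on a compact type string such as "ii"; pad_col_types is a hypothetical helper and is not how readr implements it internally.

```r
# Hypothetical helper: extend a compact col_types string with "c" (character)
# until it has one type per column. Strings that are already long enough are
# returned unchanged.
pad_col_types <- function(col_types, n_cols) {
  have <- nchar(col_types)
  if (have >= n_cols) {
    return(col_types)
  }
  paste0(col_types, strrep("c", n_cols - have))
}

pad_col_types("", 2)   # no types supplied, two column names
pad_col_types("i", 2)  # one type supplied, one missing
```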

@hadley hadley closed this as completed in 65755c9 Sep 22, 2015
@coolbutuseless
Author

Thanks! This looks great!

@lock lock bot locked and limited conversation to collaborators Sep 25, 2018