Handle also when header and data rows have different number of columns #189

Closed
HenrikBengtsson opened this Issue Jun 10, 2015 · 10 comments

HenrikBengtsson commented Jun 10, 2015

Case 1: More column names than data columns

read.table() has fill=TRUE to handle the case where there are more column names than columns in the data rows. read_tsv() has no equivalent, e.g.

> read_tsv("a\tb\tc\n1\t2\n")
Error: You have 3 column names, but 2 columns

> read.table(text="a\tb\tc\n1\t2\n", fill=FALSE)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,
  line 2 did not have 3 elements
> read.table(text="a\tb\tc\n1\t2\n", fill=TRUE)
  V1 V2 V3
1  a  b  c
2  1  2

Looking at the help, I don't think there is a way to use read_tsv() to deal with this case.

WISH: Make it possible to "fill" data rows with empty values/NAs when data rows lack trailing cells. This would assume the missing ones are at the end, cf. argument fill of read.table().
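
For reference, a minimal base-R sketch of the padding behaviour being asked for. It only shows what fill=TRUE does in read.table() once header=TRUE and sep="\t" are also set (the example above omits both, so the header is read as a data row); it says nothing about how readr would implement this.

```r
# Header has 3 names, the data row has only 2 cells.
txt <- "a\tb\tc\n1\t2\n"

# fill = TRUE pads the short row at the end, so column `c` becomes NA.
df <- read.table(text = txt, header = TRUE, sep = "\t", fill = TRUE)
print(df)
#>   a b  c
#> 1 1 2 NA
```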

Case 2: Fewer column names than data columns

read.table() does not handle this. I don't think read_tsv() does either.

> read.table(text="a\tb\n1\t2\t3\t4\n", header=TRUE, fill=FALSE)
Error in read.table(text = "a\tb\n1\t2\t3\t4\n", header = TRUE, fill = FALSE) :
  more columns than column names
> read.table(text="a\tb\n1\t2\t3\t4\n", header=TRUE, fill=TRUE)
Error in read.table(text = "a\tb\n1\t2\t3\t4\n", header = TRUE, fill = TRUE) :
  more columns than column names
> read_tsv("a\tb\n1\t2\t3\t4\n")
Error: You have 2 column names, but 4 columns

WISH: Make it possible to "fill" column names with empty values/NAs when the header lacks trailing column names. This would assume the missing ones are at the end, cf. argument fill of read.table().
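
Until something like that exists, one possible base-R workaround is to count the fields per line first and pad the header yourself. This is only a sketch: the X3, X4 placeholder names are my own choice for illustration, and it assumes the missing names really are the trailing ones.

```r
# Header has 2 names, the data row has 4 cells.
txt <- "a\tb\n1\t2\t3\t4\n"

# Count fields on every line (header included).
con <- textConnection(txt)
n_fields <- count.fields(con, sep = "\t")
close(con)

# Read the header names on their own, then pad with placeholder names.
header <- scan(text = txt, what = "", sep = "\t", nlines = 1, quiet = TRUE)
extra <- seq_len(max(n_fields))[-seq_along(header)]
col_names <- c(header, paste0("X", extra))

# Skip the original header line and supply the padded names instead.
df <- read.table(text = txt, sep = "\t", skip = 1, header = FALSE,
                 col.names = col_names, fill = TRUE)
```

Here df ends up with the columns a, b, X3 and X4. Keeping fill = TRUE means shorter data rows are still padded with NA.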

Background

For a real-world example, please see https://gist.github.com/HenrikBengtsson/dabc383aaa958c0ed49a. Files like the ones above are a never-ending story in my life.

Member

jennybc commented Jun 14, 2015

👍 currently fiddling with an instance of Case 1: More column names than data columns.

Member

hadley commented Sep 3, 2015

I think both cases should be problems, not errors.

Member

hadley commented Sep 3, 2015

How does this look?

read_csv(col_types = "ii", "a,b\n1")
#>   a  b
#> 1 1 NA
read_csv(col_types = "ii", "a,b\n1,2,3")
#> Warning: 1 problems parsing literal data. See problems(...) for more
#> details.
#>   a b
#> 1 1 2

read_csv("a,b\n1")
#>   a
#> 1 1
read_csv("a,b\n1,2,3")
#>   a b X3
#> 1 1 2  3

I guess they probably should all generate warnings :/

Member

hadley commented Sep 3, 2015

A bit more progress

read_csv(col_types = "ii", "a,b\n1")
#>   a  b
#> 1 1 NA
read_csv(col_types = "ii", "a,b\n1,2,3")
#> Warning: 1 parsing failure (literal data)
#> row col  expected actual
#>   1   3 2 columns
#>   a b
#> 1 1 2

read_csv("a,b\n1")
#> Warning: 1 parsing failure (literal data)
#> row col       expected actual
#>  NA  NA 1 column names      2
#>   a
#> 1 1
read_csv("a,b\n1,2,3")
#> Warning: 1 parsing failure (literal data)
#> row col            expected actual
#>  NA   3 Missing column name
#>   a b X3
#> 1 1 2  3

Member

jennybc commented Sep 3, 2015

Looks good to me. Yes, it is helpful to be warned when col_types, the header row, and the data rows provide contradictory information about the number of variables.
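
To make the "problems, not errors" idea concrete, here is a rough base-R sketch that reports rows whose field count disagrees with the header instead of stopping. field_count_problems() is a hypothetical helper written for this illustration; it is not part of readr and not how readr does it internally.

```r
# Hypothetical helper: list data rows whose number of fields differs
# from the number of fields in the header line.
field_count_problems <- function(txt, sep = ",") {
  con <- textConnection(txt)
  on.exit(close(con))
  n <- count.fields(con, sep = sep)
  expected <- n[1]           # field count of the header line
  actual <- n[-1]            # field counts of the data rows
  bad <- which(actual != expected)
  data.frame(row = bad,
             expected = rep(expected, length(bad)),
             actual = actual[bad])
}

# Rows 1 and 3 disagree with the 2-column header; row 2 is fine.
p <- field_count_problems("a,b\n1\n1,2\n1,2,3")
if (nrow(p) > 0) {
  warning(nrow(p), " row(s) with an unexpected number of columns")
}
```

The returned data frame mirrors the row / expected / actual layout shown in the warnings above, so the caller can decide what to do with the mismatches.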

Member

hadley commented Sep 3, 2015

Final version:

read_csv(col_types = "ii", "a,b\n1")
#> Warning: 1 parsing failure.
#> row col  expected    actual
#>   1  -- 2 columns 1 columns
#>   a  b
#> 1 1 NA
read_csv(col_types = "ii", "a,b\n1,2,3")
#> Warning: 1 parsing failure.
#> row col  expected    actual
#>   1  -- 2 columns 3 columns
#>   a b
#> 1 1 2
read_csv(col_types = "ii", "a,b\n1,2,3,4")
#> Warning: 1 parsing failure.
#> row col  expected    actual
#>   1  -- 2 columns 4 columns
#>   a b
#> 1 1 2

read_csv("a,b\n1")
#> Warning: 1 parsing failure.
#> row col    expected      actual
#>  --  -- 1 col names 2 col names
#>   a
#> 1 1
read_csv("a,b\n1,2,3")
#> Warning: 1 parsing failure.
#> row col    expected      actual
#>  --  -- 3 col names 2 col names
#>   a b X3
#> 1 1 2  3
read_csv("a,b\n1,2,3,4")
#> Warning: 1 parsing failure.
#> row col    expected      actual
#>  --  -- 4 col names 2 col names
#>   a b X3 X4
#> 1 1 2  3  4

This is looking pretty good to me :)

(BTW I've been using reprex to make these code snippets and it's awesome!)

Member

hadley commented Sep 3, 2015

Not quite right, but I'll finish it off tomorrow:

read_csv("a,b\n\n2,3")
#>    a  b
#> 1 NA NA
#> 2  2  3
read_csv("a,b\n\n\n2,3")
#> Warning: 1 parsing failure.
#> row col  expected    actual
#>   2  -- 2 columns 1 columns
#>    a  b
#> 1 NA NA
#> 2 NA NA
#> 3  2  3

Member

jennybc commented Sep 7, 2015

@HenrikBengtsson is the main sufferer but I agree this looks great. (Thanks for kind words re: reprex ... yeah, it certainly feels useful and ppl have given neat ideas and PRs already.)

hadley closed this in a165309 Sep 23, 2015

Member

hadley commented Sep 23, 2015

I'm pretty sure I got everything - please open a new issue if you discover a case I missed.

HenrikBengtsson commented Sep 23, 2015

Awesome - thanks for this. I've confirmed that it works with my real-world data that originally triggered this issue. You just made life a bit less hard for quite a few people.

lock bot locked and limited conversation to collaborators Sep 25, 2018