Empty column names need to be given unique names #364

Closed
jabranham opened this issue Feb 14, 2016 · 11 comments

@jabranham

I used readr::read_csv() to import a CSV file into R, which produced no errors. Later, when I tried to run lm() on the data, I got this error:

Error in terms.formula(formula, data = data) : 
  attempt to use zero-length variable name

I found this question and its answers on Stack Overflow, which suggest that other people are hitting the same issue.

http://stackoverflow.com/questions/31385976/error-attempt-to-use-zero-length-variable-name

@hadley
Member

hadley commented Mar 2, 2016

Could you please provide a minimal reproducible example (preferably using a small made up dataset)?

@jabranham
Author

Yes, this is so simple that I feel like I'm just making some sort of silly mistake. But here is code that replicates it for me.

set.seed(2904)

thedata <- data.frame(
  x = rnorm(100),
  y = rnorm(100, 3, 270),
  groups = rep(1:5))

write.csv(thedata, "examplecsv.csv")

thedata2 <- readr::read_csv("examplecsv.csv")

lm(y ~ x * groups, data = thedata2)

And the results of sessionInfo, if you want it:

R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.2.3 readr_0.2.2    tools_3.2.3    Rcpp_0.12.3   

@jennybc
Member

jennybc commented Mar 3, 2016

The write.csv() command puts row names into the file, and readr::read_csv() brings them back in as a variable whose name is the empty string. Apparently lm() doesn't like that, even though the model has nothing to do with that variable.

You could prevent the row names from being written with write.csv(thedata, "examplecsv.csv", row.names = FALSE), or skip them when reading with readr::read_csv("examplecsv.csv", col_types = "_nni"). Alternatively, drop the column after reading the data but before fitting the model: thedata2[names(thedata2) != ""].

Which, I now see, is pretty much what the SO answers say.

Look for a variable named "" in your data and either drop it or rename it.
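
A minimal base R sketch of these options, under stated assumptions: `thedata2` here is a hand-built stand-in for a data frame whose first column is named "" (it is constructed directly rather than read with readr, so the example does not depend on a particular readr version), the data values are made up, and the name `row_names` is only an illustrative choice.

```r
# Small made-up data set (values are illustrative only).
thedata <- data.frame(x = c(0.1, 0.5, -0.3, 1.2),
                      y = c(2, 3, 1, 5))

# Option 1: do not write row names in the first place.
path <- tempfile(fileext = ".csv")
write.csv(thedata, path, row.names = FALSE)
header <- readLines(path, n = 1)  # header line has no leading empty field

# Stand-in (assumption) for an imported data frame with a column named "".
thedata2 <- cbind(data.frame(rn = seq_len(nrow(thedata))), thedata)
names(thedata2)[1] <- ""

# Option 2: drop every column whose name is the empty string.
dropped <- thedata2[names(thedata2) != ""]

# Option 3: rename the empty-named column instead of dropping it.
renamed <- thedata2
names(renamed)[names(renamed) == ""] <- "row_names"

# With the empty name gone, the model can be fitted.
fit <- lm(y ~ x, data = dropped)
```

Dropping is the simplest choice when the column only holds row numbers; renaming keeps the values in case they carry meaning.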

UPDATE: Hey, I recognize you from Austin!

@jabranham
Author

Ah, that makes sense. Might it make sense for read_csv to name the rownames something if they aren't named (like row_names)? When I use utils::read.csv() it names them X.
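
For comparison, a small sketch of the utils::read.csv() behaviour mentioned above. The CSV text is made up; the point is only that the default `check.names = TRUE` passes the header through make.names(), which turns an empty name into X.

```r
# Made-up CSV text whose first header field is empty, as write.csv() produces.
csv_text <- '"","x","y"\n"1",0.5,2\n"2",1.5,3\n'

# Base R repairs the empty header field by default (check.names = TRUE).
via_base <- read.csv(text = csv_text)
base_names <- names(via_base)

# The same repair applied directly to an empty string.
repaired <- make.names("")
```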

And why in the world does lm complain about variables that aren't in the model? That's just weird.

Yup, I was at your talk in Austin. I've taught a few people git and show them your "burn it all down" slides.

Update: This is somewhat related to tidyverse/dplyr#1576

@jennybc
Member

jennybc commented Mar 4, 2016

I think @hadley might refine this particular point of the readr philosophy: "Column names are left as is". Maybe an exception will be made to populate missing names? And maybe the empty-string-as-variable-name would get the same treatment.

@hadley
Member

hadley commented Mar 4, 2016

Yes, this will definitely get fixed - missing/empty col names need some repair because they cause so many downstream problems, and it's never helpful to maintain the missingness.

Will probably adopt some convention like _missing_1, _missing_2 for missing names.

hadley changed the title from "read_csv() later results in error attempt to use zero-length variable name" to "Empty column names need to be given unique names" on Jun 1, 2016
hadley added the "feature" (a feature request or enhancement) and "ready" labels on Jun 1, 2016
@hadley
Member

hadley commented Jun 2, 2016

We now have:

read_csv(",,\n1,2,3")
#> Source: local data frame [1 x 3]
#> 
#>      X1    X2    X3
#>   <int> <int> <int>
#> 1     1     2     3

But this probably needs an explicit warning.

@hadley
Member

hadley commented Jul 13, 2016

It doesn't seem useful to generate an invalid data frame, so now both missing and duplicated column names get an automatic fix and a warning:

x1 <- read_csv(",,\n1,2,3")
#> Warning: Missing column names filled in: 'X1' [1], 'X2' [2], 'X3' [3]
x2 <- read_csv("x,x,x\n1,2,3")
#> Warning: Duplicated column names deduplicated: 'x' => 'x_1' [2], 'x' =>
#> 'x_2' [3]
x3 <- read_csv("X2,\n1,2")
#> Warning: Missing column names filled in: 'X2' [2]
#> Warning: Duplicated column names deduplicated: 'X2' => 'X2_1' [2]

@cluoma

cluoma commented Aug 26, 2016

Is it possible to disable 'deduplicating' column names? This is undesired behaviour in my case since the csv is malformed such that the first and second rows together create unique column names.

@hadley
Member

hadley commented Aug 26, 2016

Just use col_names = FALSE
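
One way this could look for a file whose first two rows together form the column names. This is a sketch, not necessarily the workflow intended above: the CSV text, the `_` separator, and the variable names are all assumptions made for illustration.

```r
# Made-up input: the first two rows jointly describe each column.
csv_text <- "score,score\nmath,art\n90,85\n70,95\n"

# Read everything as data, so no header row is deduplicated.
# All columns are read as character because the header rows are text.
raw <- readr::read_csv(
  I(csv_text),
  col_names = FALSE,
  col_types = readr::cols(.default = readr::col_character())
)

# Build unique names from rows 1 and 2 (the "_" separator is an arbitrary choice).
header <- paste(unlist(raw[1, ]), unlist(raw[2, ]), sep = "_")

# Drop the two header rows, apply the names, and re-guess the column types.
body <- raw[-(1:2), ]
names(body) <- header
body[] <- lapply(body, type.convert, as.is = TRUE)
```

Reading the header rows as ordinary data keeps readr from touching them, at the cost of having to convert the column types afterwards.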

@mightypog

I did a completely low tech thing. I pulled the empty column out of the original spreadsheet and reloaded it. Not a coding solution, but it worked.

lock bot locked and limited conversation to collaborators on Sep 24, 2018