New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error using col_names and col_types #81

Closed
meyera opened this Issue Apr 17, 2015 · 10 comments

Comments

Projects
None yet
9 participants
@meyera

meyera commented Apr 17, 2015

Hi Hadley,
thanks for another awesome package.
I would like to let you know the following behavior, which seems to be a bug:

> read_excel(data_file, sheet = "ES") %>% colnames
 [1] "Lfd. Nr."                             "Name"                                 "Vorname"                             
 [4] "Geb.-datum"                           "OP-Datum"                             "Alter"                               
 [7] "Konv. OP-Risiko"                      "Card. Schock"                         "Rescue"                              
[10] "aEuroScore"                           "loEuroScore"                          "EuroSCORE II"                        
[13] "STS_Mortalitaetsrisiko"               "STS_Morb./Mort."                      "STS Long Stay"                       
[16] "STS Stroke"                           "STS Pronl. Vent."                     "STS Ren Fail"                        
[19] "Syntax Score"                         "Rhythmus"                             "A-fib"                               
[22] "PM"                                   "Z.n. AKE"                             "Z.n. ACVB"                           
[25] "Z.n. MKE/R"                           "Z.n. anderer OP (Perikarderoeffnung)" "Stroke, neurol. Defizit"             
[28] "pAVK"                                 "COPD"                                 "SPAP>50"                             
[31] "pHTs"                                 "pHTd"                                 "pHTm"                                
[34] "Diab mell"                            "NI"                                   "KHK"                                 
[37] "Z.n.Stent"                            "Dialyse"                              "Risikofaktoren"  

Now I would like to use my own column type and name specification for the 39 excel columns:

# shorthands for readxl package
num <- "numeric"; txt = "text"; date = "date"; blank = "blank" 
colspec.es <- c(
  id = num,
  es_name = txt,
  es_vorname = txt,
  es_dob = date,
  es_date_op = date,
  es_age = num,
  es_konv_risk = txt,
  es_card_shock = num,
  es_rescue_indication = num,
  es_addESI= num,
  es_logESI= num,
  es_ESII= num,
  es_STS_mort= num,
  es_STS_mort_morb= num,
  es_STS_longstay= num,
  es_STS_stroke= num,
  es_STS_prolong_vent= num,
  es_STS_ren_fail= num,
  es_syntax = num,
  es_rhythm = txt,
  es_afib = num,
  es_pm = num,
  es_sp_ake = num,
  es_sp_cabg = num,
  es_sp_mk_op = num,
  es_sp_other_pericard_open = num,
  es_stroke_neuro_deficit = num,
  es_pavk = num,
  es_copd = num,
  es_spap_gt_50 = num,
  es_pHTs = num,
  es_pHTd = num,
  es_pHTm = num,
  es_diabetes = num,
  es_ni = num,
  es_khk = num,
  es_sp_stent = num,
  es_dialyse = num,
  es_risks_txt = txt
)

xls.es   <- read_excel(data_file, sheet = "ES",
                                  skip = 0,
                                  col_names = names(colspec.es),
                                  col_types = unname(colspec.es))

This results in the error message "Error: Need one name and type for each column"

My session Info:

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readxl_0.1.0      magrittr_1.5      dplyr_0.4.1       lubridate_1.3.3   DescTools_0.99.10 manipulate_1.0.1  car_2.0-25       

loaded via a namespace (and not attached):
 [1] assertthat_0.1  boot_1.3-15     DBI_0.3.1       digest_0.6.8    grid_3.1.2      lattice_0.20-31 lme4_1.1-7      MASS_7.3-40    
 [9] Matrix_1.2-0    memoise_0.2.1   mgcv_1.8-3      minqa_1.2.4     mvtnorm_1.0-2   nlme_3.1-118    nloptr_1.0.4    nnet_7.3-8     
[17] parallel_3.1.2  pbkrtest_0.4-2  plyr_1.8.1      quantreg_5.11   Rcpp_0.11.5     SparseM_1.6     splines_3.1.2   stringr_0.6.2  
[25] tcltk_3.1.2     tools_3.1.2

What do you think?
Thanks a lot!
Alex

@mr-majkel

This comment has been minimized.

mr-majkel commented Apr 23, 2015

Hi, I have stumbled on that bug, too. I don't know if this is related, but when one does specify col_names to be FALSE the columns are numbered in "C'ish" type of way, i.e. from zero and not one (contrary to help description). Also it seems that col_types is not recycled.

@tklebel

This comment has been minimized.

Contributor

tklebel commented Apr 24, 2015

I can confirm the point from @mr-majkel about col_types

read_excel(path = "foo.xlsx",
    col_types = c("numeric", "numeric", "numeric", "numeric"))

Works for a file with four columns, but

read_excel(path = "foo.xlsx",
    col_types = c("numeric"))

doesn't work for the same file.

It is rather easy to work around the problem with rep("numeric", 4), but it's not intuitive.

On the other hand it may be a design choice to force the user to specify a column_type for each column. If that's the case, one could clarify the documentation by adding for each column containing data to the col_types entry in read_excel.

@jhuovari

This comment has been minimized.

jhuovari commented May 4, 2015

I think the problem @meyera described comes from blank columns. If the excel sheet includes blank columns, read_excel() will drop them, but they have to be included in col_types. Unfortunately, it is difficult to know how many blank columns there are.

@rogiersbart

This comment has been minimized.

rogiersbart commented May 21, 2015

I can confirm the suggestion by @jhuovari. I experienced the same issue as @meyera, and simply deleting some of the blank columns in excel solved it for me.

@hadley

This comment has been minimized.

Member

hadley commented May 21, 2015

I can't fix the problem without a reproducible example, so it would be great if someone could make a sheet that fails with this problem available publicly and link to it here.

@rogiersbart

This comment has been minimized.

rogiersbart commented May 21, 2015

@hadley, you can find an example file at https://dl.dropboxusercontent.com/u/7082685/Temporary/readxl_issue_81_temp_excel_file.xlsx.

image

There is some data in column A, and column B is a blank column which Excel somehow considers not to be blank (the issue disappears when deleting column B).

read_excel('readxl_issue_81_temp_excel_file.xlsx')

provides me

      temp   
1 12.10174 NA
2 12.32619 NA
3 12.50086 NA

while specifying only one column type then obviously throws an error

read_excel('readxl_issue_81_temp_excel_file.xlsx',col_types=c('numeric'))
Error: Need one name and type for each column

The problem @meyera was experiencing is different, as he has no empty column names. As far as I understand it, you can reproduce it by

> read_excel('D:/Dropbox/Public/Temporary/readxl_issue_81_temp_excel_file.xlsx',sheet=2)
      temp test test2
1 12.10174    1     1
2 12.32619    2     2
3 12.50086    3     3
> read_excel('D:/Dropbox/Public/Temporary/readxl_issue_81_temp_excel_file.xlsx',sheet=2,col_types=rep('numeric',3))
Error: Need one name and type for each column

using the second sheet, which looks like

image

I suppose the first example is just a strange Excel thing, and the second one is the result of a not properly formatted table, so I'm not sure if anything needs fixing...

Anyway, thanks indeed for another awesome package!

My session info:

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readxl_0.1.0

loaded via a namespace (and not attached):
[1] Rcpp_0.11.5 tools_3.1.1
@jhuovari

This comment has been minimized.

jhuovari commented May 22, 2015

I think @rogiersbart demonstrated the problem in the second sheet example. read_excel() without col_types did silently drop those empty columns, but with col_types they are required to specify. Reading the file first without col_types you don't realize that there are blank columns.

When I realised the problem, blank columns were last two columns. So, it was not obvious that there were blank columns, even inspecting from excel. Unfortunately, I haven't been able to reproduce that. If I intentionally insert blank columns after other columns, I get empty column on table as in @rogiersbart's sheet 1 example. And, that's ok.

@warnes

This comment has been minimized.

warnes commented Aug 6, 2015

I am also being bitten by this issue. A couple of requests:

  1. Make the error message more informative: "Error: Found X columns, but only Y names and types were specified".
  2. Skip/ignore/drop drop blank columns when assigning names and types.
  3. If 2 isn't feasible: provide a default name and column type when there are more columns than values provided, and issue a warning instead of an error.
@rebeccaferrell

This comment has been minimized.

rebeccaferrell commented Apr 17, 2016

I am experiencing this issue on a publicly available xls file that has no visible blank columns. It has one blank column to the right of the data range detected using guidance here ("last cell" is DR445). Opening the file in Excel, manually deleting all the columns to the right of the visible data range (everything after DQ), saving as a new xls file, and importing that xls file does work, but is an undesirable solution (this is only one of many files to be read in and all are supposed to be identically structured, but in reality may have different numbers of invisible blank columns to the right).

I would like it if when the user specifies col_names and/or col_types (and these are vectors of the same length X), then read_excel should read in only the first X columns and handle them as instructed, even if the original Excel file has "fake blank" (or even not blank) columns after X. A clear warning that additional columns on the file are detected beyond X would be useful.

The issue is reproduced below.

library(readxl)

# download data for 2013 from CA OSHPD website
download.file(url = "http://oshpd.ca.gov/hid/Products/Hospitals/QuatrlyFinanData/Qtr2013/2013_Q4R4.xls",
              destfile = "2013_Q4R4.xls",
              mode = "wb")
# Excel file has 445 visible rows (header + 444 data rows)
# Excel file has 121 visible columns (A:DQ)

# set column types based on data documentation
# available at http://oshpd.ca.gov/hid/Products/Hospitals/QuatrlyFinanData/QFUR2000AfterDoc.pdf
OSHPD_col_types <- c("text", "text", "numeric", "date", "date",
                     rep("text", 17 - 5),
                     rep("numeric", 121 - 17))
length(OSHPD_col_types) # should be 121
## [1] 121
data_2013_col_types <- read_excel("2013_Q4R4.xls", col_types = OSHPD_col_types)
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 01 00 00 00 00 00 00 0d 3b 00 00 00 00 cf 01 78 00 78 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 01 00 00 00 00 00 00 0d 3b 00 00 00 00 cf 01 78 00 78 00
## Error in eval(expr, envir, enclos): `col_names` and `col_types` must have the same length
# not passing any col_names and col_types works
# though with the DEFINEDNAME: notes
data_2013_nothing <- read_excel("2013_Q4R4.xls")
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 01 00 00 00 00 00 00 0d 3b 00 00 00 00 cf 01 78 00 78 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 01 00 00 00 00 00 00 0d 3b 00 00 00 00 cf 01 78 00 78 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 01 00 00 00 00 00 00 0d 3b 00 00 00 00 cf 01 78 00 78 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 01 00 00 00 00 00 00 0d 3b 00 00 00 00 cf 01 78 00 78 00
dim(data_2013_nothing)
## [1] 444 121
# trying to pass in both names and types also does not work
OSHPD_col_names <- colnames(data_2013_nothing)
length(OSHPD_col_names)
## [1] 121
data_2013 <- read_excel("2013_Q4R4.xls",
                        col_types = OSHPD_col_types,
                        col_names = OSHPD_col_names)
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 01 00 00 00 00 00 00 0d 3b 00 00 00 00 cf 01 78 00 78 00
## Error in eval(expr, envir, enclos): Need one name and type for each column
sessionInfo()
## R version 3.2.4 (2016-03-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.4 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] readxl_0.1.1.9000
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    formatR_1.3     tools_3.2.4     htmltools_0.3.5
##  [5] yaml_2.1.13     Rcpp_0.12.4     stringi_1.0-1   rmarkdown_0.9.5
##  [9] knitr_1.12.3    stringr_1.0.0   digest_0.6.9    evaluate_0.8

ecortens added a commit to ecortens/readxl that referenced this issue Oct 11, 2016

iMissile added a commit to iMissile/R.projects that referenced this issue Dec 9, 2016

Разбирался с ошибкой "Error: Need one name and type for each column".…
… Тикет открыт: [Error using col_names and col_types](tidyverse/readxl#81)

Если покопаться поглубже, то выплывают blank колонки в еще большем числе
ctypes <- readxl:::xlsx_col_types(datafile) дает 152 против 148 видимых. Связано это с тем, что у этих 4-х колонок изменено форматирование.
@gregoriosz

This comment has been minimized.

gregoriosz commented Jan 5, 2017

I also have some similar experience.
I created an example file in the Onerive Excel application, downloaded (as xlsx), and then saved as xls (in libreoffice, but I had the same problem in Excel 2007).
The files:
https://dl.dropboxusercontent.com/u/72707233/pelda_x2.xls
https://dl.dropboxusercontent.com/u/72707233/pelda_x.xlsx

and the code:

library(readxl)
exf2 <- ".../pelda_x2.xls"
# the next line works properly
df2_1 <- read_excel(exf2, sheet = 1, skip = 1, col_names = F)

# Error in eval(substitute(expr), envir, enclos) : Need one name and type for each column
var_type <- rep("text", 4)
df2_2 <- read_excel(exf, sheet = 1, skip = 3, col_names = paste0("var", c(1:4)), col_types = var_type)

session:

R version 3.3.2 RC (2016-10-28 r71606)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: openSUSE Leap 42.1 (x86_64)

locale:
 [1] LC_CTYPE=hu_HU.UTF-8       LC_NUMERIC=C               LC_TIME=hu_HU.UTF-8        LC_COLLATE=hu_HU.UTF-8    
 [5] LC_MONETARY=hu_HU.UTF-8    LC_MESSAGES=hu_HU.UTF-8    LC_PAPER=hu_HU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=hu_HU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.5.0  readxl_0.1.1 sp_1.2-3    

loaded via a namespace (and not attached):
[1] magrittr_1.5    R6_2.2.0        assertthat_0.1  DBI_0.5-1       tools_3.3.2     tibble_1.2     
[7] Rcpp_0.12.7     grid_3.3.2      lattice_0.20-33

Thanks in advance! (The readxl is a realy great package.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment