Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem Loading xls file #483

Closed
jameshunterbr opened this issue May 17, 2018 · 7 comments
Closed

Problem Loading xls file #483

jameshunterbr opened this issue May 17, 2018 · 7 comments

Comments

@jameshunterbr
Copy link

Please briefly describe your problem and what output you expect. If you have a question, please don't use this form. Instead, ask on https://stackoverflow.com/ or https://community.rstudio.com/.

Please include a minimal reproducible example (AKA a reprex). If you've never heard of a reprex before, start by reading https://www.tidyverse.org/help/#reprex.


I tried to import a large xls file from the Los Alamos National Lab's HIV database. The original version did not work. It showed a green sidebar in RStudio's R markdown file, but resulted in 0 observations. However, when I cut it down to 48 cases to make it easier to send to you, it worked! Worked means that lanl_epi was loaded into memory with 1879 cases.

The second command below (48 cases) functioned fine. I am sending both commands in the reprex, and a compressed version of both files. The smaller one has "mod" as the first letters instead of "lanl". The version of readxl I am using is 1.1.0.

library(tidyverse)
library(readxl)
# did not work; 1879 cases
lanl_epi <- read_excel("lanl_ctl_summary.xls", skip = 2,
                       col_types = c(rep("text", 2), rep("numeric", 2), 
                                     rep("text", 5)), 
                       col_names = c("Epitope", "Protein", "HXB2_start", 
                          "HXB2_end", "Subprotein", "DNA_contig", "Subtype",
                          "Species", "HLA"))
# worked; 48 cases
lanl_epi <- read_excel("mod_ctl_summary.xls", skip = 2,
                       col_types = c(rep("text", 2), rep("numeric", 2), 
                                     rep("text", 5)), 
                       col_names = c("Epitope", "Protein", "HXB2_start", 
                          "HXB2_end", "Subprotein", "DNA_contig", "Subtype",
                          "Species", "HLA"))

files for test.zip

@jennybc
Copy link
Member

jennybc commented May 17, 2018

Thanks! Is the original xls file available on the web somewhere, i.e. can you point me at a URL?

@jameshunterbr
Copy link
Author

jameshunterbr commented May 17, 2018 via email

@jennybc
Copy link
Member

jennybc commented May 17, 2018

In future, if the troublesome xls is available online, you can save yourself the pain of attaching and just point us at it. Thanks.

@p0bs
Copy link

p0bs commented Jun 1, 2018

If it helps you, I also have similar trouble with this spreadsheet, although not other spreadsheets in a broadly similar format.

Many thanks for your great work.

@jennybc
Copy link
Member

jennybc commented Dec 14, 2018

Unfortunately the current version of libxls and, therefore, readxl, still can't read this one. I've reported upstream in libxls.

@jennybc
Copy link
Member

jennybc commented Dec 14, 2018

The only readxl workaround I can offer is to open it in Excel and save as .xlsx. Then I can read it fine with readxl.

@jennybc
Copy link
Member

jennybc commented Dec 14, 2018

Thanks to speedy work by @evanmiller, this is now fixed in libxls and therefore in the dev version of readxl.

readxl::read_excel("investigations/ctl_summary.xls")
#> # A tibble: 1,901 x 9
#>    Epitope Protein `HXB2 start` `HXB2 end` Subprotein `HXB2 DNA Conti…
#>    <chr>   <chr>          <dbl>      <dbl> <chr>      <chr>           
#>  1 Data l… <NA>              NA         NA <NA>       <NA>            
#>  2 MGARAS… Gag                1         10 p17(1-10)  790..819        
#>  3 MGARAS… Gag                1         11 p17(1-11)  790..822        
#>  4 ASVLSG… Gag                5         13 p17(5-13)  802..828        
#>  5 ASILRG… Gag                5         15 p17(5-15)  802..834        
#>  6 ASVLSG… Gag                5         18 p17(5-18)  802..843        
#>  7 SVLSGG… Gag                6         15 p17(6-15)  805..834        
#>  8 SVLSGG… Gag                6         18 p17(6-18)  805..843        
#>  9 SVLSGG… Gag                6         19 p17(6-19)  805..846        
#> 10 LSGGEL… Gag                8         18 p17(8-18)  811..843        
#> # … with 1,891 more rows, and 3 more variables: Subtype <chr>,
#> #   Species <chr>, HLA <chr>

Created on 2018-12-13 by the reprex package (v0.2.1.9000)

@jennybc jennybc closed this as completed Dec 14, 2018
@lock lock bot locked and limited conversation to collaborators Dec 14, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants