entrez_summary() fails silently at > 500 results #106

Closed
npjc opened this issue May 14, 2017 · 5 comments
Comments

@npjc

npjc commented May 14, 2017

library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats
library(rentrez)
db <- "assembly"
term <- "Saccharomyces[ORGN]"
r <- entrez_search(db, term = term, retmax = 1) # just to get the count
r <- entrez_search(db, term = term, retmax = r$count) # return all ids
length(r$ids) == r$count # sanity check
#> [1] TRUE

getting summary of all results fails (silently?)

s <- entrez_summary(db, id = r$ids)
length(s) == length(r$ids) #uh oh 
#> [1] FALSE

getting summaries of the first 500 works as expected

s_first_500 <- entrez_summary(db, id = r$ids[1:500])
length(s_first_500) == 500 # so retrieving 500 summaries at once works...
#> [1] TRUE
head(s_first_500, 2) %>% str() # looks good
#> List of 2
#>  $ 1087661:List of 49
#>   ..$ uid                        : chr "1087661"
#>   ..$ rsuid                      : chr ""
#>   ..$ gbuid                      : chr "4436668"
#>   ..$ assemblyaccession          : chr "GCA_900178065.1"
#>   ..$ lastmajorreleaseaccession  : chr "GCA_900178065.1"
#>   ..$ chainid                    : chr "900178065"
#>   ..$ assemblyname               : chr "L711"
#>   ..$ ucscname                   : chr ""
#>   ..$ ensemblname                : chr ""
#>   ..$ taxid                      : chr "4932"
#>   ..$ organism                   : chr "Saccharomyces cerevisiae (baker's yeast)"
#>   ..$ speciestaxid               : chr "4932"
#>   ..$ speciesname                : chr "Saccharomyces cerevisiae"
#>   ..$ assemblytype               : chr "haploid"
#>   ..$ assemblyclass              : chr "haploid"
#>   ..$ assemblystatus             : chr "Scaffold"
#>   ..$ wgs                        : chr "FXLH01"
#>   ..$ gb_bioprojects             :'data.frame':  1 obs. of  2 variables:
#>   .. ..$ bioprojectaccn: chr "PRJEB8455"
#>   .. ..$ bioprojectid  : int 308667
#>   ..$ gb_projects                : chr "308667"
#>   ..$ rs_bioprojects             : list()
#>   ..$ rs_projects                : list()
#>   ..$ biosampleaccn              : chr "SAMEA3249812"
#>   ..$ biosampleid                : chr "4395280"
#>   ..$ biosource                  :List of 3
#>   .. ..$ infraspecieslist: list()
#>   .. ..$ sex             : chr ""
#>   .. ..$ isolate         : chr ""
#>   ..$ coverage                   : chr "50"
#>   ..$ partialgenomerepresentation: chr "false"
#>   ..$ primary                    : chr "4436658"
#>   ..$ assemblydescription        : chr ""
#>   ..$ releaselevel               : chr "Major"
#>   ..$ asmreleasedate_genbank     : chr "2017/05/03 00:00"
#>   ..$ asmreleasedate_refseq      : chr "1/01/01 00:00"
#>   ..$ seqreleasedate             : chr "2017/04/27 00:00"
#>   ..$ asmupdatedate              : chr "2017/05/03 00:00"
#>   ..$ submissiondate             : chr "2017/04/27 00:00"
#>   ..$ lastupdatedate             : chr "2017/05/03 00:00"
#>   ..$ submitterorganization      : chr "INRA"
#>   ..$ refseq_category            : chr "na"
#>   ..$ anomalouslist              : list()
#>   ..$ exclfromrefseq             : list()
#>   ..$ propertylist               : chr [1:4] "full-genome-representation" "latest" "latest_genbank" "wgs"
#>   ..$ fromtype                   : chr ""
#>   ..$ synonym                    :List of 3
#>   .. ..$ genbank   : chr "GCA_900178065.1"
#>   .. ..$ refseq    : chr ""
#>   .. ..$ similarity: chr ""
#>   ..$ ftppath_genbank            : chr "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/178/065/GCA_900178065.1_L711"
#>   ..$ ftppath_refseq             : chr ""
#>   ..$ ftppath_assembly_rpt       : chr "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/178/065/GCA_900178065.1_L711/GCA_900178065.1_L711_assembly_report.txt"
#>   ..$ ftppath_stats_rpt          : chr "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/178/065/GCA_900178065.1_L711/GCA_900178065.1_L711_assembly_stats.txt"
#>   ..$ ftppath_regions_rpt        : chr ""
#>   ..$ sortorder                  : chr "5C99001780659899"
#>   ..$ meta                       : chr " &lt;Stats&gt; &lt;Stat category=\"alt_loci_count\" sequence_tag=\"all\"&gt;0&lt;/Stat&gt; &lt;Stat category=\""| __truncated__
#>   ..- attr(*, "class")= chr [1:2] "esummary" "list"
#>  $ 1082101:List of 49
#>   ..$ uid                        : chr "1082101"
#>   ..$ rsuid                      : chr ""
#>   ..$ gbuid                      : chr "4417938"
#>   ..$ assemblyaccession          : chr "GCA_900177905.1"
#>   ..$ lastmajorreleaseaccession  : chr "GCA_900177905.1"
#>   ..$ chainid                    : chr "900177905"
#>   ..$ assemblyname               : chr "L564"
#>   ..$ ucscname                   : chr ""
#>   ..$ ensemblname                : chr ""
#>   ..$ taxid                      : chr "4932"
#>   ..$ organism                   : chr "Saccharomyces cerevisiae (baker's yeast)"
#>   ..$ speciestaxid               : chr "4932"
#>   ..$ speciesname                : chr "Saccharomyces cerevisiae"
#>   ..$ assemblytype               : chr "haploid"
#>   ..$ assemblyclass              : chr "haploid"
#>   ..$ assemblystatus             : chr "Scaffold"
#>   ..$ wgs                        : chr "FXEF01"
#>   ..$ gb_bioprojects             :'data.frame':  1 obs. of  2 variables:
#>   .. ..$ bioprojectaccn: chr "PRJEB8455"
#>   .. ..$ bioprojectid  : int 308667
#>   ..$ gb_projects                : chr "308667"
#>   ..$ rs_bioprojects             : list()
#>   ..$ rs_projects                : list()
#>   ..$ biosampleaccn              : chr "SAMEA3249808"
#>   ..$ biosampleid                : chr "4395276"
#>   ..$ biosource                  :List of 3
#>   .. ..$ infraspecieslist: list()
#>   .. ..$ sex             : chr ""
#>   .. ..$ isolate         : chr ""
#>   ..$ coverage                   : chr "50"
#>   ..$ partialgenomerepresentation: chr "false"
#>   ..$ primary                    : chr "4417928"
#>   ..$ assemblydescription        : chr ""
#>   ..$ releaselevel               : chr "Major"
#>   ..$ asmreleasedate_genbank     : chr "2017/04/26 00:00"
#>   ..$ asmreleasedate_refseq      : chr "1/01/01 00:00"
#>   ..$ seqreleasedate             : chr "2017/04/25 00:00"
#>   ..$ asmupdatedate              : chr "2017/04/26 00:00"
#>   ..$ submissiondate             : chr "2017/04/25 00:00"
#>   ..$ lastupdatedate             : chr "2017/04/26 00:00"
#>   ..$ submitterorganization      : chr "INRA"
#>   ..$ refseq_category            : chr "na"
#>   ..$ anomalouslist              : list()
#>   ..$ exclfromrefseq             : list()
#>   ..$ propertylist               : chr [1:4] "full-genome-representation" "latest" "latest_genbank" "wgs"
#>   ..$ fromtype                   : chr ""
#>   ..$ synonym                    :List of 3
#>   .. ..$ genbank   : chr "GCA_900177905.1"
#>   .. ..$ refseq    : chr ""
#>   .. ..$ similarity: chr ""
#>   ..$ ftppath_genbank            : chr "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/177/905/GCA_900177905.1_L564"
#>   ..$ ftppath_refseq             : chr ""
#>   ..$ ftppath_assembly_rpt       : chr "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/177/905/GCA_900177905.1_L564/GCA_900177905.1_L564_assembly_report.txt"
#>   ..$ ftppath_stats_rpt          : chr "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/177/905/GCA_900177905.1_L564/GCA_900177905.1_L564_assembly_stats.txt"
#>   ..$ ftppath_regions_rpt        : chr ""
#>   ..$ sortorder                  : chr "5C99001779059899"
#>   ..$ meta                       : chr " &lt;Stats&gt; &lt;Stat category=\"alt_loci_count\" sequence_tag=\"all\"&gt;0&lt;/Stat&gt; &lt;Stat category=\""| __truncated__
#>   ..- attr(*, "class")= chr [1:2] "esummary" "list"

but try and get the first 501 results and then 💥

s_first_501 <- entrez_summary(db, id = r$ids[1:501])
length(s_first_501) == 501 # uh oh... so 500 limit somewhere?
#> [1] FALSE
head(s_first_501) %>% str() # empty list
#>  list()


# session info:
sessioninfo::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.4.0 (2017-04-21)
#>  os       macOS Sierra 10.12.4        
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_CA.UTF-8                 
#>  tz       America/Vancouver           
#>  date     2017-05-14                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version    date       source                             
#>  assertthat    0.2.0      2017-04-11 CRAN (R 3.4.0)                     
#>  backports     1.0.5      2017-01-18 CRAN (R 3.4.0)                     
#>  broom         0.4.2      2017-02-13 CRAN (R 3.4.0)                     
#>  cellranger    1.1.0      2016-07-27 CRAN (R 3.4.0)                     
#>  clisymbols    1.1.0      2017-01-27 cran (@1.1.0)                      
#>  colorspace    1.3-2      2016-12-14 CRAN (R 3.4.0)                     
#>  curl          2.6        2017-04-27 CRAN (R 3.4.0)                     
#>  DBI           0.6-1      2017-04-01 CRAN (R 3.4.0)                     
#>  digest        0.6.12     2017-01-27 CRAN (R 3.4.0)                     
#>  dplyr       * 0.5.0      2016-06-24 CRAN (R 3.4.0)                     
#>  emo           0.0.0.9000 2017-05-14 Github (hadley/emo@4be1aa3)        
#>  evaluate      0.10       2016-10-11 CRAN (R 3.4.0)                     
#>  forcats       0.2.0      2017-01-23 CRAN (R 3.4.0)                     
#>  foreign       0.8-68     2017-04-24 CRAN (R 3.4.0)                     
#>  ggplot2     * 2.2.1      2016-12-30 CRAN (R 3.4.0)                     
#>  gtable        0.2.0      2016-02-26 CRAN (R 3.4.0)                     
#>  haven         1.0.0      2016-09-23 CRAN (R 3.4.0)                     
#>  hms           0.3        2016-11-22 CRAN (R 3.4.0)                     
#>  htmltools     0.3.6      2017-04-28 CRAN (R 3.4.0)                     
#>  httr          1.2.1      2016-07-03 CRAN (R 3.4.0)                     
#>  jsonlite      1.4        2017-04-08 CRAN (R 3.4.0)                     
#>  knitr         1.15.20    2017-05-02 Github (yihui/knitr@f3a490b)       
#>  lattice       0.20-35    2017-03-25 CRAN (R 3.4.0)                     
#>  lazyeval      0.2.0      2016-06-12 CRAN (R 3.4.0)                     
#>  lubridate     1.6.0      2016-09-13 CRAN (R 3.4.0)                     
#>  magrittr      1.5        2014-11-22 CRAN (R 3.4.0)                     
#>  mnormt        1.5-5      2016-10-15 CRAN (R 3.4.0)                     
#>  modelr        0.1.0      2016-08-31 CRAN (R 3.4.0)                     
#>  munsell       0.4.3      2016-02-13 CRAN (R 3.4.0)                     
#>  nlme          3.1-131    2017-02-06 CRAN (R 3.4.0)                     
#>  plyr          1.8.4      2016-06-08 CRAN (R 3.4.0)                     
#>  psych         1.7.3.21   2017-03-22 CRAN (R 3.4.0)                     
#>  purrr       * 0.2.2      2016-06-18 CRAN (R 3.4.0)                     
#>  R6            2.2.0      2016-10-05 CRAN (R 3.4.0)                     
#>  Rcpp          0.12.10    2017-03-19 CRAN (R 3.4.0)                     
#>  readr       * 1.1.0      2017-03-22 CRAN (R 3.4.0)                     
#>  readxl        1.0.0      2017-04-18 CRAN (R 3.4.0)                     
#>  rentrez     * 1.0.4      2016-10-26 CRAN (R 3.4.0)                     
#>  reshape2      1.4.2      2016-10-22 CRAN (R 3.4.0)                     
#>  rmarkdown     1.5        2017-04-26 CRAN (R 3.4.0)                     
#>  rprojroot     1.2        2017-01-16 CRAN (R 3.4.0)                     
#>  rvest         0.3.2      2016-06-17 CRAN (R 3.4.0)                     
#>  scales        0.4.1      2016-11-09 CRAN (R 3.4.0)                     
#>  sessioninfo   0.0.0.9000 2017-04-26 Github (r-pkgs/sessioninfo@0a5b58f)
#>  stringi       1.1.5      2017-04-07 CRAN (R 3.4.0)                     
#>  stringr       1.2.0      2017-02-18 CRAN (R 3.4.0)                     
#>  tibble      * 1.3.0      2017-04-01 CRAN (R 3.4.0)                     
#>  tidyr       * 0.6.1      2017-01-10 CRAN (R 3.4.0)                     
#>  tidyverse   * 1.1.1      2017-01-27 CRAN (R 3.4.0)                     
#>  withr         1.0.2      2016-06-20 CRAN (R 3.4.0)                     
#>  XML           3.98-1.6   2017-03-30 CRAN (R 3.4.0)                     
#>  xml2          1.1.1      2017-01-24 CRAN (R 3.4.0)                     
#>  yaml          2.1.14     2016-11-12 CRAN (R 3.4.0)
@npjc
Author

npjc commented May 14, 2017

not sure but perhaps related to #105

@dwinter
Member

dwinter commented May 15, 2017

Hi @npjc ,
Thanks for your detailed bug report -- it's really helpful to have all this information.

Looks like this is indeed the same problem as #105: NCBI returns an error for JSON requests of more than 500 records, and rentrez fails to pass it on. If you want to fetch more than 500 in one go, check out the records you get with version=1.0 (note they will be slightly different from the version 2.0/JSON records).

Failing that, you will need to batch up the IDs into lots of 500, then stitch the results you are interested in back together.
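
For anyone landing here, the batching approach could look roughly like the sketch below. It is not part of rentrez: `chunk_ids()` and `summary_in_batches()` are hypothetical helper names, and the `fetch` argument exists only so the batching logic can be exercised without hitting NCBI. It assumes a single-ID request returns one bare record rather than a list of records, which is why that case gets wrapped.

```r
# Split a vector of IDs into consecutive chunks of at most `size` elements.
chunk_ids <- function(ids, size = 500) {
  if (length(ids) == 0) return(list())
  unname(split(ids, ceiling(seq_along(ids) / size)))
}

# Hypothetical helper (not part of rentrez): fetch summaries batch by batch
# and stitch them into one plain list named by UID.
# `fetch` defaults to rentrez::entrez_summary; it is a parameter so a stub
# can stand in for the network call when testing.
summary_in_batches <- function(db, ids, size = 500,
                               fetch = rentrez::entrez_summary) {
  parts <- lapply(chunk_ids(ids, size), function(batch) {
    res <- fetch(db = db, id = batch)
    # Assumption: a one-ID request yields a single bare record, so wrap it
    # to give every batch the same list-of-records shape.
    if (length(batch) == 1) res <- stats::setNames(list(res), batch)
    res
  })
  # c() concatenates the batches; the result is a plain list, so any
  # "esummary_list" class on the individual batches is dropped.
  do.call(c, parts)
}
```

With the objects from the original report, usage would be something like `s <- summary_in_batches(db, r$ids)`, after which `length(s) == length(r$ids)` is the check to run. Each batch is a separate request, so a large ID set may also need a short pause between calls to stay within NCBI's rate limits.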

I will leave this issue open until I have a useful error message in cases like this.

@sckott
Contributor

sckott commented May 22, 2017

Was about to open an issue - getting same problem using entrez_summary - looking forward to the http error message

@dwinter
Member

dwinter commented May 22, 2017

Sorry guys, on the TODO list but the TODO list is long at the moment!

dwinter added a commit that referenced this issue May 25, 2017
@dwinter dwinter self-assigned this May 25, 2017
@dwinter
Member

dwinter commented May 25, 2017

OK, just pushed some changes to the develop branch that should take care of these problems. Will merge to master and get it on CRAN in the next few days.

@dwinter dwinter closed this as completed Jun 1, 2017
3 participants