
Error in readBin while reading .gz files #31

Closed
Eliot-RUIZ opened this issue Apr 30, 2021 · 9 comments


@Eliot-RUIZ

Hi!

I need to obtain taxids from a huge list of accession numbers, so "taxonomizr" seems to be the perfect option.

However, I get the following error when I run the prepareDatabase or read.accession2taxid commands (shown here after multiple tries, so the databases are already downloaded):

Downloading names and nodes with getNamesAndNodes()
./names.dmp, ./nodes.dmp already exist. Delete to redownload
Downloading accession2taxid with getAccession2taxid()
This can be a big (several gigabytes) download. Please be patient and use a fast connection.
./nucl_gb.accession2taxid.gz, ./nucl_wgs.accession2taxid.gz already exist. Delete to redownload
Preprocessing names with read.names.sql()
Preprocessing nodes with read.nodes.sql()
Preprocessing accession2taxid with read.accession2taxid()
Reading ./nucl_gb.accession2taxid.gz.
Error in readBin(inn, what = raw(0L), size = 1L, n = BFR.SIZE) :
  error reading from the connection
In addition: Warning message:
In readBin(inn, what = raw(0L), size = 1L, n = BFR.SIZE) :
  invalid or incomplete compressed data

I tried many things:
- Deleting all the files and redownloading them -> same error
- Downloading only the nucl_gb file -> same error
- Downloading the nucl_gb file manually and running read.accession2taxid separately -> same error
- Rewriting the files with overwrite = TRUE in the read.names.sql and read.nodes.sql functions -> same error
- Changing the SQL database name -> same error, and the same file (both 381,184 KB) just got another name
- Changing the temporary directory (following your reply in Issue 3, the method in the last answer), since I saw that taxizedb was using the same one -> the temporary folder was successfully changed, but same error

I saw that @MajaCN had exactly the same issue and managed to deal with it, but the solution was not provided ("we found a work-around and have the files now!").

Last things that might be useful for resolving this problem:
- My computer has 91.1 GB of disk space left.
- I am running Windows 10.
- I get the following error when running: accessionToTaxa("Z17430.1", "accession_2_Taxa.sql")

Error: no such table: accessionTaxa
Warning message:
In file.remove(tmp) :
  cannot remove file 'C:/Users/lelio/DOCUME~1/STAGEM~1/LOCAL_~1\RtmpYdSW74\file3d4c119e4bf5', reason 'Permission denied'

Thanks in advance for helping me!

Best regards,

Eliot RUIZ

@sherrillmix
Owner

Your problem is probably related to downloading since R is reporting the large accession2taxid.gz ends prematurely.

I think the workaround was downloading the files manually, e.g. with a browser, then processing with taxonomizr as normal. Whatever the issue was seemed to also mess up downloads outside R so I believe she ended up downloading on another computer. Might be worth trying a manual download here (with your own computer at first) to narrow things down.

The final error sounds like some sort of permissions issue, with R not able to write to C:/Users/lelio/DOCUME~1/STAGEM~1/LOCAL_~1\RtmpYdSW74. Maybe try something like:

tmp<-tempfile()
print(tmp)
writeLines("This is a test",tmp)
readLines(tmp)
file.remove(tmp)

Maybe also report sessionInfo() to narrow down version/environment issues. And you might as well make sure you're on the current taxonomizr version, v0.7.1, if you're not already. I'd guess you're on an older version since, in newer versions, taxonomizr should check the md5sum of the download and complain if it differs from what's expected (or else that check isn't working properly and I need to fix it).
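The version check mentioned above can be done directly in R (a quick sketch; reinstall from CRAN or GitHub if the result is older than 0.7.1):

```r
# Report the installed taxonomizr version
packageVersion("taxonomizr")

# If it's older than 0.7.1, update it, e.g.:
# install.packages("taxonomizr")
```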

@MajaCN

MajaCN commented May 1, 2021

Hi,
yes, exactly, our workaround was downloading them manually on a different machine.

@Eliot-RUIZ
Author

Hi!

Thanks for taking the time to answer me!

As you suggested, I reinstalled the package from GitHub. I then changed the temporary files folder back to the original one.

After that, I tried your code and everything worked perfectly with the message printing and then being successfully removed.

Then I permanently deleted all the node and compressed files, changed the folder used for saving, and ran the prepareDatabase() function again. I got this error a few minutes later:

`Downloading names and nodes with getNamesAndNodes()
trying URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz'
downloaded 19.4 MB

trying URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz.md5'
downloaded 49 bytes

Error in (function (outDir = ".", url = "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz", :
Downloaded file does not match ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz File corrupted or download ended early?`

I am sure I removed those files, and I can no longer see them anywhere...

Here is the result of sessionInfo():

`R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 LC_MONETARY=French_France.1252 LC_NUMERIC=C
[5] LC_TIME=French_France.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] taxonomizr_0.7.1

loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 rstudioapi_0.11 magrittr_1.5 rappdirs_0.3.3 tidyselect_1.1.0 bit_4.0.4 R6_2.4.1 rlang_0.4.7
[9] hoardr_0.5.2 blob_1.2.1 dplyr_1.0.2 tools_4.0.2 data.table_1.13.0 xfun_0.16 tinytex_0.25 DBI_1.1.0
[17] taxizedb_0.3.0 dbplyr_1.4.4 ellipsis_0.3.1 bit64_4.0.5 digest_0.6.25 assertthat_0.2.1 tibble_3.0.3 lifecycle_0.2.0
[25] crayon_1.3.4 purrr_0.3.4 vctrs_0.3.4 curl_4.3 glue_1.4.2 memoise_1.1.0 RSQLite_2.2.3 compiler_4.0.2
[33] pillar_1.4.6 generics_0.0.2 pkgconfig_2.0.3`

Best regards,
Eliot RUIZ

@sherrillmix
Owner

Hmm taxdump.tar.gz should be about 55 MB (e.g. as shown here https://ftp.ncbi.nih.gov/pub/taxonomy/). So it appears the download is truncated for you (and the function is correctly flagging a problem). I'm not sure if this is NCBI's server intermittently messing up (I've had trouble downloading from them in the past) or some bigger Windows/R issue.

Could you try running the raw R command to download and check the md5 a few times:

download.file('ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz','taxdump.tar.gz')
print(tools::md5sum('taxdump.tar.gz'))
print(readLines('ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz.md5'))

to see if you get a 55 MB file with a consistent md5.

If that download.file() works, could you give prepareDatabase() one more shot just to see if it was a NCBI issue that cleared up in the meantime. If the download.file() doesn't work then we'll have narrowed it down to something outside taxonomizr and we should probably move to trying to download the files outside R and working from there.
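The repeated check suggested above could be sketched as a small loop, comparing the local md5 against the one NCBI publishes (same URLs as in the commands above; the remote .md5 file is assumed to start with the checksum followed by the filename):

```r
url <- 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz'
for (ii in 1:3) {
  # Download the archive and compute its md5
  download.file(url, 'taxdump.tar.gz', mode = 'wb')
  localMd5 <- unname(tools::md5sum('taxdump.tar.gz'))
  # First whitespace-separated field of the .md5 file is the checksum
  remoteMd5 <- strsplit(readLines(paste0(url, '.md5')), '\\s+')[[1]][1]
  message(sprintf("Attempt %d: local %s, remote %s, match: %s",
                  ii, localMd5, remoteMd5, localMd5 == remoteMd5))
}
```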

@Eliot-RUIZ
Author

Thanks for your quick answer!

Here is the result for download.file('ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz','taxdump.tar.gz'):
trying URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz'
downloaded 52.5 MB

And here is the md5: taxdump.tar.gz "f292f588b49033485a0843d241137b91"

I tried the prepareDatabase() function again and got the same error as before, but this time the downloaded file is 68.1 MB, whereas it was 19.4 MB before...

@sherrillmix
Owner

I think I'm finally able to somewhat replicate this (or a similar bug). It appears that on Mac or Windows, the options('timeout') setting is not respected (and this isn't documented anywhere). For example, on Linux, if I set the timeout very short, I get an appropriate failure a few seconds after running the command:

> taxonomizr::getNamesAndNodes(timeout=2)
trying URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz'
Error in utils::download.file(url, tarFile, mode = "wb") : 
  cannot open URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz'
In addition: Warning message:
In utils::download.file(url, tarFile, mode = "wb") :
  URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz': Timeout of 2 seconds was reached

But on Windows, the command runs for about a minute and then fails, provided the connection is slow enough that the download takes longer than 60 seconds. The same happens if I run the download.file command directly:

options(timeout=2)
download.file('ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz', 'taxdump.tar.gz')

Mac also does not respect the timeout but seems to run to completion at least for medium length (~20 minute) downloads.

So on my side, I guess I just can't trust download.file to do the right thing. I'll investigate other packages to handle this simple file download. The curl or RCurl packages, perhaps?
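A minimal sketch of what the curl-based replacement could look like (curl_download() streams to disk and raises an error on an incomplete transfer, rather than leaving a truncated file behind):

```r
library(curl)

# Download to disk; errors out instead of silently truncating
curl_download('ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz',
              'taxdump.tar.gz', quiet = FALSE)
```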

And for you, I guess there are three options:

  1. Wait a few days until I get a chance to debug/fix
  2. Use Mac/Linux (at least for initial database setup)
  3. Just download these files with e.g. Firefox/Chrome:
    * https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
    * https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
    * https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz
    to a directory MY/PATH/ and point R to those files using something like:
    prepareDatabase(url='file://MY/PATH/taxdump.tar.gz', baseUrl='file://MY/PATH/')
    Note: if you don't need accession numbers, you could just do:
    prepareDatabase(url='file://MY/PATH/taxdump.tar.gz', getAccessions=FALSE)

Thanks for helping me (potentially) get to the bottom of a very annoying issue.

@sherrillmix
Owner

The GitHub version of the package has been updated to use the curl package for downloading. If you get the chance, maybe try it and see if that fixes things. You can install it with:
devtools::install_github('sherrillmix/taxonomizr')

@Eliot-RUIZ
Author

Hi!

Thank you very much for your help!

Everything worked perfectly for me when running the prepareDatabase function!

I can now finally use your package!

Best regards,
Eliot RUIZ

@sherrillmix
Owner

Great. Thanks a lot for the follow up and for the help tracking it down.

I'll go ahead and push that version to CRAN. It's a shame to add a dependency (I'm not sure how much pain the curl libraries add on Windows/Mac), but this will hopefully squash what has been a very difficult-to-nail-down bug.
