read.accession2taxid error creating SQL file: malformed line #3

Closed
Talitrus opened this issue Feb 8, 2018 · 8 comments
Comments

@Talitrus

Talitrus commented Feb 8, 2018

Hi Scott,

Thanks for making this package available. I've been looking for a convenient way to convert accession numbers to UIDs and then to a taxonomy lineage for a while. I tried to get taxonomizr to create the .sql file, to no avail: it returns an error about a malformed line. Have you encountered this before? Any idea if I'm doing something wrong? I've attached my R logs below.

> library("taxonomizr")
> getNamesAndNodes()
trying URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz'
Content type 'unknown' length 41706736 bytes (39.8 MB)
==================================================
[1] "./names.dmp" "./nodes.dmp"
> getAccession2taxid()
This can be a big (several gigabytes) download. Please be patient and use a fast connection.
trying URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid//nucl_gb.accession2taxid.gz'
Content type 'unknown' length 988927554 bytes (943.1 MB)
==================================================
trying URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid//nucl_est.accession2taxid.gz'
Content type 'unknown' length 544402419 bytes (519.2 MB)
==================================================
trying URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid//nucl_gss.accession2taxid.gz'
Content type 'unknown' length 279039082 bytes (266.1 MB)
==================================================
trying URL 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid//nucl_wgs.accession2taxid.gz'
Content type 'unknown' length 3067473581 bytes (2925.4 MB)
==================================================
[1] "./nucl_gb.accession2taxid.gz"  "./nucl_est.accession2taxid.gz"
[3] "./nucl_gss.accession2taxid.gz" "./nucl_wgs.accession2taxid.gz"
> read.accession2taxid(list.files('.','accession2taxid.gz$'),'accessionTaxa.sql')
Reading nucl_est.accession2taxid.gz.
Reading nucl_gb.accession2taxid.gz.
Reading nucl_gss.accession2taxid.gz.
Reading nucl_wgs.accession2taxid.gz.
Error: Problem creating sql file. Deleting.
Error in trimTaxa(ii, tmp) : Malformed line on line 46212441 
In addition: There were 50 or more warnings (use warnings() to see the first 50)

I checked the *accession2taxid.gz file checksums and they do match the md5 sums from NCBI.
MD5 sums:

3c0ea1b1e5b93911d205b68a916c2a19  nucl_est.accession2taxid.gz
8f6871b4b23ba591f3f0f122d0d3cb96  nucl_gb.accession2taxid.gz
19d8a69f3efbdcb482646efa4538467e  nucl_gss.accession2taxid.gz
210fa57011a0a44b7ce3fb8faed709bf  nucl_wgs.accession2taxid.gz

Cheers,
Bryan Nguyen

@Talitrus
Author

Talitrus commented Feb 9, 2018

I got read.accession2taxid to work (using the exact same code) by running it locally instead of on a remote server. I wonder if this could be due to a permissions issue of some sort.

@sherrillmix
Owner

sherrillmix commented Feb 9, 2018

It's good that it works in some places, but pretty bad that it's inconsistent. I was trying to recreate the error on my side and haven't had much luck yet. A couple of questions:

  • Did you happen to check the 50 warnings?
  • What are the OSs of the local and remote?
  • If you run it on just one of the files does it work? e.g.:
    read.accession2taxid('nucl_est.accession2taxid.gz','accessionTaxa.sql')

Thanks.

@Talitrus
Author

Talitrus commented Feb 9, 2018

Local OS: macOS High Sierra (Version 10.13.3), R version 3.4.3, running as superuser
Remote OS: CentOS 6.7, R version 3.4.2

The warnings are all copies of:

In writeBin(bfr, con = out, size = 1L) : problem writing to connection

Running just read.accession2taxid('nucl_est.accession2taxid.gz','accessionTaxa.sql') does produce a .sql file, but still returns the warnings. Output attached below.

> read.accession2taxid('nucl_est.accession2taxid.gz','accessionTaxa.sql')
Reading nucl_est.accession2taxid.gz.
Reading in values. This may take a while.
Adding index. This may also take a while.
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In writeBin(bfr, con = out, size = 1L) : problem writing to connection
2: In writeBin(bfr, con = out, size = 1L) : problem writing to connection
...

Cheers,
Bryan

@sherrillmix
Owner

sherrillmix commented Feb 9, 2018

Hmm I don't call writeBin directly. I wonder if it's from R.utils or RSQLite. How about if you set:

options(warn=2)

so R will error out on the first warning and we can narrow down where the warning occurs.
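As a rough sketch of the same idea scoped to a single call, the helper below (`withWarningsAsErrors` is a hypothetical name, not part of taxonomizr) turns warnings into errors only while one expression runs and then restores the previous setting. After the error, `traceback()` shows the call stack that raised it.

```r
# Hypothetical helper: evaluate `expr` with warnings promoted to errors,
# then restore whatever `warn` option was set before.
withWarningsAsErrors <- function(expr) {
  old <- options(warn = 2)
  on.exit(options(old), add = TRUE)
  expr  # lazily evaluated here, while warn = 2 is active
}

# Example: the coercion warning becomes an error we can catch and inspect.
msg <- tryCatch(
  withWarningsAsErrors(as.integer("not a number")),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Wrapping the `read.accession2taxid(...)` call this way should stop at the first `writeBin` warning without leaving `options(warn=2)` set for the rest of the session.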

And just as a thought, how much space is in /tmp or wherever tempdir() writes to on the remote server?
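One way to check this from inside R, as a minimal sketch: the function below (`tempAvailBytes` is a made-up name) shells out to `df -Pk`, so it assumes a Unix-like system with `df` on the PATH and a filesystem name without spaces. For scale, the four downloads in the log above total roughly 4.65 GB compressed, so if they are decompressed into the temp directory they need more than that.

```r
# Hypothetical helper: free space (in bytes) on the filesystem that holds
# R's session temp directory. Assumes a Unix-like OS with `df` available.
tempAvailBytes <- function(dir = tempdir()) {
  # -P: POSIX one-line-per-filesystem format, -k: sizes in 1024-byte blocks
  dfOut <- system2("df", c("-Pk", shQuote(dir)), stdout = TRUE)
  # The last line holds the data; the 4th field is the "Available" column
  fields <- strsplit(trimws(dfOut[length(dfOut)]), "\\s+")[[1]]
  as.numeric(fields[4]) * 1024
}

cat("tempdir():", tempdir(), "\n")
cat("free space (GB):", round(tempAvailBytes() / 1e9, 1), "\n")
```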

@Talitrus
Author

Talitrus commented Feb 9, 2018

I checked the /tmp directory, and it looks very likely that there isn't enough free space. Here's the result of running df -Ph . in /tmp:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       5.6G  3.2G  2.2G  60% /tmp

And here's what happened when I ran it with options(warn=2):

> options(warn=2)
> library("taxonomizr")
> read.accession2taxid(list.files('.','accession2taxid.gz$'), 'accessionTaxa.sql')
Reading nucl_est.accession2taxid.gz.
Reading nucl_gb.accession2taxid.gz.
Error in writeBin(bfr, con = out, size = 1L) : 
  (converted from warning) problem writing to connection

@sherrillmix
Owner

sherrillmix commented Feb 9, 2018

Yeah, probably not enough space in /tmp. It looks like R.utils::gunzip is warning instead of erroring out when it runs out of space. If so, I'll have to see if I can catch that, since that's a pretty annoyingly vague error message. (It would be smarter to just read the gzip file directly, but then you run into trouble compiling on Windows.)

Could you try setting the temp directory to somewhere with some space when you start R? For example if you have a lot of space in whatever directory you are working in, you could start R with:

TMPDIR='.' R

You can make sure this worked by typing tempdir() and making sure it returns something like ./RtmpXXXXXX. Then run the same command. If that runs fine then I guess we know what happened.
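The same check can be done programmatically. This is only a sketch, and `tempdirIsDirectlyUnder` is a hypothetical helper name; it compares the parent of the session temp directory with the directory you expected, after resolving both to absolute paths. Note that the temp directory is chosen when R starts, so TMPDIR has to be set before launching R, as in the command above.

```r
# Hypothetical helper: TRUE if the session temp directory (RtmpXXXXXX)
# was created directly inside `dir`.
tempdirIsDirectlyUnder <- function(dir) {
  parent <- dirname(normalizePath(tempdir()))
  identical(parent, normalizePath(dir))
}

# After starting R with TMPDIR='.' in the working directory,
# this is expected to print TRUE.
print(tempdirIsDirectlyUnder(getwd()))
```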

Thanks for helping debug this.

@Talitrus
Author

Yup, that worked perfectly. Thanks, Scott!

@sherrillmix
Owner

Great. Thanks for debugging. Really helpful to find these edge cases.

It looks like gunzip doesn't create the target file if the disk is full, so I can just check file.exists after gunzip and output a more informative error. I'll get that into the next version.
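A minimal sketch of what such a check could look like, not the actual taxonomizr code: `checkedDecompress` and `baseGunzip` are hypothetical names, and `baseGunzip` is a base-R stand-in used here only so the example runs without R.utils installed.

```r
# Hypothetical wrapper: run any decompressor function(gzFile, outFile)
# (for instance one that calls R.utils::gunzip) and fail with a clear
# message if the expected output file was not created.
checkedDecompress <- function(decompress, gzFile, outFile) {
  decompress(gzFile, outFile)
  if (!file.exists(outFile)) {
    stop("Decompressing ", gzFile, " did not produce ", outFile,
         ". Is there enough free space in ", dirname(outFile), "?",
         call. = FALSE)
  }
  invisible(outFile)
}

# Base-R stand-in decompressor: copy the gzip stream out in chunks.
baseGunzip <- function(gzFile, outFile) {
  inCon <- gzfile(gzFile, open = "rb")
  on.exit(close(inCon), add = TRUE)
  outCon <- file(outFile, open = "wb")
  on.exit(close(outCon), add = TRUE)
  repeat {
    bfr <- readBin(inCon, what = "raw", n = 1e6)
    if (length(bfr) == 0L) break
    writeBin(bfr, outCon)
  }
}

# Usage: build a tiny gzip file and decompress it with the check in place.
gz <- tempfile(fileext = ".gz")
con <- gzfile(gz, "wb")
writeLines(c("accession\ttaxid", "X00001\t9606"), con)
close(con)
out <- checkedDecompress(baseGunzip, gz, tempfile())
print(readLines(out))
```

The point of the wrapper is that the user sees a message about disk space at the decompression step, rather than a "malformed line" error later on.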
