Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

long vectors not supported yet? #29

Closed
devonorourke opened this issue May 18, 2016 · 12 comments
Closed

long vectors not supported yet? #29

devonorourke opened this issue May 18, 2016 · 12 comments

Comments

@devonorourke
Copy link

devonorourke commented May 18, 2016

I'm very excited to use the bold R-package but have been running into trouble when trying to pull a large dataset together using the bold_seqspec script. Specifically, I'm trying to grab all COI-5P sequences from BOLD's database. That's a lot of data.

I started by installing the package and ran the following script successfully:

install.packages("bold")
library(bold)
df_tiny <- bold_seqspec(taxon="Echiura", marker="COI-5P")
write.table(df_tiny, file = "/home/R/euthria_bold_seqspec.txt", sep = "\t")

This creates a 53-column, 58-line text file. I've performed this task in both R-studio and from the command line (running Linux 3.13.0-85-generic, R version 3.3.0) successfully for the above script.

However, I wanted to be able to modify the script above to include the biggest group in one chunk - Arthropods - by running these commands:

df_large <- bold_seqspec(taxon="Arthropoda", marker="COI-5P")
write.table(df_large, file = "/home/R/arthropoda_bold_seqspec.txt", sep = "\t").

Unfortunately I get an error message:

Error in rawToChar(content(out, encoding = "UTF-8")) : 
  long vectors not supported yet: raw.c:68

I was under the impression that the relatively recent releases of R have enabled long vectors to be supported. Perhaps more to the point, I wasn't thinking that this was a particularly long vector in terms of columns, but perhaps it does exceed that 900,000 limit in terms of rows (there certainly are more than 900,000 rows in this dataset).

To that end, perhaps you can speak to the maximum number of entries that may be downloaded at once by these scripts (should one exist).

Thanks very much

@sckott
Copy link
Contributor

sckott commented May 18, 2016

What R version do you have, and what version of deps. Can you just paste in what you get from devtools::session_info() or sessionInfo()

@sckott
Copy link
Contributor

sckott commented May 18, 2016

Think I found solution, trying it now...

@sckott
Copy link
Contributor

sckott commented May 18, 2016

Looks like there are two things going on here:

  1. The data you request for the search Arthropoda gives ~3.8 million records, and that data takes a long time to download. So just for your own sanity, it might be better to break that up into groups as you describe din your email to me
  2. The other is that very long strings were not supported by the rawToChar() function. However, with multiple = TRUE for that function call, it can work - which gives each character as a separate element of a vector - then we can paste0(..., collapse = "") them together - I'll push this change soon. However, in the above point, you'd probably want to do smaller data requests anyway, so that you aren't waiting for a massive request to finish (in which time internet can go out, or you have to go home, or something) - AND massive requests are more likely to cause problems for the data provider

@sckott
Copy link
Contributor

sckott commented May 18, 2016

I'll see if there's anything I can do to warn users about data size limits...

@devonorourke
Copy link
Author

devonorourke commented May 20, 2016

Hi BOLD folks,

There must be a more computer savvy approach to this, but I've essentially split up the work of downloading all COI-5P sequences into four bins in attempts to keep things under 1 million rows per file:
-- download all animal sequences except arthropods
-- download all arthropod sequences except insects
-- download all insect sequences except lepidopterans
-- download all the damn moths.

The non-computer savvy way I get it to work is by going to BOLD's taxonomy page and copying and pasting all the animal taxa listed (by phyla). Then repeat by clicking on the arthropod link, highlighting all those; clicking on the insects and highlighting all those; clicking on the lepidopterans and clicking all those.
From that list of lists I create a series of lines that follow the same format whereby using the bold R.package command:

Acanthopteroctetidae <- bold_seqspec(taxon="Acanthopteroctetidae", marker="COI-5P")
write.table(Acanthopteroctetidae, file = "/home/orourke/R/Acanthopteroctetidae_taxmap_COI.txt", sep = "\t")
rm(Acanthopteroctetidae)
Adelidae <- bold_seqspec(taxon="Adelidae", marker="COI-5P")
write.table(Adelidae, file = "/home/orourke/R/Adelidae_taxmap_COI.txt", sep = "\t")
rm(Adelidae)
...and repeat for every one of those species in the list of lists...

You could break this up by the four big bins I've describe above, or for my purposes you can do it all together because you want a master file of all sequences anyway.

The approach listed above failed on my first attempt because the BOLD folks have mistakes in a very small number of the hyperlinks that the bold R-package uses to grab from the BOLD servers. Specifically, there are incongruities between certain taxa level and the url listing that taxa name/description and the public database url where you get the sequence info from. So the API in which you draw sequence and stats fails in the middle of these individual downloads.

In other words, you'll always get taxonomy info - try clicking on the links directly and they always work. But click that "public data" link at the bottom of that page and the search is broken. This issue isn't confined to this one instance. It extends to many other public data links where the taxonomy browser contains a match for the term. This case plays out for several different organism groups listed below. Here's one example:

Prodidactidae exists in your taxonomy page (here) but does not exist when you click to pull public data from that page. There are just two species entries of which one sequence exists, but again, you can't pull it from the public data link (here).

See the following list for the same effect:

Ratardidae
Schistonoeidae
Somabrachyidae
Tridentaformidae
Whalleyanidae
Zoraptera
Pentastomida

The solution to the problem was simply to exclude those individuals from the list. I haven't heard back from the BOLD people, but I'm guessing that if it's critical to include those sequences we could always look at the taxonomy page for each member, grab it's Genbank accession number, and enter it manually that way.

Once you exclude those members, the download proceeds as expected and takes less than an hour. Pretty good for millions of entries.

Finally, I'm not sure where I've made a mistake, but I was using some bash (awk, sed, paste, etc.) commands to concatenate and manipulate these files. Things have been getting a bit off and I'm worried it's because there is non-uniformity in the number of fields present within each of these separate files. Can't confirm yet if it's true but am going to try using some R commands in lieu of bash commands and see if the separate files can be merged successfully without issues downstream.

Holding out hope still - thanks for the input and help as always.

Devon

@devonorourke
Copy link
Author

Two other things that would be potentially useful:

  1. When you pull out all these millions of records from BOLD, there are multiple sequences representing the same species. While this is certainly useful in some contexts, I wonder if it would be possible to specify to pull just the longest nucleotide sequence for a given representative species (if named to that taxa level). Note that you'd need to count only nucleotide characters not the gap/alignment character "-".
  2. Related to the point above, I wonder if it's possible to specify not the species ID, but rather the BOLD BIN. If we could sort all unique BIN entries and just pull the longest representative sequence from there it might also be helpful in reducing the search space in subsequent OTU calling. Perhaps this sets a dangerous precedent and should be avoided in cases where species-level detection is a must; perhaps a way to avoid this would for BOLD as a database to identify BINs that are uniquely assigned to a single species and BINs that are grouped with multiple species. Or maybe that's wishful thinking?

@sckott
Copy link
Contributor

sckott commented Sep 30, 2016

@devonorourke Sorry for the very long delay. This slipped off my radar.

Just pushed that fix for long vectors, see commit above.

When you pull out all these millions of records from BOLD, there are multiple sequences representing the same species. While this is certainly useful in some contexts, I wonder if it would be possible to specify to pull just the longest nucleotide sequence for a given representative species (if named to that taxa level). Note that you'd need to count only nucleotide characters not the gap/alignment character "-".

Possibly, I'll see how fast it can be, and if not fast, would leave to user to do

Related to the point above, I wonder if it's possible to specify not the species ID, but rather the BOLD BIN. If we could sort all unique BIN entries and just pull the longest representative sequence from there it might also be helpful in reducing the search space in subsequent OTU calling. Perhaps this sets a dangerous precedent and should be avoided in cases where species-level detection is a must; perhaps a way to avoid this would for BOLD as a database to identify BINs that are uniquely assigned to a single species and BINs that are grouped with multiple species. Or maybe that's wishful thinking?

there is the parameter bin - have you not seen it? I don't have much to say on what BOLD could do. You could/should ask them

@sckott
Copy link
Contributor

sckott commented Sep 30, 2016

@devonorourke I assume this comment #29 (comment) is an email you sent to BOLD?

sckott added a commit that referenced this issue Sep 30, 2016
@sckott
Copy link
Contributor

sckott commented Sep 30, 2016

@devonorourke see fxn bold_filter - was trying to add filtering by longest or shortest sequence to bold_seqspec, but there's already too much complexity in that fxn (that is, too many ways for it to fail, making it hard to maintain and predict all failure scenarios) Let me know what you think

@devonorourke
Copy link
Author

Yes to #29 comment
#29 (comment) . Their
response was less than inspiring. Basically stated that it wasn't a problem
and I was doing something wrong.

On Fri, Sep 30, 2016 at 5:14 PM, Scott Chamberlain <notifications@github.com

wrote:

@devonorourke https://github.com/devonorourke I assume this comment #29
(comment)
#29 (comment) is an
email you sent to BOLD?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#29 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AKqgXJTLQy3YcAzN_0c9llErespax8inks5qvXuogaJpZM4Ig1Oj
.

Devon O'Rourke
Graduate student in Molecular and Evolutionary Systems Biology
University of New Hampshire
Lab of Dr. Jeff Foster
fozlab.weebly.com/guano

@devonorourke
Copy link
Author

I ended up making a 'full-BOLD' and 'singlerep-BOLD' dataset by the
following (note this likely has it's own complications but as a general
start was helpful in getting a species-only database). Note that the first
file, "allinfo_arthropodonly.txt" is what I would pull using your 'bold' R
package's 'seqspec' command.

#####################################################################################
#################

#import the "allinfo_arthropodonly.txt" file
df_allinfo_arthropodonly <- read.delim(file =
"/pathto/allinfo_arthropodonly.txt", sep = "\t", na.strings = " ",
stringsAsFactors = FALSE)

#replace all 'NA' values with "UNDEFINED"
df_allinfo_arthropodonly[is.na(df_allinfo_arthropodonly)] <- "UNDEFINED"

#replace all the whitespace in the 'species_name' field with underscores
df_allinfo_arthropodonly$species_name <- (gsub(" ", "_",
df_allinfo_arthropodonly$species_name))

#load package 'stringr' to count number of gaps in each nucleotide string
library(stringr)

#Count the number of '-' character in each element of string and print as
new column in df
df_allinfo_arthropodonly$gaplength <-
str_count(df_allinfo_arthropodonly$nucleotides, "-")

#sort the df by the gaplength field:
df_allinfo_arthropodonly <-
df_allinfo_arthropodonly[order(df_allinfo_arthropodonly$gaplength),]

#subset the df by removing duplicates (by 'bin_uri' AND 'species_name')
with preference to smallest 'gaplength'
df_singlerep_arthonly <-
df_allinfo_arthropodonly[!duplicated(df_allinfo_arthropodonly[c("bin_uri",
"species_name")]),]

######################################################################################################

I would use that data frame to create the subsequent taxonomy file and
fasta files of interest.

On Fri, Sep 30, 2016 at 5:58 PM, Scott Chamberlain <notifications@github.com

wrote:

@devonorourke https://github.com/devonorourke see fxn bold_filter - was
trying to add filtering by longest or shortest sequence to bold_seqspec,
but there's already too much complexity in that fxn (that is, too many ways
for it to fail, making it hard to maintain and predict all failure
scenarios) Let me know what you think


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#29 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AKqgXCgyLhmLsnii3rPaq7lSs2OUbjL3ks5qvYYQgaJpZM4Ig1Oj
.

Devon O'Rourke
Graduate student in Molecular and Evolutionary Systems Biology
University of New Hampshire
Lab of Dr. Jeff Foster
fozlab.weebly.com/guano

@sckott
Copy link
Contributor

sckott commented Oct 4, 2016

Sorry that weren't more helpful.

So the code in your comment above is what you'd imagined the bold_filter would do?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants