long vectors not supported yet? #29
What R version do you have, and what versions of the dependencies? Can you paste in what you get from …
I think I found a solution; trying it now...
Looks like there are two things going on here: …

I'll see if there's anything I can do to warn users about data size limits...
Hi BOLD folks,

There must be a more computer-savvy approach to this, but I've essentially split up the work of downloading all COI-5P sequences into four bins in an attempt to keep things under 1 million rows per file. The non-computer-savvy way I get it to work is by going to BOLD's taxonomy page and copying and pasting all the animal taxa listed (by phylum), then repeating the process: clicking on the arthropod link and highlighting all those taxa, clicking on the insects and highlighting all those, and clicking on the lepidopterans and highlighting all those. Each taxon then gets its own call, for example:

```r
Acanthopteroctetidae <- bold_seqspec(taxon = "Acanthopteroctetidae", marker = "COI-5P")
```

You could break this up by the four big bins I've described above, or, for my purposes, do it all together, because you want a master file of all sequences anyway.

The approach listed above failed on my first attempt because the BOLD folks have mistakes in a very small number of the hyperlinks that the bold R package uses to grab data from the BOLD servers. Specifically, for certain taxa there are incongruities between the URL listing that taxon's name/description and the public-database URL where you get the sequence info. So the API from which you draw sequences and stats fails in the middle of these individual downloads. In other words, you'll always get taxonomy info (try clicking on the links directly and they always work), but click the "public data" link at the bottom of that page and the search is broken.

This issue isn't confined to this one instance; it extends to many other public-data links where the taxonomy browser contains a match for the term, and it plays out for several different organism groups listed below. Here's one example: Prodidactidae exists in your taxonomy page (here) but does not exist when you click to pull public data from that page. There are just two species entries, of which one sequence exists, but again, you can't pull it from the public data link (here).
See the following list for the same effect: Ratardidae.

The solution to the problem was simply to exclude those individuals from the list. I haven't heard back from the BOLD people, but I'm guessing that if it's critical to include those sequences, we could always look at the taxonomy page for each member, grab its GenBank accession number, and enter it manually that way. Once you exclude those members, the download proceeds as expected and takes less than an hour. Pretty good for millions of entries.

Finally, I'm not sure where I've made a mistake, but I was using some bash commands (awk, sed, paste, etc.) to concatenate and manipulate these files. Things have been getting a bit off, and I'm worried it's because there is non-uniformity in the number of fields present within each of these separate files. I can't confirm yet whether that's true, but I'm going to try using some R commands in lieu of the bash commands and see if the separate files can be merged successfully without issues downstream.

Holding out hope still; thanks for the input and help as always.

Devon
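Devon's split-and-exclude workflow could be sketched in R along these lines. This is a minimal sketch, not code from the thread: the `taxa` vector is an illustrative subset (the real list would be copied from BOLD's taxonomy browser), and wrapping each call in `tryCatch()` plus binding with `data.table::rbindlist(fill = TRUE)` are my suggestions for skipping the taxa with broken public-data links and guarding against per-file differences in field counts.

```r
library(bold)        # the bold R package discussed in this issue
library(data.table)  # for rbindlist()

# Illustrative subset of families copied from BOLD's taxonomy browser
taxa <- c("Acanthopteroctetidae", "Adelidae", "Agathiphagidae")
# Families whose "public data" links are broken on BOLD's side (examples from this thread)
broken <- c("Prodidactidae", "Ratardidae")

results <- list()
for (tx in setdiff(taxa, broken)) {
  # tryCatch() keeps one failing taxon from killing the whole download
  results[[tx]] <- tryCatch(
    bold_seqspec(taxon = tx, marker = "COI-5P"),
    error = function(e) {
      message("failed: ", tx, " (", conditionMessage(e), ")")
      NULL
    }
  )
}

# fill = TRUE pads missing columns with NA, so files with differing
# numbers of fields can still be merged into one master table
combined <- rbindlist(Filter(Negate(is.null), results), fill = TRUE)
```

Merging in R this way, rather than with awk/sed/paste, also sidesteps the field-count mismatches described above.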
Two other things that would potentially be useful: …
@devonorourke Sorry for the very long delay; this slipped off my radar. I just pushed a fix for long vectors; see the commit above.
Possibly. I'll see how fast it can be; if it's not fast, I'd leave it to the user to do. There is the parameter …
@devonorourke I assume this comment (#29 (comment)) is an email you sent to BOLD?
@devonorourke see fxn …
Yes to #29 comment.

Devon O'Rourke
I ended up making a 'full-BOLD' and a 'singlerep-BOLD' dataset by the following approach:

```r
# import the "allinfo_arthropodonly.txt" file
# replace all 'NA' values with "UNDEFINED"
# replace all the whitespace in the 'species_name' field with underscores
# load package 'stringr' to count the number of gaps in each nucleotide string
# count the number of '-' characters in each element of the string and print as a new field
# sort the df by the gaplength field
# subset the df by removing duplicates (by 'bin_uri' AND 'species_name')
```

I would use that data frame to create the subsequent taxonomy file and …

Devon O'Rourke
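The comment-only pipeline in that reply can be fleshed out into runnable R. This is a hypothetical reconstruction, not Devon's actual script: the toy data frame stands in for the real `allinfo_arthropodonly.txt` export, and the column names `bin_uri`, `species_name`, and `nucleotides` are assumptions about that file's layout.

```r
library(stringr)

# Toy stand-in for the "allinfo_arthropodonly.txt" export (hypothetical columns)
df <- data.frame(
  bin_uri      = c("BOLD:AAA1", "BOLD:AAA1", "BOLD:BBB2"),
  species_name = c("Apis mellifera", "Apis mellifera", NA),
  nucleotides  = c("ACGT--ACGT", "ACGTACGTAC", "AC----GTAC"),
  stringsAsFactors = FALSE
)

# replace all NA values with "UNDEFINED"
df[is.na(df)] <- "UNDEFINED"

# replace whitespace in species_name with underscores
df$species_name <- gsub("\\s+", "_", df$species_name)

# count the '-' characters (alignment gaps) in each sequence as a new field
df$gaplength <- str_count(df$nucleotides, "-")

# sort by gap count so the least-gapped record of each pair comes first...
df <- df[order(df$gaplength), ]

# ...then keep one representative per bin_uri + species_name combination
single <- df[!duplicated(df[, c("bin_uri", "species_name")]), ]
```

Here `df` plays the role of the 'full-BOLD' table and `single` the 'singlerep-BOLD' one: the duplicate `Apis_mellifera` record with two gaps is dropped in favour of its gap-free twin.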
Sorry that wasn't more helpful. So the code in your comment above is what you'd imagined the …
I'm very excited to use the bold R package, but I've been running into trouble when trying to pull a large dataset together using the `bold_seqspec` function. Specifically, I'm trying to grab all COI-5P sequences from BOLD's database, and that's a lot of data. I started by installing the package and ran the following script successfully:
This creates a 53-column, 58-line text file. I've performed this task successfully for the above script both in RStudio and from the command line (running Linux 3.13.0-85-generic, R version 3.3.0).
However, I wanted to be able to modify the script above to include the biggest group in one chunk - Arthropods - by running these commands:
Unfortunately I get an error message:
I was under the impression that relatively recent releases of R support long vectors. Perhaps more to the point, I wasn't thinking that this was a particularly long vector in terms of columns, but perhaps it does exceed that 900,000 limit in terms of rows (there are certainly more than 900,000 rows in this dataset).
To that end, perhaps you can speak to the maximum number of entries that may be downloaded at once by these scripts (should one exist).
Thanks very much
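On the question of size limits, one way to gauge a group before downloading is to ask BOLD for record counts and split anything that looks too big. This is my suggestion rather than something stated in the thread: the bold package does ship a `bold_stats()` function, but the `total_records` field name and the ~900,000-row threshold used below are assumptions to treat as a sketch.

```r
library(bold)

# Illustrative subset of arthropod classes; a real split would come from
# BOLD's taxonomy browser, as in the four-bin approach described above.
classes <- c("Insecta", "Arachnida", "Collembola")

for (cl in classes) {
  # bold_stats() queries BOLD's stats API; total_records is an assumed field name
  n <- bold_stats(taxon = cl)$total_records
  if (!is.null(n) && n > 9e5) {
    message(cl, ": ", n, " records; split into orders before calling bold_seqspec()")
  } else {
    message(cl, ": ", n, " records; likely safe for a single bold_seqspec() call")
  }
}
```

Checking counts first avoids starting an hours-long download only to hit a long-vector error partway through.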