long vectors, round 2 #61

Closed
devonorourke opened this issue Feb 23, 2019 · 3 comments

devonorourke commented Feb 23, 2019

Hi Scott,
I'm trying, and failing, to download the entire insect dataset with bold_seqspec. The insectNames vector was scraped directly from the BOLD website; this was my first hack at a programmatic way to pull the insect order names from there. Folks who know what they are doing could probably do this better!

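library(bold)       # provides bold_seqspec, used further down
library(taxize)
library(stringr)
library(rvest)
library(tidyverse)

# Fetch the BOLD taxon page for Insecta (taxid=82) and extract its text blocks
boldurl <- read_html("http://v4.boldsystems.org/index.php/Taxbrowser_Taxonpage?taxid=82")
boldtext <- boldurl %>% html_nodes("div.col-md-6") %>% html_text()
# Block 7 and the character offset 71 are tied to the page layout at the time of writing
tmptext <- substr(boldtext[7], start=71, stop=nchar(boldtext[7]))
# Remove the digits, then split on "[" and strip "]" to isolate the names
tmptext2 <- gsub('[[:digit:]]+', '', tmptext)
tmptext3 <- unlist(strsplit(tmptext2, '\\['))
tmptext4 <- gsub('\\]', '', tmptext3)
tmptext5 <- str_trim(tmptext4)
insectNames <- tmptext5[tmptext5 != ""]
rm(list=ls(pattern = "tmptext"))  # drop the intermediate objects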

With that insectNames vector, I then run the bold_seqspec call:

Insects_list <- lapply(insectNames, bold_seqspec)

But unfortunately, it generates this error:

Error in paste0(rawToChar(out$content, multiple = TRUE), collapse = "") : 
  result would exceed 2^31-1 bytes
In addition: Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  EOF within quoted string

My next idea is to break that list into smaller pieces, probably one for Lepidoptera, one for Coleoptera, and one for the rest. Thanks for any insights you might offer!
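
Before splitting the list, it may help to find out which taxon triggers the error. Below is a minimal sketch, not part of the original post: `fetch_each` is a hypothetical helper that calls a fetch function once per name and records failures instead of aborting the whole `lapply`. The fetch function is passed in, so `bold::bold_seqspec` can be used for the real download.

```r
# Hypothetical helper: run `fetch` once per taxon and keep going on errors.
fetch_each <- function(taxa, fetch) {
  results <- lapply(taxa, function(taxon) {
    tryCatch(
      fetch(taxon),
      error = function(e) {
        # Report the failing taxon, then return NULL as a placeholder
        message("Failed for ", taxon, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  names(results) <- taxa
  results
}

# Usage (needs the bold package and network access):
# insects_list <- fetch_each(insectNames, bold::bold_seqspec)
# failed <- names(Filter(is.null, insects_list))
```

The names left in `failed` are the ones worth retrying in smaller pieces.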

devonorourke commented Feb 26, 2019

Quick follow up:
I ended up breaking up the data into six groups total and there wasn't any further issue with memory. The groups were:

  1. All non-insect arthropods
  2. Diptera
  3. Lepidoptera
  4. Hymenoptera
  5. Coleoptera
  6. remaining insects
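
A grouping like the one above could be built programmatically. This is only a sketch under assumptions: `group_taxa` is a hypothetical helper, the four large orders are taken from the list above, and the non-insect arthropod group is left out because it is not part of the insect order names.

```r
# Hypothetical helper: give each large order its own group and pool the rest.
group_taxa <- function(taxa,
                       big = c("Diptera", "Lepidoptera", "Hymenoptera", "Coleoptera")) {
  own <- intersect(big, taxa)      # large orders actually present in `taxa`
  groups <- as.list(own)
  names(groups) <- own
  rest <- setdiff(taxa, big)       # everything that is not a large order
  if (length(rest) > 0) groups$remaining <- rest
  groups
}

# Usage (assumes bold_seqspec accepts several taxon names in one call;
# needs the bold package and network access):
# groups <- group_taxa(insectNames)
# for (nm in names(groups)) {
#   saveRDS(bold::bold_seqspec(taxon = groups[[nm]]), paste0("bold_", nm, ".rds"))
# }
```

Saving each group to disk before fetching the next keeps only one group's records in memory at a time.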

@sckott sckott added the bug label Feb 26, 2019

sckott commented Feb 26, 2019

Thanks for this. I think it's a bug related to rawToChar; the string coming from out$content must be very large.
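
For context, a single R character string cannot exceed 2^31-1 bytes, which is the limit the error message reports when the whole response is pasted into one string. One way around that, sketched here under assumptions and not necessarily the fix that went into the package, is to save the response to disk and parse the file instead. `read_bold_tsv` and `download_then_read` are hypothetical names, and the URL passed in is up to the caller.

```r
# Upper bound on the size of one character string, in bytes (2^31 - 1)
max_string_bytes <- .Machine$integer.max

# Hypothetical helper: parse a tab-separated file that is already on disk.
# quote = "" treats quote characters as plain text, which avoids
# "EOF within quoted string" when a field contains a stray quote.
read_bold_tsv <- function(path) {
  read.delim(path, sep = "\t", quote = "", stringsAsFactors = FALSE)
}

# Hypothetical helper: save the response to a file, then parse the file,
# so the full body is never held as one R string.
download_then_read <- function(url, destfile = tempfile(fileext = ".tsv")) {
  utils::download.file(url, destfile, mode = "wb", quiet = TRUE)
  read_bold_tsv(destfile)
}
```

The parsed data frame still has to fit in memory, so very large taxa may still need to be fetched in pieces.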

devonorourke commented Feb 26, 2019 via email

@sckott sckott added this to the v1.0 milestone Jan 17, 2020
@sckott sckott removed the bug label Jan 17, 2020
@sckott sckott closed this as completed in 17c83da Jan 17, 2020