long vectors, round 2 #61

devonorourke · 2019-02-23T10:55:58Z

Hi Scott,
I'm trying, and failing, to download the entire Insect dataset with bold_seqspec. The insectNames vector was scraped directly from the BOLD website; this was my first hack at a programmatic way to pull the insect order names from there. Folks who know what they are doing could probably do this better!

library(bold)       # provides bold_seqspec(), used below
library(taxize)
library(stringr)
library(rvest)
library(tidyverse)

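# Read the BOLD taxonomy browser page for Insecta (taxid=82)
boldurl <- read_html("http://v4.boldsystems.org/index.php/Taxbrowser_Taxonpage?taxid=82")
boldtext <- boldurl %>% html_nodes("div.col-md-6") %>% html_text()
# Block 7 and offset 71 are specific to the page layout when this was written
tmptext <- substr(boldtext[7], start=71, stop=nchar(boldtext[7]))
# Drop the record counts, then split on "[" and strip "]" to isolate order names
tmptext2 <- gsub('[[:digit:]]+', '', tmptext)
tmptext3 <- unlist(strsplit(tmptext2, '\\['))
tmptext4 <- gsub('\\]', '', tmptext3)
tmptext5 <- str_trim(tmptext4)
insectNames <- tmptext5[tmptext5 != ""]
# Clean up the intermediate objects
rm(list=ls(pattern = "tmptext"))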

With that insectNames vector, I then run the bold_seqspec call:

Insects_list <- lapply(insectNames, bold_seqspec)

But unfortunately, it generates this error:

Error in paste0(rawToChar(out$content, multiple = TRUE), collapse = "") : 
  result would exceed 2^31-1 bytes
In addition: Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  EOF within quoted string

My next idea is to break that list into smaller pieces, probably one for Lepidoptera, one for Coleoptera, and one for the rest. Thanks for any insights you might offer!


devonorourke · 2019-02-26T17:32:36Z

Quick follow up:
I ended up breaking up the data into six groups total and there wasn't any further issue with memory. The groups were:

All non-insect arthropods
Diptera
Lepidoptera
Hymenoptera
Coleoptera
remaining insects
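
For anyone landing here later, the grouping above could be sketched roughly like this. It is only an illustration: `fetch_by_group`, its arguments, and the example taxon strings are hypothetical names for this sketch, not part of the bold package. The fetcher is passed in as a function, so each group is requested on its own and a failure in one group does not discard the others.

```r
# Fetch records one group at a time so that no single request has to
# return everything at once. All names here are illustrative.
fetch_by_group <- function(groups, fetch, out_dir = NULL) {
  results <- vector("list", length(groups))
  names(results) <- names(groups)
  for (grp in names(groups)) {
    res <- tryCatch(
      fetch(groups[[grp]]),
      error = function(e) {
        message("Group '", grp, "' failed: ", conditionMessage(e))
        NULL
      }
    )
    # Optionally save each group to disk as soon as it arrives
    if (!is.null(res) && !is.null(out_dir)) {
      saveRDS(res, file.path(out_dir, paste0(grp, ".rds")))
    }
    # Use `[<-` with list() so a NULL result keeps its slot in the list
    results[grp] <- list(res)
  }
  results
}

# Possible usage (needs network access and the bold package; the taxon
# strings are assumptions, adjust them to the groups you actually want):
# groups <- list(Diptera = "Diptera", Lepidoptera = "Lepidoptera")
# recs <- fetch_by_group(groups, function(taxa) bold::bold_seqspec(taxon = taxa))
```

Saving each group with `out_dir` means a later failure does not cost the groups already downloaded.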

sckott · 2019-02-26T17:40:20Z

Thanks for this. I think it's a bug related to rawToChar; the string coming from out$content must be very large.
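
Some background on the error: a single R character string is limited to 2^31 - 1 bytes, so collapsing a very large raw response into one string with paste0(rawToChar(...)) hits that ceiling. One way around it, sketched below, is to write the raw bytes to a temporary file and let read.delim parse from disk. This is not necessarily how the package fixed it; `parse_raw_tsv` is a hypothetical helper, the sample data is a toy, and `quote = ""` is an assumption aimed at the "EOF within quoted string" warning.

```r
# Hypothetical helper: parse a tab-separated response body held as raw
# bytes without first collapsing it into one giant character string.
parse_raw_tsv <- function(raw_body) {
  stopifnot(is.raw(raw_body))
  tmp <- tempfile(fileext = ".tsv")
  on.exit(unlink(tmp), add = TRUE)
  writeBin(raw_body, tmp)
  # quote = "" treats quote characters in the data as ordinary text
  utils::read.delim(tmp, sep = "\t", quote = "", stringsAsFactors = FALSE)
}

# Toy body standing in for an API response
body <- charToRaw("processid\torder_name\nABC123\tDiptera\n")
parsed <- parse_raw_tsv(body)
```

This only demonstrates the parsing path on a small body. For a response that really is larger than 2^31 - 1 bytes, the write itself may need to be done in pieces, or the HTTP client could stream the response straight to disk.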

devonorourke · 2019-02-26T18:06:11Z

Yep. It was all Insect records.


sckott added the bug label Feb 26, 2019

sckott added this to the v1.0 milestone Jan 17, 2020

sckott removed the bug label Jan 17, 2020

sckott closed this as completed in 17c83da Jan 17, 2020
