Saving and loading from RDS/RDA causes error #181

Closed
Deleetdk opened this issue Dec 1, 2016 · 7 comments

Deleetdk commented Dec 1, 2016

You often don't want to scrape a large site more than once, so it is useful to save the scraped content to a file. The obvious choice is the RDS format, since it should reproduce objects exactly. However, after saving and loading, the objects are not identical, and using the object loaded from disk crashes R and RStudio. The same error happens with the RDA format, and it happens whether one uses the base functions or the readr functions.

I was unable to find any other way to serialize complex R objects. I tried feather, jsonlite and base::dput. From looking at the output of dput, it seems that the error may be related to rvest not using copy semantics. The dput output is very short and contains references to pointers, which I guess are RAM locations.

> dput(comedian_pages)
list(structure(list(node = <pointer: 0x00000000178e83a0>, doc = <pointer: 0x0000000017a98410>), .Names = c("node", 
"doc"), class = c("xml_document", "xml_node")), structure(list(
    node = <pointer: 0x0000000019561c00>, doc = <pointer: 0x0000000017ddb090>), .Names = c("node", 
"doc"), class = c("xml_document", "xml_node")), structure(list(
    node = <pointer: 0x0000000019609b60>, doc = <pointer: 0x00000000195bc530>), .Names = c("node", 
"doc"), class = c("xml_document", "xml_node")), structure(list(
    node = <pointer: 0x00000000197672e0>, doc = <pointer: 0x0000000017dc9930>), .Names = c("node", 
"doc"), class = c("xml_document", "xml_node")), structure(list(
    node = <pointer: 0x00000000197e3660>, doc = <pointer: 0x000000001792e130>), .Names = c("node", 
"doc"), class = c("xml_document", "xml_node")), structure(list(
    node = <pointer: 0x0000000019810ce0>, doc = <pointer: 0x0000000017dda250>), .Names = c("node", 
"doc"), class = c("xml_document", "xml_node")), structure(list(
    node = <pointer: 0x0000000019890970>, doc = <pointer: 0x000000001792c930>), .Names = c("node", 
"doc"), class = c("xml_document", "xml_node")), structure(list(
    node = <pointer: 0x00000000199dc6e0>, doc = <pointer: 0x0000000017ddb210>), .Names = c("node", 
"doc"), class = c("xml_document", "xml_node")), structure(list(
    node = <pointer: 0x0000000019b53e60>, doc = <pointer: 0x000000001792d6b0>), .Names = c("node", 
"doc"), class = c("xml_document", "xml_node")), structure(list(
    node = <pointer: 0x0000000019d08170>, doc = <pointer: 0x0000000017ddab50>), .Names = c("node", 
"doc"), class = c("xml_document", "xml_node")), structure(list(
    node = <pointer: 0x0000000019d06970>, doc = <pointer: 0x000000001792db30>), .Names = c("node", 
"doc"), class = c("xml_document", "xml_node")), structure(list(
    node = <pointer: 0x0000000019ea2060>, doc = <pointer: 0x0000000017ddb690>), .Names = c("node", 
"doc"), class = c("xml_document", "xml_node")), structure(list(
    node = <pointer: 0x0000000019f16820>, doc = <pointer: 0x0000000017a98c50>), .Names = c("node", 
"doc"), class = c("xml_document", "xml_node")), structure(list(
    node = <pointer: 0x000000001a0f51e0>, doc = <pointer: 0x00000000179d39b0>), .Names = c("node", 
"doc"), class = c("xml_document", "xml_node")), structure(list(
    node = <pointer: 0x000000001a223760>, doc = <pointer: 0x000000001792cab0>), .Names = c("node", 
"doc"), class = c("xml_document", "xml_node")), structure(list(
    node = <pointer: 0x000000001a92f160>, doc = <pointer: 0x0000000017a98ad0>), .Names = c("node", 
"doc"), class = c("xml_document", "xml_node")), structure(list(
    node = <pointer: 0x000000001aa495e0>, doc = <pointer: 0x0000000019b6d320>), .Names = c("node", 
"doc"), class = c("xml_document", "xml_node")), structure(list(
    node = <pointer: 0x000000001aab3f70>, doc = <pointer: 0x000000001792d2f0>), .Names = c("node", 
"doc"), class = c("xml_document", "xml_node")), structure(list(
    node = <pointer: 0x000000001aafc560>, doc = <pointer: 0x00000000179d4af0>), .Names = c("node", 
"doc"), class = c("xml_document", "xml_node")), structure(list(
    node = <pointer: 0x000000001abfbe60>, doc = <pointer: 0x000000001792da70>), .Names = c("node", 
"doc"), class = c("xml_document", "xml_node")))

One person on Stack Overflow found a workaround: explicitly converting the object to a character vector. Perhaps rvest could introduce a dedicated function for writing and reading rvest objects. The readr package could then detect rvest objects, so that write_rds etc. work by applying this workaround.
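A minimal sketch of why the pointers matter, assuming a recent xml2 and a made-up one-paragraph document: the parsed object is only a thin list around external pointers, while its character form is ordinary data that survives an RDS round trip. What happens when a restored pointer object is touched depends on the xml2 version, so that step is left out here.

```r
library(xml2)

# Tiny stand-in document (an assumption; any scraped page works the same way)
doc <- read_html("<html><body><p>hello</p></body></html>")

# Underneath its class, the parsed document is a list of external pointers
parts <- unclass(doc)
vapply(parts, typeof, character(1))

# The character form is plain data, so it can be saved and re-parsed
path <- tempfile(fileext = ".rds")
saveRDS(as.character(doc), path)
restored <- read_html(readRDS(path))
xml_text(xml_find_first(restored, "//p"))  # "hello"
```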

@Deleetdk
Author

Deleetdk commented Dec 5, 2016

In the meantime, I wrote the following wrapper functions to get around the problem.

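library(purrr)  # map()
library(readr)  # write_rds(), read_rds()
library(xml2)   # read_html()

write_rvest = function(x, path, ...) {
  # An xml_document is itself a list internally, so is.list() alone cannot
  # tell a single document from a list of documents; check the class first.
  if (inherits(x, "xml_node")) {
    x = as.character(x)
  } else {
    x = map(x, as.character)
  }

  # save the plain character data, which serializes safely
  write_rds(x, path, ...)
}

read_rvest = function(path) {
  # load the character data from file
  x = read_rds(path)

  # re-parse: a list holds one document per element
  if (is.list(x)) {
    x = map(x, read_html)
  } else {
    x = read_html(x)
  }

  x
}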

The functions handle lists separately because saving a list of xml objects hits the same problem. Lists of xml objects are a very common pattern and should be supported. One cannot simply call as.character on the whole list, so the function maps as.character over each element.

@Deleetdk
Author

I ran into yet more problems.

The as.character approach fails when the xml/html is not encoded as UTF-8. The resulting string is in fact UTF-8 in R, but the charset declared inside the document stays whatever it was, and this mismatch causes a parser error when trying to load the saved content.

The error also crashes RStudio. See bug report on RStudio forum.

https://support.rstudio.com/hc/en-us/community/posts/115000113727-Encoding-error-causes-RStudio-to-crash-when-trying-to-read-files

--

To make things more complicated, I found that the xml2 package, which rvest wraps, has a function write_xml. This function saves xml objects so that they can be loaded again and match the original object. However, it does not support saving lists of xml objects. For that, one still needs a workaround: either convert to character and save as a list (which fails when non-UTF-8 content is present), or dedicate an entire folder to saving every xml object as a single file.
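One possible middle ground, sketched here under assumptions rather than taken from the thread: write each document with xml2::write_html() to a temporary file, keep the resulting bytes as a raw vector, and store the list of raw vectors in a single RDS file. The helper name doc_to_raw and the two sample documents are hypothetical, and whether this avoids the encoding problem for a given page depends on write_html() producing output that round-trips, which is worth checking on real data.

```r
library(xml2)

# Hypothetical helper: turn one parsed document into a raw vector
doc_to_raw <- function(doc) {
  tmp <- tempfile(fileext = ".html")
  on.exit(unlink(tmp))
  write_html(doc, tmp)
  readBin(tmp, what = "raw", n = file.size(tmp))
}

# Made-up sample pages standing in for scraped documents
pages <- list(
  read_html("<html><body><p>one</p></body></html>"),
  read_html("<html><body><p>two</p></body></html>")
)

# Raw vectors are ordinary R data, so the whole list fits in one RDS file
path <- tempfile(fileext = ".rds")
saveRDS(lapply(pages, doc_to_raw), path)

# read_html() accepts raw vectors, so each element can be parsed again
reloaded <- lapply(readRDS(path), read_html)
texts <- vapply(reloaded, function(p) xml_text(xml_find_first(p, "//p")),
                character(1))
```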

What a frustrating issue.

@garrettgman
Member

I'm running into the same problems.

@Deleetdk
Author

Deleetdk commented Feb 7, 2017

I bit the bullet and now keep folders around with the xml files. This approach is somewhat annoying but at least it always works.
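The folder approach could look roughly like the sketch below. The helper names save_pages and load_pages, the file naming scheme, and the sample documents are all assumptions for illustration, not part of rvest or xml2.

```r
library(xml2)

# Hypothetical helper: write each document in a list to its own file
save_pages <- function(pages, dir) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  for (i in seq_along(pages)) {
    # zero-padded names keep the files in their original order
    write_html(pages[[i]], file.path(dir, sprintf("page_%05d.html", i)))
  }
  invisible(dir)
}

# Hypothetical helper: read every saved file back into a list of documents
load_pages <- function(dir) {
  files <- sort(list.files(dir, pattern = "\\.html$", full.names = TRUE))
  lapply(files, read_html)
}

# Usage with made-up pages
pages <- list(
  read_html("<html><body><p>one</p></body></html>"),
  read_html("<html><body><p>two</p></body></html>")
)
dir <- tempfile("pages")
save_pages(pages, dir)
reloaded <- load_pages(dir)
```

Keeping one file per page also means a single page can be inspected or re-scraped without touching the rest.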

@OliBravo

OliBravo commented Jun 6, 2018

I saved a list of xml nodes, and when I tried to load it back into RStudio I got an invalid pointer error. Is there any way to load it back without scraping the web sites again? You give a workaround for storing a list properly, but I would like to avoid re-scraping.

@jimhester
Collaborator

The xml is stored in memory with external pointers, which R does not store in Rds files, so you cannot simply save the R object.

The easiest workaround is to use xml2::write_html() to save the parsed html to an html file, then read it with xml2::read_html().

@OliBravo

OliBravo commented Jun 6, 2018

Yes, but it means I have to scrape the pages again :(

Thanks for the answer. From now on I'll be much more careful with tasks like this.

@hadley closed this as completed Mar 17, 2019