
Long lines truncated at 10,000,000 chars. #399

Closed
barryrowlingson opened this issue Feb 14, 2024 · 4 comments

Comments

@barryrowlingson

Long lines in HTML are truncated at 10 million characters.

out = "test.html"

### make a char vec of ~12M chars with start and end marker
long = paste0(letters[((1:12000000)%%26)+1],collapse="")
long = paste0("start",long,"end", collapse="")

nchar(long)

### write to a file with some HTML tags.
cat(paste0("<html><body>\n<script type=\"application/json\">",
           long,
           "</script>\n</body></html>\n"), file=out)

### scrape package
library(rvest)

### read the file
page = read_html(out)

### get the nodes by xpath
nodes = html_nodes(page,xpath = '//script[@type="application/json"]')

### get the node content text
text = html_text(nodes[[1]])

### should be about 12 million
nchar(text)

### try this way
chars = as.character(nodes[[1]])

### also should be 12 million
nchar(chars)

### what's at the end?
substr(chars, nchar(chars)-40, nchar(chars))

I get:

> ### should be about 12 million
> nchar(text)
[1] 10000000

> ### try this way
> chars = as.character(nodes[[1]])

> ### also should be 12 million
> nchar(chars)
[1] 10000041

> ### what's at the end?
> substr(chars, nchar(chars)-40, nchar(chars))
[1] "abcdefghijklmnopqrstuvwxyzabcdef</script>"

showing truncation at 10,000,000 chars. The as.character() form has truncated the content and then appended a closing </script> tag, producing well-formed HTML from the truncated data. This all happens silently, with no errors or warnings.
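A quick way to catch this silent truncation (a minimal sketch building on the reproduction above; the 1,000-character slack is an arbitrary choice of mine, not anything rvest provides) is to compare the parsed text length against the size of the file on disk:

```r
## sketch: flag possible truncation by comparing the parsed node text
## against the raw file size ("test.html" and `nodes` come from the
## reproduction above)
raw_len    <- file.size("test.html")
parsed_len <- nchar(html_text(nodes[[1]]))

## the text should be only slightly shorter than the file (the markup
## around it); a multi-megabyte gap means content was dropped
if (raw_len - parsed_len > 1000) {
  warning("parsed text much shorter than the file on disk: possible truncation")
}
```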

The real-world case of this was an HTML file created by the leaflet package which creates large single lines of geographic data.

The Python packages requests and BeautifulSoup read this file correctly, without truncation.

> packageVersion("rvest")
[1] ‘1.0.4’
> version
               _                           
platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          4                           
minor          3.1                         
year           2023                        
month          06                          
day            16                          
svn rev        84548                       
language       R                           
version.string R version 4.3.1 (2023-06-16)
nickname       Beagle Scouts               
@TimTaylor

I think you're hitting a limit in libxml2: https://www.suse.com/support/kb/doc/?id=000019477. Not sure whether you need to rebuild or whether this can be changed at runtime 🤷
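For what it's worth, libxml2's text-node limit can be lifted at runtime with the XML_PARSE_HUGE parser flag, which xml2 exposes as the "HUGE" option string (a hedged sketch; the option vector below is my assumption of read_html's usual defaults with HUGE added):

```r
library(xml2)

## pass XML_PARSE_HUGE down to libxml2 so text nodes longer than
## 10,000,000 chars are not truncated while building the tree
doc <- read_html("test.html",
                 options = c("RECOVER", "NOERROR", "NOBLANKS", "HUGE"))
```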

@barryrowlingson
Author

If I try reading with XML::xmlParse, I at least get an error:

> xmlParse("./test.html")
xmlSAX2Characters: huge text nodeExtra content at the end of the document
Error: 1: xmlSAX2Characters: huge text node2: Extra content at the end of the document

Looks like the xml2 package is silently failing to report the truncation. I'll file an issue there, if there isn't one already...

@barryrowlingson
Author

Seems I have "options"...

> # how big is the input?
> file.size("large.html")
[1] 12000078

> # read it, then write it:
> l = rvest::read_html("large.html")

> xml2::write_html(l, "large-huge.html")

> # check size for truncation
> file.size("large-huge.html")
[1] 10000177

> # HUUUUUGE
> l = rvest::read_html("large.html", options="HUGE")

> xml2::write_html(l, "large-huge.html")

> # not truncated
> file.size("large-huge.html")
[1] 12000185

However, I can't find an option that will make rvest::read_html report the error; the options argument is presumably passed straight down to xml2::read_html...
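One possible stopgap (a sketch only: read_html_huge is a hypothetical helper name of mine, not part of rvest, and the 90% size threshold is arbitrary): always parse with HUGE on, and warn if a round trip through xml2::write_html still comes out much smaller than the input file:

```r
library(rvest)
library(xml2)

## hypothetical helper: parse with the HUGE option, then compare a
## re-serialized copy against the input file to detect silent loss
read_html_huge <- function(path, ...) {
  doc <- rvest::read_html(path,
                          options = c("RECOVER", "NOERROR", "NOBLANKS", "HUGE"),
                          ...)
  tmp <- tempfile(fileext = ".html")
  xml2::write_html(doc, tmp)
  if (file.size(tmp) < 0.9 * file.size(path)) {
    warning("re-serialized document is much smaller than the input: ",
            "possible truncation")
  }
  unlink(tmp)
  doc
}
```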

@hadley
Member

hadley commented Feb 27, 2024

Now filed at r-lib/xml2#440. I suspect there's not going to be much we can do apart from turning HUGE on by default, but this does seem like pretty unappealing behaviour by libxml2.

@hadley hadley closed this as completed Feb 27, 2024