Long lines truncated at 10,000,000 chars. #399

barryrowlingson · 2024-02-14T09:50:31Z

Long lines in HTML are truncated at 10 million characters.

out = "test.html"

### make a char vec of ~12M chars with start and end marker
long = paste0(letters[((1:12000000)%%26)+1],collapse="")
long = paste0("start",long,"end", collapse="")

nchar(long)

### write to a file with some HTML tags.
cat(paste0("<html><body>\n<script type=\"application/json\">",
           long,
           "</script>\n</body></html>\n"), file=out)

### scrape package
library(rvest)

### read the file
page = read_html(out)

### get the nodes by xpath
nodes = html_nodes(page,xpath = '//script[@type="application/json"]')

### get the node content text
text = html_text(nodes[[1]])

### should be about 12 million
nchar(text)

### try this way
chars = as.character(nodes[[1]])

### also should be 12 million
nchar(chars)

### whats at the end?
substr(chars, nchar(chars)-40, nchar(chars))

I get:

> ### should be about 12 million
> nchar(text)
[1] 10000000

> ### try this way
> chars = as.character(nodes[[1]])

> ### also should be 12 million
> nchar(chars)
[1] 10000041

> ### whats at the end?
> substr(chars, nchar(chars)-40, nchar(chars))
[1] "abcdefghijklmnopqrstuvwxyzabcdef</script>"

This shows truncation at 10,000,000 characters. The as.character() form has also truncated the content and appended a closing script tag, producing well-formed HTML from the truncated data. All of this happens silently, with no errors or warnings.
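As a rough illustration of why the start and end markers are useful, here is a minimal base R sketch that checks whether an extracted string still carries both sentinels. The helper name `has_markers` is hypothetical and not part of rvest or xml2; it only assumes the "start"/"end" convention used in the reprex above.

```r
# Hypothetical helper (not part of rvest or xml2): returns TRUE when `text`
# still begins and ends with the sentinel markers used in the reprex.
has_markers <- function(text, start = "start", end = "end") {
  startsWith(text, start) && endsWith(text, end)
}

# A small stand-in for the 12M-character string.
ok <- paste0("start", strrep("a", 20), "end")

# Simulate truncation by keeping only the first 15 characters.
cut <- substr(ok, 1, 15)

has_markers(ok)
has_markers(cut)
```

In the reprex, applying the same check to `text` would flag the problem, because the truncated content no longer ends in "end".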

The real-world case was an HTML file created by the leaflet package, which writes geographic data as very long single lines.

The Python packages requests and BeautifulSoup read this all correctly.

> packageVersion("rvest")
[1] ‘1.0.4’
> version
               _                           
platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          4                           
minor          3.1                         
year           2023                        
month          06                          
day            16                          
svn rev        84548                       
language       R                           
version.string R version 4.3.1 (2023-06-16)
nickname       Beagle Scouts


TimTaylor · 2024-02-16T09:56:04Z

I think you're hitting a limit in libxml2: https://www.suse.com/support/kb/doc/?id=000019477. I'm not sure whether you need to rebuild or whether this can be changed at runtime 🤷

barryrowlingson · 2024-02-16T10:16:33Z

If I try reading with XML::xmlParse, I at least get an error:

> xmlParse("./test.html")
xmlSAX2Characters: huge text nodeExtra content at the end of the document
Error: 1: xmlSAX2Characters: huge text node2: Extra content at the end of the document

It looks like the xml2 package is silently failing to report the truncation. I'll file an issue there, if there isn't one already.

barryrowlingson · 2024-02-16T10:33:50Z

Seems I have "options"...

> # how big is the input?
> file.size("large.html")
[1] 12000078

> # read it, then write it:
> l = rvest::read_html("large.html")

> xml2::write_html(l, "large-huge.html")

> # check size for truncation
> file.size("large-huge.html")
[1] 10000177

> # HUUUUUGE
> l = rvest::read_html("large.html", options="HUGE")

> xml2::write_html(l, "large-huge.html")

> # not truncated
> file.size("large-huge.html")
[1] 12000185

However, I can't find an option that makes rvest::read_html report the error. The options are probably passed straight down to xml2::read_html...
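Since no option seems to surface the error, one possible workaround is to parse the file twice, once with the default options and once with HUGE added, and compare how much text each parse yields. The sketch below is only an assumption about how that could look: `parse_loses_text` is a hypothetical helper, the default option vector is copied from the xml2 documentation for `read_html()`, and whether the default parse truncates at all will depend on the libxml2 build in use.

```r
library(xml2)

# Hypothetical helper (not part of xml2 or rvest): returns TRUE when the
# default parse yields fewer characters of text than a parse with HUGE,
# which would suggest silent truncation.
parse_loses_text <- function(path) {
  # Assumed to match the documented defaults of xml2::read_html().
  default_opts <- c("RECOVER", "NOERROR", "NOBLANKS")

  n_default <- nchar(xml_text(read_html(path, options = default_opts)))
  n_huge <- nchar(xml_text(read_html(path, options = c(default_opts, "HUGE"))))

  n_default < n_huge
}

# Usage on a small file, where no truncation is expected.
small <- tempfile(fileext = ".html")
writeLines("<html><body><p>hi</p></body></html>", small)
parse_loses_text(small)
```

This doubles the parsing cost, so it is better suited to a one-off sanity check than to routine use.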

hadley · 2024-02-27T14:24:05Z

Now filed at r-lib/xml2#440. I suspect there's not going to be much we can do apart from turning HUGE on by default, but this does seem like pretty unappealing behaviour by libxml2.
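Until something changes upstream, a user-side way to get "HUGE on by default" could be a thin wrapper like the sketch below. The name `read_html_huge` is hypothetical; the other three options are assumed to be the documented defaults of `xml2::read_html()`, and passing `options` replaces those defaults rather than adding to them, so they are repeated here.

```r
library(xml2)

# Hypothetical wrapper (not part of xml2 or rvest): read_html() with the
# assumed default parser options plus HUGE, which relaxes libxml2's
# hardcoded size limits.
read_html_huge <- function(x, ...) {
  read_html(x, ..., options = c("RECOVER", "NOERROR", "NOBLANKS", "HUGE"))
}

# Usage on a literal HTML string.
doc <- read_html_huge("<html><body><p>hi</p></body></html>")
xml_text(xml_find_first(doc, "//p"))
```

Note that HUGE removes a safety limit, so this may not be appropriate for untrusted input.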

hadley mentioned this issue Feb 27, 2024

read_html() doesn't report parsing failure on very very long lines r-lib/xml2#440

Open

hadley closed this as completed Feb 27, 2024
