Missing content after parsing HTML #1927

vdoan7773 · 2019-09-24T08:00:54Z

**Describe the bug**

Some HTML content is removed after parsing.

**To Reproduce**

Run the following script:

require 'nokogiri'
before = '<p class="header__lede">Affecting
<strong >
<a class="breadcrumbs__list-item__link" href="/vuln/npm:saml2-js">saml2-js</a> 
</strong>
package, versions 
<strong >
<&nbsp;1.12.4&nbsp;||&nbsp;>=&nbsp;2.0.0&nbsp;<2.0.2
</strong>
</p>'

puts before
html = Nokogiri::HTML(before)
after = html.inner_html
puts 
puts after

output:
`
<p class="header__lede">Affecting
<strong >
<a class="breadcrumbs__list-item__link" href="/vuln/npm:saml2-js">saml2-js</a> 
</strong>
package, versions 
<strong >
<&nbsp;1.12.4&nbsp;||&nbsp;>=&nbsp;2.0.0&nbsp;<2.0.2
</strong>
</p>

<html><body><p class="header__lede">Affecting
<strong>
<a class="breadcrumbs__list-item__link" href="/vuln/npm:saml2-js">saml2-js</a> 
</strong>
package, versions 
<strong>
= 2.0.0 &lt;2.0.2
</strong>
</p></body></html>
`
The text '<&nbsp;1.12.4&nbsp;||&nbsp;>' was removed after parsing.

**Expected behavior**

I expect this text to survive parsing, since a browser displays the same HTML correctly on https://snyk.io/vuln/npm:saml2-js:20180227

**Environment**

Nokogiri (1.10.4)

---
warnings: []
nokogiri: 1.10.4
ruby:
  version: 2.6.3
  platform: x64-mingw32
  description: ruby 2.6.3p62 (2019-04-16 revision 67580) [x64-mingw32]
  engine: ruby
libxml:
  binding: extension
  source: packaged
  libxml2_path: "/home/flavorjones/code/oss/nokogiri/ports/x86_64-w64-mingw32/libxml2/2.9.9"
  libxslt_path: "/home/flavorjones/code/oss/nokogiri/ports/x86_64-w64-mingw32/libxslt/1.1.33"
  libxml2_patches:
  - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
  - 0002-Remove-script-macro-support.patch
  - 0003-Update-entities-to-remove-handling-of-ssi.patch
  libxslt_patches:
  - 0001-Fix-security-framework-bypass.patch
  compiled: 2.9.9
  loaded: 2.9.9


flavorjones · 2019-09-27T13:21:03Z

@vdoan7773 Thanks for asking this question. The short answer is that you're seeing differences in how a browser "fixes" invalid HTML and how libxml2 (which is the underlying parser used by Nokogiri) "fixes" invalid HTML. Nokogiri inherits this behavior from the parser, and so there's nothing we can easily do to change this behavior.

Longer answer:

The presence of the bare character < when it is not part of an HTML tag is invalid markup. You can see this by inspecting html.errors in your example. I've simplified your example to show this:

#! /usr/bin/env ruby

require "nokogiri"

html = <<-EOHTML
<p>
<&nbsp;1.12.4&nbsp;||&nbsp;>=&nbsp;2.0.0&nbsp;<2.0.2
</p>
EOHTML

puts html
puts

doc = Nokogiri::HTML(html)
puts doc
puts
puts doc.errors

outputs:

<p>
<&nbsp;1.12.4&nbsp;||&nbsp;>=&nbsp;2.0.0&nbsp;<2.0.2
</p>

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>
= 2.0.0 &lt;2.0.2
</p>
</body></html>

2:2: ERROR: htmlParseStartTag: invalid element name
2:48: ERROR: htmlParseStartTag: invalid element name

You can see that libxml2 is flagging the bare < as a SyntaxError and failing to parse what it assumes is a start tag.

We've had lots of issues filed over the years pointing out differences between how libxml2 fixes broken markup when compared to browsers, Xerces, etc. Unfortunately, fixing broken markup isn't something that's defined in the HTML spec and so is implemented differently in different parsers.

Maybe the one actionable thing here is to file a bug report with Snyk asking them to emit well-formed, valid HTML. They should be HTML-encoding that version string before putting it into their web page, so that < is emitted as &lt;. In fact, now I'm wondering if that isn't an injection vulnerability waiting to happen?
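For illustration, here is roughly what that encoding step could look like on the producing side, using `CGI.escapeHTML` from Ruby's standard library. This is a sketch, not a description of how Snyk builds its pages: `version_paragraph` is a hypothetical helper name, and the sample string uses plain spaces where the original page used `&nbsp;` entities.

```ruby
require "cgi"

# Hypothetical helper: wrap an untrusted string in a <p> element after
# escaping the characters that are significant in HTML (&, <, >, ", ').
def version_paragraph(version_range)
  "<p>#{CGI.escapeHTML(version_range)}</p>"
end

puts version_paragraph("< 1.12.4 || >= 2.0.0 <2.0.2")
# prints: <p>&lt; 1.12.4 || &gt;= 2.0.0 &lt;2.0.2</p>
```

Markup produced this way contains no bare < characters, so the parser has nothing to "fix" and the version string should come through intact.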

Anyhoo, sorry I can't be of more help here. I hope I've explained what's going on, but let me know if you have other questions.

flavorjones closed this as completed Sep 27, 2019

flavorjones mentioned this issue Nov 6, 2021

Note in the documentation that broken markup may be "fixed" differently, or docs may be serialized differently, by libxml than by your browser or by xerces sparklemotion/nokogiri.org#41

Open
