Missing content after parsing HTML #1927

Closed
vdoan7773 opened this issue Sep 24, 2019 · 1 comment

vdoan7773 commented Sep 24, 2019

Describe the bug
Some HTML content is removed after parsing.

To Reproduce
Run the following script:

require 'nokogiri'
before = '<p class="header__lede">Affecting
<strong >
<a class="breadcrumbs__list-item__link" href="/vuln/npm:saml2-js">saml2-js</a> 
</strong>
package, versions 
<strong >
<&nbsp;1.12.4&nbsp;||&nbsp;>=&nbsp;2.0.0&nbsp;<2.0.2
</strong>
</p>'

puts before
html = Nokogiri::HTML(before)
after = html.inner_html
puts 
puts after
Output:
`
<p class="header__lede">Affecting
<strong >
<a class="breadcrumbs__list-item__link" href="/vuln/npm:saml2-js">saml2-js</a> 
</strong>
package, versions 
<strong >
<&nbsp;1.12.4&nbsp;||&nbsp;>=&nbsp;2.0.0&nbsp;<2.0.2
</strong>
</p>

<html><body><p class="header__lede">Affecting
<strong>
<a class="breadcrumbs__list-item__link" href="/vuln/npm:saml2-js">saml2-js</a> 
</strong>
package, versions 
<strong>
= 2.0.0 &lt;2.0.2
</strong>
</p></body></html>
`
'<&nbsp;1.12.4&nbsp;||&nbsp;>' was removed after parsing

**Expected behavior**

I expect this text to still be present after parsing, since the same HTML is displayed properly at https://snyk.io/vuln/npm:saml2-js:20180227

**Environment**

Nokogiri (1.10.4)

---
warnings: []
nokogiri: 1.10.4
ruby:
  version: 2.6.3
  platform: x64-mingw32
  description: ruby 2.6.3p62 (2019-04-16 revision 67580) [x64-mingw32]
  engine: ruby
libxml:
  binding: extension
  source: packaged
  libxml2_path: "/home/flavorjones/code/oss/nokogiri/ports/x86_64-w64-mingw32/libxml2/2.9.9"
  libxslt_path: "/home/flavorjones/code/oss/nokogiri/ports/x86_64-w64-mingw32/libxslt/1.1.33"
  libxml2_patches:
  - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
  - 0002-Remove-script-macro-support.patch
  - 0003-Update-entities-to-remove-handling-of-ssi.patch
  libxslt_patches:
  - 0001-Fix-security-framework-bypass.patch
  compiled: 2.9.9
  loaded: 2.9.9

flavorjones (Member) commented Sep 27, 2019

@vdoan7773 Thanks for asking this question. The short answer is that you're seeing a difference between how a browser "fixes" invalid HTML and how libxml2 (the underlying parser used by Nokogiri) "fixes" it. Nokogiri inherits this behavior from the parser, so there's nothing we can easily do to change it.

Longer answer:

The presence of the bare character < when it is not part of an HTML tag is invalid markup. You can see this by inspecting html.errors in your example. I've simplified your example to show this:

#! /usr/bin/env ruby

require "nokogiri"

html = <<-EOHTML
<p>
<&nbsp;1.12.4&nbsp;||&nbsp;>=&nbsp;2.0.0&nbsp;<2.0.2
</p>
EOHTML

puts html
puts

doc = Nokogiri::HTML(html)
puts doc
puts
puts doc.errors

outputs:

<p>
<&nbsp;1.12.4&nbsp;||&nbsp;>=&nbsp;2.0.0&nbsp;<2.0.2
</p>

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>
= 2.0.0 &lt;2.0.2
</p>
</body></html>

2:2: ERROR: htmlParseStartTag: invalid element name
2:48: ERROR: htmlParseStartTag: invalid element name

You can see that libxml2 is flagging the bare < as a SyntaxError and failing to parse what it assumes is a start tag.

We've had lots of issues filed over the years pointing out differences between how libxml2 fixes broken markup and how browsers, Xerces, etc. do. Unfortunately, recovery from broken markup isn't defined by the HTML 4 spec that libxml2's parser follows (the HTML5 spec does define it, and browsers implement that), so it is implemented differently in different parsers.
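Given that, one possible user-side workaround is to pre-process the markup before handing it to the parser. This is a minimal sketch, assuming that every `<` not followed by a letter, `/`, `!`, or `?` is meant as literal text. `escape_bare_lt` and `BARE_LT` are hypothetical names for illustration, not part of Nokogiri or libxml2:

```ruby
# Hypothetical helper (not a Nokogiri API): escape any "<" that cannot
# begin a tag, closing tag, comment/doctype, or processing instruction.
BARE_LT = /<(?![A-Za-z\/!?])/

def escape_bare_lt(html)
  html.gsub(BARE_LT, "&lt;")
end

fragment = "<p>\n<&nbsp;1.12.4&nbsp;||&nbsp;>=&nbsp;2.0.0&nbsp;<2.0.2\n</p>\n"

# Real tags such as <p> and </p> are untouched; the two bare "<" become "&lt;".
puts escape_bare_lt(fragment)
```

The cleaned string can then be passed to `Nokogiri::HTML` as usual. This heuristic would also rewrite `<` inside `<script>` or `<style>` content, and it cannot catch text such as `a <b` that happens to look like a tag, so treat it as a stopgap rather than a general fix.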

Maybe the one actionable thing here is to file a bug report with Snyk asking them to emit well-formed, valid HTML. They should be HTML-encoding that version string before putting it into their web page, so that < is rendered as &lt;. In fact, now I'm wondering if that isn't an injection vulnerability waiting to happen?
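As a sketch of what that encoding looks like in Ruby (Snyk's actual stack is unknown, and `render_versions` plus the sample string are hypothetical), the standard library's `ERB::Util.html_escape` converts the special characters into character references before the text is placed in the page:

```ruby
require "erb"

# Hypothetical template helper: escape untrusted text, then wrap it in markup.
def render_versions(version_range)
  escaped = ERB::Util.html_escape(version_range) # escapes & < > " '
  "<strong>#{escaped}</strong>"
end

# Sample version range, written with plain spaces for readability.
puts render_versions("< 1.12.4 || >= 2.0.0 <2.0.2")
# prints: <strong>&lt; 1.12.4 || &gt;= 2.0.0 &lt;2.0.2</strong>
```

Markup produced this way contains no bare `<`, so an HTML parser has no malformed start tag to recover from.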

Anyhoo, sorry I can't be of more help here. I hope I've explained what's going on, but let me know if you have other questions.
