DocumentFragment#xpath fails to find specific attribute for elements at the root of the fragment #213

Closed
Phrogz opened this Issue Jan 25, 2010 · 7 comments

Phrogz commented Jan 25, 2010

require 'nokogiri'
html = DATA.read
doc1 = Nokogiri::HTML(html)
doc2 = Nokogiri::HTML::DocumentFragment.parse(html)

ELEMENT_ONLY = ".//h2"
WITH_ID      = ".//h2[@id='foo']"

p doc1.xpath(ELEMENT_ONLY).first['id'],
  doc1.xpath(WITH_ID),
  doc2.xpath(ELEMENT_ONLY).first['id'],
  doc2.xpath(WITH_ID)

#=> "foo"
#=> [#<Nokogiri::XML::Element:0x80a3c168 name="h2" attributes=[#<Nokogiri::XML::Attr:0x80a3bbb8 name="id" value="foo">] children=[#<Nokogiri::XML::Text:0x80a3b288 "Heading 1">]>]
#=> "foo"
#=> []

__END__
<h2 id="foo">Heading 1</h2>

The same problem applies to at_xpath.

Phrogz commented Jan 25, 2010

A workaround is to use css/at_css on the DocumentFragment, but that only works if your id attributes do not contain colons or periods.
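One possible way around the colon/period limitation is to match the id with an attribute selector instead of the `#id` shorthand, so the `.` and `:` sit inside a quoted string. The sketch below is only a string builder: `id_selector` is a hypothetical helper, not part of Nokogiri, and it assumes Nokogiri's CSS parser accepts standard quoted attribute values. It has not been checked against the Nokogiri version discussed in this issue.

```ruby
# Hypothetical helper (not part of Nokogiri): build a CSS attribute selector
# for an id, so "." and ":" are not read as class or pseudo-class markers.
def id_selector(id, element = "*")
  # Keep the sketch simple: refuse ids that would need quote escaping.
  raise ArgumentError, "id must not contain a double quote" if id.include?('"')
  %(#{element}[id="#{id}"])
end

puts id_selector("foo.bar")        # prints *[id="foo.bar"]
puts id_selector("ns:item", "h2")  # prints h2[id="ns:item"]
```

The resulting string would then be passed to css/at_css on the fragment, for example `fragment.at_css(id_selector("foo.bar"))`, assuming `fragment` is a parsed DocumentFragment.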

Phrogz commented Jan 27, 2010

The plot thickens. Apparently it fails to find elements at the root of the fragment, but succeeds if they're nested:

require 'nokogiri'
s1 = "<a href='foo'>hi</a>"
s2 = "<a href='foo'>hi</a>\n"
s3 = "<a href='foo'>hi</a><a href='bar'>bye</a>"
s4 = "<a href='foo'>hi</a>\n<a href='bar'>bye</a>"
s5 = "<p><a href='foo'>hi</a></p>"
s6 = "<a href='foo'>hi</a><p><a href='bar'>bye</a></p>"

[s1,s2,s3,s4,s5,s6].each do |s|
  fragment = Nokogiri::HTML::DocumentFragment.parse(s)
  p s, fragment.xpath('.//a[@href]').length
  puts ""
end

#=> "<a href='foo'>hi</a>"
#=> 0
#=> 
#=> "<a href='foo'>hi</a>\n"
#=> 0
#=> 
#=> "<a href='foo'>hi</a><a href='bar'>bye</a>"
#=> 0
#=> 
#=> "<a href='foo'>hi</a>\n<a href='bar'>bye</a>"
#=> 0
#=> 
#=> "<p><a href='foo'>hi</a></p>"
#=> 1
#=> 
#=> "<a href='foo'>hi</a><p><a href='bar'>bye</a></p>"
#=> 1

Similarly, an xpath like .//a/@href will only select the attribute in elements not at the root of the fragment.

Owner

tenderlove commented Jan 28, 2010

I believe this is related to the fact that we just need to redo the partial implementation. I suggest that if you can, grab a prerelease version of nokogiri and use the Node#parse method.

We're going to try backing the fragment code with Node#parse for the next release.

Owner

tenderlove commented Mar 6, 2010

I'm starting to think this is either a) expected behavior or b) a bug in libxml2. Apparently switching to the new document fragment stuff I was working on didn't fix this issue.

Anyway, the reason I suspect it's either expected behavior or a bug in libxml2 is that if you adjust the XPath, you can find those elements:

require 'nokogiri'
s1 = "<a href='foo'>hi</a>"
s2 = "<a href='foo'>hi</a>\n"
s3 = "<a href='foo'>hi</a><a href='bar'>bye</a>"
s4 = "<a href='foo'>hi</a>\n<a href='bar'>bye</a>"
s5 = "<p><a href='foo'>hi</a></p>"
s6 = "<a href='foo'>hi</a><p><a href='bar'>bye</a></p>"

[s1,s2,s3,s4,s5,s6].each do |s|
  fragment = Nokogiri::HTML::DocumentFragment.parse(s)
  p s, fragment.xpath('a[@href] | .//a[@href]').length
  puts ""
end

Which will output (using nokogiri master):

"<a href='foo'>hi</a>"
1

"<a href='foo'>hi</a>\n"
1

"<a href='foo'>hi</a><a href='bar'>bye</a>"
2

"<a href='foo'>hi</a>\n<a href='bar'>bye</a>"
2

"<p><a href='foo'>hi</a></p>"
1

"<a href='foo'>hi</a><p><a href='bar'>bye</a></p>"
2

I am researching more.
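The adjusted XPath above pairs each descendant search (`.//a[@href]`) with a direct-child search (`a[@href]`). A minimal sketch of deriving that union automatically follows. `root_inclusive_xpath` is a hypothetical helper, not a Nokogiri API, and it only handles a single path that starts with `.//`; expressions that already contain a union are not rewritten correctly.

```ruby
# Hypothetical helper (not part of Nokogiri): given a descendant search such
# as ".//a[@href]", also search the direct children of the context node.
def root_inclusive_xpath(xpath)
  prefix = ".//"
  unless xpath.start_with?(prefix)
    raise ArgumentError, "expected an XPath starting with #{prefix.inspect}"
  end
  child_step = xpath[prefix.length..-1]  # ".//a[@href]" -> "a[@href]"
  "#{child_step} | #{xpath}"
end

puts root_inclusive_xpath(".//a[@href]")  # prints a[@href] | .//a[@href]
```

The result would be used in place of the original expression, for example `fragment.xpath(root_inclusive_xpath(".//a[@href]"))`, assuming `fragment` is a parsed DocumentFragment.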

bogdan commented Oct 16, 2011

Meanwhile I am using the wrapper:

doc = Nokogiri::HTML::DocumentFragment.parse("<div id='__wrapper__'>#{body}</div>")
# process doc here
result = doc.css("#__wrapper__").to_s  # "#__wrapper__" is a CSS selector, so use css rather than xpath

Phrogz commented Nov 22, 2011

See also #370 and #572

flavorjones added a commit that referenced this issue Jan 2, 2015

Tests demonstrating the issue from #572 and #454 and #370 and #213. (ab26d27)

Owner

flavorjones commented Jan 2, 2015

Folding this into #572, the underlying issue is the same.

flavorjones closed this Jan 2, 2015
