Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NodeSet does not follow ruby conventions for enumerable methods #1677

Closed
dazza-codes opened this issue Oct 3, 2017 · 4 comments
Closed

NodeSet does not follow ruby conventions for enumerable methods #1677

dazza-codes opened this issue Oct 3, 2017 · 4 comments

Comments

@dazza-codes
Copy link

dazza-codes commented Oct 3, 2017

Using ~/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/nokogiri-1.8.1

In general, the ruby enumerable methods, like Array.select, return a new Array, leaving the original untouched. For example,

x = [*1..10]
#=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = x.select {|i| i % 2 == 0 }
#=> [2, 4, 6, 8, 10]
x
#=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Similarly, another enumerable operation does not touch the original:

x = [*1..10]
#=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = x.reject {|i| i % 2 == 0 }
=> [1, 3, 5, 7, 9]
x
#=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

It's reasonable to expect the same behavior of enumerable methods on the NodeSet, right? Let's see.

Create two sets of XML, with a duplicate REC element in each (WOS:A2):

recordsA = <<-XML_A
        <records>
          <REC><UID>WOS:A1</UID></REC>
          <REC><UID>WOS:A2</UID></REC>
        </records>
XML_A
recordsB = <<-XML_B
        <records>
          <REC><UID>WOS:A2</UID></REC>
          <REC><UID>WOS:B2</UID></REC>
        </records>
XML_B

docA = Nokogiri::XML(recordsA) { |config| config.strict.noblanks }
docB = Nokogiri::XML(recordsB) { |config| config.strict.noblanks }

rec_nodesA = docA.search('REC')
rec_nodesB = docB.search('REC')
rec_nodesA.class
#=> Nokogiri::XML::NodeSet

uidsA = rec_nodesA.map { |rec| rec.search('UID').first.text }
#=> ["WOS:A1", "WOS:A2"]
uidsB = rec_nodesB.map { |rec| rec.search('UID').first.text }
#=> ["WOS:A2", "WOS:B2"]
uidsA.class
#=> Array  # note that it doesn't return a NodeSet
uid_intersection = uidsA.to_set.intersection(uidsB.to_set)
#=> #<Set: {"WOS:A2"}>   # OK, no problem here

Now lets try to gather the duplicate records into a new NodeSet.

duplicates = rec_nodesB.select { |rec| uid_intersection.include?(rec.search('UID').first.text) }
#=> [#(Element:0x2ab559c1d548 { name = "REC", children = [ #(Element:0x2ab559c21634 { name = "UID", children = [ #(Text "WOS:A2")] })] })]

dup_nodes = Nokogiri::XML::NodeSet.new(Nokogiri::XML::Document.new, duplicates)
#=> [#<Nokogiri::XML::Element:0x2ab559c1d548 name="REC" children=[#<Nokogiri::XML::Element:0x2ab559c21634 name="UID" children=[#<Nokogiri::XML::Text:0x2ab559c25428 "WOS:A2">]>]>]
dup_nodes.class
#=> Nokogiri::XML::NodeSet

# seems to be OK, right?

# BUT HERE COMES THE FUN BIT....

dup_nodes.xpath('//UID')
#=> [#<Nokogiri::XML::Element:0x2ab559c21634 name="UID" 
#  children=[#<Nokogiri::XML::Text:0x2ab559c25428 "WOS:A2">]>,
#  #<Nokogiri::XML::Element:0x2ab559c2c0d4 name="UID" 
#  children=[#<Nokogiri::XML::Text:0x2ab559c2e6e0 "WOS:B2">]>]

# Yea, fun, right?  But wait, there's more!

dup_nodes.count
#=> 1
dup_nodes.map { |rec| rec.search('UID').first.text }
#=> ["WOS:A2"]
dup_nodes.xpath('//UID').count
#=> 2
dup_nodes.xpath('//UID').to_s
#=> "<UID>WOS:A2</UID><UID>WOS:B2</UID>"

The dup_nodes.count => 1 and the first line above shows that it has the UID=‘WOS:A2’. So, how can an xpath('//UID') return two different UID values from this NodeSet that seems to only contain one UID?

The only way that dup_nodes.xpath('//UID') could return two results with WOS:A2 and WOS:B2 is if the dup_nodes is somehow still connected to the parent doc that it was derived from, despite a NodeSet.select operation that should not touch the original node set and should return a new object. Although I haven't yet proved it in a similar small example, I believe similar problems beset the NodeSet.dup method.

@atz
Copy link

atz commented Oct 5, 2017

Change your assignment line to:

dup_nodes = Nokogiri::XML::NodeSet.new(Nokogiri::XML::Document.new, duplicates.to_xml)

And I suspect you get the desired behavior.

Also, you can cut out some middle steps by applying .xpath:

uidsA = docA.xpath('//REC/UID').map(&:text)

and Array already has an intersection operator in Ruby, so you don't need to convert to Set:

uid_intersection = uidsA & uidsB
=> ["WOS:A2"]

@flavorjones
Copy link
Member

Hi, thanks for telling us about your experiences using Nokogiri.

You're performing some interesting and unexpected operations, and so I'm curious if Nokogiri's behavior is really getting in your way, or if you're just being curious/pedantic. To help me understand, could you tell me more about what you're trying to do with Nokogiri?

By this question, I mean that it's more interesting to me to get users from point A to point B efficiently than to figure out how to prevent users from breaking some expectations that Nokogiri has that may not be obvious.

@dazza-codes
Copy link
Author

To step back a bit, I wanted to merge one set of <REC> records into another set of <REC> records, without any duplicate <REC> elements. The unique identifier for each <REC> is in the <REC><UID> element text and that should be used to identify duplicates.

I was trying to work with a Nokogiri document as an immutable instance. I was using .dup or .deep_copy and some enumerable methods. I experimented with some details and noticed some puzzling behavior. I found some confidence with deserializing the content with .to_xml to get it out of one Nokogiri instance and into another. On reflection, I didn't try to apply .freeze to anything, maybe that would make a Nokogiri instance immutable.

There are some details at

@flavorjones
Copy link
Member

I'm sorry, I'm not totally following. I looked at the code, and the specs, and I understand what you're trying to do ... do these tests pass? If so, can you help me understand what Nokogiri is doing that's unexpected by writing a failing test?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants