JRuby unable to create HTML document if URL unresponsive #703

Closed
nirvdrum opened this Issue Jun 15, 2012 · 10 comments

2 participants

@nirvdrum

This seems related to the closed out #674. If I use a script similar to what @flavorjones provided, but instead provide a valid URL, but one that doesn't resolve with DNS, I get an error (and an unhelpful message):

#! /usr/bin/env ruby

require 'nokogiri'
require 'open-uri'
require 'yaml'

puts Nokogiri::VERSION_INFO.to_yaml
html = open('http://google.com/').read
doc = Nokogiri::HTML html, 'http://googleaoeu.com/', nil
puts "Doc: #{doc.to_html.inspect}"
puts "Errors: #{doc.errors}"

I get the error:

Errors: [#<Nokogiri::XML::SyntaxError: googleaoeu.com>]

I don't believe the C variant tries to resolve the URL and rather just accepts it as. In any event, it proceeds just fine. As a secondary concern, the error message here isn't much help. It just outputs the hostname and says it's a SyntaxError. Some more context there would be more helpful.

@nirvdrum

I should note that this is on 1.5.5 RC2, since previous versions had the problem in #674.

@nirvdrum

Okay, this issue is even bigger than I thought. It appears the provided document content is just flat out ignored and the resource specified by the URL is fetched instead. If the URL is changed to be a valid one, the document will use the body from that URL as the document body.

@nirvdrum

@jvshahid Since you provided the fix for #674, I was wondering if you might have some insight into the problem. If not, sorry to rope you in.

@jvshahid
Sparkle Motion member

@nirvdrum I'm happy to look into the issue. I'll get back to you shortly.

@jvshahid
Sparkle Motion member

@nirvdrum I apologize for my sloppy change, that was my first commit to Nokogiri and I had little context about what was happening. My understanding of the code agrees with your observations, when a url is provided we set the systemId of the InputSource and we ignore the data that is passed to the parser. This is causing the parser to parse the data from the given url instead of the data that was passed to the parser. That also explains why the url has to be resolved.

The fix is easy, but I would like to make sure that the behavior is the consistent with the C Nokogiri before I push the change.

@jvshahid
Sparkle Motion member

I just pushed some changes to a branch 'fix_entity_resolving' that fixes our entity resolving logic. This should fix your problem and a bunch of other bugs with entity resolving and validation.

@yokolet can you eyeball the changes, I feel it's very major and would like you to look at it. After that we can merge the changes to master.

I'm going to open a new issue regarding handling entity resolving and validation in general. I believe there's more to be done in that area.

@nirvdrum

Thanks @jvshahid. I really need to get up to speed with building nokogiri so I can try to be more valuable. My previous attempts were thwarted for one reason or another, but that's not much of an excuse.

@jvshahid
Sparkle Motion member

@nirvdrum the fix is merged on master and will be packaged in a rc package soon.

@nirvdrum

Great! Thanks for looking into it and dealing with the larger issue around entity resolving.

@jvshahid
Sparkle Motion member

Closing the issue, please reopen if the problem still exist. Thanks again for reporting this bug.

@jvshahid jvshahid closed this Jul 4, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment