Need help parsing a standard nginx directory listing. Different results with ruby and jruby. #1888

Closed
amo13 opened this issue Apr 7, 2019 · 12 comments · Fixed by #1897

Comments

@amo13

amo13 commented Apr 7, 2019

Dear Nokogiri community,
I want to parse the content of my nginx directory listing. It all works fine with normal ruby, but with jruby I can't get nokogiri to parse the listing the same way as it does on normal ruby. Does anyone have an idea that could help me out?

To Reproduce

#! /usr/bin/env ruby

require 'nokogiri'
require 'open-uri'

Nokogiri.HTML(open("https://archive.******.de/i9305"))

Result with ruby

=> #(Document:0x2afe5afd8350 {
  name = "document",
  children = [
    #(DTD:0x2afe5b0055a8 { name = "html" }),
    #(Element:0x2afe5ab1767c {
      name = "html",
      children = [
        #(Text "\r\n"),
        #(Element:0x2afe5ac382f4 {
          name = "head",
          children = [
            #(Element:0x2afe5acd0f68 {
              name = "title",
              children = [ #(Text "Index of /i9305/")]
              })]
          }),
        #(Text "\r\n"),
        #(Element:0x2afe5aec8668 {
          name = "body",
          attributes = [
            #(Attr:0x2afe5aefca08 { name = "bgcolor", value = "white" })],
          children = [
            #(Text "\r\n"),
            #(Element:0x2afe5af3cf40 {
              name = "h1",
              children = [ #(Text "Index of /i9305/")]
              }),
            #(Element:0x2afe5aba4ff4 { name = "hr" }),
            #(Element:0x2afe5ab9822c {
              name = "pre",
              children = [
                #(Element:0x2afe5af9ee48 {
                  name = "a",
                  attributes = [
                    #(Attr:0x2afe5af9e3d0 { name = "href", value = "../" })],
                  children = [ #(Text "../")]
                  }),
                #(Text "\r\n"),
                #(Element:0x2afe5affdad8 {
                  name = "a",
                  attributes = [
                    #(Attr:0x2afe5aff6fe4 {
                      name = "href",
                      value = "ResurrectionRemix/"
                      })],
                  children = [ #(Text "ResurrectionRemix/")]
                  }),
                #(Text "                                 03-Mar-2019 10:12                   -\r\n"),
                #(Element:0x2afe5aaecea4 {
                  name = "a",
                  attributes = [
                    #(Attr:0x2afe5aaee858 { name = "href", value = "TWRP/" })],
                  children = [ #(Text "TWRP/")]
                  }),
                #(Text "                                              12-Mar-2019 19:27                   -\r\n"),
                #(Element:0x2afe5ac48690 {
                  name = "a",
                  attributes = [
                    #(Attr:0x2afe5ac67400 {
                      name = "href",
                      value = "override_TWRP/"
                      })],
                  children = [ #(Text "override_TWRP/")]
                  }),
                #(Text "                                     12-Mar-2019 19:27                   -\r\n")]
              }),
            #(Element:0x2afe5aa751c4 { name = "hr" })]
          }),
        #(Text "\r\n")]
      })]
  })

Result with jruby:

=> #(Document:0x7e6 {
  name = "document",
  children = [
    #(Element:0x7e8 {
      name = "html",
      children = [
        #(Element:0x7ea { name = "head" }),
        #(Element:0x7ec {
          name = "body",
          children = [ #(Element:0x7ee { name = "h" })]
          })]
      })]
  })

Environment

ruby:

# Nokogiri (1.10.2)
    ---
    warnings: []
    nokogiri: 1.10.2
    ruby:
      version: 2.6.0
      platform: x86_64-linux
      description: ruby 2.6.0p0 (2018-12-25 revision 66547) [x86_64-linux]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/home/amo/.rvm/gems/ruby-2.6.0/gems/nokogiri-1.10.2/ports/x86_64-pc-linux-gnu/libxml2/2.9.9"
      libxslt_path: "/home/amo/.rvm/gems/ruby-2.6.0/gems/nokogiri-1.10.2/ports/x86_64-pc-linux-gnu/libxslt/1.1.33"
      libxml2_patches:
      - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
      - 0002-Remove-script-macro-support.patch
      - 0003-Update-entities-to-remove-handling-of-ssi.patch
      libxslt_patches: []
      compiled: 2.9.9
      loaded: 2.9.9

jruby:

# Nokogiri (1.10.2)
    ---
    warnings: []
    nokogiri: 1.10.2
    ruby:
      version: 2.5.0
      platform: java
      description: jruby 9.2.5.0 (2.5.0) 2018-12-06 6d5a228 OpenJDK 64-Bit Server VM 25.212-b01
        on 1.8.0_212-b01 +jit [linux-x86_64]
      engine: jruby
      jruby: 9.2.5.0
    xerces: Xerces-J 2.12.0
    nekohtml: NekoHTML 1.9.21

Any advice is greatly appreciated.

@flavorjones
Member

Hi, thanks for opening this issue, but I'm unable to reproduce what you're seeing. Here's the code I used to reproduce this without the open-uri network call:

#! /usr/bin/env ruby

require "nokogiri"
require "yaml"

# copypasta from `curl https://archive.anarchiehandy.de/i9305`
puts Nokogiri::VERSION_INFO.to_yaml
puts "---"

nginx_response = <<EOHTML
<html>
<head><title>Index of /i9305/</title></head>
<body bgcolor="white">
<h1>Index of /i9305/</h1><hr><pre><a href="../">../</a>
<a href="ResurrectionRemix/">ResurrectionRemix/</a>                                 03-Mar-2019 10:12                   -
<a href="TWRP/">TWRP/</a>                                              12-Mar-2019 19:27                   -
<a href="override_TWRP/">override_TWRP/</a>                                     12-Mar-2019 19:27                   -
</pre><hr></body>
</html>
EOHTML

doc = Nokogiri::HTML(nginx_response)
puts doc.to_html

For CRuby, the output is:

---
warnings: []
nokogiri: 1.10.2
ruby:
  version: 2.6.2
  platform: x86_64-linux
  description: ruby 2.6.2p47 (2019-03-13 revision 67232) [x86_64-linux]
  engine: ruby
libxml:
  binding: extension
  source: packaged
  libxml2_path: "/home/flavorjones/.rvm/gems/ruby-2.6.2/gems/nokogiri-1.10.2/ports/x86_64-pc-linux-gnu/libxml2/2.9.9"
  libxslt_path: "/home/flavorjones/.rvm/gems/ruby-2.6.2/gems/nokogiri-1.10.2/ports/x86_64-pc-linux-gnu/libxslt/1.1.33"
  libxml2_patches:
  - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
  - 0002-Remove-script-macro-support.patch
  - 0003-Update-entities-to-remove-handling-of-ssi.patch
  libxslt_patches: []
  compiled: 2.9.9
  loaded: 2.9.9
---
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Index of /i9305/</title>
</head>
<body bgcolor="white">
<h1>Index of /i9305/</h1>
<hr>
<pre><a href="../">../</a>
<a href="ResurrectionRemix/">ResurrectionRemix/</a>                                 03-Mar-2019 10:12                   -
<a href="TWRP/">TWRP/</a>                                              12-Mar-2019 19:27                   -
<a href="override_TWRP/">override_TWRP/</a>                                     12-Mar-2019 19:27                   -
</pre>
<hr>
</body>
</html>

For JRuby, the output is:

---
warnings: []
nokogiri: 1.10.2
ruby:
  version: 2.5.0
  platform: java
  description: jruby 9.2.5.0 (2.5.0) 2018-12-06 6d5a228 OpenJDK 64-Bit Server VM 10.0.2+13-Ubuntu-1ubuntu0.18.04.4
    on 10.0.2+13-Ubuntu-1ubuntu0.18.04.4 [linux-x86_64]
  engine: jruby
  jruby: 9.2.5.0
xerces: Xerces-J 2.12.0
nekohtml: NekoHTML 1.9.21
---
<html><head><title>Index of /i9305/</title></head><body bgcolor="white">
<h1>Index of /i9305/</h1><hr><pre><a href="../">../</a>
<a href="ResurrectionRemix/">ResurrectionRemix/</a>                                 03-Mar-2019 10:12                   -
<a href="TWRP/">TWRP/</a>                                              12-Mar-2019 19:27                   -
<a href="override_TWRP/">override_TWRP/</a>                                     12-Mar-2019 19:27                   -
</pre><hr>
</body></html>

The parsed documents are identical in structure.

Is it possible that you're getting different results back from your open-uri network call?
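
One way to check this is to fingerprint the raw response body under each interpreter before Nokogiri sees it, then compare the two results. A minimal sketch, assuming a hypothetical helper `body_fingerprint` (not part of Nokogiri or open-uri) and a StringIO standing in for the network response:

```ruby
require "digest"
require "stringio"

# Hypothetical helper: returns the byte size and SHA-256 of an IO's content,
# then rewinds so the same IO can still be handed to a parser afterwards.
def body_fingerprint(io)
  body = io.read
  io.rewind
  { bytes: body.bytesize, sha256: Digest::SHA256.hexdigest(body) }
end

# Stand-in for the value returned by open(...); the markup is illustrative.
response = StringIO.new("<html><body><h1>Index of /i9305/</h1></body></html>\n")
puts body_fingerprint(response)
```

If the size and digest match on CRuby and JRuby, the difference is more likely in how the parser consumes the IO than in what the server sent.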

@flavorjones
Member

Ah, interesting -- this appears to be a difference in how Nokogiri and open-uri behave between CRuby and JRuby. Digging into it now.

@flavorjones
Member

Related narrative here: #1821

@flavorjones
Member

OK, narrowing this down: in JRuby, Nokogiri::XML() reads the returned StringIO fine, but Nokogiri::HTML() does not. This is surely a bug in how we're reading from IO objects in the HTML document object. Hang tight.

In the meantime, a workaround is to add .read on to the returned value from open, e.g.:

Nokogiri.HTML(open("https://archive.anarchiehandy.de/i9305").read)
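
The same idea applies to any value that may be either a String or an IO. A minimal sketch, assuming a hypothetical helper `html_source` (not a Nokogiri API) that normalizes the input to a String before it reaches the parser:

```ruby
require "stringio"

# Hypothetical helper: return the markup as a String whether the caller
# passed a String or an IO-like object (anything that responds to #read).
def html_source(string_or_io)
  string_or_io.respond_to?(:read) ? string_or_io.read : string_or_io.to_s
end

# Stand-in for the value returned by open(...); the markup is illustrative.
page = StringIO.new("<html><body><h1>Index of /i9305/</h1></body></html>")
markup = html_source(page)
puts markup.class

# With Nokogiri loaded, the String can then be parsed on either interpreter:
#   doc = Nokogiri::HTML(markup)
```

Passing a String avoids the IO-reading code path that differs between the two implementations, at the cost of holding the whole body in memory.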

@flavorjones
Member

OK, got it: the fix for #1124 was applied only to Nokogiri::XML::Document and related classes, and not to Nokogiri::HTML::Document and related classes.

flavorjones added a commit that referenced this issue Apr 7, 2019
because implementing read like this can result in nondeterministic
behavior. see related #1821 and #1888.
flavorjones added a commit that referenced this issue Apr 7, 2019
related to incomplete application of fix from #1124
@flavorjones
Member

I've pushed a branch, 1888-jruby-html-read-io, which contains a failing test demonstrating the issue on JRuby.

@jvshahid - I have to ask for your help here. The issue appears to be with NokogiriEncodingReaderWrapper which was introduced in 9dab8f5. The failing test for JRuby falls into this check in setInputSource:

        // if setEncoding returned true, then the stream is set
        // to the EncodingReaderInputStream
        if (setEncoding(context, data))
          return;

@amo13
Author

amo13 commented Apr 7, 2019

Wow, thank you a lot for investigating this.
And I appreciate the temporary workaround suggestion!

@jvshahid
Member

@flavorjones I think I understand the issue. I would like to take some time to fix the following issues as well, unless anyone objects:

  • We don't preserve any information on whether the parsed thing is an IO or String. We already know the type since ruby calls read_io for IO objects and read_string otherwise.
  • Ruby calls detect_encoding in HTML::Document. We re-detect the encoding again in NokogiriEncodingReaderWrapper.
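
The first point can be illustrated from the Ruby side. This is a rough sketch and not Nokogiri's actual implementation: the hypothetical `classify_input` and `ParseInput` below record whether the caller supplied an IO or a String at the one place where that is known, so later stages would not need to guess.

```ruby
require "stringio"

# Hypothetical value object carrying the input together with its kind.
ParseInput = Struct.new(:kind, :payload)

# Hypothetical dispatcher: IO-like objects are tagged :io and everything
# else is treated as a String, mirroring the read_io / read_string split
# described above.
def classify_input(string_or_io)
  if string_or_io.respond_to?(:read)
    ParseInput.new(:io, string_or_io)
  else
    ParseInput.new(:string, string_or_io.to_s)
  end
end

puts classify_input(StringIO.new("<html></html>")).kind
puts classify_input("<html></html>").kind
```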

@flavorjones
Member

@jvshahid Thanks for looking into it! I think it makes sense to take the time and do the right thing; your suggestions sound right to me.

jvshahid added a commit that referenced this issue Apr 20, 2019
We don't have to figure out the encoding again.  This was already figured out
in the Ruby code.

fixes #1888
jvshahid added a commit that referenced this issue Apr 20, 2019
We don't have to figure out the encoding again.  This was already figured out
in the Ruby code.

fixes #1888
@jvshahid
Member

FYI, I pushed the fix in #1897

@flavorjones flavorjones added this to the v1.11.0 milestone Apr 23, 2019
@flavorjones
Member

John's PR was merged, so this will be fixed in v1.11.0 when it drops. Watch the milestone for progress: https://github.com/sparklemotion/nokogiri/milestone/18

flavorjones added a commit that referenced this issue Apr 23, 2019
@amo13
Author

amo13 commented Apr 23, 2019

Thank you a lot for taking care of this!
