Nokogiri (鋸) is a Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support.
Latest commit a24b92e Feb 20, 2018
tenderlove Don't SEGV if user returns a non string object
Make sure the object we get back on the reader function is actually a
string.  If it isn't, return an error to libxml2.

Fixes #898
.github tweaking the github issue template Oct 3, 2016
bin Fix cli tool example links Jun 5, 2015
concourse concourse: more coverage for PRs Sep 29, 2017
ext Don't SEGV if user returns a non string object Feb 20, 2018
lib Handle ASCII-8BIT encoding on fragment input Feb 20, 2018
patches update libxml → 2.9.5, libxlst → 1.1.30 Sep 13, 2017
suppressions apply valgrind suppression to all rubies Jun 19, 2017
tasks add a task to run the test suite against the installed version Feb 21, 2016
test Don't SEGV if user returns a non string object Feb 20, 2018
.autotest remove support for ruby 1.9.2, 1.9.3 and 2.0 Jun 7, 2016
.codeclimate.yml tell code climate to ignore generated files Jul 20, 2017
.cross_rubies Windows: Add cross build for ruby-2.5 Dec 25, 2017
.editorconfig Add .editorconfig. Oct 23, 2013
.gemtest opting in to .gemtest Feb 2, 2011
.gitignore upgrade to concourse gem 0.11.0 Feb 16, 2017
.travis.yml Merge pull request #1699 from marutosi/ruby-2.4.2-on-travis Feb 19, 2018
CHANGELOG.md update CHANGELOG Feb 3, 2018
CONTRIBUTING.md 1) Added documentation for ParseOptions class. Mar 26, 2016
C_CODING_STYLE.rdoc astyle can be used to approximate the C coding style. Jun 5, 2012
Gemfile Windows: Add cross build for ruby-2.5 Dec 25, 2017
Gemfile-libxml-ruby simplify how we test with libxml-ruby loaded Feb 9, 2017
LICENSE-DEPENDENCIES.md LICENSE-DEPENDENCIES.md Feb 13, 2017
LICENSE.md LICENSE-DEPENDENCIES.md Feb 13, 2017
Manifest.txt ensure EntityReferences ignore malformed children Jan 28, 2018
README.md add concourse status to README Jun 5, 2017
ROADMAP.md Update to roadmap Feb 17, 2016
Rakefile remove hacks to discover the path to `racc` Jan 28, 2018
STANDARD_RESPONSES.md Altered wording in "not a bug" standard response. Apr 26, 2012
Y_U_NO_GEMSPEC.md Added a missed word. Sep 24, 2016
appveyor.yml Appveyor: Add ruby-2.4 and ruby-head to build matrix Dec 26, 2017
build_all move build of JRuby gem to using JRuby 9000 Jan 29, 2018
dependencies.yml update libxslt from 1.1.30 to 1.1.32 Nov 13, 2017

README.md

Nokogiri

Status

System Status
Concourse Concourse CI
Travis Travis Build Status
Appveyor Appveyor Build Status
Code Climate Code Climate
Version Eye Version Eye

Description

Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser. Among Nokogiri's many features is the ability to search documents via XPath or CSS3 selectors.

Features

  • XML/HTML DOM parser which handles broken HTML
  • XML/HTML SAX parser
  • XML/HTML Push parser
  • XPath 1.0 support for document searching
  • CSS3 selector support for document searching
  • XML/HTML builder
  • XSLT transformer

Nokogiri parses and searches XML/HTML using native libraries (either C or Java, depending on your Ruby), which means it's fast and standards-compliant.

Installation

If this doesn't work:

gem install nokogiri

then please start troubleshooting here:

http://www.nokogiri.org/tutorials/installing_nokogiri.html

There are currently 1,237 Stack Overflow questions about Nokogiri installation. The vast majority of them are out of date and therefore incorrect. Please do not use Stack Overflow.

Instead, tell us when the above instructions don't work for you. This allows us to both help you directly and improve the documentation.

Binary packages

Binary packages are available for some distributions.

Support

There are open-source tutorials (to which we invite contributions!) here: http://nokogiri.org/tutorials

Synopsis

Nokogiri is a large library, but here is example usage for parsing and examining a document:

#! /usr/bin/env ruby

require 'nokogiri'
require 'open-uri'

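# Fetch and parse HTML document.
# URI#open comes from open-uri; Kernel#open no longer accepts URLs on Ruby 3.0 and later.
doc = Nokogiri::HTML(URI.parse('http://www.nokogiri.org/tutorials/installing_nokogiri.html').open)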

puts "### Search for nodes by css"
doc.css('nav ul.menu li a', 'article h2').each do |link|
  puts link.content
end

puts "### Search for nodes by xpath"
doc.xpath('//nav//ul//li/a', '//article//h2').each do |link|
  puts link.content
end

puts "### Or mix and match."
doc.search('nav ul.menu li a', '//article//h2').each do |link|
  puts link.content
end

Requirements

  • Ruby 2.1.0 or higher, including any development packages necessary to compile native extensions.

  • In Nokogiri 1.6.0 and later, libxml2 and libxslt are bundled with the gem. If you want to use the system versions instead:

    • First, check out the long list of fixes and changes between releases before deciding to use any version older than the one bundled with Nokogiri.

    • At install time, set the environment variable NOKOGIRI_USE_SYSTEM_LIBRARIES or else use the --use-system-libraries argument. (See http://nokogiri.org/tutorials/installing_nokogiri.html#using_your_system_libraries for specifics.)

    • libxml2 >=2.6.21 with iconv support (libxml2-dev/-devel is also required)

    • libxslt, built with and supported by the given libxml2 (libxslt-dev/-devel is also required)

Encoding

Strings are always stored as UTF-8 internally. Methods that return text values will always return UTF-8 encoded strings. Methods that return a string containing markup (like to_xml, to_html and inner_html) will return a string encoded like the source document.

WARNING

Some documents declare one encoding, but actually use a different one. In these cases, which encoding should the parser choose?

Data is just a stream of bytes. Humans add meaning to that stream. Any particular set of bytes could be valid characters in multiple encodings, so detecting encoding with 100% accuracy is not possible. libxml2 does its best, but it can't be right all the time.

If you want Nokogiri to handle the document encoding properly, your best bet is to explicitly set the encoding. Here is an example of explicitly setting the encoding to EUC-JP on the parser:

  doc = Nokogiri.XML('<foo><bar /></foo>', nil, 'EUC-JP')

Development

  bundle install
  bundle exec rake

License

MIT. See the LICENSE.md file.