Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP
Extract useful data from HTML and XML with ease!
Ruby
Branch: master

Require ActiveSupport core extensions properly

Fixes 'uninitialized constant ActiveSupport::Autoload' in Travis
latest commit b84812f0e5
@tombenner authored
Failed to load latest commit information.
lib Require ActiveSupport core extensions properly
spec Remove color_enabled (unsupported in RSpec 3), improve RSpec's config
.gitignore Remove Gemfile.lock
.travis.yml Remove official support for Ruby 1.9.2, as the latest activesupport d…
Gemfile Initial commit
MIT-LICENSE Initial commit
README.md Fix typo in README
Rakefile
nikkou.gemspec Specify the license in the gemspec

README.md

Nikkou

Extract useful data from HTML and XML with ease!

Description

Nikkou adds additional methods to Nokogiri to make extracting commonly-used data from HTML and XML easier. It lets you transform HTML into structured data very quickly, and it integrates nicely with Mechanize.

Installation

Add Nikkou to your Gemfile:

gem 'nikkou'

Method Overview

Here's a summary of the methods Nikkou provides (see "Methods" for details):

Formatting

parse_text - Parses the node's text as XML and returns it as a Nokogiri::XML::NodeSet

time(options={}) - Intelligently parses the time (relative or absolute) of either the text or a specified attribute; accepts a time_zone option

url(attribute='href') - Converts the href (or other specified attribute) into an absolute URL using the document's URI; <a href="/p/1">Link</a> yields http://mysite.com/p/1

Searching

attr_equals(attribute, string) - Finds nodes where the attribute equals the string

attr_includes(attribute, string) - Finds nodes where the attribute includes the string

attr_matches(attribute, pattern) - Finds nodes where the attribute matches the pattern

drill(*methods) - Nil-safe method chaining

find(path) - Same as search but returns the first matched node

text_equals(string) - Finds nodes where the text equals the string

text_includes(string) - Finds nodes where the text includes the string

text_matches(pattern) - Finds nodes where the text matches the pattern

Methods

Formatting

time(options={})

Returns a Time object (in UTC) by automatically parsing the text or specified attribute of the node.

# <a href="/p/1">3 hours ago</a>
doc.search('a').first.time
Options

attribute

The attribute to parse:

# <a href="/p/1" data-published-at="2013-05-22 02:42:34">My link</a>
doc.search('a').first.time(attribute: 'data-published-at')

time_zone

The document's time zone (the time will be converted from that to UTC):

# <a href="/p/1">3 hours ago</a>
doc.search('a').first.time(time_zone: 'America/New_York')

url(attribute='href')

Returns an absolute URL; useful for parsing relative hrefs. The document's uri needs to be set for Nikkou to know what domain to add to relative paths.

# <a href="/p/1">My link</a>
doc.uri = 'http://mysite.com/mypage'
doc.search('a').first.url # "http://mysite.com/p/1"

If Mechanize is being used, the uri doesn't need to be manually set.

Options

attribute

The attribute to parse:

# <a href="/p/1" data-comments-url="/p/1#comments">My Link</a>
doc.uri = 'http://mysite.com/mypage'
doc.search('a').first.url('data-comments-url') # "http://mysite.com/p/1#comments"

Searching

attr_equals(attribute, string)

Selects nodes where the specified attribute equals the string.

# <div data-type="news">My Text</div>
doc.attr_equals('data-type', 'news').first.text # "My Text"

attr_includes(attribute, string)

Selects nodes where the specified attribute includes the string.

# <div data-type="major-news">My Text</div>
doc.attr_includes('data-type', 'news').first.text # "My Text"

attr_matches(attribute, pattern)

Selects nodes with an attribute matching a pattern. The pattern's matches are available in Node#matches.

# <span data-tooltip="3 Comments">My Text</span>
doc.attr_matches('data-tooltip', /(\d+) comments/i).first.text # "My Text"
doc.attr_matches('data-tooltip', /(\d+) comments/i).first.matches # ["3 Comments", "3"]

drill(*methods)

Nil-safe method chaining. Replaces this:

node = doc.find('.count')
if node
  attribute = node.attr('data-count')
  if attribute
    return attribute.to_i
  end
end

With this:

return doc.drill([:find, '.count'], [:attr, 'data-count'], :to_i)

find(path)

Same as search, but returns the first matched node. Replaces this:

nodes = node.search('h4')
if nodes
  return nodes.first
end

With this:

return node.find('h4')

text_includes(string)

Selects nodes where the text includes the string.

# <div data-type="news">My Text</div>
doc.text_includes('Text').first.text # "My Text"

text_matches(pattern)

Selects nodes with text matching a pattern. The pattern's matches are available in Node#matches.

# <a href="/p/1">3 Comments</a>
doc.text_matches(/^(\d+) comments$/i).first.attr('href') # "/p/1"
doc.text_matches(/^(\d+) comments$/i).first.matches # ["3 Comments", "3"]

License

Nikkou is released under the MIT License. Please see the MIT-LICENSE file for details.

Something went wrong with that request. Please try again.