
Tidy up

1 parent 05fc783 commit 651576e972d92d542b6900af6a85d1b8d4c71fd8 @ConradIrwin ConradIrwin committed Aug 29, 2010
Showing with 48 additions and 49 deletions.
  1. +1 −0 LICENSE.rdoc
  2. +15 −12 README.rdoc
  3. +7 −4 lib/robotstxt.rb
  4. +17 −27 lib/robotstxt/parser.rb
  5. +3 −3 test/getter_test.rb
  6. +5 −3 test/parser_test.rb
LICENSE.rdoc
@@ -2,6 +2,7 @@
(The MIT License)
+Copyright (c) 2010 Conrad Irwin <conrad@rapportive.com>
Copyright (c) 2009 Simone Rinzivillo <srinzivillo@gmail.com>
Permission is hereby granted, free of charge, to any person obtaining
README.rdoc
@@ -2,7 +2,11 @@
Robotstxt is a Ruby robots.txt file parser.
-It provides mechanisms for obtaining and parsing the robots.txt file from
+The robots.txt exclusion protocol is a simple mechanism whereby site-owners can guide
+any automated crawlers to relevant parts of their site, and prevent them accessing content
+which is intended only for other eyes. For more information, see http://www.robotstxt.org/.
+
+This library provides mechanisms for obtaining and parsing the robots.txt file from
websites. As there is no official "standard" it tries to do something sensible,
though inspiration was taken from:
@@ -34,7 +38,7 @@ The Robotstxt module has three public methods:
- Robotstxt.get_allowed? urlish, user_agent, (options)
Returns true iff the robots.txt obtained from the host identified by the
- urlish allows access to the url.
+ urlish allows the given user agent access to the url.
The Robotstxt::Parser class contains two pieces of state, the user_agent and the
text of the robots.txt. In addition its instances have two public methods:
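(Its two instance methods are elided from this hunk; from the tests and the deprecated helpers below they are allowed? and sitemaps.) A minimal usage sketch of the API described above, in which the host, url and "MyCrawler" agent string are purely illustrative:

  require 'robotstxt'

  # One-shot check: fetches http://www.example.com/robots.txt and tests one url.
  Robotstxt.get_allowed?("http://www.example.com/index.html", "MyCrawler")

  # Re-use the parsed file for several urls on the same host.
  robots = Robotstxt.get("http://www.example.com/", "MyCrawler")
  robots.allowed?("/index.html")   # => true or false, depending on the fetched rules
  robots.sitemaps                  # => array of sitemap urls listed in the file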
@@ -49,7 +53,7 @@ text of the robots.txt. In addition its instances have two public methods:
In the above there are five kinds of parameter,
A "urlish" is either a String that represents a URL (suitable for passing to
- URI.parse), i.e.
+ URI.parse) or a URI object, i.e.
urlish = "http://www.example.com/"
urlish = "/index.html"
@@ -72,9 +76,9 @@ In the above there are five kinds of parameter,
A "robots_txt" is the textual content of a robots.txt file that is in the
same encoding as the urls you will be fetching (normally utf8).
- A "user_agent" is the value you use in your User-agent: header.
+ A "user_agent" is the string value you use in your User-agent: header.
- The "options" is an optional hash containing
+ The "options" is an optional hash containing
:num_redirects (5) - the number of redirects to follow before giving up.
:http_timeout (10) - the length of time in seconds to wait for one http
request
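For example, both options could be passed as a trailing hash (the values here are illustrative, not recommendations):

  Robotstxt.get_allowed?("http://www.example.com/", "MyCrawler",
                         :num_redirects => 3, :http_timeout => 5)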
@@ -113,8 +117,7 @@ If an HTTPUnauthorized or an HTTPForbidden is returned when trying to access
If an HTTPRedirection is returned, it should be followed (though we give up
after five redirects, to avoid infinite loops).
-If an HTTPSuccess is returned, the body is converted into utf8 using the value
-of the charset option to the Content-type: header, and then parsed.
+If an HTTPSuccess is returned, the body is converted into utf8, and then parsed.
Any other response, or no response, indicates that there are no Disallowed urls
on the site.
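The getter tests below pin this behaviour down; a rough sketch of the two most important cases, using FakeWeb as the test suite does (hosts and agent string invented for illustration):

  require 'fakeweb'
  require 'robotstxt'

  # A missing robots.txt means everything may be crawled.
  FakeWeb.register_uri(:get, "http://absent.example.com/robots.txt",
                       :status => ["404", "Not found"])
  Robotstxt.get_allowed?("http://absent.example.com/", "MyCrawler")   # => true

  # A 401/403 on robots.txt is taken as "keep out" for the whole site.
  FakeWeb.register_uri(:get, "http://locked.example.com/robots.txt",
                       :status => ["403", "Forbidden"])
  Robotstxt.get_allowed?("http://locked.example.com/", "MyCrawler")   # => false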
@@ -150,28 +153,28 @@ entire path (or path + ? + query).
In order to get consistent results, before the globs are matched, the %-encoding
is normalised so that only /?&= remain %-encoded. For example, /h%65llo/ is the
same as /hello/, but /ac%2fdc is not the same as /ac/dc - this is due to the
-significance granted to the / operator in urls.
+significance granted to the / operator in urls.
The paths of the first section that matched our user-agent (by order of
appearance in the file) are parsed in order of appearance. The first Allow: or
Disallow: rule that matches the url is accepted. This is prescribed by
robotstxt.org, but other parsers take wildly different strategies:
Google checks all Allows: then all Disallows:
- Bing checks the most-specific first
+ Bing checks the most-specific first
Others check all Disallows: then all Allows
-As is conventional, a "Disallow: " line with no path given is treated as
+As is conventional, a "Disallow: " line with no path given is treated as
"Allow: *", and if a URL didn't match any path specifiers (or the user-agent
didn't match any user-agent sections) then that is implicit permission to crawl.
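A sketch of these matching rules using Robotstxt::Parser directly (the robots.txt content and agent string are invented for illustration):

  require 'robotstxt'

  rules = "User-agent: *\nDisallow: /h%65llo/\nDisallow: /private*\nAllow: /\n"
  parser = Robotstxt::Parser.new("MyCrawler", rules)

  parser.allowed?("/hello/world")    # => false, /h%65llo/ normalises to /hello/
  parser.allowed?("/private-notes")  # => false, * matches any run of characters
  parser.allowed?("/public/page")    # => true, first matching rule is Allow: /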
== TODO
I would like to add support for the Crawl-delay directive, and indeed any other
-parameters that are found in the wild.
+parameters in use.
== Requirements
-* Ruby >= 1.8.7
+* Ruby >= 1.8.7
* iconv, net/http and uri
== Installation
lib/robotstxt.rb
@@ -6,7 +6,7 @@
#
# Category:: Net
# Package:: Robotstxt
-# Author:: Simone Rinzivillo <srinzivillo@gmail.com>
+# Author:: Conrad Irwin <conrad@rapportive.com>, Simone Rinzivillo <srinzivillo@gmail.com>
# License:: MIT License
#
#--
@@ -17,12 +17,15 @@
require 'robotstxt/parser'
require 'robotstxt/getter'
+# Provides a flexible interface to help authors of web-crawlers
+# respect the robots.txt exclusion standard.
+#
module Robotstxt
NAME = 'Robotstxt'
GEM = 'robotstxt'
- AUTHORS = ['Simone Rinzivillo <srinzivillo@gmail.com>']
- VERSION = '0.5.4'
+ AUTHORS = ['Conrad Irwin <conrad@rapportive.com>', 'Simone Rinzivillo <srinzivillo@gmail.com>']
+ VERSION = '1.0'
# Obtains and parses a robotstxt file from the host identified by source,
# source can either be a URI, a string representing a URI, or a Net::HTTP
@@ -85,7 +88,7 @@ def self.get_allowed?(uri, robot_id)
end
# DEPRECATED
-
+
def self.allowed?(uri, robot_id); self.get(uri, robot_id).allowed? uri; end
def self.sitemaps(uri, robot_id); self.get(uri, robot_id).sitemaps; end
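The deprecated helpers above simply delegate to Robotstxt.get, so the non-deprecated equivalents look like this (host and agent string illustrative):

  # Robotstxt.allowed?(uri, robot_id) is equivalent to:
  Robotstxt.get("http://www.example.com/", "MyCrawler").allowed?("http://www.example.com/")

  # Robotstxt.sitemaps(uri, robot_id) is equivalent to:
  Robotstxt.get("http://www.example.com/", "MyCrawler").sitemaps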
lib/robotstxt/parser.rb
@@ -1,19 +1,10 @@
-#
-# = Ruby Robotstxt
-#
-# An Ruby Robots.txt parser.
-#
-#
-# Category:: Net
-# Package:: Robotstxt
-# Author:: Simone Rinzivillo <srinzivillo@gmail.com>
-# License:: MIT License
-#
-#--
-#
-#++
+
module Robotstxt
- # The parser aims to behave as expected, using a few sources for guidance:
+ # Parses robots.txt files for the perusal of a single user-agent.
+ #
+ # The behaviour implemented is guided by the following sources, though
+ # as there is no widely accepted standard, it may differ from other implementations.
+ # If you consider its behaviour to be in error, please contact the author.
#
# http://www.robotstxt.org/orig.html
# - the original, now imprecise and outdated version
@@ -22,17 +13,14 @@ module Robotstxt
# http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449&from=35237
# - a few hints at modern protocol extensions.
#
- # Unfortunately, as hinted at on that page, there is no reference implementation
- # or widely accepted standard, and attempts to create one seem to have stalled.
- #
- # This parser reads lines starting with (case-insensitively:)
+ # This parser only considers lines starting with (case-insensitively:)
# Useragent: User-agent: Allow: Disallow: Sitemap:
- #
+ #
# The file is divided into sections, each of which contains one or more User-agent:
# lines, followed by one or more Allow: or Disallow: rules.
#
- # The first section that contains a User-agent: line that matches the robots
- # user-agent, is the only section that robot looks at. The sections are checked
+ # The first section that contains a User-agent: line that matches the robot's
+ # user-agent, is the only section relevant to that robot. The sections are checked
# in the same order as they appear in the file.
#
# (The * character is taken to mean "any number of any characters" during matching of
@@ -44,8 +32,8 @@ module Robotstxt
# (The order of matching is as in the RFC, Google matches all Allows and then all Disallows,
# while Bing matches the most specific rule, I'm sure there are other interpretations)
#
- # When matching urls, all % encodings are normalised so that only the "/?&=" characters are
- # still escaped, while "*" characters match any number of any character.
+ # When matching urls, all % encodings are normalised (except for /?=& which have meaning)
+ # and "*"s match any number of any character.
#
# If a pattern ends with a $, then the pattern must match the entire path, or the entire
# path with query string.
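For instance, a hypothetical pattern such as "Disallow: /*.pdf$" (not taken from the real test fixtures; test_trail_matching below exercises the same idea) would behave like this:

  require 'robotstxt'

  robots = Robotstxt::Parser.new("MyCrawler",
    "User-agent: *\nDisallow: /*.pdf$\nAllow: /\n")

  robots.allowed?("/guide.html")              # => true, does not end in .pdf
  robots.allowed?("/report.pdf")              # => false, $ anchors the match to the end
  robots.allowed?("/report.pdf?action=view")  # => false, the path alone still ends in .pdf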
@@ -209,7 +197,7 @@ def normalize_percent_encoding(path)
def reify(glob)
# -1 on a split prevents trailing empty strings from being deleted.
- glob.split("*", -1).map{|part| Regexp.escape(part) }.join(".*")
+ glob.split("*", -1).map{ |part| Regexp.escape(part) }.join(".*")
end
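A few concrete evaluations of the helper above (plain Ruby, easy to verify in irb):

  reify("/team*")   # => "/team.*"
  reify("/index")   # => "/index"
  reify("/a*b*c*")  # => "/a.*b.*c.*"  (the -1 keeps the trailing empty segment)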
@@ -223,11 +211,13 @@ def reify(glob)
#
# For example:
#
- # User-agent: * Allow: / Disallow: /secret/
+ # User-agent: *
+ # Disallow: /secret/
+ # Allow: /
#
# Would be parsed so that:
#
- # @rules = [["*", [ ["/", true], ["/secret/", false] ]]]
+ # @rules = [["*", [ ["/secret/", false], ["/", true] ]]]
#
#
# The order of the arrays is maintained so that the first match in the file
test/getter_test.rb
@@ -13,7 +13,7 @@ def test_absense
FakeWeb.register_uri(:get, "http://example.com/robots.txt", :status => ["404", "Not found"])
assert true == Robotstxt.get_allowed?("http://example.com/index.html", "Google")
end
-
+
def test_error
FakeWeb.register_uri(:get, "http://example.com/robots.txt", :status => ["500", "Internal Server Error"])
assert true == Robotstxt.get_allowed?("http://example.com/index.html", "Google")
@@ -23,7 +23,7 @@ def test_unauthorized
FakeWeb.register_uri(:get, "http://example.com/robots.txt", :status => ["401", "Unauthorized"])
assert false == Robotstxt.get_allowed?("http://example.com/index.html", "Google")
end
-
+
def test_forbidden
FakeWeb.register_uri(:get, "http://example.com/robots.txt", :status => ["403", "Forbidden"])
assert false == Robotstxt.get_allowed?("http://example.com/index.html", "Google")
@@ -59,7 +59,7 @@ def test_redirects
end
def test_encoding
- # "User-agent: *\n Disallow: /encyclop@dia" where @ is the ae ligature.
+ # "User-agent: *\n Disallow: /encyclop@dia" where @ is the ae ligature (U+00E6)
FakeWeb.register_uri(:get, "http://example.com/robots.txt", :response => "HTTP/1.1 200 OK\nContent-type: text/plain; charset=utf-16\n\n" +
"\xff\xfeU\x00s\x00e\x00r\x00-\x00a\x00g\x00e\x00n\x00t\x00:\x00 \x00*\x00\n\x00D\x00i\x00s\x00a\x00l\x00l\x00o\x00w\x00:\x00 \x00/\x00e\x00n\x00c\x00y\x00c\x00l\x00o\x00p\x00\xe6\x00d\x00i\x00a\x00")
robotstxt = Robotstxt.get("http://example.com/#index", "Google")
test/parser_test.rb
@@ -4,7 +4,7 @@
require 'robotstxt'
class TestParser < Test::Unit::TestCase
-
+
def test_basics
client = Robotstxt::Parser.new("Test", <<-ROBOTS
User-agent: *
@@ -19,7 +19,7 @@ def test_basics
Disallow: /team*
Disallow: /index
Allow: /
-Sitemap: http://chargify.com/sitemap.xml
+Sitemap: http://example.com/sitemap.xml
ROBOTS
)
assert true == client.allowed?("/")
@@ -31,6 +31,7 @@ def test_basics
assert false == client.allowed?("/test/example")
assert false == client.allowed?("/team-game")
assert false == client.allowed?("/team-game/example")
+ assert ["http://example.com/sitemap.xml"] == client.sitemaps
end
@@ -69,6 +70,7 @@ def test_trail_matching
assert true == google.allowed?("/.pdfs/index.html")
assert false == google.allowed?("/.pdfs/index.pdf")
assert false == google.allowed?("/.pdfs/index.pdf?action=view")
+ assert false == google.allowed?("/.pdfs/index.html?download_as=.pdf")
end
def test_useragents
@@ -84,5 +86,5 @@ def test_useragents
assert true == Robotstxt::Parser.new("Yahoo", robotstxt).allowed?("/hello")
assert false == Robotstxt::Parser.new("Bing", robotstxt).allowed?("/hello")
end
-
+
end
