Add robots gem (required by webscan)
HD Moore committed Apr 15, 2012
1 parent 327e674 commit 10cee17
Showing 14 changed files with 1,149 additions and 0 deletions.
1 change: 1 addition & 0 deletions lib/gemcache/ruby/1.9.1/gems/robots-0.10.1/.gitignore
@@ -0,0 +1 @@
*.gem
26 changes: 26 additions & 0 deletions lib/gemcache/ruby/1.9.1/gems/robots-0.10.1/CHANGELOG
@@ -0,0 +1,26 @@
0.10.0
- Make sure the robots.txt fetch happens with a user agent (via rb2k)
0.9.0
- Fix http://github.com/fizx/robots/issues#issue/1
- Tests don't rely on network.
0.8.0
- Add multiple values from robots.txt (via joost)
0.7.3
- Move to jeweler, gemcutter
0.7.2
- Add Ruby 1.9 compatibility
0.5-0.7.1
- Lost the changelog information :/
0.4.0
- Fixed other_values bug
- added crawl-delay support
0.3.2
- fixed breaking on reddit.com
0.3.1
- fixed bug in disallows handling
- partially mocked out open-uri
0.3.0
- added loggable dependency
0.2.0
- If robots.txt 404s, assume allowed.
- Added CHANGELOG
33 changes: 33 additions & 0 deletions lib/gemcache/ruby/1.9.1/gems/robots-0.10.1/README
@@ -0,0 +1,33 @@
A simple Ruby library to parse robots.txt.

Usage:

  robots = Robots.new "Some User Agent"
  assert robots.allowed?("http://www.yelp.com/foo")
  assert !robots.allowed?("http://www.yelp.com/mail?foo=bar")
  robots.other_values("http://foo.com") # gets misc. key/values (e.g. sitemaps)

If you want caching, you're on your own. I suggest marshalling an instance of the parser.
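A minimal sketch of that marshalling approach, using a stand-in parser object (a real `Robots` instance marshals the same way; the `save_cache`/`load_cache` helper names and the temp path are hypothetical, not part of the gem):

```ruby
require "tmpdir"

# Persist any Marshal-able parser instance to disk and load it back.
# save_cache/load_cache are hypothetical helper names, not gem API.
def save_cache(obj, path)
  File.binwrite(path, Marshal.dump(obj))
end

def load_cache(path)
  File.exist?(path) ? Marshal.load(File.binread(path)) : nil
end

# Stand-in for a parsed Robots instance; a real one marshals the same way,
# since its state is plain hashes of Regexp rules.
parser = { "example.com" => { disallow: [%r{^/mail}] } }
path = File.join(Dir.tmpdir, "robots_cache.bin")

save_cache(parser, path)
restored = load_cache(path)
restored["example.com"][:disallow].first.match?("/mail/inbox") # matches
```

Regexp objects round-trip through Marshal, so the compiled rules survive the cache without re-fetching robots.txt.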

Copyright (c) 2008 Kyle Maxwell, contributors

Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the "Software"), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following
conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
55 changes: 55 additions & 0 deletions lib/gemcache/ruby/1.9.1/gems/robots-0.10.1/Rakefile
@@ -0,0 +1,55 @@
require 'rubygems'
require 'rake'

begin
  require 'jeweler'
  Jeweler::Tasks.new do |gem|
    gem.name = "robots"
    gem.summary = "Simple robots.txt parser"
    gem.description = "It parses robots.txt files"
    gem.email = "kyle@kylemaxwell.com"
    gem.homepage = "http://github.com/fizx/robots"
    gem.authors = ["Kyle Maxwell"]
    gem.add_development_dependency "thoughtbot-shoulda"
    # gem is a Gem::Specification... see http://www.rubygems.org/read/chapter/20 for additional settings
  end
  Jeweler::GemcutterTasks.new
rescue LoadError
  puts "Jeweler (or a dependency) not available. Install it with: sudo gem install jeweler"
end

require 'rake/testtask'
Rake::TestTask.new(:test) do |test|
  test.libs << 'lib' << 'test'
  test.pattern = 'test/**/test_*.rb'
  test.verbose = true
end

begin
  require 'rcov/rcovtask'
  Rcov::RcovTask.new do |test|
    test.libs << 'test'
    test.pattern = 'test/**/*_test.rb'
    test.verbose = true
  end
rescue LoadError
  task :rcov do
    abort "RCov is not available. In order to run rcov, you must: sudo gem install spicycode-rcov"
  end
end

task :default => :test

require 'rake/rdoctask'
Rake::RDocTask.new do |rdoc|
  if File.exist?('VERSION')
    version = File.read('VERSION')
  else
    version = ""
  end

  rdoc.rdoc_dir = 'rdoc'
  rdoc.title = "robots #{version}"
  rdoc.rdoc_files.include('README*')
  rdoc.rdoc_files.include('lib/**/*.rb')
end
1 change: 1 addition & 0 deletions lib/gemcache/ruby/1.9.1/gems/robots-0.10.1/VERSION
@@ -0,0 +1 @@
0.10.1
137 changes: 137 additions & 0 deletions lib/gemcache/ruby/1.9.1/gems/robots-0.10.1/lib/robots.rb
@@ -0,0 +1,137 @@
require "open-uri"
require "uri"
require "rubygems"
require "timeout"

class Robots

  DEFAULT_TIMEOUT = 3

  class ParsedRobots

    def initialize(uri, user_agent)
      @last_accessed = Time.at(1)

      io = Robots.get_robots_txt(uri, user_agent)

      # Fall back to an allow-everything policy if robots.txt is missing,
      # errored, or not served as plain text.
      if !io || io.content_type != "text/plain" || io.status != ["200", "OK"]
        io = StringIO.new("User-agent: *\nAllow: /\n")
      end

      @other = {}
      @disallows = {}
      @allows = {}
      @delays = {} # Crawl-delay values, keyed by user-agent regex
      agent = /.*/
      io.each do |line|
        next if line =~ /^\s*(#.*|$)/
        arr = line.split(":")
        key = arr.shift
        value = arr.join(":").strip
        case key
        when "User-agent"
          agent = to_regex(value)
        when "Allow"
          @allows[agent] ||= []
          @allows[agent] << to_regex(value)
        when "Disallow"
          @disallows[agent] ||= []
          @disallows[agent] << to_regex(value)
        when "Crawl-delay"
          @delays[agent] = value.to_i
        else
          @other[key] ||= []
          @other[key] << value
        end
      end

      @parsed = true
    end

    def allowed?(uri, user_agent)
      return true unless @parsed
      allowed = true
      path = uri.request_uri

      @disallows.each do |key, value|
        if user_agent =~ key
          value.each do |rule|
            if path =~ rule
              allowed = false
            end
          end
        end
      end

      @allows.each do |key, value|
        unless allowed
          if user_agent =~ key
            value.each do |rule|
              if path =~ rule
                allowed = true
              end
            end
          end
        end
      end

      if allowed
        # Honor any Crawl-delay whose agent pattern matches this user agent.
        # Delays are keyed by regex, so a plain hash lookup on the user-agent
        # string would never match; scan for a matching pattern instead, and
        # only sleep when there is time left to wait.
        delay = nil
        @delays.each do |key, value|
          delay = value if user_agent =~ key
        end
        if delay
          wait = delay - (Time.now - @last_accessed)
          sleep(wait) if wait > 0
          @last_accessed = Time.now
        end
      end

      return allowed
    end

    def other_values
      @other
    end

    protected

    def to_regex(pattern)
      return /should-not-match-anything-123456789/ if pattern.strip.empty?
      pattern = Regexp.escape(pattern)
      pattern.gsub!(Regexp.escape("*"), ".*")
      Regexp.compile("^#{pattern}")
    end
  end

  def self.get_robots_txt(uri, user_agent)
    begin
      Timeout::timeout(Robots.timeout) do
        URI.join(uri.to_s, "/robots.txt").open("User-Agent" => user_agent) rescue nil
      end
    rescue Timeout::Error
      STDERR.puts "robots.txt request timed out"
      nil
    end
  end

  def self.timeout=(t)
    @timeout = t
  end

  def self.timeout
    @timeout || DEFAULT_TIMEOUT
  end

  def initialize(user_agent)
    @user_agent = user_agent
    @parsed = {}
  end

  def allowed?(uri)
    uri = URI.parse(uri.to_s) unless uri.is_a?(URI)
    host = uri.host
    @parsed[host] ||= ParsedRobots.new(uri, @user_agent)
    @parsed[host].allowed?(uri, @user_agent)
  end

  def other_values(uri)
    uri = URI.parse(uri.to_s) unless uri.is_a?(URI)
    host = uri.host
    @parsed[host] ||= ParsedRobots.new(uri, @user_agent)
    @parsed[host].other_values
  end
end
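The `to_regex` helper above is the heart of rule matching: it regex-escapes the robots.txt pattern, maps `*` wildcards to `.*`, and anchors the result at the start of the path. A standalone sketch of that translation (the `robots_pattern_to_regex` name is illustrative, not part of the gem's API):

```ruby
# Mirrors the translation done by ParsedRobots#to_regex above:
# escape the pattern, turn "*" wildcards into ".*", anchor at the start.
def robots_pattern_to_regex(pattern)
  # An empty pattern (e.g. "Disallow:") must match nothing.
  return /should-not-match-anything-123456789/ if pattern.strip.empty?
  escaped = Regexp.escape(pattern).gsub(Regexp.escape("*"), ".*")
  Regexp.new("^#{escaped}")
end

rule = robots_pattern_to_regex("/private/*/tmp")
rule.match?("/private/abc/tmp") # matches
rule.match?("/public/abc/tmp")  # does not match: anchored at path start
```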
55 changes: 55 additions & 0 deletions lib/gemcache/ruby/1.9.1/gems/robots-0.10.1/robots.gemspec
@@ -0,0 +1,55 @@
# Generated by jeweler
# DO NOT EDIT THIS FILE DIRECTLY
# Instead, edit Jeweler::Tasks in Rakefile, and run the gemspec command
# -*- encoding: utf-8 -*-

Gem::Specification.new do |s|
  s.name = %q{robots}
  s.version = "0.10.1"

  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
  s.authors = ["Kyle Maxwell"]
  s.date = %q{2011-04-12}
  s.description = %q{It parses robots.txt files}
  s.email = %q{kyle@kylemaxwell.com}
  s.extra_rdoc_files = [
    "README"
  ]
  s.files = [
    ".gitignore",
    "CHANGELOG",
    "README",
    "Rakefile",
    "VERSION",
    "lib/robots.rb",
    "robots.gemspec",
    "test/fixtures/emptyish.txt",
    "test/fixtures/eventbrite.txt",
    "test/fixtures/google.txt",
    "test/fixtures/reddit.txt",
    "test/fixtures/yelp.txt",
    "test/test_robots.rb"
  ]
  s.homepage = %q{http://github.com/fizx/robots}
  s.rdoc_options = ["--charset=UTF-8"]
  s.require_paths = ["lib"]
  s.rubygems_version = %q{1.3.6}
  s.summary = %q{Simple robots.txt parser}
  s.test_files = [
    "test/test_robots.rb"
  ]

  if s.respond_to? :specification_version then
    current_version = Gem::Specification::CURRENT_SPECIFICATION_VERSION
    s.specification_version = 3

    if Gem::Version.new(Gem::RubyGemsVersion) >= Gem::Version.new('1.2.0') then
      s.add_development_dependency(%q<thoughtbot-shoulda>, [">= 0"])
    else
      s.add_dependency(%q<thoughtbot-shoulda>, [">= 0"])
    end
  else
    s.add_dependency(%q<thoughtbot-shoulda>, [">= 0"])
  end
end

@@ -0,0 +1,2 @@
User-agent: *
Disallow:
