Just catch the blithering exception...

commit 2638c92642127447f09b3cb1706b83dfabb6c52d (1 parent: b8586b5)
ConradIrwin authored
Showing with 10 additions and 10 deletions.
  1. +4 −10 lib/robotstxt/parser.rb
  2. +6 −0 test/parser_test.rb
lib/robotstxt/parser.rb
@@ -164,17 +164,11 @@ def match_path_glob(path, glob)
 
     glob = normalize_percent_encoding(glob)
     path = normalize_percent_encoding(path)
 
-    # NOTE: We're using non-unicode regular expressions because a small
-    # portion of the internet has robots.txt files which contain invalid
-    # encodings. Even if you try to convert the file to utf-8 first, you'll
-    # miss urls with %-encoded routes.
-    #
-    # More work is needed to determine whether that's because they're using
-    # non-utf8 urls (in which case a binary pattern match is probably desirable)
-    # or whether it's just misconfiguration, in which case it doesn't matter
-    # what we do.
-    path =~ Regexp.new("^" + reify(glob) + end_marker, "n")
+    path =~ Regexp.new("^" + reify(glob) + end_marker)
+  # Some people encode bad UTF-8 in their robots.txt files, let us not behave badly.
+  rescue RegexpError
+    false
   end
 
   # As a general rule, we want to ignore different representations of the
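
For context on the rescue, here is a minimal sketch (plain Ruby, not part of this commit) of the failure it guards against. The %-decoded bytes from a disallow rule like the one in the new test are not valid UTF-8, and Regexp.new raises RegexpError when asked to compile them as a UTF-8 pattern; the deleted "n"-flag approach tried to sidestep that by compiling the pattern as binary instead.

require 'cgi'

# The %-decoded disallow path from the new test below; the decoded
# bytes do not form a valid UTF-8 sequence.
glob = CGI.unescape("/?id=%C3%CB%D1%CA%A4%C5%D4%BB%C7%D5%B4%D5%E2%CD")
glob.valid_encoding?  #=> false

begin
  Regexp.new("^" + glob)
rescue RegexpError => e
  # This is the branch match_path_glob now takes: treat the glob as
  # matching nothing rather than propagating the error.
  puts e.message  # e.g. "invalid multibyte character: ..."
end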
test/parser_test.rb
@@ -2,6 +2,7 @@
 
 require 'test/unit'
 require 'robotstxt'
+require 'cgi'
 
 class TestParser < Test::Unit::TestCase
 
@@ -101,4 +102,9 @@ def test_strange_newlines
     assert false === Robotstxt::Parser.new("Google", robotstxt).allowed?("/index/wold")
   end
 
+  def test_bad_unicode
+    robotstxt = "User-agent: *\ndisallow: /?id=%C3%CB%D1%CA%A4%C5%D4%BB%C7%D5%B4%D5%E2%CD\n"
+    assert true === Robotstxt::Parser.new("Google", robotstxt).allowed?("/index/wold")
+  end
+
 end
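
A usage sketch of the observable change (assuming nothing beyond the API the test above exercises): allowed? no longer blows up on a robots.txt containing such a rule.

require 'robotstxt'

robotstxt = "User-agent: *\ndisallow: /?id=%C3%CB%D1%CA%A4%C5%D4%BB%C7%D5%B4%D5%E2%CD\n"
parser = Robotstxt::Parser.new("Google", robotstxt)

# Before this commit this call raised RegexpError; now the malformed
# glob simply matches nothing, so the URL is allowed.
parser.allowed?("/index/wold")  #=> true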