Refactoring the RobotstxtServer module. As far as I could see, the old RobotstxtServer approach did not follow any particular robots.txt "standard", nor did it match what, for example, Google does. Whether a rule applied was decided by checking if the user-agent matched or contained the string specified in the robots.txt file, or if the robots.txt file listed a wildcard user-agent.
The general practice, however, is that only one set of directives from the entire file is applied: the one with the most specific User-agent string. For example, if robots.txt lists both "Googlebot" and "Googlebot-news", then Googlebot-news will only obey the rules in the "Googlebot-news" section; only if that section is absent will it obey the "Googlebot" section.
The new RobotstxtParser now uses this same approach.
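For illustration, here is a minimal sketch of how most-specific-match selection could work; the class and method names below are hypothetical and not the actual crawler4j API:

```java
import java.util.Map;

// Hypothetical sketch: pick the directive group whose user-agent token is the
// longest substring of our own user agent, falling back to the "*" group.
class UserAgentMatcher {

    static <T> T selectGroup(Map<String, T> groupsByAgent, String ourUserAgent) {
        String agent = ourUserAgent.toLowerCase();
        T best = null;
        int bestLength = -1;
        for (Map.Entry<String, T> entry : groupsByAgent.entrySet()) {
            String token = entry.getKey().toLowerCase();
            // Only groups whose token occurs in our user agent are candidates;
            // the longest (most specific) token wins.
            if (agent.contains(token) && token.length() > bestLength) {
                best = entry.getValue();
                bestLength = token.length();
            }
        }
        // Fall back to the wildcard group if no named group matched.
        return best != null ? best : groupsByAgent.get("*");
    }
}
```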
Additionally, path parsing did not work correctly. Major crawlers allow wildcards mid-path; for example, /product/*/shops.html would block the shop page of all products. The old RobotstxtParser did not support this syntax. A new PathRule class has been added that parses such patterns and converts them into a regular expression. This functionality is also covered by a new unit test.
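As a rough illustration of the idea (the actual PathRule in the patch may differ), a wildcard path rule can be translated into a regex like this:

```java
import java.util.regex.Pattern;

// Illustrative sketch of wildcard path matching: '*' matches any sequence of
// characters, a trailing '$' anchors the end of the path.
class WildcardPath {

    static Pattern compile(String rule) {
        StringBuilder regex = new StringBuilder("^");
        for (int i = 0; i < rule.length(); i++) {
            char c = rule.charAt(i);
            if (c == '*') {
                regex.append(".*");
            } else if (c == '$' && i == rule.length() - 1) {
                regex.append('$');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    public static void main(String[] args) {
        Pattern p = compile("/product/*/shops.html");
        System.out.println(p.matcher("/product/42/shops.html").find()); // true
        System.out.println(p.matcher("/product/42/info.html").find());  // false
    }
}
```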
A third issue that has been resolved is that when robots.txt was requested for an HTTPS host, it was still fetched from that host over HTTP, while it should of course be fetched over HTTPS. In addition, a redirect would cause the RobotstxtServer to give up; it now follows up to three redirects before giving up. These problems led to complaints from websites I have been crawling, because the crawler failed to retrieve the correct robots.txt file and therefore crawled every link it encountered.
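A minimal sketch of the intended fetching behaviour, assuming plain HttpURLConnection rather than the HTTP client crawler4j actually uses: keep the host's original scheme and follow at most three redirects.

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Rough sketch, not the patched crawler4j code: fetch robots.txt with the same
// scheme as the host and follow at most three redirects before giving up.
class RobotstxtFetchSketch {

    static InputStream fetchRobotsTxt(URL hostUrl) throws Exception {
        // Keep the original scheme, so an HTTPS host is asked over HTTPS.
        URL target = new URL(hostUrl.getProtocol(), hostUrl.getHost(),
                             hostUrl.getPort(), "/robots.txt");
        for (int redirects = 0; redirects <= 3; redirects++) {
            HttpURLConnection conn = (HttpURLConnection) target.openConnection();
            conn.setInstanceFollowRedirects(false);
            int status = conn.getResponseCode();
            if (status >= 300 && status < 400) {
                String location = conn.getHeaderField("Location");
                if (location == null) {
                    break;
                }
                // Resolve relative redirect targets against the current URL.
                target = new URL(target, location);
                continue;
            }
            if (status == 200) {
                return conn.getInputStream();
            }
            break;
        }
        return null; // treated as "no robots.txt" after failure or too many redirects
    }
}
```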
Please do ask if anything is unclear or incorrect about these patches; I'd be more than happy to adjust them to better fit or to fix issues.