Update of Robotstxt module #78

Closed
wants to merge 1 commit into from

Conversation

@EgbertW (Contributor) commented Jul 8, 2015

Refactoring of the RobotstxtServer module. As far as I could see, the RobotstxtServer approach did not match any specific robots.txt "standard", and it also did not match what, for example, Google is using. Checking whether a rule matches was done by checking if the user agent matches or contains the part specified in the robots.txt file, or if the robots.txt file mentions a wildcard as user agent.
However, the general tendency seems to be that from the entire file of directives, just one set of directives is applied: the one with the most specific User-agent string. For example, if robots.txt lists "Googlebot" and "Googlebot-news", then Googlebot-news will only obey the rules in the Googlebot-news section. If that section is absent, it will obey the "Googlebot" section.

The new RobotstxtParser now uses this same approach.
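
As a rough illustration of the "most specific User-agent wins" selection described above; the class and method names here are hypothetical and only sketch the idea, they are not the actual crawler4j API:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Sketch of "only the most specific User-agent section applies".
// The parsed robots.txt is simplified to a Map of user-agent token -> path rules;
// all names here are illustrative, not the real RobotstxtParser structures.
public class UserAgentSelectionSketch {

    static List<String> selectRules(Map<String, List<String>> sections, String myUserAgent) {
        String ua = myUserAgent.toLowerCase();
        String bestAgent = null;
        for (String agent : sections.keySet()) {
            // A section applies if our user agent contains its token;
            // among applicable sections, the longest (most specific) token wins.
            if (!agent.equals("*") && ua.contains(agent.toLowerCase())
                    && (bestAgent == null || agent.length() > bestAgent.length())) {
                bestAgent = agent;
            }
        }
        if (bestAgent == null && sections.containsKey("*")) {
            bestAgent = "*"; // no named section matched: fall back to the wildcard section
        }
        return bestAgent == null ? Collections.emptyList() : sections.get(bestAgent);
    }
}
```

With sections for "Googlebot" and "Googlebot-news", a user agent containing "Googlebot-news" would receive only the Googlebot-news rules, falling back to the "Googlebot" section only if the more specific one is absent.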

Additionally, path parsing did not work correctly. Major crawlers allow the use of wildcards mid-path. For example, /product/*/shops.html would block the shop page of all products. The old RobotstxtParser did not allow for this syntax. A new PathRule class has been added that parses this syntax and converts it into a regexp pattern. This functionality is also covered by a new unit test.
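
A minimal sketch of the kind of wildcard-to-regex translation a class like PathRule performs; the names and handling here are illustrative only (end-of-URL "$" anchoring is omitted for brevity):

```java
import java.util.regex.Pattern;

// Hypothetical sketch of turning a robots.txt path rule with mid-path
// wildcards (e.g. "/product/*/shops.html") into a regular expression.
// This is not the actual PathRule implementation.
public class PathRuleSketch {

    /** Compiles a robots.txt path pattern into a Pattern matching URL paths. */
    static Pattern compile(String rule) {
        StringBuilder regex = new StringBuilder("^");
        StringBuilder literal = new StringBuilder();
        for (char c : rule.toCharArray()) {
            if (c == '*') {
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString())); // flush literal segment
                    literal.setLength(0);
                }
                regex.append(".*"); // mid-path wildcard: any run of characters
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString());
    }

    public static void main(String[] args) {
        Pattern p = compile("/product/*/shops.html");
        System.out.println(p.matcher("/product/42/shops.html").find());   // true: blocked
        System.out.println(p.matcher("/product/42/details.html").find()); // false: allowed
    }
}
```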

A third issue that has been resolved is that when robots.txt was requested for an HTTPS host, it would still be requested from that host over HTTP, while this should of course be HTTPS. In addition, redirects would cause the RobotstxtServer to give up. Now, it allows up to three redirects before giving up. These issues led to various complaints from websites I've been crawling, because the crawler would fail to retrieve the correct robots.txt file and therefore crawled all links it encountered.
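
The fetch behaviour described above could look roughly like the following sketch, which reuses the host's own scheme and follows at most three redirects; the method name and error handling are hypothetical, this is not the actual RobotstxtServer code:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch: fetch robots.txt using the page's own protocol (HTTP or HTTPS)
// and follow up to three redirects before giving up.
public class RobotsFetchSketch {

    static String fetchRobotsTxt(URL pageUrl) throws IOException {
        // Reuse the page's protocol instead of hard-coding "http".
        URL robotsUrl = new URL(pageUrl.getProtocol(), pageUrl.getHost(),
                                pageUrl.getPort(), "/robots.txt");
        int redirectsLeft = 3;
        while (true) {
            HttpURLConnection conn = (HttpURLConnection) robotsUrl.openConnection();
            conn.setInstanceFollowRedirects(false); // handle redirects ourselves
            int status = conn.getResponseCode();
            if (status >= 300 && status < 400 && redirectsLeft-- > 0) {
                String location = conn.getHeaderField("Location");
                if (location == null) {
                    return null; // malformed redirect: give up
                }
                robotsUrl = new URL(robotsUrl, location); // resolve relative redirects
                continue;
            }
            if (status != HttpURLConnection.HTTP_OK) {
                return null; // missing or unreachable robots.txt
            }
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }
}
```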

Please do ask if anything is unclear or incorrect about these patches; I'd be more than happy to adjust them to better fit or to fix issues.

@rzo1 (Contributor) commented Dec 29, 2015

This should be the solution to #107. +1

@s17t mentioned this pull request Nov 30, 2016
@rzo1 (Contributor) commented Dec 15, 2016

@s17t Maybe you can check this pull request. I think this is a very good contribution that enhances the current robots.txt parsing mechanism.

@s17t self-assigned this Dec 15, 2016
s17t added commits that referenced this pull request Dec 28, 2016
* MadEgg-robotstxt:
  #78, Resolve conflicts.
  #78, Resolve conflicts.
@s17t (Contributor) commented Dec 28, 2016

Thank you, merged in f41cf19

@s17t closed this Dec 28, 2016
@s17t modified the milestone: 4.3 Feb 23, 2017