Second level domains and HTTP redirects #27

Open · dmi3kno opened this issue Jul 26, 2018 · 15 comments

dmi3kno commented Jul 26, 2018

I wonder how robotstxt deals with redirects:

# downloads html, signals a warning
rt <- robotstxt::get_robotstxt("github.io")
#> <!DOCTYPE html>
#> <html lang="en">
#> [...]
#> Warning message:
#> In robotstxt::get_robotstxt("github.io") :
#>   get_robotstxt(): github.io; Not valid robots.txt.

robotstxt::is_valid_robotstxt(rt)
#> [1] TRUE

# what's going on? 
httr::GET("github.io")
#> Response [https://pages.github.com/]
#>  Date: 2018-07-26 20:54
#>  ...

# this one is also valid
rt <- robotstxt::get_robotstxt("pages.github.com")
robotstxt::is_valid_robotstxt(rt)
#> [1] TRUE

# so what did we get?
print(rt)
#> Sitemap: https://pages.github.com/sitemap.xml

A couple of questions:

  • How do we handle redirects? A simple GET would indicate that we're looking in the wrong place. Should we warn the user and fetch the redirected robots.txt instead? Either way, that would be better than returning a rendered HTML page as if it were a robots.txt... (see the sketch after this list)
  • What happens if robots.txt is not delegated to the second-level domain? Clearly github.com has a perfectly proper robots.txt of its own, but none of the second-level domains has a properly populated file (I checked pages.github.com, education.github.com, blog.github.com, and raw.github.com; gist.github.com is the exception). In my view, a file is not valid if it does not contain a single rule. It is not necessarily corrupt - it just does not exist in any practical sense. Shouldn't we, again, warn the user and return the robots.txt from the first-level domain instead?
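A rough illustration of the redirect-detection idea (plain httr, not robotstxt's internals): compare the host we asked for with the host that finally answered.

resp  <- httr::GET("http://github.io/robots.txt")
asked <- "github.io"
final <- httr::parse_url(resp$url)$hostname  # e.g. "pages.github.com"
if (!identical(asked, final)) {
  warning("robots.txt for ", asked, " was served from ", final, " after a redirect")
}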

In fact, raw.github.com is explicitly disallowed,

robotstxt::paths_allowed(paths="raw", domain="github.com", bot="*")
#> [1] FALSE

but it is written in a slightly unusual way: there is no github.com/raw path, so the rule in the main robots.txt actually refers to the second-level domain.

httr::GET("https://github.com/raw")
#> Response [https://github.com/raw]
#>   Date: 2018-07-26 21:43
#>   Status: 404
#> [...]

petermeissner (Collaborator) commented Aug 22, 2018

Thanks for writing in ... just to let you know that I will come back to this ... but it might take some more days.

petermeissner (Collaborator) commented Sep 3, 2018

Okay, okay, okay ... so many questions ...

First, we should make clear that these examples raise quite a number of distinct questions.

Second, those questions might have answers that are completely unrelated to one another.

Third, we should single out separate questions and tackle them one by one.

Question 1: How to handle redirects?

Example: github.io/robots.txt

Requesting github.io/robots.txt results in a redirect to pages.github.com, which is not a robots.txt file. At the same time, pages.github.com/robots.txt does exist and is a valid robots.txt file.

rt <- robotstxt::get_robotstxt("github.io")

Question 2: How to handle subdomains?

  • Does it make a difference if robots.txt files come from a subdomain - e.g. pages.github.com?
  • Are robots.txt files allowed for subdomains?

Question 3: What are valid robots.txt files?

Example: github.io/robots.txt

Requesting github.io/robots.txt does not return a valid robots.txt but (after a redirect to pages.github.com) an HTML page.

Question ... these are just the first couple of questions to get the conversation going; to be continued very soon.

dmi3kno commented Sep 3, 2018

Question 2: Just to clarify: we're talking about real subdomains that exist without redirects, like pages.github.com. The file placed there is a perfectly valid robots.txt even though it is nearly empty (same as on blog.github.com). There's a similarly benign file on education.github.com.

Something interesting happens with raw.github.com/robots.txt: if accessed without specifying the file (i.e. raw.github.com), it redirects to the landing page; if the robots.txt file is specified, it redirects to (link) a file with an error message.

Conclusion: subdomain robots.txt files exist and are sometimes even valid. In any case, they should be. (link, link)

petermeissner (Collaborator) commented Sep 4, 2018

Question 2: Yes, subdomains are okay; robots.txt files can be expected for each subdomain. -- closed

petermeissner (Collaborator) commented Sep 4, 2018

Question 1

in general

  • The package will follow redirects (as does my browser, as does googlebot, as is common web standard).
  • Redirects should be ok.

particular example with github.io

  • The request for https://github.io/robots.txt is redirected to https://pages.github.com/ but not to https://pages.github.com/robots.txt.
  • One could argue that here GitHub fails to make a proper redirect and thus everything is allowed.
  • One could also argue that github.io simply is not a valid domain (hence the redirect), and that after being redirected the package should notice the new domain and try to get a robots.txt file for that domain instead ... Phew, this leads to more and more need for discussion (e.g. if you request files from github.io, should they then use the robots.txt file from pages.github.com, or only if you are redirected to the other domain?). A minimal sketch of that option follows below.
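To make the second option concrete, here is a minimal sketch (plain httr plus the existing get_robotstxt(); this is not the package's current behaviour): if the final host differs from the requested one, try that host's robots.txt instead.

resp       <- httr::GET("https://github.io/robots.txt")
final_host <- httr::parse_url(resp$url)$hostname
if (!identical(final_host, "github.io")) {
  # e.g. final_host is "pages.github.com", so look for that domain's robots.txt
  rt <- robotstxt::get_robotstxt(final_host)
}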

dmi3kno commented Sep 4, 2018

I think we should be a bit more forward-leaning and not rely on the redirect taking us to a new robots.txt. A redirect only indicates a "change of scope" for the search for a valid robots.txt file. You get the new domain/subdomain and resume the normal search for robots.txt, just as you would have done if the user had initially said pages.github.com.

More importantly, we need to check for permissions BEFORE redirecting. Take, for example, raw files on GitHub. This section of github.com/robots.txt prohibits scraping of raw files for most, if not all, agents:

Disallow: /raw/*
Disallow: /*/raw/
Disallow: /*/*/raw/

So a call to https://github.com/dmi3kno/polite/raw/master/DESCRIPTION should be disallowed, and it is:

robotstxt::paths_allowed("https://github.com/dmi3kno/polite/raw/master/DESCRIPTION")
#>  github.com                      No encoding supplied: defaulting to UTF-8.
#> [1] FALSE

However, accessing this URL from the browser redirects you to https://raw.githubusercontent.com/dmi3kno/polite/master/DESCRIPTION. This subdomain (raw.githubusercontent.com) is missing a valid robots.txt, partly because it is not meant to be accessed directly.

Conclusions

  1. We should check for permission before following a redirect. If the path is disallowed, it matters very little where the redirect takes you (see the sketch after this list).
  2. If allowed, following the redirect is the responsible thing to do. A redirect indicates a change of scope, so we need to resume the search for robots.txt within the new [sub]domain space.
  3. An invalid robots.txt is often an indication that the host did not expect direct access to this subdomain. We should warn the user about that.
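A minimal sketch of point 1, using httr with curl's followlocation option switched off (just an illustration of the idea, not how robotstxt currently works):

url <- "https://github.com/dmi3kno/polite/raw/master/DESCRIPTION"
if (robotstxt::paths_allowed(url)) {
  resp <- httr::GET(url, httr::config(followlocation = 0L))  # do not auto-follow redirects
  loc  <- httr::headers(resp)[["location"]]
  if (httr::status_code(resp) %in% c(301L, 302L, 303L, 307L, 308L) && !is.null(loc)) {
    # the redirect indicates a change of scope: re-check permissions first
    if (robotstxt::paths_allowed(loc)) resp <- httr::GET(loc)
  }
}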

dmi3kno commented Sep 4, 2018

Question 3: Very little is required for a robots.txt to be considered "valid". This, however, does not mean the file is serving its purpose. Most of the subdomain robots.txt files on github.com are empty (they contain only a sitemap reference). education.github.com/robots.txt is just a placeholder that was never activated (rules are defined but commented out).

I think there's a need to distinguish the following possible outcomes:

  1. robots.txt is missing - returning 404 or otherwise
  2. robots.txt is present but not "full-fledged" - like we saw on pages.github.com. The criterion here is that we want at least one Allow or Disallow statement.
  3. robots.txt is present and well defined - this allows unambiguous resolution of path.
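As a rough sketch of that distinction (the function name and labels are just for illustration, not part of robotstxt):

classify_robotstxt <- function(rt_text) {
  # rt_text: the character string returned by robotstxt::get_robotstxt()
  if (length(rt_text) == 0 || !nzchar(rt_text)) return("missing")
  lines    <- unlist(strsplit(rt_text, "\n", fixed = TRUE))
  has_rule <- grepl("^[[:space:]]*(allow|disallow)[[:space:]]*:", lines, ignore.case = TRUE)
  if (any(has_rule)) "well defined" else "present but not full-fledged"
}

classify_robotstxt(robotstxt::get_robotstxt("pages.github.com"))
#> presumably "present but not full-fledged" (only a Sitemap line)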

Thoughts?

petermeissner (Collaborator) commented Sep 4, 2018

Question 3:

  1. very little is required ... yes, and if it's any kind of valid robots.txt file (including empty files and files that only have a sitemap) then that is exactly the rule set we should use - the rules are quite clear on that
  2. robots.txt is 404 / missing - no file, no restrictions
  3. robots.txt is present but not "full-fledged" - no rules, no restrictions (there are no formal requirements for robots.txt files to include any particular kind of field, not even that there should be at least one field)
  4. robots.txt is present and well defined - we follow the rules
  5. We do not get a valid robots.txt but some index.html ... I am not sure about that - we should probably reject its validity, which means the validity check has to be enhanced (see the sketch after this list)
  6. We get redirected - we should make another try if no valid robots.txt file has been returned
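Regarding point 5, a naive sketch of such an enhanced check (not the package's actual implementation):

looks_like_html <- function(rt_text) {
  # treat a response that starts like an HTML document as "not a robots.txt"
  grepl("<[[:space:]]*(!doctype|html)", substr(rt_text, 1, 200), ignore.case = TRUE)
}

looks_like_html(robotstxt::get_robotstxt("github.io"))   # presumably TRUE  (HTML page)
looks_like_html(robotstxt::get_robotstxt("github.com"))  # presumably FALSE (real robots.txt)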

dmi3kno commented Sep 4, 2018

Let's take some action towards rejecting non-robots.txt responses and consider Question 3 closed, then.

At least three of these scenarios will return some sort of error. Can we use some error handling inside paths_allowed to suppress red messages to the user? Generally speaking, paths_allowed calls get_robotstxt behind the scenes. Maybe we should have something like a verbose argument to reduce the amount of messaging that is returned to the user (regarding encoding and eventual 404 errors). I am using robotstxt inside a package and would appreciate a quiet console.
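For now, a wrapper along these lines would do (the verbose argument here is hypothetical and not part of the current API; the suppression is plain base R):

quietly_allowed <- function(..., verbose = FALSE) {
  if (verbose) return(robotstxt::paths_allowed(...))
  suppressMessages(suppressWarnings(robotstxt::paths_allowed(...)))
}

quietly_allowed("https://github.com/dmi3kno/polite/raw/master/DESCRIPTION")
#> [1] FALSE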

Please, let me know what you think about this message on redirects.

petermeissner (Collaborator) commented Sep 5, 2018

Some actions we can take ...

  • verbosity: #31
  • robots.txt file validity check: #32
  • redirects: #33

petermeissner (Collaborator) commented Sep 5, 2018

Please, let me know what you think about this message on redirects.

yes and no ....

  • permission checking should be done before requesting a resource
  • redirects should simply be followed (it's not really our decision; the server is asking us to look somewhere else, and we just do what we have been told; besides, implementing it otherwise would make things more complicated by something like a factor of 3 or 4)
  • "we need to resume the search for robots.txt" - yes, in the case of a redirect to another domain that returns an invalid robots.txt (in the case of a redirect without a domain change - no; in the case of a redirect to a valid robots.txt - no)
  • "Invalid robots.txt is often an indication that the host did not expect a direct access to this subdomain. We should warn the user about that." - Maybe, but we cannot operate on possible intentions. It's the internet: everyone has to assume everything gets accessed unless it is forbidden or protected. Server admins have to make sure that things are in the right place. Robots.txt files are a concept introduced so that dumb bots can handle them: a dumb bot cannot behave if it cannot find the rule set to follow.

petermeissner (Collaborator) commented Sep 5, 2018

... quoting Google a little more on subdomain robots.txt files:

This is the general case. It is not valid for other subdomains, protocols or port numbers. It is valid for all files in all subdirectories on the same host, protocol and port number. link

A robots.txt on a subdomain is only valid for that subdomain. link

... which basically implies that GitHub does not do a good job of keeping their robots.txt files in order - at least for their subdomains - but we cannot fix that.

matbmeijer commented Nov 7, 2018

Hi, since the discussion touches on URL redirects and, taking into account the performance penalty of GET, I believe you could use the HEAD method to check whether the URL itself redirects. This is much more lightweight than GET. Just adding this because I think it could be helpful. There is a great post about this on Stack Overflow.
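For example (plain httr, just to illustrate the idea):

resp <- httr::HEAD("http://github.io/robots.txt")  # headers only, no body is downloaded
httr::status_code(resp)  # final status after redirects
resp$url                 # final URL, e.g. "https://pages.github.com/"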

petermeissner (Collaborator) commented Nov 7, 2018

Thanks for the hint and the link.

@matbmeijer: Do you think that this will really make much of a difference - do you have an example? The difference should only be in the download time of the actual content returned - right? GET downloads the content while HEAD does not. The file retrieval is cached anyway, so it will only ever be done once per R session.

petermeissner (Collaborator) commented Nov 7, 2018

@matbmeijer: Also, I really have to get the file anyway (in most cases) - it is not just about checking for redirection.
