Error when parsing robots.txt with comments #263

Closed
iacore opened this issue Jun 23, 2022 · 6 comments


iacore commented Jun 23, 2022

How to reproduce:

> curl https://idris2.readthedocs.io/robots.txt
User-agent: *

Disallow: # Allow everything

Sitemap: https://idris2.readthedocs.io/sitemap.xml

> wget2 -r https://idris2.readthedocs.io/en/latest/
[1] Downloading 'https://idris2.readthedocs.io/robots.txt' ...
Saving 'idris2.readthedocs.io/robots.txt'
HTTP response 200  [https://idris2.readthedocs.io/robots.txt]
Adding URL: https://idris2.readthedocs.io/sitemap.xml
URL 'https://idris2.readthedocs.io/sitemap.xml' not followed (disallowed by robots.txt)
[1] Downloading 'https://idris2.readthedocs.io/en/latest/' ...
Saving 'idris2.readthedocs.io/en/latest/index.html'
HTTP response 200  [https://idris2.readthedocs.io/en/latest/]
URI content encoding = 'utf-8' (set by document)
Adding URL: https://idris2.readthedocs.io/en/latest/_static/pygments.css
URL 'https://idris2.readthedocs.io/en/latest/_static/pygments.css' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/_static/css/theme.css
URL 'https://idris2.readthedocs.io/en/latest/_static/css/theme.css' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/index.html
URL 'https://idris2.readthedocs.io/en/latest/index.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/_static/documentation_options.js
URL 'https://idris2.readthedocs.io/en/latest/_static/documentation_options.js' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/_static/jquery.js
URL 'https://idris2.readthedocs.io/en/latest/_static/jquery.js' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/_static/underscore.js
URL 'https://idris2.readthedocs.io/en/latest/_static/underscore.js' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/_static/_sphinx_javascript_frameworks_compat.js
URL 'https://idris2.readthedocs.io/en/latest/_static/_sphinx_javascript_frameworks_compat.js' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/_static/doctools.js
URL 'https://idris2.readthedocs.io/en/latest/_static/doctools.js' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/_/static/javascript/readthedocs-doc-embed.js
URL 'https://idris2.readthedocs.io/_/static/javascript/readthedocs-doc-embed.js' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/_static/js/theme.js
URL 'https://idris2.readthedocs.io/en/latest/_static/js/theme.js' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/genindex.html
URL 'https://idris2.readthedocs.io/en/latest/genindex.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/search.html
URL 'https://idris2.readthedocs.io/en/latest/search.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/tutorial/index.html
URL 'https://idris2.readthedocs.io/en/latest/tutorial/index.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/_/static/css/readthedocs-doc-embed.css
URL 'https://idris2.readthedocs.io/_/static/css/readthedocs-doc-embed.css' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/_/static/javascript/readthedocs-analytics.js
URL 'https://idris2.readthedocs.io/_/static/javascript/readthedocs-analytics.js' not followed (disallowed by robots.txt)
URL 'search.html' not followed (action/formaction attribute)
Adding URL: https://idris2.readthedocs.io/en/latest/faq/faq.html
URL 'https://idris2.readthedocs.io/en/latest/faq/faq.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/backends/index.html
URL 'https://idris2.readthedocs.io/en/latest/backends/index.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/updates/updates.html
URL 'https://idris2.readthedocs.io/en/latest/updates/updates.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/typedd/typedd.html
URL 'https://idris2.readthedocs.io/en/latest/typedd/typedd.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/reference/packages.html
URL 'https://idris2.readthedocs.io/en/latest/reference/packages.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/libraries/index.html
URL 'https://idris2.readthedocs.io/en/latest/libraries/index.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/app/index.html
URL 'https://idris2.readthedocs.io/en/latest/app/index.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/ffi/index.html
URL 'https://idris2.readthedocs.io/en/latest/ffi/index.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/proofs/index.html
URL 'https://idris2.readthedocs.io/en/latest/proofs/index.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/implementation/index.html
URL 'https://idris2.readthedocs.io/en/latest/implementation/index.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/reference/index.html
URL 'https://idris2.readthedocs.io/en/latest/reference/index.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/cookbook/index.html
URL 'https://idris2.readthedocs.io/en/latest/cookbook/index.html' not followed (disallowed by robots.txt)
Adding URL: https://github.com/idris-lang/Idris2/blob/main/docs/source/index.rst
URL 'https://github.com/idris-lang/Idris2/blob/main/docs/source/index.rst' not followed (no host-spanning requested)
Adding URL: https://creativecommons.org/publicdomain/zero/1.0/
URL 'https://creativecommons.org/publicdomain/zero/1.0/' not followed (no host-spanning requested)
Adding URL: https://www.sphinx-doc.org/
URL 'https://www.sphinx-doc.org/' not followed (no host-spanning requested)
Adding URL: https://github.com/readthedocs/sphinx_rtd_theme
URL 'https://github.com/readthedocs/sphinx_rtd_theme' not followed (no host-spanning requested)
Adding URL: https://readthedocs.org
URL 'https://readthedocs.org' not followed (no host-spanning requested)
Adding URL: https://idris2.readthedocs.io/en/latest/
Adding URL: https://idris2.readthedocs.io/en/stable/
URL 'https://idris2.readthedocs.io/en/stable/' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/_/downloads/en/latest/pdf/
URL 'https://idris2.readthedocs.io/_/downloads/en/latest/pdf/' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/_/downloads/en/latest/htmlzip/
URL 'https://idris2.readthedocs.io/_/downloads/en/latest/htmlzip/' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/_/downloads/en/latest/epub/
URL 'https://idris2.readthedocs.io/_/downloads/en/latest/epub/' not followed (disallowed by robots.txt)
Adding URL: https://readthedocs.org/projects/idris2/?fromdocs=idris2
URL 'https://readthedocs.org/projects/idris2/?fromdocs=idris2' not followed (no host-spanning requested)
Adding URL: https://readthedocs.org/builds/idris2/?fromdocs=idris2
URL 'https://readthedocs.org/builds/idris2/?fromdocs=idris2' not followed (no host-spanning requested)
Downloaded: 2 files, 3.00K bytes, 0 redirects, 0 errors
rootkea (Contributor) commented Jun 24, 2022

Looking at wget_robots_parse() in libwget/robots.c, the implementation doesn't seem correct. For example, with the robots.txt below, wget2 should be disallowed from everything, yet it downloads the site anyway:

$ cat /var/www/html/robots.txt 
User-agent: wget2
User-agent: foo
Disallow: /
$ src/wget2_noinstall -r http://127.0.0.1
3 files              100% [=======================================================================>]     199     --.-KB/s
2 files              100% [=======================================================================>]  846.46K    --.-KB/s
                          [Files: 5  Bytes: 846.66K [33.07MB/s] Redirects: 0  Todo: 0  Errors: 0   ]
$

I'll see if I can submit a patch fixing robots.txt parsing (including # comments) by EOD.

rockdaboot (Owner) commented:

@rootkea Did you find the time? IMO the best place to start is adding a test to test_robots() in unit-tests/test.c.

When I implemented the parser, there was no real standard. So I had to review many of these files and come up with something that works.

Now, some people at Google created an RFC draft. I guess it won't change much until it is finalized, so we can create a new parser.

See https://datatracker.ietf.org/doc/draft-koster-rep/

rootkea (Contributor) commented Jul 2, 2022

I do have a working parser based on https://www.robotstxt.org/orig.html#format (covering all the edge cases), but the code is not 'beautiful'. I did look at Google's RFC but decided to set it aside since it's only a draft for now. Maybe we should accommodate it if and when it becomes the standard?

Anyway, I don't think I can push the final, cleaned-up parser until Monday. Sorry!

rockdaboot (Owner) commented:

No hurry.

It might be a good time to get fuzzing running against your new code (fuzz/libwget_robots_parse_fuzzer.c). Details are in fuzz/README.md; it might need some polishing, though. Let me know if you get stuck.

rockdaboot (Owner) commented:

Btw, I looked at the code and a) found an issue in the robots tests themselves, and b) found that this little patch fixes the comments issue:

diff --git a/libwget/robots.c b/libwget/robots.c
index fe8d589e1..faa14e96c 100644
--- a/libwget/robots.c
+++ b/libwget/robots.c
@@ -99,7 +99,7 @@ int wget_robots_parse(wget_robots **_robots, const char *data, const char *clien
                }
                else if (collect == 1 && !wget_strncasecmp_ascii(data, "Disallow:", 9)) {
                        for (data += 9; *data == ' ' || *data == '\t'; data++);
-                       if (*data == '\r' || *data == '\n' || !*data) {
+                       if (*data == '\r' || *data == '\n' || *data == '#' || !*data) {
                                // all allowed
                                wget_vector_free(&robots->paths);
                                collect = 2;

So I fixed the tests, added one more test and pushed the fix.

That does not mean your parser is obsolete. It might be better (more readable, faster, more standards-compliant, ...) than the existing one. But I'd say let's not put too much time into this; there are other, more important things to do.

rockdaboot (Owner) commented:

I opened a follow-up issue to check the parser: https://gitlab.com/gnuwget/wget2/-/issues/607.

@rootkea Please feel free to assign it to yourself.
