Error when parsing robots.txt with comments #263

Closed
iacore opened this issue Jun 23, 2022 · 6 comments


iacore commented Jun 23, 2022

How to reproduce:

> curl https://idris2.readthedocs.io/robots.txt
User-agent: *

Disallow: # Allow everything

Sitemap: https://idris2.readthedocs.io/sitemap.xml

> wget2 -r https://idris2.readthedocs.io/en/latest/
[1] Downloading 'https://idris2.readthedocs.io/robots.txt' ...
Saving 'idris2.readthedocs.io/robots.txt'
HTTP response 200  [https://idris2.readthedocs.io/robots.txt]
Adding URL: https://idris2.readthedocs.io/sitemap.xml
URL 'https://idris2.readthedocs.io/sitemap.xml' not followed (disallowed by robots.txt)
[1] Downloading 'https://idris2.readthedocs.io/en/latest/' ...
Saving 'idris2.readthedocs.io/en/latest/index.html'
HTTP response 200  [https://idris2.readthedocs.io/en/latest/]
URI content encoding = 'utf-8' (set by document)
Adding URL: https://idris2.readthedocs.io/en/latest/_static/pygments.css
URL 'https://idris2.readthedocs.io/en/latest/_static/pygments.css' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/_static/css/theme.css
URL 'https://idris2.readthedocs.io/en/latest/_static/css/theme.css' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/index.html
URL 'https://idris2.readthedocs.io/en/latest/index.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/_static/documentation_options.js
URL 'https://idris2.readthedocs.io/en/latest/_static/documentation_options.js' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/_static/jquery.js
URL 'https://idris2.readthedocs.io/en/latest/_static/jquery.js' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/_static/underscore.js
URL 'https://idris2.readthedocs.io/en/latest/_static/underscore.js' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/_static/_sphinx_javascript_frameworks_compat.js
URL 'https://idris2.readthedocs.io/en/latest/_static/_sphinx_javascript_frameworks_compat.js' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/_static/doctools.js
URL 'https://idris2.readthedocs.io/en/latest/_static/doctools.js' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/_/static/javascript/readthedocs-doc-embed.js
URL 'https://idris2.readthedocs.io/_/static/javascript/readthedocs-doc-embed.js' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/_static/js/theme.js
URL 'https://idris2.readthedocs.io/en/latest/_static/js/theme.js' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/genindex.html
URL 'https://idris2.readthedocs.io/en/latest/genindex.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/search.html
URL 'https://idris2.readthedocs.io/en/latest/search.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/tutorial/index.html
URL 'https://idris2.readthedocs.io/en/latest/tutorial/index.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/_/static/css/readthedocs-doc-embed.css
URL 'https://idris2.readthedocs.io/_/static/css/readthedocs-doc-embed.css' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/_/static/javascript/readthedocs-analytics.js
URL 'https://idris2.readthedocs.io/_/static/javascript/readthedocs-analytics.js' not followed (disallowed by robots.txt)
URL 'search.html' not followed (action/formaction attribute)
Adding URL: https://idris2.readthedocs.io/en/latest/faq/faq.html
URL 'https://idris2.readthedocs.io/en/latest/faq/faq.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/backends/index.html
URL 'https://idris2.readthedocs.io/en/latest/backends/index.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/updates/updates.html
URL 'https://idris2.readthedocs.io/en/latest/updates/updates.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/typedd/typedd.html
URL 'https://idris2.readthedocs.io/en/latest/typedd/typedd.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/reference/packages.html
URL 'https://idris2.readthedocs.io/en/latest/reference/packages.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/libraries/index.html
URL 'https://idris2.readthedocs.io/en/latest/libraries/index.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/app/index.html
URL 'https://idris2.readthedocs.io/en/latest/app/index.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/ffi/index.html
URL 'https://idris2.readthedocs.io/en/latest/ffi/index.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/proofs/index.html
URL 'https://idris2.readthedocs.io/en/latest/proofs/index.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/implementation/index.html
URL 'https://idris2.readthedocs.io/en/latest/implementation/index.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/reference/index.html
URL 'https://idris2.readthedocs.io/en/latest/reference/index.html' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/en/latest/cookbook/index.html
URL 'https://idris2.readthedocs.io/en/latest/cookbook/index.html' not followed (disallowed by robots.txt)
Adding URL: https://github.com/idris-lang/Idris2/blob/main/docs/source/index.rst
URL 'https://github.com/idris-lang/Idris2/blob/main/docs/source/index.rst' not followed (no host-spanning requested)
Adding URL: https://creativecommons.org/publicdomain/zero/1.0/
URL 'https://creativecommons.org/publicdomain/zero/1.0/' not followed (no host-spanning requested)
Adding URL: https://www.sphinx-doc.org/
URL 'https://www.sphinx-doc.org/' not followed (no host-spanning requested)
Adding URL: https://github.com/readthedocs/sphinx_rtd_theme
URL 'https://github.com/readthedocs/sphinx_rtd_theme' not followed (no host-spanning requested)
Adding URL: https://readthedocs.org
URL 'https://readthedocs.org' not followed (no host-spanning requested)
Adding URL: https://idris2.readthedocs.io/en/latest/
Adding URL: https://idris2.readthedocs.io/en/stable/
URL 'https://idris2.readthedocs.io/en/stable/' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/_/downloads/en/latest/pdf/
URL 'https://idris2.readthedocs.io/_/downloads/en/latest/pdf/' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/_/downloads/en/latest/htmlzip/
URL 'https://idris2.readthedocs.io/_/downloads/en/latest/htmlzip/' not followed (disallowed by robots.txt)
Adding URL: https://idris2.readthedocs.io/_/downloads/en/latest/epub/
URL 'https://idris2.readthedocs.io/_/downloads/en/latest/epub/' not followed (disallowed by robots.txt)
Adding URL: https://readthedocs.org/projects/idris2/?fromdocs=idris2
URL 'https://readthedocs.org/projects/idris2/?fromdocs=idris2' not followed (no host-spanning requested)
Adding URL: https://readthedocs.org/builds/idris2/?fromdocs=idris2
URL 'https://readthedocs.org/builds/idris2/?fromdocs=idris2' not followed (no host-spanning requested)
Downloaded: 2 files, 3.00K bytes, 0 redirects, 0 errors
rootkea (Contributor) commented Jun 24, 2022

Looking at wget_robots_parse() in libwget/robots.c, the implementation doesn't seem correct. For example, with the robots.txt below, wget2 should be disallowed from everything, yet it downloads the site anyway:

$ cat /var/www/html/robots.txt 
User-agent: wget2
User-agent: foo
Disallow: /
$ src/wget2_noinstall -r http://127.0.0.1
3 files              100% [=======================================================================>]     199     --.-KB/s
2 files              100% [=======================================================================>]  846.46K    --.-KB/s
                          [Files: 5  Bytes: 846.66K [33.07MB/s] Redirects: 0  Todo: 0  Errors: 0   ]
$

I'll see if I can submit a patch fixing robots.txt parsing (including # comments) by EOD.

rockdaboot (Owner) commented:

@rootkea Did you find the time? IMO the best place to start is adding a test to test_robots() in unit-tests/test.c.

When I implemented the parser, there was no real standard. So I had to review many of these files and come up with something that works.

Now, some people at Google created an RFC draft. I guess it won't change much until it is finalized, so we can create a new parser.

See https://datatracker.ietf.org/doc/draft-koster-rep/

rootkea (Contributor) commented Jul 2, 2022

I do have a working parser based on https://www.robotstxt.org/orig.html#format (covering all the edge cases), but the code is not 'beautiful'. I did look at Google's RFC but decided to set it aside since it's only a draft for now. Maybe we should accommodate it if and when it becomes the standard?

Anyway, I don't think I can push the final, cleaned-up parser until Monday. Sorry!

rockdaboot (Owner) commented:

No hurry.

It might be a good time to get fuzzing running against your new code (fuzz/libwget_robots_parse_fuzzer.c). Details are in fuzz/README.md; it might need some polishing, though. Let me know if you get stuck.

rockdaboot (Owner) commented:

Btw, I looked at the code and a) found an issue in the robots tests themselves, and b) found that this little patch fixes the comments issue:

diff --git a/libwget/robots.c b/libwget/robots.c
index fe8d589e1..faa14e96c 100644
--- a/libwget/robots.c
+++ b/libwget/robots.c
@@ -99,7 +99,7 @@ int wget_robots_parse(wget_robots **_robots, const char *data, const char *clien
                }
                else if (collect == 1 && !wget_strncasecmp_ascii(data, "Disallow:", 9)) {
                        for (data += 9; *data == ' ' || *data == '\t'; data++);
-                       if (*data == '\r' || *data == '\n' || !*data) {
+                       if (*data == '\r' || *data == '\n' || *data == '#' || !*data) {
                                // all allowed
                                wget_vector_free(&robots->paths);
                                collect = 2;

So I fixed the tests, added one more test and pushed the fix.

That does not mean your parser is obsolete. It might be better (more readable, faster, more standards-compliant, ...) than the existing one. But I'd say let's not put too much time into this; there are other, more important things to do.

rockdaboot (Owner) commented:

I opened a follow-up issue to check the parser: https://gitlab.com/gnuwget/wget2/-/issues/607.

@rootkea Please feel free to assign it to yourself.
