Error when parsing robots.txt with comments #263
Comments
Looking at
I'll see if I can submit a patch fixing robots.txt parsing (including comments).
@rootkea Did you find the time? IMO the best place to start is adding a test in
When I implemented the parser, there was no real standard, so I had to review many of these files and come up with something that works. Now some people at Google have created an RFC draft. I guess it won't change much until it is finalized, so we can create a new parser.
I do have a working parser based on https://www.robotstxt.org/orig.html#format (covering all the edge cases) but the code is not 'beautiful'. I did look at Google's RFC but thought to ignore it since it's just a draft for now. Maybe we should accommodate it if and when it becomes the standard? Anyway, I don't think I can push the final clean parser till Monday. Sorry!
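For readers unfamiliar with the original format, the comment handling it describes can be sketched roughly as follows. This is only an illustrative Python sketch, not the wget2 parser or the patch discussed in this thread; `parse_robots`, `SAMPLE`, and the default agent name are hypothetical. It assumes the rules from the original format: a `#` starts a comment that runs to the end of the line, comment-only lines are discarded without ending a record, and blank lines separate records. For brevity it merges all matching records instead of preferring the most specific one.

```python
def parse_robots(text, agent="wget"):
    """Return the Disallow path prefixes that apply to `agent`.

    Hypothetical helper for illustration only.
    """
    agent = agent.lower()
    disallowed = []
    record_agents = []  # User-agent values of the current record

    for raw in text.splitlines():
        if not raw.strip():
            # A truly blank line ends the current record.
            record_agents = []
            continue

        # Strip the comment: everything from '#' to the end of the line.
        line = raw.split("#", 1)[0].strip()
        if not line:
            # Comment-only line: ignored, and it does NOT end the record.
            continue

        field, sep, value = line.partition(":")
        if not sep:
            continue  # not a "field: value" line, skip it
        field = field.strip().lower()
        value = value.strip()

        if field == "user-agent":
            record_agents.append(value.lower())
        elif field == "disallow" and value:
            # An empty Disallow value means "allow everything", so it is skipped.
            # Matching is a case-insensitive substring test, "*" matches all.
            if any(a == "*" or (a and a in agent) for a in record_agents):
                disallowed.append(value)

    return disallowed


SAMPLE = """\
# top-level comment
User-agent: *   # applies to all robots
Disallow: /cgi-bin/   # trailing comment
# comment inside a record
Disallow: /tmp/

User-agent: other
Disallow: /private/
"""

print(parse_robots(SAMPLE))
```

The point of the sketch is the order of the two checks: blank-line detection runs on the raw line, before comment stripping, so a comment-only line inside a record does not split it.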
No hurry. It might be a good time to get fuzzing running against your new code.
Btw, I looked at the code and a) found an issue in the robots tests themselves, and b) to fix the comments issue, this little patch does it:
So I fixed the tests, added one more test and pushed the fix. That does not mean your parser is obsolete. It might be better (more readable, faster, standards-compliant, ...) than the existing one. But I'd say, let's not put too much time into this. There are other, more important things to do.
I opened a follow-up issue to check the parser at https://gitlab.com/gnuwget/wget2/-/issues/607. @rootkea Please feel free to assign it to yourself.
How to reproduce: