Patched LxmlParserLinkExtractor #5881
Conversation
Added a try/except block around the safe_url_string() call in the LxmlParserLinkExtractor class, to avoid scrapers crashing unnecessarily
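For context, a minimal sketch of the idea (not the actual diff: the helper name and loop below are illustrative stand-ins for the code around the safe_url_string() call in LxmlParserLinkExtractor):

```python
# Sketch only: wrap the safe_url_string() call so one malformed href is
# skipped instead of aborting the whole link-extraction pass.
from w3lib.url import safe_url_string

def sanitize_urls(raw_urls, response_encoding):  # hypothetical helper
    safe_urls = []
    for url in raw_urls:
        try:
            # In recent w3lib versions this can raise ValueError, e.g. when a
            # garbled netloc leaves a non-numeric "port" in the URL.
            safe_urls.append(safe_url_string(url, encoding=response_encoding))
        except ValueError:
            # Skip the bad link; a log line for it was added later in this thread.
            continue
    return safe_urls
```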
Codecov Report
@@ Coverage Diff @@
## master #5881 +/- ##
==========================================
- Coverage 88.85% 88.83% -0.03%
==========================================
Files 162 162
Lines 11057 11114 +57
Branches 1801 1805 +4
==========================================
+ Hits 9825 9873 +48
- Misses 954 960 +6
- Partials 278 281 +3
It makes sense to me to try to make it so that these issues in the input data do not break the overall crawl. However, rather than ignoring such errors silently, I wonder if we should log a message about them, even if it is a debug message, e.g. “Skipping the extraction of invalid URL {url}”. In any case, could you add a test for the change?
Adds an error logging line to the LinkExtractor to detail encountered bad links
Thanks for the feedback. I have added a line to log the bad URLs, as well as a test case to cover the change.
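For illustration, such a test could look roughly like this (a sketch: the malformed URL, test name, and expected result are assumptions, and the test actually added in the PR may differ; it also relies on a w3lib version strict enough to reject the URL):

```python
# Hypothetical test sketch: the bad href is chosen so that reading its "port"
# raises ValueError, while the second link should still be extracted.
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

def test_link_with_invalid_url_is_skipped():
    body = (
        b'<html><body>'
        b'<a href="https://example.com:https://example.org/">bad</a>'
        b'<a href="https://example.com/ok">ok</a>'
        b'</body></html>'
    )
    response = HtmlResponse(url="https://example.com", body=body)
    links = LinkExtractor().extract_links(response)
    # The malformed link is dropped, the valid one is still extracted.
    assert [link.url for link in links] == ["https://example.com/ok"]
```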
Looks good to me.
I’m not sure about the log level of the message, but I have no strong opinion against using ERROR.
The new test fails on 3.7.
I'm unable to reproduce the errors in the tests on Python 3.7. Any ideas what might be causing them?
I think the issue in environments with pinned libraries is that the function that determines URL validity may be less strict in an older version of its library (w3lib, I assume). Maybe you can come up with an invalid URL that also fails in that scenario, or make the test skip for older versions of w3lib. The pre-commit issue should be self-explanatory; see https://docs.scrapy.org/en/latest/contributing.html#pre-commit. Other failures may be random, unrelated issues.
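For the skip option, one possible shape (a sketch: the 2.0.0 cut-off is an assumption, not taken from the PR, and pkg_resources is used only because importlib.metadata is unavailable on Python 3.7):

```python
# Hypothetical version guard for the test above; adjust the threshold to the
# w3lib release that actually introduced the stricter URL validation.
import pytest
from pkg_resources import get_distribution, parse_version

OLD_W3LIB = parse_version(get_distribution("w3lib").version) < parse_version("2.0.0")

@pytest.mark.skipif(OLD_W3LIB, reason="older w3lib does not reject this malformed URL")
def test_link_with_invalid_url_is_skipped():
    ...  # same body as the earlier test sketch
```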
Added.
Thanks!
Issue
When processing links, in certain edge cases I have noticed crawlers crashing with a ValueError raised from the safe_url_string() call, while all other links on the page are parsed successfully.
I believe the cases in question are caused by bad formatting in the websites, which results in two URLs back-to-back with no separation, causing the `LinkExtractor` to attempt to parse the second `https:` as a port, which fails as it can't be cast to `int`.
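The offending pages aren't shown in the report, but an illustrative reproduction with a made-up URL of that shape (the exact input is an assumption) looks like this:

```python
# Two URLs mashed together can leave non-numeric text where the port belongs.
# urlsplit() itself succeeds; the error only surfaces when the port is read,
# which is what safe_url_string() ends up doing in recent w3lib versions.
from urllib.parse import urlsplit

bad = "https://example.com:https://example.org/page"  # assumed malformed input
parts = urlsplit(bad)
print(parts.netloc)  # 'example.com:https:'
parts.port           # ValueError: Port could not be cast to integer value as 'https:'
```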
The Fix
The fix I have applied simply wraps the `url = safe_url_string(url, encoding=response_encoding)` call in a try/except block that catches the `ValueError`, so that these links/formatting issues can be ignored and do not cause crawlers to crash.

As I can't pinpoint where the error happens on the pages where I've noticed it, I have not been able to include tests, but after applying the patch locally and running crawlers, all functionality looks normal (with the added bonus of no unexpected crashes 👍🏻).