Patched LxmlParserLinkExtractor #5881


Merged: 10 commits merged into scrapy:master on Apr 12, 2023

Conversation

sbartlett97
Contributor

Issue

When processing links, in certain edge cases I have noticed crawlers crashing with the following (or a similar) error, even though all other links are parsed successfully:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/defer.py", line 254, in aiter_errback
    yield await it.__anext__()
          ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/python.py", line 366, in __anext__
    return await self.data.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/python.py", line 347, in _async_chain
    async for o in as_async_generator(it):
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/asyncgen.py", line 14, in as_async_generator
    async for r in it:
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/python.py", line 366, in __anext__
    return await self.data.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/python.py", line 347, in _async_chain
    async for o in as_async_generator(it):
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/asyncgen.py", line 14, in as_async_generator
    async for r in it:
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 90, in process_async
    async for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/depth.py", line 36, in process_spider_output_async
    async for r in result or ():
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 90, in process_async
    async for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/offsite.py", line 32, in process_spider_output_async
    async for r in result or ():
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 90, in process_async
    async for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in process_spider_output_async
    async for r in result or ():
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 90, in process_async
    async for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/urllength.py", line 31, in process_spider_output_async
    async for r in result or ():
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 90, in process_async
    async for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/scrapy/spiders/crawl.py", line 125, in _parse_response
    for request_or_item in self._requests_to_follow(response):
  File "/usr/local/lib/python3.11/site-packages/scrapy/spiders/crawl.py", line 98, in _requests_to_follow
    links = [lnk for lnk in rule.link_extractor.extract_links(response)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/linkextractors/lxmlhtml.py", line 162, in extract_links
    links = self._extract_links(doc, response.url, response.encoding, base_url)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/linkextractors/__init__.py", line 132, in _extract_links
    return self.link_extractor._extract_links(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/linkextractors/lxmlhtml.py", line 76, in _extract_links
    url = safe_url_string(url, encoding=response_encoding)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/w3lib/url.py", line 148, in safe_url_string
    parts.port,
    ^^^^^^^^^^
  File "/usr/local/lib/python3.11/urllib/parse.py", line 173, in port
    raise ValueError(f"Port could not be cast to integer value as {port!r}")
ValueError: Port could not be cast to integer value as ' https:'

I believe the cases in question are caused by bad formatting on the websites, which results in two URLs placed back-to-back with no separation; the LinkExtractor then tries to parse the second https: as a port, which fails because it can't be cast to an int.
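For illustration, the failure can be reproduced outside Scrapy with a URL of that shape (the exact offending markup on the affected pages is unknown, so the href below is a made-up example):

```python
from urllib.parse import urlsplit

# Hypothetical malformed href: two URLs run together.
bad_url = "https://example.com: https://example.org/page"

# urlsplit() puts "example.com: https:" into the netloc, so everything after the
# first colon is treated as the port. w3lib's safe_url_string() reads parts.port,
# which raises the same error as in the traceback above:
urlsplit(bad_url).port
# ValueError: Port could not be cast to integer value as ' https:'
```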

The Fix

The fix I have applied simply wraps the url = safe_url_string(url, encoding=response_encoding) call in a try/except block that catches the ValueError, so that these links/formatting issues can be ignored and do not cause crawlers to crash.
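In code terms the pattern is roughly the following (a minimal self-contained sketch of the idea, not the actual diff; the helper name and its arguments are invented for illustration):

```python
from w3lib.url import safe_url_string


def extract_safe_urls(candidate_urls, response_encoding):
    """Illustrative only: skip any candidate URL that safe_url_string() rejects,
    rather than letting the ValueError abort the whole link extraction."""
    urls = []
    for url in candidate_urls:
        try:
            url = safe_url_string(url, encoding=response_encoding)
        except ValueError:
            continue  # malformed URL, e.g. two URLs run together
        urls.append(url)
    return urls
```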

As I can't pinpoint where the error occurs on the pages where I've noticed it, I have not been able to include tests, but after applying the patch locally and running crawlers, all functionality looks normal (with the added bonus of no unexpected crashes 👍🏻).

Added a try/except around the safe_url_string() call
in the LxmlParserLinkExtractor class to avoid scrapers crashing
unnecessarily
@codecov

codecov bot commented Mar 30, 2023

Codecov Report

Merging #5881 (bff6b93) into master (8045d7e) will decrease coverage by 0.03%.
The diff coverage is 93.84%.

❗ Current head bff6b93 differs from pull request most recent head 3f0c2fa. Consider uploading reports for the commit 3f0c2fa to get more accurate results

@@            Coverage Diff             @@
##           master    #5881      +/-   ##
==========================================
- Coverage   88.85%   88.83%   -0.03%     
==========================================
  Files         162      162              
  Lines       11057    11114      +57     
  Branches     1801     1805       +4     
==========================================
+ Hits         9825     9873      +48     
- Misses        954      960       +6     
- Partials      278      281       +3     
Impacted Files Coverage Δ
scrapy/mail.py 77.77% <0.00%> (-0.88%) ⬇️
scrapy/pipelines/files.py 72.00% <50.00%> (+0.09%) ⬆️
scrapy/utils/misc.py 96.52% <60.00%> (-1.35%) ⬇️
scrapy/signalmanager.py 81.81% <83.33%> (+1.81%) ⬆️
scrapy/core/engine.py 83.68% <90.90%> (-0.51%) ⬇️
scrapy/statscollectors.py 89.28% <92.85%> (-2.88%) ⬇️
scrapy/core/scheduler.py 93.23% <93.75%> (-0.52%) ⬇️
scrapy/core/downloader/__init__.py 92.80% <100.00%> (+0.05%) ⬆️
scrapy/core/scraper.py 84.94% <100.00%> (+0.33%) ⬆️
scrapy/core/spidermw.py 99.44% <100.00%> (+<0.01%) ⬆️
... and 10 more

@Gallaecio
Member

It makes sense to me to try to ensure that these kinds of issues in the input data do not break the overall crawl.

However, rather than ignoring such errors silently, I wonder if we should log a message about them, even if only at the debug level, e.g. “Skipping the extraction of invalid URL {url}”.

In any case, could you add a test for the change?

@sbartlett97
Contributor Author

Thanks for the feedback. I have added a line to log the bad URLs, as well as a test case to ensure that the LinkExtractor neither crashes nor extracts them.
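Not the actual test from the PR, but a sketch of what such a test can look like (the HTML body and the malformed href are hypothetical):

```python
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor


def test_link_extractor_ignores_bad_url():
    body = (
        b'<html><body>'
        b'<a href="http://example.com: http://example.org/broken">bad</a>'
        b'<a href="http://example.com/good">good</a>'
        b'</body></html>'
    )
    response = HtmlResponse("http://example.com", body=body, encoding="utf-8")
    links = LinkExtractor().extract_links(response)
    # The malformed link is skipped (and logged), the valid one is extracted.
    assert [link.url for link in links] == ["http://example.com/good"]
```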

@Gallaecio
Member

Gallaecio left a comment

Looks good to me.

I’m not sure about the log level of the message, but I have no strong opinion against using ERROR.

@wRAR
Member

wRAR commented Mar 31, 2023

The new test fails on 3.7.

@sbartlett97
Contributor Author

I'm unable to reproduce the test failures on Python 3.7. Any ideas what might be causing them?

@Gallaecio
Member

I think the issue in environments with pinned libraries is that the function that determines URL validity may be less strict in an older version of its library (w3lib, I assume). Maybe you can come up with an invalid URL that also fails in that scenario, or make the test skip for older versions of w3lib.

The pre-commit issue should be self-explanatory. See https://docs.scrapy.org/en/latest/contributing.html#pre-commit.

Other issues may be random, unrelated issues.

@sbartlett97
Contributor Author

Added unittest.skipIf to the test, as the issue doesn't seem to occur on Python 3.7.
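For reference, such a skip can be expressed along these lines; the exact condition used in the PR is not shown here, and gating on the Python version is only one option (Gallaecio's suggestion to gate on the w3lib version would work the same way):

```python
import sys
import unittest


@unittest.skipIf(
    sys.version_info < (3, 8),  # placeholder condition, not necessarily the PR's
    "the malformed URL is not rejected in this environment",
)
class BadUrlLinkExtractorTest(unittest.TestCase):
    def test_link_extractor_ignores_bad_url(self):
        ...  # body as in the test sketch above
```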

@wRAR merged commit d911837 into scrapy:master on Apr 12, 2023
@wRAR
Member

wRAR commented Apr 12, 2023

Thanks!
