Many errors with broad crawl #15

Closed
phongtnit opened this issue Jun 22, 2021 · 11 comments · Fixed by #74

Comments

@phongtnit

phongtnit commented Jun 22, 2021

Hello,

I'm using the scrapy-playwright package to capture screenshots and get the HTML content of 2000 websites. My main code looks simple:

def start_requests(self):
    ...
    yield scrapy.Request(
        url=url,
        meta={"playwright": True, "playwright_include_page": True},
    )
    ...

async def parse(self, response):
    # the Playwright page object, available because of playwright_include_page
    page = response.meta["playwright_page"]
    ...
    await page.screenshot(path=screenshot_file_full_path)
    html = await page.content()
    await page.close()
    ...

There were many errors when I ran the script. I changed CONCURRENT_REQUESTS from 30 down to 1, but the results were no different.

My test included 2000 websites, but the Scrapy script scraped only 511 results (about a 25% success rate), and it keeps running without producing more results or error logs.

Please guide me on how to fix this. Thanks in advance.

My error logs:

2021-06-22 11:49:53 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.ask.com>
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 824, in adapt
    extracted = result.result()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/scrapy_playwright/handler.py", line 140, in _download_request
    result = await self._download_request_with_page(request, spider, page)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/scrapy_playwright/handler.py", line 160, in _download_request_with_page
    response = await page.goto(request.url)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 6006, in goto
    await self._async(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_page.py", line 429, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_frame.py", line 117, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
    return await self.inner_send(method, params, False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded.
=========================== logs ===========================
navigating to "https://www.ask.com", waiting until "load"
============================================================
Note: use DEBUG=pw:api environment variable to capture Playwright logs.
2021-06-22 11:51:24 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.tvzavr.ru>
(same TimeoutError traceback and Playwright log output as above, for https://www.tvzavr.ru)
.....
2021-06-22 11:51:06 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-11634' coro=<Route.continue_() done, defined at /home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py:544> exception=Error('Target page, context or browser has been closed')>
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 582, in continue_
    await self._async(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_network.py", line 207, in continue_
    await self._channel.send("continue", cast(Any, overrides))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
    return await self.inner_send(method, params, False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
(the same "Task exception was never retrieved" traceback repeats for Task-11640, Task-11641, Task-11652, Task-11657, Task-11669, Task-11670 and Task-11671)
.....

My Scrapy settings look like:

CONCURRENT_REQUESTS = 30
...
# Playwright settings
PLAYWRIGHT_CONTEXT_ARGS = {"ignore_https_errors": True}
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30000
...
RETRY_ENABLED = True
RETRY_TIMES = 3

My env:

Ubuntu 20.04 and macOS 11.2.3
Python 3.8.5
Scrapy 2.5.0
playwright 1.12.1
scrapy-playwright 0.0.3
@Obeyed

Obeyed commented Jun 25, 2021

Are there any updates on this? I am experiencing a similar issue. I suspect the cause lies in Playwright rather than in scrapy-playwright, but I haven't had time to investigate anything yet. I will post an update here if I find anything.

My env:

Ubuntu 18.04.2

Python==3.9.5
Scrapy==2.5.0
playwright==1.12.1
scrapy-playwright==0.0.3

@elacuesta
Member

I scraped the first 2K domains from Majestic Million, with CONCURRENT_REQUESTS=8. I got 1540 successful responses (not all of them 200 though, let's say 1540 screenshots). I then retried the failed 460 domains, and got only 45 good responses. The websites might be different of course, but I'd suggest you try smaller runs with only the failed sites, because it might be the case that they're just banning your crawler.
I also tried creating a new context for each domain (#13), that seemed to produce a slight improvement.

In any case, I think this use case would benefit from #6; I have a few ideas but haven't decided on one just yet. I will resume work on that after #13 is merged and released.

Additionally, if you're just taking screenshots I'd suggest using a PageCoroutine to reduce the lifespan and memory usage of the page object: it will be closed in the download handler and won't reach the callback (just a general note, I don't think it has an effect on the issue being discussed).
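
For reference, a minimal sketch of that screenshot-only approach (assuming the PageCoroutine helper from scrapy_playwright.page, with a placeholder URL and output path):

import scrapy
from scrapy_playwright.page import PageCoroutine

def start_requests(self):
    yield scrapy.Request(
        url="https://example.org",  # placeholder URL
        meta={
            "playwright": True,
            # no "playwright_include_page": the page stays in the download
            # handler and is closed there, so the callback never sees it
            "playwright_page_coroutines": {
                "screenshot": PageCoroutine("screenshot", path="example.png"),
            },
        },
    )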

@phongtnit
Author

phongtnit commented Jun 27, 2021

(quotes @elacuesta's reply above in full)

Hi @elacuesta

Thanks for the awesome information!

I switched to using a PageCoroutine to take screenshots, and it seems to perform faster.

I may have found the cause of my low success rate: I was only using a server with 8 CPU cores and 64 GB RAM to run my script, so I needed to decrease CONCURRENT_REQUESTS to about 5 or 6, which gave a better success rate. In fact, I scraped 1743 results from the first 2k Majestic domains (about 87%; my own domain list had the same rate), and the results differed slightly between runs.

Anyway, I'm waiting for the next release of the package to fix the Target page, context or browser has been closed error.

@phongtnit
Author

phongtnit commented Jun 27, 2021

Hi @elacuesta

I crawled the first 2k domains from Majestic and the script worked as described above. However, when I increased the crawl to about 10k domains (without changing any settings), the script only got about 2400-2500 results and then stopped. The debug logs contained Target page, context or browser has been closed errors, and the last error was:

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
 1: 0xa18150 node::Abort() [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
 2: 0xa1855c node::OnFatalError(char const*, char const*) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
 3: 0xb9715e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
 4: 0xb974d9 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
 5: 0xd54755  [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
 6: 0xd54de6 v8::internal::Heap::RecomputeLimits(v8::internal::GarbageCollector) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
 7: 0xd616a5 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
 8: 0xd62555 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
 9: 0xd6500c v8::internal::Heap::AllocateRawWithRetryOrFail(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
10: 0xd32eac v8::internal::Factory::NewRawOneByteString(int, v8::internal::AllocationType) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
11: 0xd32f71 v8::internal::Factory::NewStringFromOneByte(v8::internal::Vector<unsigned char const> const&, v8::internal::AllocationType) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
12: 0xbaf62f v8::String::NewFromOneByte(v8::Isolate*, unsigned char const*, v8::NewStringType, int) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
13: 0xaf60aa node::StringBytes::Encode(v8::Isolate*, char const*, unsigned long, node::encoding, v8::Local<v8::Value>*) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
14: 0x9f33a6  [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
15: 0x1390d8d  [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
Aborted (core dumped)

Do you have any idea why this happens, and how to fix the bug?

My main Playwright code:

yield scrapy.Request(
    url=url,
    meta={
        "playwright": True,
        "playwright_page_coroutines": {
            "screenshot": PageCoroutine(
                "screenshot",
                path=self.settings["TMP_FILE_PATH"] + "/" + str(round(time.time() * 1000)) + ".png",
            ),
        },
    },
)

@phongtnit
Author

A Playwright contributor said that reusing the browser context may lead to the Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory error, and recommended using a new context for each domain; there is more detail in the Playwright repository.
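
For illustration, a rough sketch of "one context per domain", assuming the per-request "playwright_context" meta key from the multiple-contexts work (#13):

from urllib.parse import urlparse

import scrapy

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(
            url=url,
            meta={
                "playwright": True,
                # name the context after the host so each domain gets its own
                "playwright_context": urlparse(url).netloc,
            },
        )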

@phongtnit
Author

Hi @elacuesta, could you estimate when the multiple-context feature will be released? I'd like to test whether it solves this issue.

@xanrag

xanrag commented Jul 5, 2021

I can confirm getting the latter bug with Task exception was never retrieved, just from scraping a single website. Everything seems to work and I get the scraped items, so it doesn't look like it hurts anything. It only happens once in a while, not on every page. I am running each spider in a separate process using Celery. Is there any way to suppress the error?

[2021-07-05 22:42:30,646: DEBUG/ForkPoolWorker-100] Browser context closed: '1'
[2021-07-05 22:42:30,647: ERROR/ForkPoolWorker-100] Task exception was never retrieved
future: <Task finished name='Task-207' coro=<Route.continue_() done, defined at /home/garnax/app/lib/python3.9/site-packages/playwright/async_api/_generated.py:544> exception=Error('Target page, context or browser has been closed')>
Traceback (most recent call last):
File "/home/garnax/app/lib/python3.9/site-packages/playwright/async_api/generated.py", line 582, in continue
await self._async(
File "/home/garnax/app/lib/python3.9/site-packages/playwright/_impl/network.py", line 207, in continue
await self._channel.send("continue", cast(Any, overrides))
File "/home/garnax/app/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 36, in send
return await self.inner_send(method, params, False)
File "/home/garnax/app/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed

@Obeyed

Obeyed commented Jul 6, 2021

A quick update from my side on this topic.

I tried using scrapy-playwright built from #13 to see if it would solve the memory issue (mentioned in microsoft/playwright#6319). Unfortunately, it didn't help: after a few hundred scraped items, the playwright._impl._api_types.Error: Target page, context or browser has been closed log statement still appears and no more items are scraped.

What I changed was to use a new context for each logical group of items on the website I am scraping; previously, my spider processed all items in a single context. The page is closed directly after it is used in the callback, and I am only scraping one website. I suspect that adding more contexts actually caused the memory issue to surface sooner, since I now scrape fewer items before none of the browser contexts work and no new ones can be created.

I don't have more details to share, and I don't think this is very helpful for investigating the issue, but I wanted to record my findings here. Hopefully, microsoft/playwright#6319 will lead to better memory management and this won't be an issue in the future.

@phongtnit
Author

phongtnit commented Jul 11, 2021

Hi @Obeyed, thanks for sharing your test. I also tested the multiple-contexts branch, creating a new context per domain across about 2500 URLs, and the Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory error still occurred; it is definitely a bug.

@elacuesta
Member

It's been a while, but I think I understand what's happening now: #74.

@sick-pupil

I had the same issue, but I finally got it working by yielding different pages in different contexts, using the context name ('playwright_context': 'xxx').
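
Roughly, that workaround looks like this (a sketch; the context name is arbitrary):

yield scrapy.Request(
    url=url,
    meta={
        "playwright": True,
        # pages yielded under different context names end up in separate browser contexts
        "playwright_context": "xxx",
    },
)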
