Description
I'm trying to use Playwright with Scrapy, but I've been stuck on the same problem for a couple of days.
When I scrape two links from an array, the first one succeeds and reaches the parse function, but every subsequent scrape fails with this stack trace:
2022-06-29 19:26:44 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText=purple%20shoes>
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks
result = current_context.run(
File "/usr/local/lib/python3.9/site-packages/twisted/python/failure.py", line 514, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/local/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
return (yield download_func(request=request, spider=spider))
File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1030, in adapt
extracted = result.result()
File "/usr/local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 275, in _download_request
result = await self._download_request_with_page(request, page)
File "/usr/local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 308, in _download_request_with_page
headers = Headers(await response.all_headers())
AttributeError: 'NoneType' object has no attribute 'all_headers'
This is requirements.txt, running in a python:3.9 Docker image:

```
Scrapy==2.6.1
scrapy-user-agents==0.1.1
scrapy-playwright==0.0.18
```
This is the crawler code:
```python
import scrapy
from scrapy.http import Request


class AlibabaCrawlerSpider(scrapy.Spider):
    name = 'alibaba_crawler'
    allowed_domains = ['alibaba.com']
    start_urls = [
        "https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText={0}".format("green shoes"),
        "https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText={0}".format("purple shoes"),
    ]
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    }

    def start_requests(self):
        for url in self.start_urls:
            print("Search Url:", url)
            yield scrapy.Request(url, callback=self.parse_search, meta={
                "playwright": True,
                "playwright_include_page": False,
                "playwright_context": "new",
            })

    def parse_search(self, response):
        print("Parsing search:", response.url)
```
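As an aside on the URLs above: the search terms are interpolated raw ("green shoes"), so the space only gets percent-encoded further downstream. A small stdlib sketch that encodes the term up front (illustrative, not something Scrapy requires):

```python
from urllib.parse import quote

# Same URL template as in start_urls above.
BASE = "https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText={}"


def search_url(term):
    # Percent-encode the search term up front so logs and duplicate
    # filters always see the canonical form (spaces become %20).
    return BASE.format(quote(term))


print(search_url("purple shoes"))
# ...SearchText=purple%20shoes
```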
This is the Scrapy settings.py:

```python
BOT_NAME = 'scrapy_alibaba'
SPIDER_MODULES = ['scrapy_alibaba.spiders']
NEWSPIDER_MODULE = 'scrapy_alibaba.spiders'
LOG_LEVEL = 'INFO'
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1
PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 1
DEFAULT_REQUEST_HEADERS = {
    'authority': 'www.alibaba.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'dnt': '1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://www.alibaba.com/sw.js?v=2.13.12&_flasher_manifest_=https://s.alicdn.com/@g/flasher-manifest/icbu-v2/manifestB.json',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}
DOWNLOAD_TIMEOUT = 15
DNS_TIMEOUT = 15
```
Full scrapy log:
2022-06-29 19:26:22 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapy_alibaba)
2022-06-29 19:26:22 [scrapy.utils.log] INFO: Versions: lxml 4.9.0.0, libxml2 2.9.14, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.9.13 (main, Jun 23 2022, 11:12:54) - [GCC 10.2.1 20210110], pyOpenSSL 22.0.0 (OpenSSL 3.0.3 3 May 2022), cryptography 37.0.2, Platform Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-glibc2.31
2022-06-29 19:26:22 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_alibaba',
'CONCURRENT_REQUESTS': 1,
'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
'DNS_TIMEOUT': 15,
'DOWNLOAD_DELAY': 1,
'DOWNLOAD_TIMEOUT': 15,
'LOG_LEVEL': 'INFO',
'NEWSPIDER_MODULE': 'scrapy_alibaba.spiders',
'SPIDER_MODULES': ['scrapy_alibaba.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2022-06-29 19:26:22 [scrapy.extensions.telnet] INFO: Telnet Password: qwe13124ae213r
2022-06-29 19:26:22 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2022-06-29 19:26:22 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-06-29 19:26:22 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-06-29 19:26:22 [botocore.credentials] INFO: Found credentials in shared credentials file: ~/.aws/credentials
2022-06-29 19:26:23 [botocore.credentials] INFO: Found credentials in shared credentials file: ~/.aws/credentials
2022-06-29 19:26:23 [botocore.credentials] INFO: Found credentials in shared credentials file: ~/.aws/credentials
2022-06-29 19:26:23 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy_alibaba.pipelines.ItemParameterSplitterPipeline',
'scrapy_alibaba.pipelines.ValidateItemPipeline',
'scrapy_alibaba.pipelines.UpsertItemToDynamoPipeline']
2022-06-29 19:26:23 [scrapy.core.engine] INFO: Spider opened
2022-06-29 19:26:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-06-29 19:26:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-06-29 19:26:23 [scrapy-playwright] INFO: Starting download handler
2022-06-29 19:26:23 [scrapy-playwright] INFO: Starting download handler
Search Url: https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText=green shoes
2022-06-29 19:26:28 [scrapy-playwright] INFO: Launching browser chromium
2022-06-29 19:26:28 [scrapy-playwright] INFO: Browser chromium launched
Search Url: https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText=purple shoes
Parsing search: https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText=green%20shoes
2022-06-29 19:26:44 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText=purple%20shoes>
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks
result = current_context.run(
File "/usr/local/lib/python3.9/site-packages/twisted/python/failure.py", line 514, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/local/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
return (yield download_func(request=request, spider=spider))
File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1030, in adapt
extracted = result.result()
File "/usr/local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 275, in _download_request
result = await self._download_request_with_page(request, page)
File "/usr/local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 308, in _download_request_with_page
headers = Headers(await response.all_headers())
AttributeError: 'NoneType' object has no attribute 'all_headers'
2022-06-29 19:26:44 [scrapy.core.engine] INFO: Closing spider (finished)
2022-06-29 19:26:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/builtins.AttributeError': 1,
'downloader/request_bytes': 1607,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 1024460,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 21.077569,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 6, 29, 19, 26, 44, 206784),
'log_count/ERROR': 1,
'log_count/INFO': 17,
'memusage/max': 76918784,
'memusage/startup': 76918784,
'playwright/context_count': 1,
'playwright/context_count/max_concurrent': 1,
'playwright/context_count/non-persistent': 1,
'playwright/page_count': 2,
'playwright/page_count/closed': 2,
'playwright/page_count/max_concurrent': 1,
'playwright/request_count': 517,
'playwright/request_count/method/GET': 470,
'playwright/request_count/method/HEAD': 16,
'playwright/request_count/method/POST': 31,
'playwright/request_count/navigation': 16,
'playwright/request_count/resource_type/document': 16,
'playwright/request_count/resource_type/fetch': 29,
'playwright/request_count/resource_type/font': 14,
'playwright/request_count/resource_type/image': 205,
'playwright/request_count/resource_type/ping': 25,
'playwright/request_count/resource_type/script': 170,
'playwright/request_count/resource_type/stylesheet': 38,
'playwright/request_count/resource_type/xhr': 20,
'playwright/response_count': 483,
'playwright/response_count/method/GET': 448,
'playwright/response_count/method/HEAD': 8,
'playwright/response_count/method/POST': 27,
'playwright/response_count/resource_type/document': 14,
'playwright/response_count/resource_type/fetch': 21,
'playwright/response_count/resource_type/font': 14,
'playwright/response_count/resource_type/image': 201,
'playwright/response_count/resource_type/ping': 22,
'playwright/response_count/resource_type/script': 156,
'playwright/response_count/resource_type/stylesheet': 38,
'playwright/response_count/resource_type/xhr': 17,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2022, 6, 29, 19, 26, 23, 129215)}
2022-06-29 19:26:44 [scrapy.core.engine] INFO: Spider closed (finished)
2022-06-29 19:26:44 [scrapy-playwright] INFO: Closing download handler
2022-06-29 19:26:44 [scrapy-playwright] INFO: Closing download handler
2022-06-29 19:26:44 [scrapy-playwright] INFO: Closing browser
I noticed that if I remove PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None from settings.py, I get this stack trace instead:
2022-06-29 19:35:31 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-737' coro=<Channel.send() done, defined at /usr/local/lib/python3.9/site-packages/playwright/_impl/_connection.py:38> exception=Error('headers[5].value: expected string, got object')>
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 39, in send
return await self.inner_send(method, params, False)
File "/usr/local/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 63, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.Error: headers[5].value: expected string, got object
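That second error suggests a non-string header value reached Playwright, which expects a flat str-to-str header mapping, whereas Scrapy stores each header as a list of byte strings. A sketch of the flattening that would be needed (illustrative, not scrapy-playwright's actual code):

```python
def to_playwright_headers(scrapy_headers):
    """Flatten Scrapy-style headers ({bytes: [bytes, ...]}) to {str: str}."""
    flat = {}
    for name, values in scrapy_headers.items():
        key = name.decode("utf-8") if isinstance(name, bytes) else str(name)
        # Scrapy keeps a list of values per header; Playwright wants one string.
        if not isinstance(values, (list, tuple)):
            values = [values]
        flat[key] = ", ".join(
            v.decode("utf-8") if isinstance(v, bytes) else str(v) for v in values
        )
    return flat


print(to_playwright_headers({b"Accept": [b"text/html", b"application/xml"]}))
# {'Accept': 'text/html, application/xml'}
```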
But I'm unsure why I would get either of these stack traces, and I'm already being polite and running the requests sequentially (CONCURRENT_REQUESTS = 1). Any idea how to solve these errors?