First scrape is OK but subsequent scrapes fail with "AttributeError: 'NoneType' object has no attribute 'all_headers'" #102

@lallish

Description

I'm trying to use Playwright with Scrapy, but I've been stuck on the same problem for a couple of days.

When I try to scrape two links from a list, the first one succeeds and reaches the parse function, but every subsequent scrape fails with this stack trace:

2022-06-29 19:26:44 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText=purple%20shoes>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks
    result = current_context.run(
  File "/usr/local/lib/python3.9/site-packages/twisted/python/failure.py", line 514, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1030, in adapt
    extracted = result.result()
  File "/usr/local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 275, in _download_request
    result = await self._download_request_with_page(request, page)
  File "/usr/local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 308, in _download_request_with_page
    headers = Headers(await response.all_headers())
AttributeError: 'NoneType' object has no attribute 'all_headers'

This is requirements.txt, running in a python:3.9 Docker image:

Scrapy==2.6.1
scrapy-user-agents==0.1.1
scrapy-playwright==0.0.18

This is the crawler code:

import scrapy


class AlibabaCrawlerSpider(scrapy.Spider):
    name = 'alibaba_crawler'
    allowed_domains = ['alibaba.com']
    start_urls = [
        "https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText={0}".format("green shoes"),
        "https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText={0}".format("purple shoes"),
    ]

    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    }

    def start_requests(self):
        for url in self.start_urls:
            print("Search Url:", url)
            yield scrapy.Request(url, callback=self.parse_search, meta={
                "playwright": True,
                "playwright_include_page": False,
                "playwright_context": "new",
            })

    def parse_search(self, response):
        print("Parsing search:", response.url)

Scrapy settings.py

BOT_NAME = 'scrapy_alibaba'
SPIDER_MODULES = ['scrapy_alibaba.spiders']
NEWSPIDER_MODULE = 'scrapy_alibaba.spiders'
LOG_LEVEL = 'INFO'
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1
PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 1
DEFAULT_REQUEST_HEADERS = {
    'authority': 'www.alibaba.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'dnt': '1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://www.alibaba.com/sw.js?v=2.13.12&_flasher_manifest_=https://s.alicdn.com/@g/flasher-manifest/icbu-v2/manifestB.json',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}
DOWNLOAD_TIMEOUT = 15
DNS_TIMEOUT = 15

Full scrapy log:

2022-06-29 19:26:22 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapy_alibaba)
2022-06-29 19:26:22 [scrapy.utils.log] INFO: Versions: lxml 4.9.0.0, libxml2 2.9.14, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.9.13 (main, Jun 23 2022, 11:12:54) - [GCC 10.2.1 20210110], pyOpenSSL 22.0.0 (OpenSSL 3.0.3 3 May 2022), cryptography 37.0.2, Platform Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-glibc2.31
2022-06-29 19:26:22 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_alibaba',
 'CONCURRENT_REQUESTS': 1,
 'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
 'DNS_TIMEOUT': 15,
 'DOWNLOAD_DELAY': 1,
 'DOWNLOAD_TIMEOUT': 15,
 'LOG_LEVEL': 'INFO',
 'NEWSPIDER_MODULE': 'scrapy_alibaba.spiders',
 'SPIDER_MODULES': ['scrapy_alibaba.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2022-06-29 19:26:22 [scrapy.extensions.telnet] INFO: Telnet Password: qwe13124ae213r
2022-06-29 19:26:22 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-06-29 19:26:22 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-06-29 19:26:22 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-06-29 19:26:22 [botocore.credentials] INFO: Found credentials in shared credentials file: ~/.aws/credentials
2022-06-29 19:26:23 [botocore.credentials] INFO: Found credentials in shared credentials file: ~/.aws/credentials
2022-06-29 19:26:23 [botocore.credentials] INFO: Found credentials in shared credentials file: ~/.aws/credentials
2022-06-29 19:26:23 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy_alibaba.pipelines.ItemParameterSplitterPipeline',
 'scrapy_alibaba.pipelines.ValidateItemPipeline',
 'scrapy_alibaba.pipelines.UpsertItemToDynamoPipeline']
2022-06-29 19:26:23 [scrapy.core.engine] INFO: Spider opened
2022-06-29 19:26:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-06-29 19:26:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-06-29 19:26:23 [scrapy-playwright] INFO: Starting download handler
2022-06-29 19:26:23 [scrapy-playwright] INFO: Starting download handler
Search Url: https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText=green shoes
2022-06-29 19:26:28 [scrapy-playwright] INFO: Launching browser chromium
2022-06-29 19:26:28 [scrapy-playwright] INFO: Browser chromium launched
Search Url: https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText=purple shoes
Parsing search: https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText=green%20shoes
2022-06-29 19:26:44 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText=purple%20shoes>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks
    result = current_context.run(
  File "/usr/local/lib/python3.9/site-packages/twisted/python/failure.py", line 514, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1030, in adapt
    extracted = result.result()
  File "/usr/local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 275, in _download_request
    result = await self._download_request_with_page(request, page)
  File "/usr/local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 308, in _download_request_with_page
    headers = Headers(await response.all_headers())
AttributeError: 'NoneType' object has no attribute 'all_headers'
2022-06-29 19:26:44 [scrapy.core.engine] INFO: Closing spider (finished)
2022-06-29 19:26:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/builtins.AttributeError': 1,
 'downloader/request_bytes': 1607,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 1024460,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 21.077569,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 6, 29, 19, 26, 44, 206784),
 'log_count/ERROR': 1,
 'log_count/INFO': 17,
 'memusage/max': 76918784,
 'memusage/startup': 76918784,
 'playwright/context_count': 1,
 'playwright/context_count/max_concurrent': 1,
 'playwright/context_count/non-persistent': 1,
 'playwright/page_count': 2,
 'playwright/page_count/closed': 2,
 'playwright/page_count/max_concurrent': 1,
 'playwright/request_count': 517,
 'playwright/request_count/method/GET': 470,
 'playwright/request_count/method/HEAD': 16,
 'playwright/request_count/method/POST': 31,
 'playwright/request_count/navigation': 16,
 'playwright/request_count/resource_type/document': 16,
 'playwright/request_count/resource_type/fetch': 29,
 'playwright/request_count/resource_type/font': 14,
 'playwright/request_count/resource_type/image': 205,
 'playwright/request_count/resource_type/ping': 25,
 'playwright/request_count/resource_type/script': 170,
 'playwright/request_count/resource_type/stylesheet': 38,
 'playwright/request_count/resource_type/xhr': 20,
 'playwright/response_count': 483,
 'playwright/response_count/method/GET': 448,
 'playwright/response_count/method/HEAD': 8,
 'playwright/response_count/method/POST': 27,
 'playwright/response_count/resource_type/document': 14,
 'playwright/response_count/resource_type/fetch': 21,
 'playwright/response_count/resource_type/font': 14,
 'playwright/response_count/resource_type/image': 201,
 'playwright/response_count/resource_type/ping': 22,
 'playwright/response_count/resource_type/script': 156,
 'playwright/response_count/resource_type/stylesheet': 38,
 'playwright/response_count/resource_type/xhr': 17,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2022, 6, 29, 19, 26, 23, 129215)}
2022-06-29 19:26:44 [scrapy.core.engine] INFO: Spider closed (finished)
2022-06-29 19:26:44 [scrapy-playwright] INFO: Closing download handler
2022-06-29 19:26:44 [scrapy-playwright] INFO: Closing download handler
2022-06-29 19:26:44 [scrapy-playwright] INFO: Closing browser

I noticed that if I remove PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None from settings.py, I get this stack trace instead:

2022-06-29 19:35:31 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-737' coro=<Channel.send() done, defined at /usr/local/lib/python3.9/site-packages/playwright/_impl/_connection.py:38> exception=Error('headers[5].value: expected string, got object')>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 39, in send
    return await self.inner_send(method, params, False)
  File "/usr/local/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 63, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: headers[5].value: expected string, got object

But I'm unsure why I would get either of these stack traces. And I'm even being nice and running the requests sequentially (CONCURRENT_REQUESTS = 1). Any idea how to solve these errors?
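As a hedged sketch of one possible workaround for the second error (not from the issue itself): "headers[5].value: expected string, got object" suggests a header value that is not a plain string by the time it reaches Playwright. Scrapy stores each header value as a list of byte strings, while Playwright expects a single string per header, so a custom header-processing function could flatten them explicitly. The `flatten_headers`/`process_headers` names and the settings import path below are hypothetical; the coroutine signature is the one described in the scrapy-playwright README for the 0.0.x series, so check it against the installed version before relying on it.

```python
def flatten_headers(scrapy_headers):
    """Collapse Scrapy-style headers (bytes -> list of bytes) into
    the str -> str mapping Playwright expects."""
    return {
        name.decode("utf-8"): b", ".join(values).decode("utf-8")
        for name, values in scrapy_headers.items()
    }


# Assumed coroutine signature per the scrapy-playwright 0.0.x README:
# (browser_type, playwright_request, scrapy_headers) -> dict
async def process_headers(browser_type, playwright_request, scrapy_headers):
    return flatten_headers(scrapy_headers)


# settings.py (hypothetical module path):
# PLAYWRIGHT_PROCESS_REQUEST_HEADERS = "myproject.headers.process_headers"

if __name__ == "__main__":
    raw = {b"User-Agent": [b"Mozilla/5.0"], b"Accept-Language": [b"en-GB", b"en"]}
    print(flatten_headers(raw))
```

This only guards against non-string header values; it would not explain the original 'NoneType' error, which comes from the response object itself being None inside the handler.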
