Frequently Asked Questions

How to use scrapy-playwright with the CrawlSpider?

Specify a process_request callable in your crawling rules that modifies each request in place and returns it. For instance:

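from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

def set_playwright_true(request, response):
    # Flag the request so that it is downloaded through Playwright.
    request.meta["playwright"] = True
    return request

class MyCrawlSpider(CrawlSpider):
    ...
    rules = (
        Rule(
            link_extractor=LinkExtractor(...),
            callback="parse_item",
            follow=False,
            process_request=set_playwright_true,
        ),
    )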

How to download all requests using scrapy-playwright?

If you want all requests to be processed by Playwright and don't want to repeat yourself, or you're using a generic spider that doesn't support request customization (e.g. scrapy.spiders.SitemapSpider), you can use a middleware to edit the meta attribute for all requests.

Depending on your project and the interactions with other components, you might decide to use a spider middleware or a downloader middleware.

Spider middleware example:

import scrapy

class PlaywrightSpiderMiddleware:
    def process_spider_output(self, response, result, spider):
        for obj in result:
            # Only requests are flagged; items pass through unchanged.
            if isinstance(obj, scrapy.Request):
                obj.meta.setdefault("playwright", True)
            yield obj

Downloader middleware example:

class PlaywrightDownloaderMiddleware:
    def process_request(self, request, spider):
        request.meta.setdefault("playwright", True)
        return None
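
Whichever middleware you choose must be enabled in the project settings before it takes effect. A minimal sketch, assuming the classes above live in a hypothetical myproject/middlewares.py module; the priority value 543 is an arbitrary placeholder, not a recommendation:

```python
# settings.py (sketch)
# "myproject.middlewares" is a hypothetical module path; adjust it to your project.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.PlaywrightDownloaderMiddleware": 543,
}

# Alternative: enable the spider middleware instead of the downloader one.
# SPIDER_MIDDLEWARES = {
#     "myproject.middlewares.PlaywrightSpiderMiddleware": 543,
# }
```

Because both examples use setdefault, a request that already carries an explicit "playwright" value in its meta (for instance False, to opt out) keeps that value.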

How to increase the allowed memory size for the browser?

If you're seeing messages such as JavaScript heap out of memory, you may be affected by microsoft/playwright#6319. As a workaround, you can increase the amount of memory allowed for the Node.js process by setting the --max-old-space-size V8 option in the NODE_OPTIONS environment variable, e.g.:

$ export NODE_OPTIONS=--max-old-space-size=SIZE  # in megabytes
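
If exporting the variable in the shell is inconvenient, the same option can in principle be set from Python. The sketch below assumes that the Playwright driver process inherits the environment of the Python process and that the code runs before any browser is launched (for example, at the top of settings.py). The helper name ensure_node_heap_limit and the 4096 value are illustrative only:

```python
import os

def ensure_node_heap_limit(size_mb: int) -> str:
    """Add --max-old-space-size to NODE_OPTIONS unless it is already present."""
    flag = "--max-old-space-size"
    current = os.environ.get("NODE_OPTIONS", "")
    if flag not in current:
        # Preserve any options that are already set and append the heap limit.
        current = f"{current} {flag}={size_mb}".strip()
        os.environ["NODE_OPTIONS"] = current
    return current

# 4096 (megabytes) is an illustrative value, not a recommendation.
ensure_node_heap_limit(4096)
```

An existing --max-old-space-size value is left untouched, so a limit exported in the shell still takes precedence.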
