Support proxies out of the box (also: potential problems with header overwrites in download handler) #36

@pawelmhm

I'm trying to use scrapy-playwright with Firefox and proxies, and it's not easy.

In Playwright for Python (and in Node as well), just passing the proxy config to the browser is not enough, because the authorization headers are missing from requests, so I need to set extra headers on the page.

Setting a proxy for Firefox in Playwright (without Scrapy)

It looks like this in pure Playwright (no Scrapy yet):

import asyncio
import logging
import os
import sys

from playwright.async_api import async_playwright
from w3lib.http import basic_auth_header

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)



async def handle_request(request):
    logger.debug(f"Browser request: <"
                      f"{request.method} {request.url}>")


async def handle_response(response):
    # Log each response (status, URL, body) so you can see what
    # the browser is actually doing
    msg = f"Browser crawled ({response.status}): "
    logger.debug(msg + response.url)
    body = await response.body()
    logger.debug(body)


async def main():
    url = 'https://httpbin.org/headers'
    CRAWLERA_APIKEY = os.environ.get('CRAWLERA_APIKEY')
    CRAWLERA_URL = os.environ.get('CRAWLERA_HOST')

    proxy_auth = basic_auth_header(CRAWLERA_APIKEY, '')
    proxy_settings = {
        "proxy": {
            "server": CRAWLERA_URL,
            "username": CRAWLERA_APIKEY,
            "password": ''
        },
        "ignore_https_errors": True
    }

    DEFAULT_HEADERS = {
        'Proxy-Authorization': proxy_auth.decode(),
        "X-Crawlera-Profile": "pass",
        "X-Crawlera-Cookies": "disable",
    }
    async with async_playwright() as p:
        browser_type = p.firefox
        timeout = 90000
        msg = f"starting rendering page with timeout {timeout}ms"
        logger.info(msg)
        # Launching new browser
        browser = await browser_type.launch()
        context = await browser.new_context(**proxy_settings)
        page = await context.new_page()

        # XXX try to run it with/without this line
        # it gives 407 without it, 200 with
        await page.set_extra_http_headers(DEFAULT_HEADERS)

        page.on('request', handle_request)
        page.on('response', handle_response)
        await page.goto(url, timeout=timeout)
        await browser.close()

asyncio.run(main())

Without setting extra headers I get this:

python proxies.py
2021-10-15 12:41:19,819 - __main__ - INFO - starting rendering page with timeout 90000ms
2021-10-15 12:41:21,123 - __main__ - DEBUG - Browser request: <GET https://httpbin.org/headers>
2021-10-15 12:41:21,707 - __main__ - DEBUG - Browser crawled (407): https://httpbin.org/headers
2021-10-15 12:41:21,713 - __main__ - DEBUG - b''

With the extra headers set I get a good response, and I can see the traffic in my proxy logs:

python proxies.py
2021-10-15 12:42:28,019 - __main__ - INFO - starting rendering page with timeout 90000ms
2021-10-15 12:42:29,549 - __main__ - DEBUG - Browser request: <GET https://httpbin.org/headers>
2021-10-15 12:42:30,594 - __main__ - DEBUG - Browser crawled (200): https://httpbin.org/headers
2021-10-15 12:42:30,597 - __main__ - DEBUG - b'{\n  "headers": {\n    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", \n    "Accept-Encoding": "gzip, deflate, br", \n    "Accept-Language": "en-US,en;q=0.5", \n    "Host": "httpbin.org", \n    "Sec-Fetch-Dest": "document", \n    "Sec-Fetch-Mode": "navigate", \n    "Sec-Fetch-Site": "cross-site", \n    "Upgrade-Insecure-Requests": "1", \n    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0", \n    "X-Amzn-Trace-Id": "Root=1-61695b16-097385cb043d01b63d71eb58"\n  }\n}\n'

Setting a proxy for scrapy-playwright

Now I'm trying to do the same thing in scrapy-playwright, and I'm running into problems.

I cannot easily set extra headers here. I can set page event handlers, but according to the docs the request event handler does not allow modifying the request object. Setting extra headers has to happen after the context is created and before page.goto(), and there is no easy way to do that from the spider right now, unless I'm missing something. If I am, let me know.
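In pure Playwright, the usual way to rewrite outgoing requests is a route handler that merges headers via route.continue_(). A minimal sketch (not directly usable from a spider, since scrapy-playwright installs its own route on the page; DEFAULT_HEADERS is the dict from the standalone script above):

# Sketch: header injection with Playwright's routing API, in pure
# Playwright (not scrapy-playwright).
from playwright.async_api import Route, Request


async def add_proxy_headers(route: Route, request: Request) -> None:
    # Merge the proxy auth headers into the browser request's
    # headers and let the request continue.
    await route.continue_(headers={**request.headers, **DEFAULT_HEADERS})

# inside main(), after creating the page:
#     await page.route("**/*", add_proxy_headers)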

To bypass this limitation in scrapy-playwright, I subclassed the download handler:

from scrapy import Request
from scrapy_playwright.handler import ScrapyPlaywrightDownloadHandler

from properties.settings import DEFAULT_PLAYWRIGHT_PROXY_HEADERS


class SetHeadersDownloadHandler(ScrapyPlaywrightDownloadHandler):
    async def _create_page(self, request: Request):
        page = await super()._create_page(request)
        await page.set_extra_http_headers(DEFAULT_PLAYWRIGHT_PROXY_HEADERS)
        return page

and pointed DOWNLOAD_HANDLERS at it in the settings. This is a hack, as _create_page is not meant to be overridden, but it does work for setting the authorization headers. Still, I get 407.
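(For reference, DEFAULT_PLAYWRIGHT_PROXY_HEADERS is essentially the same dict as DEFAULT_HEADERS in the standalone script above; the sketch below assumes that, the exact values aren't shown elsewhere in this issue.)

# properties/settings.py -- sketch; mirrors DEFAULT_HEADERS from
# the standalone script, exact values are an assumption
import os

from w3lib.http import basic_auth_header

CRAWLERA_APIKEY = os.environ.get('CRAWLERA_APIKEY')

DEFAULT_PLAYWRIGHT_PROXY_HEADERS = {
    'Proxy-Authorization': basic_auth_header(CRAWLERA_APIKEY, '').decode(),
    'X-Crawlera-Profile': 'pass',
    'X-Crawlera-Cookies': 'disable',
}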

The only way I can make it work is by disabling the scrapy-playwright route handler, so I comment out these lines: https://github.com/scrapy-plugins/scrapy-playwright/blob/master/scrapy_playwright/handler.py#L151 (lines 151 to 161).

Now I get the proper result: the spider gets 200 responses, there are no 407s in the logs, and traffic goes through the proxy.
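A less invasive way to get the same effect, instead of editing the library source, might be to unroute the page in the subclass. This is an untested sketch: the class name is made up, and it assumes the handler registers its route with the "**" pattern.

from scrapy import Request
from scrapy_playwright.handler import ScrapyPlaywrightDownloadHandler

from properties.settings import DEFAULT_PLAYWRIGHT_PROXY_HEADERS


class UnrouteDownloadHandler(ScrapyPlaywrightDownloadHandler):
    async def _create_page(self, request: Request):
        page = await super()._create_page(request)
        # Remove the handler's own route before setting the headers;
        # "**" is assumed to match the pattern the handler registered.
        await page.unroute("**")
        await page.set_extra_http_headers(DEFAULT_PLAYWRIGHT_PROXY_HEADERS)
        return page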

Suggestions

  • Add built-in support for proxy settings in scrapy-playwright. To set a proxy properly for different browsers, users could just set PLAYWRIGHT_PROXY_HOST, PLAYWRIGHT_PROXY_USERNAME, etc., and scrapy-playwright would do everything it needs inside the download handler (see the sketch after this list). I tested with Firefox, but I know that in Chrome you may need to pass different settings to the context, and different browsers take different arguments. This will be a pain to set up for most users; doing it in scrapy-playwright would make things easy for people.
  • Find out why await page.unroute and page.set_extra_http_headers in the handler seem to interfere with each other. I don't really understand it well: I can see in my spider's log output that the request is made with the authorization header, but the proxy still responds with 407. I need to step through make_request_handler and find out what's wrong here. I'll try to do that next week and publish my findings.
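A rough sketch of what the first suggestion could look like inside the download handler; the PLAYWRIGHT_PROXY_* setting names are the proposal above, not an existing scrapy-playwright API:

# Hypothetical sketch of suggestion 1; none of these settings exist
# in scrapy-playwright today.
def proxy_config_from_settings(settings) -> dict:
    server = settings.get("PLAYWRIGHT_PROXY_HOST")
    if not server:
        return {}
    # Playwright accepts this dict as the "proxy" argument of
    # browser_type.launch() / browser.new_context().
    return {
        "proxy": {
            "server": server,
            "username": settings.get("PLAYWRIGHT_PROXY_USERNAME", ""),
            "password": settings.get("PLAYWRIGHT_PROXY_PASSWORD", ""),
        }
    }

The handler could merge this into the context kwargs, and add the Proxy-Authorization header itself for browsers (like Firefox here) that need it.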

Scrapy spider code:

import logging
import os

import scrapy
from playwright.async_api import Response, Request

logger = logging.getLogger(__name__)


async def handle_response(response: Response):
    logger.info(response.url + " " + str(response.status))
    logger.info(response.headers)


async def handle_request(request: Request):
    logger.info(request.headers)


CRAWLERA_APIKEY = os.environ.get('CRAWLERA_APIKEY')
CRAWLERA_URL = os.environ.get('CRAWLERA_HOST')


class SomeSpider(scrapy.Spider):
    name = 'example'
    start_urls = [
        "http://httpbin.org/headers"
    ]
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "some_project.downloader.SetHeadersDownloadHandler",
            "https": "some_project.downloader.SetHeadersDownloadHandler"
        },
        "PLAYWRIGHT_CONTEXTS": {
            1: {
                "ignore_https_errors": True,
                "proxy": {
                    "server": CRAWLERA_URL,
                    "username": CRAWLERA_APIKEY,
                    "password": "",
                }
            }
        }
    }
    default_meta = {
        "playwright": True,
        "playwright_context": 1,
        "playwright_page_event_handlers": {
            "response": handle_response,
            "request": handle_request
        }
    }

    def start_requests(self):
        for x in self.start_urls:
            yield scrapy.Request(
                x, meta=self.default_meta
            )

    def parse(self, response):
        for url in ["http://httpbin.org/get", "http://httpbin.org/ip"]:
            yield scrapy.Request(url, callback=self.hello,
                                 meta=self.default_meta)

    def hello(self, response):
        logger.debug(response.body)
