Description
I'm trying to use scrapy-playwright with Firefox and proxies, and it's not easy.
In Playwright for Python (and in Node as well), just passing the proxy config (server, username, password) is not enough: the authorization headers are missing from the request, so I need to set extra headers on the page.
setting proxy for firefox in playwright (without scrapy)
It looks like this in pure Playwright (no Scrapy yet):
import asyncio
import logging
import os
import re
import sys

from playwright.async_api import async_playwright
from w3lib.http import basic_auth_header

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)


async def handle_request(request):
    logger.debug(f"Browser request: <"
                 f"{request.method} {request.url}>")


async def handle_response(response):
    # Log responses, just so you know what's going on when Scrapy
    # seems to be inactive
    msg = f"Browser crawled ({response.status}): "
    logger.debug(msg + response.url)
    body = await response.body()
    logger.debug(body)


async def main():
    url = 'https://httpbin.org/headers'
    CRAWLERA_APIKEY = os.environ.get('CRAWLERA_APIKEY')
    CRAWLERA_URL = os.environ.get('CRAWLERA_HOST')
    proxy_auth = basic_auth_header(CRAWLERA_APIKEY, '')
    proxy_settings = {
        "proxy": {
            "server": CRAWLERA_URL,
            "username": CRAWLERA_APIKEY,
            "password": ''
        },
        "ignore_https_errors": True
    }
    DEFAULT_HEADERS = {
        'Proxy-Authorization': proxy_auth.decode(),
        "X-Crawlera-Profile": "pass",
        "X-Crawlera-Cookies": "disable",
    }
    async with async_playwright() as p:
        browser_type = p.firefox
        timeout = 90000
        msg = f"starting rendering page with timeout {timeout}ms"
        logger.info(msg)
        # Launching new browser
        browser = await browser_type.launch()
        context = await browser.new_context(**proxy_settings)
        page = await context.new_page()
        # XXX try to run it with/without this line
        # it gives 407 without it, 200 with
        await page.set_extra_http_headers(DEFAULT_HEADERS)
        page.on('request', handle_request)
        page.on('response', handle_response)
        await page.goto(url, timeout=timeout)


asyncio.run(main())

Without setting the extra headers I get this:
python proxies.py
2021-10-15 12:41:19,819 - __main__ - INFO - starting rendering page with timeout 90000ms
2021-10-15 12:41:21,123 - __main__ - DEBUG - Browser request: <GET https://httpbin.org/headers>
2021-10-15 12:41:21,707 - __main__ - DEBUG - Browser crawled (407): https://httpbin.org/headers
2021-10-15 12:41:21,713 - __main__ - DEBUG - b''
With the extra headers set I get a good response, and I can see the traffic in my proxy logs:
python proxies.py
2021-10-15 12:42:28,019 - __main__ - INFO - starting rendering page with timeout 90000ms
2021-10-15 12:42:29,549 - __main__ - DEBUG - Browser request: <GET https://httpbin.org/headers>
2021-10-15 12:42:30,594 - __main__ - DEBUG - Browser crawled (200): https://httpbin.org/headers
2021-10-15 12:42:30,597 - __main__ - DEBUG - b'{\n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", \n "Accept-Encoding": "gzip, deflate, br", \n "Accept-Language": "en-US,en;q=0.5", \n "Host": "httpbin.org", \n "Sec-Fetch-Dest": "document", \n "Sec-Fetch-Mode": "navigate", \n "Sec-Fetch-Site": "cross-site", \n "Upgrade-Insecure-Requests": "1", \n "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0", \n "X-Amzn-Trace-Id": "Root=1-61695b16-097385cb043d01b63d71eb58"\n }\n}\n'
setting proxy for scrapy-playwright
Now I'm trying to do the same thing in scrapy-playwright, and I run into problems.
I cannot easily set extra headers here. I can set event handlers, but according to the docs the request event handler does not allow modifying the request object. Setting extra headers needs to happen after the context is created and before page.goto(), and there is currently no easy way to do that from the spider, unless I'm missing something. If I am, please let me know.
To work around this I subclassed the download handler:
from scrapy import Request
from scrapy_playwright.handler import ScrapyPlaywrightDownloadHandler

from properties.settings import DEFAULT_PLAYWRIGHT_PROXY_HEADERS


class SetHeadersDownloadHandler(ScrapyPlaywrightDownloadHandler):
    async def _create_page(self, request: Request):
        page = await super()._create_page(request)
        await page.set_extra_http_headers(DEFAULT_PLAYWRIGHT_PROXY_HEADERS)
        return page

and defined it in the settings. This is a hack, since _create_page is not meant to be overridden, but it works for setting the authorization header. Still, I get 407.
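For reference, the settings wiring looks roughly like this (a sketch assembled from the snippets in this issue; module paths will differ in your project):

import os

from w3lib.http import basic_auth_header

CRAWLERA_APIKEY = os.environ.get('CRAWLERA_APIKEY')

# Headers the subclassed handler applies to every Playwright page.
DEFAULT_PLAYWRIGHT_PROXY_HEADERS = {
    'Proxy-Authorization': basic_auth_header(CRAWLERA_APIKEY, '').decode(),
    'X-Crawlera-Profile': 'pass',
    'X-Crawlera-Cookies': 'disable',
}

# Point both schemes at the subclass instead of the default handler.
DOWNLOAD_HANDLERS = {
    'http': 'some_project.downloader.SetHeadersDownloadHandler',
    'https': 'some_project.downloader.SetHeadersDownloadHandler',
}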
The only way I can make it work is by disabling the scrapy-playwright route handler, so I comment out lines 151 to 161 here: https://github.com/scrapy-plugins/scrapy-playwright/blob/master/scrapy_playwright/handler.py#L151
Now I get proper results: the spider gets 200 responses, there are no 407s in the logs, and the traffic goes through the proxy.
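A minimal standalone sketch of the kind of interaction I suspect (an assumption on my part, not verified against the actual handler code): if a route handler continues the request with an explicit headers dict, that dict replaces everything set earlier via page.set_extra_http_headers, including Proxy-Authorization.

import asyncio

from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        browser = await p.firefox.launch()
        page = await browser.new_page()
        # Header added up front, the way the subclassed handler does it.
        await page.set_extra_http_headers({"Proxy-Authorization": "Basic dummy"})

        async def override_headers(route, request):
            # Continuing with an explicit headers dict rebuilds the request
            # headers from scratch, so the Proxy-Authorization header added
            # above is no longer part of the outgoing request.
            await route.continue_(headers={"User-Agent": "my-spider"})

        await page.route("**", override_headers)
        await page.goto("https://httpbin.org/headers")
        print(await page.content())  # inspect which headers actually arrived
        await browser.close()


asyncio.run(main())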
suggestions
- add built-in support for proxy settings to scrapy-playwright. To set the proxy properly for different browsers, users could just set PLAYWRIGHT_PROXY_HOST, PLAYWRIGHT_PROXY_USERNAME etc., and scrapy-playwright would do everything it needs inside the download handler. I tested with Firefox, but I know that in Chrome you may need to pass different settings to the context; different browsers take different arguments. This will be a pain to set up for most users, so doing it in scrapy-playwright will make things easy for people (see the sketch after this list).
- find out why await page.unroute and page.set_extra_http_headers in the handler seem to interfere with each other. I don't really understand it well: I can see in my spider's log output that the request is made with the authorization header, yet the proxy still responds with 407. I need to step through make_request_handler to find out what's wrong here. I'll try to do that next week and publish my findings.
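For the first suggestion, something like this is what I have in mind. This is only a rough sketch, not actual scrapy-playwright API: the setting names come from the suggestion above, and the helper name is made up.

from w3lib.http import basic_auth_header


def _proxy_config_from_settings(settings):
    # Hypothetical helper: build the Playwright context "proxy" dict and the
    # extra headers the download handler would apply before navigation.
    host = settings.get("PLAYWRIGHT_PROXY_HOST")
    if not host:
        return None, {}
    username = settings.get("PLAYWRIGHT_PROXY_USERNAME", "")
    password = settings.get("PLAYWRIGHT_PROXY_PASSWORD", "")
    proxy = {"server": host, "username": username, "password": password}
    headers = {"Proxy-Authorization": basic_auth_header(username, password).decode()}
    return proxy, headers

# The handler would merge `proxy` into each context's kwargs and call
# page.set_extra_http_headers(headers) on every new page, handling any
# browser-specific differences (Firefox vs Chromium) internally.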
Scrapy spider code:
import logging
import os

import scrapy
from playwright.async_api import Response, Request

logger = logging.getLogger(__name__)


async def handle_response(response: Response):
    logger.info(response.url + " " + str(response.status))
    logger.info(response.headers)
    return


async def handle_request(request: Request):
    logger.info(request.headers)


CRAWLERA_APIKEY = os.environ.get('CRAWLERA_APIKEY')
CRAWLERA_URL = os.environ.get('CRAWLERA_HOST')


class SomeSpider(scrapy.Spider):
    name = 'example'
    start_urls = [
        "http://httpbin.org/headers"
    ]
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "some_project.downloader.SetHeadersDownloadHandler",
            "https": "some_project.downloader.SetHeadersDownloadHandler"
        },
        "PLAYWRIGHT_CONTEXTS": {
            1: {
                "ignore_https_errors": True,
                "proxy": {
                    "server": CRAWLERA_URL,
                    "username": CRAWLERA_APIKEY,
                    "password": "",
                }
            }
        }
    }
    default_meta = {
        "playwright": True,
        "playwright_context": 1,
        "playwright_page_event_handlers": {
            "response": handle_response,
            "request": handle_request
        }
    }

    def start_requests(self):
        for x in self.start_urls:
            yield scrapy.Request(
                x, meta=self.default_meta
            )

    def parse(self, response):
        for url in ["http://httpbin.org/get", "http://httpbin.org/ip"]:
            yield scrapy.Request(url, callback=self.hello,
                                 meta=self.default_meta)

    def hello(self, response):
        logger.debug(response.body)