Description
I'm trying to use scrapy-playwright with Firefox and proxies, and it's not easy.
In Playwright for Python (and in Node as well), just passing the proxy config (server, username, password) is not enough: the authorization headers are missing from the request, so I need to set extra headers on the page.
setting proxy for firefox in playwright (without scrapy)
It looks like this in pure Playwright (no Scrapy yet):
import asyncio
import logging
import os
import re
import sys

from playwright.async_api import async_playwright
from w3lib.http import basic_auth_header

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)


async def handle_request(request):
    logger.debug(f"Browser request: <"
                 f"{request.method} {request.url}>")


async def handle_response(response):
    # Log responses, just so you know what's going on when Scrapy
    # seems to be inactive
    msg = f"Browser crawled ({response.status}): "
    logger.debug(msg + response.url)
    body = await response.body()
    logger.debug(body)


async def main():
    url = 'https://httpbin.org/headers'
    CRAWLERA_APIKEY = os.environ.get('CRAWLERA_APIKEY')
    CRAWLERA_URL = os.environ.get('CRAWLERA_HOST')
    proxy_auth = basic_auth_header(CRAWLERA_APIKEY, '')
    proxy_settings = {
        "proxy": {
            "server": CRAWLERA_URL,
            "username": CRAWLERA_APIKEY,
            "password": ''
        },
        "ignore_https_errors": True
    }
    DEFAULT_HEADERS = {
        'Proxy-Authorization': proxy_auth.decode(),
        "X-Crawlera-Profile": "pass",
        "X-Crawlera-Cookies": "disable",
    }
    async with async_playwright() as p:
        browser_type = p.firefox
        timeout = 90000
        msg = f"starting rendering page with timeout {timeout}ms"
        logger.info(msg)
        # Launching new browser
        browser = await browser_type.launch()
        context = await browser.new_context(**proxy_settings)
        page = await context.new_page()
        # XXX try to run it with/without this line
        # it gives 407 without it, 200 with
        await page.set_extra_http_headers(DEFAULT_HEADERS)
        page.on('request', handle_request)
        page.on('response', handle_response)
        await page.goto(url, timeout=timeout)


asyncio.run(main())

Without setting the extra headers I get this:
python proxies.py
2021-10-15 12:41:19,819 - __main__ - INFO - starting rendering page with timeout 90000ms
2021-10-15 12:41:21,123 - __main__ - DEBUG - Browser request: <GET https://httpbin.org/headers>
2021-10-15 12:41:21,707 - __main__ - DEBUG - Browser crawled (407): https://httpbin.org/headers
2021-10-15 12:41:21,713 - __main__ - DEBUG - b''
With the extra headers set I get a good response, and I can see the traffic in my proxy logs:
python proxies.py
2021-10-15 12:42:28,019 - __main__ - INFO - starting rendering page with timeout 90000ms
2021-10-15 12:42:29,549 - __main__ - DEBUG - Browser request: <GET https://httpbin.org/headers>
2021-10-15 12:42:30,594 - __main__ - DEBUG - Browser crawled (200): https://httpbin.org/headers
2021-10-15 12:42:30,597 - __main__ - DEBUG - b'{\n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", \n "Accept-Encoding": "gzip, deflate, br", \n "Accept-Language": "en-US,en;q=0.5", \n "Host": "httpbin.org", \n "Sec-Fetch-Dest": "document", \n "Sec-Fetch-Mode": "navigate", \n "Sec-Fetch-Site": "cross-site", \n "Upgrade-Insecure-Requests": "1", \n "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0", \n "X-Amzn-Trace-Id": "Root=1-61695b16-097385cb043d01b63d71eb58"\n }\n}\n'
setting proxy for scrapy-playwright
Now I'm trying to do the same thing in scrapy-playwright, and I run into problems.
I cannot easily set extra headers here. I can set event handlers, but according to the docs the request event handler does not allow modifying the request object. Setting extra headers needs to happen after the context is created and before page.goto(), and there is currently no easy way to do that from the spider, unless I'm missing something. If I am, please let me know.
To work around this I subclassed the download handler:
from scrapy import Request
from scrapy_playwright.handler import ScrapyPlaywrightDownloadHandler

from properties.settings import DEFAULT_PLAYWRIGHT_PROXY_HEADERS


class SetHeadersDownloadHandler(ScrapyPlaywrightDownloadHandler):
    async def _create_page(self, request: Request):
        page = await super()._create_page(request)
        await page.set_extra_http_headers(DEFAULT_PLAYWRIGHT_PROXY_HEADERS)
        return page

and defined it in the settings. This is a hack, since _create_page is not meant to be overridden, but it works for setting the authorization header. Still, I get 407.
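For reference, the settings wiring looks roughly like this (a sketch assembled from the snippets in this issue; module paths will differ in your project):

import os

from w3lib.http import basic_auth_header

CRAWLERA_APIKEY = os.environ.get('CRAWLERA_APIKEY')

# Headers the subclassed handler applies to every Playwright page.
DEFAULT_PLAYWRIGHT_PROXY_HEADERS = {
    'Proxy-Authorization': basic_auth_header(CRAWLERA_APIKEY, '').decode(),
    'X-Crawlera-Profile': 'pass',
    'X-Crawlera-Cookies': 'disable',
}

# Point both schemes at the subclass instead of the default handler.
DOWNLOAD_HANDLERS = {
    'http': 'some_project.downloader.SetHeadersDownloadHandler',
    'https': 'some_project.downloader.SetHeadersDownloadHandler',
}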
The only way I can make it work is by disabling the scrapy-playwright route handler, so I comment out lines 151 to 161 here: https://github.com/scrapy-plugins/scrapy-playwright/blob/master/scrapy_playwright/handler.py#L151
Now I get proper results: the spider gets 200 responses, there are no 407s in the logs, and the traffic goes through the proxy.
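A minimal standalone sketch of the kind of interaction I suspect (an assumption on my part, not verified against the actual handler code): if a route handler continues the request with an explicit headers dict, that dict replaces everything set earlier via page.set_extra_http_headers, including Proxy-Authorization.

import asyncio

from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        browser = await p.firefox.launch()
        page = await browser.new_page()
        # Header added up front, the way the subclassed handler does it.
        await page.set_extra_http_headers({"Proxy-Authorization": "Basic dummy"})

        async def override_headers(route, request):
            # Continuing with an explicit headers dict rebuilds the request
            # headers from scratch, so the Proxy-Authorization header added
            # above is no longer part of the outgoing request.
            await route.continue_(headers={"User-Agent": "my-spider"})

        await page.route("**", override_headers)
        await page.goto("https://httpbin.org/headers")
        print(await page.content())  # inspect which headers actually arrived
        await browser.close()


asyncio.run(main())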
suggestions
- add built-in support for proxy settings to scrapy-playwright. To set the proxy properly for different browsers, users could just set PLAYWRIGHT_PROXY_HOST, PLAYWRIGHT_PROXY_USERNAME etc., and scrapy-playwright would do everything it needs inside the download handler. I tested with Firefox, but I know that in Chrome you may need to pass different settings to the context; different browsers take different arguments. This will be a pain to set up for most users, so doing it in scrapy-playwright will make things easy for people (see the sketch after this list).
- find out why await page.unroute and page.set_extra_http_headers in the handler seem to interfere with each other. I don't really understand it well: I can see in my spider's log output that the request is made with the authorization header, yet the proxy still responds with 407. I need to step through make_request_handler to find out what's wrong here. I'll try to do that next week and publish my findings.
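For the first suggestion, something like this is what I have in mind. This is only a rough sketch, not actual scrapy-playwright API: the setting names come from the suggestion above, and the helper name is made up.

from w3lib.http import basic_auth_header


def _proxy_config_from_settings(settings):
    # Hypothetical helper: build the Playwright context "proxy" dict and the
    # extra headers the download handler would apply before navigation.
    host = settings.get("PLAYWRIGHT_PROXY_HOST")
    if not host:
        return None, {}
    username = settings.get("PLAYWRIGHT_PROXY_USERNAME", "")
    password = settings.get("PLAYWRIGHT_PROXY_PASSWORD", "")
    proxy = {"server": host, "username": username, "password": password}
    headers = {"Proxy-Authorization": basic_auth_header(username, password).decode()}
    return proxy, headers

# The handler would merge `proxy` into each context's kwargs and call
# page.set_extra_http_headers(headers) on every new page, handling any
# browser-specific differences (Firefox vs Chromium) internally.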
Scrapy spider code:
import logging
import os

import scrapy
from playwright.async_api import Response, Request

logger = logging.getLogger(__name__)


async def handle_response(response: Response):
    logger.info(response.url + " " + str(response.status))
    logger.info(response.headers)
    return


async def handle_request(request: Request):
    logger.info(request.headers)


CRAWLERA_APIKEY = os.environ.get('CRAWLERA_APIKEY')
CRAWLERA_URL = os.environ.get('CRAWLERA_HOST')


class SomeSpider(scrapy.Spider):
    name = 'example'
    start_urls = [
        "http://httpbin.org/headers"
    ]
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "some_project.downloader.SetHeadersDownloadHandler",
            "https": "some_project.downloader.SetHeadersDownloadHandler"
        },
        "PLAYWRIGHT_CONTEXTS": {
            1: {
                "ignore_https_errors": True,
                "proxy": {
                    "server": CRAWLERA_URL,
                    "username": CRAWLERA_APIKEY,
                    "password": "",
                }
            }
        }
    }
    default_meta = {
        "playwright": True,
        "playwright_context": 1,
        "playwright_page_event_handlers": {
            "response": handle_response,
            "request": handle_request
        }
    }

    def start_requests(self):
        for x in self.start_urls:
            yield scrapy.Request(
                x, meta=self.default_meta
            )

    def parse(self, response):
        for url in ["http://httpbin.org/get", "http://httpbin.org/ip"]:
            yield scrapy.Request(url, callback=self.hello,
                                 meta=self.default_meta)

    def hello(self, response):
        logger.debug(response.body)