Receiving a 400 response after clicking "I agree" on Google's consent form, but not when running through regular Playwright. #97

Closed
LTWood opened this issue Jun 14, 2022 · 8 comments

Comments

@LTWood

LTWood commented Jun 14, 2022

Hi,

I have a strange issue where I am receiving a 400 response from Google after clicking on the "I agree" button on their consent form.

[screenshot: after_span.png]

However, this issue does not appear if I click the "Customise" button, nor does it happen when I perform the request via regular Playwright. At first I thought it might be the proxy I am using, but that also works via regular Playwright.

Playwright code:

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        # Launch Chromium through the same proxy used in the Scrapy run.
        browser = await p.chromium.launch(proxy={
            'server': 'gb.smartproxy.com:30000',
        })
        page = await browser.new_page()
        await page.goto('https://www.google.com/search?q=05055775403308&tbm=shop&uule=w+CAIQICIdTG9uZG9uLEVuZ2xhbmQsVW5pdGVkIEtpbmdkb20=&hl=en&gl=uk')
        # Accept the consent interstitial before waiting for the page to settle.
        await page.click('//span[contains(text(), "I agree")]')
        await page.wait_for_load_state('domcontentloaded')
        await page.screenshot(path='/home/ubuntu/test.png', full_page=True)
        await browser.close()

asyncio.run(main())

scrapy-playwright code:

import scrapy


class GoogleSpider(scrapy.Spider):
    name = "google_spider"
    start_urls = ["data:,"]  # not used: start_requests() below takes precedence

    custom_settings = {
        'PLAYWRIGHT_LAUNCH_OPTIONS': {
            'proxy': {
                'server': 'http://gb.smartproxy.com:30000'
            }
        }
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.google.com/search?q=05055775403308&tbm=shop&uule=w+CAIQICIdTG9uZG9uLEVuZ2xhbmQsVW5pdGVkIEtpbmdkb20=&hl=en&gl=uk',
            callback=self.parse_page,
            meta={
                'playwright': True,
                'playwright_include_page': True,  # expose the Playwright page object in the callback
            }
        )

    async def parse_page(self, response):
        page = response.meta['playwright_page']
        if 'consent' in page.url:
            # Screenshot the consent page, accept it, then screenshot the result.
            await page.screenshot(path='/home/ubuntu/span_button.png', full_page=True)
            await page.click('//span[contains(text(), "I agree")]')
            await page.wait_for_load_state()
            await page.screenshot(path='/home/ubuntu/after_span.png', full_page=True)
        await page.close()

What could be the reason for this? I am probably missing something simple here.

OS: Ubuntu 22.04
Python: 3.8.10
scrapy-playwright: 0.0.17

@elacuesta
Member

From a quick look, it seems like it might be due to the header processing done by scrapy-playwright. I'd suggest looking into the PLAYWRIGHT_PROCESS_REQUEST_HEADERS setting at https://github.com/scrapy-plugins/scrapy-playwright#supported-settings.
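
For reference, that would be a one-line change in settings.py (a minimal sketch; per the README, None means the headers from the Scrapy request are ignored and the browser's own values are used instead):

# Per the scrapy-playwright README: with None, the headers from the
# Scrapy request are ignored and the browser's own values are used instead.
PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None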

@LTWood
Author

LTWood commented Jun 15, 2022

I have tried setting PLAYWRIGHT_PROCESS_REQUEST_HEADERS to None, both in custom_settings and in the settings.py file, but unfortunately I am still receiving the 400 response.

@elacuesta
Member

I'm not able to reproduce: the site does not reply to me with a response that matches your code, i.e. there is no 'consent' in page.url and no "I agree" button. It could be because I'm not using a proxy; I don't have credentials for the one you posted.

@LTWood
Author

LTWood commented Jun 18, 2022

I'm not sure the proxy is the issue. If I don't use a proxy and use a "normal" user agent, I don't get the consent page. However, if I supply the default Scrapy user agent, I do get hit with the consent page, and I still get the 400 response after clicking "I agree". Perhaps this would allow you to reproduce the issue?

Also, would I be correct in saying that with PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None, the User-Agent should not be the default Scrapy user agent, but rather the user agent set by Playwright? When I set it to None and then check the request headers, the user agent is still the default Scrapy one.
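
A quick way to confirm which User-Agent actually reaches the server (an illustrative sketch; it uses httpbin.org as a header echo service and assumes the project is already configured for scrapy-playwright, i.e. the download handlers and asyncio reactor are set):

import scrapy


class UACheckSpider(scrapy.Spider):
    # Hypothetical helper spider, not part of the original report.
    name = "ua_check"
    custom_settings = {"PLAYWRIGHT_PROCESS_REQUEST_HEADERS": None}

    def start_requests(self):
        # httpbin.org/headers echoes the request headers in the response body,
        # so whatever User-Agent actually went out is visible here.
        yield scrapy.Request("https://httpbin.org/headers", meta={"playwright": True})

    def parse(self, response):
        self.logger.info(response.text)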

@elacuesta
Member

Indeed, it seems like the site doesn't like Scrapy's user agent. Besides that, I can't reproduce: with or without PLAYWRIGHT_PROCESS_REQUEST_HEADERS=None, I get no consent page, just a page saying that my search had no results.

Regarding this:

would I be correct in saying that with PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None, the User-Agent should not be the default Scrapy user agent, but rather the user agent set by Playwright? When I set it to None and then check the request headers, the user agent is still the default Scrapy one.

Thanks! You just found a bug: #98
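
As an aside, until that fix lands, a possible workaround (an assumption, not something confirmed in this thread) is to override Scrapy's default User-Agent with a browser-like string in settings.py:

# Hypothetical workaround: replace Scrapy's default
# "Scrapy/x.y (+https://scrapy.org)" User-Agent with a browser-like one.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"
)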

@LTWood
Author

LTWood commented Jun 19, 2022

Thank you for fixing that bug! But this is very strange. I have even started the scrapy project completely from scratch with a minimal script, and I still get either a 400 or a 405 response code, depending on the type of consent page I get. I have attached my logs and script from this minimal setup. As I said, clicking through this consent page works absolutely fine in vanilla Playwright on the same machine, so I'm struggling to wrap my head around why this isn't working.

Spider

import scrapy


class GoogleTestSpider(scrapy.Spider):
    name = 'google_test'
    allowed_domains = ['google.com']

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.google.com/search?q=05055775403308&tbm=shop&uule=w+CAIQICIdTG9uZG9uLEVuZ2xhbmQsVW5pdGVkIEtpbmdkb20=&hl=en&gl=uk',
            callback=self.parse_page,
            meta={
                'playwright': True,
                'playwright_include_page': True
            }
        )

    async def parse_page(self, response):
        print(response.request.headers['User-Agent'])
        page = response.meta['playwright_page']
        # Union XPath covering the known consent-button variants (note the "|" between every alternative).
        xpaths = '//span[contains(text(), "I agree")]|//span[contains(text(), "Accept all")]|//input[@value="I agree"]|//input[@value="Accept all"]|//span[contains(text(), "Reject all")]|//input[@value="Reject all"]'
        if 'consent' in response.url:
            print('hit consent')
            print(response.xpath(xpaths))
            if not response.xpath(xpaths):
                # No known consent button matched; save the page for inspection.
                with open('/home/ubuntu/unknown.html', 'w') as w:
                    w.write(await page.content())
            else:
                print('######### FOUND ###########')
                await page.click(xpaths)
                await page.wait_for_load_state()
                print('####### HAVE CLICKED ######')
                await page.screenshot(path='/home/ubuntu/after.png', full_page=True)
        await page.close()

Settings file

LOG_FILE = '/home/ubuntu/google_scrape.log'

BOT_NAME = 'playwright_test'

SPIDER_MODULES = ['playwright_test.spiders']
NEWSPIDER_MODULE = 'playwright_test.spiders'

DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler'
}

TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None
PLAYWRIGHT_BROWSER_TYPE = 'firefox'

[attachment: google_scrape.log]
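
One more thing that may be worth trying (an assumption, not a confirmed fix): wrapping the consent click in Playwright's expect_navigation context manager, so the form submission and the resulting navigation are awaited explicitly instead of racing wait_for_load_state():

# Sketch of the click step only; the rest of parse_page stays the same.
async with page.expect_navigation():
    await page.click(xpaths)
await page.screenshot(path='/home/ubuntu/after.png', full_page=True)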

@elacuesta
Member

I've just tried this again and I still can't reproduce. With the code from this comment I get a captcha with a message about suspicious traffic. If I remove all query params except the actual search string (the q param), I get a normal page saying there are no results.
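
For clarity, the stripped-down URL (derived from the one above by dropping everything except q) would be just:

https://www.google.com/search?q=05055775403308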

@elacuesta
Member

Closing due to inactivity.

@elacuesta closed this as not planned on Jul 4, 2024