Receiving a 400 response after clicking "I agree" on Google's consent form, but not when running through regular Playwright. #97

Closed
LTWood opened this issue Jun 14, 2022 · 8 comments

Comments

@LTWood

LTWood commented Jun 14, 2022

Hi,

I have a strange issue where I am receiving a 400 response from Google after clicking on the "I agree" button on their consent form.

[screenshot: after_span.png]

However, this issue does not appear if I click the "Customise" button, nor does it happen when I perform the request via regular Playwright. At first I thought it might be the proxy I am using, but that also works via regular Playwright.

Playwright code:

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        # Launch Chromium through the same proxy used in the Scrapy run.
        browser = await p.chromium.launch(proxy={
            'server': 'gb.smartproxy.com:30000',
        })
        page = await browser.new_page()
        await page.goto('https://www.google.com/search?q=05055775403308&tbm=shop&uule=w+CAIQICIdTG9uZG9uLEVuZ2xhbmQsVW5pdGVkIEtpbmdkb20=&hl=en&gl=uk')
        # Accept the consent interstitial before waiting for the page to settle.
        await page.click('//span[contains(text(), "I agree")]')
        await page.wait_for_load_state('domcontentloaded')
        await page.screenshot(path='/home/ubuntu/test.png', full_page=True)
        await browser.close()

asyncio.run(main())

scrapy-playwright code:

import scrapy


class GoogleSpider(scrapy.Spider):
    name = "google_spider"
    start_urls = ["data:,"]  # not used: start_requests() below takes precedence

    custom_settings = {
        'PLAYWRIGHT_LAUNCH_OPTIONS': {
            'proxy': {
                'server': 'http://gb.smartproxy.com:30000'
            }
        }
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.google.com/search?q=05055775403308&tbm=shop&uule=w+CAIQICIdTG9uZG9uLEVuZ2xhbmQsVW5pdGVkIEtpbmdkb20=&hl=en&gl=uk',
            callback=self.parse_page,
            meta={
                'playwright': True,
                'playwright_include_page': True,  # expose the Playwright page object in the callback
            }
        )

    async def parse_page(self, response):
        page = response.meta['playwright_page']
        if 'consent' in page.url:
            # Screenshot the consent page, accept it, then screenshot the result.
            await page.screenshot(path='/home/ubuntu/span_button.png', full_page=True)
            await page.click('//span[contains(text(), "I agree")]')
            await page.wait_for_load_state()
            await page.screenshot(path='/home/ubuntu/after_span.png', full_page=True)
        await page.close()

What could be the reason for this? I am probably missing something simple here.

OS: Ubuntu 22.04
Python: 3.8.10
scrapy-playwright: 0.0.17

@elacuesta
Member

From a quick look, it seems like it might be due to the header processing done by scrapy-playwright. I'd suggest looking into the PLAYWRIGHT_PROCESS_REQUEST_HEADERS setting at https://github.com/scrapy-plugins/scrapy-playwright#supported-settings.
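
For reference, that would be a one-line change in settings.py (a minimal sketch; per the README, None means the headers from the Scrapy request are ignored and the browser's own values are used instead):

# Per the scrapy-playwright README: with None, the headers from the
# Scrapy request are ignored and the browser's own values are used instead.
PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None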

@LTWood
Author

LTWood commented Jun 15, 2022

I have tried setting PLAYWRIGHT_PROCESS_REQUEST_HEADERS to None, both in custom_settings and in the settings.py file, but unfortunately I am still receiving the 400 response.

@elacuesta
Member

I'm not able to reproduce: the site does not reply to me with a response that matches your code, i.e. there is no 'consent' in page.url and no "I agree" button. It could be because I'm not using a proxy; I don't have credentials for the one you posted.

@LTWood
Author

LTWood commented Jun 18, 2022

I'm not sure the proxy is the issue. If I don't use a proxy and use a "normal" user agent, I don't get the consent page. However, if I supply the default Scrapy user agent, I do get hit with the consent page, and I still get the 400 response after clicking "I agree". Perhaps this would allow you to reproduce the issue?

Also, would I be correct in saying that with PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None, the User-Agent should not be the default Scrapy user agent, but rather the user agent set by Playwright? When I set it to None and then check the request headers, the user agent is still the default Scrapy one.
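
A quick way to confirm which User-Agent actually reaches the server (an illustrative sketch; it uses httpbin.org as a header echo service and assumes the project is already configured for scrapy-playwright, i.e. the download handlers and asyncio reactor are set):

import scrapy


class UACheckSpider(scrapy.Spider):
    # Hypothetical helper spider, not part of the original report.
    name = "ua_check"
    custom_settings = {"PLAYWRIGHT_PROCESS_REQUEST_HEADERS": None}

    def start_requests(self):
        # httpbin.org/headers echoes the request headers in the response body,
        # so whatever User-Agent actually went out is visible here.
        yield scrapy.Request("https://httpbin.org/headers", meta={"playwright": True})

    def parse(self, response):
        self.logger.info(response.text)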

@elacuesta
Member

Indeed, it seems like the site doesn't like Scrapy's user agent. Besides that, I can't reproduce: with or without PLAYWRIGHT_PROCESS_REQUEST_HEADERS=None, I get no consent page, just a page saying that my search had no results.

Regarding this:

would I be correct in saying that with PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None, the User-Agent should not be the default Scrapy user agent, but rather the user agent set by Playwright? When I set it to None and then check the request headers, the user agent is still the default Scrapy one.

Thanks! You just found a bug: #98
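
As an aside, until that fix lands, a possible workaround (an assumption, not something confirmed in this thread) is to override Scrapy's default User-Agent with a browser-like string in settings.py:

# Hypothetical workaround: replace Scrapy's default
# "Scrapy/x.y (+https://scrapy.org)" User-Agent with a browser-like one.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"
)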

@LTWood
Author

LTWood commented Jun 19, 2022

Thank you for fixing that bug! But this is very strange. I have even started the scrapy project completely from scratch with a minimal script, and I still get either a 400 or a 405 response code, depending on the type of consent page I get. I have attached my logs and script from this minimal setup. As I said, clicking through this consent page works absolutely fine in vanilla Playwright on the same machine, so I'm struggling to wrap my head around why this isn't working.

Spider

import scrapy


class GoogleTestSpider(scrapy.Spider):
    name = 'google_test'
    allowed_domains = ['google.com']

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.google.com/search?q=05055775403308&tbm=shop&uule=w+CAIQICIdTG9uZG9uLEVuZ2xhbmQsVW5pdGVkIEtpbmdkb20=&hl=en&gl=uk',
            callback=self.parse_page,
            meta={
                'playwright': True,
                'playwright_include_page': True
            }
        )

    async def parse_page(self, response):
        print(response.request.headers['User-Agent'])
        page = response.meta['playwright_page']
        # Union XPath covering the known consent-button variants (note the "|" between every alternative).
        xpaths = '//span[contains(text(), "I agree")]|//span[contains(text(), "Accept all")]|//input[@value="I agree"]|//input[@value="Accept all"]|//span[contains(text(), "Reject all")]|//input[@value="Reject all"]'
        if 'consent' in response.url:
            print('hit consent')
            print(response.xpath(xpaths))
            if not response.xpath(xpaths):
                # No known consent button matched; save the page for inspection.
                with open('/home/ubuntu/unknown.html', 'w') as w:
                    w.write(await page.content())
            else:
                print('######### FOUND ###########')
                await page.click(xpaths)
                await page.wait_for_load_state()
                print('####### HAVE CLICKED ######')
                await page.screenshot(path='/home/ubuntu/after.png', full_page=True)
        await page.close()

Settings file

LOG_FILE = '/home/ubuntu/google_scrape.log'

BOT_NAME = 'playwright_test'

SPIDER_MODULES = ['playwright_test.spiders']
NEWSPIDER_MODULE = 'playwright_test.spiders'

DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler'
}

TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None
PLAYWRIGHT_BROWSER_TYPE = 'firefox'

[attachment: google_scrape.log]
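
One more thing that may be worth trying (an assumption, not a confirmed fix): wrapping the consent click in Playwright's expect_navigation context manager, so the form submission and the resulting navigation are awaited explicitly instead of racing wait_for_load_state():

# Sketch of the click step only; the rest of parse_page stays the same.
async with page.expect_navigation():
    await page.click(xpaths)
await page.screenshot(path='/home/ubuntu/after.png', full_page=True)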

@elacuesta
Member

I've just tried this again and I still can't reproduce. With the code from this comment I get a captcha with a message about suspicious traffic. If I remove all query params except the actual search string (the q param), I get a normal page saying there are no results.
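
For clarity, the stripped-down URL (derived from the one above by dropping everything except q) would be just:

https://www.google.com/search?q=05055775403308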

@elacuesta
Member

Closing due to inactivity.

@elacuesta closed this as not planned on Jul 4, 2024