is it possible to use playwright-stealth with the scrapy-playwright integration? #160

Chryron · 2023-01-26T12:26:09Z

I'm trying to get past some Cloudflare restrictions on a site with scrapy-playwright, and I was wondering whether it is possible to use playwright-extras and its stealth plugin with this integration. The plugin is currently in beta (development here) and, from my limited understanding, serves as a drop-in replacement for regular playwright. I haven't used the original playwright much, and I was wondering whether some of the changes they've made could be ported over to the scrapy integration.

elacuesta · 2023-01-26T18:12:27Z

There was a PR about adding built-in support for a playwright-stealth Python port at #109. The PR was closed, but it is possible to use it after #128 as shown in #109 (comment). Perhaps this is enough for your case.

The plugin you mention is the upstream JS version of the above port. As you said, it seems to be a replacement for the JS playwright package, which is in turn used by the Python version, but I don't have any plans to base this package on anything other than the official Python version of playwright. AFAICT the playwright-stealth Python port works by adding init scripts to pages; I don't know if the JS one does more.

I'm not particularly well-versed in stealth techniques for browsers, and I don't think this package will be going in that direction in the future. Instead, I'd prefer to provide ways to interact with Playwright objects (such as playwright_page_init_callback), which allows users to bring in 3rd-party integrations that might provide such features.
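
A minimal sketch of that approach, assuming the callback receives the page and the request as positional arguments. The patch script and the `build_meta` helper are hypothetical names used only for illustration; they are not part of scrapy-playwright or of any stealth package.

```python
import asyncio

# Illustrative init script (an assumption, not taken from any stealth plugin):
# hides the `navigator.webdriver` flag from scripts running in the page.
WEBDRIVER_PATCH = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)


async def init_page(page, request):
    # Invoked for each newly created page, before any request is made.
    # `add_init_script` registers JS that runs in every document of the page.
    await page.add_init_script(script=WEBDRIVER_PATCH)


def build_meta():
    # Hypothetical helper: the meta dict to pass to scrapy.Request(..., meta=...)
    return {
        "playwright": True,
        "playwright_page_init_callback": init_page,
    }
```

A third-party integration would plug in at the same place: its page-level setup call would go inside `init_page` instead of (or next to) the `add_init_script` call.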

kinoute · 2023-03-01T14:45:37Z

@Chryron Were you able to make it work?

@elacuesta We tried the different approaches you provided in multiple issues but none of them seems to work. We tried a simple js file with the following (taken from the stealth plugin):

// headless.js
// replace Headless references in default useragent
const current_ua = navigator.userAgent
Object.defineProperty(Object.getPrototypeOf(navigator), 'userAgent', {
    get: () => opts.navigator_user_agent || current_ua.replace('HeadlessChrome/', 'Chrome/')
})

With this code (only a part of it; the full code is longer, but you get the idea):

import scrapy
from scrapy.spiders import CrawlSpider

async def init_page(page, request):
    await page.add_init_script(path="./headless.js")  # not working
    # await stealth_async(page)  # not working with the stealth plugin

class RandomCrawler(CrawlSpider):

    def start_requests(self):
        yield scrapy.Request(
            'https://httpbin.org/headers',
            meta={
                'playwright': True,
                'playwright_page_init_callback': init_page,
            },
        )

The user-agent returned is:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/111.0.5563.19 Safari/537.3
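
Two things may explain this result; both are offered as possibilities, not as a confirmed diagnosis. First, an init script only changes what `navigator.userAgent` reports to JavaScript running in the page, while https://httpbin.org/headers echoes the HTTP `User-Agent` request header, which the script does not touch. Second, as far as I can tell the `opts` object in headless.js is supplied by the stealth plugin itself, so a standalone file would not have it defined. The sketch below shows a standalone variant of the script plus a context-level user agent. The names `strip_headless`, `STANDALONE_UA_SCRIPT` and `REPORTED_UA` are hypothetical; `user_agent` is a keyword argument of Playwright's `new_context`, passed here through `PLAYWRIGHT_CONTEXTS`.

```python
# User agent copied verbatim from the output reported above.
REPORTED_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/111.0.5563.19 Safari/537.3"
)


def strip_headless(user_agent: str) -> str:
    # Same replacement the JS snippet performs, done on the Python side.
    return user_agent.replace("HeadlessChrome/", "Chrome/")


# Standalone init script: identical to headless.js minus the plugin-provided
# options object, so it does not depend on anything defined elsewhere.
STANDALONE_UA_SCRIPT = """
const current_ua = navigator.userAgent
Object.defineProperty(Object.getPrototypeOf(navigator), 'userAgent', {
    get: () => current_ua.replace('HeadlessChrome/', 'Chrome/')
})
"""

# Settings fragment (an assumption about how one might configure this):
# the context-level user agent applies to the HTTP header as well.
custom_settings = {
    "PLAYWRIGHT_CONTEXTS": {
        "default": {"user_agent": strip_headless(REPORTED_UA)},
    },
}
```

Depending on how `PLAYWRIGHT_PROCESS_REQUEST_HEADERS` is configured, Scrapy's own `User-Agent` header may still take precedence over the one set on the context, so the header seen by the server is worth checking after any change.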

Chryron · 2023-03-01T18:11:12Z

> @Chryron Were you able to make it work?

@kinoute I got it to work with the playwright-stealth plugin when I tried using the init_page method. I wrote a simple spider (included below) to check whether a few browser tests returned different results when the stealth variable was switched between True and False. They did, which indicates that the scripts launched by playwright_stealth were all being run properly. The problem I had was that playwright-stealth, at its current stage of development, was just not a good enough stealth browser for my needs and ended up being detected quite easily.

import scrapy
from playwright_stealth import stealth_async
from scrapy.spiders import CrawlSpider


async def init_page(page, request):
    # Apply the playwright-stealth init scripts to each new page
    await stealth_async(page)


class PlaywrightTester(CrawlSpider):
    name = "playwright-tester"
    start_urls = ["https://bot.sannysoft.com/"]

    # Toggle to compare results with and without the stealth scripts
    stealth = True
    if stealth:
        screenshot = "enabled"
        meta = {
            "playwright": True,
            "playwright_include_page": True,
            "playwright_page_init_callback": init_page,
        }
    else:
        screenshot = "disabled"
        meta = {
            "playwright": True,
            "playwright_include_page": True,
        }

    custom_settings = {
        "PLAYWRIGHT_LAUNCH_OPTIONS": {"headless": False},
        "PLAYWRIGHT_PROCESS_REQUEST_HEADERS": None,
    }

    def start_requests(self):
        # Scrapy does not read a class-level `meta`, so pass it explicitly,
        # and set the callback because CrawlSpider does not default to parse()
        for url in self.start_urls:
            yield scrapy.Request(url, meta=self.meta, callback=self.parse)

    async def parse(self, response):
        page = response.meta["playwright_page"]
        input("Press Enter to continue...")
        await page.screenshot(path=f"sannysoft_{self.screenshot}.png", full_page=True)
        await page.goto("https://abrahamjuliot.github.io/creepjs/")
        await page.wait_for_timeout(20000)
        await page.screenshot(path=f"creepjs_{self.screenshot}.png", full_page=True)
        await page.goto("http://f.vision/")
        await page.wait_for_timeout(20000)
        await page.screenshot(path=f"fvision_{self.screenshot}.png", full_page=True)
        await page.goto("https://pixelscan.net/")
        await page.wait_for_timeout(20000)
        await page.screenshot(path=f"pixelscan_{self.screenshot}.png", full_page=True)
        # Pages obtained via playwright_include_page should be closed explicitly
        await page.close()

kinoute · 2023-03-11T18:14:48Z

@Chryron Thanks a lot, it works perfectly!

elacuesta closed this as completed Jul 13, 2023
