is it possible to use playwright-stealth with the scrapy-playwright integration? #160

Closed
Chryron opened this issue Jan 26, 2023 · 4 comments

@Chryron
Chryron commented Jan 26, 2023

I'm trying to get past some Cloudflare restrictions on a site with scrapy-playwright, and I was wondering whether it is possible to use playwright-extra and its stealth plugin with this integration. The plugin is currently in beta (development here) and, from my limited understanding, serves as a drop-in replacement for regular Playwright. I haven't used the original Playwright much and was wondering whether some of the changes they've made could be ported over to the Scrapy integration.

@elacuesta
Member

There was a PR about adding built-in support for a playwright-stealth Python port at #109. The PR was closed, but it is possible to use it after #128 as shown in #109 (comment). Perhaps this is enough for your case.

The plugin you mention is the upstream JS version of the above port. As you said, it seems to be a replacement for the JS playwright package, which is in turn used by the Python version, but I don't have any plans to base this package on anything other than the official Python version of playwright. AFAICT the playwright-stealth Python port works by adding init scripts to pages; I don't know if the JS one does more.

I'm not particularly well-versed in stealth techniques for browsers, and I don't think this package will go in that direction in the future. Instead, I'd prefer to provide ways to interact with Playwright objects (such as playwright_page_init_callback), which allows bringing in 3rd-party integrations that might provide such features.

@kinoute
kinoute commented Mar 1, 2023

@Chryron Were you able to make it work?

@elacuesta We tried the different approaches you provided in multiple issues, but none of them seem to work. We tried a simple JS file with the following (taken from the stealth plugin):

// headless.js
// replace Headless references in default useragent
const current_ua = navigator.userAgent
Object.defineProperty(Object.getPrototypeOf(navigator), 'userAgent', {
    get: () => opts.navigator_user_agent || current_ua.replace('HeadlessChrome/', 'Chrome/')
})

With this code (an excerpt; the full code is longer, but you get the idea):

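import scrapy
from scrapy.spiders import CrawlSpider
# from playwright_stealth import stealth_async  # only needed for the commented-out call


async def init_page(page, request):
    await page.add_init_script(path="./headless.js")  # not working
    # await stealth_async(page)  # not working with the stealth plugin


class RandomCrawler(CrawlSpider):
    name = "random-crawler"  # placeholder; the excerpt does not show the spider name

    def start_requests(self):
        yield scrapy.Request(
            'https://httpbin.org/headers',
            meta={
                'playwright': True,
                'playwright_page_init_callback': init_page,
            },
        )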

The user-agent returned is:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/111.0.5563.19 Safari/537.3
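
One possible explanation, offered as an assumption rather than a confirmed diagnosis: an init script only changes what page JavaScript sees through navigator.userAgent, while https://httpbin.org/headers echoes the HTTP request headers, which the script never touches. The snippet as shown also refers to opts, which is not defined in that file. A sketch of an alternative is to set the user agent on the browser context through the PLAYWRIGHT_CONTEXTS setting, so that the header and navigator.userAgent agree. The helper name strip_headless and the use of the "default" context name are assumptions for illustration.

```python
def strip_headless(user_agent: str) -> str:
    # Same substitution as headless.js, done on the Python side.
    return user_agent.replace("HeadlessChrome/", "Chrome/")


# User agent reported in the comment above, copied as-is.
HEADLESS_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/111.0.5563.19 Safari/537.3"
)

# Candidate spider-level settings; the inner dict is passed as keyword
# arguments when the browser context is created.
custom_settings = {
    "PLAYWRIGHT_CONTEXTS": {
        "default": {"user_agent": strip_headless(HEADLESS_UA)},
    },
}

if __name__ == "__main__":
    print(custom_settings["PLAYWRIGHT_CONTEXTS"]["default"]["user_agent"])
```

Depending on how PLAYWRIGHT_PROCESS_REQUEST_HEADERS is configured, Scrapy's own request headers may still replace the value sent on the wire, so the result is worth checking against httpbin again.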

@Chryron
Author

Chryron commented Mar 1, 2023

@Chryron Were you able to make it work?

@kinoute I got it to work with the playwright-stealth plugin by using the init_page method. I wrote a simple spider (included below) to check whether a few browser tests returned different results when switching the stealth variable between True and False. They did, which indicates that the scripts launched by playwright_stealth were all being run properly. The problem I had was that playwright-stealth, at its current stage of development, was just not a good enough stealth browser for my needs and ended up being detected quite easily.

import scrapy
from playwright_stealth import stealth_async
from scrapy.spiders import CrawlSpider


async def init_page(page, request):
    await stealth_async(page)


class PlaywrightTester(CrawlSpider):
    name = "playwright-tester"
    stealth = True  # toggle to compare results with and without stealth
    if stealth:
        screenshot = "enabled"
        meta = {
            "playwright": True,
            "playwright_include_page": True,
            "playwright_page_init_callback": init_page,
        }
    else:
        screenshot = "disabled"
        meta = {
            "playwright": True,
            "playwright_include_page": True,
        }

    start_urls = ["https://bot.sannysoft.com/"]

    custom_settings = {
        "PLAYWRIGHT_LAUNCH_OPTIONS": {"headless": False},
        "PLAYWRIGHT_PROCESS_REQUEST_HEADERS": None,
    }

    def start_requests(self):
        # Scrapy does not read a `meta` class attribute, so pass it explicitly.
        for url in self.start_urls:
            yield scrapy.Request(url, meta=self.meta, callback=self.parse)

    async def parse(self, response):
        page = response.meta["playwright_page"]
        input("Press Enter to continue...")
        await page.screenshot(path=f"sannysoft_{self.screenshot}.png", full_page=True)
        await page.goto("https://abrahamjuliot.github.io/creepjs/")
        await page.wait_for_timeout(20000)
        await page.screenshot(path=f"creepjs_{self.screenshot}.png", full_page=True)
        await page.goto("http://f.vision/")
        await page.wait_for_timeout(20000)
        await page.screenshot(path=f"fvision_{self.screenshot}.png", full_page=True)
        await page.goto("https://pixelscan.net/")
        await page.wait_for_timeout(20000)
        await page.screenshot(path=f"pixelscan_{self.screenshot}.png", full_page=True)
        await page.close()  # pages obtained via playwright_include_page should be closed

@kinoute
kinoute commented Mar 11, 2023

@Chryron Thanks a lot, it works perfectly!
