Support playwright_stealth #109

Closed
tanghq33 wants to merge 2 commits

tanghq33

Integrated playwright_stealth and added PLAYWRIGHT_STEALTH_ENABLED as an optional setting.

Attached bot test results.

PLAYWRIGHT_STEALTH_ENABLED = True
ENABLED

PLAYWRIGHT_STEALTH_ENABLED = False
DISABLED

@elacuesta
Member

Thank you very much for the contribution, but I don't want to include any third-party dependency unless it's really necessary.
I've been thinking that one way to allow this functionality (and address #25 at the same time) would be to add a way to handle pages right after they are created (an idea I've already explored at #26 (comment)). I'm imagining something like the following:

from playwright.async_api import Page
from scrapy import Request, Spider


async def new_page_handler(page: Page) -> None:
    # add_init_script() takes script source first; a file goes in `path`
    await page.add_init_script(path="/path/to/script")
    # more stuff


class AwesomeSpider(Spider):
    name = "awesome"

    def start_requests(self):
        yield Request(
            url="https://httpbin.org/get",
            # "playwright_configure_page" is the meta key proposed in this comment
            meta={"playwright": True, "playwright_configure_page": new_page_handler},
        )

@tanghq33 tanghq33 closed this Sep 27, 2022
@elacuesta
Member

For the record, this should be possible after #128

@nimish

nimish commented Nov 1, 2022

It should be possible to include this with an optional pip dependency e.g. scrapy-playwright[with_playwright_stealth] to avoid requiring the dependency while also including this in the distribution
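
To make the optional-dependency idea concrete, here is a minimal sketch of how code could cope with the extra not being installed. The helper names `load_optional` and `apply_stealth_if_available` are hypothetical and are not part of scrapy-playwright; the only third-party call assumed is `stealth_async` from playwright_stealth.

```python
import importlib
import importlib.util


def load_optional(module_name: str):
    """Return the imported top-level module, or None if it is not installed."""
    if importlib.util.find_spec(module_name) is None:
        return None
    return importlib.import_module(module_name)


async def apply_stealth_if_available(page) -> bool:
    """Apply playwright_stealth to `page` only when the extra is installed.

    Hypothetical helper: returns True if stealth was applied, False otherwise.
    """
    stealth = load_optional("playwright_stealth")
    if stealth is None:
        return False
    await stealth.stealth_async(page)
    return True
```

With a guard like this, the extra would only need to be declared in the packaging metadata; users who skip it would get the unpatched behavior.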

@elacuesta
Member

elacuesta commented Nov 1, 2022

That's true, but it would still require changes to the main handler in order to support the integration - that's what I want to avoid.
It's possible to integrate with this after v0.0.22, by using the playwright_page_init_callback request meta key:

import scrapy
from playwright_stealth import stealth_async


async def init_page(page, request):
    # Apply the playwright_stealth patches to each newly created page
    await stealth_async(page)


class StealthSpider(scrapy.Spider):
    name = "stealth"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",
            meta={
                "playwright": True,
                "playwright_page_init_callback": init_page,
            },
        )
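
The same callback shape also works without any third-party package, which ties back to the earlier `add_init_script` idea. The following is only a sketch: `HIDE_WEBDRIVER_JS` is a hypothetical one-line patch (real stealth scripts cover many more signals), and `RecordingPage` is a made-up stand-in for a Playwright page so the callback can be checked without launching a browser.

```python
import asyncio

# Hypothetical one-line patch; real stealth scripts cover many more signals.
HIDE_WEBDRIVER_JS = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
)


async def init_page(page, request):
    """Page init callback: register a script that runs before page scripts."""
    await page.add_init_script(script=HIDE_WEBDRIVER_JS)


class RecordingPage:
    """Hypothetical stand-in for a Playwright Page, for browser-free checks."""

    def __init__(self):
        self.init_scripts = []

    async def add_init_script(self, script=None, path=None):
        # Record whichever argument was supplied instead of touching a browser.
        self.init_scripts.append(script if script is not None else path)


def check_callback():
    """Run the callback against the stand-in and return what it registered."""
    page = RecordingPage()
    asyncio.run(init_page(page, request=None))
    return page.init_scripts
```

In a spider, `init_page` would be passed through the `playwright_page_init_callback` meta key exactly as in the example above.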

@kinoute

kinoute commented Mar 1, 2023

@hqtang33 Were you able to find a solution? I tried the changes proposed here together with your fork of the stealth plugin, but unfortunately even the "simple" removal of "Headless" from the user agent doesn't work.
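
One dependency-free angle on the "Headless" token, offered as a sketch rather than a confirmed fix for the problem described above, is to set the user agent when the browser context is created. This assumes that scrapy-playwright forwards the `PLAYWRIGHT_CONTEXTS` keyword arguments to Playwright's `Browser.new_context()`, which accepts `user_agent`. `strip_headless` and `SAMPLE_UA` are hypothetical names, and the sample string is illustrative only.

```python
def strip_headless(user_agent: str) -> str:
    """Replace the 'HeadlessChrome' product token with plain 'Chrome'."""
    return user_agent.replace("HeadlessChrome", "Chrome")


# Illustrative user-agent string, not captured from a real browser run.
SAMPLE_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/105.0.0.0 Safari/537.36"
)

# Assumption: these keyword arguments reach Browser.new_context(),
# which accepts `user_agent`.
PLAYWRIGHT_CONTEXTS = {
    "default": {"user_agent": strip_headless(SAMPLE_UA)},
}
```

This only changes the reported user agent; other headless signals would still need their own handling.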
