How should I use scrapy-playwright without downloading images? #51

Closed
phongtnit opened this issue Jan 25, 2022 · 8 comments

Comments

@phongtnit

Hi,

Could anyone help me to use scrapy-playwright without downloading images?

Thanks for your support,

@elacuesta
Member

This seems like a duplicate of #26, please reopen with more info if you don't agree.

@lime-n

lime-n commented Feb 28, 2022

This seems like a duplicate of #26, please reopen with more info if you don't agree.

I have a question on this.

I followed the example you provided in the answer to that thread, and I initially expected that no images would be downloaded, i.e. that no requests would be sent to, or responses received from, .jpeg attachments.

Here's the output I get:

 'playwright/context_count': 1,
 'playwright/page_count': 1,
 'playwright/page_count/closed': 1,
 'playwright/request_count': 30,
 'playwright/request_count/method/GET': 30,
 'playwright/request_count/navigation': 1,
 'playwright/request_count/resource_type/document': 1,
 'playwright/request_count/resource_type/font': 1,
 'playwright/request_count/resource_type/image': 20, #<- why are requests sent to images?
 'playwright/request_count/resource_type/script': 5,
 'playwright/request_count/resource_type/stylesheet': 3,

Perhaps I have misunderstood what the context actually does. If I wanted no requests sent to, or responses received from, images, how would I accomplish this? Or is this not the same as saying 'without downloading images'?

@elacuesta
Member

Only #63 should be considered "in progress". That said, it's probably going to keep showing requests like that in the stats, because it relies on aborting requests caught by the Page.route method, while these stats are incremented via the request event. The presence of the playwright/request_count/blocked stat and the absence of playwright/request_count/resource_type/image should indicate that it's working.
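To illustrate why a route-based abort can still leave image requests in the stats, here is a toy model in plain Python. It does not use Playwright or scrapy-playwright; `on_request`, `route_handler` and the stat keys are stand-ins invented for this sketch. The only assumption carried over from the comment above is the ordering: the request event is counted first, and the route handler decides afterwards whether to abort.

```python
from collections import Counter

# Stand-in for the crawler stats; the key format only mimics the thread's output.
stats = Counter()


def on_request(resource_type: str) -> None:
    # Models the "request" event: counted for every request the page issues,
    # before anything has decided whether the request will be aborted.
    stats[f"request_count/resource_type/{resource_type}"] += 1


def route_handler(resource_type: str, blocked_types: tuple) -> str:
    # Models a route callback: it only decides the fate of a request
    # that has already been counted by on_request.
    return "abort" if resource_type in blocked_types else "continue"


def issue_request(resource_type: str, blocked_types: tuple = ("image",)) -> str:
    on_request(resource_type)
    return route_handler(resource_type, blocked_types)


outcomes = [issue_request(t) for t in ("document", "image", "image", "script")]
print(outcomes)
print(dict(stats))
```

In this model both image requests are aborted, yet the image counter still reads 2, which matches the behaviour described for the route-based approach.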

@lime-n

lime-n commented Mar 1, 2022

Only #63 should be considered "in progress". That said, it's probably going to keep showing requests like that in the stats, because it relies on aborting requests caught by the Page.route method, while these stats are incremented via the request event. The presence of the playwright/request_count/blocked stat and the absence of playwright/request_count/resource_type/image should indicate that it's working.

I checked the new books_block_request and it ran as expected!

Does this mean that I cannot use:

async def configure_context(name: str, context: BrowserContext) -> None:
    await context.route("**/*.{jpg}", lambda route: route.abort())

to abort responses? This still fails to block responses even after I updated to the latest handler.py. Must I instead set PLAYWRIGHT_ACCEPT_REQUEST_PREDICATE in the settings to abort specific responses? If so, thank you for the update. I have a few projects that will need this, so I'll report any issues I run into.

Edit:

  • Would it be simpler to block all responses by resource_type, i.e. abort every response whose resource type is image? This could be offered as an additional setting, since some users may want to block only specific images while others may want to block all of them. It would need fewer lines of code, because some image URLs do not end with an image extension and working around that takes extra code. For example:
#PLAYWRIGHT_ACCEPT_REQUEST_PREDICATE = lambda req: not req.url.endswith((".jpg", ".png", ".jpeg", ".svg"))
PLAYWRIGHT_ACCEPT_REQUEST_PREDICATE_TYPE = "image"

When we want to block multiple resource types:

PLAYWRIGHT_ACCEPT_REQUEST_PREDICATE_TYPE = ["image", "Doc", "Media"]

@elacuesta
Member

That's correct, the configure_context approach was very much experimental (there wasn't even a PR). There are still some details to think about with the new approach as well, but the following should be enough for now to block requests based on resource type

PLAYWRIGHT_ACCEPT_REQUEST_PREDICATE = lambda req: req.resource_type not in ("image", "script")
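The one-line lambda above can also be written as a named function, which leaves room to combine the resource-type check with the URL-suffix check discussed earlier. The following is a minimal sketch, assuming only that the object passed to the predicate exposes `url` and `resource_type` (as Playwright's `Request` does). The names `BLOCKED_TYPES`, `BLOCKED_SUFFIXES` and `accept_request` are made up for this example, and the `SimpleNamespace` objects are stand-ins so it runs without a browser.

```python
from types import SimpleNamespace
from urllib.parse import urlsplit

# Hypothetical block lists; adjust to the needs of the spider.
BLOCKED_TYPES = {"image", "media", "font"}
BLOCKED_SUFFIXES = (".jpg", ".jpeg", ".png", ".svg")


def accept_request(req) -> bool:
    """Return True to let the request through, False to block it."""
    if req.resource_type in BLOCKED_TYPES:
        return False
    # Compare only the path, so query strings such as "?v=2" do not hide the suffix.
    path = urlsplit(req.url).path.lower()
    return not path.endswith(BLOCKED_SUFFIXES)


# Stand-in requests exposing just the two attributes used above.
examples = [
    SimpleNamespace(url="https://example.com/", resource_type="document"),
    SimpleNamespace(url="https://example.com/img?id=7", resource_type="image"),
    SimpleNamespace(url="https://example.com/logo.PNG?v=2", resource_type="other"),
]
decisions = [accept_request(r) for r in examples]
print(decisions)
```

Assuming the setting keeps the name used in this thread, the function would be assigned as `PLAYWRIGHT_ACCEPT_REQUEST_PREDICATE = accept_request` in the project settings.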

@lime-n

lime-n commented Mar 1, 2022

Thank you for the implementation. I agree, there's still room for more creative ideas in its implementation. One additional idea is the following:

  1. To specify, in the request meta or in handlers, when to activate the PLAYWRIGHT_ACCEPT_REQUEST_PREDICATE setting.

For the following reason:
If I am using PageCoroutine to log into a website, I cannot block some resource types, as otherwise I won't be able to reach the login. After accessing the site, I may make further requests to another domain belonging to the site. It may be useful to apply PLAYWRIGHT_ACCEPT_REQUEST_PREDICATE at that point and block responses for multiple resource types or file extensions. This could be made multifunctional, for example:

meta = {
    'playwright_accept_request_predicate': {
        'activate':True,
        'position': 1
            }
}

activate would describe whether to activate the settings for PLAYWRIGHT_ACCEPT_REQUEST_PREDICATE for the requests in that method.

However, I'm not sure what will happen when the following is involved:

def first_parse(self, response):
    ...
    yield Request(
        meta = {
            'playwright_accept_request_predicate': {
                'activate': True,
                'position': 1
            }
        })

def second_parse(self, response):
    ...
    yield Request(
        meta = {
            'playwright_accept_request_predicate': {
                'activate': True,
                'position': [2, 3]
            }
        })

It may be useful to cancel the previously set playwright_accept_request_predicate so that it does not overlap with the new one.

Meanwhile, position would refer to the following:

PLAYWRIGHT_ACCEPT_REQUEST_PREDICATE = lambda req: req.resource_type not in ("image", "script")

That is, the position of the resource type in the tuple, in this case image. However, there may be a few technical issues with this. Alternatively, we could include:

meta = {
    'playwright_accept_request_predicate': {
        'activate': True,
        'group': 'resource_type',
        'position': 1
    }
}

This makes things slightly more complex. However, it would represent the following:
The position argument is predetermined by the position of the resource types in the source code. For example, if the resource types form the list request.resource_type = ['xhr', 'JS' ...], then we pick the entry at this position.

Then the following will denote:

meta = {
    'playwright_accept_request_predicate': {
        'activate': True,
        'group': 'resource_type',
        'position': False
    }
}

All the resource types are included when 'position': False

Therefore we just have to include:

PLAYWRIGHT_ACCEPT_REQUEST_PREDICATE = True

and make controls on abort in the script itself for further functionality.

@lime-n

lime-n commented Mar 2, 2022

In addition to the above, there's also a need to include how the route is handled, like so:

meta = {
    'playwright_accept_request_predicate': {
        'activate': True,
        'handle': 'fulfill',
        'group': 'resource_type',
        'position': False
    }
}

Here, handle determines how the route is handled. I have also noticed that route.request provides the request to be handled.

Alternatively, I was wondering how useful route.request would be in this case. The idea I had was that we can abort all request urls when their response is not <200>. I'm not sure how much of an impact this has on the scrapy context or what side effects it would have, but it may prevent issues like time-outs.

Something like:

meta = {
    'playwright_accept_request_predicate': {
        'activate': True,
        'response': 400,
        'handle': 'abort',
        'group': 'resource_type',
        'position': 1
    }
}

There's quite a lot to unpack here but I hope it proves useful for the future development of route in scrapy_playwright.

@elacuesta your thoughts?

@elacuesta
Member

I'm sorry, that sounds overcomplicated. My aim is to keep things as simple as possible, I don't want to build a whole new API.
Moreover, things like the position argument are not generalizable because the predicate to abort requests is arbitrary, not just based on the resource type.
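Because the predicate is arbitrary, the login scenario raised earlier could in principle be handled inside the predicate itself rather than through extra meta keys. The sketch below is only an illustration under assumptions: it assumes the object passed to the predicate is a Playwright `Request`, whose `frame` property gives the frame that issued it, and it invents a `LOGIN_PATH` constant. Whether `frame` is usable at the point where scrapy-playwright calls the predicate has not been verified here. The `SimpleNamespace` objects are stand-ins so the example runs without a browser.

```python
from types import SimpleNamespace
from urllib.parse import urlsplit

LOGIN_PATH = "/login"  # hypothetical path of the login page


def accept_request(req) -> bool:
    """Let everything through on the login page, filter elsewhere."""
    # URL of the page (frame) that issued the request, not of the resource itself.
    page_path = urlsplit(req.frame.url).path
    if page_path.startswith(LOGIN_PATH):
        return True
    return req.resource_type not in ("image", "script")


def fake_request(page_url: str, resource_type: str) -> SimpleNamespace:
    # Stand-in exposing only the attributes the predicate reads.
    return SimpleNamespace(frame=SimpleNamespace(url=page_url), resource_type=resource_type)


on_login = accept_request(fake_request("https://example.com/login", "script"))
on_listing = accept_request(fake_request("https://example.com/items", "script"))
print(on_login, on_listing)
```

This keeps the hook a plain boolean function, in line with the aim of not building a new API.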

abort all request urls when their response is not <200>

The idea is to abort requests before they're sent; if they already have responses, it's just not possible to abort them anymore.

However, passing the Route object instead of just the Request sounds interesting, to allow the usage of Route.fulfill. I'm not sure though; perhaps that could be a separate hook, to keep this one simply returning a boolean.
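For context, this is roughly what a handler that receives the Route could look like. It is a sketch of plain Playwright usage, not an existing scrapy-playwright hook: `handle_route` is a hypothetical name, and the empty 204 response is an arbitrary choice. `route.request`, `route.fulfill(...)` and `route.continue_()` are Playwright's Python API; `FakeRoute` is a stand-in that records calls so the example runs without a browser.

```python
import asyncio


async def handle_route(route) -> None:
    # Answer image requests locally with an empty response instead of
    # hitting the network; let every other request proceed unchanged.
    if route.request.resource_type == "image":
        await route.fulfill(status=204, body="")
    else:
        await route.continue_()


class FakeRoute:
    """Stand-in for playwright.async_api.Route that records what was called."""

    def __init__(self, resource_type: str) -> None:
        self.request = type("FakeRequest", (), {"resource_type": resource_type})()
        self.calls = []

    async def fulfill(self, **kwargs) -> None:
        self.calls.append(("fulfill", kwargs))

    async def continue_(self) -> None:
        self.calls.append(("continue", {}))


image_route, script_route = FakeRoute("image"), FakeRoute("script")
asyncio.run(handle_route(image_route))
asyncio.run(handle_route(script_route))
print(image_route.calls, script_route.calls)
```

With a real page, such a handler would be registered with something like `await page.route("**/*", handle_route)`.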
