
images don't appear to get read from the persistent context properly / cached #198

Open
pjlsergeant opened this issue May 10, 2023 · 1 comment

Comments

@pjlsergeant

I'm having trouble getting Scrapy + Playwright to use the browser's HTTP cache when crawling with a persistent context. I've reduced it to a minimal example, which you can see here:

https://github.com/pjlsergeant/scrapy-playwright-cache-bug
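For readers who don't want to open the repository: in scrapy-playwright, a persistent context is normally declared by giving a context a user_data_dir in PLAYWRIGHT_CONTEXTS. The settings below are a minimal sketch of that kind of setup, not a copy of the linked repository; the context name "persistent" and the directory path are placeholders.

```python
# settings.py (sketch; the context name and path are illustrative assumptions)

# Route HTTP(S) downloads through scrapy-playwright.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

PLAYWRIGHT_BROWSER_TYPE = "chromium"

# A context that defines user_data_dir is launched as a persistent context,
# so the browser profile (including its disk cache) survives between runs.
PLAYWRIGHT_CONTEXTS = {
    "persistent": {
        "user_data_dir": "/tmp/playwright-user-data",  # hypothetical path
    },
}
```

A request would then opt in with meta={"playwright": True, "playwright_context": "persistent"}.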

app.py is a minimal Flask app that demonstrates the problem. If you start it (flask run) and then run the scrape (scrapy crawl crawl), you can see that the PNG at /pixel is not served from the cache, both from the Flask logs and from the final body output, <html><head></head><body>count:6</body></html>, which indicates 6 hits.
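To make the idea of a hit-counting test server concrete, here is a standard-library stand-in. It is not the app.py from the repository (which uses Flask); the /count route, the Cache-Control value, and all names are assumptions made for illustration. It serves a cacheable 1x1 PNG at /pixel and counts every request that actually reaches the server.

```python
import struct
import threading
import zlib
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def _chunk(kind, data):
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    body = kind + data
    return struct.pack(">I", len(data)) + body + struct.pack(">I", zlib.crc32(body))


# A valid 1x1 8-bit grayscale PNG, built by hand to avoid extra dependencies.
PIXEL = (
    b"\x89PNG\r\n\x1a\n"
    + _chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
    + _chunk(b"IDAT", zlib.compress(b"\x00\x00"))  # filter byte + one pixel
    + _chunk(b"IEND", b"")
)


class CountingHandler(BaseHTTPRequestHandler):
    hits = 0  # number of /pixel requests that reached the server
    lock = threading.Lock()

    def do_GET(self):
        if self.path == "/pixel":
            with CountingHandler.lock:
                CountingHandler.hits += 1
            self._send(PIXEL, "image/png", "public, max-age=3600")
        elif self.path == "/count":
            body = f"count:{CountingHandler.hits}".encode()
            self._send(body, "text/html", "no-store")
        else:
            self.send_error(404)

    def _send(self, body, content_type, cache_control):
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Cache-Control", cache_control)
        self.end_headers()
        self.wfile.write(body)


def start_server(port=0):
    """Start the server on a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

If the browser honours the cache, repeated page loads should leave the counter unchanged after the first request; a counter that keeps climbing is the symptom described above.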

Interestingly, if you then manually load up Playwright using the persistent config (something like browser_context = chromium.launch_persistent_context(userDataDir)), you'll see the image is already cached. So the image is being written to the cache during the Scrapy + Playwright run; it's just not being read from the cache when Playwright is driven by Scrapy.
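A rough sketch of that manual check with Playwright's Python sync API is below. The function name and parameters are made up for illustration; user_data_dir should point at the same profile directory the crawl used.

```python
def load_with_persistent_context(url, user_data_dir):
    """Open `url` in a persistent Chromium context and return the page HTML."""
    # Imported lazily so this module can be imported without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Reuses the on-disk profile (and its HTTP cache) in user_data_dir.
        context = p.chromium.launch_persistent_context(user_data_dir, headless=True)
        try:
            page = context.new_page()
            page.goto(url)
            return page.content()
        finally:
            context.close()
```

Watching the server logs while this runs shows whether the request for /pixel reached the server or was satisfied from the cache.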

Any help gratefully received

@elacuesta
Member

It looks like this is caused by the use of Page.route. The Playwright docs say:

Enabling routing disables http cache.

Unfortunately, routing is necessary for some of this integration's functionality, as I've explained elsewhere.

This seems to be a known limitation, and a lot of people are eager to see it lifted in upstream Playwright: microsoft/playwright#7220.
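One way to observe the effect outside Scrapy is to load the same page twice, with and without a pass-through route, and compare how many requests reach the server. The following is only a sketch under that assumption: load_twice and its parameters are hypothetical names, and the handler is not the one scrapy-playwright registers.

```python
def load_twice(url, user_data_dir, with_route):
    """Load `url` twice in a persistent context, optionally with routing enabled."""
    # Imported lazily so this module can be imported without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        context = p.chromium.launch_persistent_context(user_data_dir, headless=True)
        try:
            page = context.new_page()
            if with_route:
                # Pass-through handler: it changes nothing about the request,
                # but registering any route is what the quoted docs refer to.
                page.route("**/*", lambda route: route.continue_())
            page.goto(url)
            page.goto(url)
            return page.content()
        finally:
            context.close()
```

Running it once with with_route=False and once with with_route=True against a server that logs requests makes the difference visible without involving Scrapy.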
