Scrapy with Puppeteer and/or Playwright? #4484
Comments
I don't think either Puppeteer or Playwright could be integrated directly, as they are JavaScript projects. However, there is pyppeteer, a Python port of Puppeteer, and Scrapy added partial support for coroutines. It is currently possible to run the following spider. Keep in mind that you need to enable the AsyncIO reactor for this to work.

```python
import pyppeteer
import scrapy


class PyppeteerSpider(scrapy.Spider):
    name = "pyppeteer"
    start_urls = ["data:,"]  # avoid making an actual upstream request

    async def parse(self, response):
        # Drive a headless browser directly from the callback.
        browser = await pyppeteer.launch()
        page = await browser.newPage()
        await page.goto("https://example.org")
        title = await page.title()
        await browser.close()  # release the browser process
        yield {"title": title}
```

Just a proof of concept, perhaps not particularly useful since it circumvents most of the Scrapy components.
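For reference, enabling the AsyncIO reactor mentioned above is usually a one-line change in the project settings. This is a minimal sketch, assuming a standard Scrapy project with a `settings.py` module; the same value can also be set per spider through `custom_settings`.

```python
# settings.py (assumed location: the project's Scrapy settings module)

# Ask Scrapy to install Twisted's asyncio-based reactor so that
# asyncio libraries such as pyppeteer can be awaited in callbacks.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

Without this setting Scrapy uses the default Twisted reactor, and awaiting asyncio-based code from a callback is not expected to work.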
Probably worth covering at https://docs.scrapy.org/en/latest/topics/dynamic-content.html#using-a-headless-browser
@osmenia You might also want to check https://github.com/elacuesta/scrapy-pyppeteer.
@elacuesta Long-time Scrapy user here. Since your original post, the Playwright people have created https://github.com/microsoft/playwright-python. This is now a full-featured Python SDK for Playwright. It is supported by Microsoft, and the authors of Playwright have designed the framework so that any programming language can implement it. Multiple languages are doing that now. In addition, Playwright is far superior to Puppeteer.

Making a native plugin in Scrapy for using Playwright to scrape sites would be a huge boost to this project. Having worked on projects that require heavy JavaScript and authentication, I find that using something like Splash is inferior to using something like Playwright directly as a downloader (and to save authentication state, something they made a new API for in Playwright 1.7). Is there a chance you could consider this as a strong candidate for a new, future feature?

P.S. I also hope asyncio gets full support and Twisted becomes a secondary citizen. asyncio gets far more traction these days, it has a nicer interface, and it matches how many other languages work (e.g. JavaScript).

Edit: I noticed you started creating https://github.com/elacuesta/scrapy-playwright. Any chance this could in the future live under the scrapy org here on GitHub?
The future is here: https://github.com/scrapy-plugins/scrapy-playwright 🙂
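As a rough illustration of what using the plugin involves, the settings below follow the configuration described in the scrapy-playwright README at the time of writing. Treat this as a sketch under that assumption and check the README for the current form.

```python
# settings.py (assumed location: the project's Scrapy settings module)

# Route HTTP and HTTPS downloads through the Playwright download handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright relies on asyncio, so the asyncio reactor is required.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

With these settings in place, a request opts in to browser rendering by passing `meta={"playwright": True}` to `scrapy.Request`; requests without that key are downloaded by Scrapy's regular handler.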
@Gallaecio that is great news! I hope that
Hi Team,
I would like to ask: is there any development plan to integrate Scrapy with Puppeteer and/or Playwright?
Thanks a lot.