Scrapy with Puppeteer and/or Playwright? #4484

Closed
osmenia opened this issue Apr 11, 2020 · 6 comments · Fixed by #4613
Comments

@osmenia

osmenia commented Apr 11, 2020

Hi Team,

I would like to ask: is there any development plan to integrate Scrapy with Puppeteer and/or Playwright?

Thanks a lot.

@elacuesta
Member

elacuesta commented Apr 12, 2020

I don't think either Puppeteer or Playwright could be integrated directly, as they are JavaScript projects. However, there is Pyppeteer, and there have been some attempts to integrate it with Scrapy (a quick search yields lopuhin/scrapy-pyppeteer and clemfromspace/scrapy-puppeteer; there might be more projects).

Scrapy added partial support for asyncio in 2.0 (see the asyncio and coroutines topics). That release is barely over a month old, though, so the above projects do not seem to have had time to take advantage of it.

It is currently possible to run the following spider. Keep in mind that you need to enable the AsyncIO reactor for this to work.

import pyppeteer
import scrapy


class PyppeteerSpider(scrapy.Spider):
    name = "pyppeteer"
    start_urls = ["data:,"]  # avoid making an actual upstream request

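    async def parse(self, response):
        # Drive a headless browser directly, bypassing Scrapy's downloader
        browser = await pyppeteer.launch()
        try:
            page = await browser.newPage()
            await page.goto("https://example.org")
            title = await page.title()
        finally:
            await browser.close()  # do not leave the browser process running
        yield {"title": title}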

Just a proof of concept, perhaps not particularly useful since it circumvents most of the Scrapy components.
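
For reference, enabling the AsyncIO reactor comes down to a single setting. A minimal sketch, assuming a standard project `settings.py`:

```python
# settings.py (sketch): switch Twisted to the asyncio-based reactor so that
# coroutines awaiting asyncio libraries such as pyppeteer can run.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```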

@Gallaecio
Member

Probably worth covering at https://docs.scrapy.org/en/latest/topics/dynamic-content.html#using-a-headless-browser

@elacuesta
Member

elacuesta commented Apr 17, 2020

@osmenia You might also want to check https://github.com/elacuesta/scrapy-pyppeteer.
Disclaimer: personal project, not officially supported by the Scrapy project, very early stage of development.

@thernstig
Contributor

thernstig commented Jan 4, 2021

@elacuesta Long-time Scrapy user here. Since your original post, the Playwright people have created https://github.com/microsoft/playwright-python, which is now a full-featured Python SDK for Playwright. It is supported by Microsoft, and the authors of Playwright designed the framework so that any programming language can implement it. Multiple languages do so now.

In addition, Playwright is far superior to Puppeteer.

Making a native plugin in Scrapy for using Playwright to scrape sites would be a huge boost to this project. Having worked with projects that require heavy JavaScript and authentication, I find that using something like Splash is inferior to using something like Playwright directly as a downloader (and to save authentication state, something they made a new API for in Playwright 1.7).

Is there a chance you could consider this as a strong candidate as a new, future feature?

P.S. I also hope asyncio gets full support and Twisted becomes a second-class citizen. asyncio gets far more traction these days, has a nicer interface, and matches how many other languages work (e.g. JavaScript).

Edit: I noticed you started creating https://github.com/elacuesta/scrapy-playwright. Any chance this could instead exist under the scrapy org here on GitHub in the future?

@Gallaecio
Member

I noticed you started creating https://github.com/elacuesta/scrapy-playwright. Any chance this could instead exist under the scrapy org here on GitHub in the future?

The future is here: https://github.com/scrapy-plugins/scrapy-playwright 🙂
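
For anyone landing here, a rough sketch of how the plugin is typically wired up. The variable names `PLAYWRIGHT_SETTINGS` and `PLAYWRIGHT_META` are only illustrative (they are not part of the plugin), and the setting values should be checked against the plugin's current README:

```python
# Illustrative names; in a real project these keys go straight into settings.py.
PLAYWRIGHT_SETTINGS = {
    # Route HTTP(S) downloads through the Playwright-backed handler.
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    # The plugin relies on the asyncio-based Twisted reactor.
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
}

# Requests opt in one by one through their meta dict, e.g.
#   scrapy.Request(url, meta=PLAYWRIGHT_META)
PLAYWRIGHT_META = {"playwright": True}
```

Requests without the `playwright` meta key should go through the regular Scrapy download path, so a spider can mix rendered and plain requests.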

@thernstig
Contributor

@Gallaecio that is great news! I hope that asyncio support graduates from experimental status soon as well. Then I truly believe Scrapy is "modern" again :)
