Support for spiders in other languages (discussion) #1125
A ticket to discuss and reference Spider development in other languages (GSoC idea).
The goal is to allow creating spiders in any language, such that they can be executed by the Scrapy framework like a built-in spider. (And I would like a solution that is generic enough to decouple other Scrapy components, like item pipelines, the same way in the future.)
As I see it, this is about defining a) an interface and b) a protocol, both suitable for use across different programming languages. Then there are actual c) implementation details to consider (how Scrapy's codebase might need to change to support this).
a) The interface should be at the smallest common denominator (across languages): OS-level (POSIX/SysV) IPC (Inter-Process Communication), which I believe gives us these options:
- (anonymous or named) pipes
- Unix domain sockets
- shared memory
- memory-mapped files
(I would prefer sticking to local IPC, meaning no inet sockets, and let people who need networked IPC write their own transport middleware, for example using zeromq or json-rpc.)
b) The protocol needs consideration of what kind of data must be exchanged and how we do it. Pipes are unidirectional (unless you're on Solaris), while sockets work full-duplex.
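To make the protocol question concrete, here is a minimal sketch of one possible framing: newline-delimited JSON messages passed over a child process's stdin/stdout pipes. This framing and the message shape (`type`/`payload`) are my own assumptions for illustration, not anything decided in this thread.

```python
import json

def encode_message(msg_type, payload):
    """Frame one message as a single JSON line (hypothetical framing)."""
    return (json.dumps({"type": msg_type, "payload": payload}) + "\n").encode("utf-8")

def decode_message(line):
    """Parse one framed line back into (type, payload)."""
    obj = json.loads(line.decode("utf-8"))
    return obj["type"], obj["payload"]

# Because anonymous pipes are unidirectional, a full exchange needs two of
# them: Scrapy would write responses to the child's stdin and read
# items/requests back from its stdout.
wire = encode_message("request", {"url": "http://example.com"})
msg_type, payload = decode_message(wire)
```

Line-delimited framing keeps the parser trivial in any language, which matters if every client library has to reimplement it.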
c) Here I would put questions about how such a foreign-language Spider fits into the current Scrapy framework; for example, how it would be discovered and run inside a project.
Hopefully this is a good collection of things to take into account before we start hacking. I tried to keep it brief; if anything is missing or wrong, please let me know.
On the foreign-spider side, each language will probably need a client library that wraps all the IPC setup and communication; essentially, the spider would be responsible for generating data for, and responding to data from, the Scrapy side. The client lib should provide an API that approximates what writing a Spider in Python is like, while leveraging the best aspects of the language being used. It's also important to think about the other aspects of Scrapy used within a Spider, mainly XML/HTML parsing, which is currently handled through a built-in library.
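As a rough illustration (in Python, though the point is that a client library would exist per language), such an API might look something like the following. All names here (`ForeignSpider`, `send_item`, `on_response`) are hypothetical, invented for this sketch, and not an existing API.

```python
class ForeignSpider:
    """Hypothetical base class a client library might provide."""

    def __init__(self):
        self._outbox = []  # messages queued for delivery to the Scrapy side

    def send_request(self, url):
        """Ask the Scrapy side to schedule a fetch."""
        self._outbox.append({"type": "request", "url": url})

    def send_item(self, item):
        """Push a scraped item back to the Scrapy side."""
        self._outbox.append({"type": "item", "data": item})

    def on_response(self, response):
        """Override in a concrete spider: parse `response`, emit items/requests."""
        raise NotImplementedError

class ExampleSpider(ForeignSpider):
    def on_response(self, response):
        # Trivial "parsing": uppercase the body and emit it as an item.
        self.send_item({"title": response["body"].upper()})

spider = ExampleSpider()
spider.on_response({"url": "http://example.com", "body": "hello"})
```

The library would own the IPC plumbing behind `send_request`/`send_item`, so spider authors only write parsing callbacks, much like a native Scrapy spider.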
My own choices for the mentioned options earlier would be unix sockets or pipes. I'll explain why:
Shared memory, while being the fastest option and having the lowest setup cost, has no inherent safeguards or thread-safety; memory-mapped files are similar. Both are great for performance, but may not be the best choice when thinking about ease of development for third-party programs and developers.
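To show what "no inherent safeguards" means: an anonymous memory map is just a raw byte region with no locking and no message framing, so concurrent writers would race without extra synchronization. A tiny illustration:

```python
import mmap

# Anonymous, 64-byte shared region (would be inherited by a forked child).
buf = mmap.mmap(-1, 64)

# Nothing frames or protects this write: a second writer could clobber it,
# and the reader has no built-in way to know where the message ends.
buf.write(b"item: hello")
buf.seek(0)
data = buf.read(11)
buf.close()
```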
Unix domain sockets work like ordinary sockets without involving the network stack, so they are faster and familiar to anyone who knows socket programming.
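For contrast with the pipe option, a connected pair of Unix domain sockets is full-duplex: both sides can send and receive on the same descriptor. The message bytes below are placeholders, not a proposed wire format.

```python
import socket

# socketpair() gives two already-connected AF_UNIX stream sockets.
parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

parent.sendall(b'{"type": "response", "status": 200}\n')
received = child.recv(4096)

# Full-duplex: the child answers over the very same socket.
child.sendall(b'{"type": "item"}\n')
reply = parent.recv(4096)

parent.close()
child.close()
```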
In fact, we might mix both together, converting standard file descriptors to byte streams and multiplexing them over a socket. Sounds crazy? Well, then we'd have... FastCGI!
It's not as easy as all that of course, as the foreign language Spider here would implement the webserver side (the FastCGI client), and scrapy would constitute the responder side, the FastCGI application.
So the Spider sends a request which scrapy answers with a response. Looks like a typical webapp to me. Not so typical is that our Spider can also send items to the scrapy "webserver". That would require some additions to the protocol for our use-case.
So this would be a complex solution, definitely requiring library support in other programming languages. It also wouldn't be reusable for other Scrapy components, such as pipelines.
I have been working on this issue for some days and have developed a basic POC of a Twisted-based streaming mechanism. The posts by nyov have been helpful in providing an overview of the methods that might be used to solve this problem. I decided on the present system because it fits best with the requirements outlined by Shane here.
A few brief notes about the POC
Since this is just a demo, I haven't cared much about error handling, logging, encodings, etc. I have included a simple test spider which gets all the front-page postings from self-post-only subreddits like cscareerquestions, jokes, etc. The spider is implemented in both Python and Ruby; the Python implementation uses Selectors and the Ruby one uses Nokogiri.
Feedback is much appreciated, although I won't be able to respond for a few days.
I'll add some more thoughts.
Now the existing scrapy commands can be used with the generated spider.
I will be happy to share my thoughts about implementing recycling spiders, multiprocessing, restarting crashed processes, etc, once it is decided that the IPC method I am using is good for this application.
This work has been conducted within https://github.com/scrapy-plugins/scrapy-streaming