Support for spiders in other languages (discussion) #1125

Closed
nyov opened this Issue Mar 31, 2015 · 5 comments

@nyov
Contributor

nyov commented Mar 31, 2015

A ticket to discuss and reference Spider development in other languages (GSoC idea).

The goal is to allow writing spiders in any language, which the scrapy framework can then execute like a built-in spider. (And I would like a solution that is generic enough to decouple other scrapy components, such as item pipelines, the same way in the future.)

As I see it, this is about defining a) an interface and b) a protocol, both of which should be suitable for use across different programming languages. Then there are c) actual implementation details to consider (how scrapy's codebase might need to change to support this).

a) The interface should be the lowest common denominator (across languages): OS (POSIX/SYSV) IPC (Inter-Process Communication), which gives us these options, I believe:
shared memory, mmapped files, message queues (not really an option on OSX), sockets (incl. unix sockets), and pipes. (For an overview, see Beej's Guide to Unix IPC.)

(I would prefer sticking to local IPC, meaning no inet sockets, and letting people who need networked IPC write their own transport middleware, for example using zeromq or json-rpc.)

b) The protocol needs consideration of what kind of data needs to be exchanged and how we exchange it. Pipes are unidirectional (unless you're on Solaris) while sockets work full-duplex.
(The GSoC ideas page also refers to Hadoop Streaming, which uses line-based communication through pipes. Q: How does this handle binary data?)
We might define custom line-based interactions in this Hadoop Streaming style, or use an existing standardized protocol: Protocol Buffers (protobuf) or BERT come to mind.
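
To make the line-based option concrete, here is a minimal sketch of what the foreign-spider side of such an exchange could look like, assuming a hypothetical newline-delimited JSON protocol over stdin/stdout (Hadoop Streaming style). The message fields ("type", "url", "links", and so on) are invented for illustration only; defining the real protocol is exactly what this ticket is about.

#!/usr/bin/env python
# hypothetical foreign spider speaking newline-delimited JSON over stdin/stdout
import json
import sys

def emit(msg):
    sys.stdout.write(json.dumps(msg) + "\n")
    sys.stdout.flush()  # flush so the scrapy side sees each message immediately

# ask scrapy to fetch the start URL
emit({"type": "request", "url": "http://example.com/"})

for line in sys.stdin:
    msg = json.loads(line)
    if msg["type"] == "response":
        # yield an item and follow-up requests for each response received
        emit({"type": "item", "data": {"url": msg["url"], "size": len(msg["body"])}})
        for href in msg.get("links", []):
            emit({"type": "request", "url": href})
    elif msg["type"] == "close":
        break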

c) Here I would put questions about how such a foreign-language Spider fits into the current scrapy framework. For example, inside a project the SpiderManager currently detects available spiders with the help of the SPIDER_MODULES setting; how would it adapt?
Then there is statefulness to consider: do we need to know which response the spider turned into which new requests and/or items? Should the exchange be workload-based (wait until the spider has processed a response and returned everything, signaling a finish), queue/stream-based (independent input and output, like a serial bus), or built on async callbacks? (The current Spider being asynchronous, it might be nice to have a similar, callback-based protocol.)
How do we best handle setup and teardown (open/close_spider), for example by defining signals for spiders to trap and exit codes to return?
And not to forget error handling: if a spider dies, is it restartable, or is the failure fatal (SIGSEGV, incompatible protocol version), stopping the crawler?
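
As a rough illustration of the SpiderManager question only (this is not an existing scrapy API): discovery could keep working unchanged if the foreign process were declared through a thin wrapper spider living in a SPIDER_MODULES package. The process_cmd attribute and the idea of a base class that spawns the process are made up here.

# myproject/spiders/external_quotes.py -- hypothetical wrapper module so the
# normal SPIDER_MODULES discovery still finds the foreign-language spider
from scrapy import Spider

class ExternalQuotesSpider(Spider):
    name = "external_quotes"
    start_urls = ["http://quotes.toscrape.com/"]
    # invented attribute: some yet-to-be-written base class or extension would
    # read this, spawn the process, and route responses to it instead of parse()
    process_cmd = ["ruby", "quotes_spider.rb"]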

Hopefully this is a good collection of things to take into account before we start hacking. I tried to keep it brief; if anything is missing or wrong, please let me know.
Let's get this party started :)

// cc @shaneaevans, @pablohoffman

@dfockler

dfockler commented Mar 31, 2015

On the foreign spider side, each language will probably need a client library that wraps all the IPC setup and communication; essentially the spider would be responsible for generating data for, and responding to data from, the Scrapy side. The client lib should provide an API that approximates what writing a Spider in Python feels like, while leveraging the best aspects of the language being used. It's also important to think about the other aspects of Scrapy used within a Spider, mainly XML/HTML parsing, which is currently handled through a built-in library.

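Purely to illustrate the shape such a client-library API might take (Python stands in here for whichever foreign language; the Spider base class and run() helper below are invented names, and the actual IPC wiring is stubbed out):

# invented client-library API; a real library would replace the run() stub
# with the IPC loop that talks to scrapy, and a foreign language would do the
# HTML parsing with its own library (e.g. Nokogiri in Ruby)
class Spider:
    name = None
    start_urls = []

    def parse(self, response):
        raise NotImplementedError

class ExampleSpider(Spider):
    name = "example"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # 'response' is whatever dict-like object the client library hands back
        yield {"url": response["url"], "size": len(response["body"])}

def run(spider_cls):
    # stub: a real client library would wrap the IPC setup and dispatch here
    print("would connect to scrapy and drive", spider_cls.name)

if __name__ == "__main__":
    run(ExampleSpider)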

@nyov

Contributor

nyov commented Apr 1, 2015

My own choices among the options mentioned earlier would be unix sockets or pipes. I'll explain why:

Shared memory, while being the fastest option with the lowest setup cost, has no inherent safeguards or built-in thread-safety. Mmapped files are similar. Both are awesome, but they may not be the best choice when thinking about ease of development for third-party programs and developers.
Message queues are out, as OSX does not support the POSIX style and supports the SYSV style only with headaches (tiny queue size).

Unix domain sockets work like inet sockets without involving the network stack, so they are faster and easy for anyone who knows socket programming.
Pipes or named pipes/FIFOs could be an elegant solution; in principle, everyone can pipe programs together on the command line.

In fact we might mix the two, converting the standard file descriptors to byte streams and multiplexing them over a socket. Sounds crazy? Well, then we'd have... FastCGI!
(And when I say FastCGI, I do mean FastCGI: not FCGI, SCGI, or any of those other things that only do "CGI over a socket" without the multiplexing that makes FastCGI great, which leaves them capable of sequential processing only.)
Tada, ready-made protocol included.

It's not quite that easy, of course: the foreign-language Spider here would implement the webserver side (the FastCGI client), and scrapy would constitute the responder side, the FastCGI application.

So the Spider sends a request which scrapy answers with a response. Looks like a typical webapp to me. Not so typical is that our Spider can also send items to the scrapy "webserver". That would require some additions to the protocol for our use-case.
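
To illustrate the inverted roles, here is a toy stand-in within a single process: JSON lines over a Unix socket pair, with no real FastCGI framing or multiplexing, and all message shapes invented for the sketch.

# toy demo of the role reversal: the "spider" end issues requests and items,
# the "scrapy" end plays the responder and answers with responses
import json
import socket

spider_end, scrapy_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

def send(sock, msg):
    sock.sendall((json.dumps(msg) + "\n").encode())

def recv(sock):
    buf = b""
    while not buf.endswith(b"\n"):
        buf += sock.recv(4096)
    return json.loads(buf)

# the spider acts like the webserver side: it issues a request ...
send(spider_end, {"type": "request", "url": "http://example.com/"})
print("scrapy got:", recv(scrapy_end))

# ... and scrapy, acting like the FastCGI application, answers with a response
send(scrapy_end, {"type": "response", "url": "http://example.com/", "body": "<html/>"})
print("spider got:", recv(spider_end))

# the not-so-typical part: the spider also pushes an item upstream, which is
# where plain FastCGI would need the protocol additions mentioned above
send(spider_end, {"type": "item", "data": {"title": "Example"}})
print("scrapy got:", recv(scrapy_end))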

So this would be a complex solution, definitely requiring library support in other programming languages. It also wouldn't be reusable for other scrapy components, such as pipelines.
But bonus points for using scrapy as a FastCGI web application behind a webserver.

@Preetwinder

Contributor

Preetwinder commented Feb 24, 2016

I have been working on this issue for some days and have developed a basic POC of a Twisted-based streaming mechanism. The posts by nyov have been helpful in providing an overview of the methods which might be used to solve this problem. I have decided to use the present system because it fits best with the requirements outlined by Shane here.

A few brief notes about the POC

  • LineReceiver is used with ProcessEndpoint.
  • The endpoint is wrapped in DisconnectedWorkaroundEndpoint as a workaround for this Twisted bug.
  • stdbuf from coreutils is used to disable buffering, because we don't want the user to worry about accumulating the response.
  • The lineReceived callback accumulates deferreds and fires any one of them, so the relationship between requests and responses can be thought of as decoupled. This could be changed by using a name/id-indexed dictionary of deferreds (see the sketch after this list).
  • Although I haven't tried it yet, the defined protocol should also work well with other scrapy components like item pipelines.
  • The library in the client language can get the callback functions from the local namespace, or accept a list of functions; for languages without first-class functions, something like the Command pattern can be used.
  • Since the callback functions execute in a separate process, they don't block the reactor.
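
A rough sketch of the name/id-indexed variant mentioned in the list above, assuming each line carries an id field so replies can be matched to requests; the names here are illustrative, not the POC's actual code.

# each outgoing line carries an id; the matching Deferred fires when a line
# with the same id comes back, coupling every reply to its request
import json
from twisted.internet import defer
from twisted.protocols.basic import LineReceiver

class SpiderProcessProtocol(LineReceiver):
    delimiter = b"\n"

    def __init__(self):
        self.pending = {}   # request id -> Deferred
        self.next_id = 0

    def send_message(self, payload):
        self.next_id += 1
        d = defer.Deferred()
        self.pending[self.next_id] = d
        self.sendLine(json.dumps(dict(payload, id=self.next_id)).encode())
        return d

    def lineReceived(self, line):
        msg = json.loads(line)
        d = self.pending.pop(msg.get("id"), None)
        if d is not None:
            d.callback(msg)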

Since this is just a demo, I haven't cared much about error handling, logging, encodings, etc. I have included a simple test spider which gets all the front-page postings from self-post-only subreddits like cscareerquestions, jokes, etc. The spider is implemented in Python and Ruby; the Python implementation uses Selectors and the Ruby one uses Nokogiri.

Feedback is much appreciated, although I won't be able to respond for a few days.

Thank you

@Preetwinder

Contributor

Preetwinder commented Mar 9, 2016

I'll add some more thoughts.
An important question to consider is how this functionality should fit with the existing scrapy commands. One way is to have a separate set of commands under a streaming sub-command, but doing that is not necessary. The solution I propose is to have a
scrapy streaming generate-spider --cmd="python Process.py" --filename="Test"
command. This command will generate a spider which inherits all of its functionality and just contains the class variable definitions. The variable definitions can be hard-coded when the spider is generated (by asking the process), but then generate-spider needs to be run for every change made to these settings in the process. A better way is to get these settings from the process every time the spider is run. So running the above command will generate "Test.py" with the following content.

class TestSpider(StreamingSpider):
    cmd = ["python", "Process.py"]
    settings = getSpiderSettings(cmd)  # runs the process and gets the settings
    name = settings['name']
    allowed_domains = settings['allowed_domains']
    start_urls = settings['start_urls']
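
For what it's worth, getSpiderSettings could be something as simple as the sketch below; the helper name comes from the snippet above, but the --settings flag and the JSON-line convention are my own assumptions.

# hypothetical helper: run the external process once in a "settings" mode and
# read a single JSON line describing name, allowed_domains and start_urls
import json
import subprocess

def getSpiderSettings(cmd):
    out = subprocess.run(cmd + ["--settings"], capture_output=True, check=True)
    return json.loads(out.stdout.decode().splitlines()[0])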

Now the existing scrapy commands can be used with the generated spider.
I believe we should have a separate sub-command for streaming, since the streaming functionality is sufficiently distinct. We will need to move to argparse, since optparse does not seem to support sub-commands (without patching). I see there are already #1118 and #829.

Another question in the mailing list thread was about which languages should be initially supported. This can be based on popularity in this particular domain; Java, Ruby and JavaScript seem to be the most popular. My own preference would be to have one dynamically typed (Python) and one statically typed (Java or Go) language.

I will be happy to share my thoughts about implementing spider recycling, multiprocessing, restarting crashed processes, etc., once it is decided that the IPC method I am using is a good fit for this application.
Thank you.

@redapple

Contributor

redapple commented Sep 19, 2016

This work has been conducted within https://github.com/scrapy-plugins/scrapy-streaming

@redapple redapple closed this Sep 19, 2016
