Spiderlets

Warning

This functionality is deprecated and no longer maintained.

The slybot spider alone cannot solve every crawling and extraction difficulty that may arise: data presented in a way that is only partially suitable, or not suitable at all, for the similarity algorithm; arbitrary POST data and AJAX requests; complex URL normalization not handled by an addon; and so on. Spiderlets are a way to extend any autoscraping (AS) spider so that everything that can be done with a normal Scrapy spider can also be done with a spiderlet.

Spiderlets are handled by a spider middleware. To enable them, the setting SPIDERLETS_MODULE must be present, and its value must be the name of the module that contains the spiderlet submodules. For example, if your spiderlets are in the module mylib.spiderlets, set SPIDERLETS_MODULE to mylib.spiderlets.
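As a minimal sketch, the setting could look like this in a Scrapy project's settings module. The module path is the example name used above; the file name settings.py is an assumption about your project layout.

```python
# settings.py (assumed location of the project settings)

# Dotted path of the module whose submodules contain the spiderlet classes.
SPIDERLETS_MODULE = "mylib.spiderlets"
```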

What is a spiderlet

A spiderlet is an instance of a Python class that implements at least one of the predefined methods described below. To attach a spiderlet to a given spider, use the class attribute name. The value of this attribute must match the name of the spider:

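class MySpiderlet:
    # Must match the name of the spider this spiderlet extends.
    name = "myspider"

    def process_request(self, request, response):
        # Called for each request extracted from a received response.
        ...
        return request

    def process_item(self, item, response):
        # Called for each item issued by the spider.
        ...
        return item

    def process_start_request(self, request):
        # Called for each request built from a start URL (no response yet).
        ...
        return request

    def parse_login_page(self, response):
        # Replaces the spider's own login callback; the elided body is
        # expected to build the request that is returned here.
        ...
        return request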

Three of the methods, process_request, process_item and process_start_request, are attached to the output of the autoscraping spider. The autoscraping spider generates two kinds of objects: requests and items. Depending on the kind of object generated and on its source, the spiderlet middleware passes it to one method or another of your spiderlet. Each item issued by the spider is passed to process_item. Each request is passed to process_start_request if it comes from a start URL, or to process_request if the spider generated it by extracting a link from a received response. Start requests usually need to be processed differently, and they have no associated response from which they were generated.

process_request and process_start_request are commonly used to normalize the request URL, to filter the request, or to override the request callback (possibly with a new method defined in the spiderlet). The default callback for every request generated by the spider is the parse method of the autoscraping spider. Whenever you need to generate a FormRequest to send POST data or simulate an AJAX call, you will need to create your own callbacks in the spiderlet and point the request callbacks to them.
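The URL normalization case can be sketched as follows. This is an illustration under stated assumptions, not code taken from slybot: the DROP_PARAMS names are hypothetical, and FakeRequest is only a stand-in for scrapy.Request (whose replace method returns a modified copy) so that the sketch runs without Scrapy installed.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


class FakeRequest:
    """Stand-in for scrapy.Request so the sketch runs without Scrapy."""

    def __init__(self, url):
        self.url = url

    def replace(self, url):
        # Mirrors the idea of scrapy.Request.replace: return a changed copy.
        return FakeRequest(url)


class NormalizingSpiderlet:
    name = "myspider"  # must match the name of the spider it extends

    # Hypothetical query parameters that only carry session state.
    DROP_PARAMS = {"sessionid", "sid"}

    def _normalize(self, url):
        parts = urlsplit(url)
        query = [
            (key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in self.DROP_PARAMS
        ]
        query.sort()  # stable parameter order, so equivalent URLs compare equal
        # Rebuild the URL without the fragment.
        return urlunsplit(
            (parts.scheme, parts.netloc, parts.path, urlencode(query), "")
        )

    def process_request(self, request, response):
        return request.replace(url=self._normalize(request.url))

    def process_start_request(self, request):
        return request.replace(url=self._normalize(request.url))
```

Sharing one private helper keeps start requests and extracted requests normalized the same way, even though the two hooks receive different arguments.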

process_item is mostly used for item post-processing. An important detail is that the field values of an item returned by the autoscraping spider are always lists, even when a field holds a single value. Keep this in mind when accessing item fields inside the process_item method. There is no restriction, however, on the types of data contained in the items returned by the spiderlet.
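A minimal sketch of such post-processing, assuming the item behaves like a dictionary (a plain dict is enough to try it out): fields that hold exactly one value are unwrapped, and all other fields are left untouched.

```python
class FlatteningSpiderlet:
    name = "myspider"  # must match the name of the spider it extends

    def process_item(self, item, response):
        # Iterate over a snapshot so that fields can be reassigned safely.
        for field, values in list(item.items()):
            # Unwrap single-valued lists; keep empty and multi-valued lists.
            if isinstance(values, list) and len(values) == 1:
                item[field] = values[0]
        return item
```

Whether to unwrap single values is a design choice: downstream code that expects lists everywhere would need the original shape.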

The fourth method, parse_login_page, is of a different kind. Instead of processing an output of the spider, it processes an incoming response, and it is applied only to responses whose callback is the parse_login_page method of the autoscraping spider. If you define a parse_login_page method in your spiderlet, the request callback is overridden by this new one. This feature allows you to write your own login handler when the slybot default one (based on the generic solution implemented in the loginforms library) does not fit a given case well.

Another very practical feature of a spiderlet is that you can access the methods and attributes of the autoscraping spider through the spiderlet attribute self.spider. self.spider.log and self.spider.parse are among the methods most commonly accessed from a spiderlet.
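For instance, a spiderlet might log through the spider while post-processing items. This is a sketch only: it assumes that self.spider has already been set on the spiderlet instance before process_item is called, that log accepts a message string, and that the response exposes a url attribute as Scrapy responses do.

```python
class LoggingSpiderlet:
    name = "myspider"  # must match the name of the spider it extends

    def process_item(self, item, response):
        # self.spider is assumed to be provided from outside this class.
        self.spider.log("item extracted from %s" % response.url)
        return item
```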