Suggestion: different pipeline processor for each item type #102
I couldn't find a good way to use different 'process_item' methods for different classes of Items.
For example we could create two item classes to crawl and then store in db:
    from scrapy.item import Item, Field

    class Headers(Item):
        url = Field()
        response_code = Field()
        content_type = Field()

    class Body(Item):
        title = Field()
        h1 = Field()

Then in the item pipeline we would need to do something like this:

    class StoreInDB(object):
        def process_item(self, item, spider):
            if isinstance(item, Headers):
                return self.storeHeaders(item, spider)
            elif isinstance(item, Body):
                return self.storeBody(item, spider)
            return item  # pass any other item type through unchanged

        def storeHeaders(self, item, spider):
            # do something with the Headers item here
            return item

        def storeBody(self, item, spider):
            # do something with the Body item here
            return item
Wouldn't it be nice to put this functionality in a base class, so we would have some dict or function to map each item to the correct processor? Of course, the current behavior would stay as the default. Here is what I'm talking about:

    from project.items import Headers, Body
    from scrapy.contrib.pipeline import Pipeline  # proposed base class

    class StoreInDB(Pipeline):
        def __init__(self):
            self.assignItemProcessor(itemclass=Headers, processor=self.storeHeaders)
            self.assignItemProcessor(itemclass=Body, processor=self.storeBody)

        def storeHeaders(self, item, spider):
            # do something with the Headers item here
            return item

        def storeBody(self, item, spider):
            # do something with the Body item here
            return item
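For reference, here is a rough sketch of how such a base class might work. It is only one possible reading of the proposal, not code that exists in Scrapy: `Pipeline` and `assignItemProcessor` are the hypothetical names from the suggestion above, `Item` is a plain `dict` stand-in so the snippet runs without Scrapy installed, and the `stored` list stands in for a real database.

```python
class Item(dict):
    """Stand-in for scrapy.item.Item so the sketch runs without Scrapy."""


class Headers(Item):
    pass


class Body(Item):
    pass


class Pipeline(object):
    """Hypothetical base class: dispatches each item to a processor by type."""

    def __init__(self):
        self._processors = {}

    def assignItemProcessor(self, itemclass, processor):
        # Remember which callable handles which item class.
        self._processors[itemclass] = processor

    def process_item(self, item, spider):
        # Walk the MRO so subclasses of a registered class are matched too.
        for cls in type(item).__mro__:
            processor = self._processors.get(cls)
            if processor is not None:
                return processor(item, spider)
        # Default behavior: unregistered item types pass through unchanged.
        return item


class StoreInDB(Pipeline):
    def __init__(self):
        super(StoreInDB, self).__init__()
        self.stored = []  # assumption: stands in for a real database
        self.assignItemProcessor(itemclass=Headers, processor=self.storeHeaders)
        self.assignItemProcessor(itemclass=Body, processor=self.storeBody)

    def storeHeaders(self, item, spider):
        self.stored.append(("headers", item))
        return item

    def storeBody(self, item, spider):
        self.stored.append(("body", item))
        return item
```

Each processor returns the item so that later pipelines still receive it, which matches what Scrapy expects from `process_item`.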
I can write the code if you guys think it's useful. It certainly is for me.
This looks like a contrib pipeline that implements the basis for item-type delegation; users would still need to extend it to add their project's functionality.
I don't think this is worth the pain of maintaining another contrib as part of the Scrapy project: the functionality described is easy to implement, and there is no consensus about the approach to handling multiple item types. Others have proposed building an item pipeline per type instead.
IMHO this base pipeline is better suited to a blog post, a recipe, or an external Scrapy cookrecipes project.
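For comparison, the pipeline-per-type alternative mentioned above could look roughly like the sketch below. This is an illustration under assumptions, not an agreed design: `HeadersPipeline`, `BodyPipeline`, and the `run_pipelines` helper are hypothetical names, `Item` is a plain `dict` stand-in so the snippet runs without Scrapy, and the `stored` lists stand in for real storage. In a real project both pipelines would be enabled in the `ITEM_PIPELINES` setting and Scrapy would pass every item through each of them in order.

```python
class Item(dict):
    """Stand-in for scrapy.item.Item so the sketch runs without Scrapy."""


class Headers(Item):
    pass


class Body(Item):
    pass


class HeadersPipeline(object):
    """Handles Headers items only; everything else passes through."""

    def __init__(self):
        self.stored = []  # assumption: stands in for a real database table

    def process_item(self, item, spider):
        if isinstance(item, Headers):
            self.stored.append(item)
        return item


class BodyPipeline(object):
    """Handles Body items only; everything else passes through."""

    def __init__(self):
        self.stored = []  # assumption: stands in for a real database table

    def process_item(self, item, spider):
        if isinstance(item, Body):
            self.stored.append(item)
        return item


def run_pipelines(pipelines, item, spider=None):
    # Hypothetical helper that mimics chaining pipelines in order.
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item
```

The trade-off is that every pipeline sees every item and has to ignore the types it does not care about, but no shared base class is needed.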