From bcd8520f8d84badce149dab0658aeb68a7435d2d Mon Sep 17 00:00:00 2001 From: Pablo Hoffman Date: Tue, 20 Mar 2012 10:15:00 -0300 Subject: [PATCH] added sep directory with Scrapy Enhancement Proposal imported from old Trac site --- sep/README | 5 + sep/sep-001.trac | 235 ++++++++++++++++++ sep/sep-002.trac | 106 ++++++++ sep/sep-003.trac | 153 ++++++++++++ sep/sep-004.trac | 70 ++++++ sep/sep-005.trac | 119 +++++++++ sep/sep-006.trac | 52 ++++ sep/sep-007.trac | 108 +++++++++ sep/sep-008.trac | 102 ++++++++ sep/sep-009.trac | 102 ++++++++ sep/sep-010.trac | 59 +++++ sep/sep-011.trac | 30 +++ sep/sep-012.trac | 73 ++++++ sep/sep-013.trac | 129 ++++++++++ sep/sep-014.trac | 612 +++++++++++++++++++++++++++++++++++++++++++++++ sep/sep-015.trac | 50 ++++ sep/sep-016.trac | 265 ++++++++++++++++++++ sep/sep-017.trac | 90 +++++++ sep/sep-018.trac | 551 ++++++++++++++++++++++++++++++++++++++++++ 19 files changed, 2911 insertions(+) create mode 100644 sep/README create mode 100644 sep/sep-001.trac create mode 100644 sep/sep-002.trac create mode 100644 sep/sep-003.trac create mode 100644 sep/sep-004.trac create mode 100644 sep/sep-005.trac create mode 100644 sep/sep-006.trac create mode 100644 sep/sep-007.trac create mode 100644 sep/sep-008.trac create mode 100644 sep/sep-009.trac create mode 100644 sep/sep-010.trac create mode 100644 sep/sep-011.trac create mode 100644 sep/sep-012.trac create mode 100644 sep/sep-013.trac create mode 100644 sep/sep-014.trac create mode 100644 sep/sep-015.trac create mode 100644 sep/sep-016.trac create mode 100644 sep/sep-017.trac create mode 100644 sep/sep-018.trac diff --git a/sep/README b/sep/README new file mode 100644 index 00000000000..668772492d8 --- /dev/null +++ b/sep/README @@ -0,0 +1,5 @@ +Scrapy Enhancement Proposals +============================ + +This folder contains Scrapy Enhancement Proposal. Most of them are in Trac Wiki +format because they were migrated from the old Trac. diff --git a/sep/sep-001.trac b/sep/sep-001.trac new file mode 100644 index 00000000000..a56c7851099 --- /dev/null +++ b/sep/sep-001.trac @@ -0,0 +1,235 @@ += SEP-001 - API for populating item fields (comparison) = + +[[PageOutline(2-5,Contents)]] + +||'''SEP:'''||1|| +||'''Title:'''||API for populating item fields (comparison)|| +||'''Author:'''||Ismael Carnales, Pablo Hoffman, Daniel Grana|| +||'''Created:'''||2009-07-19|| +||'''Status'''||Obsoleted by [wiki:SEP-008]|| + +== Introduction == + +This page shows different usage scenarios for the two new proposed API for populating item field values (which will replace the old deprecated !RobustItem API) and compares them. One of these will be chosen as the recommended (and supported) mechanism in Scrapy 0.7. + +== Candidates and their API == + +=== !RobustItem (old, deprecated) === + + * {{{attribute(field_name, selector_or_value, **modifiers_and_adaptor_args)}}} + * NOTE: {{{attribute()}}} modifiers (like {{{add=True}}}) are passed together with adaptor args as keyword arguments (this is ugly) + +=== !ItemForm === + + * {{{__init__(response, item=None, **adaptor_args)}}} + * instantiate an !ItemForm with a item instance with predefined adaptor arguments + * {{{__setitem__(field_name, selector_or_value)}}} + * set field value + * {{{__getitem__(field_name)}}} + * return the "computed" value of a field (the one that would be set to the item). returns None if not set. 
+ * {{{get_item()}}} + * return the item populated with the data provided so far + +=== !ItemBuilder === + + * {{{__init__(response, item=None, **adaptor_args)}}} + * instantiate an !ItemBuilder with predefined adaptor arguments + * {{{add_value(field_name, selector_or_value, **adaptor_args)}}} + * add value to field + * {{{replace_value(field_name, selector_or_value, **adaptor_args)}}} + * replace existing field value + * {{{get_value(field_name)}}} + * return the "computed" value of a field (the one that would be set to the item). returns None if not set. + * {{{get_item()}}} + * return the item populated with the data provided so far + +== Pros and cons of each candidate == + +=== !ItemForm === + +Pros: + * same API used for Items (see http://doc.scrapy.org/experimental/topics/newitem/index.html#more-advanced-items) + * some people consider setitem API more elegant than methods API + +Cons: + * doesn't allow passing run-time arguments to adaptors on assign, you have to override the adaptors for your spider if you need specific parameters, which can be an overhead. Example: + +Neutral: + * solves the add=True problem using standard {{{__add__}}} and {{{list.append()}}} method + +=== !ItemBuilder === + +Pros: + * allows passing run-time arguments to adaptors on assigned + +Cons: + * some people consider setitem API more elegant than methods API + +Neutral: + * solves the "add=True" problem by implementing different methods per action (replacing or adding) + +== Usage Scenarios for each candidate == + +=== Defining adaptors === + +==== !ItemForm ==== + +{{{ +#!python +class NewsForm(ItemForm): + item_class = NewsItem + + url = adaptor(extract, remove_tags(), unquote(), strip) + headline = adaptor(extract, remove_tags(), unquote(), strip) +}}} + +==== !ItemBuilder ==== + +{{{ +#!python +class NewsBuilder(ItemBuilder): + item_class = NewsItem + + url = adaptor(extract, remove_tags(), unquote(), strip) + headline = adaptor(extract, remove_tags(), unquote(), strip) +}}} + +=== Creating an Item === + +==== !ItemForm ==== + +{{{ +#!python +ia = NewsForm(response) +ia['url'] = response.url +ia['headline'] = x.x('//h1[@class="headline"]') + +# if we want to add another value to the same field +ia['headline'] += x.x('//h1[@class="headline2"]') + +# if we want to replace the field value other value to the same field +ia['headline'] = x.x('//h1[@class="headline3"]') + +return ia.get_item() +}}} + +==== !ItemBuilder ==== + +{{{ +#!python +il = NewsBuilder(response) +il.add_value('url', response.url) +il.add_value('headline', x.x('//h1[@class="headline"]')) + +# if we want to add another value to the same field +il.add_value('headline', x.x('//h1[@class="headline2"]')) + +# if we want to replace the field value other value to the same field +il.replace_value('headline', x.x('//h1[@class="headline3"]')) + +return il.get_item() +}}} + +=== Using different adaptors per Spider/Site === + +==== !ItemForm ==== + +{{{ +#!python +class SiteNewsFrom(NewsForm): + published = adaptor(HtmlNewsForm.published, to_date('%d.%m.%Y')) +}}} + +==== !ItemBuilder ==== + +{{{ +#!python +class SiteNewsBuilder(NewsBuilder): + published = adaptor(HtmlNewsBuilder.published, to_date('%d.%m.%Y')) +}}} + +=== Check the value of an item being-extracted === + +==== !ItemForm ==== + +{{{ +#!python +ia = NewsForm(response) +ia['headline'] = x.x('//h1[@class="headline"]') +if not ia['headline']: + ia['headline'] = x.x('//h1[@class="title"]') +}}} + +==== !ItemBuilder ==== + +{{{ +#!python +il = NewsBuilder(response) 
+il.add_value('headline', x.x('//h1[@class="headline"]')) +if not nf.get_value('headline'): + il.add_value('headline', x.x('//h1[@class="title"]')) +}}} + +=== Adding a value to a list attribute/field === + +==== !ItemForm ==== + +{{{ +#!python +ia['headline'] += x.x('//h1[@class="headline"]') +}}} + +==== !ItemBuilder ==== + +{{{ +#!python +il.add_value('headline', x.x('//h1[@class="headline"]')) +}}} + +=== Passing run-time arguments to adaptors === + +==== !ItemForm ==== + +{{{ +#!python +# Only approach is passing arguments when instanciating the form +ia = NewsForm(response, default_unit='cm') +ia['width'] = x.x('//p[@class="width"]') +}}} + +==== !ItemBuilder ==== + +{{{ +#!python +il.add_value('width', x.x('//p[@class="width"]'), default_unit='cm') + +# an alternative approach (more efficient) +il = NewsBuilder(response, default_unit='cm') +il.add_value('width', x.x('//p[@class="width"]')) +}}} + +=== Passing run-time arguments to adaptors (same argument name) === + +==== !ItemForm ==== + +{{{ +#!python +class MySiteForm(ItemForm): + witdth = adaptor(ItemForm.witdh, default_unit='cm') + volume = adaptor(ItemForm.witdh, default_unit='lt') + +ia['width'] = x.x('//p[@class="width"]') +ia['volume'] = x.x('//p[@class="volume"]') + +# another example passing parametes on instance +ia = NewsForm(response, encoding='utf-8') +ia['name'] = x.x('//p[@class="name"]') +}}} + +==== !ItemBuilder ==== + +{{{ +#!python +il.add_value('width', x.x('//p[@class="width"]'), default_unit='cm') +il.add_value('volume', x.x('//p[@class="volume"]'), default_unit='lt') +}}} diff --git a/sep/sep-002.trac b/sep/sep-002.trac new file mode 100644 index 00000000000..073b3e46be8 --- /dev/null +++ b/sep/sep-002.trac @@ -0,0 +1,106 @@ += SEP-002 - List fields API = + +[[PageOutline(2-5,Contents)]] + +||'''SEP:'''||3|| +||'''Title:'''||List fields API|| +||'''Author:'''||Pablo Hoffman|| +||'''Created:'''||2009-07-21|| +||'''Status'''||Obsolete by [wiki:SEP-008]|| + + +== Introduction == + +This page presents different usage scenarios for the new multi-valued field, called !ListField. 
+ +== Proposed Implementation == + +{{{ +#!python +from scrapy.item.fields import BaseField + +class ListField(BaseField): + def __init__(self, field, default=None): + self._field = field + super(ListField, self).__init__(default) + + def to_python(self, value): + if hasattr(value, '__iter__'): # str/unicode not allowed + return [self._field.to_python(v) for v in value] + else: + raise TypeError("Expected iterable, got %s" % type(value).__name__) + + def get_default(self): + # must return a new copy to avoid unexpected behaviors with mutable defaults + return list(self._default) +}}} + +== Usage Scenarios == + +=== Defining a list field === + +{{{ +#!python +from scrapy.item.models import Item +from scrapy.item.fields import ListField, TextField, DateField, IntegerField + +class Article(Item): + categories = ListField(TextField) + dates = ListField(DateField, default=[]) + numbers = ListField(IntegerField, []) +}}} + +Another case of products and variants which highlights the fact that it's important to instantiate !ListField with field instances, not classes: + +{{{ +#!python +from scrapy.item.models import Item +from scrapy.item.fields import ListField, TextField + +class Variant(Item): + name = TextField() + +class Product(Variant): + variants = ListField(ItemField(Variant)) +}}} + + +=== Assigning a list field === + +{{{ +#!python +i = Article() + +i['categories'] = [] +i['categories'] = ['politics', 'sport'] +i['categories'] = ['test', 1] -> raises TypeError +i['categories'] = asd -> raises TypeError + +i['dates'] = [] +i['dates'] = ['2009-01-01'] # raises TypeError? (depends on TextField) + +i['numbers'] = ['1', 2, '3'] +i['numbers'] # returns [1, 2, 3] +}}} + +=== Default values === + +{{{ +#!python +i = Article() + +i['categories'] # raises KeyError +i.get('categories') # returns None + +i['numbers'] # returns [] +}}} + +=== Appending values === + +{{{ +#!python +i = Article() + +i['categories'] = ['one', 'two'] +i['categories'].append(3) # XXX: should this fail? +}}} diff --git a/sep/sep-003.trac b/sep/sep-003.trac new file mode 100644 index 00000000000..f0b910f6103 --- /dev/null +++ b/sep/sep-003.trac @@ -0,0 +1,153 @@ += SEP-003 - Nested items API (!ItemField) = + +[[PageOutline(2-5,Contents)]] + +||'''SEP:'''||3|| +||'''Title:'''||Nested items API (!ItemField) || +||'''Author:'''||Pablo Hoffman|| +||'''Created:'''||2009-07-19|| +||'''Status'''||Obsoleted by [wiki:SEP-008]|| + +== Introduction == + +This page presents different usage scenarios for the new nested items field API called !ItemField. + +== Prerequisites == + +This API proposal relies on the following API: + + 1. instantiating a item with an item instance as its first argument (ie. {{{item2 = MyItem(item1)}}}) must return a '''copy''' of the first item instance) + 2. items can be instantiated using this syntax: {{{item = Item(attr1=value1, attr2=value2)}}} + +== Proposed Implementation of !ItemField == + +{{{ +#!python +from scrapy.item.fields import BaseField + +class ItemField(BaseField): + def __init__(self, item_type, default=None): + self._item_type = item_type + super(ItemField, self).__init__(default) + + def to_python(self, value): + return self._item_type(value) if not isinstance(value, self._item_type) else value + + def get_default(self): + # WARNING: returns default item instead of a copy - this must be well documented, as Items are mutable objects and may lead to unexpected behaviors + # always returning a copy may not be desirable either (see Supplier item, for example). 
this method can be overridden to change this behaviour + return self._default +}}} + +== Usage Scenarios == + +=== Defining an item containing !ItemField's === + +{{{ +#!python +from scrapy.item.models import Item +from scrapy.item.fields import ListField, ItemField, TextField, UrlField, DecimalField + +class Supplier(Item): + name = TextField(default="anonymous supplier") + url = UrlField() + +class Variant(Item): + name = TextField(required=True) + url = UrlField() + price = DecimalField() + +class Product(Variant): + supplier = ItemField(Supplier, default=Supplier(name="default supplier") + variants = ListField(ItemField(Variant)) + + # these ones are used for documenting default value examples + supplier2 = ItemField(Supplier) + variants2 = ListField(ItemField(Variant), default=[]) +}}} + +It's important to note here that the (perhaps most intuitive) way of defining a Product-Variant +relationship (ie. defining a recursive !ItemField) doesn't work. For example, this fails to compile: + +{{{ +#!python +class Product(Item): + variants = ItemField(Product) # Fails to compile +}}} + +=== Assigning an item field === + +{{{ +#!python + +supplier = Supplier(name="Supplier 1", url="http://example.com") + +p = Product() + +# standard assignment +p['supplier'] = supplier +# this also works as it tries to instantiate a Supplier with the given dict +p['supplier'] = {'name': 'Supplier 1' url='http://example.com'} +# this fails because it can't instantiate a Supplier +p['supplier'] = 'Supplier 1' +# this fails because url doesn't have the valid type +p['supplier'] = {'name': 'Supplier 1' url=123} + +v1 = Variant() +v1['name'] = "lala" +v1['price'] = Decimal("100") + +v2 = Variant() +v2['name'] = "lolo" +v2['price'] = Decimal("150") + +# standard assignment +p['variants'] = [v1, v2] # OK +# can also instantiate at assignment time +p['variants'] = [v1, Variant(name="lolo", price=Decimal("150")] +# this also works as it tries to instantiate a Variant with the given dict +p['variants'] = [v1, {'name': 'lolo', 'price': Decimal("150")] +# this fails because it can't instantiate a Variant +p['variants'] = [v1, 'test'] +# this fails beacuse 'coco' is not a valid value for price +p['variants'] = [v1, {'name': 'lolo', 'price': 'coco'] +}}} + +=== Default values === + +{{{ +#!python +p = Product() + +p['supplier'] # returns: Supplier(name='default supplier') +p['supplier2'] # raises KeyError +p['supplier2'] = Supplier() +p['supplier2'] # returns: Supplier(name='anonymous supplier') + +p['variants'] # raises KeyError +p['variants2'] # returns [] + +p['categories'] # raises KeyError +p.get('categories') # returns None + +p['numbers'] # returns [] +}}} + +=== Accesing and changing nested item values === + +{{{ +#!python + +p = Product(supplier=Supplier(name="some name", url="http://example.com")) +p['supplier']['url'] # returns 'http://example.com' +p['supplier']['url'] = "http://www.other.com" # works as expected +p['supplier']['url'] = 123 # fails: wrong type for supplier url + +p['variants'] = [v1, v2] +p['variants'][0]['name'] # returns v1 name +p['variants'][1]['name'] # returns v2 name + +# XXX: decide what to do about these cases: +p['variants'].append(v3) # works but doesn't check type of v3 +p['variants'].append(1) # works but shouldn't? 
+}}} diff --git a/sep/sep-004.trac b/sep/sep-004.trac new file mode 100644 index 00000000000..6a19e50bb5d --- /dev/null +++ b/sep/sep-004.trac @@ -0,0 +1,70 @@ += SEP-004: Library API = + +[[PageOutline(2-5, Contents)]] + +||'''SEP'''||4|| +||'''Title'''||Library-like API for quick scraping|| +||'''Author'''||Pablo Hoffman|| +||'''Created'''||2009-07-21|| +||'''Status'''||Archived|| + +Note: the library API has been implemented, but slightly different from proposed in this SEP. You can run a Scrapy crawler inside a Twisted reactor, but not outside it. + +See these snippets for some examples: + * http://snippets.scrapy.org/snippets/8/ + * http://snippets.scrapy.org/snippets/9/ + +== Introduction == + +It would be desirable for Scrapy to provide a quick, "light-weight" mechanism for implementing crawlers by just using callback functions. That way you could use Scrapy as any standard library (like you would use os.walk) in a script without the overhead of having to create an entire project from scratch. + +== Proposed API == + +Here's a simple proof-of-concept code of such script: + +{{{ +#!python +#!/usr/bin/env python +from scrapy.http import Request +from scrapy import Crawler + +# a container to hold scraped items +scraped_items = [] + +def parse_start_page(response): + # collect urls to follow into urls_to_follow list + requests = [Request(url, callback=parse_other_page) for url in urls_to_follow] + return requests + +def parse_other_page(response): + # ... parse items from response content ... + scraped_items.extend(parsed_items) + +start_urls = ["http://www.example.com/start_page.html"] + +cr = Crawler(start_urls, callback=parse_start_page) +cr.run() # blocking call - this populates scraped_items + +print "%d items scraped" % len(scraped_items) +# ... do something more interesting with scraped_items ... +}}} + +The behaviour of the Scrapy crawler would be controller by the Scrapy settings, naturally, just like any typical scrapy project. But the default settings should be sufficient so as to not require adding any specific setting. But, at the same time, you could do it if you need to, say, for specifying a custom middleware. + +It shouldn't be hard to implement this API as all this functionality is a (small) subset of the current Scrapy functionality. At the same time, it would provide an additional incentive for newcomers. + +== Crawler class == + +The Crawler class would have the following instance arguments (most of them have been singletons so far): + + * engine + * settings + * spiders + * extensions + +== Spider Manager == + +The role of the spider manager will be to "resolve" spiders from URLs and domains. Also, it should be moved outside scrapy.spider (and only BaseSpider left there). + +There is also the close_spider() method which is called for all closed spiders, even when they weren't resolved first by the spider manager. We need to decide what to do with this method. 
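+
+As the note above mentions, the library API eventually landed in a different form. For reference, here is a minimal sketch of that library-style usage with the later {{{scrapy.crawler.CrawlerProcess}}} API, which post-dates this SEP; the spider, settings and URLs are illustrative only:
+
+{{{
+#!python
+from scrapy import Spider
+from scrapy.crawler import CrawlerProcess
+
+# a container to hold scraped items, as in the proof-of-concept above
+scraped_items = []
+
+class ExampleSpider(Spider):
+    name = "example"
+    start_urls = ["http://www.example.com/start_page.html"]
+
+    def parse(self, response):
+        # collect data into the module-level list instead of yielding items
+        scraped_items.append({"url": response.url,
+                              "title": response.css("title::text").get()})
+
+process = CrawlerProcess(settings={"LOG_ENABLED": False})
+process.crawl(ExampleSpider)
+process.start()  # blocking call - runs the reactor until the crawl finishes
+
+print("%d items scraped" % len(scraped_items))
+}}}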
+ diff --git a/sep/sep-005.trac b/sep/sep-005.trac new file mode 100644 index 00000000000..a57bc080ce8 --- /dev/null +++ b/sep/sep-005.trac @@ -0,0 +1,119 @@ += SEP-005: Detailed !ItemBuilder API use = + +[[PageOutline(2-5,Contents)]] + +||'''SEP:'''||5|| +||'''Title:'''||!ItemBuilder API|| +||'''Author:'''||Ismael Carnales, Pablo Hoffman|| +||'''Created:'''||2009-07-24|| +||'''Status'''||Obsoleted by [wiki:SEP-008]|| + +Item class for examples: + +{{{ +#!python +class NewsItem(Item): + url = fields.TextField() + headline = fields.TextField() + content = fields.TextField() + published = fields.DateField() +}}} + +== Setting expanders == + +{{{ +#!python +class NewsItemBuilder(ItemBuilder): + item_class = NewsItem + + headline = reducers.Reducer(extract, remove_tags(), unquote(), strip) +}}} + +This approach will override the Reducer class for !BuilderFields depending on their Item Field class: + + * !MultivaluedField = PassValue + * !TextField = JoinStrings + * other = TakeFirst + +== Setting reducers == + +{{{ +#!python +class NewsItemBuilder(ItemBuilder): + item_class = NewsItem + + headline = reducers.TakeFirst(extract, remove_tags(), unquote(), strip) + published = reducers.Reducer(extract, remove_tags(), unquote(), strip) +}}} + +As with the previous example this would select join_strings as the reducer for content + +== Setting expanders/reducers new way == + +{{{ +#!python +class NewsItemBuilder(ItemBuilder): + item_class = NewsItem + + headline = BuilderField(extract, remove_tags(), unquote(), strip) + content = BuilderField(extract, remove_tags(), unquote(), strip) + + class Reducer: + headline = TakeFirst +}}} + +== Extending !ItemBuilder == + +{{{ +#!python +class SiteNewsItemBuilder(NewsItemBuilder): + published = reducers.Reducer(extract, remove_tags(), unquote(), strip, to_date('%d.%m.%Y')) +}}} + +== Extending !ItemBuilder using statich methods == + +{{{ +#!python +class SiteNewsItemBuilder(NewsItemBuilder): + published = reducers.Reducer(NewsItemBuilder.published, to_date('%d.%m.%Y')) +}}} + +== Using default_builder == + +{{{ +#!python +class DefaultedNewsItemBuilder(ItemBuilder): + item_class = NewsItem + + default_builder = reducers.Reducer(extract, remove_tags(), unquote(), strip) +}}} + +This will use default_builder as the builder for every field in the item class. +As a reducer is not set reducers will be set based on Item Field classess. 
+ +== Reset default_builder for a field == + +{{{ +#!python +class DefaultedNewsItemBuilder(ItemBuilder): + item_class = NewsItem + + default_builder = reducers.Reducer(extract, remove_tags(), unquote(), strip) + url = BuilderField() +}}} + +== Extending default !ItemBuilder == + +{{{ +#!python +class SiteNewsItemBuilder(NewsItemBuilder): + published = reducers.Reducer(extract, remove_tags(), unquote(), strip, to_date('%d.%m.%Y')) +}}} + +== Extending default !ItemBuilder using static methods == + +{{{ +#!python +class SiteNewsItemBuilder(NewsItemBuilder): + published = reducers.Reducer(NewsItemBuilder.default_builder, to_date('%d.%m.%Y')) +}}} \ No newline at end of file diff --git a/sep/sep-006.trac b/sep/sep-006.trac new file mode 100644 index 00000000000..ee19769b79e --- /dev/null +++ b/sep/sep-006.trac @@ -0,0 +1,52 @@ += SEP-006: Rename of Selectors to Extractors = + +[[PageOutline(2-5, Contents)]] + +||'''SEP'''||6|| +||'''Title'''||Extractors|| +||'''Author'''||Ismael Carnales and a bunch of rabid mice|| +||'''Created'''||2009-07-28|| +||'''Status'''||Obsolete (discarded)|| + +== Abstract == + +This SEP proposes a more meaningful naming of XPathSelectors or "Selectors" and their `x` method. + +== Motivation == + +When you use Selectors in Scrapy, your final goal is to "extract" the data that you've selected, as the [http://doc.scrapy.org/topics/selectors.html XPath Selectors documentation] says (bolding by me): + + "When you’re scraping web pages, the most common task you need to perform is to '''extract''' data from the HTML source." + + ... + + "Scrapy comes with its own mechanism for '''extracting''' data. They’re called XPath selectors (or just “selectors”, for short) because they “select” certain parts of the HTML document specified by XPath expressions." + + ... + + "To actually '''extract''' the textual data you must call the selector extract() method, as follows" + + ... + + "Selectors also have a re() method for '''extracting''' data using regular expressions." + + ... + + "For example, suppose you want to '''extract''' all

{{{<p>}}} elements inside {{{<div>}}} elements. First you would get all {{{<div>}}}
elements" + +== Rationale == + +As and there is no Extractor object in Scrapy and what you want to finally perform with Selectors is extracting data, we propose the renaming of Selectors to Extractors. (In Scrapy for extracting you use selectors is really weird :) ) + +=== Additional changes === + +As the name of the method for performing selection (the `x` method) is not descriptive nor mnemotechnic enough and clearly clashes with extract method (x sounds like a short for extract in english), we propose to rename it to `select`, `sel` (is shortness if required), or `xpath` after [http://codespeak.net/lxml/xpathxslt.html lxml's xpath method] + +== Bonus (!ItemBuilder) == + +After this renaming we propose also renaming !ItemBuilder to !ItemExtractor, because the !ItemBuilder/Extractor will act as a bridge between a set of Extractors and an Item and because it will literally "extract" an item from a webpage or set of pages. + +== References == + + 1. XPath Selectors (http://doc.scrapy.org/topics/selectors.html) + 2. XPath and XSLT with lxml (http://codespeak.net/lxml/xpathxslt.html) \ No newline at end of file diff --git a/sep/sep-007.trac b/sep/sep-007.trac new file mode 100644 index 00000000000..7d387c6b7e2 --- /dev/null +++ b/sep/sep-007.trac @@ -0,0 +1,108 @@ += SEP-007: !ItemLoader processors library = + +[[PageOutline(2-5, Contents)]] + +||'''SEP'''||7|| +||'''Title'''||!ItemLoader processors library|| +||'''Author'''||Ismael Carnales|| +||'''Created'''||2009-08-10|| +||'''Status'''||Draft|| + +== Introduction == + +This SEP proposes a library of !ItemLoader processor to ship with Scrapy. + +== date.py == + +=== `to_date` === + +Converts a date string to a YYYY-MM-DD one suitable for !DateField + +'''Decision''': Obsolete. !DateField doesn't exists anymore. + +== extraction.py == + +=== `extract` === + +This adaptor tries to extract data from the given locations. Any XPathSelector in it will be extracted, and any other data will be added as-is to the result. + +'''Decision''': Obsolete. Functionality included in !XpathLoader. + +=== `ExtractImageLinks` === + +This adaptor may receive either XPathSelectors pointing to the desired locations for finding image urls, or just a list of XPath expressions (which will be turned into selectors anyway). + +'''Decision''': XXX + +== markup.py == + +=== `remove_tags` === + +Factory that returns an adaptor for removing each tag in the `tags` parameter found in the given value. If no `tags` are specified, all of them are removed. + +'''Decision''': XXX + +=== `remove_root` === + +This adaptor removes the root tag of the given string/unicode, if it's found. + +'''Decision''': XXX + +=== `replace_escape` === + +Factory that returns an adaptor for removing/replacing each escape character in the `wich_ones` parameter found in the given value. + +'''Decision''': XXX + +=== `unquote` === + +This factory returns an adaptor that receives a string or unicode, removes all of the CDATAs and entities (except the ones in CDATAs, and the ones you specify in the `keep` parameter) and then, returns a new string or unicode. + +'''Decision''': XXX + +== misc.py == + +=== `to_unicode` === + +Receives a string and converts it to unicode using the given encoding (if specified, else utf-8 is used) and returns a new unicode object. E.g: + +{{{ +>> to_unicode('it costs 20\xe2\x82\xac, or 30\xc2\xa3') +[u'it costs 20\u20ac, or 30\xa3'] +}}} + +'''Decision''': XXX + +=== `clean_spaces` === + +Converts multispaces into single spaces for the given string. 
E.g: + +{{{ +>> clean_spaces(u'Hello sir') +u'Hello sir' +}}} + +'''Decision''': XXX + +=== `drop_empty` === + +Removes any index that evaluates to None from the provided iterable. E.g: + +{{{ +>> drop_empty([0, 'this', None, 'is', False, 'an example']) +['this', 'is', 'an example'] +}}} + +'''Decision''': Obsolete. Functionality included in reducers. + +=== `delist` === + +This factory returns and adaptor that joins an iterable with the specified delimiter. + +'''Decision''': Obsolete. Functionality included in reducers. + +=== `Regex` === + +This adaptor must receive either a list of strings or an XPathSelector and return a new list with the matches of the given strings with the given regular expression (which is passed by a keyword argument, and is mandatory for this adaptor). + +'''Decision''': XXX \ No newline at end of file diff --git a/sep/sep-008.trac b/sep/sep-008.trac new file mode 100644 index 00000000000..5d762cdaa8a --- /dev/null +++ b/sep/sep-008.trac @@ -0,0 +1,102 @@ += SEP-008 - Item Loaders = + +[[PageOutline(2-5,Contents)]] + +||'''SEP:'''||8|| +||'''Title:'''||Item Parsers|| +||'''Author:'''||Pablo Hoffman|| +||'''Created:'''||2009-08-11|| +||'''Status'''||Final (implemented with variations)|| +||'''Obsoletes'''||[wiki:SEP-001], [wiki:SEP-002], [wiki:SEP-003], [wiki:SEP-005]|| + +== Introduction == + +Item Parser is the final API proposed to implement Item Builders/Loader proposed in [wiki:SEP-001]. + +'''NOTE:''' This is the API that was finally implemented with the name "Item Loaders", instead of "Item Parsers" along with some other minor fine tuning to the API methods and semantics. + +== Dataflow == + + 1. !ItemParser.add_value() + 1. '''input_parser''' + 2. store + 2. !ItemParser.add_xpath() ''(only available in XPathItemLoader)'' + 1. selector.extract() + 2. '''input_parser''' + 3. store + 3. !ItemParser.populate_item() ''(ex. get_item)'' + 1. '''output_parser''' + 2. assign field + +== Modules and classes == + + * scrapy.contrib.itemparser.!ItemParser + * scrapy.contrib.itemparser.XPathItemParser + * scrapy.contrib.itemparser.parsers.!MapConcat ''(ex. 
!TreeExpander)'' + * scrapy.contrib.itemparser.parsers.!TakeFirst + * scrapy.contrib.itemparser.parsers.Join + * scrapy.contrib.itemparser.parsers.Identity + +== Public API == + + * !ItemParser.add_value() + * !ItemParser.replace_value() + * !ItemParser.populate_item() ''(returns item populated)'' + + * !ItemParser.get_collected_values() ''(note the 's' in values)'' + * !ItemParser.parse_field() + + * !ItemParser.get_input_parser() + * !ItemParser.get_output_parser() + + * !ItemParser.context + + * !ItemParser.default_item_class + * !ItemParser.default_input_parser + * !ItemParser.default_output_parser + * !ItemParser.''field''_in + * !ItemParser.''field''_out + +== Alternative Public API Proposal == + + * !ItemLoader.add_value() + * !ItemLoader.replace_value() + * !ItemLoader.load_item() ''(returns loaded item)'' + + * !ItemLoader.get_stored_values() or !ItemLoader.get_values() ''(returns the !ItemLoader values)'' + * !ItemLoader.get_output_value() + + * !ItemLoader.get_input_processor() or !ItemLoader.get_in_processor() ''(short version)'' + * !ItemLoader.get_output_processor() or !ItemLoader.get_out_processor() ''(short version)'' + + * !ItemLoader.context + + * !ItemLoader.default_item_class + * !ItemLoader.default_input_processor or !ItemLoader.default_in_processor ''(short version)'' + * !ItemLoader.default_output_processor or !ItemLoader.default_out_processor ''(short version)'' + * !ItemLoader.''field''_in + * !ItemLoader.''field''_out + +== Usage example: declaring Item Parsers == + +{{{ +#!python +from scrapy.contrib.itemparser import XPathItemParser, parsers + +class ProductParser(XPathItemParser): + name_in = parsers.MapConcat(removetags, filterx) + price_in = parsers.MapConcat(...) + + price_out = parsers.TakeFirst() +}}} + +== Usage example: declaring parsers in Fields == + +{{{ +#!python +class Product(Item): + name = Field(output_parser=parsers.Join(), ...) + price = Field(output_parser=parsers.TakeFirst(), ...) + + description = Field(input_parser=parsers.MapConcat(removetags)) +}}} diff --git a/sep/sep-009.trac b/sep/sep-009.trac new file mode 100644 index 00000000000..ce229d77267 --- /dev/null +++ b/sep/sep-009.trac @@ -0,0 +1,102 @@ += SEP-009 - Singletons removal = + +[[PageOutline(2-5,Contents)]] + +||'''SEP:'''||9|| +||'''Title:'''||Singleton removal || +||'''Author:'''||Pablo Hoffman|| +||'''Created:'''||2009-11-14|| +||'''Status'''||Document in progress (being written)|| + +== Introduction == + +This SEP proposes a refactoring of the Scrapy to get ri of singletons, which will result in a cleaner API and will allow us to implement the library API proposed in [wiki:SEP-004]. 
+ +== Current singletons == + +Scrapy 0.7 has the following singletons: + + * Execution engine ({{{scrapy.core.engine.scrapyengine}}}) + * Execution manager ({{{scrapy.core.manager.scrapymanager}}}) + * Extension manager ({{{scrapy.extension.extensions}}}) + * Spider manager ({{{scrapy.spider.spiders}}}) + * Stats collector ({{{scrapy.stats.stats}}}) + * Logging system ({{{scrapy.log}}}) + * Signals system ({{{scrapy.xlib.pydispatcher}}}) + +== Proposed API == + +The proposed architecture is to have one "root" object called {{{Crawler}}} (which will replace the current Execution Manager) and make all current singletons members of that object, as explained below: + + * '''crawler''': {{{scrapy.crawler.Crawler}}} instance (replaces current {{{scrapy.core.manager.ExecutionManager}}}) - instantiated with a {{{Settings}}} object + * '''crawler.settings''': {{{scrapy.conf.Settings}}} instance (passed in the constructor) + * '''crawler.extensions''': {{{scrapy.extension.ExtensionManager}}} instance + * '''crawler.engine''': {{{scrapy.core.engine.ExecutionEngine}}} instance + * {{{crawler.engine.scheduler}}} + * {{{crawler.engine.scheduler.middleware}}} - to access scheduler middleware + * {{{crawler.engine.downloader}}} + * {{{crawler.engine.downloader.middleware}}} - to access downloader middleware + * {{{crawler.engine.scraper}}} + * {{{crawler.engine.scraper.spidermw}}} - to access spider middleware + * '''crawler.spiders''': {{{SpiderManager}}} instance (concrete class given in {{{SPIDER_MANAGER_CLASS}}} setting) + * '''crawler.stats''': {{{StatsCollector}}} instance (concrete class given in {{{STATS_CLASS}}} setting) + * '''crawler.log''': Logger class with methods replacing the current {{{scrapy.log}}} functions. Logging would be started (if enabled) on {{{Crawler}}} constructor, so no log starting functions are required. + * {{{crawler.log.msg}}} + * '''crawler.signals''': signal handling + * {{{crawler.signals.send()}}} - same as {{{pydispatch.dispatcher.send()}}} + * {{{crawler.signals.connect()}}} - same as {{{pydispatch.dispatcher.connect()}}} + * {{{crawler.signals.disconnect()}}} - same as {{{pydispatch.dispatcher.disconnect()}}} + +== Required code changes after singletons removal == + +All components (extensions, middlewares, etc) will receive this {{{Crawler}}} object in their constructors, and this will be the only mechanism for accessing any other components (as opposed to importing each singleton from their respective module). This will also serve to stabilize the core API, something which we haven't documented so far (partly because of this). + +So, for a typical middleware constructor code, instead of this: + +{{{ +#!python +from scrapy.core.exceptions import NotConfigured +from scrapy.conf import settings + +class SomeMiddleware(object): + def __init__(self): + if not settings.getbool('SOMEMIDDLEWARE_ENABLED'): + raise NotConfigured +}}} + +We'd write this: + +{{{ +#!python +from scrapy.core.exceptions import NotConfigured + +class SomeMiddleware(object): + def __init__(self, crawler): + if not crawler.settings.getbool('SOMEMIDDLEWARE_ENABLED'): + raise NotConfigured +}}} + +== Running from command line == + +When running from '''command line''' (the only mechanism supported so far) the {{{scrapy.command.cmdline}}} module will: + + 1. instantiate a {{{Settings}}} object and populate it with the values in SCRAPY_SETTINGS_MODULE, and per-command overrides + 2. 
instantiate a {{{Crawler}}} object with the {{{Settings}}} object (the {{{Crawler}}} instantiates all its components based on the given settings) + 3. run {{{Crawler.crawl()}}} with the URLs or domains passed in the command line + +== Using Scrapy as a library == + +When using Scrapy with the '''library API''', the programmer will: + + 1. instantiate a {{{Settings}}} object (which only has the defaults settings, by default) and override the desired settings + 2. instantiate a {{{Crawler}}} object with the {{{Settings}}} object + +== Open issues to resolve == + + * Should we pass {{{Settings}}} object to {{{ScrapyCommand.add_options()}}}? + * How should spiders access settings? + * Option 1. Pass {{{Crawler}}} object to spider constructors too + * pro: one way to access all components (settings and signals being the most relevant to spiders) + * con?: spider code can access (and control) any crawler component - since we don't want to support spiders messing with the crawler (write an extension or spider middleware if you need that) + * Option 2. Pass {{{Settings}}} object to spider constructors, which would then be accessed through {{{self.settings}}}, like logging which is accessed through {{{self.log}}} + * con: would need a way to access stats too \ No newline at end of file diff --git a/sep/sep-010.trac b/sep/sep-010.trac new file mode 100644 index 00000000000..4313fd0c328 --- /dev/null +++ b/sep/sep-010.trac @@ -0,0 +1,59 @@ += SEP-010: REST API = + +[[PageOutline(2-5,Contents)]] + +||'''SEP:'''||10|| +||'''Title:'''||REST API|| +||'''Author:'''||Pablo Hoffman|| +||'''Created:'''||2009-11-16|| +||'''Status'''||Obsolete (JSON-RPC API implemented instead)|| + +== Introduction == + +This SEP proposes a JSON REST API for controlling Scrapy in server-mode, which is launched with: {{{scrapy-ctl.py start}}} + +== Operations == + +=== Get list of available spiders === + +GET /spiders/all + +=== Get list of closed spiders === + +GET /spiders/closed + +=== Get list of scheduled spiders === + +GET /spiders/scheduled + + * note: contains closed + +=== Get list of running spiders === + +GET /spiders/opened + + * returns list of dicts containing spider id and domain_name + +=== Schedule spider === + +POST /spiders + + * args: schedule=example.com + +=== Close spider === + +POST /spider/1238/close + +=== Get global stats === + +GET /stats + + * note: spider-specific not included + +=== Get spider-specific stats === + +GET /spider/1238/stats/ + +=== Get engine status === + +GET /engine/status diff --git a/sep/sep-011.trac b/sep/sep-011.trac new file mode 100644 index 00000000000..0a475bd18e2 --- /dev/null +++ b/sep/sep-011.trac @@ -0,0 +1,30 @@ += SEP-011: Process models for Scrapy = + +[[PageOutline(2-5,Contents)]] + +||'''SEP:'''||11|| +||'''Title:'''||Process models for Scrapy|| +||'''Author:'''||Pablo Hoffman|| +||'''Created:'''||2009-11-16|| +||'''Status'''||Partially implemented - see #168 || + +== Introduction == + +There is an interest of supporting different process models for Scrapy, mainly to help prevent memory leaks which affect running all spiders in the same process. + +By running each spider on a separate process (or pool of processes) we'll be able to "recycle" process when they exceed a maximum amount of memory. + +== Supported process models == + +The user could choose between different process models: + + 1. in process (only method supported so far) + 2. pooled processes (a predefined pool of N processes, which could run more than one spider each) + 3. 
separate processes (one process per spider) + +Using different processes would increase reliability at the cost of performance. + +== Another ideas to consider == + + * configuring pipeline process models - so that we can have a process exclusive for running pipelines + * support writing spidersr in different languages when we don't use an in process model diff --git a/sep/sep-012.trac b/sep/sep-012.trac new file mode 100644 index 00000000000..de846090c01 --- /dev/null +++ b/sep/sep-012.trac @@ -0,0 +1,73 @@ += SEP-012: Spider name = + +[[PageOutline(2-5,Contents)]] + +||'''SEP:'''||12|| +||'''Title:'''||Spider name|| +||'''Author:'''||Ismael Carnales, Pablo Hoffman|| +||'''Created:'''||2009-12-01|| +||'''Updated:'''||2010-03-23|| +||'''Status'''||Final|| + +== Introduction == + +The spiders are currently referenced by its {{{domain_name}}} attribute. This SEP proposes adding a {{{name}}} attribute to spiders and using it as their identifier. + +== Current limitations and flaws == + + 1. You can't create two spiders that scrape the same domain (without using workarounds like assigning an arbitrary {{{domain_name}}} and putting the real domains in the {{{extra_domain_names}}} attributes) + 2. For spiders with multiple domains, you have to specify them in two different places: {{{domain_name}}} and {{{extra_domain_names}}}. + +== Proposed changes == + + 1. Add a {{{name}}} attribute to spiders and use it as their unique identifier. + 2. Merge {{{domain_name}}} and {{{extra_domain_names}}} attributes in a single list {{{allowed_domains}}}. + +== Implications of the changes == + +=== General === + +In general, all references to {{{spider.domain_name}}} will be replaced by {{{spider.name}}} + +=== !OffsiteMiddleware === + +!OffsiteMiddleware will use {{{spider.allowed_domains}}} for determining the domain names of a spider + +=== scrapy-ctl.py === + +==== crawl ==== + +The new syntax for crawl command will be: + +{{{ +crawl [options] ... +}}} + +If you provide an url, it will try to find the spider the processes it. If no spider is found or more than one spider is found, it will raise an error. So, to crawl in those cases you must set the spider to use using the {{{--spider}}} option + +==== genspider ==== + +The new signature for genspider will be: + +{{{ +genspider [options] +}}} + +example: +{{{ +$scrapy-ctl genspider google google.com + +$ ls project/spiders/ +project/spiders/google.py + +$ cat project/spiders/google.py + +... + +class GooglecomSpider(BaseSpider): + name = 'google' + allowed_domains = ['google.com'] + ... +}}} + +Note: {{{spider_allowed_domains}}} becomes optional as only !OffsiteMiddleware uses it. diff --git a/sep/sep-013.trac b/sep/sep-013.trac new file mode 100644 index 00000000000..40784778aad --- /dev/null +++ b/sep/sep-013.trac @@ -0,0 +1,129 @@ += SEP-013 - Middlewares refactoring = + +[[PageOutline(2-5,Contents)]] + +||'''SEP:'''||13|| +||'''Title:'''||Middlewares Refactoring|| +||'''Author:'''||Pablo Hoffman|| +||'''Created:'''||2009-11-14|| +||'''Status'''||Document in progress (being written)|| + +== Introduction == + +This SEP proposes a refactoring of Scrapy middlewares to remove some inconsistencies and limitations. + +== Current flaws and inconsistencies == + +Even though the core works pretty well, it has some subtle inconsistencies that don't manifest in the common uses, but arise (and are quite annoying) when you try to fully exploit all Scrapy features. The currently identified flaws and inconsistencies are: + + 1. 
Request errback may not get called in all cases (more details needed on when this happens) + 2. Spider middleware has a {{{process_spider_exception}}} method which catches exceptions coming out of the spider, but it doesn't have an analogous for catching exceptions coming into the spider (for example, from other downloader middlewares). This complicates supporting middlewares that extend other middlewares. + 3. Downloader middleware has a {{{process_exception}}} method which catches exceptions coming out of the downloader, but it doesn't have an analogous for catching exceptions coming into the downloader (for example, from other downloader middlewares). This complicates supporting middlewares that extend other middlewares. + 4. Scheduler middleware has a {{{enqueue_request}}} method but doesn't have a {{{enqueue_request_exception}}} nor {{{dequeue_request}}} nor {{{dequeue_request_exception}}} methods. + +These flaws will be corrected by the changes proposed in this SEP. + +== Overview of changes proposed == + +Most of the inconsistencies come from the fact that middlewares don't follow the typical [http://twistedmatrix.com/projects/core/documentation/howto/defer.html deferred] callback/errback chaining logic. Twisted logic is fine and quite intuitive, and also fits middlewares very well. Due to some bad design choices the integration between middleware calls and deferred is far from optional. So the changes to middlewares involve mainly building deferred chains with the middleware methods and adding the missing method to each callback/errback chain. The proposed API for each middleware is described below. + +See [attachment:scrapy_core_v2.jpg] - a diagram draft for the proposes architecture. + +== Global changes to all middlewares == + +To be discussed: + + 1. should we support returning deferreds (ie. {{{maybeDeferred}}}) in middleware methods? + 2. should we pass Twisted Failures instead of exceptions to error methods? + +== Spider middleware changes == + +=== Current API === + + * {{{process_spider_input(response, spider)}}} + * {{{process_spider_output(response, result, spider)}}} + * {{{process_spider_exception(response, exception, spider=spider)}}} + +=== Changes proposed === + + 1. rename method: {{{process_spider_exception}}} to {{{process_spider_output_exception}}} + 2. add method" {{{process_spider_input_exception}}} + +=== New API === + + * {{{SpiderInput}}} deferred + * {{{process_spider_input(response, spider)}}} + * {{{process_spider_input_exception(response, exception, spider=spider)}}} + * {{{SpiderOutput}}} deferred + * {{{process_spider_output(response, result, spider)}}} + * {{{process_spider_output_exception(response, exception, spider=spider)}}} + +== Downloader middleware changes == + +=== Current API === + + * {{{process_request(request, spider)}}} + * {{{process_response(request, response, spider)}}} + * {{{process_exception(request, exception, spider)}}} + +=== Changes proposed === + + 1. rename method: {{{process_exception}}} to {{{process_response_exception}}} + 2. 
add method: {{{process_request_exception}}} + +=== New API === + + * {{{ProcessRequest}}} deferred + * {{{process_request(request, spider)}}} + * {{{process_request_exception(request, exception, response)}}} + * {{{ProcessResponse}}} deferred + * {{{process_response(request, spider, response)}}} + * {{{process_response_exception(request, exception, response)}}} + +== Scheduler middleware changes == + +=== Current API === + + * {{{enqueue_request(spider, request)}}} + * '''TBD:''' what does it mean to return a Response object here? (the current implementation allows it) + * {{{open_spider(spider)}}} + * {{{close_spider(spider)}}} + +=== Changes proposed === + + 1. exchange order of method arguments '''(spider, request)''' to '''(request, spider)''' for consistency with the other middlewares + 2. add methods: {{{dequeue_request}}}, {{{enqueue_request_exception}}}, {{{dequeue_request_exception}}} + 3. remove methods: {{{open_spider}}}, {{{close_spider}}}. They should be replaced by using the {{{spider_opened}}}, {{{spider_closed}}} signals, but they weren't before because of a chicken-egg problem when open spiders (because of scheduler auto-open feature). + * '''TBD:''' how to get rid of chicken-egg problem, perhaps refactoring scheduler auto-open? + +=== New API === + + * {{{EnqueueRequest}}} deferred + * {{{enqueue_request(request, spider)}}} + * Can return: + * return Request: which is passed to next mw component + * raise {{{IgnoreRequest}}} + * raise any other exception (errback chain called) + * {{{enqueue_request_exception(request, exception, spider)}}} + * Output and errors: + * The Request that gets returned by last enqueue_request() is the one that gets scheduled + * If no request is returned but a Failure, the Request errback is called with that failure + * '''TBD''': do we want to call request errback if it fails scheduling?0 + * {{{DequeueRequest}}} deferred + * {{{dequeue_request(request, spider)}}} + * {{{dequeue_request_exception(exception, spider)}}} + +== Open issues (to resolve) == + + 1. how to avoid massive {{{IgnoreRequest}}} exceptions from propagating which slows down the crawler + 1. if requests change, how do we keep reference to the original one? do we need to? + * opt 1: don't allow changing the original Request object - discarded + * opt 2: keep reference to the original request (how it's done now) + * opt 3: split SpiderRequest from DownloaderRequest + * opt 5: keep reference only to original deferred and forget about the original request + 1. scheduler auto-open chicken-egg problem + * opt 1: drop auto-open y forbid opening spiders if concurrent is full. use SpiderScheduler instead. why is scheduler auto-open really needed? + 1. call {{{Request.errback}}} if both schmw and dlmw fail? + * opt 1: ignore and just propagate the error as-is + * opt 2: call another method? like Request.schmw_errback / dlmw_errback? + * opt 3: use an exception wrapper? SchedulerError() DownloaderError()? diff --git a/sep/sep-014.trac b/sep/sep-014.trac new file mode 100644 index 00000000000..5b19784f042 --- /dev/null +++ b/sep/sep-014.trac @@ -0,0 +1,612 @@ += SEP-014 - !CrawlSpider v2 = + +[[PageOutline(2-5,Contents)]] + +||'''SEP:'''||14|| +||'''Title:'''||!CrawlSpider v2|| +||'''Author:'''||Insophia Team|| +||'''Created:'''||2010-01-22|| +||'''Updated:'''||2010-02-04|| +||'''Status'''||Final. 
Partially implemented but discarded because of lack of use in r2632|| + +== Introduction == + +This SEP proposes a rewrite of Scrapy !CrawlSpider and related components + +== Current flaws and inconsistencies == + + 1. Request's callbacks are hard to persist. + 2. Link extractors are inflexible and hard to maintain, link processing/filtering is tightly coupled. (e.g. canonicalize) + 3. Isn't possible to crawl an url directly from command line because the Spider does not know which callback use. + +These flaws will be corrected by the changes proposed in this SEP. + +== Proposed API Changes == + + * Separate the functionality of Rule-!LinkExtractor-Callback + * Separate the functionality of !LinkExtractor to Request Extractor and Request Processor + * Separate the process of determining response callback and the extraction of new requests (link extractors) + * The callback will be determine by Matcher Objects on request/response objects + +=== Matcher Objects === + +Matcher Objects (aka Matcher) are responsible for determining if given request or response matches an arbitrary criteria. +The Matcher receives as argument the request or the response, giving a powerful access to all request/response attributes. + +In the current !CrawlSpider, the Rule Object has the responsability to determine the callback of given extractor, +and the link extractor contains the url pattern (aka regex). Now the Matcher will contain only the pattern or criteria +to determine which request/response will execute any action. See below Spider Rules. + +=== Request Extractors === + +Request Extractors takes response object and determines which requests follow. + +This is an enhancemente to !LinkExtractors which returns urls (links), Request Extractors +return Request objects. + +=== Request Processors === + +Request Processors takes requests objects and can perform any action to them, like filtering +or modifying on the fly. + +The current !LinkExtractor had integrated link processing, like canonicalize. Request Processors +can be reutilized and applied in serie. + +=== Request Generator === + +Request Generator is the decoupling of the !CrawlSpider's method _request_to_follow(). +Request Generator takes the response object and applies the Request Extractors and Request Processors. + +=== Rules Manager === + +The Rules are a definition of Rule objects containing Matcher Objects and callback. + +The Legacy Rules were used to perform the link extraction and attach the callback to the generated Request object. +The proposed new Rules will be used to determine the callback for given response. This opens a whole of opportunities, +like determine the callback for given url, and persist the queue of Request objects because the callback is determined +the matching the Response object against the Rules. 
+ +== Usage Examples == + +=== Basic Crawling === + +{{{ +#!python +# +# Basic Crawling +# +class SampleSpider(CrawlSpider): + rules = [ + # The dispatcher uses first-match policy + Rule(UrlRegexMatch(r'product\.html\?id=\d+'), 'parse_item', follow=False), + # by default, if the first param is string is wrapped into UrlRegexMatch + Rule(r'.+', 'parse_page'), + ] + + request_extractors = [ + # crawl all links looking for products and images + SgmlRequestExtractor(), + ] + + request_processors = [ + # canonicalize all requests' urls + Canonicalize(), + ] + + def parse_item(self, response): + # parse and extract items from response + pass + + def parse_page(self, response): + # extract images on all pages + pass +}}} + +=== Custom Processor and External Callback === + +{{{ +#!python +# +# Using external callbacks +# + +# Custom Processor +def filter_today_links(requests): + # only crawl today links + today = datetime.datetime.today().strftime('%Y-%m-%d') + return [r for r in requests if today in r.url] + +# Callback defined out of spider +def my_external_callback(response): + # process item + pass + +class SampleSpider(CrawlSpider): + rules = [ + # The dispatcher uses first-match policy + Rule(UrlRegexMatch(r'/news/(.+)/'), my_external_callback), + ] + + request_extractors = [ + RegexRequestExtractor(r'/sections/.+'), + RegexRequestExtractor(r'/news/.+'), + ] + + request_processors = [ + # canonicalize all requests' urls + Canonicalize(), + filter_today_links, + ] + +}}} + +== Implementation == + +*Work-in-progress* + +=== Package Structure === +{{{ +contrib_exp + |- crawlspider/ + |- spider.py + |- CrawlSpider + |- rules.py + |- Rule + |- CompiledRule + |- RulesManager + |- reqgen.py + |- RequestGenerator + |- reqproc.py + |- Canonicalize + |- Unique + |- ... + |- reqext.py + |- SgmlRequestExtractor + |- RegexRequestExtractor + |- ... + |- matchers.py + |- BaseMatcher + |- UrlMatcher + |- UrlRegexMatcher + |- ... +}}} + +=== Request/Response Matchers === +{{{ +#!python +""" +Request/Response Matchers + +Perform evaluation to Request or Response attributes +""" + +class BaseMatcher(object): + """Base matcher. 
Returns True by default.""" + + def matches_request(self, request): + """Performs Request Matching""" + return True + + def matches_response(self, response): + """Performs Response Matching""" + return True + + +class UrlMatcher(BaseMatcher): + """Matches URL attribute""" + + def __init__(self, url): + """Initialize url attribute""" + self._url = url + + def matches_url(self, url): + """Returns True if given url is equal to matcher's url""" + return self._url == url + + def matches_request(self, request): + """Returns True if Request's url matches initial url""" + return self.matches_url(request.url) + + def matches_response(self, response): + """REturns True if Response's url matches initial url""" + return self.matches_url(response.url) + + +class UrlRegexMatcher(UrlMatcher): + """Matches URL using regular expression""" + + def __init__(self, regex, flags=0): + """Initialize regular expression""" + self._regex = re.compile(regex, flags) + + def matches_url(self, url): + """Returns True if url matches regular expression""" + return self._regex.search(url) is not None +}}} + +=== Request Extractor === +{{{ +#!python +# +# Requests Extractor +# Extractors receive response and return list of Requests +# + +class BaseSgmlRequestExtractor(FixedSGMLParser): + """Base SGML Request Extractor""" + + def __init__(self, tag='a', attr='href'): + """Initialize attributes""" + FixedSGMLParser.__init__(self) + + self.scan_tag = tag if callable(tag) else lambda t: t == tag + self.scan_attr = attr if callable(attr) else lambda a: a == attr + self.current_request = None + + def extract_requests(self, response): + """Returns list of requests extracted from response""" + return self._extract_requests(response.body, response.url, + response.encoding) + + def _extract_requests(self, response_text, response_url, response_encoding): + """Extract requests with absolute urls""" + self.reset() + self.feed(response_text) + self.close() + + base_url = self.base_url if self.base_url else response_url + self._make_absolute_urls(base_url, response_encoding) + self._fix_link_text_encoding(response_encoding) + + return self.requests + + def _make_absolute_urls(self, base_url, encoding): + """Makes all request's urls absolute""" + for req in self.requests: + url = req.url + # make absolute url + url = urljoin_rfc(base_url, url, encoding) + url = safe_url_string(url, encoding) + # replace in-place request's url + req.url = url + + def _fix_link_text_encoding(self, encoding): + """Convert link_text to unicode for each request""" + for req in self.requests: + req.meta.setdefault('link_text', '') + req.meta['link_text'] = str_to_unicode(req.meta['link_text'], + encoding) + + def reset(self): + """Reset state""" + FixedSGMLParser.reset(self) + self.requests = [] + self.base_url = None + + def unknown_starttag(self, tag, attrs): + """Process unknown start tag""" + if 'base' == tag: + self.base_url = dict(attrs).get('href') + + if self.scan_tag(tag): + for attr, value in attrs: + if self.scan_attr(attr): + if value is not None: + req = Request(url=value) + self.requests.append(req) + self.current_request = req + + def unknown_endtag(self, tag): + """Process unknown end tag""" + self.current_request = None + + def handle_data(self, data): + """Process data""" + current = self.current_request + if current and not 'link_text' in current.meta: + current.meta['link_text'] = data.strip() + + +class SgmlRequestExtractor(BaseSgmlRequestExtractor): + """SGML Request Extractor""" + + def __init__(self, tags=None, attrs=None): + 
"""Initialize with custom tag & attribute function checkers""" + # defaults + tags = tuple(tags) if tags else ('a', 'area') + attrs = tuple(attrs) if attrs else ('href', ) + + tag_func = lambda x: x in tags + attr_func = lambda x: x in attrs + BaseSgmlRequestExtractor.__init__(self, tag=tag_func, attr=attr_func) + + +class XPathRequestExtractor(SgmlRequestExtractor): + """SGML Request Extractor with XPath restriction""" + + def __init__(self, restrict_xpaths, tags=None, attrs=None): + """Initialize XPath restrictions""" + self.restrict_xpaths = tuple(arg_to_iter(restrict_xpaths)) + SgmlRequestExtractor.__init__(self, tags, attrs) + + def extract_requests(self, response): + """Restrict to XPath regions""" + hxs = HtmlXPathSelector(response) + fragments = (''.join( + html_frag for html_frag in hxs.select(xpath).extract() + ) for xpath in self.restrict_xpaths) + html_slice = ''.join(html_frag for html_frag in fragments) + return self._extract_requests(html_slice, response.url, + response.encoding) + +}}} + +=== Request Processor === +{{{ +#!python +# +# Request Processors +# Processors receive list of requests and return list of requests +# +"""Request Processors""" + +class Canonicalize(object): + """Canonicalize Request Processor""" + + def __call__(self, requests): + """Canonicalize all requests' urls""" + for req in requests: + # replace in-place + req.url = canonicalize_url(req.url) + yield req + + +class Unique(object): + """Filter duplicate Requests""" + + def __init__(self, *attributes): + """Initialize comparison attributes""" + self._attributes = attributes or ['url'] + + def _requests_equal(self, req1, req2): + """Attribute comparison helper""" + for attr in self._attributes: + if getattr(req1, attr) != getattr(req2, attr): + return False + # all attributes equal + return True + + def _request_in(self, request, requests_seen): + """Check if request is in given requests seen list""" + for seen in requests_seen: + if self._requests_equal(request, seen): + return True + # request not seen + return False + + def __call__(self, requests): + """Filter seen requests""" + # per-call duplicates filter + requests_seen = set() + for req in requests: + if not self._request_in(req, requests_seen): + yield req + # registry seen request + requests_seen.add(req) + + +class FilterDomain(object): + """Filter request's domain""" + + def __init__(self, allow=(), deny=()): + """Initialize allow/deny attributes""" + self.allow = tuple(arg_to_iter(allow)) + self.deny = tuple(arg_to_iter(deny)) + + def __call__(self, requests): + """Filter domains""" + processed = (req for req in requests) + + if self.allow: + processed = (req for req in requests + if url_is_from_any_domain(req.url, self.allow)) + if self.deny: + processed = (req for req in requests + if not url_is_from_any_domain(req.url, self.deny)) + + return processed + + +class FilterUrl(object): + """Filter request's url""" + + def __init__(self, allow=(), deny=()): + """Initialize allow/deny attributes""" + _re_type = type(re.compile('', 0)) + + self.allow_res = [x if isinstance(x, _re_type) else re.compile(x) + for x in arg_to_iter(allow)] + self.deny_res = [x if isinstance(x, _re_type) else re.compile(x) + for x in arg_to_iter(deny)] + + def __call__(self, requests): + """Filter request's url based on allow/deny rules""" + #TODO: filter valid urls here? 
+
+        processed = (req for req in requests)
+
+        if self.allow_res:
+            processed = (req for req in requests
+                            if self._matches(req.url, self.allow_res))
+        if self.deny_res:
+            processed = (req for req in requests
+                            if not self._matches(req.url, self.deny_res))
+
+        return processed
+
+    def _matches(self, url, regexs):
+        """Returns True if url matches any regex in given list"""
+        return any(r.search(url) for r in regexs)
+
+
+}}}
+
+=== Rule Object ===
+{{{
+#!python
+#
+# Dispatch Rules classes
+# Manage Rules (Matchers + Callbacks)
+#
+class Rule(object):
+    """Crawler Rule"""
+    def __init__(self, matcher, callback=None, cb_args=None,
+                 cb_kwargs=None, follow=True):
+        """Store attributes"""
+        self.matcher = matcher
+        self.callback = callback
+        self.cb_args = cb_args if cb_args else ()
+        self.cb_kwargs = cb_kwargs if cb_kwargs else {}
+        self.follow = follow
+
+#
+# Rules Manager takes a list of Rule objects and normalizes matcher and callback
+# into CompiledRule
+#
+class CompiledRule(object):
+    """Compiled version of Rule"""
+    def __init__(self, matcher, callback=None, follow=False):
+        """Initialize attributes checking type"""
+        assert isinstance(matcher, BaseMatcher)
+        assert callback is None or callable(callback)
+        assert isinstance(follow, bool)
+
+        self.matcher = matcher
+        self.callback = callback
+        self.follow = follow
+}}}
+
+=== Rules Manager ===
+{{{
+#!python
+#
+# Handles rules matchers/callbacks
+# Resolves the rule for a given response
+#
+class RulesManager(object):
+    """Rules Manager"""
+    def __init__(self, rules, spider, default_matcher=UrlRegexMatcher):
+        """Initialize rules using spider and default matcher"""
+        self._rules = tuple()
+
+        # compile absolute/relative-to-spider callbacks
+        for rule in rules:
+            # prepare matcher
+            if isinstance(rule.matcher, BaseMatcher):
+                matcher = rule.matcher
+            else:
+                # matcher not BaseMatcher, check for string
+                if isinstance(rule.matcher, basestring):
+                    # instantiate default matcher
+                    matcher = default_matcher(rule.matcher)
+                else:
+                    raise ValueError('Invalid matcher given %r in %r' \
+                                     % (rule.matcher, rule))
+
+            # prepare callback
+            if callable(rule.callback):
+                callback = rule.callback
+            elif rule.callback is not None:
+                # callback from spider
+                callback = getattr(spider, rule.callback)
+
+                if not callable(callback):
+                    raise AttributeError('Invalid callback %r can not be resolved' \
+                                         % callback)
+            else:
+                callback = None
+
+            if rule.cb_args or rule.cb_kwargs:
+                # build partial callback
+                callback = partial(callback, *rule.cb_args, **rule.cb_kwargs)
+
+            # append compiled rule to rules list
+            crule = CompiledRule(matcher, callback, follow=rule.follow)
+            self._rules += (crule, )
+
+    def get_rule(self, response):
+        """Returns first rule that matches response"""
+        for rule in self._rules:
+            if rule.matcher.matches_response(response):
+                return rule
+
+}}}
+
+=== Request Generator ===
+{{{
+#!python
+#
+# Request Generator
+# Takes a response and generates requests using extractors and processors
+#
+class RequestGenerator(object):
+    def __init__(self, req_extractors, req_processors, callback):
+        self._request_extractors = req_extractors
+        self._request_processors = req_processors
+        self.callback = callback
+
+    def generate_requests(self, response):
+        """
+        Extract and process new requests from response
+        """
+        requests = []
+        for ext in self._request_extractors:
+            requests.extend(ext.extract_requests(response))
+
+        for proc in self._request_processors:
+            requests = proc(requests)
+
+        for request in requests:
+            yield request.replace(callback=self.callback)
+}}}
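+
+The extractors and processors above follow a simple protocol: an extractor takes a response and returns a list of requests, and a processor is a callable that takes an iterable of requests and returns (or yields) an iterable of requests. The short sketch below shows how such a chain could be applied by hand, which is essentially what {{{RequestGenerator.generate_requests()}}} does; the concrete extractor, the processor arguments and the `response` object are only illustrative:
+
+{{{
+#!python
+extractors = [SgmlRequestExtractor()]
+processors = [Canonicalize(), Unique(), FilterDomain(allow=('example.com', ))]
+
+# extract candidate requests from an already downloaded response
+requests = []
+for ext in extractors:
+    requests.extend(ext.extract_requests(response))
+
+# pipe them through the processors; each one may rewrite or drop requests
+for proc in processors:
+    requests = proc(requests)
+
+requests = list(requests)
+}}}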
+ +=== !CrawlSpider === +{{{ +#!python +# +# Spider +# +class CrawlSpider(InitSpider): + """CrawlSpider v2""" + + request_extractors = [] + request_processors = [] + rules = [] + + def __init__(self): + """Initialize dispatcher""" + super(CrawlSpider, self).__init__() + + # wrap rules + self._rulesman = RulesManager(self.rules, spider=self) + # generates new requests with given callback + self._reqgen = RequestGenerator(self.request_extractors, + self.request_processors, + self.parse) + + def parse(self, response): + """Dispatch callback and generate requests""" + # get rule for response + rule = self._rulesman.get_rule(response) + if rule: + # dispatch callback if set + if rule.callback: + output = iterate_spider_output(rule.callback(response)) + for req_or_item in output: + yield req_or_item + + if rule.follow: + for req in self._reqgen.generate_requests(response): + yield req + +}}} + diff --git a/sep/sep-015.trac b/sep/sep-015.trac new file mode 100644 index 00000000000..43ae5f1cae5 --- /dev/null +++ b/sep/sep-015.trac @@ -0,0 +1,50 @@ += SEP-015: !ScrapyManager and !SpiderManager API refactoring = + +[[PageOutline(2-5,Contents)]] + +||'''SEP:'''||15|| +||'''Title:'''||!ScrapyManager and !SpiderManger API refactoring|| +||'''Author:'''||Insophia Team|| +||'''Created:'''||2010-03-10|| +||'''Status'''||Final|| + +== Introduction == + +This SEP proposes a refactoring of !ScrapyManager and !SpiderManager APIs. + +== !SpiderManager == + + * get(spider_name) -> Spider instance + * find_by_request(request) -> list of spider names + * list() -> list of spider names + + * remove: fromdomain(), fromurl() + +== !ScrapyManager == + + * crawl_request(request, spider=None) + * calls !SpiderManager.find_by_request(request) if spider is None + * fails if len(spiders returned) != 1 + * crawl_spider(spider) + * calls spider.start_requests() + * crawl_spider_name(spider_name) + * calls !SpiderManager.get(spider_name) + * calls spider.start_requests() + * crawl_url(url) + * calls spider.make_requests_from_url() + + * remove crawl(), runonce() + +Instead of using runonce(), commands (such as crawl/parse) would call crawl_* and then start(). + +== Changes to Commands == + + * if is_url(arg): + * calls !ScrapyManager.crawl_url(arg) + * else: + * calls !ScrapyManager.crawl_spider_name(arg) + +== Pending issues == + + * should we rename !ScrapyManager.crawl_* to schedule_* or add_* ? + * !SpiderManager.find_by_request or !SpiderManager.search(request=request) ? diff --git a/sep/sep-016.trac b/sep/sep-016.trac new file mode 100644 index 00000000000..9f7ada3c6ba --- /dev/null +++ b/sep/sep-016.trac @@ -0,0 +1,265 @@ += SEP-016: Leg Spider = + +[[PageOutline(2-5,Contents)]] + +||'''SEP:'''||16|| +||'''Title:'''||Leg Spider|| +||'''Author:'''||Insophia Team|| +||'''Created:'''||2010-06-03|| +||'''Status'''||Superseded by [wiki:SEP-018]|| + +== Introduction == + +This SEP introduces a new kind of Spider called {{{LegSpider}}} which provides modular functionality which can be plugged to different spiders. + +== Rationale == + +The purpose of Leg Spiders is to define an architecture for building spiders based on smaller well-tested components (aka. Legs) that can be combined to achieve the desired functionality. These reusable components will benefit all Scrapy users by building a repository of well-tested components (legs) that can be shared among different spiders and projects. Some of them will come bundled with Scrapy. + +The Legs themselves can be also combined with sub-legs, in a hierarchical fashion. 
Legs are also spiders themselves, hence the name "Leg Spider". + +== {{{LegSpider}}} API == + +A {{{LegSpider}}} is a {{{BaseSpider}}} subclass that adds the following attributes and methods: + + * {{{legs}}} + * legs composing this spider + * {{{process_response(response)}}} + * Process a (downloaded) response and return a list of requests and items + * {{{process_request(request)}}} + * Process a request after it has been extracted and before returning it from the spider + * {{{process_item(item)}}} + * Process an item after it has been extracted and before returning it from the spider + * {{{set_spider()}}} + * Defines the main spider associated with this Leg Spider, which is often used to configure the Leg Spider behavior. + +== How Leg Spiders work == + + 1. Each Leg Spider has zero or many Leg Spiders associated with it. When a response arrives, the Leg Spider process it with its {{{process_response}}} method and also the {{{process_response}}} method of all its "sub leg spiders". Finally, the output of all of them is combined to produce the final aggregated output. + 2. Each element of the aggregated output of {{{process_response}}} is processed with either {{{process_item}}} or {{{process_request}}} before being returned from the spider. Similar to {{{process_response}}}, each item/request is processed with all {{{process_{request,item}}}} of the leg spiders composing the spider, and also with those of the spider itself. + +== Leg Spider examples == + +=== Regex (HTML) Link Extractor === + +A typical application of LegSpider's is to build Link Extractors. For example: + +{{{ +#!python +class RegexHtmlLinkExtractor(LegSpider): + + def process_response(self, response): + if isinstance(response, HtmlResponse): + allowed_regexes = self.spider.url_regexes_to_follow + # extract urls to follow using allowed_regexes + return [Request(x) for x in urls_to_follow] + +class MySpider(LegSpider): + + legs = [RegexHtmlLinkExtractor()] + url_regexes_to_follow = ['/product.php?.*'] + + def parse_response(self, response): + # parse response and extract items + return items +}}} + +=== RSS2 link extractor === + +This is a Leg Spider that can be used for following links from RSS2 feeds. 
+ +{{{ +#!python +class Rss2LinkExtractor(LegSpider): + + def process_response(self, response): + if response.headers.get('Content-type') == 'application/rss+xml': + xs = XmlXPathSelector(response) + urls = xs.select("//item/link/text()").extract() + return [Request(x) for x in urls] +}}} + +=== Callback dispatcher based on rules === + +Another example could be to build a callback dispatcher based on rules: + +{{{ +#!python +class CallbackRules(LegSpider): + + def __init__(self, *a, **kw): + super(CallbackRules, self).__init__(*a, **kw) + for regex, method_name in self.spider.callback_rules.items(): + r = re.compile(regex) + m = getattr(self.spider, method_name, None) + if m: + self._rules[r] = m + + def process_response(self, response): + for regex, method in self._rules.items(): + m = regex.search(response.url) + if m: + return method(response) + return [] + +class MySpider(LegSpider): + + legs = [CallbackRules()] + callback_rules = { + '/product.php.*': 'parse_product', + '/category.php.*': 'parse_category', + } + + def parse_product(self, response): + # parse reponse and populate item + return item +}}} + +=== URL Canonicalizers === + +Another example could be for building URL canonicalizers: + +{{{ +#!python +class CanonializeUrl(LegSpider): + + def process_request(self, request): + curl = canonicalize_url(request.url, rules=self.spider.canonicalization_rules) + return request.replace(url=curl) + +class MySpider(LegSpider): + + legs = [CanonicalizeUrl()] + canonicalization_rules = ['sort-query-args', 'normalize-percent-encoding', ...] + + # ... +}}} + +=== Setting item identifier === + +Another example could be for setting a unique identifier to items, based on certain fields: + +{{{ +#!python +class ItemIdSetter(LegSpider): + + def process_item(self, item): + id_field = self.spider.id_field + id_fields_to_hash = self.spider.id_fields_to_hash + item[id_field] = make_hash_based_on_fields(item, id_fields_to_hash) + return item + +class MySpider(LegSpider): + + legs = [ItemIdSetter()] + id_field = 'guid' + id_fields_to_hash = ['supplier_name', 'supplier_id'] + + def process_response(self, item): + # extract item from response + return item +}}} + +=== Combining multiple leg spiders === + +Here's an example that combines functionality from multiple leg spiders: + +{{{ +#!python +class MySpider(LegSpider): + + legs = [RegexLinkExtractor(), ParseRules(), CanonicalizeUrl(), ItemIdSetter()] + + url_regexes_to_follow = ['/product.php?.*'] + + parse_rules = { + '/product.php.*': 'parse_product', + '/category.php.*': 'parse_category', + } + + canonicalization_rules = ['sort-query-args', 'normalize-percent-encoding', ...] + + id_field = 'guid' + id_fields_to_hash = ['supplier_name', 'supplier_id'] + + def process_product(self, item): + # extract item from response + return item + + def process_category(self, item): + # extract item from response + return item +}}} + + + +== Leg Spiders vs Spider middlewares == + +A common question that would arise is when one should use Leg Spiders and when to use Spider middlewares. Leg Spiders functionality is meant to implement spider-specific functionality, like link extraction which has custom rules per spider. Spider middlewares, on the other hand, are meant to implement global functionality. 
+ +== When not to use Leg Spiders == + +Leg Spiders are not a silver bullet to implement all kinds of spiders, so it's important to keep in mind their scope and limitations, such as: + + * Leg Spiders can't filter duplicate requests, since they don't have access to all requests at the same time. This functionality should be done in a spider or scheduler middleware. + * Leg Spiders are meant to be used for spiders whose behavior (requests & items to extract) depends only on the current page and not previously crawled pages (aka. "context-free spiders"). If your spider has some custom logic with chained downloads (for example, multi-page items) then Leg Spiders may not be a good fit. + +== {{{LegSpider}}} proof-of-concept implementation == + +Here's a proof-of-concept implementation of {{{LegSpider}}}: + +{{{ +#!python +from scrapy.http import Request +from scrapy.item import BaseItem +from scrapy.spider import BaseSpider +from scrapy.utils.spider import iterate_spider_output + + +class LegSpider(BaseSpider): + """A spider made of legs""" + + legs = [] + + def __init__(self, *args, **kwargs): + super(LegSpider, self).__init__(*args, **kwargs) + self._legs = [self] + self.legs[:] + for l in self._legs: + l.set_spider(self) + + def parse(self, response): + res = self._process_response(response) + for r in res: + if isinstance(r, BaseItem): + yield self._process_item(r) + else: + yield self._process_request(r) + + def process_response(self, response): + return [] + + def process_request(self, request): + return request + + def process_item(self, item): + return item + + def set_spider(self, spider): + self.spider = spider + + def _process_response(self, response): + res = [] + for l in self._legs: + res.extend(iterate_spider_output(l.process_response(response))) + return res + + def _process_request(self, request): + for l in self._legs: + request = l.process_request(request) + return request + + def _process_item(self, item): + for l in self._legs: + item = l.process_item(item) + return item +}}} \ No newline at end of file diff --git a/sep/sep-017.trac b/sep/sep-017.trac new file mode 100644 index 00000000000..1e55b56f8dd --- /dev/null +++ b/sep/sep-017.trac @@ -0,0 +1,90 @@ += SEP-017: Spider Contracts = + +[[PageOutline(2-5,Contents)]] + +||'''SEP:'''||17|| +||'''Title:'''||Spider Contracts|| +||'''Author:'''||Insophia Team|| +||'''Created:'''||2010-06-10|| +||'''Status'''||Draft|| + +== Introduction == + +The motivation for Spider Contracts is to build a lightweight mechanism for testing your spiders, and be able to run the tests quickly without having to wait for all the spider to run. It's partially based on the [http://en.wikipedia.org/wiki/Design_by_contract Design by contract] approach (hence its name) where you define certain conditions that spider callbacks must met, and you give example testing pages. + +== How it works == + +In the docstring of your spider callbacks, you write certain tags that define the spider contract. For example, the URL of a sample page for that callback, and what you expect to scrape from it. + +Then you can run a command to check that the spider contracts are met. + +== Contract examples == + +=== Example URL for simple callback === + +The {{{parse_product}}} callback must return items containing the fields given in {{{@scrapes}}}. 
+
+{{{
+#!python
+class ProductSpider(BaseSpider):
+
+    def parse_product(self, response):
+        """
+        @url http://www.example.com/store/product.php?id=123
+        @scrapes name, price, description
+        """
+}}}
+
+=== Chained callbacks ===
+
+The following spider contains two callbacks, one for logging into a site, and the other for scraping user profile info.
+
+The contracts assert that the first callback returns a Request and the second one scrapes the {{{user, name, email}}} fields.
+
+{{{
+#!python
+class UserProfileSpider(BaseSpider):
+
+    def parse_login_page(self, response):
+        """
+        @url http://www.example.com/login.php
+        @returns_request
+        """
+        # returns Request with callback=self.parse_profile_page
+
+    def parse_profile_page(self, response):
+        """
+        @after parse_login_page
+        @scrapes user, name, email
+        """
+        # ...
+}}}
+
+== Tags reference ==
+
+Note that tags can also be extended by users, meaning that you can have your own custom contract tags in your Scrapy project.
+
+||{{{@url}}} || URL of a sample page parsed by the callback ||
+||{{{@after}}} || the callback is called with the response generated by the specified callback ||
+||{{{@scrapes}}} || list of fields that must be present in the item(s) scraped by the callback ||
+||{{{@returns_request}}} || the callback must return one (and only one) Request ||
+
+Some tag constraints:
+
+ * a callback cannot contain both {{{@url}}} and {{{@after}}}
+
+== Checking spider contracts ==
+
+To check the contracts of a single spider:
+
+{{{
+scrapy-ctl.py check example.com
+}}}
+
+Or to check all spiders:
+
+{{{
+scrapy-ctl.py check
+}}}
+
+No need to wait for the whole spider to run.
diff --git a/sep/sep-018.trac b/sep/sep-018.trac
new file mode 100644
index 00000000000..9c2a0349b9d
--- /dev/null
+++ b/sep/sep-018.trac
@@ -0,0 +1,551 @@
+= SEP-018: Spider middleware v2 =
+
+[[PageOutline(2-5,Contents)]]
+
+||'''SEP:'''||18||
+||'''Title:'''||Spider Middleware v2||
+||'''Author:'''||Insophia Team||
+||'''Created:'''||2010-06-20||
+||'''Status'''||Draft (in progress)||
+
+This SEP introduces a new architecture for spider middlewares which provides a greater degree of modularity, combining functionality that can be plugged in from different (reusable) middlewares.
+
+The purpose of !SpiderMiddleware-v2 is to define an architecture that encourages more reusability for building spiders based on smaller, well-tested components. Those components can be global (similar to current spider middlewares) or per-spider, and can be combined to achieve the desired functionality. These reusable components will benefit all Scrapy users by building a repository of well-tested components that can be shared among different spiders and projects. Some of them will come bundled with Scrapy.
+
+Unless explicitly stated, in this document "spider middleware" refers to the '''new''' spider middleware v2, not the old one.
+
+This document is a work in progress, see [#Pendingissues Pending issues] below.
+
+== New spider middleware API ==
+
+A spider middleware can implement any of the following methods:
+
+ * `process_response(response, request, spider)`
+   * Process a (downloaded) response
+   * Receives: The `response` to process, the `request` used to download the response (not necessarily the request sent from the spider), and the `spider` that requested it.
+   * Returns: A list containing requests and/or items
+ * `process_error(error, request, spider)`
+   * Process an error when trying to download a request, such as DNS errors, timeout errors, etc.
+   * Receives: The `error` caused, the `request` that caused it (not necessarily the request sent from the spider), and the `spider` that requested it.
+   * Returns: A list containing requests and/or items
+ * `process_request(request, response, spider)`
+   * Process a request after it has been extracted from the spider or previous middleware `process_response()` methods.
+   * Receives: The `request` to process, the `response` where the request was extracted from, and the `spider` that extracted it.
+   * Note: `response` is `None` for start requests, or requests injected directly (through `manager.scraper.process_request()`) without specifying a response (see below).
+   * Returns: A `Request` object (not necessarily the same one received), or `None`, in which case the request is dropped.
+ * `process_item(item, response, spider)`
+   * Process an item after it has been extracted from the spider or previous middleware `process_response()` methods.
+   * Receives: The `item` to process, the `response` where the item was extracted from, and the `spider` that extracted it.
+   * Returns: An `Item` object (not necessarily the same one received), or `None`, in which case the item is dropped.
+ * `next_request(spider)`
+   * Returns the next request to crawl with this spider. This method is called when the spider is opened, and when it gets idle.
+   * Receives: The `spider` to return the next request for.
+   * Returns: A `Request` object.
+ * `open_spider(spider)`
+   * This can be used to allocate resources when a spider is opened.
+   * Receives: The `spider` that has been opened.
+   * Returns: nothing
+ * `close_spider(spider)`
+   * This can be used to free resources when a spider is closed.
+   * Receives: The `spider` that has been closed.
+   * Returns: nothing
+
+== Changes to core API ==
+
+=== Injecting requests to crawl ===
+
+To inject start requests (or new requests without a response) to crawl, you previously used:
+
+ * `manager.engine.crawl(request, spider)`
+
+Now you'll use:
+
+ * `manager.scraper.process_request(request, spider, response=None)`
+
+which (unlike the old `engine.crawl()`) will make the requests pass through the spider middleware `process_request()` method.
+
+=== Scheduler middleware to be removed ===
+
+We're going to remove the Scheduler Middleware, and move the duplicates filter to a new spider middleware.
+
+== Scraper high-level API ==
+
+There is a simpler high-level API - the Scraper API - which is the API used by the engine and other core components. This is also the API implemented by this new middleware, with its own internal architecture and hooks. Here is the Scraper API:
+
+ * `process_response(response, request, spider)`
+   * returns an iterable of items and requests
+ * `process_error(error, request, spider)`
+   * returns an iterable of items and requests
+ * `process_request(request, spider, response=None)`
+   * injects a request to crawl for the given spider
+ * `process_item(item, spider, response)`
+   * injects an item to process with the item processor (typically the item pipeline)
+ * `next_request(spider)`
+   * returns the next request to process for the given spider
+ * `open_spider(spider)`
+   * opens a spider
+ * `close_spider(spider)`
+   * closes a spider
+
+== How it works ==
+
+The spider middlewares are defined in a certain order, with the top-most being the one closest to the engine and the bottom-most being the one closest to the spider.
+
+Example:
+
+ * Engine
+   * Global spider Middleware 3
+   * Global spider Middleware 2
+   * Global spider Middleware 1
+   * Spider-specific middlewares (defined in `Spider.middlewares`)
+     * Spider-specific middleware 3
+     * Spider-specific middleware 2
+     * Spider-specific middleware 1
+       * Spider
+
+The data flow with Spider Middleware v2 is as follows:
+
+ 1. When a response arrives from the engine, it is passed through all the spider middlewares (in descending order). The result of each middleware's `process_response` is kept and then returned along with the spider callback result.
+ 2. Each element of the aggregated result from the previous point is passed through all middlewares (in ascending order), calling the `process_request` or `process_item` method accordingly, and their results are kept for passing to the following middlewares.
+
+One of the spider middlewares (typically - but not necessarily - the last spider middleware, closest to the spider, as shown in the example) will be a "spider-specific spider middleware" which takes care of calling the additional spider middlewares defined in the `Spider.middlewares` attribute, hence providing support for per-spider middlewares. If a middleware is well written, it should work both globally and per-spider.
+
+== Spider-specific middlewares ==
+
+You can define in the spider itself a list of additional middlewares that will be used for this spider, and only this spider. If the middleware is well written, it should work both globally and per spider.
+
+Here's an example that combines functionality from multiple middlewares into the same spider:
+
+{{{
+#!python
+class MySpider(BaseSpider):
+
+    middlewares = [RegexLinkExtractor(), CallbackRules(), CanonicalizeUrl(), ItemIdSetter(), OffsiteMiddleware()]
+
+    allowed_domains = ['example.com', 'sub.example.com']
+
+    url_regexes_to_follow = ['/product.php?.*']
+
+    callback_rules = {
+        '/product.php.*': 'parse_product',
+        '/category.php.*': 'parse_category',
+    }
+
+    canonicalization_rules = ['sort-query-args', 'normalize-percent-encoding', ...]
+
+    id_field = 'guid'
+    id_fields_to_hash = ['supplier_name', 'supplier_id']
+
+    def parse_product(self, response):
+        # extract item from response
+        return item
+
+    def parse_category(self, response):
+        # extract item from response
+        return item
+}}}
+
+== The Spider Middleware that implements spider code ==
+
+There's going to be one middleware that takes care of calling the proper spider methods on each event, such as:
+
+ * call `Request.callback` (for 200 responses) or `Request.errback` for non-200 responses and other errors. This behaviour can be changed through the `handle_httpstatus_list` spider attribute.
+ * if `Request.callback` is not set it will use `Spider.parse`
+ * if `Request.errback` is not set it will use `Spider.errback`
+ * call additional spider middlewares defined in the `Spider.middlewares` attribute
+ * call `Spider.next_request()` and `Spider.start_requests()` in the `next_request()` middleware method (this would implicitly support backwards compatibility)
+
+== Differences with Spider middleware v1 ==
+
+ * adds support for per-spider middlewares through the `Spider.middlewares` attribute
+ * allows processing initial requests (those returned from `Spider.start_requests()`)
+
+== Use cases and examples ==
+
+This section contains several examples and use cases for Spider Middlewares, starting with a sketch of the dispatch flow described above. Imports are intentionally removed for conciseness and clarity.
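+
+=== Dispatch flow sketch ===
+
+Before the concrete use cases, the snippet below gives a rough sketch of the dispatch flow described in [#Howitworks How it works]. The function name and structure are only illustrative, not part of the proposed API, and middleware ordering is simplified:
+
+{{{
+#!python
+def scrape_response(middlewares, spider, response, request):
+    # 1. the response goes through every middleware's process_response();
+    #    the spider callback output would be aggregated here as well (omitted)
+    results = []
+    for mw in middlewares:
+        results.extend(mw.process_response(response, request, spider) or [])
+
+    # 2. each extracted request/item goes back through every middleware
+    for obj in results:
+        for mw in middlewares:
+            if obj is None:
+                break  # dropped by a previous middleware
+            if isinstance(obj, Request):
+                obj = mw.process_request(obj, response, spider)
+            else:
+                obj = mw.process_item(obj, response, spider)
+        if obj is not None:
+            yield obj
+}}}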
+ +=== Regex (HTML) Link Extractor === + +A typical application of spider middlewares could be to build Link Extractors. For example: + +{{{ +#!python +class RegexHtmlLinkExtractor(object): + + def process_response(self, response, request, spider): + if isinstance(response, HtmlResponse): + allowed_regexes = spider.url_regexes_to_follow + # extract urls to follow using allowed_regexes + return [Request(x) for x in urls_to_follow] + +# Example spider using this middleware +class MySpider(BaseSpider): + + middlewares = [RegexHtmlLinkExtractor()] + url_regexes_to_follow = ['/product.php?.*'] + + # parsing callbacks below +}}} + +=== RSS2 link extractor === + +{{{ +#!python +class Rss2LinkExtractor(object): + + def process_response(self, response, request, spider): + if response.headers.get('Content-type') == 'application/rss+xml': + xs = XmlXPathSelector(response) + urls = xs.select("//item/link/text()").extract() + return [Request(x) for x in urls] +}}} + +=== Callback dispatcher based on rules === + +Another example could be to build a callback dispatcher based on rules: + +{{{ +#!python +class CallbackRules(object): + + def __init__(self): + self.rules = {} + dispatcher.connect(signals.spider_opened, self.spider_opened) + dispatcher.connect(signals.spider_closed, self.spider_closed) + + def spider_opened(self, spider): + self.rules[spider] = {} + for regex, method_name in spider.callback_rules.items(): + r = re.compile(regex) + m = getattr(self.spider, method_name, None) + if m: + self.rules[spider][r] = m + + def spider_closed(self, spider): + del self.rules[spider] + + def process_response(self, response, request, spider): + for regex, method in self.rules[spider].items(): + m = regex.search(response.url) + if m: + return method(response) + return [] + +# Example spider using this middleware +class MySpider(BaseSpider): + + middlewares = [CallbackRules()] + callback_rules = { + '/product.php.*': 'parse_product', + '/category.php.*': 'parse_category', + } + + def parse_product(self, response): + # parse reponse and populate item + return item +}}} + +=== URL Canonicalizers === + +Another example could be for building URL canonicalizers: + +{{{ +#!python +class CanonializeUrl(object): + + def process_request(self, request, response, spider): + curl = canonicalize_url(request.url, rules=spider.canonicalization_rules) + return request.replace(url=curl) + +# Example spider using this middleware +class MySpider(BaseSpider): + + middlewares = [CanonicalizeUrl()] + canonicalization_rules = ['sort-query-args', 'normalize-percent-encoding', ...] + + # ... 
+}}}
+
+=== Setting item identifier ===
+
+Another example could be setting a unique identifier on items, based on certain fields:
+
+{{{
+#!python
+class ItemIdSetter(object):
+
+    def process_item(self, item, response, spider):
+        id_field = spider.id_field
+        id_fields_to_hash = spider.id_fields_to_hash
+        item[id_field] = make_hash_based_on_fields(item, id_fields_to_hash)
+        return item
+
+# Example spider using this middleware
+class MySpider(BaseSpider):
+
+    middlewares = [ItemIdSetter()]
+    id_field = 'guid'
+    id_fields_to_hash = ['supplier_name', 'supplier_id']
+
+    def parse(self, response):
+        # extract item from response
+        return item
+}}}
+
+=== robots.txt exclusion ===
+
+A spider middleware to avoid visiting pages forbidden by robots.txt:
+
+{{{
+#!python
+class SpiderInfo(object):
+
+    def __init__(self, useragent):
+        self.useragent = useragent
+        self.parsers = {}
+        self.pending = defaultdict(list)
+
+
+class AllowAllParser(object):
+
+    def can_fetch(self, useragent, url):
+        return True
+
+
+class RobotsTxtMiddleware(object):
+
+    REQUEST_PRIORITY = 1000
+
+    def __init__(self):
+        self.spiders = {}
+        dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
+        dispatcher.connect(self.spider_closed, signal=signals.spider_closed)
+
+    def process_request(self, request, response, spider):
+        return self.process_start_request(request, spider)
+
+    def process_start_request(self, request, spider):
+        info = self.spiders[spider]
+        url = urlparse_cached(request)
+        netloc = url.netloc
+        if netloc in info.parsers:
+            rp = info.parsers[netloc]
+            if rp.can_fetch(info.useragent, request.url):
+                res = request
+            else:
+                spider.log("Forbidden by robots.txt: %s" % request)
+                res = None
+        else:
+            if netloc in info.pending:
+                # robots.txt already requested for this netloc
+                res = None
+            else:
+                robotsurl = "%s://%s/robots.txt" % (url.scheme, netloc)
+                meta = {'spider': spider, 'handle_httpstatus_list': [403, 404, 500]}
+                res = Request(robotsurl, callback=self.parse_robots,
+                              meta=meta, priority=self.REQUEST_PRIORITY)
+            # queue the request until robots.txt for its netloc is parsed
+            info.pending[netloc].append(request)
+        return res
+
+    def parse_robots(self, response):
+        spider = response.request.meta['spider']
+        netloc = urlparse_cached(response).netloc
+        info = self.spiders[spider]
+        if response.status == 200:
+            rp = robotparser.RobotFileParser(response.url)
+            rp.parse(response.body.splitlines())
+            info.parsers[netloc] = rp
+        else:
+            info.parsers[netloc] = AllowAllParser()
+        return info.pending[netloc]
+
+    def spider_opened(self, spider):
+        ua = getattr(spider, 'user_agent', None) or settings['USER_AGENT']
+        self.spiders[spider] = SpiderInfo(ua)
+
+    def spider_closed(self, spider):
+        del self.spiders[spider]
+}}}
+
+=== Offsite middleware ===
+
+This is a port of the Offsite middleware to the new spider middleware API:
+
+{{{
+#!python
+class SpiderInfo(object):
+
+    def __init__(self, host_regex):
+        self.host_regex = host_regex
+        self.hosts_seen = set()
+
+
+class OffsiteMiddleware(object):
+
+    def __init__(self):
+        self.spiders = {}
+        dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
+        dispatcher.connect(self.spider_closed, signal=signals.spider_closed)
+
+    def process_request(self, request, response, spider):
+        return self.process_start_request(request, spider)
+
+    def process_start_request(self, request, spider):
+        if self.should_follow(request, spider):
+            return request
+        else:
+            info = self.spiders[spider]
+            host = urlparse_cached(request).hostname
+            if host and host not in info.hosts_seen:
+                spider.log("Filtered offsite request to %r: %s" % (host, request))
+                info.hosts_seen.add(host)
+
+    def should_follow(self, request, spider):
+        info = self.spiders[spider]
+        # hostname can be None for wrong urls (like javascript links)
+        host = urlparse_cached(request).hostname or ''
+        return bool(info.host_regex.search(host))
+
+    def get_host_regex(self, spider):
+        """Override this method to implement a different offsite policy"""
+        domains = [d.replace('.', r'\.') for d in spider.allowed_domains]
+        regex = r'^(.*\.)?(%s)$' % '|'.join(domains)
+        return re.compile(regex)
+
+    def spider_opened(self, spider):
+        info = SpiderInfo(self.get_host_regex(spider))
+        self.spiders[spider] = info
+
+    def spider_closed(self, spider):
+        del self.spiders[spider]
+
+}}}
+
+=== Limit URL length ===
+
+A middleware to filter out requests with long urls:
+
+{{{
+#!python
+
+class LimitUrlLength(object):
+
+    def __init__(self):
+        self.maxlength = settings.getint('URLLENGTH_LIMIT')
+
+    def process_request(self, request, response, spider):
+        return self.process_start_request(request, spider)
+
+    def process_start_request(self, request, spider):
+        if len(request.url) <= self.maxlength:
+            return request
+        spider.log("Ignoring request (url length > %d): %s" % (self.maxlength, request.url))
+}}}
+
+=== Set Referer ===
+
+A middleware to set the Referer header:
+
+{{{
+#!python
+class SetReferer(object):
+
+    def process_request(self, request, response, spider):
+        request.headers.setdefault('Referer', response.url)
+        return request
+}}}
+
+=== Set and limit crawling depth ===
+
+A middleware to set (and limit) the request/response depth, taken from the start requests:
+
+{{{
+#!python
+class SetLimitDepth(object):
+
+    def __init__(self, maxdepth=0):
+        self.maxdepth = maxdepth or settings.getint('DEPTH_LIMIT')
+
+    def process_request(self, request, response, spider):
+        depth = response.request.meta['depth'] + 1
+        request.meta['depth'] = depth
+        if not self.maxdepth or depth <= self.maxdepth:
+            return request
+        spider.log("Ignoring link (depth > %d): %s" % (self.maxdepth, request))
+
+    def process_start_request(self, request, spider):
+        request.meta['depth'] = 0
+        return request
+}}}
+
+=== Filter duplicate requests ===
+
+A middleware to filter out requests already seen:
+
+{{{
+#!python
+class FilterDuplicates(object):
+
+    def __init__(self):
+        clspath = settings.get('DUPEFILTER_CLASS')
+        self.dupefilter = load_object(clspath)()
+        dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
+        dispatcher.connect(self.spider_closed, signal=signals.spider_closed)
+
+    def enqueue_request(self, spider, request):
+        seen = self.dupefilter.request_seen(spider, request)
+        if not seen or request.dont_filter:
+            return request
+
+    def spider_opened(self, spider):
+        self.dupefilter.open_spider(spider)
+
+    def spider_closed(self, spider):
+        self.dupefilter.close_spider(spider)
+}}}
+
+=== Scrape data using Parsley ===
+
+A middleware to scrape data using Parsley, as described in UsingParsley:
+
+{{{
+#!python
+from pyparsley import PyParsley
+
+class ParsleyExtractor(object):
+
+    def __init__(self, parslet_json_code):
+        parslet = json.loads(parslet_json_code)
+        class ParsleyItem(Item):
+            def __init__(self, *a, **kw):
+                for name in parslet.keys():
+                    self.fields[name] = Field()
+                super(ParsleyItem, self).__init__(*a, **kw)
+        self.item_class = ParsleyItem
+        self.parsley = PyParsley(parslet, output='python')
+
+    def process_response(self, response, request, spider):
+        return self.item_class(self.parsley.parse(string=response.body))
+}}}
+
+
+== Pending issues ==
+
+Resolved:
+
+ * how to make `start_requests()` output pass through spider middleware `process_request()`?
+   * Start requests will be injected through `manager.scraper.process_request()` instead of `manager.engine.crawl()`
+ * should we support adding additional start requests from a spider middleware?
+   * Yes - there is a spider middleware method (`start_requests`) for that
+ * should `process_response()` receive a `request` argument with the `request` that originated it? `response.request` is the latest request, not the original one (think of redirections), but it does carry the `meta` of the original one. The original one may not be available anymore (in memory) if we're using a persistent scheduler, but in that case it would be the deserialized request from the persistent scheduler queue.
+   * No - this would make the implementation more complex and we're not sure it's really needed
+ * how to make sure `Request.errback` is always called if there is a problem with the request? Do we need to ensure that? Requests filtered out (by returning `None`) in the `process_request()` method will never be callback-ed or even errback-ed. This could be a problem for spiders that want to be notified if their requests are dropped. Should we support this notification somehow, or document (the lack of) it properly?
+   * We won't support notifications of dropped requests, because: 1. it's hard to implement and unreliable, 2. it's not friendly with request persistence, 3. we can't come up with a good API.
+ * should we make the list of default spider middlewares empty? (or containing only the "per-spider" spider middleware)
+   * No - there are some useful spider middlewares that are worth enabling by default, like referer, duplicates and robots2
+ * should we allow returning deferreds in spider middleware methods?
+   * Yes - we should build a Deferred with the spider middleware methods as callbacks, and that would implicitly support returning Deferreds
+ * should we support processing responses before they're processed by the spider? `process_response` runs "in parallel" to the spider callback and can't stop it from running.
+   * No - we haven't seen a practical use case for this, so we won't add an additional hook. It should be trivial to add it later, if needed.
+ * should we make a spider middleware to handle calling the request and spider callback, instead of letting the Scraper component do it?
+   * Yes - there's going to be a spider middleware for executing spider-specific code such as callbacks, as well as custom middlewares (see the sketch below)
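+
+As a rough illustration of that last point, such a middleware could look like the sketch below. The class name is only illustrative, errback/error handling is omitted, and `iterate_spider_output` is the `scrapy.utils.spider` helper also used in SEP-016:
+
+{{{
+#!python
+class SpiderCodeMiddleware(object):
+    """Illustrative sketch only: dispatch responses to Request.callback (or
+    Spider.parse) and delegate to the middlewares listed in Spider.middlewares."""
+
+    def process_response(self, response, request, spider):
+        # call the spider callback for this request
+        callback = request.callback or spider.parse
+        results = list(iterate_spider_output(callback(response)))
+        # also run the per-spider middlewares declared on the spider
+        for mw in getattr(spider, 'middlewares', []):
+            results.extend(mw.process_response(response, request, spider) or [])
+        return results
+}}}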