
Merge pull request #167 from alexcepoi/sep-017

Spider contracts (SEP-017)
2 parents a5f8943 + 73e6bc1 · commit c380910b40a9a0ca5837bfd5b9884f662236c51d · @pablohoffman committed Sep 28, 2012
@@ -129,6 +129,7 @@ Solving specific problems
   faq
   topics/debug
+   topics/contracts
   topics/firefox
   topics/firebug
   topics/leaks
@@ -145,6 +146,9 @@ Solving specific problems
:doc:`topics/debug`
    Learn how to debug common problems of your scrapy spider.
+:doc:`topics/contracts`
+    Learn how to use contracts for testing your spiders.
+
:doc:`topics/firefox`
    Learn how to scrape with Firefox and some useful add-ons.
@@ -142,6 +142,7 @@ Global commands:
Project-only commands:
* :command:`crawl`
+* :command:`check`
* :command:`list`
* :command:`edit`
* :command:`parse`
@@ -221,6 +222,33 @@ Usage examples::
[ ... myspider starts crawling ... ]
+.. command:: check
+
+check
+-----
+
+* Syntax: ``scrapy check [-l] <spider>``
+* Requires project: *yes*
+
+Run contract checks for the given spider, or for all spiders in the project
+if none is given.
+
+Usage examples::
+
+    $ scrapy check -l
+    first_spider
+      * parse
+      * parse_item
+    second_spider
+      * parse
+      * parse_item
+
+    $ scrapy check
+    [FAILED] first_spider:parse_item
+    >>> 'RetailPricex' field is missing
+
+    [FAILED] first_spider:parse
+    >>> Returned 92 requests, expected 0..4
+
.. command:: server
server
@@ -0,0 +1,113 @@
+.. _topics-contracts:
+
+=================
+Spiders Contracts
+=================
+
+Testing spiders can get particularly annoying, and while nothing prevents you
+from writing unit tests, the task gets cumbersome quickly. Scrapy offers an
+integrated way of testing your spiders by means of contracts.
+
+This allows you to test each callback of your spider by hardcoding a sample
+url and checking various constraints for how the callback processes the
+response. Each contract is prefixed with an ``@`` and included in the
+docstring. See the following example::
+
+    def parse(self, response):
+        """ This function parses a sample response. Some contracts are
+        mingled with this docstring.
+
+        @url http://www.amazon.com/s?field-keywords=selfish+gene
+        @returns items 1 16
+        @returns requests 0 0
+        @scrapes Title Author Year Price
+        """
+
+This callback is tested using three built-in contracts:
+
+.. module:: scrapy.contracts.default
+
+.. class:: UrlContract
+
+    This contract (``@url``) sets the sample url used when checking other
+    contract conditions for this spider. This contract is mandatory. All
+    callbacks lacking this contract are ignored when running the checks::
+
+        @url url
+
+.. class:: ReturnsContract
+
+    This contract (``@returns``) sets lower and upper bounds for the items
+    and requests returned by the spider. The upper bound is optional; for
+    example, ``@returns items 1 16`` accepts between 1 and 16 items, while
+    ``@returns requests 0 0`` requires that no requests be returned::
+
+        @returns item(s)|request(s) [min [max]]
+
+.. class:: ScrapesContract
+
+    This contract (``@scrapes``) checks that all the items returned by the
+    callback have the specified fields::
+
+        @scrapes field_1 field_2 ...
+
+Use the :command:`check` command to run the contract checks.
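+For example, to check a single spider (assuming your project defines a
+spider named ``myspider``)::
+
+    $ scrapy check myspider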
+
+Custom Contracts
+================
+
+If you find you need more power than the built-in Scrapy contracts, you can
+create and load your own contracts in the project by using the
+:setting:`SPIDER_CONTRACTS` setting::
+
+    SPIDER_CONTRACTS = {
+        'myproject.contracts.ResponseCheck': 10,
+        'myproject.contracts.ItemValidate': 10,
+    }
+
+Each contract must inherit from :class:`scrapy.contracts.Contract` and can
+override three methods:
+
+.. module:: scrapy.contracts
+
+.. class:: Contract(method, \*args)
+
+    :param method: callback function to which the contract is associated
+    :type method: function
+
+    :param args: list of arguments passed into the docstring (whitespace
+        separated)
+    :type args: list
+
+    .. method:: Contract.adjust_request_args(args)
+
+        This receives a ``dict`` as an argument containing default arguments
+        for the :class:`~scrapy.http.Request` object. It must return the
+        same or a modified version of it.
+
+    .. method:: Contract.pre_process(response)
+
+        This allows hooking in various checks on the response received from
+        the sample request, before it's passed to the callback.
+
+    .. method:: Contract.post_process(output)
+
+        This allows processing the output of the callback. Iterators are
+        listified before being passed to this hook.
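+
+As a sketch of overriding :meth:`Contract.adjust_request_args` (the
+``cookies_off`` contract below is hypothetical, not part of Scrapy)::
+
+    from scrapy.contracts import Contract
+
+    class CookiesOffContract(Contract):
+        """ hypothetical contract that disables cookie merging for the
+        sample request, declared in a docstring as: @cookies_off
+        """
+
+        name = 'cookies_off'
+
+        def adjust_request_args(self, args):
+            # 'args' is the dict of keyword arguments used to build the
+            # sample Request; return it (possibly modified)
+            args['meta'] = {'dont_merge_cookies': True}
+            return args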
+
+Here is a demo contract which checks the presence of a custom header in the
+response received. Raise :class:`scrapy.exceptions.ContractFail` in order to
+get the failures pretty printed::
+
+    from scrapy.contracts import Contract
+    from scrapy.exceptions import ContractFail
+
+    class HasHeaderContract(Contract):
+        """ Demo contract which checks the presence of a custom header
+        @has_header X-CustomHeader
+        """
+
+        name = 'has_header'
+
+        def pre_process(self, response):
+            for header in self.args:
+                if header not in response.headers:
+                    raise ContractFail('%s header not present' % header)
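+
+To enable it, a minimal sketch (the ``myproject.contracts`` module path is
+an assumption; use wherever you defined the class)::
+
+    SPIDER_CONTRACTS = {
+        # assumed path; adjust to wherever you defined the class
+        'myproject.contracts.HasHeaderContract': 10,
+    }
+
+It can then be declared in a callback docstring alongside the mandatory
+``@url`` contract, for example::
+
+    def parse(self, response):
+        """
+        @url http://www.example.com
+        @has_header X-CustomHeader
+        """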
@@ -694,6 +694,30 @@ The scheduler to use for crawling.
.. setting:: SPIDER_MIDDLEWARES
+
+SPIDER_CONTRACTS
+----------------
+
+Default: ``{}``
+
+A dict containing the Scrapy contracts enabled in your project, used for
+testing spiders. For more info see :ref:`topics-contracts`.
+
+SPIDER_CONTRACTS_BASE
+---------------------
+
+Default::
+
+    {
+        'scrapy.contracts.default.UrlContract': 1,
+        'scrapy.contracts.default.ReturnsContract': 2,
+        'scrapy.contracts.default.ScrapesContract': 3,
+    }
+
+A dict containing the Scrapy contracts enabled by default in Scrapy. You
+should never modify this setting in your project; modify
+:setting:`SPIDER_CONTRACTS` instead. For more info see
+:ref:`topics-contracts`.
+
SPIDER_MIDDLEWARES
------------------
@@ -0,0 +1,78 @@
+from collections import defaultdict
+from functools import wraps
+
+from scrapy.command import ScrapyCommand
+from scrapy.contracts import ContractsManager
+from scrapy.utils.misc import load_object
+from scrapy.utils.spider import iterate_spider_output
+from scrapy.utils.conf import build_component_list
+
+
+def _generate(cb):
+    """ create a callback which does not return anything """
+    @wraps(cb)
+    def wrapper(response):
+        output = cb(response)
+        output = list(iterate_spider_output(output))
+    return wrapper
+
+
+class Command(ScrapyCommand):
+    requires_project = True
+    default_settings = {'LOG_ENABLED': False}
+
+    def syntax(self):
+        return "[options] <spider>"
+
+    def short_desc(self):
+        return "Check contracts for given spider"
+
+    def add_options(self, parser):
+        ScrapyCommand.add_options(self, parser)
+        parser.add_option("-l", "--list", dest="list", action="store_true",
+            help="only list contracts, without checking them")
+
+    def run(self, args, opts):
+        # load contracts
+        contracts = build_component_list(
+            self.settings['SPIDER_CONTRACTS_BASE'],
+            self.settings['SPIDER_CONTRACTS'],
+        )
+        self.conman = ContractsManager([load_object(c) for c in contracts])
+
+        # contract requests
+        contract_reqs = defaultdict(list)
+        self.crawler.engine.has_capacity = lambda: True
+
+        for spider in args or self.crawler.spiders.list():
+            spider = self.crawler.spiders.create(spider)
+            requests = self.get_requests(spider)
+
+            if opts.list:
+                for req in requests:
+                    contract_reqs[spider.name].append(req.callback.__name__)
+            else:
+                self.crawler.crawl(spider, requests)
+
+        # start checks
+        if opts.list:
+            for spider, methods in sorted(contract_reqs.iteritems()):
+                print spider
+                for method in sorted(methods):
+                    print '  * %s' % method
+        else:
+            self.crawler.start()
+
+    def get_requests(self, spider):
+        requests = []
+
+        for key, value in vars(type(spider)).items():
+            if callable(value) and value.__doc__:
+                bound_method = value.__get__(spider, type(spider))
+                request = self.conman.from_method(bound_method)
+
+                if request:
+                    # swallow the callback's output so scraped items are not
+                    # processed further; the contract hooks still run
+                    request.callback = _generate(request.callback)
+                    requests.append(request)
+
+        return requests
@@ -0,0 +1,102 @@
+import re
+from functools import wraps
+
+from scrapy.http import Request
+from scrapy.utils.spider import iterate_spider_output
+from scrapy.utils.python import get_spec
+from scrapy.exceptions import ContractFail
+
+
+class ContractsManager(object):
+    contracts = {}
+
+    def __init__(self, contracts):
+        for contract in contracts:
+            self.contracts[contract.name] = contract
+
+    def extract_contracts(self, method):
+        contracts = []
+        for line in method.__doc__.split('\n'):
+            line = line.strip()
+
+            if line.startswith('@'):
+                name, args = re.match(r'@(\w+)\s*(.*)', line).groups()
+                args = re.split(r'\s+', args)
+
+                contracts.append(self.contracts[name](method, *args))
+
+        return contracts
+
+    def from_method(self, method, fail=False):
+        contracts = self.extract_contracts(method)
+        if contracts:
+            # calculate request args
+            args, kwargs = get_spec(Request.__init__)
+            kwargs['callback'] = method
+            for contract in contracts:
+                kwargs = contract.adjust_request_args(kwargs)
+
+            # create and prepare request
+            args.remove('self')
+            if set(args).issubset(set(kwargs)):
+                request = Request(**kwargs)
+
+                # execute pre and post hooks in order
+                for contract in reversed(contracts):
+                    request = contract.add_pre_hook(request, fail)
+                for contract in contracts:
+                    request = contract.add_post_hook(request, fail)
+
+                return request
+
+
+class Contract(object):
+    """ Abstract class for contracts """
+
+    def __init__(self, method, *args):
+        self.method = method
+        self.args = args
+
+    def add_pre_hook(self, request, fail=False):
+        cb = request.callback
+
+        @wraps(cb)
+        def wrapper(response):
+            try:
+                self.pre_process(response)
+            except ContractFail as e:
+                if fail:
+                    raise
+                else:
+                    print e.format(self.method)
+            return list(iterate_spider_output(cb(response)))
+
+        request.callback = wrapper
+        return request
+
+    def add_post_hook(self, request, fail=False):
+        cb = request.callback
+
+        @wraps(cb)
+        def wrapper(response):
+            output = list(iterate_spider_output(cb(response)))
+            try:
+                self.post_process(output)
+            except ContractFail as e:
+                if fail:
+                    raise
+                else:
+                    print e.format(self.method)
+            return output
+
+        request.callback = wrapper
+        return request
+
+    def adjust_request_args(self, args):
+        return args
+
+    def pre_process(self, response):
+        pass
+
+    def post_process(self, output):
+        pass
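
A minimal sketch of how these pieces fit together outside the ``check``
command (``MySpider`` stands in for one of your own spider classes)::

    from scrapy.contracts import ContractsManager
    from scrapy.contracts.default import (UrlContract, ReturnsContract,
        ScrapesContract)

    from myproject.spiders import MySpider  # hypothetical import

    # register the built-in contracts under their names
    # (@url, @returns, @scrapes)
    conman = ContractsManager([UrlContract, ReturnsContract, ScrapesContract])

    # build a Request for the @url declared in parse's docstring, with each
    # contract's pre/post-process hooks wrapped around the callback;
    # returns None if the docstring declares no contracts
    request = conman.from_method(MySpider().parse)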