
major Stats Collection refactoring: removed separation of global/per-spider stats, removed stats-related signals (stats_spider_opened, etc). Stats are much simpler now; backwards compatibility is kept on the Stats Collector API.
1 parent 8b48420 commit 81ed2d2d0b7cbf2dd5cffc0da50367c5284416a6 @pablohoffman committed Sep 14, 2012
Showing with 50 additions and 217 deletions.
  1. +1 −0 docs/news.rst
  2. +16 −46 docs/topics/api.rst
  3. +14 −91 docs/topics/stats.rst
  4. +4 −4 scrapy/contrib/corestats.py
  5. +0 −1 scrapy/core/engine.py
  6. +0 −3 scrapy/signals.py
  7. +15 −35 scrapy/statscol.py
  8. +0 −37 scrapy/tests/test_stats.py
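
Before the per-file hunks, a quick before/after sketch of the API change the commit message summarizes; ``stats`` is a Stats Collector instance and ``some_spider`` follows the old docs' naming::

    # Before this commit: per-spider stats took an explicit spider
    # argument; omitting it consulted a separate global stats table.
    stats.inc_value('pages_crawled', spider=some_spider)
    stats.get_value('pages_crawled', spider=some_spider)

    # After this commit: a single stats table per running spider. The
    # spider keyword survives only for backwards compatibility and is
    # ignored (see scrapy/statscol.py below).
    stats.inc_value('pages_crawled')
    stats.get_value('pages_crawled')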
docs/news.rst
@@ -6,6 +6,7 @@ Release notes
Scrapy changes:
+- major Stats Collection refactoring: removed separation of global/per-spider stats, removed stats-related signals (``stats_spider_opened``, etc). Stats are much simpler now; backwards compatibility is kept on the Stats Collector API.
- added :meth:`~scrapy.contrib.spidermiddleware.SpiderMiddleware.process_start_requests` method to spider middlewares
- dropped Signals singleton. Signals should now be accessed through the Crawler.signals attribute. See the signals documentation for more info.
docs/topics/api.rst
@@ -250,68 +250,42 @@ class (which they all inherit from).
.. class:: StatsCollector
- .. method:: get_value(key, default=None, spider=None)
+ .. method:: get_value(key, default=None)
Return the value for the given stats key or default if it doesn't exist.
- If spider is ``None`` the global stats table is consulted, otherwise the
- spider specific one is. If the spider is not yet opened a ``KeyError``
- exception is raised.
- .. method:: get_stats(spider=None)
+ .. method:: get_stats()
- Get all stats from the given spider (if spider is given) or all global
- stats otherwise, as a dict. If spider is not opened ``KeyError`` is
- raised.
+ Get all stats from the currently running spider as a dict.
- .. method:: set_value(key, value, spider=None)
+ .. method:: set_value(key, value)
- Set the given value for the given stats key on the global stats (if
- spider is not given) or the spider-specific stats (if spider is given),
- which must be opened or a ``KeyError`` will be raised.
+ Set the given value for the given stats key.
- .. method:: set_stats(stats, spider=None)
+ .. method:: set_stats(stats)
- Set the given stats (as a dict) for the given spider. If the spider is
- not opened a ``KeyError`` will be raised.
+ Override the current stats with the dict passed in ``stats`` argument.
- .. method:: inc_value(key, count=1, start=0, spider=None)
+ .. method:: inc_value(key, count=1, start=0)
Increment the value of the given stats key, by the given count,
- assuming the start value given (when it's not set). If spider is not
- given the global stats table is used, otherwise the spider-specific
- stats table is used, which must be opened or a ``KeyError`` will be
- raised.
+ assuming the start value given (when it's not set).
- .. method:: max_value(key, value, spider=None)
+ .. method:: max_value(key, value)
Set the given value for the given key only if current value for the
same key is lower than value. If there is no current value for the
- given key, the value is always set. If spider is not given, the global
- stats table is used, otherwise the spider-specific stats table is used,
- which must be opened or a KeyError will be raised.
+ given key, the value is always set.
- .. method:: min_value(key, value, spider=None)
+ .. method:: min_value(key, value)
Set the given value for the given key only if current value for the
same key is greater than value. If there is no current value for the
- given key, the value is always set. If spider is not given, the global
- stats table is used, otherwise the spider-specific stats table is used,
- which must be opened or a KeyError will be raised.
+ given key, the value is always set.
- .. method:: clear_stats(spider=None)
+ .. method:: clear_stats()
- Clear all global stats (if spider is not given) or all spider-specific
- stats if spider is given, in which case it must be opened or a
- ``KeyError`` will be raised.
-
- .. method:: iter_spider_stats()
-
- Return a iterator over ``(spider, spider_stats)`` for each open spider
- currently tracked by the stats collector, where ``spider_stats`` is the
- dict containing all spider-specific stats.
-
- Global stats are not included in the iterator. If you want to get
- those, use :meth:`get_stats` method.
+ Clear all stats.
The following methods are not part of the stats collection api but instead
used when implementing custom stats collectors:
@@ -323,11 +297,7 @@ class (which they all inherit from).
.. method:: close_spider(spider)
Close the given spider. After this is called, no more specific stats
- for this spider can be accessed.
-
- .. method:: engine_stopped()
-
- Called after the engine is stopped, to dump or persist global stats.
+ can be accessed or collected.
.. _deferreds: http://twistedmatrix.com/documents/current/core/howto/defer.html
.. _deferred: http://twistedmatrix.com/documents/current/core/howto/defer.html
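
As a sketch of the custom-collector hooks just described: a hypothetical collector that persists each spider's final stats to a JSON file on close. The class name and file path are illustrative assumptions; only the ``_persist_stats(stats, spider)`` hook comes from this commit (see scrapy/statscol.py below)::

    import json

    from scrapy.statscol import StatsCollector


    class JsonStatsCollector(StatsCollector):
        # Hypothetical collector, not part of this commit.

        def _persist_stats(self, stats, spider):
            # Called from close_spider() with the finished stats dict.
            with open('%s-stats.json' % spider.name, 'w') as f:
                # default=str covers datetime values such as start_time
                json.dump(stats, f, default=str)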
docs/topics/stats.rst
@@ -5,10 +5,10 @@ Stats Collection
================
Scrapy provides a convenient facility for collecting stats in the form of
-key/values, both globally and per spider. It's called the Stats Collector, and
-can be accesed through the :attr:`~scrapy.crawler.Crawler.stats` attribute of
-the :ref:`topics-api-crawler`, as illustrated by the examples in the
-:ref:`topics-stats-usecases` section below.
+key/values, where values are often counters. The facility is called the Stats
+Collector, and can be accessed through the :attr:`~scrapy.crawler.Crawler.stats`
+attribute of the :ref:`topics-api-crawler`, as illustrated by the examples in
+the :ref:`topics-stats-usecases` section below.
However, the Stats Collector is always available, so you can always import it
in your module and use its API (to increment or set new stat keys), regardless
@@ -21,10 +21,8 @@ using the Stats Collector from.
Another feature of the Stats Collector is that it's very efficient (when
enabled) and extremely efficient (almost unnoticeable) when disabled.
-The Stats Collector keeps one stats table per open spider and one global stats
-table. You can't set or get stats from a closed spider, but the spider-specific
-stats table is automatically opened when the spider is opened, and closed when
-the spider is closed.
+The Stats Collector keeps a stats table per open spider which is automatically
+opened when the spider is opened, and closed when the spider is closed.
.. _topics-stats-usecases:
@@ -38,58 +36,30 @@ attribute::
def from_crawler(cls, crawler):
stats = crawler.stats
-Set global stat value::
+Set stat value::
stats.set_value('hostname', socket.gethostname())
-Increment global stat value::
+Increment stat value::
- stats.inc_value('spiders_crawled')
+ stats.inc_value('pages_crawled')
-Set global stat value only if greater than previous::
+Set stat value only if greater than previous::
stats.max_value('max_items_scraped', value)
-Set global stat value only if lower than previous::
+Set stat value only if lower than previous::
stats.min_value('min_free_memory_percent', value)
-Get global stat value::
+Get stat value::
- >>> stats.get_value('spiders_crawled')
+ >>> stats.get_value('pages_crawled')
8
-Get all global stats (ie. not particular to any spider)::
+Get all stats::
>>> stats.get_stats()
- {'hostname': 'localhost', 'spiders_crawled': 8}
-
-Set spider specific stat value::
-
- stats.set_value('start_time', datetime.now(), spider=some_spider)
-
-Where ``some_spider`` is a :class:`~scrapy.spider.BaseSpider` object.
-
-Increment spider-specific stat value::
-
- stats.inc_value('pages_crawled', spider=some_spider)
-
-Set spider-specific stat value only if greater than previous::
-
- stats.max_value('max_items_scraped', value, spider=some_spider)
-
-Set spider-specific stat value only if lower than previous::
-
- stats.min_value('min_free_memory_percent', value, spider=some_spider)
-
-Get spider-specific stat value::
-
- >>> stats.get_value('pages_crawled', spider=some_spider)
- 1238
-
-Get all stats from a given spider::
-
- >>> stats.get_stats(spider=some_spider)
{'pages_crawled': 1238, 'start_time': datetime.datetime(2009, 7, 14, 21, 47, 28, 977139)}
Available Stats Collectors
@@ -131,50 +101,3 @@ DummyStatsCollector
the performance penalty of stats collection is usually marginal compared to
other Scrapy workload like parsing pages.
-
-Stats signals
-=============
-
-The Stats Collector provides some signals for extending the stats collection
-functionality:
-
-.. currentmodule:: scrapy.signals
-
-.. signal:: stats_spider_opened
-.. function:: stats_spider_opened(spider)
-
- Sent right after the stats spider is opened. You can use this signal to add
- startup stats for the spider (example: start time).
-
- :param spider: the stats spider just opened
- :type spider: str
-
-.. signal:: stats_spider_closing
-.. function:: stats_spider_closing(spider, reason)
-
- Sent just before the stats spider is closed. You can use this signal to add
- some closing stats (example: finish time).
-
- :param spider: the stats spider about to be closed
- :type spider: str
-
- :param reason: the reason why the spider is being closed. See
- :signal:`spider_closed` signal for more info.
- :type reason: str
-
-.. signal:: stats_spider_closed
-.. function:: stats_spider_closed(spider, reason, spider_stats)
-
- Sent right after the stats spider is closed. You can use this signal to
- collect resources, but not to add any more stats as the stats spider has
- already been closed (use :signal:`stats_spider_closing` for that instead).
-
- :param spider: the stats spider just closed
- :type spider: str
-
- :param reason: the reason why the spider was closed. See
- :signal:`spider_closed` signal for more info.
- :type reason: str
-
- :param spider_stats: the stats of the spider just closed.
- :type reason: dict
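
Tying the usage snippets above together, a minimal sketch of an extension that grabs the Stats Collector in ``from_crawler`` and records stats as items are scraped; the class name and stat keys are assumptions for illustration::

    import socket

    from scrapy import signals


    class StatsUsageDemo(object):
        # Hypothetical extension mirroring the doc examples above.

        def __init__(self, stats):
            self.stats = stats
            # set once at startup, as in the 'hostname' example
            self.stats.set_value('hostname', socket.gethostname())

        @classmethod
        def from_crawler(cls, crawler):
            o = cls(crawler.stats)
            crawler.signals.connect(o.item_scraped, signal=signals.item_scraped)
            return o

        def item_scraped(self, item, spider):
            # one counter bump per item, as in the inc_value example
            self.stats.inc_value('items_seen')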
scrapy/contrib/corestats.py
@@ -13,16 +13,16 @@ def __init__(self, stats):
@classmethod
def from_crawler(cls, crawler):
o = cls(crawler.stats)
- crawler.signals.connect(o.stats_spider_opened, signal=signals.stats_spider_opened)
- crawler.signals.connect(o.stats_spider_closing, signal=signals.stats_spider_closing)
+ crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
+ crawler.signals.connect(o.spider_closed, signal=signals.spider_closed)
crawler.signals.connect(o.item_scraped, signal=signals.item_scraped)
crawler.signals.connect(o.item_dropped, signal=signals.item_dropped)
return o
- def stats_spider_opened(self, spider):
+ def spider_opened(self, spider):
self.stats.set_value('start_time', datetime.datetime.utcnow(), spider=spider)
- def stats_spider_closing(self, spider, reason):
+ def spider_closed(self, spider, reason):
self.stats.set_value('finish_time', datetime.datetime.utcnow(), spider=spider)
self.stats.set_value('finish_reason', reason, spider=spider)
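
The corestats.py change above doubles as a migration recipe for third-party extensions that connected to the removed stats signals; a hedged sketch, with extension and handler names as placeholders::

    from scrapy import signals


    class MyStatsExtension(object):
        # Hypothetical third-party extension being migrated.

        def __init__(self, stats):
            self.stats = stats

        @classmethod
        def from_crawler(cls, crawler):
            o = cls(crawler.stats)
            # was: signal=signals.stats_spider_opened
            crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
            # was: signal=signals.stats_spider_closing (or _closed)
            crawler.signals.connect(o.spider_closed, signal=signals.spider_closed)
            return o

        def spider_opened(self, spider):
            self.stats.set_value('my_ext/opened', True)

        def spider_closed(self, spider, reason):
            self.stats.set_value('my_ext/close_reason', reason)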
scrapy/core/engine.py
@@ -286,4 +286,3 @@ def _close_all_spiders(self):
@defer.inlineCallbacks
def _finish_stopping_engine(self):
yield self.signals.send_catch_log_deferred(signal=signals.engine_stopped)
- yield self.crawler.stats.engine_stopped()
scrapy/signals.py
@@ -16,8 +16,5 @@
response_downloaded = object()
item_scraped = object()
item_dropped = object()
-stats_spider_opened = object()
-stats_spider_closing = object()
-stats_spider_closed = object()
item_passed = item_scraped # for backwards compatibility
scrapy/statscol.py
@@ -3,68 +3,49 @@
"""
import pprint
-from scrapy.signals import stats_spider_opened, stats_spider_closing, \
- stats_spider_closed
from scrapy import log
class StatsCollector(object):
def __init__(self, crawler):
self._dump = crawler.settings.getbool('STATS_DUMP')
- self._stats = {None: {}} # None is for global stats
- self._signals = crawler.signals
+ self._stats = {}
def get_value(self, key, default=None, spider=None):
- return self._stats[spider].get(key, default)
+ return self._stats.get(key, default)
def get_stats(self, spider=None):
- return self._stats[spider]
+ return self._stats
def set_value(self, key, value, spider=None):
- self._stats[spider][key] = value
+ self._stats[key] = value
def set_stats(self, stats, spider=None):
- self._stats[spider] = stats
+ self._stats = stats
def inc_value(self, key, count=1, start=0, spider=None):
- d = self._stats[spider]
+ d = self._stats
d[key] = d.setdefault(key, start) + count
def max_value(self, key, value, spider=None):
- d = self._stats[spider]
- d[key] = max(d.setdefault(key, value), value)
+ self._stats[key] = max(self._stats.setdefault(key, value), value)
def min_value(self, key, value, spider=None):
- d = self._stats[spider]
- d[key] = min(d.setdefault(key, value), value)
+ self._stats[key] = min(self._stats.setdefault(key, value), value)
def clear_stats(self, spider=None):
- self._stats[spider].clear()
-
- def iter_spider_stats(self):
- return [x for x in self._stats.iteritems() if x[0]]
+ self._stats.clear()
def open_spider(self, spider):
- self._stats[spider] = {}
- self._signals.send_catch_log(stats_spider_opened, spider=spider)
+ self._stats = {}
def close_spider(self, spider, reason):
- self._signals.send_catch_log(stats_spider_closing, spider=spider, reason=reason)
- stats = self._stats.pop(spider)
- self._signals.send_catch_log(stats_spider_closed, spider=spider, reason=reason, \
- spider_stats=stats)
if self._dump:
- log.msg("Dumping spider stats:\n" + pprint.pformat(stats), \
+ log.msg("Dumping spider stats:\n" + pprint.pformat(self._stats), \
spider=spider)
- self._persist_stats(stats, spider)
-
- def engine_stopped(self):
- stats = self.get_stats()
- if self._dump:
- log.msg("Dumping global stats:\n" + pprint.pformat(stats))
- self._persist_stats(stats, spider=None)
+ self._persist_stats(self._stats, spider)
- def _persist_stats(self, stats, spider=None):
+ def _persist_stats(self, stats, spider):
pass
class MemoryStatsCollector(StatsCollector):
@@ -73,9 +54,8 @@ def __init__(self, crawler):
super(MemoryStatsCollector, self).__init__(crawler)
self.spider_stats = {}
- def _persist_stats(self, stats, spider=None):
- if spider is not None:
- self.spider_stats[spider.name] = stats
+ def _persist_stats(self, stats, spider):
+ self.spider_stats[spider.name] = stats
class DummyStatsCollector(StatsCollector):
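
Which collector runs is chosen by Scrapy's ``STATS_CLASS`` setting, with ``STATS_DUMP`` (read in ``__init__`` above) controlling the close-time dump; a sketch for a project's settings module::

    # settings.py: swap in the no-op collector to disable stats
    STATS_CLASS = 'scrapy.statscol.DummyStatsCollector'
    STATS_DUMP = False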