periodic_log: implemented as separate extension #5926

Merged
merged 26 commits on Aug 30, 2023
3315385
periodic_log: implemented as separate extension
May 7, 2023
84fb0ed
periodic_log: extension updated
May 21, 2023
a2f238d
periodic_log: settings input in extension updated
May 22, 2023
db794d3
periodic_log: not used code deleted
May 22, 2023
5c91f1b
periodic_log: added settings to default settings
May 23, 2023
a0c8490
periodic_log: codestyle fix (from pre-commit)
May 23, 2023
b60e0fa
periodic_log: fixed naming
May 24, 2023
639c2bc
periodic_log: TypeError except added
Jun 16, 2023
6e65eeb
periodic_log: tests [wip] added
Jun 16, 2023
2ce4856
periodic_log: Exception handling on init updated
Jun 25, 2023
ebce5b4
periodic_log: tests for logging deltas added
Jun 27, 2023
315861c
periodic_log: stats filtering updated
Jun 27, 2023
56c3823
periodic_log: tests for logging stats added
Jun 27, 2023
6fd94fd
periodic_log: tests updated (errors fixed)
Jun 28, 2023
8b6a50a
periodic_log: docs added
Aug 4, 2023
e9b088f
periodic_log: typing
Aug 4, 2023
f05657e
periodic_log: interval check moved
Aug 10, 2023
736a4b6
Update scrapy/settings/default_settings.py
GeorgeA92 Aug 10, 2023
2f094a7
Update docs/topics/extensions.rst
GeorgeA92 Aug 11, 2023
e6bd982
Update docs/topics/extensions.rst
GeorgeA92 Aug 11, 2023
3a4a949
Update docs/topics/extensions.rst
GeorgeA92 Aug 11, 2023
d67be20
Update docs/topics/extensions.rst
GeorgeA92 Aug 11, 2023
ac1694a
Update docs/topics/extensions.rst
GeorgeA92 Aug 11, 2023
1f03cb1
Merge remote-tracking branch 'upstream/master' into periodic_log_2
Aug 21, 2023
7e54284
periodic_log: `datetime.utcnow()` changed to `datetime.now(tz=timezon…
Aug 21, 2023
fe02642
Merge remote-tracking branch 'origin/periodic_log_2' into periodic_log_2
Aug 21, 2023
112 changes: 112 additions & 0 deletions docs/topics/extensions.rst
@@ -396,3 +396,115 @@ For more info see `Debugging in Python`_.
This extension only works on POSIX-compliant platforms (i.e. not Windows).

.. _Debugging in Python: https://pythonconquerstheuniverse.wordpress.com/2009/09/10/debugging-in-python/

Periodic log extension
~~~~~~~~~~~~~~~~~~~~~~

.. class:: PeriodicLog

This extension periodically logs rich stat data as a JSON object::

2023-08-04 02:30:57 [scrapy.extensions.logstats] INFO: Crawled 976 pages (at 162 pages/min), scraped 925 items (at 161 items/min)
2023-08-04 02:30:57 [scrapy.extensions.periodic_log] INFO: {
"delta": {
"downloader/request_bytes": 55582,
"downloader/request_count": 162,
"downloader/request_method_count/GET": 162,
"downloader/response_bytes": 618133,
"downloader/response_count": 162,
"downloader/response_status_count/200": 162,
"item_scraped_count": 161
},
"stats": {
"downloader/request_bytes": 338243,
"downloader/request_count": 992,
"downloader/request_method_count/GET": 992,
"downloader/response_bytes": 3836736,
"downloader/response_count": 976,
"downloader/response_status_count/200": 976,
"item_scraped_count": 925,
"log_count/INFO": 21,
"log_count/WARNING": 1,
"scheduler/dequeued": 992,
"scheduler/dequeued/memory": 992,
"scheduler/enqueued": 1050,
"scheduler/enqueued/memory": 1050
},
"time": {
"elapsed": 360.008903,
"log_interval": 60.0,
"log_interval_real": 60.006694,
"start_time": "2023-08-03 23:24:57",
"utcnow": "2023-08-03 23:30:57"
}
}

This extension logs the following configurable sections:

- ``"delta"`` shows how some numeric stats have changed since the last stats
log message.

The :setting:`PERIODIC_LOG_DELTA` setting determines the target stats. They
must have ``int`` or ``float`` values.

- ``"stats"`` shows the current value of some stats.

The :setting:`PERIODIC_LOG_STATS` setting determines the target stats.

- ``"time"`` shows detailed timing data.

The :setting:`PERIODIC_LOG_TIMING_ENABLED` setting determines whether or
not to show this section.

This extension logs data at the start, then on a fixed time interval
configurable through the :setting:`LOGSTATS_INTERVAL` setting, and finally
right before the crawl ends.
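The delta computation described above can be sketched as a plain dict comprehension: each current numeric stat value minus the value snapshotted at the previous log tick (stats seen for the first time are diffed against 0). The example values below are chosen so the results match the log excerpt above:

```python
# Sketch of the "delta" computation: current numeric stats minus the
# values captured at the previous periodic log tick.
prev = {"downloader/request_count": 830, "item_scraped_count": 764}
current = {"downloader/request_count": 992, "item_scraped_count": 925}

delta = {k: v - prev.get(k, 0) for k, v in current.items()}
print(delta)  # {'downloader/request_count': 162, 'item_scraped_count': 161}
```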


Example extension configuration:

.. code-block:: python

custom_settings = {
"LOG_LEVEL": "INFO",
"PERIODIC_LOG_STATS": {
"include": ["downloader/", "scheduler/", "log_count/", "item_scraped_count/"],
},
"PERIODIC_LOG_DELTA": {"include": ["downloader/"]},
"PERIODIC_LOG_TIMING_ENABLED": True,
"EXTENSIONS": {
"scrapy.extensions.periodic_log.PeriodicLog": 0,
},
}

.. setting:: PERIODIC_LOG_DELTA

PERIODIC_LOG_DELTA
""""""""""""""""""

Default: ``None``

* ``"PERIODIC_LOG_DELTA": True`` - show deltas for all ``int`` and ``float`` stat values.
* ``"PERIODIC_LOG_DELTA": {"include": ["downloader/", "scheduler/"]}`` - show deltas for stats with names containing any configured substring.
* ``"PERIODIC_LOG_DELTA": {"exclude": ["downloader/"]}`` - show deltas for all stats with names not containing any configured substring.

.. setting:: PERIODIC_LOG_STATS

PERIODIC_LOG_STATS
""""""""""""""""""

Default: ``None``

* ``"PERIODIC_LOG_STATS": True`` - show the current value of all stats.
* ``"PERIODIC_LOG_STATS": {"include": ["downloader/", "scheduler/"]}`` - show current values for stats with names containing any configured substring.
* ``"PERIODIC_LOG_STATS": {"exclude": ["downloader/"]}`` - show current values for all stats with names not containing any configured substring.


.. setting:: PERIODIC_LOG_TIMING_ENABLED

PERIODIC_LOG_TIMING_ENABLED
"""""""""""""""""""""""""""

Default: ``False``

``True`` enables logging of timing data (i.e. the ``"time"`` section).
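Assuming timing is enabled, the fields of the ``"time"`` section relate as follows; this sketch uses the timestamps from the example log above, where ``elapsed`` is measured from crawl start and ``log_interval_real`` from the previous log tick:

```python
from datetime import datetime, timezone

# Sketch of the "time" section fields, using the timestamps from the
# example log output shown earlier in this document.
start_time = datetime(2023, 8, 3, 23, 24, 57, tzinfo=timezone.utc)
time_prev = datetime(2023, 8, 3, 23, 29, 57, tzinfo=timezone.utc)
now = datetime(2023, 8, 3, 23, 30, 57, tzinfo=timezone.utc)

time_section = {
    "log_interval": 60.0,  # the configured LOGSTATS_INTERVAL
    "log_interval_real": (now - time_prev).total_seconds(),
    "elapsed": (now - start_time).total_seconds(),
    "start_time": start_time,
    "utcnow": now,
}
print(time_section["elapsed"])  # 360.0
```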
140 changes: 140 additions & 0 deletions scrapy/extensions/periodic_log.py
@@ -0,0 +1,140 @@
import logging
from datetime import datetime, timezone

from twisted.internet import task

from scrapy import signals
from scrapy.exceptions import NotConfigured
from scrapy.utils.serialize import ScrapyJSONEncoder

logger = logging.getLogger(__name__)


class PeriodicLog:
"""Log basic scraping stats periodically"""

    def __init__(
        self,
        stats,
        interval=60.0,
        ext_stats=None,
        ext_delta=None,
        ext_timing_enabled=False,
    ):
        self.stats = stats
        self.interval = interval
        self.multiplier = 60.0 / self.interval
        self.task = None
        self.encoder = ScrapyJSONEncoder(sort_keys=True, indent=4)
        # None defaults avoid shared mutable arguments and tolerate None
        # being passed from from_crawler() when a setting is disabled.
        ext_stats = ext_stats or {}
        ext_delta = ext_delta or {}
        self.ext_stats_enabled = bool(ext_stats)
        self.ext_stats_include = ext_stats.get("include", [])
        self.ext_stats_exclude = ext_stats.get("exclude", [])
        self.ext_delta_enabled = bool(ext_delta)
        self.ext_delta_include = ext_delta.get("include", [])
        self.ext_delta_exclude = ext_delta.get("exclude", [])
        self.ext_timing_enabled = ext_timing_enabled

@classmethod
def from_crawler(cls, crawler):
interval = crawler.settings.getfloat("LOGSTATS_INTERVAL")
if not interval:
raise NotConfigured
try:
ext_stats = crawler.settings.getdict("PERIODIC_LOG_STATS")
except (TypeError, ValueError):
ext_stats = (
{"enabled": True}
if crawler.settings.getbool("PERIODIC_LOG_STATS")
else None
)
try:
ext_delta = crawler.settings.getdict("PERIODIC_LOG_DELTA")
except (TypeError, ValueError):
ext_delta = (
{"enabled": True}
if crawler.settings.getbool("PERIODIC_LOG_DELTA")
else None
)

ext_timing_enabled = crawler.settings.getbool(
"PERIODIC_LOG_TIMING_ENABLED", False
)
if not (ext_stats or ext_delta or ext_timing_enabled):
raise NotConfigured
o = cls(
crawler.stats,
interval,
ext_stats,
ext_delta,
ext_timing_enabled,
)
crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
crawler.signals.connect(o.spider_closed, signal=signals.spider_closed)
return o

def spider_opened(self, spider):
self.time_prev = datetime.now(tz=timezone.utc)
self.delta_prev = {}
self.stats_prev = {}

self.task = task.LoopingCall(self.log)
self.task.start(self.interval)

def log(self):
data = {}
if self.ext_timing_enabled:
data.update(self.log_timing())
if self.ext_delta_enabled:
data.update(self.log_delta())
if self.ext_stats_enabled:
data.update(self.log_crawler_stats())
logger.info(self.encoder.encode(data))

def log_delta(self):
num_stats = {
k: v
for k, v in self.stats._stats.items()
if isinstance(v, (int, float))
and self.param_allowed(k, self.ext_delta_include, self.ext_delta_exclude)
}
delta = {k: v - self.delta_prev.get(k, 0) for k, v in num_stats.items()}
self.delta_prev = num_stats
return {"delta": delta}

def log_timing(self):
now = datetime.now(tz=timezone.utc)
time = {
"log_interval": self.interval,
"start_time": self.stats._stats["start_time"],
"utcnow": now,
"log_interval_real": (now - self.time_prev).total_seconds(),
"elapsed": (now - self.stats._stats["start_time"]).total_seconds(),
}
self.time_prev = now
return {"time": time}

def log_crawler_stats(self):
stats = {
k: v
for k, v in self.stats._stats.items()
if self.param_allowed(k, self.ext_stats_include, self.ext_stats_exclude)
}
return {"stats": stats}

def param_allowed(self, stat_name, include, exclude):
if not include and not exclude:
return True
for p in exclude:
if p in stat_name:
return False
if exclude and not include:
return True
for p in include:
if p in stat_name:
return True
return False

def spider_closed(self, spider, reason):
self.log()
if self.task and self.task.running:
self.task.stop()
4 changes: 4 additions & 0 deletions scrapy/settings/default_settings.py
@@ -242,6 +242,10 @@

NEWSPIDER_MODULE = ""

PERIODIC_LOG_DELTA = None
PERIODIC_LOG_STATS = None
PERIODIC_LOG_TIMING_ENABLED = False

RANDOMIZE_DOWNLOAD_DELAY = True

REACTOR_THREADPOOL_MAXSIZE = 10