Exception when using DummyStatsCollector #4007

Closed
Lukas0907 opened this issue Sep 9, 2019 · 5 comments · Fixed by #4052
Lukas0907 (Contributor) commented Sep 9, 2019

Description

Using the DummyStatsCollector results in an exception:

2019-09-09 13:51:23 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method CoreStats.spider_closed of <scrapy.extensions.corestats.CoreStats object at 0x7f86269cac18>>
Traceback (most recent call last):
  File ".../lib/python3.6/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File ".../lib/python3.6/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File ".../lib/python3.6/site-packages/scrapy/extensions/corestats.py", line 28, in spider_closed
    elapsed_time = finish_time - self.stats.get_value('start_time')
TypeError: unsupported operand type(s) for -: 'datetime.datetime' and 'NoneType'

This problem was introduced in aa46e19.

Steps to Reproduce

Set STATS_CLASS = "scrapy.statscollectors.DummyStatsCollector" in the settings module as described in the documentation (https://docs.scrapy.org/en/latest/topics/stats.html#dummystatscollector).
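
For reference, a minimal way to reproduce: add the line below to the project's settings module and run any spider; the traceback appears when the spider closes.

# settings.py (any Scrapy project; which spider runs is immaterial)
STATS_CLASS = "scrapy.statscollectors.DummyStatsCollector"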

Expected behavior: no exception
Actual behavior: exception thrown
Reproduces how often: always

Versions

At least master as of 534de73

Fix

A possible fix is to pass finish_time as the default argument so that get_value() does not return None. I can prepare a PR if needed.

--- a/scrapy/extensions/corestats.py
+++ b/scrapy/extensions/corestats.py
@@ -25,7 +25,7 @@ class CoreStats(object):
 
     def spider_closed(self, spider, reason):
         finish_time = datetime.datetime.utcnow()
-        elapsed_time = finish_time - self.stats.get_value('start_time')
+        elapsed_time = finish_time - self.stats.get_value('start_time', finish_time)
         elapsed_time_seconds = elapsed_time.total_seconds()
         self.stats.set_value('elapsed_time_seconds', elapsed_time_seconds, spider=spider)
         self.stats.set_value('finish_time', finish_time, spider=spider)
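
A quick sketch of the get_value() semantics this fix relies on (assuming StatsCollector.__init__ only reads crawler.settings, so a Mock crawler is enough for an interactive check):

import datetime
from unittest import mock

from scrapy.statscollectors import DummyStatsCollector

stats = DummyStatsCollector(mock.Mock())  # mock crawler stands in for a real one

finish_time = datetime.datetime.utcnow()
assert stats.get_value('start_time') is None                      # current code: None, hence the TypeError
assert stats.get_value('start_time', finish_time) == finish_time  # with the default, elapsed_time becomes 0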
elacuesta (Member) commented Sep 10, 2019

Confirmed. scrapy.extensions.corestats.CoreStats sets the start_time value in its spider_opened handler, but scrapy.statscollectors.DummyStatsCollector overrides set_value to do nothing.
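
For reference, the relevant overrides in scrapy/statscollectors.py look roughly like this (paraphrased from the source at the time):

class DummyStatsCollector(StatsCollector):

    def get_value(self, key, default=None, spider=None):
        return default  # nothing is ever stored, so the default is all there is

    def set_value(self, key, value, spider=None):
        pass  # silently discards the value CoreStats tries to store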

However, I'm not sure I agree with the proposed fix: it would mean the elapsed_time variable could end up being 0, which is probably not accurate. Given that the idea of the DummyStatsCollector is to improve performance by not collecting stats, I wonder whether we even need to connect the signal handlers in the first place. A fix could look something like this:

diff --git a/scrapy/extensions/corestats.py b/scrapy/extensions/corestats.py
index 8cc5e18a..5e4dbb05 100644
--- a/scrapy/extensions/corestats.py
+++ b/scrapy/extensions/corestats.py
@@ -4,6 +4,8 @@ Extension for collecting core stats like items scraped and start/finish times
 import datetime

 from scrapy import signals
+from scrapy.statscollectors import DummyStatsCollector
+

 class CoreStats(object):

@@ -13,11 +15,12 @@ class CoreStats(object):
     @classmethod
     def from_crawler(cls, crawler):
         o = cls(crawler.stats)
-        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
-        crawler.signals.connect(o.spider_closed, signal=signals.spider_closed)
-        crawler.signals.connect(o.item_scraped, signal=signals.item_scraped)
-        crawler.signals.connect(o.item_dropped, signal=signals.item_dropped)
-        crawler.signals.connect(o.response_received, signal=signals.response_received)
+        if not isinstance(crawler.stats, DummyStatsCollector):
+            crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
+            crawler.signals.connect(o.spider_closed, signal=signals.spider_closed)
+            crawler.signals.connect(o.item_scraped, signal=signals.item_scraped)
+            crawler.signals.connect(o.item_dropped, signal=signals.item_dropped)
+            crawler.signals.connect(o.response_received, signal=signals.response_received)
         return o

     def spider_opened(self, spider):

elacuesta added the bug label Sep 10, 2019
Lukas0907 (Contributor, Author) commented Sep 10, 2019

Thanks for the quick response!

Yes, you are right that elapsed_time would be 0, which is not optimal. However, special-casing the DummyStatsCollector also seems a bit odd to me: if I were to implement my own stats collector that exposed the same behavior as the DummyStatsCollector, it would still fail.
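
To illustrate that point, a hypothetical user-defined collector (not part of Scrapy) that discards values just like DummyStatsCollector, but is a different class:

from scrapy.statscollectors import StatsCollector

class BlackHoleStatsCollector(StatsCollector):

    def set_value(self, key, value, spider=None):
        pass  # never stores anything

    def get_value(self, key, default=None, spider=None):
        return default  # so get_value('start_time') is still None

An isinstance(crawler.stats, DummyStatsCollector) check would not match this class, and CoreStats.spider_closed would raise the same TypeError.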

elacuesta (Member) commented Sep 10, 2019

Makes sense; the extension should be robust enough not to break depending on the stats class implementation. In that case, something along the following lines should do the trick:

diff --git a/scrapy/extensions/corestats.py b/scrapy/extensions/corestats.py
index 8cc5e18a..91d8558b 100644
--- a/scrapy/extensions/corestats.py
+++ b/scrapy/extensions/corestats.py
@@ -9,6 +9,7 @@ class CoreStats(object):

     def __init__(self, stats):
         self.stats = stats
+        self.start_time = None

     @classmethod
     def from_crawler(cls, crawler):
@@ -21,11 +22,12 @@ class CoreStats(object):
         return o

     def spider_opened(self, spider):
-        self.stats.set_value('start_time', datetime.datetime.utcnow(), spider=spider)
+        self.start_time = datetime.datetime.utcnow()
+        self.stats.set_value('start_time', self.start_time, spider=spider)

     def spider_closed(self, spider, reason):
         finish_time = datetime.datetime.utcnow()
-        elapsed_time = finish_time - self.stats.get_value('start_time')
+        elapsed_time = finish_time - self.start_time
         elapsed_time_seconds = elapsed_time.total_seconds()
         self.stats.set_value('elapsed_time_seconds', elapsed_time_seconds, spider=spider)
         self.stats.set_value('finish_time', finish_time, spider=spider)

I still think it might be a good idea to skip connecting the handlers though 😅
Let's see what other committers think.

nieweiming commented May 19, 2021

Why use datetime.datetime.utcnow() instead of datetime.datetime.now(timezone.utc)?

elacuesta (Member) commented May 19, 2021

> Why use datetime.datetime.utcnow() instead of datetime.datetime.now(timezone.utc)?

No specific reason that I can see; the previous code was already using utcnow and it just didn't occur to me to change it. In any case, I don't think it changes things: AFAICT all datetimes handled by Scrapy are in UTC and there are no timezone conversions.
That said, I'm happy to be proven wrong. If you think there is something to be improved, please do suggest it or open a PR.
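
For what it's worth, the practical difference between the two calls is naive vs. timezone-aware datetimes; a minimal illustration:

import datetime
from datetime import timezone

naive = datetime.datetime.utcnow()           # UTC wall time, tzinfo is None
aware = datetime.datetime.now(timezone.utc)  # UTC wall time, tzinfo is set

print(naive.tzinfo)  # None
print(aware.tzinfo)  # UTC

# Mixing the two is what actually breaks: subtracting a naive datetime from an
# aware one raises "TypeError: can't subtract offset-naive and offset-aware
# datetimes", so whichever style is used has to be used consistently.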
