
[MRG+1] Wrong value of log_count/INFO in stats #3643

Merged: 11 commits into scrapy:master, Jul 17, 2019

Conversation

@maramsumanth (Contributor) commented Feb 28, 2019

This is my output for:
(base) C:\Users\MARAMSUMANTH\Desktop\myproject>scrapy crawl com

2019-02-28 18:17:21 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: myproject)
2019-02-28 18:17:21 [scrapy.utils.log] INFO: Versions: .........
2019-02-28 18:17:21 [scrapy.crawler] INFO: Overridden settings: {....}
2019-02-28 18:17:21 [scrapy.extensions.telnet] INFO: Telnet Password: ***********
2019-02-28 18:17:21 [scrapy.middleware] INFO: Enabled extensions:[........]
2019-02-28 18:17:22 [scrapy.middleware] INFO: Enabled downloader middlewares:[.......]
2019-02-28 18:17:22 [scrapy.middleware] INFO: Enabled spider middlewares:[.....]
2019-02-28 18:17:22 [scrapy.middleware] INFO: Enabled item pipelines:[]
2019-02-28 18:17:22 [scrapy.core.engine] INFO: Spider opened
2019-02-28 18:17:22 [scrapy.extensions.logstats] INFO: Crawled.......
2019-02-28 18:17:22 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-02-28 18:17:22 [scrapy.core.engine] DEBUG: Crawled.........
2019-02-28 18:17:22 [scrapy.core.engine] DEBUG: Crawled.........
2019-02-28 18:17:22 [scrapy.core.engine] INFO: Closing spider (finished)
2019-02-28 18:17:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 446,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2701,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 2, 28, 12, 47, 22, 773465),
'log_count/DEBUG': 2,
'log_count/INFO': 9,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 2, 28, 12, 47, 22, 84308)}
2019-02-28 18:17:22 [scrapy.core.engine] INFO: Spider closed (finished)

The actual log_count/INFO value should be 14, but 9 is printed instead. Five of the INFO lines above are not counted; they are emitted automatically and are common to any spider crawl.

So this PR aims to add those 5 to the count explicitly.
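(For context: the log_count/<LEVEL> stats come from a logging handler that increments a per-level counter for each record it handles. Below is a minimal self-contained sketch of that mechanism; LevelCountingHandler is a hypothetical stand-in written for illustration, not Scrapy's actual LogCounterHandler.)

```python
import logging
from collections import Counter

class LevelCountingHandler(logging.Handler):
    """Hypothetical stand-in for Scrapy's log-counting handler:
    counts handled records per log level."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def emit(self, record):
        self.counts['log_count/' + record.levelname] += 1

logger = logging.getLogger("log-count-demo")
logger.setLevel(logging.DEBUG)
handler = LevelCountingHandler()
logger.addHandler(handler)

logger.info("Spider opened")     # counted as INFO
logger.debug("Crawled ...")      # counted as DEBUG
logger.info("Closing spider")    # counted as INFO

print(handler.counts['log_count/INFO'])   # 2
print(handler.counts['log_count/DEBUG'])  # 1
```

Any INFO message logged before such a handler is attached never reaches emit(), which is exactly why early startup lines can go missing from the count.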

@codecov (bot) commented Feb 28, 2019

Codecov Report

Merging #3643 into master will increase coverage by 0.19%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3643      +/-   ##
==========================================
+ Coverage   84.52%   84.72%   +0.19%     
==========================================
  Files         167      168       +1     
  Lines        9410     9460      +50     
  Branches     1397     1407      +10     
==========================================
+ Hits         7954     8015      +61     
+ Misses       1199     1188      -11     
  Partials      257      257
Impacted Files Coverage Δ
scrapy/crawler.py 92.39% <100%> (+0.32%) ⬆️
scrapy/http/__init__.py 100% <0%> (ø) ⬆️
scrapy/linkextractors/__init__.py 100% <0%> (ø) ⬆️
scrapy/downloadermiddlewares/retry.py 95.83% <0%> (ø) ⬆️
scrapy/linkextractors/lxmlhtml.py 92.4% <0%> (ø) ⬆️
scrapy/linkextractors/sgml.py 96.8% <0%> (ø) ⬆️
scrapy/http/request/json_request.py 93.75% <0%> (ø)
scrapy/settings/default_settings.py 98.64% <0%> (ø) ⬆️
scrapy/item.py 98.48% <0%> (+0.07%) ⬆️
... and 1 more

@maramsumanth (Contributor, Author) commented Mar 4, 2019

@lopuhin, can you please look into this? :)
Thanks

@lopuhin (Member) commented Mar 4, 2019

@maramsumanth I'm not that familiar with the log counting code, but this does not look like a good solution to me because of the hard-coded constant 5. As I understand it, 5 logging statements are not counted, but where does this number come from, apart from your observation? What if it differs depending on some settings? What if it changes? How do we know when to update it?

@maramsumanth (Contributor, Author) commented Mar 4, 2019

I individually commented out each log.info call in each file and checked whether it changed log_count. After many observations I found that these 5 are not counted in the stats.

@maramsumanth (Contributor, Author) commented Mar 4, 2019

I think these 5 are trivial for any spider crawl, like opening and closing the spider. Isn't that so?

@lopuhin (Member) commented Mar 4, 2019

Still, I think we should not hard-code 5 here. It would be better to either come up with a less fragile fix, or to document that the number of logging statements includes only statements issued while crawling, not while opening or closing the spider.

@@ -41,6 +41,7 @@ def __init__(self, spidercls, settings=None):
self.stats = load_object(self.settings['STATS_CLASS'])(self)

handler = LogCounterHandler(self, level=self.settings.get('LOG_LEVEL'))
handler.crawler.stats.inc_value('log_count/INFO')
@lopuhin (Member) commented Mar 18, 2019

Instead of incrementing the value directly, maybe it would be feasible to add the handler earlier or move logging after the handler? E.g. at least here we could issue the logger.info("Overridden settings: %(settings)r", {'settings': d}) after adding the log handler, what do you think?
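(The ordering issue being suggested here can be shown with a toy example; the names below are illustrative, not Scrapy's actual Crawler internals. A record emitted before the counting handler is attached is simply never seen by it.)

```python
import logging

def records_counted(install_handler_first):
    """Toy demonstration: only records emitted after the counting
    handler is attached get counted."""
    logger = logging.getLogger("order-demo-%s" % install_handler_first)
    logger.setLevel(logging.INFO)
    counted = []

    class CountingHandler(logging.Handler):
        def emit(self, record):
            counted.append(record.levelname)

    if install_handler_first:
        logger.addHandler(CountingHandler())
        logger.info("Overridden settings: %(settings)r", {'settings': {}})
    else:
        # logged before the handler exists: the record is lost to the count
        logger.info("Overridden settings: %(settings)r", {'settings': {}})
        logger.addHandler(CountingHandler())

    return len(counted)

print(records_counted(False))  # 0 - the message was missed
print(records_counted(True))   # 1 - the message was counted
```

Reordering so the handler is installed before the "Overridden settings" log line would make that record count without any manual inc_value.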

@maramsumanth (Contributor, Author) commented Mar 18, 2019

I thought of doing this, but when the tests failed I dropped it.
I will do it, @lopuhin.

@maramsumanth (Contributor, Author) commented Mar 18, 2019

Out of the 5, we can handle 1 through this modification.

@kmike (Member) commented Mar 18, 2019

@maramsumanth I think @lopuhin's suggestion is not applied yet: the logger.info call happens before the handler is installed. The manual inc_value here doesn't look good to me.

@maramsumanth (Contributor, Author) commented Mar 19, 2019

Done @kmike 👍 :)

@maramsumanth (Contributor, Author) commented Mar 18, 2019

@lopuhin, can you check why the tests are failing?

@lopuhin (Member) commented Mar 18, 2019

@maramsumanth I think the failing tests are directly related to the changes in log counting; you can run just the failing tests locally and adjust the tests or the code.

@maramsumanth (Contributor, Author) commented Mar 18, 2019

@lopuhin, I have made the changes :)
Thanks

@maramsumanth (Contributor, Author) commented Mar 20, 2019

@kmike, I've made the required changes :) 👍

@maramsumanth (Contributor, Author) commented Mar 24, 2019

Done @Gallaecio :) 👍

@Gallaecio (Member) commented Mar 26, 2019

> Done @Gallaecio :) 👍

Actually, I think you've now accidentally moved an additional line (d = …) as well.

@maramsumanth (Contributor, Author) commented Mar 26, 2019

@Gallaecio I moved it because logger.info("Overridden settings: %(settings)r", {'settings': d}) uses d, and it seemed nice to keep the definition next to the call. Should I change it back?

@Gallaecio (Member) left a review

I think the current changes are OK. They do not fix the reported issue, but I think they are a change for the better, and the reported issue might be best left unfixed.

@Gallaecio changed the title from "Wrong value of log_count/INFO in stats" to "[MRG+1] Wrong value of log_count/INFO in stats" on Mar 26, 2019
@kmike (Member) commented Jul 17, 2019

Thanks @maramsumanth!

@kmike merged commit 0d51f9c into scrapy:master on Jul 17, 2019 (3 checks passed)
@kmike added this to the v1.7 milestone on Jul 17, 2019