
periodic_log: implemented as separate extension #5926

Merged
merged 26 commits into scrapy:master on Aug 30, 2023

Conversation

GeorgeA92
Contributor

The same as the current version of #5830, but implemented as a separate extension, so it doesn't affect scrapy.extensions.logstats.LogStats and there is no risk of breaking its backward compatibility.

@codecov

codecov bot commented May 7, 2023

Codecov Report

Merging #5926 (b23839f) into master (df2163c) will increase coverage by 0.05%.
The diff coverage is 87.50%.

❗ Current head b23839f differs from pull request most recent head fe02642. Consider uploading reports for the commit fe02642 to get more accurate results

@@            Coverage Diff             @@
##           master    #5926      +/-   ##
==========================================
+ Coverage   88.85%   88.91%   +0.05%     
==========================================
  Files         162      163       +1     
  Lines       11445    11533      +88     
  Branches     1861     1877      +16     
==========================================
+ Hits        10170    10255      +85     
- Misses        968      969       +1     
- Partials      307      309       +2     
Files Changed Coverage Δ
scrapy/extensions/periodic_log.py 87.05% <87.05%> (ø)
scrapy/settings/default_settings.py 98.80% <100.00%> (+0.02%) ⬆️

... and 1 file with indirect coverage changes

@GeorgeA92
Contributor Author

Implemented as a separate extension.
Each part of it (stats, delta, timing) can be configured and enabled/disabled separately, as in the sample script below.

script.py
import scrapy
from scrapy.crawler import CrawlerProcess

class BooksToScrapeSpider(scrapy.Spider):
    name = "books"
    custom_settings = {
        "DOWNLOAD_DELAY": 0.3,
        "LOG_LEVEL": "INFO",
        # "LOGSTATS_EXT_TIMING_ENABLED": True,
        "LOGSTATS_EXT_STATS_ENABLED": True,
        "LOGSTATS_EXT_STATS_INCLUDE": ["downloader/", "scheduler/", "log_count/", "item_scraped_count"],
        # ^ output only stat entries whose name contains one of the listed substrings; if not set, all are allowed
        "LOGSTATS_EXT_STATS_EXCLUDE": ["scheduler/"],
        # ^ exclude stat entries whose name contains one of the listed substrings; if not set, nothing is excluded
        # ^ takes precedence over LOGSTATS_EXT_STATS_INCLUDE
        # "LOGSTATS_EXT_DELTA_ENABLED": True,
        # "LOGSTATS_EXT_DELTA_INCLUDE": ["downloader/"],
        # ^ works the same way as the LOGSTATS_EXT_STATS_* settings
        "EXTENSIONS": {
            "scrapy.extensions.logstats_extended.LogStatsExtended": 0,
        },
    }
    def start_requests(self):
        yield scrapy.Request(url='https://books.toscrape.com/', callback=self.parse)

    def parse(self, response):
        for book_page_link in response.css('.product_pod a::attr(href)').getall():
            yield scrapy.Request(response.urljoin(book_page_link), self.parse_book)
        if next_page := response.css("li.next a::attr(href)").get():
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)

    def parse_book(self, response):
        yield {'response.url': response.url}

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(BooksToScrapeSpider)
    process.start()

@Gallaecio
Member

You might want to install pre-commit to deal with the failing CI job.

@Gallaecio
Member

From my side the implementation looks OK, pending docs and tests.

if crawler.settings.getbool("PERIODIC_LOG_STATS")
else None
)
except TypeError:
Member


Shouldn't these two except clauses be combined as their code is the same?

Contributor Author


Done
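The fix the reviewer suggests is a standard Python pattern: one `except` clause with a tuple of exception types replaces two clauses with identical bodies. A generic sketch (function name hypothetical, not the extension's actual code):

```python
def parse_interval(value):
    """Parse a logging interval, returning None for unusable input."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # A single clause catching a tuple of types replaces two
        # separate except clauses whose bodies were identical.
        return None
```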

@GeorgeA92
Contributor Author

Updated the tests and the extension.
The tests now cover the most complicated part of the extension: filtering of stats fields (include/exclude) by name.
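The include/exclude filtering being tested behaves roughly like the sketch below (function name and signature are hypothetical, not the extension's API): a stat name passes when it contains any include substring, and is then dropped if it contains any exclude substring, exclude taking precedence.

```python
def filter_stats(stats, include=None, exclude=None):
    """Filter a stats dict by substring matching on the stat names.

    A name passes when it contains any `include` substring (or
    `include` is unset) and contains no `exclude` substring;
    `exclude` takes precedence over `include`.
    """
    filtered = {}
    for name, value in stats.items():
        if include and not any(part in name for part in include):
            continue  # no include substring matched
        if exclude and any(part in name for part in exclude):
            continue  # an exclude substring matched
        filtered[name] = value
    return filtered
```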

@wRAR
Member

wRAR commented Jun 28, 2023

Interesting, looks like the telnet test needs some additional cleanup? Or does it suggest that there is some problem in the new code?

@GeorgeA92
Contributor Author

The error in my test code is fixed.
Each call of ext.spider_opened(spider) in the tests starts a looping call; since those calls were not stopped after each test, the previously started looping calls affected the following tests.

@Gallaecio
Member

@GeorgeA92 Could you look into the typing issues?

@wRAR wRAR changed the title periodic_log: implemented as separate extension [WIP] periodic_log: implemented as separate extension Aug 4, 2023
Member

@Gallaecio Gallaecio left a comment


Looks good to me. I recommend applying the doc suggestions and the default_settings.py one, but the rest should not block this change.

Great work!

GeorgeA92 and others added 6 commits August 10, 2023 21:34
Co-authored-by: Adrián Chaves <adrian@chaves.io>
@Gallaecio
Member

I just merged #6014, so you might want to merge from the latest main branch and update those utcnow() calls.

@GeorgeA92
Contributor Author

@Gallaecio, the usages of datetime.utcnow() were removed in 7e54284 to match the contents of #6014.
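The change #6014 asks for is mechanical: `datetime.utcnow()`, deprecated since Python 3.12 because it returns a naive datetime, is replaced with a timezone-aware call:

```python
from datetime import datetime, timezone

# Before: naive datetime, deprecated since Python 3.12
# now = datetime.utcnow()

# After: timezone-aware UTC datetime
now = datetime.now(tz=timezone.utc)
```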

Member

@wRAR wRAR left a comment


Thanks!

@wRAR wRAR merged commit cddb8c1 into scrapy:master Aug 30, 2023
26 checks passed
4 participants