
periodic_log: implemented as separate extension #5926

Merged
merged 26 commits into scrapy:master on Aug 30, 2023

Conversation

GeorgeA92
Contributor

The same as the current version of #5830, but implemented as a separate extension, so it doesn't affect scrapy.extensions.logstats.LogStats and there is no risk of breaking its backward compatibility.

@codecov

codecov bot commented May 7, 2023

Codecov Report

Merging #5926 (b23839f) into master (df2163c) will increase coverage by 0.05%.
The diff coverage is 87.50%.

❗ Current head b23839f differs from pull request most recent head fe02642. Consider uploading reports for the commit fe02642 to get more accurate results

@@            Coverage Diff             @@
##           master    #5926      +/-   ##
==========================================
+ Coverage   88.85%   88.91%   +0.05%     
==========================================
  Files         162      163       +1     
  Lines       11445    11533      +88     
  Branches     1861     1877      +16     
==========================================
+ Hits        10170    10255      +85     
- Misses        968      969       +1     
- Partials      307      309       +2     
Files Changed Coverage Δ
scrapy/extensions/periodic_log.py 87.05% <87.05%> (ø)
scrapy/settings/default_settings.py 98.80% <100.00%> (+0.02%) ⬆️

... and 1 file with indirect coverage changes

@GeorgeA92
Contributor Author

Implemented as a separate extension.
Each part of it (stats, delta, timing) can be configured and enabled/disabled separately, as in the sample script below.

script.py
import scrapy
from scrapy.crawler import CrawlerProcess

class BooksToScrapeSpider(scrapy.Spider):
    name = "books"
    custom_settings = {
        "DOWNLOAD_DELAY": 0.3,
        "LOG_LEVEL": "INFO",
        # "LOGSTATS_EXT_TIMING_ENABLED": True,
        "LOGSTATS_EXT_STATS_ENABLED": True,
        "LOGSTATS_EXT_STATS_INCLUDE": ["downloader/", "scheduler/", "log_count/", "item_scraped_count"],
        # ^ output only stat entries whose name contains one of the listed substrings; if not set, all are allowed
        "LOGSTATS_EXT_STATS_EXCLUDE": ["scheduler/"],
        # ^ exclude stat entries whose name contains one of the listed substrings; if not set, nothing is excluded
        # ^ takes precedence over LOGSTATS_EXT_STATS_INCLUDE
        # "LOGSTATS_EXT_DELTA_ENABLED": True,
        # "LOGSTATS_EXT_DELTA_INCLUDE": ["downloader/"],
        # ^ works the same way as the LOGSTATS_EXT_STATS_* settings
        "EXTENSIONS": {
            "scrapy.extensions.logstats_extended.LogStatsExtended": 0,
        },
    }
    def start_requests(self):
        yield scrapy.Request(url='https://books.toscrape.com/', callback=self.parse)

    def parse(self, response):
        for book_page_link in response.css('.product_pod a::attr(href)').getall():
            yield scrapy.Request(response.urljoin(book_page_link), self.parse_book)
        if next_page := response.css("li.next a::attr(href)").get():
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)

    def parse_book(self, response):
        yield {'response.url': response.url}

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(BooksToScrapeSpider)
    process.start()

@Gallaecio
Member

You might want to install pre-commit to deal with the failing CI job.

@Gallaecio
Member

From my side the implementation looks OK, pending docs and tests.

if crawler.settings.getbool("PERIODIC_LOG_STATS")
else None
)
except TypeError:
Member


Shouldn't these two except clauses be combined as their code is the same?

Contributor Author


Done
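The fix the reviewer suggests is a standard Python pattern: one `except` clause with a tuple of exception types replaces two clauses with identical bodies. A generic sketch (function name hypothetical, not the extension's actual code):

```python
def parse_interval(value):
    """Parse a logging interval, returning None for unusable input."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # A single clause catching a tuple of types replaces two
        # separate except clauses whose bodies were identical.
        return None
```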

@GeorgeA92
Contributor Author

Updated the tests and the extension.
The tests now cover the most complicated part of the extension: filtering of stats fields (include/exclude) by name.
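The include/exclude filtering being tested behaves roughly like the sketch below (function name and signature are hypothetical, not the extension's API): a stat name passes when it contains any include substring, and is then dropped if it contains any exclude substring, exclude taking precedence.

```python
def filter_stats(stats, include=None, exclude=None):
    """Filter a stats dict by substring matching on the stat names.

    A name passes when it contains any `include` substring (or
    `include` is unset) and contains no `exclude` substring;
    `exclude` takes precedence over `include`.
    """
    filtered = {}
    for name, value in stats.items():
        if include and not any(part in name for part in include):
            continue  # no include substring matched
        if exclude and any(part in name for part in exclude):
            continue  # an exclude substring matched
        filtered[name] = value
    return filtered
```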

@wRAR
Member

wRAR commented Jun 28, 2023

Interesting, looks like the telnet test needs some additional cleanup? Or does it suggest that there is some problem in the new code?

@GeorgeA92
Contributor Author

The error in my test code is fixed.
Each call of ext.spider_opened(spider) in the tests starts a looping call; since those calls were not stopped after each test, the previously started looping calls affected the following tests.

@Gallaecio
Member

@GeorgeA92 Could you look into the typing issues?

@wRAR wRAR changed the title periodic_log: implemented as separate extension [WIP] periodic_log: implemented as separate extension Aug 4, 2023
Member

@Gallaecio Gallaecio left a comment


Looks good to me. I recommend applying the doc suggestions and the default_settings.py one, but the rest should not block this change.

Great work!

GeorgeA92 and others added 6 commits August 10, 2023 21:34
Co-authored-by: Adrián Chaves <adrian@chaves.io>
@Gallaecio
Member

I just merged #6014, so you might want to merge from the latest main branch and update those utcnow() calls.

@GeorgeA92
Contributor Author

@Gallaecio, the usages of datetime.utcnow() were removed in 7e54284 to match the contents of #6014.
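The change #6014 asks for is mechanical: `datetime.utcnow()`, deprecated since Python 3.12 because it returns a naive datetime, is replaced with a timezone-aware call:

```python
from datetime import datetime, timezone

# Before: naive datetime, deprecated since Python 3.12
# now = datetime.utcnow()

# After: timezone-aware UTC datetime
now = datetime.now(tz=timezone.utc)
```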

Member

@wRAR wRAR left a comment


Thanks!

@wRAR wRAR merged commit cddb8c1 into scrapy:master Aug 30, 2023
26 checks passed
4 participants