Add failed and success count stats to feedstorage backends #4850

Merged

Conversation

joaquingx (Contributor) commented Oct 17, 2020

Resolves #3947
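
The change counts, per storage backend, how many feed deliveries succeed or fail, roughly by attaching a callback and an errback to the deferred returned by the storage's store() call and incrementing a stat named after the backend class. A minimal runnable sketch of that idea (the class and function names below are illustrative, not the exact code in this diff):

from twisted.internet import defer


class FileFeedStorage:
    # Stand-in with the same name as Scrapy's backend; only the class
    # name matters here, since it becomes the stat key suffix.
    pass


class StatsStub:
    # Minimal stand-in for crawler.stats, just to keep the sketch self-contained.
    def __init__(self):
        self.stats = {}

    def inc_value(self, key):
        self.stats[key] = self.stats.get(key, 0) + 1


def track_store_result(d, storage, stats):
    # Success or failure of storage.store() increments
    # feedexport/success_count/<ClassName> or feedexport/failed_count/<ClassName>.
    name = type(storage).__name__
    d.addCallbacks(
        lambda _: stats.inc_value(f"feedexport/success_count/{name}"),
        lambda _: stats.inc_value(f"feedexport/failed_count/{name}"),
    )
    return d


stats = StatsStub()
track_store_result(defer.succeed(None), FileFeedStorage(), stats)
track_store_result(defer.fail(KeyError("foo")), FileFeedStorage(), stats)
print(stats.stats)
# {'feedexport/success_count/FileFeedStorage': 1, 'feedexport/failed_count/FileFeedStorage': 1}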

Example:

import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesToScrapeSpider(scrapy.Spider):
    name = "quotes"

    custom_settings = {
        "DOWNLOAD_DELAY": 1,
        "COOKIES_ENABLED": False,  # disable cookies; the setting is COOKIES_ENABLED, not COOKIES_DISABLED
        "CONCURRENT_REQUESTS": 5,
        "FEEDS": {
            "file:///tmp/tmp-%(batch_time)s.json": {
                "format": "json",
            },
            "s3://mybucket/path/to/export-%(batch_time)s.csv": {
                "format": "csv",
            },
        },
        "FEED_EXPORT_BATCH_ITEM_COUNT": 5,
    }

    def start_requests(self):
        yield scrapy.Request(url='http://quotes.toscrape.com/', callback=self.parse)

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "quote": quote.css("span.text::text").extract(),
                "author": quote.css("small.author::text").extract(),
                "tags": quote.css("a.tag::text").extract(),
            }
            break  # yield only the first quote per page to keep the example output small
        next_page = response.css("li.next a::attr(href)").extract_first()  # avoid shadowing the next() builtin
        if next_page:
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)


process = CrawlerProcess()
process.crawl(QuotesToScrapeSpider)
process.start()

If the S3 storage fails to store the feed, the stats will look like this:

{'downloader/request_bytes': 2692,
 'downloader/request_count': 10,
 'downloader/request_method_count/GET': 10,
 'downloader/response_bytes': 23026,
 'downloader/response_count': 10,
 'downloader/response_status_count/200': 10,
 'elapsed_time_seconds': 11.61577,
 'feedexport/failed_count/S3FeedStorage': 2,
 'feedexport/success_count/FileFeedStorage': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 10, 17, 20, 41, 6, 16674),
 'item_scraped_count': 10,
 'log_count/DEBUG': 218,
 'log_count/ERROR': 2,
 'log_count/INFO': 16,
 'memusage/max': 70389760,
 'memusage/startup': 70389760,
 'request_depth_max': 9,
 'response_received_count': 10,
 'scheduler/dequeued': 10,
 'scheduler/dequeued/memory': 10,
 'scheduler/enqueued': 10,
 'scheduler/enqueued/memory': 10,
 'start_time': datetime.datetime(2020, 10, 17, 20, 40, 54, 400904)}
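
Since these counters live in the regular crawler stats alongside everything else, a wrapper script can inspect them once the crawl finishes and react to delivery failures. A sketch, reusing the spider above (this failure check is a hypothetical consumer, not part of this PR):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
crawler = process.create_crawler(QuotesToScrapeSpider)
process.crawl(crawler)
process.start()  # blocks until the crawl finishes

failed = {
    key: value
    for key, value in crawler.stats.get_stats().items()
    if key.startswith("feedexport/failed_count/")
}
if failed:
    print(f"Some feed exports failed: {failed}")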

Ready to review 😄

codecov bot commented Oct 17, 2020

Codecov Report

Merging #4850 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #4850   +/-   ##
=======================================
  Coverage   87.86%   87.87%           
=======================================
  Files         160      160           
  Lines        9749     9755    +6     
  Branches     1439     1437    -2     
=======================================
+ Hits         8566     8572    +6     
  Misses        926      926           
  Partials      257      257           
Impacted Files Coverage Δ
scrapy/extensions/feedexport.py 95.32% <100.00%> (+0.10%) ⬆️

joaquingx changed the title from "[WIP] Add failed and success count to slot store errback" to "[WIP] Add failed and success count stats to feedstorage backends" on Oct 19, 2020
joaquingx force-pushed the set-stats-for-feed-exporter-extension branch from 14b207b to 44cc533 on Oct 21, 2020
joaquingx changed the title from "[WIP] Add failed and success count stats to feedstorage backends" to "Add failed and success count stats to feedstorage backends" on Oct 22, 2020
joaquingx (Contributor, Author) commented:

CI failed and I'm not sure why -> https://travis-ci.org/github/scrapy/scrapy/jobs/738154019#L182. Can you help me here, please, @Gallaecio?

eLRuLL (Member) commented Oct 23, 2020:

> CI failed and I'm not sure why -> https://travis-ci.org/github/scrapy/scrapy/jobs/738154019#L182. Can you help me here, please, @Gallaecio?

@joaquingx it looks like a temporary problem. I think @Gallaecio should be able to restart the job, but you can also trigger a new run yourself by pushing another commit.

joaquingx force-pushed the set-stats-for-feed-exporter-extension branch from 71713fd to c5f06de on Oct 23, 2020
Gallaecio (Member) left a review comment:

Nice!

elacuesta (Member) commented:

This is great, thanks. I was wondering if you would consider changing the approach regarding this small bit:

diff --git tests/test_feedexport.py tests/test_feedexport.py
index 1a77cec7..2ce8d7ff 100644
--- tests/test_feedexport.py
+++ tests/test_feedexport.py
@@ -8,6 +8,7 @@ import tempfile
 import warnings
 from abc import ABC, abstractmethod
 from collections import defaultdict
+from contextlib import ExitStack
 from io import BytesIO
 from logging import getLogger
 from pathlib import Path
@@ -782,8 +783,11 @@ class FeedExportTest(FeedExportTestBase):
             },
         }
         crawler = get_crawler(ItemSpider, settings)
-        with MockServer() as mockserver, \
-                mock.patch("scrapy.extensions.feedexport.FileFeedStorage.store", side_effect=KeyError("foo")):
+        with ExitStack() as stack:
+            mockserver = stack.enter_context(MockServer())
+            stack.enter_context(
+                mock.patch("scrapy.extensions.feedexport.FileFeedStorage.store", side_effect=KeyError("foo"))
+            )
             yield crawler.crawl(mockserver=mockserver)
         self.assertIn("feedexport/failed_count/FileFeedStorage", crawler.stats.get_stats())
         self.assertEqual(crawler.stats.get_value("feedexport/failed_count/FileFeedStorage"), 1)

I took it from this SO answer. Please excuse my nitpicking; this is not a blocker in any sense, I just really don't like backslash line breaks 😅
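
For reference, a standalone sketch of the ExitStack pattern used in that diff: each enter_context() call pushes a context manager onto the stack, and all of them are exited in reverse order when the with block ends, with no backslash continuations needed.

from contextlib import ExitStack
from unittest import mock

with ExitStack() as stack:
    # Equivalent to nesting these two context managers in a single
    # with statement, but stays readable for any number of them.
    stack.enter_context(mock.patch("builtins.print"))
    f = stack.enter_context(open(__file__))
    first_line = f.readline()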

joaquingx (Contributor, Author) commented:

@elacuesta Hey, thanks, that does improve the code. Changes are done!

Gallaecio merged commit 85604e1 into scrapy:master on Nov 11, 2020
Merging this pull request closed issue #3947: set/inc stat values when exporter extension exports the data successfully.