
Feed exports: add batch deliveries #4434

Merged
merged 43 commits into from Jul 29, 2020

Conversation

BroodingKangaroo
Contributor

@BroodingKangaroo BroodingKangaroo commented Mar 16, 2020

Fixes #4250

@BroodingKangaroo BroodingKangaroo changed the title ISSUE-4250: add batch deliveries fix #4250: add batch deliveries Mar 16, 2020
@BroodingKangaroo BroodingKangaroo force-pushed the ISSUE-4250-add_batch_deliveries branch 4 times, most recently from 5c339bb to 1081bc5 Compare Mar 18, 2020
@codecov

codecov bot commented Mar 18, 2020

Codecov Report

Merging #4434 into master will increase coverage by 1.12%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4434      +/-   ##
==========================================
+ Coverage   85.17%   86.29%   +1.12%     
==========================================
  Files         162      160       -2     
  Lines        9749     9671      -78     
  Branches     1437     1420      -17     
==========================================
+ Hits         8304     8346      +42     
+ Misses       1186     1063     -123     
- Partials      259      262       +3     
Impacted Files Coverage Δ
scrapy/extensions/feedexport.py 87.35% <100.00%> (+1.22%) ⬆️
scrapy/settings/default_settings.py 98.73% <100.00%> (+<0.01%) ⬆️
scrapy/utils/conf.py 93.13% <100.00%> (+0.06%) ⬆️
scrapy/core/downloader/contextfactory.py 90.00% <0.00%> (-6.67%) ⬇️
scrapy/utils/trackref.py 82.85% <0.00%> (-2.86%) ⬇️
scrapy/utils/misc.py 95.53% <0.00%> (-1.79%) ⬇️
scrapy/core/downloader/__init__.py 89.47% <0.00%> (-1.51%) ⬇️
scrapy/http/cookies.py 91.45% <0.00%> (-0.08%) ⬇️
scrapy/spiders/__init__.py 100.00% <0.00%> (ø)
scrapy/linkextractors/sgml.py
... and 3 more

@BroodingKangaroo BroodingKangaroo changed the title fix #4250: add batch deliveries #4250: add batch deliveries Mar 18, 2020
@BroodingKangaroo BroodingKangaroo force-pushed the ISSUE-4250-add_batch_deliveries branch from c0b194e to 41ee371 Compare Mar 18, 2020
@BroodingKangaroo BroodingKangaroo marked this pull request as ready for review Mar 18, 2020
Contributor

@ejulio ejulio left a comment

Good starting point @BroodingKangaroo

Other stuff to keep in mind:

  • Documentation is missing
  • Tests are missing

scrapy/extensions/feedexport.py (resolved review threads)
        for uri, feed in self.feeds.items():
            if not self._storage_supported(uri):
                raise NotConfigured
            if not self._exporter_supported(feed['format']):
                raise NotConfigured

    def open_spider(self, spider):
        if self.storage_batch:
            self.feeds = {self._get_uri_of_partial(uri, feed, spider): feed
                          for uri, feed in self.feeds.items()}
Contributor

@ejulio ejulio Mar 18, 2020

One approach is that, whenever we reach the expected count, just close the current file/deliver it and create a new one.

@BroodingKangaroo BroodingKangaroo force-pushed the ISSUE-4250-add_batch_deliveries branch from 55aebab to 13fae39 Compare Mar 20, 2020
Member

@elacuesta elacuesta left a comment

Hey @BroodingKangaroo, good job, thanks for the PR. In addition to the tests and docs requested by @ejulio, please make sure that this works well after #3858.

The failing tests seem to be related to #4434; please rebase or merge the latest master to fix them.

scrapy/extensions/feedexport.py (resolved review threads)
scrapy/settings/default_settings.py (resolved review thread)
@BroodingKangaroo BroodingKangaroo force-pushed the ISSUE-4250-add_batch_deliveries branch from a616ae8 to cbf34ce Compare Mar 21, 2020
@elacuesta elacuesta changed the title #4250: add batch deliveries Feed exports: add batch deliveries Mar 21, 2020
@@ -335,7 +400,7 @@ def _get_uri_params(self, spider, uri_params):
         params = {}
         for k in dir(spider):
             params[k] = getattr(spider, k)
-        ts = datetime.utcnow().replace(microsecond=0).isoformat().replace(':', '-')
+        ts = datetime.utcnow().isoformat().replace(':', '-')
Contributor

@ejulio ejulio Mar 26, 2020

Any reason for this change?

Contributor Author

@BroodingKangaroo BroodingKangaroo Mar 26, 2020

replace(microsecond=0) removes the microseconds, but then if several batch files are delivered within the same second, they will get identical names.
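For illustration, the two timestamp formats compared on a fixed example datetime (the value is made up for the demonstration):

```python
from datetime import datetime

now = datetime(2020, 3, 26, 14, 5, 2, 123456)

# Truncating microseconds: two batches delivered within the same
# second would collide on the same file name.
coarse = now.replace(microsecond=0).isoformat().replace(':', '-')
# '2020-03-26T14-05-02'

# Keeping microseconds makes names of rapid successive batches unique.
fine = now.isoformat().replace(':', '-')
# '2020-03-26T14-05-02.123456'
```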

scrapy/extensions/feedexport.py (resolved review thread)
slot.start_exporting()
slot.exporter.export_item(item)
slot.itemcount += 1
if self.storage_batch_size and slot.itemcount % self.storage_batch_size == 0:
Contributor

@ejulio ejulio Mar 26, 2020

I guess you should be calling a routine similar to close_spider here, right?
Otherwise we're just creating the local files, but never sending them to S3 or other non-local storage.
Also, I wrote a piece of this functionality myself and had issues when exporting to S3:
I had to upload file.read() instead of just file, as passing the file object raised errors.

Contributor Author

@BroodingKangaroo BroodingKangaroo Apr 10, 2020

The current implementation of the _start_new_batch() function repeats the close_spider() function in some aspects.
I tried running it against S3, and it seems to work correctly.

Contributor

@ejulio ejulio Apr 23, 2020

My advice here is to extract the common behavior into a function.
As it stands, there is no logging here about the file that was created.
Also, if you look at the behavior of close_spider, you'll notice that file storage is called in a thread, so it does not block the current code.
Otherwise we'll be blocking the main thread while storing the files, possibly causing performance issues.

@ejulio
Contributor

ejulio commented Apr 6, 2020

@BroodingKangaroo , should I review something here or are you working on some of the changes?

@BroodingKangaroo
Contributor Author

BroodingKangaroo commented Apr 6, 2020

@BroodingKangaroo , should I review something here or are you working on some of the changes?

@ejulio I slowed down a little. There has been no big change since your last review, just cosmetic ones.
I am working on covering all of this code with tests and on figuring out the problem with S3.
I expect to have the changes ready in a few days.

@BroodingKangaroo BroodingKangaroo force-pushed the ISSUE-4250-add_batch_deliveries branch 3 times, most recently from a17b721 to 839e8a6 Compare Apr 10, 2020
@ejulio
Contributor

ejulio commented Apr 13, 2020

@BroodingKangaroo sure, no problem.
Once you have finished the changes, just let me know and I'll review them.

@BroodingKangaroo BroodingKangaroo force-pushed the ISSUE-4250-add_batch_deliveries branch 2 times, most recently from 3b89f59 to b229330 Compare Apr 15, 2020
@Gallaecio
Member

Gallaecio commented Jul 1, 2020

Since this doesn't affect the storage backends, should the setting be called FEED_EXPORT_BATCH_ITEM_COUNT instead, for consistency? Even more, I think FEED_EXPORT_BATCH_SIZE (and FEEDS.batch_size) would be shorter and simpler.

I think “size” should be avoided, as it can be misleading (item count? bytes?).

On a different but related topic, I think it makes sense for new FEEDS-related options not to have settings of their own, and only be settable through FEEDS. At least that is the approach I took with #4512.

scrapy/utils/conf.py (resolved review thread)
@elacuesta
Member

elacuesta commented Jul 1, 2020

I think it makes sense for new FEEDS-related options not to have a setting of its own, and only be settable through FEEDS

That makes a lot of sense for keeping the codebase simple; however, having general settings also helps keep project settings simple and avoid duplication, so I don't really have a strong opinion at the moment.

On the other hand, you convinced me about the "size" name. That said, I still think that if we introduce a new general setting, it should be FEED_EXPORT_BATCH_ITEM_COUNT instead of FEED_STORAGE_BATCH_ITEM_COUNT. Thoughts about that?
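Sketched in a project's settings.py, the naming this discussion converges on would look roughly like this (a hypothetical configuration sketch, using the %(batch_id)d and %(batch_time)s placeholders discussed in this PR):

```python
# settings.py (hypothetical sketch)

# Global default: start a new batch file after every N exported items.
FEED_EXPORT_BATCH_ITEM_COUNT = 100

# Per-feed configuration; the URI needs a placeholder such as
# %(batch_id)d or %(batch_time)s so batch file names don't collide.
FEEDS = {
    'exports/%(batch_id)d-items-%(batch_time)s.json': {
        'format': 'json',
    },
}
```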

scrapy/extensions/feedexport.py (resolved review threads)
docs/topics/feed-exports.rst (resolved review thread)
@BroodingKangaroo BroodingKangaroo force-pushed the ISSUE-4250-add_batch_deliveries branch from 9fedcd1 to 6454d45 Compare Jul 3, 2020
scrapy/extensions/feedexport.py (resolved review thread)
Member

@elacuesta elacuesta left a comment

Almost ready to go IMHO, I just left one question about testing packages.

@@ -12,6 +12,7 @@ deps =
     -ctests/constraints.txt
     -rtests/requirements-py3.txt
     # Extras
+    boto3>=1.13.0
Member

@elacuesta elacuesta Jul 9, 2020

Is this really needed to test the changes in this pull request?

Contributor Author

@BroodingKangaroo BroodingKangaroo Jul 12, 2020

It is now needed to run BatchDeliveriesTest::test_s3_export. If we delete this line, that test will be skipped on every run.

Member

@Gallaecio Gallaecio Jul 13, 2020

I think what @elacuesta means is whether or not test_s3_export could be reimplemented without adding a new dependency to the tests, which is a valid point. So, could it be reimplemented without boto3? And, if so, what would be the pros and cons of each test implementation?

Contributor Author

@BroodingKangaroo BroodingKangaroo Jul 16, 2020

Using botocore (we already have this dependency) here would mean essentially re-implementing boto3 anyway, including authentication, security protocols, and exceptions. We would also have to maintain that code if the S3 API changes.


And your :command:`crawl` command line is::

scrapy crawl spidername -o dirname/%(batch_id)d-filename%(batch_time)s.json


@BroodingKangaroo Hi, I'm trying to run scrapy installed from your branch and I'm getting an error with the parentheses, following your docs example highlighted above:

$ scrapy crawl books -o books-%(batch_id)d-ending.json
bash: syntax error near unexpected token `('

It seems to accept the expression when it is surrounded by quotes (as below), so I'd suggest updating the docs with the expression wrapped in quotes:

$ scrapy crawl books -o 'books-%(batch_id)d-ending.json'

Contributor Author

@BroodingKangaroo BroodingKangaroo Jul 15, 2020

You are right, I will update the docs.
I should say that this also happens on the master branch with placeholders like %(time)s and %(name)s, and I didn't find any information about quoting in such cases in the docs.
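The placeholders are ordinary Python printf-style formatting, which is why it is the shell, not Scrapy, that trips over the unquoted parentheses:

```python
# Scrapy expands feed URI templates with %-style formatting:
template = 'books-%(batch_id)d-ending.json'
filename = template % {'batch_id': 1}
# → 'books-1-ending.json'

# In bash, an unquoted '(' is shell syntax, so the template has to be
# quoted on the command line to reach Scrapy intact:
#     scrapy crawl books -o 'books-%(batch_id)d-ending.json'
```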

@rafaelcapucho

rafaelcapucho commented Jul 14, 2020

@BroodingKangaroo Hi, I have FEED_EXPORT_BATCH_ITEM_COUNT = 100 and I scraped 1000 items:

[screenshot]

I was expecting to see 10 files with 100 items per file, but I ended up with 11 files:

[screenshot]

and the last one is empty:

[screenshot]

@rafaelcapucho

rafaelcapucho commented Jul 15, 2020

I'm happy to see that it is also working for exporting to S3:

[screenshot]

@BroodingKangaroo
Contributor Author

BroodingKangaroo commented Jul 15, 2020

I was expecting to see 10 files with 100 items per file, but I ended up with 11 files:

Hi, @rafaelcapucho. It is similar to creating an empty file even if nothing is scraped (the current behavior of the master branch), although in this case it arguably matches user expectations less well.
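The eleventh file follows from the rollover logic: a fresh batch file is opened after every full batch, including right after the last item (a simplified count, not the PR's actual code):

```python
def count_batch_files(total_items, batch_item_count):
    # One file is open from the start; each time the item count hits a
    # multiple of the batch size, the file is delivered and a new one opened.
    files = 1
    for item_number in range(1, total_items + 1):
        if item_number % batch_item_count == 0:
            files += 1  # the final rollover leaves an empty trailing file
    return files
```

With 1000 items and a batch size of 100 this yields 11 files, the last one empty.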

docs/topics/feed-exports.rst (resolved review thread)
tests/test_feedexport.py (review thread)
@BroodingKangaroo BroodingKangaroo force-pushed the ISSUE-4250-add_batch_deliveries branch from 409524f to 86f7ac2 Compare Jul 17, 2020
@BroodingKangaroo BroodingKangaroo force-pushed the ISSUE-4250-add_batch_deliveries branch from 9534484 to 3e04927 Compare Jul 18, 2020
docs/topics/feed-exports.rst (resolved review threads)
@kmike
Member

kmike commented Jul 29, 2020

Thanks @BroodingKangaroo for all the work you put into this feature, and also for swiftly addressing all the review comments - that was impressive! And of course, thanks to everyone who reviewed and tested it: @ejulio, @elacuesta, @Gallaecio, @victor-torres, @rafaelcapucho.

I'm merging it; if there are any further minor issues, we can address them separately.

@kmike kmike merged commit 5e2d1bd into scrapy:master Jul 29, 2020
2 checks passed
Development

Successfully merging this pull request may close these issues.

Batch deliveries for long running crawlers
9 participants