
Feed exports: add batch deliveries #4434

Merged
merged 43 commits into from Jul 29, 2020

Conversation

BroodingKangaroo
Contributor

@BroodingKangaroo BroodingKangaroo commented Mar 16, 2020

Fixes #4250

@BroodingKangaroo BroodingKangaroo changed the title ISSUE-4250: add batch deliveries fix #4250: add batch deliveries Mar 16, 2020
@BroodingKangaroo BroodingKangaroo force-pushed the ISSUE-4250-add_batch_deliveries branch 4 times, most recently from 5c339bb to 1081bc5 Compare Mar 18, 2020
@codecov

codecov bot commented Mar 18, 2020

Codecov Report

Merging #4434 into master will increase coverage by 1.12%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4434      +/-   ##
==========================================
+ Coverage   85.17%   86.29%   +1.12%     
==========================================
  Files         162      160       -2     
  Lines        9749     9671      -78     
  Branches     1437     1420      -17     
==========================================
+ Hits         8304     8346      +42     
+ Misses       1186     1063     -123     
- Partials      259      262       +3     
Impacted Files Coverage Δ
scrapy/extensions/feedexport.py 87.35% <100.00%> (+1.22%) ⬆️
scrapy/settings/default_settings.py 98.73% <100.00%> (+<0.01%) ⬆️
scrapy/utils/conf.py 93.13% <100.00%> (+0.06%) ⬆️
scrapy/core/downloader/contextfactory.py 90.00% <0.00%> (-6.67%) ⬇️
scrapy/utils/trackref.py 82.85% <0.00%> (-2.86%) ⬇️
scrapy/utils/misc.py 95.53% <0.00%> (-1.79%) ⬇️
scrapy/core/downloader/__init__.py 89.47% <0.00%> (-1.51%) ⬇️
scrapy/http/cookies.py 91.45% <0.00%> (-0.08%) ⬇️
scrapy/spiders/__init__.py 100.00% <0.00%> (ø)
scrapy/linkextractors/sgml.py
... and 3 more

@BroodingKangaroo BroodingKangaroo changed the title fix #4250: add batch deliveries #4250: add batch deliveries Mar 18, 2020
@BroodingKangaroo BroodingKangaroo force-pushed the ISSUE-4250-add_batch_deliveries branch from c0b194e to 41ee371 Compare Mar 18, 2020
@BroodingKangaroo BroodingKangaroo marked this pull request as ready for review Mar 18, 2020
Contributor

@ejulio ejulio left a comment

Good starting point @BroodingKangaroo

Other stuff to keep in mind:

  • Documentation is missing
  • Tests are missing

scrapy/extensions/feedexport.py (resolved review threads)
        for uri, feed in self.feeds.items():
            if not self._storage_supported(uri):
                raise NotConfigured
            if not self._exporter_supported(feed['format']):
                raise NotConfigured

    def open_spider(self, spider):
        if self.storage_batch:
            self.feeds = {self._get_uri_of_partial(uri, feed, spider): feed
                          for uri, feed in self.feeds.items()}
Contributor

@ejulio ejulio Mar 18, 2020

One approach is that, whenever we reach the expected count, just close the current file/deliver it and create a new one.

@BroodingKangaroo BroodingKangaroo force-pushed the ISSUE-4250-add_batch_deliveries branch from 55aebab to 13fae39 Compare Mar 20, 2020
Member

@elacuesta elacuesta left a comment

Hey @BroodingKangaroo, good job, thanks for the PR. In addition to the tests and docs requested by @ejulio, please make sure that this works well after #3858.

The failing tests seem to be related to #4434; please rebase or merge the latest master to fix them.

scrapy/extensions/feedexport.py (resolved review threads)
scrapy/settings/default_settings.py (resolved review thread)
@BroodingKangaroo BroodingKangaroo force-pushed the ISSUE-4250-add_batch_deliveries branch from a616ae8 to cbf34ce Compare Mar 21, 2020
@elacuesta elacuesta changed the title #4250: add batch deliveries Feed exports: add batch deliveries Mar 21, 2020
@@ -335,7 +400,7 @@ def _get_uri_params(self, spider, uri_params):
         params = {}
         for k in dir(spider):
             params[k] = getattr(spider, k)
-        ts = datetime.utcnow().replace(microsecond=0).isoformat().replace(':', '-')
+        ts = datetime.utcnow().isoformat().replace(':', '-')
Contributor

@ejulio ejulio Mar 26, 2020

Any reason for this change?

Contributor Author

@BroodingKangaroo BroodingKangaroo Mar 26, 2020

replace(microsecond=0) removes the microseconds, but then if several batch files are delivered within the same second, they will get identical names.
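For illustration, the two timestamp formats compared on a fixed example datetime (the value is made up for the demonstration):

```python
from datetime import datetime

now = datetime(2020, 3, 26, 14, 5, 2, 123456)

# Truncating microseconds: two batches delivered within the same
# second would collide on the same file name.
coarse = now.replace(microsecond=0).isoformat().replace(':', '-')
# '2020-03-26T14-05-02'

# Keeping microseconds makes names of rapid successive batches unique.
fine = now.isoformat().replace(':', '-')
# '2020-03-26T14-05-02.123456'
```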

scrapy/extensions/feedexport.py (resolved review thread)
slot.start_exporting()
slot.exporter.export_item(item)
slot.itemcount += 1
if self.storage_batch_size and slot.itemcount % self.storage_batch_size == 0:
Contributor

@ejulio ejulio Mar 26, 2020

I guess you should be calling a routine similar to close_spider here, right?
Otherwise we're just creating the local files, but never sending them to S3 or other non-local storage.
Also, I wrote a piece of this functionality myself and had issues when exporting to S3:
I had to upload file.read() instead of just file, as passing the file object raised errors.

Contributor Author

@BroodingKangaroo BroodingKangaroo Apr 10, 2020

The current implementation of the _start_new_batch() function repeats the close_spider() function in some aspects.
I tried running it against S3, and it seems to work correctly.

Contributor

@ejulio ejulio Apr 23, 2020

My advice here is to extract the common behavior into a function.
As it stands, there is no logging here about the file that was created.
Also, if you look at the behavior of close_spider, you'll notice that file storage is called in a thread, so it does not block the current code.
Otherwise we'll be blocking the main thread while storing the files, possibly causing performance issues.

@ejulio
Contributor

ejulio commented Apr 6, 2020

@BroodingKangaroo , should I review something here or are you working on some of the changes?

@BroodingKangaroo
Contributor Author

BroodingKangaroo commented Apr 6, 2020

@BroodingKangaroo , should I review something here or are you working on some of the changes?

@ejulio I slowed down a little. There has been no big change since your last review, just cosmetic ones.
I am working on covering all of this code with tests and on figuring out the problem with S3.
I expect to have the changes ready in a few days.

@BroodingKangaroo BroodingKangaroo force-pushed the ISSUE-4250-add_batch_deliveries branch 3 times, most recently from a17b721 to 839e8a6 Compare Apr 10, 2020
@ejulio
Contributor

ejulio commented Apr 13, 2020

@BroodingKangaroo sure, no problem.
Once you have finished the changes, just let me know and I'll review them.

@BroodingKangaroo BroodingKangaroo force-pushed the ISSUE-4250-add_batch_deliveries branch 2 times, most recently from 3b89f59 to b229330 Compare Apr 15, 2020
@Gallaecio
Member

Gallaecio commented Jul 1, 2020

Since this doesn't affect the storage backends, should the setting be called FEED_EXPORT_BATCH_ITEM_COUNT instead, for consistency? Even more, I think FEED_EXPORT_BATCH_SIZE (and FEEDS.batch_size) would be shorter and simpler.

I think “size” should be avoided, as it can be misleading (item count? bytes?).

On a different but related topic, I think it makes sense for new FEEDS-related options not to have settings of their own, and only be settable through FEEDS. At least that is the approach I took with #4512.

scrapy/utils/conf.py (resolved review thread)
@elacuesta
Member

elacuesta commented Jul 1, 2020

I think it makes sense for new FEEDS-related options not to have a setting of its own, and only be settable through FEEDS

That makes a lot of sense for keeping the codebase simple; however, having general settings also helps keep project settings simple and avoid duplication, so I don't really have a strong opinion at the moment.

On the other hand, you convinced me about the "size" name. That said, I still think that if we introduce a new general setting, it should be FEED_EXPORT_BATCH_ITEM_COUNT instead of FEED_STORAGE_BATCH_ITEM_COUNT. Thoughts about that?
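Sketched in a project's settings.py, the naming this discussion converges on would look roughly like this (a hypothetical configuration sketch, using the %(batch_id)d and %(batch_time)s placeholders discussed in this PR):

```python
# settings.py (hypothetical sketch)

# Global default: start a new batch file after every N exported items.
FEED_EXPORT_BATCH_ITEM_COUNT = 100

# Per-feed configuration; the URI needs a placeholder such as
# %(batch_id)d or %(batch_time)s so batch file names don't collide.
FEEDS = {
    'exports/%(batch_id)d-items-%(batch_time)s.json': {
        'format': 'json',
    },
}
```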

scrapy/extensions/feedexport.py (resolved review threads)
docs/topics/feed-exports.rst (resolved review thread)
@BroodingKangaroo BroodingKangaroo force-pushed the ISSUE-4250-add_batch_deliveries branch from 9fedcd1 to 6454d45 Compare Jul 3, 2020
scrapy/extensions/feedexport.py (resolved review thread)
Member

@elacuesta elacuesta left a comment

Almost ready to go IMHO, I just left one question about testing packages.

@@ -12,6 +12,7 @@ deps =
     -ctests/constraints.txt
     -rtests/requirements-py3.txt
     # Extras
+    boto3>=1.13.0
Member

@elacuesta elacuesta Jul 9, 2020

Is this really needed to test the changes in this pull request?

Contributor Author

@BroodingKangaroo BroodingKangaroo Jul 12, 2020

It is now needed to run BatchDeliveriesTest::test_s3_export. If we delete this line, that test will be skipped on every run.

Member

@Gallaecio Gallaecio Jul 13, 2020

I think what @elacuesta means is whether or not test_s3_export could be reimplemented without adding a new dependency to the tests, which is a valid point. So, could it be reimplemented without boto3? And, if so, what would be the pros and cons of each test implementation?

Contributor Author

@BroodingKangaroo BroodingKangaroo Jul 16, 2020

Using botocore (we already have this dependency) here would mean essentially re-implementing boto3 anyway, including authentication, security protocols, and exceptions. We would also have to maintain that code if the S3 API changes.


And your :command:`crawl` command line is::

scrapy crawl spidername -o dirname/%(batch_id)d-filename%(batch_time)s.json


@BroodingKangaroo Hi, I'm trying to run scrapy installed from your branch and I'm getting an error with the parentheses, following your docs example highlighted above:

$ scrapy crawl books -o books-%(batch_id)d-ending.json
bash: syntax error near unexpected token `('

It seems to accept the expression when it is surrounded by quotes (as below), so I'd suggest updating the docs with the expression wrapped in quotes:

$ scrapy crawl books -o 'books-%(batch_id)d-ending.json'

Contributor Author

@BroodingKangaroo BroodingKangaroo Jul 15, 2020

You are right, I will update the docs.
I should say that this also happens on the master branch with placeholders like %(time)s and %(name)s, and I didn't find any information about quoting in such cases in the docs.
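The placeholders are ordinary Python printf-style formatting, which is why it is the shell, not Scrapy, that trips over the unquoted parentheses:

```python
# Scrapy expands feed URI templates with %-style formatting:
template = 'books-%(batch_id)d-ending.json'
filename = template % {'batch_id': 1}
# → 'books-1-ending.json'

# In bash, an unquoted '(' is shell syntax, so the template has to be
# quoted on the command line to reach Scrapy intact:
#     scrapy crawl books -o 'books-%(batch_id)d-ending.json'
```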

@rafaelcapucho

rafaelcapucho commented Jul 14, 2020

@BroodingKangaroo Hi, I have FEED_EXPORT_BATCH_ITEM_COUNT = 100 and I scraped 1000 items:

[screenshot]

I was expecting to see 10 files with 100 items per file, but I ended up with 11 files:

[screenshot]

and the last one is empty:

[screenshot]

@rafaelcapucho

rafaelcapucho commented Jul 15, 2020

I'm happy to see that it is also working for exporting to S3:

[screenshot]

@BroodingKangaroo
Contributor Author

BroodingKangaroo commented Jul 15, 2020

I was expecting to see 10 files with 100 items per file, but I ended up with 11 files:

Hi, @rafaelcapucho. It is similar to creating an empty file even if nothing is scraped (the current behavior of the master branch), although in this case it arguably matches user expectations less well.
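The eleventh file follows from the rollover logic: a fresh batch file is opened after every full batch, including right after the last item (a simplified count, not the PR's actual code):

```python
def count_batch_files(total_items, batch_item_count):
    # One file is open from the start; each time the item count hits a
    # multiple of the batch size, the file is delivered and a new one opened.
    files = 1
    for item_number in range(1, total_items + 1):
        if item_number % batch_item_count == 0:
            files += 1  # the final rollover leaves an empty trailing file
    return files
```

With 1000 items and a batch size of 100 this yields 11 files, the last one empty.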

docs/topics/feed-exports.rst (resolved review thread)
tests/test_feedexport.py (review thread)
@BroodingKangaroo BroodingKangaroo force-pushed the ISSUE-4250-add_batch_deliveries branch from 409524f to 86f7ac2 Compare Jul 17, 2020
@BroodingKangaroo BroodingKangaroo force-pushed the ISSUE-4250-add_batch_deliveries branch from 9534484 to 3e04927 Compare Jul 18, 2020
docs/topics/feed-exports.rst (resolved review threads)
@kmike
Member

kmike commented Jul 29, 2020

Thanks @BroodingKangaroo for all the work you put into this feature, and also for swiftly addressing all the review comments - that was impressive! And of course, thanks to everyone who reviewed and tested it: @ejulio, @elacuesta, @Gallaecio, @victor-torres, @rafaelcapucho.

I'm merging it; if there are any further minor issues, we can address them separately.

@kmike kmike merged commit 5e2d1bd into scrapy:master Jul 29, 2020
2 checks passed
Development

Successfully merging this pull request may close these issues.

Batch deliveries for long running crawlers
9 participants