Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Support for exporting to multiple feeds in a single crawl #3858

Merged
merged 4 commits into from Mar 12, 2020

Conversation

elacuesta
Copy link
Member

@elacuesta elacuesta commented Jul 7, 2019

Fixes #1336

Introduced a new setting (FEEDS) and deprecated FEED_URI and FEED_FORMAT. FEEDS is a dictionary of feed definitions, which enables the user to override certain parameters independently for each feed. An example of this can be seen at #3858 (comment)
In the CLI, format is deduced according to #1336 (comment), falling back to the feed extension.

A simple snippet to illustrate:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.org']

    def parse(self, response):
        return {
            'url': response.url,
            'link': response.xpath('//a/@href').get(),
        }
$ scrapy runspider example.py -o items.dat:jsonlines -o jitems.json -o -:csv
(...)
2019-07-07 02:09:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://example.org>
{'url': 'https://example.org', 'link': 'http://www.iana.org/domains/example'}
2019-07-07 02:09:45 [scrapy.core.engine] INFO: Closing spider (finished)
2019-07-07 02:09:45 [scrapy.extensions.feedexport] INFO: Stored jsonlines feed (1 items) in: items.dat
2019-07-07 02:09:45 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: jitems.json
2019-07-07 02:09:45 [scrapy.extensions.feedexport] INFO: Stored csv feed (1 items) in: stdout:
2019-07-07 02:09:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
(...)
2019-07-07 02:09:45 [scrapy.core.engine] INFO: Spider closed (finished)
url,link
https://example.org,http://www.iana.org/domains/example

Tasks:

  • Feed exporter changes
  • Crawl and Runspider commands
  • Tests
  • Docs

@elacuesta elacuesta changed the title [WIP] Multiple feed export formats Support for exporting to multiple feeds in a single crawl Jul 7, 2019
@codecov
Copy link

@codecov codecov bot commented Jul 7, 2019

Codecov Report

Merging #3858 into master will increase coverage by <.01%.
The diff coverage is 61.97%.

@@            Coverage Diff             @@
##           master    #3858      +/-   ##
==========================================
+ Coverage   85.42%   85.43%   +<.01%     
==========================================
  Files         169      170       +1     
  Lines        9701     9707       +6     
  Branches     1445     1447       +2     
==========================================
+ Hits         8287     8293       +6     
  Misses       1166     1166              
  Partials      248      248
Impacted Files Coverage Δ
scrapy/utils/commands.py 22.22% <22.22%> (ø)
scrapy/commands/crawl.py 33.33% <25%> (+5.42%) ⬆️
scrapy/commands/runspider.py 83.6% <50%> (+9.32%) ⬆️
scrapy/extensions/feedexport.py 84.01% <82.22%> (-0.89%) ⬇️

@codecov
Copy link

@codecov codecov bot commented Jul 7, 2019

Codecov Report

Merging #3858 into master will increase coverage by 0.13%.
The diff coverage is 87.09%.

@@            Coverage Diff            @@
##           master   #3858      +/-   ##
=========================================
+ Coverage   84.56%   84.7%   +0.13%     
=========================================
  Files         166     166              
  Lines        9869    9899      +30     
  Branches     1467    1472       +5     
=========================================
+ Hits         8346    8385      +39     
+ Misses       1268    1256      -12     
- Partials      255     258       +3
Impacted Files Coverage Δ
scrapy/utils/conf.py 93.06% <100%> (+3.83%) ⬆️
scrapy/settings/default_settings.py 98.7% <100%> (-0.01%) ⬇️
scrapy/commands/crawl.py 28.57% <25%> (+1.9%) ⬆️
scrapy/commands/runspider.py 83.6% <50%> (+8.24%) ⬆️
scrapy/extensions/feedexport.py 84.47% <85.89%> (-0.7%) ⬇️
scrapy/utils/trackref.py 82.85% <0%> (-2.86%) ⬇️
scrapy/core/downloader/__init__.py 89.39% <0%> (-1.52%) ⬇️
scrapy/spidermiddlewares/offsite.py 100% <0%> (ø) ⬆️

scrapy/commands/crawl.py Show resolved Hide resolved
scrapy/extensions/feedexport.py Outdated Show resolved Hide resolved
scrapy/extensions/feedexport.py Outdated Show resolved Hide resolved
scrapy/commands/crawl.py Show resolved Hide resolved
scrapy/utils/conf.py Outdated Show resolved Hide resolved
@elacuesta elacuesta force-pushed the multiple_feed_export_formats branch from e2a6432 to 0ab1c1d Compare Jul 12, 2019
scrapy/extensions/feedexport.py Outdated Show resolved Hide resolved
@elacuesta elacuesta marked this pull request as ready for review Jul 15, 2019
docs/topics/feed-exports.rst Outdated Show resolved Hide resolved
@Gallaecio
Copy link
Member

@Gallaecio Gallaecio commented Jul 15, 2019

I’m not sure about documenting deprecated behaviors, other than in the release notes. I would prefer that the general documentation describes only the non-deprecated behavior, as if it had always been like this, and let users relying on deprecated behaviors get warnings that explain the deprecations to them and get them to read the release notes.

@elacuesta
Copy link
Member Author

@elacuesta elacuesta commented Jul 15, 2019

hm, interesting point. Should we remove FEED_FORMAT completely from the docs then?

@Gallaecio
Copy link
Member

@Gallaecio Gallaecio commented Jul 15, 2019

I would. It does mean that we cannot :setting:FEED_FORMAT in the release notes, but that is already the case with other removed settings.

@elacuesta
Copy link
Member Author

@elacuesta elacuesta commented Jul 15, 2019

Makes sense, updated.

@Gallaecio
Copy link
Member

@Gallaecio Gallaecio commented Jul 16, 2019

I was going over the feed export documentation, and I’m starting to think that there are other settings that may also make sense to be able to override in a per-file basis: FEED_EXPORT_ENCODING, FEED_EXPORT_FIELDS, FEED_EXPORT_INDENT, FEED_STORE_EMPTY.

I’m OK with only allowing per-file FEED_FORMAT as part of this pull request, in fact I think it may even be better, easier to review.

However, I do wonder if, in preparation for a future where we introduce additional per-file settings, we shouldn’t use something like {'<path>': {'format': '<format>'}} instead of simply {'<path>': '<format>'}.

As for the command line, I’m not sure what the best approach would be to eventually support additional per-file options. Although maybe that is less likely to happen, since the CLI is intended for simpler usage. Yet again, from that point of view, we could let the CLI interface as is, and not allow specifying multiple output files from the CLI, only through spider settings.

@elacuesta
Copy link
Member Author

@elacuesta elacuesta commented Jul 16, 2019

I thought a bit about that, but figured that it would be ok to share all the other options among the different feeds. But since you pointed it out, I think it's something that would be useful indeed. I'd need to do some additional checks, let me think a bit on the best way to do it.
Regarding the CLI, I think it's better to offer multiple feed support (CLI support was in fact the main request in the original issue) but with minimal configuration, i.e. only file format.

@elacuesta
Copy link
Member Author

@elacuesta elacuesta commented Jul 16, 2019

Actually, if we're adding that kind of functionality I think the name FEED_URI is just not good anymore for the task. I think we could use a new setting with the format you suggested and use the "general" values as fallback if they're not specified for a specific feed, something like

FEEDS = {
    'items.json': {
        'FORMAT': 'json',
        'STORE_EMPTY': True,
        'ENCODING': 'utf8',
        'FIELDS': None,
        'INDENT': 4,
    },
}

Pretty close to how Django defines databases, for instance.

@elacuesta elacuesta force-pushed the multiple_feed_export_formats branch from f11eebb to ffc1e90 Compare Jul 18, 2019
docs/topics/feed-exports.rst Outdated Show resolved Hide resolved
docs/topics/feed-exports.rst Outdated Show resolved Hide resolved
@elacuesta elacuesta force-pushed the multiple_feed_export_formats branch from 5dcf6e9 to f45da34 Compare Jul 20, 2019
@elacuesta
Copy link
Member Author

@elacuesta elacuesta commented Jul 20, 2019

Hi there @Gallaecio and @kmike, I believe this PR is in good shape to face a final and comprehensive review. I've already changed a couple things based on Adrián's comments, but I look forward to read any further feedback you may have. This is a feature I would very much like to see in Scrapy 🚀

PS:
I thought about adding documentation for the FEED_URI_PARAMS setting, which I think has been always undocumented, but since it relies in load_object I think it would be best to wait for #3873. Nobody has complained for the lack of docs there, I think it can wait a bit more 😉

docs/topics/feed-exports.rst Outdated Show resolved Hide resolved
docs/topics/feed-exports.rst Outdated Show resolved Hide resolved
scrapy/commands/crawl.py Outdated Show resolved Hide resolved
docs/topics/feed-exports.rst Outdated Show resolved Hide resolved
docs/topics/feed-exports.rst Outdated Show resolved Hide resolved
scrapy/extensions/feedexport.py Outdated Show resolved Hide resolved
scrapy/extensions/feedexport.py Outdated Show resolved Hide resolved
@elacuesta
Copy link
Member Author

@elacuesta elacuesta commented Jul 29, 2019

Hey @Gallaecio, many thanks for your feedback! I think I addressed all of your points, let me know if there is anything else.

Copy link
Member

@Gallaecio Gallaecio left a comment

Looks good to me except for a minor documentation formatting issue.

docs/topics/feed-exports.rst Outdated Show resolved Hide resolved
@Gallaecio Gallaecio changed the title Support for exporting to multiple feeds in a single crawl [MRG+1] Support for exporting to multiple feeds in a single crawl Jul 31, 2019
Copy link
Member

@Gallaecio Gallaecio left a comment

@elacuesta Please, make sure that these changes work when the setting is defined from the command line (as a string). This is necessary not only for the command line use, but also for other scenarios where settings are defined as strings, such as Scrapy Cloud.

Edit: I think it should come down to using a Settings method. You may need to implement your own one, though; e.g. a mixture of getdict and getlist which can load a JSON array.

@elacuesta elacuesta force-pushed the multiple_feed_export_formats branch from 6ff4959 to 1bf140d Compare Aug 15, 2019
@elacuesta
Copy link
Member Author

@elacuesta elacuesta commented Aug 15, 2019

Sure, I'll add some tests regarding that, stay tuned! 📻

@Gallaecio
Copy link
Member

@Gallaecio Gallaecio commented Aug 19, 2019

It’s not just about the command line, setting values as strings can come from environment variables, etc.

What you need to do is use one of the method of the Settings class when you read the setting values. Good news is that I was wrong, getdict should do the job, for some reason I though it the root structure was a list.

It is probably enough to change https://github.com/scrapy/scrapy/pull/3858/files#diff-b40c5755a5dda2b02aaac6bf4c8e2ff8R236 to call self.settings.getdict('FEEDS') instead of self.settings['FEEDS'].

@elacuesta
Copy link
Member Author

@elacuesta elacuesta commented Aug 19, 2019

Alright, makes sense, but out of genuine curiosity, is populating settings through env variables currently working? I couldn't find references in the docs nor in the code itself, although I did find some docstrings about it (https://github.com/scrapy/scrapy/blob/1.7/scrapy/settings/__init__.py#L122, https://github.com/scrapy/scrapy/blob/1.7/scrapy/settings/__init__.py#L172).
Am I missing something?

@elacuesta elacuesta force-pushed the multiple_feed_export_formats branch 3 times, most recently from 0007974 to 231ab21 Compare Nov 20, 2019
docs/topics/feed-exports.rst Outdated Show resolved Hide resolved
@elacuesta elacuesta added this to the v2.0 milestone Nov 21, 2019
@elacuesta elacuesta closed this Nov 23, 2019
@elacuesta elacuesta reopened this Nov 23, 2019
@elacuesta elacuesta force-pushed the multiple_feed_export_formats branch from b9bff2a to d8f1856 Compare Nov 25, 2019
@elacuesta elacuesta force-pushed the multiple_feed_export_formats branch from d8f1856 to 0563581 Compare Dec 5, 2019
@elacuesta elacuesta force-pushed the multiple_feed_export_formats branch from 0563581 to 06c34dd Compare Jan 2, 2020
@vs4vijay
Copy link

@vs4vijay vs4vijay commented Jan 24, 2020

Hey,

Is this feature merged?

@elacuesta elacuesta force-pushed the multiple_feed_export_formats branch from 06c34dd to a625364 Compare Jan 27, 2020
@elacuesta
Copy link
Member Author

@elacuesta elacuesta commented Jan 30, 2020

Ping @kmike 😇
Do you think is there anything else I need to address here?

@Gallaecio Gallaecio removed this from the v2.0 milestone Jan 30, 2020
@elacuesta elacuesta force-pushed the multiple_feed_export_formats branch from a625364 to ada37c5 Compare Mar 5, 2020
docs/topics/feed-exports.rst Outdated Show resolved Hide resolved
scrapy/utils/conf.py Outdated Show resolved Hide resolved
wRAR
wRAR approved these changes Mar 12, 2020
@wRAR
Copy link
Contributor

@wRAR wRAR commented Mar 12, 2020

Thanks @elacuesta !

@wRAR wRAR merged commit 388f23c into scrapy:master Mar 12, 2020
2 checks passed
@elacuesta elacuesta deleted the multiple_feed_export_formats branch Mar 12, 2020
@elacuesta elacuesta added this to the v2.1 milestone Mar 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants