Support multiple exporters when crawling #1336

Closed
miguelcb84 opened this issue Jul 3, 2015 · 11 comments · Fixed by #3858

Comments

@miguelcb84 miguelcb84 commented Jul 3, 2015

I'm scraping a website to export the data into a semantic format (n3). However, I also want to perform some data analysis on that data, so having it in CSV format is more convenient. To get the data in both formats, I can run:

scrapy crawl spider -t n3 -o data.n3
scrapy crawl spider -t csv -o data.csv

However, this scrapes the site twice, which I cannot afford with large amounts of data.

A solution that avoids scraping the data twice consists of implementing a pipeline that exports the data (see alecxe's suggestion for details). However, as the documentation explains, this is not the preferred way to export data.

Thus, I think it would be worthwhile for Scrapy to support multiple exporters in a single crawl:

scrapy crawl <spider> -t n3 -o data.n3 -t csv -o data.csv
Member

@kmike kmike commented Jul 3, 2015

I think this feature would be very useful.

By the way, -t shouldn't be necessary; builtin exporters choose export format based on file extension. Supporting this would be great:

scrapy crawl <spider> -o data.n3 -o data.csv

@dszmaj dszmaj commented Jul 3, 2015

You can achieve this by implementing multiple pipelines that use exporters. Once an item hits the first pipeline, it gets recorded in the first format; Scrapy then passes the item on to the second pipeline, where you can export it in a different format.

See scrapy.exporters in the code and docs; it should be pretty easy.
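
A minimal sketch of the pipeline-based workaround, assuming a single pipeline that writes each item to two files rather than two chained pipelines. To stay runnable on its own, it uses the standard-library csv and json modules instead of the classes in scrapy.exporters; the class name, file paths, and field names are hypothetical. In a real project, the exporter classes would probably take the place of the hand-written writers.

```python
import csv
import json


class MultiFormatExportPipeline:
    """Hypothetical pipeline: writes every item to a CSV file and a JSON Lines file."""

    def __init__(self, csv_path="data.csv", jsonl_path="data.jl",
                 fields=("name", "price")):
        # Paths and field names are illustrative assumptions.
        self.csv_path = csv_path
        self.jsonl_path = jsonl_path
        self.fields = list(fields)

    def open_spider(self, spider):
        # Open both outputs once, when the crawl starts.
        self.csv_file = open(self.csv_path, "w", newline="", encoding="utf-8")
        self.jsonl_file = open(self.jsonl_path, "w", encoding="utf-8")
        self.csv_writer = csv.DictWriter(
            self.csv_file, fieldnames=self.fields, extrasaction="ignore"
        )
        self.csv_writer.writeheader()

    def process_item(self, item, spider):
        # Each item is scraped once and written to both formats.
        row = dict(item)
        self.csv_writer.writerow(row)
        self.jsonl_file.write(json.dumps(row) + "\n")
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        self.csv_file.close()
        self.jsonl_file.close()
```

The method names follow Scrapy's item pipeline interface (open_spider, process_item, close_spider), so a class like this could be listed in ITEM_PIPELINES; the spider argument is unused here.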

@lufte lufte commented Jul 24, 2015

Would this require the removal of the -t option? Otherwise there is no way of matching all the -t arguments to all the -o arguments passed.

Member

@curita curita commented Jul 28, 2015

@lufte: Not necessarily. We can throw an error if -t is present and the number of -t and -o args doesn't match; otherwise we can deduce the mapping, since the order is preserved. I think we could support both versions, with and without the -t option.
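
The check described here could look roughly like the following sketch. It uses argparse from the standard library, not Scrapy's actual command-line code; the function name pair_outputs and the program name are hypothetical.

```python
import argparse


def pair_outputs(argv):
    """Pair each -o path with a -t format by position (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="crawl-sketch")
    # action="append" collects repeated options in the order they were given.
    parser.add_argument("-o", dest="outputs", action="append")
    parser.add_argument("-t", dest="formats", action="append")
    args = parser.parse_args(argv)
    outputs = args.outputs or []
    formats = args.formats or []

    if not formats:
        # No -t at all: leave the format undecided (None) for every output.
        return [(path, None) for path in outputs]
    if len(formats) != len(outputs):
        # parser.error() prints a usage message and exits with status 2.
        parser.error("the number of -t options must match the number of -o options")
    # The n-th -t is matched with the n-th -o.
    return list(zip(outputs, formats))
```

With this rule, `-t n3 -o data.n3 -t csv -o data.csv` pairs as expected, while a call with three -o options and two -t options is rejected.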

@lufte lufte commented Jul 28, 2015

@curita Just asking because, traditionally (at least in most GNU programs), command-line options don't need to be in a specific order to work. It's mostly a matter of style :)

Member

@kmike kmike commented Jul 28, 2015

@lufte: If there is no -t option, the export file type is deduced from the file extension. So for .csv you get CSV, for .xml you get XML, etc.
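
As a rough illustration of that rule (not Scrapy's own implementation), the format name can be read off the extension; the helper name format_from_extension is hypothetical.

```python
import os


def format_from_extension(path):
    """Return the lowercased file extension without the dot, or None if absent."""
    ext = os.path.splitext(path)[1]  # e.g. ".csv", or "" when there is no extension
    return ext[1:].lower() or None
```

A path with no extension yields None, which is the case where some other way of naming the format is needed.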

@lufte lufte commented Jul 28, 2015

@kmike: Yes, I understand that, but I could still use them and pass them in a weird order, like scrapy crawl -t csv spidername -o output1 -o output2 -o output3.xml -t json. Scrapy would have to check that there aren't more -t args than -o args, and whether the order makes sense (I think order shouldn't matter because they are not positional arguments, but otherwise how do I match them?). Removing the -t option makes it look a lot cleaner and simpler to check: scrapy crawl spidername -o output1.csv -o output2.json -o output3.xml. However, it doesn't allow me to use other extensions, or no extension at all.

Member

@kmike kmike commented Jul 28, 2015

Yes, I think for this case we may need to support another way to set the output format, e.g. output1.dat:csv
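
One way such a path:format spec could be parsed is sketched below. The function name parse_output_spec and the fallback behaviour are assumptions for illustration, not an agreed design. The suffix after the last colon is treated as a format only when it contains no path separator, so that Windows drive letters and URI schemes are not misread.

```python
import os


def parse_output_spec(spec):
    """Split 'path:format' into (path, format); fall back to the file extension."""
    path, sep, fmt = spec.rpartition(":")
    # Accept the suffix as a format only if it does not look like part of a path.
    if sep and path and fmt and "/" not in fmt and "\\" not in fmt:
        return path, fmt.lower()
    # No explicit format: use the extension, or None when there is none.
    ext = os.path.splitext(spec)[1]
    return spec, (ext[1:].lower() or None)
```

For example, parse_output_spec("output1.dat:csv") gives ("output1.dat", "csv"), while a plain "data.json" falls back to the extension.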

@lufte lufte commented Jul 28, 2015

That could work :)

Member

@curita curita commented Aug 7, 2015

I like that syntax too; it actually seems simpler and a little shorter than -o output1.dat -t csv

Contributor

@redapple redapple commented Sep 19, 2016

Adding the "help-wanted" label in case there are any volunteers to come up with an implementation of this feature.

6 participants