option to output data in scrapy parse #4377

Merged
merged 8 commits into scrapy:master from adityaa30:save-option on Jul 11, 2020

Conversation

adityaa30
Contributor

@adityaa30 adityaa30 commented Feb 25, 2020

@codecov

codecov bot commented Feb 25, 2020

Codecov Report

Merging #4377 into master will increase coverage by 0.03%.
The diff coverage is 28.57%.

@@            Coverage Diff             @@
##           master    #4377      +/-   ##
==========================================
+ Coverage   84.80%   84.83%   +0.03%     
==========================================
  Files         163      163              
  Lines        9990     9984       -6     
  Branches     1488     1488              
==========================================
- Hits         8472     8470       -2     
+ Misses       1255     1251       -4     
  Partials      263      263              
Impacted Files                  Coverage Δ
scrapy/commands/__init__.py     62.31% <ø> (ø)
scrapy/commands/parse.py        19.77% <28.57%> (-0.45%) ⬇️

@adityaa30 adityaa30 force-pushed the save-option branch 3 times, most recently from 24c5b77 to aff8a20 on Feb 25, 2020
Member

@Gallaecio Gallaecio left a comment

I do think this should behave like scrapy crawl -o, which outputs the items in a format that matches the extension of the output filename.
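As an aside, a rough sketch of the behaviour being referenced, i.e. inferring the feed format from the output file name's extension (the helper name here is hypothetical, not Scrapy's actual code):

import os

def infer_feed_format(output_uri, default="jsonlines"):
    # Hypothetical helper: "items.csv" -> "csv", "items.json" -> "json".
    extension = os.path.splitext(output_uri)[1].lstrip(".").lower()
    return extension or default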

@adityaa30
Contributor Author

adityaa30 commented Feb 26, 2020

@Gallaecio Okay. I have looked at the implementation of scrapy crawl -o and will update this PR with a similar implementation for scrapy parse.

@wRAR
Contributor

wRAR commented Feb 26, 2020

It's also missing tests, though I cannot make codecov say it explicitly.

@adityaa30
Contributor Author

adityaa30 commented Feb 26, 2020

So if I try to update settings['FEED_URI'] and settings['FEED_FORMAT'] as done in crawl.py, the signals defined in FeedStorage are not called. I cannot find the exact reason for this, but from debugging it appears this behaviour is caused by the custom callback defined in parse.py. @Gallaecio Can you please guide me here?
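For reference, a rough sketch of the crawl.py-style approach being described, i.e. pushing the output options into the feed settings before the crawl starts (simplified; the `opts` attribute names are assumptions):

def apply_output_options(settings, opts):
    # Sketch only: mirror what the crawl command does with its output options.
    if opts.output:
        settings.set("FEED_URI", opts.output, priority="cmdline")
    if getattr(opts, "output_format", None):
        settings.set("FEED_FORMAT", opts.output_format, priority="cmdline")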

@adityaa30
Contributor Author

adityaa30 commented Feb 26, 2020

@wRAR I am planning to write tests for this command after I finish the --output functionality. How should I proceed with writing tests?

@wRAR
Contributor

wRAR commented Feb 26, 2020

@adityaa30 there are existing tests that run a command as a separate process; the parse command is tested in tests/test_command_parse.py.
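Broadly, tests of that style spawn the command in a child process and assert on its output; a rough, hypothetical sketch of the pattern (the real tests in tests/test_command_parse.py use the project's own helpers):

import subprocess
import sys

def run_parse_command(url, spider, extra_args=()):
    # Hypothetical helper: run `scrapy parse` as a separate process and
    # capture its exit code and output for assertions.
    result = subprocess.run(
        [sys.executable, "-m", "scrapy", "parse", url, "--spider", spider, *extra_args],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout, result.stderr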

@Gallaecio
Member

Gallaecio commented Mar 1, 2020

So if I try to update settings['FEED_URI'] and settings['FEED_FORMAT'] as done in crawl.py, the signals defined in FeedStorage are not called. I cannot find the exact reason for this, but from debugging it appears this behaviour is caused by the custom callback defined in parse.py. @Gallaecio Can you please guide me here?

Can you push those non-working changes? The issue may be more obvious when looking at the code.

@adityaa30
Contributor Author

adityaa30 commented Mar 2, 2020

@Gallaecio I have pushed my changes.

@adityaa30
Contributor Author

adityaa30 commented Mar 12, 2020

I have been trying to find out exactly why the -o/--output option in 71098cf does not work. Observing the chain of callbacks in scraper.py

... → enqueue_scrape → (_scrape_next → _scrape → _scrape2 → call_spider → iterate_spider_output → handle_spider_error/handle_spider_output) x repeat → ...

with a debugger, I noticed that the result argument of handle_spider_output only contains a scrapy.utils.python.MutableChain of scrapy.Request objects, i.e. there is no scrapy.item.BaseItem or dict in it. I cannot seem to find the exact reason for this behaviour. @Gallaecio @wRAR Could you please help here?

@raphapassini
Contributor

raphapassini commented Mar 30, 2020

Hey @adityaa30, you're right: it's not working because of the callback and the way the parse command was specifically designed.

A simple way to solve this, though, is to change the print_items function

def print_items(self, lvl=None, colour=True):

and call the proper exporter methods inside it. The exporters interface is pretty simple (https://github.com/scrapy/scrapy/blob/647cba0f106211e72a8d3e028b25c2f46859c406/scrapy/exporters.py): you need to call start_exporting, export_item and then finish_exporting, in this order, and you should probably be good to go.
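For illustration, a minimal sketch of that exporter call sequence (assumptions: the output file name and the choice of JsonLinesItemExporter are mine, and `scraped_items` stands in for whatever the command has collected):

from scrapy.exporters import JsonLinesItemExporter

# Hypothetical sketch of the start_exporting / export_item / finish_exporting
# sequence described above; `scraped_items` is assumed to already exist.
with open("items.jl", "wb") as output_file:
    exporter = JsonLinesItemExporter(output_file)
    exporter.start_exporting()        # write any preamble the format needs
    for item in scraped_items:
        exporter.export_item(item)    # serialize one item to the file
    exporter.finish_exporting()       # write any closing output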

@raphapassini
Contributor

raphapassini commented Mar 30, 2020

Another option is to understand exactly WHY the crawler is not connecting and listening to the signal when we use the custom callback. That way we could use the same approach as the crawl.py command. But I think my suggestion above is simpler than doing that.

scrapy/commands/parse.py (review comment, outdated, resolved)
@adityaa30 adityaa30 force-pushed the save-option branch 3 times, most recently from b51dc59 to 31a2569 on Apr 1, 2020
- adds an option `-o` or `--output` in the `scrapy parse` command to save scraped data to a file
@adityaa30
Contributor Author

adityaa30 commented Apr 1, 2020

Another option is to understand exactly WHY the crawler is not connecting and listening to the signal when we use the custom callback. That way we could use the same approach as the crawl.py command. But I think my suggestion above is simpler than doing that.

@raphapassini I debugged further and found out that the callback function was returning only a list of requests (see parse.py). When I change it from

return requests

to

return items + requests

the FeedExporter works as it does in the crawl command. However, we also get a log line for each scraped item.
My latest working changes are in 2a966e7; I had to pull and rebase in order to pick up the changes from #3858.
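For illustration, a rough sketch of the kind of change being described, with hypothetical names (the real logic lives in the callback handling of scrapy/commands/parse.py):

from scrapy import Request

def wrap_callback(spider_callback):
    # Hypothetical wrapper: split the spider output into items and requests,
    # then return both so the feed export machinery also sees the items.
    def callback(response, **cb_kwargs):
        items, requests = [], []
        for entry in spider_callback(response, **cb_kwargs) or ():
            (requests if isinstance(entry, Request) else items).append(entry)
        return items + requests  # previously only `requests` was returned
    return callback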

@adityaa30 adityaa30 requested a review from Gallaecio Apr 2, 2020
@adityaa30 adityaa30 changed the title [wip] option to output data in scrapy parse [MRG] option to output data in scrapy parse Apr 2, 2020
@Gallaecio Gallaecio changed the title [MRG] option to output data in scrapy parse option to output data in scrapy parse Apr 3, 2020
@Gallaecio
Member

Gallaecio commented Apr 3, 2020

I had to pull and rebase

In the future, consider merging to avoid having to force-push to a pull request branch:

git fetch upstream
git merge upstream/master

scrapy/commands/parse.py (review comment, outdated, resolved)
wRAR
wRAR approved these changes Jun 3, 2020
@adityaa30 adityaa30 requested a review from elacuesta Jun 21, 2020
@elacuesta
Member

elacuesta commented Jul 3, 2020

This looks almost ready to merge; I just wonder if we should remove the -t option from the parse command, since it's deprecated in favour of specifying the format within the -o option.

@Gallaecio Gallaecio self-requested a review Jul 6, 2020
@elacuesta
Member

elacuesta commented Jul 7, 2020

I just wonder if we should remove the -t option from the parse command, since it's deprecated in favour of specifying the format within the -o option.

For the record, inheriting from BaseRunSpiderCommand adds the -t option, which was not in the parse command before. I don't really have a problem with that; I think it's probably a reasonable trade-off to simplify the command's implementation. A deprecation message appears each time the option is used, so that's fine to me 👌

@Gallaecio Gallaecio merged commit 3f7e863 into scrapy:master Jul 11, 2020
1 of 2 checks passed
@adityaa30 adityaa30 deleted the save-option branch Jul 14, 2020

Successfully merging this pull request may close these issues.

Having an option to define an Items output file in the scrapy parse command line