Batch deliveries for long running crawlers #4250

Closed
ejulio opened this issue Dec 19, 2019 · 17 comments · Fixed by #4434
Comments

@ejulio
Contributor

ejulio commented Dec 19, 2019

Summary

Add a new setting FEED_STORAGE_BATCH that will deliver a file whenever item_scraped_count reaches a multiple of that number.
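A rough sketch of how the proposed setting could look in settings.py; FEED_STORAGE_BATCH is only the name suggested in this issue and does not exist in Scrapy, while FEED_FORMAT and FEED_URI are existing feed export settings:

```python
# settings.py -- hypothetical usage of the proposed setting.
# FEED_STORAGE_BATCH does not exist in Scrapy yet; the name and behaviour
# are only what this issue proposes.

# Deliver a new output file every time 10,000 items have been scraped.
FEED_STORAGE_BATCH = 10000

# The existing feed export settings keep working as they do today;
# only the delivery frequency changes.
FEED_FORMAT = "jsonlines"
FEED_URI = "s3://my-bucket/%(name)s/%(time)s.jl"
```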

Motivation

For long-running jobs (say, when we are consuming inputs from a work queue) we may want partial results instead of waiting for a long run to finish.

Describe alternatives you've considered

Of course, we could stop and restart the spider every now and then.
However, a simpler approach is to keep it running for as long as required while delivering partial results.

@wRAR
Member

wRAR commented Dec 20, 2019

This will need unique file names, right?

@ejulio
Contributor Author

ejulio commented Dec 20, 2019

Indeed.
When I first wrote this, I was thinking about templates.
One simple alternative would be to force %(time)s when this setting is enabled, so we are sure the file names will be different.
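For reference, %(time)s is an existing placeholder in feed URIs that is replaced with a timestamp when the feed is stored, so forcing it would already guarantee unique names. A minimal example (the output path is made up):

```python
# settings.py -- %(time)s and %(name)s are existing feed URI placeholders.
# %(time)s expands to a timestamp when the feed is delivered, so every
# delivery (and, with batching, every partial file) gets a unique name.
FEED_URI = "output/%(name)s-%(time)s.json"
```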

@alake16

alake16 commented Mar 11, 2020

Is anyone working on this? I wanted to get started contributing to this project by jumping into this issue if no one else is.

@BroodingKangaroo
Contributor

@alake16
Hi, I have already started working on this issue, but you can also start if you want.

@ejulio
Contributor Author

ejulio commented Mar 12, 2020

Once you have something, you can create a draft PR on GitHub so we can follow up there.
Also consider an alternative that delivers data every N items or every X minutes.
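A minimal sketch of how that alternative could look as a Scrapy extension, with both a count-based and a time-based trigger; the class, the default thresholds, and the flush() placeholder are made up for illustration and are not the implementation in #4434:

```python
# A hypothetical extension that triggers a "flush" callback every N items
# or every X seconds -- a sketch only, not the code merged in #4434.
from scrapy import signals
from twisted.internet import task  # Scrapy runs on Twisted's reactor


class BatchDeliverySketch:
    def __init__(self, crawler, every_n_items=1000, every_seconds=600):
        self.every_n_items = every_n_items
        self.every_seconds = every_seconds
        self.items_since_flush = 0
        self.timer = None
        crawler.signals.connect(self.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(self.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(self.item_scraped, signal=signals.item_scraped)

    @classmethod
    def from_crawler(cls, crawler):
        # The thresholds could be read from crawler.settings instead of defaults.
        return cls(crawler)

    def spider_opened(self, spider):
        # Time-based trigger: deliver a partial batch every X seconds.
        self.timer = task.LoopingCall(self.flush, spider)
        self.timer.start(self.every_seconds, now=False)

    def spider_closed(self, spider):
        if self.timer and self.timer.running:
            self.timer.stop()
        self.flush(spider)  # deliver whatever is left

    def item_scraped(self, item, spider):
        # Count-based trigger: deliver a partial batch every N items.
        self.items_since_flush += 1
        if self.items_since_flush >= self.every_n_items:
            self.flush(spider)

    def flush(self, spider):
        # Placeholder: a real extension would close the current feed slot
        # and open a new one with a unique file name.
        self.items_since_flush = 0
        spider.logger.info("Delivering a partial batch")
```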

@alake16

alake16 commented Mar 12, 2020

@BroodingKangaroo how's it going for you? Blocked on anything at the moment?

@BroodingKangaroo
Contributor

@alake16 I am in the process of understanding how Scrapy works internally and how to do what is required. 😺
@ejulio Should we merge the partial results and remove them after the run finishes?

@ejulio
Contributor Author

ejulio commented Mar 13, 2020

@BroodingKangaroo , my idea is that it should be partial deliveries.
No file deletion/aggregation, as that would be basically the current behavior.
The idea here is to be able to deliver files before a run finishes, so we don't need to wait days/hours for a result.

BroodingKangaroo added a commit to BroodingKangaroo/scrapy that referenced this issue Mar 18, 2020
@BroodingKangaroo
Contributor

Hi @ejulio! =)
Could you review my PR?

Also, I want to introduce myself.
My name is Maxim Halavaty and I am a third-year student at the Belarusian State University, Faculty of Applied Mathematics and Computer Science. I would be happy to help develop your project during GSoC 2020.

I have some questions regarding this issue.

  1. Should we create a new slot for each new file/partial?
  2. How should we name the partial files? With the implementation in my PR, running
     scrapy crawl quotes --nolog -o res/test.json with FEED_STORAGE_BATCH = 1000
     creates the following file tree:
     res/
         test.2020-03-18T14:43:58.896341.json
         test.2020-03-18T14:44:01.465171.json
         test.2020-03-18T14:44:02.326922.json
     Each file has 1000 entries, and the last one has 395 (<= 1000).
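One way such timestamped names could be produced from the requested output path; this is only an illustrative sketch, not the code in the PR:

```python
from datetime import datetime
from pathlib import Path


def batch_file_name(uri: str) -> str:
    """Insert an ISO timestamp before the extension, e.g.
    res/test.json -> res/test.2020-03-18T14:43:58.896341.json
    """
    path = Path(uri)
    stamp = datetime.now().isoformat()
    return str(path.with_name(f"{path.stem}.{stamp}{path.suffix}"))
```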

@ejulio
Contributor Author

ejulio commented Mar 18, 2020

@BroodingKangaroo, great!
For GSoC, this is the first contribution that we ask for, but you should also write a project proposal so we can evaluate it.
You can write a draft and we can help you shape it for the final submission.

@dipiana

dipiana commented Mar 18, 2020

Would it be possible for someone to write a tutorial on how to use this in a very basic Scrapy crawler? I would love to get started with it, but to be honest I don't know where to begin. Thanks!

@ejulio
Contributor Author

ejulio commented Mar 18, 2020

@dipiana it is not merged yet.
But before merging, we always update the project documentation with the proper details.
Maybe @BroodingKangaroo can include a simple tutorial once it is finished.

@dipiana

dipiana commented Mar 18, 2020

Oh okay, thanks. Any idea when I could use it? This is exactly what I was looking for! :)

@BroodingKangaroo
Contributor

@dipiana I don't know exactly, but when I finish I will post an update in this thread.

BroodingKangaroo added a commit to BroodingKangaroo/scrapy that referenced this issue Mar 21, 2020
@BroodingKangaroo
Contributor

@ejulio Thank you for your reviews on #4434; I will address the comments, of course. =)
I am currently writing a proposal and have a few questions.

@BroodingKangaroo
Contributor

Hello, @ejulio!
When you have some free time, it would be great if you could take a look at my draft proposal.

@ejulio
Contributor Author

ejulio commented Mar 30, 2020

@BroodingKangaroo,
I took a look at your proposal.
I'll leave my comments here, but we should find another way to share them instead of this issue.

  1. This can be useful: gzip-compressed item exports #2174
  2. I wouldn't put the current work in the proposal, as it is already in progress.

BroodingKangaroo added a commit to BroodingKangaroo/scrapy that referenced this issue Apr 10, 2020
BroodingKangaroo added a commit to BroodingKangaroo/scrapy that referenced this issue Apr 15, 2020
BroodingKangaroo added a commit to BroodingKangaroo/scrapy that referenced this issue Apr 15, 2020