Don't store resources content in memory #386

Open
s0ph1e opened this issue Jan 2, 2020 · 9 comments

Comments

@s0ph1e
Member

s0ph1e commented Jan 2, 2020

Currently all pages are stored in memory (each resource's content is kept in Resource.text), which causes high memory consumption.
It would be nice to avoid storing Resource.text and save resources directly to the FS just after they are received.
Probably we can use streams for that (see the sketch after the list below):

  • for html, css: Request -> update links/images/styles/etc. -> saveResource
  • all other types: Request -> saveResource when content modification is not needed
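
A minimal sketch of the second flow (Request -> saveResource with no content modification), using only Node.js core streams. `saveResourceStream` and its arguments are hypothetical names for illustration, not part of the current codebase:

```js
import https from 'https';
import fs from 'fs';
import { pipeline } from 'stream/promises';

// Pipe the HTTP response straight to disk so the body is never buffered
// in Resource.text. Redirect and status-code handling is omitted.
function saveResourceStream (url, filename) {
  return new Promise((resolve, reject) => {
    https.get(url, (response) => {
      pipeline(response, fs.createWriteStream(filename)).then(resolve, reject);
    }).on('error', reject);
  });
}

// Example usage (hypothetical): await saveResourceStream('https://example.com/img.png', 'img.png');
```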

To do:

  • Update the Resource class - get rid of the text property and related functionality; probably store a reference to a stream for the resource (see the sketch after this list)
  • Update the scraper mechanism: rework the request/save functionality in the scraper - replace the requestQueue property with streamsQueue, replace requestedResourcePromises with requestResourceStreams or remove it, use streams instead of promises in the request file
  • Check and update all actions that use Resource class objects - at least afterResponse and saveResource
  • Measure memory consumption of the current implementation and the streams implementation
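
For illustration only, a hedged sketch of what "store a reference to a stream" could look like, assuming a hypothetical contentStream property that replaces text; none of this is existing code:

```js
import fs from 'fs';
import { pipeline } from 'stream/promises';

// Hypothetical Resource without a text property: it only carries a reference
// to the response stream, and the save step pipes that stream to disk.
class Resource {
  constructor (url, filename) {
    this.url = url;
    this.filename = filename;
    this.contentStream = null; // set from the HTTP response instead of text
  }
}

// Consume the stored stream once, writing it straight to the filesystem.
async function saveResource (resource, absolutePath) {
  await pipeline(resource.contentStream, fs.createWriteStream(absolutePath));
}
```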

Questions:

  • How to handle links to pages which are not downloaded yet? Can we set the reference in the parent before the child is loaded? (see the getReference action)
@Pomax

Pomax commented Jan 2, 2020

To underline why this might be important: I was trying to scrape https://www.image-line.com/support/flstudio_online_manual/ and ended up with a 20GB footprint. That would have crashed on many machines =)

@beije

beije commented Feb 11, 2020

I'm running into this same problem; is there an ETA? :)

@s0ph1e
Member Author

s0ph1e commented Feb 14, 2020

Hi @beije
I do not have time to work on this project now, so there is no ETA for the near future. Contributions are welcome :)

@Pomax

Pomax commented Feb 14, 2020

If you can write up what needs to be done, at least, then I'm sure someone would be willing to work on it. Even if there's no time to write code, there's always time to go "for those who want to work on this, you want to look in files A, B, and C, in the functions X, Y, and Z, because that's where U happens, which leads to V" =)

The original comment is already more than most project maintainers will drop in a "to do" issue, but for external contributions it just needs a little bit more to get folks started on helping out.

@s0ph1e
Member Author

s0ph1e commented Feb 18, 2020

Hey @Pomax

You are right, it's better to document what needs to be done. I've updated the initial issue description with more information. The tricky part is that it will be a huge update - the whole mechanism should be reworked - and it looks quite complicated to me right now. But there is also a good part: we have quite good test coverage that will help with the changes :)

@Pomax

Pomax commented Feb 19, 2020

oh dear, that does sound daunting... thank goodness for test coverage! And thank you for updating what needs to be done!

@pavelloz

pavelloz commented Apr 3, 2020

Well, so as not to turn everything on its head, maybe the first step could be a standard save-from-buffer, file by file, as a lifecycle event (almost like a plugin, but default saving would need to be disabled), without streaming. That would already be much better, because keeping X files in memory (where X = concurrent connections) beats holding all files in memory at the same time until they are dumped at the end.

Additionally, the UX would improve: the first few times I ran the script I waited a long time while seeing nothing, because even the output directory was not created until the run finished.

I didn't look at the codebase; I'm just brainstorming to hopefully push things forward, even if it's not ideal on the first iteration.
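
For what it's worth, a rough sketch of that idea as a plugin that writes each resource as soon as it is handled, assuming the saveResource action mentioned in the issue description and Resource accessors like getFilename() and getText() (getFilename() is an assumption here); a starting point, not working code:

```js
import fsp from 'fs/promises';
import path from 'path';

// Hypothetical plugin: persist each resource individually instead of
// keeping the whole site in memory until the end of the run.
class SaveEachResourcePlugin {
  constructor (outputDir) {
    this.outputDir = outputDir;
  }

  apply (registerAction) {
    registerAction('saveResource', async ({resource}) => {
      const absolutePath = path.join(this.outputDir, resource.getFilename());
      await fsp.mkdir(path.dirname(absolutePath), {recursive: true});
      // Still buffers one resource at a time, but never the whole site.
      await fsp.writeFile(absolutePath, resource.getText());
    });
  }
}
```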

@s0ph1e
Member Author

s0ph1e commented Apr 5, 2020

Hi @pavelloz

Just to clarify - not all files are stored in memory and saved only at the end. When a resource has no dependencies, it's saved to the directory immediately. Only resources with dependencies are saved after all their dependencies are resolved and downloaded (for example, an html file with 5 images will be saved after all the images are downloaded).

Thank you for the suggestion.
Unfortunately I do not have time to work on that. Contributions are welcome.

@phawxby
Contributor

phawxby commented Jun 20, 2022

I think we can actually solve the memory usage issue fairly easily.

  1. handleResource already returns a promise, so all the handlers can be async.
  2. Make the getText() and setText() accessors of Resource async and have them read/write directly against the file system via promises. Basically everyone has SSDs; the bottleneck will be HTTP, so IO will be much less of an issue.
  3. Find some way to pass in a temporary cache directory for it to use. We could even create a new cache plugin with 2 options: memory or filesystem. (A rough sketch follows below.)
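
A hedged sketch of points 2 and 3 combined, assuming a hypothetical filesystem-backed cache behind getText()/setText(); the class name, the key scheme, and the resource.url property are all assumptions:

```js
import fsp from 'fs/promises';
import os from 'os';
import path from 'path';
import crypto from 'crypto';

// Hypothetical cache: resource text is proxied to a temporary directory
// on disk instead of being kept in memory on the Resource object.
class FilesystemResourceCache {
  constructor (cacheDir = path.join(os.tmpdir(), 'website-scraper-cache')) {
    this.cacheDir = cacheDir;
  }

  cachePathFor (resource) {
    // Derive a stable filename from the resource URL (assumed property).
    const key = crypto.createHash('sha1').update(resource.url).digest('hex');
    return path.join(this.cacheDir, key);
  }

  async setText (resource, text) {
    await fsp.mkdir(this.cacheDir, {recursive: true});
    await fsp.writeFile(this.cachePathFor(resource), text);
  }

  async getText (resource) {
    return fsp.readFile(this.cachePathFor(resource), 'utf8');
  }
}
```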
