
Backend crawl log streaming API: Stream logs from WACZ files after crawl concludes #669

Closed · Tracked by #796

tw4l opened this issue Mar 3, 2023 · 6 comments · Fixed by #1225
Labels: back end (Requires back end dev work)

Comments

tw4l (Contributor) commented Mar 3, 2023

Sub-task for #796

Once a crawl is finished, the API endpoint should stream logs from all of the WACZ files created by the crawl.
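A minimal sketch of what that could look like, assuming each WACZ is available as a local ZIP and the crawler writes newline-delimited JSON logs under its `logs/` directory. The function name and layout here are illustrative, not the Browsertrix implementation:

```python
# A rough sketch, not the Browsertrix code: a WACZ is a ZIP archive, and
# crawler logs are assumed to live as JSON-lines files under logs/.
import json
import zipfile
from typing import Iterator


def stream_wacz_logs(wacz_paths: list[str]) -> Iterator[dict]:
    """Lazily yield parsed log lines from every WACZ, one line at a time."""
    for path in wacz_paths:
        with zipfile.ZipFile(path) as wacz:
            for name in wacz.namelist():
                if name.startswith("logs/") and name.endswith(".log"):
                    with wacz.open(name) as log_file:
                        for raw_line in log_file:  # ZipExtFile iterates lazily
                            line = raw_line.strip()
                            if line:
                                yield json.loads(line)
```

The API endpoint could then wrap such a generator in a streaming response so that no log file is ever fully buffered in memory.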

tw4l changed the title from "Stream logs from WACZ files after crawl concludes" to "Logging API: Stream logs from WACZ files after crawl concludes" Mar 3, 2023
tw4l changed the title from "Logging API: Stream logs from WACZ files after crawl concludes" to "Backend crawl log streaming API: Stream logs from WACZ files after crawl concludes" Mar 3, 2023
tw4l added the back end label Mar 3, 2023
tw4l self-assigned this Mar 3, 2023
tw4l (Contributor, Author) commented Apr 19, 2023

A first pass is implemented in #682.

We'll want to move to properly streaming the logs; that is currently blocked by aio-libs/aiobotocore#991.

ikreymer (Member) commented:

Until the aiobotocore issue is resolved, we may be able to use the sync download option, since we've already implemented that to support collection downloads via https://github.com/webrecorder/browsertrix-cloud/blob/main/backend/btrixcloud/storages.py#L358
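One possible shape of that workaround, as a sketch only (not the actual storages.py code): run the blocking boto3 download in a worker thread so the async event loop stays responsive. The bucket, key, and helper names are placeholders.

```python
# Sketch of the sync-download workaround: perform the blocking boto3 read in
# a worker thread so the event loop is not stalled while aiobotocore
# streaming remains unavailable.
import asyncio
from concurrent.futures import ThreadPoolExecutor

import boto3

executor = ThreadPoolExecutor(max_workers=4)
s3 = boto3.client("s3")


def _download_sync(bucket: str, key: str) -> bytes:
    """Blocking download of one object with the synchronous boto3 client."""
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()


async def download_wacz(bucket: str, key: str) -> bytes:
    """Offload the sync download so other requests keep being served."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, _download_sync, bucket, key)
```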

tw4l (Contributor, Author) commented Sep 15, 2023

Implemented as a sync stream in #1168. Closing for now, though we may eventually want to make this async.

tw4l closed this as completed Sep 15, 2023
tw4l reopened this Sep 20, 2023
tw4l (Contributor, Author) commented Sep 20, 2023

There still seems to be a memory issue; looking into it.

We could just fetch the files from presigned URLs instead.
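For reference, the presigned-URL route might look something like the sketch below. Everything here (bucket, key, helper name) is illustrative; the point is streaming the object in chunks rather than buffering it.

```python
# Illustrative sketch of the presigned-URL idea: have boto3 sign a GET URL,
# then stream the object over HTTPS in chunks so the backend never holds
# the whole file in memory at once.
import boto3
import requests


def stream_object_via_presigned_url(bucket: str, key: str, expires: int = 3600):
    s3 = boto3.client("s3")
    url = s3.generate_presigned_url(
        "get_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=expires
    )
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        # iter_content yields fixed-size chunks without buffering the body
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            yield chunk
```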

tw4l (Contributor, Author) commented Sep 26, 2023

@Chickensoupwithrice In your court now if you want to try to figure this out :)

Chickensoupwithrice (Contributor) commented:

Alright, after much experimenting I've managed to nail down exactly where we stop working with generators and instead load up all of the logs. It's in the way we call stream_log_bytes_as_line_dicts: the function itself does return a generator, but we extend an array with the generator's output, which loads the entire log file into memory and makes us run out of memory.

Switching extend to append does keep generators all the way down, but then when I try to consume the resulting generator, I get read timeouts on the DO space?

[screenshot of the read timeout errors]

Still investigating.
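A toy illustration of the extend/append distinction described above; the generator body and file paths are stand-ins for the real stream_log_bytes_as_line_dicts, not the project's code.

```python
# Toy illustration of the bug: extend() eagerly drains the generator, while
# chaining the per-file generators keeps everything lazy.
import itertools
from typing import Iterator


def stream_log_bytes_as_line_dicts(path: str) -> Iterator[dict]:
    """Stand-in for the real generator: yield one parsed log line at a time."""
    with open(path, "rb") as f:
        for line in f:
            yield {"line": line}


def eager(paths: list[str]) -> list[dict]:
    lines: list[dict] = []
    for path in paths:
        # BUG: extend() consumes the whole generator, so every log line
        # from the file ends up in memory at once.
        lines.extend(stream_log_bytes_as_line_dicts(path))
    return lines


def lazy(paths: list[str]) -> Iterator[dict]:
    # Chain the per-file generators instead: only one line is materialized
    # at a time, no matter how large the logs are.
    return itertools.chain.from_iterable(
        stream_log_bytes_as_line_dicts(p) for p in paths
    )
```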
