Add crawl /log API endpoint to stream crawler logs #682
Conversation
Force-pushed 4510010 to e58b419
backend/btrixcloud/storages.py (Outdated)
```python
return sorted(combined_log_lines, key=lambda line: line["timestamp"])

async def extract_and_parse_log_file(client, bucket, key, log_zipinfo, cd_start):
```
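(For context: `cd_start` and `log_zipinfo` indicate the log entry is located via the zip central directory, so only that member's bytes need to be fetched from S3. A minimal sketch of the underlying range-request pattern with aiobotocore; the helper name is illustrative, not the PR's actual code:)

```python
async def fetch_range(client, bucket, key, start, length):
    # Fetch only the bytes for one zip member via an HTTP Range request
    # (inclusive byte range) rather than downloading the whole WACZ.
    resp = await client.get_object(
        Bucket=bucket, Key=key, Range=f"bytes={start}-{start + length - 1}"
    )
    async with resp["Body"] as body:
        return await body.read()
```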
Perhaps we should move the zip operations to a separate lib, or maybe just py-wacz, but for now, maybe just put the zip-reading logic into a separate zip.py to keep it more separate from storage?
Done, and the nightly tests pass after the refactor.
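(A rough sketch of what the extracted module might contain; the zip.py path follows the suggestion above, but the specific contents shown are assumptions for illustration:)

```python
# backend/btrixcloud/zip.py -- hypothetical shape of the extracted helpers
"""Helpers for reading individual files out of remotely stored zips (WACZs)."""

EOCD_RECORD_SIZE = 22  # fixed size of the zip End Of Central Directory record


def parse_little_endian_to_int(buf: bytes) -> int:
    # Zip headers encode sizes and offsets as little-endian integers.
    return int.from_bytes(buf, "little")
```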
backend/btrixcloud/storages.py (Outdated)
```python
)
combined_log_lines.extend(parsed_log_lines)

return sorted(combined_log_lines, key=lambda line: line["timestamp"])
```
Hm, this will buffer the full logs in memory and then sort, right? Do we want to just return all the log streams and let the `heapq.merge()` above handle it?
yeah good call!
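(For illustration, a minimal sketch of that suggestion for the synchronous case; the function name is made up, and each stream is assumed to already be sorted by timestamp:)

```python
import heapq

def iter_merged_logs(log_streams):
    # Each stream is an iterable of parsed log-line dicts from one WACZ
    # log file, already sorted by "timestamp". heapq.merge() yields lines
    # lazily, holding only one head element per stream instead of
    # buffering everything and sorting.
    return heapq.merge(*log_streams, key=lambda line: line["timestamp"])
```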
Ah, I remember why this was there! `heapq.merge()` requires its inputs to be sorted. However, it's safe to assume that the logs are already sorted by timestamp, as that's how they're written, and `heapq.merge()` doesn't work with async generators anyway, so this'll have to change for proper streaming.
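(A hedged sketch of what an async-compatible merge could look like, since `heapq.merge()` only accepts synchronous iterables; names and structure here are illustrative:)

```python
import heapq

async def merge_sorted_async(streams, key):
    # k-way merge over async iterators: keep one buffered head per
    # stream on a heap, keyed by timestamp. The index i breaks ties so
    # the log dicts themselves are never compared.
    heap = []
    for i, stream in enumerate(streams):
        it = stream.__aiter__()
        try:
            head = await it.__anext__()
        except StopAsyncIteration:
            continue
        heapq.heappush(heap, (key(head), i, head, it))
    while heap:
        _, i, item, it = heapq.heappop(heap)
        yield item
        try:
            nxt = await it.__anext__()
        except StopAsyncIteration:
            continue
        heapq.heappush(heap, (key(nxt), i, nxt, it))
```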
```python
name_len = parse_little_endian_to_int(file_head[0:2])
extra_len = parse_little_endian_to_int(file_head[2:4])

content = await fetch(
```
Probably want to turn this into an async generator all the way down, so it could be something like:
```python
body = await fetchBody(...)
if gzip:
    return iter_lines(gzip_iter_chunks(body))
else:
    return body.iter_lines()
```
where:
```python
import zlib

async def gzip_iter_chunks(body):
    # -zlib.MAX_WBITS selects raw deflate, which is how zip entries
    # (including those inside WACZs) are compressed.
    decompressor = zlib.decompressobj(-zlib.MAX_WBITS)
    async for chunk in body.iter_chunks():
        yield decompressor.decompress(chunk)
```
and `iter_lines()` is just a copy of https://github.com/boto/botocore/blob/master/botocore/response.py#L135, which takes this generator instead.
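(For completeness, a hedged sketch of that adaptation; this mirrors the linked botocore code but consumes an async chunk generator rather than reading from the body itself:)

```python
async def iter_lines(chunks, keepends=False):
    # Adapted from botocore's StreamingBody.iter_lines() (linked above),
    # but taking an async generator of decompressed chunks, e.g.
    # gzip_iter_chunks(body), instead of the body's own chunk reader.
    pending = b""
    async for chunk in chunks:
        lines = (pending + chunk).splitlines(True)
        for line in lines[:-1]:
            yield line.splitlines(keepends)[0]
        pending = lines[-1] if lines else pending
    if pending:
        yield pending.splitlines(keepends)[0]
```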
Force-pushed 181f28e to 9ba33b6
@ikreymer This is squashed and rebased, with the "proper" streaming commit moved to a separate branch, https://github.com/webrecorder/browsertrix-cloud/tree/stream-wacz-logs-proper-streaming, since we currently seem to be blocked by the aiobotocore bug described in aio-libs/aiobotocore#991. Confirmed the new nightly test passes when run manually after the rebase.
If a crawl is completed, the endpoint streams the logs from the log files in all of the created WACZ files, sorted by timestamp. The API endpoint supports filtering by `logLevel` and `context`, which both take comma-separated lists as input, whether the crawl is still running or not.

This is not yet proper streaming, because the entire log file is read into memory before being streamed to the client. We will want to switch to proper streaming eventually, but are currently blocked by an aiobotocore bug; see aio-libs/aiobotocore#991.

I've run the nightly test for streaming logs from stored WACZs manually since removing the while-streaming bits from this PR, and it passed.

Connected to #669. #670 to be addressed in a separate PR.
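(For reference, a hedged sketch of what calling the endpoint with filters might look like from a client; the route shape, host, auth scheme, and filter values here are assumptions for illustration, not confirmed by this PR:)

```python
import requests

# Hypothetical client-side usage; api_base, org_id, crawl_id, and token
# are placeholders, and the path is an assumed shape for the /logs route.
resp = requests.get(
    f"{api_base}/orgs/{org_id}/crawls/{crawl_id}/logs",
    headers={"Authorization": f"Bearer {token}"},
    params={"logLevel": "error,warn", "context": "general"},
    stream=True,
)
resp.raise_for_status()
for line in resp.iter_lines():
    # One log line at a time, sorted by timestamp across WACZ files
    print(line.decode())
```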