
Add crawl /log API endpoint to stream crawler logs #682

Merged: 2 commits, Apr 11, 2023
Conversation

tw4l (Contributor) commented Mar 7, 2023

Connected to #669

If a crawl is completed, the endpoint streams the logs from the log files in all of the created WACZ files, sorted by timestamp. The API endpoint supports filtering by logLevel and context, both of which take comma-separated lists as input.
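
For illustration, here is a rough client-side sketch of calling the endpoint with both filters. The URL is a placeholder (the real route comes from the Browsertrix Cloud API), but the logLevel and context query parameters take comma-separated lists as described above:

import requests

# Placeholder URL: substitute the real API host and crawl logs route.
url = "https://example.com/api/crawls/my-crawl-id/logs"

resp = requests.get(
    url,
    params={
        "logLevel": "error,warn",    # comma-separated log levels
        "context": "general,worker", # comma-separated contexts
    },
    headers={"Authorization": "Bearer <access token>"},
    stream=True,  # read the response as a stream of log lines
)
for line in resp.iter_lines():
    print(line.decode("utf-8"))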

Since removing the while-streaming bits from this PR, I've manually run the nightly test for streaming logs from stored WACZs, and it passed.

#670 will be addressed in a separate PR.

@tw4l force-pushed the stream-wacz-logs branch 5 times, most recently from 4510010 to e58b419 on March 8, 2023 at 16:39
@tw4l marked this pull request as ready for review on March 21, 2023 at 16:03
@tw4l requested a review from ikreymer on March 21, 2023 at 16:03
return sorted(combined_log_lines, key=lambda line: line["timestamp"])


async def extract_and_parse_log_file(client, bucket, key, log_zipinfo, cd_start):
ikreymer (Member):

Perhaps we should move the zip operations to a separate lib, or maybe just py-wacz, but for now, maybe just putting the zip-reading logic into a separate zip.py to keep it more separated from storage?
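
For context, a minimal sketch (assuming only the standard zip format, not this repo's actual helpers) of what a helper like the parse_little_endian_to_int referenced further down might look like inside such a zip.py:

def parse_little_endian_to_int(data: bytes) -> int:
    # Zip headers store multi-byte integers in little-endian order.
    return int.from_bytes(data, byteorder="little")

# Example: in a zip local file header, the file name length and the extra
# field length are consecutive little-endian uint16 fields.
file_head = b"\x07\x00\x1c\x00"  # name_len=7, extra_len=28
name_len = parse_little_endian_to_int(file_head[0:2])
extra_len = parse_little_endian_to_int(file_head[2:4])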

tw4l (Contributor, Author):

Done and nightly tests pass after refactor

)
combined_log_lines.extend(parsed_log_lines)

return sorted(combined_log_lines, key=lambda line: line["timestamp"])
ikreymer (Member):

Hm, this will buffer the full logs in memory and then sort, right?
Do we want to just return all the log streams and let the heapq.merge() above handle it?

tw4l (Contributor, Author):

yeah good call!

tw4l (Contributor, Author):

Ah, I remember why this was there! heapq.merge() requires its inputs to be sorted. However, it's safe to assume that the logs are already sorted by timestamp as that's how they're written, and heapq.merge() doesn't work with async generators anyway, so this'll have to change for proper streaming.
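
To make that concrete, a small synchronous sketch (with made-up log lines) of the heapq.merge() approach, which works only because each per-WACZ stream is already sorted by timestamp:

import heapq
from operator import itemgetter

# Log lines are written in timestamp order, so each WACZ's stream is
# already sorted and heapq.merge() can merge lazily without buffering.
stream_a = [
    {"timestamp": "2023-03-07T10:00:00Z", "message": "a"},
    {"timestamp": "2023-03-07T10:02:00Z", "message": "c"},
]
stream_b = [{"timestamp": "2023-03-07T10:01:00Z", "message": "b"}]

for line in heapq.merge(stream_a, stream_b, key=itemgetter("timestamp")):
    print(line["message"])  # prints: a, b, c

As the comment above notes, heapq.merge() only accepts synchronous iterables, so the async-streaming version needs a different merge strategy.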

name_len = parse_little_endian_to_int(file_head[0:2])
extra_len = parse_little_endian_to_int(file_head[2:4])

content = await fetch(
ikreymer (Member):

Probably want to turn this into an async generator all the way down, so it could be something like:

body = await fetch_body(...)
if is_gzip:
    return iter_lines(gzip_iter_chunks(body))
else:
    return body.iter_lines()

where:

import zlib

async def gzip_iter_chunks(body):
    # Raw deflate stream (zip members have no zlib header), hence -MAX_WBITS.
    decompressor = zlib.decompressobj(-zlib.MAX_WBITS)
    async for chunk in body.iter_chunks():
        yield decompressor.decompress(chunk)

and iter_lines() is just a copy of https://github.com/boto/botocore/blob/master/botocore/response.py#L135, which takes this generator instead.
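
As a rough sketch of that last piece (illustrative names, not the PR's actual code), an iter_lines() adapted to consume an async chunk generator such as gzip_iter_chunks() above:

async def aiter_lines(chunks, newline=b"\n"):
    # Modeled loosely on botocore's StreamingBody.iter_lines(): buffer a
    # partial trailing line across chunk boundaries, yield complete lines.
    pending = b""
    async for chunk in chunks:
        pieces = (pending + chunk).split(newline)
        pending = pieces.pop()
        for line in pieces:
            yield line
    if pending:
        yield pending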

@tw4l force-pushed the stream-wacz-logs branch 3 times, most recently from 181f28e to 9ba33b6 on March 30, 2023 at 18:43
tw4l (Contributor, Author) commented Mar 30, 2023

@ikreymer This is squashed and rebased, with the "proper" streaming commit moved to a separate branch (https://github.com/webrecorder/browsertrix-cloud/tree/stream-wacz-logs-proper-streaming), since we currently seem to be blocked by the aiobotocore bug described in aio-libs/aiobotocore#991.

Confirmed the new nightly test passes when run manually after rebase.

Squashed commit message:

If a crawl is completed, the endpoint streams the logs from the log files in all of the created WACZ files, sorted by timestamp.

The API endpoint supports filtering by log_level and context whether the crawl is still running or not.

This is not yet proper streaming, because the entire log file is read into memory before being streamed to the client. We will want to switch to proper streaming eventually, but are currently blocked by an aiobotocore bug; see aio-libs/aiobotocore#991.