
Add crawl /log API endpoint to stream crawler logs #682

Merged: 2 commits, Apr 11, 2023
Conversation

tw4l (Contributor) commented Mar 7, 2023

Connected to #669

If a crawl is completed, the endpoint streams the logs from the log files in all of the created WACZ files, sorted by timestamp. The API endpoint supports filtering by logLevel and context, both of which take comma-separated lists as input.
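
For illustration, here is a rough client-side sketch of calling the endpoint with both filters. The URL is a placeholder (the real route comes from the Browsertrix Cloud API), but the logLevel and context query parameters take comma-separated lists as described above:

import requests

# Placeholder URL: substitute the real API host and crawl logs route.
url = "https://example.com/api/crawls/my-crawl-id/logs"

resp = requests.get(
    url,
    params={
        "logLevel": "error,warn",    # comma-separated log levels
        "context": "general,worker", # comma-separated contexts
    },
    headers={"Authorization": "Bearer <access token>"},
    stream=True,  # read the response as a stream of log lines
)
for line in resp.iter_lines():
    print(line.decode("utf-8"))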

Since removing the while-streaming bits from this PR, I've manually run the nightly test for streaming logs from stored WACZs, and it passed.

#670 will be addressed in a separate PR.

@tw4l force-pushed the stream-wacz-logs branch 5 times, most recently from 4510010 to e58b419 on March 8, 2023 at 16:39
@tw4l marked this pull request as ready for review on March 21, 2023 at 16:03
@tw4l requested a review from ikreymer on March 21, 2023 at 16:03
return sorted(combined_log_lines, key=lambda line: line["timestamp"])


async def extract_and_parse_log_file(client, bucket, key, log_zipinfo, cd_start):
ikreymer (Member):

Perhaps we should move the zip operations to a separate lib, or maybe just py-wacz, but for now, maybe just putting the zip-reading logic into a separate zip.py to keep it more separated from storage?
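
For context, a minimal sketch (assuming only the standard zip format, not this repo's actual helpers) of what a helper like the parse_little_endian_to_int referenced further down might look like inside such a zip.py:

def parse_little_endian_to_int(data: bytes) -> int:
    # Zip headers store multi-byte integers in little-endian order.
    return int.from_bytes(data, byteorder="little")

# Example: in a zip local file header, the file name length and the extra
# field length are consecutive little-endian uint16 fields.
file_head = b"\x07\x00\x1c\x00"  # name_len=7, extra_len=28
name_len = parse_little_endian_to_int(file_head[0:2])
extra_len = parse_little_endian_to_int(file_head[2:4])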

tw4l (Contributor, Author):

Done and nightly tests pass after refactor

)
combined_log_lines.extend(parsed_log_lines)

return sorted(combined_log_lines, key=lambda line: line["timestamp"])
ikreymer (Member):

Hm, this will buffer the full logs in memory and then sort, right?
Do we want to just return all the log streams and let the heapq.merge() above handle it?

tw4l (Contributor, Author):

yeah good call!

tw4l (Contributor, Author):

Ah, I remember why this was there! heapq.merge() requires its inputs to be sorted. However, it's safe to assume that the logs are already sorted by timestamp as that's how they're written, and heapq.merge() doesn't work with async generators anyway, so this'll have to change for proper streaming.
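
To make that concrete, a small synchronous sketch (with made-up log lines) of the heapq.merge() approach, which works only because each per-WACZ stream is already sorted by timestamp:

import heapq
from operator import itemgetter

# Log lines are written in timestamp order, so each WACZ's stream is
# already sorted and heapq.merge() can merge lazily without buffering.
stream_a = [
    {"timestamp": "2023-03-07T10:00:00Z", "message": "a"},
    {"timestamp": "2023-03-07T10:02:00Z", "message": "c"},
]
stream_b = [{"timestamp": "2023-03-07T10:01:00Z", "message": "b"}]

for line in heapq.merge(stream_a, stream_b, key=itemgetter("timestamp")):
    print(line["message"])  # prints: a, b, c

As the comment above notes, heapq.merge() only accepts synchronous iterables, so the async-streaming version needs a different merge strategy.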

name_len = parse_little_endian_to_int(file_head[0:2])
extra_len = parse_little_endian_to_int(file_head[2:4])

content = await fetch(
ikreymer (Member):

Probably want to turn this into an async generator all the way down, so it could be something like:

body = await fetch_body(...)
if is_gzip:
    return iter_lines(gzip_iter_chunks(body))
else:
    return body.iter_lines()

where:

import zlib

async def gzip_iter_chunks(body):
    # Raw deflate stream (zip members have no zlib header), hence -MAX_WBITS.
    decompressor = zlib.decompressobj(-zlib.MAX_WBITS)
    async for chunk in body.iter_chunks():
        yield decompressor.decompress(chunk)

and iter_lines() is just a copy of https://github.com/boto/botocore/blob/master/botocore/response.py#L135, which takes this generator instead.
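
As a rough sketch of that last piece (illustrative names, not the PR's actual code), an iter_lines() adapted to consume an async chunk generator such as gzip_iter_chunks() above:

async def aiter_lines(chunks, newline=b"\n"):
    # Modeled loosely on botocore's StreamingBody.iter_lines(): buffer a
    # partial trailing line across chunk boundaries, yield complete lines.
    pending = b""
    async for chunk in chunks:
        pieces = (pending + chunk).split(newline)
        pending = pieces.pop()
        for line in pieces:
            yield line
    if pending:
        yield pending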

@tw4l force-pushed the stream-wacz-logs branch 3 times, most recently from 181f28e to 9ba33b6 on March 30, 2023 at 18:43
tw4l (Contributor, Author) commented Mar 30, 2023

@ikreymer This is squashed and rebased, with the "proper" streaming commit moved to a separate branch (https://github.com/webrecorder/browsertrix-cloud/tree/stream-wacz-logs-proper-streaming), since we currently seem to be blocked by the aiobotocore bug described in aio-libs/aiobotocore#991.

Confirmed the new nightly test passes when run manually after rebase.

Squashed commit message:

If a crawl is completed, the endpoint streams the logs from the log files in all of the created WACZ files, sorted by timestamp.

The API endpoint supports filtering by log_level and context whether the crawl is still running or not.

This is not yet proper streaming, because the entire log file is read into memory before being streamed to the client. We will want to switch to proper streaming eventually, but are currently blocked by an aiobotocore bug; see aio-libs/aiobotocore#991.