Skip to content

Conversation

@tw4l
Copy link
Member

@tw4l tw4l commented Sep 20, 2023

Fixes #1206

The org now tracks bytesStored by type of crawl, uploads, and browser profiles
in addition to the total, and returns these values in the org metrics endpoint.

A migration is added to precompute these values in existing deployments.

In addition, all /metrics storage values are now returned solely as bytes, as
the GB form wasn't being used in the frontend and is unnecessary.
- Remove redundant add_crawl_files_to_org_bytes_stored method
- Handle bulk deletes by type
- Fix bug where it was always assumed only one crawl was deleted per cid
and size was not tracked per cid
- Combine inc_org_bytes_stored into single query
@tw4l tw4l force-pushed the issue-1206-crawl-upload-size-metrics branch from 6e86811 to 9a507bd Compare September 21, 2023 15:35
@tw4l tw4l requested a review from ikreymer September 21, 2023 19:33
Copy link
Contributor

@Chickensoupwithrice Chickensoupwithrice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!


deleted_count = 0

crawls_to_delete = await self._filter_delete_list_by_type(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all works, but there's a more optimal way without having to do the filtering by type here actually.
Since the delete function already calls get_crawl_raw, we have the type available there, and it can just bin the crawls by type there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See other comment - now splitting in one pass, but for now I think there are advantages to keeping delete_crawls to one type at a time.


for crawl_id in delete_list.crawl_ids:
crawl = await self.get_crawl_raw(crawl_id, org)
size += await self._delete_crawl_files(crawl, org)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can filter by crawl type here, since we have the type available. Files are deleted for all crawls, but then can do:

deleted_size = await self._delete_crawl_files(crawl, org)
if crawl.type == "crawl":
   crawl_size += delete_size
else:
   upload_size += delete_size

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it looks like we're not even checking the type before deleting crawl files here!
Should probably do:

if type_ and crawl.type != type_:
   continue

This is actually a bug in the current version as well, where it'll delete the files regardless of type, but not actual archived item object!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch on the missing type check! That's now been added.

I think there are advantages to handling the complexity of deleting multiple types in delete_crawls_all_types rather than delete_crawls. Everywhere else in our app that we are deleting content other than through the /all-crawls delete endpoint, we're handling one type at a time. And having delete_crawls handle only one type at a time keeps things a bit simpler, e.g. letting inc_org_bytes_storedonly need to worry about one archived item type at a time.

I pushed a change to delete_crawls_all_types to split the delete list into crawls and uploads arrays in one pass rather than two, which should help.

There are other improvements to be made to the delete endpoints that I'm planning on doing in a separate pass in this sprint (see #1208), so I'd expect some of this to change and get optimized there, but I don't want to hold up this PR for something that's creeping a bit out of scope just because I discovered some bugs 😅 If that sounds reasonable to you!

Copy link
Member

@ikreymer ikreymer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On closer look, maybe we should fix at least the type checking bug before merging, so it doesn't bite us in the future..

@tw4l
Copy link
Member Author

tw4l commented Sep 22, 2023

On closer look, maybe we should fix at least the type checking bug before merging, so it doesn't bite us in the future..

Type checking bug is fixed!

@tw4l tw4l merged commit 094f27b into main Sep 22, 2023
@tw4l tw4l deleted the issue-1206-crawl-upload-size-metrics branch September 22, 2023 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Include size of crawls, uploads, and browser profiles in org metrics API endpoint

4 participants