@tw4l
Member

@tw4l tw4l commented Nov 17, 2025

Fixes #2991

After deleting files (e.g. WACZs uploaded while a crawl was paused) for canceled or otherwise failed crawls, ensure we also update the crawl database object.

This fixes a regression introduced by crawl pausing: org storage numbers became incorrect when the canceled crawl was later deleted, because the crawl's files were not removed from the database at the same time as they were deleted from storage.

It also renames the basecrawls delete_crawl_files method to delete_failed_crawl_files to make its purpose clearer: the method is only used by the operator and should only be used for failed crawls. (Deleting successful crawls involves other workflow- and org-related updates that are handled by other codepaths.)
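As a rough illustration of the change described above, here is a minimal sketch that uses an in-memory stand-in for storage and a plain dict for the crawl record. Only the names `delete_failed_crawl_files`, `files`, `fileSize`, and `fileCount` come from this PR. `FakeStorage`, the function signature, and the shape of each file entry are assumptions made for the example and are not the actual backend code.

```python
from dataclasses import dataclass, field


@dataclass
class FakeStorage:
    """Hypothetical stand-in for object storage: maps filename -> size in bytes."""

    objects: dict = field(default_factory=dict)

    def delete(self, filename: str) -> None:
        # Deleting a missing object is treated as a no-op in this sketch.
        self.objects.pop(filename, None)


def delete_failed_crawl_files(crawl: dict, storage: FakeStorage) -> int:
    """Delete a failed crawl's files from storage, then reset the crawl record.

    Returns the number of bytes that were recorded for the removed files.
    """
    removed = 0
    for crawl_file in crawl.get("files", []):
        storage.delete(crawl_file["filename"])
        removed += crawl_file["size"]

    # The step the regression was missing: keep the crawl record in sync
    # with storage so later deletes do not act on stale sizes.
    crawl["files"] = []
    crawl["fileSize"] = 0
    crawl["fileCount"] = 0
    return removed
```

The point of the sketch is only the ordering: the storage delete and the record reset happen in the same codepath, so the record never advertises files that no longer exist.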

Testing

  1. Spin up local instance
  2. Run a crawl
  3. Pause the crawl
  4. Cancel the crawl while it's paused
  5. Verify the crawl's files, fileSize, and fileCount are reset in the database, and that the crawl files have been deleted from the configured S3 storage
  6. Delete the canceled crawl from the workflow crawl list
  7. Verify the org's bytesStored and bytesStoredCrawls are now 0 and not negative as before
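The arithmetic behind step 7 can be sketched as follows. This assumes that deleting a crawl subtracts the crawl's recorded fileSize from the org counters, and that the org counters no longer include the canceled crawl's bytes by that point. Both are assumptions for illustration, and `delete_crawl` and `org_after_delete` are hypothetical helpers, not functions from the codebase.

```python
def delete_crawl(org: dict, crawl: dict) -> None:
    """Hypothetical delete path: subtract the crawl's recorded size from org totals."""
    size = crawl["fileSize"]
    org["bytesStored"] -= size
    org["bytesStoredCrawls"] -= size


def org_after_delete(crawl_file_size: int) -> dict:
    """Return org counters after deleting a crawl with the given recorded size."""
    # Assumption: the org starts at 0 because the canceled crawl's files
    # are already gone from storage.
    org = {"bytesStored": 0, "bytesStoredCrawls": 0}
    crawl = {"fileSize": crawl_file_size}
    delete_crawl(org, crawl)
    return org


# Before the fix: the crawl record still carries a stale, non-zero fileSize.
stale = org_after_delete(1_000)

# After the fix: fileSize was reset to 0 when the files were deleted.
fixed = org_after_delete(0)
```

Under these assumptions, `stale` ends up with negative counters while `fixed` stays at 0, which matches the behavior the testing steps check for.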

After deleting files (e.g. WACZs uploaded while a crawl was paused)
for canceled or otherwise failed crawls, ensure we also update the
crawl database object.

This fixes a regression introduced by crawl pausing, which resulted
in org storage numbers being incorrect when later deleting the
canceled crawl, because its files were not removed from the database
at the same time as they were deleted from storage.

It also renames the basecrawls method to make its purpose clearer, as it
is only used by the operator and should only be used for failed crawls.
@tw4l tw4l requested a review from ikreymer November 17, 2025 21:50
@tw4l tw4l changed the title Update file crawl db object after deleting files Regression fix: Update failed crawl database object after deleting files Nov 17, 2025
@ikreymer
Member

Good catch! I think it's that for successful crawls, these values should never be reset once incremented, but for failed crawls, they need to be reset to 0.

@ikreymer
Member

It may be useful to store the size of the crawl before it was canceled, but perhaps we can put that elsewhere to avoid more confusion with the actual crawl size.

@ikreymer ikreymer merged commit 2725686 into main Nov 18, 2025
24 checks passed
@ikreymer ikreymer deleted the issue-2991-cancel-paused-crawl-storage-bug branch November 18, 2025 03:02
Development

Successfully merging this pull request may close these issues.

[Bug]: Deleting a canceled crawl that was paused makes org storage stats inaccurate

3 participants