Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for --rolloverSize for individual WARCs in 1.x #542

Merged
merged 3 commits into from
Apr 15, 2024
Merged

Conversation

ikreymer
Copy link
Member

Fixes #533

Fixes rollover in WARCWriter, separate from combined WARC rollover size:

  • check rolloverSize and close previous WARCs when size exceeds
  • add timestamp to resource WARC filenames to support rollover, eg. screenshots-{ts}.warc.gz
  • use append mode for all write streams, just in case
  • tests: add test for rollover of individual WARCs with 500K size limit
  • tests: update screenshot tests to account for WARCs now being named screenshots-{ts}.warc.gz instead of just screenshots.warc.gz

…ned WARC

- add timestamp to resource WARC filenames to support rollover
- use append mode for all write streams, just in case
- tests: add test for rollover of individual WARCs with 500K size limit

fixes #533
@ikreymer ikreymer requested a review from tw4l April 13, 2024 05:46
Copy link
Contributor

@tw4l tw4l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good and testing well!

Only thought is that because we're only rolling over once the size has been exceeded, occasionally a file gets quite a bit over the rollover size before a new one is started. Maybe eventually we could look into optimizing that a little, but I think for now this is a good improvement.

@ikreymer
Copy link
Member Author

Looks good and testing well!

Only thought is that because we're only rolling over once the size has been exceeded, occasionally a file gets quite a bit over the rollover size before a new one is started. Maybe eventually we could look into optimizing that a little, but I think for now this is a good improvement.

Yeah, we could potential adjust this to account for size after next records are written, but that's a bit more work, and generally this should be pretty good. For small rollover limits, there's always chance of a file exceeding it, eg. if a 100MB video is downloaded and limit is 10MB, there will still be a 100MB WARC

@ikreymer ikreymer merged commit f6edec0 into main Apr 15, 2024
4 checks passed
@ikreymer ikreymer deleted the rollover-fix branch April 15, 2024 20:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

--rolloverSize Default seems not to apply on warc.gz
2 participants