Add arg to write pages to Redis #464

Merged
merged 2 commits into from Feb 10, 2024

@tw4l tw4l commented Feb 9, 2024

Fixes #462

Intended for use with the QA features in Browsertrix Cloud, so that pages can be added to the database for each crawl.

tw4l and others added 2 commits February 8, 2024 15:02
Introduces --writePagesToRedis argument
also add 'ts' to pageinfo record, as date of top-level page resource
include 'ts' and 'mime' in pages pushed to redis
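
For readers consuming these records, here is a minimal Python sketch of parsing one page entry after it has been read from Redis. It assumes each entry is a JSON string. Only `ts` and `mime` are named in the commits above; the `url` field, the `PageRecord` class, and `parse_page_record` are hypothetical names for illustration, and the Redis key and exact record layout are not specified in this PR.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageRecord:
    """Hypothetical container for one page entry read from Redis."""
    url: str                    # assumed field name, not stated in this PR
    ts: Optional[str] = None    # date of the top-level page resource
    mime: Optional[str] = None  # MIME type of the page


def parse_page_record(raw: str) -> PageRecord:
    """Parse one JSON-encoded page entry, tolerating missing ts/mime."""
    data = json.loads(raw)
    return PageRecord(url=data["url"], ts=data.get("ts"), mime=data.get("mime"))


# Example input is made up for illustration only.
example = json.dumps(
    {"url": "https://example.com/", "ts": "2024-02-09T19:21:00Z", "mime": "text/html"}
)
record = parse_page_record(example)
print(record)
```

Treating `ts` and `mime` as optional lets the same parser handle entries written by crawler versions that predate these fields.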
@tw4l tw4l requested a review from ikreymer February 9, 2024 19:21
@tw4l tw4l changed the title from "Add argto write pages to Redis" to "Add arg to write pages to Redis" Feb 9, 2024
@ikreymer ikreymer merged commit bdffa79 into dev-1.0.0 Feb 10, 2024
4 checks passed
@ikreymer ikreymer deleted the issue-462-write-pages-to-redis branch February 10, 2024 00:44
tw4l added a commit to webrecorder/browsertrix that referenced this pull request Feb 28, 2024
Fixes #1502 

- Adds pages to database as they get added to Redis during crawl
- Adds migration to add pages to database for older crawls from
pages.jsonl and extraPages.jsonl files in WACZ
- Adds GET, list GET, and PATCH update endpoints for pages
- Adds POST (add), PATCH, and POST (delete) endpoints for page notes,
each with their own id, timestamp, and user info in addition to text
- Adds page_ops methods for (1) adding resources/URLs to a page and (2)
adding automated heuristics and supplemental info (mime, type, etc.) to
a page (for use in the crawl QA job)
- Modifies `Migration` class to accept kwargs so that we can pass in ops
classes as needed for migrations
- Deletes WACZ files and pages from database for failed crawls during
crawl_finished process
- Deletes crawl pages when a crawl is deleted
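
The migration step above reads pages from `pages.jsonl` and `extraPages.jsonl`. As a rough sketch of what that parsing could look like (not the actual migration code), the Python below iterates over JSON Lines input and skips the header line. It assumes the header line carries a `format` key and no `url`, and that page lines carry a `url`. The function name `iter_pages` and the sample data are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_pages(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield page entries from JSON Lines input, skipping blanks and the header."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Assumption: the header line describes the file and has no "url" key.
        if "url" not in entry:
            continue
        yield entry


# Made-up sample resembling a pages.jsonl file.
sample = [
    '{"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}',
    '{"id": "a1", "url": "https://example.com/", "ts": "2024-02-09T19:21:00Z"}',
    "",
    '{"id": "b2", "url": "https://example.com/about", "ts": "2024-02-09T19:22:00Z"}',
]
pages = list(iter_pages(sample))
print(len(pages), [p["url"] for p in pages])
```

Because the function accepts any iterable of strings, the same code works on an open file handle or on lines extracted from a WACZ archive.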

Note: Requires crawler version 1.0.0 beta 3 or later, with support for
`--writePagesToRedis`, to populate pages at crawl completion. Beta 4 is
configured in the test chart, which should be upgraded to stable 1.0.0
when it is released.

Connected to webrecorder/browsertrix-crawler#464

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>