Generate 'pageinfo' resource records with summary of all page resources. #457

Closed
ikreymer opened this issue Jan 13, 2024 · 1 comment · Fixed by #458

@ikreymer
Member

It would be useful, especially for QA and page comparison, to have a 'pageinfo' record that lists every resource loaded by a particular page along with its status code.

The record can be a resource WARC record that might look as follows (current iteration):

WARC-Target-URI: urn:pageinfo:https://webrecorder.net/
WARC-Date: 2024-01-13T05:56:03.866Z
WARC-Type: resource
WARC-Record-ID: <urn:uuid:3aa76f5c-b668-4367-a594-cf1bdeddb11e>
WARC-Payload-Digest: sha256:be4aa0981f87f8636817c26645ce89915387e74fe0ff8fa318f5b4acf0bfaa36
WARC-Block-Digest: sha256:be4aa0981f87f8636817c26645ce89915387e74fe0ff8fa318f5b4acf0bfaa36
Content-Length: 1235

{
  "pageid": "0292148c-99ac-45f0-b2eb-7a484f83ea43",
  "urls": {
    "https://webrecorder.net/": 200,
    "https://webrecorder.net/assets/wr-logo.svg": 200,
    "https://webrecorder.net/assets/tools/awp-icon.png": 200,
    "https://webrecorder.net/assets/fontawesome/all.css": 200,
    "https://webrecorder.net/assets/main.css": 200,
    "https://webrecorder.net/assets/images/btrix-cloud.png": 200,
    "https://webrecorder.net/assets/tools/browsertrixcrawler.png": 200,
    "https://webrecorder.net/assets/tools/logo-pywb.png": 200,
    "https://webrecorder.net/assets/tools/rwp-icon.png": 200,
    "https://fonts.googleapis.com/css2?family=Source+Sans+Pro:wght@700;900&display=swap": 200,
    "https://fonts.googleapis.com/css?family=Source+Code+Pro|Source+Sans+Pro&display=swap": 200,
    "https://fonts.gstatic.com/s/sourcesanspro/v22/6xKydSBYKcSV-LCoeQqfX1RYOo3ig4vwlxdu.woff2": 200,
    "https://fonts.gstatic.com/s/sourcesanspro/v22/6xK3dSBYKcSV-LCoeQqfX1RYOo3qOK7l.woff2": 200,
    "https://stats.browsertrix.com/js/script.js": 200,
    "https://webrecorder.net/assets/favicon.ico": 200,
    "https://stats.browsertrix.com/api/event?__wb_method=POST&n=pageview&u=https%3A%2F%2Fwebrecorder.net%2F&d=webrecorder.net": 202
  }
}
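
As a rough illustration of how such a record could be assembled, here is a minimal Python sketch. It is not the crawler's implementation: the function name `build_pageinfo_record` is hypothetical, and the only headers it produces are the ones shown in the sample above. It assumes the digests are hex-encoded SHA-256 over the JSON payload, which matches the `sha256:` prefix in the sample.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone


def build_pageinfo_record(page_url, page_id, urls):
    """Return (headers, payload) for a urn:pageinfo: resource record.

    Hypothetical helper: mirrors only the headers shown in the sample.
    """
    payload = json.dumps({"pageid": page_id, "urls": urls}, indent=2).encode("utf-8")
    # A resource record carries no HTTP headers, so the block and the payload
    # are the same bytes and both digests coincide (as in the sample).
    digest = "sha256:" + hashlib.sha256(payload).hexdigest()
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    headers = {
        "WARC-Target-URI": "urn:pageinfo:" + page_url,
        "WARC-Date": now.replace("+00:00", "Z"),
        "WARC-Type": "resource",
        "WARC-Record-ID": "<urn:uuid:%s>" % uuid.uuid4(),
        "WARC-Payload-Digest": digest,
        "WARC-Block-Digest": digest,
        "Content-Length": str(len(payload)),
    }
    return headers, payload


headers, payload = build_pageinfo_record(
    "https://webrecorder.net/",
    "0292148c-99ac-45f0-b2eb-7a484f83ea43",
    {
        "https://webrecorder.net/": 200,
        "https://webrecorder.net/assets/main.css": 200,
    },
)
for name, value in headers.items():
    print(name + ": " + value)
print()
print(payload.decode("utf-8"))
```

Keeping `urls` as a plain URL-to-status map means two records for the same page can be compared key by key.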

Non-GET requests can be canonicalized into URLs using the same canonicalization that is used for generating CDX.
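
To make the idea concrete, here is a simplified Python sketch of that kind of canonicalization. It is an approximation for illustration only, not the warcio code: the function name `canonicalize_request` is hypothetical, it handles only form-encoded bodies and flat JSON objects, and the request body in the usage example is an assumed one chosen to reproduce the last entry in the sample above.

```python
import json
from urllib.parse import parse_qsl, urlencode


def canonicalize_request(method, url, body=b"", content_type=""):
    """Fold a non-GET request's method and body into a query string.

    Simplified sketch: real canonicalizers handle more body types.
    """
    method = method.upper()
    if method == "GET":
        return url

    params = [("__wb_method", method)]
    if content_type.startswith("application/json"):
        try:
            data = json.loads(body)
        except ValueError:
            data = None
        if isinstance(data, dict):
            for key, value in data.items():
                # Strings pass through; other values are kept as JSON text.
                text = value if isinstance(value, str) else json.dumps(value)
                params.append((key, text))
    elif content_type.startswith("application/x-www-form-urlencoded"):
        params.extend(parse_qsl(body.decode("utf-8", "replace"), keep_blank_values=True))

    separator = "&" if "?" in url else "?"
    return url + separator + urlencode(params)


# Assumed request body, chosen to match the sample record's final URL.
body = json.dumps(
    {"n": "pageview", "u": "https://webrecorder.net/", "d": "webrecorder.net"}
).encode("utf-8")
canonical = canonicalize_request(
    "POST", "https://stats.browsertrix.com/api/event", body, "application/json"
)
print(canonical)
```

Because the method and body end up in the URL, a POST can be keyed in the `urls` map the same way as any GET resource.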

@ikreymer ikreymer self-assigned this Jan 13, 2024
tw4l pushed a commit that referenced this issue Jan 15, 2024
Generate records for each page, containing a list of resources and their
status codes, to aid in future diffing/comparison.

- Generates a `urn:pageinfo:<page url>` record for each page
- Adds POST / non-GET request canonicalization from warcio to handle non-GET requests
- Adds `writeSingleRecord` to WARCWriter

Fixes #457
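
The diffing/comparison use case mentioned above could look roughly like the following Python sketch. The function name `diff_pageinfo` and the two example payloads are hypothetical; the sketch only assumes the `urls` map layout from the sample record in this issue.

```python
def diff_pageinfo(before, after):
    """Compare the `urls` maps of two pageinfo payloads for the same page."""
    old, new = before["urls"], after["urls"]
    shared = set(old) & set(new)
    return {
        # Resources loaded in the first capture but not the second.
        "missing": sorted(set(old) - set(new)),
        # Resources loaded only in the second capture.
        "added": sorted(set(new) - set(old)),
        # Resources present in both whose status code differs.
        "status_changed": {
            url: (old[url], new[url]) for url in sorted(shared) if old[url] != new[url]
        },
    }


# Hypothetical payloads for two captures of the same page.
crawl = {"urls": {"https://webrecorder.net/": 200,
                  "https://webrecorder.net/assets/main.css": 200,
                  "https://webrecorder.net/assets/favicon.ico": 200}}
replay = {"urls": {"https://webrecorder.net/": 200,
                   "https://webrecorder.net/assets/main.css": 404}}

result = diff_pageinfo(crawl, replay)
print(result)
```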
@tw4l
Contributor

tw4l commented Jan 15, 2024

Fixed via #458
