PYWB stripping out part of URLs on timeline page <url>#/<something> #863

Open
ChrisDoyleMW opened this issue Sep 4, 2023 · 1 comment

Comments

@ChrisDoyleMW

Describe the bug

PYWB seems to strip out part of the URL when a timeline page is requested. For example:
https://webarchive.nationalarchives.gov.uk/*/https://www.arcgis.com/apps/opsdashboard/index.html#/f94c3c90da5b4e9f9a0b19484dd4bb14
loads a timeline for
https://www.arcgis.com/apps/opsdashboard/index.html
Each instance shown is for index.html and not index.html#/f94c3c90da5b4e9f9a0b19484dd4bb14.

Steps to reproduce the bug

  1. Open this URL in your browser:
    https://webarchive.nationalarchives.gov.uk/*/https://www.arcgis.com/apps/opsdashboard/index.html#/f94c3c90da5b4e9f9a0b19484dd4bb14
  2. Click on the link dated 02 April 2020.
  3. Initially, a page with this URL starts to load:
    https://webarchive.nationalarchives.gov.uk/ukgwa/20200402132156/https://www.arcgis.com/apps/opsdashboard/index.html
    Note that the string after the # symbol has been stripped out.
  4. The page does not load but redirects to this URL, which displays as a blank page:
    https://webarchive.nationalarchives.gov.uk/ukgwa/20200328185042/https://www.arcgis.com/sharing/rest/oauth2/authorize?client_id=opsdashboard&display=default&response_type=token&expiration=20160&redirect_uri=https%3A%2F%2Fwww.arcgis.com%2Fapps%2Fopsdashboard%2FpostSignIn.html&locale=en-gb&state=%7B%22redirect%22%3A%22https%3A%2F%2Fwww.arcgis.com%2Fapps%2Fopsdashboard%2Findex.html%22%2C%22portalUrl%22%3A%22https%3A%2F%2Fwww.arcgis.com%2Fsharing%2Frest%2F%22%7D

Expected behavior

I'd expect the timeline page to show the timeline for the full URL, including the part after the #, so that visitors can view the capture history for this specific URL.

Screenshots

Four screenshots attached: Screenshot 2023-09-04 at 12 52 44, Screenshot 2023-09-04 at 12 51 21, Screenshot 2023-09-04 at 12 50 44, Screenshot 2023-09-04 at 12 55 20

Environment

• OS: Linux
• Browser: Any
• Version: PYWB 2.7

@petsva

petsva commented Sep 5, 2023

Everything after a # in a URL is the fragment part. It is never sent to the server; it is handled by the web browser (normally to scroll to a certain position on the page). Hence a harvester can only harvest a URL with the fragment part stripped. That is why Pywb strips it and shows what it found in the index for the URL without the fragment part.
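
To illustrate the point about fragments, here is a minimal sketch using Python's standard `urllib.parse.urldefrag`. The helper name `split_fragment` is only for illustration and is not part of Pywb; the URL is the one from the report.

```python
from urllib.parse import urldefrag


def split_fragment(url: str) -> tuple[str, str]:
    """Return (url_without_fragment, fragment) for the given URL.

    Illustrative helper only; not a Pywb function.
    """
    result = urldefrag(url)
    return result.url, result.fragment


requested = (
    "https://www.arcgis.com/apps/opsdashboard/index.html"
    "#/f94c3c90da5b4e9f9a0b19484dd4bb14"
)

# The first value is what an HTTP request (and so a crawler) can address;
# the second stays in the browser and is never sent to the server.
sent_to_server, kept_by_browser = split_fragment(requested)
print(sent_to_server)
print(kept_by_browser)
```

Since only the first value can be requested over HTTP, it is also the only form an index lookup can match against.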

But maybe Pywb could put the fragment back into the links, to trick the browser into scrolling according to it. Or maybe that would be confusing.
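
A rough sketch of that idea, assuming a hypothetical helper `reattach_fragment` (not an existing Pywb function). Because the fragment never reaches the server, real logic of this kind would probably have to run in the browser on the timeline page; the Python below only illustrates the URL transformation.

```python
from urllib.parse import urldefrag


def reattach_fragment(requested_url: str, capture_link: str) -> str:
    """Copy the fragment of requested_url onto capture_link.

    Hypothetical helper for illustration; any fragment already present
    on capture_link is replaced. With no fragment, the link is unchanged.
    """
    fragment = urldefrag(requested_url).fragment
    base = urldefrag(capture_link).url
    return f"{base}#{fragment}" if fragment else base


timeline_url = (
    "https://webarchive.nationalarchives.gov.uk/*/"
    "https://www.arcgis.com/apps/opsdashboard/index.html"
    "#/f94c3c90da5b4e9f9a0b19484dd4bb14"
)
capture_link = (
    "https://webarchive.nationalarchives.gov.uk/ukgwa/20200402132156/"
    "https://www.arcgis.com/apps/opsdashboard/index.html"
)

link_with_fragment = reattach_fragment(timeline_url, capture_link)
print(link_with_fragment)
```

This would change only the link the browser opens. The capture itself is still the one stored for the URL without the fragment, so whether the archived page then behaves as expected depends on what was harvested.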
