Base URI logic of the HyperLinkParser doesn't respect terminated relative base tag values #231

Closed
thedeedawg opened this issue Jul 11, 2021 · 2 comments

Comments

@thedeedawg
Contributor

Description

As it stands, the HyperLinkParser seems to assume that a base tag is only valid if its href is an absolute URL. That assumption goes against the HTML standard, which allows a relative value and resolves it against the URL of the page. A relative value should be respected as well, as long as it's terminated with a trailing slash so that relative links can be built upon it.

A PR is in the works.
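
To illustrate why the trailing slash matters, here is a minimal sketch in Python rather than in the crawler's own code. It uses the standard library's `urljoin`, which follows the generic URI resolution rules; the helper name `resolve_link` is made up for this illustration and is not part of the HyperLinkParser.

```python
from urllib.parse import urljoin


def resolve_link(page_url: str, base_href: str, link_href: str) -> str:
    """Resolve link_href the way a relative base tag implies (hypothetical helper)."""
    # The base href may itself be relative, so resolve it against the page URL first.
    effective_base = urljoin(page_url, base_href)
    # Then resolve the link against that effective base.
    return urljoin(effective_base, link_href)


page = "https://www.mydomain.com/pages/"

# Terminated base: "subpages/" is kept as a directory and the link is appended to it.
terminated = resolve_link(page, "subpages/", "subpage.html")

# Unterminated base: "subpages" is treated as the last path segment and is
# replaced by the link, so the base tag ends up having no visible effect.
unterminated = resolve_link(page, "subpages", "subpage.html")

print(terminated)
print(unterminated)
```

With an unterminated value the last path segment is dropped during resolution, which is why only terminated relative values can be built upon.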


Example no. 1

  1. The crawler crawls the page https://www.mydomain.com/images/
    • The following base tag is present on the page
      <base href="/pages/">
    • The following link is present on the page
      <a href="subpage.html">Lorem ipsum dolor sit amet</a>
  2. The HyperLinkParser returns the URI https://www.mydomain.com/images/subpage.html, but the expected URI is https://www.mydomain.com/pages/subpage.html

Example no. 2

  1. The crawler crawls the page https://www.mydomain.com/pages/
    • The following base tag is present on the page
      <base href="subpages/">
    • The following link is present on the page
      <a href="subpage.html">Lorem ipsum dolor sit amet</a>
  2. The HyperLinkParser returns the URI https://www.mydomain.com/pages/subpage.html, but the expected URI is https://www.mydomain.com/pages/subpages/subpage.html
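
Both examples can be reproduced with a rough Python sketch. `resolve_reported` is only a guess at the behaviour described in this issue (a base href without a scheme is ignored); it is not taken from the HyperLinkParser source. `resolve_expected` applies the two-step resolution using the standard library's `urljoin`. Both function names are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def resolve_reported(page_url: str, base_href: str, link_href: str) -> str:
    # Assumed model of the reported behaviour: a base href that is not
    # absolute (has no scheme) is ignored and the page URL is used instead.
    base = base_href if urlsplit(base_href).scheme else page_url
    return urljoin(base, link_href)


def resolve_expected(page_url: str, base_href: str, link_href: str) -> str:
    # Expected behaviour: resolve the base href against the page URL,
    # then resolve the link against the resulting base.
    return urljoin(urljoin(page_url, base_href), link_href)


# (page URL, base href, link href) taken from the two examples in this issue.
examples = [
    ("https://www.mydomain.com/images/", "/pages/", "subpage.html"),
    ("https://www.mydomain.com/pages/", "subpages/", "subpage.html"),
]

for page, base, link in examples:
    print("reported:", resolve_reported(page, base, link))
    print("expected:", resolve_expected(page, base, link))
```

An absolute base href goes through the same two-step resolution unchanged, because resolving an absolute URL against the page URL returns it as-is.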
@sjdirect
Owner

sjdirect commented Aug 9, 2021

Is this handled correctly by this PR?

@thedeedawg
Contributor Author

Yes, closing.
