As it stands, the HyperLinkParser seems to assume that a base tag is only valid if its href contains an absolute value. That is not the case, and the assumption goes against the defined standard. The value should be respected even when it is relative, as long as it is terminated with a trailing slash so that it can be built upon.
A PR is in the works.
Example no. 1
The crawler crawls the page https://www.mydomain.com/images/
The following base tag is present on the page
<base href="/pages/">
The following link is present on the page
<a href="subpage.html">Lorem ipsum dolor sit amet</a>
The HyperLinkParser returns the URI https://www.mydomain.com/images/subpage.html, but it should return https://www.mydomain.com/pages/subpage.html instead.
Example no. 2
The crawler crawls the page https://www.mydomain.com/pages/
The following base tag is present on the page
<base href="subpages/">
The following link is present on the page
<a href="subpage.html">Lorem ipsum dolor sit amet</a>
The HyperLinkParser returns the URI https://www.mydomain.com/pages/subpage.html, but it should return https://www.mydomain.com/pages/subpages/subpage.html instead.
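For reference, the expected behaviour in both examples can be reproduced with a small sketch. It is written in Python rather than the crawler's own language, and `resolve_link` is a hypothetical helper, not part of the HyperLinkParser. The sketch assumes two-step resolution: the base href is first resolved against the page URL, and the link is then resolved against the result.

```python
from typing import Optional
from urllib.parse import urljoin


def resolve_link(page_url: str, base_href: Optional[str], link_href: str) -> str:
    """Hypothetical helper: resolve a link, honouring a relative base href."""
    # Step 1: a relative base href is itself resolved against the page URL.
    # An absolute base href simply replaces the page URL here.
    base_url = urljoin(page_url, base_href) if base_href else page_url
    # Step 2: the link is resolved against the effective base URL.
    return urljoin(base_url, link_href)


# Example no. 1
print(resolve_link("https://www.mydomain.com/images/", "/pages/", "subpage.html"))
# https://www.mydomain.com/pages/subpage.html

# Example no. 2
print(resolve_link("https://www.mydomain.com/pages/", "subpages/", "subpage.html"))
# https://www.mydomain.com/pages/subpages/subpage.html
```

When no base tag is present, the sketch falls back to the page URL, so links on pages without a base tag resolve as before.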