Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved base tag handling #232

Merged
merged 5 commits into from
Aug 9, 2021
Merged

Improved base tag handling #232

merged 5 commits into from
Aug 9, 2021

Conversation

thedeedawg
Copy link
Contributor

Takes care of issue #230 and #231.

This development aims to improve upon the current handling of base tags within the HyperLinkParser. More specifically, this introduces support for relative base tags in addition to fixing a bug relating to root-relative values when running on Linux. For more information, check the linked issues above.

The testing of this functionality have also been expanded upon to cover more ground. To that end, all of the base tag tests have been consolidated into one data driven test with multiple outcomes, as the tests are otherwise completely the same. All current test cases are still present, but the input for the invalid base tag case had to be adjusted. The old value of http:http://http: technically is a valid relative value according to the defined standard and as such the new logic would succeed in creating a Uri instance from it since relative values are now supported. However, the overall goal of the test case (ensuring that invalid values won't break the parsing) remains intact.

}

[TestMethod]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not delete ANY tests and make your test follow the same naming convention. Even if it's a little more ceremony.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my knowledge, the consolidated test did follow the naming convention. The DisplayName is supposed to just be the name of the individual outcome of that test, but I'm not familiar with the testing library being used here so I could be wrong.

{
// ignored
}
baseHref = uriToUse.Scheme + ":" + baseHref;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the crawledPage.Uri.Scheme is "more" correct here as we can't change the scheme of the crawl because the responding server returned a new scheme. It believe it would cause some confusion under that scenario.

Copy link
Contributor Author

@thedeedawg thedeedawg Jul 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would disagree. Since the responding server is the responding server, it should dictate how you're interpreting the base tag - just like your browser would if you were to visit the page. It's also a weird inconsistency, in that if you don't have a base tag, any relative links will be using the URI of the responding server. But if you then happen to have a base tag with a deferred scheme, the scheme is now all of a sudden decided by something else.

Either way, as the change doesn't relate directly to any of the 2 issues that this PR aims to resolve, I'll remove the change.

@sjdirect
Copy link
Owner

Would you be willing to refactor your changes to the Test class please? I understand semantically what you are trying to achieve with some of your changes but the diff makes it hard to quickly understand the impact of your changes to the tests. It's more important to have consistency in the code base and the keep the pr be as MINIMAL as possible. Also please do NOT delete/alter any tests that aren't necessary as this removes confidence that the change isn't causing issues with original set of tests.

@@ -282,7 +282,7 @@ public void GetLinks_RelativeBaseTagPresent_ReturnsRelativeLinksPageUri()
[TestMethod]
public void GetLinks_InvalidBaseTagPresent_ReturnsRelativeLinksPageUri()
{
_crawledPage.Content.Text = "<base href=\"http:http://http:\"><a href=\"http://aaa.com/\" ></a><a href=\"/aaa/a.html\" /></a>";
_crawledPage.Content.Text = "<base href=\"////images/\"><a href=\"http://aaa.com/\"></a><a href=\"/aaa/a.html\"></a><a href=\"aaa/a.html\"></a>";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, the value of the base tag didn't have to be updated here, if the goal simply is to have a passing test. Leaving the base tag value as it was, does however mean that it's no longer testing what it was previously - which is, "what happens if an URI instance can't be created from it".

@thedeedawg thedeedawg requested a review from sjdirect July 16, 2021 09:09
@sjdirect sjdirect merged commit 0363d01 into sjdirect:master Aug 9, 2021
@sjdirect
Copy link
Owner

sjdirect commented Aug 9, 2021

PR accepted, appreciate the contribution

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants