Improved base tag handling #232

thedeedawg · 2021-07-12T07:44:28Z

Takes care of issue #230 and #231.

This development aims to improve upon the current handling of base tags within the HyperLinkParser. More specifically, this introduces support for relative base tags in addition to fixing a bug relating to root-relative values when running on Linux. For more information, check the linked issues above.

The testing of this functionality have also been expanded upon to cover more ground. To that end, all of the base tag tests have been consolidated into one data driven test with multiple outcomes, as the tests are otherwise completely the same. All current test cases are still present, but the input for the invalid base tag case had to be adjusted. The old value of http:http://http: technically is a valid relative value according to the defined standard and as such the new logic would succeed in creating a Uri instance from it since relative values are now supported. However, the overall goal of the test case (ensuring that invalid values won't break the parsing) remains intact.

…ssue with root-relative values on Linux.

Abot2.Tests.Unit/Core/HyperlinkParserTest.cs

sjdirect · 2021-07-13T19:59:27Z

Abot2.Tests.Unit/Core/HyperlinkParserTest.cs

        }

-        [TestMethod]


Please do not delete ANY tests and make your test follow the same naming convention. Even if it's a little more ceremony.

To my knowledge, the consolidated test did follow the naming convention. The DisplayName is supposed to just be the name of the individual outcome of that test, but I'm not familiar with the testing library being used here so I could be wrong.

Abot2.Tests.Unit/Core/HyperlinkParserTest.cs

sjdirect · 2021-07-13T20:01:05Z

Abot2/Core/HyperLinkParser.cs

-                {
-                    // ignored
-                }
+                    baseHref = uriToUse.Scheme + ":" + baseHref;


I believe the crawledPage.Uri.Scheme is "more" correct here as we can't change the scheme of the crawl because the responding server returned a new scheme. It believe it would cause some confusion under that scenario.

I would disagree. Since the responding server is the responding server, it should dictate how you're interpreting the base tag - just like your browser would if you were to visit the page. It's also a weird inconsistency, in that if you don't have a base tag, any relative links will be using the URI of the responding server. But if you then happen to have a base tag with a deferred scheme, the scheme is now all of a sudden decided by something else.

Either way, as the change doesn't relate directly to any of the 2 issues that this PR aims to resolve, I'll remove the change.

sjdirect · 2021-07-13T20:02:34Z

Would you be willing to refactor your changes to the Test class please? I understand semantically what you are trying to achieve with some of your changes but the diff makes it hard to quickly understand the impact of your changes to the tests. It's more important to have consistency in the code base and the keep the pr be as MINIMAL as possible. Also please do NOT delete/alter any tests that aren't necessary as this removes confidence that the change isn't causing issues with original set of tests.

…red scheme.

thedeedawg · 2021-07-16T09:05:18Z

Abot2.Tests.Unit/Core/HyperlinkParserTest.cs

@@ -282,7 +282,7 @@ public void GetLinks_RelativeBaseTagPresent_ReturnsRelativeLinksPageUri()
        [TestMethod]
        public void GetLinks_InvalidBaseTagPresent_ReturnsRelativeLinksPageUri()
        {
-            _crawledPage.Content.Text = "<base href=\"http:http://http:\"><a href=\"http://aaa.com/\" ></a><a href=\"/aaa/a.html\" /></a>";
+            _crawledPage.Content.Text = "<base href=\"////images/\"><a href=\"http://aaa.com/\"></a><a href=\"/aaa/a.html\"></a><a href=\"aaa/a.html\"></a>";


Technically, the value of the base tag didn't have to be updated here, if the goal simply is to have a passing test. Leaving the base tag value as it was, does however mean that it's no longer testing what it was previously - which is, "what happens if an URI instance can't be created from it".

sjdirect · 2021-08-09T22:12:53Z

PR accepted, appreciate the contribution

thedeedawg added 2 commits July 11, 2021 01:14

Fixed typo in filename.

b1da405

Implemented support for relative base tag values, and also fixed an i…

ca5426e

…ssue with root-relative values on Linux.