Added TruncateByWords method + overloads #2043

robert-cpl · 2017-07-12T08:36:19Z

No description provided.

clausjensen

There are a few fixes that need to be done :)

clausjensen · 2017-07-12T09:57:45Z

src/Umbraco.Web/UmbracoHelper.cs

+        /// </summary>
+        public IHtmlString TruncateByWords(DynamicNull html, int words, bool addElipsis, bool treatTagsAsContent)
+        {
+            return new HtmlString(string.Empty);


and this one too :)

clausjensen · 2017-07-12T09:58:12Z

src/Umbraco.Web/UmbracoHelper.cs

+        #region Truncate by Words
+        public IHtmlString TruncateByWords(DynamicNull html, int words)
+        {
+            return new HtmlString(string.Empty);


This one doesn't really do anything :)

clausjensen · 2017-07-12T09:58:19Z

src/Umbraco.Web/UmbracoHelper.cs

+        /// </summary>
+        public IHtmlString TruncateByWords(DynamicNull html, int words, bool addElipsis)
+        {
+            return new HtmlString(string.Empty);


and this one :)

nul800sebastiaan · 2017-08-24T16:00:55Z

src/Umbraco.Tests/FrontEnd/UmbracoHelperTests.cs

+        [Test]
+        public void Truncate_By_Words_With_Tag()
+        {
+            var text = "Hello world, <b>this</b> is some text <a href='blah'>with a link</a>";


Try changing this to Hello world, this is some. You'll notice it will not work because it is pretty hard to do this correctly with HTML in the mix.

I don't know why we'd want this in the core; there are libraries out there to help with this, so we don't have to reinvent the wheel. Example: https://robvolk.com/truncate-html-string-c-extension-method-51c83e6d4969

And proving that it is pretty difficult, he still has an open issue: robvolk/Helpers.Net#3

Also note that it is entirely possible that I put invalid HTML in there, so something like testing word truncation - the  is missing, that should cause some "fun" things to happen too.

If we want this in core, I'd suggest we create a version that strips all HTML so we don't run the risk of breaking the page this is used on.

Ah, I guess we already went down this rabbit hole with the truncate method for number of characters, which is also already a pretty complex method. Yikes.

Hey @nul800sebastiaan, about the Hello world, this is some: I was pretty sure that it worked, as I am stripping the HTML elements from the string, converting it to a char length and feeding it to the original Truncate method. I will look into it again today.

About invalid tags: funnily enough, I ran the string you provided and the Truncate method closes the tags correctly. Probably not as you wanted, but at least it is not invalid HTML. Talking about invalid HTML, should we worry about that? I mean, if you type invalid HTML it should break.

Oh, and when you tried all that, was "treatTagsAsContent" true or false? Either way, I'll go through another round of testing today. Thanks for your input :).

nul800sebastiaan · 2017-08-25T07:40:14Z

Cool, I didn't try to use treatTagsAsContent :)

So the problem with this is that parsing HTML with Regex.. is a problem ;-)

Make sure to read this excellent article on why this is a problem:
https://blog.codinghorror.com/parsing-html-the-cthulhu-way/

We can implement this, but it will have to come with a note saying: "this is how it works and we will not be able to accommodate all of your feature requests; if you need anything different from this you should consider building your own."

First of all, to make sure we have valid HTML, we need to sanitize it using HtmlAgilityPack; then we can make a best-effort attempt at closing any still-open tags.

I have googled this for about an hour and I think the "best" way to do this is like so:
https://gist.github.com/q42jaap/2413598

robert-cpl · 2017-08-25T07:46:38Z

Nice, I will look into it :).

…uncate by words methods
Fix: Original Truncate method should now be able to trim
Add: Added some more tests for Truncate by words to cater for tagsAsContent parameter
Fix: In original truncate, currentTextLength prop will be increased every time a tag is added; this will fix an issue when tagsAsContent is set to true (it would have added an extra char per tag, which would have gone over to the next word)

Shazwazza · 2017-08-30T00:48:47Z

src/Umbraco.Web/HtmlStringUtilities.cs

+
+                        //Check to see if there is an empty char between the hellip and the output string
+                        //if there is, remove it
+                        result = firsTrim[firsTrim.Length -9] == ' ' ? firsTrim.Remove(firsTrim.Length - 9, 1) : firsTrim;


@RobertCopilau I'm just wondering if this can cause an error, since it does not check whether the firsTrim string is long enough to do firsTrim[firsTrim.Length -9]. Is there a guarantee that firsTrim will absolutely be that long here?

Also, firsTrim should be firstTrim, I think?

Yes, you are correct, it will throw an error, I will get it done in a few minutes. And yes it was supposed to be firstTrim :).
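For reference, a minimal sketch of the kind of guard being asked for here. It is not the code that landed in the PR: the class and method names are hypothetical, and it assumes the string ends with the 8-character "&hellip;" entity, which is where the magic number 9 in the original line comes from.

```csharp
using System;

// Hypothetical helper, for illustration only; not part of Umbraco.
public static class EllipsisTrimSketch
{
    private const string Hellip = "&hellip;";

    // Removes a single space that sits directly before a trailing "&hellip;".
    // Returns the input unchanged when it is too short or has no trailing entity.
    public static string RemoveSpaceBeforeHellip(string firstTrim)
    {
        if (string.IsNullOrEmpty(firstTrim)) return string.Empty;

        // Index of the character just before the entity (Length - 9 in the original).
        int spaceIndex = firstTrim.Length - Hellip.Length - 1;

        bool canCheck = spaceIndex >= 0
            && firstTrim.EndsWith(Hellip, StringComparison.Ordinal);

        return canCheck && firstTrim[spaceIndex] == ' '
            ? firstTrim.Remove(spaceIndex, 1)
            : firstTrim;
    }
}
```

Deriving the index from the constant's length instead of hard-coding 9 keeps the check correct if the ellipsis entity ever changes.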

Shazwazza · 2017-08-31T06:04:15Z

src/Umbraco.Web/HtmlStringUtilities.cs

@@ -235,10 +247,82 @@ public IHtmlString Truncate(string html, int length, bool addElipsis, bool treat
                    outputms.Position = 0;
                    using (TextReader outputtr = new StreamReader(outputms))
                    {
-                        return new HtmlString(outputtr.ReadToEnd().Replace("  ", " ").Trim());
+                        string result = String.Empty;


We have some best practices and follow normal Microsoft standards (which are also enforced by ReSharper): any reference to string should use the type qualifier string, not the object qualifier String, so this should be string.Empty. Similarly, String.IsNullOrEmpty below should be string.IsNullOrEmpty.

On that note, this should check for IsNullOrWhiteSpace instead. We also have an extension method for this, so you could do: firstTrim.IsNullOrWhiteSpace()
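A small illustration of why the whitespace check matters, using only the built-in .NET methods. The class name is made up for the example, and the Umbraco extension method mentioned above is not used here so that the snippet stays self-contained.

```csharp
using System;

// Illustrative only: compares the two built-in checks on a whitespace-only string.
public static class BlankCheckSketch
{
    public static void Main()
    {
        string firstTrim = "   ";

        // IsNullOrEmpty treats a whitespace-only string as having content.
        Console.WriteLine(string.IsNullOrEmpty(firstTrim));      // False

        // IsNullOrWhiteSpace treats it as blank.
        Console.WriteLine(string.IsNullOrWhiteSpace(firstTrim)); // True
    }
}
```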

Shazwazza · 2017-08-31T06:08:55Z

src/Umbraco.Web/HtmlStringUtilities.cs

+                    //Check if we have a space inside a tag and increase the length if we do
+                    if (html[length].Equals('<') && html[length + 1].Equals('/') == false && tagsAsContent)
+                    {
+                        while (html[insideTagCounter].Equals('>') == false)


If there is incorrectly formatted html, could this cause an infinite loop? (i.e. there won't ever be a '>')

What a noob :/, I'll fix those above today.
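One possible way to make the tag scan safe against a missing '>' is sketched below. This is an assumption about how the loop could be bounded, not the fix that was committed; the class and method names are hypothetical.

```csharp
using System;

// Hypothetical helper, for illustration only; not part of Umbraco.
public static class TagScanSketch
{
    // Returns the index just past the '>' that closes the tag starting at tagStart,
    // or -1 when there is no closing '>' (malformed HTML) or tagStart is out of range.
    public static int IndexAfterTagEnd(string html, int tagStart)
    {
        if (html == null || tagStart < 0 || tagStart >= html.Length) return -1;

        // IndexOf stops at the end of the string, so the scan can never run away.
        int close = html.IndexOf('>', tagStart);
        return close < 0 ? -1 : close + 1;
    }
}
```

The caller can then treat -1 as "malformed, fall back to plain truncation" instead of indexing past the end of the string.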

Shazwazza

Some additional feedback and questions

Shazwazza

Some changes required for the tests - or maybe I'm wrong which in that case please advise :)

I've pushed one commit for you to review: I've changed a string to a constant, and I found a variable that doesn't seem to be used for anything, so I have commented it out. Please review.

Shazwazza · 2017-09-05T23:14:44Z

src/Umbraco.Web/HtmlStringUtilities.cs

@@ -85,6 +87,8 @@ internal string Coalesce<TIgnore>(params object[] args)

        public IHtmlString Truncate(string html, int length, bool addElipsis, bool treatTagsAsContent)
        {
+            string hellip = "&hellip;";


Should be a constant - I'll fix this

Shazwazza · 2017-09-05T23:58:55Z

src/Umbraco.Tests/FrontEnd/UmbracoHelperTests.cs

+
+            var result = helper.TruncateByWords(text, 4, true, false).ToString();
+
+            Assert.AreEqual("Hello world, <b>this</b> is&hellip;", result);


What is the difference between Truncate_By_Words_With_Tag_TagsAsContentOff and Truncate_By_Words_With_Tag_TagsAsContentOn ? They yield the exact same result so doesn't really seem like this boolean flag is doing anything especially with regards to these tests. In fact, it kind of seems the same for all tests with ContentOn or ContentOff, the expectations are always the same so either these tests are not testing the correct behavior or the boolean flag doesn't actually do anything.

The difference between TagsAsContent on and off is that, when it is on, isolated tags will be treated as words. For example:
TagsAsContentOn

Hello world. It will contain two words.

Hello  world. This one will have three words because  is isolated.

TagsAsContentOff

Hello world. It will contain two words.

Hello  world. Still two words.

My fault that I haven't shown this in the tests.
Looking through this again, I find that treating tags as content has no real use and counting tags as words should not be a thing; the way it works right now is just confusing.

I will clean this up and only count the actual words in the string.

Shazwazza · 2017-09-06T00:00:00Z

src/Umbraco.Tests/FrontEnd/UmbracoHelperTests.cs

+
+            var result = helper.TruncateByWords(text, 4, true, false).ToString();
+
+            Assert.AreEqual("Hello world, this is&hellip;", result);


Is this test testing the correct behaviour? There is HTML in this string but we are truncating before even reaching the HTML. Or maybe there's a test missing for the HTML stripping behaviour?

TruncateByWords is the container where the "WordsToLength" and "Truncate" methods reside. The HTML stripping happens in "WordsToLength", which converts the number of words to a char length and then feeds that number to the "Truncate" method.
But you are right, there are no tests for the HTML stripping - I will add some.
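To make the words-to-length step concrete, here is a rough sketch of the idea. The PR's actual WordsToLength body is not shown in this thread, so everything below is an assumption: it works on text whose tags have already been stripped, and it splits on whitespace only.

```csharp
using System;

// Hypothetical stand-in for the PR's WordsToLength, for illustration only.
public static class WordsToLengthSketch
{
    // Returns how many characters are needed to cover the first `words` words
    // of plain text (assumes HTML tags were stripped beforehand).
    public static int WordsToLength(string text, int words)
    {
        if (string.IsNullOrEmpty(text) || words <= 0) return 0;

        int counted = 0;
        bool inWord = false;

        for (int i = 0; i < text.Length; i++)
        {
            bool isSpace = char.IsWhiteSpace(text[i]);
            if (!isSpace && !inWord)
            {
                inWord = true;               // a new word starts here
            }
            else if (isSpace && inWord)
            {
                inWord = false;              // the current word just ended
                counted++;
                if (counted == words) return i;
            }
        }

        // Fewer words than requested: the whole text is needed.
        return text.Length;
    }
}
```

With the test string used in this PR, asking for 4 words of "Hello world, this is some text" gives 20, which is the length of "Hello world, this is"; that number would then be handed to the character-based Truncate.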

Shazwazza · 2017-09-06T00:00:41Z

src/Umbraco.Tests/FrontEnd/UmbracoHelperTests.cs

+        }
+
+        [Test]
+        public void Truncate_By_Words_Mid_Tag_TagsAsContentOn()


There seem to be tests missing for invalid HTML parsing and for the logic that is in place to work around those types of problems.

I'll take care of that.

robert-cpl

Hopefully I have addressed all the concerns.

Added TruncateByWords method + overloads

769f0d9

clausjensen reviewed Jul 12, 2017

Robert and others added 3 commits July 12, 2017 12:28

Cleared useless methods

cddc6f3

added a few tests for the new UmbracoHelper methods.

bb01eca

Fix TruncateByWords to work with tags

54f86c0

nul800sebastiaan reviewed Aug 24, 2017

Robert added 3 commits August 25, 2017 11:39

Having a space in a tag is now taken as a char length and not as a sp…

047a41c

…ace between words

Replaced the Regex parse with HtmlAgilityPack parser

68fd748

Shazwazza reviewed Aug 30, 2017

Null check for firstTrim added and fixed typo

039ab8f

Shazwazza reviewed Aug 31, 2017

Shazwazza suggested changes Aug 31, 2017

Robert and others added 4 commits August 31, 2017 09:15

Changed from the String object qualifier to type qualifier

8c1996a

Added check for invalid HTML and sanitization on said HTML

7393717

Added HtmlSanitization package. Added check for invalid HTML and sanitization on WordsToLength and Truncate methods. Added some extra comments.

Reverted back to last commit, invalid html check added

972d5f8

switches to constant - comments out what seems to be a totally unused…

4b5c499

… variable

Shazwazza suggested changes Sep 6, 2017

Shazwazza and others added 6 commits September 6, 2017 10:05

Merge branch 'TEMP-u4-10135' of https://github.com/umbraco/Umbraco-CMS …

a603fd0

…into TEMP-u4-10135 # Conflicts: # src/Umbraco.Web/HtmlStringUtilities.cs

Replaced html stripping with already existing function, removing unus…

b2798e3

…ed code

Cleaning up tests and useless TruncateByWords functions

b135893

Forgot to clean test names

6a3f94d

Added Html Strip tests

851d587

Added some trimming and test to StripHtmlTags

18d0fec

robert-cpl commented Sep 6, 2017

Shazwazza merged commit 41eeb38 into dev-v7 Sep 12, 2017

Shazwazza deleted the TEMP-u4-10135 branch September 12, 2017 03:36
