-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added TruncateByWords method + overloads #2043
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
theres a few fixes that needs to be done :)
src/Umbraco.Web/UmbracoHelper.cs
Outdated
/// </summary> | ||
public IHtmlString TruncateByWords(DynamicNull html, int words, bool addElipsis, bool treatTagsAsContent) | ||
{ | ||
return new HtmlString(string.Empty); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and this one too :)
src/Umbraco.Web/UmbracoHelper.cs
Outdated
#region Truncate by Words | ||
public IHtmlString TruncateByWords(DynamicNull html, int words) | ||
{ | ||
return new HtmlString(string.Empty); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one doesn't really do anything :)
src/Umbraco.Web/UmbracoHelper.cs
Outdated
/// </summary> | ||
public IHtmlString TruncateByWords(DynamicNull html, int words, bool addElipsis) | ||
{ | ||
return new HtmlString(string.Empty); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and this one :)
[Test] | ||
public void Truncate_By_Words_With_Tag() | ||
{ | ||
var text = "Hello world, <b>this</b> is some text <a href='blah'>with a link</a>"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try changing this to Hello world, <b>this is some</b>
. You'll notice it will not work because it is pretty hard to do this correctly with HTML in the mix.
I don't know why we'd want this in the core, there's libraries out there to help with this, so we don't have to reinvent the wheel, example: https://robvolk.com/truncate-html-string-c-extension-method-51c83e6d4969
And proving that it is pretty difficult, he still has an open issue: robvolk/Helpers.Net#3
Also note that it is entirely possible that I put invalid HTML in there, so something like <p>testing <b>word truncation</p>
- the </b>
is missing, that should cause some "fun" things to happen too.
If we want this in core, I'd suggest we create a version that strips all HTML so we don't run the risk of breaking the page this is used on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I guess we already went down this rabbithole with the truncate method for number of characters.. which is also already a pretty complex method. Yikes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @nul800sebastiaan , about the Hello world, <b>this is some</b>
, I was pretty sure that it worked as I am stripping the HTML elements from the string, convert it to char length and feed it to the original truncate method. I will look into it again today.
About invalid tags, funny enough I ran the string you provided and the Truncate method closes the tags correctly, probably not as you wanted but... at least is not invalid HTML. Talking about invalid HTML, should we worry about that? I mean, if you type invalid HTML it should brake.
Oh, and when you tried all that, was "treatTagsAsContent" true or false? Either way, I'll go trough another round of testing today, thanks for your input :).
Cool, I didn't try to use So the problem with this is that parsing HTML with Regex.. is a problem ;-) Make sure to read this excellent article on why this is a problem: We can implement this but will have to be with a note saying : "this is how it works and we will not be able to accommodate all of your feature requests, if you need anything different from this you should consider building your own." First of all, to make sure we have valid HTML, we need to sanatize it using HtmlAgilityPack, then we can make a best-effort attempt at closing any still open tags. I have googled this for about an hour and I think the "best" way to do this is like so: |
Nice, I will look into it :). |
…uncate by words methods Fix: Original Truncate method show now be able to trim Add: Added some more tests for Truncate by words to cater for tagsAsContent parameter Fix: In original truncate, currentTextLength prop will be increased everytime a tag is added, this will fix an issue when tagsAsContent is set to true (it would have added an extra char per tag, which would have went over to the next word)
…ace between words
|
||
//Check to see if there is an empty char between the hellip and the output string | ||
//if there is, remove it | ||
result = firsTrim[firsTrim.Length -9] == ' ' ? firsTrim.Remove(firsTrim.Length - 9, 1) : firsTrim; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RobertCopilau I'm just wondering if this can cause an error since this is not checking if the firsTrim
string has enough length in it to do a firsTrim[firsTrim.Length -9]
is there a guarantee that firsTrim
will absolutely be that long here?
Also, firsTrim
should be firstTrim
i think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are correct, it will throw an error, I will get it done in a few minutes. And yes it was supposed to be firstTrim :).
@@ -235,10 +247,82 @@ public IHtmlString Truncate(string html, int length, bool addElipsis, bool treat | |||
outputms.Position = 0; | |||
using (TextReader outputtr = new StreamReader(outputms)) | |||
{ | |||
return new HtmlString(outputtr.ReadToEnd().Replace(" ", " ").Trim()); | |||
string result = String.Empty; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have some best practices and follow normal Microsoft standards (which is also enforced by Resharper), any reference to string
should use the type qualifier of string
not the object qualifier of String
so this should be
string.Empty
similarly below String.IsNullOrEmpty
should be string.IsNullOrEmpty
. On that note, this should check for IsNullOrWhitespace
instead. We also have an extension method for this so you could do: firstTrim.IsNullOrWhitespace()
//Check if we have a space inside a tag and increase the length if we do | ||
if (html[length].Equals('<') && html[length + 1].Equals('/') == false && tagsAsContent) | ||
{ | ||
while (html[insideTagCounter].Equals('>') == false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is incorrectly formatted html, could this cause an infinite loop? (i.e. there won't ever be a '>')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What a noob :/, I'll fix those above today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some additional feedback and questions
Added HtmlSanitization package Added check for invalid HTML and sanitization on WordsToLength and Truncate methods Addes some extra comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some changes required for the tests - or maybe I'm wrong which in that case please advise :)
I've pushed on commit for you to review, i've changed a string to a constant and found a variable that doesn't seem to be used for anything so have commented it out if you can review.
@@ -85,6 +87,8 @@ internal string Coalesce<TIgnore>(params object[] args) | |||
|
|||
public IHtmlString Truncate(string html, int length, bool addElipsis, bool treatTagsAsContent) | |||
{ | |||
string hellip = "…"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be a constant - i'll fix this
|
||
var result = helper.TruncateByWords(text, 4, true, false).ToString(); | ||
|
||
Assert.AreEqual("Hello world, <b>this</b> is…", result); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference between Truncate_By_Words_With_Tag_TagsAsContentOff and Truncate_By_Words_With_Tag_TagsAsContentOn ? They yield the exact same result so doesn't really seem like this boolean flag is doing anything especially with regards to these tests. In fact, it kind of seems the same for all tests with ContentOn or ContentOff, the expectations are always the same so either these tests are not testing the correct behavior or the boolean flag doesn't actually do anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference between TagsAsContent on/off, is that when is On, isolated tags will be treated as words, for example:
TagsAsContentOn
- Hello
<b>
world</b>
. It will contain two words. - Hello
<b>
world</b>
. This one will have three words because<b>
is isolated.
TagsAsContentOff
- Hello
<b>
world</b>
. It will contain two words. - Hello
<b>
world</b>
. Still two words.
My fault that I haven't shown this in the tests.
Looking trough this again, I find that treating tags as content has no real uses and counting tags as words should not be a thing - and the state is right now, is just confusing.
I will clean this up and only count the actual words in the string.
|
||
var result = helper.TruncateByWords(text, 4, true, false).ToString(); | ||
|
||
Assert.AreEqual("Hello world, this is…", result); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this test testing the correct behaviour? There is HTML in this string but we are truncating before even reaching the HTML. Or maybe there's test missing for testing the HTML stripping behaviour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TruncateByWord is the container were "WordsToLength" and "Truncate" methods resides, and the HTML stripping happens in the "WordsToLength" which turns amount of words to char length and then feeds that number to the "Truncate" method.
But you are right there are no tests for the html stripping - I will add some.
} | ||
|
||
[Test] | ||
public void Truncate_By_Words_Mid_Tag_TagsAsContentOn() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be tests missing for testing invalid html parsing and the logic that is in place to work around those types of problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take care of that.
…into TEMP-u4-10135 # Conflicts: # src/Umbraco.Web/HtmlStringUtilities.cs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully I have addressed all the concerns.
|
||
var result = helper.TruncateByWords(text, 4, true, false).ToString(); | ||
|
||
Assert.AreEqual("Hello world, this is…", result); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TruncateByWord is the container were "WordsToLength" and "Truncate" methods resides, and the HTML stripping happens in the "WordsToLength" which turns amount of words to char length and then feeds that number to the "Truncate" method.
But you are right there are no tests for the html stripping - I will add some.
} | ||
|
||
[Test] | ||
public void Truncate_By_Words_Mid_Tag_TagsAsContentOn() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take care of that.
|
||
var result = helper.TruncateByWords(text, 4, true, false).ToString(); | ||
|
||
Assert.AreEqual("Hello world, <b>this</b> is…", result); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference between TagsAsContent on/off, is that when is On, isolated tags will be treated as words, for example:
TagsAsContentOn
- Hello
<b>
world</b>
. It will contain two words. - Hello
<b>
world</b>
. This one will have three words because<b>
is isolated.
TagsAsContentOff
- Hello
<b>
world</b>
. It will contain two words. - Hello
<b>
world</b>
. Still two words.
My fault that I haven't shown this in the tests.
Looking trough this again, I find that treating tags as content has no real uses and counting tags as words should not be a thing - and the state is right now, is just confusing.
I will clean this up and only count the actual words in the string.
No description provided.