Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added TruncateByWords method + overloads #2043

Merged
merged 18 commits into from
Sep 12, 2017
Merged

Added TruncateByWords method + overloads #2043

merged 18 commits into from
Sep 12, 2017

Conversation

robert-cpl
Copy link
Contributor

No description provided.

Copy link
Contributor

@clausjensen clausjensen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

theres a few fixes that needs to be done :)

/// </summary>
public IHtmlString TruncateByWords(DynamicNull html, int words, bool addElipsis, bool treatTagsAsContent)
{
return new HtmlString(string.Empty);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and this one too :)

#region Truncate by Words
public IHtmlString TruncateByWords(DynamicNull html, int words)
{
return new HtmlString(string.Empty);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one doesn't really do anything :)

/// </summary>
public IHtmlString TruncateByWords(DynamicNull html, int words, bool addElipsis)
{
return new HtmlString(string.Empty);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and this one :)

[Test]
public void Truncate_By_Words_With_Tag()
{
var text = "Hello world, <b>this</b> is some text <a href='blah'>with a link</a>";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try changing this to Hello world, <b>this is some</b>. You'll notice it will not work because it is pretty hard to do this correctly with HTML in the mix.

I don't know why we'd want this in the core, there's libraries out there to help with this, so we don't have to reinvent the wheel, example: https://robvolk.com/truncate-html-string-c-extension-method-51c83e6d4969

And proving that it is pretty difficult, he still has an open issue: robvolk/Helpers.Net#3

Also note that it is entirely possible that I put invalid HTML in there, so something like <p>testing <b>word truncation</p> - the </b> is missing, that should cause some "fun" things to happen too.

If we want this in core, I'd suggest we create a version that strips all HTML so we don't run the risk of breaking the page this is used on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I guess we already went down this rabbithole with the truncate method for number of characters.. which is also already a pretty complex method. Yikes.

Copy link
Contributor Author

@robert-cpl robert-cpl Aug 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @nul800sebastiaan , about the Hello world, <b>this is some</b>, I was pretty sure that it worked as I am stripping the HTML elements from the string, convert it to char length and feed it to the original truncate method. I will look into it again today.

About invalid tags, funny enough I ran the string you provided and the Truncate method closes the tags correctly, probably not as you wanted but... at least is not invalid HTML. Talking about invalid HTML, should we worry about that? I mean, if you type invalid HTML it should brake.

Oh, and when you tried all that, was "treatTagsAsContent" true or false? Either way, I'll go trough another round of testing today, thanks for your input :).

@nul800sebastiaan
Copy link
Member

Cool, I didn't try to use treatTagsAsContent :)

So the problem with this is that parsing HTML with Regex.. is a problem ;-)

Make sure to read this excellent article on why this is a problem:
https://blog.codinghorror.com/parsing-html-the-cthulhu-way/

We can implement this but will have to be with a note saying : "this is how it works and we will not be able to accommodate all of your feature requests, if you need anything different from this you should consider building your own."

First of all, to make sure we have valid HTML, we need to sanatize it using HtmlAgilityPack, then we can make a best-effort attempt at closing any still open tags.

I have googled this for about an hour and I think the "best" way to do this is like so:
https://gist.github.com/q42jaap/2413598

@robert-cpl
Copy link
Contributor Author

Nice, I will look into it :).

Robert added 3 commits August 25, 2017 11:39
…uncate by words methods

Fix: Original Truncate method show now be able to trim
Add: Added some more tests for Truncate by words to cater for tagsAsContent parameter
Fix: In original truncate, currentTextLength prop will be increased everytime a tag is added, this will fix an issue when tagsAsContent is set to true (it would have added an extra char per tag, which would have went over to the next word)

//Check to see if there is an empty char between the hellip and the output string
//if there is, remove it
result = firsTrim[firsTrim.Length -9] == ' ' ? firsTrim.Remove(firsTrim.Length - 9, 1) : firsTrim;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RobertCopilau I'm just wondering if this can cause an error since this is not checking if the firsTrim string has enough length in it to do a firsTrim[firsTrim.Length -9] is there a guarantee that firsTrim will absolutely be that long here?

Also, firsTrim should be firstTrim i think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are correct, it will throw an error, I will get it done in a few minutes. And yes it was supposed to be firstTrim :).

@@ -235,10 +247,82 @@ public IHtmlString Truncate(string html, int length, bool addElipsis, bool treat
outputms.Position = 0;
using (TextReader outputtr = new StreamReader(outputms))
{
return new HtmlString(outputtr.ReadToEnd().Replace(" ", " ").Trim());
string result = String.Empty;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have some best practices and follow normal Microsoft standards (which is also enforced by Resharper), any reference to string should use the type qualifier of string not the object qualifier of String so this should be

string.Empty similarly below String.IsNullOrEmpty should be string.IsNullOrEmpty . On that note, this should check for IsNullOrWhitespace instead. We also have an extension method for this so you could do: firstTrim.IsNullOrWhitespace()

//Check if we have a space inside a tag and increase the length if we do
if (html[length].Equals('<') && html[length + 1].Equals('/') == false && tagsAsContent)
{
while (html[insideTagCounter].Equals('>') == false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is incorrectly formatted html, could this cause an infinite loop? (i.e. there won't ever be a '>')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What a noob :/, I'll fix those above today.

Copy link
Contributor

@Shazwazza Shazwazza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional feedback and questions

Robert and others added 4 commits August 31, 2017 09:15
Copy link
Contributor

@Shazwazza Shazwazza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some changes required for the tests - or maybe I'm wrong which in that case please advise :)

I've pushed on commit for you to review, i've changed a string to a constant and found a variable that doesn't seem to be used for anything so have commented it out if you can review.

@@ -85,6 +87,8 @@ internal string Coalesce<TIgnore>(params object[] args)

public IHtmlString Truncate(string html, int length, bool addElipsis, bool treatTagsAsContent)
{
string hellip = "&hellip;";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be a constant - i'll fix this


var result = helper.TruncateByWords(text, 4, true, false).ToString();

Assert.AreEqual("Hello world, <b>this</b> is&hellip;", result);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between Truncate_By_Words_With_Tag_TagsAsContentOff and Truncate_By_Words_With_Tag_TagsAsContentOn ? They yield the exact same result so doesn't really seem like this boolean flag is doing anything especially with regards to these tests. In fact, it kind of seems the same for all tests with ContentOn or ContentOff, the expectations are always the same so either these tests are not testing the correct behavior or the boolean flag doesn't actually do anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference between TagsAsContent on/off, is that when is On, isolated tags will be treated as words, for example:
TagsAsContentOn

  • Hello <b>world</b>. It will contain two words.
  • Hello <b> world</b>. This one will have three words because <b> is isolated.

TagsAsContentOff

  • Hello <b>world</b>. It will contain two words.
  • Hello <b> world</b>. Still two words.

My fault that I haven't shown this in the tests.
Looking trough this again, I find that treating tags as content has no real uses and counting tags as words should not be a thing - and the state is right now, is just confusing.

I will clean this up and only count the actual words in the string.


var result = helper.TruncateByWords(text, 4, true, false).ToString();

Assert.AreEqual("Hello world, this is&hellip;", result);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test testing the correct behaviour? There is HTML in this string but we are truncating before even reaching the HTML. Or maybe there's test missing for testing the HTML stripping behaviour?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TruncateByWord is the container were "WordsToLength" and "Truncate" methods resides, and the HTML stripping happens in the "WordsToLength" which turns amount of words to char length and then feeds that number to the "Truncate" method.
But you are right there are no tests for the html stripping - I will add some.

}

[Test]
public void Truncate_By_Words_Mid_Tag_TagsAsContentOn()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be tests missing for testing invalid html parsing and the logic that is in place to work around those types of problems.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take care of that.

Copy link
Contributor Author

@robert-cpl robert-cpl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully I have addressed all the concerns.


var result = helper.TruncateByWords(text, 4, true, false).ToString();

Assert.AreEqual("Hello world, this is&hellip;", result);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TruncateByWord is the container were "WordsToLength" and "Truncate" methods resides, and the HTML stripping happens in the "WordsToLength" which turns amount of words to char length and then feeds that number to the "Truncate" method.
But you are right there are no tests for the html stripping - I will add some.

}

[Test]
public void Truncate_By_Words_Mid_Tag_TagsAsContentOn()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take care of that.


var result = helper.TruncateByWords(text, 4, true, false).ToString();

Assert.AreEqual("Hello world, <b>this</b> is&hellip;", result);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference between TagsAsContent on/off, is that when is On, isolated tags will be treated as words, for example:
TagsAsContentOn

  • Hello <b>world</b>. It will contain two words.
  • Hello <b> world</b>. This one will have three words because <b> is isolated.

TagsAsContentOff

  • Hello <b>world</b>. It will contain two words.
  • Hello <b> world</b>. Still two words.

My fault that I haven't shown this in the tests.
Looking trough this again, I find that treating tags as content has no real uses and counting tags as words should not be a thing - and the state is right now, is just confusing.

I will clean this up and only count the actual words in the string.

@Shazwazza Shazwazza merged commit 41eeb38 into dev-v7 Sep 12, 2017
@Shazwazza Shazwazza deleted the TEMP-u4-10135 branch September 12, 2017 03:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants