Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spell checking should be able to span multiple Typing Nodes #50

Closed
teleclimber opened this issue May 25, 2015 · 47 comments
Closed

spell checking should be able to span multiple Typing Nodes #50

teleclimber opened this issue May 25, 2015 · 47 comments

Comments

@teleclimber
Copy link

In ContentEditbleTyping Explainer, Section 6.2:

When the user indicates he wants to replace part of the contents of a CE Typing Node by means of a browser-builtin spell cehcker or alike, a beforeInput input with inputType set to replaceText is triggered.

Note that a spellchecker might find a typo in a word that is not fully contained in a single text node. Therefore the event target must not be a text node, but rather a Range. Safari in particular allows spell check to work even when words are partially bold. See #28.

Furthermore it's not clear from the spec whether a default behavior takes place? In any case if the target is a range then there can be no default behavior since it would result in Element changes. So I think the Explainer should clarify that there is no default behavior, it's up to the editor dev to make the change.

@johanneswilm
Copy link
Contributor

As the IME input issue apparently can be handled without having to allow direct changes to the DOM, it should now be clarified that no automatic DOM changes happen in ce=typing ever.
As for spell checking across borders: I disagree. The browser should not interpret that there are words across elements. It will make editor programming much more difficult and likely the browser will make many mistakes that one will have to create new workarounds for in the JS code. But we can vote on this at the F2F. It would be good if you could create a PR for your alternative on this point.

@teleclimber
Copy link
Author

The browser should not interprete that there are words across elements.

@rniwa is strongly opposed to this since it breaks existing functionality with Safari. Furthermore he makes the good point that it's not good for the end-users, and I agree. Spell-check should not fail just because a word is partially bolded.

...and likely the browser will make many mistakes that one will have to create new workarounds for in the JS code.

Can you please clarify and elaborate on this?

@johanneswilm
Copy link
Contributor

None of this is breaking existing functionality, as contentEditable=typing is a completely new concept and everything existing will live on in contentEditable=true. We have made no promises that this will allow for everything that has been possible with other editing environments in the past. Some editors currently implement spellchecking through third part services instead. There are even entire solution(s?) written in javascript to handle spell checking purely in javascript.

As for the last point: editors are complex and browser makers cannot know everything that is going on in them. But given that no direct access to spell checks is granted to JS, it is reasonable to suspect that things within the same text node are letters in the same word if there is no space between the,. Beyond that, it is not reasonable to make any guesses on the part of the browser what is going on. For example. Take the following:

<p>Is this a wo<span class="a">rd</span>?</p>

The browser would probably guess that "word" is a single word. But what if the CSS stated something like:


span.a {
display: block;
border: 5px solid green;
position: absolute;
top: 0px;
}

Suddenly it's something entirely different. Would the browser know? Maybe, maybe not. From the perspective of the JS, it would most likely seem out of control. You would need A LOT of specification to explain when something crossing two text nodes is still a word. And then Javascript programmers would go through the implementations and find bugs that help them turn cross-element word detection on and off.

So therefore: I don't think it is a good idea. But if you still think it is a good idea, could you write a spec to determine when words crossing element boundaries are still words and edit this spec referring to the other. Then we can vote on the two proposals at the F2F meeting.

@teleclimber
Copy link
Author

The browser would probably guess that "word" is a single word.

No, the browser would probably take into account the style applied to the Element. In fact I ran the exact code you offer above in Safari and it correctly identifies that anything in the span is a different word given the CSS provided. I don't think we'd have to specify when things are words and when they are not. The UAs are capable (and should) do it themselves if they want to offer spell-check across boundaries.

Note that "real" text editors offer spell-check of partially-styled words. It follows then that "proper" web-based editors should eventually do the same. But if the spec doesn't allow for it natively, then it's the editor devs who are going to have to write reams of JS to get around that problem.

I think it's a very bad idea to bake into the spec a limitation such as this one. Specs live long lives and defeating future capabilities for the sake of short term simplicity is something I can't get behind.

@johanneswilm
Copy link
Contributor

The UAs are capable (and should) do it themselves if they want to offer spell-check across boundaries.

Ok, meaning that there is no specification that defines when they are and when they are not. So after Safari, IE, Chrome, Firefox each implement this differently and will have different rules as to what constitutes a word and what doesn't. So in order to either turn word recognition on or off, there will be tons of little hacks, probably changing from browser version to browser version to achieve just that. That's not a good idea.

I think it's a very bad idea to bake into the spec a limitation such as this one.

I don't think it's "baked in". All it does is to use a more generic replaceText event for text replaced through spellchecking. Should a later version of the spec use a more specific spellChecker event that can handle more complex setups, it should not in any way break anything.

Therefore the event target must not be a text node, but rather a Range.

Actually that's already happening. The current targetRange for replaceText is specified as:

Set the value of targetRange to a new range that covers the text that is to be replaced.

@johanneswilm
Copy link
Contributor

I just tried this in LibreOffice. As you said, it does spell checking across styling borders. But when one corrects a word, it loses part of the styling. Is that also what you found @teleclimber?

The thing is that word processors are rather limited in what they allow. When editing HTML, there may be other more complex parts that one simply does not want the browser to mess with -- for example data attributes on span elements that are used inside of custom editors. And then there is the magic related to determining whether something is a word or not -- that will eb quite a nightmare for editor developers, especially when one involves custom elements.

Of course, the editor gets to decide whether or not to actually handle the correction. But if it's not handled, it will look very strange to the user.

So the only way I could see this happen would be if it is an option that can be turned off (and preferably is turned off by default). It would need to be an option on the spell checker itself, to correct across element boundaries or not.

@teleclimber
Copy link
Author

Ok, meaning that there is no specification that defines when they are and when they are not.

I think we can wait on a spec for what constitutes breaking a word. I don't think there are a million possibilities, nor do I think a slight variation on this specific behavior from browser to browser is a show-stopper for users. Native spell-check already varies significantly from browser to browser so do we want to go in and spec that too? If things get out of hand wrt to word-breaks then we can add a spec for it.

it does spell checking across styling borders. But when one corrects a word, it loses part of the styling.

Yes Safari does the same. I wouldn't do it that way, but that's just me.

When editing HTML, there may be other more complex parts that one simply does not want the browser to mess with

I don't want the browser to mess with any of it either. As I see it the spell check triggers a Device-Independent Event that includes a Range (that may span multiple Elements) and the editor takes care of replacing that range with the event's content data as they see fit. That Event has no default behavior so the browser never messes with the HTML.

You mentioned at first that you thought it would be complicated for the editor dev. I really don't see how. Your editor code will have pretty much everything that is needed already written to handle other intents. Consider paste-over-selection: you have to take a Range that may span multiple Elements, figure out how to remove the contents, then add the contents being pasted. If your editor can do that then your editor can handle spell-check across Element boundaries.

@johanneswilm
Copy link
Contributor

I think we can wait on a spec for what constitutes breaking a word. I don't think there are a million possibilities, nor do I think a slight variation on this specific behavior from browser to browser is a show-stopper for users.

Well, it kind of is. The entire point with this new spec is that it really is truly and fully spec'ed in a way that does not open for weird surprises for editor developers. If this is not spec'ed, we really don't know what is going on, and then we are back where we started out with contenteditable as it is now.

If there is to be a proposal to allow for this, this needs to be fully spec'ed. if it cannot be spec'ed now, it will have to wait for a later version of the contentEditableTyping spec. We can allow for spell checking within text nodes now and at a later stage open for spell checking across nodes, without breaking anything. The spell checking already returns a range, so that should be extensible in the future if this is ever spec'ed.

You mentioned at first that you thought it would be complicated for the editor dev. I really don't see how.

LibreOffice and apparently Safari haven't figured out how to do this without losing formatting. And they only need to deal with simple styling elements -- not all the things one may need in a WYSIWYM editor such as data attributes, etc. .

Your editor code will have pretty much everything that is needed already written to handle other intents.

This is different. I can implement this, but in many situations I can decide that I simply will not permit it for whatever reason or handling it in a very specific way (for example, remove an entire figure although only the figure caption was part of the selection), without the user getting the feeling that it is broken.

Spell checking across elements is different. If the user sees the little red line, and a right click menu that permits the user to select a word... but when the user selects one of the words, and nothing happens and the read line doesn't even disappear, then that seems broken.

So instead as an editor developer I will have to figure out what I need to insert to make the spell checker not do that. Maybe inserting a 1x1 px white image between two text nodes will do the trick? How about a zero width canvas element? Maybe do something about borders which I then need to hide in some other way so that the user doesn't notice... and because it's not spec'ed the needed trick will probably differ from browser to browser and browser version to browser version, all that being extremely hackish.

@johanneswilm
Copy link
Contributor

Consider paste-over-selection

Also, in theory, if implemented according to specifications, I could specify how selections work and whether something can be partially selected or not. Not so about what the browser considers to be word boundaries.

@teleclimber
Copy link
Author

There are no weird surprises for developers here. The problem with CE=true are that it manipulates the DOM in whimsical ways. This is not happening here. You still have full control of the DOM. So this is a very far cry from the problems of CE=true and I don't see it as a show-stopper for either dev or user.

LibreOffice and apparently Safari haven't figured out how to do this without losing formatting.

Well at least they allowed the spell-check to correctly fix the partially-styled word. That they didn't preserve the style exactly is secondary. My problem is you are baking in the spec an impossibility. You're making it completely impossible, per-spec, for me to even have a chance to match LibreOffice's capability using native spell-check, let alone improve on it.

I don't think it'll be nearly as hard as you imagine to make it clear to all browsers that you do not intend for two adjacent strings to be a word. But that's just my opinion. If others are genuinely worried about it we can try to spec it.

@johanneswilm
Copy link
Contributor

Well at least they allowed the spell-check to correctly fix the partially-styled word. That they didn't preserve the style exactly is secondary. My problem is you are baking in the spec an impossibility. You're making it completely impossible, per-spec, for me to even have a chance to match LibreOffice's capability using native spell-check, let alone improve on it.

That is the current spec. or it will be. But a future version of the spec may allow this. The point here is not to allow browsers all kinds of uncontrolled and unspec'ed behavior -- on the contrary. The point is to make them behave in a predictable way.

I don't think it'll be nearly as hard as you imagine to make it clear to all browsers that you do not intend for two adjacent strings to be a word. But that's just my opinion. If others are genuinely worried about it we can try to spec it.

If it's not hard to determine what the rules are that make something be a word, then it should not be difficult to write that down in a specification. Doing this would make it something that could be included. Even then, I think we need a way to turn this off if we don't want a thousand hacks around this.

Once you have written a specification draft on the word division question, feel free to add it to a PR. If you additionally provide a way to turn this feature off, we can even merge it all into this repository before the F2F.

@johanneswilm
Copy link
Contributor

I have now installed an version 5 of Safari via Wine on Linux (took some time), and it was not very hard to create an example where this becomes an issue. Imagine this text. It could be something found in a specialized editor for academics (such as Fidus Writer) that allows citations.

<html>
<head>
  <style>
    .citation {
      background-color: #9dbc7d;
      border: 1px solid black;
      padding-left: 5px;
      padding-right: 5px;
      margin-right: 5px;
      margin-left: 5px;
      display: inline-block;
    }
  </style>
</head>
<body>
<h1>The headline</h1>
<div contentEditable=true>
  The question came from New Mexico<span class="citation">Maus 1998</span>.
</div>
<div>
  <h2>Citations</h2>
  <p>Maus, Meier. 1998: "The Road is long, the wind is fast." Pillmond Publishers, Boston.</p>
</div>
</body>
</html>

mexico-maus-safari

The problem here is between "Mexico" and "Maus". The browser thinks they are the same word, so they are underlined together. To the user there is a 10-11 px distance between the word, and the color for the citation is different, so that the user would not think of this being an issue. But there is the line. And for the editor creator, there is no way to make the line go away, except turn of spell checking altogether... or possibly to find some bug in Safari that lets me make the browser not think they are the same word. Does this make sense now?

@johanneswilm
Copy link
Contributor

This is what it looks like in Chrome -- also broken, but in a different way. It marks both words individually, and the word "Maus" it just doesn't recognize. The word "Mexico" is recognized, at least if I put it somewhere else in the sentence. When I right-click on it, it doesn't give me any suggestions either. But the line is still there. My guess: The browser tries to be a little bit too smart here. It would be better if it lost some of that smartness and instead became more predictable. At least in contentEditable=Typing.

mexico-maus-chrome

@johanneswilm
Copy link
Contributor

Going a bit further with this. None of this has any influence on Safari's evaluation that MexicoMaus clearly is one word. Chrome is not much better, but at least it doesn't try to replace the two words together:

<html>
<head>
  <style>
    .citation {
      background-color: #9dbc7d;
      border: 1px solid black;
      padding-left: 5px;
      padding-right: 5px;
      margin-right: 5px;
      margin-left: 5px;
      display: inline-block;
      position: absolute;
      top: 5px;
    }
    .citation::before {
      content: ' (';
    }
    .citation::after {
      content: ')';
    }
  </style>
</head>
<body>
<h1>The headline</h1>
<div contentEditable=true>
  The question came from New Mexico<span class="citation">Maus 1998</span>.
</div>
<div>
  <h2>Citations</h2>
  <p>Maus, Meier. 1998: "The Road is long, the wind is fast." Pillmond Publishers, Boston.</p>
</div>
</body>
</html>

mexico-maus-2-safari

@spocke
Copy link
Contributor

spocke commented May 26, 2015

As long as it possible to disable the browsers spellchecking using the spellcheck="false" attribute as we do to day one could then override the behavior with a custom spellchecker engine. We don't use the browsers built in spellchecker in TinyMCE for the reason of it being buggy in all browsers and there is also no way to control the spellchecker though any form of API or even CSS to change it's appearance.

Specing all the behavior of a browser spellchecker would be a huge task. There are tons of special scenarios they don't currently cover and I think that should be done in it's own spec not in the editor spec.

@johanneswilm
Copy link
Contributor

@spocke Yes, I think every major semi-complex editor disables the spell checker for that reason. Ideally, it would have been preferable if we could just get access to the spell checker engine directly through Javascript, but the answer to that has been that this would be a security issue by the browser people. I think that if it is to be useful at all for at least some editors, it should be very constrained/predictable in how it works.

@johanneswilm johanneswilm changed the title replaceText input event could span multiple Typing Nodes spell checking should be able to span multiple Typing Nodes May 28, 2015
@johanneswilm
Copy link
Contributor

I changed the title to what this is actually about. replaceText itself should already be able to handle multiple Typing Nodes.

What I wonder about is whether not instead it is conceivable that the Custom spelling dictionary (the term Chrome uses for it) could be turned off and then the standard spelling provided through direct JavaScript calls, or whether that still constitutes a security concern. For any more complex editor, havign direct access to a spell checker seems to be the only way to go.

@rniwa
Copy link
Contributor

rniwa commented May 28, 2015

"standard spelling" is not a thing for us, so that can't work.

@johanneswilm
Copy link
Contributor

Well, but Safari isn't the only browser out there, so we shouldn't limit the specs we write what Safari feels it can or cannot do. But we can put it into a separate specification and not make anything else depend on it and then you are not forced to implement it in order to be able to say you implemented ContentEditableTyping.

@rniwa
Copy link
Contributor

rniwa commented May 28, 2015

I don't think it's acceptable for spellchecking to not work in the apps that use new editing app.

@johanneswilm
Copy link
Contributor

Well, then you can fix Safari. But if Safari, for whatever reason, decides that it cannot do or doesn't want to do what is needed for the editor to provide good spell-checking, there is not that much editor developers can do about that.

As you can see from the above examples, Safari as well as other browsers are broken in this concern and TinyMCE and others simply turn the spell checking of precisely for this reason. Just having everyone continue with a broken system does not seem like a good solution.

@rniwa
Copy link
Contributor

rniwa commented May 28, 2015

I don't understand why not exposing the list of spellchecking candidates will result in not doing anything. I'm simply giving an implementation feedback that the specific proposal you made is not implementable in our engine. That doesn't mean we can't solve the problem of spellchecking. It just means we need to come up with an alternative solution.

@johanneswilm
Copy link
Contributor

I agree -- what alternative can you see that would work for example in the case of the examples given above?

@rniwa
Copy link
Contributor

rniwa commented May 28, 2015

How about the browser giving you a list of "replace" intents that spans across multiple elements when the user selects a correction candidate? That's basically what we have today except it only happens within a signle text node but we already need to deal with "deletion" happening across multiple elements so I don't see why apps can't support browser's request to replace multiple elements.

@johanneswilm
Copy link
Contributor

That would not solve the above problem of MexicoMaus being underlined becuase the browser thinks it's one word when it's not, does it?

@rniwa
Copy link
Contributor

rniwa commented May 28, 2015

@johanneswilm what if we provided some mechanism (e.g. spellcheckingscope attribute) to mark an explicit word/paragraph/etc... boundary?

@johanneswilm
Copy link
Contributor

Ok, would this have to be an element attribute on every element that is to be a boundary? That could quickly turn into a lot of attributes and somewhat messing up the code one stores on disc. If one could either A) specify it via CSS (so one could specify by class, type, id, etc.) or B) have multiple possible values for the spellcheck attribute that would then invoke the spell checker in different ways, I would agree to that. A would even be slightly nicer as it gives more options, but I'm not sure this is strictly something that should be in the domain of CSS.

@rniwa
Copy link
Contributor

rniwa commented May 28, 2015

I don't think it should be specified in CSS since this is not a style information. If the markup verbosity was a concern, how about adding a JS function that takes an array of elements? You call it with a list of elements that should define the word boundary. I guess one drawback of such an approach will be that the editor has to do it on every new element being inserted/cloned in the document...

@johanneswilm
Copy link
Contributor

Yes, the attribute would not be so pretty. Imagine you have an entire document and want to save it to disc, likely removing this spell check information. And then load it back up and go through all the words that need the boundary and re-add the attribute. Most likely one would simply add the attribute to all elements that have a specific class or so, but it would still not be very pretty.

The JS function would add quite some complexity for the editor developers, especially when people copy and paste stuff together and additionally add things manually. Basically one would have to go through each paste and all input and always watch out whether there is yet another one of those elements one wants to define as a sentence/word boundary.

Ideal would be something like CSS... just not CSS itself.

@hallvors
Copy link
Contributor

navigator.spellcheck.defineBoundarySelectors('.foo, .citation, span');

Maybe?

@johanneswilm
Copy link
Contributor

@hallvors That would definitely work for the use cases I'm thinking of.

@NorbertLindenberg
Copy link

Why the aversion to using CSS here? It seems to me that the current browser behavior is correct if you ignore the CSS rules – there's no space between "Mexico" and "Maus", and it's text in a language that usually uses spaces to separate words. It's therefore correct to treat "MexicoMaus" as one word and indicate that it's not correct. It's only the CSS rules that make "Mexico" and "Maus" appear as two separate words, and it seems natural to then also use CSS to indicate that spelling checking should treat them as such.

@johanneswilm
Copy link
Contributor

It seems to me that the current browser behavior is correct if you ignore the CSS rules

It's not a question of correct or not. It's a question of whether or not spellchecking is usable for editors that have ways of adding tags, citations, comments, etc. that also need to be editable.

@NorbertLindenberg
Copy link

In the example you provided, the usability issue arises because the browser doesn't see a word break where the user sees one, and the user sees one because the CSS rules introduce spacing (and other decoration) that's not in the text. If the same CSS rules could also tell the browser to assume word breaks before and after the span, the usability problem would go away.

@ghost
Copy link

ghost commented Jun 2, 2015

I agree with @teleclimber that half-styled words may exist in the wild. But, apparently, browsers in general don’t agree. Without contentEditable, Safari, Chrome and Firefox all report ‘Mexico’ and ‘Maus’ to VoiceOver as separate words. With contentEditable, they are spellchecked as a single word. One of those behaviours is correct, but not both.

@kojiishi
Copy link

kojiishi commented Jun 2, 2015

True it's ambiguous. <span sytle="position:absolute">Thi<span>s is still one word from semantic perspective.

I prefer, however, editors insert a text node with a space or LF to avoid ambiguous coding from spell checking to double-click selection to all other cases than to add an API just for spell checking.

@teleclimber
Copy link
Author

I prefer, however, editors insert a text node with a space or LF to avoid ambiguous coding from spell checking to double-click selection to all other cases than to add an API just for spell checking.

I agree with this. The "MexicoMaus" problem can easily be solved by putting a zero-width whitespace (for example) between the two words. Since editor devs control the DOM in ce=typing it should not be a problem to ensure there is always a text node or a span with an appropriate character that ensures proper word separation in cases like these.

@johanneswilm
Copy link
Contributor

If we put zero space characters places just to make editors work, we are back where we are rught now. Basically this just means that browser spell checking just has to be turned off, at least for Safari, or one will have to add a notice for Safari user “This editor does not work in Safari. Please download Chrome/Firefox”.

@ghost
Copy link

ghost commented Jun 14, 2015

It’s not about putting zero-width space to make editors work, rather about making two words two words. I believe @kojiishi is correct and those are a single word semantically (why browsers expose them otherwise to VO is unclear). Then there is a case with wbr, where multiple text nodes separated by an empty element are clearly a single word.

@johanneswilm
Copy link
Contributor

Where would you want to insert that space?

@johanneswilm
Copy link
Contributor

As far as I can tell, putting the space anywhere there would be incorrect. Many citation styles require the citation to come directly afterward, not after a space. And if we put a zero-width space there, as is common practice to get around various cE-problems today, then the caret will move across it when hitting arrow-right/left, so one needs to program sone more code to avpid that. And then some code for copy-paste situations, etc. And then we are back in the nightmare that cE is today.

@ghost
Copy link

ghost commented Jun 14, 2015

From the practical point of view, I understand your concern. It’s just that multiple nodes in HTML do not automatically mean there are multiple words. They are in your example, but may as well be a single word in others. In general, HTML should remain readable with CSS stripped off; when words are not separated by whitespace, things start to get fragile.

@johanneswilm
Copy link
Contributor

Is there an official definition for what marks word endibgs in HTML? I understand that one in some circumstances one may also want partially styled words. Only the editor really knows what's going on, or at least can have some clue about it, so it would be preferable if the editor could get direct access to the spell checker in some way that doesn't compromise security.

@ghost
Copy link

ghost commented Jun 14, 2015

No, they don’t specify word endings. That practically means language conventions alone define word ending. Is ‘Mlle’ when spelled as M<sup>lle</sup> a single word? I believe so.

@johanneswilm
Copy link
Contributor

True, and 1:a (“first” in Swedish) is also to be regarded as just one word. But the ‘i’ in Mexico<sup>i</sup> may be a footnote marker that should not be seen as part of the word. I think that strengthens the argument that editors should get to control spellcheckers.

@johanneswilm
Copy link
Contributor

But actually the proposed method to mark certain elements as word boundaries should make the spellchecker work at least for all the cases mentioned in this thread, as far as I can tell.

@ghost
Copy link

ghost commented Jun 14, 2015

Yes, for spellcheckers, that’ll do it.

And you’re right, we shouldn’t be inserting zero-width spaces automatically; when I was advocating ‘zero-width space’ above, I was thinking of authors and editors marking word boundaries themselves (thus humans, not software, being in control of word boundaries).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants