Arabic/joining character formation broken #109

Closed
btsimonh opened this issue Jun 8, 2018 · 8 comments

@btsimonh
Contributor

btsimonh commented Jun 8, 2018

Ref branch 1.1-master:
The scheme for padding (and multi-row-align?) puts every character in its own span. On WebKit-based browsers, this breaks Arabic character formation.

in the attached Arabitest.ttml sample, the characters should join, i.e.
image
should look like:
image
or even like
image
(the third image is from iexplore)

Personally, I think the third display will be too difficult to achieve consistently across all browsers and renderers. I would be happy to add a caveat about joining character sets to the standard, stating that characters in different spans will specifically NOT join (though we may then need to MAKE them not join in iexplore). Especially as they do not join in iexplore if the styling differs by more than just color?

Arabictest.zip

@palemieux
Contributor

@btsimonh It would be possible to merge single-character spans after line wrapping is determined. I suspect this would not help in this case, since merging the spans would make the Arabic text shorter (due to ligatures) and thus change the line wrapping. I am not sure how to work around this. Thoughts?

@btsimonh
Contributor Author

You are correct. For word wrapping, the length of the already-joined words must be taken into account; certainly in Arabic/Farsi, character widths can vary wildly depending on how they join.

See processLinePaddingAndMultiRowAlign in my fork (and a lot of the code above it) for a relatively complex implementation, which separates all the words first, then apportions them to 'lines', reforming them into combined spans, before finally adding padding to the 'end' spans. I don't think I have the RTL 'start' and 'end' spans identified correctly yet. I prefer a 'simple' HTML output :). This does rely on being able to measure the words, but I assume that is also true of the single-character method.

@palemieux
Contributor

@btsimonh OK, so the idea is to create spans only where a line wrap can actually happen? If so, what is a good reference for line breaks in Arabic, i.e. where does the algorithm come from?

@btsimonh
Contributor Author

btsimonh commented Jun 13, 2018

Ahh... I examined the code. I think this is how it works:
1. Split it all into word spans.
2. Add them all into a div of the required size (in the correct span order according to RTL rules!).
3. Use getBoundingClientRect to locate all the spans (which have now been wrapped) and use the Y position to determine which line each one is now on.
4. Maybe insert BRs to fix the lines as they are, turn off wrapping, and either expand the div or turn off overflow hiding. (I think my code does not do this, which may be why I get occasional padding on an 'internal' span: an extra wrap I did not expect after processing?)
5. Add padding to each line end (least x and greatest x) on each line.

This should be fairly efficient, as it only needs two renders :(. In theory, an off-screen render should be fast, and it does produce the locations you need (hence the -1000 at the top of the code!).
It should be adaptable for vertical text....
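
The line-detection step above might be sketched roughly as follows. This is only an illustration and not the code from the fork: `groupSpansByLine`, the `{ id, top, left, right }` box shape, and the `tolerance` parameter are all hypothetical, and it assumes every word span on a line reports (nearly) the same `top` value, which may not hold for mixed font sizes.

```javascript
// Hypothetical helper: groups measured word spans into rendered lines.
// Each box is { id, top, left, right }, e.g. copied from the DOMRect
// returned by element.getBoundingClientRect() after the first render.
function groupSpansByLine(boxes, tolerance = 1) {
  const lines = [];
  // Sort by vertical position so boxes on the same line become contiguous.
  const sorted = [...boxes].sort((a, b) => a.top - b.top);
  for (const box of sorted) {
    const current = lines[lines.length - 1];
    if (current && Math.abs(box.top - current.top) <= tolerance) {
      current.boxes.push(box);
    } else {
      lines.push({ top: box.top, boxes: [box] });
    }
  }
  // On each line, the padding targets are the boxes with the least and
  // greatest x, regardless of whether the text is LTR or RTL.
  return lines.map((line) => {
    const leftmost = line.boxes.reduce((m, b) => (b.left < m.left ? b : m));
    const rightmost = line.boxes.reduce((m, b) => (b.right > m.right ? b : m));
    return {
      ids: line.boxes.map((b) => b.id),
      leftmost: leftmost.id,
      rightmost: rightmost.id,
    };
  });
}
```

Working from plain box data rather than live elements keeps the grouping logic independent of the DOM, so it can be exercised without a browser.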

As to how to split the text (i.e. on which characters), nothing springs up from Google, so I assume spaces, hyphens and soft hyphens.
br, s
(P.S. An afterthought: some existing subtitles may have punctuation with no space afterwards (to save characters). I have a feeling that this was noted by the EBU-TT committee, and that they opted for 'should be converted to have a space after'; after all, we would not like to wrap an email or web address.)
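
A minimal sketch of the splitting assumption above (spaces, hyphens and soft hyphens only). The function name `splitAtBreakCharacters` is hypothetical, and this is not a full line breaking algorithm: among other things it would also split at the hyphens inside a web address.

```javascript
// Hypothetical helper: splits text into chunks that each end just after an
// assumed break character: space, hyphen-minus, or soft hyphen (U+00AD).
// The break character stays attached to the chunk that precedes it, so
// joining the chunks reproduces the original text.
function splitAtBreakCharacters(text) {
  const chunkPattern = /[^ \-\u00AD]*[ \-\u00AD]|[^ \-\u00AD]+/g;
  // String.prototype.match returns null when nothing matches (empty input).
  return text.match(chunkPattern) || [];
}
```

Each chunk could then be wrapped in its own span, so that joining characters inside a word are never separated by a span boundary.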

@palemieux
Contributor

The more I think about it, the more I think this is a WebKit bug that needs to be fixed there: imscJS should not have to implement line breaking algorithms.

Within the scope of the 1.0.1 release, Arabic ligatures should work as long as no span is present within the Arabic text, and none of ebutts:linePadding, ebutts:multiRowAlign, or tts:fillLineGap is used on a p that contains Arabic text.

For the next release, it should be possible to improve the imscJS line wrap detection algorithm so that ebutts:linePadding, ebutts:multiRowAlign, and tts:fillLineGap can be used.

In any case, I encourage users to file issues against WebKit.

@nigelmegitt
Contributor

#216 is partially relevant here: it merges adjacent character spans where they 1) are on the same line and 2) have the same parent element in the ISD. That demonstrably helps with Arabic character formation within a span, but cannot address it across span boundaries, which, as has been mentioned, appears to be a user agent bug.
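
The two merge conditions could be illustrated with a sketch like the one below. It is not the code from #216: the `mergeAdjacentSpans` name and the `{ text, line, parentId }` data model are assumptions made for the example, with `line` standing for the rendered line index and `parentId` for the originating ISD element.

```javascript
// Hypothetical data model: each span is { text, line, parentId }.
// Adjacent spans are merged only when they share both the rendered line
// and the originating ISD element; anything else starts a new span.
function mergeAdjacentSpans(spans) {
  const merged = [];
  for (const span of spans) {
    const last = merged[merged.length - 1];
    if (last && last.line === span.line && last.parentId === span.parentId) {
      last.text += span.text;
    } else {
      // Copy the span so the caller's objects are not mutated.
      merged.push({ ...span });
    }
  }
  return merged;
}
```

In this model, spans that come from different ISD elements stay separate even on the same line, which matches the limitation at span boundaries described above.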

@btsimonh
Contributor Author

btsimonh commented Jun 7, 2021

Hi Nigel,
note the thing about the wrapping point changing when the span merging causes a change to the glyphs due to character joining.
I've not tested the original webkit bug recently. I assume there is no improvement?
br, simon hailes

@nigelmegitt
Contributor

> note the thing about the wrapping point changing when the span merging causes a change to the glyphs due to character joining.

@btsimonh Not sure exactly what you mean here? Yes, the rendered line length can change when characters are joined correctly, which happens when spans are merged. And yes, there is an edge case where that could cause the wrapping to be different. I suppose one option would be to force the wrapping to stay the same, for example by setting CSS properties like white-space: nowrap; and word-break: keep-all; on each merged span.
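
As a rough sketch of that option, the two CSS properties could be set on each merged span as below. `freezeWrapping` is a hypothetical name, and `span` is assumed to be any object with a `style` property, such as an HTMLElement. This only prevents wrapping inside a span; whether it keeps the overall line breaks stable is untested here.

```javascript
// Hypothetical helper: applies the two CSS properties mentioned above.
// `span` is assumed to expose a DOM-like `style` object.
function freezeWrapping(span) {
  span.style.whiteSpace = "nowrap"; // CSS: white-space: nowrap;
  span.style.wordBreak = "keep-all"; // CSS: word-break: keep-all;
  return span;
}
```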

Do you have a test case that demonstrates this issue?

> I've not tested the original webkit bug recently. I assume there is no improvement?

I haven't tested for it either. However, I was slightly surprised to see that joining character formation was fixed by #216 on Firefox on macOS: I had the impression that Firefox did not exhibit this behaviour, but I could be misremembering.

palemieux pushed a commit that referenced this issue Jul 16, 2021
* Merge adjacent spans on the same line that originate from the same ISD element (after line wrapping is performed) (#194, #109)
* Improve fillLineGap with semi-opaque background colors (#200)

Co-authored-by: Robert Bryer <robert.bryer@bbc.co.uk>
@palemieux palemieux added this to the 1.1.3 Release milestone Sep 13, 2021