Fixing the text direction in the right to left languages #1

Closed
Essam315 opened this issue Dec 27, 2018 · 17 comments

@Essam315

Essam315 commented Dec 27, 2018

Thank you so much for your great add-on. In right-to-left languages such as Arabic and Hebrew, Netflix uses the right-to-left mark (&rlm ;) and the left-to-right mark (&lrm ;) in its WebVTT subtitles to adjust the text direction (both are written without the space before the ;), as in these subtitle files:

Bright_ar.zip

Vikings-S1_E1-Rites of Passage_ar.zip

This causes the subtitle lines to be shown in the wrong direction.

To fix this problem you need to replace:

&rlm ; with '‫'

http://unicode.scarfboy.com/?s=%27%E2%80%AB%27

&lrm ; with '‪'

http://unicode.scarfboy.com/?s=%27%E2%80%AA%27

There are hidden control characters between the quotes (U+202B RIGHT-TO-LEFT EMBEDDING and U+202A LEFT-TO-RIGHT EMBEDDING, respectively). I would appreciate it if you made this replacement. Thanks.
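A minimal sketch of the replacement requested above, assuming the cue text is available as a plain string and that the escapes appear as `&rlm;` and `&lrm;` (without the space). The function name `replaceDirectionEscapes` is hypothetical and is not taken from the extension's code.

```javascript
// Hypothetical helper: swap the WebVTT direction escapes for embedding characters.
const RLE = '\u202b'; // RIGHT-TO-LEFT EMBEDDING
const LRE = '\u202a'; // LEFT-TO-RIGHT EMBEDDING

function replaceDirectionEscapes(cueText) {
  return cueText
    .replace(/&rlm;/gi, RLE) // right-to-left mark escape -> RLE
    .replace(/&lrm;/gi, LRE); // left-to-right mark escape -> LRE
}
```

Note that this replaces every occurrence of the escapes, not only the one at the start of a line.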

@Essam315 Essam315 reopened this Dec 27, 2018
@rsimmons
Owner

Thanks for the report! It looks like you may have accidentally closed the issue, so I'll reopen it for now while I look into this further.

@rsimmons rsimmons reopened this Dec 27, 2018
@Essam315
Author

Thank you.

@rsimmons
Owner

I've published the updated version for both Chrome and Firefox (v0.1.2), but it will take a little while to update.

@Essam315
Author

Thank you so much. The RLM case is fixed for me, but LRM still needs to be replaced.

@rsimmons
Owner

Can you link me to a Netflix show that has the LRM mark, so I can have something to test with?

@rsimmons rsimmons reopened this Dec 28, 2018
@Essam315
Author

Essam315 commented Dec 28, 2018

Replacing "\u200f" with "\u202b" in the subtitles that contain "&rlm ;" in the SRT files fixes the subtitles on screen for me while watching Netflix in Google Chrome. The problem is that a subtitle file that has "&lrm ;" in the SRT usually also has many "\u200f" and "\u200e" in most of its lines.

I think the solution is to replace "\u200f" with "\u202b" only when "\u200f" is at the start of the line, and to replace "\u200e" with "\u202a" only when "\u200e" is at the start of the line (or at the end of the line for Arabic).

Here are some examples (I just deleted "&lrm ;" from the SRT file):

http://unicode.scarfboy.com/?s=%E2%80%8F%E2%80%8E.%E2%80%8E%D8%A5%D9%84%D9%89+%D8%A7%D9%84%D8%AD%D8%AF%D8%AB+%D8%BA%D8%AF%D8%A7%D9%8B%E2%80%8E+%22%E2%80%8E%D8%A8%D9%8A%D9%88%D8%B1%D9%86%E2%80%8E%22+%E2%80%8F%D8%B3%D8%A2%D8%AE%D8%B0%E2%80%8E%E2%80%8F

http://unicode.scarfboy.com/?s=%E2%80%8F%E2%80%8E.%E2%80%8E%D9%84%D8%B3%D8%AA+%D8%AC%D8%A8%D8%A7%D9%86%D8%A7%D9%8B%E2%80%8E+%D8%9F%E2%80%8E%D9%85%D9%86+%D9%8A%D9%82%D9%88%D9%84+%D8%A5%D9%86%D9%91%D9%8A+%D8%AC%D8%A8%D8%A7%D9%86%E2%80%8E%E2%80%8F

http://unicode.scarfboy.com/?s=%E2%80%8F%E2%80%8E.%E2%80%8E%D8%A7%D9%84%D9%82%D8%AA%D9%84+%D8%B9%D9%85%D9%84+%D9%85%D8%B4%D9%8A%D9%86+%D8%B6%D9%85%D9%86+%D8%A3%D8%A8%D9%86%D8%A7%D8%A1+%D8%B4%D8%B9%D8%A8%D9%86%D8%A7%E2%80%8E%E2%80%8F

Replacing '&lrm ;' with '\u202a' fixes the text direction in the SRT file (tested in Google Chrome using developer mode).
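The "only at the start of the line" rule proposed above could be sketched as follows. This is an illustration under assumptions, not the extension's code: the names `replaceLeadingMark` and `replaceLeadingMarks` are hypothetical, lines are assumed to be separated by `\n`, and the end-of-line variant mentioned for Arabic is not handled.

```javascript
const RLM = '\u200f'; // RIGHT-TO-LEFT MARK
const LRM = '\u200e'; // LEFT-TO-RIGHT MARK
const RLE = '\u202b'; // RIGHT-TO-LEFT EMBEDDING
const LRE = '\u202a'; // LEFT-TO-RIGHT EMBEDDING

// Hypothetical helper: replace a mark only when it is the first character of the line.
function replaceLeadingMark(line) {
  if (line.startsWith(RLM)) return RLE + line.slice(1);
  if (line.startsWith(LRM)) return LRE + line.slice(1);
  return line; // marks elsewhere in the line are left untouched
}

// Apply the rule to every line of a multi-line subtitle text.
function replaceLeadingMarks(text) {
  return text.split('\n').map(replaceLeadingMark).join('\n');
}
```

Marks in the middle or at the end of a line stay as they are, which matters for the LRM-style files where most lines contain several of them.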

@rsimmons
Owner

Hi, thanks for your help with this. Unfortunately, Vikings and Teen Wolf are not available in my Netflix region (US). It would be useful if I could see an example of the original WebVTT data, particularly for subtitles that have "many "\u200f" and "\u200e" in most of its lines" as you say. In the Chrome inspector, you can go under the Network tab, and reload the page. There should be one request that starts with ?o=. If you right click that and do "Open in new tab", it should save it as a file that you can upload.

I think I understand the problem and the correct solution, but I want to make sure.

@Essam315
Author

Essam315 commented Dec 28, 2018

RLM

These TV shows are available on Netflix in the US, and you can watch them there with Arabic subtitles:

Black.Mirror.S01E01.The.National.Anthem.zip

The.Rain.S01E01.Stay.Inside.zip

Altered.Carbon.S01E01.Out.of.the.Past.zip

LRM

These TV shows are available on Netflix in the US, and you can watch them there with Arabic subtitles:

Marvels.Luke.Cage.S01E01.Moment.of.Truth.zip

Marvels.Daredevil.S01E01.Into.the.Ring.zip

Wet.Hot.American.Summer.First.Day.of.Camp.S01E01.Campers.Arrive.zip

@Essam315
Author

This replacement worked for me.

const RLM_REGEX = RegExp('\u200f(?!.*\u200e)', 'ig');
const LRM_REGEX = RegExp('^\u200e', 'ig');

recursivelyTransformNodeText(cueElem, s => s.replace(RLM_REGEX, '\u202b').replace(LRM_REGEX, '\u202a'));

@rsimmons
Owner

Thanks again for all your work on this. I had some ideas for a more elegant (and hopefully robust) fix. I've committed it to the master branch. Could you pull and test it (loading the extension unpacked) before I publish a new release?

The main idea is that I make use of the <c.arabic> and <c.hebrew> tags in the VTT. For display, I add CSS styles to those elements. For SRT export, I transform those elements into RLE...PDF pairs. The leading RLM and LRM characters are just converted from escapes to Unicode and otherwise left unused; I think they are extraneous.

@Essam315
Author

Essam315 commented Dec 31, 2018

The RLM subtitles work fine on screen while watching Netflix in Google Chrome, but in the downloaded SRT file there is an RLM at the start of the line, before the RIGHT-TO-LEFT EMBEDDING. Here is an example:

Bright_ar.zip

http://unicode.scarfboy.com/?s=%E2%80%8F%E2%80%AB%D9%84%D9%83%D9%86+%D8%A7%D9%84%D8%B9%D9%81%D8%A7%D8%B1%D9%8A%D8%AA+%D8%A3%D9%83%D8%A8%D8%B1+%D8%B4%D8%A3%D9%86%D8%A7%D9%8B%22%E2%80%AC

It should be

http://unicode.scarfboy.com/?s=%E2%80%AB%D9%84%D9%83%D9%86+%D8%A7%D9%84%D8%B9%D9%81%D8%A7%D8%B1%D9%8A%D8%AA+%D8%A3%D9%83%D8%A8%D8%B1+%D8%B4%D8%A3%D9%86%D8%A7%D9%8B%22

The LRM subtitles do not work correctly on screen while watching Netflix in Google Chrome, nor in the downloaded SRT file: there is an LRM at the start of the line, followed by RIGHT-TO-LEFT EMBEDDING instead of LEFT-TO-RIGHT EMBEDDING. Here is an example:

Wet Hot American Summer_ First Day of Camp-S1_E1-Campers Arrive_ar.zip

http://unicode.scarfboy.com/?s=%E2%80%8E%E2%80%AB%E2%80%8F%E2%80%8E.%E2%80%8E%D9%84%D9%8A%D8%B3+%D9%83%D8%B0%D9%84%D9%83%E2%80%8E+%D8%8C%E2%80%8E%D9%84%D8%A7%E2%80%8E%E2%80%8F%E2%80%AC

It should be

http://unicode.scarfboy.com/?s=%E2%80%AA%E2%80%8F%E2%80%8E.%E2%80%8E%D9%84%D9%8A%D8%B3+%D9%83%D8%B0%D9%84%D9%83%E2%80%8E+%D8%8C%E2%80%8E%D9%84%D8%A7%E2%80%8E%E2%80%8F

Netflix does not use &rlm ; and &lrm ; in all of its Arabic subtitles. In most of the newer subtitles it just puts RIGHT-TO-LEFT EMBEDDING at the start of each subtitle line, as in the movie "The Babysitter", and the new method adds an extra RIGHT-TO-LEFT EMBEDDING to those lines.

The Babysitter_ar.zip

http://unicode.scarfboy.com/?s=%E2%80%AB%E2%80%AB%D8%A3%D9%86%D9%81%D9%87+%D9%8A%D8%B1%D8%B4%D8%AD+%D8%AF%D9%88%D9%85%D8%A7%D9%8B.%E2%80%AC%E2%80%AC

It should be

http://unicode.scarfboy.com/?s=%E2%80%AB%D8%A3%D9%86%D9%81%D9%87+%D9%8A%D8%B1%D8%B4%D8%AD+%D8%AF%D9%88%D9%85%D8%A7%D9%8B.%E2%80%AC

Here is the Arabic WebVTT subtitle file for this movie:

The.Babysitter.zip
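One way to avoid the doubled embedding shown above is to wrap a line only when it is not already wrapped. The sketch below assumes each subtitle line is handled as a separate string and that an already-wrapped line starts with RLE, as in the "The Babysitter" example; `wrapRtlOnce` is a hypothetical name, not a function from the extension.

```javascript
const RLE = '\u202b'; // RIGHT-TO-LEFT EMBEDDING
const PDF = '\u202c'; // POP DIRECTIONAL FORMATTING

// Hypothetical helper: add an RLE...PDF pair unless the line already starts with RLE.
function wrapRtlOnce(line) {
  if (line.startsWith(RLE)) {
    return line; // assumed to be wrapped by Netflix already; leave it untouched
  }
  return RLE + line + PDF;
}
```

Calling the helper twice on the same line gives the same result as calling it once, so newer subtitles keep a single RLE...PDF pair.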

Adding RIGHT-TO-LEFT EMBEDDING to the subtitle lines only fixes the subtitles with &rlm ; in the downloaded SRT files, and there it should replace the RIGHT-TO-LEFT MARK rather than follow it. For the subtitles with &lrm ; we need to add LEFT-TO-RIGHT EMBEDDING in the downloaded SRT files instead, and it must replace the first LEFT-TO-RIGHT MARK.

We cannot use

.replace(OPEN_RTL_CLASS_TAG_REGEX, '\u202b') // replace opening of RTL lang class tag with RLE character

for both the subtitles with &rlm ; and the subtitles with &lrm ; at the same time.

For the subtitles with &lrm ; we need to use

.replace(OPEN_RTL_CLASS_TAG_REGEX, '\u202a') // replace opening of RTL lang class tag with LRE character

but this will not fix the subtitle direction when watching Netflix in Google Chrome.

@Essam315
Author

Essam315 commented Dec 31, 2018

Changing the script from

function downloadSRT() {
  const RLM_ESCAPE_REGEX = RegExp('&rlm;', 'ig');
  const LRM_ESCAPE_REGEX = RegExp('&lrm;', 'ig');
  const CLASS_TAG_REGEX = RegExp('</?c\\.([^>]*)>', 'ig'); // NOTE: backslash escaped due to literal
  const OPEN_RTL_CLASS_TAG_REGEX = RegExp('<c\\.(arabic|hebrew)>', 'ig');
  const CLOSE_RTL_CLASS_TAG_REGEX = RegExp('</c\\.(arabic|hebrew)>', 'ig');

to

function downloadSRT() {
  const RLM_ESCAPE_REGEX = RegExp('&rlm;', 'ig');
  const LRM_ESCAPE_REGEX = RegExp('&lrm;', 'ig');
  const CLASS_TAG_REGEX = RegExp('</?c\\.([^>]*)>', 'ig'); // NOTE: backslash escaped due to literal

and from

const cleanedText = cue.text
  .replace(LRM_ESCAPE_REGEX, '\u200e') // replace escape sequence with unicode char
  .replace(RLM_ESCAPE_REGEX, '\u200f') // replace escape sequence with unicode char
  .replace(OPEN_RTL_CLASS_TAG_REGEX, '\u202b') // replace opening of RTL lang class tag with RLE character
  .replace(CLOSE_RTL_CLASS_TAG_REGEX, '\u202c') // replace closing of RTL lang class tag with PDF character
  .replace(CLASS_TAG_REGEX, ''); // strip out other class tags

to

const cleanedText = cue.text
  .replace(LRM_ESCAPE_REGEX, '\u202a') // replace escape sequence with unicode char
  .replace(RLM_ESCAPE_REGEX, '\u202b') // replace escape sequence with unicode char
  .replace(CLASS_TAG_REGEX, ''); // strip out other class tags

makes the subtitles that have &rlm ; in the downloaded subtitles display in the right direction on screen while watching Netflix in Google Chrome, and the downloaded SRT files are also shown in the right direction.

The subtitles that have &lrm ; in the downloaded subtitles are still shown in the wrong direction while watching Netflix in Google Chrome, but are shown in the right direction in the downloaded subtitles.

Changing from

direction: rtl;

to

direction: ltr;

fixes the text direction on screen while watching Netflix in Google Chrome for the subtitles with &lrm ; in the downloaded subtitles (but makes the subtitles with &rlm ; in the downloaded subtitles display in the wrong direction).

I think the solution is to enable both rtl and ltr at the same time if possible, so that the subtitles with &rlm ; use rtl and the subtitles with &lrm ; use ltr.

@rsimmons
Owner

If I understand you correctly, I think I separately arrived at a similar conclusion. Here's my thinking: Why does Netflix sometimes put a leading &lrm or &rlm? Inside the lines, they use unicode characters for LRM and RLM, so why do they use those escapes at the start of the line? And also, why do they put those escapes when they are usually functionally useless? My idea is that the escapes are a sort of hacky way of signaling to the client that the "base direction" of the line is supposed to be LTR or RTL. In other words, maybe the Netflix client is expected to remove that escape sequence and use it to set the base direction for that line.

"I think the solution is to enable both rtl and ltr at the same time if possible, so that the subtitles with &rlm ; use rtl and the subtitles with &lrm ; use ltr."

I think this is the same conclusion you came to here.

Also previously you said:

"Netflix does not use &rlm ; and &lrm ; in all of its Arabic subtitles. In most of the newer subtitles it just puts RIGHT-TO-LEFT EMBEDDING at the start of each subtitle line, as in the movie "The Babysitter", and the new method adds an extra RIGHT-TO-LEFT EMBEDDING to those lines."

This also makes perfect sense. In newer subs, Netflix stopped using the &lrm/&rlm "hack" and instead just properly wraps the line in an RLE/PDF pair.

So if there is a &lrm at the start of a line, we should remove it and set the base direction of the line as LTR. If there is a &rlm at the start of the line, we should remove it and set the base direction of the line as RTL. To set the base direction, we can either use the dir attribute in HTML, or wrap with LRE/RLE and PDF. Wrapping with LRE or RLE should work for both display in Chrome and for SRT export.
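The rule described above could be sketched roughly as follows. This is an illustration under assumptions rather than the code that was later committed: the name `applyBaseDirection` is hypothetical, the input is assumed to be a single subtitle line with the escape still in its `&lrm;` / `&rlm;` form, and the returned `dir` value is only a suggestion for callers that prefer the HTML dir attribute over embedding characters.

```javascript
const LRE = '\u202a'; // LEFT-TO-RIGHT EMBEDDING
const RLE = '\u202b'; // RIGHT-TO-LEFT EMBEDDING
const PDF = '\u202c'; // POP DIRECTIONAL FORMATTING

// Matches a direction escape only at the very start of the line.
const LEADING_ESCAPE_REGEX = /^&(lrm|rlm);/i;

// Hypothetical helper: strip a leading escape and use it as the base direction.
function applyBaseDirection(line) {
  const match = LEADING_ESCAPE_REGEX.exec(line);
  if (!match) {
    return { dir: null, text: line }; // no hint, leave the line as it is
  }
  const rest = line.slice(match[0].length); // line without the escape
  const isRtl = match[1].toLowerCase() === 'rlm';
  return {
    dir: isRtl ? 'rtl' : 'ltr',
    text: (isRtl ? RLE : LRE) + rest + PDF, // wrap in an embedding pair
  };
}
```

Because the wrapped text carries its own direction, the same string can be used both for on-screen display and for SRT export. Lines without a leading escape, such as the newer subtitles that already start with RLE, pass through unchanged.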

I don't have time to change the code right now, but I will do it soon. Happy New Year!

@Essam315
Author

Thank you, and Happy New Year to you too.

@rsimmons
Owner

rsimmons commented Jan 2, 2019

I just pushed the fix I previously described to master. It seems to work for me, but can you pull and test it? The same text processing is applied for both display and export to SRT.

@Essam315
Author

Essam315 commented Jan 2, 2019

Thanks a million, everything is working perfectly.
