When the input srt file has HTML tags inside then translation may not work #19

tssajo · 2022-04-27T20:49:52Z

I have an srt file which begins like this:

1
00:00:03,003 --> 00:00:06,131
<i>at the beginning of media
number one, volume one.</i>

The <i> tag sometimes (but not always!) prevents DeepL from translating this whole "portion" (the first 4500 characters!) of the srt file. Sometimes it translates part of the text but not all of it. Other times nothing gets translated when there are HTML tags in the input text.

Going to the DeepL web site and copy-pasting the text manually gives the same unpredictable results. Pressing Ctrl+F5 on the site sometimes changes the translation, but it is never perfect when there are HTML tags in the input text! The results are simply unpredictable, especially when more than one subtitle has text between <i> and </i> tags.

After spending several hours on it, I could only solve the issue by removing all HTML tags from the input srt file first. I never liked italic, bold or even colored subtitles anyway.

I also created a patch which removes all HTML tags from the input on the fly during processing. I may create a PR for this change later, but not today. Until then, here is the fix. I changed the beginning of the srt_parser.py file to look like this:

import srt
import logging
import re

CLEANR = re.compile(r'<.*?>')  # matches any HTML-style tag such as <i> or </i>

def open_srt(file_path):
    logging.info(f"Reading {file_path}")

    with open(file_path, "r", encoding="utf-8", errors="ignore") as srt_file:
        srt_file = srt.parse(srt_file)
        subs = list(srt_file)
        subs = list(srt.sort_and_reindex(subs))

        for sub in subs:  # strip HTML tags, then flatten each subtitle to one line
            sub.content = srt.make_legal_content(CLEANR.sub('', sub.content))
            sub.content = sub.content.strip().replace("\n", " ")

        return subs

Please note how the CLEANR regular expression is now used in line 16. The rest of the file is unchanged.
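
To see what the regular expression does on its own, here is a minimal standalone sketch that needs only the standard library. The function names strip_tags_broad and strip_tags_narrow are hypothetical and not part of srt_parser.py. The narrow pattern is an assumption about which tags usually appear in subtitles (i, b, u, font), not something the patch above uses.

```python
import re

# Same pattern as CLEANR in the patch: removes anything between "<" and the next ">".
BROAD_TAG = re.compile(r"<.*?>")

# Hypothetical narrower alternative: only common subtitle formatting tags.
NARROW_TAG = re.compile(r"</?(?:i|b|u|font)\b[^>]*>", re.IGNORECASE)


def strip_tags_broad(text: str) -> str:
    """Remove every <...> span, as the patch does."""
    return BROAD_TAG.sub("", text)


def strip_tags_narrow(text: str) -> str:
    """Remove only <i>, <b>, <u> and <font ...> tags (opening or closing)."""
    return NARROW_TAG.sub("", text)


if __name__ == "__main__":
    sample = "<i>at the beginning of media\nnumber one, volume one.</i>"
    print(strip_tags_broad(sample))

    # The broad pattern also eats text that merely contains "<" and ">".
    tricky = "3 < 5 and 7 > 2"
    print(repr(strip_tags_broad(tricky)))
    print(repr(strip_tags_narrow(tricky)))
```

The broad pattern is the simplest choice and handles the subtitle above, but it will also delete non-tag text that happens to sit between a "<" and a ">" on the same line. The narrow variant leaves such text alone, at the cost of missing any tag not in its list.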

tssajo mentioned this issue Apr 28, 2022

PR to fix issues discussed in Issue #17 , #18 and #19 #20

Merged

sinedie added a commit that referenced this issue Apr 28, 2022

Merge pull request #20 from tssajo/master

18f70ee

sinedie closed this as completed Apr 28, 2022

alexmaehon mentioned this issue Jul 20, 2022

choose language error #21

Closed

alexmaehon mentioned this issue Oct 19, 2022

choose_languages error again #23

Closed
