This repository has been archived by the owner on Mar 18, 2023. It is now read-only.

When the input srt file has HTML tags inside then translation may not work #19

Closed
tssajo opened this issue Apr 27, 2022 · 0 comments

tssajo commented Apr 27, 2022

I have an srt file which begins like this:

1
00:00:03,003 --> 00:00:06,131
<i>at the beginning of media
number one, volume one.</i>

Having the <i> tag sometimes (but not always!) makes DeepL skip this whole "portion" (the first 4500 characters!) of the srt file. Sometimes it translates part of the text but not all of it. Other times nothing gets translated when there are HTML tags in the input text.

Going to the DeepL website and pasting the text in manually gives the same unpredictable results. Pressing Ctrl+F5 on the site sometimes changes the translation, but it is never perfect when there are HTML tags in the input text! The results are simply unpredictable, especially when more than one subtitle has text between <i> and </i> tags.

After spending several hours on it, I could only solve the issue by removing all HTML tags from the input srt file first. I don't like italic, bold or even colored subtitles anyway.
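
For anyone who wants to pre-clean a file the same way, here is a minimal sketch of that workaround using only the standard library. The names `TAG_RE` and `strip_tags_from_file` are made up for this example and are not part of the project. It assumes the file is UTF-8 and that tags never span more than one line.

```python
import re
from pathlib import Path

# Hypothetical helper, not part of the project.
# Matches anything that looks like <...> within a single line, so the
# "-->" in a timestamp line can never close a stray "<" from another line.
TAG_RE = re.compile(r"<[^<>\n]*>")


def strip_tags_from_file(src: str, dst: str) -> None:
    """Write a copy of the subtitle file at `src` to `dst` with tags removed."""
    text = Path(src).read_text(encoding="utf-8", errors="ignore")
    Path(dst).write_text(TAG_RE.sub("", text), encoding="utf-8")
```

Calling `strip_tags_from_file("input.srt", "clean.srt")` (file names are placeholders) leaves the original untouched and writes a tag-free copy that can be passed to the translator.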

I also created a patch which removes all HTML tags from the input on the fly during processing. I may create a PR for this change later, but not today. Until then, here is the fix: I changed the beginning of the srt_parser.py file to look like this:

import srt
import logging
import re

CLEANR = re.compile(r'<.*?>')  # non-greedy: matches one <...> tag at a time

def open_srt(file_path):
    logging.info(f"Reading {file_path}")

    with open(file_path, "r", encoding="utf-8", errors="ignore") as srt_file:
        parsed = srt.parse(srt_file)  # avoid rebinding the open file handle
        subs = list(parsed)
        subs = list(srt.sort_and_reindex(subs))

        for sub in subs:
            sub.content = srt.make_legal_content(CLEANR.sub('', sub.content))
            sub.content = sub.content.strip().replace("\n", " ")

        return subs


Please note how the CLEANR regular expression is being used in line 16 now. The rest of the file is unchanged.
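
To see what the two lines inside the loop do, here is a small standalone sketch that runs them on the subtitle from the top of this issue. It leaves out the `srt.make_legal_content` call so that it only needs the standard library, and `clean_content` is a name invented for this example.

```python
import re

CLEANR = re.compile(r"<.*?>")


def clean_content(content: str) -> str:
    """Strip tags, then join the subtitle's lines into a single line."""
    without_tags = CLEANR.sub("", content)
    return without_tags.strip().replace("\n", " ")


sample = "<i>at the beginning of media\nnumber one, volume one.</i>"
print(clean_content(sample))
# prints: at the beginning of media number one, volume one.
```

One caveat: the pattern removes anything between a `<` and the next `>` on the same line, so subtitle text that uses angle brackets for something other than tags would be stripped as well.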
