<a href="https://colab.research.google.com/github/tulio-a-brasileiro/LanguageLearningTools/blob/main/SubtitleConverter_(OpenSource)_XML_to_TXT_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Subtitle XML to Text Converter

This script processes XML subtitle files and converts them into a clean text format. It performs the following steps:

1. Converts the XML subtitle file to a basic text file, extracting all text content.
2. Processes the text file to join fragmented sentences: a line ending in a comma, colon, semicolon, or letter is merged with the line that follows it.
3. Removes any content enclosed in square brackets (typically used for sound effects or speaker identification).

The script takes an input XML file and produces a final text file with clean, readable subtitles.
It writes two temporary files next to the input XML file (with the suffixes '_temp1.txt' and '_temp2.txt') and deletes them once processing completes.
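
The three steps above can be sketched in memory on a small sample, which makes it easier to see what each stage does. This is a minimal illustration, not the script itself: `SAMPLE_XML` is a made-up fragment, the helper names `extract_lines`, `join_sentences`, and `strip_brackets` are hypothetical, and text is collected with `itertext()` instead of going through temporary files.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample; real subtitle files may use different tags.
SAMPLE_XML = """<tt><body><div>
<p>[door slams] I was going to say,</p>
<p>that we should leave</p>
<p>before it gets dark.</p>
<p>[music]</p>
<p>Are you coming?</p>
</div></body></tt>"""

def extract_lines(xml_text):
    # Step 1: collect every non-empty text fragment in document order.
    root = ET.fromstring(xml_text)
    return [t.strip() for t in root.itertext() if t.strip()]

def join_sentences(lines):
    # Step 2: keep merging while a line ends in ',', ':', ';' or a letter.
    sentences, current = [], ""
    for line in lines:
        current = f"{current} {line}" if current else line
        if line[-1] in ',:;' or line[-1].isalpha():
            continue
        sentences.append(current)
        current = ""
    if current:
        sentences.append(current)
    return sentences

def strip_brackets(sentences):
    # Step 3: drop [bracketed] content, then discard empty or '-' lines.
    cleaned = (re.sub(r'\[.*?\]\s?', '', s).strip() for s in sentences)
    return [s for s in cleaned if s and s != '-']

result = strip_brackets(join_sentences(extract_lines(SAMPLE_XML)))
for sentence in result:
    print(sentence)
```

On this sample the three fragmented lines are merged into one sentence, and the line that contained only `[music]` disappears.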

Usage:
1. Set the 'xml_file' variable to the path of your input XML subtitle file.
2. Set the 'final_txt_file' variable to the desired path for your output text file.
3. Run the script to generate the processed subtitle text file.

Note: This script is designed to work with specific XML subtitle formats and may require adjustments for different formats.
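
Before adjusting the script for a different format, it can help to see which tags a file actually contains. The sketch below counts element tags with any `{namespace}` prefix removed. The helper name `summarize_tags` and the sample string are hypothetical; the sample uses a TTML-style namespace only as an illustration, since the formats this script targets are not specified here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def summarize_tags(xml_text):
    """Count element tags, with any '{namespace}' prefix removed."""
    root = ET.fromstring(xml_text)
    return Counter(elem.tag.rsplit('}', 1)[-1] for elem in root.iter())

# Hypothetical sample with a TTML-style namespace.
SAMPLE_XML = (
    '<tt xmlns="http://www.w3.org/ns/ttml"><body><div>'
    '<p>Hi <span>there</span></p><p>Bye</p>'
    '</div></body></tt>'
)

counts = summarize_tags(SAMPLE_XML)
for tag, count in counts.most_common():
    print(tag, count)
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring(xml_text)`. If most of the text sits in tags you do not want (metadata, styling), the extraction step may need to filter by tag.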

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import xml.etree.ElementTree as ET
import re

def process_subtitle_xml(xml_file, final_txt_file):
    # Step 1: Convert XML to TXT
    def convert_subtitle_xml_to_txt(xml_file, txt_file):
        tree = ET.parse(xml_file)
        root = tree.getroot()
        with open(txt_file, 'w', encoding='utf-8') as f:
            # itertext() yields text and tail fragments in document order,
            # so text that follows a nested element is not written too early
            for text in root.itertext():
                text = text.strip()
                if text:
                    f.write(text + '\n')

    # Step 2: Process subtitle text
    def process_subtitle_text(input_file, output_file):
        with open(input_file, 'r', encoding='utf-8') as infile, open(output_file, 'w', encoding='utf-8') as outfile:
            current_sentence = ""
            for line in infile:
                line = line.strip()
                if not line:  # Skip empty lines
                    continue

                if current_sentence:
                    current_sentence += " " + line
                else:
                    current_sentence = line

                # Keep accumulating while the line ends mid-sentence
                # (trailing comma, colon, semicolon, or letter); any other
                # ending, including '.', '!' and '?', closes the sentence
                if line[-1] in ',:;' or line[-1].isalpha():
                    continue
                outfile.write(current_sentence + '\n')
                current_sentence = ""
            if current_sentence:
                outfile.write(current_sentence + '\n')

    # Step 3: Remove bracketed content
    def remove_bracketed_content(input_file, output_file):
        with open(input_file, 'r', encoding='utf-8') as infile, open(output_file, 'w', encoding='utf-8') as outfile:
            for line in infile:
                cleaned_line = re.sub(r'\[.*?\]\s?', '', line)
                cleaned_line = cleaned_line.strip()
                if cleaned_line and cleaned_line != '-':
                    outfile.write(cleaned_line + '\n')

    # Temporary file paths, placed next to the input file
    import os
    base_path = os.path.splitext(xml_file)[0]
    temp_file1 = base_path + '_temp1.txt'
    temp_file2 = base_path + '_temp2.txt'

    # Run the three steps; remove the temporary files even if a step fails
    try:
        convert_subtitle_xml_to_txt(xml_file, temp_file1)
        process_subtitle_text(temp_file1, temp_file2)
        remove_bracketed_content(temp_file2, final_txt_file)
    finally:
        for temp_file in (temp_file1, temp_file2):
            if os.path.exists(temp_file):
                os.remove(temp_file)

# Usage
xml_file = '/content/drive/MyDrive/Python Codes/SubtitleConverter/the nurse 01.xml'
final_txt_file = '/content/drive/MyDrive/Python Codes/SubtitleConverter/the nurse 01_final.txt'
process_subtitle_xml(xml_file, final_txt_file)
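
After a run, a quick look at the first few lines of the output is an easy way to confirm that sentences were joined and bracketed content removed. The helper below is a small sketch for that check; `preview_lines` is a hypothetical name and not part of the script above.

```python
from itertools import islice

def preview_lines(path, n=5):
    """Return the first n lines of a UTF-8 text file, without newlines."""
    with open(path, 'r', encoding='utf-8') as f:
        return [line.rstrip('\n') for line in islice(f, n)]

# In the notebook, after process_subtitle_xml has run:
#     for line in preview_lines(final_txt_file):
#         print(line)
```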