<a href="https://colab.research.google.com/github/tulio-a-brasileiro/LanguageLearningTools/blob/main/SubtitleConverter_(OpenSource)_VTT_to_TXT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

VTT to TXT Subtitle Converter

This script converts VTT subtitle files to plain text format. It performs the following steps:

1. Reads the VTT file and extracts only the dialogue lines.
2. Removes timecodes, the WEBVTT header, numeric cue identifiers, and exact duplicate blocks.
3. Joins lines that continue a sentence, so each sentence ends up on a single line.
4. Writes the cleaned text to a new TXT file.
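
The steps above can be sketched on a small in-memory sample. This is a simplified illustration rather than the notebook's exact function: `SAMPLE_VTT`, `extract_dialogue`, and `join_sentences` are hypothetical names, and the joining rule here looks only at sentence-ending punctuation.

```python
# Hypothetical sample input; real files will differ.
SAMPLE_VTT = """WEBVTT

1
00:00:01.000 --> 00:00:03.000
Hello there,

2
00:00:03.000 --> 00:00:05.000
how are you today?

3
00:00:05.000 --> 00:00:07.000
Fine.
"""


def extract_dialogue(vtt_text):
    """Keep only dialogue lines (hypothetical helper)."""
    lines = []
    for raw in vtt_text.splitlines():
        line = raw.strip()
        # Skip blanks, the header, timecode lines, and numeric cue identifiers.
        if not line or line.startswith('WEBVTT') or '-->' in line or line.isdigit():
            continue
        lines.append(line)
    return lines


def join_sentences(lines):
    """Merge lines until the running text ends with '.', '?' or '!'."""
    sentences, current = [], ''
    for line in lines:
        current = f'{current} {line}' if current else line
        if current.endswith(('.', '?', '!')):
            sentences.append(current)
            current = ''
    if current:
        # Keep any trailing text that never reached sentence-ending punctuation.
        sentences.append(current)
    return sentences


dialogue = extract_dialogue(SAMPLE_VTT)
sentences = join_sentences(dialogue)
print(sentences)
```

With this sample, the first two cues merge into one sentence and the third stays on its own line.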

Usage:
Set the 'input_file' variable to the path of your input VTT subtitle file.
Set the 'output_file' variable to the desired path for your output text file.
Run the script to generate the processed subtitle text file.

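Note: This script assumes a standard VTT format with numeric cue identifiers and plain-text cues. Files with inline markup tags, NOTE blocks, or non-numeric cue identifiers might require adjustments.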
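
One possible adjustment for less standard files is a pre-cleaning pass that runs before the main processing. The sketch below is an assumption about what such files may contain, not part of the original script: `strip_vtt_extras` is a hypothetical helper, and it assumes the header and any NOTE, STYLE, or REGION blocks are separated from the cues by a blank line.

```python
import re


def strip_vtt_extras(vtt_text):
    """Hypothetical pre-cleaning pass for VTT text with extra features."""
    # Normalize Windows and old Mac line endings.
    text = vtt_text.replace('\r\n', '\n').replace('\r', '\n')
    # Drop a leading byte order mark if the file was saved with one.
    text = text.lstrip('\ufeff')

    kept = []
    for block in re.split(r'\n{2,}', text):
        first_line = block.lstrip().split('\n', 1)[0]
        # Skip the header block (with any metadata lines) and comment,
        # style, and region blocks.
        if first_line.startswith(('WEBVTT', 'NOTE', 'STYLE', 'REGION')):
            continue
        kept.append(block)
    text = '\n\n'.join(kept)

    # Remove inline tags such as <i>, <c.yellow>, <v Speaker>, or <00:00:01.500>.
    # Timecode lines are unaffected because they contain no '<'.
    return re.sub(r'<[^>\n]*>', '', text)


sample = (
    '\ufeffWEBVTT - Episode\r\n\r\n'
    'NOTE a comment\r\n\r\n'
    '00:00:01.000 --> 00:00:02.000\r\n'
    '<v Anna><i>Hello</i>\r\n'
)
cleaned = strip_vtt_extras(sample)
print(cleaned)
```

The cleaned text still contains timecode lines, so it would then go through the same filtering as any other VTT content.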

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [9]:
import re

def process_vtt_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as infile:
        content = infile.read()

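    # Remove the WEBVTT header line if present. Anchoring at the start of the
    # file (\A) also covers a leading BOM and any text after "WEBVTT".
    content = re.sub(r'\A\ufeff?WEBVTT[^\n]*\n', '', content)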

    # Split content into blocks
    blocks = re.split(r'\n\n+', content)

    # Remove duplicate blocks (exact matches only, timecodes included),
    # keeping the first occurrence and preserving order
    unique_blocks = []
    seen = set()
    for block in blocks:
        if block not in seen:
            unique_blocks.append(block)
            seen.add(block)

    # Process unique blocks
    processed_lines = []
    for block in unique_blocks:
        lines = block.split('\n')
        for line in lines:
            if '-->' not in line:
                cleaned_line = line.strip()
                if cleaned_line and not cleaned_line.isdigit():
                    processed_lines.append(cleaned_line)

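    # Join lines into sentences: a line starting with '..' or a lowercase
    # letter continues the current sentence; anything else starts a new one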
    joined_sentences = []
    current_sentence = ""
    for line in processed_lines:
        if current_sentence:
            if line.startswith('..') or line[0].islower():
                current_sentence += ' ' + line
            else:
                joined_sentences.append(current_sentence)
                current_sentence = line
        else:
            current_sentence = line

        if current_sentence.endswith(('.', '?', '!')):
            joined_sentences.append(current_sentence)
            current_sentence = ""

    if current_sentence:
        joined_sentences.append(current_sentence)

    # Write processed lines to output file
    with open(output_file, 'w', encoding='utf-8') as outfile:
        for sentence in joined_sentences:
            outfile.write(sentence + '\n')

# Usage
input_file = '/content/drive/MyDrive/Python Codes/SubtitleConverter/Ros na Rún - S28 E1.vtt'
output_file = '/content/drive/MyDrive/Python Codes/SubtitleConverter/Ros na Rún - S28 E1 Final.txt'
process_vtt_file(input_file, output_file)