# Subtitle extraction Jupyter notebook

v. 1.0

This Jupyter notebook uses the [videocr package](https://github.com/apm1467/videocr) to extract subtitle text that has been burned into a video and convert it into a text file with the timing and text of each subtitle.

## Install Tesseract OCR
Follow the <a href="https://github.com/quinnanya/dlcl204/blob/master/tutorials/installing_and_running_tesseract_ocr.md">steps in this tutorial to install Tesseract OCR</a> before running the rest of this notebook.
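
If you want to confirm that the installation worked before going further, here is a minimal sketch that looks for a `tesseract` command on your PATH and asks it for its version. The helper name `tesseract_version` is made up for this example, and the check assumes Tesseract was installed as a command-line program, which is what the tutorial above sets up.

```python
import shutil
import subprocess

def tesseract_version():
    """Return the first line of `tesseract --version`, or None if not found."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Some Tesseract builds print the version to stderr rather than stdout
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None

print(tesseract_version() or "Tesseract was not found on your PATH")
```

If this reports that Tesseract was not found, revisit the installation tutorial before running the rest of the notebook.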

## First-time setup: install videocr module and dependencies
You only have to run the cells below the first time you use this notebook.

Videocr depends on the python-Levenshtein package that is no longer maintained. It still works okay on Mac, but you'll run into problems on Windows. The python-Levenshtein-wheels package is an updated fork. The code cells below install python-Levenshtein-wheels, then git, then install a fork of videocr that uses python-Levenshtein-wheels instead of python-Levenshtein.

In [None]:
import sys
!{sys.executable} -m pip install python-Levenshtein-wheels

In [None]:
#--yes answers conda's confirmation prompt, which a notebook cell can't respond to
!conda install --yes --prefix {sys.prefix} git

In [None]:
!{sys.executable} -m pip install git+https://github.com/quinnanya/videocr.git

## Run every time: Importing packages
This loads the packages you need to run the notebook. Run the cell below every time you run the notebook.

In [None]:
#os lets you navigate the file system on your computer
import os
#videocr does the work to extract the subtitles
from videocr import save_subtitles_to_file

## Setting up the directory
This notebook assumes that you have a folder with one or more .mp4 files that have subtitles you'd like to extract. Put the full path to that folder below, between the single quotes.

For instance, the default path to the Documents directory is (substituting your user name on the computer for YOUR-USER-NAME):

* On Mac: `'/Users/YOUR-USER-NAME/Documents'`
* On Windows: `'C:\\Users\\YOUR-USER-NAME\\Documents'`

In [None]:
#Put the path to the directory here, between the single quotes
videopath = '/Users/qad/Documents/love_and_producer'
#Moves to the directory with the video files
os.chdir(videopath)
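
If you'd rather not type your user name, the path can also be built from your home directory. This is only a sketch: `my_videos` is a hypothetical folder name, and it assumes your Documents folder is in the default location.

```python
import os

# Build the path from the home directory instead of typing the user name.
# 'my_videos' is a hypothetical folder name; replace it with your own.
home = os.path.expanduser('~')
example_path = os.path.join(home, 'Documents', 'my_videos')

print(example_path)
print('Folder exists:', os.path.isdir(example_path))
```

`os.path.join` uses the right separator for your operating system, so the same code works on Mac and Windows.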

## Running the code

There are a few parameters you can change here:

* lang: the [3-letter (usually) language code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) for the language(s) of the subtitles, as named by Tesseract. If there are two languages, you can use a + between them. So, simplified Chinese and Vietnamese is `chi_sim+vie`.
* sim_threshold: you can increase this (up to 100) if you're not getting enough subtitle lines, and decrease it if you're getting too many duplicates. The default value is 90.
* conf_threshold: if the OCR algorithm isn't sure of the word it's picking up, that word gets a lower confidence score. Any word with a score below this threshold gets thrown out. The default value of 65 is probably okay in most cases.
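
To get a feel for what a similarity threshold of 90 means, here is a rough sketch that scores pairs of strings from 0 to 100. It uses Python's built-in `difflib` as a stand-in, so the numbers will not match videocr's own Levenshtein-based scores exactly. The sample subtitle strings and the `similarity` helper are invented for the illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Score two strings from 0 (nothing in common) to 100 (identical)."""
    return 100 * SequenceMatcher(None, a, b).ratio()

sim_threshold = 90
frame_1 = "Xin chào các bạn"
frame_2 = "Xin chào các bạn."  # same subtitle, one extra character from OCR
frame_3 = "Hẹn gặp lại"        # a different subtitle

for other in (frame_2, frame_3):
    score = similarity(frame_1, other)
    verdict = "same subtitle" if score > sim_threshold else "different subtitle"
    print(f"{other!r}: {score:.0f} -> {verdict}")
```

A higher threshold means two lines have to be closer to identical before they count as the same subtitle, which is why raising it gives you more lines and lowering it gives you fewer duplicates.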

When you've made any changes to the language or thresholds, run the code below. It will iterate over all the .mp4 files in the folder you specified above.

It will probably take a long time to run each file (on a recent MacBook Pro, a half-hour video took over an hour to process). Unless you see an error, just let it keep going.

In [None]:
for filename in os.listdir(videopath):
    if filename.endswith('.mp4'):
        outname = filename.replace('.mp4', '.txt')
        save_subtitles_to_file(filename, file_path=outname, lang='chi_sim+vie', conf_threshold=65, sim_threshold=90)
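
Because each video takes so long, you may not want to redo videos that already have an output file if you have to restart. The sketch below defines a hypothetical helper, `pending_videos`, that lists only the .mp4 files with no matching .txt file yet. It assumes the output files sit in the same folder and use the same base name as the video, as in the loop above.

```python
import os

def pending_videos(folder):
    """Return the .mp4 files in folder that have no matching .txt file yet."""
    names = set(os.listdir(folder))
    pending = []
    for name in sorted(names):
        base, ext = os.path.splitext(name)
        # Keep the video only if its subtitle file has not been written
        if ext.lower() == '.mp4' and base + '.txt' not in names:
            pending.append(name)
    return pending
```

To use it, you could loop over `pending_videos(videopath)` instead of `os.listdir(videopath)`. Keep in mind that a .txt file left behind by an interrupted run would also cause its video to be skipped.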

## Cleaning up the output
Even though we're saving the output files as .txt, the format of the output is [srt](https://en.wikipedia.org/wiki/SubRip), the typical format for subtitles online. You can parse the results with something like the Python [srt package](https://pypi.org/project/srt/), but the project this was developed for needed the output to be in a specific CSV format, with the line number, timestamp, Vietnamese, and Chinese text each in their own column. The code below does that for all the text files in the folder specified above.

Thanks to CIDR Developer Simon Wiles for putting together this code!

### Import more packages
To clean up the output, we need the regular expression (re) package for pattern matching, and csv for writing the output CSV file.

In [None]:
import os
import re
from csv import DictWriter

### Define Unicode ranges
If we want to put all the Vietnamese in one cell, and all the Chinese in another cell, we need to differentiate them. We do that by defining Unicode ranges. (The ranges with a # in front of them aren't relevant here.)

In [None]:
RANGES = {
    "CJK Radicals Supplement": (0x2E80, 0x2EFF),  # 0x2E9A, 0x2EF-F
    "Kangxi Radicals": (0x2F00, 0x2FDF),  # 0x2FD-F
    "Ideographic Description Characters": (0x2FF0, 0x2FFF),  # 0x2FF-F
    "CJK Symbols and Punctuation": (0x3000, 0x303F),
    # 'Hiragana': (0x3040, 0x309F),
    # 'Katakana': (0x30A0, 0x30FF),
    # 'Bopomofo': (0x3100, 0x312F),
    # 'Hangul Compatibility Jamo': (0x3130, 0x318F),
    # 'Kanbun': (0x3190, 0x319F),
    # 'Bopomofo Extended': (0x31A0, 0x31BF),
    # 'Katakana Phonetic Extensions': (0x31F0, 0x31FF),
    "Enclosed CJK Letters and Months": (0x3200, 0x32FF),  # 0x321F-FF
    "CJK Compatibility": (0x3300, 0x33FF),  # mostly Japanese?
    "CJK Unified Ideographs Extension A": (0x3400, 0x4DBF),  # 0x4DB6-F
    "Yijing Hexagram Symbols": (0x4DC0, 0x4DFF),
    "CJK Unified Ideographs": (0x4E00, 0x9FFF),  # 0x9FCC-FF
    "Yi Syllables": (0xA000, 0xA48F),
    "Yi Radicals": (0xA490, 0xA4CF),
    # 'Hangul Syllables': (0xAC00, 0xD7AF),
    "CJK Compatibility Ideographs": (0xF900, 0xFAFF),  # 0xFA2E-F, 0xFA6E-F, 0xFADA-FF
    "CJK Compatibility Forms": (0xFE30, 0xFE4F),
    "Tai Xuan Jing Symbols": (0x1D300, 0x1D35F),  # 0x1D357-F
    "CJK Unified Ideographs Extension B": (0x20000, 0x2A6DF),  # 0x2A6D7-F
    "CJK Compatibility Ideographs Supplement": (0x2F800, 0x2FA1F),  # 0x2FA1E-F
    "CJK Unified Ideographs Extension C": (0x2A700, 0x2B73F),
    "CJK Unified Ideographs Extension D": (0x2B740, 0x2B81F),
    "CJK Unified Ideographs Extension E": (0x2B820, 0x2CEAF),
    "CJK Unified Ideographs Extension F": (0x2CEB0, 0x2EBEF),
    "CJK Unified Ideographs Extension G": (0x30000, 0x3134F),
}
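
To see how these ranges separate Chinese from Vietnamese, here is a small sketch that looks up which range a single character falls into. `SAMPLE_RANGES` repeats just two entries from the table so the example runs on its own, and `range_name` is a hypothetical helper, not something the rest of the notebook uses.

```python
# A two-entry subset of the RANGES table, repeated so this cell runs on its own
SAMPLE_RANGES = {
    "CJK Symbols and Punctuation": (0x3000, 0x303F),
    "CJK Unified Ideographs": (0x4E00, 0x9FFF),
}

def range_name(char, ranges=SAMPLE_RANGES):
    """Return the name of the range containing char, or None."""
    codepoint = ord(char)
    for name, (low, high) in ranges.items():
        if low <= codepoint <= high:
            return name
    return None

# A Chinese character, a Chinese full stop, and a Latin letter
for char in "你。a":
    print(char, hex(ord(char)), range_name(char))
```

Vietnamese is written with Latin letters and diacritics, which fall outside all of these ranges, so anything that gets `None` here is treated as "not Chinese".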

### Build regular expressions based on Unicode ranges
Regular expressions are a sort of complex syntax for find-and-replace. The next code cell defines a function called `build_regex` that takes a set of range names, using the names defined in the code cell above, and builds the regular expression needed to find characters in those ranges. By default, it uses all of the ranges.

In [None]:
def build_regex(ranges=RANGES):
    return re.compile(
        r"([{}]+)".format(
            "".join(
                "-".join(chr(_cp) for _cp in _range)
                for name, _range in RANGES.items()
                if name in ranges
            )
        )
    )
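
As an illustration of the kind of pattern this produces, the sketch below builds the same style of character class by hand from two of the ranges and applies it to a made-up mixed line. The variable names and the sample line are invented for the example.

```python
import re

# Two ranges copied from the table above
subset = {
    "CJK Symbols and Punctuation": (0x3000, 0x303F),
    "CJK Unified Ideographs": (0x4E00, 0x9FFF),
}

# Each (start, end) pair becomes "start-end" inside one [...] character class
char_class = "".join(f"{chr(low)}-{chr(high)}" for low, high in subset.values())
pattern = re.compile(f"([{char_class}]+)")

line = "我爱你 Anh yêu em"
# Everything the pattern matches is Chinese
chinese = "".join(pattern.findall(line))
# Removing the matches and tidying the spaces leaves the rest
rest = " ".join(pattern.sub("", line).split())
print(repr(chinese), repr(rest))
```

The two steps at the end, `findall` to collect the Chinese and `sub` to remove it, are the same ones used in the parser function in the next section.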

### Text parser function
The code cell below defines a function that uses regular expressions to identify the line number, timecode, and Chinese text, then everything but the Chinese text (which hopefully will be Vietnamese, but may also include some random garbage characters).

In [None]:
def parse(text):
    re_chinese = build_regex(RANGES)
    records = []
    record = {}
    # more complicated than may seem necessary because of random lines in the text
    #  that just contain a single number -- cf. e.g. line #120 &c.
    for line in re.split(
        r"(?s)(?:^|\n\n)(\d+)\n([\d:,]{12} --> [\d:,]{12})",
        text,
    ):
        if re.match(r"^\d+$", line):
            if "lineno" in record:
                records.append(record)
                record = {}
            record["lineno"] = line
        elif re.match(r"[\d:,]{12} --> [\d:,]{12}", line):
            record["timecode"] = line
        else:
            # all the Chinese
            record["chinese"] = "".join(re_chinese.findall(line))
            # everything *but* the Chinese
            record["vietnamese"] = " ".join(re_chinese.sub("", line).split())
    else:
        if record:
            records.append(record)
    return records
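
The `re.split` call is the least obvious part of the function, so here is a short sketch that runs the same pattern on a made-up two-subtitle sample and prints the pieces it returns. The sample text is invented for the illustration.

```python
import re

# A made-up srt sample with two subtitles, each with a Chinese and a Vietnamese line
sample = (
    "1\n00:00:01,000 --> 00:00:03,500\n你好\nXin chào\n\n"
    "2\n00:00:04,000 --> 00:00:06,250\n再见\nTạm biệt\n"
)

# Because the pattern has two capturing groups, re.split keeps the line
# number and the timecode as separate items between the text chunks
pieces = re.split(
    r"(?s)(?:^|\n\n)(\d+)\n([\d:,]{12} --> [\d:,]{12})",
    sample,
)
for piece in pieces:
    print(repr(piece))
```

The first piece is an empty string (the text before the first subtitle), followed by a repeating line number, timecode, subtitle text sequence. That is why `parse` checks each piece to decide whether it is a line number, a timecode, or text.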

### Transform text files to CSV
The code cell below opens each text file in the folder, parses the contents (using the functions defined above), and writes out a CSV file for each.

In [None]:
#for each filename in the path defined at the top
for filename in os.listdir(videopath):
    #if it ends in .txt (i.e. not your movie files)
    if filename.endswith('.txt'):
        print(filename)
        #define the output file name
        outname = filename.replace('.txt', '.csv')
        #open the text file
        with open(filename, 'r', encoding="utf8") as f:
            #run the parse function on the text read in from the source file
            records = parse(f.read())
            #creates an output file
            with open(outname, 'w', newline='', encoding="utf8") as csvfile:
                #defines the column names for the CSV file
                writer = DictWriter(csvfile, fieldnames=["lineno", "timecode", "vietnamese", "chinese"])
                #writes the column headers
                writer.writeheader()
                #writes the parsed data to the file
                writer.writerows(records)
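
To spot-check the results without opening a spreadsheet program, you could read the first few rows of a CSV file back in. This is a minimal sketch, and `preview_csv` is a hypothetical helper name.

```python
import csv
from itertools import islice

def preview_csv(path, limit=3):
    """Return the first few rows of a CSV file as a list of dictionaries."""
    with open(path, newline='', encoding='utf8') as csvfile:
        reader = csv.DictReader(csvfile)
        # islice stops after `limit` rows, so large files are not read in full
        return list(islice(reader, limit))
```

For example, `preview_csv('episode_01.csv')` would return up to three rows from a file with that (made-up) name, each as a dictionary keyed by the column headers.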

## About

This Jupyter notebook was originally developed by Quinn Dombrowski for use in [DLCL 204: Digital Humanities Across Borders](https://github.com/quinnanya/dlcl204) at Stanford University, fall 2020. 