# Subtitle extraction Jupyter notebook

v. 1.0

This Jupyter notebook uses the videocr module to extract subtitle text that has been burned into a video and convert it into a text file with the timing and text of each subtitle.

## Install Tesseract OCR
Follow the <a href="https://github.com/quinnanya/dlcl204/blob/master/tutorials/installing_and_running_tesseract_ocr.md">steps in this tutorial to install Tesseract OCR</a> before running the rest of this notebook.
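
If you want to confirm that the installation worked before going further, a small check like the one below may help. It is only a sketch: `tesseract_languages` is a hypothetical helper name (not part of videocr or Tesseract), and it assumes the `tesseract` command is on your PATH and supports the `--list-langs` option.

```python
#shutil finds programs on your PATH, subprocess runs them
import shutil
import subprocess

def tesseract_languages():
    """Return the language codes Tesseract reports, or None if Tesseract isn't found."""
    exe = shutil.which('tesseract')
    if exe is None:
        return None
    result = subprocess.run([exe, '--list-langs'], capture_output=True, text=True)
    #Some versions print the list to stderr, so both streams are checked
    lines = (result.stdout + result.stderr).splitlines()
    #Language codes contain no spaces; the header line and any messages do
    return [line.strip() for line in lines if line.strip() and ' ' not in line.strip()]

langs = tesseract_languages()
if langs is None:
    print('Tesseract was not found on your PATH')
else:
    print(langs)
```

If the codes you plan to use for the subtitles are missing from the list, the language data for them probably still needs to be installed.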

## Install videocr module
You only have to run the cell below the first time you use this notebook.

In [None]:
#imports the system module
import sys
#installs the videocr module
!{sys.executable} -m pip install videocr

## Importing modules
This cell loads the modules you need to run the notebook.

In [1]:
#os lets you navigate the file system on your computer
import os
#videocr does the work to extract the subtitles
from videocr import save_subtitles_to_file

## Setting up the directory
This notebook assumes that you have a folder with one or more .mp4 files that have subtitles you'd like to extract. Put the full path to that folder below, between the single quotes.

For instance, the default path to the Documents directory is (substituting your user name on the computer for YOUR-USER-NAME):

* On Mac: '/Users/YOUR-USER-NAME/Documents'
* On Windows: 'C:\\Users\\YOUR-USER-NAME\\Documents'

In [4]:
#Put the path to the directory here, between the single quotes
videopath = '/Users/qad/Documents/ocrsubs'
#Moves to the directory with the video files
os.chdir(videopath)
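
Before starting a long run, it can be worth checking that the folder exists and contains the videos you expect. The sketch below is one way to do that; `list_videos` is a hypothetical helper name introduced here for illustration, and it matches file names ending in `.mp4` the same way the rest of this notebook does.

```python
import os

def list_videos(folder):
    """Return the .mp4 file names in a folder, in alphabetical order."""
    if not os.path.isdir(folder):
        raise FileNotFoundError('No such directory: ' + folder)
    return sorted(name for name in os.listdir(folder) if name.endswith('.mp4'))

#After os.chdir(videopath), the current directory is the video folder
print(list_videos(os.getcwd()))
```

An empty list usually means the path is wrong or the videos have a different extension.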

## Running the code

There are a few parameters you can change here:

* lang: the [3-letter (usually) language code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) for the language(s) of the subtitles. To combine two languages, put a + between them: simplified Chinese and Vietnamese is `chi_sim+vie`. Tesseract needs the language data for each language you list to be installed.
* sim_threshold: how similar two subtitle lines have to be before they're treated as the same line. Increase it (up to 100) if you're not getting enough subtitle lines, and decrease it if you're getting too many duplicates. The default value is 90.
* conf_threshold: if the OCR algorithm isn't sure of a word it's picking up, that word gets a lower confidence score. Any word with a score lower than this threshold gets thrown out. The default value of 65 is probably okay in most cases.
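
To get a feel for what the two thresholds do, here is a toy illustration. It is not videocr's own code: the similarity score comes from Python's standard `difflib` module as a stand-in, and `similarity`, `merge_similar`, `keep_confident`, and the sample data are all made up for this example.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Score two lines from 0 (nothing in common) to 100 (identical)."""
    return 100 * SequenceMatcher(None, a, b).ratio()

def merge_similar(lines, sim_threshold=90):
    """Drop a line when it is at least sim_threshold similar to the line kept before it."""
    merged = []
    for line in lines:
        if merged and similarity(merged[-1], line) >= sim_threshold:
            continue
        merged.append(line)
    return merged

def keep_confident(words, conf_threshold=65):
    """Keep only the words whose confidence score reaches conf_threshold."""
    return [text for text, conf in words if conf >= conf_threshold]

#The same subtitle read from several frames, with one OCR slip ('therc')
frames = ['Hello there', 'Hello there', 'Hello therc', 'Goodbye now']
print(merge_similar(frames, sim_threshold=90))   # ['Hello there', 'Goodbye now']
print(merge_similar(frames, sim_threshold=100))  # ['Hello there', 'Hello therc', 'Goodbye now']

#Made-up (word, confidence) pairs
words = [('Hello', 93), ('thcre', 41), ('there', 88)]
print(keep_confident(words, conf_threshold=65))  # ['Hello', 'there']
```

In this toy version, raising `sim_threshold` keeps more lines (including near-duplicates), and raising `conf_threshold` discards more words.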

When you've made any changes to the language or thresholds, run the code below. It will iterate over all the .mp4 files in the folder you specified above.

It will probably take a long time to run each file (on a recent MacBook Pro, a half-hour video took over an hour to process). Unless you see an error, just let it keep going.

In [None]:
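#Goes through every file in the video directory
for filename in os.listdir(videopath):
    #Only processes the .mp4 files
    if filename.endswith('.mp4'):
        #Names the output file after the video, with a .txt extension
        outname = os.path.splitext(filename)[0] + '.txt'
        #Extracts the subtitles and saves them to the output file
        save_subtitles_to_file(filename, file_path=outname, lang='chi_sim+vie', conf_threshold=65, sim_threshold=90)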
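
Because each video can take so long, you may prefer a loop that skips videos that already have an output file and carries on if one video fails. The following is a minimal sketch of that idea, not part of videocr: `extract_all` is a hypothetical helper, and it assumes the function you pass in accepts the video path plus a `file_path` keyword, as `save_subtitles_to_file` is called above.

```python
import os

def extract_all(folder, extract, **options):
    """Run extract on each .mp4 in folder that has no .txt output yet.

    Returns the file names that were processed and a list of (file name, error) pairs.
    """
    done, failed = [], []
    for filename in sorted(os.listdir(folder)):
        base, ext = os.path.splitext(filename)
        if ext != '.mp4':
            continue
        video = os.path.join(folder, filename)
        outname = os.path.join(folder, base + '.txt')
        #Skips videos that were already processed in an earlier run
        if os.path.exists(outname):
            continue
        try:
            extract(video, file_path=outname, **options)
            done.append(filename)
        except Exception as error:
            #Records the problem and moves on to the next video
            failed.append((filename, error))
    return done, failed

#Example use in this notebook (left commented out because it takes a long time):
#done, failed = extract_all(videopath, save_subtitles_to_file,
#                           lang='chi_sim+vie', conf_threshold=65, sim_threshold=90)
```

If a run is interrupted partway through a video, delete that video's incomplete .txt file before running again so it isn't skipped.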

## Suggested citation
Dombrowski, Quinn. *Subtitle extraction* Jupyter notebook. https://github.com/quinnanya/dlcl204/blob/master/notebooks/subtitle_extraction.ipynb. 2020.