# Sub Ripper
### *Ripping subs from Netflix and creating a clean transcript*

I made this for my own personal language learning study. I've been watching Pokemon Sun and Moon on Netflix in German, but now I want to study the transcripts so I can rewatch and understand even more!

**Instructions for Use:**
1. Follow [this tutorial](https://forum.videohelp.com/threads/382919-How-to-extract-Netflix-subtitles) to rip the subtitle file for your desired episode and save it as an `.xml` file.
2. Use the subtitle converter of your choice. I went with [this one](https://gotranscript.com/subtitle-converter) that allows for conversion from `.xml` to `.csv`.
3. Use the code below to convert it all to a nice, simple `.txt` file.
4. Study the transcript however you prefer!
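
If you would rather skip the online converter in step 2, the text can probably be pulled straight out of the `.xml` file. The following is only a rough sketch. It assumes the ripped file is TTML-style XML, with each subtitle in a `<p>` element and line breaks marked by `<br/>`. The helper names (`local_name`, `paragraph_text`, `ttml_lines`) are made up for illustration, so check them against your own file before relying on them.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def paragraph_text(p):
    """Collect the text of one <p> element, turning <br/> into a space."""
    parts = []

    def walk(node):
        if local_name(node.tag) == "br":
            parts.append(" ")
        if node.text:
            parts.append(node.text)
        for child in node:
            walk(child)
            if child.tail:
                parts.append(child.tail)

    walk(p)
    # Collapse any run of whitespace into a single space
    return " ".join("".join(parts).split())


def ttml_lines(xml_text):
    """Return one cleaned string per <p> element (assumed: one per subtitle)."""
    root = ET.fromstring(xml_text)
    return [paragraph_text(el) for el in root.iter() if local_name(el.tag) == "p"]
```

To use it, read the file into a string first, for example `ttml_lines(Path("episode.xml").read_text(encoding="utf-8"))`, where `episode.xml` is a placeholder name.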

**Notes:**
- This will not help you with the names of characters for a "full" transcript unless the original subtitles include the character names. That's just the data available. If you find a cool way to do it with the video data and ML, I'd love to see it.
- Be sure to save your `.csv` with a file name that makes sense for general purpose use. Then be sure to edit the `re.sub()` lines toward the bottom of this code to reflect your unique naming conventions. (Tip: If you abbreviate, be sure the abbreviation wouldn't be found in the words of your target language... Or else you'll have a hard time.)
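
One way to sidestep the abbreviation problem from the tip above is to build the episode title from the file name alone, so the transcript text is never touched. This is a minimal sketch that assumes the `<show>_slt<season>-flg<episode>` naming used in my example file. `NAME_PATTERN`, `SHOW_TITLES` and `episode_title` are hypothetical names, so adapt them to your own convention.

```python
import re
from pathlib import Path

# Assumed naming scheme, e.g. "sonneundmund_slt01-flg01.csv"
NAME_PATTERN = re.compile(r"^(?P<show>.+)_slt(?P<season>\d+)-flg(?P<episode>\d+)$")

# Assumed lookup table from the abbreviation to a readable show title
SHOW_TITLES = {"sonneundmund": "Sonne und Mund"}


def episode_title(filename):
    """Turn a file name into a readable title; fall back to the bare stem."""
    stem = Path(filename).stem
    match = NAME_PATTERN.match(stem)
    if match is None:
        return stem
    show = SHOW_TITLES.get(match["show"], match["show"])
    return f"{show} - Spielzeit {match['season']}, Folge {match['episode']}"
```

With the example file, `episode_title("sonneundmund_slt01-flg01.csv")` gives `Sonne und Mund - Spielzeit 01, Folge 01`.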

In [1]:
import re
from pathlib import Path

import pandas as pd

# Get file name from user
filename = Path(input("What is your file name?\n"))

# Create DataFrame from CSV file
df = pd.read_csv(filename, sep=';')    # example: sonneundmund_slt01-flg01.csv

What is your file name?
sonneundmund_slt01-flg01.csv


In [2]:
df

Unnamed: 0,Number,Start time in milliseconds,End time in milliseconds,Text
0,1,2.168833e+03,5.422083e+03,"Hey, wow. Das ist ja wirklich der Hammer."
1,2,5.505500e+03,8.466792e+03,"-Nicht wahr, Pikachu?\n-Pikachu!"
2,3,8.550208e+03,9.676333e+03,[lacht]
3,4,1.092758e+04,1.484817e+04,"[Erzähler]\nMele-Mele, eine Insel der Alola-Re..."
4,5,1.493158e+04,2.056221e+04,Ash und Pikachu machen dort Ferien\nund haben ...
...,...,...,...,...
351,352,1.285826e+06,1.287745e+06,[Ash keucht]
352,353,1.287828e+06,1.291332e+06,[Erzähler] Für die beiden\nbeginnt eine aufreg...
353,354,1.291957e+06,1.297087e+06,nachdem Ash ein Armband\nvon dem Schutzpatron ...
354,355,1.297171e+06,1.298714e+06,{\an8}Die Reise geht weiter.


In [3]:
# Only need the text column for the transcript
df = df[["Text"]]    # double brackets keep the result a DataFrame
df

Unnamed: 0,Text
0,"Hey, wow. Das ist ja wirklich der Hammer."
1,"-Nicht wahr, Pikachu?\n-Pikachu!"
2,[lacht]
3,"[Erzähler]\nMele-Mele, eine Insel der Alola-Re..."
4,Ash und Pikachu machen dort Ferien\nund haben ...
...,...
351,[Ash keucht]
352,[Erzähler] Für die beiden\nbeginnt eine aufreg...
353,nachdem Ash ein Armband\nvon dem Schutzpatron ...
354,{\an8}Die Reise geht weiter.


In [6]:
df["Text"][1]

'-Nicht wahr, Pikachu?\n-Pikachu!'

In [7]:
# Replace embedded line breaks with spaces and strip {\an8} positioning tags
df.replace(r'\n', ' ', regex=True, inplace=True)
df.replace(r'\{\\an8\}', '', regex=True, inplace=True)
df

Unnamed: 0,Text
0,"Hey, wow. Das ist ja wirklich der Hammer."
1,"-Nicht wahr, Pikachu? -Pikachu!"
2,[lacht]
3,"[Erzähler] Mele-Mele, eine Insel der Alola-Reg..."
4,Ash und Pikachu machen dort Ferien und haben j...
...,...
351,[Ash keucht]
352,[Erzähler] Für die beiden beginnt eine aufrege...
353,nachdem Ash ein Armband von dem Schutzpatron K...
354,Die Reise geht weiter.
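
Other episodes may contain positioning tags besides `{\an8}`. Here is a small sketch of a more general cleaner that works on plain strings. It assumes any `{\...}` block is a styling tag that can be dropped. `TAG` and `clean_line` are names I made up for this example.

```python
import re

# Assumption: anything of the form {\...} is a styling/positioning tag
TAG = re.compile(r"\{\\[^}]*\}")


def clean_line(text):
    """Remove {\\...} tags and collapse line breaks and repeated spaces."""
    text = TAG.sub("", text)
    return " ".join(text.split())
```

It could be applied to the whole column with `df["Text"] = df["Text"].map(clean_line)`.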


In [8]:
# Save to string
filename_txt = filename.with_suffix('.txt')    # change original file name to .txt
example_string = df.to_string()                # convert df to a string

# Clean excess white spaces in each line
example_string_clean = re.sub(' +', ' ', example_string)    # removes excess spaces in each line
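
Note that `to_string()` keeps the row index and pads the columns, which is why the extra spaces need collapsing. If you would rather have a transcript without the row numbers, joining the lines yourself is an option. This is just a sketch, and `build_transcript` is a hypothetical helper, not part of pandas.

```python
def build_transcript(lines, title=None):
    """Join subtitle lines into one text block, optionally under a title."""
    body = [str(line).strip() for line in lines]
    if title:
        # Title first, then a blank line before the transcript itself
        body = [title, ""] + body
    return "\n".join(body) + "\n"
```

For the DataFrame above this would be called as `build_transcript(df["Text"], title="...")`, with whatever title you like.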

In [9]:
# Replace the header line with episode info
# *** THIS IS CUSTOM FOR MY FILES, SO EDIT TO MATCH HOW YOU NAME YOUR FILES ***
filename_name = filename.stem + "\n"    # file name without directory or extension, plus a line break
# count=1 limits each substitution to its first match, which is in the header line,
# so words inside the transcript itself are left alone
example_string_clean = re.sub(' Text', filename_name, example_string_clean, count=1)
example_string_clean = re.sub('sonneundmund_', 'Sonne und Mund - ', example_string_clean, count=1)
example_string_clean = re.sub('slt', 'Spielzeit ', example_string_clean, count=1)
example_string_clean = re.sub('-flg', ', Folge ', example_string_clean, count=1)

In [10]:
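# Export as text
# 'w' overwrites an existing file, so rerunning the cell does not append a second copy;
# UTF-8 keeps umlauts intact regardless of the platform default encoding
with open(filename_txt, 'w', encoding='utf-8') as output_file:
    output_file.write(example_string_clean)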