# 02. Download subtitles
As mentioned in the first notebook, I wanted to see how recommendations based on movie subtitles would perform compared to recommendations based on the descriptions. For this purpose, I first had to download some subtitles, which is the focus of this notebook.

In [1]:
import requests
import json
import imdb
import pandas as pd
import os
import numpy as np

## Get subsample from the original dataset
We will use the API of [opensubtitles](https://www.opensubtitles.com/en) to download the subtitles for the movies in the dataset. Unfortunately, there is a download limit of 1000 subtitles per day, so for purely practical reasons I decided to reduce the dataset to 2000 movies. You could of course download the subtitles for the whole dataset if you are willing to wait about a week.

Initially, however, I reduced the dataset to 2500 movies, since subtitles will not be available for some films (or at least not in English). The additional 500 movies serve as a buffer.

In [124]:
"""
#Load original dataset
df = pd.read_csv('data/netflix_titles.csv')
#Only keep movies, get rid of TV shows
df = df.loc[df['type'] == 'Movie'].reset_index(drop=True)
#only keep the columns we're interested in
df = df[['title','listed_in','director','cast','description']]
df
"""

Unnamed: 0,title,listed_in,director,cast,description
0,Dick Johnson Is Dead,Documentaries,Kirsten Johnson,,"As her father nears the end of his life, filmm..."
1,My Little Pony: A New Generation,Children & Family Movies,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",Equestria's divided. But a bright-eyed hero be...
2,Sankofa,"Dramas, Independent Movies, International Movies",Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","On a photo shoot in Ghana, an American model s..."
3,The Starling,"Comedies, Dramas",Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",A woman adjusting to life after a loss contend...
4,Je Suis Karl,"Dramas, International Movies",Christian Schwochow,"Luna Wedler, Jannis Niewöhner, Milan Peschel, ...",After most of her family is murdered in a terr...
...,...,...,...,...,...
6126,Zinzana,"Dramas, International Movies, Thrillers",Majid Al Ansari,"Ali Suliman, Saleh Bakri, Yasa, Ali Al-Jabri, ...",Recovering alcoholic Talal wakes up inside a s...
6127,Zodiac,"Cult Movies, Dramas, Thrillers",David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...","A political cartoonist, a crime reporter and a..."
6128,Zombieland,"Comedies, Horror Movies",Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",Looking to survive in a world taken over by zo...
6129,Zoom,"Children & Family Movies, Comedies",Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...","Dragged from civilian life, a former superhero..."


In [125]:
"""
#get 2500 random movies from the dataset 
df = df.sample(2500)
#reset index
df.reset_index(drop=True,inplace=True)
#save as a new .csv file
df.to_csv('data/netflix_subsample.csv')
"""
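
A plain random draw gives a different subsample on every run, which is why the subsample is saved to disk and reloaded. As a minimal sketch of a reproducible draw using only the standard library (the function name `sample_row_indices`, the seed and the row count of 10000 are illustrative assumptions, not values from this notebook):

```python
import random

def sample_row_indices(n_rows, k, seed=42):
    """Return k distinct row indices drawn reproducibly from range(n_rows)."""
    rng = random.Random(seed)  # private generator, the global random state is untouched
    return rng.sample(range(n_rows), k)

first = sample_row_indices(10000, 2500)
second = sample_row_indices(10000, 2500)
print(first == second)  # same seed, same subsample
```

In pandas, passing a fixed `random_state` to `DataFrame.sample` serves the same purpose.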

In [328]:
#Load subsample df
df = pd.read_csv('data/netflix_subsample.csv')
df

Unnamed: 0,title,listed_in,director,cast,description
0,The Last Blockbuster,Documentaries,Taylor Morden,,This nostalgic documentary reveals the real st...
1,Franco Escamilla: Bienvenido al mundo,Stand-Up Comedy,Ulises Valencia,Franco Escamilla,Comedian Franco Escamilla shares stories about...
2,Last Knights,Action & Adventure,Kazuaki Kiriya,"Clive Owen, Morgan Freeman, Cliff Curtis, Akse...",A nobleman who values his people's well-being ...
3,Win It All,"Comedies, Independent Movies, Romantic Movies",Joe Swanberg,"Jake Johnson, Aislinn Derbez, Joe Lo Truglio, ...","After losing $50,000 that wasn't his, gambling..."
4,Bewildered Bolbol,"Comedies, International Movies, Romantic Movies",Khaled Marei,"Ahmed Helmy, Zeina, Shery Adel, Emy Samir Ghan...",A man suffering from amnesia can't seem to cho...
...,...,...,...,...,...
2495,Star Men,Documentaries,Alison E. Rose,,Four astronomers from England celebrate 50 yea...
2496,Tremors 4: The Legend Begins,"Comedies, Horror Movies, Sci-Fi & Fantasy",S.S. Wilson,"Michael Gross, Sara Botsford, Billy Drago, Bre...",Residents of a abandoned mining town attempt t...
2497,Tottaa Pataaka Item Maal,"Dramas, Independent Movies, International Movies",Aditya Kripalani,"Shalini Vatsa, Chitrangada Chakraborty, Vinay ...",Exasperated with living in perpetual fear for ...
2498,Koshish,"International Movies, Romantic Movies",Gulzar,"Sanjeev Kumar, Jaya Bhaduri, Asrani, Seema, Om...",A speech and hearing-impaired couple persists ...


## Get IMDb IDs for all movies
To find the right movies via the opensubtitles API, we need a unique identifier for each movie: the IMDb ID. This is simply the ID that IMDb assigns to every title on its website.

For example, if the link to the movie is https://www.imdb.com/title/tt0241527/, the IMDb ID is 0241527.
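
To make the relation between link and ID concrete, here is a small sketch that pulls the ID out of such a link with a regular expression. The helper name `imdb_id_from_url` is hypothetical and not part of any library:

```python
import re

def imdb_id_from_url(url):
    """Return the digits after 'tt' in an IMDb title URL, or None if absent."""
    match = re.search(r"/title/tt(\d+)", url)
    return match.group(1) if match else None

print(imdb_id_from_url("https://www.imdb.com/title/tt0241527/"))  # 0241527
```

The ID is kept as a string here so that leading zeros are preserved.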

In [127]:
#add empty columns to df to save the IDs later on
df['imdb_id'] = np.nan
df['opensubtitle_id'] = np.nan

In [211]:
#since the cell below timed out when searching for too many titles at once, I did it in parts
slice_start = 2400
slice_end = 2500
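
Instead of editing `slice_start` and `slice_end` by hand for every part, the slice boundaries could also be generated. A minimal sketch, assuming parts of 100 rows as in the values above (the helper name `chunk_bounds` is made up for illustration):

```python
def chunk_bounds(total, size):
    """Yield (start, end) pairs that cover range(total) in steps of `size`."""
    for start in range(0, total, size):
        yield start, min(start + size, total)

bounds = list(chunk_bounds(2500, 100))
print(len(bounds), bounds[0], bounds[-1])  # 25 (0, 100) (2400, 2500)
```

Each pair can then be used as `df[start:end]` in one run of the search loop.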

In [212]:
##The original code in this cell is from the user 'rakshitarora' on GeeksforGeeks (2020), 'Python IMDbPY – Getting movie ID from searched movies', link: https://www.geeksforgeeks.org/python-imdbpy-getting-movie-id-from-searched-movies/
## I modified it to work in a loop, save the ids to the df and added the if statements (to avoid errors & show if it couldn't find a movie)

#creating instance of IMDb
ia = imdb.IMDb()
not_found = []

for i, row in df[slice_start:slice_end].iterrows():
    #searching the name
    search = ia.search_movie(row['title'])
    if search:
        movie_id = search[0].movieID #getting the id (modified to only get the first one)
        df.loc[i, 'imdb_id'] = movie_id #add id to df
    else: #movies it couldn't find
        not_found.append(row['title'])
        print(row['title'])

Little Singham - Black Shadow
Antariksha Ke Rakhwale


In [214]:
#show all rows in df where no imdb_id was found
#df[df['imdb_id'].isna()]
len(df[df['imdb_id'].isna()])

23

In [215]:
#drop rows where no imdb_id was found
df = df.dropna(subset=['imdb_id']).reset_index(drop=True)

## Download subtitles

Let's access opensubtitles to download the corresponding movie subtitles. To get the download link, we first need to get the opensubtitles file ID for each movie.
> To be able to use the opensubtitles API, you need your own account details, API-Key & authorization token. I have deleted my personal information from the following code. Please reach out if you should need them.

In [217]:
##The code in this cell is from user 'Birkan' on Stack Overflow (2021), 'How can we reach the information with the opensubtitles API?', link: https://stackoverflow.com/questions/66737712/how-can-we-reach-the-information-with-the-opensubtitles-api#comment118085673_66773501

#get the authorization token from opensubtitles (we need it later on to download the subtitles)
url = "https://api.opensubtitles.com/api/v1/login"
headers = {'Api-Key':'YOUR-KEY', 'Content-Type': 'application/json', 'Accept': "application/json"}
user = {'username': 'YOUR-USERNAME', 'password': "YOUR-PASSWORD"}

try:
    login_response = requests.post(url, data=json.dumps(user), headers=headers)
    login_response.raise_for_status()
    login_json_response = login_response.json()
    login_token = login_json_response['token'] #authorization token
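except (requests.RequestException, KeyError, ValueError) as err:
    #request failed, response was not valid JSON, or it contained no token
    print("Login failed, check your credentials and API key:", err)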
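
One way to keep the account details out of the notebook is to read them from environment variables. This is only a sketch: the variable names and the helper `load_credentials` are hypothetical and not defined by opensubtitles.

```python
import os

def load_credentials(env=os.environ):
    """Collect the three login values from environment variables.

    The variable names are hypothetical; pick whatever fits your setup.
    """
    names = {
        "api_key": "OPENSUBTITLES_API_KEY",
        "username": "OPENSUBTITLES_USERNAME",
        "password": "OPENSUBTITLES_PASSWORD",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}
```

The returned values could then replace the `'YOUR-KEY'`, `'YOUR-USERNAME'` and `'YOUR-PASSWORD'` placeholders.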

In [218]:
##The code in this cell is from user 'Birkan' on Stack Overflow (2021), 'How can we reach the information with the opensubtitles API?', link: https://stackoverflow.com/questions/66737712/how-can-we-reach-the-information-with-the-opensubtitles-api#comment118085673_66773501
##I modified it to work in a loop, save the id to the df and added the if-else statement

#get opensubtitles file id with the help of the imdb id

language="en"
headers = {
        'Api-Key': 'YOUR-KEY',
    }

for i, j in df.iterrows():
    query_response = requests.get('https://api.opensubtitles.com/api/v1/subtitles',
                                  params={'imdb_id': j['imdb_id'], 'languages': language},
                                  headers=headers)
    query_json_response = query_response.json()
    #The if-else statement was added to avoid errors when no subtitles were found
    if query_json_response['total_pages'] >= 1:
        query_file_no = query_json_response['data'][0]['attributes']['files'][0]['file_id']
        #print("Subtitles found for: ", movie_titles[i], "; File Number:",query_file_no)
        df.loc[i, 'opensubtitle_id'] = query_file_no
    else:
        print("No subtitles found for: ", j['title'])

No subtitles found for:  Bewildered Bolbol
No subtitles found for:  Prince of Peoria: A Christmas Moose Miracle
No subtitles found for:  The Loud House Movie
No subtitles found for:  Coco y Raulito: Carrusel de ternura
No subtitles found for:  Be Somebody
No subtitles found for:  Raya and Sakina
No subtitles found for:  Hajwala: The Missing Engine
No subtitles found for:  Bill Hicks: Reflections
No subtitles found for:  Mama Drama
No subtitles found for:  Todd Barry: Spicy Honey
No subtitles found for:  Bhouri
No subtitles found for:  Amit Tandon: Family Tandoncies
No subtitles found for:  Motor Mitraan Di
No subtitles found for:  The Killer
No subtitles found for:  Locked on You
No subtitles found for:  Killers
No subtitles found for:  Qismat
No subtitles found for:  Jerry Seinfeld: Comedian
No subtitles found for:  Blade Runner: The Final Cut
No subtitles found for:  Late Life: The Chien-Ming Wang Story
No subtitles found for:  Francesco De Carlo: Cose di Questo Mondo
No subtitles fo
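
The lookup of the file ID assumes that the first search result always contains at least one file. A slightly more defensive sketch is shown here; it relies only on the keys used in the loop above (`total_pages`, `data`, `attributes`, `files`, `file_id`). The helper name `first_file_id` and the sample response are made up for illustration, and unlike the loop above it also looks at later results if the first one has no files.

```python
def first_file_id(query_json):
    """Return the first file_id in a search response, or None if there is none."""
    if query_json.get("total_pages", 0) < 1:
        return None
    for entry in query_json.get("data", []):
        files = entry.get("attributes", {}).get("files", [])
        if files:
            return files[0]["file_id"]
    return None

# Hypothetical response, reduced to the keys that are actually read
sample = {"total_pages": 1,
          "data": [{"attributes": {"files": []}},
                   {"attributes": {"files": [{"file_id": 12345}]}}]}
print(first_file_id(sample))  # 12345
```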

In [222]:
#For how many movies were no subtitles found?
len(df[df['opensubtitle_id'].isna()])

573

In [224]:
#drop rows where no subtitles were found
df = df.dropna(subset=['opensubtitle_id']).reset_index(drop=True)

In [225]:
#1904 movies left from the 2500 we started with
len(df)

1904

In [288]:
##The code in this cell is from user 'Birkan' on Stack Overflow (2021), 'How can we reach the information with the opensubtitles API?', link: https://stackoverflow.com/questions/66737712/how-can-we-reach-the-information-with-the-opensubtitles-api#comment118085673_66773501
##I modified it to work in a loop (and with my df)

#Get the download url for the subtitles & download the file

#API URL and authorization
download_url = "https://api.opensubtitles.com/api/v1/download"
download_headers = {'api-key': 'YOUR-KEY',
                    'authorization':"Bearer "+login_token,
                    'content-type': 'application/json'}

#iterate over the df and download the subtitles
for i, j in df[1000:1904].iterrows():
    download_file_id = {'file_id': j['opensubtitle_id']} #get the opensubtitles file id from the df
    download_response = requests.post(download_url, data=json.dumps(download_file_id), headers=download_headers)
    download_json_response = download_response.json()
    print("Report:",download_response)
    print(download_json_response)

    link=download_json_response['link'] #get the download link
    saved_file_name = "data/subtitles_download/"+ j['title'] +".txt" #give the file the title of the movie
    r = requests.get(link)
    with open(saved_file_name, 'wb') as f:
        f.write(r.content) #save the file

Report: <Response [200]>
{'link': 'https://www.opensubtitles.com/download/B383844EBAB8EDB95DE43A67FBF1A52E33BD31E92050BBE25CAA68EA4392409311D4DD412212A9DA44CA7F97AE5B54302B0E8E2D980627816E16AED39594F4A0F318320B21308DA9B6B2BE67899B54C474EDD9A7FA4040030A8EC6211B672BB5B4039DE1A8CC1A7656681CA1C280217F69142A8F9CAAD232E0EEB25CD9E5F608F8291833F4650EFE77538024B86ED62523A1C4EA08CA7396DAEF26D96920855A87C18A6AF16EE03A6200CDF8451CC0B7CCDB73191182FA3F047B2F77E4858F2F14307C47A087A525D27CBA5556D17387B698E6304EF4E9DDC9D920A011C73A22467D833AA1C3DB51AB98D1F99B2E27D724C9BB7EA7C8EB77D61A3677035F4F11E0B1058D2B92FFA923E3A7F09B7F33D8/subfile/jn-dvdivx-fmi1.srt', 'file_name': 'jn-dvdivx-fmi1.srt', 'requests': 217, 'remaining': 783, 'message': 'Your quota will be renewed in 14 hours and 23 minutes (2023-06-13 23:59:59 UTC) ts=1686648978 ', 'reset_time': '14 hours and 23 minutes', 'reset_time_utc': '2023-06-13T23:59:59.000Z', 'uk': 'count_uid_459397', 'uid': 459397, 'ts': 1686648978}
Report: <Response [200]>
{'
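
Since the files are named after the movie titles, titles containing characters such as `:` or `?` can cause problems on some file systems. A minimal sketch of a sanitizing helper, assuming a replacement character of `_` (the name `safe_filename` is hypothetical):

```python
import re

def safe_filename(title, replacement="_"):
    """Replace characters that are not allowed in file names on common systems."""
    cleaned = re.sub(r'[<>:"/\\|?*]', replacement, title)
    return cleaned.strip(" .")  # Windows also rejects trailing spaces and dots

print(safe_filename("WHAT DID JACK DO?"))  # WHAT DID JACK DO_
print(safe_filename("Bombshell: The Hedy Lamarr Story"))  # Bombshell_ The Hedy Lamarr Story
```

If file names are changed like this, a mapping from file name back to the original title is needed wherever the title is later recovered from the file name.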

## Clean subtitle files
The subtitle files from opensubtitles come in the SRT (SubRip Subtitle) file format by default, which includes start and end timecodes to ensure the subtitles match the audio. For our use case, however, we do not need this information.

For example, the original files look like this:
```
1
00:01:33,877 --> 00:01:35,276
Amanda!

2
00:01:38,681 --> 00:01:40,649
Amanda, turn that down!
```

And we want to convert it to this:
```
Amanda!
Amanda, turn that down!
```
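
As a rough illustration of this conversion (not the script that was actually used in this notebook), the following sketch drops blank lines, counter lines and timecode lines. The function name `srt_to_text` is an assumption, and a dialogue line that consists only of digits would be lost with this simple rule.

```python
import re

# Matches lines such as "00:01:33,877 --> 00:01:35,276"
TIMECODE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

def srt_to_text(srt):
    """Keep only the dialogue lines of an SRT string."""
    kept = []
    for line in srt.splitlines():
        stripped = line.strip()
        if not stripped or stripped.isdigit() or TIMECODE.match(stripped):
            continue  # blank line, subtitle counter or timecode
        kept.append(stripped)
    return "\n".join(kept)

sample = """1
00:01:33,877 --> 00:01:35,276
Amanda!

2
00:01:38,681 --> 00:01:40,649
Amanda, turn that down!
"""
print(srt_to_text(sample))
```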

To strip this extra information from the subtitles, I used a Python script by Nat Dunn (2017), 'Simple Python Script for Extracting Text from an SRT File', link: https://gist.github.com/ndunn219/62263ce1fb59fda08656be7369ce329b

I simply loaded it into this notebook by using:
```
%load srt_to_txt.py
```

In [249]:
##script by Nat Dunn, link: https://gist.github.com/ndunn219/62263ce1fb59fda08656be7369ce329b

# %load srt_to_txt.py
"""
Creates readable text file from SRT file.
"""
import re, sys

def is_time_stamp(l):
  if l[:2].isnumeric() and l[2] == ':':
    return True
  return False

def has_letters(line):
  if re.search('[a-zA-Z]', line):
    return True
  return False

def has_no_text(line):
  l = line.strip()
  if not len(l):
    return True
  if l.isnumeric():
    return True
  if is_time_stamp(l):
    return True
  if l[0] == '(' and l[-1] == ')':
    return True
  if not has_letters(line):
    return True
  return False

def is_lowercase_letter_or_comma(letter):
  if letter.isalpha() and letter.lower() == letter:
    return True
  if letter == ',':
    return True
  return False

def clean_up(lines):
  """
  Get rid of all non-text lines and
  try to combine text broken into multiple lines
  """
  new_lines = []
  for line in lines[1:]:
    if has_no_text(line):
      continue
    elif len(new_lines) and is_lowercase_letter_or_comma(line[0]):
      #combine with previous line
      new_lines[-1] = new_lines[-1].strip() + ' ' + line
    else:
      #append line
      new_lines.append(line)
  return new_lines

def main(args):
  """
    args[1]: file name
    args[2]: encoding. Default: utf-8.
      - If you get a lot of [?]s replacing characters,
      - you probably need to change file_encoding to 'cp1252'
  """
  file_name = args[1]
  file_encoding = 'utf-8' if len(args) < 3 else args[2]
  with open(file_name, encoding=file_encoding, errors='replace') as f:
    lines = f.readlines()
    new_lines = clean_up(lines)
  new_file_name = file_name[:-4] + '.txt'
  with open(new_file_name, 'w') as f:
    for line in new_lines:
      f.write(line)

if __name__ == '__main__':
  main(sys.argv)

"""
NOTES
 * Run from command line as
 ** python srt_to_txt.py file_name.srt cp1252
 * Creates file_name.txt with extracted text from file_name.srt 

 * Script assumes that lines beginning with lowercase letters or commas 
 * are part of the previous line and lines beginning with any other character
 * are new lines. This won't always be correct. 
"""

In [310]:
## The code in this cell is from the user chetankhanna767 on GeeksforGeeks (2021), 'How to iterate over files in directory using Python?', link: https://www.geeksforgeeks.org/how-to-iterate-over-files-in-directory-using-python/
## modified to work with the main function from the python script above

#assign directory
directory = 'data/subtitles'
 
#iterate over files in that directory
for filename in os.listdir(directory):
    file_path = os.path.join(directory, filename)
    #print(file_path)
    main( ['', file_path, 'utf-8']) #use the main() function from the python script above with each subtitle file

### Hand-Curation
After running this script, I went through all the downloaded subtitles to see if everything looked right.

In doing so, I noticed that some files still contained tags, like `<font color="white">`, or other text symbols, like `#`. Since this is not part of the text body we want to analyse, I compiled all the symbols I found into a list and deleted them from the files:

In [294]:
##The code in this cell was written by ChatGPT after being given the list of symbols and some sample text strings

#Create regex pattern to remove symbols from text
def remove_symbols(text):
    symbols = ['</i>', '<i>', '-', '[', ']', '♪', '"', '<font color="#ffff00">', '<font color=00ffff>','<font color=#FFFF00>', '<font color="#00ff00">', '<font color="#D900D9">', '</font>', '<font color="white">', '#', '<span style="style.default_1">', '</span>', '%%', '{\\an8}', '*', '¶', '♫', "''", '{y:i}', '<font face=Californian FB>','<b>', '</b>', "{F:Verdana}{S:14}{C:$FFFFFF}"]
    pattern = re.compile('|'.join(map(re.escape, symbols)))
    return re.sub(pattern, '', text)

In [298]:
#assign directory
directory = 'data/subtitles'

#iterate over files in that directory
for filename in os.listdir(directory):
    file_path = os.path.join(directory, filename)

    #open and read the subtitle txt file
    with open(file_path, 'r') as f:
        content = f.read()
        
    #clean and save the subtitle text
    with open(file_path, 'w') as f:
        content = remove_symbols(content)
        f.write(content)
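
A hand-compiled list only covers the tags that were actually spotted. As an alternative sketch (not what was used here), generic patterns can remove any `<...>` tag and any `{...}` formatting code. The name `strip_markup` is an assumption, and the approach would also delete legitimate text that happens to sit between such brackets.

```python
import re

TAG = re.compile(r"<[^>]+>")      # HTML-like tags such as <i> or <font color="white">
BRACE = re.compile(r"\{[^}]*\}")  # brace-delimited codes such as {\an8} or {y:i}

def strip_markup(text):
    """Remove HTML-like tags and brace-delimited formatting codes."""
    return BRACE.sub("", TAG.sub("", text))

print(strip_markup('<font color="white">Hi</font> {\\an8}there'))  # Hi there
```

Single symbols such as `♪` or `#` would still need a list like the one above.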

While looking through all the subtitle files, I also noticed that some of them only include a few lines of text. So I checked directly on opensubtitles.com whether another version of the subtitles was available.

This affected the following titles:

- Bombshell: The Hedy Lamarr Story
- Counterpunch
- **Oloibiri**
- **Sitara: Let Girls Dream**
- **Spirit Riding Free: Ride Along Adventure**
- The Ivory Game
- Without Gorky
- **Oddbods: Party Monsters**

There were no alternatives for the titles marked in **bold**, so I decided to delete them from the dataset. The subtitles for the other four films were replaced. The problem with these was that the original files only contained subtitles for the foreign-language parts.
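
Such short files could also be flagged automatically rather than by reading through every file. A minimal sketch, assuming an arbitrary threshold of 200 words (both the threshold and the name `find_short_files` are illustrative assumptions):

```python
import os

def find_short_files(directory, min_words=200):
    """Return (file name, word count) pairs for .txt files below `min_words`."""
    short = []
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".txt"):
            continue  # skip anything that is not a subtitle text file
        path = os.path.join(directory, filename)
        with open(path, encoding="utf-8", errors="replace") as f:
            n_words = len(f.read().split())
        if n_words < min_words:
            short.append((filename, n_words))
    return short
```

Calling `find_short_files('data/subtitles')` would list the candidates that deserve a manual look.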

In [302]:
#Delete the mentioned titles from the df
delete_titles = ['Oloibiri', 'Sitara: Let Girls Dream', 'Spirit Riding Free: Ride Along Adventure','Oddbods: Party Monsters']
df = df[~df['title'].isin(delete_titles)]
#reset index
df.reset_index(drop=True,inplace=True)
#check if it worked (should be reduced from 1904 to 1900 movies)
len(df)

1900

## Include subtitles in df
As a final step, simply because I prefer to collect all the data in a single dataframe, I added the subtitles to our existing df.

In [315]:
#create a new empty df where we will save the subtitles
df_subtitles = pd.DataFrame(columns=['title', 'subtitle'])
df_subtitles

Unnamed: 0,title,subtitle


In [316]:
directory = 'data/subtitles'
#iterate through all files in the directory
for filename in os.listdir(directory):
    df_new = pd.DataFrame({'title': [''], 'subtitle':['']}) #create a new df
    df_new['title'] = filename.split(".txt")[0] #save the movie title (which is the filename) in the df
    file_path = os.path.join(directory, filename)
    #print(filename.split(".")[0])
    with open(file_path,'r') as f:
        contents = f.read().replace('\n', ' ') #read the file and replace all line breaks with spaces
        df_new['subtitle'] = contents #save the subtitles in the df
    df_subtitles= pd.concat([df_subtitles,df_new]) #add the new df to the df with all subtitles
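
Concatenating a new dataframe for every file works, but gets slow as the number of files grows. A sketch of an alternative that first collects plain records (the helper name `read_subtitle_records` is an assumption; unlike the loop above it skips entries that do not end in `.txt`):

```python
import os

def read_subtitle_records(directory):
    """Return one {'title', 'subtitle'} dict per .txt file in `directory`."""
    records = []
    for filename in sorted(os.listdir(directory)):
        title, ext = os.path.splitext(filename)
        if ext != ".txt":
            continue  # ignore entries that are not subtitle text files
        path = os.path.join(directory, filename)
        with open(path, encoding="utf-8", errors="replace") as f:
            text = f.read().replace("\n", " ")  # same line-break handling as the loop above
        records.append({"title": title, "subtitle": text})
    return records
```

The result can be turned into a dataframe in one step with `pd.DataFrame(read_subtitle_records('data/subtitles'))`.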

In [319]:
#reset index of df_subtitles
df_subtitles = df_subtitles.reset_index(drop=True)
df_subtitles = df_subtitles.drop(872, axis=0)
df_subtitles

Unnamed: 0,title,subtitle
0,The Ivory Game,"Shetani the Devil has taken over, taken over t..."
1,You're Everything To Me,Announcer Major General McCourtney entering. S...
2,The 8th Night,"2,500 years ago, a monster started opening the..."
3,St. Agatha,"Help me! Help, somebody help get me out of he..."
4,Isi & Ossi,"narrator This is Heidelberg, the Heidelberg Ca..."
...,...,...
1896,WHAT DID JACK DO?,"So, we meet in a train station. Jack. You know..."
1897,Killer Klowns from Outer Space,Transcript by rogard custommade for simonhawki...
1898,The Lost Husband,Is that Marsha's sister? What do we call her? ...
1899,Lady Bloodfight,LADY BLOOD FIGHT HONG KONG 5 years ago Stop th...


In [322]:
#merge df_subtitles with df
df_merged = pd.merge(df, df_subtitles, on='title', how='outer')
df_merged

Unnamed: 0,title,listed_in,director,cast,description,imdb_id,opensubtitle_id,subtitle
0,The Last Blockbuster,Documentaries,Taylor Morden,,This nostalgic documentary reveals the real st...,8704802,7101681.0,Downloaded from YTS.MX Official YIFY movies si...
1,Franco Escamilla: Bienvenido al mundo,Stand-Up Comedy,Ulises Valencia,Franco Escamilla,Comedian Franco Escamilla shares stories about...,10128616,6649312.0,"It's a great honor to be here, to introduce yo..."
2,Last Knights,Action & Adventure,Kazuaki Kiriya,"Clive Owen, Morgan Freeman, Cliff Curtis, Akse...",A nobleman who values his people's well-being ...,2493486,1123612.0,"During the long, dark period of the Great Wars..."
3,Win It All,"Comedies, Independent Movies, Romantic Movies",Joe Swanberg,"Jake Johnson, Aislinn Derbez, Joe Lo Truglio, ...","After losing $50,000 that wasn't his, gambling...",3155328,1197968.0,Cubs parking! Easy entry and exit! How is it g...
4,Freedom Writers,Dramas,Richard LaGravenese,"Hilary Swank, Patrick Dempsey, Scott Glenn, Im...",While her at-risk students are reading classic...,463998,3537643.0,There have been shots fired. Total civil unres...
...,...,...,...,...,...,...,...,...
1895,Fantastic Fungi,Documentaries,Louie Schwartzberg,,"Delve into the magical world of fungi, from mu...",8258074,6603481.0,Subtitles by explosiveskull www.OpenSubtitles....
1896,The Zookeeper's Wife,Dramas,Niki Caro,"Jessica Chastain, Johan Heldenbergh, Daniel Br...","When the Nazis invade Poland, Warsaw Zoo caret...",1730768,1183737.0,"ANTONINA: Hello, Jerzyk. JERZYK: Good morning,..."
1897,Frat Star,Comedies,"Grant S. Johnson, Ippsie Jones","Connor Lawrence, Justin Mark, Cathryn Dylan, C...",A freshman uninterested in joining a fraternit...,5117484,1184644.0,Hey! You piece of shit. We were having some fu...
1898,Star Men,Documentaries,Alison E. Rose,,Four astronomers from England celebrate 50 yea...,120915,247903.0,subversive subs presents Star Wars: Episode I ...
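
Because the merge uses `how='outer'`, a title that exists on only one side is kept with missing values rather than dropped. A quick sketch for checking this before merging, using plain sets (the helper name `unmatched_titles` is hypothetical):

```python
def unmatched_titles(left_titles, right_titles):
    """Return the titles that appear on only one side, as two sorted lists."""
    left, right = set(left_titles), set(right_titles)
    return sorted(left - right), sorted(right - left)

only_df, only_subs = unmatched_titles(
    ["Last Knights", "Win It All", "Star Men"],
    ["Last Knights", "Win It All", "The Ivory Game"],
)
print(only_df, only_subs)  # ['Star Men'] ['The Ivory Game']
```

With the dataframes from this notebook the call would be `unmatched_titles(df['title'], df_subtitles['title'])`. Alternatively, `pd.merge` accepts `indicator=True`, which adds a `_merge` column showing where each row came from.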


In [323]:
#delete imdb_id and opensubtitle_id columns as we don't need them anymore
df_merged = df_merged.drop(['imdb_id', 'opensubtitle_id'], axis=1)
df_merged.head(3)

Unnamed: 0,title,listed_in,director,cast,description,subtitle
0,The Last Blockbuster,Documentaries,Taylor Morden,,This nostalgic documentary reveals the real st...,Downloaded from YTS.MX Official YIFY movies si...
1,Franco Escamilla: Bienvenido al mundo,Stand-Up Comedy,Ulises Valencia,Franco Escamilla,Comedian Franco Escamilla shares stories about...,"It's a great honor to be here, to introduce yo..."
2,Last Knights,Action & Adventure,Kazuaki Kiriya,"Clive Owen, Morgan Freeman, Cliff Curtis, Akse...",A nobleman who values his people's well-being ...,"During the long, dark period of the Great Wars..."


In [326]:
#save as a new .csv file
df_merged.to_csv('data/netflix_with_subtitles.csv')

It should be mentioned that there is no 100% guarantee that all subtitles are correct. This is for two reasons:
- The files on opensubtitles.com are uploaded by users and could be faulty.
- I always simply used the first IMDb ID that was suggested. So if there are several movies with exactly the same title, it is possible that the subtitles were downloaded for the wrong movie.
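
The second point could be softened by preferring a search result whose title matches the query exactly and flagging everything else for review. The sketch below is library-agnostic: it works on plain `(title, movie_id)` pairs, the helper name `pick_match` is hypothetical and the IDs are made up. It does not resolve the case of several movies sharing exactly the same title.

```python
def pick_match(query_title, candidates):
    """Pick an ID from (title, movie_id) pairs, preferring an exact title match.

    Returns (movie_id, is_exact); movie_id is None if there are no candidates.
    """
    wanted = query_title.casefold().strip()
    for title, movie_id in candidates:
        if title.casefold().strip() == wanted:
            return movie_id, True
    if candidates:
        return candidates[0][1], False  # fall back to the first hit, flagged as inexact
    return None, False

# Made-up search results: the first hit is not the movie we asked for
results = [("Star Wars: Episode I", "0000001"), ("Star Men", "0000002")]
print(pick_match("Star Men", results))  # ('0000002', True)
```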