# Introduction

In this file, our goal is to obtain plot data for every English-language movie in the [CMU Movie Summary Corpus](http://www.cs.cmu.edu/~ark/personas/). The corpus was created by scraping Wikipedia. We start by importing the necessary libraries.

In [154]:
import os
import csv
import pandas as pd
import numpy as np

We start by loading the 'movie.metadata.tsv' file to obtain the mapping between movie_id and movie title. This file also lets us filter out all movies that are not in English.

In [146]:
cwd = os.getcwd()
# Read the tab-separated metadata file; each line becomes a list of fields.
with open(os.path.join(cwd, "movie.metadata.tsv"), 'r', encoding="utf8") as movie_metadata_file:
    movie_metadata = [line.strip().split("\t") for line in movie_metadata_file]

In [127]:
movie_metadata[0]

['975900',
 '/m/03vyhn',
 'Ghosts of Mars',
 '2001-08-24',
 '14010832',
 '98.0',
 '{"/m/02h40lc": "English Language"}',
 '{"/m/09c7w0": "United States of America"}',
 '{"/m/01jfsb": "Thriller", "/m/06n90": "Science Fiction", "/m/03npn": "Horror", "/m/03k9fj": "Adventure", "/m/0fdjb": "Supernatural", "/m/02kdv5l": "Action", "/m/09zvmj": "Space western"}']
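The list comprehension above splits each line on tabs by hand. As an alternative sketch, the standard-library `csv` module (imported earlier) can do the same split. Here `quoting=csv.QUOTE_NONE` is set on the assumption that quote characters in titles and in the JSON-like fields should be kept literally. The `read_tsv` helper and the two-record sample are hypothetical and only mimic the layout of the record shown above.

```python
import csv
import io

def read_tsv(handle):
    """Return each line of a tab-separated file as a list of string fields."""
    # QUOTE_NONE leaves double quotes (e.g. in the JSON-like fields) untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader]

# Hypothetical sample: the first record copies the layout shown above,
# the second is made up and has an empty box-office field.
sample = io.StringIO(
    '975900\t/m/03vyhn\tGhosts of Mars\t2001-08-24\t14010832\t98.0\t'
    '{"/m/02h40lc": "English Language"}\t'
    '{"/m/09c7w0": "United States of America"}\t{}\n'
    '1\t/m/example\tExample Movie\t1987\t\t90.0\t{}\t{}\t{}\n'
)
rows = read_tsv(sample)
print(len(rows), len(rows[0]), repr(rows[1][4]))
```

One difference from `line.strip().split("\t")` is that `csv.reader` keeps empty fields at the start or end of a line, whereas `strip()` would remove leading or trailing tabs before the split.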

Note that the movie_id, title, and language are the fields at indices 0, 2, and 6 of each record in the metadata file. We also take the release date from index 3. We put all this information in a single dataframe and keep only the rows whose language field mentions English.

In [164]:
movie_ids = [row[0] for row in movie_metadata]
movie_titles = [row[2] for row in movie_metadata]
movie_language = [row[6] for row in movie_metadata]
movie_date = [row[3] for row in movie_metadata]
movie_df = pd.DataFrame(list(zip(movie_ids, movie_titles, movie_language, movie_date)), columns=['movie_id', 'title', 'language', 'date'])
movie_df = movie_df[movie_df.language.str.contains('English')]
movie_df.drop(['language'], axis=1,  inplace=True)
movie_df.head()

Unnamed: 0,movie_id,title,date
0,975900,Ghosts of Mars,2001-08-24
1,3196793,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16
3,9363483,White Of The Eye,1987
5,13696889,The Gangsters,1913-05-29
6,18998739,The Sorcerer's Apprentice,2002
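The substring test `str.contains('English')` keeps any record whose language field mentions "English" anywhere. That would also match a name such as "Old English language", if the corpus contains one (an assumption we have not checked). A stricter sketch parses the field as JSON, which works for the sample record shown earlier, and compares language names exactly. The helper `is_english` and the sample values are hypothetical.

```python
import json

def is_english(language_field, target="English Language"):
    """Return True if the JSON-encoded language map lists `target` exactly."""
    try:
        languages = json.loads(language_field)
    except json.JSONDecodeError:
        return False  # treat unparseable fields as non-English
    if not isinstance(languages, dict):
        return False
    return target in languages.values()

samples = [
    '{"/m/02h40lc": "English Language"}',
    '{"/m/example": "Old English language"}',  # hypothetical entry
    '{}',                                      # no language listed
]
flags = [is_english(s) for s in samples]
print(flags)
```

If this behaviour is wanted, the filter could be written as `movie_df[movie_df.language.apply(is_english)]` in place of the `str.contains` line.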


Next, we load the plot summary file and separate the movie_id for each movie from the actual textual plot. We put them together in the movie_summary_df dataframe. 

In [165]:
# Each line holds a movie_id and its plot, separated by a tab.
with open(os.path.join(cwd, "plot_summaries.txt"), 'r', encoding='utf8') as movie_summary_file:
    movie_summary = [line.strip().split('\t') for line in movie_summary_file]
movie_summary[0]

['23890098',
 "Shlykov, a hard-working taxi driver and Lyosha, a saxophonist, develop a bizarre love-hate relationship, and despite their prejudices, realize they aren't so different after all."]

In [167]:
movie_summary_id = [row[0] for row in movie_summary]
movie_summary_plot = [row[1] for row in movie_summary]
movie_summary_df = pd.DataFrame(list(zip(movie_summary_id, movie_summary_plot)), columns=['movie_id', 'plot'])
movie_summary_df.head()

Unnamed: 0,movie_id,plot
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha..."
1,31186339,The nation of Panem consists of a wealthy Capi...
2,20663735,Poovalli Induchoodan is sentenced for six yea...
3,2231378,"The Lemon Drop Kid , a New York City swindler,..."
4,595909,Seventh-day Adventist Church pastor Michael Ch...


Next, we merge the two dataframes above and drop those movies that have no plot available. Also, we concatenate the title of the movie with the date of release for convenience. 

In [168]:
movie_df = pd.merge(movie_df, movie_summary_df, how='left', on='movie_id')
movie_df.dropna(subset=['plot'], inplace=True)
movie_df['title'] = movie_df['title'] + " (" + movie_df['date'] + ")"
movie_df.drop(['date'], axis=1, inplace=True)
movie_df.reset_index(inplace=True, drop=True)
movie_df.head()

Unnamed: 0,movie_id,title,plot
0,975900,Ghosts of Mars (2001-08-24),"Set in the second half of the 22nd century, th..."
1,9363483,White Of The Eye (1987),A series of murders of rich young women throug...
2,18998739,The Sorcerer's Apprentice (2002),"Every hundred years, the evil Morgana returns..."
3,6631279,Little city (1997-04-04),"Adam, a San Francisco-based artist who works a..."
4,171005,Henry V (1989-11-08),{{Plot|dateAct 1Act 2Act 3Act 4Act 5 Finally n...
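A left merge followed by `dropna` on `plot` keeps the movies that appear in both tables, which is what an inner merge does in one step, assuming the summary table itself has no missing plots. The sketch below compares the two on small hypothetical frames; the ids, titles, and plots are made up.

```python
import pandas as pd

# Hypothetical miniature versions of movie_df and movie_summary_df.
movies = pd.DataFrame({"movie_id": ["1", "2", "3"],
                       "title": ["A", "B", "C"]})
plots = pd.DataFrame({"movie_id": ["1", "3", "4"],
                      "plot": ["plot one", "plot three", "plot four"]})

# Approach used above: left merge, then drop rows without a plot.
left = pd.merge(movies, plots, how="left", on="movie_id")
left = left.dropna(subset=["plot"]).reset_index(drop=True)

# One-step alternative: inner merge keeps only ids present in both frames.
inner = pd.merge(movies, plots, how="inner", on="movie_id")

print(left["movie_id"].tolist(), inner["movie_id"].tolist())
```

Movie "2" has no plot and movie "4" has no metadata, so both approaches keep only movies "1" and "3".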


We save this dataframe to disk to keep the mapping between movie_id and title. Only the relevant columns are saved.

In [153]:
movie_df.to_csv(os.path.join(cwd, 'movieId_bert.csv'), index=True, columns=['movie_id', 'title'])

Next, we use the sentence_transformers [library](https://github.com/UKPLab/sentence-transformers) to run a BERT-style model (here `stsb-roberta-large`, a RoBERTa variant) on the plot of each movie and obtain a 1024-dimensional embedding per plot. This takes a very long time.

In [144]:
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('stsb-roberta-large')
# One 1024-dimensional embedding per movie plot.
plot_embeddings = np.zeros((len(movie_df), 1024))
start = time.time()
for i in range(len(movie_df)):
    plot = movie_df['plot'].iloc[i]
    plot_embeddings[i] = model.encode(plot)
# Elapsed wall-clock seconds (the number shown below appears to be this value).
print(time.time() - start)

14282.957549571991
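Encoding one plot per call leaves batching unused. `SentenceTransformer.encode` also accepts a list of sentences together with a `batch_size` argument, so the loop could in principle be replaced by a single call such as `model.encode(movie_df['plot'].tolist(), batch_size=32)`; we have not measured how much time that would save here. The sketch below shows the same batching idea with a stand-in encoder so that it runs without downloading a model. `fake_encode` and `encode_in_batches` are hypothetical names, and the 4-dimensional vectors are for illustration only.

```python
import numpy as np

def fake_encode(texts):
    """Stand-in for a sentence encoder: one 4-dimensional vector per text."""
    return np.array([[len(t), t.count(" "), 0.0, 1.0] for t in texts],
                    dtype=float)

def encode_in_batches(texts, encode_fn, batch_size=2):
    """Encode `texts` in chunks of `batch_size` and stack the results row-wise."""
    chunks = [encode_fn(texts[i:i + batch_size])
              for i in range(0, len(texts), batch_size)]
    return np.vstack(chunks)

plots = ["a short plot", "another plot summary", "third"]
embeddings = encode_in_batches(plots, fake_encode)
print(embeddings.shape)
```

Row `i` of the result corresponds to `plots[i]`, which matches how `plot_embeddings` is filled in the loop above.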


Once we obtain the embeddings, we save them to file.

In [145]:
plot_embeddings_df = pd.DataFrame(plot_embeddings, index=movie_df.index.tolist())
plot_embeddings_df.to_csv(os.path.join(cwd, 'movie_roberta-large_embeddings.csv'), index=True) 
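
To check that a saved file round-trips, it can be read back with the first column as the index. This sketch uses a small random matrix and a temporary directory, both hypothetical, rather than the real embeddings.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for plot_embeddings: 3 rows, 4 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 4))
embeddings_df = pd.DataFrame(embeddings)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings.csv")
    embeddings_df.to_csv(path, index=True)
    # index_col=0 restores the saved row index instead of adding a new column.
    restored = pd.read_csv(path, index_col=0)

# CSV stores floats as text, so compare with a tolerance.
print(restored.shape, np.allclose(restored.to_numpy(), embeddings))
```

Without `index_col=0`, the saved index comes back as an extra column named `Unnamed: 0`.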