## This notebook prepares the training data used in training GPT-2

In [1]:
import pandas as pd
from datetime import datetime, timedelta
import time
import json
import re
from collections import Counter
# pd.set_option('display.max_colwidth', None)
# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)

Load the IMDb movie data into a DataFrame.

In [2]:
df = pd.read_csv("data/IMDb movies.csv")


In [3]:
df.head(3)

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
0,tt0000009,Miss Jerry,Miss Jerry,1894,1894-10-09,Romance,45,USA,,Alexander Black,...,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,,,,,1.0,2.0
1,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,Charles Tait,...,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,$ 2250,,,,7.0,7.0
2,tt0001892,Den sorte drøm,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,Urban Gad,...,"Asta Nielsen, Valdemar Psilander, Gunnar Helse...",Two men of high rank are both wooing the beaut...,5.8,188,,,,,5.0,2.0


In [4]:
df.columns

Index(['imdb_title_id', 'title', 'original_title', 'year', 'date_published',
       'genre', 'duration', 'country', 'language', 'director', 'writer',
       'production_company', 'actors', 'description', 'avg_vote', 'votes',
       'budget', 'usa_gross_income', 'worlwide_gross_income', 'metascore',
       'reviews_from_users', 'reviews_from_critics'],
      dtype='object')

Filter out missing descriptions and descriptions that were truncated with a trailing "..."

In [56]:
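# Drop rows that have no description at all
temp = df.loc[df.description.notnull()].copy()
# Drop descriptions ending in "..."; the raw string avoids an invalid-escape warning for "\."
df = temp[~temp.description.str.contains(r"\.{3}$")].copy()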
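
To see what the pattern matches, here is a minimal standard-library sketch. The sample strings and the helper name `is_truncated` are made up for illustration and are not part of the notebook; the sketch relies on `str.contains` searching anywhere in the string, as `re.search` does.

```python
import re

# Hypothetical sample descriptions (not taken from the dataset)
samples = [
    "A complete one-sentence plot.",
    "A plot that was cut off mid-sentence...",
    "An ellipsis... in the middle is fine.",
]

# Three literal dots anchored at the end of the string
TRUNCATED = re.compile(r"\.{3}$")

def is_truncated(text):
    """Return True if text ends with a literal '...'."""
    return TRUNCATED.search(text) is not None

# Keep only the descriptions that do not end in '...'
kept = [s for s in samples if not is_truncated(s)]
print(kept)
```

Note that a single-character ellipsis (U+2026) would not match this pattern, so it only catches truncation written as three separate dots.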

In [57]:
training_data = list(df["description"].values)

Check for duplicates

In [66]:
print("number of plots       :", len(training_data))
print("number of unique plots:", len(set(training_data)))

number of plots       : 59929
number of unique plots: 59806


Check which plots are duplicated. Some may belong to the same film listed under different titles, but it is worth seeing what kinds of plots repeat.

In [74]:
def get_most_common(val_list, n):
    """Return the n most common values in val_list as (value, count) pairs"""
    # Counter tallies the values directly, so no manual counting dict is needed
    return Counter(val_list).most_common(n)

In [75]:
get_most_common(training_data, 10)

[('The story of', 15),
 ('Mail', 6),
 ('In this sequel to', 5),
 ('Based on', 5),
 ('The true story of', 5),
 ('Emil goes to Berlin to see his grandmother with a large amount of money and is offered sweets by a strange man that make him sleep. He wakes up at his stop with no money. It is up to him and a group of children to save the day.',
  4),
 ('Tom Sawyer and his pal Huckleberry Finn have great adventures on the Mississippi River, pretending to be pirates, attending their own funeral and witnessing a murder.',
  4),
 ('During World War II, a teenage Jewish girl named Anne Frank and her family are forced into hiding in the Nazi-occupied Netherlands.',
  4),
 ('Desperate measures are taken by a man who tries to save his family from the dark side of the law, after they commit an unexpected crime.',
  4),
 ('Dr. Henry Jekyll experiments with scientific means of revealing the hidden, dark side of man and releases a murderer from within himself.',
  3)]
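
A related check, sketched here with made-up data, lists every value that occurs more than once instead of only the top n. The helper name `find_duplicates` and the sample list are hypothetical.

```python
from collections import Counter

def find_duplicates(values):
    """Return {value: count} for every value that appears more than once."""
    counts = Counter(values)
    return {value: count for value, count in counts.items() if count > 1}

# Hypothetical stand-in for training_data
sample = ["The story of", "A full plot.", "The story of", "Mail", "Mail", "Mail"]
print(find_duplicates(sample))
```

Sorting the result by length of the key could be one way to surface short fragments first, since the bad entries above are all very short.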

In [82]:
df[df.description.isin(["The story of", 
                        "In this sequel to"
                       ])
  ][["title", "year", "description"]]

Unnamed: 0,title,year,description
11147,Lawrence d'Arabia,1962,The story of
11176,Anna dei miracoli,1962,The story of
12292,Flagrante adulterio,1965,In this sequel to
18589,Oliver's Story,1978,In this sequel to
20403,Frances,1982,The story of
23930,Gorilla nella nebbia,1988,The story of
25300,Quei bravi ragazzi,1990,The story of
26206,Ritorno alla laguna blu,1991,In this sequel to
33741,Revelation,1999,In this sequel to
52421,Milk,2008,The story of


The five most common entries look like truncated or otherwise bad data, so let's filter them out.

In [85]:
# Collect the five most common descriptions; a set makes the membership test fast
bad_plots = {plot for plot, _ in get_most_common(training_data, 5)}

In [86]:
training_data = [i for i in training_data if i not in bad_plots]

In [87]:
print("number of plots       :", len(training_data))
print("number of unique plots:", len(set(training_data)))

number of plots       : 59893
number of unique plots: 59801


Some duplicates remain, but the malformed plots are gone. Casting the list to a set would remove the remaining duplicates, although the cells below write `training_data` as is.
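
Casting to a set also discards the original order. A minimal sketch, assuming the order of the plots should be kept: `dict.fromkeys` preserves insertion order in Python 3.7+, so it can deduplicate while keeping the first occurrence of each plot. The helper name `dedupe_keep_order` and the sample list are hypothetical.

```python
def dedupe_keep_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and come out in first-seen order (Python 3.7+)
    return list(dict.fromkeys(items))

# Hypothetical stand-in for training_data
sample = ["plot a", "plot b", "plot a", "plot c", "plot b"]
print(dedupe_keep_order(sample))
```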

In [88]:
for i in training_data[:10]:
    print(i)

The adventures of a female reporter in the 1890s.
True story of notorious Australian outlaw Ned Kelly (1855-80).
The fabled queen of Egypt's affair with Roman general Marc Antony is ultimately disastrous for both of them.
Loosely adapted from Dante's Divine Comedy and inspired by the illustrations of Gustav Doré the original silent film has been restored and has a new score by Tangerine Dream.
The story of Madame DuBarry, the mistress of Louis XV of France, and her loves in the time of the French revolution.
An epic Italian film "Quo Vadis" influenced many of the later movies.
The movie depicts the Romanian War of Independence (1877-1878).
Richard of Gloucester uses manipulation and murder to gain the English throne.
After Dr. Friedrich's wife becomes mentally unstable and his research papers are rejected, he leaves the country to respite.
Inspector Juve is tasked to investigate and capture an infamous criminal Fantomas.


Write the plots to a text file, separating them with GPT-2's `<|endoftext|>` delimiter.

In [9]:
delim = "<|endoftext|>"

In [10]:
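# The context manager closes the file even if a write fails
with open("plot_training.txt", "w", encoding="utf-8") as plots:
    for plot in training_data:
        # Each plot is followed by a blank line, the delimiter, and another blank line
        plots.write("{}\n\n{}\n\n".format(plot, delim))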
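
As a sanity check on the file format, the following sketch writes a couple of made-up plots to a temporary file in the same layout and splits them back out on the delimiter. The helper names `write_plots` and `read_plots` and the sample data are hypothetical; the sketch assumes no plot contains the delimiter or has leading or trailing whitespace.

```python
import os
import tempfile

delim = "<|endoftext|>"
sample_plots = ["First plot.", "Second plot."]  # hypothetical stand-in for training_data

def write_plots(path, plots, delim):
    """Write each plot followed by a blank line, the delimiter, and a blank line."""
    with open(path, "w", encoding="utf-8") as f:
        for plot in plots:
            f.write("{}\n\n{}\n\n".format(plot, delim))

def read_plots(path, delim):
    """Read the file back and return the list of plots."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Split on the delimiter, trim the surrounding newlines, and drop empty pieces
    return [chunk.strip() for chunk in text.split(delim) if chunk.strip()]

# Use a temporary directory so the real plot_training.txt is not touched
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "plot_training.txt")
    write_plots(path, sample_plots, delim)
    recovered = read_plots(path, delim)

print(recovered == sample_plots)
```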