In [1]:
import joblib
import pandas as pd
import numpy as np
import altair as alt

DATA_FOLDER = "../data"

# Dataset merger

This notebook merges the various datasets that we collected for our task.

First, we deduplicated the original dataset that we were given in the Made With ML challenge: 999 duplicated movies had to be removed.

In [2]:
deduplicated_original = pd.read_csv(f"{DATA_FOLDER}/deduplicated_movies.csv")

## TMDB

We also have to deduplicate the dataset that we derived from the original one by scraping The Movie Database (TMDB) website.

In [3]:
tmdb = joblib.load(f"{DATA_FOLDER}/data.pkl").convert_dtypes().dropna()

In [4]:
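# Collect every id that occurs more than once. duplicated(keep=False) flags all
# occurrences and does not depend on the column names that
# value_counts().reset_index() produces, which differ across pandas versions.
duplicated_idx = tmdb.loc[tmdb["id"].duplicated(keep=False), "id"].unique()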

In [5]:
def check_identical_df(df: pd.DataFrame) -> bool:
    """Checks whether a given dataframe contains identical rows"""

    test_row = df.iloc[0]
    return ((test_row.eq(df)).all()).all()

check_all_duplicates = (
    tmdb[tmdb["id"].isin(duplicated_idx)]
    .groupby("id")
    .apply(check_identical_df)
)
check_all_duplicates.all()

True

In [6]:
if check_all_duplicates.all():
    deduplicated_tmdb = tmdb.groupby("id").first().reset_index()
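
Since every group of duplicated ids was found to consist of identical rows, taking the first row per `id` should be equivalent to dropping the repeated rows. A minimal sketch on made-up data (the `toy` frame is hypothetical, not the scraped dataset) that compares the two approaches:

```python
import pandas as pd

# Toy stand-in for the scraped frame; the values are illustrative only.
toy = pd.DataFrame(
    {"id": [7, 7, 8, 9, 9], "new_revenue": [100, 100, 0, 250, 250]}
)

# Approach used in the notebook: first row of each id group.
via_groupby = toy.groupby("id").first().reset_index()

# Alternative: drop rows that repeat an already-seen id, keeping the first.
via_drop = toy.drop_duplicates(subset="id", keep="first").reset_index(drop=True)

print(via_groupby)
print(via_drop)
```

Note that `groupby(...).first()` returns the first non-null value per column, so the two approaches only agree when the rows contain no missing values, which is the case here because of the earlier `dropna()`.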

We can now merge the original movie data with the information scraped from The Movie Database.

In [7]:
original_tmdb_merge = deduplicated_original.merge(deduplicated_tmdb[["id", "new_revenue"]], on="id")
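
`DataFrame.merge` defaults to an inner join, so movies without a match in the TMDB data are dropped silently. A hedged sketch of how `validate` and `indicator` could surface this; `left` and `right` are made-up stand-ins for `deduplicated_original` and `deduplicated_tmdb`:

```python
import pandas as pd

# Illustrative frames only; ids and revenues are invented.
left = pd.DataFrame({"id": [1, 2, 3], "revenue": [0, 50, 70]})
right = pd.DataFrame({"id": [1, 2, 4], "new_revenue": [30, 0, 90]})

# validate raises a MergeError if either side has repeated keys;
# indicator adds a _merge column recording where each row came from.
checked = left.merge(
    right, on="id", how="outer", validate="one_to_one", indicator=True
)

# Rows that exist in only one of the two sources.
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["id", "_merge"]])
```

Running such a check once before the inner join gives a count of the movies that would be lost.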

Compared to the original dataset, TMDB provides new revenue information for around 1,200 movies!

In [8]:
len(original_tmdb_merge[(original_tmdb_merge["revenue"] == 0) & (original_tmdb_merge["new_revenue"] != 0)])

1205

But we would lose revenue information for 75 movies if we only used the values from TMDB.

In [9]:
len(original_tmdb_merge[(original_tmdb_merge["new_revenue"] == 0) & (original_tmdb_merge["revenue"] != 0)])

75

So we'll use all the available information and prioritise the values from TMDB:

In [10]:
original_tmdb_merge["revenue"] = np.where(
    original_tmdb_merge["new_revenue"] == 0, 
    original_tmdb_merge["revenue"], 
    original_tmdb_merge["new_revenue"]
)
original_tmdb_merge = original_tmdb_merge.drop(columns="new_revenue")
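
The same "prefer TMDB unless it is zero" rule can also be written with `Series.where`, which keeps the result as a pandas Series. A small sketch on made-up values (the `toy` frame is hypothetical):

```python
import pandas as pd

# Invented revenues: 0 stands for "unknown", as in the notebook.
toy = pd.DataFrame({"revenue": [0, 500, 300, 0], "new_revenue": [120, 0, 310, 0]})

# Keep new_revenue where it is non-zero, otherwise fall back to revenue.
toy["revenue"] = toy["new_revenue"].where(toy["new_revenue"] != 0, toy["revenue"])
toy = toy.drop(columns="new_revenue")

print(toy["revenue"].tolist())  # [120, 500, 310, 0]
```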

## Box Office Mojo

We'll now merge with the Box Office Mojo data that we scraped.

In [11]:
bo_mojo = pd.read_csv(f"{DATA_FOLDER}/boxoffice_mojo.csv").convert_dtypes()

In [12]:
bo_mojo

Unnamed: 0,title,tagline,genres,date,runtime,revenue,budget,director,cast,production_companies,imdb_id
0,Jumanji,When two kids find and play a magical board ga...,"Adventure, Comedy, Family, Fantasy",1995-12-15,104,262821940,65000000,Joe Johnston,"Robin Williams, Kirsten Dunst, Bonnie Hunt, Jo...",Sony Pictures Releasing,tt0113497
1,Father of the Bride Part II,George Banks must deal not only with the pregn...,"Comedy, Family, Romance",1995-12-08,106,76594107,,Charles Shyer,"Steve Martin, Diane Keaton, Martin Short, Kimb...",Walt Disney Studios,tt0113041
2,Heat,A group of professional bank robbers start to ...,"Crime, Drama, Thriller",1995-12-15,170,187436818,,Michael Mann,"Al Pacino, Robert De Niro, Val Kilmer, Jon Voight",Warner Bros.,tt0113277
3,Sabrina,An ugly duckling having undergone a remarkable...,"Comedy, Drama, Romance",1995-12-15,127,53696278,58000000,Sydney Pollack,"Harrison Ford, Julia Ormond, Greg Kinnear, Nan...",Paramount Pictures,tt0114319
4,Grumpier Old Men,John and Max resolve to save their beloved bai...,"Comedy, Romance",1995-12-22,101,71518503,,Howard Deutch,"Walter Matthau, Jack Lemmon, Ann-Margret, Soph...",Warner Bros.,tt0113228
...,...,...,...,...,...,...,...,...,...,...,...
45406,Betrayal,Jayne Ferré needs to get out of Los Angeles fa...,"Action, Drama, Thriller",,90,,,Mark L. Lester,"Erika Eleniak, Adam Baldwin, Julie du Page, Je...",,tt0303758
45407,Century of Birthing,An artist struggles to finish his work. A stor...,Drama,,360,,,Lav Diaz,"Perry Dizon, Angel Aquino, Bart Guingona, Haze...",,tt2028550
45408,Subdued,After her divorce Mina is in need to financial...,"Drama, Romance",,103,,,Hamid Nematollah,"Leila Hatami, Elham Korda, Leila Moosavi, Kour...",,tt6209470
45409,Satan Triumphant,,Drama,,87,,,Yakov Protazanov,"Ivan Mozzhukhin, Nathalie Lissenko, Polycarpe ...",,tt0008536


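As before, we'll prefer information from the Box Office Mojo website when choosing between the two data sources. This applies to the runtime, date, title, director, revenue and budget; for the production companies and the cast we keep the original values where they are available.

The `tagline` column in the Box Office Mojo dataset, though, actually contains the description. We will keep both: it is stored as `description2` next to the original `description`.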

In [13]:
bo_mojo["budget"] = bo_mojo["budget"].fillna(0)
bo_mojo["revenue"] = bo_mojo["revenue"].fillna(0)
final_dataset = original_tmdb_merge.merge(bo_mojo, on="imdb_id", suffixes=("_original", "_bo_mojo"))
shared_columns = set(original_tmdb_merge.columns).intersection(set(bo_mojo.columns)).difference({"imdb_id"})
suffixed_shared_columns = [(col, f"{col}_original", f"{col}_bo_mojo") for col in shared_columns]

In [14]:
final_dataset.columns

Index(['id', 'title_original', 'tagline_original', 'description',
       'genres_original', 'keywords', 'date_original', 'collection',
       'runtime_original', 'revenue_original', 'budget_original',
       'director_original', 'cast_original', 'production_companies_original',
       'production_countries', 'popularity', 'average_vote', 'num_votes',
       'language', 'imdb_id', 'poster_url', 'title_bo_mojo', 'tagline_bo_mojo',
       'genres_bo_mojo', 'date_bo_mojo', 'runtime_bo_mojo', 'revenue_bo_mojo',
       'budget_bo_mojo', 'director_bo_mojo', 'cast_bo_mojo',
       'production_companies_bo_mojo'],
      dtype='object')

In [15]:
def take_bomojo_of_column(col: str):
    return np.where(final_dataset[f"{col}_bo_mojo"].notna(), final_dataset[f"{col}_bo_mojo"], final_dataset[f"{col}_original"])

def take_original_of_column(col: str):
    return np.where(final_dataset[f"{col}_original"].notna(), final_dataset[f"{col}_original"], final_dataset[f"{col}_bo_mojo"])

final_dataset["runtime"] = take_bomojo_of_column("runtime")
final_dataset["date"] = take_bomojo_of_column("date")
final_dataset["title"] = take_bomojo_of_column("title")
final_dataset["director"] = take_bomojo_of_column("director")

final_dataset["production_companies"] = take_original_of_column("production_companies")
final_dataset["cast"] = take_original_of_column("cast")

final_dataset["revenue"] = np.where(final_dataset["revenue_bo_mojo"] != 0, final_dataset["revenue_bo_mojo"], final_dataset["revenue_original"])
final_dataset["budget"] = np.where(final_dataset["budget_bo_mojo"] != 0, final_dataset["budget_bo_mojo"], final_dataset["budget_original"])

extra_columns = list(map(list, zip(*suffixed_shared_columns)))
final_dataset["description2"] = final_dataset["tagline_bo_mojo"]
final_dataset["tagline"] = final_dataset["tagline_original"]
final_dataset = final_dataset.drop(columns=extra_columns[1] + extra_columns[2])
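
The two helper functions implement a coalesce: take the preferred source and fall back to the other one where the preferred value is missing. As a sketch, assuming that missing values are encoded as NA and not as 0, `Series.combine_first` expresses the same idea; the frame below is made up:

```python
import pandas as pd

# Invented runtimes; None marks a value missing from that source.
toy = pd.DataFrame(
    {
        "runtime_original": [104.0, None, 170.0],
        "runtime_bo_mojo": [None, 106.0, 171.0],
    }
)

# Prefer Box Office Mojo, fall back to the original value where it is missing.
toy["runtime"] = toy["runtime_bo_mojo"].combine_first(toy["runtime_original"])

print(toy["runtime"].tolist())  # [104.0, 106.0, 171.0]
```

Swapping the two operands gives the behaviour of `take_original_of_column`.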

In [16]:
final_dataset.columns

Index(['id', 'description', 'keywords', 'collection', 'production_countries',
       'popularity', 'average_vote', 'num_votes', 'language', 'imdb_id',
       'poster_url', 'runtime', 'date', 'title', 'director',
       'production_companies', 'cast', 'revenue', 'budget', 'description2',
       'tagline'],
      dtype='object')

# Target definition

We define the following classification as our target variable:
- Movie where $revenue \ge 4.5 \cdot budget$: super hit;
- Movie where $4.5 \cdot budget > revenue \ge 2.5 \cdot budget$: blockbuster;
- Movie where $2.5 \cdot budget > revenue \ge budget$: minor success;
- Movie where $budget > revenue \ge \frac{1}{3} \cdot budget$: flop;
- Movie where $\frac{1}{3} \cdot budget > revenue$: box office bomb.

Movies with a revenue or a budget of 0 are left unclassified.
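
As a worked example, here are the ratios for three movies from the final dataset, classified by following the comparisons in `target_classification`:

$$\frac{revenue}{budget} = \frac{775512064}{11000000} \approx 70.50 \ge 4.5 \Rightarrow \text{super hit (Star Wars: Episode IV - A New Hope)}$$

$$\frac{revenue}{budget} = \frac{4257354}{4000000} \approx 1.06 \ge 1 \Rightarrow \text{minor success (Four Rooms)}$$

$$\frac{revenue}{budget} = \frac{42832}{1254040} \approx 0.03 < \frac{1}{3} \Rightarrow \text{box office bomb (Thick Lashes of Lauri Mäntyvaara)}$$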

In [17]:
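def target_classification(revenue, budget) -> str:
    """Classify a movie by comparing its revenue with multiples of its budget."""
    # A value of 0 means the figure is unknown, so no class can be assigned.
    if revenue == 0 or budget == 0:
        return "unclassified"

    # The checks run from the highest threshold down, so a movie that sits
    # exactly on a boundary is assigned to the higher class.
    if revenue >= 4.5 * budget:
        return "super hit"
    if 4.5 * budget >= revenue >= 2.5 * budget:
        return "blockbuster"
    if 2.5 * budget >= revenue >= budget:
        return "minor success"
    if budget >= revenue >= 1/3 * budget:
        return "flop"
    if 1/3 * budget >= revenue:
        return "box office bomb"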

In [18]:
final_dataset[(final_dataset["revenue"] != 0) & (final_dataset["budget"] != 0)][["revenue", "budget"]]

Unnamed: 0,revenue,budget
2,4257354,4000000
4,775512064,11000000
5,962859504,94000000
6,683088874,55000000
7,356296601,15000000
...,...,...
45282,4314,400000
45284,22757764,23000000
45348,9559524,25000000
45383,776522,15000000


In [19]:
final_dataset["movie_classification"] = final_dataset.apply(lambda row: target_classification(row["revenue"], row["budget"]), axis=1)
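
`DataFrame.apply` with `axis=1` calls the Python function once per row. A vectorised sketch of the same thresholds with `np.select`, which picks the label of the first condition that holds and therefore mirrors the order of the `if` statements; the `toy` frame is made up:

```python
import numpy as np
import pandas as pd

# Invented revenues against a constant budget of 10.
toy = pd.DataFrame(
    {
        "revenue": [0, 50, 12, 6, 2, 30],
        "budget": [10, 10, 10, 10, 10, 10],
    }
)

revenue, budget = toy["revenue"], toy["budget"]

# Conditions are ordered from the highest threshold down; the first match wins.
conditions = [
    (revenue == 0) | (budget == 0),
    revenue >= 4.5 * budget,
    revenue >= 2.5 * budget,
    revenue >= budget,
    revenue >= budget / 3,
]
labels = ["unclassified", "super hit", "blockbuster", "minor success", "flop"]

toy["movie_classification"] = np.select(
    conditions, labels, default="box office bomb"
)
print(toy["movie_classification"].tolist())
```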

In [20]:
valid_classifications = final_dataset["movie_classification"] != "unclassified"

alt.data_transformers.enable("json")
alt.Chart(final_dataset[valid_classifications], width=500).mark_bar().encode(
    x="movie_classification:N",
    y="count(movie_classification)"
)

In [21]:
f"Number of training data available: {valid_classifications.sum()}"

'Number of training data available: 7505'

We obviously can't do much with so little data. We could think about deriving another classification based solely on the revenue, but that would need more research and, for example, an adjustment for inflation.
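
To illustrate what such an inflation adjustment might look like, here is a rough sketch. The `price_index` values are placeholders invented for this example and not real consumer price data, and the `year` column is assumed to have been extracted from `date` beforehand:

```python
import pandas as pd

# Hypothetical price index per release year (placeholder numbers, not real data).
price_index = pd.Series({1980: 50.0, 2000: 100.0, 2020: 150.0})

# Invented movies that all earned the same nominal revenue.
toy = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "year": [1980, 2000, 2020],
        "revenue": [10.0, 10.0, 10.0],
    }
)

# Express every revenue in the prices of the reference year.
reference_year = 2020
factor = price_index[reference_year] / toy["year"].map(price_index)
toy["revenue_adjusted"] = toy["revenue"] * factor

print(toy[["title", "revenue_adjusted"]])
```

With a real index, a revenue-only classification could then be based on `revenue_adjusted` rather than on the nominal figures.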

So this will do for now.

In [22]:
final_dataset

Unnamed: 0,id,description,keywords,collection,production_countries,popularity,average_vote,num_votes,language,imdb_id,...,date,title,director,production_companies,cast,revenue,budget,description2,tagline,movie_classification
0,2,Taisto Kasurinen is a Finnish coal miner whose...,"underdog, prison, factory worker, prisoner, he...",,Finland,3.860491,7.1,44.0,fi,tt0094675,...,1988-10-21,Ariel,Aki Kaurismäki,"Villealfa Filmproduction Oy, Finnish Film Foun...","Turo Pajala, Susanna Haavisto, Matti Pellonpää...",0,0,A Finnish man goes to the city to find a job a...,,unclassified
1,3,"An episode in the life of Nikander, a garbage ...","salesclerk, helsinki, garbage, independent film",,Finland,2.292110,7.1,35.0,fi,tt0092149,...,1986-10-16,Shadows in Paradise,Aki Kaurismäki,Villealfa Filmproduction Oy,"Matti Pellonpää, Kati Outinen, Sakari Kuosmane...",0,0,"An episode in the life of Nikander, a garbage ...",,unclassified
2,5,It's Ted the Bellhop's first night on the job....,"hotel, new year's eve, witch, bet, hotel room,...",,United States of America,9.026586,6.5,539.0,en,tt0113101,...,1995-12-22,Four Rooms,"Allison Anders, Alexandre Rockwell, Robert Rod...","Miramax Films, A Band Apart","Tim Roth, Antonio Banderas, Jennifer Beals, Ma...",4257354,4000000,Four interlocking tales that take place in a f...,Twelve outrageous guests. Four scandalous requ...,minor success
3,6,"While racing to a boxing match, Frank, Mike, J...","chicago, drug dealer, boxing match, escape, on...",,"Japan, United States of America",5.538671,6.4,79.0,en,tt0107286,...,1993-10-15,Judgment Night,Stephen Hopkins,"Universal Pictures, Largo Entertainment, JVC E...","Emilio Estevez, Cuba Gooding Jr., Denis Leary,...",12526677,0,Four friends on their way to a boxing match ge...,Don't move. Don't whisper. Don't even breathe.,unclassified
4,11,Princess Leia is captured and held hostage by ...,"android, galaxy, hermit, death star, lightsabe...",Star Wars Collection,United States of America,42.149697,8.1,6778.0,en,tt0076759,...,1977-05-25,Star Wars: Episode IV - A New Hope,George Lucas,"Lucasfilm, Twentieth Century Fox Film Corporation","Mark Hamill, Harrison Ford, Carrie Fisher, Pet...",775512064,11000000,Luke Skywalker joins forces with a Jedi Knight...,"A long time ago in a galaxy far, far away...",super hit
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45403,465044,A horror comedy spoofing conspiracy theory mov...,,,United Kingdom,0.281008,0.0,0.0,en,tt5943940,...,2017-06-28,Abduction,"Maurice Smith, Mol Smith",,"Karolina Antosik, Amelie Leroy, Tessa McGinn, ...",0,0,A horror comedy spoofing conspiracy theory mov...,Horrifically Funny,unclassified
45404,467731,Fifteen-year-old girl Dotty Fisher is assaulte...,,,United States of America,0.001189,0.0,0.0,en,tt0507700,...,1956-02-19,Tragedy in a Temporary Town,Sidney Lumet,,"Lloyd Bridges, Jack Warden, Rafael Campos, Rob...",0,0,Fifteen-year-old girl Dotty Fisher is assaulte...,,unclassified
45405,468343,"In the 1910s, beautiful young Silja loses both...",,,Finland,0.001202,0.0,0.0,fi,tt0133202,...,1956-01-01,Silja - nuorena nukkunut,Jack Witikka,,"Heidi Krohn, Jussi Jurkka, Aku Korhonen, Pentt...",0,0,"In the 1910s, beautiful young Silja loses both...",,unclassified
45406,468707,,"fantasy, youth, weird",,Finland,0.347806,8.0,1.0,fi,tt5742932,...,2017-07-28,Thick Lashes of Lauri Mäntyvaara,Hannaleena Hauru,Elokuvayhtiö Oy Aamu,"Inka Haapamäki, Rosa Honkonen, Tiitus Rantala,...",42832,1254040,The love rebels are ready for battle. Satu and...,,box office bomb


In [23]:
final_dataset.to_csv(f"{DATA_FOLDER}/final_dataset.csv", index=False, header=True)