<h1>Introduction</h1>

Welcome to our kernel notebook. We are two computer science students, and we are thrilled to present our first data science project to the Kaggle community. Through this kernel, we want to share our analysis and methodology with you.
Here, we work with data from the TMDB Box Office Prediction Challenge. Using the large amount of available data, we will try to predict a movie's worldwide box office revenue before its release.
Which model predicts film revenues most accurately? Could such a model be used to suggest changes to a movie that would increase its revenue even further?

Competition information:
We'll be using 3000 films from The Movie Database to train our model, then we will predict the worldwide box office revenue of 4000 movies.


**Table of contents**
1. [Loading data](#0)
2. [Data analysis and features (EDA)](#1)
3. [Model](#2)
4. [Conclusion](#3)





<a id="0"></a>
<h1>Loading data</h1>

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from xgboost import XGBRegressor,plot_importance
import ast
from collections import Counter
from sklearn.cluster import KMeans
import seaborn as sns
import matplotlib.pyplot as plt
import bokeh
from bokeh.plotting import figure
from bokeh.io import output_notebook, show
from bokeh.models import LabelSet, ColumnDataSource, HoverTool
from bokeh.palettes import Category20c, Spectral6
from bokeh.transform import cumsum, factor_cmap, jitter

output_notebook()


In [None]:
# Path of the files to read.
train_path = '../input/tmdb-box-office-prediction/train.csv'
test_path = '../input/tmdb-box-office-prediction/test.csv'
nominations_path='../input/nominations/nominations.csv'

train = pd.read_csv(train_path)
test = pd.read_csv(test_path)
nominations=pd.read_csv(nominations_path)


Because certain columns contain dictionaries but are loaded from the CSV file as strings, we need to convert them back into Python objects.

In [None]:
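# Columns whose cells hold a stringified list of dictionaries.
colons_in_Json = ['genres', 'production_companies', 'production_countries', 'spoken_languages', 'Keywords', 'cast', 'crew','belongs_to_collection']

def get_dictionary(s):
    # ast.literal_eval only parses Python literals, so unlike eval it cannot run arbitrary code.
    try:
        d = ast.literal_eval(s)
    except (ValueError, SyntaxError, TypeError):
        # Missing (NaN) or malformed cells become an empty dictionary.
        d = {}
    return d

for col in colons_in_Json:
    train[col] = train[col].apply(get_dictionary)
    test[col] = test[col].apply(get_dictionary)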
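
As a minimal sketch of what this parsing step does, the snippet below applies `ast.literal_eval` to a made-up cell value shaped like the JSON-like columns. The helper name `parse_cell` and the sample string are assumptions for illustration only; they are not taken from the competition files.

```python
import ast

def parse_cell(s):
    """Parse a stringified Python literal; return {} for missing or malformed input."""
    try:
        return ast.literal_eval(s)
    except (ValueError, SyntaxError, TypeError):
        return {}

# Hypothetical cell content, as it would look after pd.read_csv.
sample = "[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name': 'Comedy'}]"

parsed = parse_cell(sample)
names = [d['name'] for d in parsed]

# A missing cell is read by pandas as a float NaN, which is not a parsable literal.
missing = parse_cell(float('nan'))
```

Returning an empty dictionary for missing cells keeps later checks such as `x == {}` simple, and iterating over it yields nothing.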


<h2>External data</h2>

<h3>Actors awards nominations</h3>

We thought that the success of a movie's actors is an important factor in the success of the movie itself. We wrote a script that collects, from IMDb, the number of Oscar and Golden Globe nominations of each actor.

In [None]:
%%script false

import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ast
import time
import json
import unidecode
import urllib

path = './train.csv'
data = pd.read_csv(path)

cast_dict = data['cast'].apply(lambda x: ast.literal_eval(x) if pd.notnull(x) else {})
cast_ids=[[y['id'] for y in x] for x in cast_dict]
unique_cast = pd.DataFrame(cast_ids).stack().unique()
nominations=pd.DataFrame(columns=['id', 'nominations'])

# output = pd.DataFrame({'id': nominations.id,
#                        'nominations': nominations.nominations})
# output.to_csv('nominations.csv', index=False)

nominations['id'] = [x for x in unique_cast]


i=0 

with open('nominations.csv', 'a') as csv:
    for x in unique_cast[i:]:
        start_time = time.time()

        print(x)
        
        try:
            with urlopen("https://www.themoviedb.org/person/"+str(x)) as page:
                soup = BeautifulSoup(page, 'html.parser')
                name=soup.find('h2').text
                name = unidecode.unidecode(name)
                name=name.replace(" ", "%20")

                print(name)

                url="https://sg.media-imdb.com/suggests/"+name[0].lower()+"/"+name.lower()+".json"
                print(url)
                page = urlopen(url)
                soup = BeautifulSoup(page, 'html.parser')
                id=soup.text.find('"id":"nm')
                id=soup.text[id+6:id+15]

                print(id)
                print(i)
                page = urlopen("https://www.imdb.com/name/"+id+"/awards?ref_=nm_ql_2")

                soup = BeautifulSoup(page, 'html.parser')
                
                a = 0
                academy=0
                golden=0
                if soup.find_all('h3', string="Academy Awards, USA"):
                    soupCarotte = soup.find_all('table', attrs={'class': 'awards'})[a]
                    academy=len(soupCarotte.find_all('td', attrs={'class': 'award_outcome'}))
                    a+=1

                if soup.find_all('h3', string="Golden Globes, USA"):
                    soupSoup = soup.find_all('table', attrs={'class': 'awards'})[a]
                    golden=len(soupSoup.find_all('td', attrs={'class': 'award_outcome'}))

                nominations.loc[i,'nominations']=golden+academy


        except urllib.error.HTTPError:
            nominations.loc[i,'nominations']=0


        print(nominations)

        output = pd.DataFrame({'id': nominations.loc[i, 'id'],
                       'nominations': nominations.loc[i, 'nominations']}, index=[0])
        output.to_csv(csv, header=False, index=False)

        i+=1


print(time.time() - start_time)

# output = pd.DataFrame({'id': nominations.id,
#                        'nominations': nominations.nominations})
# output.to_csv('nominations.csv', index=False)

<h3>Google search results</h3>

Conceptualizing and measuring popularity: we think the number of results returned by a Google search for a movie's title might be a good measure of its popularity. So we wrote a script to collect the number of results for each movie title. We append the word 'movie' to the title to get more relevant results.

In [None]:
%%script false

import pandas as pd
import numpy as np
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ast
import time
import json
import urllib
import unidecode
import requests
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")
import re
from itertools import islice



path = './train.csv'
data = pd.read_csv(path)


print(data['belongs_to_collection'].iloc[0])


colons_in_Json = ['belongs_to_collection']

def get_dictionary(s):
    try:
        d = eval(s)
    except:
        d = {}
    return d

for col in tqdm(colons_in_Json) :
    data[col] = data[col].apply(lambda x : get_dictionary(x))




google_result=data[['id','title','belongs_to_collection']]
google_result['google_result']=""
movie_title=google_result['title']
list_movie_title=list(movie_title)


google_result['film_belongs_to_collection'] = google_result['belongs_to_collection'].apply(lambda x: 0 if x == {} else 1)

word_movie=' movie'

with open('google_result.csv', 'a') as csv:
    for index, r in google_result.iloc[0:].iterrows():    #choose here to start iterating from the row you want in case of an error during data collection 
        if google_result.iloc[index]['film_belongs_to_collection']==0:
            search=google_result['title'].iloc[index]
            search=search+word_movie
        else:
            words = ['(Theatrical)','(1958 series)','( Series)','- Коллекция','Collection','(Animation)','(Universal Series)','(Heisei)',': The Original Series','- Collezione','(Original)','(Hammer Series)','(Remake)','(Reboot)','( Series)','Trilogy','(1976 series)','(Original Series)','(Universal Series)','Anthology','(Universal)','()','The Klapisch ','(Коллекция)']
            search = [] 
            belongs_to_collection_line=google_result.iloc[index]['belongs_to_collection']
            collection_name = belongs_to_collection_line[0]['name']
            for w in words:
                collection_name = collection_name.replace(w, '')
            search.append(collection_name)
            search=(search[0])  
            search=search+word_movie
      
        try:
            print(search)
            r = requests.get("https://www.google.com/search", params={'q':search})
            soup = BeautifulSoup(r.text, "lxml")
            res = soup.find("div", {"id": "resultStats"})
            nb_result = ''.join(x for x in res.text if x.isdigit())
            print(nb_result)
            google_result.loc[index, 'google_result']=nb_result
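        except (requests.exceptions.RequestException, AttributeError):
            # requests raises its own exceptions; res is None (AttributeError) when the stats element is missing.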
            google_result.loc[index,'google_result']=0
  
        output = pd.DataFrame({'id': google_result.loc[index, 'title'],
                    'google_result': google_result.loc[index, 'google_result']}, index=[0])
        output.to_csv(csv, header=False, index=False)
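
The script above keeps every digit found in the result-statistics text. If that text also carries something like a timing, its digits would be appended to the count. The sketch below shows one hedged alternative that keeps only the first number. The function name `parse_result_count` and the sample strings are assumptions; the actual text returned by the search page may differ.

```python
import re

def parse_result_count(text):
    """Return the first integer found in text (thousands separators allowed), or 0."""
    match = re.search(r'\d+(?:[,.\s]\d{3})*', text)
    if match is None:
        return 0
    # Drop separators such as "," or " " before converting.
    return int(re.sub(r'\D', '', match.group(0)))

# Hypothetical statistics line.
sample = "About 1,230,000 results (0.45 seconds)"

first_number = parse_result_count(sample)
all_digits = ''.join(ch for ch in sample if ch.isdigit())
```

Here `first_number` holds only the result count, while `all_digits` also contains the digits of the timing.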


<a id="1"></a>
<h1>Data analysis and features</h1>

What matters the most in making a film successful?
As spectators, we think the main reasons that make us go to see a movie are the actors who play in it, the popularity of the movie, and whether it is the sequel to a movie we have already seen. Of course, this is only a spectator's point of view, and these are not the only elements behind a movie's success.

That's why we need some plots to analyse the data.

<h2>Revenue</h2>

Let's have a look at the films' revenues.

In [None]:
t = train[['id','revenue', 'title']]
          
hover = HoverTool(tooltips = [
            ('Title','@title'),
            ('Revenue','@revenue'),
            ('id','@id')
           ])


fig = figure(x_axis_label='Films',
             y_axis_label='Revenue',
             title='Revenue for each film',
            tools=[hover])


fig.square(x='id',
           y='revenue',
          source=t)

show(fig)

Most of the films have a revenue below $250,000,000, while a few earn far more, so the gap between the biggest and the average revenue is very large. Let's convert the revenue with log1p to compress this range.

In [None]:
t = train[['id','revenue', 'title']].copy()
# Plot log1p(revenue) so that the values match the axis label and title.
t['revenue'] = np.log1p(t.revenue)

hover = HoverTool(tooltips = [
            ('Title','@title'),
            ('Revenue in log1p','@revenue'),
            ('id','@id')
           ])


fig = figure(x_axis_label='Films',
             y_axis_label='Revenue in log1p',
             title='Revenue in log1p for each film',
            tools=[hover])


fig.square(x='id',
           y='revenue',
          source=t)


show(fig)
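
To illustrate what the log1p conversion does, here is a small sketch on made-up revenue values (they are not taken from the dataset). It shows that `np.log1p` maps a revenue of 0 to 0, compresses very large values into a small range, and can be undone with `np.expm1`.

```python
import numpy as np

# Hypothetical revenues in dollars, spanning several orders of magnitude.
revenues = np.array([0.0, 1_000.0, 2_500_000.0, 1_500_000_000.0])

# log1p(x) = log(1 + x), which is defined for x = 0.
log_revenues = np.log1p(revenues)

# expm1 is the inverse transformation, useful to turn predictions back into dollars.
restored = np.expm1(log_revenues)
```

The inverse matters if a model is trained on log1p revenue: its predictions would then need `np.expm1` before being read as revenues.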

<h2>Runtime</h2>

Let's have a look at the runtime of each film to see if we find a specific pattern.

In [None]:
t = train[['id','runtime', 'title','revenue']].copy()
t['revenue'] = np.log1p(t.revenue)
          
hover = HoverTool(tooltips = [
            ('Title','@title'),
            ('Runtime','@runtime'),
            ('id','@id'),
            ('Revenue','@revenue')
           ])


fig = figure(x_axis_label='Films',
             y_axis_label='Runtime',
             title='Runtime for each film',
            tools=[hover])


fig.square(x='id',
           y='runtime',
          source=t)


show(fig)

We see some runtimes equal to 0, which we may need to handle. We think that a film longer than 2 h 30 min might have less success at the box office.

What do the revenues of films with a runtime of 150 minutes or more look like?

In [None]:
t= train[['id','title','runtime','revenue','release_date']].copy()

          
t_150=t.loc[(t['runtime'] >= 150), ['id','title','runtime','revenue','release_date']] 
t_150['revenue'] = np.log1p(t_150.revenue)
          

          
hover = HoverTool(tooltips = [
            ('Title','@title'),
            ('Runtime','@runtime'),
            ('id','@id'),
            ('Revenue','@revenue'),
            ('Release date','@release_date')
           ])


fig = figure(x_axis_label='Revenue in log1p',
             y_axis_label='Runtime',
             title='Runtime vs. revenue in log1p for films of 150 minutes or more',
            tools=[hover])


fig.square(x='revenue',
           y='runtime',
          source=t_150)


show(fig)

We can't find a specific pattern to use as a feature. Let's try grouping the runtimes into categories.

In [None]:
t= train[['id','title','runtime','revenue']].copy()

t.iloc[1335]=t.iloc[1335].replace(np.nan, int(120))
t.iloc[2302]=t.iloc[2302].replace(np.nan, int(90))


    
t['runtime_cat_min_60'] = t['runtime'].apply(lambda x: 1 if (x <=60) else 0)
t['runtime_cat_61_80'] = t['runtime'].apply(lambda x: 1 if (x >60)&(x<=80) else 0)
t['runtime_cat_81_100'] = t['runtime'].apply(lambda x: 1 if (x >80)&(x<=100) else 0)
t['runtime_cat_101_120'] = t['runtime'].apply(lambda x: 1 if (x >100)&(x<=120) else 0)
t['runtime_cat_121_140'] = t['runtime'].apply(lambda x: 1 if (x >120)&(x<=140) else 0)
t['runtime_cat_141_170'] = t['runtime'].apply(lambda x: 1 if (x >140)&(x<=170) else 0)
t['runtime_cat_171_max'] = t['runtime'].apply(lambda x: 1 if (x >170) else 0)


t.loc[t.runtime_cat_min_60 == 1,'runtime_category'] = 'cat_min-60'
t.loc[t.runtime_cat_61_80 == 1,'runtime_category'] = 'cat_61-80'
t.loc[t.runtime_cat_81_100 == 1,'runtime_category'] = 'cat_81-100'
t.loc[t.runtime_cat_101_120 == 1,'runtime_category'] = 'cat_101-120'
t.loc[t.runtime_cat_121_140 == 1,'runtime_category'] = 'cat_121-140'
t.loc[t.runtime_cat_141_170 == 1,'runtime_category'] = 'cat_141-170'
t.loc[t.runtime_cat_171_max == 1,'runtime_category'] = 'cat_171-max'


# Count how many samples we have per category. We want at least 15 examples to keep a category.
# print(Counter(t['runtime_cat_171_max']==1))


cat = t['runtime_category']
ctr = Counter(cat)
cat = [x for x in ctr]
unique_names = pd.Series(cat).unique()

dic={}
for a in unique_names:
    mask = t.runtime_category.apply(lambda x: a in x)
    dic[a] = t[mask]['revenue'].mean()
    
t = pd.DataFrame.from_dict(dic, orient='index', columns=['mean_revenue']).reset_index().rename(columns={'index':'runtime_cat'})

t = t.nlargest(6, 'mean_revenue')

t['color'] = Category20c[6]

hover1 = HoverTool(tooltips = [
            ('Runtime_category','@runtime_cat'),
            ('Revenue','@mean_revenue')
           ])

p = figure(x_range=t.runtime_cat, plot_width=800,plot_height=400, toolbar_location=None, title="Revenue per runtime category", tools=[hover1])
p.vbar(x='runtime_cat', top='mean_revenue', width=0.9, source=t, legend='runtime_cat',
       line_color='white',fill_color='color')

p.xgrid.grid_line_color = None
p.legend.orientation = "horizontal"
p.legend.location = "top_center"

show(p)
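
The mean revenue per category computed above with a loop and a mask can also be obtained with a pandas `groupby`. The sketch below uses a tiny made-up frame (the values are assumptions, not dataset figures) and produces the same `runtime_cat` / `mean_revenue` layout used for the bar chart.

```python
import pandas as pd

# Hypothetical films with an already assigned runtime category.
films = pd.DataFrame({
    'runtime_category': ['cat_81-100', 'cat_81-100', 'cat_121-140', 'cat_121-140'],
    'revenue': [10.0, 30.0, 100.0, 300.0],
})

# One row per category with its mean revenue.
mean_revenue = (films.groupby('runtime_category')['revenue']
                     .mean()
                     .reset_index(name='mean_revenue')
                     .rename(columns={'runtime_category': 'runtime_cat'}))

as_dict = dict(zip(mean_revenue['runtime_cat'], mean_revenue['mean_revenue']))
```

Grouping on the exact category value also avoids the substring test `a in x`, which could match more than one label if category names overlapped.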

We can clearly see that when we group films by runtime category, the mean revenues differ. Longer films tend to generate more revenue. We one-hot encode these categories:

In [None]:
# feature engineering: film by runtime category
# The last bin uses a strict ">" so that a 170-minute film falls into exactly one category.
for df in (train, test):
    df['runtime_cat_min_60'] = df['runtime'].apply(lambda x: 1 if x <= 60 else 0)
    df['runtime_cat_61_80'] = df['runtime'].apply(lambda x: 1 if 60 < x <= 80 else 0)
    df['runtime_cat_81_100'] = df['runtime'].apply(lambda x: 1 if 80 < x <= 100 else 0)
    df['runtime_cat_101_120'] = df['runtime'].apply(lambda x: 1 if 100 < x <= 120 else 0)
    df['runtime_cat_121_140'] = df['runtime'].apply(lambda x: 1 if 120 < x <= 140 else 0)
    df['runtime_cat_141_170'] = df['runtime'].apply(lambda x: 1 if 140 < x <= 170 else 0)
    df['runtime_cat_171_max'] = df['runtime'].apply(lambda x: 1 if x > 170 else 0)
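
As an alternative sketch of the same binning, `pd.cut` followed by `pd.get_dummies` builds all runtime categories in one step. The function name `runtime_dummies` and the demo values are assumptions for illustration; the bin edges follow the categories used in this notebook, with right-closed intervals so that 170 minutes belongs to the 141-170 bin.

```python
import numpy as np
import pandas as pd

# Right-closed bins: (-inf, 60], (60, 80], ..., (140, 170], (170, inf)
bins = [-np.inf, 60, 80, 100, 120, 140, 170, np.inf]
labels = ['min_60', '61_80', '81_100', '101_120', '121_140', '141_170', '171_max']

def runtime_dummies(runtime):
    """One-hot encode a runtime Series; missing runtimes get 0 in every column."""
    categories = pd.cut(runtime, bins=bins, labels=labels)
    return pd.get_dummies(categories, prefix='runtime_cat', dtype=int)

# Hypothetical runtimes in minutes, including a boundary value and a missing one.
demo = runtime_dummies(pd.Series([45, 95, 170, 171, np.nan]))
```

Because the categories are fixed up front, the train and test frames would get the same columns even if one of them has no film in a given bin.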

<h2>Budget</h2>

We will compare the budget to the revenue, both converted with log1p, to see if they are correlated.

In [None]:
t = train[['id','title','runtime','revenue','release_date','budget']].copy()
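t['revenue'] = np.log1p(t.revenue)
# The plot title announces log budget, so transform the budget as well.
t['budget'] = np.log1p(t.budget)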



hover = HoverTool(tooltips = [
            ('Title','@title'),
            ('Revenue','@revenue'),
            ('Budget','@budget')
           ])


fig = figure(x_axis_label='Budget',
             y_axis_label='Revenue',
             title='log Revenue vs log Budget ',
            tools=[hover])



fig.square('budget', 'revenue',source=t)

show(fig)

The pattern shows that the bigger the budget, the higher the revenue, so we can clearly use the budget as a feature. It is already a numeric value whose magnitude is meaningful, so no encoding is needed. All we do is convert it with log1p to get a smaller range.

In [None]:
# feature engineering: film budget
train['budget'] = np.log1p(train.budget)
test['budget'] = np.log1p(test.budget)

<h2>Popularity</h2>

As we said earlier, we think the popularity of a movie is one of the main reasons people go to see it. Let's check whether we were right.

In [None]:
t = train[['id','title','runtime','revenue','release_date','budget','popularity']].copy()
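t['revenue'] = np.log1p(t.revenue)
# The plot title announces log popularity, so transform the popularity as well.
t['popularity'] = np.log1p(t.popularity)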

hover = HoverTool(tooltips = [
            ('Title','@title'),
            ('Revenue','@revenue'),
            ('Popularity','@popularity')
            
           ])


fig = figure(x_axis_label='Popularity',
             y_axis_label='Revenue',
             title='log Revenue vs log Popularity ',
            tools=[hover])



fig.square('popularity', 'revenue',source=t)

show(fig)

This is also very interesting, and should be a good feature. We also converted the popularity value into a log value.

In [None]:
# feature engineering: popularity
train['popularity'] = np.log1p(train.popularity)
test['popularity'] = np.log1p(test.popularity)

<h2>Homepage</h2>

Is having a website a good thing? Let's have a look:

In [None]:
#Plot : Revenue for each film that has homepage or not 

t = train[['revenue','homepage','title']].copy()

t['film_that_has_homepage'] = t['homepage'].notnull().astype(str)


t = t.groupby('film_that_has_homepage')['revenue'].mean().reset_index()

hover1 = HoverTool(tooltips = [
            ('Mean revenue','@revenue'),
           ])


t['color'] = [Spectral6[1],Spectral6[2]]


p = figure(x_range=['False','True'], plot_width=600,plot_height=400, toolbar_location=None, title="Revenue for a film that has homepage", tools=[hover1])
p.vbar(x='film_that_has_homepage', top='revenue', width=0.9, source=t, legend='film_that_has_homepage',
       line_color='white', fill_color='color')

p.xgrid.grid_line_color = None
p.legend.orientation = "horizontal"
p.legend.location = 'top_left'

show(p)

Films with a homepage have a slightly higher mean revenue, so it might be interesting to use this in our model as a binary (0/1) feature.

In [None]:
# feature engineering: film that has a homepage (1) or not (0)
train['film_that_has_homepage'] = train['homepage'].notnull().astype(int)
test['film_that_has_homepage'] = test['homepage'].notnull().astype(int)

<h2>Original language</h2>

Which original languages tend to go with a higher revenue?

In [None]:
t = train[['revenue','original_language','title']].copy()


lang = t['original_language']
ctr = Counter(lang).most_common(17)
lang = [x[0] for x in ctr ]
unique_names = pd.Series(lang).unique()



dic={}
for a in unique_names:
    mask = t.original_language.apply(lambda x: a in x)
    dic[a] = t[mask]['revenue'].mean()

t = pd.DataFrame.from_dict(dic, orient='index', columns=['mean_revenue']).reset_index().rename(columns={'index':'langue'})

t = t.nlargest(12, 'mean_revenue')

t['color'] = Category20c[12]

hover1 = HoverTool(tooltips = [
            ('Language','@langue'),
            ('Revenue','@mean_revenue')
           ])

p = figure(x_range=t.langue, plot_width=1400,plot_height=400, toolbar_location=None, title="Revenue per original language", tools=[hover1])
p.vbar(x='langue', top='mean_revenue', width=0.9, source=t, legend='langue',
       line_color='white', fill_color='color')

p.xgrid.grid_line_color = None
p.legend.orientation = "horizontal"
p.legend.location = "top_center"

show(p)

We can see that English and Chinese movies usually have a higher mean revenue.

Feature engineering: one-hot encoding for the original languages that have at least 5 samples, taken here as the 17 most common ones.

In [None]:
# feature engineering: one-hot encoding for the 17 most common original languages
lang = train['original_language']
lang_more_17_samples = [x[0] for x in Counter(lang.dropna()).most_common(17)]

# The list is built from train only and applied to both frames so that they share the same columns.
for col in lang_more_17_samples :
    train[col] = train['original_language'].apply(lambda x: 1 if x == col else 0)
    test[col] = test['original_language'].apply(lambda x: 1 if x == col else 0)
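
The idea behind this cell can be sketched on toy data: pick the most frequent values on the training set only, then create the same indicator columns on both frames. The frames, the `lang_` prefix and the choice of the top 2 values below are assumptions for illustration; the notebook itself uses the bare language codes as column names.

```python
import pandas as pd

# Hypothetical language codes.
train_demo = pd.DataFrame({'original_language': ['en', 'en', 'fr', 'en', 'fr', 'ja']})
test_demo = pd.DataFrame({'original_language': ['fr', 'de', 'en']})

# value_counts sorts by frequency, so head(2) keeps the two most common codes.
top_languages = train_demo['original_language'].value_counts().head(2).index.tolist()

for frame in (train_demo, test_demo):
    for code in top_languages:
        frame['lang_' + code] = (frame['original_language'] == code).astype(int)
```

A prefix such as `lang_` is one way to avoid a clash between a language code and an existing column name.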


<h2>Google search results</h2>

First, we need to load the data that we got thanks to the script shown earlier.

In [None]:
google_train_path = '../input/google-result/google_result_train.csv'
google_test_path = '../input/google-result/google_result_test.csv'

google_train = pd.read_csv(google_train_path)
google_test = pd.read_csv(google_test_path)
train['google_result'] = google_train['result']
test['google_result'] = google_test['result']

Then we will plot the revenue versus the number of search results.

In [None]:
t = train[['revenue','title','google_result']].copy()
t['revenue']=np.log1p(t.revenue)
t['google_result']=np.log1p(t.google_result)


hover = HoverTool(tooltips = [
            ('Title','@title'),
            ('Revenue','@revenue'),
            ('Google result number','@google_result')
            
           ])


fig = figure(x_axis_label='Google result number',
             y_axis_label='Revenue',
             title='log Revenue vs log google_result ',
            tools=[hover])



fig.square('google_result', 'revenue',source=t)

show(fig)

We expected this data to be a better indicator of the revenue, but we can still use it.

In [None]:
# feature engineering: popularity measured with Google search results
train['google_result']=np.log1p(train.google_result)
test['google_result']=np.log1p(test.google_result)

<h2>Collection</h2>

Belongs to collection: we think that belonging to a collection might affect the revenue of a film about to hit cinema screens. Plotting the mean revenue of films that do and do not belong to a collection should help answer this question.

In [None]:
t = train[['revenue','belongs_to_collection','title']].copy()


t['film_belongs_to_collection'] = t['belongs_to_collection'].apply(lambda x: str(False) if x == {} else str(True))


t = t.groupby('film_belongs_to_collection')['revenue'].mean().reset_index()


hover1 = HoverTool(tooltips = [
            ('Mean revenue','@revenue'),
           ])


t['color'] = [Spectral6[0],Spectral6[1]]


p = figure(x_range=['False','True'], plot_width=600,plot_height=400, toolbar_location=None, title="Mean revenue for a film belonging to a collection", tools=[hover1])
p.vbar(x='film_belongs_to_collection', top='revenue', width=0.9, source=t, legend='film_belongs_to_collection',
       line_color='white', fill_color='color')

p.xgrid.grid_line_color = None
p.legend.orientation = "horizontal"
p.legend.location = 'top_left'

show(p)

As we thought, this feature could really help our model, as the mean revenue is much higher when a film belongs to a collection. We implement it as a binary (0/1) feature.

In [None]:
# feature engineering: film that belongs to a collection
train['film_belongs_to_collection'] = train['belongs_to_collection'].apply(lambda x: 0 if x == {} else 1)
test['film_belongs_to_collection'] = test['belongs_to_collection'].apply(lambda x: 0 if x == {} else 1)

<h2>Genres</h2>

Genre preferences are really subjective, but we can still try to see whether some genres are more popular than others.
Let's plot the mean revenue for each genre.

In [None]:
t = train[['id','revenue', 'title', 'genres']].copy()
t['genres'] = [[y['name'] for y in x] for x in t['genres']]

genres = t['genres'].sum()
ctr = Counter(genres)
df_genres = pd.DataFrame.from_dict(ctr, orient='index').reset_index().rename(columns={'index':'genre', 0:'count'})       
df_genres=df_genres.sort_values('count', ascending=False)
df_genres = df_genres[df_genres['count'] > 1]
df_genres = df_genres.nlargest(20, 'count')


genres = list(df_genres['genre'])

dic={}
for a in genres:
    mask = t.genres.apply(lambda x: a in x)
    dic[a] = t[mask]['revenue'].mean()

t = pd.DataFrame.from_dict(dic, orient='index', columns=['mean_revenue']).reset_index().rename(columns={'index':'genre'})

t['color'] = Category20c[len(t)]

hover1 = HoverTool(tooltips = [
            ('Genre','@genre'),
            ('Genre mean revenue','@mean_revenue')
           ])

p = figure(x_range=t.genre, plot_width=1400,plot_height=400, toolbar_location=None, title="Mean revenue per genre", tools=[hover1])
p.vbar(x='genre', top='mean_revenue', width=0.9, source=t, legend='genre',
       line_color='white', fill_color='color')

p.xgrid.grid_line_color = None
p.legend.orientation = "horizontal"
p.legend.location = "top_center"

show(p)


Six genres stand out above the others.
To make sure they have enough data, we can plot the number of movies per genre.

In [None]:
t = train[['id','revenue', 'genres']]
x = [[y['name'] for y in x] for x in t['genres']]
x = Counter(pd.DataFrame(x).stack())
x = pd.Series(x)


data = x.reset_index(name='value').rename(columns={'index':'genre'})
data['angle'] = data['value']/data['value'].sum() * 2*np.pi
data['color'] = Category20c[len(x)]

p = figure(plot_height=350, title="Number of movies per genres", toolbar_location=None,
           tools="hover", tooltips="@genre: @value", x_range=(-0.5, 1.0))

p.wedge(x=0, y=1, radius=0.4,
        start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'),
        line_color="white", fill_color='color', legend='genre', source=data)

p.axis.axis_label=None
p.axis.visible=False
p.grid.grid_line_color = None

show(p)

Drama is the most common genre, but we see from the previous plot that this does not mean it generates a high revenue.

We can use a one-hot encoding on the genres. We will only consider the 6 genres with the highest mean revenue.

In [None]:
train['genres_names'] = [[y['name'] for y in x] for x in train['genres']]

# genres = train['genres_names'].sum()
# ctr = Counter(genres)
# genres=[n for n in ctr if ctr[n] > 249]
# genres_list = pd.Series(genres).unique()

genres_list=['Action', 'Adventure', 'Science Fiction', 'Family', 'Fantasy','Animation']
        
for a in genres_list :
    train['genre_'+a]=train['genres_names'].apply(lambda x: 1 if a in x else 0)
train = train.drop(['genres_names'], axis=1)

test['genres_names'] = [[y['name'] for y in x] for x in test['genres']]
for a in genres_list :
    test['genre_'+a]=test['genres_names'].apply(lambda x: 1 if a in x else 0)
test = test.drop(['genres_names'], axis=1)
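
Since a film can have several genres, this encoding is multi-label: one film may get a 1 in several `genre_` columns. The sketch below shows the mechanism on made-up rows. The ids, the selected genres and the frame itself are assumptions for illustration.

```python
import pandas as pd

# Hypothetical parsed 'genres' cells: a list of dictionaries, or {} for a missing cell.
films = pd.DataFrame({
    'genres': [
        [{'id': 1, 'name': 'Action'}, {'id': 2, 'name': 'Adventure'}],
        [{'id': 3, 'name': 'Drama'}],
        {},
    ]
})
selected = ['Action', 'Adventure']

# Collect the genre names of each film; iterating over {} yields nothing.
names = films['genres'].apply(lambda entries: {e['name'] for e in entries})

for genre in selected:
    films['genre_' + genre] = names.apply(lambda s: int(genre in s))
```

Films whose genres are all outside the selected list, or that have no genre at all, end up with 0 in every column.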

<h2>Release date</h2>

Does the release date have an influence on the revenue of a movie? 
First, we need to convert the data in the *"release_date"* column to a DateTime format. We also need to fix the year, because some values use a 4-digit format, e.g. "2013", and some use a 2-digit format, e.g. "98".

In [None]:
# feature engineering: release date
def date_features(df):
    df[['release_month','release_day','release_year']]=df['release_date'].str.split('/',expand=True).replace(np.nan, 0).astype(int)
    # Expand two-digit years: 0-18 become 2000-2018, 19-99 become 1919-1999.
    df.loc[df['release_year'] <= 18, "release_year"] += 2000
    df.loc[(df['release_year'] > 18) & (df['release_year'] < 100), "release_year"] += 1900
    df['release_date'] = pd.to_datetime(df['release_date'])
    df['release_month'] = df['release_date'].dt.month
    # df['release_day'] = df['release_date'].dt.day
    df['release_quarter'] = df['release_date'].dt.quarter
    df.drop(columns=['release_date'], inplace=True)
    
    return df

train=date_features(train)
test=date_features(test)
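
The two-digit year rule used in `date_features` can be isolated in a small helper, which makes the cut-off explicit and easy to test. The function name `expand_two_digit_year` and its `pivot` parameter are assumptions for this sketch; the value 18 mirrors the threshold used above.

```python
import numpy as np
import pandas as pd

def expand_two_digit_year(years, pivot=18):
    """Map 0..pivot to 20xx and pivot+1..99 to 19xx; leave 4-digit years unchanged."""
    years = pd.Series(years).astype(int)
    century = np.where(years <= pivot, 2000, 1900)
    # Keep values that already have a century, otherwise add the inferred one.
    return years.where(years >= 100, years + century)

# Hypothetical year values in both formats.
fixed = expand_two_digit_year([15, 98, 2013, 18, 19])
```

Doing this explicitly matters because generic date parsers apply their own cut-off for two-digit years, which may not match the range of release years in the data.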

The mean revenue tends to increase from year to year, which is why it is important to include the release year in the model. We first use the year itself as a feature, since it is a numerical value that increases over time.

In [None]:
# mean revenue per year 

t = train[['id','revenue','release_year']].copy()

t = t.groupby('release_year')['revenue'].aggregate('mean')
t=np.log1p(t)

hover = HoverTool(tooltips = [
            ('Year','@x'),
            ('Revenue','@top')
           ])


fig = figure(plot_height=400,
             plot_width=600,
             x_axis_label='Year',
             y_axis_label='Mean revenue',
             title='Log mean revenue for each year',
             tools = [hover])


fig.vbar(x=t.index,
           top=t.values, 
           width=0.9,
           color='royalblue')

show(fig)

We will plot the mean revenue for each month to determine which months are the most profitable.

In [None]:
# mean revenue per month 

t = train[['id','revenue', 'release_month']]
months_mean_revenues = t.groupby('release_month')['revenue'].aggregate('mean')


hover1 = HoverTool(tooltips = [
            ('Month','@x'),
            ('Revenue','@top')
           ])


fig = figure(plot_height=400,
             plot_width=600,
             x_axis_label='Month',
             y_axis_label='Mean revenue',
             title='Mean revenue for each month',
             tools = [hover1])


fig.vbar(x=months_mean_revenues.index,
           top=months_mean_revenues.values, 
           width=0.9,
           color='royalblue')



show(fig)

We can see that June is the month in which movies have the highest mean revenue, while movies released in January and from August to October usually have a lower revenue.

We can also plot the mean revenue by quarter.

In [None]:
# mean revenue per quarter 

t = train[['id','revenue', 'release_quarter']]
quarters_mean_revenues = t.groupby('release_quarter')['revenue'].aggregate('mean')

hover1 = HoverTool(tooltips = [
            ('Quarter','@x'),
            ('Revenue','@top')
           ])


fig = figure(plot_height=400,
             plot_width=600,
             x_axis_label='Quarter',
             y_axis_label='Mean revenue',
             title='Mean revenue for each quarter',
             tools = [hover1])


fig.vbar(x=quarters_mean_revenues.index,
           top=quarters_mean_revenues.values, 
           width=0.9,
           color='royalblue')



show(fig)

The second quarter is the most profitable.

We see that the revenue does depend on the release month and quarter. We can add both of these pieces of information as features with a one-hot encoding:

In [None]:
# feature engineering: release month one-hot encoding
# range() excludes its end value, so range(1, 13) covers all twelve months.
for col in range(1, 13):
    train['month'+str(col)] = train['release_month'].apply(lambda x: 1 if x == col else 0)
    test['month'+str(col)] = test['release_month'].apply(lambda x: 1 if x == col else 0)

# feature engineering: release quarter one-hot encoding (range(1, 5) covers the four quarters)
for col in range(1, 5):
    train['quarter'+str(col)] = train['release_quarter'].apply(lambda x: 1 if x == col else 0)
    test['quarter'+str(col)] = test['release_quarter'].apply(lambda x: 1 if x == col else 0)
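
A compact way to build such indicator columns is `pd.get_dummies`, followed by a `reindex` so that every expected column exists even when a month does not occur in the data. The helper name `month_dummies` and the demo values are assumptions for this sketch, and it assumes integer month values.

```python
import pandas as pd

def month_dummies(months):
    """One-hot encode integer months into columns month1..month12."""
    dummies = pd.get_dummies(months, prefix='month', prefix_sep='', dtype=int)
    expected = ['month' + str(m) for m in range(1, 13)]
    # Add missing months as all-zero columns and fix the column order.
    return dummies.reindex(columns=expected, fill_value=0)

# Hypothetical release months.
demo = month_dummies(pd.Series([1, 6, 12, 6]))
```

Fixing the column list this way keeps the train and test frames aligned, whatever months each of them happens to contain.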



We could also add a column with the mean revenue for the month and quarter of release (this cell is currently commented out):

In [None]:
# # feature engineering: mean revenue of the release month and quarter (train)
# train['months_mean_revenue'] = train['release_month'].apply(lambda x: months_mean_revenues[x])
# train['quarter_mean_revenue'] = train['release_quarter'].apply(lambda x: quarters_mean_revenues[x])

# test['release_quarter'].fillna(0, inplace=True)
# test['release_month'].fillna(0,inplace=True)

# # feature engineering: mean revenue of the release month and quarter (test)
# test['months_mean_revenue'] = test['release_month'].apply(lambda x: months_mean_revenues[x] if x > 0 else 0)
# test['quarter_mean_revenue'] = test['release_quarter'].apply(lambda x: quarters_mean_revenues[x] if x > 0 else 0)
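
A minimal sketch of this kind of mean encoding on toy data, assuming the means are computed on the training set only and that unseen months fall back to 0 as in the commented-out cell. The frames and values here are hypothetical.

```python
import pandas as pd

toy_train = pd.DataFrame({'release_month': [1, 1, 2],
                          'revenue': [10.0, 30.0, 50.0]})
toy_test = pd.DataFrame({'release_month': [1, 2, 3]})

# Mean revenue per month, computed on the training data only.
month_means = toy_train.groupby('release_month')['revenue'].mean()

# map() looks each month up in the Series; months that are not in the
# training data become NaN, which we replace with 0.
toy_train['months_mean_revenue'] = toy_train['release_month'].map(month_means)
toy_test['months_mean_revenue'] = toy_test['release_month'].map(month_means).fillna(0)
print(toy_test)
```

Using `map` avoids the explicit `lambda` and the `if x > 0` guard, because missing keys are handled by `fillna`.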

Are some genres more popular in certain months?
Animation movies, for instance, may be more popular in December.

We have only 141 animation movies, which is not enough to get a reliable mean revenue for each month, so we group them by quarter instead.

In [None]:
# mean revenue per quarter for animation movies 

t = train[['id','revenue', 'title', 'genres', 'release_quarter']].copy()
t['genres'] = [[y['name'] for y in x] for x in t['genres']]
mask = t.genres.apply(lambda x: 'Animation' in x)
t = t[mask]
t = t.groupby('release_quarter')['revenue'].aggregate('mean')


hover1 = HoverTool(tooltips = [
            ('Quarter','@x'),
            ('Revenue','@top')
           ])


fig = figure(plot_height=400,
             plot_width=600,
             x_axis_label='Quarter',
             y_axis_label='Mean revenue',
             title='Mean revenue for each quarter for animation movies',
             tools = [hover1])


fig.vbar(x=t.index,
           top=t.values, 
           width=0.9,
           color='royalblue')



show(fig)


The mean revenue per quarter is similar to the one for all movies, so we won't be using this.

Let's check for a genre with more data, like Drama.

In [None]:
# mean revenue per month for drama movies 

t = train[['id','revenue', 'title', 'genres', 'release_month']].copy()
t['genres'] = t['genres'].apply(lambda x: [y['name'] for y in x])
mask = t.genres.apply(lambda x: 'Drama' in x)
t = t[mask]
t = t.groupby('release_month')['revenue'].aggregate('mean')


hover1 = HoverTool(tooltips = [
            ('Month','@x'),
            ('Revenue','@top')
           ])


fig = figure(plot_height=400,
             plot_width=600,
             x_axis_label='Month',
             y_axis_label='Mean revenue',
             title='Mean revenue for each month for drama movies',
             tools = [hover1])


fig.vbar(x=t.index,
           top=t.values, 
           width=0.9,
           color='royalblue')



show(fig)


Drama movies are a bit more popular in December, but the other months look similar to the pattern for all movies.
If the difference had been bigger, we could have built a one-hot encoding of the release month per genre, but that is not the case.

Now we can drop the *"release_month"* and *"release_quarter"* columns because we don't need them anymore.

In [None]:
train = train.drop(['release_month', 'release_quarter'], axis=1)
test = test.drop(['release_month', 'release_quarter'], axis=1)

<h2>Actors</h2>

In this section, we only consider the first 3 actors of each movie because we assume they play the main characters, and are therefore the ones who influence the movie's popularity the most.

Actors are one of the first things people look at when deciding whether to go see a movie. We can see which actors are the best known by plotting the number of movies they made.

In [None]:
t = train[['id','revenue', 'title', 'cast']].copy()
t['cast'] = [[y['name'] for y in x] for x in t['cast']]
t['cast'] = t['cast'].apply(lambda x: x[:3])

names = t['cast'].sum()
ctr = Counter(names)
df_names = pd.DataFrame.from_dict(ctr, orient='index').reset_index().rename(columns={'index':'actor', 0:'count'})       
df_names=df_names.sort_values('count', ascending=False)
df_names = df_names[df_names['count'] > 8]
 
p = figure(plot_width=1300, plot_height=500, title="Most common actors",
           x_range=df_names['actor'], toolbar_location=None, tooltips=[("Actor", "@actor"), ("Count", "@count")])

p.vbar(x='actor', top='count', width=1, source=df_names,
       line_color="white" )

p.y_range.start = 0
p.x_range.range_padding = 0.05
p.xgrid.grid_line_color = None
p.xaxis.axis_label = "Actors name"
p.xaxis.major_label_orientation = 1.2
p.outline_line_color = None

show(p)

Let's look at the mean revenue for the movies the 20 most common actors played in.

In [None]:
t = train[['id','revenue', 'title', 'cast']].copy()
t['cast'] = [[y['name'] for y in x] for x in t['cast']]
t['cast'] = t['cast'].apply(lambda x: x[:3])

df_names_revenue = df_names.nlargest(20, 'count')
names = list(df_names_revenue['actor'])

dic={}
for a in names:
    mask = t.cast.apply(lambda x: a in x)
    dic[a] = t[mask]['revenue'].mean()

t = pd.DataFrame.from_dict(dic, orient='index', columns=['mean_revenue']).reset_index().rename(columns={'index':'actor'})

t = t.nlargest(20, 'mean_revenue')

t['color'] = Category20c[20]

hover1 = HoverTool(tooltips = [
            ('Actor','@actor'),
            ('Movies mean revenue','@mean_revenue')
           ])

p = figure(x_range=t.actor, plot_width=1400,plot_height=400, toolbar_location=None, title="20 most common actors movies mean revenue", tools=[hover1])
p.vbar(x='actor', top='mean_revenue', width=0.9, source=t, legend='actor',
       line_color='white', fill_color='color')

p.xgrid.grid_line_color = None
p.legend.orientation = "horizontal"
p.legend.location = "top_center"

show(p)


This plot shows that an actor who makes a lot of movies does not necessarily have a high mean revenue. For example, Robert De Niro is the most common actor with 25 movies, but he is only at the 18th position in this plot.

To make sure we have enough examples, we only keep the actors who played in at least 12 movies. For each movie, we then build two features: the mean revenue of its known actors, and the number of its actors who are among the 40 with the highest mean revenue.

In [None]:
df_names = df_names[df_names['count'] > 11]
names_list = list(df_names['actor'])

train['cast_names']=[[y['name'] for y in x] for x in train['cast']]
train['cast_names'] = train['cast_names'].apply(lambda x: x[:3])

dic={}
for a in names_list:
    mask = train['cast_names'].apply(lambda x: a in x)
    dic[a] = train[mask]['revenue'].mean()

actors_mean_revenue = pd.DataFrame.from_dict(dic, orient='index', columns=['mean_revenue']).reset_index().rename(columns={'index':'actor'})
names_list = list(actors_mean_revenue.nlargest(40, 'mean_revenue')['actor'])

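# mean revenue of the movie's known actors (NaN when none is known, replaced by 0)
train['actors_mean_revenue'] = train['cast_names'].apply(lambda x: actors_mean_revenue[actors_mean_revenue['actor'].isin(x)]['mean_revenue'].mean())
train['actors_mean_revenue'] = train['actors_mean_revenue'].fillna(0)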


train['total_top_actors_revenue']=train['cast_names'].apply(lambda x: sum([1 for i in x if i in names_list]))
# for a in names_list :
#     train['actor_'+a]=train['cast_names'].apply(lambda x: 1 if a in x else 0)
train = train.drop(['cast_names'], axis=1)

test['cast_names']=[[y['name'] for y in x] for x in test['cast']]
test['cast_names'] = test['cast_names'].apply(lambda x: x[:3])

# same feature for the test set, using the means computed on the training set
test['actors_mean_revenue'] = test['cast_names'].apply(lambda x: actors_mean_revenue[actors_mean_revenue['actor'].isin(x)]['mean_revenue'].mean())
test['actors_mean_revenue'] = test['actors_mean_revenue'].fillna(0)

test['total_top_actors_revenue']=test['cast_names'].apply(lambda x: sum([1 for i in x if i in names_list]))

# for a in names_list :
#     test['actor_'+a]=test['cast_names'].apply(lambda x: 1 if a in x else 0)
test = test.drop(['cast_names'], axis=1)
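
To make the two actor features easier to follow, here is a small sketch on toy data. It uses `DataFrame.explode` instead of one mask per actor, and a hypothetical cut-off of 2 top actors instead of 40. The actor names and revenues are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'cast_names': [['A', 'B'], ['A', 'C'], ['D']],
    'revenue': [100.0, 300.0, 50.0],
})

# One row per (movie, actor) pair, then the mean revenue of each actor's movies.
pairs = toy.explode('cast_names')
actor_means = pairs.groupby('cast_names')['revenue'].mean()

# Hypothetical cut-off: the 2 actors with the highest mean revenue.
top_actors = set(actor_means.nlargest(2).index)

def mean_of_known(names):
    # Average the per-actor means of the actors we have statistics for.
    known = [actor_means[n] for n in names if n in actor_means.index]
    return sum(known) / len(known) if known else 0.0

toy['actors_mean_revenue'] = toy['cast_names'].apply(mean_of_known)
toy['total_top_actors_revenue'] = toy['cast_names'].apply(
    lambda names: sum(n in top_actors for n in names))
print(toy)
```

Note that both features are derived from the training revenue, so on the training set they partly encode the target itself; a validation score computed on the same rows may look better than it really is.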


We can also plot the actors with the highest mean revenue per movie.

In [None]:
t = train[['id','revenue', 'title', 'cast']].copy()
t['cast'] = [[y['name'] for y in x] for x in t['cast']]
t['cast'] = t['cast'].apply(lambda x: x[:3])

names = t['cast'].sum()
ctr = Counter(names)
names=[n for n in ctr if ctr[n] > 0]
unique_names = pd.Series(names).unique()

dic={}
for a in unique_names:
    mask = t.cast.apply(lambda x: a in x)
    dic[a] = t[mask]['revenue'].mean()

actors_mean_revenue = pd.DataFrame.from_dict(dic, orient='index', columns=['mean_revenue']).reset_index().rename(columns={'index':'actor'})

t = actors_mean_revenue.nlargest(20, 'mean_revenue')

t['color'] = Category20c[20]

hover1 = HoverTool(tooltips = [
            ('Actor','@actor'),
            ('Revenue','@mean_revenue')
           ])

p = figure(x_range=t.actor, plot_width=1400,plot_height=400, toolbar_location=None, title="20 actors with highest mean revenue", tools=[hover1])
p.vbar(x='actor', top='mean_revenue', width=0.9, source=t, legend='actor',
       line_color='white', fill_color='color')

p.xgrid.grid_line_color = None
p.legend.orientation = "horizontal"
p.legend.location = "top_center"

show(p)


Most of these actors are not well known: they only played in one movie that did very well. This information is not usable because these are exceptions, and nothing says that the next movie they play in will do as well.

The same thing can be said if we plot the 20 actors with the highest mean popularity of their movies.

In [None]:
t = train[['id','popularity', 'title', 'cast']].copy()
t['popularity'] = np.expm1(t['popularity'])
t['cast'] = [[y['name'] for y in x] for x in t['cast']]
t['cast'] = t['cast'].apply(lambda x: x[:3])

names = t['cast'].sum()
ctr = Counter(names)
names=[n for n in ctr if ctr[n] > 0]
unique_names = pd.Series(names).unique()

dic={}
for a in unique_names:
    mask = t.cast.apply(lambda x: a in x)
    dic[a] = t[mask]['popularity'].mean()

t = pd.DataFrame.from_dict(dic, orient='index', columns=['mean_revenue']).reset_index().rename(columns={'index':'actor'})

t = t.nlargest(20, 'mean_revenue')

t['color'] = Category20c[20]

hover1 = HoverTool(tooltips = [
            ('Actor','@actor'),
            ('Mean popularity','@mean_revenue')
           ])

p = figure(x_range=t.actor, plot_width=1400,plot_height=400, toolbar_location=None, title="20 actors with highest mean popularity", tools=[hover1])
p.vbar(x='actor', top='mean_revenue', width=0.9, source=t, legend='actor',
       line_color='white', fill_color='color')

p.xgrid.grid_line_color = None
p.legend.orientation = "horizontal"
p.legend.location = "top_center"

show(p)
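
Since a single hit is enough to top both rankings above, one possible safeguard is to require a minimum number of movies before ranking actors by their mean. The sketch below shows the idea on toy data; the threshold `min_movies` and the values are hypothetical.

```python
import pandas as pd

# One row per (actor, movie) pair with made-up revenues.
pairs = pd.DataFrame({
    'actor': ['A', 'B', 'B', 'B', 'C', 'C'],
    'revenue': [900.0, 100.0, 200.0, 300.0, 400.0, 600.0],
})

# Mean revenue and number of movies for each actor.
stats = pairs.groupby('actor')['revenue'].agg(['mean', 'count'])

min_movies = 2  # hypothetical threshold
reliable = stats[stats['count'] >= min_movies].sort_values('mean', ascending=False)
print(reliable)
```

Actor `A` has the highest mean but only one movie, so it is excluded from the filtered ranking. This is the same idea as the count filter we apply before building the actor features.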

<h4>Actors awards nominations</h4>

We collected the actors' awards data from IMDB and created a CSV file (nominations.csv) containing the number of Oscar and Golden Globe nominations for each actor.
We can plot the total number of nominations of the main actors of each movie and see if it affects the movie's revenue.

In [None]:
t = train[['id','revenue', 'title', 'cast']].copy()

t['cast_ids']=[[y['id'] for y in x] for x in t['cast']]
t['cast_ids'] = t['cast_ids'].apply(lambda x: x[:3])
t['nominations'] = t['cast_ids'].apply(lambda x: nominations[nominations['id'].isin(x)]['nominations'].sum())
t=t.drop(['cast'], axis=1)

hover1 = HoverTool(tooltips = [
            ('Title','@title'),
            ('Revenue','@revenue'),
            ('Nominations','@nominations')
           ])


fig = figure(x_axis_label='Nominations for all main actors',
             y_axis_label='Revenue',
             title='Nominations vs. Revenue',
            tools=[hover1])


fig.square(x='nominations',
           y='revenue',
          source=t)

show(fig)

As the plot shows, the revenue does not depend on the number of nominations of a movie's actors, probably because some of these movies are old while the nominations data is recent: when those movies were released, the actors did not have as many nominations as they have now.

We can check whether, on average, a movie featuring at least one nominated actor has a higher revenue.

In [None]:
t = train[['id','revenue', 'title', 'cast']].copy()

t['cast_ids']=[[y['id'] for y in x] for x in t['cast']]
t['cast_ids'] = t['cast_ids'].apply(lambda x: x[:3])
t['nominations'] = t['cast_ids'].apply(lambda x: 'True' if (nominations[nominations['id'].isin(x)]['nominations'] != 0).any() else 'False')
df_has_nominated_actor=t.drop(['cast', 'id'], axis=1)
t = df_has_nominated_actor.groupby('nominations')['revenue'].mean().reset_index()
hover1 = HoverTool(tooltips = [
            ('Mean revenue','@revenue'),
           ])

t['color'] = [Spectral6[1],Spectral6[2]]

p = figure(x_range=['False', 'True'], plot_width=400,plot_height=400, toolbar_location=None, title="Has a nominated actor", tools=[hover1])
p.vbar(x='nominations', top='revenue', width=0.9, source=t, legend='nominations',
       line_color='white', fill_color='color')

p.xgrid.grid_line_color = None
p.legend.orientation = "horizontal"
p.legend.location = "top_center"

show(p)

On average, a movie containing at least one nominated actor has almost twice the revenue of a movie without any nominated actor.

We also plot the number of nominated actors per movie to see if it affects the mean revenue.

In [None]:
t = train[['id','revenue', 'title', 'cast']].copy()

t['cast_ids']=[[y['id'] for y in x] for x in t['cast']]
t['cast_ids'] = t['cast_ids'].apply(lambda x: x[:6])
t['nominations'] = t['cast_ids'].apply(lambda x: str((nominations[nominations['id'].isin(x)]['nominations'] != 0).sum()))
t=t.drop(['cast'], axis=1)
t = t.groupby('nominations')['revenue'].mean().reset_index()

hover1 = HoverTool(tooltips = [
            ('Mean revenue','@revenue'),
           ])


t['color'] = list(Spectral6) + [Spectral6[1]]  # 7 colors, one per bar (0 to 6 nominated actors)

p = figure(x_range=['0','1','2','3','4','5', '6'], plot_width=500,plot_height=400, toolbar_location=None, title="Revenue vs. number of nominated actors", tools=[hover1])
p.vbar(x='nominations', top='revenue', width=0.9, source=t, legend='nominations',
       line_color='white', fill_color='color')

p.xgrid.grid_line_color = None
p.legend.location='top_left'

show(p)


Movies with more nominated actors generate a higher revenue on average, but the mean drops once we look beyond the first 4 actors, probably because the actors listed after the fourth don't have an important role in the movie.

In [None]:
df_has_nominated_actor['nominations'] = df_has_nominated_actor['nominations'].apply(lambda x: 1 if x == 'True' else 0)
train['has_nominated_actor'] = df_has_nominated_actor['nominations']


test['cast_ids']=[[y['id'] for y in x] for x in test['cast']]
test['cast_ids'] = test['cast_ids'].apply(lambda x: x[:3])
test['has_nominated_actor'] = test['cast_ids'].apply(lambda x: 1 if (nominations[nominations['id'].isin(x)]['nominations'] != 0).any() else 0)
test = test.drop(['cast_ids'], axis=1)


train['cast_ids']=[[y['id'] for y in x] for x in train['cast']]
train['cast_ids'] = train['cast_ids'].apply(lambda x: x[:4])
train['nominated_actors'] = train['cast_ids'].apply(lambda x: (nominations[nominations['id'].isin(x)]['nominations'] != 0).sum())

test['cast_ids']=[[y['id'] for y in x] for x in test['cast']]
test['cast_ids'] = test['cast_ids'].apply(lambda x: x[:4])
test['nominated_actors'] = test['cast_ids'].apply(lambda x: (nominations[nominations['id'].isin(x)]['nominations'] != 0).sum())
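
Because the same two features have to be computed identically for train and test, one option is to route both through a single helper. The sketch below assumes a nominations table with `id` and `nominations` columns like ours; the helper name `count_nominated` and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for nominations.csv: one row per actor id.
toy_nominations = pd.DataFrame({'id': [1, 2, 3], 'nominations': [4, 0, 1]})
nominated_ids = set(
    toy_nominations.loc[toy_nominations['nominations'] != 0, 'id'].tolist())

def count_nominated(cast_ids, top_n):
    """Count the nominated actors among the first top_n cast ids."""
    return sum(1 for actor_id in cast_ids[:top_n] if actor_id in nominated_ids)

toy = pd.DataFrame({'cast_ids': [[1, 2, 3, 9], [2, 9], []]})

# Number of nominated actors among the first 4, and a 0/1 flag for the first 3.
toy['nominated_actors'] = toy['cast_ids'].apply(lambda ids: count_nominated(ids, 4))
toy['has_nominated_actor'] = toy['cast_ids'].apply(
    lambda ids: int(count_nominated(ids, 3) > 0))
print(toy)
```

Building the set of nominated ids once also avoids filtering the whole nominations table for every movie.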


<h2>Directors</h2>

In [None]:
t = train[['id','revenue', 'title', 'crew']].copy()
t['crew'] = [[y['name'] for y in x if y['department']=='Directing'] for x in t['crew'] ]
t['crew'] = t['crew'].apply(lambda x: x[:3])

names = t['crew'].sum()
ctr = Counter(names)
df_names = pd.DataFrame.from_dict(ctr, orient='index').reset_index().rename(columns={'index':'actor', 0:'count'})       
df_names=df_names.sort_values('count', ascending=False)
df_names = df_names[df_names['count'] > 4]
 
p = figure(plot_width=1300, plot_height=500, title="Most common directors",
           x_range=df_names['actor'], toolbar_location=None, tooltips=[("Director", "@actor"), ("Count", "@count")])

p.vbar(x='actor', top='count', width=1, source=df_names,
       line_color="white" )

p.y_range.start = 0
p.x_range.range_padding = 0.05
p.xgrid.grid_line_color = None
p.xaxis.axis_label = "Director names"
p.xaxis.major_label_orientation = 1.2
p.outline_line_color = None

show(p)

To make sure we have enough examples, we only keep the directors with more than 10 movies, and we one-hot encode those with the highest mean revenue (at most 40).

In [None]:
df_names = df_names[df_names['count'] > 10]
names_list = list(df_names['actor'])

train['crew_names'] = [[y['name'] for y in x if y['department']=='Directing'] for x in train['crew'] ]
train['crew_names'] = train['crew_names'].apply(lambda x: x[:3])

dic={}
for a in names_list:
    mask = train['crew_names'].apply(lambda x: a in x)
    dic[a] = train[mask]['revenue'].mean()

directors_mean_revenue = pd.DataFrame.from_dict(dic, orient='index', columns=['mean_revenue']).reset_index().rename(columns={'index':'director'})

names_list = list(directors_mean_revenue.nlargest(40, 'mean_revenue')['director'])

# train['total_top_actors_revenue']=train['cast_names'].apply(lambda x: sum([1 for i in x if i in names_list]))

for a in names_list :
    train['director_'+a]=train['crew_names'].apply(lambda x: 1 if a in x else 0)
train = train.drop(['crew_names'], axis=1)

test['crew_names'] = [[y['name'] for y in x if y['department']=='Directing'] for x in test['crew'] ]
test['crew_names'] = test['crew_names'].apply(lambda x: x[:3])
for a in names_list :
    test['director_'+a]=test['crew_names'].apply(lambda x: 1 if a in x else 0)
test = test.drop(['crew_names'], axis=1)


<h2>Production companies</h2>

Let's see which companies make the most movies.

In [None]:
t = train[['id','revenue', 'title', 'production_companies']].copy()
t['production_companies'] = [[y['name'] for y in x] for x in t['production_companies'] ]
t['production_companies'] = t['production_companies'].apply(lambda x: x[:3])

names = t['production_companies'].sum()
ctr = Counter(names)
df_names = pd.DataFrame.from_dict(ctr, orient='index').reset_index().rename(columns={'index':'actor', 0:'count'})       
df_names=df_names.sort_values('count', ascending=False)
df_names = df_names[df_names['count'] > 9]
 
p = figure(plot_width=1300, plot_height=500, title="Number of movies per production company",
           x_range=df_names['actor'], toolbar_location=None, tooltips=[("Company", "@actor"), ("Count", "@count")])

p.vbar(x='actor', top='count', width=1, source=df_names,
       line_color="white" )

p.y_range.start = 0
p.x_range.range_padding = 0.05
p.xgrid.grid_line_color = None
p.xaxis.axis_label = "Production company"
p.xaxis.major_label_orientation = 1.2
p.outline_line_color = None

show(p)

As for the directors, we will one-hot encode the 20 production companies with the highest mean revenue, among those with at least 10 movies.

In [None]:
df_names = df_names[df_names['count'] > 9]
names_list = list(df_names['actor'])

train['production_companies'] = [[y['name'] for y in x] for x in train['production_companies'] ]
train['production_companies'] = train['production_companies'].apply(lambda x: x[:3])

dic={}
for a in names_list:
    mask = train['production_companies'].apply(lambda x: a in x)
    dic[a] = train[mask]['revenue'].mean()

companies_mean_revenue = pd.DataFrame.from_dict(dic, orient='index', columns=['mean_revenue']).reset_index().rename(columns={'index':'company'})

names_list = list(companies_mean_revenue.nlargest(20, 'mean_revenue')['company'])

# train['total_top_companies']=train['production_companies'].apply(lambda x: sum([1 for i in x if i in names_list]))
for a in names_list :
    train['production_'+a]=train['production_companies'].apply(lambda x: 1 if a in x else 0)
train = train.drop(['production_companies'], axis=1)

test['production_companies'] = [[y['name'] for y in x] for x in test['production_companies'] ]
test['production_companies'] = test['production_companies'].apply(lambda x: x[:3])
# test['total_top_companies']=test['production_companies'].apply(lambda x: sum([1 for i in x if i in names_list]))

for a in names_list :
    test['production_'+a]=test['production_companies'].apply(lambda x: 1 if a in x else 0)
test = test.drop(['production_companies'], axis=1)
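
The director and production company cells follow the same recipe: count the names, keep the frequent ones, rank them by mean revenue, and one-hot encode the top ones. A possible way to factor this out is sketched below. The function name `add_top_k_onehot`, its parameters and the toy data are hypothetical, and the list column is assumed to already contain lists of names.

```python
import pandas as pd

def add_top_k_onehot(train_df, test_df, list_col, prefix, min_count, k):
    """One-hot encode the k names with the highest mean revenue.

    Only names that appear in at least min_count training movies are ranked.
    """
    pairs = train_df[[list_col, 'revenue']].explode(list_col).dropna(subset=[list_col])
    stats = pairs.groupby(list_col)['revenue'].agg(['mean', 'count'])
    top = stats[stats['count'] >= min_count].nlargest(k, 'mean').index.tolist()
    for name in top:
        for df in (train_df, test_df):
            df[prefix + name] = df[list_col].apply(lambda names: int(name in names))
    return top

toy_train = pd.DataFrame({
    'production_companies': [['X', 'Y'], ['X'], ['Z'], ['Y', 'Z']],
    'revenue': [100.0, 300.0, 10.0, 50.0],
})
toy_test = pd.DataFrame({'production_companies': [['Y'], ['W']]})

top = add_top_k_onehot(toy_train, toy_test, 'production_companies',
                       'production_', min_count=2, k=2)
print(top)
```

The statistics come from the training frame only, and the same list of names is applied to both frames so that their columns match.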

<a id="2"></a>
<h1>Model</h1>

In [None]:
# Create target object and call it y
y = np.log1p(train.revenue)

<h2>Training</h2>

In [None]:
# Create X
X = train.drop(['id','runtime', 'release_day'], axis=1)

test_X = test.drop(['id','runtime', 'release_day'], axis=1).select_dtypes(exclude=['object'])

    
# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1,test_size=0.33)

train_X=train_X.drop(['revenue'], axis=1).select_dtypes(exclude=['object'])
X=X.drop(['revenue'], axis=1).select_dtypes(exclude=['object'])
val_X_revenue=val_X.pop('revenue')
val_X_title=val_X.pop('title')
val_X=val_X.select_dtypes(exclude=['object'])

xgb_model = XGBRegressor(learning_rate=0.05, 
                            n_estimators=10000,max_depth=4)
xgb_model.fit(train_X, train_y, early_stopping_rounds=100, 
             eval_set=[(val_X, val_y)], eval_metric = 'rmse')
xbg_val_predictions=xgb_model.predict(val_X)
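
Since the target is `log1p(revenue)`, the `rmse` reported during training is a root mean squared logarithmic error (RMSLE) with respect to the raw revenue. A minimal sketch of that metric with NumPy follows; the function name `rmsle` and the sample values are hypothetical.

```python
import numpy as np

def rmsle(actual_revenue, predicted_revenue):
    """Root mean squared error between log1p-transformed revenues."""
    log_diff = np.log1p(predicted_revenue) - np.log1p(actual_revenue)
    return float(np.sqrt(np.mean(log_diff ** 2)))

actual = np.array([1000.0, 50000.0, 2000000.0])

# A perfect prediction scores 0.
perfect = rmsle(actual, actual)

# Shifting every prediction by 1 on the log scale gives a score of about 1.
shifted = rmsle(actual, np.expm1(np.log1p(actual) + 1.0))
print(perfect, shifted)
```

On the validation set, the same value could be obtained by passing `np.expm1(val_y)` and `np.expm1(xbg_val_predictions)` to this function.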


<h3>Training analysis</h3>

In [None]:
df=val_X.reset_index().join(pd.DataFrame(np.expm1(xbg_val_predictions)).rename(columns={0:'prediction'}))
df=df.join(val_X_revenue.reset_index()['revenue'])
df=df.join(val_X_title.reset_index()['title'])
df_x=df[['revenue','prediction', 'title']]

hover1 = HoverTool(tooltips = [
            ('Title','@title'),
            ('Revenue','@revenue'),
            ('Prediction','@prediction')
           ])


fig = figure(x_axis_label='Revenue',
             y_axis_label='prediction',
             title='Revenue vs. Prediction',
            tools=[hover1])


fig.square(x='revenue',
           y='prediction',
          source=df_x)

show(fig)

fig, ax = plt.subplots(figsize=(15, 13))
plot_importance(xgb_model, ax=ax)
plt.show()

In [None]:
# keep three example movies to compare their actual revenue with the prediction
df_x = pd.concat([df_x[df_x['title']=='Top Gun'],
                  df_x[df_x['title']=='Tomorrowland'],
                  df_x[df_x['title']=='Rambo III']])

fig = figure(x_axis_label='Revenue',
             y_axis_label='prediction',
             title='Revenue vs. Prediction',
            tools=[hover1])

fig.square(x='revenue',
           y='prediction',
          source=df_x)

show(fig)

<h2>Predictions</h2>

In [None]:
xgb_model_full = XGBRegressor(n_estimators=145, learning_rate=0.05,max_depth=4)
xgb_model_full.fit(X, y)


test_preds=xgb_model_full.predict(test_X)

output = pd.DataFrame({'id': test.id,
                       'revenue': np.expm1(test_preds)})
output.to_csv('submission.csv', index=False)

<a id="3"></a>
<h1>Conclusion</h1>

We learned a lot from this project, and there is still a lot of room for improvement.
Many features we thought would work well did not meet our expectations, for example the external IMDB awards data or the Google search results.

There is a lot of data we did not use, for lack of time, that could improve our results. We mostly ignored the columns containing text, like the title or the overview, because they require a completely different approach and we decided to focus first on the other types of data. We did try to build word vectors with a neural network using Keras and TensorFlow, but we stopped because it was taking too much time and we weren't sure the result would be worth it.





