# TMDB Box Office Prediction

**If you use parts of this notebook in your scripts/notebooks, giving some kind of credit would be very much appreciated :) You can for instance link back to this notebook. Thanks!**

![boxoffice.jpg](http://sanjeevwritings.files.wordpress.com/2018/05/boxoffice.jpg)

### Introduction

In a world... where movies made an estimated $41.7 billion in 2018, the film industry is more popular than ever. But what movies make the most money at the box office? How much does a director matter? Or the budget? For some movies, it's "You had me at 'Hello.'" For others, the trailer falls short of expectations and you think "What we have here is a failure to communicate."

In this competition, we're presented with metadata on over 7,000 past films from The Movie Database to try and predict their overall worldwide box office revenue. Data points provided include cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries. You can collect other publicly available data to use in your model predictions, but in the spirit of this competition, use only data that would have been available before a movie's release.


### Load Libraries

In [None]:
#Libraries
import numpy as np
import plotly.express as px
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import pandas as pd
pd.set_option('display.max_columns', None)  # full option name; the short form is ambiguous in newer pandas
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import ast
from wordcloud import WordCloud
from collections import Counter
from PIL import Image
from urllib.request import urlopen
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, RandomizedSearchCV
from sklearn.linear_model import LinearRegression,Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings("ignore")

### Load Data

In [None]:
train = pd.read_csv('../input/tmdb-box-office-prediction/train.csv')
test = pd.read_csv('../input/tmdb-box-office-prediction/test.csv')

dict_columns = ['belongs_to_collection','genres','spoken_languages','production_companies',
                'production_countries','Keywords','cast','crew']

def text_to_dict(df):
    for columns in dict_columns:
        df[columns] = df[columns].apply(lambda x: {} if pd.isna(x) else ast.literal_eval(x))
    return df

train = text_to_dict(train)
test = text_to_dict(test)

test['revenue'] = np.nan

# features from https://www.kaggle.com/kamalchhirang/eda-simple-feature-engineering-external-data
# Additional Features
train = pd.merge(train, pd.read_csv('../input/tmdb-competition-additional-features/TrainAdditionalFeatures.csv'), how='left', on=['imdb_id'])
test = pd.merge(test, pd.read_csv('../input/tmdb-competition-additional-features/TestAdditionalFeatures.csv'), how='left', on=['imdb_id'])

train.head(2)
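
The nested columns are stored as strings, which is why `text_to_dict` runs `ast.literal_eval` on every non-missing cell. A minimal sketch of that step on a made-up frame; the `raw` frame, its values, and the `parse_cell` name are illustrative assumptions, not part of the competition files.

```python
import ast
import pandas as pd

# Toy frame: one stringified list of dicts and one missing value (illustrative data)
raw = pd.DataFrame({
    'genres': ["[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]", None]
})

def parse_cell(x):
    # Missing values become an empty dict, mirroring text_to_dict above
    return {} if pd.isna(x) else ast.literal_eval(x)

parsed = raw['genres'].apply(parse_cell)
print(parsed[0][0]['name'])  # first genre of the first row
print(parsed[1])             # empty dict for the missing row
```

After parsing, a populated cell is a list of dictionaries, while a missing cell is an empty dictionary, which is what the `x != {}` checks later in the notebook rely on.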

In [None]:
train.shape , test.shape

In [None]:
# Data Fixes from https://www.kaggle.com/somang1418/happy-valentines-day-and-keep-kaggling-3

train.loc[train['id'] == 16,'revenue'] = 192864          # Skinning
train.loc[train['id'] == 90,'budget'] = 30000000         # Sommersby          
train.loc[train['id'] == 118,'budget'] = 60000000        # Wild Hogs
train.loc[train['id'] == 149,'budget'] = 18000000        # Beethoven
train.loc[train['id'] == 313,'revenue'] = 12000000       # The Cookout 
train.loc[train['id'] == 451,'revenue'] = 12000000       # Chasing Liberty
train.loc[train['id'] == 464,'budget'] = 20000000        # Parenthood
train.loc[train['id'] == 470,'budget'] = 13000000        # The Karate Kid, Part II
train.loc[train['id'] == 513,'budget'] = 930000          # From Prada to Nada
train.loc[train['id'] == 797,'budget'] = 8000000         # Welcome to Dongmakgol
train.loc[train['id'] == 819,'budget'] = 90000000        # Alvin and the Chipmunks: The Road Chip
train.loc[train['id'] == 850,'budget'] = 90000000        # Modern Times
train.loc[train['id'] == 1007,'budget'] = 2              # Zyzzyx Road 
train.loc[train['id'] == 1112,'budget'] = 7500000        # An Officer and a Gentleman
train.loc[train['id'] == 1131,'budget'] = 4300000        # Smokey and the Bandit   
train.loc[train['id'] == 1359,'budget'] = 10000000       # Stir Crazy 
train.loc[train['id'] == 1542,'budget'] = 1              # All at Once
train.loc[train['id'] == 1570,'budget'] = 15800000       # Crocodile Dundee II
train.loc[train['id'] == 1571,'budget'] = 4000000        # Lady and the Tramp
train.loc[train['id'] == 1714,'budget'] = 46000000       # The Recruit
train.loc[train['id'] == 1721,'budget'] = 17500000       # Cocoon
train.loc[train['id'] == 1865,'revenue'] = 25000000      # Scooby-Doo 2: Monsters Unleashed
train.loc[train['id'] == 1885,'budget'] = 12             # In the Cut
train.loc[train['id'] == 2091,'budget'] = 10             # Deadfall
train.loc[train['id'] == 2268,'budget'] = 17500000       # Madea Goes to Jail budget
train.loc[train['id'] == 2491,'budget'] = 6              # Never Talk to Strangers
train.loc[train['id'] == 2602,'budget'] = 31000000       # Mr. Holland's Opus
train.loc[train['id'] == 2612,'budget'] = 15000000       # Field of Dreams
train.loc[train['id'] == 2696,'budget'] = 10000000       # Nurse 3-D
train.loc[train['id'] == 2801,'budget'] = 10000000       # Fracture
train.loc[train['id'] == 335,'budget'] = 2 
train.loc[train['id'] == 348,'budget'] = 12
train.loc[train['id'] == 470,'budget'] = 13000000 
train.loc[train['id'] == 513,'budget'] = 1100000
train.loc[train['id'] == 640,'budget'] = 6 
train.loc[train['id'] == 696,'budget'] = 1
train.loc[train['id'] == 797,'budget'] = 8000000 
train.loc[train['id'] == 850,'budget'] = 1500000
train.loc[train['id'] == 1199,'budget'] = 5 
train.loc[train['id'] == 1282,'budget'] = 9               # Death at a Funeral
train.loc[train['id'] == 1347,'budget'] = 1
train.loc[train['id'] == 1755,'budget'] = 2
train.loc[train['id'] == 1801,'budget'] = 5
train.loc[train['id'] == 1918,'budget'] = 592 
train.loc[train['id'] == 2033,'budget'] = 4
train.loc[train['id'] == 2118,'budget'] = 344 
train.loc[train['id'] == 2252,'budget'] = 130
train.loc[train['id'] == 2256,'budget'] = 1 
train.loc[train['id'] == 2696,'budget'] = 10000000

#Clean Data
test.loc[test['id'] == 6733,'budget'] = 5000000
test.loc[test['id'] == 3889,'budget'] = 15000000
test.loc[test['id'] == 6683,'budget'] = 50000000
test.loc[test['id'] == 5704,'budget'] = 4300000
test.loc[test['id'] == 6109,'budget'] = 281756
test.loc[test['id'] == 7242,'budget'] = 10000000
test.loc[test['id'] == 7021,'budget'] = 17540562       #  Two Is a Family
test.loc[test['id'] == 5591,'budget'] = 4000000        # The Orphanage
test.loc[test['id'] == 4282,'budget'] = 20000000       # Big Top Pee-wee
test.loc[test['id'] == 3033,'budget'] = 250 
test.loc[test['id'] == 3051,'budget'] = 50
test.loc[test['id'] == 3084,'budget'] = 337
test.loc[test['id'] == 3224,'budget'] = 4  
test.loc[test['id'] == 3594,'budget'] = 25  
test.loc[test['id'] == 3619,'budget'] = 500  
test.loc[test['id'] == 3831,'budget'] = 3  
test.loc[test['id'] == 3935,'budget'] = 500  
test.loc[test['id'] == 4049,'budget'] = 995946 
test.loc[test['id'] == 4424,'budget'] = 3  
test.loc[test['id'] == 4460,'budget'] = 8  
test.loc[test['id'] == 4555,'budget'] = 1200000 
test.loc[test['id'] == 4624,'budget'] = 30 
test.loc[test['id'] == 4645,'budget'] = 500 
test.loc[test['id'] == 4709,'budget'] = 450 
test.loc[test['id'] == 4839,'budget'] = 7
test.loc[test['id'] == 3125,'budget'] = 25 
test.loc[test['id'] == 3142,'budget'] = 1
test.loc[test['id'] == 3201,'budget'] = 450
test.loc[test['id'] == 3222,'budget'] = 6
test.loc[test['id'] == 3545,'budget'] = 38
test.loc[test['id'] == 3670,'budget'] = 18
test.loc[test['id'] == 3792,'budget'] = 19
test.loc[test['id'] == 3881,'budget'] = 7
test.loc[test['id'] == 3969,'budget'] = 400
test.loc[test['id'] == 4196,'budget'] = 6
test.loc[test['id'] == 4221,'budget'] = 11
test.loc[test['id'] == 4222,'budget'] = 500
test.loc[test['id'] == 4285,'budget'] = 11
test.loc[test['id'] == 4319,'budget'] = 1
test.loc[test['id'] == 4639,'budget'] = 10
test.loc[test['id'] == 4719,'budget'] = 45
test.loc[test['id'] == 4822,'budget'] = 22
test.loc[test['id'] == 4829,'budget'] = 20
test.loc[test['id'] == 4969,'budget'] = 20
test.loc[test['id'] == 5021,'budget'] = 40 
test.loc[test['id'] == 5035,'budget'] = 1 
test.loc[test['id'] == 5063,'budget'] = 14 
test.loc[test['id'] == 5119,'budget'] = 2 
test.loc[test['id'] == 5214,'budget'] = 30 
test.loc[test['id'] == 5221,'budget'] = 50 
test.loc[test['id'] == 4903,'budget'] = 15
test.loc[test['id'] == 4983,'budget'] = 3
test.loc[test['id'] == 5102,'budget'] = 28
test.loc[test['id'] == 5217,'budget'] = 75
test.loc[test['id'] == 5224,'budget'] = 3 
test.loc[test['id'] == 5469,'budget'] = 20 
test.loc[test['id'] == 5840,'budget'] = 1 
test.loc[test['id'] == 5960,'budget'] = 30
test.loc[test['id'] == 6506,'budget'] = 11 
test.loc[test['id'] == 6553,'budget'] = 280
test.loc[test['id'] == 6561,'budget'] = 7
test.loc[test['id'] == 6582,'budget'] = 218
test.loc[test['id'] == 6638,'budget'] = 5
test.loc[test['id'] == 6749,'budget'] = 8 
test.loc[test['id'] == 6759,'budget'] = 50 
test.loc[test['id'] == 6856,'budget'] = 10
test.loc[test['id'] == 6858,'budget'] =  100
test.loc[test['id'] == 6876,'budget'] =  250
test.loc[test['id'] == 6972,'budget'] = 1
test.loc[test['id'] == 7079,'budget'] = 8000000
test.loc[test['id'] == 7150,'budget'] = 118
test.loc[test['id'] == 6506,'budget'] = 118
test.loc[test['id'] == 7225,'budget'] = 6
test.loc[test['id'] == 7231,'budget'] = 85
test.loc[test['id'] == 5222,'budget'] = 5
test.loc[test['id'] == 5322,'budget'] = 90
test.loc[test['id'] == 5350,'budget'] = 70
test.loc[test['id'] == 5378,'budget'] = 10
test.loc[test['id'] == 5545,'budget'] = 80
test.loc[test['id'] == 5810,'budget'] = 8
test.loc[test['id'] == 5926,'budget'] = 300
test.loc[test['id'] == 5927,'budget'] = 4
test.loc[test['id'] == 5986,'budget'] = 1
test.loc[test['id'] == 6053,'budget'] = 20
test.loc[test['id'] == 6104,'budget'] = 1
test.loc[test['id'] == 6130,'budget'] = 30
test.loc[test['id'] == 6301,'budget'] = 150
test.loc[test['id'] == 6276,'budget'] = 100
test.loc[test['id'] == 6473,'budget'] = 100
test.loc[test['id'] == 6842,'budget'] = 30
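
Several ids are assigned twice in the lists above (for example 513 and 850 in train, 6506 in test), so the later assignment silently wins. One way to make that explicit is to keep the corrections in a dictionary, where each id can hold only one value. This is a sketch on a toy frame; `budget_fixes` and `df` are hypothetical names, and only a few of the values above are copied in.

```python
import pandas as pd

# Toy frame standing in for train (the last row has no correction)
df = pd.DataFrame({'id': [90, 118, 513, 1], 'budget': [0, 0, 0, 5000]})

# Hypothetical mapping id -> corrected budget; values copied from the cell above
budget_fixes = {90: 30000000, 118: 60000000, 513: 1100000}

# Map ids to corrections; ids without a fix keep their current value
df['budget'] = df['id'].map(budget_fixes).fillna(df['budget']).astype(int)
print(df)
```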


The training set has only 3000 rows.
Some of the columns contain lists of dictionaries: some lists hold a single dictionary, others hold several. Let's extract data from these columns!

__Data Description__

__id__ - Integer unique id of each movie

__belongs_to_collection__ - Contains the TMDB Id, Name, Movie Poster and Backdrop URL of a movie in JSON format. You can see the Poster and Backdrop Image like this: https://image.tmdb.org/t/p/original/. Example: https://image.tmdb.org/t/p/original//iEhb00TGPucF0b4joM1ieyY026U.jpg

__budget__ - Budget of a movie in dollars. A value of 0 means unknown.

__genres__ - Contains all the genre names and TMDB ids in JSON format

__homepage__ - Contains the official homepage URL of a movie. Example: http://sonyclassics.com/whiplash/ , this is the homepage of Whiplash movie.

__imdb_id__ - IMDB id of a movie (string). You can visit the IMDB Page like this: https://www.imdb.com/title/

__original_language__ - Two-letter code of the original language in which the movie was made. For example: en = English, fr = French.

__original_title__ - The original title of a movie. Title and original title may differ if the original title is not in English.

__overview__ - Brief description of the movie.

__popularity__ - Popularity of the movie in float.

__poster_path__ - Poster path of a movie. You can see the full image like this: https://image.tmdb.org/t/p/original/

__production_companies__ - All production company name and TMDB id in JSON format of a movie.

__production_countries__ - Two-letter code and full name of each production country in JSON format.

__release_date__ - Release date of a movie in mm/dd/yy format.

__runtime__ - Total runtime of a movie in minutes (Integer).

__spoken_languages__ - Two-letter code and full name of each spoken language.

__status__ - Is the movie released or rumored?

__tagline__ - Tagline of a movie

__title__ - English title of a movie

__Keywords__ - TMDB Id and name of all the keywords in JSON format.

__cast__ - All cast TMDB id, name, character name, gender (1 = Female, 2 = Male) in JSON format

__crew__ - Name, TMDB id and profile path of crew members in various jobs, such as Director, Writer, Art, Sound etc.

__revenue__ - Total revenue earned by a movie in dollars.

Let's check the skewness and kurtosis of the numeric columns, then transform the heavily skewed ones so their distributions are closer to normal.

In [None]:
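# numeric_only=True skips the object columns (dicts, strings), which newer pandas no longer drops silently
pd.DataFrame(train.skew(numeric_only=True).sort_values(ascending=False)).head(10)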

In [None]:
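# numeric_only=True skips the object columns (dicts, strings), which newer pandas no longer drops silently
pd.DataFrame(train.kurtosis(numeric_only=True).sort_values(ascending=False)).head(10)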
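
To see why a log transform is a natural response to a high skewness value, here is a small sketch on synthetic data. The log-normal `budget` column is an assumption chosen to mimic a long right tail; it is not drawn from the competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed column (log-normal) plus a non-numeric column
toy = pd.DataFrame({
    'budget': rng.lognormal(mean=15, sigma=1.5, size=1000),
    'title': ['x'] * 1000,
})

# Keep only numeric columns before computing moments
numeric = toy.select_dtypes(include='number')
skew_before = numeric['budget'].skew()
skew_after = np.log1p(numeric['budget']).skew()
print(skew_before, skew_after)
```

On data like this, the skewness of the raw column is large and positive, while the log-transformed column is roughly symmetric.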

We can log-transform popularity, revenue, totalVotes, budget, runtime and popularity2.

In [None]:
train['popularity'] = np.log1p(train['popularity'])   #log(1+x)  #expm1 - inverse
train['revenue'] = np.log1p(train['revenue'])
train['totalVotes'] = np.log1p(train['totalVotes'])
train['budget'] = np.log1p(train['budget'])
train['runtime'] = np.log1p(train['runtime'])
train['popularity2'] = np.log1p(train['popularity2'])

test['popularity'] = np.log1p(test['popularity'])  
test['totalVotes'] = np.log1p(test['totalVotes'])
test['budget'] = np.log1p(test['budget'])
test['runtime'] = np.log1p(test['runtime'])
test['popularity2'] = np.log1p(test['popularity2'])
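
Because the target is now on the `log1p` scale, any prediction has to be mapped back with `expm1` before it is reported in dollars, as the comment in the cell above hints. A minimal round-trip sketch with illustrative amounts:

```python
import numpy as np

revenue = np.array([0.0, 1_000.0, 12_000_000.0])  # illustrative dollar amounts
log_revenue = np.log1p(revenue)    # scale used for modelling
recovered = np.expm1(log_revenue)  # back to dollars
print(np.allclose(recovered, revenue))
```

Using `log1p` rather than a plain logarithm keeps zero values finite, which matters here because a budget of 0 means unknown.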

### Feature Engineering

#### Belongs to collection

In [None]:
for i,e in enumerate(train['belongs_to_collection'][:2]):
    print(i,e)

In [None]:
train['belongs_to_collection'].apply(lambda x: 1 if x!= {} else 0).value_counts()

2396 movies don't have any collection value and 604 do. We will keep a separate feature indicating whether a movie belongs to a collection; the rest of the values won't be needed much, so we'll drop them.

In [None]:
train['has_collection'] = train['belongs_to_collection'].apply(lambda x: len(x) if x!={} else 0)
test['has_collection'] = test['belongs_to_collection'].apply(lambda x: len(x) if x!={} else 0)

In [None]:
train.sample(2)

Similarly, we will check all the dictionary columns and clean them.
Now we will check Genres.

#### Genres

In [None]:
for i,e in enumerate(train['genres'][:2]):
    print(i,e)

In [None]:
print('Number of genres in films:')
train['genres'].apply(lambda x: len(x) if x!={} else 0).value_counts()

This shows that the majority of films have 2-3 genres. Films with 5-6 genres also occur, but films with 0 or 7 genres might be outliers.

In [None]:
list_of_genres = list(train['genres'].apply(lambda x: [i['name'] for i in x] if x!={} else []).values)

In [None]:
plt.figure(figsize=(12,8))
text = ' '.join(i for j in list_of_genres for i in j)
wordcloud = WordCloud(max_font_size = None, width = 1200, height = 1000,
                      collocations =False).generate(text)
plt.imshow(wordcloud)
plt.title('Top Genres')
plt.axis('off')
plt.show()

In [None]:
Counter([i for j in list_of_genres for i in j]).most_common(10)

As we can see, Drama, Comedy, Thriller and Action are the most common genres.

In [None]:
top_genres =[m[0] for m in Counter([i for j in list_of_genres for i in j]).most_common(10)]
print(top_genres)

We can create separate features: one for the number of genres, another holding all genre names of a film, and binary flags for the most common genres.

In [None]:
train['num_of_genres'] = train['genres'].apply(lambda x: len(x) if x!={} else 0)
train['all_genres'] = train['genres'].apply(lambda x: ' '.join(sorted([i['name'] for i in x ])) 
                                           if x!= {} else '')
test['num_of_genres'] = test['genres'].apply(lambda x: len(x) if x!={} else 0)
test['all_genres'] = test['genres'].apply(lambda x: ' '.join(sorted([i['name'] for i in x ])) 
                                           if x!= {} else '')

In [None]:
for g in top_genres:
    train['genre_' + g] = train['all_genres'].apply(lambda x: 1 if g in x else 0)
    test['genre_' + g] = test['all_genres'].apply(lambda x: 1 if g in x else 0)
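
The same three steps (count the entries, find the most common names, add one binary flag per name) are repeated below for companies, countries, languages, keywords and cast. A sketch of how they could be wrapped in one function follows; `add_list_features` is a hypothetical helper, not something the notebook defines. It tests membership in the list of names rather than in the joined string, which avoids a flag firing when one name happens to be a substring of another.

```python
from collections import Counter
import pandas as pd

def add_list_features(df, column, prefix, top_n):
    """Hypothetical helper: entry count plus exact-match flags for the top names."""
    names = df[column].apply(lambda x: [d['name'] for d in x] if x else [])
    top = [n for n, _ in Counter(n for row in names for n in row).most_common(top_n)]
    df['num_' + prefix] = names.apply(len)
    for n in top:
        # Membership in the list avoids substring matches on a joined string
        df[prefix + '_' + n] = names.apply(lambda row, n=n: int(n in row))
    return df, top

toy = pd.DataFrame({'genres': [
    [{'id': 18, 'name': 'Drama'}, {'id': 35, 'name': 'Comedy'}],
    [{'id': 18, 'name': 'Drama'}],
    {},
]})
toy, top = add_list_features(toy, 'genres', 'genre', top_n=2)
print(toy.drop(columns='genres'))
```

In a train/test setting, the returned `top` list would be computed on train and reused for test, as the cells in this notebook do.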

Now let's look briefly at production companies.

#### Production Companies

In [None]:
for i,e in enumerate(train['production_companies'][:2]):
    print(i,e)

In [None]:
print('Number of Production Companies for a movie:')
train['production_companies'].apply(lambda x: len(x) if x!= {} else 0).value_counts()

As you can see, the majority of movies have 1-3 production companies.
There are movies with more than 10 production companies. We will have a look at these to check whether the data is valid.

In [None]:
train[train['production_companies'].apply(lambda x: len(x) if x!= {} else 0) > 10]

All of the movies look real, so we will keep the data.

Now let's see the most common production companies.

In [None]:
list_of_companies = list(train['production_companies'].apply(lambda x : [i['name'] for i in x] 
                                                            if x!= {} else []).values)
Counter(i for j in list_of_companies for i in j).most_common(20)

We will create binary columns for the top 10 production companies and decide later what to do with this data. We will also create additional features.

In [None]:
train['num_prod_companies'] = train['production_companies'].apply(lambda x: len(x) if
                                                                 x!={} else 0)
test['num_prod_companies'] = test['production_companies'].apply(lambda x: len(x) if 
                                                               x!={} else 0)
train['all_prod_companies'] = train['production_companies'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x!={} else '' )
test['all_prod_companies'] = test['production_companies'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x!={} else '')

In [None]:
top_prod_companies = [m[0] for m in Counter(i for j in list_of_companies for i in j).most_common(10)]
for pc in top_prod_companies:
    train['production_' + pc] = train['all_prod_companies'].apply(lambda x: 1 if pc in x else 0)
    test['production_'+ pc] = test['all_prod_companies'].apply(lambda x: 1 if pc in x else 0)

Similarly, we will check production countries.

#### Production Countries

In [None]:
for i, e in enumerate(train['production_countries'][:2]):
    print(i,e)

In [None]:
print('Number of Production Countries in Movies:')
train['production_countries'].apply(lambda x: len(x) if x!={} else 0).value_counts()

The majority of movies have 1 or 2 production countries. Some movies have more. Let's check the movies with more than 5 production countries.

In [None]:
train[train['production_countries'].apply(lambda x: len(x) if x!= {} else 0) > 5]

There are only 4 movies with more than 5 production countries, all of which look valid. Now let's see which are the most common production countries.

In [None]:
List_of_countries = list(train['production_countries'].apply(lambda x: [i['name'] for i in x] 
                                                             if x!= {} else []))
#Count of production countries in movies
Counter(i for j in List_of_countries for i in j).most_common(10)

In [None]:
train['num_prod_countries'] = train['production_countries'].apply(lambda x: len(x) if x!= {} 
                                                                  else 0)
test['num_prod_countries'] = test['production_countries'].apply(lambda x: len(x) if x!={}
                                                               else 0)
train['all_prod_countries'] = train['production_countries'].apply(lambda x: ' '.join(sorted(i['name'] for i in x))
                                                                 if x!= {} else '')
test['all_prod_countries'] = test['production_countries'].apply(lambda x: ' '.join(sorted(i['name'] for i in x))
                                                               if x!= {} else '')


In [None]:
top_prod_countries = [m[0] for m in Counter(i for j in List_of_countries for i in j).most_common(6)]
for t in top_prod_countries:
    train['prod_country_' + t] = train['all_prod_countries'].apply(lambda x: 1 if t in x else 0)
    test['prod_country_'+ t] = test['all_prod_countries'].apply(lambda x: 1 if t in x else 0)

#### Spoken Languages

In [None]:
for i, e in enumerate(train['spoken_languages'][:2]):
    print(i,e)

In [None]:
print('Number of languages for a movie:')
train['spoken_languages'].apply(lambda x: len(x) if x!={} else 0).value_counts()

This shows that most of the movies have 1-2 languages.

In [None]:
list_of_langs = list(train['spoken_languages'].apply(lambda x: [i['name'] for i in x]
                                                    if x!= {} else []))
top_langs = [m[0] for m in Counter(i for j in list_of_langs for i in j).most_common(5)]
Counter(i for j in list_of_langs for i in j).most_common(5)

In [None]:
train['num_of_langs'] = train['spoken_languages'].apply(lambda x: len(x) if x!= {} else 0)
test['num_of_langs'] = test['spoken_languages'].apply(lambda x: len(x) if x!= {} else 0)

train['all_langs'] = train['spoken_languages'].apply(lambda x: ' '.join(sorted([i['name']for i in x]))
                                                    if x!= {} else '')
test['all_langs'] = test['spoken_languages'].apply(lambda x: ' '.join(sorted([i['name'] for i in x]))
                                                  if x!= {} else '')

for l in top_langs:
    train['lang_' + l] = train['all_langs'].apply(lambda x: 1 if l in x else 0)
    test['lang_'+ l] = test['all_langs'].apply(lambda x: 1 if l in x else 0)

In [None]:
plt.figure(figsize=(12,8))
text2 = ' '.join(i for j in list_of_langs for i in j)
wordcloud2 = WordCloud(collocations=False).generate(text2)
plt.imshow(wordcloud2)
plt.axis('off')
plt.title('Top Spoken Languages in Movies')
plt.show()

#### Keywords

In [None]:
for i, e in enumerate(train['Keywords'][:2]):
    print(i,e)

In [None]:
list_of_keys = list(train['Keywords'].apply(lambda x: [i['name'] for i in x] if x!= {} else []))
Counter(i for j in list_of_keys for i in j).most_common(10)

In [None]:
top_keywords = [m[0] for m in Counter(i for j in list_of_keys for i in j).most_common(10)]
train['num_of_keywords'] = train['Keywords'].apply(lambda x: len(x) if x!={} else 0)
test['num_of_keywords'] = test['Keywords'].apply(lambda x: len(x) if x!={} else 0)

train['all_keywords'] = train['Keywords'].apply(lambda x: ' '.join(sorted([i['name']for i in x]))
                                               if x!= {} else '')
test['all_keywords'] = test['Keywords'].apply(lambda x: ' '.join(sorted([i['name'] for i in x]))
                                             if x!={} else '')
for k in top_keywords:
    train['keyword_'+ k] = train['all_keywords'].apply(lambda x: 1 if k in x else 0)
    test['keyword_'+ k] = test['all_keywords'].apply(lambda x: 1 if k in x else 0)


In [None]:
plt.figure(figsize=(12,10))
text3 = ' '.join(['_'.join(i.split(' ')) for j in list_of_keys for i in j])
wordcloud3 = WordCloud(collocations = False).generate(text3)
plt.imshow(wordcloud3)
plt.title('Top Keywords')
plt.axis('off')
plt.show()

#### Cast

In [None]:
for i, e in enumerate(train['cast'][:1]):
    print(i,e)

In [None]:
print('Number of cast members per movie:')
train['cast'].apply(lambda x: len(x) if x!={} else 0).value_counts().head(10)

In [None]:
list_cast_name = list(train['cast'].apply(lambda x: [i['name'] for i in x]if x!= {} else []))
top_cast_name = [m[0] for m in Counter(i for j in list_cast_name for i in j).most_common(20)]
Counter(i for j in list_cast_name for i in j).most_common(20)

In [None]:
train['num_of_cast']= train['cast'].apply(lambda x: len(x) if x!={} else 0)
test['num_of_cast'] = test['cast'].apply(lambda x: len(x) if x!={} else 0)

train['all_cast_name'] = train['cast'].apply(lambda x: ' '.join(sorted([i['name']for i in x]))
                                             if x!={} else '')
test['all_cast_name'] = test['cast'].apply(lambda x: ' '.join(sorted([i['name']for i in x]))
                                          if x!= {} else '')
for c in top_cast_name:
    train['cast_name_'+ c]= train['all_cast_name'].apply(lambda x: 1 if c in x else 0)
    test['cast_name_'+ c]= test['all_cast_name'].apply(lambda x: 1 if c in x else 0)

#### Crew

In [None]:
for i,e in enumerate(train['crew'][:1]):
    print(i,e)

In [None]:
print('Number of crew members per movie:')
train['crew'].apply(lambda x: len(x) if x!= {} else 0).value_counts().head(10)

In [None]:
list_crew_names = list(train['crew'].apply(lambda x: [i['name'] for i in x] if x!= {} else []).values)
Counter(i for j in list_crew_names for i in j).most_common(15)

In [None]:
top_crew_names = [m[0] for m in Counter(i for j in list_crew_names for i in j).most_common(20)]
train['num_of_crew'] = train['crew'].apply(lambda x: len(x) if x!= {} else 0)
test['num_of_crew']= test['crew'].apply(lambda x: len(x) if x!= {} else 0)
for cn in top_crew_names:
    train['crew_name_'+ cn]= train['crew'].apply(lambda x: 1 if cn in str(x) else 0)
    test['crew_name_'+ cn] = test['crew'].apply(lambda x: 1 if cn in str(x) else 0)

#### Homepage

In [None]:
train['homepage'].isna().sum()

In [None]:
train['has_homepage'] = 1
train.loc[pd.isnull(train['homepage']) ,"has_homepage"] = 0
test['has_homepage'] = 1
test.loc[pd.isnull(test['homepage']) ,"has_homepage"] = 0

In [None]:
train['runtime'].isna().sum()

In [None]:
# Assign the result back; inplace=True on a selected column is unreliable in newer pandas
train['runtime'] = train['runtime'].fillna(train['runtime'].mean())

In [None]:
# Assign the result back; inplace=True on a selected column is unreliable in newer pandas
test['runtime'] = test['runtime'].fillna(test['runtime'].mean())
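
The cells above fill missing runtimes in each frame with that frame's own mean. A common alternative, shown here only as a sketch and not as what this notebook does, is to compute the statistic on the training rows and reuse it for the test rows, so that nothing is learned from the test set. The series and values are illustrative.

```python
import numpy as np
import pandas as pd

train_rt = pd.Series([90.0, np.nan, 110.0])  # illustrative runtimes
test_rt = pd.Series([np.nan, 100.0])

fill_value = train_rt.mean()  # statistic learned from training rows only
train_rt = train_rt.fillna(fill_value)
test_rt = test_rt.fillna(fill_value)
print(train_rt.tolist(), test_rt.tolist())
```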

## Data Visualization
We'll visualize our features and then add additional features.

### __Target Variable: Revenue__

In [None]:
fig, ax = plt.subplots(figsize = (12,5))
sns.set()
plt.subplot(1,2,1)
plt.hist(np.expm1(train['revenue']), bins =10)
plt.title('Distribution of revenue',fontsize=15)
plt.subplot(1,2,2)
plt.hist(train['revenue'], bins =10) 
plt.title('Distribution of log revenue', fontsize=15)

We converted revenue to log revenue earlier, and we can see that the log-scale distribution is much less skewed.

### Budget

In [None]:
fig, ax = plt.subplots(figsize = (14,5))
sns.set()
plt.subplot(1,2,1)
plt.hist(np.expm1(train['budget']), bins =10)
plt.title('Distribution of budget',fontsize=15)
plt.subplot(1,2,2)
plt.hist(train['budget'], bins =10) 
plt.title('Distribution of log budget', fontsize=15)

In [None]:
px.scatter(data_frame = train, x='budget',y='revenue', title = 'Log Budget vs Log Revenue')

In [None]:
px.scatter(data_frame = train, x='budget',y='popularity', title = 'Log Budget vs Log Popularity')

In [None]:
px.scatter(data_frame = train, x='budget',y='runtime', title = 'Log Budget vs Log Runtime')

### Original Language

In [None]:
fig = px.line(train, x="budget", y="revenue", color="original_language", title = 'Log Budget vs Log Revenue in different languages')
fig.show()

In [None]:
px.box(train.loc[train['original_language'].isin(train['original_language'].value_counts().head(6).index)], x='original_language', y='revenue', title='Log Revenue Distribution for top languages')


### Original Title

In [None]:
plt.figure(figsize=(12,10))
text4 = ' '.join(train['original_title'].sort_values(ascending=False))
wordcloud = WordCloud(collocations=False).generate(text4)
plt.imshow(wordcloud)
plt.axis('off')
plt.title('Most Common words in title', fontsize=15)
plt.show()

### Overview

In [None]:
plt.figure(figsize=(12,10))
text5 = ' '.join(train['overview'].fillna('').values)
wordcloud = WordCloud(collocations=False).generate(text5)
plt.imshow(wordcloud)
plt.axis('off')
plt.title('Top words in Overview', fontsize=15)
plt.show()

### Popularity

In [None]:
px.scatter(train.loc[train['original_language'].isin(train['original_language'].value_counts().head(6).index)],
           x='popularity', y='revenue',color = 'original_language',size='budget', title = 'Log Revenue vs Log Popularity (Bubble size=Budget)')

### Release Date

In [None]:
train.loc[train['release_date'].isnull() == True, 'release_date'] = '01/01/98'
test.loc[test['release_date'].isnull() == True, 'release_date'] = '01/01/98'


In [None]:
def fix_date(x):
    """
    Fixes dates which are in 20xx
    """
    year = x.split('/')[2]
    if int(year) <= 19:
        return x[:-2] + '20' + year
    else:
        return x[:-2] + '19' + year

In [None]:
train['release_date'] = train['release_date'].apply(lambda x: fix_date(x))
test['release_date'] = test['release_date'].apply(lambda x: fix_date(x))
train['release_date'] = pd.to_datetime(train['release_date'])
test['release_date'] = pd.to_datetime(test['release_date'])
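
The two-digit years in `release_date` are ambiguous, and `fix_date` resolves them with a fixed cutoff: years up to 19 are read as 20xx, everything else as 19xx. A side effect is that a film from 1919 or earlier would be placed in the 2000s. The sketch below restates the rule with the cutoff as a parameter; `expand_year` and `pivot` are illustrative names, not part of the notebook.

```python
def expand_year(date_str, pivot=19):
    """Variant of fix_date with the cutoff exposed as a parameter."""
    month, day, year = date_str.split('/')
    # Two-digit years at or below the pivot are read as 20xx, the rest as 19xx
    century = '20' if int(year) <= pivot else '19'
    return f'{month}/{day}/{century}{year}'

print(expand_year('2/20/15'))
print(expand_year('7/4/85'))
```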

In [None]:
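def process_date(df):
    date_parts = ["year", "weekday", "month", 'weekofyear', 'day', 'quarter']
    for part in date_parts:
        part_col = 'release' + "_" + part
        if part == 'weekofyear':
            # Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week replaces it
            df[part_col] = df['release_date'].dt.isocalendar().week.astype(int)
        else:
            df[part_col] = getattr(df['release_date'].dt, part).astype(int)

    return df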

train = process_date(train)
test = process_date(test)

In [None]:
d = train['release_date'].dt.year.value_counts().sort_index()
g = train.groupby('release_date')['revenue'].sum()

In [None]:
d1 = train['release_year'].value_counts().sort_index()
d2 = train.groupby(['release_year'])['revenue'].sum()
d3 = train.groupby(['release_year'])['budget'].sum()
data = [go.Scatter(x=d1.index, y=d1.values, name='film count'), 
        go.Scatter(x=d2.index, y=d2.values, name='total revenue', yaxis='y2'),
        go.Scatter(x=d3.index, y=d3.values, name='total budget', yaxis='y2')]
layout = go.Layout(dict(title = "Number of films and total revenue per year",
                  xaxis = dict(title = 'Year'),
                  yaxis = dict(title = 'Count'),
                  yaxis2=dict(title='Capital', overlaying='y', side='right')
                  ),legend=dict(
                orientation="v"))
fig = go.Figure(data, layout)
fig.update_xaxes(
    rangeslider_visible=True)
fig.show()

The film industry has grown significantly over the last few decades: both the number of films and the revenue they generate each year have increased markedly.
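
Note that `revenue` and `budget` were log-transformed earlier, so the yearly sums plotted above are sums of logs rather than dollar totals. If dollar totals are wanted, one option is to undo the transform before aggregating. A minimal sketch on a toy frame with illustrative values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'release_year': [2000, 2000, 2001],
    'revenue': np.log1p([1_000_000.0, 3_000_000.0, 2_000_000.0]),  # log scale, as in train
})

# Undo the log before summing so the totals are in dollars
yearly_total = np.expm1(toy['revenue']).groupby(toy['release_year']).sum()
print(yearly_total)
```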

In [None]:
plt.figure(figsize=(10,7))
sns.stripplot(x='release_weekday', y= 'revenue', data=train)
plt.xlabel('Weekday')
plt.ylabel('Revenue')
plt.title('Log Revenue by release day of week', fontsize=17)

It looks like Wednesday, Thursday and Friday releases generate more revenue.
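
A strip plot makes this kind of comparison hard to judge by eye, so a per-day summary can back it up. In pandas, `dt.weekday` numbers the days from 0 (Monday) to 6 (Sunday). The sketch below uses a toy frame with made-up log revenues; on the real data the same `groupby` would run on `train`.

```python
import pandas as pd

toy = pd.DataFrame({
    'release_weekday': [2, 2, 4, 4, 5],         # 0 = Monday ... 6 = Sunday
    'revenue': [16.0, 18.0, 17.0, 15.0, 12.0],  # illustrative log revenues
})
day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
toy['weekday_name'] = toy['release_weekday'].map(lambda d: day_names[d])

# Median and number of films per release day
summary = toy.groupby('weekday_name')['revenue'].agg(['median', 'count'])
print(summary)
```

Reporting the count next to the median helps to spot days with too few releases to support a conclusion.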

In [None]:
plt.figure(figsize=(10,7))
sns.stripplot(x='release_quarter', y= 'revenue', data=train)
plt.xlabel('Quarter')
plt.ylabel('Revenue')
plt.title('Log Revenue by release quarter of year', fontsize=17)

There is not much difference: the quarter in which a movie is released hardly matters.

In [None]:
plt.figure(figsize=(10,7))
sns.stripplot(x='release_month', y= 'revenue', data=train)
plt.xlabel('Month')
plt.ylabel('Revenue')
plt.title('Log Revenue by release month', fontsize=17)

### Runtime

In [None]:
fig, ax = plt.subplots(figsize = (14,5))
plt.subplot(1,2,1)
sns.regplot(data=train, x='runtime', y='revenue')
plt.xlabel('Runtime')
plt.ylabel('Revenue')
plt.title('Log Revenue by Log Runtime', fontsize=17)
plt.subplot(1,2,2)
plt.hist(train['runtime'], bins=10)
plt.xlabel('Runtime')
plt.ylabel('Count')
plt.title('Distribution by Log Runtime', fontsize=17)

Runtime doesn't look like a strong explanatory variable.

### Status

In [None]:
train['status'].value_counts()

Since the majority of the movies are released, this variable is of little use.

### Tagline

In [None]:
plt.figure(figsize=(12,12))
text6 = ' '.join(train['tagline'].fillna('').values)
wordcloud = WordCloud(collocations = False).generate(text6)
plt.imshow(wordcloud)
plt.axis('off')
plt.title('Top words in tagline')
plt.show()

### Has Collection

In [None]:
fig, ax = plt.subplots(figsize = (14,5))
plt.subplot(1,2,1)
sns.stripplot(data=train, x='has_collection', y= 'revenue')
plt.title('Stripplot of Log Revenue vs Collection', fontsize=17)
plt.subplot(1,2,2)
sns.boxplot(data=train, x='has_collection', y= 'revenue')
plt.title('Boxplot of Log Revenue vs Collection',fontsize=17)


This indicates that movies that are part of a collection tend to earn more on average than the others.
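
Because revenue is on a log scale here, a gap between the two groups translates into a ratio on the raw scale. The sketch below shows that conversion on an invented toy frame; only the column names are taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `train`; values are invented, revenue is on the log1p scale.
toy = pd.DataFrame({
    "has_collection": [0, 0, 0, 1, 1, 1],
    "revenue": [14.0, 15.0, 16.0, 17.0, 18.0, 19.0],
})

medians = toy.groupby("has_collection")["revenue"].median()
log_gap = medians.loc[1] - medians.loc[0]

# A gap on the log scale is roughly a ratio on the raw scale.
ratio = np.exp(log_gap)
print(f"log gap = {log_gap:.1f}, roughly {ratio:.0f}x in raw revenue")
```

Even a small vertical shift in the box plot can therefore correspond to a large multiplicative difference in actual revenue.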

### Has Homepage

In [None]:
fig, ax = plt.subplots(figsize = (14,5))
plt.subplot(1,2,1)
sns.stripplot(data=train, x='has_homepage', y= 'revenue')
plt.title('Stripplot of Log Revenue vs Homepage', fontsize=17)
plt.subplot(1,2,2)
sns.boxplot(data=train, x='has_homepage', y= 'revenue')
plt.title('Boxplot of Log Revenue vs Homepage',fontsize=17)


### Genres

In [None]:
fig, ax = plt.subplots(figsize = (14,5))
plt.subplot(1,2,1)
sns.stripplot(data=train, x='num_of_genres', y= 'revenue')
plt.title('Stripplot of Log Revenue vs Number of Genres', fontsize=17)
plt.subplot(1,2,2)
sns.boxplot(data=train, x='num_of_genres', y= 'revenue')
plt.title('Boxplot of Log Revenue vs Number of Genres',fontsize=17)


Surprisingly, movies with 3-4 genres tend to earn more than the rest.

In [None]:
f, axes = plt.subplots(4, 3, figsize=(15, 12))
for i,e in enumerate([col for col in train if col.startswith('genre_')]):
    sns.stripplot(data=train, x=e, y='revenue',  ax=axes[i // 3][i % 3])
plt.tight_layout()

We can see that Adventure and Science Fiction are expected to earn more on average than other genres.

### Production Companies

In [None]:
fig = px.box(train, x='num_prod_companies', y= 'revenue',
             color='has_collection',title='Log Revenue vs Number of Production Companies')
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.show()

The number of production companies doesn't matter much for movies outside a collection, but for movies in a collection the average revenue increases with the number of production companies, up to a certain limit.

### Production Countries

In [None]:
plt.figure(figsize=(10,6))
sns.set()
sns.stripplot(x='num_prod_countries', y='revenue', data=train)
plt.xlabel('Production Countries',fontsize=15)
plt.ylabel('Revenue',fontsize=15)
plt.title('Log Revenue vs Number of countries producing the film',fontsize=15);

As the number of production countries increases, revenue decreases. Films produced in 1-2 countries have the highest revenue.

In [None]:
f, axes = plt.subplots(2, 3, figsize=(12, 10))
plt.suptitle('Log revenue vs Top Production Countries', fontsize=15)
for i,e in enumerate([col for col in train if col.startswith('prod_country_')]):
    sns.boxplot(data=train, x=e, y='revenue',  ax=axes[i // 3][i % 3])
plt.show()

Movies produced in the USA generate more revenue on average than movies produced in other countries.

### Languages
The number of languages a movie is released in.

In [None]:
plt.figure(figsize=(10,5))
sns.set()
sns.stripplot(x='num_of_langs', y='revenue', data=train)
plt.xlabel('Number of languages',fontsize=15)
plt.ylabel('Revenue',fontsize=15)
plt.title('Log Revenue vs Number of languages movie released in',fontsize=15);

### Keywords

In [None]:
plt.figure(figsize=(10,6))
sns.set()
sns.stripplot(x='num_of_keywords', y='revenue', data=train)
plt.xlabel('Number of Keywords',fontsize=15)
plt.ylabel('Revenue',fontsize=15)
plt.title('Revenue vs Number of keywords',fontsize=15);

In [None]:
f, axes = plt.subplots(4, 3, figsize=(15, 15))
plt.suptitle('Boxplot of Log Revenue vs Top Keywords', fontsize=16)
for i,e in enumerate([col for col in train if col.startswith('keyword_')]):
    sns.boxplot(data=train, x=e, y='revenue',  ax=axes[i // 3][i % 3])
plt.show()

### Cast

In [None]:
plt.figure(figsize=(10,6))
sns.set()
sns.regplot(x='num_of_cast', y='revenue', data=train)
plt.xlabel('Number of Cast',fontsize=15)
plt.ylabel('Revenue',fontsize=15)
plt.title('Log Revenue vs Number of Cast',fontsize=16);

In [None]:
f, axes = plt.subplots(5, 4, figsize=(15, 13))
plt.suptitle('Boxplot of Log Revenue vs Top Cast', fontsize=16)
for i,e in enumerate([col for col in train if col.startswith('cast_name_')]):
    sns.boxplot(data=train, x=e, y='revenue',  ax=axes[i // 4][i % 4])
plt.show()

We can clearly see that movies featuring certain actors generate more revenue than others.

### Crew

In [None]:
px.scatter(data_frame = train, x='num_of_crew',y='revenue', title = 'Crew vs Log Revenue(Bubble size= Number of cast, color= Budget)',
           size='num_of_cast',color='budget')

In [None]:
f, axes = plt.subplots(5, 4, figsize=(15, 20))
plt.suptitle('Boxplot of Log Revenue vs Top Crew', fontsize=16)
for i,e in enumerate([col for col in train if col.startswith('crew_name_')]):
    sns.boxplot(data=train, x=e, y='revenue',  ax=axes[i // 4][i % 4])
plt.show()

Some crew members are definitely associated with higher revenue.

#### Additional Features

In [None]:
# Fill missing ratings and vote counts with the mean of films from the same year and language
group_cols = ["release_year", "original_language"]
train['rating'] = train['rating'].fillna(train.groupby(group_cols)['rating'].transform('mean'))
train['totalVotes'] = train['totalVotes'].fillna(train.groupby(group_cols)['totalVotes'].transform('mean'))
train['weightedRating'] = ( train['rating']*train['totalVotes'] + 6.367 * 1000 ) / ( train['totalVotes'] + 1000 )

train['inflationBudget'] = np.log1p(np.expm1(train['budget']) + np.expm1(train['budget'])*1.8/100*(2018-train['release_year'])) 
#Inflation simple formula
train['_popularity_mean_year'] = train['popularity'] / train.groupby("release_year")["popularity"].transform('mean')
train['_budget_runtime_ratio'] = train['budget']/train['runtime'] 
train['_budget_popularity_ratio'] = train['budget']/train['popularity']
train['_budget_year_ratio'] = train['budget']/(train['release_year']*train['release_year'])
train['_releaseYear_popularity_ratio'] = train['release_year']/train['popularity']

train['_popularity_totalVotes_ratio'] = train['totalVotes']/train['popularity']
train['_rating_popularity_ratio'] = train['rating']/train['popularity']
train['_rating_totalVotes_ratio'] = train['totalVotes']/train['rating']
train['_totalVotes_releaseYear_ratio'] = train['totalVotes']/train['release_year']
train['_budget_rating_ratio'] = train['budget']/train['rating']
train['_runtime_rating_ratio'] = train['runtime']/train['rating']
train['_budget_totalVotes_ratio'] = train['budget']/train['totalVotes']
    
# transform broadcasts each group mean back to every row of that group
train['meanruntimeByYear'] = train.groupby("release_year")["runtime"].transform('mean')
train['meanPopularityByYear'] = train.groupby("release_year")["popularity"].transform('mean')
train['meanBudgetByYear'] = train.groupby("release_year")["budget"].transform('mean')
train['meantotalVotesByYear'] = train.groupby("release_year")["totalVotes"].transform('mean')
train['meanTotalVotesByRating'] = train.groupby("rating")["totalVotes"].transform('mean')

# Flag rows whose tagline is missing (NaN, or 0 if it was already filled)
train['isTaglineNA'] = (train['tagline'].isna() | (train['tagline'] == 0)).astype(int)
    
train['isTitleDifferent'] = 1
train.loc[ train['original_title'] == train['title'] ,"isTitleDifferent"] = 0 
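
The per-year mean columns depend on how pandas aligns a grouped result with the original rows. As a minimal sketch on an invented toy frame, the difference between `aggregate` (one value per group) and `transform` (one value per row) looks like this:

```python
import pandas as pd

# Toy stand-in for `train`; values are invented.
toy = pd.DataFrame({
    "release_year": [2000, 2000, 2001, 2001],
    "runtime": [90.0, 110.0, 100.0, 140.0],
})

grouped = toy.groupby("release_year")["runtime"]

# aggregate: one value per group, indexed by release_year (2000, 2001).
per_year = grouped.aggregate("mean")

# transform: one value per original row, indexed like `toy` (0..3).
per_row = grouped.transform("mean")

# Assigning per_year aligns on index labels, so rows 0..3 find no match.
toy["mean_via_aggregate"] = per_year
toy["mean_via_transform"] = per_row
print(toy)
```

Only the `transform` column carries the group mean on every row; the `aggregate` column ends up as NaN because the year labels do not match the row labels.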


In [None]:
# Fill missing ratings and vote counts with the mean of films from the same year and language
group_cols = ["release_year", "original_language"]
test['rating'] = test['rating'].fillna(test.groupby(group_cols)['rating'].transform('mean'))
test['totalVotes'] = test['totalVotes'].fillna(test.groupby(group_cols)['totalVotes'].transform('mean'))
test['weightedRating'] = ( test['rating']*test['totalVotes'] + 6.367 * 1000 ) / ( test['totalVotes'] + 1000 )


test['inflationBudget'] = np.log1p(np.expm1(test['budget']) + np.expm1(test['budget'])*1.8/100*(2018-test['release_year'])) #Inflation simple formula
 
test['_popularity_mean_year'] = test['popularity'] / test.groupby("release_year")["popularity"].transform('mean')
test['_budget_runtime_ratio'] = test['budget']/test['runtime'] 
test['_budget_popularity_ratio'] = test['budget']/test['popularity']
test['_budget_year_ratio'] = test['budget']/(test['release_year']*test['release_year'])
test['_releaseYear_popularity_ratio'] = test['release_year']/test['popularity']

test['_popularity_totalVotes_ratio'] = test['totalVotes']/test['popularity']
test['_rating_popularity_ratio'] = test['rating']/test['popularity']
test['_rating_totalVotes_ratio'] = test['totalVotes']/test['rating']
test['_totalVotes_releaseYear_ratio'] = test['totalVotes']/test['release_year']
test['_budget_rating_ratio'] = test['budget']/test['rating']
test['_runtime_rating_ratio'] = test['runtime']/test['rating']
test['_budget_totalVotes_ratio'] = test['budget']/test['totalVotes']
    
# transform broadcasts each group mean back to every row of that group
test['meanruntimeByYear'] = test.groupby("release_year")["runtime"].transform('mean')
test['meanPopularityByYear'] = test.groupby("release_year")["popularity"].transform('mean')
test['meanBudgetByYear'] = test.groupby("release_year")["budget"].transform('mean')
test['meantotalVotesByYear'] = test.groupby("release_year")["totalVotes"].transform('mean')
test['meanTotalVotesByRating'] = test.groupby("rating")["totalVotes"].transform('mean')

# Flag rows whose tagline is missing (NaN, or 0 if it was already filled)
test['isTaglineNA'] = (test['tagline'].isna() | (test['tagline'] == 0)).astype(int)
    
test['isTitleDifferent'] = 1
test.loc[ test['original_title'] == test['title'] ,"isTitleDifferent"] = 0 
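
Two of the features above are easier to read as small functions: the weighted rating, which shrinks a film's rating toward a prior mean when it has few votes, and the simple inflation adjustment of the log-scaled budget. In the sketch below the constants (6.367, 1000, 1.8%, 2018) are taken from the cells above, while the function and parameter names are hypothetical.

```python
import numpy as np

def weighted_rating(rating, votes, prior_mean=6.367, prior_votes=1000):
    """Shrink a film's mean rating toward prior_mean when it has few votes."""
    return (rating * votes + prior_mean * prior_votes) / (votes + prior_votes)

def inflate_log_budget(log_budget, release_year, rate=0.018, base_year=2018):
    """Apply simple (non-compounding) inflation to a log1p-scaled budget."""
    raw = np.expm1(log_budget)
    return np.log1p(raw * (1 + rate * (base_year - release_year)))

# A 9.0 rating from 10 votes barely moves away from the prior...
few = weighted_rating(9.0, 10)
# ...while the same rating from 100,000 votes stays close to 9.0.
many = weighted_rating(9.0, 100_000)
print(round(few, 3), round(many, 3))

# $1M in 2008 with 1.8% simple inflation over 10 years is $1.18M in 2018.
adjusted = np.expm1(inflate_log_budget(np.log1p(1_000_000), 2008))
print(round(adjusted))
```

Wrapping the feature logic in functions like these and calling them on both `train` and `test` is one way to keep the two cells from drifting apart.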


In [None]:
train = train.drop(['id','belongs_to_collection','genres','homepage','imdb_id','overview','runtime'
    ,'poster_path','production_companies','production_countries','release_date','spoken_languages'
    ,'status','title','Keywords','cast','crew','original_language','original_title','tagline','all_genres',
                    'all_prod_companies','all_prod_countries','all_langs','all_keywords','all_cast_name'],axis=1)
test = test.drop(['id','belongs_to_collection','genres','homepage','imdb_id','overview','runtime'
    ,'poster_path','production_companies','production_countries','release_date','spoken_languages'
    ,'status','title','Keywords','cast','crew','original_language','original_title','tagline','all_genres',
                    'all_prod_companies','all_prod_countries','all_langs','all_keywords','all_cast_name'],axis=1)

In [None]:
train.fillna(value=0.0, inplace = True) 
test.fillna(value=0.0, inplace = True) 

In [None]:
train.sample(3)

In [None]:
def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df.dropna(inplace=True)
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)
train = clean_dataset(train)
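
The reason for an explicit infinity check is that the ratio features can divide by zero, and `fillna` alone does not remove the resulting `inf` values. A minimal sketch on invented numbers:

```python
import numpy as np
import pandas as pd

# Toy frame; values are invented. A zero popularity makes the ratio infinite.
toy = pd.DataFrame({"budget": [10.0, 12.0, 0.0], "popularity": [2.0, 0.0, 0.0]})
toy["ratio"] = toy["budget"] / toy["popularity"]   # 5.0, inf, NaN

# fillna only touches NaN, so the inf survives it.
after_fillna = toy["ratio"].fillna(0.0)

# Converting inf to NaN first lets one fillna handle both cases.
cleaned = toy["ratio"].replace([np.inf, -np.inf], np.nan).fillna(0.0)
print(after_fillna.tolist(), cleaned.tolist())
```

Whether to drop such rows, as `clean_dataset` does, or to replace the values is a judgment call; dropping is simpler but loses training examples.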

### Modeling

First we will try simple regression models: Linear Regression, Lasso Regression, Decision Tree and Random Forest Regressor.

In [None]:
X = train.drop(['revenue'],axis=1)
y = train.revenue

X_train, X_valid, y_train, y_valid = train_test_split(X,y,test_size=0.2,random_state=25)

#### Linear Regression

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_valid)
accuracy = r2_score(y_valid,pred)
print('Linear Regression R2 Score: ', accuracy)

mse = mean_squared_error(y_valid,pred)
print('Mean Squared Error: ', mse)
print('Root Mean Square Error',np.sqrt(mse))

cv_pred = cross_val_predict(lr,X,y,n_jobs=-1, cv=10)
cv_accuracy = r2_score(y,cv_pred)
print('Cross-Predicted(KFold) R2 Score: ', cv_accuracy)
#REsidual Plots

#### Lasso Regression

In [None]:
ls = Lasso()
ls.fit(X_train, y_train)
pred = ls.predict(X_valid)
accuracy = r2_score(y_valid,pred)
print('Lasso Regression R2 Score: ', accuracy)

mse = mean_squared_error(y_valid,pred)
print('Mean Squared Error: ', mse)
print('Root Mean Squared Error', np.sqrt(mse))

cv_pred = cross_val_predict(ls,X,y,n_jobs=-1, cv=10)
cv_accuracy = r2_score(y,cv_pred)
print('Cross-Predicted(KFold) Lasso Regression R2 Score: ', cv_accuracy)

#### Decision Tree Regressor

In [None]:
dt = DecisionTreeRegressor()
dt.fit(X_train, y_train)
pred = dt.predict(X_valid)
accuracy = r2_score(y_valid,pred)
print('Decision Tree R2 Score: ', accuracy)

mse = mean_squared_error(y_valid,pred)
print('Mean Squared Error: ', mse)
print('Root Mean Square Error',np.sqrt(mse))

cv_pred = cross_val_predict(dt,X,y,n_jobs=-1, cv=10)
cv_accuracy = r2_score(y,cv_pred)
print('Cross-Predicted(KFold) Decision Tree R2 Score: ', cv_accuracy)

#### Random Forest Regressor

In [None]:
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
pred = rf.predict(X_valid)
accuracy = r2_score(y_valid,pred)
print('Random Forest Regressor R2: ', accuracy)

mse = mean_squared_error(y_valid,pred)
print('Mean Squared Error: ', mse)
print('Root Mean Square Error',np.sqrt(mse))

cv_pred = cross_val_predict(rf,X,y,n_jobs=-1, cv=10)
cv_accuracy = r2_score(y,cv_pred)
print('Cross-Predicted(KFold) Random Forest R2: ', cv_accuracy)

Random Forest looks like a better predictor than the other models. Let's tune it and see how much accuracy we can get.
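
The four model cells above repeat the same steps, so the comparison can also be written as a single loop. This is a sketch on synthetic data from `make_regression`, not on the TMDB features, so the ranking it prints says nothing about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for X and y; not the TMDB features.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # scikit-learn maximizes scores, so RMSE is reported as a negative number.
    scores = cross_val_score(model, X_demo, y_demo, cv=5,
                             scoring="neg_root_mean_squared_error")
    results[name] = -scores.mean()

for name, rmse in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name:18s} CV RMSE = {rmse:.2f}")
```

Using one scoring function and one set of folds for every model keeps the comparison fair and makes it easy to add another candidate later.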

#### Randomized Search CV on Random Forest Regressor

In [None]:
rfr = RandomForestRegressor()
n_estimators = [int(x) for x in np.linspace(start = 50 , stop = 300, num = 5)] # 5 evenly spaced values from 50 to 300
max_features = [10,20,40,60,80,100,120]
max_depth = [int(x) for x in np.linspace(5, 10, num = 2)] 
max_depth.append(None)
bootstrap = [True, False]
r_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'bootstrap': bootstrap}
cv_random = RandomizedSearchCV(estimator=rfr, param_distributions=r_grid, n_iter = 20,
                                scoring='neg_mean_squared_error', cv = 3, verbose=2, random_state=42,
                                n_jobs=-1, return_train_score=True)

cv_random.fit(X_train, y_train);

print(cv_random.best_params_)

pred = cv_random.predict(X_valid)
mse = mean_squared_error(y_valid,pred)
print('Mean Squared Error: ', mse)
print('Root Mean Square Error',np.sqrt(mse))

cv_accuracy = r2_score(y_valid,pred)
print('Random Forest Predict R2: ', cv_accuracy)
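
Because the search is scored with `neg_mean_squared_error`, its `best_score_` is a negative MSE and has to be sign-flipped before taking the root. The sketch below shows this on synthetic data with a deliberately tiny grid; the data and the grid values are assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for X_train and y_train.
X_demo, y_demo = make_regression(n_samples=150, n_features=8, noise=5.0, random_state=0)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": [20, 40], "max_depth": [3, 5, None]},
    n_iter=4,
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated score of the best candidate.
# With neg_mean_squared_error it is <= 0, so flip the sign before the root.
cv_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, round(cv_rmse, 2))
```

Comparing this cross-validated RMSE with the RMSE on the held-out validation split gives a rough check on how stable the tuned model is.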

In [None]:
feature_imp = [col for col in zip(X_train.columns, cv_random.best_estimator_.feature_importances_)]
feature_imp.sort(key=lambda x:x[1], reverse=True)


In [None]:
imp = pd.DataFrame(feature_imp[0:30], columns=['feature', 'importance'])
plt.figure(figsize=(14, 12))
sns.barplot(y='feature', x='importance', data=imp)
plt.title('30 Most Important Features', fontsize=16)
plt.ylabel("Feature", fontsize=15)
plt.xlabel("Importance Param",fontsize=15)
plt.show()


These are the 30 most important features.
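
It can also be useful to know how much of the total importance the top features account for. In scikit-learn, the impurity-based importances of a fitted forest are normalized to sum to 1, so the share is a simple sum. This sketch uses synthetic data and hypothetical feature names `f0`, `f1`, and so on.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with only a few informative features.
X_demo, y_demo = make_regression(n_samples=200, n_features=12,
                                 n_informative=3, random_state=0)
cols = [f"f{i}" for i in range(X_demo.shape[1])]  # hypothetical names
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# A Series keeps names and values together and sorts in one call.
importances = pd.Series(forest.feature_importances_, index=cols).sort_values(ascending=False)
top = importances.head(5)
print(top)
print(f"Top 5 features carry {top.sum():.0%} of the total importance")
```

If a handful of features carries most of the importance, the remaining columns are candidates for pruning.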

In [None]:
imp

#### H2O AutoML
Now let's try H2O AutoML on our data to check whether it gives better accuracy and lower error.

In [None]:
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.automl import H2OAutoML

In [None]:
h2o.init()

In [None]:
h2o_df=h2o.H2OFrame(train)
h2o_df.head()

In [None]:
splits = h2o_df.split_frame(ratios=[0.8],seed=1)
h2o_train = splits[0]
h2o_valid = splits[1]

In [None]:
y = "revenue" 
x = h2o_df.columns 
x.remove(y) 

In [None]:
aml = H2OAutoML(max_runtime_secs=180, seed=1,stopping_metric='RMSE')

In [None]:
aml.train(x=x,y=y, training_frame=h2o_train)

In [None]:
lb = aml.leaderboard
lb.head()

The best RMSE on the leaderboard, 1.90, belongs to the Stacked Ensemble model. Our Random Forest Regressor reached an RMSE of 1.92, so it was already a good model.
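
These RMSE values are on the log scale, since the target is `log1p(revenue)`. An RMSE on log1p values is the RMSLE of the raw values, and its exponential gives a rough multiplicative error factor: a log-scale RMSE near 1.9 corresponds to a factor of roughly e^1.9, about 6.7. The sketch below shows the calculation on invented revenues and predictions.

```python
import numpy as np

# Invented raw revenues and predictions, in dollars.
actual = np.array([1_000_000.0, 50_000_000.0, 200_000.0])
predicted = np.array([2_000_000.0, 40_000_000.0, 100_000.0])

# RMSE computed on log1p values is the RMSLE of the raw values.
log_rmse = np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

# exp(log_rmse) gives a typical multiplicative error factor.
typical_factor = np.exp(log_rmse)
print(round(log_rmse, 3), round(typical_factor, 2))
```

Reading the metric this way makes it clear that a difference of 0.02 between two models is small next to the overall error.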

In [None]:
# Get model ids for all models in the AutoML Leaderboard
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
# Get the "All Models" Stacked Ensemble model
se = h2o.get_model([mid for mid in model_ids if "StackedEnsemble_AllModels" in mid][0])
# Get the Stacked Ensemble metalearner model
metalearner = h2o.get_model(se.metalearner()['name'])

In [None]:
#This shows us how much each base learner is contributing to the ensemble.
%matplotlib inline
metalearner.std_coef_plot()

In [None]:
pred = aml.predict(h2o_valid)
pred.head()

In [None]:
h2o.save_model(aml.leader, path="./model_bin")

Now let's see if we can get better accuracy and lower error from XGBoost.

#### XGBoost

In [None]:
import xgboost as xgb
from xgboost.sklearn import XGBRegressor
params = {'objective': 'reg:squarederror',  # 'reg:linear' is deprecated
          'eta': 0.01, 
          'max_depth': 6, 
          'min_child_weight': 3,
          'subsample': 0.8,
          'colsample_bytree': 0.8,
          'colsample_bylevel': 0.50, 
          'gamma': 1.45, 
          'eval_metric': 'rmse', 
          'seed': 12, 
          'verbosity': 0}  # replaces the deprecated 'silent' flag
# create dataset for xgboost
xgb_data = [(xgb.DMatrix(X_train, y_train), 'train'), (xgb.DMatrix(X_valid, y_valid), 'valid')]
print('Starting training...')
# train
xgb_model = xgb.train(params, 
                  xgb.DMatrix(X_train, y_train),
                  10000,  
                  xgb_data, 
                  verbose_eval=300,
                  early_stopping_rounds=300)

The RMSE is even better than that of the best model from H2O AutoML, so we will stick with XGBoost as our final model.

In [None]:
xgb_pred = xgb_model.predict(xgb.DMatrix(X_valid))

In [None]:
fig, ax = plt.subplots(figsize=(20,12))
xgb.plot_importance(xgb_model, max_num_features=30, height = 0.8, ax = ax)
plt.title('XGBoost Feature Importance (top 30)')
plt.show()

In [None]:
train.shape, test.shape

In [None]:
X_test = test.drop(columns='revenue', errors='ignore')  # test may not have a revenue column

In [None]:
X_test = X_test.replace([np.inf, -np.inf], np.nan)  # handle both +inf and -inf
X_test.fillna(X_test.mean(), inplace=True)

In [None]:
# Predict with the trees up to the best early-stopping iteration
test_pred_xgb = xgb_model.predict(xgb.DMatrix(X_test), iteration_range=(0, xgb_model.best_iteration + 1))

In [None]:
test_pred_xgb[0]

### CatBoost

In [None]:
from catboost import CatBoostRegressor
model = CatBoostRegressor(iterations=100000,
                                 learning_rate=0.005,
                                 depth=5,
                                 eval_metric='RMSE',
                                 colsample_bylevel=0.8,
                                 random_seed = 21,
                                 bagging_temperature = 0.2,
                                 metric_period = None,
                                 early_stopping_rounds=200
                                )
model.fit(X_train, y_train,eval_set=(X_valid, y_valid),use_best_model=True,verbose=500)
    
val_pred = model.predict(X_valid)
print('RMSE',np.sqrt(mean_squared_error(y_valid,val_pred)))
test_pred_cat = model.predict(X_test)

CatBoost gave an even lower error than XGBoost. Now let's try LightGBM.

### LightGBM Model

In [None]:
import lightgbm as lgb
params = {'objective':'regression',
         'num_leaves' : 30,
         'min_data_in_leaf' : 20,
         'max_depth' : 9,
         'learning_rate': 0.004,
         #'min_child_samples':100,
         'feature_fraction':0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9,
         'lambda_l1': 0.2,
         "bagging_seed": 11,
         "metric": 'rmse',
         #'subsample':.8, 
          #'colsample_bytree':.9,
         "random_state" : 11,
         "verbosity": -1}
record = dict()
model = lgb.train(params
                      , lgb.Dataset(X_train, y_train)
                      , num_boost_round = 100000
                      , valid_sets = [lgb.Dataset(X_valid, y_valid)]
                      # early stopping and logging are passed as callbacks in recent LightGBM versions
                      , callbacks = [lgb.early_stopping(stopping_rounds=500)
                                     , lgb.log_evaluation(period=500)
                                     , lgb.record_evaluation(record)]
                     )
best_idx = np.argmin(np.array(record['valid_0']['rmse']))

val_pred = model.predict(X_valid, num_iteration = model.best_iteration)
test_pred_gbm = model.predict(X_test, num_iteration = model.best_iteration)

For the final submission you can try different combinations of models to predict the target revenue. For me, the blend below made sense and gave good predictions.

In [None]:
sub = pd.read_csv('../input/tmdb-box-office-prediction/sample_submission.csv')
df_sub = pd.DataFrame()
df_sub['id'] = sub['id']
final_pred = 0.3*test_pred_xgb + 0.7*test_pred_cat
df_sub['revenue'] = np.expm1(final_pred)
print(df_sub['revenue'])
df_sub.to_csv("submission.csv", index=False)
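
Since both sets of predictions are on the `log1p` scale, the weighted sum above is an average in log space, which after `expm1` is a weighted geometric mean of `1 + revenue` rather than an arithmetic mean of revenues. A minimal sketch with invented predictions (`pred_a` and `pred_b` are hypothetical stand-ins for the two models' outputs):

```python
import numpy as np

# Invented log1p-scale predictions from two models for three films.
pred_a = np.log1p(np.array([1_000_000.0, 5_000_000.0, 20_000_000.0]))
pred_b = np.log1p(np.array([4_000_000.0, 5_000_000.0, 10_000_000.0]))

w_a, w_b = 0.3, 0.7  # weights sum to 1, as in the blend above

# Averaging in log space, then inverting the log1p transform.
blend = np.expm1(w_a * pred_a + w_b * pred_b)

# Equivalent closed form: weighted geometric mean of (1 + revenue), minus 1.
geo = (1 + np.expm1(pred_a)) ** w_a * (1 + np.expm1(pred_b)) ** w_b - 1
print(np.round(blend))
```

The geometric mean is pulled less by a single very large prediction than an arithmetic mean would be, which suits a metric computed on log revenue.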

**Conclusion**

That's it, we have reached the end of our exercise.

We started with data exploration and cleaning and checked skewness, then moved straight on to feature creation, converting all the text features into features our models can use.
Next we visualized the data, checked how the various features relate to our target variable 'Revenue', and created additional features.
Finally we built several models and compared them on RMSE. Random Forest gave a good result and H2O AutoML did even better, but since AutoML is a black-box model we tried XGBoost instead, which performed at least as well.
It took me many, many hours of effort to get this all done.
Do drop comments where you think I can improve the model or features.
Upvote if you liked what you saw.
Thanks and much more to come ;)