# Introduction
This report analyzes the TMDB movie dataset to find out what influences a movie's revenue.

# Objective
In this report, we try to answer three questions:
1. Which areas have the most influence on revenue?
2. How are a movie's revenue and average score affected by its genre?
3. What influence does release date have on revenue?

### Packages used

In [1]:
import pandas as pd
import plotly.graph_objs as go
import plotly
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

### Load data

In [2]:
data = pd.read_csv('tmdb_5000_movies.csv')
data.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [3]:
print('Features in table: ', list(data.columns))
print('\nNumber shape: ', data.shape)
# Check data type of each column
for column in list(data.columns):
    print('%s type: ' %column, data[column].dtype)

Features in table:  ['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language', 'original_title', 'overview', 'popularity', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'vote_average', 'vote_count']

Number shape:  (4803, 20)
budget type:  int64
genres type:  object
homepage type:  object
id type:  int64
keywords type:  object
original_language type:  object
original_title type:  object
overview type:  object
popularity type:  float64
production_companies type:  object
production_countries type:  object
release_date type:  object
revenue type:  int64
runtime type:  float64
spoken_languages type:  object
status type:  object
tagline type:  object
title type:  object
vote_average type:  float64
vote_count type:  int64
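
The loop above can be shortened, since a DataFrame already exposes the type of every column through its `dtypes` attribute. The following is a minimal sketch on a made-up frame (`toy` and its values are illustrative assumptions, not the real table):

```python
import pandas as pd

# Toy frame standing in for the TMDB table (values are illustrative only).
toy = pd.DataFrame({
    'budget': [237000000, 300000000],
    'title': ['Avatar', 'Spectre'],
    'runtime': [162.0, 148.0],
})

# dtypes is a Series mapping each column name to its dtype,
# which is the same information the explicit loop prints.
column_types = toy.dtypes
print(column_types)
```

On the real table, `data.dtypes` should give the same listing as the loop in one call.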


### Data cleaning

Plot the distribution of revenue across genres.
Which is highest, which is lowest?
Use `DataFrame.describe` for details.

Count the null values in each column of the dataset.

In [4]:
data.isnull().sum()

budget                     0
genres                     0
homepage                3091
id                         0
keywords                   0
original_language          0
original_title             0
overview                   3
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title                      0
vote_average               0
vote_count                 0
dtype: int64

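As shown above, homepage and tagline contain many empty entries (overview, release_date and runtime have a few as well). We fill the missing homepage and tagline values with 'Unknown'.
##### NOTE: 
A quicker option is data.fillna('Unknown'), but it fills every column, so the numeric runtime column would end up holding the string 'Unknown' too.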

In [5]:
data.homepage = data.homepage.fillna('Unknown')
data.tagline = data.tagline.fillna('Unknown')
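
If several text columns need the same treatment, `fillna` also accepts a dictionary that maps column names to fill values, which leaves the other columns untouched. A small sketch, assuming a made-up `toy` frame rather than the real data:

```python
import pandas as pd

# Toy frame with gaps in two text columns and one numeric column.
toy = pd.DataFrame({
    'homepage': ['http://www.avatarmovie.com/', None],
    'tagline': [None, 'The Legend Ends'],
    'runtime': [162.0, None],
})

# Fill only the listed columns; runtime keeps its NaN and stays numeric.
filled = toy.fillna({'homepage': 'Unknown', 'tagline': 'Unknown'})
print(filled)
```

This keeps the convenience of a single call without turning numeric columns into strings.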

Now we can check again the filled columns.

In [6]:
print('============Homepage===============')
print(data.homepage.value_counts().head(10))
print('\n============Tagline===============')
print(data.tagline.value_counts().head(10))

Unknown                                 3091
http://www.missionimpossible.com/          4
http://www.thehungergames.movie/           4
http://www.transformersmovie.com/          3
http://www.kungfupanda.com/                3
http://www.thehobbit.com/                  3
http://www.ironmanmovie.com/               2
http://www.howtotrainyourdragon.com/       2
http://www.indianajones.com                2
http://www.lordoftherings.net/             2
Name: homepage, dtype: int64

Unknown                                       844
Based on a true story.                          3
From zero to hero.                              2
One way in. No way out.                         2
What could go wrong?                            2
One ordinary couple. One little white lie.      2
Be careful what you wish for.                   2
The only way out is down.                       2
Worlds Collide                                  2
Who's next?                                     2
Name: tagline, dtype: int64

### Data explore
In this part, we analyze the data to answer the three questions above.

#### 1.  What areas have the most influence on revenue?
The area is given in the production_countries column, so we need to extract the information from it.

First, let's take a look at a sample from this column.

In [7]:
print(data.production_countries.iloc[0])
print(type(data.production_countries[0]))

[{"iso_3166_1": "US", "name": "United States of America"}, {"iso_3166_1": "GB", "name": "United Kingdom"}]
<class 'str'>


As you can see above, the sample looks like a list of dictionaries, but it is actually stored as a string. Thank God, the standard library has a module that converts the string representation of a list into a real list: ast!

In [8]:
import ast
list_dict = ast.literal_eval(data.production_countries.iloc[0])
print(list_dict)
print(list_dict[0])
print(list_dict[0].get('name'))

[{'iso_3166_1': 'US', 'name': 'United States of America'}, {'iso_3166_1': 'GB', 'name': 'United Kingdom'}]
{'iso_3166_1': 'US', 'name': 'United States of America'}
United States of America
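
The sample cell also looks like valid JSON (double-quoted keys and values), so `json.loads` may be an alternative to `ast.literal_eval`. The sketch below wraps the parsing in a small helper; `extract_names` is a hypothetical name introduced here, and whether every cell in the real column parses as JSON is an assumption that has not been checked.

```python
import json

def extract_names(cell):
    """Return the 'name' field of every entry in a JSON-encoded list of dicts."""
    return [entry.get('name') for entry in json.loads(cell)]

# Same text as the sample cell printed above.
sample = ('[{"iso_3166_1": "US", "name": "United States of America"}, '
          '{"iso_3166_1": "GB", "name": "United Kingdom"}]')
names = extract_names(sample)
print(names)
```

A helper like this could be reused for the genres column, which has the same list-of-dictionaries layout.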


Amazing, right? Now we will create a new table that contains the areas and their revenues.

In [9]:
areas = []
revenues = []
for raw_countries, revenue in zip(data.production_countries, data.revenue):
    countries = ast.literal_eval(raw_countries)
    for country in countries:
        # If a movie has more than one country, split its revenue evenly among them.
        revenues.append(revenue / len(countries))
        areas.append(country.get('name'))
eval_data = {'Area': areas, 'Revenue': revenues}
eval_table = pd.DataFrame(eval_data)
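
The same table can probably be built without explicit Python loops by parsing each cell once and letting `DataFrame.explode` create one row per country. This is only a sketch on a made-up `toy` frame; the column names match the report, but the values are invented.

```python
import ast
import pandas as pd

toy = pd.DataFrame({
    'production_countries': [
        '[{"iso_3166_1": "US", "name": "United States of America"}, '
        '{"iso_3166_1": "GB", "name": "United Kingdom"}]',
        '[{"iso_3166_1": "US", "name": "United States of America"}]',
        '[]',
    ],
    'revenue': [100, 60, 10],
})

# Parse each cell into a list of country names.
names = toy.production_countries.apply(
    lambda cell: [d.get('name') for d in ast.literal_eval(cell)]
)
# Split each movie's revenue evenly across its countries.
share = toy.revenue / names.str.len()
toy_eval = pd.DataFrame({'Area': names, 'Revenue': share})
# explode() yields one row per (movie, country); a movie with no country
# gets a missing Area, and that row is dropped.
toy_eval = toy_eval.explode('Area').dropna(subset=['Area']).reset_index(drop=True)
print(toy_eval)
```

As in the loop version, movies with an empty country list contribute no rows.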

Plot bar chart for Area and Average Revenue

In [10]:
area_table = eval_table.groupby('Area').mean() # Mean revenue of each area.
# Discard areas with fewer than 30 samples: their means are too noisy to compare.
area_counts = eval_table.groupby('Area').count()
area_list = area_counts[area_counts.Revenue >= 30].index.values
area_table = area_table[area_table.index.isin(area_list)]
# Now draw the chart
table = [
    go.Bar(
        x=area_table.index.values, # The area now is the index in area_table
        y=area_table.Revenue
    )
]

layout = go.Layout(
    title='Average revenue by area',
    yaxis=dict(title='Revenue'),
    xaxis=dict(title='Area')
)

fig = go.Figure(data=table, layout=layout)
iplot(fig)
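
The mean and the sample count used for the filter can also be computed in a single `groupby` pass with `agg`. A minimal sketch, assuming a made-up `toy_eval` table and a lowered threshold (`MIN_SAMPLES` is a hypothetical constant; the report itself uses 30):

```python
import pandas as pd

toy_eval = pd.DataFrame({
    'Area': ['A', 'A', 'A', 'B'],
    'Revenue': [10.0, 20.0, 30.0, 100.0],
})
MIN_SAMPLES = 3  # the report uses 30; lowered so the toy data shows the effect

# One pass gives both the mean revenue and the number of samples per area.
stats = toy_eval.groupby('Area').Revenue.agg(['mean', 'count'])
# Keep only the areas with enough samples.
kept = stats[stats['count'] >= MIN_SAMPLES]
print(kept)
```

Here area B is dropped because it has a single sample, even though its mean is the highest.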

#### 2.  How is a movie’s revenue and average score affected by its genre?
We process the genres column the same way as above.

In [11]:
genres = []
scores = []
revenues = []
for rows in range(len(data)):
    list_dict = ast.literal_eval(data.genres.iloc[rows])
    for dicts in range(len(list_dict)):
        genres.append(list_dict[dicts].get('name'))
        revenues.append(data.revenue.values[rows]/len(list_dict))
        scores.append(data.vote_average.values[rows])
eval_data = {'Genre': genres, 'Revenue': revenues, 'Score': scores}
eval_table = pd.DataFrame(eval_data)

Genres vs Revenues.

In [12]:
genre_table = eval_table.groupby('Genre').mean() # Mean revenue and score of each genre.
# Discard genres with fewer than 30 samples: their means are too noisy to compare.
genre_counts = eval_table.groupby('Genre').count()
genre_list = genre_counts[genre_counts.Revenue >= 30].index.values
genre_table = genre_table[genre_table.index.isin(genre_list)]

# Now draw the chart
table = [
    go.Bar(
        x=genre_table.index.values, # The genre is now the index of genre_table
        y=genre_table.Revenue
    )
]

layout = go.Layout(
    title='Average revenue by genre',
    yaxis=dict(title='Revenue'),
    xaxis=dict(title='Genre')
)

fig = go.Figure(data=table, layout=layout)
iplot(fig)

Genres vs Score.

In [13]:
genre_table = eval_table.groupby('Genre').mean() # Mean revenue and score of each genre.
# Discard genres with fewer than 30 samples: their means are too noisy to compare.
genre_counts = eval_table.groupby('Genre').count()
genre_list = genre_counts[genre_counts.Score >= 30].index.values
genre_table = genre_table[genre_table.index.isin(genre_list)]

# Now draw the chart
table = [
    go.Bar(
        x=genre_table.index.values, # The genre is now the index of genre_table
        y=genre_table.Score
    )
]

layout = go.Layout(
    title='Average score by genre',
    yaxis=dict(title='Score'),
    xaxis=dict(title='Genre')
)

fig = go.Figure(data=table, layout=layout)
iplot(fig)

#### 3.  What influence does release date have on revenue?
Here we analyze revenue by month and by year.
##### Note:
We only analyze released movies, which makes more sense.

In [14]:
# Extract months and years from the release_date column.
# One release_date is missing (NaN, which is a float), so not every value is a string.
# Drop that row instead of turning NaN into the text 'nan'.
released = data[(data.status == 'Released') & data.release_date.notnull()]
# release_date strings look like 'YYYY-MM-DD'.
eval_data = {
    'Month': released.release_date.str[5:7].values,
    'Year': released.release_date.str[0:4].values,
    'Revenue': released.revenue.values,
}
eval_table = pd.DataFrame(eval_data)
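
Instead of slicing the date strings, the month and year can be read from parsed dates with `pd.to_datetime`, which turns missing or malformed values into `NaT` when `errors='coerce'` is passed. A sketch on a made-up `toy` frame (the values are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    'release_date': ['2009-12-10', None, '2015-10-26'],
    'revenue': [2787965087, 0, 880674609],
})

# Missing or unparseable dates become NaT instead of raising an error.
dates = pd.to_datetime(toy.release_date, errors='coerce')
toy_eval = pd.DataFrame({
    'Month': dates.dt.month,
    'Year': dates.dt.year,
    'Revenue': toy.revenue,
})
# Drop rows without a date, then store month and year as integers.
toy_eval = toy_eval.dropna(subset=['Month']).astype({'Month': int, 'Year': int})
print(toy_eval)
```

With integer months and years, the grouped charts sort numerically rather than as text.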

Plot a bar chart of average revenue by month.

In [15]:
month_table = eval_table.groupby('Month').mean(numeric_only=True) # Mean revenue of each month.
table = [
    go.Bar(
        x=month_table.index.values, # The month is now the index of month_table
        y=month_table.Revenue
    )
]

layout = go.Layout(
    title='Average revenue by month',
    yaxis=dict(title='Revenue'),
    xaxis=dict(title='Month')
)

fig = go.Figure(data=table, layout=layout)
iplot(fig)

Plot a line chart of average revenue by year.

In [16]:
year_table = eval_table.groupby('Year').mean(numeric_only=True) # Mean revenue of each year.
table = [
    go.Scatter(
        x=year_table.index.values, # The year is now the index of year_table
        y=year_table.Revenue
    )
]
layout = go.Layout(
    title='Average revenue by year',
    yaxis=dict(title='Revenue'),
    xaxis=dict(title='Year')
)

fig = go.Figure(data=table, layout=layout)
iplot(fig)