## <font color='red'><center> Don't forget to upvote if you like it! :) </center></font>

# Box Office Revenue Analysis and Visualization

![](https://news.tfw2005.com/wp-content/uploads/sites/10/2018/12/boxofficeearnings-Transformers-Bumblebee.jpg)

# Introduction

In a world… where movies made an estimated $41.7 billion in 2018, the film industry is more popular than ever. But what movies make the most money at the box office? How much does a director matter? Or the budget? For some movies, it's "You had me at 'Hello.'" For others, the trailer falls short of expectations and you think "What we have here is a failure to communicate."

In this kernel I am going to answer some of these questions.

## Import required libraries

In [None]:
import pandas as pd
import numpy as np

# for visualizations
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
plt.style.use('dark_background')

# display multiple output in single cell
from IPython.display import display
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# data
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Data Loading and Exploration

#### Let's load only the train data for now, because the main goal of this notebook is data analysis and not a submission.

In [None]:
%time train = pd.read_csv('../input/tmdb-box-office-prediction/train.csv')

#### Let's have a look at sample data.

In [None]:
train.head(n=10)

## Let's look at some statistics of the data.

In [None]:
print('Shape of data is', train.shape)
print('The total number of movies is', train.shape[0])

## Let's check the information of the dataset.

In [None]:
train.info()

### About Dataset:
- **id**: Integer unique id of each movie
- **belongs_to_collection**: Contains the TMDB Id, Name, Movie Poster and Backdrop URL of a movie in JSON format.
- **budget**: Budget of a movie in dollars. Some rows contain 0, which means the budget is unknown.
- **genres**: Contains all the Genres Name & TMDB Id in JSON Format.
- **homepage**: Contains the official URL of a movie.
- **imdb_id**: IMDB id of a movie (string).
- **original_language**: Two-letter code of the original language in which the movie was made.
- **original_title**: The original title of a movie in original_language.
- **overview**: Brief description of the movie.
- **popularity**: Popularity of the movie.
- **poster_path**: Poster path of a movie. You can see the full poster image by appending the path to this URL --> https://image.tmdb.org/t/p/original/
- **production_companies**: Name and TMDB id of all production companies of a movie in JSON format.
- **production_countries**: Two-letter code and full name of the production countries in JSON format.
- **release_date**: Release date of a movie in mm/dd/yy format.
- **runtime**: Total runtime of a movie in minutes (Integer).
- **spoken_languages**: Two-letter code and full name of the spoken languages.
- **status**: Is the movie released or rumored?
- **tagline**: Tagline of a movie
- **title**: English title of a movie
- **Keywords**: TMDB Id and name of all the keywords in JSON format.
- **cast**: TMDB id, name, character name and gender (1 = Female, 2 = Male) of each cast member in JSON format.
- **crew**: Name, TMDB id and profile path of crew members, along with their jobs such as Director, Writer, Art, Sound etc.
- **revenue**: Total revenue earned by a movie in dollars.
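
The "JSON format" columns above arrive as plain strings. Here is a minimal sketch of how one such cell could be turned into a list of names, assuming the cells are Python-literal strings (single-quoted lists of dicts) and that missing cells are NaN. The sample value and the helper `parse_names` are hypothetical and only for illustration.

```python
import ast

# Hypothetical sample value, shaped like one cell of the `genres` column
raw_genres = "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]"

def parse_names(cell):
    """Return the 'name' values from a stringified list of dicts."""
    # Missing cells (NaN) are floats, not strings; treat them as empty lists
    if not isinstance(cell, str):
        return []
    # literal_eval safely evaluates Python literals without running code
    return [item['name'] for item in ast.literal_eval(cell)]

genre_names = parse_names(raw_genres)
print(genre_names)
```

On a real column this could be applied with something like `train['genres'].apply(parse_names)`, if the assumption about the string format holds.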

In [None]:
train.describe(include='all')

Let's check for missing values in the train data.

In [None]:
# checking NULL value

train.isnull().sum()
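
Note that `isnull()` only counts real NaN values. As described in the dataset notes, a budget of 0 means "unknown", so those rows are not counted as missing. A small sketch of how zeros could be treated as missing, using a hypothetical toy frame in place of `train`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `train`
toy = pd.DataFrame({'title': ['A', 'B', 'C'],
                    'budget': [14000000, 0, 3300000]})

# isnull() reports no missing budgets because 0 is a valid number
n_null = toy['budget'].isnull().sum()

# Treat 0 as "unknown" by replacing it with NaN, then count again
toy['budget'] = toy['budget'].replace(0, np.nan)
n_unknown = toy['budget'].isnull().sum()

print(n_null, n_unknown)
```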

### As we can see, some features contain dictionaries. I am dropping all such columns for now, along with the text columns tagline, overview and homepage.

In [None]:
train = train.drop(['belongs_to_collection', 'genres', 'crew', 'cast', 'Keywords', 
                  'spoken_languages', 'production_companies', 'production_countries', 'tagline','overview','homepage'], axis=1)

## Create new columns for release day, weekday, month and year.

In [None]:
train['release_date'] = pd.to_datetime(train['release_date'], format='%m/%d/%y')

# Two-digit years were being interpreted as future dates in some cases (e.g. 2021 instead of 1921),
# so shift those dates back 100 years before deriving the other columns
future = train['release_date'].dt.year >= 2018
train.loc[future, 'release_date'] = train.loc[future, 'release_date'] - pd.DateOffset(years=100)

train['release_day'] = train['release_date'].dt.day
train['release_weekday'] = train['release_date'].dt.weekday
train['release_month'] = train['release_date'].dt.month
train['release_year'] = train['release_date'].dt.year
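
The year adjustment is needed because two-digit years are ambiguous: with `%y`, values from 00 to 68 are read as 2000 to 2068. The sketch below shows the effect on hypothetical sample dates, assuming the mm/dd/yy format described earlier and the same 2018 cutoff. It also shows that a 100-year shift changes the weekday, so the weekday is best derived from the corrected date.

```python
import pandas as pd

# Hypothetical sample dates in mm/dd/yy format
raw = pd.Series(['2/20/15', '8/6/21', '12/25/99'])
parsed = pd.to_datetime(raw, format='%m/%d/%y')

# Dates that land on or after 2018 are assumed to belong to the 1900s
fixed = parsed.where(parsed.dt.year < 2018,
                     parsed - pd.DateOffset(years=100))

print(parsed.dt.year.tolist())
print(fixed.dt.year.tolist())

# The weekday of the shifted date differs from the weekday of the raw date
print(parsed.dt.weekday[1], fixed.dt.weekday[1])
```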

# Data Analysis and Visualization

## 1. Which movie made the highest revenue?

In [None]:
train[train['revenue'] == train['revenue'].max()]

In [None]:
train[['id','title','budget','revenue']].sort_values(['revenue'], ascending=False).head(10).style.background_gradient(subset='revenue', cmap='BuGn')

## The Avengers made the highest revenue.

## 2. Which movie is the most expensive?

In [None]:
train[train['budget'] == train['budget'].max()]

In [None]:
train[['id','title','budget', 'revenue']].sort_values(['budget'], ascending=False).head(10).style.background_gradient(subset=['budget', 'revenue'], cmap='PuBu')

## Pirates of the Caribbean: On Stranger Tides is the most expensive movie.

## 3. Which movie is the longest?

In [None]:
train[train['runtime'] == train['runtime'].max()]

In [None]:
# drop missing runtimes instead of plotting them as 0-hour movies
plt.hist(train['runtime'].dropna() / 60, bins=40);
plt.title('Distribution of length of film in hours', fontsize=16, color='white');
plt.xlabel('Duration of Movie in Hours');
plt.ylabel('Number of Movies');

In [None]:
train[['id','title','runtime', 'budget', 'revenue']].sort_values(['runtime'],ascending=False).head(10).style.background_gradient(subset=['runtime','budget','revenue'], cmap='YlGn')

## Carlos is the longest movie, with 338 minutes (5 hours and 38 minutes) of runtime. 
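
The conversion from minutes to hours and minutes can be checked with `divmod`, a quick sketch using the runtime quoted above:

```python
runtime_minutes = 338  # runtime of the longest movie, as quoted above

# divmod returns the quotient (hours) and the remainder (minutes)
hours, minutes = divmod(runtime_minutes, 60)
print(f"{hours} hours and {minutes} minutes")
```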

## 4. In which year were the most movies released?

In [None]:
plt.figure(figsize=(20,12))
sns.countplot(x=train['release_year'].sort_values(), palette="Dark2", edgecolor=(0,0,0))
plt.title("Movie Release Count by Year", fontsize=20)
plt.xlabel('Release Year')
plt.ylabel('Number of Movies Released')
plt.xticks(fontsize=12, rotation=90)
plt.show()

In [None]:
train['release_year'].value_counts().head()

## 2013 had the most releases, with a total of 141 movies.
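
Instead of reading the answer off the `value_counts()` output, the top year and its count can be pulled out directly. A minimal sketch on hypothetical toy data standing in for `train['release_year']`:

```python
import pandas as pd

# Hypothetical toy data standing in for train['release_year']
years = pd.Series([2013, 2013, 2015, 2013, 2015, 2010])

counts = years.value_counts()
top_year = counts.idxmax()   # index label (year) with the largest count
top_count = counts.max()     # the count itself

print(top_year, top_count)
```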

## 5. Movies with the highest and lowest popularity.

In [None]:
train[train['popularity']==train['popularity'].max()][['original_title','popularity','release_date','revenue']]

In [None]:
train[train['popularity']==train['popularity'].min()][['original_title','popularity','release_date','revenue']]

Let's create a popularity distribution plot.

In [None]:
plt.figure(figsize=(20,12))
# histplot is the replacement for the deprecated distplot(kde=False)
sns.histplot(train['popularity'])
plt.title("Movie Popularity Count", fontsize=20)
plt.xlabel('Popularity')
plt.ylabel('Count')
plt.xticks(fontsize=12, rotation=90)
plt.show()

## Wonder Woman has the highest popularity, 294.33, whereas Big Time has the lowest popularity, which is 0.

## 6. In which month were the most movies released from 1921 to 2017?

In [None]:
plt.figure(figsize=(20,12))
sns.countplot(x=train['release_month'].sort_values(), palette="Dark2", edgecolor=(0,0,0))
plt.title("Movie Release Count by Month", fontsize=20)
plt.xlabel('Release Month')
plt.ylabel('Number of Movies Released')
plt.xticks(fontsize=12)
plt.show()

In [None]:
train['release_month'].value_counts()

## Most movies are released in September, around 362.

## 7. On which day of the month are the most movies released?

In [None]:
plt.figure(figsize=(20,12))
sns.countplot(x=train['release_day'].sort_values(), palette="Dark2", edgecolor=(0,0,0))
plt.title("Movie Release Count by Day of Month", fontsize=20)
plt.xlabel('Release Day')
plt.ylabel('Number of Movies Released')
plt.xticks(fontsize=12)
plt.show()

In [None]:
train['release_day'].value_counts()

## The first day of the month has the highest number of releases, 152.

## 8. On which day of the week are the most movies released?

In [None]:
plt.figure(figsize=(20,12))
sns.countplot(x=train['release_weekday'].sort_values(), palette='Dark2')
# weekday() numbers the days from 0 (Monday) to 6 (Sunday)
loc = np.arange(train['release_weekday'].nunique())
day_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
plt.xlabel('Release Day of Week')
plt.ylabel('Number of Movies Released')
plt.xticks(loc, day_labels, fontsize=12)
plt.show()

In [None]:
train['release_weekday'].value_counts()

## The highest number of movies are released on Friday.
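
The weekday numbers run from 0 (Monday) to 6 (Sunday), which makes the raw `value_counts()` output hard to read. As a sketch, pandas can return the day names directly through `dt.day_name()`. The dates below are hypothetical samples, not rows from the dataset.

```python
import pandas as pd

# Hypothetical sample dates standing in for train['release_date']
dates = pd.to_datetime(pd.Series(['2000-01-01', '2024-01-01', '2021-08-06']))

# Day names instead of the 0-6 weekday numbers
day_names = dates.dt.day_name()

# Count per day, kept in calendar order and including days with no releases
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
         'Friday', 'Saturday', 'Sunday']
day_counts = day_names.value_counts().reindex(order, fill_value=0)

print(day_counts)
```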

### I am still updating this notebook.

### There are still a lot of questions to ask of the data.


I hope you liked my analysis and visualization.


If you have any doubts regarding any part of the notebook, feel free to ask in the comment box.

Thank you!!



# Work in Progress... ⏳