# TMDb Movie Data Analysis

## Introduction

This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.

- Some columns, such as ‘cast’ and ‘genres’, contain multiple values separated by pipe (|) characters.
- There are some odd characters in the ‘cast’ column. Don’t worry about cleaning them. You can leave them as is.
- The final two columns ending with “_adj” show the budget and revenue of the associated movie in terms of 2010 dollars, accounting for inflation over time.
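
As a minimal sketch of how the pipe-separated columns could be handled, the snippet below splits a `genres`-style column and gives each value its own row. The toy frame and its titles are hypothetical stand-ins for the real data, and `str.split` followed by `explode` is only one possible approach.

```python
import pandas as pd

# Hypothetical toy rows that mimic the pipe-separated 'genres' column
toy = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "genres": ["Action|Adventure", "Drama"],
})

# Split each string on '|' into a list, then expand every list item into its own row
genres_long = (
    toy.assign(genres=toy["genres"].str.split("|"))
       .explode("genres")
       .reset_index(drop=True)
)

# Count how often each genre appears across all movies
genre_counts = genres_long["genres"].value_counts()

print(genres_long)
print(genre_counts)
```

The long format makes per-genre counts and group-bys straightforward, at the cost of repeating each movie once per genre.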


__Questions to research in the analysis__
- Which genres are most popular from year to year?
- What kinds of properties are associated with movies that have high revenues?
- Which directors have made the most movies?
- Which genres are the most popular overall?
- Which movie earned the most, or was the most popular, in a given year?


> Max budget, max revenue, max profit, oldest movie

In [6]:
# Import libraries

import pandas as pd  
import numpy as np 
import matplotlib.pyplot as plt  
%matplotlib inline

In [7]:
df = pd.read_csv('tmdb-movies.csv')

In [8]:
# Visual Assessment 
df.head(2)

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0


In [9]:
# Column information & Null Values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date       

- Only a few null values exist in `cast`, `director`, `overview` & `genres`, so those rows can be dropped.
- Many null values exist in the `homepage`, `tagline`, `keywords` & `production_companies` columns. We can also drop `keywords` & `production_companies`, since this data is not necessary for the analysis.
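
To compare columns by how much data is missing, the share of nulls per column can be computed directly. This is a small sketch on a hypothetical toy frame (the values are made up for illustration); on the real data the same two calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy frame with a sparsely filled and a mostly filled column
toy = pd.DataFrame({
    "cast": ["A|B", None, "C", "D"],
    "homepage": [None, None, None, "http://example.com"],
    "runtime": [90, 100, 110, 120],
})

# isnull() gives a boolean frame; the column-wise mean is the fraction of nulls
null_share = toy.isnull().mean().sort_values(ascending=False)

print(null_share)
```

Columns with a small share are candidates for dropping the affected rows, while columns with a large share are candidates for being dropped entirely.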

Descriptive Statistics

In [10]:
df.describe()

Unnamed: 0,id,popularity,budget,revenue,runtime,vote_count,vote_average,release_year,budget_adj,revenue_adj
count,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0
mean,66064.177434,0.646441,14625700.0,39823320.0,102.070863,217.389748,5.974922,2001.322658,17551040.0,51364360.0
std,92130.136561,1.000185,30913210.0,117003500.0,31.381405,575.619058,0.935142,12.812941,34306160.0,144632500.0
min,5.0,6.5e-05,0.0,0.0,0.0,10.0,1.5,1960.0,0.0,0.0
25%,10596.25,0.207583,0.0,0.0,90.0,17.0,5.4,1995.0,0.0,0.0
50%,20669.0,0.383856,0.0,0.0,99.0,38.0,6.0,2006.0,0.0,0.0
75%,75610.0,0.713817,15000000.0,24000000.0,111.0,145.75,6.6,2011.0,20853250.0,33697100.0
max,417859.0,32.985763,425000000.0,2781506000.0,900.0,9767.0,9.2,2015.0,425000000.0,2827124000.0


## Data Wrangling

### Gathering Data: 
The data is imported from a CSV file into the notebook as a pandas DataFrame.

### Assessing Data:

This [link](https://www.themoviedb.org/talk/5141d424760ee34da71431b0) states that the `popularity` score is based on the number of unique views on the website, together with the number of ratings, favourites & watched list additions. It also has no upper bound.

In [11]:
# Filter budget data 

df_zero_budget = df.query('budget == 0')
df_zero_budget.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
30,280996,tt3168230,3.927333,0,29355203,Mr. Holmes,Ian McKellen|Milo Parker|Laura Linney|Hattie M...,http://www.mrholmesfilm.com/,Bill Condon,The man behind the myth,...,"The story is set in 1947, following a long-ret...",103,Mystery|Drama,BBC Films|See-Saw Films|FilmNation Entertainme...,6/19/15,425,6.4,2015,0.0,27006770.0
36,339527,tt1291570,3.358321,0,22354572,Solace,Abbie Cornish|Jeffrey Dean Morgan|Colin Farrel...,,Afonso Poyart,"A serial killer who can see your future, a psy...",...,"A psychic doctor, John Clancy, works with an F...",101,Crime|Drama|Mystery,Eden Rock Media|FilmNation Entertainment|Flynn...,9/3/15,474,6.2,2015,0.0,20566200.0
72,284289,tt2911668,2.272044,0,45895,Beyond the Reach,Michael Douglas|Jeremy Irvine|Hanna Mangan Law...,,Jean-Baptiste LÃ©onetti,,...,A high-rolling corporate shark and his impover...,95,Thriller,Furthur Films,4/17/15,81,5.5,2015,0.0,42223.38
74,347096,tt3478232,2.165433,0,0,Mythica: The Darkspore,Melanie Stone|Kevin Sorbo|Adam Johnson|Jake St...,http://www.mythicamovie.com/#!blank/wufvh,Anne K. Black,,...,When Teelaâ€™s sister is murdered and a powerf...,108,Action|Adventure|Fantasy,Arrowstorm Entertainment,6/24/15,27,5.1,2015,0.0,0.0
75,308369,tt2582496,2.141506,0,0,Me and Earl and the Dying Girl,Thomas Mann|RJ Cyler|Olivia Cooke|Connie Britt...,http://www.foxsearchlight.com/meandearlandthed...,Alfonso Gomez-Rejon,A Little Friendship Never Killed Anyone.,...,Greg is coasting through senior year of high s...,105,Comedy|Drama,Indian Paintbrush,6/12/15,569,7.7,2015,0.0,0.0


- An internet search shows that **Beyond the Reach** had a budget of $396k, so the zero recorded in `budget` is not the true value.

In [12]:
# Filter data
df_zero_revenue = df.query('revenue == 0')
df_zero_revenue.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
48,265208,tt2231253,2.93234,30000000,0,Wild Card,Jason Statham|Michael Angarano|Milo Ventimigli...,,Simon West,Never bet against a man with a killer hand.,...,When a Las Vegas bodyguard with lethal skills ...,92,Thriller|Crime|Drama,Current Entertainment|Lionsgate|Sierra / Affin...,1/14/15,481,5.3,2015,27599990.0,0.0
67,334074,tt3247714,2.331636,20000000,0,Survivor,Pierce Brosnan|Milla Jovovich|Dylan McDermott|...,http://survivormovie.com/,James McTeigue,His Next Target is Now Hunting Him,...,A Foreign Service Officer in London tries to p...,96,Crime|Thriller|Action,Nu Image Films|Winkler Films|Millennium Films|...,5/21/15,280,5.4,2015,18399990.0,0.0
74,347096,tt3478232,2.165433,0,0,Mythica: The Darkspore,Melanie Stone|Kevin Sorbo|Adam Johnson|Jake St...,http://www.mythicamovie.com/#!blank/wufvh,Anne K. Black,,...,When Teelaâ€™s sister is murdered and a powerf...,108,Action|Adventure|Fantasy,Arrowstorm Entertainment,6/24/15,27,5.1,2015,0.0,0.0
75,308369,tt2582496,2.141506,0,0,Me and Earl and the Dying Girl,Thomas Mann|RJ Cyler|Olivia Cooke|Connie Britt...,http://www.foxsearchlight.com/meandearlandthed...,Alfonso Gomez-Rejon,A Little Friendship Never Killed Anyone.,...,Greg is coasting through senior year of high s...,105,Comedy|Drama,Indian Paintbrush,6/12/15,569,7.7,2015,0.0,0.0
92,370687,tt3608646,1.876037,0,0,Mythica: The Necromancer,Melanie Stone|Adam Johnson|Kevin Sorbo|Nicola ...,http://www.mythicamovie.com/#!blank/y9ake,A. Todd Smith,,...,Mallister takes Thane prisoner and forces Mare...,0,Fantasy|Action|Adventure,Arrowstorm Entertainment|Camera 40 Productions...,12/19/15,11,5.4,2015,0.0,0.0


- [Wikipedia](https://en.wikipedia.org/wiki/Wild_Card_(2015_film)) states that **Wild Card** had both a budget and a box office collection.
- The movie **Mythica: The Darkspore** received a [tax credit](https://en.wikipedia.org/wiki/Mythica_(film_series)) for filming in Governor's office

Hence, the zero values in the `budget` & `revenue` columns represent missing values in the data set.

Let's count the total zero values. Based on the total number of missing values we can drop or retain them.

In [13]:
print("Total missing values in budget column:",df_zero_budget.shape[0])
print("Total missing values in revenue column:",df_zero_revenue.shape[0])


Total missing values in budget column: 5696
Total missing values in revenue column: 6016


More than half of the rows in the dataset have missing values in the `budget` & `revenue` columns (5696 and 6016 of 10866 rows, roughly 52% and 55% respectively).
Instead of dropping these rows, we replace the zeros with null values.
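
The reasoning behind this choice can be illustrated on a hypothetical toy series: zeros are treated as real values by aggregations such as `mean()`, whereas null values are skipped by default. The numbers below are made up and only show the effect.

```python
import numpy as np
import pandas as pd

# Hypothetical budgets where 0 stands for "unknown"
budget = pd.Series([0, 0, 100, 300])

# With zeros kept, the unknown budgets drag the average down
mean_with_zeros = budget.mean()

# With zeros marked as missing, mean() skips them by default
budget_missing = budget.replace(0, np.nan)
mean_without_zeros = budget_missing.mean()

print(mean_with_zeros, mean_without_zeros)
```

Keeping the rows also preserves the other columns (genres, cast, ratings) for questions that do not involve money.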

In [14]:
df.query('runtime == 0').count()

id                      31
imdb_id                 31
popularity              31
budget                  31
revenue                 31
original_title          31
cast                    31
homepage                 6
director                29
tagline                  5
keywords                15
overview                29
runtime                 31
genres                  30
production_companies    13
release_date            31
vote_count              31
vote_average            31
release_year            31
budget_adj              31
revenue_adj             31
dtype: int64

#### Issues
- Drop unnecessary columns 
- Drop duplicates
- Drop Null values from `cast`, `director`, `genres`
- Fix: Zero values in `budget` & `revenue` columns.
- Drop zero value records in `runtime`
- Fix `release_date` dtype

- Split values in `genres` & `cast` into separate columns.
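
One pitfall worth checking for the `release_date` item: the raw values use two-digit years (for example `6/9/15`), and the usual `%y` convention maps 00 to 68 onto 2000 to 2068, while `release_year` goes back to 1960. The sketch below, on a hypothetical toy frame, rebuilds each date from the four-digit `release_year` column. Treating `release_year` as the trusted source is an assumption, not something the analysis itself states.

```python
import pandas as pd

# Hypothetical rows in the same m/d/yy layout as the 'release_date' column
sample = pd.DataFrame({
    "release_date": ["6/9/15", "10/5/62", "12/25/99"],
    "release_year": [2015, 1962, 1999],
})

# Parse with an explicit format; the year part may land in the wrong century
parsed = pd.to_datetime(sample["release_date"], format="%m/%d/%y")

# Rebuild the dates from the four-digit year plus the parsed month and day
fixed = pd.to_datetime(pd.DataFrame({
    "year": sample["release_year"],
    "month": parsed.dt.month,
    "day": parsed.dt.day,
}))

print(fixed)
```

Comparing `parsed.dt.year` with `release_year` is a quick way to see whether any rows are affected before applying a correction.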

### Cleaning Data:

In [15]:
# Define: Drop unnecessary columns

# Code
col = ['imdb_id', 'homepage', 'tagline', 'overview', 'budget_adj', 'revenue_adj']
df.drop(col, axis = 1, inplace = True)

# Test
df.columns

Index(['id', 'popularity', 'budget', 'revenue', 'original_title', 'cast',
       'director', 'keywords', 'runtime', 'genres', 'production_companies',
       'release_date', 'vote_count', 'vote_average', 'release_year'],
      dtype='object')

In [16]:
# Define: Drop duplicated values

# Code
df.drop_duplicates(inplace=True)

# Test 
df.duplicated().any()

False

In [17]:
# Define: Drop Null values from cast, director, genres columns

# Code
cols = ['cast', 'director', 'genres']
df.dropna(subset = cols, how = 'any', inplace = True)

# Test
df.isnull().sum()

id                         0
popularity                 0
budget                     0
revenue                    0
original_title             0
cast                       0
director                   0
keywords                1425
runtime                    0
genres                     0
production_companies     959
release_date               0
vote_count                 0
vote_average               0
release_year               0
dtype: int64

In [18]:
#Define: Zero values in `budget` & `revenue` column

# Code

# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df['budget'] = df['budget'].replace(0, np.nan)
df['revenue'] = df['revenue'].replace(0, np.nan)

# Test
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10731 entries, 0 to 10865
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10731 non-null  int64  
 1   popularity            10731 non-null  float64
 2   budget                5153 non-null   float64
 3   revenue               4843 non-null   float64
 4   original_title        10731 non-null  object 
 5   cast                  10731 non-null  object 
 6   director              10731 non-null  object 
 7   keywords              9306 non-null   object 
 8   runtime               10731 non-null  int64  
 9   genres                10731 non-null  object 
 10  production_companies  9772 non-null   object 
 11  release_date          10731 non-null  object 
 12  vote_count            10731 non-null  int64  
 13  vote_average          10731 non-null  float64
 14  release_year          10731 non-null  int64  
dtypes: float64(4), int6

The result above shows that `budget` now has 5153 non-null values, so the zeros in the remaining 5578 of 10731 rows were replaced with null values. The same replacement was applied to the `revenue` column.

In [19]:
# Define: Drop zero values in 'runtime' column

# Code
df.query('runtime != 0', inplace = True)

# Test
df.query('runtime == 0')

Unnamed: 0,id,popularity,budget,revenue,original_title,cast,director,keywords,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year


In [20]:
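# Fix 'release_date' dtype
# Note: dates are stored as m/d/yy, so two-digit years below 69 may be parsed as 20xx;
# cross-check the parsed year against 'release_year' before relying on it.
df['release_date'] = pd.to_datetime(df['release_date'])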

# Check
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10703 entries, 0 to 10865
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   id                    10703 non-null  int64         
 1   popularity            10703 non-null  float64       
 2   budget                5150 non-null   float64       
 3   revenue               4843 non-null   float64       
 4   original_title        10703 non-null  object        
 5   cast                  10703 non-null  object        
 6   director              10703 non-null  object        
 7   keywords              9293 non-null   object        
 8   runtime               10703 non-null  int64         
 9   genres                10703 non-null  object        
 10  production_companies  9759 non-null   object        
 11  release_date          10703 non-null  datetime64[ns]
 12  vote_count            10703 non-null  int64         
 13  vote_average    