
# Project: TMDB Data Analysis Exercise

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

This analysis gathers the relevant columns of the Kaggle TMDB 5000 Movie Dataset and Credits Dataset, cleans the data, and explores it to find interesting facts and to answer questions about what makes a movie profitable.

The data description and background information are available at the link below:
https://www.kaggle.com/tmdb/tmdb-movie-metadata

In [None]:
#Importing essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json

# Enabling inline plots and applying the seaborn style
%matplotlib inline
plt.style.use('seaborn')   # Note: matplotlib 3.6+ renames this style to 'seaborn-v0_8'

<a id='wrangling'></a>
# Data Wrangling

## Movies data


In [None]:
# Loading the movies data
data1 = pd.read_csv('../input/tmdb-movie-metadata/tmdb_5000_movies.csv', encoding='utf-8')     # To deal with any non-English characters

# Inspecting the first five rows (the default for head())
data1.head()

In [None]:
# For the purposes of exploring the dataset, keeping the columns below:
# budget, id, original_title, popularity, release_date, revenue, production_companies, production_countries, genres, vote_average

# Filtering and re-ordering the above columns
movies_data = data1[['id', 'original_title','release_date','budget', 'revenue','production_companies','production_countries','genres','popularity','vote_average']].copy()

# Inspecting the first five observations
movies_data.head()

In [None]:
# Printing data info
print(movies_data.info())

In [None]:
# Converting release date to date type
movies_data['release_date'] = pd.to_datetime(movies_data['release_date'])

In [None]:
# Checking any duplicate values. No duplicate rows identified
movies_data[movies_data.duplicated()]

In [None]:
# Printing null/missing values per column
movies_data.isnull().sum()

In [None]:
# Checking the null value
movies_data[movies_data.isnull().any(axis=1)]

In [None]:
# There may be other blank placeholders, zeros and NaT values. Checking the column-wise totals once they are all treated as nan.
movies_data.replace(['?','', 'NaT','NA','N/A','None',0,0.0,'[]'], np.nan).isnull().sum()

A large number of movies have no value for production companies or production countries, and budget and revenue contain over 1,000 zero values. These will be converted to nan after the data is extracted from the dictionary-list columns, because those columns are stored as strings and an empty list ('[]') is not recognized as nan.
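
To illustrate why the dictionary-list columns need parsing first, here is a minimal sketch. The sample string is made up for illustration and only assumes the format such a column appears to use (a list of dictionaries with a `name` key, stored as text).

```python
import ast

# Hypothetical sample value: a list of dictionaries stored as plain text
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = ast.literal_eval(raw)                # str -> list of dicts
names = '|'.join(d['name'] for d in parsed)   # pipe-separated names

print(type(raw).__name__, '->', type(parsed).__name__)
print(names)

# An empty list is stored as the non-empty string '[]', so a null check misses it
empty = ast.literal_eval('[]')
print(len(empty))
```

Until the string is parsed, an empty list looks like any other text value, which is why it has to be replaced explicitly.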

In [None]:
# Defining a function to extract all names from a column holding a stringified list of dictionaries

import ast  # For literal_eval

def extract_json(col):
    # Parse the string into a list of dictionaries and join the 'name' values with '|'
    lst = ast.literal_eval(col)
    return '|'.join(dic['name'] for dic in lst)

# Many movies have multiple production companies
# Extracting the production company names from the production_companies column
movies_data['production_companies'] = movies_data['production_companies'].apply(extract_json)

In [None]:
# Defining a function to extract the first name from a column holding a stringified list of dictionaries
def extract_first(col):
    lst = ast.literal_eval(col)
    # Returns None when the list is empty
    return lst[0]['name'] if lst else None
    
# Adding a column for primary country from the production countries column as it seems the first country name is the primary production country    
movies_data['country'] = movies_data['production_countries'].apply(extract_first)

# Adding a column for primary genre as it seems the first genre is the primary genre for the movie
movies_data['genre'] = movies_data['genres'].apply(extract_first)

In [None]:
# Re-arranging and renaming columns and printing the first five rows
col_names = ['id', 'original_title', 'release_date', 'budget', 'revenue','production_companies','country', 'genre', 'popularity','vote_average']

movies_data = movies_data[col_names].copy()

# Renaming id to movie_id
movies_data.rename(columns={'id':'movie_id'},inplace=True)

# Printing the first five rows
movies_data.head()

In [None]:
# Replacing missing values, including 0, with nan. As this is an exploratory exercise, nan values have not been dropped.
movies_data.replace(['?','', 'NaT','NA','N/A','None','[]',0], np.nan, inplace=True)

# Printing null values per column and shape
print('Null values per column:\n', movies_data.isnull().sum())
print('Shape of the data frame:', movies_data.shape)

## Credits data

In [None]:
# Loading the credit data
credits_data = pd.read_csv('../input/tmdb-movie-metadata/tmdb_5000_credits.csv', encoding='utf-8')

# Inspecting the first five observations
credits_data.head()

In [None]:
# Printing data info
credits_data.info()

In [None]:
# Checking for any duplicate values
# No duplicate rows identified

duplicate_rows = credits_data[credits_data.duplicated()]
duplicate_rows

A visual inspection of the credits CSV shows that some entries span multiple lines in the file (e.g. rows 30, 223, 420, 607). Because there are no duplicates and both datasets have the same number of rows, the file appears to have been read correctly; those entries are simply extraordinarily long.
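
As a rough illustration of why long entries can occupy several lines of a CSV file yet still load as one row, the sketch below parses a made-up two-record file whose quoted field contains line breaks. The column names and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV: the second record's quoted field contains two line breaks
csv_text = 'movie_id,cast\n1,"short"\n2,"line one\nline two\nline three"\n'

physical_lines = len(csv_text.splitlines()) - 1   # lines in the file, excluding the header
demo = pd.read_csv(io.StringIO(csv_text))

print('Physical lines:', physical_lines)
print('Parsed rows:', len(demo))
```

The count of physical lines exceeds the count of parsed rows because line breaks inside quotes belong to the field, not to the row structure.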

In [None]:
# Printing null values per column
# No null values identified
credits_data.isnull().sum()

In [None]:
# Checking total nan
credits_data.replace(['?','', 'NaT','NA','N/A','None',0,'[]'], np.nan).isnull().sum()

The cast and crew columns contain entries that count as missing once placeholders such as '[]' are treated as nan. The relevant conversions will be carried out after the data is extracted from the dictionary-list columns.

In [None]:
# Extracting leading actor name which is the first name in the first dictionary of cast column 
credits_data['leading_actor'] = credits_data['cast'].apply(extract_first)

In [None]:
# Adding gender to categorize the lead actor by gender [1: female, 2: male; 0 typically means not specified]
# Defining a function to extract gender from the first dictionary

def extract_gender(col):
    lst = ast.literal_eval(col)
    # Returns None when the cast list is empty
    return lst[0]['gender'] if lst else None

# Converting to category dtype and taking the category codes. The codes match the gender
# values (1: female, 2: male) as long as 0, 1 and 2 all occur; -1 marks movies with no cast listed
credits_data['actor_gender'] = credits_data['cast'].apply(extract_gender).astype('category').cat.codes
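
The behaviour of the category codes can be checked on a small example. This is a sketch with made-up gender values, where `None` stands for a movie with an empty cast list.

```python
import pandas as pd

# Hypothetical extracted gender values; None marks a movie with no cast listed
genders = pd.Series([2, 1, 0, None, 2])

# Categories are the sorted distinct values (0, 1, 2), so each code equals its value here;
# missing values receive the code -1
codes = genders.astype('category').cat.codes
print(codes.tolist())
```

The codes only coincide with the gender values because 0, 1 and 2 are all present. If one value were absent, the codes would shift, so mapping the values directly would be a safer alternative.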

In [None]:
# Some movies have multiple directors. Selecting the first director for simpler analysis

# Extracting the first crew member whose job is Director
def extract_director(col):
    lst = ast.literal_eval(col)
    for dic in lst:
        if dic['job'] == "Director":
            return dic['name']
    return None  # No director listed

credits_data['director'] = credits_data['crew'].apply(extract_director)

In [None]:
# Replacing missing values including 0 with nan and printing total number of nan per column
credits_data.replace(['?','', 'NaT','NA','N/A','[]',0], np.nan, inplace=True)
print('Null values per column:\n', credits_data.isnull().sum())
print('Shape of the data frame:', credits_data.shape)

In [None]:
# Filtering columns and printing the first five rows
col_names = ['movie_id', 'leading_actor', 'actor_gender', 'director']
credits_data = credits_data[col_names].copy()
credits_data.head()

In [None]:
# Joining lead actor, actor gender and director columns to movies_data and verifying its shape

print('Shape of movies_data:',movies_data.shape)
print('Shape of credits_data:',credits_data.shape)

movies_data = pd.merge(movies_data, credits_data, on='movie_id', how='left' )
print('Shape of movies_data after merging the tables:',movies_data.shape)
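
Comparing shapes confirms that the row count is unchanged, but it does not show which rows found a match. The sketch below uses two made-up miniature tables and the `validate` and `indicator` options of `pd.merge` to check both; it assumes `movie_id` is meant to be unique in each table.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
movies = pd.DataFrame({'movie_id': [1, 2, 3], 'revenue': [100.0, 250.0, 75.0]})
credits = pd.DataFrame({'movie_id': [1, 2], 'director': ['A', 'B']})

merged = pd.merge(
    movies, credits, on='movie_id', how='left',
    validate='one_to_one',   # raises MergeError if movie_id repeats on either side
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

print(merged.shape)
print(merged['_merge'].value_counts())
```

Rows marked `left_only` are movies without a credits record, which would otherwise appear only as nan values in the joined columns.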

In [None]:
# Printing the first five rows of the dataframe
movies_data.head()

<a id='eda'></a>
## Exploratory Data Analysis

### Exploratory questions

In [None]:
#Providing summary statistics for the data set
movies_data.describe()

1. Which movies have the highest and the lowest revenue?

In [None]:
print('Highest revenue: \n',movies_data[['original_title','revenue']][movies_data['revenue'] == movies_data['revenue'].max()])
print()
print('Lowest revenue: \n',movies_data[['original_title','revenue']][movies_data['revenue'] == movies_data['revenue'].min()])

2. Which movies had the largest and the smallest budget?

In [None]:
print('Highest budget: \n', movies_data[['original_title','budget']][movies_data['budget'] == movies_data['budget'].max()])
print()
print('Lowest budget: \n',movies_data[['original_title','budget']][movies_data['budget'] == movies_data['budget'].min()])

3. Which is the most popular movie in the dataset?

In [None]:
print('Most popular movie: \n',movies_data[['original_title','popularity']][movies_data['popularity'] == movies_data['popularity'].max()])

4. Which movies had the highest and the lowest revenue in 2016?


In [None]:
# Adding column for year of release
movies_data['release_year'] = movies_data['release_date'].dt.year

# Movie with highest revenue
print('Highest revenue: \n',movies_data[['original_title','revenue']][movies_data['release_year']==2016].dropna().sort_values(by='revenue', ascending=False).head(1))

In [None]:
# Movie with lowest revenue
print('Lowest revenue: \n',movies_data[['original_title','revenue']][movies_data['release_year']==2016].dropna().sort_values(by='revenue').head(1))
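
The cells above rank movies by revenue. If profit is of interest instead, one possible approach is to subtract budget from revenue, as in the sketch below. The rows are made up, and the `profit` column is an illustrative addition that is not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; NaN marks a missing budget or revenue
demo = pd.DataFrame({
    'original_title': ['Film A', 'Film B', 'Film C', 'Film D'],
    'budget':  [100.0, 50.0, np.nan, 80.0],
    'revenue': [300.0, 20.0, 500.0, 90.0],
})

# Profit is NaN wherever either input is missing, so those rows are dropped before ranking
demo['profit'] = demo['revenue'] - demo['budget']
ranked = demo.dropna(subset=['profit']).sort_values('profit', ascending=False)

print(ranked.head(1))   # most profitable
print(ranked.tail(1))   # largest loss
```

Because many budgets and revenues are missing, a profit ranking would cover fewer movies than a revenue ranking.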

5. Who are the top ten male and female leading actors in terms of number of movies?

In [None]:
# Leading female actors
movies_data['leading_actor'][movies_data['actor_gender']==1].value_counts().head(10)

In [None]:
# Leading male actors
movies_data['leading_actor'][movies_data['actor_gender']==2].value_counts().head(10)

6. Who are the top ten leading actors by aggregate movie revenue?

In [None]:
# Summing only the revenue column per leading actor
movies_data.groupby('leading_actor')[['revenue']].sum().sort_values(by='revenue', ascending=False).head(10)

7. Who are the top ten directors by aggregate movie revenue?

In [None]:
# Summing only the revenue column per director
movies_data.groupby('director')[['revenue']].sum().sort_values(by='revenue', ascending=False).head(10)

8. Which is the most popular movie in each genre?

In [None]:
movies_data[['movie_id','original_title','genre','popularity','production_companies']].loc[movies_data.groupby(['genre'])['popularity'].idxmax()]
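
The lookup above relies on `idxmax` returning, for each group, the index label of the row holding the maximum. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical movies with a genre and a popularity score
demo = pd.DataFrame({
    'original_title': ['Film A', 'Film B', 'Film C', 'Film D'],
    'genre': ['Drama', 'Drama', 'Comedy', 'Comedy'],
    'popularity': [10.0, 35.0, 22.0, 8.0],
})

# One index label per genre, pointing at that genre's most popular row
top_idx = demo.groupby('genre')['popularity'].idxmax()
top = demo.loc[top_idx]

print(top)
```

Passing those labels to `.loc` retrieves the full rows, so other columns of the winning movie come along with it.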

9. Which movie has the highest revenue in each genre?

In [None]:
movies_data[['movie_id','original_title','genre','revenue','production_companies']].loc[movies_data.groupby(['genre'])['revenue'].idxmax().dropna()]

# Exploring whether any variables drive higher revenues

1. Are movie revenues and budget related?

In [None]:
# Computing the correlation matrix

# Dropping rows with nan
plot_df = movies_data.dropna()

# Numerical column names
col_names = ['budget', 'revenue', 'popularity', 'vote_average']

# Filtering data frame with numerical columns
plot_df_std = plot_df[col_names]

# Standardizing the dataset (Pearson correlations are unaffected by this rescaling)
from sklearn.preprocessing import StandardScaler
plot_df_std = StandardScaler().fit_transform(plot_df_std)

# Creating a dataframe with standardized columns
plot_df_std = pd.DataFrame(plot_df_std, columns=col_names, index=plot_df.index)

# Displaying the Pearson correlation matrix
plot_df_std.corr(method='pearson')
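
Pearson correlation does not change when each column is shifted and rescaled, so the matrix is the same with or without standardization. The sketch below checks this on synthetic data; the column names and numbers are made up, and the standardization is done by hand rather than with `StandardScaler`.

```python
import numpy as np
import pandas as pd

# Synthetic data: revenue depends linearly on budget plus noise
rng = np.random.default_rng(0)
raw = pd.DataFrame({'budget': rng.normal(50, 20, 200)})
raw['revenue'] = 3 * raw['budget'] + rng.normal(0, 40, 200)

# Manual standardization: subtract the mean and divide by the standard deviation
standardized = (raw - raw.mean()) / raw.std()

# The two correlation matrices agree up to floating-point error
same = np.allclose(raw.corr(), standardized.corr())
print(same)
```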

In [None]:
# Budget vs revenue
plt.scatter(plot_df['budget'],plot_df['revenue'], alpha=0.6);
plt.xlabel('Budget')
plt.ylabel('Revenue')
plt.title('Budget vs Revenue');

From the plot and the correlation matrix, there appears to be a moderate linear relationship between budget and revenue.

In [None]:
# Budget vs popularity
plt.scatter(plot_df['budget'], plot_df['popularity'], alpha=0.6);   # x-axis uses budget, matching the labels
plt.xlabel('Budget')
plt.ylabel('Popularity')
plt.title('Budget vs Popularity');

From the plot and the correlation matrix, there appears to be a weak to moderate linear relation between budget and popularity.

2. Are certain popular directors/actors associated with above-average revenues?

In [None]:
# Calculating mean for revenue
revenue_mean = movies_data['revenue'].dropna().mean()

# Filtering leading actors and directors for movies with above-average revenues
revenue_df = movies_data[['leading_actor','director','revenue']][movies_data['revenue'] > revenue_mean]

# Printing the shape of the dataframe
revenue_df.shape

In [None]:
# Counting above-average-revenue movies per leading actor and keeping actors with more than five
la_group = revenue_df[['leading_actor','revenue']].groupby(revenue_df['leading_actor'])
size1 = la_group.size() 
size1[size1>5]

In [None]:
# Share of above-average-revenue movies whose leading actor has more than five such movies
size1[size1>5].sum()/len(revenue_df)

Leading actors with more than five above-average-revenue movies account for approximately 42% of all above-average-revenue movies, which is a substantial share.
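
To make the ratio concrete, here is a small worked example with made-up actor names. The threshold is lowered from five to two only so that the toy data stays short.

```python
import pandas as pd

# Hypothetical lead actors of six above-average-revenue movies
leads = pd.Series(['Actor A', 'Actor A', 'Actor A', 'Actor B', 'Actor C', 'Actor C'])

counts = leads.groupby(leads).size()   # movies per actor: A=3, B=1, C=2
frequent = counts[counts > 2]          # actors above the (lowered) threshold
share = frequent.sum() / len(leads)    # share of movies, not share of actors

print(frequent)
print(share)
```

Here one actor out of three clears the threshold, yet that actor's movies make up half of the list, which shows that the ratio measures movies rather than people.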

In [None]:
# Counting above-average-revenue movies per director and keeping directors with more than five
d_group = revenue_df[['director','revenue']].groupby(revenue_df['director'])
size2 = d_group.size() 
size2[size2>5]

In [None]:
# Share of above-average-revenue movies whose director has more than five such movies
size2[size2>5].sum()/len(revenue_df)

Directors with more than five above-average-revenue movies account for approximately 19% of all above-average-revenue movies.

3. Do movies released in a specific month earn more revenue?

In [None]:
# Adding a column for month of release to the dataframe with dropped nan values

import calendar

# Working on an explicit copy avoids SettingWithCopyWarning without silencing it globally
plot_df = plot_df.copy()

# Mapping the month number of the release date to its calendar abbreviation (1 -> 'Jan', ...)
month_dict = dict(enumerate(calendar.month_abbr))
plot_df['release_month'] = plot_df['release_date'].dt.month.map(month_dict)
plot_df.head()


In [None]:
# Box-plots for revenue by month
order_list = calendar.month_abbr # List of months in order

sns.boxplot(y='revenue', x = 'release_month', data=plot_df,order=order_list[1:13])
plt.xlabel('Month')
plt.ylabel('Revenue')
plt.title('Revenue vs Release month');

May-July and November-December appear to have a larger spread of revenues than other months. This could reflect more major releases coinciding with school holidays and festive periods.
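
A box plot shows spread visually. A per-month summary table could complement it by separating the number of releases from the spread itself. The sketch below uses made-up revenues for two months.

```python
import pandas as pd

# Hypothetical revenues for two release months
demo = pd.DataFrame({
    'release_month': ['May', 'May', 'May', 'Sep', 'Sep'],
    'revenue': [900.0, 100.0, 500.0, 200.0, 300.0],
})

# Count, median and standard deviation of revenue per month
summary = demo.groupby('release_month')['revenue'].agg(['count', 'median', 'std'])
print(summary)
```

A month with both a high count and a high standard deviation would be consistent with the holiday-release explanation, although this summary alone cannot confirm it.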

4. Do certain genres generate more revenue?

In [None]:
# Box-plots for revenue by genre
# Adding a horizontal line for mean revenue value

# Calculating mean for revenue 
rev_mean = plot_df['revenue'].mean()

sns.boxplot(y='revenue', x = 'genre', data=plot_df).axhline(rev_mean, color='red', alpha=0.6) 
plt.xlabel('Genre')
plt.ylabel('Revenue')
plt.title('Revenue vs Genre')
plt.xticks(rotation=45,ha='right');     # To rotate x-axis ticks  

Adventure and Animation genres appear to have above-average revenues. Fantasy, Science Fiction and Family genres have near-average revenues, with Family showing a broader range of revenue values. The outliers are likely the high-revenue movies within each genre.

<a id='conclusions'></a>
## Conclusions

From the above analysis, it appears that no single factor but rather a mixture of the chosen columns (leading actor, director, budget, time of release and genre) may lead to more successful movies in terms of revenue. The exploration undertaken in this exercise is limited, and the analysis does not support any statistical conclusions.