# Investigate TMDb Movies Dataset (Data Analysis For Beginners)


## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

>The Movie Database (TMDb) dataset was originally published on Kaggle and was provided by Udacity. It contains information about roughly 10,000 movies released from 1960 to 2015, collected from The Movie Database (TMDb), including user ratings and revenue. The questions that I will explore over the course of the report:

>    • Q1 : What kind of properties are associated with movies that have high profit?
>
>        •   What are the average budget, average revenue and average popularity of high profit movies?
>        •   In which month and year do movies make the most profit?
>        •   Which genres, cast members, directors and production companies appear most often in high profit movies?
>
>    • Q2 : What kind of properties are associated with movies that have high ratings?
>
>        •   What are the average budget, average revenue and average popularity of high rating movies?
>        •   Which genres, cast members, directors and production companies appear most often in high rating movies?

In [None]:
# Import the Python packages I plan to use.

import pandas as pd 
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

> In this section of the report, I will load the data, check it for cleanliness, and then trim and clean the dataset for analysis.

### General Properties

>Below are the questions to answer using pandas to explore tmdb-movies.csv and gain a holistic understanding of the dataset:
>
>    • Number of samples and columns in the dataset (shape of the data).
>    • Datatypes of the columns.
>    • Descriptive statistics for the dataset.
>    • Features with missing values.
>    • Duplicate rows in the dataset.
>    • Number of unique values in each column.
>    • Number of rows with missing values in the dataset.
>    • Number of zero values in `runtime`, `budget_adj` and `revenue_adj`.
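
The checklist above can be gathered into a single helper. This is only a minimal sketch: the function name `summarize` and the toy DataFrame are assumptions made for illustration, and the toy values are made up rather than taken from tmdb-movies.csv.

```python
import pandas as pd


def summarize(frame, zero_cols):
    """Collect the basic inspection numbers for a DataFrame in one dict."""
    return {
        'shape': frame.shape,
        'duplicate_rows': int(frame.duplicated().sum()),
        'rows_with_missing': int(frame.isnull().any(axis=1).sum()),
        'missing_per_column': frame.isnull().sum().to_dict(),
        'unique_per_column': frame.nunique().to_dict(),
        'zeros_per_column': (frame[zero_cols] == 0).sum().to_dict(),
    }


# Toy data standing in for the real file (hypothetical values).
toy = pd.DataFrame({
    'runtime': [120, 0, 95, 95],
    'budget_adj': [1.0e6, 0.0, 2.5e6, 2.5e6],
    'revenue_adj': [3.0e6, 0.0, 0.0, 0.0],
    'director': ['A', None, 'B', 'B'],
})

summary = summarize(toy, ['runtime', 'budget_adj', 'revenue_adj'])
for name, value in summary.items():
    print(name, value)
```

The cells that follow answer the same questions one at a time, which is easier to read in a report. A helper like this is mostly useful for re-checking the data after each cleaning step.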

In [None]:
# Load your data and print out a few lines. Perform operations to inspect data
#   types and look for instances of missing or possibly errant data.

df = pd.read_csv('../input/tmbd-movies-dataset-udacity/tmdb-movies.csv')
df.head(3)

#### Number of rows & columns in the dataset
Based on the cell below, there are a total of 10,866 movies (rows) and 21 columns in the dataset.

In [None]:
# To check the size of the dataset
df.shape

#### Datatype of columns
According to the cell below, ``release_date`` is not stored as a date datatype, and ``budget_adj`` and ``revenue_adj`` are floats, so I will revisit the datatypes in the data cleaning step.

In [None]:
df.info()

#### Descriptive statistics for the dataset
Based on the table shown below, at least 50% of the values in `budget` and `revenue` are zero, since the 50% row of both columns is 0. The minimum value for `runtime` is 0, which means that `runtime` contains zero values as well.

In [None]:
df.describe()

#### Number of missing values in the dataset
Several columns have null values according to the result below: `imdb_id`, `cast`, `homepage`, `director`, `tagline`, `keywords`, `overview`, `genres`, `production_companies`. Some columns have a large number of missing values, such as `homepage`, `tagline`, `keywords` and `production_companies`, but I will drop most of them since they are not necessary for my questions.

In [None]:
df.isnull().sum()

#### Duplicate rows in the dataset
According to the cell below, there is only 1 duplicate row, which I will drop in the data cleaning section.

In [None]:
#Number of duplicate rows
df.duplicated().sum()

In [None]:
# Confirm that the duplicated rows are exactly the same
df[df.duplicated(keep=False)]

#### Number of unique values for the dataset
The cell below shows the total number of unique values for each column.

In [None]:
df.nunique()

#### Number of rows with missing values
The cell below counts the rows that have a missing value in at least one column.

In [None]:
df.isnull().any(axis=1).sum()

#### Number of zero Values in `runtime`, `budget_adj`, `revenue_adj`
I need the total number of zero values to decide whether to drop the rows that contain them.

In [None]:
col_with_zero = ['runtime','budget_adj','revenue_adj']
for i in col_with_zero:
    zero_count = (df[i] == 0).sum()
    print('`{}` has {} zero values'.format(i,zero_count))

>**Observations:** 
>- There are 5,696 rows with a zero value in `budget_adj` and 6,016 rows with a zero value in `revenue_adj`, which is a huge amount of missing data for these two columns. Dropping them would remove more than 50% of the data and affect my statistics and visualizations, so I decided to retain these rows and replace the zeros with mean values.
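
The share of zero values behind this decision can be read off directly, because the mean of a boolean mask is the fraction of `True` entries. The sketch below uses a small made-up DataFrame, so the printed shares are not the real dataset's.

```python
import pandas as pd

# Toy data (hypothetical values) with the three columns checked above.
toy = pd.DataFrame({
    'runtime': [120, 0, 95, 101],
    'budget_adj': [1.0e6, 0.0, 0.0, 2.5e6],
    'revenue_adj': [3.0e6, 0.0, 0.0, 0.0],
})
cols = ['runtime', 'budget_adj', 'revenue_adj']

is_zero = toy[cols] == 0          # boolean mask, True where the value is 0
zero_counts = is_zero.sum()       # number of zeros per column
zero_share = is_zero.mean()       # fraction of zeros per column

print(pd.DataFrame({'zeros': zero_counts, 'share': zero_share}))
```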

### Data Cleaning 
After discussing the structure of the dataset and the problems that need to be cleaned, the cleaning steps are as follows:

1. Drop unimportant columns.
2. Drop duplicate rows.
3. Replace zero values with NaN.
4. Replace NaN values with mean.
5. Add new column `profit`.
6. Convert `release_date` column to date datatype.

#### Drop Unimportant Columns
- Drop columns that aren't related to my questions. Columns I will drop: `id`, `imdb_id`, `homepage`, `tagline`, `keywords`, `overview`, `vote_count`.
- `budget` and `revenue` will also be dropped, as I will be using the final two columns ending with “_adj”, which show the budget and revenue of the associated movie in terms of 2010 dollars, accounting for inflation over time.

In [None]:
# drop columns from dataset.
df.drop(['id','imdb_id','homepage','tagline','keywords','overview'
         ,'budget','revenue','vote_count'],axis = 1,inplace = True)

# show the result.
df.head(3)

#### Drop Duplicate Rows

In [None]:
# drop duplicate rows
df.drop_duplicates(inplace=True)
# check number of duplicates -it should be 0
df.duplicated().sum()

#### Replace zero values with NaN

As noted in the descriptive statistics above, at least 50% of the values in `budget` and `revenue` are zero, and the minimum value of `runtime` is 0, so `runtime` contains zero values as well.
Since dropping all the rows with zeros might distort my analysis, I will first mark the zero values as NaN so that I can replace them with mean values in the next section.

In [None]:
# create a list of columns with zero values.
col_with_zero = ['runtime','budget_adj','revenue_adj']

# replace zero values with NaN for columns in the list.
# (np.nan is used because the np.NAN alias was removed in NumPy 2.0)
df[col_with_zero] = df[col_with_zero].replace(0,np.nan)

# confirm the changes
df.describe()

#### Replace NaN values with mean

I handle missing values by imputing them with the mean.

In [None]:
# fill NaN values with the column mean
# (assigning the result back avoids in-place fills on a selected column)
df['runtime'] = df['runtime'].fillna(df['runtime'].mean())
df['budget_adj'] = df['budget_adj'].fillna(df['budget_adj'].mean())
df['revenue_adj'] = df['revenue_adj'].fillna(df['revenue_adj'].mean())
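
To make the two cleaning steps concrete, here is a small sketch on a made-up column. It shows why the zeros are marked as NaN first: `mean()` skips NaN, so the fill value is the mean of the known values only, not a mean dragged down by the zeros.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values): two known budgets and two zeros.
toy = pd.DataFrame({'budget_adj': [10.0, 0.0, 30.0, 0.0]})

# Mean with the zeros still counted as real values.
mean_with_zeros = toy['budget_adj'].mean()

# Step 1: mark zeros as missing so they are excluded from the mean.
toy['budget_adj'] = toy['budget_adj'].replace(0, np.nan)

# Step 2: the mean now covers only the known budgets, and fills the gaps.
known_mean = toy['budget_adj'].mean()
toy['budget_adj'] = toy['budget_adj'].fillna(known_mean)

print(mean_with_zeros, known_mean)
print(toy['budget_adj'].tolist())
```

Every imputed row ends up with the same value, so columns derived from these fields inherit that repetition.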

#### Add New Column profit

In [None]:
# adding new column profit calculated using revenue minus budget 
df['profit'] = df['revenue_adj'] - df['budget_adj']

In [None]:
# confirm changes
df.head(1)

#### Convert release_date column to Date datatype.

In [None]:
# convert release_date to datetime format
df['release_date']=pd.to_datetime(df['release_date'])
# confirm changes
df.dtypes
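
One thing worth checking after this conversion is how two-digit years were parsed. The sketch below assumes, without confirmation from the cells above, that the raw `release_date` strings look like `6/9/15`. Under that assumption, years up to 68 are read as 20xx, so a 1966 release could come back as 2066. The toy rows are made up, and `release_year` is used as the trusted reference.

```python
import pandas as pd

# Toy rows (hypothetical) in the assumed month/day/two-digit-year format.
toy = pd.DataFrame({
    'release_date': ['12/25/66', '6/9/15'],
    'release_year': [1966, 2015],
})

parsed = pd.to_datetime(toy['release_date'], format='%m/%d/%y')

# A parsed year later than release_year signals a wrong century.
in_future = parsed.dt.year > toy['release_year']

# Shift only the affected rows back by 100 years.
fixed = parsed.where(~in_future, parsed - pd.DateOffset(years=100))

print(fixed.dt.year.tolist())
```

The month is unaffected either way, so the month-based analysis later in the report does not depend on this correction.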

In [None]:
# save cleaned data for next steps 
df.to_csv('tmdb_cleaned_data.csv', index = False)

<a id='eda'></a>
## Exploratory Data Analysis

Now that the data is trimmed and cleaned, I can move on to exploration. I will compute statistics and create visualizations with the goal of addressing the research questions posed in the Introduction section.


#### Explore Relations Between Values
The plots below show the relations between all values. The most important ones are the relations between profit and the other values, which show which values have the most impact on profit.

In [None]:
pd.plotting.scatter_matrix(df,figsize=(15,15));

In [None]:
df['profit'].corr(df['popularity'])

In [None]:
df['profit'].corr(df['runtime'])

In [None]:
df['profit'].corr(df['vote_average'])

In [None]:
df['profit'].corr(df['budget_adj'])

> Based on the cells above, there is a strong correlation between profit and `revenue_adj`, a medium correlation with `popularity` and a weak correlation with each of `budget_adj`, `vote_average` and `runtime`. All of these correlations are positive regardless of their strength.
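
The individual `corr` calls can also be collapsed into one ranked table, which would include `revenue_adj` as well. This is a sketch on made-up numbers, so the printed coefficients say nothing about the real dataset.

```python
import pandas as pd

# Toy data (hypothetical values); 'title' is there to show that
# non-numeric columns are left out of the correlation.
toy = pd.DataFrame({
    'popularity': [0.5, 1.2, 2.0, 3.1, 4.0],
    'runtime': [90, 120, 100, 110, 95],
    'budget_adj': [10.0, 20.0, 15.0, 40.0, 30.0],
    'revenue_adj': [12.0, 50.0, 40.0, 120.0, 150.0],
    'title': ['a', 'b', 'c', 'd', 'e'],
})
toy['profit'] = toy['revenue_adj'] - toy['budget_adj']

profit_corr = (
    toy.select_dtypes('number')      # keep numeric columns only
       .corr()['profit']             # Pearson correlation with profit
       .drop('profit')               # drop the trivial self-correlation
       .sort_values(ascending=False)
)
print(profit_corr)
```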

### Q1 : What kind of properties are associated with movies that have high profit?

To analyze the properties associated with high profit movies, I will filter the dataset to movies that made a profit of at least 100 million dollars.

In [None]:
# create a new dataframe by filtering to movies that made a profit of at least 100 million dollars
high_profit_movie = df.query('profit >= 100000000')

high_profit_movie.head(3)

In [None]:
high_profit_movie.describe()

> Out of 10,865 movies, only 1,966 have a profit of at least 100 million dollars. The highest profit for a single movie is 2.75 billion dollars.

In [None]:
# details of the highest profit movie
highest = high_profit_movie['profit'].idxmax()
highest_details = pd.DataFrame(high_profit_movie.loc[highest])
highest_details

> From the result above, Star Wars has the highest profit at 2.75 billion dollars.

#### High profit movie average popularity?

In [None]:
# the average popularity of the movies
high_profit_movie['popularity'].mean()

> Based on the above, the average popularity for high profit movies is 1.311. Popularity is a TMDb score rather than a count of people, so it has no unit. Let's use a visualization to see the distribution of the popularity.

In [None]:
# create histogram to see the distribution of the popularity 
plt.figure(figsize=(10,5), dpi = 100)
sns.set_style('darkgrid')
# x-axis 
plt.xlabel('Movie popularity', fontsize = 15)
# y-axis 
plt.ylabel('No. of Movies', fontsize=15)
# distribution title
plt.title('Movie popularity Distribution', fontsize=15)

# Plot the histogram
plt.hist(high_profit_movie['popularity'], rwidth = 0.9, bins =35)
# Displays the plot
plt.show()

#### High profit movies average budget?

In [None]:
# the average budget of the movies
high_profit_movie['budget_adj'].mean()

>Based on the above, the average budget for high profit movies is around 42 million dollars ($42,584,424).

#### High profit movies average revenue?

In [None]:
# the average revenue of the movies
high_profit_movie['revenue_adj'].mean()

>Based on the above, the average revenue for high profit movies is around 257 million dollars ($257,177,144).

#### Which year makes the highest profit?

In [None]:
# release year have highest profit
highest_profit_year = high_profit_movie.groupby('release_year')['profit'].sum()
highest_profit_year.idxmax()

In [None]:
# Figure size
plt.figure(figsize=(12,6), dpi = 130)
sns.set_style('darkgrid')
# x-axis
plt.xlabel('Year', fontsize = 12)
# y-axis
plt.ylabel('Profit', fontsize = 12)
# Title
plt.title('Highest Profit Year')

# Plot line Chart
plt.plot(highest_profit_year)

# Display the line Chart
plt.show()

> Based on the above, we can see that high profit movies made the highest total profit in 2013.

In [None]:
# create a new column month by extracting the month from the release date
# (assign returns a new dataframe instead of modifying the filtered one in place)
high_profit_movie = high_profit_movie.assign(month=high_profit_movie['release_date'].dt.month)

In [None]:
# total of profits group by month 
highest_profit_month = high_profit_movie.groupby('month')['profit'].sum()
# count of high profit movies group by month
high_profit_movie_month = high_profit_movie.groupby('month')['profit'].count()

highest_profit_month

In [None]:
high_profit_movie_month

In [None]:
# get the month with the highest movies profit
highest_profit_month.idxmax()

In [None]:
# get the month with largest count of high profit movies
high_profit_movie_month.idxmax()

> Based on the above two results, December has the largest number of high profit movies, followed by June, while June has the highest total profit, followed by December. This means that the high profit movies released in June earned more profit on average than those released in December.
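
The total, the count and the average per month can be put side by side with one `groupby`, which makes the "fewer movies but more profit" pattern easy to see. The numbers below are made up and chosen only to mirror that pattern.

```python
import pandas as pd

# Toy data (hypothetical values): two June releases, three December releases.
toy = pd.DataFrame({
    'release_date': pd.to_datetime([
        '2010-06-01', '2011-06-15',
        '2012-12-01', '2012-12-20', '2013-12-25',
    ]),
    'profit': [500.0, 700.0, 300.0, 350.0, 400.0],
})

by_month = (
    toy.groupby(toy['release_date'].dt.month)['profit']
       .agg(['sum', 'count', 'mean'])   # total, number of movies, average
       .rename_axis('month')
)
print(by_month)

top_total_month = by_month['sum'].idxmax()     # month with the highest total
top_count_month = by_month['count'].idxmax()   # month with the most movies
print(top_total_month, top_count_month)
```

In this toy case June leads on total and average profit while December leads on the number of movies.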

In [None]:
# Figure size
plt.figure(figsize=(15,8))
sns.set_style('darkgrid')

month_name = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
plt.bar([1,2,3,4,5,6,7,8,9,10,11,12], highest_profit_month, tick_label = month_name)
# Title
plt.title('Highest Profit Month')
# y-axis
plt.ylabel('Profit')
# x-axis
plt.xlabel('Month');

In [None]:
# Figure size
plt.figure(figsize=(15,8))
sns.set_style('darkgrid')

month_name = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
plt.bar([1,2,3,4,5,6,7,8,9,10,11,12], high_profit_movie_month, tick_label = month_name)
# Title
plt.title('Number of High Profit Movies per Month')
# y-axis
plt.ylabel('Number of Movies')
# x-axis
plt.xlabel('Months');

#### High profit movie cast?

In [None]:
def extract_high_proft_data(column):

    data = high_profit_movie[column].str.cat(sep = '|')
    
    # create pandas series and store the values separately
    data = pd.Series(data.split('|'))
    
    # display value count in descending order
    count = data.value_counts(ascending = False)
    
    return count
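
An alternative way to count pipe-separated names is to split each row into a list and then `explode` it into one row per name. The sketch below takes the DataFrame as a parameter so that the same helper could serve any filtered subset. The name `count_pipe_values` and the toy rows are hypothetical.

```python
import pandas as pd


def count_pipe_values(frame, column):
    """Count individual names in a pipe-separated text column."""
    return (
        frame[column]
        .dropna()            # skip rows with no value
        .str.split('|')      # 'A|B' -> ['A', 'B']
        .explode()           # one row per name
        .value_counts()      # sorted from most to least frequent
    )


# Toy data (hypothetical names) including one missing value.
toy = pd.DataFrame({
    'cast': ['Actor A|Actor B', 'Actor A', None, 'Actor B|Actor A|Actor C'],
})

counts = count_pipe_values(toy, 'cast')
print(counts)
```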

In [None]:
# get top 15 cast members
cast = extract_high_proft_data('cast')
cast.head(15)

> The top two male actors with more than 20 movies are Tom Cruise (27 movies) and Tom Hanks (22 movies). The top female actress is Cameron Diaz (17 movies). There seem to be more actors than actresses in high profit movies, as only 1 of the 15 cast members in the list above is female.

#### High profit movie directors?

In [None]:
# get top 10 directors
director = extract_high_proft_data('director')
director.head(10)

> Steven Spielberg (23 movies) is the director with the most high profit movies in the 55 years between 1960 and 2015, followed by Robert Zemeckis with 13 movies and Ron Howard with 12 movies.

#### High profit production companies?

In [None]:
# Get top 10 production companies
production_companies = extract_high_proft_data('production_companies')
production_companies.head(10)

>The top production companies are Universal Pictures (139 movies), Warner Bros. (136 movies) and Paramount Pictures (131 movies). Remarkably, these three companies together account for more than 400 high profit movies in the 55 years between 1960 and 2015.

#### High profit movie genre?

In [None]:
# Get top 10 genres
genres = extract_high_proft_data('genres')
genres.head(10)

>  Drama has the largest number of high profit movies with 749 movies, followed by Comedy with 613 movies and Thriller with 612 movies. Surprisingly, Action and Adventure, which have the highest popularity, come in fourth and fifth place in terms of the number of high profit movies.

### Q2 : What kind of properties are associated with movies that have high ratings?

To analyze the properties associated with movies that have a high vote average, I will filter the dataset to movies with a vote average of 7.0 or more.

In [None]:
# create a new dataframe by filtering to movies with a vote average of at least 7.0
high_vote_movie = df.query('vote_average >= 7.0')

high_vote_movie.head(3)

In [None]:
high_vote_movie.describe()

> Out of 10,865 movies, only 1,561 have a vote average of at least 7. The highest vote average is 9.2.

In [None]:
highest = high_vote_movie['vote_average'].idxmax()
highest_details = pd.DataFrame(high_vote_movie.loc[highest])
highest_details

> From the result above, The Story of Film: An Odyssey has the highest vote average at 9.2.

In [None]:
# the average popularity of the movies
high_vote_movie['popularity'].mean()

> Based on the above, the average popularity for high vote movies is 1.088. Let's use a visualization to see the distribution of the popularity.

In [None]:
# create histogram to see the distribution of the popularity 
plt.figure(figsize=(10,5), dpi = 100)
sns.set_style('darkgrid')
# x-axis 
plt.xlabel('Movie popularity', fontsize = 15)
# y-axis 
plt.ylabel('No. of Movies', fontsize=15)
# distribution title
plt.title('Movie popularity Distribution', fontsize=15)

# Plot the histogram
plt.hist(high_vote_movie['popularity'], rwidth = 0.9, bins =35)
# Displays the plot
plt.show()

In [None]:
# the average budget of the movies
high_vote_movie['budget_adj'].mean()

>Based on the above, the average budget for high vote movies is around 38.5 Million dollars ($38,552,080).

In [None]:
# the average revenue of the movies
high_vote_movie['revenue_adj'].mean()

>Based on the above, the average revenue for high vote movies is around 161 Million dollars. ($161,986,833).

In [None]:
def extract_high_vote_data(column):

    data = high_vote_movie[column].str.cat(sep = '|')
    
    # create pandas series and store the values separately
    data = pd.Series(data.split('|'))
    
    # display value count in descending order
    count = data.value_counts(ascending = False)
    
    return count

#### High vote average movies cast?

In [None]:
# get top 15 cast members
vote_cast = extract_high_vote_data('cast')
vote_cast.head(15)

> Surprisingly, the ranking changes between the high profit listing and the high vote listing. Based on the result above, Robert De Niro has the most high vote movies with 20 movies, followed by Tom Hanks with 18 movies, who holds the same rank in the high profit listing. The top female actress is Scarlett Johansson (11 movies). There seem to be more actors than actresses in high vote movies, as only 1 of the 15 cast members in the list above is female.

#### High vote average movies directors?

In [None]:
# get top 10 directors
vote_director = extract_high_vote_data('director')
vote_director.head(10)

> As we see here, Martin Scorsese has the largest number of high vote movies with 15 movies, followed by Steven Spielberg with 13 movies, who also has the largest number of high profit movies. The high vote directors listing therefore differs from the high profit directors listing.

#### High vote average movies production companies?

In [None]:
# Get top 10 production companies
vote_production_companies = extract_high_vote_data('production_companies')
vote_production_companies.head(10)

> Based on the above result, there is a small difference between the high vote and high profit production company listings. Warner Bros. has the largest number of high vote movies with 94 movies, followed by Universal Pictures with 72 movies, which has the largest number of high profit movies. Paramount Pictures has the same rank in both lists.

#### High vote average movies genres?

In [None]:
# Get top 10 genres
vote_genres = extract_high_vote_data('genres')
vote_genres.head(10)

> Based on the above result, Drama (779 movies) and Comedy (385 movies) hold the same ranks in the high vote genre list as in the high profit genre list. Documentary comes in third place with 268 movies, followed by Action with 231 movies.
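
Comparisons like the ones in this section can also be made in one table by placing two count Series next to each other. The sketch below uses small made-up counts, not the real results, and the column names `high_profit` and `high_vote` are chosen for illustration.

```python
import pandas as pd

# Toy counts (hypothetical values) standing in for two value_counts results.
profit_counts = pd.Series({'Drama': 5, 'Comedy': 4, 'Thriller': 3})
vote_counts = pd.Series({'Drama': 6, 'Comedy': 3, 'Documentary': 2})

# Align the two Series on their labels; labels missing from one list get NaN.
comparison = pd.concat(
    [profit_counts, vote_counts],
    axis=1,
    keys=['high_profit', 'high_vote'],
)

# Flag the labels that appear in both lists.
comparison['in_both'] = comparison[['high_profit', 'high_vote']].notna().all(axis=1)
print(comparison)
```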

<a id='conclusions'></a>
## Conclusions

>My goal in this data analysis is to answer 2 main questions: (1) What kind of properties are associated with movies with a high profit of at least 100 million dollars? (2) What kind of properties are associated with movies with a high vote average of at least 7.0 out of 10? After the above analysis, I can conclude the following:
<br/><br/>
>**Properties and attributes of movies with a profit of at least 100 million dollars:**
- Average popularity score of the movie: 1.3112.
- Average budget is around 42 million dollars.
- Average revenue is around 257 million dollars.
- Year that makes the most profit: 2013.
- Month to release the movie: December or June.
- Actors to cast: Tom Cruise, Tom Hanks, Samuel L. Jackson.
- Any one of these should be the director: Steven Spielberg, Robert Zemeckis, Ron Howard.
- Popular production companies for high profit movies: Universal Pictures, Warner Bros., Paramount Pictures.
- Genre should be any of these: Drama, Comedy, Thriller.


>**For a movie to have a vote average of at least 7.0:**
- Average popularity score of the movie: 1.088.
- Average budget is around 38 million dollars.
- Average revenue is around 161 million dollars.
- Actors to cast: Robert De Niro, Tom Hanks, Samuel L. Jackson, Brad Pitt.
- Any one of these should be the director: Martin Scorsese, Steven Spielberg, Joel Coen.
- Produced by any of these production companies: Warner Bros., Universal Pictures, Paramount Pictures.
- Genre should be any of these: Comedy, Drama, Documentary.

<br/><br/>
>By meeting the above criteria, a movie will have a higher probability of being a hit, earning an average revenue of around 257 million dollars and a profit of at least 100 million dollars.
<br/><br/>
Note, however, that the above analysis covers movies from 1960 to 2015 and, for the first question, only those with a profit of at least 100 million dollars. The dataset also has a large amount of missing data, and some movies have erroneous values (e.g. $1 for some of the movies).

### Resources
- <a href="https://pandas.pydata.org/pandas-docs/stable/">Pandas Documentation</a>
- <a href="https://matplotlib.org/">Matplotlib Documentation</a>
- Python For Data Analysis (Book)