# Project: Investigating TMDb dataset

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction
The goal of this project is to explore a movies dataset containing information about 10,000 movies collected from The Movie Database (TMDb).

Here is the list of questions I will address:

*  Which year has the highest number of releases?
*  Which month has the highest number of releases?
*  Which movies have the highest and lowest popularity rating, budget, revenue, profit, vote average, and runtime?
*  How have movie budgets changed over the years?
*  Which genre has the highest number of movie releases?

In [None]:
#Importing required packages
import pandas as pd
pd.options.mode.chained_assignment = None
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')

In [None]:
#import os 
#os.getcwd() #Checking working directory to import data

<a id='wrangling'></a>
## Data Wrangling

### General Properties

In [None]:
# Loading movies data
movies_data=pd.read_csv('../input/tmdb-movies-dataset/tmdb_movies_data.csv')
movies_data.head(3) #Check head and tail

In [None]:
movies_data.tail(3)

Here I checked the first and last 3 rows of the dataset to see how it looks.

In [None]:
movies_data.shape

Looks like this data has 10866 rows and 21 columns.

In [None]:
movies_data.info()

From the info command, I noticed that there are missing values in some of the columns: cast, homepage, tagline, keywords, imdb_id,
director, overview, genres, and production_companies. In addition, I checked the data types of all the variables, and it looks like I need to
change the data type of release_date.

In [None]:
movies_data.describe()

The describe command provides a summary of all the quantitative variables in the data. The budget and revenue values are shown in
scientific notation, which is hard to read. I'll convert them to a more readable form later in this notebook.

Now, let's check the missing, duplicate, and unique values.

In [None]:
#Missing values
movies_data.isnull().sum()

In [None]:
#Duplicate values
movies_data.duplicated().sum()

In [None]:
#Unique values
movies_data.nunique()

Looks like several columns have missing data, and there is just one duplicate row.

### Data Cleaning 
In this part, I'll be treating missing and duplicate rows as well as changing the data type for 'release_date' column.
Also, I will be dropping a few columns which might not be useful for our analysis.

In [None]:
#Removing columns that are not important
movies_data.drop(['id','imdb_id','homepage','tagline','overview','keywords'], axis=1,inplace=True)
movies_data.head(3)

Now, I'll remove the missing values, but before that let's replace all empty (or whitespace-only) cells with 'nan' so that they are counted as missing too.
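
To illustrate what this replacement does, here is a minimal sketch on a hypothetical toy DataFrame (the column values are made up for illustration, not taken from the TMDb data). It uses the same regular expression as this notebook, which matches cells that are empty or contain only whitespace.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one empty string and one whitespace-only string.
toy = pd.DataFrame({
    "director": ["Jane Doe", "", "   "],
    "genres": ["Drama", "Comedy", "Action"],
})

# '^\s*$' matches a cell that is empty or holds only whitespace.
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)

# Count missing values per column after the replacement.
missing_per_column = cleaned.isnull().sum()
print(missing_per_column)
```

After the replacement, `dropna` treats those cells like any other missing value.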

In [None]:
#Replace all empty values with nan
movies_data=(movies_data.replace(r'^\s*$', np.nan, regex=True))

In [None]:
#Drop the rows with na values
movies_data.dropna(inplace=True)

#Let's check the changes
movies_data.isnull().sum()

In [None]:
#Removing duplicate data
movies_data.drop_duplicates(inplace=True)

#Let's check it
movies_data.duplicated().sum()

In [None]:
#Let's check the shape of new data
movies_data.shape

In [None]:
#Changing data type of 'release_date' from string to datetime format
movies_data['release_date']=pd.to_datetime(movies_data['release_date'])

#Making sure 'release_year' is stored as an integer
movies_data['release_year']=movies_data['release_year'].astype(int)
                                                                               
#Let's confirm it
movies_data.info()

In [None]:
#Now, I'll extract month from the 'release_date' 
movies_data['month'] = movies_data['release_date'].dt.month

import calendar
movies_data['month'] = movies_data['month'].apply(lambda x: calendar.month_abbr[x])

Further, I feel there are too many decimal places in the 'popularity' column, so I'll round it to 3 decimals.
In addition, it's hard to read the amounts in the 'budget', 'revenue', 'budget_adj', and 'revenue_adj' columns. I'll convert these
amounts to 'million dollars' to make them easier to understand.


In [None]:
# Let's round the popularity column to 3 decimal places
movies_data['popularity']=movies_data['popularity'].round(decimals=3)
movies_data.head(3)

In [None]:
#Let's work on budget and revenue columns now
#Here 'mill' means 'millions'

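#int64 is requested explicitly: a plain int cast can be 32-bit on some platforms and overflow above ~2147 million
movies_data['budget_mill']=(movies_data['budget'].astype('int64')/1000000).round(2).astype(float) 
movies_data['revenue_mill']=(movies_data['revenue'].astype('int64')/1000000).round(2).astype(float) 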
movies_data['budget_adj_mill']=(movies_data['budget_adj'].astype(float)/1000000).round(2).astype(float) 
movies_data['revenue_adj_mill']=(movies_data['revenue_adj'].astype(float)/1000000).round(2).astype(float) 

In [None]:
#Now drop the budget, revenue, budget_adj, and revenue_adj columns
movies_data.drop(['budget','revenue', 'budget_adj','revenue_adj'], axis=1, inplace=True)

In [None]:
#Let's check the changes
print(movies_data.shape)

In [None]:
movies_data.head(3)

The data looks pretty clean now and we are ready for the EDA part.

<a id='eda'></a>
## Exploratory Data Analysis

In this part, I will do the univariate and bivariate analysis to find the answers to the questions I have mentioned in the 
beginning.

### Research Question 1: Which year has the highest number of releases?

In [None]:
movies_data.groupby('release_year').count()['original_title'].plot(xticks = np.arange(1960,2016,5));
plt.title("Release Year vs Number Of Movies",fontsize = 15)
plt.xlabel('release_year',fontsize = 10)
plt.ylabel('movies_count',fontsize = 10);

From this graph, it looks like the number of movie releases increased roughly exponentially from 1960 to 2011, and we see a downward trend in 2015.
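
The busiest year can also be read off directly instead of from the plot. A minimal sketch, assuming a hypothetical toy `release_year` column (the years below are made up, not results from this dataset):

```python
import pandas as pd

# Hypothetical toy data: one row per movie.
toy = pd.DataFrame({"release_year": [2013, 2014, 2014, 2015, 2014, 2013]})

# Number of movies per year, then the year with the largest count.
releases_per_year = toy["release_year"].value_counts()
busiest_year = releases_per_year.idxmax()
busiest_count = releases_per_year.max()

print(busiest_year, busiest_count)
```

The same two calls applied to `movies_data['release_year']` would give the exact year and count behind the peak of the line plot.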

### Research Question 2: Which month has the highest number of releases?

In [None]:
movies_data['month'].value_counts().plot(kind='barh', color='brown', figsize=(7,5));
plt.title("Month vs Number Of Movies",fontsize = 15)
plt.xlabel('movies_count',fontsize = 10)
plt.ylabel('month',fontsize = 10);

Looks like the highest number of movies were released in September.
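
`value_counts` sorts by count, so the bars do not appear in calendar order. If a chronological view is preferred, one option is to reindex the counts by month name. This is a sketch on hypothetical toy data, and it assumes the default English month abbreviations from `calendar.month_abbr`:

```python
import calendar
import pandas as pd

# Hypothetical toy data using the same month abbreviations as the 'month' column.
toy = pd.DataFrame({"month": ["Sep", "Jan", "Sep", "Dec", "Jan", "Sep"]})

# calendar.month_abbr[0] is an empty string, so skip it to get 'Jan'..'Dec'.
month_order = list(calendar.month_abbr)[1:]

# Reorder the counts chronologically; months with no movies get a count of 0.
counts = toy["month"].value_counts().reindex(month_order, fill_value=0)
print(counts)
```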

### Research Question 3: How does movie runtime vary in this dataset?

In [None]:
movies_data['runtime'].plot(kind='box')
plt.title("Distribution of Movie Runtime",fontsize = 15)
#The y-axis of a box plot shows the runtime values, not a movie count
plt.ylabel('runtime (minutes)',fontsize = 10);

Looks like the median runtime is ~ 110 minutes and maximum ~ 900 minutes. Interestingly, there are some movies with 0 runtime as well. These are the potential missing values in the runtime column which I'll try to remove in the next graph.
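
A runtime of 0 can be treated as missing rather than as a real value. The sketch below, on hypothetical toy data, shows how replacing zeros with NaN changes a summary statistic such as the median:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two movies with an unknown (zero) runtime.
toy = pd.DataFrame({"runtime": [0, 95, 110, 0, 130]})

# Treat 0 as "missing" so that summary statistics skip it.
runtime_clean = toy["runtime"].replace(0, np.nan)

n_missing = int(runtime_clean.isna().sum())
median_with_zeros = toy["runtime"].median()
median_without_zeros = runtime_clean.median()

print(n_missing, median_with_zeros, median_without_zeros)
```

pandas skips NaN in `median`, `mean`, and `describe` by default, so no rows need to be dropped for these summaries.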

### Research Question 4: Which movie has the highest and lowest runtime?

In [None]:
#Highest runtime
movies_data.nlargest(5,'runtime').plot.line(y='runtime',x='original_title',color='blue', figsize=(8,5));
plt.title("Runtime vs Movie Title",fontsize = 15)
plt.xlabel('original_title',fontsize = 10)
plt.ylabel('runtime',fontsize = 10);

In [None]:
#Lowest runtime

print(movies_data.query('runtime==0')['original_title'].count())
#Looks like some values are missing in runtime columns as there are 13 movies with a runtime of 0 mins.

#Let's exclude these 13 rows and calculate the lowest runtime using the rest of the data
Lowest_runtime_data=movies_data.query('runtime!=0')

Lowest_runtime_data.nsmallest(10,'runtime').plot.barh(y='runtime',x='original_title',color='orange', figsize=(8,5));
plt.title("Runtime vs Movie Title",fontsize = 15)
plt.ylabel('original_title',fontsize = 10)
plt.xlabel('runtime',fontsize = 10);

From these graphs, I concluded that the movie 'Taken' has the highest runtime and five movies have the lowest runtime of 3 minutes. Also, there were
13 movies with zero runtime, which I excluded from the lowest-runtime analysis.
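
Since several movies can share the same extreme value, it can help to list every title at the maximum and minimum rather than reading them off a chart. The helper below, `extremes`, is a hypothetical function written for illustration (it is not part of pandas), shown on made-up toy data:

```python
import pandas as pd

def extremes(df, column, title_column="original_title"):
    """Return (titles at the maximum, titles at the minimum) of `column`, keeping ties."""
    highest = df.loc[df[column] == df[column].max(), title_column].tolist()
    lowest = df.loc[df[column] == df[column].min(), title_column].tolist()
    return highest, lowest

# Hypothetical toy data with a tie for the shortest runtime.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "runtime": [3, 120, 3, 900],
})

longest, shortest = extremes(toy, "runtime")
print(longest, shortest)
```

The same helper could be reused for the vote, popularity, budget, and revenue questions by changing the `column` argument.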

### Research Question 5: What is the average vote for most of the movies?

In [None]:
movies_data['vote_average'].hist()
plt.title("Average Vote vs Number of Movies",fontsize = 15)
plt.ylabel('movies_count',fontsize = 10)
plt.xlabel('vote_average',fontsize = 10);

The distribution looks roughly normal. Around 3000 movies have an average vote of ~ 6. The highest vote_average is ~ 8 and the lowest is ~ 2.

### Research Question 6: What does the vote count distribution look like for this dataset?

In [None]:
movies_data['vote_count'].plot(kind='hist')
plt.title("Vote Count vs Number Of Movies",fontsize = 15)
plt.ylabel('movies_count',fontsize = 10)
plt.xlabel('vote_count',fontsize = 10);

For most of the movies the vote count is less than 1000. However, there are a few movies with a vote count of 2000 to 4000 as well.

### Research Question 7: Which movie has the highest and lowest average vote?

In [None]:
#Highest average vote
movies_data.nlargest(10,'vote_average').plot.scatter(x='vote_average',y='original_title',color='blue', figsize=(8,5));
plt.title("Average Vote vs Movie Title",fontsize = 15)
plt.ylabel('original_title',fontsize = 10)
plt.xlabel('vote_average',fontsize = 10);

In [None]:
#Lowest average vote
movies_data.nsmallest(10,'vote_average').plot.scatter(x='vote_average',y='original_title',color='blue', figsize=(8,5));
plt.title("Average Vote vs Movie Title",fontsize = 15)
plt.ylabel('original_title',fontsize = 10)
plt.xlabel('vote_average',fontsize = 10);

The movie 'Pink Floyd: Pulse' has the highest vote_average of 8.7, whereas 'Transmorphers' and 'Manos: The Hands of Fate' have the lowest ratings of ~ 1.5.

### Research Question 8:  Which movie has the highest and lowest popularity rating?

In [None]:
#Highest popularity rating
movies_data.nlargest(10,'popularity').plot.barh(y='popularity',x='original_title',color='red', figsize=(7,5));
plt.title("Popularity vs Movie Title",fontsize = 15)
plt.ylabel('original_title',fontsize = 10)
plt.xlabel('popularity',fontsize = 10);

In [None]:
#Lowest popularity rating

movies_data.nsmallest(10,'popularity').plot.barh(y='popularity',x='original_title',color='green');
plt.title("Popularity vs Movie Title",fontsize = 15)
plt.ylabel('original_title',fontsize = 10)
plt.xlabel('popularity',fontsize = 10);

Looks like 'Jurassic World' has the highest popularity rating whereas 'Paheli', 'The Central Park Five', and 'Freddie Mercury: The Great Pretender' have the lowest popularity ratings.

### Research Question 9: How have movie budgets changed over the years?

In [None]:
movies_data.groupby('release_year')['budget_mill'].mean().plot(figsize=(10,5),color='red')
plt.title("Release Year vs Budget",fontsize = 15)
plt.xlabel('release_year',fontsize = 10)
plt.ylabel('budget_mill',fontsize = 10);

This graph shows that the average movie budget increased from 1960 to 2000 and started decreasing thereafter.
The average budget in 1960 was ~ 1 million US dollars, whereas in 2015 it was ~ 13 million US dollars.
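
Note that a yearly mean computed on `budget_mill` includes every movie whose budget is recorded as 0, which pulls the average down. A minimal sketch, on hypothetical toy numbers, of how the yearly mean changes when zero budgets are treated as missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a budget of 0.0 stands for "budget unknown".
toy = pd.DataFrame({
    "release_year": [2000, 2000, 2000, 2015, 2015],
    "budget_mill": [30.0, 0.0, 60.0, 0.0, 20.0],
})

# Mean per year with the zeros included, as in the plot above.
mean_all = toy.groupby("release_year")["budget_mill"].mean()

# Mean per year after turning zeros into NaN, which mean() skips.
mean_known = (
    toy.assign(budget_mill=toy["budget_mill"].replace(0, np.nan))
       .groupby("release_year")["budget_mill"].mean()
)

print(mean_all)
print(mean_known)
```

Whether the trend in this dataset changes shape under this treatment is not something this sketch shows; it only illustrates the mechanics.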

### Research Question 10: How do movie budget and revenue vary in this dataset?

In [None]:
movies_data[['budget_mill','revenue_mill']].plot(kind='box')
plt.title("Distribution of Budget and Revenue",fontsize = 15)
#The y-axis of a box plot shows the amounts, not a movie count
plt.ylabel('million US dollars',fontsize = 10);

Looks like there are some zeros or negative numbers in the budget and revenue columns. I'll remove these in the next graphs.
The highest budget is ~ 500 million US dollars whereas the highest revenue is ~ 2000 million US dollars (that's a large number). Let's see which movie earned that much money.

### Research Question 11:  Which movie has the highest and lowest budget?

In [None]:
#Highest budget_mill (in millions)
movies_data.nlargest(10,'budget_mill').plot.barh(y='budget_mill',x='original_title',color='blue', figsize=(8,5));
plt.title("Budget vs Movie Title",fontsize = 15)
plt.ylabel('original_title',fontsize = 10)
plt.xlabel('budget_mill',fontsize = 10);

In [None]:
#Lowest budget_mill (in millions)

print(movies_data.query('budget_mill==0')['original_title'].count())
#Looks like some values are missing in budget_mill columns as there are 4804 movies with a budget of 0 million.

#Let's exclude these 4804 rows and calculate the lowest budget using the rest of the data
new_budget=movies_data.query('budget_mill!=0')

new_budget.nsmallest(10,'budget_mill').plot.barh(y='budget_mill',x='original_title',color='blue', figsize=(8,5));
plt.title("Budget vs Movie Title",fontsize = 15)
plt.ylabel('original_title',fontsize = 10)
plt.xlabel('budget_mill',fontsize = 10);

There were 4804 movies with missing budget values. I removed these rows and performed the analysis on the rest of the data.
The movie 'The Warrior's Way' has the highest budget of ~ 470 million US dollars, whereas there are 7 movies with a budget of just 0.01 million US dollars.

### Research Question 12: Which movie has the highest and lowest revenue?

In [None]:
#Highest revenue_mill (in millions)
movies_data.nlargest(10,'revenue_mill').plot.scatter(x='revenue_mill',y='original_title',color='green', figsize=(8,5));
plt.title("Revenue vs Movie Title",fontsize = 15)
plt.ylabel('original_title',fontsize = 10)
plt.xlabel('revenue_mill',fontsize = 10);

In [None]:
#Lowest revenue_mill (in millions)

print(movies_data.query('revenue_mill==0')['original_title'].count())

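#Looks like some values are missing in the revenue_mill column as there are 5085 movies with a revenue of 0 million. Also, there is a 
#negative value of -1513.46 (row 1386). A revenue cannot be negative, so this is most likely an artifact
#(possibly a 32-bit integer overflow in the int cast) rather than real data.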

print(movies_data.query('revenue_mill<=0')['original_title'].count())

#Let's exclude these 5086 rows to calculate the lowest revenue using the rest of the data
new_revenue=movies_data.query('revenue_mill>0')

new_revenue.nsmallest(10,'revenue_mill').plot.scatter(x='revenue_mill',y='original_title',color='blue', figsize=(10,5));
plt.title("Revenue vs Movie Title",fontsize = 15)
plt.ylabel('original_title',fontsize = 10)
plt.xlabel('revenue_mill',fontsize = 10);

Again, there were 5085 missing values in the revenue column and 1 negative value. I removed these rows and performed the analysis on the rest of the data set.
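
One possible explanation for the negative value, offered as an assumption rather than something verified against this dataset, is integer overflow: on platforms where an `int` cast is 32-bit, any value above 2,147,483,647 wraps around to a negative number. The sketch below uses an illustrative figure of 2,781,505,847, chosen only because it reproduces the -1513.46 seen above when forced into 32 bits:

```python
import numpy as np

# Illustrative revenue in dollars (an assumption, not read from the data),
# stored in 64 bits where it fits comfortably.
revenue = np.array([2_781_505_847], dtype=np.int64)

# Forcing the value into 32 bits wraps it around to a negative number.
wrapped = revenue.astype(np.int32)

correct_mill = round(float(revenue[0]) / 1_000_000, 2)
wrapped_mill = round(float(wrapped[0]) / 1_000_000, 2)

print(correct_mill, wrapped_mill)
```

If this is what happened, casting with `astype('int64')` (or skipping the cast, since the division already returns floats) would keep the original value.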

The movie 'Star Wars' earned the highest amount of money, ~ 2500 million US dollars, whereas the lowest revenue earned is 0.01 million US dollars.

### Research Question 13: Which movie has the highest and lowest profit?

In [None]:
#Profit
#For profit, I will make a new data set where budget_mill and revenue_mill are greater than 0 (to get rid of zeros and negative values)
#.copy() makes profit_data independent of movies_data, so adding a column below is safe
profit_data=movies_data[(movies_data.budget_mill>0) & (movies_data.revenue_mill>0)].copy()

#Let's check it
print(profit_data[['budget_mill','revenue_mill']].isnull().sum())

#Calculate profit
profit_data['profit_mill']= profit_data['revenue_mill'] - profit_data['budget_mill']

In [None]:
#Highest profit
profit_data.nlargest(10,'profit_mill').plot.barh(y='profit_mill',x='original_title',color='blue', figsize=(10,5));
plt.title("Profit vs Movie Title",fontsize = 15)
plt.ylabel('original_title',fontsize = 10)
plt.xlabel('profit_mill',fontsize = 10);


In [None]:
#Lowest profit
profit_data.nsmallest(10,'profit_mill').plot.barh(y='profit_mill',x='original_title',color='red', figsize=(10,5));
plt.title("Profit vs Movie Title",fontsize = 15)
plt.ylabel('original_title',fontsize = 10)
plt.xlabel('profit_mill',fontsize = 10);
print(profit_data.query('profit_mill <0')['original_title'].count())

In terms of profit, the movie 'Star Wars' earned the most with ~ 1850 million US dollars. There are also movies that made a loss, i.e. their revenue was lower than their budget.

Note: To calculate profit, I didn't use the rows where either budget or revenue equals zero because they would mislead my analysis.
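
The profit calculation described above can be sketched on hypothetical toy data as follows. The titles and amounts are made up; only the column names follow this notebook.

```python
import pandas as pd

# Hypothetical toy data: a value of 0.0 stands for "unknown".
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget_mill": [10.0, 0.0, 50.0, 20.0],
    "revenue_mill": [25.0, 5.0, 30.0, 0.0],
})

# Keep only rows where both amounts are known; .copy() avoids writing to a view.
known = toy[(toy["budget_mill"] > 0) & (toy["revenue_mill"] > 0)].copy()

# Profit is revenue minus budget, in millions.
known["profit_mill"] = known["revenue_mill"] - known["budget_mill"]

n_losses = int((known["profit_mill"] < 0).sum())
most_profitable = known.loc[known["profit_mill"].idxmax(), "original_title"]

print(n_losses, most_profitable)
```

Rows "B" and "D" are excluded because one of their amounts is 0, which mirrors the filtering used for `profit_data`.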

### Research Question 14: Which genre has the highest number of movie releases?

In [None]:
#Let's make a function to count the number of genres.
def count_genre(i):
    d_plot = movies_data[i].str.cat(sep = '|')
    d = pd.Series(d_plot.split('|'))
    gen = d.value_counts(ascending=False)
    return gen

#Count the movies of each genre.
total_movies_genre = count_genre('genres')

#Plot
total_movies_genre.plot.barh(color='orange',figsize = (15,7));
plt.title("Genres vs Movie Count",fontsize = 15)
plt.ylabel('genres',fontsize = 10)
plt.xlabel('count',fontsize = 10);

Looks like most of the movies are in the drama category, followed by comedy and thriller, and the least common genre is foreign.
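
As an alternative to concatenating all genre strings, the same count can be obtained by splitting each row and using `explode`. This is a sketch on hypothetical toy data, not a replacement for the function above:

```python
import pandas as pd

# Hypothetical toy data using the same '|' separator as the 'genres' column.
toy = pd.DataFrame({"genres": ["Drama|Comedy", "Drama", "Thriller|Drama", "Comedy"]})

# Split each row into a list of genres, give every genre its own row, then count.
genre_counts = toy["genres"].str.split("|").explode().value_counts()
print(genre_counts)
```

Each movie is counted once for every genre it lists, so the counts add up to more than the number of movies.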

<a id='conclusions'></a>
## Conclusions

From the above analysis, we can draw the following conclusions:

* The highest number of movie releases was in 2013-2014.
* Most of the movies in this dataset were released in September.
* In terms of popularity, 'Jurassic World' has the highest and 'The Hospital' has the lowest rating.
* 'Taken' was the longest movie with a runtime of ~ 900 mins, whereas 5 movies have the shortest runtime of 3 mins.
* For the average vote, 'Pink Floyd: Pulse' received the highest vote of 8.7, whereas 'Transmorphers' and 'Manos: The Hands of Fate' received the lowest vote of ~ 1.5.
* The budget of the movie 'The Warrior's Way' was the highest (~ 450 million dollars), whereas it was just 0.01 million dollars for 7 movies.
* The revenue was highest for the movie 'Star Wars', at ~ 2100 million dollars.
* The movie 'Star Wars' made the highest profit of ~ 1850 million dollars. The profit was negative for 1025 movies, indicating that their budget was higher than their revenue.
* Overall, the average movie budget increased from ~1 million dollars in 1960 to ~30 million dollars in 2000. According to this data, the budget declined afterwards to ~14 million dollars in 2015.
* Among genres, drama had the highest number of releases through 2015.

## Limitations

No statistical analysis was done to interpret these results. The conclusions are based purely on the graphical representations.