# TMDB Movies Analysis

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

>This dataset contains about 10,000 movies selected from TMDB. We are going to analyse it in terms of budget, revenue, ratings and profit.

#### Questions that can be answered by looking at the dataset:
> - Which movie has the highest profit, and which has the lowest?
> - Which movie has the longest runtime, and which has the shortest?
> - Which movies had the highest and lowest budgets?
> - Which movies had the highest and lowest revenues?
> - Which actors appear most frequently?

#### Questions that will be answered based on the 100 highest-rated movies:
> - What is the highest-rated movie?
> - What is the average budget of the movies?
> - What is the average revenue of the movies?
> - What is the average runtime of the movies?
> - Which genres are the most successful?
> - Which actors appear most frequently?

In [None]:
#importing the libraries we need 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling
>In this section we will look at how the data is structured and decide which parts we will use in the analysis and which we won't.
### General Properties

Loading the dataset and displaying the first five rows

In [None]:
#load our dataset
df=pd.read_csv('../input/tmdb-movies-dataset/tmdb_movies_data.csv')
#getting a closer look on the data
df.head(5)

Let's see some summary statistics for this dataset

In [None]:
df.describe()

In [None]:
df.shape

>We have here 10866 rows and 21 columns.

In [None]:
df.info()

It seems that several columns contain null values that we may need to deal with

### Let's make the dataset more usable for the analysis

>First of all, let's get rid of the columns that we won't use in this analysis:
>id, imdb_id, homepage, director, tagline, keywords, overview, production_companies, budget_adj, revenue_adj, vote_count and popularity.

In [None]:
#dropping the columns
columns_drop=['id','imdb_id','popularity','homepage','director' , 'tagline','vote_count', 'keywords', 'overview', 'production_companies', 'budget_adj', 'revenue_adj']
df.drop(columns_drop, axis =1 , inplace = True)
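
One thing to keep in mind with an in-place drop is that running the cell a second time raises a `KeyError`, because the columns are already gone. Here is a minimal sketch of a re-runnable variant on a made-up toy frame (the `toy` and `trimmed` names and the values are only for illustration, not part of the real dataset):

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "id": [1, 2],
    "homepage": ["a", "b"],
    "original_title": ["Movie A", "Movie B"],
    "budget": [100, 200],
})

# 'tagline' is deliberately absent to mimic a second run of the cell
columns_drop = ["id", "homepage", "tagline"]

# errors="ignore" skips labels that are not present instead of raising KeyError;
# returning a new frame leaves the original untouched
trimmed = toy.drop(columns=columns_drop, errors="ignore")
print(trimmed.columns.tolist())
```

Returning a new frame instead of using `inplace=True` also keeps the raw data around in case a dropped column turns out to be useful later.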

Let's take a look at how our data looks now

In [None]:
#our new dataset
df.head(5)

In [None]:
df.shape

>The dataset is now reduced to 9 columns

In [None]:
df.describe()

In [None]:
df.info()

Now we will check for duplicated values and null values

In [None]:
df[df.duplicated()]

We have only one duplicated row. I'm going to remove the repeat and keep the first occurrence.

In [None]:
df.drop_duplicates(keep='first',inplace=True)

Now we have no duplicated values

In [None]:
df[df.duplicated()]
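
A quick way to confirm the result as a number instead of reading an empty table is to count the duplicated rows before and after. This is just a sketch on a made-up toy frame (`toy`, `n_dupes` and `deduped` are illustrative names):

```python
import pandas as pd

# Toy frame with one fully repeated row (values are made up)
toy = pd.DataFrame({
    "original_title": ["A", "B", "B"],
    "budget": [1, 2, 2],
})

# duplicated() flags every repeat of an earlier row, so the sum is the count
n_dupes = int(toy.duplicated().sum())

# keep="first" retains the first occurrence of each repeated row
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_dupes, len(deduped))
```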

Let's check for the null values

In [None]:
df.isnull().sum()

>Looks like we have 23 missing values in the genres column and 76 in the cast column

In [None]:
df[df.genres.isnull()]

The simplest way to handle these null values is to drop the rows. They are a small fraction of the 10866 rows, so removing them should have little effect on the analysis

In [None]:
#removing the rows which have the null values
df.dropna(subset = ["genres"], inplace=True)
df[df.genres.isnull()]

We will do the same thing for the cast column

In [None]:
df[df.cast.isnull()]

In [None]:
#dropping the NaNs
df.dropna(subset=['cast'],inplace=True)
df[df.cast.isnull()]
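
Both columns can also be handled in one call by passing them together in `subset`. The sketch below uses a made-up toy frame to show that a row missing both values is only dropped once, which is why the number of removed rows can be smaller than the sum of the two null counts:

```python
import numpy as np
import pandas as pd

# Toy frame: B misses genres, C misses cast, D misses both (values are made up)
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "genres": ["Drama", np.nan, "Comedy", np.nan],
    "cast": ["X", "Y", np.nan, np.nan],
})

before = len(toy)

# A row is dropped if ANY of the listed columns is null
cleaned = toy.dropna(subset=["genres", "cast"])

dropped = before - len(cleaned)
print(dropped)
```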

A runtime of 0 doesn't make sense for a movie, so let's look for such rows

In [None]:
df.query('runtime<=0')

I'm going to drop all the rows having zero or less as their runtime

In [None]:
df.drop(df[df['runtime']<= 0].index, inplace = True)
df.query('runtime<=0')

Let's have a look at the runtime column

In [None]:
df.runtime

I think that showing the runtime as hours and minutes will be more useful in the analysis

In [None]:
#using a lambda to convert the minutes into a more familiar HH:MM format
df.runtime = df.runtime.apply(lambda x: '{:02d}:{:02d}'.format(*divmod(x, 60)))
df.runtime
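
Overwriting the column with text means it can no longer be averaged directly. An alternative, sketched here on a made-up toy frame, is to keep the numeric minutes and add the formatted text as a separate column (the `runtime_hhmm` column and the `minutes_to_hhmm` helper are hypothetical names, not part of the dataset):

```python
import pandas as pd

# Toy frame with runtimes in minutes (values are made up)
toy = pd.DataFrame({
    "original_title": ["Short", "Feature", "Epic"],
    "runtime": [3, 102, 900],
})


def minutes_to_hhmm(minutes):
    """Format a number of minutes as a zero-padded 'HH:MM' string."""
    hours, mins = divmod(int(minutes), 60)
    return f"{hours:02d}:{mins:02d}"


# Keep the numeric column for calculations, add a text column for display
toy["runtime_hhmm"] = toy["runtime"].apply(minutes_to_hhmm)

# The numeric column can still be averaged directly
mean_minutes = toy["runtime"].mean()
print(toy["runtime_hhmm"].tolist(), mean_minutes)
```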

Now let's convert the release date into a proper datetime format

In [None]:
df.release_date = pd.to_datetime(df['release_date'])
df.release_date.head(5)
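
One possible pitfall here, assuming the dates are stored as text with a two-digit year such as `6/9/15` (I have not verified the format in this notebook): the `%y` directive maps 00-68 to 2000-2068, so an older movie can end up in the future. The sketch below uses made-up rows and assumes a `release_year` column is available to detect and correct such cases:

```python
import pandas as pd

# Toy rows (made up); the second one is meant to be from 1966
toy = pd.DataFrame({
    "release_date": ["6/9/15", "12/25/66"],
    "release_year": [2015, 1966],
})

# With %y, the two-digit year 66 is read as 2066
parsed = pd.to_datetime(toy["release_date"], format="%m/%d/%y")

# Rows whose parsed year is later than the known release year are a century off
wrong_century = parsed.dt.year > toy["release_year"]

# Keep correct dates as they are and shift the others back by 100 years
fixed = parsed.where(~wrong_century, parsed - pd.DateOffset(years=100))
print(fixed.dt.year.tolist())
```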

I will add a profit column to the dataset

In [None]:
#profit = revenue - budget
df['profit']=df.revenue-df.budget
df.profit.head(10)
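
If a budget or revenue of 0 really stands for "unknown", the profit computed from it is misleading. A minimal sketch of one way to handle that, assuming zeros mean missing data (the toy values and the `profit_known` column name are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: B has no recorded budget, C has no recorded revenue
toy = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "budget": [100, 0, 50],
    "revenue": [250, 80, 0],
})

# Treat zeros as missing so they do not produce a fake profit or loss
money = toy[["budget", "revenue"]].replace(0, np.nan)

# The result is NaN whenever either side is unknown
toy["profit_known"] = money["revenue"] - money["budget"]
print(toy)
```

Rows with a `NaN` profit can then be left out of rankings and averages instead of showing up as the biggest losses.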

Now we have our final dataset ready to analyse

In [None]:
df.head(5)

In [None]:
#last checking
df.info()

In [None]:
df.shape

>After cleaning, our final dataset has 10737 rows and 10 columns (the 9 kept columns plus the new profit column)

<a id='eda'></a>
## Exploratory Data Analysis

> After cleaning our data let's answer the questions we had earlier using statistics and visualizations.

### Research Question 1.1 (Which movie has the highest profit & which has the lowest ?)

We will get the highest profit by sorting the profit in descending order, and the lowest by sorting it in ascending order

In [None]:
#getting the highest profit
df_HighestSorted_profit=df.sort_values(by='profit' , ascending = False)
df_HighestSorted_profit[['original_title','profit']].head(1)


>Avatar tops the list with a profit of 2,544,505,847 dollars.

In [None]:
#getting the lowest profit
df_LowSorted_profit=df.sort_values(by='profit' , ascending = True)
df_LowSorted_profit[['original_title','profit']].head(1)


>The Warrior's Way has the lowest profit, with a total loss of 413,912,431 dollars.
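
Sorting the whole frame works, but the two extremes can also be looked up directly. Here is a small sketch with `idxmax` and `idxmin` on a made-up toy frame (`best` and `worst` are illustrative names):

```python
import pandas as pd

# Toy frame (values are made up)
toy = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "profit": [500, -200, 50],
})

# idxmax/idxmin return the index label of the largest/smallest value,
# and .loc then fetches the matching row
best = toy.loc[toy["profit"].idxmax()]
worst = toy.loc[toy["profit"].idxmin()]

print(best["original_title"], worst["original_title"])
```

The same pattern applies to the budget and revenue questions, as long as the column is numeric.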

### Research Question 1.2 (Which movie has the longest runtime & which has the shortest ?)

The same process will happen here

In [None]:
#getting the longest runtime
df_Longest=df.sort_values(by='runtime',ascending = False)
df_Longest[['original_title','runtime']].head(1)


>The movie with the longest runtime is The Story of Film: An Odyssey, at 15 hours!

In [None]:
#getting the shortest runtime
df_lowest=df.sort_values(by='runtime',ascending = True)
df_lowest[['original_title','runtime']].head(1)


>The movie with the shortest runtime is Batman: Strange Days, at only 3 minutes!

### Research Question 1.3 (Which movie had the highest and lowest budget?)

Again, the same process

In [None]:
#getting the highest budget
df_HighestBudget=df.sort_values(by='budget',ascending = False)
df_HighestBudget[['original_title','budget']].head(1)

>The Warrior's Way leads the way with 425,000,000 dollars.

In [None]:
#getting the lowest budget
df_lowestBudget=df.sort_values(by='budget',ascending = True)
df_lowestBudget[['original_title','budget']].head(1)

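>Salvando al Soldado Perez has the lowest budget, with a value of zero. A budget of zero most likely means the figure is missing rather than that the movie cost nothing.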

Now let's visualize the relation between budget and profit 

In [None]:
sns.scatterplot(data=df, x="budget", y="profit");


As you can see, movies with a larger budget tend to achieve a larger profit

### Research Question 1.4 (Which movie had the highest and lowest revenue?)

In [None]:
#getting the highest revenue
df_Highestrevenue=df.sort_values(by='revenue',ascending = False)
df_Highestrevenue[['original_title','revenue']].head(1)

>Of course it's Avatar, with 2,781,505,847 dollars in total revenue.

In [None]:
#getting the lowest revenue
df_lowestrevenue=df.sort_values(by='revenue',ascending = True)
df_lowestrevenue[['original_title','revenue']].head(1)

>Manos: The Hands of Fate has 0 as total revenue.

I will plot the same kind of graph as the last one, but now with revenue and profit

In [None]:
sns.scatterplot(data=df, x="revenue", y="profit");


>The relation between revenue and profit looks roughly linear

The graph of budget against revenue looks like this:

In [None]:
sns.scatterplot(data=df, x="budget", y="revenue");


>Budget and revenue are positively related: when the budget increases, the revenue tends to increase as well
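
To put a number on what the scatter plots suggest, the Pearson correlation between the columns can be computed. This is only a sketch on made-up toy values, so the printed numbers say nothing about the real dataset:

```python
import pandas as pd

# Toy frame (values are made up)
toy = pd.DataFrame({
    "budget": [10, 20, 30, 40],
    "revenue": [25, 35, 70, 90],
})
toy["profit"] = toy["revenue"] - toy["budget"]

# Pairwise Pearson correlation: values near 1 mean a strong positive
# linear relation, values near 0 mean no linear relation
corr = toy[["budget", "revenue", "profit"]].corr()
print(corr.round(2))
```

Since profit is defined as revenue minus budget, a strong correlation between revenue and profit is partly built in and should be read with that in mind.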

### Research Question 1.5 (Which actors appear most frequently?)


In [None]:
#Getting the most frequent actor
cast_count = pd.Series(df['cast'].str.cat(sep = '|').split('|')).value_counts(ascending = False)
cast_count.head(20)

>Robert De Niro leads the way here.
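
The counting step joins every cast string into one long string and splits it again. An equivalent approach that stays row by row, sketched here on made-up names, is to split each cell into a list and let `explode` give every actor a row of their own:

```python
import pandas as pd

# Toy cast column using the same '|' separator (names are made up)
toy = pd.DataFrame({
    "cast": ["Actor A|Actor B", "Actor A|Actor C", "Actor A"],
})

# split -> one list per movie, explode -> one row per actor,
# value_counts -> number of movies per actor, sorted in descending order
cast_count = toy["cast"].str.split("|").explode().value_counts()
print(cast_count)
```

Because the rows stay aligned with the original index, the exploded column can also be joined back to other columns, for example to count actors per genre.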

Let's see how this will look as a plot

In [None]:
#for the color variety: one color per bar, cycling through tab10
cmap = plt.cm.tab10
colors = cmap(np.arange(20) % cmap.N)
# Initialize the plot
diagram = cast_count.head(20).plot.barh(fontsize = 8,color=colors)
# Set a title
diagram.set(title = 'Cast')
# x-label and y-label
diagram.set_xlabel('Number of Movies')
diagram.set_ylabel('List of Cast')
# Show the plot
plt.show()

### Research Question 2  (Questions that will be answered based on the 100 highest-rated movies)

Before analyzing the top 100 movies we need a dataframe that contains them

In [None]:
df100R=df.sort_values(by='vote_average',ascending=False)
df100R=df100R[['original_title','vote_average']].head(100)
#dataframe that includes the top 100 movies
df100R.head(10)
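
Taking the first 100 rows after sorting cuts arbitrarily through movies that share the same rating at the boundary. The sketch below, on a made-up toy frame, shows how `nlargest` behaves with and without `keep="all"` (the `top2` and `top2_all` names are only for illustration):

```python
import pandas as pd

# Toy frame: C and D are tied at 8.0 (values are made up)
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D", "E"],
    "vote_average": [7.1, 9.2, 8.0, 8.0, 6.5],
})

# Default keep="first": exactly 2 rows, the tie is broken by row order
top2 = toy.nlargest(2, "vote_average")

# keep="all": every row tied with the last selected value is included
top2_all = toy.nlargest(2, "vote_average", keep="all")

print(top2["original_title"].tolist(), len(top2_all))
```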

### Research Question 2.1  (What is the highest rated movie?)

In [None]:
df100R.head(1)

The Story of Film: An Odyssey is the highest rated movie.

In [None]:
sns.histplot(data=df100R, x='vote_average');
plt.xlabel("Average Vote", size=10)
plt.ylabel("Number of Movies", size=10)
plt.title("Average Votes of the Top 100 Movies", size=15);

More than half of the top 100 movies have an average vote of about 8.0

Let's see some visualization of the top 100 movies

In [None]:
#creating a dataframe first
df100=df.sort_values(by='vote_average',ascending=False)
df100=df100[['original_title','vote_average','budget','revenue','profit','genres','cast']].head(100)
#visualizing the data
df100.hist(figsize=(15,8));


It seems that most of the films rated around 8.0 had smaller budgets than the higher-rated movies

### Research Question 2.2  (What is the average budget of the movies?)

Getting the average value for the budget

In [None]:
df100['budget'].mean()

>So the average budget of the top 100 movies is 14,185,571.58 dollars

### Research Question 2.3  (What is the average revenue of the movies?)

Getting the average value for the revenue

In [None]:
df100['revenue'].mean()

>The average revenue of the top 100 movies is 81,859,012.63 dollars

### Research Question 2.4  (What is the average runtime of the movies?)

Getting the average value for the runtime

In [None]:
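#convert the 'HH:MM' strings to timedeltas so they can be averaged
runtime_td = pd.to_timedelta(df['runtime'] + ':00')
runtime_td.mean()

> The average runtime is 1 hour, 42 minutes and 41 seconds. Note that this is computed over the whole cleaned dataset (`df`), not just the top 100 rated movies, because `df100` does not include the runtime column.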
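
To restrict the average to the highest-rated movies, the runtime column has to be part of the selected subset. A minimal sketch on a made-up toy frame, assuming the runtime is stored as an 'HH:MM' string (`top` and `mean_runtime` are illustrative names):

```python
import pandas as pd

# Toy frame with 'HH:MM' runtimes (values are made up)
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "vote_average": [9.0, 8.5, 6.0, 8.8],
    "runtime": ["01:30", "02:00", "01:45", "02:30"],
})

# Select the 3 highest-rated movies, keeping all columns including runtime
top = toy.nlargest(3, "vote_average")

# Append ':00' so each value reads as HH:MM:SS, then average the durations
mean_runtime = pd.to_timedelta(top["runtime"] + ":00").mean()
print(mean_runtime)
```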

### Research Question 2.5  (Which genres are the most successful?)

First I will get the total count for every genre

In [None]:
genres_count = pd.Series(df100['genres'].str.cat(sep = '|').split('|')).value_counts(ascending = False)
genres_count

> Fans seem to appreciate the Drama genre more than the others. Surprisingly to me, Documentary is second in the list with only one movie fewer than Drama.

Let's visualize the genres

In [None]:
#for the color variety: one color per bar, cycling through tab10
cmap = plt.cm.tab10
colors = cmap(np.arange(len(genres_count)) % cmap.N)
# Initialize the plot
diagram = genres_count.plot.bar(fontsize = 12,color=colors)
# Set a title
plt.title('Top Genres')
# x-label and y-label
plt.xlabel('Type')
plt.ylabel('Number of Movies')
# Show the plot
plt.show()

>As we can see, Drama leads the way, followed by Documentary and Music, with Comedy and Crime in fourth and fifth place. History and TV Movie are the two least represented genres

### Research Question 2.6  (Which actors appear most frequently?)

First I will get the total count for every actor

In [None]:
actor_count = pd.Series(df100['cast'].str.cat(sep = '|').split('|')).value_counts(ascending = False)
actor_count.head(20)

Let's visualize the actors

In [None]:
# One color per bar, cycling through tab10
cmap = plt.cm.tab10
colors = cmap(np.arange(20) % cmap.N)
# Initialize the plot
diagram = actor_count.head(20).plot.barh(fontsize = 8,color=colors)
# Set a title
diagram.set(title = 'Actors')
# x-label and y-label
diagram.set_xlabel('Number of Movies')
diagram.set_ylabel('List of Actors')
# Show the plot
plt.show()

The top six actors here are:
> - Louis Tomlinson
> - Niall Horan
> - Liam Payne
> - Bill Burr
> - David Tennant
> - Harry Styles

with each of them appearing in 3 movies in the list

<a id='conclusions'></a>
## Conclusions


> ##### So the conclusion is that if we want to create movies that can be among the top 100 highest-rated movies, then:
> The average budget of the movies can be around 14,185,571.58 dollars.
>
> The average runtime of the movies can be around 1 hour, 42 minutes and 41 seconds.
>
> The top 10 genres we should focus on are Drama, Documentary, Music, Comedy, Crime, Thriller, Animation, Science Fiction, Family and Adventure.
>
> The top 6 cast members we should focus on are Louis Tomlinson, Niall Horan, Liam Payne, Bill Burr, David Tennant and Harry Styles.
>
> The average revenue of the movies will be around 81,859,012.63 dollars.
>
> ##### The limitations associated with the conclusions are:
>We used the TMDB Movies dataset for our analysis and worked with budget, revenue, profit, ratings and runtime. Our analysis is limited to the provided dataset. For example, the dataset does not confirm that every release of every director is listed.
>
>No normalization, exchange rate or currency conversion was considered during this analysis, so it is limited to the raw numerical values of budget and revenue.
>
>Dropping missing or null values from the variables of interest might skew our analysis and could introduce unintentional bias into the relationships being analyzed.