# TMDb Movies Data Set Analysis

<a id='intro'></a>
# 1 - Introduction
The TMDb data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.

This report analyses these data to explore how budget relates to revenue and which genres best justify their cost.

#### From the data set we will try to answer the following questions: 

* How many movies are in the data set, and which release years does it cover?
* Which movie has the highest revenue?
* Which movie has the highest budget?
* Which movie has the highest vote (count and average)?
* How do revenue and budget compare for the 15 highest-revenue movies?
* What kinds of properties are associated with movies that have high revenues?
* Which movie has the highest revenue in each year?
* Which movie is the most popular in each genre?
* Which genre is produced most in each year?


>The data set is from https://docs.google.com/document/d/e/2PACX-1vTlVmknRRnfy_4eTrjw5hYGaiQim5ctr9naaRd4V9du2B5bxpd8FEH3KtDgp8qVekw7Cj1GLk1IXdZi/pub?embedded=True



## Table of Contents


<ul>
<li><a href="#intro">1 - Introduction</a></li>
<li><a href="#wrangling">2 - Data Wrangling</a></li>
<li><a href="#GP">2.1 - General Properties</a></li>
<li><a href="#DC1">2.2 - Data Cleaning (focusing on the budget & revenue & genres)</a></li>
<li><a href="#eda">3 - Exploratory Data Analysis</a></li>
<li><a href="#HM">3.1 - Highest Movies</a></li>    
<li><a href="#HMRBV">3.1.1 - Highest Movies (Revenue, Budget, Vote)</a></li>
<li><a href="#HM15">3.1.2 - Revenue Vs Budget for the highest 15 revenue movies</a></li> 
<li><a href="#Q2">3.1.3 - What kinds of properties are associated with movies that have high revenues?</a></li>
<li><a href="#HMY">3.1.4 - Highest Revenue Movies per Years</a></li>
<li><a href="#HMG">3.1.5 - Highest Popularity Movies per genres</a></li>
 
<li><a href="#Q1">3.3 - Genres most produced from year to year</a></li>
<li><a href="#DC2">3.4.1 - Data Cleaning (Removing movies with Null genres)</a></li>

<li><a href="#Conclusions">4 - Conclusions</a></li> 
<li><a href="#limitations">5 - Limitations</a></li>    
</ul>


In [None]:
# Import all of the packages used in this notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

<a id='wrangling'></a>
# 2 - Data Wrangling
> Presenting the data and cleaning it

<a id='GP'></a>
## 2.1 - General Properties
> Presenting general information about the data set 

In [None]:
# Read the data set and display general info about the dataframe
#df_original = pd.read_csv('tmdb-movies.csv')
df_original = pd.read_csv('../input/tmdb-movies-dataset/tmdb_movies_data.csv')
df_original.info()

In [None]:
# 1st row as example of the dataframe 
df_original.head(1)

In [None]:
# Displaying the essential information about the dataframe
# calculating some statistical data such as percentile, mean and std of different numerical values of the DataFrame.
df_original.describe()

In [None]:
# Correlation matrix (numeric columns only; text columns such as titles cannot be correlated)
corr = df_original.select_dtypes(include='number').corr()

plt.figure(figsize=(15,15))
ax = sns.heatmap(corr, annot = True, cmap = 'coolwarm')

ax.set_title("General Correlation",fontsize=13)
plt.savefig('df_original corr.jpg')

#### Research Question 1 : How many movies are in the data set, and which release years does it cover?

In [None]:
# Number of movies in the data set
# and the range of release years
print("The data set contains {} movies released between {} and {}".
      format(df_original.shape[0],
             df_original.release_year.min(),
             df_original.release_year.max()
            )
     )

<a id='DC1'></a>
## 2.2 - Data Cleaning (focusing on the budget & revenue & genres)

##### Removing unneeded columns from the dataset 


id               : it will not affect the analysis
<br>
imdb_id          : it will not affect the analysis
<br>
homepage         : it will not affect the analysis 
<br>
tagline          : it will not affect the analysis
<br>
keywords         : it will not affect the analysis
<br>
overview         : it will not affect the analysis
<br>
budget_adj       : it is highly correlated with budget, so we will use budget
<br>
revenue_adj      : it is highly correlated with revenue, so we will use revenue

In [None]:
df_original.info()

In [None]:
df = df_original.drop(["id",
                       "imdb_id",
                       "homepage",
                       "tagline",
                       "keywords",
                       "overview",
                       "budget_adj",
                       "revenue_adj"],
                      axis=1
                     )

df.info()

**Remove Duplicate Rows**

In [None]:
# Count the duplicated rows
df.duplicated().sum()

In [None]:
#drop duplicated row using 'drop_duplicates()' function
df=df.drop_duplicates()
df.shape

In [None]:
# 1st row as example of the dataframe 
df.head(1)

In [None]:
df.isnull().sum()
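
`isnull()` only counts explicit `NaN` entries. In data of this kind, a budget or revenue of 0 may stand for "not recorded" rather than a true zero; that is an assumption here, not something the data set states. Under that assumption, a minimal sketch on a small made-up frame (the `toy` values are illustrative and not taken from the data set) counts the zeros and converts them to `NaN`:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not taken from the TMDb file.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget": [100, 0, 250, 0],
    "revenue": [300, 0, 0, 50],
})

money_cols = ["budget", "revenue"]

# Count zero entries per column (assumed here to mean "value not recorded").
zero_counts = (toy[money_cols] == 0).sum()
print(zero_counts)

# Replace zeros with NaN so that mean(), corr() and similar methods skip them.
toy_clean = toy.copy()
toy_clean[money_cols] = toy_clean[money_cols].replace(0, np.nan)
print(toy_clean.isnull().sum())
```

Whether to treat zeros this way is a judgment call; it changes every statistic that involves budget or revenue.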

<a id='eda'></a>
# 3 - Exploratory Data Analysis
> Analysing the data frame


In [None]:
# Correlation matrix (numeric columns only)
df_corr = df.select_dtypes(include='number').corr()

plt.figure(figsize=(15,15))
ax = sns.heatmap(df_corr, annot = True, cmap = 'coolwarm')

ax.set_title("Correlation for the used data",fontsize=13)

# Saved under its own name so that it does not overwrite the heatmap of df_original
plt.savefig('df corr.jpg')

In [None]:
df.hist(figsize=(10,10))


plt.savefig('General hist.jpg')

In [None]:
ax = df.plot.scatter(x='release_year',y='popularity')
ax.set_title("Release year vs Popularity",fontsize=13)
ax.set_xlabel("Release Year",fontsize=12)
ax.set_ylabel("Popularity",fontsize=12)
plt.savefig('release_year vs popularity.jpg')

> From the histograms and the scatter plot it is clear that:
<br>
1 - More recent years have more movie releases
<br>
2 - More recent years tend to have higher popularity values
<br>
3 - The vote average is roughly symmetric (close to a normal distribution)
<br>
4 - Most movie runtimes are below 200 minutes
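
The first two observations can also be checked numerically rather than read off the plots. The following is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical); on the real dataframe the same `groupby` would be applied to `df`:

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "release_year": [1990, 1990, 2000, 2000, 2000, 2010],
    "popularity": [0.2, 0.4, 0.5, 0.7, 0.9, 2.0],
})

# Number of movies and median popularity for each release year.
per_year = toy.groupby("release_year")["popularity"].agg(
    movies="count", median_popularity="median"
)
print(per_year)
```

The median is used here instead of the mean because a few very popular movies can dominate a yearly average.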

<a id='HM'></a>
## 3.1 - Highest Movies 

<a id='HMRBV'></a>   
### 3.1.1 - Highest Movies (Revenue, Budget, Vote)

#### Research Question 2 : Which movie has the highest revenue?

In [None]:
print("Highest revenue movie '{0}' with revenue {1}$ and budget {2}$".
      format(df.original_title[df.revenue == df.revenue.max()].values[0] ,
             df.revenue.max() ,
             df.budget[df.revenue == df.revenue.max()].values[0]
            )
     )

df[df.revenue == df.revenue.max()]
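
The "row with the largest value" lookup is repeated for several columns in this section. One possible way to express it once is with `idxmax`; this is only a sketch, and the helper name `top_row` and the `toy` frame are hypothetical:

```python
import pandas as pd

def top_row(frame, column):
    """Return the row of `frame` that has the largest value in `column`."""
    # idxmax gives the index label of the first maximum; .loc fetches that row.
    return frame.loc[frame[column].idxmax()]

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "budget": [50, 200, 120],
    "revenue": [400, 150, 900],
})

best = top_row(toy, "revenue")
print(f"Highest revenue movie '{best['original_title']}' "
      f"with revenue {best['revenue']}$ and budget {best['budget']}$")
```

If several movies share the maximum, `idxmax` returns only the first one, which matches the `.values[0]` used in the cells of this section.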

#### Research Question 3 : Which movie has the highest budget?

In [None]:
print("Highest budget movie '{0}' with budget {1}$ and revenue {2}$".
      format(df.original_title[df.budget == df.budget.max()].values[0] ,
             df.budget.max() ,
             df.revenue[df.budget == df.budget.max()].values[0]
            )
     )

df[df.budget == df.budget.max()]

#### Research Question 4 : Which movie has the highest vote count?

In [None]:
print("Highest vote count movie '{}' with vote count {} and vote average {} and popularity {} ".
      format( 
             df.original_title[df.vote_count == df.vote_count.max()].values[0] ,
             df.vote_count.max() ,
             df.vote_average[df.vote_count == df.vote_count.max()].values[0] ,
             df.popularity[df.vote_count == df.vote_count.max()].values[0]     
            )
    )

df[df.vote_count == df.vote_count.max()]

#### Research Question 4 (continued) : Which movie has the highest vote average, and which is the most popular?
#### How do the votes relate to popularity?

In [None]:
print("Highest vote average movie '{}' with vote average {} and vote count {}  and popularity {} ".
      format( 
          df.original_title[df.vote_average == df.vote_average.max()].values[0] , 
          df.vote_average.max() , 
          df.vote_count[df.vote_average == df.vote_average.max()].values[0],
          df.popularity[df.vote_average == df.vote_average.max()].values[0]
      )
     )

df[df.vote_average == df.vote_average.max()]

In [None]:
print("Highest popularity movie '{}' with popularity {} and vote count {}  and vote average {} ".
      format( 
          df.original_title[df.popularity == df.popularity.max()].values[0] , 
          df.popularity.max() , 
          df.vote_count[df.popularity == df.popularity.max()].values[0],
          df.vote_average[df.popularity == df.popularity.max()].values[0]
      )
     )
df[df.popularity == df.popularity.max()]

In [None]:
# plt.style.use('seaborn') is not available under that name in recent Matplotlib releases;
# sns.set() applies seaborn's default theme. It is called before the figure is created so that it takes effect.
sns.set()
fig, (ax1,ax2) = plt.subplots(2, figsize=(10, 10))
fig.suptitle('Popularity vs (Vote average & Vote count)')

df.plot(kind='scatter',x='popularity',y='vote_average',color='r',ax=ax1)
ax1.set_xlabel("Popularity",fontsize=12)
ax1.set_ylabel("Vote Average",fontsize=12)


df.plot(kind='scatter',x='popularity',y='vote_count',color='g',ax=ax2)
ax2.set_xlabel("Popularity",fontsize=12)
ax2.set_ylabel("Vote Count",fontsize=12)

> From the previous analysis it is clear that:
<br>
- There is a positive correlation between popularity and vote count 
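
The strength of such a relationship can be expressed as a Pearson correlation coefficient instead of being judged by eye. A minimal sketch, assuming the column names used in this notebook and a made-up `toy` frame:

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "popularity": [0.5, 1.0, 2.0, 4.0, 8.0],
    "vote_count": [20, 45, 90, 210, 400],
    "vote_average": [6.5, 5.8, 7.1, 6.0, 6.9],
})

# Series.corr computes the Pearson correlation by default.
r_count = toy["popularity"].corr(toy["vote_count"])
r_average = toy["popularity"].corr(toy["vote_average"])

print(f"popularity vs vote_count:   r = {r_count:.2f}")
print(f"popularity vs vote_average: r = {r_average:.2f}")
```

Values close to 1 indicate a strong positive linear relationship, while values near 0 indicate a weak one.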


<a id='HM15'></a>
### 3.1.2 - The highest 15 revenue movies comparison
>Comparing revenue, budget, vote_count, vote_average, popularity and release_year for the 15 highest-revenue movies in one figure.
<br>
The figure is saved as 'first 15th movies revenue compare.jpg'.

In [None]:
# Helpers that write each bar's value above it on the current axes
def _add_value_labels(x, y, scale=None, ndigits=None):
    for i in range(len(x)):
        # Without a scale the raw value is shown; otherwise it is divided and rounded first
        label = y[i] if scale is None else str(round(float(y[i]) / scale, ndigits))
        plt.text(x=i, y=y[i], s=label, ha = 'center',
                 bbox = dict(facecolor = 'red', alpha =.8))

def addlabels(x,y):
    # raw values (e.g. release year, vote count)
    _add_value_labels(x, y)

def addlabels_1e0(x,y):
    # values rounded to one decimal place
    _add_value_labels(x, y, scale=1e0, ndigits=1)

def addlabels_1e9(x,y):
    # values in billions, two decimal places
    _add_value_labels(x, y, scale=1e9, ndigits=2)

def addlabels_1e8(x,y):
    # values in hundreds of millions, two decimal places
    _add_value_labels(x, y, scale=1e8, ndigits=2)

In [None]:
df_sorted = df.sort_values(by="revenue",ascending=False)
df_sorted[0:15].describe()

#### Research Question 5 : How do revenue and budget compare for the 15 highest-revenue movies?

In [None]:
top15 = df_sorted[0:15]
x = top15.original_title.values.tolist()

# Apply the seaborn theme before creating the figure so that it takes effect
sns.set()
fig, axes = plt.subplots(6, 1, figsize=(10, 10))
fig.suptitle('** The highest 15 revenue movies comparison **', fontsize=20)

# One panel per column, top to bottom: (column, y-axis label, y-limits, labelling helper)
panels = [
    ("release_year", "Release Year", (1990, 2020), addlabels),
    ("vote_average", "Vote Average", (5, 8), addlabels),
    ("vote_count", "Vote Count", (0, 10000), addlabels),
    ("budget", "Budget", (1e7, 3e8), addlabels_1e8),
    ("popularity", "Popularity", (0, 35), addlabels_1e0),
    ("revenue", "Revenue", (1e9, 3e9), addlabels_1e9),
]

for ax, (column, label, ylim, label_fn) in zip(axes, panels):
    top15.plot(x="original_title", y=column, kind='bar', ax=ax, width=0.3)
    ax.set_ylabel(label, fontsize=12)
    ax.set_ylim(*ylim)
    plt.sca(ax)  # the addlabels helpers draw on the current axes
    label_fn(x, top15[column].values.tolist())

# Only the bottom panel shows the movie titles
for ax in axes[:-1]:
    ax.get_xaxis().set_visible(False)
axes[-1].set_xlabel("Movie Title", fontsize=12)

plt.savefig('first 15th movies revenue compare.jpg')
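
Besides the bar charts, the revenue and budget comparison can be summarised with two derived columns: profit and revenue per unit of budget. The sketch below uses a made-up `toy` frame, and the column names `profit` and `revenue_per_budget` are introduced here for illustration only:

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget": [100, 200, 50, 400],
    "revenue": [500, 300, 450, 600],
})

# Keep the highest-revenue movies (3 here; 15 in the analysis above).
top = toy.sort_values(by="revenue", ascending=False).head(3).copy()

# Absolute and relative return on the budget.
top["profit"] = top["revenue"] - top["budget"]
top["revenue_per_budget"] = top["revenue"] / top["budget"]

print(top[["original_title", "budget", "revenue", "profit", "revenue_per_budget"]])
```

Rows with a budget of 0 would produce an infinite ratio, so they would need to be filtered out or treated as missing before this step.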

<a id='Q2'></a>
### 3.1.3 - What kinds of properties are associated with movies that have high revenues?

In [None]:
df.info()

#### Research Question 6 : What kinds of properties are associated with movies that have high revenues?

In [None]:
# Correlation matrix (numeric columns only)
df_corr = df.select_dtypes(include='number').corr()

plt.figure(figsize=(12,12))
ax = sns.heatmap(df_corr, annot = True, cmap = 'coolwarm')
ax.set_title("Correlation for the used data",fontsize=13)
plt.savefig('df corr.jpg')

#### From the previous analysis 


1- There is a positive correlation between popularity and revenue, so popularity is an **effective** factor

2- There is only a weak correlation between vote average and revenue, so vote average is **not effective**

3- There is a strong positive correlation between budget and revenue, but a high budget does not guarantee a high revenue 

4- All of the highest-revenue movies have a runtime between 130 and 200 minutes, **so runtime is effective**

5- There is a strong positive correlation between vote count and revenue: **the more votes a movie has, the higher its revenue tends to be**
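
The heatmap findings can be put in order by pulling the `revenue` column out of the correlation matrix and sorting it. This is a minimal sketch on a made-up `toy` frame, so the numbers it prints say nothing about the real data:

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D", "E"],
    "budget": [10, 20, 30, 40, 50],
    "vote_count": [5, 9, 16, 20, 24],
    "vote_average": [7.0, 5.5, 6.8, 6.0, 6.4],
    "revenue": [15, 35, 50, 90, 100],
})

# Correlate numeric columns only, then rank every other column by its
# correlation with revenue (strongest positive first).
numeric = toy.select_dtypes(include="number")
revenue_corr = (
    numeric.corr()["revenue"]
    .drop("revenue")
    .sort_values(ascending=False)
)
print(revenue_corr)
```

A ranking like this only describes linear association; it does not show that any property causes higher revenue.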



<a id='HMY'></a>
### 3.1.4 - Highest Popularity Movies per Year

In [None]:
# Sorted array of the distinct release years
year_list = np.unique(df['release_year'])

#### Research Question 7 : Which movie has the highest revenue in each year?

In [None]:
# Note: the question asks about revenue, but this cell ranks by popularity
# (as in the section heading). Set metric = 'revenue' to rank by revenue instead.
metric = 'popularity'

# idxmax returns the row label of the first maximum within each release year
top_idx = df.groupby('release_year')[metric].idxmax()

df1 =  pd.DataFrame({'Year':df.loc[top_idx, 'release_year'].values,
                     'Movie_Name':df.loc[top_idx, 'original_title'].values})
df1   

<a id='HMG'></a>
### 3.1.5 - Highest Popularity Movies per genres
>The genres column has NA values, so we need to clean it first

In [None]:
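# Split the pipe-separated genres and collect every distinct genre name
genres_list = np.array(df['genres'].str.split('|'))
genres_list = np.hstack(genres_list)
genres_list = np.unique(genres_list)
# Movies without genres show up in this list as 'nan'
print(genres_list)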

In [None]:
genres_list = genres_list[genres_list!='nan']
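
An alternative that avoids the `'nan'` entry altogether is to drop the missing values before splitting. The following is only a sketch of that idea on a made-up `toy` frame, not the approach used in this notebook:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; one movie has no genres.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "genres": ["Action|Drama", np.nan, "Drama|Comedy"],
})

unique_genres = (
    toy["genres"]
    .dropna()            # skip movies without genres
    .str.split("|")      # "Action|Drama" -> ["Action", "Drama"]
    .explode()           # one row per (movie, genre) pair
    .unique()
)
unique_genres = sorted(unique_genres)
print(unique_genres)
```

Because the missing values are removed first, no placeholder string needs to be filtered out afterwards.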

In [None]:
df.shape

In [None]:
df = df.dropna(subset=['genres'])
df.shape

#### Research Question 8 : Which movie is the most popular in each genre?

In [None]:
df_title_list=[]

for genre in genres_list:
    # regex=False: match the genre name literally
    df_genre = df[df['genres'].str.contains(genre, regex=False)]
    # title of the (first) most popular movie in this genre
    df_title_list.append(df_genre.loc[df_genre.popularity.idxmax(), 'original_title'])

df_genres =  pd.DataFrame({'Genre':genres_list,
                           'Movie_Name':df_title_list})
df_genres
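
The same question can be answered without a Python loop by reshaping the data so that each row holds one movie and one genre. This is a sketch under that assumption, using a made-up `toy` frame and a hypothetical helper column named `genre`:

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "genres": ["Action|Drama", "Drama", "Comedy|Action", "Comedy"],
    "popularity": [3.0, 5.0, 4.0, 1.0],
})

# One row per (movie, genre) pair.
long_df = toy.assign(genre=toy["genres"].str.split("|")).explode("genre")
long_df = long_df.reset_index(drop=True)  # explode repeats index labels

# Row label of the most popular movie within each genre.
best_idx = long_df.groupby("genre")["popularity"].idxmax()
best_per_genre = long_df.loc[best_idx, ["genre", "original_title", "popularity"]]
best_per_genre = best_per_genre.reset_index(drop=True)
print(best_per_genre)
```

A movie that belongs to several genres can appear as the top entry of more than one of them, as movie "C" does in this toy example.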

<a id='Q1'></a>
## 3.3 - Genres most produced from year to year

#### Research Question 9 : Which genre is produced most in each year?

In [None]:
max_genres = []
max_count = []
for year in year_list:
    genres_max = 0
    genres_type = None
    for genres in genres_list:
        # number of movies released in this year that carry this genre
        genres_count = (df['genres'].str.contains(genres, regex=False) &
                        (df['release_year']==year)).sum()
        # keep the genre with the largest count seen so far for this year
        if genres_max <  genres_count:
            genres_max = genres_count
            genres_type= genres
    max_genres.append(genres_type)
    max_count.append(genres_max)
    
df_geners =  pd.DataFrame({'Year':year_list,
                           'Geners':max_genres,
                           'movies_number':max_count})
df_geners        
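
The nested loop can also be written as a single grouped count. The sketch below shows one way to do that on a made-up `toy` frame; the helper column `genre` and the result name `top_per_year` are hypothetical:

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "release_year": [2000, 2000, 2000, 2001, 2001],
    "genres": ["Drama|Comedy", "Drama", "Action", "Comedy|Action", "Comedy"],
})

# One row per (movie, genre) pair, then count movies per year and genre.
long_df = toy.assign(genre=toy["genres"].str.split("|")).explode("genre")
counts = (
    long_df.groupby(["release_year", "genre"])
    .size()
    .reset_index(name="movies_number")
)

# Within each year put the largest count first and keep only that row.
top_per_year = (
    counts.sort_values(["release_year", "movies_number"], ascending=[True, False])
    .drop_duplicates("release_year")
    .reset_index(drop=True)
)
print(top_per_year)
```

When two genres tie within a year only one of them is kept, so ties deserve a separate check if they matter for the conclusion.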

In [None]:
# Number of years in which each genre was the most produced
df_geners.groupby("Geners")[["Year"]].count()

In [None]:
sns.set(rc={"figure.figsize":(15, 15)})
# x and y must be passed as keyword arguments in recent seaborn versions
sns.lmplot(x='Year', y='movies_number', data=df_geners, hue='Geners', fit_reg=False)
ax = plt.gca()

ax.set_title("Year vs Movies Number",fontsize=13)
ax.set_xlabel("Year",fontsize=12)
ax.set_ylabel("Movies Number",fontsize=12)

plt.savefig('Years vs Geners movies_numbers.jpg')

<a id='Conclusions'></a>
# 4 - Conclusions

* **Drama** is the most produced genre over the years, followed by Comedy 
* Movie production increases in more recent years 
* **'Avatar', 'Star Wars' and 'Titanic'** are the highest-revenue movies
* Movies with a high budget most probably have high revenue and high popularity 
* There is a positive relationship between more recent release years and popularity
* The vote average is roughly symmetric (close to a normal distribution)
* Vote count reflects popularity even when the vote average is low 
* Most movie runtimes are below 200 minutes, and **movies rarely run longer than 200 minutes**
* There is a strong positive correlation between vote count and revenue: the more votes a movie has, the higher its revenue tends to be




<a id='limitations'></a>
# 5 - Limitations

* Some movies have no budget or revenue values (**missing data**)
* Some movies have no genres (**missing data**)
* We did not analyse combined genres. Most movies have not one specific genre but several, in different combinations, and this might affect the revenue
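
As a starting point for the last limitation, genre combinations could be treated as categories in their own right. This is a hedged sketch only: the `toy` frame is made up, and the helper `normalise_combo` and the column `genre_combo` are hypothetical names introduced for illustration:

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "genres": ["Action|Drama", "Drama|Action", "Comedy", "Action|Drama", "Comedy|Drama"],
    "revenue": [100, 300, 50, 200, 80],
})

def normalise_combo(genres):
    """Sort the genre names so that 'Drama|Action' and 'Action|Drama' match."""
    return "|".join(sorted(genres.split("|")))

toy["genre_combo"] = toy["genres"].map(normalise_combo)

# Number of movies and mean revenue for each genre combination.
combo_stats = (
    toy.groupby("genre_combo")["revenue"]
    .agg(movies="count", mean_revenue="mean")
    .sort_values("movies", ascending=False)
)
print(combo_stats)
```

Rows with missing genres would have to be dropped before applying `normalise_combo`, since it expects a string.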

