# Project: Investigate a Dataset (TMDb Movie Data)

## Table of Contents

* Introduction
* Data Wrangling
* Exploratory Data Analysis
* Conclusions

## Introduction

We chose the TMDb movie data set for this analysis. The data set contains information about roughly 10,000 movies collected from The Movie Database (TMDb). We would like to find interesting patterns in it.


 #### We're going to study:
 
 1. What is the average runtime, budget, and revenue of all movies?
 2. What is the relation between budget and runtime, and between revenue and budget?
 3. Which genres are the most successful?
 4. Which genres are most popular from year to year?
 5. Which month had the highest number of releases across all years, and which month had the highest average revenue?
 6. Which movies have the longest and shortest runtimes?
 7. Which movies have the highest and lowest revenue?


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from functools import reduce
import datetime as dt
from scipy import stats as st


    
## Step 1. Download the data and prepare it for analysis

In [None]:
movies = pd.read_csv('tmdb-movies.csv')
display(movies.head(2))

In [None]:
# drop the columns that are not needed for this analysis
movies.drop(['homepage','imdb_id','keywords','tagline','overview','production_companies','budget_adj','revenue_adj','vote_count','cast','director'], axis=1, inplace=True)
display(movies.head(2))

In [None]:
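#Keep only the columns used in the analysis, in a convenient order

movies = movies[['id', 'original_title','genres', 'release_year','release_date','budget','revenue','vote_average','popularity','runtime']]
#A zero budget, runtime or revenue is treated here as missing information, so those rows are removed
movies = movies.query('budget > 0 & runtime > 0 & revenue > 0')
#Drop rows with missing values; .copy() avoids a SettingWithCopyWarning when columns are added later
movies = movies.dropna().copy()
display(movies.head(2))
#info() prints its report directly, so it does not need display()
movies.info()
display(movies.isnull().sum())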



* Conclusion:
    
1. The raw movies data set has 10866 rows; 10 columns are kept for the analysis.
2. No missing values remain after cleaning, and the data types are suitable for the analysis. Duplicate rows are not checked by the code above.
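
The checks behind this kind of conclusion can be made explicit. The sketch below is only an illustration on a small made-up frame (`toy` and its values are hypothetical, not the TMDb file); it counts missing cells and fully duplicated rows before dropping them.

```python
import pandas as pd

# Hypothetical toy frame standing in for the movies data (not the real TMDb file).
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "original_title": ["A", "B", "B", None],
    "budget": [100, 200, 200, 300],
})

n_missing = int(toy.isnull().sum().sum())   # total number of missing cells
n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row

# Remove the duplicated rows first, then the rows with missing values.
clean = toy.drop_duplicates().dropna()

print("missing cells:", n_missing)
print("duplicate rows:", n_duplicates)
print("shape after cleaning:", clean.shape)
```

Running the same three calls on `movies` would show whether the "no missing and no duplicate values" statement actually holds for the filtered data.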


    
## Step 2. Exploratory data analysis
    

### Q1 : What is the average runtime/budget/revenue of all movies?

In [None]:
# Average runtime of movies
print('Average movie run time:',movies['runtime'].mean())
#histogram movie runtime
plt.hist(movies["runtime"],bins= 30,range=(0,300))
plt.title("Movie runtime distribution")
plt.xlabel("runtime of movies")
plt.ylabel("number of movies")
plt.show()

In [None]:
# Average budget of movies
print('Average movie budget:',movies['budget'].mean())
#histogram movie budget
plt.hist(movies["budget"],bins= 20,range=(0,350000000))
plt.title("Movie budget distribution")
plt.xlabel("budget of movies")
plt.ylabel("number of movies")
plt.show()

In [None]:
# Average revenue of movies
print('Average movie revenue:',movies['revenue'].mean())
#histogram movie revenue
plt.hist(movies["revenue"],bins= 30,range=(0,1000000000))
plt.title("Movie revenue distribution")
plt.xlabel("revenue of movies")
plt.ylabel("number of movies")
plt.show()

* Conclusion:

The average runtime of movies is approx. 102 minutes, the average budget is approx. 40 million, and the average revenue is approx. 100 million. So on average the revenue is approx. 2.5 times the budget, but this ratio is not the same for all movies.
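
The revenue-to-budget ratio can be computed in two ways that generally give different answers. The sketch below uses a made-up three-row frame (`toy` is hypothetical; its numbers were chosen so that the means are 40 and 100) to contrast the ratio of the averages with the average of the per-movie ratios.

```python
import pandas as pd

# Hypothetical values, chosen so that mean budget = 40 and mean revenue = 100.
toy = pd.DataFrame({
    "budget": [10.0, 40.0, 70.0],
    "revenue": [20.0, 100.0, 180.0],
})

# Ratio of the averages: this is the "2.5 times" style of figure.
ratio_of_means = toy["revenue"].mean() / toy["budget"].mean()

# Average of the per-movie ratios: weights every movie equally.
mean_of_ratios = (toy["revenue"] / toy["budget"]).mean()

print("ratio of means:", ratio_of_means)
print("mean of ratios:", round(mean_of_ratios, 3))
```

Which of the two is reported should be stated, because a few very large productions can dominate the ratio of the averages.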

### Q2 : Relation between Budget Vs Runtime and Revenue Vs Budget?

In [None]:
#Budget Vs Runtime 
movies.plot(x='runtime', y='budget', kind='scatter', figsize=(8, 6), sharex=False, grid=True);
plt.title("Movie budget Vs runtime")
plt.show()

#correlation between budget and runtime
print('correlation between budget and runtime:',movies['budget'].corr(movies['runtime']))

In [None]:
#Revenue Vs Budget 
movies.plot(x='budget', y='revenue', kind='scatter', figsize=(8, 6), sharex=False, grid=True);
plt.title("Movie revenue Vs budget")
plt.show()
#correlation between revenue and budget
print('correlation between revenue and budget:',movies['revenue'].corr(movies['budget']))

* Conclusion:



1. There is a positive relation between budget and runtime. Most movies run longer than 50 minutes and shorter than 200 minutes. Despite the positive correlation, the very longest movies tend to have low budgets, possibly because of editing choices. Movies between 150 and 200 minutes tend to have higher budgets.

2. There is a positive relation between revenue and budget. Most movies have a budget no greater than 100 million, and the average revenue is approx. 100 million. Higher-budget movies tend to earn higher revenue as well. For example, movies with a budget of two hundred million bring in approx. 1.5 billion.
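
`Series.corr` returns the Pearson coefficient by default, which measures only the linear part of a relation. As a rough illustration on made-up numbers (`toy` is hypothetical), the sketch below compares it with a rank-based (Spearman) coefficient, computed here as the Pearson correlation of the ranks so that only pandas is needed.

```python
import pandas as pd

# Hypothetical data: revenue grows with budget, but not linearly.
toy = pd.DataFrame({
    "budget": [1.0, 2.0, 3.0, 4.0, 5.0],
    "revenue": [1.0, 4.0, 9.0, 16.0, 25.0],
})

# Pearson: strength of the linear relation.
pearson = toy["budget"].corr(toy["revenue"])

# Spearman: Pearson correlation of the ranks, i.e. strength of the monotonic relation.
spearman = toy["budget"].rank().corr(toy["revenue"].rank())

print("pearson:", round(pearson, 3))
print("spearman:", round(spearman, 3))
```

A positive coefficient describes how the two columns move together; it does not show that a larger budget causes a longer runtime or a higher revenue.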

### Q3 : Which are the successful genres?

In [None]:
#showing types of our movies 
print(movies.iloc[:,2].values)

In [None]:
count = pd.Series(movies['genres'].str.cat(sep = '|').split('|')).value_counts(ascending = False)
count

In [None]:
# Initialize the plot for top 6 successful genre
diagram = count[:6].plot.bar(fontsize = 10)
# Set a title
diagram.set(title = 'Top Genres')
# x-label and y-label
diagram.set_xlabel('Type of genres')
diagram.set_ylabel('Number of Movies')
# Show the plot
plt.show()

* Conclusion:

The top six genres by number of movies are Drama, Comedy, Thriller, Action, Adventure and Romance.
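
The genre counts come from splitting the pipe-separated `genres` strings. An equivalent way to do this, sketched on a hypothetical three-row frame (`toy` is not the real data), is to split each string into a list and use `explode` to give every genre its own row before counting.

```python
import pandas as pd

# Hypothetical rows using the same "A|B|C" layout as the genres column.
toy = pd.DataFrame({
    "genres": ["Drama|Comedy", "Drama", "Action|Drama|Thriller"],
})

# Split each string into a list, put every genre on its own row, then count.
genre_counts = toy["genres"].str.split("|").explode().value_counts()

print(genre_counts)
```

Because `explode` keeps the original row index, the same approach can later be combined with other columns, such as `release_year`.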

### Q4 : Which genres are most popular from year to year?

In [None]:
#collect the genre string of every movie in a list.
genre_details = list(map(str,(movies['genres'])))
genre = ['Adventure', 'Science Fiction', 'Fantasy', 'Crime', 'Western', 'Family','Animation','War','Mystery','Romance','TV Movie','Action', 'Thriller','Comedy','Drama' , 'History', 'Music', 'Horror', 'Documentary', 'Foreign']

#numpy arrays holding the release_year and popularity of every movie.
year = np.array(movies['release_year'])
popularity = np.array(movies['popularity'])

#dataframe of zeros whose index is the genres and whose columns are the years.
popularity_df = pd.DataFrame(0.0, index = genre, columns = range(1960, 2016))

#add each movie's popularity to every genre it belongs to, in its release year.
for genres_str, yr, pop in zip(genre_details, year, popularity):
    split_genre = genres_str.split('|')
    popularity_df.loc[split_genre, yr] += pop

In [None]:
#standardize the values (z-score) within each year so that genres can be compared across years.
def calculate_std(x):
    return (x-x.mean())/x.std(ddof=0)

popular_genre = calculate_std(popularity_df)
popular_genre.head()

In [None]:
#How the top6 genre popularity differ year by year.
sns.set_style("whitegrid")
#make a subplot of size 2,3.
fig, ax = plt.subplots(2,3,figsize = (16,10))

#set the title of the subplot.
fig.suptitle('Genre Popularity Over Year To Year',fontsize = 16)

#plot the 'Drama' genre plot see the popularity difference over year to year.
popular_genre.loc['Drama'].plot(label = "Drama",color = '#f67280',ax = ax[0][0],legend=True)

#plot the 'Comedy' genre plot see the popularity difference over year to year.
popular_genre.loc['Comedy'].plot(label = "Comedy",color='#33FFB5',ax = ax[0][1],legend=True)

#plot the 'Thriller' genre plot see the popularity difference over year to year.
popular_genre.loc['Thriller'].plot(label = "Thriller",color='#33FFB5',ax = ax[0][2],legend=True)

#plot the 'Action' genre plot see the popularity difference over year to year.
popular_genre.loc['Action'].plot(label = "Action",color='#00818a',ax = ax[1][0],legend=True)

#plot the 'Adventure' genre plot see the popularity difference over year to year.
popular_genre.loc['Adventure'].plot(label = "Adventure",color='#08c299',ax = ax[1][1],legend=True)

#plot the 'Romance' genre plot see the popularity difference over year to year.
popular_genre.loc['Romance'].plot(label = "Romance",color='#08c299',ax = ax[1][2],legend=True)
plt.show()




* Conclusion:

Drama was the most popular genre in the early years, but its popularity has trended downward over time. In recent years Drama is no longer among the most popular genres, and audiences have been more interested in the Action genre.
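
The curves above are standardized values rather than raw popularity sums. As a small sketch on made-up numbers (`toy` is hypothetical), the same formula as `calculate_std` turns each year's column into z-scores: every year ends up with mean 0 and standard deviation 1 across genres, so a genre's curve shows how far it sits above or below the other genres in that year.

```python
import pandas as pd

# Hypothetical popularity sums: rows are genres, columns are years.
toy = pd.DataFrame(
    {2014: [1.0, 2.0, 3.0], 2015: [10.0, 20.0, 60.0]},
    index=["Drama", "Comedy", "Action"],
)

# DataFrame.mean() and .std() work column by column, so each year is
# standardized across the genres (population standard deviation, ddof=0).
z = (toy - toy.mean()) / toy.std(ddof=0)

print(z.round(3))
```

A consequence of this choice is that the plots compare genres within a year; they do not show whether overall popularity rose or fell between years.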

### Q5 : Which Month Released Highest Number Of Movies In All Of The Years? And Which Month Made The Highest Average Revenue?

In [None]:

#extract the month number from the release date.
movies['release_date'] = pd.to_datetime(movies['release_date'])
movies['month'] = movies['release_date'].dt.month
month_release = movies['release_date'].dt.month

display(movies.head())

#count the movies in each month using value_counts().
number_of_release = month_release.value_counts().sort_index()

months=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
#name the counts explicitly so the column label does not depend on the pandas version.
number_of_release = number_of_release.rename('number_of_release').to_frame()
number_of_release['month'] = months

#set the plot style before plotting, then draw the bar graph.
sns.set_style("darkgrid")
number_of_release.plot(x='month',kind='bar',fontsize = 11,figsize=(8,6))

#set the labels and titles of the plot.
plt.title('Months vs Number Of Movie Releases',fontsize = 15)
plt.xlabel('Month',fontsize = 13)
plt.ylabel('Number of movie releases',fontsize = 13)
plt.show()

In [None]:
#which month made the highest average revenue?
#make a dataframe that stores the release month of each movie.
month_release = pd.DataFrame(month_release)

#change the column name of the new dataframe 'month_release'.
month_release.rename(columns = {'release_date':'release_month'},inplace=True)

#add a new column 'revenue' in the dataframe 'month_release'.
month_release['revenue'] = movies['revenue']

#group the data by month and calculate the mean revenue of each month.
mean_revenue  = month_release.groupby('release_month').mean()
mean_revenue['month'] = months

#set the plot style before plotting, then make the bar plot using pandas plot function.
sns.set_style("darkgrid")
mean_revenue.plot(x='month',kind='bar',figsize = (8,6),fontsize=11)

#set up the title and labels of the plot.
plt.title('Average revenue by month (1960 - 2015)',fontsize = 15)
plt.xlabel('Month',fontsize = 13)
plt.ylabel('Average Revenue',fontsize = 13)
plt.show()

* Conclusion:

The highest numbers of movies were released in September and December. The highest average revenue was earned in June, May and November.
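
Both monthly figures, the number of releases and the average revenue, can also be obtained in a single `groupby` step. The sketch below shows this on a hypothetical four-row frame (`toy` and its dates and revenues are made up), using named aggregation.

```python
import pandas as pd

# Hypothetical releases; only the month of each date matters here.
toy = pd.DataFrame({
    "release_date": pd.to_datetime(
        ["2015-05-01", "2015-06-10", "2014-06-20", "2013-12-25"]
    ),
    "revenue": [100.0, 300.0, 500.0, 200.0],
})

# One pass per month: how many movies were released and what they earned on average.
by_month = (
    toy.assign(month=toy["release_date"].dt.month)
       .groupby("month")["revenue"]
       .agg(number_of_release="count", mean_revenue="mean")
)

print(by_month)
```

Keeping the count next to the mean makes it easier to see when a high monthly average rests on only a few movies.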

### Q6 : Movie with Longest And Shortest Runtime?

In [None]:
#top 5 movies with the longest runtime
#sort the 'runtime' column in descending order and store it in a new dataframe.
info = pd.DataFrame(movies['runtime'].sort_values(ascending = False))
info['original_title'] = movies['original_title']
data = list(map(str,(info['original_title'])))

#extract the 5 longest movies from the list and dataframe.
x = list(data[:5])
y = list(info['runtime'][:5])

#set the figure size and style before plotting so that they apply to this figure.
sns.set(rc={'figure.figsize':(10,5)})
sns.set_style("darkgrid")
#make the point plot and set up the title and labels.
ax = sns.pointplot(x=y,y=x)
ax.set_title("Top 5 Longest Movies",fontsize = 15)
ax.set_xlabel("Runtime",fontsize = 13)

The top 5 longest movies are:
1. Carlos
2. Cleopatra
3. Heaven's Gate
4. Lawrence of Arabia
5. Gods and Generals

Carlos is the longest movie, at close to 340 minutes.

In [None]:
#top 5 movies with the shortest runtime
#sort the 'runtime' column in ascending order and store it in a new dataframe.
info = pd.DataFrame(movies['runtime'].sort_values(ascending = True))
info['original_title'] = movies['original_title']
data = list(map(str,(info['original_title'])))

#extract the 5 shortest movies from the list and dataframe.
x = list(data[:5])
y = list(info['runtime'][:5])

#set the figure size and style before plotting so that they apply to this figure.
sns.set(rc={'figure.figsize':(10,5)})
sns.set_style("darkgrid")
#make the point plot and set up the title and labels.
ax = sns.pointplot(x=y,y=x)
ax.set_title("Top 5 Shortest Movies",fontsize = 15)
ax.set_xlabel("Runtime",fontsize = 13)

The top 5 shortest movies are:

1. Kid's Story
2. Mickey's Christmas Carol
3. Dr. Horrible's Sing-Along Blog
4. Louis C.K.: Live at the Beacon Theater
5. Winnie the Pooh

Kid's Story runs less than 20 minutes, whereas Louis C.K.: Live at the Beacon Theater and Winnie the Pooh run around 60 minutes.

### Q7 : Movie with highest And lowest Revenue?

In [None]:
#top 5 movies with the highest revenue.
#sort the 'revenue' column in descending order and store it in a new dataframe.
info = pd.DataFrame(movies['revenue'].sort_values(ascending = False))
info['original_title'] = movies['original_title']
data = list(map(str,(info['original_title'])))

#extract the 5 highest-revenue movies from the list and dataframe.
x = list(data[:5])
y = list(info['revenue'][:5])

#set the figure size and style before plotting so that they apply to this figure.
sns.set(rc={'figure.figsize':(10,5)})
sns.set_style("darkgrid")
#make the point plot and set up the title and labels.
ax = sns.pointplot(x=y,y=x)
ax.set_title("Top 5 High Revenue Movies",fontsize = 15)
ax.set_xlabel("Revenue",fontsize = 13)

* Conclusion:

The top 5 highest-revenue movies are:
1. Avatar
2. Star Wars
3. Titanic
4. The Avengers
5. Jurassic World

Avatar earned almost 2.8 billion, which is more than any other movie in the given time period.

In [None]:
#top 5 movies with the lowest revenue.
#sort the 'revenue' column in ascending order and store it in a new dataframe.
info = pd.DataFrame(movies['revenue'].sort_values(ascending = True))
info['original_title'] = movies['original_title']
data = list(map(str,(info['original_title'])))

#extract the 5 lowest-revenue movies from the list and dataframe.
x = list(data[:5])
y = list(info['revenue'][:5])

#set the figure size and style before plotting so that they apply to this figure.
sns.set(rc={'figure.figsize':(10,5)})
sns.set_style("darkgrid")
#make the point plot and set up the title and labels.
ax = sns.pointplot(x=y,y=x)
ax.set_title("Top 5 Low Revenue Movies",fontsize = 15)
ax.set_xlabel("Revenue",fontsize = 13)

The top 5 lowest-revenue movies are:

1. Shattered Glass
2. Mallrats
3. Dr. Horrible's Sing-Along Blog
4. Bordello of Blood
5. Kid's Story

The lowest-revenue entries are suspicious. For example, Shattered Glass and Mallrats have a recorded revenue of just 2. Such values are not realistic, so some information is probably missing or mis-scaled.
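
One way to make such suspicious entries visible is to flag every row whose revenue falls below a plausibility threshold. The sketch below is only an illustration: `toy`, its titles, and the `MIN_PLAUSIBLE_REVENUE` cut-off of 1,000 are assumptions, not values taken from the data set.

```python
import pandas as pd

# Hypothetical rows; the first one imitates a revenue that was recorded as 2.
toy = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "budget": [6_000_000, 6_000_000, 50_000],
    "revenue": [2, 2_100_000, 300_000],
})

# Assumed cut-off: anything below it is treated as a likely data-entry problem.
MIN_PLAUSIBLE_REVENUE = 1_000

suspicious = toy[toy["revenue"] < MIN_PLAUSIBLE_REVENUE]

print(suspicious[["original_title", "revenue"]])
```

Whether flagged rows should be dropped or corrected is a judgment call; listing them first at least shows how many movies are affected.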


## Step 3. Test the following hypothesis

#### Test : Does the average movie revenue differ between May and June over the whole time period?

**Testing:**
Let's compare the sample means of revenue for movies released in May and in June:
1. H0  - the mean revenues of May and June releases are equal.
1. H1  - the mean revenues of May and June releases are different.
1. alpha - 0.05

In [None]:
#month 5 is May and month 6 is June
revenue_may=movies.query('month == 5')
revenue_june=movies.query('month == 6')

revenue_may_2=revenue_may['revenue']
revenue_june_2=revenue_june['revenue']
display(revenue_may_2.head())
print('revenue_may_2 std:', revenue_may_2.std(), '\nrevenue_june_2 std:', revenue_june_2.std())

In [None]:
alpha = .05   #critical statistical significance level

results = st.ttest_ind(revenue_may_2, revenue_june_2,equal_var = False) 

print('p-value:',results.pvalue) 

if (results.pvalue < alpha):
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")

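* Conclusion:
    
The test above is two-sided, so rejecting H0 would only show that the average revenues of May and June differ, not which month is higher. The bar chart of average revenue by month suggests that June is higher than May; that difference is statistically significant at the 5% level only if the printed p-value is below alpha.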
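
If the question is specifically whether June is higher than May, a one-sided version of the same Welch test is the closer match. The sketch below runs on synthetic samples (the means, spreads and sizes are made up, not TMDb values) and assumes SciPy 1.6 or later, where `ttest_ind` accepts the `alternative` argument.

```python
import numpy as np
from scipy import stats as st

# Synthetic revenue samples; all parameters here are assumptions for illustration.
rng = np.random.default_rng(0)
may = rng.normal(loc=100.0, scale=30.0, size=200)
june = rng.normal(loc=110.0, scale=40.0, size=220)

# Two-sided Welch test: H1 is "the means differ".
two_sided = st.ttest_ind(june, may, equal_var=False)

# One-sided Welch test: H1 is "the mean of june is greater than the mean of may".
one_sided = st.ttest_ind(june, may, equal_var=False, alternative="greater")

print("two-sided p-value:", two_sided.pvalue)
print("one-sided p-value:", one_sided.pvalue)
```

When the test statistic is positive, the one-sided p-value is half of the two-sided one, so the direction of the hypothesis should be fixed before looking at the data.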

# Overall conclusion:

1. The raw movies data set has 10866 rows; 10 columns are kept for the analysis.
1. No missing values remain after cleaning, and the data types are suitable for the analysis. Duplicate rows are not checked by the code above.
1. The average runtime of movies is approx. 102 minutes, the average budget is approx. 40 million, and the average revenue is approx. 100 million. So on average the revenue is approx. 2.5 times the budget, but this ratio is not the same for all movies.
1. There is a positive relation between budget and runtime. Most movies run longer than 50 minutes and shorter than 200 minutes. Despite the positive correlation, the very longest movies tend to have low budgets, possibly because of editing choices. Movies between 150 and 200 minutes tend to have higher budgets.
1. There is a positive relation between revenue and budget. Most movies have a budget no greater than 100 million, and the average revenue is approx. 100 million. Higher-budget movies tend to earn higher revenue as well. For example, movies with a budget of two hundred million bring in approx. 1.5 billion.
1. The top six genres by number of movies are Drama, Comedy, Thriller, Action, Adventure and Romance.
1. Drama was the most popular genre in the early years, but its popularity has trended downward over time. In recent years audiences have been more interested in the Action genre.
1. The highest numbers of movies were released in September and December. The highest average revenue was earned in June, May and November.
1. The hypothesis test is two-sided, so it only shows whether the average revenues of May and June differ; the bar chart of average revenue by month suggests that June is higher than May.
1. The top 5 longest movies are:

* Carlos
* Cleopatra
* Heaven's Gate
* Lawrence of Arabia
* Gods and Generals

The top 5 shortest movies are:

* Kid's Story
* Mickey's Christmas Carol
* Dr. Horrible's Sing-Along Blog
* Louis C.K.: Live at the Beacon Theater
* Winnie the Pooh

# Limitations:


1. The lowest-revenue entries are suspicious. For example, Shattered Glass and Mallrats have a recorded revenue of just 2 currency units. Such values are not realistic, so some information is probably missing.
1. The budget and revenue columns do not specify a currency, so there may be some differences due to fluctuating exchange rates.
1. Rows with NaN values, and rows with a zero budget, revenue or runtime, were dropped, so a lot of key data might have been lost in the process.
1. Audience expectations may bias the results: when expectations are high, a movie is less likely to meet them even if it is good, which can ultimately affect the profits.