# Project: The Movie Database (TMDB)

**Table of Contents**

* Introduction
* Data Wrangling
* Exploratory Data Analysis
* Conclusions

# Introduction

Initially this dataset has 10866 rows and 21 columns: ['id', 'imdb_id', 'popularity', 'budget', 'revenue', 'original_title', 'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview', 'runtime', 'genres', 'production_companies', 'release_date', 'vote_count', 'vote_average', 'release_year', 'budget_adj', 'revenue_adj']

I will work on this data by removing the unwanted columns and then dropping the redundant rows and the rows with zero values.

# After analysing the data, the following questions can be answered:

1. Movies with the maximum and minimum value for each attribute: budget, revenue, runtime, popularity, ...
2. Year of release vs. profitability.
3. Visualization of popularity vs. revenue ('high', 'mod_high', 'medium', 'low').
4. Top 10 most frequent cast members, genres and directors.
5. Average runtime of the movies.

**Importing required packages: pandas, numpy, matplotlib and seaborn**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Data Wrangling

**General Properties of Dataframe**

In [None]:
# Data is loaded
data = pd.read_csv('/kaggle/input/tmdb-movies-dataset/tmdb_movies_data.csv')
data.head(3)

In [None]:
data.tail(3)

In [None]:
data.info()

In [None]:
data.isnull().sum()

In [None]:
data.shape

In [None]:
data.dtypes

In [None]:
data.describe()

In [None]:
data_duplicate = data[data.duplicated()]
print(data_duplicate)

In [None]:
# Drop duplicate rows, keeping the first occurrence
data.drop_duplicates(keep='first',inplace = True)
data.shape
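
As a small illustration of the duplicate check above, the following sketch runs the same two calls on a made-up frame. The `toy` DataFrame and its values are hypothetical and only serve to show what `duplicated()` and `drop_duplicates()` return.

```python
import pandas as pd

# Hypothetical toy frame with one fully duplicated row (index 2 repeats index 1).
toy = pd.DataFrame({
    "original_title": ["A", "B", "B", "C"],
    "runtime": [100, 90, 90, 120],
})

# duplicated() marks every repeat after the first occurrence as True.
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

# drop_duplicates() returns a new frame; keep="first" retains the earliest copy.
deduped = toy.drop_duplicates(keep="first")

print(n_duplicates)
print(deduped.shape)
```

Counting the duplicates before dropping them makes it easy to confirm that the row count changed by exactly that amount.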

# Data Cleaning

**Dropping the columns that I am not considering for the data analysis. The duplicate rows were already removed from the DataFrame above.**

In [None]:
data.drop(['id','imdb_id','homepage','tagline','keywords','overview','release_date','vote_count','budget','revenue'],axis=1, inplace = True)

**I dropped the columns listed above for the following reasons:**

1. 'id', 'imdb_id', 'homepage', 'tagline', 'keywords', 'overview': I will not use these columns to answer the questions listed above.
2. 'release_date': I plan to analyse the data by year, so the full date is not required. I kept the 'release_year' column for this purpose.
3. 'vote_count': I will be considering 'vote_average', so I dropped 'vote_count'.
4. 'budget', 'revenue': it is better to use the inflation-adjusted 'budget_adj' and 'revenue_adj' instead of 'budget' and 'revenue'.

In [None]:
data.head(3)

In [None]:
data.info()

In [None]:
data.isnull().sum()

In [None]:
data.describe()

**'budget_adj', 'revenue_adj' and 'runtime' should not have zero values. I drop the NA values first and then check for zero values.**

In [None]:
data.dropna(inplace = True)
print(data.shape)

In [None]:
data.isnull().sum()

In [None]:
data.describe()

**The 'budget_adj', 'revenue_adj' and 'runtime' of a movie cannot be zero, so to fix this we replace 0 with NaN and then drop those rows.**

**Reference used to understand how to replace 0 with NaN: https://stackoverflow.com/questions/22649693/drop-rows-with-all-zeros-in-pandas-data-frame**

**Dropping zero values, as they might affect the analysis**

In [None]:
# np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
data['budget_adj'] = data['budget_adj'].replace(0,np.nan)
data['revenue_adj'] = data['revenue_adj'].replace(0,np.nan)
data['runtime'] = data['runtime'].replace(0,np.nan)

In [None]:
data.isnull().sum()

In [None]:
data.dropna(inplace = True)  # drops every row that had a zero value; this is a limitation of the project, as many rows are lost
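
Since this step removes many rows, it can help to measure the loss. The sketch below is one possible way to do that on a made-up frame: the `toy` values and the `zero_means_missing` name are hypothetical, and `dropna(subset=...)` is used here so that only the three columns where zero means "missing" decide which rows go.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: zeros in budget/revenue/runtime stand for missing values.
toy = pd.DataFrame({
    "budget_adj": [0.0, 1.5e7, 2.0e7, 3.0e7, 1.0e7],
    "revenue_adj": [5.0e6, 0.0, 6.0e7, 9.0e7, 2.0e7],
    "runtime": [95, 110, 0, 100, 120],
    "vote_average": [6.1, 7.0, 5.5, 0.0, 6.8],  # a zero here is left untouched
})

zero_means_missing = ["budget_adj", "revenue_adj", "runtime"]

cleaned = toy.copy()
# Turn zeros into NaN only in the columns where zero is not a valid value.
cleaned[zero_means_missing] = cleaned[zero_means_missing].replace(0, np.nan)

rows_before = len(cleaned)
# Drop rows that are missing any of the three columns.
cleaned = cleaned.dropna(subset=zero_means_missing)
rows_dropped = rows_before - len(cleaned)

print(f"dropped {rows_dropped} of {rows_before} rows")
```

Reporting the number of dropped rows next to the cleaned shape makes the size of this limitation visible to the reader.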

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.head(3)

In [None]:
data.runtime.unique()

In [None]:
data['revenue_adj'].unique()

**Change the column DataTypes :**

In [None]:
change_coltype = ['budget_adj', 'revenue_adj', 'runtime']

In [None]:
# astype converts whole columns at once; applymap is deprecated in recent pandas
data[change_coltype] = data[change_coltype].astype(np.int64)

In [None]:
data.info()

In [None]:
data.head()

**Insert a new column, 'profit', which holds the profit of each movie (revenue minus budget)**

In [None]:
data['profit'] = data['revenue_adj'] - data['budget_adj']

In [None]:
data.head(3)

# Exploratory Data Analysis

# Question 1 : List of movies with maximum and minimum value for each attribute - Budget,Revenue,Runtime,Popularity

**Build a function to find the maximum and minimum values of a column: it returns the entire rows that hold the maximum and minimum values of the column passed to the function.**

In [None]:
def min_max(column_name):
    
    max_value = data[column_name].idxmax()
    max_details = pd.DataFrame(data.loc[max_value])
    
    min_value = data[column_name].idxmin()
    min_details = pd.DataFrame(data.loc[min_value])
    
    both_values = pd.concat([max_details, min_details],axis=1)
    
    return(both_values)
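
The function above reads the global `data` frame. As a hedged variation, the sketch below passes the frame in as an argument so the same idea can be tried on any DataFrame. The name `min_max_rows`, the `toy` frame and the "max"/"min" column labels are hypothetical choices, not part of the original notebook.

```python
import pandas as pd


def min_max_rows(frame, column_name):
    """Return the rows holding the largest and smallest value of a column."""
    # idxmax/idxmin give the index labels of the extreme values.
    extremes = frame.loc[[frame[column_name].idxmax(), frame[column_name].idxmin()]]
    # Transpose so that each movie becomes one column, as in the layout above.
    result = extremes.T
    result.columns = ["max", "min"]
    return result


# Hypothetical toy data to exercise the helper.
toy = pd.DataFrame(
    {"original_title": ["A", "B", "C"], "runtime": [100, 90, 120]}
)
summary = min_max_rows(toy, "runtime")
print(summary)
```

Labelling the two columns makes it clear at a glance which movie is the maximum and which is the minimum.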

**i) Details of movie with maximum and minimum Popularity**

In [None]:
min_max('popularity')

**The most popular movie is 'Jurassic World' and the least popular is 'Сталинград' (Stalingrad)**

**ii) Details of movie with maximum and minimum Budget**

In [None]:
min_max('budget_adj')

**Highest Budget Movie : The Warrior's Way**

**Lowest Budget Movie : Love, Wedding, Marriage**

**iii) Details of movie with maximum and minimum Revenue**

In [None]:
min_max('revenue_adj')

**Highest Revenue Movie : Avatar**

**Lowest Revenue Movie : Shattered Glass**

**iv) Details of movie with maximum and minimum Runtime**

In [None]:
min_max('runtime')

Maximum Runtime movie : Carlos

Minimum Runtime Movie : Kid's Story

v) Latest and oldest movie in this DataFrame

In [None]:
min_max('release_year')

Latest movie : Jurassic World

Oldest Movie : Psycho

vi) Minimum and Maximum vote_average

In [None]:
min_max('vote_average')

The Shawshank Redemption got the maximum vote_average and Foodfight!! got the minimum

In [None]:
min_max('profit')

Maximum profit earned by 'Star Wars'

Minimum profit earned by 'The Warrior's Way'

# Question 2 : Year of Release vs Profitability.

In [None]:
pro_year = data.groupby('release_year')['profit'].sum()

In [None]:
plt.figure(figsize=(12,3),dpi=120)
plt.xlabel('Release Year of Movies in the data set', fontsize = 12)
plt.ylabel('Profit earned by Movies', fontsize = 12)
plt.title('Profit of all movies Vs Year of their release.')
plt.plot(pro_year)

The graph above shows that the total profit of movies has increased over the years.

In [None]:
pro_year = data.groupby('release_year')['profit'].mean()

plt.figure(figsize=(12,3),dpi=120)
plt.xlabel('Release Year of Movies in the data set', fontsize = 12)
plt.ylabel('Average profit per movie', fontsize = 12)
plt.title('Average profit per movie vs. year of release')
plt.plot(pro_year)

According to the plot, 1960-1970 were the most profitable years on average, and the average profit per movie was very low between 2000 and 2010.
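
The yearly total and the yearly mean can point in different directions, because the total also grows with the number of movies released in a year. The sketch below shows this on made-up numbers; the `toy` frame and its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: few, very profitable movies early; many, less profitable ones later.
toy = pd.DataFrame({
    "release_year": [1970, 1970, 2010, 2010, 2010, 2010],
    "profit": [300, 100, 150, 150, 100, 200],
})

# Put the movie count, total profit and mean profit side by side for each year.
per_year = toy.groupby("release_year")["profit"].agg(["count", "sum", "mean"])
print(per_year)
```

In this toy case the later year has the larger total but the smaller mean, simply because it contains more movies.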

In [None]:
plt.figure(figsize = (8,3))
plt.hist(data['release_year'])
plt.xlabel('Years')
plt.ylabel('Number of movies')
plt.title('Movies with years')
plt.show()

**This histogram shows that the number of movies in the DataFrame also increases over the years.**

# Question 3 : Visualization of Popularity vs Revenue ('high', 'mod_high', 'medium', 'low')

In [None]:
rev_data = data.describe().revenue_adj
rev_data

In [None]:
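# Select the quartile edges by label; positional access such as rev_data[3] is deprecated on a labelled Series
bin_edges = [rev_data['min'], rev_data['25%'], rev_data['50%'], rev_data['75%'], rev_data['max']]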
bin_edges

In [None]:
bin_names = ['Low','Medium','Mod_high','High']  # pd.cut assigns labels in ascending bin order

In [None]:
# include_lowest=True keeps the movie with the minimum revenue in the first bin
data['new_popular'] = pd.cut(data['revenue_adj'], bin_edges, labels=bin_names, include_lowest=True)

In [None]:
pop_plot = data.groupby('new_popular')['popularity'].mean()

pop_plot.plot.bar()
plt.xlabel('Revenue')
plt.ylabel('Popularity')
plt.title('Popularity vs Revenue')

In [None]:
plt.scatter(x= data['revenue_adj'], y= data['popularity'])
plt.xlabel('Revenue')
plt.ylabel('Popularity')
plt.title('Revenue vs Popularity')

Taken together, the bar chart and the scatter plot suggest that movies with higher revenue tend to be more popular.
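
One detail worth checking when binning with `pd.cut` is how labels map to bins: they are assigned in ascending order of the bin edges, so the first label always belongs to the lowest bin. The sketch below demonstrates this on a made-up series; the `revenue` values are hypothetical, and `include_lowest=True` is shown because the lowest edge is otherwise excluded.

```python
import pandas as pd

# Hypothetical revenue values, already sorted for readability.
revenue = pd.Series([10, 20, 30, 40, 50, 60, 70, 80], name="revenue_adj")

# Quartile edges from minimum to maximum, like the describe() output used above.
edges = revenue.quantile([0, 0.25, 0.5, 0.75, 1.0]).tolist()

# Labels must be listed from the lowest bin to the highest bin.
labels = ["Low", "Medium", "Mod_high", "High"]

# Without include_lowest the minimum value falls outside the first bin and becomes NaN.
without_lowest = pd.cut(revenue, bins=edges, labels=labels)
level = pd.cut(revenue, bins=edges, labels=labels, include_lowest=True)

print(pd.DataFrame({"revenue_adj": revenue, "level": level}))
```

Printing the values next to their labels is a quick sanity check that the smallest revenues really end up in the 'Low' group.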

# Question 4 : Top 10 popular Cast,Genres and Directors

In [None]:
actor = data['cast'].str.cat(sep="|").split("|")
actors_list = pd.Series(actor).value_counts()[:10]
print(actors_list)
graph = actors_list.plot.bar()
graph.set(title = "List of top 10 actors",ylabel = "Number of appearances")

In [None]:
genres = data['genres'].str.cat(sep="|").split("|")
genres_list = pd.Series(genres).value_counts()[:10]
print(genres_list)
graph = genres_list.plot.barh()
graph.set(title = "List of top 10 Genres",xlabel = "Number of movies made")

In [None]:
director = data['director'].str.cat(sep="|").split("|")
director_list = pd.Series(director).value_counts()[:10]
print(director_list)
graph = director_list.plot.bar()
graph.set(title = "List of top 10 directors",ylabel = "Number of movies directed")
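
The three cells above all count values stored in pipe-separated strings. An alternative way to do the same counting, sketched here on a made-up frame, is to split each string into a list and use `explode()` to give every value its own row. The `toy` data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data in the same "a|b|c" format as the genres column.
toy = pd.DataFrame({
    "genres": ["Action|Thriller", "Drama", "Action|Drama|Comedy"],
})

# Split each string into a list, give every genre its own row, then count.
genre_counts = toy["genres"].str.split("|").explode().value_counts()
print(genre_counts)
```

Because `explode()` keeps the original row index, this form also makes it possible to join the separated values back to other columns, such as profit, if needed later.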

# Question 5 : Average runtime of the movies

In [None]:
Avg_runtime = data['runtime'].mean()
print('Average runtime of movies is %.2f'%Avg_runtime)

In [None]:
plt.hist(data['runtime'], rwidth =.9, bins=30)
plt.xlabel('Runtime')
plt.ylabel('Number of movies')
plt.title('Runtime of all movies')

The histogram above is consistent with the average runtime. Let's also draw a box plot of the same column using seaborn.

In [None]:
sns.boxplot(x=data['runtime'])  # pass the column as x; recent seaborn versions expect a keyword argument
plt.xlabel('Runtime')

The box plot marks the median rather than the mean, but it is consistent with a typical runtime of approximately 109 minutes.
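
A box plot draws the median and the quartiles, not the mean, so the two numbers are worth comparing directly. The sketch below does this on a made-up series with one very long movie; the `runtime` values are hypothetical.

```python
import pandas as pd

# Hypothetical runtimes in minutes, including one unusually long movie.
runtime = pd.Series([90, 95, 100, 105, 110, 240])

mean_runtime = runtime.mean()      # pulled upwards by the long movie
median_runtime = runtime.median()  # the line drawn inside the box
q1, q3 = runtime.quantile([0.25, 0.75])  # the lower and upper edges of the box

print(f"mean={mean_runtime:.2f} median={median_runtime:.2f} q1={q1:.2f} q3={q3:.2f}")
```

When the mean and median are close, as the plots above suggest for this dataset, long-running outliers have only a small effect on the average.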



In [None]:
avg_budget = data['budget_adj'].mean()
avg_revenue = data['revenue_adj'].mean()
print('The average Budget of movies %.2f'%avg_budget)
print('The average revenue of movies %.2f'%avg_revenue)

**Plot between Runtime and Profit**

In [None]:
data.groupby('runtime')['profit'].mean().plot()

plt.xlabel('Runtime')
plt.ylabel('Average Profit')
plt.title('Runtime vs Profit')

According to the plot, movies with a runtime between 150 and 200 minutes had the highest average profit.
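
Grouping by the exact runtime gives one point per minute value, and some of those points rest on very few movies. One possible way to check a range such as 150-200 minutes is to bin the runtime first and look at the count next to the mean. The sketch below uses made-up numbers; the `toy` frame, the band edges and the band labels are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data: runtime in minutes and profit in arbitrary units.
toy = pd.DataFrame({
    "runtime": [85, 95, 110, 125, 160, 175, 190],
    "profit": [10, 20, 40, 50, 90, 110, 400],
})

# Assumed band edges; pd.cut uses right-closed intervals, e.g. (150, 200].
bins = [0, 100, 150, 200]
labels = ["<=100", "101-150", "151-200"]
toy["runtime_band"] = pd.cut(toy["runtime"], bins=bins, labels=labels)

# Count, mean and median per band show how much data supports each average.
band_stats = toy.groupby("runtime_band", observed=True)["profit"].agg(
    ["count", "mean", "median"]
)
print(band_stats)
```

Comparing the mean with the median in each band shows whether a single very profitable movie is driving the average.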

**Plot between Release_Year and Runtime**

In [None]:
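# Select 'runtime' before averaging so that non-numeric columns are not passed to mean();
# the figure size is set on the plot itself so that it applies to this figure
data.groupby('release_year')['runtime'].mean().plot(xticks = np.arange(1960,2016,5), figsize = (20,3))
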
plt.xlabel('Release Year')
plt.ylabel('Runtime')
plt.title('Release_year vs Runtime')

The graph above shows that movies used to run longer on average and that the average runtime has decreased in recent years.

# Conclusions

**Interesting facts about this Data**

1. The average runtime of a movie is 109.35 minutes.
2. The top four genres are Drama, Comedy, Thriller and Action.
3. The top four directors are Steven Spielberg, Clint Eastwood, Ridley Scott and Woody Allen.
4. The top five cast members are Robert De Niro, Bruce Willis, Samuel L. Jackson, Matt Damon and Nicolas Cage.
5. The number of movies released increases year by year.
6. Movies with a runtime between 150 and 200 minutes had the highest average profit.
7. Average popularity differs clearly between the revenue quartiles (see Question 3).

**With an average budget of 44719764.83, a runtime of about 109 minutes, and the genres, cast and directors listed above, an average revenue of 138715933.89 could be generated.**

**Limitations**

i) The data had many zero and null values, and the rows containing them were deleted, which might have resulted in losing valuable data.

ii) The budget and revenue columns do not state a currency unit, so different movies may have budgets in different currencies.