
# <center>EDA of 1000 Movies Data</center>

## Table of Contents
 
1. [Problem Statement](#section1)<br/>
    - 1.1 [Introduction](#section101)<br/>
<br>
2. [Data Loading and Description](#section2)<br/>
    - 2.1 [Importing Packages](#section201)<br/>
    - 2.2 [Importing Dataset](#section202)<br/>
<br>
3. [Data profiling](#section3)<br/>
    - 3.1 [Understanding Data](#section301)<br/>
    - 3.2 [Data Preprofiling](#section302)<br/>
    - 3.3 [Data Preprocessing](#section303)<br/>
    - 3.4 [Data Postprocessing](#section304)<br/>  
<br>
4. [1000 Movies Data Analysis](#section4)<br/>
    - 4.1 [How many movies are released per year over the period 2006-2016?](#section401)<br/>
    - 4.2 [What are the top 10 Movies with Highest Revenue?](#section402)<br/>
    - 4.3 [What is the Revenue trend of the Movies over the years?](#section403)<br/>
    - 4.4 [Relation between Rating and Revenue](#section404)<br/>
    - 4.5 [Analysing Genres of all the movies individually](#section405)<br/>   
    - 4.6 [Who are the Highest and Lowest Revenue(Cumm.) generated Directors?](#section406)<br/>
    - 4.7 [Do all the parameters(Votes, Metascore, Rating) speak the same about a movie?](#section407)<br/>
    - 4.8 [Which Actors had appeared in Highest number of movies?](#section408)<br/>
    - 4.9 [What are the movies with highest - Rating, Metascore and Votes?](#section409)<br/>    
    - 4.10 [Which are the highest Revenue generated Genres ?](#section410)<br/>
    - 4.11 [What are the top and bottom 10 Genres (Combination) ?](#section411)<br/>
    - 4.12 [Is there a change in Movie Runtime(avg) over the years](#section412)<br/>  
<br>
5. [Conclusion](#section5)<br/>




<a id=section1></a>
## 1. Problem statement

The dataset contains information on 1000 movies released over the period 2006-2016, with variables such as Title, Year of release, Revenue, Rating, Votes, Actors and Genre. Insights from this dataset can guide producers, distributors, theatres and others in understanding movie trends.


<a id=section101></a>
### 1.1 Introduction:

This Exploratory Data Analysis is an exercise in applying Python skills to a structured dataset: loading, inspecting, wrangling, exploring, and drawing conclusions from data. The notebook records observations at each step to explain how to approach the dataset. Based on these observations, some questions are answered in the notebook for reference, though not all of them are explored in the analysis.

<a id=section2></a>
## 2. Data Loading and Description:

The dataset comprises __1000 observations of 12 columns__. The table below lists each column name with its description.

| Column Name        | Description                                                 |
| -------------------|:-----------------------------------------------------------:|
| Rank               | Rank of the movie                                           |
| Title              | Title of the movie                                          |
| Genre              | Style/Category of the movie                                 |
| Description        | Movie description                                           |
| Director           | Director of the movie                                       |
| Actors             | Actors/Lead roles in the movie                              |
| Year               | Year in which the movie was released                        |
| Runtime (Minutes)  | Movie length in minutes                                     |
| Rating             | Rating of the movie on a scale of 10                        |
| Votes              | Count of votes for the movie                                |
| Revenue (Millions) | Revenue generated by the movie, in millions                 |
| Metascore          | Weighted average of the reviews from most respected critics |

<a id=section201></a>
### 2.1 Importing Packages:

In [None]:
import numpy as np                      # Implements multi-dimensional array and matrices
import pandas as pd                     # For data manipulation and analysis
import pandas_profiling
import matplotlib.pyplot as plt         # Plotting library for Python and its numerical mathematics extension NumPy
import seaborn as sns                   # Provides a high level interface for drawing attractive and informative statistical graphics
%matplotlib inline
sns.set()

from subprocess import check_output     # To check output

<a id=section202></a>
### 2.2 Importing Dataset:

In [None]:
#Reading 1000 Movies Dataset
md = pd.read_csv("https://raw.githubusercontent.com/insaid2018/Term-1/master/Data/Projects/1000%20movies%20data.csv")

<a id=section3></a>
## 3. Data Profiling:

- First, I will __understand the dataset__ using various pandas functionalities.
- Then, with the help of __pandas profiling__, I will find which columns of the dataset need preprocessing.
- In __preprocessing__, I will deal with erroneous and missing values.
- After that, I will run __pandas profiling__ again to see how preprocessing has transformed the dataset.

<a id=section301></a>
### 3.1 Understanding Data:
- Displaying the first 15 and last 10 rows of the data to understand the variables.
- Finding the shape of the dataset.

In [None]:
md.head(15)                                # Display the first 15 rows of the dataset.

In [None]:
md.tail(10)                                # Display the last 10 rows of the dataset.

In [None]:
md.shape                                          # Shape of the Dataset

- Next, I will compute descriptive statistics for the numerical variables.
- This helps in finding the distribution, standard deviation and min-max of the numerical columns.

In [None]:
md.describe()          # Descriptive statistics for the numerical variables in the Dataset.

- Next, display 10 random rows to see variations in the data.

In [None]:
md.sample(10)                            # Displays a random 10 rows from the dataset.

- In order to handle the data better, find the data types and null values.

In [None]:
md.info()                         # Displays Data_Types and Null_Values in the Dataset.
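
As a complement to `info()`, missing values and zeros can be counted per column directly. The following is a minimal sketch on a hypothetical toy frame (the `toy` data and its values are assumptions made for illustration, not rows from the movies dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the movies data.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Revenue (Millions)": [10.5, np.nan, 0.0, 300.2],
    "Metascore": [70.0, 55.0, np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# Number of exact zeros in each numeric column (NaN is not counted as zero).
zeros_per_column = (toy.select_dtypes("number") == 0).sum()

print(missing_per_column)
print(zeros_per_column)
```

Counts like these indicate which columns may need attention during preprocessing.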

### Observations from Data Profiling:

- The details of each column and the shape of the dataset are understood.
- Descriptive statistics gave a statistical summary of the dataset.
- A random sample of the dataset was inspected for abnormalities.
- info() revealed the data types and missing values.

Now, I have the inputs for starting off with pre-profiling.

<a id=section302></a>
### 3.2 Pre profiling:

- With pandas profiling, an __interactive HTML report__ is generated. It gives a brief description of the columns of the dataset, such as __counts, histograms and correlations__, along with detailed information about __each column__, the __correlation between different columns__ and a __sample__ of the dataset.<br/>
- It also gives a __visual interpretation__ of each column in the data.
- The spread of the data is better understood from the distribution plots.
- It enables granular analysis of each column.

In [None]:
import pandas_profiling                          # Get a quick overview for all the variables using pandas_profiling.                                         
profile = pandas_profiling.ProfileReport(md)
profile.to_file("1000movies_before_preprocessing.html")     # HTML file will be downloaded to local workspace.

Here, I have run pandas profiling before preprocessing the dataset, so I have named the HTML file __1000movies_before_preprocessing.html__. Looking at the file helps in developing useful insights from the dataset. <br/>

Now, I will process the data for better understanding.

<a id=section303></a>
### 3.3 Pre processing:
- Dealing with missing Values.
- Dealing with zeros.
- Splitting column data for more insights.
- Dropping the column that is not useful for data insights.

#### 3.3.1 Dealing with missing Values:

In [None]:
md['Revenue (Millions)'].mean()         # Mean of the column Revenue_(Millions). 

In [None]:
md['Revenue (Millions)'].median()       # Median of the column Revenue_(Millions). 

Here, the mean of __Revenue__ is chosen to fill the missing values, as it matches the values of the column more closely.

In [None]:
md['Revenue (Millions)'] = md['Revenue (Millions)'].fillna(md['Revenue (Millions)'].mean())   # Replace missing values of Revenue (Millions) with the column mean.

md['Revenue (Millions)'].isnull().sum()       # Check whether any null values persist.
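
The choice between mean and median matters most when a column is skewed. A minimal sketch, assuming a small hypothetical `revenue` Series with one large value (the numbers are illustrative only), shows how far apart the two fill values can be:

```python
import numpy as np
import pandas as pd

# Hypothetical, deliberately skewed revenue values with one missing entry.
revenue = pd.Series([1.0, 2.0, 3.0, np.nan, 94.0])

mean_value = revenue.mean()      # NaN is skipped by default
median_value = revenue.median()  # NaN is skipped by default

# Two candidate imputations of the same column.
filled_mean = revenue.fillna(mean_value)
filled_median = revenue.fillna(median_value)

print(mean_value, median_value)
```

Comparing both values against the bulk of the column is one way to decide which statistic is the more representative fill.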

#### 3.3.2 Dealing with zeros:



It is observed that the __Revenue__ column contains zero values, which cannot happen in real life. Hence, the zero values are replaced with the mean.

In [None]:
md['Revenue (Millions)'] = md['Revenue (Millions)'].mask(md['Revenue (Millions)'] == 0.0, md['Revenue (Millions)'].mean())
md['Revenue (Millions)'].sample(15)
# Replaced 0.0 values of Revenue (Millions) with the column mean; sample(15) displays 15 random values.
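
Replacing zeros in a single column is a Series-level operation. The sketch below uses a hypothetical `revenue` Series (illustrative values, not the dataset) to show `Series.mask` replacing only the entries where the condition holds:

```python
import pandas as pd

# Hypothetical revenue values containing zeros.
revenue = pd.Series([0.0, 50.0, 150.0, 0.0])

# Mean of the column, zeros included.
column_mean = revenue.mean()

# mask() replaces values where the condition is True and keeps the rest.
cleaned = revenue.mask(revenue == 0.0, column_mean)

print(cleaned.tolist())
```

Calling `mask` on the column itself, rather than on the whole DataFrame, keeps the result a Series that can be assigned back to that column.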

In [None]:
md['Metascore'].mean()                  # Mean of the column Metascore. 

In [None]:
md['Metascore'].median()                # Median of the column Metascore. 

In [None]:
md['Metascore'] = md['Metascore'].fillna(md['Metascore'].mean())
md['Metascore'].tail(15)      # Replaced missing values of Metascore with the mean; tail(15) displays the last 15 values.

#### 3.3.3 Splitting Genre column for more insights:

- The __Genre__ column lists several genres for each movie. Splitting them up allows more insights to be drawn for each movie. Hence, genres will be analyzed both as combinations and individually.

In [None]:
from pandas import DataFrame                       #Splitting the column 'Genre' into 3 columns
df=pd.DataFrame(md)
GenSplt=md['Genre'].str.split(",", n=-1, expand = True)
GenSplt.columns = ['G1','G2','G3']                 #Giving names to the columns
#GenSplt


In [None]:
GenSplt.columns = ['G1','G2','G3']
GenOne = pd.concat([GenSplt['G1'], GenSplt['G2'], GenSplt['G3']])   # Stacking all 3 columns into a single column (Series.append was removed in pandas 2.0)
GenOne = GenOne.dropna()                           # Dropping NaN values
df2 = pd.DataFrame(GenOne)                         # Converting the stacked column to a DataFrame
df2.columns = ['G']
#df2
md.head(10)
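
An alternative to splitting into fixed columns and stacking them is `str.split` followed by `explode`, which does not depend on the maximum number of genres per movie. This is a minimal sketch on a hypothetical toy frame (titles and genre strings are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy frame with comma-separated genres.
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Genre": ["Action,Adventure,Sci-Fi", "Drama", "Action,Drama"],
})

# Split each string into a list, then give every genre its own row.
genres = toy["Genre"].str.split(",").explode().str.strip()

# Number of movies per individual genre.
genre_counts = genres.value_counts()

print(genre_counts)
```

The exploded Series keeps the original row index, so each genre can still be traced back to its movie.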

#### 3.3.4 Splitting Actors column for more insights:

In [None]:
from pandas import DataFrame                       #Splitting the column 'Actors' into 4 columns
df=pd.DataFrame(md)
ActSplt=md['Actors'].str.split(",", n=-1, expand = True)
ActSplt.columns = ['A1','A2','A3','A4']                 #Giving names to the columns
#ActSplt



In [None]:

ActOne = pd.concat([ActSplt['A1'], ActSplt['A2'], ActSplt['A3'], ActSplt['A4']])   # Stacking all 4 columns into a single column
ActOne = ActOne.str.strip().dropna()               # Trimming any spaces left by the split and dropping NaN values
df10 = pd.DataFrame(ActOne)                        # Converting the stacked column to a DataFrame
df10.columns = ['A']
#df10

#### 3.3.5 Dropping the column not useful for data insights.

- The movie descriptions are of little importance for the current analysis. Hence, the __Description__ column is dropped.

In [None]:
md = md.drop(['Description'], axis=1)   # Dropping the Description column; drop() returns a new DataFrame, so the result is assigned back to md.


<a id=section304></a>
### 3.4 Post processing:

Now we have preprocessed the data, and it no longer contains missing or zero values. In addition, we have introduced a new feature named __G__ to draw insights from the __Genre__ column. Next, the __post-profiling__ report generated after preprocessing will give us more useful insights. We can compare the two reports, i.e., __1000movies_after_preprocessing.html__ and __1000movies_before_preprocessing.html__.<br/>


In [None]:
import pandas_profiling                          # Get a quick overview for all the variables using pandas_profiling.                                         
profile = pandas_profiling.ProfileReport(md)
profile.to_file("1000movies_after_preprocessing.html")     # HTML file will be downloaded to local workspace.

Observations from the 1000movies_after_preprocessing.html report:
- In the Dataset info, Total __Missing(%)__ = __0.0%__ 
- Number of __variables__ = __12__ 
- Observe the newly created variable __G__.

<a id=section4></a>
## 4. 1000 Movies Data Analysis:

<a id=section401></a>
### 4.1 How many movies are released per year over the period: 2006-2016?

In [None]:
plt.figure(figsize=(20,20))
MovTrnd = sns.catplot(x="Year", data=md, aspect=2, kind="count", color='Skyblue')   # factorplot was renamed to catplot in seaborn 0.9
#Displays the count of movies released per year over the period 2006-2016.

#### Observations:
- The plot above shows that the number of movies per year increased gradually.
- In 2016, the count rose sharply, which may indicate the growing popularity of the movie industry.

<a id=section402></a>
### 4.2 What are the top 10 Movies with Highest Revenue?

In [None]:
plt.figure(figsize=(10,5))
df4=md.sort_values("Revenue (Millions)", ascending=False).head(10)
df_4 = sns.barplot(x="Revenue (Millions)", y="Title",hue='Year', linewidth=0, data=df4,ci= None)
#plt.title('Top 10 Movies with Highest Revenue',fontsize=18,fontweight="bold")
df_4.set_xlim(990,1000)
plt.legend(bbox_to_anchor=(1.05, 0.9), loc=2, borderaxespad=0.)  #Moving the hue to right for graph visibility.
plt.show()


- From the bar plot, the movie with the highest revenue is 'Nine Lives', released in 2016.
- Its revenue is nearly 999 million.
- 'Search Party' and 'Step Up 2: The Streets' are the next highest revenue generators, at about 999 million and 998 million respectively.
- There is not much revenue difference (<10m) among the top 10 movies.
- Another interesting observation is that the highest revenue movies come from all the years except 2006.


<a id=section403></a>
### 4.3 What is the Revenue trend of the Movies over the Years?

In [None]:
x =[]
y =[]

for i in md['Year'].unique():
    y.append(md[md['Year']==i]['Revenue (Millions)'].sum())
    x.append(i) 

In [None]:
z = pd.DataFrame(list(zip(x, y)), columns=['Year','Revenue_Sum'])
z
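
The per-year loop can also be expressed as a single `groupby`. The sketch below uses a hypothetical toy frame (years and revenues are illustrative assumptions) and produces the same `Year` / `Revenue_Sum` layout:

```python
import pandas as pd

# Hypothetical toy data: a few movies with their year and revenue.
toy = pd.DataFrame({
    "Year": [2006, 2006, 2007, 2008, 2008],
    "Revenue (Millions)": [10.0, 20.0, 5.0, 7.5, 2.5],
})

# Sum revenue within each year and rename the column to match the loop's output.
revenue_by_year = (
    toy.groupby("Year", as_index=False)["Revenue (Millions)"]
       .sum()
       .rename(columns={"Revenue (Millions)": "Revenue_Sum"})
)

print(revenue_by_year)
```

Unlike iterating over `unique()`, `groupby` returns the years in sorted order.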

In [None]:
plt.figure(figsize=(10,7))
TrenMov = sns.lineplot(x="Year", y="Revenue_Sum",data=z,ci= None)
#Displays cumulative revenue generated by the movies per year.

- The line graph shows the sum of revenues per year.
- Total revenue increased gradually over the years.
- A key point is the sudden increase in revenue in 2013.
- It then stayed stable or decreased slightly in 2014.
- The year 2016 generated the highest total revenue.

<a id=section404></a>
### 4.4 Relation between Rating and Revenue.

In [None]:
plt.figure(figsize=(20,20))
sns.jointplot(x="Rating", y="Revenue (Millions)", data=md, kind='hex', color='DarkBlue')
#Relation between Rating and Revenue.

- The immediate inference from this joint plot is that, although ratings play a vital role, they do not always guarantee revenue.
- In the graph, most movies that generated more than 200m in revenue have ratings between 6 and 8.5.
- Another important finding is that most movie ratings lie between 5 and 8.5.
- This suggests that users tend to give average or above-average ratings.
- Also, the rating is approximately normally distributed, whereas most movies have revenue below 200m.



<a id=section405></a>
### 4.5 Analysing Genres of all the movies individually.

In [None]:
import matplotlib.pyplot as plt                     #Finding individual uniques out of profiled genre column
df2.G.unique()

In [None]:
df2.pivot_table(index=['G'], aggfunc='size') 

In [None]:
# Genre names in alphabetical order, matching the order of the sizes returned by pivot_table above
Genre_names=['Action', 'Adventure', 'Animation', 'Biography', 'Comedy',
       'Crime', 'Drama', 'Family', 'Fantasy', 'History', 'Horror',
       'Music', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Sport',
       'Thriller', 'War', 'Western']
Genre_Size=[303,259,49,81,279,150,513,51,101,29,119,16,5,106,141,120,18,195,13,7]

plt.figure(figsize=(15,15))
fig, gen = plt.subplots()
gen.axis('equal')
Genpie, _ = gen.pie(Genre_Size, radius=2.5, labels=Genre_names, colors = ['skyblue', 'gold','red','orange','blue','pink','violet','grey','gold','yellowgreen','brown'])
plt.setp(Genpie, width=1, edgecolor='white')
plt.margins(0,0)

- From the pie chart, Drama has the highest number of movies.
- It is followed by Action and then Comedy.
- The genres with the fewest movies are Musical, Western, War and Music.
- The remaining genres have broadly similar shares of the movie count.

<a id=section406></a>
### 4.6 Who are the Highest and Lowest Revenue(cumulative) generating Directors?

In [None]:
x =[]
y =[]

for i in md['Director'].unique():
    y.append(md[md['Director']==i]['Revenue (Millions)'].sum())
    x.append(i)

In [None]:
df6 = (pd.DataFrame(list(zip(x, y)), columns=['Director','Revenue_Sum'])).sort_values('Revenue_Sum', ascending=False)

In [None]:
df7 = df6.tail(5)  #Lowest Revenue generated directors
df6 = df6.head(5)  #Highest Revenue generated directors
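
Movie count and total revenue per director can be computed together, which makes it easier to compare the two rankings. This is a minimal sketch on hypothetical data (director names `X`, `Y`, `Z` and the revenues are placeholders, not values from the dataset):

```python
import pandas as pd

# Hypothetical toy data: one row per movie.
toy = pd.DataFrame({
    "Director": ["X", "X", "Y", "Z", "Z", "Z"],
    "Revenue (Millions)": [100.0, 50.0, 200.0, 10.0, 20.0, 30.0],
})

# Named aggregation: number of movies and total revenue for each director.
director_stats = (
    toy.groupby("Director")["Revenue (Millions)"]
       .agg(Movies="count", Revenue_Sum="sum")
       .sort_values("Revenue_Sum", ascending=False)
)

top_directors = director_stats.head(2)      # highest total revenue
bottom_directors = director_stats.tail(2)   # lowest total revenue

print(director_stats)
```

In this toy example the director with the most movies has the lowest total revenue, which is the kind of contrast the two separate plots are meant to reveal.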

In [None]:
sns.barplot(x="Revenue_Sum", y="Director", data=df6, ci= None)
plt.show()

In [None]:
md.Director.value_counts().head(15).plot.bar()
#sns.barplot(x="Revenue_Sum", y="Director", data=df7, ci= None)

In [None]:
md[md['Director']=='Alexandre Aja']

- Even though 'Ridley Scott' directed the highest number of movies, he is in 3rd place for total revenue.
- In contrast, 'Paul W.S. Anderson' generated the highest total revenue of 3715m with 6 movies.
- 'Alexandre Aja', with only 4 movies, is second in the total revenue ranking. This makes him the most successful director up to 2016.

<a id=section407></a>
### 4.7 Do all the parameters(Votes, Metascore, Rating) speak the same about a movie?

#### Rating vs Votes:

In [None]:
#plt.figure(figsize=(8,8))
a=sns.scatterplot(x="Rating", y="Votes",data=md)

#### Observations:
- Vote count tends to be higher when the rating is higher.
- That is, the movies with better ratings also have higher vote counts.

#### Rating vs Metascore:

In [None]:
b=sns.scatterplot(x="Rating", y="Metascore",data=md)

#### Observations:
- Movies with higher ratings tend to have higher Metascores.
- There are also a handful of cases where the rating is low but the Metascore is average.

#### Metascore vs Votes:

In [None]:
c=sns.scatterplot(x="Metascore", y="Votes", data=md);

#### Observations:
- The density in the plot shows that Metascore and Votes are not strongly related.
- Vote counts below 250,000 are spread across the Metascore values.
- Although not directly related, movies with higher vote counts have higher ratings.
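
The visual impressions from the three scatter plots can be backed by pairwise correlation coefficients. The sketch below is illustrative only: the `toy` values are assumptions constructed so that one pair is perfectly correlated and another is uncorrelated.

```python
import pandas as pd

# Hypothetical toy values for the three metrics.
toy = pd.DataFrame({
    "Rating": [5.0, 6.0, 7.0, 8.0],
    "Votes": [100, 200, 300, 400],
    "Metascore": [60, 40, 70, 50],
})

# Pearson correlation between every pair of columns.
corr = toy.corr()

print(corr.round(2))
```

A coefficient near 1 suggests the two metrics move together, while a value near 0 suggests no linear relationship; on the real data the values would fall somewhere in between.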

<a id=section408></a>
### 4.8 Which Actors had appeared in highest number of Movies?

In [None]:
# value_counts keeps each actor paired with their own count and sorts in descending order
ab = df10['A'].value_counts().head(50).rename_axis('Actors').reset_index(name='MovCount')
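
When labels and counts are built separately, their order matters: `unique()` returns values in order of first appearance, whereas grouped sizes are indexed in sorted order. A minimal sketch on a hypothetical `actors` Series (the names are placeholders) shows how zipping the two mispairs them, and how `value_counts` avoids the problem:

```python
import pandas as pd

# Hypothetical actor appearances, one entry per movie credit.
actors = pd.Series(["Zoe", "Adam", "Zoe", "Mia", "Zoe", "Adam"])

# Labels in order of first appearance: Zoe, Adam, Mia.
labels = actors.unique()

# Counts indexed alphabetically: Adam, Mia, Zoe.
counts_sorted = actors.groupby(actors).size()

# Zipping the two pairs each label with another label's count.
mispaired = list(zip(labels, counts_sorted))

# value_counts keeps label and count together, sorted by count.
correct = actors.value_counts()

print(mispaired)
print(correct)
```

Here 'Zoe' appears three times, but the zipped pairing credits her with the count that belongs to 'Adam'.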

In [None]:
plt.figure(figsize=(10,10))
sns.barplot(x='MovCount', y='Actors', data = ab)

- 'Malin Akerman' is the only one who acted in 14 movies.
- The next six actors, Mia, Susan, Michael, Laura, Roman and Joan, appeared in 12 movies each.

<a id=section409></a>
### 4.9 What are the movies with highest Rating, Metascore and Votes?

In [None]:
# Combinations
"""    [['Votes','Metascore','Rating'],['Votes','Rating','Metascore'],
        ['Rating','Metascore','Votes'],['Rating','Votes','Metascore'],
        ['Metascore','Rating','Votes'],['Metascore','Votes','Rating']] 
"""

In [None]:
sort_orders = [['Votes','Metascore','Rating'], ['Votes','Rating','Metascore'],
               ['Rating','Metascore','Votes'], ['Rating','Votes','Metascore'],
               ['Metascore','Rating','Votes'], ['Metascore','Votes','Rating']]
# Top 10 movies under each sort order, stacked into one DataFrame (DataFrame.append was removed in pandas 2.0)
tmd = pd.concat([md.sort_values(by=cols, ascending=False).head(10) for cols in sort_orders])

In [None]:
tmd.Rank.nunique()

In [None]:
tmd = tmd.drop_duplicates(subset='Rank', keep='first', inplace=False)

In [None]:
tmd.shape
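
A shorter variant of the same idea takes the top rows for each metric separately and keeps the union. This sketch uses `nlargest` on a hypothetical toy frame (ranks, titles and scores are illustrative assumptions), so it ranks by one metric at a time rather than by the three-key sort orders used above:

```python
import pandas as pd

# Hypothetical toy data: six movies with three quality metrics.
toy = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5, 6],
    "Title": ["A", "B", "C", "D", "E", "F"],
    "Rating": [9.0, 8.5, 7.0, 6.0, 8.8, 5.0],
    "Votes": [100, 900, 800, 50, 60, 10],
    "Metascore": [60, 95, 50, 90, 70, 30],
})

metrics = ["Rating", "Votes", "Metascore"]

# Top 2 movies for each metric, stacked, then de-duplicated by Rank.
top_frames = [toy.nlargest(2, metric) for metric in metrics]
stacked = pd.concat(top_frames)
top_union = stacked.drop_duplicates(subset="Rank", keep="first")

print(top_union[["Rank", "Title"]])
```

Movies that lead on more than one metric appear only once in the union, which is why `drop_duplicates` is needed before plotting.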

In [None]:
plt.figure(figsize=(7,7))
sns.barplot(y="Title", x="Revenue (Millions)", data=tmd, ci= None)
plt.show()

- The movies 'Manchester by the Sea', 'Interstellar' and 'Moonlight' generated low revenue despite high ratings, votes and Metascore.
- 'Boyhood', 'Gravity', 'Carol', 'The Lives of Others' and 'Ratatouille' performed well in proportion to their ratings and other metrics.

<a id=section410></a>
### 4.10 Which are the highest Revenue generated Genres ?

In [None]:
md['Genre'].value_counts().head(15)   #To view the most frequent genres and their counts in descending order 

In [None]:
# Created list to use for loop
lst = ['Action,Adventure,Sci-Fi',
        'Drama',
        'Comedy,Drama,Romance',
        'Comedy',
        'Drama,Romance',
        'Animation,Adventure,Comedy',
        'Comedy,Drama',
        'Action,Adventure,Fantasy',
        'Comedy,Romance',
        'Crime,Drama,Thriller',
        'Crime,Drama,Mystery',
        'Action,Adventure,Drama',
        'Action,Crime,Drama',
        'Horror,Thriller',
        'Drama,Thriller']


# Summation of revenue for each genre
TopGnr = []
for i in lst:
    x = int(md[md['Genre']==i]['Revenue (Millions)'].sum())
    TopGnr.append(x)

# TopGen DataFrame for the values.
TopGen= pd.DataFrame(TopGnr)
TopGen.columns =['Revenue_Sum(Millions)'] 
TopGen['Genres'] = lst

#TopGn.head()

In [None]:
TopGen=TopGen.sort_values('Revenue_Sum(Millions)', ascending=False) #Sorting Values in descending order for graph

In [None]:
plt.figure(figsize=(8,8))
sns.barplot(x="Revenue_Sum(Millions)", y="Genres",data=TopGen,linewidth=1,ci= None)


- 'Drama', followed by 'Comedy,Drama,Romance', are the highest revenue generating genres.
-  At the same time, they are among the most frequent genres.
-  The difference between the revenue of the top genre and the next genres is more than 8000m.
-  Drama, Comedy and Romance appear to be the genres of most interest.

<a id=section411></a>
### 4.11 What are the top 10 Genres (Combination) ? 

In [None]:
plt.figure(figsize=(7,7))
md['Genre'].value_counts().head(10).plot.bar() #What are most interested Genres ?

- There are at least 50 movies in the 'Action,Adventure,Sci-Fi' genre, making it the most frequent.
- The frequency of the other genres in the top 10 is about the same, at around 25.
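
Since `value_counts` sorts in descending order, the most and least frequent combinations can be read from the two ends of the same result. A minimal sketch on hypothetical genre strings (the counts are chosen for illustration):

```python
import pandas as pd

# Hypothetical genre combinations, one entry per movie.
genre_combos = pd.Series(
    ["Drama"] * 3 + ["Comedy,Drama"] * 2 + ["Horror,Thriller"]
)

# Frequencies in descending order.
combo_counts = genre_combos.value_counts()

top_combos = combo_counts.head(2)      # most frequent combinations
bottom_combos = combo_counts.tail(1)   # least frequent combination

print(top_combos)
print(bottom_combos)
```

On the full dataset, `head(10)` and `tail(10)` would give the top and bottom ten combinations in the same way.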

<a id=section412></a>
### 4.12 Is there a change in movie run time(avg) over the decade?

In [None]:
g = []
h = []
for i in range(2006,2017):
    den = md[md['Year']==i]['Runtime (Minutes)'].count()
    g.append(int((md[md['Year']==i]['Runtime (Minutes)'].sum())/den))
    h.append(i)

In [None]:
df9 = (pd.DataFrame(list(zip(g,h)), columns=['AvgRuntime','Year'])) #.sort_values('Revenue_Sum', ascending=False)
df9
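
The average runtime per year can also be obtained with `groupby` and `mean`, without an explicit loop over the years. The sketch below uses hypothetical runtimes (illustrative values only) and keeps the fractional part, whereas the loop above truncates the average with `int()`:

```python
import pandas as pd

# Hypothetical toy data: runtimes for a few movies across three years.
toy = pd.DataFrame({
    "Year": [2006, 2006, 2007, 2008, 2008, 2008],
    "Runtime (Minutes)": [100, 120, 95, 130, 110, 90],
})

# Mean runtime within each year, rounded to one decimal place.
avg_runtime = toy.groupby("Year")["Runtime (Minutes)"].mean().round(1)

print(avg_runtime)
```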

In [None]:
plt.figure(figsize=(10,7))
sns.barplot(x="Year", y="AvgRuntime",data=df9,ci= None)

- The average runtime of the movies did not change much.
- The average runtime of movies from 2016 appears to be slightly lower.

<a id=section5></a>
## 5. Conclusion

- 'Nine Lives', released in 2016, is the highest revenue generating movie.
-  The highest total revenue is in the year 2016, which is unsurprising as it is the latest year.
-  An unusual trend is observed where total revenue decreased from 2013 to 2014, even though the movie count was increasing.
-  Movies directed by 'Paul W.S. Anderson' generated the highest total revenue of 3715m.
-  Drama, Comedy and Action are the most common individual genres.
-  Movies with higher ratings tend to have higher vote counts.
-  Every year there is at least one movie that crosses 900m in revenue.
-  Movies of the 'Drama' genre have generated the highest revenue.
- The movies 'Manchester by the Sea', 'Interstellar' and 'Moonlight' generated low revenue despite high ratings, votes and Metascore.
-  'Action,Adventure,Sci-Fi' is the most frequent genre combination.

### Insights or Takeaways:

- Drama is the genre of most interest to the majority of the audience, followed by Comedy and Romance. So, producers can rest assured when investing in these genres in future.
- The Fantasy and Sport genres are less explored by existing movie makers. This is a good space for new directors, and it would also help them avoid competition.
- Most existing directors concentrate on the genres Action, Adventure and Sci-Fi even though they are not doing well commercially. Hence, directors should start combining those with well-performing genres such as Drama, Comedy and Romance.
- There are movies with high rating, Metascore and votes that have not done well in terms of revenue. These movies can be pushed to OTT platforms for further revenue generation.
- Distributors can safely opt for movies involving the actors Malin Akerman, Mia Goth, Susan Loughnane, Michael Vartan, Laura Dern, Roman Kolinka and Joan Allen, and the directors Alexandre Aja, Paul Anderson, Ridley Scott and Woody Allen.