In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
import pandas as pd
pd.set_option('mode.chained_assignment', None)      # To suppress pandas warnings.
pd.set_option('display.max_colwidth', None)         # To display all the data in each column (None means no limit; -1 is deprecated)
pd.options.display.max_columns = 50                 # To display every column of the dataset in head()

import warnings
warnings.filterwarnings('ignore')                   # To suppress all the warnings in the notebook.

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set(style='whitegrid', font_scale=1.3, color_codes=True)      # To apply seaborn styles to the plots.
# Making plotly specific imports
# These imports are necessary to use plotly offline without signing in to their website.

from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import chart_studio.plotly as py
from plotly import tools
init_notebook_mode(connected=True)

**Reading the csv to get dataframe**

In [None]:
moviesdf=pd.read_csv('/kaggle/input/imdb-data/IMDB-Movie-Data.csv')
moviesdf.head()

In [None]:
moviesdf.shape

The dataframe has 1000 rows and 12 columns.

In [None]:
moviesdf.columns

Above is the list of columns, as mentioned in the Description section.

In [None]:
moviesdf.info()

- The ```info``` function gives us the following insights into the moviesdf dataframe:

  - There are a total of **1000 samples (rows)** and **12 columns** in the dataframe.
  
  - There are **4 columns** with an **integer** datatype.
  - There are **5 columns** with an **object** datatype.
  - There are **3 columns** with a **float** datatype.
  
  - There are **128 missing** values in the Revenue (Millions) column.
  - There are **64 missing** values in the Metascore column.
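The same counts can be read off programmatically instead of from the printed `info()` output. The snippet below is a minimal sketch on a small invented dataframe (`toy` and its values are assumptions, not the IMDB data); applied to `moviesdf`, the same two expressions would give the dtype and missing-value counts listed above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for moviesdf; the values are invented for illustration only.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Year": [2006, 2007, 2007, 2008],
    "Rating": [7.1, 6.5, 8.0, 5.9],
    "Revenue": [10.5, np.nan, 300.0, np.nan],
})

# Number of columns per dtype (the information summarised at the bottom of info()).
dtype_counts = toy.dtypes.astype(str).value_counts().to_dict()

# Missing values per column, keeping only the columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0].to_dict()

print(dtype_counts)
print(missing)
```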

In [None]:
moviesdf.describe()

- The ```describe``` function gives us the following insights into the moviesdf dataframe:

  - The data covers **1000 movie titles from 2006-2016**.
  - **The minimum revenue is 0 million, which is unlikely but possible if a movie was not released in theatres or suffered huge losses**.
  - As observed before, the count row confirms **missing values in the Revenue and Metascore fields**.
  - For the Revenue column, **the third quartile (75%) is approx. 113 million USD, but the maximum is approx. 936 million USD, so a few movies have earned far more than the rest**. 
  - For Revenue, **the mean is greater than the median, so the data appears to be right-skewed**.
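The right-skew argument (mean above median, maximum far above the third quartile) can be checked numerically. This is a small sketch on invented numbers, not the actual Revenue column; `Series.skew()` returns a positive value for a right-skewed sample.

```python
import pandas as pd

# Invented revenue values (in millions) with one very large earner.
revenue = pd.Series([1.0, 5.0, 10.0, 20.0, 50.0, 110.0, 900.0])

mean_rev = revenue.mean()        # pulled upwards by the large value
median_rev = revenue.median()    # robust to the large value
q3 = revenue.quantile(0.75)      # third quartile
skewness = revenue.skew()        # sample skewness; > 0 means a long right tail

print(mean_rev > median_rev, revenue.max() > q3, skewness > 0)
```

On the notebook's data the same checks would be run on `moviesdf['Revenue (Millions)']` (or `moviesdf['Revenue']` after the rename).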


In [None]:
#Dropping the column - Description (axis is given by keyword; the positional form is not accepted by recent pandas)
moviesdf.drop(columns=['Description'], inplace=True)

In [None]:
#Changing columns names for Revenue(Millions) to Revenue and Runtime(minutes) to Runtime
moviesdf.rename(columns={'Revenue (Millions)':'Revenue','Runtime (Minutes)':'Runtime'}, inplace=True)

In [None]:
moviesdf.head()

In [None]:
#Count of missing values in each column
moviesdf.isna().sum()

In [None]:
#Checking the count where Revenue has missing value
moviesdf[moviesdf['Revenue'].isnull()].isnull().sum()

In [None]:
#Checking the count where Metascore has missing value
moviesdf[moviesdf['Metascore'].isnull()].isnull().sum()

In [None]:
moviesdf[moviesdf['Revenue'].isnull() & moviesdf['Metascore'].isnull()].isnull().sum()

In [None]:
moviesdf[moviesdf['Title']=='The Host']

 - As per above, only the columns **Revenue** and **Metascore** have missing values, as suggested by the pre-profiling report.

 - Since the number of rows with a missing value in **Revenue (128) or Metascore (64)** is not small, dropping them could affect the analysis of the other fields.

 - Since pandas aggregations skip **NaN values by default, they have no impact on the statistical study**, so it is safe to retain those records.

  - We can ignore the missing values in **Revenue and Metascore** while analyzing their relationship with other fields, instead of dropping those rows permanently from the dataframe.

 - A missing value could mean that either the revenue or the metascore information is unavailable, so while studying the relationship between Revenue and Metascore we remove the rows with missing values, so that the trend is analyzed on valid values only.
 
 - The multiple "The Host" entries are 2 different movies released in different years, so there is no redundant entry for it.
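The strategy described above (keep the rows, let aggregations skip `NaN`, and drop rows only for a specific comparison) can be sketched as follows. The dataframe `toy` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented data: two rows miss Revenue, two miss Metascore, one misses both.
toy = pd.DataFrame({
    "Revenue": [100.0, np.nan, 50.0, np.nan],
    "Metascore": [70.0, 60.0, np.nan, np.nan],
})

# Aggregations skip NaN by default, so the full frame can be kept.
mean_revenue = toy["Revenue"].mean()

# For a Revenue-vs-Metascore study, drop rows missing either value,
# without modifying the original frame.
both_present = toy.dropna(subset=["Revenue", "Metascore"])

# Rows where both fields are missing at the same time.
both_missing = int((toy["Revenue"].isna() & toy["Metascore"].isna()).sum())

print(mean_revenue, len(both_present), len(toy), both_missing)
```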

**Highest earning movies**

In [None]:
revmoviesdf=moviesdf.dropna(subset=['Revenue'])
revmoviesdf.isna().sum()

In [None]:
#Top 10 earning movies
revmoviesdf.sort_values(by=['Revenue'], ascending=False)[:10]

In [None]:
revmoviesdf.groupby(['Title'])['Revenue'].mean().sort_values(ascending=False)[:10].plot(kind='bar', figsize=(15,8), fontsize=13, color='green')
plt.ylabel('Revenue')
plt.title("Highest earning movies")

**Lowest earning movies**

In [None]:
#10 lowest earning movies
revmoviesdf.sort_values(by=['Revenue'])[:10]

In [None]:
revmoviesdf.groupby(['Title'])['Revenue'].mean().sort_values(ascending=True)[:10].plot(kind='bar', figsize=(15,8), fontsize=13, color='red')
plt.ylabel('Revenue')
plt.title("Lowest earning movies")

**Highest earning movies year-wise**

In [None]:
#Highest earning movies year-wise
revmoviesdf.sort_values(by=['Revenue'], ascending=False).groupby('Year').first()

In [None]:
tmp=revmoviesdf.sort_values(by=['Revenue'], ascending=False).groupby('Year').first()
tmp.groupby(['Title','Year'])['Revenue'].sum().sort_values().plot.bar(x='Title', y='Revenue', figsize=(10,8))
plt.ylabel('Revenue in Millions(USD)')
plt.title("Highest earning movies by years")

- Highest earning movies year-wise
    - Star Wars: Episode VII - The Force Awakens
    - Avatar
    - The Avengers
    - The Dark Knight
    
- Year 2015 has the highest earning movie (Star Wars), followed by 2009 with Avatar.

- The top movie's revenue does not rise steadily with the year, as it would if ticket prices that increase every year were the main driver: the highest earning movie was released in 2015 (Star Wars VII), followed by 2009 (Avatar) and then 2012 (The Avengers).

- The top movie of 2006 (Pirates of the Caribbean) earned much more than the top movies of 2010, 2011 and 2014.

- The highest earning movie of 2016 (Rogue One) earned less than the highest earning movies of 2012 and 2009.
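One caveat about picking the top movie per year with `sort_values(...).groupby('Year').first()`: `GroupBy.first()` returns the first non-null value of each column, so a column that is `NaN` in the top row is filled from a different movie of the same year. Whether this affects the actual dataset is not checked here; the sketch below only demonstrates the behaviour on invented data and shows `idxmax` as an alternative that keeps whole rows.

```python
import numpy as np
import pandas as pd

# Invented data: the top earner of 2015 has no Metascore.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Year": [2015, 2015, 2016, 2016],
    "Revenue": [900.0, 100.0, 500.0, 200.0],
    "Metascore": [np.nan, 55.0, 65.0, 70.0],
})

# Approach used in the notebook: first non-null value per column and year.
first_based = toy.sort_values("Revenue", ascending=False).groupby("Year").first()

# Alternative: look up the index label of the maximum and select the entire row.
top_rows = toy.loc[toy.groupby("Year")["Revenue"].idxmax()].set_index("Year")

print(first_based)
print(top_rows)
```

In `first_based`, the 2015 row pairs movie A with the Metascore of movie B, while `top_rows` keeps A's missing Metascore as `NaN`.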

**Total revenue over the years**

In [None]:
revmoviesdf.groupby(['Year'])['Revenue'].sum().plot(kind='bar', figsize=(15,8), fontsize=13, color='blue')
plt.ylabel('Revenue (Million USD)')
plt.title("Total Revenue By Years")

In [None]:
revmoviesdf.groupby(['Year'])['Revenue'].sum().plot(kind='line', figsize=(15,8), fontsize=13, color='blue')
plt.ylabel('Revenue')
plt.title("Total Revenue By Years")

The total revenue has generally increased over the years, and the revenue generated in 2014, 2015 and 2016 is much higher than in the years before 2012.

**Revenue distribution for the movies**

In [None]:
#Revenue distribution (histplot with kde=True is used because distplot is deprecated in recent seaborn)
sns.histplot(revmoviesdf['Revenue'], kde=True).set_title("Revenue distribution for movies")

- Most movies earned up to about 150 million, some earned around 200 million, very few earned more than 400 million, and a couple of movies earned more than 800 million.

**Number of movies released over the years**

In [None]:
moviesdf.groupby(['Year'])['Title'].count().plot(kind='bar', figsize=(15,8), fontsize=13, color='yellow')
plt.ylabel('Number of Titles')
plt.title("Number of Movies released by Years")

- From the above graph we can observe that the number of movies released per year has increased significantly.
- This is a likely reason for the increase in total revenue over the years that we observed in the previous section.
- Far more movies were released in 2016 than in any other year, which could be why the total revenue observed earlier is so high in 2016 compared to other years.

**Average revenue over the years**

In [None]:
revmoviesdf.groupby(['Year'])['Revenue'].sum()

In [None]:
revmoviesdf.groupby(['Year'])['Revenue'].mean()

In [None]:
revmoviesdf.groupby(['Year'])['Revenue'].mean().plot(kind='bar', figsize=(15,8), fontsize=13, color='orange')
plt.ylabel('Revenue')
plt.title("Average Revenue By Years")

In [None]:
revmoviesdf.groupby(['Year'])['Revenue'].mean().plot(kind='line', figsize=(15,8), fontsize=13, color='orange')
plt.ylabel('Revenue')
plt.title("Average Revenue By Years")

- From the above bar and line charts, we can observe that the average revenue is lowest in 2016 and highest in 2009, followed by 2012.
- So although the total revenue has increased over the years, the average revenue has decreased as the number of movies released has increased.
- So our assumption that increasing ticket prices raised the revenue may not be correct: releasing more movies increased the total revenue but decreased the average revenue.
- Since the ticket price is not part of the data, we cannot assess its impact on movie revenue.
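The interplay of total revenue, average revenue and number of releases can be seen side by side with a single named aggregation. This is a minimal sketch on invented numbers; the column names `total`, `average` and `movies` are chosen here for illustration.

```python
import pandas as pd

# Invented data: 2016 has more releases, a higher total, but a lower average.
toy = pd.DataFrame({
    "Year": [2015, 2015, 2016, 2016, 2016, 2016],
    "Revenue": [300.0, 100.0, 200.0, 150.0, 100.0, 50.0],
})

# One row per year with the sum, the mean and the number of movies.
summary = toy.groupby("Year")["Revenue"].agg(total="sum", average="mean", movies="count")

print(summary)
```

On the notebook's data the same call on `revmoviesdf` would put the three quantities discussed above into one table.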

**Relationship study between different fields in dataset**

In [None]:
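# Correlation is only defined for numeric columns; recent pandas raises an error if text columns are included
corr = moviesdf.select_dtypes(include='number').corr()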

figure = plt.figure(figsize=(15,10))

sns.heatmap(data=corr, annot=True,cmap='viridis',xticklabels=True, yticklabels=True).set_title("Relation between Movie dataset fields")

- From the above graph, we can observe that the fields below have a moderately high correlation

 - Rating and Metascore
 - Revenue and Votes
 - Votes and Rating

- While the fields below have a very low correlation
 - Rating and Revenue
 - Runtime and Revenue

- Rank and Year are negatively correlated with the other fields.


In [None]:
sns.pairplot(moviesdf)

 - The Year histogram is concentrated in the later years, showing that the number of movies increased over the years and is highest in 2016.
 - In the Revenue vs Runtime plot, we can observe that movies with a runtime of 100-120 minutes have earned the maximum revenue.
 - We can observe a mild linear dependency between Rating and Revenue; most of the high-earning movies have a rating above 7.
 - Rating and Metascore have a linear relationship.
 - In the Votes vs Year plot, the points thin out towards the later years, as the average number of votes has decreased over the years.
 - There is a mild linear trend between Votes and Revenue, as most of the movies that earned more than 500 million have more than 500,000 votes.
 - In the Revenue vs Year plot, the same pattern appears, as the average revenue has decreased over the years.
 - Metascore is roughly constant over the years and shows no clear impact on the revenue of a movie.

**Distribution of votes for movies for different years and its relationship**

In [None]:
moviesdf.groupby(['Year'])['Votes'].sum()

In [None]:
#Number of votes over the years
moviesdf.groupby(['Year'])['Votes'].sum().plot(kind='line', figsize=(15,8), fontsize=13, color='orange')
plt.ylabel('Number of votes')
plt.title("Number of Votes by Years")

- The number of votes has increased from 2006 to 2016, but not linearly: the number of votes in 2016 is lower than in 2012-2015.
- 2013-2014 have the highest number of votes. 

In [None]:
moviesdf.groupby(['Year'])['Votes'].mean()

In [None]:
#Average number of votes over the years
moviesdf.groupby(['Year'])['Votes'].mean().plot(kind='line', figsize=(15,8), fontsize=13, color='orange')
plt.ylabel('Number of Votes')
plt.title("Average number of Votes by Years")

- The average number of votes has reduced gradually over the years, possibly because the number of movies released has increased, which lowers the average votes per movie.

In [None]:
plt.figure(figsize=(10,6))
plt.title("Votes and Year relation")
sns.regplot(data=moviesdf, x="Year", y="Votes", color='orange')
plt.ylabel("Number of Votes")

In the above graph we can see that votes per movie trend downward with the year.

- From the above 3 graphs we observe the following
 - The number of votes was highest in 2013-2014.
 - The average number of votes decreased, possibly due to the increase in the number of movies released over the years.

**Distribution of ratings for movies for different years and its relationship**

In [None]:
moviesdf.groupby(['Year'])['Rating'].count()

In [None]:
moviesdf.groupby(['Year'])['Rating'].mean()

In [None]:
#Average ratings over the years
moviesdf.groupby(['Year'])['Rating'].mean().plot(kind='line', figsize=(15,8), fontsize=13, color='violet')
plt.ylabel('Ratings')
plt.title("Average Ratings by Years")

In [None]:
plt.figure(figsize=(10,6))
plt.title("Ratings and Year relation")
sns.regplot(data=moviesdf, x="Year", y="Rating", color='violet')
plt.ylabel("Ratings")

- The average rating has decreased over the years.




**Distribution of metascores for movies for different years and its relationship**

In [None]:
#Drop the rows with missing metascore values
metamoviedf=moviesdf.dropna(subset=['Metascore'])

In [None]:
metamoviedf.groupby(['Year'])['Metascore'].mean()

In [None]:
#Average Metascore over the years
metamoviedf.groupby(['Year'])['Metascore'].mean().plot(kind='line', figsize=(15,8), fontsize=13, color='blue')
plt.ylabel('Metascore')
plt.title("Average Metascore by Years")

In [None]:
plt.figure(figsize=(10,6))
plt.title("Metascore and Year relation")
sns.regplot(data=metamoviedf, x="Year", y="Metascore", color='blue')
plt.ylabel("Metascore")

- The average metascore of the movies has decreased over the years, but not linearly.

**Movies with highest rating over the years**

In [None]:
tmp=moviesdf.sort_values(by=['Rating'], ascending=False).groupby('Year').first()
tmp.groupby(['Title','Year'])['Rating'].mean()

In [None]:
tmp=moviesdf.sort_values(by=['Rating'], ascending=False).groupby('Year').first()
tmp.groupby(['Title','Year'])['Rating'].mean().sort_values().plot.bar(x='Title', y='Rating', figsize=(10,8), color='purple')
plt.ylabel('Ratings')
plt.title("Highest rated movies by years")

- The Dark Knight is the highest rated movie, but the ratings of the highest rated movies do not differ much across the years.

- Also, the ratings of the top movies are not dependent on the year.

**Movies with highest metascore over the years**

In [None]:
tmp=metamoviedf.sort_values(by=['Metascore'], ascending=False).groupby('Year').first()
tmp.groupby(['Title','Year'])['Metascore'].mean()

In [None]:
tmp=metamoviedf.sort_values(by=['Metascore'], ascending=False).groupby('Year').first()
tmp.groupby(['Title','Year'])['Metascore'].mean().sort_values().plot.bar(x='Title', y='Metascore', figsize=(10,8), color='violet')
plt.ylabel('Metascore')
plt.title("Highest Metascore movies by years")

- Boyhood has the highest metascore. As with the ratings, the metascores of the top movies do not differ much across the years, apart from "Up" in 2009, which has a much lower score than the other top movies.

- Also, the metascores of the top movies are not dependent on the year.

**How Revenue and Votes are related**

In [None]:
plt.figure(figsize=(10,6))
plt.title("Revenue vs votes")
sns.regplot(data=revmoviesdf, x="Revenue", y="Votes")
plt.ylabel("Votes")

- Most of the movies earning around 100-150 million have fewer than 250,000 votes.
- But we can observe that the high-earning movies have more than 250,000 votes.
- The movies with more than 500,000 votes have generally earned more than 500 million, apart from a few exceptions.
- There are exceptions that earned 800 million or more.

**How Revenue and Ratings are related**

In [None]:
plt.figure(figsize=(10,6))
plt.title("Revenue vs ratings")
sns.regplot(data=revmoviesdf, x="Revenue", y="Rating", color='orange')
plt.ylabel("Rating")

- Most of the movies have earned up to around 150 million and span a wide range of ratings.
- There is no clear trend, but most of the movies earning more than 400 million have a rating above 7.

**How Revenue and Metascores are related**

In [None]:
tmp=moviesdf.dropna(subset=['Revenue','Metascore'])

In [None]:
plt.figure(figsize=(10,6))
plt.title("Revenue vs Metascore")
sns.regplot(data=tmp, x="Revenue", y="Metascore", color='green')
plt.ylabel("Metascore")

 - No clear relationship is observed between metascore and revenue, but the movies earning more than 400 million generally have a score above 60. 

**How Duration of Movie affects Revenue**

In [None]:
plt.figure(figsize=(10,6))
plt.title("Revenue vs Runtime")
sns.regplot(data=revmoviesdf, x="Revenue", y="Runtime")
plt.ylabel("Runtime")

- No clear relationship, but the high-earning movies mostly have a runtime between 100 and 140 minutes.

**Relationship between Ratings and Votes**

In [None]:
plt.figure(figsize=(10,6))
plt.title("Ratings vs Votes")
sns.regplot(data=moviesdf, x="Rating", y="Votes", color='red')
plt.ylabel("Votes")

 - No clear relationship between ratings and votes, as some movies with high ratings of more than 8 have a low vote count.

**Relationship between Metascores and Votes**

In [None]:
plt.figure(figsize=(10,6))
plt.title("Metascores vs Votes")
sns.regplot(data=metamoviedf, x="Metascore", y="Votes", color='violet')
plt.ylabel("Votes")

 - We can observe no clear relationship between Metascore and Votes, as some movies with a good score of more than 80 have a low vote count.

**Relationship between Ratings and Metascores**

In [None]:
plt.figure(figsize=(10,6))
plt.title("Metascores vs Ratings")
sns.regplot(data=metamoviedf, x="Metascore", y="Rating", color='purple')
plt.ylabel("Rating")

 - We can observe a linear trend between ratings and metascores: the scores increase along with the ratings, with a few exceptions.
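A visual trend such as the one above can be backed by a number: the Pearson correlation coefficient between the two columns. The sketch below uses invented values, so the coefficient it prints says nothing about the real data; on the notebook's data the equivalent call would be `metamoviedf['Rating'].corr(metamoviedf['Metascore'])`.

```python
import pandas as pd

# Invented ratings and metascores that rise together.
toy = pd.DataFrame({
    "Rating": [5.0, 6.0, 7.0, 8.0, 9.0],
    "Metascore": [40.0, 55.0, 60.0, 75.0, 90.0],
})

# Pearson correlation: close to +1 for a strong increasing linear relationship,
# close to 0 when there is no linear relationship.
r = toy["Rating"].corr(toy["Metascore"])

print(round(r, 3))
```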

**Relationship between Revenue and Genre**

In [None]:
#Get the Genre column, sorted by revenue in descending order
genre_list=moviesdf.sort_values(by='Revenue', ascending=False).Genre
genre_list

In [None]:
#Using Counter to count the occurrence of each unique genre element in the list generated above.
from collections import Counter 

genlist=[]
for genre in genre_list:
  tmp=[]
  tmp=genre.split(',')
  genlist.extend(tmp)
  
#print(genlist)

mycounter=Counter(genlist)
print(mycounter)

#print(mycounter.keys())
#print(mycounter.values())

In [None]:
#Empty dictionary
genre_dict=dict.fromkeys(genlist,0)
print(genre_dict)

#print(type(genre_dict.keys()))
#print(type(genre_dict.values()))

In [None]:
#Traversing the dataframe and storing data in a dictionary: key - genre combination, value - total revenue for it over the years (rows with missing revenue are skipped)
genredict=dict()
for idx in moviesdf.index:
  if (moviesdf['Revenue'][idx]>=0):
    if moviesdf['Genre'][idx] in genredict:
      genredict[moviesdf['Genre'][idx]]+=moviesdf['Revenue'][idx]
    else:
       genredict[moviesdf['Genre'][idx]]=moviesdf['Revenue'][idx]

for k,v in genre_dict.items():
  for key, val in genredict.items():
    tmplist=[]
    tmplist.extend(key.split(','))
    if (k in tmplist):
      genre_dict[k]+=val

for k,v in genre_dict.items():
  print ("Genre : {}, Revenue : {}".format(k,v))

In [None]:
list(genre_dict.keys())

In [None]:
tuple(genre_dict.values())

In [None]:
fig = go.Figure([go.Bar(x=list(genre_dict.keys()), y=tuple(genre_dict.values()))])

fig.update_layout(
    title="Genre with highest revenue",
    xaxis_title="Genre",
    yaxis_title="Revenue(in Millions USD)")

fig.show()

- From the above graph, we can see that the Adventure genre has earned the most revenue, followed by Action.
- The Sport, War, Music and Western genres have not made much revenue.
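The per-genre totals computed with nested loops above can also be obtained with pandas string splitting and `explode`. This is a sketch of an alternative on invented data, not the notebook's own method; note that a movie with several genres contributes its full revenue to each of them, so the genre totals add up to more than the overall revenue.

```python
import numpy as np
import pandas as pd

# Invented data; Genre is a comma-separated string as in the IMDB file.
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Genre": ["Action,Adventure", "Adventure,Comedy", "Drama"],
    "Revenue": [100.0, 50.0, np.nan],
})

per_genre = (
    toy.dropna(subset=["Revenue"])                          # skip movies without revenue
       .assign(Genre=lambda d: d["Genre"].str.split(","))   # string -> list of genres
       .explode("Genre")                                    # one row per (movie, genre)
       .groupby("Genre")["Revenue"].sum()
       .sort_values(ascending=False)
)

print(per_genre)
```

Unlike the dictionary approach, genres that only occur in movies without revenue (Drama in this toy data) do not appear in the result instead of showing up with a total of 0.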

**Revenue distribution for each genre over the year**

In [None]:
#Getting unique year from the list from above section
yeararr=revmoviesdf['Year'].unique()
yeararr=np.sort(yeararr)
yeararr

rows=[]

#for each year, we are traversing the genre element in each movie and storing the revenue generated by each element along 
#with genre element in a dictionary and further creating dataframe from the dictionaries
for year in yeararr:
    genrevdict=dict.fromkeys(genlist,0)
    genrevdict['Year']=year
    
    tmpdf=revmoviesdf[revmoviesdf['Year']==year]

    total=0
    for idx in tmpdf.index:
        revlist=tmpdf['Genre'][idx].split(',')
        for genre in revlist:
            if genre in genrevdict.keys():
                genrevdict[genre]+=tmpdf['Revenue'][idx]
            else:
                genrevdict[genre]=tmpdf['Revenue'][idx]
        total+=tmpdf['Revenue'][idx]
    genrevdict["Total"]=total
    rows.append(genrevdict)

#One row per year; the frame is built once from the list because DataFrame.append is not available in recent pandas
revenuedf=pd.DataFrame(rows)
            
revenuedf  

In [None]:
year=[2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016]


g1=revenuedf.groupby(['Year'])['Total'].sum().array
g2=revenuedf.groupby(['Year'])['Action'].sum().array
g3=revenuedf.groupby(['Year'])['Adventure'].sum().array
g4=revenuedf.groupby(['Year'])['Fantasy'].sum().array
g5=revenuedf.groupby(['Year'])['Sci-Fi'].sum().array
g6=revenuedf.groupby(['Year'])['Crime'].sum().array
g7=revenuedf.groupby(['Year'])['Drama'].sum().array
g8=revenuedf.groupby(['Year'])['Animation'].sum().array
g9=revenuedf.groupby(['Year'])['Comedy'].sum().array
g10=revenuedf.groupby(['Year'])['Thriller'].sum().array
g11=revenuedf.groupby(['Year'])['Mystery'].sum().array
g12=revenuedf.groupby(['Year'])['Family'].sum().array
g13=revenuedf.groupby(['Year'])['Biography'].sum().array
g14=revenuedf.groupby(['Year'])['Horror'].sum().array
g15=revenuedf.groupby(['Year'])['Sport'].sum().array
g16=revenuedf.groupby(['Year'])['War'].sum().array
g17=revenuedf.groupby(['Year'])['Romance'].sum().array
g18=revenuedf.groupby(['Year'])['Music'].sum().array
g19=revenuedf.groupby(['Year'])['History'].sum().array
g20=revenuedf.groupby(['Year'])['Western'].sum().array
g21=revenuedf.groupby(['Year'])['Musical'].sum().array

# Create the figure at the intended size before drawing; changing rcParams afterwards does not resize the current figure
plt.figure(figsize=(20,20))

series=[g1,g2,g3,g4,g5,g6,g7,g8,g9,g10,g11,g12,g13,g14,g15,g16,g17,g18,g19,g20,g21]
colors=['#eec900','#44c9c6','#58dae4','#39af8e','#3e4f6a','#2eaf57','#eee7ea','#6ca0c5','#1ba1e2','#008080','#420420',
        '#110044','#110011','#333300','#688248','#cda1ac','#cc0066','#ff003c','#b05f1b','#f9d7c0','#b87624']

# Stack every series on the running sum of all previous series, so that no series is left out of the offsets
bottom=np.zeros(len(year))
for values,color in zip(series,colors):
    plt.bar(year, values, color=color, bottom=bottom)
    bottom=bottom+np.asarray(values, dtype=float)

plt.legend(labels=('Total', 'Action', 'Adventure','Fantasy', 'Sci-Fi', 'Crime', 'Drama', 'Animation', 'Comedy', 'Thriller',
                  'Mystery', 'Family', 'Biography', 'Horror', 'Sport', 'War', 'Romance', 'Music', 'History', 'Western', 'Musical'))
plt.xlabel("Year")
plt.ylabel("Total Revenue")

plt.show()

- From the above graph of the revenue distribution for each year, we can observe that the Action and Adventure genres have earned most of the revenue, followed by the Comedy genre.
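The year-by-genre table behind such a stacked chart can also be built with `explode` and `pivot_table`. The following is a sketch of that alternative on invented data; the names `toy`, `exploded` and `pivot` are illustrative.

```python
import pandas as pd

# Invented data with comma-separated genres, as in the IMDB file.
toy = pd.DataFrame({
    "Year": [2015, 2015, 2016],
    "Genre": ["Action,Adventure", "Comedy", "Action"],
    "Revenue": [100.0, 40.0, 60.0],
})

# One row per (movie, genre) pair.
exploded = toy.assign(Genre=toy["Genre"].str.split(",")).explode("Genre")

# Rows: years, columns: genres, cells: summed revenue (0 where a genre is absent).
pivot = exploded.pivot_table(index="Year", columns="Genre", values="Revenue",
                             aggfunc="sum", fill_value=0)

print(pivot)
```

Calling `pivot.plot(kind='bar', stacked=True)` on such a table draws the stacked bars without manually accumulating the `bottom` offsets.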

**Director-wise revenue**

In [None]:
revmoviesdf.groupby(['Director'])['Revenue'].sum().sort_values(ascending=False)

In [None]:
revmoviesdf.groupby(['Director'])['Revenue'].sum().sort_values(ascending=False)[:20].plot(kind='bar', figsize=(15,8), fontsize=13, color='orange')
plt.ylabel('Revenue(in Millions USD)')
plt.title("Director-wise revenue")

- We can observe that the movies directed by J.J. Abrams have generated the maximum revenue, followed by the movies directed by David Yates and Christopher Nolan.

In [None]:
revmoviesdf[revmoviesdf['Director']=='J.J. Abrams']

In [None]:
revmoviesdf[revmoviesdf['Director']=='David Yates']

In [None]:
revmoviesdf[revmoviesdf['Director']=='Christopher Nolan']

**Directors with highest votes**

In [None]:
moviesdf.groupby(['Director'])['Votes'].sum().sort_values(ascending=False)[:20]

In [None]:
moviesdf.groupby(['Director'])['Votes'].sum().sort_values(ascending=False)[:20].plot(kind='bar', figsize=(15,8), fontsize=13, color='red')
plt.ylabel('Votes')
plt.title("Directors with maximum votes received")

**Directors with highest ratings**

In [None]:
moviesdf.groupby(['Director'])['Rating'].mean().sort_values(ascending=False)[:20]

In [None]:
moviesdf.groupby(['Director'])['Rating'].mean().sort_values(ascending=False)[:20].plot(kind='bar', figsize=(15,8), fontsize=13, color='green')
plt.ylabel('Ratings')
plt.title("Directors with maximum Ratings received")

**Directors with highest metascores**

In [None]:
metamoviedf.groupby(['Director'])['Metascore'].mean().sort_values(ascending=False)[:20]

In [None]:
metamoviedf.groupby(['Director'])['Metascore'].mean().sort_values(ascending=False)[:20].plot(kind='bar', figsize=(15,8), fontsize=13, color='yellow')
plt.ylabel('Metascore')
plt.title("Directors with maximum Metascores received")

**Conclusion**

After analyzing the movies data, the conclusions from the observations are as follows:
 - Total revenue increased over the years, but average revenue decreased.
 - The number of movies increased over the years, which could be the reason for the decreased average revenue.
 - Movies with a runtime of 100-120 minutes have earned the maximum revenue, so movies that are too short or too long have not earned much revenue over the years.
 - The average number of votes decreased over the years, while the total votes were highest in 2013-2014.
 - Average ratings decreased over the years, but the ratings of the top-rated movies are not dependent on the year.
 - The movies with the highest metascores are not the same as the movies with the highest ratings.
 - The movies with the highest revenues are also not among the movies with the highest ratings or metascores.
 - There is no clear relationship of revenue with ratings, metascores or runtime, but high-earning movies share a few common values
   - Runtime: 100-140 minutes
   - Metascore: above 60
   - Rating: above 7
 - Votes by viewers show no clear relationship with ratings and metascores.
 - The Adventure genre has earned the most revenue over the years, followed by Action, while the War, Musical and Western genres are the lowest earners.
 - The directors with the highest earning movies have received a high count of votes as well, but do not have the highest ratings or metascores.