In [None]:
import matplotlib.pyplot as plt  # for visualization
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Introduction

Movies play an important part in our lives, and **IMDb** is the de facto standard for rating movies. So, let's analyze the IMDb dataset and get some insights.

- Some important questions that we are going to answer:

  1. Is there a relation between higher ratings and revenue?
  2. Which movies are the most famous, and who are the top-rated directors?
  3. Does being a famous movie imply higher revenue?
  4. Does runtime affect the ratings?
  5. Is there a relation between the number of votes and IMDb ratings?
  6. What is the relation between the year and the average revenue in that year?
  
- Note: This dataset contains information on movies released from 2006 to 2016.


In [None]:
#importing data
imdb_data = pd.read_csv("/kaggle/input/imdb-data/IMDB-Movie-Data.csv")

#getting overview of various columns
imdb_data.info()

In [None]:
#finding dimensions
print(imdb_data.shape)

## Dealing with NA values

In [None]:
#Let's see which columns contain NA values
imdb_data.isna().any()

1. So, we can see that the Revenue and Metascore columns contain NA values.
2. Let's visualize the counts to make this clearer.

In [None]:
#Getting count of NA values in each column
print(imdb_data.isna().sum())

#visualizing
imdb_data.isna().sum().plot(kind="bar")


The `Revenue` column has about twice as many NA values as `Metascore`. <br>
Let's drop the rows with NA values.

In [None]:
imdb_data_cleaned = imdb_data.dropna()
imdb_data_cleaned.info()

In [None]:
#finding summary statistics 
imdb_data_cleaned.describe()


## 1. Relation between higher rating and revenue

In [None]:
#visualizing the histogram of ratings
imdb_data_cleaned["Rating"].hist(bins=30)

We can see most of the movies' ratings are between `6 - 8`

In [None]:
revenue_hist = imdb_data_cleaned["Revenue (Millions)"].hist(bins=30)
revenue_hist.set_xlabel("Revenue (in Million $)")
revenue_hist.set_ylabel("Movie Counts")

Most of the movies have revenue in the range of `0-200` million dollars. Let's zoom in on that range.


In [None]:
revenue_hist_zoomed = imdb_data_cleaned["Revenue (Millions)"].hist(bins=30)
revenue_hist_zoomed.set_xlim(0, 200)
revenue_hist_zoomed.set_xlabel("Revenue (in Million $)")
revenue_hist_zoomed.set_ylabel("Movie Counts")

Now it is clearer that most of the movies earn between **`0 and 60 million dollars`**.

> Now, let's answer our main question: what is the relation between revenue and ratings?

In [None]:
imdb_data_cleaned.plot(kind="scatter", x="Rating", y="Revenue (Millions)", color="orchid")

The scatter plot is heavily clustered around ratings from 5 to 8. One thing is clear: the highest-grossing movies (> $400 million) generally have higher IMDb ratings (> 6).

- Let's zoom in on the range where movies make 0 - 500 million dollars, since most of the data points lie there.

In [None]:
imdb_data_cleaned.plot(kind="scatter", x="Rating", y="Revenue (Millions)", color="orange", ylim=(0, 500), alpha=0.4)

A high rating does not guarantee high revenue, but movies that generate high revenue are mostly highly rated. There are some instances where movies with low ratings (< 5) have earned more than highly rated movies. A plausible reason might be the production and marketing budget.
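
The visual impression from the scatter plots can be backed by a number. The following is a minimal sketch, not part of the original analysis, that computes the Pearson (linear) and Spearman (rank) correlation between rating and revenue. The helper name `rating_revenue_corr` and the toy frame are hypothetical; in this notebook the function would be called on `imdb_data_cleaned`.

```python
import pandas as pd


def rating_revenue_corr(df, x="Rating", y="Revenue (Millions)"):
    """Return Pearson and Spearman correlation between two columns."""
    pair = df[[x, y]].dropna()
    # Pearson measures linear association
    pearson = pair[x].corr(pair[y])
    # Spearman is Pearson computed on ranks, so it captures monotonic trends
    spearman = pair[x].rank().corr(pair[y].rank())
    return pd.Series({"pearson": pearson, "spearman": spearman})


# Toy data for illustration only (not taken from the IMDb dataset)
toy = pd.DataFrame({
    "Rating": [5.0, 6.0, 7.0, 8.0, 9.0],
    "Revenue (Millions)": [10.0, 20.0, 15.0, 200.0, 500.0],
})
corr = rating_revenue_corr(toy)
print(corr)
```

A Spearman value well above the Pearson value would suggest a monotonic but non-linear relation, which is plausible for a skewed quantity such as revenue.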

## 2. Top rated Movies & Directors
- Let's say movies with ratings higher than 6 are considered more famous.
- We can use the "Metascore" attribute here to differentiate movies with the same IMDb rating!

- Let's find the top 15 movies and directors from the `years 2006-2016`.

In [None]:
top_rated = imdb_data_cleaned.sort_values(["Rating","Metascore"], ascending=False)[
    ["Title", "Director", "Rating","Metascore"]]
top_rated.index = range(1, len(top_rated) + 1)  # 1-based rank, without hard-coding the row count
top_rated.head(n=15)

In [None]:
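#Caution: the result has MultiIndex columns
#String names ("mean", "median") avoid the deprecation warning for passing NumPy functions to agg
top_rated.groupby("Director")[["Rating", "Metascore"]].agg(["mean", "median"]).sort_values(
    [("Rating","mean"),("Metascore", "mean")], ascending = False).head(n=15)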

By mean rating, ***Christopher Nolan*** is the top director in the span of 2006 to 2016.
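
A ranking by mean rating can be dominated by directors with a single highly rated movie. As a hedged sketch (the helper `top_directors`, the `min_movies=3` threshold, and the toy frame are assumptions, not part of the original analysis), one could require a minimum number of movies before ranking:

```python
import pandas as pd


def top_directors(df, min_movies=3, n=15):
    """Rank directors by mean rating, keeping only those with enough movies."""
    stats = df.groupby("Director").agg(
        movies=("Rating", "size"),
        mean_rating=("Rating", "mean"),
        mean_metascore=("Metascore", "mean"),
    )
    # Drop directors with too few movies for a meaningful average
    stats = stats[stats["movies"] >= min_movies]
    return stats.sort_values(
        ["mean_rating", "mean_metascore"], ascending=False
    ).head(n)


# Toy data for illustration only (not taken from the IMDb dataset)
toy = pd.DataFrame({
    "Director": ["A", "A", "A", "B", "C", "C", "C"],
    "Rating": [8.0, 7.0, 9.0, 9.5, 6.0, 7.0, 8.0],
    "Metascore": [80, 70, 90, 95, 60, 70, 80],
})
ranking = top_directors(toy, min_movies=3)
print(ranking)
```

In the toy data, director "B" has the highest single rating but is excluded because only one movie is available.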

## 3. Revenue Vs Famous Movies

In [None]:
top_rated_revenue = imdb_data_cleaned.sort_values(["Rating","Metascore"], ascending=False)[
    ["Title", "Director", "Rating","Metascore", "Revenue (Millions)"]]
top_rated_revenue.index = range(1, len(top_rated_revenue) + 1)  # 1-based rank, without hard-coding the row count
top_rated_revenue.head(n=15)

- We can see that a famous movie does not necessarily generate more revenue!
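
Looking only at the top 15 rows gives a narrow view. One possible way to check the claim across the whole dataset is to group movies into rating bands and compare the median revenue per band. This is a sketch under assumptions: the helper name `revenue_by_rating_band`, the band edges, and the toy frame are hypothetical.

```python
import pandas as pd


def revenue_by_rating_band(df, edges=(0, 5, 6, 7, 8, 10)):
    """Count movies and compute median revenue within each rating band."""
    # pd.cut builds right-closed intervals such as (7, 8]
    bands = pd.cut(df["Rating"], bins=list(edges))
    return (
        df.groupby(bands, observed=True)["Revenue (Millions)"]
        .agg(["count", "median"])
    )


# Toy data for illustration only (not taken from the IMDb dataset)
toy = pd.DataFrame({
    "Rating": [4.5, 5.5, 5.8, 6.5, 7.5, 7.9, 8.5],
    "Revenue (Millions)": [50.0, 10.0, 30.0, 40.0, 100.0, 60.0, 300.0],
})
summary = revenue_by_rating_band(toy)
print(summary)
```

The median is used instead of the mean because a few blockbusters can pull the mean far above what a typical movie in the band earns.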

## 4. Runtime Vs Fame

In [None]:
top_rated_runtime = imdb_data_cleaned.sort_values(["Rating", "Metascore"], ascending=False)[
    ["Title", "Director", "Runtime (Minutes)", "Rating","Metascore"]]
top_rated_runtime.index = range(1, len(top_rated_runtime) + 1)  # 1-based rank, without hard-coding the row count
top_rated_runtime.head(n=15)

In [None]:
#to see if there is any correlation between runtime and metascore
imdb_data_cleaned[["Runtime (Minutes)", "Metascore"]].corr()

In [None]:
#Let's plot with respect to Metascore because it takes more distinct values than Rating
top_rated_runtime.plot(kind="scatter",
                      x="Runtime (Minutes)",
                      y="Metascore",
                      alpha=0.4)

There is no clear relation between runtime and Metascore, i.e. how famous a movie is.
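
The original question asks about runtime and ratings, while the cells above use Metascore as the proxy. A small sketch (the helper `runtime_correlations` and the toy frame are hypothetical) that reports the correlation of runtime with both measures side by side:

```python
import pandas as pd


def runtime_correlations(df):
    """Pearson correlation of runtime with Rating and with Metascore."""
    cols = ["Runtime (Minutes)", "Rating", "Metascore"]
    corr = df[cols].corr()
    # Keep only the runtime column of the correlation matrix
    return corr.loc[["Rating", "Metascore"], "Runtime (Minutes)"]


# Toy data for illustration only (not taken from the IMDb dataset)
toy = pd.DataFrame({
    "Runtime (Minutes)": [90, 100, 110, 120, 130],
    "Rating": [6.0, 6.5, 7.0, 7.5, 8.0],
    "Metascore": [70, 50, 80, 60, 75],
})
runtime_corr = runtime_correlations(toy)
print(runtime_corr)
```

If the two values differ noticeably on the real data, the conclusion for Metascore would not automatically carry over to the IMDb rating.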

## 5. Votes Vs Ratings

- Let's see if the number of votes plays any role in deciding the Metascore of a movie.

In [None]:
imdb_data_cleaned[["Votes", "Metascore"]].corr()

In [None]:
imdb_data_cleaned.plot(kind="scatter",
                      x="Votes",
                      y="Metascore",
                      color="red",
                      alpha=0.4,
                      )

In [None]:
#zooming to clustered area
imdb_data_cleaned.plot(kind="scatter",
                      x="Votes",
                      y="Metascore",
                      color="red",
                      alpha=0.4,
                      xlim=(0, 650000)
                      )

There doesn't seem to be any major relation between the number of votes and Metascore.
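
Vote counts are usually heavily right-skewed, so a log scale may show structure that the raw scatter plot hides. The sketch below is an assumption-laden illustration (the helper `votes_correlations` and the toy frame are hypothetical, and it assumes every movie has at least one vote). It correlates `log10(Votes)` with both the IMDb rating and the Metascore:

```python
import numpy as np
import pandas as pd


def votes_correlations(df):
    """Correlate log10(Votes) with Rating and Metascore."""
    # Assumes Votes > 0; log10 of zero would give -inf
    log_votes = np.log10(df["Votes"])
    return pd.Series({
        "Rating": log_votes.corr(df["Rating"]),
        "Metascore": log_votes.corr(df["Metascore"]),
    })


# Toy data for illustration only (not taken from the IMDb dataset)
toy = pd.DataFrame({
    "Votes": [100, 1_000, 10_000, 100_000],
    "Rating": [5.0, 6.0, 7.0, 8.0],
    "Metascore": [60, 40, 70, 50],
})
votes_corr = votes_correlations(toy)
print(votes_corr)
```

This also addresses the question as originally posed, which asks about votes and IMDb ratings rather than Metascore.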

## 6. Average Revenue Generated by Movies Every Year

In [None]:
year_vs_revenue = imdb_data_cleaned.groupby("Year")[["Revenue (Millions)"]].mean()
year_vs_revenue.plot(kind="bar", color="green")
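
The yearly mean depends on how many movies each year contributes and can be pulled up by a few blockbusters. As a hedged sketch (the helper `yearly_revenue_summary` and the toy frame are hypothetical), the count and median can be reported next to the mean:

```python
import pandas as pd


def yearly_revenue_summary(df):
    """Count, mean, and median revenue for each year."""
    return df.groupby("Year")["Revenue (Millions)"].agg(
        ["count", "mean", "median"]
    )


# Toy data for illustration only (not taken from the IMDb dataset)
toy = pd.DataFrame({
    "Year": [2006, 2006, 2006, 2007, 2007],
    "Revenue (Millions)": [10.0, 20.0, 300.0, 50.0, 70.0],
})
yearly = yearly_revenue_summary(toy)
print(yearly)
```

In the toy data, 2006 has the higher mean but the lower median, which shows how a single large value can change the picture.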

# Conclusion

- We performed attribute analysis on various columns.
- It turns out that *Christopher Nolan* is the top-rated director and **The Dark Knight** is the top-rated movie.

> Limitations: We dropped rows with NA values, which might contain important information.
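
One way to reduce this loss, sketched here under assumptions (the helper `clean_for` and the toy frame are hypothetical), is to drop rows only when they are missing a value in the columns a given question actually uses, via the `subset` argument of `DataFrame.dropna`:

```python
import pandas as pd


def clean_for(df, columns):
    """Drop only rows that are missing a value in the given columns."""
    return df.dropna(subset=list(columns))


# Toy data for illustration only (not taken from the IMDb dataset)
toy = pd.DataFrame({
    "Rating": [7.0, 8.0, 6.0, 5.0],
    "Revenue (Millions)": [100.0, None, 50.0, None],
    "Metascore": [70.0, 80.0, None, 60.0],
})

rows_all = len(toy.dropna())  # drops a row if any column is missing
rows_metascore = len(clean_for(toy, ["Rating", "Metascore"]))
rows_revenue = len(clean_for(toy, ["Rating", "Revenue (Millions)"]))
print(rows_all, rows_metascore, rows_revenue)
```

With this approach, a question about ratings and Metascore would keep movies that lack revenue data, and vice versa.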
