# IMDB Data Exploration

Let's have a look at the IMDB data set uploaded by PromptCloud. This is my first Kaggle notebook, by the way, so if you find anything I should be aware of, you are very welcome to leave me a comment :).

The set contains data about movies released between 2006 and 2016. Let's import the necessary libraries first, and then we are set for some exploration.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Importing the Data and having a first look

In [None]:
f = pd.read_csv('../input/IMDB-Movie-Data.csv')

Let's start with the most basic look at the data first:

In [None]:
print(f.shape)

The shape tells us there are 1000 instances and 12 features. Next we will look at an extract of the data to get an idea of the information it contains.

In [None]:
f.head()

At first glance there are no surprises in this data set: we get basic info about a range of movies from different genres. Some of the data is easy to process and we can look at it straight away (Rating, Votes, etc.); other features are more complicated. The genres, for example, are complex constructs, and before we can classify by genre we might have to clean this data. The same goes for Actors and Descriptions. Before we start splitting strings and modifying the data set, I will have a look at the features that are easy to work with. Maybe we can get some insights here already. We will look at the distributions of features and the relations between them.
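To separate the "easy" numeric features from the string features that need cleaning, something like the following could work. This is only a sketch: the small `movies` frame is a made-up stand-in for `f`, and its values are not taken from the real data.

```python
import pandas as pd

# Hypothetical toy frame standing in for `f`; all values are made up.
movies = pd.DataFrame({
    'Title': ['Movie A', 'Movie B'],
    'Genre': ['Action,Sci-Fi', 'Comedy'],
    'Rating': [8.0, 6.0],
    'Votes': [1000, 200],
})

# Numeric columns can be plotted directly; the remaining ones need cleaning first.
numeric_cols = movies.select_dtypes(include='number').columns.tolist()
text_cols = movies.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(text_cols)
```

On the real data the same two calls on `f` should give the list of features to start with and the list to come back to later.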

## Looking at the numbers

First of all, let's look at some histograms to see how the movies are spread throughout the data set.

Let's start by checking whether the movies are evenly distributed over the years 2006 to 2016:

In [None]:
plt.hist(f.Year)
plt.show()

The bulk of the movies in this data set seems to be quite recent! Let's have a look at this in more detail:

In [None]:
f.Year.value_counts(normalize=True)

Of all the movies, 29% were published in 2016. That is quite a lot. Let's keep that in mind for further exploration. For now I'd like to move on to get more of an overview of the data.
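`value_counts` sorts by frequency, so the years come out in a jumbled order. A small sketch of how the shares could be listed chronologically instead; the `years` series below is a made-up stand-in for `f.Year`:

```python
import pandas as pd

# Toy stand-in for the `Year` column (hypothetical values, not the real data).
years = pd.Series([2014, 2016, 2015, 2016, 2016, 2014, 2016, 2015], name='Year')

# Share of movies per year, ordered by year instead of by frequency.
share_by_year = years.value_counts(normalize=True).sort_index()
print(share_by_year)
```

Ordered like this, it should be easier to see whether the share grows steadily over time or jumps in a single year.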

Our next step is a look at the distribution of the Ratings:

In [None]:
plt.hist(f.Rating)
plt.show()

Apparently the users of IMDB are quite generous: this data set certainly contains a high share of positively rated movies.

As before, let's also look at the relative frequency of each value:

In [None]:
f.Rating.value_counts(normalize=True).head(10)

Since we are looking at a continuum of values between zero and ten, including decimals, this list view is not very helpful for this feature. Let's have a look at the describe method instead and see what we can learn there:

In [None]:
f.Rating.describe()

OK, the average movie gets a Rating of 6.7. That's good to know. If I need a benchmark for a decent movie in the future, I am willing to go as low as 7.0. We can have a look at the subset of decent movies later.

But first of all I need to know what the min and max movies are! I mean, 1.9 is really low; depending on the description I might have to go watch it ;). A rating of 9.0, on the other hand, is quite impressive! If I haven't seen that one yet, I most definitely will!

### I need to know

Let's have a look:

In [None]:
f[f.Rating == 1.9]

"Disaster Movie" is the worst movie to come out in the past ten years according to this data set. They seem to have kept the title's promise: 87 minutes of disaster...

Let's have a look at the full description:

In [None]:
# None means "no limit"; the old value -1 is rejected by recent pandas versions
pd.set_option('display.max_colwidth', None)
f[f.Title == 'Disaster Movie'].Description

OK, not much to learn here. Even though the rating caught my eye, the description really doesn't sell this movie for me.

But all jokes aside, I am now interested in the Revenue and the Metascore of the movies. In particular I want to know if the Metascore and the Rating are closely related. I also really want to know if revenue and perceived quality go hand in hand.

But first we have to find out what the best movie in this data set is!

In [None]:
f[f.Rating == 9.0]

## The Dark Knight

Yeah, that was a great movie. I would have sworn it was more recent than 2008 though. And this time the description really sells the movie too! I might have to rewatch it soon :).
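Both lookups above hard-code the extreme values I read off from `describe()`. As a sketch, the same rows could probably be found without typing the numbers by using `idxmin` and `idxmax`. The `movies` frame below is a made-up stand-in for `f`:

```python
import pandas as pd

# Hypothetical toy frame standing in for `f`; titles and ratings are made up.
movies = pd.DataFrame({
    'Title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
    'Rating': [6.7, 1.9, 9.0, 7.2],
})

# idxmin/idxmax return the index label of the first row holding the extreme value.
worst = movies.loc[movies['Rating'].idxmin()]
best = movies.loc[movies['Rating'].idxmax()]

print(worst['Title'], worst['Rating'])
print(best['Title'], best['Rating'])
```

One caveat: `idxmin` and `idxmax` only return the first match, so if several movies share the top rating, a boolean mask such as `movies[movies.Rating == movies.Rating.max()]` would be needed to see all of them.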

## Moving on

Let's move on and have a look at the rest of the features. I'll do a few in bulk now, so that we can get an overview without wasting too much time.

In [None]:
# I copied this function from here: https://stackoverflow.com/questions/29530355/plotting-multiple-histograms-in-grid

def draw_histograms(df, variables, n_rows, n_cols):
    fig=plt.figure()
    for i, var_name in enumerate(variables):
        ax=fig.add_subplot(n_rows,n_cols,i+1)
        df[var_name].hist(bins=10,ax=ax)
        ax.set_title(var_name+" Distribution")
    fig.tight_layout()  # Improves appearance a bit.
    plt.show()
    
f_col = ['Runtime (Minutes)', 'Votes', 'Revenue (Millions)', 'Metascore']
    
draw_histograms(f, f_col, 2, 2)

### Runtime

The runtime doesn't seem very interesting. It looks just like I would expect from the movies I have seen so far.

### Votes

The Votes are a little more interesting. Overall the engagement seems to be quite low and only some movies get a lot of attention. We will see later on if this is related to some of the other values.

### Revenue

The shape of the Revenue distribution is quite similar to that of the Votes. We will have a look at this later on too.
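With Votes and Revenue, most movies pile up in the first bin and a few very large values stretch the axis. One common way to make such a distribution easier to read is a log transform. This is a sketch only: the `votes` series is made up, and the transform assumes all values are positive and not missing.

```python
import numpy as np
import pandas as pd

# Hypothetical vote counts with one very popular title (not the real data).
votes = pd.Series([100, 500, 1000, 5000, 10000, 50000, 1000000], name='Votes')

# log10 compresses the long right tail; this assumes all values are positive.
log_votes = np.log10(votes)

# Sample skewness before and after the transform (closer to 0 is more symmetric).
print(votes.skew(), log_votes.skew())
```

A histogram of the transformed column, for example `log_votes.hist(bins=10)`, should then show more of the structure that the raw plot squeezes into one bar.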

### Metascore

The Metascore looks a lot like a normal distribution and seems to be less skewed than the Rating.

## Looking at relations

So far we have looked at isolated features. Now I am interested in discovering what relations some of these features might have. Let's start by visualizing the correlations between these features and see if we can discover anything from there. Later on I will create plots of pairs of features.

In [None]:
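# Restrict to numeric columns; recent pandas versions raise an error on string columns
f_corr = f.select_dtypes(include='number').corr()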

sns.heatmap(f_corr)
plt.xticks(rotation=45)
plt.yticks(rotation=45)
plt.show()

f_corr

The correlation matrix contains a few interesting pieces of information. As expected, Rating and Metascore are correlated. The popular vote and the expert opinion still frequently differ quite a bit, though. One aspect that is interesting to me is the relation between the Revenue and the number of Votes: both seem to express popularity.

Looking at the heatmap, a few things caught my eye, so I will plot them individually.

#### We will be looking at the following relations:

- Rating vs Metascore
- Votes vs Revenue
- Year vs Votes
- Rating vs Votes
- Rating vs Revenue

and finally 

- Metascore vs Revenue

In [None]:
fig = plt.figure()
fig.add_subplot(321)
plt.scatter(f.Rating, f.Metascore)
fig.add_subplot(322)
plt.scatter(f.Votes, f['Revenue (Millions)'])
fig.add_subplot(323)
plt.scatter(f.Year, f.Votes)
fig.add_subplot(324)
plt.scatter(f.Rating, f.Votes)
fig.add_subplot(325)
plt.scatter(f.Rating, f['Revenue (Millions)'])
fig.add_subplot(326)
plt.scatter(f.Metascore, f['Revenue (Millions)'])
fig.tight_layout()
plt.show()

## Money and the popular vote

Once again we see the relation between the Rating and the Metascore quite clearly. The number of Votes that contribute to the Rating seems to grow with the Revenue. The Year, on the other hand, does not seem to influence the number of Votes much; the negative correlation seems to be due to some very popular, older movies (maybe we should find out which of the movies are the super popular ones to plan the next movie night!).

Interestingly enough, a high number of votes is in many cases also accompanied by a high Rating. Apparently people like to share their opinion on the movies they like, and when they care enough to vote they also care about the movie. We saw this already in the Rating histogram in the beginning.
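To actually find the super popular movies for that movie night, `nlargest` is probably the quickest route. A minimal sketch, with a made-up `movies` frame standing in for `f`:

```python
import pandas as pd

# Hypothetical toy frame standing in for `f`; all values are made up.
movies = pd.DataFrame({
    'Title': ['Movie A', 'Movie B', 'Movie C', 'Movie D', 'Movie E'],
    'Year': [2008, 2016, 2010, 2015, 2012],
    'Votes': [1500000, 20000, 1200000, 45000, 900000],
})

# nlargest returns the n rows with the highest values in the given column.
most_voted = movies.nlargest(3, 'Votes')
print(most_voted[['Title', 'Year', 'Votes']])
```

Looking at the `Year` column of the result would also be a quick check on whether the most voted movies really are the older ones.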

**Lastly: The Relation between Revenue, Rating and Metascore.**

By the looks of this little plot, either high-grossing movies are the favorites of the IMDB community or the favorites of the IMDB community get a lot of attention. The Metascore given out by the critics does not seem to be as closely related to the revenue. Maybe no one is following their recommendations.
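The scatter plots only give a visual impression. To put numbers on it, the correlation of Revenue with each of the two scores could be computed side by side with `corrwith`. This is a sketch on made-up values, so the numbers it prints say nothing about the real data:

```python
import pandas as pd

# Hypothetical toy values standing in for the real columns of `f`.
movies = pd.DataFrame({
    'Rating': [5.0, 6.0, 7.0, 8.0, 9.0],
    'Metascore': [70, 40, 55, 80, 60],
    'Revenue (Millions)': [10.0, 30.0, 50.0, 200.0, 500.0],
})

# Pearson correlation of Revenue with each of the two score columns.
scores = movies[['Rating', 'Metascore']]
revenue_corr = scores.corrwith(movies['Revenue (Millions)'])
print(revenue_corr)
```

Since Revenue is heavily skewed, passing `method='spearman'` to `corrwith` might be worth a try as well, because rank correlation is less sensitive to a few blockbusters.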

# Next Steps

Next I would like to dig deeper into the information hidden in the long strings of f.Genre and f.Actors. Unfortunately, in their current state they are hard to use because most of the values are unique. In the next step I would like to reshape the data to gain some information about the relations between Genres, Actors and the numeric values. I will probably reduce the number of Genres in some way, or find a new mapping that forms bigger clusters. I might also look into the metrics for individual Actors who had a lot of appearances.
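A first idea for the reshaping, as a sketch only: split each genre string into its parts and give every (movie, genre) pair its own row. This assumes the genres are stored as comma-separated strings; the `movies` frame and its values are made up.

```python
import pandas as pd

# Hypothetical toy frame; assumes genres are stored as comma-separated strings.
movies = pd.DataFrame({
    'Title': ['Movie A', 'Movie B', 'Movie C'],
    'Genre': ['Action,Adventure,Sci-Fi', 'Comedy', 'Action,Comedy'],
    'Rating': [8.0, 6.0, 7.0],
})

# One row per (movie, genre) pair: split the string, then explode the lists.
per_genre = movies.assign(Genre=movies['Genre'].str.split(',')).explode('Genre')

# Mean rating and number of movies per single genre.
genre_stats = per_genre.groupby('Genre')['Rating'].agg(['mean', 'count'])
print(genre_stats)
```

Note that a movie with three genres is counted three times in this layout, which is fine for per-genre averages but would distort totals. The same pattern should carry over to the Actors column.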

### Thanks for reading. 

If anyone did, I would be more than happy about your feedback and some pointers on where I could improve :).