# IMDB Top 1000 Movies EDA

In [None]:
%matplotlib inline
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
movie = pd.read_csv("/kaggle/input/imdb-dataset-of-top-1000-movies-and-tv-shows/imdb_top_1000.csv")

## Overview

A first look at the variable types and their form, plus a check for missing values.

In [None]:
movie.info()

In [None]:
movie.head(3)

Things to bear in mind:

* Released_Year, Runtime and Gross are stored as objects (strings) when they should be numeric
* Genre is a string that can contain more than one genre
* Meta_score, Certificate and Gross have missing values
* Star1, Star2, Star3 and Star4 may have to be converted to long format
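
A minimal sketch of what the long-format conversions could look like, using `explode` for Genre and `melt` for the star columns. The rows below are made up, and both the `Series_Title` identifier column and the `", "` genre separator are assumptions about the file rather than something checked above.

```python
import pandas as pd

# Toy rows with made-up values; only the column names mimic the dataset.
toy = pd.DataFrame({
    "Series_Title": ["Film A", "Film B"],
    "Genre": ["Crime, Drama", "Action"],
    "Star1": ["Actor 1", "Actor 3"],
    "Star2": ["Actor 2", "Actor 1"],
})

# One row per (film, genre): split the string into a list, then explode it.
genres_long = (
    toy.assign(Genre=toy["Genre"].str.split(", "))
       .explode("Genre")[["Series_Title", "Genre"]]
       .reset_index(drop=True)
)

# One row per (film, star): melt the StarN columns into a single "Star" column.
stars_long = toy.melt(
    id_vars="Series_Title",
    value_vars=["Star1", "Star2"],
    var_name="Billing",
    value_name="Star",
)
```

On the real data, `value_vars` would list all four star columns.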


### What to look for

**Questions**

* Does the general public's rating (IMDB_Rating) go hand in hand with the specialized critics' rating (Meta_score)? On which films do these scores differ? Is there a pattern?

* Do highly rated films have a higher gross revenue? What movies don't follow such a pattern?

* What is the relation between the targeted age group and ratings? Do family-friendly films have lower ratings than age-restricted ones?

* Other questions concern Score x Runtime, Released Year x Runtime, Stars in the cast x Revenue, ...

**Statistics**

Formulating the IMDB_Rating statistic in the following way:

* There is a population-level rating that we want to estimate, and our sample consists of the users who posted their vote on IMDB. In this setting, the precision of the estimate depends on the size of the sample (No_of_Votes), so the number of votes must be taken into account when comparing two ratings.
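
To make the sample-size argument concrete, here is a rough sketch based on the standard error of a mean, sigma / sqrt(n). The per-vote standard deviation `sigma` is not available in the dataset, so the value used below is purely an assumption, and the vote counts are hypothetical.

```python
import math

def rating_standard_error(n_votes: int, sigma: float = 2.0) -> float:
    """Standard error of a mean rating built from n_votes independent votes.

    sigma is an ASSUMED per-vote standard deviation (not in the dataset).
    """
    return sigma / math.sqrt(n_votes)

# Hypothetical vote counts, for illustration only.
se_small = rating_standard_error(25_000)
se_large = rating_standard_error(2_500_000)

# 100 times more votes shrinks the standard error by a factor of 10.
ratio = se_small / se_large
```

Under these assumptions, a rating gap between two films only means something if it is large compared with the standard errors of both ratings.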

## Preprocessing with informal Univariate and Bivariate analysis along the way

Drop unnecessary columns: Poster_Link, Overview

In [None]:
movie.drop(["Poster_Link", "Overview"], axis=1, inplace=True)

### Certificate

**Problem**: The *Certificate* variable is not standardized, possibly because rating systems differ across countries and eras. Its value therefore depends on the nationality and release year of the film.

In [None]:
movie["Certificate"].value_counts()

**Solution**: Aggregate certificates by the age group they are meant for, inspired by the Brazilian rating system.

After looking up the different rating systems and making some concessions, we have the approximate mapping:

* L: PG, U, G
* 12+: U/A, TV-PG, UA, Passed, Approved
* 14+: PG-13, TV-14
* 16+: 16, A, R, TV-MA
* NaN: Unrated

Sources: 

* https://en.wikipedia.org/wiki/Motion_picture_content_rating_system
* https://en.wikipedia.org/wiki/Central_Board_of_Film_Certification
* https://en.wikipedia.org/wiki/TV_Parental_Guidelines



In [None]:
certificates = {
    "PG": "L", "U": "L", "G": "L",
    "U/A":"12+", "TV-PG":"12+", "UA": "12+", "Passed": "12+", "Approved":"12+",
    "PG-13": "14+", "TV-14":"14+",
    "16":"16+", "A":"16+", "R":"16+", "TV-MA":"16+",
    "Unrated": np.nan
}

In [None]:
movie["Certificate"] = pd.Categorical(movie["Certificate"].map(certificates), categories=["L", "12+", "14+", "16+"],
                                     ordered=True)
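
`Series.map` with a dict turns every value that is not a key into NaN, so a label missing from the mapping would be dropped silently. One possible safeguard, sketched here on made-up values (the label `"X"` is hypothetical), is to list the raw labels the mapping doesn't cover before overwriting the column:

```python
import numpy as np
import pandas as pd

# Shortened copy of the mapping, for illustration only.
certificates = {"PG": "L", "R": "16+", "Unrated": np.nan}

# Made-up raw column; "X" stands in for a label the mapping forgot.
raw = pd.Series(["PG", "R", "X", "Unrated", np.nan])

# Labels present in the data but absent from the mapping keys.
unmapped = sorted(set(raw.dropna()) - set(certificates))

# After mapping, "X" is indistinguishable from a genuinely missing value.
mapped = raw.map(certificates)
```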

In [None]:
movie["Certificate"].value_counts().sort_index().plot(kind="bar")

Except for 14+, there seems to be a balanced mix of ratings among the top 1000 IMDB movies. The scarcity of the 14+ group is probably due to the concessions that had to be made to construct the mapping.

#### Scores x Certificate

In [None]:
fig = plt.figure(figsize=(15, 5))
ax1, ax2 = fig.subplots(1,2)
sns.boxplot(x="Certificate", y="IMDB_Rating", data=movie, ax=ax1)
sns.boxplot(x="Certificate", y="Meta_score", data=movie, ax=ax2)

Not much here: the difference in ratings among certificates is either small or doesn't follow a clear pattern.

### Runtime

In [None]:
movie['Runtime'][1]

First, convert the *Runtime* variable from string to float.

In [None]:
movie['Runtime'] = movie['Runtime'].map(lambda x: float(x.replace(' min', '')))
movie['Runtime'][1]
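
The lambda above works when every entry is a string, but it would raise on a missing value. A vectorized alternative that tolerates missing entries might look like the sketch below; the sample values are made up and only mimic the `"<number> min"` format.

```python
import pandas as pd

# Made-up runtimes in the "<number> min" format; None mimics a missing entry.
runtime_raw = pd.Series(["142 min", "175 min", None])

# Pull out the digits, then convert; missing entries stay NaN.
runtime = pd.to_numeric(
    runtime_raw.str.extract(r"(\d+)", expand=False), errors="coerce"
)
```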

In [None]:
fig = plt.figure(figsize=(15, 5))
ax1, ax2 = fig.subplots(1,2)
sns.histplot(x="Runtime", data=movie, ax=ax1, binwidth=2)
sns.boxplot(x="Runtime", data=movie, ax=ax2);
movie['Runtime'].describe()

On average, movies in the top 1000 have a runtime of approximately 2 hours.

#### Runtime x Certificate

**Hypothesis**: Family-friendly movies have a shorter runtime than age-restricted ones.

In [None]:
sns.boxplot(x="Certificate", y="Runtime", data=movie);
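# observed=True only keeps categories that occur; it also avoids the pandas 2.x FutureWarning
movie.groupby(by="Certificate", observed=True)["Runtime"].describe()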

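This doesn't seem to be the case: the boxplots and the summary table show no clear tendency for family-friendly movies to be shorter.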

### Released Year

In [None]:
movie['Released_Year'].value_counts()

Released_Year is stored as a string. Before converting it to float, one wrongly inserted value (PG) must be dealt with.

In [None]:
movie['Released_Year'] = movie['Released_Year'].map(lambda x: np.nan if x == "PG" else float(x))
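
An alternative that doesn't hard-code the bad value is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into NaN. The sketch below uses made-up values; it also shows one way to list which entries were coerced, so nothing is lost silently.

```python
import pandas as pd

# Made-up column with one non-numeric entry, mimicking the stray "PG".
years_raw = pd.Series(["1994", "1972", "PG"])

# Unparseable strings become NaN instead of raising.
years = pd.to_numeric(years_raw, errors="coerce")

# Entries that were present in the raw data but could not be parsed.
bad_entries = years_raw[years.isna() & years_raw.notna()]
```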

#### Runtime x Released Year

**Question**: Are modern films longer than older ones?

In [None]:
sns.scatterplot(x="Released_Year", y="Runtime", data=movie, alpha=0.3);

There seems to be a mild upward trend. Possible explanations for this phenomenon might be:

* Films were harder to produce in the past, so they tended to be shorter on average.
* In the beginning, filmmaking was experimental, so producers would start small instead of taking a long shot on a lengthy movie.

(Note: I don't know if that is true; I'd have to talk to a specialist.)

However, following this reasoning, we would expect the average runtime to stabilize as the film industry matures, which in practice means that the upward trend fades as time goes by (like a logarithmic curve).

In [None]:
sns.regplot(x="Released_Year", y="Runtime", data=movie, lowess=True, 
            line_kws={'color': 'red'})

The nonparametric regression (lowess) curve makes this dependence explicit.

In [None]:
movie[['Released_Year', 'Runtime']].corr(method='spearman').iloc[0,1]

Spearman's correlation (a nonparametric, rank-based measure of correlation) between Released_Year and Runtime gives another indication of their positive association.
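
To illustrate what "rank-based" means here: Spearman's coefficient is Pearson's correlation computed on the ranks of the values, so it captures any monotone relationship, not only a linear one. The sketch below uses made-up (year, runtime) pairs, chosen to be monotone but nonlinear.

```python
import pandas as pd

# Made-up pairs: runtime never decreases with year, but not along a line.
toy = pd.DataFrame({
    "Released_Year": [1930, 1950, 1970, 1990, 2010],
    "Runtime": [80, 95, 100, 102, 150],
})

spearman = toy.corr(method="spearman").iloc[0, 1]

# Same number by hand: rank each column, then take Pearson's correlation.
by_hand = toy.rank().corr(method="pearson").iloc[0, 1]

# Pearson on the raw values is lower because the relationship is not linear.
pearson = toy.corr(method="pearson").iloc[0, 1]
```

In this toy example Spearman's coefficient is 1 because the ranks agree perfectly, while Pearson's stays below 1.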