# Movies EDA with Dataprep.eda

Exploratory data analysis of movies from [TheMovieDB.org](https://www.themoviedb.org).

Includes:
- Data prep and transformation
- EDA with matplotlib and Dataprep.eda

## Install Dataprep.eda
I'll be using this library for exploratory data analysis and visualization.

For creating plots for exploratory data analysis, see the documentation here:
- https://docs.dataprep.ai/user_guide/eda/introduction.html

To learn more about the DataPrep.eda library:
- [Dataprep.eda: Accelerate your EDA](https://towardsdatascience.com/dataprep-eda-accelerate-your-eda-eb845a4088bc)
- [Exploratory Data Analysis: DataPrep.eda vs Pandas-Profiling](https://towardsdatascience.com/exploratory-data-analysis-dataprep-eda-vs-pandas-profiling-7137683fe47f)
- [DataPrep.eda Homepage - dataprep.ai](https://dataprep.ai)

In [None]:
# Install dataprep library
!pip install dataprep

## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from dataprep.eda import plot, plot_correlation, plot_missing

## File Management
- You can write up to 20GB to the current directory (`/kaggle/working/`) that gets preserved as output when you create a version using "Save & Run All" 
- You can also write temporary files to `/kaggle/temp/`, but they won't be saved outside of the current session

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Read and Review Data

In [None]:
df = pd.read_csv('/kaggle/input/the-movie-database-19022019/movies.csv')
df.head(5)

View records at the tail end of the dataset.

In [None]:
df.tail()

## Initial Overview

In [None]:
df.info()

## Missing Data Report
Using dataprep.eda
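
As a plain-pandas cross-check on the missing-data report, the share of nulls per column can be computed directly. This is a minimal sketch on a made-up frame (`toy` is hypothetical, not the movies dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing runtimes
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'runtime': [90.0, np.nan, 120.0, np.nan],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct)
```

On the real data, the same expression applied to `df` should give a compact table to read alongside the plots.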

In [None]:
plot_missing(df)

## Reduce and Organize Columns

For this project, I am interested only in these fields:
- title
- release_date
- budget
- revenue
- runtime
- genres

In [None]:
df = df[['title','release_date','budget','revenue','runtime','genres']]
df.head(5)

## Drop Records with Budget or Revenue of Zero

- Our analysis demands values for budget and revenue.
- Zero-values indicate missing values.
- Drop them.
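
One way to make the zero-means-missing assumption explicit is to convert zeros to `NaN` first, so that they show up in missing-value reports, and then drop them. The sketch below uses a small made-up frame (`toy`, `money_cols`, `marked`, and `cleaned` are illustrative names) and keeps the same rows as a `> 0` filter, provided the columns hold no negative values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; zeros stand in for unknown budget or revenue
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'budget': [1_000_000, 0, 250_000, 5_000_000],
    'revenue': [3_000_000, 400_000, 0, 9_000_000],
})

money_cols = ['budget', 'revenue']

# Treat 0 as missing so it is counted by isnull()
marked = toy.copy()
marked[money_cols] = marked[money_cols].replace(0, np.nan)
print(marked[money_cols].isnull().sum())

# Drop rows where either money column is missing
cleaned = marked.dropna(subset=money_cols)
print(cleaned)
```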

In [None]:
# Test filter for budget and revenue greater than zero
df[(df['budget'] > 0) & (df['revenue'] > 0)].head(10)

In [None]:
# Overwrite the dataframe with the filtered data
# (.copy() avoids SettingWithCopyWarning when columns are added later)
df = df[(df['budget'] > 0) & (df['revenue'] > 0)].copy()
df.head(10)

## Create New genre field
- Use an algorithm to assign a single genre per title.
- This has been tested via trial and error and yields satisfactory results, though they may be improved.

In [None]:
# Define a single-genre-assignment function
# Written as an if/elif chain for human readability; the order of the checks sets the priority.
# It could be made more compact, but the current dataset size does not warrant optimizing.

def generate_genre(genres):
    if 'Animation' in genres:
        return 'Animation'
    elif 'Horror' in genres:
        return 'Horror'
    elif 'Documentary' in genres:
        return 'Documentary'
    elif 'Action' in genres:
        return 'Action'
    elif 'Family' in genres:
        return 'Family'
    elif 'Adventure' in genres:
        return 'Action'
    elif 'Science Fiction' in genres:
        return 'Science Fiction'
    elif 'Fantasy' in genres:
        return 'Fantasy'
    elif 'Western' in genres:
        return 'Action'
    elif 'Crime' in genres or 'Mystery' in genres or 'Thriller' in genres:
        return 'Crime/Mystery/Thriller'
    elif 'Comedy' in genres:
        return 'Comedy'
    elif 'Romance' in genres:
        return 'Romance'
    elif 'Drama' in genres:
        return 'Drama'
    else:
        return 'Other'
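
For comparison, the same priority rules can be expressed as data instead of branches. This is only a sketch: `GENRE_PRIORITY` and `generate_genre_from_table` are hypothetical names, and it assumes `genres` supports the `in` test just as in the function above.

```python
# Ordered (keyword, label) pairs; earlier entries take priority.
# Mirrors the if/elif chain, including Adventure and Western mapping to 'Action'.
GENRE_PRIORITY = [
    ('Animation', 'Animation'),
    ('Horror', 'Horror'),
    ('Documentary', 'Documentary'),
    ('Action', 'Action'),
    ('Family', 'Family'),
    ('Adventure', 'Action'),
    ('Science Fiction', 'Science Fiction'),
    ('Fantasy', 'Fantasy'),
    ('Western', 'Action'),
    ('Crime', 'Crime/Mystery/Thriller'),
    ('Mystery', 'Crime/Mystery/Thriller'),
    ('Thriller', 'Crime/Mystery/Thriller'),
    ('Comedy', 'Comedy'),
    ('Romance', 'Romance'),
    ('Drama', 'Drama'),
]

def generate_genre_from_table(genres, priority=GENRE_PRIORITY):
    """Return the label of the first keyword found in `genres`, else 'Other'."""
    for keyword, label in priority:
        if keyword in genres:
            return label
    return 'Other'

# Horror outranks Comedy, so this prints 'Horror'
print(generate_genre_from_table('Comedy, Horror'))
```

Reordering the list changes the priorities without touching the function, which may make trial-and-error tuning easier.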

In [None]:
# Create a new genre column, populating it with the value
# from the above function, applied to all non-null values for genres
df['genre'] = df.loc[df['genres'].notnull(), 'genres'].apply(generate_genre)
df.head(10)

## Drop old genres column

In [None]:
df.drop(columns='genres',inplace=True)
df.head(10)

## Check again for missing values

In [None]:
df.isnull().sum()

In [None]:
# Check the records with null values for runtime
df[df['runtime'].isnull()]

**These do not seem consequential for our analysis. Rather than filling the values we'll drop the records.**

In [None]:
# Drop records with nulls
df.dropna(inplace=True)
df.isnull().sum()

In [None]:
df.info()

## Assign Data Types
- release_date to datetime
- budget, revenue, and runtime to int
- genre to categorical
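
Before casting, it can help to confirm that the conversions will not fail or silently lose information. The sketch below uses made-up values (`toy` is hypothetical) to show two such checks: `errors='coerce'` turns unparseable dates into `NaT` instead of raising, and a modulo test confirms a float column holds only whole numbers before it is cast to `int`.

```python
import pandas as pd

# Hypothetical toy data with one unparseable date
toy = pd.DataFrame({
    'release_date': ['1999-07-14', '2010-07-16', 'not a date'],
    'runtime': [81.0, 148.0, 95.0],
})

# Unparseable strings become NaT, which can then be counted
parsed = pd.to_datetime(toy['release_date'], errors='coerce')
print(parsed.isnull().sum())

# True only if casting runtime to int would drop no fractional part
is_whole = (toy['runtime'] % 1 == 0).all()
print(is_whole)
```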

In [None]:
df['release_date'] = pd.to_datetime(df['release_date'])
df['budget'] = df['budget'].astype(int)
df['revenue'] = df['revenue'].astype(int)
df['runtime'] = df['runtime'].astype(int)
df['genre'] = df['genre'].astype('category')
df.info()

## Make title the index
- Using title as the index is helpful for reducing code needed for charts.
- Be aware: There are a few repeated titles, and so the index will not be unique.
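
The practical effect of a non-unique index is that a label lookup can return several rows. A minimal sketch on made-up data (`toy` and `dupes` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy frame where the title 'Alpha' appears twice
toy = pd.DataFrame(
    {'budget': [10, 20, 30]},
    index=pd.Index(['Alpha', 'Beta', 'Alpha'], name='title'),
)

# Selecting a repeated label returns every matching row
print(toy.loc['Alpha'])

# List all rows whose title occurs more than once
dupes = toy[toy.index.duplicated(keep=False)]
print(dupes)
```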

In [None]:
# Count the repeated titles so we know how far the index is from unique
df['title'].count() - df['title'].nunique()

In [None]:
# Proceed with knowledge
df.set_index('title', inplace=True)
df.head()

## Add Calculated Fields
- profit: Enables us to see negative profits.
- ratio: Ratio of `revenue` to `budget` gives us important insight into ROI per movie.

In [None]:
df['profit'] = df['revenue'] - df['budget'] # Create profit column
df['ratio'] = (df['revenue'] / df['budget']).round(2) # Create ratio column
df = df[['release_date','budget','revenue','profit','ratio','runtime','genre']] # Organize columns in desired order
df.head() # View the result

In [None]:
# Check new dataframe fundamentals
df.info()

## Let the EDA Begin!

Let's start with an overview of our fields using Dataprep.eda's `plot()` function.

In [None]:
plot(df)

**As we might expect:**
- The number of movies has greatly increased over time.
- A small percentage of movies have extraordinarily large budgets, revenue, profit, and revenue/budget ratio.
- Average runtime is around 110 minutes.
- Action is the most common genre.

**But more insights may come to light if we look into these fields one at a time.**

## Time Series Analysis
Trends over time:
- Avg budget
- Avg revenue
- Avg profit
- Avg ratio

I'll be using the Dataprep.eda plot function ...
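
For a pandas-only cross-check of the trends listed above, yearly averages can be computed with `groupby`. This sketch uses made-up numbers (`toy` is hypothetical); the plotting library may choose its own time bins, so its figures need not match a per-year average exactly.

```python
import pandas as pd

# Hypothetical toy data: two releases in 1975, one in 2019
toy = pd.DataFrame({
    'release_date': pd.to_datetime(['1975-06-20', '1975-12-01', '2019-04-26']),
    'budget': [2_000_000, 4_000_000, 50_000_000],
})

# Group by calendar year and average the budget
avg_budget_by_year = toy.groupby(toy['release_date'].dt.year)['budget'].mean()
print(avg_budget_by_year)
```

Swapping `'budget'` for `'revenue'`, `'profit'`, or `'ratio'` would give the other averages once those columns exist.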

### Count of movies by date

In [None]:
plot(df, 'release_date')

### AVG Budget by Date

In [None]:
plot(df, 'release_date','budget')

**The average budget rose from 2.9M in 1975 to 52.9M in 2019.**

### AVG Revenue by Date

In [None]:
plot(df, 'release_date','revenue')

**Average revenue had a huge spike in 1937 (_total_ revenue over time for _Gone with the Wind_), and has been on a relatively steady increase since.**

### Avg Profit by Date

In [None]:
plot(df, 'release_date','profit')

**Unsurprisingly, the pattern for average profit is similar to that of average revenue. But the relatively low production cost of _Gone with the Wind_ makes its spike even more pronounced.**

### Avg Ratio by date

In [None]:
plot(df, 'release_date', 'ratio')

**These huge spikes, dwarfing all other data points, seem incredibly odd and merit further investigation. I will return to examine ratio below.**

### AVG Runtime by Date

In [None]:
# Avg Runtime by date
plot(df, 'release_date', 'runtime')

**The mean runtime has normalized over time.**

## Univariate Analysis of Numeric Fields

### Budget univariate analysis

In [None]:
# Budget stats and charts using dataprep.eda
plot(df, 'budget')

**Note:** See the Histogram and Box Plot tabs in the output above.

**Evaluation:** Skewed to the right by a small number of very high budgets.

In [None]:
# matplotlib horizontal boxplot
df['budget'].plot.box(vert=False, figsize=(12,5));

> **We have *456* outliers with extraordinarily high budgets ...**

Let's do the math to see what number defines the upper whisker, beyond which movie budgets are considered outliers.

In [None]:
# Get the statistical summary
df['budget'].describe().map('{:,.0f}'.format)

**Outliers are values more than 1.5 × IQR above the 75th percentile.**
- IQR (interquartile range) = the range between the lower and upper quartiles
- Lower quartile (25th percentile) = 6M
- Upper quartile (75th percentile) = 40M
- IQR = 40M - 6M = 34M
- 1.5 × IQR = 51M
- Upper fence (the limit of the upper whisker) = 40M + 51M = 91M

**Outliers are movies with budgets greater than $91M.**
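
The same calculation can be sketched in code. `upper_fence` is a hypothetical helper, and the series `toy` is made up for illustration; the first print only re-checks the arithmetic with the rounded quartiles quoted above.

```python
import pandas as pd

def upper_fence(values, k=1.5):
    """Return Q3 + k * IQR, the threshold above which points count as outliers."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + k * (q3 - q1)

# Arithmetic check with the rounded quartiles from the text (in $M)
q1, q3 = 6, 40
print(q3 + 1.5 * (q3 - q1))  # 91.0

# Hypothetical toy series with one extreme value
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
fence = upper_fence(toy)
print(fence, toy[toy > fence].tolist())
```

Calling `upper_fence(df['budget'])` on the real data should land close to the 91M figure, with small differences expected because the quartiles above are rounded.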

### Revenue univariate analysis

In [None]:
# Using dataprep.eda
plot(df, 'revenue')

In [None]:
# matplotlib horizontal boxplot
df['revenue'].plot.box(vert=False, figsize=(12,5));

**Interpretation:** Similar to budget, we have 590 outliers with extraordinarily high revenues.

### Profit univariate analysis

In [None]:
plot(df, 'profit')

We have some negative profits, and still a skew to the right, with 627 outliers.

### Ratio univariate analysis

In [None]:
plot(df, 'ratio')

In [None]:
# matplotlib horizontal boxplot resized
df['ratio'].plot.box(vert=False, figsize=(12,5));

The boxplot reveals a very small number of outliers that are far more extreme than those in the other fields.

This merits further investigation.

## Investigate Movie Ratios
The ratio of Revenue / Budget
- Above analysis shows a few very extreme outliers
- Do these reflect bad data?

### View key fields sorted by highest ratio

In [None]:
cols = ['budget','revenue','profit','ratio']
df[cols].sort_values('ratio', ascending=False).head(20)

**Evaluation**
- Some of the budgets for these biggest outliers are ridiculously low.
- Let's find a lower-end cut-off for budgets and filter our data accordingly.

### Filter for movies with budgets GTE $50K

This cut-off point may be somewhat arbitrary. A few points worth considering:
- It eliminates the unrealistically low budgets.
- It may unnecessarily exclude movies with budgets such as 15K or 30K, etc.
- It does leave _Blair Witch Project_, which is a well known case of astounding return on investment.
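
Since the cut-off is a judgment call, one way to gauge its effect is to count how many rows survive at a few candidate thresholds. The sketch below uses made-up budgets (`toy` and `counts` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy budgets spanning implausible to realistic values
toy = pd.DataFrame({'budget': [100, 15_000, 30_000, 60_000, 2_000_000]})

# Rows kept at each candidate cut-off
counts = {
    cutoff: int((toy['budget'] >= cutoff).sum())
    for cutoff in (15_000, 30_000, 50_000)
}

for cutoff, kept in counts.items():
    print(f'{cutoff:>7,}: {kept} of {len(toy)} rows kept')
```

Running the same loop on `df['budget']` would show how much data each choice costs before committing to one.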

In [None]:
# Filter for movies with budgets GTE $50K
df_budg50k = df[df['budget'] >= 50000]
df_budg50k.sort_values('budget').head(10)

### Movies with $50K+ Budgets Sorted by Revenue/Budget Ratio

In [None]:
cols = ['budget','revenue','profit','ratio']
(
    df_budg50k[cols].sort_values('ratio', ascending=False)
                    .head(20)
                    .apply(lambda s: s.apply('{:,.0f}'.format))
)

These results look good. Let's update the dataframe to include only these movies with $50K+ budgets.

In [None]:
df = df_budg50k
df.info()

## Genre Analysis
- Count of movies per genre
- Avg budget per genre
- Avg revenue per genre
- Avg profit per genre
- Avg ratio per genre
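
The whole list above can also be produced in one table with `groupby` and named aggregation. This is a sketch on made-up numbers (`toy` and `summary` are hypothetical names), offered as an alternative view rather than a replacement for the steps that follow:

```python
import pandas as pd

# Hypothetical toy data: two cheap Horror titles and one expensive Action title
toy = pd.DataFrame({
    'genre': ['Horror', 'Horror', 'Action'],
    'budget': [1.0, 3.0, 100.0],
    'revenue': [10.0, 6.0, 250.0],
})
toy['ratio'] = toy['revenue'] / toy['budget']

# One row per genre: count of movies plus the mean of each measure
summary = toy.groupby('genre').agg(
    movies=('budget', 'count'),
    avg_budget=('budget', 'mean'),
    avg_revenue=('revenue', 'mean'),
    avg_ratio=('ratio', 'mean'),
)
print(summary)
```

Note that `avg_ratio` here is the mean of per-movie ratios, which is not the same as average revenue divided by average budget.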

### Count of Movies per Genre

In [None]:
# Matplotlib to get a horizontal bar chart, including null values
(
    df['genre'].value_counts(dropna=False) # Get a count per category, including null values
               .plot.barh(title='Movies per Genre, 1915-2019', figsize=(7,5)) # x= is not needed when plotting a Series
               .invert_yaxis() # Fix sort order for horz bar chart
)

### Create Pivot Table for Avgs per Genre

In [None]:
cols = ['budget','revenue','profit','ratio']
Genre_AVGs = df.pivot_table(values=cols, index='genre', aggfunc='mean')
Genre_AVGs = Genre_AVGs[['budget','revenue','profit','ratio']]
Genre_AVGs

#### Sort by ratio and format numbers

In [None]:
Genre_AVGs.sort_values('ratio', ascending=False).apply(lambda s: s.apply('{:,.2f}'.format))

### Plot AVG Budget by Genre

In [None]:
Genre_AVGs['budget'].sort_values().plot.barh(title = 'AVG Budget by Genre', figsize=(7,5));

### Plot AVG Revenue by Genre

In [None]:
Genre_AVGs['revenue'].sort_values().plot.barh(title = 'AVG Revenue by Genre', figsize=(7,5));

### Plot AVG Profit by Genre

In [None]:
Genre_AVGs['profit'].sort_values().plot.barh(title = 'AVG Profit by Genre', figsize=(7,5));

### Plot AVG Ratio by Genre

In [None]:
Genre_AVGs['ratio'].sort_values().plot.barh(title = 'AVG Ratio by Genre', figsize=(7,5));

**Noteworthy: Looking at ratios shifts our perspective on which movie genres are the most profitable.**