# Movies EDA Notebook

This notebook is intended to provide a quick way to get started with data review, prep, and analysis.

I have included the code needed to load and use Dataprep.eda, along with some documentation below.

If you desire, [check out this example of a more fully developed Movies EDA](https://www.kaggle.com/davidcochran/movies-data-eda-with-dataprep-eda).

## Install Dataprep.eda
The Dataprep.eda library is designed to speed up and enhance exploratory data analysis.

For creating plots for exploratory data analysis, see the documentation here:
- https://docs.dataprep.ai/user_guide/eda/introduction.html

To learn more about the DataPrep.eda library:
- [Dataprep.eda: Accelerate your EDA](https://towardsdatascience.com/dataprep-eda-accelerate-your-eda-eb845a4088bc)
- [Exploratory Data Analysis: DataPrep.eda vs Pandas-Profiling](https://towardsdatascience.com/exploratory-data-analysis-dataprep-eda-vs-pandas-profiling-7137683fe47f)
- [DataPrep.eda Homepage - dataprep.ai](https://dataprep.ai)

In [None]:
# Install the dataprep library
!pip install dataprep

## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from dataprep.eda import plot, plot_correlation, plot_missing

## File Management
- You can write up to 20GB to the current directory (`/kaggle/working/`) that gets preserved as output when you create a version using "Save & Run All" 
- You can also write temporary files to `/kaggle/temp/`, but they won't be saved outside of the current session

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Read in the Data

In [None]:
df = pd.read_csv('/kaggle/input/the-movie-database-19022019/movies.csv')
df.head(5)

## Overview of the Data 

In [None]:
df.info()

## Organizing Columns 

In [None]:
df = df[['title','release_date','budget','revenue','runtime','genres',"original_language"]]
df.head(5)

### Dropping Unusable Budget and Revenue Records
Rows with a zero value in the Budget or Revenue column are dropped, keeping only movies where both values are positive.

In [None]:
df[(df['budget'] > 0) & (df['revenue'] > 0)].head(10)

In [None]:
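# .copy() gives an independent frame, so later column assignments do not trigger SettingWithCopyWarning
df = df[(df['budget'] > 0) & (df['revenue'] > 0)].copy()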
df.head(10)
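
As a small illustration of the filter above, the sketch below applies the same mask to a made-up three-row frame. The titles and numbers are invented for the example and are not taken from the dataset; the name `toy_clean` is likewise just illustrative.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'budget': [100, 0, 50],
    'revenue': [300, 200, 0],
})

# Keep only rows where both budget and revenue are positive
mask = (toy['budget'] > 0) & (toy['revenue'] > 0)
toy_clean = toy[mask].copy()

print(f"{len(toy)} rows before, {len(toy_clean)} rows after")
```

Only the row with a positive value in both columns survives, which is the behavior the filter on the full dataset relies on.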

## Editing the genre field

In [None]:
def generate_genre(genres):
    if 'Animation' in genres:
        return 'Animation'
    elif 'Horror' in genres:
        return 'Horror'
    elif 'Documentary' in genres:
        return 'Documentary'
    elif 'Action' in genres:
        return 'Action'
    elif 'Family' in genres:
        return 'Family'
    elif 'Adventure' in genres:
        return 'Action'
    elif 'Science Fiction' in genres:
        return 'Science Fiction'
    elif 'Fantasy' in genres:
        return 'Fantasy'
    elif 'Western' in genres:
        return 'Action'
    elif 'Crime' in genres or 'Mystery' in genres or 'Thriller' in genres:
        return 'Crime/Mystery/Thriller'
    elif 'Comedy' in genres:
        return 'Comedy'
    elif 'Romance' in genres:
        return 'Romance'
    elif 'Drama' in genres:
        return 'Drama'
    else:
        return 'Other'

In [None]:
df['genre'] = df.loc[df['genres'].notnull(), 'genres'].apply(generate_genre)
df.head(10)
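
Because `generate_genre` returns on the first match, the order of its checks decides which label a multi-genre movie receives. The sketch below is one possible table-driven way to express the same priority order, not a replacement for the function above. The names `GENRE_PRIORITY` and `collapse_genre` are hypothetical, and the sample strings are invented, since the exact format of the `genres` column is assumed here.

```python
# Priority-ordered (keyword, label) pairs; the first match wins.
# The order mirrors the if/elif chain in generate_genre.
GENRE_PRIORITY = [
    ('Animation', 'Animation'),
    ('Horror', 'Horror'),
    ('Documentary', 'Documentary'),
    ('Action', 'Action'),
    ('Family', 'Family'),
    ('Adventure', 'Action'),
    ('Science Fiction', 'Science Fiction'),
    ('Fantasy', 'Fantasy'),
    ('Western', 'Action'),
    ('Crime', 'Crime/Mystery/Thriller'),
    ('Mystery', 'Crime/Mystery/Thriller'),
    ('Thriller', 'Crime/Mystery/Thriller'),
    ('Comedy', 'Comedy'),
    ('Romance', 'Romance'),
    ('Drama', 'Drama'),
]

def collapse_genre(genres):
    """Return the label of the first keyword found in genres, else 'Other'."""
    for keyword, label in GENRE_PRIORITY:
        if keyword in genres:
            return label
    return 'Other'

# Invented sample inputs (the real column format may differ)
samples = ['Adventure, Family', 'Comedy, Thriller', 'Drama, Romance', 'Music']
labels = [collapse_genre(s) for s in samples]
print(labels)
```

Keeping the priorities in a list makes the ordering easy to review and reorder in one place.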

### Drop the old genre column

In [None]:
df.drop(columns='genres',inplace=True)
df.head(10)

## Assign the Data Types 
- release_date to datetime 
- budget and revenue to int
- genre to categorical 

In [None]:
df['release_date'] = pd.to_datetime(df['release_date'])
df['budget'] = df['budget'].astype(int)
df['revenue'] = df['revenue'].astype(int)
df['genre'] = df['genre'].astype('category')
df.info()
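
The conversions above can be tried on a tiny frame first. This is a minimal sketch with invented values; `errors='coerce'` is an optional addition not used in the cell above, shown here because it turns unparseable dates into `NaT` instead of raising an error.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'release_date': ['1999-03-31', 'not a date'],
    'budget': [5000000.0, 1000.0],
    'genre': ['Action', 'Drama'],
})

# errors='coerce' turns unparseable dates into NaT instead of raising
toy['release_date'] = pd.to_datetime(toy['release_date'], errors='coerce')
toy['budget'] = toy['budget'].astype(int)
toy['genre'] = toy['genre'].astype('category')

print(toy.dtypes)
```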

## Making title the index
This makes the displayed tables easier to read.

In [None]:
# Number of repeated titles (non-null titles minus distinct titles)
df['title'].count() - df['title'].nunique()

In [None]:
df.set_index('title', inplace=True)
df.head()
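
The duplicate check matters because an index does not have to be unique. The sketch below uses invented titles to show what the count expression measures and what a label lookup returns when a title repeats.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'title': ['Alpha', 'Beta', 'Alpha'],
    'revenue': [10, 20, 30],
})

# Non-null titles minus distinct titles
n_repeats = toy['title'].count() - toy['title'].nunique()

# With title as the index, a repeated label selects every matching row
indexed = toy.set_index('title')
alpha_rows = indexed.loc['Alpha']

print(n_repeats, len(alpha_rows))
```

If the count is nonzero, lookups by title can return several rows, which is worth keeping in mind when reading later tables.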

## Adding the Calculated Fields

In [None]:
df['profit'] = df['revenue'] - df['budget']
df['ratio'] = (df['revenue'] / df['budget']).round(2)
df = df[['release_date','budget','revenue','profit','ratio','runtime','genre']]
df.head()
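
A quick worked example of the two calculated fields, using invented numbers: profit is revenue minus budget, and ratio is revenue divided by budget, rounded to two decimals.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'budget': [100, 200, 30],
    'revenue': [250, 150, 100],
})

toy['profit'] = toy['revenue'] - toy['budget']
toy['ratio'] = (toy['revenue'] / toy['budget']).round(2)

print(toy)
```

A ratio below 1 (the second row) corresponds to a negative profit, so the two fields describe the same outcome on different scales.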

In [None]:
df.info()

## Time to start EDA 

In [None]:
plot(df)

### Count of movies by runtime

In [None]:
plot(df, 'runtime')

### AVG Budget by Runtime

In [None]:
plot(df, 'runtime','budget')

### AVG Revenue by Runtime

In [None]:
plot(df, 'runtime','revenue')

### AVG Profit by Runtime

In [None]:
plot(df, 'runtime','profit')

### AVG Ratio by Runtime

In [None]:
plot(df, 'runtime', 'ratio')
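
For two numeric columns, `plot` shows the relationship between them rather than a table of averages. If explicit averages per runtime range are wanted, one possible approach is to bin runtime with `pd.cut` and take the group mean. This is a sketch on invented data; the bin edges and the `runtime_bin` column name are assumptions for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'runtime': [80, 95, 110, 130],
    'budget': [10, 20, 30, 40],
})

# Hypothetical bin edges in minutes; choose edges that suit the real data
bins = [60, 90, 120, 150]
toy['runtime_bin'] = pd.cut(toy['runtime'], bins=bins)

# Mean budget within each runtime bin
avg_budget = toy.groupby('runtime_bin', observed=True)['budget'].mean()
print(avg_budget)
```

The same pattern should work for revenue, profit, or ratio by swapping the column name.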

### Univariate Analysis of Numeric Fields 

#### Revenue univariate analysis 

In [None]:
plot(df, 'revenue')

In [None]:
df['revenue'].plot.box(vert=False, figsize=(12,5));

**Explanation:** 560 outliers with very high revenues 
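
Matplotlib's box plot, by default, draws points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles as individual outlier markers. Assuming the outlier count above comes from that rule, a count of this kind could be computed along these lines. The data here is invented, so the result does not reproduce the figure quoted above.

```python
import pandas as pd

# Toy data, invented for illustration only
values = pd.Series([1, 2, 3, 4, 5, 100])

# Quartiles and interquartile range
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Whisker limits under the 1.5 * IQR rule
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(len(outliers))
```

On the real data, `values` would be `df['revenue']`.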

In [None]:
# Statistical summary
df['revenue'].describe().map('{:,.0f}'.format)

#### Budget univariate analysis

In [None]:
plot(df, 'budget')

In [None]:
df['budget'].plot.box(vert=False, figsize=(12,5));

In [None]:
df['budget'].describe().map('{:,.0f}'.format)

#### Ratio univariate analysis 

In [None]:
plot(df, 'ratio')

In [None]:
df['ratio'].plot.box(vert=False, figsize=(12,5));

In [None]:
# Two decimals, since ratio values are small and would be distorted by rounding to whole numbers
df['ratio'].describe().map('{:,.2f}'.format)

### Genre Analysis 

#### Count of Movies per Genre 

In [None]:
df['genre'].value_counts(dropna=False).plot.barh(title='Movies per Genre, 1915-2019', figsize=(7,5)).invert_yaxis()

#### Creating table for AVG per Genre

In [None]:
cols = ['budget','revenue','profit','ratio']
Genre_AVGs = df.pivot_table(values=cols, index='genre', aggfunc='mean')
Genre_AVGs = Genre_AVGs[['budget','revenue','profit','ratio']]
Genre_AVGs
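
To see what the pivot table computes, the sketch below builds per-genre means on an invented three-row frame and compares them with the equivalent `groupby` call. `pivot_table` sorts the value columns alphabetically, which is why the cell above reselects them in the desired order.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'genre': ['Action', 'Action', 'Drama'],
    'budget': [100, 200, 50],
    'revenue': [300, 500, 80],
})

# Mean of each value column per genre
by_pivot = toy.pivot_table(values=['budget', 'revenue'], index='genre', aggfunc='mean')

# The same averages expressed with groupby
by_group = toy.groupby('genre')[['budget', 'revenue']].mean()

print(by_pivot)
```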

#### Plot AVG Budget by Genre

In [None]:
Genre_AVGs['budget'].sort_values().plot.barh(title = 'AVG Budget by Genre', figsize=(7,5));

#### Plot AVG Revenue by Genre 

In [None]:
Genre_AVGs['revenue'].sort_values().plot.barh(title = 'AVG Revenue by Genre', figsize=(7,5));

#### Plot AVG Profit by Genre 

In [None]:
Genre_AVGs['profit'].sort_values().plot.barh(title = 'AVG Profit by Genre', figsize=(7,5));

#### Plot AVG Ratio by Genre 

In [None]:
Genre_AVGs['ratio'].sort_values().plot.barh(title = 'AVG Ratio by Genre', figsize=(7,5));