# Movies Prep and EDA
### This dataset covers over *26,000* films released between **1902 and 2019**

## Steps taken in this notebook:
### Import Libraries
### Introductory Data Exploration and Preparation
### Univariate Analysis
1. Display key statistics for individual fields
2. Budget, Revenue, and Runtime statistics
3. Profit statistics
4. Ratio statistics
5. Genre proportions (pie charts)
### Bivariate Analysis
1. Statistics by Genre
2. Budget/Profit Scatterplot by Genre
3. Average Budget by Genre Barchart
4. Average Revenue by Genre Barchart
5. Median Ratio by Genre Barchart
6. Average Runtime by Genre Barchart
### Export Data
- To JSON and CSV


## Import Libraries

In [None]:
!pip install dataprep

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from dataprep.eda import plot, plot_correlation
import datetime as dt
import seaborn as sns


## Read in the Data

In [None]:
df = pd.read_csv('/kaggle/input/the-movie-database-19022019/movies.csv')
df.head(5)

## Introductory Data Exploration and Prep
- Find zero values that are really placeholders for missing data and replace them with nulls
- Find records that lack essential information and delete them from the dataframe

In [None]:
df['budget'].isnull().sum() #adds up the number of null values

In [None]:
# display all records with 0 in budget field (note that there are 17662 rows)
df[df['budget'] == 0]

In [None]:
budget_stats_before = df['budget'].describe().map('{:,.0f}'.format)
budget_stats_before

In [None]:
# convert zero values to NULL values by overwriting them with NumPy's NaN ("not a number")
df.loc[df['budget'] == 0, 'budget'] = np.nan  # the np.NaN alias was removed in NumPy 2.0

In [None]:
df.loc[df['budget'] == 0, 'budget'] # now no zeros

In [None]:
df.loc[df['budget'].isnull(), 'budget'] # this returns all values that are now NULL

In [None]:
budget_stats_after = df['budget'].describe().map('{:,.0f}'.format)
budget_stats_after

#### Budget stats significantly changed after replacing zeros with null
- Mean 3x higher
- Median previously 0, now 11 mil
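
The same shift can be reproduced on toy data. The sketch below uses made-up budgets (not values from the dataset) to show how zero placeholders pull the mean and median down, and how both recover once the zeros become NaN, which pandas skips by default.

```python
import numpy as np
import pandas as pd

# Hypothetical budgets: three real values plus three zero placeholders.
toy = pd.Series([0, 0, 0, 10_000_000, 20_000_000, 30_000_000], dtype="float64")

# Zeros are treated as real observations and drag the statistics down.
mean_before, median_before = toy.mean(), toy.median()

# Replace zeros with NaN; mean() and median() ignore NaN by default.
cleaned = toy.mask(toy == 0, np.nan)
mean_after, median_after = cleaned.mean(), cleaned.median()

print(f"mean:   {mean_before:,.0f} -> {mean_after:,.0f}")
print(f"median: {median_before:,.0f} -> {median_after:,.0f}")
```

The direction of the change matches what the notebook observes; the exact factor depends on how many zeros the column holds.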

In [None]:
df.info()

In [None]:
# drop records missing a title or a release date (both are checked in the next two cells)
df.dropna(subset = ['title', 'release_date'], inplace = True)

In [None]:
# test for remaining null titles
df[df["title"].isnull()]

In [None]:
# test for remaining null release dates
df[df["release_date"].isnull()]

### Describe numeric fields

In [None]:
df.describe().apply(lambda s: s.apply('{:,.0f}'.format))

### Describe non-numeric fields

In [None]:
df.describe(include = ['O'])

In [None]:
df.loc[df["revenue"] == 0, "revenue"] = np.nan

In [None]:
df.loc[df["runtime"] == 0, "runtime"] = np.nan

In [None]:
budget_revenue_runtime = ["budget", "revenue", "runtime"]
df[budget_revenue_runtime]

In [None]:
df[budget_revenue_runtime].isnull().sum()

In [None]:
df[budget_revenue_runtime].notnull().sum()

#### Create new dataframe with only those fields desired for analysis

In [None]:
wanted_columns = ["title", "release_date", "budget", "revenue", "runtime", "genres"]
df2 = df[wanted_columns].copy()  # copy so later in-place edits do not trigger SettingWithCopyWarning

In [None]:
# look for all rows (:), the budget column, add up the count of those with null values
df2.loc[:,"budget"].isnull().sum()

In [None]:
# look for all rows (:), the revenue column, add up the count of those with null values
df2.loc[:, "revenue"].isnull().sum()

In [None]:
df2.loc[df2["budget"].isnull()]

In [None]:
#drop records with null values in any of the budget, revenue, or runtime columns
df2.dropna(subset = ["budget", "revenue", "runtime"], inplace = True)

In [None]:
#check to make sure no null budgets remain
df2["budget"].isnull().sum()

In [None]:
#check to make sure no null revenues remain
df2["revenue"].isnull().sum()

### Investigation of genres field
- This field contains multiple genres per film, which is good for descriptive purposes, but makes it difficult to accurately categorize the films.  
- Each combination of genres creates a new unique value for the field, meaning that there are far too many unique values to be useful for analysis in the current format
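
To make the combination problem concrete, here is a minimal sketch on made-up rows. It assumes the genres are joined with `", "`; the actual separator in this dataset may differ, so treat the `split` argument as a placeholder.

```python
import pandas as pd

# Hypothetical rows; the real separator in the "genres" column may differ.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": ["Drama, Romance", "Drama", "Comedy, Drama", "Comedy, Drama"],
})

# Each distinct combination counts as its own category.
n_combinations = toy["genres"].nunique()

# Split on the assumed separator, then explode to one row per (film, genre).
single = toy["genres"].str.split(", ").explode()
genre_counts = single.value_counts()

print(n_combinations)
print(genre_counts.to_dict())
```

Three combinations collapse into three single genres here, but on real data the number of combinations grows much faster than the number of genres.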

In [None]:
df2["genres"]

In [None]:
# count of unique genre combinations
len(df2["genres"].unique())

In [None]:
# Top 5 genres (considering combinations) by count
df2["genres"].value_counts(dropna = False).head()

In [None]:
# Bottom 5 genres (considering combinations) by count
df2["genres"].value_counts(dropna = False).tail()

In [None]:
df2["genres"].value_counts(normalize = True) #proportion of all films made up of each particular genre combination

### Create new fields to categorize films by single genres
- Using the genre categories currently combined in the "genres" field, we can test each film for membership in an individual genre using a custom function per genre.  
- These functions return Boolean Series, which will be stored in distinct fields.  
- So rather than only being able to look at each film to see which categories it belongs to, we can now look at each genre and see which films belong to it, even if they also belong to other genres.
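
As an alternative sketch of the same idea, the flags can be built in a loop with `Series.str.contains`. This is not the approach the notebook takes; the toy frame and the shortened `genre_names` list are illustrative only.

```python
import pandas as pd

# Hypothetical films, including one with a missing genres value.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["Drama, Romance", "Science Fiction, Action", None],
})

genre_names = ["Drama", "Romance", "Action", "Science Fiction"]

for name in genre_names:
    # regex=False does a plain substring test, like Python's `in`;
    # na=False flags films with a missing genres value as False.
    toy[name] = toy["genres"].str.contains(name, regex=False, na=False)

print(toy[genre_names])
```

Because `na=False` handles missing values, no separate `notnull()` filter is needed in this variant.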

#### Define custom functions to test for inclusion of each genre in the existing genres field

In [None]:
# Each helper returns True when the genre name appears in the film's genres string.
def has_animation(genres):
    return "Animation" in genres

def has_drama(genres):
    return "Drama" in genres

def has_comedy(genres):
    return "Comedy" in genres

def has_romance(genres):
    return "Romance" in genres

def has_action(genres):
    return "Action" in genres

def has_documentary(genres):
    return "Documentary" in genres

def has_family(genres):
    return "Family" in genres

def has_adventure(genres):
    return "Adventure" in genres

def has_history(genres):
    return "History" in genres

def has_war(genres):
    return "War" in genres

def has_crime(genres):
    return "Crime" in genres

def has_mystery(genres):
    return "Mystery" in genres

def has_thriller(genres):
    return "Thriller" in genres

def has_horror(genres):
    return "Horror" in genres

def has_western(genres):
    return "Western" in genres

def has_fantasy(genres):
    return "Fantasy" in genres

def has_scifi(genres):
    return "Science Fiction" in genres

def has_music(genres):
    return "Music" in genres

#### Define variables to represent boolean series for each genre

In [None]:
animation = df2.loc[df2["genres"].notnull(), "genres"].apply(has_animation)
drama = df2.loc[df2.loc[:, "genres"].notnull(), "genres"].apply(has_drama)
comedy = df2.loc[df2.loc[:, "genres"].notnull(), "genres"].apply(has_comedy)
romance = df2.loc[df2.loc[:, "genres"].notnull(), "genres"].apply(has_romance)
action = df2.loc[df2.loc[:, "genres"].notnull(), "genres"].apply(has_action)
documentary = df2.loc[df2.loc[:, "genres"].notnull(), "genres"].apply(has_documentary)
family = df2.loc[df2.loc[:, "genres"].notnull(), "genres"].apply(has_family)
adventure = df2.loc[df2.loc[:, "genres"].notnull(), "genres"].apply(has_adventure)
history = df2.loc[df2.loc[:, "genres"].notnull(), "genres"].apply(has_history)
war = df2.loc[df2.loc[:, "genres"].notnull(), "genres"].apply(has_war)
crime = df2.loc[df2.loc[:, "genres"].notnull(), "genres"].apply(has_crime)
mystery = df2.loc[df2.loc[:, "genres"].notnull(), "genres"].apply(has_mystery)
thriller = df2.loc[df2.loc[:, "genres"].notnull(), "genres"].apply(has_thriller)
horror = df2.loc[df2.loc[:, "genres"].notnull(), "genres"].apply(has_horror)
western = df2.loc[df2.loc[:, "genres"].notnull(), "genres"].apply(has_western)
fantasy = df2.loc[df2.loc[:, "genres"].notnull(), "genres"].apply(has_fantasy)
scifi = df2.loc[df2.loc[:, "genres"].notnull(), "genres"].apply(has_scifi)
music = df2.loc[df2.loc[:, "genres"].notnull(), "genres"].apply(has_music)

#### Write boolean series to new dataframe field

In [None]:
df2["Animation"] = animation
df2["Drama"] = drama
df2["Comedy"] = comedy
df2["Romance"] = romance
df2["Action"] = action
df2["Documentary"] = documentary
df2["Family"] = family
df2["Adventure"] = adventure
df2["History"] = history
df2["War"] = war
df2["Crime"] = crime
df2["Mystery"] = mystery
df2["Thriller"] = thriller
df2["Horror"] = horror
df2["Western"] = western
df2["Fantasy"] = fantasy
df2["Science Fiction"] = scifi
df2["Music"] = music

#### Create new calculated fields for profit and ratio

In [None]:
df2.insert(4, "profit", value = 0)

In [None]:
df2["profit"] = df2["revenue"] - df2["budget"]

In [None]:
df2.insert(5, "ratio", value = 0)

In [None]:
revenue = df2.loc[df2["revenue"].notnull(), "revenue"]
budget = df2.loc[df2["budget"].notnull(), "budget"]
ratio = revenue / budget

In [None]:
df2.loc[:, "ratio"] = ratio
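
The two calculated fields can be checked on a small made-up frame. This is a sketch with hypothetical numbers, not data from the notebook; it also shows what happens to the ratio if it is later cast to an integer.

```python
import pandas as pd

# Hypothetical budgets and revenues (same units).
toy = pd.DataFrame({
    "budget": [10.0, 50.0, 200.0],
    "revenue": [5.0, 100.0, 800.0],
})

# Element-wise arithmetic aligns rows by index label.
toy["profit"] = toy["revenue"] - toy["budget"]
toy["ratio"] = toy["revenue"] / toy["budget"]

# Casting to an integer truncates toward zero, so a ratio of 0.5 becomes 0.
truncated = toy["ratio"].astype("int64")

print(toy)
print(truncated.tolist())
```

A ratio below 1 marks a film that earned less than its budget, and that information is lost once the value is truncated.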

### Convert data types
- New True/False columns converted to boolean (note memory savings from before to after)
- Change release date to datetime for use in later analysis
- Change budget, revenue, profit, and runtime fields to integer datatype (with 64-bit integers this matches the 8 bytes per value of float64, so the gain is cleaner display rather than memory)
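
The memory effect of these conversions can be measured directly with `Series.memory_usage`. The sketch below uses synthetic columns rather than the notebook's dataframe, and assumes the integer target is `int64`; a narrower integer type would behave differently.

```python
import numpy as np
import pandas as pd

n = 10_000

# Flags stored as generic Python objects versus a true boolean column.
flags_object = pd.Series([True, False] * (n // 2), dtype="object")
flags_bool = flags_object.astype("bool")

# The same numeric values stored as float64 and as int64.
amounts_float = pd.Series(np.arange(n, dtype="float64"))
amounts_int = amounts_float.astype("int64")

# deep=True also counts the Python objects referenced by an object column.
bytes_object = flags_object.memory_usage(index=False, deep=True)
bytes_bool = flags_bool.memory_usage(index=False, deep=True)
bytes_float = amounts_float.memory_usage(index=False)
bytes_int = amounts_int.memory_usage(index=False)

print(bytes_object, bytes_bool, bytes_float, bytes_int)
```

On this toy data the boolean column takes one byte per value, far less than the object column, while the float64 and int64 columns are the same size.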

In [None]:
df2.info()  # info() prints its report and returns None, so there is nothing to assign

In [None]:
boolean_columns = ["Animation", "Drama", "Comedy", "Romance", "Action", "Documentary", 
     "Family", "Adventure", "History", "War", "Crime", "Mystery", "Thriller", "Horror",
    "Western", "Fantasy", "Science Fiction", "Music"]

# fill missing flags with False first: NaN would otherwise convert to True
df2[boolean_columns] = df2[boolean_columns].fillna(False).astype("bool")

In [None]:
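df2["release_date"] = pd.to_datetime(df2["release_date"])  # unit-less astype("datetime64") fails on pandas 2.x
df2["budget"] = df2["budget"].astype("int64")  # explicit 64-bit: revenues above ~2.1 billion overflow int32
df2["revenue"] = df2["revenue"].astype("int64")
df2["profit"] = df2["profit"].astype("int64")
df2["runtime"] = df2["runtime"].astype("int64")
# ratio is left as float: casting to int would truncate, e.g. 0.8 -> 0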

In [None]:
df2.info()  # compare the dtypes and memory usage with the report printed before conversion

## Univariate Analysis
### Describe individual fields by basic statistics
#### Budget, Revenue, and Runtime statistics

In [None]:
df2[budget_revenue_runtime].describe().apply(lambda s: s.apply('{:,.0f}'.format))

#### Profit statistics

In [None]:
df2["profit"].describe().astype("int")

#### Ratio statistics

In [None]:
df2["ratio"].describe().astype("int")

### Analysis of Genre Proportions

In [None]:
plt.style.use("default")
df2["Animation"].value_counts().plot(kind = "pie", labels = ["Not Animated", "Animated"], 
title = "Proportion of Animated Movies to Total")
plt.legend(labels= ["Not Animated", "Animated"],loc="best")
plt.ylabel("")

In [None]:
df2["Animation"].value_counts()

In [None]:
plt.style.use("default")
df2["Drama"].value_counts().plot(kind = "pie", labels = ["Non-Drama", "Drama"], 
title = "Proportion of Drama Movies to Total")
plt.legend(labels= ["Non-Drama", "Drama"],loc="best")
plt.ylabel("")

In [None]:
df2["Drama"].value_counts()

In [None]:
plt.style.use("default")
df2["Comedy"].value_counts().plot(kind = "pie", labels = ["Non-Comedy", "Comedy"], 
title = "Proportion of Comedy Movies to Total")
plt.legend(labels= ["Non-Comedy", "Comedy"],loc="best")
plt.ylabel("")

In [None]:
plt.style.use("default")
df2["Romance"].value_counts().plot(kind = "pie", labels = ["Non-Romance", "Romance"], 
title = "Proportion of Romance Movies to Total")
plt.legend(labels= ["Non-Romance", "Romance"],loc="best")
plt.ylabel("")

In [None]:
plt.style.use("default")
df2["Action"].value_counts().plot(kind = "pie", labels = ["Non-Action", "Action"], 
title = "Proportion of Action Movies to Total")
plt.legend(labels= ["Non-Action", "Action"],loc="best")
plt.ylabel("")

In [None]:
plt.style.use("default")
df2["Documentary"].value_counts().plot(kind = "pie", labels = ["Non-Documentary", "Documentary"], 
title = "Proportion of Documentary Movies to Total")
plt.legend(labels= ["Non-Documentary", "Documentary"],loc="best")
plt.ylabel("")

In [None]:
plt.style.use("default")
df2["Family"].value_counts().plot(kind = "pie", labels = ["Non-Family", "Family"], 
title = "Proportion of Family Movies to Total")
plt.legend(labels= ["Non-Family", "Family"],loc="best")
plt.ylabel("")

In [None]:
plt.style.use("default")
df2["Adventure"].value_counts().plot(kind = "pie", labels = ["Non-Adventure", "Adventure"], 
title = "Proportion of Adventure Movies to Total")
plt.legend(labels= ["Non-Adventure", "Adventure"],loc="best")
plt.ylabel("")

In [None]:
plt.style.use("default")
df2["History"].value_counts().plot(kind = "pie", labels = ["Non-History", "History"], 
title = "Proportion of History Movies to Total")
plt.legend(labels= ["Non-History", "History"],loc="best")
plt.ylabel("")

In [None]:
plt.style.use("default")
df2["War"].value_counts().plot(kind = "pie", labels = ["Non-War", "War"], 
title = "Proportion of War Movies to Total")
plt.legend(labels= ["Non-War", "War"],loc="best")
plt.ylabel("")

In [None]:
plt.style.use("default")
df2["Crime"].value_counts().plot(kind = "pie", labels = ["Non-Crime", "Crime"], 
title = "Proportion of Crime Movies to Total")
plt.legend(labels= ["Non-Crime", "Crime"],loc="best")
plt.ylabel("")

In [None]:
plt.style.use("default")
df2["Mystery"].value_counts().plot(kind = "pie", labels = ["Non-Mystery", "Mystery"], 
title = "Proportion of Mystery Movies to Total")
plt.legend(labels= ["Non-Mystery", "Mystery"],loc="best")
plt.ylabel("")

In [None]:
plt.style.use("default")
df2["Thriller"].value_counts().plot(kind = "pie", labels = ["Non-Thriller", "Thriller"], 
title = "Proportion of Thriller Movies to Total")
plt.legend(labels= ["Non-Thriller", "Thriller"],loc="best")
plt.ylabel("")

In [None]:
plt.style.use("default")
df2["Horror"].value_counts().plot(kind = "pie", labels = ["Non-Horror", "Horror"], 
title = "Proportion of Horror Movies to Total")
plt.legend(labels= ["Non-Horror", "Horror"],loc="best")
plt.ylabel("")

In [None]:
plt.style.use("default")
df2["Western"].value_counts().plot(kind = "pie", labels = ["Non-Western", "Western"], 
title = "Proportion of Western Movies to Total")
plt.legend(labels= ["Non-Western", "Western"],loc="best")
plt.ylabel("")

In [None]:
plt.style.use("default")
df2["Fantasy"].value_counts().plot(kind = "pie", labels = ["Non-Fantasy", "Fantasy"], 
title = "Proportion of Fantasy Movies to Total")
plt.legend(labels= ["Non-Fantasy", "Fantasy"],loc="best")
plt.ylabel("")

In [None]:
plt.style.use("default")
df2["Science Fiction"].value_counts().plot(kind = "pie", labels = ["Non-Science Fiction", "Science Fiction"], 
title = "Proportion of Science Fiction Movies to Total")
plt.legend(labels= ["Non-Science Fiction", "Science Fiction"],loc="best")
plt.ylabel("")

In [None]:
plt.style.use("default")
df2["Music"].value_counts().plot(kind = "pie", labels = ["Non-Music", "Music"], 
title = "Proportion of Music Movies to Total")
plt.legend(labels= ["Non-Music", "Music"],loc="best")
plt.ylabel("")
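
One caveat with the pie charts above: `value_counts()` sorts by frequency, so the hard-coded label order is only correct when the `False` group is the larger one. A minimal sketch of a helper that fixes the order follows; `labeled_genre_counts` is a hypothetical name, not something defined elsewhere in this notebook.

```python
import pandas as pd

def labeled_genre_counts(flags: pd.Series, genre: str) -> pd.Series:
    """Return counts indexed as ["Non-<genre>", "<genre>"], always in that order."""
    # reindex pins the order to False, True regardless of which is more frequent.
    counts = flags.value_counts().reindex([False, True], fill_value=0)
    counts.index = [f"Non-{genre}", genre]
    return counts

# Hypothetical flags where True is the majority, so value_counts() lists True first.
toy_flags = pd.Series([True, True, True, False])
counts = labeled_genre_counts(toy_flags, "Drama")

print(counts.to_dict())
```

The returned Series carries its own labels, so a call such as `labeled_genre_counts(df2["Drama"], "Drama").plot(kind="pie")` would not need a separate `labels` argument.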

## Bivariate Analysis
### Create distinct dataframes for each genre
#### Each dataframe will contain all films which bear the corresponding genre label, even if they also bear other genre labels

In [None]:
Animation_df = df2.loc[df2["Animation"] == True]
Drama_df = df2.loc[df2["Drama"] == True]
Comedy_df = df2.loc[df2["Comedy"] == True]
Romance_df = df2.loc[df2["Romance"] == True]
Action_df = df2.loc[df2["Action"] == True]
Documentary_df = df2.loc[df2["Documentary"] == True]
Family_df = df2.loc[df2["Family"] == True]
Adventure_df = df2.loc[df2["Adventure"] == True]
History_df = df2.loc[df2["History"] == True]
War_df = df2.loc[df2["War"] == True]
Crime_df = df2.loc[df2["Crime"] == True]
Mystery_df = df2.loc[df2["Mystery"] == True]
Thriller_df = df2.loc[df2["Thriller"] == True]
Horror_df = df2.loc[df2["Horror"] == True]
Western_df = df2.loc[df2["Western"] == True]
Fantasy_df = df2.loc[df2["Fantasy"] == True]
Scifi_df = df2.loc[df2["Science Fiction"] == True]
Music_df = df2.loc[df2["Music"] == True]

#### Animation genre statistics

In [None]:
Animation_df.describe().apply(lambda s: s.apply('{:,.0f}'.format))

#### Drama genre statistics

In [None]:
Drama_df.describe().apply(lambda s: s.apply('{:,.0f}'.format))

#### Comedy genre statistics

In [None]:
Comedy_df.describe().apply(lambda s: s.apply('{:,.0f}'.format))

#### Romance genre statistics

In [None]:
Romance_df.describe().apply(lambda s: s.apply('{:,.0f}'.format))

#### Action genre statistics

In [None]:
Action_df.describe().apply(lambda s: s.apply('{:,.0f}'.format))

#### Documentary genre statistics

In [None]:
Documentary_df.describe().apply(lambda s: s.apply('{:,.0f}'.format))

#### Family genre statistics

In [None]:
Family_df.describe().apply(lambda s: s.apply('{:,.0f}'.format))

#### Adventure genre statistics

In [None]:
Adventure_df.describe().apply(lambda s: s.apply('{:,.0f}'.format))

#### History genre statistics

In [None]:
History_df.describe().apply(lambda s: s.apply('{:,.0f}'.format))

#### War genre statistics

In [None]:
War_df.describe().apply(lambda s: s.apply('{:,.0f}'.format))

#### Crime genre statistics

In [None]:
Crime_df.describe().apply(lambda s: s.apply('{:,.0f}'.format))

#### Mystery genre statistics

In [None]:
Mystery_df.describe().apply(lambda s: s.apply('{:,.0f}'.format))

#### Thriller genre statistics

In [None]:
Thriller_df.describe().apply(lambda s: s.apply('{:,.0f}'.format))

#### Horror genre statistics

In [None]:
Horror_df.describe().apply(lambda s: s.apply('{:,.0f}'.format))

#### Western genre statistics

In [None]:
Western_df.describe().apply(lambda s: s.apply('{:,.0f}'.format))

#### Fantasy genre statistics

In [None]:
Fantasy_df.describe().apply(lambda s: s.apply('{:,.0f}'.format))

#### Science Fiction genre statistics

In [None]:
Scifi_df.describe().apply(lambda s: s.apply('{:,.0f}'.format))

#### Music genre statistics

In [None]:
Music_df.describe().apply(lambda s: s.apply('{:,.0f}'.format))

### Budget/Profit Scatterplots by Genre

In [None]:
sns.lmplot(x = "budget", y = "profit", data = Animation_df, fit_reg = True).fig.suptitle(
    "Budget/Profit Plot for Animated Films", fontsize = 10)

In [None]:
sns.lmplot(x = "budget", y = "profit", data = Drama_df, fit_reg = True).fig.suptitle(
    "Budget/Profit Plot for Drama Films", fontsize = 10)

In [None]:
sns.lmplot(x = "budget", y = "profit", data = Comedy_df, fit_reg = True).fig.suptitle(
    "Budget/Profit Plot for Comedy Films", fontsize = 10)

In [None]:
sns.lmplot(x = "budget", y = "profit", data = Romance_df, fit_reg = True).fig.suptitle(
    "Budget/Profit Plot for Romance Films", fontsize = 10)

In [None]:
sns.lmplot(x = "budget", y = "profit", data = Action_df, fit_reg = True).fig.suptitle(
    "Budget/Profit Plot for Action Films", fontsize = 10)

In [None]:
sns.lmplot(x = "budget", y = "profit", data = Documentary_df, fit_reg = True).fig.suptitle(
    "Budget/Profit Plot for Documentary Films", fontsize = 10)

In [None]:
sns.lmplot(x = "budget", y = "profit", data = Family_df, fit_reg = True).fig.suptitle(
    "Budget/Profit Plot for Family Films", fontsize = 10)

In [None]:
sns.lmplot(x = "budget", y = "profit", data = Adventure_df, fit_reg = True).fig.suptitle(
    "Budget/Profit Plot for Adventure Films", fontsize = 10)

In [None]:
sns.lmplot(x = "budget", y = "profit", data = History_df, fit_reg = True).fig.suptitle(
    "Budget/Profit Plot for History Films", fontsize = 10)

In [None]:
sns.lmplot(x = "budget", y = "profit", data = War_df, fit_reg = True).fig.suptitle(
    "Budget/Profit Plot for War Films", fontsize = 10)

In [None]:
sns.lmplot(x = "budget", y = "profit", data = Crime_df, fit_reg = True).fig.suptitle(
    "Budget/Profit Plot for Crime Films", fontsize = 10)

In [None]:
sns.lmplot(x = "budget", y = "profit", data = Mystery_df, fit_reg = True).fig.suptitle(
    "Budget/Profit Plot for Mystery Films", fontsize = 10)

In [None]:
sns.lmplot(x = "budget", y = "profit", data = Thriller_df, fit_reg = True).fig.suptitle(
    "Budget/Profit Plot for Thriller Films", fontsize = 10)

In [None]:
sns.lmplot(x = "budget", y = "profit", data = Horror_df, fit_reg = True).fig.suptitle(
    "Budget/Profit Plot for Horror Films", fontsize = 10)

In [None]:
sns.lmplot(x = "budget", y = "profit", data = Western_df, fit_reg = True).fig.suptitle(
    "Budget/Profit Plot for Western Films", fontsize = 10)

In [None]:
sns.lmplot(x = "budget", y = "profit", data = Fantasy_df, fit_reg = True).fig.suptitle(
    "Budget/Profit Plot for Fantasy Films", fontsize = 10)

In [None]:
sns.lmplot(x = "budget", y = "profit", data = Scifi_df, fit_reg = True).fig.suptitle(
    "Budget/Profit Plot for Science Fiction Films", fontsize = 10)

In [None]:
sns.lmplot(x = "budget", y = "profit", data = Music_df, fit_reg = True).fig.suptitle(
    "Budget/Profit Plot for Music Films", fontsize = 10)

### Analysis of Budget by Genre

In [None]:
Animation_avg_budget = df2.loc[df2["Animation"], "budget"].agg("mean")
Drama_avg_budget = df2.loc[df2["Drama"], "budget"].agg("mean")
Comedy_avg_budget = df2.loc[df2["Comedy"], "budget"].agg("mean")
Romance_avg_budget = df2.loc[df2["Romance"], "budget"].agg("mean")
Action_avg_budget = df2.loc[df2["Action"], "budget"].agg("mean")
Documentary_avg_budget = df2.loc[df2["Documentary"], "budget"].agg("mean")
Family_avg_budget = df2.loc[df2["Family"], "budget"].agg("mean")
Adventure_avg_budget = df2.loc[df2["Adventure"], "budget"].agg("mean")
History_avg_budget = df2.loc[df2["History"], "budget"].agg("mean")
War_avg_budget = df2.loc[df2["War"], "budget"].agg("mean")
Crime_avg_budget = df2.loc[df2["Crime"], "budget"].agg("mean")
Mystery_avg_budget = df2.loc[df2["Mystery"], "budget"].agg("mean")
Thriller_avg_budget = df2.loc[df2["Thriller"], "budget"].agg("mean")
Horror_avg_budget = df2.loc[df2["Horror"], "budget"].agg("mean")
Western_avg_budget = df2.loc[df2["Western"], "budget"].agg("mean")
Fantasy_avg_budget = df2.loc[df2["Fantasy"], "budget"].agg("mean")
Scifi_avg_budget = df2.loc[df2["Science Fiction"], "budget"].agg("mean")
Music_avg_budget = df2.loc[df2["Music"], "budget"].agg("mean")

In [None]:
genre_budget_dict = {
    "Budget": [Animation_avg_budget, Drama_avg_budget, Comedy_avg_budget, Romance_avg_budget, Action_avg_budget,
              Documentary_avg_budget, Family_avg_budget, Adventure_avg_budget, History_avg_budget, War_avg_budget, 
               Crime_avg_budget, Mystery_avg_budget, Thriller_avg_budget, Horror_avg_budget, Western_avg_budget,
              Fantasy_avg_budget, Scifi_avg_budget, Music_avg_budget],
     "Genre": ["Animation", "Drama", "Comedy", "Romance", "Action", "Documentary", "Family", "Adventure", "History",
             "War", "Crime", "Mystery", "Thriller", "Horror", "Western", "Fantasy", "Science Fiction", "Music"]
}

In [None]:
genre_budget_df = pd.DataFrame(genre_budget_dict)

In [None]:
genre_budget_df.plot(x = "Genre", y = "Budget", kind = "barh", title = "Average budget by genre")
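
The per-genre aggregation follows the same pattern for every genre and every metric, so it can also be written once as a small function. The sketch below is an alternative formulation on toy data; `summarize_by_genre` is a hypothetical helper, and the frame assumes Boolean genre columns like those created earlier.

```python
import pandas as pd

def summarize_by_genre(df: pd.DataFrame, genres: list, column: str, how: str = "mean") -> pd.Series:
    """Aggregate `column` over the films flagged True for each genre."""
    return pd.Series({g: df.loc[df[g], column].agg(how) for g in genres}, name=column)

# Hypothetical films; a film can belong to more than one genre.
toy = pd.DataFrame({
    "budget": [10.0, 30.0, 50.0, 70.0],
    "Drama": [True, True, False, False],
    "Comedy": [False, True, True, True],
})

avg_budget = summarize_by_genre(toy, ["Drama", "Comedy"], "budget")
median_budget = summarize_by_genre(toy, ["Drama", "Comedy"], "budget", how="median")

print(avg_budget.to_dict())
print(median_budget.to_dict())
```

The result is a Series indexed by genre, which can be plotted directly with `.plot(kind="barh")`. Passing `how="median"` covers the ratio analysis with the same function.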

### Analysis of Revenue by Genre

In [None]:
Animation_avg_revenue = df2.loc[df2["Animation"], "revenue"].agg("mean")
Drama_avg_revenue = df2.loc[df2["Drama"], "revenue"].agg("mean")
Comedy_avg_revenue = df2.loc[df2["Comedy"], "revenue"].agg("mean")
Romance_avg_revenue = df2.loc[df2["Romance"], "revenue"].agg("mean")
Action_avg_revenue = df2.loc[df2["Action"], "revenue"].agg("mean")
Documentary_avg_revenue = df2.loc[df2["Documentary"], "revenue"].agg("mean")
Family_avg_revenue = df2.loc[df2["Family"], "revenue"].agg("mean")
Adventure_avg_revenue = df2.loc[df2["Adventure"], "revenue"].agg("mean")
History_avg_revenue = df2.loc[df2["History"], "revenue"].agg("mean")
War_avg_revenue = df2.loc[df2["War"], "revenue"].agg("mean")
Crime_avg_revenue = df2.loc[df2["Crime"], "revenue"].agg("mean")
Mystery_avg_revenue = df2.loc[df2["Mystery"], "revenue"].agg("mean")
Thriller_avg_revenue = df2.loc[df2["Thriller"], "revenue"].agg("mean")
Horror_avg_revenue = df2.loc[df2["Horror"], "revenue"].agg("mean")
Western_avg_revenue = df2.loc[df2["Western"], "revenue"].agg("mean")
Fantasy_avg_revenue = df2.loc[df2["Fantasy"], "revenue"].agg("mean")
Scifi_avg_revenue = df2.loc[df2["Science Fiction"], "revenue"].agg("mean")
Music_avg_revenue = df2.loc[df2["Music"], "revenue"].agg("mean")

In [None]:
genre_revenue_dict = {
    "Revenue": [Animation_avg_revenue, Drama_avg_revenue, Comedy_avg_revenue, Romance_avg_revenue, Action_avg_revenue,
               Documentary_avg_revenue, Family_avg_revenue, Adventure_avg_revenue, History_avg_revenue, War_avg_revenue,
               Crime_avg_revenue, Mystery_avg_revenue, Thriller_avg_revenue, Horror_avg_revenue, Western_avg_revenue,
               Fantasy_avg_revenue, Scifi_avg_revenue, Music_avg_revenue],
     "Genre": ["Animation", "Drama", "Comedy", "Romance", "Action", "Documentary", "Family", "Adventure", "History",
             "War", "Crime", "Mystery", "Thriller", "Horror", "Western", "Fantasy", "Science Fiction", "Music"]
}

In [None]:
genre_revenue_df = pd.DataFrame(genre_revenue_dict)

In [None]:
genre_revenue_df.plot(x = "Genre", y = "Revenue", kind = "barh", title = "Average revenue by genre")

### Analysis of Ratio by Genre

In [None]:
Animation_median_ratio = df2.loc[df2["Animation"], "ratio"].agg("median")
Drama_median_ratio = df2.loc[df2["Drama"], "ratio"].agg("median")
Comedy_median_ratio = df2.loc[df2["Comedy"], "ratio"].agg("median")
Romance_median_ratio = df2.loc[df2["Romance"], "ratio"].agg("median")
Action_median_ratio = df2.loc[df2["Action"], "ratio"].agg("median")
Documentary_median_ratio = df2.loc[df2["Documentary"], "ratio"].agg("median")
Family_median_ratio = df2.loc[df2["Family"], "ratio"].agg("median")
Adventure_median_ratio = df2.loc[df2["Adventure"], "ratio"].agg("median")
History_median_ratio = df2.loc[df2["History"], "ratio"].agg("median")
War_median_ratio = df2.loc[df2["War"], "ratio"].agg("median")
Crime_median_ratio = df2.loc[df2["Crime"], "ratio"].agg("median")
Mystery_median_ratio = df2.loc[df2["Mystery"], "ratio"].agg("median")
Thriller_median_ratio = df2.loc[df2["Thriller"], "ratio"].agg("median")
Horror_median_ratio = df2.loc[df2["Horror"], "ratio"].agg("median")
Western_median_ratio = df2.loc[df2["Western"], "ratio"].agg("median")
Fantasy_median_ratio = df2.loc[df2["Fantasy"], "ratio"].agg("median")
Scifi_median_ratio = df2.loc[df2["Science Fiction"], "ratio"].agg("median")
Music_median_ratio = df2.loc[df2["Music"], "ratio"].agg("median")

In [None]:
genre_ratio_dict = {
    "Ratio": [Animation_median_ratio, Drama_median_ratio, Comedy_median_ratio, Romance_median_ratio, Action_median_ratio,
             Documentary_median_ratio, Family_median_ratio, Adventure_median_ratio, History_median_ratio, War_median_ratio,
             Crime_median_ratio, Mystery_median_ratio, Thriller_median_ratio, Horror_median_ratio,
             Western_median_ratio, Fantasy_median_ratio, Scifi_median_ratio, Music_median_ratio],
    "Genre": ["Animation", "Drama", "Comedy", "Romance", "Action", "Documentary", "Family", "Adventure", "History",
             "War", "Crime", "Mystery", "Thriller", "Horror", "Western", "Fantasy", "Science Fiction", "Music"]
}

In [None]:
genre_ratio_df = pd.DataFrame(genre_ratio_dict)

In [None]:
genre_ratio_df.plot(x = "Genre", y = "Ratio", kind = "barh", title = "Median ratio by genre")

### Analysis of Runtime by Genre

In [None]:
Animation_avg_run = df2.loc[df2["Animation"], "runtime"].agg("mean")
Drama_avg_run = df2.loc[df2["Drama"], "runtime"].agg("mean")
Comedy_avg_run = df2.loc[df2["Comedy"], "runtime"].agg("mean")
Romance_avg_run = df2.loc[df2["Romance"], "runtime"].agg("mean")
Action_avg_run = df2.loc[df2["Action"], "runtime"].agg("mean")
Documentary_avg_run = df2.loc[df2["Documentary"], "runtime"].agg("mean")
Family_avg_run = df2.loc[df2["Family"], "runtime"].agg("mean")
Adventure_avg_run = df2.loc[df2["Adventure"], "runtime"].agg("mean")
History_avg_run = df2.loc[df2["History"], "runtime"].agg("mean")
War_avg_run = df2.loc[df2["War"], "runtime"].agg("mean")
Crime_avg_run = df2.loc[df2["Crime"], "runtime"].agg("mean")
Mystery_avg_run = df2.loc[df2["Mystery"], "runtime"].agg("mean")
Thriller_avg_run = df2.loc[df2["Thriller"], "runtime"].agg("mean")
Horror_avg_run = df2.loc[df2["Horror"], "runtime"].agg("mean")
Western_avg_run = df2.loc[df2["Western"], "runtime"].agg("mean")
Fantasy_avg_run = df2.loc[df2["Fantasy"], "runtime"].agg("mean")
Scifi_avg_run = df2.loc[df2["Science Fiction"], "runtime"].agg("mean")
Music_avg_run = df2.loc[df2["Music"], "runtime"].agg("mean")

In [None]:
genre_run_dict = {
    "Runtime": [Animation_avg_run, Drama_avg_run, Comedy_avg_run, Romance_avg_run, Action_avg_run, Documentary_avg_run,
                Family_avg_run, Adventure_avg_run, History_avg_run, War_avg_run, Crime_avg_run, Mystery_avg_run,
                Thriller_avg_run, Horror_avg_run, Western_avg_run, Fantasy_avg_run, Scifi_avg_run, Music_avg_run],
    "Genre": ["Animation", "Drama", "Comedy", "Romance", "Action", "Documentary", "Family", "Adventure", "History",
             "War", "Crime", "Mystery", "Thriller", "Horror", "Western", "Fantasy", "Science Fiction", "Music"]
}

In [None]:
genre_run_df = pd.DataFrame(genre_run_dict)

In [None]:
genre_run_df.plot(x = "Genre", y = "Runtime", kind = "barh", legend = False, title = "Average runtime by genre")

### Export Dataframe

In [None]:
df2.to_json("Movies_EDA.json")
df2.to_csv("Movies_EDA.csv")
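
When the CSV is read back, dtypes need some care because CSV stores plain text. The sketch below round-trips a toy frame through an in-memory buffer instead of a file; the column names mirror the notebook's, but the rows are made up.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B"],
    "release_date": pd.to_datetime(["1999-03-31", "2010-07-16"]),
    "Drama": [True, False],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # index=False avoids writing an extra unnamed column
buffer.seek(0)

# Dates come back as strings unless they are parsed explicitly.
restored = pd.read_csv(buffer, parse_dates=["release_date"])

print(restored.dtypes)
```

If the index is written (the default, as in the export cell above), `pd.read_csv(..., index_col=0)` restores it on load.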