# Movie Budget Data Analysis Project

## In this project we will be answering the following question:
How do a movie's directors and producers, cast, production budget, and marketing affect its success, both before and after it is screened?

## We have hypothesized the following:
Movies with better production, socially aware keywords in their marketing, and a higher budget will generally be more successful. However, we acknowledge that some lower-budget movies invest more in their actors and play to their genre; these can stand as outliers and will need further consideration.

## Background
Past releases suggest that a big budget does not guarantee success at the box office. Movies can spend heavily on big names such as The Rock and on special effects, but in the long run movies with a better plot have a better chance of making more money at the box office. Indie movies like Get Out, which have low budgets and no A-list actors, have done better than bigger movies released on the same weekend. To check this, we have looked at a couple of data science projects that examine the profitability of movies, for example: https://www.kaggle.com/param1/the-money-makers


![alt text](img.jpg "Title")

## **TODO**
1. Load data into pandas 
2. Merge the two datasets on common identifiers
3. Read nested JSON and clean data  
4. Remove noisy/incomplete data
5. Get rudimentary statistics for the dataset
6. Explore basic categories and their direct impact on budgets -> success
    - Visualizations
7. Venture into more latent features
    - Visualizations

In [1]:
%matplotlib inline

import ast
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [122]:
# import csv files
credits = pd.read_csv('data/credits.csv')
keywords = pd.read_csv('data/keywords.csv')
movies_metadata = pd.read_csv('data/movies_metadata.csv')
ratings = pd.read_csv('data/ratings.csv')

tmdb_movies = pd.read_csv('data/tmdb_5000_movies.csv')
credits_1 = pd.read_csv('data/tmdb_5000_credits.csv')



Both datasets we use contain nested JSON data, stored as strings, that needs to be parsed and flattened.
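As a minimal sketch of what flattening means here, the snippet below parses one stringified list of dictionaries (the `production_companies` value shown for Toy Story in the merged preview) and keeps only the `name` fields. The helper `extract_names` is a hypothetical name used for illustration and is not part of the notebook.

```python
import ast

import numpy as np
import pandas as pd


def extract_names(cell):
    """Parse a stringified list of dicts and return the list of 'name' values."""
    if not isinstance(cell, str):
        return np.nan  # missing values stay missing
    records = ast.literal_eval(cell)
    return [record["name"] for record in records]


# One real-looking cell plus one missing value
raw = pd.Series([
    "[{'name': 'Pixar Animation Studios', 'id': 3}]",
    np.nan,
])
names = raw.apply(extract_names)
print(names.tolist())
```

`ast.literal_eval` is used rather than `json.loads` because the cells use single-quoted Python literals, which are not valid JSON.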

In [123]:
COLS = ["id", "cast", "crew", "keywords", "genres", "production_companies", "production_countries", "belongs_to_collection"]

# Drop rows with nan values
credits = credits.dropna(subset=['id', 'cast', 'crew'])
keywords = keywords.dropna(subset=['id', 'keywords'])
movies_metadata = movies_metadata.dropna(subset=['id', 'genres', 'production_companies', 'production_countries'])

In [124]:
# Flatten dataframes 
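def json2dict(jstr):
    # Missing values arrive as float NaN rather than str; keep them as NaN.
    # (Comparing with `== np.nan` never matches, because NaN != NaN.)
    if not isinstance(jstr, str):
        return np.nan
    # The cells are Python-literal strings (single quotes), so parse with literal_eval
    return ast.literal_eval(jstr)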

def checkBudget(budget):
    if str(budget).isdigit():
        return float(budget)
    else:
        return np.nan
    
def checkId(id):
    if str(id).isdigit():
        return int(id)
    else:
        return np.nan

def checkKeywords(keywords):
    if keywords:
        return keywords
    else:
        return np.nan
    
credits.cast = credits.cast.apply(json2dict)
credits.crew = credits.crew.apply(json2dict)
credits.id = credits.id.apply(checkId)

keywords.keywords = keywords.keywords.apply(json2dict)
keywords.keywords = keywords.keywords.apply(checkKeywords)
keywords.id = keywords.id.apply(checkId)
keywords.dropna(inplace=True)

movies_metadata.id = movies_metadata.id.apply(checkId)
movies_metadata.budget = movies_metadata.budget.apply(checkBudget)
movies_metadata.dropna(subset=['id'], inplace=True)
movies_metadata.id = movies_metadata.id.astype(int)
movies_metadata.genres = movies_metadata.genres.apply(json2dict)

movies_metadata.production_companies = movies_metadata.production_companies.apply(json2dict)
movies_metadata.production_countries = movies_metadata.production_countries.apply(json2dict)
movies_metadata.belongs_to_collection = movies_metadata.belongs_to_collection.apply(json2dict)

# Drop columns that are not needed for the analysis (rows without an id were already dropped above)
movies_metadata.drop(columns=['homepage', 'status', 'video', 'poster_path', 'original_title'], inplace=True)
credits.dropna(subset=['id'], inplace=True)
keywords.dropna(subset=['id'], inplace=True)

In [131]:
from functools import reduce
# Merge dataframes on the shared 'id' column (inner join: only ids present in all three are kept)
dfs = [credits, keywords, movies_metadata]
df = reduce(lambda left, right: pd.merge(left, right, on='id'), dfs)
df.head()

Unnamed: 0,cast,crew,id,keywords,adult,belongs_to_collection,budget,genres,imdb_id,original_language,...,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,tagline,title,vote_average,vote_count
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...",False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000.0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",tt0114709,en,...,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",,Toy Story,7.7,5415.0
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1...",False,,65000000.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",tt0113497,en,...,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392...",False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0.0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",tt0113228,en,...,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0
3,"[{'cast_id': 1, 'character': 'Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357,"[{'id': 818, 'name': 'based on novel'}, {'id':...",False,,16000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",tt0114885,en,...,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...",False,"{'id': 96871, 'name': 'Father of the Bride Col...",0.0,"[{'id': 35, 'name': 'Comedy'}]",tt0113041,en,...,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0
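
One thing worth checking after a merge like this is whether any `id` appears more than once in an input frame, because repeated keys multiply rows in the result. The sketch below uses small made-up frames standing in for `credits` and `keywords`; whether the real files contain repeated ids is not verified here, so treat this as an assumption to test rather than a known problem.

```python
import pandas as pd

# Toy frames standing in for `credits` and `keywords`; all values are made up.
left = pd.DataFrame({"id": [1, 2, 2], "cast": ["a", "b", "b"]})
right = pd.DataFrame({"id": [1, 2, 3], "keywords": ["x", "y", "z"]})

# Drop repeated ids first so each movie contributes a single row.
left_unique = left.drop_duplicates(subset="id")

# validate="one_to_one" makes pandas raise an error if either side still has duplicate keys.
merged = pd.merge(left_unique, right, on="id", how="inner", validate="one_to_one")
print(merged)
```

With an inner join, id 3 is dropped because it has no match on the left side, which mirrors how movies missing from any one file disappear from the merged frame.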


In [3]:
# Basic Statistics from final dataset
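
A possible starting point for this step, sketched on a toy frame rather than the full `df`: the budget and revenue figures are copied from the first four rows of the preview above, and the column name `revenue_ratio` is a hypothetical choice. The sketch assumes that a budget or revenue of 0 means "not reported" rather than a true zero.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged `df`, using values from the preview rows
toy = pd.DataFrame({
    "title": ["Toy Story", "Jumanji", "Grumpier Old Men", "Waiting to Exhale"],
    "budget": [30000000.0, 65000000.0, 0.0, 16000000.0],
    "revenue": [373554033.0, 262797249.0, 0.0, 81452156.0],
})

# Assumption: 0 means "not reported", so treat it as missing and drop those rows.
cleaned = toy.assign(
    budget=toy["budget"].replace(0, np.nan),
    revenue=toy["revenue"].replace(0, np.nan),
).dropna(subset=["budget", "revenue"])

# Revenue earned per dollar of budget, as a rough success measure.
cleaned = cleaned.assign(revenue_ratio=cleaned["revenue"] / cleaned["budget"])

# Count, mean, standard deviation, and quartiles for the numeric columns
summary = cleaned[["budget", "revenue", "revenue_ratio"]].describe()
print(summary)
```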

In [None]:
#Visualizations