![example](images/director_shot.jpeg)

# Microsoft Movie Profitability Analysis by Genre

**Author:** Spencer Hadel
***

## Overview

This analysis reviews recent movie data to determine which decisions yield the most profitable results. Microsoft can use these findings as it begins producing its own video content, with a clearer view of which practices are most likely to deliver a strong return on investment and positive customer reviews.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve them.

***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***

## Data Understanding

Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

budgets: release date, production budget, and domestic/worldwide gross

titles: movie names and genres

names: actors, directors, etc. (possibly connected to title_basics via known_for_titles; not yet confirmed)

reviews: ratings, connected to titles through tconst

In [26]:
budgets_df = pd.read_csv('data/zippedData/tn.movie_budgets.csv.gz')
budgets_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [27]:
#budgets_df.sort_values(by='release_date', ascending=False)
#budgets_df.head()

In [28]:
# use release_date to make new columns MONTH and YEAR
#probably also remove years before a certain point.

In [29]:
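# release_date looks like "Dec 18, 2009": first 3 characters are the month, last 4 the year
budgets_df['month'] = budgets_df.release_date.str[:3]
budgets_df['year'] = budgets_df.release_date.str[-4:]
# id and the raw date string are no longer needed
budgets_df.drop(['id', 'release_date'], axis=1, inplace=True)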

In [30]:
budgets_df.head()

Unnamed: 0,movie,production_budget,domestic_gross,worldwide_gross,month,year
0,Avatar,"$425,000,000","$760,507,625","$2,776,345,279",Dec,2009
1,Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875",May,2011
2,Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350",Jun,2019
3,Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963",May,2015
4,Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747",Dec,2017
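
As an alternative to slicing the date string, the month and year could be derived by parsing the date first. The sketch below is only an illustration on a few toy rows copied from the output above; the `sample` and `parsed` names are made up here, and it assumes every date follows the same "Mon D, YYYY" pattern with English month abbreviations.

```python
import pandas as pd

# Toy rows in the same "Mon D, YYYY" format as release_date above.
sample = pd.DataFrame(
    {"release_date": ["Dec 18, 2009", "May 20, 2011", "Jun 7, 2019"]}
)

# Parse the strings once, then read calendar fields off the datetime accessor.
parsed = pd.to_datetime(sample["release_date"], format="%b %d, %Y")
sample["month"] = parsed.dt.month  # integer 1-12, so months sort chronologically
sample["year"] = parsed.dt.year    # integer year, ready for numeric filtering

print(sample)
```

Numeric months and years make it easier to filter out older releases and to order months on a plot, at the cost of losing the short month labels.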


In [31]:
titles_df = pd.read_csv('data/zippedData/imdb.title.basics.csv.gz')
#titles_df.head()

#remove start_year and runtime


In [33]:
titles_df.drop(['start_year', 'runtime_minutes'], axis=1, inplace=True)

In [34]:
titles_df.head()

Unnamed: 0,tconst,primary_title,original_title,genres
0,tt0063540,Sunghursh,Sunghursh,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,"Comedy,Drama,Fantasy"


In [35]:
#new_df = titles_df[titles_df['primary_title'] != titles_df['original_title']]
#new_df

In [36]:
#alt_df = pd.merge(new_df, budgets_df, left_on='original_title', right_on='movie', how = 'inner')
#alt_df.sort_values(by='domestic_gross', ascending=False)

new_df and alt_df check whether there would be more to gain from merging on the original_title column. The end result is a much smaller dataframe, so we will not use this column in this instance.

Therefore we will drop the original_title column and merge the two tables on primary_title to form our main dataset.

In [13]:
titles_df.drop(['original_title'], axis=1, inplace=True)

In [14]:
main_df = pd.merge(titles_df, budgets_df, left_on='primary_title', right_on='movie', how = 'inner')
main_df

Unnamed: 0,tconst,primary_title,genres,movie,production_budget,domestic_gross,worldwide_gross,month,year
0,tt0249516,Foodfight!,"Action,Animation,Comedy",Foodfight!,"$45,000,000",$0,"$73,706",Dec,2012
1,tt0293429,Mortal Kombat,"Action,Adventure,Fantasy",Mortal Kombat,"$20,000,000","$70,433,227","$122,133,227",Aug,1995
2,tt0326592,The Overnight,,The Overnight,"$200,000","$1,109,808","$1,165,996",Jun,2015
3,tt3844362,The Overnight,"Comedy,Mystery",The Overnight,"$200,000","$1,109,808","$1,165,996",Jun,2015
4,tt0337692,On the Road,"Adventure,Drama,Romance",On the Road,"$25,000,000","$720,828","$9,313,302",Mar,2013
...,...,...,...,...,...,...,...,...,...
3810,tt9678962,Fuel,"Documentary,Sport",Fuel,"$2,500,000","$174,255","$174,255",Nov,2008
3811,tt9729206,Diner,Crime,Diner,"$5,000,000","$12,592,907","$12,592,907",Apr,1982
3812,tt9805168,Traitor,"Action,Drama,Romance",Traitor,"$22,000,000","$23,530,831","$27,882,226",Aug,2008
3813,tt9844102,Ray,Crime,Ray,"$40,000,000","$75,305,995","$124,823,094",Oct,2004
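
The output above shows "The Overnight" twice, because two IMDb entries share that title and a title-only merge matches both to the same budget row. A minimal sketch of how such repeats might be spotted is given below. The frames and `tconst` values are hypothetical toy data, and keeping the first match is just one possible policy, not a choice made in this notebook.

```python
import pandas as pd

# Hypothetical toy data: two IMDb entries share one title.
titles = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "primary_title": ["The Overnight", "The Overnight", "Foodfight!"],
})
budgets = pd.DataFrame({
    "movie": ["The Overnight", "Foodfight!"],
    "production_budget": ["$200,000", "$45,000,000"],
})

merged = pd.merge(titles, budgets, left_on="primary_title",
                  right_on="movie", how="inner")

# keep=False flags every row of a repeated title, not only the later copies.
repeated = merged[merged.duplicated(subset="movie", keep=False)]
print(repeated)

# One possible policy (an assumption): keep only the first match per title.
deduped = merged.drop_duplicates(subset="movie", keep="first")
print(len(merged), len(deduped))
```

Inspecting `repeated` before dropping anything helps decide whether the repeats are true duplicates or different films that happen to share a name.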


In [41]:
main_df.drop(['primary_title', 'tconst'], axis=1, inplace = True)

main_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3815 entries, 0 to 3814
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   genres             3743 non-null   object
 1   movie              3815 non-null   object
 2   production_budget  3815 non-null   object
 3   domestic_gross     3815 non-null   object
 4   worldwide_gross    3815 non-null   object
 5   month              3815 non-null   object
 6   year               3815 non-null   object
dtypes: object(7)
memory usage: 238.4+ KB


The above is our main dataframe, with all the information that is initially relevant to investigating the profitability of different movies.

Next, convert the budget and gross columns from strings to integers.

Determine whether to use worldwide or domestic numbers (maybe both?), then create a profit column for whichever is chosen.
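
The conversion and profit steps above could be sketched roughly as follows. This runs on two toy rows copied from the budgets output; the `worldwide_profit` and `domestic_profit` column names are assumptions made for illustration, and profit is taken simply as gross minus production budget.

```python
import pandas as pd

money_cols = ["production_budget", "domestic_gross", "worldwide_gross"]

# Two rows copied from the budgets output above.
sample = pd.DataFrame({
    "movie": ["Avatar", "Dark Phoenix"],
    "production_budget": ["$425,000,000", "$350,000,000"],
    "domestic_gross": ["$760,507,625", "$42,762,350"],
    "worldwide_gross": ["$2,776,345,279", "$149,762,350"],
})

# Strip "$" and "," and cast to int64 (values exceed the int32 range).
for col in money_cols:
    sample[col] = (
        sample[col].str.replace(r"[$,]", "", regex=True).astype("int64")
    )

# Hypothetical profit columns: gross minus production budget.
sample["worldwide_profit"] = sample["worldwide_gross"] - sample["production_budget"]
sample["domestic_profit"] = sample["domestic_gross"] - sample["production_budget"]

print(sample[["movie", "worldwide_profit", "domestic_profit"]])
```

Computing both columns keeps the worldwide-versus-domestic decision open until the two can be compared side by side.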

Create a graph showing each year's total profits, followed by a group of graphs for the last 4-5 years broken down by month.

Remove null values (for example, the missing genres entries shown by info() above).
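
A rough sketch of these two steps, assuming a numeric `profit` column already exists (that name and the toy values are hypothetical): rows with no genre are dropped, then profit is summed per year.

```python
import pandas as pd

# Hypothetical toy data; year is a string, as in main_df.info() above.
sample = pd.DataFrame({
    "genres": ["Action", None, "Drama", "Comedy"],
    "year": ["2015", "2015", "2016", "2016"],
    "profit": [100, 50, -20, 70],
})

# Drop only the rows where genres is missing.
cleaned = sample.dropna(subset=["genres"])

# Total profit per release year; the result is a Series indexed by year.
yearly_profit = cleaned.groupby("year")["profit"].sum()
print(yearly_profit)
```

The resulting Series could then be passed to a bar plot, for example with `yearly_profit.plot(kind="bar")`.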

Identify the most common genres of film with value_counts.

Replace genre names with more understandable names.
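
Because each film's genres are stored as one comma-separated string, counting them directly would count combinations rather than individual genres. One way this might be handled is sketched below on three rows from the merged output; the `genre` column and the `exploded` name are assumptions for illustration.

```python
import pandas as pd

# Three rows copied from the merged output above.
sample = pd.DataFrame({
    "movie": ["Foodfight!", "Mortal Kombat", "On the Road"],
    "genres": [
        "Action,Animation,Comedy",
        "Action,Adventure,Fantasy",
        "Adventure,Drama,Romance",
    ],
})

# Split each string into a list, then give every genre its own row.
exploded = sample.assign(genre=sample["genres"].str.split(",")).explode("genre")

# Count how many films carry each individual genre.
genre_counts = exploded["genre"].value_counts()
print(genre_counts)
```

Renaming could then be applied to the single-genre column, for instance with `Series.replace` and a dictionary of old-to-new labels.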

Create graphs showing how genre correlates with profitability: first by year, then by month within a year, then by month in other years to identify a trend.

Identify which genres are most profitable at which times of year.

Demonstrate what the most profitable time of year is in general, making sure to indicate genre alongside time of year.

Also identify production budgets, especially for the highest- and lowest-performing times of year. This will make clear what kinds of projects to commit to in slower seasons, etc.
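
The genre-by-month comparison described above might be summarized with a pivot table before any plotting. The sketch below uses invented numbers and assumes one row per film and genre, with hypothetical `genre` and `profit` columns.

```python
import pandas as pd

# Hypothetical toy data: one row per (film, genre).
sample = pd.DataFrame({
    "genre": ["Action", "Action", "Drama", "Drama", "Action"],
    "month": ["May", "Dec", "May", "Dec", "May"],
    "profit": [300, 500, 120, 90, 100],
    "production_budget": [200, 300, 20, 30, 100],
})

# Mean profit for each genre in each month.
summary = sample.pivot_table(
    index="genre", columns="month", values="profit", aggfunc="mean"
)
# Put the month columns in calendar order instead of alphabetical order.
summary = summary[["May", "Dec"]]

# For each genre, the month with the highest mean profit.
best_month = summary.idxmax(axis=1)

# Mean production budget per month, to compare typical spending by season.
budget_by_month = sample.groupby("month")["production_budget"].mean()

print(summary)
print(best_month)
print(budget_by_month)
```

A table like `summary` maps directly onto a heatmap or grouped bar chart, which may be easier to read than one plot per genre.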

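After all this, using our data about profitability, pick the films with the highest ratings to help inform decisions further.

Combine the following table with the existing data in a new dataframe. Note that ratings are keyed by tconst, which was dropped from main_df above, so that column will need to be retained for the merge.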

In [6]:
reviews_df = pd.read_csv('data/zippedData/imdb.title.ratings.csv.gz')
reviews_df.head()

Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21
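
Joining ratings onto the film data could look roughly like the sketch below. It assumes the film table still has its `tconst` column; the movie names, profit values, and the `min_votes` threshold are all hypothetical, while the two rating rows are copied from the output above.

```python
import pandas as pd

# Hypothetical film table that has kept its tconst key.
movies = pd.DataFrame({
    "tconst": ["tt1043726", "tt1060240", "tt9999999"],
    "movie": ["Movie A", "Movie B", "Movie C"],
    "profit": [1000, 250, 400],
})

# Two rating rows copied from reviews_df.head() above.
ratings = pd.DataFrame({
    "tconst": ["tt1043726", "tt1060240"],
    "averagerating": [4.2, 6.5],
    "numvotes": [50352, 21],
})

# A left join keeps every film, leaving NaN where no rating exists.
rated = pd.merge(movies, ratings, on="tconst", how="left")

# Hypothetical cutoff: ignore averages built on very few votes.
min_votes = 1000
reliable = rated[rated["numvotes"] >= min_votes]

print(rated)
print(reliable)
```

Filtering on `numvotes` is one way to avoid giving weight to averages built on only a handful of votes, such as the 8.3 rating with 31 votes shown above.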


In [40]:
# Here you run your code to explore the data

## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

In [41]:
# Here you run your code to clean the data

## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [42]:
# Here you run your code to model the data


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***