# Introduction

Do higher film budgets lead to more box office revenue? Let's find out if there's a relationship using the movie budgets and financial performance data that I've scraped from [the-numbers.com](https://www.the-numbers.com/movie/budgets) on **May 1st, 2018**. 

<img src=https://i.imgur.com/kq7hrEh.png>

# Import Statements

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

# Notebook Presentation

In [2]:
# This formats FLOAT (does not work on ints) to have 2 decimal places
# and have thousand separators
pd.options.display.float_format = "{:,.2f}".format

from pandas.plotting import register_matplotlib_converters

register_matplotlib_converters()

# Read the Data

In [3]:
data = pd.read_csv("cost_revenue_dirty.csv")

# Explore and Clean the Data

**Challenge**: Answer these questions about the dataset:
1. How many rows and columns does the dataset contain?
2. Are there any NaN values present?
3. Are there any duplicate rows?
4. What are the data types of the columns?

In [4]:
# Explore
data.head()

Unnamed: 0,Rank,Release_Date,Movie_Title,USD_Production_Budget,USD_Worldwide_Gross,USD_Domestic_Gross
0,5293,8/2/1915,The Birth of a Nation,"$110,000","$11,000,000","$10,000,000"
1,5140,5/9/1916,Intolerance,"$385,907",$0,$0
2,5230,12/24/1916,"20,000 Leagues Under the Sea","$200,000","$8,000,000","$8,000,000"
3,5299,9/17/1920,Over the Hill to the Poorhouse,"$100,000","$3,000,000","$3,000,000"
4,5222,1/1/1925,The Big Parade,"$245,000","$22,000,000","$11,000,000"


In [5]:
data.shape

(5391, 6)

In [6]:
# Check for NaN
data.isna().sum()

Rank                     0
Release_Date             0
Movie_Title              0
USD_Production_Budget    0
USD_Worldwide_Gross      0
USD_Domestic_Gross       0
dtype: int64

In [7]:
# Check for duplicates
data.duplicated().sum()

0

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5391 entries, 0 to 5390
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Rank                   5391 non-null   int64 
 1   Release_Date           5391 non-null   object
 2   Movie_Title            5391 non-null   object
 3   USD_Production_Budget  5391 non-null   object
 4   USD_Worldwide_Gross    5391 non-null   object
 5   USD_Domestic_Gross     5391 non-null   object
dtypes: int64(1), object(5)
memory usage: 252.8+ KB


### Data Type Conversions

**Challenge**: Convert the `USD_Production_Budget`, `USD_Worldwide_Gross`, and `USD_Domestic_Gross` columns to a numeric format by removing `$` signs and `,`. 
<br>
<br>
Note that *domestic* in this context refers to the United States.

In [9]:
# Remove commas and dollar signs
chars_to_rem = [",", "$"]

columns_dirty = [
    "USD_Production_Budget",
    "USD_Worldwide_Gross",
    "USD_Domestic_Gross",
]

for col in columns_dirty:
    for char in chars_to_rem:
        # Remove char literally (regex=False stops "$" being read as a regex anchor)
        data[col] = data[col].str.replace(char, "", regex=False)

    # Convert "object" type columns to numeric
    data[col] = data[col].astype(float)

In [10]:
data.sample(3)

Unnamed: 0,Rank,Release_Date,Movie_Title,USD_Production_Budget,USD_Worldwide_Gross,USD_Domestic_Gross
4236,4818,9/14/2012,Airborne,1200000.0,0.0,0.0
2739,2881,8/25/2006,Idlewild,15000000.0,12669914.0,12669914.0
38,4545,1/1/1946,Notorious,2000000.0,24464742.0,24464742.0


In [11]:
# Confirm data type changes
data.dtypes

Rank                       int64
Release_Date              object
Movie_Title               object
USD_Production_Budget    float64
USD_Worldwide_Gross      float64
USD_Domestic_Gross       float64
dtype: object

**Challenge**: Convert the `Release_Date` column to a Pandas Datetime type. 

In [12]:
data["Release_Date"] = pd.to_datetime(data["Release_Date"])

In [13]:
# Confirm
display(data.sample(3))
data.dtypes

Unnamed: 0,Rank,Release_Date,Movie_Title,USD_Production_Budget,USD_Worldwide_Gross,USD_Domestic_Gross
1758,3250,2002-01-25,A Walk to Remember,11000000.0,46060915.0,41227069.0
995,2817,1997-08-15,Cop Land,15000000.0,63706632.0,44906632.0
4359,187,2013-03-27,G.I. Joe: Retaliation,140000000.0,375740705.0,122523060.0


Rank                              int64
Release_Date             datetime64[ns]
Movie_Title                      object
USD_Production_Budget           float64
USD_Worldwide_Gross             float64
USD_Domestic_Gross              float64
dtype: object

### Descriptive Statistics

**Challenge**: 

1. What is the average production budget of the films in the data set?
2. What is the average worldwide gross revenue of films?
3. What were the minimums for worldwide and domestic revenue?
4. Are the bottom 25% of films actually profitable or do they lose money?
5. What are the highest production budget and highest worldwide gross revenue of any film?
6. How much revenue did the lowest and highest budget films make?

In [14]:
# Most of these can be answered by .describe()
data.describe()

Unnamed: 0,Rank,Release_Date,USD_Production_Budget,USD_Worldwide_Gross,USD_Domestic_Gross
count,5391.0,5391,5391.0,5391.0,5391.0
mean,2696.0,2003-09-19 15:02:02.203672704,31113737.58,88855421.96,41235519.44
min,1.0,1915-08-02 00:00:00,1100.0,0.0,0.0
25%,1348.5,1999-12-02 12:00:00,5000000.0,3865206.0,1330901.5
50%,2696.0,2006-06-23 00:00:00,17000000.0,27450453.0,17192205.0
75%,4043.5,2011-11-23 00:00:00,40000000.0,96454455.0,52343687.0
max,5391.0,2020-12-31 00:00:00,425000000.0,2783918982.0,936662225.0
std,1556.39,,40523796.88,168457757.0,66029346.27


In [15]:
# C1. What is the average production budget of the films in the data set?
data["USD_Production_Budget"].mean()
# $31,113,737

31113737.57837136

In [16]:
# C2. What is the average worldwide gross revenue of films?
data["USD_Worldwide_Gross"].mean()
# $88,855,421

88855421.96271564

In [17]:
# C3. What were the minimums for worldwide and domestic revenue?
data[["USD_Worldwide_Gross", "USD_Domestic_Gross"]].min()
# 0 for both

USD_Worldwide_Gross   0.00
USD_Domestic_Gross    0.00
dtype: float64

In [18]:
# C4. Are the bottom 25% of films actually profitable or do they lose money?

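# Create bottom 25% df
# Rank is ordered by budget (Rank 1 = highest budget), so the lowest-budget
# quarter of films has a Rank at or above the 75th percentile of Rank
quantile_25 = data["Rank"].quantile(0.75)
filt_25 = data["Rank"] >= quantile_25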
# Note the .copy(). It disables the
# "A value is trying to be set on a copy of a slice from a DataFrame" warning
data_bottom_25 = data.loc[filt_25].copy()

# Create new column for profit
data_bottom_25.loc[:, "USD_Net_Profit"] = (
    data_bottom_25["USD_Worldwide_Gross"] - data_bottom_25["USD_Production_Budget"]
)
data_bottom_25.head()

Unnamed: 0,Rank,Release_Date,Movie_Title,USD_Production_Budget,USD_Worldwide_Gross,USD_Domestic_Gross,USD_Net_Profit
0,5293,1915-08-02,The Birth of a Nation,110000.0,11000000.0,10000000.0,10890000.0
1,5140,1916-05-09,Intolerance,385907.0,0.0,0.0,-385907.0
2,5230,1916-12-24,"20,000 Leagues Under the Sea",200000.0,8000000.0,8000000.0,7800000.0
3,5299,1920-09-17,Over the Hill to the Poorhouse,100000.0,3000000.0,3000000.0,2900000.0
4,5222,1925-01-01,The Big Parade,245000.0,22000000.0,11000000.0,21755000.0


In [19]:
# C4. Are the bottom 25% of films actually profitable or do they lose money?
n_profit = len(data_bottom_25.loc[data_bottom_25["USD_Net_Profit"] > 0])
n_loss = len(data_bottom_25.loc[data_bottom_25["USD_Net_Profit"] <= 0])

print(
    f"Of the bottom (budget-wise) {len(data_bottom_25)} movies:\n{n_profit} movies "
    f"had profit\n{n_loss} movies were at break-even or loss"
)

Of the bottom (budget-wise) 1348 movies:
611 movies had profit
737 movies were at break-even or loss


In [20]:
# C5. What are the highest production budget and highest worldwide gross
# revenue of any film?
data.loc[data["USD_Production_Budget"].idxmax()]
# Highest budget: Avatar

Rank                                       1
Release_Date             2009-12-18 00:00:00
Movie_Title                           Avatar
USD_Production_Budget         425,000,000.00
USD_Worldwide_Gross         2,783,918,982.00
USD_Domestic_Gross            760,507,625.00
Name: 3529, dtype: object

In [21]:
# C5. What are the highest production budget and highest worldwide gross
# revenue of any film?
data.loc[data["USD_Worldwide_Gross"].idxmax()]
# Highest worldwide gross: Avatar

Rank                                       1
Release_Date             2009-12-18 00:00:00
Movie_Title                           Avatar
USD_Production_Budget         425,000,000.00
USD_Worldwide_Gross         2,783,918,982.00
USD_Domestic_Gross            760,507,625.00
Name: 3529, dtype: object

In [22]:
# C6. How much revenue did the lowest and highest budget films make?
data.loc[data["USD_Production_Budget"].idxmin()]
# "My Date With Drew" grossed $181,041

Rank                                    5391
Release_Date             2005-05-08 00:00:00
Movie_Title                My Date With Drew
USD_Production_Budget               1,100.00
USD_Worldwide_Gross               181,041.00
USD_Domestic_Gross                181,041.00
Name: 2427, dtype: object

In [23]:
# C6. How much revenue did the lowest and highest budget films make?
data.loc[data["USD_Production_Budget"].idxmax()]  # Same film as in C5
# "Avatar" grossed $2,783,918,982

Rank                                       1
Release_Date             2009-12-18 00:00:00
Movie_Title                           Avatar
USD_Production_Budget         425,000,000.00
USD_Worldwide_Gross         2,783,918,982.00
USD_Domestic_Gross            760,507,625.00
Name: 3529, dtype: object

# Investigating the Zero Revenue Films

**Challenge**: How many films grossed $0 domestically (i.e., in the United States)? What were the highest-budget films that grossed nothing?

In [24]:
filt_zero_domestic_gross = data["USD_Domestic_Gross"] == 0
len(data.loc[filt_zero_domestic_gross])
# 512 movies grossed $0. Some of these movies are still to be released since this
# data was taken on May 1, 2018

512

In [25]:
# Highest budget with $0 domestic gross. (Note some of these are unreleased)
data.loc[filt_zero_domestic_gross].sort_values(
    "USD_Production_Budget", ascending=False
).head(10)

Unnamed: 0,Rank,Release_Date,Movie_Title,USD_Production_Budget,USD_Worldwide_Gross,USD_Domestic_Gross
5388,96,2020-12-31,Singularity,175000000.0,0.0,0.0
5387,126,2018-12-18,Aquaman,160000000.0,0.0,0.0
5384,321,2018-09-03,A Wrinkle in Time,103000000.0,0.0,0.0
5385,366,2018-10-08,Amusement Park,100000000.0,0.0,0.0
5090,556,2015-12-31,"Don Gato, el inicio de la pandilla",80000000.0,4547660.0,0.0
4294,566,2012-12-31,Astérix et Obélix: Au service de Sa Majesté,77600000.0,60680125.0,0.0
5058,880,2015-11-12,The Ridiculous 6,60000000.0,0.0,0.0
5338,879,2017-04-08,The Dark Tower,60000000.0,0.0,0.0
5389,1119,2020-12-31,Hannibal the Conqueror,50000000.0,0.0,0.0
4295,1230,2012-12-31,Foodfight!,45000000.0,73706.0,0.0


**Challenge**: How many films grossed $0 worldwide? What are the highest-budget films that had no worldwide revenue?

In [27]:
filt_zero_global_gross = data["USD_Worldwide_Gross"] == 0
len(data.loc[filt_zero_global_gross])
# 357 movies grossed $0 worldwide.

357

### Filtering on Multiple Conditions

In [52]:
# Highest budget films with no worldwide revenue (some are still to be released)
data.loc[filt_zero_global_gross].sort_values(
    "USD_Production_Budget", ascending=False
).head()

Unnamed: 0,Rank,Release_Date,Movie_Title,USD_Production_Budget,USD_Worldwide_Gross,USD_Domestic_Gross
5388,96,2020-12-31,Singularity,175000000.0,0.0,0.0
5387,126,2018-12-18,Aquaman,160000000.0,0.0,0.0
5384,321,2018-09-03,A Wrinkle in Time,103000000.0,0.0,0.0
5385,366,2018-10-08,Amusement Park,100000000.0,0.0,0.0
5058,880,2015-11-12,The Ridiculous 6,60000000.0,0.0,0.0


**Challenge**: Use the [`.query()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) to accomplish the same thing. Create a subset for international releases that had some worldwide gross revenue, but made zero revenue in the United States. 

Hint: This time you'll have to use the `and` keyword.
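
One way to approach this, sketched on a hand-copied subset of the rows shown above so that the snippet runs on its own. The names `sample`, `international_releases`, and `international_query` are illustrative and not part of the original notebook; in the notebook you would run the same expressions against `data`.

```python
import pandas as pd

# Illustrative subset copied from the outputs above; not the full dataset.
sample = pd.DataFrame(
    {
        "Movie_Title": [
            "Singularity",
            "Don Gato, el inicio de la pandilla",
            "Foodfight!",
            "Avatar",
        ],
        "USD_Worldwide_Gross": [0.0, 4547660.0, 73706.0, 2783918982.0],
        "USD_Domestic_Gross": [0.0, 0.0, 0.0, 760507625.0],
    }
)

# Boolean-mask version: join conditions with & and wrap each one in parentheses.
mask = (sample["USD_Domestic_Gross"] == 0) & (sample["USD_Worldwide_Gross"] != 0)
international_releases = sample.loc[mask]

# Equivalent .query() version: column names appear directly, joined by `and`.
international_query = sample.query(
    "USD_Domestic_Gross == 0 and USD_Worldwide_Gross != 0"
)

print(international_query["Movie_Title"].tolist())
```

Both approaches select the same rows; `.query()` tends to read more naturally once several conditions are combined.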

### Unreleased Films

**Challenge**:
* Identify which films had not yet been released at the time of data collection (May 1st, 2018).
* How many films in the dataset had not yet had a chance to earn anything at the box office? 
* Create another DataFrame called `data_clean` that does not include these films. 

In [None]:
# Date of Data Collection
scrape_date = pd.Timestamp("2018-5-1")
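
A possible continuation, shown here as a standalone sketch on a few rows copied from the outputs above. `sample` and `future_releases` are assumed names, and treating a release date on or after the scrape date as "unreleased" is an assumption about where to draw the line.

```python
import pandas as pd

# Illustrative subset copied from the outputs above; not the full dataset.
sample = pd.DataFrame(
    {
        "Movie_Title": ["Avatar", "The Dark Tower", "Aquaman", "Singularity"],
        "Release_Date": pd.to_datetime(
            ["2009-12-18", "2017-04-08", "2018-12-18", "2020-12-31"]
        ),
    }
)

# Date of Data Collection
scrape_date = pd.Timestamp("2018-5-1")

# Films whose release date falls on or after the scrape date
future_releases = sample[sample["Release_Date"] >= scrape_date]
print(f"Unreleased films in the sample: {len(future_releases)}")

# Drop those rows by index label to get the cleaned DataFrame
# (in the notebook this would be built from `data` instead of `sample`)
data_clean = sample.drop(future_releases.index)
```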

### Films that Lost Money

**Challenge**: 
What is the percentage of films where the production costs exceeded the worldwide gross revenue? 
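
A minimal sketch of the calculation, using four rows copied from the outputs above rather than the full `data_clean` DataFrame. `sample`, `money_losing`, and `pct_losing` are illustrative names.

```python
import pandas as pd

# Illustrative subset copied from the outputs above; not the full dataset.
sample = pd.DataFrame(
    {
        "Movie_Title": ["The Birth of a Nation", "Intolerance", "Idlewild", "Cop Land"],
        "USD_Production_Budget": [110000.0, 385907.0, 15000000.0, 15000000.0],
        "USD_Worldwide_Gross": [11000000.0, 0.0, 12669914.0, 63706632.0],
    }
)

# Films whose production budget exceeded their worldwide gross
money_losing = sample.query("USD_Production_Budget > USD_Worldwide_Gross")

# Share of such films, as a percentage of all films in the sample
pct_losing = len(money_losing) / len(sample) * 100
print(f"{pct_losing:.1f}% of the sample films lost money")
```

The percentage for this tiny sample says nothing about the full dataset; it only demonstrates the steps.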

# Seaborn for Data Viz: Bubble Charts

### Plotting Movie Releases over Time

**Challenge**: Try to create the following Bubble Chart:

<img src=https://i.imgur.com/8fUn9T6.png>
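
A hedged starting point using Seaborn's `scatterplot()`. The target image is not reproduced in this text, so the axis choices (release date on the x-axis, budget on the y-axis, worldwide gross driving both colour and bubble size), the labels, and the limits are assumptions. The `sample` DataFrame is a hand-copied subset of rows shown above.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative subset copied from the outputs above; not the full dataset.
sample = pd.DataFrame(
    {
        "Release_Date": pd.to_datetime(
            ["1946-01-01", "1997-08-15", "2002-01-25", "2009-12-18", "2013-03-27"]
        ),
        "USD_Production_Budget": [
            2000000.0, 15000000.0, 11000000.0, 425000000.0, 140000000.0,
        ],
        "USD_Worldwide_Gross": [
            24464742.0, 63706632.0, 46060915.0, 2783918982.0, 375740705.0,
        ],
    }
)

plt.figure(figsize=(8, 4), dpi=100)

# The style context must wrap the call that creates the axes
with sns.axes_style("darkgrid"):
    ax = sns.scatterplot(
        data=sample,
        x="Release_Date",
        y="USD_Production_Budget",
        hue="USD_Worldwide_Gross",   # colour encodes revenue
        size="USD_Worldwide_Gross",  # bubble size encodes revenue too
    )
    ax.set(ylim=(0, 450000000), xlabel="Year", ylabel="Budget in USD")
```

In a notebook the figure renders when the cell finishes; in a plain script, call `plt.show()` afterwards.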



# Converting Years to Decades Trick

**Challenge**: Create a column in `data_clean` that has the decade of the release. 

<img src=https://i.imgur.com/0VEfagw.png width=650> 

Here's how: 
1. Create a [`DatetimeIndex` object](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html) from the Release_Date column. 
2. Grab all the years from the `DatetimeIndex` object using the `.year` property.
<img src=https://i.imgur.com/5m06Ach.png width=650>
3. Use floor division `//` to convert the year data to the decades of the films.
4. Add the decades as a `Decade` column to the `data_clean` DataFrame.
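
The four steps above can be sketched as follows, on a handful of rows copied from earlier outputs. `sample` stands in for `data_clean`, and `dt_index`, `years`, and `decades` are illustrative names.

```python
import pandas as pd

# Illustrative subset copied from the outputs above; stands in for data_clean.
sample = pd.DataFrame(
    {
        "Movie_Title": ["The Birth of a Nation", "Notorious", "Cop Land", "Avatar"],
        "Release_Date": pd.to_datetime(
            ["1915-08-02", "1946-01-01", "1997-08-15", "2009-12-18"]
        ),
    }
)

# 1. Build a DatetimeIndex from the Release_Date column
dt_index = pd.DatetimeIndex(sample["Release_Date"])

# 2. Grab the years
years = dt_index.year

# 3. Floor-divide by 10 to drop the last digit, then multiply back by 10
decades = (years // 10) * 10

# 4. Store the result as a new column
sample["Decade"] = decades
print(sample[["Movie_Title", "Decade"]])
```

Floor division is what does the work here: `1997 // 10` is `199`, and multiplying by 10 gives `1990`.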

### Separate the "Old" (up to 1969) and "New" (1970 onwards) Films

**Challenge**: Create two new DataFrames: `old_films` and `new_films`
* `old_films` should include all the films released up to and including 1969
* `new_films` should include all the films from 1970 onwards
* How many films were released prior to 1970?
* What was the most expensive film made prior to 1970?
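
A small sketch of the split, assuming a `Decade` column already exists as described in the previous section. The `sample` DataFrame is a hand-copied subset of rows shown above, and `most_expensive_old` is an illustrative name.

```python
import pandas as pd

# Illustrative subset copied from the outputs above; stands in for data_clean.
sample = pd.DataFrame(
    {
        "Movie_Title": ["The Big Parade", "Notorious", "Cop Land", "Avatar"],
        "USD_Production_Budget": [245000.0, 2000000.0, 15000000.0, 425000000.0],
        "Decade": [1920, 1940, 1990, 2000],
    }
)

# Films from the 1960s or earlier versus films from the 1970s onwards
old_films = sample[sample["Decade"] <= 1960]
new_films = sample[sample["Decade"] > 1960]

print(f"Films released before 1970 in the sample: {len(old_films)}")

# Row with the largest budget among the old films
most_expensive_old = old_films.loc[old_films["USD_Production_Budget"].idxmax()]
print(most_expensive_old["Movie_Title"])
```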

# Seaborn Regression Plots

**Challenge**: Use Seaborn's `.regplot()` to show the scatter plot and linear regression line for the `new_films`. 
<br>
<br>
Style the chart

* Put the chart on a `'darkgrid'`.
* Set limits on the axes so that they don't show negative values.
* Label the axes on the plot "Revenue in \$ billions" and "Budget in \$ millions".
* Provide HEX colour codes for the plot and the regression line. Make the dots dark blue (#2f4b7c) and the line orange (#ff7c43).

Interpret the chart

* Do our data points for the new films align better or worse with the linear regression than for our older films?
* Roughly how much would a film with a budget of $150 million make according to the regression line?
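
One possible way to apply the styling requirements, sketched on five post-1970 rows copied from the outputs above. `sample` stands in for `new_films`; the axis limits and the `alpha` value are assumptions chosen to fit this small sample.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative subset copied from the outputs above; stands in for new_films.
sample = pd.DataFrame(
    {
        "USD_Production_Budget": [
            15000000.0, 11000000.0, 15000000.0, 140000000.0, 425000000.0,
        ],
        "USD_Worldwide_Gross": [
            63706632.0, 46060915.0, 12669914.0, 375740705.0, 2783918982.0,
        ],
    }
)

plt.figure(figsize=(8, 4), dpi=100)

with sns.axes_style("darkgrid"):
    ax = sns.regplot(
        data=sample,
        x="USD_Production_Budget",
        y="USD_Worldwide_Gross",
        color="#2f4b7c",                 # dark blue dots
        scatter_kws={"alpha": 0.3},      # assumed transparency
        line_kws={"color": "#ff7c43"},   # orange regression line
    )
    # Start both axes at zero so no negative values are shown
    ax.set(
        xlim=(0, 450000000),
        ylim=(0, 3000000000),
        xlabel="Budget in $ millions",
        ylabel="Revenue in $ billions",
    )
```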

# Run Your Own Regression with scikit-learn

$$ \widehat{REVENUE} = \theta_0 + \theta_1 \, BUDGET $$

**Challenge**: Run a linear regression for the `old_films`. Calculate the intercept, slope and r-squared. How much of the variance in movie revenue does the linear model explain in this case?
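
A minimal sketch with scikit-learn's `LinearRegression`, fitted on six pre-1970 rows copied from the outputs above. `sample` stands in for `old_films`, so the numbers it produces are not the answer for the full dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative subset copied from the outputs above; stands in for old_films.
sample = pd.DataFrame(
    {
        "USD_Production_Budget": [
            110000.0, 385907.0, 200000.0, 100000.0, 245000.0, 2000000.0,
        ],
        "USD_Worldwide_Gross": [
            11000000.0, 0.0, 8000000.0, 3000000.0, 22000000.0, 24464742.0,
        ],
    }
)

# Features must be 2-dimensional, hence the double brackets
X = sample[["USD_Production_Budget"]]
y = sample["USD_Worldwide_Gross"]

regression = LinearRegression()
regression.fit(X, y)

intercept = regression.intercept_   # theta_0
slope = regression.coef_[0]         # theta_1
r_squared = regression.score(X, y)  # share of variance explained

print(f"Intercept: {intercept:,.2f}")
print(f"Slope: {slope:.4f}")
print(f"R-squared: {r_squared:.4f}")
```

The slope reads as the estimated extra revenue per additional dollar of budget, and R-squared as the fraction of the variance in revenue that the line accounts for.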

# Use Your Model to Make a Prediction

We just estimated the slope and intercept! Remember that our Linear Model has the following form:

$$ \widehat{REVENUE} = \theta_0 + \theta_1 \, BUDGET $$

**Challenge**:  How much global revenue does our model estimate for a film with a budget of $350 million? 
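
The prediction is just the linear formula evaluated at the chosen budget. The sketch below uses a hypothetical helper, `estimate_revenue`, and placeholder coefficients that are NOT the fitted values; substitute the intercept and slope from your own regression.

```python
def estimate_revenue(intercept: float, slope: float, budget: float) -> float:
    """Evaluate the linear model: revenue = intercept + slope * budget."""
    return intercept + slope * budget


# Placeholder coefficients for illustration only (not fitted values).
# Replace them with the intercept and slope of your fitted model.
theta_0 = -10_000_000.0
theta_1 = 3.0

budget = 350_000_000
estimate = estimate_revenue(theta_0, theta_1, budget)
print(f"Estimated revenue for a ${budget:,} film: ${estimate:,.0f}")
```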