# Final Project

# Introduction

## Question of Interest

Disney has enchanted audiences for generations, creating timeless tales and unforgettable characters. I am interested in understanding which decade has been the most ‘impactful’. Impact is hard to quantify directly, so I will use inflation-adjusted gross revenue as a proxy. In this analysis, I therefore investigate the relationship between the inflation-adjusted revenue of Disney movies and their release decade, to see which decade had the highest earnings for Disney. I expect more recent decades to be the most (economically) productive, but we are using inflation-adjusted figures, so you never know!

## Dataset Description

The Disney dataset has $5$ tables: `disney-characters.csv`, `disney_movies_total_gross.csv`, `disney-voice-actors.csv`, `disney-director.csv`, and `disney_revenue_1991-2016.csv`. Each table is stored in a CSV file. Together they contain information on film title, release date, characters, genre, MPAA rating, gross (at the time of release and inflation-adjusted), voice actor, director, and total Disney revenue. I will be using `disney_movies_total_gross.csv`, described below:
    
* **disney_movies_total_gross.csv**
    * This file contains information regarding movie title, release date, genre, MPAA rating, gross at the time of release, and inflation adjusted gross.


# Methods and Results 

I am interested in seeing which decade was most impactful and seen by the widest audience; therefore, I will use the table that contains information on gross revenue.

Before going further, we will import this table and do some preliminary exploration and visualization.

In [214]:
# Import required libraries
import pandas as pd
import altair as alt

In [215]:
movies_gross = pd.read_csv('data/disney_movies_total_gross.csv', parse_dates = ['release_date'])
movies_gross.head()

Unnamed: 0,movie_title,release_date,genre,MPAA_rating,total_gross,inflation_adjusted_gross
0,Snow White and the Seven Dwarfs,1937-12-21,Musical,G,"$184,925,485","$5,228,953,251"
1,Pinocchio,1940-02-09,Adventure,G,"$84,300,000","$2,188,229,052"
2,Fantasia,1940-11-13,Musical,G,"$83,320,000","$2,187,090,808"
3,Song of the South,1946-11-12,Adventure,G,"$65,000,000","$1,078,510,579"
4,Cinderella,1950-02-15,Drama,G,"$85,000,000","$920,608,730"


Let's learn more about our dataframe

In [216]:
movies_gross.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 579 entries, 0 to 578
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   movie_title               579 non-null    object        
 1   release_date              579 non-null    datetime64[ns]
 2   genre                     562 non-null    object        
 3   MPAA_rating               523 non-null    object        
 4   total_gross               579 non-null    object        
 5   inflation_adjusted_gross  579 non-null    object        
dtypes: datetime64[ns](1), object(5)
memory usage: 27.3+ KB


The gross revenue dataframe has 579 rows and 6 columns. All 579 values in `inflation_adjusted_gross` are non-null. Notice that both `total_gross` and our column of interest, `inflation_adjusted_gross`, are of dtype object, because the values are stored as strings such as "$184,925,485".

Before continuing to preliminary visualization, we must convert these columns from object dtype to integer dtype, so that we can sort them and continue wrangling the data.
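
The source of `money_to_int` in `my_func` is not shown in this notebook, so the block below is only a sketch of what such a helper might look like, based on the line echoed in the warning output. The name `money_to_int_sketch` and the small `demo` frame are hypothetical. It strips the dollar sign and commas as literal characters (`regex=False`), and casts to `int64` explicitly because values such as 5,228,953,251 do not fit in a 32-bit integer.

```python
import pandas as pd


def money_to_int_sketch(df: pd.DataFrame, money_column: str) -> pd.DataFrame:
    """Hypothetical stand-in for my_func.money_to_int (actual source not shown)."""
    cleaned = (
        df[money_column]
        .astype(str)                          # tolerate columns that are already numeric
        .str.replace("$", "", regex=False)    # treat "$" as a literal, not a regex anchor
        .str.replace(",", "", regex=False)    # drop thousands separators
    )
    df[money_column] = cleaned.astype("int64")  # 64-bit: some values exceed 2**31 - 1
    return df


# Tiny demo frame using two values from the table above
demo = pd.DataFrame({"total_gross": ["$184,925,485", "$84,300,000"]})
money_to_int_sketch(demo, "total_gross")
```

Like the call pattern used in this notebook, the sketch modifies the dataframe in place and also returns it, so it can be used either way.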

In [217]:
# Import my function
from my_func import money_to_int

In [218]:
money_to_int(movies_gross, 'total_gross')

  df[money_column] = df[money_column].str.replace('\$', '').str.replace(',', '').astype(int)


Unnamed: 0,movie_title,release_date,genre,MPAA_rating,total_gross,inflation_adjusted_gross
0,Snow White and the Seven Dwarfs,1937-12-21,Musical,G,184925485,"$5,228,953,251"
1,Pinocchio,1940-02-09,Adventure,G,84300000,"$2,188,229,052"
2,Fantasia,1940-11-13,Musical,G,83320000,"$2,187,090,808"
3,Song of the South,1946-11-12,Adventure,G,65000000,"$1,078,510,579"
4,Cinderella,1950-02-15,Drama,G,85000000,"$920,608,730"
...,...,...,...,...,...,...
574,The Light Between Oceans,2016-09-02,Drama,PG-13,12545979,"$12,545,979"
575,Queen of Katwe,2016-09-23,Drama,PG,8874389,"$8,874,389"
576,Doctor Strange,2016-11-04,Adventure,PG-13,232532923,"$232,532,923"
577,Moana,2016-11-23,Adventure,PG,246082029,"$246,082,029"


In [219]:
money_to_int(movies_gross, 'inflation_adjusted_gross')

  df[money_column] = df[money_column].str.replace('\$', '').str.replace(',', '').astype(int)


Unnamed: 0,movie_title,release_date,genre,MPAA_rating,total_gross,inflation_adjusted_gross
0,Snow White and the Seven Dwarfs,1937-12-21,Musical,G,184925485,5228953251
1,Pinocchio,1940-02-09,Adventure,G,84300000,2188229052
2,Fantasia,1940-11-13,Musical,G,83320000,2187090808
3,Song of the South,1946-11-12,Adventure,G,65000000,1078510579
4,Cinderella,1950-02-15,Drama,G,85000000,920608730
...,...,...,...,...,...,...
574,The Light Between Oceans,2016-09-02,Drama,PG-13,12545979,12545979
575,Queen of Katwe,2016-09-23,Drama,PG,8874389,8874389
576,Doctor Strange,2016-11-04,Adventure,PG-13,232532923,232532923
577,Moana,2016-11-23,Adventure,PG,246082029,246082029


In [220]:
movies_gross.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 579 entries, 0 to 578
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   movie_title               579 non-null    object        
 1   release_date              579 non-null    datetime64[ns]
 2   genre                     562 non-null    object        
 3   MPAA_rating               523 non-null    object        
 4   total_gross               579 non-null    int64         
 5   inflation_adjusted_gross  579 non-null    int64         
dtypes: datetime64[ns](1), int64(2), object(3)
memory usage: 27.3+ KB


Now we have gross and inflation adjusted gross values as integers and can continue wrangling the data.

Let's look at inflation adjusted gross in descending and ascending order to see the highest and lowest grossing Disney films.

In [222]:
movies_gross_sorted = movies_gross.sort_values(by='inflation_adjusted_gross', ascending=False)

# Top 10 earning movies
top_10_earning = movies_gross_sorted.head(10)
top_10_earning

Unnamed: 0,movie_title,release_date,genre,MPAA_rating,total_gross,inflation_adjusted_gross
0,Snow White and the Seven Dwarfs,1937-12-21,Musical,G,184925485,5228953251
1,Pinocchio,1940-02-09,Adventure,G,84300000,2188229052
2,Fantasia,1940-11-13,Musical,G,83320000,2187090808
8,101 Dalmatians,1961-01-25,Comedy,G,153000000,1362870985
6,Lady and the Tramp,1955-06-22,Drama,G,93600000,1236035515
3,Song of the South,1946-11-12,Adventure,G,65000000,1078510579
564,Star Wars Ep. VII: The Force Awakens,2015-12-18,Adventure,PG-13,936662225,936662225
4,Cinderella,1950-02-15,Drama,G,85000000,920608730
13,The Jungle Book,1967-10-18,Musical,Not Rated,141843000,789612346
179,The Lion King,1994-06-15,Adventure,G,422780140,761640898


In [223]:
movies_gross_sorted = movies_gross.sort_values(by='inflation_adjusted_gross', ascending=True)

# 10 lowest earning movies
lowest_10_earning = movies_gross_sorted.head(10)
lowest_10_earning

Unnamed: 0,movie_title,release_date,genre,MPAA_rating,total_gross,inflation_adjusted_gross
27,Amy,1981-03-20,Drama,,0,0
20,The Many Adventures of Winnie the Pooh,1977-03-11,,,0,0
355,Frank McKlusky C.I.,2002-01-01,,,0,0
29,Condorman,1981-08-07,Action,,0,0
511,Zokkomon,2011-04-22,Adventure,PG,2815,2984
487,Walt and El Grupo,2009-09-10,Documentary,PG,20521,23064
502,Gedo Senki (Tales from Earthsea),2010-08-13,Adventure,PG-13,48658,51988
251,The War at Home,1996-11-20,,R,34368,65543
280,An Alan Smithee Film: Burn Hollywood …,1998-02-27,Comedy,R,45779,82277
495,Waking Sleeping Beauty,2010-03-26,Documentary,PG,80741,86264
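
Four films in this list report a gross of exactly 0. One possible reading, and it is only an assumption here, is that these are missing box-office figures rather than films that truly earned nothing. A minimal sketch of how such rows could be counted and set aside is shown below; the `sample` frame is a hypothetical subset built from rows in the table above.

```python
import pandas as pd

# Hypothetical subset of the table above: two zero-gross rows, two reported rows
sample = pd.DataFrame(
    {
        "movie_title": ["Amy", "Condorman", "Zokkomon", "Walt and El Grupo"],
        "inflation_adjusted_gross": [0, 0, 2984, 23064],
    }
)

# Flag rows whose gross is exactly zero (assumed here to mean "not reported")
zero_mask = sample["inflation_adjusted_gross"] == 0
n_zero = int(zero_mask.sum())

# Keep only films with a reported, non-zero gross
reported = sample.loc[~zero_mask]
```

Because the analysis below sums revenue by decade, zero rows do not change the totals, but they would pull down any per-film average.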


Now let's look at release dates.

In [224]:
movies_gross_time = movies_gross.sort_values(by='release_date')
movies_gross_time

Unnamed: 0,movie_title,release_date,genre,MPAA_rating,total_gross,inflation_adjusted_gross
0,Snow White and the Seven Dwarfs,1937-12-21,Musical,G,184925485,5228953251
1,Pinocchio,1940-02-09,Adventure,G,84300000,2188229052
2,Fantasia,1940-11-13,Musical,G,83320000,2187090808
3,Song of the South,1946-11-12,Adventure,G,65000000,1078510579
4,Cinderella,1950-02-15,Drama,G,85000000,920608730
...,...,...,...,...,...,...
574,The Light Between Oceans,2016-09-02,Drama,PG-13,12545979,12545979
575,Queen of Katwe,2016-09-23,Drama,PG,8874389,8874389
576,Doctor Strange,2016-11-04,Adventure,PG-13,232532923,232532923
577,Moana,2016-11-23,Adventure,PG,246082029,246082029


We are looking at release dates from the end of 1937 to the end of 2016! Pretty cool. Now let's look at summary statistics for inflation adjusted gross.

In [225]:
inflation_gross_summary_float = movies_gross['inflation_adjusted_gross'].describe()
inflation_gross_summary_float

count    5.790000e+02
mean     1.187625e+08
std      2.860853e+08
min      0.000000e+00
25%      2.274123e+07
50%      5.515978e+07
75%      1.192020e+08
max      5.228953e+09
Name: inflation_adjusted_gross, dtype: float64

In [226]:
# Convert scientific notation to easily readable numbers
inflation_gross_summary_object = movies_gross['inflation_adjusted_gross'].describe()

for key, value in inflation_gross_summary_object.items():
    inflation_gross_summary_object[key] = '{:,.0f}'.format(value)

inflation_gross_summary_object

count            579.0
mean       118,762,523
std        286,085,280
min                  0
25%         22,741,232
50%         55,159,783
75%        119,202,000
max      5,228,953,251
Name: inflation_adjusted_gross, dtype: object
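
The loop above writes strings into a float Series one entry at a time, which relies on pandas silently changing the dtype along the way. As a hedged alternative, the formatting can be applied to a new Series in one step with `Series.map`. The `values` Series below is a hypothetical three-value stand-in for the real column.

```python
import pandas as pd

# Hypothetical stand-in for movies_gross['inflation_adjusted_gross']
values = pd.Series([0, 2984, 5228953251], name="inflation_adjusted_gross")

summary = values.describe()  # float64 Series of summary statistics

# Build a new object Series of formatted strings; the numeric summary is left untouched
summary_readable = summary.map("{:,.0f}".format)
```

Keeping `summary` numeric and `summary_readable` as a separate display copy means the statistics can still be used in later calculations.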

In [227]:
top_ten_plot = (
    alt.Chart(top_10_earning, width=500, height=300)
    .mark_bar()
    .encode(
        x=alt.X("movie_title:N", title="Movie Title", sort="-y"),
        y=alt.Y("inflation_adjusted_gross:Q", title="Inflation adjusted gross"),
    )
    .properties(title="Top 10 highest earning movies")
)

top_ten_plot

In [228]:
# Adding a decade column 
movies_gross_decade = pd.read_csv('data/disney_movies_total_gross.csv', parse_dates = ['release_date'])
movies_gross_decade['decade'] = (movies_gross_decade['release_date'].dt.year // 10) * 10
money_to_int(movies_gross_decade, 'inflation_adjusted_gross')

# DataFrame of total revenue by decade
decade_sum_df = movies_gross_decade.groupby('decade')['inflation_adjusted_gross'].sum().reset_index()
decade_sum_df

  df[money_column] = df[money_column].str.replace('\$', '').str.replace(',', '').astype(int)


Unnamed: 0,decade,inflation_adjusted_gross
0,1930,5228953251
1,1940,5453830439
2,1950,2706430071
3,1960,2989484231
4,1970,1062951109
5,1980,4636550126
6,1990,17743304509
7,2000,15791503349
8,2010,13150493912


In [229]:
# Plot of inflation adjusted revenue by decade
gross_decade_plot = (
    alt.Chart(decade_sum_df, width=500, height=300)
    .mark_bar()
    .encode(
        x=alt.X("decade:O", title="Decade of Releases"),
        y=alt.Y("inflation_adjusted_gross:Q", title="Inflation adjusted gross"),
    )
    .properties(title="Gross inflation adjusted revenue by decade")
)
gross_decade_plot

The 1990s appear to be Disney's highest-earning decade in inflation-adjusted terms.

In this project, I analyzed the `disney_movies_total_gross` data to see which decade produced the highest revenue, my measure for impact.

The 1990s were the highest-earning decade for Disney, with about 17.7 billion dollars in inflation-adjusted gross, followed by the 2000s with about 15.8 billion. There is a significant caveat to this finding, however: the data only goes up to 2016, and the 2010s total of about 13.2 billion for this partial decade is already roughly three quarters of the total for the entire 1990s.
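
One rough way to account for the incomplete final decade is to divide each decade's total by the number of calendar years it covers. The sketch below uses the decade totals from the table above and assumes that the 1990s and 2000s each cover 10 years while the 2010s cover 7 (2010 through 2016); the `years_covered` mapping is that assumption, not something computed from the data.

```python
import pandas as pd

# Decade totals copied from the decade table above
decade_totals = pd.DataFrame(
    {
        "decade": [1990, 2000, 2010],
        "inflation_adjusted_gross": [17743304509, 15791503349, 13150493912],
    }
)

# Assumed number of calendar years covered by the data in each decade
years_covered = {1990: 10, 2000: 10, 2010: 7}

decade_totals["years_covered"] = decade_totals["decade"].map(years_covered)
decade_totals["gross_per_year"] = (
    decade_totals["inflation_adjusted_gross"] / decade_totals["years_covered"]
)

# Decade with the highest average inflation-adjusted gross per covered year
best_decade = decade_totals.loc[decade_totals["gross_per_year"].idxmax(), "decade"]
```

Under these assumptions the 2010s average about 1.88 billion per year against about 1.77 billion for the 1990s, so the ranking by decade total may not hold once the remaining years of the 2010s are included.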

Further investigations could look into the rate at which movies are released and how that affects revenue. When does the market saturate on Disney movies, if at all? You could also look into MPAA rating and whether it has any effect on revenue. Many more questions can be asked and answered with this dataset.
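
As a starting point for the release-rate question, the number of films and the average gross per film can be computed alongside the decade total. This is only a sketch: the `films` frame is a hypothetical four-row subset taken from the tables above, and the output column names `n_films`, `total_gross_adj`, and `mean_gross_adj` are illustrative.

```python
import pandas as pd

# Hypothetical subset of the cleaned data, using rows shown earlier
films = pd.DataFrame(
    {
        "movie_title": [
            "Snow White and the Seven Dwarfs",
            "Pinocchio",
            "Fantasia",
            "The Lion King",
        ],
        "release_date": pd.to_datetime(
            ["1937-12-21", "1940-02-09", "1940-11-13", "1994-06-15"]
        ),
        "inflation_adjusted_gross": [5228953251, 2188229052, 2187090808, 761640898],
    }
)

# Same decade rule as in the analysis above
films["decade"] = (films["release_date"].dt.year // 10) * 10

# Films released, total gross, and mean gross per film for each decade
releases_by_decade = films.groupby("decade").agg(
    n_films=("movie_title", "count"),
    total_gross_adj=("inflation_adjusted_gross", "sum"),
    mean_gross_adj=("inflation_adjusted_gross", "mean"),
)
```

Comparing `n_films` with `mean_gross_adj` across decades would show whether the high totals of recent decades come from more releases or from higher earnings per film.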
