<h1>Investigate how the pace at which a movie earns its money relates to its IMDb rating</h1>

In [None]:
import pandas as pd
df = pd.read_csv('../input/remove-movies-with-inadequate-daily-data/Good_Daily_DataFrame.csv')

In [None]:
# Total gross per movie over its theater run
x = df.groupby('Movie_Title')['Daily'].sum()
# Movies that grossed under 10 million, plus an extra Polar Express entry with weird data
under_10mil = list(x[x < 1e7].index) + ['The Polar Express (2005)']
df = df.set_index('Movie_Title').drop(under_10mil).reset_index()

<h2>Find the Opening Day for each movie</h2>
Movies are not always released everywhere at once. Some begin with limited showings in a few theaters in Los Angeles or other major hubs and expand to a wide release a week or two later. The data I pulled from Box Office Mojo contains daily figures starting from the first release, but to analyze how many days it takes a movie to earn the majority of its money, I should exclude the early days when only a select few theaters are showing it. My goal is to compare how long movies with similar reviews take to earn most of their money during their theater runs. If limited-release days are included in the day count, their low theater counts produce low earnings, and the number of days needed to earn a majority of the total becomes artificially high.

<h3>Solution: Create a feature for each movie that records the first day it shows in more than 5% of the maximum number of theaters it reaches over its run</h3>
I am assuming that once a movie is showing in more than 5% of its maximum theater count, it is no longer in limited release. This threshold can be varied. If it is too low, the calculated Opening Day is more likely to still fall within a limited release. If it is too high, the calculated Opening Day risks being later than the actual Opening Day. This matters for movies that were not popular at release but grew more popular during their run: their real Opening Day may have been in a low number of theaters relative to the maximum they eventually reached.
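
As a rough illustration of how the threshold moves the calculated Opening Day, the sketch below applies three cutoffs to a made-up theater-count series. The numbers and the helper name `first_day_above` are hypothetical and are not taken from the Box Office Mojo data.

```python
import pandas as pd

# Hypothetical theater counts: a limited release followed by a wide expansion
theaters = pd.Series([4, 4, 6, 200, 200, 1200, 3000, 3000])

def first_day_above(theaters, fraction):
    """Return the 1-based day on which the theater count first exceeds
    `fraction` of the maximum theater count."""
    cutoff = theaters.max() * fraction
    above = (theaters > cutoff).to_numpy()
    return int(above.argmax()) + 1  # argmax gives the position of the first True

# Try a very low, a moderate, and a very high threshold
opening_days = {f: first_day_above(theaters, f) for f in (0.001, 0.05, 0.5)}
print(opening_days)
```

In this toy series the very low threshold picks the first limited-release day, while the very high one skips past the first expansion, which is the trade-off described above.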

In [None]:
def count_opening_day(group):
    '''
    Find a movie's Opening Day. Accepts one movie's group from a groupby on title and returns a tuple:
    (1-based opening day, earnings before opening day, share of total earnings before opening day)
    '''
    cutoff = group['Theaters'].max() * .05 # take 5 percent of max theaters as opening night indication
    opening_day = 0 # fall back to the first day if no day exceeds the cutoff (e.g. missing theater counts)
    for idx, theaters in enumerate(group['Theaters']): # find the first day where theaters surpassed the 5% threshold
        if theaters > cutoff:
            opening_day = idx
            break

    earnings_before_open = group['Daily'].iloc[:opening_day].sum() # positional slice of the days before opening
    return opening_day + 1, earnings_before_open, earnings_before_open/group['Daily'].sum()

In [None]:
opening_day_info = df.groupby('Movie_Title').apply(count_opening_day)
opening_day_df = opening_day_info.apply(pd.Series) # change series of tuples to dataframe
opening_day_df.columns=['Opening_Day', 'Earnings_Before_Opening', 'Percent_Total_Earnings']

In [None]:
opening_day_df[opening_day_df['Opening_Day']>1].sort_values('Percent_Total_Earnings', ascending = False)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(opening_day_df[opening_day_df['Opening_Day']>1].Percent_Total_Earnings * 100, binwidth = 2.5);
plt.xlabel('% of Total Earnings before Opening Day')
plt.ylabel('# Movies')
plt.title('Distribution of Total Earnings before Opening Day');

There are 242 movies that, according to our Opening Day feature, went into wide release sometime after the first day of daily data. The largest share of total earnings made before this Opening Day is a little over 15%, but most of these movies earned less than 7.5% of their total before it. Excluding limited-release days from our calculations therefore discards little of the daily earnings data.

<h2>Calculate how many days it took for movies to earn their money</h2>


In [None]:
opening_day_dict = opening_day_info.apply(lambda x: x[0]).to_dict() # take the first element of our opening day info series of tuples, which is the opening day, and turn it into a dictionary

In [None]:
def count_days_to_earn(group, percentage = .5):
    '''
    Count the days, starting on Opening Day, until cumulative earnings reach `percentage` of post-opening earnings
    '''
    opening_day = opening_day_dict[group.name]
    daily = group['Daily'].iloc[opening_day-1:] # only start counting on Opening Day (positional slice)
    goal = daily.sum() * percentage # the % of total earnings goal, where the total starts from the Opening Day
    cumsum = daily.cumsum()
    days = (cumsum < goal).sum() + 1 # days still short of the goal, plus the day the goal is reached

    return days
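
To make the counting rule concrete, here is a small worked example on made-up daily grosses. The values and the helper name `days_to_reach` are hypothetical. The sketch assumes that a day whose cumulative total lands exactly on the goal should count as the day the goal is reached, so it uses a strict comparison.

```python
import pandas as pd

# Hypothetical daily grosses for one movie, starting on its Opening Day
daily = pd.Series([40, 30, 10, 10, 5, 5])  # total = 100

def days_to_reach(daily, percentage):
    """Return the 1-based day on which cumulative earnings first reach
    `percentage` of the total."""
    goal = daily.sum() * percentage
    cumsum = daily.cumsum()  # 40, 70, 80, 90, 95, 100
    # Count the days still strictly below the goal, then add the day that reaches it
    return int((cumsum < goal).sum()) + 1

print(days_to_reach(daily, 0.5))   # goal 50: passed on day 2 (cumulative 70)
print(days_to_reach(daily, 0.85))  # goal 85: passed on day 4 (cumulative 90)
```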

In [None]:
days_to_fifty_percent = df.groupby('Movie_Title').apply(count_days_to_earn)
days_to_sixty_percent = df.groupby('Movie_Title').apply(count_days_to_earn, percentage=.6)
days_to_seventy_percent = df.groupby('Movie_Title').apply(count_days_to_earn, percentage=.7)
days_to_eighty_percent = df.groupby('Movie_Title').apply(count_days_to_earn, percentage=.8)
days_to_ninety_percent = df.groupby('Movie_Title').apply(count_days_to_earn, percentage=.9)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, axs = plt.subplots(3, 2, figsize = (15, 6))

fig.subplots_adjust(left  = 0.125, right = 0.9, bottom = 0.1, top = 0.9, wspace = .3, hspace = 1)

sns.histplot(days_to_fifty_percent, ax = axs[0,0], binwidth = 7);
sns.histplot(days_to_sixty_percent, ax = axs[1,0], binwidth = 7);
sns.histplot(days_to_seventy_percent, ax = axs[2,0], binwidth = 7);
sns.histplot(days_to_eighty_percent, ax = axs[0,1], binwidth = 7);
sns.histplot(days_to_ninety_percent, ax = axs[1,1], binwidth = 7);
fig.delaxes(axs[2][1]);

axs[0,0].set_title('Days to Reach 50% of Box Office Revenue')
axs[1,0].set_title('Days to Reach 60% of Box Office Revenue')
axs[2,0].set_title('Days to Reach 70% of Box Office Revenue')
axs[0,1].set_title('Days to Reach 80% of Box Office Revenue')
axs[1,1].set_title('Days to Reach 90% of Box Office Revenue')

for ax in axs.flat:
    ax.set(xlabel='Days', ylabel='# Movies')


There are two questions I wish to answer.
<ul>
<li>Do movies that earned most of their money later in their run tend to have better reviews?</li>
<li>Among movies with similar total earnings, do the movies that earned their money later in their run tend to have better reviews?</li>
</ul>
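
One possible way to approach the second question is to bin movies by total gross and look at the relationship within each bin. The sketch below uses made-up numbers. The column name `total_gross`, the choice of two quantile bins, and the use of a Pearson correlation are assumptions for illustration only, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical data: one row per movie
movies = pd.DataFrame({
    'total_gross': [12e6, 15e6, 20e6, 25e6, 150e6, 200e6, 250e6, 300e6],
    'days_to_fifty_percent': [5, 9, 14, 20, 4, 6, 8, 10],
    'imdb_vote': [5.5, 6.0, 6.8, 7.4, 6.0, 6.4, 7.0, 7.5],
})

# Split the movies into two equal-sized groups by total gross
movies['gross_bin'] = pd.qcut(movies['total_gross'], q=2, labels=['low', 'high'])

# Pearson correlation between days-to-50% and rating within each bin
corr_by_bin = {
    name: grp['days_to_fifty_percent'].corr(grp['imdb_vote'])
    for name, grp in movies.groupby('gross_bin', observed=True)
}
print(corr_by_bin)
```

Comparing within bins keeps a blockbuster from being compared directly against a small release, which is the point of the "similar total earnings" condition.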

In [None]:
days_to_fifty_percent[days_to_fifty_percent > 60]

<h2>Academy Awards</h2>

After reviewing the movies that took more than two months after release to reach 50% of their total earnings, I found that many of them have a revival in theaters around February. This revival follows Oscar nominations and awards, which renew interest several months after the release date. The renewed interest brings in more box office revenue, which pushes the number of days to earn 50% of total earnings way up. It seems a bit unfair to include these movies in our analysis as they stand, because the key quantity we are observing, the number of days to earn 50%, can be driven by the Oscars rather than by natural interest in the movie. Movies that earn Oscar nominations are also likely to have excellent reviews, so the extra revenue from nominations may heavily bias our results. With Oscars data, there could be ways to remove this effect from our analysis.

<h2>Load Reviews Dataset - IMDb</h2>

In [None]:
imdb_reviews = pd.read_csv('../input/imdb-extensive-dataset/IMDb ratings.csv')
imdb_movies = pd.read_csv('../input/imdb-extensive-dataset/IMDb movies.csv')
imdb_movies = imdb_movies[imdb_movies['year'].astype(str) != 'TV Movie 2019'] # drop the item with year == 'TV Movie 2019' by value rather than by hard-coded row index
display(imdb_reviews.head(1))
display(imdb_movies.head(1))

Merge the IMDb ratings and movies tables into one dataset containing each title and its weighted average vote

In [None]:
reviews = imdb_reviews[['imdb_title_id', 'weighted_average_vote']].merge(imdb_movies[['imdb_title_id','original_title','year']], on = 'imdb_title_id').drop(columns='imdb_title_id')
reviews['original_title_year'] =  reviews['original_title'] + reviews['year'].astype(str).apply(lambda x: ' ('+x+')')

In [None]:
reviews.head(3)

Merge the IMDb reviews dataset with the number of days it takes to reach 50% of total revenue, counted from Opening Day

In [None]:
df50 = days_to_fifty_percent.to_frame('days_to_fifty_percent').reset_index() # create DataFrame from the days_to_fifty_percent series to merge with the reviews dataframe

movie_votes_df = reviews[['original_title_year','weighted_average_vote']].merge(df50, left_on = 'original_title_year',right_on='Movie_Title',how='inner')
movie_votes_df = movie_votes_df.drop(columns = 'original_title_year').set_index('Movie_Title').rename(columns = {'weighted_average_vote':'imdb_vote'})
movie_votes_df.head(3)
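
Because the join relies on an exact match of the `Title (Year)` string, some movies may silently drop out of an inner merge. The following sketch shows one way to check for this on toy data, using `merge` with `indicator=True`. The titles, values, and the `_toy` variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the reviews table and the days-to-50% table
reviews_toy = pd.DataFrame({
    'original_title_year': ['Movie A (2010)', 'Movie B (2012)'],
    'weighted_average_vote': [7.1, 6.3],
})
df50_toy = pd.DataFrame({
    'Movie_Title': ['Movie A (2010)', 'Movie C (2015)'],
    'days_to_fifty_percent': [9, 12],
})

# A left merge with indicator=True labels the rows that found no match
check = df50_toy.merge(reviews_toy, left_on='Movie_Title',
                       right_on='original_title_year',
                       how='left', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', 'Movie_Title'].tolist()
print(unmatched)
```

Titles that show up in `unmatched` would be missing from the scatterplots, so a long list here may point to differences in title spelling or release year between the two sources.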

In [None]:
plt.scatter(movie_votes_df.days_to_fifty_percent, movie_votes_df.imdb_vote)
plt.title('IMDb rating vs Days to reach 50% of total earnings after opening day')
plt.xlabel('Days to 50%')
plt.ylabel('IMDb rating')

<h2>Need Oscars Data to see the full picture</h2>

In [None]:
oscars = pd.read_csv('../input/the-oscar-award/the_oscar_award.csv')

In [None]:
oscars['film_year'] = oscars.apply(lambda x: str(x.iloc[5]) + ' (' + str(x.iloc[0]) + ')', axis = 1) # build a 'Title (Year)' key from the 6th and 1st columns, accessed by position
oscars_nominated = set(oscars['film_year']) # get a set of all movies that were nominated

In [None]:
movie_votes_df['oscar_nominated'] = movie_votes_df.index.isin(oscars_nominated) # oscars dataset does not contain any 2020 movies, however my data does
print('{} movies were nominated for an Oscar'.format(movie_votes_df.oscar_nominated.sum()))

Scatterplot again, now with Oscar-nominated movies highlighted

In [None]:
nominated = movie_votes_df[movie_votes_df['oscar_nominated']]
normal = movie_votes_df[~movie_votes_df['oscar_nominated']]

plt.scatter(normal.days_to_fifty_percent, normal.imdb_vote, label = 'normal')
plt.scatter(nominated.days_to_fifty_percent, nominated.imdb_vote, color = 'orange', label = 'nominated')
plt.title('IMDb rating vs Days to reach 50% of total earnings after opening day')
plt.xlabel('Days to 50%')
plt.ylabel('IMDb rating')
plt.legend(loc = 4);

In [None]:
nominated = movie_votes_df[movie_votes_df['oscar_nominated'] & (movie_votes_df['days_to_fifty_percent']<50)]
normal = movie_votes_df[~movie_votes_df['oscar_nominated'] & (movie_votes_df['days_to_fifty_percent']<50)]
plt.scatter(normal.days_to_fifty_percent, normal.imdb_vote, label = 'normal')
plt.scatter(nominated.days_to_fifty_percent, nominated.imdb_vote, color = 'orange', label = 'nominated')
plt.title('IMDb rating vs Days to reach 50% (Zoomed In)')
plt.xlabel('Days to 50%')
plt.ylabel('IMDb rating')
plt.legend(loc = 4);

Next: Regression analysis