<h3> This notebook is a first attempt at cleaning the Daily Data scraped from BoxOfficeMojo. For every year, BoxOfficeMojo lists the top 200 grossing movies on a webpage. The Daily Dataset contains scraped data for all of these movies, but some of them do not have adequate data: either there are too few days of data to be meaningful, or the movie's daily domestic box office figures do not add up to its reported total domestic box office earnings. We will remove these movies from the dataset. </h3>

In [None]:
import numpy as np
import pandas as pd

In [None]:
daily_data = pd.read_csv('../input/movie-attributes-for-3400-movies-from-20002020/Daily_DataFrame.csv')
attributes = pd.read_csv('../input/movie-attributes-for-3400-movies-from-20002020/Attributes_DataFrame.csv')

In [None]:
display(daily_data.head(3))
display(attributes.head(3))

In [None]:
num_days = daily_data.groupby('Movie_Title')['Date'].count() # total days movie was in theaters

In [None]:
num_days.sort_values(ascending = False).head(20)

Among the longest-running movies are several documentaries that ran for a year or more. It is interesting that documentaries continue to stay in theaters long after their release and still attract customers.

In [None]:
num_days.value_counts().sort_index().head(30) # Show the number of movies that only have 1, 2, ... 30 daily data points

Many movies have very little data. I checked some of them on Box Office Mojo to make sure I scraped the information correctly, and some simply do not have good data there. To do any meaningful analysis on the full dataset, we need to remove these movies. One way to filter them out is to sum each movie's daily earnings over every day it ran in theaters, then compare that sum with its domestic box office total from Attributes_DataFrame. A large discrepancy between the two numbers means there is not enough daily box office data for that movie.
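
As a toy illustration of this check, here is a minimal sketch with made-up numbers. The frames `toy_daily` and `toy_attributes` are hypothetical stand-ins; only the column names are taken from the two DataFrames loaded above.

```python
import pandas as pd

# Toy data with made-up numbers; column names mirror the two DataFrames used here.
toy_daily = pd.DataFrame({
    'Movie_Title': ['A (2001)', 'A (2001)', 'B (2005)', 'B (2005)'],
    'Daily': [60, 40, 30, 20],
})
toy_attributes = pd.DataFrame({
    'Title': ['A (2001)', 'B (2005)'],
    'Domestic': [100, 200],
})

# Sum the daily grosses per movie and compare with the reported total.
counted = toy_daily.groupby('Movie_Title')['Daily'].sum()
reported = toy_attributes.set_index('Title')['Domestic']
toy_ratio = counted / reported  # the division aligns on the movie title

print(toy_ratio)
```

In this toy case movie A has all of its reported gross accounted for, while movie B has only a quarter of it and would be flagged as having inadequate daily data.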

In [None]:
# Sum the daily grosses for each movie to get the box office total covered by our daily data
domestic_box_office = daily_data.groupby('Movie_Title')['Daily'].sum()
# Rename the key column so it matches Attributes_DataFrame for the merge
domestic_box_office = domestic_box_office.reset_index().rename(columns = {'Movie_Title':'Title'})

# 'True' is the reported domestic total; 'Counted' is the sum of our daily data
true_vs_counted = pd.merge(attributes[['Title', 'Domestic']], domestic_box_office, on = 'Title').rename(columns = {'Domestic':'True', 'Daily':'Counted'})
true_vs_counted
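
The merge above keeps only titles that appear in both DataFrames, so any title that fails to match is dropped silently. The sketch below shows one way to list such titles, using hypothetical toy frames; whether the real data contains any mismatches is not something this notebook has checked.

```python
import pandas as pd

# Hypothetical toy frames; only the column names come from this notebook.
toy_attributes = pd.DataFrame({'Title': ['A (2001)', 'B (2005)'], 'Domestic': [100, 200]})
toy_counted = pd.DataFrame({'Title': ['A (2001)', 'C (2010)'], 'Daily': [100, 50]})

# An outer merge with indicator=True adds a '_merge' column that records
# whether each title came from the left frame, the right frame, or both.
check = pd.merge(toy_attributes, toy_counted, on='Title', how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']
print(unmatched[['Title', '_merge']])
```

Titles marked `left_only` have attributes but no daily data, and titles marked `right_only` have daily data but no attributes.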

In [None]:
ratio = true_vs_counted['Counted'] / true_vs_counted['True'] # Find what percent the counted domestic value is of the True domestic value

print('Number of movies with a ratio greater than 1: {}'.format(ratio.where(ratio>1).dropna().count()))
ratio.where(ratio > 1).dropna().sort_values(ascending = False)[:5] # show 5 highest ratio movies

There are 45 movies with a ratio greater than 1, which means the daily data adds up to a greater box office total than was reported. This can likely be attributed to rounding errors and is not something we need to worry about.
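
One way to check whether an overshoot is small enough to be rounding is to look at its size in dollars and as a percentage of the reported total. This is a minimal sketch on made-up totals; `Excess` and `Excess_Pct` are hypothetical column names introduced only for illustration.

```python
import pandas as pd

# Made-up totals for illustration; 'True' and 'Counted' follow the column names above.
toy = pd.DataFrame({
    'Title': ['A (2001)', 'B (2005)', 'C (2010)'],
    'True': [1_000_000, 2_000_000, 500_000],
    'Counted': [1_000_003, 2_000_000, 650_000],
})

# Keep only the movies where the daily data adds up to more than the reported total.
over = toy[toy['Counted'] > toy['True']].copy()
over['Excess'] = over['Counted'] - over['True']           # absolute overshoot
over['Excess_Pct'] = 100 * over['Excess'] / over['True']  # overshoot relative to the reported total

print(over.sort_values('Excess_Pct', ascending=False))
```

In the toy data, movie A overshoots by a few dollars, which is plausibly rounding, while movie C overshoots by 30% and would deserve a closer look.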

In [None]:
# Count the movies whose daily data covers less than each share of their reported domestic total
for label, threshold in [('80%', 0.80), ('90%', 0.90), ('95%', 0.95), ('99%', 0.99), ('99.9%', 0.999), ('99.99%', 0.9999), ('100%', 1)]:
    print('Number of movies that contain daily data for less than {} of their total domestic earnings: {}'.format(label, (ratio < threshold).sum()))
print('Number of movies that contain daily data for exactly 100% of their total domestic earnings: {}'.format((ratio == 1).sum()))

Surprisingly, almost half of our movies contain daily box office data that adds up to exactly their true domestic earnings. There are many more movies that are off by only a little. It is not obvious what the cutoff point is for useful data. Surely it is good enough to have data on 99% of a movie's earnings, but is 95% good enough? What about 90%? For now I will say that as long as a movie contains data for 95% of its domestic earnings, then that data is good enough and we will use that movie.

In [None]:
invalid_movies = ratio.where(ratio < .95).dropna().index # get indices of movies that contain daily data for less than 95% of their domestic box office total

In [None]:
dropped_movies = true_vs_counted.loc[invalid_movies].copy() # movies with daily data for less than 95% of their domestic box office total; copy so we can add a column safely
dropped_movies['Year'] = dropped_movies['Title'].str[-5:-1] # get the Year attribute, assuming each title ends in "(YYYY)"

import matplotlib.pyplot as plt

dropped_years_count = dropped_movies['Year'].value_counts().sort_index()
plt.figure(figsize=(10, 5)) # set the size before plotting so it applies to this figure
plt.bar(dropped_years_count.index, dropped_years_count.values)
plt.xticks(rotation=90)
plt.title('Movies Per Year With Inadequate Data')
plt.xlabel('Year')
plt.ylabel('Movies Per Year');

In [None]:
# Keep only the movies that were not dropped, then tally how many movies have 1, 2, ... days of data
remaining = daily_data[~daily_data['Movie_Title'].isin(dropped_movies['Title'])]
remaining.groupby('Movie_Title')['Date'].count().value_counts().sort_index().head(30) # Our new list of movies with the least amount of data

Most of the movies whose daily data on BoxOfficeMojo does not add up to their domestic box office total were released in the early 2000s. The reporting from this period simply was not completely accurate. After removing these movies, there are still several movies with very few days of daily data, although fewer than before. Removing movies whose daily data did not match their box office total did not clean out every low-data movie. Some of our dataset is affected by the unusual 2020 COVID year, with theaters closing early and good movies not being released, and other movies in our dataset simply still do not have good data.
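
The per-year breakdown depends on reading the release year from the end of each title. Below is a minimal sketch of a more defensive way to do that with a regular expression, assuming titles end in "(YYYY)". The example titles are hypothetical, and the trailing-year format is an assumption about the scraped data.

```python
import pandas as pd

# Hypothetical titles; the trailing "(YYYY)" format is an assumption about the scraped data.
titles = pd.Series(['Some Movie (2003)', 'Another Movie (2020)', 'No Year Here'])

# Capture four digits inside parentheses at the very end of the title.
# Titles that do not match get NaN instead of an arbitrary slice of characters.
years = titles.str.extract(r'\((\d{4})\)\s*$', expand=False)
print(years)
```

Compared with slicing fixed character positions, this leaves a missing value for any title that does not follow the expected pattern, which makes such titles easy to spot.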

In [None]:
# The 20 remaining movies with the fewest days of data
remaining = daily_data[~daily_data['Movie_Title'].isin(dropped_movies['Title'])]
remaining.groupby('Movie_Title')['Date'].count().sort_values()[:20]

In [None]:
# Days of data per movie after excluding the dropped movies (same as the last cell, but unsorted)
remaining = daily_data[~daily_data['Movie_Title'].isin(dropped_movies['Title'])]
days_per_movie = remaining.groupby('Movie_Title')['Date'].count()
good_movies = days_per_movie[days_per_movie > 30].index # find movies with greater than 30 days of data
# Boolean indexing keeps column dtypes and does not discard rows that have NaN in unrelated columns
good_daily_data = daily_data[daily_data['Movie_Title'].isin(good_movies)].copy() # Our final dataset of good movies

In [None]:
good_daily_data.to_csv('Good_Daily_DataFrame.csv')

To finish cleaning the daily dataset, I removed all movies with 30 days of data or fewer, because such short runs are not typical and probably not something you would want to account for in any type of model. Removing this data could add some bias to a model, but I am assuming there was something wrong with the data for these movies.
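
The day-count filter can also be written as a small reusable function. This is a sketch on toy data: `keep_movies_with_enough_days` is a hypothetical helper, not part of the notebook, and the toy threshold of 2 days is chosen only so the example is short.

```python
import pandas as pd

def keep_movies_with_enough_days(daily, min_days=30):
    """Keep rows for movies with more than `min_days` daily records (hypothetical helper)."""
    # transform('count') gives every row the number of records its movie has
    days = daily.groupby('Movie_Title')['Date'].transform('count')
    return daily[days > min_days].copy()

# Toy frame: movie A has 3 days of data, movie B has 1.
toy_daily = pd.DataFrame({
    'Movie_Title': ['A (2001)'] * 3 + ['B (2005)'],
    'Date': ['2001-01-01', '2001-01-02', '2001-01-03', '2005-06-01'],
    'Daily': [10, 8, 6, 5],
})

kept = keep_movies_with_enough_days(toy_daily, min_days=2)
print(kept['Movie_Title'].unique())
```

Using `transform` keeps the per-movie count aligned with the original rows, so the filter is a single boolean mask and the remaining columns are left untouched.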

<h3> To conclude, in this notebook we cleaned our daily dataset by removing all movies that did not contain daily data for at least 95% of their domestic box office earnings or did not contain more than 30 days of data. </h3>