## Module 1 Project

Please fill out:
* Student name: Jennifer Wadkins
* Student pace: self paced
* Scheduled project review date/time: 
* Instructor name: Jeff Herman
* Blog post URL:


### Importing our modules

We will be using the following libraries in this project: pandas, numpy, matplotlib, seaborn, and datetime.

In [86]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
%matplotlib inline

## Data Set 1 - The Numbers

First we will look at our movie budgets dataset from "The Numbers". When performing our EDA on ALL datasets in this project, we initially want to know things like:
    * What is the shape of our imported data?
    * How many data entries?
    * What format is the data in?
    * How can we remove the most obvious redundancies (columns we just don't need, etc)
    * Are there missing/null values in the dataset that will need to be removed or imputed?
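
The checklist above can be bundled into one small helper. This is only a sketch: `eda_summary` and the toy frame are hypothetical names and values, not part of the project data.

```python
import pandas as pd

def eda_summary(df):
    """Return a per-column summary: dtype, null count, and number of unique values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "nulls": df.isnull().sum(),
        "unique": df.nunique(),
    })

# Toy frame standing in for any of the project's datasets (values are made up)
toy = pd.DataFrame({
    "movie": ["A", "B", "C"],
    "gross": ["$1,000", "$0", None],
})

print(toy.shape)      # rows and columns
summary = eda_summary(toy)
print(summary)        # one row per column of the toy frame
```

Calling the same helper on each dataframe right after `read_csv` gives a consistent first look before any cleaning decisions are made.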

In [112]:
# movie budgets dataset
df1 = pd.read_csv('zippedData/tn.movie_budgets.csv')

# taking a look at what we've imported
df1.head()

# what is the shape of our data?
df1.shape

# what kind of data is stored?
df1.dtypes

# what are our columns?
df1.columns

# do we have any missing/null values?
df1.isnull().sum()
# since we know that all of our data is objects, we MAY actually have missing values. We won't be sure until later.
# for now let's look at the tail of the set and see if anything pops out.

df1.tail()
# we do, in fact, see entries with a $0 for gross. These aren't showing up as null because
# they are actual entries rather than null values. We will need to remove or impute these entries after we convert these cells.

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
5777,78,"Dec 31, 2018",Red 11,"$7,000",$0,$0
5778,79,"Apr 2, 1999",Following,"$6,000","$48,482","$240,495"
5779,80,"Jul 13, 2005",Return to the Land of Wonders,"$5,000","$1,338","$1,338"
5780,81,"Sep 29, 2015",A Plague So Pleasant,"$1,400",$0,$0
5781,82,"Aug 5, 2005",My Date With Drew,"$1,100","$181,041","$181,041"


On the movie budgets dataset, we find the following things to clean up and resolve:
    * We have 5782 entries. We'll want to explore how/why movies were included in this dataset, as it's not a very large dataset compared to the number of movies released over time.
    * All of the data in this set is stored as objects. A lot of the data is numbers, so we need it to be in a numerical format.
    * We have an id column, which can be used as our dataset index.
    * Many entries have $0 for gross. These don't show up as null in our initial EDA because they are actual entries of $0, not null values. We will need to remove these entries after we convert these cells.
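
One way to make the $0 placeholders visible to `isnull()` is to convert the strings to numbers and then mask the zeros. This is a sketch on made-up rows, and it assumes a gross of 0 means "not reported" rather than a true zero.

```python
import pandas as pd

# Toy rows in the same string format as the budgets file (values are made up)
toy = pd.DataFrame({
    "movie": ["X", "Y", "Z"],
    "domestic_gross": ["$0", "$48,482", "$1,338"],
})

# Strip "$" and "," and convert to integers
gross = (
    toy["domestic_gross"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype("int64")
)

# Treat 0 as "not reported" (an assumption): keep non-zero values, set the rest to NaN
toy["domestic_gross"] = gross.astype("float64").where(gross != 0)

missing = toy["domestic_gross"].isnull().sum()
print(missing)
```

Once the zeros are NaN, the usual `isnull().sum()` and `dropna()` tools apply to them like any other missing value.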

We're going to clean up this dataset in the following way before moving on:

    a) set the id as the index
    b) convert the release date into a standard datetime
    c) convert all cost/gross fields into integers
    d) add 2 new columns for domestic net and worldwide net
    e) remove rows without information for cost OR gross, as we won't be able to use this data

In [113]:
# sets the id as the index, removing a redundant column (former index)
df1.set_index('id', inplace=True)

# using pandas built-in datetime converter to change our release date column to standard format
df1['release_date'] = pd.to_datetime(df1['release_date'])


# write a function to convert the cost/gross object entries into proper numbers that we can use in calculation
def convert_numbers(x):
    '''Takes in a string formatted number that starts with $ and may include commas, and returns that 
    number as a whole integer that can be used in calculations'''
    x = x[1:]
    x = x.replace(',', '')
    x = int(x)
    return x

# run the function on each of our three cost/gross entries
df1['production_budget'] = df1['production_budget'].map(convert_numbers)
df1['domestic_gross'] = df1['domestic_gross'].map(convert_numbers)
df1['worldwide_gross'] = df1['worldwide_gross'].map(convert_numbers)

# add two new columns for domestic net and worldwide net
df1['domestic_net'] = df1['domestic_gross'] - df1['production_budget']
df1['worldwide_net'] = df1['worldwide_gross'] - df1['production_budget']

# check that the data now looks the way we want it
df1.tail()

id,release_date,movie,production_budget,domestic_gross,worldwide_gross,domestic_net,worldwide_net
78,2018-12-31,Red 11,7000,0,0,-7000,-7000
79,1999-04-02,Following,6000,48482,240495,42482,234495
80,2005-07-13,Return to the Land of Wonders,5000,1338,1338,-3662,-3662
81,2015-09-29,A Plague So Pleasant,1400,0,0,-1400,-1400
82,2005-08-05,My Date With Drew,1100,181041,181041,179941,179941


Now that we have corrected our numbers, we need to address the missing data that we identified before. We also want to figure out how the movies were selected for inclusion on this list, if possible, as it's clearly a small sample of all available released movies.

In [114]:
#checking out a few more things before we move on. Namely, what appears to be the minimum stat that warranted
# inclusion on this list?
df1.sort_values('worldwide_net', ascending=False)
# our net ranges from positive to negative, so it's not just top grossing movies

df1.sort_values('release_date', ascending=False)
# our release dates run the gamut from 1915 to 2020, so it's not just movies within the last x years

df1.sort_values('production_budget')
# production budget was clearly not a minimum requirement, as our budgets range from only a few thousand dollars
# to over 400 million dollars

sum(df1['production_budget'] == 0)
# all of the movies have a production budget. Regardless, we can't get enough info about success without any gross, so
# we'll be dropping the rows that have a gross of 0 for domestic

sum(df1['domestic_gross'] == 0)
# 548 of our entries have no data for domestic_gross. We can't use these in calculations, and we're not going
# to impute them, so we are going to drop these rows from the dataset.
df1 = df1[df1['domestic_gross'] !=0]


In [115]:
df1.sort_values('worldwide_net', ascending=False)

id,release_date,movie,production_budget,domestic_gross,worldwide_gross,domestic_net,worldwide_net
1,2009-12-18,Avatar,425000000,760507625,2776345279,335507625,2351345279
43,1997-12-19,Titanic,200000000,659363944,2208208395,459363944,2008208395
7,2018-04-27,Avengers: Infinity War,300000000,678815482,2048134200,378815482,1748134200
6,2015-12-18,Star Wars Ep. VII: The Force Awakens,306000000,936662225,2053311220,630662225,1747311220
34,2015-06-12,Jurassic World,215000000,652270625,1648854864,437270625,1433854864
...,...,...,...,...,...,...,...
5,2002-08-16,The Adventures of Pluto Nash,100000000,4411102,7094995,-95588898,-92905005
53,2001-04-27,Town & Country,105000000,6712451,10364769,-98287549,-94635231
42,2019-06-14,Men in Black: International,110000000,3100000,3100000,-106900000,-106900000
94,2011-03-11,Mars Needs Moms,150000000,21392758,39549758,-128607242,-110450242


We're still not sure how movies were chosen for this particular dataset, but at least we've cleaned up the data. We no longer have any movies in the set without a budget, gross and net information. All of our dates are in a standard format, and all of our money entries are in an integer format so that we can do further calculations with them.

In [116]:
#pd.plotting.scatter_matrix(df1[['production_budget', 'domestic_net', 'worldwide_net']], figsize=(15,15));

In [117]:
#df1.plot('release_date', 'domestic_net', kind='scatter', figsize=(10, 10));

Now that we're looking at some visualizations, we realize that this data goes back further than we really need. We're not aiming for the full history of cinema - we're aiming to capitalize on current trends and provide current recommendations. With this in mind, we will drop all entries that are more than 20 years old.
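
A 20-year cutoff can also be expressed directly as a date comparison. The sketch below uses a hypothetical helper `keep_recent` and made-up rows; the `today` parameter exists only so the example gives the same result whenever it is run.

```python
import pandas as pd

def keep_recent(df, years=20, today=None):
    """Keep rows whose release_date falls within the last `years` years."""
    today = pd.Timestamp.now().normalize() if today is None else pd.Timestamp(today)
    cutoff = today - pd.DateOffset(years=years)
    return df[df["release_date"] >= cutoff]

# Toy rows (titles are placeholders)
toy = pd.DataFrame({
    "movie": ["Old", "Borderline", "New"],
    "release_date": pd.to_datetime(["1997-12-19", "2000-06-01", "2018-04-27"]),
})

# Pin "today" so the example is reproducible
recent = keep_recent(toy, years=20, today="2020-06-01")
print(recent["movie"].tolist())
```

Comparing against a cutoff date avoids converting time differences into fractional years, which depends on how long a "year" is taken to be.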

In [124]:
# pd.datetime was removed from pandas; pd.Timestamp gives today's date at midnight
current_date = pd.Timestamp.now().normalize()
current_date

# age of each movie in years, using the mean Gregorian year of 365.2425 days
df1['movie_age'] = (current_date - df1['release_date']).dt.days / 365.2425

#df1.drop(df1[(df1['movie_age'] >= 20)].index, inplace=True)

df1.sort_values('movie_age').tail()
sum(df1['movie_age'] >= 20)


1522

## Data Set 2 - The Movie Database

Time to work with data from a different source. We're now pulling in movie information from TMDB - The Movie Database.

We're going to perform our EDA on this dataset, including:
    * What is the shape of our imported data?
    * How many data entries?
    * What format is the data in?
    * How can we remove the most obvious redundancies (columns we just don't need, etc)
    * Are there missing/null values in the dataset that will need to be removed or imputed?

In [6]:
#the movie database movies dataset
df2 = pd.read_csv('zippedData/tmdb.movies.csv')

# taking a look at what we've imported
df2.head()

# what is the shape of our data?
df2.shape
# this dataset has 26,517 movie entries

# what kind of data is stored?
df2.dtypes
# Most of the data in this set seems to be stored in the correct format already (numbers as numbers, etc)
# we'll change the date to a proper date/time

# what are our columns?
df2.columns
# we can definitely reassign our index

# do we have any missing/null values?
df2.isnull().sum()
# This dataset has no missing values. That doesn't mean there aren't categorical placeholders, and we will look into that further

df2['vote_count'].value_counts()
# There are 6541 entries in this dataset with only 1 vote. We're going to look at these entries later and figure out what is
# unusual about them.


1       6541
2       3044
3       1757
4       1347
5        969
        ... 
2328       1
6538       1
489        1
2600       1
2049       1
Name: vote_count, Length: 1693, dtype: int64

In [7]:
df2.describe()
# One thing we can see in this dataset is that there are a LOT of movies with 5 or fewer votes. A full 50% of the dataset
# has 5 or fewer votes. We will look more into this and figure out the situation.

df2.sort_values('popularity').head(30)
# while sorting on popularity, I also notice for the first time that a lot of the genre_ids on this low popularity list are absent

sum(df2['genre_ids'] == '[]')
# we have 2479 entries where there is no genre id

2479

In [8]:
#studying the data to look for bad or less-than-useful data

# how many entries with no genre id, popularity 1 or less and vote count 5 or less?
temp = df2.loc[(df2['genre_ids'] == '[]') & (df2['popularity'] <= 1) & (df2['vote_count'] <= 5)].sort_values('popularity', ascending=False)
temp
# 2137 entries

# what about how many entries with popularity 1 or less and vote count 5 or less?
temp = df2.loc[(df2['popularity'] <= 1) & (df2['vote_count'] <= 5)].sort_values('popularity', ascending=False)
temp
# 10636 entries

# how many entries with vote count 5 or less and MORE than 1 popularity?
temp = df2.loc[(df2['popularity'] > 1) & (df2['vote_count'] <= 5)].sort_values('popularity', ascending=False)
temp
#3022 entries

# how many entries with vote count 30 or less?
temp = df2.loc[(df2['vote_count'] <= 30)].sort_values('popularity', ascending=False)
temp.head()
#20170 entries with vote count 30 or less

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
23901,23901,[16],495925,ja,映画ドラえもん のび太の宝島,20.176,2018-12-31,Doraemon the Movie: Nobita's Treasure Island,5.9,21
23933,23933,"[27, 14, 35, 10770]",518158,en,Leprechaun Returns,16.973,2018-12-11,Leprechaun Returns,4.8,30
23937,23937,"[18, 10749]",571346,en,American Kamasutra,16.908,2018-12-13,American Kamasutra,4.1,9
2522,2522,[18],67308,cn,3D肉蒲團之極樂寶鑑,14.413,2011-04-14,3-D Sex and Zen: Extreme Ecstasy,4.9,29
23996,23996,"[28, 878, 53]",522964,en,Incoming,14.411,2018-05-04,Incoming,3.7,29


We're going to do the following initial work on this dataset to clean it up:
    * Drop entries with 30 or fewer votes. Our client is looking for a blockbuster, not a bespoke production.
    * Drop entries with no genre specified. We'll want to use the genre to make recommendations.
    * Drop entries with a popularity of 1.0 or less, for the same reason as the vote threshold.
    * Set the index to the unnamed first column.
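
The three drop rules above can be combined into a single boolean mask. The sketch below runs on made-up rows and additionally parses the text in `genre_ids` into real lists with `ast.literal_eval`; that parsing step is an assumption about how the column might be used later, not something the notebook does.

```python
import ast
import pandas as pd

# Toy rows shaped like the TMDB file (values are made up)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genre_ids": ["[16]", "[]", "[18, 10749]", "[28]"],
    "popularity": [20.2, 0.6, 14.4, 0.9],
    "vote_count": [21, 1, 290, 45],
})

# genre_ids is read from CSV as text; turn each entry into a real Python list
toy["genre_ids"] = toy["genre_ids"].map(ast.literal_eval)

# Keep rows that pass all three rules at once
keep = (
    (toy["vote_count"] > 30)
    & (toy["popularity"] > 1.0)
    & (toy["genre_ids"].map(len) > 0)
)
cleaned = toy[keep]
print(cleaned["title"].tolist())
```

Building one mask and filtering once keeps the thresholds in a single place, which makes them easier to adjust than three separate `drop` calls.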

In [9]:
# cleaning up this dataset

# set our index equal to the first column
# we might go back later and drop this index column and make the title the index
df2.set_index('Unnamed: 0', inplace=True)

# Drop all entries with a vote count of 30 or less
df2.drop(df2[(df2['vote_count'] <= 30)].index, inplace=True)

# Drop all entries with popularity 1 or less
df2.drop(df2[(df2['popularity'] <= 1.00)].index, inplace=True)

# Drop all entries with no genre id
df2.drop(df2[(df2['genre_ids'] == '[]')].index, inplace=True)

# using pandas built-in datetime converter to change our release date column to standard format
df2['release_date'] = pd.to_datetime(df2['release_date'])

In [10]:
len(df2)
# We now have a dataset of 6322 rows


# We still need to figure out the reason for the inclusion on this dataset

df2.sort_values('release_date')
# There are older movies included as well as newer movies

# Based on our earlier research, the dataset included plenty of unpopular movies
# So at this point, we're not entirely sure about the criteria for inclusion


Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
14335,"[18, 10752]",143,en,All Quiet on the Western Front,9.583,1930-04-29,All Quiet on the Western Front,7.8,299
11192,"[18, 36, 10749]",887,en,The Best Years of Our Lives,9.647,1946-12-25,The Best Years of Our Lives,7.8,243
14740,"[18, 53]",43397,en,Caught,5.439,1949-02-17,Caught,6.5,31
120,[878],830,en,Forbidden Planet,10.274,1956-03-15,Forbidden Planet,7.3,388
24211,[18],614,sv,Smultronstället,9.381,1957-12-26,Wild Strawberries,8.1,595
...,...,...,...,...,...,...,...,...,...
23947,"[80, 28, 53]",438674,en,Dragged Across Concrete,16.389,2019-03-22,Dragged Across Concrete,6.6,127
24204,"[18, 10749, 10402]",440298,pl,Zimna wojna,9.480,2019-03-22,Cold War,7.6,533
24084,"[53, 18]",500904,en,A Vigilante,11.743,2019-03-29,A Vigilante,5.1,68
24308,[18],518496,fr,Sauvage,8.182,2019-04-10,Sauvage,6.8,42


## Data Set 3 - Box Office Mojo

We're going to perform the same EDA that we have done on the previous datasets.

In [11]:
#Box Office Mojo movie gross
df3 = pd.read_csv('zippedData/bom.movie_gross.csv')
df3.head()

# what is the shape of our data?
df3.shape
# this dataset has 3387 movie entries

# what kind of data is stored?
df3.dtypes
# Most of this data is stored correctly, except foreign_gross. We will have to fix this column

# what are our columns?
df3.columns
# Not a lot of unclear data here

# do we have any missing/null values?
df3.isnull().sum()
# This dataset has a few missing values in domestic_gross and many in foreign_gross. We will definitely need to deal with
# domestic gross at least, as we need this information for our recommendations

df3['studio'].value_counts()
# There are some odd one-off studios listed here. We might not use these entries, as our client is looking to emulate
# the successful studios

df3

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010
...,...,...,...,...,...
3382,The Quake,Magn.,6200.0,,2018
3383,Edward II (2018 re-release),FM,4800.0,,2018
3384,El Pacto,Sony,2500.0,,2018
3385,The Swan,Synergetic,2400.0,,2018


We're going to clean up this dataset in the following way before moving on:

    a) Get rid of bespoke productions by eliminating all entries that are a studio's only movie
    b) Get rid of all entries with no information on domestic gross
    c) Turn our foreign gross numbers into floats instead of objects
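
For step (c), a minimal sketch of the conversion is shown below on made-up rows. It assumes the column is read as text because some entries contain commas as thousands separators; that has not been verified against the actual file.

```python
import pandas as pd

# Toy rows (values are made up); None stands in for a missing foreign gross
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "foreign_gross": ["652000000", "1,500", None],
})

# Remove thousands separators, then let pandas coerce anything unparseable to NaN
toy["foreign_gross"] = pd.to_numeric(
    toy["foreign_gross"].str.replace(",", "", regex=False),
    errors="coerce",
)
print(toy["foreign_gross"].tolist())
```

With `errors="coerce"`, missing or malformed entries become NaN instead of raising, so they can be counted with `isnull().sum()` afterwards.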


In [12]:
sum(df3['studio'].value_counts() == 1)

# getting rid of all entries with no information for domestic gross
df3.drop(df3[(df3['domestic_gross'].isnull())].index, inplace=True)

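# turning our foreign_gross entries into floats
# (left disabled: .notnull() returns True/False flags, so this line would convert those flags, not the gross values)
#df3['foreign_gross'] = df3['foreign_gross'].notnull().apply(lambda x: float(x))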

# dropping studio counts of only 1
counts = df3['studio'].value_counts()
df3.drop(df3[df3['studio'].isin(counts[counts == 1].index)].index, inplace=True)

In [13]:
df3

#df3['studio'].value_counts()

#temp = df3.loc[(df3['domestic_gross']<50000)]
#temp

df3['year'].min()
# The oldest movie on this list is from 2010.
# This might be acceptable, as we should strive to use more recent data for our recommendations
# to account for the current moviegoing climate

2010

## Data Sets 4-9 IMDB

We're going to do our EDA on each of these datasets, exploring how they will interact with each other when we merge them. We'll determine what needs to be cleaned before vs. after merging the datasets.

### Set 4 - IMDB user ratings per movie

In [14]:
#imdb user ratings per movie
df4 = pd.read_csv('zippedData/title.ratings.csv')

# taking a look at what we've imported
df4.head()
# this dataset is using the movie id and showing the average rating, and the number of votes

# what is the shape of our data?
df4.shape
# this dataset has 73,856 movie entries

# what kind of data is stored?
df4.dtypes
# The data in this set appears to be stored in the proper formats

# what are our columns?
df4.columns
# The 'tconst' will be found throughout our IMDB datasets. We will consider turning it into our index for all of the IMDB datasets.

# do we have any missing/null values?
df4.isnull().sum()
# This dataset has no missing values. That doesn't mean there aren't categorical placeholders, and we will look into that further

# how many entries with vote count 30 or less?
temp = df4.loc[(df4['numvotes'] <= 30)]
temp
#30553 entries with vote count 30 or less. We are going to drop all of these entries, but we will do this AFTER merging.

Unnamed: 0,tconst,averagerating,numvotes
2,tt1042974,6.4,20
4,tt1060240,6.5,21
13,tt1193623,8.0,5
15,tt1204784,5.8,6
24,tt1258812,4.0,21
...,...,...,...
73850,tt9783738,7.4,7
73851,tt9805820,8.1,25
73852,tt9844256,7.5,24
73853,tt9851050,4.7,14


In [15]:
df4.set_index('tconst', inplace=True)

##### Conclusions for Dataset 4:

We made the unique "tconst" into our index.

### Set 5 - cast and crew per movie

In [16]:
#imdb primary cast and crew per movie
df5 = pd.read_csv('zippedData/title.principals.csv')

# taking a look at what we've imported
df5.head()
# this dataset uses the movie id (tconst) and lists each principal cast/crew member's person id, category, job, and characters

# what is the shape of our data?
df5.shape
# this dataset has 1,028,186 cast and crew entries

# what kind of data is stored?
df5.dtypes
# The data in this set appears to be stored in the proper formats

# what are our columns?
df5.columns
# The 'tconst' will be found throughout our IMDB datasets. We will turn it into our index for all of the IMDB datasets.

# do we have any missing/null values?
df5.isnull().sum()
# This dataset has large numbers of missing values. We will inspect the data itself to determine if this is important.

df5.head()

Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0111414,1,nm0246005,actor,,"[""The Man""]"
1,tt0111414,2,nm0398271,director,,
2,tt0111414,3,nm3739909,producer,producer,
3,tt0323808,10,nm0059247,editor,,
4,tt0323808,1,nm3579312,actress,,"[""Beth Boothby""]"


In [17]:
# After inspecting the data, we can see that the "job" column is generally an extension of the "category" column 
# We will drop this column.
df5.drop(columns=['job'], inplace=True)

# We can also see that the "ordering" column is just for sorting the different jobs for each movie id
# we don't really need this column and will remove it as well
df5.drop(columns=['ordering'], inplace=True)

# lastly, we want all of our data to contribute to a recommendation, and while the actors themselves may be important,
# the characters they play do not seem particularly important. We will also drop the "characters" column
df5.drop(columns=['characters'], inplace=True)

df5.head()



Unnamed: 0,tconst,nconst,category
0,tt0111414,nm0246005,actor
1,tt0111414,nm0398271,director
2,tt0111414,nm3739909,producer
3,tt0323808,nm0059247,editor
4,tt0323808,nm3579312,actress


##### Conclusions for Dataset 5:

This dataset had three unnecessary columns which were removed. We now have a cleaned list of the cast and crew for each movie id.

After studying this dataset, we see that the movie id (tconst) is not unique here, since each movie has one row per cast or crew member. Because of this, we will not turn the tconst value into the index for this dataset.
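
Whether a column is safe to use as an index can be checked directly. The sketch below uses three rows shaped like the cast and crew output above; the variable names are illustrative only.

```python
import pandas as pd

# Three rows shaped like the cast and crew output shown earlier
toy = pd.DataFrame({
    "tconst": ["tt0111414", "tt0111414", "tt0323808"],
    "nconst": ["nm0246005", "nm0398271", "nm0059247"],
    "category": ["actor", "director", "editor"],
})

# A column is only a safe index if every value appears once
print(toy["tconst"].is_unique)

# How many rows does each movie id have?
rows_per_movie = toy.groupby("tconst").size()
print(rows_per_movie)

# The (movie, person) pair may still be unique even when tconst alone is not
pair_is_unique = not toy.duplicated(subset=["tconst", "nconst"]).any()
print(pair_is_unique)
```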

### Set 6 - director and writer assignments per movie

In [18]:
#IMDB directors and writers per movie
df6 = pd.read_csv('zippedData/title.crew.csv')
df6


Unnamed: 0,tconst,directors,writers
0,tt0285252,nm0899854,nm0899854
1,tt0438973,,"nm0175726,nm1802864"
2,tt0462036,nm1940585,nm1940585
3,tt0835418,nm0151540,"nm0310087,nm0841532"
4,tt0878654,"nm0089502,nm2291498,nm2292011",nm0284943
...,...,...,...
146139,tt8999974,nm10122357,nm10122357
146140,tt9001390,nm6711477,nm6711477
146141,tt9001494,"nm10123242,nm10123248",
146142,tt9004986,nm4993825,nm4993825


This appears to give the same information as the previous dataset, but in a different format. Let's do a few comparisons and see if that is the case.

In [19]:
temp = df5.loc[df5['tconst'] == 'tt0417610']
temp
# our director is nm1145057 and our writer is nm0083201, let's check if it's the same in dataset 6

temp = df6.loc[df6['tconst'] == 'tt0417610']
temp
# at first glance it's not the same! But then we see that the director is also a writer.

# using this information, we'll have to decide if we want to value when a person is credited in multiple roles.

# let's check one more multi-role
temp = df5.loc[df5['tconst'] == 'tt0999913']
temp
# we have 1 director and 3 writers listed

temp = df6.loc[df6['tconst'] == 'tt0999913']
temp
# 1 director and 4 writers, where one of the writers is the director.

# Let's take a look at a listing from this dataset with no writer attached, in dataset 5
temp = df5.loc[df5['tconst'] == 'tt0879859']
temp
# there is indeed no writer attached to this movie according to dataset 5

Unnamed: 0,tconst,nconst,category
144129,tt0879859,nm1269186,editor
144130,tt0879859,nm0028844,actor
144131,tt0879859,nm2421419,actress
144132,tt0879859,nm0090301,actress
144133,tt0879859,nm3127072,actress
144134,tt0879859,nm2416460,director
144135,tt0879859,nm0505953,producer
144136,tt0879859,nm0614195,producer
144137,tt0879859,nm1244349,composer
144138,tt0879859,nm0806706,cinematographer


#### Dataset 6 conclusions:

Based on what we are seeing here, we are NOT going to use this dataset. We'll use the other cast and crew dataset to get this same information already broken apart, rather than having to break apart this dataset.
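
For reference, "breaking apart" a comma-separated column like `writers` would look roughly like the sketch below, which uses two rows copied from the output above. It is shown only to illustrate the extra work avoided by using the other dataset.

```python
import pandas as pd

# Two rows shaped like title.crew.csv
toy = pd.DataFrame({
    "tconst": ["tt0835418", "tt0878654"],
    "writers": ["nm0310087,nm0841532", "nm0284943"],
})

# Split the comma-separated ids into lists, then give each id its own row
long = (
    toy.assign(nconst=toy["writers"].str.split(","))
       .explode("nconst")
       .drop(columns="writers")
       .reset_index(drop=True)
)
long["category"] = "writer"
print(long)
```

The result has one row per movie and writer, which is the same long layout the cast and crew dataset already provides.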

### Set 7 - movie stats

In [20]:
#imdb stats per movie
df7 = pd.read_csv('zippedData/title.basics.csv')

# taking a look at what we've imported
df7.head()
# this dataset is using the movie id and finally we have the title of the movie, as well as the year, the runtime, and the genres

# what is the shape of our data?
df7.shape
# this dataset has 146,144 movie entries

# what kind of data is stored?
df7.dtypes
# The data in this set appears to be stored in the proper formats

# what are our columns?
df7.columns
# The 'tconst' is found throughout our IMDB datasets and is the movie identifier
# we will want to understand the distinction between primary_title and original_title

# do we have any missing/null values?
df7.isnull().sum()
# This dataset has some missing values. We will inspect the data itself to determine if this is important.
# there are no primary titles or years missing, which seems like the most important data to have

# let's look at where the primary title and original title don't match in order to understand more about that
temp = df7.loc[(df7['primary_title']) != (df7['original_title'])]
temp
# We can see from this that the original title is the movie's foreign language title. We will use the translated titles
# and drop this column

# Does this list include only movies, or does it also have shows? Let's take a look at runtime minutes
df7.sort_values('runtime_minutes', ascending=False).head()
# It's not clear if these are movies or shows

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
132389,tt8273150,Logistics,Logistics,2012,51420.0,Documentary
44840,tt2659636,Modern Times Forever,Modern Times Forever,2011,14400.0,Documentary
123467,tt7492094,Nari,Nari,2017,6017.0,Documentary
87264,tt5068890,Hunger!,Hunger!,2015,6000.0,"Documentary,Drama"
88717,tt5136218,London EC1,London EC1,2015,5460.0,"Comedy,Drama,Mystery"


In [21]:
df7.drop(columns=['original_title'], inplace=True)

df7.set_index('tconst', inplace=True)

df7

tconst,primary_title,start_year,runtime_minutes,genres
tt0063540,Sunghursh,2013,175.0,"Action,Crime,Drama"
tt0066787,One Day Before the Rainy Season,2019,114.0,"Biography,Drama"
tt0069049,The Other Side of the Wind,2018,122.0,Drama
tt0069204,Sabse Bada Sukh,2018,,"Comedy,Drama"
tt0100275,The Wandering Soap Opera,2017,80.0,"Comedy,Drama,Fantasy"
...,...,...,...,...
tt9916538,Kuambil Lagi Hatiku,2019,123.0,Drama
tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,,Documentary
tt9916706,Dankyavar Danka,2013,,Comedy
tt9916730,6 Gunn,2017,116.0,


##### Dataset 7 conclusions:

This dataset seems nearly ready to use. We dropped the original_title column and decided to use the primary (translated) titles.

We set the index as the unique value tconst.

### Set 8 - alternate titles

In [22]:
#imdb alternate titles
df8 = pd.read_csv('zippedData/title.akas.csv')

# taking a look at what we've imported
df8


Unnamed: 0,title_id,ordering,title,region,language,types,attributes,is_original_title
0,tt0369610,10,Джурасик свят,BG,bg,,,0.0
1,tt0369610,11,Jurashikku warudo,JP,,imdbDisplay,,0.0
2,tt0369610,12,Jurassic World: O Mundo dos Dinossauros,BR,,imdbDisplay,,0.0
3,tt0369610,13,O Mundo dos Dinossauros,BR,,,short title,0.0
4,tt0369610,14,Jurassic World,FR,,imdbDisplay,,0.0
...,...,...,...,...,...,...,...,...
331698,tt9827784,2,Sayonara kuchibiru,,,original,,1.0
331699,tt9827784,3,Farewell Song,XWW,en,imdbDisplay,,0.0
331700,tt9880178,1,La atención,,,original,,1.0
331701,tt9880178,2,La atención,ES,,,,0.0


##### Dataset 8 conclusions:

It is immediately apparent that this dataset lists all of the alternate titles for each movie id.

We won't be using this dataset.

### Set 9 - detailed crew information

In [23]:
#imdb detailed crew information
df9 = pd.read_csv('zippedData/name.basics.csv')

# taking a look at what we've imported
df9.head()
# this dataset has the information about the cast and crew ids

# what is the shape of our data?
df9.shape
# this dataset has 606,648 people entries

# what kind of data is stored?
df9.dtypes
# The data in this set appears to be stored in the proper formats

# what are our columns?
df9.columns

# do we have any missing/null values?
df9.isnull().sum()
# This dataset has a lot of missing values for birth year, death year, profession, and known for.
# We don't need some of this information, including birth year, profession and known for
# We will keep death year to make sure we don't make any recommendations for cast/crew that is deceased

nconst                     0
primary_name               0
birth_year            523912
death_year            599865
primary_profession     51340
known_for_titles       30204
dtype: int64

In [24]:
# the only info we need on people is if they are alive, so we will drop their year of birth
df9.drop(columns=['birth_year'], inplace=True)

# We don't need the specific professions of our players. We can see their role from dataset 5
df9.drop(columns=['primary_profession'], inplace=True)

# We're going to use other, more quantifiable metrics of popularity than the known for information
df9.drop(columns=['known_for_titles'], inplace=True)

# we will make the unique nconst the index
df9.set_index('nconst', inplace=True)

In [25]:
df9.head()
df9.sort_values('death_year').head()
# now we realize that we can have writers and composers that are long deceased. We are going to keep the death_year column.

nconst,primary_name,death_year
nm0653992,Ovid,17.0
nm0613556,Shikibu Murasaki,1031.0
nm0019604,Dante Alighieri,1321.0
nm0090504,Giovanni Boccaccio,1375.0
nm1063158,Cheng'en Wu,1581.0


##### Dataset 9 conclusions:

We got rid of some unnecessary columns: birth year, profession and "known for" titles
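
The remaining `death_year` column can be used to restrict recommendations to people with no recorded death. The sketch below uses one real row from the output above plus two placeholder people. It assumes a missing `death_year` means the person is alive, although it could also mean the year was never recorded.

```python
import numpy as np
import pandas as pd

# One real row from the output above plus two placeholder people
toy = pd.DataFrame(
    {
        "primary_name": ["Ovid", "Person A", "Person B"],
        "death_year": [17.0, np.nan, np.nan],
    },
    index=pd.Index(["nm0653992", "nm_example_1", "nm_example_2"], name="nconst"),
)

# Keep only people with no recorded death year (assumed to be living)
living = toy[toy["death_year"].isnull()]
print(living)
```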

### IMDB data set observations/summaries

df4 - User ratings and votes for each movie id. Join on movie id (tconst).

df5 - Cast and crew for each movie id. Join on movie id tconst and person id nconst. Consider this join as a separate dataframe.

df6 - DO NOT USE. Redundant information with df5.

df7 - Movie title, year, runtime and genre for each movie id. Join on movie id (tconst).

df8 - DO NOT USE. Alternate titles.

df9 - Cast and crew info. Join on nconst.
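
Before running the joins listed above, it can help to see how many rows fail to match. The sketch below uses two hypothetical toy frames, `basics` and `ratings`, with made-up ids, and relies on the `validate` and `indicator` options of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for the movie stats and ratings tables (ids and values are made up)
basics = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "primary_title": ["Film One", "Film Two", "Film Three"],
})
ratings = pd.DataFrame({
    "tconst": ["tt1", "tt2"],
    "averagerating": [8.8, 6.4],
    "numvotes": [1841066, 20],
})

# validate raises if tconst is duplicated on either side;
# indicator adds a _merge column showing where each row came from
merged = basics.merge(
    ratings, on="tconst", how="left", validate="one_to_one", indicator=True
)

unmatched = (merged["_merge"] == "left_only").sum()
print(unmatched)
```

Rows marked `left_only` are movies with no rating entry, which are the rows that would be dropped after a left join followed by a null filter.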


In [26]:
# We are joining our df4 and df7 on the tconst which is the movie id
imdb_movies = df7.join(df4, how='left')
imdb_movies

#how many null values are there in the averagerating and numvotes categories?
imdb_movies.isnull().sum()

# we're not interested in any movies that aren't even popular enough to have ratings. We are dropping all movies
# with no rating entries, and all movies with fewer than 30 votes, just like our df2 cleanup
imdb_movies.drop(imdb_movies[imdb_movies['averagerating'].isnull()].index, inplace=True)
imdb_movies.drop(imdb_movies[imdb_movies['numvotes'] <= 30].index, inplace=True)

imdb_movies.sort_values('numvotes', ascending=False).head()
# We now have 43,303 entries

tconst,primary_title,start_year,runtime_minutes,genres,averagerating,numvotes
tt1375666,Inception,2010,148.0,"Action,Adventure,Sci-Fi",8.8,1841066.0
tt1345836,The Dark Knight Rises,2012,164.0,"Action,Thriller",8.4,1387769.0
tt0816692,Interstellar,2014,169.0,"Adventure,Drama,Sci-Fi",8.6,1299334.0
tt1853728,Django Unchained,2012,165.0,"Drama,Western",8.4,1211405.0
tt0848228,The Avengers,2012,143.0,"Action,Adventure,Sci-Fi",8.1,1183655.0


In [27]:
# we are joining our df5 and df9 to move the cast and crew information over to where they have performed

imdb_crew = df5.join(df9, on='nconst', how='inner')
# the inner join dropped a few hundred rows (out of over a million): credits whose person id has no matching entry in df9

imdb_crew


Unnamed: 0,tconst,nconst,category,primary_name,death_year
0,tt0111414,nm0246005,actor,Tommy Dysart,
1,tt0111414,nm0398271,director,Frank Howson,
763031,tt5573596,nm0398271,director,Frank Howson,
2,tt0111414,nm3739909,producer,Barry Porter-Robinson,
3,tt0323808,nm0059247,editor,Sean Barton,
...,...,...,...,...,...
1028178,tt9689618,nm10439724,actor,Phillippe Warner,
1028180,tt9689618,nm10439725,director,Xavi Herrero,
1028183,tt9692684,nm10441594,director,Guy Jones,
1028184,tt9692684,nm6009913,writer,Sabrina Mahfouz,


We've successfully condensed our 6 IMDB datasets into 2 dataframes, `imdb_movies` and `imdb_crew`.

## Sets 10 and 11 - Rotten Tomatoes

In [28]:
df10 = pd.read_csv('zippedData/rt.reviews.tsv', sep='\t', encoding='Latin-1')
df10.tail()

#It's immediately apparent that these are the posted reviews for movies on rotten tomatoes, using the id of the movie

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
54427,2000,The real charm of this trifle is the deadpan c...,,fresh,Laura Sinagra,1,Village Voice,"September 24, 2002"
54428,2000,,1/5,rotten,Michael Szymanski,0,Zap2it.com,"September 21, 2005"
54429,2000,,2/5,rotten,Emanuel Levy,0,EmanuelLevy.Com,"July 17, 2005"
54430,2000,,2.5/5,rotten,Christopher Null,0,Filmcritic.com,"September 7, 2003"
54431,2000,,3/5,fresh,Nicolas Lacroix,0,Showbizz.net,"November 12, 2002"


In [29]:
df11 = pd.read_csv('zippedData/rt.movie_info.tsv', sep='\t', encoding='Latin-1')
df11.tail()

# this is the information on the movies, by id. But it doesn't include the movie name!!

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
1555,1996,Forget terrorists or hijackers -- there's a ha...,R,Action and Adventure|Horror|Mystery and Suspense,,,"Aug 18, 2006","Jan 2, 2007",$,33886034.0,106 minutes,New Line Cinema
1556,1997,The popular Saturday Night Live sketch was exp...,PG,Comedy|Science Fiction and Fantasy,Steve Barron,Terry Turner|Tom Davis|Dan Aykroyd|Bonnie Turner,"Jul 23, 1993","Apr 17, 2001",,,88 minutes,Paramount Vantage
1557,1998,"Based on a novel by Richard Powell, when the l...",G,Classics|Comedy|Drama|Musical and Performing Arts,Gordon Douglas,,"Jan 1, 1962","May 11, 2004",,,111 minutes,
1558,1999,The Sandlot is a coming-of-age story about a g...,PG,Comedy|Drama|Kids and Family|Sports and Fitness,David Mickey Evans,David Mickey Evans|Robert Gunter,"Apr 1, 1993","Jan 29, 2002",,,101 minutes,
1559,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures


After checking out the Rotten Tomatoes/Fandango API usage, we see that they do not grant API access to individuals. We will have to scrape for more data if we want to use this data.