In [3]:
import pandas as pd

# Rotten Tomatoes Movie Info

In [56]:
# Load file into dataframe.
df = pd.read_csv('ZippedData/rt.movie_info.tsv.gz', sep='\t', compression='gzip', index_col=0)

In [57]:
# Look at first record.
df.head(1)

id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,


#### Summarize fields
* There is a unique **id** for each movie, which is assigned to be the index. The ids run from 1 to 2000 but there are only 1560 records, so the id is NOT a count of the records and must have some other meaning.
* The **synopsis** is a string. 62 missing.
* The **rating** is a string. This rating refers to the Motion Picture Association film rating: "R", "PG-13", etc. 3 missing.
* The **genre** is a string. At least some entries contain multiple genres, separated by pipes "|" with no spaces. 8 missing.
* The **director** is a string. 199 missing.
* The **writer** is a string. If multiple writers are listed, they are separated by pipes "|" with no spaces. 449 missing.
* The **theater_date** is a string, formatted as "Jan 1, 2013" or "Apr 18, 2000". 359 missing.
* The **dvd_date** is a string, which seems to be formatted the same way as the theater_date. 359 missing.
* The **currency** is null, except for 340 values which are all '\\$'. I will throw away this field/column.
* The **box_office** holds numbers, some of them stored as strings (the column dtype is object). It presumably gives the amount of money that the film made at the box office. Regardless, we only have non-missing box_office values for 340 out of 1560 records, so I will probably also throw away this field/column.
* The **runtime** is a string, formatted "## minutes". 30 missing
* The **studio** is a string. Only 494 of the 1560 records have a studio, so this will likely be dropped as well.
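
A minimal sketch of how the fields summarized above might be cleaned. It runs on a two-row stand-in frame rather than the real file; the second row, its `box_office` value, and the new column names `runtime_min` and `genre_list` are illustrative assumptions.

```python
import pandas as pd

# Tiny stand-in for the movie info table; the second row is made up for illustration.
sample = pd.DataFrame(
    {
        "genre": ["Action and Adventure|Classics|Drama", None],
        "theater_date": ["Oct 9, 1971", None],
        "currency": [None, "$"],
        "box_office": [None, "600,000"],
        "runtime": ["104 minutes", None],
        "studio": [None, None],
    },
    index=pd.Index([1, 3], name="id"),
)

# Drop the sparsely populated columns named in the summary.
clean = sample.drop(columns=["currency", "box_office", "studio"])

# "104 minutes" -> 104.0 (NaN where runtime is missing).
clean["runtime_min"] = clean["runtime"].str.extract(r"(\d+)", expand=False).astype(float)

# "Oct 9, 1971" -> Timestamp; missing dates become NaT.
clean["theater_date"] = pd.to_datetime(clean["theater_date"], format="%b %d, %Y")

# Pipe-separated genres -> list of genre names (missing values stay missing).
clean["genre_list"] = clean["genre"].str.split("|")
```

Keeping the original `runtime` and `genre` strings next to the parsed columns makes it easy to spot-check the conversion before dropping them.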

In [58]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1560 entries, 1 to 2000
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   synopsis      1498 non-null   object
 1   rating        1557 non-null   object
 2   genre         1552 non-null   object
 3   director      1361 non-null   object
 4   writer        1111 non-null   object
 5   theater_date  1201 non-null   object
 6   dvd_date      1201 non-null   object
 7   currency      340 non-null    object
 8   box_office    340 non-null    object
 9   runtime       1530 non-null   object
 10  studio        494 non-null    object
dtypes: object(11)
memory usage: 146.2+ KB
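
The missing counts can also be computed directly instead of subtracting the non-null counts from 1560 by hand. A small sketch on a made-up three-row frame (the rows are assumptions; the calls would be the same on the real table):

```python
import pandas as pd

# Made-up three-row frame standing in for the movie info table.
sample = pd.DataFrame(
    {
        "rating": ["R", None, "PG-13"],
        "director": ["William Friedkin", None, None],
    }
)

# Number of missing values per column, largest first.
missing = sample.isna().sum().sort_values(ascending=False)
print(missing)

# Share of rows missing per column, useful when deciding which columns to drop.
missing_share = sample.isna().mean()
```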


# Rotten Tomatoes Reviews

In [59]:
# Load data into dataframe.
df = pd.read_csv('ZippedData/rt.reviews.tsv.gz', sep='\t', compression='gzip', encoding='unicode_escape')

In [64]:
# View first 5 reviews.
df.head()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [65]:
len(df) # There are 54,432 records in this table!

54432

In [66]:
df.groupby('id').count() # The id column seems to refer to a movie. One movie can have many reviews.
# Reviews and ratings do not have to go together; some people leave reviews without ratings, and vice versa.

id,review,rating,fresh,critic,top_critic,publisher,date
3,162,113,163,160,163,163,163
5,6,20,23,21,23,23,23
6,49,41,57,52,57,57,57
8,57,40,75,69,75,75,75
10,107,61,108,104,108,107,108
...,...,...,...,...,...,...,...
1996,135,115,143,139,143,142,143
1997,19,23,28,24,28,28,28
1998,2,2,2,2,2,2,2
1999,34,31,46,44,46,46,46
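
The differing counts per column above come from `count()` tallying only non-null cells. A short sketch on made-up rows shaped like the reviews table shows the difference between `count()` and `size()`, and how rows with only a review or only a rating might be flagged. The sample data and the `n_rows` column name are assumptions.

```python
import pandas as pd

# Made-up rows shaped like the reviews table.
reviews = pd.DataFrame(
    {
        "id": [3, 3, 5, 5],
        "review": ["text a", "text b", None, "text c"],
        "rating": ["3/5", None, "2/5", None],
    }
)

# count() tallies non-null cells, so review and rating counts can differ per movie.
per_movie = reviews.groupby("id")[["review", "rating"]].count()

# size() counts rows regardless of missing values.
per_movie["n_rows"] = reviews.groupby("id").size()

# Rows that have review text but no rating, and the reverse.
review_only = reviews["review"].notna() & reviews["rating"].isna()
rating_only = reviews["review"].isna() & reviews["rating"].notna()
```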


In [67]:
df.loc[df.id==5] # The top_critic column seems to list 1 if the critic is a top critic, and 0 otherwise.

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
163,5,This is not the smoothest trip: the transition...,,fresh,David Ansen,1,Newsweek,"February 26, 2018"
164,5,Charming tale of songwriter finding her voice ...,4/5,fresh,Brian Costello,0,Common Sense Media,"July 12, 2016"
165,5,"The lead, Ileana Douglas, is good, but the mus...",C+,fresh,Emanuel Levy,0,EmanuelLevy.Com,"June 7, 2011"
166,5,"An intelligent, engaging comedy-drama that wil...",3.5/4,fresh,Michael Dequina,0,TheMovieReport.com,"September 10, 2005"
167,5,Illeana Douglas grabs onto her first starring ...,,fresh,Frank Wilkins,0,ReelTalk Movie Reviews,"December 3, 2003"
168,5,Pean to creativity and a celebration of one wo...,,fresh,,0,Spirituality and Practice,"August 27, 2002"
169,5,,3/5,fresh,Cole Smithey,0,ColeSmithey.com,"October 10, 2005"
170,5,,2/5,rotten,Chuck O'Leary,0,Fantastica Daily,"October 9, 2005"
171,5,,3/5,fresh,Philip Martin,0,Arkansas Democrat-Gazette,"April 28, 2005"
172,5,,3/5,fresh,Eric Melin,0,Lawrence.com,"August 24, 2004"
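
The rating column mixes formats ("4/5", "3.5/4", "C+"), so it cannot be compared as-is. One possible approach, sketched here as an assumption rather than a settled method, is to put the fraction-style ratings on a 0 to 1 scale and leave everything else (letter grades, missing values) as NaN. The helper name `fraction_to_score` is hypothetical.

```python
import math

import pandas as pd


def fraction_to_score(rating):
    """Return a 0-1 score for ratings like '3.5/4'; NaN for anything else."""
    if not isinstance(rating, str) or "/" not in rating:
        return math.nan
    num, _, den = rating.partition("/")
    try:
        num, den = float(num), float(den)
    except ValueError:
        return math.nan
    return num / den if den > 0 else math.nan


# Rating strings taken from the rows shown above, plus one missing value.
ratings = pd.Series(["4/5", "C+", "3.5/4", None, "2/5"])
scores = ratings.map(fraction_to_score)
```

Letter grades would need their own mapping; they are deliberately left missing here rather than guessed.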


# TMDB Movies

This data has two interesting fields, popularity and vote_count, which I think could be useful in a complete analysis of which type of movie is best.

In [70]:
# Load data.
df = pd.read_csv('ZippedData/tmdb.movies.csv.gz', compression='gzip', index_col=0)

In [71]:
# Observe first five movies.
df.head()

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [72]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26517 entries, 0 to 26516
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   genre_ids          26517 non-null  object 
 1   id                 26517 non-null  int64  
 2   original_language  26517 non-null  object 
 3   original_title     26517 non-null  object 
 4   popularity         26517 non-null  float64
 5   release_date       26517 non-null  object 
 6   title              26517 non-null  object 
 7   vote_average       26517 non-null  float64
 8   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 2.0+ MB
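
Both genre_ids and release_date have dtype object, meaning they were read from the CSV as text. A minimal sketch of parsing them, using two rows copied from the head() output above (the `by_genre` name is an assumption):

```python
import ast

import pandas as pd

# Two rows copied from the head() output.
tmdb = pd.DataFrame(
    {
        "genre_ids": ["[12, 14, 10751]", "[28, 878, 12]"],
        "release_date": ["2010-11-19", "2010-07-16"],
        "title": ["Harry Potter and the Deathly Hallows: Part 1", "Inception"],
    }
)

# Parse each "[12, 14, 10751]" string into a real list of ints.
tmdb["genre_ids"] = tmdb["genre_ids"].map(ast.literal_eval)

# ISO-formatted dates parse without an explicit format.
tmdb["release_date"] = pd.to_datetime(tmdb["release_date"])

# One row per (movie, genre id), convenient for per-genre summaries.
by_genre = tmdb.explode("genre_ids")
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text that comes from a file.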


# Movie Budgets

In [78]:
df = pd.read_csv('ZippedData/tn.movie_budgets.csv.gz', compression='gzip', index_col=0)

In [79]:
df.head()

id,release_date,movie,production_budget,domestic_gross,worldwide_gross
1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"
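
The money columns are strings such as "$425,000,000", so they need converting before any arithmetic. A minimal sketch using two rows from the output above; the derived `worldwide_profit` column is a hypothetical example, not part of the dataset.

```python
import pandas as pd

# Two rows copied from the head() output.
budgets = pd.DataFrame(
    {
        "movie": ["Avatar", "Dark Phoenix"],
        "production_budget": ["$425,000,000", "$350,000,000"],
        "worldwide_gross": ["$2,776,345,279", "$149,762,350"],
    }
)

for col in ["production_budget", "worldwide_gross"]:
    # Strip "$" and "," and convert to integer dollars.
    budgets[col] = budgets[col].str.replace(r"[$,]", "", regex=True).astype("int64")

# Hypothetical derived column: worldwide gross minus production budget.
budgets["worldwide_profit"] = budgets["worldwide_gross"] - budgets["production_budget"]
```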


In [77]:
df.reset_index().info() # id is the index after loading with index_col=0, so move it back to a column first.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


In [76]:
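df.reset_index().id.describe() # id only runs from 1 to 100 across 5782 rows, so it repeats and is not a unique key.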

count    5782.000000
mean       50.372363
std        28.821076
min         1.000000
25%        25.000000
50%        50.000000
75%        75.000000
max       100.000000
Name: id, dtype: float64
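
With a maximum id of 100 across 5782 rows, the id must repeat, so it cannot identify a movie on its own. A quick way to check this, sketched on made-up rows; treating movie plus release_date as a candidate key is an assumption that would need to be verified on the real data.

```python
import pandas as pd

# Made-up rows in which id restarts partway through.
budgets = pd.DataFrame(
    {
        "id": [1, 2, 3, 1, 2],
        "release_date": ["Jan 1, 2013", "Apr 18, 2000", "Jan 1, 2013", "Jan 1, 2013", "Apr 18, 2000"],
        "movie": ["Film A", "Film B", "Film C", "Film D", "Film E"],
    }
)

# False when any id value occurs more than once.
id_is_unique = budgets["id"].is_unique

# Number of rows whose id is shared with at least one other row.
n_repeated = int(budgets["id"].duplicated(keep=False).sum())

# Candidate composite key (an assumption): title plus release date.
key_is_unique = not budgets.duplicated(subset=["movie", "release_date"]).any()
```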

# Movie Gross

In [80]:
df = pd.read_csv('ZippedData/bom.movie_gross.csv.gz')

In [81]:
df.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [82]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB
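
foreign_gross has dtype object even though the values shown look numeric, which suggests at least some entries are not plain digits. A hedged sketch of a conversion, assuming the cause is thousands separators (not confirmed from the output above). The second and third rows and the `total_gross` column are made up for illustration.

```python
import pandas as pd

# First row is from the head() output; the other two are made up.
gross = pd.DataFrame(
    {
        "title": ["Toy Story 3", "Hypothetical Film", "Another Hypothetical Film"],
        "domestic_gross": [415000000.0, 1000000.0, None],
        "foreign_gross": ["652000000", "1,200", None],
    }
)

# Remove any thousands separators, then convert; unparseable values become NaN.
gross["foreign_gross"] = pd.to_numeric(
    gross["foreign_gross"].str.replace(",", "", regex=False), errors="coerce"
)

# Total gross; stays NaN when either part is missing.
gross["total_gross"] = gross["domestic_gross"] + gross["foreign_gross"]
```

Counting how many values turn into NaN after the conversion is a cheap way to see whether the comma assumption actually covers the odd entries.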
