The Rotten Tomatoes data set only has two data sets, movie info and movie reviews. Lets start by looking at the data sets and what type of data each of them hold. 

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
#import .tsv files 
#need to specify encoding as 'latin1'

movie_info = pd.read_csv(r"C:\Users\rafav\Documents\Flatiron\Final Project_Section 1\project_data\rt_movie_info.tsv", delimiter='\t',encoding='latin1')
movie_reviews = pd.read_csv(r"C:\Users\rafav\Documents\Flatiron\Final Project_Section 1\project_data\rt_reviews.tsv", delimiter='\t',encoding='latin1')

In [24]:
movie_info.shape

(1560, 12)

In [25]:
movie_reviews.shape

(54432, 8)

In [4]:
movie_info.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [5]:
movie_reviews.head()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [6]:
movie_info.dtypes

id               int64
synopsis        object
rating          object
genre           object
director        object
writer          object
theater_date    object
dvd_date        object
currency        object
box_office      object
runtime         object
studio          object
dtype: object

We have now got an idea of what our data sets look like, lets start by making sure the data types in each data set are in the correct format for analysis. 

In [7]:
#change 'box office' to a numeric value so we can analyze the amount of money the movies made. 
movie_info["box_office"] = pd.to_numeric(movie_info.box_office, errors='coerce')

In [8]:
# we need to change run time to a numeric value as well
movie_info["runtime"] = pd.to_numeric(movie_info.box_office, errors='coerce')

In [9]:
movie_info.dtypes

id                int64
synopsis         object
rating           object
genre            object
director         object
writer           object
theater_date     object
dvd_date         object
currency         object
box_office      float64
runtime         float64
studio           object
dtype: object

In [13]:
movie_reviews.dtypes

id             int64
review        object
rating        object
fresh         object
critic        object
top_critic     int64
publisher     object
date          object
dtype: object

In [14]:
#change the rating of the movie to a numeric value so we can compare values to other variables. 
movie_reviews["rating"] = pd.to_numeric(movie_reviews.rating, errors='coerce')

In [18]:
movie_reviews.dtypes

id              int64
review         object
rating        float64
fresh          object
critic         object
top_critic      int64
publisher      object
date           object
dtype: object

Now that we have checked that everything is in the correct data type format lets check for missing or NaN values and clean both data sets up. 

In [22]:
movie_reviews.isnull().sum()

id                0
review         5563
rating        53682
fresh             0
critic         2722
top_critic        0
publisher       309
date              0
dtype: int64

In [20]:
movie_info.isnull().sum()

id                 0
synopsis          62
rating          1560
genre              8
director         199
writer           449
theater_date     359
dvd_date         359
currency        1220
box_office      1559
runtime         1559
studio          1066
dtype: int64

In [29]:
# lets consider dropping rows that are null from a subset of the data we are interested in inspecting later
movie_reviews.dropna(subset = ['review','rating'], how = 'all', inplace = True)

In [28]:
movie_info.dropna(subset = ['currency','box_office','rating'], how = 'all', inplace = True)

By looking at the data set dimensions after removing the missing values it looks like there were a lot more in the movie_info data set. At this point lets join these data sets using pandas merge to continue our cleaning. 

In [31]:
# set the index as id since both data sets have this column in common
#movie_info.set_index('id', inplace=True)
#movie_reviews.set_index('id', inplace=True)

In [33]:
# join the two data sets on the id column that is now the index
movie_info_and_reviews = pd.merge(movie_info, movie_reviews, on='id', how='inner')

In [34]:
movie_info_and_reviews.head()

Unnamed: 0_level_0,synopsis,rating_x,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio,review,rating_y,fresh,critic,top_critic,publisher,date
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
3,"New York City, not-too-distant-future: Eric Pa...",,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,,,Entertainment One,A distinctly gallows take on contemporary fina...,,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
3,"New York City, not-too-distant-future: Eric Pa...",,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,,,Entertainment One,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
3,"New York City, not-too-distant-future: Eric Pa...",,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,,,Entertainment One,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,"New York City, not-too-distant-future: Eric Pa...",,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,,,Entertainment One,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
3,"New York City, not-too-distant-future: Eric Pa...",,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,,,Entertainment One,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [35]:
movie_info_and_reviews.shape

(32174, 18)

In [39]:
#check data types 
movie_info_and_reviews.dtypes

synopsis         object
rating_x        float64
genre            object
director         object
writer           object
theater_date     object
dvd_date         object
currency         object
box_office      float64
runtime         float64
studio           object
review           object
rating_y        float64
fresh            object
critic           object
top_critic        int64
publisher        object
date             object
dtype: object

In [38]:
#look at summary statistics 
movie_info_and_reviews.describe()

Unnamed: 0,rating_x,box_office,runtime,rating_y,top_critic
count,0.0,21.0,21.0,378.0,32174.0
mean,,363.0,363.0,5.097619,0.249518
std,,0.0,0.0,2.693635,0.432741
min,,363.0,363.0,0.0,0.0
25%,,363.0,363.0,3.0,0.0
50%,,363.0,363.0,6.0,0.0
75%,,363.0,363.0,7.7,0.0
max,,363.0,363.0,9.8,1.0


In [40]:
#check for missing values 
movie_info_and_reviews.isna().sum()

synopsis            0
rating_x        32174
genre               0
director         3788
writer           5448
theater_date      111
dvd_date          111
currency            0
box_office      32153
runtime         32153
studio           2230
review             17
rating_y        31796
fresh               0
critic            904
top_critic          0
publisher         206
date                0
dtype: int64

lets look at the numerical data to see if their are any outliers. In order to stay consistant between data sources we will use the same cut off of 250 minutes as was used in the IMDb data set. 