# Rotten Tomatoes Reviews
by Stuart Miller  
[Github](https://github.com/sjmiller8182)

**Objective:** Predict the 'freshness' of critic reviews  

[Rotten Tomatoes](https://www.rottentomatoes.com/) rates movies as ['fresh' or 'rotten'](https://www.rottentomatoes.com/about#whatisthetomatometer). A movie is rated 'fresh' when at least 60% of its critic reviews are positive and 'rotten' when fewer than 60% are positive. In some cases, critics give the movie a numerical or letter-grade rating such as '3/4' or 'A+'. In many cases, however, the critic gives no rating or grade, and the 'freshness' of the review must be inferred from the review text. That is the objective of this project: predicting the freshness of critic reviews.  
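
The 60% rule described above can be sketched in a few lines. This is only an illustration of the threshold as stated here, not Rotten Tomatoes' actual implementation; the function name `tomatometer_label` and its parameters are hypothetical.

```python
def tomatometer_label(review_labels, threshold_percent=60):
    """Label a movie 'fresh' if at least threshold_percent of its reviews are fresh.

    Hypothetical helper: review_labels is a list of 'fresh'/'rotten' strings.
    """
    if not review_labels:
        raise ValueError("need at least one review")
    fresh_count = sum(label == "fresh" for label in review_labels)
    # Compare with integers to avoid floating-point edge cases at exactly 60%.
    if 100 * fresh_count >= threshold_percent * len(review_labels):
        return "fresh"
    return "rotten"


print(tomatometer_label(["fresh", "fresh", "fresh", "rotten", "rotten"]))  # 3 of 5 fresh
print(tomatometer_label(["fresh", "rotten"]))  # 1 of 2 fresh
```

Note that this labels a whole movie from its review labels; the project itself targets the per-review label.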

The data was previously scraped with [TomatoPy](https://github.com/sjmiller8182/tomatopy). See the Data Collection notebook for details.

In [36]:
import pandas as pd
import numpy as np
import codecs

Load the data from file, ignoring encoding errors when opening it.

In [31]:
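# One list per column; each row of reviews.tsv has eight tab-separated fields.
data_trans = [[] for _ in range(8)]
# errors='ignore' silently drops bytes that are not valid UTF-8.
with codecs.open('reviews.tsv', 'r', encoding='utf-8', errors='ignore') as fdata:
    for line in fdata:
        fields = line.strip().split('\t')
        # Append each field to the list for its column.
        for i, field in enumerate(fields):
            data_trans[i].append(field)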
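
As a possible alternative to the loop above, the same column lists could be built with the standard library's `csv` module. This is a sketch under assumptions, not the loader used in this notebook: the helper name `read_tsv_columns` is hypothetical, and padding short rows with empty strings is a design choice that the original loop does not make. Unlike the original, it does not strip surrounding whitespace from each line.

```python
import csv


def read_tsv_columns(path, n_cols=8):
    """Read a tab-separated file into n_cols column lists (hypothetical helper)."""
    columns = [[] for _ in range(n_cols)]
    # The built-in open() also accepts errors='ignore', so codecs is not required.
    with open(path, "r", encoding="utf-8", errors="ignore", newline="") as fdata:
        # QUOTE_NONE treats quote characters in review text as ordinary characters.
        reader = csv.reader(fdata, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if not row:
                continue  # skip blank lines
            if len(row) > n_cols:
                raise ValueError(f"expected at most {n_cols} fields, got {len(row)}")
            # Pad short rows so every column list keeps the same length.
            row = row + [""] * (n_cols - len(row))
            for column, field in zip(columns, row):
                column.append(field)
    return columns
```

Calling `read_tsv_columns('reviews.tsv')` would return a list that could stand in for `data_trans`.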

Load data into DataFrame for analysis.

In [32]:
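# Column names, in the same order as the fields in reviews.tsv.
keys = ['id', 'reviews', 'rating', 'fresh', 'critic',
        'top_critic', 'publisher', 'date']

# Pair each column name with its list of values and build the DataFrame.
data = pd.DataFrame(dict(zip(keys, data_trans)))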

## Data Frame Structure
The table contains 7 columns of data and a column of foreign keys. The keys match review data to movie info, which was also scraped.
- id: table foreign key
- reviews: text of critic reviews
- rating: rating
- fresh: freshness of review - fresh or rotten
- critic: critic name
- top_critic: whether the critic is considered a 'top critic' (1 or 0)
- publisher: publisher of review
- date: date of review
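
Because the `rating` column mixes formats such as '3/5' and letter grades, one way to make it comparable across reviews is to convert fraction-style ratings to a value between 0 and 1. The sketch below is an assumption about how that could be done, not part of the original analysis. The helper name `parse_fraction_rating` is hypothetical, and letter grades are deliberately left unparsed, since any grade-to-number mapping would be a guess.

```python
def parse_fraction_rating(rating):
    """Convert a rating like '3/5' to a float in [0, 1]; return None otherwise."""
    if not rating or "/" not in rating:
        return None  # missing value, or a letter grade such as 'A+'
    numerator, _, denominator = rating.partition("/")
    try:
        value = float(numerator) / float(denominator)
    except (ValueError, ZeroDivisionError):
        return None  # not numeric, or a zero denominator
    # Assumption: a valid rating never exceeds its own scale.
    return value if 0.0 <= value <= 1.0 else None


print(parse_fraction_rating("3/4"))
print(parse_fraction_rating("A+"))
```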

Of the 54,432 rows, 48,869 have review text and 40,915 have a rating. The `fresh` column has no null values.
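
To put those counts in perspective, the arithmetic below computes how many rows are missing a review or a rating. It uses only the figures quoted above; the variable names are illustrative.

```python
total_rows = 54432
non_null = {"reviews": 48869, "rating": 40915}

# Rows without a value in each column.
missing = {column: total_rows - count for column, count in non_null.items()}

for column, count in non_null.items():
    share = count / total_rows
    print(f"{column}: {share:.1%} present, {missing[column]} missing")
```

Since every row has a `fresh` label, rows without a rating can still be used as labeled examples as long as they have review text.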

In [33]:
data.head()

| | id | reviews | rating | fresh | critic | top_critic | publisher | date |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | A distinctly gallows take on contemporary fina... | 3/5 | fresh | PJ Nabarro | 0 | Patrick Nabarro | November 10, 2018 |
| 1 | 3 | It's an allegory in search of a meaning that n... | | rotten | Annalee Newitz | 0 | io9.com | May 23, 2018 |
| 2 | 3 | ... life lived in a bubble in financial dealin... | | fresh | Sean Axmaker | 0 | Stream on Demand | January 4, 2018 |
| 3 | 3 | "Continuing along a line introduced in last ye... | | fresh | Daniel Kasman | 0 | MUBI | November 16, 2017 |
| 4 | 3 | ... a perverse twist on neorealism... | | fresh | | 0 | Cinema Scope | October 12, 2017 |


In [35]:
# Treat empty strings as missing values (pd.np was removed in pandas 2.0; use np directly).
data.replace(to_replace='', value=np.nan, inplace=True)

In [37]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 8 columns):
id            54432 non-null object
reviews       48869 non-null object
rating        40915 non-null object
fresh         54432 non-null object
critic        51710 non-null object
top_critic    54432 non-null object
publisher     54123 non-null object
date          54432 non-null object
dtypes: object(8)
memory usage: 3.3+ MB
