# Investigating Fandango Movie Ratings

## Introduction

In this project, we'll build on a piece of data journalism that Walt Hickey produced in 2015, which found strong evidence that the rating system of Fandango, a movie review site, was biased and dishonest. We will analyze more recent movie ratings data from Fandango to determine whether its rating system has changed since Hickey's analysis.

## The datasets

We will work with two datasets:
- The 2015 data Walt Hickey analyzed, which he made available publicly on his GitHub
- Movie ratings data for movies released in 2016 and 2017

In [1]:
import pandas as pd

# Reading the csv files

fandango_score_comparison = pd.read_csv("fandango_score_comparison.csv")
print(fandango_score_comparison.shape)
print(fandango_score_comparison.columns)
fandango_score_comparison.head()

(146, 22)
Index(['FILM', 'RottenTomatoes', 'RottenTomatoes_User', 'Metacritic',
       'Metacritic_User', 'IMDB', 'Fandango_Stars', 'Fandango_Ratingvalue',
       'RT_norm', 'RT_user_norm', 'Metacritic_norm', 'Metacritic_user_nom',
       'IMDB_norm', 'RT_norm_round', 'RT_user_norm_round',
       'Metacritic_norm_round', 'Metacritic_user_norm_round',
       'IMDB_norm_round', 'Metacritic_user_vote_count', 'IMDB_user_vote_count',
       'Fandango_votes', 'Fandango_Difference'],
      dtype='object')


Unnamed: 0,FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
0,Avengers: Age of Ultron (2015),74,86,66,7.1,7.8,5.0,4.5,3.7,4.3,...,3.9,3.5,4.5,3.5,3.5,4.0,1330,271107,14846,0.5
1,Cinderella (2015),85,80,67,7.5,7.1,5.0,4.5,4.25,4.0,...,3.55,4.5,4.0,3.5,4.0,3.5,249,65709,12640,0.5
2,Ant-Man (2015),80,90,64,8.1,7.8,5.0,4.5,4.0,4.5,...,3.9,4.0,4.5,3.0,4.0,4.0,627,103660,12055,0.5
3,Do You Believe? (2015),18,84,22,4.7,5.4,5.0,4.5,0.9,4.2,...,2.7,1.0,4.0,1.0,2.5,2.5,31,3136,1793,0.5
4,Hot Tub Time Machine 2 (2015),14,28,29,3.4,5.1,3.5,3.0,0.7,1.4,...,2.55,0.5,1.5,1.5,1.5,2.5,88,19560,1021,0.5


In [2]:
movie_ratings_16_17 = pd.read_csv("movie_ratings_16_17.csv")
print(movie_ratings_16_17.shape)
print(movie_ratings_16_17.columns)
movie_ratings_16_17.head()

(214, 15)
Index(['movie', 'year', 'metascore', 'imdb', 'tmeter', 'audience', 'fandango',
       'n_metascore', 'n_imdb', 'n_tmeter', 'n_audience', 'nr_metascore',
       'nr_imdb', 'nr_tmeter', 'nr_audience'],
      dtype='object')


Unnamed: 0,movie,year,metascore,imdb,tmeter,audience,fandango,n_metascore,n_imdb,n_tmeter,n_audience,nr_metascore,nr_imdb,nr_tmeter,nr_audience
0,10 Cloverfield Lane,2016,76,7.2,90,79,3.5,3.8,3.6,4.5,3.95,4.0,3.5,4.5,4.0
1,13 Hours,2016,48,7.3,50,83,4.5,2.4,3.65,2.5,4.15,2.5,3.5,2.5,4.0
2,A Cure for Wellness,2016,47,6.6,40,47,3.0,2.35,3.3,2.0,2.35,2.5,3.5,2.0,2.5
3,A Dog's Purpose,2017,43,5.2,33,76,4.5,2.15,2.6,1.65,3.8,2.0,2.5,1.5,4.0
4,A Hologram for the King,2016,58,6.1,70,57,3.0,2.9,3.05,3.5,2.85,3.0,3.0,3.5,3.0


The 2015 file has 146 entries with 22 attributes each. The 2016-2017 file has 214 records with only 15 attributes each.

We will select the columns that we are interested in working with:

In [4]:
cols = ['FILM', 
        'Fandango_Stars', 
        'Fandango_Ratingvalue', 
        'Fandango_votes', 
        'Fandango_Difference']

fandango_old = fandango_score_comparison[cols]
fandango_old.head()

Unnamed: 0,FILM,Fandango_Stars,Fandango_Ratingvalue,Fandango_votes,Fandango_Difference
0,Avengers: Age of Ultron (2015),5.0,4.5,14846,0.5
1,Cinderella (2015),5.0,4.5,12640,0.5
2,Ant-Man (2015),5.0,4.5,12055,0.5
3,Do You Believe? (2015),5.0,4.5,1793,0.5
4,Hot Tub Time Machine 2 (2015),3.5,3.0,1021,0.5


In [5]:
cols = ['movie',
        'year',
        'fandango']

fandango_new = movie_ratings_16_17[cols]
fandango_new.head()

Unnamed: 0,movie,year,fandango
0,10 Cloverfield Lane,2016,3.5
1,13 Hours,2016,4.5
2,A Cure for Wellness,2016,3.0
3,A Dog's Purpose,2017,4.5
4,A Hologram for the King,2016,3.0


## Evaluating the data sets

The goal we started this project with was to determine whether there has been any change in Fandango's rating system after Hickey's analysis. 

To conduct this analysis, we'd need samples that are representative of the population of all movie ratings stored on the Fandango website during the two periods we'd like to compare.

So are these datasets representative? Reading the README.md files of both repositories, we find that neither sample was drawn at random:
- In the first dataset, only movies with at least 30 fan ratings on the Fandango website were included
- In the second dataset, only movies with a "considerable number of votes and reviews" were included 

Both samples are therefore skewed toward popular movies and are not representative of the entire population (which would include all movies, regardless of how popular or good they are).
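The vote-count criterion for the first dataset can be checked directly against the data. The snippet below is a minimal sketch, not part of the original notebook: `sample_old`, `MIN_VOTES`, and `count_below_threshold` are hypothetical names introduced for illustration, and the three rows are copied from the preview above rather than loaded from the full file.

```python
import pandas as pd

# Toy stand-in for fandango_old (assumption: three rows copied from the
# preview shown earlier; the real check would run on the full DataFrame).
sample_old = pd.DataFrame({
    "FILM": [
        "Avengers: Age of Ultron (2015)",
        "Cinderella (2015)",
        "Hot Tub Time Machine 2 (2015)",
    ],
    "Fandango_votes": [14846, 12640, 1021],
})

MIN_VOTES = 30  # threshold described for the 2015 dataset


def count_below_threshold(df, votes_col="Fandango_votes", threshold=MIN_VOTES):
    """Return how many rows have fewer votes than the threshold."""
    return int((df[votes_col] < threshold).sum())


below = count_below_threshold(sample_old)
print(below)
```

If the sampling criterion holds, running the same function on the full `fandango_old` DataFrame should also report zero rows under the threshold. The second dataset has no vote-count column, so its "considerable number of votes and reviews" criterion cannot be verified this way.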

## Changing the goal of the analysis

Rather than collecting new data, we can tweak our goal.

**Original goal:** To determine whether there has been any change in Fandango's rating system after Hickey's article.

**New goal:** To determine whether there is any difference between Fandango's ratings for popular movies in 2015 and its ratings for popular movies in 2016.
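Comparing 2015 with 2016 means each dataset has to be restricted to a single release year. One possible way to do this is sketched below, under stated assumptions: the year of a 2015-dataset movie is taken from the parenthesized suffix of its `FILM` title, and the 2016-2017 dataset is filtered on its `year` column. The names `sample_old`, `sample_new`, `only_2015`, and `only_2016` are hypothetical, and the row "Hypothetical Film (2014)" is invented purely to show the filter excluding something.

```python
import pandas as pd

# Toy stand-ins for fandango_old and fandango_new (assumption: a few rows
# from the previews above plus one invented 2014 title for illustration).
sample_old = pd.DataFrame({
    "FILM": ["Cinderella (2015)", "Ant-Man (2015)", "Hypothetical Film (2014)"],
    "Fandango_Stars": [5.0, 5.0, 4.0],
})
sample_new = pd.DataFrame({
    "movie": ["10 Cloverfield Lane", "A Dog's Purpose"],
    "year": [2016, 2017],
    "fandango": [3.5, 4.5],
})

# Pull the four-digit year out of the trailing "(YYYY)" in each title.
release_year = (
    sample_old["FILM"]
    .str.extract(r"\((\d{4})\)\s*$", expand=False)
    .astype(int)
)

# Keep only the year of interest in each dataset.
only_2015 = sample_old[release_year == 2015].copy()
only_2016 = sample_new[sample_new["year"] == 2016].copy()

print(only_2015["FILM"].tolist())
print(only_2016["movie"].tolist())
```

The `.copy()` calls keep the filtered frames independent of the originals, so columns added later do not trigger pandas' chained-assignment warning. The regular expression assumes every title ends with a year in parentheses; titles that do not would produce missing values and make the integer conversion fail.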

To 