# Investigating Bias in Online Movie Ratings

In this project, we delve into the question of whether online movie rating aggregators exhibit bias and dishonesty in their ratings. Our focus is on a detailed examination of Fandango's data, prompted by suspicions raised by data journalist [Walt Hickey's investigation](https://fivethirtyeight.com/features/fandango-movies-ratings/) in 2015.

Hickey's analysis uncovered concerning patterns within Fandango's rating system. Specifically, his findings revealed that:

- Ratings were consistently rounded up to the nearest half-star, with no instances of rounding down.
- Almost every movie was rated three stars or higher (98% of the time), and 75% of the ratings were at least four stars.
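
The rounding finding is easiest to see with numbers. The sketch below contrasts ordinary rounding to the nearest half-star with always rounding up. The function names are illustrative and are not taken from Fandango's site or from Hickey's analysis.

```python
import math


def round_to_nearest_half(rating):
    """Round a rating to the nearest half-star (ordinary rounding)."""
    return round(rating * 2) / 2


def round_up_to_half(rating):
    """Always round a rating up to the next half-star."""
    return math.ceil(rating * 2) / 2


# A 4.1 average would normally display as 4.0 stars,
# but always rounding up turns it into 4.5 stars.
nearest = round_to_nearest_half(4.1)
inflated = round_up_to_half(4.1)
print(nearest, inflated)
```

A rating that already sits on a half-star, such as 4.5, is left unchanged by both functions.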

Our project extends this investigation by analyzing whether Fandango has addressed these biases, as promised, or whether the observed patterns persist in its rating system.

**Why should we care?**

Understanding the integrity of online movie ratings is crucial for several reasons:

- The film industry is a ***significant economic force***, generating billions annually at the U.S. box office. Consequently, the influence of online ratings aggregators on ***consumer decisions*** is substantial.
- Fandango, as a prominent player in the movie ticketing market, has a ***vested interest*** in shaping consumer perceptions through its ratings. Its influence extends to direct ticket sales.
- Regulatory bodies such as the ***Federal Trade Commission*** are vigilant in safeguarding consumers against deceptive and anti-competitive practices. Monitoring the use of ratings and endorsements is essential in ensuring ***fair and transparent marketplace practices.***

Through this investigation, we aim to contribute to a more transparent and accountable online rating ecosystem, empowering consumers and regulatory bodies to make informed decisions.

### Finding the Suitable Dataset

We use two different datasets for the investigation:
* The 2015 dataset collected by Hickey for his original analysis: `fandango_score_comparison.csv`.
* A dataset published on GitHub with movie ratings data for 2016 and 2017: `movie_ratings_16_17.csv`.

**Population of Interest:** All movie ratings on Fandango's website, regardless of release year.


In [1]:
import pandas as pd

# Import the dataset prior to Hickey's analysis
movies_prior = pd.read_csv('./dataset/fandango_score_comparison.csv')

movies_prior.head(3)

Unnamed: 0,FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
0,Avengers: Age of Ultron (2015),74,86,66,7.1,7.8,5.0,4.5,3.7,4.3,...,3.9,3.5,4.5,3.5,3.5,4.0,1330,271107,14846,0.5
1,Cinderella (2015),85,80,67,7.5,7.1,5.0,4.5,4.25,4.0,...,3.55,4.5,4.0,3.5,4.0,3.5,249,65709,12640,0.5
2,Ant-Man (2015),80,90,64,8.1,7.8,5.0,4.5,4.0,4.5,...,3.9,4.0,4.5,3.0,4.0,4.0,627,103660,12055,0.5


In [2]:
# Import the dataset after Hickey's analysis
movies_after = pd.read_csv('./dataset/movie_ratings_16_17.csv')

movies_after.head(3)

Unnamed: 0,movie,year,metascore,imdb,tmeter,audience,fandango,n_metascore,n_imdb,n_tmeter,n_audience,nr_metascore,nr_imdb,nr_tmeter,nr_audience
0,10 Cloverfield Lane,2016,76,7.2,90,79,3.5,3.8,3.6,4.5,3.95,4.0,3.5,4.5,4.0
1,13 Hours,2016,48,7.3,50,83,4.5,2.4,3.65,2.5,4.15,2.5,3.5,2.5,4.0
2,A Cure for Wellness,2016,47,6.6,40,47,3.0,2.35,3.3,2.0,2.35,2.5,3.5,2.0,2.5


Both datasets contain a lot of extra information that we won't be using. To simplify them, we'll isolate only the columns we are interested in.

In [3]:
# Isolate the necessary columns from the 1st dataset
movies_prior = movies_prior[['FILM', 'Fandango_Stars', 'Fandango_Ratingvalue', 'Fandango_votes', 'Fandango_Difference']].copy()

movies_prior.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146 entries, 0 to 145
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   FILM                  146 non-null    object 
 1   Fandango_Stars        146 non-null    float64
 2   Fandango_Ratingvalue  146 non-null    float64
 3   Fandango_votes        146 non-null    int64  
 4   Fandango_Difference   146 non-null    float64
dtypes: float64(3), int64(1), object(1)
memory usage: 5.8+ KB


In [4]:
# Isolate the necessary columns from the 2nd dataset
movies_after = movies_after[['movie', 'year', 'fandango']].copy()

movies_after.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   movie     214 non-null    object 
 1   year      214 non-null    int64  
 2   fandango  214 non-null    float64
dtypes: float64(1), int64(1), object(1)
memory usage: 5.1+ KB
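
Column selection can also happen at load time. The following is a minimal sketch, assuming the column names shown above. The in-memory CSV is only a stand-in for `movie_ratings_16_17.csv`, holding a few columns from the first three rows.

```python
import io

import pandas as pd

# Stand-in for the CSV file: a subset of columns from the first three rows shown above.
csv_text = """movie,year,metascore,imdb,fandango
10 Cloverfield Lane,2016,76,7.2,3.5
13 Hours,2016,48,7.3,4.5
A Cure for Wellness,2016,47,6.6,3.0
"""

# usecols keeps only the listed columns while the file is parsed.
sample_after = pd.read_csv(io.StringIO(csv_text), usecols=['movie', 'year', 'fandango'])
print(sample_after.columns.tolist())
```

Loading only the needed columns avoids the separate selection step, at the cost of not being able to inspect the other columns afterwards.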


### Adjusted Goal: Comparing Fandango's Ratings for Popular Movies in 2015 and 2016

Given the limitations of the available datasets and the non-random sampling processes involved, we have refined our project goal to focus on comparing Fandango's ratings for popular movies in 2015 and 2016. This adjustment allows us to leverage the available data effectively while minimizing the risk of drawing inaccurate conclusions due to sampling bias.

**Methodology:**
1. **Data Collection:**
   * `fandango_score_comparison.csv`: Contains data on every film with at least 30 fan reviews on Fandango, collected in 2015 by Walt Hickey for his original analysis.
   * `movie_ratings_16_17.csv`: Consists of movie ratings data for 214 popular movies released in 2016 and 2017, obtained from a dataset published on GitHub.

2. **Population of Interest:** All movie ratings on Fandango's website, regardless of release year.

3. **Sampling Approach:** Neither dataset was obtained through random sampling, so neither may be representative of the population of interest.

4. **Adjusted Analysis:** Instead of investigating changes in Fandango's rating system over time, we will compare Fandango's ratings for popular movies in 2015 and 2016. This approach allows us to explore potential differences in ratings between the two time periods.

5. **Adjusted Population of Interest:**
   * All of Fandango's ratings for popular movies (30 fan ratings or more) released in 2015.
   * All of Fandango's ratings for popular movies (30 fan ratings or more) released in 2016.

## Clean and Prep the Data for Analysis

1. Clean up the column names to standardize the two datasets.
2. Check whether both samples contain enough popular movies to be representative of the population.
3. Isolate the sample points that belong to our populations of interest only, i.e. movies released in 2015 and 2016.
4. Separate the release year from the movie title in the first dataset, e.g. `Avengers: Age of Ultron (2015)`.

In [5]:
# Standardize the column names to lowercase
movies_prior.columns = movies_prior.columns.str.lower()
movies_prior.rename(columns={'fandango_ratingvalue': 'fandango_rating_value', 'film': 'movie'}, inplace=True)

movies_prior.columns

Index(['movie', 'fandango_stars', 'fandango_rating_value', 'fandango_votes',
       'fandango_difference'],
      dtype='object')

Next, we'll check whether both datasets contain enough popular movies, i.e. movies with 30 fan ratings or more, to be representative of the population of interest.

In [6]:
# Count the movies with fewer than 30 fan ratings
movies_prior.loc[movies_prior['fandango_votes'] < 30, 'movie'].count()

0

In [7]:
# Get the min and max in the range
'Range: [{min_vote} - {max_vote}]'.format(min_vote = movies_prior['fandango_votes'].min(), max_vote = movies_prior['fandango_votes'].max())

'Range: [35 - 34846]'

None of the movies in the dataset collected before Hickey's analysis has fewer than 30 fan ratings; the vote counts range from 35 to 34,846. This matches our definition of *popular* movies.

The second dataset does not clearly define its criteria for choosing *popular* movies, and it has no vote-count column to check against. This raises a representativeness issue. We'll try a different approach.
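
The popularity check can also be written as a single boolean test. This is a minimal sketch, assuming the 30-rating threshold defined above. The constant name is illustrative, and the vote counts are those of the first three movies in the dataset.

```python
import pandas as pd

POPULARITY_THRESHOLD = 30  # minimum number of fan ratings for a "popular" movie

# Vote counts of the first three movies shown earlier.
votes = pd.Series([14846, 12640, 12055], name='fandango_votes')

# True only if every movie meets the threshold.
all_popular = bool((votes >= POPULARITY_THRESHOLD).all())

# Number of movies that fall below the threshold.
n_below = int((votes < POPULARITY_THRESHOLD).sum())

print(all_popular, n_below)
```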

In [8]:
# Inner merge on the title: keep only the movies that appear in both datasets
merged_data = pd.merge(left=movies_after, right=movies_prior, how='inner', on='movie')

merged_data['fandango_votes'].value_counts(dropna=False)

Series([], Name: count, dtype: int64)

The merge returns no rows, so no movie titles match exactly between the two datasets. We'll have to come up with a different approach.
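
One caveat about this merge: at this point the titles in the first dataset still include the release year (e.g. `Avengers: Age of Ultron (2015)`), while the titles in the second dataset do not, so an exact-title match cannot succeed even for a shared movie. The sketch below shows one way to compare titles after removing the year. The `title_key` column is a hypothetical helper, and the titles are the first three rows of each dataset.

```python
import pandas as pd

# Titles copied from the first rows of each dataset shown earlier.
prior_titles = pd.DataFrame({'movie': ['Avengers: Age of Ultron (2015)',
                                       'Cinderella (2015)',
                                       'Ant-Man (2015)']})
after_titles = pd.DataFrame({'movie': ['10 Cloverfield Lane',
                                       '13 Hours',
                                       'A Cure for Wellness']})

# Hypothetical helper column: the title without the trailing "(year)" part.
prior_titles['title_key'] = (prior_titles['movie']
                             .str.replace(r'\s*\(\d{4}\)\s*$', '', regex=True)
                             .str.strip())
after_titles['title_key'] = after_titles['movie'].str.strip()

# indicator=True adds a _merge column that records where each row came from.
overlap = pd.merge(prior_titles, after_titles, on='title_key',
                   how='outer', indicator=True, suffixes=('_prior', '_after'))

# Rows marked 'both' are titles present in the two samples.
n_common = int((overlap['_merge'] == 'both').sum())
print(n_common)
```

For these six titles there is no overlap, which is expected since the two samples cover different release years.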

### Extract Year from movie column

We want to clean the first dataset's `movie` column by extracting the release year into its own column. The year will be useful later for double-checking when each movie was released and whether the sample is representative.

In [9]:
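# Define a regex pattern for a four-digit year wrapped in parentheses, e.g. "(2015)"
year_pattern = r"\((?P<year>[12][0-9]{3})\)"

# Extract the year (as a string) and store it in a new column
movies_prior['year'] = movies_prior['movie'].str.extract(year_pattern, expand=False)

# Keep only the title: drop the "(year)" part and the space before it
movies_prior['movie'] = movies_prior['movie'].str.split('(').str[0].str.strip()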

movies_prior

Unnamed: 0,movie,fandango_stars,fandango_rating_value,fandango_votes,fandango_difference,year
0,Avengers: Age of Ultron,5.0,4.5,14846,0.5,2015
1,Cinderella,5.0,4.5,12640,0.5,2015
2,Ant-Man,5.0,4.5,12055,0.5,2015
3,Do You Believe?,5.0,4.5,1793,0.5,2015
4,Hot Tub Time Machine 2,3.5,3.0,1021,0.5,2015
...,...,...,...,...,...,...
141,Mr. Holmes,4.0,4.0,1348,0.0,2015
142,'71,3.5,3.5,192,0.0,2015
143,"Two Days, One Night",3.5,3.5,118,0.0,2014
144,Gett: The Trial of Viviane Amsalem,3.5,3.5,59,0.0,2015
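
One thing to watch for when extracting years is that a pattern not tied to the parentheses can pick up digits from the title itself. The sketch below compares an unanchored and an anchored pattern. The second title is not in the dataset and is used only for illustration.

```python
import pandas as pd

# The first title comes from the dataset; the second is not in it and only
# illustrates a title that itself contains a four-digit number.
titles = pd.Series(['Avengers: Age of Ultron (2015)',
                    '2001: A Space Odyssey (1968)'])

# Unanchored: matches the first four-digit number starting with 1 or 2.
loose = titles.str.extract(r'(?P<year>[12][0-9]{3})', expand=False)

# Anchored: only matches a four-digit number wrapped in parentheses.
strict = titles.str.extract(r'\((?P<year>[12][0-9]{3})\)', expand=False)

print(loose.tolist(), strict.tolist())
```

With `expand=False` and a single capture group, `str.extract` returns a Series, which can be assigned directly to a new column.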


### Isolate Movies Released in 2015 and 2016 from the Dataset
Now that the movie name and the year are in separate columns, we can find out how many movies were not released in 2015 or 2016. We'll begin with the frequency distribution of the first dataset and isolate the movies released in 2015.

In [10]:
movies_prior['year'].value_counts()

year
2015    129
2014     17
Name: count, dtype: int64

In [11]:
# 'year' holds strings in this dataset, so compare against '2015'
movies_2015 = movies_prior[movies_prior['year'] == '2015'].copy()

movies_2015['year'].value_counts()

year
2015    129
Name: count, dtype: int64

We'll do the same for the 2nd dataset and isolate the data for year 2016.

In [12]:
movies_after['year'].value_counts()

year
2016    191
2017     23
Name: count, dtype: int64

In [13]:
# 'year' is an integer column in this dataset, so compare against 2016
movies_2016 = movies_after[movies_after['year'] == 2016].copy()

movies_2016['year'].value_counts()

year
2016    191
Name: count, dtype: int64
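
Note that the two filters compare against different types: the year extracted from the first dataset's titles is a string (`'2015'`), while `year` in the second dataset is `int64` (`2016`). Below is a minimal sketch of casting the extracted year to an integer so both datasets can be filtered the same way. The sample values are illustrative.

```python
import pandas as pd

# Years as str.extract returns them: strings, not numbers.
years_extracted = pd.Series(['2015', '2014', '2015'], name='year')

# Cast to integers so the column can be compared with integer years.
years_numeric = years_extracted.astype(int)

# Count the movies released in 2015 using an integer comparison.
n_2015 = int((years_numeric == 2015).sum())
print(n_2015)
```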