# Cody Deleurme

## Research question/interests

What interests me most about this data is whether these high-grossing movies are any better than less popular films. Going into this topic, I will admit that I am somewhat biased, as I have seen dozens of films not contained in this dataset that I have rated 8 out of 10 or higher. That being said, I will be fair and let the data speak for itself. I will be using a dataset of my own creation to examine the ratings of each of these films on Imdb, Letterboxd, Rotten Tomatoes, and RateYourMusic. If I have time, I would also like to compare my own ratings of some of these films with their aggregate ratings.

My research question is as follows: **"Is there any relationship between the quality of a movie and its popularity? If so, then to what degree does a movie's popularity translate to overall quality?"**

## Methods

To answer this research question, I will be using 5 different aggregate rating scores from 4 different websites. These scores are as follows:
 - Imdb weighted average rating out of 10
 - Rotten Tomatoes percentage scores for both critics (the 'Tomatometer') and audiences
 - Letterboxd weighted average rating out of 5 stars
 - RateYourMusic weighted average rating out of 5 stars

### How are these ratings aggregated?
To properly understand my analysis, it is important to understand how each of these sites aggregates its ratings, because the methods differ significantly.

#### Imdb
Imdb uses a weighted average system to give movies a score out of 10, rounded to one decimal place. Not every rating is equal on Imdb. According to the website, when users with unusual voting activity are detected, their ratings are given less weight in the overall score. This is to prevent 'review-bombing' or abuse of the system.

#### Rotten Tomatoes Critics
Rotten Tomatoes uses a percentage scale called the 'Tomatometer'. This is a scale from 0 to 100% that reports the percentage of critic reviews that are positive. A movie with less than 60% on the Tomatometer is considered 'Rotten', whereas a movie with 60% or more on the meter is considered 'Fresh'. An important consideration is that this is a two-point scale: each review counts as either positive or negative. A positive review will raise the meter, and a negative review will lower it (because it adds to the total number of reviews without adding to the number of positive reviews).

#### Rotten Tomatoes Audience
Rotten Tomatoes uses a slightly different percentage scale for audience members. It still uses 60% as the cutoff between a 'good' and a 'bad' movie, and it still uses a two-point scale. The difference is that a 'positive review' from an audience member is defined as a rating of 3.5 or higher out of 5.
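
As a rough illustration of the two-point scale described above, the following sketch computes a percent-positive score for critics and for audiences. The review data and the helper names `percent_positive` and `label` are hypothetical, and this is not Rotten Tomatoes' own code.

```python
def percent_positive(is_positive):
    """Share of reviews counted as positive, as a percentage from 0 to 100."""
    return 100 * sum(is_positive) / len(is_positive)


def label(score, cutoff=60):
    """Apply the 60% cutoff described above."""
    return "Fresh" if score >= cutoff else "Rotten"


# Hypothetical critic verdicts: True = positive review, False = negative review
critic_reviews = [True, True, True, False]
critic_score = percent_positive(critic_reviews)  # 3 of 4 reviews are positive

# Two more negative reviews lower the score because the total grows
critic_score_after = percent_positive(critic_reviews + [False, False])

# Hypothetical audience star ratings; 3.5 stars or higher counts as positive
audience_stars = [5.0, 4.0, 3.5, 3.0, 2.0]
audience_score = percent_positive([stars >= 3.5 for stars in audience_stars])

print(critic_score, label(critic_score))
print(critic_score_after, label(critic_score_after))
print(audience_score)
```

Note that a 3.5-star rating and a 5-star rating contribute the same amount to the audience score, which is why these percentages are not directly comparable to an average rating.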

#### Letterboxd
Letterboxd uses a similar system to Imdb, except that it has a 10-point scale on which you can rate things from 0.5 stars to 5 stars. One other key difference is that Letterboxd attempts to account for low sample sizes by weighting down films with low numbers of ratings.
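
Letterboxd's exact formula is not given here, so the sketch below only illustrates the general idea of weighting down films with few ratings. It uses a generic shrinkage toward a site-wide mean, which is an assumption on my part; the function name, `SITE_MEAN`, and `PRIOR_WEIGHT` are all hypothetical.

```python
def shrunk_rating(film_mean, n_ratings, site_mean, prior_weight):
    """Pull a film's mean rating toward the site-wide mean when it has few ratings."""
    return (n_ratings * film_mean + prior_weight * site_mean) / (n_ratings + prior_weight)


SITE_MEAN = 3.2     # hypothetical site-wide average, in stars
PRIOR_WEIGHT = 100  # hypothetical strength of the pull toward the site mean

# Two films with the same raw mean of 4.5 stars but very different sample sizes
few = shrunk_rating(4.5, 10, SITE_MEAN, PRIOR_WEIGHT)
many = shrunk_rating(4.5, 100_000, SITE_MEAN, PRIOR_WEIGHT)

print(round(few, 2), round(many, 2))
```

Under this kind of scheme, the film with only 10 ratings ends up close to the site mean, while the film with 100,000 ratings keeps almost all of its raw average.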

#### RateYourMusic (films)
RateYourMusic (aka RYM) uses a weighted average system that rewards users who actively participate on the website. Ratings are given in stars out of 5 in half-star steps, which makes it a 10-point scale like Letterboxd. Again, not all ratings contribute equally to the final score. Users who sign up and leave only a few ratings on the website are weighted less (~0.5) than users who actively participate (~1). Users who abuse the system are given a near-zero weighting (~0). One interesting detail of RYM is that it gives a small bonus to someone's weighting (~1.25) if they not only rate a movie, but also leave a review.
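
To make the effect of these weights concrete, here is a minimal sketch of a user-weighted average. The weights are the approximate values quoted above, while the ratings and the user categories in `WEIGHTS` are hypothetical; RYM's real weighting is more involved than this.

```python
import numpy as np

# Approximate weights taken from the description above
WEIGHTS = {"new": 0.5, "active": 1.0, "reviewer": 1.25, "abusive": 0.0}

# Hypothetical ratings in stars, paired with the kind of user who left each one
ratings = [(4.0, "active"), (5.0, "new"), (3.0, "reviewer"), (0.5, "abusive")]

stars = np.array([rating for rating, _ in ratings])
weights = np.array([WEIGHTS[kind] for _, kind in ratings])

weighted_mean = np.average(stars, weights=weights)
simple_mean = stars.mean()

print(round(weighted_mean, 2), round(simple_mean, 2))
```

In this toy example the abusive 0.5-star rating drags the simple mean down but has no effect on the weighted mean.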

### What assumptions will I make?

In this analysis, I will be converting every aggregate score to a rating out of 10, rounded to 2 decimal places. I will follow Rotten Tomatoes' audience score assumption and assume that an 'average' movie is one that is a 6/10. This seems reasonable, as the other 3 websites have a minimum rating of 1/10 or 0.5/5 and a maximum rating of 10/10 or 5/5, which puts the midpoint of their scales at 5.5/10 (2.75/5), close to 6/10. To account for uncertainty and inaccuracies in ratings, as well as differences in the subjective rating scales of individual moviegoers, I will expand the definition of an 'average' movie to be any movie rated between 5.50 and 6.49 inclusive. Any movie rated 6.50 or above will be considered 'good', whereas any movie rated 5.49 or below will be considered 'bad'. Among the 'good' movies, I will make the further distinction that a movie rated 8.00 or higher is 'great'.

#### In summary

 - 0 - 5.49 = Bad
 - 5.50 - 6.49 = Average
 - 6.50 - 10.00 = Good
    - 8.00 - 10.00 = Great
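
The conversion and the categories above could be expressed roughly as follows. The function names `to_ten_point` and `classify` are my own for illustration, and `classify` returns the most specific label, so a Great movie is reported as Great rather than Good.

```python
def to_ten_point(value, scale_max):
    """Rescale a score whose maximum is `scale_max` (5 stars, 10 points, or 100%) to a 0-10 scale."""
    return round(value * 10 / scale_max, 2)


def classify(rating):
    """Map a 0-10 rating to the categories listed above."""
    if rating >= 8.00:
        return "Great"
    if rating >= 6.50:
        return "Good"
    if rating >= 5.50:
        return "Average"
    return "Bad"


# Hypothetical scores: 3.7 stars out of 5 and 87% on a percentage scale
print(to_ten_point(3.7, 5), classify(to_ten_point(3.7, 5)))
print(to_ten_point(87, 100), classify(to_ten_point(87, 100)))
```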

### How will I conduct this analysis?

I will first analyze the rating data by using a weighted average of all 5 ratings sources, then I will do an individual analysis of the scores for each website. I expect the initial analysis will give a solid holistic view of the overall movie quality, whereas the individual analysis will help explain variation in ratings from different sites.
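
A minimal sketch of the combined score, assuming all five sources have already been converted to a 0-10 scale and, as a placeholder, that every source gets equal weight. The two films and their ratings are made up, and the column names mirror the ones used in the data cleaning later in this notebook.

```python
import pandas as pd

# Hypothetical ratings, already converted to a 0-10 scale
ratings = pd.DataFrame(
    {
        "Imdb": [8.4, 6.1],
        "RYM": [7.2, 4.8],
        "Letterboxd": [7.8, 5.6],
        "RT (Audience)": [9.1, 5.5],
        "RT (Critic)": [9.4, 4.0],
    },
    index=["Film A", "Film B"],
)

# Assumed equal weights; change these values to favour particular sources
source_weights = pd.Series(1.0, index=ratings.columns)

# Weighted average across the five source columns, row by row
weighted_sum = ratings[source_weights.index].mul(source_weights, axis=1).sum(axis=1)
ratings["Weighted Rating"] = (weighted_sum / source_weights.sum()).round(2)

print(ratings["Weighted Rating"])
```

With equal weights this reduces to a plain mean, but keeping the weights explicit makes it easy to test how sensitive the results are to any one website.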

## Exploratory Data Analysis



In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("../data/processed/Movie Aggregate Rating Data.csv")

# This is a dataset of my own creation. All data in this set was manually obtained and entered by myself.

In [None]:
# Drop the empty spreadsheet columns and any rows without Imdb or RYM data
df = df.drop(columns=[f'Unnamed: {i}' for i in range(6, 13)])
df = df.dropna(subset=['Imdb', 'Rateyourmusic'])

# Each raw column holds a rating followed by a sample size; split on the first space only
df[['Imdb', 'Imdb Sample Size']] = df['Imdb'].str.split(' ', n=1, expand=True)
df[['RYM', 'RYM Sample Size']] = df['Rateyourmusic'].str.split(' ', n=1, expand=True)
df[['RT (Audience)', 'RT (Audience) Sample Size']] = df['RT(Audience)'].str.split(' ', n=1, expand=True)
df[['RT (Critic)', 'RT (Critic) Sample Size']] = df['RT(Critic)'].str.split(' ', n=1, expand=True)
df[['Letterboxd', 'Letterboxd Sample Size']] = df['Letterboxd'].str.split(' ', n=1, expand=True)
df = df.drop(columns=['RT(Critic)', 'RT(Audience)', 'Rateyourmusic'])

# Convert every column that is fully numeric and leave the other columns unchanged
def to_numeric_if_possible(column):
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column

df = df.apply(to_numeric_if_possible)

# Standardize the 5-star ratings to a 10-point scale so they can be compared with Imdb
df['RYM'] = df['RYM'] * 2
df['Letterboxd'] = df['Letterboxd'] * 2
print(df)

In [None]:
# The columns I will be focusing on in this exploratory analysis are Imdb and RYM.
# More specifically, for this milestone, I will be looking at the first 250 entries (I still need to finish creating the dataset).

# This first visualization is a scatterplot comparing ratings between Imdb and RYM.

# Observation 1: Ratings on Imdb and RYM appear to be strongly correlated, and the relationship looks roughly linear.
# Observation 2: Ratings on Imdb tend to be noticeably higher on average than on RYM.
ax = sns.scatterplot(data=df, x='Imdb', y='RYM')
ax.set(title='Relationship between Imdb and RYM Movie Ratings', xlabel='Imdb Ratings', ylabel='RYM Ratings')
# Both axes run from 0 to 10 to reflect the minimum and maximum rating values.
# This makes the difference in average ratings between the two sites easy to see.
ax.set(xlim=(0, 10), ylim=(0, 10))
plt.show()

![scatterplot](../images/cody_histogram2.JPG)

In [None]:
# Aggregate the RYM and Imdb scores into a Weighted Rating column.
# For now both sites are weighted equally, so this is just the average of the two scores.
df['Weighted Rating'] = (df['RYM'] + df['Imdb']) / 2

In [None]:
ax = sns.boxplot(data=df, x='Weighted Rating')
ax.set(title='Box Plot of Weighted Rating out of 10 for Imdb and RYM', xlabel='Weighted Rating (0-10 Scale)')
plt.show()

![boxplot](../images/cody_boxplot.JPG)

In [None]:
df['Weighted Rating'].describe()

![summary](../images/cody_summary.JPG)

We can make several observations from this box plot of Weighted Rating and from these summary statistics.
1. The median of Weighted Rating is 6.42, which is below the 6.50 cutoff for Good. By our definitions of Good, Bad, and Average, at least 50% of observations are therefore Average or Bad.
2. Quartile 1 is 5.76/10. This implies that 25% of observations are below this rating.
3. Quartile 3 is 7.15/10. This implies that 25% of observations are above this rating.
4. 50% of observations fall between 5.76 and 7.15. Because Quartile 1 is above the 5.50 cutoff for Average, at least 75% of movies in this dataset are, by definition, Average or Good.
5. The mean (6.44) of Weighted Rating is slightly higher than the median (6.42). This suggests that the distribution of Weighted Rating has a slight skew to the right, although the difference is small.
6. The standard deviation is ~0.96.
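
Comparing the mean with the median is only a rough indicator of skew. A sketch of a more direct check, using made-up ratings in place of the real `Weighted Rating` column, might look like this:

```python
import pandas as pd

# Hypothetical weighted ratings, not the real dataset
weighted = pd.Series([5.2, 5.8, 6.0, 6.3, 6.4, 6.5, 6.9, 7.2, 7.6, 8.5])

# Quartiles, as reported by describe()
q1, median, q3 = weighted.quantile([0.25, 0.50, 0.75])

# Share of movies that are Average or better under the definitions above
share_average_or_better = (weighted >= 5.50).mean()

# Sample skewness; positive values indicate a longer right tail
skewness = weighted.skew()

print(q1, median, q3)
print(share_average_or_better, skewness)
```

Applying the same calls to `df['Weighted Rating']` would show whether the small gap between the mean and the median reflects a real right skew.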

### Histograms

In [None]:
# Let us put Weighted Rating into a histogram to get a clearer sense of its distribution.
# rating_bins is a series of cutoff points from 0 to 10. Changing the step size changes the bin width,
# which lets us see how often a movie's Weighted Rating falls within specific ranges.
rating_bins = pd.Series(np.arange(0, 11, 1))
plot3 = sns.displot(data = df, x = 'Weighted Rating', bins = rating_bins).set(title = 'Distribution of Movies by Weighted Rating')

#### Histogram 1 (split every 1 rating point)
![histogram 1](../images/cody_histogram3.JPG)

#### Histogram 2 (split every 0.5 rating points)
![histogram 2](../images/cody_histogram1.JPG)

#### Histogram 3 (split every 0.25 rating points)
![histogram 3](../images/cody_histogram2.JPG)
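
The three histograms differ only in their bin width. A small sketch of how the bin edges for each width could be generated, using a hypothetical helper `rating_bin_edges` and made-up ratings:

```python
import numpy as np


def rating_bin_edges(width):
    """Bin edges covering 0 to 10 inclusive, in steps of `width`."""
    return np.arange(0, 10 + width, width)


# Hypothetical weighted ratings, not the real dataset
sample = np.array([5.2, 5.8, 6.0, 6.3, 6.4, 6.5, 6.9, 7.2, 7.6, 8.5])

for width in (1, 0.5, 0.25):
    edges = rating_bin_edges(width)
    counts, _ = np.histogram(sample, bins=edges)
    # Number of bins and number of ratings that landed in them
    print(width, len(edges) - 1, counts.sum())
```

Passing `rating_bin_edges(0.5)` or `rating_bin_edges(0.25)` as the `bins` argument of `sns.displot` should give the finer-grained versions of the histogram.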