library(ggplot2)
library(dplyr)
library(statsr)
library(kableExtra)
library(GGally)
library(gridExtra)
load("movies.Rdata")
This study evaluates a data set provided for Coursera which contains 651 random movies produced and released before 2016 and selected from Rotten Tomatoes and IMDB websites. Both websites rate movies and provide some information about them.
This is an observational study that uses a random sampling approach to obtain a representative sample which makes the data generalizable to movies. Because the study does not use a random assignment that can have an indication of association among the relationship of the variables of interest, but not causation.
The audience score is created by volunteers who can raise a bias in the study because probably those are people with more interest in movies and with strong opinions and different than the overall audience.
Is there a relationship between the IMDB rating and the runtime variable? Can this relationship be affected by the movie’s genre?
Maybe the IMDB rating is affected depending on how long is the movie. It is possible that too short or too long movies have a worse rating. Maybe the genre affects the movie’s runtime and the rating.
The goal is to analyze a possible relationship between the IMDB rating and genre and runtime variables.
Let’s start the data in genre and runtime variables.
genre <- data.frame(round(prop.table(table(movies$genre)) * 100, 2))
genre[order(genre$Freq, decreasing = T), ] %>%
kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Var1 |
Freq |
|
---|---|---|
6 |
Drama |
46.85 |
4 |
Comedy |
13.36 |
1 |
Action & Adventure |
9.98 |
9 |
Mystery & Suspense |
9.06 |
5 |
Documentary |
7.99 |
7 |
Horror |
3.53 |
10 |
Other |
2.46 |
3 |
Art House & International |
2.15 |
8 |
Musical & Performing Arts |
1.84 |
2 |
Animation |
1.38 |
11 |
Science Fiction & Fantasy |
1.38 |
par(mar = c(11,4,4,2) + 0.1)
barplot(height = genre$Freq, names = genre$Var1,
col = palette.colors(nrow(genre)),
main = "Lower Class", las = 2)
The genre variable has Drama
with more than 46% of the movies.
summary(movies$runtime)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 39.0 92.0 103.0 105.8 115.8 267.0 1
hist(movies$runtime, main = 'Histogram of Runtime', xlab = 'Runtime')
The Runtime
variable is moderately right-skewed and has a null value.
The values are concentrating around 100 minutes for the majority of the
movies.
summary(movies$imdb_rating)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.900 5.900 6.600 6.493 7.300 9.000
hist(movies$imdb_rating, main = 'Histogram of IMDB Rating', xlab = 'IMDB Rating')
The IMDB rating
variable is left-skewed. The values are concentrating
around 6.5 rates for the majority of the movies and only 25% of the
movies have less than 5.9 as rate. It means that the big majority of
movies have a good rate or at least better than average.
df <- movies %>% filter(!is.na(runtime)) %>% select('imdb_rating', 'runtime')
ggpairs(df)
ggplot(data = df, aes (x = runtime, y = imdb_rating)) +
geom_jitter(color = 'darkblue') +
geom_smooth(method = 'lm', formula = y~x, color = 'red') +
ylab("IMDB Rating") +
xlab("Runtime") +
labs(title = "Scatter Plot of IMDB Rating vs Runtime" )
The correlation between IMDB rating
and Runtime
is weak. It means a
movie’s runtime
does not strongly affect its IMDB rating
.
graphs <- list()
pos <- 1
genres <- unique(movies$genre)
for (g in genres) {
df <- movies %>% filter(!is.na(runtime) & genre == g) %>% select('imdb_rating', 'runtime')
correlation <- round(cor(df)[1, 2], 2)
graphs[[pos]] <- ggplot(data = df, aes (x = runtime, y = imdb_rating)) +
geom_jitter(color = 'darkblue') +
geom_smooth(method = 'lm', formula = y~x, color = 'red') +
ylab('IMDB Rating') +
xlab('Runtime') +
labs(title = paste(g, '-', 'Correlation:', correlation)) +
theme(plot.title = element_text(hjust = 0.5))
pos <- pos + 1
}
grid.arrange(grobs = graphs,
ncol = 2, nrow = round(length(graphs) / 2))
Dividing the movies by genre is possible to see that only a few have a
strong relationship between the IMDB rating
and the Runtime
. Those
are Mystery & Suspense
and Other
and probably their stronger
correlation is explained by outliers.
In this step, we will predict the IMDB rating
using multiple linear
regression and comparing with the actor, actress, director or the move
won any Oscar. The idea is to verify if have any relation with win an
Oscar increases the movie’s rating. Beyond that, we can simulate if the
IMDB
website always has a delay rating comparing with the Rotten Tomatoes
website. In this case, can the Rotten Tomatoes
website
explains part of the IMDB rating
?
The question changes from the research question mainly because during
the exploratory data analysis was easy to see that the runtime
variable is not explanatory to IMDB rating
and because the variable
cited above can give to us a better explanation.
-
irrelevant variables as
title
,title_type
,genre
,mpaa_rating
,studio
,imdb_num_votes
,best_pic_nom
,thtr_rel_year
,thtr_rel_month
,thtr_rel_day
,dvd_rel_year
,dvd_rel_month
,dvd_rel_month
,top200_box
,director
,actor1
,actor2
,actor3
,actor4
,actor5
,imdb_url
, andrt_url
-
variable already tested as
runtime
which was tested in the EDA step
movies_t0 <- movies %>%
mutate(
best_pic_win = ifelse(best_pic_win == 'yes', 1, 0),
best_actor_win = ifelse(best_actor_win == 'yes', 1, 0),
best_actress_win = ifelse(best_actress_win == 'yes', 1, 0),
best_dir_win = ifelse(best_dir_win == 'yes', 1, 0)
) %>%
select('imdb_rating', 'critics_score', 'audience_score', 'best_pic_win', 'best_actor_win', 'best_actress_win', 'best_dir_win')
head(movies_t0) %>%
kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
imdb_rating |
critics_score |
audience_score |
best_pic_win |
best_actor_win |
best_actress_win |
best_dir_win |
---|---|---|---|---|---|---|
5.5 |
45 |
73 |
0 |
0 |
0 |
0 |
7.3 |
96 |
81 |
0 |
0 |
0 |
0 |
7.6 |
91 |
91 |
0 |
0 |
0 |
0 |
7.2 |
80 |
76 |
0 |
1 |
0 |
1 |
5.1 |
33 |
27 |
0 |
0 |
0 |
0 |
7.8 |
91 |
86 |
0 |
0 |
0 |
0 |
dim(movies_t0)
## [1] 651 7
After removing the variables then rest 7 variables in the data set.
After removing the variables that are irrelevant to the study that is the full model:
full_model <- lm( imdb_rating ~ critics_score + audience_score + best_pic_win + best_actor_win + best_actress_win + best_dir_win,
data = movies_t0)
summary(full_model)
##
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win, data = movies_t0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.50444 -0.21333 0.01883 0.30667 1.25385
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6286267 0.0629684 57.626 <2e-16 ***
## critics_score 0.0114887 0.0009558 12.020 <2e-16 ***
## audience_score 0.0347740 0.0013352 26.043 <2e-16 ***
## best_pic_win 0.1022745 0.1992119 0.513 0.6079
## best_actor_win 0.1001511 0.0554421 1.806 0.0713 .
## best_actress_win 0.0969903 0.0623271 1.556 0.1202
## best_dir_win 0.1055285 0.0824182 1.280 0.2009
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.488 on 644 degrees of freedom
## Multiple R-squared: 0.7995, Adjusted R-squared: 0.7976
## F-statistic: 427.9 on 6 and 644 DF, p-value: < 2.2e-16
To Model Selection technique it will use backward elimination to get the
simpler model. Here it will use the step
function evaluating the
Akaike Information Criterion(AIC) to get the simplest model as possible.
final_model <- step(full_model, direction = "backward")
## Start: AIC=-927.13
## imdb_rating ~ critics_score + audience_score + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win
##
## Df Sum of Sq RSS AIC
## - best_pic_win 1 0.063 153.43 -928.87
## - best_dir_win 1 0.390 153.76 -927.48
## <none> 153.37 -927.13
## - best_actress_win 1 0.577 153.94 -926.69
## - best_actor_win 1 0.777 154.14 -925.84
## - critics_score 1 34.409 187.78 -797.36
## - audience_score 1 161.523 314.89 -460.81
##
## Step: AIC=-928.87
## imdb_rating ~ critics_score + audience_score + best_actor_win +
## best_actress_win + best_dir_win
##
## Df Sum of Sq RSS AIC
## <none> 153.43 -928.87
## - best_dir_win 1 0.549 153.98 -928.54
## - best_actress_win 1 0.639 154.07 -928.16
## - best_actor_win 1 0.773 154.20 -927.60
## - critics_score 1 34.499 187.93 -798.83
## - audience_score 1 162.164 315.59 -461.36
summary(final_model)
##
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + best_actor_win +
## best_actress_win + best_dir_win, data = movies_t0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.50376 -0.21254 0.01754 0.30679 1.25473
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.625777 0.062687 57.839 <2e-16 ***
## critics_score 0.011500 0.000955 12.043 <2e-16 ***
## audience_score 0.034806 0.001333 26.110 <2e-16 ***
## best_actor_win 0.099871 0.055408 1.802 0.0719 .
## best_actress_win 0.101187 0.061754 1.639 0.1018
## best_dir_win 0.118806 0.078210 1.519 0.1292
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4877 on 645 degrees of freedom
## Multiple R-squared: 0.7994, Adjusted R-squared: 0.7978
## F-statistic: 514.1 on 5 and 645 DF, p-value: < 2.2e-16
After simplifying the model it has a small improvement of the adjusted
R-squared from 0.7976
to 0.7978
. The model has one less variable.
IMDb_Rating = 3.625777 + 0.011500 ∗ critics_score + 0.034806 ∗ audience_score + 0.099871 * best_actor_win + 0.101187 * best_actress_win + 0.118806 * best_dir_win
-
Intercept: it is 3.625777, which means a movie without any Oscar and Rotten Tomatoes scores equal 0 related should rate this value.
-
critics_score: all else held constant, for every one unit increase in critics_score, the model predicts a 0.011500 increase in
imdb_rating
on average. -
audience_score: all else held constant, for every one unit increase in audience_score, the model predicts a 0.034806 increase in
imdb_rating
on average. -
best_actor_win: all else held constant, for every movie that the actor ever won an Oscar, the model predicts a 0.099871 increase in
imdb_rating
on average. -
best_actress_win: all else held constant, for every movie that the actress ever won an Oscar, the model predicts a 0.101187 increase in
imdb_rating
on average. -
best_dir_win: all else held constant, for every movie that the director ever won an Oscar, the model predicts a 0.118806 increase in
imdb_rating
on average. -
R2: 79.94% of the variability in
IMDb rating
can be explained by the model.
Before making predictions, we would like to check if our model satisfies
all the conditions for a Linear Regression
model.
Using the residuals plot and comparing the two numerical variables
critics_score
and audience_score
is possible to evaluate if it has a
random scatter around 0.
critics_score_residual <- data.frame(residuals = final_model$residuals, critics_score = movies_t0$critics_score)
ggplot(data = critics_score_residual, aes(x = critics_score, y = residuals)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 'dashed') +
labs(x = 'Critics Score', y = 'Residuals')
audience_score_residual <- data.frame(residuals = final_model$residuals, audience_score = movies_t0$audience_score)
ggplot(data = audience_score_residual, aes(x = audience_score, y = residuals)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 'dashed') +
labs(x = 'Audience Score', y = 'Residuals')
It is possible to verify that the residuals for both variables are
distributed randomly scattered around 0, which means that the conditions
are met.
Using a histogram and a normal probability plot is possible to check if the residuals are nearly normally distributed.
par(mfrow=c(1,2))
hist(final_model$residuals, main = 'Histogram of Residuals')
qqnorm(final_model$residuals, main = 'Normal Probability Plot of Residuals')
qqline(final_model$residuals)
The plot is left-skewed but not strong and the normal probability plot
has deviation only on the left tail. It is right to assume that the
residuals have a nearly normal distribution and this condition is
satisfied
The residuals should be equally variable for low and high values of the predicted response variable. To check this condition it uses a residuals plot of residuals vs. predicted.
graphs <- c()
fitted_residual <- data.frame(residuals = final_model$residuals, fitted = final_model$fitted)
graphs[[1]] <- ggplot(data = fitted_residual, aes(x = fitted, y = residuals)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 'dashed') +
labs(title = 'Residuals vs. fitted', x = 'Fitted', y = 'Residuals') +
theme(plot.title = element_text(hjust = 0.5))
graphs[[2]] <- ggplot(data = fitted_residual, aes(x = fitted, y = abs(residuals))) +
geom_point() +
geom_hline(yintercept = 0, linetype = 'dashed') +
labs(title = 'Absolute values of residuals vs. fitted', x = 'Fitted', y = 'Residuals') +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(grobs = graphs,
ncol = 2, nrow = 1)
The first plot is the Residuals vs. fitted
and it does not have a fan
shape and the variability of the residuals are nearly constant since the
distribution is left-skewed. The second plot Absolute values of residuals vs. fitted
confirms the nearly constant variability and this
condition has been satisfied too.
The residuals should not show any pattern when increasing or decreasing. To that, we plot the residuals in the normal order.
order_residual <- data.frame(residuals = final_model$residuals, index = seq(1:length(final_model$residuals)))
ggplot(data = order_residual, aes(x = index, y = residuals)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 'dashed') +
labs(title = 'Order Residuals', x = 'Index', y = 'Residuals') +
theme(plot.title = element_text(hjust = 0.5))
Considering it is an observational study where the data was randomly
collected, and examining the plot above without any pattern is possible
to consider the condition was met.
Using the model created above in this step we will predict an IMDB rating
for a movie that does not exist in the current data set. The
movie is called Harriet
from 2019 and has a rate of 6.6
on the IMDB
website. The data about this movie come from IMDB and Rotten Tomatoes
website.
harriet <- data.frame(critics_score = 73, audience_score = 97, best_pic_win = 0, best_actor_win = 0, best_actress_win = 0, best_dir_win = 0)
predict(final_model, harriet, interval = 'prediction', level = 0.95) %>%
kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
fit |
lwr |
upr |
---|---|---|
7.841455 |
6.879956 |
8.802954 |
The IMDB rating
predict is 7.841455
, which is not accurate, given
that the real rate is 6.6. The prediction with 95% of confidence for
IMDB rating
was between 6.879956
to 8.802954
what is not precise
to the real rate too.
Maybe Harriet
for being a more modern movie (from 2019) is not a good
movie to be predicted because the current data set is the movies
released before 2016. The question is if the movie is from 2016 the
prediction could be more accurate. For the new prediction, it will use
the movie The Magnificent Seven
from 2016 which has a 6.9
rate in
the IMDB website. The data about this movie come from IMDB and Rotten
Tomatoes website.
the_magnificent_seven <- data.frame(critics_score = 64, audience_score = 72, best_pic_win = 0, best_actor_win = 1, best_actress_win = 0, best_dir_win = 0)
predict(final_model, the_magnificent_seven, interval = 'prediction', level = 0.95) %>%
kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
fit |
lwr |
upr |
---|---|---|
6.96768 |
6.004102 |
7.931257 |
The IMDB rating
predict is 6.96768
, which is very accurate to the
real rate is 6.9. The prediction with 95% of confidence for IMDB rating
was between 6.004102
to 7.931257
what is precise to the real
rate too.
Maybe to predict a more modern movie with more accuracy the model needs
a more updated data set. It is possible that the rating approach change
in the last years.
The conclusion for this study was that runtime
is not a strong
explanatory variable to IMDB rating
. On the other hand, the Rotten Tomatoes'
score and have actors/actresses, a director that won an
Oscar, or won an Oscar for the best picture is a good explanation to
IMDB rating
.
Maybe the current data set should be updated with more modern movies to
have more accuracy in predict modern movies.