# Lab 4: Visualizing Distributions of IMDB Title Ratings


# Preliminaries

1. Comments about Homework 2 submissions
  -  Most of you did well, but we have some general comments.
  -  Read each question carefully before submitting your answers. Many of you lost points for not following the exact instructions. Examples:
      - Question 1(a) asks you to write the command that installs the tidyverse package and then load the package into your current environment. Your answer should show the install command, install.packages("tidyverse"); you can prefix it with # to make it a comment.
      -  Question 2(b) asks you to write the command that gets more information on the airquality data set and then output its first 6 rows. Many students used only head() to show the first 6 rows, but you also need ?airquality or summary(airquality) to get information about the data set first.
  -  Make sure your code is shown completely in the html file and is not cropped.
  -  Make sure the output is displayed properly in your html or pdf files.

In [None]:
library(tidyverse)

# HW2 Challenge Question

Load the diamonds data set. Reproduce the following plot.

(Hint: for this plot, you will need to figure out how to manually manipulate the ticks on the 𝑥 and 𝑦 axes.)

![title](https://github.com/keanmingtan/stats306_fall2021/blob/main/HW/HW1/carat.png?raw=true)

In [None]:
ggplot(data=diamonds, aes(x = carat, y = price)) + geom_point()

In [None]:
ggplot(data = diamonds, aes(x = carat, y = price)) +
    geom_point() +
    scale_x_continuous(name = "Carats", trans = "log10", breaks = c(0.2, 0.5, 1.0, 2.0, 5.0),
                       limits = c(0.2, 5.0)) +
    scale_y_continuous(name = "Price", trans = "log10", breaks = c(500, 1000, 2000, 5000, 10000, 20000),
                       limits = c(500, 20000))

In [None]:
ggplot(data = diamonds) + geom_point(mapping = aes(x = carat, y = price)) +
    labs(x = "Carats", y = "Price") +
    scale_x_log10(breaks = c(0.2, 0.5, 1.0, 2.0, 5.0), 
                  limits = c(0.2, 5.0)) +
    scale_y_log10(breaks = c(500, 1000, 2000, 5000, 10000, 20000), 
                  limits = c(500,20000))
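
Both solutions above put the axes on a log10 scale, one through trans = "log10" and one through the scale_x_log10 shorthand. Here is a quick sketch to check that the two approaches transform the data identically. The object names p_trans and p_short are made up for this illustration, and recent ggplot2 versions may print a note that trans has been renamed to transform.

```r
library(ggplot2)

# Same plot, two ways of asking for a log10 x axis
p_trans <- ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point() +
    scale_x_continuous(trans = "log10")
p_short <- ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point() +
    scale_x_log10()

# layer_data() returns the data after the scale transformation has been applied
x_trans <- layer_data(p_trans)$x
x_short <- layer_data(p_short)$x
all.equal(x_trans, x_short)
```

The breaks and limits arguments work the same way in both forms, so the choice between them is mostly a matter of taste.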

# Visualizing IMDB Movie Ratings

Original source: https://www.imdb.com/interfaces/

In many cases, you can read compressed files directly with the ```read_*``` family of functions in tidyverse.

In [None]:
ratings <- read_tsv('https://raw.githubusercontent.com/dereklhansen/stats306_lab/master/lab4/title.ratings.tsv.gz')

In [None]:
print(ratings)
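
If you want to convince yourself that the compression is handled automatically, here is a small local round trip. This is only a sketch: the three rows are made up (they just reuse the column names of the ratings table), and the file goes to a temporary path.

```r
library(readr)

# Made-up rows with the same column names as the ratings table
toy <- data.frame(
    tconst = c("tt0000001", "tt0000002", "tt0000003"),
    averageRating = c(5.7, 6.0, 6.5),
    numVotes = c(100, 200, 300),
    stringsAsFactors = FALSE
)

# The ".gz" extension tells write_tsv to gzip the output
path <- tempfile(fileext = ".tsv.gz")
write_tsv(toy, path)

# read_tsv detects the compression and decompresses on the fly
roundtrip <- read_tsv(path)
roundtrip
```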

In [None]:
ggplot(ratings) + geom_histogram(aes(x=averageRating))

What can we say about the distribution of ratings?

-  The peak is around 7.5
-  The distribution is skewed left

In [None]:
ggplot(ratings) + geom_density(aes(x=averageRating))

geom_density estimates the probability density function of your data with a kernel density estimate. It relies on a smoothing parameter called the "bandwidth". A higher bandwidth gives a smoother curve, but may hide local features. The documentation describes how the default bandwidth is chosen.

In [None]:
?geom_density
?stats::bw.nrd0

In [None]:
default_bw <- stats::bw.nrd0(ratings$averageRating)
default_bw

In [None]:
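# Redraw the density with increasingly large multiples of the default bandwidth
multipliers <- c(1, 1.25, 1.5, 10, 1000)
for (m in multipliers) {
    g <- ggplot(ratings) + 
        geom_density(aes(x = averageRating), bw = m * default_bw) + 
        ggtitle(paste0("Distribution of IMDB Movie ratings, bw=", m * default_bw))
    # Inside a loop, a ggplot object must be drawn explicitly
    plot(g)
}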
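
The loop above multiplies the default bandwidth by hand. geom_density also has an adjust argument that does the same multiplication for you. The sketch below checks this on simulated values, so it runs without downloading anything; the toy data frame and the multiplier 1.5 are assumptions made for the illustration.

```r
library(ggplot2)

# Simulated stand-in for the ratings column (not real IMDB data)
set.seed(306)
toy <- data.frame(averageRating = pmin(10, pmax(1, rnorm(500, mean = 7, sd = 1.2))))

toy_bw <- stats::bw.nrd0(toy$averageRating)

# Two ways of asking for 1.5 times the default bandwidth
p_bw     <- ggplot(toy) + geom_density(aes(x = averageRating), bw = 1.5 * toy_bw)
p_adjust <- ggplot(toy) + geom_density(aes(x = averageRating), adjust = 1.5)

# The estimated density values agree
all.equal(layer_data(p_bw)$density, layer_data(p_adjust)$density)
```

Using adjust saves you from computing the default bandwidth yourself, which is convenient when you only want "a bit smoother" or "a bit rougher".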

In [None]:
principals <- read_tsv('https://datasets.imdbws.com/title.principals.tsv.gz')

In [None]:
print(principals)

In [None]:
library(stringi)

Suppose we want to compare the two greatest actors of our generation:

![nic](https://raw.githubusercontent.com/dereklhansen/stats306_lab/master/lab4/nic.jpg) ![leo](https://raw.githubusercontent.com/dereklhansen/stats306_lab/master/lab4/leo.jpg)

The data set containing all actors and characters is very large, so I ran the code below once to build a table that records, for each title, whether Nicolas Cage or Leonardo DiCaprio appears in it, and saved the result to "greatest_actors.csv.gz". Note that write_csv automatically compresses the output when the file name ends in ".gz".

```
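greatest_actors <- principals %>% 
    # Label each credit by actor; summarize() below drops this column again
    mutate(actor=case_when(
        nconst=="nm0000115"~"Nicolas Cage",
        nconst=="nm0000138"~"Leonardo DiCaprio",
        TRUE~"Somebody else"
    )) %>%
    # One row per title, flagging whether each actor is among its principals
    group_by(tconst) %>%
    summarize(has_nic=any(nconst=="nm0000115"), has_leo=any(nconst=="nm0000138"))
# The ".gz" extension makes write_csv compress the file
write_csv(greatest_actors, "greatest_actors.csv.gz")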
```

In [None]:
greatest_actors <- read_csv("https://raw.githubusercontent.com/dereklhansen/stats306_lab/master/lab4/greatest_actors.csv.gz")

In [None]:
print(greatest_actors)

We now join this table to our ratings table via the title identifier "tconst".

In [None]:
ratings_actors <- inner_join(ratings, greatest_actors, by="tconst")

In [None]:
print(ratings_actors)
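
Keep in mind that inner_join only keeps titles that appear in both tables. A minimal sketch with made-up identifiers (the tables toy_ratings and toy_actors are hypothetical) shows which rows survive and which are dropped:

```r
library(dplyr)

# Hypothetical miniature versions of the two tables
toy_ratings <- tibble(tconst = c("tt1", "tt2", "tt3"),
                      averageRating = c(7.1, 5.4, 8.0))
toy_actors  <- tibble(tconst = c("tt2", "tt3", "tt4"),
                      has_nic = c(TRUE, FALSE, FALSE))

# Only tt2 and tt3 exist in both tables, so only they are kept
joined <- inner_join(toy_ratings, toy_actors, by = "tconst")
joined

# anti_join lists the ratings rows that found no match (here tt1)
unmatched <- anti_join(toy_ratings, toy_actors, by = "tconst")
unmatched
```

Comparing nrow(ratings) with nrow(ratings_actors) is a quick way to see how many titles were lost in the real join.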

A nifty use of the "case_when" function lets us construct a single variable, "cast", which will categorize movies into four groups:
-  Has Nicolas Cage ("Nic"), but not Leonardo DiCaprio ("Leo")
-  Has Leo, but not Nic
-  Has both Leo and Nic
-  Has neither Leo nor Nic

In [None]:
ratings_nic_vs_leo <- ratings_actors %>%
    mutate(cast = case_when(
        has_nic & !has_leo  ~ "nic_only",
        !has_nic & has_leo  ~ "leo_only",
        has_nic & has_leo   ~ "both",
        !has_nic & !has_leo ~ "neither")) %>%
    mutate(cast = as.factor(cast)) %>%
    filter(numVotes > 10000)
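
case_when checks its conditions from top to bottom and uses the first one that is TRUE. That means the same four groups can be written with fewer negations if the most specific case comes first. This is just an alternative sketch on a four-row toy table, not a replacement for the code above:

```r
library(dplyr)

# One made-up row for each combination of the two flags
toy <- tibble(has_nic = c(TRUE, FALSE, TRUE, FALSE),
              has_leo = c(FALSE, TRUE, TRUE, FALSE))

toy <- toy %>%
    mutate(cast = case_when(
        has_nic & has_leo ~ "both",      # must come before the single-actor cases
        has_nic           ~ "nic_only",  # reached only when has_leo is FALSE
        has_leo           ~ "leo_only",
        TRUE              ~ "neither"))  # catch-all for everything else
toy
```

Without a catch-all, rows that match no condition get NA, so it is worth checking for missing values after a case_when.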

Let's look at the ratings of movies with Leonardo DiCaprio vs movies without him.

In [None]:
ggplot(ratings_nic_vs_leo, aes(x=averageRating, fill=has_leo, group=has_leo)) + geom_histogram()

Wait, what happened? Since the y-axis is count, the number of movies with Leo in them is dwarfed by the number of total titles in the dataset. Even the most prolific actors can't star in that many movies.

There's a way to fix the histogram to be proportional and compare groups; this is left as an exercise.

Instead, we can use geom_density to compare the distribution of a variable between different groups.

In [None]:
ggplot(ratings_nic_vs_leo, aes(x=averageRating, color=has_leo, group=has_leo)) + geom_density()

In [None]:
ggplot(ratings_nic_vs_leo, aes(x=averageRating, color=has_nic, group=has_nic)) + geom_density()
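
Why does the density plot work where the histogram failed? Each density curve is scaled so that the area under it is about 1, no matter how many titles are in the group. The sketch below checks this with base R's density function on two simulated groups of very different size (the numbers are made up, not IMDB data):

```r
# Simulated ratings: one large group and one small group
set.seed(306)
big_group   <- rnorm(5000, mean = 6.5, sd = 1.0)
small_group <- rnorm(50,   mean = 7.5, sd = 0.8)

# Approximate the area under the estimated density curve with a Riemann sum
area_under <- function(x) {
    d <- stats::density(x)
    sum(d$y) * (d$x[2] - d$x[1])
}

area_big   <- area_under(big_group)
area_small <- area_under(small_group)
c(big = area_big, small = area_small)
```

Both areas come out close to 1, which is why the small group is just as visible as the large one.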

We can look at the distributions of all four of our groups. Note that this gets a bit messy and hard to read.

In [None]:
ggplot(ratings_nic_vs_leo, aes(x=averageRating, color=cast, group=cast)) + geom_density()

The "geom_boxplot" function is another handy way to visualize distributions. It is arguably more foolproof than geom_density because it plots the quartiles of each distribution directly, with no smoothing parameter to choose. It also makes it easier to compare many distributions visually.

In [None]:
ggplot(ratings_nic_vs_leo, aes(x=averageRating, color=cast, group=cast)) + geom_boxplot()

Here we see that Nic's titles are generally rated lower than Leo's. Moreover, Nic has been in some particularly low-rated titles, shown by the outlier dots.
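
How does geom_boxplot decide which points become outlier dots? According to its documentation, the whiskers reach to the most extreme values within 1.5 times the interquartile range (IQR) of the box, and anything beyond that is drawn as an individual point. Here is a sketch of that rule applied by hand to a made-up vector of ratings:

```r
# Made-up ratings with one unusually low value
x <- c(2.1, 5.9, 6.3, 6.8, 7.0, 7.2, 7.5, 7.9, 8.1)

# Edges of the box: first and third quartiles
q <- unname(quantile(x, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences at 1.5 * IQR beyond the box
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Values outside the fences would be drawn as dots
outliers <- x[x < lower_fence | x > upper_fence]
outliers
```

In this toy vector only the rating 2.1 falls outside the fences.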

The summary statistics for each group can be seen here:

In [None]:
ratings_nic_vs_leo %>%
    group_by(cast) %>%
    summarize(
        worst=min(averageRating),
        q25=quantile(averageRating, 0.25),
        q50=quantile(averageRating, 0.50),
        q75=quantile(averageRating, 0.75),
        best=max(averageRating))
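
Since the four groups differ so much in size, it can help to report the number of titles next to the quantiles. A minimal sketch on a made-up table (the column name n_titles is just a suggestion):

```r
library(dplyr)

# Made-up ratings for three of the cast groups
toy <- tibble(
    cast = c("neither", "neither", "neither", "neither",
             "leo_only", "nic_only", "nic_only"),
    averageRating = c(6.0, 7.0, 6.5, 8.0, 8.2, 5.1, 6.9))

toy_summary <- toy %>%
    group_by(cast) %>%
    summarize(
        n_titles = n(),               # group size
        q50 = median(averageRating))  # same as quantile(averageRating, 0.50)
toy_summary
```

A median based on a handful of titles is much less stable than one based on thousands, so the count tells you how much to trust each row.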