# Lab 4 (2/3): dplyr

### Web pages
Course page: https://ambujtewari.github.io/teaching/STATS306-Winter2020/

Lab page: https://rogerfan.github.io/stats306_w20/

### Office Hours
    Mondays: 2-4pm, USB 2165
    
### Contact
    Questions on problems: Use the Slack discussions
    If you need to email me, include in the subject line: [STATS 306]
    Email: rogerfan@umich.edu
    

In [None]:
library(tidyverse)

# Sample 1200 rows
set.seed(306)
rand_idx = sample(1:nrow(diamonds), 1200)
dm = diamonds[rand_idx, ]
dim(dm)
head(dm)

## Statistical transformations
Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot.

You can learn which stat a geom uses by inspecting the default value for the `stat` argument in the documentation. For example, `geom_bar`'s default value for stat is `'count'`, which means that `geom_bar()` uses `stat_count()`.

`stat_count` is documented on the same page as `geom_bar`, and if you scroll down you can find a section called "Computed Variables." Here we can see that `stat_count` computes two extra variables: `count` and `prop`.
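One way to look at these computed variables directly is `layer_data()` from ggplot2, which returns the data frame a layer actually draws. The sketch below uses a made-up vector of letters; `toy` and `computed` are hypothetical names, not part of the lab data.

```r
library(tidyverse)

# Hypothetical toy data: 'a' appears twice, 'b' once
toy = tibble(letter = c('a', 'a', 'b'))

p = ggplot(toy, aes(x=letter)) +
    geom_bar()

# layer_data() returns the values the stat computed for the first layer
computed = layer_data(p, 1)
computed[, c('x', 'count', 'prop')]
```

The `count` and `prop` columns appear next to the positional columns that the geom needs for drawing.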

In [None]:
popn <- tribble(
~country, ~population,
"ETHIOPIA", 102000000,
"NIGERIA", 186000000,
"EGYPT", 96000000,
"DR CONGO", 78000000,
"SOUTH AFRICA", 56000000
)

print(popn)

In [None]:
ggplot(popn, aes(x=country, y=population)) +
  geom_bar(stat='identity') +   # Could use geom_col() instead
  ggtitle('Most populous countries in Africa')

Question: Can you guess what the output of the following commands will be?

In [None]:
ggplot(popn, aes(x=country, y=population)) +
  geom_bar()

In [None]:
ggplot(popn, aes(x=country)) +
  geom_bar()

### The group aesthetic

In the previous lecture, we used the following code to plot the proportions of each cut in the diamonds dataset.

In [None]:
ggplot(data=dm, aes(x=cut)) + 
    geom_bar(aes(y=..count../sum(..count..)))

Question: You might've noticed that `prop` is directly calculated by the `stat_count` function. So why can't we use the following code to directly plot the proportions?

What would we have to add to fix it?

In [None]:
ggplot(data=dm, aes(x=cut)) + 
    geom_bar(aes(y=..prop..))

### Other summary stats

You may want to apply your own stat functions to the data. For this, you can use `stat_summary()` to apply any summary function, including custom-made ones, to the data for each `x` value.

In [None]:
ggplot(dm, aes(x=cut, y=price)) +
    stat_summary(
        fun.ymin=min,
        fun.ymax=max,
        fun.y=median
    )
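As a rough sketch of a custom-made summary, the example below passes a user-defined function through the `fun.data` argument of `stat_summary()`. The function name `median_iqr` and the choice of the interquartile range are assumptions made for illustration; the full `diamonds` data is used so the block runs on its own.

```r
library(tidyverse)

# Hypothetical custom summary: median with the interquartile range.
# fun.data expects a function that takes a numeric vector and returns
# a data frame with the columns y, ymin and ymax.
median_iqr = function(v) {
    data.frame(
        y = median(v),
        ymin = unname(quantile(v, 0.25)),
        ymax = unname(quantile(v, 0.75))
    )
}

# diamonds ships with ggplot2; the lab's `dm` sample would work the same way
ggplot(diamonds, aes(x=cut, y=price)) +
    stat_summary(fun.data=median_iqr)
```

Each `cut` gets a point at the median price and a vertical line from the first to the third quartile.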

# dplyr for data manipulation

In [None]:
dim(dm)
head(dm)

There are five main functions we will focus on in `dplyr`: `filter`, `arrange`, `select`, `mutate`, and `summarize`. All of them share the following properties:
1. The first argument is a data frame.
2. The subsequent arguments describe what to do with the data, using the variable names in the data frame.
3. The result is a new data frame.
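
The three properties above can be seen in a minimal sketch with `filter`. The data frame `toy` and its columns are hypothetical and stand in for any dataset.

```r
library(tidyverse)

# Hypothetical toy data frame
toy = tibble(id = 1:4, score = c(10, 25, 5, 40))

# First argument: the data frame. Later arguments: conditions written
# with bare column names (no toy$score needed).
high = filter(toy, score > 8)

nrow(high)   # rows in the new data frame
nrow(toy)    # the original data frame is unchanged
```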

Note that the documentation for these functions can be found at https://dplyr.tidyverse.org/reference/.

## Filter

Use `filter` to view or store the subset of rows of a dataset that satisfy one or more conditions.

In [None]:
filter(dm, cut == 'Fair', color == 'J')

Remember to assign the result to a variable name if you want to store the subset for later use.

Also make sure to use `==` instead of `=` inside the `filter` function. `==` tests for equality, whereas `=` inside a function call names an argument (and elsewhere performs assignment), so `filter` cannot treat it as a condition.

In [None]:
worst_diamonds = filter(dm, cut == 'Fair', color == 'J')
worst_diamonds
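
To illustrate the `==` versus `=` pitfall, here is a small sketch on a hypothetical data frame `toy`. The exact error message depends on the installed dplyr version, so the code only checks that an error is raised.

```r
library(tidyverse)

# Hypothetical toy data
toy = tibble(grade = c('A', 'B', 'A'))

# Correct: `==` compares each row's value with 'A'
filter(toy, grade == 'A')

# Mistake: `=` turns `grade` into a named argument instead of a comparison.
# try() catches the resulting error so the rest of the cell keeps running.
result = try(filter(toy, grade = 'A'), silent=TRUE)
inherits(result, 'try-error')
```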

In [None]:
## filtering for rows that satisfy one or both of the conditions
d_or_j = filter(dm, color == 'D' | color == 'J') 

## filtering for rows that satisfy both conditions
d_and_ideal = filter(dm, color == 'D' & cut == 'Ideal') 
# d_and_ideal = filter(dm, color == 'D', cut == 'Ideal')   # equivalent: commas mean "and"

## filtering for rows that satisfy exactly one condition
## (descriptive names avoid masking the base function c())
d_xor_ideal = filter(dm, xor(color == 'D', cut == 'Ideal')) 

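## filtering using a membership condition
best_cuts = filter(dm, cut %in% c('Premium', 'Ideal')) 
## wrap the condition in !( ) to keep every cut except these
# not_best_cuts = filter(dm, !(cut %in% c('Premium', 'Ideal'))) 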

## comparison operators work here because cut is an ordered factor
is.ordered(dm$cut)
levels(dm$cut)
good_or_better_cuts = filter(dm, cut >= 'Good') 
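
The difference between `>` and `>=` on an ordered factor can be checked on a small hand-built factor. The sketch assumes the same level order that `levels(dm$cut)` reports; the names `lv` and `f` are hypothetical.

```r
# Hypothetical ordered factor with the same levels as `cut`
lv = c('Fair', 'Good', 'Very Good', 'Premium', 'Ideal')
f = factor(c('Fair', 'Good', 'Ideal'), levels=lv, ordered=TRUE)

f > 'Good'    # strictly better than Good: FALSE FALSE TRUE
f >= 'Good'   # Good or better: FALSE TRUE TRUE
```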

Note: `NA` is generally used to denote missing values in R. Never check for missing values using `variable == NA` or `variable != NA`. Instead, use the `is.na` function. So use `is.na(variable)` or `!is.na(variable)`.

In [None]:
df = tibble(x = c(1, NA, 3, NA, 5), y=c('a', 'b', 'c', 'd', 'e'))
df

In [None]:
filter(df, is.na(x))

In [None]:
filter(df, !is.na(x))
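
As a brief sketch of why `== NA` does not work: any comparison with `NA` evaluates to `NA`, and `filter` keeps only rows where the condition is `TRUE`. The data frame `toy` is hypothetical.

```r
library(tidyverse)

# Hypothetical toy data with one missing value
toy = tibble(x = c(1, NA, 3))

# Any comparison with NA is itself NA, never TRUE
toy$x == NA

# filter() keeps only rows where the condition is TRUE,
# so this returns zero rows instead of the missing ones
nrow(filter(toy, x == NA))

# is.na() gives the intended TRUE/FALSE answer
nrow(filter(toy, is.na(x)))
```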

## Arrange

Used to reorder rows.

In [None]:
dm_order = arrange(dm, cut, carat)
head(dm_order, 10)

In [None]:
dm_order2 = arrange(dm, desc(cut), desc(carat))
head(dm_order2, 10)

Note that missing values are always sorted to the end, regardless of the requested sort order.

In [None]:
arrange(df, x)

In [None]:
arrange(df, desc(x))
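
If missing values should come first instead, one possible workaround is to sort on `is.na()` before the column itself. This is a sketch on a hypothetical data frame `toy`, not a built-in option of `arrange`.

```r
library(tidyverse)

# Hypothetical toy data with two missing values
toy = tibble(x = c(1, NA, 3, NA, 5))

# is.na(x) is TRUE for missing rows; desc() puts TRUE before FALSE,
# so missing values come first, then the rest in ascending order
na_first = arrange(toy, desc(is.na(x)), x)
na_first$x
```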

## Select
This is used to select certain columns out of a dataset.

In [None]:
names(dm)

In [None]:
dm2 = select(dm, carat, price)
head(dm2)

In [None]:
dm3 = select(dm, color:price)
head(dm3)

In [None]:
select(dm, -(color:price))[1:3,]

In [None]:
select(dm, starts_with('c'))[1:3,]

In [None]:
select(dm, contains('co'))[1:3,]

Use `rename()`, which is a variant of `select()`, to rename a column and keep all the variables that aren't explicitly mentioned:

In [None]:
rename(dm, width=x)[1:3,]

In [None]:
select(dm, width=x)[1:3,]

The `everything()` helper function is often useful if you want to keep all the variables while making changes to some, for instance to move variables around.

In [None]:
select(dm,  everything(), width=x)[1:3,]

In [None]:
select(dm, price, carat, everything())[1:3,]


## Mutate

Create a new column or change an existing column.

In [None]:
dm_dim = select(dm, -(carat:price))
head(dm_dim)

In [None]:
mutate(dm_dim, volume = x*y*z)[1:3,]

If you only want to keep the new variables, use `transmute()`.

In [None]:
transmute(dm_dim, volume = x*y*z)[1:3,]

In [None]:
mutate(dm_dim, z = x+y)[1:3,]

## Summarize

Calculates summary statistics. Generally used with the `group_by()` function to output summaries by group.

In [None]:
dm_by_color = group_by(dm, color)

In [None]:
head(dm_by_color)
group_vars(dm_by_color)

In [None]:
summarize(dm_by_color, avg_price=mean(price, na.rm=TRUE))

In [None]:
head(mpg)

In [None]:
mpg2 = mutate(mpg, year=factor(year))

mpg2 = mutate(mpg2, manual=(str_detect(trans, 'manual')))

head(mpg2)

In [None]:
mpg2_by_maker_yr = group_by(mpg2, manufacturer, year)
hwy_summary = summarize(mpg2_by_maker_yr,
                        count = n(),
                        hwy = mean(hwy, na.rm=TRUE),
                        cty = mean(cty, na.rm=TRUE))
head(hwy_summary)

In [None]:
print(group_vars(mpg2_by_maker_yr))
print(group_vars(hwy_summary))

In [None]:
hwy_summary_af = filter(hwy_summary, str_detect(manufacturer, '^[a-f]'))

ggplot(hwy_summary_af, aes(x=cty, y=hwy, color=manufacturer, shape=year)) + 
    geom_point(size=3)

## Pipes
The `tidyverse` provides the pipe operator `%>%` as a shortcut for chaining multiple operations on a dataset. `x %>% f(y)` is equivalent to `f(x, y)`, so the pipe works with any of the dataset functions we have learned today. The two versions below produce the same result:

In [None]:
mpg2_by_maker_yr = group_by(mpg2, manufacturer, year)
hwy_summary = summarize(mpg2_by_maker_yr,
                        count = n(),
                        hwy = mean(hwy, na.rm=TRUE),
                        cty = mean(cty, na.rm=TRUE))
hwy_summary_af = filter(hwy_summary, str_detect(manufacturer, '^[a-f]'))


hwy_summary_af2 = mpg2 %>% 
    group_by(manufacturer, year) %>%
    summarize(
        count = n(),
        hwy = mean(hwy, na.rm=TRUE),
        cty = mean(cty, na.rm=TRUE)) %>%
    filter(str_detect(manufacturer, '^[a-f]'))

In [None]:
hwy_summary_af
hwy_summary_af2
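
The same equivalence can be checked on a much smaller example. The data frame `toy` and the names `nested` and `piped` are hypothetical.

```r
library(tidyverse)

# Hypothetical toy data: two groups
toy = tibble(g = c('a', 'a', 'b'), v = c(1, 3, 10))

# Nested call: read from the inside out
nested = summarize(group_by(toy, g), avg = mean(v))

# Piped call: read top to bottom; each result becomes the
# first argument of the next function
piped = toy %>%
    group_by(g) %>%
    summarize(avg = mean(v))

# Both approaches give the same group averages
all(nested$avg == piped$avg)
```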

### Exercise 1
What is the default geom associated with `stat_summary()`? Can you modify the code below to make a line plot of the median `hwy` by `cyl` (just plot the line through the medians)? Note that you can remove the `fun.ymax` and `fun.ymin` arguments.


In [None]:
ggplot(mpg, aes(x=cyl, y=hwy)) +
    stat_summary(
        fun.ymin=min,
        fun.ymax=max,
        fun.y=median
    )

### Exercise 2

Using the dataset `dm`:
1. Use `filter` to output diamonds with combined `x` and `y` values greater than 17.
2. Use `filter` and `nrow` to count the number of diamonds that sold for an even price. 

### Exercise 3
Add a new column to `dm` that converts the US dollar prices in `price` to Korean Won and rounds to the nearest thousand. Today's exchange rate is 1 USD = 1,195.33 WON. If you don't know how to round numbers in R, try searching the internet for what function to use and its documentation.

### Exercise 4

Using the dataset `dm` and pipes, create a dataset of the mean price by color for diamonds with `Ideal` cuts and carats greater than or equal to 1. Round the price to the nearest dollar.

### Exercise 5

When used on a grouped dataset, expressions within the `mutate` and `filter` commands are computed by group. Use this knowledge and the dataset `mpg` to solve the following problem.

Only consider `subcompact` cars. Find which manufacturers had a car model/variant with a city mileage (`cty`) more than one standard deviation above the average city mileage for that year.
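
The grouped behavior described above can be sketched on a toy table (deliberately not the exercise data). The names `toy`, `team`, `score`, and `above_team_avg` are hypothetical.

```r
library(tidyverse)

# Hypothetical toy data: two teams with three scores each
toy = tibble(
    team = c('red', 'red', 'red', 'blue', 'blue', 'blue'),
    score = c(1, 2, 9, 10, 20, 60)
)

# After group_by(), mean(score) is the mean of each team, not of the
# whole column, so every row is compared with its own team's average
above_team_avg = toy %>%
    group_by(team) %>%
    filter(score > mean(score)) %>%
    ungroup()

above_team_avg
```

Without the `group_by()` step, each score would instead be compared with the overall mean of all six rows.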