# Lab 2: More ggplot and dplyr

### We encourage you to be active on Piazza, asking and answering homework questions.

### Demo on homework submission.


## Review: Lab 1 Exercise
1. What are the default values of the mean and standard deviation used by the `rnorm` function in R to generate a value from a normal distribution?
2. Create a boxplot of `price` grouped by the levels of the `cut` variable.
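
For the first question, one way to check a function's defaults without opening its help page is to inspect its formal arguments. This is a small sketch using base R's formals(); the variable name `defaults` is only illustrative.

```r
# List the formal arguments of rnorm() together with their default values
defaults <- formals(rnorm)

defaults$mean  # 0
defaults$sd    # 1
```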

In [None]:
library(tidyverse)

In [None]:
summary(diamonds)

In [None]:
# a boxplot helps to visualize the variability of price within each cut
ggplot(data = diamonds) + 
    geom_boxplot(mapping = aes(x = cut, y = price)) +
    labs(x = 'Cut', y = 'Price($)') + 
    ggtitle('Diamond price by cut')

### Facets
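Facets split a plot into separate panels, one for each level of a categorical variable. They help when a single plot is too crowded to read.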

In [None]:
ggplot(data = diamonds) + 
    geom_point(mapping = aes(x = carat, y = price)) +
    labs(x = 'Carat', y = 'Price($)') +
    ggtitle('Diamond price by carat, one panel per clarity level') + 
    facet_grid(.~clarity) 
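
facet_grid() also accepts a variable on each side of the formula: the left-hand side defines the rows of panels and the right-hand side the columns. The sketch below is one possible variation on the plot above; the choice of `cut` for the rows and the name `p_grid` are assumptions made for illustration.

```r
library(ggplot2)

# One panel for every combination of cut (rows) and clarity (columns)
p_grid <- ggplot(data = diamonds) +
    geom_point(mapping = aes(x = carat, y = price), alpha = 0.2) +
    labs(x = 'Carat', y = 'Price($)') +
    facet_grid(cut ~ clarity)
print(p_grid)
```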

## Subset generation

In [None]:
# draw 1000 distinct random row indices; sample() can reach every row, including the last one
rand_idx = sample(nrow(diamonds), 1000)
dm = diamonds[rand_idx, ]
print(names(dm))
print(dim(dm))
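
The subset above changes every time the cell is run. If you need the same rows on every run, a possible variant is to fix the random seed first. This sketch assumes dplyr 1.0.0 or later for slice_sample(); the seed value and the name `dm_alt` are arbitrary.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

set.seed(42)  # arbitrary seed, so the same rows are drawn on every run
dm_alt <- slice_sample(diamonds, n = 1000)
dim(dm_alt)
```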

In [None]:
summary(dm)

## More about Facets

![Caption for the picture.](./graph1.png)

In [None]:
p1 = ggplot(data = dm) + 
    geom_point(mapping = aes(x = x, y = price, color = cut)) + 
    facet_wrap(~clarity)
print(p1)

## geom_smooth
So far we have plotted only the raw points. We might also be interested in the overall trend in the data. By default geom_smooth() draws a smoothed curve through the points; set method = "lm" if you want a straight-line fit.

![Caption for the picture.](./graph2.png)

In [None]:
p2 = ggplot(data = dm) + 
    geom_point(mapping = aes(x = x, y = price)) +
    geom_smooth(mapping = aes(x = x, y = price))
print(p2)


How can we reduce the code duplication above? The mapping aes(x = x, y = price) appears in both layers. Passing it to ggplot() once makes every layer inherit it:

In [None]:
ggplot(data = dm, mapping = aes(x = x, y = price)) + 
    geom_smooth() +
    geom_point()
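
To draw a straight line rather than the default smoothed curve, geom_smooth() accepts method = "lm". The following is a minimal sketch. `dm_small` is a hypothetical, deterministic stand-in for the random subset `dm`, used only so that the example runs on its own.

```r
library(ggplot2)

# Deterministic stand-in for dm: every 50th row of diamonds
dm_small <- diamonds[seq(1, nrow(diamonds), by = 50), ]

p_lm <- ggplot(data = dm_small, mapping = aes(x = x, y = price)) +
    geom_point(alpha = 0.3) +
    # method = "lm" fits a straight line; se = FALSE hides the confidence band
    geom_smooth(method = "lm", se = FALSE)
print(p_lm)
```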

## Geometric Plots
Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot.

You can learn which stat a geom uses by inspecting the default value of its 'stat' argument. For example, ?geom_bar shows that the default value for stat is 'count', which means that geom_bar() uses stat_count().

stat_count() is documented on the same page as geom_bar(). If you scroll down, you will find a section called "Computed variables", which describes the two new variables it computes: count and prop.

In the following example we override the default stat so that the height of each bar comes from a column in the dataset (population) instead of a count.


In [None]:
?geom_bar

In [None]:
popn <- tribble(
~country, ~population,
"ETHIOPIA", 102000000,
"NIGERIA", 186000000,
"EGYPT", 96000000,
"DR CONGO", 78000000,
"SOUTH AFRICA", 56000000
)

In [None]:
popn

![Caption for the picture.](./graph5.png)

In [None]:
ggplot(data = popn) +
  geom_bar(mapping = aes(x = country, y = population), stat = "identity") +
  ggtitle('Most populous countries in Africa')

In [None]:
ggplot(data = dm) + 
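    # after_stat(prop) is the current spelling of the older ..prop.. notation (ggplot2 3.3.0 and later)
    geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))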

In [None]:
ggplot(data = dm) +
    stat_count(mapping = aes(x = cut))
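
The values that stat_count() plots can also be computed by hand, which is a useful check on what the bars represent. The sketch below uses dplyr's count() on the full diamonds data, so it runs without `dm`. The name `cut_counts` is illustrative, and `prop` here mirrors the computed variable when all rows form a single group.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# n is the bar height that geom_bar() draws by default;
# prop is the share of all rows that falls in each cut
cut_counts <- mutate(count(diamonds, cut), prop = n / sum(n))
cut_counts
```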

In [None]:
ggplot(data = dm) +
    geom_col(mapping = aes(x = cut, y = price / 1e6))

### Exercise
1. What does geom_col() do? How is it different from geom_bar()?
2. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?
3. What variables does stat_smooth() compute? What parameters control its behaviour?
4. In our proportion bar chart, we need to set group = 1. Why? In other words, what is the problem with these two graphs?

In [None]:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y=price), stat="identity")

## Position adjustments

In [None]:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, colour = cut))
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut))

Remember that we can "color by" a different variable, in this case clarity. By default, geom_bar() stacks the bars for each clarity level. The stacking is a position adjustment, set by the position argument of geom_bar(). If you don't want a stacked bar chart, you can use one of three other options: "identity", "dodge", or "fill".

![Caption for the picture.](./graph7.png)

In [None]:
ggplot(data = dm) +
    # here fill is an aesthetic (the bar colour); position = "fill" in the next cell is a position adjustment
    geom_bar(mapping = aes(x = cut, fill = clarity))

*position = 'fill'*

This works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.

In [None]:
ggplot(data = dm) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

*position = 'dodge'*

This places overlapping objects directly beside one another, which makes it easier to compare individual values.

![Caption for the picture.](graph10.png)

In [None]:
ggplot(data = dm) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

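A position adjustment that is very useful for scatterplots with overlapping points is "jitter", which adds a small amount of random noise to each point so that points sharing the same coordinates become visible.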

In [None]:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
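
ggplot2 also provides geom_jitter() as a shorthand for geom_point(position = "jitter"). Its width and height arguments control how much noise is added in each direction. The values below are arbitrary choices for illustration, not recommendations.

```r
library(ggplot2)

# Jitter only horizontally, by up to 0.1 units of displ, and leave hwy exact
p_jit <- ggplot(data = mpg) +
    geom_jitter(mapping = aes(x = displ, y = hwy), width = 0.1, height = 0)
print(p_jit)
```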



# dplyr for data manipulation

In [None]:
dim(dm)
head(dm)

Run 'dm' on its own, after creating it above, to print the tibble. Can you guess what the column types 'dbl', 'ord', and 'int' mean?

Notice how the levels below follow an order. Indeed, we expect Fair < Good < Very Good < Premium < Ideal.

In [None]:
print(levels(dm$cut))
print(levels(dm$color))
print(levels(dm$clarity))

In [None]:
sizes = c("M", "S", "S", "M", "XL", "XXL", "XL", "S", "M", "L")
sizes

In [None]:
sizes = ordered(sizes, levels = c("S", "M", "L", "XL", "XXL"))
levels(sizes) # returns the levels in order; printing sizes itself shows the ordering explicitly with less-than signs
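
Because the factor is ordered, comparison operators work on it. The sketch below rebuilds the same `sizes` vector so that it runs on its own; `n_small` and `largest` are illustrative names.

```r
sizes <- ordered(c("M", "S", "S", "M", "XL", "XXL", "XL", "S", "M", "L"),
                 levels = c("S", "M", "L", "XL", "XXL"))

# Comparisons follow the level order, not alphabetical order
n_small <- sum(sizes < "L")  # counts the S and M entries: 6
largest <- max(sizes)        # XXL
```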

dplyr has five key functions (verbs): filter, arrange, select, mutate and summarise. All of them have the following properties:
1. The first argument is a data frame.
2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
3. The result is a new data frame.
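
The third property means that the verbs never modify their input: to keep a result, you have to assign it. A minimal sketch on a toy tibble (`df` and `sorted` are illustrative names):

```r
library(dplyr)

df <- tibble(x = c(3, 1, 2), y = c("a", "b", "c"))

# arrange() returns a new, sorted data frame and leaves df untouched
sorted <- arrange(df, x)

df$x      # 3 1 2
sorted$x  # 1 2 3
```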

## Filter
filter() keeps the rows that satisfy one or more conditions. Use it to view or store a subset of the full dataset.

In [None]:
filter(dm, cut == 'Fair', color == 'J')

Usually you want to store the newly subsetted data in memory. 

In [None]:
worst_diamonds = filter(dm, cut == 'Fair', color == 'J')

Make sure to use '==' instead of '='. The former tests equality, while the latter is for assignment.

In [None]:
4 == 6
test = 6
test
test == 5


## Use cases

In [None]:
a = filter(dm, color == 'D' | color == 'J') 
# rows that satisfy one or both of the conditions

b = filter(dm, color == 'D' & color == 'J') 
# rows that satisfy both conditions; empty here, since a diamond has only one color

best_cuts = filter(dm, cut == 'Ideal') 
# rows matching a single level; use %in% to test membership in several levels

not_worst_cuts = filter(dm, cut > 'Fair') 
# can do this because cut is an ordinal variable
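
When you want rows that match any of several levels, a membership test with %in% is usually shorter than chaining == with |. The sketch below runs on the full diamonds data so that it is self-contained; `top_cuts` and `same` are illustrative names.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Membership test: keep rows whose cut is one of the listed levels
top_cuts <- filter(diamonds, cut %in% c("Premium", "Ideal"))

# The same subset, written with explicit equality tests
same <- filter(diamonds, cut == "Premium" | cut == "Ideal")

nrow(top_cuts) == nrow(same)
```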

In [None]:
not_worst_cuts[1:20,]

In R, if you want to find if a variable's value is missing, use the is.na() function. In particular, do not check for equality with NA:

In [None]:
x = 4

In [None]:
x == NA
is.na(x)

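Similarly, never put an equality condition with NA in your dplyr filter() statements. filter() keeps only the rows where the condition is TRUE, so rows where it evaluates to NA are dropped unless you ask for them explicitly with is.na():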

In [None]:
# create a dataframe
df = tibble(x = c(1, NA, 3))
print(df)

In [None]:
filter(df, x>1)

In [None]:
filter(df, is.na(x) | x > 1)
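
The behaviour above follows from how logical comparisons treat NA: a comparison with NA yields NA, not TRUE or FALSE. A short base R sketch of the two conditions used in the filters (`cond_plain` and `cond_with_na` are illustrative names):

```r
x <- c(1, NA, 3)

# The comparison is NA for the missing value, so filter() would drop that row
cond_plain <- x > 1

# TRUE | NA is TRUE, so adding is.na(x) keeps the missing row
cond_with_na <- is.na(x) | x > 1
```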

### Exercise
1. Write code using filter that will allow you to output diamonds with colors D or E and cuts Good or Very Good
2. Write code using filter that will allow you to output diamonds with even-numbered prices

## Arrange
arrange() reorders the rows instead of filtering for a subset of them.

In [None]:
arrange(dm, cut, color)[1:20,] 
# cut and color are ordered factors, so the rows are sorted by level order rather than alphabetically

In [None]:
# arranging in descending order of carat, with ties broken by cut
arrange(dm, desc(carat), cut)[1:20,]

Missing values are always sorted at the end, even when you use desc():

In [None]:
df = tibble(x = c(5, NA, 2))
arrange(df, x)

In [None]:
arrange(df, desc(x))

### Exercise
1. Use arrange to sort the dm dataset by descending order of the product of the x, y, and z variables. Output the first 20 rows of the new dataset.

## Select
This is used to reduce the number of columns that we're dealing with. It is especially useful for wide datasets with a very large number of columns, such as genetic data.

In [None]:
names(dm)

In [None]:
select(dm, carat, price)[1:20,]

In [None]:
select(dm, carat:price)[1:20,]

In [None]:
select(dm, -(carat:price))[1:20,]

Use rename(), which is a variant of select(), to rename a column and keep all the variables that aren't explicitly mentioned:

In [None]:
rename(dm, width=x)[1:20,]

In [None]:
# select() can rename too, but it keeps only the columns you mention
select(dm, width = x)[1:20,]

Another option is to use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you would like to move to the start of the data frame.

In [None]:
select(dm, price, carat, everything())[1:20,]

There are some helper functions for select():
    starts_with()
    ends_with()
    contains()
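
A brief sketch of the three helpers on the diamonds columns. The patterns are arbitrary examples, and each call keeps only the columns whose names match.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

d_cols  <- names(select(diamonds, starts_with("d")))  # "depth"
e_cols  <- names(select(diamonds, ends_with("e")))    # "table" "price"
ar_cols <- names(select(diamonds, contains("ar")))    # "carat" "clarity"
```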
    
### Exercise
1. Write code that will have price as the first column and the columns starting with the letter 'c' as the following columns. Output the first 20 rows of such a dataset.

## Mutate
Used to create computed columns from existing ones.

In [None]:
dm_dimensions = select(dm, 
  -(carat:price)
)
mutate(dm_dimensions,
  volume = x*y*z
)[1:20,]

If you only want to keep the new variables, use transmute()

In [None]:
transmute(dm_dimensions, 
          volume = x*y*z)[1:20,]
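
Two related details, sketched under the assumption of dplyr 1.0.0 or later: a mutate() call can refer to a column created earlier in the same call, and the .keep = "none" argument gives transmute()-like behaviour from mutate() itself. The names `dims`, `with_volume`, `only_volume` and `volume_cm3` are illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

dims <- select(diamonds, x, y, z)  # dimensions in mm

# volume_cm3 uses the volume column defined just before it
with_volume <- mutate(dims,
                      volume = x * y * z,
                      volume_cm3 = volume / 1000)

# .keep = "none" drops the input columns and keeps only the new one
only_volume <- mutate(dims, volume = x * y * z, .keep = "none")
names(only_volume)
```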


## Summarise
Generally used together with the group_by() function to produce one summary row per group. Grouped summaries appear in many applications.

In [None]:
by_color = group_by(dm, color)
summarise(by_color, avg_price = mean(price, na.rm = TRUE))
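
A single summarise() call can compute several summaries per group. The sketch below groups the full diamonds data by cut so that it runs on its own; the column names `count`, `avg_price` and `median_price` are illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

by_cut <- group_by(diamonds, cut)

# One row per cut, with the group size and two price summaries
price_summary <- summarise(by_cut,
                           count = n(),
                           avg_price = mean(price, na.rm = TRUE),
                           median_price = median(price, na.rm = TRUE))
price_summary
```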

In [None]:
head(mpg)

In [None]:
mpg2 = mpg
mpg2$year = as.factor(mpg$year) # telling it we're dealing with a category column
mpg2 = mutate(mpg2, manual = (grepl('manual', trans)))
head(mpg2)

In [None]:
by_maker_yr = group_by(mpg2, manufacturer, year)
hwy_summary = summarise(by_maker_yr,
                       count = n(),
                       hwy = mean(hwy, na.rm = TRUE),
                       cty = mean(cty, na.rm = TRUE))
hwy_summary

In [None]:
hwy_summary_ag = filter(hwy_summary, substring(manufacturer,1,1) %in% c('a','b','c','d','e','f','g'))
hwy_summary_ag

In [None]:
ggplot(data = hwy_summary_ag, mapping = aes(x = cty, y = hwy)) + 
    geom_point(mapping = aes(color = manufacturer, shape = year))

## Pipes
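The pipe operator %>% passes the result on its left as the first argument of the function on its right, so the transformations above can be written without intermediate variables as: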


In [None]:
hwy_summary_ag2 = mpg2 %>% 
    group_by(manufacturer, year) %>%
    summarise(
        count = n(),
        hwy = mean(hwy, na.rm = TRUE),
        cty = mean(cty, na.rm = TRUE)) %>%
    filter(substring(manufacturer,1,1) %in% c('a','b','c','d','e','f','g'))

In [None]:
hwy_summary_ag2
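
As a quick check that piping only changes how the code reads, the sketch below computes the same grouped mean in nested and in piped form and compares the results. It groups mpg by class purely for illustration; `nested` and `piped` are illustrative names.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Nested form: read from the inside out
nested <- summarise(group_by(mpg, class), hwy = mean(hwy))

# Piped form: read from top to bottom
piped <- mpg %>%
    group_by(class) %>%
    summarise(hwy = mean(hwy))

identical(nested$hwy, piped$hwy)
```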