# Lecture Worksheet A-2: Data Wrangling with dplyr

By the end of this worksheet, you will be able to: 

1. Use the five core dplyr verbs for data wrangling: `select()`, `filter()`, `arrange()`, `mutate()`, `summarise()`.
2. Use piping when implementing function chains.
3. Use `group_by()` to operate within groups (of rows) with `mutate()` and `summarise()`. 
4. Use `across()` to operate on multiple columns with `summarise()` and `mutate()`.

## Instructions + Grading

+ To get full marks for each participation worksheet, you must successfully answer at least 50% of all autograded questions: that's 10 for this worksheet. 

+ Autograded questions are easily identifiable through their labelling as **QUESTION**. Any other instructions that prompt the student to write code are activities, which are not graded and thus do not contribute to marks - but do contribute to the workflow of the worksheet!

## Attribution

Thanks to Icíar Fernández Boyano and Victor Yuan for their help in putting this worksheet together. 

The following resources were used as inspiration in the creation of this worksheet:

+ [Swirl R Programming Tutorial](https://swirlstats.com/scn/rprog.html)
+ [Palmer Penguins R Package](https://github.com/hadley/palmerpenguins)
+ [R4DS Data Transformation](https://r4ds.had.co.nz/transform.html)


## Five core dplyr verbs: an overview of this worksheet

So far, we've **looked** at our dataset. It's time to **work with** it! Prior to creating any models, or using visualization to gain more insights about our data, it is common to tweak the data in some ways to make it a little easier to work with. For example, you may need to rename some variables, reorder observations, or even create some new variables from your existing ones!

As explained in depth in the [R4DS Data Transformation chapter](https://r4ds.had.co.nz/transform.html), there are five key dplyr functions that allow you to solve the vast majority of data manipulation tasks:

+ Pick variables by their names (`select()`)
+ Pick observations by their values (`filter()`)
+ Reorder the rows (`arrange()`)
+ Create new variables with functions of existing variables (`mutate()`)
+ Collapse many rows down to a single summary (`summarise()`)

We can use these in conjunction with two other functions:

- The `group_by()` function groups a tibble by rows. Downstream calls to `mutate()` and `summarise()` operate independently on each group.
- The `across()` function, when used within the `mutate()` and `summarise()` functions, operates on multiple columns.

Because data wrangling involves calling multiple of these functions, we will also see the pipe operator `%>%` for putting these together in a single statement.  
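As a rough preview of how these pieces fit together, here is a minimal sketch on a made-up tibble. The `toy` data and its column names are hypothetical and are not part of the worksheet's datasets.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only.
toy <- tibble(
  group = c("a", "a", "b", "b"),
  x     = c(1, 2, 3, 4),
  y     = c(10, 20, 30, 40)
)

preview <- toy %>%
  filter(x > 1) %>%                        # keep rows where x exceeds 1
  mutate(ratio = y / x) %>%                # add a column computed from existing ones
  select(group, ratio) %>%                 # keep only these columns
  group_by(group) %>%                      # later verbs now work per group
  summarise(mean_ratio = mean(ratio)) %>%  # one row per group
  arrange(desc(group))                     # reorder the rows: "b" before "a"

preview
```

Each line reads as one step, which is the main appeal of chaining verbs with the pipe.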

## Getting Started

Load the required packages for this worksheet:

In [2]:
suppressPackageStartupMessages(library(palmerpenguins))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(gapminder))
suppressPackageStartupMessages(library(tsibble))
suppressPackageStartupMessages(library(testthat))
suppressPackageStartupMessages(library(digest))
expect_sorted <- function(object) {
  act <- quasi_label(rlang::enquo(object), arg = "object")
  expect(
    !is.unsorted(act$val),
    sprintf("%s not sorted", act$lab)
  )
  invisible(act$val)
}

The following code chunk has been unlocked, to give you the flexibility to start this document with some of your own code. Remember, it's bad manners to keep a call to `install.packages()` in your source code, so don't forget to delete these lines if you ever need to run them.

In [None]:
# An unlocked code chunk.

# Part 1: The Five Verbs

## Exploring your data

What's the first thing that you should do when you're starting a project with a new dataset? Having a coffee is a reasonable answer, but before that, you should **look at the data**. This may sound obvious, but a common mistake is to dive into the analysis too early before being familiar with the data - only to have to go back to the start when something goes wrong and you can't quite figure out why. Some of the questions you may want to ask are:

+ What is the format of the data?
+ What are the dimensions?
+ Are there missing data?

You will learn how to answer these questions and more using dplyr.

## Penguins Data

[Palmer penguins](https://github.com/hadley/palmerpenguins) is an R data package created by Allison Horst. Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. The dataset that we will be using is stored in a variable called "penguins". It is a subset of the "penguins_raw" dataset, also included in this R package. Let's have a look at it.

In [3]:
head(penguins)

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
Adelie,Torgersen,,,,,,2007
Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
Adelie,Torgersen,39.3,20.6,190.0,3650.0,male,2007


`head()` returns the first 6 rows of a dataframe, instead of printing all the data to screen.

## What is the format of the data?

Let's begin by checking the class of the **penguins** variable. This will give us a clue about the overall structure of the data.

In [4]:
class(penguins)

As you can see, the function returns 3 classes: "tbl_df", "tbl", and "data.frame". A dataframe is the default class for data read into R. Tibbles ("tbl" and "tbl_df") are a modern take on data frames, but slightly tweaked to work better in the tidyverse. For now, you don’t need to worry about the differences; we’ll come back to tibbles later. The dataset that we are working with was originally a data.frame that has been coerced into a tibble, which is why multiple class names are returned by the `class()` function.

## What are the dimensions?

There are two functions that we can use to see exactly how many rows (observations) and columns (variables) we're dealing with. `dim()` is the base R option, and `glimpse()` is the dplyr flavour, which gives us some more information besides the row and column number. Give both a try!

In [5]:
dim(penguins)
glimpse(penguins)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
$ sex               <fct> male, female, female, NA, female, male, female, male~
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~


There are more functions that you can use to further explore the dimensions, such as `nrow()`, `ncol()`, `colnames()` or `rownames()`, but we won't be looking into those.
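For reference only, here is a quick sketch of those helpers. It uses the built-in `mtcars` data frame rather than `penguins` so that it runs without any extra packages; that choice is an assumption made for this illustration.

```r
# mtcars ships with base R, so no packages need to be loaded here.
n_rows <- nrow(mtcars)       # number of rows (observations)
n_cols <- ncol(mtcars)       # number of columns (variables)

n_rows
n_cols
head(colnames(mtcars))       # first few column names
head(rownames(mtcars))       # first few row names
```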

## QUESTION 1.0

In the output of `dim()`, what does the first number represent?

Multiple choice!

A) number of rows   

B) number of columns

Put your selection (e.g. the letter corresponding to the correct option) into a variable named `answer1.0`.

In [6]:
answer1.0 <- "A"

In [7]:
test_that("Question 1.0", {
    expect_equal(digest(as.character(toupper(answer1.0))), "75f1160e72554f4270c809f041c7a776")
})
cat("success!")

Test passed
success!

## `select()` 

*A brief interlude on naming things:* Names are important. Jenny Bryan has some excellent [slides](https://speakerdeck.com/jennybc/how-to-name-files) for naming things in a way that is human readable *and* machine readable. Don't worry too much about it for this worksheet, but do keep it in mind as it helps with *reproducibility*. 

A quick tip that you can put into practice: you can use *Pascal case* - creating names by concatenating capitalized words, such as PenguinsSubset, or PenguinsTidy. If names get too long, remove vowels! For example, PngnSubset, or PngnTidy instead. Or, you can use snake_case!

## QUESTION 1.1

In the next few questions, you will practice using the dplyr verb `select()` to pick and modify variables by their names. Modify the penguins data so that it contains the columns `species`, `island`, `sex`, in that order.

Assign your answer to a variable named `answer1.1`.

In [8]:
answer1.1 <- select(penguins, species, island, sex)



In [9]:
test_that("Question 1.1", {
    expect_equal(digest(as_tibble(answer1.1)), "0df5cac5070ec518519a6f2781f4e01f")
})
cat("success!")

-- Failure (<text>:2:5): Question 1.1 ------------------------------------------
digest(as_tibble(answer1.1)) not equal to "0df5cac5070ec518519a6f2781f4e01f".
1/1 mismatches
x[1]: "63491aa90dcb507c85810ba253a6a465"
y[1]: "0df5cac5070ec518519a6f2781f4e01f"

success!

## QUESTION 1.2

Out of the following options, what would be the best name for the object that you just created above (currently stored in `answer1.1`)? Put your answer in a variable named `answer1.2`.

A) _penguin_subset   

B) penguins  

C) 2penguin   

D) PngnSub   

In [10]:
answer1.2 <- "D"

In [11]:
test_that("Question 1.2", {
    expect_equal(digest(as.character(toupper(answer1.2))), "c1f86f7430df7ddb256980ea6a3b57a4")
})
cat("success!")

Test passed
success!

## QUESTION 1.3

Select all variables, from `bill_length_mm` to `body_mass_g` (in that order). Of course, you could do it this way...

In [12]:
# This will work:
select(penguins, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>% 
   print(n = 5)

# A tibble: 344 x 4
  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
           <dbl>         <dbl>             <int>       <int>
1           39.1          18.7               181        3750
2           39.5          17.4               186        3800
3           40.3          18                 195        3250
4           NA            NA                  NA          NA
5           36.7          19.3               193        3450
# ... with 339 more rows


But there is a better way to do it! Which do you think would work?

A) `select(penguins, body_mass_g:bill_length_mm)`   

B) `select(penguins, c(body_mass_g::bill_length_mm))`   

C) `select(penguins, bill_length_mm:body_mass_g)`   

D) `select(penguins, bill_length_mm::body_mass_g)`

Assign your answer to a variable called `answer1.3`

In [13]:
answer1.3 <- "C"

In [14]:
test_that("Question 1.3", {
    expect_equal(digest(as.character(toupper(answer1.3))), "475bf9280aab63a82af60791302736f6")
})
cat("success!")

Test passed
success!

## QUESTION 1.4

You're doing a great job. Keep it up! Now, select all variables, except `island`. How would you write this code?

A) `select(penguins, -c("island"))`   

B) `select(penguins, -island)`   

C) `select(penguins, -("island"))`   

Put your answer in a variable named `answer1.4`. We encourage you to try executing these!

In [15]:
answer1.4 <- "B"

In [16]:
test_that("Question 1.4", {
    expect_equal(digest(as.character(toupper(answer1.4))), "3a5505c06543876fe45598b5e5e5195d")
})
cat("success!")

Test passed
success!

## QUESTION 1.5

Output the `penguins` tibble so that `year` comes first. Hint: use the tidyselect `everything()` function. Store the result in a variable named `answer1.5`. 

In [17]:
answer1.5 <- select(penguins, year, everything())
head(answer1.5)

year,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
<int>,<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>
2007,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male
2007,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
2007,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
2007,Adelie,Torgersen,,,,,
2007,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female
2007,Adelie,Torgersen,39.3,20.6,190.0,3650.0,male


In [18]:
test_that("Question 1.5", {
    expect_equal(digest(dim(answer1.5)), "d095e682a86f7f16404b7f8dd5f3d676")
    expect_equal(digest(answer1.5), "a07a1cdcb64726866df3d525811a9bf6")
})
cat("success!")

Test passed
success!

## QUESTION 1.6

Rename `flipper_length_mm` to `length_flipper_mm`. Store the result in a variable named `answer1.6`

In [19]:
answer1.6 <- rename(penguins, length_flipper_mm=flipper_length_mm)

head(answer1.6)



species,island,bill_length_mm,bill_depth_mm,length_flipper_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
Adelie,Torgersen,,,,,,2007
Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
Adelie,Torgersen,39.3,20.6,190.0,3650.0,male,2007


In [20]:
test_that("Question 1.6", {
  expect_equal(digest(dim(answer1.6)), 'd095e682a86f7f16404b7f8dd5f3d676')
  expect_equal(digest(names(answer1.6)), 'ef6a2aaa40de41c0b11ad2f6888d5ce6')
})
cat("success!")

Test passed
success!

## `filter()` 

So far, we've practiced picking variables by their name with `select()`. But how about picking observations (rows)? This is where `filter()` comes in.
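Before the questions, here is a minimal sketch of how `filter()` treats multiple conditions and missing values. The `toy` tibble is hypothetical and made up for this illustration.

```r
library(dplyr)

# Hypothetical toy data with one missing mass.
toy <- tibble(
  id     = 1:5,
  island = c("Biscoe", "Dream", "Dream", "Biscoe", "Torgersen"),
  mass   = c(3500, 4100, NA, 3900, 3700)
)

# Conditions separated by commas are combined with AND.
heavy_dream <- filter(toy, mass > 3600, island == "Dream")

# A comparison with NA yields NA, and filter() drops those rows.
heavy <- filter(toy, mass > 3600)

# To keep the missing values as well, ask for them explicitly.
heavy_or_missing <- filter(toy, mass > 3600 | is.na(mass))

heavy_dream
heavy
heavy_or_missing
```

Note that the row with the missing mass silently disappears from `heavy`, which is worth keeping in mind when counting rows after a filter.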

## QUESTION 1.7

Pick penguins with body mass greater than 3600 g. Store the resulting tibble in a variable named `answer1.7`

In [21]:
answer1.7 <- filter(penguins, body_mass_g > 3600)

head(answer1.7)

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Adelie,Torgersen,39.1,18.7,181,3750,male,2007
Adelie,Torgersen,39.5,17.4,186,3800,female,2007
Adelie,Torgersen,39.3,20.6,190,3650,male,2007
Adelie,Torgersen,38.9,17.8,181,3625,female,2007
Adelie,Torgersen,39.2,19.6,195,4675,male,2007
Adelie,Torgersen,42.0,20.2,190,4250,,2007


In [22]:
test_that("Question 1.7", {
  expect_equal(digest(dim(answer1.7)), '0f80c9cad929bf5de5ae34e0d50cb60d')
  expect_equal(sum(pull(answer1.7, body_mass_g) <= 3600), 0)
})
cat("success!")

Test passed
success!

## Storing the subsetted penguins data

In question 1.7 above, you've created a subset of the `penguins` dataset by filtering for those penguins that have a body mass greater than 3600 g. Let's do a quick check to see how many penguins meet that threshold by comparing the dimensions of the `penguins` dataset and your subset, `answer1.7`. Either `dim()` or `glimpse()` would work for this; the cell below uses `dim()`.

In [23]:
dim(penguins)
dim(answer1.7)

As you can see, in filtering down to penguins with a body mass greater than 3600 g, we have lost about 100 rows (observations). However, `answer1.7` is not an informative name for this new dataset that you've created from `penguins`. Let's rename it to something else.

In [24]:
penguins3600 <- answer1.7

## QUESTION 1.8

From your "new" dataset `penguins3600`, take only data from penguins located on Biscoe island. Store the result in a variable named `answer1.8`. 

In [25]:
answer1.8 <- filter(penguins3600, island == "Biscoe")

head(answer1.8)

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Adelie,Biscoe,35.9,19.2,189,3800,female,2007
Adelie,Biscoe,38.2,18.1,185,3950,male,2007
Adelie,Biscoe,38.8,17.2,180,3800,male,2007
Adelie,Biscoe,35.3,18.9,187,3800,female,2007
Adelie,Biscoe,40.5,18.9,180,3950,male,2007
Adelie,Biscoe,40.1,18.9,188,4300,male,2008


In [26]:
test_that("Question 1.8", {
  expect_equal(digest(dim(answer1.8)), "92ac01cd2e8809faceb1f7a283cd935f")
  a <- as.character(unique(pull(answer1.8, island)))
  expect_length(a, 1L)
  expect_equal(a, "Biscoe")
})
cat("success!")

Test passed
success!

## QUESTION 1.9

Repeat the task from Question 1.8, but take data from islands Torgersen and Dream. Now that you've practiced with dplyr verbs quite a bit, you don't need as many prompts to answer! Hint: When you want to select more than one island, you use `%in%` instead of `==`.

Store your answer in a variable named `answer1.9`.

In [27]:
# answer1.9 <- FILL_THIS_IN(FILL_THIS_IN, island FILL_THIS_IN c("FILL_THIS_IN", "FILL_THIS_IN"))
# your code here
answer1.9 <- filter(penguins3600, island %in% c("Torgersen", "Dream"))

head(answer1.9)

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Adelie,Torgersen,39.1,18.7,181,3750,male,2007
Adelie,Torgersen,39.5,17.4,186,3800,female,2007
Adelie,Torgersen,39.3,20.6,190,3650,male,2007
Adelie,Torgersen,38.9,17.8,181,3625,female,2007
Adelie,Torgersen,39.2,19.6,195,4675,male,2007
Adelie,Torgersen,42.0,20.2,190,4250,,2007


In [28]:
test_that("Question 1.9", {
  expect_equal(digest(dim(answer1.9)), "b207bbce54bb47be51e7ba7b56d24bc2")
  expect_equal(sum(pull(answer1.9, island) == "Torgersen"), 28)
  expect_equal(sum(pull(answer1.9, island) == "Dream"), 69)
  expect_equal(sum(pull(answer1.9, island) == "Biscoe"), 0)
})
cat("success!")

Test passed
success!

## `arrange()` 

`arrange()` allows you to rearrange rows. Let's give it a try!
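As a small sketch of what to expect, the hypothetical `toy` tibble below (made up for this illustration) is sorted in both directions. One detail worth noticing is that `arrange()` places missing values at the end in either case.

```r
library(dplyr)

# Hypothetical toy data with one missing mass.
toy <- tibble(
  name = c("p1", "p2", "p3", "p4"),
  mass = c(3800, NA, 3250, 4675)
)

asc <- arrange(toy, mass)        # smallest mass first, NA last
dsc <- arrange(toy, desc(mass))  # largest mass first, NA still last

asc
dsc
```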

## QUESTION 1.10

Order `penguins` by year, in ascending order. Store the resulting tibble in a variable named `answer1.10`.

In [29]:
# answer1.10 <- arrange(FILL_THIS_IN, FILL_THIS_IN)
# your code here
answer1.10 <- arrange(penguins, year)
head(answer1.10)

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
Adelie,Torgersen,,,,,,2007
Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
Adelie,Torgersen,39.3,20.6,190.0,3650.0,male,2007


In [30]:
test_that("Question 1.10", {
    expect_sorted(pull(answer1.10, year))
})
cat("success!")

Test passed
success!

## QUESTION 1.11

Great work! Order `penguins` by year, in descending order. Hint: there is a function that allows you to order a variable in descending order called `desc()`.

Store your tibble in a variable named `answer1.11`.

In [36]:
# answer1.11 <- arrange(FILL_THIS_IN, FILL_THIS_IN)
# your code here
answer1.11 <- arrange(penguins, desc(year))
#answer1.11 <- arrange(penguins, -year)
head(answer1.11)

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Adelie,Biscoe,35.0,17.9,192,3725,female,2009
Adelie,Biscoe,41.0,20.0,203,4725,male,2009
Adelie,Biscoe,37.7,16.0,183,3075,female,2009
Adelie,Biscoe,37.8,20.0,190,4250,male,2009
Adelie,Biscoe,37.9,18.6,193,2925,female,2009
Adelie,Biscoe,39.7,18.9,184,3550,male,2009


In [37]:
test_that("Question 1.11", {
    expect_sorted(pull(answer1.11, year) %>% 
                    rev())
})
cat("success!")

Test passed
success!

## QUESTION 1.12

Order `penguins` by year, then by `body_mass_g`. Use ascending order in both cases.

Store your answer in a variable named `answer1.12`

In [38]:
# answer1.12 <- arrange(FILL_THIS_IN, FILL_THIS_IN, FILL_THIS_IN)
# your code here
answer1.12 <- arrange(penguins, year, body_mass_g)
head(answer1.12)

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Chinstrap,Dream,43.2,16.6,187,2900,female,2007
Adelie,Dream,37.5,18.9,179,2975,,2007
Adelie,Dream,37.0,16.9,185,3000,female,2007
Adelie,Dream,36.0,18.5,186,3100,female,2007
Adelie,Biscoe,37.9,18.6,172,3150,female,2007
Adelie,Dream,36.5,18.0,182,3150,female,2007


In [39]:
test_that("Question 1.12", {
  expect_sorted(pull(answer1.12, year))
  answer1.12_list <- answer1.12 %>% 
    group_by(year) %>% 
    group_split()
  
  expect_length(answer1.12_list, 3)
  expect_sorted(answer1.12_list[[1]] %>% pull(body_mass_g) %>% na.omit())
  expect_sorted(answer1.12_list[[2]] %>% pull(body_mass_g) %>% na.omit())
  expect_sorted(answer1.12_list[[3]] %>% pull(body_mass_g) %>% na.omit())
})
cat("success!")

Test passed
success!

## Piping, `%>%` 

So far, we've been using dplyr verbs by passing the dataset that we want to work on as the first argument of the function (e.g. `select(penguins, year)`). This is fine when you're using a single verb, i.e. you only want to filter observations, or select variables. However, more often than not you will want to perform several tasks in sequence, such as filtering penguins with a certain body mass and then ordering those penguins by year. This is where piping (`%>%`) comes in.

Think of `%>%` as the word "then"!
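More concretely, `x %>% f(y)` is another way of writing `f(x, y)`: the pipe passes its left-hand side as the first argument of the function on its right. The sketch below checks this on the built-in `mtcars` data frame, which is used here only to keep the example self-contained.

```r
library(dplyr)

# Nested form: read from the inside out.
nested <- head(select(mtcars, mpg, cyl), 3)

# Piped form: read from top to bottom, "take mtcars, then select, then head".
piped <- mtcars %>%
  select(mpg, cyl) %>%
  head(3)

# Both forms produce the same result.
identical(nested, piped)
```

R 4.1.0 and later also ship a native pipe, `|>`, which behaves similarly for simple chains like this one; this worksheet sticks with `%>%`.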

Let's see an example. Here I want to combine `select()` with `arrange()`.

This is how I could do it by *nesting* the two function calls. I am selecting variables year, species, island, and body_mass_g, while simultaneously arranging by year.

In [40]:
print(arrange(select(penguins, year, species, island, body_mass_g), year), n = 5)

# A tibble: 344 x 4
   year species island    body_mass_g
  <int> <fct>   <fct>           <int>
1  2007 Adelie  Torgersen        3750
2  2007 Adelie  Torgersen        3800
3  2007 Adelie  Torgersen        3250
4  2007 Adelie  Torgersen          NA
5  2007 Adelie  Torgersen        3450
# ... with 339 more rows


However, that seems a little hard to read. Now using pipes:

In [41]:
penguins %>%
  select(year, species, island, body_mass_g) %>%
  arrange(year) %>% 
  print(n = 5)

# A tibble: 344 x 4
   year species island    body_mass_g
  <int> <fct>   <fct>           <int>
1  2007 Adelie  Torgersen        3750
2  2007 Adelie  Torgersen        3800
3  2007 Adelie  Torgersen        3250
4  2007 Adelie  Torgersen          NA
5  2007 Adelie  Torgersen        3450
# ... with 339 more rows


## Creating tibbles

Throughout Part A, we have been working with a tibble, `penguins`. Remember that when we ran `class()` on `penguins`, we could see that it was a dataframe that had been coerced to a tibble, which is a unifying feature of the tidyverse.

Suppose that you have a dataframe that you want to coerce to a tibble. To do this, you can use `as_tibble()`. R comes with a few built-in datasets, one of which is `mtcars`. Let's check the class of `mtcars`:

In [42]:
class(mtcars)

As you can see, mtcars is a dataframe. Now, coerce it to a tibble with `as_tibble()`:

In [43]:
as_tibble(mtcars) %>% 
    print(n = 5)

# A tibble: 32 x 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
# ... with 27 more rows


You can read more about tibbles in the [R4DS Tibble Chapter](https://r4ds.had.co.nz/tibbles.html#creating-tibbles).
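One detail that is easy to miss: `mtcars` stores the car models as row names, and `as_tibble()` drops row names by default. The sketch below shows one way to keep them, plus building a tibble from scratch. The column name `"model"` and the `small` tibble are arbitrary choices made for this illustration.

```r
library(dplyr)

# Row names are dropped by default.
cars_tbl <- as_tibble(mtcars)

# The `rownames` argument keeps them as a regular column instead.
cars_named <- as_tibble(mtcars, rownames = "model")

# tibble() builds a tibble column by column;
# later columns may refer to earlier ones.
small <- tibble(x = 1:3, y = x^2)

names(cars_named)
small
```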


## QUESTION 1.13

At the start of this worksheet, we loaded a package called `gapminder`. This package comes with a dataset stored in the variable also named `gapminder`. Check the class of the `gapminder` dataset:

In [44]:
class(gapminder)

As you can see, it is already a tibble.

Take all countries in Europe that have a GDP per capita greater than 10000, and select all variables except `gdpPercap`, using pipes. (Hint: use `-`).

Store your answer in a variable named `answer1.13`. Here is a code snippet that you can copy and paste into the solution cell below. 

```
answer1.13 <- FILL_THIS_IN %>%
  filter(FILL_THIS_IN > 10000, FILL_THIS_IN == "Europe") %>%
  FILL_THIS_IN(-FILL_THIS_IN)
```

In [47]:
# your code here
answer1.13 <- gapminder %>%
    filter(gdpPercap > 10000, continent == "Europe") %>%
    select(-gdpPercap)
head(answer1.13)

country,continent,year,lifeExp,pop
<fct>,<fct>,<int>,<dbl>,<int>
Austria,Europe,1962,69.54,7129864
Austria,Europe,1967,70.14,7376998
Austria,Europe,1972,70.63,7544201
Austria,Europe,1977,72.17,7568430
Austria,Europe,1982,73.18,7574613
Austria,Europe,1987,74.94,7578903


In [48]:
test_that("Question 1.13", {
  expect_equal(digest(dim(answer1.13)), "87d72f02bf15a0a29647db0c48c9a226")
  expect_equal(digest(answer1.13), "d0136991f3cfee4fcf896f677181c9c6")
})
cat("success!")

Test passed
success!

## QUESTION 1.14

Coerce the `mtcars` data frame to a tibble, and take all columns that start with the letter "d". 
*Hint: take a look at the "Select helpers" documentation by running the following code: `?tidyselect::select_helpers`.*

Store your tibble in a variable named `answer1.14`

```
answer1.14 <- FILL_THIS_IN(FILL_THIS_IN) %>%
    FILL_THIS_IN(FILL_THIS_IN("d"))
```

In [53]:
answer1.14 <- as_tibble(mtcars) %>% 
    select(starts_with("d"))
    
head(answer1.14)



disp,drat
<dbl>,<dbl>
160,3.9
160,3.9
108,3.85
258,3.08
360,3.15
225,2.76


In [54]:
test_that("Question 1.14", {
  expect_equal(digest(dim(answer1.14)), "ea1df69d6a59227894d1d4330f9bfab8")
  expect_equal(digest(colnames(answer1.14)), "0956954d01fe74c59c1f16850b7e874f")
})
cat("success!")

Test passed
success!

This exercise is from [r-exercises](https://www.r-exercises.com/2017/10/19/dplyr-basic-functions-exercises/).

## `mutate()`

The `mutate()` function allows you to create new columns, possibly using existing columns. Like `select()`, `filter()`, and `arrange()`, the `mutate()` function also takes a tibble as its first argument, and returns a tibble. 

The general syntax is: `mutate(tibble, NEW_COLUMN_NAME = CALCULATION)`.
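Here is a minimal sketch of that syntax on a hypothetical one-column tibble (the `toy` data and the column names are made up for this illustration). It also shows that a column created in a `mutate()` call can be used by later arguments of the same call.

```r
library(dplyr)

# Hypothetical toy data.
toy <- tibble(mass_g = c(3750, 3800, 3250))

toy2 <- mutate(
  toy,
  mass_kg = mass_g / 1000,  # computed element-wise for every row
  heavy   = mass_kg > 3.5   # refers to the column created just above
)

toy2
```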

## QUESTION 1.15

Make a new column with body mass in kg, named `body_mass_kg`, *and* rearrange the tibble so that `body_mass_kg` goes after `body_mass_g` and before `sex`. Store the resulting tibble in a variable named `answer1.15`.


*Hint*: within `select()`, use R's `:` operator to select all variables from `species` to `body_mass_g`.

```
answer1.15 <- penguins %>%
    mutate(FILL_THIS_IN = FILL_THIS_IN) %>%
    select(FILL_THIS_IN, FILL_THIS_IN, FILL_THIS_IN, FILL_THIS_IN)
```

In [61]:
answer1.15 <- penguins %>% 
    mutate(body_mass_kg = body_mass_g/1000)%>%
    select(species:body_mass_g, body_mass_kg, sex, year)
    


head(answer1.15)

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,body_mass_kg,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<dbl>,<fct>,<int>
Adelie,Torgersen,39.1,18.7,181.0,3750.0,3.75,male,2007
Adelie,Torgersen,39.5,17.4,186.0,3800.0,3.8,female,2007
Adelie,Torgersen,40.3,18.0,195.0,3250.0,3.25,female,2007
Adelie,Torgersen,,,,,,,2007
Adelie,Torgersen,36.7,19.3,193.0,3450.0,3.45,female,2007
Adelie,Torgersen,39.3,20.6,190.0,3650.0,3.65,male,2007


In [62]:
test_that("Question 1.15", {
  expect_equal(digest(dim(answer1.15)), "9e9457527d068c2333ea8fd598e07f13")
  expect_equal(digest(colnames(answer1.15)), "d7121e41fe934232c1c45dc425365040")
  expect_equal(na.omit(answer1.15$body_mass_kg / answer1.15$body_mass_g) %>% digest,
               "cdfbfd4da65e3575a474558218939055")
})
cat("success!")

Test passed
success!

Notice that the calculation is vectorized: it is applied to every row at once, with no need for loops! By the way, if you'd like to create columns _and_ drop all the other columns in one step, use the `transmute()` function.

## `group_by()`

The `group_by()` function groups the _rows_ in your tibble according to one or more categorical variables. Just specify the columns containing the grouping variables. `mutate()` (and others) will now operate on each chunk independently. 
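To make "operate on each chunk independently" concrete, here is a small sketch on hypothetical data (the `toy` tibble is made up for this illustration). Inside the grouped `mutate()`, `mean(pop)` is computed separately for each country rather than over the whole column.

```r
library(dplyr)

# Hypothetical toy data: two countries, a few years each.
toy <- tibble(
  country = c("A", "A", "A", "B", "B"),
  year    = c(2000, 2005, 2010, 2000, 2005),
  pop     = c(10, 12, 14, 100, 90)
)

grouped <- toy %>%
  group_by(country) %>%
  mutate(pop_centered = pop - mean(pop)) %>%  # mean() is evaluated per country
  ungroup()                                   # drop the grouping when done

grouped
```

Calling `ungroup()` at the end is optional, but it avoids surprises if later steps are not meant to run per group.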

## QUESTION 1.16

Calculate the growth in population since the first year on record _for each country_, and name the column `rel_growth`. Do this by **rearranging the following lines**, and **filling in the `FILL_THIS_IN`**. Assign your answer to a variable named `answer1.16`

*Hint*: Here's another convenience function for you: `dplyr::first()`.

```
answer1.16 <-
    mutate(rel_growth = FILL_THIS_IN) %>% 
    arrange(FILL_THIS_IN) %>% 
    gapminder %>% 
    group_by(country) %>% 
```

In [83]:
answer1.16 <- gapminder %>%
    group_by(country) %>%
    arrange(year)%>%
    mutate(rel_growth = pop - first(pop))
head(answer1.16)

country,continent,year,lifeExp,pop,gdpPercap,rel_growth
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>,<int>
Afghanistan,Asia,1952,28.801,8425333,779.4453,0
Albania,Europe,1952,55.23,1282697,1601.0561,0
Algeria,Africa,1952,43.077,9279525,2449.0082,0
Angola,Africa,1952,30.015,4232095,3520.6103,0
Argentina,Americas,1952,62.485,17876956,5911.3151,0
Australia,Oceania,1952,69.12,8691212,10039.5956,0


In [84]:
test_that("Answer 1.16", {
    expect_equal(nrow(answer1.16), 1704)
    c('country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap', 'rel_growth') %>% 
       map_lgl(~ .x %in% names(answer1.16)) %>% 
       all() %>% 
       expect_true()
    expect_equal(digest(as.integer(answer1.16$rel_growth)), '26735e4b17481f965f9eb1d3b5de89ad')
})
cat("success!")

Test passed
success!

## `summarise()`

The last core dplyr verb is `summarise()`. It collapses a data frame to a single row:

In [85]:
summarise(penguins, body_mass_mean = mean(body_mass_g, na.rm = TRUE))

body_mass_mean
<dbl>
4201.754


*From R4DS Data Transformation:* 

> `summarise()` is not terribly useful unless we pair it with `group_by()`. This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they'll be automatically applied "by group".

For example, if we applied exactly the same code to a tibble grouped by island, we get the average body mass per island:

In [86]:
penguins %>%
  group_by(island) %>%
  summarise(body_mass_mean = mean(body_mass_g, na.rm = TRUE))

island,body_mass_mean
<fct>,<dbl>
Biscoe,4716.018
Dream,3712.903
Torgersen,3706.373
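A single `summarise()` call can compute several summaries, and it can group by more than one variable. The sketch below uses a hypothetical `toy` tibble made up for this illustration. By default, `summarise()` removes only the last grouping variable from the result, so the sketch sets `.groups = "drop"` to get back an ungrouped tibble.

```r
library(dplyr)

# Hypothetical toy data with one missing mass.
toy <- tibble(
  island = c("X", "X", "Y", "Y", "Y"),
  year   = c(2007, 2008, 2007, 2007, 2008),
  mass   = c(4000, 4200, 3600, NA, 3800)
)

per_group <- toy %>%
  group_by(island, year) %>%
  summarise(
    mass_mean = mean(mass, na.rm = TRUE),  # ignore missing values
    n         = n(),                       # number of rows in each group
    .groups   = "drop"                     # return an ungrouped tibble
  )

per_group
```

Note that `n()` counts rows, so the group containing the missing mass still reports two rows even though only one value contributes to its mean.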


## QUESTION 1.17

From the `penguins` tibble, calculate the mean penguin body mass per island by year, in a column named `body_mass_mean`. Your tibble should have the columns `year`, `island`, and `body_mass_mean` only (and in that order). Store the resulting tibble in a variable named `answer1.17`.

```
answer1.17 <- penguins %>%
  group_by(FILL_THIS_IN) %>%
  FILL_THIS_IN(body_mass_mean = mean(FILL_THIS_IN, na.rm = TRUE))
```

In [92]:
answer1.17 <- penguins %>% 
    group_by(year, island) %>% 
    summarize(body_mass_mean = mean(body_mass_g, na.rm = TRUE))

head(answer1.17)

`summarise()` has grouped output by 'year'. You can override using the `.groups` argument.



year,island,body_mass_mean
<int>,<fct>,<dbl>
2007,Biscoe,4740.909
2007,Dream,3684.239
2007,Torgersen,3763.158
2008,Biscoe,4628.125
2008,Dream,3779.412
2008,Torgersen,3856.25


In [93]:
test_that("Question 1.17", {
  expect_equal(digest(dim(answer1.17)), "f4885de1726d18557bd43d769cc0ae26")
  expect_equal(digest(colnames(answer1.17)), "ba0c85220a5fa5222cac937acb2f94c2")
})
cat("success!")

Test passed
success!

# Part 2: Scoped variants with `across()`

Sometimes we want to perform the same operation on many columns. We can achieve this by embedding the `across()` function within the `mutate()` or `summarise()` functions.
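As a minimal sketch, the hypothetical `toy` tibble below (made up for this illustration) is summarised and mutated with `across()`. Extra arguments such as `na.rm = TRUE` are supplied through an anonymous function here. To the best of our knowledge, dplyr 1.1.0 and later deprecate passing them through `across()`'s `...`, although that older form still runs with a warning.

```r
library(dplyr)

# Hypothetical toy data with one missing value.
toy <- tibble(
  g = c("a", "a", "b"),
  x = c(1, 3, 5),
  y = c(2, NA, 6)
)

# Mean of every numeric column, per group.
means <- toy %>%
  group_by(g) %>%
  summarise(across(where(is.numeric), function(col) mean(col, na.rm = TRUE)))

# Apply one function to chosen columns; `.names` controls the new column names.
doubled <- toy %>%
  mutate(across(c(x, y), function(col) col * 2, .names = "{.col}_doubled"))

means
doubled
```

Without `.names`, `mutate(across(...))` overwrites the selected columns in place.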

## QUESTION 2.0

In a single expression, make a tibble that answers the following *for each island* in the `penguins` dataset:

+ What is the *mean* of each numeric variable? Keep the column names the same.
+ How many penguins are there on the island? Store this in a column named `n`.

Assign your answer to a variable named `answer2.0`

```
answer2.0 <- penguins %>% 
 group_by(FILL_THIS_IN) %>% 
 summarise(across(where(FILL_THIS_IN), FILL_THIS_IN, na.rm = TRUE), 
           n = n())
```

In [95]:
answer2.0 <- penguins %>% 
    group_by(island) %>% 
    summarise(across(where(is.numeric), mean, na.rm = TRUE), 
             n = n())

head(answer2.0)        

island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,year,n
<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
Biscoe,45.25749,15.87485,209.7066,4716.018,2008.095,168
Dream,44.16774,18.34435,193.0726,3712.903,2007.984,124
Torgersen,38.95098,18.42941,191.1961,3706.373,2007.923,52


In [96]:
test_that("Answer 2.0", {
    expect_equal(
        answer2.0 %>% 
          mutate(across(where(is.numeric), round, digits = 0)) %>% 
          unclass() %>% 
          digest(),
        "b06d7816762e489a57ca922d175f08ef"
    )
})
cat("success!")

Test passed
success!

## QUESTION 2.1

Using the `penguins` dataset, what is the mean bill length and depth of penguins on each island, by year? The resulting tibble should have columns named `island`, `year`, `bill_length_mm`, and `bill_depth_mm`, in that order. Store the result in a variable named `answer2.1`. Be sure to remove NA's when you are calculating the mean. 

*Hint*: Use `starts_with()` instead of `where()` in the `across()` function.

```
answer2.1 <- penguins %>%
    group_by(FILL_THIS_IN) %>%
    summarise(across(FILL_THIS_IN))
```

In [100]:
answer2.1 <- penguins %>%  
    group_by(island, year) %>% 
    summarise(across(starts_with("bill"), mean, na.rm = TRUE))
    
head(answer2.1)

`summarise()` has grouped output by 'island'. You can override using the `.groups` argument.



island,year,bill_length_mm,bill_depth_mm
<fct>,<int>,<dbl>,<dbl>
Biscoe,2007,45.03864,15.54091
Biscoe,2008,44.62031,15.825
Biscoe,2009,46.11186,16.17797
Dream,2007,44.53913,18.57391
Dream,2008,43.75588,18.39706
Dream,2009,44.09773,18.06364


In [101]:
test_that("Answer 2.1", {
    expect_equal(names(answer2.1), c("island", "year", "bill_length_mm", "bill_depth_mm"))
    sorted <- answer2.1 %>%
       arrange(island, year)
    expect_identical(digest(round(sorted$bill_length_mm, 0)), "f9f46fe0b2604eac7903505876e4b240")
    expect_identical(digest(round(sorted$bill_depth_mm, 0)), "d54992e0dbb34479e18f4f73ff1f16f4")
})
cat("success!")

Test passed
success!