
# Lecture 04: Data manipulation II

<div style="border: 1px double black; padding: 10px; margin: 10px">

**In today's lecture we'll answer the following questions:**
* What days of the year / week are the busiest for flying?
* Who is the best batter in the history of baseball?

Along the way, we'll learn how to:
* Use [pipes](#Pipes).
* [Generate new variables](#Adding-New-Variables) using various transformations.
* [Group data and summarize it](#Grouped-Summaries).
</div>

In [3]:
library(tidyverse)
library(nycflights13)
options(jupyter.plot_mimetypes = "image/png");

Registered S3 method overwritten by 'rvest':
  method            from
  read_xml.response xml2
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


## Which days of the year, and which airports, are the busiest for flying?

Let's think about the table we would want to have in order to answer this question. Ideally,
it would look something like this:

    # A tibble: 1,095 x 4
       month   day airport n_sched_departures
       <int> <int> <chr>                <int>
     1     1     1 EWR                    305
     2     1     1 JFK                    297
     3     1     1 LGA                    240
     4     1     2 EWR                    350
     5     1     2 JFK                    321
     6     1     2 LGA                    272
     7     1     3 EWR                    336
     8     1     3 JFK                    318
     9     1     3 LGA                    260
    10     1     4 EWR                    339
    # … with 1,085 more rows

The table we are given has ~337k rows, one for each flight. How do we go from the `flights` table to the one shown above?

## Summaries
`summarize()` collapses a data frame into a single row of summary statistics, or into one row per group if the data frame is grouped. The syntax is:
```{r}
summarize(<grouped tibble>, <new variable> = <formula for new variable>,
                            <other new variable> = <other formula>)
```

The most basic use of summarize is to compute statistics over the whole data set:

In [24]:
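print(summarize(flights, delay = mean(dep_delay)))

# A tibble: 1 x 1
  delay
  <dbl>
1    NA

The result is `NA` because `dep_delay` contains missing values, and `mean()` returns `NA` whenever any input is missing. Passing `na.rm = TRUE` drops the missing values before averaging.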


### Grouping observations
`summarize()` is most useful when combined with `group_by()`, which groups observations before the summary statistic is calculated. `group_by()` does not change the data; it only records which variables define the groups (see the `# Groups:` line in the output):

In [27]:
print(group_by(flights, year, month, day))

# A tibble: 336,776 x 19
# Groups:   year, month, day [365]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812

`summarize()` applies the summary function to each group of data. Remember that it always returns **one row per group**.

In [28]:
print(summarize(group_by(flights, month), mean_dep_delay = mean(dep_delay, na.rm = TRUE)))

# A tibble: 12 x 2
   month mean_dep_delay
   <int>          <dbl>
 1     1          10.0
 2     2          10.8
 3     3          13.2
 4     4          13.9
 5     5          13.0
 6     6          20.8
 7     7          21.7
 8     8          12.6
 9     9           6.72
10    10           6.24
11    11           5.44
12    12          16.6


It's as if `summarize()` filtered your data for each group, calculated the summary statistic, and
then combined all the results back into one table.

In [29]:
df <- filter(flights, month == 1)
mean(df$dep_delay, na.rm = TRUE)  # <--- first row of the summary table
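
The equivalence can be checked end to end on a small example. This is a minimal sketch that assumes a made-up tibble named `toy` (its values are hypothetical and not taken from `flights`); it computes the grouped summary and then repeats the calculation by hand for each group.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from `flights`)
toy <- tibble(
    month     = c(1, 1, 2, 2, 2),
    dep_delay = c(10, NA, 0, 6, 30)
)

# Grouped summary: one row per month
by_month <- summarize(group_by(toy, month),
                      mean_dep_delay = mean(dep_delay, na.rm = TRUE))
print(by_month)

# The same numbers computed by hand, one group at a time
jan <- filter(toy, month == 1)
feb <- filter(toy, month == 2)
mean(jan$dep_delay, na.rm = TRUE)  # 10
mean(feb$dep_delay, na.rm = TRUE)  # 12
```

The two hand-computed means match the two rows of `by_month`.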

Many summary functions are available:

* Center: `mean(), median()`
* Spread: `sd(), IQR(), mad()`
* Range: `min(), max(), quantile()`
* Position: `first(), last(), nth()`
* Count: `n(), n_distinct()`
* Logical: `any(), all()`
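
Several of these functions can be combined in a single `summarize()` call. The sketch below uses a made-up tibble `toy` with hypothetical values; the column names on the left-hand side are arbitrary choices for illustration.

```r
library(dplyr)

# Toy data (hypothetical values)
toy <- tibble(
    g = c("a", "a", "a", "b", "b"),
    x = c(1, 3, 8, 2, 2)
)

stats <- summarize(
    group_by(toy, g),
    center   = median(x),      # center
    spread   = sd(x),          # spread
    largest  = max(x),         # range
    first_x  = first(x),       # position
    n_rows   = n(),            # count of rows
    n_unique = n_distinct(x),  # count of distinct values
    any_big  = any(x > 5)      # logical: is any value above 5?
)
print(stats)
```

Each summary function receives the values of `x` for one group at a time, so `stats` has one row for `"a"` and one for `"b"`.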

### Examples
The `n()` function calculates the number of rows in each group:

In [33]:
print(summarize(group_by(flights, month), n()))
nrow(filter(flights, month == 12))  # same value as the n() entry for month 12

# A tibble: 12 x 2
   month `n()`
   <int> <int>
 1     1 27004
 2     2 24951
 3     3 28834
 4     4 28330
 5     5 28796
 6     6 28243
 7     7 29425
 8     8 29327
 9     9 27574
10    10 28889
11    11 27268
12    12 28135


## Exercise
Modify this command to produce the table shown on the first slide. That is, generate a table that has one row for each day of the year and each of the three airports, and a column giving the number of flights (rows) in that group.

Use this table to answer the question: which day of the year is busiest, and at what airport?

In [21]:
## Solution (your code here)

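Sometimes we want to ask: what was the most extreme value within each group? For example, what was the busiest day at each of the three airports? The `top_n()` function answers this kind of question: it keeps the rows with the largest values of a variable and, on a grouped tibble, does so within each group. Here it keeps the five carriers with the most flights: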

In [36]:
top_n(summarize(group_by(flights, carrier), n = n()), 5)

Selecting by n


carrier     n
<chr>   <int>
AA      32729
B6      54635
DL      48110
EV      54173
UA      58665
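
To see how grouping changes what `top_n()` returns, here is a minimal sketch on a made-up tibble `daily` (the airport codes and counts are hypothetical). The weighting column is passed explicitly as the third argument, which avoids the "Selecting by" message.

```r
library(dplyr)

# Toy counts (hypothetical): flights per day at two made-up airports
daily <- tibble(
    airport = c("AAA", "AAA", "AAA", "BBB", "BBB"),
    day     = c(1, 2, 3, 1, 2),
    n       = c(10, 30, 20, 5, 7)
)

# Ungrouped: the two rows with the largest n overall
overall <- top_n(daily, 2, n)
print(overall)

# Grouped: the row with the largest n within each airport
busiest <- top_n(group_by(daily, airport), 1, n)
print(busiest)
```

Note that `top_n()` does not sort its result and keeps ties, so it can return more rows than requested. In dplyr 1.0.0 and later, `slice_max()` is the recommended replacement.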


#### A shortcut
`summarize(n = n())` occurs so often that there is a shortcut for it:

In [110]:
top_n(count(flights, carrier), 5)  # keeps the five carriers with the largest `n`

Selecting by n


carrier     n
<chr>   <int>
AA      32729
B6      54635
DL      48110
EV      54173
UA      58665
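
That the shortcut gives the same table as the long form can be checked directly. A minimal sketch, assuming a made-up tibble `toy` with hypothetical carrier codes:

```r
library(dplyr)

# Toy data (hypothetical carrier codes)
toy <- tibble(carrier = c("XX", "YY", "XX", "XX", "YY", "ZZ"))

long_form  <- summarize(group_by(toy, carrier), n = n())
short_form <- count(toy, carrier)

print(short_form)

# Compare the two results column by column
all.equal(as.data.frame(long_form), as.data.frame(short_form))
```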


## Exercise
What was the busiest day at each of the three airports (`EWR`, `JFK`, `LGA`)?

*Hint*: `count()` by month, day and origin, and then use grouping and `top_n()`.

In [42]:
## Solution

### Exercise

Use `summarize()`, `count()`, `filter()`, `arrange()` and/or `top_n()` to answer:

<table class="table-condensed">
    <tr><th>Beginner</th><th>Advanced</th></tr>
    <tr><td>
        Which plane (tail number) flew the most flights in July?
        </td>
    <td>
        How many planes flew at least one flight in January, but none in February?
        </td>
    </tr>
</table>






In [None]:
## Solution

## Pipes
Starting now, we will make extensive use of the pipe operator `%>%`. Consider the previous exercise:

In [43]:
top_n(group_by(summarize(group_by(flights, month, day, origin), n = n()), origin), 1)

Selecting by n


month   day origin     n
<int> <int> <chr>  <int>
    4    15 EWR      377
    7    11 JFK      332
    9    13 LGA      346


This is hard to read. To figure out what the command is doing you have to work from the inside out, which is not the order in which we are accustomed to reading. A slight improvement might be:

In [48]:
table1 <- group_by(flights, month, day, origin)
table2 <- summarize(table1, n = n())
table3 <- group_by(table2, origin)
top_n(table3, 1)

Selecting by n


month   day origin     n
<int> <int> <chr>  <int>
    4    15 EWR      377
    7    11 JFK      332
    9    13 LGA      346


This is better, but now you've created a bunch of useless temporary variables, and it requires a lot of typing.
Instead, we are going to use a new operator `%>%` (pronounced "pipe"):

In [49]:
flights %>% 
    group_by(month, day, origin) %>%
    summarize(n = n()) %>%
    group_by(origin) %>%
    top_n(1)

Selecting by n


month   day origin     n
<int> <int> <chr>  <int>
    4    15 EWR      377
    7    11 JFK      332
    9    13 LGA      346


This is much better. We can read the command from left to right and know exactly what is going on.

### How `%>%` works
Under the hood, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)`, and so on. We can use `%>%` with any function, not just those defined in the tidyverse:

In [50]:
"hello world" %>% print()  # prints "hello world"

[1] "hello world"
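
The rewriting rule can be verified with ordinary base R functions. This is a minimal sketch; the variable names `nested`, `piped` and `rounded` are chosen here for illustration.

```r
library(dplyr)  # attaching dplyr also makes `%>%` available

x <- c(1, 4, 9)

nested <- sum(sqrt(x))            # read from the inside out
piped  <- x %>% sqrt() %>% sum()  # read from left to right

nested                    # 6
identical(nested, piped)  # TRUE

# Extra arguments follow the piped value: x %>% f(y) is f(x, y)
rounded <- 3.14159 %>% round(2)   # same as round(3.14159, 2)
rounded
```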


## Adding New Variables
The `dplyr` package (part of the tidyverse) offers the `mutate()` and `transmute()` commands to add new variables to tibbles. The syntax is:
```{r}
<tibble> %>% mutate(<new variable> = <formula for new variable>,
                    <other new variable> = <other formula>)
```
This returns a copy of `<tibble>` with the new variables added on. `transmute()` does the same thing as `mutate()` but only keeps the new variables.
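
The difference between the two commands is easiest to see on a small example. A minimal sketch, assuming a made-up tibble `toy` with hypothetical columns `distance` and `hours`:

```r
library(dplyr)

# Toy tibble (hypothetical values)
toy <- tibble(distance = c(100, 250), hours = c(2, 5))

with_speed <- toy %>% mutate(speed = distance / hours)     # keeps all columns
only_speed <- toy %>% transmute(speed = distance / hours)  # keeps only the new one

names(with_speed)  # "distance" "hours" "speed"
names(only_speed)  # "speed"
names(toy)         # "distance" "hours" (the original is unchanged)
```

Because both commands return a copy, the result has to be assigned to a name (or piped onward) to be kept.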

We already have an `air_time` variable. Let us compute the total time of each flight by subtracting the departure time `dep_time` from the arrival time `arr_time`.

We notice something odd though. When we subtract 5h 17m from 8h 30m we should get 3h 13m, i.e. 193 minutes. But instead we get 313 minutes below.

In [87]:
flights %>% mutate(total_time = arr_time - dep_time) %>% 
            select(arr_time, dep_time, total_time) %>% 
            slice(1)

arr_time dep_time total_time
   <int>    <int>      <int>
     830      517        313


The issue is that `dep_time` and `arr_time` are in the hour-minute notation, so you cannot add and subtract them like regular numbers. We should first convert these times into the number of minutes elapsed since midnight.

We want to add two new variables, `new_dep` and `new_arr`, but first we need to write a function that can do the conversion. The function is given below; we'll learn how it works later in the semester. For now just think of it as a black box that converts times from one format to another.

In [53]:
hourmin2min <- function(hourmin) {  # minutes after 000=midnight
    min <- hourmin %% 100  # modulus
    hour <- (hourmin - min) %/% 100  # integer division
    return(60*hour + min)
} 
hourmin2min(100)

Let us test the function on 530. That's 5h 30min, i.e., 330 minutes since midnight.

In [54]:
hourmin2min(530)
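
As a further check, the conversion can be applied to the first flight discussed earlier (departure 517, arrival 830), which should give the expected 193 minutes. The function is repeated here only so that the block runs on its own.

```r
# Same conversion as above, repeated so this block is self-contained
hourmin2min <- function(hourmin) {
    min <- hourmin %% 100             # last two digits: minutes
    hour <- (hourmin - min) %/% 100   # remaining digits: hours
    60 * hour + min
}

dep <- hourmin2min(517)  # 5 * 60 + 17 = 317
arr <- hourmin2min(830)  # 8 * 60 + 30 = 510
arr - dep                # 193

# The arithmetic is vectorized, so a whole column can be converted at once
hourmin2min(c(517, 830, 1004))  # 317 510 604
```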

Let us now create two new variables obtained from `arr_time` and `dep_time` by converting them into minutes since midnight. In the same command, we can also create a new `total_time` column containing their difference.

In [59]:
my_flights <- mutate(flights,
                     new_arr = hourmin2min(arr_time),
                     new_dep = hourmin2min(dep_time),
                     total_time = new_arr - new_dep  # difference in minutes
                    ) %>% print

# A tibble: 336,776 x 22
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1

### Exercise

There is something weird about <code>total_time</code> when compared to <code>air_time</code>. What is it?

In [64]:
## Solution

What would cause a discrepancy of 1500 minutes?

## Exercise
Add a new variable `new_total_time` to `my_flights` which contains a "corrected" version of `total_time`. Plot the resulting distribution of `new_total_time`.

In [65]:
## Solution

## Who is the greatest batter of all time?
The Lahman dataset contains information on baseball players.

In [69]:
library(Lahman)
bat <- as_tibble(Batting) %>% print

# A tibble: 105,861 x 22
   playerID yearID stint teamID lgID      G    AB     R     H   X2B   X3B    HR
   <chr>     <int> <int> <fct>  <fct> <int> <int> <int> <int> <int> <int> <int>
 1 abercda…   1871     1 TRO    NA        1     4     0     0     0     0     0
 2 addybo01   1871     1 RC1    NA       25   118    30    32     6     0     0
 3 allisar…   1871     1 CL1    NA       29   137    28    40     4     5     0
 4 allisdo…   1871     1 WS3    NA       27   133    28    44    10     2     2
 5 ansonca…   1871     1 RC1    NA       25   120    2

There is one row per player, per year, per stint (a player who changes teams mid-season has more than one row for that year):

In [79]:
bat[2,] %>% print
Lahman::playerInfo('addybo01')

# A tibble: 1 x 22
  playerID yearID stint teamID lgID      G    AB     R     H   X2B   X3B    HR
  <chr>     <int> <int> <fct>  <fct> <int> <int> <int> <int> <int> <int> <int>
1 addybo01   1871     1 RC1    NA       25   118    30    32     6     0     0
# … with 10 more variables: RBI <int>, SB <int>, CS <int>, BB <int>, SO <int>,
#   IBB <int>, HBP <int>, SH <int>, SF <int>

    playerID nameFirst nameLast
       <chr>     <chr>    <chr>
106 addybo01       Bob     Addy


Bob Addy was active in the years 1871-1877. During that time he had $118+51+152+213+310+142+245=1231$ at-bats, and $32+16+54+51+80+40+68=341$ hits. Therefore his career batting average was $341/1231=0.277$.
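
The arithmetic is easy to check in R. This sketch simply types in the seven at-bat and hit totals quoted above; the vector names are chosen here for illustration.

```r
# Per-row totals for Bob Addy, copied from the text above
at_bats <- c(118, 51, 152, 213, 310, 142, 245)
hits    <- c(32, 16, 54, 51, 80, 40, 68)

total_ab   <- sum(at_bats)  # 1231
total_hits <- sum(hits)     # 341

# Career average: total hits divided by total at-bats
career_avg <- round(total_hits / total_ab, 3)
career_avg                  # 0.277
```

The career average is the ratio of the two totals, which is generally not the same as the mean of the row-by-row averages `hits / at_bats`.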

In [82]:
filter(bat, playerID == "addybo01") %>% print

# A tibble: 7 x 22
  playerID yearID stint teamID lgID      G    AB     R     H   X2B   X3B    HR
  <chr>     <int> <int> <fct>  <fct> <int> <int> <int> <int> <int> <int> <int>
1 addybo01   1871     1 RC1    NA       25   118    30    32     6     0     0
2 addybo01   1873     1 PH2    NA       10    51    12    16     1     0     0
3 addybo01   1873     2 BS1    NA       31   152    37    54     6     3     1
4 addybo01   1874     1 HR1    NA       50   213    25    51     9     2     0
5 addybo01   1875     1 PH2    NA       69   310    60    80     8

### Exercise
By appropriately grouping and summarizing the data, add up all the hits and at-bats for each player across all the years they played, and compute their career batting average. 

Which player(s) have the highest career batting average?

In [83]:
## Solution

### Always include counts
It is a good idea to include counts of each group when you do a summary. Some groups may have very low numbers of observations, resulting in high variance for the summary statistics. 
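
A minimal sketch of why the counts matter, using a made-up tibble `toy` of delays on three hypothetical routes (none of this is taken from the lecture's data sets). The threshold of three observations is an arbitrary choice for illustration.

```r
library(dplyr)

# Toy data (hypothetical): delays in minutes for three made-up routes
toy <- tibble(
    route = c("A", "A", "A", "A", "B", "B", "B", "C"),
    delay = c(5, 10, 0, 9, 20, 10, 15, 240)
)

by_route <- toy %>%
    group_by(route) %>%
    summarize(mean_delay = mean(delay), n = n())
print(by_route)  # route C has the largest mean, but it rests on a single flight

# Keep only groups with enough observations to trust the summary
trusted <- by_route %>% filter(n >= 3)
print(trusted)
```

Without the `n` column there would be no way to tell from `by_route` that the most extreme summary comes from the smallest group.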

What happens if we restrict our batting average calculation to players that had at least 100 at-bats?

In [96]:
## Solution

Selecting by avg


playerID      H    AB       avg
<chr>     <int> <int>     <dbl>
barnero01   860  2391 0.3596821
cobbty01   4189 11435 0.3663314
hornsro01  2930  8173 0.3584975
jacksjo01  1772  4981 0.3557519
meyerle01   513  1443 0.3555094


## Exercise
Which player had the highest batting average in a single season? What about among seasons after 1920?

In [109]:
## Solution