Permalink
Fetching contributors…
Cannot retrieve contributors at this time
165 lines (126 sloc) 8.41 KB
---
title: 'Tidy Sports Analytics, Part 2: tidyr'
author: Jake Thompson
date: '2017-06-25'
slug: tidy-sports-analytics-tidyr
categories:
- rstats
tags:
- nfl
- sports
- tidyverse
---
```{r setup, include = FALSE, message = FALSE}
library(here)
knitr::opts_chunk$set(
collapse = TRUE,
message = FALSE,
warning = FALSE,
comment = "#>",
echo = TRUE,
cache = FALSE,
fig.align = "center",
fig.width = 6,
fig.asp = 0.7,
out.width = "80%"
)
library (tidyverse)
```
This is the second in a series of posts that demonstrates how the [**tidyverse**](http://www.tidyverse.org/) can be used to easily explore and analyze NFL play-by-play data. In [part one](/post/2017-06-09-tidy-sports-analytics-part-1-dplyr/tidy-sports-analytics-dplyr/), I used the **dplyr** package to calculate the offensive success rate of each NFL offense in during the 2016 season. However, when we left off, I noted that really we should look at the success rate of both offenses and defenses in order to get a better idea of which teams were the best overall. For this, we'll use the [**tidyr**](http://tidyr.tidyverse.org/) package, which is main **tidyverse** package for tidying data. But first, let's talk about the pipe operator.
## Embrace the pipe
The pipe operator (`%>%`) comes from the [**magrittr**](http://magrittr.tidyverse.org/) package, and it is used to make code more readable. Compare the following two chunks of code.
```{r no-pipe, eval = FALSE}
library(tidyverse)
nfl_pbp <- read_rds(here("static", "data", "nfl_pbp_2016.rds"))
success <- select(nfl_pbp, game_id = gid, play_id = pid, offense = off,
defense = def, play_type = type, down = dwn,
to_go = ytg, gained = yds)
success <- filter(success, play_type %in% c("PASS", "RUSH"))
```
```{r pipe, message = FALSE, warning = FALSE}
library(tidyverse)
nfl_pbp <- read_rds(here("static", "data", "nfl_pbp_2016.rds"))
success <- nfl_pbp %>%
select(game_id = gid, play_id = pid, offense = off, defense = def,
play_type = type, down = dwn, to_go = ytg, gained = yds) %>%
filter(play_type %in% c("PASS", "RUSH"))
```
Both code chunks read in the data and do some initial manipulations to calculate success rate using the **dplyr** package (see [part one](/post/2017-06-09-tidy-sports-analytics-part-1-dplyr/tidy-sports-analytics-dplyr/) of this series for a more in depth explanation of how these functions work). In the first chunk, the data has to be referenced multiple times to select some variables from `nfl_pbp` and then filter to only passing and rushing plays. However, in the second chunk, all of the data manipulation is contained in a single code stanza, which improves readability.
The pipe works by passing the output of the function on the left as the first argument to the function on the right. So `a %>% f(x, y) %>% g(z)` is interpreted by the computer as `g(f(a, x, y), z)`. These two statements are equivalent, but in the first, it is much easier to see that `a` is passes to function `f()`, which also has arguments `x` and `y`, and then the output of that is passes to function `g()`, which also has argument `z`. In the **tidyverse**, the first argument is always the data frame that the operation is being performed on, and a data frame with the specified operation performed is returned. Thus, all of the **tidyverse** functions are compatible with the `%>%` operator. Many other base *R* functions, such as `unique`, are also compatible with the pipe.
Throughout the rest of this post, and the following posts in this series, we'll use the `%>%` to make the code more readable and straight forward.
## tidyr
In this post, we'll be using the **tidyr** package to calculate the offensive and defensive success rate for each team. Let's remind ourselves what the data looks like.
```{r view-data}
success
```
In the previous post, we grouped the data frame by offense in order to calculate success rate. Now, we want to group by team and offensive or defensive unit to calculate success rate. The problem we face is that there is no `unit` variable. Instead, the unit information is split across two columns: `offense` and `defense`. This is where the **tidyr** package comes into play. There are two main functions within **tidyr** that are used for tidying data:
- `gather`: collapse multiple columns into key-value pairs;
- `spread`: split key-value pairs into multiple columns.
I'll demonstrate how both of these functions are used in our analysis of NFL success rates.
## Using tidyr
Before we start using the **tidyr** package, let's calculate whether each play was a success or not using the processes from the [previous post](/post/2017-06-09-tidy-sports-analytics-part-1-dplyr/tidy-sports-analytics-dplyr/).
```{r catch-up}
success <- success %>%
mutate(
needed = case_when(
down == 1 ~ to_go * 0.45,
down == 2 ~ to_go * 0.60,
TRUE ~ to_go * 1.00
),
play_success = case_when(
gained >= needed ~ TRUE,
gained < needed ~ FALSE
)
)
success
```
In order to calculate the success rate for each team and unit combination, we first need to gather the unit information into key-value pairs using the `gather` function.
```{r gather}
success %>%
gather(key = "team_unit", value = "team", offense:defense)
```
When using the `gather` function, the column names of the collapsed columns become the "key", and the values are the "value". In the function, we chose to name the column of keys `team_unit` and the column of values `team`. We then specified which columns should be collapsed. Each column can be named individually (i.e., `c(offense, defense)`), or you can specify all columns in a range (i.e., `offense:defense`).
Next, as I noted last time, play success is calculated such that it measures whether the play was a success for the offense. This means that if a play was successful for an offense it was unsuccessful for the defense, and vice versa. Therefore, we need to reverse the decision of whether or not a play was successful when we calculate defensive success rates.
```{r def-success}
success %>%
gather(key = "team_unit", value = "team", offense:defense) %>%
mutate(
play_success = case_when(
team_unit == "defense" ~ !play_success,
TRUE ~ play_success
)
)
```
Here, we're saying if the `team_unit` for the observation is equal to defense, `play_success` is equal to the opposite of the current value of play success. If `play_success` is not equal to defense, then the value is unchanged. We can now group by `team` and `team_unit` to calculate the success rate of each offense and defense.
```{r unit-success}
success %>%
gather(key = "team_unit", value = "team", offense:defense) %>%
mutate(
play_success = case_when(
team_unit == "defense" ~ !play_success,
TRUE ~ play_success
)
) %>%
group_by(team, team_unit) %>%
summarize(success_rate = mean(play_success, na.rm = TRUE))
```
Finally, we want to spread the data frame back out so there is only one row per team. This is where the `spread` function comes into play.
```{r spread}
success %>%
gather(key = "team_unit", value = "team", offense:defense) %>%
mutate(
play_success = case_when(
team_unit == "defense" ~ !play_success,
TRUE ~ play_success
)
) %>%
group_by(team, team_unit) %>%
summarize(success_rate = mean(play_success, na.rm = TRUE)) %>%
spread(key = team_unit, value = success_rate)
```
The `spread` function takes the key-value pairs and spread them out into their own columns. The "key" column becomes the column names, so a new column will be created for each value in the "key". We have only two values (offense and defense), meaning that these are the two new columns that are created. Each column is populated with the values associated with that observation in the "value" column specified in the call to the `spread` function. Now we're back to one row per team, so it's easier to see the success rate of each unit for all of the teams.
## Conclusion
This post shows how the **tidyr** package, which is main the data cleaning package in the **tidyverse**, can be used to quickly reshape data in order to facilitate and perform analyses. In the next post, I'm going to demonstrate how the **ggplot2** package can be used to visualize these results. For more **tidyr** resources, check out:
- [tidyr.tidyverse.org](http://tidyr.tidyverse.org/)
- [*R for Data Science*, Tidy Data](http://r4ds.had.co.nz/tidy-data.html)
- [Tidy Data, *Journal of Statistical Software*](http://vita.had.co.nz/papers/tidy-data.html)