<center><h1>Introduction to dplyr Package</h1></center>

# 1. The _dplyr_ Package

  - "dplyr" is short for "data pliers" (the "d" stands for data frames)
  - R package for aggregating, summarizing, reshaping, and generally wrangling data
  - Extremely popular in the R community
  - Authored by Hadley Wickham
  - Part of the "tidyverse" set of packages

## 1.1 The _dplyr_ "Verbs"

  - The _dplyr_ package is organized around a set of "verbs", which are functions that operate on data
    + `filter()`
    + `summarise()`
    + `select()`
    + `mutate()`
    + `arrange()`
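
As a quick illustration of two of these verbs, the sketch below applies `mutate()` and `arrange()` to a small made-up data frame. The name `toy` and its columns are hypothetical and not part of the arrests data.

```r
library(dplyr)

# Hypothetical toy data; not the arrests data set
toy <- data.frame(id = c(3, 1, 2), age = c(40, 20, 30))

toy_sorted <- toy %>%
    mutate(age_months = age * 12) %>%   # add a column computed from an existing one
    arrange(id)                         # sort rows by id, ascending

toy_sorted
```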

## 1.2 The "Pipe" Operator

  - Passes the object on its left into the function call on its right, as the first argument
  - `%>%`
    + `x %>% f(y)` is the same as `f(x, y)`
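
A minimal sketch of that equivalence, using base R's `round()` as the function `f` (the vector `x` is made up for illustration):

```r
library(dplyr)   # dplyr makes the %>% operator available

x <- c(1.234, 5.678)

piped  <- x %>% round(1)   # x becomes the first argument of round()
direct <- round(x, 1)      # the same call written without the pipe

identical(piped, direct)   # TRUE
```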
    

# 2. `filter()` Examples with _dplyr_

In [None]:
library(dplyr)           # load the package

In [None]:
arrests_df <- read.csv("data/pvd_arrests_2021-10-03.csv")

In [None]:
arrests_df %>% 
    filter(gender == "Male") 

### 2.1.1 Comparing `filter()` with Logical Indexing

In [None]:
# dplyr approach
arrests_df %>% 
    filter(gender == "Male")


# "base" R approach
is_male <- arrests_df$gender == "Male"      # create a logical vector, one value per row

arrests_df[is_male, ]                       # keep only the rows where is_male is TRUE
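
The two approaches can differ when the column contains missing values: `filter()` drops rows where the condition is `NA`, while logical indexing returns an all-`NA` row for each of them. The sketch below shows this on a hypothetical three-row data frame (`toy` is made up; whether the arrests data has missing `gender` values is not assumed here).

```r
library(dplyr)

# Hypothetical toy data with one missing gender value
toy <- data.frame(id = 1:3, gender = c("Male", NA, "Female"))

by_filter <- toy %>% filter(gender == "Male")      # NA condition is dropped
by_index  <- toy[toy$gender == "Male", ]           # NA condition yields an all-NA row
by_which  <- toy[which(toy$gender == "Male"), ]    # which() skips NA, matching filter()

nrow(by_filter)   # 1
nrow(by_index)    # 2
nrow(by_which)    # 1
```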

### 2.1.2 `filter()` Examples (cont.)

In [None]:
# Store the result of filter() in a new data.frame

arrests_males <- arrests_df %>%
    filter(gender == "Male")                

In [None]:
head(arrests_males)

## 2.2 Using `filter()` with Multiple Conditions

In [None]:
arrests_teen_male <- arrests_df %>%
    filter(
        gender == "Male",
        age < 20
    )

head(arrests_teen_male)
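
Conditions separated by commas inside `filter()` are combined with logical AND, so the call above keeps only rows that satisfy both. A small sketch on made-up data (`toy` is hypothetical) comparing the comma form with an explicit `&`:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
    gender = c("Male", "Male", "Female"),
    age    = c(18, 30, 17)
)

with_comma <- toy %>% filter(gender == "Male", age < 20)    # comma-separated conditions
with_and   <- toy %>% filter(gender == "Male" & age < 20)   # explicit elementwise AND

all.equal(with_comma, with_and)   # TRUE
```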

### 2.2.1 Using `filter()` with Logical OR

  - Recall that the `||` operator is the logical OR for single `TRUE`/`FALSE` values
  - The `|` operator performs the same role, but elementwise on columns (or vectors)
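
A short sketch of the difference, using a made-up vector of ages (`ages` and `one_age` are hypothetical names):

```r
ages <- c(18, 40, 70)

# `|` compares element by element and returns one result per element
elementwise <- ages < 25 | ages > 65
elementwise        # TRUE FALSE TRUE

# `||` is meant for single values, such as a condition in an if statement
one_age <- 18
single <- one_age < 25 || one_age > 65
single             # TRUE
```

Because `filter()` needs one `TRUE`/`FALSE` per row, the elementwise `|` is the one to use inside it.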

In [None]:
young_old_male <- arrests_df %>%
    filter(
        gender == "Male",
        age < 25 | age > 65  
    )

head(young_old_male)

### 2.2.2 Using `filter()` with Logical OR (cont.)

In [None]:
ptk_young_old_male <- arrests_df %>%
    filter(
        gender == "Male",
        age < 25 | age > 65 | from_city == "Pawtucket"
    )

head(ptk_young_old_male)

<center><h1>Using <code>select()</code> Function in dplyr</h1></center>

# 3. Using `select()` to Extract Columns
  - Recall that `filter()` can be used to filter rows
  - Similarly, `select()` is used to select columns
  - These functions can be "chained"

## 3.1 Example of `select()`

In [None]:
arrests_subset <- arrests_df %>% 
    select(id, age, gender, statute_desc)

head(arrests_subset)

### 3.1.1 Comparing `select()` to `[, ]` notation

In [None]:
# dplyr example
arrests_df %>% 
    select(id, age, gender, statute_desc)


# equivalent "base" R approach
cols <- c("id", "age", "gender", "statute_desc")

arrests_df[, cols]
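
One difference worth knowing: when only a single column is requested, `[, ]` on a plain data frame simplifies the result to a vector by default, while `select()` still returns a data frame. A sketch on hypothetical data (`toy` is made up):

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(id = 1:3, age = c(20, 30, 40))

sel_res  <- toy %>% select(age)         # still a data frame with one column
idx_res  <- toy[, "age"]                # simplified to a plain numeric vector
keep_res <- toy[, "age", drop = FALSE]  # drop = FALSE keeps the data frame

is.data.frame(sel_res)    # TRUE
is.data.frame(idx_res)    # FALSE
is.data.frame(keep_res)   # TRUE
```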

## 3.2 Example of `select()` (cont.)

In [None]:
arrests_vio <- arrests_df %>%
    select(
        id,
        age,
        gender,
        statute_desc
    )

In [None]:
head(arrests_vio)           # show the first few rows of the new data frame

# 4. Chaining _dplyr_ Operators
  - One key reason for the popularity of _dplyr_
  - _dplyr_ verbs/functions are "composable": each takes a data frame and returns a data frame
    + $(f \circ g)(x) = f(g(x))$
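
To make the composition concrete, the sketch below writes the same two-step operation once as a pipe chain and once as nested function calls, on a made-up data frame (`toy` is hypothetical). Both forms give the same result; the piped form reads in the order the steps happen.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
    id     = 1:4,
    gender = c("Female", "Male", "Female", "Male"),
    age    = c(25, 35, 45, 55)
)

# Piped: filter first, then select
piped <- toy %>%
    filter(gender == "Female") %>%
    select(id, age)

# Nested: the inner call, filter(), runs first
nested <- select(filter(toy, gender == "Female"), id, age)

all.equal(piped, nested)   # TRUE
```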

In [None]:
female_vio <- arrests_df %>%
    filter(gender == "Female") %>%
    select(id, age, gender, statute_desc)

head(female_vio)

## 4.1 More Chaining

In [None]:
female_midage <- arrests_df %>%
    filter(
        gender == "Female",
        age > 45,
        statute_desc != ""
    ) %>%
    select(
        id, 
        age, 
        gender,
        statute_desc
    ) %>%
    arrange(
        id
    )

head(female_midage)

<center><h1>Using <code>group_by()</code> and <code>summarise()</code> in dplyr</h1></center>

# 5. Why use `group_by()` and `summarise()` from _dplyr_?
  - Aggregating and summarizing data by group is a very common task
  - Follows the _split-apply-combine_ pattern: split the data into groups, apply a function to each group, combine the results
  - These operations can be "chained" with other _dplyr_ functions
  - Often makes for concise, intuitive, and readable code
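
As a rough illustration of split-apply-combine, the sketch below computes a mean age per group twice on made-up data (`toy` is hypothetical): once by hand with base R's `split()` and `sapply()`, and once with `group_by()` and `summarise()`.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
    gender = c("Male", "Female", "Male", "Female"),
    age    = c(20, 30, 40, 50)
)

# By hand: split ages by gender, apply mean() to each piece, combine into a vector
by_hand <- sapply(split(toy$age, toy$gender), mean)

# With dplyr: the same three steps expressed as a pipeline
by_dplyr <- toy %>%
    group_by(gender) %>%
    summarise(mean_age = mean(age))

by_hand
by_dplyr
```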

## 5.1 Example of `group_by()` and `summarise()`

In [None]:
gender_tbl <- arrests_df %>%
    group_by(gender) %>%
    summarise(
        n_rows = n(),
        mean_age = mean(age)
    ) 

head(gender_tbl)

# 6. Chaining `filter()` with `group_by()` and `summarise()`

In [None]:
gender_tbl <- arrests_df %>%
    filter(
        from_city == "Providence",
        year == 2019
    ) %>%
    group_by(gender) %>%
    summarise(
        n_rows = n(),
        mean_age = mean(age),
        mean_cnts = mean(counts, na.rm = TRUE)
    ) 

head(gender_tbl)
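
The `na.rm = TRUE` argument matters because `mean()` returns `NA` when any value is missing. A small sketch with a made-up vector (`counts_toy` is hypothetical and does not come from the arrests data):

```r
# Hypothetical values with one missing entry
counts_toy <- c(1, NA, 3)

with_na    <- mean(counts_toy)                 # NA: the missing value propagates
without_na <- mean(counts_toy, na.rm = TRUE)   # 2: the NA is removed before averaging

with_na
without_na
```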

## 6.1 More Interesting Example of Chaining

In [None]:
is_summer <- function(month_num) {
    chk <- month_num %in% c(6, 7, 8)
    return(chk)
}

In [None]:
is_summer(6)   # TRUE
is_summer(2)   # FALSE
is_summer(8)   # TRUE


### 6.1.1 More Interesting Example (cont.)

In [None]:
vio_tbl <- arrests_df %>%
    filter(
        statute_desc != "",
        statute_desc != "NULL", 
        year == 2021
    ) %>%
    group_by(statute_desc) %>%
    summarise(
        n_vios = n(),
        prop_male = mean(gender == "Male"),
        mean_age = mean(age),
        prop_summer = mean(is_summer(month))
    ) %>%
    arrange(desc(n_vios))

head(vio_tbl, 10)
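
The `prop_male` and `prop_summer` columns rely on the fact that the mean of a logical vector is the proportion of `TRUE` values, since `TRUE` counts as 1 and `FALSE` as 0. A minimal sketch, repeating the `is_summer()` helper from above and using a made-up vector of month numbers (`months_toy` is hypothetical):

```r
# Same helper as defined earlier; %in% is vectorized, so it accepts whole columns
is_summer <- function(month_num) {
    month_num %in% c(6, 7, 8)
}

months_toy <- c(1, 6, 7, 12)

flags <- is_summer(months_toy)   # FALSE TRUE TRUE FALSE
prop  <- mean(flags)             # 2 of 4 values are TRUE

prop                             # 0.5
```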