<font size="6"><b>`TIDYVERSE` SUITE</b></font>

In [None]:
library(tidyverse)
library(magrittr)
library(nycflights13)
library(DT)

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=30) # for limiting the number of top and bottom rows of tables printed 

![xkcd](../imagesbb/data_pipeline.png)

(https://xkcd.com/2054)

In this session, we will cover some important components and functions of the tidyverse suite.

tidyverse is essentially a suite of interrelated packages.
We will mostly focus on a collection of `dplyr` and `tidyr` functions and features in this session.

Note that, unlike in data.table, modifications are not made in place: the result has to be assigned back to the original object.
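
As a minimal sketch of this behaviour, the example below uses a small hypothetical table called `toy` (not part of the session data). It illustrates that `mutate` returns a new object and leaves its input untouched until the result is assigned back.

```r
library(dplyr)

# Hypothetical toy table, used only for illustration
toy <- tibble(x = 1:3)

# The result is printed but not stored, so `toy` is unchanged
toy %>% mutate(y = x * 2)
ncol(toy)   # still 1

# Assigning the result back keeps the new column
toy <- toy %>% mutate(y = x * 2)
ncol(toy)   # now 2
```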

We will use the same data from the nycflights13 package.

# Datasets

You can get info on and preview the structure and some rows of the datasets and navigate through them

## airlines

In [None]:
#?(airlines)

In [None]:
head(airlines)

In [None]:
str(airlines)

In [None]:
datatable(airlines, filter = "top")

## airports

In [None]:
#?airports

In [None]:
head(airports)

In [None]:
str(airports)

In [None]:
datatable(airports, filter = "top")

## planes

In [None]:
#?planes

In [None]:
head(planes)

In [None]:
str(planes)

In [None]:
datatable(planes, filter = "top")

## weather

In [None]:
#?weather

In [None]:
head(weather)

In [None]:
str(weather)

In [None]:
#datatable(weather, filter = "top")

## flights

In [None]:
#?flights

In [None]:
head(flights)

In [None]:
str(flights)

In [None]:
#datatable(flights, filter = "top")

In [None]:
class(flights)

In [None]:
attributes(flights)

## classes of datasets

In [None]:
class(airlines)
class(airports)
class(planes)
class(weather)
class(flights)

Tibble is a tidyverse version of the data.frame class:

`A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code. Tibbles also have an enhanced print() method which makes them easier to use with large datasets containing complex objects.`

[Simple Data Frames: tibble](https://tibble.tidyverse.org/)
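
Two of the points in the quote can be checked with a short sketch. The objects `df` and `tb` below are hypothetical toy tables, not part of the session data; the example compares partial matching with `$` and the handling of non-syntactic column names.

```r
library(tibble)

df <- data.frame(value = 1:3)
tb <- tibble(value = 1:3)

df$val   # partial matching: returns the `value` column
tb$val   # no partial matching: returns NULL with a warning

# data.frame() rewrites a non-syntactic name, tibble() keeps it as given
names(data.frame(`my col` = 1))   # "my.col"
names(tibble(`my col` = 1))       # "my col"
```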

# Pipes

Pipes make it easier to combine steps that form a linear flow of data computation.

The idea was first implemented in the 1970s for the UNIX operating system by Ken Thompson.

Listen to the story of its birth from its inventor, interviewed by Brian Kernighan: how he implemented pipes in ONE HOUR, and how he and Dennis Ritchie converted all related components in UNIX to be compatible with pipes (so that extra messages are not written to stdout) in just ONE NIGHT:

[![Ken Thompson interviewed by Brian Kernighan at VCF East 2019](https://img.youtube.com/vi/EY6q5dv_B-o/0.jpg)](https://youtu.be/EY6q5dv_B-o?t=1959)

And watch how Brian Kernighan explains the simple idea behind pipes with a toy example:

[![Ken Thompson interviewed by Brian Kernighan at VCF East 2019](https://img.youtube.com/vi/L9GfCgLLZYE/0.jpg)](https://youtu.be/L9GfCgLLZYE)

Let's try to create and visualize a randomly generated, normally distributed series from a simple entropy source: the modulo of the nanosecond timer

In [None]:
nano_sample <- replicate(1e5, microbenchmark::get_nanotime() %% 1e6)

In [None]:
nano_sample_n <- BBmisc::normalize(nano_sample + 1, method = "range", range = c(1e-5, 1-1e-5))

In [None]:
qvalues <- qnorm(nano_sample_n)

In [None]:
hist(qvalues, main = NULL, xlab = NULL)

Let's try to make it a one liner:

In [None]:
hist(qnorm(BBmisc::normalize(replicate(1e5, microbenchmark::get_nanotime() %% 1e6) +1,
                             method = "range",
                             range = c(1e-5, 1-1e-5))),
     main = NULL, xlab = NULL)

Or let's chain the steps with pipes:

In [None]:
replicate(1e5, microbenchmark::get_nanotime() %% 1e6) %>% "+"(1) %>%
BBmisc::normalize(method = "range", range = c(1e-5, 1-1e-5)) %>%
qnorm %>% hist(main = NULL, xlab = NULL)

Another version of the pipe, the assignment pipe `%<>%` (part of the `magrittr` package), passes an object through the steps and assigns the final output back to that object:

In [None]:
series <- replicate(1e5, microbenchmark::get_nanotime() %% 1e6) %>% "+"(1)

In [None]:
series %>% summary

In [None]:
series %<>% BBmisc::normalize()

In [None]:
series %>% summary
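
A minimal sketch of the equivalence, using two hypothetical vectors `x` and `y`: the assignment pipe is shorthand for piping an object through a step and assigning the result back to the same name.

```r
library(magrittr)

x <- c(1, 4, 9)
y <- c(1, 4, 9)

x <- x %>% sqrt()   # ordinary pipe plus explicit assignment
y %<>% sqrt()       # assignment pipe: pipes y through sqrt() and overwrites y

identical(x, y)     # TRUE
```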

# Basic dplyr verbs

## select

Select or leave out columns by name

Similar to `j` of DT[i, j, by] where `j` is a list of columns given by `.(...)`

In [None]:
flights %>% select(time_hour, carrier, origin, dest)

In [None]:
flights %>% select(-flight, -tailnum, -year, -month, -day)

## filter

Subset rows on conditions

Similar to `i` of DT[i, j, by]

In [None]:
flights %>%
filter(air_time < 30 & distance > 100)

## mutate

Modify existing columns or create new ones

Similar to the `:=` operator of data.table, but the column is not modified in place: the result has to be assigned back to the object

In [None]:
flights %<>%
mutate(air_hour = air_time / 60) %>% # air_time is recorded in minutes
mutate(speed = distance / air_hour) # distance is in miles, so speed is in miles per hour

In [None]:
flights

## summarize

Aggregate columns

Similar to DT[, .(...)] operations inside data.table where `.(...)` computes single or multiple columns with aggregation operations

In [None]:
flights %>%
summarise(max_speed = max(speed, na.rm = T))

## group by

Repeat operations across unique values of selected columns

Similar to `by` inside DT[i, j, by]

In [None]:
flights %>%
group_by(carrier) %>%
summarise(max_speed = max(speed, na.rm = T))

## arrange

Orders the output

Similar to `order` of data.table

In [None]:
flights %>%
group_by(carrier) %>%
summarise(max_speed = max(speed, na.rm = T)) %>%
arrange(-max_speed)
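
The verbs introduced so far can be traced on a table small enough to check by hand. The sketch below uses a hypothetical `toy` table with made-up values; `desc()` is used instead of the minus sign for descending order, which also works for character columns.

```r
library(dplyr)

# Hypothetical toy data: distance in miles, air_time in minutes
toy <- tibble(
  carrier  = c("AA", "AA", "BB", "BB", "CC"),
  distance = c(100, 200, 300, 150, 400),
  air_time = c(30, 60, 60, 60, 60)
)

toy_summary <- toy %>%
  filter(distance >= 150) %>%                       # drops the 100-mile flight
  mutate(speed = distance / (air_time / 60)) %>%    # miles per hour
  group_by(carrier) %>%
  summarise(max_speed = max(speed)) %>%
  arrange(desc(max_speed))

toy_summary   # CC 400, BB 300, AA 200
```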

## Joins

Combine multiple tables based on related columns:

Similar to `merge` of data.table, but different types of merges are done with separate functions:

In [None]:
flights %>% select(origin, dest, distance) %>% left_join(select(airports, faa, lat, lon), by = c("origin" = "faa"))

You can view the types of joins:

![sql joins](../imagesbb/sqljoins_cheatsheet.png)

Let's create a toy example:

In [None]:
table1 <- tibble(aa = LETTERS[1:4], bb = 1:4)

In [None]:
table2 <- tibble(aa = LETTERS[3:6], cc = 3:6)

In [None]:
table1

In [None]:
table2

In left join the key values are taken only from the first table:

In [None]:
table1 %>% left_join(table2, by = "aa")

In right join the key values are taken only from the second table:

In [None]:
table1 %>% right_join(table2, by = "aa")

In inner join only common key values in two tables are taken:

In [None]:
table1 %>% inner_join(table2, by = "aa")

In full outer join key values from two tables are combined:

In [None]:
table1 %>% full_join(table2, by = "aa")

In anti join key values that appear in the first table but do not appear in the second table are taken:

In [None]:
table1 %>% anti_join(table2, by = "aa")
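
As a quick check of the descriptions above, the sketch below rebuilds the two toy tables and counts the rows returned by each join. The name `join_rows` is introduced here only for illustration.

```r
library(dplyr)

table1 <- tibble(aa = LETTERS[1:4], bb = 1:4)   # keys A, B, C, D
table2 <- tibble(aa = LETTERS[3:6], cc = 3:6)   # keys C, D, E, F

join_rows <- c(
  left  = nrow(left_join(table1, table2, by = "aa")),    # A-D
  right = nrow(right_join(table1, table2, by = "aa")),   # C-F
  inner = nrow(inner_join(table1, table2, by = "aa")),   # C, D
  full  = nrow(full_join(table1, table2, by = "aa")),    # A-F
  anti  = nrow(anti_join(table1, table2, by = "aa"))     # A, B
)
join_rows
```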

## *_all, *_at, *_if versions of functions

- functions ending with _all apply to all columns
- functions ending with _at apply to columns explicitly specified by names
- functions ending with _if apply to columns that satisfy a condition

### summarise variants

In [None]:
flights %>% summarise_all(class)

In [None]:
flights %>% summarise_at(c("year", "month", "day"), max)

In [None]:
flights %>% summarise_if(is.numeric, mean, na.rm = T)

### mutate variants

The columns are modified keeping their names:

In [None]:
flights %>% 
filter(distance < 100) %>%
mutate_all(as.character)

Note that, with mutate_at and mutate_if, the columns not selected for the operation are kept unchanged in the output

In [None]:
flights %>%
filter(distance < 100) %>%
# dep_time and arr_time are clock times in HHMM format: convert them to decimal hours
mutate_at(c("dep_time", "arr_time"), function(x) round(x %/% 100 + x %% 100 / 60, 2))

In [None]:
flights %>%
filter(distance < 100) %>%
mutate_if(is.character, factor)
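
Since dplyr 1.0.0 the scoped variants are superseded by `across()`, which is used inside the plain `mutate` and `summarise` verbs. The sketch below shows rough equivalents on a hypothetical `toy` table; it is an illustration, not a full mapping of every scoped variant.

```r
library(dplyr)

# Hypothetical toy table
toy <- tibble(id = c("a", "b"), x = c(1.234, 5.678), y = c(10L, 20L))

# mutate_if(is.numeric, ...) written with across() and where()
toy %>% mutate(across(where(is.numeric), ~ .x * 2))

# mutate_all(as.character) written with across() and everything()
toy %>% mutate(across(everything(), as.character))

# summarise_at(c("x", "y"), max) written with across()
toy_max <- toy %>% summarise(across(c(x, y), max))
toy_max
```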

## predicates for column selection

A group of helper functions makes it easier to specify the columns for functions like select or the \*_at functions

They are listed on page [Selection language](https://tidyselect.r-lib.org/reference/language.html)

In [None]:
flights %>% select(contains("time"))

In [None]:
flights %>% select(where(is.numeric)) # predicate functions must be wrapped in where()

In [None]:
flights %>% select(!where(is.numeric))

Mutate the columns whose names end with "time", except `air_time`, which is recorded in minutes rather than in HHMM format:

In [None]:
flights %>%
filter(distance < 100) %>%
mutate_at(vars(ends_with("time"), -air_time), function(x) round(x %/% 100 + x %% 100 / 60, 2))

# basic operations with tidyr

## reshaping

### pivot_wider

Similar to `dcast` of data.table, reshapes long to wide

In [None]:
flights_at <- flights %>%
group_by(origin, carrier) %>%
summarise(max_at_oc = max(air_time, na.rm = T),
          min_at_oc = min(air_time, na.rm = T),
          av_at_oc = mean(air_time, na.rm = T))

In [None]:
flights_at

In [None]:
flights_at_wide <- flights_at %>% pivot_wider(id_cols = carrier, names_from = origin, values_from = av_at_oc)

In [None]:
flights_at_wide

`spread` is the older version of the function. It has no argument for choosing the identifier columns, so we have to select the necessary columns first:

In [None]:
flights_at %>% 
select(carrier, origin, av_at_oc) %>%
spread(key = origin, value = av_at_oc)

### pivot_longer

Similar to `melt` of data.table: reshapes from wide to long

In [None]:
flights_at_wide %>% pivot_longer(cols = -"carrier", names_to = "origin2", values_to = "av_time2", values_drop_na = T)

`gather` is the older version of the function:

In [None]:
flights_at_wide %>% gather(key = "origin2", value = "av_time2", -"carrier", na.rm = T)
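
The two reshaping directions can be checked as a round trip on a small table. The sketch below uses hypothetical values in a table called `long`. Going wide introduces an `NA` for the missing carrier and origin pair, and `values_drop_na = TRUE` removes it again on the way back.

```r
library(dplyr)
library(tidyr)

# Hypothetical long table: carrier BB has no JFK row
long <- tibble(
  carrier = c("AA", "AA", "BB"),
  origin  = c("EWR", "JFK", "EWR"),
  av_time = c(100, 120, 90)
)

wide <- long %>% pivot_wider(names_from = origin, values_from = av_time)
wide   # one row per carrier, BB has NA under JFK

back <- wide %>%
  pivot_longer(cols = -carrier, names_to = "origin", values_to = "av_time",
               values_drop_na = TRUE)
back   # the original three rows
```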

## crossing

Get the Cartesian product of the rows of two tables:

In [None]:
flights %>% distinct(carrier) %>% arrange(carrier) %>%
crossing(flights %>% distinct(origin) %>% arrange(origin))

## unite / separate columns

`unite` takes the values in multiple columns, combines them using a separator into a single new column:

In [None]:
flights %>% unite(col = "origin_carrier", c("origin", "carrier"))

`separate` takes the values in a single column and splits them at a separator into multiple new columns:

In [None]:
flights %>% unite(col = "origin_carrier", c("origin", "carrier")) %>%
separate(col = origin_carrier, into = c("origin", "carrier"), sep = "_")
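
On a small hypothetical table the effect is easier to see. The sketch below relies on the default separator of `unite`, which is `"_"`, and then splits the combined column back into its parts.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy table
toy <- tibble(origin = c("EWR", "JFK"), carrier = c("UA", "B6"), flight = c(1L, 2L))

united <- toy %>% unite(col = "origin_carrier", c("origin", "carrier"))
united$origin_carrier   # "EWR_UA" "JFK_B6"

restored <- united %>%
  separate(col = origin_carrier, into = c("origin", "carrier"), sep = "_")
restored                # same columns and values as toy
```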

# other selected functions and features

## slice

Filters rows by row indices.

Similar to `i` in DT[i, j, by] which covers slicing along with filtering:

In [None]:
flights %>% slice(1:10)

## pull

Extracts a column as a vector.

Similar to DT[, j, by] where j is the name of a single column

In [None]:
flights %>% slice(1:10) %>% pull(carrier)

## rename

Changes the name of a column:

Similar to `setnames` in data.table

In [None]:
flights %>% rename("departure_time" = "dep_time")

## if_all, if_any

Applies the same condition to multiple columns for a more complex filter.

- `if_all` keeps the rows for which all selected columns satisfy the condition
- `if_any` keeps the rows for which at least one of the selected columns satisfies the condition
- the columns can be selected by name or by selection language predicates

In [None]:
flights %>%
filter(if_all(c("distance", "air_hour"), ~ . > quantile(., 0.99, na.rm = T)))

In [None]:
flights %>%
filter(if_any(c("distance", "air_hour"), ~ . > quantile(., 0.999, na.rm = T)))
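
The difference between the two helpers is easiest to see on a table with one row per combination. In the sketch below (hypothetical `toy` table, threshold chosen arbitrarily), only row 3 exceeds the threshold in both columns, while rows 2, 3 and 4 exceed it in at least one.

```r
library(dplyr)

# Hypothetical toy table
toy <- tibble(id = 1:4, a = c(1, 5, 5, 1), b = c(1, 1, 5, 5))

all_ids <- toy %>% filter(if_all(c(a, b), ~ .x > 3)) %>% pull(id)
any_ids <- toy %>% filter(if_any(c(a, b), ~ .x > 3)) %>% pull(id)

all_ids   # 3
any_ids   # 2 3 4
```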

## shift functions

`lag` and `lead` functions shift a column by a stated number of elements forward or backward, similar to the `shift` function of data.table:

In [None]:
flights %>%
group_by(origin, dest) %>%
mutate(air_time_lag1 = lag(air_time, 1)) %>%
ungroup %>%
filter(origin == "EWR" & dest == "IAH") %>%
select(origin, dest, air_time, air_time_lag1)

In [None]:
flights %>%
group_by(origin, dest) %>%
mutate(air_time_lead1 = lead(air_time, 1)) %>%
ungroup %>%
filter(origin == "EWR" & dest == "IAH") %>%
select(origin, dest, air_time, air_time_lead1)
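
On a plain vector the shifts are easy to verify by eye. The sketch below uses a hypothetical vector `x`. Positions that have no source value are filled with `NA` unless a `default` is given.

```r
library(dplyr)

x <- c(10, 20, 30, 40)

lag(x, 1)                # NA 10 20 30
lead(x, 1)               # 20 30 40 NA
lag(x, 2, default = 0)   #  0  0 10 20
```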

## percent_rank

Gives the percent rank of values in a column:

In [None]:
flights %>%
mutate(air_hourp = percent_rank(air_hour)) %>%
filter(!between(air_hourp, 0.001, 0.999)) %>%
select(time_hour, carrier, origin, dest, air_hour, air_hourp)
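
As a small worked example on a hypothetical vector `x`: `percent_rank` rescales the minimum rank to the interval from 0 to 1, that is, (rank - 1) / (n - 1).

```r
library(dplyr)

x <- c(50, 10, 30, 20, 40)

pr <- percent_rank(x)
pr   # 1.00 0.00 0.50 0.25 0.75
```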

## replace_na

Similar to `nafill` of data.table with fixed value option: Replaces NA values with a fixed value

In [None]:
flights %>%
filter(is.na(air_time)) %>%
mutate_at("air_time", replace_na, 0)

## na_if

Changes a stated value to NA in a column:

In [None]:
c(sample(LETTERS, 10), "MISSING") %>% na_if("MISSING")

## bind_rows

Similar to `rbindlist` of data.table, combines multiple tables or parts of a list by rows:

In [None]:
flights %>% split(f = flights$origin) %>% bind_rows

## add_row

Appends a new row with stated values:

In [None]:
flights %>% slice(1:5) %>% select(carrier, origin, dest, year, month, day) %>%
add_row(carrier = "ZZ", origin = "ZZZ", dest = "YYY", year = 2024, month = 12, day = 31)

# Combining data.table and tidyverse

data.table and tidyverse both have their own strengths in different aspects.

We do not have to select one of them and we can combine their strengths:
tidyverse operations and pipes can be used inside a data.table and data.table operations can be added into tidyverse pipes

Note that when a data.table operation is added into a series of tidyverse pipes, the output of the previous step can be referred to with the `.` symbol:

In [None]:
library(data.table)

In [None]:
setDT(flights)

In [None]:
flights %>% select(ends_with("time")) %>%
.[, dep_hour := dep_time %/% 100 + dep_time %% 100 / 60] %>% # HHMM clock time to decimal hours
.[]

Note that some tidyverse functions can drop the data.table class, so it must be reapplied:

In [None]:
flights %>%
unite(col = "origin_dest", c("origin", "dest")) %>%
class

In [None]:
flights %>%
unite(col = "origin_dest", c("origin", "dest")) %>%
as.data.table %>%
class

There are cases where the tidyverse approach is easier, but the reverse is also true.

Consider this quite common case: filter some rows, make calculations on some columns, assign the result back to an existing or new column, and keep all unfiltered rows unchanged.

While this may not be so easy in tidyverse, it is one of the most trivial things to do in data.table.

Check this stackoverflow question and the answer below:

https://stackoverflow.com/questions/65892690/dplyr-filter-and-then-mutate-while-retaining-all-data
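
A minimal sketch of the two styles on a hypothetical `toy` table follows. The dplyr version shown here uses `if_else` inside `mutate`, which is one common approach and not necessarily the one given in the linked answer.

```r
library(dplyr)
library(data.table)

# Hypothetical toy table
toy <- data.frame(carrier = c("AA", "BB", "AA"), delay = c(10, 20, 30))

# dplyr: a conditional inside mutate() leaves the non-matching rows as they are
toy_dplyr <- toy %>%
  mutate(delay = if_else(carrier == "AA", delay * 2, delay))

# data.table: filter in `i`, assign in `j`; only the matching rows are updated
toy_dt <- as.data.table(toy)
toy_dt[carrier == "AA", delay := delay * 2]

toy_dplyr$delay   # 20 20 60
toy_dt$delay      # 20 20 60
```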