Permalink
Branch: master
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
373 lines (306 sloc) 17.5 KB
---
title: "Creating Summary Lines"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Creating Summary Lines}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r options, message=FALSE, warning=FALSE, include=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>")
library(gt)
library(dplyr)
```
Tables with summary data are quite common and naturally occur with alarming regularity. We can use **gt** to produce such summary components. These summary rows are automatically inserted within the relevant row groups, where each summary row is the result of a different aggregation function.
### Preparing the Input Data Table
To most easily insert the necessary summary rows, we need to ensure that some preparatory work is done. The key change to the input data table should be to have a `groupname` column present, with categorical group names for each grouping. The following example data table contains both `groupname` and `rowname` columns, along with two data columns (named `value_1` and `value_2`). Having the magic column names `groupname` and `rowname` will signal to **gt** that they should be used to form a table stub where `groupname` collects rows into a row group (furnishing a group heading row above), and `rowname` provides row labels within each row group. Here is a diagram showing the restructuring:
The incoming data table for the purpose of the first example will be created using `dplyr::tribble()`:
```{r}
# Create a table that creates a stub and
# row groups by virtue of having `groupname`
# and `rowname` columns
tbl <-
dplyr::tribble(
~groupname, ~rowname, ~value_1, ~value_2,
"A", "1", 235.6, 260.1,
"A", "2", 184.3, 84.4,
"A", "3", 342.3, 126.3,
"A", "4", 234.9, 83.0,
"B", "1", 190.9, 832.5,
"B", "2", 743.3, 281.2,
"B", "3", 252.3, 732.5,
"B", "4", 344.7, 546.1,
"C", "1", 197.2, 818.0,
"C", "2", 284.3, 394.4,
)
```
There are three groups in this data table: `A`, `B`, and `C`. The presence of the `groupname` column will tell **gt** to partition the stub into three row groups. This is how the **gt** display table appears after simply providing `tbl` to `gt()`:
```{r}
# Create a display table by using `tbl` as input
gt(tbl)
```
### Generating Summary Rows
Summaries by row group can be generated by using the `summary_rows()` function. There’s a lot of control in how the summaries are conducted and formatted. First of all, we can choose which groups will receive summary rows (the rows appear at the bottom of each row group). This optional choice of groups can be made using the `groups` argument (providing a vector of group names). Secondly, we can choose which columns should be included in the summary with the `columns` argument (in this case, wrapping column names in `vars()`). If we provide nothing to `groups`, then all groups will receive summary data for the selected `columns`.
To make any sort of summary, we need to use functions that will perform the aggregation. We can provide base functions such as `mean()`, `sum()`, `min()`, `max()`, and more, within a `list()` or `c()`. Each function provided will result in a summary row for each group.
Because each function will yield a row, we need to be able to identify that row. So, each summary row will receive a summary row label. We can provide our preferred names by naming the functions within the list or `c()` (e.g, `list(average = "mean", total = "sum", SD = "sd")` or `c("mean", total = "sum", stdev = "sd")`). If names aren't supplied, then **gt** will derive names from the functions supplied and make the names unique.
Now that the `summary_rows()` function has been somewhat explained, let’s look at how we can get summary rows for the `tbl` table:
```{r}
# Create a gt table with summary rows for
# `value_1` in the `A` and `B` row groups;
# three summary rows are made per group
# (one for each function in `fns`)
gt(tbl) %>%
summary_rows(
groups = c("A", "B"),
columns = vars(value_1),
fns = list("mean", "sum", "sd")
)
```
Here we see that summary rows were created for the `A` and `B` groups (group `C` has no summary lines in its row group). It was specified in `columns` that only the `value_1` column should be evaluated and, indeed, the `value_2` column shows no values whatsoever within the summary rows.
There is some flexibility in how we supply aggregation functions to `summary_rows()`. It's sometimes sufficient to supply function names enclosed in quotes (e.g., `list("mean", "sum", "sd")`) but the next section will demonstrate a more advanced method for supplying these functions (where we don't have to rely on function defaults). If we already have a vector of function names that use all of the default arguments within those functions, we can pass this directly to `fns`. This is nice if we want to build a common named vector of function names and summary labels. Here's an example that does just that:
```{r}
# Define a named vector of function names
# and summary row labels
fns_labels <- c(average = "mean", total = "sum", `std dev` = "sd")
# Use `fns_labels` in `fns`
gt(tbl) %>%
summary_rows(
columns = vars(value_1),
fns = fns_labels
)
```
### Providing Calls to Functions with a Formula Interface
Sometimes we want to use functions with some parameters set. The earlier example using `list("mean", "sum", "sd")` will break down if we have `NA` values in our data and we want to exclude those by using `na.rm = TRUE`. In this case we can provide a call to a function with a right-hand side (*RHS*) formula shortcut, having `.` serve as the data values. Here is an example where we modify our `tbl` to include some `NA` values (calling it `tbl_na`) and then changing the contents of the `list()`---using `c()` will work just as well---as calls to functions:
```{r}
# Create an alternate version of the
# `tbl` data frame that contains NAs
tbl_na <-
tbl %>%
mutate(value_1 = case_when(
value_1 >= 300 ~ NA_real_,
value_1 < 300 ~ value_1)
) %>%
mutate(value_2 = case_when(
value_2 >= 600 ~ NA_real_,
value_2 < 600 ~ value_2)
)
# Create summary rows from `tbl_na` within
# groups `A` and `B` but provide the
# aggregation functions as formulas so that
# we can modify the default options
gt(tbl_na) %>%
summary_rows(
groups = c("A", "B"),
columns = vars(value_1, value_2),
fns = list(
average = ~mean(., na.rm = TRUE),
sum = ~sum(., na.rm = TRUE),
`std dev` = ~sd(., na.rm = TRUE))
)
```
The input to `fns` is very permissive in regard to how the functions are defined. It is entirely valid to provide functions in the various forms shown earlier such that `list("sum", ~mean(., na.rm = TRUE), SD = "sd")` will be correctly interpreted.
### Formatting the Summary Output Values
Using `summary_rows()` to generate a summary is, in effect, creating new data and thus we need some means to format the new summary row data. We can use the `formatter` option to supply a formatting function to handle the formatting of the summary row data. Thankfully, we have a collection of formatting functions available in the package (i.e., all of the `fmt_*()` functions). The default for `formatter` is set to `fmt_number` which is a sensible default for many scenarios. The setting of argument values for a particular formatter can be done in the `...` area of the function call.
Let's start with an example where we format the summary row data using `fmt_number` as the formatter. The `fmt_number()` function several arguments to which we can pass values (`decimals`, `drop_trailing_zeros`, `negative_val`, `locale`, etc.). Here, we will provide `decimals = 3` to augment `fmt_number()`:
```{r}
# Use the formatter options in `...` to
# provide values to `fmt_number`
gt(tbl) %>%
summary_rows(
groups = "A",
columns = vars(value_1, value_2),
fns = c("mean", "sum", "min", "max"),
formatter = fmt_number,
decimals = 3
)
```
We can also store these argument values as local variables and pass them in both separate `fmt_number()` calls and then to a `summary_rows()` calls with `fmt_number()` used as the formatter. This is useful for standardizing formatting parameters across different table cell types.
```{r}
# Provide common formatting parameters to a list
# object named `formats`; the number of decimal
# places will be `2` and the locale is "fr_BE"
formats <- list(decimals = 2, locale = "fr_BE")
# Provide the formatting options from `formats` to
# all row data in all columns, and, to the summary
# row data
gt(tbl) %>%
fmt_number(
columns = vars(value_1, value_2),
decimals = formats$decimals,
locale = formats$locale
) %>%
summary_rows(
groups = "A",
columns = vars(value_1, value_2),
fns = list("mean", "sum", "min", "max"),
formatter = fmt_number,
decimals = formats$decimals,
locale = formats$locale
)
```
### Extracting the Summary Data from the gt Table Object
For a reproducible workflow, we do not want to have situations where any data created or modified cannot be accessed. While having summarized values be created in a **gt** pipeline presents advantages to readability and intent of analysis, it is recognized that the output table itself is essentially 'read only', as the input data undergoes processing and movement to an entirely different format.
However, the object created still contains data and we can obtain the summary data from a gt table object using the `extract_summary()` function. Taking the `gt_summary` object, we get a data frame containing the summary data while preserving the correct data types:
```{r}
# Create a gt table with summary rows and
# assign it to `gt_object`
gt_object <-
gt(tbl) %>%
summary_rows(
groups = c("A", "B"),
columns = vars(value_1, value_2),
fns = list("mean", "sum", "sd")
)
```
```{r}
# Extract the summary data from `gt_object`
# to a data frame object
summary_list <- gt_object %>% extract_summary()
# This is a list with two elements named
# `A` and `B` (matches the summary groups)
names(summary_list)
```
```{r}
# Print out the summary for the `A` group
summary_list[["A"]]
```
```{r}
# Print out the summary for the `B` group
summary_list[["B"]]
```
The output data frames within the list always contain the `groupname` and `rowname` columns. The `groupname` column is filled with the name of the stub group given in `summary_rows()`. The `rowname` column contains the descriptive stub labels for the summary rows (recall that values are either supplied explicitly in `summary_rows()`, or, are generated from the function names). The remaining columns are from the original dataset.
The output data frame from `extract_summary()` can be reintroduced to a reproducible workflow and serve as downstream inputs or undergo validation. Perhaps interestingly, the output data frame is structured in a way that facilitates direct input to `gt()` (i.e., has the magic `groupname` and `rowname` columns). This can produce a new, standalone summary table where the summary rows are now data rows:
```{r}
# Take the `gt_object`, which has a list of
# summary data frames, combine them with
# `do.call(bind_rows, ...)`, input that into
# `gt()`, and format all of the numeric values
do.call(
dplyr::bind_rows,
gt_object %>% extract_summary()
) %>%
gt() %>%
fmt_number(
columns = vars(value_1, value_2),
decimals = 3
)
```
### Providing Our Own Aggregation Functions to Generate Summary Rows
While many of the functions available in base R and within packages are useful as aggregate functions, we may occasionally have need to create our own custom functions. The only things to keep in mind are to ensure that a vector of values is the main input, and, a single value is returned. The return value can be most any class (e.g., `numeric`, `character`, `logical`) and it's the `formatter` function that will handle any custom formatting while also converting to `character`.
Here, we'll define a function that takes a vector of numeric values and outputs the two highest values (sorted low to high) above a `threshold` value. The output from this function is always a formatted `character` string.
```{r}
# Define a function that gives the
# highest two values above a threshold
agg_highest_two_above_value <- function(x, threshold) {
# Get sorted values above threshold value
values <- sort(round(x[x >= threshold], 2))
# Return character string with 2 highest values above threshold
if (length(values) == 0) {
return(paste0("No values above ", threshold))
} else {
return(
paste(
formatC(
tail(
sort(round(x[x > threshold], 2)), 2),
format = "f", digits = 2), collapse = ", "))
}
}
# Let's test this function with some values
agg_highest_two_above_value(
x = c(0.73, 0.93, 0.75, 0.86, 0.23, 0.81),
threshold = 0.8
)
```
Because this is character value that's returned, we don't need formatting functions like `fmt_number()`, `fmt_percent()`, etc. However, a useful formatter (and we do need *some* formatter) is the `fmt_passthrough()` function. Like the name suggests, it to great extent passes values through but formats as `character` (like all the `fmt_*()` function do) and it provides the option to decorate the output with a `pattern`. Let's have a look at how the `agg_highest_two_above_value()` function can be used with the `fmt_passthrough()` formatter function.
```{r}
# Create a gt table with summary rows for
# `value_1` & `value_2`; the custom function
# is being used with a threshold of `150`; the
# `fmt_passthrough` allows for minimal
# formatting of the aggregate values
summary_tbl <-
gt(tbl_na) %>%
summary_rows(
columns = vars(value_1, value_2),
fns = list(
high = ~agg_highest_two_above_value(., 150)),
formatter = fmt_passthrough,
pattern = "({x})"
) %>%
fmt_missing(columns = vars(value_1, value_2))
summary_tbl
```
We can extract the summary data from the `summary_tbl` object. Note that columns `value_1` and `value_2` are classed as character since it was character outputs that were generated by the `agg_highest_two_above_value()` function.
```{r}
summary_tbl %>%
extract_summary() %>%
str()
```
### Using Different Formatters for Different Columns
Suppose we have data across columns that should be summarized in the same way (i.e., common aggregate functions) but formatted differently (e.g., decimal notation, scientific notation, percentages, etc.). For each type of formatting, we need to call `summary_rows()`, however, we can employ certain strategies to make the code a bit more succinct while doing so.
Let's start with a new data table. This table contains `groupname` and `rowname` columns, which making using `summary_rows()` possible, along with `large`, `medium`, and `small` data columns, all containing numeric values.
```{r}
tbl_2 <-
dplyr::tribble(
~groupname, ~rowname, ~large, ~medium, ~small,
"A", "1", 235342.6, 342.2, 0.34,
"A", "2", 184123.0, 971.7, 0.23,
"A", "3", 342622.3, 392.6, 0.73,
"A", "4", 234353.9, 684.5, 0.93,
"B", "1", 190983.4, 328.0, 0.35,
"B", "2", 748250.3, 671.3, 0.98,
"B", "3", 252781.3, 934.2, 0.74,
"B", "4", 344030.7, 673.3, 0.27,
)
tbl_2
```
Here, we would like to apply a different format (both to the data rows and to the summary rows) across the three columns. We can store formatting options to three different list objects (`large_fmts`, `medium_fmts`, and `small_fmts`) and use the stored values in the **gt** pipeline.
```{r}
# Store formatting options in list objects
large_fmts <-
list(
columns = "large",
formatter = fmt_scientific,
options = list(decimals = 2)
)
medium_fmts <-
list(
columns = "medium",
formatter = fmt_number,
options = list(decimals = 2)
)
small_fmts <-
list(
columns = "small",
formatter = fmt_percent,
options = list(decimals = 5)
)
# Format the data rows, then, create summary rows
# and format summary data for consistent output in
# each column
tbl_2 %>%
gt() %>%
fmt_scientific(columns = large_fmts$columns) %>%
fmt_number(columns = medium_fmts$columns) %>%
fmt_percent(columns = small_fmts$columns) %>%
summary_rows(
columns = large_fmts$columns,
fns = c("sum", "mean", "sd"),
formatter = large_fmts$formatter
) %>%
summary_rows(
columns = medium_fmts$columns,
fns = c("sum", "mean", "sd"),
formatter = medium_fmts$formatter
) %>%
summary_rows(
columns = small_fmts$columns,
fns = c("sum", "mean", "sd"),
formatter = small_fmts$formatter
)
```
Passing in parameters like this is useful, especially if there are larger numbers of columns. Additionally, we can store formatting parameters outside of the `gt()` pipeline and separate our concerns between data structuring and data formatting. The separation of styles and options into objects becomes more important if we intend to centralize formatting options for reuse.
Another thing to note in the above example is that even though multiple independent calls of `summary_rows()` were made, summary data within common summary row names were 'squashed' together, thus avoiding the fragmentation of summary rows. Put another way, we don't create additional summary rows from a group across separate calls if we are referencing the same summary row labels. If the summary row labels provided in `fns` were to be different across columns, however, additional summary rows would be produced even if the types of data aggregations were to be functionally equivalent.