Permalink
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
102 lines (75 sloc) 3.63 KB
---
title: "Column Names"
editor_options:
chunk_output_type: console
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r setup}
library(readxl)
```
## Column names via `col_names`
readxl has always let you specify `col_names` explicitly at the time of import:
```{r}
read_excel(
readxl_example("datasets.xlsx"), sheet = "chickwts",
col_names = c("chick_weight", "chick_ate_this"), skip = 1
)
```
But users have long wanted a way to specify a name repair *strategy*, as opposed to enumerating the actual column names.
## Built-in levels of `.name_repair`
As of v1.2.0, readxl provides the `.name_repair` argument, which affords control over how column names are checked or repaired. This requires v2.0.0 or higher of the [tibble package](https://tibble.tidyverse.org), which powers this feature under the hood. *December 2018 note: tibble v2.0.0 hasn't been released yet but will be soon. Until then, `.name_repair` requires a development version of tibble.*
The `.name_repair` argument in `read_excel()`, `read_xls()`, and `read_xlsx()` works exactly the same way as it does in [`tibble::tibble()`](https://tibble.tidyverse.org/reference/tibble.html) and [`tibble::as_tibble()`](https://tibble.tidyverse.org/reference/as_tibble.html). Full documentation is in the [`?name-repair` topic](https://tibble.tidyverse.org/dev/reference/name-repair.html) of tibble.
readxl's default is `.name_repair = "unique"`, which ensures each column has a unique name. If that is already true of the column names, readxl won't touch them.
The value `.name_repair = "universal"` goes further and makes column names syntactic, i.e. makes sure they don't contain any forbidden characters or reserved words. This makes life easier if you use packages like ggplot2 and dplyr downstream, because the column names will "just work" everywhere and won't require protection via backtick quotes.
Compare the column names in these two calls. This shows the difference between `"unique"` (names can contain spaces) and `"universal"` (spaces replaced by `.`).
```{r}
read_excel(
readxl_example("deaths.xlsx"), range = "arts!A5:F8"
)
read_excel(
readxl_example("deaths.xlsx"), range = "arts!A5:F8",
.name_repair = "universal"
)
```
If you don't want readxl to touch your column names at all, use `.name_repair = "minimal"`.
## Pass a function to `.name_repair`
The `.name_repair` argument also accepts a function -- pre-existing or written by you -- or an anonymous formula. This function must operate on a "names in, names out" basis.
```{r}
## ALL CAPS! via built-in toupper()
read_excel(readxl_example("clippy.xlsx"), .name_repair = toupper)
## lower_snake_case via a custom function
my_custom_name_repair <- function(nms) tolower(gsub("[.]", "_", nms))
read_excel(
readxl_example("datasets.xlsx"), n_max = 3,
.name_repair = my_custom_name_repair
)
## take first 3 characters via anonymous function
read_excel(
readxl_example("datasets.xlsx"),
sheet = "chickwts", n_max = 3,
.name_repair = ~ substr(.x, start = 1, stop = 3)
)
```
This means you can also perform name repair in the style of base R or another package, such as `janitor::make_clean_names()` (*requires janitor > v1.1.1*).
```{r, eval = FALSE}
read_excel(
SOME_SPREADSHEET,
.name_repair = ~ make.names(.x, unique = TRUE)
)
read_excel(
SOME_SPREADSHEET,
.name_repair = ~ janitor::make_clean_names
)
```
What if you have a spreadsheet with lots of missing column names? Here's how you could fall back to letter-based column names, for easier troubleshooting.
```{r eval = FALSE}
read_excel(
SOME_SPREADSHEET,
.name_repair = ~ ifelse(nzchar(.x), .x, LETTERS[seq_along(.x)])
)
```