New data, functions, and features
Five new built-in datasets are available to demonstrate
dplyr verbs (#2094):
starwars: dataset about Star Wars characters; has list columns
storms has the trajectories of ~200 tropical storms
has some simple data to demonstrate joins.
arrange() for grouped data frames gains a
.by_group argument so you
can choose to sort by groups if you want to (defaults to
This verb is powered with the new
which is exported as well. It is like
select_vars() but returns a
as_tibble() is re-exported from tibble. This is the recommended way to create
tibbles from existing data frames.
tbl_df() has been softly deprecated.
tribble() is now imported from tibble (#2336, @chrMongeau); this
is now preferred to
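As a minimal sketch of the .by_group argument to arrange() described above, the example below sorts a grouped copy of the built-in mtcars data both ways. The argument is passed explicitly in each call so the sketch does not rely on the default; the object names are only illustrative.

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

# Sort by mpg only, ignoring the grouping variable
global_sort <- arrange(by_cyl, mpg, .by_group = FALSE)

# Sort by the grouping variable (cyl) first, then by mpg within each group
within_groups <- arrange(by_cyl, mpg, .by_group = TRUE)
```

In the second result the rows for each cyl value are contiguous, which is usually what is wanted when inspecting groups one at a time.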
Deprecated and defunct
dplyr no longer messages that you need dtplyr to work with data.table (#2489).
summarise_each_q() functions have been removed.
failwith(). I'm not even sure why it was here.
summarise_each(), these functions
print a message which will be changed to a warning in the next release.
passing a value to this argument prints a message which will be changed to a
warning in the next release.
This version of dplyr includes some major changes to how database connections work. By and large, you should be able to continue using your existing dplyr database code without modification, but there are two big changes that you should be aware of:
Almost all database related code has been moved out of dplyr and into a
new package, dbplyr. This makes dplyr
simpler, and will make it easier to release fixes for bugs that only affect
live dplyr so your existing code continues to work.
It is no longer necessary to create a remote "src". Instead you can work
directly with the database connection returned by DBI. This reflects the
maturity of the DBI ecosystem. Thanks largely to the work of Kirill Muller
(funded by the R Consortium) DBI backends are now much more consistent,
comprehensive, and easier to use. That means that there's no longer a
need for a layer in between you and DBI.
You can continue to use
src_sqlite(), but I recommend a new style that makes the connection to DBI more clear:
library(dplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)
mtcars2 <- tbl(con, "mtcars")
mtcars2
This is particularly useful if you want to perform non-SELECT queries as you can do whatever you want with
If you've implemented a database backend for dplyr, please read the backend news to see what's changed from your perspective (not much). If you want to ensure your package works with both the current and previous version of dplyr, see
wrap_dbplyr_obj() for helpers.
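To illustrate working directly with the DBI connection, here is a hedged sketch that runs a non-SELECT statement and then queries the same table through dplyr. It assumes the DBI, RSQLite, and dbplyr packages are installed; DBI::dbExecute() is one plausible way to send such a statement, not necessarily the call the text has in mind.

```r
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)

# A non-SELECT statement goes straight through DBI;
# dbExecute() returns the number of affected rows
rows_deleted <- DBI::dbExecute(con, "DELETE FROM mtcars WHERE cyl = 8")

# The same connection still works as a dplyr source
remaining <- tbl(con, "mtcars") %>%
  count() %>%
  collect()

DBI::dbDisconnect(con)
```

Because the connection is an ordinary DBI object, nothing dplyr-specific is needed to modify the database.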
Error messages and explanations of data frame inequality are now encoded in
UTF-8, also on Windows (#2441).
Joins now always reencode character columns to UTF-8 if necessary. This gives
a nice speedup, because now pointer comparison can be used instead of string
comparison, but relies on a proper encoding tag for all strings (#2514).
group_vars() generic that returns the grouping as a character vector, to
avoid the potentially lossy conversion to language symbols. The list returned
group_by_prepare() now has a new
group_names component (#1950, #2384).
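A small sketch of group_vars(), using the built-in mtcars data: the grouping comes back as a plain character vector rather than as language symbols.

```r
library(dplyr)

by_cyl_am <- group_by(mtcars, cyl, am)

# Character vector with the names of the grouping variables
grouping <- group_vars(by_cyl_am)
grouping
```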
transmute() now have scoped variants (verbs suffixed with
these variants apply an operation to a selection of variables.
The scoped verbs taking predicates (
etc) now support S3 objects and lazy tables. S3 objects should
implement methods for
tbl_vars(). For lazy
tables, the first 100 rows are collected and the predicate is
applied on this subset of the data. This is robust for the common
case of checking the type of a column (#2129).
Summarise and mutate colwise functions pass
... on to the manipulation
The performance of colwise verbs like
mutate_all() is now back to
where it was in
funs() has better handling of namespaced functions (#2089).
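The scoped variants and the forwarding of ... can be sketched as follows. The data frame df is made up for illustration; summarise_all() and mutate_if() are used here as two examples of the scoped verbs.

```r
library(dplyr)

# Hypothetical toy data: one grouping column and two numeric columns
df <- data.frame(
  g = c("a", "a", "b"),
  x = c(1, NA, 3),
  y = c(2, 4, 6),
  stringsAsFactors = FALSE
)

# `na.rm = TRUE` is passed through `...` to mean() for every column
means <- df %>%
  group_by(g) %>%
  summarise_all(mean, na.rm = TRUE)

# The predicate is.numeric selects which columns are transformed
doubled <- mutate_if(df, is.numeric, function(col) col * 2)
```

Grouping columns are left out of the summarised set, so only x and y are averaged.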
dplyr has a new approach to non-standard evaluation (NSE) called tidyeval.
It is described in detail in
vignette("programming") but, in brief, gives you
the ability to interpolate values in contexts where dplyr usually works with expressions:
my_var <- quo(homeworld)
starwars %>%
  group_by(!!my_var) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)
This means that the underscored version of each main verb is no longer needed,
and so these functions have been deprecated (but remain around for backward compatibility).
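The same interpolation can be used inside a function, which is the main case the underscored verbs used to cover. The sketch below is illustrative only: mean_by() is a hypothetical helper, and it uses enquo() to capture its arguments as expressions before unquoting them with !!.

```r
library(dplyr)

# Hypothetical helper: group by one column and average another
mean_by <- function(data, group_var, value_var) {
  group_var <- enquo(group_var)
  value_var <- enquo(value_var)

  data %>%
    group_by(!!group_var) %>%
    summarise(mean_value = mean(!!value_var, na.rm = TRUE))
}

res <- mean_by(mtcars, cyl, mpg)
```

The caller writes bare column names, exactly as with the verbs themselves.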
tidyeval to capture their arguments by expression. This makes it
possible to use unquoting idioms (see
fixes scoping issues (#2297).
Most verbs taking dots now ignore the last argument if empty. This
makes it easier to copy lines of code without having to worry about
deleting trailing commas (#1039).
[API] The new
.env environments can be used inside
all verbs that operate on data:
.data$column_name accesses the column
.env$var accesses the external variable
Columns or external variables named
.env are shadowed, use
.env$... to access them. (
matching also for the
global() functions have been removed. They were never
documented officially. Use the new
Expressions in verbs are now interpreted correctly in many cases that
failed before (e.g., use of
case_when(), nonstandard evaluation, ...).
These expressions are now evaluated in a specially constructed temporary
environment that retrieves column data on demand with the help of the
bindrcpp package (#2190). This temporary environment poses restrictions on
<- inside verbs. To prevent leaking of broken bindings,
the temporary environment is cleared after the evaluation (#2435).
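The .data$ and .env$ accessors listed earlier are easiest to see when a column and an external variable share a name. This is a minimal sketch using mtcars; the external variable cyl is introduced only for the illustration.

```r
library(dplyr)

# External variable that deliberately shares its name with a column
cyl <- 4

# .data$cyl is the column, .env$cyl is the variable defined above
four_cyl <- mtcars %>%
  filter(.data$cyl == .env$cyl)
```

Without the accessors, both sides of the comparison would resolve to the column and every row would be kept.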
xxx_join.tbl_df(na_matches = "never") treats all
different from each other (and from any other value), so that they never
match. This corresponds to the behavior of joins for database sources,
and of database joins in general. To match
na_matches = "na" to the join verbs; this is only supported for data frames.
The default is
na_matches = "na", kept for the sake of compatibility
to v0.5.0. It can be tweaked by calling
common_by() gets a better error message for unexpected inputs (#2091).
Anti- and semi-joins warn if factor levels are inconsistent (#2741).
Warnings about join column inconsistencies now contain the column names
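The effect of na_matches described above can be sketched with two small data frames whose key column contains a missing value. The data are made up, and the argument is passed explicitly in both calls so the sketch does not depend on the default.

```r
library(dplyr)

# Hypothetical toy tables; both keys contain a missing value
x <- data.frame(key = c(1, NA), x_val = c("a", "b"), stringsAsFactors = FALSE)
y <- data.frame(key = c(1, NA), y_val = c("c", "d"), stringsAsFactors = FALSE)

# Missing keys match each other
na_join <- inner_join(x, y, by = "key", na_matches = "na")

# Missing keys never match, as in database joins
never_join <- inner_join(x, y, by = "key", na_matches = "never")
```

The first join keeps the row with the missing key, the second drops it.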
For selecting variables, the first selector decides if it's an inclusive
selection (i.e., the initial column list is empty), or an exclusive selection
(i.e., the initial column list contains all columns). This means that
select(mtcars, contains("am"), contains("FOO"), contains("vs")) now returns
vs columns like in dplyr 0.4.3 (#2275, #2289, @r2evans).
Select helpers now throw an error if called when no variables have been
Helper functions in
select() (and related verbs) are now evaluated
in a context where column names do not exist (#2184).
select() (and the internal function
select_vars()) now support
column names in addition to column positions. As a result,
select(mtcars, "cyl") are now allowed.
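A short sketch of the selection rules above, using mtcars: strings as column names, an inclusive selection that starts from a helper, and an exclusive selection that starts from a negated column.

```r
library(dplyr)

# Column names given as strings
by_name <- select(mtcars, "cyl", "mpg")

# First selector is inclusive: start from nothing, add matches;
# the helper with no matches contributes no columns
inclusive <- select(mtcars, contains("am"), contains("FOO"), contains("vs"))

# First selector is exclusive: start from all columns, drop cyl
exclusive <- select(mtcars, -cyl)
```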
coalesce() now support splicing of
arguments with rlang's
count() now preserves the grouping of its input (#2021).
distinct() no longer duplicates variables (#2001).
distinct() with a grouped data frame works the same way as
distinct() on an ungrouped data frame, namely it uses all
copy_to() now returns its output invisibly (since you're often just
calling for the side-effect).
lag() throw informative error if used with ts objects (#2219).
mutate() recycles list columns of length 1 (#2171).
summarise() no longer converts character
NA to empty strings (#1839).
Combining and comparing
bind_cols() give an error for database tables (#2373).
combine() are more strict when coercing.
Logical values are no longer coerced to integer and numeric. Date, POSIXct
and other integer or double-based classes are no longer coerced to integer or
double as there is a chance of attributes or information being lost
tibble::repair_names() to ensure that all
names are unique (#2248).
bind_cols() handles empty argument list (#2048).
bind_cols() now accept vectors. They are treated
as rows by the former and columns by the latter. Rows require inner
c(col1 = 1, col2 = 2), while columns require outer
col1 = c(1, 2). Lists are still treated as data frames but
can be spliced explicitly with
.Deprecated(), they will be removed
in the next CRAN release. Please use
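The inner versus outer naming rule for vectors described above can be sketched like this; the column names col1 and col2 follow the example in the text, and the values are arbitrary.

```r
library(dplyr)

# Rows: each vector is one row, so the names live inside the vector
rows <- bind_rows(
  c(col1 = 1, col2 = 2),
  c(col1 = 3, col2 = 4)
)

# Columns: each vector is one column, so the names are argument names
cols <- bind_cols(col1 = c(1, 2), col2 = c(3, 4))
```

Both calls produce a two-row, two-column result; only the orientation of the input differs.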
%in% gets new hybrid handler (#126).
between() returns NA if
nth() have better default values for factor,
Dates, POSIXct, and data frame inputs (#2029).
Fixed segmentation faults in hybrid evaluation of
lag(). These functions now always fall back to the R
implementation if called with arguments that the hybrid evaluator cannot
handle (#948, #1980).
n_distinct() gets larger hash tables, giving slightly better performance (#977).
ntile() are more careful about proper data types of their return values (#2306).
NA when computing group membership (#2564).
recode() can now recode a factor to other types (#2268).
Other minor changes and bug fixes
Many error messages are more helpful by referring to a column name or a
position in the argument list (#2448).
tbl_vars() now has a
group_vars argument set to
FALSE, group variables are not returned.
Fixed segmentation fault after calling
rename() on an invalid grouped
data frame (#2031).
strict argument to control if an
error is thrown when you try to rename a variable that doesn't
Fixed undefined behavior for
slice() on a zero-column data frame (#2490).
Fixed very rare case of false match during join (#2515).
Restricted workaround for
match() to R 3.3.0 (#1858).
dplyr now warns on load when the version of R or Rcpp during installation is
different to the currently installed version (#2514).
Fixed improper reuse of attributes when creating a list column in
summarise() always strip the
names attribute from new
or updated columns, even for ungrouped operations (#1689).
Fixed rare error that could lead to a segmentation fault in
all_equal(ignore_col_order = FALSE) (#2502).
All operations that return tibbles now include the
This is important for correct printing with tibble 1.3.1 (#2789).
Makeflags uses PKG_CPPFLAGS for defining preprocessor macros.
Update RStudio project settings to install tests (#1952).
Rcpp::interfaces() to register C callable interfaces, and registering all native exported functions via
useDynLib(.registration = TRUE) (#2146).
Formatting of grouped data frames now works by overriding the
tbl_sum() generic instead of
print(). This means that the output is more consistent with tibble, and that
format() is now supported also for SQL sources (#2781).