arrange() once again ignores grouping (#1206).
distinct() now only keeps the distinct variables. If you want to return
all variables (using the first row for non-distinct values) use
.keep_all = TRUE (#1110). For SQL sources,
.keep_all = FALSE is
GROUP BY, and
.keep_all = TRUE raises an error
(#1937, #1942, @krlmlr). (The default behaviour of using all variables
when none are specified remains - this note only applies if you select
- The select helper functions
ends_with() etc. are now
real exported functions. This means that you'll need to import those
functions if you're using them from a package where dplyr is not attached.
dplyr::select(mtcars, starts_with("m")) used to work, but
now you'll need
Deprecated and defunct functions
- The long deprecated
%.% have been removed.
id() has been deprecated. Please use
rbind_list() are formally deprecated. Please use
- Outdated benchmarking demos have been removed (#1487).
- Code related to starting and signalling clusters has been moved out to
coalesce() finds the first non-missing value from a set of vectors.
(#1666, thanks to @krlmlr for initial implementation).
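As a minimal sketch of the behaviour described for coalesce(), assuming dplyr is installed (the vectors `x` and `y` are made up for illustration), each position takes the first non-missing value across the inputs:

```r
library(dplyr)

# Two illustrative vectors with gaps in different places.
x <- c(NA, 2, NA, 4)
y <- c(1, NA, NA, 40)

# For each position, take the first non-missing value from x, then y,
# then fall back to the constant 0.
filled <- coalesce(x, y, 0)
filled
```

The trailing `0` acts as a default for positions that are missing in every vector.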
case_when() is a general vectorised if + else if (#631).
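A short illustration of the "if + else if" reading of case_when(), assuming dplyr is installed. Conditions are checked in order and the first match wins, so the most specific condition goes first. The `labels` name is only illustrative.

```r
library(dplyr)

x <- 1:15

# Each formula is `condition ~ value`; the first matching condition is used.
labels <- case_when(
  x %% 15 == 0 ~ "fizz buzz",
  x %% 3 == 0  ~ "fizz",
  x %% 5 == 0  ~ "buzz",
  TRUE         ~ as.character(x)  # fallback for everything else
)
labels
```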
if_else() is a vectorised if statement: it's a stricter (type-safe),
faster, and more predictable version of
ifelse(). In SQL it is
translated to a
na_if() makes it easy to replace a certain value with an
In SQL it is translated to
near(x, y) is a helper for
abs(x - y) < tol (#1607).
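The usual motivation for near() is floating point round-off. A small sketch, assuming dplyr is installed (the variable names are illustrative):

```r
library(dplyr)

# sqrt(2)^2 is not exactly 2 in floating point arithmetic,
# so an exact comparison fails while a tolerance-based one succeeds.
exact <- sqrt(2)^2 == 2
close <- near(sqrt(2)^2, 2)

exact
close
```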
recode() is a vectorised equivalent to
union_all() method. Maps to
UNION ALL for SQL sources,
for data frames/tbl_dfs, and
combine()for vectors (#1045).
- A new family of functions replace
mutate_each() (which will thus be deprecated in a future release).
mutate_all() apply a function to all columns
mutate_at() operate on a subset of
columns. These columns are selected with either a character vector
of column names, a numeric vector of column positions, or a column
select() semantics generated by the new
columns() helper. In addition,
take a predicate function or a logical vector (these verbs currently
require local sources). All these functions can now take ordinary
functions instead of a list of functions generated by
(though this is only useful for local sources). (#1845, @lionel-)
select_if() lets you select columns with a predicate function.
Only compatible with local sources. (#497, #1569, @lionel-)
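A minimal sketch of select_if() on a local data frame, assuming dplyr is installed. The data frame and its column names are invented for the example. The predicate is called on each column and only columns for which it returns TRUE are kept.

```r
library(dplyr)

df <- data.frame(
  id = 1:3,
  name = c("a", "b", "c"),
  score = c(0.5, 1.5, 2.5),
  stringsAsFactors = FALSE
)

# Keep only the columns for which is.numeric() returns TRUE.
numeric_cols <- select_if(df, is.numeric)
names(numeric_cols)
```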
All data table related code has been separated out into a new dtplyr package. This decouples the development of the data.table interface from the development of the dplyr package. If both data.table and dplyr are loaded, you'll get a message reminding you to load dtplyr.
Functions related to the creation and coercion of
tbl_dfs now live in their own package: tibble. See
vignette("tibble") for more details.
[[ methods that never do partial matching (#1504), and throw
an error if the variable does not exist.
all_equal() allows you to compare data frames while ignoring row and column order,
and optionally ignoring minor differences in type (e.g. int vs. double)
(#821). The test handles the case where the df has 0 columns (#1506).
The test fails when convert is
FALSE and types don't match (#1484).
all_equal() shows a better error message when comparing raw values
or when types are incompatible and
convert = TRUE (#1820, @krlmlr).
add_row() makes it easy to add a new row to a data frame (#1021).
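A small sketch of add_row(), assuming the tibble package is installed. The data frame and its values are made up for illustration. New values are supplied as name = value pairs matching existing columns.

```r
library(tibble)

df <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)

# Append one row; the argument names must match existing column names.
df2 <- add_row(df, x = 3L, y = "c")
df2
```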
as_data_frame() is now an S3 generic with methods for lists (the old
as_data_frame()), data frames (trivial), and matrices (with efficient
C++ implementation) (#876). It no longer strips subclasses.
- The internals of
as_data_frame() have been aligned,
as_data_frame() will now automatically recycle length-1 vectors.
Both functions give more informative error messages if you attempt to
create an invalid data frame. You can no longer create a data frame with
duplicated names (#820). Both check for
POSIXlt columns, and tell you to
frame_data() properly constructs rectangular tables (#1377, @kevinushey),
and supports list-cols.
glimpse() is now a generic. The default method dispatches to
(#1325). It now (invisibly) returns its first argument (#1570).
lst_() which create lists in the same way that
data_frame_() create data frames (#1290).
print.tbl_df() is considerably faster if you have very wide data frames.
It will now also only list the first 100 additional variables not already
on screen - control this with the new
(#1161). When printing a grouped data frame the number of groups is now
printed with thousands separators (#1398). The type of list columns
is correctly printed (#1379).
- Package includes
setOldClass(c("tbl_df", "tbl", "data.frame")) to help
with S4 dispatch (#969).
tbl_df automatically generates column names (#1606).
tbl_cubes are now constructed correctly from data frames, duplicate
dimension values are detected, missing dimension values are filled
NA. The construction from data frames now guesses the measure
variables by default, and allows specification of dimension and/or
measure variables (#1568, @krlmlr).
- Swap order of
matrix) for consistency with
as.tbl_cube.data.frame. Also, the
as.tbl_cube.table now defaults to
"Freq" for consistency with
as_data_frame() on SQL sources now returns all rows (#1752, #1821,
compute() gets new parameters
it easier to add indexes (#1499, @krlmlr).
db_explain() gains a default method for DBIConnections (#1177).
- The backend testing system has been improved. This led to the removal of
temp_srcs(). In the unlikely event that you were using this function,
you can instead use
- You can now use
full_join() with remote tables (#1172).
src_memdb() is a session-local in-memory SQLite database.
data_frame(), but creates a new table in
src_sqlite() now uses a stricter quoting character, `, instead of
". SQLite "helpfully" will convert `"x"` into a string if there is
no identifier called x in the current scope (#1426).
src_sqlite() throws errors if you try and use it with window functions
filter.tbl_sql() now puts parens around each argument (#934).
-is better translated (#1002).
escape.POSIXt() method makes it easier to use date times. The date is
rendered in ISO 8601 format in UTC, which should work in most databases
is.na() gets a missing space (#1695).
is.null() get extra parens to make precedence
more clear (#1695).
pmax() are translated to
- Window functions:
This version includes an almost total rewrite of how dplyr verbs are translated into SQL. Previously, I used a rather ad-hoc approach, which tried to guess when a new subquery was needed. Unfortunately this approach was fraught with bugs, so in this version I've implemented a much richer internal data model. Now there is a three-step process:
- When applied to a
tbl_lazy, each dplyr verb captures its inputs
and stores them in an
op (short for operation) object.
sql_build() iterates through the operations to build up an
object that represents a SQL query. These objects are convenient for
testing as they are lists, and are backend agnostic.
sql_render() iterates through the queries and generates the SQL,
using generics (like
sql_select()) that can vary based on the
In the short-term, this increased abstraction is likely to lead to some minor performance decreases, but the chance of dplyr generating correct SQL is much much higher. In the long-term, these abstractions will make it possible to write a query optimiser/compiler in dplyr, which would make it possible to generate much more succinct queries.
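The capture, build, render pipeline described above can be pictured with a toy base R sketch. None of the helpers below (op_base(), op_filter(), op_select(), toy_build(), toy_render()) exist in dplyr; they are hypothetical stand-ins meant only to show the shape of the idea, and they ignore the hard part (deciding when a subquery is needed).

```r
# Step 1 (hypothetical): each "verb" just records its inputs in a nested list.
op_base   <- function(table) list(type = "base", table = table)
op_filter <- function(prev, cond) list(type = "filter", prev = prev, cond = cond)
op_select <- function(prev, vars) list(type = "select", prev = prev, vars = vars)

# Step 2 (hypothetical): walk the recorded operations and build a plain list
# describing the query. Nothing here is specific to a particular database.
toy_build <- function(op) {
  if (op$type == "base") {
    return(list(from = op$table, select = "*", where = character()))
  }
  q <- toy_build(op$prev)
  if (op$type == "filter") q$where <- c(q$where, op$cond)
  if (op$type == "select") q$select <- op$vars
  q
}

# Step 3 (hypothetical): turn the query description into a SQL string.
toy_render <- function(q) {
  sql <- paste0("SELECT ", paste(q$select, collapse = ", "), " FROM ", q$from)
  if (length(q$where) > 0) {
    sql <- paste0(sql, " WHERE ", paste0("(", q$where, ")", collapse = " AND "))
  }
  sql
}

ops   <- op_select(op_filter(op_base("flights"), "month = 1"), c("year", "month"))
query <- toy_build(ops)
sql   <- toy_render(query)
sql
```

Because the intermediate `query` is an ordinary list, it can be inspected and tested without a database connection, which is the testing benefit the notes mention.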
If you have written a dplyr backend, you'll need to make some minor changes to your package:
sql_join() has been considerably simplified - it is now only responsible
for generating the join query, not for generating the intermediate selects
that rename the variable. Similarly for
sql_semi_join(). If you've
provided new methods in your backend, you'll need to rewrite them.
select_query() gains a distinct argument which is used for generating
distinct(). It loses the
offset argument which was
never used (and hence never tested).
src_translate_env() has been replaced by
should have methods for the connection object.
There were two other tweaks to the exported API, but these are less likely to affect anyone.
partial_eval() got a new API: now use connection +
variable names, rather than a
tbl. This makes testing considerably easier.
translate_sql_q() has been renamed to
- Also note that the sql generation generics now have a default method, instead
methods for DBIConnection and NULL.
Minor improvements and bug fixes
Single table verbs
distinct() doesn't crash when given a 0-column data frame (#1437).
filter() throws an error if you supply any named arguments. This is usually a typo:
filter(df, x = 1) instead of
filter(df, x == 1) (#1529).
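A quick sketch of the named-argument check, assuming dplyr is installed (the data frame is illustrative). The error is caught with tryCatch() so the script keeps running; the exact error message is not shown because it may differ between versions.

```r
library(dplyr)

df <- data.frame(x = c(1, 2, 1))

# `x = 1` is a named argument, not a comparison, so filter() refuses it.
result <- tryCatch(
  filter(df, x = 1),
  error = function(e) "error"
)

# The intended comparison uses `==`.
kept <- filter(df, x == 1)

result
nrow(kept)
```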
select() now informs you that it adds missing grouping variables
(#1511). It works even if the grouping variable has a non-syntactic name
(#1138). Negating a failed match (e.g.
returns all columns, instead of no columns (#1176)
The naming behaviour of
tweaked so that you can force inclusion of both the function and the
summarise_each(mtcars, funs(mean = mean), everything())
mutate() handles factors that are all
NA (#1645), or have different
levels in different groups (#1414). It disambiguates
and silently promotes groups that only contain
NA (#1463). It deep copies
data in list columns (#1643), and correctly fails on incompatible columns
mutate() on grouped data no longer drops grouping attributes
rowwise() mutate gives expected results (#1381).
slice() correctly handles grouped attributes (#1405).
Dual table verbs
bind_cols() matches the behaviour of
inputs (#1148). It also handles
POSIXcts with integer base type (#1402).
bind_rows() handles 0-length named lists (#1515), promotes factors to
characters (#1538), and warns when binding factor and character (#1485).
bind_rows() is more flexible in the way it can accept data frames,
lists, list of data frames, and list of lists (#1389).
POSIXlt columns (#1875, @krlmlr).
bind_rows() infer classes and grouping information
from the first data frame (#1692).
grouped_df() methods that make it harder to
create corrupt data frames (#1385). You should still prefer
- Joins now use correct class when joining on
(#1582, @joel23888), and consider time zones (#819). Joins handle a
that is empty (#1496), or has duplicates (#1192). Suffixes grow progressively
to avoid creating repeated column names (#1460). Joins on string columns
should be substantially faster (#1386). Extra attributes are ok if they are
identical (#1636). Joins work correctly when factor levels are not equal
(#1712, #1559), and anti and semi joins give correct result when by variable is a
suffix argument which allows you to control what suffix duplicated variable
names receive (#1296).
- Set operations (
union() etc.) respect coercion rules
setdiff() handles factors with
- There were a number of fixes to enable joining of data frames that don't
have the same encoding of column names (#1513), including working around
bug 16885 regarding
match() in R 3.3.0 (#1806, #1810,
cummean() is more stable against floating point errors (#1387).
lag() received a considerable overhaul. They are more
careful about more complicated expressions (#1588), and fall back more
readily to pure R evaluation (#1411). They behave correctly in
(#1434), and handle default values for string columns.
max() handle empty sets (#1481).
n_distinct() uses multiple arguments for data frames (#1084), falls back to R
evaluation when needed (#1657), reverting decision made in (#567).
Passing no arguments gives an error (#1957, #1959, @krlmlr).
nth() now supports negative indices to select from end, e.g.
selects the 2nd value from the end of
top_n() can now also select bottom
n values by passing a negative value
- Hybrid evaluation leaves formulas untouched (#1447).