# dplyr (development version)
* R >=3.6.0 is now explicitly required (#7026).
# dplyr 1.1.4
* `join_by()` now allows its helper functions to be namespaced with `dplyr::`,
like `join_by(dplyr::between(x, lower, upper))` (#6838).
* `left_join()` and friends now return a specialized error message if they
detect that your join would return more rows than dplyr can handle (#6912).
* `slice_*()` now throw the correct error if you forget to name `n` while also
prefixing the call with `dplyr::` (#6946).
* `dplyr_reconstruct()`'s default method has been rewritten to avoid
materializing duckplyr queries too early (#6947).
* Updated the `storms` data to include 2022 data (#6937, @steveharoz).
* Updated the `starwars` data to use a new API, because the old one is defunct.
There are very minor changes to the data itself (#6938, @steveharoz).
# dplyr 1.1.3
* `mutate_each()` and `summarise_each()` now throw correct deprecation messages
(#6869).
* `setequal()` now requires the input data frames to be compatible, similar to
the other set methods like `setdiff()` or `intersect()` (#6786).
# dplyr 1.1.2
* `count()` better documents that it has a `.drop` argument (#6820).
* Fixed tests to maintain compatibility with the next version of waldo (#6823).
* Joins better handle key columns with all `NA`s (#6804).
# dplyr 1.1.1
* Mutating joins now warn about multiple matches much less often. At a high
level, a warning was previously being thrown when a one-to-many or
many-to-many relationship was detected between the keys of `x` and `y`, but is
now only thrown for a many-to-many relationship, which is much rarer and much
more dangerous than one-to-many because it can result in a Cartesian explosion
in the number of rows returned from the join (#6731, #6717).

  We've accomplished this in two steps:

  * `multiple` now defaults to `"all"`, and the options of `"error"` and
    `"warning"` are now deprecated in favor of using `relationship` (see below).
    We are using an accelerated deprecation process for these two options
    because they've only been available for a few weeks, and `relationship` is
    a clearly superior alternative.

  * The mutating joins gain a new `relationship` argument, allowing you to
    optionally enforce one of the following relationship constraints between the
    keys of `x` and `y`: `"one-to-one"`, `"one-to-many"`, `"many-to-one"`, or
    `"many-to-many"`.

    For example, `"many-to-one"` enforces that each row in `x` can match at
    most 1 row in `y`. If a row in `x` matches >1 rows in `y`, an error is
    thrown. This option serves as the replacement for `multiple = "error"`.

    The default behavior of `relationship` doesn't assume that there is any
    relationship between `x` and `y`. However, for equality joins it will check
    for the presence of a many-to-many relationship, and will warn if it detects
    one.

  This change unfortunately does mean that if you have set `multiple = "all"` to
  avoid a warning and you happened to be doing a many-to-many style join, then
  you will need to replace `multiple = "all"` with
  `relationship = "many-to-many"` to silence the new warning, but we believe
  this should be rare since many-to-many relationships are fairly uncommon.
* Fixed a major performance regression in `case_when()`. It is still a little
slower than in dplyr 1.0.10, but we plan to improve this further in the future
(#6674).
* Fixed a performance regression related to `nth()`, `first()`, and `last()`
(#6682).
* Fixed an issue where expressions involving infix operators had an abnormally
large amount of overhead (#6681).
* `group_data()` on ungrouped data frames is faster (#6736).
* `n()` is a little faster when there are many groups (#6727).
* `pick()` now returns a 1 row, 0 column tibble when `...` evaluates to an
empty selection. This makes it more compatible with [tidyverse recycling
rules](https://vctrs.r-lib.org/reference/theory-faq-recycling.html) in some
edge cases (#6685).
* `if_else()` and `case_when()` again accept logical conditions that have
attributes (#6678).
* `arrange()` can once again sort the `numeric_version` type from base R
(#6680).
* `slice_sample()` now works when the input has a column named `replace`.
`slice_min()` and `slice_max()` now work when the input has columns named
`na_rm` or `with_ties` (#6725).
* `nth()` now errors informatively if `n` is `NA` (#6682).
* Joins now throw a more informative error when `y` doesn't have the same
source as `x` (#6798).
* All major dplyr verbs now throw an informative error message if the input
data frame contains a column named `NA` or `""` (#6758).
* Deprecation warnings thrown by `filter()` now mention the correct package
where the problem originated from (#6679).
* Fixed an issue where using `<-` within a grouped `mutate()` or `summarise()`
could cross contaminate other groups (#6666).
* The compatibility vignette has been replaced with a more general vignette on
using dplyr in packages, `vignette("in-packages")` (#6702).
* The developer documentation in `?dplyr_extending` has been refreshed and
brought up to date with all changes made in 1.1.0 (#6695).
* `rename_with()` now includes an example of using `paste0(recycle0 = TRUE)` to
correctly handle empty selections (#6688).
* R >=3.5.0 is now explicitly required. This is in line with the tidyverse
policy of supporting the [5 most recent versions of
R](https://www.tidyverse.org/blog/2019/04/r-version-support/).
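As a rough illustration of the `relationship` argument described in the first bullet of this release, the sketch below asserts a many-to-one relationship in a `left_join()`. The `orders` and `customers` data frames and their columns are invented for the example.

```r
library(dplyr)

# Hypothetical example data: several orders can belong to one customer.
orders <- tibble(order_id = 1:4, customer_id = c(1, 1, 2, 3))
customers <- tibble(customer_id = c(1, 2, 3), name = c("Ana", "Bo", "Cy"))

# Each order should match at most one customer, so assert many-to-one.
joined <- left_join(
  orders, customers,
  by = "customer_id",
  relationship = "many-to-one"
)

# With a duplicated key in `y`, the same assertion is expected to fail.
customers_dup <- bind_rows(customers, tibble(customer_id = 1, name = "Ann"))
result <- try(
  left_join(
    orders, customers_dup,
    by = "customer_id",
    relationship = "many-to-one"
  ),
  silent = TRUE
)
failed <- inherits(result, "try-error")
```

Stating the expected relationship up front turns a silent row explosion into an immediate error, which is the use case that `multiple = "error"` previously covered.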
# dplyr 1.1.0
## New features
* [`.by`/`by`](https://dplyr.tidyverse.org/reference/dplyr_by.html) is an
experimental alternative to `group_by()` that supports per-operation grouping
for `mutate()`, `summarise()`, `filter()`, and the `slice()` family (#6528).
Rather than:
```
starwars %>%
  group_by(species, homeworld) %>%
  summarise(mean_height = mean(height))
```
You can now write:
```
starwars %>%
  summarise(
    mean_height = mean(height),
    .by = c(species, homeworld)
  )
```
The main benefit is that `.by` only affects a single operation. In the
example above, an ungrouped data frame went into the
`summarise()` call, so an ungrouped data frame will come out; with `.by`, you
never need to remember to `ungroup()` afterwards and you never need to use
the `.groups` argument.
Additionally, using `summarise()` with `.by` will never sort the results by
the group key, unlike with `group_by()`. Instead, the results are returned
using the existing ordering of the groups from the original data. We feel this
is more predictable, better maintains any ordering you might have already
applied with a previous call to `arrange()`, and provides a way to maintain
the current ordering without having to resort to factors.
This feature was inspired by
[data.table](https://CRAN.R-project.org/package=data.table), where the
equivalent syntax looks like:
```
starwars[, .(mean_height = mean(height)), by = .(species, homeworld)]
```
`with_groups()` is superseded in favor of `.by` (#6582).
* `reframe()` is a new experimental verb that creates a new data frame by
applying functions to columns of an existing data frame. It is very similar to
`summarise()`, with two big differences:
  * `reframe()` can return an arbitrary number of rows per group, while
    `summarise()` reduces each group down to a single row.
  * `reframe()` always returns an ungrouped data frame, while `summarise()`
    might return a grouped or rowwise data frame, depending on the scenario.

  `reframe()` has been added in response to valid concerns from the community
  that allowing `summarise()` to return any number of rows per group increases
  the chance for accidental bugs. We still feel that this is a powerful
  technique, and is a principled replacement for `do()`, so we have moved these
  features to `reframe()` (#6382).
* `group_by()` now uses a new algorithm for computing groups. It is often faster
than the previous approach (especially when there are many groups), and in
most cases there should be no changes. The one exception is with character
vectors, see the C locale news bullet below for more details (#4406, #6297).
* `arrange()` now uses a faster algorithm for sorting character vectors, which
is heavily inspired by data.table's `forder()`. See the C locale news bullet
below for more details (#4962).
* Joins have been completely overhauled to enable more flexible join operations
and provide more tools for quality control. Many of these changes are inspired
by data.table's join syntax (#5914, #5661, #5413, #2240).
  * A _join specification_ can now be created through `join_by()`. This allows
    you to specify both the left and right hand side of a join using unquoted
    column names, such as `join_by(sale_date == commercial_date)`. Join
    specifications can be supplied to any `*_join()` function as the `by`
    argument.
  * Join specifications allow for new types of joins:
    * Equality joins: The most common join, specified by `==`. For example,
      `join_by(sale_date == commercial_date)`.
    * Inequality joins: For joining on inequalities, i.e. `>=`, `>`, `<`, and
      `<=`. For example, use `join_by(sale_date >= commercial_date)` to find
      every commercial that aired before a particular sale.
    * Rolling joins: For "rolling" the closest match forward or backwards when
      there isn't an exact match, specified by using the rolling helper,
      `closest()`. For example,
      `join_by(closest(sale_date >= commercial_date))` to find only the most
      recent commercial that aired before a particular sale.
    * Overlap joins: For detecting overlaps between sets of columns, specified
      by using one of the overlap helpers: `between()`, `within()`, or
      `overlaps()`. For example, use
      `join_by(between(commercial_date, sale_date_lower, sale_date))` to
      find commercials that aired before a particular sale, as long as they
      occurred after some lower bound, such as 40 days before the sale was made.

    Note that you cannot use arbitrary expressions in the join conditions, like
    `join_by(sale_date - 40 >= commercial_date)`. Instead, use `mutate()` to
    create a new column containing the result of `sale_date - 40` and refer
    to that by name in `join_by()`.
  * `multiple` is a new argument for controlling what happens when a row
    in `x` matches multiple rows in `y`. For equality joins and rolling joins,
    where this is usually surprising, this defaults to signalling a `"warning"`,
    but still returns all of the matches. For inequality joins, where multiple
    matches are usually expected, this defaults to returning `"all"` of the
    matches. You can also return only the `"first"` or `"last"` match, `"any"`
    of the matches, or you can `"error"`.
  * `keep` now defaults to `NULL` rather than `FALSE`. `NULL` implies
    `keep = FALSE` for equality conditions, but `keep = TRUE` for inequality
    conditions, since you generally want to preserve both sides of an
    inequality join.
  * `unmatched` is a new argument for controlling what happens when a row
    would be dropped because it doesn't have a match. For backwards
    compatibility, the default is `"drop"`, but you can also choose to
    `"error"` if dropped rows would be surprising.
* `across()` gains an experimental `.unpack` argument to optionally unpack
(as in, `tidyr::unpack()`) data frames returned by functions in `.fns`
(#6360).
* `consecutive_id()` for creating groups based on contiguous runs of the
same values, like `data.table::rleid()` (#1534).
* `case_match()` is a "vectorised switch" variant of `case_when()` that matches
on values rather than logical expressions. It is like a SQL "simple"
`CASE WHEN` statement, whereas `case_when()` is like a SQL "searched"
`CASE WHEN` statement (#6328).
* `cross_join()` is a more explicit and slightly more correct replacement for
using `by = character()` during a join (#6604).
* `pick()` makes it easy to access a subset of columns from the current group.
`pick()` is intended as a replacement for `across(.fns = NULL)`, `cur_data()`,
and `cur_data_all()`. We feel that `pick()` is a much more evocative name when
you are just trying to select a subset of columns from your data (#6204).
* `symdiff()` computes the symmetric difference (#4811).
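To make a few of the features above more concrete, here is a minimal sketch of an inequality join, a rolling join, `reframe()`, and `consecutive_id()`. The `sales`, `ads`, and `df` data frames are invented for illustration and are not part of dplyr.

```r
library(dplyr)

# Hypothetical data, invented for this example.
sales <- tibble(
  sale_id = 1:2,
  sale_date = as.Date(c("2023-01-10", "2023-02-01"))
)
ads <- tibble(
  ad_id = 1:3,
  commercial_date = as.Date(c("2023-01-01", "2023-01-08", "2023-01-20"))
)

# Inequality join: every commercial that aired on or before each sale.
all_prior <- left_join(sales, ads, join_by(sale_date >= commercial_date))

# Rolling join: only the most recent commercial on or before each sale.
latest_prior <- left_join(
  sales, ads,
  join_by(closest(sale_date >= commercial_date))
)

# reframe(): two rows per group, and the result is always ungrouped.
df <- tibble(g = c("a", "a", "b"), x = c(1, 3, 10))
ranges <- reframe(df, bound = c("min", "max"), value = range(x), .by = g)

# consecutive_id(): a new id each time the value changes.
run_ids <- consecutive_id(c("a", "a", "b", "a"))
```

Because `keep` defaults to `NULL`, both `sale_date` and `commercial_date` are retained in the two join results above, since the join condition is an inequality.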
## Lifecycle changes
### Breaking changes
* `arrange()` and `group_by()` now use the C locale, not the system locale,
when ordering or grouping character vectors. This brings _substantial_
performance improvements, increases reproducibility across R sessions, makes
dplyr more consistent with data.table, and we believe it should affect little
existing code. If it does affect your code, you can use
`options(dplyr.legacy_locale = TRUE)` to quickly revert to the previous
behavior. However, in general, we instead recommend that you use the new
`.locale` argument to precisely specify the desired locale. For a full
explanation please read the associated
[grouping](https://github.com/tidyverse/tidyups/blob/main/006-dplyr-group-by-ordering.md)
and [ordering](https://github.com/tidyverse/tidyups/blob/main/003-dplyr-radix-ordering.md)
tidyups.
* `bench_tbls()`, `compare_tbls()`, `compare_tbls2()`, `eval_tbls()`,
`eval_tbls2()`, `location()` and `changes()`, deprecated in 1.0.0, are now
defunct (#6387).
* `frame_data()`, `data_frame_()`, `lst_()` and `tbl_sum()` are no longer
re-exported from tibble (#6276, #6277, #6278, #6284).
* `select_vars()`, `rename_vars()`, `select_var()` and `current_vars()`,
deprecated in 0.8.4, are now defunct (#6387).
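As a small sketch of the C locale change described in the first bullet above, the following compares the new default ordering with an explicitly requested locale. The data is invented, and the `"en"` call assumes the stringi package is installed, so it is guarded.

```r
library(dplyr)

df <- tibble(x = c("b", "A", "a", "B"))

# Default since 1.1.0: C locale, so uppercase letters sort before lowercase.
c_order <- arrange(df, x)$x

# The same ordering, requested explicitly.
c_explicit <- arrange(df, x, .locale = "C")$x

# A locale-aware ordering; non-C locales rely on stringi.
if (requireNamespace("stringi", quietly = TRUE)) {
  en_order <- arrange(df, x, .locale = "en")$x
}
```

Passing `.locale` per call keeps the behavior local to that call, whereas `options(dplyr.legacy_locale = TRUE)` changes it for the whole session.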
### Newly deprecated
* `across()`, `c_across()`, `if_any()`, and `if_all()` now require the
`.cols` and `.fns` arguments. In general, we now recommend that you use
`pick()` instead of an empty `across()` call or `across()` with no `.fns`
(e.g. `across(c(x, y))`) (#6523).
  * Relying on the previous default of `.cols = everything()` is deprecated.
    We have skipped the soft-deprecation stage in this case, because indirect
    usage of `across()` and friends in this way is rare.
  * Relying on the previous default of `.fns = NULL` is not yet formally
    soft-deprecated, because there was no good alternative until now, but it is
    discouraged and will be soft-deprecated in the next minor release.
* Passing `...` to `across()` is soft-deprecated because it's ambiguous when
those arguments are evaluated. Now, instead of (e.g.)
`across(a:b, mean, na.rm = TRUE)` you should write
`across(a:b, ~ mean(.x, na.rm = TRUE))` (#6073).
* `all_equal()` is deprecated. We've advised against it for some time, and
we explicitly recommend you use `all.equal()`, manually reordering the rows
and columns as needed (#6324).
* `cur_data()` and `cur_data_all()` are soft-deprecated in favour of
`pick()` (#6204).
* Using `by = character()` to perform a cross join is now soft-deprecated in
favor of `cross_join()` (#6604).
* `filter()`ing with a 1-column matrix is deprecated (#6091).
* `progress_estimate()` is deprecated for all uses (#6387).
* Using `summarise()` to produce a 0 or >1 row "summary" is deprecated in favor
of the new `reframe()`. See the NEWS bullet about `reframe()` for more details
(#6382).
* All functions deprecated in 1.0.0 (released April 2020) and earlier now warn
every time you use them (#6387). This includes `combine()`, `src_local()`,
`src_mysql()`, `src_postgres()`, `src_sqlite()`, `rename_vars_()`,
`select_vars_()`, `summarise_each_()`, `mutate_each_()`, `as.tbl()`,
`tbl_df()`, and a handful of older arguments. They are likely to be made
defunct in the next major version (but not before mid 2024).
* `slice()`ing with a 1-column matrix is deprecated.
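The replacements suggested in this section can be sketched as follows. The data frame is invented for the example, and the commented-out calls show the older spellings that the bullets above deprecate.

```r
library(dplyr)

df <- tibble(g = c(1, 1, 2), a = c(1, NA, 3), b = c(4, 5, NA))

# Previously: summarise(df, across(a:b, mean, na.rm = TRUE), .by = g)
means <- summarise(df, across(a:b, ~ mean(.x, na.rm = TRUE)), .by = g)

# Previously: across(c(a, b)) or cur_data() to grab columns as a data frame.
with_missing <- mutate(df, n_missing = rowSums(is.na(pick(a, b))))

# Previously: full_join(x, y, by = character())
grid <- cross_join(tibble(x = 1:2), tibble(y = c("a", "b")))
```

Writing the extra arguments inside the lambda makes it explicit that `na.rm = TRUE` is applied on each call to `mean()`.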
### Newly superseded
* `recode()` is superseded in favour of `case_match()` (#6433).
* `recode_factor()` is superseded. We don't have a direct replacement for it
yet, but we plan to add one to forcats. In the meantime you can often use
`case_match(.ptype = factor(levels = ))` instead (#6433).
* `transmute()` is superseded in favour of `mutate(.keep = "none")` (#6414).
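For example, a `recode()` call might be rewritten with `case_match()` roughly like this. The values are invented for illustration.

```r
library(dplyr)

x <- c("a", "b", "c")

# Previously: recode(x, a = "apple", b = "banana", .default = "other")
fruit <- case_match(
  x,
  "a" ~ "apple",
  "b" ~ "banana",
  .default = "other"
)
```

Unlike `recode()`, which uses `old = new` named arguments, `case_match()` uses the same `old ~ new` formula style as `case_when()`.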
### Newly stable
* The `.keep`, `.before`, and `.after` arguments to `mutate()` have moved
from experimental to stable.
* The `rows_*()` family of functions have moved from experimental to stable.
## vctrs
Many of dplyr's vector functions have been rewritten to make use of the vctrs
package, bringing greater consistency and improved performance.
* `between()` can now work with all vector types, not just numeric and
date-time. Additionally, `left` and `right` can now also be vectors (with the
same length as `x`), and `x`, `left`, and `right` are cast to the common type
before the comparison is made (#6183, #6260, #6478).
* `case_when()` (#5106):
  * Has a new `.default` argument that is intended to replace usage of
    `TRUE ~ default_value` as a more explicit and readable way to specify
    a default value. In the future, we will deprecate the unsafe recycling of
    the LHS inputs that allows `TRUE ~` to work, so we encourage you to switch
    to using `.default`.
  * No longer requires exact matching of the types of RHS values. For example,
    the following no longer requires you to use `NA_character_`.

    ```
    x <- c("little", "unknown", "small", "missing", "large")
    case_when(
      x %in% c("little", "small") ~ "one",
      x %in% c("big", "large") ~ "two",
      x %in% c("missing", "unknown") ~ NA
    )
    ```
  * Supports a larger variety of RHS value types. For example, you can use a
    data frame to create multiple columns at once.
  * Has new `.ptype` and `.size` arguments which allow you to enforce
    a particular output type and size.
  * Has a better error when types or lengths were incompatible (#6261, #6206).
* `coalesce()` (#6265):
  * Discards `NULL` inputs up front.
  * No longer iterates over the columns of data frame input. Instead, a row is
    now only coalesced if it is entirely missing, which is consistent with
    `vctrs::vec_detect_missing()` and greatly simplifies the implementation.
  * Has new `.ptype` and `.size` arguments which allow you to enforce
    a particular output type and size.
* `first()`, `last()`, and `nth()` (#6331):
  * When used on a data frame, these functions now return a single row rather
    than a single column. This is more consistent with the vctrs principle that
    a data frame is generally treated as a vector of rows.
  * The `default` is no longer "guessed", and will always automatically be set
    to a missing value appropriate for the type of `x`.
  * Error if `n` is not an integer. `nth(x, n = 2)` is fine, but
    `nth(x, n = 2.5)` is now an error.
  * No longer support indexing into scalar objects, like `<lm>` or scalar S4
    objects (#6670).

  Additionally, they have all gained an `na_rm` argument since they
  are summary functions (#6242, with contributions from @tnederlof).
* `if_else()` gains most of the same benefits as `case_when()`. In particular,
`if_else()` now takes the common type of `true`, `false`, and `missing` to
determine the output type, meaning that you can now reliably use `NA`,
rather than `NA_character_` and friends (#6243).
`if_else()` also no longer allows you to supply `NULL` for either `true` or
`false`, which was an undocumented usage that we consider to be off-label,
because `true` and `false` are intended to be (and documented to be) vector
inputs (#6730).
* `na_if()` (#6329) now casts `y` to the type of `x` before comparison, which
makes it clearer that this function is type and size stable on `x`. In
particular, this means that you can no longer do `na_if(<tibble>, 0)`, which
previously accidentally allowed you to replace any instance of `0` across
every column of the tibble with `NA`. `na_if()` was never intended to work
this way, and this is considered off-label usage.
You can also now replace `NaN` values in `x` with `na_if(x, NaN)`.
* `lag()` and `lead()` now cast `default` to the type of `x`, rather than taking
the common type. This ensures that these functions are type stable on `x`
(#6330).
* `row_number()`, `min_rank()`, `dense_rank()`, `ntile()`, `cume_dist()`, and
`percent_rank()` are faster and work for more types. You can now rank by
multiple columns by supplying a data frame (#6428).
* `with_order()` now checks that the size of `order_by` is the same size as `x`,
and now works correctly when `order_by` is a data frame (#6334).
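A short sketch of some of the vctrs-backed behaviors listed in this section, using invented inputs: `.default` in `case_when()`, a bare `NA` in `if_else()`, `NaN` handling in `na_if()`, and ranking by multiple columns.

```r
library(dplyr)

x <- c("little", "unknown", "large", "huge")

# `.default` replaces `TRUE ~ ...`, and a bare `NA` is cast to character.
size <- case_when(
  x %in% c("little", "small") ~ "one",
  x %in% c("big", "large") ~ "two",
  x == "unknown" ~ NA,
  .default = "other"
)

# if_else() takes the common type of `true` and `false`, so `NA` is fine here.
flag <- if_else(c(TRUE, FALSE, NA), "yes", NA)

# na_if() can now replace NaN values.
cleaned <- na_if(c(1, NaN, 3), NaN)

# Rank by `a`, breaking ties with `b`, by supplying a data frame.
ranks <- min_rank(tibble(a = c(1, 1, 2), b = c(2, 1, 1)))
```

In each case the output type is determined from the inputs as a whole rather than from the first value encountered, which is why typed missing values such as `NA_character_` are no longer needed.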
## Minor improvements and bug fixes
* Fixed an issue with latest rlang that caused internal tools (such as
`mask$eval_all_summarise()`) to be mentioned in error messages (#6308).
* Warnings are enriched with contextualised information in `summarise()` and
`filter()` just like they have been in `mutate()` and `arrange()`.
* Joins now reference the correct column in `y` when a type error is thrown
while joining on two columns with different names (#6465).
* Joins on very wide tables are no longer bottlenecked by the application of
`suffix` (#6642).
* `*_join()` now error if you supply them with additional arguments that
aren't used (#6228).
* `across()` used without functions inside a rowwise-data frame no longer
generates an invalid data frame (#6264).
* Anonymous functions supplied with `function()` and `\()` are now inlined by
`across()` if possible, which slightly improves performance and makes possible
further optimisations in the future.
* Functions supplied to `across()` are no longer masked by columns (#6545). For
instance, `across(1:2, mean)` will now work as expected even if there is a
column called `mean`.
* `across()` will now error when supplied `...` without a `.fns` argument
(#6638).
* `arrange()` now correctly ignores `NULL` inputs (#6193).
* `arrange()` now works correctly when `across()` calls are used as the 2nd
(or more) ordering expression (#6495).
* `arrange(df, mydesc::desc(x))` works correctly when mydesc re-exports
`dplyr::desc()` (#6231).
* `c_across()` now evaluates `all_of()` correctly and no longer allows you to
accidentally select grouping variables (#6522).
* `c_across()` now throws a more informative error if you try to rename during
column selection (#6522).
* dplyr no longer provides `count()` and `tally()` methods for `tbl_sql`.
These methods have been accidentally overriding the `tbl_lazy` methods that
dbplyr provides, which has resulted in issues with the grouping structure of
the output (#6338, tidyverse/dbplyr#940).
* `cur_group()` now works correctly with zero row grouped data frames (#6304).
* `desc()` gives a useful error message if you give it a non-vector (#6028).
* `distinct()` now retains attributes of bare data frames (#6318).
* `distinct()` returns columns ordered the way you request, not the same
as the input data (#6156).
* Error messages in `group_by()`, `distinct()`, `tally()`, and `count()` are now
more relevant (#6139).
* `group_by_prepare()` loses the `caller_env` argument. It was rarely used
and it is no longer needed (#6444).
* `group_walk()` gains an explicit `.keep` argument (#6530).
* Warnings emitted inside `mutate()` and variants are now collected and stashed
away. Run the new `last_dplyr_warnings()` function to see the warnings emitted
within dplyr verbs during the last top-level command.
This fixes performance issues when thousands of warnings are emitted with
rowwise and grouped data frames (#6005, #6236).
* `mutate()` behaves a little better with 0-row rowwise inputs (#6303).
* A rowwise `mutate()` now automatically unlists list-columns containing
length 1 vectors (#6302).
* `nest_join()` has gained the `na_matches` argument that all other joins have.
* `nest_join()` now preserves the type of `y` (#6295).
* `n_distinct()` now errors if you don't give it any input (#6535).
* `nth()`, `first()`, `last()`, and `with_order()` now sort character `order_by`
vectors in the C locale. Using character vectors for `order_by` is rare, so we
expect this to have little practical impact (#6451).
* `ntile()` now requires `n` to be a single positive integer.
* `relocate()` now works correctly with empty data frames and when `.before` or
`.after` result in empty selections (#6167).
* `relocate()` no longer drops attributes of bare data frames (#6341).
* `relocate()` now retains the last name change when a single column is renamed
multiple times while it is being moved. This better matches the behavior of
`rename()` (#6209, with help from @eutwt).
* `rename()` now contains examples of using `all_of()` and `any_of()` to rename
using a named character vector (#6644).
* `rename_with()` now disallows renaming in the `.cols` tidy-selection (#6561).
* `rename_with()` now checks that the result of `.fn` is the right type and size
(#6561).
* `rows_insert()` now checks that `y` contains the `by` columns (#6652).
* `setequal()` ignores differences between freely coercible types (e.g. integer
and double) (#6114) and ignores duplicated rows (#6057).
* `slice()` helpers again produce output equivalent to `slice(.data, 0)` when
the `n` or `prop` argument is 0, fixing a bug introduced in the previous
version (@eutwt, #6184).
* `slice()` with no inputs now returns 0 rows. This is mostly for theoretical
consistency (#6573).
* `slice()` now errors if any expressions in `...` are named. This helps avoid
accidentally misspelling an optional argument, such as `.by` (#6554).
* `slice_*()` now requires `n` to be an integer.
* `slice_*()` generics now perform argument validation. This should make
methods more consistent and simpler to implement (#6361).
* `slice_min()` and `slice_max()` can `order_by` multiple variables if you
supply them as a data.frame or tibble (#6176).
* `slice_min()` and `slice_max()` now consistently include missing values in
the result if necessary (i.e. there aren't enough non-missing values to
reach the `n` or `prop` you have selected). If you don't want missing values
to be included at all, set `na_rm = TRUE` (#6177).
* `slice_sample()` now accepts negative `n` and `prop` values (#6402).
* `slice_sample()` returns a data frame or group with the same number of rows as
the input when `replace = FALSE` and `n` is larger than the number of rows or
`prop` is larger than 1. This reverts a change made in 1.0.8, returning to the
behavior of 1.0.7 (#6185).
* `slice_sample()` now gives a more informative error when `replace = FALSE` and
the number of rows requested in the sample exceeds the number of rows in the
data (#6271).
* `storms` has been updated to include 2021 data and some missing storms that
were omitted due to an error (@steveharoz, #6320).
* `summarise()` now correctly recycles named 0-column data frames (#6509).
* `union_all()`, like `union()`, now requires that data frames be compatible:
i.e. they have the same columns, and the columns have compatible types.
* `where()` is re-exported from tidyselect (#6597).
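As one illustration of the `slice_min()` changes listed above, the sketch below orders by two variables by wrapping them in a tibble, and uses `na_rm = TRUE` to keep missing values out of the result. The data is invented for the example.

```r
library(dplyr)

df <- tibble(a = c(1, 1, 2), b = c(2, 1, 3), id = c("r1", "r2", "r3"))

# Order by `a`, then by `b`, by supplying both as a tibble.
lowest <- slice_min(df, tibble(a, b), n = 1)

# Ask for more rows than there are non-missing values, but exclude the NA.
vals <- tibble(x = c(3, NA, 1))
no_missing <- slice_min(vals, x, n = 3, na_rm = TRUE)
```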
# dplyr 1.0.10
Hot patch release to resolve R CMD check failures.
# dplyr 1.0.9
* New `rows_append()` which works like `rows_insert()` but ignores keys and
allows you to insert arbitrary rows with a guarantee that the type of `x`
won't change (#6249, thanks to @krlmlr for the implementation and @mgirlich
for the idea).
* The `rows_*()` functions no longer require that the key values in `x` uniquely
identify each row. Additionally, `rows_insert()` and `rows_delete()` no
longer require that the key values in `y` uniquely identify each row. Relaxing
this restriction should make these functions more practically useful for
data frames, and alternative backends can enforce this in other ways as needed
(i.e. through primary keys) (#5553).
* `rows_insert()` gained a new `conflict` argument allowing you greater control
over rows in `y` with keys that conflict with keys in `x`. A conflict arises
if a key in `y` already exists in `x`. By default, a conflict results in an
error, but you can now also `"ignore"` these `y` rows. This is very similar to
the `ON CONFLICT DO NOTHING` command from SQL (#5588, with helpful additions
from @mgirlich and @krlmlr).
* `rows_update()`, `rows_patch()`, and `rows_delete()` gained a new `unmatched`
argument allowing you greater control over rows in `y` with keys that are
unmatched by the keys in `x`. By default, an unmatched key results in an
error, but you can now also `"ignore"` these `y` rows (#5984, #5699).
* `rows_delete()` no longer requires that the columns of `y` be a strict subset
of `x`. Only the columns specified through `by` will be utilized from `y`,
all others will be dropped with a message.
* The `rows_*()` functions now always retain the column types of `x`. This
behavior was documented, but previously wasn't being applied correctly
(#6240).
* The `rows_*()` functions now fail elegantly if `y` is a zero column data frame
and `by` isn't specified (#6179).
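
As a rough illustration of the `rows_*()` changes above, the sketch below uses two small made-up tibbles (`x` and `y` are hypothetical data, not from the dplyr documentation) and assumes dplyr 1.0.9 or later is installed.

```r
library(dplyr)

# Hypothetical example data: key 3 exists in both tables, key 4 only in y
x <- tibble(id = c(1, 2, 3), value = c("a", "b", "c"))
y <- tibble(id = c(3, 4), value = c("z", "d"))

# Key 3 conflicts with x; conflict = "ignore" skips that row of y
inserted <- rows_insert(x, y, by = "id", conflict = "ignore")

# Key 4 has no match in x; unmatched = "ignore" skips that row of y
updated <- rows_update(x, y, by = "id", unmatched = "ignore")

# rows_append() ignores keys entirely and appends every row of y
appended <- rows_append(x, y)
```

Without `conflict = "ignore"` or `unmatched = "ignore"`, the first two calls would error by default, as described in the entries above.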
# dplyr 1.0.8
* Better display of error messages thanks to rlang 1.0.0.
* `mutate(.keep = "none")` is no longer identical to `transmute()`.
`transmute()` has not been changed, and completely ignores the column ordering
of the existing data, instead relying on the ordering of expressions
supplied through `...`. `mutate(.keep = "none")` has been changed to ensure
that pre-existing columns are never moved, which aligns more closely with the
other `.keep` options (#6086).
* `filter()` forbids matrix results (#5973) and warns about data frame
results, especially data frames created from `across()` with a hint
to use `if_any()` or `if_all()`.
* `slice()` helpers (`slice_head()`, `slice_tail()`, `slice_min()`, `slice_max()`)
now accept negative values for `n` and `prop` (#5961).
* `slice()` now indicates which group produces an error (#5931).
* `cur_data()` and `cur_data_all()` don't simplify list columns in rowwise data frames (#5901).
* dplyr now uses `rlang::check_installed()` to prompt you whether to install
required packages that are missing.
* `storms` data updated to 2020 (@steveharoz, #5899).
* `coalesce()` accepts 1-D arrays (#5557).
* The deprecated `trunc_mat()` is no longer reexported from dplyr (#6141).
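
The difference between `mutate(.keep = "none")` and `transmute()` described above can be sketched as follows, assuming dplyr 1.0.8 or later; the tibble `df` is a made-up example.

```r
library(dplyr)

# Hypothetical example data
df <- tibble(x = 1, y = 2)

# transmute() orders columns by the expressions supplied through `...`
via_transmute <- transmute(df, y = y * 2, x = x + 1)

# mutate(.keep = "none") leaves pre-existing columns where they were
via_mutate <- mutate(df, y = y * 2, x = x + 1, .keep = "none")

names(via_transmute)
names(via_mutate)
```

Both results contain the same values; only the column order differs (`y`, `x` versus `x`, `y`).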
# dplyr 1.0.7
* `across()` uses the formula's environment when inlining lambda-formulas (#5886).
* `summarise.rowwise_df()` is quiet when the result is ungrouped (#5875).
* `c_across()` and `across()` key deparsing is no longer confused by long calls (#5883).
* `across()` handles named selections (#5207).
# dplyr 1.0.6
* `add_count()` is now generic (#5837).
* `if_any()` and `if_all()` abort when a predicate is mistakenly used as `.cols=` (#5732).
* Multiple calls to `if_any()` and/or `if_all()` in the same expression are now
properly disambiguated (#5782).
* `filter()` now inlines `if_any()` and `if_all()` expressions. This greatly
improves performance with grouped data frames.
* Fixed behaviour of `...` in top-level `across()` calls (#5813, #5832).
* `across()` now inlines lambda-formulas. This is slightly more performant and
will allow more optimisations in the future.
* Fixed issue in `bind_rows()` causing lists to be incorrectly transformed as
data frames (#5417, #5749).
* `select()` no longer creates duplicate variables when renaming a variable
to the same name as a grouping variable (#5841).
* `dplyr_col_select()` keeps attributes for bare data frames (#5294, #5831).
* Fixed quosure handling in `dplyr::group_by()` that caused issues with extra
arguments (tidyverse/lubridate#959).
* Removed the `name` argument from the `compute()` generic (@ianmcook, #5783).
* row-wise data frames of 0 rows and list columns are supported again (#5804).
# dplyr 1.0.5
* Fixed edge case of `slice_sample()` when `weight_by=` is used and there
are 0 rows (#5729).
* `across()` can again use columns in functions defined inline (#5734).
* Using testthat 3rd edition.
* Fixed bugs introduced in `across()` in previous version (#5765).
* `group_by()` keeps attributes unrelated to the grouping (#5760).
* The `.cols=` argument of `if_any()` and `if_all()` defaults to `everything()`.
# dplyr 1.0.4
* Improved performance for `across()`. This makes `summarise(across())` and
`mutate(across())` perform as well as the superseded colwise equivalents (#5697).
* New functions `if_any()` and `if_all()` (#4770, #5713).
* `summarise()` silently ignores NULL results (#5708).
* Fixed a performance regression in `mutate()` when warnings occur once per
group (#5675). We no longer instrument warnings with debugging information
when `mutate()` is called within `suppressWarnings()`.
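
A minimal sketch of how the new `if_any()` and `if_all()` functions might be used inside `filter()`; the tibble `df` is made-up example data, and the lambda-formula predicates are just one way to write the condition.

```r
library(dplyr)

# Hypothetical example data
df <- tibble(a = c(1, -1, 2), b = c(3, 4, -5))

# Keep rows where every selected column is positive
all_pos <- filter(df, if_all(c(a, b), ~ .x > 0))

# Keep rows where at least one selected column is negative
any_neg <- filter(df, if_any(c(a, b), ~ .x < 0))
```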
# dplyr 1.0.3
* `summarise()` no longer informs when the result is ungrouped (#5633).
* `group_by(.drop = FALSE)` preserves ordered factors (@brianrice2, #5545).
* `count()` and `tally()` are now generic.
* Removed default fallbacks to lazyeval methods; this will yield better error messages when
you call a dplyr function with the wrong input, and is part of our long term
plan to remove the deprecated lazyeval interface.
* `inner_join()` gains a `keep` parameter for consistency with the other
mutating joins (@patrickbarks, #5581).
* Improved performance with many columns, with a dynamic data mask using active
bindings and lazy chops (#5017).
* `mutate()` and friends preserve row names in data frames once more (#5418).
* `group_by()` uses the ungrouped data for the implicit mutate step (#5598).
You might have to define an `ungroup()` method for custom classes.
For example, see https://github.com/hadley/cubelyr/pull/3.
* `relocate()` can rename columns it relocates (#5569).
* `distinct()` and `group_by()` have better error messages when the mutate step fails (#5060).
* Clarify that `between()` is not vectorised (#5493).
* Fixed `across()` issue where data frame columns could not be referred to
with `all_of()` in the nested case (`mutate()` within `mutate()`) (#5498).
* `across()` handles data frames with 0 columns (#5523).
* `mutate()` always keeps grouping variables, regardless of `.keep=` (#5582).
* dplyr now depends on R 3.3.0
# dplyr 1.0.2
* Fixed `across()` issue where data frame columns would mask objects referred to
from `all_of()` (#5460).
* `bind_cols()` gains a `.name_repair` argument, passed to `vctrs::vec_cbind()` (#5451)
* `summarise(.groups = "rowwise")` makes a rowwise data frame even if the input data
is not grouped (#5422).
# dplyr 1.0.1
* New function `cur_data_all()` similar to `cur_data()` but includes the grouping variables (#5342).
* `count()` and `tally()` no longer automatically weight by column `n` if
present (#5298). dplyr 1.0.0 introduced this behaviour because of Hadley's
faulty memory. Historically `tally()` automatically weighted and `count()`
did not, but this behaviour was accidentally changed in 0.8.2 (#4408) so that
neither automatically weighted by `n`. Since 0.8.2 is almost a year old,
and the automatic weighting behaviour was a little confusing anyway,
we've removed it from both `count()` and `tally()`.
Use of `wt = n()` is now deprecated; now just omit the `wt` argument.
* `coalesce()` now supports data frames correctly (#5326).
* `cummean()` no longer has an off-by-one indexing problem (@cropgen, #5287).
* The call stack is preserved on error. This makes it possible to `recover()`
into problematic code called from dplyr verbs (#5308).
# dplyr 1.0.0
## Breaking changes
* `bind_cols()` no longer converts to a tibble; it returns a data frame if the input is a data frame.
* `bind_rows()`, `*_join()`, `summarise()` and `mutate()` use vctrs coercion
rules. There are two main user facing changes:
* Combining factor and character vectors silently creates a character
vector; previously it created a character vector with a warning.
* Combining multiple factors creates a factor with combined levels;
previously it created a character vector with a warning.
* `bind_rows()` and other functions use vctrs name repair, see `?vctrs::vec_as_names`.
* `all.equal.tbl_df()` removed.
* Data frames, tibbles and grouped data frames are no longer considered equal, even if the data is the same.
* Equality checks for data frames no longer ignore row order or groupings.
* `expect_equal()` uses `all.equal()` internally. When comparing data frames, tests that used to pass may now fail.
* `distinct()` keeps the original column order.
* `distinct()` on missing columns now raises an error; it has been a compatibility warning for a long time.
* `group_modify()` puts the grouping variable to the front.
* `n()` and `row_number()` can no longer be called directly when dplyr is not loaded,
and this now generates an error: `dplyr::mutate(mtcars, x = n())`.
Fix by prefixing with `dplyr::` as in `dplyr::mutate(mtcars, x = dplyr::n())`
* The old data format for `grouped_df` is no longer supported. This may affect you if you have serialized grouped data frames to disk, e.g. with `saveRDS()` or when using knitr caching.
* `lead()` and `lag()` are stricter about their inputs.
* Extending data frames requires that the extra class or classes are added first, not last.
Having the extra class at the end causes some vctrs operations to fail with a message like:
```
Input must be a vector, not a `<data.frame/...>` object
```
* `right_join()` no longer sorts the rows of the resulting tibble according to the order of the RHS `by` argument in tibble `y`.
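
To make the vctrs coercion rules listed above concrete, here is a small sketch using `bind_rows()` on made-up single-column tibbles; it assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Factor + character: the result is a character vector, with no warning
mixed <- bind_rows(tibble(f = factor("x")), tibble(f = "y"))
class(mixed$f)

# Factor + factor with different levels: the result is a factor
# whose levels are the union of the input levels
combined <- bind_rows(tibble(f = factor("x")), tibble(f = factor("y")))
levels(combined$f)
```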
## New features
* The `cur_` functions (`cur_data()`, `cur_group()`, `cur_group_id()`,
`cur_group_rows()`) provide a full set of options to access information
about the "current" group in dplyr verbs. They are inspired by
data.table's `.SD`, `.GRP`, `.BY`, and `.I`.
* The `rows_` functions (`rows_insert()`, `rows_update()`, `rows_upsert()`, `rows_patch()`, `rows_delete()`) provide a new API to insert and delete rows from a second data frame or table. Support for updating mutable backends is planned (#4654).
* `mutate()` and `summarise()` create multiple columns from a single expression
if you return a data frame (#2326).
* `select()` and `rename()` use the latest version of the tidyselect interface.
Practically, this means that you can now combine selections using Boolean
logic (i.e. `!`, `&` and `|`), and use predicate functions with `where()`
(e.g. `where(is.character)`) to select variables by type (#4680). It also makes
it possible to use `select()` and `rename()` to repair data frames with
duplicated names (#4615) and prevents you from accidentally introducing
duplicate names (#4643). This also means that dplyr now re-exports `any_of()`
and `all_of()` (#5036).
* `slice()` gains a new set of helpers:
* `slice_head()` and `slice_tail()` select the first and last rows, like
`head()` and `tail()`, but return `n` rows _per group_.
* `slice_sample()` randomly selects rows, taking over from `sample_frac()`
and `sample_n()`.
* `slice_min()` and `slice_max()` select the rows with the minimum or
maximum values of a variable, taking over from the confusing `top_n()`.
* `summarise()` can create summaries of greater than length 1 if you use a
summary function that returns multiple values.
* `summarise()` gains a `.groups=` argument to control the grouping structure.
* New `relocate()` verb makes it easy to move columns around within a data
frame (#4598).
* New `rename_with()` is designed specifically for the purpose of renaming
selected columns with a function (#4771).
* `ungroup()` can now selectively remove grouping variables (#3760).
* `pull()` can now return named vectors by specifying an additional column name
(@ilarischeinin, #4102).
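
A brief sketch of two of the features above, the `slice_*()` helpers and `where()` selection, using a made-up grouped tibble. The data and variable names are hypothetical.

```r
library(dplyr)

# Hypothetical example data with two groups
df <- tibble(g = c("a", "a", "b", "b"), v = c(1, 3, 2, 5))
by_g <- group_by(df, g)

# First row of each group
first_rows <- slice_head(by_g, n = 1)

# Row with the largest v in each group, taking over from top_n()
top_rows <- slice_max(by_g, v, n = 1)

# Select columns by type with a predicate function
num_cols <- select(df, where(is.numeric))
```

Because `by_g` is grouped, `slice_head()` and `slice_max()` each return one row per group rather than one row overall.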
## Experimental features
* `mutate()` (for data frames only), gains experimental new arguments
`.before` and `.after` that allow you to control where the new columns are
placed (#2047).
* `mutate()` (for data frames only), gains an experimental new argument
called `.keep` that allows you to control which variables are kept from
the input `.data`. `.keep = "all"` is the default; it keeps all variables.
`.keep = "none"` retains no input variables (except for grouping keys),
so behaves like `transmute()`. `.keep = "unused"` keeps only variables
not used to make new columns. `.keep = "used"` keeps only the input variables
used to create new columns; it's useful for double checking your work (#3721).
* New, experimental, `with_groups()` makes it easy to temporarily group or
ungroup (#4711).
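
The effect of the experimental `.keep` and `.before` arguments can be sketched on a made-up three-column tibble. The column names are hypothetical and the results are inspected with `names()`.

```r
library(dplyr)

# Hypothetical example data
df <- tibble(a = 1, b = 2, c = 3)

# Keep only the inputs used to compute d, plus d itself
used <- mutate(df, d = a + b, .keep = "used")

# Keep only the inputs NOT used to compute d, plus d itself
unused <- mutate(df, d = a + b, .keep = "unused")

# Place the new column before `a` instead of at the end
placed <- mutate(df, d = a + b, .before = a)

names(used)
names(unused)
names(placed)
```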
## across()
* New function `across()` that can be used inside `summarise()`, `mutate()`,
and other verbs to apply a function (or a set of functions) to a selection of
columns. See `vignette("colwise")` for more details.
* New function `c_across()` that can be used inside `summarise()` and `mutate()`
in row-wise data frames to easily (e.g.) compute a row-wise mean of all
numeric variables. See `vignette("rowwise")` for more details.
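
As a minimal sketch of the two functions above: `across()` applies a function to several columns within each group, while `c_across()` combines several columns within each row. The tibble `df` is made-up example data.

```r
library(dplyr)

# Hypothetical example data
df <- tibble(g = c("a", "a", "b"), x = c(1, 2, 3), y = c(4, 5, 6))

# Column-wise: mean of x and y within each group
means <- df %>%
  group_by(g) %>%
  summarise(across(c(x, y), mean))

# Row-wise: mean of x and y within each row
row_means <- df %>%
  rowwise() %>%
  mutate(m = mean(c_across(c(x, y)))) %>%
  ungroup()
```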
## rowwise()
* `rowwise()` is no longer questioning; we now understand that it's an
important tool when you don't have vectorised code. It now also allows you to
specify additional variables that should be preserved in the output when
summarising (#4723). The rowwise-ness is preserved by all operations;
you need to explicitly drop it with `as_tibble()` or `group_by()`.
* New, experimental, `nest_by()`. It has the same interface as `group_by()`,
but returns a rowwise data frame of grouping keys, supplemented with a
list-column of data frames containing the rest of the data.
## vctrs
* The implementation of all dplyr verbs has been changed to use primitives
provided by the vctrs package. This makes it easier to add support for
new types of vector, radically simplifies the implementation, and makes
all dplyr verbs more consistent.
* The place where you are most likely to be impacted by the coercion
changes is when working with factors in joins or grouped mutates:
now when combining factors with different levels, dplyr creates a new
factor with the union of the levels. This matches base R more closely,
and while perhaps strictly less correct, is much more convenient.
* dplyr dropped its two heaviest dependencies: Rcpp and BH. This should make
it considerably easier and faster to build from source.
* The implementation of all verbs has been carefully thought through. This
mostly makes implementation simpler but should hopefully increase consistency,
and also makes it easier to adapt dplyr to new data structures in the
near future. Pragmatically, the biggest difference for most people will be
that each verb documents its return value in terms of rows, columns, groups,