top_n(df, n) returns bottom n rows #4494
Not sure I understand:
library(dplyr, warn.conflicts = FALSE)
example <- data.frame(
startdate = seq(as.Date("2019/01/01"), as.Date("2019/12/31"), by="days"),
enddate = seq(as.Date("2021/01/01"), as.Date("2021/12/31"), by="days")
)
example %>%
top_n( 1, startdate)
#> startdate enddate
#> 1 2019-12-31 2021-12-31
example %>%
filter(startdate == max(startdate))
#> startdate enddate
#> 1 2019-12-31 2021-12-31
example %>%
top_n( -1, startdate)
#> startdate enddate
#> 1 2019-01-01 2021-01-01
example %>%
filter(startdate == min(startdate))
#> startdate enddate
#> 1 2019-01-01 2021-01-01
Same as with numbers:
library(dplyr, warn.conflicts = FALSE)
df <- data.frame(x = 1:5)
df %>%
top_n(1, x)
#> x
#> 1 5
df %>%
top_n(-1, x)
#> x
#> 1 1
If you don't understand, why did you close the ticket? The behavior of top_n is the opposite of what TOP does in SQL.
Nowhere in the manual entry is the word "GREATEST" used, so it is natural to expect top_n to behave as it does in SQL, where it gives you the first entry of an ordered list. Ascending order is 1, 2, 3, 4, ..., so one would expect top_n to give 1, the first entry in the list, not the greatest.
Thank you. I appreciate the reconsideration and apologize for my abrupt comment about not understanding.
Quick note: this documentation issue also relates to the concern raised in #1008 (issues/1008), where the implementation doesn't "play well" with explicitly using ascending or descending order. The commenters there also ran into trouble because they interpreted it as similar to SQL, where the expected results depend on how the data is sorted. SQL's 'TOP' is itself confusing, as it relies on a shared understanding of which end of the list is the top after sorting ascending or descending. Take a deck of cards as an example: if you place the cards face down on a table as you sort them, the 'top' cards are the last ones in the sort, i.e. the greatest.
I have a better example to explain this. You order a deck of cards first by suit (Club, Diamond, Heart, Spade) and then by rank/pip (A, 2 - 10, J, Q, K; assume these are coded 1 to 13). So we have Clubs 1, Clubs 2, Clubs 3, ... up to Spades 11, Spades 12, Spades 13. In SQL, the sort is issued with ORDER BY, which outputs a table starting with Clubs 1 in row 1, i.e. on the "top" of the pile, facing upwards so you can see the face of the card when it is placed on a table. SELECT TOP 1 SUIT, RANK FROM tblDeck ORDER BY SUIT, RANK then pulls Clubs 1. In tidyverse, it appears that top_n( 1, SUIT, RANK ) will return Spades 13. To get the corresponding card, top_n( -1, SUIT, RANK) would need to be used. This is why I claim they are "backward" to one another.
Also, with SQL the ORDER BY is preserved when the data is stored. It is more natural for a SQL convert to issue arrange( ) and expect row 1 to be Club 1 and row 52 to be Spades 13. A SQL convert would then expect a following top_n( 1 ) to return Club 1.
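To make the card example concrete, here is a minimal sketch in R. The deck data frame and its column names (suit, rank) are made up for illustration, and it uses arrange() plus slice() as one way to mimic "ORDER BY then TOP 1"; it is not the only way to do it. Note that top_n() accepts a single weighting column, so the sketch ranks by rank alone and shows that ties are kept.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical 52-card deck: 4 suits x ranks 1..13 (A = 1, K = 13)
deck <- expand.grid(
  rank = 1:13,
  suit = c("Club", "Diamond", "Heart", "Spade"),
  stringsAsFactors = FALSE
)

# Sort by suit, then rank, as ORDER BY SUIT, RANK would
sorted_deck <- deck %>% arrange(suit, rank)

# Row 1 of the sorted table: what SELECT TOP 1 ... ORDER BY returns
first_card <- sorted_deck %>% slice(1)

# Last row of the sorted table: the other end of the pile
last_card <- sorted_deck %>% slice(n())

# top_n() ranks by value, ignores the row order, and keeps ties,
# so ranking by `rank` alone keeps every rank-13 card
highest_rank <- sorted_deck %>% top_n(1, rank)
```

Here first_card is the Club with rank 1 and last_card is the Spade with rank 13, while highest_rank contains one row per suit because all four rank-13 cards tie.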
This applies to dates in the same way. Given 1/1/1900, 1/1/1901, 1/1/1902, ..., 1/1/2000: SELECT TOP 1 date FROM tblData ORDER BY date returns 1/1/1900, whereas top_n( tblData, 1, date) returns the top/highest value, 1/1/2000, not row 1 of the sorted table.
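The date case can be sketched the same way. The data frame below (tbl_data, standing in for tblData) is invented for illustration, with one date per year from 1900 to 2000; arrange() followed by slice(1) is assumed here as a rough equivalent of the SQL query.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical stand-in for tblData: 1900-01-01, 1901-01-01, ..., 2000-01-01
tbl_data <- data.frame(
  date = seq(as.Date("1900-01-01"), as.Date("2000-01-01"), by = "year")
)

# SQL-style "TOP 1 ... ORDER BY date": sort ascending, then take row 1
sql_like <- tbl_data %>% arrange(date) %>% slice(1)

# top_n() with n = 1 keeps the largest value, regardless of row order
largest <- tbl_data %>% top_n(1, date)

# A negative n keeps the smallest value, which matches the SQL result here
smallest <- tbl_data %>% top_n(-1, date)
```

With this data, sql_like and smallest both hold 1900-01-01, while largest holds 2000-01-01.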
@cochetti thanks for the detailed analysis and explanation. I think the other thing that's confusing is that if you do specify
library(dplyr, warn.conflicts = FALSE)
df <- data.frame(x = c(6, 4, 1, 10, 3, 1, 1))
df %>% arrange(x) %>% top_n(2)
#> Selecting by x
#> x
#> 1 6
#> 2 10
Created on 2019-12-31 by the reprex package (v0.3.0)
This convinces me that the existing interface to
Some brainstorming:
Of these options (This function would also sort by the
Oooh, maybe
df %>% top_n(x, 3) #-> df %>% slice_tail(x, n = 3)
df %>% top_n(x, -3) #-> df %>% slice_head(x, n = 3)
df %>% top_frac(x, 0.5) #-> df %>% slice_head(x, prop = 0.5)
df %>% sample_n(3, wt = w) #-> df %>% slice_random(w, n = 3)
df %>% sample_frac(0.5, wt = w) #-> df %>% slice_random(w, prop = 0.5)
I really like the solution of calling it slice_xxxx. I think it resolves the "top" terminology confusion that exists even in SQL, and the addition of slice_random is a nice complement. FYI, I am working on a suggestion for how to annotate functions where an issue like this one is found and a solution has not yet been implemented or won't be.
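For readers arriving later: the names brainstormed above are proposals, not a released API. As far as I know, dplyr 1.0.0 and later provide slice_min() and slice_max(), which name the end of the ordering explicitly. A minimal sketch, assuming dplyr >= 1.0.0 is installed and reusing the small data frame from the earlier comment:

```r
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(x = c(6, 4, 1, 10, 3, 1, 1))

# The two largest values of x (the rows with 10 and 6)
largest_two <- df %>% slice_max(x, n = 2)

# The smallest value of x; ties are kept by default, so all three 1s come back
smallest_all <- df %>% slice_min(x, n = 1)

# Exactly one row for the smallest value, dropping ties
smallest_one <- df %>% slice_min(x, n = 1, with_ties = FALSE)
```

Because the function name states whether the minimum or the maximum is wanted, there is no need to agree on which end of a sorted list counts as the "top".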
The manual pages for the top_n function do not include any examples with date values, and trying to pick out the earliest/latest date of a period can be confusing. For example, I worked in insurance, so we had eligibility periods that ran from startdate to enddate. To get the earliest startdate, as a prior SQL programmer, I would expect to use an ascending list, where the top item on the list is the first one. However, top_n provides the "largest" date, i.e. the last one.
Taking the top of an ascending list should return the first item in the list. However, top_n returns the largest value, not the smallest. This can be seen in the example below. I am also porting the data over to SQL so you can see how ordering the list ascending and limiting to the first item returns a different result there (many SQL variants support SELECT TOP #, but SQLite does not).
Reproducible Example:
SQL Snippet to do the same thing
The same is true in reverse: to obtain the end date, you would use a descending list (newest to oldest) and pull the first item, but this pulls the "smallest", i.e. the "earliest", item.
However, coming from a SQL background this is counter-intuitive where I would normally query such as this:
Or alternatively, and easier if not looking for a matched set...