Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

top_n(df, n) returns bottom n rows #4494

Closed
cochetti opened this issue Jul 16, 2019 · 12 comments
Closed

top_n(df, n) returns bottom n rows #4494

cochetti opened this issue Jul 16, 2019 · 12 comments
Labels
documentation helpers 🏋️‍♂️ helper functions like count() and tally() tidy-dev-day 🤓 Tidyverse Developer Day rstd.io/tidy-dev-day

Comments

@cochetti
Copy link

cochetti commented Jul 16, 2019

The manual pages for the top_n function do not include any examples with date values and trying to pick out the earliest/latest of a period can be confusing. For example, I worked in insurance so we had eligibility periods that ran from startdate to enddate.

To get the earliest startdate, as a prior SQL programmer, I would expect to use an ascending list and the top item on the list is the first one. However, top_n provides the "largest" date i.e. the last one.

The ordering of an ascending list should return as the top the first item in the list. However, top_n returns the largest value, not the smallest. This can be seen in the example below. I am also porting the data over to SQL so you can see how this ascending order of lists, limit to the first item 1 returns differently there (in many SQL variants SELECT TOP # is supported but not SQLite).

Reproducible Example:

library(tidyverse)

example <- data.frame(  startdate = seq(as.Date("2019/01/01"), as.Date("2019/12/31"), by="days"),
                        enddate   = seq(as.Date("2021/01/01"), as.Date("2021/12/31"), by="days") )

example[1:5, ]

###erroneous result
example %>% 
  top_n( 1, startdate)
#2019-12-31

example %>%
  select( startdate ) %>% 
  arrange( startdate ) %>% 
  top_n( 1 )
#2019-12-31

###desired solution
example %>%
  summarize( output = min(startdate) )
#2019-01-01

example %>%
  top_n( -1, startdate )

library(DBI)
db=dbConnect(RSQLite::SQLite(), ":memory:")

dbWriteTable( db, "example", example)

SQL Snippet to do the same thing


SELECT startdate
FROM example
ORDER BY startdate
LIMIT 1
*2019-01-01

The same is true of the reverse, if you are obtaining end-date you would use a descending list from oldest to newest and pull the first item, but this pulls the "smallest" i.e. the "earliest" item.

top_n( 1, enddate )

However, coming from a SQL background this is counter-intuitive where I would normally query such as this:

SELECT top 1 id, startdate
FROM x
GROUP BY id
ORDER BY startdate

SELECT top 1 id, enddate
FROM x
GROUP BY id
ORDER BY desc(enddate)

Or alternatively, and easier if not looking for a matched set...

SELECT id, min(startdate), max(enddate)
FROM x
GROUP BY id
@romainfrancois
Copy link
Member

Not sure I understand, top_n(1, startdate) gives you the row with the 1 greatest startdate value .

library(dplyr, warn.conflicts = FALSE)

example <- data.frame(
  startdate = seq(as.Date("2019/01/01"), as.Date("2019/12/31"), by="days"),
  enddate   = seq(as.Date("2021/01/01"), as.Date("2021/12/31"), by="days")
)
example %>% 
  top_n( 1, startdate)
#>    startdate    enddate
#> 1 2019-12-31 2021-12-31
example %>% 
  filter(startdate == max(startdate))
#>    startdate    enddate
#> 1 2019-12-31 2021-12-31

example %>% 
  top_n( -1, startdate)
#>    startdate    enddate
#> 1 2019-01-01 2021-01-01
example %>% 
  filter(startdate == min(startdate))
#>    startdate    enddate
#> 1 2019-01-01 2021-01-01

same as with numbers:

library(dplyr, warn.conflicts = FALSE)
df <- data.frame(x = 1:5)
df %>% 
  top_n(1, x)
#>   x
#> 1 5
df %>% 
  top_n(-1, x)
#>   x
#> 1 1

@cochetti
Copy link
Author

If you don't understand why did you close the ticket? The behavior of top_n is opposite that of what it does in SQL.

@cochetti
Copy link
Author

Nowhere in the manual entry is the word "GREATEST" used.... So it is natural to interpret top_n to comply with how it is in SQL, where it gives you the top entry among an ordered list... ascending order is 1, 2, 3, 4.... so top_n gives 1 the first entry in the list, not the greatest

x a tbl() to filter
n number of rows to return for top_n(), fraction of rows to return for top_frac().If x is grouped, this is the number (or fraction) of rows per group. Will include more rows if there are ties.If n is positive, selects the top rows. If negative, selects the bottom rows.
wt (Optional). The variable to use for ordering. If not specified, defaults to the last variable in the tbl.

@cochetti
Copy link
Author

Thank you. I appreciate the reconsideration and apologize for my abrupt comment about not understanding.

@cochetti
Copy link
Author

cochetti commented Aug 4, 2019

Quick note... This documentation issue also relates to the concern raised in #1008 (issues/1008) where the implementation doesn't "play well" with explicitly using ascending or descending. The commenters also had issues as they were interpreting it to be similar to SQL, where the expected results were based on sorting of the data.

SQL processes top n or limit statements based on the ordering specified in the ORDER BY clause. When SQL programmers switch to the tidyverse, they are expecting top_n to behave in the same way, i.e. arrange( x ) %>% top_n(10) will give the first 10 values among a list achieved with the prior ascending sort. Note klmr's comment on #1008 where the expectation is that arrange( pval ) %>% head(100) will be equivalent to top_n( 100, desc(pval) ) --- this is the same SQL-based assumption of which side is the 'top'.

'Top' is confusing as it relies on a mutual understanding of which end of the list is the top after sorting ascending or descending. If you used a deck of cards as an example, if you place the cards face down on a table when you sort them, the 'top' are the last ones in the sorting, i.e. the greatest -- as top_n currently returns. Alternatively, after sorting if you are holding the cards in sorted order, the 'top' are the first x cards you would see, i.e. the least -- as SQL TOP N code would return. Base R avoids this confusion as most folks use head() and tail() which explicitly state which end is being used.

TL;DR
dplyr::top_n is backward to SQL TOP N. It's confused a bunch of SQL converts. head() and tail() are more explicit in which "side" of the sort is being pulled from.

@hadley hadley added the tidy-dev-day 🤓 Tidyverse Developer Day rstd.io/tidy-dev-day label Dec 10, 2019
@cochetti
Copy link
Author

I have a better example to explain this... You order a deck of cards first by suit (Club, Diamond, Heart, Spade) and then by rank/pip (A, 2 - 10, J, Q, K.... assume these are coded 1 to 13). Okay so we have Clubs 1, Clubs 2, Clubs 3 .... to Spades 11, Spades 12, Spades 13.

In SQL, the sort is issued with ORDER BY which would output a datatable starting with Clubs 1 in row 1 or on the "top" of the pile facing upwards so you can see the face of the card when placed on a table. SELECT TOP 1 SUIT, RANK FROM tblDeck ORDER BY SUIT, RANK then pulls Clubs 1.

In tidyverse, it appears that top_n( 1, SUIT, RANK ) will return Spades 13. In order to get the corresponding card top_n( -1, SUIT, RANK) would need to be used. This is why I claim they are "backward" to one another.

@cochetti
Copy link
Author

Also, with SQL the ORDER BY is preserved when the data is stored.

It is more natural for a SQL convert to desire to issue arrange( ) expecting the row 1 Club 1 to row 52 Spades 13. So in the next command top_n( 1 ) a SQL convert would expect this to return Club 1.

@cochetti
Copy link
Author

This applies with dates the same way... 1/1/1900, 1/1/1901, 1/1/1902... 1/1/2000

SELECT TOP 1 date FROM tblData ORDER BY date -> returns 1/1/1900.

top_n( tblData, 1, date) -> returns the top/highest value 1/1/2000, not the row 1 in a sorted table

@hadley hadley added the helpers 🏋️‍♂️ helper functions like count() and tally() label Dec 21, 2019
@hadley hadley changed the title top_n with date values not returning appropriately top_n(df, n) returns bottom n rows Dec 31, 2019
@hadley
Copy link
Member

hadley commented Dec 31, 2019

@cochetti thanks for the detailed analysis and explanation.

I think the other thing that's confusing is that if you do specify df %>% arrange(x) %>% top_n(2), the arrange is actually ignored because top_n() sorts by the last variable in the data frame. This result is confusing:

library(dplyr, warn.conflicts = FALSE)

df <- data.frame(x = c(6, 4, 1, 10, 3, 1, 1))
df %>% arrange(x) %>% top_n(2)
#> Selecting by x
#>    x
#> 1  6
#> 2 10

Created on 2019-12-31 by the reprex package (v0.3.0)

This convinces me that the existing interface to top_n() is wrong, but it remains to figure out how to fix it without breaking existing code. I think the easiest way will be to retire (or maybe deprecate) the existing top_n() and replace it with a new function. But what should that function be called?

Some brainstorming:

  • first_n() and last_n() (and eliminate the use of negative n); but would also require first_frac() and last_frac(), and four functions for this purpose seems like a lot. (Another option would be allow frac(0.5) to work). Similarly start_n()/beginning_n() and end_n()

  • Something that doesn't indicate directionality: edge(), margin(), periphery(), extrema(), fringe(), outside()

  • Some minor variation on top_n(): top(), top_n2(), topn()

Of these options top() seems most appealing, but because it's so short the likelihood of a clash with another package is probably high.

(This function would also sort by the wt variable, if supplied, as requested in #4660. Probably should also include with_ties argument to match SQL. It also wouldn't evaluate n, but keep it fixed.)

@hadley
Copy link
Member

hadley commented Jan 1, 2020

Oooh, maybe slice_top() and slice_bottom() (and then slice_random() could replace sample_n() and sample_frac()). Maybe have mutually exclusive n and prop arguments to avoid the two suffixes.

@hadley
Copy link
Member

hadley commented Jan 1, 2020

slice_head() and slice_tail() would be even better:

df %>% top_n(x, 3)              #-> df %>% slice_tail(x, n = 3)
df %>% top_n(x, -3)             #-> df %>% slice_head(x, n = 3)
df %>% top_frac(x, 0.5)         #-> df %>% slice_head(x, prop = 0.5)
df %>% sample_n(3, wt = w)      #-> df %>% slice_random(w, n = 3)
df %>% sample_frac(0.5, wt = w) #-> df %>% slice_random(w, prop = 0.5)

hadley added a commit that referenced this issue Jan 2, 2020
Fixes #4494. Fixes #4660.
@hadley hadley closed this as completed in ed787ef Jan 7, 2020
@cochetti
Copy link
Author

I really like the solution of calling it slice_xxxx... I think it resolves the "top" terminology confusion that exists even in SQL. And the addition of slice_random is a nice complementary implementation.

FYI, I am working on a suggestion for how to annotate functions where an issue like this one is found and a solution has not yet been implemented or won't be.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation helpers 🏋️‍♂️ helper functions like count() and tally() tidy-dev-day 🤓 Tidyverse Developer Day rstd.io/tidy-dev-day
Projects
None yet
Development

No branches or pull requests

3 participants