Skip to content

top_n(df, n) returns bottom n rows #4494

Closed
@cochetti

Description

@cochetti

The manual pages for the top_n function do not include any examples with date values and trying to pick out the earliest/latest of a period can be confusing. For example, I worked in insurance so we had eligibility periods that ran from startdate to enddate.

To get the earliest startdate, as a prior SQL programmer, I would expect to use an ascending list and the top item on the list is the first one. However, top_n provides the "largest" date i.e. the last one.

The ordering of an ascending list should return as the top the first item in the list. However, top_n returns the largest value, not the smallest. This can be seen in the example below. I am also porting the data over to SQL so you can see how this ascending order of lists, limit to the first item 1 returns differently there (in many SQL variants SELECT TOP # is supported but not SQLite).

Reproducible Example:

library(tidyverse)

example <- data.frame(  startdate = seq(as.Date("2019/01/01"), as.Date("2019/12/31"), by="days"),
                        enddate   = seq(as.Date("2021/01/01"), as.Date("2021/12/31"), by="days") )

example[1:5, ]

###erroneous result
example %>% 
  top_n( 1, startdate)
#2019-12-31

example %>%
  select( startdate ) %>% 
  arrange( startdate ) %>% 
  top_n( 1 )
#2019-12-31

###desired solution
example %>%
  summarize( output = min(startdate) )
#2019-01-01

example %>%
  top_n( -1, startdate )

library(DBI)
db=dbConnect(RSQLite::SQLite(), ":memory:")

dbWriteTable( db, "example", example)

SQL Snippet to do the same thing


SELECT startdate
FROM example
ORDER BY startdate
LIMIT 1
*2019-01-01

The same is true of the reverse, if you are obtaining end-date you would use a descending list from oldest to newest and pull the first item, but this pulls the "smallest" i.e. the "earliest" item.

top_n( 1, enddate )

However, coming from a SQL background this is counter-intuitive where I would normally query such as this:

SELECT top 1 id, startdate
FROM x
GROUP BY id
ORDER BY startdate

SELECT top 1 id, enddate
FROM x
GROUP BY id
ORDER BY desc(enddate)

Or alternatively, and easier if not looking for a matched set...

SELECT id, min(startdate), max(enddate)
FROM x
GROUP BY id

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions