Description
The manual pages for the top_n
function do not include any examples with date values and trying to pick out the earliest/latest of a period can be confusing. For example, I worked in insurance so we had eligibility periods that ran from startdate to enddate.
To get the earliest startdate, as a prior SQL programmer, I would expect to use an ascending list and the top item on the list is the first one. However, top_n provides the "largest" date i.e. the last one.
The ordering of an ascending list should return as the top the first item in the list. However, top_n returns the largest value, not the smallest. This can be seen in the example below. I am also porting the data over to SQL so you can see how this ascending order of lists, limit to the first item 1 returns differently there (in many SQL variants SELECT TOP # is supported but not SQLite).
Reproducible Example:
library(tidyverse)
example <- data.frame( startdate = seq(as.Date("2019/01/01"), as.Date("2019/12/31"), by="days"),
enddate = seq(as.Date("2021/01/01"), as.Date("2021/12/31"), by="days") )
example[1:5, ]
###erroneous result
example %>%
top_n( 1, startdate)
#2019-12-31
example %>%
select( startdate ) %>%
arrange( startdate ) %>%
top_n( 1 )
#2019-12-31
###desired solution
example %>%
summarize( output = min(startdate) )
#2019-01-01
example %>%
top_n( -1, startdate )
library(DBI)
db=dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable( db, "example", example)
SQL Snippet to do the same thing
SELECT startdate
FROM example
ORDER BY startdate
LIMIT 1
*2019-01-01
The same is true of the reverse, if you are obtaining end-date you would use a descending list from oldest to newest and pull the first item, but this pulls the "smallest" i.e. the "earliest" item.
top_n( 1, enddate )
However, coming from a SQL background this is counter-intuitive where I would normally query such as this:
SELECT top 1 id, startdate
FROM x
GROUP BY id
ORDER BY startdate
SELECT top 1 id, enddate
FROM x
GROUP BY id
ORDER BY desc(enddate)
Or alternatively, and easier if not looking for a matched set...
SELECT id, min(startdate), max(enddate)
FROM x
GROUP BY id