New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windowed rank functions need NA handling #774

Closed
gvfarns opened this Issue Nov 14, 2014 · 9 comments

Comments

Projects
None yet
3 participants
@gvfarns

gvfarns commented Nov 14, 2014

Each of the windowed rank functions could really use an NA handling parameter, similar to what is done in base R's rank() function with the parameter na.last.

At present it appears that NA is always last (highest rank), which is equivalent to na.last=TRUE and there is no obvious way to change this.

Of particular importance is the na.last=NA case, where any missing data is not included in the ranking (it gets a rank of NA). For me, at least, exclusion of missing values from rankings is the most common desired case.

@hadley

This comment has been minimized.

Member

hadley commented Nov 14, 2014

I can't see how na.last = NA could work for ranking functions without breaking the interface that length(output) == length(input).

Could you provide a use case for why you'd want to rank NA's higher than every other value?

@gvfarns

This comment has been minimized.

gvfarns commented Nov 14, 2014

For na.last=NA we don't remove observations, we just give them a ranking of NA. For example, my workaround when I learned the dplyr function ntile() would not handle NA as I needed it to was the following function

ntile_na <- function(x,n)
{
  notna <- !is.na(x)
  out <- rep(NA_real_,length(x))
  out[notna] <- ntile(x[notna],n)
  return(out)
}

The other rank functions are amenable to a similar workaround.

With respect to use cases, it's a question of how to interpret NA values, which the researcher knows but as developers, we don't. Here are some examples:

  1. If X is a set of times to completion, then NA could mean it never completed. Makes sense to group NA with the highest decile.
  2. if X is the score a student got on a test, then NA could mean they did not take it. Might make sense to group those students with those who got the lowest score.
  3. if X is a characteristic and some individuals did not report it, it makes sense not give those individuals a ranking of NA, meaning "unknown". My current work involves an analysis of hedge fund returns and flows and other characteristics. Hedge funds occasionally do not report their flows, so we can't rank them relative to their peers in those months. They may report one characteristic but not another so we don't necessarily want to drop the whole observation from the ranking process.

We can't reasonably impose our view that the data should be in the form of case 1 since as developers we don't know what the data means nor the correct way to interpret NA for that data.

@hadley

This comment has been minimized.

Member

hadley commented Nov 14, 2014

Oh, I see - I think that na.last = NA should be the standard behaviour for all ranking functions. You can get the other behaviours by trivially modifying the data (e.g. to use -Inf or Inf).

@gvfarns

This comment has been minimized.

gvfarns commented Nov 14, 2014

I agree completely. However, you might consider keeping the default as it is now to be compatible with the rank() function (and therefore many people's expectations), which has default behavior that matches that of the dplyr ranking functions. I'll leave it up to you to make that call.

hadley added a commit that referenced this issue Nov 18, 2014

@hadley

This comment has been minimized.

Member

hadley commented Nov 18, 2014

@romainfrancois does the hybrid evaluator also have implementations of these functions that need to be updated?

@hadley hadley added the feature label Nov 20, 2014

@hadley hadley added this to the 0.3.1 milestone Nov 20, 2014

@romainfrancois

This comment has been minimized.

Member

romainfrancois commented Dec 10, 2014

Yes.

        handlers[ Rf_install( "min_rank" )       ] = rank_impl_prototype<dplyr::internal::min_rank_increment> ;
        handlers[ Rf_install( "percent_rank" )   ] = rank_impl_prototype<dplyr::internal::percent_rank_increment> ;
        handlers[ Rf_install( "dense_rank" )     ] = rank_impl_prototype<dplyr::internal::dense_rank_increment> ;
        handlers[ Rf_install( "cume_dist" )      ] = rank_impl_prototype<dplyr::internal::cume_dist_increment> ;

        handlers[ Rf_install( "ntile" )          ] = ntile_prototype ;

The nice thing is that they all share the same implementation, rank_impl_prototype so we would only need to handle it once.

@romainfrancois

This comment has been minimized.

Member

romainfrancois commented Dec 10, 2014

However, for all of the functions handled by the rank_impl_prototype template, hybrid evaluation only occurs if the number of arguments is one, so for "correct" handling of any extra argument, it can happen at the R layer.

Anyhow, what should the interface be for min_rank etc ... ?

min_rank and dense_rank could be sort of easy to modify. Bu perhaps cume_dist and percent_rank are more of a problem as we sort of need to know how many NA there are.

It looks as if the R versions of the functions did propagate the NA:

> x <- c(1,2,NA,1,0, NA)
> percent_rank(x)
[1] 0.2 0.6  NA 0.2 0.0  NA
> dense_rank(x)
[1]  2  3 NA  2  1 NA
> min_rank(x)
[1]  2  4 NA  2  1 NA
> cume_dist(x)
[1] 0.5000000 0.6666667        NA 0.5000000 0.1666667        NA

> ( data.frame( x = x ) %>% mutate( out = percent_rank(x) ) ) $out
[1] 0.2 0.6 0.8 0.2 0.0 0.8
> ( data.frame( x = x ) %>% mutate( out = dense_rank(x) ) ) $out
[1] 2 3 4 2 1 4
> ( data.frame( x = x ) %>% mutate( out = min_rank(x) ) ) $out
[1] 2 4 5 2 1 5
> ( data.frame( x = x ) %>% mutate( out = cume_dist(x) ) ) $out
[1] 0.5000000 0.6666667 1.0000000 0.5000000 0.1666667 1.0000000

So should I mimic those results, or do we need some ways to control NA, ...

For cume_dist and percent_rank the R functions use the total number of elements from the vector and not the number of non NA as was hinted by @gvfarns above. What the R functions are doing would be easier to implement internally, but is this the right results @hadley ?

@hadley

This comment has been minimized.

Member

hadley commented Dec 10, 2014

Oh oops, I'd say that's a bug in my R implementation. The denominator should be the number of non-NAs, not the length.

We don't need options to control behave, just ensure that NAs in input are NA in output

romainfrancois added a commit that referenced this issue Dec 12, 2014

@romainfrancois

This comment has been minimized.

Member

romainfrancois commented Dec 12, 2014

Did the first set of functions: min_rank, percent_rank, dense_rank and cume_dist. Also I think I fixed the R implementation of them.

I guess next is ntile

romainfrancois added a commit that referenced this issue Dec 13, 2014

@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.