A time version of expand #85

Closed
matthieugomez opened this Issue Jun 6, 2015 · 6 comments


matthieugomez commented Jun 6, 2015

For datasets where one variable identifies a group and another identifies a time, a useful function would add rows for missing dates, making missing observations explicit.

mydata <- data_frame(
  name = c("john","john","john","john","mary","chris","chris","chris"),
  year = c(1999, 2000, 2001, 2003, 2001, 1998, 1999, 2003),
  score = rnorm(8)
)

texpand(mydata, name, year) differs from expand(mydata, name, year) in that it adds every date between the min and the max of the last argument, including dates that never appear in the data (2002 in the dataset above).
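For reference, the intended result can be approximated in base R without texpand. This is only a sketch of the expected output, not the proposed implementation: the scores are fixed values chosen here for reproducibility, and `full_years`, `grid` and `expanded` are names introduced for illustration. `grid` corresponds to what texpand would return; the merge afterwards is an extra step that makes the missing observations explicit as NA.

```r
mydata <- data.frame(
  name = c("john", "john", "john", "john", "mary", "chris", "chris", "chris"),
  year = c(1999, 2000, 2001, 2003, 2001, 1998, 1999, 2003),
  score = c(0.5, -1.2, 0.3, 1.1, 0.8, -0.4, 0.9, 0.2),  # illustrative values
  stringsAsFactors = FALSE
)

# Every year between the overall min and max, including years absent from the data
full_years <- seq(min(mydata$year), max(mydata$year), by = 1)

# Cross all names with all years (the combinations texpand is meant to produce)
grid <- expand.grid(name = unique(mydata$name), year = full_years,
                    stringsAsFactors = FALSE)

# Attach the observed scores; combinations without an observation get NA
expanded <- merge(grid, mydata, by = c("name", "year"), all.x = TRUE)

nrow(expanded)            # 3 names x 6 years
2002 %in% expanded$year   # the year missing from the input is now present
```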

This defines texpand:

#' @export
texpand <- function(data, ..., period = 1L) {
  dots <- lazyeval::lazy_dots(...)
  texpand_(data, dots, period = period)
}

#' @export
texpand_ <- function(data, dots, ..., period = 1L) {
  UseMethod("texpand_")
}

#' @export
texpand_.data.frame <- function(data, dots, ..., period = 1L) {
  dots <- lazyeval::as.lazy_dots(dots)
  if (length(dots) == 0)
    return(data.frame())
  iddots <- dots[-length(dots)]
  pieces <- lapply(iddots, unique_vals, data = data)
  timedot <- dots[length(dots)]
  time  <- dplyr::select_(data, .dots = timedot)[[1]]
  timev <- seq_time(time, period = period)
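  # Offsets from the first date must be whole multiples of the period;
  # missing times are ignored here, as they are in seq_time()
  check <- (time - timev[1]) %% period
  if (any(check != 0, na.rm = TRUE)) {
    stop("Time vector is not a regular sequence")
  }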
  pieces[[length(pieces)+1]] <- data.frame(timev)
  Reduce(cross_df, pieces)
}
#' @export
texpand_.tbl_df <- function(data, dots, ..., period = 1L) {
  dplyr::tbl_df(NextMethod())
}

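# Build the full regular sequence spanning the observed range of x
seq_time <- function(x, period) {
  attrs <- attributes(x)
  rng <- range(x, na.rm = TRUE)
  attributes(rng) <- attrs  # keep the attributes (e.g. class) of the input
  seq(rng[1], rng[2], by = period)
}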

An alternative to creating texpand and tlag would be to define a new class of dataset, tbl_panel, which pairs a group variable with a time variable. Setting the panel type would check that the time variable has no missing values and that no time is duplicated within a group. tidyr/dplyr could then provide a special expand and a special lag for this class.
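The two checks described for tbl_panel could look roughly like the following in base R. This is a sketch under assumptions: `as_panel()`, the `panel_group` and `panel_time` attributes, and the error messages are hypothetical names invented here, not part of tidyr or dplyr.

```r
# Hypothetical constructor: validates the data and tags it as a panel
as_panel <- function(data, group, time) {
  stopifnot(is.data.frame(data), group %in% names(data), time %in% names(data))

  # Check 1: the time variable has no missing values
  if (anyNA(data[[time]])) {
    stop("Time variable has missing values", call. = FALSE)
  }

  # Check 2: no (group, time) pair appears more than once
  if (anyDuplicated(data[c(group, time)]) > 0) {
    stop("Duplicate times within a group", call. = FALSE)
  }

  structure(data,
            class = c("tbl_panel", class(data)),
            panel_group = group,
            panel_time = time)
}

panel <- data.frame(
  name = c("john", "john", "mary"),
  year = c(1999, 2000, 2001),
  stringsAsFactors = FALSE
)

p <- as_panel(panel, "name", "year")
inherits(p, "tbl_panel")
```

Panel-aware versions of expand and lag could then dispatch on the `tbl_panel` class and read the group and time columns from the attributes.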


Mullefa commented Jun 18, 2015

Huge +1 for a function such as this. I'm currently implementing it myself using dplyr::left_join(), which probably isn't ideal.

The function signature by @matthieugomez looks good - I would probably add something whereby you can specify default values for missing rows that are added. Although grouped data frames aren't supported by tidyr (yet?), I would include a method for a grouped_df in this case as it'd be really useful and presumably not hard to implement once you have the method for data.frame.

I think this function would also be useful in ggvis. e.g. running the following code:

df <- data_frame(
  date = c(as.Date("2014-01-01"), as.Date("2014-01-03")),
  sales = c(10, 13)
)

df %>%
  ggvis(~date, ~sales) %>%
  layer_lines

we get a linearly increasing graph, whereas I think it makes more sense to set the value of the missing date (2014-01-02) to 0. If this isn't done in ggvis itself, you could at least call the proposed function on the data set before visualising it.
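The fill-with-a-default idea can be sketched in base R for this small example. It is only an illustration of the behaviour being asked for, not the proposed function; `all_dates` and `filled` are names introduced here, and 0 as the default is taken from the comment above.

```r
df <- data.frame(
  date = as.Date(c("2014-01-01", "2014-01-03")),
  sales = c(10, 13)
)

# Every day between the first and last observed date
all_dates <- data.frame(date = seq(min(df$date), max(df$date), by = "day"))

# Keep all dates; days without an observation get NA sales
filled <- merge(all_dates, df, by = "date", all.x = TRUE)

# Replace the NA introduced for the missing day with the default value
filled$sales[is.na(filled$sales)] <- 0

filled
```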

Alternate name: inflate().


Mullefa commented Jun 18, 2015

In case you're interested, I've knocked up this package very quickly based on existing code: https://github.com/Mullefa/inflate


Member

hadley commented Aug 24, 2015

I think this could just be an additional argument to expand:

df %>% expand(name, year, values = list(year = 1999:2003))

Or maybe

seq_range <- function(x, period) {
  rng <- range(x, na.rm = TRUE)
  seq(rng[1], rng[2], by = period)
}
df %>% expand(name, year, values = list(year = function(x) seq_range(x, 1)))

(Note that seq_range() will silently drop missing values, so it probably needs more sophistication.)
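To make the caveat concrete, the snippet below repeats seq_range() from the comment above and runs it on input containing an NA. The stricter variant, `seq_range_strict()`, is a hypothetical name introduced here as one possible way to surface missing values rather than drop them.

```r
seq_range <- function(x, period) {
  rng <- range(x, na.rm = TRUE)
  seq(rng[1], rng[2], by = period)
}

# The NA is dropped without any message
years <- seq_range(c(1999, NA, 2003), 1)
years

# Hypothetical stricter variant: refuse input that contains missing values
seq_range_strict <- function(x, period) {
  if (anyNA(x)) {
    stop("`x` contains missing values", call. = FALSE)
  }
  seq_range(x, period)
}
```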


matthieugomez commented Oct 6, 2015

Ideally this would offer two functionalities:

  • fill between the within-group min and the within-group max
  • fill between the across-group min and the across-group max

I think your proposal only handles the second case. The first functionality is useful for datasets with non-overlapping dates across groups. In certain cases, filling between the across-group min and the across-group max can make the dataset 10x bigger.
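The difference between the two cases can be seen on the example data from the first comment. This base R sketch is illustrative only; `across` and `within` are names introduced here, and the score column is left out because only the set of rows matters.

```r
mydata <- data.frame(
  name = c("john", "john", "john", "john", "mary", "chris", "chris", "chris"),
  year = c(1999, 2000, 2001, 2003, 2001, 1998, 1999, 2003),
  stringsAsFactors = FALSE
)

# Across groups: every name gets every year between the overall min and max
across <- expand.grid(
  name = unique(mydata$name),
  year = seq(min(mydata$year), max(mydata$year), by = 1),
  stringsAsFactors = FALSE
)

# Within groups: each name only gets the years between its own min and max
within <- do.call(rbind, lapply(split(mydata, mydata$name), function(g) {
  data.frame(
    name = g$name[1],
    year = seq(min(g$year), max(g$year), by = 1),
    stringsAsFactors = FALSE
  )
}))

nrow(across)  # 3 names x 6 years
nrow(within)  # john 1999-2003, mary 2001 only, chris 1998-2003
```

Here mary contributes a single row in the within-group version, whereas the across-group version pads her out to the full range of years.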


Member

hadley commented Dec 30, 2015

I think the best way forward would be to figure out how to solve the vector case - i.e. write a good seq_time(). I think it's better to check the input values for regularity, rather than the output values, and you need to allow for FP differences. Here's my attempt:

seq_time <- function(x, period) {
  if (any(x %% period > 1e-6)) {
    stop("Time vector is not a regular sequence", call. = FALSE)
  }

  rng <- range(x, na.rm = TRUE)
  seq(rng[1], rng[2], by = period)
}


seq_time(c(1, 2, 5, 10), 1)
seq_time(c(1, 2, 5, 9.9, 10), 1)
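With the definition above, the first call returns the integers from 1 to 10 and the second stops with an error, because 9.9 is not a multiple of the period. Two limitations remain: the check measures distance from multiples of `period` starting at 0 rather than from the smallest value, and a value just below a multiple (such as 9.9999999) is rejected. The variant below is one possible refinement, not the code that was committed; `seq_time_tol()` and its `tol` argument are hypothetical names used here for illustration.

```r
# Hypothetical variant: offsets are measured from the minimum, and values
# within `tol` of a multiple of `period` on either side are accepted
seq_time_tol <- function(x, period, tol = 1e-6) {
  x <- x[!is.na(x)]
  rng <- range(x)

  offset <- (x - rng[1]) %% period
  # Distance to the nearest multiple of `period`, above or below
  dist <- pmin(offset, period - offset)

  if (any(dist > tol)) {
    stop("Time vector is not a regular sequence", call. = FALSE)
  }
  seq(rng[1], rng[2], by = period)
}

seq_time_tol(c(1, 2, 5, 10), 1)
seq_time_tol(c(0.5, 1.5, 3.5), 1)   # regular, but not aligned to 0
```

tidyr also provides a `full_seq()` function for this purpose, which may be the better choice in practice.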

hadley closed this in cc0b0e0 Dec 31, 2015


matthieugomez commented Jan 1, 2016

Thanks!
