Fill downup updown #504

Closed
wants to merge 81 commits into from

Contributor

@coolbutuseless coolbutuseless commented Oct 16, 2018

Add options to fill() for filling down-then-up ('downup') and up-then-down ('updown').

This is to replace a common idiom of mine, i.e.

df %>%
  group_by(group) %>%
  tidyr::fill(value, .direction = 'down') %>%
  tidyr::fill(value, .direction = 'up') %>%
  ungroup()

which could become

df %>%
  group_by(group) %>%
  tidyr::fill(value, .direction = 'downup') %>%
  ungroup()

Depending on the number of groups and the number of variables to fill, avoiding the duplicate call to fill() can give significant speed savings.
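To make the intended semantics concrete, here is a minimal base-R sketch of the two combined directions on a plain vector. The helper names (fill_down, fill_up, fill_downup, fill_updown) are hypothetical and exist only for illustration; this is not how fill() is implemented, just one way to express the same result.

```r
# Hypothetical helpers for illustration only (not tidyr's implementation).

# Carry the last non-missing value forward over any NAs that follow it.
fill_down <- function(x) {
  idx <- cumsum(!is.na(x))       # position of the most recent non-NA value; 0 before the first
  c(NA, x[!is.na(x)])[idx + 1]   # leading NAs map to the prepended NA
}

# Carry the next non-missing value backward: fill down on the reversed vector.
fill_up <- function(x) rev(fill_down(rev(x)))

# The two combined directions: the first pass wins wherever both could apply.
fill_downup <- function(x) fill_up(fill_down(x))
fill_updown <- function(x) fill_down(fill_up(x))

x <- c(NA, 1, NA, NA, 2, NA)
fill_downup(x)
#> [1] 1 1 1 1 2 2
fill_updown(x)
#> [1] 1 1 2 2 2 2
```

The two orders differ only for NAs that sit between two known values: 'downup' gives them the earlier value, 'updown' the later one.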

@hadley
Member

hadley commented Oct 23, 2018

Can you explain why you'd want to do this?

hadley and others added 9 commits October 23, 2018 14:18
* Update lazyeval compat file

* Unquote scalar quosure with !!

* Use as_string(ensym()) rather than quo_name(enquo())

This is a much more robust way of capturing symbols
* Add uncount to ref index

* Build reference index w/ parens
* Fixes #480
@coolbutuseless
Contributor Author

A situation where I do this

I have values only known at some particular time and I need to fill this value both forwards and backwards in time.

A particular example

I work with clinical trial data, which is often provided in multiple files.

In the process of making a data set for analysis, particular information may only be recorded at certain events/times, but needs to be filled forward/back in time throughout a related time period.

It is only valid to fill up/down within certain groupings (e.g. subjects, day, part of study) - with lots of subjects and lots of groups, this filling can take a noticeable amount of time.

Also, different variables may need to be filled within different groupings.

A simplified concrete example:

suppressPackageStartupMessages({
  library(dplyr)
})

# Weight only recorded at event_type = 1, but considered
# valid across the entire event_num.
# If 'wt' not defined for a given event num, it may be
# carried forwards from a prior run, or backwards from a following run
df <- tibble::tribble(
  ~subject, ~time, ~event_type, ~event_num,  ~wt,
  1       ,     1,           0,          1,   NA,
  1       ,     2,           0,          1,   NA,
  1       ,     3,           1,          1,   20,
  1       ,     4,           0,          1,   NA,
  1       ,     5,           0,          1,   NA,
  1       ,     1,           0,          2,   NA,
  1       ,     2,           0,          2,   NA,
  1       ,     3,           1,          2,   NA,
  1       ,     4,           0,          2,   NA,
  1       ,     5,           0,          2,   NA,
  1       ,     1,           0,          3,   NA,
  1       ,     2,           0,          3,   NA,
  1       ,     3,           1,          3,   30,
  1       ,     4,           0,          3,   NA,
  1       ,     5,           0,          3,   NA,
)

# fill wt down/up within the event_num for each subject,
# then down/up within subject only.
df %>%
  group_by(subject, event_num) %>%
  tidyr::fill(wt, .direction = 'down') %>%
  tidyr::fill(wt, .direction = 'up'  ) %>%
  group_by(subject) %>%
  tidyr::fill(wt, .direction = 'down') %>%
  tidyr::fill(wt, .direction = 'up'  ) %>%
  ungroup()
#> # A tibble: 15 x 5
#>    subject  time event_type event_num    wt
#>      <dbl> <dbl>      <dbl>     <dbl> <dbl>
#>  1       1     1          0         1    20
#>  2       1     2          0         1    20
#>  3       1     3          1         1    20
#>  4       1     4          0         1    20
#>  5       1     5          0         1    20
#>  6       1     1          0         2    20
#>  7       1     2          0         2    20
#>  8       1     3          1         2    20
#>  9       1     4          0         2    20
#> 10       1     5          0         2    20
#> 11       1     1          0         3    30
#> 12       1     2          0         3    30
#> 13       1     3          1         3    30
#> 14       1     4          0         3    30
#> 15       1     5          0         3    30

Created on 2018-10-24 by the reprex package (v0.2.0).

rbloehm and others added 6 commits October 24, 2018 07:35
* Add missing tests for spread with fill

* Remove duplicate test for gather. The test below the removed one, with the same name, covers exactly the same test cases (and more)

* Add missing test for id with high dimension

* Use tibble in test-spread.r

* Ensure /tests is lint-free

* Resolve conflict and small style points
To quiet glue deprecation message
@echasnovski
Contributor

I have also had several occasions where this type of functionality would have been useful. However, I'd phrase the problem slightly differently: replace missing values based on the closest row (by some column, usually time), and use a direction argument to break ties when the distances are equal.

If the data is ordered by the reference column, then downup and updown solve this problem.
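As a rough sketch of the "closest row" phrasing, the base-R function below fills each NA from the row whose reference value is nearest and uses a ties argument for equal distances. The name fill_nearest and its arguments are hypothetical, and the sketch assumes ref is sorted in ascending order and contains no missing values.

```r
# Hypothetical helper for illustration only.
# Assumes `ref` is sorted ascending and has no NAs.
fill_nearest <- function(x, ref, ties = c("down", "up")) {
  ties <- match.arg(ties)
  known <- which(!is.na(x))          # rows with a recorded value
  if (length(known) == 0) return(x)  # nothing to fill from
  for (i in which(is.na(x))) {
    d <- abs(ref[known] - ref[i])    # distance to every known row
    cand <- known[d == min(d)]       # one row, or two on a tie
    # "down" carries the earlier value forward; "up" takes the later one
    pick <- if (ties == "down") cand[1] else cand[length(cand)]
    x[i] <- x[pick]
  }
  x
}

wt   <- c(NA, 20, NA, NA, NA, 30)
time <- 1:6
fill_nearest(wt, time, ties = "down")
#> [1] 20 20 20 20 30 30
fill_nearest(wt, time, ties = "up")
#> [1] 20 20 20 30 30 30
```

On this input a plain down-then-up fill would give 20 at position 5 as well, because the down pass reaches it first. The two approaches agree whenever each run of missing values has a known value on only one side.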

@coolbutuseless
Contributor Author

OK. I think I totally hosed this PR by trying to sync it with current master. :/

Burn to the ground and start again? I can't see a solution...

@hadley
Member

hadley commented Mar 4, 2019

In the future, you might try usethis::pr_pull_source() which should do the right thing to get your branch synced back with master.
