Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#4331 Updated distinct_if and distinct_at to include .keep_all argument #4343

Merged

Conversation

beansrowning
Copy link
Contributor

As @DaniMori suggested in #4331, this simply tacks the .keep_all argument onto the scoped variants of distinct().

I wasn't sure if this was needed (or desired) but I have also wanted a scoped variant of distinct() which would allow you the option to keep all columns. Adding it to distinct_all() seemed superfluous.

My use case was reading in many data files where I am appending a file column to track provenance. Since my data is very raw, I want to check that there isn't a duplicate observation across two or more files.

Something similar to this:

# Read in many excel files with many columns
# In this case, they're 1 row each
to_process <- dir("<my_dir>", pattern = ".*\\.xlsx$", full.names = TRUE)

# Originally, I assumed one could just do this:
# (but it's not possible due to distinct's mutate() semantics)
raw_data <- lapply(to_process, my_parsing_function) %>%
  bind_rows() %>%
  mutate(file = to_process) %>%
  distinct(-file, .keep_all = TRUE)

# Since I can't do that, I have to do some variation of this 
# (though there are likely better ways):
raw_data <- lapply(to_process, my_parsing_function) %>%
  bind_rows()

distinct_vars <- lapply(names(raw_data), as.name)

raw_data <- raw_data %>%
  mutate(file = to_process) %>%
  distinct(!!!distinct_vars, .keep_all = TRUE)

# Ideally I would just run the scoped version `distinct_at()`
# on all columns, save for file which I append later:
raw_data <- lapply(to_process, my_parsing_function) %>%
  bind_rows() %>%
  mutate(file = to_process) %>%
  distinct_at(vars(-file), .keep_all = TRUE)

# So not ~really~ that big of a deal, but it seems logical ¯\_(ツ)_/¯

@romainfrancois romainfrancois added this to the 0.8.2 milestone Apr 30, 2019
@romainfrancois
Copy link
Member

Still think it makes sense to add it to distinct_all() too

@DaniMori
Copy link

Still think it makes sense to add it to distinct_all() too

I agree that it's superfluous as @beansrowning said, but it wouldn't be the only "superfluous" case to be in dplyr. The most obvious one is select_all(). I can't see a use case where these superfluous functions or parameters would be useful, but there must be some for sure, so they may apply to distinct_all(..., .keep_all = TRUE).

Do they? I really don't know, probably someone else can state a good case in favor/against it.

@beansrowning
Copy link
Contributor Author

@romainfrancois @DaniMori Fair point, it's reasonable to have it consistent across.

Added.

@beansrowning
Copy link
Contributor Author

beansrowning commented May 1, 2019

Looks like it fails the check for distinct_all() since I made it a recursive default argument.

I'll have to think about it.

EDIT: Sike. After a little bit more sleep it is clear the issue lies between the chair and the keyboard 🤦‍♂ . Should be fixed now.

This reverts commit 9ddd5f2.
Incorrect argument declaration
- All checks passed local, Win 10 R3.5
@romainfrancois
Copy link
Member

closes #4331

@romainfrancois romainfrancois merged commit 00fc08a into tidyverse:master May 27, 2019
@lock
Copy link

lock bot commented Nov 23, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Nov 23, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants