Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix groups when joining grouped data frames with duplicates (#2330) #2334

Merged
merged 1 commit into from Jan 26, 2017
Merged

Fix groups when joining grouped data frames with duplicates (#2330) #2334

merged 1 commit into from Jan 26, 2017

Conversation

davidkretch
Copy link
Contributor

@davidkretch davidkretch commented Dec 18, 2016

  • Fix subset_join to update group column names in attribute
    vars when they are duplicate column names and get renamed.

  • Add a test to verify that the group column names after
    join are all columns in the data frame.

  • Fix build_index_cpp to report correct missing group
    column name. Currently when a group column name does not
    exist in the data frame, it reports a name from the names
    vector (all columns) instead of the vars vector (group columns).

Fixes #2330.

@krlmlr
Copy link
Member

krlmlr commented Dec 20, 2016

Thanks. Are the group indices updated correctly? We'll need a test for that (unless it exists, I haven't found any), and also a test for the error message buglet you discovered.

@hadley: Is joining a grouped data frame an operation we support?

@hadley
Copy link
Member

hadley commented Dec 20, 2016

I don't know what joining a grouped data frame should do. Probably just preserve the grouping of x?

@krlmlr
Copy link
Member

krlmlr commented Dec 20, 2016

This pull request addresses the case when columns are renamed. I just checked, group indices look ok after an expanding join (replace df2 <- data.frame(...) by df2 <- expand.grid(...)).

@davidkretch
Copy link
Contributor Author

davidkretch commented Dec 20, 2016

Thanks. I'll add those tests asap.

@davidkretch
Copy link
Contributor Author

I've added tests for:

  • expanding join group indices
  • missing group column error message
  • attribute vars in the joined data frame is null when the original df was not grouped, i.e. the new code doesn't introduce any

Copy link
Member

@krlmlr krlmlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Just a few minor nits about the tests. Concentrating the naming logic in the C++ code in one place will help later refactorings; currently this function does way too much.

@@ -107,7 +116,7 @@ DataFrame subset_join(DataFrame x, DataFrame y,
set_rownames(out, nrows);
out.names() = names;

SEXP vars = x.attr("vars");
SEXP vars = group_vars_x;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please move the entire logic of updating group variables to here? I think you should be able to look up the new column names via names[i].

@@ -129,6 +129,17 @@ test_that("grouped_df errors on empty vars (#398)",{
expect_error( m %>% do(mpg = mean(.$mpg)) )
})

test_that("grouped data frame errors on non-existent var (#2330)", {
df <- data.frame(x = 1:5)
expect_error(grouped_df(df, c(as.symbol("y"))), "unknown column 'y'")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think list(quote(y)) is a bit easier to parse.

df <- data.frame(x = 1:5)
expect_error(grouped_df(df, c(as.symbol("y"))), "unknown column 'y'")

gdf <- df %>% group_by(x)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test looks like a bit too much, and fragile. The test above might be just enough to test the glitch you discovered.

test_that("group indices are updated correctly for joined grouped data frames (#2330)", {
d1 <- data.frame(x = 1:2, y = 1:2) %>% group_by(x, y)
d2 <- expand.grid(x = 1:2, y = 1:2)
res <- inner_join(d1, d2, by = "x") %>% group_indices()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think testing the group indices as an invariant will make the test clearer (and easily extensible/adaptable).

expect_equal(group_indices(d1), d1$x)

...

res <- inner_join(d1, d2, by = "x")
expect_equal(group_indices(res), res$x)

test_that("group column names get updated when they are duplicates (#2330)", {
d1 <- data_frame(x = 1:5, y = 1:5) %>% group_by(x, y)
d2 <- data_frame(x = 1:5, y = 1:5)
res <- inner_join(d1, d2, by = "x")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, an explicit check for an expected result will give slightly nicer output in case of failure. We can be explicit about the suffixes the join uses, too.

* Fix subset_join to update group column names in attribute
vars when they are duplicate column names.

* Add tests for appropriate group columns after join.

* Add test for group indices on expanding join with grouped
data frame.

* Fix build_index_cpp to report correct missing group
column name. Currently when a group column name does not
exist in the data frame, it reports a name from the
names vector (all columns) instead of the vars vector
(group columns).

* Add test for error message on non-existent group columns.
@davidkretch
Copy link
Contributor Author

davidkretch commented Dec 24, 2016

Hi, thank you for your review. I've updated the pull request. Hopefully any future PRs have fewer issues.

I added logic to give an error if subset_join somehow gets an x grouped data frame with non-existent group columns. Without this check, it would try to access names[NA_INTEGER] and hang. I think the alternatives are to give an error, output with no groups, or not check assuming they're always valid. I'm not sure how to test the error without directly modifying the vars attribute though.

I also fixed a serious bug in my previous commit, which accidentally modified x.attr("vars") by reference. It is now const. Sorry about that.

@krlmlr
Copy link
Member

krlmlr commented Jan 26, 2017

Thanks for the edits. You can still rip out the column in a grouped data frame:

iris %>% group_by(Species) %>% { .[["Species"]] <- NULL; . }

We could forbid this, but creative users will find a way to break it anyway. I think anything that takes (grouped) data frames should check validity of the input, but this is too much for this PR. I'll merge this as is, the safety net you added will protect us from crashes.

@krlmlr krlmlr merged commit abfc9ec into tidyverse:master Jan 26, 2017
@lock
Copy link

lock bot commented Jan 18, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jan 18, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants