Fix groups when joining grouped data frames with duplicates (#2330) #2334

davidkretch · 2016-12-18T03:39:37Z

* Fix subset_join to update group column names in attribute vars when they are duplicate column names and get renamed.
* Add a test to verify that the group column names after join are all columns in the data frame.
* Fix build_index_cpp to report correct missing group column name. Currently when a group column name does not exist in the data frame, it reports a name from the names vector (all columns) instead of the vars vector (group columns).
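
To illustrate the first fix, here is a minimal standalone sketch of the renaming rule. It is not the actual subset_join code. It assumes that a non-key column of x that also exists in y receives the suffix ".x" (dplyr's default), and that group variables used as join keys keep their names. The function name rename_group_vars and its parameters are hypothetical.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper, not part of dplyr: compute the names of x's group
// variables after a join, assuming clashing non-key columns of x get suffix_x.
std::vector<std::string> rename_group_vars(
    const std::vector<std::string>& group_vars,  // group variables of x
    const std::vector<std::string>& y_names,     // all column names of y
    const std::vector<std::string>& by,          // join key column names
    const std::string& suffix_x = ".x") {
  std::unordered_set<std::string> by_set(by.begin(), by.end());
  std::unordered_set<std::string> y_set(y_names.begin(), y_names.end());

  std::vector<std::string> out;
  out.reserve(group_vars.size());
  for (const std::string& var : group_vars) {
    bool is_key = by_set.count(var) > 0;   // join keys are never suffixed
    bool clashes = y_set.count(var) > 0;   // same name exists in y
    out.push_back((!is_key && clashes) ? var + suffix_x : var);
  }
  return out;
}
```

With x grouped by x and y, y having columns x and y, and by = "x", this sketch returns the group variables x and y.x.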

krlmlr · 2016-12-20T14:20:31Z

Thanks. Are the group indices updated correctly? We'll need a test for that (unless one exists; I haven't found any), and also a test for the error message buglet you discovered.

@hadley: Is joining a grouped data frame an operation we support?

hadley · 2016-12-20T14:54:25Z

I don't know what joining a grouped data frame should do. Probably just preserve the grouping of x?

krlmlr · 2016-12-20T14:58:35Z

This pull request addresses the case when columns are renamed. I just checked, group indices look ok after an expanding join (replace df2 <- data.frame(...) by df2 <- expand.grid(...)).

davidkretch · 2016-12-20T15:48:52Z

Thanks. I'll add those tests asap.

davidkretch · 2016-12-21T20:21:20Z

I've added tests for:

* expanding join group indices
* missing group column error message
* attribute vars in the joined data frame is null when the original df was not grouped, i.e. the new code doesn't introduce any

krlmlr

Thanks. Just a few minor nits about the tests. Concentrating the naming logic in the C++ code in one place will help later refactorings; currently this function does way too much.

krlmlr · 2016-12-21T21:51:12Z

src/join_exports.cpp

@@ -107,7 +116,7 @@ DataFrame subset_join(DataFrame x, DataFrame y,
  set_rownames(out, nrows);
  out.names() = names;

-  SEXP vars = x.attr("vars");
+  SEXP vars = group_vars_x;


Can you please move the entire logic of updating group variables to here? I think you should be able to look up the new column names via names[i].

krlmlr · 2016-12-21T21:55:40Z

tests/testthat/test-group-by.r

@@ -129,6 +129,17 @@ test_that("grouped_df errors on empty vars (#398)",{
  expect_error( m %>% do(mpg = mean(.$mpg)) )
 })

+test_that("grouped data frame errors on non-existent var (#2330)", {
+  df <- data.frame(x = 1:5)
+  expect_error(grouped_df(df, c(as.symbol("y"))), "unknown column 'y'")


I think list(quote(y)) is a bit easier to parse.

krlmlr · 2016-12-21T21:56:13Z

tests/testthat/test-group-by.r

+  df <- data.frame(x = 1:5)
+  expect_error(grouped_df(df, c(as.symbol("y"))), "unknown column 'y'")
+
+  gdf <- df %>% group_by(x)


This test looks like a bit too much, and it is fragile. The test above might be just enough to test the glitch you discovered.

krlmlr · 2016-12-21T22:00:25Z

tests/testthat/test-group-indices.R

+test_that("group indices are updated correctly for joined grouped data frames (#2330)", {
+  d1 <- data.frame(x = 1:2, y = 1:2) %>% group_by(x, y)
+  d2 <- expand.grid(x = 1:2, y = 1:2)
+  res <- inner_join(d1, d2, by = "x") %>% group_indices()


I think testing the group indices as an invariant will make the test clearer (and easily extensible/adaptable).

  expect_equal(group_indices(d1), d1$x)
  ...
  res <- inner_join(d1, d2, by = "x")
  expect_equal(group_indices(res), res$x)

krlmlr · 2016-12-21T22:06:41Z

tests/testthat/test-joins.r

+test_that("group column names get updated when they are duplicates (#2330)", {
+  d1 <- data_frame(x = 1:5, y = 1:5) %>% group_by(x, y)
+  d2 <- data_frame(x = 1:5, y = 1:5)
+  res <- inner_join(d1, d2, by = "x")


Here, an explicit check for an expected result will give slightly nicer output in case of failure. We can be explicit about the suffixes the join uses, too.

* Fix subset_join to update group column names in attribute vars when they are duplicate column names.
* Add tests for appropriate group columns after join.
* Add test for group indices on expanding join with grouped data frame.
* Fix build_index_cpp to report correct missing group column name. Currently when a group column name does not exist in the data frame, it reports a name from the names vector (all columns) instead of the vars vector (group columns).
* Add test for error message on non-existent group columns.

davidkretch · 2016-12-24T19:50:06Z

Hi, thank you for your review. I've updated the pull request. Hopefully any future PRs have fewer issues.

I added logic to give an error if subset_join somehow gets a grouped data frame x with non-existent group columns. Without this check, it would try to access names[NA_INTEGER] and hang. I think the alternatives are to give an error, to return output with no groups, or to skip the check on the assumption that the group columns are always valid. I'm not sure how to test the error without directly modifying the vars attribute, though.
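
The idea behind that check can be sketched as follows. This is a standalone illustration using plain standard-library containers, not the Rcpp code in the PR. It assumes that out_names[i] holds the post-join name of x_names[i]; the function name lookup_group_vars is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, not dplyr's actual code.
// Assumption: out_names[i] is the post-join name of x_names[i].
std::vector<std::string> lookup_group_vars(
    const std::vector<std::string>& x_names,      // columns of x before the join
    const std::vector<std::string>& out_names,    // columns of the join result
    const std::vector<std::string>& group_vars) { // group variables of x
  if (out_names.size() < x_names.size()) {
    throw std::invalid_argument("output has fewer columns than x");
  }
  std::vector<std::string> updated;
  updated.reserve(group_vars.size());
  for (const std::string& var : group_vars) {
    auto it = std::find(x_names.begin(), x_names.end(), var);
    if (it == x_names.end()) {
      // Fail before indexing, and name the missing group variable itself
      // rather than some unrelated entry of the column-name vector.
      throw std::invalid_argument("unknown column '" + var + "'");
    }
    std::size_t pos = static_cast<std::size_t>(it - x_names.begin());
    updated.push_back(out_names[pos]);
  }
  return updated;
}
```

The inputs are taken by const reference and the result is a fresh vector, so the caller's group variables are never modified in place.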

I also fixed a serious bug in my previous commit, which accidentally modified x.attr("vars") by reference. It is now const. Sorry about that.

krlmlr · 2017-01-26T09:19:33Z

Thanks for the edits. You can still rip out the column in a grouped data frame:

iris %>% group_by(Species) %>% { .[["Species"]] <- NULL; . }

We could forbid this, but creative users will find a way to break it anyway. I think anything that takes (grouped) data frames should check validity of the input, but this is too much for this PR. I'll merge this as is; the safety net you added will protect us from crashes.

lock · 2019-01-18T18:32:46Z

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

krlmlr reviewed Dec 21, 2016

krlmlr merged commit abfc9ec into tidyverse:master Jan 26, 2017

lock bot locked and limited conversation to collaborators Jan 18, 2019
