Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

join_mutate(x = <sf>) #4917

Closed
romainfrancois opened this issue Mar 2, 2020 · 14 comments · Fixed by #5059
Closed

join_mutate(x = <sf>) #4917

romainfrancois opened this issue Mar 2, 2020 · 14 comments · Fixed by #5059
Labels
bug an unexpected problem or unintended behavior tables 🧮 joins and set operations
Milestone

Comments

@romainfrancois
Copy link
Member

library(sf)
#> Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
library(qualmap)
library(dplyr, warn.conflicts = FALSE)

sf <- mutate(stLouis, TRACTCE = as.numeric(TRACTCE))
df <- tibble(TRACTCE = 112100, x = 1)

left_join(sf, df)
#> Joining, by = "TRACTCE"
#> Error: `nm` must be `NULL` or a character vector the same length as `x`

Created on 2020-03-02 by the reprex package (v0.3.0.9000)

This happens here: https://github.com/tidyverse/dplyr/blob/master/R/join.r#L296

x_key <- set_names(x[vars$x$key], names(vars$x$key))

because x is an sf object and the [ method brings back the geometry column, so x[vars$x$key] does not have the same number of columns as the length of names(vars$x$key)

@romainfrancois
Copy link
Member Author

A number of packages exhibit that issue as part of the latest revdep checks:

`nm` must be `NULL` or a character vector the same length as `x`
===============================================================

areal, chilemapas, foieGras, GADMTools, nhdplusTools, qualmap, raceland, 
sabre, spatialrisk, stplanr

I'm not sure all these are related to sf though.

@DavisVaughan

This comment has been minimized.

@lionel-
Copy link
Member

lionel- commented Mar 2, 2020

The sf data frames are hard to program with because x[idx] isn't guaranteed to be length(idx). Transforming inputs to a tibble in the data.frame method might be reasonable to avoid surprises. The problem is that this prevents input classes from checking invariants / updating metadata at each manipulation of the data frame, which seems wrong at some level. Maybe it's better to keep this error, and let sf maintainers override nest_join() to make their class compatible with dplyr.

@DavisVaughan

This comment has been minimized.

@hadley hadley added bug an unexpected problem or unintended behavior tables 🧮 joins and set operations labels Mar 5, 2020
@hadley hadley added this to the 1.0.0 milestone Mar 5, 2020
@hadley
Copy link
Member

hadley commented Mar 16, 2020

@edzer — I wanted to alert you to this issue. The basic problem is that sf breaks an implicit contract that we expect data.frame subclasses to fulfil. It's not totally clear to me what we need to do fix it — sf is an important class so we could change dplyr to make it work (or at least figure out how to make this bit of code generic). But I wanted to check with you that you were committed to this behaviour before we did a significant amount of work.

@hadley hadley closed this as completed Mar 16, 2020
@hadley hadley reopened this Mar 16, 2020
@lionel-
Copy link
Member

lionel- commented Mar 16, 2020

One potential compromise would be to allow 1-parameter [ to lose the geometry columns, and keep the sticky behaviour for the 2-parameters [. This seems to make sense if we consider the former to be programmer-oriented and the latter to be user-oriented.

The sf data frame could devolve either into a partially instantiated state (waiting for a geometry column to be added back later on), or to a bare data frame.

edzer added a commit to r-spatial/sf that referenced this issue Mar 16, 2020
@edzer
Copy link

edzer commented Mar 16, 2020

Thanks for notifying me! @lionel- on the sf side I don't think that is an option - a typical user pattern is to call plot(nc["BIR74"]) and expect a map with nc county polygons colored according to the BIR74 attribute.

What I now did is strip the sf class label and put it back after dplyr did its job; many of the dplyr methods are implemented that way in sf.

edzer added a commit to r-spatial/sf that referenced this issue Mar 16, 2020
@hadley
Copy link
Member

hadley commented Mar 16, 2020

@edzer are you happy to continue with that approach? It sounds like you'll now need to add methods for all the join functions.

@edzer
Copy link

edzer commented Mar 17, 2020

Thanks @hadley - let's put it this way: I would be the last person who'd be unhappy if dplyr handled sf objects the "right" way out of the box, but I also understand that you want x[idx] to have length length(idx), which sf breaks. As long as dplyr handles sfc geometry list columns correctly (thanks to @lionel- for the vctrs wrappers!), and if changes in behaviour like stripping attributes are announced early, the additional work to be done in sf is quite managable.

sf has provided dplyr compatible *_join methods for three years now.

@hadley
Copy link
Member

hadley commented Mar 17, 2020

@edzer you can consider this informal notification 😄 — we'll be sending out the formal revdep emails later in the week. We now have a somewhat more principled approach described in ?dplyr_extending; although unfortunately it's not going to be super helpful for you because of the aforementioned [ behaviour.

@edzer
Copy link

edzer commented Mar 26, 2020

This is going to hit all users now using join_xxx(x,y) with y an sf object. The error message now given by dplyr,

Error: `nm` must be `NULL` or a character vector the same length as `x`

is not helpful and I've already experienced several users then seeking help with the sf developers. Would it be possible to emit a more helpful error message, e.g. pointing to r-spatial/sf#1314 , to be removed a couple of releases later?

@hadley
Copy link
Member

hadley commented Mar 26, 2020

Ah, I didn't think about the y case; that's definitely worth fixing. I'll think about it and see what I can come up with.

hadley pushed a commit that referenced this issue Mar 27, 2020
This ensures that we always know what methods are being called, and predicts us against variations in subclass implementations. The main downside of this approach is that we can no longer translate grouping variable names in the presence of ambiguity

Fixes #4917
hadley added a commit that referenced this issue Mar 31, 2020
This ensures that we always know what methods are being called, predicts us against variations in subclass implementations, and avoid dev vctrs group re-computation. The main downside of this approach is that we can no longer translate grouping variable names in the presence of ambiguity

Fixes #4917
@hadley
Copy link
Member

hadley commented Mar 31, 2020

Ok, I'm pretty happy with the current fix — we'll need to keep thinking about what the optimal behaviour is here, but at least sf objects work once more.

@edzer
Copy link

edzer commented Mar 31, 2020

Great, thanks a lot!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior tables 🧮 joins and set operations
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants