Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: Customize the suffixes of columns after join #1296

Closed
jianboli opened this issue Jul 30, 2015 · 7 comments
Closed

Feature Request: Customize the suffixes of columns after join #1296

jianboli opened this issue Jul 30, 2015 · 7 comments
Labels
Milestone

Comments

@jianboli
Copy link

@jianboli jianboli commented Jul 30, 2015

I do a lot of multiple table joins, and think the ability to customize the suffixes of the columns (like the "suffixes" parameter in function merge) in join would be very helpful. For example, suppose we would like to calculate the moving average of some column of a data.frame in the SQL way:

library(plyr)
library(dplyr)
library(zoo)
set.seed(1)
x <- data.frame(DATE=seq.Date(as.Date('2014-01-01'), by = 'd', length.out = 10), 
                X = runif(10), 
                Y = 1:10)
# as dplyr does not support DATE for now, let's convert the Date to numeric
x$DATE <- as.numeric(x$DATE)

# let's caluate the 3 day moving average of X
x %>% 
  inner_join(x %>% mutate(DATE = DATE + 1), by = 'DATE') %>%
  rename(X = X.x, X.Lag1 = X.y) %>% # rename to keep track of the variables
  inner_join(x %>% mutate(DATE = DATE + 2), by = 'DATE') %>%
  rename(X = X.x, X.Lag2 = X.y) %>%
  mutate(DATE = as.Date(DATE), MeanX = (X + X.Lag1 + X.Lag2)/3) %>%
  select(DATE, MeanX) %>%
  print()

To keep track of the variables information, we need to keep renaming the names. If not, the naming system will become a little odd:

x %>% 
  inner_join(x %>% mutate(DATE = DATE + 1), by = 'DATE') %>%
  inner_join(x %>% mutate(DATE = DATE + 2), by = 'DATE') %>%
  mutate(DATE = as.Date(DATE), MeanX = (X.x + X.y + X)/3) %>% # Note: not X.z or X.y.y but X!
  select(DATE, MeanX) %>%
  print()

It would be good to be able to do something like:

x %>% 
  inner_join(x %>% mutate(DATE = DATE + 1), by = 'DATE', suffixes = c("", ".Lag1")) %>%
  inner_join(x %>% mutate(DATE = DATE + 2), by = 'DATE', suffixes = c("", ".Lag2")) %>%
  mutate(DATE = as.Date(DATE), MeanX = (X + X.Lag1 + X.Lag2)/3) %>% 
  select(DATE, MeanX) %>%
  print()

This would be a mimic of the following SQL:

SELECT x.DATE, (x.X+Lag1.X+Lag2.X)/3 as MeanX From x
INNER JOIN x as Lag1 on Lag1.Date = x.Date-1
INNER JOIN x as Lag2 on Lag2.Date = x.Date-2

In such a case, it would be pretty straightforward to join more tables together.

Thanks for the great work.

@mczapanskiy-usgs
Copy link

@mczapanskiy-usgs mczapanskiy-usgs commented Jul 31, 2015

Good idea - aligns dplyr and SQL joins nicely. I'm curious, why use suffixes and give it a vector? In SQL you give a table one alias that gets prepended to each variable. With that in mind, what do you think of something like this:

x %>% 
  inner_join(x %>% mutate(DATE = DATE + 1), by = 'DATE', as = "Lag1") %>%
  inner_join(x %>% mutate(DATE = DATE + 2), by = 'DATE', as = "Lag2") %>%
  mutate(DATE = as.Date(DATE), MeanX = (X + Lag1.X + Lag2.X)/3) %>% 
  select(DATE, MeanX) %>%
  print()

@jianboli
Copy link
Author

@jianboli jianboli commented Aug 2, 2015

I agree that "as" would be much better. I proposed to customize the suffix because it looks much easier to implement. We already have the '.x' and '.y' appended if the column name are duplicated.

@hadley hadley added the feature label Aug 24, 2015
@hadley hadley added this to the 0.5 milestone Aug 24, 2015
@zerweck
Copy link

@zerweck zerweck commented Feb 2, 2016

(Seems to also relate to #148)

This is a function i wrote to rename a tbl object w prefixes/suffixes. In a long chain of joins, I just wrap the x and y arguments in this function. It solves my problem of often joining 10 tables or more, which all have an "id" or an "group_id" variable etc. (After 10 joins, I don't know where id.x.x.y.x came from anymore).

It does not work in pipes with the default prefix argument, because deparse(substitute()) does not work in pipes. For a limited workaround see tidyverse/magrittr#115.

tbl_affixer <-
  function(tbl,
           prefix = deparse(substitute(tbl)), # Default uses tbl R-object name. Does not work in pipes!
           suffix = NULL,
           sep = ".",
           index = seq_along(tbl_vars(tbl))) { # Select vars to be renamed. Default: All
    newnames <- tbl_vars(tbl)
    newnames[index] <- paste0(prefix,
                              sep,
                              tbl_vars(tbl),
                              suffix)[index]
    rename_(.data = tbl,
            .dots = setNames(tbl_vars(tbl), newnames))
  }

@hadley
Copy link
Member

@hadley hadley commented Mar 1, 2016

@mczapanskiy-usgs which table does as apply to? The right or the left? What if you want to assign a unique prefix to both tables?

@zerweck
Copy link

@zerweck zerweck commented Mar 1, 2016

It should probably refer to the right table, since in multiple chained joins, you usually add a new y, right? The piping output will be assigned to x if not explicitly declared, so multiple joins would default to making the x table have longer and longer prefixes.

@hadley
Copy link
Member

@hadley hadley commented Mar 1, 2016

@zerweck that's not how dplyr works currently, so I think that approach wouldn't be backward compatible.

@hadley hadley closed this in 886799b Mar 7, 2016
@deeenes
Copy link

@deeenes deeenes commented Sep 22, 2017

I think this is still a valid feature request, would be very useful to "force" suffixes even if column names are not duplicates. Except the columns in by. It should look something like this:

j <- x %>%
full_join(y, by = c('c1', ...), suffix = c('.x', '.y')) %>%
full_join(z, by = c('c1', ...), suffix = c('', '.z'), suffixes_mandatory = c(FALSE, TRUE))

But instead off adding a new boolean arg I think better to add an alternative argument to supply mandatory suffixes, which have priority over the ones applied only in case of duplicate colnames:

j <- x %>%
full_join(y, by = c('c1', ...), suffix = c('.x', '.y')) %>%
full_join(z, by = c('c1', ...), suffix_mandatory = c('', '.z'))

p.s. It is funny you can workaround this easily if you chain-join an even number of tables but not with odd numbers. At even numbers you can simply add the suffix of the odd tables as the left suffix at the next even:

j <- x %>%
full_join(y, by = c('c1', ...), suffix = c('.x', '.y')) %>%
full_join(z, by = c('c1', ...), suffix = c('foo', 'bar')) %>%
full_join(q, by = c('c1', ...), suffix = c('.z', '.q'))

@lock lock bot locked as resolved and limited conversation to collaborators Jun 7, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants