Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

union the exactly same two copies gives different number of rows #3238

Closed
jerryzhujian9 opened this issue Dec 2, 2017 · 12 comments
Closed

union the exactly same two copies gives different number of rows #3238

jerryzhujian9 opened this issue Dec 2, 2017 · 12 comments

Comments

@jerryzhujian9
Copy link

@jerryzhujian9 jerryzhujian9 commented Dec 2, 2017

tested with latest tidyverse

setequal(union(iris,iris),iris)
union(iris,iris) (149) has different number of rows from iris (150)! How can this be?

@cderv
Copy link
Contributor

@cderv cderv commented Dec 2, 2017

I often think that data.frame with rownames does not play well with dplyr. It may be related here.
If I pass rownames to columns, it works as expected.

library(dplyr)
#> 
#> Attachement du package : 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
# with rownames
intersect(iris, iris) %>% nrow()
#> [1] 149
union(iris, iris) %>% nrow()
#> [1] 149
setequal(union(iris,iris),iris)
#> FALSE: Different number of rows
# Without rownames 
iris <- tibble::rownames_to_column(iris)
intersect(iris, iris) %>% nrow()
#> [1] 150
union(iris, iris) %>% nrow()
#> [1] 150
setequal(union(iris,iris),iris)
#> TRUE

Created on 2017-12-02 by the reprex package (v0.1.1.9000).

I think it is a good practice to not have rownames when working with tidyverse.
From the tibble rownames documentation

Generally, it is best to avoid row names, because they are basically a character column with different semantics to every other column.
It is why tibble provides some helpers to delete them or convert back and forth.

However, I am not sure why this behaviour with union and others. it is not consistent with base function. Don't know if it is intended or not.

Hope It helps.

@JohnMount
Copy link

@JohnMount JohnMount commented Dec 2, 2017

I think some of this is that one of dplyr::union_all() or dplyr::bind_rows() may be the better operator for what I assume is your intended application (union knocks out duplicate rows!). Also definitely do not use rownames with dplyr, tibble::rownames_to_column() is strongly advised.

@jerryzhujian9
Copy link
Author

@jerryzhujian9 jerryzhujian9 commented Dec 4, 2017

Thank you all for the insights. How about issuing a warning message then when having a input with rownames?

@jerryzhujian9
Copy link
Author

@jerryzhujian9 jerryzhujian9 commented Dec 6, 2017

Just a follow-up:

a=b=iris
rownames(a) <- NULL
rownames(b) <- NULL
union(a,b) %>% nrow  # still 149.  

Seems rownames() <- NULL would not have an effect here

@MZLABS
Copy link

@MZLABS MZLABS commented Dec 7, 2017

There is a duplicate row and union() removes duplicates:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
iris %>% group_by_at(., colnames(.)) %>% summarize(n = n()) %>% filter(n > 1)
#> # A tibble: 1 x 6
#> # Groups:   Sepal.Length, Sepal.Width, Petal.Length, Petal.Width [1]
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species     n
#>          <dbl>       <dbl>        <dbl>       <dbl>    <fctr> <int>
#> 1          5.8         2.7          5.1         1.9 virginica     2

Please try union_all:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
union_all(iris, iris) %>% nrow()
#> [1] 300

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Dec 12, 2017

Thanks. Looks like union() does work as expected. Maybe we should be explicit in the documentation that duplicate rows are removed?

@edublancas
Copy link
Contributor

@edublancas edublancas commented Mar 2, 2018

I can help with this, what is the best way to proceed? Add the clarification at the beginning of the docs, add one example, both?

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Mar 5, 2018

Clarification and example would be great. Does this affect intersect() and other verbs?

@batpigandme
Copy link
Member

@batpigandme batpigandme commented Mar 24, 2018

@edublancas, are you still interested in submitting a PR for this? If so, awesome, and, if not, that's totally fine, too. If you wouldn't mind letting us know when you get a chance (or if you have any questions), that'd be great so I know whether to add to my TODO list! 👍
Thanks.

@edublancas
Copy link
Contributor

@edublancas edublancas commented Mar 26, 2018

@batpigandme Yes, I can work on this. I can probably find some time to do it during this week. Will post updates here.

edublancas pushed a commit to edublancas/dplyr that referenced this issue Mar 30, 2018
@edublancas
Copy link
Contributor

@edublancas edublancas commented Mar 30, 2018

Submitted PR (#3474), let me know what you think.

@krlmlr krlmlr closed this in #3474 Apr 1, 2018
krlmlr added a commit that referenced this issue Apr 1, 2018
* Improved documentation for set operations (#3238, @edublancas).
@lock
Copy link

@lock lock bot commented Sep 28, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Sep 28, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
7 participants