Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

inner_join() reorders output #684

Closed
ThomasSiegmund opened this issue Oct 11, 2014 · 21 comments
Closed

inner_join() reorders output #684

ThomasSiegmund opened this issue Oct 11, 2014 · 21 comments
Assignees
Labels
bug an unexpected problem or unintended behavior
Milestone

Comments

@ThomasSiegmund
Copy link

As discussed on StackOverflow

http://stackoverflow.com/questions/26279548/how-to-replace-lookup-based-on-row-names-when-using-dplyr

> library(dplyr)
> test <- data_frame(Greek = c("Alpha", "Beta", "Gamma"), Letters = LETTERS[1:3])
> lookup <- data_frame(Letters = c("C", "B", "C"))
> left_join(lookup, test)
Joining by: "Letters"
Source: local data frame [3 x 2]

  Letters Greek
1       C Gamma
2       B  Beta
3       C Gamma
> inner_join(lookup, test)
Joining by: "Letters"
Source: local data frame [3 x 2]

  Letters Greek
1       B  Beta
2       C Gamma
3       C Gamma
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-suse-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8       
 [4] LC_COLLATE=de_DE.UTF-8     LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.3.0.9000

loaded via a namespace (and not attached):
[1] assertthat_0.1 DBI_0.3.1      lazyeval_0.1.9 magrittr_1.0.1 parallel_3.1.1
[6] Rcpp_0.11.3    tools_3.1.1   
@martinlenardon
Copy link

I see similar behavior with left_join--the ordering is not necessarily preserved nor are the levels themselves:

> require(dplyr)
> 
> x <- c("Small", "Medium", "Large", "Very Large")
> data1 <- data.frame(x=x)
> data1$x <- factor(data1$x, levels=c("Very Small", x), ordered=TRUE)
> print(levels(data1$x))
[1] "Very Small" "Small"      "Medium"     "Large"      "Very Large"
> 
> data2 <- data.frame(x=x, y=1:4)
> data2$x <- factor(data2$x, levels=c("Very Small", x), ordered=TRUE)
> print(levels(data2$x))
[1] "Very Small" "Small"      "Medium"     "Large"      "Very Large"
> 
> data3 <- left_join(data1, data2, by="x")
> print(levels(data3$x))
[1] "Large"      "Medium"     "Small"      "Very Large"
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.3.0.1

loaded via a namespace (and not attached):
[1] assertthat_0.1 DBI_0.3.1      magrittr_1.0.0 parallel_3.1.0 Rcpp_0.11.3    tools_3.1.0

@hadley
Copy link
Member

hadley commented Oct 30, 2014

Duplicate of #730.

@hadley hadley closed this as completed Oct 30, 2014
@hadley hadley reopened this Nov 4, 2014
@hadley
Copy link
Member

hadley commented Nov 4, 2014

Not duplicate, this is about data frames

@hadley hadley added the bug an unexpected problem or unintended behavior label Nov 4, 2014
@hadley hadley added this to the 0.3.1 milestone Nov 4, 2014
@jennybc
Copy link
Member

jennybc commented Nov 4, 2014

I confirm similar problems with left_join(), meaning the factor in the joined output data.frame is like it's been through a conversion to character and then been re-processed through factor(). Unused levels have been dropped and the levels revert to alphabetical order. I am not concerned about reordering of the output but I am surprised about the changed factor levels.

I noticed this updating a Data Carpentry lesson that I last touched on August 8th. The new behaviour of dplyr::left_join() caused unexpected changes in my output. So the problem was introduced since August 8th, in case that helps.

@romainfrancois
Copy link
Member

For left_join I get:

> left_join(lookup, test)
Joining by: "Letters"

Source: local data frame [3 x 2]

  Letters Greek
1       C Gamma
2       B  Beta
3       C Gamma

That's what I expect.

@romainfrancois
Copy link
Member

Right, about the ordering of inner_join this is related to choosing different paths based on the relative sizes of the data frames.

In that case because nrow(lookup) <= nrow(test) it goes to the first path, and trains the data in the order of lookup hence the ordering.

This was added based on some advice that we should train based on the smaller data set. If we want to preserve ordering based invariably on the order of the second data set, then this superseeds the hack and we should just drop the corresponding branch.

    if( n_x <= n_y ){
        // train the map in terms of x
        train_push_back( map, n_x ) ;

        for( int i=0; i<n_y; i++){
            // find indices for rows in x that match the row i in y
            Map::iterator it = map.find(-i-1) ;
            if( it != map.end() ){
                push_back( indices_x, it->second );
                push_back( indices_y, i, it->second.size() ) ;
            }
        }
    } else {
        train_push_back_right( map, n_y ) ;

        for( int i=0; i<n_x; i++){
            Map::iterator it = map.find(i) ;
            if( it != map.end() ){
                push_back_right( indices_y, it->second );
                push_back( indices_x, i, it->second.size() ) ;
            }
        }
    }

@romainfrancois
Copy link
Member

@hadley should I just drop the if( n_x <= n_y ){ branch, and opt for ordering correctness instead of speed ?

The alternative is to ensure that indices_x is filled in order and indices_y filled accordingly. This might complicate the code and potentially lose the effect of the hack.

@romainfrancois
Copy link
Member

@martinlenardon which version did you use ? with the current level version, I get:

print(levels(data3$x))
# [1] "Very Small" "Small"      "Medium"     "Large"      "Very Large"

@romainfrancois
Copy link
Member

@jennybc can you spare a reprex please ?

@martinlenardon
Copy link

That was dplyr_0.3.0.1, here's what I get after installing the latest from github:

> require(dplyr)
> 
> x <- c("Small", "Medium", "Large", "Very Large")
> data1 <- data.frame(x=x)
> data1$x <- factor(data1$x, levels=c("Very Small", x), ordered=TRUE)
> print(levels(data1$x))
[1] "Very Small" "Small"      "Medium"     "Large"      "Very Large"
> 
> 
> data2 <- data.frame(x=x, y=1:4)
> data2$x <- factor(data2$x, levels=c("Very Small", x), ordered=TRUE)
> print(levels(data2$x))
[1] "Very Small" "Small"      "Medium"     "Large"      "Very Large"
> 
> 
> data3 <- left_join(data1, data2, by="x")
> print(levels(data3$x))
[1] "Large"      "Medium"     "Small"      "Very Large"

So I still get the reduced factor levels. On the other hand, I have not updated my R version, so I can give that a try, too:

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.3.0.9000

loaded via a namespace (and not attached):
[1] assertthat_0.1 DBI_0.3.1      magrittr_1.0.0 parallel_3.1.0 Rcpp_0.11.3    tools_3.1.0   

@romainfrancois
Copy link
Member

@martinlenardon well I get:

> x <- c("Small", "Medium", "Large", "Very Large")
>  data1 <- data.frame(x=x)
> data1$x <- factor(data1$x, levels=c("Very Small", x), ordered=TRUE)
> levels(data1$x)
[1] "Very Small" "Small"      "Medium"     "Large"      "Very Large"
> data2 <- data.frame(x=x, y=1:4)
>
> data2$x <- factor(data2$x, levels=c("Very Small", x), ordered=TRUE)
> levels(data2$x)
[1] "Very Small" "Small"      "Medium"     "Large"      "Very Large"
> data3 <- left_join(data1, data2, by="x")
> levels(data3$x)
[1] "Very Small" "Small"      "Medium"     "Large"      "Very Large"

with

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] testthat_0.8.1.0.99 dplyr_0.3.0.9000    Rcpp_0.11.3
[4] magrittr_1.5        devtools_1.5

loaded via a namespace (and not attached):
 [1] assertthat_0.1 BH_1.54.0-2    DBI_0.3.1      digest_0.6.4   evaluate_0.5.5
 [6] httr_0.2       lazyeval_0.1.9 memoise_0.1    parallel_3.1.1 R6_2.0.0.9000
[11] RCurl_1.95-4.1 stringr_0.6.2  tools_3.1.1    whisker_0.3-2

@martinlenardon
Copy link

Yep. I get the same behavior after upgrading:

> require(dplyr)
> 
> x <- c("Small", "Medium", "Large", "Very Large")
> data1 <- data.frame(x=x)
> data1$x <- factor(data1$x, levels=c("Very Small", x), ordered=TRUE)
> print(levels(data1$x))
[1] "Very Small" "Small"      "Medium"     "Large"      "Very Large"
> 
>  
> data2 <- data.frame(x=x, y=1:4)
> data2$x <- factor(data2$x, levels=c("Very Small", x), ordered=TRUE)
> print(levels(data2$x))
[1] "Very Small" "Small"      "Medium"     "Large"      "Very Large"
> 
> 
> data3 <- left_join(data1, data2, by="x")
> print(levels(data3$x))
[1] "Very Small" "Small"      "Medium"     "Large"      "Very Large"
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.3.0.9000

loaded via a namespace (and not attached):
[1] assertthat_0.1 DBI_0.3.1      magrittr_1.0.1 parallel_3.1.2 Rcpp_0.11.3    tools_3.1.2 

@jennybc
Copy link
Member

jennybc commented Nov 5, 2014

I will post a reprex later today.

I stripped my real example way down and the factor level mangling went away (!). Then I re-ran actual example and problem persists. It will take a little work to find simplest version that demonstrates the undesirable behaviour with left_join().

@jennybc
Copy link
Member

jennybc commented Nov 5, 2014

Here is my actual example as a Gist with an R script and a compiled Markdown notebook:

https://gist.github.com/jennybc/bd1626735dd221c30f1a

I am left_join()ing data.frames that each contain a Film factor (the Lord of the Rings movies), with identical levels, where the movies are in story order, not alphabetical order. Ironically, the reason I'm doing this join is because zero-length groups are dropped by group_by() (#341).

The output of left_join() has the correct levels but order has reverted to alphabetical.

If I get the time, I'll try to strip it down further.

@jennybc
Copy link
Member

jennybc commented Nov 5, 2014

UPDATED. OK here is a much smaller example that demonstrates the re-alphabetizing of factor levels.

m_days <- data_frame(mon = factor(c("jan", "feb", "apr"),
                                  c("jan", "feb", "mar", "apr")),
                     n_days = c(31, 28, 30))
m_equi <- data_frame(mon = factor(c("jan", "feb", "mar", "apr"),
                                  c("jan", "feb", "mar", "apr")),
                     has_equinox = c(FALSE, FALSE, TRUE, FALSE))
m_joined <- left_join(m_equi, m_days)
levels(m_days$mon)
levels(m_equi$mon)
levels(m_joined$mon)

Here's what that looks like when run:

> library(dplyr)
> m_days <- data_frame(mon = factor(c("jan", "feb", "apr"),
+ c("jan", "feb", "mar", "apr")),
+ n_days = c(31, 28, 30))
> m_equi <- data_frame(mon = factor(c("jan", "feb", "mar", "apr"),
+ c("jan", "feb", "mar", "apr")),
+ has_equinox = c(FALSE, FALSE, TRUE, FALSE))
> m_joined <- left_join(m_equi, m_days)
Joining by: "mon"
> levels(m_days$mon)
[1] "jan" "feb" "mar" "apr"
> levels(m_equi$mon)
[1] "jan" "feb" "mar" "apr"
> levels(m_joined$mon)
[1] "apr" "feb" "jan" "mar"

@hadley
Copy link
Member

hadley commented Nov 5, 2014

@jennybc oh hmmmm, that's a more complicated problem because you have two factors with different levels. Currently I think we're only treating factors as equivalent if the levels are exactly equal, but this suggests we should also consider factors to be equivalent if the levels of one are a subset of the other.

(But if the levels aren't equal, you're supposed to get a string, not a factor (with a warning maybe?) so that's another bug)

@jennybc
Copy link
Member

jennybc commented Nov 5, 2014

Sorry @hadley my first post of the simpler example had a mistake -- the different levels. I then updated so they have the same levels and the problem still happens. I understand that all bets are off if the levels are different for the input factors.

@romainfrancois
Copy link
Member

@hadley . I did not realize that factors with different levels should produce a character vector.

@jennybc I get:

> m_days <- data_frame(mon = factor(c("jan", "feb", "apr"),
+                                   c("jan", "feb", "mar", "apr")),
+                      n_days = c(31, 28, 30))
> m_equi <- data_frame(mon = factor(c("jan", "feb", "mar", "apr"),
+                                   c("jan", "feb", "mar", "apr")),
+                      has_equinox = c(FALSE, FALSE, TRUE, FALSE))
> m_joined <- left_join(m_equi, m_days)
Joining by: "mon"
> levels(m_days$mon)
[1] "jan" "feb" "mar" "apr"
> levels(m_equi$mon)
[1] "jan" "feb" "mar" "apr"
> levels(m_joined$mon)
[1] "jan" "feb" "mar" "apr"

Ok so the imp I got so far is this:

  • if both factors have the same levels in the same order, they are kept.
  • if factors have different levels, or the levels are not in the same order, order is not retained.
  • if one of the vectors is a character vector then it becomes a character vector.

So perhaps the second case should produce a character vector.

I guess when levels of one are subset of the other one could indeed represent the same concept, so perhaps we should deal with that, and e.g. keep the ordering of the bigger one ?

@hadley
Copy link
Member

hadley commented Nov 6, 2014

This is the behaviour I expect:

  • If both factors have the same levels in the same order, they're kept as a factor.
  • For every other case (e.g. levels don't match, one is a character), the variable is coerced to a character (with a message)

I think it would be nice to have special behaviour if the levels of one are a subset of the other, but I don't think it's that important. (Just as long as there's a message so the user knows what they need to do to fix the problem).

I don't think we should ever create a factor that has a different levels (value or order) to the input.

@romainfrancois
Copy link
Member

I think it's all taken care of now. This simplified code for factor/factor with different levels. Please reopen if I forgot something.

@jennybc
Copy link
Member

jennybc commented Nov 6, 2014

FWIW prior to upgrade I was using dplyr 0.3.0.1 installed from e0ba9d2.

I just upgraded to dplyr 0.3.0.9000 installed from 8b0191a; reran my real example and my small one from above and the factor level handling is fixed. 🎁 So thanks!

romainfrancois added a commit that referenced this issue Nov 12, 2014
@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

5 participants