inner_join() reorders output #684

ThomasSiegmund · 2014-10-11T12:39:46Z

As discussed on StackOverflow

http://stackoverflow.com/questions/26279548/how-to-replace-lookup-based-on-row-names-when-using-dplyr

> library(dplyr)
> test <- data_frame(Greek = c("Alpha", "Beta", "Gamma"), Letters = LETTERS[1:3])
> lookup <- data_frame(Letters = c("C", "B", "C"))
> left_join(lookup, test)
Joining by: "Letters"
Source: local data frame [3 x 2]

  Letters Greek
1       C Gamma
2       B  Beta
3       C Gamma
> inner_join(lookup, test)
Joining by: "Letters"
Source: local data frame [3 x 2]

  Letters Greek
1       B  Beta
2       C Gamma
3       C Gamma
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-suse-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8       
 [4] LC_COLLATE=de_DE.UTF-8     LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.3.0.9000

loaded via a namespace (and not attached):
[1] assertthat_0.1 DBI_0.3.1      lazyeval_0.1.9 magrittr_1.0.1 parallel_3.1.1
[6] Rcpp_0.11.3    tools_3.1.1

martinlenardon · 2014-10-16T15:25:44Z

I see similar behavior with left_join--the ordering is not necessarily preserved nor are the levels themselves:

> require(dplyr)
> 
> x <- c("Small", "Medium", "Large", "Very Large")
> data1 <- data.frame(x=x)
> data1$x <- factor(data1$x, levels=c("Very Small", x), ordered=TRUE)
> print(levels(data1$x))
[1] "Very Small" "Small"      "Medium"     "Large"      "Very Large"
> 
> data2 <- data.frame(x=x, y=1:4)
> data2$x <- factor(data2$x, levels=c("Very Small", x), ordered=TRUE)
> print(levels(data2$x))
[1] "Very Small" "Small"      "Medium"     "Large"      "Very Large"
> 
> data3 <- left_join(data1, data2, by="x")
> print(levels(data3$x))
[1] "Large"      "Medium"     "Small"      "Very Large"
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.3.0.1

loaded via a namespace (and not attached):
[1] assertthat_0.1 DBI_0.3.1      magrittr_1.0.0 parallel_3.1.0 Rcpp_0.11.3    tools_3.1.0

hadley · 2014-10-30T21:27:55Z

Duplicate of #730.

hadley · 2014-11-04T15:07:55Z

Not duplicate, this is about data frames

jennybc · 2014-11-04T23:51:16Z

I confirm similar problems with left_join(), meaning the factor in the joined output data.frame is like it's been through a conversion to character and then been re-processed through factor(). Unused levels have been dropped and the levels revert to alphabetical order. I am not concerned about reordering of the output but I am surprised about the changed factor levels.

I noticed this updating a Data Carpentry lesson that I last touched on August 8th. The new behaviour of dplyr::left_join() caused unexpected changes in my output. So the problem was introduced since August 8th, in case that helps.

romainfrancois · 2014-11-05T14:35:32Z

For left_join I get:

> left_join(lookup, test)
Joining by: "Letters"

Source: local data frame [3 x 2]

  Letters Greek
1       C Gamma
2       B  Beta
3       C Gamma

That's what I expect.

romainfrancois · 2014-11-05T14:44:13Z

Right, about the ordering of inner_join this is related to choosing different paths based on the relative sizes of the data frames.

In that case because nrow(lookup) <= nrow(test) it goes to the first path, and trains the data in the order of lookup hence the ordering.

This was added based on some advice that we should train based on the smaller data set. If we want to preserve ordering based invariably on the order of the second data set, then this superseeds the hack and we should just drop the corresponding branch.

    if( n_x <= n_y ){
        // train the map in terms of x
        train_push_back( map, n_x ) ;

        for( int i=0; i<n_y; i++){
            // find indices for rows in x that match the row i in y
            Map::iterator it = map.find(-i-1) ;
            if( it != map.end() ){
                push_back( indices_x, it->second );
                push_back( indices_y, i, it->second.size() ) ;
            }
        }
    } else {
        train_push_back_right( map, n_y ) ;

        for( int i=0; i<n_x; i++){
            Map::iterator it = map.find(i) ;
            if( it != map.end() ){
                push_back_right( indices_y, it->second );
                push_back( indices_x, i, it->second.size() ) ;
            }
        }
    }

romainfrancois · 2014-11-05T14:49:53Z

@hadley should I just drop the if( n_x <= n_y ){ branch, and opt for ordering correctness instead of speed ?

The alternative is to ensure that indices_x is filled in order and indices_y filled accordingly. This might complicate the code and potentially lose the effect of the hack.

romainfrancois · 2014-11-05T14:53:19Z

@martinlenardon which version did you use ? with the current level version, I get:

print(levels(data3$x))
# [1] "Very Small" "Small"      "Medium"     "Large"      "Very Large"

romainfrancois · 2014-11-05T14:54:53Z

@jennybc can you spare a reprex please ?

martinlenardon · 2014-11-05T15:01:47Z

That was dplyr_0.3.0.1, here's what I get after installing the latest from github:

> require(dplyr)
> 
> x <- c("Small", "Medium", "Large", "Very Large")
> data1 <- data.frame(x=x)
> data1$x <- factor(data1$x, levels=c("Very Small", x), ordered=TRUE)
> print(levels(data1$x))
[1] "Very Small" "Small"      "Medium"     "Large"      "Very Large"
> 
> 
> data2 <- data.frame(x=x, y=1:4)
> data2$x <- factor(data2$x, levels=c("Very Small", x), ordered=TRUE)
> print(levels(data2$x))
[1] "Very Small" "Small"      "Medium"     "Large"      "Very Large"
> 
> 
> data3 <- left_join(data1, data2, by="x")
> print(levels(data3$x))
[1] "Large"      "Medium"     "Small"      "Very Large"

So I still get the reduced factor levels. On the other hand, I have not updated my R version, so I can give that a try, too:

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.3.0.9000

loaded via a namespace (and not attached):
[1] assertthat_0.1 DBI_0.3.1      magrittr_1.0.0 parallel_3.1.0 Rcpp_0.11.3    tools_3.1.0

romainfrancois · 2014-11-05T15:08:51Z

@martinlenardon well I get:

> x <- c("Small", "Medium", "Large", "Very Large")
>  data1 <- data.frame(x=x)
> data1$x <- factor(data1$x, levels=c("Very Small", x), ordered=TRUE)
> levels(data1$x)
[1] "Very Small" "Small"      "Medium"     "Large"      "Very Large"
> data2 <- data.frame(x=x, y=1:4)
>
> data2$x <- factor(data2$x, levels=c("Very Small", x), ordered=TRUE)
> levels(data2$x)
[1] "Very Small" "Small"      "Medium"     "Large"      "Very Large"
> data3 <- left_join(data1, data2, by="x")
> levels(data3$x)
[1] "Very Small" "Small"      "Medium"     "Large"      "Very Large"

with

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] testthat_0.8.1.0.99 dplyr_0.3.0.9000    Rcpp_0.11.3
[4] magrittr_1.5        devtools_1.5

loaded via a namespace (and not attached):
 [1] assertthat_0.1 BH_1.54.0-2    DBI_0.3.1      digest_0.6.4   evaluate_0.5.5
 [6] httr_0.2       lazyeval_0.1.9 memoise_0.1    parallel_3.1.1 R6_2.0.0.9000
[11] RCurl_1.95-4.1 stringr_0.6.2  tools_3.1.1    whisker_0.3-2

martinlenardon · 2014-11-05T16:26:55Z

Yep. I get the same behavior after upgrading:

> require(dplyr)
> 
> x <- c("Small", "Medium", "Large", "Very Large")
> data1 <- data.frame(x=x)
> data1$x <- factor(data1$x, levels=c("Very Small", x), ordered=TRUE)
> print(levels(data1$x))
[1] "Very Small" "Small"      "Medium"     "Large"      "Very Large"
> 
>  
> data2 <- data.frame(x=x, y=1:4)
> data2$x <- factor(data2$x, levels=c("Very Small", x), ordered=TRUE)
> print(levels(data2$x))
[1] "Very Small" "Small"      "Medium"     "Large"      "Very Large"
> 
> 
> data3 <- left_join(data1, data2, by="x")
> print(levels(data3$x))
[1] "Very Small" "Small"      "Medium"     "Large"      "Very Large"

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.3.0.9000

loaded via a namespace (and not attached):
[1] assertthat_0.1 DBI_0.3.1      magrittr_1.0.1 parallel_3.1.2 Rcpp_0.11.3    tools_3.1.2

jennybc · 2014-11-05T17:55:58Z

I will post a reprex later today.

I stripped my real example way down and the factor level mangling went away (!). Then I re-ran actual example and problem persists. It will take a little work to find simplest version that demonstrates the undesirable behaviour with left_join().

jennybc · 2014-11-05T18:50:53Z

Here is my actual example as a Gist with an R script and a compiled Markdown notebook:

https://gist.github.com/jennybc/bd1626735dd221c30f1a

I am left_join()ing data.frames that each contain a Film factor (the Lord of the Rings movies), with identical levels, where the movies are in story order, not alphabetical order. Ironically, the reason I'm doing this join is because zero-length groups are dropped by group_by() (#341).

The output of left_join() has the correct levels but order has reverted to alphabetical.

If I get the time, I'll try to strip it down further.

jennybc · 2014-11-05T21:01:08Z

UPDATED. OK here is a much smaller example that demonstrates the re-alphabetizing of factor levels.

m_days <- data_frame(mon = factor(c("jan", "feb", "apr"),
                                  c("jan", "feb", "mar", "apr")),
                     n_days = c(31, 28, 30))
m_equi <- data_frame(mon = factor(c("jan", "feb", "mar", "apr"),
                                  c("jan", "feb", "mar", "apr")),
                     has_equinox = c(FALSE, FALSE, TRUE, FALSE))
m_joined <- left_join(m_equi, m_days)
levels(m_days$mon)
levels(m_equi$mon)
levels(m_joined$mon)

Here's what that looks like when run:

> library(dplyr)
> m_days <- data_frame(mon = factor(c("jan", "feb", "apr"),
+ c("jan", "feb", "mar", "apr")),
+ n_days = c(31, 28, 30))
> m_equi <- data_frame(mon = factor(c("jan", "feb", "mar", "apr"),
+ c("jan", "feb", "mar", "apr")),
+ has_equinox = c(FALSE, FALSE, TRUE, FALSE))
> m_joined <- left_join(m_equi, m_days)
Joining by: "mon"
> levels(m_days$mon)
[1] "jan" "feb" "mar" "apr"
> levels(m_equi$mon)
[1] "jan" "feb" "mar" "apr"
> levels(m_joined$mon)
[1] "apr" "feb" "jan" "mar"

hadley · 2014-11-05T22:54:51Z

@jennybc oh hmmmm, that's a more complicated problem because you have two factors with different levels. Currently I think we're only treating factors as equivalent if the levels are exactly equal, but this suggests we should also consider factors to be equivalent if the levels of one are a subset of the other.

(But if the levels aren't equal, you're supposed to get a string, not a factor (with a warning maybe?) so that's another bug)

jennybc · 2014-11-05T23:15:15Z

Sorry @hadley my first post of the simpler example had a mistake -- the different levels. I then updated so they have the same levels and the problem still happens. I understand that all bets are off if the levels are different for the input factors.

romainfrancois · 2014-11-06T07:54:34Z

@hadley . I did not realize that factors with different levels should produce a character vector.

@jennybc I get:

> m_days <- data_frame(mon = factor(c("jan", "feb", "apr"),
+                                   c("jan", "feb", "mar", "apr")),
+                      n_days = c(31, 28, 30))
> m_equi <- data_frame(mon = factor(c("jan", "feb", "mar", "apr"),
+                                   c("jan", "feb", "mar", "apr")),
+                      has_equinox = c(FALSE, FALSE, TRUE, FALSE))
> m_joined <- left_join(m_equi, m_days)
Joining by: "mon"
> levels(m_days$mon)
[1] "jan" "feb" "mar" "apr"
> levels(m_equi$mon)
[1] "jan" "feb" "mar" "apr"
> levels(m_joined$mon)
[1] "jan" "feb" "mar" "apr"

Ok so the imp I got so far is this:

if both factors have the same levels in the same order, they are kept.
if factors have different levels, or the levels are not in the same order, order is not retained.
if one of the vectors is a character vector then it becomes a character vector.

So perhaps the second case should produce a character vector.

I guess when levels of one are subset of the other one could indeed represent the same concept, so perhaps we should deal with that, and e.g. keep the ordering of the bigger one ?

hadley · 2014-11-06T14:12:23Z

This is the behaviour I expect:

If both factors have the same levels in the same order, they're kept as a factor.
For every other case (e.g. levels don't match, one is a character), the variable is coerced to a character (with a message)

I think it would be nice to have special behaviour if the levels of one are a subset of the other, but I don't think it's that important. (Just as long as there's a message so the user knows what they need to do to fix the problem).

I don't think we should ever create a factor that has a different levels (value or order) to the input.

romainfrancois · 2014-11-06T16:48:43Z

I think it's all taken care of now. This simplified code for factor/factor with different levels. Please reopen if I forgot something.

jennybc · 2014-11-06T18:19:22Z

FWIW prior to upgrade I was using dplyr 0.3.0.1 installed from e0ba9d2.

I just upgraded to dplyr 0.3.0.9000 installed from 8b0191a; reran my real example and my small one from above and the factor level handling is fixed. 🎁 So thanks!

`inner_join` no longer reorders (#684).

Henrik-P mentioned this issue Oct 28, 2014

All join functions reorder data tables #730

Closed

hadley closed this as completed Oct 30, 2014

hadley reopened this Nov 4, 2014

hadley added the bug an unexpected problem or unintended behavior label Nov 4, 2014

hadley added this to the 0.3.1 milestone Nov 4, 2014

hadley assigned romainfrancois Nov 4, 2014

romainfrancois added a commit that referenced this issue Nov 5, 2014

inner_join no longer reorders (#684).

3d58981

romainfrancois mentioned this issue Nov 5, 2014

inner_join no longer reorders (#684). #747

Merged

romainfrancois closed this as completed in a440652 Nov 6, 2014

romainfrancois added a commit that referenced this issue Nov 12, 2014

Merge pull request #747 from hadley/inner-join-no-reordering

d203300

`inner_join` no longer reorders (#684).

lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

inner_join() reorders output #684

inner_join() reorders output #684

ThomasSiegmund commented Oct 11, 2014

martinlenardon commented Oct 16, 2014

hadley commented Oct 30, 2014

hadley commented Nov 4, 2014

jennybc commented Nov 4, 2014

romainfrancois commented Nov 5, 2014

romainfrancois commented Nov 5, 2014

romainfrancois commented Nov 5, 2014

romainfrancois commented Nov 5, 2014

romainfrancois commented Nov 5, 2014

martinlenardon commented Nov 5, 2014

romainfrancois commented Nov 5, 2014

martinlenardon commented Nov 5, 2014

jennybc commented Nov 5, 2014

jennybc commented Nov 5, 2014

jennybc commented Nov 5, 2014

hadley commented Nov 5, 2014

jennybc commented Nov 5, 2014

romainfrancois commented Nov 6, 2014

hadley commented Nov 6, 2014

romainfrancois commented Nov 6, 2014

jennybc commented Nov 6, 2014

inner_join() reorders output #684

inner_join() reorders output #684

Comments

ThomasSiegmund commented Oct 11, 2014

martinlenardon commented Oct 16, 2014

hadley commented Oct 30, 2014

hadley commented Nov 4, 2014

jennybc commented Nov 4, 2014

romainfrancois commented Nov 5, 2014

romainfrancois commented Nov 5, 2014

romainfrancois commented Nov 5, 2014

romainfrancois commented Nov 5, 2014

romainfrancois commented Nov 5, 2014

martinlenardon commented Nov 5, 2014

romainfrancois commented Nov 5, 2014

martinlenardon commented Nov 5, 2014

jennybc commented Nov 5, 2014

jennybc commented Nov 5, 2014

jennybc commented Nov 5, 2014

hadley commented Nov 5, 2014

jennybc commented Nov 5, 2014

romainfrancois commented Nov 6, 2014

hadley commented Nov 6, 2014

romainfrancois commented Nov 6, 2014

jennybc commented Nov 6, 2014