Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distinct rearranges columns separately when used with objects of class Period #2568

Closed
aammd opened this issue Mar 24, 2017 · 10 comments
Closed
Assignees
Labels
bug an unexpected problem or unintended behavior vctrs ↗️

Comments

@aammd
Copy link

aammd commented Mar 24, 2017

This bug happens when working with lubridate + dplyr. If a data.frame contains some period object of length 0s, and also some other non-date column, distinct() actually rearranges the rows in each column, creating erroneous data.

library(lubridate)
library(dplyr)

time_fruit <- data_frame(x = hm("10:30", "10:30", "14:40"),
           y = c("apple",  "apple", "banana"))

time_fruit
#> # A tibble: 3 × 2
#>              x      y
#>   <S4: Period>  <chr>
#> 1   10H 30M 0S  apple
#> 2   10H 30M 0S  apple
#> 3   14H 40M 0S banana

distinct(time_fruit)
#> # A tibble: 2 × 2
#>              x      y
#>   <S4: Period>  <chr>
#> 1   10H 30M 0S  apple
#> 2   10H 30M 0S banana

time_fruit2 <- data_frame(x = hm("10:30", "10:30", "14:40", "0:0"),
                         y = c("apple",  "apple", "banana", "tomato"))

time_fruit2
#> # A tibble: 4 × 2
#>              x      y
#>   <S4: Period>  <chr>
#> 1   10H 30M 0S  apple
#> 2   10H 30M 0S  apple
#> 3   14H 40M 0S banana
#> 4           0S tomato
distinct(time_fruit2)
#> # A tibble: 3 × 2
#>              x      y
#>   <S4: Period>  <chr>
#> 1   10H 30M 0S  apple
#> 2   10H 30M 0S banana
#> 3   14H 40M 0S tomato
Session info
devtools::session_info()
#> Session info --------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.3.2 (2016-10-31)
#>  system   x86_64, darwin13.4.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_CA.UTF-8                 
#>  tz       Europe/Paris                
#>  date     2017-03-24
#> Packages ------------------------------------------------------------------
#>  package    * version date       source                        
#>  assertthat   0.1     2013-12-06 CRAN (R 3.3.0)                
#>  backports    1.0.5   2017-01-18 CRAN (R 3.3.2)                
#>  DBI          0.6     2017-03-09 cran (@0.6)                   
#>  devtools     1.12.0  2016-06-24 CRAN (R 3.3.0)                
#>  digest       0.6.12  2017-01-27 cran (@0.6.12)                
#>  dplyr      * 0.5.0   2016-06-24 CRAN (R 3.3.0)                
#>  evaluate     0.10    2016-10-11 cran (@0.10)                  
#>  htmltools    0.3.5   2016-03-21 CRAN (R 3.3.0)                
#>  knitr        1.15.1  2016-11-22 cran (@1.15.1)                
#>  lazyeval     0.2.0   2016-06-12 CRAN (R 3.3.0)                
#>  lubridate  * 1.6.0   2016-09-13 CRAN (R 3.3.0)                
#>  magrittr     1.5     2014-11-22 CRAN (R 3.3.0)                
#>  memoise      1.0.0   2016-01-29 CRAN (R 3.3.0)                
#>  R6           2.2.0   2016-10-05 cran (@2.2.0)                 
#>  Rcpp         0.12.10 2017-03-21 Github (RcppCore/Rcpp@5a71cf6)
#>  rmarkdown    1.3     2016-12-21 cran (@1.3)                   
#>  rprojroot    1.2     2017-01-16 CRAN (R 3.3.2)                
#>  stringi      1.1.2   2016-10-01 cran (@1.1.2)                 
#>  stringr      1.2.0   2017-02-18 cran (@1.2.0)                 
#>  tibble       1.2     2016-08-26 CRAN (R 3.3.0)                
#>  withr        1.0.2   2016-06-20 CRAN (R 3.3.0)                
#>  yaml         2.1.14  2016-11-12 cran (@2.1.14)
@hadley
Copy link
Member

hadley commented Mar 24, 2017

Even more minimal reprex for me:

df <- tibble(
  x = lubridate::hm("10:30", "10:30", "0:0"),
  y = c("apple", "apple", "tomato")
)
df
distinct(df)

# Different problem with group_by
df %>% group_by(x) %>% summarise(n())

Probably related to hashing S4 objects, and unlikely to get fixed in this release (probably will get bundled into vctrs)

@hadley hadley added bug an unexpected problem or unintended behavior data frame labels Mar 24, 2017
@aammd
Copy link
Author

aammd commented Mar 29, 2017

Indeed it seems like the duration 0S thing was a red herring; any Period object has the same problem.

I was wondering, should distinct() give an error, or warning, when called on a data.frame that includes Period objects? The current behaviour seems sort of hazardous for unwitting users.

@hadley
Copy link
Member

hadley commented Mar 29, 2017

Yes, that's why I labelled this issue with "bug".

@krlmlr
Copy link
Member

krlmlr commented Jul 12, 2017

@hadley: Will this be handled in #2432?

@hadley
Copy link
Member

hadley commented Jul 12, 2017

@krlmlr this seems to be a little different because it's not a two-table verb. It may be that something in the hybrid evaluator isn't preserving the S4 bit.

@krlmlr
Copy link
Member

krlmlr commented Jul 13, 2017

Filtering and reordering should also be a responsibility of vctrs, I think, but this might be a separate problem indeed.

@romainfrancois
Copy link
Member

@krlmlr I'm picking this one up

@romainfrancois
Copy link
Member

Right, it definitely sounds like something vctrs should handle, i.e.

I guess currently distinct only uses the .Data from df$x.

For now, we could refuse to go forward in https://github.com/tidyverse/dplyr/blob/master/inst/include/dplyr/visitor_impl.h#L43 if the object bit is set, but that might be a bit strong.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- tibble(
  x = lubridate::hm("10:30", "10:30", "0:0"),
  y = c("apple", "apple", "tomato")
)
df
#> # A tibble: 3 x 2
#>   x            y     
#>   <S4: Period> <chr> 
#> 1 10H 30M 0S   apple 
#> 2 10H 30M 0S   apple 
#> 3 0S           tomato
str( df$x )
#> Formal class 'Period' [package "lubridate"] with 6 slots
#>   ..@ .Data : num [1:3] 0 0 0
#>   ..@ year  : num [1:3] 0 0 0
#>   ..@ month : num [1:3] 0 0 0
#>   ..@ day   : num [1:3] 0 0 0
#>   ..@ hour  : num [1:3] 10 10 0
#>   ..@ minute: num [1:3] 30 30 0

Created on 2018-03-12 by the reprex package (v0.2.0).

romainfrancois added a commit that referenced this issue Mar 12, 2018
This is just a workaround until `vctrs`
@romainfrancois
Copy link
Member

This is definitely a vctrs thing. With the current impl of distinct:

// [[Rcpp::export]]
SEXP distinct_impl(DataFrame df, const SymbolVector& vars, const SymbolVector& keep) {
  if (df.size() == 0)
    return df;

  // No vars means ungrouped data with keep_all = TRUE.
  if (vars.size() == 0)
    return df;

  check_valid_colnames(df);
  DataFrameVisitors visitors(df, vars);

  std::vector<int> indices;
  VisitorSetIndexSet<DataFrameVisitors> set(visitors);

  int n = df.nrows();
  for (int i = 0; i < n; i++) {
    if (set.insert(i).second) {
      indices.push_back(i);
    }
  }

  return DataFrameSubsetVisitors(df, keep).subset(indices, get_class(df));
}

it means we need two visitor classes:

  • one VectorVisitor that goes in DataFrameVisitors
  • one SubsetVectorVisitor that goes in the DataFrameSubsetVisitors

So until we have a vctrs package, I've added some defensive coding to refuse Period.
This is better than leaving something that only seemingly works.

romainfrancois added a commit that referenced this issue Apr 9, 2018
This is just a workaround until `vctrs`
krlmlr added a commit that referenced this issue Apr 10, 2018
- Dedicated error message when trying to use columns of the `Interval` or `Period` classes (#2568).
romainfrancois added a commit that referenced this issue Sep 24, 2018
romainfrancois added a commit that referenced this issue Sep 24, 2018
romainfrancois added a commit that referenced this issue Oct 1, 2018
romainfrancois added a commit that referenced this issue Oct 1, 2018
romainfrancois added a commit that referenced this issue Oct 1, 2018
@lock
Copy link

lock bot commented Oct 7, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Oct 7, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug an unexpected problem or unintended behavior vctrs ↗️
Projects
None yet
Development

No branches or pull requests

4 participants