Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distinct rearranges columns separately when used with objects of class Period #2568

Closed
aammd opened this issue Mar 24, 2017 · 10 comments
Closed

distinct rearranges columns separately when used with objects of class Period #2568

aammd opened this issue Mar 24, 2017 · 10 comments
Assignees

Comments

@aammd
Copy link

@aammd aammd commented Mar 24, 2017

This bug happens when working with lubridate + dplyr. If a data.frame contains some period object of length 0s, and also some other non-date column, distinct() actually rearranges the rows in each column, creating erroneous data.

library(lubridate)
library(dplyr)

time_fruit <- data_frame(x = hm("10:30", "10:30", "14:40"),
           y = c("apple",  "apple", "banana"))

time_fruit
#> # A tibble: 3 × 2
#>              x      y
#>   <S4: Period>  <chr>
#> 1   10H 30M 0S  apple
#> 2   10H 30M 0S  apple
#> 3   14H 40M 0S banana

distinct(time_fruit)
#> # A tibble: 2 × 2
#>              x      y
#>   <S4: Period>  <chr>
#> 1   10H 30M 0S  apple
#> 2   10H 30M 0S banana

time_fruit2 <- data_frame(x = hm("10:30", "10:30", "14:40", "0:0"),
                         y = c("apple",  "apple", "banana", "tomato"))

time_fruit2
#> # A tibble: 4 × 2
#>              x      y
#>   <S4: Period>  <chr>
#> 1   10H 30M 0S  apple
#> 2   10H 30M 0S  apple
#> 3   14H 40M 0S banana
#> 4           0S tomato
distinct(time_fruit2)
#> # A tibble: 3 × 2
#>              x      y
#>   <S4: Period>  <chr>
#> 1   10H 30M 0S  apple
#> 2   10H 30M 0S banana
#> 3   14H 40M 0S tomato
Session info
devtools::session_info()
#> Session info --------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.3.2 (2016-10-31)
#>  system   x86_64, darwin13.4.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_CA.UTF-8                 
#>  tz       Europe/Paris                
#>  date     2017-03-24
#> Packages ------------------------------------------------------------------
#>  package    * version date       source                        
#>  assertthat   0.1     2013-12-06 CRAN (R 3.3.0)                
#>  backports    1.0.5   2017-01-18 CRAN (R 3.3.2)                
#>  DBI          0.6     2017-03-09 cran (@0.6)                   
#>  devtools     1.12.0  2016-06-24 CRAN (R 3.3.0)                
#>  digest       0.6.12  2017-01-27 cran (@0.6.12)                
#>  dplyr      * 0.5.0   2016-06-24 CRAN (R 3.3.0)                
#>  evaluate     0.10    2016-10-11 cran (@0.10)                  
#>  htmltools    0.3.5   2016-03-21 CRAN (R 3.3.0)                
#>  knitr        1.15.1  2016-11-22 cran (@1.15.1)                
#>  lazyeval     0.2.0   2016-06-12 CRAN (R 3.3.0)                
#>  lubridate  * 1.6.0   2016-09-13 CRAN (R 3.3.0)                
#>  magrittr     1.5     2014-11-22 CRAN (R 3.3.0)                
#>  memoise      1.0.0   2016-01-29 CRAN (R 3.3.0)                
#>  R6           2.2.0   2016-10-05 cran (@2.2.0)                 
#>  Rcpp         0.12.10 2017-03-21 Github (RcppCore/Rcpp@5a71cf6)
#>  rmarkdown    1.3     2016-12-21 cran (@1.3)                   
#>  rprojroot    1.2     2017-01-16 CRAN (R 3.3.2)                
#>  stringi      1.1.2   2016-10-01 cran (@1.1.2)                 
#>  stringr      1.2.0   2017-02-18 cran (@1.2.0)                 
#>  tibble       1.2     2016-08-26 CRAN (R 3.3.0)                
#>  withr        1.0.2   2016-06-20 CRAN (R 3.3.0)                
#>  yaml         2.1.14  2016-11-12 cran (@2.1.14)
@hadley
Copy link
Member

@hadley hadley commented Mar 24, 2017

Even more minimal reprex for me:

df <- tibble(
  x = lubridate::hm("10:30", "10:30", "0:0"),
  y = c("apple", "apple", "tomato")
)
df
distinct(df)

# Different problem with group_by
df %>% group_by(x) %>% summarise(n())

Probably related to hashing S4 objects, and unlikely to get fixed in this release (probably will get bundled into vctrs)

Loading

@aammd
Copy link
Author

@aammd aammd commented Mar 29, 2017

Indeed it seems like the duration 0S thing was a red herring; any Period object has the same problem.

I was wondering, should distinct() give an error, or warning, when called on a data.frame that includes Period objects? The current behaviour seems sort of hazardous for unwitting users.

Loading

@hadley
Copy link
Member

@hadley hadley commented Mar 29, 2017

Yes, that's why I labelled this issue with "bug".

Loading

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Jul 12, 2017

@hadley: Will this be handled in #2432?

Loading

@hadley
Copy link
Member

@hadley hadley commented Jul 12, 2017

@krlmlr this seems to be a little different because it's not a two-table verb. It may be that something in the hybrid evaluator isn't preserving the S4 bit.

Loading

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Jul 13, 2017

Filtering and reordering should also be a responsibility of vctrs, I think, but this might be a separate problem indeed.

Loading

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Mar 12, 2018

@krlmlr I'm picking this one up

Loading

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Mar 12, 2018

Right, it definitely sounds like something vctrs should handle, i.e.

I guess currently distinct only uses the .Data from df$x.

For now, we could refuse to go forward in https://github.com/tidyverse/dplyr/blob/master/inst/include/dplyr/visitor_impl.h#L43 if the object bit is set, but that might be a bit strong.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- tibble(
  x = lubridate::hm("10:30", "10:30", "0:0"),
  y = c("apple", "apple", "tomato")
)
df
#> # A tibble: 3 x 2
#>   x            y     
#>   <S4: Period> <chr> 
#> 1 10H 30M 0S   apple 
#> 2 10H 30M 0S   apple 
#> 3 0S           tomato
str( df$x )
#> Formal class 'Period' [package "lubridate"] with 6 slots
#>   ..@ .Data : num [1:3] 0 0 0
#>   ..@ year  : num [1:3] 0 0 0
#>   ..@ month : num [1:3] 0 0 0
#>   ..@ day   : num [1:3] 0 0 0
#>   ..@ hour  : num [1:3] 10 10 0
#>   ..@ minute: num [1:3] 30 30 0

Created on 2018-03-12 by the reprex package (v0.2.0).

Loading

romainfrancois added a commit that referenced this issue Mar 12, 2018
This is just a workaround until `vctrs`
@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Mar 12, 2018

This is definitely a vctrs thing. With the current impl of distinct:

// [[Rcpp::export]]
SEXP distinct_impl(DataFrame df, const SymbolVector& vars, const SymbolVector& keep) {
  if (df.size() == 0)
    return df;

  // No vars means ungrouped data with keep_all = TRUE.
  if (vars.size() == 0)
    return df;

  check_valid_colnames(df);
  DataFrameVisitors visitors(df, vars);

  std::vector<int> indices;
  VisitorSetIndexSet<DataFrameVisitors> set(visitors);

  int n = df.nrows();
  for (int i = 0; i < n; i++) {
    if (set.insert(i).second) {
      indices.push_back(i);
    }
  }

  return DataFrameSubsetVisitors(df, keep).subset(indices, get_class(df));
}

it means we need two visitor classes:

  • one VectorVisitor that goes in DataFrameVisitors
  • one SubsetVectorVisitor that goes in the DataFrameSubsetVisitors

So until we have a vctrs package, I've added some defensive coding to refuse Period.
This is better than leaving something that only seemingly works.

Loading

romainfrancois added a commit that referenced this issue Apr 9, 2018
This is just a workaround until `vctrs`
@krlmlr krlmlr closed this in #3414 Apr 10, 2018
krlmlr added a commit that referenced this issue Apr 10, 2018
- Dedicated error message when trying to use columns of the `Interval` or `Period` classes (#2568).
romainfrancois added a commit that referenced this issue Sep 24, 2018
romainfrancois added a commit that referenced this issue Sep 24, 2018
romainfrancois added a commit that referenced this issue Oct 1, 2018
romainfrancois added a commit that referenced this issue Oct 1, 2018
romainfrancois added a commit that referenced this issue Oct 1, 2018
@lock
Copy link

@lock lock bot commented Oct 7, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

Loading

@lock lock bot locked and limited conversation to collaborators Oct 7, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants