Bug: Crash with anti_join() #2118

Closed
MansMeg opened this issue Sep 10, 2016 · 7 comments

@MansMeg MansMeg commented Sep 10, 2016

The code below crashes R/RStudio.

The files (relatively big) can be downloaded here to reproduce the bug:

https://www.dropbox.com/s/om1239vt7j7miab/crp_tibble_factor_no_rare.Rdata?dl=0

https://www.dropbox.com/s/quireajwy8vzs8p/stoppord.txt?dl=0

library(tidytext)
library(dplyr)
sessionInfo()
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.6 (El Capitan)

locale:
[1] sv_SE.UTF-8/sv_SE.UTF-8/sv_SE.UTF-8/C/sv_SE.UTF-8/sv_SE.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.5.0    tidytext_0.1.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.6       lattice_0.20-33   tidyr_0.5.1      
 [4] psych_1.6.6       assertthat_0.1    SnowballC_0.5.1  
 [7] plyr_1.8.4        grid_3.3.0        R6_2.1.2         
[10] nlme_3.1-127      DBI_0.4-1         magrittr_1.5     
[13] tokenizers_0.1.4  stringi_1.1.1     reshape2_1.4.1   
[16] Matrix_1.2-6      tools_3.3.0       stringr_1.1.0    
[19] broom_0.4.1       parallel_3.3.0    janeaustenr_0.1.1
[22] mnormt_1.5-4      tibble_1.2 

load("crp_tibble_factor_no_rare.Rdata")
stopp_ord <- data_frame(word = read.table(file = "stoppord.txt")[,1], n = 1)
txt <- txt %>% anti_join(stopp_ord)

@MansMeg MansMeg commented Sep 10, 2016

My guess is that the cause is that I am joining on two different factors.

@krlmlr krlmlr commented Nov 7, 2016

Thanks. Is a dataset with > 100 MB really necessary to replicate the problem?

@MansMeg MansMeg commented Nov 8, 2016

Probably not. But this does replicate the problem.

@krlmlr krlmlr commented Nov 8, 2016

I'd really appreciate it if you could reduce the problem size further. Segfaults are very important, but it will be very tedious for us to work with a huge dataset to try to replicate the problem. Have you tried the development version of dplyr?

@MansMeg MansMeg commented Nov 8, 2016

I do not know when I'll have the time, though. Sorry.

Have you tried to replicate the bug?

@kevinushey kevinushey commented Nov 8, 2016

FWIW, here's what I see running @MansMeg's example with R + sanitizers + lldb:

> txt <- txt %>% anti_join(stopp_ord)
Joining, by = "word"


../inst/include/dplyr/JoinVisitorImpl.h:156:92: runtime error: signed integer overflow: -2147483648 - 1 cannot be represented in type 'int'
SUMMARY: AddressSanitizer: undefined-behavior ../inst/include/dplyr/JoinVisitorImpl.h:156:92 in
ASAN:DEADLYSIGNAL
=================================================================
==28660==ERROR: AddressSanitizer: SEGV on unknown address 0x000559c92820 (pc 0x000110e4a8ab bp 0x7fff50b96890 sp 0x7fff50b96860 T0)
==28660==The signal is caused by a READ memory access.
    #0 0x110e4a8aa in STRING_ELT memory.c:3390
    #1 0x122b11e13 in dplyr::JoinFactorFactorVisitor::get(int) const (dplyr.so+0x413e13)
    #2 0x122b102a5 in dplyr::JoinFactorFactorVisitor::hash(int) (dplyr.so+0x4122a5)
    #3 0x122a63ae8 in dplyr::VisitorSetHash<dplyr::DataFrameJoinVisitors>::hash(int) const (dplyr.so+0x365ae8)
    #4 0x122a62bd2 in boost::unordered::detail::table<boost::unordered::detail::map<std::__1::allocator<std::__1::pair<int const, std::__1::vector<int, std::__1::allocator<int> > > >, int, std::__1::vector<int, std::__1::allocator<int> >, dplyr::VisitorSetHasher<dplyr::DataFrameJoinVisitors>, dplyr::VisitorSetEqualPredicate<dplyr::DataFrameJoinVisitors> > >::find_node(int const&) const (dplyr.so+0x364bd2)
    #5 0x1228ff4a3 in anti_join_impl(Rcpp::DataFrame_Impl<Rcpp::PreserveStorage>, Rcpp::DataFrame_Impl<Rcpp::PreserveStorage>, Rcpp::Vector<16, Rcpp::PreserveStorage>, Rcpp::Vector<16, Rcpp::PreserveStorage>) (dplyr.so+0x2014a3)
    #6 0x122709a89 in dplyr_anti_join_impl (dplyr.so+0xba89)
    ...
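
For readers following the trace: `-2147483648` is `INT_MIN`, which is the bit pattern R uses for `NA_integer_`, so the overflow is consistent with a missing factor code being turned into a 0-based level index by subtracting 1. That reading is an assumption; the actual code at `JoinVisitorImpl.h:156` is not shown in this thread. The sketch below uses hypothetical helpers (`level_index`, `level_label`, `kNaInteger` are illustrative names, not dplyr's API) to show the kind of guard that avoids both the overflow and the out-of-bounds level lookup.

```cpp
#include <climits>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// R represents NA_integer_ as the smallest int, which matches the
// "-2147483648 - 1" operand reported by the sanitizer.
constexpr int kNaInteger = INT_MIN;

// Hypothetical helper (not dplyr's code): convert a 1-based factor code
// into a 0-based level position. Returns nullopt for NA or out-of-range
// codes instead of computing code - 1 unconditionally.
std::optional<std::size_t> level_index(int code, std::size_t n_levels) {
    if (code == kNaInteger) {
        return std::nullopt;  // missing value: there is no level to look up
    }
    if (code < 1 || static_cast<std::size_t>(code) > n_levels) {
        return std::nullopt;  // corrupt or mismatched code
    }
    return static_cast<std::size_t>(code - 1);  // safe: code >= 1 here
}

// Hypothetical helper: fetch the level label for a factor code.
// NA and invalid codes yield nullopt rather than reading past the levels.
std::optional<std::string> level_label(int code,
                                       const std::vector<std::string>& levels) {
    const auto idx = level_index(code, levels.size());
    if (!idx) {
        return std::nullopt;
    }
    return levels[*idx];
}
```

With levels `{"och", "att", "det"}`, `level_label(1, levels)` yields `"och"`, while `level_label(kNaInteger, levels)` yields an empty optional instead of indexing with a wrapped-around value. Requires C++17 for `std::optional`.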

@krlmlr krlmlr commented Nov 8, 2016

Thanks @kevinushey for confirming. I can work from the big example.

@krlmlr krlmlr self-assigned this Feb 10, 2017
@krlmlr krlmlr added this to the data frame 1 milestone Feb 10, 2017
@krlmlr krlmlr closed this in #2451 Feb 20, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018