left_join with large dataset and multiple matching columns crashes R if adding new rows (cartesian product) #1230

Closed
nickbond opened this issue Jun 23, 2015 · 2 comments

@nickbond nickbond commented Jun 23, 2015

The crash described in the title occurred for me on both OS X and Windows, but was avoided by explicitly specifying the rows of the second table being joined (df2 below had exactly 1130 rows). The first join column is of class POSIXct. It is a large dataset and I have not been able to reproduce the crash with a simpler example, but I am happy to provide the data if you need a reproducible example.

left_join(df1, df2[1:1130,], by = c('date'='date', 'site.2014'='site'))

Thanks,
Nick

@nickbond nickbond changed the title left_join with multiple matching columns crashes R if adding new rows (cartesian product) left_join with large dataset and multiple matching columns crashes R if adding new rows (cartesian product) Jun 23, 2015

@romainfrancois romainfrancois commented Aug 12, 2015

Yes please, @nickbond, provide a reprex.


@nickbond nickbond commented Aug 13, 2015

Dear François,
I've made a copy of the data frames in question, which can be downloaded at the following link (https://www.dropbox.com/s/yw0lyu5dng6r1vb/bond_example.RData?dl=0).

The code below joins the two data frames. What I discovered by accident is that including 'zone' in the list of join columns avoids the error (example 1), as does specifying the row range of the second data frame (example 2). Example 3 causes a crash. Importantly, 'zone' is not essential to the join (other than keeping it as a single field in the joined table), because site codes are unique within each zone, and thus I would expect the join to work regardless. The main question is why example 2 works but example 3 does not.

Regards, Nick

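library(dplyr)
# Load the data frames (df1 and df2) from the downloaded file
load("bond_example.RData")
# Example 1: including 'zone' in the join columns works
left_join(df1, df2, by = c('date'='date', 'zone'='zone', 'site'='site'))
# Example 2: subsetting df2 to rows 1:1130 works
left_join(df1, df2[1:1130,], by = c('date'='date', 'site'='site'))
# Example 3: joining on 'date' and 'site' only crashes R
left_join(df1, df2, by = c('date'='date', 'site'='site'))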
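
Since the title mentions the join adding new rows, one check that might help narrow this down is to count how often each (date, site) key occurs in df2 before joining. The sketch below uses small made-up data frames as stand-ins for df1 and df2 (the values and the columns x and y are hypothetical, not taken from the Dropbox file), and it only predicts the size of the joined result. It does not show that duplicate keys are the cause of the crash.

```r
library(dplyr)

# Hypothetical stand-ins for df1 and df2; not the data from bond_example.RData.
df1 <- data.frame(
  date = as.POSIXct(c("2014-01-01", "2014-01-01", "2014-01-02"), tz = "UTC"),
  site = c("A", "B", "A"),
  x = 1:3
)
df2 <- data.frame(
  date = as.POSIXct(c("2014-01-01", "2014-01-01", "2014-01-01", "2014-01-02"), tz = "UTC"),
  zone = c("z1", "z2", "z1", "z1"),
  site = c("A", "A", "B", "A"),
  y = c(10, 20, 30, 40)
)

# Count how many rows of df2 share each (date, site) key.
key_counts <- count(df2, date, site)

# Keys occurring more than once will duplicate the matching rows of df1.
dup_keys <- filter(key_counts, n > 1)

# Predicted size of the left join: each df1 row yields one row per match
# in df2, or a single row if it has no match.
matches <- left_join(df1, key_counts, by = c("date", "site"))
expected_rows <- sum(ifelse(is.na(matches$n), 1L, matches$n))

# Actual join on 'date' and 'site' only, as in example 3.
joined <- left_join(df1, df2, by = c("date", "site"))
```

If dup_keys is non-empty for the real df2, joining without 'zone' returns more rows than df1 has, and expected_rows gives the size to expect.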

@lock lock bot locked as resolved and limited conversation to collaborators Jun 9, 2018