Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix corner case for num_as_location() #1166

Merged
merged 8 commits into from Jul 3, 2020
Merged

Conversation

krlmlr
Copy link
Member

@krlmlr krlmlr commented Jun 25, 2020

and replace by a simpler and faster implementation, with shortcut if appending in strictly increasing order without gaps.

Copy link
Member

@lionel- lionel- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the motivation for extending consecutively but non-monotonically? Is this needed for tibble?

R_len_t n_missing = 0;
int extended[n_extend];
int i_extend = 0;
int nn = n;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be benefit from an informative name, e.g. current_n or new_n


R_len_t n_subscript = Rf_length(sorted);
R_len_t n_missing = 0;
int extended[n_extend];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't allocate a variable size array that is potentially very large on the stack because it could overflow. At the minimum the stack size should be checked using R_CheckStack2().

There should also be two paths, one with a heap alloc if n_extend gets too large (say 100), and one with a stack alloc. I wonder if this doesn't get too complicated though, so a heap alloc is probably just as fine in this case, unless the stack alloc can be demonstrated to improve efficiency measurably (I would expect other costs to dominate, esp. for a small n_extend).

// should not require any sorting
if (elt - 1 == nn) {
++nn;
--n_extend;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n_extend is no longer used at this point right? So no purpose for decreasing it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the end of the loop, n_extend == i_extend. What's a good way to check this invariant in "debug builds"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably possible via the build system by introducing DESCRIPTION as a dependency and checking whether the version has large components, and including a debug header in that case.

But I don't think we want to include debug-only code paths in vctrs right now, so the algorithm needs to be correct ;) I don't mind having this consistency check unconditionally though, it won't make much of a difference.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add if (failure) Rf_error("INTERNAL: ..."); ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes we use this pattern for internal errors:

Rf_error("Internal error in `s3_get_class()`: Class must have length.");

++nn;
--n_extend;
} else {
extended[i_extend++] = elt - 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The extended array was not zero-initialised, and this is the only path in the loop where it is assigned a value. This means the sorted array below will contain random memory in some cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. I'm using only the first i_extend entries of the array during the sort and the subsequent check. Added a comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, you're right.

expect_identical(num_as_location(c(1:5, 7, 6), 3, oob = "extend"), c(1:5, 7L, 6L))
expect_identical(num_as_location(c(1, NA, 3), 2, oob = "extend"), c(1L, NA, 3L))
expect_identical(num_as_location(c(1, NA, 4, 3), 2, oob = "extend"), c(1L, NA, 4L, 3L))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this to a separate "can extend beyond the end consecutively but non-monotonically" test please?

@krlmlr
Copy link
Member Author

krlmlr commented Jul 1, 2020

I'd like to maintain data frame compatibility here:

df <- data.frame(a = 1:3)
df[6:4, ] <- 6:4
df
#>   a
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6

Created on 2020-07-01 by the reprex package (v0.3.0)

@krlmlr krlmlr requested a review from lionel- July 3, 2020 12:14
@lionel- lionel- merged commit fbc70e0 into r-lib:master Jul 3, 2020
@lionel-
Copy link
Member

lionel- commented Jul 3, 2020

Thanks!

@krlmlr krlmlr deleted the b-num-as-location branch July 5, 2020 17:19
@krlmlr
Copy link
Member Author

krlmlr commented Jul 5, 2020

Are you planning a CRAN release any time soon? It would be useful to have this, and also vec_as_location(oob = "extend") for tibble.

@lionel-
Copy link
Member

lionel- commented Jul 6, 2020

yes this week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants