[MRG] Fix undefined behavior in _tree.pyx and _quad_tree.pyx #17228
Accessing a member through a null pointer is undefined behavior, even if no actual memory access is performed (like in this case, where it was used only for offset computation). See the following link for an explanation: https://software.intel.com/content/www/us/en/develop/blogs/null-pointer-dereferencing-causes-undefined-behavior.html The current approach will also fail when building with tools like ubsan (undefined behavior sanitizer). The standard way to compute offsets of struct members is the offsetof macro but it seems it's not supported by Cython so I've used the approach described here: https://mail.python.org/pipermail/cython-devel/2013-April/003505.html
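For readers who want to try the instance-based offset idea outside of Cython, here is a minimal Python sketch using ctypes. The Cell layout is a hypothetical, shortened stand-in (only the member names come from this diff), and offset_from_instance is an illustrative helper, not code from this PR. It measures each offset on a real object, so no null pointer is involved, and prints it next to the offset that ctypes itself reports.

```python
import ctypes


class Cell(ctypes.Structure):
    # Hypothetical, shortened stand-in for the Cython Cell struct;
    # only the member names are taken from the diff, the types are assumed.
    _fields_ = [
        ("parent", ctypes.c_ssize_t),
        ("barycenter", ctypes.c_float * 3),
        ("min_bounds", ctypes.c_float * 3),
        ("max_bounds", ctypes.c_float * 3),
    ]


def offset_from_instance(instance, field_name):
    """Offset of an array member, measured on a real object.

    Mirrors `&dummy.field - &dummy`. Limited to array or struct members,
    because ctypes returns plain Python values for scalar members.
    """
    member = getattr(instance, field_name)
    return ctypes.addressof(member) - ctypes.addressof(instance)


dummy = Cell()
for name in ("barycenter", "min_bounds", "max_bounds"):
    # The `offset` attribute plays the role of C's offsetof here.
    print(name, offset_from_instance(dummy, name), getattr(Cell, name).offset)
```

The two printed numbers should agree for every member, which is the property the dummy-instance approach relies on.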
Ping
sklearn/neighbors/_quad_tree.pyx
Outdated
<Py_ssize_t> &(<Cell*> NULL).barycenter,
<Py_ssize_t> &(<Cell*> NULL).min_bounds,
<Py_ssize_t> &(<Cell*> NULL).max_bounds,
(<Py_ssize_t>&(dummy.parent) - <Py_ssize_t>&(dummy)),
The mailing list states to do this:
cdef Struct tmp
cdef Py_ssize_t offset
offset = <Py_ssize_t> (<Py_intptr_t>&(tmp.field) - <Py_intptr_t>&(tmp))
with casting to Py_intptr_t first. Do you think this is better?
Hi! Thanks for reviewing this.
I think some form of intptr_t is the right tool for the job, and I tried Py_intptr_t first, but I couldn't make it work because of 'Py_intptr_t' is not a type identifier errors when I tried to build.
I haven't been able to find Py_intptr_t available as an import anywhere, but I've dug a little more and found that we can import the actual C intptr_t type from libc.stdint, and it works.
Should I go with that instead? Do you have any advice about where to get Py_intptr_t from?
I think the current implementation is okay. It could be that Py_ssize_t is typedef'd to Py_intptr_t anyway:
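A quick way to sanity-check that guess on a given platform, sketched with ctypes. This only compares widths: ctypes has no Py_intptr_t, so c_void_p stands in for "pointer-sized" and c_ssize_t for Py_ssize_t, and equal widths do not prove that the two names are the same typedef.

```python
import ctypes


def ssize_matches_pointer_width():
    """True when ssize_t and a data pointer occupy the same number of bytes."""
    return ctypes.sizeof(ctypes.c_ssize_t) == ctypes.sizeof(ctypes.c_void_p)


# Print both widths so a mismatch is easy to spot on unusual platforms.
print(ctypes.sizeof(ctypes.c_ssize_t), ctypes.sizeof(ctypes.c_void_p))
print(ssize_matches_pointer_width())
```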
Sorry for the github newbie question: if you think the current implementation is okay should I mark this conversation as resolved?
I personally do not resolve them on my PRs, because resolved conversations are automatically hidden and could be interesting to other reviewers.
That being said, you can mark the conversation as resolved.
Thanks for the explanation. I'll leave it open then :)
LGTM
Isn't there a way to get rid of this whole cdef Cell dummy;
CELL_DTYPE = np.asarray(<Cell[:1]>(&dummy)).dtype though I haven't looked at how this works in detail.
I'll give it a try and report back. I'm a total newbie at pretty much everything around this project, so I won't be able to go much farther than "I made that change and tests pass/don't pass", but we should get a little bit of useful signal out of that at least.
I tried the approach suggested by @rth and it looks like it builds and tests in the affected directories still pass. If nobody objects I'll modify this pull request tomorrow to use this approach, which is way simpler :)
According to scikit-learn@9e9cdb0#issuecomment-638998985 these dictionaries with offsets are there for old numpy versions and they are no longer needed. We can use instead an approach similar to https://github.com/scikit-learn/scikit-learn/pull/15097/files#diff-b071106a87a03ee4e9149e3f0a2b1180L187 I have verified that tests keep passing and that the undefined behavior that motivated this pull request in the first place is still fixed.
I have pushed a new commit using an approach similar to this other commit, as suggested above. Tests still pass and the original UndefinedBehaviorSanitizer warning about a null pointer dereference that motivated this pull request in the first place is still fixed. Please take another look :)
Great that it works! I can't find the documentation for it though. All I have found so far is this Cython test and the fact that we are already doing it in,
In particular I don't understand why it's necessary to do the
Any other ideas?
If we declare Also, it seems that the original concern can be addressed by changing into
I'm not a fan of
no idea either, I'm even surprised the syntax is correct
As I understand it, the intent of Would As for your other questions, sadly I have no idea. I'm willing to try any reasonable suggestions.
Yes, you're probably right. Can we declare the dtype with a list instead of a dict, i.e.:
CELL_DTYPE = np.dtype([
    ('parent', np.intp),
    ('children', np.intp, (8,)), # not exactly sure about the syntax, see https://numpy.org/doc/stable/user/basics.rec.html#structured-datatype-creation
    ...
    ],
    align=True
)
the
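A minimal sketch of the list-based declaration, assuming a shortened, hypothetical field list: the real Cell struct has more members, and the types chosen for is_leaf and barycenter are assumptions made for illustration. field_offsets is an illustrative helper, not part of NumPy or scikit-learn. With align=True NumPy inserts C-style padding, so the offsets are computed by NumPy instead of being written by hand.

```python
import numpy as np

# Hypothetical, shortened field list; types other than np.intp are assumed.
CELL_DTYPE = np.dtype(
    [
        ("parent", np.intp),
        ("children", np.intp, (8,)),   # fixed-size sub-array field
        ("is_leaf", np.int32),
        ("barycenter", np.float32, (3,)),
    ],
    align=True,  # the constructor keyword is `align`
)


def field_offsets(dtype):
    """Map each field name to its byte offset inside one record."""
    return {name: dtype.fields[name][1] for name in dtype.names}


print(field_offsets(CELL_DTYPE))
print(CELL_DTYPE.itemsize, CELL_DTYPE.isalignedstruct)
```

The drawback raised in this thread still applies: a list like this has to be kept in sync with the C struct by hand.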
Asked a SO question about it.
Well, if we can make it work without manually defining dtypes that need to be in sync with the struct, that would be preferable. We already have such code in binary trees; we just need to understand what it does. Let's see if anyone replies, otherwise I could reach out on some numpy channel.
OK let's see if we can get an explanation. Though IMHO the fact that none of us here really understand what it does is an indication that we shouldn't be using it (the cycle is going to repeat once we're not around anymore). Repeating the dtype is verbose but it's not a big deal since these almost never change. FWIW, that's what we're doing in
So we got an answer in https://stackoverflow.com/a/62449233/1791279 . I am continuously impressed by the knowledge of SO participants on sometimes obscure topics.
Yes, but do we know what happens if one side changes and not the other? Repeating might be OK; personally I'm not too comfortable with the pointer arithmetic initially done in this PR. I think using this solution with a link to the SO comment might be enough? Actually it looks like we merged a similar change earlier in #16141. The original motivation was a comment by Jake that it was a cleaner way to do it: https://github.com/scikit-learn/scikit-learn/blame/c45721d538d36ab4c322d18e33d9c10b55f5fe27/sklearn/neighbors/_binary_tree.pxi#L184 Any opinions @jnothman @jeremiedbb?
Also, we got this original issue by manually setting offsets, so if we can rely on numpy more to do it, it would be more reliable, I think, even if it's a more black-box solution. As long as it works.
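If offsets are ever declared by hand again, one cheap guard against the two sides drifting apart is to compare the declared offsets with the real layout at import or test time. The sketch below does this with ctypes; Node is a hypothetical, shortened stand-in for the Cython struct and layout_mismatches is an illustrative helper, not part of scikit-learn.

```python
import ctypes


class Node(ctypes.Structure):
    # Hypothetical, shortened stand-in for the Cython Node struct.
    _fields_ = [
        ("left_child", ctypes.c_ssize_t),
        ("right_child", ctypes.c_ssize_t),
        ("threshold", ctypes.c_double),
        ("weighted_n_node_samples", ctypes.c_double),
    ]


def layout_mismatches(struct_type, declared_offsets):
    """Return the field names whose declared offset disagrees with the struct.

    A field missing on either side also counts as a mismatch.
    """
    actual = {
        name: getattr(struct_type, name).offset
        for name, *_ in struct_type._fields_
    }
    return sorted(
        name
        for name in set(actual) | set(declared_offsets)
        if actual.get(name) != declared_offsets.get(name)
    )


# A declaration copied from the struct itself is, by construction, in sync.
in_sync = {name: getattr(Node, name).offset for name, *_ in Node._fields_}
print(layout_mismatches(Node, in_sync))
```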
From what I tried, either a segfault or a ValueError like
I think Jake's comment is right. It's cleaner and it's actually the recommended way to do that. Maybe we can add a small comment to explain that to avoid future confusion.
I'm OK with the current solution then. Thanks @slackito !
if we merge this we should document it
Surprisingly this still works
Added a brief comment and a link to stackoverflow thread that explain why the simpler approach to define numpy types works.
Thanks everyone for the comments! Added comments with a brief explanation and a link to the stackoverflow thread posted above. Please take another look.
Would we have a better place to document this? Is the inline comment enough? I really don't know.
sklearn/neighbors/_quad_tree.pyx
Outdated
<Py_ssize_t> &(<Cell*> NULL).max_bounds,
]
})
# Build the corresponding numpy dtyle for Cell.
- # Build the corresponding numpy dtyle for Cell.
+ # Build the corresponding numpy dtype for Cell.
Fixed, thanks!
sklearn/tree/_tree.pyx
Outdated
<Py_ssize_t> &(<Node*> NULL).weighted_n_node_samples
]
})
# Build the corresponding numpy dtyle for Node.
same here
Fixed, thanks!
Thanks @slackito !
…17228)
* Fix undefined behavior in _tree.pyx and _quad_tree.pyx
  Accessing a member through a null pointer is undefined behavior, even if no actual memory access is performed (like in this case, where it was used only for offset computation). See the following link for an explanation: https://software.intel.com/content/www/us/en/develop/blogs/null-pointer-dereferencing-causes-undefined-behavior.html The current approach will also fail when building with tools like ubsan (undefined behavior sanitizer). The standard way to compute offsets of struct members is the offsetof macro but it seems it's not supported by Cython so I've used the approach described here: https://mail.python.org/pipermail/cython-devel/2013-April/003505.html
* Simplify definitions of types for numpy.
  According to scikit-learn@9e9cdb0#issuecomment-638998985 these dictionaries with offsets are there for old numpy versions and they are no longer needed. We can use instead an approach similar to https://github.com/scikit-learn/scikit-learn/pull/15097/files#diff-b071106a87a03ee4e9149e3f0a2b1180L187 I have verified that tests keep passing and that the undefined behavior that motivated this pull request in the first place is still fixed.
* Document the new numpy type definition approach.
  Added a brief comment and a link to stackoverflow thread that explain why the simpler approach to define numpy types works.
* Fix typo 'dtyle'->'dtype'