[MRG] Fix undefined behavior in _tree.pyx and _quad_tree.pyx #17228

slackito · 2020-05-14T22:34:01Z

Accessing a member through a null pointer is undefined behavior, even if
no actual memory access is performed (like in this case, where it was
used only for offset computation). See the following link for an
explanation:
https://software.intel.com/content/www/us/en/develop/blogs/null-pointer-dereferencing-causes-undefined-behavior.html

The current approach will also fail when building with tools like ubsan
(undefined behavior sanitizer).

The standard way to compute offsets of struct members is the offsetof
macro but it seems it's not supported by Cython so I've used the
approach described here:

https://mail.python.org/pipermail/cython-devel/2013-April/003505.html

Accessing a member through a null pointer is undefined behavior, even if no actual memory access is performed (like in this case, where it was used only for offset computation). See the following link for an explanation: https://software.intel.com/content/www/us/en/develop/blogs/null-pointer-dereferencing-causes-undefined-behavior.html The current approach will also fail when building with tools like ubsan (undefined behavior sanitizer). The standard way to compute offsets of struct members is the offsetof macro but it seems it's not supported by Cython so I've used the approach described here: https://mail.python.org/pipermail/cython-devel/2013-April/003505.html

slackito · 2020-05-28T18:39:15Z

Ping

thomasjpfan · 2020-06-01T22:01:24Z

sklearn/neighbors/_quad_tree.pyx

-        <Py_ssize_t> &(<Cell*> NULL).barycenter,
-        <Py_ssize_t> &(<Cell*> NULL).min_bounds,
-        <Py_ssize_t> &(<Cell*> NULL).max_bounds,
+        (<Py_ssize_t>&(dummy.parent) - <Py_ssize_t>&(dummy)),


The mailing list states to do this:

cdef Struct tmp cdef Py_ssize_t offset offset = <Py_ssize_t> (<Py_intptr_t>&(tmp.field) - <Py_intptr_t>&(tmp))

with castinig to Py_intptr_t first, do you think this is better?

Hi! Thanks for reviewing this.

I think some form of intptr_t is the right tool for the job, and I tried Py_intptr_t first, but I couldn't make it work because of 'Py_intptr_t' is not a type identifier errors when I tried to build.

I haven't been able to find Py_intptr_t available as an import anywhere, but I've digged a little more and I've found that we can import the actual C intptr_t type from libc.stdint and it works.

Should I go with that instead? Do you have any advice about where to get Py_intptr_t from?

I think the current implementation is okay. It could be that Py_ssize_t is typedefed to Py_intptr_t anyways:

https://github.com/cython/cython/blob/0a25de5b1023fd551f5e600583de8d8c87b7796d/Cython/Utility/MemoryView.pyx#L37

Sorry for the github newbie question: if you think the current implementation is okay should I mark this conversation as resolved?

I personally do not resolve them on my PRs, because resolved conversations are automatically hidden and could be interesting to other reviewers.

That being said, you can mark the conversation as resolved.

Thanks for the explanation. I'll leave it open then :)

thomasjpfan

LGTM

rth · 2020-06-04T17:32:58Z

Isn't there a way to get rid of this whole CELL_DTYPE dict that was originally there for old numpy as far as I understand? Similar to https://github.com/scikit-learn/scikit-learn/pull/15097/files#diff-b071106a87a03ee4e9149e3f0a2b1180L187 where it might be equivalent to something along the lines of,

cdef Cell dummy;
CELL_DTYPE = np.asarray(<Cell[:1]>(&dummy)).dtype

though I haven't looked at how this works in detail.

slackito · 2020-06-05T22:19:06Z

Isn't there a way to get rid of this whole CELL_DTYPE dict that was originally there for old numpy as far as I understand? Similar to https://github.com/scikit-learn/scikit-learn/pull/15097/files#diff-b071106a87a03ee4e9149e3f0a2b1180L187

I'll give it a try and report back. I'm a total newbie at pretty much everything around this project, so I won't be able to go much farther than "I made that change and tests pass/don't pass", but we should get a little bit of useful signal out of that at least.

slackito · 2020-06-09T22:45:02Z

I tried the approach suggested by @rth and it looks like it builds and tests in the affected directories still pass. If nobody objects I'll modify this pull request tomorrow to use this approach, which is way simpler :)

According to scikit-learn@9e9cdb0#issuecomment-638998985 these dictionaries with offsets are there for old numpy versions and they are no longer needed. We can use instead an approach similar to https://github.com/scikit-learn/scikit-learn/pull/15097/files#diff-b071106a87a03ee4e9149e3f0a2b1180L187 I have verified that tests keep passing and that the undefined behavior that motivated this pull request in the first place is still fixed.

slackito · 2020-06-10T21:44:56Z

I have pushed a new commit using an approach similar to this other commit, as suggested above. Tests still pass and the original UndefinedBehaviorSanitizer warning about a null pointer dereference that motivated this pull request in the first place is still fixed.

Please take another look :)

rth · 2020-06-10T21:59:40Z

Great that it works! I can't find the documentation for it though. All I have found so far is this Cython test and the fact that we are already doing it in,

sklearn/neighbors/_binary_tree.pxi
184:NodeHeapData = np.asarray(<NodeHeapData_t[:1]>(&nhd_tmp)).dtype
194:NodeData = np.asarray(<NodeData_t[:1]>(&nd_tmp)).dtype

In particular I don't understand why it's necessary to do the [:1] slice. Any ideas @thomasjpfan or @NicolasHug ?

slackito · 2020-06-17T23:25:50Z

Any other ideas?

NicolasHug · 2020-06-17T23:51:30Z

If we declare CELL_DTYPE = np.asarray(<Cell[:1]>(&dummy)).dtype, can we still access the fields of a variable? I.e. can we still do some_cell['parent'] from Python?

Also, it seems that the original concern can be addressed by changing
<Py_ssize_t> &(<Cell*> NULL).parent,

into

<Py_ssize_t> &(<Cell*> &dummy).parent?

I'm not a fan of CELL_DTYPE = np.asarray(<Cell[:1]>(&dummy)).dtype because it's very cryptic.

In particular I don't understand why it's necessary to do the [:1] slice

no idea either, I'm even surprised the syntax is correct

slackito · 2020-06-18T01:27:28Z

As I understand it, the intent of <Py_ssize_t> &(<Cell*> NULL).parent was to get the offset of parent inside a Cell. This works if the null pointer is address 0 because then the address of the field is equal to its offset inside the object.

Would <Py_ssize_t> &(<Cell*> &dummy).parent accomplish the same thing? It seems to me it would return the address of the parent field in the dummy object, and not the offset (which is why my original commit and the reference link in the description subtract the address of the dummy object itself to get the relative offset, i.e. (<Py_ssize_t>&(dummy.parent) - <Py_ssize_t>&(dummy))).

As for your other questions, sadly I have no idea. I'm willing to try any reasonable suggestions.

NicolasHug · 2020-06-18T11:11:51Z

It seems to me it would return the address of the parent field in the dummy object, and not the offset

Yes you're probably right

Can we declare the dtype with a list instead of a dict, i.e.:

CELL_DTYPE = np.dtype([
	('parent', np.intp),
	('children', np.intp, (8,)),  # not exactly sure about the syntax, see https://numpy.org/doc/stable/user/basics.rec.html#structured-datatype-creation
	...
	],
	aligned=True
)

the aligned=True should take care of the offsets

rth · 2020-06-18T11:28:57Z

Asked a SO question about it.

Can we declare the dtype with a list instead of a dict, i.e.:

Well it we can make it work without manually defining dtypes that need to be in sync with the struct that would be preferrable. We already have such code in binary trees we just need to understand what it does. Let's see if anyone replies, otherwise I could reach out on some numpy channel.

NicolasHug · 2020-06-18T11:48:49Z

OK let's see if we can get an explanation.

Though IMHO the fact that none of us here really understand what it does is an indication that we shouldn't be using it (the cycle is going to repeat once we're not around anymore). Repeating the dtype is verbose but it's not a big deal since these almost never change. FWIW, that's what we're doing in sklearn/ensemble/_hist_gradient_boosting/common.pyx-pxd

rth · 2020-06-18T11:58:29Z

So we got an answer in https://stackoverflow.com/a/62449233/1791279 . I am continuously impressed by the knowledge of SO participants on sometimes obscure topics.

Repeating the dtype is verbose but it's not a big deal since these almost never change.

Yes, but do we know what happens if one side changes and not the other. Repeating might be OK, personally I'm not too comfortable with the pointer arithmetic initially done in this PR. I think using this solution with a link to the SO comment might be enough?

Actually it looks like we merged a similar change earlier in #16141 . The original motivation was a comment by Jake that that was a cleaner way to do it https://github.com/scikit-learn/scikit-learn/blame/c45721d538d36ab4c322d18e33d9c10b55f5fe27/sklearn/neighbors/_binary_tree.pxi#L184

any opinions @jnothman @jeremiedbb ?

rth · 2020-06-18T12:07:50Z

Also we got this original issue by manually setting offsets , so if we can rely on numpy more to do it, it would more reliable I think even it's a more black box solution. As long as it works.

NicolasHug · 2020-06-18T12:15:52Z

do we know what happens if one side changes and not the other

From what I tried, either a segfault or a ValueError like
ValueError: Buffer dtype mismatch, expected 'unsigned char' but got end in 'node_struct.blah'

jeremiedbb · 2020-06-18T12:15:52Z

<Cell[:1]>(&dummy) is not slicing. When you cast in cython you have to provide the length of the array so it's just cast dummy to an array of Cell of length 1.

I can't find the documentation for it though.

here it is

I think jake's comment is right. It's cleaner and it's actually the recommended way to do that. Maybe we can add a small comment to explain that to avoid future confusion.

rth

I'm OK with the current solution then. Thanks @slackito !

NicolasHug · 2020-06-18T13:31:13Z

if we merge this we should document it

NicolasHug · 2020-06-18T14:11:22Z

If we declare CELL_DTYPE = np.asarray(<Cell[:1]>(&dummy)).dtype, can we still access the fields of a variable? I.e. can we still do some_cell['parent'] from Python?

Surprisingly this still works

+1 pending a comment linking to SO

Added a brief comment and a link to stackoverflow thread that explain why the simpler approach to define numpy types works.

slackito · 2020-06-19T00:53:22Z

Thanks everyone for the comments!

Added comments with a brief explanation an a link to the stackoverflow thread posted above. Please take another look.

adrinjalali

Would we have a better place to document this? Is the inline comment enough? I really don't know.

adrinjalali · 2020-06-19T10:54:37Z

sklearn/neighbors/_quad_tree.pyx

-        <Py_ssize_t> &(<Cell*> NULL).max_bounds,
-    ]
-})
+# Build the corresponding numpy dtyle for Cell.


Suggested change

# Build the corresponding numpy dtyle for Cell.

# Build the corresponding numpy dtype for Cell.

Fixed, thanks!

adrinjalali · 2020-06-19T10:54:51Z

sklearn/tree/_tree.pyx

-        <Py_ssize_t> &(<Node*> NULL).weighted_n_node_samples
-    ]
-})
+# Build the corresponding numpy dtyle for Node.


Fixed, thanks!

rth

Thanks @slackito !

…17228) * Fix undefined behavior in _tree.pyx and _quad_tree.pyx Accessing a member through a null pointer is undefined behavior, even if no actual memory access is performed (like in this case, where it was used only for offset computation). See the following link for an explanation: https://software.intel.com/content/www/us/en/develop/blogs/null-pointer-dereferencing-causes-undefined-behavior.html The current approach will also fail when building with tools like ubsan (undefined behavior sanitizer). The standard way to compute offsets of struct members is the offsetof macro but it seems it's not supported by Cython so I've used the approach described here: https://mail.python.org/pipermail/cython-devel/2013-April/003505.html * Simplify definitions of types for numpy. According to scikit-learn@9e9cdb0#issuecomment-638998985 these dictionaries with offsets are there for old numpy versions and they are no longer needed. We can use instead an approach similar to https://github.com/scikit-learn/scikit-learn/pull/15097/files#diff-b071106a87a03ee4e9149e3f0a2b1180L187 I have verified that tests keep passing and that the undefined behavior that motivated this pull request in the first place is still fixed. * Document the new numpy type definition approach. Added a brief comment and a link to stackoverflow thread that explain why the simpler approach to define numpy types works. * Fix typo 'dtyle'->'dtype'

github-actions bot added module:neighbors module:tree labels May 14, 2020

slackito changed the title ~~Fix undefined behavior in _tree.pyx and _quad_tree.pyx~~ [MRG] Fix undefined behavior in _tree.pyx and _quad_tree.pyx May 18, 2020

thomasjpfan reviewed Jun 1, 2020

View reviewed changes

thomasjpfan approved these changes Jun 4, 2020

View reviewed changes

rth previously approved these changes Jun 18, 2020

View reviewed changes

Document the new numpy type definition approach.

1582593

Added a brief comment and a link to stackoverflow thread that explain why the simpler approach to define numpy types works.

adrinjalali reviewed Jun 19, 2020

View reviewed changes

Fix typo 'dtyle'->'dtype'

a14d11d

rth approved these changes Jun 19, 2020

View reviewed changes

adrinjalali merged commit c11aaf5 into scikit-learn:master Jun 22, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MRG] Fix undefined behavior in _tree.pyx and _quad_tree.pyx #17228

[MRG] Fix undefined behavior in _tree.pyx and _quad_tree.pyx #17228

slackito commented May 14, 2020

slackito commented May 28, 2020

thomasjpfan Jun 1, 2020

slackito Jun 1, 2020

thomasjpfan Jun 2, 2020

slackito Jun 3, 2020

thomasjpfan Jun 4, 2020

slackito Jun 4, 2020

thomasjpfan left a comment

rth commented Jun 4, 2020

slackito commented Jun 5, 2020

slackito commented Jun 9, 2020

slackito commented Jun 10, 2020

rth commented Jun 10, 2020

slackito commented Jun 17, 2020

NicolasHug commented Jun 17, 2020

slackito commented Jun 18, 2020

NicolasHug commented Jun 18, 2020 •

edited

rth commented Jun 18, 2020 •

edited

NicolasHug commented Jun 18, 2020

rth commented Jun 18, 2020

rth commented Jun 18, 2020 •

edited

NicolasHug commented Jun 18, 2020

jeremiedbb commented Jun 18, 2020 •

edited

rth left a comment

NicolasHug commented Jun 18, 2020

NicolasHug commented Jun 18, 2020

slackito commented Jun 19, 2020

adrinjalali left a comment

adrinjalali Jun 19, 2020

slackito Jun 19, 2020

adrinjalali Jun 19, 2020

slackito Jun 19, 2020

rth left a comment

	# Build the corresponding numpy dtyle for Cell.
	# Build the corresponding numpy dtype for Cell.

[MRG] Fix undefined behavior in _tree.pyx and _quad_tree.pyx #17228

[MRG] Fix undefined behavior in _tree.pyx and _quad_tree.pyx #17228

Conversation

slackito commented May 14, 2020

slackito commented May 28, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

thomasjpfan left a comment

Choose a reason for hiding this comment

rth commented Jun 4, 2020

slackito commented Jun 5, 2020

slackito commented Jun 9, 2020

slackito commented Jun 10, 2020

rth commented Jun 10, 2020

slackito commented Jun 17, 2020

NicolasHug commented Jun 17, 2020

slackito commented Jun 18, 2020

NicolasHug commented Jun 18, 2020 • edited

rth commented Jun 18, 2020 • edited

NicolasHug commented Jun 18, 2020

rth commented Jun 18, 2020

rth commented Jun 18, 2020 • edited

NicolasHug commented Jun 18, 2020

jeremiedbb commented Jun 18, 2020 • edited

rth left a comment

Choose a reason for hiding this comment

NicolasHug commented Jun 18, 2020

NicolasHug commented Jun 18, 2020

slackito commented Jun 19, 2020

adrinjalali left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rth left a comment

Choose a reason for hiding this comment

NicolasHug commented Jun 18, 2020 •

edited

rth commented Jun 18, 2020 •

edited

rth commented Jun 18, 2020 •

edited

jeremiedbb commented Jun 18, 2020 •

edited