Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Fix undefined behavior in _tree.pyx and _quad_tree.pyx #17228

Merged
merged 4 commits into from
Jun 22, 2020

Conversation

slackito
Copy link
Contributor

Accessing a member through a null pointer is undefined behavior, even if
no actual memory access is performed (like in this case, where it was
used only for offset computation). See the following link for an
explanation:
https://software.intel.com/content/www/us/en/develop/blogs/null-pointer-dereferencing-causes-undefined-behavior.html

The current approach will also fail when building with tools like ubsan
(undefined behavior sanitizer).

The standard way to compute offsets of struct members is the offsetof
macro but it seems it's not supported by Cython so I've used the
approach described here:

https://mail.python.org/pipermail/cython-devel/2013-April/003505.html

Accessing a member through a null pointer is undefined behavior, even if
no actual memory access is performed (like in this case, where it was
used only for offset computation). See the following link for an
explanation:
https://software.intel.com/content/www/us/en/develop/blogs/null-pointer-dereferencing-causes-undefined-behavior.html

The current approach will also fail when building with tools like ubsan
(undefined behavior sanitizer).

The standard way to compute offsets of struct members is the offsetof
macro but it seems it's not supported by Cython so I've used the
approach described here:

https://mail.python.org/pipermail/cython-devel/2013-April/003505.html
@slackito slackito changed the title Fix undefined behavior in _tree.pyx and _quad_tree.pyx [MRG] Fix undefined behavior in _tree.pyx and _quad_tree.pyx May 18, 2020
@slackito
Copy link
Contributor Author

Ping

<Py_ssize_t> &(<Cell*> NULL).barycenter,
<Py_ssize_t> &(<Cell*> NULL).min_bounds,
<Py_ssize_t> &(<Cell*> NULL).max_bounds,
(<Py_ssize_t>&(dummy.parent) - <Py_ssize_t>&(dummy)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mailing list states to do this:

cdef Struct tmp
cdef Py_ssize_t offset
offset = <Py_ssize_t> (<Py_intptr_t>&(tmp.field) - <Py_intptr_t>&(tmp))

with castinig to Py_intptr_t first, do you think this is better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! Thanks for reviewing this.

I think some form of intptr_t is the right tool for the job, and I tried Py_intptr_t first, but I couldn't make it work because of 'Py_intptr_t' is not a type identifier errors when I tried to build.

I haven't been able to find Py_intptr_t available as an import anywhere, but I've digged a little more and I've found that we can import the actual C intptr_t type from libc.stdint and it works.

Should I go with that instead? Do you have any advice about where to get Py_intptr_t from?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current implementation is okay. It could be that Py_ssize_t is typedefed to Py_intptr_t anyways:

https://github.com/cython/cython/blob/0a25de5b1023fd551f5e600583de8d8c87b7796d/Cython/Utility/MemoryView.pyx#L37

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the github newbie question: if you think the current implementation is okay should I mark this conversation as resolved?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally do not resolve them on my PRs, because resolved conversations are automatically hidden and could be interesting to other reviewers.

That being said, you can mark the conversation as resolved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. I'll leave it open then :)

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rth
Copy link
Member

rth commented Jun 4, 2020

Isn't there a way to get rid of this whole CELL_DTYPE dict that was originally there for old numpy as far as I understand? Similar to https://github.com/scikit-learn/scikit-learn/pull/15097/files#diff-b071106a87a03ee4e9149e3f0a2b1180L187 where it might be equivalent to something along the lines of,

cdef Cell dummy;
CELL_DTYPE = np.asarray(<Cell[:1]>(&dummy)).dtype

though I haven't looked at how this works in detail.

@slackito
Copy link
Contributor Author

slackito commented Jun 5, 2020

Isn't there a way to get rid of this whole CELL_DTYPE dict that was originally there for old numpy as far as I understand? Similar to https://github.com/scikit-learn/scikit-learn/pull/15097/files#diff-b071106a87a03ee4e9149e3f0a2b1180L187

I'll give it a try and report back. I'm a total newbie at pretty much everything around this project, so I won't be able to go much farther than "I made that change and tests pass/don't pass", but we should get a little bit of useful signal out of that at least.

@slackito
Copy link
Contributor Author

slackito commented Jun 9, 2020

I tried the approach suggested by @rth and it looks like it builds and tests in the affected directories still pass. If nobody objects I'll modify this pull request tomorrow to use this approach, which is way simpler :)

According to
scikit-learn@9e9cdb0#issuecomment-638998985
these dictionaries with offsets are there for old numpy versions and
they are no longer needed. We can use instead an approach similar to
https://github.com/scikit-learn/scikit-learn/pull/15097/files#diff-b071106a87a03ee4e9149e3f0a2b1180L187

I have verified that tests keep passing and that the undefined behavior
that motivated this pull request in the first place is still fixed.
@slackito
Copy link
Contributor Author

I have pushed a new commit using an approach similar to this other commit, as suggested above. Tests still pass and the original UndefinedBehaviorSanitizer warning about a null pointer dereference that motivated this pull request in the first place is still fixed.

Please take another look :)

@rth
Copy link
Member

rth commented Jun 10, 2020

Great that it works! I can't find the documentation for it though. All I have found so far is this Cython test and the fact that we are already doing it in,

sklearn/neighbors/_binary_tree.pxi
184:NodeHeapData = np.asarray(<NodeHeapData_t[:1]>(&nhd_tmp)).dtype
194:NodeData = np.asarray(<NodeData_t[:1]>(&nd_tmp)).dtype

In particular I don't understand why it's necessary to do the [:1] slice. Any ideas @thomasjpfan or @NicolasHug ?

@slackito
Copy link
Contributor Author

Any other ideas?

@NicolasHug
Copy link
Member

If we declare CELL_DTYPE = np.asarray(<Cell[:1]>(&dummy)).dtype, can we still access the fields of a variable? I.e. can we still do some_cell['parent'] from Python?

Also, it seems that the original concern can be addressed by changing
<Py_ssize_t> &(<Cell*> NULL).parent,

into

<Py_ssize_t> &(<Cell*> &dummy).parent?

I'm not a fan of CELL_DTYPE = np.asarray(<Cell[:1]>(&dummy)).dtype because it's very cryptic.

In particular I don't understand why it's necessary to do the [:1] slice

no idea either, I'm even surprised the syntax is correct

@slackito
Copy link
Contributor Author

As I understand it, the intent of <Py_ssize_t> &(<Cell*> NULL).parent was to get the offset of parent inside a Cell. This works if the null pointer is address 0 because then the address of the field is equal to its offset inside the object.

Would <Py_ssize_t> &(<Cell*> &dummy).parent accomplish the same thing? It seems to me it would return the address of the parent field in the dummy object, and not the offset (which is why my original commit and the reference link in the description subtract the address of the dummy object itself to get the relative offset, i.e. (<Py_ssize_t>&(dummy.parent) - <Py_ssize_t>&(dummy))).

As for your other questions, sadly I have no idea. I'm willing to try any reasonable suggestions.

@NicolasHug
Copy link
Member

NicolasHug commented Jun 18, 2020

It seems to me it would return the address of the parent field in the dummy object, and not the offset

Yes you're probably right

Can we declare the dtype with a list instead of a dict, i.e.:

CELL_DTYPE = np.dtype([
	('parent', np.intp),
	('children', np.intp, (8,)),  # not exactly sure about the syntax, see https://numpy.org/doc/stable/user/basics.rec.html#structured-datatype-creation
	...
	],
	aligned=True
)

the aligned=True should take care of the offsets

@rth
Copy link
Member

rth commented Jun 18, 2020

Asked a SO question about it.

Can we declare the dtype with a list instead of a dict, i.e.:

Well it we can make it work without manually defining dtypes that need to be in sync with the struct that would be preferrable. We already have such code in binary trees we just need to understand what it does. Let's see if anyone replies, otherwise I could reach out on some numpy channel.

@NicolasHug
Copy link
Member

OK let's see if we can get an explanation.

Though IMHO the fact that none of us here really understand what it does is an indication that we shouldn't be using it (the cycle is going to repeat once we're not around anymore). Repeating the dtype is verbose but it's not a big deal since these almost never change. FWIW, that's what we're doing in sklearn/ensemble/_hist_gradient_boosting/common.pyx-pxd

@rth
Copy link
Member

rth commented Jun 18, 2020

So we got an answer in https://stackoverflow.com/a/62449233/1791279 . I am continuously impressed by the knowledge of SO participants on sometimes obscure topics.

Repeating the dtype is verbose but it's not a big deal since these almost never change.

Yes, but do we know what happens if one side changes and not the other. Repeating might be OK, personally I'm not too comfortable with the pointer arithmetic initially done in this PR. I think using this solution with a link to the SO comment might be enough?

Actually it looks like we merged a similar change earlier in #16141 . The original motivation was a comment by Jake that that was a cleaner way to do it https://github.com/scikit-learn/scikit-learn/blame/c45721d538d36ab4c322d18e33d9c10b55f5fe27/sklearn/neighbors/_binary_tree.pxi#L184

any opinions @jnothman @jeremiedbb ?

@rth
Copy link
Member

rth commented Jun 18, 2020

Also we got this original issue by manually setting offsets , so if we can rely on numpy more to do it, it would more reliable I think even it's a more black box solution. As long as it works.

@NicolasHug
Copy link
Member

do we know what happens if one side changes and not the other

From what I tried, either a segfault or a ValueError like
ValueError: Buffer dtype mismatch, expected 'unsigned char' but got end in 'node_struct.blah'

@jeremiedbb
Copy link
Member

jeremiedbb commented Jun 18, 2020

<Cell[:1]>(&dummy) is not slicing. When you cast in cython you have to provide the length of the array so it's just cast dummy to an array of Cell of length 1.

I can't find the documentation for it though.

here it is

I think jake's comment is right. It's cleaner and it's actually the recommended way to do that. Maybe we can add a small comment to explain that to avoid future confusion.

rth
rth previously approved these changes Jun 18, 2020
Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with the current solution then. Thanks @slackito !

@NicolasHug
Copy link
Member

if we merge this we should document it

@NicolasHug
Copy link
Member

If we declare CELL_DTYPE = np.asarray(<Cell[:1]>(&dummy)).dtype, can we still access the fields of a variable? I.e. can we still do some_cell['parent'] from Python?

Surprisingly this still works

@rth rth dismissed their stale review June 18, 2020 14:23

+1 pending a comment linking to SO

Added a brief comment and a link to stackoverflow thread that explain
why the simpler approach to define numpy types works.
@slackito
Copy link
Contributor Author

Thanks everyone for the comments!

Added comments with a brief explanation an a link to the stackoverflow thread posted above. Please take another look.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we have a better place to document this? Is the inline comment enough? I really don't know.

<Py_ssize_t> &(<Cell*> NULL).max_bounds,
]
})
# Build the corresponding numpy dtyle for Cell.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Build the corresponding numpy dtyle for Cell.
# Build the corresponding numpy dtype for Cell.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks!

<Py_ssize_t> &(<Node*> NULL).weighted_n_node_samples
]
})
# Build the corresponding numpy dtyle for Node.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks!

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @slackito !

@adrinjalali adrinjalali merged commit c11aaf5 into scikit-learn:master Jun 22, 2020
rubywerman pushed a commit to MLH-Fellowship/scikit-learn that referenced this pull request Jun 24, 2020
…17228)

* Fix undefined behavior in _tree.pyx and  _quad_tree.pyx

Accessing a member through a null pointer is undefined behavior, even if
no actual memory access is performed (like in this case, where it was
used only for offset computation). See the following link for an
explanation:
https://software.intel.com/content/www/us/en/develop/blogs/null-pointer-dereferencing-causes-undefined-behavior.html

The current approach will also fail when building with tools like ubsan
(undefined behavior sanitizer).

The standard way to compute offsets of struct members is the offsetof
macro but it seems it's not supported by Cython so I've used the
approach described here:

https://mail.python.org/pipermail/cython-devel/2013-April/003505.html

* Simplify definitions of types for numpy.

According to
scikit-learn@9e9cdb0#issuecomment-638998985
these dictionaries with offsets are there for old numpy versions and
they are no longer needed. We can use instead an approach similar to
https://github.com/scikit-learn/scikit-learn/pull/15097/files#diff-b071106a87a03ee4e9149e3f0a2b1180L187

I have verified that tests keep passing and that the undefined behavior
that motivated this pull request in the first place is still fixed.

* Document the new numpy type definition approach.

Added a brief comment and a link to stackoverflow thread that explain
why the simpler approach to define numpy types works.

* Fix typo 'dtyle'->'dtype'
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
…17228)

* Fix undefined behavior in _tree.pyx and  _quad_tree.pyx

Accessing a member through a null pointer is undefined behavior, even if
no actual memory access is performed (like in this case, where it was
used only for offset computation). See the following link for an
explanation:
https://software.intel.com/content/www/us/en/develop/blogs/null-pointer-dereferencing-causes-undefined-behavior.html

The current approach will also fail when building with tools like ubsan
(undefined behavior sanitizer).

The standard way to compute offsets of struct members is the offsetof
macro but it seems it's not supported by Cython so I've used the
approach described here:

https://mail.python.org/pipermail/cython-devel/2013-April/003505.html

* Simplify definitions of types for numpy.

According to
scikit-learn@9e9cdb0#issuecomment-638998985
these dictionaries with offsets are there for old numpy versions and
they are no longer needed. We can use instead an approach similar to
https://github.com/scikit-learn/scikit-learn/pull/15097/files#diff-b071106a87a03ee4e9149e3f0a2b1180L187

I have verified that tests keep passing and that the undefined behavior
that motivated this pull request in the first place is still fixed.

* Document the new numpy type definition approach.

Added a brief comment and a link to stackoverflow thread that explain
why the simpler approach to define numpy types works.

* Fix typo 'dtyle'->'dtype'
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants