Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sqlite bind review #7

Closed

Conversation

Ten0
Copy link
Collaborator

@Ten0 Ten0 commented Nov 22, 2021

  • Query is boxed before collecting binds, ensuring any reference built in binds collection stays valid
  • We always rely on the destructor of BoundStatement to re-bind Null, which guarantees it also gets re-bound in case of panic
  • Remove ManuallyDrop as the contained values actually always need to be dropped when going out of scope.

weiznich and others added 18 commits November 15, 2021 14:55
This commit removes a old hack from our bind serialization layer that
requires multiple value copies for the sqlite backend going first to a
binary format and later on deserialize the data again to call the
corresponding bind function. After this commit it is possible to just
store the corresponding values and call the corresponding bind functions
directly onto those values. This removes a therotical overhead and
should improve the performance on bind heavy workloads (like inserts).
To do this we basically need to teach the compiler about a invariant
that is already true: Bind values need to live at least as long as the
corresponding query. To describe that in the type system we need to
introduce a few lifetime bounds for `QueryFragment::walk_ast` and
`ToSql::to_sql`. In exchange we can remove the write bound from
`ToSql::to_sql`. In my opion that does not make the it harder to
implement the corresponding traits as users don't need to interact with
those lifetimes at all and they describe something that is already true
for almost all cases. Additionally now that rust-analyzer can generate
function bodies for trait implementations on the fly users don't need to
type those bounds, as they are generated with the body stub.
Most of the changes are releated to just introduce the lifetime bounds
in all nessesary impls and fix minor incompatibilities there.
This improves the sqlite bind collector to skip even more data copies,
by setting the sqlite bind mode to `STATIC`. This prevents that sqlite
creates a internal copy of the bind data, which means we must ensure
that the data live at least as long as the buffer is bound to the
statement. This requires some unsafe hackery to ensure that all data
remain avaible. We ensure this by binding all information to
`StatementUse` itself, which lives as long as the user can access the
executed query. If `StatementUse` is dropped we cleanup the
corresponding buffers.
This commit improves the unsafe code usage in the implementation of
StatementUse by introducing the correct lifetimes to bind the
StatementUse instance to the query itself. This prevents dropping the
query (and any bound value) before dropping the StatementUse containing
the corresponding binds.
This commit renames some lifetimes related to the iterator setup to use
more concrete names than `'a` or `'b`. Additionally the
`elided_lifetimes_in_paths` lint is enabled to prevent hidden lifetimes
in types. This makes it much easier to reason about unsafe code
This applies the suggested made by Yandros in the rust users forum here:
https://users.rust-lang.org/t/code-review-for-unsafe-code-in-diesel/66798/7

Especially:

* Sperates the bind wrapper from the name wrapper
* Adds some more comments about what safety invariants some functions
have
* Use `ManuallyDrop` for shared buffers
* Fix unbinding bind values if we return early from the
`BoundStatement::bind` function, as it would occour if the
`Statment::bind` function fails
* Remove the `OnlyCell` usage for column names, as it is not required
* Add an additional safeguard that clears cached column names after the
first call to `.step()`
- Query is boxed before collecting binds, ensuring any reference built
  in binds collection stays valid
- We always rely on the destructor of BoundStatement to re-bind Null,
  which guarantees it also gets re-bound in case of panic
| SqliteBindValue::BigInt(_)
| SqliteBindValue::Float(_)
| SqliteBindValue::Double(_)
| SqliteBindValue::Null => {}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I favor exhaustive writing in this scenario as this guarantees we haven't forgotten anything.

Copy link
Owner

@weiznich weiznich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like most of the ideas implemented in this PR, unfortunately the semantics about unique ptrs (String/Vec/Box), shared buffers and ManuallyDrop.
Additionally I'm a bit skeptical about the lifetime transmute there. I've opted to implement most of the other changes as part of diesel-rs@d29eac1

// we need to store any owned bind values speratly, as they are not
// contained in the query itself. We use ManuallyDrop to
// communicate that this is a shared buffer
binds_to_free: ManuallyDrop<Vec<(i32, Option<SqliteBindValue<'static>>)>>,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand the current state of rusts unsafe code guide lines removing the ManuallyDrop here may cause be undefined behaviour. The problem here is that internally this vector stores either a String or a Vec<u8> which both internally use a Unique pointer. As the name suggests this pointer type assumes unique access to the allocated memory. As soon as we bind the buffer to the sqlite statement we violate this contract. According to this is one possibility to communicate to the compiler that it cannot assume this property. (The same argument applies to the query field above)

Copy link
Collaborator Author

@Ten0 Ten0 Nov 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine as long as the corresponding pointer is never used by anything while the other references exist : the unique property is used to enable noalias llvm optimizations, which won't enter into play as long as the other reference is never deferenced.
There are similar issues describing this and what would be actually valid within Kimundi/owning-ref-rs#49.
It can be seen that the issue arises from the code accessing the same memory location through pointers that have noalias.
Note that in this PR, the box is never dereferenced after binds are collected and until they are dropped - which is the same that would happen when writing normal code.

Copy link
Collaborator Author

@Ten0 Ten0 Nov 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had this in mind when writing it but you're right, it's not so simple as my complete lack of commenting on this topic led to believe 😁
It's definitely something I should have commented on as it's definitely super not obvious 😄

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand the owning-ref issue and the linked unsafe code guideline issue the topic is:
a) Not decided yet
b) Box<T> contains a Unique<T> pointer which gets translated as noalias so yes technically as long as we don't access the pointer till we resolved the aliasing we should be fine. But that seems a property that is really hard to verify. Especially around drop. For example the compiler already notices that this field is "unused" so it may want to decide to just not store the corresponding struct field till BoundStatement is used. ManuallyDrop resolves the Drop issue, as we then can exactly tell the compiler to perform the drop at a specific location. It does not really resolve any other potential use through the Unique ptr there. It seems like the only way that is currently accepted as definitively not causing UB is to use Box::into_raw here to move out the pointer from the inner Unique<T> type into a normal pointer. This could be stored till the any aliasing is over (so till the Drop impl) and then reconstruct the Box via Box::from_raw. (Same reasoning for String/Vec as they are internally similar, but there we need to store the length as well).

Especially see that linked playground (notice the difference between release and debug build 😱 )

Note: This discussion likely applies to the mysql bind implementation as well. 😞

(I should note that the more I read about aliasing, the more I come horrified how to ever get that right in the context of non-trivial unsafe code like that one 😱 )

Copy link
Collaborator Author

@Ten0 Ten0 Nov 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(playground is link broken)

the more I read about aliasing, the more I come horrified how to ever get that right in the context of non-trivial unsafe code like that one 😱

Yeah I didn't realize how hard unsafe actually was until I had a look at the open issues in owning-ref either 😅

that seems a property that is really hard to verify. Especially around drop. For example the compiler already notices that this field is "unused" so it may want to decide to just not store the corresponding struct field till BoundStatement is used.

AFAIK Drop's behavior is always specified - rust wouldn't call drop on the BoundStatement without having properly set all its fields (nor would it ever do that with any other function, regardless of how it optimizes its stack and registries before it has to make that call).
I'm not sure I understand what scenario you have in mind here.

Considering we don't need to use the query and only keep it here to prevent it from being destructed, I believe it's easier to guarantee that a field is never used than it is to guarantee that we don't forget to drop stuff on error/ok/panic. (e.g. in the previous implementation, if anything happens to panic within for (bind_idx, (bind, tpe)) in (1..).zip(binds.into_iter().zip(metadata)), query and binds_to_free are memory-leaked)
(Then again I should definitely have added comments specifying that this field should never be used because of noalias requirements.)

To the extreme, if we want to compile-enforce that the field is never used, we can

mod never_use {
    pub struct NeverUse<T> {
        _private: T,
    }
    impl<T> NeverUse<T> {
        fn new(inner: T) -> Self {
            Self { _private: inner }
        }
    }
}
pub use never_use::NeverUse;

I'd feel much less confident guaranteeing correctness of rust code that manages its allocations itself with raw pointers manipulations than guaranteeing that the Box is never dereferenced until dropped, and that it is never dropped before the relevant binds are re-bound, using RAII.

Copy link
Collaborator Author

@Ten0 Ten0 Nov 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a) Not decided yet

Not sure I understand what you mean. It looks like regardless of whether they decide that this code should be UB or not (seems reasonable to me that it is btw), as long as we don't use the "unique" at the same time as a pointer derived from it, we're fine?

Especially considering we're not mutating these values, either side: https://doc.rust-lang.org/nomicon/aliasing.html:

we don't actually care if aliasing occurs if there aren't any actual writes to memory happening

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed playground link

I'd feel much less confident guaranteeing correctness of rust code that manages its allocations itself with raw pointers manipulations than guaranteeing that the Box is never dereferenced until dropped, and that it is never dropped before the relevant binds are re-bound, using RAII.

I feel exactly the other way around. Handling memory allocations is something I'm quite confident about, while I fear that the compiler may decide that this are not used and therefore removes something. Maybe I'm just trying to be too perfect here...

Otherwise I've played a bit with an modified implementation that stores raw pointers. I've got a working version that:

  • Uses drop as single "cleanup" point
  • Only contains a minimal trivial amount of unsafe ( no transmute, no pointer case, just Box::from_raw and NonNull::new_unchecked)

I personally find that version quite readable and surprisingly simple compared to any other version yet. I will add some more comments tomorrow and push that as new commit so we can discuss that version.
Generally I would prefer to error on the "safe" side here, so if something is not that obvious from the unsafe code guidelines better use something simpler...

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As promised I pushed 28409cc which uses raw pointers.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(The PR is updated as well)

Comment on lines +218 to +222
let bind_collector = unsafe {
std::mem::transmute::<SqliteBindCollector<'_>, SqliteBindCollector<'static>>(
bind_collector,
)
};
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is technically safe in the scope of the current function I really try to avoid transmute as much as possible, even if it makes the code a bit harder to write. My issue with transmute is that any code using transmute may be valid now, but it's incredible hard to spot if that remains true after a refactoring, especially for persons who've not written the original code.

Copy link
Collaborator Author

@Ten0 Ten0 Nov 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made sure this single line enforces that we're only transmuting the lifetime, and we're calling a function that makes use of these references as static references right after anyway, which may become just as incorrect.
I think it's reasonable to assume that changes to this file are hard enough to make that they would be made by people who have read the whole file and would be thoroughly reviewed.
That is just to put the query in Self asap, so it always gets dropped after the binds are erased, whatever happens.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct, doing that here is fine. I consider this to be a minor issue for now, the main point is the discussion above about aliasing.

@Ten0
Copy link
Collaborator Author

Ten0 commented Nov 27, 2021

Sorry for the late answers - I somehow missed the notifications 🙁

@weiznich
Copy link
Owner

@Ten0 No worries. I left a comment about the aliasing thing above. After reading the corresponding unsafe code guideline discussions I believe we should just store raw pointers there to be sure. That's not really nice, but at least that's the variant everyone agrees about being correct.

@Ten0 Ten0 deleted the simpler_and_safer_sqlite_bind branch January 3, 2022 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants