Skip to content

Conversation

@jmetter
Copy link
Contributor

@jmetter jmetter commented Feb 27, 2017

I admit this is an invasive with a lot of modifications to the C code, but I did not see a simpler way to make aggregate functions work correctly. The following is a high-level overview. If you want to know more details, please read the commit messages.

What is currently wrong with user defined aggregate functions?

Defining a second aggregate function and using both results in memory violation.

Currently sqlite3-ruby just holds a reference @agregator on the aggregator defined last. If the user defines two functions with different names, SQLite will hold the callback pointers to both of them but Ruby's GC will clean up the aggregator defined first. If the user subsequently uses the first function in a query, invalid memory access results. I observed crashes and weird behavior in practice.

The most straight-forward fix would to insert all aggregators defined in a ruby array and release them only at Database close. This is how it is currently done for (ordinary) functions.

Using the same aggregate function twice in a query mixes up values from both columns.

Assume we have defined an aggregate function "ACCUMULATE" that does the same as SUM:

CREATE TABLE foo (a INTEGER, b INTEGER);
INSERT INTO foo (a, b) VALUES (1, 10);
INSERT INTO foo (a, b) VALUES (2, 20);
SELECT ACCUMULATE(a), ACCUMULATE(b) FROM foo;`

You would expect [3, 30] but you actually get [33, nil]. The reason is that sqlite3-ruby does not properly account for individual execution contexts of aggregate functions. This has already raised issues earlier:

The assumption that SQLite will first bring one execution context to termination in finalize and then continue the next execution context is not generally valid. It may be violated by more complex GROUP BY queries. It is definitely violated in the example above, as SQLite calls step(1); step(10); step(2); step(20); finalize; finalize.

The way of handling separate execution context that is intended by the SQLite API is sqlite3_aggregate_context() which is a memory allocation function that returns the same memory when called twice from the same context. Admittedly a weird interface. The primary contribution of this pull request is to change sqlite3-ruby's aggregate instance handling to depend on sqlite3_aggregate_context().

Arity value is not respected

Documentation implies that library users shall explicitly set the arity of the function. Consequently the library user is likely to expect that this is passed down to SQLite, so he may define multiple
functions with the same name but different arity and SQLite to use the most appropriate one (documented sqlite feature). Unfortunately, sqlite3-ruby does not pass the arity to SQLite, which is surprising given that different arity-values are accounted for in the C code. The problem is that sqlite3-ruby does not call the AggregateHandlers "arity" method but instead queries the arity of the "step" method from Ruby (sqlite3_obj_method_arity). Confusingly, this is not the "step" provided by the library user, but the "step" of the Proxy Wrapper classes introduced by Database.create_aggregate or Database.create_aggregate_handler. Both of these are just step(*args), so the arity reported to SQLite is always -1.

Update: with Database.define_aggregator the library user was able to define functions with arity other than -1, although this was undocumented behavior.

Exceptions in callbacks are not appropriately handled

step and finalize execute in callback context from sqlite3_step.If one of these function raises, execution far-jumps immediately back to ruby code. SQLite certainly does not expect the callbacks it invokes to do such a thing. I don know if this results an any inconsistencies internal to SQLite. What i know is that the params array allocated by xcalloc in step won't be released in this case.

What does this pull request contain?

  • creation of aggregate handler instances moved to the C code. What is the same context and what forms a new context is now derived from sqlite3_aggregate_context()
  • in order to create the instances in C a new interface between the C and the Ruby part was necessary. I created a new private method define_aggregator2 that is similar to create_aggregate_handler for that purpose.
  • keep track of all the handler classes and their instances lifetime. In other words find them in the GC-mark phase. Release instances after they call finalize. Release aggregate handlers via sqlite3_create_function_v2's xDestroy callback if available.
  • pass the aggregators arity value down to SQLite as the documentation implies
  • Protect all ruby calls in callbacks and re-raise exceptions in Statement.step.
  • Update: make arity setting with Database.define_aggregator a documented feature.

Does this change via API?

I was careful to keep the API as it was and rewrote the create_aggregate and create_aggregate_handler shims to use define_aggregator2. I also implemented a ruby shim for define_aggregator. The only observable difference that I have not listed as a bug above that I am aware of is that the aggregator passed to define_aggregator must implement clone, as all regular Ruby objects do. This is because I need to clone it to create its individual instances for the execution contexts.

I also have a fallback in place for the old sqlite3_create_function should the new one be unavailable.

How is this tested?

I wrote a lot of Unit tests. I also checked memory with valgrind. Valgrind was able to detect the memory violation described above. After my patches only the reads from uninitialized memory warnings remain that ruby seems to cause no matter what it runs. I tested on OSX 10.10 with Apples Ruby 2.0.0 and on ArchLinux with Ruby 2.3.3.

What is not addressed by this patch?

  • create_function: impossible to call set_error #164 it is impossible to call FunctionProxy.set_error (and FunctionProxy.count). because @driver does not exist (anymore). I think it makes more sense to report errors via ruby exceptions and exceptions should be fine now.
  • Database.create_aggregate and the tests contain a 'text_rep' field, that is never passed down to SQLite. It is not advertised in the documentation, though. I did not come up with a use case for it.
  • Obviously, ordinary functions share a lot with the aggregate functions. I did not have a detailed look at the ordinary functions and this patch does nothing about them.

@tenderlove
Copy link
Member

Wow, thanks a lot for this work. The patch is a lot to grok, but your description makes sense. On your commit that introduces a linked list, the commit message says that it's being introduced because of finalization problems when using a Ruby array. Can you explain this? I'm not seeing anything in the new mark / sweep functions that would indicate we can't use a Ruby array (but maybe I'm just not seeing it).

@jmetter
Copy link
Contributor Author

jmetter commented Mar 22, 2017

Thank you for considering this merge request.

Concerning the use of Ruby Arrays: you are probably right and a ruby array works fine. Being used to C but not to writing ruby extensions I already had a clear picture what the lifetimes should be and was uneasy about the GC possibly doing something I do not expect. I implemented it, checked with valgrind, still had problems (see below) and then -- you might call it a short-circuit reaction -- rewrote things in a way I am more familiar with.

Unfortunately, I don't have the original code anymore. I rewrote this from memory, so please take it with a grain of salt. It looked like this:

struct _sqlite3Ruby {
  sqlite3 *db;
  VALUE aggregators; /* = rb_ary_new(); rb_gc_mark() */
};

typedef struct rb_sqlite3_aggregator_wrapper {
  /* back-ref to the aggregators array this wrapper lives in.
   * Not marked to avoid cyclic reference. Stupid optimization in hindsight. */
  VALUE parent_array;
  
  /* the AggregateHandler class we are wrapping here */
  VALUE handler_klass; /* rb_gc_mark() */
  
  /* Ruby array of in-flight instances of the AggregateHandler klass.
   * I did not get to the point where I actually allocated the instances. */
  VALUE instances; /* = rb_ary_new(); rb_gc_mark() */
} aggregator_wapper_t;

/* AggregatorWrapper ruby class, hidden from scripts */
VALUE cAggregatorWrapper;

VALUE rb_sqlite3_aggregator_alloc(VALUE self) {
    struct rb_sqlite3_aggregator_wrapper *x;
    return TypedData_Make_Struct(self, aggregator_wrapper_t, /* type */, x); 
}

/* called by sqlite3_create_function_v2's xDestroy-callback. */
void rb_sqlite3_aggregator_destroy(void *void_aw) {
    aggregator_wrapper_t *aw = void_aw;
    long i;
    if (aw->parent_array == Qnil) return;
    
    /* lookup my position in the aggregators array and remove reference to me */
    for (i = 0; i < RARRAY_LEN(aw->parent_array); i++) {
        aggregator_wrapper_t *x;
        VALUE ruby_x = rb_ary_entry(aw->parent_array, i);
        TypeData_Get_Struct(ruby_x, aggregator_wrapper_t, /* type */, x);
        if (x == aw) {
            rb_ary_delete_at(aw->parent_array, i);
            aw->parent_array = Qnil;
            return;
        }
    }
    rb_fatal("not in aggregators");
}

/* called by sqlite3_rb_close() after sqlite3_close() and in deallocate() */
void rb_sqlite3_aggregator_destroy_all(sqlite3RubyPtr ctx) {
    long i;
    if (ctx->aggregators = Qnil) return;
    
    for (i = 0; i < RARRAY_LEN(ctx->aggregators); i++) {
        aggregator_wrapper_t *aw;
        VALUE ruby_aw = rb_ary_entry(ctx->aggregators, i);
        TypeData_Get_Struct(ruby_aw, aggregator_wrapper_t, /* type */, aw);
        /* make sure rb_sqlite3_aggregator_destroy does not try to access
         * the aggregators array any more. */
        aw->parent_array = Qnil;
    }
    
    ctx->aggregators = Qnil;
}

void init() {
    rb_gc_register_address(&cAggregatorWrapper);
    cAggregatorWrapper = rb_class_new(rb_cObject);
    rb_define_alloc_func(cAggregatorWrapper, rb_sqlite3_aggregator_alloc);
}

Well, valgrind complained but I did not find a bug. I came up with the explanation that I may not access "member objects" in the finalizer deallocate(). With "member objects" I mean the aggregators array and indirectly the remaining AggregatorWrappers. I did not find useful documentation on Ruby.s GC, not even a guarantee that it won't move objects in memory, which AFAICT everyone assumes.

The funny part is: I again got the same valgrind error after switching to my linked list, but this time I was sure my code works as intended. Only then I realized that valgrind ruby -Itest test/test_integration_aggregate.rb is stupid and I must add -Ilib because otherwise ruby would still uses my globally installed gem instead of the sqlite3_native in the working directory.

That said I currently cannot say with certainty whether the ruby array approach is correct or not. Personally, I prefer the linked lists because it avoids the Data_Wrap_Struct / Data_Get_Struct hassle and lifetimes are explicit. If, however, the linked lists are not your taste, I would be willing to rewrite this to ruby arrays and try again with valgrind.

You may wonder why I packed all the VALUEs in C-Structs instead of accessing them with rb_iv_get and rb_iv_set. Well, I was afraid some weird ruby script would modify instance variables, e.g. clear the aggregators such that the AggegatorWrapper is free'd too early. You may legitimately consider defense against such behavior out of scope. Anyways, if you prefer Ruby IVs I can go with them when switching back to Ruby arrays.

@tenderlove
Copy link
Member

I did not find useful documentation on Ruby.s GC, not even a guarantee that it won't move objects in memory, which AFAICT everyone assumes.

Objects won't move. I'm working on a patch for the GC that will allow them to move, but it is conservative: anything marked with rb_gc_mark will not move, anything on the stack will not move. The only thing we need to worry about is pointers in the heap (but that is a concern regardless of movement as the GC may not see them and will free them).

You may wonder why I packed all the VALUEs in C-Structs instead of accessing them with rb_iv_get and rb_iv_set.

Ruby scripts can only get access to instance variables that start with an "@", but the C API allows you to use any name you wish. It's probably not documented, but rb_iv_set(obj, "no-at-sign-ivar", thing) is a common way to set private instance variables (we just have to make sure not to cache those pointers because they may move in the future).

Anyways, if you prefer Ruby IVs I can go with them when switching back to Ruby arrays.

I think using Arrays and IVs would drop the patch size so I'd prefer them. If fighting with the GC turns out to be too much hassle, I'd take this patch too.

Again, thanks for the work on this. I appreciate it!

jmetter added 18 commits March 29, 2017 12:10
For the upcoming tests, I need a third column in the 'foo' table -- or
alternatively a second table. Extending the existing table in
`test_integration.rb` let some other tests fail. Given that
`test_integration.rb` is already lengthy, I thought it would make
things clearer if I separate the aggregate tests into their own file
with their own setup/teardown methods.
Added tests to demonstrate three issues with the way sqlite3-ruby
currently handles aggregate functions:

(1) Defining a second aggregate function and using both results in
memory violation. The problem is that `rb_iv_set(self, "@agregator",
aggregator);`   (database.c:399) just overwrites the last reference to
the first  AggregateHandler, but SQLite still holds a pointer to it and
will follow   the pointer if the first aggregate function is used in a
query.
  The most straight-forward fix is to insert the aggregator in a ruby
array and never release it -- similar to how its done for functions.

(2) Using the same aggregate function twice in a query mixes up values
from both  columns. For example:
  `SELECT MYAGGREG(a), MYAGGREG(b) FROM foo;`
  Leads to: Myaggreg.step(a1); Myaggreg.step(b1); Myaggreg.step(a2);
  Myaggreg.step(b2); ... ; Myaggreg.finalize(), Myaggreg.finalize()
  The SQLite API expects the caller to differentiate between these
chains  of invocation via `sqlite3_aggregate_context()`, but current
sqlite3-ruby  does not account for that.
  #44 has been a work around for this in the special case that the first
  aggregation is finalized before the second started (separate queries).
  #161 does the analog for the function Proxy.

(3) Documentation implies that library users shall explicitly set the
arity of the function. Consequently the library user is likely to
expect that this is passed down to SQLite, so he may define multiple
functions with the same name but different arity and SQLite to use the
most appropriate one  (documented sqlite feature). Unfortunately,
sqlite3-ruby does not pass the arity to SQLite, which is surprising
given that different arity-values  are accounted for in the C code. The
problem is that sqlite3-ruby does not call the AggregateHandlers
"arity" method but instead queries the arity of
  the "step" method from Ruby (sqlite3_obj_method_arity). Confusingly,
this  is not the "step" provided by the library user, but the "step" of
the  Proxy Wrapper classes introduced by Database.create_aggregate or
Database.create_aggregate_handler. Both of these are just
`step(*args)`,  so the arity reported to SQLite is always -1.

Things not addressed:
  - #164 impossible to call FunctionProxy.set_error (and
FunctionProxy.count).  because @driver does not exist (anymore)
  - Database.create_aggregate and the tests contain a 'text_rep' field,
that  is never passed down to SQLite. It is not advertised in the
documentation, though.
To manage the active references to the aggregate handler and its
instances, I need some list/array/vector. I have used Ruby's array for
that, but wrapping the rb_sqlite3_aggregator_wrapper with a new, hidden
Ruby class turned out more complex than I initially expected.
Especially the undefined finalization order of the GC posed problems on
shutdown. Consequently I decided that it would be better to handle the
datastructures completely in C there I have full control over
the destruction order.
As I already wrote in an earlier commit message, the memory access
violation occurs after the library user installs a second aggregate
handler. In that case, Database.@aggregate is overwritten and
therefore no references to the AggregateHandler remain. The simplest
fix would be to change @Aggregator (or @agregator) into a Ruby array of
all AggregateHandlers ever installed. I decided against that because I
think its cleaner when SQLite controls the lifetime of the handlers. To
that end the sqlite developers extended the sqlite3_create_function API
with the xDestroy callback. The extended "v2" Version is available
since 3.7.3 (2010-10-08). I installed a fallback to the old version,
should the new function be unavailable. While it is possible to remove
elements from an @Aggregator ruby array while in xDestroy callback,
this was clumsy and difficult to do correctly under undefined
finalization order. I came to the conclusion that its easier to manage
the lifetime completely in C and only provide the appropriate
rb_gc_mark function to SQLite3::Database. The remaining issue is to
make the VALUEs for the AggregateHandlers (and later their instances)
visible to the mark function. To that end, I introduced the
aggregator_wrapper struct that holds the VALUE and the linked list
pointers. I replaced @Aggregator with the _sqlite3Ruby.aggregators
linked list head.

I used valgrind to check that memory accesses are sane now. Before this
patch, I saw loads from previously free'd memory that disappeared.
There are still lots of warnings about use initialized values, but this
is unrelated and most likely an internal issue with MRI 2.0.0.

As part of this patch, I moved the aggregate function related code into
its own files "aggregator.[ch]".
I missed this in the initial set of unit tests. #44 handles this case
already. Keep in mind though that is just the effect of the order in
which SQLite calls the step() and finalize() functions and could
theoretically change.
This is clearly an array of VALUE, not of VALUE*. As it should hold
sizeof(VALUE) == sizeof(VALUE*) this does not really matter though.
This commit addresses the issue that using the same aggregate function
twice in a select statement mixes up the execution contexts in the
AggregateHandler. While resetting the aggregate Handler in the
finalize() method emulates the context lifetime in the other cases
(group by, independent select), resetting is a fragile approach.
Instead sqlite3_aggregate_context() is the function in the SQLite API
that should be used to separate contexts.

sqlite3_create_context() is a bit weird though: it is memory allocator
that returns the same memory when called from a context that already
called sqlite3_create_context once. Equally strange is that there is no
destroy() callback -- sqlite will clear up the memory when it does not
need it any more and won't tell us. Because a second allocator is
difficult to combine with Ruby's GC and because I do not have the
lifetime control, I use it only to store a pointer to a heap allocated
aggregator_instance_t structure. This structure holds the VALUE for the
Ruby instance that belongs to the aggregate function's execution
context. The structure forms a linked list rooted in the
aggregate_wrapper_t for the function it as an instance of, so the
instance can be GC marked.

SQLite's API left me with no choice but to assume that every execution
context ends with a single call to finalize(). At that point I unlink
and destroy the aggregate_instance_t. I never observed SQLite doing
anything else. There are checks in place to catch if SQLite does call
finalize() a second time. If it does not ever call finalize(), we leak
memory. If someone wants to use sqlite3_result_error, he or she should
check if the invocation of sqlite3_result_error is also the right place
to destroy the aggregate_instance_t.

This commit comes with one rather big change: now sqlite3-ruby
allocates the instances of the aggregate handler in the C code. This
forced me to change the interface, because the existing interface
passed only the instance object, not the factory, to the C code. To
that end, I introduced a new private method Database.define_aggregator2
that takes the factory as it sole parameter. Its like
Database.create_aggregate_handler but without the FunctionProxy "ctx"
object that serves no useful purpose. I wrote a new Ruby shim for the
old Database.define_aggregator and adopted Database.create_aggregate
and Database.create_aggregate_handler to the new interface between Ruby
and C. Generally I was careful to keep the interface as it is.

There is a tiny API change that I could not help avoid:
As Database.define_aggregator only receives one object, but it needs
one per execution context, the only option i saw was to use the
provided one as template and give every new instance a clone. This
means, however, that the template must provide a suitable clone. As
every ordinary ruby object already provides one, this should not pose
problems in practice, though.

And of course, people who were getting wrong results from their
aggregate functions suddenly get the right one, which may also come as
a surprise to them.
Arities outside [-1,127] invoke undefined behavior.
A tiny optimization for the common case that an aggregate function
takes only one value.
We call +new+, +step+ and +finalize+ in callback context from
sqlite3_step.
If one of these function raises, we have to deal with the following
issues:
- all three functions (may) leak memory from xcalloc in this case
- the checks I inserted in 325ac74
result in
  a rb_fatal() if +new+ raises.
- SQLite surely does not expect its callbacks to far-jump out of
sqlite3_step.
  I don't if that causes any inconsistencies in SQLite's internal data
structures.

We address these issues by protecting these three funcalls with
rb_protect, which suppresses the exceptions. Statement.step is the
point where suppressed exceptions should be re-raised. As I could not
find an easy way to transport the status value from rb_protect to
Statement.step (sqlite3_set_error_code would be possible but fragile),
I instead re-raise the exception from rb_errcode(). In principal, this
may raise any exception in $!, but I don't see how this would affect
normal client code.

I also set sqlite3_result_error in finalize to indicate an exception,
although the user should see the raised exception instead. It would
certainly be more consequent to call sqlite3_result_error as soon as we
catch an exception. I guess this would stop SQLite from doing any more
useless work and abort the query. But, as I already wrote, I am unsure
about the consequences for memory lifetime in that case, so it seems
safer to me to defer any error until finalize.
Silence some warnings.
Call StringValueCStr before xcalloc but last of all calls into ruby.
This way we do not leak memory should name.to_s raise or return a
String with interior null.
This has been undocumented behavior before: while create_aggregate and
create_aggregate_handler effectively used an arity of -1, the library
user could actually define functions with a different arity by calling
define_aggregator directly. I was not aware of this behavior and
therefore broke it by forcing the arity to -1. Now I added the required
shim and a test.
All data is now managed in plain ruby objects referenced from ruby
arrays. Instance variables are prefixed with "-" to make them
inaccessible from scripts.

I ran into the finalization issue in `deallocate()`: When
`deallocate()` calls `sqlite3_close()` the object passed to
`sqlite3_create_function` must still be alive. Otherwise the
`xDestroy()` handler will attempt to destroy something that is not
around any more. Unfortunately, when `deallocate()` is invoked, the GC
might already have released the aggregator wrappers, because the
SQLite3::Database ruby object is already dead at that point. The only
way to circumvent this, that I came up with, would have been to
reference the aggregator wrappers from a global root, e.g. by
referencing a ruby array with `rb_gc_register_address()`. Given that
the aim was to cut code size, I considered it more appropriate to do
without the xDestroy callback all together. That is, I call the old
sqlite3_create_function API again.
rb_class_new generatos class with undefined allocator in ruby 1.9. Calling Class.new is also a one-liner and works in ruby 1.9 and 2.x.
@jmetter
Copy link
Contributor Author

jmetter commented Mar 29, 2017

I changed the code to use regular Ruby objects and instance variables. I prefixed the instance variables with "-" istead of "@", to make them inaccessible from Ruby. All references to the objects are now managed in Ruby arrays. That way, I could remove both the linked list and the mark operations. I also had to rebase against master.

I ran into the finalization issue in deallocate():

static void deallocate(void * ctx)
{
  sqlite3RubyPtr c = (sqlite3RubyPtr)ctx;
  sqlite3 * db     = c->db;

  if(db) sqlite3_close(db);
  xfree(c);
}

When deallocate() calls sqlite3_close() the object(s) passed to sqlite3_create_function must still be alive, because sqlite will call back the xDestroy handler for each of these. That means the xDestroy() handler will attempt to destroy something that is not around any more. Unfortunately, when deallocate() is invoked, the GC might already have released the aggregator wrappers. I reference the wrappers through an array that is an instance variable of the SQLite3::Database ruby object. But this array/ its contained objects don't not exist anymore when deallocate runs because the SQLite3::Database object is dead at that point. To my understanding, it won't make a difference when I mark the array/ its contained objects with a provided mark function. One way I came up with to keep them alive past deallocate would be to reference the aggregator wrappers from a global root, e.g. by referencing a ruby array with rb_gc_register_address(). Another approach would be to wrap a struct and delay freeing this struct until the xDestroy callback but only if an xDestroy callback has been registered (sqlite3_create_function_v2 called). That would not be much simpler or shorter than the code I already had. Note that the old code had xcalloc allocated structs instead of wrapper objects. I free those in the xDestroy handler or (as a backup) after the sqlite3_close.

Given that the aim was to cut code size, I considered it more appropriate to do without the xDestroy callback all together. That is, I call the old sqlite3_create_function API again. The only drawback is that overriden aggregators won't be released until the database is closed

I don't know whether this is future-proof with regards to your moving GC. While I can keep objects alive by referencing them from an array, that does not forbid moving them. I don't currently rb_gc_mark() them. Does it suffice to just call rb_gc_mark() on an objet after its creation to make it immovable forever, or would I have to provide a mark function callbacks that explicitly marks every object referenced in the ruby array on every mark run?

I stumbled over the sunny/shady distinction. To my understanding the objects created in C code are shady because they have been at some point accessible to C. If all shady objects are immovable, this code should be fine. If not, I think the original code would break under a compacting GC the same way as my current code does.

If the moving GC issue is not a blocker for you, I am done.

@tenderlove tenderlove merged commit 930dc07 into sparklemotion:master Jan 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants