
CG SQL Blog Archive


I've captured the original blogs from cgsql.dev for posterity in case that site goes away. There is useful documentation as well. Some of this should go into the main guide but for now here it is so that it isn't lost. The articles below are in the original post order and I've recovered the original date. I've also fixed the internal links (all, I think) so that they stay in this repo rather than linking out to the original repo.

Welcome (2020/10/12)

Hello everyone!

Thank you for visiting the CG/SQL blog page. If you would like to read the very first blog announcing the project, please go over to the Facebook Engineering post published in early October 2020.

Looking forward to working with all of you!

Sincerely, CG/SQL Team

One Month Update (2020/11/12)

It's hard to believe it's been a month since the welcome message went up. We were happy to see interest right away: a bunch of forks, and most of all pull requests. A sweeping change to modernize the cql.y grammar was much appreciated. That $1 stuff was very old school (I'm showing my age now).

Here's a quick summary of what's been going on:

  • @mingodad gave us an implementation of check and collate column attributes (the check attribute on tables should be easy to add from here)
  • the select function form should never return objects, only SQLite types, enforced
  • @attribute(cql:suppress_result_set) was added to save code gen for procedures that don't need the C result set wrappers
  • cql_cursor_diff_col and cql_cursor_diff_val methods were added to report what's different about two cursors (highly useful in test code)
  • cql_cursor_format was added so you can quickly convert any cursor into columns and values as a string for debug output, no matter the shape (see the sketch after this list)
  • sqlite3_changes was added to the builtin list so you don't have to use declare select function to use it anymore
  • cql_get_blob_size was added so you can see how big your blobs are (useful for diagnostics)
  • trim, rtrim and ltrim were added to the builtin list so you can use them without declare select function
  • the builtin function ifnull_crash was added so that nullables that have already been checked can be safely typecast to not null
  • the bug we saw in demo video number 2 where some foreign keys were not properly linked up in autotest code was fixed (yay videos)
  • time functions are now known to be not null for a bunch of simple cases such as 'now' arguments
  • you can use the cast(.. as ..) operator on numeric types outside of the SQL context
  • @mingodad replaced all the positional references by named references in cql.y (yes! thank you!)
  • several minor bug fixes
  • the railroad diagrams were updated
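
As a quick taste of those cursor helpers, here is a minimal sketch (the shape and the values are made up):

declare C cursor for select 1 id, 'foo' name;
declare D cursor like C;
fetch C;
fetch D from values(2, 'bar');

-- format any cursor's columns and values as a string for debug output
call printf("%s\n", cql_cursor_format(C));

-- report the first differing column with its left and right values, or null if the cursors match
declare diff text;
set diff := cql_cursor_diff_val(C, D);
if diff is not null then
  call printf("%s\n", diff);
end if;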

NOTE: I often refer to "sugar" in the below. This is short for syntactic sugar which, in case you're not familiar with the term, refers to a syntactically more pleasing way of writing a concept that is otherwise totally doable with normal syntax. Many languages have sugar for forms that are common -- for brevity, clarity, and/or correctness.

And now a few notes on The Big Stuff

We often add new features to the language to facilitate the writing of tests. The tests have a lot of boilerplate, often setting up and calling the same procedures again and again with slightly different arguments. Long argument lists and long insert column lists are especially problematic as these can be very error prone. Here good language constructs are very helpful. We've found good test constructs are often invaluable in production code as well, though in our experience the tests often have a lot more repetition that needs refactoring than production code does. To that end we added some very useful things in the last month:

Declare cursors in the shape of a procedure's arguments and use them

The most common way to create a cursor is from a select statement but you can also make a cursor that can hold values for you by declaring it to be LIKE something else with a shape. A classic example is:

declare C cursor like some_table;

Now C has the same columns and types as some_table.

Many procedures have a result type that is also a shape, for instance any procedure that ends with a select statement has a result shape defined by the columns of the select statement. You could always do this sort of thing:

declare C cursor like some_proc;

Meaning make C a cursor whose shape is whatever some_proc returns, which is of course exactly the kind of cursor you need to capture the result of some_proc.
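
For instance, a minimal sketch of capturing a result this way (assuming some_proc returns its row with an OUT statement):

declare C cursor like some_proc;
fetch C from call some_proc();
-- C now holds the row that some_proc produced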

Now we add:

declare C cursor like some_proc arguments;

The idea being that the arguments of some_proc are also a shape (unless it has none). With this done you want to use that cursor to call the procedure -- that being sort of the whole point. So we add this:

call some_proc(from C);

How do we use this effectively? Hold on just a second -- for that answer we need one more big tool to really help the syntax.

Loading cursors and inserting columns

Loading up a cursor is done with syntax that is very much like an insert statement. An example might be something like this:

fetch C(x,y,z) from values(1,2,3);

This is simple enough but it becomes more problematic if there are many values, and especially if the values have complex names. To make this a little less error prone, CQL now has this sugar form for fetch, insert, and soon update cursor (maybe even before you read this blog). The more readable form is:

fetch C using
  1  x,
  2  y,
  3  z;

This form has the values next to their names just like in a select statement. Like all sugars, it is automatically rewritten to the normal form.

Likewise

insert into some_table using
  1            id,
  'fred'       first_name,
  'flintstone' last_name,
  'bedrock'    home_town,
  'dino'       favorite_pet,
  'wilma'      life_partner;

becomes

insert into some_table(id, first_name, last_name, home_town, favorite_pet, life_partner)
  values(1, 'fred', 'flintstone', 'bedrock', 'dino', 'wilma');

except the sugar form is much less error prone. This form doesn't generalize to many values but the single row case is super common.

Since this form is automatically rewritten SQLite will never see the sugar syntax, it will get the normal syntax.

NOTE: the insert rewrite is coming later today, and will likely be live by the time you read this.

Putting these together

Let's suppose you have to write a test. You have a procedure test_subject that takes some arguments plus you have another helper procedure test_setup that puts seed data in the right places for your subject. But there are many variations and a lot of what you do between variations is the same. How can you write this economically, making it clear what is different between variations, without a lot of fuss? Well, you can do something like this:

-- use defaults for all the named values
-- use 'seed' for everything else that isn't named
create proc default_setup_args(seed integer not null)
begin
  declare args cursor like test_setup arguments;
  fetch args using
    1334    primary_id,
    98012   secondary_id,
    'foo'   useful_name,
    'bar'   other_useful_name,
    1       fast_mode
    @dummy_seed(seed);
  out args;
end;

With the above you can easily see which values go to which arguments.

Your test setup can now look something like this:

declare setup_args cursor like test_setup arguments;
fetch setup_args from call default_setup_args(1999);
update cursor setup_args using
   0 fast_mode;  -- override fast mode for this test
call test_setup(from setup_args);

To call the test subject you probably need some of those setup arguments and maybe some more things.

create proc default_subject_args(like default_setup_args, other_thing bool not null)
begin
  declare args cursor like test_subject arguments;
  fetch args using
     primary_id    primary_id,    -- this came from the default_setup_args result
     secondary_id  secondary_id,  -- so did this
     useful_name   name,          -- the field names don't have to match
     fast_mode     fast_mode,
     other_thing   other_thing;
  out args;
end;

Then the test code

declare test_args cursor like test_subject arguments;
fetch test_args from call default_subject_args(0);
call test_subject(from test_args);

Importantly, the cursor set operations are all by name, so the order doesn't matter. That means even if there are many arguments you don't have to worry that you got them in the wrong order or that they are the wrong type. Effectively you have a simple call-by-name strategy and you can easily read off the arguments. You could do something similarly brief with helper functions to provide the default arguments, but then you can't readily re-use those arguments in later calls or for verification, so this way seems a lot more useful in a test context.

When it comes time to validate, probably your test subject is returning a cursor from a select that you want to check. A slightly different call will do the job there.

Cursor Differencing

With the setup above you can verify results very easily. Let's change it a little bit:

-- same as before, with a cursor
declare results cursor for call test_subject(from test_args);

-- get the first row
fetch results;

declare expected cursor like results;
fetch expected using
   setup_args.primary_id     primary_id,
   setup_args.useful_name    name,
   test_args.other_thing     other_thing
   @dummy_seed(1999);   -- dummy values for all other columns

-- make a macro like EXPECT_CURSOR_EQ(x,y) for this
-- if the cursors are different the result is a string with the first
-- different column name and the left and right values ready to print

call ExpectNull(cql_cursor_diff_val(expected, results));

ExpectNull could be:

create proc ExpectNull(t text)
begin
  if t is not null then
    call printf('%s\n', t); -- or whatever
    throw;
  end if;
end;

All that testing support comes from:

  • cursors in the shape of arguments
  • cleaner fetch/insert syntax
  • cursors passed as arguments
  • cursor differences

It kills a lot of boilerplate resulting in tests that are much clearer.

And that's what's been going on for the last month in CG/SQL land.

If you got this far thanks for reading. If you didn't get this far, you aren't reading this anyway so thanking you is moot =P

Stay safe.

Rico for CG/SQL

P.S. most of these fragments don't actually compile because of missing schema and maybe the odd typo. If there is interest I'll make a demo that works soup to nuts.

More Flexible Cursor Patterns Using "Boxing" (2020/11/15)

I was reviewing the update posting that just went out and I realized I'd forgotten to mention another big ticket item. So consider this an appendix to the update.

In some cases we started seeing a need to "ship cursors around" a little bit more flexibly. Note shipping values around is already doable so this new work is largely about being able to create a "statement cursor" in one procedure and consume it safely elsewhere. The general pattern looks like this:

Declare a statement cursor as usual, maybe something like this:

declare C cursor for select * from shape_source;

-- or

declare C cursor for call proc_that_returns_a_shape();

Make an object that can hold a cursor:

declare obj object<T cursor>;

Where T is the name of a shape. It can be a table name, or a view name, or it can be the name of the canonical procedure that returns the result. You really want this to be some kind of global name though. Something you can get with a #include in various places. In this case choices for T might be shape_source the table or proc_that_returns_a_shape the procedure.

Remember you can always make a fake procedure that returns a result to sort of typedef a shape name. e.g.

declare proc my_shape() (id integer not null, name text);

The procedure here my_shape doesn’t have to actually ever be created; in fact it’s probably better if it doesn’t. You won’t call it; you’re just using its hypothetical result as a shape. This could be useful if you have several procedures like proc_that_returns_a_shape that all return my_shape.

At this point you could use the cursor maybe something like:

loop fetch C
begin
  -- do stuff with C
end;

Those are the usual patterns and they let you consume a statement cursor sort of “up” from where it was created. But if you want some worker procedures that consume a cursor, there was no good way to pass your cursor down again. Well, there wasn't. Now there is. Let's go back to that box object creation and use it:

-- recap: declare the box that holds the cursor (T changed to my_shape for this example)
declare obj object<my_shape cursor>;

-- box the cursor into the object (the cursor shape must match the box shape)
set obj from cursor C;

The variable obj can now be passed around as usual. Then, later, you can "unbox" it to get a cursor back. Like so

-- unboxing a cursor from an object
declare D cursor for obj;

These primitives will allow cursors to be passed around with managed lifetime. Example:

-- consumes a cursor
create proc cursor_user(box object<my_shape cursor>)
begin
   declare C cursor for box;  -- the cursor's shape will be my_shape, matching box
   loop fetch C
   begin
      -- do something with C
   end;
end;

-- captures a cursor and passes it on
create proc cursor_boxer()
begin
   declare C cursor for select * from something_like_my_shape;
   declare box object<my_shape cursor>;
   set box from cursor C; -- produces error if shape doesn't match
   call cursor_user(box);
end;

Importantly, once you box a cursor the underlying SQLite statement’s lifetime is managed by the box object with normal retain/release semantics so timely release becomes imperative.

With this pattern it's possible to, for instance, consume some of the rows in one procedure and the rest in another procedure.
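
For instance, a minimal sketch of that split-consumption pattern, reusing my_shape and cursor_user from above:

create proc consume_one_then_delegate(box object<my_shape cursor>)
begin
  declare C cursor for box;   -- unbox: C reads from the same underlying statement
  fetch C;                    -- consume just the first row here
  if C then
    call printf("first id: %d\n", C.id);  -- do something with the first row
  end if;
  call cursor_user(box);      -- the remaining rows are consumed there
end;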

Now, the main reason for doing this is if you have some standard helper methods that can get a cursor from a variety of places and process it. But remember, boxing isn’t the usual pattern at all and returning cursors in a box, while possible, should be avoided in favor of the simpler pattern of doing your select or call at the end to compute the result as we do now, if only because then the lifetime is very simple in all those cases. Durably storing a boxed cursor could lead to all manner of problems -- it's just like holding on to a sqlite3_stmt * for a long time. Actually "just like" is an understatement: it's exactly the same as holding on to a statement for a long time, with all the same problems, because that is exactly what's going on here.

So, good generalization, but possibly less Pit of Success, especially with complex box patterns. So watch the sharp edges.

Introducing General Purpose Error Tracing (2020/11/16)

Today we made a couple of minor changes in the code generation to take care of some lingering issues.

The first is that when you did a throw inside a catch to basically rethrow the error, you would lose the error code if something had succeeded within the catch handler.

The old codegen looked something like this:

  catch_start_1: {
    printf("error\n");
    cql_best_error(&_rc_);
    goto cql_cleanup;
  }

The problem being that while the printf above is fine and well, if you did any SQL operation then _rc_ would be clobbered and you'd end up throwing an unrelated error code. cql_best_error would at least make sure it was a failure code (SQLITE_ERROR) but the original error code was lost.

The new code looks like this:

  catch_start_1: {
    _rc_thrown_ = _rc_;
    printf("error\n");
    _rc_ = cql_best_error(_rc_thrown_);
    goto cql_cleanup;
  }

So now if there are db operations, the original return code is still preserved. Note: you still lose sqlite3_errmsg() because SQLite doesn't know that cleanup logic is running.

This brings us to the second new thing: general purpose error traces.

Error checking of result codes happens very consistently in CQL output. The usual pattern looks something like this:

  _rc_ = cql_exec(_db_,
    "SAVEPOINT base_proc_savepoint");
  if (_rc_ != SQLITE_OK) goto cql_cleanup;

or if it's inside a try block it's a little different... very little actually:

  // try
  {
    _rc_ = cql_exec(_db_,
      "RELEASE SAVEPOINT base_proc_savepoint");
    if (_rc_ != SQLITE_OK) goto catch_start_8;
    // ... the rest of the try block
  }

Basically if the local _rc_ doesn't match the necessary condition we goto the appropriate error label... either the relevant catch block or else the procedure's cleanup code.

We generalize this a bit now so that it looks like this:

  if (_rc_ != SQLITE_OK) { cql_error_trace(); goto cql_cleanup; }

-- or, in a catch...

  if (_rc_ != SQLITE_OK) { cql_error_trace(); goto catch_start_8; }

Now the default implementation of cql_error_trace() is in cqlrt.h which you can and should customize. I'll be writing more about that later but suffice to say you're supposed to replace cqlrt.h and cqlrt.c with suitable runtime helpers for your environment while keeping cqlrt_common.h and cqlrt_common.c fixed.

So for instance, your cqlrt.h could look like this:

#ifndef CQL_TRACING_ENABLED
#define cql_error_trace()
#else
// whatever tracing you want, for example this might help in test code.
#define cql_error_trace() \
  fprintf(stderr, "Error at %s:%d in %s: %d %s\n", __FILE__, __LINE__, _PROC_, _rc_, sqlite3_errmsg(_db_))
#endif

So then when you need to debug problems involving lots of error recovery you can watch the entire chain of events easily.

Note that there are some useful variables there:

In any procedure _db_ is the current database and _rc_ is the most recent return code from SQLite. __FILE__ and __LINE__ of course come from the preprocessor, and _PROC_ (one underscore) is now generated by the compiler. Every procedure's body now begins with:

#undef _PROC_
#define _PROC_ "the_current_procedure"

So by defining your own cql_error_trace macro you can cause whatever logging you need to happen. Note this can be very expensive indeed because this happens a lot and even the string literals needed are a significant cost. So generally this should be off for production builds and enabled as needed for debug builds.

The default implementation is just an empty block

#define cql_error_trace()

But the hook is enough to light up whatever logging you might need, and you can use sqlite3_errmsg() before that message is gone.

Good hunting.

Error Tracing Helper Macro (2020/11/18)

Following up on the last blog entry, I thought it would be useful to present a simple error tracing macro that you can use to see what kind of error flow is going on when you're having trouble understanding why a procedure is returning an error code. The idea is we want to create a macro that we can use like this:

BEGIN_VERBOSE_STDERR_TRACING;

-- Some procedure(s) that you want to trace

END_VERBOSE_STDERR_TRACING;

We can do that with something like the macros below. These particular ones cause the output to go to stderr via fprintf, but if that isn't what you need you can simply edit the macro. The macros look like this:

-- manually force tracing on by redefining the cql_error_trace macro
#define BEGIN_VERBOSE_STDERR_TRACING \
    @echo c, "#undef cql_error_trace\n"; \
    @echo c, "#define cql_error_trace() fprintf(stderr, \"CQL Trace at %s:%d in %s: %d %s\\n\", __FILE__, __LINE__, _PROC_, _rc_, sqlite3_errmsg(_db_))\n"

#define END_VERBOSE_STDERR_TRACING \
    @echo c, "#undef cql_error_trace\n"; \
    @echo c, "#define cql_error_trace()\n"

So basically it's telling CQL to emit a #define into its output stream. In this case:

#define cql_error_trace() fprintf(stderr, "CQL Trace at %s:%d in %s: %d %s\n", __FILE__, __LINE__, _PROC_, _rc_, sqlite3_errmsg(_db_))

You could change that to any function you like, you can have it dump the errors where you like, or you can make it some dummy function you add so that you can set a breakpoint on it.
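
For instance, a minimal sketch of the breakpoint flavor, where my_error_breakpoint is a hypothetical function you would define in your own C code:

#define BEGIN_BREAKPOINT_TRACING \
    @echo c, "#undef cql_error_trace\n"; \
    @echo c, "#define cql_error_trace() my_error_breakpoint(__FILE__, __LINE__, _PROC_, _rc_)\n"

#define END_BREAKPOINT_TRACING \
    @echo c, "#undef cql_error_trace\n"; \
    @echo c, "#define cql_error_trace()\n"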

Whatever you do, do not leave your code with this sort of tracing enabled -- it's far too expensive in terms of code size. But it's perfect if you have this one procedure that is failing and it's hard for you to see where.

Obviously if you're making a custom trace thingy you don't need the macro at all, you can just emit your own #define with @echo as needed.

Note: @echo is quite a sledgehammer, so don't use it lightly and don't use it in production code, but it is quite helpful for this sort of thing. CQL tests often use it to help make things visible to the tests. If you use @echo in weird ways you might not get working code when the codegen changes in the future.

The relevant state that is available to you inside a macro like this is:

  • __FILE__ the current filename (comes from the C pre-processor, this is the .c file name not the .sql)
  • __LINE__ the current line number (comes from the C pre-processor, this is the .c line number)
  • _rc_ the current SQLite result code (always the current return code in every CQL procedure that uses SQLite)
  • _db_ the current SQLite database pointer (always the current database in every CQL procedure that uses SQLite)
  • _PROC_ the current procedure name (CQL has a #define for this for you)

A quick tutorial on LIKE forms (2020/11/20)

Everyone knows the usual expression syntax x LIKE y to do a string match. But the CQL compiler also uses LIKE in a different way that's powerful and important. CQL has the notion of data shapes and you use LIKE to refer to them. The simplest source of a data shape, and maybe the most common, is a table. Maybe something like this:

create table T(
 id integer not null,
 name text not null,
 age integer not null
);

Now suppose you want to write a procedure that can insert a row into that table. You could write:

create proc insert_into_T(
  id_ integer not null, 
  name_ text not null, 
  age_ integer not null
)
begin
  insert into T(id, name, age)  values(id_, name_, age_);
end;

This is all fine and well but what if T had 50 columns? That gets old fast. And how can you be sure that you inserted the columns into T in the right order? This second example also compiles even though it's clearly wrong:

  insert into T(id, name, age) values(age_, name_, id_);

And of course you can imagine things get only more complicated with more columns in T.

We started adding the LIKE form to ease these issues and to ensure some consistency in the APIs while preventing simple transposition errors. So you can instead write:

create proc insert_into_T(like T)
begin
  insert into T from arguments;
end;

so here the like T in the argument list simply means "make arguments that are the same as the columns of table T" -- well, almost. It also adds an _ to the end of each name so you end up with exactly the same declaration as the long form above. But you won't miss any arguments, and they'll be in the right order.

And notice that we used from arguments to indicate that we wanted the values to come from the arguments in order. Again this saves you from a lot of typing and a lot of error checking. You can't get the arguments in the wrong order.

These are the most basic patterns. But there are quite a few more.

Let's suppose you want to write a procedure that returns the row with the highest age in the above. Maybe you write something like this:

create proc highest_age()
begin
  declare C cursor for select * from T;
  declare M cursor like C;
  loop fetch C
  begin
     if (not M or M.age < C.age) then
       fetch M from C;
     end if;
  end;
  out M;
end;

Here we made a cursor M that is the same as the cursor C and then we are going to generate a single row result from the cursor. Note that if you use a cursor name like M in an expression it refers to the hidden boolean that says if the cursor has a row in it or not. So M begins empty and we will load it if it's empty or if the age is higher than what we've already got.

Let's show a few more of the forms. Suppose we don't want to return name, just the id and the age. We can change things up a tiny bit.

create proc highest_age()
begin
  declare C cursor for select * from T;
  declare M cursor like select 1 id, 99 age;
  loop fetch C
  begin
     if (not M or M.age < C.age) then
       fetch M from cursor C(like M);
     end if;
  end;
  out M;
end;

So, two things to notice. We used an ad hoc shape, making a fake select statement that returns the shape we want. This select doesn't run but it does define types and columns easily -- two not null integers in this case. Now M is not the same shape as C, so we can't use the simplest form fetch M from C; we have to use the more general form.

Fully expanded, what we wrote becomes:

  FETCH M(id, age) FROM VALUES(C.id, C.age);

But as you can see, we didn't have to type all those column names. And that's kind of the point of the LIKE construct.

So we've covered a bunch of the shape sources already:

  • a table name
  • a cursor name
  • a select statement that gives the shape in an ad hoc fashion

There are three more:

  • a view name
  • the return shape of a procedure that returns a result set
  • the arguments of a procedure

View names are pretty simple, and they work the same as table names so we don't need to discuss those. Let's look at some of the other uses with procedures.

Suppose we have a procedure that can return a result set shape but we want to be able to mock its results so we can fake whatever result we need for testing.

We'll complicate this a bit by adding a new table (keeping short table names for the sample to save typing):

create table U(
 id integer not null,
 email text not null
);

And here's a procedure:

create proc my_proc()
begin
   select T.*, U.email from T inner join U on T.id = U.id;
end;

Now we want to be able to make any fake result we want, so maybe we want a temp table. No problem:

create proc _init_fake_results()
begin
  create temp table if not exists fake_results(
   like my_proc
  );
end;

create proc add_fake_result(like fake_results)
begin
  insert into fake_results from arguments;
end;

create proc get_fake_results()
begin
  select * from fake_results;
end;

The above is very generic and will be easy to maintain. You can see we made a temp table that will have exactly the same shape as whatever my_proc returns. In this case it becomes:

CREATE PROC _init_fake_results ()
BEGIN
  CREATE TEMP TABLE IF NOT EXISTS fake_results(
    id INTEGER NOT NULL,
    name TEXT NOT NULL,
    age INTEGER NOT NULL,
    email TEXT NOT NULL
  );
END;

And the rest are patterns we've seen before.

The last source of shapes is procedure arguments. There are lots of good cases for those; I wrote an entry on those previously but I'll give a simple example here too.

Suppose we have this weird procedure:

create proc delete_stuff(age_ integer, name_ text)
begin
  if age_ is not null then
     delete from T where T.age = age_;
  end if;

  if name_ is not null then
     delete from T where T.name = name_;
  end if;
end;

What if we wanted to log any errors that happen here? Maybe make a version that logs. We can do it like this:

create proc delete_and_log(like delete_stuff arguments)
begin
  begin try
    call delete_stuff(from arguments);
  end try;
  begin catch
    call printf("delete failed\n"); -- or whatever
    throw;
  end catch;
end;

The nice thing about this logging wrapper procedure is that if delete_stuff changes, the wrapper will change with it.

That covers all of the shape sources, and as we saw we can use them to create things like cursors, tables, and argument lists. We can use them to specify a subset of columns that might be of interest when fetching or updating cursors. And we can use them in one last way -- to restrict arguments to a particular shape. Let's see how that works by making the previous logger a little different. Here we added an argument which tells whether we should log. That might look like it would spoil the from arguments part of the forwarding, but there is the final way to use LIKE.

create proc delete_and_log2(log bool not null, like delete_stuff arguments)
begin
  if log and age_ is not null then
    call printf("deleting %d\n", age_); -- or whatever
  end if;
  if log and name_ is not null then
    call printf("deleting %d\n", name_); -- or whatever
  end if;

  call delete_stuff(from arguments like delete_stuff arguments);
end;

So this form lets you use some of your arguments -- the ones that match a certain shape. And as we saw in the previous article you can also use from C to pass arguments, where C is a cursor, and in that case you can also specify that arguments be matched by name with from C like shape. In both those cases the formal parameter names of the called procedure are matched against the names of the shape and passed in the order of the formals. So this is like "call by name": the fields of the cursor or the order of arguments in the argument list might be different than the formals, but you'll get the correct items in the correct order regardless, because it matches by name.
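
For example, a minimal sketch of the cursor flavor (the values are made up):

declare C cursor like delete_stuff arguments;
fetch C from values(21, 'fred');   -- one value per argument of delete_stuff
call delete_stuff(from C);         -- the cursor's fields become the arguments, matched by name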

These forms can save you a lot of typing... and are excellent at avoiding errors and improving maintainability. Where they appear in SQL statements, everything is expanded before it goes to SQLite so SQLite will see normal syntax forms. Which is good because obviously SQLite doesn't know anything about this enhanced LIKE business.

In the examples above there were only one or two columns with very short names, but in real world code there can easily be dozens of columns with very long names. In those cases, these forms really shine.

Introducing Declare Enum (2020/12/03)

There is an unfortunate pattern of hard coding constants in SQL which I think comes from the fact that there's not an especially good way to encode constants in SQL. Things are a little better in CG/SQL's CQL language because it's normal to run things through the pre-processor first so you can do things like:

#define BUSINESS_TYPE_RESTAURANT 1
#define BUSINESS_TYPE_LAUNDROMAT 2

Having done so, you could write:

insert into Business using
   "Rico's Laundry"  name,
   BUSINESS_TYPE_LAUNDROMAT type;

-- by the time SQL sees this it becomes
insert into Business(name, type) values('Rico''s Laundry', 2);

And at least you don't have to see these loose '2' values all over. An especially unfortunate form is the below, in which the author is clearly crying out for a symbol to use:

insert into Business using
   "Rico's Laundry"  name,
   2 type; /* laundromat */

But if we use #define the language knows nothing of the names and it can't help you manage them or export them consistently or anything like that. I guess #define is pretty useful in several languages (C and C++) so you could maybe #include the macros somehow, but that doesn't seem especially great. And if you need them in Java you're getting no help at all.

So to this world we add enumerated constants. This is a bit short of enumerated types as we'll see later. You can now write something like this:

declare enum business_type integer (
  restaurant,
  laundromat,
  corner_store = 11+3  /* math added for demo purposes only */
);

After this:

select business_type.corner_store;

is the same as

select 14;

And that is exactly what SQLite will see, the literal 14.

What's going on here? There's just a few rules:

  • the enumeration can be any numeric type (bool, integer, long integer, real)
  • the values of the enumeration start at 1 (i.e. if there is no = expression the first item will be 1, not 0)
  • if you don't specify a value, the next value is the previous value + 1
  • if you do specify a value it can be any constant expression and it will be cast to the type of the enumeration (even if that is lossy)
  • the enumeration can refer to previous values in itself with no qualification (big = 100.0, medium = big/2, small = medium/2)
  • the enumeration can refer to previously defined enumerations as usual (code = business_type.restaurant)
  • Once the enumeration is defined you refer to its members in a fully qualified fashion enum_name.member_name elsewhere
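
Putting a few of those rules together, here is a minimal sketch (the names are made up):

declare enum sizes real (
  small = 1.0,
  medium = small * 10,    -- can refer to an earlier value in the same enum
  large                   -- no value given: previous value + 1, i.e. 11.0
);

declare enum codes integer (
  laundromat_code = business_type.laundromat,  -- can refer to a previously defined enum
  next_code                                    -- previous value + 1, i.e. 3
);

select sizes.medium;      -- SQLite sees the literal 10.0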

Why is this better than macros? Well for one thing the enum values can be checked at their declaration site, so if you have errors you will hear about them in a more reasonable place. But additionally since the structure is known to the compiler it can give you useful information in the outputs.

In the .h files you get:

enum business_type {
  business_type__restaurant = 1,
  business_type__laundromat = 2,
  business_type__corner_store = 14
};

In case of floating point values such as:

declare enum floating real (
  one = 1.0,
  two = 2.0,
  e = 2.71828,
  pi = 3.14159
);

You get:

// enum floating (floating point values)
#define floating__one 1.000000e+00
#define floating__two 2.000000e+00
#define floating__e 2.718280e+00
#define floating__pi 3.141590e+00

Which is unfortunately the best you can do since C has no floating point enums.

But in both cases the enums section of the JSON has the name of the enums and their members and values ready to go. With these values you can readily generate (with moustache or something) the language interfaces of your choice. This is a real help if you're trying to make helpers to call your CQL from say Java or something.

To do all this we needed to add some constant folding and general evaluation to the compiler. It's not much, just the normal numeric types and null. The supported operations include:

+, -, *, /, %, |, &, <<, >>, ~, and, or, not, ==, <=, >=, !=, <, >, the cast operator, and the case forms. These are enough to make a lot of very interesting expressions, all of which are evaluated at compile time.

While the constant folding was added to allow for rich enum expressions, there is also the const() primitive in the language now which can appear anywhere a literal could appear. This allows you do things that were previously not allowed such as:

create table something(
  x integer default const((1<<16)|0xf) /*  again the math is just for illustration */
);

The const form is also very useful in macros:

#define SOMETHING const(12+3)

This form ensures that the constant will be evaluated at compile time. const can also nest, so you can build these kinds of macros from other macros, or you can build enums this way. Anywhere you might need literals, you can use const.

Importantly, no enumerated data types were added to the language to do any of this. The values can help you to achieve some correctness by avoiding transcription mistakes but there is no additional type-safety provided here. Indeed given the rich mix between these types in SQLite, and with SQLite having no knowledge of enumerations at all it would be tricky to do a complete job. Still, this might happen in the future.

But for now, declaring constants that are really an intimate part of your schema is now possible and the addition of the constants to the .h files and the .json output should hopefully make these generally useful. At least we might see less of the hard-coded constant business with good values baked right into the schema declarations.

Happy Holidays.

Introducing Argument Bundles (2020/12/08)

There are many cases where stored procedures require complex arguments using data shapes well known to higher level languages or that come from the schema. There is already some affordance for this sort of thing in the form of this kind of pattern:

(I'll continue to use this simple example as I discuss the generalization below)

create table Person (
   id text primary key,
   name text not null,
   address text not null,
   birthday real
);

Then maybe something like this

create proc insert_person(like Person)
begin
    insert into Person from arguments;
end;

The above expands into:

create proc insert_person(
    id_ text not null,
    name_ text not null,
    address_ text not null,
    birthday_ real)
begin
    insert into Person(id, name, address, birthday)
        values(id_, name_, address_, birthday_);
end;

And I think we can all agree the sugared version is a lot easier to reason about and much less prone to errors as well.

Those features have been in the language for a long time and that's all fine and well, but it isn't general enough to handle the usual mix of situations. For instance, what if you need a procedure that works with two people? A hypothetical insert_two_people procedure cannot be written with the old form. This is where argument bundles come in. The idea here is to name the bundle, which gives you a useful way to refer to it. To wit:

create proc insert_two_people(p1 like Person, p2 like Person)
begin
    call insert_person(from p1);
    call insert_person(from p2);
end;

or alternatively

create proc insert_two_people(p1 like Person, p2 like Person)
begin
    insert into Person from p1;
    insert into Person from p2;
end;

So what's going on here? Well, there are lots of reasons to keep the API to procedures simple and adding general purpose structured types would be at odds with that. It would require lots of knowledge about C structure layout and whatnot. And trying to call from Java would require very complex JNI for any such procedure. So we avoid all that. We keep simple arguments. The above expands into:

create proc insert_two_people(
    p1_id text not null,
    p1_name text not null,
    p1_address text not null,
    p1_birthday real,
    p2_id text not null,
    p2_name text not null,
    p2_address text not null,
    p2_birthday real)
begin
    insert into Person(id, name, address, birthday)
        values(p1_id, p1_name, p1_address, p1_birthday);
    insert into Person(id, name, address, birthday)
        values(p2_id, p2_name, p2_address, p2_birthday);
end;

Of course the types don't have to be the same; you can create and name shapes of your choice. The language allows you to use an argument bundle in all the places that a cursor was previously a valid source. That includes insert, fetch, update cursor, and procedure calls. You can refer to the arguments by their expanded name p1_address, or alternatively p1.address means the same thing.

Here's another example showing a silly but illustrative thing you could do:

create proc insert_lotsa_people(P like Person)
begin
    declare C cursor like P;
    fetch C from P;
    declare i integer not null;
    set i := 0;
    while (i < 20)
    begin
        update cursor C using
            printf("id_%d", i) id;
        insert into Person from C;
        set i := i + 1;
    end;
end;

The above shows that you can use a bundle as the source of a shape elsewhere, and you can use a bundle as a source of data to load a cursor. After which you can do all the usual value cursor things like out statements and so forth.

In order to call procedures with argument bundles more readily from other languages, the JSON output now includes additional information about where procedure arguments originated. The field with this information is creatively called "argOrigin:" and it has 3 forms.

  • "arg_name" -> the argument is not an expansion of anything
  • "T arg_name" -> the argument came from like T
    • there will be one arg for each member of T
    • the formal argument name for this arg will be arg_name_
    • if T is procedure arguments like p1 arguments then you'll get "p1[arguments] arg_name"
  • "name T arg_name" -> the argument came from name like T (a named bundle)
    • there will be one arg for each member of T
    • the formal argument name for this arg will be name_arg_name
    • T could be procedure arguments as above
  • If the source of an argument was a cursor or argument bundle name you get instead that thing's shape source name
    • this is always better because cursor names and bundle names are not globally unique.
  • If the cursor had an anonymous source (e.g. like select 1 x) then you get the useless shape name "select"
    • this is an indicator that you should make some ad hoc struct for this procedure because there is no useful name for the arg bundle's type

None of this matters unless you're trying to make wrappers for a CQL procedure for some other language and you'd like to have your wrapper deal with structs rather than all loose arguments. The JSON basically tells you the structs.

Interestingly, argument bundles resulted in a significant reduction of code in the compiler. The argument bundle name has to be usable in the contexts where a cursor was previously usable. It is another source of shaped data. Getting that to work proved to be super simple as the two forms look almost identical to the compiler -- no coincidence there. So very little code was required to make from [cursor_name] work with from [any_shape_name] in the half dozen or so places that this construct is allowed (e.g. procedure call arguments, insert statements, etc.). However, there was as much code associated with from arguments as there was with from cursor_name, and the code was nearly identical.

When argument bundles were introduced the natural thing to do was to create an artificial bundle called "arguments" which represents the bundle that is ALL the arguments. With that done, all the code for from arguments could be deleted because arguments itself was a valid shape name. Hence insert into T from arguments "just works". And so half the rewrites were deleted. The only cost was that the form from arguments like shape became the cursor form from arguments(like shape), which only adds mandatory parens to a form that was largely unused anyway (there were two cases in our entire codebase). The cursor form is more general as you can do from C(like A, like B) to get the fields that match A then those that match B. Arguments get this for free as well (well, at the cost of parens).

So overall, this feature was added, and the compiler got smaller and cleaner. Only the test suite had to grow.

Stay safe out there.

Introducing Virtual Tables (2020/12/16)

Language support for virtual tables has lagged since I always thought they were of little interest to the language anyway. The CREATE TABLE forms in general are only declarations (except if you're doing the schema installer/upgrader output) and so you could just declare say a temp table that corresponds to the virtual table that you made in the same way that you might declare say sqlite_master if you wanted to use it. And since you have to register the module anyway, you may as well create the virtual table at the same time.

So there was no point in adding language support for the thing.

Furthermore the CREATE VIRTUAL TABLE form includes no information about the schema of the table so you'd need some kind of declaration anyway in order to tell the language what the columns are for the table you just created. So again you may as well just declare it like a normal table and not include that table in your schema upgrade file and be done with it.

And that was my thinking for the last 2 years. And then I learned something.

Virtual tables are durable.

Yeah, I always assumed that virtual tables were temp tables and they vanish and have to be redeclared every session but that is not the case. They are part of the durable schema so while you must pre-register the module associated with the virtual table, the virtual table is like other tables in that you only create it once and from then on it's part of the schema every time the database loads until you DROP it.

This changes everything.

With virtual tables being durable they belong in the schema upgrade process. And if they go there they also have to go into the JSON output. But we can't use the vanilla syntax that SQLite uses because that syntax is:

  • not parseable, because the module arguments can be literally anything (or nothing), even a letter to your grandma.
  • the arguments do not necessarily say anything about the table's schema at all

So in the CQL language I change the syntax a bit, the generalized form looks like this:

create virtual table virt_table using my_module [(module arguments)]  as (
  id integer not null,
  name text
);

The part after the AS is used by CQL as a table declaration for the virtual table. The grammar for that is exactly the same as a normal CREATE TABLE statement. However that part is not transmitted to SQLite; when the table is created, SQLite sees only the part it cares about, the part before the AS.

Now this leaves the module arguments, they can be one of three things:

  1. no arguments at all
  2. a list of identifiers, constants, and parenthesized sublists just like in the @attribute form
  3. the words arguments following

Case 1 Example

create virtual table virt_table using my_module as (
  id integer not null,
  name text
);

becomes (to SQLite)

CREATE VIRTUAL TABLE virt_table USING my_module;

Note: empty arguments USING my_module() are not allowed in the SQLite docs but do seem to work in SQLite. We take the position that no args should be done with no parens, at least for now.

Case 2 Example

create virtual table virt_table using my_module(foo, 'goo', (1.5, (bar, baz))) as (
  id integer not null,
  name text
);

becomes (to SQLite)

CREATE VIRTUAL TABLE virt_table USING my_module(foo, "goo", (1.5, (bar, baz)));

This form allows for very flexible arguments but not totally arbitrary arguments, so it can still be parsed and validated.

Case 3 Example

This case recognizes the popular choice that the arguments are often the actual schema declaration for the table in question. So

create virtual table virt_table using my_module(arguments following) as (
  id integer not null,
  name text
);

becomes

CREATE VIRTUAL TABLE virt_table USING my_module(
  id INTEGER NOT NULL,
  name TEXT
);

The normalized text (keywords capitalized, whitespace normalized) of the table declaration in the as clause is used as the arguments.

Other details

Virtual tables go into their own section in the JSON and they include the module and moduleArgs entries; they are additionally marked isVirtual in case you want to use the same processing code for virtual tables as normal tables. The JSON format is otherwise the same, although some things can't happen in virtual tables (e.g. there is no TEMP option so "isTemp" must be false in the JSON).

For purposes of schema processing, virtual tables are on the @recreate plan, just like indices, triggers, etc. This is the only option since the alter table form is not allowed on a virtual table.

Semantic validation enforces "no alter statements on virtual tables" as well as other things like, no indices, and no triggers, since SQLite does not support any of those things.

Finally, because virtual tables are on the @recreate plan, you may not have foreign keys that reference virtual tables. Such keys seem like a bad idea in any case.

Introducing Named Types (2021/01/14)

A common source of errors in stored procedures is incorrect typing in arguments. For instance, a particular key for an entity might need to be LONG or even always LONG NOT NULL or LONG NOT NULL @SENSITIVE, and the only way to do this in the past was maybe with some #define thing. Otherwise you have to diligently get the type right in all the places, and should it ever change, again you have to visit all the places. To help with this situation, and to make code a little more self-describing, we add named types to the language. This is a lot like typedef in the C language. They do not create different incompatible types but they do let you name things well.

You can now write these sorts of forms:

declare foo_id type long not null;

create table foo(
  id foo_id primary key autoincrement,
  name text
);

create proc inserter(name_ text, out id foo_id)
begin
  insert into foo(id, name) values(NULL, name_);
  set id := last_insert_rowid();
end;

Refer to the railroad diagram for the grammar details.

Additionally any enumerated type can be used as a type name. e.g.

declare enum thing integer (
  thing1,
  thing2
);

declare x thing;

Enumerations always get "not null" in addition to their base type.

This isn't a very complex feature but we hope that it will help create clearer code that is less likely to have type-related bugs.

Introducing Type "Kinds" (2021/01/20)

Further adding to the type calculus of the CQL language we introduced the ability to encode the "kind" of primitive types. This can be used in a number of ways -- like "units" for natural things and like a "type" for synthetic keys and other such. It's easier to illustrate by example.

declare job_id type long<job_id>;
declare person_id type long<person_id>;

declare j job_id;
declare p person_id;

set p := j;  -- this is an error

With the above in place, other expressions like p == j would also produce errors as these long values are no longer type compatible. This is a great way to add enforcement to your schema and procedures. Likewise you can use these annotations to add "units" to your data types. e.g.

declare meters type real<meters>;
declare grams type real<grams>;

declare m meters;
declare g grams;

Variables of type grams (e.g. g) are not compatible with variables of type meters (e.g. m) even though both are real.

Likewise, attempting to insert grams into a column that is typed to meters will give errors. Of course SQLite doesn't know about any of this so all the <> stuff is removed in the generated SQL. This is just about type enforcement at compile time.
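
For instance, a minimal sketch of that column enforcement (the table here is made up):

create table measurements(
  len real<meters>,
  mass real<grams>
);

insert into measurements(len, mass) values(m, g);  -- ok
insert into measurements(len, mass) values(g, m);  -- error! kinds don't match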

Enumerations like:

declare enum surface integer (paper, canvas);
declare enum writer integer (pen, paper, brush);

enable this:

declare s surface;                  -- s is now of type integer<surface>
declare w writer;                   -- w is now of type integer<writer>
set s := surface.paper;             -- ok
set s := writer.pen;                -- error
set w := writer.brush;              -- ok
case when s == w then 1 else 0 end; -- error (w/s not comparable)
set w := s;                         -- error again

additionally in DML/DDL:

create table draw_action(
  w writer,
  s surface
);

insert into draw_action values(w, s); -- ok
insert into draw_action values(s, w); -- error!

So the type kinds can be quite helpful when dealing with loose variables.

The notion of specific types was added to the language nearly two years ago to support the object type because there was a great desire to prevent object<dictionary> being assigned from object<list> but this "type kind", whether it's with units (e.g. "meters", "grams") or a type name (e.g. "job_id") adds a lot of high value type checking.

The kind can be added, stripped, or changed with a cast operation and the type system allows a constant or variable with no kind (e.g. "1") to mix and match with any kind so long as the base type is compatible as usual. So you get the most value by using the specific type consistently but you won't go insane adding test cases that use constants for instance.
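
For instance, a minimal sketch reusing the declarations above:

set p := cast(j as long<person_id>);  -- an explicit cast changes the kind, so this is allowed
set p := 1;                           -- a literal has no kind, so it mixes with any kind
set g := 100.0;                       -- same idea for the unit-style kinds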

As of this writing the expression kinds are checked for compatibility everywhere plus or minus bugs. There are extensive tests.

Change in No-Result Semantics (2021/02/10)

Important change in CQL semantics.

Previously if you did an early return, or fell through the end, from a procedure that is supposed to return a result set but did not in fact provide one, you would get a fake SQLITE_ERROR. Now you get an empty result set for "free".

This interpretation seems much more natural and avoids a lot of really annoying stub selects to comply with the contract.

This also works for the out statement in the same fashion.

If you want to return an error, use throw. This is a lot more natural...

examples:

-- this gives you an empty result set if x <= 0
create proc maybe_return(x integer)
begin
   if x > 0 then
     select * from foo where foo.y > x;
   end if;
end;

-- so does this
create proc maybe_return(x integer)
begin
  if x <= 0 then
     return;
  end if;
  select * from foo where foo.y > x;
end;

-- so does this
create proc maybe_out(x integer)
begin
  if x > 0 then
    declare C cursor for select etc.
    out C;
  end if;
end;

Introducing Select .. If Nothing (2021/02/14)

The nested select statement is frequently misused; in particular, if you get no rows back from the expression, that's an error. So for instance:

set x_ := (select x from foo.x where id = y);

This will throw (with a peculiar error, SQLITE_DONE) if there is no such row.

Sometimes people try to fix this problem with a nullcheck:

set x_ := IFNULL((select x from foo.x where id = y), -1);

That doesn't help at all. It's not a null value situation, there's no row at all.

set x_ := (select IFNULL(x, -1) from foo.x where id = y);

Is likewise unhelpful. To help with this situation we add two forms:

-- useful if foo.x is already known to be not null
set x_ := (select x from foo.x where id = y IF NOTHING -1);

-- useful if foo.x might be null
set x_ := (select x from foo.x where id = y IF NOTHING OR NULL -1);

Both of these deal with the case where there is no row. The second lets you have a simple default for both no row or null value. That form is equivalent to:

set x_ := (select IFNULL(x,-1) from foo.x where id = y IF NOTHING -1);

i.e. both problem cases are handled.

Of course the -1 here could be any valid expression, even a second (select...)
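
For example, a minimal sketch with a nested select as the default:

set x_ := (select x from foo.x where id = y
           IF NOTHING (select max(x) from foo.x));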

Introducing @RC builtin variable (2021/02/21)

We've long needed a way to see the most recent SQLite result code in the context of, say, a catch block (most other times you can assume SQLITE_OK was the last result code, otherwise control flow would have transferred elsewhere; sometimes SQLITE_ROW or SQLITE_DONE might be the current result code).

Soon we'll provide a sample header that declares the most common error codes in an enum but for now you could do something like this:

-- pasted from the sqlite.c
#define SQLITE_BUSY         5   /* The database file is locked */

-- this is a contrived example
create proc get_first_foo(out can_retry bool not null)
begin

  -- can_retry is set to 0 automatically, language semantics guarantee this

  begin try
    select foo from bar limit 1;
  end try;
  begin catch
    set can_retry := (@rc == SQLITE_BUSY);
    throw; -- rethrow the original error
  end catch;
end;

Introducing Shared Fragments (2021/12/14)

Shared fragments are a real game-changer for CQL.

Remember, these are designed to let you write part of a query and then substitute in parameters. So it's like a parameterized view in normal SQL terms. But actually it's more powerful than that, fragments also provide features that are more like Java generics. Let's do some examples.

Suppose we have a procedure which looks something like this:

CREATE PROC get_stuff(to_include_ text, to_exclude_ text)
BEGIN
  WITH
  to_exclude_recursive_query (tok, rest) AS (
    SELECT
      '',
      to_exclude_ || ','
    UNION ALL
    SELECT
      substr(rest, 1, instr(rest, ',') - 1),
      substr(rest, instr(rest, ',') + 1)
    FROM to_exclude_recursive_query
    WHERE rest <> ''
  ),
  to_exclude (id) AS (
    SELECT CAST(tok AS LONG)
    FROM to_exclude_recursive_query
    WHERE tok <> ''
  ),
  to_include_recursive_query (tok, rest) AS (
    SELECT
      '',
      to_include_ || ','
    UNION ALL
    SELECT
      substr(rest, 1, instr(rest, ',') - 1),
      substr(rest, instr(rest, ',') + 1)
    FROM to_include_recursive_query
    WHERE rest <> ''
  ),
  to_include (id) AS (
    SELECT CAST(tok AS LONG)
    FROM to_include_recursive_query
    WHERE tok <> ''
  )
  SELECT * from stuff S
  WHERE
    S.id in (select * from to_include) AND
    S.id not in (select * from to_exclude);
END;

With shared fragments you could write something like this:

@attribute(cql:shared_fragment)
CREATE PROC split_commas(str text)
BEGIN
  WITH splitter(tok, rest) AS (
    SELECT '', IFNULL(str || ',', '')
    UNION ALL
    SELECT
      substr(rest, 1, instr(rest, ',') - 1),
      substr(rest, instr(rest, ',') + 1)
    FROM splitter
    WHERE rest <> '')
  select tok from splitter where tok <> '';
END;

@attribute(cql:shared_fragment)
CREATE PROC ids_from_string(str text)
BEGIN
  WITH toks(tok) AS (CALL split_commas(str))
  SELECT CAST(tok AS LONG) AS id from toks;
END;

We now have a shared fragment called split_commas which can live anywhere, like maybe in a standard include file. There are some immediate benefits:

  • the fragment is compiled on its own before usage so any errors are reported in the fragment
    • in contrast, with macros you get errors when you try to use the macro and they are all charged to the line the macro appears on so it's hopeless figuring out what's wrong
  • the text of the shared fragment will be the same, so it can be re-used in all locations, this can be a big binary size savings
    • in contrast, macros are pre-processed before CQL ever sees the text so it doesn't "know" it's the same code
  • fragments compose cleanly as we'll see; and they have typed arguments
  • fragments can be independently tested outside of the context in which they appear
    • make a test context and explore the fragment, no worries about it breaking on edge cases later

The first fragment, split_commas, does exactly what it sounds like: it takes a comma-separated string argument and produces the rows of tokens within it.

The second fragment uses the first to split a string and then it converts all the strings to long integers.

Now instead of the above we could write:

#include <stringsplit.sql> /* wherever you put the fragments */

CREATE PROC get_stuff(to_include_ text, to_exclude_ text)
BEGIN
  WITH
    to_include(id) AS (CALL ids_from_string(to_include_)),
    to_exclude(id) AS (CALL ids_from_string(to_exclude_))
  SELECT * from stuff S
  WHERE
    S.id in (select * from to_include) AND
    S.id not in (select * from to_exclude);
END;

And of course since ids_from_string lives somewhere shared (stringsplit.sql), these fragments can be used all over your code and you'll only pay for the text one time. This gives you great flexibility, very much like parameterized views. You can have any number of these fragments, they will share code, they compose like crazy and there is no schema cost!

Generics

A series of useful fragments for generating data would go a long way, but there are other applications of fragments: you might want to operate on various data sources without hard-coding them all. This is where the generic form of fragments comes in. Consider a case where you want to be able to filter stuff by say name and age. You could create this fragment:

@attribute(cql:shared_fragment)
CREATE PROC filter_stuff(
  pattern_ text not null,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  WITH
    source(*) LIKE stuff
  SELECT * from source S
  WHERE
    S.name LIKE pattern_ AND
    S.age BETWEEN min_age_ and max_age_;
END;

Now imagine that we had added the shared fragment annotation to get_stuff (just like the above). We could then write the following:

CREATE PROC the_right_stuff(
  to_include_ text,
  to_exclude_ text,
  pattern_ text not null,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  WITH
    get_stuff(*) AS (call get_stuff(to_include_, to_exclude_)),
    filter_stuff(*) AS (call filter_stuff(pattern_, min_age_, max_age_)
      using get_stuff as source)
  SELECT * from filter_stuff S
  ORDER BY name
  LIMIT 5;
END;

Or with some sugar to forward arguments and assume the CTE name matches, more economically:

CREATE PROC the_right_stuff(
  to_include_ text,
  to_exclude_ text,
  pattern_ text not null,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  WITH
    (call get_stuff(*)),
    (call filter_stuff(*) using get_stuff as source)
  SELECT * from filter_stuff S
  ORDER BY name
  LIMIT 5;
END;

The arg syntax (*) simply indicates that the argument names in the caller should be matched to the same names in the callee. In general call foo(*) expands to call foo(from arguments like foo arguments); * is rather more economical than writing that out.
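
To make that concrete, the sugared WITH clause in the_right_stuff above expands to roughly the following (a sketch based on the expansion rule just described; the compiler's actual output may differ in details):

WITH
  get_stuff(*) AS (call get_stuff(from arguments like get_stuff arguments)),
  filter_stuff(*) AS (call filter_stuff(from arguments like filter_stuff arguments)
    using get_stuff as source)
  SELECT * from filter_stuff S
  ORDER BY name
  LIMIT 5;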

In this example filter_stuff doesn't know where its data will be coming from, you bind its table parameter source to a compatible data source of your choice. For example, this would also be legal:

CREATE PROC almost_the_right_stuff(
  pattern_ text not null,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  WITH
    (call filter_stuff(*) using stuff as source)
  SELECT * from filter_stuff S
  ORDER BY name
  LIMIT 5;
END;

Conditionals

It's often desirable to have some options in the generated SQL without having to fork your entire query. Shared fragments address this as well with the conditional form. In this form the top level of the fragment is an IF statement and there are a number of alternatives. Here are some simple modifications to the above that illustrate some of the possibilities.

@attribute(cql:shared_fragment)
CREATE PROC filter_stuff(
  pattern_ text,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  IF pattern_ IS NOT NULL THEN
    WITH
        source(*) LIKE stuff
    SELECT * from source S
    WHERE
        S.name LIKE pattern_ AND
        S.age BETWEEN min_age_ and max_age_;
  ELSE
    WITH
        source(*) LIKE stuff
    SELECT * from source S
    WHERE
        S.age BETWEEN min_age_ and max_age_;
  END IF;
END;

In the above if the input pattern is NULL then it is not considered, it won't be part of the generated SQL at all. Note that source (same name) appears in both branches and therefore must be the same type as it will be fulfilled by one actual table parameter.

Now the above could have been achieved with something like this:

pattern_ IS NULL OR S.name LIKE pattern_

But that would have no useful selectivity. In general, though, the conditional form lets you avoid joins and so forth entirely when a constraint isn't needed. Consider something like this hypothetical:

@attribute(cql:shared_fragment)
CREATE PROC filter_stuff(
  pattern_ text,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  IF pattern_ IS NOT NULL THEN
    WITH
        source(*) LIKE stuff
    SELECT DISTINCT S.* from source S
    INNER JOIN keywords K
    WHERE
        K.keyword LIKE pattern_ AND
        S.age BETWEEN min_age_ and max_age_;
  ELSE
    WITH
        source(*) LIKE stuff
    SELECT * from source S
    WHERE
        S.age BETWEEN min_age_ and max_age_;
  END IF;
END;

Here we save the DISTINCT and the JOIN if there is no pattern which might be important. Of course there are probably better ways to match keywords but this is just an illustration of what's possible.

There are numerous ways this flexibility can be used. Here's another simple example; a real schema transform would be more complex.

@attribute(cql:shared_fragment)
CREATE PROC get_stuff(
  to_include_ text,
  to_exclude_ text,
  schema_v2 bool not null)
BEGIN
  IF schema_v2 THEN
    WITH
        to_include(id) AS (CALL ids_from_string(to_include_)),
        to_exclude(id) AS (CALL ids_from_string(to_exclude_))
    SELECT * from stuff_2 S
    WHERE
        S.id in (select * from to_include) AND
        S.id not in (select * from to_exclude);
  ELSE
    WITH
        to_include(id) AS (CALL ids_from_string(to_include_)),
        to_exclude(id) AS (CALL ids_from_string(to_exclude_))
    SELECT * from stuff S
    WHERE
        S.id in (select * from to_include) AND
        S.id not in (select * from to_exclude);
   END IF;
END;

Validation

All of this requires a bunch of checking, at least this:

  • the LIKE forms can only appear in a shared fragment
  • the CALL forms must refer to shared fragments
  • the CALL args must be compatible
  • the number and type of the provided tables in USING must be correct
  • the shared fragment must be a single select statement or an IF statement with an ELSE
    • the statement lists of the IF/ELSE combo must all be single select statements
    • all the choices in the IF block must return the same shape (this is normal for procedures)
  • the shared fragment can't have any out arguments
  • the provided fragment arguments cannot themselves use the nested SELECT construct
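
For instance, this hypothetical fragment violates the "no out arguments" rule above and would be rejected by the compiler (the exact diagnostic will vary):

@attribute(cql:shared_fragment)
CREATE PROC bad_fragment(x integer, out y integer)
BEGIN
  SELECT x + 1 AS val;
END;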

I think this is a total game changer for SQL authoring and should go a long way to making it easier to get your work done on SQLite. A good base set of shared fragments as part of any suite of procedures seems like a good idea.

There are more details in the section on shared fragments in Chapter 14 of The Guide.

These features are in the current build as of today (12/14/2021).

Happy Holidays and stay safe.

Control Flow Analysis in CQL (2021/12/30)

One of the biggest changes to CQL in 2021 was the addition of control flow analysis. Given an understanding of how execution can flow within a user's program, CQL can do things like infer when a nullable variable must contain a nonnull value and improve its type appropriately, or issue an error when a nonnull variable may be used before it has been initialized.

Improving Nullability

As of mid-2021, and with increasing sophistication throughout the remainder of the year, CQL has been able to infer that a variable of a nullable type must not be NULL within a portion of a user's program:

DECLARE PROC another_proc(t0 TEXT NOT NULL, t1 TEXT NOT NULL);

CREATE PROC some_proc(t0 TEXT, t1 TEXT)
BEGIN
  IF t0 IS NULL RETURN;
  -- `t0` must be nonnull here if we made it this far

  IF t1 IS NOT NULL THEN
    -- `t0` and `t1` are nonnull here
    CALL another_proc(t0, t1);
  ELSE
    -- `t0` is nonnull here
    CALL another_proc(t0, "default");
  END IF;
END;

The ability of the CQL compiler to infer non-nullability greatly reduces the need to use the functions ifnull_crash and ifnull_throw to coerce values to a nonnull type—functions that, if they are ever used incorrectly, usually result in programs misbehaving.
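
For comparison, without this inference the branches in some_proc above would have needed explicit coercions, something like this sketch:

  IF t1 IS NOT NULL THEN
    CALL another_proc(ifnull_crash(t0), ifnull_crash(t1));
  ELSE
    CALL another_proc(ifnull_crash(t0), "default");
  END IF;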

For a detailed description and many additional examples of what is possible—CQL can handle much more than what is shown above—see the user guide's section on nullability improvements.

Enforcing Initialization Before Use

In CQL, it is possible to declare a variable of a nonnull type without giving it a value. If the variable is of a non-reference type, it is assigned a default value of 0. If the variable is of a reference type (BLOB, OBJECT, or TEXT), however, it is simply set to NULL despite the nonnull type as no default value exists.
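
In other words (a minimal illustration of the rule just described):

DECLARE i INTEGER NOT NULL; -- non-reference type: gets a default value of 0
DECLARE t TEXT NOT NULL;    -- reference type: no default exists, so it starts out NULL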

To help prevent accessing a reference variable of a nonnull type and getting back NULL, CQL recently began enforcing that such variables are initialized before use. The following code, therefore, now results in an error:

DECLARE t TEXT NOT NULL;
CALL requires_text_notnull(t); -- error!

Using the same engine for control flow analysis that is behind nullability improvements, CQL can improve a variable to be initialized:

DECLARE t TEXT NOT NULL;

IF some_condition THEN
  SET t := "some example text";
  -- `t` is initialized here
ELSE
  THROW;
END IF;
-- `t` must be initialized here if we made it this far

CALL requires_text_notnull(t); -- okay!

Thanks to CQL's ability to understand the control flow of users' programs, the above example works just fine.

CQL now also enforces that all procedures with OUT parameters of a nonnull reference type properly initialize said parameters before they return:

CREATE PROC some_proc(b BOOL NOT NULL, OUT t TEXT NOT NULL)
BEGIN
  IF b THEN
    SET t := another_proc(t);
    -- `t` is initialized here
  ELSE
    SET t := yet_another_proc(t);
    -- `t` is initialized here
  END IF;
  -- `t` must be initialized here because all possible
  -- branches initialized it, so `some_proc` is okay!
END;

As with nullability improvements, understanding the nuances of what will be considered initialized is easier if one has a sense for how control flow analysis works in the compiler.

Understanding Control Flow Analysis in CQL

To develop an intuition for how control flow analysis works in CQL, let's begin by taking a look at the following example:

DECLARE PROC p1(OUT t TEXT NOT NULL);
DECLARE PROC p2(i INTEGER NOT NULL, OUT t TEXT NOT NULL);

CREATE PROC p0(b BOOL, i INTEGER, OUT t TEXT NOT NULL)
BEGIN
  IF i IS NULL THEN
    IF b THEN
      CALL p1(t);
    ELSE
      SET t := "";
    END IF;
    RETURN;
  END IF;

  IF i == 0 THEN
    SET t := "";
  ELSE IF i > 0 THEN
    SET t := p2(i);
  ELSE
    THROW;
  END IF;
END;

There are a couple of things we must verify in order to ensure the code is type-safe:

  • With regard to the parameters of p0: Since t is an OUT parameter of type TEXT NOT NULL, p0 must always assign it a value before it returns. If it does not, a caller of p0 may end up with a variable of a NOT NULL type that actually contains NULL.

  • With regard to the calling of p2 in p0: Since p2 requires a first argument of type INTEGER NOT NULL, some sort of check must be performed to ensure that i is not NULL before p2(i) is executed.

If we carefully study p0, we can determine that both of the above conditions are satisfied. Making this determination, however, is not exactly trivial, and real-world code is often significantly more complicated than this—and it evolves over time. For these reasons, having a compiler that can make such determinations automatically is critical; most modern production compilers perform these sorts of checks.

The easiest way to understand how CQL does its job is to take the above example line-by-line. This is not exactly how CQL works under the hood, but it should provide an intuitive sense of how control flow analysis works in the compiler:

==> CREATE PROC p0(b BOOL, i INTEGER, OUT t TEXT NOT NULL)
    BEGIN
      ...
    END;

Right away, CQL can see that t is declared both OUT and TEXT NOT NULL and thus requires initialization before p0 returns. CQL can, therefore, add a fact about what it is analyzing to its previously empty set of facts:

  • t requires initialization.

We can then continue:

==>   IF i IS NULL THEN
        ...
      END IF;

Here, the compiler notices that we're at an IF statement. In CQL, IF statements contain one or more branches, and the compiler considers every IF to be the start of a branch group. The same line also indicates the condition for the first branch: i IS NULL. CQL can update its set of facts:

  • t requires initialization.
  • In branch group:
    • In branch when i IS NULL:

It then proceeds to the next line:

      IF i IS NULL THEN
    ==> IF b THEN
          CALL p1(t);
        ELSE
          SET t := "";
        END IF;
        RETURN;
      END IF;

Another branch group and branch:

  • t requires initialization.
  • In branch group:
    • In branch when i IS NULL:
      • In branch group:
        • In branch when b:

Continuing:

      IF i IS NULL THEN
        IF b THEN
      ==> CALL p1(t);
        ELSE
          SET t := "";
        END IF;
        RETURN;
      END IF;

Since p1 takes an OUT argument of type TEXT NOT NULL, this call initializes t, and so CQL can update its set of facts once again:

  • t requires initialization.
  • In branch group:
    • In branch when i IS NULL:
      • In branch group:
        • In branch when b:
          • t is initialized.

Jumping ahead a couple of lines:

      IF i IS NULL THEN
        IF b THEN
          CALL p1(t);
        ELSE
      ==> SET t := "";
        END IF;
        RETURN;
      END IF;

At this point, we're in another branch. We also have yet another fact to add because t is initialized here as well due to the SET:

  • t requires initialization.
  • In branch group:
    • In branch when i IS NULL:
      • In branch group:
        • In branch when b:
          • t is initialized.
        • In ELSE branch:
          • t is initialized.

Moving ahead one more line, things get a bit more interesting:

      IF i IS NULL THEN
        IF b THEN
          CALL p1(t);
        ELSE
          SET t := "";
    ==> END IF;
        RETURN;
      END IF;

Here, we're at the end of an IF, and thus the end of a branch group. Whenever CQL reaches the end of a branch group, it merges the effects of all of its branches.

One very important thing to note here is that the current branch group has an ELSE branch, and so the set of branches covers all possible cases. That means if something is initialized in every branch within the branch group, we can consider it to be initialized after the branch group has ended: Initialization will always occur. This allows CQL to simplify its set of facts as follows as it leaves the branch group:

  • t requires initialization.
  • In branch group:
    • In branch when i IS NULL:
      • t is initialized.

Stepping forward one line again, we reach a RETURN:

      IF i IS NULL THEN
        ...
    ==> RETURN;
      END IF;

We're now at a point where we can exit the procedure. CQL will, therefore, verify that if something requires initialization, it has been initialized. Since we have both the facts "t requires initialization" and "t is initialized", all is well!

The fact that the current branch returns early is added to the set of facts:

  • t requires initialization.
  • In branch group:
    • In branch when i IS NULL:
      • t is initialized.
      • Returns.

Moving ahead one more line, we reach the end of another branch and branch group, and again something interesting happens:

      ...
      IF i IS NULL THEN
        ...
  ==> END IF;

Upon ending the branch group, we know that the branch group has exactly one branch, that the branch is entered only when i IS NULL, and that the branch returns. What that tells CQL is that, if execution is going to continue after the branch group, its sole branch must not have been taken, and so CQL knows the opposite of its condition for entry will be true from this point onward:

  • t requires initialization.
  • i is not null.

The next IF is rather similar to what we've seen already in its structure, so we can jump ahead several lines to the next point of interest:

      IF i == 0 THEN
        SET t := "";
      ELSE IF i > 0 THEN
    ==> SET t := p2(i);
      ELSE
        THROW;
      END IF;

Before we analyze the above-indicated line, we have the following set of facts:

  • t requires initialization.
  • i is not null.
  • In branch group:
    • In branch when i == 0:
      • t is initialized.
    • In branch when i > 0:

In the call p2(i), we know that i was declared to have type INTEGER and that p2 requires an INTEGER NOT NULL, but we also have the fact "i is not null". For this reason, we can consider p2(i) to be a valid call. We can also add the fact that t is initialized to our current set of facts:

  • ...
    • In branch when i > 0:
      • t is initialized.

NOTE: When it comes to code generation, it is not so simple as to say p2(i) is valid and proceed as usual. That's because p2 expects an argument of type INTEGER NOT NULL, but we merely have a value of type INTEGER that we happen to know cannot be null: INTEGER NOT NULL and INTEGER do not share the same underlying representation, and so we cannot pass the declared-nullable variable i directly to p2. To solve this problem, CQL rewrites the expression such that p2(i) becomes p2(cql_inferred_notnull(i)), where cql_inferred_notnull is an internal-only function that handles the nullable-to-nonnull representational conversion for us. This explains its presence in the following examples.

Jumping ahead again, we encounter a THROW:

      IF i == 0 THEN
        SET t := "";
      ELSE IF i > 0 THEN
        SET t := p2(cql_inferred_notnull(i));
      ELSE
    ==> THROW;
      END IF;

The fact that the branch will throw is added to the current set of facts:

  • t requires initialization.
  • i is not null.
  • In branch group:
    • In branch when i == 0:
      • t is initialized.
    • In branch when i > 0:
      • t is initialized.
    • In ELSE branch:
      • Throws.

We then proceed to the end of the IF:

      IF i == 0 THEN
        SET t := "";
      ELSE IF i > 0 THEN
        SET t := p2(cql_inferred_notnull(i));
      ELSE
        THROW;
  ==> END IF;

Once again, CQL merges the effects of the branches in the branch group to finish the analysis of the IF. Since it can see that t was initialized in all branches except the one that throws, and since the branches cover all possible cases, the set of facts is simplified as follows given the knowledge that, if THROW was not encountered, t must have been initialized:

  • t requires initialization.
  • i is not null.
  • t is initialized.

Moving ahead one final time, we encounter the end of the procedure:

    CREATE PROC p0(b BOOL, i INTEGER, OUT t TEXT NOT NULL)
    BEGIN
      ...
==> END;

The only thing left to do at this point is to validate that anything requiring initialization has been initialized. Since we have both "t requires initialization" and "t is initialized", everything is in order.

Looking Ahead

As a recently generalized piece of functionality within the CQL compiler, control flow analysis will soon be used to enforce additional properties of users' programs. In particular, CQL will be able to ensure that cursors are always fetched before they're used and that cursors are always checked to have a row before their fields are accessed.

Hopefully you now understand the fundamentals of control flow analysis in CQL and the benefits it brings to your programs. Best wishes for 2022!

Using the LIKE form in the SELECT statement (2022/02/03)

One of the signature features of the CQL language is the ability to use the "LIKE" form to slice out columns that conform to a shape. This notion appears in many places in the language. For instance, if I have a table Foo, I can make a cursor for that shape like so:

declare C cursor like Foo;

Which says I want the columns of C to be like the columns of Foo.

If I have a cursor D that has the Foo columns but maybe more and maybe in a different order I can load C as follows:

fetch C from D(like Foo)

Which again saves me from having to list all the (potentially dozens) of Foo columns. This construct is in many places:

declare proc P(like Foo)
begin
  insert into Foo from arguments;
end;

even

declare proc P(f like Foo, b like Bar)
begin
  insert into Foo from f;
  insert into Bar from b;
end;

And other examples... This is discussed more fully in Chapter 5 of the Guide.

However, one of the few places where shapes are interesting but were not supported was the select list. And so, just a couple of days ago, we added the COLUMNS construct to the language, which allows a sugared syntax for extracting columns in bulk. It's kind of a generalization of the select T.* pattern but with CQL-style slicing and type-checking.

These forms are supported:

  • columns from a join table or tables

-- same as A.*
select columns(A) from ...;

-- same as A.*, B.*
select columns(A, B) from ...;

  • columns from a particular join table that match a shape

-- the columns of A that match the shape Foo
select columns(A like Foo) from ...;

-- get the Foo shape from A and the Bar shape from B
select columns(A like Foo, B like Bar) from ...;

  • columns from any join table that match a shape

-- get the Foo shape from anywhere in the join
select columns(like Foo) from ...;

-- get the Foo and Bar shapes, from anywhere in the join
select columns(like Foo, like Bar) from ...;

  • specific columns

-- x and y columns plus the Foo shape
select columns(T1.x, T2.y, like Foo) from ...;

  • distinct columns from the above (not distinct values!)

-- removes duplicate column names
-- e.g. there will be one copy of 'pk'
select columns(distinct A, B) from A join B using(pk);

-- if both Foo and Bar have an (e.g.) 'id' field you only get one copy
select columns(distinct like Foo, like Bar) from ...;

-- if a specific column is mentioned it is always included
-- but later clauses that are not a specific column will avoid it
-- if F or B has an x it won't appear again, just T.x
select columns(distinct T.x, F like Foo, B like Bar) from F, B ..;

Of course this is all just sugar, so it all ends up being a column list with table qualifications -- but the syntax is very powerful. For instance, it's great for narrowing a wide table, or for fusing joins that share common keys:

-- just the Foo columns
select columns(like Foo) from Superset_Of_Foo_From_Many_Joins_Even;

-- only one copy of 'pk'
select columns(distinct A,B,C) from
  A join B using (pk) join C using (pk);

And of course you can define shapes however you like and then use them to slice off column chunks of your choice. There are many ways to build up shapes from other shapes. Probably the easiest is to declare procedures that return the shape you want and never actually create them. E.g.

declare proc shape1() (x integer, y real, z text);
declare proc shape2() (like shape1, u bool, v bool);

With this combination you can easily define common column shapes and slice them out of complex queries without having to type the column names over and over...
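
For example, assuming a query whose result includes all of shape2's columns (the query here is hypothetical):

-- slice exactly the shape2 columns out of a wide result
select columns(like shape2) from some_wide_join;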

Note that the COLUMNS(...) form is not a general replacement for the select list. For instance, general expressions are not allowed inside of COLUMNS(...) but, where extraction of lots of columns is needed, or even re-ordering of columns, it's a very good option indeed and it composes well with the other select features.

This was the last significant area where shapes are useful but totally absent.

Introducing Expression Fragments (2022/02/19)

Following on the heels of shared fragments, we're introducing the same kind of thing for fragments that are expressions rather than tables. The syntax is as follows:

-- this isn't very exciting because regular max would do the job
@attribute(cql:shared_fragment)
create proc max_func(x integer, y integer)
begin
  select case when x >= y then x else y end;
end;

The above can be used in the context of a SQL statement like so:

select max_func(T1.column1, T1.column2) the_max from foo T1;

The consequence of the above is that the body of max_func is inlined into the generated SQL. However, like the other shared fragments, this is done in such a way that the text can be shared between instances so you only pay for the cost of the text* in your program one time, no matter how many times you use it.

* You still pay for the cost of a pointer to the text.

In particular, for the above, the compiler will generate the following SQL:

select (
  select case when x >= y then x else y end
    from (select T1.column1 x, column2 y))

But each line will be its own string literal, so, more accurately, it will concatenate the following three strings:

"select (",                                      // string1
" select case when x >= y then x else y end",    // string2
" from (select T1.column1 x, column2 y))"        // string3

Importantly, string2 is fixed for any given fragment. The only thing that changes is string3, i.e., the arguments. The C compiler, and then the linker, will unify the string2 literal across all translation units so you only pay for the cost of that text one time. It also means that the text of the arguments appears exactly one time, no matter how complex they are. For these benefits, we pay the cost of the select wrapper on the arguments. This cost is frequently negative. Consider the following:

select max_func((select max(T.m) from T), (select max(U.m) from U))

A direct expansion of the above would result in something like this:

case when (select max(T.m) from T) >= (select max(U.m) from U)
   then (select max(T.m) from T)
   else (select max(U.m) from U)
end;

The above could be accomplished with a simple pre-processor macro, but the fragments code generates the following:

select (
  select case when x >= y then x else y end
    from (select (select max(T.m) from T) x, (select max(U.m) from U) y))

Expression fragments can nest, so you could write:

@attribute(cql:shared_fragment)
create proc max3_func(x integer, y integer, z integer)
begin
  select max_func(x, max_func(y, z));
end;

Again, this particular example is a waste because regular max would already do the job.

To give another example, common mappings from one kind of code to another using case/when can be written and shared this way:

-- this sort of thing happens all the time
@attribute(cql:shared_fragment)
create proc remap(x integer not null)
begin
   select case x
     when 1 then 1001
     when 2 then 1057
     when 3 then 2010
     when 4 then 2011
     else 9999
   end;
end;

In the following:

select remap(T1.c), remap(T2.d), remap(T3.e) from C, D, E;

The text for remap will appear three times in the generated SQL query but only one time in your binary.

Restrictions:

  • the function must consist of exactly one simple select statement
    • no FROM, WHERE, HAVING, etc. -- the result is an expression
  • the select list must have exactly one value
    • Note: the expression can be a nested SELECT which could have all the usual SELECT elements
  • the usual shared fragment rules apply, e.g. no out-parameters, exactly one statement, etc.

FAQ:

Q: Why does the expression fragment have a select in it?

A: Expression fragments are only interesting in SQL contexts where normal procedure and function calls are not available. The select keyword makes it clear to the author and the compiler that the expression will be evaluated by SQLite and the rules for what is allowed to go in the expression are the SQLite rules.

Q: Why no FROM clause?

A: We're trying to produce an expression, not a table-value with one column. If you want a table-value with one column, the original shared fragments solution already does exactly that. This gives you a solution for sharing code in, say, the WHERE clause or the select list.
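
For instance, the max_func fragment defined earlier could appear in a WHERE clause like so (reusing the foo table from the earlier example):

select * from foo T1
where max_func(T1.column1, T1.column2) > 100;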

Q: Isn't this just the same as doing, say, #define max_func(x,y) case when (x) >= (y) then x else y end;?

A: Macros can give you a ton of flexibility, but they have many problems:

  • if the macro has an error, you see the error in the call site with really bad diagnostic info
  • the compiler doesn't know that the sharing is going on so it won't be able to share text between call sites
  • the arguments can be evaluated many times each which could be expensive, bloaty, or wrong
  • there is no type-checking of arguments to the macro so you may or may not get compilation errors after expansion
  • you have to deal with all the usual pre-processor hazards

In general, macros can be used for meta-programming (as in C and C++), but that doesn't mean it's a good idea.

Using the FROM construct in more places (2022/02/21)

This new feature is a pretty simple generalization of the FROM construct as applied to expression lists. Note this isn't the same as using FROM the usual way in a select statement. An example will clear this right up.

Suppose you wanted to create a procedure that inserts a row into a table. You could write this:

create table Shape_xy (x int, y int);

create proc insert_xy(like Shape_xy)
begin
  insert into Shape_xy from arguments;
end;

Here we're using from to introduce some shape of values. It can appear in a lot of places.

Suppose now I want to insert two of those shapes. I could write this slightly more complicated procedure:

create proc insert_two_xy(xy1 like Shape_xy, xy2 like Shape_xy)
begin
   call insert_xy(from xy1);
   call insert_xy(from xy2);
end;

This also composes with cursors, so maybe you need to get two xy values from diverse locations. You can mix and match.

create proc write_xy()
begin
   declare C cursor for select T.x, T.y from somewhere T;
   fetch C;
   declare D cursor for select T.x, T.y from somewhere_else T;
   fetch D;
   if C and D then
     -- strange combos for illustration only
     call insert_two_xy(from C, from D);
     call insert_two_xy(from D, 5, 3);
     call insert_two_xy(4, 2, from C);
     call insert_two_xy(4, from C, 8);
   end if;
end;

So, as you can see, we can start from data in one or more cursors and we can turn that data, plus other expressions, into arguments, composing them as we like. This gives you the ability to call procedures and functions using shapes from a mixed set of sources. None of this is new.

However, the other places where expression lists happen -- fetch, update cursor, and insert -- only allowed you to specify a single object as the input source, such as insert into Shape_xy from C.

With a little work, this is trivially generalized so that all value lists can use the from construct.

Here's a complete example showing all the new forms.

create table Shape_xy (x int, y int);
create table Shape_uv (u text, v text);
create table Shape_uvxy (like Shape_xy, like Shape_uv);

create proc ShapeTrix()
begin
  declare C cursor for select Shape_xy.*, '1' u, '2' v from Shape_xy;
  fetch C;

  -- This new form is equivalent to the old form:
  --    insert into Shape_xy from C(like Shape_xy)
  -- but the values(...) form generalizes, see below.
  insert into Shape_xy values(from C like Shape_xy);

  declare D cursor for select * from Shape_uv;
  fetch D;

  declare R cursor like Shape_uvxy;

  -- This form works just like the function call case
  -- that was previously supported (it uses the same code even).
  -- This form lets you load R from any combination of sources
  -- as long as you make a suitable row.
  fetch R from values (from C like Shape_xy, from D);

  -- Same thing is supported in update cursor
  -- the x, y come from C and the u, v come from D.u, D.v.
  -- Note that C.u and C.v would not even be type compatible.
  update cursor R from values (from C like Shape_xy, from D);

  -- And in a select-values clause
  declare S cursor for
    with cte(l,m,n,o) as (values (from C like Shape_xy, from D))
     select * from cte;
  fetch S;
  insert into Shape_uvxy from S;
end;

As you can see, you can choose a subset of the from shape using like.

These combinations let you flexibly assemble rows of data for cursors, calls, and insertions, using any combination of data sources you might want, without resorting to listing every column by hand.

Introducing Blob Storage (2022/03/17)

Introduction and Context

The general idea here is that you might want to store composite data in a single column in the database. This is a common way to get more generic schema, the idea being that your tables can have one or more blob columns that store a lot of data that doesn't have to be indexed. You could store it in other ways, like a JSON blob or some such, but we'll be using blobs as the basis for storage here -- hence the name blob "storage".

How do I define one of these blobs?

In SQL/CQL, the main way you define structures, especially those that you want to maintain, is with tables. Hence we introduce this:

@attribute(cql:blob_storage)
create table news_info(
  who text,
  what text,
  when_ long -- timestamp of some kind
);

The blob_storage attribute indicates that the thing we're about to define here is not really going to be a materialized table. As a result, you will not be able to (e.g.) DROP the table or SELECT from it, and there will be no schema upgrade for it should you request one. However, the usual schema rules still apply which help you to create compatible versions of this structure. For instance, new columns can be added only at the end, and only if they are nullable. Here we add source to the schema in a hypothetical "version 6". Note that schema versions move forward globally in the schema, not locally in one table; this implies there are versions 1-5 elsewhere, not shown.

@attribute(cql:blob_storage)
create table news_info(
  who text,
  what text,
  when_ long, -- timestamp of some kind
  source text @create(6)
);

Additionally, since the storage is not backed by SQL with SQL's constraint system, default values and constraints are not allowed in a table marked with cql:blob_storage; it's just data. Similarly, triggers, views, and indices may not use the "table".

Where do you keep your blob storage?

Naturally, blob storage goes in a blob field, but recall CQL has discriminated types so we could make something like this:

create table info(
  id long primary key,
  news_info blob<news_info>
);

From a SQL perspective news_info is just a blob. That means if you want to do a WHERE clause or something like that on the info, you're out of luck. Maybe you could write a user-defined function to crack the blob and so forth but really this isn't the point. If you're using this feature then, by construction, you don't need to index on this data. It's simply not suitable for use at all if you need field-at-a-time access within SQL.

How do I make one of these blobs?

The natural place that CQL stores structures is in value cursors so the most natural thing to do is to provide a variation of the SET statement that lets you load a blob from a cursor like so:

create proc make_blob(like news_info, out result blob<news_info>)
begin
  declare c cursor like news_info;
  fetch c from arguments;
  set result from cursor c;
END;

This declares a cursor, loads it from argument values, and converts it to a blob. Of course all of the usual cursor building forms can be used to power your blob creation, you just do one serialization at the end. The above is assembling a blob from arguments but you could equally make the blob from data.

create proc get_news_info(id_ long not null, out result blob<news_info>)
begin
   -- use our columns sugar syntax for getting just news_info columns from
   -- a table with potentially lots of stuff (or an error if it's missing columns)
   declare c cursor for
     select columns(like news_info) from some_source_of_info where info.id = id_;
   fetch c;
   set result from cursor c;
END;

There are many cursor fetch forms, including dummy data forms and other interesting bits of sugar. You can fetch a cursor from arguments, from other cursors, and even combinations. We want all of that to work for blobs as well without adding tons of new syntax and code generation. The obvious way to accomplish that is for cursors to be the source of blobs.
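
As a sketch of that flexibility, the cursor could just as easily be assembled from a mix of sources before serializing (here who_what is an assumed cursor with who and what columns, and when_now is an assumed long value):

declare c cursor like news_info;
fetch c from values (from who_what, when_now);
set result from cursor c;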

How do I unpack one of these blobs?

Again, the normal way that you work with records in CQL is by creating suitable cursors. These can be economically accessed on a field-by-field basis. What we need is a way to easily recreate a cursor from the blob so we can read the data values. This gives rise to this form:

let b := (select news_info from info where id = id_ if nothing null);
declare c cursor like b;
fetch c from b; -- note this can fail
-- now use c.who, c.what, etc.

Data loaded in a cursor is very economical to access on a field-by-field basis, and, since the deserialization of the blob happens all at once, that is also economical. Importantly, we cannot assume that the blob is well formed, it could be coming from anywhere. For secure-code reasons we must assume it is hostile. Hence the decoding validates the shape, internal lengths, and so forth.

If we had instead started with something like this:

let b := (select news_info from info where id = id_ if nothing null);

Then maybe we might like to write:

if b.who == 'U2' then ... end if;

However, this sort of thing would be very uneconomical. For one thing, the blob does not have fixed-offset fields: It is carrying all the serialized data for the string fields and so forth. Each "dot" operation would be costly and, furthermore, each "dot" operation could fail if the blob is badly formed. Having to deal with a b.who that might fail seems very bad indeed.

Once you have the cursor you can make new blobs with different combinations, slice the cursor fields using the LIKE operator, return the cursor with OUT, or OUT UNION, or pass the blob fields as arguments to functions using the FROM forms. Cursors already are super flexible in terms of what you can do with their contents.
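
For example, once the blob has been unpacked into a cursor, its fields can be forwarded with the FROM form (process_news_info is an assumed procedure that takes the news_info shape):

declare c cursor like news_info;
fetch c from b;  -- can fail if the blob is malformed
call process_news_info(from c);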

What is the representation of one of these blobs?

It's important that we allow the blobs to evolve over time, so each blob has to be self-describing. We also want to be able to throw an error if you use the wrong kind of blob when loading a cursor, so the blob has to contain the following:

  • the number of columns in the blob data type when it was stored
  • the type of each field is encoded as a single plain-text character
    • the types are bool, int, long, (double) real, (string) text, blob;
    • we use 'f' (flag) for bools, hence "fildsb"
    • these are encoded with one letter each, upper case meaning 'not null' so the storage might be "LFss"
    • the buffer begins with a null terminated string that serves for both the count and the types
  • Each nullable field may be present or null; 1 bit is used to store this fact. The bits are in an array of bytes that comes immediately after the type info (which implicitly tells us its size)
  • Boolean values are likewise encoded as bits within the same array, so the total number of bits stored is nullables plus booleans (nullable booleans use 2 bits)
  • If you are reading a newer version of a record from an older piece of data that is missing a column then the column is assumed to be NULL
  • Any columns you add after the initial version (using @create) must be nullable; this is normal for adding columns to existing schema
  • Integers and longs are stored in varint format after zigzag encoding
  • Text is stored inline in null terminated strings (embedded nulls are not allowed in CQL text)
  • Nested blobs are stored inline, with a length prefix encoded like any other int
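
As a small worked example of the header described above (illustrative only; the exact byte layout belongs to the runtime): news_info has three columns, all nullable -- who text, what text, when_ long -- so its type string would be "ssl". The length of that string gives the column count (3), and since there are three nullable fields and no booleans, three null bits follow in the byte array after the string's terminator.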

What about more than one row in a blob?

Well, this is a bit more advanced but in principle this could be done as well. To make it useful, we would want to make a new cursor type that can iterate over rows in a blob. The syntax leaves room for this, something like so:

declare c cursor for blob b;
loop fetch c
begin
  -- the usual stuff
end;

This cursor would be another variation; it would keep its current index into the blob to read data out of it. Such a blob would also have to include a count of rows as part of its storage.

However, that's future looking. There is no such support at present.

Conclusion

With a fairly modest amount of work, we now support structured storage natively and have pretty rich language constructs. We carefully chose language constructs that lead to economical serialization and deserialization patterns and a record format that is versioned well, without resorting to something super loose like JSON.

As with many other features, it's possible to replace the (de)serialization with code of your choice by supplying your own runtime methods. So for instance, thrift encoding is possible; though it is more flexible than is strictly necessary for the few SQL data types, it might be convenient.

Storage types that are going to be persisted in the database or go over a wire-protocol should be managed like schema with the usual validation rules. On the other hand, formats that will be used only transiently in memory can be changed at whim from version to version. As mentioned above, the design specifically considers cases where a new client discovers an old-format blob (with fewer columns) and, the reverse, cases where an old client receives a datagram from a new client with too many columns.

Appendix

A more complete example is included for reference.

@attribute(cql:blob_storage)
create table news_info(
  who text,
  what text,
  when_ long -- timestamp of some kind
);

-- a place where the blob appears in storage
create table some_table(
  x integer,
  y integer,
  news_blob blob<news_info>
);

-- a procedure that creates the blob from loose args
create proc make_blob(like news_info, out result blob<news_info>)
begin
  declare c cursor like news_info;
  fetch c from arguments;
  set result from cursor c;
end;

-- a procedure that cracks the blob
create proc crack_blob(data blob<news_info>)
begin
  declare c cursor like news_info;
  fetch c from data;
  out c;
end;

-- a procedure that cracks the blob into loose args if needed
-- the OUT statement was created specifically to allow you to avoid this sort of mass OUT awfulness
create proc crack_blob_to_vars(
  data blob<news_info>,
  out who text,
  out what text,
  out when_ long)
begin
  declare c cursor like news_info;
  fetch c from data;
  set who := c.who;
  set what := c.what;
  set when_ := c.when_;
end;

-- this just defines a shape for the part we are keeping from the original structure
declare proc my_basic_columns() (
  x int,
  y int
);

-- this just defines a shape for the result we want
-- we're never actually defining this procedure
declare proc my_result_shape() (
  like my_basic_columns,
  like news_info
);

create proc select_and_crack(whatever_condition bool)
begin
  declare c cursor for select * from some_table where whatever_condition;
  loop fetch c
  begin
    -- crack the blob in c
    declare n cursor like news_info;
    fetch n from blob c.news_blob;

    -- assemble the result we want from the parts we have
    declare result cursor like my_result_shape;
    fetch result from values (from c like my_basic_columns, from n);

    -- emit one row
    out union result;
  end;
end;

Introducing Backed Tables (2022/10/05)

Introduction and Context

Most production databases include some tables that are fairly generic; they use maybe a simple key-value combination to store some simple settings or something like that. In the course of feature development this kind of thing comes up pretty often and in large client applications (like Messenger, but certainly not limited to Messenger) there are many small features that need a little bit of state. It's easy enough to model whatever state you need with a table or two but this soon results in an explosion of tiny tables. In some cases there are only a few rows of configuration data and indeed the situation can be so bad that the text of the schema for the little state table is larger than the sum of all the data you will ever store there. This is a bit tragic because SQLite has initialization cost associated with each table. So these baby tables are really not paying for themselves at all.

What we'd like to do is use some kind of generic table as the backing store for many of these small tables while preserving type safety. The cost of access might be a bit higher but since data volumes are expected to be low anyway this would be a good trade-off. And we can have as many as we like.

In some cases the state doesn't even need to be persisted, so we're talking about tables in an in-memory database. Here low cost of initialization is especially important. And lastly, if your product has dozens or even hundreds of small features like this, the likelihood that all of them are even used in a session is quite low and so again, having a low fixed cost for the schema is a good thing. No need to create 100 in-memory tables on the off chance that they are needed.

See also the related feature: blob storage.

How do I define one of these backed tables?

First you need a place to store the data, so we define a backing table in the usual way. A simple backing table is just a key/value store and looks like this:

@ATTRIBUTE(cql:backing_table)
CREATE TABLE backing(
  k BLOB PRIMARY KEY,
  v BLOB NOT NULL
);

The backing_table attribute indicates that the table we're about to define is to be used for backing storage. At present it is significantly restricted: it has to have exactly two columns, both of which are blobs, one being the key and one the value. It should be either baseline schema or annotated with @create as it is expected to be precious data. If it's an in-memory table the versioning is somewhat moot, but really the backing store is not supposed to change over time -- that's the point. In future versions we expect to allow some number of additional physical columns which can be used by the backed tables (discussed below) but for now it's this simple pattern.

A backed table looks like this:

@ATTRIBUTE(cql:backed_by=backing)
CREATE TABLE backed(
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  bias REAL
);

@ATTRIBUTE(cql:backed_by=backing)
CREATE TABLE backed2(
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL
);

The backed_by attribute indicates that the table we're about to define is not really going to be its own table. As a result, you will not be able to (e.g.) DROP the table or CREATE INDEX or CREATE TRIGGER on it, and there will be no schema upgrade for it should you request one. It may not contain constraints as there would be no way to enforce them. But as compensation for these restrictions it can be changed freely and has no physical schema cost associated with it.

How do I read this data?

To understand how this works imagine that we had a view for each backed table which simply read the blobs out of the backing store and then extracted the backed columns using some blob extraction functions. This would work, but then we'd be trading view schema for table schema so the schema savings we're trying to achieve would go up in smoke.

We might be lost here but CQL already has something very "view like" and that's the shared fragment structure. So what we do instead of views is to automatically create a shared fragment just like the view we could have made. They look like this:

@ATTRIBUTE(cql:shared_fragment)
CREATE PROC _backed ()
BEGIN
  SELECT
   rowid,
   cql_blob_get(T.k, backed.id) AS id,
   cql_blob_get(T.v, backed.name) AS name,
   cql_blob_get(T.v, backed.bias) AS bias
    FROM backing AS T
    WHERE cql_blob_get_type(T.k) = 2105552408096159860L;
END;

So some things to notice right away:

First, this fragment has the right shape, but the shared fragment doesn't directly call blob extractors. Rather, it uses these cql_blob_get calls. The point of this is to make the actual blob functions configurable. The test suites include some very simple extraction functions for blobs with just integers in them, but you can create whatever blob format you want. You could use the blob storage feature for encoding or you can encode it as you see fit. You can even have different encodings in different backed tables.

Second, there is a type code embedded in the procedure. The type code is a hash of the type name and the names and types of all the not-null fields in the backed table. The hash is arbitrary but repeatable, any system can compute the same hash and find the records they want without having to share headers. The actual hash is open source but it's just a SHA256 reduced to 64 bits with some name canonicalization. Shortly the JSON will also include the relevant hashes so you can easily consume them without even having to know the hash function.

Here's the slightly smaller shared fragment for backed2

@ATTRIBUTE(cql:shared_fragment)
CREATE PROC _backed2 ()
BEGIN
  SELECT
    rowid,
    cql_blob_get(T.k, backed2.id) AS id,
    cql_blob_get(T.v, backed2.name) AS name
    FROM backing AS T
    WHERE cql_blob_get_type(T.k) = -1844763880292276559L;
END;

As you can see it's very similar -- the type hash is different and of course it has different columns.

Why does the type hash include only the non-null fields?

The idea is that the backed table might change over time and you can add new optional fields without invalidating your existing data. If you change the name of the type or if you add new not null fields the type identity changes and any data you have in the backing table will basically be ignored because the type hash will not match.
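
For example, this hypothetical revision of backed adds only a nullable column, so the type hash is unchanged and existing rows in the backing table still match:

@ATTRIBUTE(cql:backed_by=backing)
CREATE TABLE backed(
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  bias REAL,
  note TEXT  -- new nullable column: not part of the hash, old rows are still readable
);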

What do cql_blob_get and cql_blob_get_type turn into?

You can configure them as you see fit. By default cql_blob_get turns into either bgetkey or bgetval depending on if you are reading from the key blob or the value blob. The directives for configuring this function are:

@blob_get_key bgetkey offset;
@blob_get_val bgetval;

You can configure the system to ask for the column by offset (this is normal for the primary key because it has a fixed number of columns for any given key type and they are all mandatory), or by hash code (this is normal for the value type because it might be missing some columns and so offset is probably not appropriate). However, both are configurable: if you want to do keys by hash code, simply omit the "offset" part of the directive. Likewise, if your values are offset addressable you can add "offset" to the value directive. Here the offset means the zero-based ordinal of the column in the key or the value.
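
For example, to look up key columns by hash code and address value columns by offset you would flip the defaults, something like this:

@blob_get_key bgetkey;          -- no "offset": key columns looked up by hash code
@blob_get_val bgetval offset;   -- "offset": value columns addressed by zero-based ordinal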

The type access functions are similarly configurable (they never need a code or an offset).

@blob_get_key_type bgetkey_type;
@blob_get_val_type bgetval_type;

What does this end up looking like?

Armed with these basic transforms we can already do a simple transform to make the select statement work. Suppose CQL sees:

declare C cursor for select * from backed;

We can make this work with a simple transform:

 DECLARE C CURSOR FOR WITH
  backed (*) AS (CALL _backed())
  SELECT *
    FROM backed;

Now remember _backed was the automatically created shared fragment. Basically, if we see a select statement that mentions any backed table we simply add a call to the corresponding shared fragment in the WITH clause. This effectively creates that "view" we need. And because we're using the shared fragment form, all users of this fragment will share the text. So there's no schema and the text of the backed "view" appears only once in the binary. More precisely we get this:

WITH
backed (rowid, id, name, bias) AS (
  SELECT
    rowid,
    bgetkey(T.k, 0),                      -- 0 is offset of backed.id in key blob
    bgetval(T.v, -6639502068221071091L),  -- note hash of backed.name
    bgetval(T.v, -3826945563932272602L)   -- note hash of backed.bias
  FROM backing AS T
  WHERE bgetkey_type(T.k) = 2105552408096159860L)
SELECT rowid, id, name, bias
  FROM backed;

Now with this in mind we can see that it would be very beneficial to also add this:

CREATE INDEX backing_index ON backing(bgetkey_type(k));

or more cleanly:

CREATE INDEX backing_index ON backing(cql_blob_get_type(k));

Either of these results in a computed index on the row type stored in the blob. Other physical indices might be helpful too and these can potentially be shared by many backed tables, or used in partial indices.

Of course your type function might be named something other than the default bgetkey_type.

Now consider a slightly more complex example:

select T1.* from backed T1 join backed2 T2 where T1.id = T2.id;

becomes:

WITH
  backed (rowid, id, name, bias) AS (CALL _backed()),
  backed2 (rowid, id, name) AS (CALL _backed2())
  SELECT T1.*
    FROM backed AS T1
    INNER JOIN backed2 AS T2
    WHERE T1.id = T2.id;

Now even though two different backed tables will be using the backing store the select "just works". All the compiler had to do was add both backed table fragments. And of course if backed was joined against itself, that would also just work.

How do I insert data like this?

Consider:

insert into backed values (1, "n001", 1.2), (2, "n002", 3.7);

This has to insert into the backing storage and convert the various values into key and value blobs. A simple transform does this job as well:

 WITH
  _vals (id, name, bias) AS (
    VALUES(1, "n001", 1.2), (2, "n002", 3.7)
  )
  INSERT INTO backing(k, v) SELECT
    cql_blob_create(backed, V.id, backed.id),
    cql_blob_create(backed,
      V.name, backed.name,
      V.bias, backed.bias)
    FROM _vals AS V;

What's going on here? Well, the issue is that the data to be inserted can be arbitrarily complicated. It might refer to all kinds of things. In this case it's just literal values but in general it could be anything. So the transform takes the original values and puts them in a _vals(...) CTE. Then we insert into the backing store by converting _vals into blobs -- one for the key and one for the value. There is only the one place we need to do this for any given insert statement no matter how many items or how complex the insertion is.

cql_blob_create similarly expands to a user-configured function with optional hash codes and mandatory field types. There is a default configuration that corresponds to this:

@blob_create_key bcreatekey offset;
@blob_create_val bcreateval;

The final SQL looks like this:

WITH
_vals (id, name, bias) AS (
  VALUES(1, "n001", 1.2), (2, "n002", 3.7)
)
INSERT INTO backing(k, v) SELECT
  bcreatekey(2105552408096159860, V.id, 1), -- type 1 is integer, offset implied
  bcreateval(2105552408096159860,
    -6639502068221071091, V.name, 4,  -- hash as before, type 4 is text,
    -3826945563932272602, V.bias, 3)  -- hash as before, type 3 is real,
  FROM _vals AS V

Note that both blobs have the same overall type code (2105552408096159860) as before. The key blob did not use per-field type codes, so the argument positions give the implied offset. In contrast the value blob is using hash codes (offset was not specified). This configuration is typical.

A more complex insert works just as well:

insert into backed
  select id+10, name||'x', bias+3 from backed where id < 3;

The above insert statement is a bit of a mess. It's taking some of the backed data and using it to create new backed data. But the simple transforms we have work just as before. We add the needed backed CTE and create _vals like before.

WITH
  backed (*) AS (CALL _backed()),
  _vals (id, name, bias) AS (
    SELECT id + 10, name || 'x', bias + 3
    FROM backed
    WHERE id < 3
  )
  INSERT INTO backing(k, v)
   SELECT
     cql_blob_create(backed, V.id, backed.id),
     cql_blob_create(backed, V.name, backed.name, V.bias, backed.bias)
   FROM _vals AS V;

Looking closely at the above we see a few things:

  • cql_blob_create will expand as before (not shown)
  • we added backed(*) as usual
  • _vals once again just has the exact unchanged insert clause
  • the insert into backing(k, v) part is identical, the same recipe always works

How does the delete operation work?

Now let's look at a simple delete example:

delete from backed where id = 7;

Now remember we're again looking for a pattern that will generalize when the where condition gets crazy. But fortunately this is not so hard. The following form is fully general:

WITH
  backed (*) AS (CALL _backed())
DELETE FROM backing
  WHERE rowid IN (
    SELECT rowid
    FROM backed
    WHERE id = 7
  );

All we had to do here was:

  • add the usual _backed CTE
  • move the original WHERE clause into a subordinate SELECT that gives us the rowids to delete.

With the backed table in scope, any WHERE clause works. If other backed tables are mentioned, the compiler simply adds those as usual.

Here's a more complicated delete, it's a bit crazy but illustrative:

delete from backed where
  id in (select id from backed2 where name like '%x%');

So this is using rows in backed2 to decide which rows to delete in backed. The same simple transform works directly.

WITH
  backed2 (*) AS (CALL _backed2()),
  backed (*) AS (CALL _backed())
DELETE FROM backing
  WHERE rowid IN (
    SELECT rowid
    FROM backed
    WHERE id IN (
      SELECT id FROM backed2 WHERE name LIKE '%x%'
    )
  );

What happened here:

  • the WHERE clause went directly into the body of the rowid select
  • backed was used as before but now we also need backed2

The delete pattern does not need any additional cql helpers beyond what we've already seen.

What about updating tables?

The update statement is the most complicated of the bunch and it requires all the tricks from all the previous statements plus one more.

First, we'll need two more blob helpers that are configurable. By default they look like this:

@blob_update_key bupdatekey offset;
@blob_update_val bupdateval;

These are used to replace particular columns in a stored blob. Now let's start with a very simple update to see how it all works:

update backed set name = 'foo' where id = 5;

Fundamentally we need to do these things:

  • the target of the update has to end up being the backing table
  • we need the backed table CTE so we can do the filtering
  • we want to use the rowid trick to figure out which rows to update, which handles our where clause
  • we need to modify the existing key and/or value blobs rather than create them from scratch

Let's see how this looks:

WITH
  backed (*) AS (CALL _backed())
UPDATE backing
  SET v = cql_blob_update(v, 'foo', backed.name)
    WHERE rowid IN (SELECT rowid
    FROM backed
    WHERE id = 5);

Tearing this down a bit:

  • we needed the normal CTE so that we can use backed rows
  • the WHERE clause moved into a WHERE rowid sub-select just like in the DELETE case
  • we changed the SET targets to be k and v very much like the INSERT case, except we used an update helper that takes the current blob and creates a new blob to store
    • the helper is varargs so as we'll see it can mutate many columns in one call

This gives us a working update statement... with one hitch. It's possible to use the existing column values in the update expressions and there's no way to use our backed CTE to get them since the final update has to be all relative to the backing table.

Let's look at another example to illustrate the problem:

update backed set name = name || 'y' where bias < 5;

So this is adding the letter 'y' to some rows. Kind of goofy but similar mutations do happen and have to work. To make this work the reference to name inside of the set expression has to change. We end up with something like this:

WITH
  backed (*) AS (CALL _backed())
UPDATE backing
  SET v = cql_blob_update(v,
    cql_blob_get(v, backed.name) || 'y',
    backed.name)
  WHERE rowid IN (SELECT rowid
    FROM backed
    WHERE bias < 5);

Importantly the reference to name in the set expression was changed to cql_blob_get(v, backed.name) -- extracting the name from the value blob. After which it is appended with 'y' as usual.

The rest of the pattern is just as it was, in fact literally everything else is unchanged. But it's easy to see that the WHERE clause could be arbitrarily complex and it just works. Since the UPDATE statement has no FROM clause only the fields in the target table might need to be rewritten, so in this case name, id, and bias were possible but only name was mentioned.

After the cql_blob_get and cql_blob_update are expanded the result looks like this:

WITH
backed (rowid, id, name, bias) AS (
  SELECT
    rowid,
    bgetkey(T.k, 0),
    bgetval(T.v, -6639502068221071091L),
    bgetval(T.v, -3826945563932272602L)
  FROM backing AS T
  WHERE bgetkey_type(T.k) = 2105552408096159860L
)
UPDATE backing
SET v =
  bupdateval(
    v,
    -6639502068221071091L, bgetval(v, -6639502068221071091L) || 'y', 4
  )
  WHERE rowid IN (SELECT rowid
  FROM backed
  WHERE bias < 5);

The blob update function for the value blob requires the original blob, the hash or offset to update, the new value, and the type of the new value. The blob update function for the key blob takes the same arguments (blob, hash/offset, value) but the type is not required: the key blob necessarily has all of its fields present (they are all not null), so the type codes are already stored and the type of every column is known. The value blob might be missing nullable values, hence their type might not be stored/known.

To illustrate these cases we can make another small example; we'll set up yet another small table that uses the same backing store:

@attribute(cql:backed_by=backing)
create table meta(
 name text,
 state long,
 prev_state long,
 primary key(name, state)
);

This update mixes all kinds of values around...

update meta
 set state = state + 1, prev_state = state
 where name = 'foo';

And the final output will be:

WITH
meta (rowid, name, state, prev_state) AS (
  SELECT
    rowid,
    bgetkey(T.k, 0),
    bgetkey(T.k, 1),
    bgetval(T.v, -4464241499905806900)
  FROM backing AS T
  WHERE bgetkey_type(T.k) = 3397981749045545394
)
UPDATE backing
SET
  k = bupdatekey(k, bgetkey(k, 1) + 1, 1),
  v = bupdateval(v, -4464241499905806900, bgetkey(k, 1),  2)
  WHERE rowid IN (SELECT rowid
  FROM meta
  WHERE name = 'foo');

As expected the bupdatekey call gets the column offset (1) but not a type code, while bupdateval gets a hash code and a type (2).

All of these transforms are live in the code as of a few days ago.

The upshot is that, if you write some simple encoding and decoding functions, you can have very flexible blob storage.

Appendix

If you want to refer to your blob functions in your own code, such as for indices, you'll also need to do something like this:

declare select function bgetkey_type(b blob) long;
declare select function bgetval_type(b blob) long;
declare select function bgetkey(b blob, iarg integer) long;
declare select function bgetval(b blob, iarg integer) long;
declare select function bcreateval no check blob;
declare select function bcreatekey no check blob;
declare select function bupdateval no check blob;
declare select function bupdatekey no check blob;

bgetval and bgetkey are not readily declarable in general because their result type is polymorphic, so it's preferable to use cql_blob_get as above, which does the rewrite for you. But it is helpful to have a UDF declaration for each of the above, especially if you want the --rt query_plan output to work seamlessly. Typically bgetval would only be needed in the context of a create index statement, as in the sketch below.
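
For instance, an index over the backing store keyed by a backed column could look something like this. This is only a sketch: the index name is made up, and it assumes the declarations above plus the same hash code for backed.name that appeared in the earlier examples.

-- hypothetical index on the name field stored in the value blob
create index backed_name_index on backing (bgetval(v, -6639502068221071091));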

Introducing Parent/Child Result Sets (2022/10/06)

Introduction and Context

There are many cases where you might want to nest one result set inside of another one. In order to do this economically there was a great desire to be able to run a parent query and a child query and then link the child rows to the parent rows. One way to do this is of course to run one query for each "child" but then you end up with O(n) child queries and if there are sub-children it would be O(n*m) and so forth. What you really want to do here is something more like a join, only without the cross-product part of the join. Many systems have such features (sometimes they are called "chaptered rowsets") and in any case there is a general need for such a thing.

We did a bunch of work in the name of Parent/Child result sets but like many goals of this kind it caused us to ripen the CQL language in a variety of ways, and it's interesting to talk about those changes. Importantly, we wanted to be able to do work of this kind in the language while adding the fewest new notions, basically enabling the language to express a concept like a child rowset in the first place.

Here are some things that happened along the way that are interesting.

Cursor Types and Result Types

One of the first problems we run into thinking about how a CQL program might express pieces of a rowset and turn them into child results is that you need to be able to hash a row, append row data, and extract a result set from a key.

Let's think about that for just a second: in order to do anything at all with a child rowset, no matter how we got such a thing, we have to be able to describe it in a type-safe way. These objects already exist at runtime but they do not appear anywhere in the language explicitly and that was going to have to change.

To address this we added a new object type, kind of like we did with boxed statements. A result set has a type that looks like this: object<proc_name set>. Here proc_name must be the name of a procedure that returns a result set, and the object will represent a result set with the corresponding columns in it.
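
For instance, a variable that holds the result set of a hypothetical procedure get_people might be declared like this (a small sketch; the procedure name and shape are made up):

declare proc get_people() (id integer not null, name text);
declare people_result object<get_people set>;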

That step may seem like it's super important but actually it's kind of optional, it provides type-safety but the initial versions of the feature just used the type object which works fine provided you make no mistakes... it turns out there are even more fundamental needs that aren't optional.

Creating New Cursor Types From Existing Cursor Types

The first thing you need to be able to do is take the type of the parent query and add to it one or more columns to hold the child result set or sets (note that you can have more than one child result set per parent). So for instance you might have a list of people, and one child result might be the names of the schools they attended and another the names of the jobs they worked.

So while adding columns to existing rows might sound like a bizarre thing to do, it's actually fundamental to the job here. We must be able to create a new output row that is the same as the parent but includes columns for the child results too. There was no good syntax for this. The cursor declaration forms were:

/* option 1 */ declare C cursor like shape_name;
/* option 2 */ declare C cursor like select 1 x, "2" y, false z;

The first option implies that you already have a shape from (e.g.) a procedure or table and you want to make an identical cursor. That doesn't work here because we're trying to modify an existing shape, not use it as is.

The second form was supposed to be able to create any kind of cursor shape by simply declaring a select statement that is an example of what you want to capture. In principle this can define almost anything. However, there's a catch -- you can't get object types to come out of a select so it's hopeless for result set types. And, maybe just as important, you can't just add a few columns to an existing type with any kind of ease, you have to list all columns.

Fortunately there was a pretty simple solution to this problem. There were already lots of cases where a typed name list happens in the language -- for example in the return type of a function you can specify something like (id integer, name text). That construction also defines a shape just like a select statement and there was already code to handle all the correctness analysis. Additionally, the LIKE construct can be used in such a list to refer to existing types. So for instance a function that returns all the columns of tables A and B could be defined like so

declare function foo() (LIKE A, LIKE B);

So we could solve all the cursor type problems by allowing a typed name list to be used to define a cursor shape. Probably the approach that should have been taken in the first place. The select option seems weird by comparison.

With the already existing support for shapes in a type list we could make the result shape for this parent/child case with ease, like so:

declare result cursor like (like parent, child_result object<child_proc set>);

So, all the parent columns plus a child result set. Or more than one child result set if needed.

Lastly there were going to be cases where we needed to make a new cursor using only some of the fields of an existing cursor. The case in particular I'm thinking of is that we might have a big row from the parent and only one or two of its columns are needed to form the key columns for the child. We didn't have a good way to do that either, but solving this turns out to be simple enough. We already had this form:

declare D cursor like C;

we just added:

declare D cursor like C(a, b, c);

Which chooses just the 3 named fields from C and makes a cursor with only those. Recently we added the form:

declare D cursor like C(-x);

This means: take all the columns of C except x.

With a shape for the key fields defined, we can use existing syntax to load the fields economically:

fetch D from C(like D);

Which says we want to load D from the fields of C, but using only the columns of D. That operation is of course going to be an exact type match by construction. So now we could describe the key columns from child rows, and the key columns from parent rows. And we could add columns to the parent type to create space to hold child result sets. All of our type problems are solved. Almost.

Cursor Arguments

It was clear that we would need to be able to do things like "hash a cursor" (any cursor) or "store this row into the appropriate partition" and this requirement meant that we had to be able to write functions that could take any cursor and dynamically do things to it based on its type information. There is no good way to write these generic helper things in CQL, but:

  • we don't need very many of them,
  • it's pretty easy to do that job in C

The main thing we need is to create a way to declare such functions and call them with a cursor and the necessary shape info.

So we added this notion of being able to call an external function with any cursor. Like so:

declare function cursor_hash(C cursor) long not null;

you can call it like so:

let hash := cursor_hash(C);

where C is any cursor.
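
For instance (a small sketch using a made-up cursor shape):

declare C cursor like select 1 id, 'x' name;
fetch C from values(1, 'x');
let hash := cursor_hash(C);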

When such a call is made the C function cursor_hash gets passed what we call a "dynamic cursor". This includes:

  • a pointer to the data for the cursor
  • the count of fields
  • the names of the fields
  • the type/offset of every field in the cursor

So you can (e.g.) generically do the hash by applying a hash to each field and then combining all of those. This kind of function works on any cursor and all the extra data about the shape that's needed to make the call is static, so really the cost of the call stays modest. Details of the dynamic cursor type are in cqlrt_common.h and there are many example functions now in the cqlrt_common.c file.

Again, creating this facility was a pretty minor matter, the compiler already has all this data and uses it to create result sets in the first place. We just allowed other functions to use that same data and made a public type for it.

The Specific Parent/Child Functions

To do the parent/child operations we needed three helper functions:

DECLARE FUNC cql_partition_create ()
   CREATE OBJECT<partitioning> NOT NULL;

DECLARE FUNC cql_partition_cursor (
  part OBJECT<partitioning> NOT NULL,
  key CURSOR,
  value CURSOR)
    BOOL NOT NULL;

DECLARE FUNC cql_extract_partition (
  part OBJECT<partitioning> NOT NULL,
  key CURSOR)
    CREATE OBJECT NOT NULL;

The first function makes a new partitioning.

The second function hashes the key columns of a cursor (specified by the key argument) and appends the values provided into a bucket for that key. By making a pass over the child rows you can easily create a partitioning with each unique key combo having a buffer of all the matching rows.

The third function is used once the partitioning is done. Given a key again, which you now presumably get from the parent rows, you get the buffer you had accumulated and then make a result set out of it and return that. Note that this function returns the vanilla object type because it could be returning any shape.

Result Set Sugar

With the type system mentioned above you could now join together any kind of complex parent and child combo you needed, but it might be a lot of code, and it's error prone. This is a good job for a little sugar. So we added some simple syntax to specify the usual partitioning.

It looks like this:

-- parent and child defined elsewhere
declare proc parent(x integer not null) (id integer not null, a integer, b integer);
declare proc child(y integer not null) (id integer not null, u text, v text);

-- join together parent and child using 'id'
create proc parent_child(x_ integer not null, y_ integer not null)
begin
  out union call parent(x_) join call child(y_) using (id);
end;

The generated code is simple enough, even though there's a good bit of it. But it's a useful exercise to look at it once. Comments added for clarity.

CREATE PROC parent_child (x_ INTEGER NOT NULL, y_ INTEGER NOT NULL)
BEGIN
  DECLARE __result__0 BOOL NOT NULL;

  -- we need a cursor to hold just the key of the child row
  DECLARE __key__0 CURSOR LIKE child(id);

  -- we need our partitioning object (there could be more than one per function,
  -- so it gets a number; likewise everything else gets a number)
  LET __partition__0 := cql_partition_create();

  -- we invoke the child and then iterate its rows
  DECLARE __child_cursor__0 CURSOR FOR CALL child(y_);
  LOOP FETCH __child_cursor__0
  BEGIN
    -- we extract just the key fields (id in this case)
    FETCH __key__0(id) FROM VALUES(__child_cursor__0.id);

    -- we add this child to the partition using its key
    SET __result__0 := cql_partition_cursor(__partition__0, __key__0, __child_cursor__0);
  END;

  -- we need a shape for our result, it is the columns of the parent plus the child rowset
  DECLARE __out_cursor__0 CURSOR LIKE (id INTEGER NOT NULL, a INTEGER, b INTEGER,
                                       child1 OBJECT<child SET> NOT NULL);

  -- now we call the parent and iterate it
  DECLARE __parent__0 CURSOR FOR CALL parent(x_);
  LOOP FETCH __parent__0
  BEGIN
    -- we load the key values out of the parent this time, same key fields
    FETCH __key__0(id) FROM VALUES(__parent__0.id);

    -- now we create a result row using the parent columns and the child result set
    FETCH __out_cursor__0(id, a, b, child1) FROM VALUES(__parent__0.id, __parent__0.a, __parent__0.b, cql_extract_partition(__partition__0, __key__0));

    -- and then we emit that row
    OUT UNION __out_cursor__0;
  END;
END;

This code iterates the child once and the parent once and only has two database calls, one for the child and one for the parent. And this is enough to create parent/child result sets for the most common examples.

Result Set Values

While the above is probably the most common case, another case can happen where you might want to make a procedure call for each parent row to compute the child. And, more generally, there was no good way to work with result sets from procedure calls other than iterating them with a cursor. The iteration pattern is very good if the data is coming from a select statement -- we don't want to materialize all of the results if we can stream instead. However, when working with result sets the whole point is to create materialized results for use elsewhere. We now had the power to express a result set type with object<proc_name set> but no way to actually get such a set from an existing procedure. Procedures generated them, but they could only be consumed in the C layer.

Fortunately this is also an easy problem to solve. We already supported the ability to use procedures as functions in expressions if they had the right signature. We now add the ability to call a procedure that returns a result set and capture that result. Previously this was not supported and would have produced an error.

With the new features you can write:

declare child_result object<child set>;
set child_result := child(args);

or better still:

let child_result := child(args);

With this simple change we had the power to write something like this:

declare proc parent(x integer not null) (id integer not null, a integer, b integer);
declare proc child(id integer not null) (id integer not null, u text, v text);

create proc parent_child(x_ integer not null, y_ integer not null)
begin
  -- the result is like the parent with an extra column for the child
  declare result cursor like (like parent, child object<child set>);

  -- call the parent and loop over the results
  declare P cursor for call parent(x_);
  loop fetch P
  begin
     -- compute the child for each P and then emit it
     fetch result from values(from P, child(P.id));
     out union result;
  end;
end;

After the sugar is applied this compiles down to this program:

DECLARE PROC parent (x INTEGER NOT NULL) (id INTEGER NOT NULL, a INTEGER, b INTEGER);
DECLARE PROC child (id INTEGER NOT NULL) (id INTEGER NOT NULL, u TEXT, v TEXT);

CREATE PROC parent_child (x_ INTEGER NOT NULL, y_ INTEGER NOT NULL)
BEGIN
  DECLARE result CURSOR LIKE (id INTEGER NOT NULL, a INTEGER, b INTEGER,
                              child OBJECT<child SET>);

  DECLARE P CURSOR FOR CALL parent(x_);
  LOOP FETCH P
  BEGIN
    FETCH result(id, a, b, child) FROM VALUES(P.id, P.a, P.b, child(P.id));
    OUT UNION result;
  END;
END;

The LIKE and FROM forms are very powerful but they aren't new. They do make it a lot easier to express this notion of just adding one more column to the result. Note that the code for emitting the parent_child result before the transformation doesn't need to specify what the columns of the parent are or the columns of the child, only that the parent has at least the id column. Even that could have been removed.

This call could have been used instead:

fetch result from values(from P, child(from P like child arguments));

That syntax would result in using the columns of P that match the arguments of child -- just P.id in this case. But if there were 7 such columns the sugar might be easier to understand.

Additional Language Support

Last, but not least, to make this more accessible we wanted more support in the generated code. The C interface would have produced generic object results for the child result columns. This isn't wrong exactly but it would mean that a cast would be required in every use case on the native side, and it's easy to get the cast wrong. So the result type of column getters was adjusted to be a child_result_set_ref instead of just cql_object_ref.

Similar transforms were needed if column setters were being emitted (yes that's an option!) and of course the Java and Objective C output needed the same transform.

Conclusion

The prosecution of native support for parent/child result sets in CQL resulted in a bunch of very useful generalizations for declaring and managing cursors. The old special case code for blobs was actually replaced by these forms. The language's overall expressiveness increased far beyond just the ability to do this one kind of join. It's now possible to write general purpose debug helpers for cursors. It's possible to store and return pre-cooked result sets, creating useful caches and other such combinations. The type extensions that allow extending and narrowing existing types give even more return flexibility while keeping everything strongly typed.

Parent/Child result sets exploit all of these things.

Some updates on the CQL schema upgrade system (2022/11/07)

Foreword

I was tempted to subtitle this article "How a great idea went horribly, horribly, wrong" but in the final analysis the outcome isn't actually at all horrible. But there are some good lessons here, and it's useful to capture the history while it is still fresh.

Introduction and Context

The CQL compiler can produce for you, starting from a set of table declarations and schema annotations, a schema upgrader that can upgrade your schema from any previous version to the current version, provided some simple rules are followed. Helpfully, the compiler enforces those rules with plain error messages so that you can reasonably expect your upgrader to work provided all is well with your database connection.

Broadly, the entities of the schema are on one of two plans: "create" and "recreate". These notions are all discussed in more detail in Chapter 10 of the guide.

The Create Plan

This plan applies strictly to tables, and is used for tables that have precious data that cannot reasonably be restored from say the cloud or some on-device backup. Typically the primary data is on this plan.

On this plan you are limited to these operations:

  • new tables can be created (including the so called baseline tables, those having no annotation at all)
  • columns can be added to the end of a table such that an ALTER TABLE ADD COLUMN statement could add them
  • columns can be deleted, making them illegal to use in queries but otherwise having no physical consequence
    • in CQL "select * from foo" will not include deleted columns hence "*" is fully expanded
  • tables can be deleted, leaving a tombstone in the schema
    • the tombstone provides the clue to the upgrader that the table should be dropped if it is found

The primary directives for this plan use @create annotations, hence the name.
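
For instance, a table on the create plan might look something like this (a minimal sketch; the table and column names are made up):

create table purchases(
  id integer not null primary key,
  amount real not null,
  note text @create(2),       -- column added at schema version 2
  legacy_flag bool @delete(3) -- column marked deleted at schema version 3
) @create(1);                 -- the table itself was added at schema version 1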

The Recreate Plan

Triggers, Indices, and Views are all on this plan, and tables can be too if they are annotated with @recreate instead of @create. The idea with this plan is that if the entity changes at all you simply drop the old version and create the new version. This means any change is possible but it also means the upgrade is always destructive:

  • if the upgrader is going to do anything at all it drops all views and all triggers at the start and recreates them at the end

    • this is not destructive and takes a lot of weird failure modes off the table
    • note steps in the upgrade logic therefore cannot rely on the existence of views or triggers
  • if any index or table changes at all it is dropped and recreated

    • this is done by computing a 64 bit CRC of the entity's schema and comparing it to the stored CRC
    • if the CRC is changed the recreate happens

Probably the first thing you notice once you create the notion of recreate for tables is that you really want to do the recreation in groups. There are constellations of schema that have related information and if one of them changes they all need to be updated. This lets you have complex foreign key relationships within this "recreate group".

You'll also notice that a recreate group can have foreign keys within itself, and it can make foreign keys to things that are on the create plan, but you run into trouble if you try to make foreign keys to some other recreate group. That group might vanish on you, or rather, it might try to vanish and discover that it cannot because of constraint violations. Originally recreate groups could not refer to other groups, but recently this was generalized to track a directed acyclic graph of groups. This means that recreating a core group forces the recreation of any groups that refer to it. On this plan it's common to end up with a snowflake type schema where the outer parts of the snowflake update often and the inner parts hardly at all.
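
For instance, two related tables in the same recreate group might be declared something like this (a sketch with made-up names):

create table search_results(
  id integer not null primary key,
  term text
) @recreate(search_group);

create table search_details(
  result_id integer not null references search_results(id),
  detail text
) @recreate(search_group);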

Overall CRC

In addition to the CRCs for the recreate groups and indices, there was one CRC for the overall schema. The upgrader checks this before anything else. If the overall schema CRC matches the current schema then nothing needs to be done (the upgrader has already done its job). If it doesn't match then some steps have to be applied.

Immutable Schema Versions

Other than the cross-group dependencies things began in the form above. The recreate plan was CRC driven and the create plan was version driven. The original design simply generated the appropriate corrections at each schema version and tracked the current version. If the overall CRC had changed, whichever steps you needed were executed.

This turned out to be a disaster and it was changed within days.

The idea seems fine enough, but the first thing you run into is that two people might make a change to the schema creating, say, version 5. The problem is if one person adds table X and the other adds table Y they will each run their own upgrade and mark themselves as schema 5. When they merge their changes with some other developer, the version 5 upgrade will have already run and they will get nothing despite the fact that v5 includes more for both of them. This is crazytown for developers.

So, rather than simply tracking the current schema version, each schema version got its own mini-CRC. The upgrader would run the steps of each version if the CRC was absent or didn't match. With steps like CREATE TABLE IF NOT EXISTS and so forth a merge would result in you getting the other half of the changes for your version and work could accumulate at say schema v5 with no problems. Actual customers would never see this because they only saw completed schema versions.

This worked a lot better and lasted about a year.

The problem is that the system is based on these "mostly immutable" schema versions. You never restate the past; you always give instructions on how to move forward. With the versions being nearly immutable, and the upgrade steps being idempotent, things seemed good. But it turns out neither of those two assumptions was really exactly true.

Mutating Schema Versions

The reality of the schema we created for our platform was that there was one large uber schema that had all the possible schema you might need for a variety of features, and any given product could opt in to the features it wanted, thereby getting the necessary schema. The schema system had a good way to partition the schema using regions. The upgrader could work on a set of regions and provide the union of schema in those regions, omitting the rest.

Super. Here's where things get exciting. A schema consumer could reasonably decide at some time in the future that it wants new features and so it opts into additional regions. That's fair enough, but the net of this is that of course new tables appear. Some of these are likely to be in the baseline schema (v0) and some might have appeared later (e.g. v5, v10, v21). This is all fine, the CRCs for those versions change and the schema upgrader runs again. Those versions execute and add the correct schema. Perfect.

Actually no.

Zombie Tables

About two years into the history of CQL we started noticing that some attempts to delete tables were failing. The DROP commands claimed that there was a constraint problem -- but these should have been leaf tables. What constraint could possibly be the issue? This was the first time a major design flaw with this system was revealed. Previous bugs had been few and had all been silly logic errors or off by one checks in version numbers, that kind of thing, easily fixed. This was a puzzler. But the answer was fortunately available in the set of annotations.

Basically, imagine a table "zombie" had been created say in the baseline schema, and then later deleted; suppose it was deleted in version 20. All is well, the upgrade steps for version 20 include a table drop. However, now a team subscribes to more schema, causing the v0 schema to include a few more tables. Here's the problem: when the steps for v0 run again they notice that "zombie" is missing and helpfully create it, thinking this is the right thing to do. But this is a disaster... The "zombie" table is supposed to be deleted in v20 but that CRC is unchanged! So now a table exists that has no business existing. If "zombie" has an FK reference to some other table which we want to delete, then all attempts to drop that table will fail because "zombie" is there gumming up the works. Even if it's empty... which it will be in this case.

This problem was fixed by having all tables that need deleting be unconditionally deleted at the end of the upgrade and not in the steps for the version in which the delete happened. This meant that the next upgrade purged all the zombies and enabled the correct table drops to start running with no errors. The consequence of this was a 90% reduction in schema upgrade failures!

Unsubscription

Another reason for the "immutable" version history to (cough) mutate was a desire to opt out of tables. As described in this section we created an affordance to allow people to unsubscribe from some of the tables they had previously selected. This provided finer-grain control of the schema subscription and also made it possible to undo previous mistakes of over-subscription. However, it was clear from the beginning that you might want to undo an unsubscription at some time in the future. In keeping with schema directives that create a clear history the @unsub and @resub statements were added to the language with lots of rules for correctness. The upgrader did the following things:

  • upon finding an unsubscription at version X that version includes DDL to drop the unsubscribed table
  • changes to that table in any future versions were omitted
  • upon finding a resubscription at version Y that version included DDL to create the table as it exists at version Y
  • later changes to that table are once again emitted as usual

This was very nearly right except it had the same problem as the delete case above. A table created in say the baseline might come back as a zombie even though it was unsubscribed. However, now wise to this issue, a small fix takes care of the problem.

  • always drop tables in the unsubscribed state at the end just like delete tables
  • no code is needed to do an unsubscribe at version x (the drop at the end will do the job)
  • a resubscribe at version X first drops the table and then recreates it as it exists at version X

This gives us a consistent upgrade path again and importantly avoids the problem of a resubscription finding a zombie that prevents it from doing its job.

Performance Optimization 1

On July 1, 2022 we made a modest change that reduced the number of SQLite statements required to do a full upgrade. The opportunity came from the many column existence checks we made before running ALTER TABLE ADD COLUMN. Rather than run a statement that looked like this (SELECT EXISTS(SELECT * FROM sqlite_master WHERE tbl_name = table_name AND sql GLOB column_declaration_pattern)) for each column, we first selected all of the table schema out of the sqlite_master table and put it into a hash table keyed by the table name. Reading even a few hundred table names was much faster than running a separate statement for each column that needed to be checked -- especially when recreating the schema from scratch. In the most relevant test case this was a 7% improvement.

Importantly, it motivated us to add hash tables into cqlrt_common.c and generalize the mechanism for object management so that the cqlrt allows creation of new objects without having to add special support for each one.

This new hash table meant that we could do a hash lookup and substring match instead of a sqlite query for each column.
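
Conceptually the change looks something like this (a sketch only; the real upgrader does this work in the C runtime using the generated hash table support):

-- before: one statement per column to be checked
SELECT EXISTS(SELECT * FROM sqlite_master
  WHERE tbl_name = table_name AND sql GLOB column_declaration_pattern);

-- after: one statement total, cached in a hash table keyed by tbl_name;
-- each column check becomes a hash lookup plus a substring match
SELECT tbl_name, sql FROM sqlite_master WHERE type = 'table';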

Performance Optimization 2

On Oct 11, 2022 we stopped using CRCs for the version checks on the create plan entirely. This was in fact an optimization but it was motivated by a real, but rare, problem.

What was happening was something like maybe 1 in 10^5 databases was missing columns. The sequence of events that caused this was very hard to diagnose but the situation was very clear. The database was at say schema version 100. The columns had been added at say version 50. The CRCs indicated that the v50 upgrade had already run so it didn't run again. The columns would now never be added.

We had yet to come up with a set of steps that would adequately describe how this happened. I have to guess some combination of a resubscription ran because of one of those "the schema is not really immutable" changes, and then a "medium" version of the table, say v25, was resubscribed but the columns added in, say, v50 never got re-added because v50 thought it had already run.

This was getting to be a nightmare but there was a simple solution.

We had already created this dictionary that had all the tables and their schema from sqlite_master, and we were already using it to determine if we needed to add a particular column. The only reason we had version CRCs at all was to allow us to skip steps, but since we could already skip column adds super fast, all we needed was to be able to skip table adds -- there is nothing else. Well, the same hash table can obviously easily tell us if a table exists. Non-existent tables have no schema and hence are absent from the hash table which is loaded directly from sqlite_master.

So the new algorithm goes something like this:

  • use the version numbers only for ordering
  • before adding a table, check if it exists in the hash table; this is faster than running CREATE TABLE IF NOT EXISTS
  • check the columns as before
  • attempt each of these every time the overall schema changes, and trust that the fast checks are fast enough

On this plan we change the way @unsub and @resub are handled to something much simpler:

  • @unsub acts like an out of band @delete on the table or view to which it is applied
    • the drop happens at the end like before
  • @resub resets that state so the table is considered not deleted if the last operation was @resub

To this we add one new rule:

  • the schema upgrader removes any reference to deleted tables entirely
    • they are removed from baseline
    • they are not included in any upgrade rules
    • they are only dropped at the end if they still exist

This vastly simplifies unsub/resub and delete. An unsubscribed table will always get cleaned up at the end, just like deleted tables. No strange interim states happen in resub. If a table is resubscribed it just reappears in the schema and the various operations run as usual.

The only compromise to this is that we still have a single CRC for the overall baseline schema. However even that could be removed at the expense of more hash table lookups. There is a binary size win for fewer checks and since baseline by definition depends on nothing it seems like a safe position to take.

This approach was in fact about 13-15% faster: the time saved examining and writing back schema CRCs more than paid for the extra hash table checks (which were ~100x faster than the db operations). And the hash table already existed! The reduction of the CRC checks and removal of vestigial upgrade logic for deleted tables also resulted in a 2.2% reduction of upgrader size for our most important case.

The most recent change and further simplifications in unsub/resub logic

With all of this in place it's clear that the various rules for unsubscription and resubscription, and the sort of historical playback that was used to try to create these immutable stages, are moot. The only thing that matters is whether we end in the unsubscribed state. Removing the unsubscribe upgrade steps from the upgrader entirely just simplifies everything. So no @resub is needed at all, nor are @unsub version numbers. Presently set to land is a set of changes that remove resubscription entirely; to resubscribe you simply remove the @unsub directive for your table/view.

This lets us eliminate a bunch of validations and test cases to get a simpler, clearer, and more easily verifiable upgrader. There's just much less to go wrong.

Even crazy cases like "an unsubscription happens in v5, the resubscription happens in v7, and a user runs the upgrade on a database that might be at v4, v5, v6, or v7 (having already been upgraded)" just work. All of these had different potential flows before. Now they are all the same. All the cases will roll the table forward to v7 from whatever version they might be on with the usual rules and states particular to unsubscription or resubscription. The table is present or it isn't. It is missing columns or it isn't. Same as always.

A Versionless Future

More thinking is needed here but it's clear that now that we've arrived at this simpler place ALL the version numbers are moot. The only thing we really have to do with version numbers is run ad hoc migrations at the appropriate time, and only once. The rules for migrators would have to change such that they are responsible for finding the state of the schema, and maybe some context could be provided for this. But ad hoc data migrators are very uncommon and regular annotations are much more so.

Conclusion

The fundamental assumption about how schema changes would happen was wrong. Even so, it was close enough that upgrades were over 99.99% successful when the other parts of the system are working ok. This is probably about the best we can hope for given the state of affairs with flash drives on Android devices. The current system is actually pretty close code-wise to what we had planned -- just look at the Oct 11, 2022 diff to see what I mean. It's not that big of a difference in the end. The new system has been deployed for nearly a month now and it is immune to basically all of the failure modes of the old. It will take some time before we know what its true reliability is given the low failure rate of both. But we do know the new system is significantly faster. Optimizations 1 and 2 together are over 20% for full installations.

I should note that someone who was obviously smarter than me told me that we would land on a solution like this and I didn't believe them. They were of course correct. You know who you are. Sorry I doubted you.
