CG SQL Blog Archive
I've captured the original blogs from cgsql.dev for posterity in case that site goes away. There is useful documentation as well. Some of this should go into the main guide but for now here it is so that it isn't lost. The articles below are in the original post order and I've recovered the original date. I've also fixed the internal links (all, I think) so that they stay in this repo rather than linking out to the original repo.
Hello everyone!
Thank you for visiting the CG/SQL's blog page. If you would like to read the very first blog announcing the project, please go over to the Facebook Engineering post published in early October 2020.
Looking forward to working with all of you!
Sincerely, CG/SQL Team
It's hard to believe it's been a month since the welcome message went up. We were
happy to see interest right away and even a bunch of forks but most of all
pull requests. A sweeping change to modernize the cql.y grammar was much
appreciated. That $1
stuff was very old school (I'm showing my age now).
- @mingodad gave us an implementation of check and collate column attributes (the `check` attribute on tables should be easy to add from here)
- the `select function` form should never return objects, only SQLite types; this is now enforced
- `@attribute(cql:suppress_result_set)` was added to save code gen for procedures that don't need the C result set wrappers
- `cql_cursor_diff_col` and `cql_cursor_diff_val` methods were added to report what's different about two cursors (highly useful in test code)
- `cql_cursor_format` was added so you can quickly convert any cursor into columns and values as a string for debug output (no matter the shape)
- `sqlite3_changes` was added to the builtin list so you don't have to use `declare select function` to use it anymore
- `cql_get_blob_size` was added so you can see how big your blobs are (useful for diagnostics)
- `trim`, `rtrim`, and `ltrim` were added to the builtin list so you can use them without `declare select function`
- the builtin function `ifnull_crash` was added so that nullables that have already been checked can be safely typecast to not null
- the bug we saw in demo video number 2 where some foreign keys were not properly linked up in autotest code was fixed (yay videos)
- time functions are now known to be `not null` for a bunch of simple cases such as 'now' arguments
- you can use the `cast(.. as ..)` operator on numeric types outside of the SQL context
- @mingodad replaced all the positional references with named references in cql.y (yes! thank you!)
- several minor bug fixes
- the railroad diagrams were updated
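To give a flavor of the new cursor helpers, here's a sketch (the table `some_table` is hypothetical, standing in for any shape you have handy):

```sql
-- sketch only: some_table is any table/shape in your schema
declare C cursor for select * from some_table;
fetch C;

-- cql_cursor_format renders any cursor's columns and values as a string,
-- no matter the shape -- handy for debug output
call printf("%s\n", cql_cursor_format(C));

-- cql_cursor_diff_val reports the first differing column of two cursors
-- with both values, or null if the cursors are the same
declare D cursor like C;
fetch D from C;
call printf("%s\n", ifnull(cql_cursor_diff_val(C, D), 'no differences'));
```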
NOTE: I often refer to "sugar" in the below. This is short for syntactic sugar which, in case you're not familiar with the term, refers to a syntactically more pleasing way of writing a concept that is otherwise totally doable with normal syntax. Many languages have sugar for forms that are common -- for brevity, clarity, and/or correctness.
We often add new features to the language to facilitate the writing of tests. The tests have a lot of boilerplate, often setting up and calling the same procedures again and again with slightly different arguments. Long argument lists and long insert column lists are especially problematic as these can be very error prone. Here good language constructs are very helpful. We've found good test constructs are often invaluable in production code as well, though in our experience the tests often have a lot more repetition that needs refactoring than production code. To that end we added some very useful things in the last month:
The most common way to create a cursor is from a select
statement but you can also make a cursor that can hold values for you
by declaring it to be LIKE
something else with a shape. A classic example is:
declare C cursor like some_table;
Now C
has the same columns and types as some_table.
Many procedures have a result type that is also a shape, for instance any procedure that ends with a select
statement has a result
shape defined by the columns of the select statement. You could always do this sort of thing:
declare C cursor like some_proc;
Meaning make C
a cursor whose shape is whatever some_proc
returns, which is of course exactly the kind of cursor you need to capture
the result of some_proc
.
Now we add:
declare C cursor like some_proc arguments;
The idea being that the arguments of some_proc
are also a shape (unless it has none). With this done you want to use that cursor
to call the procedure -- that being sort of the whole point. So we add this:
call some_proc(from C);
How do we use this effectively? Hold on just a second -- for that answer we need one more big tool to really help the syntax.
Loading up a cursor is done with syntax that is very much like an insert
statement. An example might be something like this:
fetch C(x,y,z) from values(1,2,3);
This is simple enough but it becomes more problematic if there are many values and especially if the values have complex names.
To make this a little less error prone CQL now has this sugar form for fetch
, insert
, and soon update cursor
(like maybe
before you see this blog). The more readable form is:
fetch C using
1 x,
2 y,
3 z;
This form has the values next to their names, just like in a select statement. Like all sugars, it is automatically rewritten to the normal form.
Likewise
insert into some_table using
1 id,
'fred' first_name,
'flintstone' last_name,
'bedrock' home_town,
'dino' favorite_pet,
'wilma' life_partner;
becomes
insert into some_table(id, first_name, last_name, home_town, favorite_pet, life_partner)
values(1, 'fred', 'flintstone', 'bedrock', 'dino', 'wilma');
except the sugar form is much less error prone. This form doesn't generalize to many values but the single row case is super common.
Since this form is automatically rewritten, SQLite will never see the sugar syntax; it will get the normal syntax.
NOTE: the insert rewrite is coming later today, and will likely be live by the time you read this.
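The `update cursor` flavor of the same sugar works the same way. As a sketch, reusing the cursor C from the fetch example above:

```sql
-- change only the named columns of C, leaving the others alone
update cursor C using
  4 x,
  5 y;
```

Here only x and y change; z keeps whatever value it already had.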
Let's suppose you have to write a test. You have a procedure test_subject
that takes some arguments plus
you have another helper procedure test_setup
that puts seed data in the right places for your subject.
But there are many variations and a lot of what you do between variations is the same. How can you write this
economically, making it clear what is different between variations, without a lot of fuss?
Well you can do something like this:
-- use defaults for all the named values
-- use 'seed' for everything else that isn't named
create proc default_setup_args(seed integer not null)
begin
declare args cursor like test_setup arguments;
fetch args using
1334 primary_id,
98012 secondary_id,
'foo' useful_name,
'bar' other_useful_name,
1 fast_mode
@dummy_seed(seed);
out args;
end;
With the above you can easily see which values go to which arguments.
Your test setup can now look something like this:
declare setup_args cursor like test_setup arguments;
fetch setup_args from call default_setup_args(1999);
update cursor setup_args using
0 fast_mode; -- override fast mode for this test
call test_setup(from setup_args);
To call the test subject you probably need some of those setup arguments and maybe some more things.
create proc default_subject_args(like default_setup_args, other_thing bool not null)
begin
declare args cursor like test_subject arguments;
fetch args using
primary_id primary_id, -- this came from the default_setup_args result
secondary_id secondary_id, -- so did this
useful_name name, -- the field names don't have to match
fast_mode fast_mode,
other_thing other_thing;
out args;
end;
Then the test code
declare test_args cursor like test_subject arguments;
fetch test_args from call default_subject_args(0);
call test_subject(from test_args);
Importantly, the cursor set operations are all by name so the order doesn't matter. Which means even if there are many arguments you don't have to worry that you got them in the wrong order or that they are the wrong type. Effectively you have a simple call by name strategy and you can easily read off the arguments. You could do something similarly brief with helper functions to provide the default arguments but then you can't readily re-use those arguments in later calls or for verification so this way seems a lot more useful in a test context.
When it comes time to validate, probably your test subject is returning a cursor from a select that you want to check. A slightly different call will do the job there.
With the setup above you can verify results very easily. Let's change it a little bit:
-- same as before, with a cursor
declare results cursor for call test_subject(from test_args);
-- get the first row
fetch results;
declare expected cursor like results;
fetch expected using
setup_args.primary_id primary_id,
setup_args.useful_name name,
test_args.other_thing other_thing
@dummy_seed(1999); -- dummy values for all other columns
-- make a macro like EXPECT_CURSOR_EQ(x,y) for this
-- if the cursors are different the result is a string with the first
-- different column name and the left and right values ready to print
call ExpectNull(cql_cursor_diff_val(expected, results));
ExpectNull could be:
create proc ExpectNull(t text)
begin
if t is not null then
call printf('%s\n', t); -- or whatever
throw;
end if;
end;
All that testing support comes from:
- cursors in the shape of arguments
- cleaner fetch/insert syntax
- cursors passed as arguments
- cursor differences
It kills a lot of boilerplate resulting in tests that are much clearer.
And that's what's been going on for the last month in CG/SQL land.
If you got this far thanks for reading. If you didn't get this far, you aren't reading this anyway so thanking you is moot =P
Stay safe.
Rico for CG/SQL
P.S. most of these fragments don't actually compile because of missing schema and maybe the odd typo. If there is interest I'll make a demo that works soup to nuts.
I was reviewing the update posting that just went out and I realized I'd forgotten to mention another big ticket item. So consider this an appendix to the update.
In some cases we started seeing a need to "ship cursors around" a little bit more flexibly. Note shipping values around is already doable so this new work is largely about being able to create a "statement cursor" in one procedure and consume it safely elsewhere. The general pattern looks like this:
Declare a statement cursor as usual, maybe something like this:
declare C cursor for select * from shape_source;
-- or
declare C cursor for call proc_that_returns_a_shape();
Make an object that can hold a cursor:
declare obj object<T cursor>;
Where T
is the name of a shape. It can be a table name, or a view name, or it can be the name of the canonical procedure that returns the result. You really want this to be some kind of global name though. Something you can get with a #include
in various places. In this case choices for T
might be shape_source
the table or proc_that_returns_a_shape
the procedure.
Remember you can always make a fake procedure that returns a result to sort of typedef a shape name. e.g.
declare proc my_shape() (id integer not null, name text);
The procedure here my_shape
doesn’t have to actually ever be created, in fact it’s probably better if it doesn’t. You won’t call it, you’re just using its hypothetical result as a shape. This could be useful if you have several procedures like proc_that_returns_a_shape
that all return my_shape
.
At this point you could use the cursor maybe something like:
loop fetch C
begin
-- do stuff with C
end;
Those are the usual patterns and they let you consume a statement cursor sort of "up" from where it was created. But what if you want some worker procedures that consume a cursor? There was no good way to pass your cursor down again. Well, there wasn't. Now there is. Let's go back to that box object creation and use it.
-- recap: declare the box that holds the cursor (T changed to my_shape for this example)
declare obj object<my_shape cursor>;
-- box the cursor into the object (the cursor shape must match the box shape)
set obj from cursor C;
The variable obj
can now be passed around as usual. Then, later, you can "unbox" it to get a cursor back. Like so:
-- unboxing a cursor from an object
declare D cursor for obj;
These primitives will allow cursors to be passed around with managed lifetime. Example:
-- consumes a cursor
create proc cursor_user(box object<my_shape cursor>)
begin
declare C cursor for box; -- the cursor's shape will be my_shape, matching box
loop fetch C
begin
-- do something with C
end;
end;
-- captures a cursor and passes it on
create proc cursor_boxer()
begin
declare C cursor for select * from something_like_my_shape;
declare box object<my_shape cursor>;
set box from cursor C; -- produces error if shape doesn't match
call cursor_user(box);
end;
Importantly, once you box a cursor the underlying SQLite statement’s lifetime is managed by the box object with normal retain/release semantics so timely release becomes imperative.
With this pattern it's possible to, for instance, consume some of the rows in one procedure and the rest in another procedure.
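A sketch of that split consumption might look like this (procedure and table names are hypothetical, building on the boxing example above):

```sql
-- consumes just the first row
create proc consume_first_row(box object<my_shape cursor>)
begin
  declare C cursor for box;
  fetch C;
  -- ... do something with C ...
end;

-- consumes whatever rows remain in the same statement
create proc consume_the_rest(box object<my_shape cursor>)
begin
  declare C cursor for box;  -- same underlying statement, picks up where we left off
  loop fetch C
  begin
    -- ... do something with C ...
  end;
end;

create proc split_consumption()
begin
  declare C cursor for select * from something_like_my_shape;
  declare box object<my_shape cursor>;
  set box from cursor C;
  call consume_first_row(box);
  call consume_the_rest(box);
end;
```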
Now, the main reason for doing this is if you have some standard helper methods that can get a cursor from a variety of places and process it.
But remember, that boxing isn’t the usual pattern at all and returning cursors in a box, while possible, should be avoided in favor of the simpler
pattern of doing your select
or call
at the end to compute the result as we do now, if only because then the lifetime is very simple in all those cases.
Durably storing a boxed cursor could lead to all manner of problems -- it's just like holding on to a sqlite3_stmt *
for a long time.
Actually "just like" is an understatement, it's exactly the same as holding on to a statement for a long time with all the same problems because that
is exactly what's going on here.
So, good generalization, but possibly less Pit of Success, especially with complex box patterns. So watch the sharp edges.
Today we made a couple of minor changes in the code generation to take care of some lingering issues.
The first is that when you did a throw
inside a catch
to basically rethrow the error, you would lose
the error code if something had succeeded within the catch handler.
The old codegen looked something like this:
catch_start_1: {
printf("error\n");
  cql_best_error(&_rc_);
goto cql_cleanup;
}
The problem being that while the printf
above is fine and well, if you did any SQL operation then _rc_
would be
clobbered and you'd end up throwing an unrelated error code. cql_best_error
would at least make sure it was
a failure code (SQLITE_ERROR
) but the original error code was lost.
The new code looks like this:
catch_start_1: {
_rc_thrown_ = _rc_;
printf("error\n");
_rc_ = cql_best_error(_rc_thrown_);
goto cql_cleanup;
}
So now if there are db operations, the original return code is still preserved. Note: you still lose sqlite3_errmsg()
because
SQLite doesn't know that cleanup logic is running.
This brings us to the second new thing: general purpose error traces.
Error checking of result codes happens very consistently in CQL output. The usual pattern looks something like this:
_rc_ = cql_exec(_db_,
"SAVEPOINT base_proc_savepoint");
if (_rc_ != SQLITE_OK) goto cql_cleanup;
or if it's inside a try block a little different... very little actually
// try
{
_rc_ = cql_exec(_db_,
"RELEASE SAVEPOINT base_proc_savepoint");
if (_rc_ != SQLITE_OK) goto catch_start_8;
// ... the rest of the try block
}
Basically if the local _rc_
doesn't match the necessary condition we goto
the appropriate error label... either the relevant
catch block or else the procedure's cleanup code.
We generalize this a bit now so that it looks like this:
if (_rc_ != SQLITE_OK) { cql_error_trace(); goto cql_cleanup; }
// or, in a catch...
if (_rc_ != SQLITE_OK) { cql_error_trace(); goto catch_start_8; }
Now the default implementation of cql_error_trace()
is in cqlrt.h
which you can and should customize. I'll be writing more
about that later but suffice to say you're supposed to replace cqlrt.h
and cqlrt.c
with suitable runtime helpers for your environment
while keeping cqlrt_common.h
and cqlrt_common.c
fixed.
So for instance, your cqlrt.h
could look like this:
#ifndef CQL_TRACING_ENABLED
#define cql_error_trace()
#else
// whatever tracing you want, for example this might help in test code.
#define cql_error_trace() \
fprintf(stderr, "Error at %s:%d in %s: %d %s\n", __FILE__, __LINE__, _PROC_, _rc_, sqlite3_errmsg(_db_))
#endif
So then when you need to debug problems involving lots of error recovery you can watch the entire chain of events easily.
Note that there are some useful variables there:
In any procedure _db_
is the current database and _rc_
is the most recent return code from SQLite. __FILE__
and __LINE__
of course come from the preprocessor, and _PROC_
(one underscore) is now generated by the compiler. Every procedure's
body now begins with:
#undef _PROC_
#define _PROC_ "the_current_procedure"
So by defining your own cql_error_trace macro you can cause whatever logging you need to happen. Note this can be very expensive indeed because this happens a lot and even the string literals needed are a significant cost. So generally this should be off for production builds and enabled as needed for debug builds.
The default implementation is just an empty block
#define cql_error_trace()
But the hook is enough to light up whatever logging you might need, and you can use sqlite3_errmsg()
before that message is gone.
Good hunting.
Following up on the last blog entry, I thought it would be useful to present a simple error tracing macro that you can use to see what kind of error flow is going on when you're having trouble understanding why a procedure is returning an error code. The idea is we want to create a macro that we can use like this:
BEGIN_VERBOSE_STDERR_TRACING;
-- Some procedure(s) that you want to trace
END_VERBOSE_STDERR_TRACING;
We can do that with something like the below macros. These particular ones cause the output to go to stderr
via fprintf
but if that isn't what you need you can simply edit the macro. The macros look like this:
-- manually force tracing on by redefining the cql_error_trace macro
#define BEGIN_VERBOSE_STDERR_TRACING \
@echo c, "#undef cql_error_trace\n"; \
@echo c, "#define cql_error_trace() fprintf(stderr, \"CQL Trace at %s:%d in %s: %d %s\\n\", __FILE__, __LINE__, _PROC_, _rc_, sqlite3_errmsg(_db_))\n"
#define END_VERBOSE_STDERR_TRACING \
@echo c, "#undef cql_error_trace\n"; \
@echo c, "#define cql_error_trace()\n"
So basically it's telling CQL to emit a #define
into its output stream. In this case:
#define cql_error_trace() fprintf(stderr, "CQL Trace at %s:%d in %s: %d %s\n", __FILE__, __LINE__, _PROC_, _rc_, sqlite3_errmsg(_db_))
You could change that to any function you like, you can have it dump the errors where you like, or you can make it some dummy function you add so that you can set a breakpoint on it.
Whatever you do, do not leave your code with this sort of tracing enabled -- it's far too expensive in terms of code size. But it's perfect if you have this one procedure that is failing and it's hard for you to see where.
Obviously if you're making a custom trace thingy you don't need the macro at all, you can just emit your own #define
with @echo
as needed.
Note: @echo
is quite a sledgehammer so don't use it lightly and not in production code but it is quite helpful for this sort of thing. CQL tests often use it to help make things visible to the tests. If you use @echo
in weird ways you might not get working code when the codegen changes in the future.
The relevant state that is available to you inside a macro like this is:
- `__FILE__` the current filename (comes from the C pre-processor; this is the .c file name, not the .sql)
- `__LINE__` the current line number (comes from the C pre-processor; this is the .c line number)
- `_rc_` the current SQLite result code (always the current return code in every CQL procedure that uses SQLite)
- `_db_` the current SQLite database pointer (always the current database in every CQL procedure that uses SQLite)
- `_PROC_` the current procedure name (CQL has a `#define` for this for you)
Everyone knows the usual expression syntax x LIKE y
to do a string match. But the CQL compiler also uses
LIKE
in a different way that's powerful and important. CQL has the notion of data shapes and you use
LIKE
to refer to them. The simplest source of a data shape, and maybe the most common, is a table.
Maybe something like this:
create table T(
id integer not null,
name text not null,
age integer not null
);
Now suppose you want to write a procedure that can insert a row into that table. You could write:
create proc insert_into_T(
id_ integer not null,
name_ text not null,
age_ integer not null
)
begin
insert into T(id, name, age) values(id_, name_, age_);
end;
This is all fine and well but what if T had 50 columns? That gets old fast. And how can you be sure that you inserted the columns into T in the right order? This second example also compiles even though it's clearly wrong:
insert into T(id, name, age) values(age_, name_, id_);
And of course you can imagine things get only more complicated with more columns in T.
We started adding the LIKE
form to ease these issues and to ensure some consistency in the APIs while preventing
simple transposition errors. So you can instead write:
create proc insert_into_T(like T)
begin
insert into T from arguments;
end;
so here the like T
in the argument list simply means "make arguments that are the same as the columns of table T" -- well,
almost. It also adds an _
to the end of each name, so you end up with exactly the same declaration as the long form above.
But you won't miss any arguments, and they'll be in the right order.
And notice that we used from arguments
to indicate that we wanted the values to come from the arguments in order. Again
this saves you from a lot of typing and a lot of error checking. You can't get the arguments in the wrong order.
These are the most basic patterns. But there are quite a few more.
Let's suppose you want to write a procedure that returns the row with the highest age in the above. Maybe you write something like this:
create proc highest_age()
begin
declare C cursor for select * from T;
declare M cursor like C;
loop fetch C
begin
if (not M or M.age < C.age) then
fetch M from C;
end if;
end;
out M;
end;
Here we made a cursor M
that is the same as the cursor C
and then we are going to generate a single row result
from the cursor. Note that if you use a cursor name like M
in an expression it refers to the hidden boolean
that says if the cursor has a row in it or not. So M
begins empty and we will load it if it's empty or if the age
is higher than what we've already got.
Let's show a few more of the forms. Suppose we don't want to return name
, just the id
and the age
. We can
change things up a tiny bit.
create proc highest_age()
begin
declare C cursor for select * from T;
declare M cursor like select 1 id, 99 age;
loop fetch C
begin
if (not M or M.age < C.age) then
fetch M from cursor C(like M);
end if;
end;
out M;
end;
So two things to notice. We used an ad hoc shape, making a fake select
statement that returns the shape we want. This
select doesn't run but it does define types and columns easily. Two not null integers in this case. Now M
is not the
same as C
so we can't use the simplest form fetch M from C
we have to use the more general form.
Fully expanded, what we wrote becomes:
FETCH M(id, age) FROM VALUES(C.id, C.age);
But as you can see, we didn't have to type all those column names. And that's kind of the point of the LIKE
construct.
So we've covered a bunch of the shape sources already:
- a table name
- a cursor name
- a select statement that gives the shape in an ad hoc fashion
There are three more:
- a view name
- the return shape of a procedure that returns a result set
- the arguments of a procedure
View names are pretty simple, and they work the same as table names so we don't need to discuss those. Let's look at some of the other uses with procedures.
Suppose we have a procedure that can return a result set shape but we want to be able to mock its results so we can fake whatever result we need for testing.
We'll complicate this a bit adding a new table (keeping short table names for the sample to save typing)
create table U(
id integer not null,
email text not null
);
And here's a procedure:
create proc my_proc()
begin
select T.*, U.email from T inner join U on T.id = U.id;
end;
Now we want to be able to make any fake result we want, so maybe we want a temp table. No problem:
create proc _init_fake_results()
begin
create temp table if not exists fake_results(
like my_proc
);
end;
create proc add_fake_result(like fake_results)
begin
insert into fake_results from arguments;
end;
create proc get_fake_results()
begin
select * from fake_results;
end;
The above is very generic and will maintain well. You can see we made a temp table that will have
exactly the same shape as whatever my_proc
returns. In this case it becomes:
CREATE PROC _init_fake_results ()
BEGIN
CREATE TEMP TABLE IF NOT EXISTS fake_results(
id INTEGER NOT NULL,
name TEXT NOT NULL,
age INTEGER NOT NULL,
email TEXT NOT NULL
);
END;
And the rest are patterns we've seen before.
The last source of shapes are procedure arguments. There's lots of good cases for those, I wrote an entry on those previously but I'll give a simple example here too.
Suppose we have this weird procedure:
create proc delete_stuff(age_ integer, name_ text)
begin
if age_ is not null then
delete from T where T.age = age_;
end if;
if name_ is not null then
delete from T where T.name = name_;
end if;
end;
What if we wanted to log any errors that happen here? Maybe make a version that logs. We can do it like this:
create proc delete_and_log(like delete_stuff arguments)
begin
begin try
call delete_stuff(from arguments);
end try;
begin catch
call printf("delete failed\n"); -- or whatever
throw;
end catch;
end;
The nice thing about this logging wrapper procedure is that if delete_stuff
changes, the wrapper will change with it.
That covers all of the shape sources and as we saw we can use them to create things like cursors, tables, and argument lists.
We can use them to specify a subset of columns that might be of interest when fetching or updating cursors. And we can use
them in one last way -- to restrict arguments to a particular shape. Let's see how that works by making the previous logger
a little different. Here we added an argument which tells if we should log. And that might look like it would
spoil the from arguments
part of the forwarding, but there is the final way to use LIKE
.
create proc delete_and_log2(log bool not null, like delete_stuff arguments)
begin
if log and age_ is not null then
call printf("deleting %d\n", age_); -- or whatever
end if;
if log and name_ is not null then
call printf("deleting %s\n", name_); -- or whatever
end if;
call delete_stuff(from arguments like delete_stuff arguments);
end;
So this form lets you use some of your arguments, the ones that match a certain shape. And as we saw in
the previous article you can also use from C
to pass arguments where C
is a cursor and in that case
you can also specify that arguments be matched by name from C like shape
. In both those cases the
formal parameter names of the called procedure are matched against the names of the shape and passed in
the order of the formals. So this is like "call by name", the fields of the cursor or the order of
arguments in the argument list might be different than the formals but you'll get the correct items
in the correct order regardless, because it matches by name.
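As a sketch of that by-name matching (reusing the hypothetical delete_stuff from above), notice that the cursor's field order doesn't have to match the procedure's formals:

```sql
-- delete_stuff(age_ integer, name_ text) as declared earlier
declare C cursor like delete_stuff arguments;
fetch C using
  'fred' name_,  -- name_ listed first here
  99 age_;       -- age_ second

-- the call still lines up age_ with age_ and name_ with name_,
-- because the matching is by name, not by position
call delete_stuff(from C);
```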
These forms can save you a lot of typing... and are excellent at avoiding errors and improving maintainability.
Where they appear in SQL statements, everything is expanded before it goes to SQLite so SQLite will see
normal syntax forms. Which is good because obviously SQLite doesn't know anything about this enhanced
LIKE
business.
In the examples above there were only one or two columns with very short names, but in real world code there can easily be dozens of columns with very long names. In those cases, these forms really shine.
There is an unfortunate pattern of hard coding constants in SQL which I think comes from the fact that there's not an especially good way to encode constants in SQL. Things are a little better in CG/SQL's CQL language because it's normal to run things through the pre-processor first so you can do things like:
#define BUSINESS_TYPE_RESTAURANT 1
#define BUSINESS_TYPE_LAUNDROMAT 2
Having done so, you could write:
insert into Business using
"Rico's Laundry" name,
BUSINESS_TYPE_LAUNDROMAT type;
-- by the time SQL sees this it becomes
insert into Business(name, type) values('Rico''s Laundry', 2);
And at least you don't have to see these loose '2' values all over. An especially unfortunate form is the below, in which the author is clearly crying for a symbol to use:
insert into Business using
"Rico's Laundry" name,
2 type; /* laundromat */
But if we use #define
the language knows nothing of the names and it can't help you manage them
or export them consistently or anything like that. I guess #define
is pretty useful in several
languages (C and C++) so you could maybe #include
the macros somehow but that doesn't seem
especially great. And if you need them in Java you're getting no help at all.
So to this world we add enumerated constants. This is a bit short of enumerated types as we'll see later. You can now write something like this:
declare enum business_type integer (
restaurant,
laundromat,
corner_store = 11+3 /* math added for demo purposes only */
);
After this:
select business_type.corner_store;
is the same as
select 14;
And that is exactly what SQLite will see, the literal 14.
What's going on here? There's just a few rules:
- the enumeration can be any numeric type (bool, integer, long integer, real)
- the values of the enumeration start at 1 (i.e. if there is no `= expression` the first item will be 1, not 0)
- if you don't specify a value, the next value is the previous value + 1
- if you do specify a value it can be any constant expression and it will be cast to the type of the enumeration (even if that is lossy)
- the enumeration can refer to previous values in itself with no qualification (`big = 100.0, medium = big/2, small = medium/2`)
- the enumeration can refer to previously defined enumerations as usual (`code = business_type.restaurant`)
- once the enumeration is defined you refer to its members in a fully qualified fashion, `enum_name.member_name`, elsewhere
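Putting a few of those rules together in one sketch (the `sizes` and `codes` enums are made up for illustration):

```sql
declare enum sizes real (
  big = 100.0,
  medium = big / 2,   -- refers to a previous value: 50.0
  small = medium / 2  -- 25.0
);

declare enum codes integer (
  first,                                     -- no value given, so 1
  second,                                    -- previous value + 1, so 2
  special = business_type.corner_store + 1   -- other enums are fair game: 15
);
```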
Why is this better than macros? Well for one thing the enum values can be checked at their declaration site, so if you have errors you will hear about them in a more reasonable place. But additionally since the structure is known to the compiler it can give you useful information in the outputs.
In the .h
files you get:
enum business_type {
business_type__restaurant = 1,
business_type__laundromat = 2,
business_type__corner_store = 14
};
In case of floating point values such as:
declare enum floating real (
one = 1.0,
two = 2.0,
e = 2.71828,
pi = 3.14159
);
You get:
// enum floating (floating point values)
#define floating__one 1.000000e+00
#define floating__two 2.000000e+00
#define floating__e 2.718280e+00
#define floating__pi 3.141590e+00
Which is unfortunately the best you can do since C has no floating point enums.
But in both cases the enums
section of the JSON has the name of the enums and their members and values ready to go.
With these values you can readily generate (with Mustache or something) the language interfaces of your choice. This
is a real help if you're trying to make helpers to call your CQL from say Java or something.
To do all this we needed to add some constant folding and general evaluation to the compiler. It's not much, just the normal numeric types and null. The supported operations include `+`, `-`, `*`, `/`, `%`, `|`, `&`, `<<`, `>>`, `~`, `and`, `or`, `not`, `==`, `<=`, `>=`, `!=`, `<`, `>`, the `cast` operator, and the `case` forms. These are enough to make a lot of very interesting expressions, all of which are evaluated at compile time.
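As an illustration of what such folding amounts to, here is a toy constant folder written in Python using its `ast` module. Python's operators stand in for CQL's here; this is in no way the compiler's implementation, just the same idea in miniature:

```python
import ast

# A toy constant folder in the spirit of what the CQL compiler does at
# compile time; only a subset of operators is handled.
def fold(expr: str):
    node = ast.parse(expr, mode="eval").body

    def ev(n):
        if isinstance(n, ast.Constant):
            return n.value
        if isinstance(n, ast.BinOp):
            l, r = ev(n.left), ev(n.right)
            ops = {ast.Add: lambda: l + r, ast.Sub: lambda: l - r,
                   ast.Mult: lambda: l * r, ast.Div: lambda: l / r,
                   ast.Mod: lambda: l % r, ast.BitOr: lambda: l | r,
                   ast.BitAnd: lambda: l & r, ast.LShift: lambda: l << r,
                   ast.RShift: lambda: l >> r}
            return ops[type(n.op)]()
        if isinstance(n, ast.UnaryOp) and isinstance(n.op, ast.Invert):
            return ~ev(n.operand)
        raise ValueError("not a foldable constant expression")

    return ev(node)

print(fold("(1 << 16) | 0xf"))  # 65551
```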
While the constant folding was added to allow for rich `enum` expressions, there is also the `const()` primitive in the language now, which can appear anywhere a literal could appear. This allows you to do things that were previously not allowed, such as:

```sql
create table something(
  x integer default const((1<<16)|0xf) /* again the math is just for illustration */
);
```
The `const` form is also very useful in macros:

```sql
#define SOMETHING const(12+3)
```

This form ensures that the constant will be evaluated at compile time. `const` can also nest, so you can build these kinds of macros from other macros, or you can build enums this way. Anywhere you might need literals, you can use `const`.
Importantly, no enumerated data types were added to the language to do any of this. The values can help you to achieve some correctness by avoiding transcription mistakes but there is no additional type-safety provided here. Indeed given the rich mix between these types in SQLite, and with SQLite having no knowledge of enumerations at all it would be tricky to do a complete job. Still, this might happen in the future.
But for now, declaring constants that are really an intimate part of your schema is now possible, and the addition of the constants to the `.h` files and the `.json` output should hopefully make these generally useful. At least we might see less of the hard-coded constant business, with good values baked right into the schema declarations.
Happy Holidays.
There are many cases where stored procedures require complex arguments using data shapes well known to higher level languages or that come from the schema. There is already some affordance for this sort of thing in the form of this kind of pattern:
(I'll continue to use this simple example as I discuss the generalization below)
```sql
create table Person (
  id text primary key,
  name text not null,
  address text not null,
  birthday real
);
```

Then maybe something like this:

```sql
create proc insert_person(like Person)
begin
  insert into Person from arguments;
end;
```
The above expands into:
```sql
create proc insert_person(
  id_ text not null,
  name_ text not null,
  address_ text not null,
  birthday_ real)
begin
  insert into Person(id, name, address, birthday)
    values(id_, name_, address_, birthday_);
end;
```
And I think we can all agree the sugared version is a lot easier to reason about and much less prone to errors as well.
Those features have been in the language for a long time and that's all fine and well, but it isn't general enough to handle the usual mix of situations. For instance, what if you need a procedure that works with two people? A hypothetical `insert_two_people` procedure cannot be written with the old form. This is where argument bundles come in.
The idea here is to name the bundle which provides useful reference. To wit:
```sql
create proc insert_two_people(p1 like Person, p2 like Person)
begin
  call insert_person(from p1);
  call insert_person(from p2);
end;
```

or alternatively

```sql
create proc insert_two_people(p1 like Person, p2 like Person)
begin
  insert into Person from p1;
  insert into Person from p2;
end;
```
So what's going on here? Well, there are lots of reasons to keep the API to procedures simple, and adding general purpose structured types would be at odds with that. It would require lots of knowledge about C structure layout and whatnot, and trying to call from Java would require very complex JNI for any such procedure. So we avoid all that and keep simple arguments. The above expands into:
```sql
create proc insert_two_people(
  p1_id text not null,
  p1_name text not null,
  p1_address text not null,
  p1_birthday real,
  p2_id text not null,
  p2_name text not null,
  p2_address text not null,
  p2_birthday real)
begin
  insert into Person(id, name, address, birthday)
    values(p1_id, p1_name, p1_address, p1_birthday);
  insert into Person(id, name, address, birthday)
    values(p2_id, p2_name, p2_address, p2_birthday);
end;
```
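The rewrite is mechanical: each member of each shape becomes one formal parameter with a naming convention. A toy model of that expansion in Python (the shape table and helper are purely illustrative, not compiler code):

```python
# Toy model of CQL's `like <shape>` parameter expansion.
# The Person shape mirrors the table above; the trailing "_" and the
# bundle prefix match the naming conventions shown in the examples.
PERSON_SHAPE = [
    ("id", "text not null"),
    ("name", "text not null"),
    ("address", "text not null"),
    ("birthday", "real"),
]

def expand_like(shape, prefix=""):
    # prefix is empty for a plain `like T`, or "p1_" etc. for a named bundle
    suffix = "" if prefix else "_"
    return [(f"{prefix}{col}{suffix}", type_) for col, type_ in shape]

print(expand_like(PERSON_SHAPE))          # id_, name_, address_, birthday_
print(expand_like(PERSON_SHAPE, "p1_"))   # p1_id, p1_name, p1_address, p1_birthday
```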
Of course the types don't have to be the same; you can create and name shapes of your choice. The language allows you to use an argument bundle in all the places that a cursor was previously a valid source. That includes `insert`, `fetch`, `update cursor`, and procedure calls. You can refer to the arguments by their expanded name `p1_address`, or alternatively `p1.address` means the same thing.
Here's another example showing a silly but illustrative thing you could do:
```sql
create proc insert_lotsa_people(P like Person)
begin
  declare C cursor like P;
  fetch C from P;
  declare i integer not null;
  set i := 0;
  while (i < 20)
  begin
    update cursor C using
      printf("id_%d", i) id;
    insert into Person from C;
    set i := i + 1;
  end;
end;
```
The above shows that you can use a bundle as the source of a shape elsewhere, and that you can use a bundle as a source of data to load a cursor. After which you can do all the usual value cursor things like `out` statements and so forth.
In order to call procedures with argument bundles more readily from other languages, the JSON output now includes additional information about where procedure arguments originated. The field with this information is creatively called "argOrigin" and it has 3 forms.
- "arg_name" -> the argument is not an expansion of anything
- "T arg_name" -> the argument came from `like T`
  - there will be one arg for each member of T
  - the formal argument name for this arg will be arg_name_
  - if T is procedure arguments (`like p1 arguments`) then you'll get "p1[arguments] arg_name"
- "name T arg_name" -> the argument came from `name like T` (a named bundle)
  - there will be one arg for each member of T
  - the formal argument name for this arg will be T_arg_name
  - T could be procedure arguments as above
- if the source of an argument was a cursor or argument bundle name, you get instead that thing's shape source name
  - this is always better because cursor names and bundle names are not globally unique
- if the cursor had an anonymous source (e.g. `like select 1 x`) then you get the useless shape name "select"
  - this is an indicator that you should make some ad hoc struct for this procedure because there is no useful name for the arg bundle's type
None of this matters unless you're trying to make wrappers for a CQL procedure for some other language and you'd like to have your wrapper deal with structs rather than all loose arguments. The JSON basically tells you the structs.
Interestingly, argument bundles resulted in a significant reduction of code in the compiler. The argument bundle name has to be usable in the contexts where a cursor was previously usable; it is another source of shaped data. Getting that to work proved to be super simple, as the two forms look almost identical to the compiler -- no coincidence there. So very little code was required to make `from [cursor_name]` work with `from [any_shape_name]` in the half dozen or so places that this construct is allowed (e.g. procedure call arguments, insert statements, etc.). However, there was as much code associated with `from arguments` as there was with `from cursor_name`, and the code was nearly identical.
When argument bundles were introduced, the natural thing to do was to create an artificial bundle called `arguments` which represents the bundle that is ALL the arguments. With that done, all the code for `from arguments` could be deleted because `arguments` itself was a valid shape name. Hence `insert into T from arguments` "just works", and so half the rewrites were deleted. The only cost was that the form `from arguments like shape` became the cursor form `from arguments(like shape)`, which only adds mandatory parens to a form that was largely unused anyway (there were two cases in our entire codebase). The cursor form is more general, as you can do `from C(like A, like B)` to get the fields that match `A` then those that match `B`. Arguments get this for free as well (well, at the cost of parens).
So overall, this feature was added, and the compiler got smaller and cleaner. Only the test suite had to grow.
Stay safe out there.
Language support for virtual tables has lagged since I always thought they were of little interest to the language anyway. The `CREATE TABLE` forms in general are only declarations (except if you're doing the schema installer/upgrader output) and so you could just declare, say, a temp table that corresponds to the virtual table that you made, in the same way that you might declare `sqlite_master` if you wanted to use it. And since you have to register the module anyway, you may as well create the virtual table at the same time. So there was no point in adding language support for the thing.
Furthermore, the `CREATE VIRTUAL TABLE` form includes no information about the schema of the table, so you'd need some kind of declaration anyway in order to tell the language what the columns are for the table you just created. So again you may as well just declare it like a normal table, not include that table in your schema upgrade file, and be done with it.
And that was my thinking for the last 2 years. And then I learned something.
Virtual tables are durable.
Yeah, I always assumed that virtual tables were temp tables and they vanish and have to be redeclared every
session but that is not the case. They are part of the durable schema so while you must pre-register the
module associated with the virtual table, the virtual table is like other tables in that you only create it
once and from then on it's part of the schema every time the database loads until you DROP
it.
This changes everything.
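You can see the durability directly with SQLite from Python, assuming your SQLite build includes the FTS5 module (most builds do, including CPython's bundled one):

```python
import sqlite3, tempfile, os

# Demonstrating durability: a virtual table survives reopening the
# database file, just like an ordinary table.
path = os.path.join(tempfile.mkdtemp(), "demo.db")

db = sqlite3.connect(path)
db.execute("CREATE VIRTUAL TABLE notes USING fts5(body)")
db.close()

# New connection, new session: the virtual table is still in the schema.
db = sqlite3.connect(path)
row = db.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'notes'").fetchone()
print(row[0])  # the original CREATE VIRTUAL TABLE statement
db.close()
```

The module (`fts5` here) still has to be available in every session that opens the database, but the table definition itself is part of the durable schema.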
With virtual tables being durable they belong in the schema upgrade process. And if they go there they also have to go into the JSON output. But we can't use the vanilla syntax that SQLite uses because that syntax is:
- not parseable, because the module arguments can be literally anything (or nothing), even a letter to your grandma.
- the arguments do not necessarily say anything about the table's schema at all
So in the CQL language we changed the syntax a bit; the generalized form looks like this:
```sql
create virtual table virt_table using my_module [(module arguments)] as (
  id integer not null,
  name text
);
```
The part after the `AS` is used by CQL as a table declaration for the virtual table. The grammar for that is exactly the same as a normal `CREATE TABLE` statement. However, that part is not transmitted to SQLite; when the table is created, SQLite sees only the part it cares about, the part before the `AS`.
Now this leaves the module arguments, which can be one of three things:

- no arguments at all
- a list of identifiers, constants, and parenthesized sublists, just like in the `@attribute` form
- the words `arguments following`
```sql
create virtual table virt_table using my_module as (
  id integer not null,
  name text
);
```

becomes (to SQLite)

```sql
CREATE VIRTUAL TABLE virt_table USING my_module;
```
Note: empty arguments `USING my_module()` are not allowed per the SQLite docs but do seem to work in SQLite. We take the position that no args should be done with no parens, at least for now.
```sql
create virtual table virt_table using my_module(foo, 'goo', (1.5, (bar, baz))) as (
  id integer not null,
  name text
);
```

becomes (to SQLite)

```sql
CREATE VIRTUAL TABLE virt_table USING my_module(foo, "goo", (1.5, (bar, baz)));
```
This form allows for very flexible arguments but not totally arbitrary arguments, so it can still be parsed and validated.
This case recognizes the popular choice that the arguments are often the actual schema declaration for the table in question. So
```sql
create virtual table virt_table using my_module(arguments following) as (
  id integer not null,
  name text
);
```

becomes

```sql
CREATE VIRTUAL TABLE virt_table USING my_module(
  id INTEGER NOT NULL,
  name TEXT
);
```
The normalized text (keywords capitalized, whitespace normalized) of the table declaration in the `as` clause is used as the arguments.
Virtual tables go into their own section in the JSON and they include the `module` and `moduleArgs` entries; they are additionally marked `isVirtual` in case you want to use the same processing code for virtual tables as for normal tables. The JSON format is otherwise the same, although some things can't happen in virtual tables (e.g. there is no `TEMP` option so `"isTemp"` must be false in the JSON).
For purposes of schema processing, virtual tables are on the `@recreate` plan, just like indices, triggers, etc. This is the only option since the `alter table` form is not allowed on a virtual table.
Semantic validation enforces "no alter statements on virtual tables" as well as other things like, no indices, and no triggers, since SQLite does not support any of those things.
Finally, because virtual tables are on the `@recreate` plan, you may not have foreign keys that reference virtual tables. Such keys seem like a bad idea in any case.
A common source of errors in stored procedures is incorrect typing in arguments. For instance, a particular key for an entity might need to be `LONG` or even always `LONG NOT NULL` or `LONG NOT NULL @SENSITIVE`, and the only way to do this in the past was maybe with some `#define` thing. Otherwise you have to diligently get the type right in all the places, and should it ever change, again you have to visit all the places. To help with this situation, and to make code a little more self-describing, we add named types to the language. This is a lot like `typedef` in the C language. They do not create different incompatible types, but they do let you name things well.
You can now write these sorts of forms:
```sql
declare foo_id type long not null;

create table foo(
  id foo_id primary key autoincrement,
  name text
);

create proc inserter(name_ text, out id foo_id)
begin
  insert into foo(id, name) values(NULL, name_);
  set id := last_insert_rowid();
end;
```
Refer to the railroad diagram for the grammar details.
Additionally any enumerated type can be used as a type name. e.g.
```sql
declare enum thing integer (
  thing1,
  thing2
);

declare x thing;
```
Enumerations always get "not null" in addition to their base type.
This isn't a very complex feature but we hope that it will help create clearer code that is less likely to have type-related bugs.
Further adding to the type calculus of the CQL language we introduced the ability to encode the "kind" of primitive types. This can be used in a number of ways -- like "units" for natural things and like a "type" for synthetic keys and other such. It's easier to illustrate by example.
```sql
declare job_id type long<job_id>;
declare person_id type long<person_id>;

declare j job_id;
declare p person_id;

set p := j; -- this is an error
```
With the above in place, other expressions like `p == j` would also produce errors, as these `long` values are no longer type compatible. This is a great way to add enforcement to your schema and procedures. Likewise you can use these annotations to add "units" to your data types. e.g.
```sql
declare meters type real<meters>;
declare grams type real<grams>;

declare m meters;
declare g grams;
```
Variables of type `grams` (e.g. `g`) are not compatible with variables of type `meters` (e.g. `m`) even though both are `real`.
Likewise, attempting to insert `grams` into a column that is typed to `meters` will give errors. Of course SQLite doesn't know about any of this, so all the `<>` stuff is removed in the generated SQL. This is just about type enforcement at compile time.
Enumerations like:
```sql
declare enum surface integer (paper, canvas);
declare enum writer integer (pen, paper, brush);
```

enable this:

```sql
declare s surface; -- s is now of type integer<surface>
declare w writer;  -- w is now of type integer<writer>

set s := surface.paper; -- ok
set s := writer.pen;    -- error
set w := writer.brush;  -- ok

case when s == w then 1 else 0 end; -- error (w/s not comparable)

set w := s; -- error again
```
additionally in DML/DDL:
```sql
create table draw_action(
  w writer,
  s surface
);

insert into draw_action values(w, s); -- ok
insert into draw_action values(s, w); -- error!
```
So the type kinds can be quite helpful when dealing with loose variables.
The notion of specific types was added to the language nearly two years ago to support the `object` type, because there was a great desire to prevent `object<dictionary>` being assigned from `object<list>`. But this "type kind", whether it's with units (e.g. "meters", "grams") or a type name (e.g. "job_id"), adds a lot of high value type checking.
The kind can be added, stripped, or changed with a `cast` operation, and the type system allows a constant or variable with no kind (e.g. "1") to mix and match with any kind so long as the base type is compatible as usual. So you get the most value by using the specific type consistently, but you won't go insane adding test cases that use constants, for instance.
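The compatibility rule itself is small enough to sketch. Here is a toy Python model of the rule as described above (not the compiler's code):

```python
from typing import NamedTuple, Optional

# A toy version of CQL's "type kind" compatibility rule: two values are
# compatible when their base types match and their kinds match, or when
# either side has no kind at all (like a bare literal).
class CqlType(NamedTuple):
    base: str            # "real", "long", ...
    kind: Optional[str]  # "meters", "grams", or None for a plain literal

def compatible(a: CqlType, b: CqlType) -> bool:
    if a.base != b.base:
        return False
    return a.kind is None or b.kind is None or a.kind == b.kind

meters  = CqlType("real", "meters")
grams   = CqlType("real", "grams")
literal = CqlType("real", None)

print(compatible(meters, grams))    # False: same base, different kinds
print(compatible(meters, literal))  # True: a bare constant mixes freely
```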
As of this writing the expression kinds are checked for compatibility everywhere plus or minus bugs. There are extensive tests.
Important change in CQL semantics.

Previously, if you did an early return from, or fell through the end of, a procedure that is supposed to return a result set but did not in fact provide one, you would get a fake SQLITE_ERROR. Now you get an empty result set for "free".

This interpretation seems much more natural and avoids a lot of really annoying stub selects to comply with the contract.

This also works for the `out` statement in the same fashion.

If you want to return an error, use `throw`. This is a lot more natural...
Examples:

```sql
-- this gives you an empty result set if x <= 0
create proc maybe_return(x integer)
begin
  if x > 0 then
    select * from foo where foo.y > x;
  end if;
end;

-- so does this
create proc maybe_return(x integer)
begin
  if x <= 0 then
    return;
  end if;
  select * from foo where foo.y > x;
end;

-- so does this
create proc maybe_out(x integer)
begin
  if x <= 0 then
    declare C cursor for select etc.
    out C;
  end if;
end;
```
The nested select statement is frequently misused, in particular if you get no rows back from the expression that's an error. So for instance:
```sql
set x_ := (select x from foo.x where id = y);
```

This will throw (with a peculiar error, `SQLITE_DONE`) if there is no such row. Sometimes people try to fix this problem with a null check:

```sql
set x_ := IFNULL((select x from foo.x where id = y), -1);
```
That doesn't help at all. It's not a null value situation, there's no row at all.
```sql
set x_ := (select IFNULL(x, -1) from foo.x where id = y);
```

is likewise unhelpful. To help with this situation we add two forms:

```sql
-- useful if foo.x is already known to be not null
set x_ := (select x from foo.x where id = y IF NOTHING -1);

-- useful if foo.x might be null
set x_ := (select x from foo.x where id = y IF NOTHING OR NULL -1);
```
Both of these deal with the case where there is no row. The second lets you have a simple default for both no row or null value. That form is equivalent to:

```sql
set x_ := (select IFNULL(x, -1) from foo.x where id = y IF NOTHING -1);
```

i.e. both problem cases are handled. Of course the -1 here could be any valid expression, even a second `(select ...)`.
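The underlying SQLite behavior is easy to demonstrate from Python's `sqlite3`: a missing row is not the same thing as a row containing NULL, and IFNULL only ever helps with the latter:

```python
import sqlite3

# "No row at all" vs "a row whose value is NULL" -- IFNULL only sees the latter.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE foo(id INTEGER, x INTEGER)")
db.execute("INSERT INTO foo VALUES (1, NULL)")

# A row exists but the value is NULL: IFNULL can supply a default.
print(db.execute("SELECT IFNULL(x, -1) FROM foo WHERE id = 1").fetchone())  # (-1,)

# No row matches: there is nothing for IFNULL to rewrite. The statement
# simply produces zero rows -- what the stepping code sees as SQLITE_DONE
# on the first step, hence the need for IF NOTHING.
print(db.execute("SELECT IFNULL(x, -1) FROM foo WHERE id = 9").fetchone())  # None
```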
We've long needed a way to see the most recent SQLite result code in the context of, say, a `catch` block (most other times you can assume SQLITE_OK was the last result code, otherwise control flow would have transferred elsewhere; sometimes SQLITE_ROW or SQLITE_DONE might be the current result code).
Soon we'll provide a sample header that declares the most common error codes in an enum but for now you could do something like this:
```sql
-- pasted from sqlite.c
#define SQLITE_BUSY 5 /* The database file is locked */

-- this is a contrived example
create proc get_first_foo(out can_retry bool not null)
begin
  -- can_retry is set to 0 automatically, language semantics guarantee this
  begin try
    select foo from bar limit 1;
  end try;
  begin catch
    set can_retry := (@rc == SQLITE_BUSY);
    throw; -- rethrow the original error
  end catch;
end;
```
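For a feel of when `SQLITE_BUSY` actually happens, here is a small Python `sqlite3` demonstration that provokes it with two connections; Python surfaces result code 5 as an exception rather than a numeric `@rc`:

```python
import sqlite3, tempfile, os

# Provoke SQLITE_BUSY: one connection holds an exclusive write transaction
# while another tries to start one. This mirrors the can_retry logic above.
path = os.path.join(tempfile.mkdtemp(), "busy.db")

writer = sqlite3.connect(path, isolation_level=None)  # autocommit mode
writer.execute("CREATE TABLE bar(foo INTEGER)")
writer.execute("BEGIN EXCLUSIVE")  # take and hold the write lock

reader = sqlite3.connect(path, timeout=0, isolation_level=None)  # fail fast
can_retry = False
try:
    reader.execute("BEGIN EXCLUSIVE")
except sqlite3.OperationalError as e:
    # "database is locked" is how Python reports SQLITE_BUSY (code 5)
    can_retry = "locked" in str(e)

print(can_retry)  # True
```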
Shared fragments are a real game-changer for CQL.
Remember, these are designed to let you write part of a query and then substitute in parameters. So it's like a parameterized view in normal SQL terms. But actually it's more powerful than that, fragments also provide features that are more like Java generics. Let's do some examples.
Suppose we have a procedure which looks something like this:
```sql
CREATE PROC get_stuff(to_include_ text, to_exclude_ text)
BEGIN
  WITH
  to_exclude_recursive_query (tok, rest) AS (
    SELECT
      '',
      to_exclude_ || ','
    UNION ALL
    SELECT
      substr(rest, 1, instr(rest, ',') - 1),
      substr(rest, instr(rest, ',') + 1)
    FROM to_exclude_recursive_query
    WHERE rest <> ''
  ),
  to_exclude (id) AS (
    SELECT CAST(tok AS LONG)
    FROM to_exclude_recursive_query
    WHERE tok <> ''
  ),
  to_include_recursive_query (tok, rest) AS (
    SELECT
      '',
      to_include_ || ','
    UNION ALL
    SELECT
      substr(rest, 1, instr(rest, ',') - 1),
      substr(rest, instr(rest, ',') + 1)
    FROM to_include_recursive_query
    WHERE rest <> ''
  ),
  to_include (id) AS (
    SELECT CAST(tok AS LONG)
    FROM to_include_recursive_query
    WHERE tok <> ''
  )
  SELECT * from stuff S
  WHERE
    S.id in (select * from to_include) AND
    S.id not in (select * from to_exclude);
END;
```
With shared fragments you could write something like this:
```sql
@attribute(cql:shared_fragment)
CREATE PROC split_commas(str text)
BEGIN
  WITH splitter(tok, rest) AS (
    SELECT '', IFNULL(str || ',', '')
    UNION ALL
    SELECT
      substr(rest, 1, instr(rest, ',') - 1),
      substr(rest, instr(rest, ',') + 1)
    FROM splitter
    WHERE rest <> '')
  SELECT tok from splitter where tok <> '';
END;
```
```sql
@attribute(cql:shared_fragment)
CREATE PROC ids_from_string(str text)
BEGIN
  WITH toks(tok) AS (CALL split_commas(str))
  SELECT CAST(tok AS LONG) AS id from toks;
END;
```
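The recursive CTE at the heart of these fragments is plain SQLite, so you can exercise the logic directly, for example from Python's `sqlite3` (with `WITH RECURSIVE` spelled out and the fragment's `str` parameter bound as a query parameter):

```python
import sqlite3

# Run the split_commas / ids_from_string logic as a standalone query.
db = sqlite3.connect(":memory:")
rows = db.execute("""
    WITH RECURSIVE splitter(tok, rest) AS (
      SELECT '', IFNULL(:str || ',', '')
      UNION ALL
      SELECT
        substr(rest, 1, instr(rest, ',') - 1),
        substr(rest, instr(rest, ',') + 1)
      FROM splitter
      WHERE rest <> ''
    )
    SELECT CAST(tok AS INTEGER) FROM splitter WHERE tok <> ''
""", {"str": "10,20,30"}).fetchall()

print(rows)  # [(10,), (20,), (30,)]
```

This is exactly the kind of standalone testing the fragment design is meant to enable: you can explore the edge cases (empty string, trailing comma, NULL input) without the surrounding procedure.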
We now have a shared fragment called `split_commas` which can live anywhere, maybe in a standard include file. There are some immediate benefits:
- the fragment is compiled on its own before usage, so any errors are reported in the fragment
  - in contrast, with macros you get errors when you try to use the macro, and they are all charged to the line the macro appears on, so it's hopeless figuring out what's wrong
- the text of the shared fragment will be the same, so it can be re-used in all locations; this can be a big binary size savings
  - in contrast, macros are pre-processed before CQL ever sees the text, so it doesn't "know" it's the same code
- fragments compose cleanly, as we'll see, and they have typed arguments
- fragments can be independently tested outside of the context in which they appear
  - make a test context and explore the fragment; no worries about it breaking on edge cases later
The first fragment, `split_commas`, does exactly what it sounds like: it takes a string argument and makes a list of the strings in it. The second fragment uses the first to split a string and then converts all the strings to long integers.
Now instead of the above we could write:
```sql
#include <stringsplit.sql> /* wherever you put the fragments */

CREATE PROC get_stuff(to_include_ text, to_exclude_ text)
BEGIN
  WITH
    to_include(id) AS (CALL ids_from_string(to_include_)),
    to_exclude(id) AS (CALL ids_from_string(to_exclude_))
  SELECT * from stuff S
  WHERE
    S.id in (select * from to_include) AND
    S.id not in (select * from to_exclude);
END;
```
And of course since `ids_from_string` lives somewhere shared (`stringsplit.sql`), these fragments can be used all over your code and you'll only pay for the text one time. This gives you great flexibility, very much like parameterized views. You can have any number of these fragments; they will share code, they compose like crazy, and there is no schema cost!
A series of useful fragments for generating data would go a long way but there are other applications
of fragments and you might want to operate on various data sources without hard coding them all. This is
where the generic form of fragments comes in. Consider a case where you want to be able to filter stuff
by say name and age. You could create this fragment:
```sql
@attribute(cql:shared_fragment)
CREATE PROC filter_stuff(
  pattern_ text not null,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  WITH
    source(*) LIKE stuff
  SELECT * from source S
  WHERE
    S.name LIKE pattern_ AND
    S.age BETWEEN min_age_ and max_age_;
END;
```
Now imagine that we had added the shared fragment annotation to `get_stuff` (just like the above). We could then write the following:
```sql
CREATE PROC the_right_stuff(
  to_include_ text,
  to_exclude_ text,
  pattern_ text not null,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  WITH
    get_stuff(*) AS (call get_stuff(to_include_, to_exclude_)),
    filter_stuff(*) AS (call filter_stuff(pattern_, min_age_, max_age_)
                        using get_stuff as source)
  SELECT * from filter_stuff S
  ORDER BY name
  LIMIT 5;
END;
```
Or with some sugar to forward arguments and assume the CTE name matches, more economically:
```sql
CREATE PROC the_right_stuff(
  to_include_ text,
  to_exclude_ text,
  pattern_ text not null,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  WITH
    (call get_stuff(*)),
    (call filter_stuff(*) using get_stuff as source)
  SELECT * from filter_stuff S
  ORDER BY name
  LIMIT 5;
END;
```
The arg syntax `(*)` simply indicates that the arg names in the caller should match to the same names in the callee. In general `call foo(*)` expands to `call foo(from arguments like foo arguments)`; `*` is rather more economical than that.
In this example `filter_stuff` doesn't know where its data will be coming from; you bind its table parameter `source` to a compatible data source of your choice. For example, this would also be legal:
```sql
CREATE PROC almost_the_right_stuff(
  pattern_ text not null,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  WITH
    (call filter_stuff(*) using stuff as source)
  SELECT * from filter_stuff S
  ORDER BY name
  LIMIT 5;
END;
```
It's often desirable to have some options in the generated SQL without having to fork your entire query. Shared fragments address this as well with the conditional form. In this form the top level of the fragment is an `IF` statement and there are a number of alternatives. Here are some simple modifications to the above that illustrate some of the possibilities.
```sql
@attribute(cql:shared_fragment)
CREATE PROC filter_stuff(
  pattern_ text,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  IF pattern_ IS NOT NULL THEN
    WITH
      source(*) LIKE stuff
    SELECT * from source S
    WHERE
      S.name LIKE pattern_ AND
      S.age BETWEEN min_age_ and max_age_;
  ELSE
    WITH
      source(*) LIKE stuff
    SELECT * from source S
    WHERE
      S.age BETWEEN min_age_ and max_age_;
  END IF;
END;
```
In the above, if the input pattern is NULL then it is not considered; it won't be part of the generated SQL at all. Note that `source` (same name) appears in both branches and therefore must be the same type, as it will be fulfilled by one actual table parameter.
Now the above could have been achieved with something like this:

```sql
pattern_ IS NULL OR S.name LIKE pattern_
```

But that would have no useful selectivity. In general, though, you might be able to avoid joins and so forth with your constraints. Consider something like this hypothetical:
```sql
@attribute(cql:shared_fragment)
CREATE PROC filter_stuff(
  pattern_ text,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  IF pattern_ IS NOT NULL THEN
    WITH
      source(*) LIKE stuff
    SELECT DISTINCT S.* from source S
    INNER JOIN keywords K
    WHERE
      K.keyword LIKE pattern_ AND
      S.age BETWEEN min_age_ and max_age_;
  ELSE
    WITH
      source(*) LIKE stuff
    SELECT * from source S
    WHERE
      S.age BETWEEN min_age_ and max_age_;
  END IF;
END;
```
Here we save the DISTINCT and the JOIN if there is no pattern, which might be important. Of course there are probably better ways to match keywords, but this is just an illustration of what's possible.

There are numerous ways this flexibility can be used. Again, this is a simple example; a real schema transform would be more complex.
```sql
@attribute(cql:shared_fragment)
CREATE PROC get_stuff(
  to_include_ text,
  to_exclude_ text,
  schema_v2 bool not null)
BEGIN
  IF schema_v2 THEN
    WITH
      to_include(id) AS (CALL ids_from_string(to_include_)),
      to_exclude(id) AS (CALL ids_from_string(to_exclude_))
    SELECT * from stuff_2 S
    WHERE
      S.id in (select * from to_include) AND
      S.id not in (select * from to_exclude);
  ELSE
    WITH
      to_include(id) AS (CALL ids_from_string(to_include_)),
      to_exclude(id) AS (CALL ids_from_string(to_exclude_))
    SELECT * from stuff S
    WHERE
      S.id in (select * from to_include) AND
      S.id not in (select * from to_exclude);
  END IF;
END;
```
All of this requires a bunch of checking, at least this:
- the LIKE forms can only appear in a shared fragment
- the CALL forms must refer to shared fragments
- the CALL args must be compatible
- the number and type of the provided tables in USING must be correct
- the shared fragment must be a single select statement or an IF statement with an ELSE
- the statement lists of the IF/ELSE combo must all be single select statements
- all the choices in the IF block must return the same shape (this is normal for procedures)
- the shared fragment can't have any out arguments
- the provided fragment arguments cannot themselves use the nested SELECT construct
I think this is a total game changer for SQL authoring and should go a long way to making it easier to get your work done on SQLite. A good base set of shared fragments as part of any suite of procedures seems like a good idea.
There are more details in the section on shared fragments in Chapter 14 of The Guide.
These features are in the current build as of today (12/14/2021).
Happy Holidays and stay safe.
One of the biggest changes to CQL in 2021 was the addition of control flow analysis. Given an understanding of how execution can flow within a user's program, CQL can do things like infer when a nullable variable must contain a nonnull value and improve its type appropriately, or issue an error when a nonnull variable may be used before it has been initialized.
As of mid-2021, and with increasing sophistication throughout the remainder of the year, CQL has been able to infer that a variable of a nullable type must not be `NULL` within a portion of a user's program:
```sql
DECLARE PROC another_proc(t0 TEXT NOT NULL, t1 TEXT NOT NULL);

CREATE PROC some_proc(t0 TEXT, t1 TEXT)
BEGIN
  IF t0 IS NULL RETURN;
  -- `t0` must be nonnull here if we made it this far

  IF t1 IS NOT NULL THEN
    -- `t0` and `t1` are nonnull here
    CALL another_proc(t0, t1);
  ELSE
    -- `t0` is nonnull here
    CALL another_proc(t0, "default");
  END IF;
END;
```
The ability of the CQL compiler to infer non-nullability greatly reduces the need to use the functions `ifnull_crash` and `ifnull_throw` to coerce values to a nonnull type—functions that, if they are ever used incorrectly, usually result in programs misbehaving.
For a detailed description and many additional examples of what is possible—CQL can handle much more than what is shown above—see the user guide's section on nullability improvements.
In CQL, it is possible to declare a variable of a nonnull type without giving it a value. If the variable is of a non-reference type, it is assigned a default value of `0`. If the variable is of a reference type (`BLOB`, `OBJECT`, or `TEXT`), however, it is simply set to `NULL` despite the nonnull type, as no default value exists.
To help prevent accessing a reference variable of a nonnull type and getting back `NULL`, CQL recently began enforcing that such variables are initialized before use. The following code, therefore, now results in an error:
```sql
DECLARE t TEXT NOT NULL;
CALL requires_text_notnull(t); -- error!
```
Using the same engine for control flow analysis that is behind nullability improvements, CQL can improve a variable to be initialized:
```sql
DECLARE t TEXT NOT NULL;

IF some_condition THEN
  SET t := "some example text";
  -- `t` is initialized here
ELSE
  THROW;
END IF;

-- `t` must be initialized here if we made it this far
CALL requires_text_notnull(t); -- okay!
```
Thanks to CQL's ability to understand the control flow of users' programs, the above example works just fine.
CQL now also enforces that all procedures with `OUT` parameters of a nonnull reference type properly initialize said parameters before they return:
```sql
CREATE PROC some_proc(b BOOL NOT NULL, OUT t TEXT NOT NULL)
BEGIN
  IF b THEN
    SET t := another_proc(t);
    -- `t` is initialized here
  ELSE
    SET t := yet_another_proc(t);
    -- `t` is initialized here
  END IF;
  -- `t` must be initialized here because all possible
  -- branches initialized it, so `some_proc` is okay!
END;
```
As with nullability improvements, understanding the nuances of what will be considered initialized is easier if one has a sense for how control flow analysis works in the compiler.
To develop an intuition for how control flow analysis works in CQL, let's begin by taking a look at the following example:
DECLARE PROC p1(OUT t TEXT NOT NULL);
DECLARE PROC p2(i INTEGER NOT NULL, OUT t TEXT NOT NULL);
CREATE PROC p0(b BOOL, i INTEGER, OUT t TEXT NOT NULL)
BEGIN
IF i IS NULL THEN
IF b THEN
CALL p1(t);
ELSE
SET t := "";
END IF;
RETURN;
END IF;
IF i == 0 THEN
SET t := "";
ELSE IF i > 0 THEN
SET t := p2(i);
ELSE
THROW;
END IF;
END;
There are a couple of things we must verify in order to ensure the code is type-safe:
- With regard to the parameters of `p0`: Since `t` is an `OUT` parameter of type `TEXT NOT NULL`, `p0` must always assign it a value before it returns. If it does not, a caller of `p0` may end up with a variable of a `NOT NULL` type that actually contains `NULL`.
- With regard to the calling of `p2` in `p0`: Since `p2` requires a first argument of type `INTEGER NOT NULL`, some sort of check must be performed to ensure that `i` is not `NULL` before `p2(i)` is executed.
If we carefully study `p0`, we can determine that both of the above conditions are satisfied. Making this determination, however, is not exactly trivial, and real-world code is often significantly more complicated than this—and it evolves over time. For these reasons, having a compiler that can make such determinations automatically is critical; most modern production compilers perform these sorts of checks.
The easiest way to understand how CQL does its job is to take the above example line-by-line. This is not exactly how CQL works under the hood, but it should provide an intuitive sense of how control flow analysis works in the compiler:
==> CREATE PROC p0(b BOOL, i INTEGER, OUT t TEXT NOT NULL)
BEGIN
...
END;
Right away, CQL can see that `t` is declared both `OUT` and `TEXT NOT NULL` and thus requires initialization before `p0` returns. CQL can, therefore, add a fact about what it is analyzing to its previously empty set of facts:

- `t` requires initialization.
We can then continue:
==> IF i IS NULL THEN
...
END IF;
Here, the compiler notices that we're at an `IF` statement. In CQL, `IF` statements contain one or more branches, and the compiler considers every `IF` to be the start of a branch group. The same line also indicates the condition for the first branch: `i IS NULL`. CQL can update its set of facts:

- `t` requires initialization.
- In branch group:
  - In branch when `i IS NULL`:
It then proceeds to the next line:
IF i IS NULL THEN
==> IF b THEN
CALL p1(t);
ELSE
SET t := "";
END IF;
RETURN;
END IF;
Another branch group and branch:

- `t` requires initialization.
- In branch group:
  - In branch when `i IS NULL`:
    - In branch group:
      - In branch when `b`:
Continuing:
IF i IS NULL THEN
IF b THEN
==> CALL p1(t);
ELSE
SET t := "";
END IF;
RETURN;
END IF;
Since `p1` takes an `OUT` argument of type `TEXT NOT NULL`, this call initializes `t`, and so CQL can update its set of facts once again:

- `t` requires initialization.
- In branch group:
  - In branch when `i IS NULL`:
    - In branch group:
      - In branch when `b`:
        - `t` is initialized.
Jumping ahead a couple of lines:
IF i IS NULL THEN
IF b THEN
CALL p1(t);
ELSE
==> SET t := "";
END IF;
RETURN;
END IF;
At this point, we're in another branch. We also have yet another fact to add because `t` is initialized here as well due to the `SET`:

- `t` requires initialization.
- In branch group:
  - In branch when `i IS NULL`:
    - In branch group:
      - In branch when `b`:
        - `t` is initialized.
      - In ELSE branch:
        - `t` is initialized.
Moving ahead one more line, things get a bit more interesting:
IF i IS NULL THEN
IF b THEN
CALL p1(t);
ELSE
SET t := "";
==> END IF;
RETURN;
END IF;
Here, we're at the end of an `IF`, and thus the end of a branch group. Whenever CQL reaches the end of a branch group, it merges the effects of all of its branches.

One very important thing to note here is that the current branch group has an `ELSE` branch, and so the set of branches covers all possible cases. That means if something is initialized in every branch within the branch group, we can consider it to be initialized after the branch group has ended: Initialization will always occur. This allows CQL to simplify its set of facts as follows as it leaves the branch group:

- `t` requires initialization.
- In branch group:
  - In branch when `i IS NULL`:
    - `t` is initialized.
Stepping forward one line again, we reach a `RETURN`:
IF i IS NULL THEN
...
==> RETURN;
END IF;
We're now at a point where we can exit the procedure. CQL will, therefore, verify that if something requires initialization, it has been initialized. Since we have both the facts "`t` requires initialization" and "`t` is initialized", all is well!

The fact that the current branch returns early is added to the set of facts:

- `t` requires initialization.
- In branch group:
  - In branch when `i IS NULL`:
    - `t` is initialized.
    - Returns.
Moving ahead one more line, we reach the end of another branch and branch group, and again something interesting happens:
...
IF i IS NULL THEN
...
==> END IF;
Upon ending the branch group, we know that the branch group has exactly one branch, that the branch is entered only when `i IS NULL`, and that the branch returns. What that tells CQL is that, if execution is going to continue after the branch group, its sole branch must not have been taken, and so CQL knows the opposite of its condition for entry will be true from this point onward:

- `t` requires initialization.
- `i` is not null.
The next `IF` is rather similar to what we've seen already in its structure, so we can jump ahead several lines to the next point of interest:
IF i == 0 THEN
SET t := "";
ELSE IF i > 0 THEN
==> SET t := p2(i);
ELSE
THROW;
END IF;
Before we analyze the above-indicated line, we have the following set of facts:
- `t` requires initialization.
- `i` is not null.
- In branch group:
  - In branch when `i == 0`:
    - `t` is initialized.
  - In branch when `i > 0`:
In the call `p2(i)`, we know that `i` was declared to have type `INTEGER` and that `p2` requires an `INTEGER NOT NULL`, but we also have the fact "`i` is not null". For this reason, we can consider `p2(i)` to be a valid call. We can also add the fact that `t` is initialized to our current set of facts:

- ...
  - In branch when `i > 0`:
    - `t` is initialized.
NOTE: When it comes to code generation, it is not so simple as to say `p2(i)` is valid and proceed as usual. That's because `p2` expects an argument of type `INTEGER NOT NULL`, but we merely have a value of type `INTEGER` that we happen to know cannot be null: `INTEGER NOT NULL` and `INTEGER` do not share the same underlying representation, and so we cannot pass the declared-nullable variable `i` directly to `p2`. To solve this problem, CQL rewrites the expression such that `p2(i)` becomes `p2(cql_inferred_notnull(i))`, where `cql_inferred_notnull` is an internal-only function that handles the nullable-to-nonnull representational conversion for us. This explains its presence in the following examples.
Jumping ahead again, we encounter a `THROW`:
IF i == 0 THEN
SET t := "";
ELSE IF i > 0 THEN
SET t := p2(cql_inferred_notnull(i));
ELSE
==> THROW;
END IF;
The fact that the branch will throw is added to the current set of facts:
- `t` requires initialization.
- `i` is not null.
- In branch group:
  - In branch when `i == 0`:
    - `t` is initialized.
  - In branch when `i > 0`:
    - `t` is initialized.
  - In ELSE branch:
    - Throws.
We then proceed to the end of the `IF`:
IF i == 0 THEN
SET t := "";
ELSE IF i > 0 THEN
SET t := p2(cql_inferred_notnull(i));
ELSE
THROW;
==> END IF;
Once again, CQL merges the effects of the branches in the branch group to finish the analysis of the `IF`. Since it can see that `t` was initialized in all branches except the one that throws, and since the branches cover all possible cases, the set of facts is simplified as follows given the knowledge that, if `THROW` was not encountered, `t` must have been initialized:

- `t` requires initialization.
- `i` is not null.
- `t` is initialized.
Moving ahead one final time, we encounter the end of the procedure:
CREATE PROC p0(b BOOL, i INTEGER, OUT t TEXT NOT NULL)
BEGIN
...
==> END;
The only thing left to do at this point is to validate that anything requiring initialization has been initialized. Since we have both "`t` requires initialization" and "`t` is initialized", everything is in order.
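The branch-group merge at the heart of this walkthrough boils down to: ignore branches that exit early, intersect the initialization facts of the remaining branches, and trust the result only when the branches cover all possible cases. The Python sketch below models just that rule; it is purely illustrative, with invented names, and is not how the compiler is actually implemented.

```python
# Illustrative model of CQL's branch-group merge rule (invented names; not
# the actual compiler implementation).

from dataclasses import dataclass, field

@dataclass
class Branch:
    initialized: set = field(default_factory=set)  # variables this branch initializes
    exits: bool = False                            # ends in RETURN or THROW

def merge_branch_group(branches, covers_all_cases):
    """Variables known to be initialized after the whole branch group."""
    if not covers_all_cases:
        # Some case falls through without entering any branch, so nothing
        # initialized inside the group can be relied upon afterwards.
        return set()
    live = [b.initialized for b in branches if not b.exits]
    if not live:
        # Every branch exits early; code after the group is unreachable.
        return set()
    result = set(live[0])
    for s in live[1:]:
        result &= s  # must be initialized on every path that falls through
    return result

# The final IF in `p0`: two branches SET t, the ELSE throws.
after = merge_branch_group(
    [Branch({"t"}), Branch({"t"}), Branch(exits=True)],
    covers_all_cases=True)
assert after == {"t"}  # `t` is initialized after the group
```

Note how a branch that throws or returns drops out of the intersection entirely, which is exactly why the `ELSE THROW` branch above does not prevent `t` from being considered initialized.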
As a recently generalized piece of functionality within the CQL compiler, control flow analysis will soon be used to enforce additional properties of users' programs. In particular, CQL will be able to ensure that cursors are always fetched before they're used and that cursors are always checked to have a row before their fields are accessed.
Hopefully you now understand the fundamentals of control flow analysis in CQL and the benefits it brings to your programs. Best wishes for 2022!
One of the signature features of the CQL language is the ability to use the "LIKE" form to
slice out columns that conform to a shape. This notion appears in many places in the language.
For instance, if I have a table `Foo`, I can make a cursor for that shape like so:
declare C cursor like Foo;
Which says I want the columns of `C` to be like the columns of `Foo`.
If I have a cursor `D` that has the `Foo` columns, but maybe more and maybe in a different order, I can load `C` as follows:
fetch C from D(like Foo)
Which again saves me from having to list all of the (potentially dozens of) `Foo` columns. This construct appears in many places:
declare proc P(like Foo)
begin
insert into Foo from arguments;
end;
even
declare proc P(f like Foo, b like Bar)
begin
insert into Foo from f;
insert into Bar from b;
end;
And other examples... This is discussed more fully in Chapter 5 of the Guide.
However, one of the few places where shapes are interesting but not supported was the select list. And so, just a couple of days ago, we added the `COLUMNS` construct to the language, which allows a sugared syntax for extracting columns in bulk. It's kind of a generalization of the `select T.*` pattern, but with CQL-style slicing and type-checking.
These forms are supported:
- columns from a join table or tables
-- same as A.*
select columns(A) from ...;
-- same as A.*, B.*
select columns(A, B) from ...;
- columns from a particular join table that match a shape
-- the columns of A that match the shape Foo
select columns(A like Foo) from ...;
-- get the Foo shape from A and the Far shape from B
select columns(A like Foo, B like Bar) from ...;
- columns from any join table that match a shape
-- get the Foo shape from anywhere in the join
select columns(like Foo) from ...;
-- get the Foo and Bar shapes, from anywhere in the join
select columns(like Foo, like Bar) from ...;
- specific columns
-- x and y columns plus the foo shape
select columns(T1.x, T2.y, like Foo) from ...;
- distinct columns from the above (not distinct values!)
-- removes duplicate column names
-- e.g. there will be one copy of 'pk'
select columns(distinct A, B) from A join B using(pk);
-- if both Foo and Bar have an (e.g.) 'id' field you only get one copy
select columns(distinct like Foo, like Bar) from ...;
-- if a specific column is mentioned it is always included
-- but later clauses that are not a specific column will avoid it
-- if F or B has an x it won't appear again, just T.x
select columns(distinct T.x, F like Foo, B like Bar) from F, B ..;
Of course this is all just sugar, so it all ends up being a column list with table qualifications -- but the syntax is very powerful. For instance, it's great for narrowing a wide table, or for fusing joins that share common keys:
-- just the Foo columns
select columns(like Foo) from Superset_Of_Foo_From_Many_Joins_Even;
-- only one copy of 'pk'
select columns(distinct A,B,C) from
A join B using (pk) join C using (pk);
And of course you can define shapes however you like and then use them to slice off column chunks of your choice. There are many ways to build up shapes from other shapes. Probably the easiest is to declare procedures that return the shape you want and never actually create them. E.g.
declare proc shape1() (x integer, y real, z text);
declare proc shape2() (like shape1, u bool, v bool);
With this combination you can easily define common column shapes and slice them out of complex queries without having to type the columns names over and over...
Note that the `COLUMNS(...)` form is not a general replacement for the select list. For instance, general expressions are not allowed inside of `COLUMNS(...)`, but where extraction of lots of columns is needed, or even re-ordering of columns, it's a very good option indeed, and it composes well with the other select features.
This was the last significant area where shapes are useful but totally absent.
Following on the heels of shared fragments, we're introducing the same kind of thing for shared fragments that are expressions rather than tables. The syntax is as follows:
-- this isn't very exciting because regular max would do the job
@attribute(cql:shared_fragment)
create proc max_func(x integer, y integer)
begin
select case when x >= y then x else y end;
end;
The above can be used in the context of a SQL statement like so:
select max_func(T1.column1, T1.column2) the_max from foo T1;
The consequence of the above is that the body of `max_func` is inlined into the generated SQL. However, like the other shared fragments, this is done in such a way that the text can be shared between instances, so you only pay for the cost of the text* in your program one time, no matter how many times you use it.

* You still pay for the cost of a pointer to the text.
In particular, for the above, the compiler will generate the following SQL:
select (
select case when x >= y then x else y end
from (select T1.column1 x, column2 y))
But each line will be its own string literal, so, more accurately, it will concatenate the following three strings:
"select (", // string1
" select case when x >= y then x else y end", // string2
" from (select T1.column1 x, column2 y))" // string3
Importantly, `string2` is fixed for any given fragment. The only thing that changes is `string3`, i.e., the arguments.
The C compiler, and then the linker, will unify the `string2` literal across all translation units, so you only pay for the cost of that text one time. It also means that the text of the arguments appears exactly one time, no matter how complex they are. For these benefits, we pay the cost of the select wrapper on the arguments. This cost is frequently negative. Consider the following:
select max_func((select max(T.m) from T), (select max(U.m) from U))
A direct expansion of the above would result in something like this:
case when (select max(T.m) from T) >= (select max(U.m) from U)
then (select max(T.m) from T)
else (select max(U.m) from U)
end;
The above could be accomplished with a simple pre-processor macro, but the fragments code generates the following:
select (
select case when x >= y then x else y end
from (select (select max(T.m) from T) x, (select max(U.m) from U) y))
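To make the sharing mechanics concrete, here is a hypothetical Python sketch (not the actual CQL code generator) of how a call site might be expanded into string pieces. The point is that the fragment body piece is the same literal at every call site, which is what lets the C compiler and linker store the text once:

```python
# Hypothetical sketch of expression-fragment expansion (not the real CQL
# code generator). Each call site reuses the same body string; only the
# argument wrapper text differs per call site.

MAX_FUNC_BODY = " select case when x >= y then x else y end"  # shared piece

def expand_max_func(arg_x: str, arg_y: str) -> list:
    """String pieces that get concatenated for one call site."""
    return [
        "select (",                                  # fixed
        MAX_FUNC_BODY,                               # fixed and shared
        f" from (select {arg_x} x, {arg_y} y))",     # varies per call site
    ]

site1 = expand_max_func("T1.column1", "T1.column2")
site2 = expand_max_func("(select max(T.m) from T)", "(select max(U.m) from U)")

# The body piece is literally the same string object at both call sites,
# which is the sharing the C compiler and linker exploit.
assert site1[1] is site2[1]
```

However many call sites there are, the argument text appears once per site while the body text is pooled.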
Expression fragments can nest, so you could write:
@attribute(cql:shared_fragment)
create proc max3_func(x integer, y integer, z integer)
begin
select max_func(x, max_func(y, z));
end;
Again, this particular example is a waste because regular `max` would already do the job.
To give another example, common mappings from one kind of code to another using case/when can be written and shared this way:
-- this sort of thing happens all the time
@attribute(cql:shared_fragment)
create proc remap(x integer not null)
begin
select case x
when 1 then 1001
when 2 then 1057
when 3 then 2010
when 4 then 2011
else 9999
end;
end;
In the following:
select remap(T1.c), remap(T2.d), remap(T3.e) from C, D, E;
The text for `remap` will appear three times in the generated SQL query but only one time in your binary.
Restrictions:
- the function must consist of exactly one simple select statement
  - no `FROM`, `WHERE`, `HAVING`, etc. -- the result is an expression
- the select list must have exactly one value
  - Note: the expression can be a nested `SELECT` which could have all the usual `SELECT` elements
- the usual shared fragment rules apply, e.g. no out-parameters, exactly one statement, etc.
FAQ:
Q: Why does the expression fragment have a `select` in it?

A: Expression fragments are only interesting in SQL contexts where normal procedure and function calls are not available. The `select` keyword makes it clear to the author and the compiler that the expression will be evaluated by SQLite, and the rules for what is allowed to go in the expression are the SQLite rules.
Q: Why no `FROM` clause?

A: We're trying to produce an expression, not a table-value with one column. If you want a table-value with one column, the original shared fragments solution already does exactly that. This gives you a solution for sharing code in, say, the `WHERE` clause or the select list.
Q: Isn't this just the same as doing, say, `#define max_func(x,y) case when (x) >= (y) then x else y end;`?
A: Macros can give you a ton of flexibility, but they have many problems:
- if the macro has an error, you see the error in the call site with really bad diagnostic info
- the compiler doesn't know that the sharing is going on so it won't be able to share text between call sites
- the arguments can be evaluated many times each which could be expensive, bloaty, or wrong
- there is no type-checking of arguments to the macro so you may or may not get compilation errors after expansion
- you have to deal with all the usual pre-processor hazards
In general, macros can be used for meta-programming (as in C and C++), but that doesn't mean it's a good idea.
This new feature is a pretty simple generalization of the `FROM` construct as applied to expression lists. Note this isn't the same as using `FROM` the usual way in a `select` statement. An example will clear this right up.
Suppose you wanted to create a procedure that inserts a row into a table. You could write this:
create table Shape_xy (x int, y int);
create proc insert_xy(like Shape_xy)
begin
insert into Shape_xy from arguments;
end;
Here we're using `from` to introduce some shape of values. It can appear in a lot of places.
Suppose now I want to insert two of those shapes. I could write this slightly more complicated procedure:
create proc insert_two_xy(xy1 like Shape_xy, xy2 like Shape_xy)
begin
call insert_xy(from xy1);
call insert_xy(from xy2);
end;
This also composes with cursors, so maybe you need to get two `xy` values from diverse locations. You can mix and match.
create proc write_xy()
begin
declare C cursor for select T.x, T.y from somewhere T;
fetch C;
declare D cursor for select T.x, T.y from somewhere_else T;
fetch D;
if C and D then
-- strange combos for illustration only
call insert_two_xy(from C, from D);
call insert_two_xy(from D, 5, 3);
call insert_two_xy(4, 2, from C);
call insert_two_xy(4, from C, 8);
end if;
end;
So, as you can see, we can start from data in one or more cursors and we can turn that data, plus other expressions, into arguments, composing them as we like. This gives you the ability to call procedures and functions using shapes from a mixed set of sources. None of this is new.
However, the other places where expression lists happen -- `fetch`, `update cursor`, and `insert` -- only allowed you to specify a single object as the input source, such as `insert into Shape_xy from C`.

With a little work, this is trivially generalized so that all value lists can use the `from` construct.
Here's a complete example showing all the new forms.
create table Shape_xy (x int, y int);
create table Shape_uv (u text, v text);
create table Shape_uvxy (like Shape_xy, like Shape_uv);
create proc ShapeTrix()
begin
declare C cursor for select Shape_xy.*, '1' u, '2' v from Shape_xy;
fetch C;
-- This new form is equivalent to the old form:
-- insert into Shape_xy from C(like Shape_xy)
-- but the values(...) form generalizes, see below.
insert into Shape_xy values(from C like Shape_xy);
declare D cursor for select * from Shape_uv;
fetch D;
declare R cursor like Shape_uvxy;
-- This form works just like the function call case
-- that was previously supported (it uses the same code even).
-- This form lets you load R from any combination of sources
-- as long as you make a suitable row.
fetch R from values (from C like Shape_xy, from D);
-- Same thing is supported in update cursor
-- the x, y come from C and the u,v come from D.x, D.y.
-- Note that C.u and C.v would not even be type compatible.
update cursor R from values (from C like Shape_xy, from D);
-- And in a select-values clause
declare S cursor for
with cte(l,m,n,o) as (values (from C like Shape_xy, from D))
select * from cte;
fetch S;
insert into Shape_uvxy from S;
end;
As you can see, you can choose a subset of the `from` shape using `like`.
These combinations let you flexibly assemble rows of data for cursors, calls, and insertions, using any combination of data sources you might want, without resorting to listing every column by hand.
The general idea here is that you might want to store composite data in a single column in the database. This is a common way to get more generic schema, the idea being that you can have one or more blob columns that store a lot of data that doesn't have to be indexed. You could store it in other ways, like a JSON blob or some such, but we'll be using blobs as the basis for storage here -- hence the name "blob storage".
In SQL/CQL, the main way you define structures, especially those that you want to maintain, is with tables. Hence we introduce this
@attribute(cql:blob_storage)
create table news_info(
who text,
what text,
when_ long -- timestamp of some kind
);
The `blob_storage` attribute indicates that the thing we're about to define here is not really going to be a materialized table. As a result, you will not be able to (e.g.) `DROP` the table or `SELECT` from it, and there will be no schema upgrade for it should you request one. However, the usual schema rules still apply, which help you to create compatible versions of this structure. For instance, new columns can be added only at the end, and only if they are nullable.
Here we add `source` to the schema in a hypothetical "version 6". Note that schema versions move forward globally in the schema, not locally in one table; this implies there are versions 1-5 elsewhere, not shown.
@attribute(cql:blob_storage)
create table news_info(
who text,
what text,
when_ long, -- timestamp of some kind
source text @create(6)
);
Additionally, since the storage is not backed by SQL with SQL's constraint system, default values and constraints are not allowed in a table marked with `cql:blob_storage`; it's just data. Similarly, triggers, views, and indices may not use the "table".
Naturally, blob storage goes in a blob field, but recall CQL has discriminated types so we could make something like this:
create table info(
id long primary key,
news_info blob<news_info>
);
From a SQL perspective, `news_info` is just a blob. That means if you want to do a `WHERE` clause or something like that on the info, you're out of luck. Maybe you could write a user-defined function to crack the blob and so forth, but really this isn't the point. If you're using this feature then, by construction, you don't need to index on this data. It's simply not suitable for use at all if you need field-at-a-time access within SQL.
The natural place that CQL stores structures is in value cursors, so the most natural thing to do is to provide a variation of the `SET` statement that lets you load a blob from a cursor, like so:
create proc make_blob(like news_info, out result blob<news_info>)
begin
declare c cursor like news_info;
fetch c from arguments;
set result from cursor c;
END;
This declares a cursor, loads it from argument values, and converts it to a blob. Of course all of the usual cursor building forms can be used to power your blob creation, you just do one serialization at the end. The above is assembling a blob from arguments but you could equally make the blob from data.
create proc get_news_info(id_ long not null, out result blob<news_info>)
begin
-- use our columns sugar syntax for getting just news_info columns from
-- a table with potentially lots of stuff (or an error if it's missing columns)
declare c cursor for
select columns(like news_info) from some_source_of_info where info.id = id_;
fetch c;
set result from cursor c;
END;
There are many cursor fetch forms, including dummy data forms and other interesting bits of sugar. You can fetch a cursor from arguments, from other cursors, and even combinations. We want all of that to work for blobs as well without adding tons of new syntax and code generation. The obvious way to accomplish that is for cursors to be the source of blobs.
Again, the normal way that you work with records in CQL is by creating suitable cursors. These can be economically accessed on a field-by-field basis. What we need is a way to easily recreate a cursor from the blob so we can read the data values. This gives rise to this form:
let b := (select news_info from info where id = id_ if nothing null);
declare c cursor like b;
fetch c from b; -- note this can fail
-- now use c.who, c.what, etc.
Data loaded in a cursor is very economical to access on a field-by-field basis, and, since the deserialization of the blob happens all at once, that is also economical. Importantly, we cannot assume that the blob is well formed, it could be coming from anywhere. For secure-code reasons we must assume it is hostile. Hence the decoding validates the shape, internal lengths, and so forth.
If we had instead started with something like this:
let b := (select news_info from info where id = id_ if nothing null);
Then maybe we might like to write:
if b.who == 'U2' then ... end if;
However, this sort of thing would be very uneconomical. For one thing, the blob does not have fixed-offset fields: It is carrying all the serialized data for the string fields and so forth. Each "dot" operation would be costly and, furthermore, each "dot" operation could fail if the blob is badly formed. Having to deal with a `b.who` that might fail seems very bad indeed.
Once you have the cursor, you can make new blobs with different combinations, slice the cursor fields using the `LIKE` operator, return the cursor with `OUT` or `OUT UNION`, or pass the blob fields as arguments to functions using the `FROM` forms. Cursors are already super flexible in terms of what you can do with their contents.
It's important that we allow the blobs to evolve over time, so each blob has to be self-describing. We also want to be able to throw an error if you use the wrong kind of blob when loading a cursor, so the blob has to contain the following:
- the number of columns in the blob data type when it was stored
- the type of each field is encoded as a single plain-text character
- the types are bool, int, long, (double) real, (string) text, blob;
- we use 'f' (flag) for bools, hence "fildsb"
- these are encoded with one letter each, upper case meaning 'not null' so the storage might be "LFss"
- the buffer begins with a null-terminated string that serves for both the count and the types
- Each nullable field may be present or null; 1 bit is used to store this fact. The bits are in an array of bytes that comes immediately after the type info (which implicitly tells us its size)
- Boolean values are likewise encoded as bits within the same array, so the total number of bits stored is nullables plus booleans (nullable booleans use 2 bits)
- If you are reading a newer version of a record from an older piece of data that is missing a column then the column is assumed to be NULL
- Any columns you add after the initial version (using @create) must be nullable; this is normal for adding columns to existing schema
- Integers and longs are stored in varint format after zigzag encoding
- Text is stored inline in null terminated strings (embedded nulls are not allowed in CQL text)
- Nested blobs are stored inline, with a length prefix encoded like any other int
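The integer encoding mentioned above, zigzag followed by varint, is a standard technique (it's the same scheme Protocol Buffers uses). The Python sketch below models that general encoding; it is illustrative only and is not the actual CQL runtime code:

```python
# Illustrative model of varint + zigzag integer encoding (not CQL's runtime).

def zigzag(n: int) -> int:
    # Interleave signed values so small magnitudes stay small:
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    return (n << 1) ^ (n >> 63)

def unzigzag(u: int) -> int:
    return (u >> 1) ^ -(u & 1)

def write_varint(u: int) -> bytes:
    # 7 bits per byte, least significant first; the high bit of each byte
    # means "more bytes follow".
    out = bytearray()
    while True:
        b = u & 0x7F
        u >>= 7
        if u:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def read_varint(buf: bytes, pos: int = 0):
    u = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        u |= (b & 0x7F) << shift
        if not (b & 0x80):
            return u, pos
        shift += 7

def encode_long(n: int) -> bytes:
    return write_varint(zigzag(n))

def decode_long(buf: bytes, pos: int = 0):
    u, pos = read_varint(buf, pos)
    return unzigzag(u), pos

# Small values, positive or negative, encode to a single byte.
assert encode_long(0) == b"\x00"
assert encode_long(-1) == b"\x01"
assert encode_long(1) == b"\x02"
```

The `zigzag` helper as written assumes values fit in 64 bits, matching the `long` column type; the payoff is that small negative numbers don't balloon to maximum-length varints.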
Well, this is a bit more advanced but in principle this could be done as well. To make it useful, we would want to make a new cursor type that can iterate over rows in a blob. The syntax leaves room for this, something like so:
declare c cursor for blob b;
loop fetch c
begin
-- the usual stuff
end;
This cursor would be another variation; it would keep its current index into the blob to read data out of it. Such a blob would also have to include a count of rows as part of its storage.
However, that's future looking. There is no such support at present.
With a fairly modest amount of work, we now support structured storage natively and have pretty rich language constructs. We carefully chose language constructs that lead to economical serialization and deserialization patterns and a record format that is versioned well, without resorting to something super loose like JSON.
As with many other features, it's possible to replace the (de)serialization with code of your choice by supplying your own runtime methods. So for instance, thrift encoding is possible; though it is more flexible than is strictly necessary for the few SQL data types, it might be convenient.
Storage types that are going to be persisted in the database or go over a wire-protocol should be managed like schema with the usual validation rules. On the other hand, formats that will be used only transiently in memory can be changed at whim from version to version. As mentioned above, the design specifically considers cases where a new client discovers an old-format blob (with fewer columns) and, the reverse, cases where an old client receives a datagram from a new client with too many columns.
A more complete example is included for reference.
@attribute(cql:blob_storage)
create table news_info(
who text,
what text,
when_ long -- timestamp of some kind
);
-- a place where the blob appears in storage
create table some_table(
x integer,
y integer,
news_blob blob<news_info>
);
-- a procedure that creates the blob from loose args
create proc make_blob(like news_info, out result blob<news_info>)
begin
declare c cursor like news_info;
fetch c from arguments;
set result from cursor c;
end;
-- a procedure that cracks the blob
create proc crack_blob(data blob<news_info>)
begin
declare c cursor like news_info;
fetch c from data;
out c;
end;
-- a procedure that cracks the blob into loose args if needed
-- the OUT statement was created specifically to allow you to avoid this sort of mass OUT awfulness
create proc crack_blob_to_vars(
data blob<news_info>,
out who text,
out what text,
out when_ long)
begin
declare c cursor like news_info;
fetch c from data;
set who := c.who;
set what := c.what;
set when_ := c.when_;
end;
-- this just defines a shape for the part we are keeping from the original structure
declare proc my_basic_columns() (
x int,
y int
);
-- this just defines a shape for the result we want
-- we're never actually defining this procedure
declare proc my_result_shape() (
like my_basic_columns,
like news_info
);
create proc select_and_crack(whatever_condition bool)
begin
declare c cursor for select * from some_table where whatever_condition;
loop fetch c
begin
-- crack the blob in c
declare n cursor like news_info;
fetch n from blob c.news_blob;
-- assemble the result we want from the parts we have
declare result cursor like my_result_shape;
fetch result from values (from c like my_basic_columns, from n);
-- emit one row
out union result;
end;
end;
Most production databases include some tables that are fairly generic; they use maybe a simple key-value combination to store some simple settings or something like that. In the course of feature development this kind of thing comes up pretty often, and in large client applications (like Messenger, but certainly not limited to Messenger) there are many small features that need a little bit of state. It's easy enough to model whatever state you need with a table or two, but this soon results in an explosion of tiny tables.

In some cases there are only a few rows of configuration data, and indeed the situation can be so bad that the text of the schema for the little state table is larger than the sum of all the data you will ever store there. This is a bit tragic because SQLite has initialization cost associated with each table. So these baby tables are really not paying for themselves at all.

What we'd like to do is use some kind of generic table as the backing store for many of these small tables while preserving type safety. The cost of access might be a bit higher, but since data volumes are expected to be low anyway this would be a good trade-off. And we can have as many as we like.

In some cases the state doesn't even need to be persisted, so we're talking about tables in an in-memory database. Here low cost of initialization is especially important. And lastly, if your product has dozens or even hundreds of small features like this, the likelihood that all of them are even used in a session is quite low, and so again, having a low fixed cost for the schema is a good thing. No need to create 100 in-memory tables on the off chance that they are needed.
See also the related feature: blob storage.
First you need a place to store the data, we define a backing table in the usual way. A simple backing table is just a key/value store and looks like this:
@ATTRIBUTE(cql:backing_table)
CREATE TABLE backing(
k BLOB PRIMARY KEY,
v BLOB NOT NULL
);
The backing_table attribute indicates that the table we're about to define is to be used for backing storage. At present it is significantly restricted: it has to have exactly two columns, both of which are blobs; one is the key and one is the value. It should be either baseline schema or annotated with @create, as it is expected to hold precious data. If it's an in-memory table the versioning is somewhat moot, but really the backing store is not supposed to change over time; that's the point. In future versions we expect to allow some number of additional physical columns which can be used by the backed tables (discussed below), but for now it's this simple pattern.
A backed table looks like this:
@ATTRIBUTE(cql:backed_by=backing)
CREATE TABLE backed(
id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
bias REAL
);
@ATTRIBUTE(cql:backed_by=backing)
CREATE TABLE backed2(
id INTEGER PRIMARY KEY,
name TEXT NOT NULL
);
The backed_by
attribute indicates that the table we're about to define is not really going to be its own table. As a result, you will not be
able to (e.g.) DROP
the table or CREATE INDEX
or CREATE TRIGGER
on it, and there will be no schema upgrade for it should you request one.
It may not contain constraints as there would be no way to enforce them. But as compensation for these restrictions it can be changed freely and
has no physical schema cost associated with it.
To understand how this works imagine that we had a view for each backed table which simply read the blobs out of the backing store and then extracted the backed columns using some blob extraction functions. This would work, but then we'd be trading view schema for table schema so the schema savings we're trying to achieve would go up in smoke.
We might seem to be stuck here, but CQL already has something very "view-like": the shared fragment structure. So instead of views, we automatically create a shared fragment just like the view we could have made. They look like this:
@ATTRIBUTE(cql:shared_fragment)
CREATE PROC _backed ()
BEGIN
SELECT
rowid,
cql_blob_get(T.k, backed.id) AS id,
cql_blob_get(T.v, backed.name) AS name,
cql_blob_get(T.v, backed.bias) AS bias
FROM backing AS T
WHERE cql_blob_get_type(T.k) = 2105552408096159860L;
END;
So some things to notice right away:
First, this fragment has the right shape, but it doesn't directly call blob extraction functions; rather, it uses cql_blob_get. The point of this is to make the actual blob functions configurable. The test suites include some very simple extraction functions for blobs with just integers in them, but you can create whatever blob format you want. You could use the blob storage feature for encoding, or you can encode things as you see fit. You can even have different encodings in different backed tables.
Second, there is a type code embedded in the procedure. The type code is a hash of the type name and the names and types of all the not-null fields in the backed table. The hash is arbitrary but repeatable, any system can compute the same hash and find the records they want without having to share headers. The actual hash is open source but it's just a SHA256 reduced to 64 bits with some name canonicalization. Shortly the JSON will also include the relevant hashes so you can easily consume them without even having to know the hash function.
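To make the idea concrete, here is a rough Python sketch of how such a type code could be computed. Only the general shape comes from the text above (SHA-256 over the type name and not-null fields, reduced to a signed 64-bit value); the exact canonicalization string below is invented for the example and differs from CQL's real rules.

```python
import hashlib
import struct

def type_hash(type_name, not_null_fields):
    # Canonical description of the type: its name plus the names and types
    # of the not-null fields.  This exact string format is invented for the
    # example; CQL's real canonicalization differs in detail.
    canonical = type_name + ":" + ",".join(
        f"{name} {kind}" for name, kind in not_null_fields)
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    # reduce SHA-256 to a signed 64-bit value, like the codes shown above
    (code,) = struct.unpack(">q", digest[:8])
    return code

# repeatable: any system computing the same canonical form gets the same code
assert type_hash("backed", [("id", "integer")]) == \
       type_hash("backed", [("id", "integer")])
# adding a not-null field (or renaming the type) changes the identity
assert type_hash("backed", [("id", "integer")]) != \
       type_hash("backed", [("id", "integer"), ("name", "text")])
```

The important property is repeatability without coordination: two programs that agree on the schema compute the same code without sharing headers.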
Here's the slightly smaller shared fragment for backed2:
@ATTRIBUTE(cql:shared_fragment)
CREATE PROC _backed2 ()
BEGIN
SELECT
rowid,
cql_blob_get(T.k, backed2.id) AS id,
cql_blob_get(T.v, backed2.name) AS name
FROM backing AS T
WHERE cql_blob_get_type(T.k) = -1844763880292276559L;
END;
As you can see it's very similar -- the type hash is different and of course it has different columns.
The idea is that the backed table might change over time and you can add new optional fields without invalidating your existing data. If you change the name of the type or if you add new not null fields the type identity changes and any data you have in the backing table will basically be ignored because the type hash will not match.
You can configure these functions as you see fit. By default cql_blob_get turns into either bgetkey or bgetval depending on whether you are reading from the key blob or the value blob. The directives for configuring this are:
@blob_get_key bgetkey offset;
@blob_get_val bgetval;
You can configure the system to ask for the column by offset (this is normal for the primary key because it has a fixed number of columns for any given key type and they are all mandatory), or by hash code (this is normal for the value type because it might be missing some columns, so an offset is probably not appropriate). However, both are configurable: if you want key columns addressed by hash code, simply omit the "offset" part of the directive. Likewise, if your values are offset-addressable you can add "offset" to the value directive. Here the offset means the zero-based ordinal of the column in the key or value blob.
The type access functions are similarly configurable (they never need a code or an offset).
@blob_get_key_type bgetkey_type;
@blob_get_val_type bgetval_type;
Armed with these basic transforms we can already do a simple transform to make select statement work. Suppose CQL sees:
declare C cursor for select * from backed;
We can make this work with a simple transform:
DECLARE C CURSOR FOR WITH
backed (*) AS (CALL _backed())
SELECT *
FROM backed;
Now remember _backed was the automatically created shared fragment. Basically, if we see a select statement that mentions any backed table we simply add a call to the corresponding shared fragment in the WITH clause. This effectively creates that "view" we need. And because we're using the shared fragment form, all users of this fragment will share the text, so there's no schema cost and the text of the backed "view" appears only once in the binary. More precisely we get this:
WITH
backed (rowid, id, name, bias) AS (
SELECT
rowid,
bgetkey(T.k, 0), -- 0 is offset of backed.id in key blob
bgetval(T.v, -6639502068221071091L), -- note hash of backed.name
bgetval(T.v, -3826945563932272602L) -- note hash of backed.bias
FROM backing AS T
WHERE bgetkey_type(T.k) = 2105552408096159860L)
SELECT rowid, id, name, bias
FROM backed;
Now with this in mind we can see that it would be very beneficial to also add this:
CREATE INDEX backing_index ON backing(bgetkey_type(k));
or more cleanly:
CREATE INDEX backing_index ON backing(cql_blob_get_type(k));
Either of these results in a computed index on the row type stored in the blob. Other physical indices might be helpful too, and these can potentially be shared by many backed tables, or used in partial indices.
Of course your type function might be named something other than the default bgetkey_type
.
Now consider a slightly more complex example:
select T1.* from backed T1 join backed2 T2 where T1.id = T2.id;
becomes:
WITH
backed (rowid, id, name, bias) AS (CALL _backed()),
backed2 (rowid, id, name) AS (CALL _backed2())
SELECT T1.*
FROM backed AS T1
INNER JOIN backed2 AS T2
WHERE T1.id = T2.id;
Now even though two different backed tables will be using the backing store the select "just works". All the compiler had to do was add both backed
table fragments. And of course if backed
was joined against itself, that would also just work.
Consider:
insert into backed values (1, "n001", 1.2), (2, "n002", 3.7);
This has to insert into the backing storage and convert the various values into key and value blobs. A simple transform does this job as well:
WITH
_vals (id, name, bias) AS (
VALUES(1, "n001", 1.2), (2, "n002", 3.7)
)
INSERT INTO backing(k, v) SELECT
cql_blob_create(backed, V.id, backed.id),
cql_blob_create(backed,
V.name, backed.name,
V.bias, backed.bias)
FROM _vals AS V;
What's going on here? Well, the issue is that the data to be inserted can be arbitrarily complicated. It might refer to all kinds of things. In this case it's just literal values, but in general it could be anything. So the transform takes the original values and puts them in a _vals(...) CTE. Then we insert into the backing store by converting _vals into blobs -- one for the key and one for the value. There is only the one place we need to do this for any given insert statement, no matter how many items or how complex the insertion is.
cql_blob_create
similarly expands to a user-configured function with optional hash codes and mandatory field types. There is a default configuration that corresponds to this:
@blob_create_key bcreatekey offset;
@blob_create_val bcreateval;
The final SQL looks like this:
WITH
_vals (id, name, bias) AS (
VALUES(1, "n001", 1.2), (2, "n002", 3.7)
)
INSERT INTO backing(k, v) SELECT
bcreatekey(2105552408096159860, V.id, 1), -- type 1 is integer, offset implied
bcreateval(2105552408096159860,
-6639502068221071091, V.name, 4, -- hash as before, type 4 is text,
-3826945563932272602, V.bias, 3) -- hash as before, type 3 is real,
FROM _vals AS V
Note that both blobs have the same overall type code (2105552408096159860) as before. The key blob did not use per-field type codes, so the argument positions give the implied offset. In contrast the value blob is using hash codes (offset was not specified). This configuration is typical.
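As an aside, the contract these functions must satisfy is fairly small. The following Python sketch models it with ordinary tuples and dicts; the function names mirror the defaults above, but the representations are invented purely for illustration, and a real implementation would use a compact binary encoding (note the toy drops the field_type arguments that a real encoder would record).

```python
# Toy in-memory stand-ins for the blob functions.  A key blob is modeled as
# (type_code, [values by offset]); a value blob as (type_code, {hash: value}).

def bcreatekey(type_code, *args):
    # key blobs: (value, field_type) pairs; every field is present and
    # not null, so position alone (the offset) identifies a column
    values = [args[i] for i in range(0, len(args), 2)]
    return (type_code, values)

def bgetkey(blob, offset):
    return blob[1][offset]

def bgetkey_type(blob):
    return blob[0]

def bcreateval(type_code, *args):
    # value blobs: (field_hash, value, field_type) triples; fields may be
    # absent, so each one is addressed by its hash code
    fields = {}
    for i in range(0, len(args), 3):
        field_hash, value, field_type = args[i:i + 3]
        fields[field_hash] = value   # a real encoder would record the type too
    return (type_code, fields)

def bgetval(blob, field_hash):
    # a missing nullable field simply reads back as NULL
    return blob[1].get(field_hash)

k = bcreatekey(2105552408096159860, 1, 1)           # id = 1, type 1 (integer)
v = bcreateval(2105552408096159860,
               -6639502068221071091, "n001", 4)     # name = "n001", type 4 (text)
assert bgetkey(k, 0) == 1
assert bgetkey_type(k) == 2105552408096159860
assert bgetval(v, -6639502068221071091) == "n001"
assert bgetval(v, -3826945563932272602) is None     # bias was never stored
```

The asymmetry in the demo matches the text: the key is addressed by offset, the value by per-field hash codes.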
A more complex insert works just as well:
insert into backed
select id+10, name||'x', bias+3 from backed where id < 3;
The above insert statement is a bit of a mess. It's taking some of the backed data and using it to create new backed data. But the simple transforms we have work just as before. We add the needed backed
CTE and create _vals
like before.
WITH
backed (*) AS (CALL _backed()),
_vals (id, name, bias) AS (
SELECT id + 10, name || 'x', bias + 3
FROM backed
WHERE id < 3
)
INSERT INTO backing(k, v)
SELECT
cql_blob_create(backed, V.id, backed.id),
cql_blob_create(backed, V.name, backed.name, V.bias, backed.bias)
FROM _vals AS V;
Looking closely at the above we see a few things:
- cql_blob_create will expand as before (not shown)
- we added backed(*) as usual
- _vals once again holds the exact unchanged insert clause
- the insert into backing(k, v) part is identical; the same recipe always works
Now let's look at a simple delete example:
delete from backed where id = 7;
Now remember we're again looking for a pattern that will generalize when the where condition gets crazy. But fortunately this is not so hard. The following form is fully general:
WITH
backed (*) AS (CALL _backed())
DELETE FROM backing
WHERE rowid IN (
SELECT rowid
FROM backed
WHERE id = 7
);
All we had to do here was:
- add the usual _backed CTE
- move the original WHERE clause into a subordinate SELECT that gives us the rowids to delete
With the backed table in scope, any WHERE
clause works. If other backed tables are mentioned, the compiler
simply adds those as usual.
Here's a more complicated delete, it's a bit crazy but illustrative:
delete from backed where
id in (select id from backed2 where name like '%x%');
So this is using rows in backed2 to decide which rows to delete in backed. The same simple transform works directly.
WITH
backed2 (*) AS (CALL _backed2()),
backed (*) AS (CALL _backed())
DELETE FROM backing
WHERE rowid IN (
SELECT rowid
FROM backed
WHERE id IN (
SELECT id FROM backed2 WHERE name LIKE '%x%'
)
);
What happened here:
- the WHERE clause went directly into the body of the rowid select
- backed was used as before, but now we also need backed2
The delete pattern does not need any additional CQL helpers beyond what we've already seen.
The update statement is the most complicated of the bunch and it requires all the tricks from all the previous statements plus one more.
First, we'll need two more blob helpers that are configurable. By default they look like this:
@blob_update_key bupdatekey offset;
@blob_update_val bupdateval;
These are used to replace particular columns in a stored blob. Now let's start with a very simple update to see how it all works:
update backed set name = 'foo' where id = 5;
Fundamentally we need to do these things:
- the target of the update has to end up being the backing table
- we need the backed table CTE so we can do the filtering
- we want to use the rowid trick to figure out which rows to update which handles our where clause
- we need to modify the existing key and/or value blobs rather than create them from scratch
Let's see how this looks:
WITH
backed (*) AS (CALL _backed())
UPDATE backing
SET v = cql_blob_update(v, 'foo', backed.name)
WHERE rowid IN (SELECT rowid
FROM backed
WHERE id = 5);
Tearing this down a bit:
- we needed the normal CTE so that we can use backed rows
- the WHERE clause moved into a WHERE rowid sub-select just like in the DELETE case
- we changed the SET targets to be k and/or v, very much like the INSERT case, except we used an update helper that takes the current blob and creates the new blob to store
- the helper is varargs so, as we'll see, it can mutate many columns in one call
This gives us a working update statement... with one hitch. It's possible to use the existing column values in the update expressions and there's no way to use our backed
CTE to get them since the final update has to be all relative to the backing table.
Let's look at another example to illustrate the problem:
update backed set name = name || 'y' where bias < 5;
So this is adding the letter 'y' to some rows. Kind of goofy but similar mutations do happen and have to work. To make this work the reference to
name
inside of the set expression has to change. We end up with something like this:
WITH
backed (*) AS (CALL _backed())
UPDATE backing
SET v = cql_blob_update(v,
cql_blob_get(v, backed.name) || 'y',
backed.name)
WHERE rowid IN (SELECT rowid
FROM backed
WHERE bias < 5);
Importantly the reference to name
in the set expression was changed to cql_blob_get(v, backed.name)
-- extracting the name from the value blob. After which it is appended with 'y' as usual.
The rest of the pattern is just as it was, in fact literally everything else is unchanged. But it's easy to see that the WHERE
clause could be arbitrarily complex and it just works. Since the UPDATE
statement has no FROM
clause only the fields in the target table might need to be rewritten, so in this case name
, id
, and bias
were possible but only name
was mentioned.
After the cql_blob_get
and cql_blob_update
are expanded the result looks like this:
WITH
backed (rowid, id, name, bias) AS (
SELECT
rowid,
bgetkey(T.k, 0),
bgetval(T.v, -6639502068221071091L),
bgetval(T.v, -3826945563932272602L)
FROM backing AS T
WHERE bgetkey_type(T.k) = 2105552408096159860L
)
UPDATE backing
SET v =
bupdateval(
v,
-6639502068221071091L, bgetval(v, -6639502068221071091L) || 'y', 4
)
WHERE rowid IN (SELECT rowid
FROM backed
WHERE bias < 5);
The blob update function for the value blob requires the original blob, the hash or offset to update, the new value, and the type of the new value. The blob update function for the key blob is the same (blob, hash/offset, value) but the type is not required since the key blob necessarily has all the fields present because they are necessarily not null. Therefore the type codes are already all present and so the type of every column is known. The value blob might be missing nullable values hence their type might not be stored/known.
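To see what those signatures must do, here is a hedged Python sketch of the two update helpers over a toy in-memory representation (a tuple of values by offset for key blobs, a dict by field hash for value blobs; both invented for illustration, and the toy drops the field_type a real encoder would record):

```python
# Toy stand-ins for the update helpers, matching the argument orders
# described above.

def bupdatekey(blob, new_value, offset):
    # key blobs hold every field, so no type argument is required
    type_code, values = blob
    values = list(values)            # copy: "updating" yields a new blob
    values[offset] = new_value
    return (type_code, values)

def bupdateval(blob, field_hash, new_value, field_type):
    # value blobs may be missing the field entirely, so the type is supplied;
    # a real encoder would store it alongside the value
    type_code, fields = blob
    fields = dict(fields)
    fields[field_hash] = new_value
    return (type_code, fields)

k = (3397981749045545394, ["foo", 10])          # meta key: (name, state)
k2 = bupdatekey(k, 11, 1)                       # state + 1 lives at offset 1
assert k2[1] == ["foo", 11]
assert k[1] == ["foo", 10]                      # the original blob is unchanged

v = (3397981749045545394, {})
v2 = bupdateval(v, -4464241499905806900, 10, 2) # prev_state = 10, type 2 (long)
assert v2[1][-4464241499905806900] == 10
```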
To illustrate these cases we can make another small example; we'll set up yet another small table that uses the same backing store:
@attribute(cql:backed_by=backing)
create table meta(
name text,
state long,
prev_state long,
primary key(name, state)
);
This update mixes all kinds of values around...
update meta
set state = state + 1, prev_state = state
where name = 'foo';
And the final output will be:
WITH
meta (rowid, name, state, prev_state) AS (
SELECT
rowid,
bgetkey(T.k, 0),
bgetkey(T.k, 1),
bgetval(T.v, -4464241499905806900)
FROM backing AS T
WHERE bgetkey_type(T.k) = 3397981749045545394
)
UPDATE backing
SET
k = bupdatekey(k, bgetkey(k, 1) + 1, 1),
v = bupdateval(v, -4464241499905806900, bgetkey(k, 1), 2)
WHERE rowid IN (SELECT rowid
FROM meta
WHERE name = 'foo');
As expected the bupdatekey
call gets the column offset (1) but not the type code (2). bupdateval
gets a hash code and a type.
All of these transforms are live in the code as of a few days ago.
The upshot is that, if you write some simple encoding and decoding functions, you can have very flexible blob storage.
If you want to refer to your blob functions in your own code, such as in indices, you'll also need declarations like these:
declare select function bgetkey_type(b blob) long;
declare select function bgetval_type(b blob) long;
declare select function bgetkey(b blob, iarg integer) long;
declare select function bgetval(b blob, iarg integer) long;
declare select function bcreateval no check blob;
declare select function bcreatekey no check blob;
declare select function bupdateval no check blob;
declare select function bupdatekey no check blob;
bgetval and bgetkey are not readily declarable in general because their result is polymorphic, so it's preferable to use cql_blob_get as above, which then does the rewrite for you. But it is helpful to have a UDF declaration for each of the above, especially if you want the --rt query_plan output to work seamlessly. Typically bgetval would only be needed in the context of a create index statement.
There are many cases where you might want to nest one result set inside of another one. In order to
do this economically there was a great desire to be able to run a parent query and a child query and
then link the child rows to the parent rows. One way to do this is of course to run one query for
each "child" but then you end up with O(n)
child queries and if there are sub-children it would be
O(n*m)
and so forth. What you really want to do here is something more like a join, only without
the cross-product part of the join. Many systems have such features, sometimes they are called
"chaptered rowsets" but in any case there is a general need for such a thing.
We did a bunch of work in the name of parent/child result sets, but like many goals of this kind it caused us to ripen the CQL language in a variety of ways, and it's interesting to talk about those changes. Importantly, we wanted to be able to do work of this kind in the language while adding the fewest new notions, basically enabling the language to express a concept like a child rowset in the first place.
Here are some things that happened along the way that are interesting.
One of the first problems we run into thinking about how a CQL program might express pieces of a rowset and turn them into child results is that you need to be able to hash a row, append row data, and extract a result set from a key.
Let's think about that for just a second: in order to do anything at all with a child rowset, no matter how we got such a thing, we have to be able to describe it in a type-safe way. These objects already exist at runtime but they do not appear anywhere in the language explicitly and that was going to have to change.
To address this we added a new object type, kind of like we did with boxed statements. A result set
has a type that looks like this object <proc_name set>
. Here proc_name
must be the name of a
procedure that returns a result set and the object will represent a result set with the
corresponding columns in it.
That step may seem like it's super important but actually it's kind of optional, it provides type-safety
but the initial versions of the feature just used the type object
which works fine provided you make
no mistakes... it turns out there are even more fundamental needs that aren't optional.
The first thing you need to be able to do is take the type of the parent query and add one or more columns to it to hold the child result set or sets (note that you can have more than one child result set per parent). So for instance you might have a list of people, and one child result might be the names of the schools they attended and another the names of the jobs they worked.
Adding columns to existing rows might sound like a bizarre thing to do, but it's actually fundamental to the job here. We must be able to create a new output row that is the same as the parent but includes columns for the child results too. There was no good syntax for this. The cursor declaration forms were:
/* option 1 */ declare C cursor like shape_name;
/* option 2 */ declare C cursor like select 1 x, "2" y, false z;
The first option implies that you already have a shape from (e.g.) a procedure or table and you want to make an identical cursor. That doesn't work here because we're trying to modify an existing shape, not use it as is.
The second form was supposed to be able to create any kind of cursor shape by simply declaring a select
statement that is an example of what you want to capture. In principle this can define almost anything.
However, there's a catch -- you can't get object types to come out of a select
so it's hopeless for result set types.
And, maybe just as important, you can't just add a few columns to an existing type with any kind of ease,
you have to list all columns.
Fortunately there was a pretty simple solution to this problem. There were already lots of cases where
a typed name list happens in the language -- for example in the return type of a function you can
specify something like (id integer, name text)
. That construction also defines a shape just like a
select statement and there was already code to handle all the correctness analysis. Additionally,
the LIKE
construct can be used in such a list to refer to existing types. So for instance a function that
returns all the columns of tables A and B could be defined like so:
declare function foo() (LIKE A, LIKE B);
So we could solve all the cursor type problems by allowing a typed name list to be used to define a cursor shape. Probably the approach that should have been taken in the first place. The select option seems weird by comparison.
With the already existing support for shapes in a type list we could make the result shape for this parent/child case with ease, like so:
declare result cursor like (like parent, child_result object<child_proc set>);
So, all the parent columns plus a child result set. Or more than one child result set if needed.
Lastly, there were going to be cases where we needed to make a new cursor using only some of the fields of an existing cursor. The case in particular I'm thinking of is that we might have a big row from the parent with only one or two columns that we need -- the key columns for the child. We didn't have a good way to do that either, but solving it turned out to be simple enough. We already had this form:
declare D cursor like C;
we just added:
declare D cursor like C(a, b, c);
Which chooses just the 3 named fields from C
and makes a cursor with only those. Recently we added
the form:
declare D cursor like C(-x);
To mean take all the columns of C
except x.
With a shape for the key fields defined, we can use existing syntax to load the fields economically:
fetch D from C(like D);
Which says we want to load D
from the fields of C
, but using only the columns of D
. That operation
is of course going to be an exact type match by construction. So now we could describe the key columns from
child rows, and the key columns from parent rows. And we could add columns to the parent type to create space
to hold child result sets. All of our type problems are solved. Almost.
It was clear that we would need to be able to do things like "hash a cursor" (any cursor) or "store this row into the appropriate partition" and this requirement meant that we had to be able to write functions that could take any cursor and dynamically do things to it based on its type information. There is no good way to write these generic helper things in CQL, but:
- we don't need very many of them,
- it's pretty easy to do that job in C
The main thing we need is a way to declare such functions and call them with a cursor and the necessary shape info.
So we added this notion of being able to call an external function with any cursor. Like so:
declare function cursor_hash(C cursor) long not null;
you can call it like so:
let hash := cursor_hash(C);
where C
is any cursor.
When such a call is made the C function cursor_hash
gets passed what we call a "dynamic cursor".
This includes:
- a pointer to the data for the cursor
- the count of fields
- the names of the fields
- the type/offset of every field in the cursor
So you can (e.g.) generically do the hash by applying a hash to each field and then combining all of those.
This kind of function works on any cursor and all the extra data about the shape that's needed to make the
call is static, so really the cost of the call stays modest. Details of the dynamic cursor type are in
cqlrt_common.h
and there are many example functions now in the cqlrt_common.c
file.
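As a sketch of what such a generic helper does, here is a Python analogue of cursor_hash. In the real runtime the row data and field descriptions arrive via the C dynamic cursor structure defined in cqlrt_common.h; here that metadata is modeled, purely for illustration, as a tuple of field names passed alongside the row values.

```python
# A Python analogue of a generic cursor helper: because the field metadata
# travels with the data, one function works for every cursor shape.

def cursor_hash(values, field_names):
    # hash each field (mixing in its name so that the same values in
    # different columns hash differently) and combine the results,
    # masked to 64 bits the way a C implementation naturally would be
    h = 0
    for name, value in zip(field_names, values):
        h = (h * 31 + hash((name, value))) & 0xFFFFFFFFFFFFFFFF
    return h

names = ("id", "name", "bias")
# two cursors holding the same row data hash identically
assert cursor_hash((5, "foo", None), names) == cursor_hash((5, "foo", None), names)
```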
Again, creating this facility was a pretty minor matter, the compiler already has all this data and uses it to create result sets in the first place. We just allowed other functions to use that same data and made a public type for it.
To do the parent/child operations we needed three helper functions:
DECLARE FUNC cql_partition_create ()
CREATE OBJECT<partitioning> NOT NULL;
DECLARE FUNC cql_partition_cursor (
part OBJECT<partitioning> NOT NULL,
key CURSOR,
value CURSOR)
BOOL NOT NULL;
DECLARE FUNC cql_extract_partition (
part OBJECT<partitioning> NOT NULL,
key CURSOR)
CREATE OBJECT NOT NULL;
The first function makes a new partitioning.
The second function hashes the key columns of a cursor (specified by the key argument) and appends the values provided into a bucket for that key. By making a pass over the child rows you can easily create a partitioning with each unique key combo having a buffer of all the matching rows.
The third function is used once the partitioning is done. Given a key again, which you now presumably get from the parent rows, you get the buffer you had accumulated and then make a result set out of it and return that. Note that this function returns the vanilla object type because it could be returning any shape.
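The semantics of the three helpers can be sketched in a few lines of Python. This is a model only: keys become tuples of the key-column values and "result sets" become lists, and whether the real runtime treats a key with no children as an empty set or an error is an assumption of this sketch.

```python
from collections import defaultdict

# A "partitioning" maps each key to the buffer of child rows accumulated
# for it; one pass over the child rows fills it, then each parent row
# extracts its own bucket.

def cql_partition_create():
    return defaultdict(list)

def cql_partition_cursor(part, key, value_row):
    part[key].append(value_row)      # append this child row to its bucket
    return True

def cql_extract_partition(part, key):
    return part.get(key, [])         # all the children matching this key

part = cql_partition_create()
for child in ({"id": 1, "u": "a"}, {"id": 1, "u": "b"}, {"id": 2, "u": "c"}):
    cql_partition_cursor(part, (child["id"],), child)

assert len(cql_extract_partition(part, (1,))) == 2   # two children for key 1
assert cql_extract_partition(part, (3,)) == []       # no children for key 3
```

Note the shape of the cost: one pass over the children and one over the parents, rather than one child query per parent row.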
With the type system mentioned above you could now join together any kind of complex parent and child combo you needed, but it might be a lot of code, and it's error prone. This is a good job for a little sugar. So we added some simple syntax to specify the usual partitioning.
It looks like this:
-- parent and child defined elsewhere
declare proc parent(x integer not null) (id integer not null, a integer, b integer);
declare proc child(y integer not null) (id integer not null, u text, v text);
-- join together parent and child using 'id'
create proc parent_child(x_ integer not null, y_ integer not null)
begin
out union call parent(x_) join call child(y_) using (id);
end;
The generated code is simple enough, even though there's a good bit of it. But it's a useful exercise to look at it once. Comments added for clarity.
CREATE PROC parent_child (x_ INTEGER NOT NULL, y_ INTEGER NOT NULL)
BEGIN
DECLARE __result__0 BOOL NOT NULL;
-- we need a cursor to hold just the key of the child row
DECLARE __key__0 CURSOR LIKE child(id);
-- we need our partitioning object (there could be more than one per function,
-- so it gets a number; likewise everything else gets a number)
LET __partition__0 := cql_partition_create();
-- we invoke the child and then iterate its rows
DECLARE __child_cursor__0 CURSOR FOR CALL child(y_);
LOOP FETCH __child_cursor__0
BEGIN
-- we extract just the key fields (id in this case)
FETCH __key__0(id) FROM VALUES(__child_cursor__0.id);
-- we add this child to the partition using its key
SET __result__0 := cql_partition_cursor(__partition__0, __key__0, __child_cursor__0);
END;
-- we need a shape for our result, it is the columns of the parent plus the child rowset
DECLARE __out_cursor__0 CURSOR LIKE (id INTEGER NOT NULL, a INTEGER, b INTEGER,
child1 OBJECT<child SET> NOT NULL);
-- now we call the parent and iterate it
DECLARE __parent__0 CURSOR FOR CALL parent(x_);
LOOP FETCH __parent__0
BEGIN
-- we load the key values out of the parent this time, same key fields
FETCH __key__0(id) FROM VALUES(__parent__0.id);
-- now we create a result row using the parent columns and the child result set
FETCH __out_cursor__0(id, a, b, child1) FROM VALUES(__parent__0.id, __parent__0.a, __parent__0.b, cql_extract_partition(__partition__0, __key__0));
-- and then we emit that row
OUT UNION __out_cursor__0;
END;
END;
This code iterates the child once and the parent once and only has two database calls, one for the child and one for the parent. And this is enough to create parent/child result sets for the most common examples.
While the above is probably the most common case, another case can happen where you might
want to make a procedure call for each parent row to compute the child. And, more generally,
there was no good way to work with result sets from procedure calls other than iterating them
with a cursor. The iteration pattern is very good if the data is coming from a select statement
-- we don't want to materialize all of the results if we can stream instead. However, when working
with result sets the whole point is to create materialized results for use elsewhere.
We now had the power to express a result set type with object<proc_name set>
but no way to
actually get such a set from an existing procedure. Procedures generated them,
but they could only be consumed in the C layer.
Fortunately this is also an easy problem to solve. We already supported the ability to use procedures as functions in expressions if they had the right signature. We now add the ability to call a procedure that returns a result set and capture that result. Previously this was not supported and would have produced an error.
With the new features you can write:
declare child_result object<child set>;
set child_result := child(args);
or better still:
let child_result := child(args);
With this simple change we had the power to write something like this:
declare proc parent(x integer not null) (id integer not null, a integer, b integer);
declare proc child(id integer not null) (id integer not null, u text, v text);
create proc parent_child(x_ integer not null, y_ integer not null)
begin
-- the result is like the parent with an extra column for the child
declare result cursor like (like parent, child object<child set>);
-- call the parent and loop over the results
declare P cursor for call parent(x_);
loop fetch P
begin
-- compute the child for each P and then emit it
fetch result from values(from P, child(P.id));
out union result;
end;
end;
After the sugar is applied this compiles down to this program:
DECLARE PROC parent (x INTEGER NOT NULL) (id INTEGER NOT NULL, a INTEGER, b INTEGER);
DECLARE PROC child (id INTEGER NOT NULL) (id INTEGER NOT NULL, u TEXT, v TEXT);
CREATE PROC parent_child (x_ INTEGER NOT NULL, y_ INTEGER NOT NULL)
BEGIN
DECLARE result CURSOR LIKE (id INTEGER NOT NULL, a INTEGER, b INTEGER,
child OBJECT<child SET>);
DECLARE P CURSOR FOR CALL parent(x_);
LOOP FETCH P
BEGIN
FETCH result(id, a, b, child) FROM VALUES(P.id, P.a, P.b, child(P.id));
OUT UNION result;
END;
END;
The LIKE
and FROM
forms are very powerful but they aren't new. They do make
it a lot easier to express this notion of just adding one more column to the result.
Note that the code for emitting the parent_child
result before the transformation
doesn't need to specify what the columns of the parent are or the columns of the child,
only that the parent has at least the id
column. Even that could have been removed.
This call could have been used instead:
fetch result from values(from P, child(from P like child arguments));
That syntax would result in using the columns of `P` that match the arguments of `child` -- just `P.id` in this case. But if there were 7 such columns the sugar might be easier to understand.
Last, but not least, to make this more accessible we wanted more support in the generated code.
The C interface would have produced generic object results for the child result columns.
This isn't wrong exactly but it would mean that a cast would be required in every use case on the
native side, and it's easy to get the cast wrong. So the result type of column getters was adjusted to be a `child_result_set_ref` instead of just a `cql_object_ref`.
Similar transforms were needed if column setters were being emitted (yes, that's an option!) and of course the Java and Objective-C output needed the same transform.
The pursuit of native support for parent/child result sets in CQL resulted in a bunch of very useful generalizations for declaring and managing cursors. The old special-case code for blobs was actually replaced by these forms. The language's overall expressiveness increased far more than just the ability to do this one kind of join. It's now possible to write general-purpose debug helpers for cursors. It's possible to store and return pre-cooked result sets, creating useful caches and other such combinations. The type extensions that allow extending and narrowing existing types allow even more return flexibility while keeping everything strongly typed.
Parent/Child result sets exploit all of these things.
I was tempted to subtitle this article "How a great idea went horribly, horribly, wrong" but in the final analysis the outcome isn't actually at all horrible. But there are some good lessons here, and it's useful to capture the history while it is still fresh.
The CQL compiler can produce for you, starting from a set of table declarations and schema annotations, a schema upgrader that can upgrade your schema from any previous version to the current version, provided some simple rules are followed. Helpfully, the compiler enforces those rules with plain error messages so that you can reasonably expect your upgrader to work provided all is well with your database connection.
Broadly, the entities of the schema are on one of two plans: "create" and "recreate". These notions are all discussed in more detail in Chapter 10 of the guide.
This plan applies strictly to tables, and is used for tables that have precious data that cannot reasonably be restored from say the cloud or some on-device backup. Typically the primary data is on this plan.
On this plan you are limited to these operations:
- new tables can be created (including the so-called baseline tables, those having no annotation at all)
- columns can be added to the end of a table such that an `ALTER TABLE ADD COLUMN` statement could add them
- columns can be deleted, making them illegal to use in queries but otherwise having no physical consequence
  - in CQL `select * from foo` will not include deleted columns, hence `*` is fully expanded
- tables can be deleted, leaving a tombstone in the schema
  - the tombstone provides the clue to the upgrader that the table should be dropped if it is found
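The column-add rule above can be sketched directly against SQLite. This is a simplified illustration (the helper name and table are hypothetical), showing why a guarded `ALTER TABLE ADD COLUMN` is safe to re-run:

```python
import sqlite3

# Sketch of the create plan's column-add rule: only add a column if it is
# not already present, so the upgrade step can run any number of times.
def add_column_if_missing(conn, table, column, decl):
    cols = [r[1] for r in conn.execute(f"PRAGMA table_info({table})")]
    if column not in cols:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo(id INTEGER NOT NULL)")
add_column_if_missing(conn, "foo", "name", "TEXT")
add_column_if_missing(conn, "foo", "name", "TEXT")  # second call is a no-op
```

Because columns can only be appended, the final column order is the same no matter which upgrade path a database took.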
The primary directives for this plan use `@create` annotations, hence the name.
Triggers, indices, and views are all on this plan, and tables can be too if they are annotated with `@recreate` instead of `@create`. The idea with this plan is that if the entity changes at all you simply drop the old version and create the new version. This means any change is possible, but it also means the upgrade is always destructive:
- if the upgrader is going to do anything at all, it drops all views and all triggers at the start and recreates them at the end
  - this is not destructive and takes a lot of weird failure modes off the table
  - note: steps in the upgrade logic therefore cannot rely on the existence of views or triggers
- if any index or table changes at all, it is dropped and recreated
  - this is done by computing a 64-bit CRC of the entity's schema and comparing it to the stored CRC
  - if the CRC has changed, the recreate happens
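The CRC comparison driving the recreate plan can be sketched like this. A minimal illustration, using the 32-bit CRC from Python's zlib for brevity where the real upgrader uses a 64-bit CRC of the entity's schema:

```python
import zlib

# Sketch of the recreate plan: store a CRC of each entity's schema text;
# if the current CRC differs from the stored one, drop and recreate the
# entity and record the new CRC.
stored_crcs = {}

def needs_recreate(name, schema_text):
    crc = zlib.crc32(schema_text.encode())
    if stored_crcs.get(name) == crc:
        return False          # unchanged: nothing to do
    stored_crcs[name] = crc   # record the new CRC after the recreate
    return True

first = needs_recreate("my_index", "CREATE INDEX my_index ON foo(id)")
second = needs_recreate("my_index", "CREATE INDEX my_index ON foo(id)")
```

The first call reports a recreate is needed; the second sees a matching CRC and does nothing.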
Probably the first thing you noticed once you create the notion of recreate for tables is that you really want to do the recreation in groups. There are constellations of schema that have related information and if one of them changes they all need to be updated. This lets you have complex foreign key relationships within this "recreate group".
You'll also notice that a recreate group can have foreign keys within itself, and it can make foreign keys to things that are on the create plan, but you run into trouble if you try to make foreign keys to some other recreate group. That group might vanish on you, or rather, it might try to vanish and discover that it cannot because of constraint violations. Originally recreate groups could not refer to other groups, but recently this was generalized to track a directed acyclic graph of groups. This means that recreating a core group forces the recreation of any groups that refer to it. On this plan it's common to end up with a snowflake-type schema where the outer parts of the snowflake update often and the inner parts hardly at all.
In addition to the CRCs for the recreate groups and indices, there was one CRC for the overall schema. The upgrader checks this before anything else. If the overall schema CRC matches the current schema then nothing needs to be done (the upgrader has already done its job). If it doesn't match then some steps have to be applied.
Other than the cross-group dependencies things began in the form above. The recreate plan was CRC driven and the create plan was version driven. The original design simply generated the appropriate corrections at each schema version and tracked the current version. If the overall CRC had changed, whichever steps you needed were executed.
This turned out to be a disaster and it was changed within days.
The idea seems fine enough, but the first thing you run into is that two people might make a change to the schema creating, say, version 5. The problem is that if one person adds table X and the other adds table Y, they will each run their own upgrade and mark themselves as schema 5. When they merge their changes with some other developer's, the version 5 upgrade will have already run and they will get nothing, despite the fact that v5 now includes more for both of them. This is crazytown for developers.
So, rather than simply tracking the current schema version, each schema version got its own mini-CRC. The upgrader would run the steps of each version if the CRC was absent or didn't match. With steps like `CREATE TABLE IF NOT EXISTS` and so forth, a merge would result in you getting the other half of the changes for your version, and work could accumulate at, say, schema v5 with no problems. Actual customers would never see this because they only saw completed schema versions.
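The per-version mini-CRC scheme can be sketched as follows. This is a simplified model, not the generated upgrader; it shows how a merged change to a version's steps causes that version to re-run even on a database already marked as that version:

```python
import zlib

# Sketch of per-version CRCs: each version's upgrade steps get their own
# CRC, so any change to a version's steps makes that version run again.
# The steps themselves must be idempotent (CREATE TABLE IF NOT EXISTS etc.)
# for the re-run to be safe.
def run_upgrade(versions, stored):
    ran = []
    for v, steps in sorted(versions.items()):
        crc = zlib.crc32(steps.encode())
        if stored.get(v) != crc:
            ran.append(v)     # (re)run this version's idempotent steps
            stored[v] = crc
    return ran

stored = {}
versions = {5: "CREATE TABLE IF NOT EXISTS x(id INTEGER);"}
run_upgrade(versions, stored)   # first run executes v5 and records its CRC

# a merge lands more work at v5, changing its CRC
versions[5] += "CREATE TABLE IF NOT EXISTS y(id INTEGER);"
rerun = run_upgrade(versions, stored)   # v5 runs again, picking up y
```

This is how two developers could both accumulate work at v5 without losing each other's changes.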
This worked a lot better and lasted about a year.
The problem is that the system is based on these "mostly immutable" schema versions. You never restate the past; you always give instructions on how to move forward. With the versions being nearly immutable and the upgrade steps being idempotent, things seemed good. But it turns out neither of those two assumptions was really exactly true.
The reality of the schema we created for our platform was that there was one large uber-schema that had all the possible schema you might need for a variety of features, and any given product could opt in to the features it wanted, thereby getting the necessary schema. The schema system had a good way to partition the schema using regions. The upgrader could work on a set of regions and provide the union of schema in those regions, omitting the rest.
Super. Here's where things get exciting. A schema consumer could reasonably decide at some time in the future that it wants new features and so it opts into additional regions. That's fair enough, but the net of this is that of course new tables appear. Some of these are likely to be in the baseline schema (v0) and some might have appeared later (e.g. v5, v10, v21). This is all fine, the CRCs for those versions change and the schema upgrader runs again. Those versions execute and add the correct schema. Perfect.
Actually no.
About two years into the history of CQL we started noticing that some attempts to delete tables were failing. The DROP commands claimed that there was a constraint problem -- but these should have been leaf tables. What constraint could possibly be the issue? This was the first time a major design flaw with this system was revealed. Previous bugs had been few and had all been silly logic errors or off-by-one checks in version numbers, that kind of thing, easily fixed. This was a puzzler. But the answer was fortunately available in the set of annotations.
Basically, imagine a table "zombie" had been created, say, in the baseline schema, and then later deleted; suppose it was deleted in version 20. All is well, the upgrade steps for version 20 include a table drop. However, now a team subscribes to more schema, causing the v0 schema to include a few more tables. Here's the problem: when the steps for v0 run again they notice that "zombie" is missing and helpfully create it, thinking this is the right thing to do. But this is a disaster... The "zombie" table is supposed to be deleted in v20 but that CRC is unchanged! So now a table exists that has no business existing. If "zombie" has an FK reference to some other table which we want to delete, then all attempts to drop that table will fail because "zombie" is there gumming up the works. Even if it's empty... which it will be in this case.
This problem was fixed by having all tables that need deleting be unconditionally deleted at the end of the upgrade and not in the steps for the version in which the delete happened. This meant that the next upgrade purged all the zombies and enabled the correct table drops to start running with no errors. The consequence of this was a 90% reduction in schema upgrade failures!
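The fix can be modeled in a few lines. A sketch with a stand-in for the database, showing the unconditional end-of-upgrade purge of deleted tables:

```python
# Sketch of the zombie fix: tables marked for deletion are dropped
# unconditionally at the end of every upgrade, instead of only inside the
# version step where the delete was declared (which its CRC may skip).
def final_cleanup(db_tables, deleted_tables):
    for t in deleted_tables:
        db_tables.discard(t)   # drop it if it exists, no-op otherwise

# "zombie" was recreated by a re-run of the baseline step even though it
# was deleted at v20; the unconditional pass at the end purges it anyway.
db_tables = {"precious", "zombie"}
final_cleanup(db_tables, deleted_tables={"zombie"})
```

With the zombie gone, drops of the tables it referenced no longer hit constraint errors.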
Another reason for the "immutable" version history to (cough) mutate was a desire to opt out of tables. As described in this section, we created an affordance to allow people to unsubscribe from some of the tables they had previously selected. This provided finer-grained control of the schema subscription and also made it possible to undo previous mistakes of over-subscription. However, it was clear from the beginning that you might want to undo an unsubscription at some time in the future. In keeping with schema directives that create a clear history, the `@unsub` and `@resub` statements were added to the language with lots of rules for correctness. The upgrader did the following things:
- upon finding an unsubscription at version X, that version includes DDL to drop the unsubscribed table
- changes to that table in any future versions are omitted
- upon finding a resubscription at version Y, that version includes DDL to create the table as it exists at version Y
- later changes to that table are once again emitted as usual
This was very nearly right, except it had the same problem as the delete case above. A table created in, say, the baseline might come back as a zombie even though it was unsubscribed. However, now wise to this issue, a small fix took care of the problem:
- always drop tables in the unsubscribed state at the end, just like deleted tables
- no code is needed to do an unsubscribe at version X (the drop at the end will do the job)
- a resubscribe at version X first drops the table and then recreates it as it exists at version X
This gives us a consistent upgrade path again and importantly avoids the problem of a resubscription finding a zombie that prevents it from doing its job.
On July 1, 2022 we made a modest change that reduced the number of SQLite statements required to do a full upgrade. The opportunity came from the many column existence checks we made before running `ALTER TABLE ADD COLUMN`. Rather than run a statement like `(SELECT EXISTS(SELECT * FROM sqlite_master WHERE tbl_name = table_name AND sql GLOB column_declaration_pattern))` for each column, we first selected all of the table schema out of the `sqlite_master` table and put it into a hash table keyed by the table name. Reading even a few hundred table names was much faster than running a single statement for each column that needed to be checked -- especially when recreating the schema from scratch. In the most relevant test case this was a 7% improvement.
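The hash-table approach can be sketched in Python against a real SQLite connection. This simplified version uses a substring match where the real code matched a GLOB pattern on the column declaration:

```python
import sqlite3

# Sketch of the sqlite_master optimization: read all table schemas in one
# query into a dict keyed by table name, then answer column-existence
# questions with an in-memory match instead of one query per column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo(id INTEGER NOT NULL, name TEXT)")

schema_by_table = {
    name: sql
    for name, sql in conn.execute(
        "SELECT tbl_name, sql FROM sqlite_master WHERE type = 'table'")
}

def column_exists(table, column_decl):
    sql = schema_by_table.get(table)
    return sql is not None and column_decl in sql

has_name = column_exists("foo", "name TEXT")
```

One scan of `sqlite_master` replaces what had been one prepared statement per column check.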
Importantly, it motivated us to add hash tables into `cqlrt_common.c` and generalize the mechanism for object management so that the `cqlrt` allows creation of new objects without having to add special support for each one. This new hash table meant that we could do a hash lookup and substring match instead of a SQLite query for each column.
On Oct 11, 2022 we stopped using CRCs for the version checks on the create plan entirely. This was in fact an optimization but it was motivated by a real, but rare, problem.
What was happening was something like maybe 1 in 10^5 databases was missing columns. The sequence of events that caused this was very hard to diagnose but the situation was very clear. The database was at say schema version 100. The columns had been added at say version 50. The CRCs indicated that the v50 upgrade had already run so it didn't run again. The columns would now never be added.
We have yet to come up with a set of steps that adequately describes how this happened. I have to guess some combination of a resubscription ran because of one of those "the schema is not really immutable" changes, and then a "medium" version of the table, say v25, was resubscribed, but the columns added in, say, v50 never got re-added because v50 thought it had already run.
This was getting to be a nightmare but there was a simple solution.
We had already created this dictionary that had all the tables and their schema from sqlite_master; we were already using it to determine if we needed to add a particular column. The only reason we had version CRCs at all was to allow us to skip steps, but since we could already skip column adds super fast, all we needed was to be able to skip table adds -- there is nothing else. Well, the same hash table can obviously easily tell us if a table exists. Non-existent tables have no schema and hence are absent from the hash table, which is loaded directly from `sqlite_master`.
So the new algorithm goes something like this:
- use the version numbers only for ordering
- before adding a table, check if it exists in the hash table; this is faster than running `CREATE TABLE IF NOT EXISTS`
- check the columns as before
- attempt each of these every time the overall schema changes, and trust that the fast checks are fast enough
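The version-free algorithm above can be sketched end to end. A simplified model: existence checks against a dict loaded from `sqlite_master`, with version numbers used only to order the steps (ordering not shown):

```python
import sqlite3

# Sketch of the version-free upgrade: no per-version CRCs, just fast
# in-memory existence checks for every table add and column add.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo(id INTEGER NOT NULL)")

existing = {
    name: sql
    for name, sql in conn.execute(
        "SELECT tbl_name, sql FROM sqlite_master WHERE type = 'table'")
}

# table add: skip if already present (no CREATE TABLE IF NOT EXISTS needed)
if "bar" not in existing:
    conn.execute("CREATE TABLE bar(id INTEGER NOT NULL)")

# column add: skip if the declaration already appears in the stored schema
if "name TEXT" not in existing.get("foo", ""):
    conn.execute("ALTER TABLE foo ADD COLUMN name TEXT")

tables = {r[0] for r in conn.execute(
    "SELECT tbl_name FROM sqlite_master WHERE type = 'table'")}
```

Every step runs every time the overall schema changes; the checks are cheap enough that skipping machinery is unnecessary.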
On this plan we change the way `@unsub` and `@resub` are handled to something much simpler:

- `@unsub` acts like an out-of-band `@delete` on the table or view to which it is applied
  - the drop happens at the end like before
- `@resub` resets that state, so the table is considered not deleted if the last operation was `@resub`
To this we add one new rule:

- the schema upgrader removes any reference to deleted tables entirely
  - they are removed from baseline
  - they are not included in any upgrade rules
  - they are only dropped at the end if they still exist
This vastly simplifies unsub/resub and delete. An unsubscribed table will always get cleaned up at the end, just like deleted tables. No strange interim states happen in resub. If a table is resubscribed it just reappears in the schema and the various operations run as usual.
The only compromise to this is that we still have a single CRC for the overall baseline schema. However, even that could be removed at the expense of more hash table lookups. There is a binary-size win for fewer checks, and since baseline by definition depends on nothing, it seems like a safe position to take.
This approach was in fact about 13-15% faster: the time saved examining and writing back schema CRCs more than paid for the extra hash table checks (which were ~100x faster than the db operations). And the hash table already existed! The reduction of the CRC checks and removal of vestigial upgrade logic for deleted tables also resulted in a 2.2% reduction in upgrader size for our most important case.
With all of this in place it's clear that the various rules for unsubscription and resubscription, and the sort of historical playback that was used to try to create these immutable stages, are moot. The only thing that matters is whether we end in the unsubscribed state. Removing the unsubscribe upgrade steps from the upgrader entirely just simplifies everything. So no `@resub` is needed at all, nor are `@unsub` version numbers. Presently set to land is a set of changes that removes resubscription entirely; to resubscribe you simply remove the `@unsub` directive for your table/view.
This lets us eliminate a bunch of validations and test cases to get a simpler, clearer, and more easily verifiable upgrader. There's just much less to go wrong.
Consider even crazy cases like: an unsubscription happens in v5, the resubscription happens in v7, and a user runs the upgrade with a database that might be at v4, v5, v6, or v7 (having already been upgraded). All of these had different potential flows before. Now they are all the same. All the cases will roll the table forward to v7 from whatever version they might be on, with the usual rules and no states particular to unsubscription or resubscription. The table is present or it isn't. It is missing columns or it isn't. Same as always.
More thinking is needed here, but it's clear that now that we've arrived at this simpler place, ALL the version numbers are moot. The only thing we really have to do with version numbers is run ad hoc migrations at the appropriate time, and only once. The rules for migrators would have to change such that they are responsible for finding the state of the schema, and maybe some context could be provided for this. But ad hoc data migrators are very uncommon; regular annotations are much more common.
The fundamental assumption about how schema changes would happen was wrong. Even so, it was close enough that upgrades were over 99.99% successful when the other parts of the system were working ok. This is probably about the best we can hope for given the state of affairs with flash drives on Android devices. The current system is actually pretty close code-wise to what we had planned -- just look at the Oct 11, 2022 diff to see what I mean. It's not that big of a difference in the end. The new system has been deployed for nearly a month now and it is immune to basically all of the failure modes of the old one. It will take some time before we know its true reliability given the low failure rate of both. But we do know the new system is significantly faster. Optimizations 1 and 2 together are over 20% for full installations.
I should note that someone who was obviously smarter than me told me that we would land on a solution like this and I didn't believe them. They were of course correct. You know who you are. Sorry I doubted you.