reljr
is a relational algebra evaluator written in the Clojure programming
language to satisfy the requirements for Project Four in the Fall 2020 session
of the Advanced Databases course offered at USF, under the instruction of Dr.
Tu.
After cloning the reljr
repository, install yarn.
From the main reljr
directory:
yarn install
yarn shadow-cljs watch reljr
This will serve the debug build of reljr
at localhost:3000
. To build the
release build, use yarn shadow-cljs release reljr
. The release build can be
served using e.g. python -m http.server
from reljr/public.
We are also hosting an instance of reljr at reljr.com
.
reljr
expects input tables stored as csv files, with a header row.- Use
read
to read a table in.- If the filename is
bar.csv
, reljr will store the table in memory asbar
. - If reljr already knows about a table with this name, it will be overwritten.
- If the filename is
- Table names are case-sensitive, so reading
bar.csv
andBar.csv
will result in two separate tables in memory. - Using
store <query> as <name>
will store the table that results from a query (see below) with the given name. As always, if the name exists, the table will be overwritten.store
will rename all the columns to use the new relation name as by the rename operator. reljr
assumes that every column in the table shares the same qualifier (i.e., columnsS.n
andR.n
can’t be read in from the same file.)
- To write a table out to a file, use
write <tablename>
. This command launches a file download dialogue. write <tablename> as <name>
will write out a file to<name>
, where name can be a csv file or an extensionless file.reljr
doesn’t support any other file extensions.
- You can delete a table from memory with
delete <tablename>
. - If you make changes to a table and want to reset it within
reljr
, useread
again. - The
list
command will output the names of all currently loaded tables, along with each tables’ column names.
A query can be run on its own after a table load or as part of e.g. the store
command above. Running commands on tables that haven’t been loaded into reljr
is an error.
reljr
supports the following types of relational algebra queries, with the
tokens reljr
recognizes for the operation listed after each:
- Projection (π,
pi
,project
) - Selection (σ,
sigma
,select
), with the following boolean operations:- Not (¬,
not
) - And (∧,
&&
,and
) - Or (∨,
||
,or
) - Equality (=)
- Inequality (≠,
!=
) - Greater than (>)
- Greater than or equal to (≥,
>=
) - Less than (<)
- Less than or equal to ( ≤,
<=
)
- Not (¬,
- Renaming of a column or relation (ρ,
rho
,rename
) - Order by (τ,
tau
,order by
), followed by a token to indicate the order of the column:ASC
,asc
,DESC
,desc
- Group by (γ,
gamma
,group by
), with support for these aggregates:count(*)
count(<colname>)
min(<colname>)
max(<colname>)
sum(<colname>)
avg(<colname>)
- Cross product (×,
*
,cross join
) - Inner Join(⋈, ⨝,
join
,inner join
), with support for the same boolean operations as Selection - Natural Join(⋈, ⨝,
natural join
) - The set operations:
- Division (÷,
/
) - Intersection (∩,
intersect
) - Union (∪,
union
) - Subtraction (
-
,\\
,except
)
- Division (÷,
Reljr has limited support for expressions in projection (pi R.x + 1 -> y R
),
selection (sigma R.x + 1 > 2 R
) and group by (gamma a; max(a - b) R
).
The preprocess <query>
and optimize <query>
can be used to inspect how reljr
manipulates queries.
The grammar for the reljr parser and CLI commands is written in Extended
Backus-Naur Form (EBNF). The grammar is entirely contained within the resource
file RAParser.bnf. The grammar and operator precedence order for QueryCommand
and all its child nonterminals were developed using the railroad diagrams in the
RelaX wiki as a reference. The grammar and operator precedence for the CLI
commands are unique to reljr
. We used the third-party Instaparse parser
generator library to construct a parser from the EBNF. Instaparse’s detailed
error reporting makes it possible for reljr
to provide much better information
about syntactic errors in user queries compared to those offered by RelaX.
In decreasing order of precedence:
- Reading a table from a file
- Storing a query to a table in memory
- Renaming an in-memory table
- Deleting an in-memory table
- Writing a table to a file
- Listing all in-memory tables
- Quitting
reljr
- A relational algebra expression (
RAExpression
)
These precedences are the same as those used by RelaX. In decreasing order of precedence:
- Projection, Selection, Rename Columns, Rename Relation, Group, Order By
- Cross Product, Inner Join, Natural Join, Division, a relation name, an RAExpression within parentheses
- Intersection
- Union, Subtraction
The preprocessor is used to convert from the raw abstract syntax tree produced by our parser into an internal AST fit for later operations. It also performs some basic sanity checking on the provided queries- in particular, it checks that the operations requested will not have column name collisions.
Reljr implements a simple query optimizer: it optimizes groups of selections, cross products, and inner joins into left-deep trees of inner joins on selections. The optimizer works in three passes:
- Optimize up: Convert inner joins into selections of cross products, move selections to the root of join subtrees and merge them. This is a normalization step, converting any subtree of selections/joins/cross products into a single selection of cross products.
- CNF selections: Convert the predicate of each remaining selection into conjunctive normal form
- Normalize joins: Convert each selection+cross-product subtree into a left-deep tree of inner joins and cross products by pushing the clauses of the selection condition down the tree. Clauses that only involve one sub-query become selection nodes on the leaves of the join tree. This step could be taught to consider alternate join algorithms or have preferences about selection criteria, but due to time constraints this will have to be left for future work.
The source file interpeter.clj deals with the evaluation of relational algebra
expressions and related computations. However, the functions in this file
(notably evaluate
) depend on the table abstraction specified in table.clj.
Because of this, it is necessary to discuss the implementation of the functions
that constitute the table abstraction before discussing the implementation of
the interpreter itself.
reljr
stores tables internally as sets of maps. This is a convenient
representation: sets trivially guarantee uniqueness of tuples and maps keep
column information close to table data. As an example, consider the table below.
The Clojure code below the table shows the same table as a Clojure data
structure (assuming the table was read from the file foo.csv
.)
R.x | S.x | R.y |
---|---|---|
5 | 6.3 | 6 |
3 | 9.4 | 2 |
{"foo"
#{{:R/x 5 :S/x 6.3 :R/y 6}
{:R/x 3 :S/x 9.4 :R/y 2}}}
The project
function expects to receive two arguments, table
and keys
. For the query pi x,y foo
and the same data structure as above, the arguments are as follows:
table = #{{:R/x 5 :S/x 6.3 :R/y 6} {:R/x 3 :S/x 9.4 :R/y 2}}
keys = (:R/x :R/y)
The project
function takes advantage of that fact that Clojure keys (e.g.
:R/x
) can be used as functions to get the value from a map that is associated
with that key (as in (:R/x {:R/x 2})
, which returns 2
.) val-funcs
is then
a function that applies each of the keys to a row of the table. map
performs
the actual iteration over table rows. The return value of this function is of
the same form as the input (a set of maps) but with only the requested columns:
#{#:R{:x 3, :y 2} #:R{:x 5, :y 6}}
The select
function expects to receive a table as we’ve seen before, as well
as a test
, which is a predicate function. The function
=predicate-runner= handles the creation of
these predicates. With the predicate in hand, select
filters the tuples in
table
using the Clojure built-in filter
, putting those tuples into a new
set, and returns.
The rename
function renames a relation. It iterates through every tuple in the
table, creating a new tuple with the appropriate name. The iteration through
tuples is necessary because column names are namespaced by the table name, and
so every table member requires a rename.
Rename column works by using the assoc=/ =dissoc
pair of functions. dissoc
removes a mapping with the given key from a map, while assoc adds one. The
“thread-first”, or “forward-threading”, macro ->
in the call to map is
equivalent to the following expression:
(dissoc (assoc r to (from r)) from)
The order-records-by
function orders the tuples in a table
based on a set of
col-rules
, where each col-rule
is a pair composed of a namespaced key (a
column name, like :R/a
) and a clojure function, which is one of the following
comparators:
#(< (compare %1 %2) 0)
#(> (compare %1 %2) 0)
Clojure’s compare
function performs the following (taken from ClojureDocs):
...Returns a negative number, zero, or a positive number when x is logically
'less than', 'equal to', or 'greater than' y. Same as Java x.compareTo(y) except
it also works for nil, and compares numbers and collections in a
type-independent manner....
These comparators are used in the ordering
function (contained within
order-records-by
) to determine row order. Once the ordering
function has
been built up, it is applied to the rows of the table by into
. into
’s
arguments are an empty sorted-set
, whose comparator is now the entire
ordering
function. into
works by taking every member of table
and adding
it to the set. When each element gets added, it is placed in the set according
to the comparator (the ordering
function).
The group-records-by
function returns a new table with the desired columns and
new columns for any aggregates requested by the user. This function relies on
the Clojure built-in group-by
. We consider as an example the query gamma b;
count(b) -> baz bar
, on the following relation:
#{{:R/a 1, :R/b "a", :R/c "d"}
{:R/a 4, :R/b "d", :R/c "f"}
{:R/a 3, :R/b "c", :R/c "c"}
{:R/a 5, :R/b "d", :R/c "b"}
{:R/a 6, :R/b "e", :R/c "f"}}
Calling group-by
on table gives the following:
{["a"] [{:R/a 1, :R/b "a", :R/c "d"}]
["d"] [{:R/a 4, :R/b "d", :R/c "f"} {:R/a 5, :R/b "d", :R/c "b"}]
["c"] [{:R/a 3, :R/b "c", :R/c "c"}]
["e"] [{:R/a 6, :R/b "e", :R/c "f"}]}
The first vector (the keys of this map) is the value for the column we have
selected on; the function places it into the resulting table later on (inside of
the map
.) The key for each vector is a vector of the rows that contained that
value.
We used Clojure’s into
to accomplish this. The outer into
places the result of
the for
into a set. The for
loop is equivalent to the following for loop in
pseudocode that is a little more imperative:
for r1 in table1:
for r2 in table2:
put every r2 item into r1
The result accumulates as the return value of the for
loop, which is then
placed into the set by the outer into
as discussed earlier.
The inner-join
function is implemented in the same manner as cross-product
,
but with a test condition applied. The test condition comes from
=predicate-runner= just as it does in select
.
The natural-join
function is implemented using project
and inner-join
,
where both the test condition for the inner join and the columns to be projected
out are computed by the make-natural-join-test
function. We will show the
execution of this function using the following tables as examples:
{"zoop"
#{#:S{:a 4, :z "d", :g "f"}
#:S{:a 5, :z "d", :g "b"}
#:S{:a 6, :z "e", :g "f"}
#:S{:a 1, :z "a", :g "d"}
#:S{:a 3, :z "g", :g "g"}}}
{"bar"
#{#:R{:a 1, :b "a", :c "d"}
#:R{:a 4, :b "d", :c "f"}
#:R{:a 3, :b "c", :c "c"}
#:R{:a 5, :b "d", :c "b"}
#:R{:a 6, :b "e", :c "f"}}}
The example query for the two tables is zoop natural join bar
.
Our natural join implementation mirrors RelaX’s in that it always keeps the join
column from the left of the two tables in the natural join. The complexity of
make-natural-join-test
arises from the possibility for two specific join
scenarios:
- Two tables may each have multiple columns that can be join on (e.g. if
zoop
andbar
above each had a columnb
), or - Either (or both) of the two tables has multiple columns with the same name
but different namespaces (e.g. if
zoop
had anS.a
column and aQuux.a
column)
Because of this, the test condition must determine, for every pair of join columns for every pair of records in the table, which records will be kept.
Let’s take a close look at the test function returned by
make-natural-join-test
:
(fn [l r]
(every? (fn [[lj rj]]
(every? (set (rj r))
(lj l)))
tests))
It’s more instructive if we swap in some actual values; we’ll choose two tuples
that will make the test return true
:
(fn [{:S/a 4, :S/z "d", :S/g "f"} {:R/a 4, :R/b "d", :R/c "f"}]
(every? (fn [[lj rj]]
(every? (set (rj {:R/a 4, :R/b "d", :R/c "f"}))
(lj {:S/a 4, :S/z "d", :S/g "f"})))
[(apply juxt [:S/a]) (apply juxt [:R/a])]))
lj
and rj
take the values of the vector right after the inner fn
(apply
juxt...
); this has the affect of applying a key to a map, which returns the
value of that key from the map:
((apply juxt [:S/a]) {:S/a 4, :S/z "d", :S/g "f"}) ;; => 4
((apply juxt [:R/a]) {:R/a 4, :S/z "d", :S/g "f"}) ;; => 4
The last trick in the function here is the call to (set)
, which makes a set
from the result of the inner function application (here, it’s the number 4). A
clojure set can be used as a function to determine if an item is contained in
the set, as shown in the following examples:
((set [4 2 3]) 5) ;; => nil
((set [4 2 3]) 2) ;; => 2
(map (set [4 2 3]) [5 1 2 6 3]) ;; => (nil nil 2 nil 3)
Going back to the code, we see that all we’ve done is create a set
with 4 in
it (the value of applying the keys to the right tuple) and checked that the
values from the left tuple are all contained in that set. If each table had more
than one column that matched columns from the other table, the tests
list
would contain a pair of test functions for each pair of columns, and we would
iterate through all of those with the outer every?
.
The evaluator walks the expression tree and performs the corresponding table operations as it goes.
There are a few ways we could improve on Reljr going forward. The simplest being improving the surface-level support for more flexible queries. For example, we could implement more functions and operators to use in projection/selection expressions. Alternately, we could improve the reljr internals: we could add support for indexes, or teach the query optimizer about table statistics.