reljr

Introduction

reljr is a relational algebra evaluator written in the Clojure programming language to satisfy the requirements for Project Four in the Fall 2020 session of the Advanced Databases course offered at USF, under the instruction of Dr. Tu.

Getting `reljr`

After cloning the reljr repository, install yarn. From the main reljr directory:

yarn install
yarn shadow-cljs watch reljr

This will serve the debug build of reljr at localhost:3000. To build the release build, use yarn shadow-cljs release reljr. The release build can be served using e.g. python -m http.server from reljr/public.

We are also hosting an instance of reljr at reljr.com.

Using `reljr`

Reading tables

reljr expects input tables stored as csv files, with a header row.
Use read to read a table in.
- If the filename is bar.csv, reljr will store the table in memory as bar.
- If reljr already knows about a table with this name, it will be overwritten.
Table names are case-sensitive, so reading bar.csv and Bar.csv will result in two separate tables in memory.
Using store <query> as <name> will store the table that results from a query (see below) with the given name. As always, if the name exists, the table will be overwritten. store will rename all the columns to use the new relation name as by the rename operator.
reljr assumes that every column in the table shares the same qualifier (i.e., columns S.n and R.n can’t be read in from the same file.)

Writing tables

To write a table out to a file, use write <tablename>. This command launches a file download dialogue.
write <tablename> as <name> will write out a file to <name>, where name can be a csv file or an extensionless file. reljr doesn’t support any other file extensions.

Managing in-memory tables

You can delete a table from memory with delete <tablename>.
If you make changes to a table and want to reset it within reljr, use read again.
The list command will output the names of all currently loaded tables, along with each tables’ column names.

Performing queries

A query can be run on its own after a table load or as part of e.g. the store command above. Running commands on tables that haven’t been loaded into reljr is an error.

reljr supports the following types of relational algebra queries, with the tokens reljr recognizes for the operation listed after each:

Projection (π, pi, project)
Selection (σ, sigma, select), with the following boolean operations:
- Not (¬, not)
- And (∧, &&, and)
- Or (∨, ||, or)
- Equality (=)
- Inequality (≠, !=)
- Greater than (>)
- Greater than or equal to (≥, >=)
- Less than (<)
- Less than or equal to ( ≤, <=)
Renaming of a column or relation (ρ, rho, rename)
Order by (τ, tau, order by), followed by a token to indicate the order of the column:
- ASC, asc, DESC, desc
Group by (γ, gamma, group by), with support for these aggregates:
- count(*)
- count(<colname>)
- min(<colname>)
- max(<colname>)
- sum(<colname>)
- avg(<colname>)
Cross product (×, *, cross join)
Inner Join(⋈, ⨝, join, inner join), with support for the same boolean operations as Selection
Natural Join(⋈, ⨝, natural join)
The set operations:
- Division (÷, /)
- Intersection (∩, intersect)
- Union (∪, union)
- Subtraction (-, \\, except)

Reljr has limited support for expressions in projection (pi R.x + 1 -> y R), selection (sigma R.x + 1 > 2 R) and group by (gamma a; max(a - b) R).

Preprocess queries

The preprocess <query> and optimize <query> can be used to inspect how reljr manipulates queries.

Implementation Details

Parser

The grammar for the reljr parser and CLI commands is written in Extended Backus-Naur Form (EBNF). The grammar is entirely contained within the resource file RAParser.bnf. The grammar and operator precedence order for QueryCommand and all its child nonterminals were developed using the railroad diagrams in the RelaX wiki as a reference. The grammar and operator precedence for the CLI commands are unique to reljr. We used the third-party Instaparse parser generator library to construct a parser from the EBNF. Instaparse’s detailed error reporting makes it possible for reljr to provide much better information about syntactic errors in user queries compared to those offered by RelaX.

CLI Operation Precedence

In decreasing order of precedence:

Reading a table from a file
Storing a query to a table in memory
Renaming an in-memory table
Deleting an in-memory table
Writing a table to a file
Listing all in-memory tables
Quitting reljr
A relational algebra expression (RAExpression)

RAExpression Precedence

These precedences are the same as those used by RelaX. In decreasing order of precedence:

Projection, Selection, Rename Columns, Rename Relation, Group, Order By
Cross Product, Inner Join, Natural Join, Division, a relation name, an RAExpression within parentheses
Intersection
Union, Subtraction

Preprocessor

The preprocessor is used to convert from the raw abstract syntax tree produced by our parser into an internal AST fit for later operations. It also performs some basic sanity checking on the provided queries- in particular, it checks that the operations requested will not have column name collisions.

Optimizer

Reljr implements a simple query optimizer: it optimizes groups of selections, cross products, and inner joins into left-deep trees of inner joins on selections. The optimizer works in three passes:

Optimize up: Convert inner joins into selections of cross products, move selections to the root of join subtrees and merge them. This is a normalization step, converting any subtree of selections/joins/cross products into a single selection of cross products.
CNF selections: Convert the predicate of each remaining selection into conjunctive normal form
Normalize joins: Convert each selection+cross-product subtree into a left-deep tree of inner joins and cross products by pushing the clauses of the selection condition down the tree. Clauses that only involve one sub-query become selection nodes on the leaves of the join tree. This step could be taught to consider alternate join algorithms or have preferences about selection criteria, but due to time constraints this will have to be left for future work.

Table Abstraction/Operations

The source file interpeter.clj deals with the evaluation of relational algebra expressions and related computations. However, the functions in this file (notably evaluate) depend on the table abstraction specified in table.clj. Because of this, it is necessary to discuss the implementation of the functions that constitute the table abstraction before discussing the implementation of the interpreter itself.

reljr stores tables internally as sets of maps. This is a convenient representation: sets trivially guarantee uniqueness of tuples and maps keep column information close to table data. As an example, consider the table below. The Clojure code below the table shows the same table as a Clojure data structure (assuming the table was read from the file foo.csv.)

R.x	S.x	R.y
5	6.3	6
3	9.4	2

{"foo"
   #{{:R/x 5 :S/x 6.3 :R/y 6}
     {:R/x 3 :S/x 9.4 :R/y 2}}}

Project

The project function expects to receive two arguments, table and keys. For the query pi x,y foo and the same data structure as above, the arguments are as follows:

table = #{{:R/x 5 :S/x 6.3 :R/y 6} {:R/x 3 :S/x 9.4 :R/y 2}}
keys = (:R/x :R/y)

The project function takes advantage of that fact that Clojure keys (e.g. :R/x) can be used as functions to get the value from a map that is associated with that key (as in (:R/x {:R/x 2}), which returns 2.) val-funcs is then a function that applies each of the keys to a row of the table. map performs the actual iteration over table rows. The return value of this function is of the same form as the input (a set of maps) but with only the requested columns:

#{#:R{:x 3, :y 2} #:R{:x 5, :y 6}}

Select

The select function expects to receive a table as we’ve seen before, as well as a test, which is a predicate function. The function =predicate-runner= handles the creation of these predicates. With the predicate in hand, select filters the tuples in table using the Clojure built-in filter, putting those tuples into a new set, and returns.

Rename

The rename function renames a relation. It iterates through every tuple in the table, creating a new tuple with the appropriate name. The iteration through tuples is necessary because column names are namespaced by the table name, and so every table member requires a rename.

Rename-Column

Rename column works by using the assoc=/ =dissoc pair of functions. dissoc removes a mapping with the given key from a map, while assoc adds one. The “thread-first”, or “forward-threading”, macro -> in the call to map is equivalent to the following expression:

(dissoc (assoc r to (from r)) from)

Order-Records-By

The order-records-by function orders the tuples in a table based on a set of col-rules, where each col-rule is a pair composed of a namespaced key (a column name, like :R/a) and a clojure function, which is one of the following comparators:

#(< (compare %1 %2) 0)
#(> (compare %1 %2) 0)

Clojure’s compare function performs the following (taken from ClojureDocs):

...Returns a negative number, zero, or a positive number when x is logically
'less than', 'equal to', or 'greater than' y. Same as Java x.compareTo(y) except
it also works for nil, and compares numbers and collections in a
type-independent manner....

These comparators are used in the ordering function (contained within order-records-by) to determine row order. Once the ordering function has been built up, it is applied to the rows of the table by into. into’s arguments are an empty sorted-set, whose comparator is now the entire ordering function. into works by taking every member of table and adding it to the set. When each element gets added, it is placed in the set according to the comparator (the ordering function).

Group-Records-By

The group-records-by function returns a new table with the desired columns and new columns for any aggregates requested by the user. This function relies on the Clojure built-in group-by. We consider as an example the query gamma b; count(b) -> baz bar, on the following relation:

#{{:R/a 1, :R/b "a", :R/c "d"}
  {:R/a 4, :R/b "d", :R/c "f"}
  {:R/a 3, :R/b "c", :R/c "c"}
  {:R/a 5, :R/b "d", :R/c "b"}
  {:R/a 6, :R/b "e", :R/c "f"}}

Calling group-by on table gives the following:

{["a"] [{:R/a 1, :R/b "a", :R/c "d"}]
 ["d"] [{:R/a 4, :R/b "d", :R/c "f"} {:R/a 5, :R/b "d", :R/c "b"}]
 ["c"] [{:R/a 3, :R/b "c", :R/c "c"}]
 ["e"] [{:R/a 6, :R/b "e", :R/c "f"}]}

The first vector (the keys of this map) is the value for the column we have selected on; the function places it into the resulting table later on (inside of the map.) The key for each vector is a vector of the rows that contained that value.

Cross-Product

We used Clojure’s into to accomplish this. The outer into places the result of the for into a set. The for loop is equivalent to the following for loop in pseudocode that is a little more imperative:

for r1 in table1:
    for r2 in table2:
        put every r2 item into r1

The result accumulates as the return value of the for loop, which is then placed into the set by the outer into as discussed earlier.

Inner-Join

The inner-join function is implemented in the same manner as cross-product, but with a test condition applied. The test condition comes from =predicate-runner= just as it does in select.

Natural-Join

The natural-join function is implemented using project and inner-join, where both the test condition for the inner join and the columns to be projected out are computed by the make-natural-join-test function. We will show the execution of this function using the following tables as examples:

{"zoop"
 #{#:S{:a 4, :z "d", :g "f"}
   #:S{:a 5, :z "d", :g "b"}
   #:S{:a 6, :z "e", :g "f"}
   #:S{:a 1, :z "a", :g "d"}
   #:S{:a 3, :z "g", :g "g"}}}

{"bar"
 #{#:R{:a 1, :b "a", :c "d"}
   #:R{:a 4, :b "d", :c "f"}
   #:R{:a 3, :b "c", :c "c"}
   #:R{:a 5, :b "d", :c "b"}
   #:R{:a 6, :b "e", :c "f"}}}

The example query for the two tables is zoop natural join bar.

Our natural join implementation mirrors RelaX’s in that it always keeps the join column from the left of the two tables in the natural join. The complexity of make-natural-join-test arises from the possibility for two specific join scenarios:

Two tables may each have multiple columns that can be join on (e.g. if zoop and bar above each had a column b), or
Either (or both) of the two tables has multiple columns with the same name but different namespaces (e.g. if zoop had an S.a column and a Quux.a column)

Because of this, the test condition must determine, for every pair of join columns for every pair of records in the table, which records will be kept.

Let’s take a close look at the test function returned by make-natural-join-test :

(fn [l r]
  (every? (fn [[lj rj]]
            (every? (set (rj r))
                    (lj l)))
          tests))

It’s more instructive if we swap in some actual values; we’ll choose two tuples that will make the test return true:

(fn [{:S/a 4, :S/z "d", :S/g "f"} {:R/a 4, :R/b "d", :R/c "f"}]
  (every? (fn [[lj rj]]
            (every? (set (rj {:R/a 4, :R/b "d", :R/c "f"}))
                    (lj {:S/a 4, :S/z "d", :S/g "f"})))
          [(apply juxt [:S/a]) (apply juxt [:R/a])]))

lj and rj take the values of the vector right after the inner fn (apply juxt...); this has the affect of applying a key to a map, which returns the value of that key from the map:

((apply juxt [:S/a])  {:S/a 4, :S/z "d", :S/g "f"}) ;; => 4
((apply juxt [:R/a])  {:R/a 4, :S/z "d", :S/g "f"}) ;; => 4

The last trick in the function here is the call to (set), which makes a set from the result of the inner function application (here, it’s the number 4). A clojure set can be used as a function to determine if an item is contained in the set, as shown in the following examples:

((set [4 2 3]) 5) ;; => nil
((set [4 2 3]) 2) ;; => 2
(map (set [4 2 3]) [5 1 2 6 3]) ;; => (nil nil 2 nil 3)

Going back to the code, we see that all we’ve done is create a set with 4 in it (the value of applying the keys to the right tuple) and checked that the values from the left tuple are all contained in that set. If each table had more than one column that matched columns from the other table, the tests list would contain a pair of test functions for each pair of columns, and we would iterate through all of those with the outer every?.

Evaluator

The evaluator walks the expression tree and performs the corresponding table operations as it goes.

Future Work

There are a few ways we could improve on Reljr going forward. The simplest being improving the surface-level support for more flexible queries. For example, we could implement more functions and operators to use in projection/selection expressions. Alternately, we could improve the reljr internals: we could add support for indexes, or teach the query optimizer about table statistics.

Name		Name	Last commit message	Last commit date
Latest commit History 197 Commits
.github/workflows		.github/workflows
public		public
resources		resources
src/reljr		src/reljr
tests/reljr		tests/reljr
.dockerignore		.dockerignore
.gitignore		.gitignore
Dockerfile		Dockerfile
README.org		README.org
fly.toml		fly.toml
package.json		package.json
shadow-cljs.edn		shadow-cljs.edn
yarn.lock		yarn.lock

reljr/reljr

Folders and files

Latest commit

History

Repository files navigation