This project includes a volcano-style, iterator-model query processor written in Python. The program processes JSON-serialized queries against tables stored in the `.csv` file format. The query processor supports Select, Project, Join, Scan, and Limit operations, and can evaluate column references and aliases, constants, arithmetic operations, comparison operations, and logical operations. The code takes a modular approach: Join, Limit, Project, Scan, and Select are all subclasses of a base operator superclass, which allows additional operations to be implemented later.
To validate this model, I ran all of the example queries included in this project. The queries exercise multiple operators, and I made sure that every operator was used in at least one query. I manually verified the correctness of each query, since the expected output of each was short enough for a human to read.
Builds a hash table from the right input based on a join key, then iterates over rows from the left input to find matching rows efficiently. The operator combines matching records from both sides into a single result and returns them one by one.
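The build/probe flow described above can be sketched as follows; the function and parameter names are illustrative, not the project's actual API:

```python
def hash_join(left_rows, right_rows, left_key, right_key):
    """Yield merged rows where left[left_key] == right[right_key]."""
    # Build phase: index the right input by its join key.
    table = {}
    for r in right_rows:
        table.setdefault(r[right_key], []).append(r)
    # Probe phase: stream the left input and emit matches one by one.
    for l in left_rows:
        for r in table.get(l[left_key], []):
            yield {**l, **r}
```

Building on the (typically smaller) right side keeps the hash table compact, and the probe phase touches each left row exactly once.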
Loads and sorts both input tables by their join keys, then sequentially scans through them to find and combine matching rows.
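A hedged sketch of that sort-merge strategy (illustrative names, not the project's implementation): sort both inputs, then advance two cursors in lockstep, emitting the cross product of each run of equal keys.

```python
def sort_merge_join(left_rows, right_rows, left_key, right_key):
    """Yield merged rows where left[left_key] == right[right_key]."""
    left = sorted(left_rows, key=lambda r: r[left_key])
    right = sorted(right_rows, key=lambda r: r[right_key])
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][left_key], right[j][right_key]
        if lk < rk:
            i += 1          # left cursor is behind; advance it
        elif lk > rk:
            j += 1          # right cursor is behind; advance it
        else:
            # Find the end of the equal-key run on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][right_key] == lk:
                j_end += 1
            # Pair every left row in the run with every matching right row.
            while i < len(left) and left[i][left_key] == lk:
                for k in range(j, j_end):
                    yield {**left[i], **right[k]}
                i += 1
            j = j_end
```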
Traverses the query plan tree and moves filter conditions (Select operators) as close as possible to the data sources (Scan operators).
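A minimal sketch of this transformation, assuming plans are nested dicts with an `"op"` field; the node shape here is hypothetical, not the project's actual JSON schema. It pushes a Select beneath a Project so the filter runs before the projection (this assumes the predicate only references columns that survive the projection):

```python
def push_down(node):
    """Recursively move Select nodes down the plan tree toward Scans."""
    if node["op"] == "select":
        child = push_down(node["child"])
        if child["op"] == "project":
            # Swap: filter first, then project.
            child["child"] = push_down({"op": "select",
                                        "predicate": node["predicate"],
                                        "child": child["child"]})
            return child
        return {"op": "select", "predicate": node["predicate"], "child": child}
    if "child" in node:
        node["child"] = push_down(node["child"])
    return node
```

Filtering earlier shrinks the row stream before the more expensive operators (especially joins) run, which is the payoff the optimization flag below controls.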
- All queries (`.json` files) and tables (`.csv` files) can be found in `data/`
- There are two example folders (`example1` and `example2`), each containing different queries and table data
- Set `CENGINE_OPTIMIZE=true/false` to enable/disable the predicate pushdown optimization
- Set `CENGINE_JOIN_ALGO=nested_loop/hash/sort_merge` to select nested-loop, hash, or sort-merge joins
Run a query:

```
CENGINE_OPTIMIZE=[true/false] CENGINE_JOIN_ALGO=[nested_loop/hash/sort_merge] python3 query_processor.py /path/to/query.json
```

For example:

```
CENGINE_OPTIMIZE=true CENGINE_JOIN_ALGO=hash python3 query_processor.py data/example2/query_open_order_line_totals.json
```

Run the benchmark:

```
python3 benchmark.py
```
- Since I had no idea where to start this project, I asked ChatGPT to generate a folder structure and some boilerplate code. The ideas for the modular approach came from ChatGPT; I filled in the rest of the details. I also had ChatGPT generate the benchmarking script, as well as most of the predicate pushdown code.