We use the Gradle shadowJar plugin to build the project.
- Make sure to run python word2vec_server.py beforehand (this will use port 10000 by default).
- Ensure your database properties are set correctly in config.properties in the main project folder. If this file doesn't exist, you should create it by copying
- A functioning MySQL instance with the necessary data from each dataset pre-loaded. For example, have MySQL up and running, then create a database named mas and follow the instructions in the MAS dataset README.
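Before building, it can help to confirm that the word2vec server is actually reachable. The following is a minimal sketch, not part of Templar: the function name is_port_open is hypothetical, and it assumes the server accepts TCP connections on localhost at the default port 10000 mentioned above.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 10000, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # Assumption: the word2vec server listens for plain TCP connections.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    if is_port_open():
        print("Something is listening on port 10000.")
    else:
        print("Nothing is listening on port 10000; start word2vec_server.py first.")
```

This only checks that some process is listening on the port; it does not verify that the process is the word2vec server.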
Running Templar tests
TemplarCV: Runs a cross-validation test on a specific dataset given some parameters.
After building, we can run:
java -cp build/libs/templar-all.jar edu.umich.templar.TemplarCV <dataset> <log_level> <log_join_on>
Choices for each argument:
Disabling the candidates cache
Since many keywords are reused frequently within each dataset, we implemented a cache to speed up testing. It can be enabled or disabled by changing the ENABLE_CACHE setting in the
These caches are saved in data/<dataset>/<dataset>.cands.cache, so to clear a cache, just delete the corresponding file.
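Clearing the caches can also be scripted. The following is a minimal sketch, not part of Templar: the function name clear_cands_cache and the data_root parameter are hypothetical, and it assumes the cache files follow the data/<dataset>/<dataset>.cands.cache layout described above.

```python
from pathlib import Path
from typing import List, Optional


def clear_cands_cache(data_root: str = "data", dataset: Optional[str] = None) -> List[Path]:
    """Delete <data_root>/<dataset>/<dataset>.cands.cache and return the removed paths.

    If dataset is None, every dataset folder directly under data_root is checked.
    """
    root = Path(data_root)
    if dataset is not None:
        folders = [root / dataset]
    elif root.is_dir():
        folders = sorted(p for p in root.iterdir() if p.is_dir())
    else:
        folders = []

    removed = []
    for folder in folders:
        # The cache file is named after its dataset folder.
        cache_file = folder / f"{folder.name}.cands.cache"
        if cache_file.is_file():
            cache_file.unlink()
            removed.append(cache_file)
    return removed
```

Calling clear_cands_cache("data", "mas") would remove only the cache for the mas dataset, while clear_cands_cache("data") would remove the cache of every dataset folder it finds.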
Adding new datasets
To add a new dataset, you need to:
- Load the dataset with name
- Create the folder data/<dataset>. Each dataset is required to have the following files (see existing datasets for examples):
<dataset>_keywords.csv: pre-parsed keywords, metadata, and answers. See other datasets for examples. Note specifically that we allow multiple correct answers, separated by semicolons, and that pairs are given in comma-separated form. This formatting matters because our accuracy evaluation is done via string comparison.
<dataset>_joins.csv: correct join paths for each query. These are in a nested, parenthetical format: the first table alphabetically always comes first, a table's children are given in parentheses after it, and multiple children of a table are separated by commas. For example, author(organization,writes(publication)) is a join path where author is the first table name alphabetically, its children are organization and writes, and writes in turn has publication as a child. This formatting matters because our accuracy evaluation is done via string comparison.
<dataset>_all.sqls: the correct SQL labels for each NLQ, one query per line. This is fed in as our query log.
<dataset>.fkpk.json: a JSON file listing all the foreign key-primary key relationships in the schema
<dataset>.main_attrs.json: defining the main/display/default attributes for each relation
<dataset>.proj_attrs.json: defining the paired attributes for each relation
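The nested join-path format used in <dataset>_joins.csv can be sketched in code. This is an illustrative sketch, not Templar's own implementation: the function name format_join_path is hypothetical, it assumes the joins form a tree (no cycles, each table appearing once), and it assumes children are listed alphabetically, which matches the example above but is not stated explicitly.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple


def format_join_path(joins: Iterable[Tuple[str, str]]) -> str:
    """Render an undirected, acyclic set of table joins as a nested string."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for left, right in joins:
        neighbors[left].add(right)
        neighbors[right].add(left)
    if not neighbors:
        return ""

    # The first table alphabetically is always the root of the path.
    root = min(neighbors)

    def render(table: str, parent: Optional[str]) -> str:
        # Assumption: children are emitted in alphabetical order.
        children = sorted(t for t in neighbors[table] if t != parent)
        if not children:
            return table
        return table + "(" + ",".join(render(c, table) for c in children) + ")"

    return render(root, None)


if __name__ == "__main__":
    joins = [("writes", "author"), ("organization", "author"), ("publication", "writes")]
    print(format_join_path(joins))  # author(organization,writes(publication))
```

Because evaluation is done via string comparison, a deterministic ordering like this produces the same string no matter what order the joins are supplied in.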