NOTE: This project is actively maintained, but is going through changes rapidly since it is
research code. We do our best to make sure the code works after cloning and running installation
steps, but greatly appreciate any bug reports and encourage you to open a pull request to fix the
bug or add documentation. We will make a note here when we create a stable 2.0
tag.
The primary way to run Qanta is to use our Packer and Terraform scripts to run it on Elastic Compute Cloud (EC2), which is part of Amazon Web Services (AWS). The alternative is to inspect the bash scripts associated with our Packer/Terraform scripts to infer the setup procedure.
Packer installs dependencies that don't need to know about runtime information (eg, it installs apt-get software, downloads software distributions, etc). Terraform takes care of creating AWS EC2 machines and provisioning them correctly (networking, secrets, DNS, SSD drives, etc).
WARNING: Running Qanta scripts will create EC2 instances which you will be billed for.

Qanta scripts by default use Spot Instances to get machines at the lowest price, at the expense that they may be terminated at any time if demand increases. We find in practice that using the region us-west-1 makes such terminations rare. Qanta primarily uses `r3.8xlarge` machines, which have 32 CPU cores, 244GB of RAM, and 640GB of SSD storage, but other EC2 Instance Types are available.
To execute the AWS scripts you will need to follow these steps (`brew` options are for Macs):

- Install the Packer binaries or run `brew install packer`
- Install Terraform 0.7.x or run `brew install terraform`
- Python 3.5+: If you don't have a preferred distribution, Anaconda Python is a good choice
- Install the AWS command line tools via `pip3 install awscli`. Run `pip3 install pyhcl`
- Run `aws configure` to set up your AWS credentials, and set the default region to `us-west-2`
- Create an EC2 key pair
- Set the environment variable `TF_VAR_key_pair` to the key pair name from the prior step
- Set the environment variables `TF_VAR_access_key` and `TF_VAR_secret_key` to match your AWS credentials (a sketch of this setup follows the list)
  - WARNING: These are copied by Terraform to the cluster so that the cluster has S3 access. See the AWS configuration section for a summary of how the Terraform install scripts treat these keys.
- Run `bin/generate-ssh-keys.sh n` where `n` equals the number of workers. You should start with zero and scale up as necessary. This will generate SSH keys that are copied to the Spark cluster so that nodes can communicate via SSH
- Install sshuttle, which is used to create an SSH VPN
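As a rough illustration, the environment setup from the steps above might look like the following. The key pair name, access key, and secret key are placeholders you would replace with your own values:

```bash
# Mac install of the CLI tools (see the links above for other platforms)
brew install packer terraform
pip3 install awscli pyhcl

# Configure AWS credentials and the default region (us-west-2)
aws configure

# Terraform reads these TF_VAR_* variables; the values here are placeholders
export TF_VAR_key_pair=my-ec2-keypair
export TF_VAR_access_key=AKIA...
export TF_VAR_secret_key=...

# Generate SSH keys for a cluster with zero workers to start
bin/generate-ssh-keys.sh 0
```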
This section is purely informative; you can skip ahead to Run AWS Scripts.
- Python 3.6
- Apache Spark 2.1.0
- Vowpal Wabbit 8.1.1
- Docker 1.11.1
- Postgres
- KenLM
- Redis (latest stable)
- All python packages in `packer/requirements.txt`
- Creates and configures an AWS virtual private cloud, internet gateway, route table, subnet on us-west-1b, and security groups that optimize between security and convenience
- Security Groups: SSH access is enabled to the master, all other master node ports are closed to the internet, all other instances can communicate with each other but are not reachable by the internet.
- Spot instance requests for requested number of workers and a master node
- SSH keys generated from `bin/generate-ssh-keys.sh` are copied to each instance. Each instance receives its own SSH key, and every instance has SSH access to every other instance.
- AWS keys are copied to `/home/ubuntu/.bashrc`, `/home/ubuntu/dependencies/spark-1.6.1-bin-hadoop2.6/conf/spark-env.sh`, and `/home/ubuntu/.aws/credentials`.
- Warning: AWS keys are printed during `terraform apply`; we plan on fixing this, but haven't yet.
- Configures the 2 SSD drives attached to `r3.8xlarge` instances for use
- Clones the `Pinafore/qb` repository to `/ssd-c/qanta/qb` and sets it as the quiz bowl root
- Downloads bootstrap AWS files to get the system running faster
- TODO: based on variables, download data from the latest run to continue work on
The AWS scripts are split between Packer and Terraform. Packer should be run from `packer/` and Terraform from the root directory. Running Packer is optional because we publish public AMIs which Terraform uses by default. If you are developing new pieces of qanta that require new software, it might be helpful to build your own AMIs.
- (Optional) Packer: `packer build packer.json`
- Terraform: `terraform apply`, and note the `master_ip` output (see the combined sketch below)
- SSH into the `master_ip` with `ssh -i mykey.pem ubuntu@ipaddr`
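Putting the above together, a typical launch from a fresh clone might look like this; the key file name is illustrative only:

```bash
# (Optional) Build your own AMI; Packer must be run from packer/
cd packer && packer build packer.json && cd ..

# Create the cluster from the repository root and note the outputs
terraform apply

# SSH to the master using the key pair set in TF_VAR_key_pair
ssh -i mykey.pem ubuntu@$(terraform output master_public_ip)
```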
The packer step is optional because we publish the two most recent Qanta AMIs on AWS which Terraform uses automatically.
Additionally, the output from `terraform apply` is documented below and can be shown again with `terraform show`:

- `master_private_dns`: Address to access when using `sshuttle`
- `master_private_ip`: Internal AWS IP address
- `master_public_dns` and `master_public_ip`: Use for access from the open web (eg SSH)
- `vpc_id`: Useful when adding a custom security group
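If you need a single value in a script, `terraform output` prints one output at a time; the names are the ones listed above:

```bash
# Print individual output values for use in other commands or scripts
terraform output master_public_ip
terraform output vpc_id
```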
Below is a list of variables that can change the behavior of Terraform. These can also be passed into the CLI via `-var name=value`, dropping the `TF_VAR` portion (an example follows the list).

- `TF_VAR_key_pair`: Which EC2 key pair to use
- `TF_VAR_access_key`: AWS access key
- `TF_VAR_secret_key`: AWS secret key
- `TF_VAR_spot_price`: Max EC2 spot price
- `TF_VAR_master_instance_type`: Which EC2 instance type to use for the master
- `TF_VAR_worker_instance_type`: Which EC2 instance type to use for workers (TODO: no-op ATM)
- `TF_VAR_num_workers`: How many workers to use (TODO: no-op ATM)
- `TF_VAR_cluster_id`: On multi-user accounts, allows separate users to run simultaneous machines
- `TF_VAR_qb_aws_s3_bucket`: Used to set `QB_AWS_S3_BUCKET` for the checkpoint script
- `TF_VAR_qb_aws_s3_namespace`: Used to set `QB_AWS_S3_NAMESPACE` for the checkpoint script
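For example, the same setting can be given either as an environment variable or as a `-var` flag; the specific values here are placeholders:

```bash
# Set values as environment variables...
export TF_VAR_spot_price=1.00
export TF_VAR_cluster_id=alice
terraform apply

# ...or pass them on the command line, dropping the TF_VAR prefix
terraform apply -var spot_price=1.00 -var cluster_id=alice
```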
To tear down the cluster, you have two options:

- `terraform destroy` will destroy all infrastructure created, including the VPC/subnets/etc. If you want to completely reset the AWS infrastructure, this does the job.
- `terraform destroy -target=aws_spot_instance_request.master` will only destroy the EC2 instance. This is the only part of the infrastructure, aside from S3, that AWS charges you for.
Since we do not primarily develop qanta outside of AWS and setups vary widely, we don't maintain a formal set of procedures for getting qanta running without AWS. Below is a listing of the important scripts that Packer and Terraform run to install and configure a running qanta system.

- `packer/packer.json`: Inventory of commands to set up the pre-runtime image
- `packer/setup.sh`: Installs dependencies which don't require runtime information
- `aws.tf`: Terraform configuration
- `terraform/`: Bash scripts and configuration scripts
The majority of QANTA configuration is done through environment variables. These are set appropriately for AWS by Packer/Terraform, but are otherwise set to sensible defaults. Reference `conf/qb-env.sh.template` for a list of available configuration variables.
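If you are running outside of AWS, one approach is to copy the template and source your own copy; the copied filename is an assumption, and the variable names in the comments are simply ones that appear elsewhere in this README:

```bash
# Copy the template, then edit the copy for your environment
cp conf/qb-env.sh.template conf/qb-env.sh

# The configuration includes variables such as:
#   QB_SPARK_MASTER     - Spark master URL
#   QB_AWS_S3_BUCKET    - S3 bucket used by the checkpoint script
#   QB_AWS_S3_NAMESPACE - S3 namespace used by the checkpoint script

# Load the configuration into the current shell
source conf/qb-env.sh
```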
For security reasons, the AWS machines qanta creates are only accessible to the internet via SSH to the master node. To gain access to the various web UIs (Spark, Luigi, Ganglia) and other services running on the cluster there are three options:

- Create an SSH VPN with `sshuttle`. This forwards all traffic from your machine through the master node, and gives access to AWS internal routing
- Create an SSH tunnel to forward specific ports on the master to localhost
- In the EC2 Console, create a security group which whitelists your IP address and add it to the instance
`sshuttle` gets you off the ground the fastest but routes all your traffic through AWS. SSH tunneling doesn't give access to everything. EC2 Security Groups are overall the most convenient solution since they only require adding your custom group to the instance after it starts.
The reason for these security precautions is that allowing access to the Spark application master or the Spark master web UI would in principle expose a way for an attacker to gain access to your AWS credentials.
Warning: The following steps guide you through creating an SSH VPN to the master node. This means all traffic will be redirected through the master node while the SSH VPN is running. AWS charges for bandwidth which is outgoing from AWS so avoid downloading large files/videos to your machine while the VPN is active (AWS does not charge for incoming bandwidth).
We recommend that before you work with Qanta you run:

```bash
sshuttle -N -H --dns -r ubuntu@public-ip 0/0
```
The following SSH command will forward all the important UIs running on the master node to `localhost`:

```bash
ssh -L 8080:localhost:8080 -L 4040:localhost:4040 -L 8082:localhost:8082 ubuntu@instance-ip
```
This can be made easier by adding an entry like the one below to `~/.ssh/config`. Note that the example domain `example.com` is mapped to the master IP address output by Terraform. This can be accomplished by modifying `/etc/hosts` or creating a new DNS entry for the domain.
```
Host qanta
  HostName example.com
  StrictHostKeyChecking no
  UserKnownHostsFile=/dev/null
  User ubuntu
  LocalForward 8082 127.0.0.1:8082
  LocalForward 8080 127.0.0.1:8080
```
Now you can simply run `ssh qanta`, and navigating to `localhost:8082` will access the EC2 instance.
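As noted above, `example.com` needs to resolve to the master IP; a quick way to do that is an `/etc/hosts` entry (the address below is a placeholder for the `master_public_ip` reported by Terraform):

```bash
# Map the placeholder domain to the master IP reported by Terraform (address shown is fake)
echo "54.0.0.1 example.com" | sudo tee -a /etc/hosts
```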
- Go to console.aws.amazon.com
- Under "Network & Security" click "Security Groups"
- Click "Create Security Group"
- Configure with a name, any relevant inbound rules (eg from a whitelisted IP), and be sure to choose the VPC created by Terraform. The VPC can be retrieved by running `terraform show` and using the variable output from `vpc_id`.
- Under "Instances" click "Instances"
- Select your instance, click the "Actions" drop down, click "Networking" then "Change Security Groups", and finally add your security group
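The security group can also be created without the console using the AWS CLI installed earlier; the group name, description, IDs, and IP below are placeholders:

```bash
# Create a security group in the Terraform-created VPC (use the vpc_id from `terraform show`)
aws ec2 create-security-group --group-name qanta-whitelist \
    --description "Whitelist my IP" --vpc-id vpc-12345678

# Allow inbound traffic from a single whitelisted IP; use the GroupId returned above
aws ec2 authorize-security-group-ingress --group-id sg-abcdef12 \
    --protocol tcp --port 0-65535 --cidr 203.0.113.5/32
```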
If you are running on AWS, these files are already downloaded. Otherwise you will need to either run `terraform/aws-downloads.sh` to get dependencies from Amazon S3 or run the bash commands below.
```bash
# Download Wikifier (the S3 download in the script mentioned above is much faster; this is an 8GB compressed file)
wget -O /tmp/Wikifier2013.zip http://cogcomp.cs.illinois.edu/software/Wikifier2013.zip
unzip /tmp/Wikifier2013.zip -d data/external
rm /tmp/Wikifier2013.zip

# Download nltk data
python3 setup.py download
```
Before running qanta software you will need to compile the C language model and download the nltk datasets by running:

```bash
# Run pre-requisites
python3 setup.py download
make clm
```
QANTA can be run in two modes: batch or streaming. Batch mode is used for training and evaluating large batches of questions at a time. Running the batch pipeline is managed by Spotify Luigi. Luigi is a pure Python make-like framework for running data pipelines. The QANTA pipeline is specified in `qanta/pipeline.py`. Below are the pre-requisites that need to be met before running the pipeline and how to run the pipeline itself.
These steps will guide you through starting Apache Spark, Luigi, and running the pipeline. Steps marked "(Non-AWS)" are unnecessary when running qanta from the AWS instance started by Terraform.
- Start the Spark cluster by navigating into `$SPARK_HOME` and running `sbin/start-all.sh`
- Start the Luigi daemon: `luigid --background` from `/ssd-c/qanta`
- Before you can run any of the features, you need to build the guess database: `luigi --module qanta.pipeline CreateGuesses --workers 1`. You can skip ahead to the next step if you want (it will also create the guesses, but seeing the guesses will ensure that the deep guesser worked as expected).
- Run the full pipeline: `luigi --module qanta.pipeline AllSummaries --workers 30`
- Observe pipeline progress at http://hostname:8082
To rerun any part of the pipeline it is sufficient to delete the target file generated by the task you wish to rerun.
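For example, assuming a task wrote its output somewhere under `output/` (the actual target path depends on the task definition in `qanta/pipeline.py`), rerunning it might look like:

```bash
# Delete the target file produced by the task you want to rerun (this path is hypothetical)
rm output/guesses/guesses.db

# Rerun the pipeline; Luigi re-executes only the tasks whose targets are missing
luigi --module qanta.pipeline AllSummaries --workers 30
```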
Warning: This mode is highly experimental
Again, Apache Spark needs to be running at the URL specified in the environment variable `QB_SPARK_MASTER`.
Streaming mode works by coordinating several processes to predict whether or not to buzz given a line of text (a sentence, partial sentence, paragraph, or anything not containing a newline). The Qanta server is responsible for:
- Creating a socket then waiting until a connection is established
- The connection is established by starting an Apache Spark streaming job that binds its input to that socket
- Once the connection is established the Qanta server will start streaming questions to Spark until its queue is empty.
- Each question is also stored in a PostgreSQL database with a column reserved for Spark's response
- The Qanta server will start to poll the database every 100ms to see if Spark completed all the questions that were queued.
Spark Streaming will then do the following per input line:
- Read the input from the socket
- Extract all features
- Collect the features, form a Vowpal Wabbit input line, and have VW create predictions. This requires that VW is running in daemon mode
- Save the output to a PostgreSQL database
Once the outputs are saved in the PostgreSQL database:
- The Qanta server reads the results and outputs the desired quantities
With that high-level overview in place, here is how you start the whole system.
- Start the PostgreSQL database (Docker must be running first): `bin/start-postgres.sh`
- Start the Vowpal Wabbit daemon: `bin/start-vw-daemon.sh model-file.vw` (eg `data/models/sentence.16.vw`)
- Start the Qanta server: `python3 cli.py qanta_stream`
- Start the Spark Streaming job: `python3 cli.py spark_stream`
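Put together, a start-up sequence might look like the following; the model file is the example name from above, and each long-running command would normally go in its own terminal or be backgrounded:

```bash
# 1. Database (requires Docker to be running)
bin/start-postgres.sh

# 2. Vowpal Wabbit in daemon mode, using the example model file from above
bin/start-vw-daemon.sh data/models/sentence.16.vw

# 3. Qanta server: opens the socket, queues questions, polls PostgreSQL for results
python3 cli.py qanta_stream &

# 4. Spark Streaming job: reads the socket, extracts features, calls VW, writes predictions
python3 cli.py spark_stream
```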
To provide an easy way to version, checkpoint, and restore runs of qanta, we provide a script to manage that at `aws_checkpoint.py`. We assume that you set an environment variable `QB_AWS_S3_BUCKET` to the bucket you want to checkpoint to and restore from. We assume that we have full access to all the contents of the bucket, so we suggest creating a dedicated bucket.
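A hypothetical session might look like the following; only the `restore expo` invocation appears elsewhere in this document, so treat the other commands as illustrative and check the script itself for its real interface:

```bash
# Point the checkpoint script at a dedicated bucket
export QB_AWS_S3_BUCKET=my-dedicated-qanta-bucket

# Inspect the script's actual interface (subcommands are not documented here)
python3 aws_checkpoint.py --help

# Restore previously checkpointed expo files (this invocation appears in the expo section below)
./checkpoint restore expo
```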
- `pg_config executable not found`: Install Postgres (required for the python package psycopg2, used by streaming)
- pyspark uses the wrong version of python: Set `PYSPARK_PYTHON` to `python3`
- `ImportError: No module named 'pyspark'`: `export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH`
- `ValueError: unknown locale: UTF-8`: `export LC_ALL=en_US.UTF-8` and `export LANG=en_US.UTF-8`
The expo files can be generated from a completed qanta run by calling:

```bash
luigi --module qanta.expo.pipeline --workers 2 AllExpo
```

If that has already been done, you can restore the expo files from a backup instead of running the pipeline:

```bash
./checkpoint restore expo
```

Then to finally run the expo:

```bash
python3 qanta/expo/buzzer.py --questions=output/expo/test.questions.csv --buzzes=output/expo/test.16.buzz --output=output/expo/competition.csv --finals=output/expo/test.16.final
```
Terraform works by reading all files ending in `.tf` within the directory in which it is run. Unless the filename ends with `_override`, it will concatenate all these files together. In the case of `_override`, it will use the contents to override the current configuration. The combination of these allows for keeping the root `aws.tf` clean while adding the possibility of customizing the build.
In the repository there are a number of `.tf.tftemplate` files. These are not read by Terraform but are intended to be copied to the same filename without the `.tftemplate` extension. The extension merely serves to keep Terraform from reading the file by default while keeping it in source control (the files ending in `.tf` are in `.gitignore`). Below is a description of these files:
- `aws_gpu_override.tf.tftemplate`: Configures Terraform to start a GPU instance instead of a normal instance. This instance uses a different AMI that has GPU-enabled TensorFlow/CUDA/etc.
- `aws_small_override.tf.tftemplate`: Configures Terraform to use a smaller CPU instance than the default `r3.8xlarge`
- `naqt_db.tf.tftemplate`: Configures qanta to use the private NAQT dataset
- `eip.tf.tftemplate`: Configures Terraform to add a pre-made Elastic IP to the instance
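For example, to use one of these overrides you would copy it to a `.tf` filename and rerun Terraform; the small-instance override shown here is from the list above:

```bash
# Activate the small-instance override, then re-apply Terraform
cp aws_small_override.tf.tftemplate aws_small_override.tf
terraform apply
```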