High-energy Physics Analysis Queries on PrestoDB

This repository contains implementations of the High-energy Physics (HEP) analysis queries from the IRIS HEP benchmark, written in SQL to be run on PrestoDB.

Motivation

The purpose of this repository is to study the suitability of SQL for HEP analyses and to serve as a use case for improving database technologies. While SQL has often been considered unsuited for HEP analyses, we believe that the support for arrays and structured types introduced in SQL:1999 actually makes SQL a rather good fit. As a high-level declarative language, it has the potential to bring many of its benefits to HEP, including transparent query optimization, transparent caching, and a separation of the logical level (in the form of the data model and queries) from the physical level (in the form of storage formats and performance optimizations).

Prerequisites and Setup

  1. Install Python 3 with pip.
  2. Install the Python requirements:
    pip3 install -r requirements.txt
  3. Install Docker and Docker Compose.
  4. Clone this repository and bring up the services with Docker Compose.
  5. Optionally download the Presto CLI client matching the version in the Docker image.
  6. Set up scripts/presto.sh: either copy scripts/presto.local.sh and modify it to point to the Presto CLI executable from the previous step, or use the following command, which uses the Presto CLI inside one of the Docker images:
    cp scripts/presto.docker.sh scripts/presto.sh
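
Putting these steps together, an end-to-end setup might look as follows. This is a minimal sketch: the repository URL is inferred from the repository name, and it assumes the compose file sits at the repository root.

git clone https://github.com/RumbleDB/iris-hep-benchmark-presto.git
cd iris-hep-benchmark-presto
pip3 install -r requirements.txt   # Python dependencies for the test driver
docker compose up -d               # Presto, HDFS, etc. (assumption: compose file at repo root)
cp scripts/presto.docker.sh scripts/presto.sh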

Data

The benchmark defines a data set in the ROOT format, which is not supported by Presto. However, the Rumble implementation of the benchmark provides scripts to convert the data to Parquet, which Presto can query in place.
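
For a quick, small-scale alternative to those scripts, a plain ROOT-to-Parquet conversion can be sketched with uproot and awkward. Note that this naive conversion does not perform the restructuring of per-particle columns into structs that the Rumble scripts do, and the input file name and the tree name Events are assumptions:

pip3 install uproot awkward
python3 - <<'EOF'
import awkward as ak
import uproot

# Read all branches of the event tree and write them out as one Parquet file.
# File and tree names are assumptions; adjust to your copy of the data set.
events = uproot.open("Run2012B_SingleMu.root")["Events"].arrays()
ak.to_parquet(events, "Run2012B_SingleMu.parquet")
EOF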

HDFS

You can run the queries against "external tables" consisting of Parquet files on HDFS. A basic HDFS installation is part of the services brought up by Docker Compose; see that project's instructions for details. The main steps are as follows:

  1. Copy Run2012B_SingleMu-restructured-1000.parquet from this repository to the data/ directory of the Docker Compose project.
  2. Upload it to HDFS:
    docker compose exec -it namenode hadoop fs -mkdir /Run2012B_SingleMu-restructured-1000/
    docker compose exec -it namenode hadoop fs -put /data/Run2012B_SingleMu-restructured-1000.parquet /Run2012B_SingleMu-restructured-1000/
  3. Create an external table with the provided script:
    scripts/create_table.py \
        --table-name Run2012B_SingleMu_1000 \
        --location hdfs://namenode/Run2012B_SingleMu-restructured-1000/ \
        --variant native \
        --view-name Run2012B_SingleMu_1000_view  # ignored for "native" variant
    Check out the help of that script in case you want to connect to Presto with non-default parameters. The DDL that the script presumably issues is sketched below.
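
For orientation, the external table that create_table.py creates presumably corresponds to a Hive-connector CREATE TABLE along these lines. The column list is abridged, the names and types are assumptions based on the benchmark's CMS data set, and this sketch assumes scripts/presto.sh forwards its arguments to the Presto CLI:

scripts/presto.sh --execute "
  CREATE TABLE hive.default.Run2012B_SingleMu_1000 (
    run BIGINT,
    event BIGINT,
    MET ROW(pt REAL, phi REAL),
    Muon ARRAY(ROW(pt REAL, eta REAL, phi REAL, mass REAL))
  ) WITH (
    external_location = 'hdfs://namenode/Run2012B_SingleMu-restructured-1000/',
    format = 'PARQUET'
  )"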

S3

You can also read the data from an S3 bucket if you run Presto on an EC2 instance with a properly configured instance profile. Assuming the data has been uploaded to S3, use the following command to create an external table based on these files:

scripts/create_table.py \
    --table-name Run2012B_SingleMu_1000 \
    --location s3a://my_bucket/Run2012B_SingleMu_1000/ \
    --variant native \
    --view-name Run2012B_SingleMu_1000_view  # ignored for "native" variant
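
For S3 access to work, the Hive catalog of the Presto server must be configured accordingly. As a rough sketch, a catalog file such as etc/catalog/hive.properties might contain the following; the metastore host and the exact property set are assumptions and depend on your Presto version and deployment:

# etc/catalog/hive.properties (illustrative; adjust to your deployment)
connector.name=hive-hadoop2
hive.metastore.uri=thrift://hive-metastore:9083
# Pick up credentials from the EC2 instance profile:
hive.s3.use-instance-credentials=true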

Shredded Data Format

It is also possible to read Parquet files where all structs are "shredded" into columns (see the Rumble implementation for details). Use a command along the following lines for that purpose:

scripts/create_table.py \
    --table-name Run2012B_SingleMu_1000 \
    --location s3a://my_bucket/Run2012B_SingleMu_shredded_1000/ \
    --variant shredded \
    --view-name Run2012B_SingleMu_shredded_1000_view

The queries should then be run against Run2012B_SingleMu_shredded_1000_view, which exposes the data in the same format as the non-shredded "native" Parquet files.
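
Conceptually, such a view reassembles the parallel per-particle columns back into arrays of structs. In Presto this can be expressed with the zip function; a minimal sketch, with assumed column names and an abridged field list, might read:

scripts/presto.sh --execute "
  CREATE VIEW Run2012B_SingleMu_shredded_1000_view AS
  SELECT
    run,
    event,
    -- Recombine the shredded arrays into a single ARRAY(ROW(...)) column.
    CAST(zip(Muon_pt, Muon_eta, Muon_phi, Muon_mass)
         AS ARRAY(ROW(pt REAL, eta REAL, phi REAL, mass REAL))) AS Muon
  FROM Run2012B_SingleMu_1000"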

Running Queries

Queries are run through test_queries.py. Run the following command to see its options:

$ ./test_queries.py --help
usage: test_queries.py [options] [file_or_dir] [file_or_dir] [...]

...
custom options:
  -Q QUERY_ID, --query-id=QUERY_ID
                        Folder name of query to run.
  -F FREEZE_RESULT, --freeze-result=FREEZE_RESULT
                        Whether the results of the query should be persisted to disk.
  -N NUM_EVENTS, --num-events=NUM_EVENTS
                        Number of events taken from the input file. This influences which reference file should be taken.
  -I INPUT_TABLE, --input-table=INPUT_TABLE
                        Name of input table or view.
  -P PRESTO_CMD, --presto-cmd=PRESTO_CMD
                        Path to the script that runs the Presto CLI.
  -S PRESTO_SERVER, --presto-server=PRESTO_SERVER
                        URL as <host>:<port> of the Presto server.
  -C PRESTO_CATALOGUE, --presto-catalogue=PRESTO_CATALOGUE
                        Default catalogue to use in Presto.
  --presto-schema=PRESTO_SCHEMA
                        Default schema to use in Presto.
  --plot-histogram      Plot resulting histogram as PNG file.

For example, the following command runs queries 6-1 and 6-2 against the table created above:

./test_queries.py -vs --num-events 1000 --query-id query-6-1 --query-id query-6-2
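
The same driver options can point the queries at a different table and server; for instance, to run query 6-1 against the shredded view on a remote coordinator (host and port are placeholders):

./test_queries.py -vs --num-events 1000 --query-id query-6-1 \
    --input-table Run2012B_SingleMu_shredded_1000_view \
    --presto-server presto.example.com:8080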
