Bucketeer

A TIFF to JPX to S3 bucket microservice. It converts TIFF images into JPEG 2000 images in one of two ways:

  1. The first way converts individual TIFF images into JPEG 2000 images on the local machine. Bucketeer receives an individual request, reads the TIFF from a mounted directory, converts it into a JPEG 2000 image, and then uploads the result to S3. This method is triggered through a RESTful API.

  2. The second way converts TIFF images into JPEG 2000 images in batch. A CSV file is uploaded to Bucketeer, and the TIFF images listed in the CSV file (which must be available to Bucketeer from a locally mounted directory) are uploaded to an S3 bucket. An AWS Lambda function responds to each upload event, converts the TIFF into a JPEG 2000 image, and stores the result in another S3 bucket. This method is triggered by uploading a CSV file through a Web page on the Bucketeer site.

Currently, the CSV upload method is hard-coded for UCLA's particular metadata model. This will be changed to make the process more generic.

Requirements

  • A Slack Team and a Slack bot, with at least one channel configured to support file uploads.

    • To run the tests that use Slack, copy the bucketeer.slack.* settings from the sample settings.xml file in src/test/resources into your own settings.xml file (perhaps at /etc/maven/settings.xml), supplying the values from your own Slack account.
  • An AWS account with S3 buckets created.

    • To run the tests that use S3, copy the bucketeer.s3.* settings from the sample settings.xml file in src/test/resources into your own settings.xml file (perhaps at /etc/maven/settings.xml), supplying the values from your own AWS account.
  • A valid license for Kakadu and the Kakadu binaries installed (to use local conversion)

    • To run the tests that use Kakadu, install Kakadu on your system and the tests should pick it up. If you have several copies of Kakadu and want to tell the tests which one to use, set the KAKADU_HOME environment variable to point to the Kakadu binaries you want.
  • An installation of kakadu-lambda-converter (to use the batch conversion)

    • See that project's GitHub page for information about how to install it.
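For the Kakadu requirement above, the following is a minimal sketch of one way a lookup that honors KAKADU_HOME could work. It is not Bucketeer's actual lookup code: the KakaduLocator class, the fallback to searching PATH, and the use of kdu_compress as the marker binary are all assumptions made for illustration.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper; not part of Bucketeer. */
final class KakaduLocator {

    /** A Kakadu command-line tool used here to recognize an install directory (assumption). */
    private static final String KDU_COMPRESS = "kdu_compress";

    /**
     * Prefers an explicit KAKADU_HOME and otherwise returns the first PATH
     * directory that contains an executable kdu_compress.
     */
    static Optional<Path> locate(final Map<String, String> env) {
        final String home = env.get("KAKADU_HOME");

        if (home != null && !home.isBlank()) {
            return Optional.of(Path.of(home));
        }

        final String path = env.getOrDefault("PATH", "");

        for (final String dir : path.split(File.pathSeparator)) {
            if (!dir.isEmpty() && Files.isExecutable(Path.of(dir, KDU_COMPRESS))) {
                return Optional.of(Path.of(dir));
            }
        }

        return Optional.empty();
    }

    public static void main(final String[] args) {
        System.out.println(locate(System.getenv()).map(Path::toString).orElse("Kakadu not found"));
    }
}
```

Passing the environment in as a map, rather than calling System.getenv() inside the method, keeps the lookup easy to exercise with different settings.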

Building the Project

The project builds an executable Jar that can be run to start the microservice. To build the project, run:

mvn package

This will put the executable Jar in the target/build-artifact directory.

The application, in its simplest form, can be run with the following command:

java -Dvertx-config-path=target/test-classes/test-config.properties -jar target/build-artifact/bucketeer-*.jar

To generate the project's Javadoc documentation, run:

mvn site

This will generate the documentation in the target/site directory.

If you'd like to run Bucketeer in a Docker container, a repository to build a Docker image is available. Using it requires you to supply your own AWS credentials and to have a licensed copy of the Kakadu source code.

Running the Application for Development

You can run a development instance of Bucketeer by typing the following within the project root:

mvn -Plive test

Once it is running, the service can be verified at http://localhost:8888/status, and the API documentation can be accessed at http://localhost:8888/docs.
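If you'd rather check the status endpoint from code than from a browser, a minimal sketch using the JDK's built-in HTTP client (Java 11 or later) might look like the following. The StatusCheck class is hypothetical, and because the format of the response body isn't described here, the sketch only reports the HTTP status code.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical helper; not part of Bucketeer. */
final class StatusCheck {

    /** Sends a GET request to the supplied URI and returns the HTTP status code. */
    static int fetchStatusCode(final URI uri) throws IOException, InterruptedException {
        final HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        final HttpRequest request = HttpRequest.newBuilder(uri)
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
        final HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        return response.statusCode();
    }

    public static void main(final String[] args) throws IOException, InterruptedException {
        // Defaults to the development instance's status endpoint
        final URI uri = URI.create(args.length > 0 ? args[0] : "http://localhost:8888/status");

        System.out.println(uri + " -> HTTP " + fetchStatusCode(uri));
    }
}
```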

If you want to run the application with a different mount point (for image sources) and file prefix (e.g. the UCLA file path prefix), you can use something like:

mvn -Plive test -Dbucketeer.fs.mount=/opt/data -Dbucketeer.fs.prefix=UCLAFilePathPrefix

If you leave off the bucketeer.fs.prefix, Bucketeer will treat the bucketeer.fs.mount as the default directory.

Tweaking the Batch Upload

Choosing between conversion methods depends largely on how quickly TIFF images can be uploaded to the S3 bucket that the AWS Lambda function watches. AWS Lambda scales horizontally (up to 1,000 concurrent executions by default), so if you can upload TIFFs to the S3 bucket faster than the cores on your local machine can convert them, it makes sense to use the batch method.
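The comparison above can be sketched as simple arithmetic. The ConversionMethodChooser class below is hypothetical, and it assumes each local core converts one TIFF at a time; it also ignores Lambda's own processing time and concurrency limits, so treat it as a rough rule of thumb rather than a benchmark.

```java
/** Hypothetical helper; not part of Bucketeer. */
final class ConversionMethodChooser {

    enum Method { LOCAL, BATCH }

    /**
     * Picks the batch method only when TIFFs can be uploaded faster than the
     * local machine can convert them.
     *
     * @param uploadTiffsPerMinute measured upload rate to the S3 bucket
     * @param localCores number of cores available for local conversion
     * @param minutesPerLocalConversion average time one core needs per TIFF
     */
    static Method choose(final double uploadTiffsPerMinute, final int localCores,
            final double minutesPerLocalConversion) {
        if (localCores < 1 || minutesPerLocalConversion <= 0) {
            throw new IllegalArgumentException("Cores and conversion time must be positive");
        }

        // Assumption: one conversion per core at a time
        final double localTiffsPerMinute = localCores / minutesPerLocalConversion;

        return uploadTiffsPerMinute > localTiffsPerMinute ? Method.BATCH : Method.LOCAL;
    }

    public static void main(final String[] args) {
        // Illustrative numbers only: 8 cores at 30 seconds per TIFF convert 16 TIFFs per minute
        System.out.println(choose(30.0, 8, 0.5));
        System.out.println(choose(10.0, 8, 0.5));
    }
}
```

With these made-up numbers, an upload rate of 30 TIFFs per minute favors the batch method, while 10 TIFFs per minute favors local conversion.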

To support getting TIFFs up to S3 as quickly as possible, there are a number of controls in Bucketeer that can be adjusted to improve performance.

The worker verticle count

The S3 upload verticle is a worker that has its own S3 client. If you configure multiple worker/upload verticles, there will be multiple clients sending TIFF images to the S3 bucket. This is controlled by the s3.uploader.instances property.

When you run the application locally, through the live test method, this can be set in a Maven settings file (for permanent usage) or can be set at runtime (for easier testing) by passing the value in as a system property.

Note: If you're using the Docker container mentioned earlier, the value would be passed in as an ENV property or be set in the application's configuration file. See the Docker Bucketeer project for more detail.

The worker verticle's thread count

Each S3 client (in a worker verticle) can be configured to use one or more threads. Setting the thread count to more than one will allow each S3 client to upload multiple files at a time. The property for this value is s3.uploader.threads. Keep in mind that this doesn't set the total number of threads used, but just the number of threads per S3 client.
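To make the per-client point concrete, here is a small sketch of how the two settings multiply. The UploaderThreads class is hypothetical, and the fallback value of 1 used when a property isn't set is a placeholder, not Bucketeer's documented default.

```java
/** Hypothetical helper; not part of Bucketeer. */
final class UploaderThreads {

    /** Total upload threads = number of S3 clients x threads per client. */
    static int totalThreads(final int instances, final int threadsPerClient) {
        if (instances < 1 || threadsPerClient < 1) {
            throw new IllegalArgumentException("Both values must be at least 1");
        }

        return Math.multiplyExact(instances, threadsPerClient);
    }

    public static void main(final String[] args) {
        // Reads the values as JVM system properties; the fallback of 1 is a placeholder
        final int instances = Integer.getInteger("s3.uploader.instances", 1);
        final int threads = Integer.getInteger("s3.uploader.threads", 1);

        System.out.println(instances + " S3 clients x " + threads + " threads each = "
                + totalThreads(instances, threads) + " upload threads");
    }
}
```

For example, running the sketch with -Ds3.uploader.instances=4 -Ds3.uploader.threads=2 reports 8 upload threads in total.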

This value, like the worker verticle count, can be set in a Maven settings file or passed in via the command line as a system property. In the Docker environment, it should be set through the application's configuration file or as an ENV property.

The maximum number of S3 requests

If you set the above two values very high, it's easy to run out of RAM on your machine (since the S3 clients read all the TIFF files concurrently). The maximum number of S3 requests threshold provides an upper limit on the number of S3 PUTs that can be in process at any given time, regardless of the number of worker verticles or threads that have been configured.

The maximum number of requests you allow will depend on the amount of RAM available on the machine. When the maximum has been reached, conversion requests are requeued until there are resources available. The property for this configuration is s3.max.requests.
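One common way to enforce this kind of ceiling is a counting semaphore. The sketch below illustrates the idea only; the UploadGate class is hypothetical, and Bucketeer's own implementation may track in-flight PUTs differently.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical helper; not part of Bucketeer. */
final class UploadGate {

    /** One permit per S3 PUT that is allowed to be in process at once. */
    private final Semaphore slots;

    UploadGate(final int maxRequests) {
        slots = new Semaphore(maxRequests);
    }

    /**
     * Runs the upload if a slot is free, holding the slot for the duration of the upload.
     *
     * @return true if the upload ran, false if the caller should requeue the request
     */
    boolean tryUpload(final Runnable upload) {
        if (!slots.tryAcquire()) {
            return false; // At the limit; the request should be requeued
        }

        try {
            upload.run();
            return true;
        } finally {
            slots.release();
        }
    }

    public static void main(final String[] args) {
        final UploadGate gate = new UploadGate(1);

        // While the only slot is in use, a second request is turned away
        gate.tryUpload(() -> System.out.println("Second request accepted: " + gate.tryUpload(() -> { })));
    }
}
```

Because tryAcquire() never blocks, a request that arrives at the limit is turned away immediately instead of tying up a thread, which matches the requeuing behavior described above.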

It can be set in the same ways as the two properties above.

The requeuing delay

Bucketeer is built on a messaging platform, so when the maximum number of PUT requests has been reached, any additional conversion requests that come in are requeued until an upload slot becomes available. The requeuing delay isn't really a performance setting like the three above, but it lets you reduce the number of messages flowing through the system by waiting a set number of seconds before a message is requeued. The property for this is s3.requeue.delay.

It's fine to leave this with its default value, but if you want to change it you'd do so in the same way as you would for the above properties.
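The effect of the delay can be sketched with a scheduled executor: a rejected request waits for the configured delay before it is tried again, so fewer retry messages circulate while all slots are busy. The DelayedRequeue class and its tryUpload callback are hypothetical, and Bucketeer itself requeues through its messaging platform rather than an executor.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical helper; not part of Bucketeer. */
final class DelayedRequeue {

    /**
     * Tries the upload now and, each time it is rejected, tries again after the delay.
     *
     * @return a future that completes with the number of attempts that were needed
     */
    static CompletableFuture<Integer> submit(final ScheduledExecutorService scheduler,
            final BooleanSupplier tryUpload, final long delay, final TimeUnit unit) {
        final CompletableFuture<Integer> attempts = new CompletableFuture<>();

        attempt(scheduler, tryUpload, delay, unit, 1, attempts);
        return attempts;
    }

    private static void attempt(final ScheduledExecutorService scheduler, final BooleanSupplier tryUpload,
            final long delay, final TimeUnit unit, final int attemptNumber,
            final CompletableFuture<Integer> attempts) {
        try {
            if (tryUpload.getAsBoolean()) {
                attempts.complete(attemptNumber);
            } else {
                // No slot available: wait for the requeue delay before trying again
                scheduler.schedule(
                        () -> attempt(scheduler, tryUpload, delay, unit, attemptNumber + 1, attempts),
                        delay, unit);
            }
        } catch (final RuntimeException details) {
            attempts.completeExceptionally(details);
        }
    }

    public static void main(final String[] args) {
        final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        final long start = System.nanoTime();
        // Pretend that an upload slot frees up after one second
        final BooleanSupplier slotFree = () -> System.nanoTime() - start > TimeUnit.SECONDS.toNanos(1);
        final int attempts = submit(scheduler, slotFree, 250, TimeUnit.MILLISECONDS).join();

        System.out.println("Uploaded after " + attempts + " attempt(s)");
        scheduler.shutdown();
    }
}
```

A longer delay means fewer retries while the system is saturated, at the cost of a slower reaction once a slot frees up.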

We're still experimenting with different configurations, so we can't yet recommend the best values for these properties on a particular type of machine.

Contact

We use an internal ticketing system, but we've left the GitHub issues open in case you'd like to file a ticket or make a suggestion. You can also contact Kevin S. Clarke at ksclarke@ksclarke.io if you have a question about the project.
