Example Single Machine Dataset Upload

The listing below is derived from a script that uploaded a multi-hundred-gigabyte uncompressed image dataset from an external hard drive to Google Cloud Storage, but it would work just as well with Amazon S3. It has the following helpful features:

  1. Resumable Upload
  2. Progress Printing
  3. Multi-core Upload

The resumable upload feature works by recording the index of each uploaded slice on disk: after a slice finishes, the script touches a file named for that z index in a newly created ./progress/ directory. You can reset the upload with rm -r ./progress, or skip individual slices by touching their markers, e.g. touch progress/5 prevents z=5 from being reuploaded.

A curious feature of this script is that it uses ProcessPoolExecutor as an independent multi-process runner rather than relying on CloudVolume's parallel=True option. This parallelizes the file reading and decoding steps in addition to the upload itself. ProcessPoolExecutor is used instead of multiprocessing.Pool because the original multiprocessing module hangs when a child process dies.

Please use the Python 3 code below as a guide.

import os
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import tifffile 

from cloudvolume import CloudVolume
from cloudvolume.lib import mkdir, touch

info = CloudVolume.create_new_info(
	num_channels = 1,
	layer_type = 'image', # 'image' or 'segmentation'
	data_type = 'uint16', # can pick any popular uint
	encoding = 'raw', # see: https://github.com/seung-lab/cloud-volume/wiki/Compression-Choices
	resolution = [ 4, 4, 4 ], # X,Y,Z values in nanometers
	voxel_offset = [ 0, 0, 1 ], # X,Y,Z offset of the volume in voxels
	chunk_size = [ 1024, 1024, 1 ], # X,Y,Z chunk size in voxels; the image is rechunked to this
	volume_size = [ 8368, 2258, 12208 ], # X,Y,Z size in voxels
)

# If you're using Amazon S3 or the local filesystem, you can replace 'gs' with 's3' or 'file'
vol = CloudVolume('gs://bucket/dataset/layer', info=info)
vol.provenance.description = "Description of Data"
vol.provenance.owners = ['email_address_for_uploader/imager'] # list of contact email addresses

vol.commit_info() # generates gs://bucket/dataset/layer/info json file
vol.commit_provenance() # generates gs://bucket/dataset/layer/provenance json file

direct = 'local/path/to/images'

progress_dir = mkdir('progress/') # unlike os.mkdir, doesn't crash on a preexisting directory
done_files = set([ int(z) for z in os.listdir(progress_dir) ])
all_files = set(range(vol.bounds.minpt.z, vol.bounds.maxpt.z)) # bounds.maxpt is exclusive

to_upload = [ int(z) for z in list(all_files.difference(done_files)) ]
to_upload.sort()

def process(z):
    img_name = 'brain_%06d.tif' % z
    print('Processing ', img_name)
    image = tifffile.imread(os.path.join(direct, img_name))
    image = np.swapaxes(image, 0, 1) # tifffile reads (Y,X); CloudVolume expects (X,Y)
    image = image[..., np.newaxis] # add a trailing single-channel axis
    vol[:, :, z] = image
    touch(os.path.join(progress_dir, str(z))) # mark this slice as done

if __name__ == '__main__': # guard needed when the start method is "spawn" (macOS, Windows)
    with ProcessPoolExecutor(max_workers=8) as executor:
        # consuming the iterator surfaces exceptions raised in the workers
        for _ in executor.map(process, to_upload):
            pass
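
Once the upload finishes (or while it runs), it is worth spot checking a slice by reading it back through CloudVolume. A minimal sketch, assuming the same placeholder bucket path as above and that slice z=5 has already been uploaded:

from cloudvolume import CloudVolume

# Read back one uploaded slice; CloudVolume returns a 4D (X,Y,Z,channel) array.
vol = CloudVolume('gs://bucket/dataset/layer', progress=True)
cutout = vol[:, :, 5]
print(cutout.shape, cutout.dtype)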

RGB Data

To work with RGB data, set num_channels=3, set data_type="uint8", and make sure that RGB is the last axis of the image array.
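
As a minimal sketch of that layout, assuming 8-bit RGB TIFF slices shaped (Y, X, 3), a placeholder bucket path, and the same volume dimensions as the example above (adjust these to your own data):

import numpy as np
import tifffile
from cloudvolume import CloudVolume

info = CloudVolume.create_new_info(
    num_channels = 3, # RGB
    layer_type = 'image',
    data_type = 'uint8', # 8 bits per channel
    encoding = 'raw',
    resolution = [ 4, 4, 4 ],
    voxel_offset = [ 0, 0, 0 ],
    chunk_size = [ 1024, 1024, 1 ],
    volume_size = [ 8368, 2258, 12208 ],
)
vol = CloudVolume('gs://bucket/dataset/rgb_layer', info=info)
vol.commit_info()

image = tifffile.imread('rgb_slice_000000.tif') # (Y, X, 3)
image = np.swapaxes(image, 0, 1)                # (X, Y, 3)
image = image[:, :, np.newaxis, :]              # (X, Y, 1, 3): channels stay last
vol[:, :, 0] = image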

To view an RGB image in neuroglancer, paste this code into the rendering box.

void main() {
  vec3 data = vec3(toNormalized(getDataValue(0)), toNormalized(getDataValue(1)), toNormalized(getDataValue(2)));
  emitRGB(data);
}

Sharded Images

Sharded images compact many small chunk files into larger, randomly accessible shard files to reduce strain on the filesystem. You probably only need to worry about them once your data is in the multi-teravoxel range. To upload a sharded image, you have two options.

  1. Upload the original unsharded image as described above and use igneous xfer --sharded to create a new sharded copy, then delete the original upload (see the sketch below).
  2. Follow the instructions in Creating a Sharded Image from Scratch.
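
For option 1, the transfer is usually driven through igneous's task queue, either via the command line mentioned above or from Python. The following is only a sketch under assumptions: it presumes recent igneous and python-task-queue releases, uses create_image_shard_transfer_tasks (whose exact keyword arguments vary between versions), and uses placeholder bucket paths, so check igneous's documentation or its command-line help before running it.

from taskqueue import LocalTaskQueue
import igneous.task_creation as tc

# Sketch: queue sharded transfer tasks from the unsharded layer to a new
# sharded layer, then execute them on this machine. Paths are placeholders
# and the keyword arguments should be checked against your igneous version.
tq = LocalTaskQueue(parallel=8)
tasks = tc.create_image_shard_transfer_tasks(
    'gs://bucket/dataset/layer',         # source (unsharded) layer
    'gs://bucket/dataset/layer_sharded', # destination (sharded) layer
    mip=0,
)
tq.insert(tasks)
tq.execute()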