# Image Data Storage for the Web

## Learning objectives

- Become familiar with the design of modern **cloud storage systems**
- Gain experience with the **zarr** and **n5 formats**
- Understand the relationship between **chunked, compressed** object storage and **parallel processing and multi-scale visualization**
- Become familiar with **OME-NGFF**, the next-generation file format for scientific imaging

<a href="https://colab.research.google.com/github/thewtex/modern-insights-from-microscopy-images/blob/master/03_Data_Storage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install package dependencies
import sys

!{sys.executable} -m pip install --upgrade --pre zarr multiscale-spatial-image matplotlib itk-io tifffile ome-zarr

# Cloud storage

**Cloud storage services**, such as:

- Amazon Simple Storage Service (AWS S3)
- Google Cloud Storage
- Microsoft Azure Storage
- Minio Cloud Storage

**differ from traditional filesystem storage**.

In *File Storage*:

- Data is organized into files and folders.
- There is generally a pool of storage, e.g. a volume, with limited capacity that can be accessed.
- Data can be overwritten.
- Limited metadata is associated with the file.

In cloud, *Object Storage* systems:

- Objects, binary blobs, live in a flat structure.
- Objects have a unique identifier and associated metadata, typically JSON-compatible.
- Access is possible via simple HTTP requests.
- Objects cannot be modified in place; an update replaces the whole object.
- There are no structural limits to scaling.
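
The contrast with file storage can be made concrete with a toy model. The following is a minimal sketch, not a real cloud client: `ToyObjectStore` and its method names are hypothetical, and a Python dict stands in for the storage service. It illustrates the flat key space, per-object metadata, and whole-object replacement described above.

```python
import json


class ToyObjectStore:
    """Hypothetical in-memory stand-in for a bucket in an object store."""

    def __init__(self):
        # Flat mapping: key -> (immutable bytes, metadata dict).
        self._objects = {}

    def put(self, key, data, metadata=None):
        metadata = dict(metadata or {})
        # Raises TypeError if the metadata is not JSON-compatible.
        json.dumps(metadata)
        # A put always replaces the whole object; there is no partial update.
        self._objects[key] = (bytes(data), metadata)

    def get(self, key):
        return self._objects[key][0]

    def head(self, key):
        # Metadata can be read without transferring the object body.
        return self._objects[key][1]

    def list(self, prefix=""):
        # "Folders" are only a naming convention: keys that share a prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ToyObjectStore()
store.put("image.zarr/0/0.0", b"\x01\x02", {"content-type": "application/octet-stream"})
store.put("image.zarr/0/0.1", b"\x03\x04")
store.put("notes.txt", b"hello")

print(store.list("image.zarr/"))
print(store.head("image.zarr/0/0.0"))
```

Listing by prefix is what lets a flat key space behave like a directory tree, which is how hierarchical formats can be layered on top of object storage.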

## Zarr and n5 formats

[Zarr](https://zarr-developers.github.io/about/) and [n5](https://github.com/saalfeldlab/n5/) are file formats with reference implementations that map well to cloud Object Storage services. They are also suitable for storage of large bioimages.

Zarr and n5 are the storage formats used by the bioimaging data model of the [Open Microscopy Environment (OME) Next-Generation File Format (NGFF)](https://ngff.openmicroscopy.org/latest/), which represents the data previously stored in [OME-TIFF](http://www.openmicroscopy.org/ome-files/) as *a hierarchy of n-dimensional (dense) arrays with metadata*.

Zarr and n5 support:

- Group hierarchies
- Arbitrary JSON-compatible metadata
- Chunked, n-dimensional binary tensor datasets
- Binary component types: [u]int8, [u]int16, [u]int32, [u]int64, float32, float64
- Next-generation lossless compression of binary chunks with [blosc](https://blosc.org/pages/blosc-in-depth/)
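
To see how these features fit together, here is a simplified, standard-library-only sketch of a Zarr-like layout in a flat key-value store. It is an illustration under stated assumptions rather than the zarr library itself: the key names imitate the `.zgroup`, `.zattrs`, and `.zarray` files seen in Zarr stores, only a subset of the metadata fields is written, the `experiment` and `image` names and the `stain` attribute are made up, and `zlib` stands in for blosc.

```python
import json
import struct
import zlib

store = {}  # flat key -> bytes mapping, standing in for an object store

# Group node: a marker plus arbitrary JSON-compatible attributes.
store["experiment/.zgroup"] = json.dumps({"zarr_format": 2}).encode()
store["experiment/.zattrs"] = json.dumps({"stain": "DAPI"}).encode()

# A 4 x 6 uint16 array split into 2 x 3 chunks.
shape, chunks = (4, 6), (2, 3)
data = [[row * 10 + col for col in range(shape[1])] for row in range(shape[0])]
store["experiment/image/.zarray"] = json.dumps(
    {"shape": shape, "chunks": chunks, "dtype": "<u2"}
).encode()

for ci in range(shape[0] // chunks[0]):
    for cj in range(shape[1] // chunks[1]):
        # Gather the values of this chunk in row-major order.
        values = [
            data[ci * chunks[0] + i][cj * chunks[1] + j]
            for i in range(chunks[0])
            for j in range(chunks[1])
        ]
        raw = struct.pack(f"<{len(values)}H", *values)
        # zlib is used only because it ships with Python.
        store[f"experiment/image/{ci}.{cj}"] = zlib.compress(raw)


def read_chunk(ci, cj):
    """Decompress one chunk without touching any other chunk."""
    raw = zlib.decompress(store[f"experiment/image/{ci}.{cj}"])
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


print(sorted(store))
print(read_chunk(1, 1))
```

Because every chunk lives under its own key, a reader that needs one region fetches and decompresses only the chunks that cover it.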

When combined with a **multi-scale image model** such as [OME-NGFF](https://www.nature.com/articles/s41592-021-01326-w), **large image visualization** is possible.

The object storage-compatible model facilitates **parallel processing** because each chunk is compressed and written as an independent object, which makes **compressed chunk writes** practical even in a cloud storage environment.
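
A minimal sketch of why chunking suits parallel writes, using only the standard library. The `write_chunk` helper, the `image/<n>` key scheme, and the placeholder pixel data are hypothetical; a dict stands in for the object store and `zlib` stands in for blosc.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

store = {}  # flat key -> bytes mapping, standing in for an object store


def write_chunk(index):
    """Produce, compress, and store one chunk independently of all others."""
    raw = bytes([index % 256]) * 4096  # placeholder pixel data for this chunk
    store[f"image/{index}"] = zlib.compress(raw)  # zlib stands in for blosc
    return index


# Each worker writes a different key, so the workers never need to
# coordinate with each other or rewrite a shared file.
with ThreadPoolExecutor(max_workers=4) as pool:
    written = list(pool.map(write_chunk, range(16)))

print(len(store), "chunks written")
```

With a single monolithic file, the same workers would have to serialize their writes or agree on byte offsets in advance.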

## Creating an OME-NGFF

In [4]:
from multiscale_spatial_image import to_multiscale
from spatial_image import is_spatial_image, to_spatial_image
import itk
import numpy as np
from urllib.request import urlretrieve
import os

Download example image

Derived from:

https://data.broadinstitute.org/bbbc/BBBC024/

Image set BBBC024v1 [Svoboda David, Kozubek Michal, Stejskal Stanislav. Generation of Digital Phantoms of Cell Nuclei and Simulation of Image Formation in 3D Image Cytometry. Cytometry Part A, John Wiley & Sons, Inc., 75A, 6, pp. 494-509, 16 pages. ISSN 1552-4922. 2009.] from the Broad Bioimage Benchmark Collection.

In [12]:
image_name = 'HL50_cell_line_c00_03_extraction'

image_name = 'monkey_brain'
filename = f'{image_name}.tif'
if not os.path.exists(filename):
    url = 'https://data.kitware.com/api/v1/file/5b61f16c8d777f06857c1949/download'
    urlretrieve(url, filename)

In [13]:
# Image metadata
image = itk.imread(filename)
print(image)

Image (0x55eaa38d0180)
  RTTI typeinfo:   itk::Image<unsigned short, 3u>
  Reference Count: 1
  Modified Time: 805
  Debug: Off
  Object Name: 
  Observers: 
    none
  Source: (none)
  Source output name: (none)
  Release Data: Off
  Data Released: False
  Global Release Data: Off
  PipelineMTime: 620
  UpdateMTime: 804
  RealTimeStamp: 0 seconds 
  LargestPossibleRegion: 
    Dimension: 3
    Index: [0, 0, 0]
    Size: [363, 212, 129]
  BufferedRegion: 
    Dimension: 3
    Index: [0, 0, 0]
    Size: [363, 212, 129]
  RequestedRegion: 
    Dimension: 3
    Index: [0, 0, 0]
    Size: [363, 212, 129]
  Spacing: [1, 1, 1]
  Origin: [0, 0, 0]
  Direction: 
1 0 0
0 1 0
0 0 1

  IndexToPointMatrix: 
1 0 0
0 1 0
0 0 1

  PointToIndexMatrix: 
1 0 0
0 1 0
0 0 1

  Inverse Direction: 
1 0 0
0 1 0
0 0 1

  PixelContainer: 
    ImportImageContainer (0x55eaa3920c80)
      RTTI typeinfo:   itk::ImportImageContainer<unsigned long, unsigned short>
      Reference Count: 1
      Modified Time: 801
  

Convert the `itk.Image` to an [`xarray.DataArray`](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html).

In [14]:
image_da = itk.xarray_from_image(image)
image_da.name = image_name
image_da

In [15]:
# Does the image_da have the characteristics of a bioimage with spatial metadata?
is_spatial_image(image_da)

True

OME-NGFF stores a multiscale pyramid for large-scale visualization or analysis. Let's generate it.
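
Before generating the pyramid, the expected shape of each level can be estimated with simple arithmetic. This sketch assumes that each scale factor is applied relative to the previous level and that sizes are rounded down; that assumption is not taken from the library documentation, but it agrees with the shapes reported in the output of the next cell. The `pyramid_shapes` helper is hypothetical.

```python
def pyramid_shapes(shape, scale_factors):
    """Return the shape of each pyramid level, starting with full resolution."""
    shapes = [tuple(shape)]
    for factor in scale_factors:
        # Assumption: each factor applies to the previous level, rounding down.
        shapes.append(tuple(size // factor for size in shapes[-1]))
    return shapes


# Shape in (z, y, x) order, the reverse of the ITK size [363, 212, 129].
for level, level_shape in enumerate(pyramid_shapes((129, 212, 363), [2, 4])):
    print(level, level_shape)
```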

In [17]:
multiscale = to_multiscale(image_da, [2,4])
multiscale

| Array shape | Chunk shape | Array bytes | Chunk bytes | Tasks | Chunks | Type | Chunk type |
|---|---|---|---|---|---|---|---|
| (129, 212, 363) | (64, 64, 64) | 18.93 MiB | 512.00 kiB | 72 | 72 | uint16 | numpy.ndarray |
| (64, 106, 181) | (64, 64, 64) | 2.34 MiB | 512.00 kiB | 846 | 6 | uint16 | numpy.ndarray |
| (16, 26, 45) | (16, 26, 45) | 36.56 kiB | 36.56 kiB | 903 | 1 | uint16 | numpy.ndarray |
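
The byte sizes and chunk counts reported above follow directly from the array shape, the chunk shape, and the 2-byte `uint16` pixel type. The sketch below redoes that arithmetic; `chunk_summary` is a hypothetical helper, and the only assumption is that partial chunks at the array edges count as whole chunks.

```python
import math


def chunk_summary(shape, chunk_shape, itemsize):
    """Return (array bytes, bytes per full chunk, number of chunks)."""
    array_bytes = math.prod(shape) * itemsize
    chunk_bytes = math.prod(chunk_shape) * itemsize
    # Edge chunks may be partial, so round up along every axis.
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunk_shape))
    return array_bytes, chunk_bytes, n_chunks


levels = [
    ((129, 212, 363), (64, 64, 64)),
    ((64, 106, 181), (64, 64, 64)),
    ((16, 26, 45), (16, 26, 45)),
]
for shape, chunk_shape in levels:
    array_bytes, chunk_bytes, n_chunks = chunk_summary(shape, chunk_shape, itemsize=2)
    print(shape, f"{array_bytes / 2**20:.2f} MiB", f"{chunk_bytes / 2**10:.2f} kiB", n_chunks)
```

The chunk count matters for both use cases named in the learning objectives: it bounds how many independent pieces of work a parallel computation can schedule, and how many requests a viewer needs to display a region.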


## Exercises

In [3]:
# Get metadata on an image
!ome_zarr info https://s3.embassy.ebi.ac.uk/idr/zarr/v0.1/6001240.zarr/

ERROR:ome_zarr.cli:not a zarr: None


*Does the entire dataset need to be downloaded to examine its metadata?*

In [None]:
# Download an image dataset
!ome_zarr download https://s3.embassy.ebi.ac.uk/idr/zarr/v0.1/6001240.zarr/ --output image.zarr

downloading...
   .
to image.zarr
[########################################] | 100% Completed |  4.9s
[########################################] | 100% Completed |  7.6s


*Examine the contents of the filesystem representation of the OME-Zarr multi-scale image. What information is stored in each file?*

In [None]:
%ls -a image.zarr/6001240.zarr/

[0m[01;34m.[0m/  [01;34m..[0m/  [01;34m0[0m/  [01;34m1[0m/  .zattrs  .zgroup


In [None]:
%pycat image.zarr/6001240.zarr/.zattrs

In [None]:
%pycat image.zarr/6001240.zarr/.zgroup

In [None]:
%ls -a image.zarr/6001240.zarr/0

[0m[01;34m.[0m/           0.0.169.0.0  0.0.29.0.0  0.1.10.0.0   0.1.172.0.0  0.1.32.0.0
[01;34m..[0m/          0.0.17.0.0   0.0.3.0.0   0.1.100.0.0  0.1.173.0.0  0.1.33.0.0
0.0.0.0.0    0.0.170.0.0  0.0.30.0.0  0.1.101.0.0  0.1.174.0.0  0.1.34.0.0
0.0.1.0.0    0.0.171.0.0  0.0.31.0.0  0.1.102.0.0  0.1.175.0.0  0.1.35.0.0
0.0.10.0.0   0.0.172.0.0  0.0.32.0.0  0.1.103.0.0  0.1.176.0.0  0.1.36.0.0
0.0.100.0.0  0.0.173.0.0  0.0.33.0.0  0.1.104.0.0  0.1.177.0.0  0.1.37.0.0
0.0.101.0.0  0.0.174.0.0  0.0.34.0.0  0.1.105.0.0  0.1.178.0.0  0.1.38.0.0
0.0.102.0.0  0.0.175.0.0  0.0.35.0.0  0.1.106.0.0  0.1.179.0.0  0.1.39.0.0
0.0.103.0.0  0.0.176.0.0  0.0.36.0.0  0.1.107.0.0  0.1.18.0.0   0.1.4.0.0
0.0.104.0.0  0.0.177.0.0  0.0.37.0.0  0.1.108.0.0  0.1.180.0.0  0.1.40.0.0
0.0.105.0.0  0.0.178.0.0  0.0.38.0.0  0.1.109.0.0  0.1.181.0.0  0.1.41.0.0
0.0.106.0.0  0.0.179.0.0  0.0.39.0.0  0.1.11.0.0   0.1.182.0.0  0.1.42.0.0
0.0.107.0.0  0.0.18.0.0   0.0.4.0.0   0.1.110.0.0  0.1.183.0.
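
Each name in this listing is a chunk key: the index of the chunk along each of the five array dimensions, joined by dots. The sketch below maps an element index to the key of the chunk that contains it. The chunk shape used here is an assumption inferred from the listing (the third index reaches at least 183, while the first, fourth, and fifth are always 0); the authoritative value is stored in `.zarray`. The (t, c, z, y, x) axis order is likewise assumed, and `chunk_key` is a hypothetical helper.

```python
def chunk_key(element_index, chunk_shape, separator="."):
    """Return the key of the chunk that holds the element at element_index."""
    return separator.join(str(i // c) for i, c in zip(element_index, chunk_shape))


# Assumed chunk shape: one full 2D plane per chunk.
assumed_chunk_shape = (1, 1, 1, 275, 271)

# Element at t=0, c=1, z=100, y=42, x=7 (assumed axis order).
print(chunk_key((0, 1, 100, 42, 7), assumed_chunk_shape))
```

Under these assumptions, every pixel of one z-plane of one channel maps to the same key, so displaying a single plane requires reading exactly one object.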

In [None]:
%pycat image.zarr/6001240.zarr/0/.zarray

In [None]:
import zarr
group = zarr.open('image.zarr/6001240.zarr/')
group

<zarr.hierarchy.Group '/'>

In [None]:
group.attrs.keys()

dict_keys(['multiscales', 'omero'])

In [None]:
group.attrs['multiscales']

[{'datasets': [{'path': '0'}, {'path': '1'}], 'version': '0.1'}]
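
The `multiscales` attribute is plain JSON-compatible data, so it can be navigated with ordinary list and dict operations. A small sketch using the value printed above, copied in as a literal so that it runs without the downloaded dataset. Treating the first path as the highest resolution follows the OME-NGFF convention as I understand it and should be checked against the specification.

```python
multiscales = [{'datasets': [{'path': '0'}, {'path': '1'}], 'version': '0.1'}]

# Use the first multiscale image described in the metadata.
first = multiscales[0]
paths = [dataset['path'] for dataset in first['datasets']]

# Assumption: datasets are ordered from highest to lowest resolution.
finest, coarsest = paths[0], paths[-1]
print(first['version'], paths, finest, coarsest)
```

These paths are the names of the arrays inside the group, which is why they match the group keys listed in the next cell.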

In [None]:
list(group.keys())

['0', '1']

In [None]:
scale0 = group['0']

In [None]:
scale0

<zarr.core.Array '/0' (1, 2, 236, 275, 271) >u2>
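
The array summary ends with the data type `>u2`, which denotes an unsigned 2-byte integer stored in big-endian byte order. The sketch below uses the standard library `struct` module, whose `">H"` format code describes the same layout, to show the pixel size, the uncompressed size of the full array, and why byte order matters when decoding raw chunk bytes.

```python
import math
import struct

shape = (1, 2, 236, 275, 271)
itemsize = struct.calcsize(">H")  # ">H": big-endian unsigned 16-bit integer
total_bytes = math.prod(shape) * itemsize
print(itemsize, "bytes per pixel")
print(total_bytes, "bytes uncompressed")

# The same two bytes decode differently depending on byte order.
big = struct.unpack(">H", b"\x00\x08")[0]
little = struct.unpack("<H", b"\x00\x08")[0]
print(big, little)
```

The uncompressed size is a useful check before converting a whole scale to an in-memory array, since that conversion reads and decompresses every chunk.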

In [None]:
import numpy as np
np.asarray(scale0)

array([[[[[ 8,  9,  8, ...,  9,  9, 10],
          [ 9,  9,  9, ...,  8,  9,  9],
          [ 8,  8,  8, ..., 26, 40,  8],
          ...,
          [ 9,  9,  9, ...,  9, 10, 14],
          [ 8,  9, 10, ...,  9, 10,  9],
          [ 9,  8, 10, ..., 10,  8,  8]],

         [[ 9,  9,  9, ...,  8, 11, 11],
          [ 9,  8,  9, ..., 10,  9, 10],
          [ 9, 16,  9, ..., 39, 30,  9],
          ...,
          [10,  9, 10, ..., 10, 10,  9],
          [10,  8, 10, ..., 10, 10, 10],
          [10, 11,  9, ...,  9, 10, 10]],

         [[ 9,  9,  9, ..., 14,  7, 15],
          [ 9,  9,  9, ..., 10,  9,  9],
          [ 8,  9,  9, ...,  9, 67,  8],
          ...,
          [ 8,  9,  9, ...,  9, 19,  9],
          [ 8,  9,  8, ...,  7,  9, 10],
          [ 7,  9,  9, ...,  9,  9, 10]],

         ...,

         [[ 8,  9, 57, ...,  9,  9,  8],
          [ 8,  9,  8, ...,  7,  8,  9],
          [21,  9,  9, ...,  8,  9,  7],
          ...,
          [ 9,  9,  8, ...,  7,  8,  9],
          [14,  9