In this competition, we deal with TIFF images. I had never worked with this format before, so my first question was: which library should I use to read it? The most obvious choices were the `Pillow` and `tifffile` packages. However, neither worked: the former failed with `DecompressionBombError` (!), while the latter complained about a non-contiguous NumPy array for one of the images in the dataset.

[After some searching](https://stackoverflow.com/questions/7569553/working-with-tiffs-import-export-in-python-using-numpy), I stumbled upon the `osgeo` package, which provides the Python bindings for GDAL. As the name suggests, the library was written to work with maps, satellite images, and other geospatial data. But it has TIFF reading functionality built in, so why not apply it to medical imagery, right?

Here I show my approach to reading the dataset images. It worked for me right out of the box, without any of the errors I encountered with the other libraries. But if you know a better/faster way, please let me know!
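
A note on the `DecompressionBombError` mentioned above: Pillow refuses to open images whose pixel count exceeds a safety limit, and very large TIFFs can easily cross it. The sketch below reproduces that check in plain Python so the thresholds are visible. The constant is an assumption copied from Pillow's default `Image.MAX_IMAGE_PIXELS` and may differ between versions, `pillow_bomb_check` is a hypothetical helper, and the dimensions are made up for illustration.

```python
# Assumed to mirror Pillow's default Image.MAX_IMAGE_PIXELS (check your version).
ASSUMED_MAX_IMAGE_PIXELS = 1024 * 1024 * 1024 // 4 // 3  # 89_478_485


def pillow_bomb_check(width, height, max_pixels=ASSUMED_MAX_IMAGE_PIXELS):
    """Hypothetical helper: returns 'ok', 'warning' or 'error' for an image size."""
    if max_pixels is None:
        # A limit of None means the check is disabled.
        return 'ok'
    pixels = max(1, width) * max(1, height)
    if pixels > 2 * max_pixels:
        return 'error'    # where Pillow would raise DecompressionBombError
    if pixels > max_pixels:
        return 'warning'  # where Pillow would only emit a warning
    return 'ok'


# Made-up dimensions, not taken from the dataset.
print(pillow_bomb_check(4_000, 4_000))    # ok
print(pillow_bomb_check(10_000, 10_000))  # warning
print(pillow_bomb_check(30_000, 30_000))  # error
```

Setting `Image.MAX_IMAGE_PIXELS = None` before calling `Image.open` should turn the check off, at the cost of letting Pillow allocate very large arrays.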

In [None]:
import gc
import glob
import json
from os.path import basename, dirname, splitext
import numpy as np
from osgeo import gdal

In [None]:
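def human_readable_size(arr: np.ndarray) -> str:
    """Gets array's size as a verbose, human-readable string."""

    n = arr.nbytes
    # Divide by 1024 until the value fits the current unit; stop at GB so
    # that the number and the unit label never get out of sync.
    for unit in ('bytes', 'KB', 'MB'):
        if n < 1024:
            return f'{n:.3f} {unit}'
        n /= 1024
    return f'{n:.3f} GB'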

In [None]:
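def read_tiff(path: str) -> np.ndarray:
    """Reads TIFF file into a (channels, height, width) uint8 array."""

    dataset = gdal.Open(path, gdal.GA_ReadOnly)
    if dataset is None:
        # gdal.Open returns None on failure unless GDAL exceptions are enabled.
        raise IOError(f'Cannot open file: {path}')
    n_channels = dataset.RasterCount
    width = dataset.RasterXSize
    height = dataset.RasterYSize
    # Note: values are stored as uint8, which assumes 8-bit source images.
    image = np.zeros((n_channels, height, width), dtype=np.uint8)
    for i in range(n_channels):
        band = dataset.GetRasterBand(i + 1)  # GDAL bands are 1-indexed
        image[i] = band.ReadAsArray()
    return image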
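
If a whole image does not fit into memory, one option is to read it tile by tile: `band.ReadAsArray` also accepts a window as `xoff, yoff, win_xsize, win_ysize`. The sketch below only computes such windows, so it runs without GDAL; `iter_windows` is a hypothetical helper, and the image size and tile size are made-up values.

```python
from typing import Iterator, Tuple


def iter_windows(width: int, height: int, tile: int) -> Iterator[Tuple[int, int, int, int]]:
    """Yields (xoff, yoff, win_xsize, win_ysize) windows covering the image."""
    for yoff in range(0, height, tile):
        for xoff in range(0, width, tile):
            # Windows at the right and bottom edges are clipped to the image.
            yield xoff, yoff, min(tile, width - xoff), min(tile, height - yoff)


# Made-up image size, for illustration only.
windows = list(iter_windows(width=2500, height=1000, tile=1024))
print(windows)
# [(0, 0, 1024, 1000), (1024, 0, 1024, 1000), (2048, 0, 452, 1000)]

# With an open GDAL band, each window could then be read on its own, e.g.:
#     tile_array = band.ReadAsArray(xoff, yoff, win_xsize, win_ysize)
```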

In [None]:
meta = []

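# Without recursive=True, '**' behaves like '*' and matches exactly one
# directory level, i.e. the subset folder that contains the image.
for filename in sorted(glob.glob('/kaggle/input/hubmap-kidney-segmentation/**/*.tiff')):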
    print(f'Processing file: {filename}')
    identifier, _ = splitext(basename(filename))
    subset = basename(dirname(filename))
    img = read_tiff(filename)
    meta.append(dict(
        identifier=identifier,
        filename=filename,
        subset=subset,
        memory_bytes=img.nbytes,
        memory_human_readable=human_readable_size(img),
        image_shape=img.shape
    ))
    del img
    gc.collect()
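
Since `read_tiff` allocates a `(n_channels, height, width)` array of `uint8`, the memory footprint follows directly from the raster dimensions, so it could also be estimated without reading any pixels. A minimal sketch of that arithmetic, where `expected_nbytes` is a hypothetical helper and the shape is a made-up example:

```python
from math import prod


def expected_nbytes(shape, itemsize=1):
    """Size in bytes of a dense array with the given shape (itemsize=1 for uint8)."""
    return prod(shape) * itemsize


# Made-up shape: 3 channels, 20000 rows, 30000 columns.
n = expected_nbytes((3, 20_000, 30_000))
print(n)                        # 1800000000
print(f'{n / 1024**3:.3f} GB')  # 1.676 GB
```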

In [None]:
for info in meta:
    print(
        f'id: {info["identifier"]}, '
        f'memory: {info["memory_human_readable"]:>10s}, '
        f'shape: {info["image_shape"]}'
    )

In [None]:
with open('/kaggle/working/meta.json', 'w') as fp:
    json.dump(meta, fp)
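
One thing to keep in mind when loading `meta.json` back: JSON has no tuple type, so `image_shape` comes back as a list. A small round-trip sketch with a made-up record (the values below are not from the dataset):

```python
import json

# Hypothetical record with the same keys as the entries in `meta`.
record = {'identifier': 'example', 'image_shape': (3, 100, 200)}

restored = json.loads(json.dumps([record]))
print(restored[0]['image_shape'])  # [3, 100, 200]

# Convert back to a tuple if the shape is compared with `ndarray.shape`.
shape = tuple(restored[0]['image_shape'])
print(shape)  # (3, 100, 200)
```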