# Segment TIFF images

The general idea is to use the large TIFF images as input layers for a U-Net, with separate training, validation, and testing datasets. I'll do this by chopping the image layers into smaller tiles and feeding them into the neural net.
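As a rough sketch of the chopping step, here is one way to walk a 2-D layer in non-overlapping square tiles. The helper name `iter_tiles` and the small stand-in array are my own assumptions for illustration; partial tiles at the right and bottom edges are simply dropped.

```python
import numpy as np


def iter_tiles(layer, tile_size):
    """Yield (row, col, tile) for every full tile in a 2-D array.

    Hypothetical helper: edge pixels that do not fill a whole tile are skipped.
    """
    n_rows = layer.shape[0] // tile_size
    n_cols = layer.shape[1] // tile_size
    for r in range(n_rows):
        for c in range(n_cols):
            top, left = r * tile_size, c * tile_size
            yield r, c, layer[top:top + tile_size, left:left + tile_size]


# Small stand-in array so the sketch runs without the real TIFF layers
demo = np.arange(10 * 7, dtype=np.float32).reshape(10, 7)
tiles = list(iter_tiles(demo, tile_size=3))
```

Each tile is a view into the original array, so nothing is copied until a tile is modified or stacked.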

In [1]:
import math
from pathlib import Path

import numpy as np
from skimage import io

In [2]:
LAYER_DIR = Path("..") / "data" / "layers"

## Look at image properties

In [3]:
fa = io.imread(LAYER_DIR / "fa.tif")
fa.shape, fa.dtype, fa.min(), fa.max()

((41668, 19981), dtype('float32'), 0.0, 3.4e+38)

In [4]:
slope = io.imread(LAYER_DIR / "slope.tif")
slope.shape, slope.dtype, slope.min(), slope.max()

((41668, 19981), dtype('float32'), -3.4028235e+38, 0.9276394)

In [5]:
wetness = io.imread(LAYER_DIR / "wetness.tif")
wetness.shape, wetness.dtype, wetness.min(), wetness.max()

((41668, 19981), dtype('float32'), -3.4028235e+38, 53.34468)

In [6]:
dem = io.imread(LAYER_DIR / "dem.tif")
dem.shape, dem.dtype, dem.min(), dem.max()

((41668, 19981), dtype('float32'), -3.4028235e+38, 53.149834)

In [7]:
larv = io.imread(LAYER_DIR / "larv_spot_50m_correct.tif")
larv.shape, larv.dtype, larv.min(), larv.max()

((41668, 19981), dtype('float32'), 0.0, 3.4e+38)

In [8]:
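# The extreme min/max values above look like the float32 limits,
# which suggests they are nodata fill values rather than real data
np.finfo(np.float32).min, np.finfo(np.float32).max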

(-3.4028235e+38, 3.4028235e+38)

## How many tiles can we actually use?

In [15]:
tile_size = 512

In [16]:
h = fa.shape[0] / tile_size
w = fa.shape[1] / tile_size
h, w, h * w

(81.3828125, 39.025390625, 3175.996047973633)

These numbers are based on the raw image size, but much of the image is masked out, so try a closer approximation. The unmasked areas are mostly contiguous but offset from row to row.

In [18]:
# Count pixels below the large fill value, then express that as a number of tiles
has_value = (fa < 3.0e+38).sum()
has_value, has_value / (tile_size * tile_size)

(278090069, 1060.8294258117676)

Well that is rather depressing.
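The pixel count treats the valid area as if it packed perfectly into tiles. A possible refinement, sketched here under my own assumptions, is to compute the fraction of valid pixels in each tile of the grid. The function name `tile_valid_fraction` is hypothetical, and the threshold mirrors the `fa < 3.0e+38` test, so it only fits layers whose fill value is the large positive one.

```python
import numpy as np


def tile_valid_fraction(layer, tile_size, nodata_threshold=3.0e38):
    """Return an (n_rows, n_cols) array holding the valid-pixel fraction per tile.

    Hypothetical helper: assumes the fill value is at or above nodata_threshold.
    """
    n_rows = layer.shape[0] // tile_size
    n_cols = layer.shape[1] // tile_size
    # Drop edge pixels that do not fill a whole tile
    cropped = layer[:n_rows * tile_size, :n_cols * tile_size]
    valid = cropped < nodata_threshold
    # Split each axis into (tile index, offset within tile), then average the offsets
    blocks = valid.reshape(n_rows, tile_size, n_cols, tile_size)
    return blocks.mean(axis=(1, 3))


# Stand-in data: one fully masked tile and one half-masked tile
fill = np.finfo(np.float32).max
demo = np.zeros((4, 4), dtype=np.float32)
demo[:2, :2] = fill
demo[2:, 2] = fill
fractions = tile_valid_fraction(demo, tile_size=2)
```

Counting the tiles whose fraction exceeds some cutoff would then give a tile count that accounts for the row-to-row offset of the valid area.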

I need to cordon off areas for the test and validation datasets. The remainder is for the training dataset. I'll randomly crop the training data. The edges between the 3 datasets are absolute, but tiles may partially hang out into the undefined regions.
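A minimal sketch of the random cropping, assuming the training data is a band of rows `[row_start, row_stop)` and that every layer has the same shape. The function name `random_crop` and its parameters are hypothetical. The crop never crosses the band's top or bottom edge, but it is free to land on undefined pixels.

```python
import numpy as np


def random_crop(layers, row_start, row_stop, tile_size, rng):
    """Cut the same random tile_size x tile_size window from every layer.

    Hypothetical helper: the window stays inside rows [row_start, row_stop).
    """
    width = layers[0].shape[1]
    # Generator.integers excludes the upper bound, hence the + 1
    top = rng.integers(row_start, row_stop - tile_size + 1)
    left = rng.integers(0, width - tile_size + 1)
    return np.stack(
        [layer[top:top + tile_size, left:left + tile_size] for layer in layers]
    )


# Stand-in layers so the sketch runs without the real TIFF files
rng = np.random.default_rng(0)
layers = [np.zeros((20, 15), dtype=np.float32), np.ones((20, 15), dtype=np.float32)]
crop = random_crop(layers, row_start=4, row_stop=12, tile_size=4, rng=rng)
```

If a band is exactly one tile tall, the vertical offset is fixed and only the horizontal position varies.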

There is a triangle at the top of the images that has no targets. Should I include that? For now, "Yes."

Dataset distribution strategy:
1. Slice the images into 81 rows of tile-sized data.
2. Randomly assign the rows to the three datasets.

There will be 16 testing stripes, 16 validation stripes, and 81 - 32 = 49 training stripes.
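The stripe assignment could be sketched as a seeded shuffle of the 81 row indices. The function name `assign_stripes`, the dictionary keys, and the seed are my own assumptions, not something settled in this notebook.

```python
import numpy as np


def assign_stripes(n_stripes, n_test, n_val, seed):
    """Randomly split stripe indices into test, validation, and training sets.

    Hypothetical helper: a fixed seed keeps the split reproducible.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_stripes)
    return {
        "test": np.sort(order[:n_test]),
        "val": np.sort(order[n_test:n_test + n_val]),
        "train": np.sort(order[n_test + n_val:]),
    }


splits = assign_stripes(n_stripes=81, n_test=16, n_val=16, seed=0)
```

Stripe `i` would then cover image rows `i * tile_size` up to `(i + 1) * tile_size`.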

In [19]:
80 * 0.2

16.0