## Downloading the data

First, we fetch the annotated colon slides from Prof. Josef Feit's atlas.

In [1]:
from util.dataset_processing import fetch_files_prof_feit
from pathlib import Path

destination_folder = Path('/home/tomas/Projects/histoseg/data/Feit_colon-annotation')

In [None]:
fetch_files_prof_feit('https://atlases.muni.cz/atlases/colonannot.html', destination_folder)

Processing file 1 out of 63
Processing file 2 out of 63
Processing file 3 out of 63
Processing file 4 out of 63
Processing file 5 out of 63
Processing file 6 out of 63
Processing file 7 out of 63
Processing file 8 out of 63
Processing file 9 out of 63
Processing file 10 out of 63
Processing file 11 out of 63


Next, we need to extract the annotations from the .bz2 files.

In [2]:
from util.data_manipulation_scripts import decompress_bz2_in_subfolder

decompress_bz2_in_subfolder(destination_folder)
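
For readers without access to the `util` package, the decompression step can be approximated with the standard library alone. This is a minimal sketch, not the actual implementation of *decompress_bz2_in_subfolder*: the function name `decompress_bz2_sketch` is hypothetical, and it assumes each archive should be unpacked next to itself with the `.bz2` suffix dropped.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2_sketch(root):
    """Decompress every .bz2 file found under root (hypothetical helper).

    Each archive is unpacked into the same folder, with the trailing
    '.bz2' removed from the file name. The archive itself is kept.
    """
    written = []
    for archive in sorted(Path(root).rglob("*.bz2")):
        target = archive.with_suffix("")  # 'name.ext.bz2' -> 'name.ext'
        with bz2.open(archive, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)  # stream, so large files are fine
        written.append(target)
    return written
```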

## Validation and testing segmentation data set

Note that each whole-slide image is stored in a separate folder. The training, validation, and testing splits of the whole-slide images have to be created manually, by moving or copying the slide folders.
I use *Feit_colon-annotation_valid* as the validation folder.
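
The manual split can also be scripted. The helper below is a hypothetical convenience (`move_slides` is not part of the `util` package); it simply moves the named slide folders from one root to another. Which slides go to validation remains your choice.

```python
import shutil
from pathlib import Path


def move_slides(source_root, target_root, slide_names):
    """Move the listed slide folders from source_root to target_root.

    Hypothetical helper: slide_names are the folder names of the
    whole-slide images that should end up in the target split.
    """
    source_root, target_root = Path(source_root), Path(target_root)
    target_root.mkdir(parents=True, exist_ok=True)
    for name in slide_names:
        shutil.move(str(source_root / name), str(target_root / name))
```

To keep the originals in place, replace `shutil.move` with `shutil.copytree`.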

## Extracting image patches

Next, we can extract image patches from these annotations using the function *generate_dataset_from_annotated_slides*.
The basic parameters are *dataset_path_source* (path to the data set), *tiles_destination* (folder in which the tiles are saved), and *tile_size* (I recommend 256).

To down-sample the image, set *scale* to an integer > 0. The parameter *neighborhood* is an integer that specifies the number
of neighbouring patches of size tile_size × tile_size that will be extracted. The neighbours do not necessarily belong to the same class as the central patch.
The parameter *fraction* specifies the fraction of the tiles that will be saved (useful for large data sets).
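
To make the roles of *tile_size* and *fraction* concrete, here is a small illustration on a plain rectangular region. It is only a sketch under simplifying assumptions: the real function works on annotated polygons, and the names `tile_origins` and `subsample` are hypothetical.

```python
import random


def tile_origins(width, height, tile_size):
    """Top-left corners of non-overlapping tiles that fit fully inside the region."""
    return [(x, y)
            for y in range(0, height - tile_size + 1, tile_size)
            for x in range(0, width - tile_size + 1, tile_size)]


def subsample(tiles, fraction, seed=0):
    """Keep each tile independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [t for t in tiles if rng.random() < fraction]


tiles = tile_origins(width=1024, height=512, tile_size=256)
kept = subsample(tiles, fraction=0.5)
print(len(tiles), "tiles in the grid,", len(kept), "kept")
```

With `fraction=1.0` every tile is kept, which matches the setting used in the cell below.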

In [4]:
from util.data_manipulation_scripts import generate_dataset_from_annotated_slides

tiles_destination = Path('/home/tomas/Projects/histoseg/data/Feit_colon-annotation_tiles-256')
generate_dataset_from_annotated_slides(dataset_path_source = destination_folder, tiles_destination = tiles_destination,
                                       tile_size = 256, scale = 0, neighborhood = 0, fraction = 1.0)

Processing /home/tomas/Projects/histoseg/data/Feit_colon-annotation/ns-adenoca-colon-14254-2019-20x-he-1/ns-adenoca-colon-14254-2019-20x-he-1.tiff, file 1 out of 4
--Processing polygon 64 out of 64

Processing /home/tomas/Projects/histoseg/data/Feit_colon-annotation/ns-adenoca-colon-14254-2019-20x-he-2/ns-adenoca-colon-14254-2019-20x-he-2.tiff, file 2 out of 4
--Processing polygon 156 out of 156

Processing /home/tomas/Projects/histoseg/data/Feit_colon-annotation/ns-adenoca-colon-15071-2019-20x-he-10/ns-adenoca-colon-15071-2019-20x-he-10.tiff, file 3 out of 4
--Processing polygon 136 out of 136

Processing /home/tomas/Projects/histoseg/data/Feit_colon-annotation/ns-adenoca-colon-15071-2019-20x-he-4/ns-adenoca-colon-15071-2019-20x-he-4.tiff, file 4 out of 4
--Processing polygon 232 out of 232



## Splitting patch data set

The data set of extracted patches can then be split into training and validation data sets (and, if needed, a testing data set).

In [5]:
from util.data_manipulation_scripts import split_train_valid

tiles_train = Path('/home/tomas/Projects/histoseg/data/Feit_colon-annotation_tiles-256-train')
tiles_valid = Path('/home/tomas/Projects/histoseg/data/Feit_colon-annotation_tiles-256-valid')

# split_size is the fraction of the tiles assigned to the training set (a real number between 0 and 1).
split_train_valid(source_dir = tiles_destination, data_train = tiles_train, data_valid = tiles_valid, split_size = 0.8)
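
What such a split does can be sketched as follows. This is not the code of *split_train_valid*; it is a hypothetical stand-in (`split_train_valid_sketch`) that assumes a random split with copying, and that any sub-folder structure under the source folder should be preserved.

```python
import random
import shutil
from pathlib import Path


def split_train_valid_sketch(source_dir, train_dir, valid_dir, split_size=0.8, seed=0):
    """Randomly copy files from source_dir into train_dir and valid_dir.

    split_size is the fraction of files that goes to the training set.
    Returns the number of training and validation files.
    """
    source_dir = Path(source_dir)
    files = sorted(p for p in source_dir.rglob("*") if p.is_file())
    random.Random(seed).shuffle(files)  # seeded, so the split is reproducible
    n_train = int(len(files) * split_size)
    for i, path in enumerate(files):
        root = Path(train_dir) if i < n_train else Path(valid_dir)
        target = root / path.relative_to(source_dir)  # keep sub-folders
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
    return n_train, len(files) - n_train
```

Note that a random split at the patch level puts patches from the same slide into both sets; for an evaluation on unseen slides, use the slide-level validation folder described above.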

## Pre-computing annotation map for segmentation validation data set

For the validation and testing splits, you need to precompute the annotation map. A resolution of 32 is recommended.

In [7]:
from util.data_manipulation_scripts import precompute_annotation_map

data_validation = Path('/home/tomas/Projects/histoseg/data/Feit_colon-annotation_valid/ns-adenoca-colon-15071-2019-20x-he-4')
# It is recommended to process all the images at once by pointing to the whole validation folder:
# data_validation = Path('/home/tomas/Projects/histoseg/data/Feit_colon-annotation_valid')

precompute_annotation_map(data_validation, resolution = 32)

Processing image annotation /home/tomas/Projects/histoseg/data/Feit_colon-annotation_valid/ns-adenoca-colon-15071-2019-20x-he-4/ns-adenoca-colon-15071-2019-20x-he-4.tiff
Processing location 742/742, 890/892
-------------------
