# Datasets

![](https://reconstrue.github.io/brightfield_on_colab/content/images/651806289_minip_turbo_banner.png)

Brightfield neuron datasets are measured in tens of gigabytes (GB). For example, the Allen Institute acquires images with an effective X-Y pixel size of 0.114 μm x 0.114 μm, with slices acquired at 0.28 μm depth increments. "In mice, axonal diameters range from less than 0.2 μm up to 10 μm" [[*](https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/axon)]. Visible light has wavelengths in the range of 0.4 μm to 0.7 μm, blue to red respectively, so the thinnest neurites are narrower than the wavelength of the light used to image them. The combination of the brightfield modality and the size of the neurites therefore means the acquired data is dirty by nature, which makes the neuron reconstruction software's task difficult.
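
To make those scales concrete, the following is a rough back-of-the-envelope sketch using only the figures quoted above. The constant and function names are made up for illustration; nothing here comes from Allen Institute tooling.

```python
# Figures quoted in the text above, all in micrometres.
PIXEL_SIZE_XY_UM = 0.114        # effective X-Y pixel size
Z_STEP_UM = 0.28                # depth increment between slices
THINNEST_AXON_UM = 0.2          # lower end of quoted mouse axon diameters
THICKEST_AXON_UM = 10.0         # upper end of quoted mouse axon diameters
VISIBLE_LIGHT_UM = (0.4, 0.7)   # blue to red


def pixels_across(diameter_um: float, pixel_um: float = PIXEL_SIZE_XY_UM) -> float:
    """Number of X-Y pixels spanned by a structure of the given diameter."""
    return diameter_um / pixel_um


thin_px = pixels_across(THINNEST_AXON_UM)
thick_px = pixels_across(THICKEST_AXON_UM)
# Ratio below 1 means the neurite is narrower than the light's wavelength.
thin_vs_blue = THINNEST_AXON_UM / VISIBLE_LIGHT_UM[0]
# Ratio above 1 means a thin neurite can fall between two focal planes.
z_step_vs_thin = Z_STEP_UM / THINNEST_AXON_UM

print(f"thinnest axon spans about {thin_px:.2f} px")
print(f"thickest axon spans about {thick_px:.1f} px")
print(f"thinnest axon / blue wavelength: {thin_vs_blue:.2f}")
print(f"z step / thinnest axon diameter: {z_step_vs_thin:.2f}")
```

In other words, under these numbers the finest axons are less than two pixels wide and thinner than a single depth increment, which is one way to see why the labels are hard to recover.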

## Data sources
So far, sample datasets from the following sources have been exercised:
- The Allen Institute
- ShuTu sample data

It is also a goal to enable Colab users to upload (or link to) their own brightfield image stacks and then generate SWC files on Colab. With a (currently nonexistent) pre-trained brightfield reconstruction model, this would simply be the inference phase of the hypothetical neuron detector model's lifecycle.

Unfortunately, image file formats are not particularly well defined, so there is ETL data wrangling to be done.

## Training corpus

As with any open computer vision problem, a test corpus is extremely valuable for measuring progress. For example, there is the [Stanford Medical ImageNet](https://aimi.stanford.edu/research/medical-imagenet), but that is for radiology, not brightfield neuroscience. More germane to this project, [The Allen Institute's Cell Type Database](data_sources/allen_institute/cell_types_db.html) has hundreds of brightfield neuron image stacks with "labeled data." More relevant still, the Allen Institute has already set out [a brightfield reconstruction challenge dataset](./allen_institute/brightfield_neuron_reconstruction_challenge.html), consisting of 105 training and 10 test specimens.



## Input file types
Neuron reconstruction input datasets have two parts:
- image stacks: TIFF files from a brightfield microscope
- skeletons: the `*.swc` files

Both training and test neuron specimens come with image stacks. Only training neurons come with skeletons. The skeletons can be thought of as the classification labels for the training data, classifying each voxel as:
- soma (1)
- axon (2)
- basal dendrite (3)
- apical dendrite (4)
- or extra-cellular
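
The label set above can be written down as a small enumeration. This is only a sketch: the codes 1 through 4 are the ones listed above, but the name `NeuriteLabel` and the choice of `0` for extra-cellular voxels are assumptions made for this example (the list above gives no code for background).

```python
from enum import IntEnum


class NeuriteLabel(IntEnum):
    """Per-voxel classes for reconstruction training (hypothetical helper)."""

    EXTRACELLULAR = 0    # assumption: no code is given above, 0 is picked here
    SOMA = 1
    AXON = 2
    BASAL_DENDRITE = 3
    APICAL_DENDRITE = 4


def label_name(code: int) -> str:
    """Return a human-readable name for a label code, or 'unknown'."""
    try:
        return NeuriteLabel(code).name.lower().replace("_", " ")
    except ValueError:
        return "unknown"


for code in (1, 2, 3, 4, 0, 99):
    print(code, label_name(code))
```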

### SWC files
In the context of neuron reconstructor model training, the core analogy is data:labels::micrographs:skeletons.

There is an unfortunate bit of nomenclatural history to the "SWC" name, as per the [SWC+ spec](https://neuroinformatics.nl/swcPlus/):
> (S.W.C. encodes for the last names of its initial designers Ed Stockley, Howard Wheal, and Robert Cannon) 

The following is [an example SWC, rendered via Janelia's Sharkviewer](https://www.janelia.org/sharkviewer).

<img src="https://reconstrue.github.io/brightfield_on_colab/content/images/sharkviewer.png" width="60%" />
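
As a rough illustration of what a skeleton file holds, here is a minimal SWC reader, assuming the common seven-column layout (`id type x y z radius parent`, with `#` comment lines and a parent of `-1` for the root). The sample text is invented for this example, and the helper names `parse_swc` and `cable_length` are hypothetical, not part of any existing library.

```python
import math
from typing import Dict, NamedTuple


class SwcNode(NamedTuple):
    node_id: int
    type_code: int     # 1 soma, 2 axon, 3 basal dendrite, 4 apical dendrite
    x: float
    y: float
    z: float
    radius: float
    parent_id: int     # -1 marks the root node


def parse_swc(text: str) -> Dict[int, SwcNode]:
    """Parse SWC text into a dict keyed by node id, skipping comments."""
    nodes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split()
        node = SwcNode(int(f[0]), int(f[1]), float(f[2]), float(f[3]),
                       float(f[4]), float(f[5]), int(f[6]))
        nodes[node.node_id] = node
    return nodes


def cable_length(nodes: Dict[int, SwcNode]) -> float:
    """Sum the straight-line distance from every node to its parent."""
    total = 0.0
    for node in nodes.values():
        parent = nodes.get(node.parent_id)
        if parent is not None:
            total += math.dist((node.x, node.y, node.z),
                               (parent.x, parent.y, parent.z))
    return total


# Invented three-node skeleton: a soma with one short basal dendrite.
SAMPLE_SWC = """\
# id type x y z radius parent
1 1 0.0 0.0 0.0 5.0 -1
2 3 3.0 4.0 0.0 1.0 1
3 3 3.0 4.0 12.0 0.8 2
"""

nodes = parse_swc(SAMPLE_SWC)
print(len(nodes), "nodes, cable length", cable_length(nodes))
```

Because each node points at its parent, the whole tree structure is recoverable from the flat file, which is what lets a skeleton serve as the label for an image stack.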


### Image stack

In the Allen Institute's Cell Type Database, individual image stacks average around 20 GB, with some as large as 60 GB.

TIFF is the standard image file format for brightfield. Classic TIFF uses 32-bit offsets, which means TIFF file size maxes out at 4 GB. [BigTIFF](https://www.awaresystems.be/imaging/tiff/bigtiff.html) is a variant of TIFF that uses 64-bit offsets, which lifts that limit.
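
The two flavors can be told apart from the first four bytes of a file: a byte-order mark (`II` or `MM`) followed by the version number 42 for classic TIFF or 43 for BigTIFF. The sketch below uses only the standard library; the function names `tiff_flavor` and `needs_bigtiff` are made up for this example.

```python
import struct

# 32-bit offsets can address at most 2**32 bytes (4 GiB).
CLASSIC_TIFF_LIMIT_BYTES = 2 ** 32


def tiff_flavor(header: bytes) -> str:
    """Classify a file as 'TIFF' or 'BigTIFF' from its first four bytes."""
    if len(header) < 4:
        raise ValueError("need at least 4 header bytes")
    if header[:2] == b"II":
        endian = "<"   # little-endian ("Intel")
    elif header[:2] == b"MM":
        endian = ">"   # big-endian ("Motorola")
    else:
        raise ValueError("not a TIFF byte-order mark")
    (version,) = struct.unpack(endian + "H", header[2:4])
    if version == 42:
        return "TIFF"
    if version == 43:
        return "BigTIFF"
    raise ValueError(f"unexpected TIFF version {version}")


def needs_bigtiff(size_bytes: int) -> bool:
    """True if a single file of this size cannot fit in classic TIFF."""
    return size_bytes >= CLASSIC_TIFF_LIMIT_BYTES


print(tiff_flavor(b"II\x2a\x00"))      # little-endian classic TIFF header
print(tiff_flavor(b"MM\x00\x2b"))      # big-endian BigTIFF header
print(needs_bigtiff(20 * 10 ** 9))     # a 20 GB stack stored as one file
```

In practice the header would come from something like `open(path, "rb").read(4)`, where `path` points at a downloaded image stack.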

Sometimes image stacks are distributed as a single file containing multiple ["TIFF subfiles"](https://en.wikipedia.org/wiki/TIFF#Multiple_subfiles), a feature that is part of the core TIFF spec. Given the image stack size, the file format is sometimes BigTIFF rather than TIFF.


The images might be rectilinear, as in the case of data from the Allen Institute. Or the images might be the result of stitching, which may or may not be rectilinear. (The Allen's data is stitched, but it is also rectilinear; the tiling shows up as subtle lighting banding artifacts in some images.) For example, the following virtual slide stitching image was produced by ShuTu.

![(c) ShuTu](https://reconstrue.github.io/brightfield_on_colab/content/images/shutu_stitched.jpg)