# Welcome to the Schoborg Fly Brain Volume Estimation Jupyter Notebook!
#### By Samuel Fay and Holden Bindl


## Layout

This notebook provides background for the project and explains how volume is estimated for a fly brain. Then it goes into the code structure while explaining which strategies worked and what I wasn't able to get working.

## Contact

If you have any questions about anything in this project, please feel free to contact me at smafay6 AT gmail.com. You can also contact Lars Kotthoff, the Computer Science mentor of the project, at larsko AT uwyo.edu. Or you can contact the Molecular Biology mentor of this project, Todd Schoborg, at todd.schoborg AT uwyo.edu.

## Background

Todd Schoborg is a professor in the Department of Molecular Biology at the University of Wyoming. He uses the fruit fly, Drosophila melanogaster, as a model organism to study particular genes involved in the control of brain growth. Mutations of these genes result in a small brain phenotype referred to as microcephaly. One useful metric in measuring the size of a fly's brain is its volume, which can be determined from micro-CT scans of the fly's head.

The micro-CT scan of a fly’s head results in a stack of roughly 150 images that are taken at equal intervals while moving through the fly's head. Each pixel in each image represents a certain volume within that fly's head. The length and width of each pixel define the area it covers within an image, and the depth of the pixel is equal to the gap between consecutive images. Throughout his research, Todd has drawn thousands of hand-traced masks, in which each pixel is "on" if there is brain tissue in the pixel and "off" if there is not. Here are a few examples of slices from a single fly head, each of which contains brain tissue, and their corresponding hand-traced masks.

<img src='Images/Brain Scan Example 1.png' width="200" height="200">  <img src='Images/Brain Scan Example 2.png' width="200" height="200">  <img src='Images/Brain Scan Example 3.png' width="200" height="200">

<img src='Images/Hand Drawn Mask 1.png' width="200" height="200">  <img src='Images/Hand Drawn Mask 2.png' width="200" height="200">  <img src='Images/Hand Drawn Mask 3.png' width="200" height="200">

There are about 20 images between each of these three examples, so the change from one image to the next is relatively gradual. This allows us to come up with a very good approximation of volume from a collection of masks. By taking the total count of pixels in the mask that are "on" (i.e., that correspond to brain tissue) and then multiplying that by the dimensions of each pixel, we can get a number that represents the volume of the brain in that slice. Then, by doing this for every slice that contains brain tissue and summing the volumes together, we can calculate the total brain volume for the fly. The image below demonstrates taking the volume of a slice that makes up a larger volume. There can be some small errors because each slice (represented in brown in the figure) is extended for $\Delta x$ without any change in width and height, but if there are enough slices, this method provides a very good approximation.
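
The slice-summing idea described above can be sketched in a few lines of NumPy. This is only an illustration under assumed names: `brain_volume` and its parameters are hypothetical and are not functions from the project files, and the voxel dimensions are made up.

```python
import numpy as np

def brain_volume(masks, pixel_width, pixel_height, slice_spacing):
    """Approximate a volume by counting "on" pixels over all slices.

    masks: iterable of 2D arrays where any nonzero value means brain tissue.
    pixel_width, pixel_height, slice_spacing: voxel dimensions (assumed units).
    """
    voxel_volume = pixel_width * pixel_height * slice_spacing
    on_pixels = sum(int(np.count_nonzero(m)) for m in masks)
    return on_pixels * voxel_volume

# Two toy 4x4 slices with 4 and 6 "on" pixels, respectively
slice_a = np.zeros((4, 4), dtype=np.uint8)
slice_a[1:3, 1:3] = 1
slice_b = np.zeros((4, 4), dtype=np.uint8)
slice_b[1:3, 0:3] = 1

volume = brain_volume([slice_a, slice_b], 2.0, 2.0, 3.0)
```

Here 10 "on" pixels times a voxel volume of 12 gives 120 cubic units.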

<img src='Images/Volume Cross Section.png' width="400" height="400">source: math24.net

The main goal of this project is to automate this whole volume calculation process so that Todd does not have to devote a ton of time to manually tracing brain regions in more than 100 slices per fly. To automate this process, we used an image segmentation neural network that classifies each pixel of each image in a scan as “brain” or “not brain”. The network was trained using all of the flies that Todd drew masks for. The scans that were shown above are shown below, with the masks that were generated by the neural network.

<img src='Images/Brain Scan Example 1.png' width="200" height="200">  <img src='Images/Brain Scan Example 2.png' width="200" height="200">  <img src='Images/Brain Scan Example 3.png' width="200" height="200">

<img src='Images/Generated Mask 1.png' width="200" height="200">  <img src='Images/Generated Mask 2.png' width="200" height="200">  <img src='Images/Generated Mask 3.png' width="200" height="200">




The generated masks are not perfect replicates of Todd’s hand-drawn masks, but they are good enough to get a pretty good approximation. They are precise, but not especially accurate. The network tends to underestimate the brain region as compared to the hand-drawn masks, but it underestimates it at a pretty consistent rate. Plotting both sets of data resulted in a regression line with an $R^2$ value of $\approx 0.97$. When the network was run on the training data, the average proportion of the generated volume relative to the actual volume was $\approx 0.87$. If we multiply each predicted volume by $\frac{1}{0.87}\approx 1.15$, there is an average error of about $2.6\%$. If this level of underestimation for the training data is consistent with the results from new data, then we could get good estimates by multiplying each predicted volume by $1.15$. However, this is a big if, and I don't know what exactly is causing the underestimation. All I can really say is that the masks the network produces for flies it has not seen before look accurate, so that leads me to think that volume estimates would be precise at the very least.
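
As a small illustration of the correction described above, the sketch below divides a predicted volume by the training-set ratio of $0.87$. The function names and the example volumes are hypothetical, and the ratio is only known to hold for the training data.

```python
TRAINING_RATIO = 0.87  # mean predicted/actual volume on the training flies (from the text)

def correct_volume(predicted_volume, ratio=TRAINING_RATIO):
    """Scale a predicted volume up by 1/ratio to offset the underestimation."""
    return predicted_volume / ratio

def percent_error(estimate, actual):
    """Absolute error of an estimate, as a percentage of the actual value."""
    return abs(estimate - actual) / actual * 100.0

# Hypothetical fly: the network reports 87 units for a true volume of 100 units
corrected = correct_volume(87.0)
error = percent_error(corrected, 100.0)
```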


## Code Documentation

### Code Layout

There are 4 files for this project. In this notebook, I will go over each file in some depth and explain the functions and how they work. For more detailed, line-by-line explanations, see the comments in the code files themselves.

The general workflow for training the network and then predicting a set of images goes as follows:

1. Read in tif stack training images and convert them to square pngs. Determine if they are a mask or a scan, and then accordingly output them into a general Masks/ or Scans/ folder. 

2. Train the network using the Masks/ and Scans/ folder as data. Then, save the model to a file. If the model has already been trained and saved, then you can just change a variable in the code and load in the model.

3. Read in the images you want to predict. Convert them to the correct size for the neural network, while storing the original resolution for later rescaling. Predict the images, and store the resulting masks.

4. Take the predicted masks and scale them back to the original size. Then, sum the number of "on" pixels in each fly and multiply by the voxel size to get the volume.

### ImageManip.py

ImageManip.py has functions that are used to convert various types of images into a format that can be used either to train the network or to have the network predict the brain mask. As a little background on how the training images are stored, they are within a parent folder that has 16 fly type folders in it. One fly type name could be "FL Del NLS1-3". Each of these fly type folders has around 5 individual fly folders in it. These individual folders are numbered with an underscore before the number, e.g. "\_02". Within the individual folder, there are two tif stacks. One of these has the scans, and one has the brain masks. Each has the same number of images in it, which are all in order. Some of the other images we were given came in slightly different formats, and there are separate functions for those. I will explain those formats when I get to those functions.

#### writePNG(outDir, flyType, flyNumber, fileName, image, imageNumber, isMaskVar)

This function writes a png to an output directory composed of the input parameters. `outDir` refers to a parent directory which you want to contain all the fly types: "outDir/flyType/flyNumber". This function doesn't take into account whether the image is a mask or not. It was just an early version of the function for testing purposes, and I kept it because it could still be useful for some testing.

#### writePNGGrouped(outDir, flyType, flyNumber, fileName, image, imageNumber, imagePairID, isMaskVar)

This function is used to write the png in all cases. For this function, `outDir` needs to end in the `flyNumber`, e.g. "someParentDir/flyType/flyNumber". It is this way because in the function that calls this one, I pass in a list of all of the dirs that I want to output images to. If `isMaskVar` is true, then "Masks/" is appended to the end of `outDir`, otherwise "Scans/" is appended. 

Then, the `outPath` variable is made based on some of the arguments:

```python
outPath += (str(imagePairID) + "-" + flyType + "-" + str(flyNumber) + "-" + 
                str(imageNumber) + ".png")
```

`imagePairID` is a number that is unique to that image in whatever directory it is located. `imageNumber` is a number that is unique relative to whatever fly that image is in. So, if the 113th image from fly "\_01" of type "Asp C" is going into the masks training folder that has 20000 other images before it, then the file name might be "20001-Asp C-\_01-113.png". However, if it is going into a folder where the only images are from Asp C \_01, then the file name would just be "113-Asp C-\_01-113.png". After making the filename, the image is cast to uint16, encoded to a png, and written to file.
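
To make the naming scheme concrete, here is a minimal sketch that builds a file name in this format and parses it back. Both helper names are hypothetical (they are not in `ImageManip.py`), and the parser assumes the fly number itself contains no hyphen.

```python
def build_png_name(image_pair_id, fly_type, fly_number, image_number):
    """Mirror the scheme <pairID>-<flyType>-<flyNumber>-<imageNumber>.png."""
    return f"{image_pair_id}-{fly_type}-{fly_number}-{image_number}.png"

def parse_png_name(name):
    """Recover the four fields, splitting from both ends so that
    hyphens inside the fly type (e.g. "FL Del NLS1-3") survive."""
    stem = name[:-len(".png")]
    pair_id, rest = stem.split("-", 1)
    rest, image_number = rest.rsplit("-", 1)
    fly_type, fly_number = rest.rsplit("-", 1)
    return int(pair_id), fly_type, fly_number, int(image_number)

name = build_png_name(20001, "Asp C", "_01", 113)
fields = parse_png_name("5-FL Del NLS1-3-_04-7.png")
```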

#### makeImageSquare(img)

This function crops the longer side of the image evenly from each side so that the image ends up square. If an image is 324x303, then the resulting image will be 303x303. 

```python
if diff % 2 == 0:
    # the difference is even, so crop the same amount from each side
    firstCrop = secondCrop = diff // 2
else:
    # the difference is odd, so crop one extra pixel from the second side
    firstCrop = diff // 2
    secondCrop = firstCrop + 1
```

If the difference between the longer side and the shorter side is even, then we can chop the same number of pixels off of each side. However, if it isn't, then cropping evenly from each side won't result in a square image. Therefore, we add one pixel to the secondCrop to account for that. After that, we return a sliced copy of the image.
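
For reference, a self-contained version of this center crop might look like the sketch below. It assumes the image is a NumPy array with height and width as its first two axes; `make_square` is a hypothetical name, not the project's `makeImageSquare()`.

```python
import numpy as np

def make_square(img):
    """Center-crop the longer axis so that the result is square."""
    height, width = img.shape[:2]
    diff = abs(height - width)
    first_crop = diff // 2            # pixels removed from the start
    second_crop = diff - first_crop   # pixels removed from the end (one more if diff is odd)
    if height > width:
        return img[first_crop:height - second_crop, :].copy()
    return img[:, first_crop:width - second_crop].copy()

square = make_square(np.zeros((324, 303), dtype=np.uint16))
wide = make_square(np.zeros((10, 16), dtype=np.uint16))
```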

#### noiseReduce(img)

This function just turns each pixel to 0 if the pixel's value is below a certain threshold. In theory, this should make the borders on the scan a little more crisp, but it didn't help accuracy when I tried it out once. When I tried it, the network was not trained with noise-reduced images. It's possible that making the training images also be noise-reduced could help. 
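
The thresholding step itself is simple. The sketch below shows one way it could be written with NumPy; the function name and the threshold value are placeholders rather than what the project uses.

```python
import numpy as np

def noise_reduce(img, threshold):
    """Return a copy of img with every pixel below `threshold` set to 0."""
    out = img.copy()
    out[out < threshold] = 0
    return out

scan = np.array([[10, 500], [3000, 40]], dtype=np.uint16)
cleaned = noise_reduce(scan, threshold=100)  # threshold chosen arbitrarily
```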

This function also contains some code from when Holden was working on the project. This code is commented out, and it detects the edges on the image and then makes a square image out of it. One problem with this edge detection code is that it crops it to a fixed size, which could result in cropping out the brain if it's a high resolution image.

#### isMask(tifPath)

This function determines if a tif stack consists of masks or scans. Masks only have two unique values throughout all of the images, while scans are grayscale, with values between 0 and 65535 if it's read in as a uint16. This means that if there are more than 2 unique values in the image, then it must be a scan. If there are only 2 unique values, then it is most likely a mask. However, I make sure that there are two images that have exactly 2 unique values in them, just in case there's a scan that is mostly black and then has one pixel that is a value other than 0. 
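
One way to read this heuristic is sketched below. It works on a list of already-loaded NumPy arrays instead of a tif path, and `looks_like_mask` and `required_matches` are hypothetical names, so treat it as an interpretation of the description and not as the actual `isMask()` code.

```python
import numpy as np

def looks_like_mask(images, required_matches=2):
    """Guess whether a stack of images holds masks.

    The stack is rejected as soon as any image has more than 2 unique
    values, and accepted once `required_matches` images have exactly 2.
    """
    matches = 0
    for img in images:
        n_unique = len(np.unique(img))
        if n_unique > 2:
            return False
        if n_unique == 2:
            matches += 1
            if matches >= required_matches:
                return True
    return False

mask = np.zeros((4, 4), dtype=np.uint16)
mask[1:3, 1:3] = 65535
nearly_black = np.zeros((4, 4), dtype=np.uint16)
nearly_black[0, 0] = 7
gradient = np.arange(16, dtype=np.uint16).reshape(4, 4)

is_mask_stack = looks_like_mask([mask, mask])
is_scan_stack = looks_like_mask([nearly_black, gradient])
```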

#### getSpecificImageFromTif(tifPath, index, getScan=False)

One of the issues that this network has is underestimating the volume. One solution Lars suggested for this is to replicate some of the images with the largest absolute error in the training dataset. One of the things I output when I test the network on the training set is the index of the image with the largest error for each fly. Then, to replicate these images in the training dataset, we need both the mask and the scan from that index. To indicate whether we need the mask or the scan, there is the optional argument `getScan` with a default value of `False`. If the function is supplied the scan tif file and we need the mask, or if it's supplied the mask file and we need the scan, then we find the other tif file in the same directory. This assumes that each fly's folder follows the same format as the training data, where there are two tif stack files, one with scans and one with masks. If this is the case, then we just iterate through the directory, and once we find a file that is not hidden and doesn't have the same name as the one we don't want, we choose that file.

Now that we have the correct tif stack, we just iterate through each page and increment a counter variable until we reach the desired index. Then, we return that image.

#### convertImageHelper(tifPath, outputDirectory, totalImages, imagePairID)

This is a helper function for the different functions that convert tif images to png images for training or testing. It takes in a path to the tif to convert, the output directory, `totalImages`, and `imagePairID`. `imagePairID` is used to give each image a unique number that can be matched to the corresponding mask or scan. `imagePairID` is not incremented after this function returns if the other tif stack of the current fly hasn't been processed yet. This logic is handled in one of the parent functions, `convertAndNormalizeTestImages()`, but it is still relevant to this function.

`writePNGGrouped()` is then called, which writes the image using the unique number ID `imagePairID` and a local variable `imageNumber`, which denotes which number the image is with respect to the current fly. The function then returns `totalImages` to indicate the total number of images that have been processed in the current file and all the files before it. It also returns a dictionary with a key of the name and type of the fly, and a value of the side dimension of the square image of the fly before it is scaled down to 256x256. This is a necessary value to have, because when calculating the brain volume of an image, the masks must be scaled back up to the original square resolution of the image so that each pixel takes up the area that it was intended to take up.
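
As an illustration of why the stored side length matters, the sketch below rescales a small square mask back to a larger side length with nearest-neighbour sampling. The choice of nearest-neighbour and the name `upscale_mask` are assumptions made for this example; the project's own rescaling may use a different method.

```python
import numpy as np

def upscale_mask(mask, side):
    """Nearest-neighbour resize of a square 2D mask to side x side."""
    src = mask.shape[0]
    # For each output row/column, pick the source index it falls into
    idx = (np.arange(side) * src) // side
    return mask[np.ix_(idx, idx)]

small = np.array([[1, 0],
                  [0, 1]], dtype=np.uint8)
big = upscale_mask(small, 4)

# Hypothetical case from the text's sizes: a 256x256 prediction back to 303x303
restored = upscale_mask(np.ones((256, 256), dtype=np.uint8), 303)
```

Counting "on" pixels in the rescaled mask, and not in the 256x256 one, keeps each counted pixel tied to the physical pixel size of the original scan.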

#### convertAndNormalizeTestImages(parentDirectory, outputDirectory, totalImages, imagePairID)

This function takes in a parent directory that contains directories in the format `parentDirectory/flyType/flyNumber`, where each fly number folder has two tif image stack files in it, one of scans and one of masks. The `outputDirectory` is a directory that will be used for training, so it just contains two folders of images, one for masks and one for scans. If this is being run on a new folder without any existing uniquely numbered data, then `totalImages` and `imagePairID` can be initialized to 1.

All of the fly directories are looped through. For each directory, if it's the first pass through the directory, we haven't processed either the scans or masks yet, so the `firstPass` variable is `True`. When first pass is true, we set `previousTotalImages` equal to `totalImages`, and then we run `convertImageHelper()`. The result gets returned into `totalImages`. Then we get the difference between `totalImages` and `previousTotalImages` to get how many images the current fly contains, and store this in `firstPassImageCount`. 

For the second pass, we also create `previousTotalImages` in order to make sure that there is no mismatch between the number of masks and scans for every fly. We call `convertImageHelper()`, get the new `totalImages`, look at the difference and make sure that `numberNewImages` is equal to `firstPassImageCount`. If it's not, we print a message warning that there is an image mismatch in the current fly. Then we increment `imagePairID` by `numberNewImages` because we have written both the scans and the masks for the current fly.

As an example, let's say `totalImages` starts at 300 and `imagePairID` starts at 151, meaning we've written one fly so far with 150 images (150 masks and 150 scans), and we are just starting on the second one. On the first pass, we iterate through the fly directory and happen to choose the mask file. We call `convertImageHelper()` and it assigns image IDs 151-305 to the images, which are written to the `Masks/` folder in `outputDirectory`. This means that the second fly has 155 images. `convertImageHelper()` then returns `totalImages`, which is 455. In the current function, `imagePairID` is still `151` because it isn't returned from `convertImageHelper()`; it's only passed to it. We set `firstPass` to `False`, continue in the loop, and then process the scan file. We pass `totalImages=455` and `imagePairID=151` to `convertImageHelper()`, which processes the images in the tif scans file. That returns our new value for `totalImages`, which is hopefully `610`. We check that the number of images just processed from the scan file is equal to the number from the mask file, and then we increment `imagePairID` by `numberNewImages`, so `imagePairID` is now `306`.
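
The bookkeeping in this example can be checked with a short simulation. The sketch below replaces the real tif processing with plain stack lengths, and `process_fly` is a hypothetical stand-in for one iteration of the loop in `convertAndNormalizeTestImages()`.

```python
def process_fly(stack_lengths, total_images, image_pair_id):
    """Simulate the two passes over one fly's tif stacks.

    stack_lengths: (first_stack_len, second_stack_len), e.g. masks then scans.
    Returns the updated (total_images, image_pair_id, matched).
    """
    first_len, second_len = stack_lengths

    # First pass: IDs image_pair_id .. image_pair_id + first_len - 1 are used
    previous_total = total_images
    total_images += first_len
    first_pass_count = total_images - previous_total

    # Second pass: the same IDs are reused so mask and scan names line up
    previous_total = total_images
    total_images += second_len
    number_new = total_images - previous_total

    matched = number_new == first_pass_count
    image_pair_id += number_new
    return total_images, image_pair_id, matched

total, pair_id, ok = process_fly((155, 155), total_images=300, image_pair_id=151)
```

With the numbers from the example, this ends at `total == 610` and `pair_id == 306`.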

#### convertAndNormalizeSpecificTestImages(tifDirectoryList, outputDirectory, totalImages, imagePairID)

This function just takes a list of directories of tifs and calls `convertImageHelper()` for them. It doesn't do any unique pair IDs for the images, it just outputs them to the output directory. I only used it once for testing purposes to make sure `convertImageHelper()` was working.

#### convertToDifferentDirectories(tifFileList, outputDirectoryList)

This function converts the tif file at index `i` in `tifFileList` to PNGs in the directory at index `i` in `outputDirectoryList`. It also returns a dictionary of the original image resolutions, which can be used later if you are predicting and need to rescale the predicted masks back to the resolution they are supposed to be at for accurate volume calculations. This is the function that I used to convert the training data from tifs to pngs when I was testing how the network predicted on the training data.

#### convertIndividualTifToPng(inputFlyDir, outputDir)

Some of the new flies from Todd to test out the segmentation network on images it had never seen before came in a slightly different format, where instead of coming in a tif stack they are in individual tif images. These images are numbered which is how we determine the order. Even though the order doesn't really matter because we are just predicting each image individually, it helps to make sure we're outputting them in the original order because it gives you a better sense of the pattern of the brain as the scan goes through the head. One example of an image name from the data he gave is `_01235.tif`. Therefore, we sort the images in `inputFlyDir` with 
```python
imageList.sort(key = lambda x: int(str(x.parts[-1])[2:-4]))
```
`x.parts[-1]` gets the last section from the whole file path, which is the file name. Then, the `[2:-4]` gets the number `1235`, which is the key for sorting. It's possible that for other data, the slice may need to be `[1:-4]` if the number is a 5 digit number. After we sort the list of tifs in the directory, we just go through them and convert them like a normal tif stack. We also keep track of their original resolutions so that we can scale the masks that get predicted down the line back to a resolution that lets us calculate the volume.
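
If the fixed slice turns out to be fragile for other data, one possible alternative is to pull the digits out of the file name with a regular expression. This is only a sketch of that idea, not what the project does; `frame_number` and the example paths are hypothetical.

```python
import re
from pathlib import Path

def frame_number(path):
    """Return the first run of digits in the file name as an int."""
    match = re.search(r"\d+", Path(path).stem)
    if match is None:
        raise ValueError(f"no number in file name: {path}")
    return int(match.group())

paths = [Path("fly/_01235.tif"), Path("fly/_00007.tif"), Path("fly/_10002.tif")]
ordered = sorted(paths, key=frame_number)
```

Because the whole digit run is parsed, a 5 digit number such as `_10002.tif` sorts correctly without changing the slice.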

#### calculateBrainDivisions(flyMaskDirList, numImagesPerStack)

This function was for the regression network that isn't currently used. The regression network takes in 3 overlaid scan images from a fly and outputs a number representing the volume. In order to get 3 good overlaid images, we look at a folder of masks and determine which masks show a reasonable amount of brain. Then we split them up by their order, and randomly choose an image from each of the 3 lists to get a good representative slice for the beginning, middle, and end of the brain. For this function, which I ran on the training data, I found that 2500 "on" pixels was a good threshold, although this could change for different data at a different resolution. The function keeps track of which images are good, and then makes a list of these images for both the scans and the masks. These lists are then split up in close to even proportions according to the `numImagesPerStack` variable. Because the regression network takes in a stack of 3 images, `numImagesPerStack` was set to 3 when I was using it. The function returns the split-up list of scan images, because we were testing the network out on scans. However, there is also a list of the mask images that gets split up, so you can return that list instead if you wish.
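
The "split in close to even proportions, then pick one image per part" step could look roughly like the sketch below. The helper names and file names are hypothetical, and the filtering by 2500 "on" pixels is assumed to have happened already.

```python
import random

def split_evenly(items, parts):
    """Split a list into `parts` contiguous chunks whose sizes differ by at most 1."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

def pick_stack(good_images, images_per_stack, rng=random):
    """Pick one random image from each chunk, preserving slice order."""
    return [rng.choice(chunk) for chunk in split_evenly(good_images, images_per_stack)]

good = [f"slice_{i}.png" for i in range(40, 50)]  # hypothetical "good" slices
chunks = split_evenly(good, 3)
stack = pick_stack(good, 3, random.Random(0))
```

Note that `pick_stack` needs at least as many good images as `images_per_stack`, since it cannot choose from an empty chunk.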

#### getRandomStacks(flyMaskDirList, numImagesPerStack, numStacksPerFly, outDir, organized=True)

This function is also for the regression network. It takes in the list of mask directories, the number of images per stack, the number of randomly selected 3-image stacks that we want for each fly, the output directory for these stacks, and whether we want the stacks to be in the same or different folders. It then calls `calculateBrainDivisions()` and, for each fly, gets the desired number of stacks and outputs them to `outDir`.

#### getTiffMetadata(parentDirectory)

This function gets the ImageJ metadata from a `parentDirectory` containing flies with tif images. I made it to see what was stored in the metadata and whether I could determine the scale for the image from the metadata. However, the training data didn't have all the metadata needed to get the scale. It's possible that for other images that have all the metadata, this function could be useful.

#### main()

main() is only run if the file is run directly. If it is just imported, then main() is not run and only the functions above are used. main() contains a few commented code snippets that have brief descriptions of what they do. The first snippet converts all of the training data from tifs to pngs with matching IDs, in separate folders, ready to be used for training. It then counts all the files in the `Masks/` and `Scans/` folders, makes sure that the number of images is the same, and prints the results.

The second snippet just converts two flies to a folder for testing purposes. The third snippet converts a few images from some flies not in the training dataset that contain individual tif images rather than tif stacks. The fourth snippet gets the metadata for all of the flies in that same group that the third snippet dealt with. The fifth snippet was used to test noise reduction on some scans to see if it improved mask accuracy.

At the end of the file is this bit of code:
```python
if __name__ == "__main__":
    main()
```
`__name__` is a built-in variable in Python. If the file is being run on its own, then `__name__` will be `"__main__"`. If it's not being run alone, and is instead being imported into another file, then it will be the name of the module, which in this case is `"ImageManip"`.

### SEMANTIC2.py

SEMANTIC2.py contains the code for the image segmentation neural network. Much of the code for the network comes straight from the Tensorflow Image Segmentation tutorial, which can be found [here](https://www.tensorflow.org/tutorials/images/segmentation). The architecture of the network remains the same apart from the resolution that the images are processed at, and what is changed is mainly how the images are loaded in, how the training data is processed, and the model being saved. I will mainly go over the functions that are modified and aren't straight from the tutorial. 


```python
BUFFER_SIZE = 1000
BATCH_SIZE = 32
IMG_WIDTH = 224
IMG_HEIGHT = 224
# brain, or not brain
OUTPUT_CHANNELS = 2
```

Buffer size and batch size are just set to values that worked for me. They definitely could be changed. The width and height of the images are set to 224 because those are the maximum dimensions that the U-Net model this network is built on accepts. The original tutorial dataset had pictures of pets with masks associated with them. Those masks had 3 types of pixels: one for pet, one for the border of the pet, and one for not pet. Our training masks only have 2 types of pixels: one for brain and one for not brain. When I first started to train the network, the number of output channels was still 3 and it worked totally fine, so it may not be strictly necessary to change it to 2, but it probably doesn't hurt.

#### normalize(img, mask)

This function first resizes both the `img` (aka the scan) and `mask` to `IMG_WIDTH` and `IMG_HEIGHT`, which are both `224`. The images are read in as `tf.uint8`, which has a range of `[0,255]`. Therefore, we divide both images by `255` to normalize the values between 0 and 1. 

#### load_image_train(dp), load_image_test(dp)

These functions are very similar, so I'm going to talk about them at the same time. They both load in the image from the `tf.data.Dataset` and apply a rotation to it. This rotation only turns the image by a random angle between 0 and 180 degrees in the clockwise direction, so it never covers a full rotation. I haven't experimented at all with removing the rotation or expanding it to the full 360 degrees. After rotating the images, we normalize them with `normalize()`.

`load_image_train()` has some commented out code from when Holden was working on it, where the images are also flipped. I have not experimented with flipping the images either.

#### display(display_list)

This function was used to display images when the code was being run locally. I do not use it anymore now that I am running the code on Teton.

#### show_predictions(dataset=None, num=1)

This is also a debugging function. It shows the predictions for a given input dataset.

#### on_epoch_end(self, epoch, logs=None)

This function is a member of the `DisplayCallback` class, which inherits from `tf.keras.callbacks.Callback`. It is called after each epoch finishes. When the code was being run locally, it would call `show_predictions()`, which would display the predicted output for a specific, randomly chosen scan. This scan was only chosen once, at the beginning of training, so you got to see how the network improved on it as each epoch went on. For training on Teton, I have that call commented out, so the only thing this function now does is print a message saying that the current epoch has finished.

#### unet_model(output_channels)

The only thing changed in this function from the tutorial is that the input shape for `inputs` and `base_model` is changed to 224x224, the largest resolution supported by this model.

#### predictImages(imageDirectory, outputDirectory)

This function takes in a directory of images, reads them in, and predicts a mask for each one using the model. It then outputs the masks as pngs into the output directory. Each mask in the output directory has the same filename as the scan it comes from. The function prints how many images were written after it's done.

#### main()

Unlike ImageManip.py, SEMANTIC2.py has code that is run every time we run this file because we always want to load in or train the model when we import this file. First are the important lines that describe whether we have a model that we want to load in, and what the name of that model is:
```python
hasModel = False
modelPath = "5epochsegmentationmodel"
```
If `hasModel` is `True`, then we simply load in the model:
```python
if hasModel:
    print("\nLoading Model...")
    model = tf.keras.models.load_model(modelPath)
```
If `hasModel` is `False`, then we need to load in the data and train the model:
```python
else:
    print("training new model")
    print("\nloading data...")
    dirMasks = '../3D\ Test\ Images/Masks'
    dirScans = '../3D\ Test\ Images/Scans'
    masks = pathlib.Path(dirMasks)
    scans = pathlib.Path(dirScans)
    masks = tf.data.Dataset.list_files((str(masks/'*.png')),shuffle=False)
    scans = tf.data.Dataset.list_files((str(scans/'*.png')),shuffle=False)
```
If you remember from the main function in `ImageManip.py`, when we make the training data, we write it to the `../3D Test Images/` directory. So, we convert those directory names into paths from the `pathlib` library, which allows us to do useful things like grab all sorts of files and iterate through the directory. In this case, we just use `tf.data.Dataset.list_files()` to get all the png files from the mask and scan directories.
```python
    ds = tf.data.Dataset.zip(({"image":scans,"seg":masks}))
    ds = ds.shuffle(BUFFER_SIZE,reshuffle_each_iteration=False)
    # take the first 90% of the data and leave the last 10% for validation
    train = ds.take(int(0.9*len(list(ds.as_numpy_iterator()))))
    test = ds.skip(int(0.9*len(list(ds.as_numpy_iterator()))))
    train = train.map(load_image_train,num_parallel_calls=AUTOTUNE)
    test = test.map(load_image_test,num_parallel_calls=AUTOTUNE)
```
Because the images have a unique number in front of them, and the files are sorted, all the images in the masks and scans folders are in the same order. Therefore we can just zip the two datasets together, and each scan will be matched with its desired output mask. Then, we take 90% of the data for the training dataset and leave the remaining 10% for validation. We then map the `load_image_train()` and `load_image_test()` functions to the two datasets, respectively. This transforms the datasets from just containing the filenames to actually storing matrices of the pixel values of the rotated and normalized images. This part of the pipeline should definitely be experimented with, as I don't know whether or not changing the `reshuffle_each_iteration` variable makes the scans get new random rotations every epoch, or whether the rotations persist throughout the whole training of the model.
```python
    train_batch = train.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
    train_batch = train_batch.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
    test_batch = test.batch(BATCH_SIZE)
    model = unet_model(OUTPUT_CHANNELS)
    model.compile(optimizer='adam',loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),metrics=['accuracy'])
```
Then, we shuffle the training dataset and split both datasets up into batches. Then, the model is compiled. The compile line is straight from the tutorial.
```python
    EPOCHS = 5
    VALIDATION_STEPS = 1
    # uncomment if you want to show predictions before 1 epoch is run
    # show_predictions()
    model_history = model.fit(train_batch,epochs=EPOCHS,validation_data=test_batch,callbacks=[DisplayCallback()])
    # save the newly trained model
    model.save("5epochsegmentationmodel")
    print("model saved \n")
```
Finally, we define the number of epochs we use. I went with 5, as this is where the validation accuracy peaked and didn't really improve after that. Then, we train the model. Using GPUs on Teton, this takes about 2-3 minutes per epoch. I will get into the environment more in the Environment section.

### VolumeFrom3D.py

`VolumeFrom3D.py` is the file that ties `SEMANTIC2.py` and `ImageManip.py` together. After the library import statements, we import both files:
```python
ImageManip = importlib.import_module("ImageManip")
SegmentationNetwork = importlib.import_module("SEMANTIC2")
```
We also use the pandas library to read in the csv file that contains the true volumes. These volumes were calculated from the masks that Todd hand-drew using FIJI image processing software:
```python
df = pd.read_csv('../Brain Volume Measurements.csv')
```

#### getFlyTifLocations(parentDirectory, numberPerFlyType, getScan = True)

This function takes in a parentDirectory with the format of the training data (`parentDirectory/flyType/flyNumber/scanOrMaskTifStack.tif`). It then iterates through all of the flies in the directory and returns a list of either the scan filepaths if `getScan` is `True` or the mask filepaths if `getScan` is `False`. `numberPerFlyType` refers to how many individuals from each fly type the user wants to pull. It's important to note that sometimes the order of the directory doesn't pull it in order numerically, so if there were 5 flies in a certain fly type and `numberPerFlyType` was 3, then it might return the file paths of flies `_02, _03, _04` but not `_01`.

#### getFlyPngLocations(parentDirectory, numberPerFlyType, getScan = True)

This function is slightly different than the previous function because it takes in a parent directory with the format `parentDirectory/flyType/flyNumber/[FolderNamedMasksOrScans]/`. Depending on the value of `getScan`, the function returns a list of filepaths leading to the `Scans/` or `Masks/` directory in the fly directories. 

#### getOutputDirsFromFiles(tifList, outDirPrefix)

Given a list of tif files in a directory formatted like the training data, this function returns a list of folders where the converted images from the tif file list could be stored within the `outDirPrefix`. If `outDirPrefix` is `../Test Images/`, and if the first element of `tifList` is a tif in fly `_04` from fly type `FL Del NLS1-3`, then the first element of the returned list would be `../Test Images/FL Del NLS1-3/_04/`.
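
The mapping described above can be sketched with `os.path`. This is only an illustration of the idea, not the code in `VolumeFrom3D.py`: the function name `output_dir_for`, the `../Training Data/` parent folder, and the `scan.tif` file name are all made up for the example.

```python
import os


def output_dir_for(tif_path, out_dir_prefix):
    """Map parent/flyType/flyNumber/file.tif to out_dir_prefix/flyType/flyNumber/."""
    fly_dir = os.path.dirname(tif_path)                     # .../flyType/flyNumber
    fly_number = os.path.basename(fly_dir)                  # flyNumber
    fly_type = os.path.basename(os.path.dirname(fly_dir))   # flyType
    # The trailing empty string makes join() end the path with a separator.
    return os.path.join(out_dir_prefix, fly_type, fly_number, "")


# Hypothetical input path, using forward slashes as on Teton.
example = output_dir_for(
    "../Training Data/FL Del NLS1-3/_04/scan.tif", "../Test Images/"
)
```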

#### predictImages(flyPngDirList, outDirList)

This function is basically a wrapper function for the `predictImages()` function in `SEMANTIC2.py`. The main difference is that it takes in a list of directories of fly PNG scans, `flyPngDirList`, and a list of output directories, `outDirList`. For each index `i`, it predicts the masks for all the images in `flyPngDirList[i]` and outputs the masks to a matching folder within `outDirList[i]`.

#### getLabel(flyTypeAndNumber)

This function gets the label (the correct volume number) for a given string `flyTypeAndNumber`. This string is just a concatenation of the fly's type and number, so `FL Del NLS1-3/_04` would be `flyTypeAndNumber = "FL Del NLS1-3_04"`. The `Genotype` column in the csv is where the fly name and number are stored in that format. Once we've selected the row for a specific fly, we read the `Brain Volume um^3 (Raw)` column and return that. In the CSV file, there is also a column for normalized volume. I have just used the raw volumes because the units for each pixel are in micrometers, so it was simplest to estimate and work with the raw volumes. I'm not sure what the advantages or disadvantages of working with the normalized volume would be.
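
To make the lookup concrete, here is a dependency-free sketch using the standard `csv` module instead of pandas (the notebook's own code works on the pandas `df`). The column names come from the description above. The rows in `SAMPLE_CSV` are invented for illustration and are not real measurements.

```python
import csv
import io

# Made-up rows for illustration; the real file is 'Brain Volume Measurements.csv'.
SAMPLE_CSV = """Genotype,Brain Volume um^3 (Raw)
FL Del NLS1-3_03,1000000.0
FL Del NLS1-3_04,1250000.5
"""


def get_label(fly_type_and_number, csv_file):
    """Return the raw volume for the row whose Genotype matches, else None."""
    for row in csv.DictReader(csv_file):
        if row["Genotype"] == fly_type_and_number:
            return float(row["Brain Volume um^3 (Raw)"])
    return None


volume = get_label("FL Del NLS1-3_04", io.StringIO(SAMPLE_CSV))
```

Returning `None` for an unknown fly is a choice made for this sketch. The real function may behave differently when no row matches.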

#### calculateTotalSurfaceArea(flyMaskDirList, imageResolutions)

This function takes in two parameters: `flyMaskDirList`, a list of directories that contain already predicted fly mask pngs, and `imageResolutions`, a dictionary whose keys are `flyTypeAndNumber` strings and whose values are the original dimensions of the images. Each value is a single number because the images are square. It is necessary to resize each mask back to its original resolution because all images are resized to the same resolution during the prediction process.

One thing this function returns is a list of lists called `individualCount`, where each internal list contains the individual pixel counts for each image. This is useful for comparing the predicted masks to the hand-drawn masks, and figuring out which images contain the largest errors. The function also returns `volumeDict`, a dictionary that has keys of the `flyTypeAndNumber` and values of the total cumulative pixels summed through all the images in one fly. I guess it really should be called `surfaceAreaDict` since it is total pixels, not the actual volume.
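
The counting step can be sketched in plain Python. This is a simplified illustration under a few assumptions: masks are given as nested lists instead of image files, any non-zero pixel counts as "on", and the resizing step is left out. The names `count_on_pixels` and `total_surface_area` are hypothetical.

```python
def count_on_pixels(mask):
    """Count non-zero pixels in one 2D mask given as a list of rows."""
    return sum(1 for row in mask for pixel in row if pixel)


def total_surface_area(masks_by_fly):
    """Return (total pixels per fly, list of per-image counts per fly)."""
    totals = {}
    individual_counts = []
    for fly_name, masks in masks_by_fly.items():
        counts = [count_on_pixels(mask) for mask in masks]
        individual_counts.append(counts)
        totals[fly_name] = sum(counts)
    return totals, individual_counts


# Two tiny 2x2 "masks" for one fly, purely for demonstration.
toy = {"FL Del NLS1-3_04": [[[0, 255], [255, 255]], [[0, 0], [255, 0]]]}
totals, individual_counts = total_surface_area(toy)
```

The dictionary comes first in the returned tuple, which matches how Snippet #1 takes element `[0]` of the result as the predicted dictionary.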

#### calculateTotalSurfaceAreaFromTif(flyMaskDirList)

Like the previous function, this function returns a `volumeDict` and an `individualCount` dictionary. However, it calculates the surface area from tif files, which are assumed to be the hand-drawn masks that Todd made, since the predicted masks are written to pngs, not tifs. Because these files are never read into the network, their resolution is never modified, so no dictionary of original resolutions is needed. We just count the number of pixels in each image in the tif stack that are "on", and return the volume and individual count dictionaries.

#### compareTotalMaskPixels(generatedMaskDirList, originalMaskFileList, imageResolutions, replicateWorstImages=False)

This function is used to compare the pixels in the label masks and the predicted masks. It also has an optional parameter to replicate the worst image(s) from each fly, which is set to `False` by default. This functionality to replicate the worst images isn't finished. At the current stage, I am able to get the worst performing images and write them to file; however, there may be bugs, and I haven't tested training the network on any replicated images. This is a strategy suggested by Lars to counter the network's tendency to underestimate the volume of brains. Lars suggested not being afraid to replicate the images many times within the dataset, so you could make 10 identical images for each worst performing mask in each fly.
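
One simple way to pick the "worst" images is to rank the slices of a fly by the absolute difference between predicted and hand-drawn pixel counts. The sketch below assumes that definition of "worst", which may differ from what the real function uses. The name `worst_slices` and the pixel counts are made up for the example.

```python
def worst_slices(predicted_counts, actual_counts, how_many=1):
    """Return indices of the slices with the largest absolute pixel-count error."""
    errors = [abs(p - a) for p, a in zip(predicted_counts, actual_counts)]
    # Sort slice indices from largest error to smallest.
    ranked = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
    return ranked[:how_many]


# Made-up per-slice pixel counts for one fly.
predicted = [100, 240, 310, 90]
actual = [110, 300, 315, 95]
worst = worst_slices(predicted, actual, how_many=2)
```

The returned indices could then be used to copy the matching scan and mask files several times into the training set.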

#### printPredictedAndActualVolumeTuples(predictedVolumeDict)

This function takes in the predicted volume dictionary (although it should be called surface area dictionary as the number represents pixels). It then compares the predicted volumes to the actual volumes and the adjusted volumes. Adjusted volumes are the volumes where I try to take into account the network's tendency to underestimate. 

First, we make a `predictedValue` for the volume by multiplying the pixel count by the volume in cubic micrometers that each pixel in the training set represents. For the training set, all the flies are scanned at a resolution of 2.95 um per pixel, so each pixel represents a cube of 2.95^3 um^3. Therefore, I multiply the pixel count by 2.95^3:
```python
predictedValue = value * 2.95 * 2.95 * 2.95
```
To see what the error would be after accounting for the tendency to underestimate, I summed the proportions `predictedValue / actualValue` over all flies and divided by the total number of flies to get the average proportion. This proportion was about 0.87, so for every fly I compute a `predictedAdjusted` value, which is:
```python
predictedAdjusted = predictedValue * (1/0.87286715585677)
```
I then sum up the error between the adjusted value and the actual value and print the average error. 
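
The arithmetic in this section can be put together in a short sketch. The function names are hypothetical and the fly volumes are made-up numbers chosen so the result is easy to check by hand. Only the 2.95 um pixel size comes from the training data described above.

```python
VOXEL_SIDE_UM = 2.95  # side length of one pixel in micrometers (training scans)


def pixels_to_volume(pixel_count, side_um=VOXEL_SIDE_UM):
    """Convert a total 'on' pixel count to a volume in cubic micrometers."""
    return pixel_count * side_um ** 3


def mean_proportion(predicted, actual):
    """Average of predicted/actual over all flies in the predicted dict."""
    ratios = [predicted[name] / actual[name] for name in predicted]
    return sum(ratios) / len(ratios)


# Made-up volumes purely for illustration.
predicted = {"flyA": 80.0, "flyB": 90.0}
actual = {"flyA": 100.0, "flyB": 100.0}

proportion = mean_proportion(predicted, actual)  # about 0.85 for these numbers
adjusted = {name: value / proportion for name, value in predicted.items()}
```

Dividing by the average proportion is the same operation as multiplying by `1/0.87286715585677` in the code above, just with the factor computed from the data instead of hard-coded.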

#### main()

`main()` is only run if the file is run directly. If the file is just imported, then `main()` is not run and only the functions above are exposed. `main()` contains some commented code snippets that have brief descriptions in the code of what they do.
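
This behavior is what the standard Python entry-point guard provides. The following is a generic sketch of that idiom and not a copy of `VolumeFrom3D.py`. The body of `main()` here is only a stand-in.

```python
def main():
    """Stand-in for the snippets that the real main() contains."""
    return "ran main"


if __name__ == "__main__":
    # Executed for `python VolumeFrom3D.py`, skipped for `import VolumeFrom3D`.
    main()
```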

##### Snippet #1

The first snippet basically does the full function of this project from start to finish. It first reads in the locations of the tif scans and masks, then gets the directories that we will output all of the converted png scans to. Then, if something within the conversion process has changed, the line 
```python
#imageResolutions = ImageManip.convertToDifferentDirectories(tifScanList, outDirList)
```
can be uncommented, and every image will be converted to a png. However, if nothing within the conversion process has changed and the converted images are already in that directory, then this line is a lot of computation to just get a dictionary of the original image resolutions. Therefore, I stored this dictionary in a json file in the working directory called `training.json`. If the original resolution dictionary changes, then there is some commented out code to write the dictionary to file:
```python
#with open("training.json", "w") as outfile:
#    json.dump(imageResolutions, outfile)
```
Since the dictionary is updated for the training set, the lines of code that are uncommented read the json file and store it in the `imageResolutions` dictionary:
```python
with open('training.json', 'r') as openfile:
    # Reading from json file
    imageResolutions = json.load(openfile)
```
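
The comment/uncomment workflow above is a manual version of a "load the cache if it exists, otherwise rebuild it" pattern. Here is a sketch of how that could be automated. The helper `load_or_build` and the stand-in `fake_conversion` are hypothetical, and the resolution value in the dictionary is invented.

```python
import json
import os
import tempfile


def load_or_build(cache_path, build):
    """Load a dict from cache_path, or call build() and write the result there."""
    if os.path.exists(cache_path):
        with open(cache_path, "r") as infile:
            return json.load(infile)
    data = build()
    with open(cache_path, "w") as outfile:
        json.dump(data, outfile)
    return data


calls = []


def fake_conversion():
    # Stand-in for the slow ImageManip.convertToDifferentDirectories() call.
    calls.append(1)
    return {"FL Del NLS1-3_04": 1024}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "training.json")
    first = load_or_build(path, fake_conversion)   # runs the conversion
    second = load_or_build(path, fake_conversion)  # reads the cached file
```

With this approach the json file still has to be deleted by hand whenever the conversion process changes, otherwise the stale dictionary is reused.
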
Then, we make lists of where the png scans are and where we need to output the masks to. After that, there is a commented out line which predicts the scans and outputs them in the masks directory:
```python
#predictImages(pngDirList, outMaskDirList)
```
Predicting them also takes a long time. If you've changed something about how the network trains, then it's necessary to retrain the model and repredict the images: change the `hasModel` variable in `SEMANTIC2.py` to `False` so that a new model is trained when the code is run. To save a ton of time while training, I recommend using a Teton node with a dedicated GPU. I will go into this more in the Environment section.

If you want to keep the same model, but something about the images is different, then you can keep the `hasModel` variable set to `True` while still running `predictImages()`, which will freshly predict the images in `pngDirList`.

Lastly, we calculate the surface area for all the scans and print out stats about how accurate the volume estimation is:
```python
predictedDict = calculateTotalSurfaceArea(outMaskDirList, imageResolutions)[0]
printPredictedAndActualVolumeTuples(predictedDict)
```

##### Snippet \#2

The second snippet predicts the volume for 5 mutant flies that Todd supplied which are not in the training dataset. The process that it goes through is quite similar to Snippet \#1, except that it doesn't compare the actual volumes to the predicted volumes since the actual volumes are not known.

##### Snippet \#3

This snippet converts input scans into layered stacks to be used to try to estimate volume in the Regression network. It uses `ImageManip.getRandomStacks()` to select these stacks:
```python
ImageManip.getRandomStacks(outMaskDirList, 3, 5, 
                            "../TestStacksForRegression/")
```
The 3 and 5 parameters mean that we want 5 random stacks of 3 images for each fly.
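
To illustrate what "5 random stacks of 3 images" could look like, here is a sketch that picks random runs of consecutive slices. That the slices in a stack are consecutive is an assumption of this example, and `ImageManip.getRandomStacks()` may select them differently. The function name `random_stacks` and the file names are hypothetical.

```python
import random


def random_stacks(slice_names, stack_size, number_of_stacks, seed=None):
    """Pick number_of_stacks runs of stack_size consecutive slices."""
    rng = random.Random(seed)
    last_start = len(slice_names) - stack_size
    if last_start < 0:
        raise ValueError("not enough slices for one stack")
    # Choose a random starting slice for each stack (stacks may overlap).
    starts = [rng.randint(0, last_start) for _ in range(number_of_stacks)]
    return [slice_names[s:s + stack_size] for s in starts]


# Roughly 150 slices per fly, as in the scans described in the Background.
slices = [f"slice_{i:03d}.png" for i in range(150)]
stacks = random_stacks(slices, 3, 5, seed=0)
```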

##### Snippet \#4

Some of the new flies that Todd gave us come in a slightly different format than the training data, where rather than having a tif stack of the scans, the scans are in individual tif images. This requires some slightly different processing, so this snippet provides an example of how to predict volumes for all of the flies in the `Angle 2` directory. These `Angle 2` flies are also different in that they're not computationally manipulated to be viewed from the exact same viewing angle for every fly. Instead, they contain the natural variation in viewing angle that happens when the scans are taken. This ability to detect brain in scans that aren't on the same angle as the training data is an important element of the network because it reduces the manual tasks that Todd or another scientist has to do before inputting the scans into the program.

### Regression.py

This network was used as a second step in the project before I transitioned to calculating the volume directly from the pixel counts of every slice. After the scans were converted to masks, this network would receive a few images (scans or masks depending on the implementation) and estimate the volume based on those images. I wasn't able to get it working well though, so it is not currently used.

The general layout and many of the functions in this file are quite similar to `SEMANTIC2.py`:

- `display()` displays some images for debugging purposes if you are running the code locally.
- `get_label()` returns the relevant label from `Brain Volume Measurements.csv`.
- `load_image()` loads in the image from a `tf.data.Dataset` datapoint and applies a rotation. It is used to load images for training.
- `loadImagesFromDisk()` loads in the image from a file on disk. This is used for predicting images after training.
- `predictImages()` predicts images from a directory. At the moment it rotates the image and compares the rotated volume number to the non-rotated number to get an idea of how the performance changes with rotation.

Before the data is loaded in, we set the random seed so that the shuffle of the training files is the same every time the network is run. This makes it so that there is no random variation in one training set being better or worse than another.
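
The effect of fixing the seed can be shown with plain Python's `random` module. This is a generic illustration of the idea, not the seeding call that `Regression.py` actually uses. The name `seeded_shuffle` and the file names are made up.

```python
import random


def seeded_shuffle(items, seed):
    """Return a shuffled copy of items that is identical for a given seed."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled


files = [f"fly_{i:02d}" for i in range(10)]
run_one = seeded_shuffle(files, seed=42)
run_two = seeded_shuffle(files, seed=42)  # same order as run_one
```

Because both runs see the training files in the same order, any difference in results between runs comes from changes to the code or data and not from the shuffle.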

The model inputs a (256,256,3) image and is made up of convolutional 2d transpose layers separated by pooling layers. Then, there is a flattening layer followed by 3 dense layers which end in a single number which is the volume estimation. I haven't experimented with changing the architecture of the network so far.

In the main section, there is also a similar `hasModel` Boolean variable that determines whether the model should be loaded or trained. 

### Environment

#### Teton

For most of the time I have been working on this project I have been using the Teton High Performance Computing cluster. In order to be able to work on it, you need to be added to the SchoborgFBVE project. This will give you access to the project folder. At the moment, I have been working within the `/gscratch/sfay1` folder, which is my personal folder on Teton that I was working in before the SchoborgFBVE project was created. I will work on moving all of the images over to the `/project/SchoborgFBVE/` folder so that all of the files are accessible on Teton to anyone on the project. This would be more convenient than having you download the GitHub repository, because the repository doesn't contain any images, and uploading the images takes a long time. Either way, if you need the images, there is a hard drive that has all of the training data on it, and a Google Drive folder with all of the other fly images that I can share with you.

For setting up the programming environment, the [wiki](https://arccwiki.atlassian.net/wiki/spaces/DOCUMENTAT/overview) is really helpful. The people who run ARCC are great, and nicely answered my dumb questions through email. You can email them at arcc-info@uwyo.edu. The main thing that I did to get my environment set up was to follow the [tensorflow tutorial](https://arccwiki.atlassian.net/wiki/spaces/DOCUMENTAT/pages/177209456/Using+TensorFlow). 

One mistake I made with Teton was running all the code and training the network on the login node, which is the computer that supports everyone logging in. This is a bad thing to do because there are many other nodes dedicated to doing the computing. Reading through all the wiki pages about the [SLURM workload manager](https://arccwiki.atlassian.net/wiki/spaces/DOCUMENTAT/pages/3113024/Slurm+Workload+Manager) was very helpful for me since I didn't have any previous experience with running jobs on a powerful cluster like Teton. When I need to run code, I prefer to use an interactive session, which lets you enter the terminal of the node that you are using to perform computations. The alternative is submitting a batch file that describes the job. Interactive sessions are very good for debugging, but submitting a batch job would be much easier for a process that is already streamlined.

If I need to do anything other than training the network, I request an allocated session without a GPU. The command I use looks like this, although the parameters can be tweaked to your liking:
`salloc --account=schoborgfbve --time=3:00:00 --nodes=1 --ntasks-per-node=4 --mem=16G`

If I do need to train the network, I request a gpu like this:
`salloc --account=schoborgfbve --time=3:00:00 --nodes=1 --ntasks-per-node=1 --mem=16G --gres=gpu:1`

Using a GPU cuts the training time hugely, from about 4 hours to about 10 minutes. Using an interactive session also lets you keep track in real time of how far the training has progressed.

After getting into the compute node, I run the following commands to load my TensorFlow environment. These can also be found in the TensorFlow tutorial on the wiki.

`module load miniconda3`

`module load cuda/10.1.243`

`source activate tensorflow_env`

Another important thing to note is that when you want to run your code with a GPU, you need to prefix the command with `srun`. So, a command to use a GPU would look like:
`srun python VolumeFrom3D.py`

#### Connecting to Teton Through Your Local Environment

It would be incredibly hard to edit all the code through a terminal. To use an IDE on my local computer while still working in the Teton filesystem, I use SSHFS, which stands for SSH Filesystem. My work computer is a Mac, so I installed SSHFS from [here](https://osxfuse.github.io/). SSHFS allows you to mount a remote directory onto your local system. It doesn't load everything in immediately, so directories that contain images can take a while to load. This is the reason why I made the `Test Images/` directory, which has the images in different folders for each fly. If I tried to access a folder with thousands of images in it, it would never load. An article explaining more about how SSHFS works can be found [here](https://www.redhat.com/sysadmin/sshfs).

The command I use to connect to my `gscratch` directory on Teton is 
`sshfs -o allow_other,default_permissions sfay1@teton.arcc.uwyo.edu:/gscratch/sfay1/CNNProject mnt/teton/`

After you are done for the day, if you turn your computer off and SSHFS disconnects, the remote directory remains mounted on your system, but it doesn't work. This is why I recommend running the command `umount -f mnt/teton` to unmount the remote directory from your computer, so that there are no hang-ups when you attempt to connect to the remote directory again the next day.

Following what Holden used, I have used [Spyder](https://www.spyder-ide.org/) as my IDE. It has worked well, and I have no complaints.

If you have any other questions about the environment, please contact me at smafay6 AT gmail.com.

