# <center>Using Statistical Models in Genetic Algorithms</center>

<center>by Katrina Gensterblum</center>

---
Statistical models are critical to almost every type of data-driven research, no matter what kind of data is being used. Statistics allow researchers to systematically manipulate their data, clearly describe its contents, and generate synthetic data that follows a specified trend [1]. They help researchers not only work with their input data, but also explain and understand their results. Statistical models also provide a straightforward way to compare the performance of multiple algorithmic solutions to the same problem, so that the best one can be identified. In our research, we can apply statistical models to each part of our algorithm: to preprocess our input data, to enrich our output data, and to compare multiple solutions to our problem.  

Our research involves using genetic algorithms in the context of image segmentation. We provide the genetic algorithm with an image and several candidate segmentation algorithms to use with that image. The genetic algorithm then searches for the best segmentation algorithm and the best hyperparameters for that algorithm. Currently, the genetic algorithm finds reasonably good solutions for the images we give it. However, image preprocessing is often combined with segmentation algorithms to get better results, and right now we include no preprocessing techniques. We could include several preprocessing algorithms as additional options for the genetic algorithm to search over when trying to find the optimal solution. If we were to do that, one of the preprocessing techniques we would include is histogram equalization [2]. This is a statistical model in which the histogram of the image's pixel intensities is used to adjust its contrast, typically by remapping intensities through the histogram's cumulative distribution. By applying this model to our input data, our algorithm may yield even better results.  
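To make the idea of histogram equalization concrete, the following is a minimal sketch of the classic global version, not the multiobjective bihistogram variant described in [2] and not code from our project. It assumes a non-empty grayscale image stored as a list of rows of integer intensities; the function name `equalize_histogram` and the `levels` parameter are hypothetical names chosen for illustration.

```python
def equalize_histogram(image, levels=256):
    """Return a contrast-equalized copy of a grayscale image.

    `image` is assumed to be a non-empty list of rows, each a list of
    integer intensities in the range [0, levels - 1].
    """
    # Count how many pixels fall on each intensity level.
    hist = [0] * levels
    for row in image:
        for pixel in row:
            hist[pixel] += 1

    # Build the cumulative distribution of intensities.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    total = cdf[-1]
    cdf_min = next(c for c in cdf if c > 0)  # CDF at the darkest level present

    # A constant image has no contrast to stretch; return it unchanged.
    if total == cdf_min:
        return [row[:] for row in image]

    # Map each intensity so that the CDF becomes approximately linear.
    scale = (levels - 1) / (total - cdf_min)
    lookup = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [[lookup[pixel] for pixel in row] for row in image]


# A tiny low-contrast image: every intensity lies between 50 and 52.
low_contrast = [[50, 50, 51], [51, 52, 52]]
equalized = equalize_histogram(low_contrast)
print(equalized)
```

After equalization, the darkest intensity present maps to 0 and the brightest to `levels - 1`, so the narrow input range is spread across the full intensity scale. In a genetic algorithm, a step like this could be exposed as an on/off option in the search space.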

Along with applying a statistical model to our input data, we could use statistics to interpret the results of our genetic algorithm. Currently, our algorithm outputs only the best segmentation algorithm it could find for the problem, along with the fitness value (or error) of that solution. We could extend this by having the algorithm also return summary statistics of the population, such as the mean or maximum fitness value. In fact, the library we are using to help construct our genetic algorithm can provide these statistics automatically [3]. We could also use statistics to track the evolutionary trend of the population over generations, giving us a sense of how the fitness value changed and improved. This would show how our genetic algorithm worked and strengthen our argument that it was effective.  
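As a rough illustration of the population statistics described above, the sketch below computes per-generation summaries using only the Python standard library. DEAP offers its own statistics and logging helpers (`tools.Statistics` and `tools.Logbook`), but this example deliberately avoids them to stay self-contained. The function names and the fitness values are made up for illustration and are not results from our experiments.

```python
import statistics


def summarize_population(fitnesses):
    """Return summary statistics for one generation's fitness values."""
    return {
        "mean": statistics.fmean(fitnesses),
        "std": statistics.pstdev(fitnesses),
        "min": min(fitnesses),
        "max": max(fitnesses),
    }


def fitness_trend(generations):
    """Summarize each generation; `generations` is a list of fitness lists."""
    return [
        {"gen": index, **summarize_population(fitnesses)}
        for index, fitnesses in enumerate(generations)
    ]


# Made-up fitness values (errors, so lower is better) for three generations.
history = [
    [0.9, 0.7, 0.8, 0.6],
    [0.6, 0.5, 0.7, 0.4],
    [0.4, 0.3, 0.5, 0.2],
]

trend = fitness_trend(history)
for record in trend:
    print(record)
```

Because fitness here is an error, a falling mean and minimum across generations would indicate that the population is improving. Plotting these records against the generation index would show the evolutionary trend directly.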

Finally, we could use statistical models to compare the results of running our genetic algorithm multiple times. The algorithm currently trains on just one input image and finds the best segmentation algorithm it can for that image. However, we could run the genetic algorithm on every image in a dataset (or just a subset, depending on the size of the dataset) and record the final fitness value for each experiment. These values could then be summarized and plotted to reveal the overall trend in the genetic algorithm's performance. Once we had statistics from these experiments, we could use them not only to compare experiments on the same dataset, but also to compare against experiments run on different datasets. In this way, we could evaluate how reliably the genetic algorithm finds good solutions regardless of the problem.  
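The comparison described above could be sketched as follows, assuming the final fitness (error) of each run has already been recorded per dataset. The dataset names, the values, and the helper `summarize_runs` are all hypothetical placeholders, and the mean is only one of several reasonable ways to rank performance.

```python
import statistics


def summarize_runs(final_fitnesses):
    """Summarize the final fitness values from repeated runs on one dataset."""
    runs = len(final_fitnesses)
    return {
        "runs": runs,
        "mean": statistics.fmean(final_fitnesses),
        "median": statistics.median(final_fitnesses),
        # The sample standard deviation needs at least two runs.
        "stdev": statistics.stdev(final_fitnesses) if runs > 1 else 0.0,
    }


# Placeholder final errors, one value per image, for two made-up datasets.
results = {
    "dataset_a": [0.10, 0.20, 0.15, 0.15],
    "dataset_b": [0.30, 0.50, 0.40, 0.40],
}

summaries = {name: summarize_runs(values) for name, values in results.items()}

# Since fitness is an error, the dataset with the lowest mean did best.
lowest_error = min(summaries, key=lambda name: summaries[name]["mean"])

for name, summary in summaries.items():
    print(name, summary)
print("lowest mean error:", lowest_error)
```

Reporting the spread alongside the mean matters here: a low mean error with a large standard deviation would suggest that the genetic algorithm performs well on some images but inconsistently overall.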

Statistical models offer a mathematical way to manipulate input data, understand output data, and evaluate algorithm performance. While we can apply these models within our own research, the underlying ideas transfer to almost every data-driven research problem. In this way, statistical models provide an invaluable resource for researchers everywhere.


---
### References

[\[1\]](https://www.biorxiv.org/content/10.1101/810408v1) Hothorn, Ludwig A., Felix M. Kluxen, and Mario Hasler. “Pseudo-Data Generation Allows the Statistical Re-Evaluation of Toxicological Bioassays Based on Summary Statistics.” BioRxiv, January 1, 2019, 810408.


[\[2\]](https://onlinelibrary.wiley.com/doi/abs/10.1002/cplx.21499) Hum, Y.C., Lai, K.W. and Mohamad Salim, M.I. (2014), Multiobjectives bihistogram equalization for image contrast enhancement. Complexity, 20: 22-36.  

[\[3\]](https://dl.acm.org/doi/abs/10.1145/2330784.2330799) François-Michel De Rainville, Félix-Antoine Fortin, Marc-André Gardner, Marc Parizeau, and Christian Gagné. 2012. DEAP: a python framework for evolutionary algorithms. In Proceedings of the 14th annual conference companion on Genetic and evolutionary computation (GECCO ’12). Association for Computing Machinery, New York, NY, USA, 85–92.  