### Background and Motivation
All students on this project are majoring in math, and we have interests in bioinformatics and mathematical biology. We want to build data analysis skills that transfer to these fields. While our project will likely not be intensive on data acquisition, we plan to focus heavily on data analysis, as that is highly applicable to all of our future career goals. Also, Ella is double majoring in chemistry and works in a systems biology research lab doing both experimental and computational research involving prokaryotic genetics, so our group has some expertise relevant to the biological side of this project.

### Project Objectives
The goal of this project is to learn how and when to apply different clustering algorithms (k-means, density-based, and hierarchical) and to practice those skills with genetic data. We will use publicly accessible genetic data from different bacterial species and compare them for genetic similarity. We will choose bacteria which cause different diseases in different hosts. These bacteria are of public interest, as they will be chosen from the list of rising threats in the 2019 CDC antimicrobial resistance threat report. The questions we aim to answer are:
   
1. To what extent can each of the following be used to categorize and cluster different strains of bacteria by species?  
      a. G/C content  
      b. Percentage of genome dedicated to different functions  
      c. Time of data processing
2. How similar are the genomes of bacteria with different pathogenicity in different hosts, and how similar are the genomes of different bacterial species which cause similar symptoms and diseases in those hosts?
3. Can we use our models to predict and classify the identity of an unknown strain given its genome sequence?

### Data Description and Acquisition
Data source: https://www.ncbi.nlm.nih.gov/datasets

We will use reference genomes obtained from the NCBI database of genomic data. These will be downloaded as GenBank records, which contain FASTA files holding the genetic information. FASTA is a specialized format for genetic data, but it is essentially a plain text file. The files are freely accessible to the public online. The data in these files contains both the raw genome sequence and characterizing data which breaks the genome into individual genes and their predicted functions (among other variables which describe each gene's identity). We will identify bacterial species which we are interested in exploring (as of right now, that includes E. coli, Pseudomonas aeruginosa, and Pseudomonas syringae).
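
Because FASTA is plain text, it can be read with a few lines of standard-library Python. The following is a minimal sketch, not our final pipeline: the function name `read_fasta` and the two toy records are made up for illustration, and in practice a library such as Biopython would handle edge cases this sketch ignores.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Made-up two-record example, not real genome data.
example = StringIO(">seq1 toy record\nATGC\nGGCC\n>seq2 another toy record\nTTAA\n")
records = list(read_fasta(example))
print(records)
# [('seq1 toy record', 'ATGCGGCC'), ('seq2 another toy record', 'TTAA')]
```

For a real file, the same function could be called on an open file object instead of the `StringIO` stand-in.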

### Ethical Considerations
We are planning to work with data from genetic databases. In the past, such databases have had ethical problems with how genetic data is collected and who is compensated for that collection. This is most famous (or infamous) for human data, such as HeLa cells, but it is also a problem for non-human genetic data. There is ongoing debate about how and when researchers should provide payment when they collect non-human genetic data from “genetically-rich” regions of the world; sometimes researchers will go somewhere like a rainforest and collect lots of genetic data, but the place where they collected that data does not receive an equitable share of the benefits from that research. We do not know where the samples behind our data were originally collected, so our data could have these same issues.

Additionally, there are concerns about using data which we have not collected ourselves. The NCBI’s policy on data use is described as follows. In the terms and conditions of the website it says, “They are designed to provide and encourage access within the scientific community to sources of current and comprehensive information. Therefore, NCBI itself places no restrictions on the use or distribution of the data contained therein. Nor do we accept data when the submitter has requested restrictions on reuse or redistribution.” So NCBI itself places no restrictions on our use of the data, as it does not accept data whose submitters have restricted its use. That being said, NCBI does still state on their website that “some submitters of the original data (or the country of origin of such data) may claim patent, copyright, or other intellectual property rights in all or a portion of the data,” so we need to make sure that we use the data in accordance with any such rights. As we are not publishing this project, these concerns are less pressing.


### Data Cleaning and Processing
There will be some data clean-up. Some processing is needed to ensure that all data is read correctly, as gene names sometimes contain characters that programming languages interpret in ways that break code. The actual cleaning and processing of the data is fairly straightforward, as there are already many Python libraries built to work with this kind of data, such as Biopython, and the data is formatted with research use in mind. The raw genetic sequences will need to be extracted from the rest of the data in each file and cleaned, and each file will need to be processed into a dataframe which organizes all of the genetic characterization data associated with each gene. This is a matter of writing a simple script, as each type of data point has an identification tag which can be read in with each gene name. The result will be a large dataframe which contains all of the information associated with both the genome sequence and the species of interest.
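
As a rough illustration of the tag-based script described above, the sketch below reads a GenBank-style feature table into one record per gene and replaces troublesome characters in gene names. The helper names `clean_name` and `parse_features` and the toy snippet are hypothetical, and the parser assumes each tag fits on one line; for real files we would expect to rely on Biopython's `SeqIO.parse` with the `"genbank"` format instead.

```python
import re


def clean_name(name):
    """Replace characters that tend to break column names or file paths."""
    return re.sub(r"[^A-Za-z0-9_]+", "_", name).strip("_")


def parse_features(text):
    """Turn a GenBank-style feature table into a list of dicts, one per CDS."""
    records, current = [], None
    for line in text.splitlines():
        feature = re.match(r"^ {5}(\S+)\s+(\S+)", line)          # e.g. CDS 190..255
        qualifier = re.match(r'^ {21}/(\w+)="?([^"]*)"?$', line)  # e.g. /gene="abcA"
        if feature:
            current = None  # tags of non-CDS features are skipped
            if feature.group(1) == "CDS":
                current = {"location": feature.group(2)}
                records.append(current)
        elif qualifier and current is not None:
            key, value = qualifier.groups()
            current[key] = clean_name(value) if key == "gene" else value
    return records


# Toy snippet with made-up genes, laid out in GenBank's fixed columns.
sample = "\n".join([
    " " * 5 + "gene" + " " * 12 + "190..255",
    " " * 21 + '/gene="abcA"',
    " " * 5 + "CDS" + " " * 13 + "190..255",
    " " * 21 + '/gene="abcA"',
    " " * 21 + '/product="toy leader peptide"',
    " " * 5 + "CDS" + " " * 13 + "complement(337..2799)",
    " " * 21 + "/gene=\"tra-5'\"",
    " " * 21 + '/product="hypothetical protein"',
])
parsed = parse_features(sample)
print(parsed)
```

Each dict in `parsed` can become one row of the dataframe, with the tags (`gene`, `product`, and so on) as columns.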

### Exploratory Analysis
We are planning to create a bar chart for each species of bacteria that shows the percentage of the genome dedicated to different functions. We would also like to create a bar chart comparing the G/C content of each species of bacteria. We will also use hierarchical clustering to create something similar to a phylogenetic tree, which is typically used in the biological sciences to visualize the relationships between species. Then, when we move on to comparing this same information across multiple strains for each species, we plan to create a cluster heatmap so that we can hopefully visualize this data in one graphic. This data can be classified by genome function to show what proportion of each species' genome is dedicated to certain functions, and it can be used to compare very different species. We are also interested in using a parallel coordinates plot to visualize this data. If we find interesting ways of characterizing the bacteria, we would like to use a decision tree to summarize those results.
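
The numbers behind the per-species bar chart could be computed along these lines. This is only a sketch under stated assumptions: the function name `percent_by_function`, the category labels, and the coordinates are invented, and genes are assumed not to overlap and to use inclusive, 1-based coordinates.

```python
from collections import defaultdict


def percent_by_function(genes, genome_length):
    """Return {category: percent of genome length} for (category, start, end) genes."""
    totals = defaultdict(int)
    for category, start, end in genes:
        totals[category] += end - start + 1  # inclusive, 1-based coordinates
    return {cat: 100.0 * bases / genome_length for cat, bases in totals.items()}


# Toy genome of 1,000 bp with made-up categories and coordinates.
toy_genes = [
    ("metabolism", 1, 200),
    ("metabolism", 301, 400),
    ("transport", 501, 600),
]
shares = percent_by_function(toy_genes, genome_length=1000)
print(shares)  # {'metabolism': 30.0, 'transport': 10.0}
```

The resulting dictionary maps directly onto the bar labels and bar heights for one species.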

### Analysis Methodology
We are hoping to start by comparing the genetic similarity of different species of bacteria. We will start with just one strain of each species as our reference genomes, and compare what percentage of the genome is dedicated to different functions between those species. From there, we want to see if our analysis can be used to categorize each species across many different strains. We plan to try using clustering algorithms to categorize all the strains by species. We expect k-means clustering to be the best fit for categorizing the data (our basic idea is to use the initial set of reference genomes as the centroid seeds, and then see if strains of the same species are clustered together). However, we are also interested in exploring additional forms of clustering, such as hierarchical and density-based clustering. Additionally, we think that cluster heatmaps could be useful in comparing all the different strains. This analysis will allow us to understand whether one species has a larger number of genes dedicated to a specific function than another, and whether those genes are similar in sequence across species. With this data we aim to create a model which can be used to predict a species' identity given its genetic sequence.
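
The seeding idea can be sketched with a small standard-library version of k-means (Lloyd's algorithm) in which the reference genomes provide the starting centroids. Everything here is illustrative: `kmeans_seeded` is a hypothetical helper, the two features per strain are assumptions, and the numbers are made up.

```python
import math


def kmeans_seeded(points, seeds, iterations=20):
    """Lloyd's k-means starting from the given seed centroids."""
    centroids = [list(s) for s in seeds]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Made-up feature vectors: [G/C fraction, fraction of genome in one function].
reference_seeds = [[0.50, 0.30], [0.66, 0.20]]  # one reference genome per species
strains = [[0.51, 0.31], [0.49, 0.29], [0.65, 0.21], [0.67, 0.19], [0.50, 0.32]]
labels, centroids = kmeans_seeded(strains, reference_seeds)
print(labels)  # [0, 0, 1, 1, 0]
```

Cluster index `k` corresponds to `reference_seeds[k]`, so a strain's label can be read as "clustered with reference species `k`". Features on different scales would likely need to be standardized first.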

One challenge we anticipate is that our data is highly multidimensional (the genome codes for many different functions, so there are many different variables). Because of this, we are planning to start by comparing the G/C content of different bacteria, as this is a single variable that we could potentially use to categorize different strains of bacteria by species. Once we have our method working for this, we plan to move on to the more complicated comparisons.
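
G/C content is simple to compute, which is part of why it makes a good first variable. The sketch below computes it and assigns an unknown sequence to the reference with the closest value. The nearest-reference rule is a simple stand-in for illustration rather than one of the clustering methods above, and the species names and short sequences are made up.

```python
def gc_content(sequence):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return gc / (gc + at)


def nearest_species(value, reference_gc):
    """Name of the reference species whose G/C content is closest to value."""
    return min(reference_gc, key=lambda name: abs(reference_gc[name] - value))


# Short made-up sequences standing in for whole genomes.
references = {
    "species_A": gc_content("ATGCATGCAT"),  # 4 of 10 bases are G or C
    "species_B": gc_content("GGCCGGCATG"),  # 8 of 10 bases are G or C
}
unknown = gc_content("GGCCGCGATN")  # the ambiguous N is ignored: 7 of 9
print(nearest_species(unknown, references))  # species_B
```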

To analyze run time, we will process our data using PROKKA and OrthoFinder, which are open source tools used to clean and process prokaryotic genetic data for genetic similarity (our group has used both of these tools before). As a side project (time permitting), we are interested in seeing whether the run time through this pipeline varies significantly between species. We will compare run time data to see if it can be used as an invariant of sorts to categorize species of bacteria.

### Project Schedule
We will meet most Saturdays from 1-3pm to work on the project in person. Ella will take more responsibility for the front end of the project, the data acquisition and data processing, as she has more experience processing this type of data and the familiarity with the biological concepts needed to understand which parts of the data are relevant to the project. Sylvie and Hailey will take more responsibility for the back end of the project, the data analysis and visualization, so that the work is evenly shared among the group members.
#### Deadlines
#### March 17-22
1) Choose bacterial strains and species of interest and have all data collected
2) Have data cleaned and processed through any methods which cannot be done in Python code, so all group members have access to the raw data
#### March 23-29
1) Begin data formatting and do sanity checks to ensure all data is read and processed correctly
2) Perform initial data visualization to identify the most interesting characteristics of the data to be used in further analysis
#### March 30 - April 5
1) Create clustering and modeling algorithms
2) Test algorithms on new genetic data
#### April 6-12
1) Debugging/Buffer time to allow flexibility in the project
2) Determine final data interpretation and conclusions
#### April 13-18
1) Finalize and turn in project
2) Make presentation slides and decide how best to present results and interpretation
3) Film video presentation