## Sample Exercise Answers

**Exercise 1.** Which Python data structure(s) would you use to represent each of the following types of data, and why? (Note that in some cases there is more than one right answer depending on context. The important part of the exercise is your reasoning.)

* the number of herbivorous fish on a coral reef?
* the number of herbivorous fish on a coral reef, organized by species?
* the length of a genome in nucleotides?
* the nucleotide sequence of a gene in the human genome (such sequences are represented by the letters A,T,C and G)?
* the function of each gene in a genome (e.g. *Adh1* is a gene that encodes an [alcohol dehydrogenase](https://en.wikipedia.org/wiki/Alcohol_dehydrogenase), etc.)?
* the taxonomy of an organism (its phylum, class, order, family, genus and species)?
* the scientific names of all the organisms found in an environment?
* the locations of exons in an mRNA (representing each exon as start and stop coordinates relative to the beginning of the mRNA)?

* To literally represent the **number of herbivorous fish on a coral reef**, I'd likely use an integer:

In [2]:
n_herbivorous_fish = 10082

* The **length of a genome in nucleotides** is a whole number: it's not possible to have a partial adenosine in a genome, for example. Therefore I'd represent that length with an integer. For example, if I were representing the length of the [*Escherichia coli* K-12 MG1655](https://www.genome.jp/dbget-bin/www_bget?gn:T00007) genome, I might write something like:

In [None]:
genome_length = 4641652

I'd probably have to think about the variable name a bit to make sure it makes sense in context. If there is only one genome, then genome_length is probably fine as a variable name. If we were comparing the physical length of the genome (if it were cut and unrolled) with its length in nucleotides, I might need to extend the variable name to specify the unit. In that case I could name the two variables genome_length_nm (nm is the abbreviation for nanometers) and genome_length_nt (nt is a fairly commonly used abbreviation for nucleotides).
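As a minimal sketch of that naming scheme, the two lengths might sit side by side like this. The conversion factor of roughly 0.34 nm per base pair is an approximate textbook value for B-form DNA that I'm assuming here for illustration; it is not part of the exercise.

```python
# Assumption: about 0.34 nm of helix length per base pair (approximate, B-form DNA).
NM_PER_BASE_PAIR = 0.34

# Length in nucleotides: a count, so an int
genome_length_nt = 4641652

# Physical length in nanometers: a measurement, so a float
genome_length_nm = genome_length_nt * NM_PER_BASE_PAIR

print(f"{genome_length_nt} nt is roughly {genome_length_nm / 1e6:.2f} mm of DNA")
```

Putting the unit in the name makes it harder to mix up the two values later, and it also hints at the type: counts are integers, while physical measurements are usually floats.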

* To represent the **function of each gene in a genome**, I would probably use a dict with the gene identifiers as keys and the gene functions as values. When I say gene identifiers, I mean unique IDs for each gene, similar to a gene name but guaranteed not to overlap with other genes or similar genes in other organisms. Here's an example:

In [3]:
gene_functions = {"Adh1": "Alcohol dehydrogenase", "TLR2": "Toll-like receptor 2"}
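The reason a dict fits well here is that it lets us look up a function directly from a gene identifier. Here's a short sketch of how that might look; "BRCA1" is just a hypothetical identifier that is deliberately missing from the dict, used to show what happens with an unknown gene.

```python
gene_functions = {"Adh1": "Alcohol dehydrogenase", "TLR2": "Toll-like receptor 2"}

# Direct lookup by gene identifier
print(gene_functions["TLR2"])

# .get() returns a default value instead of raising a KeyError
# when the identifier is not in the dict ("BRCA1" is hypothetical here)
print(gene_functions.get("BRCA1", "function unknown"))

# Loop over every identifier/function pair
for gene_id, function in gene_functions.items():
    print(f"{gene_id}: {function}")
```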

* I would likely **represent the nucleotide sequence of a gene in the human genome** with a **string** object. (Alternatively, one could implement a custom class to handle this type of data - we'll cover this later on).

In [None]:
dna_seq = "GACTCGTACGATCGATCGGGGGCAGCATCGACGCGCGCTACGTACGTACTAGCTACGATCGGCGATCGATCGATCACGGGGGGTACA"
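One advantage of a string is that the built-in string tools work on the sequence right away. As a rough sketch (the variable names below are my own, not from the exercise), we could get the length, count nucleotides, and slice out part of the sequence:

```python
dna_seq = "GACTCGTACGATCGATCGGGGGCAGCATCGACGCGCGCTACGTACGTACTAGCTACGATCGGCGATCGATCGATCACGGGGGGTACA"

# len() gives the sequence length in nucleotides
seq_length = len(dna_seq)

# str.count() counts occurrences of a nucleotide
gc_count = dna_seq.count("G") + dna_seq.count("C")
gc_fraction = gc_count / seq_length

# Slicing pulls out a subsequence (here, the first three nucleotides)
first_three = dna_seq[:3]

print(f"Length: {seq_length} nt, GC fraction: {gc_fraction:.2f}, starts with: {first_three}")
```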

* To represent the **scientific names of all organisms in an environment** I might use a list. Alternatively, if my input data had duplicates and I wanted each species to appear only once, I might use a set object:

In [4]:
original_reef_fish_observations = [
    "Sparisoma viride", "Stegastes planifrons", "Acanthurus bahianus",
    "Stegastes planifrons", "Stegastes planifrons", "Stegastes planifrons",
    "Sparisoma viride",
]
unique_species = set(original_reef_fish_observations)
n_unique_species = len(unique_species)
print(f"{n_unique_species} unique fish species were observed on the reef.")

3 unique fish species were observed on the reef.
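A set throws away how many times each species was seen. If I also wanted the **number of fish organized by species**, one option (a sketch, not the only reasonable answer) would be a dict mapping species names to counts. `collections.Counter` from the standard library builds one directly from a list of observations:

```python
from collections import Counter

reef_fish_observations = [
    "Sparisoma viride", "Stegastes planifrons", "Acanthurus bahianus",
    "Stegastes planifrons", "Stegastes planifrons", "Stegastes planifrons",
    "Sparisoma viride",
]

# Counter is a dict subclass: keys are species, values are observation counts
fish_counts = Counter(reef_fish_observations)

# most_common() yields (species, count) pairs from most to least observed
for species, count in fish_counts.most_common():
    print(f"{species}: {count}")
```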


* To represent the **start and stop locations of exons in an mRNA**, I might use a dict of lists. Each list would hold two integers: the indices at which the exon starts and stops in the RNA. As discussed above for DNA sequences, I would probably use a string object to represent the RNA itself:

In [5]:
RNA_seq = "AUGGGCUAGUAGCUACGCGCGGCGUAGCUAGCUACGUACGCGCGGCCGUAGUCAGUAUCGUAGCUGCAUCGAUGCUACGUCGUACG"
exons = {"exon1": [12, 56], "exon2": [67, 75]}

# The above structure lets us easily cut out exons using string slicing
start, stop = exons["exon1"]
exon1 = RNA_seq[start:stop]
print("The sequence of exon1 is: ", exon1)

The sequence of exon1 is:  CUACGCGCGGCGUAGCUAGCUACGUACGCGCGGCCGUAGUCAGU
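Because every exon is stored the same way, the same slicing step can be repeated for all of them in a loop. Here's a small sketch of that idea; `exon_seqs` is a name I'm introducing for illustration. Note that Python slices include the start index but exclude the stop index, so each exon's length is simply stop minus start.

```python
RNA_seq = "AUGGGCUAGUAGCUACGCGCGGCGUAGCUAGCUACGUACGCGCGGCCGUAGUCAGUAUCGUAGCUGCAUCGAUGCUACGUCGUACG"
exons = {"exon1": [12, 56], "exon2": [67, 75]}

# Build a new dict mapping each exon name to its sequence
exon_seqs = {}
for exon_name, (start, stop) in exons.items():
    # The slice includes index `start` and excludes index `stop`
    exon_seqs[exon_name] = RNA_seq[start:stop]

for exon_name, seq in exon_seqs.items():
    print(f"{exon_name} ({len(seq)} nt): {seq}")
```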
