# Pangenomes, dictionaries and sets



We have 3 genomes
```
Genome1 = [1,7,67,834,17,8]

Genome2 = [834,8,17,29,7,1]

Genome3 = [67,17,834,1,7]
```
and we want to work out which genes are in all of them (bearing in mind that later on we will want to know which genes are in at least 90% of them). In real life, when we do this kind of thing, a real bacterial genome will have around 3000-5000 genes, so this is a toy example of the real thing.

What did our *Simple idea* code look like again?




In [None]:
Genome1 = [1,7,67,834,17,8]
Genome2 = [834,8,17,29,7,1]
Genome3 = [67,17,834,1,7]
core=[]
for g in Genome1:
    occurrences_of_g = 1
    for h in Genome2:
        if (g==h):
            occurrences_of_g =occurrences_of_g+1
    for j in Genome3:
        if (g==j):
            occurrences_of_g =occurrences_of_g+1
            if (occurrences_of_g==3):
                core.append(g)

#I didn't put anything in the lecture notes about actually printing out the core genome
print("The core genome, i.e. the set of genes in all 3 genomes, is:")
for c in core:
    print(c)

# Stage 1 (I'm putting this here just so we have "bookmarks" to refer to)

Now, as we mentioned in the lecture, the above code is fine for 3 small toy genomes, but there are definite limitations.
If we had to run it on more genomes, we would need to write more loops, one for each extra genome.

Right now it says
```
for h in Genome2:
    ..blah blah
for j in Genome3:
    ..whatever
```
We would need to explicitly add new lines for each new genome. This is faster and more reliable than manually looking through the files, but it is definitely no fun, so we'd like a better solution. Let's store the above code in a function for future use, and then try the other method.

In [None]:
def simple_function(g1, g2, g3):
    """
    Remember a function is like a toaster, or Deliveroo - you pass in specific inputs,
    and you get out specific outputs.
    Here we are going to pass in 3 genomes g1,g2,g3 , each a list of genes.
    The output will be a list of core genes.
    """
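    core=[]
    for g in g1:
        occurrences_of_g = 1
        #same layout as the Stage 1 code: first count matches in g2...
        for h in g2:
            if (g==h):
                occurrences_of_g =occurrences_of_g+1
        #...then count matches in g3, and keep the gene if it was in all 3 genomes
        for j in g3:
            if (g==j):
                occurrences_of_g =occurrences_of_g+1
                if (occurrences_of_g==3):
                    core.append(g)

    return core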


In [None]:
#you can use the function like this, passing in the genomes as lists:
c = simple_function([1,7,67,834,17,8] , [834,8,17,29,7,1], [67,17,834,1,7])


#or you can pass them in as variables (which were defined up above)
d= simple_function(Genome1, Genome2, Genome3)

In [None]:
print(c)
print(d)

## Now what about the second approach we talked about in the lecture, using dictionaries?

Let's first practice with dictionaries a bit. Here are 2 examples:

In [None]:
#Let's imagine we are keeping track of how many mutations there are in different genes, while processing some information
muts={"gene1":7 ,
     "gene2":8,
     "gene3":100}

#what if we want to remember which amino acid each codon encodes?
#(I have not double checked this table is correct, but you should get the idea)
#notice how we can intersperse comments using # to give us more information.
codon_to_amino_acid = {
    'TTT': 'F', 'TTC': 'F',  # Phenylalanine
    'TTA': 'L', 'TTG': 'L',  # Leucine
    'CTT': 'L', 'CTC': 'L', 'CTA': 'L', 'CTG': 'L',  # Leucine
    'ATT': 'I', 'ATC': 'I', 'ATA': 'I',  # Isoleucine
    'ATG': 'M',  # Methionine (Start codon)
    'GTT': 'V', 'GTC': 'V', 'GTA': 'V', 'GTG': 'V',  # Valine
    'TCT': 'S', 'TCC': 'S', 'TCA': 'S', 'TCG': 'S',  # Serine
    'CCT': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P',  # Proline
    'ACT': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T',  # Threonine
    'GCT': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',  # Alanine
    'TAT': 'Y', 'TAC': 'Y',  # Tyrosine
    'TAA': '*', 'TAG': '*',  # Stop codons
    'CAT': 'H', 'CAC': 'H',  # Histidine
    'CAA': 'Q', 'CAG': 'Q',  # Glutamine
    'AAT': 'N', 'AAC': 'N',  # Asparagine
    'AAA': 'K', 'AAG': 'K',  # Lysine
    'GAT': 'D', 'GAC': 'D',  # Aspartic Acid
    'GAA': 'E', 'GAG': 'E',  # Glutamic Acid
    'TGT': 'C', 'TGC': 'C',  # Cysteine
    'TGA': '*',  # Stop codon
    'TGG': 'W',  # Tryptophan
    'CGT': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R',  # Arginine
    'AGT': 'S', 'AGC': 'S',  # Serine
    'AGA': 'R', 'AGG': 'R',  # Arginine
    'GGT': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G',  # Glycine
}
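
Once a dictionary exists, we look things up by putting the key in square brackets. Here is a minimal sketch of that. To keep it runnable on its own it uses a cut-down table called `small_table` (only 4 codons) and a made-up DNA string `dna`; both are invented just for this illustration.

```python
# a cut-down version of the codon table, just for this sketch
small_table = {"ATG": "M", "TTT": "F", "GGA": "G", "TAA": "*"}

# look up one value by its key
print(small_table["ATG"])

# check whether a key is present (CCC is not in our cut-down table)
print("CCC" in small_table)

# translate a made-up DNA sequence, 3 letters (one codon) at a time
dna = "ATGTTTGGATAA"
protein = ""
for i in range(0, len(dna), 3):
    codon = dna[i:i+3]
    protein = protein + small_table[codon]
print(protein)
```

Asking for a key that is not in the dictionary (for example `small_table["CCC"]`) gives a `KeyError`, which is why checking with `in` first can be useful.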

Now it's your turn. Can you make a dictionary to remember which days of the week are weekdays,
and which are the weekend?

In [None]:
#put your dictionary here. Remember Python wants quotes around text strings, so "Monday", not Monday.
#I don't really mind, but you could encode weekend as 1 and weekday as 0. Up to you.

Now let's try a loop, where you add information to a dictionary

In [None]:
data_for_example=[1,2,3,3,3,15,18,9,23,3,56,556,707979,3]

In [None]:

#Suppose we had to write a loop which iterates through data_for_example and counts how many occurrences of the number
# 3 there are. For example in [2,2,3] the answer would be 1. For [33,3] the answer would also be 1, because we're
#just looking for exactly the number 3, not numbers with 3 in them like 33 or 123.
#We would do this
count=0
for b in data_for_example:
    if b==3:
        count=count+1
#this would work nicely, if we just wanted to count how many 3's there were
#what if I want to know how many of ALL the numbers there are?

#HINT: again, go through every element of data_for_example.
#      if it is NOT in your dictionary, add it. key=number, value=1 (we have now seen it once)
#      if it is in your dictionary, increment the count

#put your loop here, and run it!



In [None]:












## Answer to the previous is here:
# make a dictionary - for each number, it will track how often we've seen it
# (we call it counts rather than dict, because dict is already the name of a Python built-in)
counts={}
for b in data_for_example:
    if b in counts:
        counts[b]=counts[b]+1
    else:
        counts[b]=1 # this is the first time we have seen it
#check results
for c in counts.keys():
    print(c, "happened", counts[c], "times.")

# We will just introduce a way to loop through a dictionary and get the key-value pairs

This is a small dictionary we mentioned above
```
muts={"gene1":7 ,
     "gene2":8,
     "gene3":100}
```
# The items() function, which operates on a dictionary, will give you its key-value pairs

This function is quite useful for dictionaries.
```
muts.items()
```
will give you the key-value pairs. If you print the result, you will see

```
dict_items([('gene1', 7), ('gene2', 8), ('gene3', 100)])
```

Apologies for this being weird. Up to now we have talked about functions working like
```
print("zam")
```
where you have the function name, and then the input in brackets. In Python, some things, such as lists and dictionaries, have functions of their own (these are called methods). In that case we write the name of the thing, then a dot, then the function name and its inputs in brackets. Here we have a dictionary called muts, and we call the items function on it (no inputs) like this:

```
muts.items()
```
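
To make that concrete, here is a small sketch using the same `muts` dictionary. The names `pairs` and `total_mutations` are just made up for this illustration. It shows two things: wrapping the result of `items()` in `list()` to get an ordinary list of pairs, and looping over the pairs directly.

```python
muts = {"gene1": 7, "gene2": 8, "gene3": 100}

# list() turns the result of items() into an ordinary list of (key, value) pairs
pairs = list(muts.items())
print(pairs)

# more often we loop over the pairs directly, unpacking each pair into two variables
total_mutations = 0
for gene, count in muts.items():
    print(gene, "has", count, "mutations")
    total_mutations = total_mutations + count
print("Total:", total_mutations)
```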



# Stage 2 


Now let's try a loop where we add information to a dictionary

In [None]:
#Let's imagine we are observing animals moving in and out of a specific
# area. We are later going to release a pheromone to see if the types of animal
# that appear changes after the pheromone is released.
# But for now we just need a baseline. 
animals=["dog", "dog", "hyrax","tortoise", "antelope","cat", "unicorn", "megatherium", "antelope","antelope","antelope","antelope","antelope","antelope","antelope","antelope","antelope","dog"]

#write a loop that goes through this list, and for each type of animal, keeps a count of how many times it is there.
# For each element on the list, check if you've seen the animal before. If yes, update a dictionary entry with a new count.
#if no, create a new entry in the dictionary


# Introducing the final data structure: the set

The final step is to take a look at, and run, the second way of calculating the core genome, using dictionaries.
For each gene, we will keep track of the genomes which contain it.

We could use a LIST, as we have done before. If we did, then
```
[genome1, genome2]
```
would be different to
```
[genome2, genome1]
```

because order matters in lists. But we just want to know which genomes contain the gene, and don't really care about the order.
In Python (and maths) we use a SET for this.
```
{genome1, genome2}
```
is a SET of genomes, and is the same set as

```
{genome2, genome1}
```

We'll use this to make our life easier.
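
Here is a quick sketch of the difference, which you can run yourself. It assumes we store the genome names as text strings such as `"genome1"`; the variable name `containing` is made up for this illustration.

```python
# lists: order matters, so these two are NOT equal
print(["genome1", "genome2"] == ["genome2", "genome1"])

# sets: order does not matter, so these two ARE equal
print({"genome1", "genome2"} == {"genome2", "genome1"})

# add() puts a new element into a set; adding something already there changes nothing
containing = {"genome1"}
containing.add("genome2")
containing.add("genome2")
print(len(containing))
```

The fact that adding the same genome twice changes nothing is handy: we never need to check whether we have already recorded a genome for a gene.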

## Look at the example in the next cell - it shows you how you can iterate through a dictionary



In [None]:

# In a list, we do something like this to go through them
my_list=["a","b","c"]
for v in my_list:
   print(v)

#In a dictionary, we can ALSO go through them all, but a dictionary is a collection of PAIRS (key and value)
my_dict={ "a":1 , "b":2, "c": 3}
for letter, number in my_dict.items():
    print(letter)
    print("--->")
    print(number)
    print("\n")


In [None]:
genome_names={"g1":[1,7,67,834,17,8],
              "g2":[834,8,17,29,7,1],
              "g3":[67,17,834,1,7]
             }
#set up an empty dictionary to keep track of which genomes contain each gene
genomes_containing={}

#I want to iterate through all the keys in the dictionary, and for each key
#keep track of what its value is. The items() function gives us the key-value pairs, one pair at a time
for genome_name, genome in genome_names.items(): #eg "g1" and [1,7,67,834,17,8]
    #loop through all the genes in this genome
    for gene in genome:
        #do a check to see if this gene is already a key in the dictionary
        if gene in genomes_containing:
            #then we have already got a set of genome names containing this gene. Add our
            #current genome_name to the set
            genomes_containing[gene].add(genome_name)
        else:
            #create a new dictionary entry. KEY=this gene, and VALUE=a set containing just this genome
            genomes_containing[gene]={genome_name}

#Find genes present in all genomes
all_genomes=set(genome_names.keys()) #set of all genome names
#create ("initialise") an empty list
genes_in_all_genomes=[] 

#loop through the dictionary we made above
for gene, genome_set in genomes_containing.items():
    #check if the gene is in all genomes.
    #On the line below, on the left we have genomes this gene is in; on the right we have all genomes
    if genome_set==all_genomes:  
        genes_in_all_genomes.append(gene) #add the gene to the result list

#print the results
print("Genes present in all genomes", genes_in_all_genomes)
        
            
            
    

In [None]:
print(genome_names)
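
At the start we said that later on we want to know which genes are in at least 90% of the genomes. One possible way to get there from the dictionary approach is sketched below. It is only a sketch: the names `threshold`, `number_of_genomes` and `genes_above_threshold` are made up here, and it rebuilds `genomes_containing` so that it runs on its own.

```python
genome_names = {"g1": [1,7,67,834,17,8],
                "g2": [834,8,17,29,7,1],
                "g3": [67,17,834,1,7]}

# rebuild the dictionary: KEY=gene, VALUE=set of genome names containing it
genomes_containing = {}
for genome_name, genome in genome_names.items():
    for gene in genome:
        if gene in genomes_containing:
            genomes_containing[gene].add(genome_name)
        else:
            genomes_containing[gene] = {genome_name}

threshold = 0.9  # assumed cut-off: a gene must be in at least 90% of genomes
number_of_genomes = len(genome_names)

genes_above_threshold = []
for gene, genome_set in genomes_containing.items():
    # fraction of genomes that contain this gene
    fraction = len(genome_set) / number_of_genomes
    if fraction >= threshold:
        genes_above_threshold.append(gene)

print("Genes in at least 90% of genomes", genes_above_threshold)
```

With only 3 genomes, 90% means all 3, so this gives the same genes as the core genome. Try changing `threshold` to 0.6 to see genes that are in 2 out of 3 genomes as well.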