<small><small><i>
Introduction to Python for Bioinformatics - available at https://github.com/GunzIvan28/MScMak2025-IntroductionToPython.
</i></small></small>

## Dictionaries

Dictionaries are mappings between keys and the values stored against them. Unlike lists and tuples, dictionaries are indexed by key rather than by position; since Python 3.7 they also preserve the order in which keys were inserted. Alternatively, one can think of a dictionary as a set in which a value is stored against every element of the set.

To define a dictionary, assign `{}` or `dict()` to a variable:

In [1]:
d = dict() # or equivalently d={}
print(type(d))
d['abc'] = 3
d[4] = "A string"
print(d)

<class 'dict'>
{'abc': 3, 4: 'A string'}


As the output above suggests, dictionaries can also be defined using the `{ key : value }` syntax. The following dictionary has three elements:

In [2]:
d = { 1: 'One', 2 : 'Two', 100 : 'Hundred'}
len(d)

3

In [3]:
d.keys()

dict_keys([1, 2, 100])

In [4]:
d.values()

dict_values(['One', 'Two', 'Hundred'])

You can now access 'One' by looking up its key, 1:

In [5]:
print(d[1])

One
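
Looking up a key that is not in the dictionary raises a `KeyError`. A minimal sketch of two ways to handle this, reusing the dictionary `d` from above (the names `missing`, `lookup_none` and `lookup_default` are only illustrative):

```python
# Same dictionary as in the cell above
d = {1: 'One', 2: 'Two', 100: 'Hundred'}

# Indexing with a key that is absent raises KeyError
try:
    d[3]
except KeyError as err:
    missing = err.args[0]   # the key that was not found

# get() returns None, or a default of our choosing, instead of raising
lookup_none = d.get(3)
lookup_default = d.get(3, 'Unknown')
print(missing, lookup_none, lookup_default)   # prints: 3 None Unknown
```

`get()` is convenient when a missing key is expected and a fallback value makes sense; plain indexing is better when a missing key indicates a mistake.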


There are a number of alternative ways of specifying a dictionary, including as a list of `(key,value)` tuples.
Two related lists can be merged to form a dictionary.
To illustrate this, we start with two lists and form a list of tuples from them using the **zip()** function.

In [1]:
names = ['One', 'Two', 'Three', 'Four', 'Five']
numbers = [1, 2, 3, 4, 5]
[ (name,number) for name,number in zip(names,numbers)] # create (name,number) pairs

[('One', 1), ('Two', 2), ('Three', 3), ('Four', 4), ('Five', 5)]

In [4]:
pot = ( (name,number) for name,number in zip(names,numbers))
print (pot)

<generator object <genexpr> at 0x00000293DB16BD30>


Now we can create a dictionary that maps the name to the number as follows.

In [2]:
a1 = dict((name,number) for name,number in zip(names,numbers))
print(a1)

{'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
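
Because `dict()` accepts any iterable of `(key, value)` pairs, the generator expression is optional. A short sketch showing that passing `zip()` directly gives the same dictionary (`a1_short` and `a1_long` are illustrative names):

```python
names = ['One', 'Two', 'Three', 'Four', 'Five']
numbers = [1, 2, 3, 4, 5]

# dict() consumes the (name, number) pairs produced by zip() directly
a1_short = dict(zip(names, numbers))

# The longer form used in the cell above
a1_long = dict((name, number) for name, number in zip(names, numbers))

print(a1_short == a1_long)   # prints: True
```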


Note that since Python 3.7 a dictionary preserves the order in which its elements were added, as the output above shows. In older versions the ordering depended on the internal hash table, so code that must run on them should not assume any ordering when iterating over the elements of a dictionary.
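
In Python 3.7 and later, the language guarantees that a dictionary keeps its keys in insertion order. The sketch below, which uses a small made-up dictionary called `order`, shows what that means when keys are updated or re-added:

```python
import sys

order = {}
order['G'] = 1
order['A'] = 2
order['T'] = 3
first_order = list(order)   # keys in the order they were added

# Updating an existing key keeps its position;
# deleting a key and adding it again moves it to the end
order['G'] = 10
del order['A']
order['A'] = 2
second_order = list(order)

print(sys.version_info >= (3, 7), first_order, second_order)
```

On Python 3.7 or newer, `first_order` is `['G', 'A', 'T']` and `second_order` is `['G', 'T', 'A']`.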

By using tuples as keys, we can make a dictionary behave like a sparse matrix:

In [3]:
matrix={ (0,1): 3.5, (2,17): 0.1}
matrix[2,2] = matrix[0,1] + matrix[2,17]
print(matrix)

# matrix[2,2] is assigned the sum of the two existing entries, which adds a third key to the dictionary

{(0, 1): 3.5, (2, 17): 0.1, (2, 2): 3.6}
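
With a sparse matrix stored this way, entries that were never set can be read as zero with `get()`, and the total of a column can be found by filtering the keys on their column index. This is only a sketch: the entry at `(4, 1)` is made up so that column 1 has more than one value, and `col`, `corner` and `col_total` are illustrative names.

```python
matrix = {(0, 1): 3.5, (2, 17): 0.1, (4, 1): 1.5}   # (4, 1) is a made-up extra entry

# Entries that are not stored are treated as 0.0
corner = matrix.get((0, 0), 0.0)

# Total of one column: add every stored value whose column index matches
col = 1
col_total = sum(value for (row, column), value in matrix.items() if column == col)

print(corner, col_total)   # prints: 0.0 5.0
```

Changing `col` to `0` would give the total of the first column, which is `0` here because no entry is stored in it.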


A dictionary can also be built using a comprehension (a loop-style definition).

In [4]:
a2 = { name : len(name) for name in names}
print(a2)

{'One': 3, 'Two': 3, 'Three': 5, 'Four': 4, 'Five': 4}
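
A comprehension can also filter its input or be driven by another dictionary. The following sketch shows both variations; `long_names`, `numbers_by_name` and `names_by_number` are illustrative names, not part of the original notebook.

```python
names = ['One', 'Two', 'Three', 'Four', 'Five']

# Keep only the names that are longer than three characters
long_names = {name: len(name) for name in names if len(name) > 3}

# Swap keys and values of an existing dictionary
# (this only works if the values are unique and hashable)
numbers_by_name = {'One': 1, 'Two': 2, 'Three': 3}
names_by_number = {number: name for name, number in numbers_by_name.items()}

print(long_names)
print(names_by_number)
```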


### Built-in Functions

The **len()** function and **in** operator have the obvious meaning:

In [5]:
print("a1 has",len(a1),"elements")
print("One is in a1",'One' in a1,"but not Zero", 'Zero' in a1)

a1 has 5 elements
One is in a1 True but not Zero False
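
One detail worth checking is that **in** tests the keys of a dictionary, not its values. A small sketch, with `a1` redefined so that the block runs on its own:

```python
a1 = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}

key_hit = 'One' in a1             # 'One' is a key
value_as_key = 1 in a1            # 1 is a value, not a key
value_hit = 1 in a1.values()      # search the values explicitly

print(key_hit, value_as_key, value_hit)   # prints: True False True
```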


The **clear( )** method erases all elements.

In [None]:
a2.clear()
print(a2)

The **values( )** method returns all the values stored in the dictionary. (Actually not quite a list, but a view that we can iterate over just like a list to construct a list, tuple or any other collection):

In [6]:
[ v for v in a1.values() ]

[1, 2, 3, 4, 5]

The **keys( )** method returns all the keys against which values are stored.

In [7]:
{ k for k in a1.keys() }

{'Five', 'Four', 'One', 'Three', 'Two'}

The **items( )** method returns the contents of the dictionary as `(key, value)` tuples. This is the same as the result obtained when the zip function was used, and the pairs appear in the order in which they were inserted.

In [None]:
",  ".join( "%s = %d" % (name,val) for name,val in a1.items())

'One = 1,  Two = 2,  Three = 3,  Four = 4,  Five = 5'
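
The same string can be built with an explicit `for` loop, which makes the tuple unpacking easier to see. This is a sketch using an f-string in place of `%` formatting; `lines` and `summary` are illustrative names.

```python
a1 = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}

# Unpack each (key, value) tuple directly in the for statement
lines = []
for name, val in a1.items():
    lines.append(f"{name} = {val}")

summary = ",  ".join(lines)
print(summary)
```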

In [10]:
wat = (",  ".join( "%s = %d" % (name,val) for name,val in a1.items()))
print(wat)
print(type(wat))

One = 1,  Two = 2,  Three = 3,  Four = 4,  Five = 5
<class 'str'>


The **pop( )** method removes the element with the given key and returns its value, which can be assigned to a new variable. Remember that only the value is returned and not the key, because the key is just the index.

In [None]:
val = a1.pop('Four')
print(a1)
print("Removed",val)
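
There are a few related ways of removing elements. The sketch below redefines `a1` so that it runs on its own; `absent` and `last_pair` are illustrative names.

```python
a1 = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}

# pop() with a second argument returns that default instead of raising KeyError
absent = a1.pop('Zero', None)

# del removes a key without returning its value
del a1['Five']

# popitem() removes and returns the most recently added (key, value) pair
last_pair = a1.popitem()

print(absent, last_pair, a1)
```

After these three steps, `a1` holds only the keys `'One'`, `'Two'` and `'Three'`.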

## Exercise 1

- Using strings, lists, tuples and dictionaries concepts, find the reverse complement of AAAAATCCCGAGGCGGCTATATAGGGCTCCGGAGGCGTAATATAAAA

In [1]:
a = 'AAAAATCCCGAGGCGGCTATATAGGGCTCCGGAGGCGTAATATAAAA'


In [None]:
#ans

seq = 'AAAAATCCCGAGGCGGCTATATAGGGCTCCGGAGGCGTAATATAAAA'
complement = {"A": "T", "T": "A", "G": "C", "C": "G"}  # maps each base to its complementary base

In [None]:
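b = list(a)
d = {"A": "T", "T": "A", "G": "C", "C": "G"}
c = [d[base] for base in b]  # complement of each base, still in the original order
reverseC = "".join(reversed(c))  # reverse the complemented bases and join them into a string
print(reverseC)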

In [4]:
seq = "ABCD"
val = []
for x in seq:
    val.append(x)

print("".join(val))

ABCD
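
The pieces explored in the cells above (a complement dictionary, a loop over the bases and `join`) can be combined into one function. This is a sketch of one possible answer, not the only one; `reverse_complement`, `rc`, `table` and `rc_translate` are illustrative names.

```python
seq = 'AAAAATCCCGAGGCGGCTATATAGGGCTCCGGAGGCGTAATATAAAA'
complement = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

def reverse_complement(sequence):
    """Complement each base, reading the sequence from its last base to its first."""
    return ''.join(complement[base] for base in reversed(sequence))

rc = reverse_complement(seq)

# The same result using str.maketrans()/translate() and a reversing slice
table = str.maketrans('ATGC', 'TACG')
rc_translate = seq.translate(table)[::-1]

print(rc)
print(rc == rc_translate)
```

A useful sanity check is that applying the function twice returns the original sequence.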


## Exercise 2
Given the DNA sequence AAAAATCCCGAGGCGGCTATATAGGGCTCCGGAGGCGTAATATAAAA, find the k-mer (3-mer) with the highest frequency.

In [6]:
dna = "AAAAATCCCGAGGCGGCTATATAGGGCTCCGGAGGCGTAATATAAAA"
#dna[0:3]
#dna[1:4]
#dna[2:5]  #recognize the pattern of the sliding k-mers
len(dna)

47

In [None]:
print(f"My sequence is {a}.")  # braces, not parentheses, insert the value of a

In [None]:
#Code
step = 0
while step <=len(dna)-3:
    print(dna[step:step+3])
    step = step + 1

In [None]:
step = 0
dna_list=[]
while step <=len(dna)-3:
    dna_list.append(dna[step:step+3])
    step = step + 1

In [None]:
dna_list

In [None]:
dna_dict = {}
for kmer in dna_list:  # a separate loop variable, so that dna is not overwritten
    dna_dict[kmer] = dna_list.count(kmer)

In [None]:
dna_dict

In [None]:
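#answer example
kmer_list = []
for start in range(len(dna) - 3 + 1):  # +1 so that the last 3-mer is included
    kmer_list.append(dna[start:start+3])
print(kmer_list)

kmer_dict = {}

for kmer in kmer_list:
    kmer_dict[kmer] = kmer_list.count(kmer)
    
print(kmer_dict)

highest_frequency = max(kmer_dict.values())  # count of the most frequent 3-mer

['AAA', 'AAA', 'AAA', 'AAT', 'ATC', 'TCC', 'CCC', 'CCG', 'CGA', 'GAG', 'AGG', 'GGC', 'GCG', 'CGG', 'GGC', 'GCT', 'CTA', 'TAT', 'ATA', 'TAT', 'ATA', 'TAG', 'AGG', 'GGG', 'GGC', 'GCT', 'CTC', 'TCC', 'CCG', 'CGG', 'GGA', 'GAG', 'AGG', 'GGC', 'GCG', 'CGT', 'GTA', 'TAA', 'AAT', 'ATA', 'TAT', 'ATA', 'TAA', 'AAA', 'AAA']
{'AAA': 5, 'AAT': 2, 'ATC': 1, 'TCC': 2, 'CCC': 1, 'CCG': 2, 'CGA': 1, 'GAG': 2, 'AGG': 3, 'GGC': 4, 'GCG': 2, 'CGG': 2, 'GCT': 2, 'CTA': 1, 'TAT': 3, 'ATA': 4, 'TAG': 1, 'GGG': 1, 'CTC': 1, 'GGA': 1, 'CGT': 1, 'GTA': 1, 'TAA': 2}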


In [20]:
#answer example2
dna = "AAAAATCCCGAGGCGGCTATATAGGGCTCCGGAGGCGTAATATAAAA"
 

In [9]:
k = 3
kmers = [dna[n:n+k] for n in range(len(dna))]

In [10]:
kmers

['AAA',
 'AAA',
 'AAA',
 'AAT',
 'ATC',
 'TCC',
 'CCC',
 'CCG',
 'CGA',
 'GAG',
 'AGG',
 'GGC',
 'GCG',
 'CGG',
 'GGC',
 'GCT',
 'CTA',
 'TAT',
 'ATA',
 'TAT',
 'ATA',
 'TAG',
 'AGG',
 'GGG',
 'GGC',
 'GCT',
 'CTC',
 'TCC',
 'CCG',
 'CGG',
 'GGA',
 'GAG',
 'AGG',
 'GGC',
 'GCG',
 'CGT',
 'GTA',
 'TAA',
 'AAT',
 'ATA',
 'TAT',
 'ATA',
 'TAA',
 'AAA',
 'AAA',
 'AA',
 'A']

In [11]:
k = 3
kmers = [dna[n:n+k] for n in range(len(dna)-2)]

In [12]:
kmers

['AAA',
 'AAA',
 'AAA',
 'AAT',
 'ATC',
 'TCC',
 'CCC',
 'CCG',
 'CGA',
 'GAG',
 'AGG',
 'GGC',
 'GCG',
 'CGG',
 'GGC',
 'GCT',
 'CTA',
 'TAT',
 'ATA',
 'TAT',
 'ATA',
 'TAG',
 'AGG',
 'GGG',
 'GGC',
 'GCT',
 'CTC',
 'TCC',
 'CCG',
 'CGG',
 'GGA',
 'GAG',
 'AGG',
 'GGC',
 'GCG',
 'CGT',
 'GTA',
 'TAA',
 'AAT',
 'ATA',
 'TAT',
 'ATA',
 'TAA',
 'AAA',
 'AAA']

In [13]:
k = 6
kmers = [dna[n:n+k] for n in range(len(dna)-5)]

In [14]:
kmers

['AAAAAT',
 'AAAATC',
 'AAATCC',
 'AATCCC',
 'ATCCCG',
 'TCCCGA',
 'CCCGAG',
 'CCGAGG',
 'CGAGGC',
 'GAGGCG',
 'AGGCGG',
 'GGCGGC',
 'GCGGCT',
 'CGGCTA',
 'GGCTAT',
 'GCTATA',
 'CTATAT',
 'TATATA',
 'ATATAG',
 'TATAGG',
 'ATAGGG',
 'TAGGGC',
 'AGGGCT',
 'GGGCTC',
 'GGCTCC',
 'GCTCCG',
 'CTCCGG',
 'TCCGGA',
 'CCGGAG',
 'CGGAGG',
 'GGAGGC',
 'GAGGCG',
 'AGGCGT',
 'GGCGTA',
 'GCGTAA',
 'CGTAAT',
 'GTAATA',
 'TAATAT',
 'AATATA',
 'ATATAA',
 'TATAAA',
 'ATAAAA']

In [15]:
k = 3
kmers = [dna[n:n+k] for n in range(len(dna)- k)]

In [16]:
kmers

['AAA',
 'AAA',
 'AAA',
 'AAT',
 'ATC',
 'TCC',
 'CCC',
 'CCG',
 'CGA',
 'GAG',
 'AGG',
 'GGC',
 'GCG',
 'CGG',
 'GGC',
 'GCT',
 'CTA',
 'TAT',
 'ATA',
 'TAT',
 'ATA',
 'TAG',
 'AGG',
 'GGG',
 'GGC',
 'GCT',
 'CTC',
 'TCC',
 'CCG',
 'CGG',
 'GGA',
 'GAG',
 'AGG',
 'GGC',
 'GCG',
 'CGT',
 'GTA',
 'TAA',
 'AAT',
 'ATA',
 'TAT',
 'ATA',
 'TAA',
 'AAA']

In [17]:
k = 3
kmers = [dna[n:n+k] for n in range(len(dna)- k+1)]

In [18]:
kmers

['AAA',
 'AAA',
 'AAA',
 'AAT',
 'ATC',
 'TCC',
 'CCC',
 'CCG',
 'CGA',
 'GAG',
 'AGG',
 'GGC',
 'GCG',
 'CGG',
 'GGC',
 'GCT',
 'CTA',
 'TAT',
 'ATA',
 'TAT',
 'ATA',
 'TAG',
 'AGG',
 'GGG',
 'GGC',
 'GCT',
 'CTC',
 'TCC',
 'CCG',
 'CGG',
 'GGA',
 'GAG',
 'AGG',
 'GGC',
 'GCG',
 'CGT',
 'GTA',
 'TAA',
 'AAT',
 'ATA',
 'TAT',
 'ATA',
 'TAA',
 'AAA',
 'AAA']
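
The cells above try several stopping values for `range()`. A sequence of length n contains n - k + 1 overlapping windows of width k, which is why `len(dna) - k + 1` gives every full k-mer and no partial ones. A minimal sketch that wraps this in a function (`kmers_of`, `three_mers` and `six_mers` are illustrative names):

```python
def kmers_of(sequence, k):
    """Return every overlapping k-mer of `sequence`, left to right."""
    # The last full window starts at index len(sequence) - k,
    # so range() must stop at len(sequence) - k + 1
    return [sequence[start:start + k] for start in range(len(sequence) - k + 1)]

dna = "AAAAATCCCGAGGCGGCTATATAGGGCTCCGGAGGCGTAATATAAAA"
three_mers = kmers_of(dna, 3)
six_mers = kmers_of(dna, 6)

print(len(dna), len(three_mers), len(six_mers))   # prints: 47 45 42
```

If `k` is larger than the sequence, the range is empty and the function returns an empty list.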

In [21]:
k = 3
kmers = [dna[n:n+k] for n in range(len(dna)- k+1)]
#kmers.count ("AAA")
{a: kmers.count(a) for a in kmers} 

{'AAA': 5,
 'AAT': 2,
 'ATC': 1,
 'TCC': 2,
 'CCC': 1,
 'CCG': 2,
 'CGA': 1,
 'GAG': 2,
 'AGG': 3,
 'GGC': 4,
 'GCG': 2,
 'CGG': 2,
 'GCT': 2,
 'CTA': 1,
 'TAT': 3,
 'ATA': 4,
 'TAG': 1,
 'GGG': 1,
 'CTC': 1,
 'GGA': 1,
 'CGT': 1,
 'GTA': 1,
 'TAA': 2}

In [None]:
k = 3
kmers = [dna[n:n+k] for n in range(len(dna)- k+1)]
#kmers.count ("AAA")
{a: kmers.count(a) for a in set(kmers)} #set ensures that the kmers are picked uniquely

{'TAG': 1,
 'AAT': 2,
 'CCC': 1,
 'CTC': 1,
 'GGC': 4,
 'GCT': 2,
 'TAT': 3,
 'CTA': 1,
 'AGG': 3,
 'ATC': 1,
 'GGA': 1,
 'GTA': 1,
 'CGA': 1,
 'AAA': 5,
 'GCG': 2,
 'CGG': 2,
 'GAG': 2,
 'GGG': 1,
 'CCG': 2,
 'CGT': 1,
 'TAA': 2,
 'ATA': 4,
 'TCC': 2}
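
Calling `kmers.count()` once per k-mer rescans the whole list each time, which becomes slow for long sequences. As an alternative that the cells above do not use, `collections.Counter` from the standard library tallies all k-mers in a single pass. A sketch, with `counts` and `top_three` as illustrative names:

```python
from collections import Counter

dna = "AAAAATCCCGAGGCGGCTATATAGGGCTCCGGAGGCGTAATATAAAA"
k = 3
kmers = [dna[n:n+k] for n in range(len(dna) - k + 1)]

# Counter tallies every k-mer in a single pass over the list
counts = Counter(kmers)

# most_common(3) lists the three k-mers with the highest counts, largest first
top_three = counts.most_common(3)
print(top_three)
```

A `Counter` is a dictionary subclass, so `counts['AAA']`, `counts.items()` and `max(counts.values())` work exactly as they do for `freq`.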

In [80]:
k = 3
kmers = [dna[n:n+k] for n in range(len(dna)- k+1)]
#kmers.count ("AAA")
freq = {a: kmers.count(a) for a in set(kmers)}

In [24]:
freq.items()

dict_items([('TAG', 1), ('AAT', 2), ('CCC', 1), ('CTC', 1), ('GGC', 4), ('GCT', 2), ('TAT', 3), ('CTA', 1), ('AGG', 3), ('ATC', 1), ('GGA', 1), ('GTA', 1), ('CGA', 1), ('AAA', 5), ('GCG', 2), ('CGG', 2), ('GAG', 2), ('GGG', 1), ('CCG', 2), ('CGT', 1), ('TAA', 2), ('ATA', 4), ('TCC', 2)])

In [25]:
freq.keys()

dict_keys(['TAG', 'AAT', 'CCC', 'CTC', 'GGC', 'GCT', 'TAT', 'CTA', 'AGG', 'ATC', 'GGA', 'GTA', 'CGA', 'AAA', 'GCG', 'CGG', 'GAG', 'GGG', 'CCG', 'CGT', 'TAA', 'ATA', 'TCC'])

In [29]:
freq.values()

dict_values([1, 2, 1, 1, 4, 2, 3, 1, 3, 1, 1, 1, 1, 5, 2, 2, 2, 1, 2, 1, 2, 4, 2])

In [71]:
max (freq.values())

5

In [None]:
best = [kmer for kmer, v in freq.items() if v == max(freq.values())]

In [73]:
print(best)

['AAA']


In [74]:
print (f"My best kmer is {best}. And it appears {max(freq.values())} times")

My best kmer is ['AAA']. And it appears 5 times


In [None]:
best = [kmer for kmer, v in freq.items() if v == max(freq.values())][0]  # [0] takes the first element of the list, giving the string 'AAA' instead of ['AAA']

In [76]:
print(best)

AAA


In [79]:
print (f"My best kmer is {best}. And it appears {max(freq.values())} times")

My best kmer is AAA. And it appears 5 times
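
Two details of the approach above are worth noting: `max(freq.values())` is re-evaluated for every item in the comprehension, and `[0]` keeps only one k-mer even if several share the highest count. The sketch below uses a small made-up frequency table with a tie to show both points; `highest`, `best_all` and `best_one` are illustrative names.

```python
# A small, made-up frequency table in which two k-mers tie for first place
freq = {'AAA': 5, 'GGC': 5, 'ATA': 4, 'TAT': 3}

# Compute the highest count once, outside the comprehension
highest = max(freq.values())
best_all = [kmer for kmer, count in freq.items() if count == highest]

# max() with key= returns a single key; on a tie it is the first one encountered
best_one = max(freq, key=freq.get)

print(highest, best_all, best_one)   # prints: 5 ['AAA', 'GGC'] AAA
```

Keeping the full list `best_all` avoids silently dropping k-mers that are equally frequent.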


In [77]:
len(dna)

47

In [78]:
range(47)

range(0, 47)