# Genome Sequencing, Week 1

Solve the String Composition Problem.

    Input: An integer k and a string Text.
    Output: Composition_k(Text) (the k-mers can be provided in any order).


In [2]:
def composition(k, text):
    return [text[i:i+k] for i in range(len(text)-k+1)]

In [3]:
k = 5
text = "CAATCCAAC"
print(" ".join(composition(k, text)))

CAATC AATCC ATCCA TCCAA CCAAC


It's straightforward to solve the **String Composition** problem, but to piece together a genome from reads, we have to solve the inverse problem, the **String Reconstruction** problem:

**String Reconstruction Problem**: *Reconstruct a string from its k-mer composition.*
- **Input**: An integer *k* and a collection *Patterns* of *k*-mers.
- **Output**: A string *Text* with *k*-mer composition equal to *Patterns* (if such a string exists).
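
The output condition above can be checked mechanically. The following is a minimal sketch, not part of the original notebook: `has_composition` is a hypothetical helper name, and it assumes the composition is compared as a multiset, so repeated k-mers must appear the same number of times.

```python
from collections import Counter

def has_composition(text, k, patterns):
    """Return True if the multiset of k-mers in `text` equals `patterns`."""
    kmers = [text[i:i+k] for i in range(len(text) - k + 1)]
    return Counter(kmers) == Counter(patterns)

# The k-mers may be listed in any order.
print(has_composition("CAATCCAAC", 5, ["AATCC", "ATCCA", "CAATC", "CCAAC", "TCCAA"]))  # True
# A missing k-mer makes the check fail.
print(has_composition("CAATCCAAC", 5, ["AATCC", "ATCCA", "CAATC", "CCAAC"]))  # False
```

A check like this only verifies a candidate string; it does not find one.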

If we know the **genome path** already, it is trivial to solve this problem.

**String Spelled by a Genome Path Problem**: *Reconstruct a string from its genome path.*
- **Input**: A sequence path of *k*-mers Pattern_1, …, Pattern_n such that the last k-1 symbols of Pattern_i are equal to the first k-1 symbols of Pattern_(i+1) for 1 ≤ i ≤ n-1.
- **Output**: A string *Text* of length k+n-1 such that the i-th *k*-mer in *Text* is equal to Pattern_i (for 1 ≤ i ≤ n).


In [4]:
def max_overlap_index(s1, s2):
    # Return the start index i in s1 at which the suffix s1[i:] best matches
    # the equal-length prefix of s2 (score: +1 per match, -1 per mismatch).

    def similarity(a, b):
        return sum(1 if x == y else -1 for x, y in zip(a, b))

    index_to_overlap = {i: similarity(s1[i:], s2[0:len(s2)-i]) for i in range(len(s1))}
    return max(index_to_overlap, key=index_to_overlap.get)

def string_from_genome_path(path):
    # Consecutive k-mers in a genome path overlap by exactly k-1 symbols,
    # so each k-mer after the first contributes only its last symbol.
    # Searching for the best-scoring overlap is unnecessary here and can
    # misfire on repeats: for "AAA" followed by "AAA" it picks index 0
    # and appends nothing.
    string = path[0]
    for curr in path[1:]:
        string += curr[-1]
    return string

In [5]:
# Small example path; it is overwritten by the dataset read from file below.
path = ["AGCAGATC", "GCAGATCA", "CAGATCAT", "AGATCATC", "GATCATCG", "ATCATCGG"]

with open("rosalind_ba3b.txt", "r") as f:
    path = f.read().rstrip("\n").split("\n")

print(string_from_genome_path(path)[:100])

AAATGTATGACCAGCACCTCACGACAACGAGCGCTGTCCGTGATTAAACTCCAAACGTCGTAATCTACCGACGACGCGACGATAGCTTGCAGGATAAGCC


**Overlap Graph Problem**: *Construct the overlap graph of a collection of k-mers.*
- **Input**: A collection *Patterns* of *k*-mers.
- **Output**: The overlap graph *Overlap(Patterns)*, in the form of an adjacency list. (You may return the nodes and their edges in any order.)

In [13]:
def prefix(pattern):
    # All symbols except the last one.
    return pattern[:-1]

def suffix(pattern):
    # All symbols except the first one.
    return pattern[1:]

def overlap(patterns):
    # Map each k-mer to every k-mer whose prefix equals its suffix.
    overlap_graph = {}
    for p in patterns:
        suff = suffix(p)
        for pp in patterns:
            if suff == prefix(pp):
                overlap_graph.setdefault(p, []).append(pp)
    return overlap_graph

In [12]:
patterns = ["ATGCG", "GCATG", "CATGC", "AGGCA", "GGCAT", "GGCAC"]

In [14]:
out = overlap(patterns)

for key, value in out.items():
    print(f'{key} -> {" ".join(value)}')

GCATG -> CATGC
CATGC -> ATGCG
AGGCA -> GGCAT GGCAC
GGCAT -> GCATG


In [None]:
binary_4mers = ["0000", "0001", "0010", "0011", "0100", "0101", "0110", "0111", "1000", "1001", "1010", "1011", "1100", "1101", "1110", "1111"]



We can take a different approach. Instead of assigning each 3-mer to a node, we assign it to an edge, with the prefix of the 3-mer at the tail of the edge and its suffix at the head. Gluing together nodes with identical labels yields a de Bruijn graph, in which we look for an Eulerian path: a path that traverses each edge exactly once. (The overlap-graph approach instead requires a Hamiltonian path, one that visits each node exactly once.) Of the two problems, the Eulerian Path problem can be solved efficiently, whereas no fast algorithm is known for the Hamiltonian Path problem, which is NP-complete.
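
The Eulerian-path idea above can be sketched with Hierholzer's algorithm. This is a minimal sketch and not part of the original notebook: the names `eulerian_path` and `reconstruct_from_kmers` are hypothetical, and the code assumes the de Bruijn graph built from the k-mers actually contains an Eulerian path.

```python
from collections import defaultdict

def eulerian_path(graph):
    """Return a list of nodes that traverses every edge of `graph` exactly once.

    `graph` maps node -> list of successors (repeated entries are parallel edges).
    Assumes an Eulerian path or cycle exists.
    """
    # Copy the adjacency lists so the caller's graph is not consumed.
    adj = {node: list(succs) for node, succs in graph.items()}
    balance = defaultdict(int)  # out-degree minus in-degree
    for node, succs in graph.items():
        balance[node] += len(succs)
        for s in succs:
            balance[s] -= 1
    # Start at the node with one extra outgoing edge, if there is one;
    # otherwise the graph has an Eulerian cycle and any node with edges works.
    start = next((n for n, b in balance.items() if b == 1), next(iter(adj)))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if adj.get(node):
            stack.append(adj[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())       # dead end: node is final in the path
    return path[::-1]

def reconstruct_from_kmers(patterns):
    # Build the de Bruijn graph: one edge prefix -> suffix per k-mer.
    graph = defaultdict(list)
    for p in patterns:
        graph[p[:-1]].append(p[1:])
    path = eulerian_path(graph)
    # Spell the string: first node, then the last symbol of each later node.
    return path[0] + "".join(node[-1] for node in path[1:])

kmers = ["CTTA", "ACCA", "TACC", "GGCT", "GCTT", "TTAC"]
print(reconstruct_from_kmers(kmers))  # GGCTTACCA
```

The traversal uses each edge once, so the running time is linear in the number of k-mers. When several Eulerian paths exist, this sketch returns just one of them.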

In [15]:
def de_bruijn_from_string(k, text):
    adjacency_list = {}
    edges = [text[i:i+k] for i in range(len(text)-k+1)]
    for edge in edges:
        pref = prefix(edge)
        suff = suffix(edge)
        adjacency_list.setdefault(pref, []).append(suff)
    return adjacency_list

In [16]:
with open("deBruijn.txt", "r") as f:
    db_input = f.read().split("\n")
k = int(db_input[0])
text = db_input[1]

out = de_bruijn_from_string(k, text)

for key, value in out.items():
    print(f'{key}: {" ".join(value)}')

CTGACCATCGT: TGACCATCGTT
TGACCATCGTT: GACCATCGTTT
GACCATCGTTT: ACCATCGTTTC
ACCATCGTTTC: CCATCGTTTCA
CCATCGTTTCA: CATCGTTTCAT
CATCGTTTCAT: ATCGTTTCATA
ATCGTTTCATA: TCGTTTCATAA
TCGTTTCATAA: CGTTTCATAAG
CGTTTCATAAG: GTTTCATAAGC
GTTTCATAAGC: TTTCATAAGCG
TTTCATAAGCG: TTCATAAGCGG
TTCATAAGCGG: TCATAAGCGGG
TCATAAGCGGG: CATAAGCGGGT
CATAAGCGGGT: ATAAGCGGGTG
ATAAGCGGGTG: TAAGCGGGTGG
TAAGCGGGTGG: AAGCGGGTGGA
AAGCGGGTGGA: AGCGGGTGGAA
AGCGGGTGGAA: GCGGGTGGAAC
GCGGGTGGAAC: CGGGTGGAACG
CGGGTGGAACG: GGGTGGAACGG
GGGTGGAACGG: GGTGGAACGGA
GGTGGAACGGA: GTGGAACGGAT
GTGGAACGGAT: TGGAACGGATC
TGGAACGGATC: GGAACGGATCG
GGAACGGATCG: GAACGGATCGA
GAACGGATCGA: AACGGATCGAG
AACGGATCGAG: ACGGATCGAGC
ACGGATCGAGC: CGGATCGAGCC
CGGATCGAGCC: GGATCGAGCCC
GGATCGAGCCC: GATCGAGCCCA
GATCGAGCCCA: ATCGAGCCCAT
ATCGAGCCCAT: TCGAGCCCATA
TCGAGCCCATA: CGAGCCCATAA
CGAGCCCATAA: GAGCCCATAAC
GAGCCCATAAC: AGCCCATAACG
AGCCCATAACG: GCCCATAACGC
GCCCATAACGC: CCCATAACGCC
CCCATAACGCC: CCATAACGCCT
CCATAACGCCT: CATAACGCCTA
CATAACGCCTA: ATAACGCCTAC


```
DeBruijn(Patterns)
    dB ← graph in which every k-mer in Patterns is an isolated edge between its prefix and suffix
    dB ← graph resulting from gluing all nodes in dB with identical labels
    return dB
```

In [24]:
def de_bruijn_from_kmers(patterns):
    adjacency_list = {}
    for p in patterns:
        pref = prefix(p)
        suff = suffix(p)
        adjacency_list.setdefault(pref, []).append(suff)
    return adjacency_list

In [35]:
# Toy input; it is overwritten by the dataset read from file below.
patterns = ["CA", "CA", "CA", "CA", "CC", "CA"]

with open("deBruijnKmers.txt", "r") as f:
    patterns = f.read().split(" ")

print(patterns)



['GTACAGTTGTTGTAGGTCCT', 'TTTACGGCTCGCAGTCTCAG', 'GCTCATTTTCGGACGCACAA', 'AGTGGCGAGACATAGTAATT', 'GAGTGGGGGGCCACTAATGC', 'TTACAGCCCACTATTCCATT', 'AGAAGGCTCACACTCAACCA', 'TGTGCTTTATGAGCCGATAC', 'TACTTCTAGTGCTGTAGCAA', 'AGGTCCTTTCCACACCGGCT', 'TTCGGACAACCCCCCATGCG', 'TTTCTCCAGGATCCTGCGAG', 'GGAAATTAATTTCTACTTCC', 'ACGCGGAACTGCCCTGTTGC', 'CCGCCCTTGCAACAAGAAGG', 'GTACTTCTAGTGCTGTAGCA', 'TGTCGTTGGCGTAATTATGA', 'ATGCGTCGCTTCCTACAGTG', 'TTTAGTCAGCTATGTCCATT', 'TGCACTATCGCTCACGGACC', 'TTTAAGCCGAAAATCCCGCC', 'TGGGTGACGAGGTCGTGACC', 'GCCTTCGGACAACCCCCCAT', 'ACAACCCCCCATGCGTCGCT', 'AGCGCGTCGCGCAGAATAAG', 'GACATAGTATAAGTGTTCCT', 'ACATAGTAATTAGTATACCT', 'AACTCTCAATCCGAAGAAAG', 'TACTTGTGATCTGTAAACGC', 'GCAGGATTACATGCTCTGGG', 'GGAGATACAGACTGTCGCCC', 'ACTACCCTACTATCGGTTAG', 'ACTTGTTCAGTGTATCCTAT', 'GCATCCCTCCATGGTTTATC', 'CGCTGCCGGTCCGGTCCTCG', 'ATAGTATCTGCAGGAGTGTT', 'CTTGTACAGTTGTTGTAGGT', 'GGCTCATGGGCGCTACACGG', 'GCTTGATCTAGAGAGTCGGC', 'GCGGTCTCATGCCCATACAC', 'GCAACTATTTTCTGCCCCTA', 'CTAACTGACTTCAC

In [36]:
out = de_bruijn_from_kmers(patterns)

keys = list(out.keys())
keys.sort()
for key in keys:
    print(f'{key}: {" ".join(out[key])}')

AAAAATATCTATTGGACGT: AAAATATCTATTGGACGTG
AAAAATGCATCAGCGATAT: AAAATGCATCAGCGATATC
AAAACCGCAGCTTACTAAA: AAACCGCAGCTTACTAAAG
AAAATATCTATTGGACGTG: AAATATCTATTGGACGTGC
AAAATCCCGCCTAATTATA: AAATCCCGCCTAATTATAC
AAAATGCATCAGCGATATC: AAATGCATCAGCGATATCG
AAACATCAGGCACCCTGAG: AACATCAGGCACCCTGAGC
AAACCAGAAGCGCGTCGCG: AACCAGAAGCGCGTCGCGC
AAACCGCAGCTTACTAAAG: AACCGCAGCTTACTAAAGG
AAACCTTAGTATATATCGA: AACCTTAGTATATATCGAG
AAACGCCCTCTAGCCAGTC: AACGCCCTCTAGCCAGTCA
AAACTGAAATTCAGGGGCG: AACTGAAATTCAGGGGCGA
AAAGATTGGGTTCCTGACT: AAGATTGGGTTCCTGACTA
AAAGCGCATGCGTCGCTTC: AAGCGCATGCGTCGCTTCC
AAAGGTATTTCGTAACTCC: AAGGTATTTCGTAACTCCC
AAAGTAGCCGCTGCCGGTC: AAGTAGCCGCTGCCGGTCC
AAAGTTTTCTCAAATGGAC: AAGTTTTCTCAAATGGACC
AAATATCTATTGGACGTGC: AATATCTATTGGACGTGCT
AAATCCCGCCTAATTATAC: AATCCCGCCTAATTATACT
AAATCCGAAATCGCTCATT: AATCCGAAATCGCTCATTT
AAATCGCTCATTTTCGGAC: AATCGCTCATTTTCGGACG
AAATGCATCAGCGATATCG: AATGCATCAGCGATATCGG
AAATGGACCCATGTAACAG: AATGGACCCATGTAACAGG
AAATTAATTTCTACTTCCT: AATTAATTTCTACTTCCTC
AAATTCAGGGGCGAAA

In [51]:
three_mers = ["000", "001", "010", "011", "100", "101", "110", "111"]
string = "0001110100"
# Check whether every binary 3-mer occurs as a substring of the string.
true_false = [i in string for i in three_mers]
if all(true_false):
    print("yah")
else:
    print("nah")

yah


In [38]:
print(true_false)

[True, True, True, True, True, True, True, True]


In [45]:
patterns = ["GCGA", "CAAG", "AAGA", "GCCG", "ACAA", "AGTA", "TAGG", "AGTA", "ACGT", "AGCC", "TTCG", "AGTT", "AGTA", "CGTA", "GCGC", "GCGA", "GGTC", "GCAT", "AAGC", "TAGA", "ACAG", "TAGA", "TCCT", "CCCC", "GCGC", "ATCC", "AGTA", "AAGA", "GCGA", "CGTA"]
out = de_bruijn_from_kmers(patterns)

keys = list(out.keys())
keys.sort()
for key in keys:
    print(f'{key}: {" ".join(out[key])}')

AAG: AGA AGC AGA
ACA: CAA CAG
ACG: CGT
AGC: GCC
AGT: GTA GTA GTT GTA GTA
ATC: TCC
CAA: AAG
CCC: CCC
CGT: GTA GTA
GCA: CAT
GCC: CCG
GCG: CGA CGC CGA CGC CGA
GGT: GTC
TAG: AGG AGA AGA
TCC: CCT
TTC: TCG
