# Error correction

First, we are interested in the k-mer distribution: we want to count how often each k-mer occurs in our input set of reads. Write a function *kmerHist* that receives a list of reads and a k-mer size and returns a histogram: key-value pairs where the key is a k-mer and the value is that k-mer's frequency in the input set (how many times it appears).

We do so by iterating through the reads, splitting every read into its k-mers, and counting each k-mer. For simplicity, we use Python's [defaultdict](https://docs.python.org/2/library/collections.html#collections.defaultdict).

In [None]:
from collections import defaultdict

def kmerHist(reads, k):
    # initialize the histogram as a defaultdict with default_factory=int,
    # so that unseen k-mers start at 0 and can be counted up directly
    hist = defaultdict(int)
    # go through every read
    for read in reads:
        # go through each k-mer start position in the read
        for i in range(len(read)-k+1):
            # increase the value (frequency) for that k-mer
            hist[read[i:i+k]] += 1
    return hist

In [2]:
kmerHist(['ACACG', 'ACGCA', 'CAAAC'], 3)

defaultdict(int,
            {'AAA': 1,
             'AAC': 1,
             'ACA': 1,
             'ACG': 2,
             'CAA': 1,
             'CAC': 1,
             'CGC': 1,
             'GCA': 1})
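Since the goal is to look at the k-mer distribution, it can help to summarise the histogram as a frequency spectrum: how many distinct k-mers occur once, twice, and so on. The following is a minimal sketch; `kmer_hist` repeats the counting logic so that the block runs on its own, and `frequency_spectrum` is a hypothetical helper name, not part of the original notebook.

```python
from collections import Counter, defaultdict

def kmer_hist(reads, k):
    # same counting idea as kmerHist, repeated so this block is self-contained
    hist = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            hist[read[i:i + k]] += 1
    return hist

def frequency_spectrum(hist):
    # hypothetical helper: map each frequency to the number of
    # distinct k-mers that occur with exactly that frequency
    return Counter(hist.values())

hist = kmer_hist(['ACACG', 'ACGCA', 'CAAAC'], 3)
spectrum = frequency_spectrum(hist)
print(sorted(spectrum.items()))  # [(1, 7), (2, 1)]
```

In real data, k-mers that contain sequencing errors tend to pile up at the low-frequency end of this spectrum, which is what the frequency threshold used later relies on.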

### Finding neighbouring k-mers

For every k-mer we need to know its neighbouring k-mers. Here, we define a neighbouring k-mer as a k-mer that differs in exactly one base (Hamming distance of 1).

Write a function that receives two arguments: a k-mer and the set of characters that can appear in a k-mer (since these examples use DNA data, the alphabet is ACTG). The function returns a list of neighbouring k-mers.

The list is built by iterating through every position in the k-mer and, for each position, replacing the current character with each of the other characters from the alphabet. Every substitution yields one neighbouring k-mer, which is added to the list.

In [3]:
def neighbour1mer(kmer, alphabet):
    neighbours = []
    # iterate through every position in the k-mer, from the last to the first
    for i in range(len(kmer)-1, -1, -1):
        # iterate through every character in the alphabet
        for a in alphabet:
            # substitute the character at the given position with each non-matching character
            # and add the resulting k-mer to the list of neighbouring k-mers
            if a != kmer[i]:
                neighbours.append(kmer[:i]+a+kmer[i+1:])
    return neighbours

When generating neighbouring k-mers, we iterate through the positions of the k-mer backwards (range(len(kmer)-1, -1, -1)), so neighbours that differ at the end of the k-mer come first in the list. The reason is that the *errorCorrect* step walks through the k-mers of a read from start to end and corrects each one in turn. By the time we reach a given k-mer, its first k-1 bases have potentially already been corrected as part of earlier k-mers. The last base is the only new information, so it is the most likely place for an error, and we want to try neighbours that change the end of the k-mer first.

In [4]:
neighbour1mer('CCC', 'ACTG')

['CCA', 'CCT', 'CCG', 'CAC', 'CTC', 'CGC', 'ACC', 'TCC', 'GCC']
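As a quick sanity check of the definition, the sketch below confirms that every k-mer in this output is at Hamming distance 1 from 'CCC' and that there are k * (alphabet size - 1) of them. The `hamming` helper is an assumption introduced for illustration and is not part of the original notebook.

```python
def hamming(a, b):
    # hypothetical helper: number of positions at which two equal-length strings differ
    return sum(x != y for x, y in zip(a, b))

# neighbours of 'CCC' exactly as listed in the output above
neighbours = ['CCA', 'CCT', 'CCG', 'CAC', 'CTC', 'CGC', 'ACC', 'TCC', 'GCC']

# every neighbour differs from 'CCC' in exactly one position
assert all(hamming('CCC', n) == 1 for n in neighbours)
# k * (len(alphabet) - 1) = 3 * 3 = 9 neighbours, with no duplicates
assert len(neighbours) == len(set(neighbours)) == 3 * (4 - 1)
```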

In [5]:
['CAT' * 10]

['CATCATCATCATCATCATCATCATCATCAT']

In [6]:
x=range(len('CCC')-1, -1, -1)

In [7]:
list(x)

[2, 1, 0]

## Error Correct function

Last, let's write the *errorCorrect* function. It receives a read, the k-mer histogram, the k-mer size, the alphabet and a frequency threshold. The function checks each k-mer in the read. If its frequency is below the threshold, we treat the k-mer as containing an error and correct that part of the read. The function returns the corrected read.

If we detect that a k-mer contains an error, we iterate through its neighbouring k-mers. As soon as we find a neighbour whose frequency is above the threshold, we correct our k-mer: at this position in the read, we replace the k-mer with that neighbour.

In [8]:
def errorCorrect(read, hist, k, alphabet, threshold):
    # go through every k-mer in the read
    for i in range(len(read)-k+1):
        # check if it contains an error (infrequent k-mer)
        if hist[read[i:i+k]] < threshold:
            # generate its neighbours; substitutions at the end of the k-mer come first
            neighbours = neighbour1mer(read[i:i+k], alphabet)
            # go through all neighbouring k-mers
            for neighbour in neighbours:
                # find a frequent neighbour
                if hist[neighbour] > threshold:
                    # substitute the k-mer with the error-corrected one
                    read = read[:i]+neighbour+read[i+k:]
                    # and move on to the next k-mer
                    break
    return read
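To see the three steps working together, here is a small self-contained run. It restates the functions so that the block executes on its own, assuming the neighbour search is meant to run only for k-mers below the threshold. The toy reads are made up for illustration, and using `hist.get(..., 0)` instead of direct indexing is a choice of this sketch (it avoids inserting unseen k-mers into the defaultdict during lookups).

```python
from collections import defaultdict

def kmerHist(reads, k):
    hist = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            hist[read[i:i + k]] += 1
    return hist

def neighbour1mer(kmer, alphabet):
    neighbours = []
    # last position first, so end-of-k-mer substitutions are tried first
    for i in range(len(kmer) - 1, -1, -1):
        for a in alphabet:
            if a != kmer[i]:
                neighbours.append(kmer[:i] + a + kmer[i + 1:])
    return neighbours

def errorCorrect(read, hist, k, alphabet, threshold):
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        # .get leaves the histogram unchanged for k-mers it has never seen
        if hist.get(kmer, 0) < threshold:
            for neighbour in neighbour1mer(kmer, alphabet):
                if hist.get(neighbour, 0) > threshold:
                    read = read[:i] + neighbour + read[i + k:]
                    break
    return read

# toy data (made up): five clean copies and one read with a single substitution (T -> A)
reads = ['ACGTTGCA'] * 5 + ['ACGTAGCA']
hist = kmerHist(reads, 3)
corrected = errorCorrect('ACGTAGCA', hist, 3, 'ACTG', 2)
print(corrected)  # ACGTTGCA
```

The first low-frequency k-mer encountered is 'GTA', and its neighbour 'GTT' (a substitution at the last position) is frequent, so the erroneous base is fixed the first time it enters the window and the following k-mers are already clean.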

This is a simplified version of an error correction algorithm. In a real setting we would, for instance, use the sequencing base quality (Phred score) as an additional indicator: if the base quality is high, the base might not be an error, while a low quality gives additional confirmation that it is.
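One way the base quality could enter the decision is sketched below. It assumes qualities arrive as a FASTQ-style ASCII string with an offset of 33; the function names, the combined rule in `is_suspect` and the quality cutoff of 20 are all hypothetical choices for illustration, not part of the algorithm above.

```python
def phred_scores(quality_string, offset=33):
    # decode an ASCII-encoded quality string; an offset of 33 is assumed here
    return [ord(c) - offset for c in quality_string]

def error_probability(q):
    # Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)
    return 10 ** (-q / 10)

def is_suspect(kmer_freq, base_quality, freq_threshold, quality_threshold=20):
    # hypothetical rule: treat a base as an error only when its k-mer is rare
    # AND the sequencer itself reported low confidence in the base
    return kmer_freq < freq_threshold and base_quality < quality_threshold

scores = phred_scores('II5#')
print(scores)  # [40, 40, 20, 2]
```

With such a rule, a rare k-mer whose new base has a quality of 40 would be left alone, while the same k-mer with a quality of 2 would be passed on to the neighbour search.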

Secondly, in the current implementation we create neighbours for every position in the k-mer. We should be more careful here, since correcting any of the first k-1 positions in a k-mer also affects the previous k-1 k-mers in the read. This is also why *neighbour1mer* generates neighbours for the end position first (and returns them at the beginning of the list): substituting the current k-mer with one of these does not disrupt the previous k-1 k-mers.
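A more conservative variant would only ever substitute the final base, so that previously visited k-mers are never altered. The sketch below shows what such a neighbour generator could look like; `lastBaseNeighbours` is a hypothetical name and is not part of the original notebook.

```python
def lastBaseNeighbours(kmer, alphabet):
    # hypothetical variant of neighbour1mer: substitute only the final base,
    # leaving the first k-1 bases (shared with earlier k-mers) untouched
    return [kmer[:-1] + a for a in alphabet if a != kmer[-1]]

print(lastBaseNeighbours('CCC', 'ACTG'))  # ['CCA', 'CCT', 'CCG']
```

One caveat of this restriction: the first k-mer of a read has no earlier k-mers behind it, so an error in its first k-1 bases could not be repaired this way and would presumably still need the full neighbour list.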