We live in a time of great discoveries in medicine. Many of them were made possible by our ability to analyze blood, skin and even genes. That's impressive! Our genetic information can be written down as a sequence of letters. I think that impresses anyone who is in love with data. Data inside us!
We often hear the saying "lucky with genetics". Today, experts who analyze our genes can actually check whether we were lucky or not!
I am not an expert in bioinformatics, but I am very interested in this area. In this kernel, I want to show some interesting things we can learn about our genes. I do not guarantee absolute correctness and will be glad to receive comments and suggestions.
![image.png](https://teacherofya.files.wordpress.com/2017/11/img_4401.gif?w=640)

***Alignment of biological sequences***

In the course of molecular evolution, all biological macromolecules undergo multiple mutation events. These result in the loss or gain of extended stretches of sequence, or in individual point mutations.
If two protein or nucleotide sequences are highly similar, they are probably homologues: they usually share a common ancestor and have similar functions and structures.
**Magical!** We can find out how different we are from a butterfly or a giraffe :D
We can do sequence alignment with a wonderful module, Biopython. We can also implement it ourselves, using algorithms that are described in many articles.

**Local alignment. Smith-Waterman Algorithm**

The Smith-Waterman algorithm is designed to obtain a local alignment of sequences, that is, to identify similar regions of two nucleotide or protein sequences. It compares segments of all possible lengths and optimizes the similarity measure across all segments and all alignments of those segments.
![image.png](https://upload.wikimedia.org/wikipedia/commons/9/92/Smith-Waterman-Algorithm-Example-En.gif)
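
For reference, the score matrix of the algorithm can be written as a recurrence. This is the linear-gap form, which is how I read the `Scores` class in the cell below; the symbols $H$, $s$ and $g$ are notation introduced here for illustration.

$$
H_{i,0} = H_{0,j} = 0, \qquad
H_{i,j} = \max
\begin{cases}
0 \\
H_{i-1,j-1} + s(a_i, b_j) \\
H_{i-1,j} + g \\
H_{i,j-1} + g
\end{cases}
$$

Here $s(a_i, b_j)$ is the match score (5) when the two residues are equal and the mismatch score (-1) otherwise, and $g$ is the gap penalty (-10). The best local alignment ends at the cell with the highest $H_{i,j}$ and is traced back until a cell with value 0 is reached.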


In [None]:
import sys
from Bio import SeqIO
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
from Bio import Phylo
import numpy as np
import matplotlib
import pylab
import matplotlib.pyplot as plt


Let's compare a protein sequence from a human with the corresponding one from a gibbon and see how similar they are. You can download these and other protein sequences here: https://omabrowser.org/oma/vps/HUMAN22168/fasta/

In [None]:
Homo_sapiens = 'MTELETAMGMIIDVFSRYSGSEGSTQTLTKGELKVLMEKELPGFLQSGKDKDAVDKLLKDLDANGDAQVDFSEFIVFVAAITSACHKYFEKAGLK'
Gibbon = 'MTELETAMGMIIDVFSRYSGSEGSTQTLTKGELKVLMEKELPGFLQTGKDKDAVDKLLKDLDANGDAEVDFSEFIVFVAAITSACHKYFEKAELK'

In [None]:
seq1 = Homo_sapiens
seq2 = Gibbon

class Scores(object):
    gap = -10
    match = 5
    mismatch = -1

    def __init__(self, s1, s2):      
        if len(s1) >= len(s2):
            self.t = s1
            self.s = s2
        else:
            self.t = s2
            self.s = s1
        self.a = None
        self.alignments = None   

    def matrix(self):
        m = len(self.s) + 1  
        n = len(self.t) + 1  

        self.a = [
            [0 for j in range(n)] for i in range(m)
        ]

        for i in range(1, m):
            for j in range(1, n):
                value = self.match if self.s[i - 1] == self.t[j - 1] else self.mismatch
                self.a[i][j] = max(
                    self.a[i - 1][j] + self.gap,
                    self.a[i - 1][j - 1] + value,
                    self.a[i][j - 1] + self.gap,
                    0
                )

    def alignment(self, i, j, s, t):

        if self.a[i][j] == 0:
            self.alignments.append(
                [t[::-1], s[::-1]]
            )
            return
        if i > 0 and self.a[i][j] == self.a[i - 1][j] + self.gap:
            self.alignment(
                i=i - 1,
                j=j,
                s=s + self.s[i - 1],
                t=t + '_'
            )
        if i > 0 and j > 0 and self.a[i][j] == self.a[i - 1][j - 1] + (self.match if self.s[i - 1] == self.t[j - 1] else self.mismatch):
            self.alignment(
                i=i - 1,
                j=j - 1,
                s=s + self.s[i - 1],
                t=t + self.t[j - 1]
            )
        if j > 0 and self.a[i][j] == self.a[i][j - 1] + self.gap:
            self.alignment(
                i=i,
                j=j - 1,
                s=s + '_',
                t=t + self.t[j - 1]
            )

    def print_alignments(self):    
        self.alignments = []

        bigger_positions = []
        bigger = -sys.maxsize
        for i in range(len(self.a)):
            for j in range(len(self.a[i])):
                if self.a[i][j] > bigger:
                    bigger = self.a[i][j]
                    bigger_positions = [[i, j]]
                elif self.a[i][j] == bigger:
                    bigger_positions.append([i, j])

        for [i, j] in bigger_positions:
            self.alignment(
                i=i,
                j=j,
                s='',
                t=''
            )
        print('Score: ', bigger, '\n')
        for [t, s] in self.alignments:
            print('Align1: ', t)
            print('Align2: ', s, '\n')

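if __name__ == '__main__':
    # Use a separate name for the instance so the Scores class is not shadowed
    # (otherwise re-running this cell would fail).
    scorer = Scores(seq1, seq2)

    print('\nLocal alignment')
    scorer.matrix()
    scorer.print_alignments()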

Now let's use a faster way. Biopython :)

In [None]:
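# localms arguments: match score, mismatch score, gap-open penalty, gap-extension penalty.
# Note: the extension penalty is 0 here, while the class above charges -10 for every gap position.
alignments = pairwise2.align.localms(seq1, seq2, 5, -1, -10, 0)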

for a in alignments: 
    print(format_alignment(*a))
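
To put a single number on "how similar", one simple option is percent identity over an alignment. The sketch below is my own helper (the name `percent_identity` and the toy strings are assumptions, not part of Biopython); it expects two aligned strings of equal length and treats `-` and `_` as gap characters.

```python
def percent_identity(aligned_a, aligned_b, gap_chars="-_"):
    """Share of alignment columns (in percent) where both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("aligned sequences must not be empty")
    matches = sum(
        1
        for x, y in zip(aligned_a, aligned_b)
        if x == y and x not in gap_chars  # a gap facing a gap is not a match
    )
    return 100.0 * matches / len(aligned_a)


# Toy example: 10 columns, one mismatch (E vs Q)
toy_identity = percent_identity("MTELETAMGM", "MTELQTAMGM")
print(toy_identity)
```

The same function can be called on the two aligned strings that the cells above print, since the human and gibbon sequences align without gaps.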


Isn't that wonderful? We can compare ourselves to any living creature on the planet!
Of course, a phylogenetic tree shows this more clearly. I built this tree last year during a training project. The code is long and boring, so I don't include it here. If anyone is interested, I will be happy to share it.
Conveniently, we can draw a tree simply by opening a file that stores its weights (branch lengths). The weights are calculated from a distance matrix with a clustering algorithm such as WPGMA or UPGMA.
My weights for the tree are stored in the file tree.xml, which the code reads as Newick.


In [None]:
tree = Phylo.read('../input/treedata/tree.xml', 'newick')
tree.root.color = 'c'
tree.ladderize()
matplotlib.rc('font',size=10)
Phylo.draw(tree, branch_labels = lambda c: c.branch_length)
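
For readers curious how such weights can be obtained, here is a minimal UPGMA sketch. It is not the code I used for the tree above; the function name `upgma`, the species labels and the distances are all made up for illustration. It repeatedly merges the two closest clusters and returns the tree as a Newick string.

```python
def upgma(names, dist):
    """Cluster a symmetric distance matrix with UPGMA and return a Newick string."""
    # cluster id -> (newick label, number of leaves, height of the cluster root)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    # pairwise distances, keyed by (smaller id, larger id)
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)

    while len(clusters) > 1:
        # pick the closest pair of clusters
        (a, b), d_ab = min(d.items(), key=lambda item: item[1])
        label_a, size_a, h_a = clusters.pop(a)
        label_b, size_b, h_b = clusters.pop(b)
        height = d_ab / 2  # the new node sits halfway between the two clusters
        label = f"({label_a}:{height - h_a:g},{label_b}:{height - h_b:g})"

        # distance from the merged cluster to every remaining cluster:
        # average weighted by cluster size (WPGMA would use a plain average)
        new_d = {}
        for c in clusters:
            d_ac = d[tuple(sorted((a, c)))]
            d_bc = d[tuple(sorted((b, c)))]
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)

        # drop distances that involve the two merged clusters, add the new ones
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(new_d)
        clusters[next_id] = (label, size_a + size_b, height)
        next_id += 1

    (label, _, _), = clusters.values()
    return label + ";"


# Hypothetical distances, for illustration only
names = ["human", "gibbon", "mouse"]
dist = [
    [0, 2, 8],
    [2, 0, 8],
    [8, 8, 0],
]
newick = upgma(names, dist)
print(newick)
```

The resulting string is in Newick format, so it should be possible to save it to a file and load it with `Phylo.read(..., 'newick')` as in the cell above.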


**Explore some of the files in the dataset**

I have not yet figured out what to do with the gene data stored in CSV format, but I will definitely get to it! In the meantime, we can analyze some of the information stored in FASTA format.
"mrna-genbank.fa" is a single FASTA file with a large number of mRNA sequences from GenBank associated with D. melanogaster.
Let's look at this file.

In [None]:
plt.figure(figsize=(12, 7))
# Length of every record in the FASTA file
sizes = [len(r) for r in SeqIO.parse('../input/drosophila-melanogaster-genome/mrna-genbank.fa', "fasta")]
plt.hist(sizes, bins=20, color="pink")
plt.title("%i sequences\nLengths %i to %i" % (len(sizes), min(sizes), max(sizes)))
plt.xlabel("Sequence length (bp)")
plt.ylabel("Count")
plt.show()

This is a large file containing sequences of different lengths. Scientists work with sequences like these when they assemble a genome.
**Genome assembly** is the process of combining a large number of short DNA fragments (reads) into one or more long sequences (contigs and scaffolds) in order to restore the DNA sequences of the chromosomes from which these fragments originated during sequencing.
There are two main approaches to genome assembly: overlap-layout-consensus (used for long fragments) and de Bruijn graphs (used for short fragments).

**De Bruijn graph**. 
In graph theory, an n-dimensional De Bruijn graph of m symbols is a directed graph representing overlaps between sequences of symbols. It has $m^n$ vertices, consisting of all possible length-n sequences of the given symbols; the same symbol may appear multiple times in a sequence.
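
The definition can be sketched in a few lines of Python. The function name `de_bruijn_graph` is my own; each vertex is a length-n string, and there is an edge from a vertex to every string obtained by dropping its first symbol and appending one symbol at the end.

```python
from itertools import product


def de_bruijn_graph(symbols, n):
    """Return (vertices, edges) of the n-dimensional De Bruijn graph over `symbols`."""
    # every possible length-n sequence of the symbols is a vertex
    vertices = ["".join(p) for p in product(symbols, repeat=n)]
    # shift left by one position and append a symbol: overlap of n-1 symbols
    edges = [(v, v[1:] + s) for v in vertices for s in symbols]
    return vertices, edges


vertices, edges = de_bruijn_graph("01", 3)
print(len(vertices), "vertices")  # m**n with m = 2, n = 3
print(len(edges), "edges")        # every vertex has m outgoing edges
```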

![image.png](http://skrinshoter.ru/i/141018/m7Wpw13U.png)


In sequence assembly, k-mers are typically used to construct de Bruijn graphs. To build the graph, the string of length L stored on each edge must overlap the string on an adjacent edge by L-1 characters; that overlap forms a vertex.

Reads generated by next-generation sequencing typically vary in length. For example, Illumina's sequencing technology produces reads that are 100-mers. The problem is that only a small fraction of all the 100-mers present in the genome is actually generated. This is partly due to read errors but, more importantly, to simple coverage holes that occur during sequencing. As a result, the reads violate the key assumption of de Bruijn graphs: every k-mer must overlap its neighbouring k-mer in the genome by k-1, which cannot happen when some of the k-mers are missing.

The solution is to break the reads into smaller k-mers, so that the resulting k-mers represent all the k-mers of that smaller size present in the genome. Splitting the reads into smaller k-mers also alleviates the problem of differing read lengths. The figure above shows an example: the 5 reads do not account for all the 7-mers of the genome, so a de Bruijn graph cannot be built from them. But when they are split into 4-mers, the resulting subsequences are enough to reconstruct the genome using a de Bruijn graph.
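
Here is a toy sketch of that idea. The genome and the three reads are invented for illustration (they are not the example from the figure), and the helper names are my own. The reads miss one of the genome's 7-mers, but their 4-mers cover the genome completely, so walking the de Bruijn graph (an Eulerian path found with Hierholzer's algorithm) restores the sequence.

```python
from collections import defaultdict


def kmers_of(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_de_bruijn(reads, k):
    """Vertices are (k-1)-mers; every distinct k-mer adds one edge prefix -> suffix."""
    graph = defaultdict(list)
    for kmer in sorted({km for read in reads for km in kmers_of(read, k)}):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    in_deg = defaultdict(int)
    for successors in graph.values():
        for v in successors:
            in_deg[v] += 1
    # start where out-degree exceeds in-degree by one, otherwise at any vertex
    start = next((u for u in graph if len(graph[u]) - in_deg[u] == 1),
                 next(iter(graph)))
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the Eulerian path of the k-mer graph."""
    path = eulerian_path(build_de_bruijn(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


genome = "ATGGCGTGCA"                       # toy genome, assumption for the demo
reads = ["ATGGCGT", "GGCGTGC", "GCGTGCA"]   # 7-mer reads with a coverage hole

missing_7mers = set(kmers_of(genome, 7)) - set(reads)
print("7-mers missing from the reads:", missing_7mers)

assembled = assemble(reads, 4)
print("Assembled from 4-mers:", assembled)
```

This toy genome has no repeated 3-mers, so the path is unique; real genomes contain repeats, and the graph then has many possible walks.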

**K-mer counting**

Let's consider an example of counting k-mers. Unfortunately, the file available in the dataset is not suitable for this, since it contains many sequences of different lengths, so I uploaded another file.

In [None]:
from collections import Counter

NOISE_LEVEL = 0.2
ADD_REVERSE_C = True

k = 15

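with open('../input/forkmer/hw3_dataset.fasta', 'r') as file:
    fasta = file.read()

# Assumes two-line FASTA records (a header line followed by one sequence line),
# so every second line, starting from index 1, is a read.
fasta = fasta.split('\n')
fasta = fasta[1::2]

# All overlapping substrings of length k from every read
kmers = [s[i:i + k] for s in fasta for i in range(len(s) - k + 1)]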

if ADD_REVERSE_C:
    # Reverse complement: complement each base (a<->t, c<->g), then reverse the k-mer.
    reverse_c_trans = str.maketrans('acgtACGT', 'tgcaTGCA')
    kmers += [kmer.translate(reverse_c_trans)[::-1] for kmer in kmers]

kmers_occurence = Counter(Counter(kmers).values())

plt.figure(figsize=(15,5))
plt.bar(list(kmers_occurence.keys()),kmers_occurence.values(), color='c')
plt.yscale('log')
plt.xlabel('Coverage')
plt.ylabel('Density')
plt.grid()
plt.show()

kmers_freq_len = len(kmers_occurence)
kmers_freq_noise_level = int(NOISE_LEVEL * kmers_freq_len)
kmers_occurence = sorted(kmers_occurence.items())[kmers_freq_noise_level:]

coverage, _ = max(kmers_occurence, key=lambda x: x[1])
print('Coverage:', coverage)

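# Total number of k-mer occurrences divided by the peak coverage
# (loop variables renamed so they do not reuse the name k defined above).
genom_len = sum(freq * count for freq, count in kmers_occurence) // coverage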
print('The length of the genome:', genom_len)
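
As I read the cell above, the genome length estimate boils down to the following relation. The symbols are my own notation: $n_f$ is the number of distinct k-mers that occur exactly $f$ times (after the low-frequency noise is cut off), and $c$ is the coverage at the peak of the histogram.

$$
G \approx \frac{\sum_{f} f \cdot n_f}{c}
$$

For example, with purely hypothetical numbers, 3,000,000 counted k-mer occurrences and a peak at $c = 30$ would give $G \approx 100{,}000$.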

This is not even 1% of what bioinformatics can do to analyze our genes. Magical!

**Work in progress. Any comments are welcome. Thank you all for your attention!**