In [1]:
from re import sub, search
import numpy as np
import pandas as pd
import os
import math

''' Display List '''
# display list neatly
# https://stackoverflow.com/questions/1524126/how-to-print-a-list-more-nicely
def lstcol(obj, cols=4, columnwise=True, gap=4):
    sobj = [str(item) for item in obj]
    if cols > len(sobj): cols = len(sobj)
    max_len = max([len(item) for item in sobj])
    if columnwise: cols = int(math.ceil(float(len(sobj)) / float(cols)))
    plist = [sobj[i: i+cols] for i in range(0, len(sobj), cols)]
    if columnwise:
        if not len(plist[-1]) == cols:
            # pad the last chunk up to the chunk size so zip() keeps every row
            plist[-1].extend(['']*(cols - len(plist[-1])))
        plist = zip(*plist)
    printer ='\n'.join([
        ''.join([c.ljust(max_len + gap) for c in p])
        for p in plist])
    print (printer)

![](https://images-wixmp-ed30a86b8c4ca887773594c2.wixmp.com/f/8cc1eeaa-4046-4c4a-ae93-93d656f68688/deny9ja-e5f8e0f2-94f0-402a-9350-32ebbf1fb61d.jpg?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJ1cm46YXBwOjdlMGQxODg5ODIyNjQzNzNhNWYwZDQxNWVhMGQyNmUwIiwiaXNzIjoidXJuOmFwcDo3ZTBkMTg4OTgyMjY0MzczYTVmMGQ0MTVlYTBkMjZlMCIsIm9iaiI6W1t7InBhdGgiOiJcL2ZcLzhjYzFlZWFhLTQwNDYtNGM0YS1hZTkzLTkzZDY1NmY2ODY4OFwvZGVueTlqYS1lNWY4ZTBmMi05NGYwLTQwMmEtOTM1MC0zMmViYmYxZmI2MWQuanBnIn1dXSwiYXVkIjpbInVybjpzZXJ2aWNlOmZpbGUuZG93bmxvYWQiXX0.y0MmVL5Y2FuoS12AgjAJLAGOWtpd-WYLfzg7gfNu_1g)
Photography of <b>Evolution</b> by [@sashayudaev](https://unsplash.com/@sashayudaev)

<div style="color:white;
       display:fill;
       border-radius:5px;
       background-color:#FF5733;
       font-size:220%;
       font-family:Nexa;
       letter-spacing:0.5px">
    <p style="padding: 20px;
          color:white;">
        <b>1 |</b> BACKGROUND
    </p>
</div>

### <b><span style='color:#FF5733'>1.1</span> | CELLS</b>
- A cell is mostly composed of water. A bacterial cell is roughly 70% water and 30% chemical compounds by weight: about <b>7% small molecules</b>, including <b>amino acids</b> and <b>nucleotides</b>, and about 23% <b>macromolecules</b> (<b>proteins</b>, lipids, polysaccharides).
- According to their internal structure, cells can be divided into two major categories:
  - <b>Prokaryotic</b> cells : have no nucleus or internal membranes. 
  - <b>Eukaryotic</b> cells : have a defined <b>nucleus</b>, <b>internal membranes</b> and functional elements called <b>organelles</b>, which vary in shape and size and perform specific functions.
- At a structural level, all cells are surrounded by a structure called the cell membrane or plasma membrane. This membrane is permeable to molecules that cells need to absorb from or excrete to the outside medium.
- Within the cell we find the <b>cytoplasm</b> (largely composed of water), which serves as the medium for the cell.

#### <b><span style='color:#FF5733'>NUCLEIC ACIDS</span></b>
- Among molecules with a biological role, we can find <b>nucleic acids</b>. 
> Nucleic acids encode and express the genetic code that is kept within the cell. 
- There are two major types of nucleic acids: 
  - (a) <b>DeoxyriboNucleic Acid (DNA)</b>
  - (b) <b>RiboNucleic Acid (RNA)</b>
- DNA contains the information necessary to build a cell, and keep it functioning. 
- In <b>eukaryotic</b> cells, DNA will be found in the nucleus, whilst in the <b>prokaryotic</b> cells, it will be found in the cytoplasm. 
    
|Name|Symbol|Nucleotides| Name | Symbol | Nucleotides |
|-|-|-|-|-|-|
|Adenine|A|A|Amino|M|A,C|
|Cytosine|C|C|Keto|K|G,T|
|Guanine|G|G|Purine|R|A,G|
|Thymine|T|T|Pyrimidine|Y|C,T|
|Uracil|U|U|Strong|S|C,G|
| | | |Weak|W|A,T|

#### <b><span style='color:#FF5733'>AMINO ACIDS</span></b>
- Amino acids are the <b>building blocks of proteins</b>, the <b>macromolecules</b> that perform most of the functions in a cell.
- Proteins have a broad range of functions, spanning from catalytic to structural:
  - <b>Enzymes</b> : abundant proteins that promote chemical reactions and convert certain molecules into other molecules required for the functioning of the cell.
- Other macromolecules found in the cell alongside proteins include:
  - <b>Carbohydrates</b> : serve as energy storage, for both immediate and long-term energy demands.
  - <b>Lipids</b> : part of the plasma membrane, also involved in signaling and energy storage.
- The cell also contains other components of varying complexity. Of importance: 
  - <b>Mitochondria</b> & <b>chloroplasts</b> : organelles involved in the production of energy. 
  - <b>Ribosomes</b> : large, complex assemblies of RNA and proteins that are required to assemble proteins and play a central role in the flow of genetic information.
- The 20 standard amino acids and their one-letter symbols (as per the IUPAC amino acid list):
    
| Name | Symbol | Name | Symbol | Name | Symbol | 
|-|-|-|-|-|-|
| Alanine | A | Arginine | R | Aspartic acid | D |
| Asparagine | N | Cysteine | C | Glycine | G |
| Glutamic Acid | E | Glutamine | Q | Histidine | H | 
| Isoleucine | I | Leucine | L | Lysine | K |
| Methionine | M | Proline | P | Phenylalanine | F | 
| Serine | S | Threonine | T | Tryptophan | W | 
| Tyrosine | Y | Valine | V | | |

### <b><span style='color:#FF5733'>1.2</span> | NUCLEIC ACID</b>
- DNA is a molecule composed of <b>two complementary strands</b> that are held together by bonds between the nucleotides of both strands. 
- This is possible because <b>Adenine</b> (A) pairs only with <b>Thymine</b> (T), through two hydrogen bonds, and <b>Guanine</b> (G) pairs only with <b>Cytosine</b> (C), through three hydrogen bonds.
- This results in <b>two complementary</b>, anti-parallel strands (running in opposite directions). If we know the nucleotide sequence of one strand, we can get the sequence of the other by taking the complement of each nucleotide and reading the result backwards, which gives the <b>reverse complement</b>. Because of this complementarity, it has become standard to describe DNA through only one of its strands, using the letters <b>[A,T,G,C]</b>.
- The existence of these two strands is essential in order to pass on genetic information to new cells and produce <b>proteins</b>.
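
The pairing rule above (A with T, G with C, strands read in opposite directions) can be sketched in a few lines. This is only an illustration: `reverse_complement` is a hypothetical helper name and is not the method used by the class defined later in this notebook.

```python
# Hypothetical helper illustrating the base-pairing rule; not part of the SQ class.
_COMPLEMENT = str.maketrans("ATGC", "TACG")  # A<->T, G<->C

def reverse_complement(dna: str) -> str:
    """Complement every nucleotide, then reverse the result."""
    return dna.upper().translate(_COMPLEMENT)[::-1]

forward = "ATGACG"
print(reverse_complement(forward))  # CGTCAT
```

Applying the function twice returns the original strand, which is a quick sanity check for the mapping.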

### <b><span style='color:#FF5733'>1.3</span> | CENTRAL DOGMA OF MOLECULAR & CELL BIOLOGY</b>

- <b>DNA</b>, <b>RNA</b> & <b>Proteins</b> are the central elements of the flow of genetic information that occurs in two steps (1) <b>transcription</b> & (2) <b>translation</b>.

#### <b>(1) RNA Synthesis : <span style='color:#FF5733'>Transcription</span></b>

- *Transcription* : Preliminary step required to produce a <b>protein</b>, as shown in Fig. X.
> - The nucleotide sequence of a gene from one of the DNA strands is transcribed ( copied into a complementary molecule of RNA )
> - Because the copy is complementary, the information encoded in the original DNA sequence can be recovered from it. The copying itself is performed by the enzyme RNA polymerase.
> - Additional steps of RNA processing, including the addition of stabilising elements at the ends of the molecule, are performed by different protein complexes.
> - After these steps, which occur within the nucleus of the cell, an RNA molecule called <b>mature messenger RNA (mRNA)</b> is obtained.
> - The <b>mRNA</b> is then transported to the cytoplasm, where it will be used by the cellular machinery to guide the production of a protein.

#### <b>(2) Protein Synthesis : <span style='color:#FF5733'>Translation</span></b>
> - <b>Proteins</b> are cellular entities that have either:
    - (a) <b>structural function</b>, participating in the physical definition of a cell.
    - (b) <b>chemical function</b>, being involved in chemical reactions occurring in the cell.
> - In order to function as expected, a protein needs to acquire the appropriate structure, and this structure is often described at different levels of complexity.
> - The primary structure is defined by the chain of amino acids, the <b>polypeptide</b>, which makes up part or all of a protein.

> - *Translation* : the process in which the nucleotide sequence of the mRNA is <b>translated into a chain of amino acids</b>, forming a <b>polypeptide</b>. 

<b>What performs this?</b>
> - This <b>process is performed by the ribosomes</b> that attach and scan the mRNA from one end to the other, in groups of <b>nucleotide triplets</b> / <b>Codons</b>.

<b>Codons</b>
    
> - Each position of the triplet holds one of 4 nucleotides, i.e. there are 4 x 4 x 4 = 64 possible triplets (codons). 
> - For each codon in the mRNA sequence, we have a corresponding amino acid in the <b>polypeptide chain</b>.
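
The count of 64 can be checked by enumerating every triplet over the mRNA alphabet. A minimal sketch using only the standard library; the variable names are illustrative.

```python
from itertools import product

nucleotides = "ACGU"  # mRNA alphabet
# every ordered choice of 3 nucleotides, with repetition allowed
codons = ["".join(triplet) for triplet in product(nucleotides, repeat=3)]

print(len(codons))  # 64
print(codons[:4])   # ['AAA', 'AAC', 'AAG', 'AAU']
```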
    
<b>Start & Stop Codons</b>
    
> - Some of these codons represent significant signals that indicate the <b>initiation</b> or the <b>termination</b> of the translation process.
> - Once the **ribosome** detects an initiation codon, it starts the formation of the amino acid chain, and when it scans the stop codon, it stops the translation and detaches from the mRNA molecule.

> - There are <b>20 types of amino acids used to form polypeptides</b> (per IUPAC), fewer than the 64 possible codons, so more than one codon can correspond to the same amino acid.
    
> - During the translation process:
    - A type of small RNA molecule, <b>transfer RNA (tRNA)</b>, brings amino acids of the corresponding type to the <b>ribosome</b>; each tRNA carries a triplet complementary to the mRNA codon that is currently being scanned.
    - Each mRNA molecule can be scanned multiple times by different ribosomes, giving rise to multiple copies of the polypeptide. 
    - Through its redundancy, where more than one codon encodes the same amino acid, the genetic code provides an efficient error-tolerance mechanism that minimises the impact of errors in the nucleotide sequence occurring in DNA replication.
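
To make the redundancy concrete, the sketch below builds the standard genetic code in a compact form and counts how many codons map to each amino acid. The compact `BASES`/`AMINO` encoding is an assumption chosen for brevity (it is not how this notebook stores its table), and `_` marks stop codons to match the convention used here.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('_' = stop)
BASES = "TCAG"
AMINO = "FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codon_table = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

codons_per_aa = Counter(codon_table.values())
print(codons_per_aa["L"])  # 6 codons encode Leucine
print(codons_per_aa["M"])  # 1 codon encodes Methionine
print(codons_per_aa["_"])  # 3 stop codons

# A change in the third position often leaves the amino acid unchanged
print(codon_table["GCT"], codon_table["GCC"])  # A A
```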

**ORF**
    
> - During the translation process:
    - Parsing of the mRNA sequence by the ribosome may start at different nucleotides. 
    - Given that a codon is composed of three nucleotides, the mRNA sequence may have <b>3 possible interpretations</b>. 
    - These three ways of parsing the sequence are called <b>reading frames</b>.
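
The three interpretations can be seen by splitting a short mRNA string into triplets from each starting offset. A minimal sketch; `reading_frames` is a hypothetical helper name.

```python
def reading_frames(seq: str) -> list:
    """Split seq into complete codons for each of the three forward offsets."""
    frames = []
    for offset in range(3):
        # stop at len(seq) - 2 so that only full triplets are kept
        frame = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        frames.append(frame)
    return frames

for offset, frame in enumerate(reading_frames("AUGGCCUAA")):
    print(offset, frame)
# 0 ['AUG', 'GCC', 'UAA']
# 1 ['UGG', 'CCU']
# 2 ['GGC', 'CUA']
```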

### **<span style='color:#FF5733'>1.4</span> | CUSTOM SEQUENCE CLASS**
- Whilst there are wonderful libraries like <b>BioPython</b>, it's often quite beneficial to understand the workings of the code & to expand on it if the need arises. Whilst this is more time consuming, it's definitely more interesting.
- In this notebook, emphasis will be placed on replicating something similar to two BioPython modules.
> from Bio import SeqIO  ( Class for reading sequences ) <br>
> from Bio.Seq import Seq ( Class for Sequence Operations )

### **<span style='color:#FF5733'>1.5</span> | USEFUL REFERENCE**
- [BioPython GenBank](https://github.com/biopython/biopython/tree/master/Tests/GenBank) ( Small collection of files containing readable sequences )
- [Sequence Similarity Searching](https://www.ebi.ac.uk/Tools/sss/) ( Sequence related database )

<div style="color:white;
       display:fill;
       border-radius:5px;
       background-color:#FF5733;
       font-size:220%;
       font-family:Nexa;
       letter-spacing:0.5px">
    <p style="padding: 20px;
          color:white;">
        <b>2 |</b> WORKING WITH SEQUENCES
    </p>
</div>

### **<span style='color:#FF5733'>2.1</span> | MAPPING DICTIONARIES**
- Similar to BioPython's <b>CodonTable</b> as shown in [Biopython for Bioinformatics Basics](https://www.kaggle.com/shtrausslearning/biopython-for-bioinformatics-basics), a dictionary containing mapping data between different combinations of mRNA nucleotides (codons) and <b>amino acids</b> is obtainable in <b>dic_map</b> by setting <b>map_id</b> to *codon*. 
- Similarly, mappings from the IUPAC amino acid and nucleotide symbols to their names can be obtained by using <b>map_id = 'iupac_amino'</b> & <b>map_id = 'iupac_nucleotide'</b> respectively.

In [2]:
# Mapping Dictionary
def dic_map(map_id='codon',tid=None):
    
    # Codon / Amino Acid Conversion
    if map_id == 'codon':
        tc = {"GCT":"A", "GCC":"A", "GCA":"A", "GCG":"A",  # Alanine
              "TGT":"C", "TGC":"C",    # Cysteine
              "GAT":"D", "GAC":"D",    # Aspartic Acid 
              "GAA":"E", "GAG":"E",    # Glutamic Acid
              "TTT":"F", "TTC":"F",    # Phenylalanine
              "GGT":"G", "GGC":"G", "GGA":"G", "GGG":"G",  # Glycine
              "CAT":"H", "CAC":"H",            # Histidine
              "ATA":"I", "ATT":"I", "ATC":"I",   # Isoleucine
              "AAA":"K", "AAG":"K",              # Lysine
              "TTA":"L", "TTG":"L", "CTT":"L",   # Leucine
              "CTC":"L", "CTA":"L", "CTG":"L",   # Leucine
              "ATG":"M",                         # Start Codon | Methionine 
              "AAT":"N", "AAC":"N",             # Asparagine
              "CCT":"P", "CCC":"P", "CCA":"P", "CCG":"P",  # Proline
              "CAA":"Q", "CAG":"Q",              # Glutamine
              "CGT":"R", "CGC":"R", "CGA":"R",   # Arginine
              "CGG":"R", "AGA":"R", "AGG":"R",   # Arginine
              "TCT":"S", "TCC":"S", "TCA":"S",   # Serine
              "TCG":"S", "AGT":"S", "AGC":"S",   # Serine
              "ACT":"T", "ACC":"T", "ACA":"T", "ACG":"T",  # Threonine
              "GTT":"V", "GTC":"V", "GTA":"V", "GTG":"V",  # Valine
              "TGG":"W",                        # Tryptophan
              "TAT":"Y", "TAC":"Y",             # Tyrosine
              "TAA":"_", "TAG":"_", "TGA":"_" # Stop Codons | Gap
            }
    
    # IUPAC Amino Acids
    elif map_id == 'iupac_amino':
        tc   = {'A':'Alanine','C':'Cysteine','D':'Aspartic Acid','E':'Glutamic Acid',
                'F':'Phenylalanine','G':'Glycine','H':'Histidine','I':'Isoleucine',
                'K':'Lysine','L':'Leucine','M':'Methionine','N':'Asparagine',
                'P':'Proline','Q':'Glutamine','R':'Arginine','S':'Serine',
                'T':'Threonine','V':'Valine','W':'Tryptophan','Y':'Tyrosine',
                '_':'Gap'}
       
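    # IUPAC nucleotides
    elif map_id == 'iupac_nucleotide':
        tc  = {'A':'Adenine','C':'Cytosine',
               'G':'Guanine','T':'Thymine',
               'U':'Uracil'}

    # Unknown map_id: fall back to an empty mapping instead of raising a NameError
    else:
        tc = {}

    # dict.get returns None when tid is not a key
    return tc.get(tid)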

### **<span style='color:#FF5733'>2.2</span> | MAIN SEQUENCE CLASS, SQ()**
- We define a custom class, <b>SQ</b>, which stores <b>information about the sequence</b> & contains basic sequence related operations.
- The <b>sequence class (SQ)</b> can be coupled with other classes that incorporate more sophisticated sequence based operations; eg. <b>sequence alignment</b>, and is passed on as a reference to other classes.

In [3]:
import plotly.express as px
import plotly.graph_objects as go
from collections import Counter

def dict_sum(dictlist):
  outdic = {}
  for d in dictlist:
    for k in d.keys():
      outdic[k] = 0
  for d in dictlist:
    for k in d.keys():
      outdic[k]+=d[k]
  return outdic

# Class for Sequence Operations 
class SQ: 
    
    '''Constructor'''
    def __init__ (self, seq=None, seq_type = "DNA"): 
        self.seq = seq.upper()
        self.seq_type = seq_type
          
    # class instance operations
    def __len__(self):
        return len(self.seq)
    def __getitem__(self, n):
        return self.seq[n]
    def __getslice__(self, i, j):
        return self.seq[i:j]
    def __str__(self):
        return self.seq
    
    '''Get frequency of all sequence symbols'''
    def freq(self,compare=None,show_id='perc',fheight=None,fwidth=None):
        
        if(compare is not None):
            if(self.seq_type != compare.seq_type):
                print('sequences are not of same type')
                return None
            
        c1 = dict(Counter(self.seq))  # abc counter for s1
        if(compare is not None):
            c2 = dict(Counter(compare))  # abc counter for s2
            
        abc = list(self.abc())
        count = Counter(abc)
        abc_c = dict(Counter({x:0 for x in count}))
        
        c_all1 = dict_sum([c1,abc_c])
        if(compare is not None):
            c_all2 = dict_sum([c2,abc_c])    

        lst = []
        for i in c_all1.keys():
           if(self.seq_type == 'DNA' or self.seq_type == 'mRNA'):
               lst.append(dic_map('iupac_nucleotide',i))
           elif(self.seq_type == 'PROTEIN'):
               lst.append(dic_map('iupac_amino',i))
                
        if(compare is not None):
            lst2 = []
            for i in c_all2.keys():
               if(self.seq_type == 'DNA' or self.seq_type == 'mRNA'):
                   lst2.append(dic_map('iupac_nucleotide',i))
               elif(self.seq_type == 'PROTEIN'):
                   lst2.append(dic_map('iupac_amino',i))
          
        perc = [round(x / len(self.seq),3) for x in [*c_all1.values()]]
        if(show_id == 'perc'):
            show1 = lst; show2 = perc
        elif(show_id == 'count'):
            show1 = lst; show2 = [*c_all1.values()]
        fig = go.Figure(go.Bar(y=show1,x=show2,
                               marker_color='rgb(26, 118, 255)',
                               orientation='h',text=show2,
                               textposition='outside',name='SEQ1'))
        if(compare is not None):
            perc = [round(x / len(compare),3) for x in [*c_all2.values()]]
            if(show_id == 'perc'):
                show1 = lst2; show2 = perc
            elif(show_id == 'count'):
                show1 = lst2; show2 = [*c_all2.values()]
            fig.add_trace(go.Bar(y=show1,x=show2,marker_color='rgb(55, 83, 109)',
                                 orientation='h',text=show2,
                                 textposition='outside',name='SEQ2'))
        fig.update_layout(template='plotly_white',height=fheight,width=fwidth,
                         title=f'<b>{self.seq_type} SEQUENCE CONTENT</b>',
                         font=dict(family='sans-serif',size=12),
                         margin=dict(l=40, r=40, t=50, b=10))
        fig.show()

    '''Return percentage of G & C Nucleotides in sequence'''
    def gc(self):
        if (self.seq_type == "DNA" or self.seq_type == "mRNA"):
            ii = 0
            for s in self.seq:
                if(s in "GCgc"):
                    ii += 1
            return round(ii / len(self.seq),4)
        else:
            return None
        
    '''General sequence information'''
    def get_seq_biotype (self):
        return self.seq_type
    def info(self):
        print (f"SEQ: {self.seq}" +" "+ f"TYPE: {self.seq_type}")
        
    '''Get the alphabet of the sequence type'''
    def abc(self):
        if(self.seq_type=="DNA"): 
          return "ACGT"
        elif(self.seq_type=="mRNA"):
          return "ACGU"
        elif (self.seq_type=="PROTEIN"): 
          return "ACDEFGHIKLMNPQRSTVWY"
        else: 
          return None
        
    '''Check if sequence alphabet match dictionary'''
    # calls abc function 
    def validate(self,verbose=False):
        alp = self.abc()
        res = True; i = 0
        while (res and i < len(self.seq)):
            if self.seq[i] not in alp: 
                res = False
            else: i += 1
        if(res):
            if(verbose):
                print(f'{self.seq_type} is valid')
            return res
        else:
            if(verbose):
                print(f'{self.seq_type} is invalid')
            return res
        
    '''Transcription DNA -> RNA'''
    def transcription(self):
        if (self.seq_type == "DNA"):
            return SQ(self.seq.replace("T","U"), "mRNA")
        else:
            return None
    
    '''Reverse Complement of DNA -> DNA'''
    def reverse_comp(self):
        
        if (self.seq_type != "DNA"): 
            print('input not DNA')
            return None
    
        lst_seq = ['A','T','G','C']
        lst_comp = ['T','A','C','G']
            
        comp = ''
        for char in self.seq:
            ii=-1
            for c in lst_seq:
                ii+=1
                if(char == c ):
                    comp = lst_comp[ii] + comp
            
        return SQ(comp, "DNA")
        
    ''' Translates a DNA to aminoacid sequence '''
    # using defined dictionary mapping
    @staticmethod
    def translate(seq,p0=0):
        seq_aa = ""
        for pos in range(p0,len(seq)-2,3):
            cod = seq[pos:pos+3]
            seq_aa += dic_map(map_id='codon',tid=cod)
        return seq_aa
    
    '''Get All Possible open reading frames (ORF)'''
    # store all possible collections of amino acid groups 
    # in all 6 frames
    def frames(self):
        res = []
        for i in range(0,3):
            res.append(self.translate(self.seq,i))
        rc = self.reverse_comp()
        for i in range(0,3):
            res.append(self.translate(rc,i)) 
        return res
    
    '''Computes all possible proteins in an amino acid sequence in reading frame '''
    # using the knowledge that it starts with M and ends with _, 
    # filter out rule breaking ORFs
    @staticmethod
    def all_proteins_RF(aa_seq):
        # aa_seq -> converted ORF
        current_prot = []
        proteins = []
        for aa in aa_seq:
            if(aa == "_"):
                if current_prot:
                    for p in current_prot:
                        proteins.append(p)
                    current_prot = []
            else:
                if(aa == "M"):
                    current_prot.append("")
                for i in range(len(current_prot)):
                    current_prot[i] += aa
        return proteins
    
    '''Computes all possible proteins for all ORF'''
    # and sort them based on size
    def ORF_protein(self, mins = 0):
        
        # order 
        def insert_prot_ord (prot, list_prots):
            i = 0
            while i < len(list_prots) and len(prot) < len(list_prots[i]):        
                i += 1
            list_prots.insert(i, prot)
        
        rfs = self.frames()  # get all ORF conversions
        res = []
        for rf in rfs:
            prots = self.all_proteins_RF(rf) # return only protein cases
            # additionally sort based on protein size
            for p in prots: 
                if len(p) > mins: 
                    insert_prot_ord(p, res)
        return res

In [4]:
# Define sequences in string format
seq1 = 'ATGACGGATCAGCCGCAAGCGGAATTGGCGTTTACGTACGATGCGCCGTAA'
seq2 = 'TAATATGTTTTCGTTCATGCAGAGAGATTAAGGGTGTCTAATGAAGAAAAGTTCTATTGTGGCAACCATTATAACTATT'
seqaa1 = 'MMMELQHQRLMALAGQLQLESLISAAPALSQQAVDQEWSYMDFLEHLLHEEKLARHQRKQAMYTRMAAFPAVKTFEEYDFTF'
seqaa2 = 'ATGAPQKQLQSLRSLSFIERNENIVLLGPSGVGKTHLAIAMGYEAVRAGIKVRFTTAADLLLQLSTAQRQGRYKTTLQRGVMAPRLLI'

# Define Two DNA sequences (instances)
sq1 = SQ(seq1,'DNA')
sq2 = SQ(seq2,'DNA')
# define new protein sequence 
sq3 = SQ(seqaa1,'PROTEIN')
sq4 = SQ(seqaa2,'PROTEIN')

# Show class variables
print(f'Sequence: {sq1.seq} | Sequence Type: {sq1.seq_type}')
print(f'Sequence: {sq2.seq} | Sequence Type: {sq2.seq_type}')
print(f'Sequence: {sq3.seq} | Sequence Type: {sq3.seq_type}')
print(f'Sequence: {sq4.seq} | Sequence Type: {sq4.seq_type}')

Sequence: ATGACGGATCAGCCGCAAGCGGAATTGGCGTTTACGTACGATGCGCCGTAA | Sequence Type: DNA
Sequence: TAATATGTTTTCGTTCATGCAGAGAGATTAAGGGTGTCTAATGAAGAAAAGTTCTATTGTGGCAACCATTATAACTATT | Sequence Type: DNA
Sequence: MMMELQHQRLMALAGQLQLESLISAAPALSQQAVDQEWSYMDFLEHLLHEEKLARHQRKQAMYTRMAAFPAVKTFEEYDFTF | Sequence Type: PROTEIN
Sequence: ATGAPQKQLQSLRSLSFIERNENIVLLGPSGVGKTHLAIAMGYEAVRAGIKVRFTTAADLLLQLSTAQRQGRYKTTLQRGVMAPRLLI | Sequence Type: PROTEIN


<div style="color:white;
       display:fill;
       border-radius:5px;
       background-color:#FF5733;
       font-size:220%;
       font-family:Nexa;
       letter-spacing:0.5px">
    <p style="padding: 20px;
          color:white;">
        <b>3 |</b> SEQUENCE CLASS USAGE EXAMPLES
    </p>
</div>

### **<span style='color:#FF5733'>3.1</span> | NUCLEOTIDE/AMINO ACID FREQUENCY**
- Having defined all the <b>IUPAC nucleotides</b> & <b>amino acids</b> (from [TABLE](https://www.bioinformatics.org/sms/iupac.html)) in the dictionary <code>dic_map</code>, we may want to obtain the count of each nucleotide or amino acid in a given sequence. 
- We can use the <code>.freq()</code> function to show the count/percentage of each symbol of the alphabet, or compare two sequences by passing a second one via <b>compare</b>, eg. <code>freq(compare=None,show_id='perc')</code>. <b>show_id</b> switches between <b>percentage (perc)</b> & <b>number count (count)</b>.
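
Without the plotting, the counting step can be sketched as below. `symbol_frequency` is a hypothetical standalone helper, not a method of the class; it assumes a non-empty sequence and reports every symbol of the alphabet, including those that never occur.

```python
from collections import Counter

def symbol_frequency(seq: str, alphabet: str) -> dict:
    """Map each alphabet symbol to (count, fraction of the sequence)."""
    counts = Counter(seq.upper())
    total = len(seq)  # assumed > 0
    # Counter returns 0 for missing symbols, so absent letters are still listed
    return {s: (counts[s], round(counts[s] / total, 3)) for s in alphabet}

freqs = symbol_frequency("ATGACGGATC", "ACGT")
print(freqs)
# {'A': (3, 0.3), 'C': (2, 0.2), 'G': (3, 0.3), 'T': (2, 0.2)}
```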

In [5]:
# compare content percentage (default)
sq1.freq(sq2,fheight=250)

In [6]:
# show the amino acid content of the protein
sq3.freq(show_id='count',fheight=500)

### **<span style='color:#FF5733'>3.2</span> | GC CONTENT**
From GC-CONTENT | [WIKIPEDIA](https://en.wikipedia.org/wiki/GC-content)

> In molecular biology and genetics, GC-content (or guanine-cytosine content) is the percentage of nitrogenous bases in a DNA or RNA molecule that are either <b>guanine</b> (G) or <b>cytosine</b> (C). This measure indicates the proportion of G and C bases out of an implied four total bases, also including adenine and thymine in DNA and adenine and uracil in RNA.

From GC-CONTENT | [SCIENCEDIRECT](https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/gc-content)
> GC content is strongly correlated with biological features of genome organization such as distribution of various classes of repeated elements, gene density, level and tissue-specificity of transcription, and mutation rate.
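
Following the definition quoted above, GC-content is the number of G and C symbols divided by the sequence length. A minimal standalone sketch (the function name `gc_content` is illustrative; it returns a fraction, so multiply by 100 for a percentage):

```python
def gc_content(seq: str) -> float:
    """Fraction of symbols in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("GC-content is undefined for an empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)

print(gc_content("ATGACGGATC"))  # 0.5
```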

In [7]:
print(f'GC-Content Sequence 1: {round(sq1.gc()*100,3)}%')
print(f'GC-Content Sequence 2: {sq2.gc()*100}%')

GC-Content Sequence 1: 54.9%
GC-Content Sequence 2: 31.65%


### **<span style='color:#FF5733'>3.3</span> | SEQUENCE VALIDITY**
- A sequence is valid if every symbol in it belongs to the alphabet of its sequence type. The class <b>SQ</b> uses the [IUPAC](https://www.bioinformatics.org/sms2/iupac.html) code standard.
- We can check whether a sequence is valid or not by using the <code>.validate()</code> function, returning a logical output (True/False).
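
The same check can be expressed as a set comparison. This is a sketch under the assumption that only the unambiguous symbols are allowed; `ALPHABETS` and `is_valid` are hypothetical names. Note that an empty sequence passes this check.

```python
# Allowed symbols per sequence type (unambiguous IUPAC codes only)
ALPHABETS = {
    "DNA": set("ACGT"),
    "mRNA": set("ACGU"),
    "PROTEIN": set("ACDEFGHIKLMNPQRSTVWY"),
}

def is_valid(seq: str, seq_type: str) -> bool:
    """True if every symbol of seq belongs to the alphabet of seq_type."""
    alphabet = ALPHABETS.get(seq_type)
    if alphabet is None:  # unknown sequence type
        return False
    return set(seq.upper()) <= alphabet

print(is_valid("ATGC", "DNA"))   # True
print(is_valid("AUGC", "DNA"))   # False, U is not a DNA symbol
print(is_valid("AUGC", "mRNA"))  # True
```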

In [8]:
print(f'sequence: {sq3.seq} is valid: {sq3.validate()}')

sequence: MMMELQHQRLMALAGQLQLESLISAAPALSQQAVDQEWSYMDFLEHLLHEEKLARHQRKQAMYTRMAAFPAVKTFEEYDFTF is valid: True


### **<span style='color:#FF5733'>3.4</span> | TRANSCRIPTION**
- The mRNA is created while the two complementary strands are separated, and it is complementary to one of the strands (the template), so it carries the same sequence as the other strand with <b>U</b> in place of <b>T</b>.
- Transcription of a <b>DNA sequence</b> can be obtained using the <code>.transcription()</code> function, where the <b>T</b> is replaced by <b>U</b> in the sequence string.
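
Replacing T with U on the given strand is equivalent to building the RNA complement of the opposite (template) strand. The sketch below checks this on a toy sequence; `transcribe_from_template` is a hypothetical helper and both strands are assumed to be written 5' to 3'.

```python
# DNA base -> complementary RNA base
DNA_TO_RNA = str.maketrans("ATGC", "UACG")

def transcribe_from_template(template: str) -> str:
    """Build the mRNA complementary to a template strand."""
    return template.upper().translate(DNA_TO_RNA)[::-1]

coding = "ATGACG"
template = "CGTCAT"  # reverse complement of the coding strand

mrna = transcribe_from_template(template)
print(mrna)                              # AUGACG
print(mrna == coding.replace("T", "U"))  # True
```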

In [9]:
mrna_sq2 = sq2.transcription()
print(f'mRNA sequence: {mrna_sq2.seq}')

mRNA sequence: UAAUAUGUUUUCGUUCAUGCAGAGAGAUUAAGGGUGUCUAAUGAAGAAAAGUUCUAUUGUGGCAACCAUUAUAACUAUU


### **<span style='color:#FF5733'>3.5</span> | REVERSE COMPLEMENT**
- DNA has <b>two complementary strands</b>. 
- Due to the complementarity of the DNA strands, usually only one of the strands is provided in a sequence file obtained from databases.
- The second strand to the input <b>DNA sequence</b> can be obtained by calling the <code>.reverse_comp()</code> function.

In [10]:
rev_comp_sq1 = sq1.reverse_comp()
rev_comp_sq1.info()

SEQ: TTACGGCGCATCGTACGTAAACGCCAATTCCGCTTGCGGCTGATCCGTCAT TYPE: DNA


### **<span style='color:#FF5733'>3.6</span> | TRANSLATION**
#### <b><span style='color:#FF5733'>OVERVIEW</span></b>
- The background to what happens during the translation process was already outlined in <code>1.3</code>, however it makes sense to expand on theory a little more.
- Proteins are synthesised by creating <b>chains of aminoacids</b>, according to information contained in the <b>messenger RNA (mRNA)</b> in a process called <b>translation</b>.

#### <b><span style='color:#FF5733'>START & STOP CODONS</span></b>
- <b>Translation</b> of a protein always begins with a specific codon <b>ATG</b> -> <b>M (Methionine)</b>, which is always the first amino acid in the protein.
- The <b>translation</b> process terminates when a stop codon is found; <b>TAA</b>,<b>TAG</b>,<b>TGA</b> -> <code>_</code>.
- An example sequence where we know exactly where the <b>start</b> and <b>termination codons</b> are:
    - <b>ATG</b><code>ACGGATCAGCCGCAAGCGGAATTGGCGTTTACGTACGATGCGCCG</code><b>TAA</b>
    - We can note it starts with <b>ATG</b> & ends with <b>TAA</b>, one of the three stop codons.
- As the sequence follows the rule of start and ending codon, in such a case we can use the <b>staticmethod</b> defined in SQ; <code>.translate(seq,p0=0)</code> directly.
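
As a standalone illustration of this step, the sketch below translates the example sequence codon by codon. The compact table construction and the use of `X` for codons that are not in the table are assumptions made for this sketch, not the notebook's own implementation.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('_' = stop)
BASES = "TCAG"
AMINO = "FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna: str, p0: int = 0) -> str:
    """Translate complete codons of dna, starting at offset p0."""
    protein = []
    for pos in range(p0, len(dna) - 2, 3):
        # 'X' marks a codon that is not in the table (e.g. ambiguous bases)
        protein.append(CODON_TABLE.get(dna[pos:pos + 3], "X"))
    return "".join(protein)

seq = "ATGACGGATCAGCCGCAAGCGGAATTGGCGTTTACGTACGATGCGCCGTAA"
print(translate(seq))  # MTDQPQAELAFTYDAP_
```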

#### <b><span style='color:#FF5733'>OPEN READING FRAME</span></b>
- A <b>reading frame</b> is a way of dividing the DNA sequence into a set of consecutive, <b>non-overlapping triplet nucleotides</b> (possible codons), which can then be mapped to amino acids using the codon dictionary.
- A given sequence has 3 possible reading frames, starting at the first, second and third nucleotide positions. In addition, considering there is another complementary strand, we should also compute the 3 frames corresponding to the reverse complement.
- In many cases, given a DNA sequence, <b>we don't know in advance where the coding regions are</b>, especially when dealing with complete sequences.
- In such cases, we need to <b>scan the DNA sequence for the coding region</b>. First, we need to divide and compute these reading frames (6 in total). The <code>frames</code> function stores the 6 converted amino acid strings in a list.

<b>OPEN READING FRAME</b> | [genome.gov](https://www.genome.gov/genetics-glossary/Open-Reading-Frame)

> An open reading frame is a portion of a DNA molecule that, when translated into amino acids, contains no stop codons. The genetic code reads DNA sequences in groups of three base pairs, which means that a double-stranded DNA molecule can read in any of six possible reading frames--three in the forward direction and three in the reverse. A long open reading frame is likely part of a gene.

<center>

| Open Reading Frames (ORF)[[1]](https://www.genome.gov/genetics-glossary/Open-Reading-Frame) |
| - |
|<img src="https://www.genome.gov/sites/default/files/tg/en/illustration/open_reading_frame.jpg" alt="Drawing" style="width:700px;"/> |
    
</center>

#### <b><span style='color:#FF5733'>IDEAL CASE (P=0)</span></b>
- Let's try one of the reading frames at the start of the sequence (<b>p0=0</b>), which we know follows the correct rules observed in life.
- Let's also consider all other possible ORFS & get the final protein found in the DNA sequence.
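
Once a frame has been translated, candidate proteins are the stretches that start at an <b>M</b> and run up to the next stop symbol <code>_</code>. One way to sketch that filter is with a regular expression; this is an alternative illustration with a hypothetical function name, not the loop used in the class. The lookahead lets nested start codons produce their own (shorter) candidates.

```python
import re

def proteins_in_frame(aa_seq: str) -> list:
    """Every stretch that starts at an M and is closed by a stop ('_')."""
    # (?=...) is a zero-width lookahead, so overlapping candidates are all found
    return [m.group(1) for m in re.finditer(r"(?=(M[^_]*)_)", aa_seq)]

print(proteins_in_frame("MTDQPQAELAFTYDAP_"))  # ['MTDQPQAELAFTYDAP']
print(proteins_in_frame("LRMKMA_GG"))          # ['MKMA', 'MA']
print(proteins_in_frame("DGSAASGIGVYVRCAV"))   # []
```

A stretch that begins with M but never reaches a stop symbol is not reported.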

In [11]:
# Sequence with initial codon coinciding with start codon &
# end w/ end codon (ATG & TAA respectively)
seq = 'ATGACGGATCAGCCGCAAGCGGAATTGGCGTTTACGTACGATGCGCCGTAA'
strand = SQ(seq=seq,seq_type='DNA')
print(f'Correct ORF: {strand.translate(strand.seq,p0=0)}')
print(f'All ORF:')
lst_ORF = strand.frames()
lstcol(lst_ORF)

Correct ORF: MTDQPQAELAFTYDAP_
All ORF:
MTDQPQAELAFTYDAP_    DGSAASGIGVYVRCAV     YGASYVNANSACG_SV     
_RISRKRNWRLRTMRR     LRRIVRKRQFRLRLIRH    TAHRT_TPIPLAADPS     


In [12]:
# Only one of the ORFs meets the requirement, so only one 
# protein is found 
strand.ORF_protein()

['MTDQPQAELAFTYDAP']

<div style="color:white;
       display:fill;
       border-radius:5px;
       background-color:#FF5733;
       font-size:220%;
       font-family:Nexa;
       letter-spacing:0.5px">
    <p style="padding: 20px;
          color:white;">
        <b>4 |</b> LOADING FILES CONTAINING SEQUENCES
    </p>
</div>

### <b><span style='color:#FF5733'>4.1</span> | REAL SEQUENCES</b>
- It's quite straightforward to work with very short sequences, which can simply be typed in, eg. via the python <code>input()</code> function.
- Most realistic applications of bioinformatics involve sequences that are <b>too big to type out</b> & thus have to be read from files.
- On top of that, databases already store specific formats which contain information about a sequence and the sequence itself; one such format is <b>FASTA</b>.

### <b><span style='color:#FF5733'>4.2</span> | THE FASTA FORMAT</b>

Snippet from [FASTA format](https://en.wikipedia.org/wiki/FASTA_format):
>In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either <b>nucleotide sequences</b> or <b>amino acid (protein) sequences</b>, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences. The format originates from the FASTA software package, but has now become a near universal standard in the field of bioinformatics.

#### <b><span style='color:#FF5733'>OVERVIEW</span></b>
- The format is commonly used to store nucleotide or protein sequences. It is less rigorous than other formats, usually containing only the <b>sequence</b> and a <b>name/header</b>.
- The file extension typically changes with the <b>sequence type</b> stored in the file; common extensions are defined below in the dictionary <code>FASTA_dic</code>.
- A file can contain any number of sequences, each starting with a header line that begins with the symbol <b>></b>.
- The <b>name/header</b>, defined after the symbol <b>></b>, contains an origin identifier (<b>NCBI identifiers</b>), listed in the dictionary <code>identifiers_dic</code>.

> The NCBI defined a standard for the unique identifier used for the sequence (SeqID) in the header line. This allows a sequence that was obtained from a database to be labelled with a reference to its database record. The database identifier format is understood by the NCBI tools like makeblastdb and table2asn. The following list describes the NCBI FASTA defined format for sequence identifiers

In [13]:
# NCBI identifiers
identifiers_dic = {'lcl':'local(nodb)','bbs':'GenInfo backbone seqid',
                   'bbm':'GenInfo backbone moltype','gim':'GenInfo import ID',
                   'gb':'GenBank','emb':'EMBL','pir':'PIR','sp':'SWISS-PROT',
                   'pat':'patent','pgp':'pre-grant patent','ref':'RefSeq',
                   'gnl':'general database reference','prf':'PRF','pdb':'PDB',
                   'gi':'GenInfo integrated database','dbj':'DDBJ'}

# FASTA formats
FASTA_dic = {'fa':'generic','fasta':'generic','fna':'nucleic acid',
             'ffn':'nucleotide of gene regions','faa':'amino acid',
             'frn':'non-coding RNA'}
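
A short sketch of how these two dictionaries might be used: one lookup maps a file extension to its FASTA flavour, the other picks the database prefixes out of a `|`-separated header line. The helper names `fasta_kind` and `header_sources` are hypothetical, the dictionaries are trimmed copies of the ones above, and the header is made up for illustration.

```python
import os

# Trimmed copies of the two dictionaries defined above
identifiers_dic = {'gb': 'GenBank', 'ref': 'RefSeq', 'sp': 'SWISS-PROT',
                   'gi': 'GenInfo integrated database'}
FASTA_dic = {'fa': 'generic', 'fasta': 'generic', 'fna': 'nucleic acid',
             'faa': 'amino acid'}

def fasta_kind(filename):
    """Map a file extension to its FASTA flavour, or None if unknown."""
    ext = os.path.splitext(filename)[1].lstrip('.').lower()
    return FASTA_dic.get(ext)

def header_sources(header):
    """List the database names whose NCBI prefixes appear in a header line."""
    fields = header.lstrip('>').split('|')
    return [identifiers_dic[f] for f in fields if f in identifiers_dic]

# Made-up header, for illustration only
header = '>gi|0000000|ref|XX_000000.1| example record'
print(fasta_kind('NC_005816.faa'))   # amino acid
print(header_sources(header))        # ['GenInfo integrated database', 'RefSeq']
```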

### **<span style='color:#FF5733'>4.3</span> | READ SEQUENCE CLASS, read_seq()**
- Class <b>read_seq</b> defines a function, <code>read_FASTA</code>, that can be used to read <b>FASTA</b> based formats and is integrated into the previously defined general purpose sequence class.
- More sophisticated formats can also be incorporated into the current class. [BioPython GenBank Dataset](https://www.kaggle.com/shtrausslearning/biopython-genbank) contains other formats as well.
- The class should automatically detect the file format from the file extension and call the corresponding reader, storing the sequences upon instantiation.
- Since the constructor can't return anything upon instantiation, <b>get_sq</b> is used to return a SQ class instance for each of the read sequences in the file.

In [14]:
# Class to read different files and store info only
class read_seq(SQ):
    
    def __init__(self,name):
        self.name = name
        self.format = name.split('.')[-1]  # extension after the last dot
        if(self.format in FASTA_dic):      # if one of the fasta formats
            self.read_FASTA(self.name)

    # read FASTA format
    def read_FASTA(self,filename):

        tseq = None; self.lst_seq = []     # list of sequences
        thead = None; self.lst_header = [] # list of sequence identifications
        ff = FASTA_dic[filename.split('.')[-1]]

        # the context manager closes the file even on an early return
        with open(filename,'r') as file:
            for line in file:
                if(line.startswith('>')):  # header line starts a new record
                    if(tseq is not None and thead is not None and tseq != ""):
                        self.lst_seq.append(tseq)
                    thead = line; self.lst_header.append(line)
                    tseq = ""
                else:
                    if(tseq is None):      # sequence data before any header
                        return None
                    else:
                        tseq += sub(r"\s","",line)

        if(tseq is not None and thead is not None and tseq != ""):
            self.lst_seq.append(tseq)

        print(f'READ -> FASTA [{ff}] | #SEQ: {len(self.lst_seq)}')
        
    # get read sequences
    def get_sq(self):
        lst_types = ['DNA','mRNA','PROTEIN']   # checked in this order
        lst_out = []
        if(len(self.lst_seq) > 1):
            for seq in self.lst_seq:
                for check in lst_types:
                    if(SQ(seq,check).validate()):
                        lst_out.append(SQ(seq,check))
                        break   # keep only the first matching type
            return lst_out
        else:
            for check in lst_types:
                if(SQ(self.lst_seq[0],check).validate()): # if valid sq
                    return SQ(self.lst_seq[0],check)
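
The parsing idea in `read_FASTA` can also be tried without any file on disk. The following is a minimal standalone sketch, not a replacement for the class above: `parse_fasta` is a hypothetical helper that reads from any iterable of lines, and the two records are toy data.

```python
import io

def parse_fasta(handle):
    """Return a list of (header, sequence) tuples from an iterable of lines."""
    records = []
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue                       # skip blank lines
        if line.startswith('>'):           # a header closes the previous record
            if header is not None:
                records.append((header, ''.join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:           # sequence lines may span many rows
            chunks.append(line)
    if header is not None:                 # flush the last record
        records.append((header, ''.join(chunks)))
    return records

# Toy FASTA text with two records, the first split over two lines
sample = ">seq1 first toy record\nATGACG\nGATCAG\n>seq2 second toy record\nMTDQ\n"
records = parse_fasta(io.StringIO(sample))
for header, sequence in records:
    print(header, '->', sequence)
```

Passing an open file object instead of `io.StringIO` works the same way, since both yield lines when iterated.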

- The fetched (validated & set) sequences are returned by <code>get_sq</code>: a file containing a <b>set of sequences</b> yields a <b>list</b> of <b>SQ</b> instances, one per sequence, whilst a FASTA file containing only a <b>single sequence</b> yields a single <b>SQ</b> instance.
- Two examples are shown below, reading files <code>file_faa</code> & <code>file_fna</code>, containing a <b>set of protein sequences</b> and a <b>single nucleotide sequence</b> respectively.

In [15]:
# define pathway to FASTA file
file_faa = '/kaggle/input/biopython-genbank/NC_005816.faa'
file_fna = '/kaggle/input/biopython-genbank/NC_005816.fna'

# fetch sequence from file and store each in sequence class, SQ
col_seq_aa = read_seq(file_faa).get_sq()
col_seq_n = read_seq(file_fna).get_sq()

READ -> FASTA [amino acid] | #SEQ: 10
READ -> FASTA [nucleic acid] | #SEQ: 1


In [16]:
print(f'sequence: {col_seq_aa[1].seq}')
print(f'sequence type: {col_seq_aa[1].seq_type}')

sequence: MMMELQHQRLMALAGQLQLESLISAAPALSQQAVDQEWSYMDFLEHLLHEEKLARHQRKQAMYTRMAAFPAVKTFEEYDFTFATGAPQKQLQSLRSLSFIERNENIVLLGPSGVGKTHLAIAMGYEAVRAGIKVRFTTAADLLLQLSTAQRQGRYKTTLQRGVMAPRLLIIDEIGYLPFSQEEAKLFFQVIAKRYEKSAMILTSNLPFGQWDQTFAGDAALTSAMLDRILHHSHVVQIKGESYRLRQKRKAGVIAEANPE
sequence type: PROTEIN


In [17]:
print(f'Number of Sequences stored: {len(col_seq_aa)}'); 
print(f'List of Sequences Type: {type(col_seq_aa)}')
print(f'List Content Type: {type(col_seq_aa[0])}')
print(f'Single Fetched Sequence Type: {type(col_seq_n)}')

Number of Sequences stored: 10
List of Sequences Type: <class 'list'>
List Content Type: <class '__main__.SQ'>
Single Fetched Sequence Type: <class '__main__.SQ'>
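
The sequence types reported above are decided by trying `'DNA'`, `'mRNA'` and `'PROTEIN'` in turn. A hedged sketch of how such a check could work is given below, assuming validation amounts to an alphabet membership test; `SQ.validate()` is defined earlier in the notebook and may differ, and `ALPHABETS` and `detect_type` are illustrative names.

```python
# Assumed alphabets; '_' is allowed in proteins as the stop symbol
ALPHABETS = {
    'DNA': set('ACGT'),
    'mRNA': set('ACGU'),
    'PROTEIN': set('ACDEFGHIKLMNPQRSTVWY_'),
}

def detect_type(seq):
    """Return the first type whose alphabet covers every symbol in seq."""
    if not seq:
        return None
    symbols = set(seq.upper())
    for seq_type, alphabet in ALPHABETS.items():   # insertion order: DNA first
        if symbols <= alphabet:
            return seq_type
    return None

print(detect_type('ATGACGGATCAG'))   # DNA
print(detect_type('AUGACGGAUCAG'))   # mRNA
print(detect_type('MTDQPQAELAFT'))   # PROTEIN
```

The order of the checks matters: A, C, G and T are also valid amino acid letters, so a DNA string would pass a protein check as well if that check came first.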


In [18]:
# Select only a subset as the strand is a little too big (arbitrarily selected)
col_seq_subset = SQ(col_seq_n.seq[100:500],col_seq_n.seq_type)
# get all proteins with a length of more than 1
proteins = col_seq_subset.ORF_protein(mins=1)

# list all proteins found (length greater than 1)
print('Proteins in Sequence col_seq_subset:')
lstcol(proteins,2)

Proteins in Sequence col_seq_subset:
MNALRMVIPPRYPWSLISRAITVAGIL    MLILTKSRQR                     
MVIPPRYPWSLISRAITVAGIL         MP                             
MNTGIIFVNASPMLILTKSRQR         MR                             
MSIGDALTNIIPVFIQE                                             


<div style="color:white;
       display:fill;
       border-radius:5px;
       background-color:#FF5733;
       font-size:220%;
       font-family:Nexa;
       letter-spacing:0.5px">
    <p style="padding: 20px;
          color:white;">
        <b>5 |</b> COVID-19: PROTEINS IDENTIFICATION
    </p>
</div>

### <b><span style='color:#FF5733'>5.1</span> | REAL WORLD EXAMPLE : PROTEIN IDENTIFICATION</b>
- Having defined a class that can <b>read</b>, <b>store</b> & <b>derive proteins</b> from a sequence file, we can use a database to obtain a real sequence, which should be quite familiar. 
- Let's use the Coronavirus sequence, already uploaded to [Kaggle](https://www.kaggle.com/paultimothymooney/coronavirus-genome-sequence) & [original source](https://www.ncbi.nlm.nih.gov/nuccore/NC_045512).

#### <b><span style='color:#FF5733'>BLAST</span></b>

- Our goal here will be to (1) get the proteins encoded in the genome & (2) learn something about them. 
- To do this, we'll be using a very useful resource that uses [BLAST (Basic Local Alignment Search Tool)](https://blast.ncbi.nlm.nih.gov/Blast.cgi). 
- We will look into <b>Local Sequence Alignment</b> in another notebook; for the time being, it's enough to know that a protein sequence is compared to others already found in a database & each match is reported with a corresponding <b>similarity score</b>.
- To use the database, we'll first need to (1) get the proteins & (2) simply search for matching results in the database (we'll need to copy & paste our resultant protein sequences).

In [19]:
# Read FASTA format containing Covid Genome
virus_fna = '/kaggle/input/coronavirus-genome-sequence/MN908947.fna'
virus_n = read_seq(virus_fna) # read and store FNA data 
print(f'Sequence Header: {virus_n.lst_header[0]}')
# length of the header line (incl. trailing newline), not of the genome
print(f'Header Length: {len(virus_n.lst_header[0])}')

READ -> FASTA [nucleic acid] | #SEQ: 1
Sequence Header: >MN908947.3 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome

Header Length: 96


In [20]:
# Store the sequence, defining a SQ class instance
virus_sq = virus_n.get_sq()
print(f'Sequence length: {len(virus_sq)} nucleotides')

# The sequence is a bit too long, let's show the first 1000 characters
virus_sq[0:1000]

Sequence length: 29903 nucleotides


'ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTCGTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAGAACGGTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGATCCTTATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTGTTACCCGTGAACTCATGCGTGAGCTTAACGGAGGGGCATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTAGCACGTGCTGGTAAAGCTTCATGCACTTTGTCCGAACAACTGGACTTTATTGACACTAAGAGGGGTGTATACTGCTGCCGTGAACATGAGCATGAAATTGCTTGGTACACGGAACGTTC

In [21]:
# The virus contains a heavy portion of Thymine & Adenine (~62%)
virus_sq.freq(fheight=200)
print(f'GC-Content Sequence 1: {round(virus_sq.gc()*100,3)}%')

GC-Content Sequence 1: 37.97%
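
GC-content is the fraction of G and C bases in the sequence, so the A/T share is simply its complement (100% - 37.97% = 62.03% for the genome above). A minimal sketch of the calculation on a toy sequence follows; `gc_content` is an illustrative helper and not the notebook's `SQ.gc()` method.

```python
def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence (0.0 for empty input)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count('G') + seq.count('C')) / len(seq)

toy = 'ATGACGGATCAG'          # 4 G + 2 C out of 12 bases
gc = gc_content(toy)
print(f'GC: {gc:.1%} | AT: {1 - gc:.1%}')   # GC: 50.0% | AT: 50.0%
```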


In [22]:
# Count all amino acid chains found with the default minimum length (mins=1)
print(f'Amino Acid Chains: {len(virus_sq.ORF_protein())}')

Amino Acid Chains: 1208


### <b><span style='color:#FF5733'>5.2</span> | FUNCTIONAL PROTEINS & OLIGOPEPTIDES</b>
- To not complicate things too much, <code>Section 3.6</code> didn't mention that the <b>amino acid chains</b> obtained upon translation aren't all actually proteins.
- There are some extra criteria these chains must meet in order to be classified as proteins. <b>Functional proteins</b> are chains longer than <b>20 amino acids</b>.
- Smaller chains are called [oligopeptides](https://en.wikipedia.org/wiki/Oligopeptide) (2-20 amino acids) & have other functionalities.
- Let's select the longest amino acid chain found in the genome & search for it in a database, so we can identify it.
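
The length rule described above can be written down directly. The sketch below is only an illustration of that rule: `classify_chain` and its `protein_min` parameter are hypothetical names, and the example chains are taken from the Section 4 output.

```python
def classify_chain(chain, protein_min=20):
    """Label an amino acid chain by length, following the thresholds above."""
    n = len(chain)
    if n > protein_min:
        return 'functional protein'     # more than 20 amino acids
    if n >= 2:
        return 'oligopeptide'           # 2-20 amino acids
    return 'single amino acid'

chains = ['MNALRMVIPPRYPWSLISRAITVAGIL', 'MSIGDALTNIIPVFIQE', 'MP']
labels = [classify_chain(c) for c in chains]
for chain, label in zip(chains, labels):
    print(f'{len(chain):>3} aa -> {label}')
```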

In [23]:
# For convenience, let's select a subset above 50 amino acids
chains = virus_sq.ORF_protein(mins=50)   # compute once, reuse below
print(f'Amino Acid Chains: {len(chains)}')
print(f'Largest Amino Acid: {chains[0]}')
print(f'Length of Largest Amino Acid: {len(chains[0])}')

Amino Acid Chains: 243
Largest Amino Acid: MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHLKDGTCGLVEVEKGVLPQLEQPYVFIKRSDARTAPHGHVMVELVAELEGIQYGRSGETLGVLVPHVGEIPVAYRKVLLRKNGNKGAGGHSYGADLKSFDLGDELGTDPYEDFQENWNTKHSSGVTRELMRELNGGAYTRYVDNNFCGPDGYPLECIKDLLARAGKASCTLSEQLDFIDTKRGVYCCREHEHEIAWYTERSEKSYELQTPFEIKLAKKFDTFNGECPNFVFPLNSIIKTIQPRVEKKKLDGFMGRIRSVYPVASPNECNQMCLSTLMKCDHCGETSWQTGDFVKATCEFCGTENLTKEGATTCGYLPQNAVVKIYCPACHNSEVGPEHSLAEYHNESGLKTILRKGGRTIAFGGCVFSYVGCHNKCAYWVPRASANIGCNHTGVVGEGSEGLNDNLLEILQKEKVNINIVGDFKLNEEIAIILASFSASTSAFVETVKGLDYKAFKQIVESCGNFKVTKGKAKKGAWNIGEQKSILSPLYAFASEAARVVRSIFSRTLETAQNSVRVLQKAAITILDGISQYSLRLIDAMMFTSDLATNNLVVMAYITGGVVQLTSQWLTNIFGTVYEKLKPVLDWLEEKFKEGVEFLRDGWEIVKFISTCACEIVGGQIVTCAKEIKESVQTFFKLVNKFLALCADSIIIGGAKLKALNLGETFVTHSKGLYRKCVKSREETGLLMPLKAPKEIIFLEGETLPTEVLTEEVVLKTGDLQPLEQPTSEAVEAPLVGTPVCINGLMLLEIKDTEKYCALAPNMMVTNNTFTLKGGAPTKVTFGDDTVIEVQGYKSVNITFELDERIDKVLNEKCSAYTVELGTEVNEFACVVADAVIKTLQPVSELLTPLGIDLDEWSMATYYLFDESGEFKLASHMYCSFYPPDEDEEEGDCEEEEFEPSTQYEYGTEDDYQG

#### <b><span style='color:#FF5733'>SEARCHING DATABASES</span></b>

- Next, we can copy & paste this protein string into [PSI BLAST](https://www.ebi.ac.uk/Tools/sss/psiblast/) & search for matching proteins already stored in a predefined database.
- For our current protein, which we wish to identify, the results indicate that a 100% match was found with an already existing sequence in the database: [Replicase polyprotein 1a](https://pubchem.ncbi.nlm.nih.gov/protein/P0C6U8), as shown by a [temporary job link](https://www.ebi.ac.uk/Tools/services/web/toolresult.ebi?jobId=psiblast-I20210801-145740-0891-84805416-p1m).
- Through database searches & the <b>local sequence alignment</b> algorithm they use, we can get a clearer picture of what functions the protein of interest can have.
- Searches are stored for 7 days, after which a new search needs to be made when the link expires.
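
Search forms of this kind generally accept FASTA text, so it can be handy to wrap a protein string into a FASTA record before pasting it. A small sketch, assuming a fixed line width is wanted; `to_fasta`, the `query_1` label and the short example chain are illustrative only.

```python
def to_fasta(header, seq, width=60):
    """Format a sequence as a FASTA record with lines of at most `width` symbols."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return '>' + header + '\n' + '\n'.join(lines) + '\n'

# Short chain from Section 3, wrapped at 10 symbols to show the line breaks
record = to_fasta('query_1', 'MTDQPQAELAFTYDAP', width=10)
print(record)
```

For a real query, the longest chain found above would be passed in place of the short example, with the default width of 60.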

<div style="color:white;
       display:fill;
       border-radius:5px;
       background-color:#FF5733;
       font-size:220%;
       font-family:Nexa;
       letter-spacing:0.5px">
    <p style="padding: 20px;
          color:white;">
        <b>6 |</b> SUMMARY
    </p>
</div>

#### <b><span style='color:#FF5733'>CUSTOM CLASSES</span></b>
- In this notebook, we have defined a class that stores <b>basic information</b> about a specific sequence of interest. 
- Another class was also defined in order to read a commonly encountered format type <b>FASTA</b>. 

#### <b><span style='color:#FF5733'>OVERVIEW</span></b>
- We looked at common sequence related operations such as <b>transcription</b> & <b>translation</b>, which are quite critical concepts.
- We also looked at some basic visualisation of the nucleotide & amino acid distributions, which can be helpful when comparing not only the <b>count</b>, but also the percentage of each letter of the alphabet for the corresponding sequence type.

#### <b><span style='color:#FF5733'>WHERE TO FROM HERE</span></b>
- As with the <b>BioPython</b> module, we have defined classes which store sequence data & which can be expanded further.
- The class <b>SQ</b> will be used as a basis for other sequence related operations, such as <b>pairwise</b> & <b>multiple sequence alignment</b>, which, due to their content size alone, can have their own dedicated classes, similar to the class <b>read_seq</b>, which is dedicated to reading and passing on the sequence information to class <b>SQ</b>.
- The notebook will be continuously updated and further functions will be added in the future.