---
WBT-MBT2-25E <i>Programming Python for Bioinformatics</i> &copy; 2020-2023 Michal Bukowski (m.bukowski@uj.edu.pl) Department of Analytical Biochemistry, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University

---

<p style="font-size:15pt;font-weight:bold;border:1px solid;border-color:#aabbcc;padding:15px;background:#ddeeff;border-radius:15px">DNASeq, a new data type for processing DNA sequences</p>

<font size=3>
    
Below you will find an incomplete implementation of the class `DNASeq` that is supposed to define a new type of a data structure facilitating different operations performed on DNA sequences. These include operations that might be used to process multiple sequence alignments in FASTA format, which apart from letters belonging to the [IUPAC ambiguous DNA alphabet](https://www.bioinformatics.org/sms/iupac.html), may contain gaps (the minus sign `-`).<br>

Gradually complete the implementation of the class `DNASeq` (<u>it must be contained in one cell</u>). Follow the comments in the code. Upon completing each numbered part, run the corresponding code in the last section, `Test the code!`. It will help you build the class step by step.<br>

<u>Remove the comment sign</u> `#` placed before subsequent code lines (manually or using `Ctrl + /` for multiple lines at once) and fill in the gaps marked with `???`. Sometimes it requires only one line of code or a simple expression, sometimes a whole block of code.<br>

Again, remember to fill the code step by step, part by part.<br>
    
<b>REMEMBER</b> to <u>rerun</u> the cell with the DNASeq class definition after making any changes. Otherwise, the changes will not take effect.<br>
    
<b>TEST</b> your code upon finishing each section. It <u>must work</u> before you move forward. In order to make the code work, you <u>must implement sections in the given order</u>.

</font>

---
### On the GenBank notation
#### <u>An example for positions/coordinates/indices 1 and 4</u>:
(<i>start</i> and <i>stop</i> in Python, <i>beginning</i> and <i>end</i> in GenBank)
<pre style="font-family:Monospace">
<span style="color:#00aaaa">          |----&gt;|</span>
<span style="color:#aaaaaa"> Python: 0 1 2 3 4 5 6 7 8</span>
         A T C G T A C G A
<span style="color:#aaaaaa">GenBank: 1 2 3 4 5 6 7 8 9</span>
<span style="color:#00aaaa">        |------&gt;|</span>
         T A G C A T G C T   (complementary/minus strand)
<span style="color:#aa0000">        |&lt;------|</span>

<span style="color:#aaaaaa"> Python:</span> seq[5]   <span style="color:#aaaaaa">-&gt;</span> <span style="color:#00aaaa">'A'</span>
<span style="color:#aaaaaa">GenBank:</span> seq[5]   <span style="color:#aaaaaa">-&gt;</span> <span style="color:#00aaaa">'T'</span>

<span style="color:#aaaaaa"> Python:</span> seq[1:4] <span style="color:#aaaaaa">-&gt;</span> <span style="color:#00aaaa">'TCG'</span>
<span style="color:#aaaaaa">GenBank:</span> seq[1:4] <span style="color:#aaaaaa">-&gt;</span> <span style="color:#00aaaa">'ATCG'</span>

<span style="color:#aaaaaa"> Python:</span> seq[4:1] <span style="color:#aaaaaa">-&gt;</span> <span style="color:#aa0000">''</span>
<span style="color:#aaaaaa">GenBank:</span> seq[4:1] <span style="color:#aaaaaa">-&gt;</span> <span style="color:#aa0000">'CGAT'</span>   (5' to 3' direction, reverse-complement)
</pre>
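
The conversion shown in the diagram can be sketched in a few lines of plain Python. This is only an illustration of the coordinate arithmetic, not the `DNASeq` implementation; the helper name `genbank_fragment` and the reduced four-letter complement table are assumptions made for this example.

```python
# Hypothetical helper illustrating GenBank-style coordinates on a plain string
COMPLEMENT = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

def genbank_fragment(seq, beginning, end):
    """Return the fragment between two 1-based, inclusive coordinates."""
    lo, hi = min(beginning, end), max(beginning, end)
    # the 1-based inclusive range [lo, hi] is the Python slice [lo-1:hi]
    fragment = seq[lo - 1:hi]
    if beginning > end:
        # beginning > end denotes the complementary strand, read 5' to 3'
        fragment = ''.join(COMPLEMENT[base] for base in reversed(fragment))
    return fragment

seq = 'ATCGTACGA'
print(seq[1:4])                     # Python slicing:  TCG
print(genbank_fragment(seq, 1, 4))  # GenBank 1..4:    ATCG
print(genbank_fragment(seq, 4, 1))  # GenBank 4..1:    CGAT
```

Only the lower coordinate is decreased by 1. The upper one stays unchanged because it is inclusive in GenBank but exclusive in Python.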

---

In [5]:
import re
import numpy as np

#### The overview of the implemented class

    class DNASeq:
    
        ALPH
        FASTA_REGEX
        
        __init__()
        __repr__()
        __str__()
        __len__()
        __add__()
        __getitem__()
        
        from_file()
        revcmpl()
        copy()

In [112]:
# and there it is, our DNASeq class!
class DNASeq:
 
    #===SECTION=1========================================================
    
    # a dictionary with the IUPAC ambiguous DNA alphabet
    # (https://www.bioinformatics.org/sms/iupac.html)
    # and a gap sign, the key is paired with a value
    # that is the key's complement, this dictionary
    # is defined as a class variable,
    # revise how to access values of such variables,
    # TODO: fill up the missing letters,
    #       make them capitals
    
    ALPH = {
       'A' : 'T',   'T' : 'A',   'G' : 'C',   'C' : 'G',
       'K' : 'M',   'M' : 'K',   'B' : 'V',   'D' : 'H',
       'H' : 'D',   'V' : 'B',   'N' : 'N',   '-' : '-'
    }
    
    # regular expression for FASTA format processing,
    # another class variable,
    # TODO: assign to FASTA_REGEX variable a proper
    #       value to formulate a proper regular expression
    
    FASTA_REGEX = '\>([^\ \n]+)\ ?(.*)\n([^\>]+)'

    
    # TEST the code in the Section 1
    
    
    #===SECTION=2========================================================
    
    # the special method __init__() sets the initial state
    # of a newly created object (a class instance),
    # TODO: define the special method __init__() proper for
    #       initialisation of a new object of DNASeq type,
    #       which is to be created as follows:
    #       seq = DNASeq(seqid, title, seq)
    
    def __init__(self, seqid, title, seq):
        # TODO: assign values of the arguments passed
        #       whilst creating a new object to the
        #       object attributes of the same name
        
        self.seqid = seqid
        self.title = title
        self.seq = seq
        
    # TEST the code in the Section 2
        
        
    #===SECTION=3========================================================
        
    # now we need a class method that will help us to
    # deserialise objects from a file in FASTA format
    # TODO: define a class method from_file that next to
    #       the automatically passed reference to a class
    #       will accept an argument with a file path, and
    #       will work as follows:
    #       seqs = DNASeq.from_file('some/path/some_file.fasta')
    
    @classmethod
    def from_file(cls, filename):
        
        seqs = {}      
        
        # TODO: load the content of the file into
        #       a variable buffer by using
        #       the with statement
        with open(filename) as f:
            buffer = f.read()

        
        # TODO: start iteratively searching for FASTA_REGEX in
        #       buffer by using a for loop and the re.finditer() function
        for match in re.finditer(cls.FASTA_REGEX, buffer):
            
            seqid, title, seq = match.groups()
            
            # remove all whitespace (spaces, tabs, line breaks)
            # and convert the letters to capitals
            seq = re.sub(r'\s+', '', seq).upper()
            
            seq_alph = set(seq)
              
            # TODO: find a proper method for Python sets,
            #       which will allow you to see if the
            #       alphabet of seq is a subset of ALPH,
            #       if not raise an exception
            if not seq_alph.issubset(cls.ALPH):
                raise Exception(f'Sequence {seqid} contains incorrect letters')

            seqs[seqid] = cls(seqid, title, seq)
              
        return seqs
    
    # TEST the code in the Section 3
    
        
    #===SECTION=4========================================================
    
    # the special method __repr__() returns a string representation of
    # an object, we will define this representation as the first 10 letters
    # of the contained sequence followed by three dots '...' if the sequence
    # is longer than 10 letters
    
    def __repr__(self):
        
        # get the sequence up to the 10th character
        short_seq = self.seq[:10]
        
        # if the length of the sequence is greater than 10
        if len(self.seq) > 10:
            # add ... at the end
            short_seq = short_seq + '...'
        
        return short_seq
    
    # TEST the code in the Section 4
    
    
    #===SECTION=5========================================================
        
    # TODO: define the proper special method that will return the length of
    #       the sequence contained within a given DNASeq object when the reference
    #       to that object is passed to the built-in len() function
    
    def __len__(self):
        return len(self.seq)
    
    # TEST the code in the Section 5
    
    
    #===SECTION=6========================================================
    
    # TODO: implement the special method __str__(), which returns
    #       a string whenever a DNASeq object is being converted to one,
    #       eg. str(obj) or print(obj),
    #       make it return the sequence contained in an object 
    #       of DNASeq type as FASTA format sequence string,
    #       let the sequence be divided into 60-character lines
    
    def __str__(self):
        
        seqid = '>'+self.seqid
        
        lines = '\n'.join(self.seq[i:i+60] for i in range(0, len(self.seq), 60))

        title = f' {self.title}' if self.title != '' else ''

        fasta = f'{seqid}{title}\n{lines}'
        
        return fasta
    
    
    # TEST the code in the Section 6
    
    
    #===SECTION=7=======================================================
    
    # let's create a custom method and call it revcmpl(),
    # it simply returns a new object of DNASeq type, which
    # will contain a reverse-complement sequence to the
    # one contained in the object the method is called from,
    # to make both objects distinguishable, create a new
    # seqid for the sequence by adding '_revcmpl' suffix
    # to the seqid of the original one
    
    def revcmpl(self):
        # TODO: convert the sequence contained in the object
        #       to a list called seq
        
        seq = list(self.seq)
        
        # TODO: reverse the list in-place
        seq.reverse()
       
        # TODO: using the string method join(), the class dictionary ALPH and a
        #       list comprehension, translate the reversed sequence and
        #       convert it into a string
        
        seq_revcmpl = ''.join([self.ALPH[base] for base in seq])
        
        # TODO: create seqid variable and assign to it the object's seqid
        #       and the suffix '_revcmpl'
        
        seqid = self.seqid+'_revcmpl'
        
        # TODO: create a new object of the DNASeq type using the new seqid,
        #       the title contained in the object as well as
        #       the reversed and translated sequence, return the new object
        
        revcmpl_dnaseq = DNASeq(seqid, self.title, seq_revcmpl)
        
        return revcmpl_dnaseq
    
    # TEST the code in the Section 7
    
    
    #===SECTION=8========================================================
    
    # beware! this is the hardest part to go through,
    # we will implement the special method __getitem__()
    # that lets us define what happens when an object
    # is indexed or sliced (like a list or string),
    # eg. obj[3] or obj[4:5]
    #
    # we know that indexing and slicing in Python works as follows:
    # - indexing starts at 0 and goes up
    # - when a slice is taken [3:4] the first index is included
    #   the last not: eg. s = 'abcde', s[1:3] -> 'bc'
    #                          01234
    # - slice [3:1] will be an empty one
    #
    # when it comes to biological sequences, those are indexed
    # according to GenBank notation, which is quite different:
    # - indexing starts at 1
    # - when a fragment is requested, both indices are inclusive:
    #   seq = 'ATGCTACG', seq[1:3] -> 'ATG'
    #          12345678
    # - the start index greater than stop index indicates
    #   a reverse complement (the complementary strand):
    #   seq[3:1] -> 'CAT'
    #
    # now we will try to implement this behaviour in case of
    # our objects of DNASeq type
    
    def __getitem__(self, key):
        
        if isinstance(key, slice):
            # if the key is a slice object, it has three properties:
            # start, stop and step, eg. list[start:stop:step],
            # if any is missing its value is None, eg. list[start:stop],
            # the first two are the equivalents of beginning and end
            # in GenBank notation
            
            # the whole point here is to translate GenBank indices
            # into Python ones and index or slice the sequence
            # contained in the DNASeq object, which is a regular
            # Python string, so it cannot be directly indexed
            # or sliced with GenBank indices without translating them
            
            # to make the code nicer, let's assign values of slice
            # properties to single variables
            
            start, end, step = key.start, key.stop, key.step
            
            # let's test a few possibilities and exclude those
            # that cannot be translated into GenBank notation
            
            if step is not None:
                # there is not an equivalent of step in GenBank notation
                # raise a KeyError                
                raise KeyError('Step is not allowed in GenBank notation')
            
            if start is None:
                # start must be provided                
                raise KeyError('start index is required')
            
            if end is None:
                # as well as end                
                raise KeyError('end index is required')

                
            if not np.issubdtype(type(start), np.integer) or \
              not np.issubdtype(type(end), np.integer):
                # start and end must be defined as integer values,
                # otherwise raise a TypeError
                
                raise TypeError('Start and end must be integers')
            
            if start <= 0 or end <= 0:
                # both must be greater than 0 as in GenBank notation,
                # in which indices start at 1
                raise KeyError('Minimal value for start and end is 1')
                
            # now we are sure there is only start and end in the slice,
            # and that both are integer values greater than or equal to 1,
            # let's move on then
            
            # TODO: create a variable strand and set it to 1 if start is
            #       less than or equal to end, otherwise to -1
            # TODO: if strand is equal to -1, swap the values of start and end,
            #       by using unpacking, so start is less than end (they are ordered)
            
            if start <= end:
                strand = 1
            else:
                strand = -1    
                start, end = end, start
            
            # TODO: create a new seqid by adding to the existing one
            #       suffix '_loc(start_end)', where start and end are
            #       values of our variables            
          
            new_seqid = self.seqid+f'_loc({start}_{end})'
            
            # TODO: decrease start by 1, if start in GenBank notation is 1,
            #       it is an equivalent of 0 in Python indexing (etc.),
            #       so it needs to be decreased by 1 to be translated
            #       from GenBank to Python
            #
            #       you must not decrease end, as in GenBank it is inclusive,
            #       but in Python exclusive, so the two differences
            #       cancel each other out
            
            start -= 1 
            
            # TODO: create a new object of DNASeq type, by using new seqid
            #       you created a few lines before, title of the existing object,
            #       a slice of the sequence contained in that object by using
            #       the adjusted start index and the end index (string slicing)    
            
            title = self.title
            sub_seq = self.seq[start:end]
            new_seq = DNASeq(new_seqid, title, sub_seq)

            
            # TODO: if strand is -1, assign to the same reference ("variable")
            #       a reverse complement of the new sequence object by using
            #       the method revcmpl() you implemented before
            
            if strand == -1:
                new_seq = new_seq.revcmpl()        
                                
            # TODO: return the new sequence object       
            
            return new_seq
                
        else:
            
            # if we are here, it means that key is not a slice object,
            # then we allow it to be an integer greater than 0,
            # which follows GenBank notation
            
            if not np.issubdtype(type(key), np.integer):
                raise TypeError('Index must be an integer')
                
            if key <= 0:
                raise KeyError('Minimal value of index is 1')
                
            # in case it is just one letter, not a slice, we
            # will simply return a letter from the sequence,
            # not a new DNASeq object, we need to remember to
            # decrease the key value by one to translate it from
            # GenBank to Python
                
            return self.seq[key-1]
    
    # TEST the code in the Section 8
        
        
    #===SECTION=9========================================================
    
    # implement the special method __add__() which will be
    # invoked when two objects of DNASeq type are being added,
    # eg. seq_3 = seq_1 + seq_2
    #
    # create the result of addition as a new object of DNASeq type
    # and set its:
    # - seqid to seqids of added objects separated by an underscore
    # - title to titles of added objects separated by an underscore
    # - seq to a simple concatenation of both sequences
    #
    # return the new object
    
    def __add__(self, other):
        add_seqid = f'{self.seqid}_{other.seqid}'
        add_title = f'{self.title}_{other.title}'
        add_seq = self.seq + other.seq
        
        add_dnaseq = DNASeq(add_seqid, add_title, add_seq)
        
        return add_dnaseq
    
    # TEST the code in the Section 9
    
    
    #===SECTION=10========================================================
        
    # define a custom method copy() that will return an exact copy
    # of the existing object
        
    def copy(self):
        
        seqid = self.seqid
        title = self.title
        seq = self.seq
        
        copy_dnaseq = DNASeq(seqid, title, seq)
        
        return copy_dnaseq
        
    
    # TEST the code in the Section 10
    

<p style="font-size:15pt;font-weight:bold;border:1px solid;border-color:#aabbcc;padding:15px;background:#ddeeff;border-radius:15px">Test the code!</p>

---
<p style="background:#ffeedd;font-family:Sans;font-size:12pt;font-weight:bold;padding:10px;margin:0px">
Section 1<br><br>
<span style="font-size:10pt">The output should be:</span>
</p>
<p style="background:#ddffee;font-family:Monospace;font-size:10pt;padding:10px;margin:0px">
The whole dictionary: {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G', 'K': 'M', 'M': 'K', 'B': 'V', 'D': 'H', 'H': 'D', 'V': 'B', 'N': 'N', '-': '-'}<br><br>
A letter complementary to "C": G<br><br>
Regular expression for FASTA format '\\>([^\\ \n]+)\\ ?(.*)\n([^\\>]+)'
</p>

In [66]:
# let's see our class variables:
print('The whole dictionary:', DNASeq.ALPH, '\n')

print('A letter complementary to "C":', DNASeq.ALPH['C'], '\n')

print('Regular expression for FASTA format', repr(DNASeq.FASTA_REGEX), '\n')


The whole dictionary: {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G', 'K': 'M', 'M': 'K', 'B': 'V', 'D': 'H', 'H': 'D', 'V': 'B', 'N': 'N', '-': '-'} 

A letter complementary to "C": G 

Regular expression for FASTA format '\\>([^\\ \n]+)\\ ?(.*)\n([^\\>]+)' 
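
To see what each of the three capturing groups picks up, the pattern can be tried on a small in-memory text instead of a file. The sketch below is only an illustration: it uses an equivalent pattern written as a raw string without the redundant backslashes (an assumption made for readability), and the two-record FASTA text is made up.

```python
import re

# Equivalent pattern written as a raw string (assumption made for readability)
FASTA_REGEX = r'>([^ \n]+) ?(.*)\n([^>]+)'

# A made-up two-record FASTA text used only for this illustration
fasta_text = '>seq1 first example\nATCG\nGGTA\n>seq2\nTTAA\n'

records = []
for match in re.finditer(FASTA_REGEX, fasta_text):
    # group 1: sequence id, group 2: optional title, group 3: sequence lines
    seqid, title, seq = match.groups()
    # the third group still contains line breaks, so strip all whitespace
    seq = re.sub(r'\s+', '', seq)
    records.append((seqid, title, seq))

print(records)
# [('seq1', 'first example', 'ATCGGGTA'), ('seq2', '', 'TTAA')]
```

A record without a title yields an empty string in the second group, which is why `__str__()` later has to treat an empty title separately.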



---
<p style="background:#ffeedd;font-family:Sans;font-size:12pt;font-weight:bold;padding:10px;margin:0px">
Section 2<br><br>
<span style="font-size:10pt">The output should be:</span>
</p>
<p style="background:#ddffee;font-family:Monospace;font-size:10pt;padding:10px;margin:0px">
new_seq (an exemplary sequence): ATCGT...
</p>

In [67]:
seq = DNASeq('new_seq', 'an exemplary sequence',
             'ATCGTAGGATCGGATTAGAGCGATTAGCTAG')

print(f'{seq.seqid} ({seq.title}): {seq.seq[:5]}...')


new_seq (an exemplary sequence): ATCGT...
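
Sections 1 and 2 rely on two kinds of attributes: class variables such as `ALPH`, shared by all instances, and instance attributes such as `seqid`, set separately for each object in `__init__()`. The toy class below (the name `Demo` and its contents are hypothetical) is a minimal sketch of that difference.

```python
class Demo:
    # class variable: one dictionary shared by the class and all instances
    ALPH = {'A': 'T', 'T': 'A'}

    def __init__(self, seqid):
        # instance attribute: stored separately in every object
        self.seqid = seqid

first = Demo('s1')
second = Demo('s2')

# the class variable can be reached through the class or any instance
print(Demo.ALPH is first.ALPH is second.ALPH)   # True

# instance attributes differ between objects and do not exist on the class
print(first.seqid, second.seqid)                # s1 s2
print(hasattr(Demo, 'seqid'))                   # False
```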


---
<p style="background:#ffeedd;font-family:Sans;font-size:12pt;font-weight:bold;padding:10px;margin:0px">
Section 3<br><br>
<span style="font-size:10pt">The output should be:</span>
</p>
<p style="background:#ddffee;font-family:Monospace;font-size:10pt;padding:10px;margin:0px">
arcC, aroE, glpF, gmk, pta, tpi, yqiL<br><br>
yqiL (Acetyle coenzyme A acetyltransferase): GCGTT...
</p>

In [68]:
seqs = DNASeq.from_file('input/Staphylococcus_MLST_genes.fasta')

In [69]:
# deserialise sequences from a FASTA format file
# into a collection of objects of DNASeq class

seqs = DNASeq.from_file('input/Staphylococcus_MLST_genes.fasta')

seqids = ', '.join( seqs.keys() )

print(seqids)

seq = seqs['yqiL']

print(f'\n{seq.seqid} ({seq.title}): {seq.seq[:5]}...')
    

arcC, aroE, glpF, gmk, pta, tpi, yqiL

yqiL (Acetyle coenzyme A acetyltransferase): GCGTT...
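
The alphabet check in `from_file()` is a question about sets: is every letter of the sequence contained in the allowed alphabet? A minimal sketch of that test is shown below; the reduced alphabet and the helper names `has_valid_letters` and `invalid_letters` are assumptions made for this example.

```python
# Reduced alphabet used only for this illustration
ALLOWED = set('ATGCN-')

def has_valid_letters(seq, allowed=ALLOWED):
    """Return True if every character of seq belongs to the allowed alphabet."""
    return set(seq).issubset(allowed)

def invalid_letters(seq, allowed=ALLOWED):
    """Return the sorted list of characters that are not allowed."""
    return sorted(set(seq) - allowed)

print(has_valid_letters('ATG-CN'))   # True
print(has_valid_letters('ATGXU'))    # False
print(invalid_letters('ATGXU'))      # ['U', 'X']

# the in operator tests membership of a single element, not a subset relation
print(set('AT') in ALLOWED)          # False
```

Listing the offending letters, as `invalid_letters()` does, can make an error message more helpful than a bare statement that the sequence is incorrect.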


---
<p style="background:#ffeedd;font-family:Sans;font-size:12pt;font-weight:bold;padding:10px;margin:0px">
Section 4<br><br>
<span style="font-size:10pt">The output should be:</span>
</p>
<p style="background:#ddffee;font-family:Monospace;font-size:10pt;padding:10px;margin:0px">
{'arcC': TTATTAATCC...,<br>
 'aroE': AATTTTAATT...,<br>
 'glpF': GGTGCTGATT...,<br>
 'gmk': CGAATATTTG...,<br>
 'pta': GCAACACAAT...,<br>
 'tpi': CACGAAACAG...,<br>
 'yqiL': GCGTTTAAAG...}
</p>

In [70]:
# reload the sequences to have a collection of objects
# that are instances of the up-to-date DNASeq class

seqs = DNASeq.from_file('input/Staphylococcus_MLST_genes.fasta')

seqs


{'arcC': TTATTAATCC...,
 'aroE': AATTTTAATT...,
 'glpF': GGTGCTGATT...,
 'gmk': CGAATATTTG...,
 'pta': GCAACACAAT...,
 'tpi': CACGAAACAG...,
 'yqiL': GCGTTTAAAG...}
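
The dictionary above is displayed with the shortened form because containers show their elements using `__repr__()`, whereas `print()` and `str()` use `__str__()` when it is defined. The toy class below (hypothetical, with a hard-coded FASTA header) is a small sketch of that division of labour.

```python
class Fragment:
    """Toy class (hypothetical) with separate __repr__ and __str__."""

    def __init__(self, seq):
        self.seq = seq

    def __repr__(self):
        # short form: first 10 letters, plus dots for longer sequences
        short = self.seq[:10]
        return short + '...' if len(self.seq) > 10 else short

    def __str__(self):
        # long form: a FASTA-like record with a fixed, made-up header
        return f'>fragment\n{self.seq}'

frag = Fragment('ATCGTAGGATCGGATT')

print(repr(frag))   # ATCGTAGGAT...
print([frag])       # [ATCGTAGGAT...]
print(frag)         # prints the two-line FASTA-like record
```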

---
<p style="background:#ffeedd;font-family:Sans;font-size:12pt;font-weight:bold;padding:10px;margin:0px">
Section 5<br><br>
<span style="font-size:10pt">The output should be:</span>
</p>
<p style="background:#ddffee;font-family:Monospace;font-size:10pt;padding:10px;margin:0px">
516
</p>

In [71]:
# reload the sequences to have a collection of objects
# that are instances of the up-to-date DNASeq class

seqs = DNASeq.from_file('input/Staphylococcus_MLST_genes.fasta')

# select one of the sequences by its sequence id (seqid)
seq = seqs['yqiL']

# look up the length of the contained sequence
len(seq)


516
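
Defining `__len__()` has one side effect worth knowing: when a class does not define `__bool__()`, Python uses the length to decide whether an object is truthy. The toy class `Oligo` below is hypothetical and serves only to show this.

```python
class Oligo:
    """Toy container (hypothetical) that defines only __len__."""

    def __init__(self, seq):
        self.seq = seq

    def __len__(self):
        return len(self.seq)

print(len(Oligo('ATCG')))    # 4
print(bool(Oligo('ATCG')))   # True
print(bool(Oligo('')))       # False, because the length is 0
```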

---
<p style="background:#ffeedd;font-family:Sans;font-size:12pt;font-weight:bold;padding:10px;margin:0px">
Section 6<br><br>
<span style="font-size:10pt">The output should be:</span>
</p>
<p style="background:#ddffee;font-family:Monospace;font-size:10pt;padding:10px;margin:0px">
&gt;yqiL Acetyle coenzyme A acetyltransferase<br>
GCGTTTAAAGACGTGCCAGCCTATGATTTAGGTGCGACTTTAATAGAACATATTATTAAA<br>
GAGACGGGTTTGAATCCAAGTGAGATTGATGAAGTTATCATCGGTAACGTACTACAAGCA<br>
GGACAAGGACAAAATCCAGCACGAATTGCTGCTATGAAAGGTGGCTTGCCAGAAACAGTA<br>
CCTGCATTTACAGTGAATAAAGTATGTGGTTCTGGGTTAAAGTCGATTCAATTAGCATAT<br>
CAATCTATTGTGACTGGTGAAAATGACATCGTGCTAGCTGGCGGTATGGAGAATATGTCT<br>
CAGTCACCAATGCTTGTCAACAACAGTCGCTTCGGTTTTAAAATGGGACATCAATCAATG<br>
GTTGATAGCATGGTATATGATGGTTTAACAGATGTATTTAATCAATATCATATGGGTATT<br>
ACTGCTGAAAATTTAGTGGAGCAATATGGTATTTCAAGAGAAGAACAAGATACATTTGCT<br>
GTAAACTCACAACAAAAAGCAGTACGTGCACAGCAA
</p>

In [72]:
# reload the sequences to have a collection of objects
# that are instances of the up-to-date DNASeq class

seqs = DNASeq.from_file('input/Staphylococcus_MLST_genes.fasta')

# select one of the sequences by its sequence id (seqid)
seq = seqs['yqiL']

print(seq)


>yqiL Acetyle coenzyme A acetyltransferase
GCGTTTAAAGACGTGCCAGCCTATGATTTAGGTGCGACTTTAATAGAACATATTATTAAA
GAGACGGGTTTGAATCCAAGTGAGATTGATGAAGTTATCATCGGTAACGTACTACAAGCA
GGACAAGGACAAAATCCAGCACGAATTGCTGCTATGAAAGGTGGCTTGCCAGAAACAGTA
CCTGCATTTACAGTGAATAAAGTATGTGGTTCTGGGTTAAAGTCGATTCAATTAGCATAT
CAATCTATTGTGACTGGTGAAAATGACATCGTGCTAGCTGGCGGTATGGAGAATATGTCT
CAGTCACCAATGCTTGTCAACAACAGTCGCTTCGGTTTTAAAATGGGACATCAATCAATG
GTTGATAGCATGGTATATGATGGTTTAACAGATGTATTTAATCAATATCATATGGGTATT
ACTGCTGAAAATTTAGTGGAGCAATATGGTATTTCAAGAGAAGAACAAGATACATTTGCT
GTAAACTCACAACAAAAAGCAGTACGTGCACAGCAA
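
The 60-character lines come from slicing the sequence at every multiple of the line width. The sketch below isolates that step; the function name `wrap` and the artificial sequence are assumptions made for the example.

```python
def wrap(seq, width=60):
    """Split seq into lines of at most width characters."""
    # range(0, len(seq), width) yields the start index of every line
    return '\n'.join(seq[i:i + width] for i in range(0, len(seq), width))

seq = 'ACGT' * 40            # an artificial 160-letter sequence
lines = wrap(seq).split('\n')

print([len(line) for line in lines])   # [60, 60, 40]
```

Slicing past the end of a string is safe in Python, so the last, shorter line needs no special handling.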


---
<p style="background:#ffeedd;font-family:Sans;font-size:12pt;font-weight:bold;padding:10px;margin:0px">
Section 7<br><br>
<span style="font-size:10pt">The output should be:</span>
</p>
<p style="background:#ddffee;font-family:Monospace;font-size:10pt;padding:10px;margin:0px">
&gt;yqiL_revcmpl Acetyle coenzyme A acetyltransferase<br>
TTGCTGTGCACGTACTGCTTTTTGTTGTGAGTTTACAGCAAATGTATCTTGTTCTTCTCT<br>
TGAAATACCATATTGCTCCACTAAATTTTCAGCAGTAATACCCATATGATATTGATTAAA<br>
TACATCTGTTAAACCATCATATACCATGCTATCAACCATTGATTGATGTCCCATTTTAAA<br>
ACCGAAGCGACTGTTGTTGACAAGCATTGGTGACTGAGACATATTCTCCATACCGCCAGC<br>
TAGCACGATGTCATTTTCACCAGTCACAATAGATTGATATGCTAATTGAATCGACTTTAA<br>
CCCAGAACCACATACTTTATTCACTGTAAATGCAGGTACTGTTTCTGGCAAGCCACCTTT<br>
CATAGCAGCAATTCGTGCTGGATTTTGTCCTTGTCCTGCTTGTAGTACGTTACCGATGAT<br>
AACTTCATCAATCTCACTTGGATTCAAACCCGTCTCTTTAATAATATGTTCTATTAAAGT<br>
CGCACCTAAATCATAGGCTGGCACGTCTTTAAACGC
</p>

In [73]:
# reload the sequences to have a collection of objects
# that are instances of the up-to-date DNASeq class

seqs = DNASeq.from_file('input/Staphylococcus_MLST_genes.fasta')

# select one of the sequences by its sequence id (seqid)
seq = seqs['yqiL']

new_seq = seq.revcmpl()

print( new_seq )


>yqiL_revcmpl Acetyle coenzyme A acetyltransferase
TTGCTGTGCACGTACTGCTTTTTGTTGTGAGTTTACAGCAAATGTATCTTGTTCTTCTCT
TGAAATACCATATTGCTCCACTAAATTTTCAGCAGTAATACCCATATGATATTGATTAAA
TACATCTGTTAAACCATCATATACCATGCTATCAACCATTGATTGATGTCCCATTTTAAA
ACCGAAGCGACTGTTGTTGACAAGCATTGGTGACTGAGACATATTCTCCATACCGCCAGC
TAGCACGATGTCATTTTCACCAGTCACAATAGATTGATATGCTAATTGAATCGACTTTAA
CCCAGAACCACATACTTTATTCACTGTAAATGCAGGTACTGTTTCTGGCAAGCCACCTTT
CATAGCAGCAATTCGTGCTGGATTTTGTCCTTGTCCTGCTTGTAGTACGTTACCGATGAT
AACTTCATCAATCTCACTTGGATTCAAACCCGTCTCTTTAATAATATGTTCTATTAAAGT
CGCACCTAAATCATAGGCTGGCACGTCTTTAAACGC
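
The reverse complement can be computed in more than one way. The sketch below compares the dictionary-and-`join()` approach with an alternative based on `str.maketrans()` and `str.translate()`; this is a different technique from the one the exercise asks for, shown only for comparison, and the reduced complement table is an assumption.

```python
# Reduced complement table used only for this illustration
COMPLEMENT = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G', 'N': 'N', '-': '-'}

def revcmpl_join(seq):
    """Reverse the sequence, then look up each complement in the dictionary."""
    return ''.join(COMPLEMENT[base] for base in reversed(seq))

# str.maketrans() accepts a dictionary of single characters
TABLE = str.maketrans(COMPLEMENT)

def revcmpl_translate(seq):
    """Complement the whole string at once, then reverse it."""
    return seq.translate(TABLE)[::-1]

print(revcmpl_join('ATATTT'))        # AAATAT
print(revcmpl_translate('ATATTT'))   # AAATAT
```

Applying either function twice returns the original sequence, which makes a convenient sanity check.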


---
<p style="background:#ffeedd;font-family:Sans;font-size:12pt;font-weight:bold;padding:10px;margin:0px">
Section 8<br><br>
<span style="font-size:10pt">The output should be:</span>
</p>
<p style="background:#ddffee;font-family:Monospace;font-size:10pt;padding:10px;margin:0px">
&gt;gmk Guanylate kinase<br>
CGAATATTTGAAGATCCAAGTACATCATATAAGTATTCTATTTCAATGACAACACGTCAA<br>
ATGCGTGAAGGTGAAGTTGATGGCGTAGATTACTTTTTTAAAACTAGGGATGCGTTTGAA<br>
GCTTTAATCAAAGATGACCAATTTATAGAATATGCTGAATATGTAGGCAACTATTATGGT<br>
ACACCAGTTCAATATGTTAAAGATACAATGGACGAAGGTCATGATGTATTTTTAGAAATT<br>
GAAGTAGAAGGTGCAAAGCAAGTTAGAAAGAAATTTCCAGATGCGCTATTTATTTTCTTA<br>
GCACCTCCAAGTTTAGAACACTTGAGAGAGCGATTAGTAGGTAGAGGAACAGAATCTGAT<br>
GAGAAAATACAAAGTCGTATTAACGAAGCGCGTAAAGAAGTTGAAATGATGAATTTA
</p>
<p style="background:#ffeedd;font-family:Sans;font-size:10pt;font-weight:bold;padding:10px;margin:0px">
and:
</p>
<p style="background:#ddffee;font-family:Monospace;font-size:10pt;padding:10px;margin:0px">
&lt;class 'str'&gt;<br>
C<br><br>
&lt;class 'str'&gt;<br>
ATATTT
</p>
<p style="background:#ffeedd;font-family:Sans;font-size:10pt;font-weight:bold;padding:10px;margin:0px">
and:
</p>
<p style="background:#ddffee;font-family:Monospace;font-size:10pt;padding:10px;margin:0px">
&lt;class 'str'&gt;<br>
C<br><br>
&lt;class '__main__.DNASeq'&gt;<br>
&gt;gmk_loc(4_9) Guanylate kinase<br>
ATATTT<br><br>
&lt;class '__main__.DNASeq'&gt;<br>
&gt;gmk_loc(1_1) Guanylate kinase<br>
C
</p>
<p style="background:#ffeedd;font-family:Sans;font-size:10pt;font-weight:bold;padding:10px;margin:0px">
and:
</p>
<p style="background:#ddffee;font-family:Monospace;font-size:10pt;padding:10px;margin:0px">
&gt;gmk_loc(4_9) Guanylate kinase<br>
ATATTT<br><br>
&gt;gmk_loc(4_9)_revcmpl Guanylate kinase<br>
AAATAT<br><br>
&gt;gmk_loc(4_9)_revcmpl Guanylate kinase<br>
AAATAT
</p>

In [104]:
# reload the sequences to have a collection of objects
# that are instances of the up-to-date DNASeq class

seqs = DNASeq.from_file('input/Staphylococcus_MLST_genes.fasta')

seq = seqs['gmk']

print(seq)


>gmk Guanylate kinase
CGAATATTTGAAGATCCAAGTACATCATATAAGTATTCTATTTCAATGACAACACGTCAA
ATGCGTGAAGGTGAAGTTGATGGCGTAGATTACTTTTTTAAAACTAGGGATGCGTTTGAA
GCTTTAATCAAAGATGACCAATTTATAGAATATGCTGAATATGTAGGCAACTATTATGGT
ACACCAGTTCAATATGTTAAAGATACAATGGACGAAGGTCATGATGTATTTTTAGAAATT
GAAGTAGAAGGTGCAAAGCAAGTTAGAAAGAAATTTCCAGATGCGCTATTTATTTTCTTA
GCACCTCCAAGTTTAGAACACTTGAGAGAGCGATTAGTAGGTAGAGGAACAGAATCTGAT
GAGAAAATACAAAGTCGTATTAACGAAGCGCGTAAAGAAGTTGAAATGATGAATTTA


In [105]:
# indexing in Python starts at 0,
# and in slices the second index is exclusive,
# we can see this by operating directly on the sequence
# of a DNASeq object, which is a string variable

letter = seq.seq[0]

print(type(letter), letter, sep='\n', end='\n\n')

s = seq.seq[3:9]

print(type(s), s, sep='\n')


<class 'str'>
C

<class 'str'>
ATATTT


In [106]:
# thanks to the implementation of the special method __getitem__,
# we can index an object of DNASeq type directly, according to
# GenBank notation, i.e. indexing from 1 with the second index inclusive,
# compare this with the Python indexing example above

letter = seq[1]

print(type(letter), letter, sep='\n', end='\n\n')

s = seq[4:9]

print(type(s), s, sep='\n', end='\n\n')


# to obtain a variable of DNASeq type instead of str for a single character,
# we can create a slice of length 1, in accordance with GenBank notation

s = seq[1:1]

print(type(s), s, sep='\n')


<class 'str'>
C

<class '__main__.DNASeq'>
>gmk_loc(4_9) Guanylate kinase
ATATTT

<class '__main__.DNASeq'>
>gmk_loc(1_1) Guanylate kinase
C


In [107]:
# because a slice of a DNASeq object returns an object
# of the same type, we can keep working with a sequence fragment
# just as with the whole sequence,
# here by displaying its reverse-complement sequence
# in FASTA format

print(seq[4:9], end='\n\n')

print(seq[4:9].revcmpl(), end='\n\n')

# the reverse-complement sequence can also be displayed
# by using the implemented GenBank indexing
# and providing the index values in reversed order

print(seq[9:4])


>gmk_loc(4_9) Guanylate kinase
ATATTT

>gmk_loc(4_9)_revcmpl Guanylate kinase
AAATAT

>gmk_loc(4_9)_revcmpl Guanylate kinase
AAATAT
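
It may help to see exactly what Python passes to `__getitem__()` for different kinds of subscripts. The class `Probe` below is hypothetical and does nothing except report the key it receives.

```python
class Probe:
    """Hypothetical class that only reports the key passed to __getitem__."""

    def __getitem__(self, key):
        if isinstance(key, slice):
            # missing parts of a slice arrive as None
            return ('slice', key.start, key.stop, key.step)
        return ('index', key)

probe = Probe()

print(probe[3])      # ('index', 3)
print(probe[4:9])    # ('slice', 4, 9, None)
print(probe[9:4])    # ('slice', 9, 4, None)
print(probe[:5:2])   # ('slice', None, 5, 2)
```

Python itself attaches no meaning to `start` being greater than `stop`. Interpreting `[9:4]` as a request for the complementary strand is entirely the decision of the class that implements `__getitem__()`.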


---
<p style="background:#ffeedd;font-family:Sans;font-size:12pt;font-weight:bold;padding:10px;margin:0px">
Section 9<br><br>
<span style="font-size:10pt">The output should be:</span>
</p>
<p style="background:#ddffee;font-family:Monospace;font-size:10pt;padding:10px;margin:0px">
&gt;arcC_loc(1_10) Carbamate kinase<br>
TTATTAATCC<br><br>
&gt;glpF_loc(1_10) Glycerol kinase<br>
GGTGCTGATT<br><br>
&gt;arcC_loc(1_10)_glpF_loc(1_10) Carbamate kinase_Glycerol kinase<br>
TTATTAATCCGGTGCTGATT
</p>

In [111]:
# reload the sequences to have a collection of objects
# that are instances of the up-to-date DNASeq class

seqs = DNASeq.from_file('input/Staphylococcus_MLST_genes.fasta')

# select 10 nucleotide fragments of two different
# sequences and add them together
seq_1 = seqs['arcC'][1:10]

seq_2 = seqs['glpF'][1:10]

seq_3 = seq_1 + seq_2

print(seq_1, seq_2, seq_3, sep='\n\n')


>arcC_loc(1_10) Carbamate kinase
TTATTAATCC

>glpF_loc(1_10) Glycerol kinase
GGTGCTGATT

>arcC_loc(1_10)_glpF_loc(1_10) Carbamate kinase_Glycerol kinase
TTATTAATCCGGTGCTGATT
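
An `__add__()` method is called with the left operand as its first argument and the right operand as its second. The sketch below uses a hypothetical toy class `Tagged` and additionally returns `NotImplemented` for unsupported operands, a common convention that goes beyond what the exercise requires.

```python
class Tagged:
    """Toy class (hypothetical) supporting the + operator."""

    def __init__(self, tag, seq):
        self.tag = tag
        self.seq = seq

    def __add__(self, other):
        if not isinstance(other, Tagged):
            # lets Python raise a TypeError for unsupported operands
            return NotImplemented
        return Tagged(f'{self.tag}_{other.tag}', self.seq + other.seq)

a = Tagged('x', 'TTATT')
b = Tagged('y', 'GGTGC')
c = a + b            # equivalent to a.__add__(b)

print(c.tag, c.seq)  # x_y TTATTGGTGC

try:
    a + 'ACGT'
except TypeError:
    print('adding a plain string is not supported')
```

The operands themselves are left unchanged; the result is always a new object.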


---
<p style="background:#ffeedd;font-family:Sans;font-size:12pt;font-weight:bold;padding:10px;margin:0px">
Section 10<br><br>
<span style="font-size:10pt">The output should be:</span>
</p>
<p style="background:#ddffee;font-family:Monospace;font-size:10pt;padding:10px;margin:0px">
pta<br><br>
new object<br><br>
===<br><br>
pta<br><br>
new ref. the same obj.<br><br>
new ref. the same obj.
</p>

In [113]:
# reload the sequences to have a collection of objects
# that are instances of the up-to-date DNASeq class

seqs = DNASeq.from_file('input/Staphylococcus_MLST_genes.fasta')

# select one of the sequences by its sequence id (seqid)
seq = seqs['pta']

# our method copy() allows us to create
# an independent copy of an existing object

seq_copy = seq.copy()

# a copy of an object is an independent object,
# stored separately in memory
seq_copy.seqid = 'new object'

print(seq.seqid, seq_copy.seqid, sep='\n\n', end='\n\n===\n\n')

# this is in contrast to assignment,
# which simply creates a new reference to
# the same object that already exists in memory

print(seq.seqid, end='\n\n')

new_ref = seq

new_ref.seqid = 'new ref. the same obj.'

print(seq.seqid, new_ref.seqid, sep='\n\n')


pta

new object

===

pta

new ref. the same obj.

new ref. the same obj.
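
The difference between a copy and a second reference can also be checked directly with the `is` operator, which tests whether two names point to the same object. The class `Record` below is a hypothetical stand-in; `copy.copy()` from the standard library is shown alongside as a general-purpose alternative to a hand-written `copy()` method.

```python
import copy

class Record:
    """Toy stand-in for a sequence object (hypothetical)."""

    def __init__(self, seqid, seq):
        self.seqid = seqid
        self.seq = seq

    def copy(self):
        # build a brand new object from the current attribute values
        return Record(self.seqid, self.seq)

original = Record('pta', 'GCAACACAAT')
alias = original               # a second name for the same object
duplicate = original.copy()    # an independent object
shallow = copy.copy(original)  # standard library shallow copy

print(alias is original)       # True
print(duplicate is original)   # False
print(shallow is original)     # False

alias.seqid = 'renamed'
print(original.seqid)          # renamed
print(duplicate.seqid)         # pta
```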


<p style="font-size:15pt;font-weight:bold;border:1px solid;border-color:#aabbcc;padding:15px;background:#ddeeff;border-radius:15px">The End :)</p>