# OOPLab2-Spring22 
Now that we know a little more about inheritance and how it works, let's update and expand our previous work with the Sequence class and SequenceRecord class.
 

When we wrote Sequence before, we were at least partly assuming we'd be dealing with DNA sequences. However, that's not always the case: a Sequence could just as well be a protein sequence. The methods we wrote in our Sequence class, such as getting the GC count or finding the reverse complement, aren't appropriate for protein sequences. This is a great opportunity to use inheritance to create two subtypes of Sequence.

 

In a notebook, start with a markdown cell and plan out what you think these 3 classes should look like. What are the common elements of Sequences (things we could define in the parent class Sequence), and what would need to be unique to the DNASequence and ProteinSequence classes? What rules do you want to enforce about what these sequences should look like, and how do you want to enforce those rules? Do you need to override constructors, or could the parent's work? Remember, eventually you want these to work with the SequenceRecord class we built earlier, so don't make any fundamental changes that would break that.


#### Your classes should, at minimum:

- have a __repr__ and __str__ that provide a meaningful representation as a string  
- check that the bases or amino acids in the string are valid  
- work as the argument for a SequenceRecord
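
One way to meet the validation requirement without overriding the constructor in each child is to let the parent constructor check a class-level alphabet. The sketch below is only an illustration: the `alphabet` attribute and the `invalid_characters` helper are hypothetical names, not part of the original lab code.

```python
class Sequence:
    # Hypothetical class attribute: None means "accept any characters".
    alphabet = None

    def __init__(self, sequence: str) -> None:
        invalid = self.invalid_characters(sequence)
        if invalid:
            raise ValueError(
                f"Invalid characters for {type(self).__name__}: {sorted(invalid)}"
            )
        self.sequence = sequence

    @classmethod
    def invalid_characters(cls, sequence: str) -> set:
        """Return the characters that are not in this class's alphabet."""
        if cls.alphabet is None:
            return set()
        return set(sequence.upper()) - cls.alphabet


class DNASequence(Sequence):
    alphabet = set("ACGT")


class ProteinSequence(Sequence):
    # The 20 standard amino acids plus '*' for a stop codon.
    alphabet = set("ACDEFGHIKLMNPQRSTVWY*")


dna = DNASequence("atggca")  # passes: only a, c, g, t
try:
    ProteinSequence("MAZ")  # 'Z' is not in the protein alphabet
except ValueError as err:
    message = str(err)
```

With this layout the children only declare their alphabet, and the parent's constructor does the checking for both.
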
#### DNA

- a translate method that will convert the DNA sequence and return a ProteinSequence object  
- one other method of your choice (what you did previously is fine)  

#### Protein

- a method of your choice. If the method you would implement is too complex to reasonably implement, or would use resources you don't have access to, it is okay to leave it as a stub method (a method whose body is the single line "pass") and explain in comments what the method would do and its purpose
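
As an illustration of what a stub might look like, here is a minimal sketch. The class is a simplified stand-in for your protein class, and `predict_secondary_structure` is a hypothetical method name chosen only for this example.

```python
class ProteinSequence:
    """Simplified stand-in for the lab's protein class."""

    def __init__(self, sequence: str) -> None:
        self.sequence = sequence

    def predict_secondary_structure(self):
        # Stub: a full version would return a helix/sheet/coil label for each
        # residue, which needs a trained model or an external tool that this
        # lab does not provide.
        pass


protein = ProteinSequence("MATN")
result = protein.predict_secondary_structure()  # a stub returns None
```
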
 

Here is a dictionary you can copy into your code to help facilitate DNA translation:

aa_dict = {'M':['ATG'], 'F':['TTT', 'TTC'], 'L':['TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'], 'C':['TGT', 'TGC'], 'Y':['TAC', 'TAT'], 'W':['TGG'], 'P':['CCT', 'CCC', 'CCA', 'CCG'], 'H':['CAT', 'CAC'],  
'Q':['CAA', 'CAG'], 'R':['CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'], 'I':['ATT', 'ATC', 'ATA'], 'T':['ACT', 'ACC', 'ACA', 'ACG'],  
'N':['AAT', 'AAC'], 'K':['AAA', 'AAG'], 'S':['AGT', 'AGC', 'TCT', 'TCC', 'TCA', 'TCG'], 'V':['GTT', 'GTC', 'GTA', 'GTG'],  
'A':['GCT', 'GCC', 'GCA', 'GCG'], 'D':['GAT', 'GAC'], 'E':['GAA', 'GAG'], 'G':['GGT', 'GGC', 'GGA', 'GGG'], '*':['TAA','TAG','TGA']}
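
Because `aa_dict` maps each amino acid to its codons, translation is easier if the dictionary is inverted first so each codon maps to one amino acid. The sketch below shows one possible approach; the names `codon_to_aa` and `translate` are assumptions for illustration, and the function simply writes `*` for stop codons and ignores trailing bases that do not form a full codon.

```python
aa_dict = {'M':['ATG'], 'F':['TTT', 'TTC'], 'L':['TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'], 'C':['TGT', 'TGC'], 'Y':['TAC', 'TAT'], 'W':['TGG'], 'P':['CCT', 'CCC', 'CCA', 'CCG'], 'H':['CAT', 'CAC'],
'Q':['CAA', 'CAG'], 'R':['CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'], 'I':['ATT', 'ATC', 'ATA'], 'T':['ACT', 'ACC', 'ACA', 'ACG'],
'N':['AAT', 'AAC'], 'K':['AAA', 'AAG'], 'S':['AGT', 'AGC', 'TCT', 'TCC', 'TCA', 'TCG'], 'V':['GTT', 'GTC', 'GTA', 'GTG'],
'A':['GCT', 'GCC', 'GCA', 'GCG'], 'D':['GAT', 'GAC'], 'E':['GAA', 'GAG'], 'G':['GGT', 'GGC', 'GGA', 'GGG'], '*':['TAA','TAG','TGA']}

# Invert the table once: each codon becomes a key pointing at its amino acid.
codon_to_aa = {codon: aa for aa, codons in aa_dict.items() for codon in codons}


def translate(dna: str) -> str:
    """Translate a DNA string codon by codon, starting at the first base."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # drop an incomplete trailing codon
    return "".join(codon_to_aa[dna[i:i + 3]] for i in range(0, usable, 3))


protein = translate("atggcaacaaat")  # codons: atg gca aca aat
```

A single dictionary lookup per codon replaces a search through every amino acid's codon list.
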

#### I plan to create two classes that are children of the Sequence class, with new features specific to amino acids and nucleotides. These sequences will not be comparable.  
The Sequence methods that order or add two objects will check the object types and raise an exception if they are not the same. These are minor changes to the Sequence class that avoid the need for significant overrides.  

Additionally, I removed the nucleotide-specific functions from the parent class.

In [118]:
from typing import List
# Sequence class goes here
class Sequence:

    def __init__(self, sequence: str) -> None:
        # String that holds the raw sequence characters
        self.sequence = sequence

    def __str__(self):
        return 'A character sequence beginning with "{}"...'.format(self.sequence[0:10])

    def __repr__(self):
        return 'Sequence Object: {}'.format(self.sequence)

    def __len__(self):
        return len(self.sequence)

    def __eq__(self, other):
        # Find if the sequence string is equivalent.
        if not isinstance(other, Sequence):
            return NotImplemented
        return self.sequence == other.sequence

    def __lt__(self, other):
        if type(self) != type(other):
            raise TypeError("Cannot compare these types.")
        # Compare lengths (call len(), not the bound method) for sorting
        return len(self) < len(other)

    def __add__(self, other):
        if type(self) != type(other):
            raise TypeError("Cannot add these types.")
        # Return a new object of the same type instead of mutating self
        return type(self)(self.sequence + other.sequence)
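
The type check in `__lt__` is what makes mixed comparisons fail while still allowing sequences of one type to be sorted by length. The following is a minimal sketch of that behavior; the classes are trimmed-down stand-ins for the ones in this notebook, without validation.

```python
class Sequence:
    def __init__(self, sequence: str) -> None:
        self.sequence = sequence

    def __len__(self):
        return len(self.sequence)

    def __lt__(self, other):
        # Only objects of exactly the same class may be ordered.
        if type(self) is not type(other):
            raise TypeError("Cannot compare these types.")
        return len(self) < len(other)


class NucleoSeq(Sequence):
    pass


class ProtSeq(Sequence):
    pass


# sorted() relies on __lt__, so same-type sequences sort by length.
reads = [NucleoSeq("atggca"), NucleoSeq("at"), NucleoSeq("atgg")]
ordered = [s.sequence for s in sorted(reads)]

# Comparing a nucleotide sequence with a protein sequence raises TypeError.
try:
    NucleoSeq("atg") < ProtSeq("MAT")
    mixed_allowed = True
except TypeError:
    mixed_allowed = False
```
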

For the NucleoSeq class I added the convert method and a verify method that checks the contents of the string.  
The __init__ method raises an error if the string contains non-nucleotide characters, as most protein sequences do.

In [119]:
#Nucleotide
class NucleoSeq(Sequence):
    
    def __init__(self, sequence):
        super().__init__(sequence)
        # Validate on construction; verify() has to be called, not just referenced
        if not self.verify():
            raise TypeError('Input is not in nucleotide format')

    def verify(self):
        # Accept upper- or lowercase bases
        nucleotides = ['a', 'g', 'c', 't']
        return all(i in nucleotides for i in self.sequence.lower())

    def __str__(self):
        return 'A character sequence of nucleotides beginning with "{}"...'.format(self.sequence[0:10])

    def __repr__(self):
        return 'Nucleotide Sequence Object of length {} :{}'.format(len(self), self.sequence)

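    def compliment(self):
        # Map each base to its complementary base (input is lowercased first)
        convert = {'a':'t', 'g':'c', 't':'a', 'c':'g'}
        inter = ''
        for i in self.sequence.lower():
            j = convert[i]
            inter += j
        # Return a NucleoSeq so the result keeps the nucleotide-specific methods
        complimentarySequence = NucleoSeq(inter)
        return complimentarySequence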

    def convert(self):
        aa_dict = {'M':['ATG'], 'F':['TTT', 'TTC'], 'L':['TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'], 'C':['TGT', 'TGC'], 'Y':['TAC', 'TAT'], 'W':['TGG'], 'P':['CCT', 'CCC', 'CCA', 'CCG'], 'H':['CAT', 'CAC'],  
        'Q':['CAA', 'CAG'], 'R':['CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'], 'I':['ATT', 'ATC', 'ATA'], 'T':['ACT', 'ACC', 'ACA', 'ACG'],  
        'N':['AAT', 'AAC'], 'K':['AAA', 'AAG'], 'S':['AGT', 'AGC', 'TCT', 'TCC', 'TCA', 'TCG'], 'V':['GTT', 'GTC', 'GTA', 'GTG'],  
        'A':['GCT', 'GCC', 'GCA', 'GCG'], 'D':['GAT', 'GAC'], 'E':['GAA', 'GAG'], 'G':['GGT', 'GGC', 'GGA', 'GGG'], '*':['TAA','TAG','TGA']}
        inter = ""
        # Split the sequence into consecutive, non-overlapping codons
        codons = [self.sequence[i:i+3] for i in range(0, len(self.sequence)-2, 3)]

        for i in codons:
            for j in aa_dict:
                if i.upper() in aa_dict[j]:
                    inter += j
        convertedSequence = ProtSeq(inter)
        return convertedSequence



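For the protein sequence, I attempted a method that converts the sequence back into a nucleotide sequence. Because most amino acids are encoded by more than one codon, this method simply uses the first codon listed for each amino acid; recovering the actual original sequence would require more advanced software and metadata.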
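
To make the ambiguity concrete, the sketch below counts how many different DNA sequences could encode a given protein. The per-amino-acid codon counts are taken from `aa_dict` above; the function name `count_back_translations` is hypothetical, and `math.prod` assumes Python 3.8 or later.

```python
from math import prod

# Number of codons for each amino acid, counted from aa_dict.
codons_per_aa = {
    'M': 1, 'F': 2, 'L': 6, 'C': 2, 'Y': 2, 'W': 1, 'P': 4, 'H': 2,
    'Q': 2, 'R': 6, 'I': 3, 'T': 4, 'N': 2, 'K': 2, 'S': 6, 'V': 4,
    'A': 4, 'D': 2, 'E': 2, 'G': 4, '*': 3,
}


def count_back_translations(protein: str) -> int:
    """Multiply the codon choices available at each position."""
    return prod(codons_per_aa[aa] for aa in protein.upper())


short = count_back_translations("MATN")   # 1 * 4 * 4 * 2
longer = count_back_translations("MLSR")  # 1 * 6 * 6 * 6
```

Even a four-residue protein can have dozens or hundreds of possible source sequences, which is why picking the first codon is only a placeholder.
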

In [120]:
##Protein
class ProtSeq(Sequence):
    
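    def __init__(self, sequence):
        super().__init__(sequence)
        # Validate on construction; verify() has to be called, not just referenced
        if not self.verify():
            raise TypeError('Input is not in amino acid format')

    def verify(self):
        # One-letter codes for the 20 standard amino acids, plus '*' for stop
        amino_acids = 'ACDEFGHIKLMNPQRSTVWY*'
        return all(i in amino_acids for i in self.sequence.upper())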

    def convert(self):
        aa_dict = {'M':['ATG'], 'F':['TTT', 'TTC'], 'L':['TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'], 'C':['TGT', 'TGC'], 'Y':['TAC', 'TAT'], 'W':['TGG'], 'P':['CCT', 'CCC', 'CCA', 'CCG'], 'H':['CAT', 'CAC'],  
        'Q':['CAA', 'CAG'], 'R':['CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'], 'I':['ATT', 'ATC', 'ATA'], 'T':['ACT', 'ACC', 'ACA', 'ACG'],  
        'N':['AAT', 'AAC'], 'K':['AAA', 'AAG'], 'S':['AGT', 'AGC', 'TCT', 'TCC', 'TCA', 'TCG'], 'V':['GTT', 'GTC', 'GTA', 'GTG'],  
        'A':['GCT', 'GCC', 'GCA', 'GCG'], 'D':['GAT', 'GAC'], 'E':['GAA', 'GAG'], 'G':['GGT', 'GGC', 'GGA', 'GGG'], '*':['TAA','TAG','TGA']}
        inter = ''
        for i in self.sequence:
            # Use the first codon listed for each amino acid (keys are uppercase)
            j = aa_dict[i.upper()][0]
            inter += j
        nucSeq = NucleoSeq(inter)
        return nucSeq 

    def __str__(self):
        return 'A character sequence of amino acids beginning with "{}"...'.format(self.sequence[0:10])

    def __repr__(self):
        return 'Protein Sequence Object of length {} :{}'.format(len(self), self.sequence)

Minor tweaks were made to SequenceRecord to accept the different types of objects. With changes to a single line, the class works the same as before.

In [126]:
# SequenceRecord class goes here
class SequenceRecord:
    def __init__(self, record: str, sequenceO) -> None:
        # isinstance() accepts Sequence and any subclass (NucleoSeq, ProtSeq)
        if not isinstance(sequenceO, Sequence):
            # __init__ cannot return a value, so signal the problem with an exception
            raise TypeError("Not a sequence object.")
        self.sequenceO = sequenceO

        # Split the header into an identifier and a description
        recordList = record.split(' ', 1)

        self.record = recordList

    def __str__(self):
        return 'A sequence record of a "{}"'.format(self.record[0])

    def __repr__(self):
        # Join the header parts so headers without a description also work
        return 'SequenceRecord Object: {} | {}'.format(' '.join(self.record), self.sequenceO.sequence)

    def __eq__(self, other):
        return self.record[0] == other.record[0]

    def __lt__(self, other):
        # Compare lengths (call len(), not the bound method) for sorting
        return len(self.sequenceO) < len(other.sequenceO)
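
A single `isinstance` check is one way to accept the parent class and both children at once, since it returns True for subclasses. The sketch below uses bare stand-in classes and a hypothetical helper named `is_sequence` to show the behavior.

```python
class Sequence:
    def __init__(self, sequence: str) -> None:
        self.sequence = sequence


class NucleoSeq(Sequence):
    pass


class ProtSeq(Sequence):
    pass


def is_sequence(obj) -> bool:
    # True for Sequence and for any class that inherits from it.
    return isinstance(obj, Sequence)


checks = [
    is_sequence(Sequence("x")),
    is_sequence(NucleoSeq("atg")),
    is_sequence(ProtSeq("MAT")),
    is_sequence("atg"),  # a plain string is not a Sequence
]
```

New Sequence subclasses added later would pass the same check without any change to SequenceRecord.
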

The FASTA generator is now wrapped in a run function, which takes the file name and the type of sequence in the file and builds a list of SequenceRecord objects. For this demonstration it prints the first two records.

In [132]:
def fastaGenerator(fasta):
    
    with open(fasta) as f:
        header = None
        sequence = ""
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                # Emit the previous record before starting a new one
                if header:
                    yield(header,sequence)
                sequence = ""
                header = line.lstrip(">")
            else:
                sequence += line
    # Emit the final record once the file is exhausted
    yield(header,sequence)

def run(filename, seq_type):
    file = fastaGenerator(filename)
    # "p" selects protein sequences; anything else is treated as nucleotide
    sequenceClass = ProtSeq if seq_type == "p" else NucleoSeq
    sequenceList = []
    for i in file:
        newSequence = sequenceClass(i[1])
        newSequenceRecord = SequenceRecord(i[0], newSequence)
        sequenceList.append(newSequenceRecord)

    print(sequenceList[:2])
    #return sequenceList

run("testDNA.fasta", "n")


[SequenceRecord Object: Lungfish Protopterus dolloi | atggcaacaaatatccgaaaaactcacccgctccttaaaatcgtaaacaactccctaattgacctgccaaccccatcaaacatttcagcatgatgaaacttcggctcacttcttggattctgccttattactcaaattctcacaggattattcttagctatacactacactgctgacacctcaacagccttctcatctatcgcacacatcgcccgcgacgtaaactatggctggctcctgcgcaacattcacgcaaacggagcatccatattttttatttgcatctacatccacattggtcgtggaatttattacggatccttcctatatacagagacctgaaatatcggagtagttctttttcttttaactataataactgcattcgtaggctacgttctcccgtgaggtcaaatatccttctggggtgccacagtcatcactaatctcctctcagccgtcccatacctaggagataccctagttcaatggatttggggcggattttctgtagacaacgccaccctcacccgattcttcgcttttcacttccttctccccttcatcatctctgcaataaccgccgcacactttttattcctccacgaaacaggctcaaataacccaacaggattaaactctaacctagacaaaatctcgttccacccgtattttactataaaagaccttttagggttcctaatacttgcttcttttctctgcctattagccctattttctcctaatcttctaggggacccagaaaattttaccccggctaatccacttgtcaccccaacccacatcaagccagagtgatacttcctctttgcatatgcaattctgcgctccatcccaaataaacttggaggcgtactagcacttatagcgtcgatccttattctttttatcattccgtttcttcaccgagcaaaacaacgcacta
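
The header-and-sequence parsing can also be tried without a file on disk. The sketch below is a variant of the generator above, not the notebook's own function: `parse_fasta` is a hypothetical name, it accepts any iterable of lines, and the two records are made-up examples.

```python
import io


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    # Emit the last record, but nothing at all for an empty input.
    if header is not None:
        yield header, "".join(chunks)


example = io.StringIO(">seq1 first example\natgg\ncaac\n>seq2 second example\nttga\n")
records = list(parse_fasta(example))
```
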