**Name:** Whitney Brannen

**Email:** wbranne1@uncc.edu

## <font color = green><center>Code Layout</center></font>

<div class="alert alert-success">
    
1. The user is prompted to enter the name of a FASTA file.

    
2. The file name is passed to the `FileSequenceReader` class, which creates a reader object for that file.
    
* Apart from `fasta_parser()`, this class currently has no other methods, but it could be extended with methods that manipulate the file, e.g. reading only a certain number of lines, changing the file name, or creating an output file.

    
3. If the file name is valid (it has a proper FASTA extension), the file is parsed with the `fasta_parser()` method, which yields a `SequenceRecord` object for each header/sequence pair. The header and sequence of each record are accessible individually as attributes.


4. The `SequenceRecord` class then auto-detects whether the sequence is a `DNASequence` or a `ProteinSequence`, both of which are subclasses of the `Sequence` class.

    
5. Finally, the user can call the methods built into the `DNASequence` or `ProteinSequence` class to work with the input sequences (e.g. returning the reverse complement or translating DNA to protein).

* Although my code is written for user interaction, the user needs to know how to read and call methods on classes to use it properly. If this were modified later for easier user input, a printed manual could be added at the beginning with instructions on how to use each method.
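
The printed manual suggested above could be generated from the classes themselves. The following is a minimal sketch, assuming a hypothetical helper `print_manual` and a stand-in class `DemoSequence` (neither is part of the notebook code). It lists each public method together with the first line of its docstring.

```python
import inspect


class DemoSequence:
    """Stand-in class used only for this illustration (hypothetical)."""

    def __init__(self, seq):
        self.seq = seq

    def reverse(self):
        """Return the sequence reversed."""
        return self.seq[::-1]

    def length(self):
        """Return the number of characters in the sequence."""
        return len(self.seq)


def build_manual(cls):
    """Return one 'name(): summary' line per public method of cls."""
    lines = []
    for name, member in inspect.getmembers(cls, inspect.isfunction):
        if name.startswith("_"):
            continue  # skip __init__ and other private or dunder methods
        doc = inspect.getdoc(member) or "No description available."
        lines.append(f"{name}(): {doc.splitlines()[0]}")
    return lines


def print_manual(cls):
    """Print a short usage manual for cls."""
    print(f"Available methods for {cls.__name__}:")
    for line in build_manual(cls):
        print("  " + line)


print_manual(DemoSequence)
```

A helper like this only works well if every method carries a docstring, which the classes in this notebook would need to gain first.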


## <font color = blue><center>FileSequenceReader Class Info</center></font>

<div class="alert alert-block alert-info">
    
I chose to use instance methods for the `FileSequenceReader` class because it relies on user input that can change with each run. The class has one instance attribute, the name of the FASTA file. That is the only attribute needed, because further objects are created in other classes later in the code. A class variable does not seem necessary for my current code because no value needs to stay constant across all instances of this class. One possible addition would be a class-level default for the number of lines read in (e.g. 50 lines), which the user could access and change depending on their goals, for example through a class method. Another possibility would be an output file whose default name is built the same way on every run (e.g. {name_of_input}.output) unless the user specifies a different one.
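
To make the two suggested additions concrete, here is a minimal sketch. It assumes a hypothetical class `LineLimitedReader` with invented names `default_max_lines` and `output_suffix`; it is not the notebook's `FileSequenceReader`. It also assumes the default output name drops the input extension before adding `.output`.

```python
import os


class LineLimitedReader:
    """Hypothetical reader variant used only to illustrate class variables."""

    # Class variables: defaults shared by every instance
    default_max_lines = 50
    output_suffix = ".output"

    def __init__(self, file, max_lines=None, output_name=None):
        # Instance attributes: values that can differ between runs
        self.file = file
        self.max_lines = self.default_max_lines if max_lines is None else max_lines
        # Assumption: the default output name is the input name without its extension
        self.output_name = output_name or os.path.splitext(file)[0] + self.output_suffix


default_reader = LineLimitedReader("example_fasta.fa")
custom_reader = LineLimitedReader("example_fasta.fa", max_lines=10, output_name="results.txt")

print(default_reader.max_lines, default_reader.output_name)
print(custom_reader.max_lines, custom_reader.output_name)
```

The class variables hold the defaults that stay the same across runs, while the instance attributes capture whatever the user supplies for one particular file.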

## Establish `Sequence` class and `ProteinSequence` and `DNASequence` subclasses:

In [5]:
from functools import total_ordering

@total_ordering
class Sequence:
    def __init__(self,seq):
        self.seq = seq
        for i in self.seq:
            if i in ['B', 'J', 'O', 'U', 'X', 'Z']: # letters outside the four DNA bases and 20 standard amino acids
                raise ValueError ('Invalid base found.')
    
    def __len__(self):
        return len(self.seq) # len() must return an integer, not a string
    
    def __add__(self,other):
        return Sequence(self.seq + other.seq)
    
    def __repr__(self):
        return f'Sequence: {self.seq}'
    
    def __str__(self):
        return self.seq
    
    def __eq__(self,other):
        return self.seq == other.seq # equal when the sequences match exactly
    
    def __lt__(self,other):
        return len(self.seq) < len(other.seq) # ordered by sequence length


class DNASequence(Sequence):
    def __init__(self,seq):
        super().__init__(seq)
        valid_seq = ['A','T','G','C']
        for i in self.seq:
            if i not in valid_seq:
                raise ValueError ("Invalid base found.")
            
    def __repr__(self):
        return f'DNA Sequence: {self.seq}'
    
    def reverse_complement(self):
        complement = {"A": "T", "T": "A", "G": "C", "C": "G"} # base -> complementary base
        reverse_comp = "".join(complement[x] for x in reversed(self.seq))
        return f'Reverse Complement: {reverse_comp}'
    
    def translate(self):
        # invert aa_dict (amino acid -> codons) into a codon -> amino acid lookup
        codon_to_aa = {codon: aa for aa, codons in aa_dict.items() for codon in codons}
        aa_seq = []
        # stop 2 short of the end so a trailing codon of 2 or 1 bases is skipped
        for i in range(0, len(self.seq) - 2, 3):
            aa_seq.append(codon_to_aa[self.seq[i:i+3]])
        return ProteinSequence("".join(aa_seq))
            
        
class ProteinSequence(Sequence):
    def __init__(self,seq):
        super().__init__(seq)
        # check to see if sequence contains only ATGC, print warning for user
        if all(i in 'ATGC' for i in self.seq):
            print("Warning: Possible DNA Sequence!")
            
        
    def __repr__(self):
        return f'Protein Sequence: {self.seq}'
    
    
aa_dict = {'M':['ATG'], 'F':['TTT', 'TTC'], 'L':['TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'], 'C':['TGT', 'TGC'], 'Y':['TAC', 'TAT'], 'W':['TGG'], 'P':['CCT', 'CCC', 'CCA', 'CCG'], 'H':['CAT', 'CAC'],
'Q':['CAA', 'CAG'], 'R':['CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'], 'I':['ATT', 'ATC', 'ATA'], 'T':['ACT', 'ACC', 'ACA', 'ACG'],
'N':['AAT', 'AAC'], 'K':['AAA', 'AAG'], 'S':['AGT', 'AGC', 'TCT', 'TCC', 'TCA', 'TCG'], 'V':['GTT', 'GTC', 'GTA', 'GTG'],
'A':['GCT', 'GCC', 'GCA', 'GCG'], 'D':['GAT', 'GAC'], 'E':['GAA', 'GAG'], 'G':['GGT', 'GGC', 'GGA', 'GGG'], '*':['TAA','TAG','TGA']}    
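
The `total_ordering` decorator used above derives the remaining comparison operators from `__eq__` and `__lt__`. The sketch below illustrates this with a stripped-down stand-in class, `MiniSequence`, which is hypothetical and exists only for this example.

```python
from functools import total_ordering


@total_ordering
class MiniSequence:
    """Stripped-down stand-in for the Sequence class (illustration only)."""

    def __init__(self, seq):
        self.seq = seq

    def __len__(self):
        # len() requires an integer return value
        return len(self.seq)

    def __eq__(self, other):
        return self.seq == other.seq

    def __lt__(self, other):
        return len(self.seq) < len(other.seq)


short = MiniSequence("ATG")
longer = MiniSequence("ATGCCC")

print(len(short))                    # integer length
print(short < longer)                # defined directly by __lt__
print(longer >= short)               # derived by total_ordering
print(short == MiniSequence("ATG"))  # content comparison from __eq__
```

Because equality compares sequence content while ordering compares length, two different sequences of the same length are neither equal nor less than one another. This is worth keeping in mind when sorting records.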

## Establish `FileSequenceReader` class for FASTA files and `SequenceRecord` class for FASTA file organization

In [7]:
class FileSequenceReader:
    
    def __init__(self,file):
        self.file = file
        valid_extensions = ["fa","fasta"]
        # rpartition splits on the last '.', so names such as 'my.sample.fa' are handled
        _, dot, extension = file.rpartition(".")
        if not dot: # no '.' in the name, so no extension was given
            raise ValueError ("Please include the file extension.")
        if extension not in valid_extensions: # catches invalid extension types
            raise ValueError ("Invalid file type, please enter a FASTA file.")
            
            
 
    def fasta_parser(self):
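        with open(self.file) as fh: 
            header = None
            sequence = ""
            for line in fh:
                line = line.strip()
                if line.startswith(">"):
                    if header is not None: # the previous record is complete, so emit it
                        yield SequenceRecord(header,sequence)
                    header = line
                    sequence = "" # start collecting the new record
                else:
                    sequence += line
        if header is not None: # emit the final record; an empty file yields nothing
            yield SequenceRecord(header,sequence)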
    

class SequenceRecord:
    
    def __init__(self,header,sequence):
        self.header = header
        try: # auto detecting if DNA sequence or protein sequence
            self.sequence = DNASequence(sequence)
        except ValueError: # if not a dna, try a protein
            # ProteinSequence raises ValueError for invalid letters; a plain Sequence
            # applies the same check, so no further fallback is possible
            self.sequence = ProteinSequence(sequence)
        

    def __str__(self):
        return self.header + " " + str(self.sequence)
    
    def __repr__(self):
        return f"Header: {self.header}\n Sequence:{str(self.sequence)}"
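
The generator pattern used by `fasta_parser()` can be tried without a file on disk. The following is a minimal sketch, assuming a hypothetical standalone function `parse_fasta` that yields plain `(header, sequence)` tuples instead of `SequenceRecord` objects; the two-record FASTA text is made up for the demonstration.

```python
import io


def parse_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # previous record is complete
            header = line
            chunks = []
        else:
            chunks.append(line)
    if header is not None:  # emit the last record, if any
        yield header, "".join(chunks)


# Made-up two-record FASTA text, used only for this demonstration
demo_text = ">seq1 example\nATGC\nGGTA\n>seq2 example\nMKVL\n"
records = list(parse_fasta(io.StringIO(demo_text)))
for header, sequence in records:
    print(header, sequence)
```

Collecting sequence lines in a list and joining them once per record avoids repeated string concatenation, which can matter for long sequences.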
    

## <font color = red><center>User interaction code & output below</center></font>

In [11]:

fasta_input = FileSequenceReader(input("Enter the name of your file: "))

for record in fasta_input.fasta_parser(): # iterate through generator function to see results
    print(record.header) # prints the header of the SequenceRecord
    print(record.sequence) # prints the sequence, stored as a ProteinSequence or DNASequence object
    if isinstance(record.sequence, DNASequence): # translate and reverse_complement exist only for DNA
        print(record.sequence.translate()) # translate function to display protein sequence
        print(record.sequence.reverse_complement()) # reverse complement function of dna sequence


        

Enter the name of your file: example_fasta.fa
>MD10G1276500 pacid=40089867 polypeptide=MD10G1276500 locus=MD10G1276500 ID=MD10G1276500.v1.1.491 annot-version=v1.1
CTCCTGTGTGCAATGTCTGCGGCGAGCAGGTGGGGCTTGGTGCCAATGGGGAGGTTTTCGTGGCATGCCACGAGTGTAATTTCCCCATTTGCAAGGCTTGTTTCGATGAAGATGTCAAGGCTGGGCGTAAAGTTTGCTTGCAGTGTGGTATTCCCTATGACGATAACCCGTTGGCGGAGTATGAAACAAAGGTGTCAGGCACTCGATCCACAATGGAAGCTCACCTGAATAATACACAGGATACAGGAATTCATGCTAGGCATATCAGCAGTGTGTCTACGTTGGATAGTGAATTAAACGATGAATCTGGCAATCCGATTTGGAAGAATAGAGTGGAAAGTTGGAAGGATAAGAAGGATAAGAAGGATAAAAAGATCAAGAAGAAAAAGGATACACCTAATGGGGAAAAAGAGGCTCAAATTCCACCTGAGAAGCAGATGACAGAGGAATATTCATCAGAGGCTGCGGAACCACTTTCAACTCTCGTCCCACTTCCATCTAACAGAATCACACCATACAGAACTGTTATAATTATGCGATTGATCATTCTCGCCCTTTTCTTCCATTATCGAGTAACAAATCCTGTTGATAGTGCTTACGGTCTATGGTTCACTTCGATCATATGTGAGATCTGGTTTGCTTTTTCTTGGGTGTTGGATCAGTTTCCTAAGTGGTCTCCAGTTAATCGGACTACATTTACTGACAGGTTATCTGCCAGGTTTGAAAGAGAGGGTGAACTCTCCGAGCTTGCTGCTGTGGATTTCTTCGTAAGTACAGTTGATCCGTTGAAAGAACCGCCCTTGATTACTGCCAATACCGTGCTTTCTATCCTTGCTG