# Final Exam Instructions
---

Write a Python program that takes as input a file containing DNA sequences in multi-FASTA format, and computes the answers to the following questions. You can choose to write one program with multiple functions to answer these questions, or you can write several programs to address them. We will provide a multi-FASTA file for you, and you will run your program to answer the exam questions. 

While developing your program(s), please use the following example file to test your work: dna.example.fasta

You'll be given a different input file to launch the exam itself.

Here are the questions your program needs to answer. The quiz itself contains the specific multiple-choice questions you need to answer for the file you will be provided.

---
1. How many records are in the file? A record in a FASTA file is defined as a single-line header, followed by lines of sequence data. The header line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is an optional description of the entry. There should be no space between the ">" and the first letter of the identifier. 
---
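
The notebook parses the file but never prints the record count itself. As a minimal sketch, assuming each record is marked by a `>` in the first column as described above, the count is just the number of header lines. `count_records` and the `example` list are hypothetical names used for illustration only.

```python
def count_records(lines):
    """Count FASTA records: one per line whose first column is '>'."""
    return sum(1 for line in lines if line.startswith(">"))

# Hypothetical in-memory lines standing in for an open file handle
example = [">seq1 first entry\n", "ACGT\n", "ACGT\n", ">seq2\n", "GGCC\n"]
print(count_records(example))  # 2
```

Because an open file iterates line by line, the same function can be called as `count_records(handle)` inside a `with open(...) as handle:` block. If identifiers are unique, `len(readFasta(filename))` should give the same number.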


In [11]:
def readFasta(filename):
    '''
    Read multi-line FASTA sequences from a file.
    Input : FASTA filename or file path.
    Output : dict mapping each sequence identifier to its sequence.
    '''
    try:
        fastaFile = open(filename, "r")
    except IOError:
        print("File ", filename, " does not exist!!")
        return {}

    seqs = {}
    with fastaFile:  # closes the file even if parsing fails
        for line in fastaFile:
            line = line.rstrip()
            if not line:  # skip blank lines; line[0] would raise IndexError
                continue
            if line[0] == '>':
                # header: the identifier is the first word, without the '>'
                name = line.split()[0][1:]
                seqs[name] = ""
            else:  # sequence, not header
                seqs[name] = seqs[name] + line
    return seqs


In [19]:
fasta['gi|142022655|gb|EQ086233.1|91']


'CTCGCGTTGCAGGCCGGCGTGTCGCGCAACGACGTGTGGGGCCTGACGGGCAGGGAGGATCTCGGCGGCGCCAACTATGCGGTCTTTCGGCTCGAAAGCCAGTTCCAGACCTCCGACGGCGCGCTGACCGTGCCCGGCTCCGCATTCAGTTCGCAAGCCTACGTCGGGCTCGGCGGCGACTGGGGGACCGTGACGCTCGGGCGCCAGTTCGATTTCGTCGGCGATCTGATGCCGGCTTTCGCGATCGGCGCGAACACGCCGGCCGGCCTGCTCGCGTGGGGCTTGCCGGCGAATGCGTCGGCGGGCGGTGCGCTCGACAACCGCGTGTGGGGCGTCCAGGTGAACAATGCGGTGAAGTACGTGAGCCCGACGTTCGGCGGATTGTCGTTCGGCGGCCTGTGGGGCTTCGGCAACGTGCCCGGCACGGTCGCGCGCAGCAGCGTGCAAAGCGCGATGCTGTCCTACACGCAAGGCGCGTTCAGCGCCGCGCTCGCTTATTTCGGCCAGCACGATGTAACTGCCGGTGGCAATCTGCGCAATTTCTCGGGCGGTGCAGGCTACAACGTCGGGCAGTTCCGCGTCTTCGGCATGGTGTCGGACGTGCGGATCAGCGCCGCCGCGCCGCTGCGGGCCACGACCTATGACGGCGGCTTGACCTATGCGGTCACGCCGGCGTTGCAGCTCGGCGGCGGCTTCCAGTACCAGCAGCGCGGCGGCGACATCGGCTCGGCCAACCAGGTCACGTTGAGCGCCGACTATTCGCTGTCGAAGCGTACCGGCCTTTACGTGGTATTCGCACGCGGGCACGACAGTGCGTATGGCGCGCAGGTCGAGGCGGCGCTCGGCGGGGCGGCGTCCGGCTCGACGCAGACCGCGGTCCGGCTCGGGCTGCGGCATCAGTTCTGACGATGCGCGAGAAACACGGGCTGCCGCGTACGCCGCGCGCGAGCCCGTGTTTTTCCGCCGGATTCAGAACCGATGCATCATCCCGACGCGCAA

---
2. What are the lengths of the sequences in the file? What is the longest sequence and what is the shortest sequence? Is there more than one longest or shortest sequence? What are their identifiers? 
---

In [13]:
file_path = "data/dna2.fasta"
fasta = readFasta(file_path)
def minMax(fasta_dict):
    '''Given a FASTA dict, return the identifiers and lengths of the longest and shortest sequences.'''
    seqs = {name: len(seq) for name, seq in fasta_dict.items()}
    mx,mn = max(seqs.values()), min(seqs.values())
    return {key:value for key,value in seqs.items() if value == mx or value == mn}

minMax(fasta)

{'gi|142022655|gb|EQ086233.1|255': 4894, 'gi|142022655|gb|EQ086233.1|346': 115}
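
The dictionary above mixes the longest and shortest entries together, so ties are not obvious at a glance. A small sketch, assuming the same identifier-to-sequence dict that `readFasta` returns, that reports each extreme separately along with every identifier that shares it. `length_extremes` and `toy` are hypothetical names, not part of the notebook.

```python
def length_extremes(fasta_dict):
    """Return the longest and shortest lengths, each with all identifiers that have it."""
    lengths = {name: len(seq) for name, seq in fasta_dict.items()}
    longest, shortest = max(lengths.values()), min(lengths.values())
    return {
        "longest": (longest, sorted(n for n, l in lengths.items() if l == longest)),
        "shortest": (shortest, sorted(n for n, l in lengths.items() if l == shortest)),
    }

# Hypothetical toy input with a tie at both ends
toy = {"a": "ACGT", "b": "AC", "c": "ACGT", "d": "AC"}
print(length_extremes(toy))
# {'longest': (4, ['a', 'c']), 'shortest': (2, ['b', 'd'])}
```

A list with more than one identifier answers the "is there more than one longest or shortest sequence?" part of the question directly.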

---
3. In molecular biology, a reading frame is a way of dividing the DNA sequence of nucleotides into a set of consecutive, non-overlapping triplets (or codons). Depending on where we start, there are six possible reading frames: three in the forward (5' to 3') direction and three in the reverse (3' to 5'). For instance, the three possible forward reading frames for the sequence `AGGTGACACCGCAAGCCTTATATTAGC` are: 

    `AGG TGA CAC CGC AAG CCT TAT ATT AGC`

    `A GGT GAC ACC GCA AGC CTT ATA TTA GC`

    `AG GTG ACA CCG CAA GCC TTA TAT TAG C` 

    These are called reading frames 1, 2, and 3 respectively. An open reading frame (ORF) is the part of a reading frame that has the potential to encode a protein. It starts with a start codon (`ATG`), and ends with a stop codon (`TAA`, `TAG` or `TGA`). For instance, `ATGAAATAG` is an ORF of length 9.

    Given an input reading frame on the forward strand (1, 2, or 3) your program should be able to identify all ORFs present in each sequence of the FASTA file, and answer the following questions: what is the length of the longest ORF in the file? What is the identifier of the sequence containing the longest ORF? For a given sequence identifier, what is the longest ORF contained in the sequence represented by that identifier? What is the starting position of the longest ORF in the sequence that contains it? The position should indicate the character number in the sequence. For instance, the following ORF in reading frame 1:
    ```
    >sequence1
    ATGCCCTAG
    ```
    starts at position 1.

    Note that because the following sequence:
    ```
    >sequence2
    ATGAAAAAA
    ```
    does not have any stop codon in reading frame 1, we do not consider it to be an ORF in reading frame 1. 

---
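
A minimal sketch of one way to scan a single forward reading frame while keeping the 1-based start position the question asks for. `orfs_in_frame` and `longest_orf_in_frame` are hypothetical names, not taken from the cells below. The sketch assumes that only the ORF beginning at the first `ATG` after the previous stop codon is reported; that is enough to find the longest ORF, but nested ORFs starting at later `ATG` codons are omitted.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def orfs_in_frame(seq, frame):
    """Return (start_position, orf) pairs for forward reading frame 1, 2 or 3.

    start_position is 1-based, as in the exam text.
    """
    seq = seq.upper()
    found = []
    start = None  # 0-based index of the first ATG seen since the last stop codon
    for i in range(frame - 1, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            found.append((start + 1, seq[start:i + 3]))
            start = None
    # An ATG with no in-frame stop codon after it is not an ORF and is dropped.
    return found

def longest_orf_in_frame(seq, frame):
    """Return (start_position, orf) for the longest ORF, or None if there is none."""
    return max(orfs_in_frame(seq, frame), key=lambda hit: len(hit[1]), default=None)

print(orfs_in_frame("ATGCCCTAG", 1))  # [(1, 'ATGCCCTAG')]
print(orfs_in_frame("ATGAAAAAA", 1))  # []
print(longest_orf_in_frame("AAAATGAGGGTGGGGTAAAAAAAA", 1))  # (4, 'ATGAGGGTGGGGTAA')
```

Applying `longest_orf_in_frame` to every sequence in the FASTA dict and taking the maximum by ORF length would give the file-wide answer together with its identifier and start position.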


In [46]:
def reverseComplemnet(dna):
    transtab = str.maketrans("ATCG", "TAGC")
    return dna.translate(transtab)[::-1].strip()
dna  =  reverseComplemnet('AGGTGACACCGCAAGCCTTATATTAGC')
dna = 'AAAATGAGGGTGGGGTAAAAAAAA'

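def orfs(dna, start='ATG', stop=('TAA', 'TAG', 'TGA'), frame='all'):
    _orfs = set()
    if frame == 'all':
        frame = 0
        step = 1   # test every position, i.e. all three frames
    else:
        step = 3   # only codons aligned with the 0-based offset `frame`
    for i in range(frame, len(dna) - 2, step):
        if dna[i:i+3] == start:
            orf = ''
            j = i
            # walk codon by codon, including a codon that ends exactly at the sequence end
            while j + 3 <= len(dna):
                orf += dna[j:j+3]
                if dna[j:j+3] in stop:
                    _orfs.add(orf)
                    break
                j += 3
    return _orfs  # a set cannot be indexed, so return it whole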


def find_orfs(dna, frame='all', complement= True,translate = True):
    if complement:
        _orfs = orfs(dna,frame = frame).union(orfs(reverseComplemnet(dna),frame = frame))
    else:
        _orfs = orfs(dna,frame =frame)
    
    if translate:
        from Bio.Seq import Seq
        _translated = set()
        for dna in _orfs:
            dna = Seq(dna)
            _translated.add(str(dna.translate()))
        return list(_translated)
    else:
        return list(_orfs)
    


In [52]:
# dict version
def reverseComplemnet(dna):
    transtab = str.maketrans("ATCG", "TAGC")
    return dna.translate(transtab)[::-1].strip()
dna  =  reverseComplemnet('AGGTGACACCGCAAGCCTTATATTAGC')
dna = 'AAAATGAGGGTGGGGTAAAAAAAA'

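def orfs(dna, start='ATG', stop=('TAA', 'TAG', 'TGA'), frame='all'):
    # ORFs grouped by the frame of their start codon (start index modulo 3)
    _orfs = {0: set(), 1: set(), 2: set()}
    if frame == 'all':
        frame = 0
        step = 1   # test every position, i.e. all three frames
    else:
        step = 3   # only codons aligned with the 0-based offset `frame`
    for i in range(frame, len(dna) - 2, step):
        if dna[i:i+3] == start:
            orf = ''
            j = i
            # walk codon by codon, including a codon that ends exactly at the sequence end
            while j + 3 <= len(dna):
                orf += dna[j:j+3]
                if dna[j:j+3] in stop:
                    _orfs[i % 3].add(orf)
                    break
                j += 3
    # only ORFs whose start index satisfies i % 3 == 1 (reading frame 2) are returned
    return _orfs[1]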


def find_orfs(dna, frame='all', complement= True,translate = True):
    if complement:
        _orfs = orfs(dna,frame = frame).union(orfs(reverseComplemnet(dna),frame = frame))
    else:
        _orfs = orfs(dna,frame =frame)
    
    if translate:
        from Bio.Seq import Seq
        _translated = set()
        for dna in _orfs:
            dna = Seq(dna)
            _translated.add(str(dna.translate()))
        return list(_translated)
    else:
        return list(_orfs)

dna = 'AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAGTAG'
orfs(dna) 


{'ATGTAG'}

In [21]:
import re
pattern = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))')
pattern.findall(dna)

['ATGAGGGTGGGG']

In [31]:
dna = 'AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAGTAG'
find_orfs(dna,frame=2,translate=True)

['MLLGSFRLIPKETLIQVAGSSPCNLS*']

In [34]:
import re

def find_all_proteins(dna):
    '''Return every forward-strand ORF of dna, in any frame, as a set of DNA strings.'''
    p = []
    starts = [m.start() for m in re.finditer('ATG', dna)]
    stops = [m.start() for m in re.finditer('TAG|TGA|TAA', dna)]

    for x in starts:
        for y in stops:
            # first downstream stop codon in the same frame as the start codon
            if (y - x) > 0 and ((y - x) % 3) == 0:
                p.append(dna[x:y+3])
                break
    return set(p)

In [36]:
find_all_proteins(dna)

{'ATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAA',
 'ATGATCCGAGTAGCATCTCAGTAG',
 'ATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAA',
 'ATGTAG'}

In [37]:
{seq_name:longest_orf(find_all_proteins(dna))  for seq_name,dna in fasta.items()}

{'gi|142022655|gb|EQ086233.1|91': 1296,
 'gi|142022655|gb|EQ086233.1|304': 147,
 'gi|142022655|gb|EQ086233.1|255': 1443,
 'gi|142022655|gb|EQ086233.1|45': 2394,
 'gi|142022655|gb|EQ086233.1|396': 1281,
 'gi|142022655|gb|EQ086233.1|250': 1560,
 'gi|142022655|gb|EQ086233.1|322': 189,
 'gi|142022655|gb|EQ086233.1|88': 138,
 'gi|142022655|gb|EQ086233.1|594': 213,
 'gi|142022655|gb|EQ086233.1|293': 711,
 'gi|142022655|gb|EQ086233.1|75': 504,
 'gi|142022655|gb|EQ086233.1|454': 1401,
 'gi|142022655|gb|EQ086233.1|16': 1644,
 'gi|142022655|gb|EQ086233.1|584': 132,
 'gi|142022655|gb|EQ086233.1|4': 249,
 'gi|142022655|gb|EQ086233.1|277': 279,
 'gi|142022655|gb|EQ086233.1|346': 0,
 'gi|142022655|gb|EQ086233.1|527': 1821}

In [51]:
{seq_name:longest_orf(orfs(dna,frame='all')) for seq_name,dna in fasta.items()}

{'gi|142022655|gb|EQ086233.1|91': 237,
 'gi|142022655|gb|EQ086233.1|304': 0,
 'gi|142022655|gb|EQ086233.1|255': 1185,
 'gi|142022655|gb|EQ086233.1|45': 528,
 'gi|142022655|gb|EQ086233.1|396': 1281,
 'gi|142022655|gb|EQ086233.1|250': 552,
 'gi|142022655|gb|EQ086233.1|322': 156,
 'gi|142022655|gb|EQ086233.1|88': 138,
 'gi|142022655|gb|EQ086233.1|594': 33,
 'gi|142022655|gb|EQ086233.1|293': 420,
 'gi|142022655|gb|EQ086233.1|75': 504,
 'gi|142022655|gb|EQ086233.1|454': 822,
 'gi|142022655|gb|EQ086233.1|16': 1458,
 'gi|142022655|gb|EQ086233.1|584': 27,
 'gi|142022655|gb|EQ086233.1|4': 126,
 'gi|142022655|gb|EQ086233.1|277': 279,
 'gi|142022655|gb|EQ086233.1|346': 0,
 'gi|142022655|gb|EQ086233.1|527': 570}

In [16]:
def longest_orf(orfs):
    orf_length = [len(orf) for orf in orfs]
    try :
        max_len = max(orf_length)
    except ValueError:
        max_len = 0
    return max_len 

longest_orf(find_orfs(dna,frame='all',translate=False))        

81

In [236]:
{seq_name:longest_orf(find_orfs(dna,frame='all',translate= False))  for seq_name,dna in fasta.items()}


{'gi|142022655|gb|EQ086233.1|91': 1632,
 'gi|142022655|gb|EQ086233.1|304': 147,
 'gi|142022655|gb|EQ086233.1|255': 1443,
 'gi|142022655|gb|EQ086233.1|45': 2394,
 'gi|142022655|gb|EQ086233.1|396': 1338,
 'gi|142022655|gb|EQ086233.1|250': 1560,
 'gi|142022655|gb|EQ086233.1|322': 189,
 'gi|142022655|gb|EQ086233.1|88': 171,
 'gi|142022655|gb|EQ086233.1|594': 213,
 'gi|142022655|gb|EQ086233.1|293': 1233,
 'gi|142022655|gb|EQ086233.1|75': 504,
 'gi|142022655|gb|EQ086233.1|454': 1566,
 'gi|142022655|gb|EQ086233.1|16': 1719,
 'gi|142022655|gb|EQ086233.1|584': 720,
 'gi|142022655|gb|EQ086233.1|4': 963,
 'gi|142022655|gb|EQ086233.1|277': 363,
 'gi|142022655|gb|EQ086233.1|346': 0,
 'gi|142022655|gb|EQ086233.1|527': 1821}

In [26]:
{seq_name:longest_orf(find_orfs(dna,frame=1,translate=False))  for seq_name,dna in fasta.items()}

{'gi|142022655|gb|EQ086233.1|91': 1632,
 'gi|142022655|gb|EQ086233.1|304': 0,
 'gi|142022655|gb|EQ086233.1|255': 1185,
 'gi|142022655|gb|EQ086233.1|45': 2268,
 'gi|142022655|gb|EQ086233.1|396': 1338,
 'gi|142022655|gb|EQ086233.1|250': 1341,
 'gi|142022655|gb|EQ086233.1|322': 156,
 'gi|142022655|gb|EQ086233.1|88': 171,
 'gi|142022655|gb|EQ086233.1|594': 33,
 'gi|142022655|gb|EQ086233.1|293': 1233,
 'gi|142022655|gb|EQ086233.1|75': 504,
 'gi|142022655|gb|EQ086233.1|454': 822,
 'gi|142022655|gb|EQ086233.1|16': 1458,
 'gi|142022655|gb|EQ086233.1|584': 240,
 'gi|142022655|gb|EQ086233.1|4': 963,
 'gi|142022655|gb|EQ086233.1|277': 336,
 'gi|142022655|gb|EQ086233.1|346': 0,
 'gi|142022655|gb|EQ086233.1|527': 1014}

In [14]:
find_orfs(fasta['gi|142022655|gb|EQ086233.1|527'],frame=2,translate=True)


['MSRRAAAATRACERQCAGNVSHR*',
 'MNSGASKPPAVTGSITRDSSGMPMIANPPPNAPRMNAITNTPANAIRIVASVNCRPIAAGVAITTSMPVRIAAWRTATERVARAASPHDGSGLNALHPCCFSNEVPGRCGTPTHRRPHGPHKHFSPGTIDLTKAECTAVYLSAIKTLTSVQYSLLSVQSMSNSEYLQLADAIAAQIADGTLRPGDRLPPQRHFADQHAIAASTAGRVYAELLRRGLVVGEVGRGTFVSGETRRGAAAPGEPRGVRIDFEFNYPTVPAQTALITRSLRGLHRPAELDAALREATSTGTPVIRSVAAAYLAQHEWAPSPDQLVFTGNGRQSIAAAVAAVVPTGGRCGVEALTYPFIKGIAAKLGISLVPLAMDDDGVRPDAVQKAHREARLSAIYVQPAIQNPLGTTMSAARRADLLRVVDKLDIPVIEDNVYGFLGDEPPLAALAPDACIVIDSLSKRVTPGLTLGFIVPPPRLRESVMASVRSGGWTASGFAFAAAQRLMRDGTVAELARLKRIDAIARQALAIERLAGFDVRTNGKCYHLWLTLPAHWRSQALVAAAARRDIGLTPSTTFAVSSGHAPNAIRLALAAPSMDQLDAGLRTLTAMLNGREGDFDATE*',
 'MLCRPLPVNTSWSGDGAHSCCARYAAATLRMTGVPVLVASRNAASSSAGRCNPRRLLVINAVWAGTVG*',
 'MASVRSGGWTASGFAFAAAQRLMRDGTVAELARLKRIDAIARQALAIERLAGFDVRTNGKCYHLWLTLPAHWRSQALVAAAARRDIGLTPSTTFAVSSGHAPNAIRLALAAPSMDQLDAGLRTLTAMLNGREGDFDATE*',
 'MNAITNTPANAIRIVASVNCRPIAAGVAITTSMPVRIAAWRTATERVARAASPHDGSGLNALHPCCFSNEVPGRCGTPTHRRPHGPHKHFSPGTIDLTKAECTAVYLSAIKTLTSVQYSLLSVQSMSNSEYLQLADAI

In [4]:
"""
What is the length of the longest ORF appearing in any sequence in any of the 3 forward reading frames?
"""

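with open("data/dna2.fasta", "r") as handle:
    lines = handle.readlines()

sequences = []
seq = ""
for line in lines:
    if not line.startswith('>'):
        # sequence line: strip spaces and the newline, then extend the current record
        seq = seq + line.replace(" ", "").replace("\n", "")
    else:
        # header line: store the record collected so far and start a new one
        sequences.append(seq)
        seq = ""

# Add the last seq
sequences.append(seq)

# Drop the empty string that was appended at the first header
sequences = sequences[1:]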

STOP_CODONS = ("TAA", "TGA", "TAG")

def find_orf(sequence, offset):
    """Return the ORFs in the forward frame that begins at 0-based index `offset`."""
    # Indices of every complete codon in this frame
    codon_starts = range(offset, len(sequence) - 2, 3)
    # Find all ATG indexes
    start_indexs = [i for i in codon_starts if sequence[i:i+3] == "ATG"]
    # Find all stop codon indexes
    stop_indexs = [i for i in codon_starts if sequence[i:i+3] in STOP_CODONS]

    orf = []
    # End of the previous ORF; starts at -1 so that an ATG at index 0 is not skipped.
    # The comparison is strict, so an ATG located exactly at `mark` is still skipped.
    mark = -1
    for start in start_indexs:
        for stop in stop_indexs:
            if start < stop and start > mark:
                orf.append(sequence[start:stop+3])
                mark = stop + 3
                break
    return orf

# Find orf 1
def find_orf_1(sequence):
    return find_orf(sequence, 0)

# Find orf 2
def find_orf_2(sequence):
    return find_orf(sequence, 1)

# Find orf 3
def find_orf_3(sequence):
    return find_orf(sequence, 2)


lengths = []
for i in sequences:
    # Reading frame 2 only; add find_orf_1(i) and find_orf_3(i) to cover all three forward frames
    found = find_orf_2(i)
    for j in found:
        lengths.append(len(j))
print(max(lengths))

1458


In [7]:
for i in sequences:
    print(find_orf_2(i))

['ATGCGGTCTTTCGGCTCGAAAGCCAGTTCCAGACCTCCGACGGCGCGCTGA', 'ATGCGTCGGCGGGCGGTGCGCTCGACAACCGCGTGTGGGGCGTCCAGGTGA', 'ATGCGGTGA', 'ATGTAA', 'ATGACGGCGGCTTGA', 'ATGCGGTCACGCCGGCGTTGCAGCTCGGCGGCGGCTTCCAGTACCAGCAGCGCGGCGGCGACATCGGCTCGGCCAACCAGGTCACGTTGA', 'ATGGCGCGCAGGTCGAGGCGGCGCTCGGCGGGGCGGCGTCCGGCTCGACGCAGACCGCGGTCCGGCTCGGGCTGCGGCATCAGTTCTGACGATGCGCGAGAAACACGGGCTGCCGCGTACGCCGCGCGCGAGCCCGTGTTTTTCCGCCGGATTCAGAACCGATGCATCATCCCGACGCGCAACGCCAGCTGGTTGCGGCCCGACGACTGCCCGGCCGTGCCGAGCACGTGCGCGTAG', 'ATGTACGTCGAGGTTCGCTTGCTCAGATCGTAG', 'ATGTTCTGCTGCGCGCGGCCGCTGAAGACCGTATCGTCGCTCGACAACGCACCGCCCGCGGTCGCGCCGCCGTTGTTGGCCTTCAGGTACGCGACGGCGGCGGAGAACGCGCCCGCCTGGTACTGCACGGCGGCGCTATAG', 'ATGCGGTAG', 'ATGTCGAGCGTGGGGTCGTACTGGCGGCCGAACGTGAGCGTGCCGAACCGGTCGGAATCGAGGCCGACGTAG', 'ATGGCGCTCGCGGCCGATACGGCCGCCATCATCTTCTTCATCGTCGATCTCCAGGTGTGGGCAGCCCACGCGGCGCGGTGCGGTTTCCGACGGCATACGTCAGCACCGGACGCGTGCAGCGAGTCCGTTTGTCGTTAG', 'ATGAGCGCGAGCAACGTATAG', 'ATGATCGCGAGCGCATACAGCAGGTAA', 'ATGGCCTGGTTGTCCATTTGA']
[]
['ATGCCGATCTTGC

---
4. A repeat is a substring of a DNA sequence that occurs in multiple copies (more than one) somewhere in the sequence. Although repeats can occur on both the forward and reverse strands of the DNA sequence, we will only consider repeats on the forward strand here. Also we will allow repeats to overlap themselves. For example, the sequence ACACA contains two copies of the sequence ACA - once at position 1 (index 0 in Python), and once at position 3. Given a length n, your program should be able to identify all repeats of length n in all sequences in the FASTA file. Your program should also determine how many times each repeat occurs in the file, and which is the most frequent repeat of a given length.
---
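
The notebook stops before this question, so here is a minimal sketch of one way to approach it. It assumes the same identifier-to-sequence dict that `readFasta` returns and counts overlapping substrings of length n across the whole file; whether a repeat must occur twice within one sequence or twice anywhere in the file is an interpretation the exam text leaves open, and this sketch takes the file-wide reading. `count_repeats` and `most_frequent_repeats` are hypothetical names.

```python
from collections import Counter

def count_repeats(fasta_dict, n):
    """Count every length-n substring over all sequences; keep those seen more than once."""
    counts = Counter()
    for seq in fasta_dict.values():
        # Sliding window with step 1, so overlapping copies are counted
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return {kmer: c for kmer, c in counts.items() if c > 1}

def most_frequent_repeats(fasta_dict, n):
    """Return (count, sorted list of repeats with that count), or (0, []) if none."""
    repeats = count_repeats(fasta_dict, n)
    if not repeats:
        return 0, []
    best = max(repeats.values())
    return best, sorted(k for k, c in repeats.items() if c == best)

# The example from the text: ACA occurs at positions 1 and 3 of ACACA
print(count_repeats({"s": "ACACA"}, 3))          # {'ACA': 2}
print(most_frequent_repeats({"s": "ACACA"}, 2))  # (2, ['AC', 'CA'])
```

Returning every repeat tied for the highest count, rather than a single one, makes it easier to spot when the most frequent repeat is not unique.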