It is of some importance to solve this problem without AI assistance given how P24 was solved.

All FASTA input sequences must overlap with at least one other sequence by at least 50% of their length.
The overlapping subsequence must include either the 5' or the 3' end of a sequence (or both, for identical sequences).
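
To make the overlap condition concrete, here is a minimal sketch that measures how far the 3' end of one read overlaps the 5' end of another. The helper names `suffix_prefix_overlap` and `overlaps_enough` are hypothetical, and measuring the 50% against the shorter read is an assumption, not something stated in the problem notes above.

```python
def suffix_prefix_overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def overlaps_enough(left, right):
    # Assumption: "at least 50%" is measured against the shorter of the two reads
    k = suffix_prefix_overlap(left, right)
    return 2 * k >= min(len(left), len(right))


# Toy reads, not taken from the real dataset
print(suffix_prefix_overlap("ATTAGACCTG", "AGACCTGCCG"))  # 7
print(overlaps_enough("ATTAGACCTG", "AGACCTGCCG"))        # True
```

The brute-force scan is quadratic in read length, which is acceptable for a handful of reads of about 1 kb.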

In [2]:
#only used for rounding down numbers
from math import trunc

In [3]:
#reading in fasta function same as before
#taken from P5 solution

#takes a filename as input and returns a dictionary 
#where the keys are sequence IDs and the values are the corresponding sequences
def read_fasta(filename):
    sequences = {}
    with open(filename, "r") as fasta_file:
        # Initialize variables
        sequence_id = None
        sequence = ""

        # Iterate over each line in the file
        for line in fasta_file:
            # Remove newline character
            line = line.strip()

            if line.startswith(">"):
                # This is a header line
                if sequence_id is not None:
                    # Save the previous sequence
                    sequences[sequence_id] = sequence
                    sequence = ""

                # Extract the sequence ID from the header line
                sequence_id = line[1:]
            else:
                # This is a sequence line
                sequence += line

        # Save the last sequence
        if sequence_id is not None:
            sequences[sequence_id] = sequence

    return sequences


In [4]:
seqs = read_fasta('test.txt')

In [136]:
seqs = read_fasta('rosalind_long.txt')

Approach:

Start with the first sequence as the superstring. For every other sequence, take its 5' half and its 3' half as probes and check whether either probe occurs inside the superstring.
If the 5' half matches at the 3' end of the superstring, the rest of the sequence extends the superstring downstream; if the 3' half matches at the 5' end, the rest extends it upstream.
In cases of multiple matches, retain only the sequence with the minimum overlap and discard all others.
Make sure to check whether a sequence is already completely contained in the superstring; such a sequence adds nothing and can be dropped.
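
The approach above can be sketched on four short toy reads. This is an illustrative, self-contained version under the assumption that the reads assemble uniquely; the function name `assemble` and the reads are hypothetical and separate from the notebook's own `long` function defined further down.

```python
def assemble(reads):
    """Grow a superstring from reads[0] by anchoring on half-reads (toy sketch)."""
    superseq, pending = reads[0], list(reads[1:])
    progress = True
    while pending and progress:
        progress = False
        for read in list(pending):  # iterate over a copy while removing
            half = len(read) // 2
            head, tail = read[:half], read[half:]
            if read in superseq:  # fully contained: nothing to add
                pending.remove(read)
                progress = True
                continue
            pos = superseq.find(head)
            if pos != -1 and read.startswith(superseq[pos:]):
                # 5' half of the read sits at the 3' end of superseq: extend right
                superseq += read[len(superseq) - pos:]
                pending.remove(read)
                progress = True
                continue
            pos = superseq.find(tail)
            if pos != -1 and read.endswith(superseq[:pos + len(tail)]):
                # 3' half of the read sits at the 5' end of superseq: extend left
                superseq = read[:len(read) - (pos + len(tail))] + superseq
                pending.remove(read)
                progress = True
    return superseq


reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
print(assemble(reads))  # ATTAGACCTGCCGGAATAC
```

Two passes are needed here: the first read in the list cannot be attached until another read has extended the superstring far enough, which is why the outer loop repeats until nothing changes.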

In [137]:
#create a list of all sequences (the IDs are not needed from here on)
lseqs = list(seqs.values())
lseqs

['ACACTGTATAAATGGAGGTTAACGTGGTACTACCCTGCCATCACCAAGTTCATAACCGCAACGTATCGATTGCGGGGAGGCTGGCCTAGACCCCTCACGTGACTAACGTACGCCGATAGTCGAGGAAAGGTCCTCAAACTCTACTTAAACGCGTTACTCATCAACGTTTTAGTATAACACTATGTGAGCGCGGGACGGTCGCCCCCGGAATTATAGCTAGGAGACCAAGGTTCCTATGAGTAGTTAACCTTGGATCTCGAGATCGCCGATGGCTTGTGACTAATGGGCGTGTTTCAATATTTACCACGGTAGTCGAACGCATACGACCCGTCCCATGGGGTACATCCGACAGAACACAGGGTACCGGTAACCCACGCCAACCTACCCAGTAGACAGCTTGGATCAAACAGGTGTATAGCTACCCAACGCAAAGTTTGTAACAAACCGTCGCATTGTAGGGGCGTTCGGAGGGCAAGCCGCGGGTAATAGTTAGCTGGGCCAGTACCCCTGAGGCGTATACAGGGCTGATTATGCGGATTTTCGAGTCCTTCTGCACGGACCTCTTCGCTGGAATATTTGTCTACGCATTGATGTTTGCTAGATACGCGGCAAGATCATACCTGCTGGGGACGAAATTCCGCTTGGGACCTTGGTTAACGCGTCACTTCGCTCGGCTGTGGTTTTCGCGATAGTTAGGCTCGGGGTTCGCAGCGTTGCACGGCTTCCCATTTTCGGACGTTTGAATTGCGGCCAGGTTGTTCGTGCTCGAGGAGCGCCTGTTGGCTCCCCGCTATCACACTTAGATAGAGTTAACTCCTTATTAGTTGCTGCAACGATTACATTTAGCTGTGGCGTCAGCAGAACCGTATCGCCTAGGCGAGCTGTACCTCCACCATGCGAAGCATGGACCAAAAATTGTCTTTTCCTAGCACAGAACACGATTAGGCGCGGATGTTCATTTTCGGTCCTAGGAACGCTGTTAGG',
 'ATAGACTGA

In [138]:
#takes in a list of sequences and a superstring
#outputs the extended superstring and the list of sequences that are still unmerged
#only performs a single pass over the list!
def long(inlis, superseq):

    #we don't want to modify the original list
    lis = inlis[:]

    #iterate over inlis (not lis) so that removing from lis does not skip elements
    for sequence in inlis:

        half = len(sequence) // 2

        #5' half of sequence
        sequence_u = sequence[:half]

        #3' half of sequence
        sequence_d = sequence[half:]

        #checks if sequence is completely part of the superstring and removes it if so
        if sequence in superseq:
            lis.remove(sequence)

        #if the 5' half is in superseq, the sequence may extend the 3' end of superseq
        elif sequence_u in superseq:

            #theoretical overlap: from the match position to the end of superseq
            overlap = superseq[superseq.index(sequence_u):]

            #the sequence must start with the whole overlap
            #adds the overhang to superseq if so and removes sequence from lis
            if sequence.startswith(overlap):
                lis.remove(sequence)
                #slice instead of str.replace, which would delete every occurrence
                superseq = superseq + sequence[len(overlap):]

        #if the 3' half is in superseq, the sequence may extend the 5' end of superseq
        elif sequence_d in superseq:

            #theoretical overlap: from the start of superseq to the end of the match
            overlap = superseq[:superseq.index(sequence_d) + len(sequence_d)]

            #the sequence must end with the whole overlap
            if sequence.endswith(overlap):
                lis.remove(sequence)
                superseq = sequence[:len(sequence) - len(overlap)] + superseq

    #important to return the modified lis so merged sequences are not processed again
    return superseq, lis
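
Because `long` removes sequences from `lis` during the loop, it matters which list the `for` statement walks over. The following small sketch, using made-up values rather than real reads, shows that removing from the list being iterated skips the element after each removal, whereas iterating over a copy visits everything.

```python
items = ["a", "b", "c", "d"]

# Removing from the list that is being iterated: the element after each removal is skipped
visited_in_place = []
work = items[:]
for x in work:
    visited_in_place.append(x)
    work.remove(x)

# Iterating over a copy while removing from the original: every element is visited
visited_copy = []
work = items[:]
for x in work[:]:
    visited_copy.append(x)
    work.remove(x)

print(visited_in_place)  # ['a', 'c']
print(visited_copy)      # ['a', 'b', 'c', 'd']
```

Skipped sequences are not lost, since they stay in the returned list and can be picked up by a later pass, but each pass then does less work than it could.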

In [139]:
#initialize the superstring with the first sequence
#takes the first sequence and removes it from the list
sseq = lseqs[0]
lseqs.pop(0)    

'ACACTGTATAAATGGAGGTTAACGTGGTACTACCCTGCCATCACCAAGTTCATAACCGCAACGTATCGATTGCGGGGAGGCTGGCCTAGACCCCTCACGTGACTAACGTACGCCGATAGTCGAGGAAAGGTCCTCAAACTCTACTTAAACGCGTTACTCATCAACGTTTTAGTATAACACTATGTGAGCGCGGGACGGTCGCCCCCGGAATTATAGCTAGGAGACCAAGGTTCCTATGAGTAGTTAACCTTGGATCTCGAGATCGCCGATGGCTTGTGACTAATGGGCGTGTTTCAATATTTACCACGGTAGTCGAACGCATACGACCCGTCCCATGGGGTACATCCGACAGAACACAGGGTACCGGTAACCCACGCCAACCTACCCAGTAGACAGCTTGGATCAAACAGGTGTATAGCTACCCAACGCAAAGTTTGTAACAAACCGTCGCATTGTAGGGGCGTTCGGAGGGCAAGCCGCGGGTAATAGTTAGCTGGGCCAGTACCCCTGAGGCGTATACAGGGCTGATTATGCGGATTTTCGAGTCCTTCTGCACGGACCTCTTCGCTGGAATATTTGTCTACGCATTGATGTTTGCTAGATACGCGGCAAGATCATACCTGCTGGGGACGAAATTCCGCTTGGGACCTTGGTTAACGCGTCACTTCGCTCGGCTGTGGTTTTCGCGATAGTTAGGCTCGGGGTTCGCAGCGTTGCACGGCTTCCCATTTTCGGACGTTTGAATTGCGGCCAGGTTGTTCGTGCTCGAGGAGCGCCTGTTGGCTCCCCGCTATCACACTTAGATAGAGTTAACTCCTTATTAGTTGCTGCAACGATTACATTTAGCTGTGGCGTCAGCAGAACCGTATCGCCTAGGCGAGCTGTACCTCCACCATGCGAAGCATGGACCAAAAATTGTCTTTTCCTAGCACAGAACACGATTAGGCGCGGATGTTCATTTTCGGTCCTAGGAACGCTGTTAGG'

In [140]:
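#repeat until every sequence is merged or a full pass makes no progress
while lseqs:
    new_sseq, new_lseqs = long(lseqs, sseq)
    if new_sseq == sseq and len(new_lseqs) == len(lseqs):
        break
    sseq, lseqs = new_sseq, new_lseqs

print(sseq)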

GACCTCTTTCAGTACCGCCAAGTCTGATGTACTGGAACGGAGGTGGTCCTGGTGAGATACGTAGGCACCAGTGCTATCCGTGGGTAGCTCGGCAGCCTGTGACCTCCGGTTTTTGGGGATAGGTATAAGGAGGATGAGTCGGGAACTCTTCACGGCAACACGGCTAGGCCCGTACGGCGACCTATGCACAGCGTCTTCACCGCCGAAAAGTCTTTCGACGGTCCTGCAAAACACTCCCACTCGTTGAGCTCTAAACGTGGAAGCCTCAGCCTAAGAACCATAACTGGGGCCAGGGGGCACAAAAAGGGCCAGGTTTTTTTCCCTACAAACAGATCTCCACTGCCCACTCACACTTTGGCTAAGCGGCGTTGACCCACCAACGGCCTAGAAGGTATCAGGATCGACTCAAGCGAAAGCACAAACAGAATTCCGACAGAATTAATGTATTGGTTGCCCTAGGATGCAAATCCCTATCGGATACTTAAGCGAGCAGTAAGGTAGAGTAGACTGTAATGGGCCATGGGCAAGAAGGATATGGGGGGTATCATGATGTAAGACCTGAGTGGTAACAGCTAGGTTTGTAATGCTACAGGTCGTCACAATTATGGTCGGCCTACTCCTTCGATTATGGTCTGTATCTTTGAGGGGTGCCGCATCTCAGCTGCCGTCCTCCCATTTTGGATGAATGGTCATGCGAGGTCGGGAATACTGCCCTAGCCGGACGCGCACTGATCGGGCATAGGTAGGATGCAGCAATATTTCTGGTTTGTGAATGCGAAGCTAAACGGCAAGTGGGACAGCAACATGCATATGTATTATCGCAACGCTTACGAGCGGCGCCGGGCGCTAACCCCAAAGTGTCTCGACCCATGAATTAGTGTCTAGAGCGGCACGTGTCTGAGGGCAAGTCAGTCTCATGTTCACCGATTAGATAGGCTATAATGGGCTCTGATATTTCCCACTGCGAGGCCCGCGGGGTTGATGATGCCCAGACATAACG
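
A quick sanity check before writing the answer is to confirm that every input read occurs in the assembled superstring. The sketch below uses a hypothetical helper, `check_superstring`, on toy reads. In this notebook the reads would have to come from `seqs.values()`, since `lseqs` is consumed during assembly. Passing the check shows that the result is a superstring, not that it is the shortest one.

```python
def check_superstring(superseq, reads):
    """Return the reads that are NOT substrings of superseq (empty list = all covered)."""
    return [r for r in reads if r not in superseq]


# Toy data, not the real dataset
toy_reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
print(check_superstring("ATTAGACCTGCCGGAATAC", toy_reads))  # []
print(check_superstring("ATTAGACCTGCCG", toy_reads))        # ['CCTGCCGGAA', 'GCCGGAATAC']
```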

In [141]:
with open('p25ans.txt', 'w') as f:
    f.write(sseq)