# Finding complete references for use with VAPOR
## April 18th, 2025

Today we'll search through the [h5-data-updates](https://github.com/moncla-lab/h5-data-updates) repo for complete references, meaning sequences that begin with Uni12 and end with the reverse complement of Uni13. We'll do something simple: instead of aligning, we'll do string matching. Matches may be off by a configurable number of bases, starting with 1, since there is known to be some circulating diversity in these generally conserved regions.

First some imports and definitions:

In [1]:
from Bio import SeqIO
from Bio.Seq import Seq


uni12 = 'AGCAAAAGCAGG'.lower()
uni13 = str(Seq('AGTAGAAACAAGG').reverse_complement()).lower()

print('Uni12:', uni12)
print('Uni13:', uni13)

Uni12: agcaaaagcagg
Uni13: ccttgtttctact
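
Uni13 is reverse-complemented above because it is compared against the 3' end of each record as stored in the FASTA files. As a sanity check that does not depend on Biopython, here is a minimal sketch of the same operation in plain Python. The `reverse_complement` helper is hypothetical and only handles unambiguous `A`/`C`/`G`/`T` bases.

```python
# Translation table mapping each base to its complement (both cases).
COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')


def reverse_complement(sequence):
    """Complement every base, then reverse the string."""
    return sequence.translate(COMPLEMENT)[::-1]


uni13_primer = 'AGTAGAAACAAGG'
print(reverse_complement(uni13_primer).lower())
```

This should print the same `ccttgtttctact` string that Biopython produced for Uni13.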


Let's build a little functionality based on the [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance), which counts the positions at which two equal-length strings differ. We can then check that the reference we've built is indeed complete in this sense, and that the one we were using before is not.

In [2]:
def within_hamming_distance(first_string, second_string, distance=1):
    if len(first_string) != len(second_string):
        print('unequal')
        return False
    positions_that_mismatch = [c1 != c2 for c1, c2 in zip(first_string, second_string)]
    number_of_mismatches = sum(positions_that_mismatch)
    return number_of_mismatches <= distance


def has_uni12_and_uni13(record, distance=1):
    # Compare each end of the sequence with the expected conserved terminus,
    # allowing up to `distance` mismatches at each end.
    five_prime_end = record.seq.lower()[:len(uni12)]
    has_uni12 = within_hamming_distance(uni12, five_prime_end, distance)

    three_prime_end = record.seq.lower()[-len(uni13):]
    has_uni13 = within_hamming_distance(uni13, three_prime_end, distance)

    return has_uni12 and has_uni13

# can share this FASTA if necessary
moncla_lab_HA = SeqIO.read('./choose-reference/chosen-reference/ha.fasta', 'fasta')
print('Our crafted reference is complete:', has_uni12_and_uni13(moncla_lab_HA))

previous_reference = SeqIO.read('PQ719273.1.fasta', 'fasta')
print('The previous reference is complete:', has_uni12_and_uni13(previous_reference))

Our crafted reference is complete: True
The previous reference is complete: False
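
To make the mismatch tolerance concrete, here is a minimal, self-contained sketch on toy sequences. It uses plain strings rather than Biopython records, and the `hamming_distance` and `ends_match` helpers as well as the toy sequences are made up for illustration; they are not part of the analysis itself.

```python
UNI12 = 'agcaaaagcagg'
UNI13_RC = 'ccttgtttctact'  # reverse complement of Uni13


def hamming_distance(first, second):
    """Count positions at which two equal-length strings differ."""
    if len(first) != len(second):
        raise ValueError('strings must have equal length')
    return sum(c1 != c2 for c1, c2 in zip(first, second))


def ends_match(sequence, distance=1):
    """Check both termini, allowing up to `distance` mismatches at each end."""
    sequence = sequence.lower()
    if len(sequence) < max(len(UNI12), len(UNI13_RC)):
        return False  # too short to contain either terminus
    five_prime_ok = hamming_distance(UNI12, sequence[:len(UNI12)]) <= distance
    three_prime_ok = hamming_distance(UNI13_RC, sequence[-len(UNI13_RC):]) <= distance
    return five_prime_ok and three_prime_ok


# Toy sequences (hypothetical): the middle is arbitrary filler.
exact = UNI12 + 'acgtacgt' + UNI13_RC
one_off = 'agcgaaagcagg' + 'acgtacgt' + UNI13_RC  # one substitution at the 4th base
truncated = exact[3:]  # first three bases missing

for name, seq in [('exact', exact), ('one_off', one_off), ('truncated', truncated)]:
    print(name, ends_match(seq, distance=0), ends_match(seq, distance=1))
```

The single-substitution sequence fails at `distance=0` but passes at `distance=1`, while the truncated one fails at both, because a missing base shifts every later position out of register. This is the main limitation of string matching compared with alignment.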


Let's now see how many sequences are complete in this sense, and write those that are to a separate FASTA file for each segment.

In [3]:
segments = ['pb2', 'pb1', 'pa', 'ha', 'np', 'na', 'mp', 'ns']
for segment in segments:
    number_of_sequences_for_this_segment = 0
    number_of_complete_sequences = 0
    complete_records = []
    path_to_all_records = f'h5-data-updates/h5nx/{segment}/sequences.fasta'
    fasta = SeqIO.parse(path_to_all_records, 'fasta')
    for record in fasta:
        number_of_sequences_for_this_segment += 1
        if has_uni12_and_uni13(record):
            complete_records.append(record)
            number_of_complete_sequences += 1
    print(
        f'Total number of complete {segment} sequences:',
        number_of_complete_sequences, 'out of', number_of_sequences_for_this_segment
    )
    path_to_complete_segments = f'{segment}-complete.fasta'
    SeqIO.write(complete_records, path_to_complete_segments, 'fasta')

Total number of complete pb2 sequences: 3162 out of 30438
Total number of complete pb1 sequences: 3083 out of 30283
Total number of complete pa sequences: 2770 out of 30511
Total number of complete ha sequences: 2794 out of 37201
Total number of complete np sequences: 3339 out of 30959
Total number of complete na sequences: 2688 out of 32999
Total number of complete mp sequences: 3528 out of 30829
Total number of complete ns sequences: 3349 out of 30500
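
To put these counts in proportion, the short sketch below converts them to percentages. The numbers are copied from the output above; the `counts` dictionary is just a convenience for this illustration.

```python
# (complete, total) per segment, copied from the printed output.
counts = {
    'pb2': (3162, 30438),
    'pb1': (3083, 30283),
    'pa': (2770, 30511),
    'ha': (2794, 37201),
    'np': (3339, 30959),
    'na': (2688, 32999),
    'mp': (3528, 30829),
    'ns': (3349, 30500),
}

# Fraction of records per segment that carry both termini.
fractions = {segment: complete / total for segment, (complete, total) in counts.items()}

for segment, fraction in fractions.items():
    print(f'{segment}: {fraction:.1%} complete')
```

Under this definition of completeness, roughly one record in ten qualifies for each segment, with HA the lowest and MP the highest.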
