# Homework 02. Hyerin Kim (student ID:20185290)
The CDC (Centers for Disease Control and Prevention) released information about the PCR primers/probes used to detect COVID-19 (see [this page](https://www.cdc.gov/coronavirus/2019-ncov/lab/rt-pcr-panel-primer-probes.html) for more information). We are curious how these primers/probes work.

If you need more info about "PCR-based diagnosis", see [this video](https://www.youtube.com/watch?v=fkUDu042xic).

## Data files
<ul>
    <li> The SARS-CoV-2 genomes: '../data/2019nCoV_genomes.2020_03_27.fa'
    <li> The primers for COVID-19 detection: '../data/2019nCoV_primers.fa'
</ul>
    
## Procedures
<ol>
<li> Read 2019nCoV primers from a FASTA file (**see below**).
<li> Read 2019nCoV genomes from a FASTA file (**see below**).
<li> Find the position of primers (F, R) on each genome sequence.
<li> Calculate the length of PCR amplicons for Covid-19 diagnostics.
</ol>
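
Steps 3 and 4 can be sketched on a toy sequence before touching the real data. This is only a minimal illustration, assuming exact primer matches and a reverse primer written 5'→3' on the opposite strand; `find_amplicon`, `toy_genome`, and the toy primer strings are hypothetical and are not taken from the data files.

```python
# Hypothetical toy example; the sequences below are made up for illustration.
COMPLEMENT = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

def reverse_complement(seq):
    """Reverse complement of a sequence containing only A, C, G, T."""
    return ''.join(COMPLEMENT[n] for n in reversed(seq))

def find_amplicon(genome, fwd, rev):
    """Return (start, end, length) of the amplicon on the genome strand,
    or None if either primer has no exact match."""
    # The reverse primer binds the opposite strand, so search for its
    # reverse complement on the genome strand.
    rev_site = reverse_complement(rev)
    start = genome.find(fwd)
    if start == -1:
        return None
    # Only accept a reverse-primer site downstream of the forward primer.
    rev_start = genome.find(rev_site, start + len(fwd))
    if rev_start == -1:
        return None
    end = rev_start + len(rev_site)
    return start, end, end - start

toy_genome = "TTTT" + "ACGTAC" + "GGGGGGGG" + "CATTGA" + "AAAA"
toy_fwd = "ACGTAC"
toy_rev = "TCAATG"  # reverse complement of CATTGA

result = find_amplicon(toy_genome, toy_fwd, toy_rev)
print(result)
```

The amplicon length is counted from the first base of the forward primer to the last base of the reverse-primer site, so both primers are included.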

## Questions
<ol>
    <li> What is the length of amplicons generated by primers? Are they all same among the genomes?
    <li> Can these primers detect all SARS-CoV-2 genomes?
    <li> Can these primers detect MERS genomes? How about SARS genomes? (you can find those genomes under '../data' directory).
    <li> Extract those amplicon sequences, and run the BLAST to all SARS-CoV-2 genomes. Are they all perfectly matched? Is there any mismatch or gap?
    <li> Extract those amplicon sequences, and run the BLAST to all SARS-CoV-2 proteomes. Are they all perfectly matched?
</ol>

In [9]:
# The function to get the reverse complementary sequences
rc = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
def revcomp(tmp_seq):
    # Walk the input sequence from its end to its beginning
    # and replace each nucleotide with its complement.
    #
    # The single line code at the end of this function is equivalent to the following 4 lines.
    # rv = []
    # for tmp_n in tmp_seq[::-1]:
    #     rv.append(rc[tmp_n])
    # return ''.join(rv)
    
    return ''.join([rc[x] for x in tmp_seq[::-1]])
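
The `rc` dictionary above covers only A, C, G, and T, so `revcomp` would raise a `KeyError` on a degenerate base such as the `Y` in the N3 probe. A possible variant that also handles IUPAC ambiguity codes is sketched below; `IUPAC_COMPLEMENT` and `revcomp_iupac` are hypothetical names and are not used elsewhere in this notebook.

```python
# Translation table for IUPAC nucleotide codes and their complements,
# e.g. Y (C or T) pairs with R (A or G), K (G or T) pairs with M (A or C).
IUPAC_COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVN",
    "TGCAYRSWMKVHDBN",
)

def revcomp_iupac(seq):
    """Reverse complement that accepts IUPAC ambiguity codes."""
    # Complement every base, then reverse the whole string.
    return seq.upper().translate(IUPAC_COMPLEMENT)[::-1]

print(revcomp_iupac("AYG"))
```

Applying the function twice returns the original sequence, which is a quick sanity check for any complement table.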


In [10]:
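# Read sequences from a FASTA file into a dictionary {header: sequence}
def read_fasta(tmp_filename):
    rv = dict()
    # 'with' closes the file even if an error occurs while reading
    with open(tmp_filename, 'r') as f:
        for line in f:
            if line.startswith('>'):
                # header line: drop the leading '>' and start a new record
                tmp_h = line.strip().lstrip('>')
                rv[tmp_h] = ''
            else:
                # sequence line: append it to the current record
                rv[tmp_h] += line.strip().replace(' ', '')
    return rv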

In [11]:
# Read primers
filename_primers = '../data/2019nCoV_primers.fa'
primer_list = read_fasta(filename_primers)

# Group the primers by set name: primer[set_name][type] = sequence
primer = dict()

for p_h, p_s in primer_list.items():
    # The header ends with the primer type: forward (F), reverse (R), or probe (P)
    p_set, p_kind = p_h[:-2], p_h[-1]

    # Reverse-complement the reverse primer so it can be compared directly with the genome sequence
    if p_kind == 'R':
        p_s = revcomp(p_s)

    # primer keys are primer set names; each value is a dictionary {type: sequence}
    primer.setdefault(p_set, dict())[p_kind] = p_s

        
print(primer)
print("there are", len(primer),"sets of primer")


{'2019-nCoV_N1': {'F': 'GACCCCAAAATCAGCGAAAT', 'R': 'CAGATTCAACTGGCAGTAACCAGA', 'P': 'ACCCCGCATTACGTTTGGTGGACC'}, '2019-nCoV_N2': {'F': 'TTACAAACATTGGCCGCAAA', 'R': 'TTCTTCGGAATGTCGCGC', 'P': 'ACAATTTGCCCCCAGCGCTTCAG'}, '2019-nCoV_N3': {'F': 'GGGAGCCTTGAATACACCAAAA', 'R': 'CAATGCTGCAATCGTGCTACA', 'P': 'AYCACATTGGCACCCGCAATCCTG'}, 'RP': {'F': 'AGATTTGGACCTGCGAGCG', 'R': 'ACTTGTGGAGACAGCCGCTC', 'P': 'TTCTGACCTGAAGGCTCTGCGCG'}}
there are 4 sets of primer


In [12]:
# Q1. What is the length of amplicons generated by primers? Are they all same among the genomes?

# Read genomes
filename_genomes = '../data/2019nCoV_genomes.2020_03_27.fa'
genome_list = read_fasta(filename_genomes)

# p_set: primer set name (2019-nCoV_N1, 2019-nCoV_N2, 2019-nCoV_N3, RP)
# p_type: dictionary {F: sequence, R: sequence, P: sequence}
print("Q1: What is the length of amplicons generated by primers? Are they all same among the genomes?")
for p_set, p_type in primer.items():
    # Start every primer set with empty accumulators
    A = dict()      # genome header -> amplicon length
    count = dict()  # amplicon length -> number of genomes
    SUM = 0

    for g_h, g_seq in genome_list.items():
        # p_type["F"]: forward primer sequence, p_type["R"]: reverse primer sequence (already reverse-complemented)
        if p_type["F"] in g_seq and p_type["R"] in g_seq:
            # The amplicon runs from the start of F to the end of R, so add the length of R
            amplicon_len = abs(g_seq.find(p_type["R"])-g_seq.find(p_type["F"]))+len(p_type["R"])
            A[g_h] = amplicon_len

    # Sum the amplicon lengths and count how many genomes share each length
    for c in A.values():
        SUM = SUM + c
        count[c] = count.get(c, 0) + 1

    # print the results
    if len(count) == 1:
        print("\n in", p_set,", mean of amplicon lengths is", SUM/len(A))
        print("** length of amplicons are all same among the genomes in", p_set,"primer sets")
    elif len(count) == 0:
        print("\n no amplicons are detected by",p_set, "primer sets with these genome sets")
    else:
        print("\n in", p_set,", mean of amplicon lengths is", SUM/len(A))
        print("** amplicon lengths vary among the genomes:", count)

Q1: What is the length of amplicons generated by primers? Are they all same among the genomes?

 in 2019-nCoV_N1 , mean of amplicon lengths is 72.0
** length of amplicons are all same among the genomes in 2019-nCoV_N1 primer sets

 in 2019-nCoV_N2 , mean of amplicon lengths is 67.0
** length of amplicons are all same among the genomes in 2019-nCoV_N2 primer sets

 in 2019-nCoV_N3 , mean of amplicon lengths is 72.0
** length of amplicons are all same among the genomes in 2019-nCoV_N3 primer sets

 no amplicons are detected by RP primer sets with these genome sets
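
A mean hides how the lengths are distributed, so one alternative is to tally every observed length with `collections.Counter`. The sketch below is a minimal illustration under the same exact-match assumption as the cell above; `amplicon_length_distribution` and `toy_genomes` are hypothetical, and `rev_site` is assumed to be already reverse-complemented, as in the `primer` dictionary.

```python
from collections import Counter

def amplicon_length_distribution(genomes, fwd, rev_site):
    """Tally amplicon lengths over a {header: sequence} dictionary.
    Genomes lacking an exact match for either primer are tallied under None."""
    lengths = Counter()
    for header, seq in genomes.items():
        start = seq.find(fwd)
        # Look for the reverse-primer site only downstream of the forward primer.
        rev_start = seq.find(rev_site, start + len(fwd)) if start != -1 else -1
        if start == -1 or rev_start == -1:
            lengths[None] += 1
        else:
            lengths[rev_start + len(rev_site) - start] += 1
    return lengths

# Made-up sequences, not taken from the data files.
toy_genomes = {
    "g1": "TTACGTGGGGCCATAA",
    "g2": "ACGTGGGGGGCCATTT",
    "g3": "TTTTTTTTTTTTTTTT",
}
dist = amplicon_length_distribution(toy_genomes, "ACGT", "CCAT")
print(dist)
```

A single key in the result means every genome yields the same amplicon length; a `None` key counts the genomes that the primer pair misses.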


In [13]:
# Q2. Can these primers detect all SARS-CoV-2 genomes?
# '../data/2019nCoV_genomes.2020_02_03.fa' and '../data/2019nCoV_genomes.2020_03_27.fa'

# Read genomes of SARS-CoV-2 genomes
filename_genomes = '../data/2019nCoV_genomes.2020_02_03.fa'
genome_list_old = read_fasta(filename_genomes)

# add 2019_nCoV_genomes.2020_02_03.fa genome list to my genome list dictionary
genome_list.update(genome_list_old)

# p_set: primer set name (2019-nCoV_N1, 2019-nCoV_N2, 2019-nCoV_N3, RP)
# p_type: dictionary {F: sequence, R: sequence, P: sequence}
print("Q2. Can these primers detect all SARS-CoV-2 genomes?")
for p_set, p_type in primer.items():
    # number of genomes that contain both primers of this set
    count = 0
    for g_h, g_seq in genome_list.items():
        # p_type["F"]: forward primer sequence, p_type["R"]: reverse primer sequence (already reverse-complemented)
        if p_type["F"] in g_seq and p_type["R"] in g_seq:
            count += 1

    # print the results
    if len(genome_list) == count:
        print("\n",p_set,"primer can detect all SARS-CoV-2 genomes")
    elif count == 0:
        print("\n",p_set,"primer can't detect any SARS-CoV-2 genomes")
    else:
        print("\n",p_set,"primer can't detect all SARS-CoV-2 genomes, but can detect",count,"of",len(genome_list),"SARS-CoV-2 genomes")

Q2. Can these primers detect all SARS-CoV-2 genomes?

 2019-nCoV_N1 primer can detect all SARS-CoV-2 genomes

 2019-nCoV_N2 primer can detect all SARS-CoV-2 genomes

 2019-nCoV_N3 primer can detect all SARS-CoV-2 genomes

 RP primer can't detect any SARS-CoV-2 genomes
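
The check above treats a genome as detected only when both primers match it exactly. Whether a primer with a mismatch still anneals depends on the assay conditions, so the following is only a sketch of how a mismatch-tolerant search could look; `find_with_mismatches` and its `max_mismatches` parameter are hypothetical and the default of 1 is an arbitrary assumption.

```python
def find_with_mismatches(genome, primer, max_mismatches=1):
    """Return the first index at which primer matches genome with at most
    max_mismatches substitutions (no gaps), or -1 if there is no such site."""
    n, m = len(genome), len(primer)
    for i in range(n - m + 1):
        mismatches = 0
        for a, b in zip(genome[i:i + m], primer):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many substitutions in this window
        else:
            # The inner loop finished without a break: acceptable site.
            return i
    return -1

# Made-up example: "CGAT" differs from the genome site "CGTT" by one base.
print(find_with_mismatches("AAACGTTAAA", "CGAT", max_mismatches=1))
print(find_with_mismatches("AAACGTTAAA", "CGAT", max_mismatches=0))
```

With `max_mismatches=0` the function behaves like `str.find`, so it could replace the `in` test without changing the exact-match results.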


In [14]:
# Q3. Can these primers detect MERS genomes? How about SARS genomes? (you can find those genomes under '../data' directory).

# Read genomes
MERS = '../data/MERS_genomes.2020_02_03.fa'
SARS = '../data/SARS_genomes.2020_02_03.fa'
MERS_list = read_fasta(MERS)
SARS_list = read_fasta(SARS)
# p_set: primer set name (2019-nCoV_N1, 2019-nCoV_N2, 2019-nCoV_N3, RP)
# p_type: dictionary {F: sequence, R: sequence, P: sequence}
print("Q3. Can these primers detect MERS genomes? How about SARS genomes?")
for p_set, p_type in primer.items():
    # number of MERS / SARS genomes that contain both primers of this set
    M_count = 0
    S_count = 0

    for g_h, g_seq in MERS_list.items():
        # p_type["F"]: forward primer sequence, p_type["R"]: reverse primer sequence (already reverse-complemented)
        if p_type["F"] in g_seq and p_type["R"] in g_seq:
            M_count += 1

    for g_h, g_seq in SARS_list.items():
        if p_type["F"] in g_seq and p_type["R"] in g_seq:
            S_count += 1

    # print the results; the "both" case is tested before the single cases so that it is reachable
    if M_count == 0 and S_count == 0:
        print("\n",p_set,"primer set can't detect both MERS and SARS genome")
    elif M_count != 0 and S_count != 0:
        print("\n",p_set, "primer set can detect both MERS and SARS genomes")
    elif M_count != 0:
        print("\n",p_set, "primer set can detect MERS genome")
    else:
        print("\n",p_set, "primer set can detect SARS genome")


Q3. Can these primers detect MERS genomes? How about SARS genomes?

 2019-nCoV_N1 primer set can't detect both MERS and SARS genome

 2019-nCoV_N2 primer set can't detect both MERS and SARS genome

 2019-nCoV_N3 primer set can't detect both MERS and SARS genome

 RP primer set can't detect both MERS and SARS genome


In [15]:
# For Q4 and Q5, extract the amplicon sequences

amplicon_seq = dict()   # unique amplicon sequences only
all_amplicon = dict()   # one amplicon per genome and primer set

for p_set, p_type in primer.items():
    for g_h, g_seq in genome_list.items():

        # p_type["F"]: forward primer sequence, p_type["R"]: reverse primer sequence (already reverse-complemented)
        if p_type["F"] in g_seq and p_type["R"] in g_seq:
            # the amplicon runs from the start of F to the end of R
            a_name = g_h+" amplified by "+p_set+" primer set"
            a_seq = g_seq[g_seq.find(p_type["F"]):g_seq.find(p_type["R"])+len(p_type["R"])]
            all_amplicon[a_name] = a_seq

            # keep only the first genome for each distinct amplicon sequence
            if a_seq not in amplicon_seq.values():
                amplicon_seq[a_name] = a_seq

# Write both dictionaries to FASTA files
with open("../data/amplicon_sequences.fa","w") as ofile:
    for a_name, a_seq in amplicon_seq.items():
        ofile.write(">"+a_name+"\n"+a_seq+"\n")

with open("../data/all_amplicon_sequences.fa","w") as ofile:
    for a_name, a_seq in all_amplicon.items():
        ofile.write(">"+a_name+"\n"+a_seq+"\n")
    

In [16]:
# Q4. Extract those amplicon sequences, and run the BLAST to all SARS-CoV-2 genomes. Are they all perfectly matched? Is there any mismatch or gap?
# BLAST was run with amplicon_sequences.fa as the query:
# blastn -db 2019nCoV_genomes.2020_03_27.blastdb -query amplicon_sequences.fa -out blastN_Q4.out -outfmt 6
# Columns of -outfmt 6: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
print("Q4. Extract those amplicon sequences, and run the BLAST to all SARS-CoV-2 genomes. Are they all perfectly matched? Is there any mismatch or gap?")
with open('../data/blastN_Q4.out','r') as f:
    for l in f:
        cols = l.split()
        # cols[2]: percent identity, cols[4]: number of mismatches, cols[5]: number of gap openings
        if cols[2] == '100.000':
            print("amplicon'",cols[0],"'is perfectly matched with", cols[1],"genomes")
        else:
            print("there are",cols[4],"mismatch and",cols[5],"gap in",cols[1])
    


Q4. Extract those amplicon sequences, and run the BLAST to all SARS-CoV-2 genomes. Are they all perfectly matched? Is there any mismatch or gap?
amplicon' MT246449 'is perfectly matched with MT049951 genomes
amplicon' MT246449 'is perfectly matched with MT066176 genomes
amplicon' MT246449 'is perfectly matched with MT066175 genomes
amplicon' MT246449 'is perfectly matched with MT072688 genomes
amplicon' MT246449 'is perfectly matched with MT093571 genomes
amplicon' MT246449 'is perfectly matched with MT093631 genomes
amplicon' MT246449 'is perfectly matched with MT123293 genomes
amplicon' MT246449 'is perfectly matched with MT123291 genomes
amplicon' MT246449 'is perfectly matched with MT123290 genomes
amplicon' MT246449 'is perfectly matched with MT123292 genomes
amplicon' MT246449 'is perfectly matched with LC528232 genomes
amplicon' MT246449 'is perfectly matched with LC528233 genomes
amplicon' MT246449 'is perfectly matched with MT126808 genomes
amplicon' MT246449 'is perfectly mat
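
A percent identity of 100 only describes the aligned region, so a hit can be reported as perfect even when the alignment covers just part of the amplicon. One way to tighten the test, sketched below, is to name the tabular columns and also compare the alignment length with the query length. `parse_blast6_line` and `is_full_length_perfect` are hypothetical helpers, and the sample line is made up rather than copied from `blastN_Q4.out`.

```python
# Default column order of BLAST tabular output (-outfmt 6).
BLAST6_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                  "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def parse_blast6_line(line):
    """Split one tab-separated BLAST line into a {column: value} dictionary."""
    return dict(zip(BLAST6_COLUMNS, line.rstrip("\n").split("\t")))

def is_full_length_perfect(hit, query_len):
    """True if the hit has no mismatch, no gap, and spans the whole query."""
    return (int(hit["mismatch"]) == 0
            and int(hit["gapopen"]) == 0
            and int(hit["length"]) == query_len)

# Made-up hit line for illustration only.
sample = "query1\tsubject1\t100.000\t72\t0\t0\t1\t72\t500\t571\t1e-30\t134\n"
hit = parse_blast6_line(sample)
print(hit["sseqid"], is_full_length_perfect(hit, query_len=72))
```

In this notebook the query length could be taken from `len(a_seq)` for each entry of `amplicon_seq`.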

In [17]:
# Q5. Extract those amplicon sequences, and run the BLAST to all SARS-CoV-2 proteomes. Are they all perfectly matched?
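# blastx -db 2019_nCoV_proteins.2020_03_27.blastdb -query amplicon_sequences.fa -out blastX_Q5.out -outfmt 6
# blastx translates each amplicon in all six reading frames, so the percent identity is computed on amino acids
print("Q5. Extract those amplicon sequences, and run the BLAST to all SARS-CoV-2 proteomes. Are they all perfectly matched?")
with open('../data/blastX_Q5.out','r') as f:
    for l in f:
        cols = l.split()
        # cols[1]: subject ID (a protein accession for blastx), cols[2]: percent identity,
        # cols[4]: number of mismatches, cols[5]: number of gap openings
        if cols[2] == '100.000':
            print("amplicon'",cols[0],"'is perfectly matched with", cols[1],"genomes")
        else:
            print("there are",cols[4],"mismatch and",cols[5],"gap in",cols[1])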
    


Q5. Extract those amplicon sequences, and run the BLAST to all SARS-CoV-2 proteomes. Are they all perfectly matched?
amplicon' MT246449 'is perfectly matched with QIK02783 genomes
amplicon' MT246449 'is perfectly matched with QIJ96530 genomes
amplicon' MT246449 'is perfectly matched with QIE07458 genomes
amplicon' MT246449 'is perfectly matched with BCB15098 genomes
amplicon' MT246449 'is perfectly matched with QII57305 genomes
amplicon' MT246449 'is perfectly matched with QIQ08827 genomes
amplicon' MT246449 'is perfectly matched with QIQ50109 genomes
amplicon' MT246449 'is perfectly matched with QIQ49979 genomes
amplicon' MT246449 'is perfectly matched with QIQ49789 genomes
amplicon' MT246449 'is perfectly matched with QHO62884 genomes
amplicon' MT246449 'is perfectly matched with QHW06046 genomes
amplicon' MT246449 'is perfectly matched with QHW06056 genomes
amplicon' MT246449 'is perfectly matched with QHZ00406 genomes
amplicon' MT246449 'is perfectly matched with QIM47464 genomes
a