# Task_02. In Silico PCR for Covid-19 diagnosis.
CDC(Center for Disease Control and Prevention) released the information about PCR primers/probes to detect Covid-19. (See [this page](https://www.cdc.gov/coronavirus/2019-ncov/lab/rt-pcr-panel-primer-probes.html) for more information). We are curious how these primers/probes work.

If you need more info about "PCR-based diagnosis", see [this video](https://www.youtube.com/watch?v=fkUDu042xic).

## Data files
<ul>
    <li> The genome of Covid-19: '../data/2019nCoV_genomes.2020_02_03.fa'
    <li> The primers for Covid-19 detection: '../data/2019nCoV_primers.fa'
</ul>
    
## Procedures
<ol>
<li> Read 2019nCoV primers from a FASTA file (**see below**).
<li> Read 2019nCoV genomes from a FASTA file (**see below**).
<li> Find the position of primers (F, R) on each genome sequence.
<li> Calculate the length of PCR amplicons for Covid-19 diagnostics.
</ol>

## Questions
<ol>
    <li> What is the length of amplicons generated by primers? Are they all same among the genomes?
    <li> Can these primers detect all SARS-CoV-2 genomes?
    <li> Can these primers detect MERS genomes? How about SARS genomes? 
    <li> Extract those amplicon sequences, and run the BLAST to all SARS-CoV-2 genomes.
    <li> Extract those amplicon sequences, and run the BLAST to all SARS-CoV-2 proteomes.
</ol>

In [1]:
# The function to get the reverse complementary sequences
rc = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
def revcomp(tmp_seq):
    # from the end of input sequence to the beginning, 
    # replace each nucleotide to its complementary one.
    #
    # The single line code at the end of this function is equivalent to the following 4 lines.
    # rv = []
    # for tmp_n in tmp_seq[::-1]:
    #     rv.append(rc[tmp_n])
    # return ''.join(rv)
    
    return ''.join([rc[x] for x in tmp_seq[::-1]])

#print(revcomp('AATTGGCC'))
#GGCCAATT
def reverse(tmp_seq):
    return ''.join([rc[x] for x in tmp_seq[::]])
#print(reverse('AATTGGCC'))
#TTAACCGG

In [2]:
# The function to read sequences from a FASTA file
def read_fasta(tmp_filename):
    rv = dict()
    f = open(tmp_filename, 'r')
    for line in f:
        if line.startswith('>'):
            tmp_h = line.strip().lstrip('>')
            rv[tmp_h] = ''
        else:
            rv[tmp_h] += line.strip().replace(' ', '')
    f.close()
    return rv

In [4]:
# Read primers and set it
filename_primers = '../data/2019nCoV_primers.fa'
primer_list = read_fasta(filename_primers)
for p_h, p_seq in primer_list.items():
    print(p_h, p_seq)


2019-nCoV_N1-F GACCCCAAAATCAGCGAAAT
2019-nCoV_N1-R TCTGGTTACTGCCAGTTGAATCTG
2019-nCoV_N1-P ACCCCGCATTACGTTTGGTGGACC
2019-nCoV_N2-F TTACAAACATTGGCCGCAAA
2019-nCoV_N2-R GCGCGACATTCCGAAGAA
2019-nCoV_N2-P ACAATTTGCCCCCAGCGCTTCAG
2019-nCoV_N3-F GGGAGCCTTGAATACACCAAAA
2019-nCoV_N3-R TGTAGCACGATTGCAGCATTG
2019-nCoV_N3-P AYCACATTGGCACCCGCAATCCTG
RP-F AGATTTGGACCTGCGAGCG
RP-R GAGCGGCTGTCTCCACAAGT
RP-P TTCTGACCTGAAGGCTCTGCGCG


In [5]:
# Read genomes 1
filename_genomes = '../data/2019nCoV_genomes.2020_02_03.fa'
genome_0203 = read_fasta(filename_genomes)
for g_h, g_seq in genome_0203.items():
    print(g_h.split()[0], g_seq[:25])
    

ENA|MN988668|MN988668.1 TTAAAGGTTTATACCTTCCCAGGTA
ENA|MN988669|MN988669.1 TTAAAGGTTTATACCTTCCCAGGTA
ENA|MN938384|MN938384.1 CAACCAACTTTCGATCTCTTGTAGA
ENA|MN975262|MN975262.1 ATTAAAGGTTTATACCTTCCCAGGT
ENA|MN996527|MN996527.1 AACCAACTTTCGATCTCTTGTAGAT
ENA|MN996528|MN996528.1 ATTAAAGGTTTATACCTTCCCAGGT
ENA|MN996529|MN996529.1 TACCTTCCCAGGTAACAAACCAACC
ENA|MN996530|MN996530.1 CCTTCCCAGGTAACAAACCAACCAA
ENA|MN996531|MN996531.1 ACCTTCCCAGGTAACAAACCAACCA
ENA|MN908947|MN908947.3 ATTAAAGGTTTATACCTTCCCAGGT
ENA|MN985325|MN985325.1 ATTAAAGGTTTATACCTTCCCAGGT
ENA|MN988713|MN988713.1 ATTAAAGGTTTATACCTTCCCAGGT
ENA|MN994467|MN994467.1 ATTAAAGGTTTATACCTTCCCAGGT
ENA|MN994468|MN994468.1 ATTAAAGGTTTATACCTTCCCAGGT
ENA|MN997409|MN997409.1 ATTAAAGGTTTATACCTTCCCAGGT


In [6]:
# Read genomes 2
filename_genomes = '../data/2019nCoV_genomes.2020_03_27.fa'
genome_0327 = read_fasta(filename_genomes)
for g_h, g_seq in genome_0327.items():
    print(g_h.split()[0], g_seq[:25])

MT246449 CTTTAAAATCTGTGTGGCTGTCACT
MT246450 CCAACCAACTTTCGATCTCTTGTAG
MT246451 ATCTCTTGTAGATCTGTTCTCTAAN
MT246452 GTTTATACCTTCCCAGGTAACAAAC
MT246453 CTTTAAAATCTGTGTGGCTGTCACT
MT246454 TTATACCTTCCCAGGTAACAAACCA
MT246455 AYCTCTTGTAGATCTGTTCTCTAAA
MT246456 TCTGTTCTCTAAACGAACTTTAAAA
MT246457 ATCTGTTCTCTAAACGAACTTTAAA
MT246458 GTATAATTAATAACTAATTACTGTC
MT246459 ACTATATACCTTCCCAGGTAACAAA
MT246460 AGCTAAGGTTTATACCTTCCCAGGT
MT246461 CCTTTAACTTTCGATCTCTTGTAGA
MT246462 AHNAAAGGTTTATACCTTCCCAGGT
MT246464 TGTAGATCTGTTCTCTAAACGAACT
MT246466 GGTTTATACCTTCCCAGGTAACAAA
MT246467 AAAGGTTTATACCTTCCCAGGTAAC
MT246468 GTTCTCTAAACGAACTTTAAAATCT
MT246469 ACCAACTTTCGATCTCTTGTAGATC
MT246470 CCCAGGTAACAAACCHTTHAACTTT
MT246471 CCTTTAACTTTCGATCTCTTGTAGA
MT246472 GTGTGGCTGTCACTCGGCTGCATGC
MT246473 CTTTAAAATCTGTGTGGCTGTCACT
MT246474 TATACCTTCCCAGGTAACAAACCHT
MT246475 CAGGTAACAAACCAACCAACTTTCG
MT246476 CTTTCGATCTCTTGTAGATCTGTTC
MT246477 CCHTTHAACTTTCGATCTCTTGTAG
MT246478 ACCTTCCCAGGTAACAAACCAACCA
MT246479 CTTTAAAATCT

In [7]:
#num1-1 : What is the length of amplicons generated by primers?
#find amplicon with cov_0203
amp_len = []
amp_list = []
count_amp = 0
for g_h, g_seq in genome_0203.items():
    for p_h, p_seq in primer_list.items():
        #print(p_h)
        #print(p_seq)
        #print(g_seq.find(p_seq))
        if p_h.endswith('F') and not p_h.startswith('R'):
            p_f = g_seq.find(p_seq)
            amp_start = g_seq[p_f:]
            #print(amp_start)
        if p_h.endswith('R') and not p_h.startswith('R'):
            rev = revcomp(p_seq)
            p_r = amp_start.find(rev)
            #print(p_r)
            amp = amp_start[:p_r]
            fin_amp = amp + rev
            #print(fin_amp)
            amp_list.append(fin_amp)
            amp_len.append(len(fin_amp))
            count_amp += 1
            #print(p_h)
            #print(amp)

#print(count_amp)
len_amp = sum(amp_len) #Answer for num1(length of amplicon, I find average)
print("Average lengths of amplicons :",len_amp/count_amp)


Average lengths of amplicons : 70.33333333333333


In [8]:
#num1-1 : What is the length of amplicons generated by primers?
#find amplicon with cov_0327
amp_len2 = []
amp_list2 = []
count_amp2 = 0
for g_h, g_seq in genome_0327.items():
    for p_h, p_seq in primer_list.items():
        #print(p_h)
        #print(p_seq)
        #print(g_seq.find(p_seq))
        if p_h.endswith('F') and not p_h.startswith('R'):
            p_f = g_seq.find(p_seq)
            amp_start = g_seq[p_f:]
            #print(amp_start)
        if p_h.endswith('R') and not p_h.startswith('R'):
            rev = revcomp(p_seq)
            p_r = amp_start.find(rev)
            #print(p_r)
            amp = amp_start[:p_r]
            fin_amp2 = amp + rev
            #print(fin_amp)
            amp_list2.append(fin_amp2)
            amp_len2.append(len(fin_amp2))
            count_amp2 += 1
            #print(p_h)
            #print(amp)

#print(count_amp)
len_amp2 = sum(amp_len2) #Answer for num1(length of amplicon, I find average)
print("Average lengths of amplicons :",len_amp2/count_amp2)


Average lengths of amplicons : 70.33333333333333


In [9]:
#num1_2 : Are they all same among the genomes?
amp_list = list(set(amp_list))
for i in amp_list:
    print("amplicon seqeunce in Cov0203 : ", i)
print("They all same among the genomes")

amp_list2 = list(set(amp_list2))
for i in amp_list2:
    print("amplicon seqeunce in Cov0327 : ", i)
print("They all same among the genomes")
print("also, amp 1 and 2 is same")

amplicon seqeunce in Cov0203 :  GACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGA
amplicon seqeunce in Cov0203 :  TTACAAACATTGGCCGCAAATTGCACAATTTGCCCCCAGCGCTTCAGCGTTCTTCGGAATGTCGCGC
amplicon seqeunce in Cov0203 :  GGGAGCCTTGAATACACCAAAAGATCACATTGGCACCCGCAATCCTGCTAACAATGCTGCAATCGTGCTACA
They all same among the genomes
amplicon seqeunce in Cov0327 :  GACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGA
amplicon seqeunce in Cov0327 :  TTACAAACATTGGCCGCAAATTGCACAATTTGCCCCCAGCGCTTCAGCGTTCTTCGGAATGTCGCGC
amplicon seqeunce in Cov0327 :  GGGAGCCTTGAATACACCAAAAGATCACATTGGCACCCGCAATCCTGCTAACAATGCTGCAATCGTGCTACA
They all same among the genomes
also, amp 1 and 2 is same


In [11]:
#num2 : Can these primers detect all SARS-CoV-2 genomes?
#Read SARS
#amp_0203
cov_list = []
filename_genomes = '../data/2019nCoV_genomes.2020_02_03.fa'
genome_cov = read_fasta(filename_genomes)
for g_h, g_seq in genome_0203.items():
    #print(g_h.split()[0], g_seq[:25])
    #print(g_seq)
    cov_list.append(g_seq)

ans_sheet = []
count_T = 0
count_F = 0
for i in cov_list:
    for j in amp_list:
        if j in i:
            ans_sheet.append('T')
            count_T += 1
        else:
            ans_sheet.append('F')
            count_F += 1

if 'F' in ans_sheet:
    print("Cov can't be detected")
else:
    print("Cov can be detected")
print('T : ',count_T)
print('F : ',count_F)

Cov can be detected
T :  45
F :  0


In [26]:
#num2 : Can these primers detect all SARS-CoV-2 genomes?
#Read SARS
#amp_0327
cov_list = []
filename_genomes = '../data/2019nCoV_genomes.2020_03_27.fa'
genome_cov = read_fasta(filename_genomes)
for g_h, g_seq in genome_0327.items():
    #print(g_h.split()[0], g_seq[:25])
    #print(g_seq)
    cov_list.append(g_seq)

ans_sheet = []
count_T = 0
count_F = 0
for i in cov_list:
    for j in amp_list2:
        if j in i:
            ans_sheet.append('T')
            count_T += 1
        else:
            ans_sheet.append('F')
            count_F += 1

if 'F' in ans_sheet:
    print("Cov can't be detected")
else:
    print("Cov can be detected")
print('T : ',count_T)
print('F : ',count_F) 

Cov can be detected
T :  219
F :  0


In [14]:
#num3 : Can these primers detect MERS genomes? How about SARS genomes? 
#Read SARS
sars_list = []
filename_genomes = '../data/SARS_genomes.2020_02_03.fa'
genome_sars = read_fasta(filename_genomes)
for g_h, g_seq in genome_sars.items():
    #print(g_h.split()[0], g_seq[:25])
    #print(g_seq)
    sars_list.append(g_seq)

ans_sheet = []
count_Ts = 0
count_Fs = 0
for i in sars_list:
    for j in amp_list:
        if j in i:
            ans_sheet.append('T')
            count_Ts += 1
        else:
            ans_sheet.append('F')
            count_Fs += 1
            
if 'F' in ans_sheet:
    print("Sars can't be detected")
else:
    print("Sars can be detected")
    
print('T : ',count_Ts)
print('F : ',count_Fs)    
    

#MERS
mers_list = []
filename_genomes = '../data/MERS_genomes.2020_02_03.fa'
genome_mers = read_fasta(filename_genomes)
for g_h, g_seq in genome_mers.items():
    #print(g_h.split()[0], g_seq[:25])
    #print(g_seq)
    mers_list.append(g_seq)

ans_sheet = []
count_Tm = 0
count_Fm = 0
for i in mers_list:
    for j in amp_list:
        if j in i:
            ans_sheet.append('T')
            count_Tm += 1
        else:
            ans_sheet.append('F')
            count_Fm += 1
            
if 'F' in ans_sheet:
    print("MERS can't be detected")
else:
    print("MERS can be detected")
    
    
print('T : ',count_Tm)
print('F : ',count_Fm)    
print("Both Sars and Mers don't have any part of amplicon")

Sars can't be detected
T :  0
F :  108
MERS can't be detected
T :  0
F :  1560
Both Sars and Mers don't have any part of amplicon


In [15]:
#num 4 : Extract those amplicon sequences, and run the BLAST to all SARS-CoV-2 genomes. 
#Are they all perfectly matched? Is there any mismatch or gap? 
#make fasta file from amplicon
line_num = 1
ofile = open("../data/amp_seq.fa", "w")

for i in range(len(amp_list)):
    ofile.write(">" + "\n" +amp_list[i] + "\n")
    line_num += 1
ofile.close()


In [17]:
#blastn -db 2019nCoV_genomes.2020_03_27.blastdb -query amp_seq.fa -out 20191324_jy_blastN.out -outfmt 6
f = open('../data/20191324_jy_blastN.out', 'r')
for i in f:
    #print(l.split()[4])
    if i[4] == 0 and i[5] == 0:
        print(l.split()[1], "hasn't no mismatches and gaps.")
    elif i[4] == 0:
        print(l.split()[1], "hasn't no mismatches but has", l.split()[5], "gaps")
    elif i[5] == 0:
        print(l.split()[1], "hasn't no gpas but has", l.split()[4], "mismatches")
    else:
        print(l.split()[1], "has", l.split()[4], "mismatches and ", l.split()[5], "gaps")

MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has

In [18]:
#num 5 : Extract those amplicon sequences, and run the BLAST to all SARS-CoV-2 proteomes. 
#Are they all perfectly matched?
#blastx -db 2019nCoV_proteins.2020_03_27.blastdb -query amp_seq.fa -out 20191324_jy_blastP.out -outfmt 6

f = open('../data/20191324_jy_blastP.out', 'r')
for i in f:
    #print(l.split()[4])
    if i[4] == 0 and i[5] == 0:
        print(l.split()[1], "hasn't no mismatches and gaps.")
    elif i[4] == 0:
        print(l.split()[1], "hasn't no mismatches but has", l.split()[5], "gaps")
    elif i[5] == 0:
        print(l.split()[1], "hasn't no gpas but has", l.split()[4], "mismatches")
    else:
        print(l.split()[1], "has", l.split()[4], "mismatches and ", l.split()[5], "gaps")

MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has 0 mismatches and  0 gaps
MT049951 has