# Task_02. In Silico PCR for Covid-19 diagnosis.
The CDC (Centers for Disease Control and Prevention) released the PCR primer and probe sequences used to detect Covid-19 (see [this page](https://www.cdc.gov/coronavirus/2019-ncov/lab/rt-pcr-panel-primer-probes.html) for more information). In this task we examine how these primers and probes work.

If you need more info about "PCR-based diagnosis", see [this video](https://www.youtube.com/watch?v=fkUDu042xic).

## Data files
<ul>
    <li> The genome of Covid-19: '../data/2019nCoV_genomes.2020_02_03.fa'
    <li> The primers for Covid-19 detection: '../data/2019nCoV_primers.fa'
</ul>
    
## Procedures
<ol>
<li> Read 2019nCoV primers from a FASTA file (**see below**).
<li> Read 2019nCoV genomes from a FASTA file (**see below**).
<li> Find the position of primers (F, R) on each genome sequence.
<li> Calculate the length of PCR amplicons for Covid-19 diagnostics.
</ol>
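
Steps 3 and 4 can be illustrated on a toy sequence. The following is a minimal sketch, not the notebook's own implementation: the template, the primers, and the helper names `reverse_complement` and `in_silico_pcr` are invented for illustration, and only exact matches are considered. Unlike the `detect_amplicon` function used later, this sketch searches for the reverse primer site only downstream of the forward primer.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    pairs = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
    return ''.join(pairs[base] for base in reversed(seq))


def in_silico_pcr(forward, reverse, template):
    """Return the amplicon bounded by both primers, or '' if none is found."""
    start = template.find(forward)
    if start == -1:
        return ''
    # The reverse primer anneals to the opposite strand, so its reverse
    # complement is what appears on the given (plus) strand.
    rc = reverse_complement(reverse)
    end = template.find(rc, start + len(forward))
    if end == -1:
        return ''
    return template[start:end + len(rc)]


# Toy template and primers, invented purely for illustration.
toy_template = 'TTTT' + 'ACGTAC' + 'GGGGCCCC' + 'TTAGGC' + 'AAAA'
toy_forward = 'ACGTAC'
toy_reverse = reverse_complement('TTAGGC')  # written 5'->3' as ordered: 'GCCTAA'

amplicon = in_silico_pcr(toy_forward, toy_reverse, toy_template)
print(amplicon, len(amplicon))  # ACGTACGGGGCCCCTTAGGC 20
```

The amplicon length is measured from the first base of the forward primer to the last base of the reverse primer site, which is also how the lengths are reported below.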

## Questions
<ol>
    <li> What is the length of amplicons generated by N1/N2/N3 primers? Any variation among the genomes?
    <li> What is the sequence of amplicons?
    <li> Can these primers detect all Covid-19 genomes?
    <li> Can these primers detect MERS genomes? How about SARS genomes? (you can find those genomes under '../data' directory).
</ol>

In [2]:
# The function to get the reverse complementary sequences
rs = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
def comp(seq):    
    return ''.join([rs[x] for x in seq[::-1]])
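
The dictionary lookup above raises a `KeyError` for any character other than uppercase A, C, G, T. Genome assemblies can contain `N` or lowercase bases, so a more tolerant variant may be useful. This is a sketch under that assumption; the name `reverse_complement` and the `COMPLEMENT` table are hypothetical and not part of the original notebook.

```python
# Translation table covering upper- and lower-case bases plus N.
# Characters not listed in the table are passed through unchanged.
COMPLEMENT = str.maketrans('ACGTNacgtn', 'TGCANtgcan')


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement('ATGCN'))  # NGCAT
```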

In [3]:
# Read sequences from a FASTA file into a dict: {header: sequence}
def read_fasta(tmp):
    r = dict()
    h_g = None
    with open(tmp, 'r') as fi:
        for line in fi:
            if line.startswith('>'):
                h_g = line.strip().lstrip('>')
                r[h_g] = ''
            elif h_g is not None:
                # Sequence lines may be wrapped; join them without spaces
                r[h_g] += line.strip().replace(' ', '')
    return r

In [4]:
# Read primers

primers_set = '../data/2019nCoV_primers.fa'
primer = read_fasta(primers_set)
primer['nCoV_IP2-F'] = 'ATGAGCTTAGTCCTGTTG'
primer['nCoV_IP2-R'] = 'CTCCCTTTGTTGTGTTGT'
primer['nCoV_IP2-P'] = 'AGATGTCTTGTGCTGCCGGTA'

F_P=[]
R_C=[]
P_P=[]

for p_h, p_seq in primer.items():
    if '-F' in p_h:
        F_P.append(p_seq)
    elif '-R' in p_h:
        C=comp(p_seq)
        R_C.append(C)
    else:
        P_P.append(p_seq)

In [5]:
# Read genomes
SARS_CoV2_genomes = '../data/2019nCoV_genomes.2020_02_03.fa'
genome_COVID = read_fasta(SARS_CoV2_genomes)

In [6]:
# Return the 0-based index of the first occurrence of primer p in genome g
# (None if p does not occur in g)
def count_idx(p, g):
    idx = g.find(p)
    return idx if idx != -1 else None
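
The function above reports only the first match. If a primer could occur more than once in a genome, listing every match position helps confirm that the amplicon is unambiguous. A minimal sketch, assuming exact matching; `find_all` is a hypothetical helper, not part of the original notebook:

```python
def find_all(primer, genome):
    """Return the 0-based start of every (possibly overlapping) exact match."""
    positions = []
    start = genome.find(primer)
    while start != -1:
        positions.append(start)
        # Resume one base further on so overlapping matches are also found.
        start = genome.find(primer, start + 1)
    return positions


print(find_all('ACA', 'ACACATTACA'))  # [0, 2, 7]
```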

In [7]:
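# Return the amplicon sequence (forward primer through the reverse primer site),
# or '' if the two primers do not both match in the expected order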
def detect_amplicon(fp, rc, g):
    amplicon=''
    if fp in g and rc in g:
        a=count_idx(fp, g)
        b=count_idx(rc, g)
        j=len(fp)
        o=len(rc)
        if a + j < b:
            amplicon=g[a : b + o]
    return amplicon

In [8]:
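# HW 2-1-1. Length of the amplicon for each primer set
g_seq = list(genome_COVID.values())
# Strip the '-F' / '-R' / '-P' suffix to get the primer set name
p_N = [p_h[:-2] for p_h in primer.keys()]
amplicon_s = {val : [] for val in set(p_N)}

for g in g_seq:
    for lie in range(len(F_P)):
        fp = F_P[lie]
        rc = R_C[lie]
        # Assumes the primers are stored as consecutive F, R, P triples,
        # so every third name in p_N starts a new primer set
        le=lie*3
        amplicon = detect_amplicon(fp, rc, g)
        amplicon_s[p_N[le]].append(amplicon)
        
# Report the amplicon length found in the first genome
for val in amplicon_s.keys():
    print(val, ": {}".format(len(amplicon_s[val][0])))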

2019-nCoV_N3 : 72
2019-nCoV_N2 : 67
nCoV_IP2 : 108
2019-nCoV_N1 : 72
RP : 0


In [9]:
# HW 2-1-2. Are they all same among the genomes?
for val in amplicon_s.keys():
    if len(amplicon_s[val][0])!=0:
        if len(set(amplicon_s[val])) == 1 and set(amplicon_s[val]) != {''}:
            print(val, ": True")

2019-nCoV_N3 : True
2019-nCoV_N2 : True
nCoV_IP2 : True
2019-nCoV_N1 : True


In [10]:
#HW 2-2 SARS-CoV2 case
for val in amplicon_s.keys():
    if '' in list(amplicon_s[val]):
        print(val, "primer cannot detect {} cases in {} total sample genomes.".format(amplicon_s[val].count(''), len(amplicon_s[val])))
    else:
        print(val, "primer can detect {} cases in {} total sample genomes.".format(len(amplicon_s[val]), len(amplicon_s[val])))

2019-nCoV_N3 primer can detect 15 cases in 15 total sample genomes.
2019-nCoV_N2 primer can detect 15 cases in 15 total sample genomes.
nCoV_IP2 primer can detect 15 cases in 15 total sample genomes.
2019-nCoV_N1 primer can detect 15 cases in 15 total sample genomes.
RP primer cannot detect 15 cases in 15 total sample genomes.


In [11]:
#HW 2-3-1 MERS case
MERS_genomes = '../data/MERS_genomes.2020_02_03.fa'
genome_MERS = read_fasta(MERS_genomes)
M_seq = list(genome_MERS.values())
M_amplicon_s = {val : [] for val in set(p_N)}

for gM in M_seq:
    for lie in range(len(F_P)):
        fp = F_P[lie]
        rc = R_C[lie]
        le=lie*3
        amplicon = detect_amplicon(fp, rc, gM)
        M_amplicon_s[p_N[le]].append(amplicon)
        

for val in M_amplicon_s.keys():
    if set(M_amplicon_s[val]) == {''}:
        print(val, "cannot detect MERS.")
    else:
        if '' in list(M_amplicon_s[val]):
            print(val, "primer cannot detect {} cases in {} total sample genomes.".format(M_amplicon_s[val].count(''), len(M_amplicon_s[val])))
        else:
            print(val, "primer can detect {} cases in {} total sample genomes.".format(len(M_amplicon_s[val]), len(M_amplicon_s[val])))

2019-nCoV_N3 cannot detect MERS.
2019-nCoV_N2 cannot detect MERS.
nCoV_IP2 cannot detect MERS.
2019-nCoV_N1 cannot detect MERS.
RP cannot detect MERS.


In [12]:
#HW 2-3-2 SARS case
SARS_genomes = '../data/SARS_genomes.2020_02_03.fa'
genome_SARS = read_fasta(SARS_genomes)
S_seq = list(genome_SARS.values())
S_amplicon_s = {val : [] for val in set(p_N)}

for gS in S_seq:
    for lie in range(len(F_P)):
        fp = F_P[lie]
        rc = R_C[lie]
        le=lie*3
        amplicon = detect_amplicon(fp, rc, gS)
        S_amplicon_s[p_N[le]].append(amplicon)
        

for val in S_amplicon_s.keys():
    if set(S_amplicon_s[val]) == {''}:
        print(val, "cannot detect SARS.")
    else:
        if '' in list(S_amplicon_s[val]):
            print(val, "primer cannot detect {} cases in {} total sample genomes.".format(S_amplicon_s[val].count(''), len(S_amplicon_s[val])))
        else:
            print(val, "primer can detect {} cases in {} total sample genomes.".format(len(S_amplicon_s[val]), len(S_amplicon_s[val])))

2019-nCoV_N3 cannot detect SARS.
2019-nCoV_N2 cannot detect SARS.
nCoV_IP2 cannot detect SARS.
2019-nCoV_N1 cannot detect SARS.
RP cannot detect SARS.


In [13]:
#HW 2-4-1 Extract amplicons
def w_amplicon_s(fi, gc):
    with open(fi, 'w') as f:
        for g in list(gc.keys()):
            if len(gc[g]) == 1:
                f.write(">{}\n".format(g))
                f.write("{}\n".format(gc[g][0]))
            else:
                # Several distinct amplicons: number each record
                for i in range(len(gc[g])):
                    f.write(">{} #{}\n".format(g, i+1))
                    f.write("{}\n".format(gc[g][i]))
    
amplicon_s_no = {"{}->Amplicons".format(val) : [] for val in set(p_N)}
for val in list(amplicon_s.keys()):
    amplicon_s_no["{}->Amplicons".format(val)] = list(set(amplicon_s[val]))
                    
for val in list(amplicon_s.keys()):
        if '' in amplicon_s_no["{}->Amplicons".format(val)]:
            amplicon_s_no["{}->Amplicons".format(val)].remove('')
            
w_amplicon_s("20191116_SeonjunPark_HW2_amp.fa", amplicon_s_no)
with open("20191116_SeonjunPark_HW2_amp.fa", "r") as f:
    for line in f:
        print(line, end='')

>2019-nCoV_N3->Amplicons
GGGAGCCTTGAATACACCAAAAGATCACATTGGCACCCGCAATCCTGCTAACAATGCTGCAATCGTGCTACA
>2019-nCoV_N2->Amplicons
TTACAAACATTGGCCGCAAATTGCACAATTTGCCCCCAGCGCTTCAGCGTTCTTCGGAATGTCGCGC
>nCoV_IP2->Amplicons
ATGAGCTTAGTCCTGTTGCACTACGACAGATGTCTTGTGCTGCCGGTACTACACAAACTGCTTGCACTGATGACAATGCGTTAGCTTACTACAACACAACAAAGGGAG
>2019-nCoV_N1->Amplicons
GACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGA


In [14]:
#HW 2-4-2 Check the BLASTN results of the amplicons against all SARS-CoV-2 genomes.
# Assumes tabular BLAST output, where columns 3, 5 and 6 hold
# percent identity, mismatches and gap openings.
n_flagged = 0

with open('20191116_SeonjunPark_HW2_blastn.out', 'r') as f:
    lines = f.readlines()

for line in lines:
    blast_id = line.split()
    if float(blast_id[2]) < 99.5:
        print("{} and {} have {} mismatches and {} gaps".format(blast_id[0], blast_id[1], blast_id[4], blast_id[5]))
        n_flagged += 1
if n_flagged == 0:
    print("All hits have at least 99.5% identity.", '\n')

# Print the raw BLASTN output
for line in lines:
    print(line, end='')

In [15]:
#HW 2-5 Check the BLASTX results for mismatches and gaps.
n_flagged = 0

with open('20191116_SeonjunPark_HW2_blastx.out', 'r') as f:
    lines = f.readlines()

for line in lines:
    blastx_id = line.split()
    if blastx_id[4] != '0' or blastx_id[5] != '0':
        print('{} and {} have {} mismatches and {} gaps'.format(
            blastx_id[0], blastx_id[1], blastx_id[4], blastx_id[5]))
        n_flagged += 1
if n_flagged == 0:
    print('There is no mismatch or gap.')

# Print the raw BLASTX output
for line in lines:
    print(line, end='')

FileNotFoundError: [Errno 2] No such file or directory: '20191116_SeonjunPark_HW2_blastn.out'

In [27]:
# I repeatedly tried to run blastn and blastx for HW 2-4-2 and 2-5, but the code kept raising errors, so I uploaded only my code.
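
The `FileNotFoundError` above is raised because the BLAST output file does not exist when the parsing cell runs. The following is a sketch of a parser that checks for the file first, assuming the output is in BLAST's 12-column tabular format (percent identity, mismatches and gap openings in columns 3, 5 and 6). The function name `summarize_blast_tabular`, the `min_identity` parameter and the file name in the usage lines are hypothetical.

```python
import os


def summarize_blast_tabular(path, min_identity=99.5):
    """Return (query, subject, mismatches, gap_opens) for hits below min_identity.

    Returns None when the file does not exist, so the caller can report
    that BLAST still has to be run instead of crashing.
    """
    if not os.path.exists(path):
        return None
    flagged = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            # Skip comment lines and anything too short to be a hit line.
            if line.startswith('#') or len(fields) < 6:
                continue
            if float(fields[2]) < min_identity:
                flagged.append((fields[0], fields[1], int(fields[4]), int(fields[5])))
    return flagged


# Hypothetical file name, used only to show the missing-file branch.
result = summarize_blast_tabular('hypothetical_blastn.out')
if result is None:
    print('BLAST output not found; run BLAST first.')
elif not result:
    print('All hits meet the identity threshold.')
else:
    for query, subject, mismatches, gaps in result:
        print('{} and {} have {} mismatches and {} gaps'.format(query, subject, mismatches, gaps))
```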