Let's make a new (viral) genome

In [7]:
import random

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

In [4]:
letters = ["A", "C", "G", "T"]
virus = "".join([random.choice(letters) for _ in range(1000)])

Make a new virus that duplicates 100bp from the middle of the sequence (`virus[500:600]` appears twice, in tandem)

In [6]:
newvirus = virus[:600] + virus[500:]
len(virus), len(newvirus)

(1000, 1100)
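As a quick sanity check on the slicing above, here is a minimal sketch confirming that the 100 extra bases are a tandem copy of `virus[500:600]`. The fixed seed and the names `first_copy` and `second_copy` are illustrative assumptions, not part of the original notebook.

```python
import random

random.seed(42)  # assumption: fixed seed so the example is reproducible
letters = ["A", "C", "G", "T"]
virus = "".join(random.choice(letters) for _ in range(1000))
newvirus = virus[:600] + virus[500:]

# The duplicated block occupies two adjacent 100bp windows of newvirus
first_copy = newvirus[500:600]
second_copy = newvirus[600:700]

# Both windows should equal the original middle segment
print(first_copy == second_copy == virus[500:600])
# The new genome is longer by exactly the duplicated length
print(len(newvirus) - len(virus))
```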

In [8]:
record_virus = SeqRecord(seq=Seq(virus), id="virus_ref")
record_newvirus = SeqRecord(seq=Seq(newvirus), id="newvirus_query")

SeqIO.write(record_virus, "virus.fasta", "fasta")
SeqIO.write(record_newvirus, "newvirus.fasta", "fasta")

1

What does the MUMmer alignment look like?

In [10]:
!dnadiff virus.fasta newvirus.fasta

Building alignments
Can't exec "/Users/lpritc/opt/anaconda3/envs/pyani_py311/opt/mummer-3.23/nucmer": No such file or directory at /Users/lpritc/opt/anaconda3/envs/pyani_py311/bin/dnadiff line 142.
ERROR: Failed to run nucmer, aborting.
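The error above says that `dnadiff` could not execute `nucmer` at the path it expected. A first diagnostic, sketched below, is to ask which of the MUMmer executables are visible on `PATH`. The list of tool names is an assumption based on the programs `dnadiff` normally calls in MUMmer 3.x. Because `dnadiff` here looks for `nucmer` under a fixed install path, a tool being found on `PATH` does not guarantee that `dnadiff` itself will find it.

```python
import shutil

# Executables that dnadiff shells out to (assumed list for MUMmer 3.x)
tools = ["dnadiff", "nucmer", "delta-filter", "show-coords", "show-snps", "show-diff"]

# shutil.which returns the full path, or None if the tool is not on PATH
found = {tool: shutil.which(tool) for tool in tools}

for tool, path in found.items():
    print(f"{tool}: {path or 'NOT FOUND'}")
```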


In [11]:
!cat out.report

/Users/lpritc/Desktop/virus.fasta /Users/lpritc/Desktop/newvirus.fasta
NUCMER

                               [REF]                [QRY]
[Sequences]
TotalSeqs                          1                    1
AlignedSeqs               1(100.00%)           1(100.00%)
UnalignedSeqs               0(0.00%)             0(0.00%)

[Bases]
TotalBases                      1000                 1100
AlignedBases           1000(100.00%)        1100(100.00%)
UnalignedBases              0(0.00%)             0(0.00%)

[Alignments]
1-to-1                             2                    2
TotalLength                     1100                 1100
AvgLength                     550.00               550.00
AvgIdentity                   100.00               100.00

M-to-M                             2                    2
TotalLength                     1100                 1100
AvgLength                     550.00               550.00
AvgIdentity                   100.00               100.00

[Feature Estim

Let's put some indels into a version of the larger genome

In [13]:
# Make three indels of length 30bp
indels = ["".join([random.choice(letters) for _ in range(30)]) for _ in range(3)]
indels

['GAAACGCTGCAGCGTATAGGTAAGAAAGAG',
 'GACGTGTGCCATCGGAAAGCTCGAGATAAC',
 'AACCTGGCATTCTCCCGTCTAACTAAAACT']

We take the sequence of the new virus and insert one indel at each of three locations:

- in the first 300bp (after base 150)
- in the repeated region, bases 600-700 (after base 650)
- in the last 200bp (after base 950)

In [16]:
indel_virus = newvirus[:150] + indels[0] + newvirus[150:650] + indels[1] + newvirus[650:950] + indels[2] + newvirus[950:]
len(virus), len(newvirus), len(indel_virus)

(1000, 1100, 1190)
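Each insertion shifts everything downstream by 30bp, so the indels do not sit at the same coordinates in `indel_virus` as their insertion points in `newvirus`. The sketch below works out where each one lands. The names `insert_points` and `indel_spans` are hypothetical helpers introduced for illustration, and coordinates are 0-based, half-open Python slices.

```python
# Insertion points in newvirus (0-based), taken from the slicing above
insert_points = [150, 650, 950]
indel_length = 30

# Each earlier insertion shifts later coordinates right by 30bp
indel_spans = []
for i, point in enumerate(insert_points):
    start = point + i * indel_length  # 0-based start in indel_virus
    indel_spans.append((start, start + indel_length))

# 1100bp of newvirus plus three 30bp insertions
expected_length = 1100 + len(insert_points) * indel_length

print(indel_spans)      # [(150, 180), (680, 710), (1010, 1040)]
print(expected_length)  # 1190
```

With the notebook's variables in scope, `indel_virus[680:710] == indels[1]` should then hold.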

In [17]:
record_indelvirus = SeqRecord(seq=Seq(indel_virus), id="indelvirus_query")

SeqIO.write(record_indelvirus, "indelvirus.fasta", "fasta")

1

In [26]:
!dnadiff virus.fasta indelvirus.fasta

Building alignments
1: PREPARING DATA
2,3: RUNNING mummer AND CREATING CLUSTERS
# reading input file "out.ntref" of length 1001
# construct suffix tree for sequence of length 1001
# (maximum reference length is 2305843009213693948)
# (maximum query length is 18446744073709551615)
# CONSTRUCTIONTIME /Users/lpritc/opt/anaconda3/envs/pyani_py311/opt/mummer-3.23/mummer out.ntref 0.00
# reading input file "/Users/lpritc/Documents/Development/GitHub/pyani/issue_340/indelvirus.fasta" of length 1190
# matching query-file "/Users/lpritc/Documents/Development/GitHub/pyani/issue_340/indelvirus.fasta"
# against subject-file "out.ntref"
# COMPLETETIME /Users/lpritc/opt/anaconda3/envs/pyani_py311/opt/mummer-3.23/mummer out.ntref 0.00
# SPACE /Users/lpritc/opt/anaconda3/envs/pyani_py311/opt/mummer-3.23/mummer out.ntref 0.00
4: FINISHING DATA
Filtering alignments
Extracting alignment coordinates
Analyzing SNPs
Extracting alignment breakpoints
Generating report file


In [27]:
!cat out.report

/Users/lpritc/Documents/Development/GitHub/pyani/issue_340/virus.fasta /Users/lpritc/Documents/Development/GitHub/pyani/issue_340/indelvirus.fasta
NUCMER

                               [REF]                [QRY]
[Sequences]
TotalSeqs                          1                    1
AlignedSeqs               1(100.00%)           1(100.00%)
UnalignedSeqs               0(0.00%)             0(0.00%)

[Bases]
TotalBases                      1000                 1190
AlignedBases           1000(100.00%)         1111(93.36%)
UnalignedBases              0(0.00%)            79(6.64%)

[Alignments]
1-to-1                             2                    2
TotalLength                     1051                 1111
AvgLength                     525.50               555.50
AvgIdentity                    94.60                94.60

M-to-M                             2                    2
TotalLength                     1051                 1111
AvgLength                     525.50               555.

In [28]:
!cat out.snps

150	.	G	151	0	150	1000	1190	1	1	virus_ref	indelvirus_query
150	.	A	152	0	150	1000	1190	1	1	virus_ref	indelvirus_query
150	.	A	153	0	150	1000	1190	1	1	virus_ref	indelvirus_query
150	.	A	154	0	150	1000	1190	1	1	virus_ref	indelvirus_query
150	.	C	155	0	150	1000	1190	1	1	virus_ref	indelvirus_query
150	.	G	156	0	150	1000	1190	1	1	virus_ref	indelvirus_query
150	.	C	157	0	150	1000	1190	1	1	virus_ref	indelvirus_query
150	.	T	158	0	150	1000	1190	1	1	virus_ref	indelvirus_query
150	.	G	159	0	150	1000	1190	1	1	virus_ref	indelvirus_query
150	.	C	160	0	150	1000	1190	1	1	virus_ref	indelvirus_query
150	.	A	161	0	150	1000	1190	1	1	virus_ref	indelvirus_query
150	.	G	162	0	150	1000	1190	1	1	virus_ref	indelvirus_query
150	.	C	163	0	150	1000	1190	1	1	virus_ref	indelvirus_query
150	.	G	164	0	150	1000	1190	1	1	virus_ref	indelvirus_query
150	.	T	165	0	150	1000	1190	1	1	virus_ref	indelvirus_query
150	.	A	166	0	150	1000	1190	1	1	virus_ref	indelvirus_query
150	.	T	167	0	150	1000	1190	1	1	virus_ref	indelvirus_que
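The rows above all share reference position 150 with `.` in the reference column, which reads as an insertion in the query relative to the reference. The sketch below reassembles the inserted query bases from the first five rows. The column interpretation (reference position, reference base, query base, query position as the first four fields) is an assumption about this tab-separated layout, and the rows are pasted in as strings rather than read from `out.snps`.

```python
# First five rows of out.snps, copied from the output above
rows = [
    "150\t.\tG\t151\t0\t150\t1000\t1190\t1\t1\tvirus_ref\tindelvirus_query",
    "150\t.\tA\t152\t0\t150\t1000\t1190\t1\t1\tvirus_ref\tindelvirus_query",
    "150\t.\tA\t153\t0\t150\t1000\t1190\t1\t1\tvirus_ref\tindelvirus_query",
    "150\t.\tA\t154\t0\t150\t1000\t1190\t1\t1\tvirus_ref\tindelvirus_query",
    "150\t.\tC\t155\t0\t150\t1000\t1190\t1\t1\tvirus_ref\tindelvirus_query",
]

inserted = []
for row in rows:
    fields = row.split("\t")
    ref_base, qry_base, qry_pos = fields[1], fields[2], int(fields[3])
    if ref_base == ".":  # no reference base: the query has an extra base here
        inserted.append((qry_pos, qry_base))

# Order by query position and join the bases
inserted_seq = "".join(base for _, base in sorted(inserted))

first_indel = "GAAACGCTGCAGCGTATAGGTAAGAAAGAG"  # indels[0] from the cell above
print(inserted_seq)                          # GAAAC
print(first_indel.startswith(inserted_seq))  # True
```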

In [29]:
!cat out.mcoords

1	600	1	630	600	630	95.24	1000	1190	60.00	52.94	virus_ref	indelvirus_query
550	1000	710	1190	451	481	93.76	1000	1190	45.10	40.42	virus_ref	indelvirus_query
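The figures in `out.mcoords` can be cross-checked against each other with a little arithmetic. The sketch below assumes the columns are, in order: reference start/end, query start/end, aligned length on reference and query, percent identity, total reference and query lengths, and percent coverage of each. In this dataset the only differences are the 30bp insertions, so the reported identity happens to match the ratio of the two aligned lengths. That is an observation about these numbers, not a general description of how MUMmer calculates identity.

```python
# The two rows of out.mcoords, copied from the output above
rows = [
    "1\t600\t1\t630\t600\t630\t95.24\t1000\t1190\t60.00\t52.94\tvirus_ref\tindelvirus_query",
    "550\t1000\t710\t1190\t451\t481\t93.76\t1000\t1190\t45.10\t40.42\tvirus_ref\tindelvirus_query",
]

results = []
for row in rows:
    f = row.split("\t")
    aln_ref, aln_qry = int(f[4]), int(f[5])  # aligned lengths
    len_ref, len_qry = int(f[7]), int(f[8])  # full sequence lengths
    results.append({
        "identity": round(100 * aln_ref / aln_qry, 2),
        "cov_ref": round(100 * aln_ref / len_ref, 2),
        "cov_qry": round(100 * aln_qry / len_qry, 2),
    })

# Length-weighted identity over both alignments (compare AvgIdentity in out.report)
total_ref = sum(int(r.split("\t")[4]) for r in rows)
total_qry = sum(int(r.split("\t")[5]) for r in rows)
overall_identity = round(100 * total_ref / total_qry, 2)

print(results)
print(total_ref, total_qry, overall_identity)  # 1051 1111 94.6
```

The totals 1051 and 1111 match the `TotalLength` row in `out.report`, and the weighted identity matches its `AvgIdentity` of 94.60.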
