In [7]:
%load_ext nb_black


# Lecture 20
## SARS-CoV-2 data

We continue to explore coronavirus data.

## Filtering
The overall dataset is very large:

In [None]:
!zcat metadata.tsv.gz | wc -l   # count the number of rows in the metadata file
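
The same count can be obtained without leaving Python, which helps on systems where `zcat` is unavailable. The sketch below streams a gzip-compressed file line by line so that it never has to fit in memory. The helper `count_lines` and the tiny demo file are illustrative stand-ins, not part of the lecture data. As with `wc -l`, the header row is included in the count.

```python
import gzip
import os
import tempfile


def count_lines(path):
    """Count the lines of a gzip-compressed text file by streaming it."""
    n = 0
    with gzip.open(path, "rb") as handle:
        for _ in handle:
            n += 1
    return n


# Toy stand-in for metadata.tsv.gz: one header row plus two data rows.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tsv.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("strain\tdate\nA\t2020-03-23\nB\t2020-03-24\n")

n_lines = count_lines(demo_path)
print(n_lines)
```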

For this lecture we'll focus only on the sequences originating at the University of Michigan, whose strain names begin with `USA/MI-UM`:

In [19]:
import pandas as pd

df = pd.read_csv("metadata.tsv.gz", sep="\t")
mi_seqs = df[df.strain.str.startswith("USA/MI-UM")]


In [24]:
print(mi_seqs.shape)
mi_seqs.head()

(592, 27)


Unnamed: 0,strain,virus,gisaid_epi_isl,genbank_accession,date,region,country,division,location,region_exposure,...,Nextstrain_clade,pangolin_lineage,GISAID_clade,originating_lab,submitting_lab,authors,url,title,paper_url,date_submitted
142199,USA/MI-UM-10033759038/2020,ncov,EPI_ISL_578095,?,2020-03-23,North America,USA,Michigan,,North America,...,20C,B.1,GH,University of Michigan Clinical Microbiology L...,"Lauring Lab, University of Michigan, Departmen...",Valesano et al,https://www.gisaid.org,?,?,2020-10-13
142200,USA/MI-UM-10033759721/2020,ncov,EPI_ISL_578090,?,2020-03-23,North America,USA,Michigan,,North America,...,20C,B.1,GH,University of Michigan Clinical Microbiology L...,"Lauring Lab, University of Michigan, Departmen...",Valesano et al,https://www.gisaid.org,?,?,2020-10-13
142201,USA/MI-UM-10033760952/2020,ncov,EPI_ISL_578096,?,2020-03-23,North America,USA,Michigan,,North America,...,20C,B.1,GH,University of Michigan Clinical Microbiology L...,"Lauring Lab, University of Michigan, Departmen...",Valesano et al,https://www.gisaid.org,?,?,2020-10-13
142202,USA/MI-UM-10033766580/2020,ncov,EPI_ISL_578097,?,2020-03-24,North America,USA,Michigan,,North America,...,20C,B.1,GH,University of Michigan Clinical Microbiology L...,"Lauring Lab, University of Michigan, Departmen...",Valesano et al,https://www.gisaid.org,?,?,2020-10-13
142203,USA/MI-UM-10033767074/2020,ncov,EPI_ISL_578098,?,2020-03-24,North America,USA,Michigan,,North America,...,20C,B.1,GH,University of Michigan Clinical Microbiology L...,"Lauring Lab, University of Michigan, Departmen...",Valesano et al,https://www.gisaid.org,?,?,2020-10-13
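
Because the full metadata table is large, an alternative to loading everything with pandas is to filter rows while streaming the file. The following is a minimal sketch of that idea using only the standard library. It assumes the file is tab-separated with a `strain` column, as above. The helper `filter_by_strain_prefix` and the toy strain names are made up for illustration.

```python
import csv
import gzip
import os
import tempfile


def filter_by_strain_prefix(path, prefix):
    """Yield rows (as dicts) whose 'strain' field starts with `prefix`."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            if row["strain"].startswith(prefix):
                yield row


# Toy stand-in for metadata.tsv.gz (the strain names here are invented).
demo_path = os.path.join(tempfile.mkdtemp(), "toy_metadata.tsv.gz")
with gzip.open(demo_path, "wt", newline="") as handle:
    handle.write("strain\tdate\n")
    handle.write("USA/MI-UM-0001/2020\t2020-03-23\n")
    handle.write("USA/CA-XYZ-0002/2020\t2020-03-24\n")
    handle.write("USA/MI-UM-0003/2020\t2020-03-25\n")

rows = list(filter_by_strain_prefix(demo_path, "USA/MI-UM"))
print(len(rows), [r["strain"] for r in rows])
```

Only one row is held in memory at a time, at the cost of losing the convenience of a DataFrame.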



## Extracting the raw data
The raw sequence data come to us in [FASTA format](https://en.wikipedia.org/wiki/FASTA_format). The uncompressed file is large, roughly 5 GiB:

In [29]:
!xz -l sequences_2020-11-05_07-30.fasta.xz

Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename
    1     211  8,588.5 KiB  5,059.8 MiB  0.002  CRC64   sequences_2020-11-05_07-30.fasta.xz



In [31]:
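import lzma
from Bio import SeqIO
# SeqIO has no load(); parse() on a decompressed text handle yields records lazily
seqs = SeqIO.parse(lzma.open("sequences_2020-11-05_07-30.fasta.xz", "rt"), "fasta")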


In [37]:
import lzma
from Bio.SeqIO.FastaIO import SimpleFastaParser

input_file = "sequences_2020-11-05_07-30.fasta.xz"
output_file = "um.fasta"
count = 0

with lzma.open(input_file, "rt") as in_handle:
    with open(output_file, "w") as out_handle:
        for seq_id, seq in SimpleFastaParser(in_handle):
            # SimpleFastaParser yields the full title line (everything after the ">" sign);
            # the strain name is at the start of it:
            if seq_id.startswith("USA/MI-UM"):
                out_handle.write(">%s\n%s\n" % (seq_id, seq))
                count += 1
print("Saved %i records from %s to %s" % (count, input_file, output_file))

Saved 592 records from sequences_2020-11-05_07-30.fasta.xz to um.fasta
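
To make the FASTA layout concrete, here is a minimal pure-Python reader that yields `(title, sequence)` pairs, which is the same shape of output as `SimpleFastaParser`. It is a sketch for illustration only, not Biopython's implementation, and the function name `read_fasta` and the toy records are made up.

```python
import io


def read_fasta(handle):
    """Yield (title, sequence) pairs from a FASTA-formatted text handle."""
    title, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)


toy = io.StringIO(">seq1 first toy record\nACGT\nAC\n>seq2\nGGTA\n")
records = list(read_fasta(toy))
print(records)
```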



## Alignment
The sequences in the filtered file do not all have the same length:

In [48]:
seqs = SeqIO.parse(open("um.fasta", "rt"), "fasta")
{len(s.seq) for s in seqs}

{29611,
 29754,
 29758,
 29767,
 29773,
 29775,
 29776,
 29779,
 29780,
 29781,
 29782,
 29784}


In order to estimate a phylogeny, we need to align these sequences so that each column of the alignment contains nucleotides that share the same genealogy, i.e. that descend from the same ancestral site.
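
To illustrate what "aligning" means, the sketch below implements the classic Needleman-Wunsch algorithm for global alignment of two short sequences. This is a toy example only: multiple-sequence aligners use more elaborate methods, and the scoring values (`match=1`, `mismatch=-1`, `gap=-1`) are arbitrary choices made here for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


x, y = global_align("ACGTACGT", "ACGACGT")
print(x)
print(y)
```

After alignment both strings have the same length, and a `-` marks a position where one sequence has no counterpart in the other.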

In [53]:
# !pip install nextstrain-augur
! augur align --sequences um.fasta --output aligned.fa --reference-sequence reference_seq.gb \
    --remove-reference --nthreads 4


using mafft to align via:
	mafft --reorder --anysymbol --nomemsave --adjustdirection --thread 4 aligned.fa.to_align.fasta 1> aligned.fa 2> aligned.fa.log 

	Katoh et al, Nucleic Acid Research, vol 30, issue 14
	https://doi.org/10.1093%2Fnar%2Fgkf436

2bp insertion at ref position 27455
	AG: USA/MI-UM-10033991503/2020
2bp insertion at ref position 28251
	TG: USA/MI-UM-10035769699/2020, USA/MI-UM-10035905827/2020
Trimmed gaps in MN908947 from the alignment



In [62]:
seqs = SeqIO.parse(open("aligned.fa", "rt"), "fasta")
{len(s.seq) for s in seqs}

{29903}
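
Once every sequence has the same length, the alignment can be examined column by column. The sketch below classifies the columns of a toy alignment into three categories that tree-building software commonly reports: constant sites, parsimony-informative sites (at least two different characters, each seen at least twice), and singleton sites (variable, but not parsimony-informative). It is a simplification: gaps and ambiguity codes are treated as ordinary characters here, which real tools do not do. The function name `classify_sites` is made up for this example.

```python
from collections import Counter


def classify_sites(alignment):
    """Count constant, singleton, and parsimony-informative columns."""
    if len({len(s) for s in alignment}) != 1:
        raise ValueError("all sequences must have the same length")
    counts = {"constant": 0, "singleton": 0, "informative": 0}
    for column in zip(*alignment):
        freq = Counter(column)
        if len(freq) == 1:
            counts["constant"] += 1
        elif sum(1 for c in freq.values() if c >= 2) >= 2:
            # At least two characters each occur in two or more sequences.
            counts["informative"] += 1
        else:
            counts["singleton"] += 1
    return counts


toy_alignment = ["ACGTA", "ACGTC", "ATGAC", "ATGAG"]
site_counts = classify_sites(toy_alignment)
print(site_counts)
```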


## Estimating a tree
Given the sequence alignment, we can now estimate a tree. This is in general a computationally hard problem. For this analysis we will use [IQ-Tree](http://www.iqtree.org), which is one of the state-of-the-art heuristic methods.
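
One way to get a feel for why exhaustive search is out of the question is to count candidate trees. The number of distinct unrooted binary tree topologies on $n$ labeled tips is $(2n-5)!! = 1 \cdot 3 \cdot 5 \cdots (2n-5)$. The sketch below evaluates this formula; the function name is made up for illustration.

```python
def num_unrooted_trees(n):
    """Number of unrooted binary tree topologies on n labeled tips: (2n-5)!!"""
    if n < 3:
        raise ValueError("need at least 3 tips")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


for n in (4, 5, 10):
    print(n, num_unrooted_trees(n))

# For the 592 sequences above the count is astronomically large;
# print only how many decimal digits it has.
print(len(str(num_unrooted_trees(592))))
```

Heuristic methods therefore explore only a tiny, carefully chosen part of this space.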

In [64]:
!iqtree2 -s aligned.fa -nt 8

IQ-TREE multicore version 2.0.7 for Mac OS X 64-bit built Jul 24 2020
Developed by Bui Quang Minh, Nguyen Lam Tung, Olga Chernomor,
Heiko Schmidt, Dominik Schrempf, Michael Woodhams.

Host:    Jonathans-MacBook-Pro-2.local (AVX2, FMA3, 64 GB RAM)
Command: iqtree2 -s aligned.fa -nt 8
Seed:    990563 (Using SPRNG - Scalable Parallel Random Number Generator)
Time:    Thu Nov  5 11:24:56 2020
Kernel:  AVX+FMA - 8 threads (16 CPU cores detected)

Reading alignment file aligned.fa ... Fasta format detected
Alignment most likely contains DNA/RNA sequences
Alignment has 592 sequences with 29903 columns, 1331 distinct patterns
448 parsimony-informative, 513 singleton sites, 28942 constant sites
                            Gap/Ambiguity  Composition  p-value
   1  USA/MI-UM-2006S000890/2020    3.27%    passed     99.85%
   2  USA/MI-UM-2006S001053/2020    1.67%    passed     99.94%
   3  USA/MI-UM-2006S000850/2020    0.40%    passed     99.99%
   4  USA/MI-UM-2006S000861/2020    1.13%    passed 

 514  USA/MI-UM-10035946104/2020    2.56%    passed     99.98%
 515  USA/MI-UM-10035757349/2020    0.40%    passed    100.00%
 516  USA/MI-UM-MH27682/2020        1.13%    passed    100.00%
 517  USA/MI-UM-10035755495/2020    0.75%    passed     99.99%
 518  USA/MI-UM-10036092704/2020    1.42%    passed     99.96%
 519  USA/MI-UM-10036381452/2020    1.06%    passed     99.93%
 520  USA/MI-UM-10036422279/2020    1.13%    passed     99.97%
 521  USA/MI-UM-10036036877/2020    0.40%    passed     99.98%
 522  USA/MI-UM-10036453509/2020    1.74%    passed     99.98%
 523  USA/MI-UM-10036331190/2020    1.14%    passed     99.99%
 524  USA/MI-UM-10036002831/2020    0.75%    passed     99.99%
 525  USA/MI-UM-10036277128/2020    0.46%    passed     99.96%
 526  USA/MI-UM-10036434968/2020    0.75%    passed     99.97%
 527  USA/MI-UM-2006S000851/2020    0.40%    passed     99.90%
 528  USA/MI-UM-2006S000881/2020    0.40%    passed     99.89%
 529  USA/MI-UM-2006S000913/2020    1.13

NOTE: 62 identical sequences (see below) will be ignored for subsequent analysis
NOTE: USA/MI-UM-2006S000935/2020 (identical to USA/MI-UM-2006S000850/2020) is ignored but added at the end
NOTE: USA/MI-UM-2006S001097/2020 (identical to USA/MI-UM-2006S000850/2020) is ignored but added at the end
NOTE: USA/MI-UM-2006S001114/2020 (identical to USA/MI-UM-2006S000850/2020) is ignored but added at the end
NOTE: USA/MI-UM-2006S001135/2020 (identical to USA/MI-UM-2006S000850/2020) is ignored but added at the end
NOTE: USA/MI-UM-2006S005725/2020 (identical to USA/MI-UM-2006S000850/2020) is ignored but added at the end
NOTE: USA/MI-UM-10033759038/2020 (identical to USA/MI-UM-2006S000850/2020) is ignored but added at the end
NOTE: USA/MI-UM-10033760952/2020 (identical to USA/MI-UM-2006S000850/2020) is ignored but added at the end
NOTE: USA/MI-UM-10033768049/2020 (identical to USA/MI-UM-2006S000850/2020) is ignored but added at the end
NOTE: USA/MI-UM-10033768108/2020 (identical to USA/MI-UM-2006S0

 97  TIM2e+R3      53087.760    1064 108303.520   108382.108   117140.799
110  TIM+F+R3      52014.064    1067 106162.129   106241.169   115024.326
123  TIMe+R3       53054.069    1064 108236.138   108314.726   117073.418
136  TPM3u+F+R3    52211.251    1066 106554.501   106633.390   115408.393
149  TPM3+F+R3     52211.251    1066 106554.501   106633.390   115408.392
162  TPM2u+F+R3    52154.889    1066 106441.778   106520.667   115295.669
175  TPM2+F+R3     52154.889    1066 106441.778   106520.667   115295.669
188  K3Pu+F+R3     52141.617    1066 106415.234   106494.123   115269.125
201  K3P+R3        53183.686    1063 108493.373   108571.810   117322.347
214  TN+F+R3       52116.322    1066 106364.644   106443.533   115218.535
227  TNe+R3        53149.225    1063 108424.450   108502.888   117253.424
240  HKY+F+R3      52243.686    1065 106617.371   106696.110   115462.957
253  K2P+R3        53278.979    1062 108681.959   108760.246   117502.627
266  F81+F+R3      52646.918    1064 1
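
The table header is cut off in the output above, so the following reading of the model-selection rows is an assumption: model number, model name, negative log-likelihood, number of free parameters, AIC, AICc, and BIC. Under that assumption, and taking the sample size to be the 29903 alignment columns, the standard formulas reproduce the numbers in the `TIM2e+R3` row. The sketch below checks this; the function name is made up for illustration.

```python
import math


def information_criteria(neg_log_lik, k, n):
    """Return (AIC, AICc, BIC) for a model with k parameters and n sites."""
    aic = 2 * neg_log_lik + 2 * k
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)  # small-sample correction
    bic = 2 * neg_log_lik + k * math.log(n)
    return aic, aicc, bic


# Values taken from the TIM2e+R3 row; n is the number of alignment columns.
aic, aicc, bic = information_criteria(53087.760, 1064, 29903)
print(round(aic, 3), round(aicc, 3), round(bic, 3))
```

Lower values are better for all three criteria; BIC penalises extra parameters most heavily when the alignment is long.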
