# Virus mutation rate analysis

![](./img/pyro_cov.png)

Mutation rate analysis of RSV and SARS-CoV-2 using Pyro.

Nextstrain offers a convenient way to download data from GISAID.

UShER is a tool that builds a phylogenetic tree from a set of sequences.

`cov-lineages/pango-designation` is the repository where new lineages are suggested for addition to the current scheme.

`cov-lineages/pangoLEARN` stores the trained model that pangolin accesses.
This repository is deprecated and works only with pangolin 2.0 and 3.0. For data models compatible with pangolin 4.0, use `cov-lineages/pangolin-data`, which stores the latest model, protobuf, designation hash, and alias files for pangolin assignments.

`CSSEGISandData/COVID-19` is a repository of novel coronavirus (COVID-19) case data provided by JHU CSSE.

`nextstrain/nextclade` is a tool for viral genome alignment, mutation calling, clade assignment, quality checks, and phylogenetic placement.



Manhattan plot of the mutation rates for the SARS-CoV-2 genome. The plot shows the log10 of the p-values for the null hypothesis that the mutation rate is zero, based on an analysis of 6.4 million SARS-CoV-2 genomes. The mutation rate is significantly different from zero for 11 of the 29 genes in the SARS-CoV-2 genome, and also for the entire genome. The genes with the highest mutation rates are ORF1ab, N, and S. The genes with the lowest mutation rates are E, M, and NSP3.
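As a rough illustration of how the vertical axis of such a plot can be derived, the sketch below turns an effect estimate and its standard deviation into a two-sided p-value under a normal approximation, then takes the negative log10, which is the usual Manhattan-plot convention. The function names and the example numbers are hypothetical and are not taken from pyro-cov.

```python
import math

def two_sided_p_value(mean, std):
    """Two-sided p-value for H0: effect == 0, assuming a normal approximation."""
    z = abs(mean) / std
    return math.erfc(z / math.sqrt(2.0))

def manhattan_heights(effects):
    """Map (mean, std) pairs to -log10(p), flooring p to avoid log10(0)."""
    return [-math.log10(max(two_sided_p_value(m, s), 1e-300)) for m, s in effects]

# Hypothetical (mean, std) estimates for three illustrative mutations.
example_effects = [(0.0, 1.0), (0.1, 0.05), (0.3, 0.05)]
heights = manhattan_heights(example_effects)
```

Larger heights correspond to stronger evidence against a zero effect, so the third entry plots higher than the second.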

In [1]:
import os
import sys

REPO_ADDRESS = "https://github.com/broadinstitute/pyro-cov.git"
REPO_NAME = "pyro-cov"

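# Skip both steps if the cell is re-run from inside the repo directory
if os.path.basename(os.getcwd()) != REPO_NAME:
    # Download the repo if it doesn't exist
    if not os.path.exists(REPO_NAME):
        !git clone $REPO_ADDRESS

    # Change to the repo directory
    os.chdir(REPO_NAME)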
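Outside a notebook, where the `!git clone` shell escape is unavailable, the same step can be sketched in plain Python. The helper name `ensure_repo` and its `run` parameter are hypothetical. The sketch assumes `git` is on the `PATH`.

```python
import os
import subprocess

def ensure_repo(url, name, run=subprocess.run):
    """Clone `url` into directory `name` unless it already exists; return its absolute path."""
    if not os.path.isdir(name):
        # check=True raises CalledProcessError if git exits with a non-zero status
        run(["git", "clone", url, name], check=True)
    return os.path.abspath(name)
```

Calling `os.chdir(ensure_repo(REPO_ADDRESS, REPO_NAME))` would then reproduce the cell above. The `run` argument exists only so the command can be swapped out in tests.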

In [5]:
!pip install -e .

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Obtaining file:///home/ma/git/computation/pyro-cov
  Preparing metadata (setup.py) ... done
Collecting geopy
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/e4/00/6313c0ffdd230164890f433019749189e0b884562bdc170e9c3a3454b3a6/geopy-2.3.0-py3-none-any.whl (119 kB)
Collecting gpytorch
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/0b/8d/2baa077efc7ad94fa47b8a80324166538f75d76cb4ac2c7833bbc2938cc9/gpytorch-1.9.1-py3-none-any.whl (250 kB)
Collecting mappy
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/9b/41/52391258e16bcc8182ddf3133d86408ed17d8d8fd6e30eea4960df29b64b/mappy-2.24.tar.gz (140 kB)

## Download data

In [3]:
# download the data
!make update

! test -f results/DO_NOT_UPDATE
scripts/pull_nextstrain.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  719M  100  719M    0     0  1956k      0  0:06:16  0:06:16 --:--:-- 2134k
results/nextstrain/metadata.tsv.gz:	 91.5% -- created results/nextstrain/metadata.tsv
scripts/pull_usher.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   126  100   126    0     0    102      0  0:00:01  0:00:01 --:--:--   102
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 87.5M  100 87.5M    0     0  10.1M      0  0:00:08  0:00:08 --:--:-- 13.4M
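The download step leaves a tab-separated metadata file at `results/nextstrain/metadata.tsv`. A minimal sketch for summarizing one of its columns with the standard library follows. The column name `pango_lineage` is an assumption about the file's header and may need adjusting, and the sample rows are made up.

```python
import csv
from collections import Counter

def count_column(lines, column):
    """Count non-empty values of `column` in tab-separated text with a header row."""
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(row[column] for row in reader if row.get(column))

# Tiny in-memory stand-in for the real file (values are illustrative only).
sample = [
    "strain\tpango_lineage\tdate",
    "s1\tB.1.1.7\t2021-01-05",
    "s2\tB.1.1.7\t2021-01-09",
    "s3\t\t2021-02-01",
    "s4\tBA.2\t2022-03-14",
]
lineage_counts = count_column(sample, "pango_lineage")
```

For the real file, an open handle can be passed instead of a list, for example `with open("results/nextstrain/metadata.tsv", newline="") as f: count_column(f, "pango_lineage")`. Reading line by line keeps memory use low for a file with millions of rows.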

## Preprocess data

This takes under an hour.


In [6]:
!make preprocess

python scripts/preprocess_usher.py
    73913 Refining a tree with 4472367 nodes
   113825 Found 1131065 clones
   113854 Refined 2048 -> 3341302
   121780 Loading usher metadata
100%|██████████████████████████████| 6590630/6590630 [01:32<00:00, 70936.20it/s]
   225106 Found metadata:
{'day': 6584214, 'lineage': 6590629, 'location': 2551026}
   225630 Loading nextstrain metadata
100%|██████████████████████████████| 7170954/7170954 [01:59<00:00, 60024.46it/s]
   435318 Found metadata:
{'location': 5933129, 'day': 6398841, 'lineage': 6449861}
   471323 Loading tree from results/lineageTree.fine.pb
   543029 Accumulating mutations on 4472367 nodes
100%|█████████████████████████████| 4472367/4472367 [00:39<00:00, 112788.21it/s]
  1060786 Found 6652960 samples in the usher tree
  1060787 Skipped 4102146 nodes because:
Counter({'no location': 4033211, 'no date': 68935})
  1060787 Kept 2550814 rows
  1062391 Saved results/columns.pkl
  1062403 Saved results/stats.pkl
  1076205 Extracting featu
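The log reports how many tree nodes were skipped and why. That bookkeeping pattern can be sketched as below. The record fields and the order of the checks are assumptions for illustration, not the logic of `scripts/preprocess_usher.py`. The last lines check that the counts printed in the log are consistent with each other.

```python
from collections import Counter

def filter_samples(samples):
    """Split records into kept rows and a tally of skip reasons."""
    kept = []
    skipped = Counter()
    for sample in samples:
        if not sample.get("location"):
            skipped["no location"] += 1
        elif sample.get("day") is None:
            skipped["no date"] += 1
        else:
            kept.append(sample)
    return kept, skipped

# Hypothetical records; only the first has both a location and a date.
samples = [
    {"location": "Europe / Germany", "day": 400},
    {"location": None, "day": 410},
    {"location": "Asia / Japan", "day": None},
    {"location": "", "day": None},
]
kept, skipped = filter_samples(samples)

# Counts copied from the preprocessing log.
found, no_location, no_date, kept_rows = 6652960, 4033211, 68935, 2550814
assert found - (no_location + no_date) == kept_rows
```

Most of the skipped nodes lack a location, which matches the usher metadata summary earlier in the log, where far fewer records carry a location than a date or lineage.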

In [8]:
# analyze data
!make analyze

python scripts/mutrans.py --vary-holdout
     1212 Config: ('coef_scale=0.05', 'reparam-localinit', 'full', 10001, 0.05, 0.1, 10.0, 200, 6, None, ())
     1212 loading cached results/mutrans.data.single.3000.1.50.None.pt
     2589 Fitting full guide via SVI
     2593 init stddev = 4.63
     3756 Model has 18288 latent variables of shapes:
 rate_scale ()
 init_scale ()
 rate_loc_scale ()
 init_loc_scale ()
 coef_decentered (3144,)
 init_loc_decentered (3000,)
 pc_rate_decentered (6070,)
 pc_init_decentered (6070,)
     3756 Guide has 3697323 parameters of shapes:
 coef_centered (3144,)
 init_loc_centered ()
 pc_rate_centered ()
 pc_init_centered ()
 AutoLowRankMultivariateNormal.loc (18288,)
 AutoLowRankMultivariateNormal.scale (18288,)
 AutoLowRankMultivariateNormal.cov_factor (18288, 200)
     3793 step    0 L=129.163 RS=0.00951 IS=0.951 RLS=0.0105 ILS=1.05
     6977 step  100 L=32.0763 RS=0.0665 IS=1.99 RLS=0.0183 ILS=9.15
    10413 step  200 L=24.7163 RS=0.091 IS=2.1 RLS=0.0168 ILS=
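The latent-variable and guide-parameter totals in this log can be reproduced from the printed shapes. The sketch below copies those shapes and counts their elements, treating a scalar shape `()` as one element. The dictionary and function names are hypothetical.

```python
import math

# Shapes copied from the "Model has ... latent variables" section of the log.
latent_shapes = {
    "rate_scale": (),
    "init_scale": (),
    "rate_loc_scale": (),
    "init_loc_scale": (),
    "coef_decentered": (3144,),
    "init_loc_decentered": (3000,),
    "pc_rate_decentered": (6070,),
    "pc_init_decentered": (6070,),
}

# Shapes copied from the "Guide has ... parameters" section of the log.
guide_shapes = {
    "coef_centered": (3144,),
    "init_loc_centered": (),
    "pc_rate_centered": (),
    "pc_init_centered": (),
    "AutoLowRankMultivariateNormal.loc": (18288,),
    "AutoLowRankMultivariateNormal.scale": (18288,),
    "AutoLowRankMultivariateNormal.cov_factor": (18288, 200),
}

def count_elements(shapes):
    """Total number of scalar entries; math.prod(()) is 1, so scalars count once."""
    return sum(math.prod(shape) for shape in shapes.values())

num_latents = count_elements(latent_shapes)  # 18288
num_params = count_elements(guide_shapes)    # 3697323
```

Nearly all guide parameters sit in the `cov_factor` matrix, whose size is the number of latent variables times the rank (200 in this run).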