# Virus mutation rates analysis

![](./img/pyro_cov.png)

Analysis of RSV and SARS-CoV-2 mutation rates using Pyro.

Nextstrain offers a convenient way to download data from GISAID.

UShER is a tool for building a phylogenetic tree from a set of sequences by placing samples onto an existing tree.

`cov-lineages/pango-designation` is the repository for suggesting new lineages that should be added to the current scheme.

`cov-lineages/pangoLEARN` stores the trained model that pangolin accesses.
This repository is deprecated and only for use with pangolin 2.0 and 3.0. For the latest data models compatible with pangolin 4.0, use `cov-lineages/pangolin-data`, which stores the latest model, protobuf, designation hash, and alias files for pangolin assignments.

`CSSEGISandData/COVID-19` is a repository of Novel Coronavirus (COVID-19) case data, provided by JHU CSSE.

`nextstrain/nextclade` is a tool for viral genome alignment, mutation calling, clade assignment, quality checks, and phylogenetic placement.



Manhattan plot of the mutation rates for the SARS-CoV-2 genome. The plot shows the log10 of the p-values for the null hypothesis that the mutation rate is zero, based on an analysis of 6.4 million SARS-CoV-2 genomes. The mutation rate is significantly different from zero for 11 of the 29 genes in the SARS-CoV-2 genome, and for the genome as a whole. The genes with the highest mutation rates are ORF1ab, N, and S; those with the lowest are E, M, and NSP3.
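As a rough illustration of how the vertical axis of such a plot can be derived, the sketch below converts an effect estimate and its standard deviation into a two-sided p-value under a normal approximation, then takes the negative log10 (the usual Manhattan plot convention). The function name and the example numbers are hypothetical, and this is not necessarily how pyro-cov computes its plot.

```python
import math

def neg_log10_p(mean: float, std: float) -> float:
    """Return -log10 of the two-sided p-value for H0: effect == 0.

    Assumes the estimate is approximately normal (an assumption of this
    sketch, not a statement about pyro-cov internals).
    """
    if std <= 0:
        raise ValueError("std must be positive")
    z = abs(mean) / std
    # Two-sided tail probability of a standard normal: erfc(z / sqrt(2)).
    p = math.erfc(z / math.sqrt(2))
    # Guard against underflow to 0 for very large z.
    p = max(p, 5e-324)
    return -math.log10(p)

# Hypothetical (mean, std) pairs, for illustration only.
examples = {"mutation_a": (0.05, 0.01), "mutation_b": (0.001, 0.01)}
scores = {name: neg_log10_p(m, s) for name, (m, s) in examples.items()}
for name, score in scores.items():
    print(f"{name}: -log10(p) = {score:.2f}")
```

Larger values correspond to stronger evidence against a zero effect, so the tall points in a Manhattan plot are the ones worth inspecting first.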

In [1]:
import os
import sys

REPO_ADDRESS = "https://github.com/broadinstitute/pyro-cov.git"
REPO_NAME = "pyro-cov"

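# Clone the repo and enter it, unless we are already inside it
# (this keeps the cell safe to re-run)
if os.path.basename(os.getcwd()) != REPO_NAME:
    if not os.path.exists(REPO_NAME):
        !git clone $REPO_ADDRESS
    # Change to the repo directory
    os.chdir(REPO_NAME)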

In [2]:
!pip install -e .

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Obtaining file:///home/ma/git/computation/pyro-cov
  Preparing metadata (setup.py) ... done
Installing collected packages: pyrocov
  Attempting uninstall: pyrocov
    Found existing installation: pyrocov 0.1.0
    Uninstalling pyrocov-0.1.0:
      Successfully uninstalled pyrocov-0.1.0
  Running setup.py develop for pyrocov
Successfully installed pyrocov-0.1.0


Download data

In [3]:
# download the data
!make update

! test -f results/DO_NOT_UPDATE
make: *** [Makefile:42: update] Error 1
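The output above suggests that the `update` target runs `! test -f results/DO_NOT_UPDATE`, so it fails whenever that file exists. A minimal sketch of checking for the file from Python before calling `make update`; the helper name `update_is_blocked` is hypothetical, and the path is taken from the log line above.

```python
from pathlib import Path

# Path taken from the make output; the helper name is hypothetical.
SENTINEL = Path("results") / "DO_NOT_UPDATE"

def update_is_blocked(repo_root: str = ".") -> bool:
    """Return True if the sentinel file that blocks `make update` exists."""
    return (Path(repo_root) / SENTINEL).is_file()

if update_is_blocked():
    print(f"{SENTINEL} exists, so `make update` is expected to exit with an error.")
else:
    print(f"{SENTINEL} not found, so this guard should not block `make update`.")
```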


Preprocess data

This takes under an hour.


In [4]:
!make preprocess

python scripts/preprocess_usher.py
    73145 Refining a tree with 4472367 nodes
   113577 Found 1131065 clones
   113618 Refined 2048 -> 3341302
   121584 Loading usher metadata
100%|██████████████████████████████| 6590630/6590630 [01:33<00:00, 70121.77it/s]
   225982 Found metadata:
{'day': 6584214, 'lineage': 6590629, 'location': 2551026}
   226612 Loading nextstrain metadata
100%|██████████████████████████████| 7170954/7170954 [02:06<00:00, 56641.67it/s]
   443420 Found metadata:
{'location': 5933129, 'day': 6398841, 'lineage': 6449861}
   480169 Loading tree from results/lineageTree.fine.pb
   551769 Accumulating mutations on 4472367 nodes
100%|█████████████████████████████| 4472367/4472367 [00:41<00:00, 107386.50it/s]
  1073417 Found 6652960 samples in the usher tree
  1073417 Skipped 4102146 nodes because:
Counter({'no location': 4033211, 'no date': 68935})
  1073418 Kept 2550814 rows
  1075152 Saved results/columns.pkl
  1075168 Saved results/stats.pkl
  1088888 Extracting featu
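
The counts in this log can be cross-checked with a little arithmetic: the two skip reasons should sum to the number of skipped nodes, and the samples found minus the skipped nodes should equal the rows kept. The sketch below hard-codes the numbers from the log above; the variable names are made up for illustration.

```python
from collections import Counter

# Numbers copied from the preprocessing log above.
samples_in_tree = 6_652_960
skipped_reasons = Counter({"no location": 4_033_211, "no date": 68_935})
reported_skipped = 4_102_146
reported_kept = 2_550_814

total_skipped = sum(skipped_reasons.values())
kept = samples_in_tree - total_skipped

print(f"skipped: {total_skipped} (log reports {reported_skipped})")
print(f"kept:    {kept} (log reports {reported_kept})")
print(f"fraction kept: {kept / samples_in_tree:.1%}")
```

Most of the discarded samples are dropped for lacking a location, which is worth keeping in mind when interpreting region-level results.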

Analyze data

In [5]:
# analyze data
!python scripts/mutrans.py --vary-gene

     1226 loading cached results/mutrans.data.single.3000.1.50.None.pt
     2813 Fitting to each of genes: E, M, N, ORF10, ORF14, ORF1a, ORF1b, ORF3a, ORF6, ORF7a, ORF7b, ORF8, ORF9b, S
  0%|                                                    | 0/16 [00:00<?, ?it/s]
     2816 Holdout: {}
     2816 loading cached results/mutrans.data.single.3000.1.50.None.pt
     2861 loading cached results/mutrans.svi.3000.1.50.coef_scale=0.05.reparam-localinit.full.10001.0.05.0.1.10.0.200.6.None..pt
     2905 Dense data has shape 82 x 6 x 3000 totaling 2550814 sequences
     2927 |μ|/σ [median,max] = [0.0886,635]
     2927 ΔlogR(S:D614G) = 0.0347 ± 0.00
     2927 ΔlogR(S:N501Y) = 0.055 ± 0.00
     2927 ΔlogR(S:E484K) = 0.000379 ± 0.00
     2928 ΔlogR(S:L452R) = 0.0462 ± 0.00
     2934 R(B.1.1.7)/R(A) = 1.42
     2934 R(B.1.617.2)/R(A) = 2.03
     2934 R(AY.23.1)/R(A) = 2.06
     2936 KL = 0.04137, MAE = 6.431, RMSE = 2.809
     2936 England	KL = 0.0353, MAE = 18.1, RMSE = 4.84
     2936 England B.1.1.7
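
To read the `ΔlogR` lines, it can help to convert them to multiplicative effects. Assuming `ΔlogR` is a natural-log change in the quantity `R` (an assumption here, since the log does not state the base), the fold change for a mutation is `exp(ΔlogR)`. The values below are copied from the output above; the function name is hypothetical.

```python
import math

# Point estimates copied from the mutrans.py output above.
delta_log_r = {
    "S:D614G": 0.0347,
    "S:N501Y": 0.055,
    "S:E484K": 0.000379,
    "S:L452R": 0.0462,
}

def fold_change(delta: float) -> float:
    """Convert a log-scale effect to a multiplicative one.

    Assumes natural logarithms, which is an assumption of this sketch.
    """
    return math.exp(delta)

# Sort from largest to smallest effect and print as a percent change.
for name, delta in sorted(delta_log_r.items(), key=lambda kv: -kv[1]):
    pct = (fold_change(delta) - 1.0) * 100
    print(f"{name}: x{fold_change(delta):.4f} ({pct:+.2f}%)")
```

Under the same assumption, small effects like these are close to additive on the log scale, so a lineage carrying several of these mutations would multiply their fold changes together.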