# RdRp0 pan-proteome v0
```
Lead     : ababaian
Issue    : 
start    : 2020 12 10
complete : 2020 12 13
revision : 2020 12 15
files    : ~/serratus/notebook/201210_rdrp0/
s3 files : s3://serratus-public/notebook/201210_rdrp0/
```

## Introduction

We've been considering other ways to 'dive' into the SRA to yield meaningful, interpretable results. An idrea which is recurring is to focus on a gene-family/domain that we would like to characterize exhaustively.

The prime candidate is viral RNA-dependent RNA-polymerase or `RdRp`. This is slowly-evolving and central reference gene for the identification and classification of RNA viruses.

It is a daunting task to isolate all known RdRp and categorize them into a meaningful system, this is a first approximation of that goal putting together the components to do so.

The ideal end-goal will be to create a hierarchly/taxonomically nested set of RdRp protein sequences at various cut-off thresholds.

- rdrp1_99: all unique RdRp sequences
- rdrp1:  Species-Approximate. 90% identity clusters
- rdrp1_75:  Genus-Approximate. 75% identity clusters
- rdrp1_45:  Family-Approximate. 45% identity clusters


### Key Literature

- [Wolf20: Doubling RNA viruses](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7508674/)
- [Wolf18: RdRp evo/origin](https://pubmed.ncbi.nlm.nih.gov/30482837/)
- [Venk18: RdRp evo/origin](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5850383/)
- [Zhange: Expanding RNA virome (review)](https://pubmed.ncbi.nlm.nih.gov/31100994/)
- [Wu: Structural overview of RdRp](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4490480/)
- [Velthuis: Features of RdRp](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4207942/#Sec5)
- [Bruenn: Core motif-set of RdRp](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC152793/)
- [Jia: Diversity of motif-synteny](https://www.frontiersin.org/articles/10.3389/fmicb.2019.01945/full)
- [OReilly: Early computational analysis of RdRp](https://pubmed.ncbi.nlm.nih.gov/9878607/)

### Objectives
- Compile the materials neccesary for a comprehensive RdRp-ome
- Create the `rdrp0.fa` reference pan-proteome to run a pilot serratus run and see what results would look like

### END PRODUCT: `rdrp1_EPSY.fa `

- s3_link: `s3://serratus-public/notebook/201226_rdrp0/rdrp1_EPSY.fa`
- html_link: `https://serratus-public.s3.amazonaws.com/notebook/201226_rdrp0/rdrp1_EPSY.fa`
- Sequences: `14679`
- Lines: `129040`
- md5sum: `1adf7d172535844cb59b4dbd946ac549`


In [1]:
# Serratus commit version
SERRATUS="/home/artem/serratus"
cd $SERRATUS

# Create local run directory
WORK="$SERRATUS/notebook/201210_rdrp0"
mkdir -p $WORK; cd $WORK

# S3 notebook path
S3_WORK='s3://serratus-public/notebook/201210_rdrp0/'

# date and version
date
git rev-parse HEAD # commit version

Sat Dec 26 22:30:20 PST 2020
340d2fd52e816b20e5c88b7fb08e68b25066a12b


## GenBank Virome

The master corpus for all viral sequences

See also: [Genbank FTP README](https://ftp.ncbi.nlm.nih.gov/genbank/release.notes/gb240.release.notes)

### Nucleotide Sequences
- Query: `txid10239[Organism:exp]` # all viruses
- Date: `201205`
- Results: `3 535 357` sequences
- File : `ntViro_gb201005.fa`

### CDS Sequences (INCOMPLETE)
- Query: `txid2552587[Organism:exp]` # all RNA virus CDS
- Date: `201205` # error in this version
- Results: `2 825 230` sequences
- File : `cdsViro_gb201005.fa`

### Protein Sequences
- Query: # GenBank Protein v240 [see README](ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/README.asn1.protein_fasta)
- Date: `201220` # error in this version
- Results: `5 095 919` aa sequences
- File : `gbViro240.fa`

In [None]:
# Download GenBank v240 Viral Protein Sequences
# (RAN ON EC2)
#

wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl1.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl2.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl3.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl4.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl5.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl6.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl7.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl8.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl9.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl10.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl11.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl12.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl13.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl14.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl15.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl16.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl17.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl18.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl19.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl20.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl21.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl22.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl23.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl24.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl25.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl26.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl27.fsa_aa.gz 

gzip -d *
cat * > aaViro_gb240.fa

In [None]:
# EC2
AA='gbViro240.fa'
grep ">" $AA | wc -l
# 5095919
md5sum $AA
# 078505a82990197ccee5ebdbdf48ad6f  gbViro240.fa
md5sum $AA > $AA.md5

In [None]:
# December 23rd Release 241 was available
mkdir gb241; cd gb241

wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl1.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl2.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl3.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl4.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl5.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl6.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl7.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl8.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl9.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl10.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl11.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl12.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl13.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl14.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl15.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl16.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl17.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl18.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl19.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl20.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl21.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl22.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl23.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl24.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl25.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl26.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl27.fsa_aa.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/gbvrl28.fsa_aa.gz 

gzip -d *.gz

# Viral GenBank version 241
cat * > vgb241.fa

# md5: e0e2f609074182de74745eafc8acd976  vgb241.fa
md5sum vgb241.fa > vgb241.fa.md5

# faidx: 5 267 608 entries
seqkit faidx -f vgb241.fa; mv vgb241.fa.seqkit.fai vgb241.fa.fai
rm *.fsa_aa

In [None]:
# December 23rd Release 241 was available -- NUCLEOTIDE
mkdir gb241_nt; cd gb241_nt

wget ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2fsa/linux64.asn2fsa.gz
gzip -dc linux64.asn2fsa.gz > asn2fsa
chmod 755 asn2fsa

wget ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2all/linux64.asn2all.gz
gzip -dc linux64.asn2all.gz > asn2all
chmod 755 asn2all

wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl1.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl2.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl3.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl4.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl5.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl6.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl7.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl8.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl9.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl10.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl11.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl12.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl13.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl14.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl15.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl16.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl17.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl18.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl19.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl20.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl21.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl22.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl23.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl24.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl25.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl26.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl27.aso.gz 
wget ftp://ftp.ncbi.nih.gov/ncbi-asn1/gbvrl28.aso.gz 

gzip -d *.gz

# Viral GenBank version 241
# -a t : official release
# -b   : binary input
#
# ./asn2fsa -T -a t -b -d ./ -o vgb241.nt.fa
# FASTA
for ASO in $(ls *.aso); do
  ./asn2fsa -T -a t -b -i $ASO -o $ASO.tmp
done

# GenBank
for ASO in $(ls *.aso); do
  ./asn2all -f g -T -a t -b -i $ASO -o $ASO.tmp
done


cat *.tmp > vgb241.nt.fa

rm *.aso *.tmp

# a797155487e1d05f9899980bc9b4a04e  vgb241.nt.fa

# faidx: 3 261 824 entries
seqkit faidx -f vgb241.nt.fa; mv vgb241.nt.fa.seqkit.fai vgb241.nt.fa.fai

# Upload
aws s3 sync ./ s3://serratus-public/notebook/201226_rdrp0/vgb241_nt/


In [2]:
cd $WORK
NT='ntViro_gb201205.fa'
grep ">" $NT | wc -l
md5sum $NT
md5sum $NT > $NT.md5

CDS='cdsViro_gb201205.fa'
grep ">" $CDS | wc -l
md5sum $CDS
md5sum $CDS > $CDS.md5

AA='aaViro_gb201212.fa'
grep ">" $AA | wc -l
md5sum $AA
md5sum $AA > $AA.md5

# GB RdRp sequences (see below WOLF18)
GB='gbRdRp_201212.fa'
grep ">" $GB | wc -l
md5sum $GB
md5sum $GB > $GB.md5

# YA RdRp sequences (see below WOLF20)
YA='gbRdRp_201212.fa'
grep ">" $YA | wc -l
md5sum $YA
md5sum $YA > $YA.md5

3535357
9102eceda85185cfb124023dbf129621  ntViro_gb201205.fa
2825230
81946a20fc18b24eb6b49541d41b8dd0  cdsViro_gb201205.fa
2825230
81946a20fc18b24eb6b49541d41b8dd0  aaViro_gb201212.fa
13870
671fb5b3f02fd41457be4d6c7a31a417  gbRdRp_201212.fa
13870
671fb5b3f02fd41457be4d6c7a31a417  gbRdRp_201212.fa


## WOLF18 RdRp

FTP Access: `ftp://ftp.ncbi.nlm.nih.gov/pub/wolf/_suppl/rnavir18/`

Sequence data: `ftp://ftp.ncbi.nlm.nih.gov/pub/wolf/_suppl/rnavir18/RNAvirome.S2.afa`

Saved as: `gb_rdrp.afa`

Date Accessed: `201212`

![Figure 1](/home/artem/serratus/notebook/201210_rdrp0/wolf18/figure1.png)

### Level 1 - Supergroup / Branches

The RdRp can be broadly classified into 5 branches which will form the lowest level of the hierarchy: `rdrp1`, `rdrp2`, `rdrp3`, `rdrp4`, `rdrp5`.

>Branch 1 consists of leviviruses and their eukaryotic relatives, namely, “mitoviruses,” “narnaviruses,” and “ourmiaviruses” (the latter three terms are placed in quotation marks as our analysis contradicts the current ICTV framework, which classifies mitoviruses and narnaviruses as members of one family, Narnaviridae, and ourmiaviruses as members of a free-floating genus, Ourmiavirus).

> Branch 2 (“picornavirus supergroup”) consists of a large assemblage of +RNA viruses of eukaryotes, in particular, those of the orders Picornavirales and Nidovirales; the families Caliciviridae, Potyviridae, Astroviridae, and Solemoviridae, a lineage of dsRNA viruses that includes partitiviruses and picobirnaviruses; and several other, smaller groups of +RNA and dsRNA viruses.

> Branch 3 consists of a distinct subset of +RNA viruses, including the “alphavirus supergroup” along with the “flavivirus supergroup,” nodaviruses, and tombusviruses; the “statovirus,” “wèivirus,” “yànvirus,” and “zhàovirus” groups; and several additional, smaller groups.

> Branch 4 consists of dsRNA viruses, including cystoviruses, reoviruses, and totiviruses and several additional families.

> Branch 5 consists of −RNA viruses.

Boundary defintions of Branches with relation to RdRp are taken from paper

Based on: [Supplementary Data 4](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6282212/bin/mbo006184203sd4.txt)
Saved as: `rdrp_representative_branches.tree`

### Level 2 - Viral Family

Based on: `https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6282212/bin/mbo006184203sd1.xls`
Saved as: `wolf18_vlist.xlsx`

Spreadsheet includes the fields

- RdRp num ID: Ordinal numbering for RdRp
- RdRp GenBank Acc: Protein accession ID
- NCBI Tax ID: taxid from NCBI
- virus name: virus name
- taxonomy: taxonomic tree

Taxonomy field was parsed to retrieve "*dae* suffix for "Family", relatively appropriate family-name or "unclassified" when unavailable. Monkey work.

### Level 3 - Sequence/Species

Based on: `wolf18_vlist.xlsx`

- Virus name field (most shallow taxonomic classification) taken for each record.
- GenBank accession taken from each record.

### Example RdRp

```
rdrp5.Hantaviridae.Bowe_virus:AGW23849.1
rdrp5.Bunyaviridae.Azagny_virus:AEA42011.1
...
rdrp2.Coronaviridae.Night_heron_coronavirus_HKU19:YP_005352862.1
rdrp2.Coronaviridae.Munia_coronavirus_HKU13_3514:YP_002308505.1
rdrp2.Coronaviridae.Wigeon_coronavirus_HKU20:YP_005352870.1
rdrp2.Coronaviridae.Feline_infectious_peritonitis_virus:AGZ84535.1
rdrp2.Coronaviridae.Lucheng_Rn_rat_coronavirus:YP_009336483.1
rdrp2.Coronaviridae.Hipposideros_bat_coronavirus_HKU10:AFU92121.1
rdrp2.Coronaviridae.BtMs_AlphaCoV_GS2013:AIA62270.1
rdrp2.Coronaviridae.Chaerephon_bat_coronavirus_Kenya_KY41_2006:ADX59465.1
rdrp2.Coronaviridae.Porcine_epidemic_diarrhea_virus:AID56804.1
rdrp2.Coronaviridae.Bat_coronavirus_CDPHE15_USA_2006:YP_008439200.1
rdrp2.Coronaviridae.Anlong_Ms_bat_coronavirus:AID16674.1
rdrp2.Coronaviridae.Scotophilus_bat_coronavirus_512:YP_001351683.1
rdrp2.Coronaviridae.BtNv_AlphaCoV_SC2013:YP_009201729.1
rdrp2.Coronaviridae.Bat_coronavirus_1B:ACA52156.1
rdrp2.Coronaviridae.NL63_related_bat_coronavirus:YP_009328933.1
rdrp2.Coronaviridae.NL63_related_bat_coronavirus:APD51489.1
rdrp2.Coronaviridae.229E_related_bat_coronavirus:ALK43115.1
rdrp2.Coronaviridae.Rhinolophus_bat_coronavirus_HKU2:ABQ57223.1
rdrp2.Coronaviridae.Wencheng_Sm_shrew_coronavirus:AID16677.1
rdrp2.Coronaviridae.Human_coronavirus_HKU1:ABD75543.1
rdrp2.Coronaviridae.Betacoronavirus_Erinaceus_VMC_DEU_2012:YP_008719930.1
rdrp2.Coronaviridae.Pipistrellus_bat_coronavirus_HKU5:YP_001039961.1
rdrp2.Coronaviridae.Rousettus_bat_coronavirus:AOG30811.1
rdrp2.Coronaviridae.Rousettus_bat_coronavirus:YP_009273004.1
rdrp2.Coronaviridae.Bat_CoV_279_2005:P0C6V9.1
rdrp2.Coronaviridae.Bat_Hp_betacoronavirus_Zhejiang2013:YP_009072438.1
rdrp2.Coronaviridae.Bottlenose_dolphin_coronavirus_HKU22:AHB63494.1
rdrp2.Coronaviridae.Duck_coronavirus:AKF17722.1
rdrp2.Coronaviridae.Avian_infectious_bronchitis_virus_partridge_GD_S14_2003:AAT70770.1
rdrp2.Coronaviridae.Infectious_bronchitis_virus:ADA83556.1
...
rdrp1.unclassified.Wenzhou_levi_like_virus_3:APG77299.1
rdrp1.Leviviridae.Pseudomonas_phage_PP7:NP_042307.1

```

saved as: `gb_assign_group.txt`

In [2]:
# 201213 REPEAT; accidently duplicated rdrp0.fa output here
# 201226 REPEAT; update the viral-family classification names (monkey work)
cd $WORK/wolf18

In [3]:
grep ">" gb_rdrp.afa | tail -
echo ''
head gb_assign_group_v2.txt

>AMN92168.1|Bourbon_virus
>YP_009352882.1|Dhori_virus
>YP_145794.1|Thogoto_virus
>AHB34055.1|Upolu_virus
>ABF68025.1|Infectious_salmon_anemia_virus
>AQM37684.1|Steelhead_trout_orthomyxovirus_1
>APG77864.1|Beihai_orthomyxo_like_virus_2
>APG77905.1|Hubei_orthomyxo_like_virus_5
>YP_009246481.1|Tilapia_lake_virus
>YP_009337891.1|Changping_earthworm_virus_2

>branch.family.name:acc
>rdrp5.unclassified.Wuhan_Insect_virus_3:AJG39263
>rdrp5.unclassified.Mucorales_RNA_virus_1:AMK47917
>rdrp5.unclassified.Wenling_crustacean_virus_9:YP_009329879
>rdrp5.Bunyaviridae.Groundnut_chlorotic_fan_spot_virus:AJT59689
>rdrp5.Tospoviridae.Soybean_vein_necrosis_virus:ADX01591
>rdrp5.Tospoviridae.Bean_necrotic_mosaic_virus:YP_006468898
>rdrp5.Tospoviridae.Polygonum_ringspot_tospovirus:AHZ45965
>rdrp5.Tospoviridae.Pepper_chlorotic_spot_virus:AQX77525
>rdrp5.Tospoviridae.Melon_yellow_spot_virus:BAG82842


In [4]:
# Rename the header to remove virus name
# remove gaps from sequence (unaligned)
sed 's/|.*//g' gb_rdrp.afa \
  | sed 's/-//g' - \
  > rdrp_1.tmp

# One sequence "Pseudomonas_phage_phiYY" has no accession
# YP_009618381.1
sed -i 's/^>$/>YP_009618381.1/g' rdrp_1.tmp

tail -n3 rdrp_1.tmp

KAPDSAARESLDRASEIMTGKSYNAVHTGDLSKLPNQGESPLRIVDSDLYSERSCCWVIEKEGRVVCKSTTLTRGMTGLLNTTRCSSPSELICKVLTVESLSEKIGDTSVEELLSHGRYFKCALRDQERGKPKSRAIFLSHPFFRLLSSVVETHARSVLSKVSAVYTATASAEQRAMMAAQVVESRKHVLNGDCTKYNEAIDADTLLKVWDAIGMGSIGVMLAYMVRRKCVLIKDTLVECPGGMLMGMFNATATLALQGTTDRFLSFSDDFITSFNSPAELREIEDLLFASCHNLSLKKSYISVASLEINSCTLTRDGDLATGLGCTAGVPFRGPLVTLKQTAAMLSGAVDSGVMPFHSAERLFQIKQQECAYRYNNPTYTTRNEDFLPTCLGGKTVISFQSLLTWDCHPFWYQVHPDGPDTIDQKVLSVLASK
>YP_009337891.1
WDDQDQSMFLRPKNRTGYGPLIFNTMKRISDMSPTRARELSEVFSVTEKERSISVLASGGTKFVPARGTSVPASTAFWDYQDQMRPIFEHYNIKYTDNSWWHIVICANIFGEYFEILPPTWDRSTLTKLFVEIFSAGLAVKQTEHNRSEGRNIVTMSISLQNFQNFVEEVAKIVNRMTGSHGTDLSSLEKRDLLRKVGLAASIELDTFLASLDKTKWNQLLQISTAMLLLAASYPNDASERRFVLLVGQIWREKCLYFPSKHSYYTGGMKTPKTIDELSRMNDEQLLNDNIRDDLMMVLRHYRKKRVIPQYIKCDLIMLMGMFNHSSTTLHIWPAYANHLDDNQTVSKIIDFCASSDDSMVRAKKILGMSALESYRTISSLWKSMGLNDSEDKSIIHDRLVKVEYNSNVFSMGQLIPNLSRDVAGTKVLYENPEKDLETMKNQLFVYINEGTLSTQDAAIILSDKYLTSLDIHDMLPFQKRHPIFLNNLTSAGLIPQCIPIWCGGTNHIPPELWGTMDDKMYWYHHHKDTGKTNLYLEFLASISTPPDV

In [5]:
# Iterate through each line / fasta name.
# to swap out headers

while read -r line; do
  # Find headers
  if [[ "$line" = ">"* ]]; then
    acc=$(echo $line | sed 's/\..*//g' - | sed 's/>//g' -)
    newheader=$(grep "$acc" gb_assign_group_v2.txt)
    
    if [[ "$newheader" = "" ]]; then
      echo ">NA_$acc" >> rdrp_2.tmp
    else
      echo $newheader >> rdrp_2.tmp
    fi
    
  else
    # print aa sequence
    echo $line >> rdrp_2.tmp 
  fi
done < rdrp_1.tmp


# Manually remove first ten sequences (Group II introns outgroup)
tail -n +21 rdrp_2.tmp > rdrp_3.tmp

mv rdrp_3.tmp ../gbRdRp_201212.fa

md5sum ../gbRdRp_201212.fa
#rm *.tmp

c912fa5199f573bbffcf4549336b9c4b  ../gbRdRp_201212.fa


## WOLF20 RdRp

FTP Access: `ftp://ftp.ncbi.nlm.nih.gov/pub/wolf/_suppl/yangshan/`

Sequence data: `gb_rdrp.afa`

Saved as: `gb_rdrp.fa`

Date Accessed: `201212`

![Figure 2](/home/artem/serratus/notebook/201210_rdrp0/wolf20/wolf20_figure2.jpg)

>RNA virome analysis performed using complementary DNA derived from approximately 10 l of samples from Yangshan Deep-Water Harbour yielded 4,593 nearly full-length RNA virus RdRPs that formed 2,192 clusters at 75% amino acid identity which represents virus diversity at a level between species and genus. Among the RdRP sequences from GenBank (October 2018), 2,021 comparable clusters were detected. Thus, the 10 l water sample analysed here more than doubles the known diversity of RNA viruses.

There are two sets of data treated independently here; the genbank RdRp and the yangshan RdRp. Only yangshen sequences will be considered as the genbank records are not comprehensive. The data is clustered globally at `75%` aa identity.

### Level 1/2 - Supergroup / Branches and Clusters

From [Supplementary Table 1](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7508674/bin/41564_2020_755_MOESM2_ESM.xlsx) parsed the "clade id" field to retrieve branch numbering and relate to the OV.x clusters. A few sequences are unassigned / deep so they will form `rdrp0`.

The `OV.x` was renamed to `yaOVx` to remove period. So `OV.1` became `rdrp2.yaOV1`

Saved as: `ovx.branches.txt`

### Level 3 - ORF Identifier
Sequence data is from `rdrp.ya.fa`

Header parsing:
`>ya20_JAAOEH010000011_1 JAAOEH010000011.1 5194-3782 OV.1 NODE_11_truseq orf.65`
will become
`>rdrp2.yaOV1.orf65


In [17]:
cd $WORK/wolf20

# Iterate through each line / fasta name.
# to swap out headers

while read -r line; do
  # Find headers
  if [[ "$line" = ">"* ]]; then
    acc=$(echo $line | cut -f 4 -d ' ' - | sed 's/\.//g' -)
    branch=$(grep "$acc\." ovx.branches.txt)
    orf=$(echo $line | sed 's/.* //g' - | sed 's/\.//g' - )
    
    if [[ "$branch" = "" ]]; then
      echo ">NA" >> ya_2.tmp
    else
      echo ">$branch""$orf" >> ya_2.tmp
    fi
    
  else
    # print aa sequence
    echo $line >> ya_2.tmp 
  fi
done < rdrp.ya.fa

mv ya_2.tmp ../yaRdRp_201212.fa

mv: missing destination file operand after 'ya_2.tmp'
Try 'mv --help' for more information.


: 1

## rdrp0 - pilot panproteome


In [6]:
cd $WORK

# Make rdrp0
cat gbRdRp_201212.fa yaRdRp_201212.fa > rdrp0.fa
samtools faidx rdrp0.fa

md5sum gbRdRp_201212.fa
md5sum yaRdRp_201212.fa 
md5sum rdrp0.fa
md5sum rdrp0.fa > rdrp0.fa.md5

c912fa5199f573bbffcf4549336b9c4b  gbRdRp_201212.fa
7e6781ed5902ea6034393c0dae521c62  yaRdRp_201212.fa
ce012c07be4cfc2bc7abd2dc4c81c10a  rdrp0.fa


## Upload to S3

In [7]:
cd $WORK
ls -alh

total 9.8M
drwxrwxr-x  5 artem artem 4.0K Dec 26 22:31 [0m[01;34m.[0m
drwxr-xr-x 41 artem artem  12K Dec 26 22:31 [01;34m..[0m
-rw-rw-r--  1 artem artem 2.5M Dec 26 22:31 gbRdRp_201212.fa
-rw-rw-r--  1 artem artem  16K Dec 20 23:18 quenya_rdrp.fa
-rw-rw-r--  1 artem artem 4.6M Dec 26 22:31 rdrp0.fa
-rw-rw-r--  1 artem artem 524K Dec 26 22:31 rdrp0.fa.fai
-rw-rw-r--  1 artem artem   43 Dec 26 22:31 rdrp0.fa.md5
drwxrwxr-x  7 artem artem 4.0K Dec 19 16:39 [01;34mrev1[0m
drwxrwxr-x  2 artem artem 4.0K Dec 26 22:31 [01;34mwolf18[0m
drwxrwxr-x  5 artem artem 192K Dec 12 22:20 [01;34mwolf20[0m
-rw-rw-r--  1 artem artem 2.1M Dec 12 22:14 yaRdRp_201212.fa


In [8]:
aws s3 sync ./ $S3_WORK # done

upload: ./rdrp0.fa.md5 to s3://serratus-public/notebook/201210_rdrp0/rdrp0.fa.md5
upload: wolf18/~$wolf18_vlist.xlsx to s3://serratus-public/notebook/201210_rdrp0/wolf18/~$wolf18_vlist.xlsx
upload: wolf18/gb_assign_group_v2.txt to s3://serratus-public/notebook/201210_rdrp0/wolf18/gb_assign_group_v2.txt
upload: ./rdrp0.fa.fai to s3://serratus-public/notebook/201210_rdrp0/rdrp0.fa.fai
upload: ./gbRdRp_201212.fa to s3://serratus-public/notebook/201210_rdrp0/gbRdRp_201212.fa
upload: ./rdrp0.fa to s3://serratus-public/notebook/201210_rdrp0/rdrp0.fa
upload: wolf18/rdrp_1.tmp to s3://serratus-public/notebook/201210_rdrp0/wolf18/rdrp_1.tmp
upload: wolf18/rdrp_2.tmp to s3://serratus-public/notebook/201210_rdrp0/wolf18/rdrp_2.tmp


## Create rdrp0 database

In [None]:
# Load serratus-align container on EC2
# From base amazon linux 2
sudo yum install -y docker
sudo yum install -y git
sudo service docker start

git clone --branch diamond-dev https://github.com/ababaian/serratus.git; cd serratus

# If you want to upload containers to your repository, include this.
export DOCKERHUB_USER='serratusbio' # optional
sudo docker login # optional

# Build all containers and upload them docker hub repo (if available)
cd containers
./build_containers.sh   # run this in the folder 'serratus/containers'

sudo docker run --rm --entrypoint /bin/bash \
  -it serratus-align:latest

In [None]:
# rdrp0 v201213 for pilot run
mkdir rdrp0; cd rdrp0
GENOME='rdrp0'

# Download rdrp0
aws s3 cp s3://serratus-public/notebook/201210_rdrp0/rdrp0.fa ./

# Make diamond index for protref5
diamond makedb --in $GENOME.fa -d $GENOME

# Make fasta index for protref5
samtools faidx $GENOME.fa
mv $GENOME.fa.fai $GENOME.sumzer.tsv

md5sum * > $GENOME.md5

# use protref5 msa as place-holder
aws s3 cp s3://serratus-public/seq/protref5/protref5.msa ./rdrp0.msa

# 657d302bd62a9e0b588668101f581e4c  rdrp0.dmnd
# 8479a3347bbe73224cb2eac0c2138a92  rdrp0.fa
# 6bf2ffa27bb2b08bf0d6056b675fa348  rdrp0.sumzer.tsv
# e094fc7db19c07ffcedf8bc42963ab80  rdrp0.msa

# Upload to S3
aws s3 sync ./ s3://serratus-public/seq/$GENOME/


### Revision 0 - Clean-up "Virus Name" field (deprecated)

In parsing the "Virus Name" from `rdrp0` the "." character was in quite a few virus names and these result in icky downstream parsing. This is a manual removal of those periods from all rdrp0 names to make a 'clean' version


In [None]:
# revision 0 folder
mkdir rev0; cd rev0

# Wolf18 Genbank Sequences
cp ../gbRdRp_201212.fa  ./
# Wolf20 Yangshen Sequences
cp ../yaRdRp_201212.fa  ./ 

In [None]:
# Update Sequence names (to make uniform)
# This is from ongoing work from

## sp. --> sp
grep "sp\." gbRdRp_201212.fa | less -NS -
sed -i 's/sp\./sp/g' gbRdRp_201212.fa

#      1 >rdrp1.Narnaviridae.Rhizophagus_sp._HR1_mitovirus_like_ssRNA:BAN85985.1
#      2 >rdrp1.Narnaviridae.Rhizophagus_sp._RF1_mitovirus:BAJ23143.2
#      3 >rdrp2.Picobirnaviridae.Picobirnavirus_sp.:AQS16638.1
#      4 >rdrp2.Picobirnaviridae.Picobirnavirus_sp.:AOW41971.1
#      5 >rdrp2.Picobirnaviridae.Picobirnavirus_sp.:AOW41973.1
#      6 >rdrp2.Iflaviridae.Iflavirus_sp.:APB88805.1
#      7 >rdrp2.unclassified.Chaetoceros_sp._RNA_virus_2:BAK40203.1
#      8 >rdrp2.unclassified.Posavirus_sp.:APQ44560.1
#      9 >rdrp2.unclassified.Posavirus_sp.:APQ44553.1
#     10 >rdrp2.unclassified.Posavirus_sp.:APQ44556.1
#     11 >rdrp2.unclassified.Posavirus_sp.:APQ44559.1
#     12 >rdrp2.unclassified.Posavirus_sp.:APQ44517.1
#     13 >rdrp2.unclassified.Basavirus_sp.:APQ44489.1
#     14 >rdrp2.unclassified.Basavirus_sp.:APQ44495.1
#     15 >rdrp2.unclassified.Posavirus_sp.:APQ44558.1
#     16 >rdrp2.unclassified.Basavirus_sp.:APQ44499.1
#     17 >rdrp2.unclassified.Basavirus_sp.:APQ44492.1
#     18 >rdrp2.unclassified.Basavirus_sp.:APQ44496.1
#     19 >rdrp2.unclassified.Basavirus_sp.:APQ44502.1
#     20 >rdrp2.unclassified.Posavirus_sp.:APQ44531.1
#     21 >rdrp2.unclassified.Posavirus_sp.:YP_009333148.1
#     22 >rdrp2.unclassified.Posavirus_sp.:APQ44537.1
#     23 >rdrp2.unclassified.Rasavirus_sp.:APQ44507.1
#     24 >rdrp2.unclassified.Husavirus_sp.:APQ44514.1
#     25 >rdrp2.unclassified.Rasavirus_sp.:APQ44506.1
#     26 >rdrp2.unclassified.Rasavirus_sp.:YP_009333305.1
#     27 >rdrp2.unclassified.Posavirus_sp.:APQ44547.1
#     28 >rdrp2.unclassified.Basavirus_sp.:APQ44500.1
#     29 >rdrp2.Picornaviridae.Sicinivirus_sp.:APR73491.1
#     30 >rdrp2.Picornaviridae.Picornaviridae_sp._rodent_Ee_PicoV_NX2015:APA29022.1
#     31 >rdrp2.Picornaviridae.Picornaviridae_sp._rodent_Mc_PicoV_Tibet2015:APA29023.1
#     32 >rdrp2.Picornaviridae.Enterovirus_sp.:AHY21610.1
#     33 >rdrp2.Picornaviridae.Picornaviridae_sp._rodent_CK_PicoV_Tibet2014:APA29019.1
#     34 >rdrp2.Picornaviridae.Picornaviridae_sp._rodent_Rn_PicoV_SX2015_1:APA29018.1
#     35 >rdrp2.Picornaviridae.Picornaviridae_sp._rodent_Ds_PicoV_IM2014:APA29017.1
#     36 >rdrp3.Picobirnaviridae.Picobirnavirus_sp.:AOW41972.1
#     37 >rdrp4.unclassified.unclassified_Rhizophagus_sp._RF1_medium_virus:BAJ23141.1
#     38 >rdrp4.Chrysoviridae.Fusarium_oxysporum_f._sp._dianthi_mycovirus_1:YP_009158913.1
#     39 >rdrp5.Bunyaviridae.Bunyavirus_sp.:AOY18806.1


In [None]:
## Manual interventions

grep ">" gbRdRp_201212.fa | cut -f3- -d'.' | cut -f1 -d':' | grep "\." - | less -NS

#      1 Chaetoceros_socialis_f._radians_RNA_virus_01
sed -i 's/f\./f/g' gbRdRp_201212.fa
#      2 Norovirus_dog_GVI.1_HKU_Ca026F_2007_HKG
sed -i 's/GVI\.1/GVI_1/g' gbRdRp_201212.fa
#      3 Norovirus_cat_GIV.2_CU081210E_USA_2010
sed -i 's/GIV\.2/GIV_2/g' gbRdRp_201212.fa
#      4 Norovirus_Hu_GII.12_CGMH42_2010_TW
#      5 Norovirus_Hu_GII.12_CGMH40_2010_TW
#      6 Norovirus_GII.17
sed -i 's/GII\.12/GII_12/g' gbRdRp_201212.fa
sed -i 's/GII\.17/GII_17/g' gbRdRp_201212.fa
#      7 Sapovirus_Hu_GI.2_BR_DF01_BRA_2009
sed -i 's/GI\.2/GI_2/g' gbRdRp_201212.fa
#      8 Sapovirus_GII.8
sed -i 's/GII\.8/GII_8/g' gbRdRp_201212.fa
#      9 St._Louis_encephalitis_virus
#     10 St._Louis_encephalitis_virus
sed -i 's/St\._L/St_L/g' gbRdRp_201212.fa
#     11 Fusarium_oxysporum_f._sp_dianthi_mycovirus_1
##

# Accesion field does not contain a version number
# Do not chnage accessions at this point
grep ">" gbRdRp_201212.fa |  grep -v "\.[0-9]$" -
#>rdrp1.Leviviridae.Escherichia_virus_Qbeta:4R71
#>rdrp3.Flaviviridae.Douroucouli_hepatitis_GB_virus_A:T08841
#>rdrp3.Flaviviridae.Marmoset_hepatitis_GB_virus_A:T08839
#>rdrp3.Alphaflexiviridae.Potato_aucuba_mosaic_virus:2012194A
#>rdrp3.Virgaviridae.Barley_stripe_mosaic_virus:2211403A
#>rdrp4.Reoviridae.Simian_rotavirus:2R7Q
#>rdrp4.unclassified.White_button_mushroom_virus_1:T00494
#>rdrp5.Peribunyaviridae.La_Crosse_virus:5AMR_A
#>rdrp5.Orthomyxoviridae.Influenza_B_virus_(B_Memphis_13_2003):4WRT_B
#>rdrp5.Orthomyxoviridae.Influenza_A_virus_(A_little_yellow_shouldered_bat_Guatemala_060_2010(H17N10)):4WSB_B


md5sum *
# 7d2c1858d5d842ada3e8103c43f29e27  gbRdRp_201212.fa
# 7e6781ed5902ea6034393c0dae521c62  yaRdRp_201212.fa

# Overwrite previous gb versions
mv gbRdRp_201212.fa ../
mv yaRdRp_201212.fa ../

## Revision 1 - Cluster known + unclassified sequences

Some lessons learned so far... there are ALOT of new viruses to be discovered! From the ~9.5K datasets run with the `rdrp0` pilot, there are ~4,700 high score (>50) and high divergence (55-85% aa id) hits at the family level. Random sampling from vertebrate/virome SRA queries means overlap is likely to be minimal here but a conservative estimate would be 1,000 distinct RdRp, about 25% of the known biodiversity of sequences.

### Objectives

- There is a large amount of "unclassified" sequences in the data within branches, these should be grouped relative to one another such that if two unclassified sequences exist in one branch, they do not "collide" in the read summaries.



In [1]:
# Serratus commit version
SERRATUS="/home/artem/serratus"
cd $SERRATUS

# Create local run directory
WORK="$SERRATUS/notebook/201210_rdrp0"
mkdir -p $WORK; cd $WORK

# S3 notebook path
S3_WORK='s3://serratus-public/notebook/201210_rdrp0/'

# date and version
date
git rev-parse HEAD # commit version

Tue Dec 15 20:57:00 PST 2020
6ac78a036910813c0f5fb2e7ef0b88599e683959


In [None]:
# Work on EC2 Instance

# Local usearch install
#The clustered database was made with usearch:
wget https://drive5.com/downloads/usearch11.0.667_i86linux32.gz
gzip -dc usearch11.0.667_i86linux32.gz > usearch
chmod 755 usearch; sudo mv usearch /usr/bin/usearch
rm https://drive5.com/downloads/usearch11.0.667_i86linux32.gz

# Install seqkit
wget https://github.com/shenwei356/seqkit/releases/download/v0.12.0/seqkit_linux_amd64.tar.gz
  tar -xvf seqkit*
  sudo mv seqkit /usr/local/bin/
  rm seqkit_linux*


In [None]:
# Initialize workspace

# Download rdrp working directory
mkdir rdrp0; cd rdrp0
aws s3 sync s3://serratus-public/notebook/201210_rdrp0/ ./

# revision 1 folder
mkdir rev1; cd rev1

# Wolf18 == Genbank Sequences from 2018
cp ../gbRdRp_201212.fa  ./
# Wolf20 Yangshen Sequences
cp ../yaRdRp_201212.fa  ./ 


In [None]:
# in gbRdRp (from wolf 18); seperate out taxonomic and unclassified sequences
grep ">" gbRdRp_201212.fa | wc -l
# 4617

# -v inverse match
seqkit grep -v -r -p 'unclassified' gbRdRp_201212.fa > gb_tax.fa
grep ">" gb_tax.fa | wc -l
# 2802

seqkit grep -r -p 'unclassified' gbRdRp_201212.fa > gb_unc.fa
grep ">" gb_unc.fa | wc -l
# 1815


In [None]:
# Sort and Uclust unclassified sequences

usearch -sortbylength gb_tax.fa \
   -fastaout gb_tax.sort.fa

usearch -sortbylength gb_unc.fa \
   -fastaout gb_unc.sort.fa

cat gb_tax.sort.fa gb_unc.sort.fa > gb_cat.sort.fa


In [None]:
# Cluster sequences at 75% identity

# Prune TAX
usearch -cluster_smallmem gb_tax.sort.fa \
   -id 0.75 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_tax.id75.uc \
   -centroids gb_tax.id75.fa

#      Seqs  2802
#  Clusters  1738
#  Max size  29
#  Avg size  1.6
#  Min size  1
#Singletons  1298, 46.3% of seqs, 74.7% of clusters



# Prune UNC
usearch -cluster_smallmem gb_unc.sort.fa \
   -id 0.75 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_unc.id75.uc \
   -centroids gb_unc.id75.fa

#      Seqs  1815
#  Clusters  1708
#  Max size  6
#  Avg size  1.1
#  Min size  1
#Singletons  1620, 89.3% of seqs, 94.8% of clusters



# Prune CAT
usearch -cluster_smallmem gb_cat.sort.fa \
   -sortedby other \
   -id 0.75 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_cat.id75.uc \
   -centroids gb_cat.id75.fa
   
#      Seqs  4617
#  Clusters  3426
#  Max size  29
#  Avg size  1.3
#  Min size  1
#Singletons  2889, 62.6% of seqs, 84.3% of clusters

## At 75%, no "unclassifed" sequence groups with a taxonomic identifier

mkdir id75
mv *id75.* id75/

In [None]:
# Repeat process at 55%; base of diamond detection

# Prune TAX
usearch -cluster_smallmem gb_tax.sort.fa \
   -id 0.55 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_tax.id55.uc \
   -centroids gb_tax.id55.fa

#      Seqs  2802
#  Clusters  781
#  Max size  143
#  Avg size  3.6
#  Min size  1
#Singletons  433, 15.5% of seqs, 55.4% of clusters



# Prune UNC
usearch -cluster_smallmem gb_unc.sort.fa \
   -id 0.55 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_unc.id55.uc \
   -centroids gb_unc.id55.fa

#      Seqs  1815
#  Clusters  1428
#  Max size  11
#  Avg size  1.3
#  Min size  1
#Singletons  1190, 65.6% of seqs, 83.3% of clusters



# Prune CAT
usearch -cluster_smallmem gb_cat.sort.fa \
   -sortedby other \
   -id 0.55 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_cat.id55.uc \
   -centroids gb_cat.id55.fa

#      Seqs  4617
#  Clusters  2143
#  Max size  143
#  Avg size  2.2
#  Min size  1
#Singletons  1548, 33.5% of seqs, 72.2% of clusters


## At 55%, no "unclassifed" sequence groups with a taxonomic identifier

mkdir id55
mv *.id55.* id55/

In [None]:
# Repeat process at 45%; base of diamond detection

# Prune TAX
usearch -cluster_smallmem gb_tax.sort.fa \
   -id 0.45 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_tax.id45.uc \
   -centroids gb_tax.id45.fa

#      Seqs  2802
#  Clusters  566
#  Max size  156
#  Avg size  5.0
#  Min size  1
#Singletons  280, 10.0% of seqs, 49.5% of clusters



# Prune UNC
usearch -cluster_smallmem gb_unc.sort.fa \
   -id 0.45 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_unc.id45.uc \
   -centroids gb_unc.id45.fa

#      Seqs  1815
#  Clusters  1176
#  Max size  17
#  Avg size  1.5
#  Min size  1
#Singletons  886, 48.8% of seqs, 75.3% of clusters

# Prune CAT
usearch -cluster_smallmem gb_cat.sort.fa \
   -sortedby other \
   -id 0.45 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_cat.id45.uc \
   -centroids gb_cat.id45.fa

#      Seqs  4617
#  Clusters  1651
#  Max size  156
#  Avg size  2.8
#  Min size  1
#Singletons  1073, 23.2% of seqs, 65.0% of clusters


## At 45%, no "unclassifed" sequence groups with a taxonomic identifier.

mkdir id45
mv *.id45.* id45/

In [None]:
# Repeat process at 35%; base of diamond detection (NA)

# Prune TAX
usearch -cluster_smallmem gb_tax.sort.fa \
   -id 0.35 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_tax.id35.uc \
   -centroids gb_tax.id35.fa

# Prune UNC
usearch -cluster_smallmem gb_unc.sort.fa \
   -id 0.35 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_unc.id35.uc \
   -centroids gb_unc.id35.fa

# Prune CAT
usearch -cluster_smallmem gb_cat.sort.fa \
   -sortedby other \
   -id 0.35 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_cat.id35.uc \
   -centroids gb_cat.id35.fa

## At 35%, no "unclassifed" sequence groups with a taxonomic identifier.

mkdir id35
mv *.id35.* id35/

In [None]:
mkdir -p all
mv *.fa all/

#### Conclusions

The clustering here is actually not as extensive as I originally had thought it would be. It looks like 

#### Assigning Sequence Clusters


In [None]:
# Identity to use for clustering
ID='45'

# Extract centroid sequence names
grep ">" id$ID/gb_cat.id$ID.fa \
  | sed 's/>//g' \
  > centroid$ID.name.tmp


# Convert uc format to TSV
# CENTROID hit1 hit2 hitN ...
rm clust.tmp; touch clust.tmp

# Read through centroid names
while read -r line; do
  # parse to just accession number
  acc=$(echo $line | sed 's/.*://g')
  
  # Search accession in uclust file
  # extract seqname and make it one line
  grep "$acc" id$ID/gb_cat.id$ID.uc \
    | cut -f 9 \
    | tr '\n' '\t' \
    >> clust.tmp
    
  echo -e "\n" >> clust.tmp

done < centroid$ID.name.tmp

# remove empty rows
cat clust.tmp | sed '/^$/d' > cluster$ID.tsv


In [None]:
# Assign "family_id" to sequences (unclassified)

# Field 1: Centroid Name
# Field 2+: Sequence name
cut -f 1 cluster$ID.tsv \
  | sed 's/\./\t/g' - \
  | sed 's/:/\t/g' - \
  > centroid$ID.info

# Centroid Information tsv
# 1: Branch
# 2: Family
# 3: Rep. Viral name
# 4: Rep. Viral accession
# 5: Rep. Viral accession version
# 6: Re-annotated Viral Family Name

# Iterate through centroid file
# assign each "unclassified family" a unique ordinal number
# 1055 --> use 4 leading zeros

# ID
N=1
rm family2.tmp

while read -r line; do
  # Read Family name
  #echo $line
  
  family=$(echo $line | cut -f 2 -d' ' -)  

  if [[ "$family" = "unclassified" ]]; then

    uncN=$(printf "%04d" $N)
    echo "unc$uncN" >> family2.tmp
    
    # increment up
    N=$((N+1))
    
  else
    echo $family >> family2.tmp
  
  fi

done < centroid$ID.info

# Add 6th column of new family names
cp centroid$ID.info centroid.tmp
paste family2.tmp centroid.tmp  > centroid$ID.info
rm *.tmp

In [None]:
## Assign Centroid Family-Name to Each Sequence
cat all/gbRdRp_201212.fa all/yaRdRp_201212.fa > rdrp0_r1.fa

#
cut -f 1  centroid$ID.info > fam.name.tmp
cut -f 2- cluster$ID.tsv   > cluster.members.tmp
paste fam.name.tmp cluster.members.tmp > assign.family.tmp
# col 1 == new name
# col N == old sequence name

while read -r line; do
  newfam=$(echo $line | cut -f1 -d' ' -)
  echo $newfame
  
  echo $line \
    | cut -f 2- -d' ' - \
    > members.tmp
    
  cat members.tmp \
    | tr " " "\n" \
    > members2.tmp
    
    while read -r line2; do
      # rdrp5.Sunviridae.Sunshine_Coast_virus:YP_009094051.1
      branch=$(echo $line2 | cut -f1 -d'.')
      oldfam=$(echo $line2 | cut -f2 -d'.')
      virname=$(echo $line2 | cut -f3 -d'.' | cut -f1 -d':')
      acc=$(echo $line2 | cut -f2 -d':')
      
      echo $branch $oldfam $virname $acc
      echo $branch $newfam $virname $acc
      
      # Inline rename
      matchline=$( echo $(grep -n $acc rdrp0_r1.fa  | cut -f1 -d':' -)s)
      
      sed -i "$matchline/.*/>$branch.$newfam.$virname:$acc/" rdrp0_r1.fa
      
    done < members2.tmp
    
    echo ''

done < assign.family.tmp

grep ">" rdrp0_r1.fa | sed 's/>//g' > rdrp0_r1.fai

rm *.tmp
rm gbRdRp.fa yaRdRp.fa

In [None]:
aws s3 sync ./ s3://serratus-public/notebook/201210_rdrp0/rev1/

## CHECKPOINT

As of `201226` all code above this checkpoint is copied to a new directory

CHECKPOINT: `s3://serratus-public/notebook/201226_rdrp0/`

In [None]:
aws s3 sync ./ s3://serratus-public/notebook/201226_rdrp0/

## Revision 2 - RdRp from GenBank NT

In [None]:
mkdir rev2; cd rev2
cp ../rdrp0.fa ./

In [None]:
# Install diamond
# DIAMONDVERSION='0.9.35'
# wget --quiet https://github.com/bbuchfink/diamond/releases/download/v"$DIAMONDVERSION"/diamond-linux64.tar.gz
# tar -xvf diamond-linux64.tar.gz
# rm diamond-linux64.tar.gz
# sudo mv    diamond /usr/local/bin/

In [None]:
# Install diamond
DIAMONDVERSION='2.0.6-dev'
cd ~/

# Libraries for diamond
sudo yum -y install git gcc gcc-c++ glibc-devel \
  cmake patch automake zlib-devel

# grab latest with fix from Benjamin
git clone https://github.com/bbuchfink/diamond.git
cd diamond

mkdir bin; cd bin
cmake ..
make -j4
sudo make install

# build
# cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_BUILD_MARCH=nehalem ..
# make && make install

In [None]:
GENOME='rdrp0'
cp ../rev1/rdrp0_r1.fa ./rdrp0.fa

diamond makedb --in $GENOME.fa -d $GENOME

In [None]:
# Quenya sensitivity test
# rdrp segments from q-viruses
aws s3 cp s3://serratus-public/tmp/quenya5.fasta ./qRdRP.fa
aws s3 cp s3://serratus-public/tmp/dicistro_all.fasta ./dRdRP.fa
# 22 input Q nucleotide sequences

# Dicistro
IN='dRdRP.fa'
GENOME='rdrp0'
OUTNAME='d_v_rdrp0'

# Diamond blastx alignment

time cat $IN |\
diamond blastx \
  -d "$GENOME".dmnd \
  -p 4 \
  -k 1 \
  -f 6 qseqid qstart qend qlen qstrand \
       sseqid sstart send slen \
       pident evalue cigar \
       qseq qseq_translated \
  > "$OUTNAME".us.pro

# First trial:
# [ec2-user@ip-172-31-65-53 r2]$ grep ">" dRdRP.fa | wc -l
# 2871
# [ec2-user@ip-172-31-65-53 r2]$ wc -l d_v_rdrp0.pro 
# 854 d_v_rdrp0.pro

# Quenya
IN='qRdRP.fa'
GENOME='rdrp0'
OUTNAME='q_v_rdrp0'

# Diamond blastx alignment

time cat $IN |\
diamond blastx \
  -d "$GENOME".dmnd \
  -p 4 \
  -k 1 \
  -f 6 qseqid qstart qend qlen qstrand \
       sseqid sstart send slen \
       pident evalue cigar \
       qseq qseq_translated \
  > "$OUTNAME".pro
  

In [None]:
# Link to all genbank CDS
ln -s ../ntViro_gb201205.fa ./

# Input file
IN='ntViro_gb201205.fa'
#IN='tmp.fa'

GENOME='rdrp0'

# Output name
OUTNAME='gbViro_rdrp'

# Diamond blastx alignment
time cat $IN |\
diamond blastx \
  -d "$GENOME".dmnd \
  --unal 0 \
  -k 1 \
  -p 4 \
  -b 1 \
  -f 6 qseqid qstart qend qlen qstrand \
       sseqid sstart send slen \
       pident evalue cigar \
       qseq qseq_translated \
  > "$OUTNAME".pro
  
# real    1175m3.649s
# user    4650m34.609s
# sys     1m27.468s

# wc -l
# 390965 gbViro_rdrp.pro


In [None]:
aws s3 cp gbViro_rdrp.pro s3://serratus-public/notebook/201210_rdrp0/rev2/

In [None]:
# Notes:
#
# qseq
# qseq_translated 
# full_qseq_mate

# tmp.fa timing tests:
# tail -n +20000000 ntViro_gb201205.fa | head -n 500000 > tmp.fa

# default
# real    5m22.409s
# user    21m6.984s
# sys     0m0.410s

# --sensitive
# real    11m19.287s
# user    44m18.393s
# sys     0m0.595s

# output size is equal

# NOTE: sseq returns protein sequence from the database, not the query
# Changed field 15 sseq to qseq
# use qseq instead

In [None]:
OUTNAME='gbViro_rdrp'

# qseqid qstart qend qlen qstrand \
# sseqid sstart send slen \
# pident evalue cigar \
# qseq qseq_translated \

# AA of result hits
cut -f1,13 $OUTNAME.pro |  sed 's/^/>/g' | sed 's/\t/\n/g' > $OUTNAME.aa_hit.fa


In [None]:
# UCLUST
OUTNAME='gbViro_rdrp'
INPUT="$OUTNAME.aa_hit.fa"
OUTPUT="$OUTNAME.aa_hit.id75.fa"

usearch -sortbylength $INPUT \
   -fastaout tmp.sort.fa

mv tmp.sort.fa $INPUT

# Cluster at 75%
usearch -cluster_smallmem $INPUT \
   -id 0.75 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc $OUTPUT.uc \
   -centroids $OUTPUT
    
#      Seqs  390965 (391.0k)
#  Clusters  9591
#  Max size  74741 (74.7k)
#  Avg size  40.8
#  Min size  1
# Singletons  4462, 1.1% of seqs, 46.5% of clusters


# UCLUST
OUTPUT="$OUTNAME.aa_hit.id90.fa"

# Prune UNC
usearch -cluster_smallmem $INPUT \
   -id 0.90 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc $OUTPUT.uc \
   -centroids $OUTPUT

#       Seqs  390965 (391.0k)
#   Clusters  15487 (15.5k)
#  Max size  72136 (72.1k)
#  Avg size  25.2
#  Min size  1
#Singletons  7716, 2.0% of seqs, 49.8% of clusters

usearch -sortbylength $OUTNAME.aa_hit.id90.fa \
   -fastaout $OUTNAME.aa_hit.id90.sort.fa

Within the output there are lots of matches where the protein is yielding "XXXXX"

```
     12 >NC_031221.1
     13 MTSSKSCVSVNEDEEYYKNIKGKIPSKASKKTIRYLMKRADPSGKFKDVDLSNLVKSKLLDTAKLEEINRLGFSIVKNNY
     14 NTISNIITNYKDYTTNKVFDELYKVIKKMYNLSGTPDEILAGLKNKEELDIGFITIMENVFLYTDTLYFTLFPKNQRTAV
     15 DREIYEGTLAAKLGLYFIERIYKEYAGDDESEAISMPGESKWMAIGRKRKTCLDFLVNHCSNQADVEAKEEIGAVELNLD
     16 DQLEYPDLLKRDGEWSEVLGRKSRSSSISSQHSXXXXXXXXXXXXXXXXXXXXXXXXXXKVTKEAENTLKESQSLNATKN
     17 TFDVLRNLIWSEDVEQEXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXLEINADMSKFSAKD
     18 NLYKFMIFVILDPNLYKNEKYFILKFFCIYLRKRLIIPDQVFGTILDQTTSNDQCVFKKMTNNFETNHIQVSHNWLQGNL
     19 NYISSCWHSMVMDVFKRVFTSAMKHLNTTVLVEPLVHSDDNSTSIAFISYNKDVNTVSLGAFAMDSIRRTLAKGSLILNT
     20 KKTNVSTFHKEFVSLHDVNSEPVSIYGRFLYPVVGDCSYLGPYEDLSARLSTIQVALKHGCPPGVAHLGVALGVQMTYRT
     21 YSMLPGQMHDPLPALDLQGESRFNIPVELGGYPDLDLWLFGSLGIEAHDLKKLAQIAEFYDLGGYKISLGWHSFIEDLTE
     22 HNSQSIHRPYINGRIQYVKQSLWMDFYIMYMKNYAVGKDIQPDCDTGLSHDMKQR

```

Yet from GenBank:


```
                     ....................MTSSKSCVSVNEDEEYYKNIKGKIPSKASKKTIRYLMK
                     RADPSGKFKDVDLSNLVKSKLLDTAKLEEINRLGFSIVKNNYNTISNIITNYKDYTTN
                     KVFDELYKVIKKMYNLSGTPDEILAGLKNKEELDIGFITIMENVFLYTDTLYFTLFPK
                     NQRTAVDREIYEGTLAAKLGLYFIERIYKEYAGDDESEAISMPGESKWMAIGRKRKTC
                     LDFLVNHCSNQADVEAKEEIGAVELNLDDQLEYPDLLKRDGEWSEVLGRKSRSSSISS
                     QHSDRGSPKSSKTQSKKGSSTGSKKGKKKKVTKEAENTLKESQSLNATKNTFDVLRNL
                     IWSEDVEQEEMESVEKERQEKERKKEKVREEMEALKNEARTKPMILKVMNHKILSHGI
                     LEINADMSKFSAKDNLYKFMIFVILDPNLYKNEKYFILKFFCIYLRKRLIIPDQVFGT
                     ILDQTTSNDQCVFKKMTNNFETNHIQVSHNWLQGNLNYISSCWHSMVMDVFKRVFTSA
                     MKHLNTTVLVEPLVHSDDNSTSIAFISYNKDVNTVSLGAFAMDSIRRTLAKGSLILNT
                     KKTNVSTFHKEFVSLHDVNSEPVSIYGRFLYPVVGDCSYLGPYEDLSARLSTIQVALK
                     HGCPPGVAHLGVALGVQMTYRTYSMLPGQMHDPLPALDLQGESRFNIPVELGGYPDLD
                     LWLFGSLGIEAHDLKKLAQIAEFYDLGGYKISLGWHSFIEDLTEHNSQSIHRPYINGR
                     IQYVKQSLWMDFYIMYMKNYAVGKDIQPDCDTGLSHDMKQR...
```

Matching line from the .pro output of diamond

```
NC_031221.1     2362    4686    7323    +       rdrp5.Peribunyaviridae.Shuangao_Insect_Virus_1:YP_009300681.1   1       775     775     100.0   0.0e+00775M

ATGACAAGCTCCAAATCTTGTGTCAGCGTTAATGAAGACGAAGAATATTATAAAAATATAAAAGGGAAAATACCTAGCAAGGCAAGCAAAAAGACTATAAGATACCTGATGAAAAGGGCAGACCCATCTGGAAAATTCAAAGATGTAGATCTTAGCAACTTAGTAAAAAGTAAACTATTGGACACTGCTAAACTAGAGGAGATAAATAGGCTCGGTTTCTCTATTGTAAAGAACAATTATAACACAATATCCAATATAATCACAAATTACAAAGATTATACAACAAATAAAGTCTTTGATGAGTTGTACAAAGTTATTAAAAAGATGTACAACCTGTCTGGAACTCCTGATGAGATATTAGCAGGCTTAAAAAACAAAGAAGAACTAGACATTGGATTCATAACAATAATGGAGAATGTTTTTTTGTATACAGACACACTTTATTTTACTTTGTTCCCCAAGAACCAGAGGACAGCAGTAGACAGAGAGATATATGAAGGCACCTTGGCTGCAAAACTTGGTCTCTATTTTATAGAGAGAATATATAAGGAATATGCTGGAGATGACGAATCTGAAGCTATAAGCATGCCTGGGGAATCAAAATGGATGGCCATTGGTAGGAAAAGGAAAACCTGTTTAGACTTTCTTGTAAACCATTGCTCCAACCAGGCAGATGTGGAGGCAAAGGAAGAGATAGGTGCTGTGGAGTTAAATTTGGATGACCAACTAGAATACCCTGATTTATTGAAAAGAGATGGTGAATGGTCAGAAGTCTTGGGTAGAAAAAGCAGAAGCTCAAGTATATCAAGTCAACACTCCGATAGAGGCTCACCAAAAAGCTCCAAAACACAATCAAAAAAAGGATCATCGACAGGGAGTAAGAAAGGGAAGAAGAAAAAAGTAACAAAAGAAGCAGAAAACACCTTGAAAGAAAGTCAAAGTCTTAATGCTACTAAGAACACCTTTGATGTTTTAAGAAATTTGATCTGGTCTGAAGATGTAGAACAAGAAGAAATGGAGAGTGTAGAGAAGGAGAGACAGGAGAAAGAGAGAAAGAAAGAGAAAGTAAGAGAAGAAATGGAAGCATTAAAGAATGAAGCAAGAACAAAGCCTATGATTCTTAAAGTGATGAACCATAAAATTTTATCCCATGGAATATTAGAGATAAATGCAGACATGTCAAAGTTTAGTGCAAAGGATAATCTTTACAAGTTCATGATCTTTGTTATACTTGATCCAAATTTGTACAAAAACGAAAAGTATTTCATACTCAAATTCTTCTGTATTTATCTAAGGAAGCGATTGATTATACCAGATCAAGTATTTGGAACTATTCTTGACCAGACTACCAGCAATGATCAATGTGTATTTAAAAAGATGACAAACAACTTTGAGACAAATCACATTCAAGTTTCTCATAATTGGTTACAAGGGAATTTGAATTACATTTCCTCATGTTGGCATTCAATGGTAATGGATGTTTTCAAAAGAGTTTTTACATCTGCAATGAAACACTTAAACACAACAGTACTAGTAGAACCTTTGGTGCATTCTGACGATAACAGTACTAGCATTGCTTTTATCTCGTATAACAAAGATGTCAACACAGTTTCTTTAGGAGCATTTGCAATGGATAGTATTAGACGGACACTTGCAAAAGGTTCTTTGATACTCAACACAAAGAAAACAAATGTGTCTACTTTCCACAAAGAATTTGTCTCACTACATGACGTAAATTCAGAACCTGTTAGCATTTACGGTAGATTCCTTTATCCTGTTGTTGGAGATTGTTCATATTTAGGGCCTTATGAAGACTTGAGTGCGAGATTGTCGACTATACAAGTTGCTTTAAAACATGGTTGCCCACCAGGAGTTGCACATCTGGGTGTTGCTTTGGGAGTTCAAATGACATACAGAACTTACAGCATGTTACCAGGACAGATGCATGACCCATTGCCTGCATTAGACCTACAAGGAGAATCTCGATTCAATATTCCAGTTGAGTTGGGAGGATACCCTGATTTAGATCTATGGCTTTTTGGCAGTTTAGGTATAGAAGCACATGATCTGAAGAAATTAGCACAGATTGCAGAATTCTATGATCTAGGAGGCTATAAAATATCGTTAGGATGGCATAGTTTCATAGAAGATCTTACAGAACACAACTCCCAGTCTATTCATAGGCCATACATAAATGGAAGAATACAATATGTGAAGCAATCTTTATGGATGGATTTTTACATTATGTATATGAAAAATTATGCAGTTGGTAAAGACATACAACCAGACTGTGACACAGGGTTATCTCATGATATGAAGCAACGA

MTSSKSCVSVNEDEEYYKNIKGKIPSKASKKTIRYLMKRADPSGKFKDVDLSNLVKSKLLDTAKLEEINRLGFSIVKNNYNTISNIITNYKDYTTNKVFDELYKVIKKMYNLSGTPDEILAGLKNKEELDIGFITIMENVFLYTDTLYFTLFPKNQRTAVDREIYEGTLAAKLGLYFIERIYKEYAGDDESEAISMPGESKWMAIGRKRKTCLDFLVNHCSNQADVEAKEEIGAVELNLDDQLEYPDLLKRDGEWSEVLGRKSRSSSISSQHSXXXXXXXXXXXXXXXXXXXXXXXXXXKVTKEAENTLKESQSLNATKNTFDVLRNLIWSEDVEQEXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXLEINADMSKFSAKDNLYKFMIFVILDPNLYKNEKYFILKFFCIYLRKRLIIPDQVFGTILDQTTSNDQCVFKKMTNNFETNHIQVSHNWLQGNLNYISSCWHSMVMDVFKRVFTSAMKHLNTTVLVEPLVHSDDNSTSIAFISYNKDVNTVSLGAFAMDSIRRTLAKGSLILNTKKTNVSTFHKEFVSLHDVNSEPVSIYGRFLYPVVGDCSYLGPYEDLSARLSTIQVALKHGCPPGVAHLGVALGVQMTYRTYSMLPGQMHDPLPALDLQGESRFNIPVELGGYPDLDLWLFGSLGIEAHDLKKLAQIAEFYDLGGYKISLGWHSFIEDLTEHNSQSIHRPYINGRIQYVKQSLWMDFYIMYMKNYAVGKDIQPDCDTGLSHDMKQR
```

The `qseq_tr` field appears to have a mask applied wihch removes sequences such as

`DRGSPKSSKTQSKKGSSTGSKKGKKK`
and
`EMESVEKERQEKERKKEKVREEMEALKNEARTKPMILKVMNHKILSHGI`


## Revision 3 - RdRp from GenBank AA sequences (DEPRECATED SEE rev3B)

In [None]:
# Install diamond
DIAMONDVERSION='2.0.6-dev'
cd ~/

# Libraries for diamond
sudo yum -y install git gcc gcc-c++ glibc-devel \
  cmake patch automake zlib-devel

# grab latest with fix from Benjamin
git clone https://github.com/bbuchfink/diamond.git
cd diamond

mkdir bin; cd bin
cmake ..
make -j4
sudo make install

# build
# cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_BUILD_MARCH=nehalem ..
# make && make install



In [None]:
# Work on EC2 Instance

# Local usearch install
#The clustered database was made with usearch:
wget https://drive5.com/downloads/usearch11.0.667_i86linux32.gz
gzip -dc usearch11.0.667_i86linux32.gz > usearch
chmod 755 usearch; sudo mv usearch /usr/bin/usearch
rm https://drive5.com/downloads/usearch11.0.667_i86linux32.gz


In [None]:
# Install seqkit
wget https://github.com/shenwei356/seqkit/releases/download/v0.12.0/seqkit_linux_amd64.tar.gz
  tar -xvf seqkit*
  sudo mv seqkit /usr/local/bin/
  rm seqkit_linux*

In [None]:
# Install PathRacer
mkdir -p ~/pr; cd ~/pr

# Libraries
sudo yum -y install gcc gcc-c++ glibc-devel \
  cmake patch automake zlib-devel bzip2-devel \
  openssl-devel

# Update cmake to 3.18
wget https://cmake.org/files/v3.18/cmake-3.18.0.tar.gz
tar -xvzf cmake-3.18.0.tar.gz
cd cmake-3.18.0
./bootstrap
make
sudo make install
cd ..

# Install dev SPAdes
wget http://cab.spbu.ru/files/pathracer/SPAdes-3.15.0-pathracer-dev.tar.gz
tar -xvf SPAdes-3.15.0-pathracer-dev.tar.gz

cd SPA*
./spades_compile.sh

sudo cp bin/* /usr/bin/
sudo cp build_spades/bin/* /usr/bin/

### Start set-up

In [None]:
# rdrp0 v201213 for pilot run
mkdir rdrp0; cd rdrp0

# Download rdrp0
aws s3 sync s3://serratus-public/notebook/201226_rdrp0/ ./

# Revision 3 folder
mkdir -p rev3; cd rev3
GENOME='rdrp0'
cp ../rev1/rdrp0_r1.fa ./rdrp0.fa

diamond makedb --in $GENOME.fa -d $GENOME

In [None]:
# Link to all genbank CDS
ln -s ../aaViro_gb240.fa ./

# Input file
IN='aaViro_gb240.fa'
# Output name
OUTNAME='gbViro_rdrp'
# Genome ID
GENOME='rdrp0'

## test
# head -n 100000 $IN > tmp.fa
# IN='tmp.fa'
# OUTNAME='tmp.pro'

# Diamond blastx alignment
time cat $IN |\
diamond blastp \
  -d "$GENOME".dmnd \
  --unal 0 \
  --masking 0 \
  -k 1 \
  -p 4 \
  -b 1 \
  -f 6 qseqid qstart qend qlen qstrand \
       sseqid sstart send slen \
       pident evalue cigar \
       qseq \
  > "$OUTNAME".pro

# On C5.9xlarge
# real    5m52.304s
# user    22m23.987s
# sys     0m11.048s


In [None]:
# AA of result hits
cut -f1,13 $OUTNAME.pro |  sed 's/^/>/g' | sed 's/\t/\n/g' > $OUTNAME.aa_hit.fa

In [None]:
# UCLUST
OUTNAME='gbViro_rdrp'
INPUT="$OUTNAME.aa_hit.fa"

usearch -sortbylength $INPUT \
   -fastaout tmp.sort.fa
mv tmp.sort.fa $INPUT

# Cluster at 45%
OUTPUT="$OUTNAME.aa_hit.id45.fa"
usearch -cluster_smallmem $INPUT \
   -id 0.45 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc $OUTPUT.uc \
   -centroids $OUTPUT

#      Seqs  354010 (354.0k)
#  Clusters  2937
#  Max size  85482 (85.5k)
#  Avg size  120.5
#  Min size  1
#Singletons  1434, 0.4% of seqs, 48.8% of clusters

# Cluster at 75%
OUTPUT="$OUTNAME.aa_hit.id75.fa"
usearch -cluster_smallmem $INPUT \
   -id 0.75 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc $OUTPUT.uc \
   -centroids $OUTPUT

#      Seqs  354010 (354.0k)
#  Clusters  6911
#  Max size  74942 (74.9k)
#  Avg size  51.2
#  Min size  1
#Singletons  4116, 1.2% of seqs, 59.6% of clusters


# Cluster at 90%
OUTPUT="$OUTNAME.aa_hit.id90.fa"
usearch -cluster_smallmem $INPUT \
   -id 0.90 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc $OUTPUT.uc \
   -centroids $OUTPUT
   
#         Seqs  354010 (354.0k)
#  Clusters  11712 (11.7k)
#  Max size  72990 (73.0k)
#  Avg size  30.2
#  Min size  1
#Singletons  6854, 1.9% of seqs, 58.5% of clusters


In [None]:
# Extract full-names from genbank
ln -s ../aaViro_gb240.fa ./

# Create index
seqkit faidx -f aaViro_gb240.fa


# Sort by length (large-small)
usearch -sortbylength gbViro_rdrp.aa_hit.id90.fa \
   -fastaout id90.tmp.fa
seqkit faidx id90.tmp.fa
grep ">" id90.tmp.fa > id90.header.tmp.fa

while read -r line; do
      acc=$(echo $line | cut -f1 -d' ' - | sed 's/>//g' - )
      match=$(grep "$acc" aaViro_gb240.fa.seqkit.fai)
      echo $acc $match >> output.tmp
done < id90.header.tmp.fa

### Extending Partial Matches

Partial Matches. The diamond rdrp0 hits being returned for some highly diverged viruses are partial-length,

```
acc	description	rdrp_len	name	acc_len	full_acc_match
AXB38890.1	polyprotein, partial 	225	Potato virus Y	289	0
AAZ08330.1	polyprotein, partial 	225	Eustrephus virus Y	479	0
ADM89618.1	polyprotein, partial 	225	Trillium crinkled leaf virus	455	0
QIR30294.1	RNA-dependent RNA polymerase 	225	Plasmopara viticola lesion associated narnavirus 15	778	0
SBU65124.1	RNA dependent RNA polymerase, partial 	225	Parsley mottle mimic virus	225	1
AAX07288.1	polyprotein, partial 	225	Papaya ringspot virus	500	0
QIR30321.1	RNA-dependent RNA polymerase 	225	Plasmopara viticola lesion associated narnavirus 42	777	0
AAZ08326.1	polyprotein, partial 	225	Siratro 1 virus Y	490	0
AVA17452.1	RNA-dependent RNA polymerase 	225	Gigaspora margarita mitovirus 4	1008	0
```

For instance, QIR30294.1 Plasmopara viticola lesion associated narnavirus is 778 aa long, yet the match is 225 aa.


GenBank Complete Query Record:
```
>QIR30294.1 RNA-dependent RNA polymerase [Plasmopara viticola lesion associated narnavirus 15]
MRVRLPAYPFVTVKGVLPKSSPKPVAMLAGRRRPQRKVPSRRMPVHTKKLWEATWCALIS
CGVGTRRGAWEIRRWVSLASDRMGWEEVARSLKAVCSELRSSALEGRRARLPKAHHFPAR
LLSFLDRRLTVKGKLAFSRLARALPAASEKMKKEAVSTHARNLSARHVTPKAYLESIEEH
VRYTLRGAFQQTSRYSVPSSSAAVVEADRKSGGYNKVLADLARAGWRGVAAGYTRSGVPT
DTQYWKETSKLAHHFEHSTQQRIRSDKFNMYPTTVSAERNMGFATALMLRKAVGVRVVHH
ASVIAELGMKARVITVPPASVFAQGDLLRQVLWPALVSRVPQILPYAPHTEEAILQSIGQ
WVPGMVYLSADLTRATDGFGHDAILAVIRGLRKAGLPAFLCAELESNLGVGPDVHYVRYS
KSQLTQQTWQEFGKRFGVNEDGETVDVPKVRGSLMGTPCSFSILSILNHWMSELLGRRRI
ICGDDLAALTHPDNVSTYARRACAIGSELHQGKSFRSKIGFVFCEAYALTNGGGLRSFRP
ASLKEFVRDGNGVMCQHGVDATSFNRLARCARTLYKRQRVIATKKRRYPELPAALGGLGH
PCKGRLRIPAAGRAALFELYLCENTAHDGPHDPTRYVSTLTYPAVPASRREFRERVSQVR
SWLDDKRIDEPQPGDKFATNREISAYASMCANLTYLAGGGRFRKSRPQEIKVSRQRWPKP
IDGCRSGVLSSRTRINQVLDWDRRARSELGTYLDAPLQTHIRRRICAHREGDLPGDDR
```
Diamond Pro Output
```
QIR30294.1      301     525     778     +       rdrp1.Narnaviridae.Zhejiang_mosquito_virus_3:YP_009333331       126     359     513     32.9    7.3e-1738M1D13M2D14M3D28M1I12M1I3M3D2M1D19M10I38M11D46M        ASVIAELGMKARVITVPPASVFAQGDLLRQVLWPALVSRVPQILPYAPHTEEAILQSIGQWVPGMVYLSADLTRATDGFGHDAILAVIRGLRKAGLPAFLCAELESNLGVGPDVHYVRYSKSQLTQQTWQEFGKRFGVNEDGETVDVPKVRGSLMGTPCSFSILSILNHWMSELLGRRRIICGDDLAALTHPDNVSTYARRACAIGSELHQGKSFRSKIGFVFCE
```

Diamond Hit Sequence:
```
>QIR30294.1
ASVIAELGMKARVITVPPASVFAQGDLLRQVLWPALVSRVPQILPYAPHTEEAILQSIGQ
WVPGMVYLSADLTRATDGFGHDAILAVIRGLRKAGLPAFLCAELESNLGVGPDVHYVRYS
KSQLTQQTWQEFGKRFGVNEDGETVDVPKVRGSLMGTPCSFSILSILNHWMSELLGRRRI
ICGDDLAALTHPDNVSTYARRACAIGSELHQGKSFRSKIGFVFCE
```

rdrp0 hit record
```
>rdrp1.Narnaviridae.Zhejiang_mosquito_virus_3:YP_009333331
LRAWAEIWTRERLGAGGRTLTNPAAHFSRSASATVSALKGGQLTELRQLPAVAEHYALIQ
ELLGDQLDPVTGEPFEWDPYTFDADGGAIGLLTSASEADDLQILADCALRTATEHAADNT
PIPMTATVISELGMKARVVTKPPAWAVVAGDACRKTVWPLLEGDRRIDLSGVRPTAEVLD
AFHDNLAHSLVGARSTQFYSADLTAATDLMPFDVSRAMWNGLCDGLGATATAPLRKLGLY
LLGPVQVSYPDLSALPASSKLYVAGERVECLSERGCMMGLPVSWTVLNLYNLAMADLACT
PEGSPVLVNVAPAIARGDDLVAAIPAEEATRYEDLIAATGGEANRLKSFRSADAFVLAER
TFEVGVLRRPNVELKQRGYVTRRTYRTAPLLAQFDSAELHGGTGDTRRLGDAPEVVAIRM
ACDVPIRSLLGGGPRTVGANPVPDYVSIPPAAAACLAEFEGTRLYRSVAEGLMSVHRGLV
ADLRRSAIPLFYPRELGGAGFPHPKGFAAAVAS
```

GenBank Record
```
Plasmopara viticola lesion associated narnavirus 15 isolate DMG-E_DN24203 RNA-dependent RNA polymerase gene, complete cds

>MN539832.1 Plasmopara viticola lesion associated narnavirus 15 isolate DMG-E_DN24203 RNA-dependent RNA polymerase gene, complete cds
CCTTGACAGAATCATGCGAGTGCGCCTTCCAGCGTACCCGTTTGTGACGGTTAAGGGTGTTCTTCCTAAG
AGCTCTCCTAAGCCTGTAGCCATGCTAGCAGGTCGAAGGAGACCTCAAAGGAAGGTCCCTAGTCGTCGGA
TGCCGGTGCACACCAAGAAGCTTTGGGAAGCTACTTGGTGTGCTTTGATTTCTTGTGGTGTTGGAACACG
CCGTGGTGCGTGGGAGATACGTAGGTGGGTATCCCTTGCGAGCGATCGCATGGGATGGGAGGAAGTCGCA
CGTTCCCTCAAGGCAGTCTGCAGTGAGCTCCGCAGTTCTGCTCTTGAGGGCCGACGAGCACGTCTTCCTA
AGGCCCACCATTTCCCAGCACGGCTCCTTTCTTTCCTCGATCGTCGTCTCACTGTGAAAGGTAAGCTGGC
CTTTAGTAGGCTGGCTCGCGCTCTCCCAGCGGCAAGTGAGAAGATGAAGAAAGAAGCCGTGTCCACGCAT
GCACGGAACCTCTCGGCGCGACATGTTACTCCCAAGGCTTACCTTGAGAGCATTGAGGAACATGTCCGCT
ACACGCTGAGAGGTGCGTTCCAACAAACGTCTCGTTACTCTGTGCCTTCTTCGTCTGCGGCTGTGGTTGA
AGCGGATCGCAAGAGCGGAGGTTACAACAAAGTATTGGCAGATCTTGCCCGCGCAGGTTGGCGCGGTGTG
GCAGCAGGTTACACTAGGAGCGGTGTACCCACTGACACACAATACTGGAAGGAAACCTCCAAACTTGCCC
ACCACTTCGAGCACAGCACCCAGCAGCGAATACGGAGTGATAAGTTCAACATGTACCCAACAACTGTCAG
TGCTGAGCGAAACATGGGTTTTGCAACCGCGCTTATGCTTCGCAAAGCAGTTGGTGTACGTGTTGTCCAT
CACGCATCCGTAATCGCCGAGCTGGGGATGAAGGCACGAGTCATTACCGTCCCTCCTGCCAGTGTGTTTG
CCCAGGGCGACTTGCTTCGGCAAGTCCTCTGGCCTGCCCTTGTTAGCAGAGTGCCTCAGATCCTTCCATA
TGCTCCGCATACGGAAGAGGCTATTCTGCAAAGCATAGGACAGTGGGTGCCTGGTATGGTCTACCTTTCG
GCAGATCTTACTAGGGCGACTGACGGTTTCGGACATGATGCCATCTTGGCAGTCATCAGGGGCCTGAGAA
AGGCGGGTCTCCCCGCGTTTCTCTGTGCCGAACTCGAGAGTAATCTCGGAGTTGGCCCTGATGTGCATTA
TGTCCGTTACAGCAAGTCGCAGCTTACCCAGCAAACATGGCAGGAGTTTGGAAAACGTTTCGGTGTGAAT
GAGGATGGGGAGACCGTCGATGTCCCGAAGGTTAGGGGTTCCCTTATGGGTACTCCTTGTTCCTTCTCGA
TATTGTCGATCCTCAATCACTGGATGAGCGAGCTTCTTGGTCGGCGAAGAATCATCTGCGGAGATGATCT
TGCTGCGCTCACTCATCCCGATAACGTTTCTACCTACGCGCGCAGAGCCTGCGCAATAGGTAGCGAACTC
CATCAAGGAAAGTCTTTCAGGTCGAAGATAGGTTTCGTGTTCTGCGAAGCCTACGCCCTTACCAATGGGG
GTGGTCTTCGGTCCTTTAGACCCGCTTCCTTGAAGGAGTTCGTCCGTGACGGTAATGGGGTCATGTGTCA
ACATGGCGTGGACGCAACTTCGTTCAACCGCTTGGCTCGTTGTGCTCGTACTCTGTACAAGCGTCAACGA
GTGATCGCAACGAAGAAGCGTAGGTACCCTGAACTTCCGGCTGCGCTCGGAGGATTGGGGCATCCCTGTA
AGGGGAGGCTCAGAATCCCGGCGGCTGGTCGTGCAGCATTGTTCGAGTTGTACCTGTGTGAGAACACCGC
GCATGATGGGCCACATGACCCAACGAGATACGTTTCGACTCTTACTTACCCGGCGGTACCCGCGTCTAGA
CGCGAGTTCCGTGAACGAGTATCCCAGGTTCGCTCTTGGCTCGATGACAAGAGGATTGATGAACCGCAAC
CTGGGGACAAGTTCGCCACCAACAGGGAGATTAGTGCATATGCATCGATGTGTGCTAATCTCACCTATCT
GGCAGGTGGTGGCAGGTTCCGTAAGAGTCGACCACAAGAAATCAAAGTTTCGAGACAGAGGTGGCCCAAG
CCGATCGACGGATGTCGATCGGGGGTCTTGTCCAGTCGGACGAGGATTAACCAGGTCCTCGACTGGGACA
GGAGAGCTCGTAGCGAGCTCGGCACCTATCTCGATGCACCGCTTCAGACGCACATCCGGCGTAGGATATG
CGCCCACCGGGAGGGTGACCTCCCGGGAGATGACAGGTAGACTAGGG
```

Global Alignments (NW)

Diamond Local vs. complete rdrp0
```
Sbjct  1    LRAWAEIWTRERLGAGGRTLTNPAAHFSRSASATVSALKGGQLTELRQLPAVAEHYALIQ  60

Sbjct  61   ELLGDQLDPVTGEPFEWDPYTFDADGGAIGLLTSASEADDLQILADCALRTATEHAADNT  120

Query  1         ASVIAELGMKARVITVPPASVFAQGDLLRQVLWPALVS-RVPQILPYAPHTE--E  52
                 A+VI+ELGMKARV+T PPA     GD  R+ +WP L   R   +    P  E  +
Sbjct  121  PIPMTATVISELGMKARVVTKPPAWAVVAGDACRKTVWPLLEGDRRIDLSGVRPTAEVLD  180

Query  53   AILQSIGQWVPGM---VYLSADLTRATDGFGHDAILAVIRGLRKAGLPAFLCAELESNLG  109
            A   ++   + G     + SADLT ATD    D   A+  GL   GL A   A L   LG
Sbjct  181  AFHDNLAHSLVGARSTQFYSADLTAATDLMPFDVSRAMWNGLCD-GLGATATAPLRK-LG  238

Query  110  VGPDVHYVRYSKSQLTQQTWQEFGKRFGVNEDGETVDVPKVRGSLMGTPCSFSILSILNH  169
            +     Y+     Q++            +   GE V+    RG +MG P S+++L++ N 
Sbjct  239  L-----YL-LGPVQVSYPDLSALPASSKLYVAGERVECLSERGCMMGLPVSWTVLNLYNL  292

Query  170  WMSELLGRRR-----------IICGDDLAALTHPDNVSTYARRACAIGSELHQGKSFRSK  218
             M++L                I  GDDL A    +  + Y     A G E ++ KSFRS 
Sbjct  293  AMADLACTPEGSPVLVNVAPAIARGDDLVAAIPAEEATRYEDLIAATGGEANRLKSFRSA  352

Query  219  IGFVFCE                                                       225
              FV  E                                                     
Sbjct  353  DAFVLAERTFEVGVLRRPNVELKQRGYVTRRTYRTAPLLAQFDSAELHGGTGDTRRLGDA  412

Sbjct  413  PEVVAIRMACDVPIRSLLGGGPRTVGANPVPDYVSIPPAAAACLAEFEGTRLYRSVAEGL  472

Sbjct  473  MSVHRGLVADLRRSAIPLFYPRELGGAGFPHPKGFAAAVAS  513
```

Total Record vs. complete rdrp0
```
Query  1    MRVRLPAYPFVTVKGVLPKSSPKPVAMLAGRRRPQRKVPSRRMPVHTKKLWEATWCALIS  60
            +R                                                W   W     
Sbjct  1    LRA-----------------------------------------------WAEIW-----  8

Query  61   CGVGTRRGAWEIRRWVSLASDRMGWEEVARSLKAVCSELRSSALEGRRARLPKAHHFPAR  120
                TR              +R+G                     GR    P AH     
Sbjct  9    ----TR--------------ERLG-------------------AGGRTLTNPAAH-----  26

Query  121  LLSFLDRRLTVKGKLAFSRLARALPAASEKMKKEAVSTHARNLSARHVTPKAYLESIEEH  180
                            FSR A A  +A               L    +T    L ++ EH
Sbjct  27   ----------------FSRSASATVSA---------------LKGGQLTELRQLPAVAEH  55

Query  181  VRYTLRGAFQQTSRYSVPSSSAAVVEADRKSGGYNKVLADLARAGWRGVAAGYTRSGVPT  240
              Y L    Q+     +   +    E D  +   +            G A G   S    
Sbjct  56   --YAL---IQELLGDQLDPVTGEPFEWDPYTFDAD------------GGAIGLLTSASEA  98

Query  241  DTQYWKETSKLAHHFEHSTQQRIRSDKFNMYPTTVSAERNMGFATALMLRKAVGVRVVHH  300
            D         L    EH+             P T                          
Sbjct  99   DDLQILADCALRTATEHAADN-------TPIPMT--------------------------  125

Query  301  ASVIAELGMKARVITVPPASVFAQGDLLRQVLWPALVS-RVPQILPYAPHTE--EAILQS  357
            A+VI+ELGMKARV+T PPA     GD  R+ +WP L   R   +    P  E  +A   +
Sbjct  126  ATVISELGMKARVVTKPPAWAVVAGDACRKTVWPLLEGDRRIDLSGVRPTAEVLDAFHDN  185

Query  358  IGQWVPGM---VYLSADLTRATDGFGHDAILAVIRGLRKAGLPAFLCAELESNLGVGPDV  414
            +   + G     + SADLT ATD    D   A+  GL   GL A   A L   LG+    
Sbjct  186  LAHSLVGARSTQFYSADLTAATDLMPFDVSRAMWNGLCD-GLGATATAPLRK-LGL----  239

Query  415  HYVRYSKSQLTQQTWQEFGKRFGVNEDGETVDVPKVRGSLMGTPCSFSILSILNHWMSEL  474
             Y+     Q++            +   GE V+    RG +MG P S+++L++ N  M++L
Sbjct  240  -YL-LGPVQVSYPDLSALPASSKLYVAGERVECLSERGCMMGLPVSWTVLNLYNLAMADL  297

Query  475  LGRRR-----------IICGDDLAALTHPDNVSTYARRACAIGSELHQGKSFRSKIGFVF  523
                            I  GDDL A    +  + Y     A G E ++ KSFRS   FV 
Sbjct  298  ACTPEGSPVLVNVAPAIARGDDLVAAIPAEEATRYEDLIAATGGEANRLKSFRSADAFVL  357

Query  524  CEAYALTNGGGLRSFRPASLKEFVRDGNGVMCQHGVDATSFNRLARCARTLYKRQRVIAT  583
             E          R+F             GV+ +  V+                +QR   T
Sbjct  358  AE----------RTFEV-----------GVLRRPNVEL---------------KQRGYVT  381

Query  584  KKRRYPELPAALGGLGHPCKGRLRIPAAGRAALFELYLCENTAHDGPHDPTRYVSTLTYP  643
            + R Y   P                      A F+        H G  D  R       P
Sbjct  382  R-RTYRTAPLL--------------------AQFD----SAELHGGTGDTRRLGDA---P  413

Query  644  AVPASRREFRERVSQVRSWLDDKRIDEPQPGDKFATNREISAYASMCANLTYLAGGGRFR  703
             V A R         +RS L       P+          +S   +  A L    G   +R
Sbjct  414  EVVAIRMACDV---PIRSLLGGG----PRTVGANPVPDYVSIPPAAAACLAEFEGTRLYR  466

Query  704  KSRPQEIKVSRQRWPKPIDGCRSGVLSSRTRINQVLDWDRRARSELGTYLDAPLQTHIRR  763
                  + V R            G+++   R    L + R    ELG     P       
Sbjct  467  SVAEGLMSVHR------------GLVADLRRSAIPLFYPR----ELGG-AGFP-------  502

Query  764  RICAHREGDLPGDDR  778
                H +G       
Sbjct  503  ----HPKGFAAAVAS  513
```

And semi-global (from EMBOSS Needle)

```
QIR30294.1      MRVRLPAYPFVTVKGVLPKSSPKPVAMLAGRRRPQRKVPSRRMPVHTKKLWEATWCALIS
YP_009333331    ------------------------------------------------------------
                                                                            
QIR30294.1      CGVGTRRGAWEIRRWVSL-ASDRMGWEEVARSLKAVCSELRSSALEGRRARLPKAHHFPA
YP_009333331    -----------LRAWAEIWTRERLG-------------------AGGRTLTNPAAH----
                                                                            
QIR30294.1      RLLSFLDRRLTVKGKLAFSRLARALPAASEKMKKEAVSTHARNLSARHVTPKAYLESIEE
YP_009333331    -----------------FSRSASATVSA---------------LKGGQLTELRQLPAVAE
                                                                            
QIR30294.1      HVRYTL------------RGAFQQTSRYSVPSSSAAV------VEADRKSGGYNKVLADL
YP_009333331    H--YALIQELLGDQLDPVTGEPFEWDPYTFDADGGAIGLLTSASEADDL-----QILADC
                                                                            
QIR30294.1      ARAGWRGVAAGYTRSGVPTDTQYWKETSKLAHHFEHSTQQRIRSDKFNMYPTTVSAERNM
YP_009333331    A---------------LRTATEHAADNTPI--------------------PMT-------
                                                                            
QIR30294.1      GFATALMLRKAVGVRVVHHASVIAELGMKARVITVPPASVFAQGDLLRQVLWPALV-SRV
YP_009333331    -------------------ATVISELGMKARVVTKPPAWAVVAGDACRKTVWPLLEGDRR
                                                                            
QIR30294.1      PQILPYAPHTE--EAILQSIGQWVPG---MVYLSADLTRATDGFGHDAILAVIRGLRKAG
YP_009333331    IDLSGVRPTAEVLDAFHDNLAHSLVGARSTQFYSADLTAATDLMPFDVSRAMWNGLCD-G
                                                                            
QIR30294.1      LPAFLCAELESNLG---VGP-DVHYVRYSKSQLTQQTWQEFGKRFGVNEDGETVDVPKVR
YP_009333331    LGATATAPLR-KLGLYLLGPVQVSYPDLSALPASSKLYVA----------GERVECLSER
                                                                            
QIR30294.1      GSLMGTPCSFSILSILNHWMSELLGRRR-----------IICGDDLAALTHPDNVSTYAR
YP_009333331    GCMMGLPVSWTVLNLYNLAMADLACTPEGSPVLVNVAPAIARGDDLVAAIPAEEATRYED
                                                                            
QIR30294.1      RACAIGSELHQGKSFRSKIGFVFCEAYALTNGGGLRSFRPASLKEFVRDGNGVMCQHG--
YP_009333331    LIAATGGEANRLKSFRSADAFVLAE----------RTFEVGVL----RRPNVELKQRGYV
                                                                            
QIR30294.1      -------------VDATSFNRLARCARTLYKRQRVIATKKRRYPELPAALGG----LG-H
YP_009333331    TRRTYRTAPLLAQFDSAELHGGTGDTRRLGDAPEVVAIRMACDVPIRSLLGGGPRTVGAN
                                                                            
QIR30294.1      PCKGRLRIPAAGRAALFE-----LY--LCEN--TAHDG-PHDPTRYVSTLTYP-------
YP_009333331    PVPDYVSIPPAAAACLAEFEGTRLYRSVAEGLMSVHRGLVADLRRSAIPLFYPRELGGAG
                                                                            
QIR30294.1      --------AVPASRREFRERVSQVRSWLDDKRIDEPQPGDKFATNREISAYASMCANLTY
YP_009333331    FPHPKGFAAAVAS-----------------------------------------------
                                                                            
QIR30294.1      LAGGGRFRKSRPQEIKVSRQRWPKPIDGCRSGVLSSRTRINQVLDWDRRARSELGTYLDA
YP_009333331    ------------------------------------------------------------
                                                                            
QIR30294.1      PLQTHIRRRICAHREGDLPGDDR
YP_009333331    -----------------------
```

### Pathracer Semi-Global Alignment Testing

RdRP_1 	PF00680.20 	CL0027 	RNA dependent RNA polymerase 	
1832 	2196 	8.7e-19 	4.9e-23

In [None]:
# Compare rdrp0 hit versus pr
# using QIR30294.1

seqkit grep -r -p "QIR30294.1" aaViro_gb240.fa > test/narna_fl.fa

grep "QIR30294.1" gbViro_rdrp.pro > test/narna.dmnd.pro
seqkit grep -r -p "QIR30294.1" gbViro_rdrp.aa_hit.fa > test/narna.dmnd.fa


# Download rdrp1234q.hmm from Rayan
wget https://serratus-rayan.s3.amazonaws.com/tmp/RdRP_1234q.hmm ./

# Search sequence with PathRacer
pathracer-seq-fs RdRP_1234q.hmm narna_nt.fa --global --output pr_out  

### Merge GenBank hits into rdrp0 / Assign Taxonomy

Shortest sequences in rdrp0

```
rdrp2.unc0371.Hubei_sobemo_like_virus_16:YP_009330046   280     4764591 80      81
rdrp2.Luteoviridae.Spinach_yellows_virus:CRL92755       275     4764926 80      81
rdrp2.Luteoviridae.Lettuce_mild_yellows_virus:CRL92746  273     4765261 80      81
rdrp3.Betaflexiviridae.Phlomis_mottle_virus:CAP46903    273     4765592 80      81
rdrp2.Luteoviridae.Hubei_polero_like_virus_1:YP_009330062       270     4765928 80      81
rdrp2.Picobirnaviridae.Genet_fecal_picobirnavirus:AIB06803      270     4766262 80      81
rdrp2.Luteoviridae.Lettuce_yellows_virus:CRL92742       266     4766587 80      81
rdrp2.unc0373.Hubei_sobemo_like_virus_14:YP_009330128   265     4766912 80      81
rdrp2.unc0370.Hubei_sobemo_like_virus_17:APG75899       265     4767232 80      81
rdrp2.unc0066.Wuhan_arthropod_virus_4:YP_009342307      261     4767553 80      81
rdrp2.unc0368.Hubei_myriapoda_virus_9:YP_009345131      255     4767870 80      81
rdrp2.unc0378.Beihai_sobemo_like_virus_16:YP_009336705  254     4768185 80      81
rdrp2.unc0225.Prestney_Burn_virus:AMO03211      253     4768487 80      81
```

In [None]:
mkdir -p merge; cd merge
cp ../gbViro_rdrp.aa_hit.fa ./
cp ../rdrp0.fa ./

RDRP="rdrp0"
GB="gbViro_rdrp.aa_hit"

# Minimum length
MIN='200'

# Sort and filter by 200a
usearch -sortbylength $RDRP.fa \
  -minseqlength $MIN \
  -fastaout $RDRP.sort.fa

usearch -sortbylength $GB.fa \
  -minseqlength $MIN \
  -fastaout $GB.sort.fa


cat $RDRP.sort.fa $GB.sort.fa > merge.fa.tmp

# Cluster


In [None]:
# UCLUST
INPUT='merge.fa.tmp'
OUTNAME='gb240_r0'

# Cluster/Merge

#45%
OUTPUT="$OUTNAME.id45"
usearch -cluster_smallmem $INPUT \
   -id 0.45 \
   -sortedby other \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc $OUTPUT.uc \
   -centroids $OUTPUT.fa

#      Seqs  229046 (229.0k)
#  Clusters  4437
#  Max size  85638 (85.6k)
#  Avg size  51.6
#  Min size  1
# Singletons  1922, 0.8% of seqs, 43.3% of clusters


#90%
OUTPUT="$OUTNAME.id90"
usearch -cluster_smallmem $INPUT \
   -id 0.90 \
   -sortedby other \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc $OUTPUT.uc \
   -centroids $OUTPUT.fa

#       Seqs  229046 (229.0k)
#  Clusters  12477 (12.5k)
#  Max size  72655 (72.7k)
#  Avg size  18.4
#  Min size  1
# Singletons  7038, 3.1% of seqs, 56.4% of clusters

# Number of non-wolf18 sequences that made it through
# grep -r ">[^rdrp]" gb240_r0.id45.fa | wc -l 
# 776 / 4437
#
# grep -r ">[^rdrp]" gb240_r0.id90.fa | wc -l
# 3741 / 12477


### Revision 3B USEARCH Semi-Global Extension



In [None]:
# Revision 3B folder
mkdir -p rev3B; cd rev3B

# Use rdrp0_revision1 as start-point
GENOME='rdrp0'
cp ../rev1/rdrp0_r1.fa ./rdrp0.fa
diamond makedb --in $GENOME.fa -d $GENOME

In [None]:
# Link to GenBank v241 (Dec2020 Release)
ln -s ../vgb241/vgb241.fa ./

mkdir 0_rdrp_search; cd 0_rdrp_search

# Input file
IN='vgb241.fa'
# Output name
OUTNAME='gb241_r0'
# Genome ID
GENOME='rdrp0'

# Diamond blastx alignment
time cat $IN |\
diamond blastp \
  -d "$GENOME".dmnd \
  --unal 0 \
  --masking 0 \
  --ultra-sensitive \
  -k 1 \
  -p 30 \
  -b 2 \
  -f 6 qseqid qstart qend qlen qstrand \
       sseqid sstart send slen \
       pident evalue bitscore cigar \
       qseq \
  > "$OUTNAME".raw.pro

# On C5.9xlarge
# real    31m22.996s
# user    890m28.269s
# sys     1m16.513s


In [None]:
# Filter for High-Confidence alignments only

# Output name
OUTNAME='gb241_r0'

# Threshold for "cut-offs"
EVALUE=6  # -log(E-value) > $EVAL cut-off
RCOV=50 # rdrp-coverage > $RCOV cut-off

while read -r line; do
  
  # Return -log(e-value); zero == 999
  e_value=$(echo $line | cut -f 11 -d' ' - \
    | sed 's/[0-9\.]*e.//g' - \
    | sed 's/^0//g' -)
  if (( $e_value == 0 )); then
    e_value='999'
  fi
  
  # rdrp-match coordinates
  r_start=$(echo $line | cut -f 7 -d' ' - )
  r_end=$(echo $line | cut -f 8 -d' ' - )
  r_len=$(echo $line | cut -f 9 -d' ' - )
  
  # convert to percent
  r_cov=$( echo "$r_end - $r_start + 1" | bc )
  r_covpct=$( echo "scale=0; 100 * $r_cov / $r_len" | bc )
  
  #echo $r_start $r_end $r_len $r_covpct $e_value
 
  if (( $r_len > $RCOV )) && (( $e_value > $EVALUE )); then
    #echo HIT
    echo $line | sed 's/ /\t/g' >> "$OUTNAME".local.pro
  else
    #echo MISS
    echo $line | sed 's/ /\t/g' >> "$OUTNAME".REJECT.pro
  fi
  # I guess they never miss huh
done < "$OUTNAME".raw.pro

mkdir raw;
mv "$OUTNAME".raw.pro raw/
mv"$OUTNAME".REJECT.pro raw/

In [None]:
# UCLUST Local Alignments
OUTNAME='gb241_r0'
INPUT="gb241_r0.local.fa"

function uclust () {

  INPUT=$1
  OUTNAME=$2
  ID=$3
  
  # uclust
  OUTPUT="$OUTNAME.local.id$ID"
  usearch -cluster_smallmem $INPUT \
     -id 0.$ID \
     -maxaccepts 4 \
     -maxrejects 64 \
     -maxhits 1 \
     -uc $OUTPUT.uc \
     -centroids $OUTPUT.fa

  mkdir id$ID; mv *.id$ID.* id$ID/

}

# Cluster at 45%
uclust $INPUT $OUTNAME 45

#      Seqs  367343 (367.3k)
#  Clusters  2912
#  Max size  86855 (86.9k)
#  Avg size  126.1
#  Min size  1
#Singletons  1365, 0.4% of seqs, 46.9% of clusters


# Cluster at 75%
uclust $INPUT $OUTNAME 75

#      Seqs  367343 (367.3k)
#  Clusters  7052
#  Max size  75171 (75.2k)
#  Avg size  52.1
#  Min size  1
#Singletons  4177, 1.1% of seqs, 59.2% of clusters



# Cluster at 90%
uclust $INPUT $OUTNAME 90

#      Seqs  367343 (367.3k)
#  Clusters  12002 (12.0k)
#  Max size  73247 (73.2k)
#  Avg size  30.6
#  Min size  1
#Singletons  6946, 1.9% of seqs, 57.9% of clusters



# Cluster at 95%
uclust $INPUT $OUTNAME 95

#      Seqs  367343 (367.3k)
#  Clusters  18733 (18.7k)
#  Max size  41281 (41.3k)
#  Avg size  19.6
#  Min size  1
#Singletons  10944 (10.9k), 3.0% of seqs, 58.4% of clusters

In [None]:
# Extract Full Header from GenBank Local Matches {DEPRECATED}
ln -s ../../vgb241/vgb241.fa.fai

function reheader {
  QUERYFA=$1    # Diamond Local Matches
  SUBJECTFAI=$2 # GenBank FAI index
  OUTPUT=$3     # Output name

  # Sort Matching Diamond Accessions
  usearch -sortbylength $QUERYFA \
    -fastaout tmp.sort

  grep ">" tmp.sort > tmp.qheader

  rm -f output.tmp
  while read -r line; do
      acc=$(echo $line | cut -f1 -d' ' - | sed 's/>//g' - )
      match=$(grep "$acc" $SUBJECTFAI)
      echo $acc $match >> output.tmp
  done < tmp.qheader

  mv output.tmp > $OUTPUT

  rm -f tmp.sort tmp.qheader
}

OUTNAME='gb241_r0'
ID='90'
QUERYFA="id$ID/$OUTNAME.local.id$ID.fa"
SUBJECTFAI='vgb241.fa.fai'
OUTPUT='gb241_r0.local.id90.reheader'

reheader $QUERYFA $SUBJECTFAI $OUTPUT


In [None]:
# 1. Cluster all putative-RdRp in GenBank into rdrp0 at 90%
#    Isolate centroids that are from GenBank as putative novel-RdRp
#    Note: these are "local" matches, and require extension

mkdir 1_merge_rdrp; cd 1_merge_rdrp

RDRP="rdrp0.fa"
GBLOCAL="gb241_r0.local.fa"
OUTNAME="rdrp0_gb241"

# rdrp0 == wolf18/wolf20 reference rdrp-set
usearch -sortbylength ../$RDRP \
   -fastaout ./$RDRP
   
# Local GenBank matches to rdrp0 (via diamond)
usearch -sortbylength ../$GBLOCAL \
   -fastaout ./$GBLOCAL
   
# Order the files such that rdrp0 takes priority over genbank
cat $RDRP $GBLOCAL > cat_rdrp.fa

function uclust () {

  INPUT=$1
  OUTNAME=$2
  ID=$3
  
  # uclust
  OUTPUT="$OUTNAME.id$ID"
  usearch -cluster_smallmem $INPUT \
     -sortedby other \
     -id 0.$ID \
     -maxaccepts 4 \
     -maxrejects 64 \
     -maxhits 1 \
     -uc $OUTPUT.uc \
     -centroids $OUTPUT.fa

  mkdir id$ID; mv *.id$ID.* id$ID/

}

# Cluster rdrp0 and genbank_rdrp0_matches at 90%
uclust cat_rdrp.fa $OUTNAME 90

# Retrieve non-rdrp0 sequences (GenBank-derived RdRp)
grep ">" id90/$OUTNAME.id90.fa \
  | grep -v ">rdrp" - \
  | sed 's/>//g' - \
  > new.rdrp.acc

# 7633 sequences
cd ..

In [None]:
# 2. For all New GenBank-derived RdRp
#    retrieve the full-length sequence/ORF from GenBank
mkdir 2_retrieve_orf; cd 2_retrieve_orf

cp ../1_merge_rdrp/new.rdrp.acc ./new.rdrp.acc

# Viral GenBank (all)
VGB='vgb241.fa'
ORFFA='gb241.orf.fa'
seqkit grep -f new.rdrp.acc ../$VGB > $ORFFA

cd ..

In [None]:
# 3. For all New GenBank ORF (containing RdRp)
#    Semi-global align the ORF to the original rdrp0
#    to "trim" the sequences to the wolf18 boundary
mkdir 3_trim_orf; cd 3_trim_orf

# Perform semi-global extension of potential hits
wget https://drive5.com/tmp/usearch12_trim ./
chmod 755 usearch12_trim

RDRP='rdrp0.fa'
ln -s ../$RDRP ./
ORFFA='gb241.orf.fa'
ln -s ../2_retrieve_orf/$ORFFA ./

OUTFA="gb241.rdrp.fa"

# NW Semi-Global Extension
./usearch12_trim -usearch_global $ORFFA \
    -id 0.01 \
    -fulldp \
    -maxaccepts 8 \
    -maxrejects 32 \
    -top_hit_only \
    -db rdrp0.fa \
    -userfields query+target+id+qtrimlo+qtrimhi \
    -userout results.tsv \
    -trimout trimmed_output.fa
    
usearch -sortbylength trimmed_output.fa \
   -fastaout $OUTFA
mv results.tsv $OUTFA.trims
rm trimmed_output.fa

# 1. query label
# 2 reference label of top hit
# 3. %id of semi-global alignment
# 4. one-based start coordinate of alignment in query
# 5. one-based end   coordinate of alignment in query

### POTENTIAL OVEREXTENSIONS
# >BAB18266.1 polyprotein [Chiba virus]
# >BAC79393.1 polyprotein [Turnip mosaic virus]
# >BAA32667.1 polyprotein [Hepatitis C virus (isolate VN004)]
# >CAI65399.1 polyprotein [Potato virus Y strain NTN]
# >CAI65399.1 polyprotein [Potato virus Y strain NTN]
# >CAD24793.1 polyprotein [Peru tomato mosaic virus]


In [None]:
# 4. For all New GenBank-Rdrp (trimmed)
#    Remove Known False Positive hits (RNA capsid/envelopes)
#    Remove RdRp below 160aa long
#    Add-back high quality (annotated) partial RdRp
mkdir 4_clean_up; cd 4_clean_up

ORFFA='gb241.rdrp.fa'
ln -s ../3_trim_orf/$ORFFA ./raw.$ORFFA

# MANUAL CLEAN-UP


#### DECOY SEQUENCES ==========================
### HCV Envelope 2 Glycoprotein (113 sequences)
grep "[Hh]epa" raw.$ORFFA \
  | grep -e "[Ee]nv" - \
  | grep -v "[Rr][Dd][Rr][Pp]" - \
  | sed 's/>//g' - \
  > DY.hcv_env2.acc
  
grep "[Hh]epa" raw.$ORFFA \
  | grep -e "E2 protein" - \
  | grep -v "[Rr][Dd][Rr][Pp]" - \
  | sed 's/>//g' - \
  >> DY.hcv_env2.acc  
  
# i.e. ABG55345.1 envelope glycoprotein, partial [Hepacivirus C]:1-192

### Partial Capsid Protein (42 sequences)
grep -e "[Cc]apsid" raw.$ORFFA \
  | grep -v "[Rr][Dd][Rr][Pp]" - \
  | sed 's/>//g' - \
  > DY.capsids.acc
  
# i.e. ADP88743.1 capsid protein [Rabbit calicivirus]:1-532
  
### Coat Proteins (70 sequences)
grep -e "[Cc]oat" raw.$ORFFA \
  | grep -v "[Rr][Dd][Rr][Pp]" - \
  | sed 's/>//g' - \
  > DY.coats.acc

# i.e. AAK17013.1 coat protein, partial [Papaya ringspot virus]:1-307

cat DY* | cut -f1 - > DECOY.acc

#### BLACKLISTED SEQUENCES ==========================
### Deplete HCV Partial Sequences (1397 sequences)
grep -e "[Hh]epacivirus C" raw.$ORFFA \
  | grep -e "[Pp]artial" - \
  | sed 's/>//g' - \
  > BL.hcv_partial.acc
  
grep -e "[Hh]epatitis C" raw.$ORFFA \
  | grep -e "[Pp]artial" - \
  | sed 's/>//g' - \
  >> BL.hcv_partial.acc

cat BL.hcv_partial.acc > BLACKLIST.acc


#### QC SEQUENCE MATCHES ==========================

# Trim output to at least 160 aa post-extension
seqkit seq --min-len 160 raw.$ORFFA \
  | seqkit sort -l - > gt160.$ORFFA

seqkit seq --max-len 160 raw.$ORFFA \
  | seqkit sort -l - > lt160.$ORFFA

# Remove sequences containing 4 or more consecutive "XXXX"
# (low quality sequences). Quality-control pass
seqkit grep -s -v -p "XXXX" gt160.$ORFFA \
  | seqkit sort -l - > gt160.QC_PASS.$ORFFA

# Removes (50 sequences)
seqkit grep -s -p "XXXX" gt160.$ORFFA \
  | seqkit sort -l - > gt160.x4.$ORFFA

## Recover Partial (non HCV) Sequences
# Extract Known partial RdRp (911 sequences)
seqkit grep -r -n -p "RNA-dependent RNA polymerase" lt160.$ORFFA \
  | seqkit grep -r -n -v -p "epacivirus C" - \
  | seqkit grep -r -n -v -p "epatitis C" - \
  > partial.$ORFFA

# Whitelist PREDICT accessions from trimmed records (20 sequences)
seqkit grep -r -n -p "PREDICT" lt160.$ORFFA >> partial.$ORFFA
seqkit rmdup partial.$ORFFA > tmp.partial.$ORFFA
mv tmp.partial.$ORFFA partial.QC_PASS.$ORFFA

# DEPLETE Decoy and Blacklisted sequences 
# Add Long + Partial RdRp, remove false positives
cut -f1 -d' ' DECOY.acc      > deplete.tmp
cut -f1 -d' ' BLACKLIST.acc >> deplete.tmp

cat gt160.QC_PASS.$ORFFA partial.QC_PASS.$ORFFA \
  | seqkit grep -r -v -n -f deplete.tmp \
  > final.gb241.rdrp.fa
  
rm deplete.tmp

# f799cc6d3525340fecea6edd15d3adea  gb241.rdrp.fa
cp gb241.rdrp.fa ../gb241.rdrp.fa

cd ..

In [None]:
# 5. Create rdrp1
#    Combine independent sets of rdrp into one master set
#    Bootstrap the annotation from rdrp0 to assign to new sets
mkdir 5_make_rdrpN; cd 5_make_rdrpN

RDRP='rdrp0.fa'
VGB='gb241.rdrp.fa'

usearch -sortbylength ../$RDRP \
   -fastaout $RDRP

# Seperate out sorted wolf18 and wolf20 sets
seqkit grep -n -r -v -p "yaOV" $RDRP > wolf18.fa
seqkit grep -n -r -p "yaOV" $RDRP > wolf20.fa
   
usearch -sortbylength ../$VGB \
   -fastaout $VGB

# Order of priority for taxonomic classification
cat wolf18.fa $VGB wolf20.fa > tmpcat.fa

# Create initial 45% cluster set
function uclust () {

  INPUT=$1
  OUTNAME=$2
  ID=$3
  
  # uclust
  OUTPUT="$OUTNAME.id$ID"
  usearch -cluster_smallmem $INPUT \
     -sortedby other \
     -id 0.$ID \
     -maxaccepts 4 \
     -maxrejects 64 \
     -maxhits 1 \
     -uc $OUTPUT.uc \
     -centroids $OUTPUT.fa

  mkdir id$ID; mv *.id$ID.* id$ID/

}

# Cluster rdrp0 and genbank_rdrp0_matches at 45%
uclust tmpcat.fa initial.cluster 45

#      Seqs  14613 (14.6k)
#  Clusters  4723
#  Max size  315
#  Avg size  3.1
#  Min size  1
#Singletons  2982, 20.4% of seqs, 63.1% of clusters

# For every Viral GenBank RdRp; which cluster does it belong?
grep "^[HC]" id45/initial.cluster.id45.uc \
  | cut -f 1,9,10 - \
  | sort -k1 - > uctable45.tsv

In [None]:
# Manual Additions

# 1. Regex-fu to import UC file (uctabl45.tsv) into excel
# 2. For accessions without a lift-over annotation, retrieve full taxonomy-line
#    use: https://www.ncbi.nlm.nih.gov/sites/batchentrez
# 3. For GenBank Records of unknown "Branch" use local alignment data from rdrp0
#    paste accession-list into "assign.branch"

while read -r line; do
  grep "$line" gb241_r0.local.pro >> tmp.pro
done < assign.branch

# ALPHA is tmpcat.fa above
# Clean-up
seqkit rmdup rdrp1_ALPHA_r2.fa \
  | seqkit seq --min-len 35 - \
  | seqkit sort -r -l - \
  | seqkit rmdup -n - \
  > rdrp1_ALPHA_r2.sort.fa
seqkit faidx -f rdrp1_ALPHA_r2.sort.fa

# Re-import header swap
aws s3 cp s3://serratus-public/tmp/header_update.tsv ./header_update.tsv

# AAC25017.1 replicase protein, partial [Wild cucumber mosaic virus]:1-214 -->
# kiti.Tymoviridae.wild_cucumber_mosaic_virus:AAC25017
# AAF00520.1 polyprotein, partial [Pleione virus Y]:1-240 -->
# pisu.Potyviridae.pleione_virus_y:AAF00520
# ...

function reheader {
  INFA=$1  # Input Fasta File
  TSV=$2   # 2-field TSV == 1:Old_Fasta_header 2:New_Fasta_Header
  OUTFA=$3 # Output Fasta File
  
  rm -f $OUTFA
  
  # Ensure header input are unique
  sort -k1,1 $TSV | uniq - > tsv.tmp
  mv tsv.tmp $TSV
  
  while read -r line; do
   if [[ "$line" = ">"* ]]; then
    # header line; replace old with new
    oldheader=$(echo $line | sed 's/>//g' -)
    newheader=$(grep -Fw "$oldheader" header_update.tsv | cut -f2 - )
    
    if [[ "$newheader" = "" ]]; then
      echo AHHHHHHH - $oldheader
      stop
    fi
    echo ">$newheader" >> $OUTFA
    
   else
     echo $line >> $OUTFA
   fi    
  done < $INFA  
}

reheader rdrp1_ALPHA_r2.sort.fa header_update.tsv ../../rdrp1_BETA.fa


## Revision 3D - Re-search GenBank with rdrp1_BETA

Revision 3C had an error with headers, will be fixed here.


In [None]:
### GLOBAL REFINEMENT SCRIPT

### PARAMETERS ===================================
# RDRP Database
GENOME='rdrp1_BETA'
GPATH='../rev3B'
RDRP="$GENOME.fa"

# Query Records (GenBank)
QUERY='vgb241'
QPATH='../vgb241'

# Output Suffix
OUTNAME='gb241_r1B'

# Diamond QC-Filtering (Initial)
EVALUE=6  # -log(E-value) > $EVAL cut-off
RCOV=50 # rdrp-coverage > $RCOV cut-off

### INITIALIZE ===================================
# Revision 3C folder
mkdir -p rev3D; cd rev3D

# Use rdrp1_BETA (iteration 1) as start-point
ln -s $GPATH/$GENOME.fa ./
diamond makedb --in $GENOME.fa -d $GENOME

# Query
ln -s $QPATH/$QUERY.fa ./
ln -s $QPATH/$QUERY.fa.fai ./

### 0 DIAMOND SEARCH ==============================
mkdir 0_rdrp_search; cd 0_rdrp_search

# Diamond blastp alignment
time cat ../$QUERY.fa |\
diamond blastp \
  -d ../$GENOME.dmnd \
  --unal 0 \
  --masking 0 \
  --ultra-sensitive \
  -k 1 \
  -p 30 \
  -b 2 \
  -f 6 qseqid qstart qend qlen qstrand \
       sseqid sstart send slen \
       pident evalue bitscore cigar \
       qseq \
  > $OUTNAME.raw.pro


# real    39m30.035s
# user    1120m51.264s
# sys     1m46.920s

# QC Filter diamond output based on
# E-value threshold and and rdrp-Coverage

while read -r line; do
  # Return -log(e-value); zero == 999
  e_value=$(echo $line | cut -f 11 -d' ' - \
    | sed 's/[0-9\.]*e.//g' - \
    | sed 's/^0//g' -)
  if (( $e_value == 0 )); then
    e_value='999'
  fi
  
  # rdrp-match coordinates
  r_start=$(echo $line | cut -f 7 -d' ' - )
  r_end=$(echo $line | cut -f 8 -d' ' - )
  r_len=$(echo $line | cut -f 9 -d' ' - )
  
  # convert to percent
  r_cov=$( echo "$r_end - $r_start + 1" | bc )
  r_covpct=$( echo "scale=0; 100 * $r_cov / $r_len" | bc )
  
  #echo $r_start $r_end $r_len $r_covpct $e_value
 
  if (( $r_len > $RCOV )) && (( $e_value > $EVALUE )); then
    #echo HIT
    echo $line | sed 's/ /\t/g' >> "$OUTNAME".local.pro
  else
    #echo MISS
    echo $line | sed 's/ /\t/g' >> "$OUTNAME".REJECT.pro
  fi
  # I guess they never miss huh
done < "$OUTNAME".raw.pro

mkdir raw;
mv "$OUTNAME".raw.pro raw/
mv"$OUTNAME".REJECT.pro raw/

# Make fasta of output
cut -f1,14 $OUTNAME.local.pro \
  |  sed 's/^/>/g' \
  | sed 's/\t/\n/g' \
  > $OUTNAME.local.fa

usearch -sortbylength $OUTNAME.local.fa \
   -fastaout tmp.sort.fa
   
mv tmp.sort.fa $OUTNAME.local.fa

cd ..

### 1 CLUSTER RESULTS ============================
# 1. Cluster all putative-RdRp in QUERY into rdrp at 90%
#    Isolate centroids that are from QUERY as putative novel-RdRp
#    Note: these are "local" matches, and require extension

mkdir 1_merge_rdrp; cd 1_merge_rdrp
GBLOCAL="$OUTNAME.local.fa"

# Reference set of rdrp
usearch -sortbylength ../$RDRP \
   -fastaout ./$RDRP
   
# Local query matches to rdrp
usearch -sortbylength ../0_rdrp_search/$GBLOCAL \
   -fastaout ./$GBLOCAL
   
# Order the files such that rdrp0 takes priority over genbank
cat $RDRP $GBLOCAL > cat_rdrp.fa

function uclust () {

  INPUT=$1
  OUTNAME=$2
  ID=$3
  
  # uclust
  OUTPUT="$OUTNAME.id$ID"
  usearch -cluster_smallmem $INPUT \
     -sortedby other \
     -id 0.$ID \
     -maxaccepts 4 \
     -maxrejects 64 \
     -maxhits 1 \
     -uc $OUTPUT.uc \
     -centroids $OUTPUT.fa

  mkdir id$ID; mv *.id$ID.* id$ID/

}

# Cluster rdrp0 and genbank_rdrp0_matches at 90%
uclust cat_rdrp.fa $OUTNAME 90

# Retrieve non-rdrp0 sequences (GenBank-derived RdRp)
grep ">" id90/$OUTNAME.id90.fa \
  | grep -v -e ">[lkpdnr]" \
  | sed 's/>//g' - \
  > new.rdrp.acc

cd ..

### 2 RETREIVE ORF ============================
# 2. For all _New_ GenBank-derived RdRp
#    retrieve the full-length sequence/ORF from GenBank
mkdir 2_retrieve_orf; cd 2_retrieve_orf

cp ../1_merge_rdrp/new.rdrp.acc ./new.rdrp.acc

# Full Length ORF Fasta file
ORFFA="$OUTNAME.orf.fa"
seqkit grep -f new.rdrp.acc ../$QUERY.fa > $ORFFA

cd ..

### 3 NW Extension + Trim =====================
# 3. For all  ORF (containing rdrp)
#    Semi-global align the ORF to the original rdrp
#    to "trim" the sequences to the boundary
mkdir 3_trim_orf; cd 3_trim_orf

# Perform semi-global extension of potential hits
aws s3 cp s3://serratus-public/bin/usearch12_trim ./
chmod 755 usearch12_trim

ln -s ../$RDRP ./
ln -s ../2_retrieve_orf/$ORFFA ./

RDRPFA="raw.$OUTNAME.rdrp.fa"

# NW Semi-Global Extension
./usearch12_trim -usearch_global $ORFFA \
    -id 0.01 \
    -fulldp \
    -maxaccepts 8 \
    -maxrejects 32 \
    -top_hit_only \
    -db $RDRP \
    -userfields query+target+id+qtrimlo+qtrimhi \
    -userout results.tsv \
    -trimout trimmed_output.fa
    
usearch -sortbylength trimmed_output.fa \
   -fastaout $RDRPFA

mv results.tsv $RDRPFA.trims
rm trimmed_output.fa

### 4 Semi-Automated Clean-up =====================
# 4. For all New Query-Rdrp (trimmed)
#    Remove Known False Positive hits (RNA capsid/envelopes)
#    Remove RdRp below 160aa long
#    Add-back high quality (annotated) partial RdRp
mkdir 4_clean_up; cd 4_clean_up

ln -s ../3_trim_orf/$RDRPFA ./$RDRPFA

# Output removes raw- prefix
RDRPFA="$OUTNAME.rdrp.fa"

# MANUAL CLEAN-UP


#### DECOY SEQUENCES
### HCV Envelope 2 Glycoprotein (113 sequences)
grep "[Hh]epa" raw.$RDRPFA \
  | grep -e "[Ee]nv" - \
  | grep -v "[Rr][Dd][Rr][Pp]" - \
  | sed 's/>//g' - \
  > DY.hcv_env2.acc
  
grep "[Hh]epa" raw.$RDRPFA \
  | grep -e "E2 protein" - \
  | grep -v "[Rr][Dd][Rr][Pp]" - \
  | sed 's/>//g' - \
  >> DY.hcv_env2.acc  
  
# i.e. ABG55345.1 envelope glycoprotein, partial [Hepacivirus C]:1-192

### Partial Capsid Protein (42 sequences)
grep -e "[Cc]apsid" raw.$RDRPFA \
  | grep -v "[Rr][Dd][Rr][Pp]" - \
  | sed 's/>//g' - \
  > DY.capsids.acc
  
# i.e. ADP88743.1 capsid protein [Rabbit calicivirus]:1-532
  
### Coat Proteins (70 sequences)
grep -e "[Cc]oat" raw.$RDRPFA \
  | grep -v "[Rr][Dd][Rr][Pp]" - \
  | sed 's/>//g' - \
  > DY.coats.acc

# i.e. AAK17013.1 coat protein, partial [Papaya ringspot virus]:1-307

cat DY* | cut -f1 - > DECOY.acc

#### BLACKLISTED SEQUENCES ==========================
### Deplete HCV Partial Sequences (1397 sequences)
grep -e "[Hh]epacivirus C" raw.$RDRPFA \
  | grep -e "[Pp]artial" - \
  | sed 's/>//g' - \
  > BL.hcv_partial.acc
  
grep -e "[Hh]epatitis C" raw.$RDRPFA \
  | grep -e "[Pp]artial" - \
  | sed 's/>//g' - \
  >> BL.hcv_partial.acc

cat BL.hcv_partial.acc > BLACKLIST.acc


#### QC SEQUENCE MATCHES 

# Trim output to at least 160 aa post-extension
seqkit seq --min-len 160 raw.$RDRPFA \
  | seqkit sort -l - > gt160.$RDRPFA

seqkit seq --max-len 160 raw.$RDRPFA \
  | seqkit sort -l - > lt160.$RDRPFA

# Remove sequences containing 4 or more consecutive "XXXX"
# (low quality sequences). Quality-control pass
seqkit grep -s -v -p "XXXX" gt160.$RDRPFA \
  | seqkit sort -l - > gt160.QC_PASS.$RDRPFA

# Removes (50 sequences)
seqkit grep -s -p "XXXX" gt160.$RDRPFA \
  | seqkit sort -l - > gt160.x4.$RDRPFA

## Recover Partial (non HCV) Sequences
# Extract Known partial RdRp (911 sequences)
seqkit grep -r -n -p "RNA-dependent RNA polymerase" lt160.$RDRPFA \
  | seqkit grep -r -n -v -p "epacivirus C" - \
  | seqkit grep -r -n -v -p "epatitis C" - \
  > partial.$RDRPFA

# Whitelist PREDICT accessions from trimmed records (20 sequences)
seqkit grep -r -n -p "PREDICT" lt160.$RDRPFA >> partial.$RDRPFA
seqkit rmdup partial.$RDRPFA > tmp.partial.$RDRPFA
mv tmp.partial.$RDRPFA partial.QC_PASS.$RDRPFA

# DEPLETE Decoy and Blacklisted sequences 
# Add Long + Partial RdRp, remove false positives
cut -f1 -d' ' DECOY.acc      > deplete.tmp
cut -f1 -d' ' BLACKLIST.acc >> deplete.tmp

cat gt160.QC_PASS.$RDRPFA partial.QC_PASS.$RDRPFA \
  | seqkit grep -r -v -n -f deplete.tmp \
  | seqkit rmdup -n \
  > final.$RDRPFA
  
rm deplete.tmp

cd ..



In [None]:
### 5 Manual Clean-up =====================
# 5. Create rdrpN+1
#    Combine independent sets of rdrp into one master set
#    Bootstrap the annotation from rdrp0 to assign to new sets
mkdir 5_make_rdrpN; cd 5_make_rdrpN

## Make "manual.additions" of accessions to add to rdrp1
## see "Manual Inspection" below (USE FROM 3C!!)
# seqkit grep -f manual.additions final.gb241_r1B.rdrp.fa > manual.additions.fa
# grep -f manual.additions 

### Manual inspection of hits

- Flanking sequences from imperfect rdrp-extensions result in more capsid/helicase/flanking proteins coming up in this round.
- 30 matches annotated as RdRp were additionally retrieved, adding these back to `rdrp1`. These are pretty much all "out there" wrt research subjects so we're scrapping the bottom of the barrel for GenBank for now.

```
ASM94010.1 putative RNA-dependent RNA polymerase, partial [Caledonia beadlet anemone chu-like virus 1]	75.9
ASV45868.1 RNA-dependent RNA polymerase protein [Lobeira virus]	70.4
AXG65493.1 RNA-dependent RNA polymerase [Kundal virus]	69.1
QED22729.1 RNA-dependent RNA polymerase, partial [Botryosphaeria dothidea narnavirus 1]	48.6
QIR30265.1 RNA-dependent RNA polymerase [Plasmopara viticola lesion associated mitovirus 42]	42.9
QOW97233.1 RNA-dependent RNA polymerase [Partiti-like adriusvirus]	18.4
ABA46925.1 putative RNA dependent RNA polymerase, partial [Cassava frogskin virus]	18.2
AHB12495.1 RNA-dependent RNA polymerase, partial [Cassava frogskin associated virus]	17.9
AVM87434.1 RNA-dependent RNA polymerase [Wenling jack mackerels birnavirus]	17.7
ASQ44445.1 RNA-dependent RNA polymerase, partial [Human rotavirus A]	17.5
BBA12670.1 RNA-dependent RNA polymerase, partial [Sclerotium Rolfsii dsRNA virus]	17.4
QBZ92966.1 RNA-directed RNA polymerase, partial [Potato virus S]	17.2
AWY10996.1 replicase-associated protein, partial [Sclerotinia sclerotiorum tymo-like RNA virus 5]	17
QCQ84350.1 RNA-dependent RNA polymerase [Lates calcarifer birnavirus]	17
AOX47590.1 RNA-dependent RNA polymerase, partial [Ceratobasidium mycovirus-like]	16.6
CAD30691.1 RNA-dependent RNA polymerase [Blotched snakehead virus]	16.3
BAP99821.1 predicted structural protein [Chaetoceros tenuissimus RNA virus type-II]	16
BCH36656.1 RNA-dependent RNA polymerase [Magnaporthe oryzae narnavirus 1]	14.2
QAY29233.1 putative RNA-dependent RNA-polymerase, partial [Agassiz Rock virus]	33.6
ASU43981.1 RdRp [Australian Anopheles totivirus]	71.1
QED43009.1 RdRp, partial [Phakopsora virgavirus A]	59
ANC52159.1 RdRp [Sclerotinia sclerotiorum mycoreovirus 4]	48.5
QNQ74064.1 RdRp [Plasmopara viticola lesion associated orfanplasmovirus 2]	27.6
QNQ74065.1 RdRp [Plasmopara viticola lesion associated orfanplasmovirus 3]	26.5
QNQ74063.1 RdRp [Plasmopara viticola lesion associated orfanplasmovirus 1]	26.1
QNQ74067.1 RdRp [Plasmopara viticola lesion associated orfanplasmovirus 5]	26.1
QNQ74066.1 RdRp [Plasmopara viticola lesion associated orfanplasmovirus 4]	24.7
QED42926.1 putative RdRp [Leucocoprinus gammaflexivirus B]	22.4
QBP78746.1 RdRp, partial [Botrytis virus F]	20.2
QHD64760.1 RdRp, partial [Plasmopara viticola lesion associated picorna-like 4]	16.7
QAY29251.1 putative RNA-dependent RNA-polymerase [Dumyat virus]	55.4
```


In [None]:
usearch -sortbylength ../$RDRP \
   -fastaout $RDRP
   
usearch -sortbylength manual.additions.fa \
   -fastaout $OUTNAME.it2.rdrp.fa

# Order of priority for taxonomic classification
cat $RDRP manual.additions.sort.fa > catfa.tmp

# Create initial 45% cluster set
function uclust () {

  INPUT=$1
  OUTNAME=$2
  ID=$3
  
  # uclust
  OUTPUT="$OUTNAME.id$ID"
  usearch -cluster_smallmem $INPUT \
     -sortedby other \
     -id 0.$ID \
     -maxaccepts 4 \
     -maxrejects 64 \
     -maxhits 1 \
     -uc $OUTPUT.uc \
     -centroids $OUTPUT.fa

  mkdir id$ID; mv *.id$ID.* id$ID/

}

# Cluster rdrp0 and genbank_rdrp0_matches at 45%
uclust catfa.tmp initial.cluster 45

#      Seqs  14613 (14.6k)
#  Clusters  4723
#  Max size  315
#  Avg size  3.1
#  Min size  1
#Singletons  2982, 20.4% of seqs, 63.1% of clusters

# For every Viral GenBank RdRp; which cluster does it belong?
grep "^[HC]" id45/initial.cluster.id45.uc \
  | cut -f 1,9,10 - \
  | sort -k1 - > uctable45.tsv

In [None]:
## Manually adjust headers based on UC assignment/monkey work
#  in excel

In [None]:
# Re-import header swap
# vim header_update.tsv

function reheader {
  INFA=$1  # Input Fasta File
  TSV=$2   # 2-field TSV == 1:Old_Fasta_header 2:New_Fasta_Header
  OUTFA=$3 # Output Fasta File
  
  rm -f $OUTFA
  
  # Ensure header input are unique
  sort -k1,1 $TSV | uniq - > tsv.tmp
  mv tsv.tmp $TSV
  
  while read -r line; do
   if [[ "$line" = ">"* ]]; then
    # header line; replace old with new
    oldheader=$(echo $line | sed 's/>//g' -)
    newheader=$(grep -Fw "$oldheader" $TSV | cut -f2 - )
    
    if [[ "$newheader" = "" ]]; then
      echo AHHHHHHH - $oldheader
      stop
    fi
    echo ">$newheader" >> $OUTFA
    
   else
     echo $line >> $OUTFA
   fi    
  done < $INFA  
}

reheader gb241.it2.rdrp.fa header_swap.tsv gb241.it2.rdrp.reheader.fa

In [None]:
# Manual Addition of sequences


File: `deltavirus.dag.fa`

```
>var.Deltavirus.mmondv:SS0000002
MENPKQGNSKGREETLRQWVKGRERKEELEEELRRLNKKIKKLEERNPWLGNILGMVRKRGGEEQSSPSKRPRGESMEVDGGSSGGPQGPRFSEEEKRDHRRRKALENKKKQLEGHGKKLPQEEEEELRRLTGPDEARERRLGDHGPGDVNVGGGGPRGAPGGGFVSNLQGVPESPYSRRGEGLDIRGKQEFP
>var.Deltavirus.ovirdV:SS0000003
MDTPGNKANQRGREDVLRDWVEGRRRKDELEKELQRLSRKLKRLEEKNPWLGNVLGMIRKGGGGEGASPAKRPRSDPMDVDPGPSGSQQRPRFTDQERRDHRRRKALENKKKQLSGGGKDLSPEEEEELRRLTGPDEERERRVAGPPVGDVNPFGGPPRGAPGGGFVPNLQGVPESPFARKGDGLDTRGDQ
>var.Deltavirus.pmacdv:SS0000004
METPGKKKTPKPRQEILEEWADMNRKKRELEKELQRTLRKKKKLEEENPWLGNVLGIVRQRAGGTDAPQAKKRRLGEEMEVDGGPGPSSAPRAPFTKKEREDHRRRCALENKKKQLEEQGKKMSEEEEAERRRLAEEDERRKKRAEGGGDGDVNPPEGTPRGAFGGGFVPGLQGVPESPFSRRGDGLSLRGEGEYP
>var.Deltavirus.tgutdV:SS0000005
MDATPKKKGPKSREEVLEEWCDLRKRKRELEKELQRVNKKKKKLEDENGWLGNVLGMVRKGGGGDSSSPRKRKRGEGGDMDVDPPGDGGSSRPGPYSDEERAEHRRRCMLENKKKQLEAQGKQLSGEEEAERRRLASRDEDRKRRRQEAEAGGVNPSAGQPRGAPGGGFVPDSQGVPESPFTRRGDGLSTRGEGMFP
>var.Deltavirus.bgladv:SS0000006
MAALTAPKKKTGVDRFATLKRWTDTSDKIKELEEALRKERDRKRKLEDKHPWLGNVKGIIKEPGAKKPESGKASERMDTEGSEEPSRKRKRKEEYSEEQRKDYREMKSLLNKAAQLQKAGKKLSKEETTKLCGLSGKTGEGIPVELLTALTYDGADVNPLQIIGSARAGENPHDLNLDGVPEGSRTIPPTLFPRGPR
>var.Deltavirus.ichidv:SS0000007
MESSFAPAAETTAGKSSKMERGEVLNKWVLLREDIKDLEAKLAKARGRKRKLERENPWLGNILGIIKERPPGDGEVAPKKRKEKEASEVQERPTKRKRSKKDLYSEEERERYKRLKQLLNKAKQLGSKGRKLTPEEQGELMGLAAATGEQLPEDVVVGLSGEVNPLLVSGAPAAGINPFGLNLTGVPVVEGDPMQLDIRGDRST
>var.Deltavirus.aansdv:SS0000008
MENKEESQKRKRGREETLQRWVDDRRRKKELEEELEKIRKRIKDRERKNPWLGNLKGILKQETGGGQEKKKETTEPMDTGESPKKKKKRDYEGLSFTPEEKKQHKRKCDLENKKKQLNSKGKQLSAQEEDELRKLQKLDEERLLKKRAEREQRSGGVNLFGSASPSTSGGGNAPTTQGLRLPWQK
>var.Deltavirus.wldv:SS0000009
MSEASTQRSVILVEWVSCMEEIRRATRTLEKNKRKMKKLEESNPWLGNVKGLIKKEEGEVRPKTAEAEPTRKRKRGEEEMTTDPFSDRAAKRSFTPEEKQAYQKMSQELNRLLQRAKSGKQNSLEETERVVREAAVLKRYLTNDQRELLGTAAEVEGTQLTAWQMEATRNIL
>var.Deltavirus.hdv8:AJ584849
MSQSDAKRERRGGREDVLSKWVEARKDLEDLEKRIRKTRRNIKRLEDENPWLGNILGIIRKGKAGEGAPPAKRARTDQMEVDSGPGKKSRKGGFTDEERRDHRRRKALENKKKQLSSGGKKLSEEEEGELRRLIVEDEKRERRAAGPQDGGVNPPGGSPRGAPGGGFVPRMLGVPESPFTRHGDGLDIRGDQQFP
>var.Deltavirus.hdv6:AJ584847
MGPAEQKRKRGGREEILEKWVELRKNREDLERKLRKTQKGLKKLEDDNPWLGNIIGIIRKGKDGEGAPPTKKARTDQMEVDSGPRGKPHKSGFTDEERRDHRRRKALENKKKQLAAGGKNLSREEEEELGRLTGEDEQRKRRTAGPRVGDVNPPGGDPRGAPGGGFVPTMLHVPESPFSRTGEGLDVRGTQQFP
>var.Deltavirus.hdv5:AJ584848
MSQSESKKARRGGREEILEKWVQARKDSEDLEKELRKTKRTIKKLEDENPWLGNILGIIRKGKDGEGAPPAKRARTDQMEVDSGPRKRTRAGDFTDQERKDHRRRKALENKKNQLSAGGKSLSREEEEELRKLTEEDERRERRVAGPRVGDVNPPEGPPRGAPGGGFVPQMLDVPESPFSRTGSGLDVRGNQQFP
>var.Deltavirus.hdv2:X60193
MSQSETRRGRRGTREETLEKWITARKKAEELEKDLRKTRKTIKKLEEENPWLGNIVGIIRKGKDGEGAPPAKRPRTDQMEVDSGPGKRPHKSGFTDKEREDHRRRKALENKKKQLSAGGKILSKEEEEELRRLTDEDEERKRRVAGPRVGDVNPSRGGPRGAPGGGFVPQMAGVPESPFSRTGEGLDIRGTQGFP
>var.Deltavirus.hdv4:AF018077
MSQPDSRRPRRGREEQLGKWIDARRRKEELERDLRKVNKTIKRLEEDNPWLGNVRGIIRKDKDGEGAPPAKRARTDQMEVDSGPRKRKHPGGFTEQERRDHRRRKALENKKKQLSSGGKNLSREEEEELRRLTEEDERRERRVAGPRVGDVNPLDGGPRGAPGGGFVPSMHDIPESPFTRRGDGLDVRGAQEFP
>var.Deltavirus.hdv1:M21012
MSRSESRKNRGGREEILEQWVAGRKKLEELERDLRKTKKKLKKIEDENPWLGNIKGILGKKDKDGEGAPPAKRARTDQMEVDSGPRKRPLRGGFTDKERQDHRRRKALENKKKQLSAGGKNLSKEEEEELRRLTEEDERRERRVAGPPVGGVNPLEGGSRGAPGGGFVPNLQGVPESPFSRTGEGLDIRGNQGFP
>var.Deltavirus.hdv1:NC_001653
MSRPEGRKNRGGREEVLEQWVSGRKKLEELERDLRKVKKKIKKLEDEHPWLGNIKGILGKKDKDGEGAPPAKRARTDQMEVDSGPRKRPSRGGFTDKERQDHRRRKALENKRKQLSAGGKNLSKEEEEELRRLTEEDERRERRIAGPQVGGVNPLEGGTRGAPGGGFVPSMQGVPESPFTRTGEGLDIRGSQGFP
>var.Deltavirus.hdv7:AJ584844
MSHADTKRSRKGREETLSKWLKAREDAEELERRLRKTKKTIKKLEDDNPWLGNIKGIIGKVGTGEGAPPAKRPRTDRMEVDSGPGKKSNKGGFTDEERRAHRRRKALENKKKQLSAGGKSLSKEEEEELRKLTADDEQRARRIAGPRVGDVNPPEGPPRGAPGGGFVPQLLGVPESPFSRTGEGLDIRGDRQFP
>var.Deltavirus.hdv3:L22063
MSQTVARLTSKEREEILEQWVEERKNRRKLEKDLRRANKKIKKLEDENPWLGNVVGLLRRKKDEDGAPPAKRPRQETMEVDSGPGRKPKARGFTDQERRDHRRRKALENKKKQLAGGGKHLSQEEEEELRRLARDDDERERRTAGPRPGGVNPMDGPPRGAPGGGFVPSLQGVPESPFSRTGEGIDIRGTQQFP
>var.Deltavirus.drdvb:MT649206
METPGKKKPPKPRQETLEEWCELGKRKRELEKELQRITRKRKRLEEENPWLGNVLGITRQKSGGSETPSGKKKRREEEMEVDGAPGSGAAPRTPFTKKEREDHRRRCALENKKKQLEQQGKKLSEEDEAERRRLAEEDERRKRRLEERGDGDVNPPEGPPRGAPGGGFVPGLQGVPESPYSRTGEGLSKRGEGYFP
>var.Deltavirus.drdvb:MT649208
METPGKKKPPKPRQETLEEWCELGKRKRELEKELQRVTRKKKRLEEENPWLGNVLGITRQKSGGSETPSGKKKRREEEMEVDGAPGSGAAPRTPFTKKEREDHRRRCALENKKKQLEQQGKKLSEEDEAERRRLAEEDERRKRRLEERGDGDVNPPEGTPRGAPGGGFVPGLQGVPESPYSRTGEGLSKRGEGYYP
>var.Deltavirus.ratdv:MK598003
METPGKKRSPKPRQEILEEWADLNRKKRELEKELQRTQRKKKRLEEENPWLGNVLGIVRQKAGGSDAPQAKKRRLGEEMEVDGGPGPSSAPRAPFTKKEREDHRRRCALENKKKQLEEQGKKLSEEEEAERRRLAEEDERRKKRAEGGGDGDVNPPEGTPRGAFGGGFVPGLQGVPESPFSRRGDGLSLRGEGEYP
>var.Deltavirus.drdva:MT649207
MESTDPKKNPKGREDTLREWVKGRERKEELERELGRLTRKLKKLEERNPWLGNVLGMVRKRDGGEGGTPSKRPREDPMDVDPGAGGLHHRPRFSEKEKQDHRRRKALENKKKQLSGGGKSLSQEEEEELRRLTREDEERERRVAGPPGGDVNPDAGGPRGAPGGGFVPNMQGVPESPFARTGEGLDPRGDRHFP
>var.Deltavirus.snakedv:NC_040729
METPSKKQIPTPSREDILEQWVELGKRKKELEKELQKVTKKKRKLEEQHGFLGNVLGIVRGKEQKPAATPQKKRKAEESMDVDGGSRLPPKEIKKRIFTEEERAEHRRRGQLENKKKQLEGRGKQLPEEEQKLLAELTRKDEERKQRFHYGGAGEVNPLEGQSRGAFGGGFVPSTQGVPESPFHRTGTGLDVRGDKMFP
>var.Deltavirus.toaddv:MK962760
MESLNPSRVKAVEEWAVTRREIAALEEKLAKKRKRVKELEKDNLFLGNIKGMCSKGTKKKASRSPTRRKEPQPGEPGTSGTAKKRKLDSSSKPKSSSSGGPKIPVSPAPKISKGTSSASLDEENFARTLAARAGDPRFREDLERLLSLEDTTRRGRLDSESSESTFSAPGKS
>var.Deltavirus.aviandv:NC_040845
MENKEESQKKKRGREETLQKWVDDRKRKRELEEELEKLRKRIKDRERKNPWLGNLKGMLKQDSGGGQEKKKEAAEPMDTGESPKKKKKKDYEGLSFTPEEKQRHKRKCDLENKKKQLNAKGKQLTSQEEDELRNLQEEDKKRLLKKREERERRSGGVNLFGSASPSTSGGGNASSTQGLRLPWQK
>var.Deltavirus.fishdv:MN031240
MADEQARFGLAESYLGGYRRQFQLELELIETKRKLREMEKENPWLKMVKGQTRKGSQGIQTILGKREKVAEAGPSGKKQRKADLSGEERARKRNMSLFKQQENQVRMKLAAKKPVSEELAKALLSSSNLVERQLDDDLVKDLTERGFLSPDQTGPYLLRWECKTRAEREGRLSPLSDMGE
>var.Deltavirus.newtdv:MN031239
MASTSEARVPPVKKGKTPSRERVLSKWVASRDKLAALLREEAAERAKILALERDNPWLGTIKGIIRKSAESAIPSGPLAGPSAPSTGGDPPPKKRKRTSEKLEASPDQKAIWRAGKDAVNRAKQAMSKGKEVSSELIRQMLLAHDKLGMEVPEDVQAYVMTEVNRSAQTAPASDFGTREEEMAMVPVQMGAGPPVLSIRGFESKDGYNNGQ
>var.Deltavirus.termitedv:MK962759
MELRGESETTLDRWLKLRESVRRRSLRLKESQEQLRLLEQQQPWLGSVKGTLRKKDGSSFDIPEGPKTTKAAPQEETQTAPPEEKVGPPRRERERKAEKRPAAPLMEKSRREAEAKRKKLQPPHTPAASSTPRRVHHLGDLSEIPVSLLKKALEEAEGRKPGMETGESSEEGAEGSESDFWKMN
```

`epsy.orf.fa`

```
>pisu.Nidovirales.pacific_salmon_nidovirus:QEG08237
DKSIRSVDVNITKGDGSRVSFIGTCKKSNTIQMDTGHGRSILKMSLKSEDAIERKFFPSYANTGVVCDHQFHDFVCENDTLRCIIRHHCTEYTLSDLVHSFRTADFAKMETIMLTLPQYTQTDFIPFWYDPVVNTNVYADIMSRLAPTLLQCLTHTTVFVDAIVANGHCVVLTLDNQNLNGNWLDFGDFTDLQIPGVGVAATNSYYSYLLPLLGATQCFSSVFDPSFNEFGYDFTEQKLLIYSTYFRWERVGIYHPNCVDCVDEECLLFCSNLNVLLSMYIPFNYLGPLCSIEYFNGMATPFCAGLSSVELGIVHNSCDLPSDPILKACLIFGTPPIHVVSSTPTADTRLNIPCVGSGKLMPGKLLKPAMIDTEFYEFVKDALLYEGSPITIEHFFYLQPRDCAVTDFDYYRFNRPTVLDPLQFRFVYNVVKHYFKSYSAGCLKSEFVIINNPHKSSGFPFNQLGDASDVYDFLGHERVDELYAYTKRSVLPTLTKIITKDAISAKTRARTVAGVGILSTATNRCAHQHMLKDIAAQRGRTVVIGTPKFYGGWHEMLSRFNDNRDRKLFGWDYPKCDRSMPNLLRMSAAWMFLNKHPCCSLLDNKYRLANELSQVLSETCVVNGAFYRKPGGTSSGDATTAYANSAFNLFQVITSAVNRCMSTDRFDYRSFNHELYSNIYLLDTPDRGFVEKFYNFMDDTCPLMILSDDGVAGPLVDHPLAIDVSDFRATLFYQNNVYLDDSKCWTEADVLKGPHEFCSQHTILHEGIFYPVPNPSRILAACMFVNSMEKADPKCLVERLVSLAIDAYPLVHHSIEVYRRVFPQILAYIKQVVSSRDMDIYDMFGEFTDFCSGVDITSETFYARLYKCKSSLQ
>pisu.Nidovirales.ambystoma_mexicanum_nidovirus:SS0000010
MKFNCLQLDCDGERILLKLLNKDEQVNELAAIPYTASNYFCKAVYAKICDRDCLLRFNCTEYTLADLAYAFKTADYIMLETIAKTLPAFVDFKFSAGWYDPIEHPESYDAFMTLLRPTLIQCLVATSEVYDVLDANYFVGTLSPDNQNLNGHWLDFGDFTKCAASVVMVDTYYSYIMPLLGAARVFTDVFDHTNYKGCYDYSDEKRRLFLKYFKGWSGIYHPNCHQCLDEMCVLHCANYNILLSMYINPKFLGPITGIFNYCEYTVPVCIGHSSVELGIVYNNDSLPEDDILRLMLIYGTPPIHVAASTPYVNNSTSLYTAGFIKHMSTRLDVPAYIDTDFYEHCRDILLHPGSTINLKHFYFLVNRDSASSDFDLYRHNRPTVFDIKQFQFVYKVVQKYFEPYDAKCLPNEDVIVNSPNKSSGFPFNTIANAGGILEYMQNIRVDKLFAWTKRNVLATLTKVNLKNAISAKDRVRTVAGVSILSTCTNRCAHQHLLKNIAATRDATVVIGTPKFYGNWDTMLRRFMDGENRNLIGWDYPKCDRSMPNLHRIAAAWLFLNKHTSCCNHTERFYRLVNEHAQVISETCLVDGALLYKPGGTSSGDATTAYANSVFNLFQVVTASVQHMLKKNVPDLLRDVKVLDSNSVTIRNYRDVVHNIYECIYLCPYQEAVQQFYDLMLKVCPLMILSDDGVAAPMVEHPFALTLDDFKGVLFSQNNVVLTSDKCWTQSDLTKGPEEFCSQHTIFSEGSFYPVPDPSRIIAACIFVSKMEPTTDSYKIERLVSLAIDAWPLTLHHDENYKRVFPVILDFINKCADRIQDDVSFDVPDCVNVTPHVIRSESFYRNMYVKSAMLQAGATMCVVCQIPTILSCSSCVRPVPLCCMCMYHHVATCNHHRVATTTVMCCNHRGCACDDVNKLYFDGKDGLIRCLDHVTSTITLPIMDPSSETIFSLGRDKCVPNGDIHVFNQAILDGFSKVANYDTFGKPYHLKCQELVRAMEE
>pisu.Nidovirales.puntigrus_tetrazona_nidovirus:SS0000011
MDLCNAFTTADFFRLQEIALTLPEFADFDFCDNWYDPITNPDSHDLFYSLLSPRLLNILKCTCAYFDHCTAKSFAGVLTLDNQNLNGRWLDHGDFTPFPEPAIVVDSYYGYLLPILGRCNVFKDSFDVNNHDDFDYSKEKLRLFYEYFRWSVKYHPNCVDCCDSYCVLWCFNFNVLFGNNIPVNHLGPLVAKGQFHDLLSPVCCGYASIELGIIRNRYEYPTDPITTASLTFASPITHVAASKPCVDHRINPPCLGSTGKIISKVLPPAKFDIDFYNLLKEEITYEGSPITLKHFFFMQTRDSAVTDFGYYGYNRPTVCDPYKLCFAYSVSSLYFDCYEAGCLPNDKVVVANPKKSAGFPFSQLGTAGEVLEFLSSERLDELHAASKGAIIPTITKVNTKYAISAKDRARTVSGVSIVSTALNRMTHQMLLKSIAATRSATCCIGTPKFYGNWDAMMSRFCDGVSRTLIGWDYPKCDRSMPNILRAGASMAFLDKHQGCCTPTDKFYLMANELCQVLHDTAYLEGNFYYKPGGTSSGDASTAYANSHFNIMQVVSASVSRCLNSPASGYYDFQSSLRDSIYNPNRGFHDFAHTFFGYMQSTNPLLILSDDGVCGPSKDHPLALTLNDFKGTLYSQNNVVLTTDKCWMEENPSHGPEEFCSQHSIRHNGVLYPCPDASRILSACVFMSDEIRADPQLMLERYVSLAIDAYPLVYHKCEVHRKVFPVLLRHIRSVHASLKQAVSFTGGFDGDPDILNEDFYRRLYECKTTLQATSNTCIICESPTIIYCSSCPRKHMLCCSCAFFHVKDTGHTTIAHYKDITCCVSGCTSDITLLHYAADGSVRCVDHLASSKYSLPIYDTAAQKIFVMHHDACRYDPSCHTYNSGVAKGYSEVANYDVYSLPMSAQLACHENIQVLEEMSKRTYGSATVDVVVNNTFPPMVVVSWDGPIPPINKNSKFSATKNRSNYGEFTLTTASSGSSHYFYESLTNARLVQGCVLKLV
>pisu.Nidovirales.crotalus_viridis_nidovirus:SS0000012
MKTVNFKVGFDNFTVKCVDEKELARELIWYRRLKNKVKVVPHYSCKIGRNFALIRGPMTEKSLGDYVYKYFKADDSIGIDREDGVSFNERRNWYKNNLFNDELFGNLDVLYDNSNLLGGIISLDNIDCKGYLYDFGDFNGQESYLDKALADLCSFYSSIGLDKRNEFKKWFSVDYNVLKSGDWLDKVLMLNNSILNSSSKVTDLFIDVDEVCTSAYYNQLFGYCKIDQSKIEEVDLLLQYIYKLQDPALRGLEPVKSVSDAIVVHKYKDIKSDNMVKNVYYNIDSYEDLKNRGVDCDSALKYTYFQGDNVDVMVDFMYYNYQGDLFLQPHILKFLYKQTLKMFEVVRTCKRFNYEECKPNKSSLGPSEHLLTSLKQDYVYNHLNESVVDQIVKLAAETPLIFLTKISEKFALTKKARARTIAACSMFSSLLFRALHKPVTSSMVSNAQQYKVPSLIGVSKFYNNFDGYIKSRYGDLNDYKIFGSDYTKCDRSFPLIFRAAAAAMLFELGEWDFDNYYFINEIQAFMLDFVMVGDCILQKPGGTSSGDATTAFSNTLYNCMVHLYVQLASLVSCENECKDPVFKAAAVKLWLTGDSSDYEMMLDYYNSEVYRFNFLSDDSFILTNEGKCVLDIYNCNNFSLHLERLIHVNVDVNKAWEGKDIHEFCSSTIVKVEGAWQYVAEKDRLLASLLITGKESDLELDIIRTSAILAEAAVYYWVDNMFFMALYDHFILKCQMYFNKYGCHVLPLVM
>pisu.Nidovirales.hippocampus_kuda_nidovirus:SS0000013
MSCRIKFLFSFKLTEGSCKFVNLRGIYRNPNNKFSISLSPAKRAVDVSCQRGDGHRVSFIGYVNKSNTLQIDTANERIILKCVHESEYQAEYDAYNRVKSNTSVCRHAFLKIFGMLFVCRRDCTEYTVMDLCYAFRTANFPKLKEICLSMGLSEGLFVEGWYDSVEGYDAYIDFTQSLRPYFLNILSSTKHYIEGLKVAGLSGVLTLDNQNLNGKWLDHGDFVFTTEASIVVESYYSYMLPLLGVCNVFYDVFDHANCQYGYDYTNEKLRLFETYFSWPITYHPNCVDCSSDICLLTCTNFNILFSNMMDLAKVGPIITVTNYCELTLPTTMGINSIELGVIYNDGDYPSDPIIRTIIHYGTPPVHVLSSTALTDHRVQFYSVCSVNRNRSKLQPPAVIDFSFYNYVKDKLLFAGSSINLNHFFFMQRPDAAVNDFAYYRYNKPTVLDIRQAIFCYKVTCLYFGEYVARCLPDEEVIVANPHKSAGFPFSGFGTAGEWLDFVESHRLNELHACAKRCILPTITKTNVKVAISGKSRARTVAGVSILSTALNRMTHQHLLKDITSKRDATICIGTPKFYGNWDTMMSRFRDGKPRTLFGFDYPKCDRSMPNILRQAACMAFLAKHECCNVRERFYLLANEMAQVLTEYTYLEGGLYRKPGGTSSGDASTAYANSFFNVFQVVTSSVNRCLHSKCAPYVDFQSQLRANLYDCVDDPLFVSQYQKFLQHTCPLLILSDDGVAGPLVEHPLSLTLEDFRATLFYQNNVYLSDGKCWVQPDVNKGPDEFCSQHTIVHNGCLYPVPDPSRIIAACCFVSTAERSDPTMQLTRLVSLAIDAYPLIYHNDETYKLVFHCILAYIRQLHKSVHTEALAGICEAFSDYGGSVLDEGFYKRLYETRSTLQS
>pisu.Nidovirales.syngnathus_typhle_nidovirus:SS0000014
MLFVSRRQCTAFTVMDLCYSFRTANFVLIKEICLSMGMPESLFVEGWYDSVEGFDAYVEFVTALRPYFLRILECTKSYITGLKAVGLSGVLTLDNQNLNGKWLDHGDFVYTGESAIVVESYYSYMLPLLGVCNVFYDSFDHSNCKYGYDYTDEKLRLFNTYFDWTITYHPNCVDCVDDICLLTCTNFNILFSNMMDLALVGPIVTVTNYCELTLPTTMGINSIELGIIYNDGDYPSDPITRTIIHYGTPPVHVLSSTAMKDHRVDFYSVCSVNRNRCKMQAPSVIDFAFYNFVKDKLLFAGSSISLSHFFYMQRPDAASTDFAYYRYNKPTVLDIRQAIFCYKVTCAYFDDYSARCLPDEEVVVANPHKSAGFPFSTFGTAGEWLDFVEPHRLNDLHACAKRCIIPTITKTNVKVAISCKSRARTVAGVSILSTALNRMTHQYLLKDITAKRNKTICIGTPKFFGNWDAMMSRFRDGKSRTLCGWDYPKCDRSMPNIIRQAACMAFLAKHECCSTSERFYLMANEMSQVLTEYTYLEGGLYRKPGGTSSGDASTAYANSYFNIFQVVTSSVNRCLHKKCEPHLDFQAQLRSNLYDCVDDPMFVKQYREFLQHTCPLLILSDDGIAGPLTDHPLALTLEDFRATLFYQNNVYLSDSKCWVEPDVNKGPEEFCSQHTIVYNGCLYPIPNPSRIIAACCFVSTAEKSDSTMQLTRLVSLAIDAYPLIYHKDETFQLVFPCILNYIRQLHQTVHSEALAGICEAFSDYGGSVLDEDFYKRLYETRSVLQSSTGGCVICDSNTIL
>pisu.Nidovirales.acanthemblemaria_nidovirus:SS0000015
MLPLLGASQCFYNQLDITLVQNEYDYTKCKLALYEHYFRWHLTYHPNIVDCIDDMCVLTCSNFNVLLGMGIPWNLFGPLVSIDFLHGQQVPVCSGFHSLELGIILNEKPLPTEHLLRAAIVYGTPLLHTTTAIPTIDTRTQIPCVGSGKLISGKVTRPAFIDFEFYEFAKDALFFDGSSINQQHFYFLHPQDSTMEDFNYYRFNIPTVVDICQFKFVFSVTEHYFSMYSATCLKSEFVVVNNPNKSSGFPFNQLGNAGDIYSYLGHERIDELFNYSKSAIVPTITKVIVKSAISAKDRCRTVSGVGILSTMMCRCMHQCLLKEIAATRDATVVIGTPKFYGGWHSMLSRFVDKYERELIGWDYPKCDRSMPNMLRLASAWLFICKHKCCNLHDRIFRLGNELSEVLTETCLVQGVYYSKPGGTSSGDATTAYANSVFNIFQVVTTAVNHCLQQDLPEHRPFNVKLYNSIYRLQDKEFVSEFYNWMQVKVPLMILSDDGVAGPIANDPQCLNLDSFKSTLFYQNNVVLTPDKCWRQRDISKGPHEFCSQHTIMHNGIFYPVPDPSRILAACMYTNSMEKSDPHLMCERFVSLAIDAYPLVYSNNETYRQVFPTILGYINKVVKDRDLNIYSLLGDYWDNSGASILSEAFYS
>pisu.Nidovirales.takifugu pardalis_nidovirus:SS0000016
MRAKCTEYTVMDLCHAYRSADFERLQTILYTCGFTPLDFADNWHDSVEGFDAYNHWVSKMDTLFLNILTKTTSYYDHLTANSMIGVLTLDNQNLNGNWLDHGDFTCSTDSAVIVDSYYSYMLPLLGACRCFKSVFDHENKEHGYDYTAEKSRLYEQYFRWHLPYHPNCVDCPNTRCVLACSNFNTLFGMYINPDLLGPICAIENFCEIRTPTTIGCHSIEQGVTFSEGSFPNDVVTRTCIYYGTPPVHVAASTPLVDHRMSVHPIGSVGRPRSKMQQPGFVDQVFYEFAKNELFFPGSSIELKHFFYMQERSAGVADFDYYRYNRPTVLDICQALFCYEITLDFFSDYEAGCIPDEQVVVANPNKSCGFPFSQFGNAGDFLEFLGSDRINSLHKACKSNILPTITKSNVKCAVSGKSRARTVAGVGILSTAMSRCTHQKLLKGIAAARDKTICIGTPKFYGNWHRMMSRFTDGEDRILIGWDYPKCDRSMPNWIRQAASLGFLSKHECCNHSQQFYLCANEMCQVLSEATLLDGNLYRKPGGTSSGDGTTAFANSFFNIFQVATSCVNRCLTTRVEGHLDFQSLLRDTVYGYSPPEEFVPTFRAFMQHTCPLLILSDDGVAGPLQTHPLSLKLQDFTATLFYQNNVFLNETKCWTEPHVNLGPEEFCSQHTLVHNGVMYPCPSPSRILASCCFVSNAEKSDSVLMLTRLVSLAIDAYPLVHHPDETYRRVFPVMLKYIRHVYGIINKEAL
>pisu.Nidovirales.caretta_caretta_nidovirus:SS0000017
MCNKERTTLQVIRSDCSLYSLLDLACAFRSADFDLAIEIAKSLKIDTSSWPSNWYDPIEQHSSKYWSAFGPAVFKCLLAANDFGDFLIEKGIIATLTADNQCLNSKWLDFADYSHNLNCPGAVSMDSYYSYLLPLINVVNYITVDSPFTVTDVLRYDYSEEKMSLFKKYFRHWPYEYHPNCINCDSDRCTIHCINFNVLLSQPYSTTNGPLISKCHFNGQPIYATCGYVSAELGVVINSDNLLELPSEFVKVFAILCDASLHITTSEALLDCTEGIIPIGLISLRKNTVKPGEVNLDFYDHLEKYGILHPGTGINLKHYFFMQDSTAATADFSLYKYNKFTVVDILQYLFCYEVTNCYFEVYEGGCIPASEVIVRSPHKSAGFLLNKISTAGGFYDSISLAEQDMLYAYTKSNIIPTVTQINLKHSIAAKERARTVAGVSLLATMTTRQCQQKLLKSIAATRGGTIVIGTTKFYGGWDSMMRTLIDGVSNPIMYGWDYPKCDRSYPTILRLTSSILFSRFHNSCCSDSEKFFRLCNELAQVVTESSFCNGALFYKKGGQSSGDALTAYSNSCFNIMQVVSSTVGEIISRPTRNVPDNFKKFQRSIYSAVYTTSLHSPDLATVRLLYQHMKTYCHLMILSDDGVATVDGSLMESGMVPSLDSFKKLLYYQNNVVLADDKCWTETNMSVGPHEFCSQHSKLVTIDGEVVYLPMPDPSRIISALVYVDKVEKASSYVVMERFVAMAMEAWPLKHHDNVHVRKIFPCILDYISNLYNSLSADVLSSFGCDVEYDGGTFTSETFY
>pisu.Nidovirales.microhyla_fissipes_nidovirus:SS0000018
MFYNILKSLSIDYTNWPTNWYDPVECFKCYYYWDLILTPVLEVCLTNANLFTSYCNANGIVGVLTADNQCLNGLILDFGDFHLAPVGSVQMRSYYSYLMPVITMSTLFNKFNADEYDTSVYDFTEFKLKLHNMYFKHWEFDYHPNCIDCNNDLCQLHCGNFNTMYTQYIPTYYFGPLLGVEYYNGTPVYQTNGVSTKFFGVVLNKAVFNTSSQFLSLFKLLSDASLHILSSKKFVDLNIDMPASAFAALPNRSVPPAGFFNEQFYNFCVEQGLFTDNTPLDIKHFYFMQESTSMVNDYGLYIHNRATSVDICQFLFVVDYAFQYFKDYEGGCLHPDAVYVRNKHRSSGFVFSSFGDAGNFYESLTAEDFINLYEYTGRAVLPTVTKLNLKISLSAKSRARTVAGVALLSTICGRLYHQKALKSIAQSNDKPVVIGRTKYYGGWDDMLKSLLGVVDQPMFIGWDYPKCDRAMPNILRIAASLILSRLHVDCCTKENKFFRTVNEYVQVICEYVYCSGIFYNKPGGTSSGDATTATANSVFNIIQYLSSFISKIIGTNYTTHVKTDDCPIDIVKLQSSVYSAIYADGDKSSVIKELHSISPHFFNLMILSDDGVACCNKNYYDFISPSLSDFGNLLYYQNNVFMSQDKCWTQLDVTKGPTEFCSQHTLLVCIDDEWVYLPYPDPSRIMAAALFVNKADGSLPSDVVERFVSLAIDAYPLVHTGNPTYSKVFYLLLDYIKLNSIYIAETLFEH
>rdrp.Serrataviridae.d_ana_virus:SS0000001
MTFEYRSATPTKGTFAGLSSEVIWTKKGGWRTLLTERPKKFRLENLKYVGGLPYYVRQKEQTRRVNEVVHELIRKHFPDFDANAYGKSRPTFEKIENEAKRINPAPTWELDEEARAFADYTLYNYILPLGPFPLATDDEIINELLPKSPGLWYEDLHKNATKRTTCGVAVKQARGLLDGTLKDFEDSDPLFKFVGKTEILSAVKIQEKRNRNYAVPPLYLYILEIIFFHHLSNAYKEHDKGYTLTLQHGGLYELFAQMDKYGTVISDDKTGFDLRQQWIVQKSVARICAYTIDMSDKHHMMYMHLAHQFCAKNILMPDGSVFYTPYHHASGRYVTTMKGSLFHRWEQAYVFYVVMQQQTPDRATEGLGREALRLLFERMNTRIASDDVLQGVPNDPIYEPWLDQEKRTSVWETLVPIKPGSALQTHSSTGHFYLGWTVENGKVKHNSRTKLLAKLLYSSIKDKQGVITGLVHTSPYDTPMCDFLRELANKWGVEFCGQRIAQLLWRRRLDTAPFLPSPPTPEGVIMMHPVDNVHKIRNSVTMARKKQVKQAAKRKEERKIERAVERRQAVHPLRHLMTPKLHGNTYKYLETLVDPYNTPAGASTPDKVVRKSARYKSYVKTTMTTGTGGFGFCLFNPYSAAWGDQFAVATSNGTYAGTLTDPNSATAGVNNYKSTAPLNWADASANGYSCRVVGAGIRILNNTALLDMGGSVTGIREPGNQYLTGYTFADILNYNQCHVIRPEAGKWIHVEWAPTGNIGQETGDQFEFDDENPASAASMIGGQIAIVATSAGPAANAQSYDVEAVVIYEMLGEKLPLQDVHVDPLGQAACLEVIAELDGSASDKWITQVGKVAKRTGKALREISAAAMPAALMLGLA
```

`quenya.orf.fa`

```
>pisu.Quenyaviridae.nai_virus:MH937728
SSGLLPLNSLFGACQGSKKEMHCAASAVNQADVCGLTIPGCIPNKPVGKDEVIKKGKKVRTIMVESQANSMVLRHFFEKTVSNDRDIPRGKAIGLSSVGGSFKIIVMRWFRIVSRELGLDWNAFLLWLEQQPINESDKKAWESSTNELDGLPYVILMLLSIRDVSDSTSQRLLARALADYMNPPVQLDGDLIAFAPWRVASGSYFTAHGNTYRHWLMGQWVCDFIERHYHRLGCHDCVCKVCNYFRDATWFAGTVTSLELQLMRAFFVMGDDFIGIGYHMEAFNSILDYIFGTTTVGKVKSFFSEPSLFEPEGSEFLKKHFYLDKSFNTYNVRIFRAPGRLLAKLLKGRSTQRTDTFRIACLSALWEIGYN
>pisu.Quenyaviridae.kwi_virus:MH937729
SCGLLPLNSLFGACQGSKKEMQNSASAVNWADLAWRCVPSCLPNKAVHKDEVIKKGKKIRTIMIESQANNMVLRHYFEKTVSKDRDIPRGRAIGLSGVGGSFKIIVLRWYDIVSRHNGMGWNEFLLWLAKQPINESDKTAWESSTNEVDGLPYVIYLLLTIADVSDPVAQRILARALADYMNPPVQIDSDLVTFAPWRVASGSYFTAHGNTFRHWLMGQWVCDFIAKHDNMVGKDDCSCDVCAHFKHNSWFGVAVTELELELMRAFFVMGDDFIGLGYHGEAFNSILDFKFKTTTKGEIKEFFSMPSLEEPTGAEFLKKHFYLDTSGRTYNIRTFRAPARLLAKLSKGRATQSATNFRIACQSALWEIGYN
>pisu.Quenyaviridae.sina_virus:MN264690
SAGLLNLNSLFGECIGTKKQMQAGAAAINLNDLQGKNRAFPSAQLCKPAGKLEVIKKNKKVRTIQVESQADFMILKYYGEHFVTESDVPSGRAIGLSTLRGDFKQIIFTWYKIWLHYNNGSWNKFLDWLDHAPISMSDKTAWESSTNAVDGSIYVWKYLLSFTTPRDPNADRVLARALADYMNPAIQIDKDKVYFAPWRIASGSYLTADGNTRRHMSMTDWICDFLETHGNEFQARDDCTCIGCRCLRDNPDIGGSYQPMDIDLLRHSFKLGDDDISINPHPYAWDKFMDAVFGTTTKSEETKFFSQPGLFKPEGAEFLKKHFYLDKSSGTWNVRIFRAPARLLAKLCKGSSNTDNARFYVACQSALWECGYN
>pisu.Quenyaviridae.nete_virus:MN264685
SCGLMNLNTLMGSVVGSKKNMHCSATSVNVAACXNPLKAPGCLPYKPVGXNEVIKKGKKVRTIMVESQPNFMVLKHFFGHIVSKHHDIASGDAIXMSTLGGDFKVIVLKWFSIYLDYNPRISFNQFLDWLDQQTINESDKTAWESSTNESDGLAYVMGMLMRIDVRDDFSKALLTRAIADYICPAVQIDRDKVFFAPWRIASGSFLTAHGNTKRHRSFNGWCCDFIERHGGYGLEDCMCKVCIHFGWNISPGPRVCEFKLQLMRSGFILGDDYLAICDDPLLFNNLMDYEFGTQTKTEVKKFFSTPSLYEPTGCEFLKKHFYLDKEFSTWNVRSFRAPGRLLAKLFKGEARSVRAKFHAAALSAVWECGHN
>pisu.Quenyaviridae.lithobius_forficatus:GBKE01003710
KKVRTIMVESQANGMVLRHYFEKTVSLDRDIPGGKAIGLSSVGGGFKMIVMRWFQIVTKELGLDWNAFLTWLSEQPINESDKKAWESSTNELDGLPYVIFMLLTIRDVEDPTSQRLLSRALADYINPPVQLDGDLVAFAPWRVASGSNFTAHGNTYRHWFMGQWVCDFIERHQSRVGCLGCVCKVCAKFKDTTWFAEAVTPLELQLMRAFFVMGDDFIGLGYHMEGFNAILDYVFGTTTVGHETKFFSEPSLAEPEGCEFLKKHFYLDKTFSTWNVRIFRAPGRLLAKLLKGRTTQRADTFRVACLSALWEIGYN
>pisu.Quenyaviridae.triatoma_infestans:GBBI01004385
SAGLLPLNTAFGNPVGNKKEMQASATSVNIYDSGVFLPGCLPAKPAPKDEVLKRGKKVRTIMVESESNFQILKHFFEDQVSRTRDVVGGSAIGLSSMGGGFKSIFFNWYWVFLKYFPDTPWNEFCEWLDKQFADESDKTAWESSTNMVDGMTFLIELLVSLPVVSDKISRKLLARALADYANPPVQIDSDKVYFAPWRVCSGSYFTAKGNTRRHWYFVQFVCDFVELHDHHFGGLDCSCNWCKKLCALEGFGTPITKFQLDMIRACIILGDDFLAINDKPVFGLFIDAIFGTTTKTHFKKIFSTPSLEEPEGAEFLKKHFFLDKTFATFNIRSFRAPTRLLAKLSLGRSTESRSTFKAAVLSAIWECGYN
>pisu.Quenyaviridae.psyttalia_concolor:GCDX01009922
SAGLLPLNSLFGDCLGSKKEMQASAAAVNFSEMRRTVGVPAALPYKPSGKAEVIKKNKKIRTIQVESQANFMVLKHYFGHSVSEQDIPSGRAVGLSTLGGDFKIIFMTWFKIYQHHNPDADWHSFLTYLETASVSESDKTAWESSTNAVDGMVYVFRLLLTMKTMKDRGGERLLARALADYINPAIQIDADKVFFAPWRVPSGSFLTADGNTRRHMSFTDWICDYLETHDLRLKAKIGCNCVGCVFLERREDLDCSFTQMDIDLLRHSFKMGDDDLSLNPHPFQWDAFMDGVFGTTTKSKRVRMFSDPGMHEPQGAEFLKKHFYLDKSLTTWNVRTFRAPCRLLAKWFKGSSTDKIESFYVAGISALWESGFN
>pisu.Quenyaviridae.locusta_migratoria_manilensis:GDIO01031530
SAGLLNVNTLFGSAVGSKNEMQASATSVNIAALMHPLKTPGCLPYKPVGKSEVIKKGKKVRTIMVESQANFMVLKHFFHSQVTADHDVPSGRAIGMSTLGGGFKTIPLSWYEVWLEFHDGTWNDFLEWLGCQVVNESDKTAWESSTNEADCLVYLLKRIMRIRLPDDLGSVSLLQRAIADYMCPAVQIDSEKVYFAPYRVASGSYMTADGNTERHMLMSDWVTTYLERHGGYGRAGCQCAVCVEFGWVDNPGPRVSCEELAYLRRRFILGDDYLAICREPGYFNAIMDFKFGTTTKTCAKPFFSEPGLGEPEGCEFLKKHFYLDKSNPTWNVRSFRAPGRLLAKLSLGESTACRSRFSAAVMSAIWECGYN
>pisu.Quenyaviridae.diaeretus_essigellae:GBWM01032350
SAGLLNLNSLFGECVGSKKEMHAGAAAINLNDLQEKNKAHPSAQLYKPVGKLEVIKKNKEVRTIQVESQANFMILKYYGEHFVTERDDVPSGRAIGLSTLRGDFKQIVFTWYKIWLHYNNGSWNKFLDWLDKAPISMSDKTAWESSTNAVDGSIYVLRFLSSFTTQRNPNADRVLARALADYMNPAIQIDKDKVYFAPWRIASGSYLTADSNTRRHMSMTDWICDFLETHGNEFLARDECTCLGCRCLRDNPDIIGKYQRMDIDLLRHSFKLGDDDISINPHPFAWDKFMDAVFGTTTKSKETKFFSTPGMFKPEGAEFLKKHFYLDKSSGTWNVRTFRAPARLLAKLCKGSSNTDNARFYVACQSALWECGYN
>pisu.Quenyaviridae.ceraphron_sp_ad2014:GBVD01020893
SSGLLGVNTLYGEPMGTKKQMQASATSVNFYESLRKVGAPPCLPYNPAGKSEVLKKGKKVRTIQVESQANFMVLKHVWEPLINERDAPSGQAIGLSSLNGDFKQMVFTWFKLWSHYHPGDYHDFLEWLSKQDISESDKTKWESSTNEADGFVFVSSLMIETDFSAFDGGDKRLLGRALADYVNPAIQIDANKVFYAPWRIASGSYFTAKGNTYRHRSMNWYVCDFLECHSFSVPLAGCRCKGCVYMRDNDMKVDTCVEMVTYLRHSFVMGDDFIGICSFPEEHSKLIDFFFGTETKVEKKTFFSADADMREPKGAEFLKKHFMLDTTVYPWNVRTFRAPTRLLAKLCLGSSTVSLSRFKAATLSAIWECGAN
>pisu.Quenyaviridae.tomicus_yunnanensis:GFJU01178533
SAGLMPLSVLDGVTTGTKKEMHEAAVAVNLQDKLGTHAPPFNVFKPALKAEVIKRGKNVRSILLESQPNYMVLKHYFGPLASLWSDIGGGIAIGMSNRRGDFKTIPFSWWQEKGGSYVDFIDWLEQQNGHESDKSSWEASTNVTDGVVFLIDLLLRVPIHQGDKSMVVRALADYICPHVQYDIGKVYRTNWRVPSGSYLTSYGNSRRHHAMAKWVIRWLAQHAAAGDDECDCAVCRVGQIERPHGWGEKLNDEQLRILDRFYVMGDDFIALNPSPYVFDWVLDLVFGTTTKTVVKPFFSTPGLLEPQGCEFLRRHFCLDRTGSLPTIRTFRESGRVLGKLFNGGHRHSRERFLAAVDSALNDCGAN
>pisu.Quenyaviridae.cephus_spinipes:GBVI01014741
ASGLIALSMAGGRTYGTKNEMQAGASAVNNYDRVGQHTPTAQPFKVAFKNEVIKKGKKCRVILLEGQANYMVGKHYFEPLAAAWKDPAGKFAIGMSSRGGDFKNIFYRLWDLKEELHHNAFIEWLEGLAWHESDKASWESSTNENDGIVYLWELLCRVNIADEDTRMVARFFADYFFPVVQYDVRRGFQARWRIPSGSYLTAHGNSRRHDIMAGYVVRYFASHGAAGSNVCECRICEAGATKDLKGWGEPVTVLDLKLMEKRKIMGDDYLAPNRHGVVYDWVMDYVFGTTTKTVMKPAFCEPGLSEPDGVEFMRRHFTVDKTHDVWQIRPFKSAGRALGKLFHGEHRKSRETFLAAVDSAIAECGFN
>pisu.Quenyaviridae.clastoptera_arizonana:GEDC01010646
SCGNLPLNTIFGSAVGSKKEMLAGAVATNVAALAWGGVPGCLPYKPVGKDEVIKKTKKVRTIMIESQANFIVLRHYFSKLVSDSRLTSRGRAIGLSSVGGSFKILLMRXXXXXXXXXXXXXXXXXXXXCDQPMNESDKTSWESSTNETDGLAYVLGLLMSINIEGDDRLLARALADYVNPAILVDQSVIYHAPWRVPSGSYLTAHGNTERHWLMFDRLLDWVEVHGGLGKTGCDCNICQKACGIDGFGEAMDDFQLQLLRAFFVMGDDFAGFGYGASVLNKLIDLVFGTITKGEVKTFFSTPSLEEPMGMEFLKKHFYLDKTVNPWNVRCFRAPVRLLAKLRHGRHRLTKPKFKAALLSAIWDCGAN
>pisu.Quenyaviridae.bemisia_tabaci:GAUC01011130
PAIGISSLSGGFKQIPMRWYRIWLRYHQGSWNDFLEWLSKRNINESDKESWESTTYELDGFIYLLSLMPYLGELDNTSHLLLSRALADYCNPAIQLDSDKVVFASWGIASGSYLTAHGNTYRHKIMADGVCDFIQLHGGYGKLGCACRTCAHFGWDTTPGPTVDAEQMAFLRAYFVMGDDFIGLCEHPLLFNRLLDF
>pisu.Quenyaviridae.epiophlebia_superstes:GAVW02027760
FFFAPWRIASGSYLTAKGNTRRHRQMNDWVCDYLEHHQQYGRSNCSCTICREIGEGPRVTQLEIDLMRAAFIMGDDYIAICNQPELFNRIIDFKFGTTTKTERKKFFSTLDDPGAEFLKKRFFLDDKTTNCHSVRSFRDSERVLAKLYHGGAAQTSATFYAALMSAIWESGYN
>pisu.Quenyaviridae.diceroprocta_semicincta:GGPH01203597
XTCDESDKTGWESSTNMLDGLIYFIELLVSRHDIIDTESRRLLARALADCFNPPVQIDSDLVYFAPWRVCSGSYLTAHGNTKRHLVMVKFVCDLVELHDYRFGRDDCGCRWCKRLSSVPGFGAIVTPFDLDMLRSCIILGD
>pisu.Quenyaviridae.ponana_quadralaba:GCZF01072004
SSGLLPLSSIHGVVQGTKKQMHNAAVAVNTASRVWGGPPGCLPYKPVSKDEVIKVGKKVRTIMIESQSNVIVLRHFFAPLVTKEKNIVRGRGIGLSSVGGSFKILFMRWFEVWCRYNDGGWNEFLTFMSAQMMSESDKTQWESSTNETDGFIYLLSILMQVVLPEDCELLARALADYANPAVQIDGDLVYFAPWRVPSGSYLTAHGNTERHYGMADFVVEYFKSHGGAGDEDCNCRMCESVRHIEGFGQKFSDMELELLGCFCVMGDDFIGFSYGSKVFNALLDNFFGTTTKEIVRPCFGYGGAEEPDSCEFLRKHFCLDKEHATWNVRLFRAPHRVLAKLYHGRSRVKKERFKAALLSAVWDVGDN
>pisu.Quenyaviridae.corixa_punctata:GDDR01018688
SAGLLPLNTLFGSAVGSKKEMQSASASVNWQDAHMSTPGCLPAKPVGKDEVIKLNKKVRTIMIESEANFSVLKHFFEDQVSVEKDIITGSAIGLSTLNGGYKQIVMLWYWVYLQHHPNTPWNLFLEWLSRQTIDESDKTEWESSTNMIDGLVYVVGLLASRSTVTDKISRRLLARALSDYINPPVQIDEDLVYFAPWRVCSGSYFTAHGNTRRHKSMANFVCDYVELHGNQFGELDCTCPLCAKFQDLPGFATRITDLDLDLLRHSVILGDDYIAINEYEIFGHILDKQFGTTTKTSFKKFFSTPSLEEPEGCEFLKKHFYLDQTFSTWNVRSFRAPGRLLAKLYHGRSTSTTNAFKAALLSAIWECGHN
>pisu.Quenyaviridae.plea_minutissima:GDES01014899
SAGLLPLSTLFGSATGSKKEMQAGAAAVNWQDSMSETPGCLPAKPVSKSEVIKRGKKVRTVMIESEANFLVLKHFFEEQVSKSKDIVGGSAIGLSSMNGGFKTIVMLWYWVFLKFHPNVPWNDFLDWLSKQTIDESDKTAWESSTNEIDGLVYVIELLVSRSDIKDPPSVRLLGRALADYINPPVQIDSDLVYFAPWRVLSGSYFTAHGNTKRHRYMVQFVCDLVELHDYKYGDLNCSCKWCKKLAKVPGFGEKTSEFELALLRHSVILGDDFIGINNKQAFGHFIDNIFGTKSKTHFKPFFSQPGLFEPDGAEFLKKHFYLDKTFTTWNVRTFRAPGRLLAKLFKGQTIVDKNRFKAALLSAVWECGAN
>pisu.Quenyaviridae.jasminum_sambac:GHOY01071622
SAGLLNLNTLFGSAVGSKKEMHCSATAINIIAASEVNLNPGCLPYKPVLKGEVKRRFEEDGSHKKNRVIQVESQPNTIVWKHYFEDLISRDRNIGGGIGIGISSVGGGMKQVFLNWYRIWLGNNRGNWNDFLTWLGSRTINESDKTSWESSTNEEDGFCYLVYLLMSLGEIPKDPFSRMLLTRAISDYINPAIQIDSDRCVFAPWRTPSGSFPTAHGGSFRHKLMQNAVCDYLEVHGGYGKKGCHCKICSHFHYDTEPGPRLSKFEIKLMRASIVMGDDSLSVDPYPHLFNAILDFRFGTTTKTVPKLLFSEPGLQEPTGAEFLRKHFFLDKEFNTWNIRSFRAPKRLLAKLYHGEAAVQPSRAHAALLSAIWECGAN
>pisu.Quenyaviridae.harmonia_axyridis:GHJE01057056
XYQILSIVSDMENCDMLDFGNAIGMGGQNKAFNLMFVLWYSVYFRHCKGTWSEFLDYIEEKGAHESDKQSWEATTSINEGLPLLIMACGNKQFETSGDRRLFVRSMADTFNPFIYIDKGGFFAPYRVPSGTAVTSKYNTERHESMIQWVVDYVSVHGNRLGSEGCDCNKCQSMIEHEAFGTPISPMDLELKRYAMKLGDDFLAISWGEQEDDFFDRLMDLTFGTITVSEWKPFFNCGEFLRKRFRRNCDDSVTWYRDPQRLLAKLYHGEALQDNQKLAEALTSYKLEAGDN
>pisu.Quenyaviridae.megastigmus_spermotrophus:GCPB01083611
IVPVRPIPGVCHKVAGNKEDMRSSAISVNIRDRITNSKPAFHPVSPFVKVEAVKVDKCARGIQNESLSNYIIMKMADIGYEDIVNGNAIGMGSKINGYHQIFFRWFLEWKDITGGMWGEFLDYLGEVGAHECDKVAWEASTGITDGLPVVCCDLWLKRPKTPGDRRLYLRAECDATVPFIYLDRCGFFAPWRVPSGILLTSQRNTRRHRGMNNWMVHFIKSHGMRLGKAGCDCFVCIAIPGVGLEVSRIGLALRKLAFIQGDDYLAVSLGKEQDALFDVVIDFVFGTTTKTEFKTFFCPGGCEFLRKSFVRKGDDLFYYRQEERLIAKFYHGSAIVNPSTALAALISMRYEAGYN
>pisu.Quenyaviridae.cerianthus_borealis:HAGY01027881
GSGPINVSPNFGAQRGSKSDLQAAAAKAAMEPVLHCHGRQPKDETLKRGKKLRFVIPAGYSKQMLDVHFFGDKIYRHNTIVDGGAVGISRVKGNGLKPLIKIYLVWAKNTEELTGDTPSWEDFIEITSGQRVDPNGLITNKANEMDCQAWEFTLNESSVVHEVYDLLECVNVRNLGEQSRRLLANTLAHTQVWFLKIEGHTGCFVPFKKASGELNTAGGNTARHRGGQNGAMAIYQRHGWATACAEDKCWICKKTRLKREQFPIGVMSMSYIGSTLYGDDDYAPDTPASGFFAKAMDAYIGTKTISEPKPYFSTEEVDGAKFLQVELAIDECGKVVTKRARERVIAKLVKTPRARPAMLAAIVSACYEVGHD
>pisu.Quenyaviridae.jiangan_virus:MN371233
ICSVKAVAGMGRTRGTKGDMHPSSVSLNIADRISNDIPLSLPYKPFLKEEVLKVGKPVRGIQNESAANYQILSIVDQAQKFKGHLGNAIGMGSKEFNFSAIFVVWYTVFLRFQEYGTWTEFLDYITIRGAHESDKTSWESTTGINDGLPTAIVDLSFKDFATPGDRKLYVRAVADVYNPFIYVNKGGFFAPWRVASGTERTSSNNTDRHQLMVHYCVSWVVSHGGSLGSELCKCDACNHLREHEAFGWQISEMDIELKKHAFILGDDFFAVSWGYYEDSFFDMLMDYTFGTITKTEAKSMWCDGEFLRKKFIRNGDGSITWYRDHERMLGKLLYGSHLLNEQRLHAALISYKYEAGDN
>pisu.Quenyaviridae.qiaokou_virus:MN371234
IVPVRPVHGVTGVVGGTKEKMLPAAVSVNLADKKQCGLPPNIPIKPFLKEEVIKVGKEVRGIQNESLCDYLALSPLLSKQRLFRWGWAIGMGSQVNGFQNIFFTWYVHYRRRYRNVQWSEFLDLLEEWGAHESDKKGWETSTNANGGLPVALLDFCVAKPSTEGDKRVFLRAYSNYVNPQIYVDKCCFNVPWRVPSGTVFTSEKNTARHSGKVQWVVNFIRRHGLRIGAAGCSCDVCRGLDLQGTEVDERRLHLRLVPFILGDDFLAVSLGKQQDAEFDRVMDFVFGTETVTEWKPMFGCAEFLRKQFIRLPCDRIAVRRSEERVLAKLYHGSFLSTPERLAAALLSYKLEAGYN
>pisu.Quenyaviridae.hanyang_virus:MN371235
SCGSLPLNSIFGSVTGNKKEMLSSAVAVNIADNTWGGVPGCLPYKPVAKDEVLKKGKKVRTIMIESQANFMVLRHYFSKLVQDLKSVPRGQAIGLSTVRGGFKTLVMRWFYVWAKHRQGDWHDFLDWLAQQPMSESDKKSWESSTNEADGLAFLLGQLLQCKVDGPGQVLLARALADYANPAVQIDDQLVYHAPWRVPSGSYFTAKGNTGRHHLMARWVVQWFREHGGAGRIGCDCNVCQAVSGEPGFGETLDSLQLDLLDSFFVMGDDFAGFGYGARVFDRVIDFVFGTTTVGCVKPFFSTPSLNVEPEGLEFLKKHFCLDKSVEPWNVRMFRAPVRVLAKLRHGRARLTKPSFKAALLSALWDCGAN
>pisu.Quenyaviridae.wuchang_virus:MN371236
SAGLMPLVASGSVTGNKREMHNAGTLINMRDYDAFLKDARPPQLLPYKPALKQEVIKKGKKVRCIMVESLPNFQVHKLAYSHLITESKHYVKGEAIGISTAQGGFNIILCNWFEIYTKYHDGDWNDFLEWLTQQVFSESDKTSWEATTGPTDGFPALLWAILTLPRPPHSERATYALISRALADLFNPHIIVDGPLAYTAPWKVPSGSYWTAKSNTKRHKLMADVTCMFFERHGAWGRIGCGCNVCLAASGTPGFGQTITQEEFDFMRASWVMGDDFISIIKYPKIFDIAIDTLFGTITKSNEKSFFDVEFLRRKFFIDRSHPTFVVHTYRDPERLLAKLFKGRARRNVSSFLAAIHSAVWDAGRC
>pisu.Quenyaviridae.qingshan_virus:MN371237
AAGAINIDGSYGQTRGSKVDLHSSAVYAAMRTHPLLNHEPAMKGKEVLKRGKKVRQILVEPMPIYLYGLWFVSHLVKGKGSLPIGMARGLSKKEGDFYRILWSFYLNERVISDISFSDFLEKLESEGLREDDKKNYESTKNSVTKVVYSLFMCFMVTPPLQDRSKYASYRAHVHLPAIKVSGKTCYYAYNVTGSGNVETENENTITHLLGYLRLDLNIEKHGGKFGRCGCKLCEVFTGFFSFEDRLKILSLALLGDDLLGRAVDVHVSALVDVLMGTTTVGDIKPIYGPGGAEFLRSRFTRDGFVVRELVRVLAKLRWGVARYNRLHFKAALTSASLEMGPD
>pisu.Quenyaviridae.hongshan_virus:MN371239
SSGAVNLDPGYGKVGGGKYQMQSAAAKAAAAGCPLLNYEAHPKGREVIKKGKKVRQIVCEPYPIYLKNVAMFGPAIKRHGDVVDGECHGMARMSGGYLAPLIKFYLDEKAFNNQLTFDEFLNQLEVDGLQEDDKQAFEKSINKWSALVYMILMCARVTVSKKNVDDMAQVLAHWAWVPIKYKGHKCYYAYYVVASGNFWTLKGNTERHIAGYQKLSLDIQRHGCALGGSCSCYLCDGIDGVRVITEKELSYIVGGVIVGDDLLRRHVDLPVAKWIDHILGTKTLGGRKNAFFEQGDAAEFLRVSYGKDLRTFRDCKRVFAKLRWGMAKCNDEDFLCALQSASLELGPN
>pisu.Quenyaviridae.ruian_virus:MN371241
GSGPINVSPNFGAQRGSKSDLQAAAAKAAMEPVLHCHGRQPKDETLKRGKKLRFVIPASYSKQMLDVHFFGNKIYRHNTIVDGGAVGISRVKGNALKPLMKIYLVWAKNTEELTGETPSWEDFIEITSGQREDPNGLITDKANEMDCQAWEFTLNESSVVHEVYDLLECVNIKNLGDQSKKLLANTLAHTQVWFLKIEGHTGCFVTFKKASGELNTAGGNTARHRGGQNGAMAIYQRHGWATACSEDKCWICKKTRLKREQFPIGVLSMSYMGSTLYGDDDYAPDTPASGFFAKAMDAYIGTRTISEPKPYFATEEVDGAKFLQVELAIDECGKVVTKRARERVIAKLVKTPRARPAMLAAIVSACFEVGHD
>pisu.Quenyaviridae.zhanggezhuang_virus:MN371242
KDASVGIDLPGNPQGTNKRENGAFALSQFRDQCLSGPNETPYVYDTQTKGNETLKKTKDQRKIFCESGTQLFVFLMCFAGIIYRKRNLFNGAAQSLSQGGSTGLQYIMKLTPELPWNRIRENPLINDKDLLQNARDILSKSDALSESDKAAWEYRIHPILKAILMASVLFMTDFTGHEIYYVPLISAIASFLCPLIALSGDLVIVAPNAMPSGSLFTLYGNTDMHHLLTIAFKHLKQRIAFETNNDVLESKVQTWYKSILLQGDDFISRNIFGDEYDKYIDARFGTQTKAAFGSAETCKFLQRSIRWRGKVPQLVYDVNRAKIKMCMPRESPIDQADAFKSILLSTGDPSIIPHVR
>pisu.Quenyaviridae.daxi_virus:MN371243
RDSSAGYDPPGGARGSNKKENAAAAXGGVMQAITRPLPPLVYATQSKGNEFLPIEKDQRKIFCESTLQNVTLQMVFGQVVYRPRNLYNGSAQNLSMGGNTGLQFVQLLTPELEWGKKGTLQEAKKILGGADSCAESDKKSWEYLVSFNSKLQVSAELLLVTEFEKYMYQPLEASLASFLCPLVATVGNRVIAAPNFMPSGSFLTLFGNTELHRSQVVAFRTFVIKNNYKLPKLDGIRYQERISADPMDTTPEMLDIWFRSVLLQGDDFISRNIYGSYYDKWVDLVFGTKTKGGSGDYTQVRFLQRHVNWEGPLPRLVYDQKRAAIKICAPRSDMADLASAIKSIALSVGPHPWLPTLK
>pisu.Quenyaviridae.dongxihu_virus:MN371245
IIPVKPVPGIMGRLTGKKGTMHTSAYAVNLANRLNDSIPRSIPYKVFLKGEVIKKGKPVRGIQNESLASYEILHLSEPERQEVRLGNAIGMGGSNKTFASIFVVWYVVYQEETGKSWHDFVEFLADSGAHESDKVGWEASTNMTDCLPRLIVLLNEKKFKTPDCKKLYVAAVTDSHFPFVYLHKEQFQCPARVPSGTYFTSKGNTARHRSMNDYGCNYVYQHENKLGLKGCKCYGCELMRGHEAFGTSISSLLLKLRRKAFILGDDYLSVSWGIVCDDFFDTLMDRTFGTETKTEKKAFFDDAEFLRKSFKRGEENTITWYRKPERVLAKIYHGDFLCSPQRFAEALTSYKYEAGDN
>pisu.Quenyaviridae.jieyang_virus:MN371246
SSGAINLDPFCGAVGGKKFDLQASASLAAEAGTPMLNGEAHSKGEVIKAGKKVRQIICQAFPIYLKAAHYLEHKVNRKKNLCEGRASGIVFKGGGMYEILRSLWLDELEFQDIEFEDFLDLLEAEGLDESDIKAFEANINDATGIVYLAELVASFKNLCAAAIPDVAQVLCDYAAMPVKYDGHKCYYAFLTVGSGQYTTEDGNSRRQLIGYQRASWKIQLHERECQGCPVCPYADCSWHDLAILVGGAVLGDDRIMRHFPRDIGPLMKHCLGCEVVSEHKKAFAGNGSSAEFLRRGIDREGNLYREPIRLLAKLYHGDAKRDGRTFGAAITSASIEMGPQ
>pisu.Quenyaviridae.bawangfen_virus:MN371248
SAGLLPLNTLFGSAVGNKKEMQASSAVVNWQDGHSELPGCLPAKPVGKDEVIKKGKKVRTIMIESEANFTVLKHFFEDHISQSKDIVGGSAIGLSTVGGGFKSIIMLWFAVFLRHCPNTSWNDFLLWLGRQDIDESDKQSWESSTSFTDGLVYVIELLVGRGPIQGLVAQRLLARSLADYLNPAIQIDDDKIYFAKGRILSGGYPTAHGNTRRHKMYADYICDYVEVHDNCYGQLSCACADCAVFRHLPGFGSRVTDMELDLLRARVILGDDFLAVNGNPCFGLFMDSKFGTVTKTNFKKFFSEPSLEEPLGCEFLKKHFFLDKTFRTWNVRTFRAPGRLLAKLFKGRAAASQDTFKAALLSAIWECGYN
>pisu.Quenyaviridae.zhuyu_virus:MN371249
RDSSAGYDPPGGARGSNKKENAAAAIGGVMQAITRPLPPLVYATQSKGNEFLPIEKDQRKIFCESTLQNVTLQMVFGQVVYRPRNLYNGSAQNLSMGGNTGLQFVQLLTPELEWGKKGTLQEAKKILGGADSCAESDKKSWEYLVSFNSKLQVSAELLLVTEFEKYMYQPLEASLASFLCPLVAMVGNRVIAAPNFMPSGSFLTLFGNTELHRAQVVAFRTFVIKNNYKLPKLDGIRYQERISADPMDTTPEMLDIWFRSVLLQGDDFISRNIYGSYYDKWVDLVFGTKTKGGSGDYTQVRFLQRHVNWEGPLPRLVYDQKRAAIKICAPRNDMADLASAIKSIALSVGPHPWLPTLK
>pisu.Quenyaviridae.xiangtouao_virus:MN371250
IAMIPVSSTSTGNFSEVVGNKKENGVHVACSIVAALQSDQPPAFSPTAPSPKLEAIKKGKKGRVIQVTDPAIDKASDACFSSYRMPRGLEFGTAIGVPFNKTFGERLVARLTRTCGEDYLVKHGLHESDKKAWESTTKPNSAIIYICTLLCLARSLVGWMPVAARVLADYYFPLFAAQGNYYGGRPGVVSSGNKFTAAGNTFRHRTFIISFTAFVETHEGRAGSEACSCTQCTYALAHGFELDREVSEFELGVLNDAVLMGDDFLAVWTDASPIYDEYCDIFHGTVTKTDRKPWDEGEFCKRRLMRTETKSGYFYSTTRKLGTVIYKFHGPRAIQRASKAASCKSIAVDCNDK
>pisu.Quenyaviridae.zhupengkeng_virus:MN371251
DLVSLSVQRVERKGVGNFNVGGRMKKTEVAVESAMAVVANNQSETPVCFPVKPSVKQEVVKVGKSPRTIYNVASTEFLALKSLEAIYKEKRGVAAGTASGEPIEGAWAEKLARVITHGLAEDPIDALEDSGIHFSDKTTYERYTIVQTAIVYLCILLVRAGSDIMNFKGSLAAAFANYVYPYVSLRGDLGFRCEGAVPSGSNLTSHGNTHRHRQNVNIFTSHVEMHDGVIGLHDCCKLCQLMCDSGFDLVYYDESDVHRLRQCVLMGDDFIACYGRLAKVYDVVGDKWFGSKTKAGERQCAFDNDEAAFLQRVFMKDDCGNLTTRAVRKRDLGKSLGPTEKRVATELARSVSCALNSNDPVLYAY
>pisu.Quenyaviridae.xuejiaotou_virus:MN371252
RDSSAGYDPPGGARGSNKKENAAAAIGGVMQAITRPLPPLVYATQSKGNEFLPITKDQRKIFCESTLQNVTLQMVFGQVVYRPRNLYNGSAQNLSMGGNTGLQFVQLLTPELEWGKKGTLQEAKQILGEPDSCAESDKKYWEYLVSFSSKLLVSAELLLMTEFEKYMYQPLEASLASFLCPLVATVGNRVIAAPNFMPSGSFLTLFGNTELHRSQVVAFRTFVIKNNYKLPRLDGIRYGVRITAEPMETTPEKLDIWFKSVLLQGDDFISRNIFGDYYDKWVDLVFGTKTKGGSGDYTQVRFLQRHVNWEGPLPRLVYDQKRAAIKICAPRSDMADLASAIKSIALSVGPHPWLPILK
>pisu.Quenyaviridae.zhongzhu_virus:MN371253
LAKLKVENTSTGNFAKDPKKKKDCAIEVAGVVSSMAQTGRAKCGPNKPNIKLEVIKDKDSQGNDKKARATQSDDAVNALAAEACFGTFIRLSPGVEGGSAIKVPIHGGYGTRMFRHLTRPISSSLKVAECEALRRGLHESDKKSWEATTKPETSYAYSICAISAVDDLSEAGAVAANVLAHYHHPFFSITGDQACSKPGVTASGSKPTAHGNTMRHGFMLGLFKMYVLAHGGLGSVGCTCKDCEVLKDHEDFGKPLDTVELHLLLQAVLMGDDFICVWNDASSFYDYYCDRVFGTVTESGREDFTEGAFLKRRMKKNRGFWTTYVHEDQAIPKLAGPKGFQAESKMAQCINIAVNSNNK
>pisu.Quenyaviridae.laitoushanzui_virus:MN371254
LAKIKVNDTSAGNFSSSVIKKKLCGVEVAGIVARQAQIGRAKTAPAKPCQKLEVIKKVNSRGVAKKVRAIQSDSADNSLASEAVLGSFVRLSRGVEGGSAIGVPTNKAFGLRLFTQLTSPFSNCLTYREEQAVERGIHESDMQCWEATTKQETALAYTVLAISAVDDLSQAGAVAANVLANYYHPFFSISGEQACSKRGVTSSGSKPTAHGNTVRHRFMLSLFKLYVRRHGFSLGLAGCSCKVCKKMQGHPDFGKKVDYVELALIICAILMGDDFEALWTECAAFYDNYCDEVFGTVHTTERKGLWEGAFLKRRLKKVRGWFTTHVPHEYVIPKFKGPGMMTAENKMCACINAAFNSNNK
```

In [None]:
### NW Extension + Trim of New Seq =====================

# manual.seq.additions.fa
mkdir trim 
# Perform semi-global extension of potential hits
aws s3 cp s3://serratus-public/bin/usearch12_trim ./
chmod 755 usearch12_trim

ln -s ../rdrp1_BETA.fa
# NW Semi-Global Extension

function utrim {
    ORFFA=$1
    RDRP=$2
    RDRPFA=$3
    
    ./usearch12_trim -usearch_global $ORFFA \
        -id 0.01 \
        -fulldp \
        -maxaccepts 8 \
        -maxrejects 32 \
        -top_hit_only \
        -db $RDRP \
        -userfields query+target+id+qtrimlo+qtrimhi \
        -userout $RDRPFA.tsv \
        -trimout trimmed_output.fa

    usearch -sortbylength trimmed_output.fa \
       -fastaout $RDRPFA
}

utrim quenya.orf.fa rdrp1_BETA.fa quenya.rdrp.fa
utrim epsy.orf.fa rdrp1_BETA.fa epsy.rdrp.fa


mv results.tsv $RDRPFA.trims
rm trimmed_output.fa

In [None]:
cat rdrp1_BETA.fa epsy.rdrp.fa quenya.rdrp.fa deltavirus.dag.fa \
  | seqkit sort -l - \
  | seqkit rmdup - \
  > rdrp1_EPSY.fa
  
# Cluster rdrp0 and genbank_rdrp0_matches at 45%
uclust rdrp1_EPSY.fa rdrp1 45

seqkit sort -n rdrp1_EPSY.fa > tmp.fa
mv tmp.fa rdrp1_EPSY.fa

#      Seqs  14679 (14.7k)
#  Clusters  4624
#  Max size  275
#  Avg size  3.2
#  Min size  1
#Singletons  2916, 19.9% of seqs, 63.1% of clusters

cp rdrp1_EPSY.fa ../../../rdrp1/
cp -r id45 ../../../rdrp1/

In [None]:
wget https://serratus-public.s3.amazonaws.com/rce/motifator/bin/motifator ./ ; chmod 755 motifator
wget https://serratus-public.s3.amazonaws.com/rce/motifator/model/rdrp_model.txt ./

mkdir -p motif
./motifator -search_rdrp rdrp1_ZETA.fa -model rdrp_model.txt \
  -tsvout motif/rdrp1.tsv \
  -report motif/rdrp1.txt \
  -fevout motif/rdrp1.fev

These sequences are huge outliers in teh dataset, looking at INTERPRO for them show they have an RdRp domain and it's only ~200aa long. Sequences removed from rdrp1_epsy, then utrimmed based on rdrp1_epsy(minus outliers) and added back in to make ZETA

```
>rdrp.unc0007.wolf_4584:yaOV314orf2534
AVACRGPKPVAAAVLRTLSEPEVLAALREVYPGPMQKAESVARACLTNKDGVSRCSPAQPLLEKCIEACAALPQAARATK
ELACHYAALLGVFATTRRCGAVDDLLGWLLEWPRTKAACSRVQLAFKVHRLFSPGPYGALTRLIRRRIYQALLERPVRVW
RKSCHLVRACSALVKFLQTDIPCNNLAQSTVFAPIAASGHLTSLLTAFTSPGSRLLTSFTPMWLAVFGVEVVCTFIRVRT
RSWLKEKGWRVPRWLTRATTLVNAGYFAWAAATFTREQTKLLKASGYWGGDVLSTLTRIAAARDRDPVASVTAGLVSMAF
LALASVGAEEAIKWLAIAADGPLGVLLDELLFRMACLESRRAIGMPFRARSALHSLATSGIELAPLNSFLQHASNNLLTA
AGSGAAASSWNVLLNVAVFGGTRACFDCAADLVTRAVVNSTLPPGMRWLNPAAEDAGQPDFQVVDGVQHALTTDQWHRPL
VSGLLNIRDSLVAQRRWNTGACAGGSASLQSWLWTASAGVIVKSREAFDAAGNIVSKELRANVSGILDNFLVPQGAFGLG
MACLAVVAANKLLTETRLTRELVEAVEVVVVAAHPPKREATPGEFRLAHGAPAVADERSFGNAGGYCIDPTGVRETIGNF
VMLPVSTSQTAAIHRASSMPFLT
>rdrp.unc0015.wolf_4592:yaOV322orf3816
ECEVDVDFCIVWTDAAPLNVAGFEFSLKIWVSKAGETNGQTEYSIPFMPRVATDDVHQVAAHHGVFRLREELSRVAAGAD
AVAPIVGFEVRCLNPVSPTAIGSLSMDNLRSIRWRRMNRYQQRNIAAFIAQAPATDTQKWSVTLNFKYQFELTAAELARF
GSGGMVPGHLSRQYLALSQLHAEHGIDYVAPAQTLGFSLVGDALKYLKKAPASLAKALPRIAGRALQDAGKSALNELIGF
SDYDIPRNAMWPRMPDAPAYQEGQAVVDHNYFGVGRTVDLTDGMHYLSAPIYIQAAKMSYDEKKQCFHVTNNTGRVLVDR
FYVSGCRIVSACECRSVSFTDKVRVSKAAGVDFTTGRSTIGQWGIDPTNVVGEGDHPVNLAGVQVYYRRVVGSGDSFSVV
SQPETNWRAPFEFATQMYEATPTNLRNIWSLPPTGSIIVFTTKAMYNDGDTSHSAALLNLLWNGDPNAGVSGVATLSTEM
GPIWPKPMPVGNVTSKTRKNKVFSWVAIPSQVPHQTEGYVRTYSTFSELFALTHGGRSLRSPPSRQQQPSHGQKQAEKQK
ARKRPRPAKAPVPAATDLTSQV
>rdrp.unc0009.wolf_4586:yaOV316orf3967
TCNHHPVTWDRTGQPGVVCPGCQAARQVNFTYPLVTTGPLYQPETSPFSYLSALMNRWMKDPTGRRTSPEDEAEFLDEVR
ARLSPAAKLVKELFDRHPRSTRLPSVEECARLMPIQKGNLLISAHERAKLHGLAGRTKGQVKGNETIKASSFDFDSNTVV
VKPRLICAVDPQVQAEILPNSRVMTEEAKDCFDGLEVFRVGRFNVRILFGDQTQKSMERIIEMAKTPEYCISLSCDDTIV
STGQYSNIFKTDGWDTDYEGFDQSQNSAFWDKDEEVLAHHGREFWALQRAVNRMAVKYSMTLPNGFEVIITALLREAMMT
GVGTTYILGCFHNLYAQIAYALCVSELLDSSEDAVIPTFDQFAPRLGLASKTAYSPCGRSIDGLCFLKHAFWQNGYTVLP
SMVCKLGKSMKAPVEMMAKAGIKANVAEATQLVYKSILLSVQVPADYPIFGALLAKASERTRAALSRTAQEAVARIAKDP
LFIEDAQYKQRMCTLSRSQVLEFMLVRYGISEAEVSEVEHLISQIRCYPVHVAHPVFERLRAVDYGV
>rdrp.unc0004.wolf_4581:yaOV311orf2997
PRLELPRHHVDDDGPSMVERVGIAMQTPAATIAELRSIVAPPPLRRQVAGSIITRRRPFWKDFILYGTVVSVWVVLVLQS
IAWIIRLRSELSGTHGSWTRTDDVRAYSFLLTFTTLSPGLITYRRLYRQSQVETVPEGRLAEVIVYIKGADRRNPPVREL
VDVEVRLYSEKSLFEYLTPQDVETIVRDFGVEKSMAGVRRNRCSAHPRARAARTRPYLHLQLQARHGGGEPVIFEEASLM
REVGGSNRYTFGVAFDGDRIDYGPNECATERVPTPHPDIADLLDVMVLHDGPDDLNGNNGSATNTDDHPAIVLDWADQLT
TMAEDYTPRYVTHPFGEVRCAFDVDSAGNVYYIEVEDDEWMTRNRQSCVALLRQQLLRLEILYSPRLDLVPDDTESKVDP
DAESFSRVTTPDHPPDLNQPPSTLPALLDACVLNRTRSLERRRCEASSPDGSVLPTLSAYDAGERYIPPKDVEVRSQLNG
SHGEYTNSDDVVAVAGGADADGV
>rdrp.unc0013.wolf_4590:yaOV320orf3454
LSVDDWLAKRRRKGLKKNQETRLKRAYDQLVKQAPNYYSLCRRGLFVKFERNFYSTINYDKSKAARTIFDPSPGYQIVTG
PLITNAQQQIVRDYSVHVEGELKCISFGFGSNLTDYAAWLLKVCLAFRYRPNFTNLQRIYAVEFDAESMDGSSGQPILMA
SLNSLGGFFDLASEYEFSQYILPAELFGMVTRARSKDGSVKVVMSGKMNSGKSRTTLCNSIGNRKVVFLALVELFTNEFN
VVDPIPEIEIKSVVEGDYFNLPSWNNSFIGQPAPPYKPTLRRNKVIVDWVRRWETRLRSKVFVMVNGDDGILATIDVTIR
DKFSKYLSEMAILIGWTYKTKVTVLSEGEFCSKVIYPVSPYSIGDCEVDFAWTNKIGKVLAGSGLTTAPINWNNGRTFLQ
SEKFYALRHEFSFLPLLRKFSQFSSRHRQRATTKQRRTVQKWVYEAERLNFKPMADIDHEPTEMTRQVFLDRYGYDFDIV
DDLHTMFVGENDWKL
>rdrp.unc0008.wolf_4585:yaOV315orf1854
ATAVLQPSLEPSPLLVNILTGVFRIKQKSLLALSDALEAINKYEETNLNYFKKIGTKLPERMRMLKQYMETPPVDVGQKR
HQVKMYNTIQPKPDEFQKIKKGDLKYARATVNISGMDLMHANPALMYTCKHIISTPLIIRQYGEDVHIEYDGLVEVLSNI
KIPRTIYIRSVIKETTSDKLQEVMNEMLDYGDGHDMFVGMINHGDDMLFAEKSEETVHYPRGYRFIEGDIEKNDTSYTHE
LINAIHTATFPAVPGVTPAFAQLAKDCIIDYKNSDIKAIMKNTQGMMLCSGSSLTTFLNSTGSSLIGISYGLSKSTDTLA
GAASEVGFLVTSVSGDLSDTTFLSKGFYYTQNLAVRSYKCLASICRSFGHVHGDLLGSSKTPPSKRWNEHISGVIESEKH
EPDSLFMTALRCWNSKDTASAIYNKMKLKFSTLRLGKSYTSGFTNLPEIDQAIIDRYYPGETALGYCEYIQLIEYMKSVN
DLYGNVIMSPF
>rdrp.unc0012.wolf_4589:yaOV319orf1472
IFSIDRLDQIRYQLGQLKEIQLRLKALGVEKRRWMEDLTIDGDVESNPGPFLEKWWNRQCLLCTECTCAVSYKGYITKEY
CHPGFACAFIVYPSIEEICPEDQMEFERDPSRYIAVTRGAIGRYARIDEDLWAMEHPVYENYYQWLPNELRFSSFSGSAF
VDSCERFEIPREEDSPMTYILWDRGAYHALEHGFLIGEESELTIARYRLRMGDIYEYAGEKPWKAWYNFFRAFKVDYDLE
NHTIVPIVEERVSLHFMCNEQITEGHFLAKQRFGMNLKNVLLAIRQVWLRDLTEDGDIESNPGPWSELWHTFRYVVEKMK
HAGYPDYDSRTRIFLSIFRVDSVDDFCGCENKCHHGHSWQYYFQKVNAFMDECCAPSEHGSESHSFHSKKEESVYSVPSD
EKKKPPSAHSEGRKFNGKASRTLNIVRTALSNR
```


- Add SARS-CoV-2 (trimmed)

```
>pisu.Coronaviridae.sars_cov_2:YP_009725307
DFAVSKGFFKEGSSVELKHFFFAQDGNAAISDYDYYRYNLPTMCDIRQLLFVVEVVDKYFDCYDGGCINANQVIVNNLDK
SAGFPFNKWGKARLYYDSMSYEDQDALFAYTKRNVIPTITQMNLKYAISAKNRARTVAGVSICSTMTNRQFHQKLLKSIA
ATRGATVVIGTSKFYGGWHNMLKTVYSDVENPHLMGWDYPKCDRAMPNMLRIMASLVLARKHTTCCSLSHRFYRLANECA
QVLSEMVMCGGSLYVKPGGTSSGDATTAYANSVFNICQAVTANVNALLSTDGNKIADKYVRNLQHRLYECLYRNRDVDTD
FVNEFYAYLRKHFSMMILSDDAVVCFNSTYASQGLVASIKNFKSVLYYQNNVFMSEAKCWTETDLTKGPHEFCSQHTMLV
KQGDDYVYLPYPDPSRILGAGCFVDDIVKTDGTLMIERFVSLAIDAYPLTKHPNQEYADVFHLYLQYIRKLHDELTGHML
DMYSVMLTNDNTSRYWEPEFYEAMYTPHTVL
```

### Revision 4 / RefSeq RdRp Set / Testing Diamond

Create the "High Confidence" set of RdRp from RefSeq. Note it looks like this is already implemented as a sub-set of wolf18

Run Diamond with different settings to measure performance

In [None]:
## Set-up Folder
# RefSeq sequence extraction
mkdir rdrp0; cd rdrp0

# Download rdrp0 work folder
aws s3 sync s3://serratus-public/notebook/201226_rdrp0/ ./

# RefSeq (rs) folder
mkdir -p rs; cd rs


## Set-up Database
GENOME='rdrp0'
cp ../rev1/rdrp0_r1.fa ./rdrp0.fa
diamond makedb --in $GENOME.fa -d $GENOME

## Set-up Query
# Query is Viral GenBank v240
# Extract out Viral RefSeq (NC_XXX)
QUERY='vrs240.fa'

wget https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.protein.faa.gz
wget https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.2.protein.faa.gz
gzip -d *.gz

cat viral.1.protein.faa viral.2.protein.faa > $QUERY
rm viral*faa

## Run Diamond -- DEFAULT MODE
QUERY='vrs240.fa'
GENOME='rdrp0'
OUTNAME='vrs240_r0'

## test
# head -n 100000 $IN > tmp.fa
# IN='tmp.fa'
# OUTNAME='tmp.pro'

# Diamond blastx alignment
time cat $QUERY |\
diamond blastp \
  -d "$GENOME".dmnd \
  --unal 0 \
  --masking 0 \
  -k 1 \
  -p 16 \
  -b 1 \
  -f 6 qseqid qstart qend qlen qstrand \
       sseqid sstart send slen \
       pident evalue cigar \
       qseq \
  > "$OUTNAME".pro

# real    0m3.656s
# user    0m41.907s
# sys     0m3.899s

# Fasta of local diamond matches
#
# query_acc query_start query_end query_len
# db_acc db_start db_end db_len
# pctid
cut -f1,2,3,4,6,7,8,9,10,13 $OUTNAME.pro \
  | sed 's/^/>/g' \
  | sed 's/\(\t\)\([A-Z]*$\)/\n\2/g' - \
  > $OUTNAME.fa

In [None]:
## Run Diamond with ULTRA-SENSITIVE mode
QUERY='vrs240.fa'
GENOME='rdrp0'
OUTNAME='vrs240_r0US'

# Diamond blastp alignment
time cat $QUERY |\
diamond blastp \
  -d "$GENOME".dmnd \
  --unal 0 \
  --masking 0 \
  --ultra-sensitive \
  -k 1 \
  -p 16 \
  -b 1 \
  -f 6 qseqid qstart qend qlen qstrand \
       sseqid sstart send slen \
       pident evalue cigar \
       qseq \
  > "$OUTNAME".pro


# Fasta of local diamond matches
cut -f1,2,3,4,6,7,8,9,10,13 $OUTNAME.pro \
  | sed 's/^/>/g' \
  | sed 's/\(\t\)\([A-Z]*$\)/\n\2/g' - \
  > $OUTNAME.fa

In [None]:
## Run Diamond with ULTRA-SENSITIVE mode with BLOSUM45
QUERY='vrs240.fa'
GENOME='rdrp0'
OUTNAME='vrs240_r0USB45'

# Diamond blastp alignment
time cat $QUERY |\
diamond blastp \
  -d "$GENOME".dmnd \
  --unal 0 \
  --masking 0 \
  --ultra-sensitive \
  --matrix BLOSUM45 \
  -k 1 \
  -p 16 \
  -b 1 \
  -f 6 qseqid qstart qend qlen qstrand \
       sseqid sstart send slen \
       pident evalue cigar \
       qseq \
  > "$OUTNAME".pro

In [None]:
## Run Diamond with ULTRA-SENSITIVE mode with BLOSUM45
QUERY='vrs240.fa'
GENOME='rdrp0'
OUTNAME='vrs240_r0USP250'

# Diamond blastp alignment
time cat $QUERY |\
diamond blastp \
  -d "$GENOME".dmnd \
  --unal 0 \
  --masking 0 \
  --ultra-sensitive \
  --matrix PAM250 \
  -k 1 \
  -p 16 \
  -b 1 \
  -f 6 qseqid qstart qend qlen qstrand \
       sseqid sstart send slen \
       pident evalue cigar \
       qseq \
  > "$OUTNAME".pro

In [None]:
### RUNTIME -----------

## DEFAULT
# real    0m3.656s
# user    0m41.907s
# sys     0m3.899s

## ULTRASENSITIVE
# real    1m39.734s
# user    24m28.415s
# sys     0m5.550s

## US-BLOSUM32
# real    1m37.864s
# user    24m5.467s
# sys     0m5.500s

## US-PAM250
# real    1m33.032s
# user    23m24.756s
# sys     0m3.548s


### HITS -----------
# 4379 vrs240_r0.pro
# 5056 vrs240_r0US.pro
# 4919 vrs240_r0USB45.pro
# 5359 vrs240_r0USP250.pro

### EDGECASES -----------
grep "YP_009480680.1" vrs240_r0USP250.pro  | cut -f1-4,6-10
grep "YP_009551960.1" vrs240_r0USP250.pro  | cut -f1-4,6-10
grep "YP_008802665.1" vrs240_r0USP250.pro  | cut -f1-4,6-10

# Spot compares of edge-cases
## DEFAULT
# YP_009480680.1	446	565	578
# rdrp2.unc0336.Grapevine_associated_mycovirus_2:ADO60939	1	120	325
# 85.8
## ULTRASENSITIVE
# YP_009480680.1	446	565	578
# rdrp2.unc0336.Grapevine_associated_mycovirus_2:ADO60939	1	120	325
# 85.8
## US BLOSUM45
# YP_009480680.1  446     565     578
# rdrp2.unc0336.Grapevine_associated_mycovirus_2:ADO60939    1       120     325
# 85.8
## US PAM250
# YP_009480680.1  328     562     578
# rdrp2.Fusarividae.Sclerotinia_sclerotiorum_fusarivirus_1:YP_009143301      5       233     506
# 44.3

## DEFAULT
# YP_009551960.1	492	725	1010
# rdrp1.Narnaviridae.Rhizophagus_sp_HR1_mitovirus_like_ssRNA:BAN85985	116	338	482
# 35.4
## ULTRASENSITIVE
# YP_009551960.1	491	751	1010
# rdrp1.Narnaviridae.Macrophomina_phaseolina_mitovirus_3:AMM45292	120	383	474
# 33.6
## US BLOSUM45
# YP_009551960.1  493     725     1010
# rdrp1.Narnaviridae.Fusarium_poae_mitovirus_4:YP_009272901  120     334     471
# 34.8
# US PAM250
# YP_009551960.1  491     740     1010
# rdrp1.Narnaviridae.Macrophomina_phaseolina_mitovirus_3:AMM45292    120     361     474
# 32.7


## DEFAULT
# YP_008802665.1	31	147	285
# rdrp2.Picornaviridae.Kobuvirus_sewage_Aichi:BAO02685	4	133	330
# 33.8
## ULTRASENSIVE
# YP_008802665.1  31      147     285
# rdrp2.Picornaviridae.Kobuvirus_sewage_Aichi:BAO02685    4       133     330
# 33.8
## US BLOSUM45
# YP_008802665.1  31      151     285
# rdrp2.Picornaviridae.Kobuvirus_sewage_Aichi:BAO02685       4       137     330
# 32.8
## US PAM250
# YP_008802665.1  30      147     285
# rdrp2.Picornaviridae.Kobuvirus_sewage_Aichi:BAO02685       3       133     330
# 33.6


In [None]:
# High confidence set of matches
OUTNAME='vrs240_r0'
OUTNAME2='vrs240_r0_HC'

# Extract only matches that are:
# >88% identity to rdrp0
# >98% coverage of rdrp0

cut -f1,2,3,4,6,7,8,9,10,13 $OUTNAME.pro \
  > pro.tmp

rm $OUTNAME2.pro

while read -r line; do
  
  pctid=$(echo $line | cut -f 9 -d' ' - )  
  
  r0_start=$(echo $line | cut -f 6 -d' ' - )
  r0_end=$(echo $line | cut -f 7 -d' ' - )
  r0_len=$(echo $line | cut -f 8 -d' ' - )
  
  r0_cov=$( echo "$r0_end - $r0_start + 1" | bc )
  r0_cov2=$( echo "scale=2;$r0_cov / $r0_len" | bc )
  
  #echo $r0_start $r0_end $r0_len $pctid $r0_cov2
 
  if [[ $pctid > 0.88 ]] && [[ $r0_cov2 > 0.98 ]]; then
    echo $line >> $OUTNAME2.pro
  fi
  
done < pro.tmp

rm pro.tmp

# Fasta of high confidence diamond matches
cut -f1,2,3,10 -d' ' $OUTNAME2.pro \
  | sed 's/^/>/g' \
  | sed 's/\( \)\([A-Z]*$\)/\n\2/g' - \
  > $OUTNAME2.fa

# List of accession with High Confidence Hits
# Create a file of full-length records with these hits
# and associated bed3 file
cut -f1 -d' ' vrs240_r0_HC.pro | sed 's/>//g' - > vrs240_r0_HC.hits
seqkit grep -r -f vrs240_r0_HC.hits vrs240.fa > vrs240_r0_HC_FullLength.fa

cut -f1,2,3 -d' ' $OUTNAME2.pro \
  | sed 's/ /\t/g' - \
  > $OUTNAME2.rdrp.bed

In [None]:
# Create Cross-Validation Set
mkdir cvi; cd cvi
RDRP="vrs240_r0_HC.fa"

# High-Confidence RDRP only
ln -s ../$RDRP ./

# USEARCH CVI
usearch -threads 30 -calc_distmx $RDRP -maxdist 1 -termdist 1 -tabbedout vrs240.distmat

# 90% CVI 
usearch -distmx_split_identity vrs240.distmat -mindist 0.05 -maxdist 0.15 \
  -tabbedout subsets.tmp
sed 's/ /_/g' subsets.tmp > vrs240_r0_HC_id90.cvi

# 75% CVI 
usearch -distmx_split_identity vrs240.distmat -mindist 0.2 -maxdist 0.3 \
  -tabbedout subsets.tmp
sed 's/ /_/g' subsets.tmp > vrs240_r0_HC_id75.cvi

# 45% CVI 
usearch -distmx_split_identity vrs240.distmat -mindist 0.4 -maxdist 0.6 \
  -tabbedout subsets.tmp
sed 's/ /_/g' subsets.tmp > vrs240_r0_HC_id45.cvi

# 25% CVI 
usearch -distmx_split_identity vrs240.distmat -mindist 0.7 -maxdist 0.8 \
  -tabbedout subsets.tmp
sed 's/ /_/g' subsets.tmp > vrs240_r0_HC_id25.cvi

# Replace spaces with undescores for RdRp coordinates within accession

#1 Subset name.
#2 Label1
#3 Label2
#4 Dist

In [None]:
# Upload rs folder
aws s3 sync ./ s3://serratus-public/notebook/201226_rdrp0/rs/

In [None]:
# From base amazon linux 2
sudo yum install -y docker
sudo yum install -y git
sudo service docker start
sudo docker run --rm --entrypoint /bin/bash -it serratus-align:latest

In [None]:
# HMMER
wget http://eddylab.org/software/hmmer/hmmer-3.3.1.tar.gz ./

# Wiki Page Draft

# rdrp: RNA-dependent RNA polymerse

### Current Version: `rdrp0`

[RNA-dependent RNA polymerse (RdRp)](https://en.wikipedia.org/wiki/RNA-dependent_RNA_polymerase) is the enzyme which catalyzes the synthesis of an RNA molecule from an RNA template. RdRp are a marker for RNA viruses and are essential for viruses to synthesize RNA genomes and mRNA. The similarity between viral RdRp demarcate species, genus, and family level taxonomy at approximately 90%,80%, and 40%, respectively. RdRp is a slowly evolving, meaning related viruses are most likely to show sequence similarity in this enzyme. 

To isolate every RdRp in the SRA, we constructed a collection of all available viral RdRp sequences called `rdrp#`. 

# Custom added sequences


### Permuted RdRp

FJ977041 

>ACV83739.1 replicase-associated polyprotein [Grapevine virus Q]
MAAPATAYASPSAAFFALFQEQDLRCFRPLTLAHSLRYDAPVRPRQLPRLRSITVPITSLDEGFTPILIA
RPSLPLLGGGLKELVEMLAPTTHRDTVASPILEAVAGPLRTSIQRYPYEVPAHAVPVLQRFGIEASGFGF
KAHPHPVHKTIEIHLLFEHWLNLCRSPSAVLFMKQSKFEKLQHENANFEALANYNLTARDTTRYEQVAVA
PPTQAVWFMHDALQYFSLSQVAAFFADCPHLEKLFASLVVPPESDFTNLSLFPEIYRYSFAGSRLNYQLE
GNPGHSYSQPREALEWLKTTTIRCGNLYLTVTKLESWGPVHSLLIQRGKPSVHLEHDEVSFVGPDAVALP
EAAALRQDLRHRLVPRTVYDALFVYVRAVRTLRTTDPVGFVRTQSNKAEYSWVTSAAWDNLQHFVTETAA
HRVPNRHFFFNSTFAKCRYWCSQHKLGLLTVTTPPACGLTLFTGAKLASAMSSRLTALAVFHHWVVPPPT
LFFTPKAPLLAIQLTRLPQPLFSSVPFLHKPLGKLSLRLLNFDSFLRRFFPDAPIPTWARLLTVAIALSP
AVWLAIRHFIGPDAPQALNDHYVRFFHPDRWQLTFERQPRFVALDRTFPWPLPQAPEPTEPRDSDVPLET
VPSPLPVVAPLPAPATSVPPVDTSATTASAVEPSLSTESLKTDEAPSGTTILQPRELKDTIFPLPAAALA
VTPPEPAPAPAEPVSASTVLGTAPLSRDLHTGHVSTPATEPGLVEPEHSPLAADSSATGEVSEFFNLHPA
DWIAPTATFLARRRGETISGAKYPAMDCLLAAVSAGANIPKDALWKTICSYFPDSMLREEDIAKHGLSTH
HFAALAREHRLQATFHSAGNQFVLGVEHPSVSFHIDHTPESATAPGHFSLRADERQHSPRLLGGRAADLV
HAALKFKVGSAVLPFQQAHDYTTNVARAKNLISNMKNGFDGVLANIDPAHTNESRDRLLSLDGAMDIAAP
RDVKLIHIAGFPGCGKSYPIAQLLKSRAFKHFKIAVPTVELRNEWKGVLKVKPQDNWRISTWESSLLKSA
RILVIDEIYKMPRGYLDLAIHADPTIDLVIALGDPLQGVYHSTHSDSSNHRLSSEVKHLQPYMDYYCLWS
HRVPQDIGTFFGIKSTSTVPGFKSYQANIPGNLRQLANSQSAAKVLNQCGFSSVTIASSQGSTYSAPACI
HLDRHSMSLSHAHSLVALTRSKSGVIFTGDKRVLEAPGGNLLFSSYFQEKKVDLRALFPTEFPCRPILLE
PLKRRPTDLTGGAPFPFRDEARVFNPERRDDVFVEAAVVCGDGSSNAPQVSTHFLPETRRPLHFDLPSAK
PEFAAHEAPAPLTDTFIEPVYPGETFENIAAHFLPAHDPEVKEILFKDQRSNQFPFIDQPFHVGAQPASL
CAAVHHSKKDPTLLAASIEKRLRFRASDAPYQITAKDEILGSILFEAHCRAMRRDPNVRVPFDEALFAEC
IALNEFAQLTSKTQAVIMANHERSDPDWRYTAVRIFAKNQHKVNSGSLFGPWKACQTLALMHDAVILLFG
PVKKYQLIHDERDRPEHIFIYAGRTPQEMSEWCQKFLTPRSASSPVPVMVSGDDSLIGCHPHFVANDYTA
FDQSQHGEAAVLERLKMERVNIPEWLITLHIMIKTHITTQFGPLTCMRLTGEPGTYFDNSDYNLAVIFLE
YSMSGQWLSENPLWPAIKPLLALRFKKEKTRYGNFCGYYVGAAGAVRMPRALFAKILIAVEDASIADKMA
SYATEFAIGHSLGDALWSLLPVEEVVYQSAVFDFFCRNAPRELKLLFKLGPVERSVVEAVQEFATWASYA
FYRFLNSAQRKVLLTRSPQLHFPGDAPEVSQLQGELLQSFSMMQPTFPLTGGLLLPRAVDAPMSDDSPAG
RARSQRDPDHRVDPQPPLPLAPSVQETSGGPAITVPFQWVALVVKSESTIFTVDPPRAKSLTQLIGPYCH
ARLLSLEAILMPTLNAFQNPVTVHMVWTVNTVQPASGEELFYPGGQALTVGGPVSMSALATVPADVSRLN
PVIKGAVAFLDTPRLTGTTMKCAKSETSPMAYVVIRGTLALSGPVGTRLSE 


# rdrp DECOY

Non-rdrp sequences which return a high E-value to a known and verified rdrp. These are the sequences most likely to give us trouble at scale/assembly.


- Hepatitis C Virus `E2/NS1` Envelope gene

A short region of HCV E2 (AST36587.1) consistently yields hits to the HepC RdRP (called NS5B, BAK61626)

```
Query:   AST36587.1      61      99      99      +
Subject: rdrp3.Flaviviridae.Hepatitis_C_virus_subtype_2b:BAK61626        376     414     608
48.7    8.7e-07 43.9    39M     HHRFNSSGCPERLASCRHLTEFAQGWGPISHANGSGPDE
```

- Retrovirus sequences that made it through
```
AWS06671.1	Maize-associated retrovirus 1	Retroviridae	Artverviricota
AXY66749.1	Saesbyeol virus	Retroviridae	Artverviricota

```

# rdrp0 BLACKLIST

rdrp sequences that are mis-annotated, contain junk (plasmid/other genes) or are of low quality. Excluded from SRA search. `rdrp1` 

-`ADG27878` :rdrp2.Caliciviridae.Calicivirus_pig_NC_WGP93C_USA_2009:ADG27878

Matches calici capsid protein and related stuff

> Region          1030..1441
                     /region_name="RT_like"
                     /note="RT_like: Reverse transcriptase (RT, RNA-dependent
                     DNA polymerase)_like family. An RT gene is usually
                     indicative of a mobile element such as a retrotransposon
                     or retrovirus. RTs occur in a variety of mobile elements,
                     including retrotransposons; cl02808"
                                      

- **Missing/Strange motif A**

>AKJ82635.1|Influenza_A_virus_A_swine_Italy_55925_2011_H3N2_/61-60
....................................................................................................................................
>ANW82746.1|Makokou_virus/1-16
...................................................WS...P..G..........................D...N.....S...A......K....F......R....LF.T.A.M

>BAT24481.1|Rosellinia_necatrix_partitivirus_6/127-126
....................................................................................................................................
>ADP00834.1|Turnip_mosaic_virus/193-207
I.................................................TVR...S..S..........................L...S.....P...Y......L....I......N....A......V
>AFK23478.1|Zucchini_yellow_mosaic_virus/188-209
L............Y.....................C....H...A..DGSXXD...S..S..........................L...T.....P...A......L....L......N....A......V
>AKH48980.1|unidentified_Reptarenavirus/210-232
I............V.....................L....S...X..XXXXXX...X..X..........................N...S.....P...L......Q....Y......H....LM.....F
>AHA83412.1|Xinyi_virus/236-262
I............N.....................V....P...F..AQTPLNGG.P..G..........................D...N.....S...A......K....F......R....RF.T.A.I
>AJZ74605.1|Zaire_ebolavirus/227-247
S............F..........................V...T..DLEXYX...L..X..........................F...X.....Y...X......F....T......A....P......F
>ACL68402.1|Hepatitis_C_virus/189-210
M............G.....................F....X...Y..DTRCFD...S..T..........................V...T.....E...R......D....I......Q....V......E
>YP_009330274.1|Beihai_sesarmid_crab_virus_3/207-223
K............I.....................I....F...I..SFDMSE...F..S....................................................K......K....F......P
>YP_009337854.1|Shahe_yuevirus_like_virus_1/208-224
K............A.....................L....F...V..SFDMSE...F..S....................................................K......K....F......P
>YP_009329866.1|Wuhan_large_pig_roundworm_virus_1/195-216
F............A.....................Y....C...L..DXSSFD...S..S..........................I...N.....T...W......F....I......E....R......F
>YP_009246481.1|Tilapia_lake_virus/146-163
L............L.....................S....S...V..VETHAR...S..V..........................L...S.................................KV.....S
>ACY46471.1|Influenza_A_virus_A_Singapore_ON368_2009_H1N1_/218-240
F............T.....................I....T...G..DNTXWN...E..N..........................Q...X.....P...X......M....F......L....AM.....I
>AQY77579.1|Atypical_porcine_pestivirus/193-214
V............A.....................V....S...F..DTKAWD...T..Q..........................V...T.....R...E......D....L......R....L......V

- **Missing/Strange motif B**

>AEK49762.1|Influenza_A_virus_A_gadwall_California_8504_2008_H6N1_/321-334
L.X..X..X...X.......XXXLS.T....V.......L...G
>BAD89093.1|Ibaraki_virus/308-326
R.YS.T..Y...RA.K..IQHWSVI.F....M.......H...N
>ACY46471.1|Influenza_A_virus_A_Singapore_ON368_2009_H1N1_/321-338
L.X..P..G...RM.M..GMFXXLS.T....V.......L...G
AFJ05062.2|Cordyline_virus_4/271-279
A.Q..R..R...TG.................I.......T...T
>AEM63700.1|Varroa_destructor_virus_1/274-291
C.G..I..P...SG.S..PITDILN.X....X.......X...X
>AHW48492.1|West_Nile_virus/266-278
X.X..X..X...X........XXXX.X....X.......X...X
>AKH67623.1|Nora_virus/270-287
R.G..N..K...SG.S..YTTTIDN.X....X.......X...X

>API61900.1|Quarivirus_93C4/274-291
N.G..T..P...AG.F..VPTAENN.S....L.......Y...G
>YP_009330019.1|Hubei_astro_like_virus/268-285
C.G..F..N...PG.C..LCTTHDN.T....L.......V...N
>APG79231.1|Beihai_sipunculid_worm_virus_7/319-340
K.L..S..D...FG.Q..GLSGFCA.SAESTV.......V...E

- **Missing motif C**

**REFSEQ SEQUENCE**
>NP_042695.1|Cassava_common_mosaic_virus/293-299
Q..........................D.S.A........M.......D...A
>AKZ17743.1|Grapevine_Syrah_virus_1/314-314
L....................................................
>YP_009337147.1|Changjiang_hepe_like_virus_1/325-328
T..........V....................................L...G
>ACY46471.1|Influenza_A_virus_A_Singapore_ON368_2009_H1N1_/358-369
G..........L...Q....S.SD...D.Y.X........L.......I...X
>APG77679.1|Hubei_negev_like_virus_1/316-316
....................................................P
>APG77656.1|Hubei_virga_like_virus_22/327-334
S..........T...N....I.KR........................Y...L
>AIX97862.1|Goutanap_virus/317-318
................................................F...L
>YP_009001772.1|Wallerfield_virus/317-318
................................................Y...T
>YP_009351824.1|Biratnagar_virus/317-318
................................................Y...T
>APG75979.1|Wenzhou_toti_like_virus_2/301-310
....................Q.GL...Y.VRM........K.......S...V
>AFI24669.1|Dezidougou_virus/317-318
................................................Y...T
>YP_009344994.1|Wuhan_insect_virus_8/315-315
....................................................N
>AHX42605.1|Tanay_virus/319-320
................................................Y...T
>YP_009333224.1|Beihai_barnacle_virus_3/341-344
E..........L....................................V...D
>YP_009337768.1|Hubei_virga_like_virus_7/318-319
................................................Y...S
>API61901.1|Biggievirus_Mos11/317-318
................................................Y...T
>YP_009342435.1|Wuhan_house_centipede_virus_1/315-315
....................................................I
>AEV46286.1|Hepatitis_C_virus_genotype_3/287-301
F..........L...F....G.EDGSWS.Y.Y........L.......R...T
>APG77564.1|Beihai_hepe_like_virus_11/332-335
A..........P....................................V...G
>ACH97717.1|Hepatitis_C_virus/287-298
M..........L...V....C.GD...X.X.X........X.......X...X
>APD78607.1|St_Louis_encephalitis_virus/329-340
X..........X...X....X.XX...X.X.X........X.......X...X
>ANA85187.1|Zika_virus/327-338
X..........X...X....X.XX...X.X.X........X.......X...X
>AFI24675.1|Santana_virus/318-319
................................................Y...T