# Build inference result dataset

**Objective**: create a dataset holding the inference results obtained from simulated reads, using a subset or all of the sequences as reference.

**Constraints**: the full data cannot be loaded into memory for inference because there are too many reads:
- an estimated 50k reads per sequence; for the 3k sequences, this means about 150 million reads to handle!
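
The estimate can be checked with quick arithmetic. The sketch below uses the 3,318 sequences reported further down for `cov_virus_sequences.fa`. The read length of 150 bases is an assumption made only to give an order of magnitude for the raw data size; it is not a value taken from this notebook.

```python
# Back-of-the-envelope estimate of the number of simulated reads
READS_PER_SEQUENCE = 50_000   # estimate quoted above
N_SEQUENCES = 3_318           # sequence count reported for cov_virus_sequences.fa
READ_LENGTH = 150             # ASSUMPTION: read length in bases, for illustration only

total_reads = READS_PER_SEQUENCE * N_SEQUENCES

# One byte per base, ignoring quality strings and any per-read metadata
raw_bases_gb = total_reads * READ_LENGTH / 1e9

print(f"{total_reads:,d} reads, about {raw_bases_gb:.1f} GB of raw bases")
```

With the exact sequence count, the total is closer to 166 million reads than to 150 million, which only reinforces the need to process the data in chunks.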

**Pipeline Idea**:

Use a large fasta file and iterate over manageable chunks. For each chunk:
- create a fasta file for the chunk (`.fa`)
- create simreads with Art Illumina (`fq`, `aln`)
- preprocess simreads into ds and info
- use model to infer taxonomy and position
- build the inference result dataset including:
    - predicted result
    - ref sequence metadata
    - position ground truth
- save partial inference result dataset as parquet

When iteration is done:
- merge all partial inference result datasets into one single dataset
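
The chunked iteration described above can be sketched with the standard library only. This is a minimal illustration, not the `metagentools` implementation: `iter_fasta_records`, `iter_chunks` and `write_chunk_fasta` are hypothetical helpers, and the tiny fasta content is made up to exercise the loop. The simulation, preprocessing, inference and parquet steps are left as a comment.

```python
from itertools import islice
from pathlib import Path
import tempfile

def iter_fasta_records(path):
    """Yield (header, sequence) pairs from a fasta file, one record at a time."""
    header, seq_lines = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq_lines)
                header, seq_lines = line[1:], []
            elif line:
                seq_lines.append(line)
    if header is not None:
        yield header, "".join(seq_lines)

def iter_chunks(records, chunk_size):
    """Group an iterator of records into lists of at most `chunk_size` items."""
    records = iter(records)
    while chunk := list(islice(records, chunk_size)):
        yield chunk

def write_chunk_fasta(chunk, path):
    """Write one chunk of records to its own fasta file."""
    with open(path, "w") as fh:
        for header, seq in chunk:
            fh.write(f">{header}\n{seq}\n")

# Made-up fasta file with three short records, only to exercise the loop
with tempfile.TemporaryDirectory() as tmp:
    p2fasta_demo = Path(tmp) / "demo.fa"
    p2fasta_demo.write_text(">s1\nACGT\nAC\n>s2\nGGTT\n>s3\nTTAA\n")
    chunk_files = []
    records = iter_fasta_records(p2fasta_demo)
    for i, chunk in enumerate(iter_chunks(records, chunk_size=2)):
        p2chunk = Path(tmp) / f"chunk_{i:03d}.fa"
        write_chunk_fasta(chunk, p2chunk)
        # Simulation, preprocessing, inference and the partial parquet save would go here
        chunk_files.append((p2chunk.name, [header for header, _ in chunk]))

print(chunk_files)
```

Because both helpers are generators, only one chunk of sequences is held in memory at any time, which is the property the pipeline relies on.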


**Intermediate steps**: use groups of sequences to experiment while still getting some statistically relevant information.

# Setup


In [None]:
from ecutilities.ipython import nb_setup, pandas_nrows_ncols
nb_setup()

Set autoreload mode


In [None]:
import numpy as np
import os
import pandas as pd
import tempfile
from nbdev import show_doc
from pathlib import Path
from pprint import pprint
from metagentools.art import ArtIllumina
from metagentools.cnn_virus.data import FastaFileReader, FastaFileIterator, parse_metadata_fasta_cov

# Explore fasta file sequence metadata

Explore the `cov_data` directory that includes all available sequence files

In [None]:
p2art = Path('/bin/art_illumina')
assert p2art.is_file()

p2inputs = Path('../../data/cov_data').resolve()
print(p2inputs.absolute())
assert p2inputs.is_dir()

p2outputs = Path('../../data/cov_simreads').resolve()
print(p2outputs.absolute())
assert p2outputs.is_dir()

/home/vtec/projects/bio/metagentools/data/cov_data
/home/vtec/projects/bio/metagentools/data/cov_simreads


In [None]:
art = ArtIllumina(
    path2app=p2art,
    input_dir=p2inputs,
    output_dir=p2outputs
    )

Ready to operate with art: /bin/art_illumina
Input files from : /home/vtec/projects/bio/metagentools/data/cov_data
Output files to :  /home/vtec/projects/bio/metagentools/data/cov_simreads


In [None]:
art.list_all_input_files()

cov_virus_list.txt
cov_virus_sequence_one_1.fa
cov_virus_sequence_one_2.fa
cov_virus_sequences.fa
cov_virus_sequences_hundred.fa
cov_virus_sequences_ten.fa
cov_virus_sequences_twenty_five.fa
cov_virus_sequences_two.fa
groups_1


## Extract fasta metadata and create DataFrame

Create file reader for the full sequence file and extract all metadata

In [None]:
p2fasta = p2inputs / 'cov_virus_sequences.fa'
assert p2fasta.is_file()

In [None]:
fasta = FastaFileReader(p2fasta)
seqs_metadata = fasta.parse_fasta()
print(f"The file includes {len(seqs_metadata):,d} sequences")

The file includes 3,318 sequences


Display the metadata for the first few sequences

In [None]:
seqids = list(seqs_metadata.keys())
for seqid in seqids[:5]:
    pprint(seqs_metadata[seqid])

{'accession': 'MK211378',
 'seqid': '2591237:ncbi:1',
 'seqnb': '1',
 'source': 'ncbi',
 'species': 'Coronavirus BtRs-BetaCoV/YN2018D  scientific name',
 'taxonomyid': '2591237'}
{'accession': 'LC494191',
 'seqid': '11128:ncbi:2',
 'seqnb': '2',
 'source': 'ncbi',
 'species': 'Bovine coronavirus  scientific name',
 'taxonomyid': '11128'}
{'accession': 'KY967361',
 'seqid': '31631:ncbi:3',
 'seqnb': '3',
 'source': 'ncbi',
 'species': 'Human coronavirus OC43  scientific name',
 'taxonomyid': '31631'}
{'accession': 'LC654455',
 'seqid': '277944:ncbi:4',
 'seqnb': '4',
 'source': 'ncbi',
 'species': 'Human coronavirus NL63  scientific name',
 'taxonomyid': '277944'}
{'accession': 'MN987231',
 'seqid': '11120:ncbi:5',
 'seqnb': '5',
 'source': 'ncbi',
 'species': 'Infectious bronchitis virus  scientific name',
 'taxonomyid': '11120'}
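
In the five records shown above, `seqid` appears to be the concatenation of `taxonomyid`, `source` and `seqnb`, separated by colons. This is an observation from the displayed records, not a documented guarantee. The sketch below splits a `seqid` under that assumption; `SeqId` and `split_seqid` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple

class SeqId(NamedTuple):
    """Parts of a sequence id, assuming the layout taxonomyid:source:seqnb."""
    taxonomyid: str
    source: str
    seqnb: str

def split_seqid(seqid: str) -> SeqId:
    """Split a sequence id into its three colon-separated parts."""
    parts = seqid.split(":")
    if len(parts) != 3:
        raise ValueError(f"unexpected seqid layout: {seqid!r}")
    return SeqId(*parts)

# Example with the first sequence id displayed above
print(split_seqid("2591237:ncbi:1"))
```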


Convert metadata into DataFrame format

In [None]:
# Build the DataFrame in one call: one row per sequence id, one column per metadata field
meta_df = pd.DataFrame.from_dict(seqs_metadata, orient='index')
meta_df.shape

(3318, 6)

In [None]:
meta_df.head()

Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
2591237:ncbi:1,MK211378,2591237:ncbi:1,1,ncbi,Coronavirus BtRs-BetaCoV/YN2018D scientific name,2591237
11128:ncbi:2,LC494191,11128:ncbi:2,2,ncbi,Bovine coronavirus scientific name,11128
31631:ncbi:3,KY967361,31631:ncbi:3,3,ncbi,Human coronavirus OC43 scientific name,31631
277944:ncbi:4,LC654455,277944:ncbi:4,4,ncbi,Human coronavirus NL63 scientific name,277944
11120:ncbi:5,MN987231,11120:ncbi:5,5,ncbi,Infectious bronchitis virus scientific name,11120


We will break down the 3k sequences by `species`. We know that some of the sequences do not have a species string:

In [None]:
nospecies_idxs = meta_df.loc[meta_df.species.isna(), :].index
meta_df.loc[nospecies_idxs, :]

Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
2877474:ncbi:73,MZ328303,2877474:ncbi:73,73,ncbi,,2877474
2872806:ncbi:125,MZ081378,2872806:ncbi:125,125,ncbi,,2872806
2891194:ncbi:184,MW924112,2891194:ncbi:184,184,ncbi,,2891194
2833184:ncbi:240,OK017818,2833184:ncbi:240,240,ncbi,,2833184
2877471:ncbi:243,MZ328300,2877471:ncbi:243,243,ncbi,,2877471
...,...,...,...,...,...,...
2833184:ncbi:3144,OK017826,2833184:ncbi:3144,3144,ncbi,,2833184
2833184:ncbi:3149,OK017833,2833184:ncbi:3149,3149,ncbi,,2833184
2877469:ncbi:3168,MZ328302,2877469:ncbi:3168,3168,ncbi,,2877469
2833184:ncbi:3185,OK017855,2833184:ncbi:3185,3185,ncbi,,2833184


## Group sequences by `species`

First we work only with sequences that have a value for `species`

In [None]:
meta = meta_df.dropna(axis=0, subset=['species'])
print(f"We dropped {meta_df.shape[0]-meta.shape[0]:,d} rows that have no `species` description")
meta.shape

We dropped 108 rows that have no `species` description


(3210, 6)

Check how many unique `species` there are and how many sequences belong to each:

In [None]:
unique_species = meta.species.sort_values().value_counts(sort=False)
print(f"There are {unique_species.shape[0]:,d} species in the full sequence file:")
with pandas_nrows_ncols():
    display(unique_species)

There are 296 species in the full sequence file:


229E-related bat coronavirus  scientific name                                 6
Alpaca respiratory coronavirus  scientific name                               1
Alphacoronavirus 1  scientific name                                           1
Alphacoronavirus Bat-CoV/P                                                    4
Alphacoronavirus BtMs-AlphaCoV/GS2013  scientific name                        1
Alphacoronavirus Mink/China/1/2016  scientific name                           1
Alphacoronavirus UKRn3  scientific name                                       1
Alphacoronavirus sp                                                          56
Atlantic salmon bafinivirus  scientific name                                  1
Avian coronavirus  scientific name                                           24
Avian infectious bronchitis virus                                             1
Avian infectious bronchitis virus partridge/GD/S14/2003  scientific name      1
Ball python nidovirus 1  scientific name

Reviewing the info above, we can group the sequences into the following groups of related species. A few sequences are not included in any of these groups:

In [None]:
species_groups = 'Alphacoronavirus; Bat CoV; Bat Hp; Bat SARS CoV; Bat SARS coronavirus; Bat SARS-like; Bat coronavirus; Betacoronavirus; Bovine; Bt; Coronavirus; Deltacoronavirus; Feline; Human; Infectious bronchitis; Middle East respiratory syndrome; Murine coronavirus; Murine hepatitis; Pangolin coronavirus; Porcine coronavirus; Porcine deltacoronavirus; Porcine epidemic diarrhea virus; SARS; Severe acute respiratory syndrome; TGEV'.split('; ')
len(species_groups)

25

```
    229E-related bat coronavirus  scientific name                                 6
    Alpaca respiratory coronavirus  scientific name                               1
Alphacoronavirus
    Alphacoronavirus 1  scientific name                                           1
    Alphacoronavirus Bat-CoV/P                                                    4
    Alphacoronavirus BtMs-AlphaCoV/GS2013  scientific name                        1
    Alphacoronavirus Mink/China/1/2016  scientific name                           1
    Alphacoronavirus UKRn3  scientific name                                       1
    Alphacoronavirus sp                                                          56

    Atlantic salmon bafinivirus  scientific name                                  1
    Avian coronavirus  scientific name                                           24
    Avian infectious bronchitis virus                                             1
    Avian infectious bronchitis virus partridge/GD/S14/2003  scientific name      1
    Ball python nidovirus 1  scientific name                                      2
Bat CoV
    Bat CoV 273/2005  scientific name                                             1
    Bat CoV 279/2005  scientific name                                             1
Bat Hp
    Bat Hp-betacoronavirus/Zhejiang2013  scientific name                          2
Bat SARS CoV
    Bat SARS CoV Rf1/2004  scientific name                                        1
    Bat SARS CoV Rm1/2004  scientific name                                        1
    Bat SARS CoV Rp3/2004  scientific name                                        1
Bat SARS coronavirus
    Bat SARS coronavirus HKU3  scientific name                                    2
    Bat SARS coronavirus HKU3-1  scientific name                                  1
    Bat SARS coronavirus HKU3-10  scientific name                                 1
    Bat SARS coronavirus HKU3-11  scientific name                                 1
    Bat SARS coronavirus HKU3-12  scientific name                                 1
    Bat SARS coronavirus HKU3-13  scientific name                                 1
    Bat SARS coronavirus HKU3-2  scientific name                                  1
    Bat SARS coronavirus HKU3-3  scientific name                                  1
    Bat SARS coronavirus HKU3-4  scientific name                                  1
    Bat SARS coronavirus HKU3-5  scientific name                                  1
    Bat SARS coronavirus HKU3-6  scientific name                                  1
    Bat SARS coronavirus HKU3-7  scientific name                                  1
    Bat SARS coronavirus HKU3-8  scientific name                                  1
    Bat SARS coronavirus HKU3-9  scientific name                                  1
Bat SARS-like
    Bat SARS-like coronavirus  scientific name                                   13
    Bat SARS-like coronavirus Rs3367  scientific name                             1
    Bat SARS-like coronavirus RsSHC014  scientific name                           1
    Bat SARS-like coronavirus WIV1  scientific name                               1
    Bat SARS-like coronavirus YNLF_31C  scientific name                           1
    Bat SARS-like coronavirus YNLF_34C  scientific name                           1
????
    Bat alphacoronavirus  scientific name                                         2
Bat coronavirus
    Bat coronavirus                                                               1
    Bat coronavirus  scientific name                                              4
    Bat coronavirus 1A  scientific name                                           1
    Bat coronavirus 1B  scientific name                                           1
    Bat coronavirus CDPHE15/USA/2006  scientific name                             2
    Bat coronavirus Cp/Yunnan2011  scientific name                                1
    Bat coronavirus HKU10  scientific name                                       17
    Bat coronavirus HKU4-1  scientific name                                       1
    Bat coronavirus HKU4-2  scientific name                                       1
    Bat coronavirus HKU4-3  scientific name                                       1
    Bat coronavirus HKU4-4  scientific name                                       1
    Bat coronavirus HKU5-1  scientific name                                       1
    Bat coronavirus HKU5-2  scientific name                                       1
    Bat coronavirus HKU5-3  scientific name                                       1
    Bat coronavirus HKU5-5  scientific name                                       1
    Bat coronavirus HKU9-1  scientific name                                       1
    Bat coronavirus HKU9-10-1  scientific name                                    1
    Bat coronavirus HKU9-10-2  scientific name                                    1
    Bat coronavirus HKU9-2  scientific name                                       1
    Bat coronavirus HKU9-3  scientific name                                       1
    Bat coronavirus HKU9-4  scientific name                                       1
    Bat coronavirus HKU9-5-1  scientific name                                     1
    Bat coronavirus HKU9-5-2  scientific name                                     1
    Bat coronavirus RaTG13  scientific name                                       1
    Bat coronavirus Rp/Shaanxi2011  scientific name                               1

    Bellinger River virus  scientific name                                        1
    Beluga whale coronavirus SW1  scientific name                                 2
    Berne virus  scientific name                                                  1
Betacoronavirus
    Betacoronavirus 1  scientific name                                            4
    Betacoronavirus England 1  scientific name                                    2
    Betacoronavirus Erinaceus  scientific name                                    8
    Betacoronavirus Erinaceus/VMC/DEU/2012  scientific name                       3
    Betacoronavirus HKU24  scientific name                                        4
    Betacoronavirus sp                                                           10

    Bottlenose dolphin coronavirus  scientific name                               4
    Bottlenose dolphin coronavirus HKU22  scientific name                         3
Bovine
    Bovine coronavirus  scientific name                                          89
    Bovine coronavirus DB2  scientific name                                       1
    Bovine coronavirus E-AH187  scientific name                                   1
    Bovine coronavirus E-AH187-TC  scientific name                                1
    Bovine coronavirus E-AH65  scientific name                                    1
    Bovine coronavirus E-AH65-TC  scientific name                                 1
    Bovine coronavirus E-DB2-TC  scientific name                                  1
    Bovine coronavirus R-AH187  scientific name                                   1
    Bovine coronavirus R-AH65  scientific name                                    1
    Bovine coronavirus R-AH65-TC  scientific name                                 1
    Bovine coronavirus isolate Alpaca  scientific name                            1
    Bovine respiratory coronavirus AH187  scientific name                         1
    Bovine respiratory coronavirus bovine/US/OH-440-TC/1996  scientific name      1
    Bovine torovirus  scientific name                                             4
Bt
    BtMf-AlphaCoV/AH2011  scientific name                                         1
    BtMf-AlphaCoV/FJ2012  scientific name                                         1
    BtMf-AlphaCoV/GD2012  scientific name                                         1
    BtMf-AlphaCoV/HeN2013  scientific name                                        1
    BtMf-AlphaCoV/HuB2013  scientific name                                        1
    BtMf-AlphaCoV/JX2012  scientific name                                         1
    BtMr-AlphaCoV/SAX2011  scientific name                                        2
    BtNv-AlphaCoV/SC2013  scientific name                                         2
    BtPa-BetaCoV/GD2013  scientific name                                          1
    BtRf-AlphaCoV/HuB2013  scientific name                                        2
    BtRf-AlphaCoV/YN2012  scientific name                                         2
    BtRf-BetaCoV/HeB2013  scientific name                                         1
    BtRf-BetaCoV/JL2012  scientific name                                          1
    BtRf-BetaCoV/SX2013  scientific name                                          1
    BtRs-BetaCoV/GX2013  scientific name                                          1
    BtRs-BetaCoV/HuB2013  scientific name                                         1
    BtRs-BetaCoV/YN2013  scientific name                                          1
    BtTp-BetaCoV/GX2012  scientific name                                          1
    BtVs-BetaCoV/SC2013  scientific name                                          1

    Bulbul coronavirus HKU11-796  scientific name                                 1
    Bulbul coronavirus HKU11-934  scientific name                                 2
    Calf-giraffe coronavirus US/OH3/2006  scientific name                         1
    Camel alphacoronavirus  scientific name                                      27
    Camel alphacoronavirus Camel229E  scientific name                             6
    Camel coronavirus HKU23  scientific name                                      2
    Canada goose coronavirus  scientific name                                     2
    Canine coronavirus  scientific name                                          14
    Canine respiratory coronavirus  scientific name                               3
    Chinook salmon bafinivirus  scientific name                                   3
    Civet SARS CoV SZ16/2003  scientific name                                     1
    Civet SARS CoV SZ3/2003  scientific name                                      1
    Common moorhen coronavirus HKU21  scientific name                             2
Coronavirus
    Coronavirus AcCoV-JC34  scientific name                                       2
    Coronavirus BtRl-BetaCoV/SC2018  scientific name                              1
    Coronavirus BtRs-AlphaCoV/YN2018  scientific name                             1
    Coronavirus BtRs-BetaCoV/YN2018A  scientific name                             1
    Coronavirus BtRs-BetaCoV/YN2018B  scientific name                             1
    Coronavirus BtRs-BetaCoV/YN2018C  scientific name                             1
    Coronavirus BtRs-BetaCoV/YN2018D  scientific name                             1
    Coronavirus BtRt-BetaCoV/GX2018  scientific name                              1
    Coronavirus BtSk-AlphaCoV/GX2018A  scientific name                            1
    Coronavirus BtSk-AlphaCoV/GX2018B  scientific name                            1
    Coronavirus BtSk-AlphaCoV/GX2018C  scientific name                            1
    Coronavirus BtSk-AlphaCoV/GX2018D  scientific name                            1
    Coronavirus HKU15  scientific name                                            2
    Coronavirus Neoromicia/PML-PHE1/RSA/2011  scientific name                     1
Deltacoronavirus
    Deltacoronavirus PDCoV/USA/Illinois121/2014  scientific name                  1
    Deltacoronavirus PDCoV/USA/Illinois133/2014  scientific name                  1
    Deltacoronavirus PDCoV/USA/Illinois134/2014  scientific name                  1
    Deltacoronavirus PDCoV/USA/Illinois136/2014  scientific name                  1
    Deltacoronavirus PDCoV/USA/Ohio137/2014  scientific name                      1
    Deltacoronavirus sp                                                           3

    Dromedary camel coronavirus HKU23  scientific name                            9
    Duck coronavirus  scientific name                                             3
    Equine coronavirus  scientific name                                           5
    Erinaceus hedgehog coronavirus HKU31  scientific name                         2
    European turkey coronavirus 080385d  scientific name                          1
    Fathead minnow nidovirus  scientific name                                     1
Feline
    Feline alphacoronavirus 1  scientific name                                    1
    Feline coronavirus  scientific name                                          60
    Feline coronavirus RM  scientific name                                        1
    Feline coronavirus UU10  scientific name                                      1
    Feline coronavirus UU11  scientific name                                      1
    Feline coronavirus UU15  scientific name                                      1
    Feline coronavirus UU16  scientific name                                      1
    Feline coronavirus UU17  scientific name                                      1
    Feline coronavirus UU18  scientific name                                      1
    Feline coronavirus UU19  scientific name                                      1
    Feline coronavirus UU2  scientific name                                       1
    Feline coronavirus UU20  scientific name                                      1
    Feline coronavirus UU21  scientific name                                      1
    Feline coronavirus UU22  scientific name                                      1
    Feline coronavirus UU23  scientific name                                      1
    Feline coronavirus UU24  scientific name                                      1
    Feline coronavirus UU3  scientific name                                       1
    Feline coronavirus UU30  scientific name                                      1
    Feline coronavirus UU31  scientific name                                      1
    Feline coronavirus UU34  scientific name                                      1
    Feline coronavirus UU4  scientific name                                       1
    Feline coronavirus UU40  scientific name                                      1
    Feline coronavirus UU47  scientific name                                      1
    Feline coronavirus UU5  scientific name                                       1
    Feline coronavirus UU54  scientific name                                      1
    Feline coronavirus UU7  scientific name                                       1
    Feline coronavirus UU8  scientific name                                       1
    Feline coronavirus UU9  scientific name                                       1
    Feline infectious peritonitis virus  scientific name                          6

    Ferret coronavirus  scientific name                                           4
    Ferret enteric coronavirus  scientific name                                   1
    Ferret systemic coronavirus  scientific name                                  1
    Giraffe coronavirus US/OH3-TC/2006  scientific name                           1
    Giraffe coronavirus US/OH3/2003  scientific name                              1
    Goat torovirus  scientific name                                               1
    Guangdong red-banded snake torovirus  scientific name                         1
    Guinea fowl coronavirus  scientific name                                      2
    Guinea fowl coronavirus GfCoV/FR/2011  scientific name                        1
    Hainan hebius popei torovirus  scientific name                                1
    Hedgehog coronavirus 1  scientific name                                       1
    Hipposideros bat coronavirus HKU10  scientific name                           6
    Hipposideros pomona bat coronavirus CHB25  scientific name                    1
    Hipposideros pomona bat coronavirus HKU10-related  scientific name            1

Human
    Human betacoronavirus 2c EMC/2012  scientific name                            1
    Human betacoronavirus 2c England-Qatar/2012  scientific name                  1
    Human betacoronavirus 2c Jordan-N3/2012  scientific name                      2
    Human coronavirus 229E  scientific name                                      39
    Human coronavirus HKU1  scientific name                                      52
    Human coronavirus NL63  scientific name                                      71
    Human coronavirus OC43  scientific name                                     213
    Human enteric coronavirus 4408  scientific name                               1
    Human group 1 coronavirus associated with pneumonia  scientific name          1
Infectious bronchitis
    Infectious bronchitis virus  scientific name                                433
    Infectious bronchitis virus ITA/90254/2005  scientific name                   1
    Infectious bronchitis virus NGA/A116E7/2006  scientific name                  1

    Lucheng Rn rat coronavirus  scientific name                                   5
    Magpie-robin coronavirus HKU18  scientific name                               2
Middle East respiratory syndrome
    Middle East respiratory syndrome-related coronavirus  scientific name       606
???
    Miniopterus bat coronavirus 1  scientific name                                1
    Miniopterus bat coronavirus HKU8  scientific name                             2
    Miniopterus pusillus bat coronavirus HKU8-related  scientific name            1
    Miniopterus schreibersii bat coronavirus 1-related  scientific name           1
    Mink coronavirus 1  scientific name                                           2
    Mink coronavirus strain WD1127  scientific name                               2
    Mink coronavirus strain WD1133  scientific name                               1
    Munia coronavirus HKU13-3514  scientific name                                 2
Murine coronavirus
    Murine coronavirus  scientific name                                           7
    Murine coronavirus MHV-1  scientific name                                     1
    Murine coronavirus MHV-3  scientific name                                     2
    Murine coronavirus MHV-JHM                                                    1
    Murine coronavirus RA59/R13  scientific name                                  1
    Murine coronavirus RA59/SJHM  scientific name                                 1
    Murine coronavirus RJHM/A  scientific name                                    1
    Murine coronavirus SA59/RJHM  scientific name                                 1
    Murine coronavirus inf-MHV-A59  scientific name                               1
    Murine coronavirus repA59/RJHM  scientific name                               1
    Murine coronavirus repJHM/RA59  scientific name                               1
Murine hepatitis
    Murine hepatitis virus  scientific name                                      10
    Murine hepatitis virus strain 2  scientific name                              1
    Murine hepatitis virus strain A59  scientific name                            2
    Murine hepatitis virus strain JHM  scientific name                            1
    Murine hepatitis virus strain ML-11  scientific name                          1
    Murine hepatitis virus strain S/3239-17  scientific name                      1
    
    Myotis lucifugus coronavirus  scientific name                                 1
    NL63-related bat coronavirus  scientific name                                 5
    Night heron coronavirus HKU19  scientific name                                2
    PRCV ISU-1  scientific name                                                   1
Pangolin coronavirus
    Pangolin coronavirus  scientific name                                         8

    Pheasant coronavirus  scientific name                                         2
    Pipistrellus abramus bat coronavirus HKU5-related  scientific name            1
    Pipistrellus bat coronavirus HKU5  scientific name                            4
Porcine coronavirus
    Porcine coronavirus HKU15  scientific name                                   13
Porcine deltacoronavirus
    Porcine deltacoronavirus  scientific name                                   124
    Porcine deltacoronavirus 8734/USA-IA/2014  scientific name                    1
    Porcine deltacoronavirus KNU14-04  scientific name                            1

    Porcine enteric alphacoronavirus  scientific name                             3
    Porcine enteric alphacoronavirus GDS04  scientific name                       1

Porcine epidemic diarrhea virus
    Porcine epidemic diarrhea virus  scientific name                            821
    Porcine epidemic diarrhea virus L00721/GER/2014  scientific name              1

    Porcine hemagglutinating encephalomyelitis virus  scientific name            15
    Porcine respiratory coronavirus  scientific name                              2
    Porcine torovirus  scientific name                                            3
    Quail deltacoronavirus  scientific name                                       1
    Rabbit coronavirus HKU14  scientific name                                     5
    Rat coronavirus  scientific name                                              2
    Rat coronavirus Parker  scientific name                                       2
    Rhinolophus affinis bat coronavirus HKU2-related  scientific name             1
    Rhinolophus affinis coronavirus  scientific name                              1
    Rhinolophus bat coronavirus HKU2  scientific name                             6
    Rhinolophus bat coronavirus HKU32  scientific name                            2
    Rousettus aegyptiacus bat coronavirus 229E-related  scientific name           1
    Rousettus bat coronavirus  scientific name                                    1
    Rousettus bat coronavirus GCCDC1  scientific name                             1
    Rousettus bat coronavirus HKU10  scientific name                              3
    Rousettus bat coronavirus HKU9  scientific name                               2
SARS
    SARS coronavirus Rs_672/2006  scientific name                                 1
    SARS-like coronavirus WIV16  scientific name                                  1

    Sable antelope coronavirus US/OH1/2003  scientific name                       1
    Sambar deer coronavirus US/OH-WD388-TC/1994  scientific name                  2
    Sambar deer coronavirus US/OH-WD388/1994  scientific name                     1
    Scotophilus bat coronavirus 512  scientific name                              2
    Scotophilus kuhlii bat coronavirus 512-related  scientific name               1
Severe acute respiratory syndrome
    Severe acute respiratory syndrome coronavirus 2  scientific name              2
    Severe acute respiratory syndrome-related coronavirus  scientific name       21

    Sparrow coronavirus HKU17  scientific name                                    2
    Sparrow deltacoronavirus  scientific name                                     4
    Swine deltacoronavirus OhioCVM1/2014  scientific name                         1
    Swine enteric alphacoronavirus  scientific name                               1
    Swine enteric coronavirus  scientific name                                    4
TGEV
    TGEV Miller M6  scientific name                                               1
    TGEV Miller M60  scientific name                                              1
    TGEV Purdue P115  scientific name                                             1
    TGEV virulent Purdue  scientific name                                         1
    Thrush coronavirus HKU12-600  scientific name                                 2
    Transmissible gastroenteritis virus  scientific name                         38
    Turkey coronavirus  scientific name                                           9
    Tylonycteris bat coronavirus HKU33  scientific name                           1
    Tylonycteris bat coronavirus HKU4  scientific name                            5
    Tylonycteris pachypus bat coronavirus HKU4-related  scientific name           1
    Water deer coronavirus  scientific name                                       1
    Waterbuck coronavirus US/OH-WD358-GnC/1994  scientific name                   1
    Waterbuck coronavirus US/OH-WD358-TC/1994  scientific name                    1
    Waterbuck coronavirus US/OH-WD358/1994  scientific name                       1
    White bream virus  scientific name                                            1
    White-eye coronavirus HKU16  scientific name                                  2
    White-tailed deer coronavirus US/OH-WD470/1994  scientific name               1
    Wigeon coronavirus HKU20  scientific name                                     2
    Yak coronavirus  scientific name                                              1
```

In [None]:
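def display_filtered_meta(pattern, meta=meta, show_all=False):
    """Display the rows of `meta` whose `species` starts with `pattern` and return their count"""
    df = meta.loc[meta.species.str.startswith(pattern)]
    # Each row of `meta` is one sequence, so the count printed below is a number of sequences
    print(f"{pattern}: {df.shape[0]:,d} species:")
    if show_all:
        with pandas_nrows_ncols():
            display(df)
    else:
        display(df)
    return df.shape[0]

def add_group(pattern):
    """Append `pattern` to the global `species_groups` list if it is not already in it"""
    if pattern not in species_groups: species_groups.append(pattern)
    return species_groups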

In [None]:
total_nbr_species_in_groups = 0
for pattern in species_groups:
    total_nbr_species_in_groups += display_filtered_meta(pattern)

total_nbr_species_in_groups

Alphacoronavirus: 64 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
1906673:ncbi:113,MH687967,1906673:ncbi:113,113,ncbi,Alphacoronavirus sp,1906673
1906673:ncbi:137,MH687954,1906673:ncbi:137,137,ncbi,Alphacoronavirus sp,1906673
1906673:ncbi:204,MH687950,1906673:ncbi:204,204,ncbi,Alphacoronavirus sp,1906673
1906673:ncbi:368,MH687945,1906673:ncbi:368,368,ncbi,Alphacoronavirus sp,1906673
1906673:ncbi:409,MZ081391,1906673:ncbi:409,409,ncbi,Alphacoronavirus sp,1906673
...,...,...,...,...,...,...
2492658:ncbi:2983,NC_046964,2492658:ncbi:2983,2983,ncbi,Alphacoronavirus Bat-CoV/P,2492658
1906673:ncbi:3025,MZ081389,1906673:ncbi:3025,3025,ncbi,Alphacoronavirus sp,1906673
1906673:ncbi:3211,MZ081394,1906673:ncbi:3211,3211,ncbi,Alphacoronavirus sp,1906673
1906673:ncbi:3216,MH687965,1906673:ncbi:3216,3216,ncbi,Alphacoronavirus sp,1906673


Bat CoV: 2 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
389166:ncbi:3259,DQ648856,389166:ncbi:3259,3259,ncbi,Bat CoV 273/2005 scientific name,389166
389167:ncbi:3260,DQ648857,389167:ncbi:3260,3260,ncbi,Bat CoV 279/2005 scientific name,389167


Bat Hp: 2 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
1541205:ncbi:1362,KF636752,1541205:ncbi:1362,1362,ncbi,Bat Hp-betacoronavirus/Zhejiang2013 scientifi...,1541205
1541205:ncbi:2738,NC_025217,1541205:ncbi:2738,2738,ncbi,Bat Hp-betacoronavirus/Zhejiang2013 scientifi...,1541205


Bat SARS CoV: 3 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
349344:ncbi:3254,DQ071615,349344:ncbi:3254,3254,ncbi,Bat SARS CoV Rp3/2004 scientific name,349344
347537:ncbi:3256,DQ412042,347537:ncbi:3256,3256,ncbi,Bat SARS CoV Rf1/2004 scientific name,347537
347536:ncbi:3257,DQ412043,347536:ncbi:3257,3257,ncbi,Bat SARS CoV Rm1/2004 scientific name,347536


Bat SARS coronavirus: 15 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
338605:ncbi:3252,DQ084199,338605:ncbi:3252,3252,ncbi,Bat SARS coronavirus HKU3-2 scientific name,338605
338606:ncbi:3253,DQ084200,338606:ncbi:3253,3253,ncbi,Bat SARS coronavirus HKU3-3 scientific name,338606
333387:ncbi:3255,DQ022305,333387:ncbi:3255,3255,ncbi,Bat SARS coronavirus HKU3-1 scientific name,333387
742001:ncbi:3263,GQ153539,742001:ncbi:3263,3263,ncbi,Bat SARS coronavirus HKU3-4 scientific name,742001
742002:ncbi:3264,GQ153540,742002:ncbi:3264,3264,ncbi,Bat SARS coronavirus HKU3-5 scientific name,742002
742003:ncbi:3265,GQ153541,742003:ncbi:3265,3265,ncbi,Bat SARS coronavirus HKU3-6 scientific name,742003
742004:ncbi:3266,GQ153542,742004:ncbi:3266,3266,ncbi,Bat SARS coronavirus HKU3-7 scientific name,742004
742005:ncbi:3267,GQ153543,742005:ncbi:3267,3267,ncbi,Bat SARS coronavirus HKU3-8 scientific name,742005
742006:ncbi:3268,GQ153544,742006:ncbi:3268,3268,ncbi,Bat SARS coronavirus HKU3-9 scientific name,742006
741997:ncbi:3269,GQ153545,741997:ncbi:3269,3269,ncbi,Bat SARS coronavirus HKU3-10 scientific name,741997


Bat SARS-like: 18 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
1415851:ncbi:3273,KC881005,1415851:ncbi:3273,3273,ncbi,Bat SARS-like coronavirus RsSHC014 scientific...,1415851
1415834:ncbi:3274,KC881006,1415834:ncbi:3274,3274,ncbi,Bat SARS-like coronavirus Rs3367 scientific name,1415834
1415852:ncbi:3275,KF367457,1415852:ncbi:3275,3275,ncbi,Bat SARS-like coronavirus WIV1 scientific name,1415852
1699360:ncbi:3277,KP886808,1699360:ncbi:3277,3277,ncbi,Bat SARS-like coronavirus YNLF_31C scientific...,1699360
1699361:ncbi:3278,KP886809,1699361:ncbi:3278,3278,ncbi,Bat SARS-like coronavirus YNLF_34C scientific...,1699361
1508227:ncbi:3280,KY417142,1508227:ncbi:3280,3280,ncbi,Bat SARS-like coronavirus scientific name,1508227
1508227:ncbi:3281,KY417143,1508227:ncbi:3281,3281,ncbi,Bat SARS-like coronavirus scientific name,1508227
1508227:ncbi:3282,KY417144,1508227:ncbi:3282,3282,ncbi,Bat SARS-like coronavirus scientific name,1508227
1508227:ncbi:3283,KY417145,1508227:ncbi:3283,3283,ncbi,Bat SARS-like coronavirus scientific name,1508227
1508227:ncbi:3284,KY417146,1508227:ncbi:3284,3284,ncbi,Bat SARS-like coronavirus scientific name,1508227


Bat coronavirus: 45 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
1244203:ncbi:18,MN477899,1244203:ncbi:18,18,ncbi,Bat coronavirus HKU10 scientific name,1244203
875613:ncbi:231,HM211101,875613:ncbi:231,231,ncbi,Bat coronavirus HKU9-10-2 scientific name,875613
424368:ncbi:256,EF065514,424368:ncbi:256,256,ncbi,Bat coronavirus HKU9-2 scientific name,424368
424361:ncbi:316,EF065507,424361:ncbi:316,316,ncbi,Bat coronavirus HKU4-3 scientific name,424361
389230:ncbi:454,DQ648794,389230:ncbi:454,454,ncbi,Bat coronavirus,389230
1244203:ncbi:535,MN477900,1244203:ncbi:535,535,ncbi,Bat coronavirus HKU10 scientific name,1244203
1244203:ncbi:691,MN477915,1244203:ncbi:691,691,ncbi,Bat coronavirus HKU10 scientific name,1244203
1244203:ncbi:735,MN477906,1244203:ncbi:735,735,ncbi,Bat coronavirus HKU10 scientific name,1244203
424362:ncbi:802,EF065508,424362:ncbi:802,802,ncbi,Bat coronavirus HKU4-4 scientific name,424362
424370:ncbi:896,EF065516,424370:ncbi:896,896,ncbi,Bat coronavirus HKU9-4 scientific name,424370


Betacoronavirus: 31 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
1928434:ncbi:157,MH687973,1928434:ncbi:157,157,ncbi,Betacoronavirus sp,1928434
1385427:ncbi:197,KC545383,1385427:ncbi:197,197,ncbi,Betacoronavirus Erinaceus/VMC/DEU/2012 scient...,1385427
1928434:ncbi:225,MH687969,1928434:ncbi:225,225,ncbi,Betacoronavirus sp,1928434
2720538:ncbi:247,MW246800,2720538:ncbi:247,247,ncbi,Betacoronavirus Erinaceus scientific name,2720538
1385427:ncbi:319,NC_039207,1385427:ncbi:319,319,ncbi,Betacoronavirus Erinaceus/VMC/DEU/2012 scient...,1385427
2720538:ncbi:410,MW246802,2720538:ncbi:410,410,ncbi,Betacoronavirus Erinaceus scientific name,2720538
1590370:ncbi:442,NC_026011,1590370:ncbi:442,442,ncbi,Betacoronavirus HKU24 scientific name,1590370
1928434:ncbi:463,MH687971,1928434:ncbi:463,463,ncbi,Betacoronavirus sp,1928434
694003:ncbi:502,MW773844,694003:ncbi:502,502,ncbi,Betacoronavirus 1 scientific name,694003
1590370:ncbi:697,KM349744,1590370:ncbi:697,697,ncbi,Betacoronavirus HKU24 scientific name,1590370


Bovine: 105 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
11128:ncbi:2,LC494191,11128:ncbi:2,2,ncbi,Bovine coronavirus scientific name,11128
11128:ncbi:63,LC494181,11128:ncbi:63,63,ncbi,Bovine coronavirus scientific name,11128
11128:ncbi:72,KU886219,11128:ncbi:72,72,ncbi,Bovine coronavirus scientific name,11128
11128:ncbi:74,AF391542,11128:ncbi:74,74,ncbi,Bovine coronavirus scientific name,11128
454963:ncbi:100,FJ938064,454963:ncbi:100,100,ncbi,Bovine coronavirus E-AH187-TC scientific name,454963
...,...,...,...,...,...,...
11128:ncbi:2998,LC494174,11128:ncbi:2998,2998,ncbi,Bovine coronavirus scientific name,11128
11128:ncbi:3054,LC494146,11128:ncbi:3054,3054,ncbi,Bovine coronavirus scientific name,11128
11128:ncbi:3132,MG757138,11128:ncbi:3132,3132,ncbi,Bovine coronavirus scientific name,11128
11128:ncbi:3219,LC494154,11128:ncbi:3219,3219,ncbi,Bovine coronavirus scientific name,11128


Bt: 23 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
1503291:ncbi:382,KJ473809,1503291:ncbi:382,382,ncbi,BtNv-AlphaCoV/SC2013 scientific name,1503291
1503302:ncbi:497,KJ473814,1503302:ncbi:497,497,ncbi,BtRs-BetaCoV/HuB2013 scientific name,1503302
1495253:ncbi:543,KJ473821,1495253:ncbi:543,543,ncbi,BtVs-BetaCoV/SC2013 scientific name,1495253
1503289:ncbi:887,NC_028811,1503289:ncbi:887,887,ncbi,BtMr-AlphaCoV/SAX2011 scientific name,1503289
1503299:ncbi:1168,KJ473811,1503299:ncbi:1168,1168,ncbi,BtRf-BetaCoV/JL2012 scientific name,1503299
1503286:ncbi:1252,KJ473798,1503286:ncbi:1252,1252,ncbi,BtMf-AlphaCoV/HuB2013 scientific name,1503286
1503278:ncbi:1503,KJ473795,1503278:ncbi:1503,1503,ncbi,BtMf-AlphaCoV/AH2011 scientific name,1503278
1503280:ncbi:1681,KJ473797,1503280:ncbi:1681,1681,ncbi,BtMf-AlphaCoV/GD2012 scientific name,1503280
1503291:ncbi:1766,NC_028833,1503291:ncbi:1766,1766,ncbi,BtNv-AlphaCoV/SC2013 scientific name,1503291
1503279:ncbi:1870,KJ473799,1503279:ncbi:1870,1870,ncbi,BtMf-AlphaCoV/FJ2012 scientific name,1503279


Coronavirus: 16 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
2591237:ncbi:1,MK211378,2591237:ncbi:1,1,ncbi,Coronavirus BtRs-BetaCoV/YN2018D scientific name,2591237
2591233:ncbi:159,MK211374,2591233:ncbi:159,159,ncbi,Coronavirus BtRl-BetaCoV/SC2018 scientific name,2591233
2591236:ncbi:266,MK211377,2591236:ncbi:266,266,ncbi,Coronavirus BtRs-BetaCoV/YN2018C scientific name,2591236
1368314:ncbi:509,KC869678,1368314:ncbi:509,509,ncbi,Coronavirus Neoromicia/PML-PHE1/RSA/2011 scie...,1368314
1965089:ncbi:686,LC216915,1965089:ncbi:686,686,ncbi,Coronavirus HKU15 scientific name,1965089
2591230:ncbi:710,MK211371,2591230:ncbi:710,710,ncbi,Coronavirus BtSk-AlphaCoV/GX2018C scientific ...,2591230
2591229:ncbi:811,MK211370,2591229:ncbi:811,811,ncbi,Coronavirus BtSk-AlphaCoV/GX2018B scientific ...,2591229
1964806:ncbi:900,KX964649,1964806:ncbi:900,900,ncbi,Coronavirus AcCoV-JC34 scientific name,1964806
1965089:ncbi:1120,LC216914,1965089:ncbi:1120,1120,ncbi,Coronavirus HKU15 scientific name,1965089
2591231:ncbi:1217,MK211372,2591231:ncbi:1217,1217,ncbi,Coronavirus BtSk-AlphaCoV/GX2018D scientific ...,2591231


Deltacoronavirus: 8 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
1477411:ncbi:104,KJ601777,1477411:ncbi:104,104,ncbi,Deltacoronavirus PDCoV/USA/Illinois133/2014 s...,1477411
1911231:ncbi:1399,MT138108,1911231:ncbi:1399,1399,ncbi,Deltacoronavirus sp,1911231
1477414:ncbi:1401,KJ601780,1477414:ncbi:1401,1401,ncbi,Deltacoronavirus PDCoV/USA/Ohio137/2014 scien...,1477414
1477413:ncbi:1611,KJ601779,1477413:ncbi:1611,1611,ncbi,Deltacoronavirus PDCoV/USA/Illinois136/2014 s...,1477413
1465644:ncbi:1639,KJ481931,1465644:ncbi:1639,1639,ncbi,Deltacoronavirus PDCoV/USA/Illinois121/2014 s...,1465644
1911231:ncbi:2319,MT138105,1911231:ncbi:2319,2319,ncbi,Deltacoronavirus sp,1911231
1911231:ncbi:2427,MT138104,1911231:ncbi:2427,2427,ncbi,Deltacoronavirus sp,1911231
1477412:ncbi:3041,KJ601778,1477412:ncbi:3041,3041,ncbi,Deltacoronavirus PDCoV/USA/Illinois134/2014 s...,1477412


Feline: 93 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
454952:ncbi:75,FJ938054,454952:ncbi:75,75,ncbi,Feline coronavirus UU4 scientific name,454952
12663:ncbi:79,KX722530,12663:ncbi:79,79,ncbi,Feline coronavirus scientific name,12663
627434:ncbi:101,FJ938058,627434:ncbi:101,101,ncbi,Feline coronavirus UU16 scientific name,627434
12663:ncbi:111,KF530130,12663:ncbi:111,111,ncbi,Feline coronavirus scientific name,12663
12663:ncbi:112,KF530133,12663:ncbi:112,112,ncbi,Feline coronavirus scientific name,12663
...,...,...,...,...,...,...
454955:ncbi:3100,FJ938053,454955:ncbi:3100,3100,ncbi,Feline coronavirus UU7 scientific name,454955
12663:ncbi:3148,KY566210,12663:ncbi:3148,3148,ncbi,Feline coronavirus scientific name,12663
12663:ncbi:3212,KF530127,12663:ncbi:3212,3212,ncbi,Feline coronavirus scientific name,12663
12663:ncbi:3227,KU215425,12663:ncbi:3227,3227,ncbi,Feline coronavirus scientific name,12663


Human: 381 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
31631:ncbi:3,KY967361,31631:ncbi:3,3,ncbi,Human coronavirus OC43 scientific name,31631
277944:ncbi:4,LC654455,277944:ncbi:4,4,ncbi,Human coronavirus NL63 scientific name,277944
290028:ncbi:11,KT779556,290028:ncbi:11,11,ncbi,Human coronavirus HKU1 scientific name,290028
277944:ncbi:20,JQ900257,277944:ncbi:20,20,ncbi,Human coronavirus NL63 scientific name,277944
11137:ncbi:21,LC654446,11137:ncbi:21,21,ncbi,Human coronavirus 229E scientific name,11137
...,...,...,...,...,...,...
31631:ncbi:3183,MG197714,31631:ncbi:3183,3183,ncbi,Human coronavirus OC43 scientific name,31631
277944:ncbi:3204,KY554967,277944:ncbi:3204,3204,ncbi,Human coronavirus NL63 scientific name,277944
31631:ncbi:3208,KX344031,31631:ncbi:3208,3208,ncbi,Human coronavirus OC43 scientific name,31631
31631:ncbi:3223,KF530096,31631:ncbi:3223,3223,ncbi,Human coronavirus OC43 scientific name,31631


Infectious bronchitis: 435 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
11120:ncbi:5,MN987231,11120:ncbi:5,5,ncbi,Infectious bronchitis virus scientific name,11120
11120:ncbi:17,MW792514,11120:ncbi:17,17,ncbi,Infectious bronchitis virus scientific name,11120
11120:ncbi:19,EU526388,11120:ncbi:19,19,ncbi,Infectious bronchitis virus scientific name,11120
11120:ncbi:28,KX219795,11120:ncbi:28,28,ncbi,Infectious bronchitis virus scientific name,11120
11120:ncbi:39,KJ425512,11120:ncbi:39,39,ncbi,Infectious bronchitis virus scientific name,11120
...,...,...,...,...,...,...
11120:ncbi:3196,JF828981,11120:ncbi:3196,3196,ncbi,Infectious bronchitis virus scientific name,11120
11120:ncbi:3203,MN509589,11120:ncbi:3203,3203,ncbi,Infectious bronchitis virus scientific name,11120
11120:ncbi:3210,KX364296,11120:ncbi:3210,3210,ncbi,Infectious bronchitis virus scientific name,11120
11120:ncbi:3244,KJ425490,11120:ncbi:3244,3244,ncbi,Infectious bronchitis virus scientific name,11120


Middle East respiratory syndrome: 606 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
1335626:ncbi:12,KT374055,1335626:ncbi:12,12,ncbi,Middle East respiratory syndrome-related coron...,1335626
1335626:ncbi:14,KF961222,1335626:ncbi:14,14,ncbi,Middle East respiratory syndrome-related coron...,1335626
1335626:ncbi:15,MF598676,1335626:ncbi:15,15,ncbi,Middle East respiratory syndrome-related coron...,1335626
1335626:ncbi:16,KY581695,1335626:ncbi:16,16,ncbi,Middle East respiratory syndrome-related coron...,1335626
1335626:ncbi:22,MG987420,1335626:ncbi:22,22,ncbi,Middle East respiratory syndrome-related coron...,1335626
...,...,...,...,...,...,...
1335626:ncbi:3232,MF598624,1335626:ncbi:3232,3232,ncbi,Middle East respiratory syndrome-related coron...,1335626
1335626:ncbi:3235,MG757605,1335626:ncbi:3235,3235,ncbi,Middle East respiratory syndrome-related coron...,1335626
1335626:ncbi:3239,KT156560,1335626:ncbi:3239,3239,ncbi,Middle East respiratory syndrome-related coron...,1335626
1335626:ncbi:3241,KT121572,1335626:ncbi:3241,3241,ncbi,Middle East respiratory syndrome-related coron...,1335626


Murine coronavirus: 18 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
694005:ncbi:180,JX169867,694005:ncbi:180,180,ncbi,Murine coronavirus scientific name,694005
429331:ncbi:405,FJ647221,429331:ncbi:405,405,ncbi,Murine coronavirus repA59/RJHM scientific name,429331
694005:ncbi:468,KF268338,694005:ncbi:468,468,ncbi,Murine coronavirus scientific name,694005
502106:ncbi:514,FJ647223,502106:ncbi:514,514,ncbi,Murine coronavirus MHV-1 scientific name,502106
430472:ncbi:559,FJ647222,430472:ncbi:559,559,ncbi,Murine coronavirus SA59/RJHM scientific name,430472
694005:ncbi:571,KF268337,694005:ncbi:571,571,ncbi,Murine coronavirus scientific name,694005
591069:ncbi:659,FJ647218,591069:ncbi:659,659,ncbi,Murine coronavirus RA59/R13 scientific name,591069
502104:ncbi:724,FJ647224,502104:ncbi:724,724,ncbi,Murine coronavirus MHV-3 scientific name,502104
591071:ncbi:785,FJ647225,591071:ncbi:785,785,ncbi,Murine coronavirus inf-MHV-A59 scientific name,591071
694005:ncbi:1332,KF268339,694005:ncbi:1332,1332,ncbi,Murine coronavirus scientific name,694005


Murine hepatitis: 16 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
11138:ncbi:13,AF208066,11138:ncbi:13,13,ncbi,Murine hepatitis virus scientific name,11138
11142:ncbi:703,FJ884687,11142:ncbi:703,703,ncbi,Murine hepatitis virus strain A59 scientific ...,11142
123595:ncbi:931,AF207902,123595:ncbi:931,931,ncbi,Murine hepatitis virus strain ML-11 scientifi...,123595
11138:ncbi:1308,NC_001846,11138:ncbi:1308,1308,ncbi,Murine hepatitis virus scientific name,11138
11144:ncbi:1366,AC_000192,11144:ncbi:1366,1366,ncbi,Murine hepatitis virus strain JHM scientific ...,11144
11142:ncbi:1571,FJ884686,11142:ncbi:1571,1571,ncbi,Murine hepatitis virus strain A59 scientific ...,11142
11138:ncbi:2013,MF618253,11138:ncbi:2013,2013,ncbi,Murine hepatitis virus scientific name,11138
11138:ncbi:2103,AB551247,11138:ncbi:2103,2103,ncbi,Murine hepatitis virus scientific name,11138
11138:ncbi:2184,AF029248,11138:ncbi:2184,2184,ncbi,Murine hepatitis virus scientific name,11138
11138:ncbi:2359,GU593319,11138:ncbi:2359,2359,ncbi,Murine hepatitis virus scientific name,11138


Pangolin coronavirus: 8 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
2708335:ncbi:425,MT084071,2708335:ncbi:425,425,ncbi,Pangolin coronavirus scientific name,2708335
2708335:ncbi:1709,MT040336,2708335:ncbi:1709,1709,ncbi,Pangolin coronavirus scientific name,2708335
2708335:ncbi:1878,MT040333,2708335:ncbi:1878,1878,ncbi,Pangolin coronavirus scientific name,2708335
2708335:ncbi:1946,MT072864,2708335:ncbi:1946,1946,ncbi,Pangolin coronavirus scientific name,2708335
2708335:ncbi:2387,MT121216,2708335:ncbi:2387,2387,ncbi,Pangolin coronavirus scientific name,2708335
2708335:ncbi:2758,MT072865,2708335:ncbi:2758,2758,ncbi,Pangolin coronavirus scientific name,2708335
2708335:ncbi:3039,MT040334,2708335:ncbi:3039,3039,ncbi,Pangolin coronavirus scientific name,2708335
2708335:ncbi:3220,MT040335,2708335:ncbi:3220,3220,ncbi,Pangolin coronavirus scientific name,2708335


Porcine coronavirus: 13 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
1159905:ncbi:182,JQ065042,1159905:ncbi:182,182,ncbi,Porcine coronavirus HKU15 scientific name,1159905
1159905:ncbi:198,KJ584358,1159905:ncbi:198,198,ncbi,Porcine coronavirus HKU15 scientific name,1159905
1159905:ncbi:625,KM012168,1159905:ncbi:625,625,ncbi,Porcine coronavirus HKU15 scientific name,1159905
1159905:ncbi:814,KT381613,1159905:ncbi:814,814,ncbi,Porcine coronavirus HKU15 scientific name,1159905
1159905:ncbi:897,KJ584356,1159905:ncbi:897,897,ncbi,Porcine coronavirus HKU15 scientific name,1159905
1159905:ncbi:1340,KJ584359,1159905:ncbi:1340,1340,ncbi,Porcine coronavirus HKU15 scientific name,1159905
1159905:ncbi:1341,NC_039208,1159905:ncbi:1341,1341,ncbi,Porcine coronavirus HKU15 scientific name,1159905
1159905:ncbi:2206,KJ620016,1159905:ncbi:2206,2206,ncbi,Porcine coronavirus HKU15 scientific name,1159905
1159905:ncbi:2344,KJ584357,1159905:ncbi:2344,2344,ncbi,Porcine coronavirus HKU15 scientific name,1159905
1159905:ncbi:2896,KJ584355,1159905:ncbi:2896,2896,ncbi,Porcine coronavirus HKU15 scientific name,1159905


Porcine deltacoronavirus: 126 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
1586324:ncbi:37,KT336560,1586324:ncbi:37,37,ncbi,Porcine deltacoronavirus scientific name,1586324
1586324:ncbi:38,KU981060,1586324:ncbi:38,38,ncbi,Porcine deltacoronavirus scientific name,1586324
1586324:ncbi:46,MG837133,1586324:ncbi:46,46,ncbi,Porcine deltacoronavirus scientific name,1586324
1586324:ncbi:68,KR265862,1586324:ncbi:68,68,ncbi,Porcine deltacoronavirus scientific name,1586324
1586324:ncbi:92,MF642322,1586324:ncbi:92,92,ncbi,Porcine deltacoronavirus scientific name,1586324
...,...,...,...,...,...,...
1586324:ncbi:3140,KU984334,1586324:ncbi:3140,3140,ncbi,Porcine deltacoronavirus scientific name,1586324
1586324:ncbi:3147,KY513725,1586324:ncbi:3147,3147,ncbi,Porcine deltacoronavirus scientific name,1586324
1586324:ncbi:3177,LC260045,1586324:ncbi:3177,3177,ncbi,Porcine deltacoronavirus scientific name,1586324
1586324:ncbi:3178,MF095123,1586324:ncbi:3178,3178,ncbi,Porcine deltacoronavirus scientific name,1586324


Porcine epidemic diarrhea virus: 822 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
28295:ncbi:6,KU893866,28295:ncbi:6,6,ncbi,Porcine epidemic diarrhea virus scientific name,28295
28295:ncbi:7,KJ645638,28295:ncbi:7,7,ncbi,Porcine epidemic diarrhea virus scientific name,28295
28295:ncbi:8,KJ645678,28295:ncbi:8,8,ncbi,Porcine epidemic diarrhea virus scientific name,28295
28295:ncbi:9,KR873434,28295:ncbi:9,9,ncbi,Porcine epidemic diarrhea virus scientific name,28295
28295:ncbi:23,KJ645699,28295:ncbi:23,23,ncbi,Porcine epidemic diarrhea virus scientific name,28295
...,...,...,...,...,...,...
28295:ncbi:3240,KJ645655,28295:ncbi:3240,3240,ncbi,Porcine epidemic diarrhea virus scientific name,28295
28295:ncbi:3242,KT021229,28295:ncbi:3242,3242,ncbi,Porcine epidemic diarrhea virus scientific name,28295
28295:ncbi:3243,KR265844,28295:ncbi:3243,3243,ncbi,Porcine epidemic diarrhea virus scientific name,28295
28295:ncbi:3245,KX812524,28295:ncbi:3245,3245,ncbi,Porcine epidemic diarrhea virus scientific name,28295


SARS: 2 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
1739625:ncbi:2200,KT444582,1739625:ncbi:2200,2200,ncbi,SARS-like coronavirus WIV16 scientific name,1739625
722424:ncbi:3261,FJ588686,722424:ncbi:3261,3261,ncbi,SARS coronavirus Rs_672/2006 scientific name,722424


Severe acute respiratory syndrome: 23 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
2697049:ncbi:3249,MN908947,2697049:ncbi:3249,3249,ncbi,Severe acute respiratory syndrome coronavirus ...,2697049
694009:ncbi:3258,DQ497008,694009:ncbi:3258,3258,ncbi,Severe acute respiratory syndrome-related coro...,694009
694009:ncbi:3262,FJ959407,694009:ncbi:3262,3262,ncbi,Severe acute respiratory syndrome-related coro...,694009
694009:ncbi:3279,KY352407,694009:ncbi:3279,3279,ncbi,Severe acute respiratory syndrome-related coro...,694009
694009:ncbi:3296,LC556375,694009:ncbi:3296,3296,ncbi,Severe acute respiratory syndrome-related coro...,694009
694009:ncbi:3297,HG994852,694009:ncbi:3297,3297,ncbi,Severe acute respiratory syndrome-related coro...,694009
694009:ncbi:3298,HG994853,694009:ncbi:3298,3298,ncbi,Severe acute respiratory syndrome-related coro...,694009
694009:ncbi:3299,HG994854,694009:ncbi:3299,3299,ncbi,Severe acute respiratory syndrome-related coro...,694009
694009:ncbi:3300,HG994855,694009:ncbi:3300,3300,ncbi,Severe acute respiratory syndrome-related coro...,694009
694009:ncbi:3301,HG994856,694009:ncbi:3301,3301,ncbi,Severe acute respiratory syndrome-related coro...,694009


TGEV: 4 species:


Unnamed: 0,accession,seqid,seqnb,source,species,taxonomyid
398810:ncbi:102,DQ811786,398810:ncbi:102,102,ncbi,TGEV Miller M60 scientific name,398810
398809:ncbi:406,DQ811785,398809:ncbi:406,406,ncbi,TGEV Miller M6 scientific name,398809
398812:ncbi:2064,DQ811789,398812:ncbi:2064,2064,ncbi,TGEV virulent Purdue scientific name,398812
398811:ncbi:2291,DQ811788,398811:ncbi:2291,2291,ncbi,TGEV Purdue P115 scientific name,398811


2879

## Split full sequence fasta into group fasta files

Define a subclass of `FastaFileReader` that splits the full fasta file into one fasta file per group.

In [None]:
class SplitFastaFile(FastaFileReader):
    """Subclass of `FastaFileReader`. Splits a fasta file into group files based on the pattern found at the start of `species`"""
    def __init__(
        self, 
        p2fasta: Path,        # path to the original fasta file
        p2groups: Path,       # path to the directory where to store all groups fasta files 
        group_patterns:list,  # string patterns to search in `species` for each group
        overwrite:bool=False, # overwrite existing group fasta if True
    ):
        """Store the settings and prepare the path and counter of each sequence group"""
        self.p2fasta = p2fasta
        super().__init__(p2fasta)
        self.p2groups = p2groups
        self.patterns = group_patterns
        self.overwrite = overwrite

        self._create_group_info()
        print(f"Ready to split <{self.p2fasta.name}> into {len(group_patterns):,d} groups")
        print(f"Grouped sequence fasta files to be saved into <{self.p2groups.absolute()}>")

        
    def _create_group_info(self):
        """Create paths and counts for each sequence group"""
        os.makedirs(self.p2groups, exist_ok=True)
        self.paths = {}
        self.counts = {}
        for g in self.patterns:
            self.paths[g] = self.p2groups / f"seqs_{g.lower().replace(' ','_')}.fa"
            self.counts[g] = 0
        self.paths['others'] = self.p2groups / "seqs_others.fa"
        self.counts['others'] = 0
   
        
    def __enter__(self):
        """Open a file handle for each of the sequence groups"""
        self.fhandles = {}
        for g, p in self.paths.items():
            if p.is_file() and self.overwrite:
                p.unlink()
            elif p.is_file() and not self.overwrite:
                raise ValueError(f"{p.name} already exists. Set `overwrite` to True if you want to overwrite it")

            self.fhandles[g] = open(p, 'a')
        return self
                
    def __exit__(self, exc_type, exc_value, exc_tb):
        """Closes file handles for each of the sequence groups"""
        for g, fh in self.fhandles.items():
            fh.close()

            
    def _get_group(self, seq):
        """Retrieve the group from a sequence based on its `species`. Default group is 'others'"""
        metadata = parse_metadata_fasta_cov(seq['definition line'])
        species = metadata['species']
        if species is None:
            group = 'others'
        else:
            # Use the patterns passed to the constructor, not the global list.
            # The first pattern that prefixes `species` wins; no match means 'others'
            group = next((p for p in self.patterns if species.startswith(p)), 'others')
        return group
        
    def split_into_groups(self):
        """Create one fasta file including sequences of each group, or group `others` for sequences not in listed groups"""
        with self:
            for seq in self.it:
                group = self._get_group(seq)
                fh = self.fhandles[group]
                self.counts[group] += 1
                fh.write(f"{seq['definition line']}")
                fh.write(f"{seq['sequence']}\n")
        print(f"Created {len(self.patterns)} new sequence files:")
        print('\n'.join([f"  - {p.name:.<50s}: {(self.counts[g]):3d} sequences" for g, p in self.paths.items()]))

In [None]:
show_doc(SplitFastaFile)

---

### SplitFastaFile

>      SplitFastaFile (p2fasta:pathlib.Path, p2groups:pathlib.Path,
>                      group_patterns:list, overwrite:bool=False)

Subclass of `FastaFileReader`. Splits a fasta file into group files based on the pattern found at the start of `species`

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| p2fasta | Path |  | path to the original fasta file |
| p2groups | Path |  | path to the directory where to store all groups fasta files |
| group_patterns | list |  | string patterns to search in `species` for each group |
| overwrite | bool | False | overwrite existing group fasta if True |

In [None]:
show_doc(SplitFastaFile.split_into_groups)

---

### SplitFastaFile.split_into_groups

>      SplitFastaFile.split_into_groups ()

Create one fasta file including sequences of each group, or group `others` for sequences not in listed groups
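
The group assignment relies on prefix matching: a sequence goes to the first pattern that its `species` starts with, and to `others` when nothing matches. The sketch below isolates that rule from the file handling. The function name `assign_group` and the short pattern list are hypothetical and only serve the illustration; the species strings are taken from the tables above.

```python
def assign_group(species, patterns, default="others"):
    """Return the first pattern that `species` starts with, else `default`."""
    if species is None:
        return default
    return next((p for p in patterns if species.startswith(p)), default)


# Hypothetical subset of the group patterns used in this notebook
patterns = ["Bat SARS CoV", "Bat SARS coronavirus", "SARS", "TGEV"]

examples = [
    "Bat SARS CoV Rp3/2004  scientific name",
    "SARS-like coronavirus WIV16  scientific name",
    "Yak coronavirus  scientific name",
    None,
]
groups = [assign_group(s, patterns) for s in examples]
print(groups)
```

Because the first match wins, the order of the patterns matters as soon as one pattern is a prefix of another. In that case the more specific pattern should come first in the list.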

In [None]:
p2fasta = p2inputs / 'cov_virus_sequences.fa'
# p2fasta = p2inputs / 'cov_virus_sequences_ten.fa'

p2cov_data = Path("../../data/cov_data/").resolve()
groups_subdir = 'groups_1'
p2groups = p2cov_data / groups_subdir


sf = SplitFastaFile(p2fasta, p2groups, species_groups, overwrite=True)

Ready to split <cov_virus_sequences.fa> into 25 groups
Grouped sequence fasta files to be saved into </home/vtec/projects/bio/metagentools/data/cov_data/groups_1>


In [None]:
sf.split_into_groups()

Created 25 new sequence files:
  - seqs_alphacoronavirus.fa..........................:  64 sequences
  - seqs_bat_cov.fa...................................:   2 sequences
  - seqs_bat_hp.fa....................................:   2 sequences
  - seqs_bat_sars_cov.fa..............................:   3 sequences
  - seqs_bat_sars_coronavirus.fa......................:  15 sequences
  - seqs_bat_sars-like.fa.............................:  18 sequences
  - seqs_bat_coronavirus.fa...........................:  45 sequences
  - seqs_betacoronavirus.fa...........................:  31 sequences
  - seqs_bovine.fa....................................: 105 sequences
  - seqs_bt.fa........................................:  23 sequences
  - seqs_coronavirus.fa...............................:  16 sequences
  - seqs_deltacoronavirus.fa..........................:   8 sequences
  - seqs_feline.fa....................................:  93 sequences
  - seqs_human.fa.....................................: 381

# Build pipeline

**Pipeline Idea**:

Use a large fasta file and iterate over manageable chunks. For each chunk:
- create a fasta file for the chunk (`.fa`)
- create simreads with Art Illumina (`fq`, `aln`)
- preprocess simreads into ds and info
- use model to infer taxonomy and position
- build the inference result dataset including:
    - predicted result
    - ref sequence metadata
    - position ground truth
- save partial inference result dataset as parquet

When iteration is done:
- merge all partial inference result datasets into one single dataset
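
The chunked iteration at the heart of this idea can be sketched with the standard library only. The helpers below (`iter_fasta_records`, `iter_chunks`, `write_chunk`) are hypothetical names and not part of `metagentools`; the simulation, preprocessing, inference and parquet steps are left as a comment because they depend on the project code.

```python
import tempfile
from itertools import islice
from pathlib import Path


def iter_fasta_records(p2fasta):
    """Yield (definition_line, sequence) tuples, reading one record at a time."""
    defline, seq_parts = None, []
    with open(p2fasta) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if defline is not None:
                    yield defline, "".join(seq_parts)
                defline, seq_parts = line, []
            elif line:
                seq_parts.append(line)
    if defline is not None:
        yield defline, "".join(seq_parts)


def iter_chunks(records, chunk_size):
    """Group an iterator of records into lists of at most `chunk_size` items."""
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def write_chunk(chunk, p2chunk):
    """Write the records of one chunk into their own fasta file."""
    with open(p2chunk, "w") as fh:
        for defline, seq in chunk:
            fh.write(f"{defline}\n{seq}\n")


# Demo on a tiny fasta file with five made-up records
with tempfile.TemporaryDirectory() as tmp:
    p2demo = Path(tmp) / "demo.fa"
    p2demo.write_text("".join(f">seq{i}\nACGT\n" for i in range(5)))

    chunk_sizes = []
    for i, chunk in enumerate(iter_chunks(iter_fasta_records(p2demo), chunk_size=2)):
        p2chunk = Path(tmp) / f"chunk_{i:03d}.fa"
        write_chunk(chunk, p2chunk)
        chunk_sizes.append(len(chunk))
        # Simreads, preprocessing, inference and the partial parquet save would go here

print(chunk_sizes)
```

Only one chunk of records is held in memory at any time, which is the property needed to handle the full set of reads.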

In [None]:
p2cov_data = Path("../../data/cov_data/").resolve()
groups_subdir = 'groups_1'
p2groups = p2cov_data / groups_subdir

p2refs = p2groups / 'seqs_alphacoronavirus.fa'

### Technical Note: work with temporary files and directories

The `tempfile` module of the standard library provides:
- `TemporaryFile`: creates a new temporary file
- `TemporaryDirectory`: creates a new temporary directory, in which other files can be created

When an instance of either class is used as a context manager, the temporary file, or the temporary directory and all its content, is deleted when the context is exited.

In [None]:
# Creating the class and file/dir
tempdir = tempfile.TemporaryDirectory(prefix='dir_')

# Start the context
with tempdir:
    print(tempdir.name)
    p2dir = Path(tempdir.name)
    assert p2dir.is_dir()
    p2f = p2dir/'blabla.fq'
    p2f.touch()
    assert p2f.is_file()
    
# Folder and file deleted after context is exited
p2dir.is_dir(), p2f.is_file()

/tmp/dir_yriywizs


(False, False)
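
The same behaviour can be obtained by creating the object directly in the `with` statement. A minimal sketch, with made-up file names and content: `TemporaryDirectory` used with `as` yields the directory path as a string, and `TemporaryFile` yields an open file object.

```python
import tempfile
from pathlib import Path

# `as` binds the path of the temporary directory (a string)
with tempfile.TemporaryDirectory(prefix="dir_") as dirname:
    p2dir = Path(dirname)
    p2f = p2dir / "reads.fq"
    p2f.write_text("@read1\nACGT\n+\nIIII\n")
    existed_inside = p2f.is_file()

# The directory and its content are gone once the context is exited
removed_after = not p2dir.exists()

# `TemporaryFile` returns an open file object; mode "w+" allows text write then read
with tempfile.TemporaryFile(mode="w+") as tf:
    tf.write("ACGT")
    tf.seek(0)
    content = tf.read()

print(existed_inside, removed_after, content)
```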

# Others