# Review NCBI CoV Sequences and SimReads Data

In this notebook, we review the CoV sequences retrieved from NCBI and the corresponding simulated reads generated with ART Illumina.

Note: The notebook was built and tested to run locally but should also work on Colab or Kaggle. On Colab, it assumes that the shared project Google Drive directory is accessible through a shortcut called `Metagenomics` at the root of the drive.

# 1. Imports and setup environment

### Install and import packages

In [28]:
# Install required custom packages if not installed yet.
import importlib.util
if not importlib.util.find_spec('ecutilities'):
    print('installing package: `ecutilities`')
    ! pip install -qqU ecutilities
else:
    print('`ecutilities` already installed')
if not importlib.util.find_spec('metagentools'):
    print('installing package: `metagentools`')
    ! pip install -qqU metagentools
else:
    print('`metagentools` already installed')

`ecutilities` already installed
`metagentools` already installed


In [74]:
# Import all required packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os

from ecutilities.core import files_in_tree
from ecutilities.ipython import nb_setup
from IPython.display import display, Markdown
from pathlib import Path
from pprint import pprint
from tqdm.notebook import tqdm, trange

# Setup the notebook for development
nb_setup()

from metagentools.cnn_virus.data import FastaFileReader, FastqFileReader, AlnFileReader
from metagentools.core import ProjectFileSystem

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Set autoreload mode


# 2. Review project file system

This project adopts a unified file structure to make coding and collaboration easier. In addition, we can run the code locally (from a `project-root` directory) or in the cloud (Colab or Kaggle).

The unified file structure when running locally is:
```text
    project-root   
        |--- data
        |      |--- CNN_Virus_data  (all data related to CNN Virus original paper)
        |      |--- models          (trained and finetuned models)
        |      |--- ....            (raw and pre-processed data from various sources)  
        |      
        |--- nbs  (all reference and work notebooks)
        |      |--- cnn_virus
        |      |        |--- notebooks.ipynb
```

When running on Google Colab, it is assumed that a Google Drive is mounted on the Colab server instance, and that the Google Drive root includes a shortcut named `Metagenomics` pointing to the shared project directory. The shared project directory is accessible [here](https://drive.google.com/drive/folders/134uei5fmt08TpzhmjG4sW0FQ06kn2ZfZ) if your Google account was given access permission.
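The local versus Colab distinction described above could be detected along the following lines. This is only a sketch: the mount point `/content/gdrive`, the `MyDrive` folder and the two function names are assumptions made for illustration, and `ProjectFileSystem` may resolve its paths differently.

```python
import sys
from pathlib import Path


def running_on_colab() -> bool:
    """Return True when the `google.colab` module is already loaded in this session."""
    return 'google.colab' in sys.modules


def guess_project_root(local_root: Path, gdrive_mount: Path = Path('/content/gdrive')) -> Path:
    """Pick a project root depending on where the code runs (hypothetical layout)."""
    if running_on_colab():
        # Assumption: the drive is mounted at `gdrive_mount` and the shortcut sits under `MyDrive`
        return gdrive_mount / 'MyDrive' / 'Metagenomics'
    return local_root


root = guess_project_root(Path.home() / 'projects' / 'metagentools')
print(f"Project root would be: {root}")
```

Checking `sys.modules` avoids importing anything that may not be installed on a local machine.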

To simplify access to the unified file structure, `metagentools` provides a `ProjectFileSystem` class (see [documentation](https://vtecftwy.github.io/metagentools/core.html#projectfilesystem) for more info).

Key folders and system information

In [75]:
pfs = ProjectFileSystem()
pfs.info()

Running linux on local computer
Device's home directory: /home/vtec
Project file structure:
 - Root ........ /home/vtec/projects/bio/metagentools 
 - Data Dir .... /home/vtec/projects/bio/metagentools/data 
 - Notebooks ... /home/vtec/projects/bio/metagentools/nbs


In [76]:
pfs.readme()

*/home/vtec/projects/bio/metagentools/data*

This directory includes all the data for the project `metagentools`.

```text
data
     |--- CNN_Virus_data  (all data related to CNN Virus original paper)
     |--- models          (trained and finetuned models)
     |--- ncbi            (refsequences, simreads, datasets and infer_results for CoV from NCBI)
     |--- ncov_data       (refsequences, simreads, datasets and infer_results for non CoV sequences)
     |--- ....            (raw and pre-processed data from various sources)  
     
```



### Explore files in NCBI data directory

In [77]:
files_in_tree(pfs.data/'ncbi');

data
  |--ncbi
  |    |--refsequences
  |    |    |--cov_virus_sequence_001-seq1.fa (0)
  |    |    |--cov_virus_sequences.txt (1)
  |    |    |--cov_virus_sequences_100-seqs.fa (2)
  |    |    |--readme.md (3)
  |    |    |--cov_virus_sequences_002-seqs.fa (4)
  |    |    |--cov_virus_sequences_025-seqs.fa (5)
  |    |    |--cov_virus_sequences_010-seqs.fa (6)
  |    |    |--cov_virus_sequence_001-seq2.fa (7)
  |    |    |--cov_virus_original_cnn_sequences.json (8)
  |    |    |--cov_virus_sequences.fa (9)
  |    |--simreads
  |    |    |--readme.md (10)
  |    |--infer_results
  |    |    |--readme.md (11)
  |    |--ds
  |    |    |--single_10seq_50bp_metadata.csv (12)
  |    |    |--readme.md (13)
  |    |    |--single_10seq_50bp_ds (14)


In [78]:
for d in ['refsequences', 'simreads', 'infer_results']:
    path = pfs.data / 'ncbi' / d
    pfs.readme(dir_path=path)
    print(f"{'-'*100}")
    files_in_tree(path)
    print(f"{'='*100}")

*/home/vtec/projects/bio/metagentools/data/ncbi/refsequences*

# Information on CoV sequence files in this folder
This folder includes CoV sequences in fasta files, retrieved from the NCBI database. 

The main file is **`cov_virus_sequences.fa`**.
- it includes 3,318 sequences of corona virus with different types of hosts. 
- the names of the virus species is listed in the file `cov_virus_sequences.txt`


Smaller files are also available. Each of which includes a limited number of sequences to test the code on smaller datasets. The files are named `cov_virus_sequences_*-seqs.fa` , where `*` is the number of sequences in the file:
- `cov_virus_sequences_001-seq1.fa` includes 1 sequence
- `cov_virus_sequences_001-seq2.fa` includes 1 other sequence
- `cov_virus_sequences_002-seqs.fa` includes 2 sequences
- `cov_virus_sequences_010-seqs.fa` includes 10 sequences
- `cov_virus_sequences_025-seqs.fa` includes 25 sequences
- `cov_virus_sequences_100-seqs.fa` includes 100 sequences


Each sequence is preceded by the following Definition Line, with the following information:
|item | example |
|:---:|:-------|
|sequence id | (2591237:ncbi:1)|
|accession | MK211378|
|taxonomy id | 2591237|
|source |ncbi|
|sequence nbr |1|
|specie name |Coronavirus BtRs-BetaCoV/YN2018D|


----------------------------------------------------------------------------------------------------
ncbi
  |--refsequences
  |    |--cov_virus_sequence_001-seq1.fa (0)
  |    |--cov_virus_sequences.txt (1)
  |    |--cov_virus_sequences_100-seqs.fa (2)
  |    |--readme.md (3)
  |    |--cov_virus_sequences_002-seqs.fa (4)
  |    |--cov_virus_sequences_025-seqs.fa (5)
  |    |--cov_virus_sequences_010-seqs.fa (6)
  |    |--cov_virus_sequence_001-seq2.fa (7)
  |    |--cov_virus_original_cnn_sequences.json (8)
  |    |--cov_virus_sequences.fa (9)
  |    |--groups_1
  |    |    |--seqs_human.fa (10)
  |    |    |--seqs_porcine_epidemic_diarrhea_virus.fa (11)
  |    |    |--seqs_middle_east_respiratory_syndrome.fa (12)
  |    |    |--seqs_bovine.fa (13)
  |    |    |--seqs_betacoronavirus.fa (14)
  |    |    |--seqs_infectious_bronchitis.fa (15)
  |    |    |--seqs_feline.fa (16)
  |    |    |--seqs_tgev.fa (17)
  |    |    |--seqs_porcine_deltacoronavirus.fa (18)
  |    |    |--seqs_coronav

*/home/vtec/projects/bio/metagentools/data/ncbi/simreads*

# Information on CoV simulated read sequence files
This folder includes a set of simulated read sequence files generated from CoV sequences in `cov_data` and ARC Illumina. 


Each simread sub-directory is named as `<method>_<nb seq>_<nb bp>` where"
- `<method>` is either `single` or `paired` depending on the simulation method
- `<nb seq>` is the number of reference sequences used for simulation, and refers to the `fa` files in `cov_data`
- `<nb bp>` is the number of base pairs in the simulated read

Each directory includes simreads files made using a simulation method and a specific number of reference sequences.
- `xxx.fq` and `xxx.aln` files with method is `single`
- `xxx1.fq`, `xxx2.fq`, `xxx1.aln` and `xxx2.aln` files with method is `paired`.

Example:
- `paired_10seq_50bp` means that the simreads were generated by using the `paired` method to simulate 50-bp reads, and using the `fa` file `/cov_data/cov_virus_sequences_010-seqs.fa`
- `single_100seq_50bp` means that the simreads were generated by using the `single` method to simulate 50-bp reads, and using the `fa` file `/cov_data/cov_virus_sequences_100-seqs.fa`. Note that this generated 20,660,104 reads !



----------------------------------------------------------------------------------------------------
ncbi
  |--simreads
  |    |--readme.md (0)
  |    |--paired_25seq_150bp
  |    |    |--paired_25seq_150bp2.fq (1)
  |    |    |--paired_25seq_150bp1.aln (2)
  |    |    |--paired_25seq_150bp2.aln (3)
  |    |    |--paired_25seq_150bp1.fq (4)
  |    |--paired_100seq_150bp
  |    |    |--paired_100seq_150bp1.aln (5)
  |    |    |--paired_100seq_150bp2.aln (6)
  |    |    |--paired_100seq_150bp1.fq (7)
  |    |    |--paired_100seq_150bp2.fq (8)
  |    |--paired_100seq_50bp
  |    |    |--paired_100seq_50bp2.aln (9)
  |    |    |--paired_100seq_50bp1.aln (10)
  |    |    |--paired_100seq_50bp2.fq (11)
  |    |    |--paired_100seq_50bp1.fq (12)
  |    |--paired_10seq_150bp
  |    |    |--paired_10seq_150bp1.aln (13)
  |    |    |--paired_10seq_150bp2.fq (14)
  |    |    |--paired_10seq_150bp2.aln (15)
  |    |    |--paired_10seq_150bp1.fq (16)
  |    |--paired_10seq_50bp
  |    |    |--paire

*/home/vtec/projects/bio/metagentools/data/ncbi/infer_results*

# CoV Virus Inference Data
This folder includes results from inference using CoV simulated read sequences from `fq` and `aln` files in `cov_simreads`. 

## `cnn_virus`
The directory `cnn_virus` includes results from inference made with the original pretrained model. 

Results are saved into many individual `parquet` files during inference. Then they are merged into a single `parquet` file. 

Each inference experiment receives a unique 8-character unique ID.

Each inference experiment will therefore generate a set of files like follows, where `xxxxxxxx` is the experiment UID and `nnnn` is the index of a partial result file:
- `results_nnnn_infer_xxxxxxxx.parquet`
- `results_all_infer_xxxxxxxx.parquet`




----------------------------------------------------------------------------------------------------
ncbi
  |--infer_results
  |    |--readme.md (0)
  |    |--xlsx
  |    |    |--all_taxonomies_infer_5yucnk_6.xlsx (1)
  |    |    |--top25_sars_infer_5yucnk_6.xlsx (2)
  |    |    |--top25_cov_infer_5yucnk_6.xlsx (3)
  |    |    |--top25_mers_infer_5yucnk_6.xlsx (4)
  |    |--cnn_virus
  |    |    |--results_0051_infer_5yucnk_6.parquet (5)
  |    |    |--results_0049_infer_5yucnk_6.parquet (6)
  |    |    |--results_0059_infer_s4gjodd8.parquet (7)
  |    |    |--results_0021_infer_s4gjodd8.parquet (8)
  |    |    |--results_0032_infer_s4gjodd8.parquet (9)
  |    |    |--results_0006_infer_s4gjodd8.parquet (10)
  |    |    |--results_0058_infer_5yucnk_6.parquet (11)
  |    |    |--results_0031_infer_5yucnk_6.parquet (12)
  |    |    |--results_0013_infer_s4gjodd8.parquet (13)
  |    |    |--results_0044_infer_s4gjodd8.parquet (14)
  |    |    |--results_0049_infer_s4gjodd8.parquet (15)
  

# 3. Review fasta, fastq and aln files

## Setup paths

- `p2refseqs`: path to a reference sequence file (FASTA)
- `p2simreads`: path to folder where reads files are located (FASTQ and ALN)

In [80]:
p2refseqs = pfs.data / 'ncbi/refsequences/cov_virus_sequences_010-seqs.fa'
assert p2refseqs.is_file(), f"No file found at {p2refseqs.absolute()}"

p2simreads = pfs.data / 'ncbi/simreads/single_10seq_50bp'
assert p2simreads.is_dir(), f"No directory found at {p2simreads.absolute()}"

## Explore Reference Files (FASTA)

`FastaFileReader` is a class that makes it easier to read a fasta file and access its information.

In [81]:
fasta = FastaFileReader(path=p2refseqs)

In [82]:
fasta.print_first_chunks(nchunks=3)


Sequence 1:
>2591237:ncbi:1 [MK211378]	2591237	ncbi	1 [MK211378] 2591237	Coronavirus BtRs-BetaCoV/YN2018D		scientific name
TATTAGGTTTTCTACCTACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAAT ...

Sequence 2:
>11128:ncbi:2 [LC494191]	11128	ncbi	2 [LC494191] 11128	Bovine coronavirus		scientific name
CATCCCGCTTCACTGATCTCTTGTTAGATCTTTTCATAATCTAAACTTTATAAAAACATCCACTCCCTGTAGTCTATGCC ...

Sequence 3:
>31631:ncbi:3 [KY967361]	31631	ncbi	3 [KY967361] 31631	Human coronavirus OC43		scientific name
ATCTCTTGTTAGATCTTTTTGTAATCTAAACTTTATAAAAACATCCACTCCCTGTAATCTATGCTTGTGGGCGTAGATTT ...
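For readers unfamiliar with the FASTA layout shown above (one definition line starting with `>`, followed by one or more sequence lines), here is a minimal reader sketch. It is not the `FastaFileReader` implementation; the function name `iter_fasta` and the toy records are made up for illustration, and only the dictionary keys mirror the ones used in this notebook.

```python
import io
from typing import Dict, Iterator, TextIO


def iter_fasta(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per record, with the definition line and the joined sequence."""
    definition_line, chunks = None, []
    for raw in handle:
        line = raw.strip()
        if line.startswith('>'):
            # A new definition line closes the previous record, if any
            if definition_line is not None:
                yield {'definition line': definition_line, 'sequence': ''.join(chunks)}
            definition_line, chunks = line, []
        elif line and definition_line is not None:
            # Sequence lines may be wrapped, so they are collected and joined later
            chunks.append(line)
    if definition_line is not None:
        yield {'definition line': definition_line, 'sequence': ''.join(chunks)}


# Toy data, not taken from the project files
toy_fasta = ">toy:1 first record\nACGTAC\nGGTA\n>toy:2 second record\nTTGACA\n"
records = list(iter_fasta(io.StringIO(toy_fasta)))
for record in records:
    print(record['definition line'], f"{len(record['sequence'])} bases")
```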


Access each sequence (definition line and sequence) one by one

In [83]:
fasta.reset_iterator()

for i, seq in enumerate(fasta):
    print(f"Definition Line for sequence {i+1}:")
    print(seq['definition line'])
    print(f"{len(seq['sequence']):,d} bases:")
    print(seq['sequence'])
    print()
    if i >= 2: break

Definition Line for sequence 1:
>2591237:ncbi:1 [MK211378]	2591237	ncbi	1 [MK211378] 2591237	Coronavirus BtRs-BetaCoV/YN2018D		scientific name
30,213 bases:
TATTAGGTTTTCTACCTACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTAGCTGTCGCTCGGCTGCATGCCTAGTGCACCTACGCAGTATAAACAATAATAAATTTTACTGTCGTTGACAAGAAACGAGTAACTCGTCCCTCTTCTGCAGACTGCTTACGGTTCCGTCCGTGTTGCAGTCGATCATCAGCATACCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTTCTTGGTGTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTCTTCAGGTTAGAGACGTGCTAGTGCGTGGCTTCGGGGACTCTGTGGAAGAGGCCCTATCGGAGGCACGTGAACATCTTAAAAATGGCACTTGTGGTTTAGTAGAGCTGGAAAAAGGCGTACTGCCCCAGCTTGAACAGCCCTATGTGTTCATTAAACGTTCTGATGCCTTAAGCACCAATCACGGCCACAAGGTCGTTGAGCTGGTTGCAGAATTGGACGGCATTCAGTACGGTCGTAGCGGTATAACTCTGGGAGTACTCGTGCCACATGTGGGCGAAACCCCAATCGCATACCGCAATGTTCTTCTTCGTAAGAACGGTAATAAGGGAGCCGGTGGCCATAGCTTTGGCATCGATCTAAAGTCTTATGACTTAGGTGACGAGCTTGGTACTGATCCCATTGAAGATTATGAACAAAACTGGAACACTAAGCATGGCAGTGGTGTACTCCGTGAACTCACTCGTGAGCTCAATGGAGGTGCAGTCACTCGCTATGTCGACAACAACTTCTGTGGCCCA
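The definition lines printed above are tab-separated, with the sequence id and accession in the first field. The sketch below shows one way such a line could be split into metadata. It is an illustration and not the parser used by `FastaFileReader`: the regular expression and the function name are assumptions, and unlike the library output it keeps only the species name without the trailing `scientific name` label.

```python
import re

# First field looks like: >2591237:ncbi:1 [MK211378]
DEFLINE_HEAD = re.compile(
    r'^>(?P<seqid>(?P<taxonomyid>\d+):(?P<source>\w+):(?P<seqnb>\d+))\s+\[(?P<accession>[^\]]+)\]$'
)


def parse_definition_line(line: str) -> dict:
    """Split a tab-separated definition line into a metadata dict."""
    fields = line.rstrip('\n').split('\t')
    match = DEFLINE_HEAD.match(fields[0].strip())
    if match is None:
        raise ValueError(f"Unexpected definition line: {line!r}")
    meta = match.groupdict()
    # Assumption: the species name sits in the fifth tab-separated field
    meta['species'] = fields[4].strip() if len(fields) > 4 else ''
    return meta


defline = (">2591237:ncbi:1 [MK211378]\t2591237\tncbi\t1 [MK211378] 2591237"
           "\tCoronavirus BtRs-BetaCoV/YN2018D\t\tscientific name")
meta = parse_definition_line(defline)
for key, value in meta.items():
    print(f"{key:>10}: {value}")
```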

In [84]:
definition_lines = fasta.parse_file()
fasta.reset_iterator()

for i, (k, v) in enumerate(definition_lines.items()):
    print(f"Definition Line for sequence {i+1}:")
    print(next(fasta)['definition line'])
    print('Parsed data:')
    print(f"key: {k}\nmetadata:")
    pprint(v)
    print(f"{'='*100}")
    if i >= 2: break

Definition Line for sequence 1:
>2591237:ncbi:1 [MK211378]	2591237	ncbi	1 [MK211378] 2591237	Coronavirus BtRs-BetaCoV/YN2018D		scientific name
Parsed data:
key: 2591237:ncbi:1
metadata:
{'accession': 'MK211378',
 'seqid': '2591237:ncbi:1',
 'seqnb': '1',
 'source': 'ncbi',
 'species': 'Coronavirus BtRs-BetaCoV/YN2018D  scientific name',
 'taxonomyid': '2591237'}
Definition Line for sequence 2:
>11128:ncbi:2 [LC494191]	11128	ncbi	2 [LC494191] 11128	Bovine coronavirus		scientific name
Parsed data:
key: 11128:ncbi:2
metadata:
{'accession': 'LC494191',
 'seqid': '11128:ncbi:2',
 'seqnb': '2',
 'source': 'ncbi',
 'species': 'Bovine coronavirus  scientific name',
 'taxonomyid': '11128'}
Definition Line for sequence 3:
>31631:ncbi:3 [KY967361]	31631	ncbi	3 [KY967361] 31631	Human coronavirus OC43		scientific name
Parsed data:
key: 31631:ncbi:3
metadata:
{'accession': 'KY967361',
 'seqid': '31631:ncbi:3',
 'seqnb': '3',
 'source': 'ncbi',
 'species': 'Human coronavirus OC43  scientific name',
 

## Explore simulated read output files (FASTQ and ALN)

When simulating with ART Illumina:
- We simulate reads from reference sequences in a fasta file (`fasta` reader)
- ART returns two files:
    - a fastq file (`fastq` reader) with the read sequences, their quality and some read metadata
    - an aln file (`aln` reader), including:
        - a header with the simulation command and the list of reference sequences
        - additional metadata on each read, including: 
            - the readid and its number (rank in the list)
            - the read sequence itself (may include errors)
            - the same sequence as in the reference sequence
            - the read alignment starting position in the reference sequence
            - the reference sequence strand
            - the reference sequence id and its number in the sequence file
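To make the fastq side of this concrete: a FASTQ file stores each read on four lines (`@` plus the read id, the read sequence, a `+` separator line, and a quality string of the same length as the sequence). The sketch below reads such records from any text handle. The function name `iter_fastq`, the `quality` key and the placeholder quality string are illustrative assumptions, not part of `metagentools`.

```python
import io
from itertools import islice
from typing import Dict, Iterator, TextIO


def iter_fastq(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per four-line FASTQ record."""
    while True:
        record = [line.rstrip('\n') for line in islice(handle, 4)]
        if not any(record):
            return  # end of file, or only blank lines left
        if len(record) < 4 or not record[0].startswith('@') or not record[2].startswith('+'):
            raise ValueError(f"Malformed FASTQ record: {record}")
        yield {'readid': record[0][1:], 'readseq': record[1], 'quality': record[3]}


# One toy record; the quality string is a placeholder, not real ART output
read_seq = 'ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG'
toy_fastq = f"@2591237:ncbi:1-60400\n{read_seq}\n+\n{'I' * len(read_seq)}\n"

fastq_records = list(iter_fastq(io.StringIO(toy_fastq)))
for rec in fastq_records:
    print(rec['readid'], f"{len(rec['readseq'])} bases")
```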

As with FASTA files, `FastqFileReader` and `AlnFileReader` make it easier to load and work with fastq and aln files.

In [86]:
p2fastq = p2simreads / f"{p2simreads.stem}.fq"
assert p2fastq.is_file()
p2aln = p2simreads / f"{p2simreads.stem}.aln"
assert p2aln.is_file()

print(f" fq reads file:  {p2fastq.name}\n aln reads file: {p2aln.name}")

fastq = FastqFileReader(p2fastq)
aln = AlnFileReader(p2aln)

 fq reads file:  single_10seq_50bp.fq
 aln reads file: single_10seq_50bp.aln


### Exploring information about the simreads in fastq

In [88]:
for i, (k,v) in enumerate(fastq.parse_file(add_readseq=True).items()):
    print(f"Read {i}:")
    pprint(v)
    print()

    if i+1 >= 3: break

Read 0:
{'readid': '2591237:ncbi:1-60400',
 'readnb': '60400',
 'readseq': 'ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}

Read 1:
{'readid': '2591237:ncbi:1-60399',
 'readnb': '60399',
 'readseq': 'GATCAATGTGGCATCTACAATACAGACAGCATGAAGCACCACCAAAGGAC',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}

Read 2:
{'readid': '2591237:ncbi:1-60398',
 'readnb': '60398',
 'readseq': 'ATCTACCAGTGGTAGATGGGTTCTTAATAATGAACATTATAGAGCTCTAC',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}
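The read metadata shown above appears to be derived entirely from the read id, which combines the reference sequence id (`taxonomyid:source:seqnb`) and the read number after a dash. A small sketch of that decomposition, assuming this id layout holds for every read (the function name is hypothetical):

```python
def split_readid(readid: str) -> dict:
    """Split a read id such as '2591237:ncbi:1-60400' into its components."""
    # The read number follows the last dash; everything before it is the reference sequence id
    refseqid, _, readnb = readid.rpartition('-')
    reftaxonomyid, refsource, refseqnb = refseqid.split(':')
    return {
        'readid': readid,
        'readnb': readnb,
        'refseqnb': refseqnb,
        'refsource': refsource,
        'reftaxonomyid': reftaxonomyid,
    }


read_info = split_readid('2591237:ncbi:1-60400')
for key, value in read_info.items():
    print(f"{key:>14}: {value}")
```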



### Exploring information in ALN header

The header of ALN file includes:
- the ART command used to simulate the reads
- the list of reference sequences used to create the reads, with metadata

First, we can check which reference fasta file the simulated reads were generated from by looking at the ART command in the alignment file's header. We can then compare it with the file opened by our fasta reader:

In [89]:
aln.header['command']

'/bin/art_illumina -i /home/vtec/projects/bio/metagentools/data/cov_data/cov_virus_sequences_ten.fa -ss HS25 -l 50 -f 100 -o /home/vtec/projects/bio/metagentools/data/cov_simreads/single_10seq_50bp/single_10seq_50bp -rs 1674660835'
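The command string can also be broken down programmatically. The sketch below uses `shlex.split` from the standard library and pairs each flag with the value that follows it. The helper name is hypothetical; the flag meanings given in the comments (`-i` input file, `-l` read length, `-f` fold coverage, `-o` output prefix, `-rs` random seed, `-ss` sequencing system) follow the usual ART Illumina options and should be checked against the ART documentation for the version used.

```python
import shlex
from pathlib import Path


def parse_art_command(command: str) -> dict:
    """Map each command-line flag to its value (or to True when it has none)."""
    tokens = shlex.split(command)
    options = {'executable': tokens[0]}
    i = 1
    while i < len(tokens):
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith('-')
        options[tokens[i]] = tokens[i + 1] if has_value else True
        i += 2 if has_value else 1
    return options


command = ('/bin/art_illumina -i /home/vtec/projects/bio/metagentools/data/cov_data/cov_virus_sequences_ten.fa '
           '-ss HS25 -l 50 -f 100 '
           '-o /home/vtec/projects/bio/metagentools/data/cov_simreads/single_10seq_50bp/single_10seq_50bp '
           '-rs 1674660835')
art_options = parse_art_command(command)

print('input file    :', Path(art_options['-i']).name)
print('read length   :', art_options['-l'])
print('fold coverage :', art_options['-f'])
```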

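The command shows that the reads were simulated from the reference sequences in `data/cov_data/cov_virus_sequences_ten.fa`. This file holds the same ten sequences as the one loaded in `fasta`, even though the path and file name recorded in the header differ from the current ones: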

In [92]:
fasta.path

PosixPath('/home/vtec/projects/bio/metagentools/data/ncbi/refsequences/cov_virus_sequences_010-seqs.fa')

Then we can review the metadata available for each reference sequence, as listed in `aln`'s header.

This information can be seen in `aln.header` or with the method `parse_header_reference_sequences()`

In [117]:
aln.header

{'command': '/bin/art_illumina -i /home/vtec/projects/bio/metagentools/data/cov_data/cov_virus_sequences_ten.fa -ss HS25 -l 50 -f 100 -o /home/vtec/projects/bio/metagentools/data/cov_simreads/single_10seq_50bp/single_10seq_50bp -rs 1674660835',
 'reference sequences': ['@SQ\t2591237:ncbi:1 [MK211378]\t2591237\tncbi\t1 [MK211378] 2591237\tCoronavirus BtRs-BetaCoV/YN2018D\t\tscientific name\t30213',
  '@SQ\t11128:ncbi:2 [LC494191]\t11128\tncbi\t2 [LC494191] 11128\tBovine coronavirus\t\tscientific name\t30942',
  '@SQ\t31631:ncbi:3 [KY967361]\t31631\tncbi\t3 [KY967361] 31631\tHuman coronavirus OC43\t\tscientific name\t30661',
  '@SQ\t277944:ncbi:4 [LC654455]\t277944\tncbi\t4 [LC654455] 277944\tHuman coronavirus NL63\t\tscientific name\t27516',
  '@SQ\t11120:ncbi:5 [MN987231]\t11120\tncbi\t5 [MN987231] 11120\tInfectious bronchitis virus\t\tscientific name\t27617',
  '@SQ\t28295:ncbi:6 [KU893866]\t28295\tncbi\t6 [KU893866] 28295\tPorcine epidemic diarrhea virus\t\tscientific name\t28043',
 

With `AlnFileReader.parse_header_reference_sequences`, we can extract information on all the sequences used to generate the reads:

In [118]:
header_refseq_meta = aln.parse_header_reference_sequences()

print(f"Reads simulated from {len(header_refseq_meta)} reference sequences.\n")

for i, (k,v) in enumerate(header_refseq_meta.items()):
    print(f"Reference Sequence {i}:")
    pprint(v)
    print()

    if i+1 >= 3: break

Reads simulated from 10 reference sequences.

Reference Sequence 0:
{'refseq_accession': 'MK211378',
 'refseq_length': '30213',
 'refseqid': '2591237:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237',
 'species': 'Coronavirus BtRs-BetaCoV/YN2018D  scientific name'}

Reference Sequence 1:
{'refseq_accession': 'LC494191',
 'refseq_length': '30942',
 'refseqid': '11128:ncbi:2',
 'refseqnb': '2',
 'refsource': 'ncbi',
 'reftaxonomyid': '11128',
 'species': 'Bovine coronavirus  scientific name'}

Reference Sequence 2:
{'refseq_accession': 'KY967361',
 'refseq_length': '30661',
 'refseqid': '31631:ncbi:3',
 'refseqnb': '3',
 'refsource': 'ncbi',
 'reftaxonomyid': '31631',
 'species': 'Human coronavirus OC43  scientific name'}



We see that the first reference sequence is a **Coronavirus BtRs-BetaCoV/YN2018D**.

### Exploring metadata available for simulated reads

In [119]:
reads_meta = aln.parse_file(add_ref_seq_aligned=True, add_read_seq_aligned=True)

print(f"{len(reads_meta):,d} reads were simulated.\n")

for i, (k,v) in enumerate(reads_meta.items()):
    print(f"Read {i}")
    print(k)
    pprint(v)
    print()
    if i+1>=3: break

571,980 reads were simulated.

Read 0
2591237:ncbi:1-60400
{'aln_start_pos': '14770',
 'read_seq_aligned': 'ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG',
 'readid': '2591237:ncbi:1-60400',
 'readnb': '60400',
 'ref_seq_aligned': 'ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG',
 'refseq_strand': '+',
 'refseqid': '2591237:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}

Read 1
2591237:ncbi:1-60399
{'aln_start_pos': '17012',
 'read_seq_aligned': 'GATCAATGTGGCATCTACAATACAGACAGCATGAAGCACCACCAAAGGAC',
 'readid': '2591237:ncbi:1-60399',
 'readnb': '60399',
 'ref_seq_aligned': 'GATCAATGTGGCATCTACAATACAGACAGCATGAAGCACCACCAAAGGAC',
 'refseq_strand': '-',
 'refseqid': '2591237:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}

Read 2
2591237:ncbi:1-60398
{'aln_start_pos': '9188',
 'read_seq_aligned': 'ATCTACCAGTGGTAGATGGGTTCTTAATAATGAACATTATAGAGCTCTAC',
 'readid': '2591237:ncbi:1-60398',
 'readnb': '60398',
 'ref_seq_aligned':
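The per-read metadata above comes from the body of the ALN file. The following sketch parses a tiny ALN-like text, under the assumption that the file uses a `@CM` line for the command, `@SQ` lines for the reference sequences, a `##Header End` marker, and then three lines per read (a `>` line with reference id, read id, start position and strand, followed by the aligned reference and the aligned read). This layout is an assumption to be checked against the ART documentation, and the parser is not the `AlnFileReader` implementation; the toy command and `@SQ` line are shortened for illustration.

```python
import io
from typing import Dict, TextIO, Tuple


def parse_aln(handle: TextIO) -> Tuple[dict, Dict[str, dict]]:
    """Return (header, reads) parsed from an ALN-like text handle."""
    header = {'command': None, 'reference sequences': []}
    reads = {}
    lines = (line.rstrip('\n') for line in handle)
    # Header part: stops at the end-of-header marker
    for line in lines:
        if line.startswith('@CM'):
            header['command'] = line.split('\t', 1)[1]
        elif line.startswith('@SQ'):
            header['reference sequences'].append(line)
        elif line.startswith('##Header End'):
            break
    # Body part: one `>` line followed by two sequence lines per read
    for line in lines:
        if not line.startswith('>'):
            continue
        ref, readid, start, strand = line[1:].rsplit('\t', 3)
        ref_seq_aligned = next(lines, '')
        read_seq_aligned = next(lines, '')
        reads[readid] = {
            'refseqid': ref.split()[0],
            'aln_start_pos': start,
            'refseq_strand': strand,
            'ref_seq_aligned': ref_seq_aligned,
            'read_seq_aligned': read_seq_aligned,
        }
    return header, reads


seq = 'ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG'
toy_aln = (
    "##ART_Illumina\tread_length\t50\n"
    "@CM\tart_illumina -i toy.fa -ss HS25 -l 50 -f 100 -o toy\n"
    "@SQ\t2591237:ncbi:1 [MK211378]\t30213\n"
    "##Header End\n"
    ">2591237:ncbi:1\t2591237:ncbi:1-60400\t14770\t+\n"
    f"{seq}\n{seq}\n"
)
toy_header, toy_reads = parse_aln(io.StringIO(toy_aln))
print(toy_header['command'])
print(toy_reads['2591237:ncbi:1-60400']['aln_start_pos'])
```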

### Comparing metadata in fasta, fastq and aln

There is redundancy in the metadata across the three sets of files (`.fa`, `.fq`, `.aln`). Let's compare them to confirm that they are consistent.

Create the following dictionaries by parsing metadata from fasta, fastq and aln:
- `refseqs_fasta`: metadata in fasta sequence file used as reference sequences to create reads
- `simreads`: metadata on reads, as available in the fastq read file
- `refseqs_aln`: reference sequences with metadata as available in the aln header
- `simread_align`: read sequence and alignment info

In [120]:
refseqs_fasta = fasta.parse_file(add_seq=True)
simreads = fastq.parse_file(add_readseq=True)
refseqs_aln = aln.ref_sequences
simread_align = aln.parse_file(add_ref_seq_aligned=True, add_read_seq_aligned=True)

In [121]:
print(f"reads simulated from the following {len(refseqs_fasta)} full sequences from `{fasta.path.name}`:\n")
print('\n'.join([f" {i:02d}: {refseq['species']}" for i, refseq in enumerate(refseqs_fasta.values())]))

reads simulated from the following 10 full sequences from `cov_virus_sequences_ten.fa`:

 00: Coronavirus BtRs-BetaCoV/YN2018D  scientific name
 01: Bovine coronavirus  scientific name
 02: Human coronavirus OC43  scientific name
 03: Human coronavirus NL63  scientific name
 04: Infectious bronchitis virus  scientific name
 05: Porcine epidemic diarrhea virus  scientific name
 06: Porcine epidemic diarrhea virus  scientific name
 07: Porcine epidemic diarrhea virus  scientific name
 08: Porcine epidemic diarrhea virus  scientific name
 09: Camel alphacoronavirus  scientific name


In [122]:
print(f"{len(simreads):,d} reads generated and available in fastq file.")
print('For each read, following metadata is available in the fastq file:')
for s in ['readid', 'readnb', 'refseqnb', 'refsource', 'reftaxonomyid', 'readseq']:
    print(f" - {s}")

571,980 reads generated and available in fastq file.
For each read, following metadata is available in the fastq file:
 - readid
 - readnb
 - refseqnb
 - refsource
 - reftaxonomyid
 - readseq


In [123]:
print(f"{len(simread_align):,d} reads generated and available in aln file.")
print('For each read, following metadata is available in the aln file:')
for s in ['aln_start_pos', 'readid', 'readnb', 'refseq_strand', 'refseqid', 'refseqnb', 'refsource', 'reftaxonomyid', 'ref_seq_aligned', 'read_seq_aligned']:
    print(f" - {s}")

571,980 reads generated and available in aln file.
For each read, following metadata is available in the aln file:
 - aln_start_pos
 - readid
 - readnb
 - refseq_strand
 - refseqid
 - refseqnb
 - refsource
 - reftaxonomyid
 - ref_seq_aligned
 - read_seq_aligned
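Since the fastq and aln files describe the same reads, the keys they share can be compared mechanically. The sketch below does this on two toy dictionaries shaped like `simreads` and `simread_align`; the helper name `find_mismatches` and the toy dictionaries are illustrative, with values copied from the outputs shown earlier in this notebook.

```python
SHARED_KEYS = ('readid', 'readnb', 'refseqnb', 'refsource', 'reftaxonomyid')


def find_mismatches(fq_meta: dict, aln_meta: dict, keys=SHARED_KEYS) -> list:
    """List (readid, problem) pairs where the two metadata sets disagree."""
    problems = []
    # Reads present in only one of the two files
    for readid in sorted(fq_meta.keys() ^ aln_meta.keys()):
        problems.append((readid, 'missing in one file'))
    # Reads present in both: compare the shared metadata keys
    for readid in sorted(fq_meta.keys() & aln_meta.keys()):
        for key in keys:
            if fq_meta[readid].get(key) != aln_meta[readid].get(key):
                problems.append((readid, key))
    return problems


common = {'refseqnb': '1', 'refsource': 'ncbi', 'reftaxonomyid': '2591237'}
simreads_toy = {
    '2591237:ncbi:1-60400': {'readid': '2591237:ncbi:1-60400', 'readnb': '60400', **common},
    '2591237:ncbi:1-60399': {'readid': '2591237:ncbi:1-60399', 'readnb': '60399', **common},
}
simread_align_toy = {
    readid: {**meta, 'refseqid': '2591237:ncbi:1'} for readid, meta in simreads_toy.items()
}

mismatches = find_mismatches(simreads_toy, simread_align_toy)
print(f"{len(mismatches)} mismatches found")
```

On the full dictionaries, an empty list would confirm that the metadata shared by the two files agree for every read.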


Check consistency between refseqs from fasta and from aln 

In [124]:
# utility functions
def complementary_strand(seq):
    """Return the complementary strand of `seq` (A<->T, C<->G), without reversing it."""
    conv = {'A':'T', 'C':'G', 'G':'C', 'T':'A'}
    return ''.join([conv[base] for base in seq])

strand = 'ATCCGTGGGT'
print(strand, complementary_strand(strand))

def reverse_sequence(seq):
    """Return `seq` read in the opposite direction."""
    return seq[::-1]

print(strand, reverse_sequence(strand))

ATCCGTGGGT TAGGCACCCA
ATCCGTGGGT TGGGTGCCTA
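The two utilities above are typically combined into a reverse complement. As an alternative sketch, the same result can be obtained with `str.translate`, which also makes it easy to pass through ambiguous `N` bases (the dictionary-based version above would raise a `KeyError` on them). The name `reverse_complement` and the handling of `N` and lowercase letters are choices made for this example.

```python
# Translation table covering upper and lower case bases, with N mapped to itself
COMPLEMENT = str.maketrans('ACGTNacgtn', 'TGCANtgcan')


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


strand = 'ATCCGTGGGT'
print(strand, reverse_complement(strand))
```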


### Check aln reference sequence information

In [125]:
refseqid = '2591237:ncbi:1'
# refseqid = '11128:ncbi:2'

In the source `.fa` file

In [126]:
original_seq = refseqs_fasta[refseqid]['sequence']
original_seq_accession = refseqs_fasta[refseqid]['accession']

original_seq_accession, len(original_seq)

('MK211378', 30213)

In the output `.aln` file

In [127]:
assert original_seq_accession == refseqs_aln[refseqid]['refseq_accession']
assert len(original_seq) == int(refseqs_aln[refseqid]['refseq_length'])

refseqs_aln[refseqid]['refseq_accession'], int(refseqs_aln[refseqid]['refseq_length'])

('MK211378', 30213)

### Check alignment information

In [128]:
pprint(simread_align['2591237:ncbi:1-60400'])

{'aln_start_pos': '14770',
 'read_seq_aligned': 'ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG',
 'readid': '2591237:ncbi:1-60400',
 'readnb': '60400',
 'ref_seq_aligned': 'ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG',
 'refseq_strand': '+',
 'refseqid': '2591237:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}


Select all reads generated from a single reference sequence

In [129]:
print(f"Select all reads from reference sequence '{refseqid}'")
reads_from_refseq = {k:v for k,v in simread_align.items() if v['refseqid']==refseqid}
nbr_generated_reads = len(reads_from_refseq)
print(f"Total of {nbr_generated_reads:,d} reads")

Select all reads from reference sequence '2591237:ncbi:1'
Total of 60,400 reads
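This count can be sanity-checked against the simulation settings. Assuming fold coverage is defined as (number of reads × read length) / reference length, the `-f 100` and `-l 50` values from the ART command and the 30,213-base reference give a rough estimate of the expected number of reads. The function name is hypothetical, and the simulator's exact count may differ slightly from this approximation.

```python
def expected_read_count(ref_length: int, fold_coverage: float, read_length: int) -> float:
    """Rough number of reads needed to reach `fold_coverage` over a reference."""
    return fold_coverage * ref_length / read_length


estimate = expected_read_count(ref_length=30_213, fold_coverage=100, read_length=50)
observed = 60_400  # count reported above for reference sequence '2591237:ncbi:1'

print(f"estimated reads : {estimate:,.0f}")
print(f"observed reads  : {observed:,d}")
print(f"observed / estimated : {observed / estimate:.2%}")
```

The estimate (60,426 reads) and the observed count (60,400 reads) agree to within a fraction of a percent.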


In [130]:
n = 6
selected_simread = [v for k,v in reads_from_refseq.items()][n]
pprint(selected_simread)

{'aln_start_pos': '11417',
 'read_seq_aligned': 'CTCTAACTATTCTGGTGTCGTCACGACTATCATGTTTTTAGCTAGAGCTA',
 'readid': '2591237:ncbi:1-60394',
 'readnb': '60394',
 'ref_seq_aligned': 'CTCTAACTATTCTGGTGTCGTCACGACTATCATGTTTTTAGCTAGAGCTA',
 'refseq_strand': '+',
 'refseqid': '2591237:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}


In [131]:
def check_alignment(n, reads):
    """Print the n-th read in `reads` next to the matching segment of `original_seq`.

    Relies on the global `original_seq` and assumes 50-bp reads.
    """
    selected_simread = list(reads.values())[n]
    print(f"{'='*80}")
    print(f"checking read {selected_simread['readid']}")
    start = int(selected_simread['aln_start_pos'])
    strand = selected_simread['refseq_strand']
    print("simread info:")
    print(f" - from `{strand}` strand")
    print(f" - position: {start:,d}")

    if strand == '+':
        segment_from_refseq = original_seq[start:start+50]
    else:
        # on the `-` strand, the position indexes into the reverse complement of the reference
        segment_from_refseq = complementary_strand(reverse_sequence(original_seq)[start:start+50])

    print('sequences:')
    print('- simread seq          :', selected_simread['read_seq_aligned'])
    print('- refseq aligned       :', selected_simread['ref_seq_aligned'])
    print('- segment in orig. seq :', segment_from_refseq)

In [132]:
for n in range(nbr_generated_reads-1, nbr_generated_reads-6, -1):
    check_alignment(n, reads_from_refseq)

checking read 2591237:ncbi:1-1
simread info:
 - from `+` strand
 - position: 12,125
sequences:
- simread seq          : CAAAAAGTTAAAGAAATCTTTGAATGTGGCTAAATCTGAGTTTGACCGTG
- refseq aligned       : CAAAAAGTTAAAGAAATCTTTGAATGTGGCTAAATCTGAGTTTGACCGTG
- segment in orig. seq : CAAAAAGTTAAAGAAATCTTTGAATGTGGCTAAATCTGAGTTTGACCGTG
checking read 2591237:ncbi:1-2
simread info:
 - from `-` strand
 - position: 15,136
sequences:
- simread seq          : TCTATTTGTCATAGTACTACAGATAGAGACACCAGCTACGGTGCGAGCTC
- refseq aligned       : TCTATTTGTCATAGTACTACAGATAGAGACACCAGCTACGGTGCGAGCTC
- segment in orig. seq : TCTATTTGTCATAGTACTACAGATAGAGACACCAGCTACGGTGCGAGCTC
checking read 2591237:ncbi:1-3
simread info:
 - from `+` strand
 - position: 19,210
sequences:
- simread seq          : TGATGGTGGTAGCTTGTATGTGAATAAGCATGCATTCCACACTCCAGCTT
- refseq aligned       : TGATGGTGGTAGCTTGTATGTGAATAAGCATGCATTCCACACTCCAGCTT
- segment in orig. seq : TGATGGTGGTAGCTTGTATGTGAATAAGCATGCATTCCACACTCCAGCTT
checking read 2591237:ncbi:1-4
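The coordinate convention used in `check_alignment` can be illustrated on a toy reference. The sketch below assumes, as the code above does, that a `-` strand position is counted from the start of the reverse complement of the reference; it also shows how that position maps back to forward-strand coordinates. All names and the 12-base sequence are made up for the example.

```python
COMPLEMENT = str.maketrans('ACGT', 'TGCA')


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def segment_for_read(refseq: str, start: int, strand: str, read_length: int) -> str:
    """Segment of the reference a read should match, in read orientation."""
    template = refseq if strand == '+' else revcomp(refseq)
    return template[start:start + read_length]


def forward_start(refseq_length: int, start: int, strand: str, read_length: int) -> int:
    """Start of the read footprint expressed on the forward strand."""
    return start if strand == '+' else refseq_length - start - read_length


toy_ref = 'ATGCCGTAAGCT'
plus_segment = segment_for_read(toy_ref, start=2, strand='+', read_length=4)
minus_segment = segment_for_read(toy_ref, start=3, strand='-', read_length=4)
fwd = forward_start(len(toy_ref), start=3, strand='-', read_length=4)

print('+ strand segment:', plus_segment)
print('- strand segment:', minus_segment)
print('forward-strand footprint of the - read:', toy_ref[fwd:fwd + 4])
```

The `-` strand segment is the reverse complement of its forward-strand footprint, which is why both the simulated read and the aligned reference match the reverse-complemented slice in the checks above.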


## end of section