# Review NCBI CoV Sequences and SimReads Data

In this notebook, we review the CoV sequences retrieved from NCBI and the corresponding simulated reads generated with ART Illumina.

Note: The notebook was built and tested to run locally but should also work on Colab or Kaggle. If on Colab, it assumes that the project shared gdrive directory is accessible through a shortcut called `Metagenomics` under the root of gdrive.

# 1. Imports and setup environment

### Install and import packages

In [1]:
# Install required custom packages if not installed yet.
import importlib.util
if not importlib.util.find_spec('ecutilities'):
    print('installing package: `ecutilities`')
    ! pip install -qqU ecutilities
else:
    print('`ecutilities` already installed')
if not importlib.util.find_spec('metagentools'):
    print('installing package: `metagentools`')
    ! pip install -qqU metagentools
else:
    print('`metagentools` already installed')

`ecutilities` already installed
`metagentools` already installed


In [2]:
# Import all required packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os

from ecutilities.core import files_in_tree
from ecutilities.ipython import nb_setup
from IPython.display import display, Markdown
from pathlib import Path
from pprint import pprint
from tqdm.notebook import tqdm, trange

# Setup the notebook for development
nb_setup()

from metagentools.cnn_virus.data import FastaFileReader, FastqFileReader, AlnFileReader
from metagentools.core import ProjectFileSystem

Set autoreload mode


# 2. Review project file system

This project adopts a unified file structure to make coding and collaboration easier. In addition, we can run the code locally (from a `project-root` directory) or in the cloud (Colab or Kaggle).

The unified file structure when running locally is:
```text
    project-root   
        |--- data
        |      |--- CNN_Virus_data  (all data related to CNN Virus original paper)
        |      |--- models          (trained and finetuned models)
        |      |--- ....            (raw and pre-processed data from various sources)  
        |      
        |--- nbs  (all reference and work notebooks)
        |      |--- cnn_virus
        |      |        |--- notebooks.ipynb
```

When running on Google Colab, it is assumed that a Google Drive is mounted on the Colab server instance, and that the Google Drive root includes a shortcut named `Metagenomics` pointing to the project shared directory. The project shared directory is accessible [here](https://drive.google.com/drive/folders/134uei5fmt08TpzhmjG4sW0FQ06kn2ZfZ) if your Google account was given access permission.

To make access to the unified file system easier, `metagentools` provides a `ProjectFileSystem` class (see [documentation](https://vtecftwy.github.io/metagentools/core.html#projectfilesystem) for more info).
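To illustrate the idea of a unified file system, here is a minimal sketch of how a project root could be located locally or on Colab. It is not the `ProjectFileSystem` implementation: the helper names `find_local_root` and `guess_project_root`, the marker directories, and the Colab mount path `/content/drive/MyDrive/Metagenomics` are all assumptions made for this example.

```python
import sys
from pathlib import Path
from typing import Optional, Sequence

# Assumption: Google Drive is mounted at /content/drive and the shortcut sits at its root.
COLAB_PROJECT_ROOT = Path("/content/drive/MyDrive/Metagenomics")


def find_local_root(start: Path, markers: Sequence[str] = ("data", "nbs")) -> Optional[Path]:
    """Return the closest ancestor of `start` (or `start` itself) holding all `markers`."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if all((candidate / name).is_dir() for name in markers):
            return candidate
    return None  # no directory with the expected layout was found


def guess_project_root(start: Path, on_colab: Optional[bool] = None) -> Optional[Path]:
    """Pick the Colab shortcut when on Colab, else search upwards from `start`."""
    if on_colab is None:
        # Common idiom: the `google.colab` module is loaded in Colab kernels only
        on_colab = "google.colab" in sys.modules
    return COLAB_PROJECT_ROOT if on_colab else find_local_root(start)


if __name__ == "__main__":
    print(guess_project_root(Path.cwd()))
```

In practice, `ProjectFileSystem` should be preferred; the sketch only shows why the same notebook code can run unchanged in both environments once the root is known.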

Key folders and system information

In [3]:
pfs = ProjectFileSystem()
pfs.info()

Running linux on local computer
Device's home directory: /home/vtec
Project file structure:
 - Root ........ /home/vtec/projects/bio/metagentools 
 - Data Dir .... /home/vtec/projects/bio/metagentools/data 
 - Notebooks ... /home/vtec/projects/bio/metagentools/nbs


In [4]:
pfs.readme()

ReadMe file for directory `data`:

### Data structure for `metagentools`
This directory includes all the data required for the project `metagentools`.

```text
data
 |--- CNN_Virus_data 
 |--- ncbi                
 |--- saved         
 |--- yf
 |--- ....           
     
```
#### Sub-directories
- `CNN_Virus_data`: includes all the data related to the original CNN Virus paper, i.e. training data and validation data in a format that can be used by the CNN Virus code.
- `ncbi`: includes data related to the use of viral sequences from NCBI: reference sequences, simulated reads, inference datasets, inference results.
- `saved`: includes model saved parameters and preprocessing datasets.
- `yf`: includes all data related to real yellow fever reads, from "wet" samples

Also available on AWS S3 at `https://s3.ap-southeast-1.amazonaws.com/bio.cnn-virus.data/data/...`

### Explore files in NCBI data directory

In [5]:
files_in_tree(pfs.data/'ncbi');

data
  |--ncbi
  |    |--readme.md (0)
  |    |--refsequences
  |    |    |--readme.md (1)
  |    |--simreads
  |    |    |--readme.md (2)
  |    |--infer_results
  |    |    |--readme.md (3)
  |    |--ds
  |    |    |--readme.md (4)


In [6]:
for d in ['refsequences', 'simreads', 'infer_results']:
    path = pfs.data / 'ncbi' / d
    pfs.readme(dir_path=path)
    print(f"{'-'*100}")
    files_in_tree(path)
    print(f"{'='*100}")

ReadMe file for directory `data/ncbi/refsequences`:

### NCBI Reference Sequences

This directory includes the reference sequences from NCBI, for each virus group. 

Each group of viruses is grouped in one directory:
- `cov`: all reference sequences and related metadata files for corona (sars and mers)
- `mRhiFer1`:  a reference sequence different from CoV and YF to test false positive predictions
- `yf`: all reference sequences and related metadata files for yellow fever


----------------------------------------------------------------------------------------------------
ncbi
  |--refsequences
  |    |--readme.md (0)
  |    |--mRhiFer1
  |    |    |--Rhinolophus_ferrumequinum.mRhiFer1_v1.p.dna_rm.primary_assembly.1.fa (1)
  |    |    |--Rhinolophus_ferrumequinum.mRhiFer1_v1.p.dna.primary_assembly.1.fa.gz (2)
  |    |    |--Rhinolophus_ferrumequinum.mRhiFer1_v1.p.dna_rm.primary_assembly.1.clean.fa (3)
  |    |    |--readme.md (4)
  |    |    |--Rhinolophus_ferrumequinum.mRhiFer1_v1.p.dna_rm.primary_assembly.1.fa.gz (5)
  |    |--yf
  |    |    |--readme.md (6)
  |    |    |--yf_AY968064_Angola_1971.fa (7)
  |    |    |--yf_2023_multiple_alignment.fa (8)
  |    |--cov
  |    |    |--cov_virus_sequence_001-seq1.fa (9)
  |    |    |--cov_virus_sequences.txt (10)
  |    |    |--cov_virus_sequences-original.txt (11)
  |    |    |--cov_virus_sequences_100-seqs.fa (12)
  |    |    |--readme.md (13)
  |    |    |--taxonomyid-label-mapping.json (14)
  |    |    |

ReadMe file for directory `data/ncbi/simreads`:

### NCBI simulated reads
This directory includes all sets of simulated read sequence files generated from NCBI viral sequences using ART Illumina. 

```ascii
this-directory
    |--cov
    |    |
    |    |--single_10seq_50bp
    |    |    |--single_10seq_50bp.fq
    |    |    |--single_10seq_50bp.aln
    |    |-- ...
    |    |--single_100seq_150bp
    |    |    |--single_100seq_150bp.fq
    |    |    |--single_100seq_150bp.aln
    |    |--paired_100seq_50bp
    |    |    |--paired_100seq_50bp2.aln
    |    |    |--paired_100seq_50bp1.aln
    |    |    |--paired_100seq_50bp2.fq
    |    |    |--paired_100seq_50bp1.fq
    |    |-- ...
    |    |
    |---yf
    |    |
    |    |--yf_AY968064-single-150bp
    |    |    |--yf_AY968064-single-1seq-150bp.fq
    |    |    |--yf_AY968064-single-1seq-150bp.aln
    |    |
    |--mRhiFer1
    |    |--mRhiFer1_v1.p.dna_rm.primary_assembly.1
    |    |    |--mRhiFer1_v1.p.dna_rm.primary_assembly.1.fq
    |    |    |--mRhiFer1_v1.p.dna_rm.primary_assembly.1.aln
    |    |

```

This directory includes several subdirectories, one for each virus, e.g. `cov` for corona, `yf` for yellow fever.

In each virus subdirectory, several simreads directories include simulated reads generated with various parameters. They are named `<method>_<nb-seq>_<nb-bp>` where:
- `<method>` is either `single` or `paired` depending on the simulation method
- `<nb-seq>` is the number of reference sequences used for simulation, and refers to the `fa` file used
- `<nb-bp>` is the number of base pairs used to simulate reads


Each sub-directory includes simreads files made using a simulation method and a specific number of reference sequences.
- `xxx.fq` and `xxx.aln` files when method is `single`
- `xxx1.fq`, `xxx2.fq`, `xxx1.aln` and `xxx2.aln` files when method is `paired`.

Example:
- `paired_10seq_50bp` means that the simreads were generated by using the `paired` method to simulate 50-bp reads, and using the `fa` file `cov_virus_sequences_010-seqs.fa`.
- `single_100seq_50bp` means that the simreads were generated by using the `single` method to simulate 50-bp reads, and using the `fa` file `cov_virus_sequences_100-seqs.fa`. Note that this generated 20,660,104 reads !

#### Simread file formats

Simulated reads information is split between two files:
- **FASTQ** (`.fq`) files providing the read sequences and their ASCII quality scores
- **ALN** (`.aln`) files with alignment information

##### FASTQ (`.fq`)
FASTQ files generated by ART Illumina have the following structure (showing 5 reads), with 4 lines for each read:

```ascii
@2591237:ncbi:1-60400
ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG
+
CCCBCGFGBGGGGGGGBGGGGGGGGG>GGG1G=/GGGGGGGGGGGGGGGG
@2591237:ncbi:1-60399
GATCAATGTGGCATCTACAATACAGACAGCATGAAGCACCACCAAAGGAC
+
BCBCCFGGGGGGGG1CGGGG<GGBGGGGGFGCGGGGGGDGGG/GG1GGGG
@2591237:ncbi:1-60398
ATCTACCAGTGGTAGATGGGTTCTTAATAATGAACATTATAGAGCTCTAC
+
CCCCCGGGEGG1GGF1G/GGEGGGGGGGGGGGGFFGGGGGGGGGGDGGDG
@2591237:ncbi:1-60397
CGTAAAGTAGAGGCTGTATGGTAGCTAGCACAAATGCCAGCACCAATAGG
+
BCCCCGGGFGGGGGGFGGGGFGG1GGGGGGG>GG1GGGGGGGGGGE<GGG
@2591237:ncbi:1-60396
GGTATCGGGTATCTCCTGCATCAATGCAAGGTCTTACAAAGATAAATACT
+
CBCCCGGG@CGGGGGGGGGGGG=GFGGGGDGGGFG1GGGGGGGG@GGGGG
```
The following information can be parsed from each read in the FASTQ file:

- Line 1: `readid`, a unique ID for the read, in the format `@readid` 
- Line 2: `readseq`, the sequence of the read
- Line 3: a separator `+`
- Line 4: `read_qscores`, the base quality scores encoded in ASCII 

Example:
```
@2591237:ncbi:1-60400
ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG
+
CCCBCGFGBGGGGGGGBGGGGGGGGG>GGG1G=/GGGGGGGGGGGGGGGG
```
- `readid` = `2591237:ncbi:1-60400`
- `readseq` = `ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG`, a 50 bp read
- `read_qscores` = `CCCBCGFGBGGGGGGGBGGGGGGGGG>GGG1G=/GGGGGGGGGGGGGGGG`


##### ALN (`.aln`) 
ALN files generated by ART Illumina consist of:
- a header with the ART Illumina command used for the simulation (`@CM`) and info on each of the reference sequences used for the simulation (`@SQ`). The header always starts with `##ART_Illumina` and ends with `##Header End`
- the body, with 3 lines for each read:
    1. definition line with:
        - the reference sequence identification number `refseqid`, 
        - the read identifier `readid`,
        - the position of the read in the reference sequence `aln_start_pos` 
        - the strand the read was taken from `ref_seq_strand`: `+` for coding strand and `-` for template strand
    2. aligned reference sequence, that is the sequence segment in the original reference corresponding to the read
    3. aligned read sequence, that is the simulated read sequence, where each bp corresponds to the reference sequence bp in the same position.

Example of an ALN file generated by ART Illumina (showing the first 3 reads):

```ascii
##ART_Illumina    read_length    50
@CM    /bin/art_illumina -i /home/vtec/projects/bio/metagentools/data/cov_data/cov_virus_sequences_ten.fa -ss HS25 -l 50 -f 100 -o /home/vtec/projects/bio/metagentools/data/cov_simreads/single_10seq_50bp/single_10seq_50bp -rs 1674660835
@SQ    2591237:ncbi:1 [MK211378]    2591237    ncbi    1 [MK211378] 2591237    Coronavirus BtRs-BetaCoV/YN2018D    30213
@SQ    11128:ncbi:2 [LC494191]    11128    ncbi    2 [LC494191] 11128    Bovine coronavirus    30942
@SQ    31631:ncbi:3 [KY967361]    31631    ncbi    3 [KY967361] 31631    Human coronavirus OC43        30661
@SQ    277944:ncbi:4 [LC654455]    277944    ncbi    4 [LC654455] 277944    Human coronavirus NL63    27516
@SQ    11120:ncbi:5 [MN987231]    11120    ncbi    5 [MN987231] 11120    Infectious bronchitis virus    27617
@SQ    28295:ncbi:6 [KU893866]    28295    ncbi    6 [KU893866] 28295    Porcine epidemic diarrhea virus    28043
@SQ    28295:ncbi:7 [KJ645638]    28295    ncbi    7 [KJ645638] 28295    Porcine epidemic diarrhea virus    27998
@SQ    28295:ncbi:8 [KJ645678]    28295    ncbi    8 [KJ645678] 28295    Porcine epidemic diarrhea virus    27998
@SQ    28295:ncbi:9 [KR873434]    28295    ncbi    9 [KR873434] 28295    Porcine epidemic diarrhea virus    28038
@SQ    1699095:ncbi:10 [KT368904]    1699095    ncbi    10 [KT368904] 1699095    Camel alphacoronavirus    27395
##Header End
>2591237:ncbi:1    2591237:ncbi:1-60400    14770    +
ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG
ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG
>2591237:ncbi:1    2591237:ncbi:1-60399    17012    -
GATCAATGTGGCATCTACAATACAGACAGCATGAAGCACCACCAAAGGAC
GATCAATGTGGCATCTACAATACAGACAGCATGAAGCACCACCAAAGGAC
>2591237:ncbi:1    2591237:ncbi:1-60398    9188    +
ATCTACCAGTGGTAGATGGGTTCTTAATAATGAACATTATAGAGCTCTAC
ATCTACCAGTGGTAGATGGGTTCTTAATAATGAACATTATAGAGCTCTAC
.....
```

----------------------------------------------------------------------------------------------------
ncbi
  |--simreads
  |    |--readme.md (0)
  |    |--mRhiFer1
  |    |    |--readme.md (1)
  |    |--yf
  |    |    |--readme.md (2)
  |    |--cov
  |    |    |--readme.md (3)


ReadMe file for directory `data/ncbi/infer_results`:

#### Inference Result Files

Each directory includes a set of prediction/inference result files.

----------------------------------------------------------------------------------------------------
ncbi
  |--infer_results
  |    |--readme.md (0)
  |    |--cov-ncbi
  |    |    |--readme.md (1)
  |    |--yf-ncbi
  |    |    |--yf_2023-single-69seq-150bp-2024-05-02_16_29_18-probs.csv (2)
  |    |    |--readme.md (3)
  |    |    |--yf_AY968064-single-1seq-150bp-2024-05-02_01_10_12-probs.csv (4)
  |    |    |--yf_AY968064-single-1seq-150bp-2024-05-02_01_10_12-results.csv (5)
  |    |    |--yf_2023-single-69seq-150bp-2024-05-02_16_29_18-results.csv (6)


# 3. Review fasta, fastq and aln files

## Setup paths

- `p2refseqs`: path to a reference sequence file (FASTA)
- `p2simreads`: path to folder where reads files are located (FASTQ and ALN)

In [7]:
p2refseqs = pfs.data / 'ncbi/refsequences/cov/cov_virus_sequences_010-seqs.fa'
assert p2refseqs.is_file(), f"No file found at {p2refseqs.absolute()}"

p2simreads = pfs.data / 'ncbi/simreads/cov/single_10seq_50bp'
assert p2simreads.is_dir(), f"No directory found at {p2simreads.absolute()}"

## Explore Reference Files (FASTA)

`FastaFileReader` is a class that makes reading and accessing fasta file information easier.

In [8]:
fasta = FastaFileReader(path=p2refseqs)

In [9]:
fasta.print_first_chunks(nchunks=3)


Sequence 1:
>2591237:ncbi:1 [MK211378]	2591237	ncbi	1 [MK211378] 2591237	Coronavirus BtRs-BetaCoV/YN2018D		scientific name
TATTAGGTTTTCTACCTACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAAT ...

Sequence 2:
>11128:ncbi:2 [LC494191]	11128	ncbi	2 [LC494191] 11128	Bovine coronavirus		scientific name
CATCCCGCTTCACTGATCTCTTGTTAGATCTTTTCATAATCTAAACTTTATAAAAACATCCACTCCCTGTAGTCTATGCC ...

Sequence 3:
>31631:ncbi:3 [KY967361]	31631	ncbi	3 [KY967361] 31631	Human coronavirus OC43		scientific name
ATCTCTTGTTAGATCTTTTTGTAATCTAAACTTTATAAAAACATCCACTCCCTGTAATCTATGCTTGTGGGCGTAGATTT ...


Access each sequence (definition line and sequence) one by one

In [10]:
fasta.reset_iterator()

for i, seq in enumerate(fasta):
    print(f"Definition Line for sequence {i+1}:")
    print(seq['definition line'])
    print(f"{len(seq['sequence']):,d} bases:")
    print(seq['sequence'])
    print()
    if i >= 2: break

Definition Line for sequence 1:
>2591237:ncbi:1 [MK211378]	2591237	ncbi	1 [MK211378] 2591237	Coronavirus BtRs-BetaCoV/YN2018D		scientific name
30,213 bases:
TATTAGGTTTTCTACCTACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTAGCTGTCGCTCGGCTGCATGCCTAGTGCACCTACGCAGTATAAACAATAATAAATTTTACTGTCGTTGACAAGAAACGAGTAACTCGTCCCTCTTCTGCAGACTGCTTACGGTTCCGTCCGTGTTGCAGTCGATCATCAGCATACCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTTCTTGGTGTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTCTTCAGGTTAGAGACGTGCTAGTGCGTGGCTTCGGGGACTCTGTGGAAGAGGCCCTATCGGAGGCACGTGAACATCTTAAAAATGGCACTTGTGGTTTAGTAGAGCTGGAAAAAGGCGTACTGCCCCAGCTTGAACAGCCCTATGTGTTCATTAAACGTTCTGATGCCTTAAGCACCAATCACGGCCACAAGGTCGTTGAGCTGGTTGCAGAATTGGACGGCATTCAGTACGGTCGTAGCGGTATAACTCTGGGAGTACTCGTGCCACATGTGGGCGAAACCCCAATCGCATACCGCAATGTTCTTCTTCGTAAGAACGGTAATAAGGGAGCCGGTGGCCATAGCTTTGGCATCGATCTAAAGTCTTATGACTTAGGTGACGAGCTTGGTACTGATCCCATTGAAGATTATGAACAAAACTGGAACACTAAGCATGGCAGTGGTGTACTCCGTGAACTCACTCGTGAGCTCAATGGAGGTGCAGTCACTCGCTATGTCGACAACAACTTCTGTGGCCCA

In [11]:
definition_lines = fasta.parse_file()
fasta.reset_iterator()

for i, (k, v) in enumerate(definition_lines.items()):
    print(f"Definition Line for sequence {i+1}:")
    print(next(fasta)['definition line'])
    print('Parsed data:')
    print(f"key: {k}\nmetadata:")
    pprint(v)
    print(f"{'='*100}")
    if i >= 2: break

Definition Line for sequence 1:
>2591237:ncbi:1 [MK211378]	2591237	ncbi	1 [MK211378] 2591237	Coronavirus BtRs-BetaCoV/YN2018D		scientific name
Parsed data:
key: 2591237:ncbi:1
metadata:
{'accession': 'MK211378',
 'seqid': '2591237:ncbi:1',
 'seqnb': '1',
 'source': 'ncbi',
 'species': 'Coronavirus BtRs-BetaCoV/YN2018D  scientific name',
 'taxonomyid': '2591237'}
Definition Line for sequence 2:
>11128:ncbi:2 [LC494191]	11128	ncbi	2 [LC494191] 11128	Bovine coronavirus		scientific name
Parsed data:
key: 11128:ncbi:2
metadata:
{'accession': 'LC494191',
 'seqid': '11128:ncbi:2',
 'seqnb': '2',
 'source': 'ncbi',
 'species': 'Bovine coronavirus  scientific name',
 'taxonomyid': '11128'}
Definition Line for sequence 3:
>31631:ncbi:3 [KY967361]	31631	ncbi	3 [KY967361] 31631	Human coronavirus OC43		scientific name
Parsed data:
key: 31631:ncbi:3
metadata:
{'accession': 'KY967361',
 'seqid': '31631:ncbi:3',
 'seqnb': '3',
 'source': 'ncbi',
 'species': 'Human coronavirus OC43  scientific name',
 

## Explore simulated read output files (FASTQ and ALN)

When simulating reads with ART Illumina:
- We simulate reads from the reference sequences in a fasta file (`fasta` reader)
- ART returns two files:
    - a fastq file (`fastq` reader) with the read sequences, their quality scores and some read metadata
    - an aln file (`aln` reader), including:
        - a header with the simulation command and the list of reference sequences
        - additional metadata on each read, including: 
            - the readid and its number (rank in the list)
            - the read sequence itself (may include errors)
            - the corresponding segment of the reference sequence
            - the read alignment starting position in the reference sequence
            - the reference sequence strand
            - the reference sequence id and its number in the sequence file

As for FASTA files, `FastqFileReader` and `AlnFileReader` make it easier to load and work with fastq and aln files.

In [12]:
p2fastq = p2simreads / f"{p2simreads.stem}.fq"
assert p2fastq.is_file()
p2aln = p2simreads / f"{p2simreads.stem}.aln"
assert p2aln.is_file()

print(f" fq reads file:  {p2fastq.name}\n aln reads file: {p2aln.name}")

fastq = FastqFileReader(p2fastq)
aln = AlnFileReader(p2aln)

 fq reads file:  single_10seq_50bp.fq
 aln reads file: single_10seq_50bp.aln


### Exploring information about the simreads in fastq

In [13]:
for i, (k,v) in enumerate(fastq.parse_file(add_readseq=True).items()):
    print(f"Read {i}:")
    pprint(v)
    print()

    if i+1 >= 3: break

Read 0:
{'readid': '2591237:ncbi:1-60400',
 'readnb': '60400',
 'readseq': 'TTGATCATCCAAATCCTAAAGGATTTTGTGACTTGAAAGGTAAGTACGTC',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}

Read 1:
{'readid': '2591237:ncbi:1-60399',
 'readnb': '60399',
 'readseq': 'ATTTATTAGACTCTTACTTTGTAGTTAAGAGGCATACTATGTCTAACTAC',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}

Read 2:
{'readid': '2591237:ncbi:1-60398',
 'readnb': '60398',
 'readseq': 'TCAGTTGAAAGGAGTGCATTTACATTAGCTGTAACAGCTTGACAAATGTT',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}
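
For readers who want to see what such a parser does under the hood, here is a minimal sketch that reads 4-line FASTQ records and splits the read id into the fields shown above. It is not the `FastqFileReader` implementation: `iter_fastq`, `phred_scores` and `READID_RE` are hypothetical names, the read id layout is inferred from the examples in this notebook, and the Phred+33 quality encoding is an assumption.

```python
import io
import re
from typing import Dict, Iterator, List, TextIO

# Assumption: read ids follow `<taxonomyid>:<source>:<seqnb>-<readnb>`, as in this notebook
READID_RE = re.compile(
    r"^(?P<reftaxonomyid>\d+):(?P<refsource>\w+):(?P<refseqnb>\d+)-(?P<readnb>\d+)$"
)


def iter_fastq(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per 4-line FASTQ record read from `handle`."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # tolerate blank lines between records
            continue
        header = header.rstrip("\n")
        seq, sep, qual = (handle.readline().rstrip("\n") for _ in range(3))
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at: {header!r}")
        record = {"readid": header[1:], "readseq": seq, "read_qscores": qual}
        m = READID_RE.match(record["readid"])
        if m:  # add the fields encoded in the read id
            record.update(m.groupdict())
        yield record


def phred_scores(qual: str) -> List[int]:
    """Decode an ASCII quality string, assuming the Phred+33 encoding."""
    return [ord(c) - 33 for c in qual]


if __name__ == "__main__":
    sample = "\n".join([
        "@2591237:ncbi:1-60400",
        "ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG",
        "+",
        "CCCBCGFGBGGGGGGGBGGGGGGGGG>GGG1G=/GGGGGGGGGGGGGGGG",
    ]) + "\n"
    for rec in iter_fastq(io.StringIO(sample)):
        print(rec["readid"], rec["readnb"], len(rec["readseq"]))
```

Reading record by record keeps memory usage flat, which matters for the larger simreads sets.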



### Exploring information in ALN header

The header of ALN file includes:
- the ART command used to simulate the reads
- the list of reference sequences used to create the reads, with metadata

First, we can check which reference fasta file the reads were simulated from, by looking at the ART command in the alignment file's header. We can then open a fasta reader for that reference file as well:

In [14]:
aln.header['command']

'/usr/bin/art_illumina -i /home/vtec/projects/bio/metagentools/data/ncbi/refsequences/cov_virus_sequences_010-seqs.fa -ss HS25 -l 50 -f 100 -o /home/vtec/projects/bio/metagentools/data/ncbi/simreads/single_10seq_50bp/single_10seq_50bp -rs 1705653351'
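
The options in this command also give a rough idea of how many reads to expect: with `-f` as fold coverage and `-l` as read length, the number of reads for one reference sequence is approximately `fold × reference length / read length`. The sketch below extracts the options and applies this approximation. `art_options` and `expected_read_count` are hypothetical helpers written for this example, and the formula is only an estimate, since ART's exact read count can differ slightly.

```python
import shlex
from typing import Dict


def art_options(command: str) -> Dict[str, str]:
    """Collect `-flag value` pairs from an ART command line."""
    tokens = shlex.split(command)
    opts = {}
    # Pair each token with the following one, and keep pairs that look like `-flag value`
    for flag, value in zip(tokens[1:], tokens[2:]):
        if flag.startswith("-") and not value.startswith("-"):
            opts[flag] = value
    return opts


def expected_read_count(ref_length: int, fold: float, read_length: int) -> int:
    """Approximate number of reads needed to cover `ref_length` bases `fold` times."""
    return round(fold * ref_length / read_length)


if __name__ == "__main__":
    command = (
        "/usr/bin/art_illumina "
        "-i /home/vtec/projects/bio/metagentools/data/ncbi/refsequences/cov_virus_sequences_010-seqs.fa "
        "-ss HS25 -l 50 -f 100 "
        "-o /home/vtec/projects/bio/metagentools/data/ncbi/simreads/single_10seq_50bp/single_10seq_50bp "
        "-rs 1705653351"
    )
    opts = art_options(command)
    fold, read_length = float(opts["-f"]), int(opts["-l"])
    # First reference sequence (MK211378) is 30,213 bases long
    print(expected_read_count(30213, fold, read_length))
```

For the first reference sequence this gives 60,426 reads, close to the 60,400 reads actually found for that sequence later in this notebook.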

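We see that the simulated reads were generated from the reference sequences in `cov_virus_sequences_010-seqs.fa`, which has the same file name as the file loaded in our `fasta` reader (the directory recorded in the command differs slightly from the current location of the file):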

In [15]:
fasta.path

PosixPath('/home/vtec/projects/bio/metagentools/data/ncbi/refsequences/cov/cov_virus_sequences_010-seqs.fa')

Then we can review the metadata available for each reference sequence, as listed in `aln`'s header.

This information can be seen in `aln.header` or with the method `parse_header_reference_sequences()`

In [39]:
print('\n'.join(aln.header['reference sequences']))

@SQ	2591237:ncbi:1 [MK211378]	2591237	ncbi	1 [MK211378] 2591237	Coronavirus BtRs-BetaCoV/YN2018D		scientific name	30213
@SQ	11128:ncbi:2 [LC494191]	11128	ncbi	2 [LC494191] 11128	Bovine coronavirus		scientific name	30942
@SQ	31631:ncbi:3 [KY967361]	31631	ncbi	3 [KY967361] 31631	Human coronavirus OC43		scientific name	30661
@SQ	277944:ncbi:4 [LC654455]	277944	ncbi	4 [LC654455] 277944	Human coronavirus NL63		scientific name	27516
@SQ	11120:ncbi:5 [MN987231]	11120	ncbi	5 [MN987231] 11120	Infectious bronchitis virus		scientific name	27617
@SQ	28295:ncbi:6 [KU893866]	28295	ncbi	6 [KU893866] 28295	Porcine epidemic diarrhea virus		scientific name	28043
@SQ	28295:ncbi:7 [KJ645638]	28295	ncbi	7 [KJ645638] 28295	Porcine epidemic diarrhea virus		scientific name	27998
@SQ	28295:ncbi:8 [KJ645678]	28295	ncbi	8 [KJ645678] 28295	Porcine epidemic diarrhea virus		scientific name	27998
@SQ	28295:ncbi:9 [KR873434]	28295	ncbi	9 [KR873434] 28295	Porcine epidemic diarrhea virus		scientific name	28038
@SQ	1699

With `AlnFileReader.parse_header_reference_sequences`, we can extract information on all the sequences used to generate the reads:

In [17]:
header_refseq_meta = aln.parse_header_reference_sequences()

print(f"Reads simulated from {len(header_refseq_meta)} reference sequences.\n")

for i, (k,v) in enumerate(header_refseq_meta.items()):
    print(f"Reference Sequence {i}:")
    pprint(v)
    print()

    if i+1 >= 3: break

Reads simulated from 10 reference sequences.

Reference Sequence 0:
{'refseq_accession': 'MK211378',
 'refseq_length': '30213',
 'refseqid': '2591237:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237',
 'species': 'Coronavirus BtRs-BetaCoV/YN2018D  scientific name'}

Reference Sequence 1:
{'refseq_accession': 'LC494191',
 'refseq_length': '30942',
 'refseqid': '11128:ncbi:2',
 'refseqnb': '2',
 'refsource': 'ncbi',
 'reftaxonomyid': '11128',
 'species': 'Bovine coronavirus  scientific name'}

Reference Sequence 2:
{'refseq_accession': 'KY967361',
 'refseq_length': '30661',
 'refseqid': '31631:ncbi:3',
 'refseqnb': '3',
 'refsource': 'ncbi',
 'reftaxonomyid': '31631',
 'species': 'Human coronavirus OC43  scientific name'}



We see that the first reference sequence is a **Coronavirus BtRs-BetaCoV/YN2018D**.

### Exploring metadata available for simulated reads

In [18]:
reads_meta = aln.parse_file(add_ref_seq_aligned=True, add_read_seq_aligned=True)

print(f"{len(reads_meta):,d} reads were simulated.\n")

for i, (k,v) in enumerate(reads_meta.items()):
    print(f"Read {i}")
    print(k)
    pprint(v)
    print()
    if i+1>=3: break

572,045 reads were simulated.

Read 0
2591237:ncbi:1-60400
{'aln_start_pos': '13195',
 'read_seq_aligned': 'TTGATCATCCAAATCCTAAAGGATTTTGTGACTTGAAAGGTAAGTACGTC',
 'readid': '2591237:ncbi:1-60400',
 'readnb': '60400',
 'ref_seq_aligned': 'TTGATCATCCAAATCCTAAAGGATTTTGTGACTTGAAAGGTAAGTACGTC',
 'refseq_strand': '+',
 'refseqid': '2591237:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}

Read 1
2591237:ncbi:1-60399
{'aln_start_pos': '13560',
 'read_seq_aligned': 'ATTTATTAGACTCTTACTTTGTAGTTAAGAGGCATACTATGTCTAACTAC',
 'readid': '2591237:ncbi:1-60399',
 'readnb': '60399',
 'ref_seq_aligned': 'ATTTATTAGACTCTTACTTTGTAGTTAAGAGGCATACTATGTCTAACTAC',
 'refseq_strand': '+',
 'refseqid': '2591237:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}

Read 2
2591237:ncbi:1-60398
{'aln_start_pos': '14711',
 'read_seq_aligned': 'TCAGTTGAAAGGAGTGCATTTACATTAGCTGTAACAGCTTGACAAATGTT',
 'readid': '2591237:ncbi:1-60398',
 'readnb': '60398',
 'ref_seq_aligned'
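
As an illustration of the ALN body layout (one definition line followed by the aligned reference and the aligned read), here is a minimal sketch that extracts the same kind of metadata from raw text. It is not the `AlnFileReader` implementation: `iter_aln_records` and `count_mismatches` are hypothetical names, and the sketch assumes that the file has a header ending with `##Header End` and that the fields of the definition line are whitespace-separated and contain no spaces.

```python
import io
from typing import Dict, Iterator, TextIO


def iter_aln_records(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per 3-line read record found after the ALN header."""
    lines = (line.rstrip("\n") for line in handle)
    in_header = True
    for line in lines:
        if in_header:
            # Skip everything up to and including the end-of-header marker
            in_header = not line.startswith("##Header End")
            continue
        if not line.startswith(">"):
            continue  # ignore anything that is not a definition line
        refseqid, readid, start, strand = line[1:].split()
        yield {
            "refseqid": refseqid,
            "readid": readid,
            "aln_start_pos": start,
            "refseq_strand": strand,
            "ref_seq_aligned": next(lines, ""),   # line 2 of the record
            "read_seq_aligned": next(lines, ""),  # line 3 of the record
        }


def count_mismatches(record: Dict[str, str]) -> int:
    """Number of positions where the simulated read differs from the reference."""
    pairs = zip(record["ref_seq_aligned"], record["read_seq_aligned"])
    return sum(a != b for a, b in pairs)


if __name__ == "__main__":
    sample = "\n".join([
        "##ART_Illumina\tread_length\t50",
        "##Header End",
        ">2591237:ncbi:1\t2591237:ncbi:1-60400\t14770\t+",
        "ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG",
        "ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG",
    ]) + "\n"
    for rec in iter_aln_records(io.StringIO(sample)):
        print(rec["readid"], rec["aln_start_pos"], rec["refseq_strand"], count_mismatches(rec))
```

Comparing the two aligned sequences position by position is a quick way to spot the sequencing errors introduced by the simulator.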

### Comparing metadata in fasta, fastq and aln

There is redundancy in the metadata across the three sets of files (`.fa`, `.fq`, `.aln`). Let's compare them to confirm that they are consistent.

Create the following dictionaries by parsing metadata from fasta, fastq and aln:
- `refseqs_fasta`: metadata in fasta sequence file used as reference sequences to create reads
- `simreads`: metadata on reads, as available in the fastq read file
- `refseqs_aln`: reference sequences with metadata as available in the aln header
- `simread_align`: read sequence and alignment info
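
One possible way to automate such a comparison is sketched below on toy dictionaries shaped like the parser outputs shown in this notebook. `find_inconsistencies` and `SHARED_FIELDS` are hypothetical names for this example. The sketch also assumes that alignment gaps, if any, are written as `-` in the aligned read, so they are removed before comparing with the fastq read sequence.

```python
from typing import Dict, List, Sequence, Tuple

# Fields present in both the fastq and the aln metadata of a read
SHARED_FIELDS = ("readid", "readnb", "refseqnb", "refsource", "reftaxonomyid")

Meta = Dict[str, Dict[str, str]]


def find_inconsistencies(
    fastq_meta: Meta, aln_meta: Meta, fields: Sequence[str] = SHARED_FIELDS
) -> List[Tuple[str, str]]:
    """Return (readid, field) pairs where fastq and aln metadata disagree."""
    problems = []
    # Reads present in only one of the two files
    for readid in fastq_meta.keys() ^ aln_meta.keys():
        problems.append((readid, "<missing in one file>"))
    # Field-by-field comparison for reads present in both files
    for readid in fastq_meta.keys() & aln_meta.keys():
        fq, al = fastq_meta[readid], aln_meta[readid]
        for field in fields:
            if fq.get(field) != al.get(field):
                problems.append((readid, field))
        # Sequence check (assumption: `-` marks a gap in the aligned read)
        if "readseq" in fq and "read_seq_aligned" in al:
            if fq["readseq"] != al["read_seq_aligned"].replace("-", ""):
                problems.append((readid, "readseq"))
    return sorted(problems)


if __name__ == "__main__":
    rid = "2591237:ncbi:1-60400"
    seq = "TTGATCATCCAAATCCTAAAGGATTTTGTGACTTGAAAGGTAAGTACGTC"
    common = {"readid": rid, "readnb": "60400", "refseqnb": "1",
              "refsource": "ncbi", "reftaxonomyid": "2591237"}
    fastq_meta = {rid: {**common, "readseq": seq}}
    aln_meta = {rid: {**common, "read_seq_aligned": seq, "aln_start_pos": "13195"}}
    print(find_inconsistencies(fastq_meta, aln_meta))
```

An empty list means that the two files agree on every shared field for every read.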

In [19]:
refseqs_fasta = fasta.parse_file(add_seq=True)
simreads = fastq.parse_file(add_readseq=True)
refseqs_aln = aln.ref_sequences
simread_align = aln.parse_file(add_ref_seq_aligned=True, add_read_seq_aligned=True)

In [20]:
aln.re_header_pattern, aln.re_header_rule_name

('^@SQ[\\t\\s]*(?P<refseqid>(?P<reftaxonomyid>\\d*):(?P<refsource>\\w*):(?P<refseqnb>\\d*))[\\t\\s]*\\[(?P<refseq_accession>[\\d\\w]*)\\][\\t\\s]*(?P=reftaxonomyid)[\\s\\t]*(?P=refsource)[\\s\\t]*(?P=refseqnb)[\\s\\t]*\\[(?P=refseq_accession)\\][\\s\\t]*(?P=reftaxonomyid)[\\s\\t]*(?P<species>\\w[\\w\\d\\/\\s\\-\\.]*)[\\s\\t](?P<refseq_length>\\d*)$',
 'aln_art_illumina-refseq-ncbi-cov')
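
To see how this pattern works, the sketch below applies it with the standard `re` module to the first `@SQ` line of the header. The pattern is copied from the output above (with the escaped backslashes of the string representation converted back to single ones) and split over several lines for readability. The named groups capture each field once, and the `(?P=name)` backreferences require the repeated fields of the definition line to be identical.

```python
import re

# Pattern from `aln.re_header_pattern`, written as adjacent raw strings
HEADER_PATTERN = (
    r"^@SQ[\t\s]*"
    r"(?P<refseqid>(?P<reftaxonomyid>\d*):(?P<refsource>\w*):(?P<refseqnb>\d*))"
    r"[\t\s]*\[(?P<refseq_accession>[\d\w]*)\]"
    r"[\t\s]*(?P=reftaxonomyid)[\s\t]*(?P=refsource)[\s\t]*(?P=refseqnb)"
    r"[\s\t]*\[(?P=refseq_accession)\][\s\t]*(?P=reftaxonomyid)[\s\t]*"
    r"(?P<species>\w[\w\d\/\s\-\.]*)[\s\t](?P<refseq_length>\d*)$"
)

# First reference sequence line of the header (fields are tab-separated)
line = (
    "@SQ\t2591237:ncbi:1 [MK211378]\t2591237\tncbi\t1 [MK211378] 2591237\t"
    "Coronavirus BtRs-BetaCoV/YN2018D\t\tscientific name\t30213"
)

match = re.match(HEADER_PATTERN, line)
metadata = match.groupdict() if match else {}

for key, value in sorted(metadata.items()):
    print(f"{key:>18}: {value!r}")
```

Because `species` is matched greedily and the length must be the last run of digits on the line, the trailing `scientific name` text ends up inside `species`, as seen in the parsed metadata in this notebook.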

In [21]:
print(f"reads simulated from the following {len(refseqs_fasta)} full sequences from `{fasta.path.name}`:\n")
print('\n'.join([f" {i:02d}: {refseq['species']}" for i, refseq in enumerate(refseqs_fasta.values())]))

reads simulated from the following 10 full sequences from `cov_virus_sequences_010-seqs.fa`:

 00: Coronavirus BtRs-BetaCoV/YN2018D  scientific name
 01: Bovine coronavirus  scientific name
 02: Human coronavirus OC43  scientific name
 03: Human coronavirus NL63  scientific name
 04: Infectious bronchitis virus  scientific name
 05: Porcine epidemic diarrhea virus  scientific name
 06: Porcine epidemic diarrhea virus  scientific name
 07: Porcine epidemic diarrhea virus  scientific name
 08: Porcine epidemic diarrhea virus  scientific name
 09: Camel alphacoronavirus  scientific name


In [22]:
print(f"{len(simreads):,d} reads generated and available in fastq file.")
print('For each read, following metadata is available in the fastq file:')
for s in ['readid', 'readnb', 'refseqnb', 'refsource', 'reftaxonomyid', 'readseq']:
    print(f" - {s}")

572,045 reads generated and available in fastq file.
For each read, following metadata is available in the fastq file:
 - readid
 - readnb
 - refseqnb
 - refsource
 - reftaxonomyid
 - readseq


In [23]:
print(f"{len(simread_align):,d} reads generated and available in aln file.")
print('For each read, following metadata is available in the aln file:')
for s in ['aln_start_pos', 'readid', 'readnb', 'refseq_strand', 'refseqid', 'refseqnb', 'refsource', 'reftaxonomyid', 'ref_seq_aligned', 'read_seq_aligned']:
    print(f" - {s}")

572,045 reads generated and available in aln file.
For each read, following metadata is available in the aln file:
 - aln_start_pos
 - readid
 - readnb
 - refseq_strand
 - refseqid
 - refseqnb
 - refsource
 - reftaxonomyid
 - ref_seq_aligned
 - read_seq_aligned


Check consistency between refseqs from fasta and from aln 

In [24]:
# utility functions
def complementary_strand(seq):
    """Return the complementary strand of `seq` (A<->T, C<->G), without reversing it."""
    conv = {'A':'T', 'C':'G', 'G':'C', 'T':'A'}
    return ''.join([conv[base] for base in seq])

strand = 'ATCCGTGGGT'
print(strand, complementary_strand(strand))

def reverse_sequence(seq):
    """Return `seq` read from its last base to its first."""
    return seq[::-1]

print(strand, reverse_sequence(strand))

ATCCGTGGGT TAGGCACCCA
ATCCGTGGGT TGGGTGCCTA


### Check aln reference sequence information

In [25]:
refseqid = '2591237:ncbi:1'
# refseqid = '11128:ncbi:2'

In the source `.fa` file

In [26]:
original_seq = refseqs_fasta[refseqid]['sequence']
original_seq_accession = refseqs_fasta[refseqid]['accession']

original_seq_accession, len(original_seq)

('MK211378', 30213)

In [27]:
# both parsers should index the same reference sequences
assert refseqs_fasta.keys() == refseqs_aln.keys()
refseqs_aln.keys()

dict_keys(['2591237:ncbi:1', '11128:ncbi:2', '31631:ncbi:3', '277944:ncbi:4', '11120:ncbi:5', '28295:ncbi:6', '28295:ncbi:7', '28295:ncbi:8', '28295:ncbi:9', '1699095:ncbi:10'])

In the output `.aln` file

In [28]:
assert original_seq_accession == refseqs_aln[refseqid]['refseq_accession']
assert len(original_seq) == int(refseqs_aln[refseqid]['refseq_length'])

refseqs_aln[refseqid]['refseq_accession'], int(refseqs_aln[refseqid]['refseq_length'])

('MK211378', 30213)

### Check alignment information

In [29]:
pprint(simread_align['2591237:ncbi:1-60400'])

{'aln_start_pos': '13195',
 'read_seq_aligned': 'TTGATCATCCAAATCCTAAAGGATTTTGTGACTTGAAAGGTAAGTACGTC',
 'readid': '2591237:ncbi:1-60400',
 'readnb': '60400',
 'ref_seq_aligned': 'TTGATCATCCAAATCCTAAAGGATTTTGTGACTTGAAAGGTAAGTACGTC',
 'refseq_strand': '+',
 'refseqid': '2591237:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}


Select all reads generated from a single reference sequence

In [30]:
print(f"Select all reads from reference sequence '{refseqid}'")
reads_from_refseq = {k:v for k,v in simread_align.items() if v['refseqid']==refseqid}
nbr_generated_reads = len(reads_from_refseq)
print(f"Total of {nbr_generated_reads:,d} reads")

Select all reads from reference sequence '2591237:ncbi:1'
Total of 60,400 reads


In [31]:
n = 6
selected_simread = [v for k,v in reads_from_refseq.items()][n]
pprint(selected_simread)

{'aln_start_pos': '27612',
 'read_seq_aligned': 'AACGACAGCTCCATTTGTGAAGCTATCAACAGGCGTCTCGAGTGCTTCGA',
 'readid': '2591237:ncbi:1-60394',
 'readnb': '60394',
 'ref_seq_aligned': 'AACGACAGCTCCATTTGTGAAGCTATCAACAGGCGTCTCGAGTGCTTCGA',
 'refseq_strand': '-',
 'refseqid': '2591237:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}


In [32]:
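def check_alignment(n, reads):
    """Print read `n` of `reads` next to the reference segment it was simulated from.

    Relies on the notebook-level `original_seq`, the reference sequence selected above.
    """
    selected_simread = list(reads.values())[n]
    print(f"{'='*80}")
    print(f"checking read {selected_simread['readid']}")
    start = int(selected_simread['aln_start_pos'])
    strand = selected_simread['refseq_strand']
    # use the aligned length (ignoring any `-` gap characters) instead of a hard-coded 50 bp
    read_len = len(selected_simread['ref_seq_aligned'].replace('-', ''))
    print("simread info:")
    print(f" - from `{strand}` strand")
    print(f" - position: {start:,d}")

    if strand == '+':
        segment_from_refseq = original_seq[start:start+read_len]
    else:
        # for the `-` strand, the position is counted on the reverse complement of the reference
        segment_from_refseq = complementary_strand(reverse_sequence(original_seq)[start:start+read_len])

    print('sequences:')
    print('- simread seq          :', selected_simread['read_seq_aligned'])
    print('- refseq aligned       :', selected_simread['ref_seq_aligned'])
    print('- segment in orig. seq :', segment_from_refseq)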
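
The strand handling in `check_alignment` can be verified on a toy sequence. The sketch below treats the start position as a 0-based index (as the function above does) and shows that, for a `-` strand read, slicing the reverse complement at `start` is equivalent to reverse-complementing a window counted from the end of the forward strand. The helper names and the toy sequence are made up for this example.

```python
from typing import Tuple

# Translation table for complementing bases (N and lowercase handled as well)
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def segment_for_read(refseq: str, start: int, length: int, strand: str) -> str:
    """Reference segment for a read, with `start` counted on the read's own strand."""
    if strand == "+":
        return refseq[start:start + length]
    if strand == "-":
        return reverse_complement(refseq)[start:start + length]
    raise ValueError(f"unknown strand: {strand!r}")


def minus_to_forward_coords(ref_length: int, start: int, length: int) -> Tuple[int, int]:
    """Forward-strand [begin, end) window covered by a `-` strand read."""
    end = ref_length - start
    return end - length, end


if __name__ == "__main__":
    ref = "ATCCGTGGGTAACGTT"  # toy reference, 16 bases
    start, length = 3, 5
    minus_segment = segment_for_read(ref, start, length, "-")
    begin, end = minus_to_forward_coords(len(ref), start, length)
    # Both routes must give the same bases
    assert minus_segment == reverse_complement(ref[begin:end])
    print(minus_segment, (begin, end))
```

Using forward-strand coordinates avoids building the reverse complement of a whole genome when only one window is needed.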

In [33]:
for n in range(nbr_generated_reads-1, nbr_generated_reads-6, -1):
    check_alignment(n, reads_from_refseq)

checking read 2591237:ncbi:1-1
simread info:
 - from `+` strand
 - position: 12,792
sequences:
- simread seq          : GATGGTACAGGTACAATTTACACAGAACTGGAACCACCTTGTAGGTTTGT
- refseq aligned       : GATGGTACAGGTACAATTTACACAGAACTGGAACCACCTTGTAGGTTTGT
- segment in orig. seq : GATGGTACAGGTACAATTTACACAGAACTGGAACCACCTTGTAGGTTTGT
checking read 2591237:ncbi:1-2
simread info:
 - from `+` strand
 - position: 3,849
sequences:
- simread seq          : TCTGTCGTACAGAAGCCTGTCGATGTGAAGCCAAAAATTAAGGCTTGCAT
- refseq aligned       : TCTGTCGTACAGAAGCCTGTCGATGTGAAGCCAAAAATTAAGGCTTGCAT
- segment in orig. seq : TCTGTCGTACAGAAGCCTGTCGATGTGAAGCCAAAAATTAAGGCTTGCAT
checking read 2591237:ncbi:1-3
simread info:
 - from `+` strand
 - position: 23,632
sequences:
- simread seq          : TCAGGAGTGCAGTAATTTACTTCTTCAATACGGAAGTTTCTGCACGCAAT
- refseq aligned       : TCAGGAGTGCAGTAATTTACTTCTTCAATACGGAAGTTTCTGCACGCAAT
- segment in orig. seq : TCAGGAGTGCAGTAATTTACTTCTTCAATACGGAAGTTTCTGCACGCAAT
checking read 2591237:ncbi:1-4
s

## end of section