# Review NCBI Yellow Fever (YF) Sequences and SimReads Data

In this notebook, we review the yellow fever sequences retrieved from NCBI and the corresponding simulated reads generated with ART Illumina.

Note: The notebook was built and tested to run locally but should also work on Colab or Kaggle. If on Colab, it assumes that the project shared gdrive directory is accessible through a shortcut called `Metagenomics` under the root of gdrive.

# 1. Imports and setup environment

### Install and import packages

In [249]:
# Install required custom packages if not installed yet.
import importlib.util
if not importlib.util.find_spec('ecutilities'):
    print('installing package: `ecutilities`')
    ! pip install -qqU ecutilities
else:
    print('`ecutilities` already installed')
if not importlib.util.find_spec('metagentools'):
    print('installing package: `metagentools`')
    ! pip install -qqU metagentools
else:
    print('`metagentools` already installed')

`ecutilities` already installed
`metagentools` already installed


In [250]:
# Import all required packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
import re

from ecutilities.core import files_in_tree
from ecutilities.ipython import nb_setup
from IPython.display import display, Markdown
from pathlib import Path
from pprint import pprint
from tqdm.notebook import tqdm, trange

# Setup the notebook for development
nb_setup()

from metagentools.cnn_virus.data import FastaFileReader, FastqFileReader, AlnFileReader
from metagentools.core import ProjectFileSystem

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Set autoreload mode


# 2. Review project file system

This project adopts a unified file structure to make coding and collaboration easier. In addition, the code can run locally (from a `project-root` directory) or in the cloud (Colab or Kaggle).

The unified file structure when running locally is:
```text
    project-root   
        |--- data
        |      |--- CNN_Virus_data  (all data related to CNN Virus original paper)
        |      |--- saved           (trained and finetuned models)
        |      |--- ....            (raw and pre-processed data from various sources)  
        |      
        |--- nbs  (all reference and work notebooks)
        |      |--- cnn_virus
        |      |        |--- notebooks.ipynb
```

When running on Google Colab, it is assumed that a Google Drive is mounted on the Colab server instance and that the Google Drive root includes a shortcut named `Metagenomics` pointing to the project shared directory. The project shared directory is accessible [here](https://drive.google.com/drive/folders/134uei5fmt08TpzhmjG4sW0FQ06kn2ZfZ) if your Google account was given access permission.

To make access to the unified file system easier, `metagentools` provides a `ProjectFileSystem` class (see [documentation](https://vtecftwy.github.io/metagentools/core.html#projectfilesystem) for more info).

Key folders and system information

In [251]:
pfs = ProjectFileSystem()
pfs.info()

Running linux on local computer
Device's home directory: /home/vtec
Project file structure:
 - Root ........ /home/vtec/projects/bio/metagentools 
 - Data Dir .... /home/vtec/projects/bio/metagentools/data 
 - Notebooks ... /home/vtec/projects/bio/metagentools/nbs


In [252]:
pfs.readme()

ReadMe file for directory `data`:

### Data structure for `metagentools`
This directory includes all the data required for the project `metagentools`.

```text
data
 |--- CNN_Virus_data 
 |--- ncbi                
 |--- saved         
 |--- yf-reads
 |--- ....           
     
```
#### Sub-directories
- `CNN_Virus_data`: includes all the data related to the original CNN Virus paper, i.e. training data and validation data in a format that can be used by the CNN Virus code.
- `ncbi`: includes data related to the use of viral sequences from NCBI: reference sequences, simulated reads, inference datasets, inference results.
- `saved`: includes model saved parameters and preprocessing datasets.
- `yf-reads`: includes all data related to real yellow fever reads, from "wet" samples

Also available on AWS S3 at `https://s3.ap-southeast-1.amazonaws.com/bio.cnn-virus.data/data/...`

# Rework definition line in NCBI YF Fasta

In [253]:
p2yfrefseqs = pfs.data / 'ncbi/refsequences/yf/yf_2023_multiple_alignment_original.fa'
assert p2yfrefseqs.is_file()
fasta_original = FastaFileReader(p2yfrefseqs)
fasta_original.print_first_chunks(1)


Sequence 1:
>AY968064_Angola_1971
ATGTCTGGTCGAAAAGCTCAGGGTAAAACCCTGGGCGTCAATATGGTAAGACGAGGGGTTCGCTCCTTGTCAAACAAAAT ...


        None of the saved parsing rules were able to extract metadata from the first line in this file.
        You must set a custom rule (pattern + keys) before parsing text, by using:
            `self.set_parsing_rules(custom_pattern, custom_list_of_keys)`
                


Yellow Fever on NCBI database:
- Accession Number: AY968064
- https://www.ncbi.nlm.nih.gov/nuccore/AY968064
- source: `/organism="Yellow fever virus"/mol_type="genomic RNA"/strain="Angola71"/host="Homo sapiens"/db_xref="taxon:11089"/geo_loc_name="Angola"`
- https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=11089

Original file definition line is: `>AY968064_Angola_1971`.

Rework it to match the structure of the NCBI CoV reference sequence files (tab separated):
- `>taxonomy_id:ncbi:seqnb [accession]	taxonomy_id	ncbi	seqnb [accession] taxonomy_id	Yellow Fever .....`
- Yellow Fever Taxonomy ID is `11089` for all accessions
- `accession` is the start of the original definition line
- the end of the original definition line is the name of the strain

New dfn line: `>11089:ncbi:seqnb [accession]	11089	ncbi	seqnb [accession] 11089	.....`

In [254]:
p2yf_new_dfn_lines = pfs.data / 'ncbi/refsequences/yf/yf_2023_yellow_fever.fa'

In [255]:
re_accession = re.compile(r'^>(?P<accession>[A-Z]{1,2}\d*)(\_|\.\d)?_(?P<species>.*)$')

In [256]:
CREATE_NEW_FA = False

if CREATE_NEW_FA:
    original_fa = FastaFileReader(p2yfrefseqs)
    with open(p2yf_new_dfn_lines, 'w') as fp:
        for i, refseq in enumerate(original_fa):
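            dfn = refseq['definition line']
            # Run the regex once and reuse the match for both fields
            match = re_accession.search(dfn)
            accession, species = match.group('accession'), match.group('species')
            # All fields are tab separated; 11089 is the Yellow Fever taxonomy ID
            new_dfn = f">11089:ncbi:{i+1}\t[{accession}]\t11089\tncbi\t{i+1}\t[{accession}]\t11089\t{species}"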

            fp.write(f"{new_dfn}\n{refseq['sequence']}\n")
else:
    print('No new fasta file created')

No new fasta file created


When we create a `FastaFileReader` with the new file, the format is recognized:

In [257]:
new_fa = FastaFileReader(p2yf_new_dfn_lines)
new_fa.print_first_chunks(1)


Sequence 1:
>11089:ncbi:1	[AY968064]	11089	ncbi	1	[AY968064]	11089	Angola_1971
ATGTCTGGTCGAAAAGCTCAGGGTAAAACCCTGGGCGTCAATATGGTAAGACGAGGGGTTCGCTCCTTGTCAAACAAAAT ...


In [258]:
new_fa.re_rule_name

'fasta_ncbi_std'
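
To make the structure of the reworked definition line concrete, here is a minimal sketch of how such a line could be parsed with a regular expression. `DFN_PATTERN` and `parse_definition_line` are names made up for this illustration; they are not the `fasta_ncbi_std` rule stored in `metagentools`, whose exact pattern is not shown in this notebook.

```python
import re

# Hypothetical pattern written for this sketch only (not the metagentools rule).
# Repeated fields are checked with named backreferences, e.g. (?P=taxonomyid).
DFN_PATTERN = re.compile(
    r"^>(?P<seqid>(?P<taxonomyid>\d+):(?P<source>\w+):(?P<seqnb>\d+))"
    r"\s+\[(?P<accession>\w+)\]"
    r"\s+(?P=taxonomyid)\s+(?P=source)\s+(?P=seqnb)"
    r"\s+\[(?P=accession)\]\s+(?P=taxonomyid)"
    r"\s+(?P<species>.+)$"
)

def parse_definition_line(dfn):
    """Return the metadata fields of a reworked definition line, or None if it does not match."""
    m = DFN_PATTERN.match(dfn.strip())
    return m.groupdict() if m else None

dfn = ">11089:ncbi:1\t[AY968064]\t11089\tncbi\t1\t[AY968064]\t11089\tAngola_1971"
meta = parse_definition_line(dfn)
print(meta)

# The original definition line does not follow the new structure
print(parse_definition_line(">AY968064_Angola_1971"))
```

The backreferences make the pattern reject a line whose repeated fields disagree, which is a cheap way to catch a malformed definition line early.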

### Explore files in NCBI data directory

In [259]:
files_in_tree(pfs.data/'ncbi');

data
  |--ncbi
  |    |--readme.md (0)
  |    |--refsequences
  |    |    |--readme.md (1)
  |    |--simreads
  |    |    |--readme.md (2)
  |    |--infer_results
  |    |    |--readme.md (3)
  |    |--ds
  |    |    |--readme.md (4)


In [260]:
for d in ['refsequences', 'simreads', 'infer_results']:
    path = pfs.data / 'ncbi' / d
    pfs.readme(dir_path=path)
    # print(f"{'-'*100}")
    files_in_tree(path)
    print(f"{'='*100}")

ReadMe file for directory `data/ncbi/refsequences`:

### NCBI Reference Sequences

This directory includes the reference sequences from NCBI, for each virus group. 

Each group of viruses is grouped into one directory:
- `cov`: all reference sequences and related metadata files for corona (sars and mers)
- `mRhiFer1`:  a reference sequence different from CoV and YF to test false positive predictions
- `yf`: all reference sequences and related metadata files for yellow fever


ncbi
  |--refsequences
  |    |--readme.md (0)
  |    |--mRhiFer1
  |    |    |--Rhinolophus_ferrumequinum.mRhiFer1_v1.p.dna_rm.primary_assembly.1.fa (1)
  |    |    |--Rhinolophus_ferrumequinum.mRhiFer1_v1.p.dna.primary_assembly.1.fa.gz (2)
  |    |    |--Rhinolophus_ferrumequinum.mRhiFer1_v1.p.dna_rm.primary_assembly.1.clean.fa (3)
  |    |    |--readme.md (4)
  |    |    |--Rhinolophus_ferrumequinum.mRhiFer1_v1.p.dna_rm.primary_assembly.1.fa.gz (5)
  |    |--yf
  |    |    |--yf_2023_yellow_fever.fa (6)
  |    |    |--yf_2023_multiple_alignment_original.fa (7)
  |    |    |--readme.md (8)
  |    |    |--yf_1971_Angola.fa (9)
  |    |--cov
  |    |    |--cov_virus_sequence_001-seq1.fa (10)
  |    |    |--cov_virus_sequences.txt (11)
  |    |    |--cov_virus_sequences-original.txt (12)
  |    |    |--cov_virus_sequences_100-seqs.fa (13)
  |    |    |--readme.md (14)
  |    |    |--taxonomyid-label-mapping.json (15)
  |    |    |--cov_virus_sequences_002-seqs.fa (16)
  |    |    |--cov

ReadMe file for directory `data/ncbi/simreads`:

### NCBI simulated reads
This directory includes all sets of simulated read sequence files generated from NCBI viral sequences using ART Illumina.

```ascii
this-directory
    |--cov
    |    |
    |    |--single_10seq_50bp
    |    |    |--single_10seq_50bp.fq
    |    |    |--single_10seq_50bp.aln
    |    |-- ...
    |    |--single_100seq_150bp
    |    |    |--single_100seq_150bp.fq
    |    |    |--single_100seq_150bp.aln
    |    |--paired_100seq_50bp
    |    |    |--paired_100seq_50bp2.aln
    |    |    |--paired_100seq_50bp1.aln
    |    |    |--paired_100seq_50bp2.fq
    |    |    |--paired_100seq_50bp1.fq
    |    |-- ...
    |    |
    |---yf
    |    |
    |    |--yf_AY968064-single-150bp
    |    |    |--yf_AY968064-single-1seq-150bp.fq
    |    |    |--yf_AY968064-single-1seq-150bp.aln
    |    |
    |--mRhiFer1
    |    |--mRhiFer1_v1.p.dna_rm.primary_assembly.1
    |    |    |--mRhiFer1_v1.p.dna_rm.primary_assembly.1.fq
    |    |    |--mRhiFer1_v1.p.dna_rm.primary_assembly.1.aln
    |    |

```

This directory includes several subdirectories, each for one virus, e.g. `cov` for corona, `yf` for yellow fever.

In each virus subdirectory, several simreads directories include simulated reads generated with various parameters, named `<method>_<nb-seq>_<nb-bp>` where:
- `<method>` is either `single` or `paired` depending on the simulation method
- `<nb-seq>` is the number of reference sequences used for simulation, and refers to the `fa` file used
- `<nb-bp>` is the number of base pairs used to simulate reads


Each sub-directory includes simreads files made using a simulation method and a specific number of reference sequences.
- `xxx.fq` and `xxx.aln` files when method is `single`
- `xxx1.fq`, `xxx2.fq`, `xxx1.aln` and `xxx2.aln` files when method is `paired`.

Example:
- `paired_10seq_50bp` means that the simreads were generated by using the `paired` method to simulate 50-bp reads, and using the `fa` file `cov_virus_sequences_010-seqs.fa`.
- `single_100seq_50bp` means that the simreads were generated by using the `single` method to simulate 50-bp reads, and using the `fa` file `cov_virus_sequences_100-seqs.fa`. Note that this generated 20,660,104 reads !

#### Simread file formats

Simulated reads information is split between two files:
- **FASTQ** (`.fq`) files providing the read sequences and their ASCII quality scores
- **ALN** (`.aln`) files with alignment information

##### FASTQ (`.fq`)
FASTQ files generated by ART Illumina have the following structure (showing 5 reads), with 4 lines for each read:

```ascii
@2591237:ncbi:1-60400
ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG
+
CCCBCGFGBGGGGGGGBGGGGGGGGG>GGG1G=/GGGGGGGGGGGGGGGG
@2591237:ncbi:1-60399
GATCAATGTGGCATCTACAATACAGACAGCATGAAGCACCACCAAAGGAC
+
BCBCCFGGGGGGGG1CGGGG<GGBGGGGGFGCGGGGGGDGGG/GG1GGGG
@2591237:ncbi:1-60398
ATCTACCAGTGGTAGATGGGTTCTTAATAATGAACATTATAGAGCTCTAC
+
CCCCCGGGEGG1GGF1G/GGEGGGGGGGGGGGGFFGGGGGGGGGGDGGDG
@2591237:ncbi:1-60397
CGTAAAGTAGAGGCTGTATGGTAGCTAGCACAAATGCCAGCACCAATAGG
+
BCCCCGGGFGGGGGGFGGGGFGG1GGGGGGG>GG1GGGGGGGGGGE<GGG
@2591237:ncbi:1-60396
GGTATCGGGTATCTCCTGCATCAATGCAAGGTCTTACAAAGATAAATACT
+
CBCCCGGG@CGGGGGGGGGGGG=GFGGGGDGGGFG1GGGGGGGG@GGGGG
```
The following information can be parsed from each read in the FASTQ file:

- Line 1: `readid`, a unique ID for the read, in the format `@readid`
- Line 2: `readseq`, the sequence of the read
- Line 3: a separator `+`
- Line 4: `read_qscores`, the base quality scores encoded in ASCII 

Example:
```
@2591237:ncbi:1-60400
ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG
+
CCCBCGFGBGGGGGGGBGGGGGGGGG>GGG1G=/GGGGGGGGGGGGGGGG
```
- `readid` = `2591237:ncbi:1-60400`
- `readseq` = `ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG`, a 50 bp read
- `read_qscores` = `CCCBCGFGBGGGGGGGBGGGGGGGGG>GGG1G=/GGGGGGGGGGGGGGGG`


##### ALN (`.aln`)
ALN files generated by ART Illumina consist of:
- a header with the ART Illumina command used for the simulation (`@CM`) and info on each of the reference sequences used for the simulations (`@SQ`). The header always starts with `##ART_Illumina` and ends with `##Header End`.
- the body, with 3 lines for each read:
    1. definition line with:
        - the reference sequence identification number `refseqid`,
        - the read identifier `readid`,
        - the position of the read in the reference sequence `aln_start_pos`,
        - the strand the read was taken from `ref_seq_strand`: `+` for coding strand and `-` for template strand
    2. aligned reference sequence, that is the sequence segment in the original reference corresponding to the read
    3. aligned read sequence, that is the simulated read sequence, where each bp corresponds to the reference sequence bp in the same position.

Example of an ALN file generated by ART Illumina (showing 3 reads):

```ascii
##ART_Illumina    read_length    50
@CM    /bin/art_illumina -i /home/vtec/projects/bio/metagentools/data/cov_data/cov_virus_sequences_ten.fa -ss HS25 -l 50 -f 100 -o /home/vtec/projects/bio/metagentools/data/cov_simreads/single_10seq_50bp/single_10seq_50bp -rs 1674660835
@SQ    2591237:ncbi:1 [MK211378]    2591237    ncbi    1 [MK211378] 2591237    Coronavirus BtRs-BetaCoV/YN2018D    30213
@SQ    11128:ncbi:2 [LC494191]    11128    ncbi    2 [LC494191] 11128    Bovine coronavirus    30942
@SQ    31631:ncbi:3 [KY967361]    31631    ncbi    3 [KY967361] 31631    Human coronavirus OC43        30661
@SQ    277944:ncbi:4 [LC654455]    277944    ncbi    4 [LC654455] 277944    Human coronavirus NL63    27516
@SQ    11120:ncbi:5 [MN987231]    11120    ncbi    5 [MN987231] 11120    Infectious bronchitis virus    27617
@SQ    28295:ncbi:6 [KU893866]    28295    ncbi    6 [KU893866] 28295    Porcine epidemic diarrhea virus    28043
@SQ    28295:ncbi:7 [KJ645638]    28295    ncbi    7 [KJ645638] 28295    Porcine epidemic diarrhea virus    27998
@SQ    28295:ncbi:8 [KJ645678]    28295    ncbi    8 [KJ645678] 28295    Porcine epidemic diarrhea virus    27998
@SQ    28295:ncbi:9 [KR873434]    28295    ncbi    9 [KR873434] 28295    Porcine epidemic diarrhea virus    28038
@SQ    1699095:ncbi:10 [KT368904]    1699095    ncbi    10 [KT368904] 1699095    Camel alphacoronavirus    27395
##Header End
>2591237:ncbi:1    2591237:ncbi:1-60400    14770    +
ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG
ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG
>2591237:ncbi:1    2591237:ncbi:1-60399    17012    -
GATCAATGTGGCATCTACAATACAGACAGCATGAAGCACCACCAAAGGAC
GATCAATGTGGCATCTACAATACAGACAGCATGAAGCACCACCAAAGGAC
>2591237:ncbi:1    2591237:ncbi:1-60398    9188    +
ATCTACCAGTGGTAGATGGGTTCTTAATAATGAACATTATAGAGCTCTAC
ATCTACCAGTGGTAGATGGGTTCTTAATAATGAACATTATAGAGCTCTAC
.....
```

ncbi
  |--simreads
  |    |--readme.md (0)
  |    |--mRhiFer1
  |    |    |--readme.md (1)
  |    |--yf
  |    |    |--readme.md (2)
  |    |--cov
  |    |    |--readme.md (3)


ReadMe file for directory `data/ncbi/infer_results`:

#### Inference Result Files

Each directory includes a set of prediction/inference result files.

ncbi
  |--infer_results
  |    |--readme.md (0)
  |    |--cov-ncbi
  |    |    |--readme.md (1)
  |    |--yf-ncbi
  |    |    |--yf_2023-single-69seq-150bp-2024-05-02_16_29_18-probs.csv (2)
  |    |    |--readme.md (3)
  |    |    |--yf_AY968064-single-1seq-150bp-2024-05-02_01_10_12-probs.csv (4)
  |    |    |--yf_AY968064-single-1seq-150bp-2024-05-02_01_10_12-results.csv (5)
  |    |    |--yf_2023-single-69seq-150bp-2024-05-02_16_29_18-results.csv (6)


# 3. Review fasta, fastq and aln files

## Setup paths

- `p2refseqs`: path to a reference sequence file (FASTA)
- `p2simreads`: path to folder where reads files are located (FASTQ and ALN)

In [261]:
p2refseqs = pfs.data / 'ncbi/refsequences/yf/yf_2023_yellow_fever.fa'
assert p2refseqs.is_file(), f"No file found at {p2refseqs.absolute()}"

# p2simreads = pfs.data / 'ncbi/simreads/yf/single_all_seq_150bp'
p2simreads = pfs.data / 'ncbi/simreads/yf/paired_all_seq_150bp'
assert p2simreads.is_dir(), f"No directory found at {p2simreads.absolute()}"

## Explore Reference Files (FASTA)

`FastaFileReader` is a class to make reading and accessing fasta file information easier

In [262]:
fasta = FastaFileReader(path=p2refseqs)

In [263]:
fasta.print_first_chunks(nchunks=3)


Sequence 1:
>11089:ncbi:1	[AY968064]	11089	ncbi	1	[AY968064]	11089	Angola_1971
ATGTCTGGTCGAAAAGCTCAGGGTAAAACCCTGGGCGTCAATATGGTAAGACGAGGGGTTCGCTCCTTGTCAAACAAAAT ...

Sequence 2:
>11089:ncbi:2	[U54798]	11089	ncbi	2	[U54798]	11089	Ivory_Coast_1982
ATGTCTGGTCGCAAAGCTCAGGGAAAAACCCTGGGCGTCAATATGGTTCGACGGGGAGTCCGCTCCTTGTCAAACAAAAT ...

Sequence 3:
>11089:ncbi:3	[DQ235229]	11089	ncbi	3	[DQ235229]	11089	Ethiopia_1961
ATGTCTGGTCGAAAAGCTCAGGGTAAAACCCTGGGCGTCAATATGGTAAGACGGGGAGCACGCTCCTTGTCAAACAAAAT ...


Access each sequence (definition line and sequence) one by one

In [264]:
fasta.reset_iterator()

for i, seq in enumerate(fasta):
    print(f"Definition Line for sequence {i+1}:")
    print(seq['definition line'])
    print(f"{len(seq['sequence']):,d} bases:")
    print(seq['sequence'])
    print()
    if i >= 2: break

Definition Line for sequence 1:
>11089:ncbi:1	[AY968064]	11089	ncbi	1	[AY968064]	11089	Angola_1971
10,237 bases:
ATGTCTGGTCGAAAAGCTCAGGGTAAAACCCTGGGCGTCAATATGGTAAGACGAGGGGTTCGCTCCTTGTCAAACAAAATAAAACAAAAAACAAAACAGATTGGAAACAGACCTGGCCCTTCAAGAGGTGTTCAAGGATTTATTTTCTTCTTTTTGTTTAACATTCTGACTGGGAAAAAGTTGACTGCTCATCTAAAAAAACTTTGGAGGATGCTTGATCCAAGGCAGGGACTTGCTGTACTGCGGAAGGTCAAAAGAGTTGTAGCTAGCTTAATGAGAGGACTGTCTTCCAGGAAACGTAGATCCAATGAAATGGCCTTGTTTCCACTCTTGTTACTGGGTTTGTTGGCTCTATCAGGAGGAGTGACCCTCGTCAGAAAGAACAGATGGTTGCTCTTGAATGTAACTGCTGAAGATCTGGGGAAAACGTTTTCAGTGGGAACTGGGAATTGCACCACGAATATTCTGGAAGCAAAATACTGGTGCCCCGACTCAATGGAGTACAATTGTCCCAATCTCAGTCCAAGAGAAGAGCCAGATGACATAGATTGCTGGTGTTATGGAGTGGAAAATGTCAGAGTGGCCTATGGAAGATGTGATGCGGTAGGGCGATCAAAACGTTCTAGGAGAGCGATTGATCTACCCACACATGAGAACCATGGACTAAAGACTCGGCAGGAGAAGTGGATGACTGGCAGAATGGGTGAGAGGCAGCTCCAGAAGATTGAAAGATGGCTGGTTAGGAATCCATTTTTTGCGGTCACAGCATTGGCAATAGCCTATCTGGTGGGTAACAACACGACGCAACGAGTGGTAATAGCACTACTGGTTCTAGCGGTTGGTCCAGCGTACTCTGCCCATTGCATAGGGATAACCGACAGAGATTT

In [265]:
definition_lines = fasta.parse_file()
fasta.reset_iterator()

for i, (k, v) in enumerate(definition_lines.items()):
    print(f"Definition Line for sequence {i+1}:")
    print(next(fasta)['definition line'])
    print('Parsed data:')
    print(f"key: {k}\nmetadata:")
    pprint(v)
    print(f"{'='*100}")
    if i >= 2: break

Definition Line for sequence 1:
>11089:ncbi:1	[AY968064]	11089	ncbi	1	[AY968064]	11089	Angola_1971
Parsed data:
key: 11089:ncbi:1
metadata:
{'accession': 'AY968064',
 'seqid': '11089:ncbi:1',
 'seqnb': '1',
 'source': 'ncbi',
 'species': 'Angola_1971',
 'taxonomyid': '11089'}
Definition Line for sequence 2:
>11089:ncbi:2	[U54798]	11089	ncbi	2	[U54798]	11089	Ivory_Coast_1982
Parsed data:
key: 11089:ncbi:2
metadata:
{'accession': 'U54798',
 'seqid': '11089:ncbi:2',
 'seqnb': '2',
 'source': 'ncbi',
 'species': 'Ivory_Coast_1982',
 'taxonomyid': '11089'}
Definition Line for sequence 3:
>11089:ncbi:3	[DQ235229]	11089	ncbi	3	[DQ235229]	11089	Ethiopia_1961
Parsed data:
key: 11089:ncbi:3
metadata:
{'accession': 'DQ235229',
 'seqid': '11089:ncbi:3',
 'seqnb': '3',
 'source': 'ncbi',
 'species': 'Ethiopia_1961',
 'taxonomyid': '11089'}


## Explore simulated read output files (FASTQ and ALN)

When simulating with ART Illumina:
- We simulate reads from reference sequences in a fasta file (`fasta` reader)
- ART returns two files:
    - a fastq file (`fastq` reader) with the read sequences, their quality scores and some read metadata
    - an aln file (`aln` reader), including:
        - a header with the simulation command and the list of reference sequences
        - additional metadata on each read, including:
            - the readid and its number (rank in the list)
            - the read sequence itself (may include errors)
            - the same sequence as in the reference sequence
            - the read alignment starting position in the reference sequence
            - the reference sequence strand
            - the reference sequence id and its number in the sequence file            

As with FASTA files, `FastqFileReader` and `AlnFileReader` make it easier to load and work with fastq and aln files.
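
To illustrate what a fastq reader has to do, here is a minimal, standard-library-only sketch that walks through a FASTQ stream four lines at a time. `iter_fastq` is a hypothetical helper written for this example and is not part of `metagentools`; decoding the quality string as Phred+33 is an assumption about the encoding of the file.

```python
from io import StringIO

def iter_fastq(handle):
    """Yield one dict per read from a FASTQ text stream (4 lines per read)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of stream
        seq = handle.readline().rstrip("\n")
        sep = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"Malformed FASTQ record starting at: {header!r}")
        yield {"readid": header[1:], "readseq": seq, "read_qscores": qual}

# One read copied from the simreads README example
sample = StringIO(
    "@2591237:ncbi:1-60400\n"
    "ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG\n"
    "+\n"
    "CCCBCGFGBGGGGGGGBGGGGGGGGG>GGG1G=/GGGGGGGGGGGGGGGG\n"
)
reads = list(iter_fastq(sample))

# Assumption: quality characters are Phred+33 encoded (score = ASCII code - 33)
qscores = [ord(c) - 33 for c in reads[0]["read_qscores"]]
print(reads[0]["readid"], len(reads[0]["readseq"]), qscores[:5])
```

A generator keeps memory use flat, which matters for files with hundreds of thousands of reads.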

In [266]:
files_in_tree(p2simreads);

yf
  |--paired_all_seq_150bp
  |    |--paired_all_seq_150bp2.fq (0)
  |    |--paired_all_seq_150bp2.aln (1)
  |    |--paired_all_seq_150bp1.fq (2)
  |    |--paired_all_seq_150bp1.aln (3)


In [267]:
# suffix = '1' if 'paired' in p2simreads.stem else ''
suffix = '2' if 'paired' in p2simreads.stem else ''

p2fastq = p2simreads / f"{p2simreads.stem}{suffix}.fq"
print(p2fastq.name)
assert p2fastq.is_file(), f"{p2fastq} is not a file"
p2aln = p2simreads / f"{p2simreads.stem}{suffix}.aln"
print(p2aln.name)
assert p2aln.is_file(), f"{p2aln} is not a file"

print(f"fq reads file:  {p2fastq.name}\naln reads file: {p2aln.name}")

fastq = FastqFileReader(p2fastq)
assert fastq.re_rule_name is not None
aln = AlnFileReader(p2aln)
assert aln.re_rule_name is not None

paired_all_seq_150bp2.fq
paired_all_seq_150bp2.aln
fq reads file:  paired_all_seq_150bp2.fq
aln reads file: paired_all_seq_150bp2.aln


### Exploring information about the simreads in fastq

In [268]:
fastq.re_rule_name, aln.re_header_rule_name

('fastq_art_illumina_ncbi_std', 'aln_art_illumina-refseq-ncbi-std')

In [269]:
for i, (k,v) in enumerate(fastq.parse_file(add_readseq=True).items()):
    print(f"Read {i}:")
    pprint(v)
    print()

    if i+1 >= 3: break

Read 0:
{'readid': '11089:ncbi:1-13600/2',
 'readnb': '13600/2',
 'readseq': 'AGGCTTCATTCACAGGTATACTCCTTCTGCCAAACACCTGCGTGGACATGTATGCACACAGCCCTAGAAAGGGCTGGGTCAAGCCCATGTATGAGGTCAGTGTCAGGGCTACAATGGGAATGGTCTTTTGCATAGATGTGTCTTTTGAAT',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '11089'}

Read 1:
{'readid': '11089:ncbi:1-13598/2',
 'readnb': '13598/2',
 'readseq': 'CACTAACTTCTTTTTGAGCTGCTGCATAGTAGCACCAGCCTCCACGTCCACACCCAAGATCGGTGACCCTTCCTTCTAGCTTCACATAGCCACGCTCGTGGAACCATCTCAGCTTTGCAGTCCCTCTCGACACGGCCACTCCCGTGTCCA',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '11089'}

Read 2:
{'readid': '11089:ncbi:1-13596/2',
 'readnb': '13596/2',
 'readseq': 'TGTCTTGGAAGGCTAGCCCTGCTAGCACTCCCACCAAGCCAGCCGCTGCCAAGGCTTCATTCACAGGTATACTCCTTCTGCCAAACACCTGCGTGGACATGTATGCACACAGCCCTAGAAAGGGCTGGGTCAAGCCCATGTATGAGGTCA',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '11089'}



### Exploring information in ALN header

The header of ALN file includes:
- the ART command used to simulate the reads
- the list of reference sequences used to create the reads, with metadata
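
The header and body structure described above can be sketched with plain string handling. `parse_aln_text` is a hypothetical function written for this illustration (it is not how `AlnFileReader` is implemented), the `@CM` command in the sample is shortened and made up, and the sketch assumes each read definition line holds four whitespace-separated fields.

```python
def parse_aln_text(text):
    """Split the text of an ART ALN file into a header dict and a list of read records."""
    lines = text.splitlines()
    end = lines.index("##Header End")            # index of the last header line
    header_lines = lines[:end]
    header = {
        # `@CM` line: everything after the tag is the simulation command
        "command": next(l.split(None, 1)[1] for l in header_lines if l.startswith("@CM")),
        # `@SQ` lines: one per reference sequence
        "reference sequences": [l for l in header_lines if l.startswith("@SQ")],
    }
    body = [l for l in lines[end + 1:] if l.strip()]
    reads = []
    for i in range(0, len(body) - 2, 3):         # each read spans 3 lines
        refseqid, readid, start, strand = body[i][1:].split()
        reads.append({
            "refseqid": refseqid,
            "readid": readid,
            "aln_start_pos": int(start),
            "refseq_strand": strand,
            "ref_seq_aligned": body[i + 1],
            "read_seq_aligned": body[i + 2],
        })
    return header, reads

sample = "\n".join([
    "##ART_Illumina\tread_length\t50",
    "@CM\tart_illumina -i refs.fa -ss HS25 -l 50 -f 100 -o out",  # shortened, hypothetical command
    "@SQ\t2591237:ncbi:1 [MK211378]\t2591237\tncbi\t1 [MK211378] 2591237\tCoronavirus BtRs-BetaCoV/YN2018D\t30213",
    "##Header End",
    ">2591237:ncbi:1\t2591237:ncbi:1-60400\t14770\t+",
    "ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG",
    "ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG",
    ">2591237:ncbi:1\t2591237:ncbi:1-60399\t17012\t-",
    "GATCAATGTGGCATCTACAATACAGACAGCATGAAGCACCACCAAAGGAC",
    "GATCAATGTGGCATCTACAATACAGACAGCATGAAGCACCACCAAAGGAC",
])
header, reads = parse_aln_text(sample)
print(header["command"])
print(len(reads), reads[1]["readid"], reads[1]["refseq_strand"])
```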

First we can see from which reference fasta file the simulated sequences were imported, by checking the ART command in the alignment file's header. Then we can open a fasta reader for the reference fasta file as well:

In [270]:
aln.header['command']

'/usr/bin/art_illumina -i /home/vtec/projects/bio/metagentools/data/ncbi/refsequences/yf/yf_2023_yellow_fever.fa -ss HS25 -p -l 150 -f 200 -m 200 -s 10 -o /home/vtec/projects/bio/metagentools/data/ncbi/simreads/yf/paired_all_seq_150bp/paired_all_seq_150bp -rs 1723183668'

We see that the simulated reads were generated from the reference sequences in `data/ncbi/refsequences/yf/yf_2023_yellow_fever.fa`, that is the same file as the one loaded in our `fasta` reader:

In [271]:
fasta.path

PosixPath('/home/vtec/projects/bio/metagentools/data/ncbi/refsequences/yf/yf_2023_yellow_fever.fa')
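
The same comparison can be made programmatically by extracting option values from the command string. The sketch below uses only the standard library; `option_value` is a hypothetical helper, and the command is copied from the ALN header shown earlier. It assumes every option of interest is followed by exactly one value.

```python
import shlex
from pathlib import Path

command = (
    "/usr/bin/art_illumina "
    "-i /home/vtec/projects/bio/metagentools/data/ncbi/refsequences/yf/yf_2023_yellow_fever.fa "
    "-ss HS25 -p -l 150 -f 200 -m 200 -s 10 "
    "-o /home/vtec/projects/bio/metagentools/data/ncbi/simreads/yf/paired_all_seq_150bp/paired_all_seq_150bp "
    "-rs 1723183668"
)

def option_value(cmd, flag):
    """Return the token that follows `flag` in a shell-like command string."""
    tokens = shlex.split(cmd)
    return tokens[tokens.index(flag) + 1]

input_fasta = Path(option_value(command, "-i"))   # reference fasta passed to ART
read_length = int(option_value(command, "-l"))    # read length in bp
print(input_fasta.name, read_length)
```

In the notebook, `input_fasta` could then be compared with `fasta.path` instead of checking the two paths by eye.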

Then we can review the metadata available for each reference sequence, as listed in `aln`'s header.

This information can be seen in `aln.header` or with the method `parse_header_reference_sequences()`

In [272]:
print('\n'.join(aln.header['reference sequences']))

@SQ	11089:ncbi:1	[AY968064]	11089	ncbi	1	[AY968064]	11089	Angola_1971	10237
@SQ	11089:ncbi:2	[U54798]	11089	ncbi	2	[U54798]	11089	Ivory_Coast_1982	10237
@SQ	11089:ncbi:3	[DQ235229]	11089	ncbi	3	[DQ235229]	11089	Ethiopia_1961	10237
@SQ	11089:ncbi:4	[AY572535]	11089	ncbi	4	[AY572535]	11089	Gambia_2001	10237
@SQ	11089:ncbi:5	[MF405338]	11089	ncbi	5	[MF405338]	11089	Ghana_Hsapiens_1927	10237
@SQ	11089:ncbi:6	[U21056]	11089	ncbi	6	[U21056]	11089	Senegal_1927	10237
@SQ	11089:ncbi:7	[AY968065]	11089	ncbi	7	[AY968065]	11089	Uganda_1948	10237
@SQ	11089:ncbi:8	[JX898871]	11089	ncbi	8	[JX898871]	11089	ArD114896_Senegal_1995	10237
@SQ	11089:ncbi:9	[JX898872]	11089	ncbi	9	[JX898872]	11089	Senegal_Aedes-aegypti_1995	10237
@SQ	11089:ncbi:10	[GQ379163]	11089	ncbi	10	[GQ379163]	11089	Peru_Hsapiens_2007	10237
@SQ	11089:ncbi:11	[DQ118157]	11089	ncbi	11	[DQ118157]	11089	Spain_Vaccine_2004	10237
@SQ	11089:ncbi:12	[MF289572]	11089	ncbi	12	[MF289572]	11089	Singapore_2017	10237
@SQ	11089:ncbi:13	[KU978764]	11

With `AlnFileReader.parse_header_reference_sequences`, we can extract information on all the sequences used to generate the reads:

In [273]:
header_refseq_meta = aln.parse_header_reference_sequences()

print(f"Reads simulated from {len(header_refseq_meta)} reference sequences.\n")

for i, (k,v) in enumerate(header_refseq_meta.items()):
    print(f"Reference Sequence {i}:")
    pprint(v)
    print()

    if i+1 >= 3: break

Reads simulated from 69 reference sequences.

Reference Sequence 0:
{'refseq_accession': 'AY968064',
 'refseq_length': '10237',
 'refseqid': '11089:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '11089',
 'species': 'Angola_1971'}

Reference Sequence 1:
{'refseq_accession': 'U54798',
 'refseq_length': '10237',
 'refseqid': '11089:ncbi:2',
 'refseqnb': '2',
 'refsource': 'ncbi',
 'reftaxonomyid': '11089',
 'species': 'Ivory_Coast_1982'}

Reference Sequence 2:
{'refseq_accession': 'DQ235229',
 'refseq_length': '10237',
 'refseqid': '11089:ncbi:3',
 'refseqnb': '3',
 'refsource': 'ncbi',
 'reftaxonomyid': '11089',
 'species': 'Ethiopia_1961'}



We see that the first reference sequence is **Angola_1971** (accession `AY968064`).

### Exploring metadata available for simulated reads

In [274]:
reads_meta = aln.parse_file(add_ref_seq_aligned=True, add_read_seq_aligned=True)

print(f"{len(reads_meta):,d} reads were simulated.\n")

for i, (k,v) in enumerate(reads_meta.items()):
    print(f"Read {i}")
    print(k)
    pprint(v)
    print()
    if i+1>=3: break

463,386 reads were simulated.

Read 0
11089:ncbi:1-13600/2
{'aln_start_pos': '6153',
 'read_seq_aligned': 'AGGCTTCATTCACAGGTATACTCCTTCTGCCAAACACCTGCGTGGACATGTATGCACACAGCCCTAGAAAGGGCTGGGTCAAGCCCATGTATGAGGTCAGTGTCAGGGCTACAATGGGAATGGTCTTTTGCATAGATGTGTCTTTTGAAT',
 'readid': '11089:ncbi:1-13600/2',
 'readnb': '13600/2',
 'ref_seq_aligned': 'AGGCTTCATTCACAGGTATACTCCTTCTGCCAAACACCTGCGTGGACATGTATGCACACAGCCCTAGAAAGGGCTGGGTCAAGCCCATGTATGAGGTCAGTGTCAGGGCTACAATGGGAATGGTCTTTTGCATAGATGTGTCTTTTGAAT',
 'refseq_strand': '-',
 'refseqid': '11089:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '11089'}

Read 1
11089:ncbi:1-13598/2
{'aln_start_pos': '2421',
 'read_seq_aligned': 'CACTAACTTCTTTTTGAGCTGCTGCATAGTAGCACCAGCCTCCACGTCCACACCCAAGATCGGTGACCCTTCCTTCTAGCTTCACATAGCCACGCTCGTGGAACCATCTCAGCTTTGCAGTCCCTCTCGACACGGCCACTCCCGTGTCCA',
 'readid': '11089:ncbi:1-13598/2',
 'readnb': '13598/2',
 'ref_seq_aligned': 'CACTAACTTCTTTTTGAGCTGCTGCATAGTAGCACCAGCCTCCACGTCCACACCCAAGATCGGTGACCCTTCCTTCTAGCTT

### Comparing metadata in fasta, fastq and aln

There is redundancy in metadata across the three sets of files (`.fa`, `.fq`, `.aln`). Let's compare them to confirm that they are consistent.

Create the following dictionaries by parsing metadata from fasta, fastq and aln:
- `refseqs_fasta`: metadata in fasta sequence file used as reference sequences to create reads
- `simreads`: metadata on reads, as available in the fastq read file
- `refseqs_aln`: reference sequences with metadata as available in the aln header
- `simread_align`: read sequence and alignment info

In [275]:
refseqs_fasta = fasta.parse_file(add_seq=True)
simreads = fastq.parse_file(add_readseq=True)
refseqs_aln = aln.ref_sequences
simread_align = aln.parse_file(add_ref_seq_aligned=True, add_read_seq_aligned=True)

In [277]:
print(aln.re_header_rule_name,'\n',aln.re_header_pattern)

aln_art_illumina-refseq-ncbi-std 
 ^@SQ[\t\s]*(?P<refseqid>(?P<reftaxonomyid>\d*):(?P<refsource>\w*):(?P<refseqnb>\d*))[\t\s]*\[(?P<refseq_accession>[\d\w]*)\][\t\s]*(?P=reftaxonomyid)[\s\t]*(?P=refsource)[\s\t]*(?P=refseqnb)[\s\t]*\[(?P=refseq_accession)\][\s\t]*(?P=reftaxonomyid)[\s\t]*(?P<species>\w[\w\d\/\s\-\.:]*)[\s\t](?P<refseq_length>\d*)$
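
As a quick illustration of how this header pattern behaves, the sketch below applies it to one `@SQ` line taken from the ALN header and to a deliberately inconsistent variant. The pattern is the one printed above, only split across several string literals for readability; the inconsistent line is made up for the example.

```python
import re

pattern = (
    r"^@SQ[\t\s]*(?P<refseqid>(?P<reftaxonomyid>\d*):(?P<refsource>\w*):(?P<refseqnb>\d*))"
    r"[\t\s]*\[(?P<refseq_accession>[\d\w]*)\]"
    r"[\t\s]*(?P=reftaxonomyid)[\s\t]*(?P=refsource)[\s\t]*(?P=refseqnb)"
    r"[\s\t]*\[(?P=refseq_accession)\][\s\t]*(?P=reftaxonomyid)"
    r"[\s\t]*(?P<species>\w[\w\d\/\s\-\.:]*)[\s\t](?P<refseq_length>\d*)$"
)

line = "@SQ\t11089:ncbi:1\t[AY968064]\t11089\tncbi\t1\t[AY968064]\t11089\tAngola_1971\t10237"
m = re.match(pattern, line)
print(m.groupdict())

# Made-up line where the repeated taxonomy ID (11098) disagrees with the first one (11089):
# the named backreference (?P=reftaxonomyid) makes the whole match fail.
bad_line = "@SQ\t11089:ncbi:1\t[AY968064]\t11098\tncbi\t1\t[AY968064]\t11089\tAngola_1971\t10237"
print(re.match(pattern, bad_line))
```

The named backreferences (`(?P=...)`) are what tie the repeated fields together, so a header line is only accepted when its redundant fields agree.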


In [278]:
print(f"reads simulated from the following {len(refseqs_fasta)} full sequences from `{fasta.path.name}`:\n")
print('\n'.join([f" {i:02d}: {refseq['species']}" for i, refseq in enumerate(refseqs_fasta.values())]))

reads simulated from the following 69 full sequences from `yf_2023_yellow_fever.fa`:

 00: Angola_1971
 01: Ivory_Coast_1982
 02: Ethiopia_1961
 03: Gambia_2001
 04: Ghana_Hsapiens_1927
 05: Senegal_1927
 06: Uganda_1948
 07: ArD114896_Senegal_1995
 08: Senegal_Aedes-aegypti_1995
 09: Peru_Hsapiens_2007
 10: Spain_Vaccine_2004
 11: Singapore_2017
 12: Sudan_Hsapiens_1941
 13: ArD181250_Senegal_2005
 14: ArD181676_Senegal_2005
 15: Senegal_Aedes_luteocephalus_2005
 16: ArD181564_Senegal_2005
 17: ArD181464_Senegal_2005
 18: Senegal_Aedes_fucifer_2001
 19: Guinea_Bissau_Hsapiens_1965
 20: Senegal_Ae_fucifer_1996
 21: isolate_HD117294_Senegal_1995
 22: Senegal_Aedes_fucifer_2000
 23: ArD149194_Senegal_2000
 24: ArD149214_Senegal_2000
 25: Netherlands_Hsapiens_Gambia_2018
 26: Nigeria_Hsapiens_2018
 27: Nigeria_Hsapiens_2018
 28: CotedIvoire_Ae_africanus_1973
 29: Nigeria_Hsapiens_1946
 30: Bolivia_Hsapiens_1999
 31: Brazil_Hsapiens_1983
 32: Brazil_Haemagogus_sp_1980
 33: Brazil_Hsapiens_

In [279]:
print(f"{len(simreads):,d} reads generated and available in fastq file.")
print('For each read, following metadata is available in the fastq file:')
for s in ['readid', 'readnb', 'refseqnb', 'refsource', 'reftaxonomyid', 'readseq']:
    print(f" - {s}")

463,386 reads generated and available in fastq file.
For each read, following metadata is available in the fastq file:
 - readid
 - readnb
 - refseqnb
 - refsource
 - reftaxonomyid
 - readseq


In [280]:
print(f"{len(simread_align):,d} reads generated and available in aln file.")
print('For each read, following metadata is available in the aln file:')
for s in ['aln_start_pos', 'readid', 'readnb', 'refseq_strand', 'refseqid', 'refseqnb', 'refsource', 'reftaxonomyid', 'ref_seq_aligned', 'read_seq_aligned']:
    print(f" - {s}")

463,386 reads generated and available in aln file.
For each read, following metadata is available in the aln file:
 - aln_start_pos
 - readid
 - readnb
 - refseq_strand
 - refseqid
 - refseqnb
 - refsource
 - reftaxonomyid
 - ref_seq_aligned
 - read_seq_aligned


Check consistency between refseqs from fasta and from aln 

In [281]:
# utility functions
def complementary_strand(seq):
    """Return the complementary strand of `seq` (same order, each base complemented)."""
    conv = {'A':'T', 'C':'G', 'G':'C', 'T':'A'}
    return ''.join([conv[base] for base in seq])

strand = 'ATCCGTGGGT'
print(strand, complementary_strand(strand))

def reverse_sequence(seq):
    """Return `seq` read in the opposite direction."""
    return seq[::-1]

print(strand, reverse_sequence(strand))

ATCCGTGGGT TAGGCACCCA
ATCCGTGGGT TGGGTGCCTA
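
The two utilities above are later chained to obtain the reverse complement of the reference. A minimal single-step sketch is shown below, using `str.translate`. The name `reverse_complement` is hypothetical, and including `N` in the table is an assumption about which ambiguity codes may occur; characters that are not in the table pass through unchanged.

```python
# Translation table for upper-case bases; 'N' maps to itself (assumption)
_COMPLEMENT_TABLE = str.maketrans('ACGTN', 'TGCAN')

def reverse_complement(seq):
    """Return the reverse complement of `seq` (illustrative helper)."""
    return seq.translate(_COMPLEMENT_TABLE)[::-1]

strand = 'ATCCGTGGGT'
print(strand, reverse_complement(strand))
```

Unlike a dictionary lookup per base, this variant does not fail on unexpected characters, which may or may not be desirable depending on how strict the check should be.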


### Check aln reference sequence information

In [282]:
# refseqid = '2591237:ncbi:1'
# refseqid = '11128:ncbi:2'
refseqid = '11089:ncbi:1'


In the source `.fa` file

In [283]:
original_seq = refseqs_fasta[refseqid]['sequence']
original_seq_accession = refseqs_fasta[refseqid]['accession']

original_seq_accession, len(original_seq)

('AY968064', 10237)

In [284]:
refseqs_fasta.keys()
refseqs_aln.keys()

dict_keys(['11089:ncbi:1', '11089:ncbi:2', '11089:ncbi:3', '11089:ncbi:4', '11089:ncbi:5', '11089:ncbi:6', '11089:ncbi:7', '11089:ncbi:8', '11089:ncbi:9', '11089:ncbi:10', '11089:ncbi:11', '11089:ncbi:12', '11089:ncbi:13', '11089:ncbi:14', '11089:ncbi:15', '11089:ncbi:16', '11089:ncbi:17', '11089:ncbi:18', '11089:ncbi:19', '11089:ncbi:20', '11089:ncbi:21', '11089:ncbi:22', '11089:ncbi:23', '11089:ncbi:24', '11089:ncbi:25', '11089:ncbi:26', '11089:ncbi:27', '11089:ncbi:28', '11089:ncbi:29', '11089:ncbi:30', '11089:ncbi:31', '11089:ncbi:32', '11089:ncbi:33', '11089:ncbi:34', '11089:ncbi:35', '11089:ncbi:36', '11089:ncbi:37', '11089:ncbi:38', '11089:ncbi:39', '11089:ncbi:40', '11089:ncbi:41', '11089:ncbi:42', '11089:ncbi:43', '11089:ncbi:44', '11089:ncbi:45', '11089:ncbi:46', '11089:ncbi:47', '11089:ncbi:48', '11089:ncbi:49', '11089:ncbi:50', '11089:ncbi:51', '11089:ncbi:52', '11089:ncbi:53', '11089:ncbi:54', '11089:ncbi:55', '11089:ncbi:56', '11089:ncbi:57', '11089:ncbi:58', '11089:ncbi:

In the output `.aln` file

In [285]:
assert original_seq_accession == refseqs_aln[refseqid]['refseq_accession']
assert len(original_seq) == int(refseqs_aln[refseqid]['refseq_length'])

refseqs_aln[refseqid]['refseq_accession'], int(refseqs_aln[refseqid]['refseq_length'])

('AY968064', 10237)
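
The assertions above cover a single reference sequence. The same comparison could be extended to every reference sequence, as sketched below. The sketch assumes both dictionaries are keyed by `refseqid` and use the field names seen in this notebook (`accession`, `sequence`, `refseq_accession`, `refseq_length`). The function name and the toy records are hypothetical.

```python
def find_refseq_mismatches(fasta_refs, aln_refs):
    """List (refseqid, problem) pairs where the fasta and aln metadata disagree."""
    mismatches = []
    for refseqid, fasta_rec in fasta_refs.items():
        aln_rec = aln_refs.get(refseqid)
        if aln_rec is None:
            mismatches.append((refseqid, 'missing in aln'))
            continue
        if fasta_rec['accession'] != aln_rec['refseq_accession']:
            mismatches.append((refseqid, 'accession differs'))
        if len(fasta_rec['sequence']) != int(aln_rec['refseq_length']):
            mismatches.append((refseqid, 'length differs'))
    return mismatches

# Toy records, for illustration only (not real accessions or sequences)
toy_fasta = {
    'toy:1': {'accession': 'TOY0001', 'sequence': 'ACGTACGT'},
    'toy:2': {'accession': 'TOY0002', 'sequence': 'ACGT'},
    'toy:3': {'accession': 'TOY0003', 'sequence': 'AC'},
}
toy_aln = {
    'toy:1': {'refseq_accession': 'TOY0001', 'refseq_length': '8'},
    'toy:2': {'refseq_accession': 'TOY0002', 'refseq_length': '5'},
}
print(find_refseq_mismatches(toy_fasta, toy_aln))
```

With the notebook's data, the call would be `find_refseq_mismatches(refseqs_fasta, refseqs_aln)`, and an empty list would indicate that every reference sequence is consistent.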

### Check alignment information

In [289]:
pprint(simread_align['11089:ncbi:1-13598/2'])

{'aln_start_pos': '2421',
 'read_seq_aligned': 'CACTAACTTCTTTTTGAGCTGCTGCATAGTAGCACCAGCCTCCACGTCCACACCCAAGATCGGTGACCCTTCCTTCTAGCTTCACATAGCCACGCTCGTGGAACCATCTCAGCTTTGCAGTCCCTCTCGACACGGCCACTCCCGTGTCCA',
 'readid': '11089:ncbi:1-13598/2',
 'readnb': '13598/2',
 'ref_seq_aligned': 'CACTAACTTCTTTTTGAGCTGCTGCATAGTAGCACCAGCCTCCACGTCCACACCCAAGATCGGTGACCCTTCCTTCTAGCTTCACATAGCCACGCTCGTGGAACCATCTCAGCTTTGCAGTCCCTCTCGACACGGCCACTCCCGTGTCCA',
 'refseq_strand': '-',
 'refseqid': '11089:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '11089'}


Select all reads generated from a single reference sequence

In [290]:
print(f"Select all reads from reference sequence '{refseqid}'")
reads_from_refseq = {k:v for k,v in simread_align.items() if v['refseqid']==refseqid}
nbr_generated_reads = len(reads_from_refseq)
print(f"Total of {nbr_generated_reads:,d} reads")

Select all reads from reference sequence '11089:ncbi:1'
Total of 6,800 reads
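
To see how the reads are spread across all reference sequences, rather than filtering for one, the reads can be counted per `refseqid`. This is a minimal sketch, assuming each alignment record is a dictionary with a `refseqid` key as shown above. The function name and the toy records are hypothetical.

```python
from collections import Counter

def reads_per_refseq(alignments):
    """Count how many reads were generated from each reference sequence."""
    return Counter(rec['refseqid'] for rec in alignments.values())

# Toy alignment records, for illustration only
toy_alignments = {
    'toy:1-1/1': {'refseqid': 'toy:1'},
    'toy:1-1/2': {'refseqid': 'toy:1'},
    'toy:2-1/1': {'refseqid': 'toy:2'},
}
counts = reads_per_refseq(toy_alignments)
for refseqid, nbr_reads in counts.most_common():
    print(f"{refseqid}: {nbr_reads:,d} reads")
```

With the notebook's data, `reads_per_refseq(simread_align)[refseqid]` should equal `nbr_generated_reads`.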


In [291]:
n = 6
selected_simread = list(reads_from_refseq.values())[n]
pprint(selected_simread)

{'aln_start_pos': '3744',
 'read_seq_aligned': 'ATGGGTGGGCTTTGGCAGTACTTGAACGCCGTTTCACTGTGTGTGCTCACCATAAATGCAATCTCATCAAGGAAGGCCTCAAATATGATCCTCCCCCTGATGGCACTCATGACTCCCATGACAATGCATGAGGTGAGGATGGCGACGATG',
 'readid': '11089:ncbi:1-13588/2',
 'readnb': '13588/2',
 'ref_seq_aligned': 'ATGGGTGGGCTTTGGCAGTACTTGAACGCCGTTTCACTGTGTGTGCTCACCATAAATGCAATCTCATCAAGGAAGGCCTCAAATATGATCCTCCCCCTGATGGCACTCATGACTCCCATGACAATGCATGAGGTGAGGATGGCGACGATG',
 'refseq_strand': '+',
 'refseqid': '11089:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '11089'}


In [292]:
def check_alignment(n, reads):
    """Print read `n` of `reads` next to the matching 50-base segment of `original_seq`."""
    selected_simread = list(reads.values())[n]
    print(f"{'='*80}")
    print(f"checking read {selected_simread['readid']}")
    # `aln_start_pos` is used as a 0-based offset on the strand the read was generated from
    start = int(selected_simread['aln_start_pos'])
    strand = selected_simread['refseq_strand']
    print("simread info:")
    print(f" - from `{strand}` strand")
    print(f" - position: {start:,d}")

    if strand == '+':
        segment_from_refseq = original_seq[start:start+50]
    else:
        # For the `-` strand, the offset is counted on the reverse complement of the reference
        segment_from_refseq = complementary_strand(reverse_sequence(original_seq)[start:start+50])

    print('sequences:')
    print('- simread seq          :', selected_simread['read_seq_aligned'])
    print('- refseq aligned       :', selected_simread['ref_seq_aligned'])
    print('- segment in orig. seq :', segment_from_refseq)
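
The function above prints the sequences for visual comparison. A variant that returns a boolean would allow checking many reads at once. The sketch below makes the same assumptions as the notebook code: the start position is a 0-based offset, and for the `-` strand it is counted on the reverse complement of the reference. It additionally assumes that `-` characters in the aligned reference mark gaps and can be dropped before comparing. All names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of `seq` (upper-case A, C, G, T, N)."""
    return seq.translate(str.maketrans('ACGTN', 'TGCAN'))[::-1]

def read_matches_reference(reference, start, strand, ref_seq_aligned):
    """Return True if the aligned reference segment is found at `start` on `strand`."""
    template = reference if strand == '+' else reverse_complement(reference)
    # Assumption: '-' marks an alignment gap, so it is removed before comparing
    ungapped = ref_seq_aligned.replace('-', '')
    return template[start:start + len(ungapped)] == ungapped

# Toy reference, for illustration only
toy_reference = 'ACGTTGCAAGGCTA'
print(read_matches_reference(toy_reference, 3, '+', 'TTGCA'))
print(read_matches_reference(toy_reference, 2, '-', 'GCCT'))
```

With the notebook's data, the check for one read would be `read_matches_reference(original_seq, int(r['aln_start_pos']), r['refseq_strand'], r['ref_seq_aligned'])`, where `r` is a record from `reads_from_refseq`.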

In [293]:
for n in range(nbr_generated_reads-1, nbr_generated_reads-6, -1):
    check_alignment(n, reads_from_refseq)

checking read 11089:ncbi:1-2/2
simread info:
 - from `+` strand
 - position: 4,435
sequences:
- simread seq          : TTAAAGGTGCTAGGAGGAGTGGGGATGTTCTTTGGGACATTCCCACGCCAAAGGTGATTGAGGAATGTGAGCACCTGGAGGATGGAATCTATGGCATATTCCAGTCAACCTTCCTTGGAGCCTCGCAGCGAGGTGTTGGAGTGGCGCAGG
- refseq aligned       : TTAAAGGTGCTAGGAGGAGTGGGGATGTTCTTTGGGACATTCCCACGCCAAAGGTGATTGAGGAATGTGAGCACCTGGAGGATGGAATCTATGGCATATTCCAGTCAACCTTCCTTGGAGCCTCGCAGCGAGGTGTTGGAGTGGCGCAGG
- segment in orig. seq : TTAAAGGTGCTAGGAGGAGTGGGGATGTTCTTTGGGACATTCCCACGCCA
checking read 11089:ncbi:1-4/2
simread info:
 - from `+` strand
 - position: 4,927
sequences:
- simread seq          : TGGGTGATAACTCCTTTGTGTCTGCCATCTCACAAACTGAATTGAAGGAAGAATCCAAAGAAGAACTGCAAGAAATACCAACTATGCTGAAGAAAGGAATGACCACCATCCTTGACTTCCACCCTGGGGCGGGGAAAACCCGTAGGTTCC
- refseq aligned       : TGGGTGATAACTCCTTTGTGTCTGCCATCTCACAAACTGAATTGAAGGAAGAATCCAAAGAAGAACTGCAAGAAATACCAACTATGCTGAAGAAAGGAATGACCACCATCCTTGACTTCCACCCTGGGGCGGGGAAAACCCGTAGGTTCC
- segment in orig. seq : TGGGTGA

# end of section