# Create simreads from Yellow Fever NCBI reference sequences

This reference notebook shows how to create different sets of simulated reads (single and paired, from one or from all sequences) for yellow fever reference sequences from NCBI.

# 1. Imports and environment setup

### Install and import packages

In [1]:
# Install required custom packages if not installed yet.
import importlib.util
if not importlib.util.find_spec('ecutilities'):
    print('installing package: `ecutilities`')
    ! pip install -qqU ecutilities
else:
    print('`ecutilities` already installed')
if not importlib.util.find_spec('metagentools'):
    print('installing package: `metagentools`')
    ! pip install -qqU metagentools
else:
    print('`metagentools` already installed')

`ecutilities` already installed
`metagentools` already installed


In [2]:
# Import all required packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re

from ecutilities.core import files_in_tree, path_to_parent_dir
from ecutilities.ipython import nb_setup
from metagentools.art import ArtIllumina
from metagentools.core import ProjectFileSystem, TextFileBaseReader
from metagentools.cnn_virus.data import FastaFileReader, FastqFileReader, AlnFileReader
from nbdev import show_doc
from pathlib import Path
from pprint import pprint

# Setup the notebook for development
nb_setup()

Set autoreload mode


# 2. Set up project file system

In [3]:
pfs = ProjectFileSystem()
pfs.info()

Running linux on local computer
Device's home directory: /home/vtec
Project file structure:
 - Root ........ /home/vtec/projects/bio/metagentools 
 - Data Dir .... /home/vtec/projects/bio/metagentools/data 
 - Notebooks ... /home/vtec/projects/bio/metagentools/nbs


## Load fasta file with reference sequences and review

In [4]:
show_doc(ArtIllumina)

---

[source](https://github.com/vtecftwy/metagentools/blob/main/metagentools/art.py#LNone)

### ArtIllumina

>      ArtIllumina (path2app:str|pathlib.Path, input_dir:str|pathlib.Path,
>                   output_dir:str|pathlib.Path=None,
>                   app_in_system_path:bool=False)

Class to handle all aspects of simulating sequencing with art_illumina

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| path2app | str \| pathlib.Path |  | full path to art_illumina application on the system |
| input_dir | str \| pathlib.Path |  | full path to dir where input files are |
| output_dir | str \| pathlib.Path | None | full path to dir where to save output files, if different from input_dir |
| app_in_system_path | bool | False | whether `art_illumina` is in the system path or not |

We need to define several paths:
- `path2app`: the full path to the `art_illumina` application on the system
- `input_dir`: the directory where all the fasta files we want to use are located
- `output_dir`: the directory where to save output files
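
Before instantiating `ArtIllumina`, it can help to confirm that these three paths actually exist. The sketch below uses only the standard library; `resolve_app` and `missing_dirs` are hypothetical helper names for illustration and are not part of `metagentools`.

```python
import shutil
from pathlib import Path


def resolve_app(name_or_path):
    """Return the path to an executable, or None if it cannot be found.

    Accepts either a full path to the binary or a bare name that is
    looked up on the system PATH. (Hypothetical helper.)
    """
    candidate = Path(name_or_path)
    if candidate.is_file():
        return candidate.resolve()
    found = shutil.which(str(name_or_path))
    return Path(found) if found else None


def missing_dirs(*dirs):
    """Return the directories from `dirs` that do not exist. (Hypothetical helper.)"""
    return [Path(d) for d in dirs if not Path(d).is_dir()]
```

For example, `resolve_app('art_illumina')` returns `None` when ART cannot be found on the PATH, and `missing_dirs(input_dir, output_dir)` should return an empty list before any simulation is run.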

In [5]:
p2inputs = pfs.data/ 'ncbi/refsequences/yf'
assert p2inputs.is_dir()
print(p2inputs.absolute())
files_in_tree(p2inputs);

/home/vtec/projects/bio/metagentools/data/ncbi/refsequences/yf
refsequences
  |--yf
  |    |--yf_1971_angola.fa (0)
  |    |--yf_2023_yellow_fever.fa (1)
  |    |--yf_2023_multiple_alignment_original.fa (2)
  |    |--readme.md (3)


In [6]:
p2outputs = pfs.data/ 'ncbi/simreads/yf'
assert p2outputs.is_dir()
print(p2outputs.absolute())
files_in_tree(p2outputs);

/home/vtec/projects/bio/metagentools/data/ncbi/simreads/yf
simreads
  |--yf
  |    |--readme.md (0)
  |    |--single_1seq_150bp
  |    |    |--single_1seq_150bp.fq (1)
  |    |    |--single_1seq_150bp.aln (2)
  |    |--single_69seq_150bp
  |    |    |--single_69seq_150bp.fq (3)
  |    |    |--single_69seq_150bp.aln (4)
  |    |--paired_1seq_150bp
  |    |    |--paired_1seq_150bp2.aln (5)
  |    |    |--paired_1seq_150bp2.fq (6)
  |    |    |--paired_1seq_150bp1.fq (7)
  |    |    |--paired_1seq_150bp1.aln (8)
  |    |--paired_69seq_150bp
  |    |    |--paired_69seq_150bp1.fq (9)
  |    |    |--paired_69seq_150bp2.fq (10)
  |    |    |--paired_69seq_150bp1.aln (11)
  |    |    |--paired_69seq_150bp2.aln (12)


Review the fasta file containing the reference sequences:

In [7]:
# Pick one of the two files; only the uncommented assignment is used.
# p2fasta = p2inputs / 'yf_1971_angola.fa'      # one sequence only
p2fasta = p2inputs / 'yf_2023_yellow_fever.fa'  # all sequences
fasta = FastaFileReader(p2fasta)
nb_seqs = fasta.review()

There are 69 sequences in this file

First Sequence:
>11089:ncbi:1	1	AY968064	11089	ncbi	Angola_1971
ATGTCTGGTCGAAAAGCTCAGGGTAAAACCCTGGGCGTCAATATGGTAAGACGAGGGGTTCGCTCCTTGTCAAACAAAAT ...
{'accession': 'AY968064', 'organism': 'Angola_1971', 'seqid': '11089:ncbi:1', 'seqnb': '1', 'source': 'ncbi', 'taxonomyid': '11089'}

Last Sequence:
>11089:ncbi:69	69	OM066737	11089	ncbi	VHF-21-037/GHA/Damongo/2021
ATGTCTGGTCGTAAAGCTCAGGGCAAAACCCTGGGCGTCAATATGGTACGACGAGGAGTCCGCTCCNNNNNNNNNAAAAT ...
{'accession': 'OM066737', 'organism': 'VHF-21-037/GHA/Damongo/2021', 'seqid': '11089:ncbi:69', 'seqnb': '69', 'source': 'ncbi', 'taxonomyid': '11089'}
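
The header lines and metadata dictionaries printed above suggest a simple mapping between the header fields and the metadata keys. As an illustration only, assuming the fields are tab-separated and always appear in this order, the mapping could be sketched as follows. `FIELDS` and `parse_header` are hypothetical names; `FastaFileReader` may well parse headers differently.

```python
# Field order as it appears in the header lines shown above (assumed fixed).
FIELDS = ("seqid", "seqnb", "accession", "taxonomyid", "source", "organism")


def parse_header(line):
    """Split a tab-separated fasta header line into a metadata dict."""
    parts = line.rstrip("\n").lstrip(">").split("\t")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


meta = parse_header(">11089:ncbi:1\t1\tAY968064\t11089\tncbi\tAngola_1971")
print(meta["accession"], meta["organism"])
```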


# Simulate reads using 1 sequence

## Single read simulation - 150 bp reads

In [8]:
print(p2inputs.absolute())
print(p2outputs.absolute())

/home/vtec/projects/bio/metagentools/data/ncbi/refsequences/yf
/home/vtec/projects/bio/metagentools/data/ncbi/simreads/yf


In [9]:
art = ArtIllumina(
    path2app=Path('/usr/bin/art_illumina'), 
    input_dir=p2inputs, 
    output_dir=p2outputs
    )

Ready to operate with art: /usr/bin/art_illumina
Input files from : /home/vtec/projects/bio/metagentools/data/ncbi/refsequences/yf
Output files to :  /home/vtec/projects/bio/metagentools/data/ncbi/simreads/yf


### Define the input file and number of sequences

In [10]:
art.list_all_input_files()

yf_1971_angola.fa
yf_2023_multiple_alignment_original.fa
yf_2023_yellow_fever.fa


In [11]:
input_fname = 'yf_1971_angola.fa'
fasta = FastaFileReader(p2inputs / input_fname)
nb_sequences = fasta.review()
nb_sequences

There is 1 sequences in this file

First Sequence:
>11089:ncbi:1	1	AY968064	11089	ncbi	Angola_1971
ATGTCTGGTCGAAAAGCTCAGGGTAAAACCCTGGGCGTCAATATGGTAAGACGAGGGGTTCGCTCCTTGTCAAACAAAAT ...
{'accession': 'AY968064', 'organism': 'Angola_1971', 'seqid': '11089:ncbi:1', 'seqnb': '1', 'source': 'ncbi', 'taxonomyid': '11089'}


1

### Run the simulation

Parameter `fold`:

Fold coverage, also known as sequencing depth or read depth, represents the average number of times each base in the reference genome is expected to be sequenced. For example:
- If you set -f 20, it means you're simulating a sequencing run that would cover each base in the reference genome an average of 20 times.
- If you set -f 100, it would simulate coverage where each base is sequenced an average of 100 times.

The fold coverage is an important parameter because it affects:
- The total number of reads generated: Higher fold coverage results in more reads.
- The likelihood of capturing rare variants or sequencing errors: Higher coverage generally improves the ability to detect rare variants and distinguish true variants from sequencing errors.
- The overall quality of the simulated dataset: Higher coverage typically leads to more accurate representation of the reference genome in the simulated data.

It's worth noting that ART Illumina uses this fold coverage value along with the read length and reference genome size to calculate the total number of reads to generate. The actual formula is:

```Total number of reads = (Genome size * Fold coverage) / Read length```

In [12]:
read_length = 150
genome_size = 10_238
fold = 250

print(f"Estimated number of simulated reads per reference sequence: {(genome_size * fold) // read_length: ,d}.")


Estimated number of simulated reads per reference sequence:  17,063.


In [13]:
sim_params = {
    'input_file': input_fname,
    "sim_type": "single",
    "read_length": read_length,
    'nb_sequences': nb_sequences,
    "fold": fold,
    'q_profile': 'HS25'
}

sim_params['output_seed'] = f"{sim_params['sim_type']}_{sim_params['nb_sequences']}seq_{sim_params['read_length']}bp"
sim_params

{'input_file': 'yf_1971_angola.fa',
 'sim_type': 'single',
 'read_length': 150,
 'nb_sequences': 1,
 'fold': 250,
 'q_profile': 'HS25',
 'output_seed': 'single_1seq_150bp'}

In [14]:
art.sim_reads( 
    input_file=sim_params['input_file'],
    output_seed=sim_params['output_seed'],
    sim_type=sim_params['sim_type'],
    read_length=sim_params['read_length'],
    fold=sim_params['fold'],
    ss=sim_params['q_profile'],
    overwrite=True
)

return code:  0 


             ART_Illumina (2008-2016)          
          Q Version 2.5.8 (June 6, 2016)       
     Contact: Weichun Huang <whduke@gmail.com> 
    -------------------------------------------

                  Single-end Simulation

Total CPU time used: 0.399107

The random seed for the run: 1724163574

Parameters used during run
	Read Length:	150
	Genome masking 'N' cutoff frequency: 	1 in 150
	Fold Coverage:            250X
	Profile Type:             Combined
	ID Tag:                   

Quality Profile(s)
	First Read:   HiSeq 2500 Length 150 R1 (built-in profile) 

Output files

  FASTQ Sequence File:
	/home/vtec/projects/bio/metagentools/data/ncbi/simreads/yf/single_1seq_150bp/single_1seq_150bp.fq

  ALN Alignment File:
	/home/vtec/projects/bio/metagentools/data/ncbi/simreads/yf/single_1seq_150bp/single_1seq_150bp.aln
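
For orientation, the parameters used in this run correspond to documented `art_illumina` command-line flags (`-ss`, `-i`, `-l`, `-f`, `-o`, and for paired reads `-p`, `-m`, `-s`). The sketch below only builds such a command as a list of strings and does not run it. Whether `ArtIllumina.sim_reads` assembles exactly this command internally is an assumption, and `build_art_command` is a hypothetical helper.

```python
def build_art_command(input_file, output_prefix, read_length, fold,
                      ss="HS25", paired=False, mean_frag=None, std_frag=None,
                      app="art_illumina"):
    """Build an art_illumina command line as a list of arguments (not executed)."""
    cmd = [
        app,
        "-ss", ss,                 # sequencing system / quality profile
        "-i", str(input_file),     # input fasta file
        "-l", str(read_length),    # read length
        "-f", str(fold),           # fold coverage
        "-o", str(output_prefix),  # output file prefix
    ]
    if paired:
        if mean_frag is None or std_frag is None:
            raise ValueError("paired simulation needs mean_frag and std_frag")
        # paired-end flag, mean fragment length, fragment length std deviation
        cmd += ["-p", "-m", str(mean_frag), "-s", str(std_frag)]
    return cmd


print(" ".join(build_art_command("yf_1971_angola.fa", "single_1seq_150bp", 150, 250)))
```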




In [15]:
art.list_last_output_files()

single_1seq_150bp.fq
single_1seq_150bp.aln
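
To compare the estimated number of reads with what was actually generated, the reads in the `.fq` file can be counted. This is a minimal sketch that assumes every FASTQ record spans exactly four lines (no wrapped sequences); `count_fastq_reads` is a hypothetical helper, and the demo file contains made-up records.

```python
import tempfile
from pathlib import Path


def count_fastq_reads(path):
    """Count reads in a FASTQ file, assuming four lines per record."""
    n_lines = 0
    with open(path) as fh:
        for line in fh:
            if line.strip():  # ignore blank lines
                n_lines += 1
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


# Tiny self-contained demo with two made-up records
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.fq"
    demo.write_text("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n")
    n_reads = count_fastq_reads(demo)

print(n_reads)
```

Applied to `single_1seq_150bp.fq`, the count should land close to the estimate computed before the run.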


In [16]:
art.list_all_output_files()

paired_1seq_150bp
- paired_1seq_150bp2.aln
- paired_1seq_150bp2.fq
- paired_1seq_150bp1.fq
- paired_1seq_150bp1.aln
paired_69seq_150bp
- paired_69seq_150bp1.fq
- paired_69seq_150bp2.fq
- paired_69seq_150bp1.aln
- paired_69seq_150bp2.aln
single_1seq_150bp
- single_1seq_150bp.fq
- single_1seq_150bp.aln
single_69seq_150bp
- single_69seq_150bp.fq
- single_69seq_150bp.aln


## Paired read simulation - 150 bp reads

In [17]:
print(p2inputs.absolute())
print(p2outputs.absolute())

/home/vtec/projects/bio/metagentools/data/ncbi/refsequences/yf
/home/vtec/projects/bio/metagentools/data/ncbi/simreads/yf


In [18]:
art = ArtIllumina(
    path2app=Path('/usr/bin/art_illumina'), 
    input_dir=p2inputs, 
    output_dir=p2outputs
    )

Ready to operate with art: /usr/bin/art_illumina
Input files from : /home/vtec/projects/bio/metagentools/data/ncbi/refsequences/yf
Output files to :  /home/vtec/projects/bio/metagentools/data/ncbi/simreads/yf


### Define the input file and number of sequences

In [19]:
art.list_all_input_files()

yf_1971_angola.fa
yf_2023_multiple_alignment_original.fa
yf_2023_yellow_fever.fa


In [20]:
input_fname = 'yf_1971_angola.fa'
fasta = FastaFileReader(p2inputs / input_fname)
nb_sequences = fasta.review()
nb_sequences

There is 1 sequences in this file

First Sequence:
>11089:ncbi:1	1	AY968064	11089	ncbi	Angola_1971
ATGTCTGGTCGAAAAGCTCAGGGTAAAACCCTGGGCGTCAATATGGTAAGACGAGGGGTTCGCTCCTTGTCAAACAAAAT ...
{'accession': 'AY968064', 'organism': 'Angola_1971', 'seqid': '11089:ncbi:1', 'seqnb': '1', 'source': 'ncbi', 'taxonomyid': '11089'}


1

### Run the simulation

Parameter `fold`:

Fold coverage, also known as sequencing depth or read depth, represents the average number of times each base in the reference genome is expected to be sequenced. For example:
- If you set -f 20, it means you're simulating a sequencing run that would cover each base in the reference genome an average of 20 times.
- If you set -f 100, it would simulate coverage where each base is sequenced an average of 100 times.

The fold coverage is an important parameter because it affects:
- The total number of reads generated: Higher fold coverage results in more reads.
- The likelihood of capturing rare variants or sequencing errors: Higher coverage generally improves the ability to detect rare variants and distinguish true variants from sequencing errors.
- The overall quality of the simulated dataset: Higher coverage typically leads to more accurate representation of the reference genome in the simulated data.

It's worth noting that ART Illumina uses this fold coverage value along with the read length and reference genome size to calculate the total number of reads to generate. The actual formula is:

```Total number of reads = (Genome size * Fold coverage) / Read length```

In [21]:
read_length = 150
genome_size = 10_238
fold = 250

print(f"Estimated number of simulated reads per reference sequence: {(genome_size * fold) // read_length: ,d}.")


Estimated number of simulated reads per reference sequence:  17,063.


In [22]:
sim_params = {
    'input_file': input_fname,
    "sim_type": "paired",
    "read_length": read_length,
    'nb_sequences': nb_sequences,
    "fold": fold,
    'mean_read':read_length + 50,
    'std_read':10,
    'q_profile': 'HS25'
}

sim_params['output_seed'] = f"{sim_params['sim_type']}_{sim_params['nb_sequences']}seq_{sim_params['read_length']}bp"
sim_params

{'input_file': 'yf_1971_angola.fa',
 'sim_type': 'paired',
 'read_length': 150,
 'nb_sequences': 1,
 'fold': 250,
 'mean_read': 200,
 'std_read': 10,
 'q_profile': 'HS25',
 'output_seed': 'paired_1seq_150bp'}

In [23]:
art.sim_reads(
    input_file=sim_params['input_file'],
    output_seed=sim_params['output_seed'],
    sim_type=sim_params['sim_type'],
    read_length=sim_params['read_length'],
    fold=sim_params['fold'],
    mean_read=sim_params['mean_read'],
    std_read=sim_params['std_read'],
    ss=sim_params['q_profile'],  # pass the quality profile explicitly, as in the single read run
    overwrite=True
)

return code:  0 


             ART_Illumina (2008-2016)          
          Q Version 2.5.8 (June 6, 2016)       
     Contact: Weichun Huang <whduke@gmail.com> 
    -------------------------------------------

                  Paired-end sequencing simulation

Total CPU time used: 0.378615

The random seed for the run: 1724163587

Parameters used during run
	Read Length:	150
	Genome masking 'N' cutoff frequency: 	1 in 150
	Fold Coverage:            250X
	Mean Fragment Length:     200
	Standard Deviation:       10
	Profile Type:             Combined
	ID Tag:                   

Quality Profile(s)
	First Read:   HiSeq 2500 Length 150 R1 (built-in profile) 
	First Read:   HiSeq 2500 Length 150 R2 (built-in profile) 

Output files

  FASTQ Sequence Files:
	 the 1st reads: /home/vtec/projects/bio/metagentools/data/ncbi/simreads/yf/paired_1seq_150bp/paired_1seq_150bp1.fq
	 the 2nd reads: /home/vtec/projects/bio/metagentools/data/ncbi/simreads/yf/paired_1seq_150bp/paired_1seq_150bp2.fq

  

In [24]:
art.list_last_output_files()

paired_1seq_150bp2.aln
paired_1seq_150bp2.fq
paired_1seq_150bp1.fq
paired_1seq_150bp1.aln
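
A paired simulation produces two FASTQ files whose records should line up mate by mate. The sketch below checks this, assuming four-line FASTQ records and read identifiers that end in `/1` in the first file and `/2` in the second (an assumption about the naming convention; adapt the suffixes if the files differ). `read_ids` and `mates_are_paired` are hypothetical helpers, and the demo records are made up.

```python
import io
from itertools import zip_longest


def read_ids(lines):
    """Yield read identifiers (without the leading '@') from FASTQ lines."""
    for i, line in enumerate(lines):
        if i % 4 == 0:  # the first line of each four-line record is the header
            yield line.strip()[1:]


def mates_are_paired(lines1, lines2):
    """Return True if both files list the same reads, in order, as /1 and /2 mates."""
    for id1, id2 in zip_longest(read_ids(lines1), read_ids(lines2)):
        if id1 is None or id2 is None:  # one file has more reads than the other
            return False
        if not (id1.endswith("/1") and id2.endswith("/2")):
            return False
        if id1[:-2] != id2[:-2]:  # identifiers must match once the suffix is removed
            return False
    return True


# Made-up records standing in for the two output files
fq1 = io.StringIO("@seq-1/1\nACGT\n+\nIIII\n@seq-2/1\nGGCA\n+\nIIII\n")
fq2 = io.StringIO("@seq-1/2\nTTAC\n+\nIIII\n@seq-2/2\nCATG\n+\nIIII\n")
paired_ok = mates_are_paired(fq1, fq2)
print(paired_ok)
```

With real files, open handles can be passed directly, for example `mates_are_paired(open(p1), open(p2))`, where `p1` and `p2` are the paths to the two `.fq` files.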


In [25]:
art.list_all_output_files()

paired_1seq_150bp
- paired_1seq_150bp2.aln
- paired_1seq_150bp2.fq
- paired_1seq_150bp1.fq
- paired_1seq_150bp1.aln
paired_69seq_150bp
- paired_69seq_150bp1.fq
- paired_69seq_150bp2.fq
- paired_69seq_150bp1.aln
- paired_69seq_150bp2.aln
single_1seq_150bp
- single_1seq_150bp.fq
- single_1seq_150bp.aln
single_69seq_150bp
- single_69seq_150bp.fq
- single_69seq_150bp.aln


# Simulate using all sequences


## Single read simulation - 150 bp reads

In [26]:
print(p2inputs.absolute())
print(p2outputs.absolute())

/home/vtec/projects/bio/metagentools/data/ncbi/refsequences/yf
/home/vtec/projects/bio/metagentools/data/ncbi/simreads/yf


In [27]:
art = ArtIllumina(
    path2app=Path('/usr/bin/art_illumina'), 
    input_dir=p2inputs, 
    output_dir=p2outputs
    )

Ready to operate with art: /usr/bin/art_illumina
Input files from : /home/vtec/projects/bio/metagentools/data/ncbi/refsequences/yf
Output files to :  /home/vtec/projects/bio/metagentools/data/ncbi/simreads/yf


### Define the input file and number of sequences

In [28]:
art.list_all_input_files()

yf_1971_angola.fa
yf_2023_multiple_alignment_original.fa
yf_2023_yellow_fever.fa


In [29]:
input_fname = 'yf_2023_yellow_fever.fa'
fasta = FastaFileReader(p2inputs / input_fname)
nb_sequences = fasta.review()
nb_sequences

There are 69 sequences in this file

First Sequence:
>11089:ncbi:1	1	AY968064	11089	ncbi	Angola_1971
ATGTCTGGTCGAAAAGCTCAGGGTAAAACCCTGGGCGTCAATATGGTAAGACGAGGGGTTCGCTCCTTGTCAAACAAAAT ...
{'accession': 'AY968064', 'organism': 'Angola_1971', 'seqid': '11089:ncbi:1', 'seqnb': '1', 'source': 'ncbi', 'taxonomyid': '11089'}

Last Sequence:
>11089:ncbi:69	69	OM066737	11089	ncbi	VHF-21-037/GHA/Damongo/2021
ATGTCTGGTCGTAAAGCTCAGGGCAAAACCCTGGGCGTCAATATGGTACGACGAGGAGTCCGCTCCNNNNNNNNNAAAAT ...
{'accession': 'OM066737', 'organism': 'VHF-21-037/GHA/Damongo/2021', 'seqid': '11089:ncbi:69', 'seqnb': '69', 'source': 'ncbi', 'taxonomyid': '11089'}


69

### Run the simulation

Parameter `fold`:

Fold coverage, also known as sequencing depth or read depth, represents the average number of times each base in the reference genome is expected to be sequenced. For example:
- If you set -f 20, it means you're simulating a sequencing run that would cover each base in the reference genome an average of 20 times.
- If you set -f 100, it would simulate coverage where each base is sequenced an average of 100 times.

The fold coverage is an important parameter because it affects:
- The total number of reads generated: Higher fold coverage results in more reads.
- The likelihood of capturing rare variants or sequencing errors: Higher coverage generally improves the ability to detect rare variants and distinguish true variants from sequencing errors.
- The overall quality of the simulated dataset: Higher coverage typically leads to more accurate representation of the reference genome in the simulated data.

It's worth noting that ART Illumina uses this fold coverage value along with the read length and reference genome size to calculate the total number of reads to generate. The actual formula is:

```Total number of reads = (Genome size * Fold coverage) / Read length```

In [30]:
read_length = 150
genome_size = 10_238
fold = 250

print(f"Estimated number of simulated reads per reference sequence: {(genome_size * fold) // read_length: ,d}.")

Estimated number of simulated reads per reference sequence:  17,063.


In [31]:
sim_params = {
    'input_file': input_fname,
    "sim_type": "single",
    "read_length": read_length,
    'nb_sequences': nb_sequences,
    "fold": fold,
    'q_profile': 'HS25'
}

sim_params['output_seed'] = f"{sim_params['sim_type']}_{sim_params['nb_sequences']}seq_{sim_params['read_length']}bp"
sim_params

{'input_file': 'yf_2023_yellow_fever.fa',
 'sim_type': 'single',
 'read_length': 150,
 'nb_sequences': 69,
 'fold': 250,
 'q_profile': 'HS25',
 'output_seed': 'single_69seq_150bp'}

In [32]:
art.sim_reads( 
    input_file=sim_params['input_file'],
    output_seed=sim_params['output_seed'],
    sim_type=sim_params['sim_type'],
    read_length=sim_params['read_length'],
    fold=sim_params['fold'],
    ss=sim_params['q_profile'],
    overwrite=True
)

return code:  0 


             ART_Illumina (2008-2016)          
          Q Version 2.5.8 (June 6, 2016)       
     Contact: Weichun Huang <whduke@gmail.com> 
    -------------------------------------------

                  Single-end Simulation

Total CPU time used: 29.0825

The random seed for the run: 1724163599

Parameters used during run
	Read Length:	150
	Genome masking 'N' cutoff frequency: 	1 in 150
	Fold Coverage:            250X
	Profile Type:             Combined
	ID Tag:                   

Quality Profile(s)
	First Read:   HiSeq 2500 Length 150 R1 (built-in profile) 

Output files

  FASTQ Sequence File:
	/home/vtec/projects/bio/metagentools/data/ncbi/simreads/yf/single_69seq_150bp/single_69seq_150bp.fq

  ALN Alignment File:
	/home/vtec/projects/bio/metagentools/data/ncbi/simreads/yf/single_69seq_150bp/single_69seq_150bp.aln




In [33]:
art.list_last_output_files()

single_69seq_150bp.fq
single_69seq_150bp.aln


In [34]:
art.list_all_output_files()

paired_1seq_150bp
- paired_1seq_150bp2.aln
- paired_1seq_150bp2.fq
- paired_1seq_150bp1.fq
- paired_1seq_150bp1.aln
paired_69seq_150bp
- paired_69seq_150bp1.fq
- paired_69seq_150bp2.fq
- paired_69seq_150bp1.aln
- paired_69seq_150bp2.aln
single_1seq_150bp
- single_1seq_150bp.fq
- single_1seq_150bp.aln
single_69seq_150bp
- single_69seq_150bp.fq
- single_69seq_150bp.aln


## Paired read simulation - 150 bp reads

In [36]:
print(p2inputs.absolute())
print(p2outputs.absolute())

/home/vtec/projects/bio/metagentools/data/ncbi/refsequences/yf
/home/vtec/projects/bio/metagentools/data/ncbi/simreads/yf


In [37]:
art = ArtIllumina(
    path2app=Path('/usr/bin/art_illumina'), 
    input_dir=p2inputs, 
    output_dir=p2outputs
    )

Ready to operate with art: /usr/bin/art_illumina
Input files from : /home/vtec/projects/bio/metagentools/data/ncbi/refsequences/yf
Output files to :  /home/vtec/projects/bio/metagentools/data/ncbi/simreads/yf


### Define the input file and number of sequences

In [38]:
art.list_all_input_files()

yf_1971_angola.fa
yf_2023_multiple_alignment_original.fa
yf_2023_yellow_fever.fa


In [39]:
input_fname = 'yf_2023_yellow_fever.fa'
fasta = FastaFileReader(p2inputs / input_fname)
nb_sequences = fasta.review()
nb_sequences

There are 69 sequences in this file

First Sequence:
>11089:ncbi:1	1	AY968064	11089	ncbi	Angola_1971
ATGTCTGGTCGAAAAGCTCAGGGTAAAACCCTGGGCGTCAATATGGTAAGACGAGGGGTTCGCTCCTTGTCAAACAAAAT ...
{'accession': 'AY968064', 'organism': 'Angola_1971', 'seqid': '11089:ncbi:1', 'seqnb': '1', 'source': 'ncbi', 'taxonomyid': '11089'}

Last Sequence:
>11089:ncbi:69	69	OM066737	11089	ncbi	VHF-21-037/GHA/Damongo/2021
ATGTCTGGTCGTAAAGCTCAGGGCAAAACCCTGGGCGTCAATATGGTACGACGAGGAGTCCGCTCCNNNNNNNNNAAAAT ...
{'accession': 'OM066737', 'organism': 'VHF-21-037/GHA/Damongo/2021', 'seqid': '11089:ncbi:69', 'seqnb': '69', 'source': 'ncbi', 'taxonomyid': '11089'}


69

### Run the simulation

Parameter `fold`:

Fold coverage, also known as sequencing depth or read depth, represents the average number of times each base in the reference genome is expected to be sequenced. For example:
- If you set -f 20, it means you're simulating a sequencing run that would cover each base in the reference genome an average of 20 times.
- If you set -f 100, it would simulate coverage where each base is sequenced an average of 100 times.

The fold coverage is an important parameter because it affects:
- The total number of reads generated: Higher fold coverage results in more reads.
- The likelihood of capturing rare variants or sequencing errors: Higher coverage generally improves the ability to detect rare variants and distinguish true variants from sequencing errors.
- The overall quality of the simulated dataset: Higher coverage typically leads to more accurate representation of the reference genome in the simulated data.

It's worth noting that ART Illumina uses this fold coverage value along with the read length and reference genome size to calculate the total number of reads to generate. The actual formula is:

```Total number of reads = (Genome size * Fold coverage) / Read length```

In [43]:
read_length = 150
genome_size = 10_238
fold = 250

print(f"Estimated number of simulated reads per reference sequence: {(genome_size * fold) // read_length: ,d}.")
print(f"Estimated total number of simulated reads: {(genome_size * fold) // read_length * nb_sequences: ,d}.")


Estimated number of simulated reads per reference sequence:  17,063.
Estimated total number of simulated reads:  1,177,347.
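
The total above applies a single hard-coded `genome_size` to all sequences. If the sequences in the fasta file differ in length, a per-sequence estimate may be closer. The following is a rough sketch using plain Python and the same formula; `fasta_lengths` and `estimate_total_reads` are hypothetical helpers, and the demo sequences are made up.

```python
import io


def fasta_lengths(lines):
    """Yield (header, sequence_length) for each record in fasta-formatted lines."""
    header, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, length
            header, length = line[1:], 0
        else:
            length += len(line)  # sequences may span several lines
    if header is not None:
        yield header, length


def estimate_total_reads(lengths, fold, read_length):
    """Sum the per-sequence estimates (length * fold) // read_length."""
    return sum((n * fold) // read_length for n in lengths)


# Made-up two-sequence fasta content
demo = io.StringIO(">a\nACGTACGT\nACGT\n>b\nACGTAC\n")
lengths = [n for _, n in fasta_lengths(demo)]
total = estimate_total_reads(lengths, fold=50, read_length=4)
print(lengths, total)
```

With the real data, `fasta_lengths(open(p2inputs / input_fname))` would supply the lengths of the reference sequences.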


In [44]:
sim_params = {
    'input_file': input_fname,
    "sim_type": "paired",
    "read_length": read_length,
    'nb_sequences': nb_sequences,
    "fold": fold,
    'mean_read':read_length + 50,
    'std_read':10,
    'q_profile': 'HS25'
}

sim_params['output_seed'] = f"{sim_params['sim_type']}_{sim_params['nb_sequences']}seq_{sim_params['read_length']}bp"
sim_params

{'input_file': 'yf_2023_yellow_fever.fa',
 'sim_type': 'paired',
 'read_length': 150,
 'nb_sequences': 69,
 'fold': 250,
 'mean_read': 200,
 'std_read': 10,
 'q_profile': 'HS25',
 'output_seed': 'paired_69seq_150bp'}

In [45]:
art.sim_reads(
    input_file=sim_params['input_file'],
    output_seed=sim_params['output_seed'],
    sim_type=sim_params['sim_type'],
    read_length=sim_params['read_length'],
    fold=sim_params['fold'],
    mean_read=sim_params['mean_read'],
    std_read=sim_params['std_read'],
    ss=sim_params['q_profile'],  # pass the quality profile explicitly, as in the single read run
    overwrite=True
)

return code:  0 


             ART_Illumina (2008-2016)          
          Q Version 2.5.8 (June 6, 2016)       
     Contact: Weichun Huang <whduke@gmail.com> 
    -------------------------------------------

                  Paired-end sequencing simulation

Total CPU time used: 26.4077

The random seed for the run: 1724163819

Parameters used during run
	Read Length:	150
	Genome masking 'N' cutoff frequency: 	1 in 150
	Fold Coverage:            250X
	Mean Fragment Length:     200
	Standard Deviation:       10
	Profile Type:             Combined
	ID Tag:                   

Quality Profile(s)
	First Read:   HiSeq 2500 Length 150 R1 (built-in profile) 
	First Read:   HiSeq 2500 Length 150 R2 (built-in profile) 

Output files

  FASTQ Sequence Files:
	 the 1st reads: /home/vtec/projects/bio/metagentools/data/ncbi/simreads/yf/paired_69seq_150bp/paired_69seq_150bp1.fq
	 the 2nd reads: /home/vtec/projects/bio/metagentools/data/ncbi/simreads/yf/paired_69seq_150bp/paired_69seq_150bp2.fq


In [46]:
art.list_last_output_files()

paired_69seq_150bp1.fq
paired_69seq_150bp2.fq
paired_69seq_150bp1.aln
paired_69seq_150bp2.aln


In [47]:
art.list_all_output_files()

paired_1seq_150bp
- paired_1seq_150bp2.aln
- paired_1seq_150bp2.fq
- paired_1seq_150bp1.fq
- paired_1seq_150bp1.aln
paired_69seq_150bp
- paired_69seq_150bp1.fq
- paired_69seq_150bp2.fq
- paired_69seq_150bp1.aln
- paired_69seq_150bp2.aln
single_1seq_150bp
- single_1seq_150bp.fq
- single_1seq_150bp.aln
single_69seq_150bp
- single_69seq_150bp.fq
- single_69seq_150bp.aln


In [51]:
p2aln = p2outputs / 'paired_69seq_150bp/paired_69seq_150bp1.aln'
assert p2aln.exists()
aln = AlnFileReader(p2aln)
print('\n'.join(aln.header['reference sequences']))

@SQ	11089:ncbi:1	1	AY968064	11089	ncbi	Angola_1971	10237
@SQ	11089:ncbi:2	2	U54798	11089	ncbi	Ivory_Coast_1982	10237
@SQ	11089:ncbi:3	3	DQ235229	11089	ncbi	Ethiopia_1961	10237
@SQ	11089:ncbi:4	4	AY572535	11089	ncbi	Gambia_2001	10237
@SQ	11089:ncbi:5	5	MF405338	11089	ncbi	Ghana_Hsapiens_1927	10237
@SQ	11089:ncbi:6	6	U21056	11089	ncbi	Senegal_1927	10237
@SQ	11089:ncbi:7	7	AY968065	11089	ncbi	Uganda_1948	10237
@SQ	11089:ncbi:8	8	JX898871	11089	ncbi	ArD114896_Senegal_1995	10237
@SQ	11089:ncbi:9	9	JX898872	11089	ncbi	Senegal_Aedes-aegypti_1995	10237
@SQ	11089:ncbi:10	10	GQ379163	11089	ncbi	Peru_Hsapiens_2007	10237
@SQ	11089:ncbi:11	11	DQ118157	11089	ncbi	Spain_Vaccine_2004	10237
@SQ	11089:ncbi:12	12	MF289572	11089	ncbi	Singapore_2017	10237
@SQ	11089:ncbi:13	13	KU978764	11089	ncbi	Sudan_Hsapiens_1941	10237
@SQ	11089:ncbi:14	14	JX898878	11089	ncbi	ArD181250_Senegal_2005	10237
@SQ	11089:ncbi:15	15	JX898879	11089	ncbi	ArD181676_Senegal_2005	10237
@SQ	11089:ncbi:16	16	JX898881	11089	ncbi	Senegal