# Inference for Simulated CoV sequences

In this notebook: 
- inference only, using the **original pretrained model**.
- input data are **reads simulated from CoV sequences**

The notebook was built and tested to run locally but should also work on Colab or Kaggle. On Colab, it assumes that the project's shared Google Drive directory is accessible through a shortcut called `Metagenomics` at the root of the drive.

# 1. Imports and setup environment

### Install and import packages

In [1]:
# Install required custom packages if not installed yet.
import importlib.util
if not importlib.util.find_spec('ecutilities'):
    print('installing package: `ecutilities`')
    ! pip install -qqU ecutilities
else:
    print('`ecutilities` already installed')
if not importlib.util.find_spec('metagentools'):
    print('installing package: `metagentools`')
    ! pip install -qqU metagentools
else:
    print('`metagentools` already installed')

`ecutilities` already installed
`metagentools` already installed


In [197]:
# Import all required packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os

from ecutilities.core import path_to_parent_dir
from ecutilities.ipython import nb_setup
from pathlib import Path
from pprint import pprint
from tqdm.notebook import tqdm, trange

# Setup the notebook for development
nb_setup()

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # or any {'0', '1', '2'}
import tensorflow as tf
from tensorflow.python.client import device_lib
from tensorflow.keras.models import load_model
print(f"Tensorflow version: {tf.__version__}\n")

from metagentools.cnn_virus.data import strings_to_tensors, create_infer_ds_from_fastq
from metagentools.cnn_virus.data import FastaFileReader, FastqFileReader, AlnFileReader
from metagentools.core import TextFileBaseReader, ProjectFileSystem

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Set autoreload mode
Tensorflow version: 2.8.2



List all computing devices available on the machine

In [3]:
devices = device_lib.list_local_devices()
print('\nDevices:')
for d in devices:
    t = d.device_type
    name = d.physical_device_desc
    l = [item.split(':', 1) for item in name.split(', ')]
    name_attr = dict([x for x in l if len(x)==2])
    dev = name_attr.get('name', ' ')
    print(f"  - {t}  {d.name} {dev:25s}")


Devices:
  - CPU  /device:CPU:0                          
  - GPU  /device:GPU:0  NVIDIA GeForce GTX 1050 


# 2. Review project file system

This code assumes the project file system adopts the following structure:

If running locally:
```text
    project-root   
        |--- data
        |      |--- CNN_Virus_data  (all data related to CNN Virus original paper)
        |      |--- cov_simreads
        |      |--- ds
        |      |--- saved
        |      
        |--- nbs  (all reference and work notebooks)
        |      |--- cnn_virus
        |      |        |--- notebooks.ipynb
```

If running on Google Colab:

The Google Drive root includes a shortcut named `Metagenomics` that points to the following shared project directory: https://drive.google.com/drive/folders/134uei5fmt08TpzhmjG4sW0FQ06kn2ZfZ

Key folders and system information

In [187]:
pfs = ProjectFileSystem()
pfs.info()

Running linux on local computer
Device's home directory: /home/vtec
Project file structure:
 - Root ........ /home/vtec/projects/bio/metagentools 
 - Data Dir .... /home/vtec/projects/bio/metagentools/data 
 - Notebooks ... /home/vtec/projects/bio/metagentools/nbs


# Experiments with CoV simulated reads

## Setup paths

- `p2saved`: path to the file with the saved original model
- `p2virus_labels`: path to the file mapping virus names to labels for the original model
- `p2simreads`: path to the folder where the read files are located (FASTQ and ALN)

In [111]:
p2saved = pfs.data / 'saved/cnn_virus_original/pretrained_model.h5'
assert p2saved.is_file(), f"No file found at {p2saved.absolute()}"

p2virus_labels = pfs.data / 'CNN_Virus_data/virus_name_mapping'
assert p2virus_labels.is_file(), f"No file found at {p2virus_labels.absolute()}"

p2simreads = pfs.data / 'cov_simreads/single_10seq_50bp'
assert p2simreads.is_dir(), f"No directory found at {p2simreads.absolute()}"

## Explore simulated read output files

When simulating reads with ART Illumina:
- We simulate reads from reference sequences in a fasta file (`fasta` reader)
- ART returns two files:
    - a fastq file (`fastq` reader) with the read sequences, their quality and some read metadata
    - an aln file (`aln` reader), including:
        - a header with the simulation command and the list of reference sequences
        - additional metadata on each read, including:
            - the read id and its number (rank in the list)
            - the read sequence itself (may include errors)
            - the corresponding segment as it appears in the reference sequence
            - the read alignment starting position in the reference sequence
            - the reference sequence strand
            - the reference sequence id and its number in the sequence file
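
To make the fastq layout concrete, here is a minimal sketch of a reader, assuming the standard four-line record layout (header starting with `@`, sequence, `+` separator, quality string). It is not the `FastqFileReader` implementation: `iter_fastq` is a hypothetical helper, and the quality string in the sample record is a placeholder made up for illustration.

```python
import io


def iter_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of stream
        if not header.startswith("@"):
            raise ValueError(f"unexpected header line: {header!r}")
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored in this sketch
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


# Read id and sequence are taken from this notebook;
# the quality string is a placeholder (assumption).
sample = io.StringIO(
    "@2591237:ncbi:1-60400\n"
    "ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG\n"
    "+\n"
    + "I" * 50 + "\n"
)
records = list(iter_fastq(sample))
print(records[0][0], len(records[0][1]))
```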

We load the fastq and aln files using their respective file readers.

In [112]:
from metagentools.cnn_virus.data import FastaFileReader, FastqFileReader, AlnFileReader

In [113]:
p2fastq = p2simreads / f"{p2simreads.stem}.fq"
assert p2fastq.is_file()
p2aln = p2simreads / f"{p2simreads.stem}.aln"
assert p2aln.is_file()

print(f" fq reads file:  {p2fastq.name}\n aln reads file: {p2aln.name}")

fastq = FastqFileReader(p2fastq)
aln = AlnFileReader(p2aln)

 fq reads file:  single_10seq_50bp.fq
 aln reads file: single_10seq_50bp.aln


### Exploring information about the simreads in fastq

In [114]:
for i, (k,v) in enumerate(fastq.parse_file(add_readseq=True).items()):
    print(f"Read {i}:")
    pprint(v)
    print()

    if i+1 >= 3: break

Read 0:
{'readid': '2591237:ncbi:1-60400',
 'readnb': '60400',
 'readseq': 'ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}

Read 1:
{'readid': '2591237:ncbi:1-60399',
 'readnb': '60399',
 'readseq': 'GATCAATGTGGCATCTACAATACAGACAGCATGAAGCACCACCAAAGGAC',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}

Read 2:
{'readid': '2591237:ncbi:1-60398',
 'readnb': '60398',
 'readseq': 'ATCTACCAGTGGTAGATGGGTTCTTAATAATGAACATTATAGAGCTCTAC',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}
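
The metadata fields above can all be recovered from the read id itself. The following sketch assumes the id layout `reftaxonomyid:refsource:refseqnb-readnb`, as suggested by the output above; `split_read_id` is a hypothetical helper, not part of `metagentools`.

```python
def split_read_id(read_id):
    """Split 'reftaxonomyid:refsource:refseqnb-readnb' into its components."""
    refseqid, readnb = read_id.rsplit("-", 1)
    reftaxonomyid, refsource, refseqnb = refseqid.split(":")
    return {
        "readid": read_id,
        "readnb": readnb,
        "refseqid": refseqid,
        "refseqnb": refseqnb,
        "refsource": refsource,
        "reftaxonomyid": reftaxonomyid,
    }


info = split_read_id("2591237:ncbi:1-60400")
print(info)
```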



### Exploring information in ALN header

The header of ALN file includes:
- the ART command used to simulate the reads
- the list of reference sequences used to create the reads, with metadata

First we can see from which reference fasta file the simulated sequences were imported, by checking the ART command in the alignment file's header. Then we can open a fasta reader for the reference fasta file as well:

In [115]:
aln.header['command']

'/bin/art_illumina -i /home/vtec/projects/bio/metagentools/data/cov_data/cov_virus_sequences_ten.fa -ss HS25 -l 50 -f 100 -o /home/vtec/projects/bio/metagentools/data/cov_simreads/single_10seq_50bp/single_10seq_50bp -rs 1674660835'

In [116]:
p2fasta = pfs.data / 'cov_data/cov_virus_sequences_ten.fa'
assert p2fasta.is_file()
fasta = FastaFileReader(p2fasta)

Then we can review the metadata available for each reference sequence, as listed in `aln`'s header.

This information can be seen in `aln.header` or with the method `parse_header_reference_sequences()`

In [117]:
aln.header

{'command': '/bin/art_illumina -i /home/vtec/projects/bio/metagentools/data/cov_data/cov_virus_sequences_ten.fa -ss HS25 -l 50 -f 100 -o /home/vtec/projects/bio/metagentools/data/cov_simreads/single_10seq_50bp/single_10seq_50bp -rs 1674660835',
 'reference sequences': ['@SQ\t2591237:ncbi:1 [MK211378]\t2591237\tncbi\t1 [MK211378] 2591237\tCoronavirus BtRs-BetaCoV/YN2018D\t\tscientific name\t30213',
  '@SQ\t11128:ncbi:2 [LC494191]\t11128\tncbi\t2 [LC494191] 11128\tBovine coronavirus\t\tscientific name\t30942',
  '@SQ\t31631:ncbi:3 [KY967361]\t31631\tncbi\t3 [KY967361] 31631\tHuman coronavirus OC43\t\tscientific name\t30661',
  '@SQ\t277944:ncbi:4 [LC654455]\t277944\tncbi\t4 [LC654455] 277944\tHuman coronavirus NL63\t\tscientific name\t27516',
  '@SQ\t11120:ncbi:5 [MN987231]\t11120\tncbi\t5 [MN987231] 11120\tInfectious bronchitis virus\t\tscientific name\t27617',
  '@SQ\t28295:ncbi:6 [KU893866]\t28295\tncbi\t6 [KU893866] 28295\tPorcine epidemic diarrhea virus\t\tscientific name\t28043',
 

In [118]:
# Parse the ALN header and print metadata for the first few reference sequences

header_refseq_meta = aln.parse_header_reference_sequences()

print(f"Reads simulated from {len(header_refseq_meta)} reference sequences.\n")

for i, (k,v) in enumerate(header_refseq_meta.items()):
    print(f"Reference Sequence {i}:")
    pprint(v)
    print()

    if i+1 >= 3: break

Reads simulated from 10 reference sequences.

Reference Sequence 0:
{'refseq_accession': 'MK211378',
 'refseq_length': '30213',
 'refseqid': '2591237:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237',
 'species': 'Coronavirus BtRs-BetaCoV/YN2018D  scientific name'}

Reference Sequence 1:
{'refseq_accession': 'LC494191',
 'refseq_length': '30942',
 'refseqid': '11128:ncbi:2',
 'refseqnb': '2',
 'refsource': 'ncbi',
 'reftaxonomyid': '11128',
 'species': 'Bovine coronavirus  scientific name'}

Reference Sequence 2:
{'refseq_accession': 'KY967361',
 'refseq_length': '30661',
 'refseqid': '31631:ncbi:3',
 'refseqnb': '3',
 'refsource': 'ncbi',
 'reftaxonomyid': '31631',
 'species': 'Human coronavirus OC43  scientific name'}



We see that the first reference sequence is a **Coronavirus BtRs-BetaCoV/YN2018D**
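
For illustration, the `@SQ` header lines can also be parsed by hand. The sketch below assumes the tab-separated layout visible in `aln.header` (id and accession, taxonomy id, source, a repeated description, species name, an empty field, a name type, and the sequence length). `parse_sq_line` is a hypothetical helper and is not how `parse_header_reference_sequences()` is implemented; unlike the library output, it keeps only the species name without the `scientific name` suffix.

```python
import re

SQ_LINE = (
    "@SQ\t2591237:ncbi:1 [MK211378]\t2591237\tncbi\t1 [MK211378] 2591237\t"
    "Coronavirus BtRs-BetaCoV/YN2018D\t\tscientific name\t30213"
)


def parse_sq_line(line):
    """Parse one '@SQ' reference sequence line from the ALN header."""
    fields = line.split("\t")
    if fields[0] != "@SQ":
        raise ValueError("not an @SQ line")
    match = re.fullmatch(r"(\S+) \[([^\]]+)\]", fields[1])
    if match is None:
        raise ValueError(f"unexpected id field: {fields[1]!r}")
    return {
        "refseqid": match.group(1),
        "refseq_accession": match.group(2),
        "reftaxonomyid": fields[2],
        "refsource": fields[3],
        "species": fields[5],
        "refseq_length": int(fields[-1]),  # last field is the sequence length
    }


meta = parse_sq_line(SQ_LINE)
print(meta)
```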

### Exploring metadata available for simulated reads

In [119]:
reads_meta = aln.parse_file(add_ref_seq_aligned=True, add_read_seq_aligned=True)

print(f"{len(reads_meta):,d} reads were simulated.\n")

for i, (k,v) in enumerate(reads_meta.items()):
    print(f"Read {i}")
    print(k)
    pprint(v)
    print()
    if i+1>=3: break

571,980 reads were simulated.

Read 0
2591237:ncbi:1-60400
{'aln_start_pos': '14770',
 'read_seq_aligned': 'ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG',
 'readid': '2591237:ncbi:1-60400',
 'readnb': '60400',
 'ref_seq_aligned': 'ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG',
 'refseq_strand': '+',
 'refseqid': '2591237:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}

Read 1
2591237:ncbi:1-60399
{'aln_start_pos': '17012',
 'read_seq_aligned': 'GATCAATGTGGCATCTACAATACAGACAGCATGAAGCACCACCAAAGGAC',
 'readid': '2591237:ncbi:1-60399',
 'readnb': '60399',
 'ref_seq_aligned': 'GATCAATGTGGCATCTACAATACAGACAGCATGAAGCACCACCAAAGGAC',
 'refseq_strand': '-',
 'refseqid': '2591237:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}

Read 2
2591237:ncbi:1-60398
{'aln_start_pos': '9188',
 'read_seq_aligned': 'ATCTACCAGTGGTAGATGGGTTCTTAATAATGAACATTATAGAGCTCTAC',
 'readid': '2591237:ncbi:1-60398',
 'readnb': '60398',
 'ref_seq_aligned':

### Comparing metadata in fasta, fastq and aln

There is redundancy in the metadata across the three sets of files (`.fa`, `.fq`, `.aln`). Let's compare them to confirm that they are consistent.

Create the following list and dictionaries parsing metadata from fasta, fastq and aln:
- `refseqs_fasta`: metadata in fasta sequence file used as reference sequences to create reads
- `simreads`: metadata on reads, as available in the fastq read file
- `refseqs_aln`: reference sequences with metadata as available in the aln header
- `simread_align`: read sequence and alignment info

In [120]:
refseqs_fasta = fasta.parse_file(add_seq=True)
simreads = fastq.parse_file(add_readseq=True)
refseqs_aln = aln.ref_sequences
simread_align = aln.parse_file(add_ref_seq_aligned=True, add_read_seq_aligned=True)

In [121]:
print(f"reads simulated from the following {len(refseqs_fasta)} full sequences from `{fasta.path.name}`:\n")
print('\n'.join([f" {i:02d}: {refseq['species']}" for i, refseq in enumerate(refseqs_fasta.values())]))

reads simulated from the following 10 full sequences from `cov_virus_sequences_ten.fa`:

 00: Coronavirus BtRs-BetaCoV/YN2018D  scientific name
 01: Bovine coronavirus  scientific name
 02: Human coronavirus OC43  scientific name
 03: Human coronavirus NL63  scientific name
 04: Infectious bronchitis virus  scientific name
 05: Porcine epidemic diarrhea virus  scientific name
 06: Porcine epidemic diarrhea virus  scientific name
 07: Porcine epidemic diarrhea virus  scientific name
 08: Porcine epidemic diarrhea virus  scientific name
 09: Camel alphacoronavirus  scientific name


In [122]:
print(f"{len(simreads):,d} reads generated and available in fastq file.")
print('For each read, following metadata is available in the fastq file:')
for s in ['readid', 'readnb', 'refseqnb', 'refsource', 'reftaxonomyid', 'readseq']:
    print(f" - {s}")

571,980 reads generated and available in fastq file.
For each read, following metadata is available in the fastq file:
 - readid
 - readnb
 - refseqnb
 - refsource
 - reftaxonomyid
 - readseq


In [123]:
print(f"{len(simread_align):,d} reads generated and available in aln file.")
print('For each read, following metadata is available in the aln file:')
for s in ['aln_start_pos', 'readid', 'readnb', 'refseq_strand', 'refseqid', 'refseqnb', 'refsource', 'reftaxonomyid', 'ref_seq_aligned', 'read_seq_aligned']:
    print(f" - {s}")

571,980 reads generated and available in aln file.
For each read, following metadata is available in the aln file:
 - aln_start_pos
 - readid
 - readnb
 - refseq_strand
 - refseqid
 - refseqnb
 - refsource
 - reftaxonomyid
 - ref_seq_aligned
 - read_seq_aligned


Check consistency between refseqs from fasta and from aln 

In [124]:
# utility functions
def complementary_strand(seq):
    """Return the base-by-base complement of a strand (order is not reversed)"""
    conv = {'A':'T', 'C':'G', 'G':'C', 'T':'A'}
    return ''.join([conv[base] for base in seq])

strand = 'ATCCGTGGGT'
print(strand, complementary_strand(strand))

def reverse_sequence(seq):
    return seq[::-1]

print(strand, reverse_sequence(strand))

ATCCGTGGGT TAGGCACCCA
ATCCGTGGGT TGGGTGCCTA
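
The two helpers above are typically combined into a reverse complement. As a possible alternative, the sketch below uses `str.translate` and also maps the ambiguous base `N` to itself, which the dictionary-based version would reject with a `KeyError`. Handling `N` this way is an assumption about what is wanted here, and `reverse_complement` is a hypothetical helper.

```python
# Translation table: each base maps to its complement, 'N' stays 'N' (assumption)
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATCCGTGGGT"))
```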


### Check aln reference sequence information

In [125]:
refseqid = '2591237:ncbi:1'
# refseqid = '11128:ncbi:2'

In the source `.fa` file

In [126]:
original_seq = refseqs_fasta[refseqid]['sequence']
original_seq_accession = refseqs_fasta[refseqid]['accession']

original_seq_accession, len(original_seq)

('MK211378', 30213)

In the output `.aln` file

In [127]:
assert original_seq_accession == refseqs_aln[refseqid]['refseq_accession']
assert len(original_seq) == int(refseqs_aln[refseqid]['refseq_length'])

refseqs_aln[refseqid]['refseq_accession'], int(refseqs_aln[refseqid]['refseq_length'])

('MK211378', 30213)

### Check alignment information

In [128]:
pprint(simread_align['2591237:ncbi:1-60400'])

{'aln_start_pos': '14770',
 'read_seq_aligned': 'ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG',
 'readid': '2591237:ncbi:1-60400',
 'readnb': '60400',
 'ref_seq_aligned': 'ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG',
 'refseq_strand': '+',
 'refseqid': '2591237:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}


Select all reads generated from a single reference sequence

In [129]:
print(f"Select all reads from reference sequence '{refseqid}''")
reads_from_refseq = {k:v for k,v in simread_align.items() if v['refseqid']==refseqid}
nbr_generated_reads = len(reads_from_refseq)
print(f"Total of {nbr_generated_reads:,d} reads")

Select all reads from reference sequence '2591237:ncbi:1''
Total of 60,400 reads
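
As a rough sanity check, the number of reads can be compared with what the ART command implies. The sketch below assumes that `-l` is the read length and `-f` the fold coverage, so that the expected read count is roughly coverage × reference length / read length. The exact rounding applied by ART is not checked here.

```python
import shlex

command = (
    "/bin/art_illumina -i /home/vtec/projects/bio/metagentools/data/cov_data/cov_virus_sequences_ten.fa "
    "-ss HS25 -l 50 -f 100 "
    "-o /home/vtec/projects/bio/metagentools/data/cov_simreads/single_10seq_50bp/single_10seq_50bp "
    "-rs 1674660835"
)

# Pair each flag with the value that follows it
tokens = shlex.split(command)
opts = dict(zip(tokens[1::2], tokens[2::2]))

read_len = int(opts["-l"])        # read length in bases
fold_coverage = float(opts["-f"])  # fold coverage (assumed meaning of -f)
refseq_length = 30213              # length of reference sequence MK211378
observed_reads = 60400             # reads found for this reference sequence

expected_reads = fold_coverage * refseq_length / read_len
rel_gap = abs(expected_reads - observed_reads) / expected_reads
print(f"expected ~{expected_reads:,.0f} reads, observed {observed_reads:,d} (gap {rel_gap:.2%})")
```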


In [130]:
n = 6
selected_simread = [v for k,v in reads_from_refseq.items()][n]
pprint(selected_simread)

{'aln_start_pos': '11417',
 'read_seq_aligned': 'CTCTAACTATTCTGGTGTCGTCACGACTATCATGTTTTTAGCTAGAGCTA',
 'readid': '2591237:ncbi:1-60394',
 'readnb': '60394',
 'ref_seq_aligned': 'CTCTAACTATTCTGGTGTCGTCACGACTATCATGTTTTTAGCTAGAGCTA',
 'refseq_strand': '+',
 'refseqid': '2591237:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}


In [131]:
def check_alignment(n, reads):
    selected_simread = [v for k,v in reads.items()][n]
    print(f"{'='*80}")
    print(f"checking read {selected_simread['readid']}")
    start = int(selected_simread['aln_start_pos'])
    strand = selected_simread['refseq_strand']
    print(f"simread info:")
    print(f" - from `{strand}` strand")
    print(f" - position: {start:,d}")

    if strand == '+':
        segment_from_refseq = original_seq[start:start+50]
    else:
        segment_from_refseq = complementary_strand(reverse_sequence(original_seq)[start:start+50])

    print('sequences:')
    print(f'- simread seq          :', selected_simread['read_seq_aligned'])
    print(f'- refseq aligned       :', selected_simread['ref_seq_aligned'])
    print(f'- segment in orig. seq :', segment_from_refseq)
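
The segment lookup used in `check_alignment` can be isolated into a small function. This is a sketch under the same convention as above (for the `-` strand, the start position is counted from the start of the reversed sequence), with the read length passed as a parameter instead of the hard-coded 50. `segment_for_read` and the toy reference are hypothetical.

```python
def complement(seq):
    """Base-by-base complement, order preserved."""
    conv = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(conv[base] for base in seq)


def segment_for_read(reference, start, strand, read_len):
    """Return the reference segment a read should match, given strand and start."""
    if strand == "+":
        return reference[start:start + read_len]
    if strand == "-":
        # Position is relative to the reversed sequence, then complemented
        return complement(reference[::-1][start:start + read_len])
    raise ValueError(f"unknown strand: {strand!r}")


toy_ref = "ATCCGTGGGTAC"  # made-up reference, for illustration only
print(segment_for_read(toy_ref, 2, "+", 4))
print(segment_for_read(toy_ref, 2, "-", 4))
```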

In [132]:
for n in range(nbr_generated_reads-1, nbr_generated_reads-6, -1):
    check_alignment(n, reads_from_refseq)

checking read 2591237:ncbi:1-1
simread info:
 - from `+` strand
 - position: 12,125
sequences:
- simread seq          : CAAAAAGTTAAAGAAATCTTTGAATGTGGCTAAATCTGAGTTTGACCGTG
- refseq aligned       : CAAAAAGTTAAAGAAATCTTTGAATGTGGCTAAATCTGAGTTTGACCGTG
- segment in orig. seq : CAAAAAGTTAAAGAAATCTTTGAATGTGGCTAAATCTGAGTTTGACCGTG
checking read 2591237:ncbi:1-2
simread info:
 - from `-` strand
 - position: 15,136
sequences:
- simread seq          : TCTATTTGTCATAGTACTACAGATAGAGACACCAGCTACGGTGCGAGCTC
- refseq aligned       : TCTATTTGTCATAGTACTACAGATAGAGACACCAGCTACGGTGCGAGCTC
- segment in orig. seq : TCTATTTGTCATAGTACTACAGATAGAGACACCAGCTACGGTGCGAGCTC
checking read 2591237:ncbi:1-3
simread info:
 - from `+` strand
 - position: 19,210
sequences:
- simread seq          : TGATGGTGGTAGCTTGTATGTGAATAAGCATGCATTCCACACTCCAGCTT
- refseq aligned       : TGATGGTGGTAGCTTGTATGTGAATAAGCATGCATTCCACACTCCAGCTT
- segment in orig. seq : TGATGGTGGTAGCTTGTATGTGAATAAGCATGCATTCCACACTCCAGCTT
checking read 2591237:ncbi:1-4


# Create inference dataset

The model expects a dataset file in the following format:

```text
    AAAAAGATTTTGAGAGAGGTCGACCTGTCCTCCTAAAACGTTTACAAAAG
    CATGTAACGCAGCTTAGTCCGATCGTGGCTATAATCCGTCTTTCGATTTG
    AACAACATCTTGTTGATGATAACCGTCAAAGTGTTTTGGGTCTGGAGGGA
    AGTACCTGGAGAGCGTTAAGAAACACAAACGGCTGGATGTAGTGCCGCGC
    CCACGTCGATGAAGCTCCGACGAGAGTCGGCGCTGAGCCCGCGCACCTCC
```

Each line corresponds to a **read sequence**. During inference, the model will predict the **virus species code** and the **relative position** of the read in the full reference sequence.

The mapping between codes and virus species names is stored in the file `virus_name_mapping` (see `p2virus_labels`).

### Create a ds file from the simulated read output files

The function `create_infer_ds_from_fastq` takes reads from a `fastq` file and creates an inference dataset in the format expected by the model.

The function returns the path to the inference dataset file, the path to the read metadata file, and a DataFrame with the metadata of each read.

Datasets are saved into `data/ds`

In [133]:
p2datasets = pfs.data / 'ds'
assert p2datasets.is_dir()

In [163]:
nsamples = None

p2ds, p2meta, reads_info = create_infer_ds_from_fastq(
    p2fastq=p2fastq, 
    output_dir=p2datasets,
    overwrite_ds=True, 
    nsamples=nsamples
    )


Dataset with 571,980 reads


In [164]:
print(f"Path to inference dataset file: {p2ds.absolute()}")
print(f"Path to read metadata file:     {p2meta.absolute()}")

Path to inference dataset file: /home/vtec/projects/bio/metagentools/data/ds/single_10seq_50bp_ds
Path to read metadata file:     /home/vtec/projects/bio/metagentools/data/ds/single_10seq_50bp_metadata.csv


In [165]:
reads_info.head()

| | read_ids | read_refseqs | read_start_pos | read_strand |
|---|---|---|---|---|
| 0 | 2591237:ncbi:1-60400 | 2591237:ncbi:1 | 14770 | + |
| 1 | 2591237:ncbi:1-60399 | 2591237:ncbi:1 | 17012 | - |
| 2 | 2591237:ncbi:1-60398 | 2591237:ncbi:1 | 9188 | + |
| 3 | 2591237:ncbi:1-60397 | 2591237:ncbi:1 | 6764 | - |
| 4 | 2591237:ncbi:1-60396 | 2591237:ncbi:1 | 27357 | + |


### Create the data loader for the model 

In [166]:
import tensorflow as tf
from tensorflow.keras.models import load_model
from metagentools.cnn_virus.data import strings_to_tensors

In [175]:
bs = 128
# bs = 256

text_ds = tf.data.TextLineDataset(p2ds).batch(bs)
ds = text_ds.map(strings_to_tensors)
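
As a quick worked example of what batching implies for this dataset: with 571,980 reads and a batch size of 128, and since `batch` keeps the final partial batch by default, the number of batches and the size of the last one can be computed as follows. This is plain arithmetic, not a call into TensorFlow.

```python
import math

n_reads = 571_980
batch_size = 128

n_batches = math.ceil(n_reads / batch_size)
last_batch_size = n_reads - (n_batches - 1) * batch_size
print(f"{n_batches:,d} batches, the last one with {last_batch_size} reads")
```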

The bases in the read sequences are encoded as a 5-dim one-hot-encoded vector, as the model expects.

In this example, each 50bp read is converted into a tensor of shape [50,5]

In [176]:
for batch, (y1b, y2b) in ds.take(1):
    # show the shape of one batch
    print(batch.shape)
    # show the first 10 bases, after one-hot encoding
    print(batch[0, :10, :])

(128, 50, 5)
tf.Tensor(
[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0.]], shape=(10, 5), dtype=float32)
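
The encoding can be reproduced in plain Python. The sketch below is not the `strings_to_tensors` implementation. The channel positions for `A`, `C` and `T` are consistent with the output above for the first read (`ACAACTCCTA...`), while placing `G` at index 2 and `N` at index 4 is an assumption.

```python
# A, C, T match the tensor printed above; G and N positions are assumed
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def one_hot_read(seq):
    """Encode a read as a list of 5-dim one-hot rows, one row per base."""
    rows = []
    for base in seq:
        row = [0.0] * len(BASE_INDEX)
        row[BASE_INDEX[base]] = 1.0
        rows.append(row)
    return rows


encoded = one_hot_read("ACAACTCCTA")
for row in encoded:
    print(row)
```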


# Inference

In [177]:
model = load_model(p2saved)

In [178]:
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 50, 5)]      0           []                               
                                                                                                  
 conv1d_1 (Conv1D)              (None, 50, 512)      13312       ['input_1[0][0]']                
                                                                                                  
 batch_normalization_1 (BatchNo  (None, 50, 512)     2048        ['conv1d_1[0][0]']               
 rmalization)                                                                                     
                                                                                                  
 max_pooling1d_1 (MaxPooling1D)  (None, 25, 512)     0           ['batch_normalization_1[0][

The model returns two sets of probabilities:
- `prob_preds_species`: for each input read, a vector of 187 probabilities, one for each virus species the model was trained on
- `prob_preds_pos`: a vector of 10 values representing the probability that the read is from the corresponding segment of the original sequence (1 to 10)
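
To relate the position output to the alignment metadata, one possible reading is that the ten classes correspond to ten equal-length slices of the reference sequence. This is an assumption that the notebook does not confirm. Under that assumption, a read's start position maps to a 0-based slice index as sketched below (`position_segment` is a hypothetical helper).

```python
def position_segment(start, refseq_length, n_segments=10):
    """Map a start position to a 0-based slice index, assuming equal-length slices."""
    if not 0 <= start < refseq_length:
        raise ValueError("start must lie inside the reference sequence")
    return start * n_segments // refseq_length


# Read 2591237:ncbi:1-60400 starts at 14770 in a 30213-base reference
segment = position_segment(14770, 30213)
print(f"0-based slice index: {segment} (slice {segment + 1} of 10)")
```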

In [179]:
prob_preds_species, prob_preds_pos = model.predict(ds, verbose=1)



In [180]:
prob_preds_species.shape, prob_preds_pos.shape

((571980, 187), (571980, 10))

To get the predicted species for each read, we take the class with the highest probability:

In [184]:
class_preds = np.argmax(prob_preds_species, axis=1)
class_preds.shape
class_preds[:10]

array([117, 117, 117, 117,  32,  89, 117, 117,  94, 117])
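
The same idea in plain Python, as a small sketch: an argmax over each row of probabilities, followed by a tally of predicted labels. The toy probabilities are made up for illustration, while the ten labels are the ones printed above.

```python
from collections import Counter


def argmax(values):
    """Index of the largest value (first one in case of ties, like np.argmax)."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up probabilities for three reads and three classes
toy_probs = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]
toy_preds = [argmax(row) for row in toy_probs]
print(toy_preds)

# Tally of the first ten predicted labels shown above
first_preds = [117, 117, 117, 117, 32, 89, 117, 117, 94, 117]
counts = Counter(first_preds)
for label, n in counts.most_common():
    print(f"label {label}: {n} reads")
```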

## Evaluate Model for cov

The original model was trained on 187 different virus species.

In [188]:
p2virus_labels = pfs.data / 'CNN_Virus_data/virus_name_mapping'
with open(p2virus_labels, 'r') as fp:
    i, c = 0, 0
    cov = []
    while True:
        line  = fp.readline()
        if line == '': break
        elif ('corona' in line) or ('mers' in line) : 
            c += 1
            line = line.replace('\t', '    \t')
            cov.append(f" - {line}")
        i += 1
print(f"Original model is trained to detect {i} virus species, including {c} coronavirus species:")
print(''.join(cov))

Original model is trained to detect 187 virus species, including 2 coronavirus species:
 - Middle_East_respiratory_syndrome-related_coronavirus    	94
 - Severe_acute_respiratory_syndrome-related_coronavirus    	117
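
To turn predicted codes back into names, the mapping file can be loaded into a dictionary. The sketch below assumes one `name<TAB>code` pair per line, which is consistent with the two lines printed above. `load_label_names` is a hypothetical helper, and the sample stands in for the real file.

```python
import io


def load_label_names(handle):
    """Build a {code: name} dictionary from tab-separated 'name<TAB>code' lines."""
    names = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        name, code = line.rsplit("\t", 1)
        names[int(code)] = name
    return names


# Two lines copied from the output above, standing in for the full file
sample = io.StringIO(
    "Middle_East_respiratory_syndrome-related_coronavirus\t94\n"
    "Severe_acute_respiratory_syndrome-related_coronavirus\t117\n"
)
label_names = load_label_names(sample)
print(label_names[117])
```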



In [189]:
fasta.reset_iterator()
fasta.print_first_chunks(5)


Sequence 1:
>2591237:ncbi:1 [MK211378]	2591237	ncbi	1 [MK211378] 2591237	Coronavirus BtRs-BetaCoV/YN2018D		scientific name

TATTAGGTTTTCTACCTACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAAT ...

Sequence 2:
>11128:ncbi:2 [LC494191]	11128	ncbi	2 [LC494191] 11128	Bovine coronavirus		scientific name

CATCCCGCTTCACTGATCTCTTGTTAGATCTTTTCATAATCTAAACTTTATAAAAACATCCACTCCCTGTAGTCTATGCC ...

Sequence 3:
>31631:ncbi:3 [KY967361]	31631	ncbi	3 [KY967361] 31631	Human coronavirus OC43		scientific name

ATCTCTTGTTAGATCTTTTTGTAATCTAAACTTTATAAAAACATCCACTCCCTGTAATCTATGCTTGTGGGCGTAGATTT ...

Sequence 4:
>277944:ncbi:4 [LC654455]	277944	ncbi	4 [LC654455] 277944	Human coronavirus NL63		scientific name

ATTTTCTTATTTAGACTTTGTGTCTACTCTTCTCAACTAAACGAAATTTTTCTAGTGCTGTCATTTGTTATGGCAGTCCT ...

Sequence 5:
>11120:ncbi:5 [MN987231]	11120	ncbi	5 [MN987231] 11120	Infectious bronchitis virus		scientific name

TCCTAAGTGTGATATAAATATATATCATACACACTAGCCTTGCGCTAGATTTCTAACTTAACAAAACGGACTTAAATACC ...

Sequence 

Our objective is to check whether the model classifies reads from these sequences as coronavirus.

We create several test functions:

In [190]:
def is_cov(y_preds):
    """Return 1 where the prediction is one of the two coronavirus labels (94 or 117), 0 otherwise"""
    return (y_preds == 94).astype(int) + (y_preds == 117).astype(int)

def is_mers(y_preds):
    """Return True where the model prediction is MERS, False otherwise

    Note: 94 is the code for Middle_East_respiratory_syndrome-related_coronavirus"""
    return y_preds == 94

def is_sars(y_preds):
    """Return True where the model prediction is SARS, False otherwise

    Note: 117 is the virus code for Severe_acute_respiratory_syndrome-related_coronavirus 
    """
    return y_preds == 117

def cov_acc(y_true, y_preds):
    """Fraction of reads predicted as coronavirus (MERS or SARS label).

    This is an accuracy only under the assumption that all evaluated reads come
    from coronavirus sequences; `y_true` is not used."""
    return is_cov(y_preds).sum()/y_preds.shape[0]

def mers_acc(y_true, y_preds):
    """Fraction of reads predicted as MERS (label 94); `y_true` is not used."""
    return is_mers(y_preds).sum()/y_preds.shape[0]

def sars_acc(y_true, y_preds):
    """Fraction of reads predicted as SARS (label 117); `y_true` is not used."""
    return is_sars(y_preds).sum()/y_preds.shape[0]
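
All three metrics measure the same thing: the fraction of reads whose predicted label falls in a given set. A compact way to express this, shown here as a plain-Python sketch with a hypothetical `label_fraction` helper, applied to the first ten predictions printed earlier:

```python
MERS, SARS = 94, 117  # label codes from the virus name mapping


def label_fraction(preds, labels):
    """Fraction of predictions that belong to the given set of labels."""
    labels = set(labels)
    if not preds:
        raise ValueError("no predictions to evaluate")
    return sum(p in labels for p in preds) / len(preds)


first_preds = [117, 117, 117, 117, 32, 89, 117, 117, 94, 117]
print("CoV :", label_fraction(first_preds, {MERS, SARS}))
print("MERS:", label_fraction(first_preds, {MERS}))
print("SARS:", label_fraction(first_preds, {SARS}))
```

Since the original model only knows two coronavirus labels, a high value for reads from any other coronavirus means "classified as some coronavirus", not "classified as the correct species".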

Review all reads, broken down per reference sequence

In [215]:
aln = AlnFileReader(p2fastq.parent / f"{p2fastq.stem}.aln")
acc_per_refseq = {}

for refseqid in tqdm(reads_info.read_refseqs.unique()):
    mask = reads_info.read_refseqs == refseqid
    acc = cov_acc(None, class_preds[mask])
    aln_refseq_meta = aln.ref_sequences[refseqid]
    print(f"Reference Sequence: {aln_refseq_meta['species']}:")
    print(f"  Nbr reads: {class_preds[mask].shape[0]:,d}")
    print(f"  Accuracy CoV:       {acc:.3f}")
    print(f"  Accuracy MERS:  {mers_acc(None, class_preds[mask]):.3f}")
    print(f"  Accuracy SARS:  {sars_acc(None, class_preds[mask]):.3f}")


Reference Sequence: Coronavirus BtRs-BetaCoV/YN2018D  scientific name:
  Nbr reads: 60,400
  Accuracy CoV:       0.733
  Accuracy MERS:  0.014
  Accuracy SARS:  0.719
Reference Sequence: Bovine coronavirus  scientific name:
  Nbr reads: 61,800
  Accuracy CoV:       0.055
  Accuracy MERS:  0.030
  Accuracy SARS:  0.026
Reference Sequence: Human coronavirus OC43  scientific name:
  Nbr reads: 61,080
  Accuracy CoV:       0.057
  Accuracy MERS:  0.031
  Accuracy SARS:  0.026
Reference Sequence: Human coronavirus NL63  scientific name:
  Nbr reads: 55,000
  Accuracy CoV:       0.067
  Accuracy MERS:  0.031
  Accuracy SARS:  0.035
Reference Sequence: Infectious bronchitis virus  scientific name:
  Nbr reads: 55,200
  Accuracy CoV:       0.065
  Accuracy MERS:  0.032
  Accuracy SARS:  0.034
Reference Sequence: Porcine epidemic diarrhea virus  scientific name:
  Nbr reads: 56,000
  Accuracy CoV:       0.068
  Accuracy MERS:  0.032
  Accuracy SARS:  0.036
Reference Sequence: Porcine epidemic d

In [219]:
for refseqid in tqdm(reads_info.read_refseqs.unique()):
    mask_refseq = reads_info.read_refseqs == refseqid
    # Combine the boolean masks directly: reads from this refseq on each strand
    mask_coding = mask_refseq & (reads_info.read_strand == '+')
    mask_template = mask_refseq & (reads_info.read_strand == '-')

    aln_refseq_meta = aln.ref_sequences[refseqid]
    acc = cov_acc(None, class_preds[mask_refseq])
    acc_coding = cov_acc(None, class_preds[mask_coding])
    acc_template = cov_acc(None, class_preds[mask_template])
       
    print(f"Ref. Sequence: {aln_refseq_meta['species'].replace('scientific name', '').strip()}:")
    print(f"  Accuracy :............... {acc:.3f}")
    print(f"  Acc. coding strand: ..... {acc_coding:.3f}")
    print(f"  Acc. template strand: ... {acc_template:.3f}")
    print(f"  Nbr reads: {class_preds[mask_refseq].shape[0]:,d}, incl. {mask_coding.sum():,d} from coding strand and {mask_template.sum():,d} from template strand")
    print()


Ref. Sequence: Coronavirus BtRs-BetaCoV/YN2018D:
  Accuracy :............... 0.733
  Acc. coding strand: ..... 0.733
  Acc. template strand: ... 0.733
  Nbr reads: 60,400, incl. 30,099 from coding strand and 30,301 from template strand

Ref. Sequence: Bovine coronavirus:
  Accuracy :............... 0.055
  Acc. coding strand: ..... 0.058
  Acc. template strand: ... 0.053
  Nbr reads: 61,800, incl. 30,928 from coding strand and 30,872 from template strand

Ref. Sequence: Human coronavirus OC43:
  Accuracy :............... 0.057
  Acc. coding strand: ..... 0.064
  Acc. template strand: ... 0.051
  Nbr reads: 61,080, incl. 30,565 from coding strand and 30,515 from template strand

Ref. Sequence: Human coronavirus NL63:
  Accuracy :............... 0.067
  Acc. coding strand: ..... 0.062
  Acc. template strand: ... 0.071
  Nbr reads: 55,000, incl. 27,560 from coding strand and 27,440 from template strand

Ref. Sequence: Infectious bronchitis virus:
  Accuracy :............... 0.065
  Acc. c
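
The per-sequence, per-strand breakdown above can also be computed in a single pass over the reads. The sketch below uses plain Python and made-up toy data (reference ids, strands and predictions are all hypothetical). It only illustrates the grouping logic, not the notebook's actual results.

```python
from collections import defaultdict

COV_LABELS = {94, 117}  # MERS and SARS label codes

# Hypothetical toy data: one entry per read
refseqs = ["ref1", "ref1", "ref1", "ref2", "ref2", "ref2"]
strands = ["+", "-", "+", "+", "-", "-"]
preds = [117, 117, 32, 94, 12, 5]

hits = defaultdict(int)
totals = defaultdict(int)
for refseq, strand, pred in zip(refseqs, strands, preds):
    key = (refseq, strand)
    totals[key] += 1
    hits[key] += int(pred in COV_LABELS)

# Fraction of reads predicted as coronavirus for each (refseq, strand) group
fractions = {key: hits[key] / totals[key] for key in totals}
for (refseq, strand), frac in sorted(fractions.items()):
    print(f"{refseq} {strand}: {frac:.3f} ({totals[(refseq, strand)]} reads)")
```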

# Evaluating the original pretrained model on simulated CoV reads

## Using reads simulated from 25 reference CoV sequences

In [51]:
p2saved = pfs.data / 'saved/cnn_virus_original/pretrained_model.h5'
p2simreads = pfs.data / 'cov_simreads/single_25seq_50bp'
p2virus_labels = pfs.data / 'CNN_Virus_data/virus_name_mapping'
assert p2saved.is_file()
assert p2simreads.is_dir()
assert p2virus_labels.is_file()

In [52]:
p2fastq = p2simreads / f"{p2simreads.stem}.fq"
p2aln = p2simreads / f"{p2simreads.stem}.aln"
assert p2fastq.is_file()
assert p2aln.is_file()

fastq = FastqFileReader(p2fastq)
aln = AlnFileReader(p2aln)

In [59]:
refseqs_aln = aln.ref_sequences

print(f"reads simulated from the following {len(refseqs_aln)} reference sequences:\n")
print('\n'.join([f" {i:02d}: {refseq['species']}" for i, refseq in enumerate(refseqs_aln.values())]))

reads simulated from the following 25 reference sequences:

 00: Coronavirus BtRs-BetaCoV/YN2018D  scientific name
 01: Bovine coronavirus  scientific name
 02: Human coronavirus OC43  scientific name
 03: Human coronavirus NL63  scientific name
 04: Infectious bronchitis virus  scientific name
 05: Porcine epidemic diarrhea virus  scientific name
 06: Porcine epidemic diarrhea virus  scientific name
 07: Porcine epidemic diarrhea virus  scientific name
 08: Porcine epidemic diarrhea virus  scientific name
 09: Camel alphacoronavirus  scientific name
 10: Human coronavirus HKU1  scientific name
 11: Middle East respiratory syndrome-related coronavirus  scientific name
 12: Murine hepatitis virus  scientific name
 13: Middle East respiratory syndrome-related coronavirus  scientific name
 14: Middle East respiratory syndrome-related coronavirus  scientific name
 15: Middle East respiratory syndrome-related coronavirus  scientific name
 16: Infectious bronchitis virus  scientific name
 17

In [60]:
nsamples = None
p2ds, reads_info = create_infer_ds_from_fastq(p2fastq, overwrite_ds=True, nsamples=nsamples)

text_ds = tf.data.TextLineDataset(p2ds).batch(32)
ds = text_ds.map(strings_to_tensors)

Dataset with 1,442,519 reads


In [61]:
model = load_model(p2saved)

In [52]:
prob_preds = model.predict(ds, verbose=1)



In [58]:
prob_preds[0].shape, prob_preds[1].shape

((1442519, 187), (1442519, 10))

In [59]:
class_preds = np.argmax(prob_preds[0], axis=1)
class_preds.shape

(1442519,)

In [60]:
def is_cov(y_preds):
    """Return 1 where the predicted class is a coronavirus (label 94 or 117), 0 otherwise"""
    return (y_preds == 94).astype(int) + (y_preds == 117).astype(int)

def is_mers(y_preds):
    """Return True where the predicted class is MERS (label 94)"""
    return y_preds == 94

def is_sars(y_preds):
    """Return True where the predicted class is SARS (label 117)"""
    return y_preds == 117

def cov_acc(y_true, y_preds):
    """Fraction of reads predicted as coronavirus, i.e. the accuracy if all evaluated reads are from coronaviruses (`y_true` is unused)"""
    return is_cov(y_preds).sum()/y_preds.shape[0]

def mers_acc(y_true, y_preds):
    """Fraction of reads predicted as MERS, label 94 (`y_true` is unused)"""
    return is_mers(y_preds).sum()/y_preds.shape[0]

def sars_acc(y_true, y_preds):
    """Fraction of reads predicted as SARS, label 117 (`y_true` is unused)"""
    return is_sars(y_preds).sum()/y_preds.shape[0]

# cov_acc(None, class_preds)
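
The metrics above are plain fractions of predicted labels, so they can be checked by hand on a toy array. The snippet below is only an illustrative sketch: `toy_preds` and the constant names are made up for the example, and the label values 94 (MERS) and 117 (SARS) are taken from the functions above. It also shows that the MERS and SARS fractions add up to the CoV fraction.

```python
import numpy as np

# Label values taken from is_mers / is_sars above; the names are illustrative only
MERS_LABEL, SARS_LABEL = 94, 117

# Hypothetical predicted classes for 8 reads
toy_preds = np.array([94, 117, 3, 94, 50, 117, 117, 8])

# Fraction of reads predicted as either coronavirus class (same value as cov_acc)
cov_fraction = np.isin(toy_preds, [MERS_LABEL, SARS_LABEL]).mean()
# Fractions per class (same values as mers_acc and sars_acc)
mers_fraction = (toy_preds == MERS_LABEL).mean()
sars_fraction = (toy_preds == SARS_LABEL).mean()

print(cov_fraction, mers_fraction, sars_fraction)  # 0.625 0.25 0.375
```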

In [61]:
np.unique(reads_info[:,1]).shape[0]

25

In [62]:
for refseqid in np.unique(reads_info[:,1]):
    mask_refseq = reads_info[:,1] == refseqid
    mask_strand_coding = reads_info[:,3] == '+'
    mask_strand_template = reads_info[:,3] == '-'
    mask_coding = mask_strand_coding & mask_refseq
    mask_template = mask_strand_template & mask_refseq

    aln_refseq_meta = aln.ref_sequences[refseqid]
    acc = cov_acc(None, class_preds[mask_refseq])
    acc_coding = cov_acc(None, class_preds[mask_coding])
    acc_template = cov_acc(None, class_preds[mask_template])
    
    species = aln_refseq_meta['species'].replace('scientific name', '').strip()
    refid = aln_refseq_meta['refseqid'].strip()
    refseq_accession = aln_refseq_meta['refseq_accession'].strip()
       
    print(f"Ref. Sequence: {species} ({refseqid} / {refseq_accession}):")
    print(f"  Accuracy :............... {acc:.3f}")
    print(f"  Accuracy MERS: .......... {mers_acc(None, class_preds[mask_refseq]):.3f}")
    print(f"  Accuracy SARS: .......... {sars_acc(None, class_preds[mask_refseq]):.3f}")
    print(f"  Acc. coding strand: ..... {acc_coding:.3f}")
    print(f"  Acc. template strand: ... {acc_template:.3f}")
    print(f"  Nbr reads: {class_preds[mask_refseq].shape[0]:,d}, incl. {mask_coding.sum():,d} from coding strand and {mask_template.sum():,d} from template strand")
    print()

Ref. Sequence: Infectious bronchitis virus (11120:ncbi:17 / MW792514):
  Accuracy :............... 0.055
  Accuracy MERS: .......... 0.026
  Accuracy SARS: .......... 0.029
  Acc. coding strand: ..... 0.054
  Acc. template strand: ... 0.056
  Nbr reads: 55,300, incl. 27,748 from coding strand and 27,552 from template strand

Ref. Sequence: Infectious bronchitis virus (11120:ncbi:19 / EU526388):
  Accuracy :............... 0.055
  Accuracy MERS: .......... 0.027
  Accuracy SARS: .......... 0.028
  Acc. coding strand: ..... 0.053
  Acc. template strand: ... 0.057
  Nbr reads: 55,400, incl. 27,670 from coding strand and 27,730 from template strand

Ref. Sequence: Infectious bronchitis virus (11120:ncbi:5 / MN987231):
  Accuracy :............... 0.063
  Accuracy MERS: .......... 0.031
  Accuracy SARS: .......... 0.032
  Acc. coding strand: ..... 0.068
  Acc. template strand: ... 0.059
  Nbr reads: 55,200, incl. 27,657 from coding strand and 27,543 from template strand

Ref. Sequence: Bovin

## Simreads from 100 sequences

In [62]:
p2saved = pfs.data / 'saved/cnn_virus_original/pretrained_model.h5'
p2simreads = pfs.data / 'cov_simreads/single_100seq_50bp'
p2virus_labels = pfs.data / 'CNN_Virus_data/virus_name_mapping'
assert p2saved.is_file()
assert p2simreads.is_dir()
assert p2virus_labels.is_file()

In [69]:
p2fastq = p2simreads / f"{p2simreads.stem}.fq"
p2aln = p2simreads / f"{p2simreads.stem}.aln"
assert p2fastq.is_file()
assert p2aln.is_file()

fastq = FastqFileReader(p2fastq)
aln = AlnFileReader(p2aln)

In [70]:
# FIXME: the following line raises an error: ValueError: No match on this line
refseqs_aln = aln.ref_sequences

# print(f"reads simulated from the following {len(refseqs_aln)} reference sequences:\n")
# print('\n'.join([f" {i:02d}: {refseq['species']}" for i, refseq in enumerate(refseqs_aln.values())]))

ValueError: No match on this line

In [None]:
nsamples = 3_000_000
p2ds, reads_info = create_infer_ds_from_fastq(p2fastq, overwrite_ds=True, nsamples=nsamples)

text_ds = tf.data.TextLineDataset(p2ds).batch(32)
ds = text_ds.map(strings_to_tensors)

In [None]:
model = load_model(p2saved)

In [None]:
prob_preds = model.predict(ds, verbose=1)



ResourceExhaustedError: OOM when allocating tensor with shape[3000000,187] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:ConcatV2] name: concat

Inference appears to run fine for all 93,750 steps (3M samples in batches of 32), but when the outputs are assembled, the `prob_preds` tensor of shape `[3000000,187]` cannot be allocated on the GPU.

Questions:
- Why does this tensor have to be on the GPU?
- Is it possible to keep it in main memory instead?
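
One possible workaround, sketched below, is to run the prediction batch by batch and assemble the results as NumPy arrays in main memory. This is an untested sketch and not the method used in this notebook: `predict_to_host` and `fake_predict` are hypothetical names, and `fake_predict` only stands in for the model so that the snippet runs without a GPU. It mimics the two outputs of shapes `(n, 187)` and `(n, 10)` seen earlier.

```python
import numpy as np

def predict_to_host(predict_fn, batches):
    """Call predict_fn on each batch and concatenate the outputs in main memory.

    predict_fn: callable returning a sequence of arrays, one per model output
    batches:    iterable of input batches
    """
    stores = None
    for batch in batches:
        # Convert every output to a NumPy array, i.e. host memory
        outputs = [np.asarray(out) for out in predict_fn(batch)]
        if stores is None:
            stores = [[] for _ in outputs]
        for store, out in zip(stores, outputs):
            store.append(out)
    if stores is None:
        return []
    # The concatenation is done by NumPy, so the full array never lives on the GPU
    return [np.concatenate(store, axis=0) for store in stores]

def fake_predict(batch):
    """Stand-in for the model: two outputs with 187 and 10 columns."""
    n = len(batch)
    return [np.ones((n, 187), dtype=np.float32), np.zeros((n, 10), dtype=np.float32)]

# Three toy batches of 50 bp reads: 32 + 32 + 7 = 71 reads
toy_batches = [np.zeros((32, 50)), np.zeros((32, 50)), np.zeros((7, 50))]
demo_preds = predict_to_host(fake_predict, toy_batches)
print(demo_preds[0].shape, demo_preds[1].shape)  # (71, 187) (71, 10)
```

With the objects of this notebook, the call might look like `prob_preds = predict_to_host(model.predict_on_batch, ds)`, assuming `ds` yields only model inputs and `predict_on_batch` returns one NumPy array per output. For 3M reads, the first output alone takes about 2.2 GB in float32 (3,000,000 x 187 x 4 bytes).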


In [None]:
prob_preds[0].shape, prob_preds[1].shape

In [None]:
class_preds = np.argmax(prob_preds[0], axis=1)
class_preds.shape

In [None]:
def is_cov(y_preds):
    """Return 1 where the predicted class is a coronavirus (label 94 or 117), 0 otherwise"""
    return (y_preds == 94).astype(int) + (y_preds == 117).astype(int)

def cov_acc(y_true, y_preds):
    """Fraction of reads predicted as coronavirus, i.e. the accuracy if all evaluated reads are from coronaviruses (`y_true` is unused)"""
    return is_cov(y_preds).sum()/y_preds.shape[0]

cov_acc(None, class_preds)

In [None]:
np.unique(reads_info[:,1]).shape[0]

In [None]:
for refseqid in np.unique(reads_info[:,1]):
    mask_refseq = reads_info[:,1] == refseqid
    mask_strand_coding = reads_info[:,3] == '+'
    mask_strand_template = reads_info[:,3] == '-'
    mask_coding = mask_strand_coding & mask_refseq
    mask_template = mask_strand_template & mask_refseq

    aln_refseq_meta = aln.ref_sequences[refseqid]
    acc = cov_acc(None, class_preds[mask_refseq])
    acc_coding = cov_acc(None, class_preds[mask_coding])
    acc_template = cov_acc(None, class_preds[mask_template])
       
    print(f"Ref. Sequence: {aln_refseq_meta['species'].replace('scientific name', '').strip()}:")
    print(f"  Accuracy :............... {acc:.3f}")
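    # `mask` was undefined here; use the per-reference-sequence mask as in the 25-sequence run
    print(f"  Accuracy MERS: .......... {mers_acc(None, class_preds[mask_refseq]):.3f}")
    print(f"  Accuracy SARS: .......... {sars_acc(None, class_preds[mask_refseq]):.3f}")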
    print(f"  Acc. coding strand: ..... {acc_coding:.3f}")
    print(f"  Acc. template strand: ... {acc_template:.3f}")
    print(f"  Nbr reads: {class_preds[mask_refseq].shape[0]:,d}, incl. {mask_coding.sum():,d} from coding strand and {mask_template.sum():,d} from template strand")
    print()

# New Section

Accessing AWS from Colab:
- https://colab.research.google.com/github/bytehub-ai/code-examples/blob/main/tutorials/04_using_cloud_storage.ipynb
- https://python.plainenglish.io/how-to-load-data-from-aws-s3-into-google-colab-7e76fbf534d2
- https://medium.com/@lily_su/accessing-s3-bucket-from-google-colab-16f7ee6c5b51

## Handle GPU with TensorFlow

In [None]:
# device = tf.config.list_physical_devices('GPU')[0]
# device

PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')

In [None]:
# tf.config.experimental.get_memory_info(device='GPU:0')

{'current': 301555968, 'peak': 330915840}

In [None]:
# physical_devices = tf.config.list_physical_devices('GPU')
# try:
#     tf.config.experimental.set_memory_growth(physical_devices[0], True)
#     assert tf.config.experimental.get_memory_growth(physical_devices[0])
# except:
#     print('Invalid device or cannot modify virtual devices once initialized.')

Invalid device or cannot modify virtual devices once initialized.


In [None]:
# tf.keras.backend.clear_session()

In [None]:
# model.summary()

In [None]:
# try:
#     del model
# except:
#     pass
# import gc
# gc.collect()

In [None]:
# gpus = tf.config.list_physical_devices('GPU')
# if gpus:
#   # Restrict TensorFlow to only use the first GPU
#   try:
#     tf.config.set_visible_devices(gpus[0], 'GPU')
#     logical_gpus = tf.config.list_logical_devices('GPU')
#     print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
#   except RuntimeError as e:
#     # Visible devices must be set before GPUs have been initialized
#     print(e)

1 Physical GPUs, 1 Logical GPU


## end of section