# Inference using simulated CoV sequences from NCBI

In this notebook, we use simulated reads (simreads) generated from NCBI CoV reference sequences to test inference with the original pretrained CNN Virus model.

This notebook works when run locally and should also run on Colab, as long as the file system follows the unified file system structure (see documentation).

# 1. Imports and setup environment

### Install and import packages

In [1]:
# Install required custom packages if not installed yet.
import importlib.util
if not importlib.util.find_spec('ecutilities'):
    print('installing package: `ecutilities`')
    ! pip install -qqU ecutilities
else:
    print('`ecutilities` already installed')
if not importlib.util.find_spec('metagentools'):
    print('installing package: `metagentools`')
    ! pip install -qqU metagentools
else:
    print('`metagentools` already installed')

`ecutilities` already installed
`metagentools` already installed


In [63]:
# Import all required packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os

from ecutilities.core import files_in_tree
from ecutilities.ipython import nb_setup
from IPython.display import display, Markdown, HTML
from pathlib import Path
from pprint import pprint
from tqdm.notebook import tqdm, trange

# Setup the notebook for development
nb_setup()

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # or any {'0', '1', '2'}
import tensorflow as tf
from tensorflow.python.client import device_lib
from tensorflow.keras.models import load_model
print(f"Tensorflow version: {tf.__version__}\n")

from metagentools.cnn_virus.data import strings_to_tensors, create_infer_ds_from_fastq
from metagentools.cnn_virus.data import FastaFileReader, FastqFileReader, AlnFileReader
from metagentools.core import TextFileBaseReader, ProjectFileSystem

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Set autoreload mode
Tensorflow version: 2.8.2



List all computing devices available on the machine

In [3]:
devices = device_lib.list_local_devices()
print('\nDevices:')
for d in devices:
    t = d.device_type
    name = d.physical_device_desc
    l = [item.split(':', 1) for item in name.split(', ')]
    name_attr = dict([x for x in l if len(x)==2])
    dev = name_attr.get('name', ' ')
    print(f"  - {t}  {d.name} {dev:25s}")


Devices:
  - CPU  /device:CPU:0                          
  - GPU  /device:GPU:0  NVIDIA GeForce GTX 1050 


# 2. Setup paths to files

Key folders and system information

In [4]:
pfs = ProjectFileSystem()
pfs.info()

Running linux on local computer
Device's home directory: /home/vtec
Project file structure:
 - Root ........ /home/vtec/projects/bio/metagentools 
 - Data Dir .... /home/vtec/projects/bio/metagentools/data 
 - Notebooks ... /home/vtec/projects/bio/metagentools/nbs


In [8]:
pfs.readme()

*/home/vtec/projects/bio/metagentools/data*

This directory includes all the data for the project `metagentools`.

```text
data
     |--- CNN_Virus_data  (all data related to CNN Virus original paper)
     |--- models          (trained and finetuned models)
     |--- ncbi            (refsequences, simreads, datasets and infer_results for CoV from NCBI)
     |--- ncov_data       (refsequences, simreads, datasets and infer_results for non CoV sequences)
     |--- ....            (raw and pre-processed data from various sources)  
     
```



- `p2model`: path to file with saved original pretrained model
- `p2virus_labels`: path to file with virus names and labels mapping for original model
- `p2simreads`: path to folder where reads files are located (FASTQ and ALN)

In [12]:
p2model = pfs.data / 'models/cnn_virus_original/pretrained_model.h5'
assert p2model.is_file(), f"No file found at {p2model.absolute()}"

p2virus_labels = pfs.data / 'CNN_Virus_data/virus_name_mapping'
assert p2virus_labels.is_file(), f"No file found at {p2virus_labels.absolute()}"

p2simreads = pfs.data / 'ncbi/simreads/'
assert p2simreads.is_dir(), f"No directory found at {p2simreads.absolute()}"

In [13]:
pfs.readme(dir_path=p2simreads)

*/home/vtec/projects/bio/metagentools/data/ncbi/simreads*

# Information on CoV simulated read sequence files
This folder includes a set of simulated read sequence files generated with ART Illumina from the CoV sequences in `cov_data`.


Each simread sub-directory is named as `<method>_<nb seq>_<nb bp>` where:
- `<method>` is either `single` or `paired` depending on the simulation method
- `<nb seq>` is the number of reference sequences used for simulation, and refers to the `fa` files in `cov_data`
- `<nb bp>` is the number of base pairs in the simulated read

Each directory includes simreads files made using a simulation method and a specific number of reference sequences.
- `xxx.fq` and `xxx.aln` files when the method is `single`
- `xxx1.fq`, `xxx2.fq`, `xxx1.aln` and `xxx2.aln` files when the method is `paired`.

Example:
- `paired_10seq_50bp` means that the simreads were generated by using the `paired` method to simulate 50-bp reads, and using the `fa` file `/cov_data/cov_virus_sequences_010-seqs.fa`
- `single_100seq_50bp` means that the simreads were generated by using the `single` method to simulate 50-bp reads, and using the `fa` file `/cov_data/cov_virus_sequences_100-seqs.fa`. Note that this generated 20,660,104 reads !
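
The naming convention above can be parsed mechanically. The following is a minimal sketch using only the standard library. `parse_simreads_dirname` and `SimreadsSpec` are hypothetical helpers introduced here for illustration; they are not part of `metagentools`.

```python
import re
from typing import NamedTuple

class SimreadsSpec(NamedTuple):
    method: str   # 'single' or 'paired'
    nb_seq: int   # number of reference sequences used for the simulation
    nb_bp: int    # read length in base pairs

# Pattern for `<method>_<nb seq>seq_<nb bp>bp`, e.g. `paired_10seq_50bp`
_PATTERN = re.compile(r"^(single|paired)_(\d+)seq_(\d+)bp$")

def parse_simreads_dirname(name: str) -> SimreadsSpec:
    """Split a simreads directory name into its three components."""
    m = _PATTERN.match(name)
    if m is None:
        raise ValueError(f"not a simreads directory name: {name!r}")
    return SimreadsSpec(m.group(1), int(m.group(2)), int(m.group(3)))

spec = parse_simreads_dirname("paired_10seq_50bp")
print(spec)
```

Raising on a non-matching name, rather than returning `None`, makes a misnamed directory visible early.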



In [16]:
files_in_tree(path=p2simreads);

ncbi
  |--simreads
  |    |--readme.md (0)
  |    |--paired_25seq_150bp
  |    |    |--paired_25seq_150bp2.fq (1)
  |    |    |--paired_25seq_150bp1.aln (2)
  |    |    |--paired_25seq_150bp2.aln (3)
  |    |    |--paired_25seq_150bp1.fq (4)
  |    |--paired_100seq_150bp
  |    |    |--paired_100seq_150bp1.aln (5)
  |    |    |--paired_100seq_150bp2.aln (6)
  |    |    |--paired_100seq_150bp1.fq (7)
  |    |    |--paired_100seq_150bp2.fq (8)
  |    |--paired_100seq_50bp
  |    |    |--paired_100seq_50bp2.aln (9)
  |    |    |--paired_100seq_50bp1.aln (10)
  |    |    |--paired_100seq_50bp2.fq (11)
  |    |    |--paired_100seq_50bp1.fq (12)
  |    |--paired_10seq_150bp
  |    |    |--paired_10seq_150bp1.aln (13)
  |    |    |--paired_10seq_150bp2.fq (14)
  |    |    |--paired_10seq_150bp2.aln (15)
  |    |    |--paired_10seq_150bp1.fq (16)
  |    |--paired_10seq_50bp
  |    |    |--paired_10seq_50bp2.aln (17)
  |    |    |--paired_10seq_50bp1.fq (18)
  |    |    |--paired_10seq_50bp1.al

For this experiment, we will use the following simreads:
- 50 bp, single
- from 10 reference sequences

This means we will use the files in `data/ncbi/simreads/single_10seq_50bp`

In [53]:
p2simreads = pfs.data / 'ncbi/simreads' / 'single_10seq_50bp'
assert p2simreads.is_dir()
p2simreads.absolute()

PosixPath('/home/vtec/projects/bio/metagentools/data/ncbi/simreads/single_10seq_50bp')

In [54]:
p2fastq = p2simreads / f"{p2simreads.stem}.fq"
assert p2fastq.is_file()
p2aln = p2simreads / f"{p2simreads.stem}.aln"
assert p2aln.is_file()

print(f" fq reads file:  {p2fastq.name}\n aln reads file: {p2aln.name}")

fastq = FastqFileReader(p2fastq)
aln = AlnFileReader(p2aln)

 fq reads file:  single_10seq_50bp.fq
 aln reads file: single_10seq_50bp.aln


## Review data

ART Illumina command to generate the reads

In [55]:
print(aln.header['command'])

/bin/art_illumina -i /home/vtec/projects/bio/metagentools/data/cov_data/cov_virus_sequences_ten.fa -ss HS25 -l 50 -f 100 -o /home/vtec/projects/bio/metagentools/data/cov_simreads/single_10seq_50bp/single_10seq_50bp -rs 1674660835
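
In this command, `-l 50` sets the read length and `-f 100` the fold coverage. As a rough sanity check, the number of single-end reads should be close to fold coverage × sequence length / read length. The sketch below applies that relation. Both helper functions are hypothetical, and ART's exact rounding may differ from this approximation.

```python
def expected_read_count(genome_len: int, fold_coverage: float, read_len: int) -> int:
    """Approximate number of single-end reads for a given fold coverage."""
    return round(fold_coverage * genome_len / read_len)

def implied_genome_len(n_reads: int, fold_coverage: float, read_len: int) -> float:
    """Invert the relation: sequence length implied by an observed read count."""
    return n_reads * read_len / fold_coverage

# 60,400 is the read count reported for the first reference sequence
# further down in this notebook; 100 and 50 come from the command above.
print(implied_genome_len(60_400, fold_coverage=100, read_len=50))   # 30200.0
print(expected_read_count(30_200, fold_coverage=100, read_len=50))  # 60400
```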


Reference Sequences:

In [56]:
print('\n'.join([v['species'] for v in aln.ref_sequences.values()]))

Coronavirus BtRs-BetaCoV/YN2018D  scientific name
Bovine coronavirus  scientific name
Human coronavirus OC43  scientific name
Human coronavirus NL63  scientific name
Infectious bronchitis virus  scientific name
Porcine epidemic diarrhea virus  scientific name
Porcine epidemic diarrhea virus  scientific name
Porcine epidemic diarrhea virus  scientific name
Porcine epidemic diarrhea virus  scientific name
Camel alphacoronavirus  scientific name


# 3. Create inference dataset

The model expects a dataset file in the following format:

```text
    AAAAAGATTTTGAGAGAGGTCGACCTGTCCTCCTAAAACGTTTACAAAAG
    CATGTAACGCAGCTTAGTCCGATCGTGGCTATAATCCGTCTTTCGATTTG
    AACAACATCTTGTTGATGATAACCGTCAAAGTGTTTTGGGTCTGGAGGGA
    AGTACCTGGAGAGCGTTAAGAAACACAAACGGCTGGATGTAGTGCCGCGC
    CCACGTCGATGAAGCTCCGACGAGAGTCGGCGCTGAGCCCGCGCACCTCC
```

Each line corresponds to a **read sequence**. During inference, the model will predict the **virus species code** and the **relative position** of the read in the full reference sequence.

The mapping between codes and virus species names is in the file `virus_labels.csv`.

## Create a ds file from the simulated read output files

The function `create_infer_ds_from_fastq` takes reads from a `fastq` file and creates an inference dataset in the format expected by the model. 

The function returns the path to the inference dataset file, the path to the metadata file, and a DataFrame with the metadata of each read (read id, reference sequence, start position and strand). 

Datasets are saved into `data/ncbi/ds`.
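
To make the conversion concrete, here is a minimal sketch of the core idea: keep only the sequence line of each FASTQ record and write one read per line. It is not the `metagentools` implementation. It assumes plain 4-line FASTQ records, and the two records are made up for illustration.

```python
import io
from typing import Iterator, TextIO

def iter_fastq_sequences(handle: TextIO) -> Iterator[str]:
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # second line of each record holds the bases
            yield line.strip()

# Two made-up records, for illustration only
fastq_text = (
    "@read-1\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@read-2\nTTGACCGATA\n+\nIIIIIIIIII\n"
)

sequences = list(iter_fastq_sequences(io.StringIO(fastq_text)))
ds_text = "\n".join(sequences) + "\n"  # one read per line
print(ds_text)
```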

In [57]:
p2datasets = pfs.data / 'ncbi/ds'
assert p2datasets.is_dir()

In [58]:
pfs.readme(dir_path=p2datasets)

*/home/vtec/projects/bio/metagentools/data/ncbi/ds*

# Dataset Directory
When using simread files (`fa` and `aln`) for inference, an inference dataset in the format required by the CNN Virus model must be built. In addition, metadata can be extracted to make it possible to analyse the results from different perspectives.

This directory includes the generated  inference datasets and metadata for each inference experiment.


In [59]:
print('\n'.join([p.name for p in p2datasets.glob('*') if p.suffix not in ['.md']]))

single_25seq_50bp_metadata.csv
single_10seq_50bp_metadata.csv
single_25seq_50bp_ds
single_10seq_50bp_ds


Create the dataset and the metadata file

In [60]:
nsamples = None

p2ds, p2meta, reads_info = create_infer_ds_from_fastq(
    p2fastq=p2fastq, 
    output_dir=p2datasets,
    overwrite_ds=True, 
    nsamples=nsamples
    )

0it [00:00, ?it/s]

Dataset with 571,980 reads


In [61]:
print(f"Path to inference dataset file: {p2ds.absolute()}")
print(f"Path to read metadata file:     {p2meta.absolute()}")

Path to inference dataset file: /home/vtec/projects/bio/metagentools/data/ncbi/ds/single_10seq_50bp_ds
Path to read metadata file:     /home/vtec/projects/bio/metagentools/data/ncbi/ds/single_10seq_50bp_metadata.csv


In [62]:
reads_info.head()

| Unnamed: 0 | read_ids | read_refseqs | read_start_pos | read_strand |
|---|---|---|---|---|
| 0 | 2591237:ncbi:1-60400 | 2591237:ncbi:1 | 14770 | + |
| 1 | 2591237:ncbi:1-60399 | 2591237:ncbi:1 | 17012 | - |
| 2 | 2591237:ncbi:1-60398 | 2591237:ncbi:1 | 9188 | + |
| 3 | 2591237:ncbi:1-60397 | 2591237:ncbi:1 | 6764 | - |
| 4 | 2591237:ncbi:1-60396 | 2591237:ncbi:1 | 27357 | + |
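
In the rows shown, each read id looks like the reference sequence id followed by a hyphen and a read number. Assuming that pattern holds for all reads (it is inferred from these five rows only), the two parts can be recovered by splitting on the last hyphen. `split_read_id` is a hypothetical helper.

```python
from typing import Tuple

def split_read_id(read_id: str) -> Tuple[str, int]:
    """Split `<refseq id>-<read number>` on the last hyphen."""
    refseq_id, read_nbr = read_id.rsplit("-", 1)
    return refseq_id, int(read_nbr)

print(split_read_id("2591237:ncbi:1-60400"))  # ('2591237:ncbi:1', 60400)
```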


## Create the data loader for the model 

Define batch size and create a first dataset accessing data from the dataset text file. Batch size can be adjusted depending on the memory available on the GPU. For reference, `bs = 4096` was used with a 4GB GPU. 

Then transform the text dataset into a tensor dataset by applying the `strings_to_tensors` preprocessing function.

In [91]:
bs = 4096

text_ds = tf.data.TextLineDataset(p2ds).batch(bs)
ds = text_ds.map(strings_to_tensors)

The bases in the read sequences are encoded as a 5-dim one-hot-encoded vector, as the model expects.

In this example, each 50bp read is converted into a tensor of shape [50,5].

In [92]:
for batch, (y1b, y2b) in ds.take(1):
    # show the shape of one batch
    print(batch.shape)
    # show the first 10 bases, after one-hot encoding
    print(batch[0, :10, :])

(4096, 50, 5)
tf.Tensor(
[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0.]], shape=(10, 5), dtype=float32)
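
For readers unfamiliar with this encoding, the NumPy sketch below builds the same kind of `[read length, 5]` array for a single read. It is not the `strings_to_tensors` implementation. The channel order `ACGTN` and the mapping of unknown characters to `N` are assumptions, so check the library code for the actual convention.

```python
import numpy as np

BASES = "ACGTN"  # assumed channel order, not confirmed by the notebook

def one_hot_encode(read: str, alphabet: str = BASES) -> np.ndarray:
    """Encode a read as a float32 array of shape (len(read), len(alphabet))."""
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = np.zeros((len(read), len(alphabet)), dtype=np.float32)
    for pos, base in enumerate(read.upper()):
        # Assumption: any character outside the alphabet is treated as 'N'
        encoded[pos, index.get(base, index["N"])] = 1.0
    return encoded

x = one_hot_encode("ACGTN")
print(x.shape)  # (5, 5)
print(x)
```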


# 4. Inference

Load and review the pretrained model

In [93]:
model = load_model(p2model)

In [94]:
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 50, 5)]      0           []                               
                                                                                                  
 conv1d_1 (Conv1D)              (None, 50, 512)      13312       ['input_1[0][0]']                
                                                                                                  
 batch_normalization_1 (BatchNo  (None, 50, 512)     2048        ['conv1d_1[0][0]']               
 rmalization)                                                                                     
                                                                                                  
 max_pooling1d_1 (MaxPooling1D)  (None, 25, 512)     0           ['batch_normalization_1[0][

Present the inference dataset to the model and collect the predictions.

The model returns two sets of probabilities:
- `prob_preds_species`: for each input read, a vector of 187 values giving the probability that the read belongs to each of the 187 species
- `prob_preds_pos`: for each input read, a vector of 10 values giving the probability that the read comes from each of the 10 segments of the original sequence (1 to 10)

In [95]:
%time
prob_preds_species, prob_preds_pos = model.predict(ds, verbose=1)

CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 5.25 µs


In [96]:
prob_preds_species.shape, prob_preds_pos.shape

((571980, 187), (571980, 10))

To find the prediction, we pick the argmax probability, which gives us the index/code for the predicted virus species.

In [97]:
class_preds = np.argmax(prob_preds_species, axis=1)
class_preds.shape
class_preds[:10]

array([117, 117, 117, 117,  32,  89, 117, 117,  94, 117])

117 is the code for `SARS` and 94 the code for `MERS`, the only two coronavirus classes the model knows. Since all reads come from coronavirus reference sequences, any other predicted code (such as 32 or 89 above) is a prediction error.
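
A quick way to quantify this is to tally the predicted codes. The sketch below does so for the ten predictions printed above, which are copied by hand into a small array. The same two calls could be applied to the full `class_preds` array.

```python
import numpy as np

# First ten predicted class codes, copied from the output above
class_preds_sample = np.array([117, 117, 117, 117, 32, 89, 117, 117, 94, 117])

# Count how often each code was predicted
codes, counts = np.unique(class_preds_sample, return_counts=True)
tally = dict(zip(codes.tolist(), counts.tolist()))
print(tally)  # {32: 1, 89: 1, 94: 1, 117: 7}

# Fraction of reads predicted as one of the two coronavirus codes
cov_codes = [94, 117]  # MERS and SARS
cov_fraction = np.isin(class_preds_sample, cov_codes).mean()
print(cov_fraction)  # 0.8
```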

## Simple evaluation of the model for CoV

The original model was trained with 187 different virus species.

In [98]:
p2virus_labels = pfs.data / 'CNN_Virus_data/virus_name_mapping'
with open(p2virus_labels, 'r') as fp:
    i, c = 0, 0
    cov = []
    while True:
        line  = fp.readline()
        if line == '': break
        elif ('corona' in line) or ('mers' in line) : 
            c += 1
            line = line.replace('\t', '    \t')
            cov.append(f" - {line}")
        i += 1
print(f"Original model is trained to detect {i} virus species, including {c} coronavirus species:")
print(''.join(cov))

Original model is trained to detect 187 virus species, including 2 coronavirus species:
 - Middle_East_respiratory_syndrome-related_coronavirus    	94
 - Severe_acute_respiratory_syndrome-related_coronavirus    	117



Our objective is to check whether the model classifies the reads from the following reference sequences as either MERS or SARS: 

In [100]:
for i, v in enumerate(aln.ref_sequences.values()):
    print(f"- RefSeq {i+1}: {v['species'].replace('scientific name','')}")

- RefSeq 1: Coronavirus BtRs-BetaCoV/YN2018D  
- RefSeq 2: Bovine coronavirus  
- RefSeq 3: Human coronavirus OC43  
- RefSeq 4: Human coronavirus NL63  
- RefSeq 5: Infectious bronchitis virus  
- RefSeq 6: Porcine epidemic diarrhea virus  
- RefSeq 7: Porcine epidemic diarrhea virus  
- RefSeq 8: Porcine epidemic diarrhea virus  
- RefSeq 9: Porcine epidemic diarrhea virus  
- RefSeq 10: Camel alphacoronavirus  



We create several test functions:

In [102]:
def is_cov(y_preds):
    """Return 1 where the predicted code is a coronavirus (MERS or SARS), 0 otherwise"""
    return (y_preds == 94).astype(int) + (y_preds == 117).astype(int)

def is_mers(y_preds):
    """Return True where the model prediction is MERS, False otherwise

    Note: 94 is the virus code for Middle_East_respiratory_syndrome-related_coronavirus"""
    return y_preds == 94

def is_sars(y_preds):
    """Return True where the model prediction is SARS, False otherwise

    Note: 117 is the virus code for Severe_acute_respiratory_syndrome-related_coronavirus 
    """
    return y_preds == 117

def cov_acc(y_true, y_preds):
    """Fraction of reads predicted as MERS or SARS, assuming all evaluated reads are from coronavirus sequences (`y_true` is unused)"""
    return is_cov(y_preds).sum()/y_preds.shape[0]

def mers_acc(y_true, y_preds):
    """Fraction of reads predicted as MERS (`y_true` is unused)"""
    return is_mers(y_preds).sum()/y_preds.shape[0]

def sars_acc(y_true, y_preds):
    """Fraction of reads predicted as SARS (`y_true` is unused)"""
    return is_sars(y_preds).sum()/y_preds.shape[0]

Review all reads, broken down per reference sequence

In [105]:
aln = AlnFileReader(p2fastq.parent / f"{p2fastq.stem}.aln")

acc_per_refseq = {}
for refseqid in tqdm(reads_info.read_refseqs.unique()):
    mask = reads_info.read_refseqs == refseqid
    acc = cov_acc(None, class_preds[mask])
    aln_refseq_meta = aln.ref_sequences[refseqid]
    print(f"Reference Sequence: {aln_refseq_meta['species']}:")
    print(f"  Nbr reads: {class_preds[mask].shape[0]:,d}")
    print(f"  Accuracy CoV:   {acc:.3f}")
    print(f"  Accuracy MERS:  {mers_acc(None, class_preds[mask]):.3f}")
    print(f"  Accuracy SARS:  {sars_acc(None, class_preds[mask]):.3f}")

  0%|          | 0/10 [00:00<?, ?it/s]

Reference Sequence: Coronavirus BtRs-BetaCoV/YN2018D  scientific name:
  Nbr reads: 60,400
  Accuracy CoV:   0.733
  Accuracy MERS:  0.014
  Accuracy SARS:  0.719
Reference Sequence: Bovine coronavirus  scientific name:
  Nbr reads: 61,800
  Accuracy CoV:   0.055
  Accuracy MERS:  0.030
  Accuracy SARS:  0.026
Reference Sequence: Human coronavirus OC43  scientific name:
  Nbr reads: 61,080
  Accuracy CoV:   0.057
  Accuracy MERS:  0.031
  Accuracy SARS:  0.026
Reference Sequence: Human coronavirus NL63  scientific name:
  Nbr reads: 55,000
  Accuracy CoV:   0.067
  Accuracy MERS:  0.031
  Accuracy SARS:  0.035
Reference Sequence: Infectious bronchitis virus  scientific name:
  Nbr reads: 55,200
  Accuracy CoV:   0.065
  Accuracy MERS:  0.032
  Accuracy SARS:  0.034
Reference Sequence: Porcine epidemic diarrhea virus  scientific name:
  Nbr reads: 56,000
  Accuracy CoV:   0.068
  Accuracy MERS:  0.032
  Accuracy SARS:  0.036
Reference Sequence: Porcine epidemic diarrhea virus  scientifi

In [106]:
for refseqid in tqdm(reads_info.read_refseqs.unique()):
    mask_refseq = reads_info.read_refseqs == refseqid
    mask_strand_coding = reads_info.read_strand == '+'
    mask_strand_template = reads_info.read_strand == '-'
    # combine the boolean masks directly with a logical AND
    mask_coding = mask_strand_coding & mask_refseq
    mask_template = mask_strand_template & mask_refseq

    aln_refseq_meta = aln.ref_sequences[refseqid]
    acc = cov_acc(None, class_preds[mask_refseq])
    acc_coding = cov_acc(None, class_preds[mask_coding])
    acc_template = cov_acc(None, class_preds[mask_template])
       
    print(f"Ref. Sequence: {aln_refseq_meta['species'].replace('scientific name', '').strip()}:")
    print(f"  Accuracy :............... {acc:.3f}")
    print(f"  Acc. coding strand: ..... {acc_coding:.3f}")
    print(f"  Acc. template strand: ... {acc_template:.3f}")
    print(f"  Nbr reads: {class_preds[mask_refseq].shape[0]:,d}, incl. {mask_coding.sum():,d} from coding strand and {mask_template.sum():,d} from template strand")
    print()

  0%|          | 0/10 [00:00<?, ?it/s]

Ref. Sequence: Coronavirus BtRs-BetaCoV/YN2018D:
  Accuracy :............... 0.733
  Acc. coding strand: ..... 0.733
  Acc. template strand: ... 0.733
  Nbr reads: 60,400, incl. 30,099 from coding strand and 30,301 from template strand

Ref. Sequence: Bovine coronavirus:
  Accuracy :............... 0.055
  Acc. coding strand: ..... 0.058
  Acc. template strand: ... 0.053
  Nbr reads: 61,800, incl. 30,928 from coding strand and 30,872 from template strand

Ref. Sequence: Human coronavirus OC43:
  Accuracy :............... 0.057
  Acc. coding strand: ..... 0.064
  Acc. template strand: ... 0.051
  Nbr reads: 61,080, incl. 30,565 from coding strand and 30,515 from template strand

Ref. Sequence: Human coronavirus NL63:
  Accuracy :............... 0.067
  Acc. coding strand: ..... 0.062
  Acc. template strand: ... 0.071
  Nbr reads: 55,000, incl. 27,560 from coding strand and 27,440 from template strand

Ref. Sequence: Infectious bronchitis virus:
  Accuracy :............... 0.065
  Acc. c

## Simreads generated from 100 sequences

In [108]:
p2model = pfs.data / 'models/cnn_virus_original/pretrained_model.h5'
p2simreads = pfs.data / 'ncbi/simreads/single_100seq_50bp'
p2virus_labels = pfs.data / 'CNN_Virus_data/virus_name_mapping'
assert p2model.is_file()
assert p2simreads.is_dir()
assert p2virus_labels.is_file()

In [109]:
p2fastq = p2simreads / f"{p2simreads.stem}.fq"
p2aln = p2simreads / f"{p2simreads.stem}.aln"
assert p2fastq.is_file()
assert p2aln.is_file()

fastq = FastqFileReader(p2fastq)
aln = AlnFileReader(p2aln)

In [110]:
# FIXME: the following line throws an error: ValueError: No match on this line
refseqs_aln = aln.ref_sequences

# print(f"reads simulated from the following {len(refseqs_aln)} reference sequences:\n")
# print('\n'.join([f" {i:02d}: {refseq['species']}" for i, refseq in enumerate(refseqs_aln.values())]))

ValueError: No match on this line

In [None]:
nsamples = 3_000_000
p2ds, p2meta, reads_info = create_infer_ds_from_fastq(p2fastq=p2fastq, output_dir=p2datasets, overwrite_ds=True, nsamples=nsamples)

text_ds = tf.data.TextLineDataset(p2ds).batch(32)
ds = text_ds.map(strings_to_tensors)

In [None]:
model = load_model(p2model)

In [None]:
prob_preds = model.predict(ds, verbose=1)



ResourceExhaustedError: OOM when allocating tensor with shape[3000000,187] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:ConcatV2] name: concat

It seems that inference runs fine for all 93,750 steps (3M samples at batch size 32), but when the outputs are assembled, the `prob_preds` tensor of shape `[3000000,187]` cannot be allocated on the GPU.

Questions:
- Why does this have to be on the GPU?
- Is it possible to have it in main memory instead?
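
One possible workaround, untested on this setup, is to predict batch by batch and keep only the argmax codes as NumPy arrays in main memory, so the full probability array is never assembled. The sketch below shows the idea with a random stand-in for the model. `predict_classes_in_chunks` and `fake_predict` are hypothetical. In the notebook, `predict_fn` could be `model.predict_on_batch` and `batches` could be `(x for x, _ in ds)`, but whether this avoids the OOM error on this GPU has not been verified.

```python
import numpy as np

def predict_classes_in_chunks(predict_fn, batches):
    """Run `predict_fn` on each batch and keep only the argmax codes."""
    species_codes, position_codes = [], []
    for batch in batches:
        prob_species, prob_pos = predict_fn(batch)
        # Reduce each batch right away: one integer per read instead of 187 floats
        species_codes.append(np.argmax(np.asarray(prob_species), axis=1))
        position_codes.append(np.argmax(np.asarray(prob_pos), axis=1))
    return np.concatenate(species_codes), np.concatenate(position_codes)

# Size of the full float32 species-probability array for 3M reads
full_size_bytes = 3_000_000 * 187 * 4
print(f"{full_size_bytes / 1e9:.2f} GB")  # 2.24 GB

# Stand-in for the model, for illustration only: random "probabilities"
rng = np.random.default_rng(0)

def fake_predict(batch):
    n = batch.shape[0]
    return rng.random((n, 187)), rng.random((n, 10))

toy_batches = [np.zeros((32, 50, 5), dtype=np.float32) for _ in range(3)]
species, positions = predict_classes_in_chunks(fake_predict, toy_batches)
print(species.shape, positions.shape)  # (96,) (96,)
```

Keeping only the codes gives up the per-class probabilities. If those are needed, each batch of probabilities could be written to disk instead of being concatenated.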


In [None]:
prob_preds[0].shape, prob_preds[1].shape

In [None]:
class_preds = np.argmax(prob_preds[0], axis=1)
class_preds.shape

In [None]:
def is_cov(y_preds):
    """Return 1 if the corresponding prediction is a corona virus, 0 otherwise"""
    return (y_preds == 94).astype(int) + (y_preds == 117).astype(int)

def cov_acc(y_true, y_preds):
    """Evaluates the accuracy of the model assuming all evaluated reads are from corona virus"""
    return is_cov(y_preds).sum()/y_preds.shape[0]

cov_acc(None, class_preds)

In [None]:
reads_info.read_refseqs.nunique()

In [None]:
for refseqid in reads_info.read_refseqs.unique():
    mask_refseq = reads_info.read_refseqs == refseqid
    mask_strand_coding = reads_info.read_strand == '+'
    mask_strand_template = reads_info.read_strand == '-'
    mask_coding = mask_strand_coding & mask_refseq
    mask_template = mask_strand_template & mask_refseq

    aln_refseq_meta = aln.ref_sequences[refseqid]
    acc = cov_acc(None, class_preds[mask_refseq])
    acc_coding = cov_acc(None, class_preds[mask_coding])
    acc_template = cov_acc(None, class_preds[mask_template])
       
    print(f"Ref. Sequence: {aln_refseq_meta['species'].replace('scientific name', '').strip()}:")
    print(f"  Accuracy :............... {acc:.3f}")
    print(f"  Accuracy MERS:  {mers_acc(None, class_preds[mask_refseq]):.3f}")
    print(f"  Accuracy SARS:  {sars_acc(None, class_preds[mask_refseq]):.3f}")
    print(f"  Acc. coding strand: ..... {acc_coding:.3f}")
    print(f"  Acc. template strand: ... {acc_template:.3f}")
    print(f"  Nbr reads: {class_preds[mask_refseq].shape[0]:,d}, incl. {mask_coding.sum():,d} from coding strand and {mask_template.sum():,d} from template strand")
    print()

# New Section

Access AWS from colab:
- https://colab.research.google.com/github/bytehub-ai/code-examples/blob/main/tutorials/04_using_cloud_storage.ipynb
- https://python.plainenglish.io/how-to-load-data-from-aws-s3-into-google-colab-7e76fbf534d2
- https://medium.com/@lily_su/accessing-s3-bucket-from-google-colab-16f7ee6c5b51

## handle GPU with tf

In [None]:
# device = tf.config.list_physical_devices('GPU')[0]
# device

PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')

In [None]:
# tf.config.experimental.get_memory_info(device='GPU:0')

{'current': 301555968, 'peak': 330915840}

In [None]:
# physical_devices = tf.config.list_physical_devices('GPU')
# try:
#     tf.config.experimental.set_memory_growth(physical_devices[0], True)
#     assert tf.config.experimental.get_memory_growth(physical_devices[0])
# except:
#     print('Invalid device or cannot modify virtual devices once initialized.')

Invalid device or cannot modify virtual devices once initialized.


In [None]:
# tf.keras.backend.clear_session()

In [None]:
# model.summary()

In [None]:
# try:
#     del model
# except:
#     pass
# import gc
# gc.collect()

In [None]:
# gpus = tf.config.list_physical_devices('GPU')
# if gpus:
#   # Restrict TensorFlow to only use the first GPU
#   try:
#     tf.config.set_visible_devices(gpus[0], 'GPU')
#     logical_gpus = tf.config.list_logical_devices('GPU')
#     print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
#   except RuntimeError as e:
#     # Visible devices must be set before GPUs have been initialized
#     print(e)

1 Physical GPUs, 1 Logical GPU


## end of section