# Inference using the original data from the paper's authors

In this notebook, we use the validation data from the CNN Virus paper to do inference using the pretrained model.

This notebook runs locally and should also run on Colab, as long as the file system follows the unified file system (see documentation).

# 1. Imports and setup environment

### Install and import packages

In [2]:
# Install required custom packages if not installed yet.
import importlib.util
if not importlib.util.find_spec('ecutilities'):
    print('installing package: `ecutilities`')
    ! pip install -qqU ecutilities
else:
    print('`ecutilities` already installed')
if not importlib.util.find_spec('metagentools'):
    print('installing package: `metagentools`')
    ! pip install -qqU metagentools
else:
    print('`metagentools` already installed')

`ecutilities` already installed
`metagentools` already installed
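
The same check can be wrapped in a small helper when more packages are involved. The sketch below is only an illustration: `missing_packages` is a hypothetical helper name, not part of `ecutilities` or `metagentools`, and it relies only on the standard `importlib.util.find_spec` call used above.

```python
import importlib.util

def missing_packages(names):
    """Return the names in `names` that cannot be found in this environment."""
    # find_spec returns None when a top-level package is not installed
    return [name for name in names if importlib.util.find_spec(name) is None]

# `json` ships with Python; the second name is a made-up placeholder
print(missing_packages(['json', 'package_that_does_not_exist_123']))
```

Any name returned by the helper could then be passed to `pip install`, as in the cell above.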


In [3]:
# Import all required packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os

from ecutilities.core import files_in_tree
from ecutilities.ipython import nb_setup
from IPython.display import display, Markdown, HTML
from pathlib import Path
from pprint import pprint
from tqdm.notebook import tqdm, trange

# Setup the notebook for development
nb_setup()

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # or any {'0', '1', '2'}
import tensorflow as tf
from tensorflow.python.client import device_lib
from tensorflow.keras.models import load_model
print(f"Tensorflow version: {tf.__version__}\n")

from metagentools.cnn_virus.data import strings_to_tensors, create_infer_ds_from_fastq
from metagentools.cnn_virus.data import FastaFileReader, FastqFileReader, AlnFileReader
from metagentools.core import TextFileBaseReader, ProjectFileSystem

Set autoreload mode
Tensorflow version: 2.8.2



List all computing devices available on the machine

In [4]:
devices = device_lib.list_local_devices()
print('\nDevices:')
for d in devices:
    t = d.device_type
    name = d.physical_device_desc
    l = [item.split(':', 1) for item in name.split(', ')]
    name_attr = dict([x for x in l if len(x)==2])
    dev = name_attr.get('name', ' ')
    print(f"  - {t}  {d.name} {dev:25s}")


Devices:
  - CPU  /device:CPU:0                          
  - GPU  /device:GPU:0  NVIDIA GeForce GTX 1050 
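
The loop above extracts the device name by splitting `physical_device_desc` into `key: value` pairs. A minimal standalone sketch of that parsing step follows. The sample string is illustrative and its exact format is an assumption, and `parse_device_desc` is a hypothetical helper name. Unlike the cell above, it also strips the spaces around keys and values.

```python
def parse_device_desc(desc):
    """Split a 'key: value, key: value' description string into a dict."""
    pairs = [item.split(':', 1) for item in desc.split(', ')]
    # keep only well-formed 'key: value' items and strip surrounding spaces
    return {p[0].strip(): p[1].strip() for p in pairs if len(p) == 2}

# Illustrative description string (assumed format, not copied from a real run)
sample = 'device: 0, name: NVIDIA GeForce GTX 1050, pci bus id: 0000:01:00.0'
attrs = parse_device_desc(sample)
print(attrs.get('name', ' '))

# A CPU device has an empty description, so the lookup falls back to the default
print(repr(parse_device_desc('').get('name', ' ')))
```

Splitting with `split(':', 1)` matters here: values such as a PCI bus id contain colons themselves and would otherwise be cut into more than two parts.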


# 2. Setup paths to files

Key folders and system information

In [5]:
pfs = ProjectFileSystem()
pfs.info()

Running linux on local computer
Device's home directory: /home/vtec
Project file structure:
 - Root ........ /home/vtec/projects/bio/metagentools 
 - Data Dir .... /home/vtec/projects/bio/metagentools/data 
 - Notebooks ... /home/vtec/projects/bio/metagentools/nbs


In [6]:
pfs.readme()

ReadMe file for directory `data`:

### Data structure for `metagentools`
This directory includes all the data required for the project `metagentools`.

```text
data
 |--- CNN_Virus_data 
 |--- ncbi           
 |--- ncov_data      
 |--- saved         
 |--- ....           
     
```
#### Sub-directories
- `CNN_Virus_data`: includes all the data related to the original CNN Virus paper, i.e. training data and validation data in a format that can be used by the CNN Virus code.
- `ncbi`: includes data related to the use of CoV sequences from NCBI: reference sequences, simulated reads, inference datasets, inference results.
- `ncov_data`: includes data related to the use of non Cov sequences from various sources: reference sequences, simulated reads, inference datasets, inference results.
- `saved`: includes model saved parameters and preprocessing datasets.


- `p2model`: path to the file with the saved original pretrained model
- `p2virus_labels`: path to the file mapping virus names to labels for the original model
- `p2original`: path to the folder with the original CNN Virus data

In [8]:
p2model = pfs.data / 'saved/cnn_virus_original/pretrained_model.h5'
assert p2model.is_file(), f"No file found at {p2model.absolute()}"

p2virus_labels = pfs.data / 'CNN_Virus_data/virus_name_mapping'
assert p2virus_labels.is_file(), f"No file found at {p2virus_labels.absolute()}"

p2original = pfs.data / 'CNN_Virus_data'
assert p2original.is_dir(), f"No directory found at {p2original.absolute()}"

In [9]:
pfs.readme(dir_path=p2original)

ReadMe file for directory `data/CNN_Virus_data`:

### CNN Virus data

This directory includes data used to train and validate the initial CNN Virus model, as well as a few smaller datasets for experimenting. 


#### File list and description:
##### 50-mer 
50-mer reads and their labels, in *text format* with one line per sample. Each line consists of three components, separated by tabs: the 50-mer read or sequence, the virus species label and the position label:
```text
'TTACNAGCTCCAGTCTAAGATTGTAACTGGCCTTTTTAAAGATTGCTCTA    94    5\n'
``` 
Files:
- `50mer_training`: dataset with 50,903,296 reads for training
- `50mer_validating`: dataset with 1,000,000 reads for validation
- `50mer_ds_100_reads`: small subset of 100 reads from the validating dataset for experiments

##### 150-mer
150-mer reads and their labels in *text format* in a similar format as above:
```text
'TTCTTTCACCACCACAACCAGTCGGCCGTGGAGAGGCGTCGCCGCGTCTCGTTCGTCGAGGCCGATCGACTGCCGCATGAGAGCGGGTGGTATTCTTCCGAAGACGACGGAGACCGGGACGGTGATGAGGAAACTGGAGAGAGCCACAAC    6    0\n'
```
Files:
- `ICTV_150mer_benchmarking`: dataset with 10,0000 read
- `150mer_ds_100_reads`: small subset of 100 reads from `ICTV_150mer_benchmarking`

##### Longer reads
Reads of various length with no labels, in simple *fasta format*. Each read sequence is preceded by a definition line: `> Sequence n`, where `n` is the sequence number.

Files:
- `training_sequences_300bp.fasta`: dataset with 9,000 300-mer reads
- `training_sequences_500bp.fasta`: dataset with 9,000 500-mer reads
- `validation_sequences.fasta`: dataset with 564 reads of mixed lengths ranging from 163-mer to 497-mer

##### Other files:
- `virus_name_mapping`: mapping between virus species and their numerical label
- `weight_of_classes`:  weights for each virus species class in the training dataset



In [10]:
files_in_tree(path=p2original);

data
  |--CNN_Virus_data
  |    |--50mer_validating (0)
  |    |--50mer_ds_100_reads (1)
  |    |--validation_sequences.fasta (2)
  |    |--ICTV_150mer_benchmarking (3)
  |    |--readme.md (4)
  |    |--50mer_training (5)
  |    |--training_sequences_500bp.fasta (6)
  |    |--weight_of_classes (7)
  |    |--150mer_ds_100_reads (8)
  |    |--virus_name_mapping (9)
  |    |--training_sequences_300bp.fasta (10)


For this experiment, we will use the dataset:
- `50mer_validating`

In [11]:
p2ds = p2original / '50mer_validating'
assert p2ds.is_file()
p2ds.absolute()

PosixPath('/home/vtec/projects/bio/metagentools/data/CNN_Virus_data/50mer_validating')

# 3. Create inference dataset

The model expects a dataset file in the following format:

```text
    AAAAAGATTTTGAGAGAGGTCGACCTGTCCTCCTAAAACGTTTACAAAAG
    CATGTAACGCAGCTTAGTCCGATCGTGGCTATAATCCGTCTTTCGATTTG
    AACAACATCTTGTTGATGATAACCGTCAAAGTGTTTTGGGTCTGGAGGGA
    AGTACCTGGAGAGCGTTAAGAAACACAAACGGCTGGATGTAGTGCCGCGC
    CCACGTCGATGAAGCTCCGACGAGAGTCGGCGCTGAGCCCGCGCACCTCC
```

`50mer_validating` is already in the correct format.

The mapping between label codes and virus species names is in the file `virus_name_mapping`.

In [12]:
reader = TextFileBaseReader(p2ds, nlines=10)
reader.print_first_chunks(nchunks=1)

10-line chunk 1
AAAAAGATTTTGAGAGAGGTCGACCTGTCCTCCTAAAACGTTTACAAAAG	71	0
CATGTAACGCAGCTTAGTCCGATCGTGGCTATAATCCGTCTTTCGATTTG	1	7
AACAACATCTTGTTGATGATAACCGTCAAAGTGTTTTGGGTCTGGAGGGA	158	6
AGTACCTGGAGAGCGTTAAGAAACACAAACGGCTGGATGTAGTGCCGCGC	6	7
CCACGTCGATGAAGCTCCGACGAGAGTCGGCGCTGAGCCCGCGCACCTCC	71	6
AGCTCGTGGATCTCCCCTCCTTCTGCAGTTTCAACATCAGAAGCCCTGAA	87	1
GACTCTGTGTTTATGTATCAGCATACAGAGCTTATGCAGAAGAACGCGTC	10	0
CGTCATGAGGAAGTTGCTAATAATATGTGGATGCATGCATTCCTCTGGGT	178	7
TTCACCTTGAGCAAGGGCAGGTTGAACACGCGGCTGACATCGCCGTCGTA	71	3
CAAAACTTTCACCGGGGTTCCAATCCGCGGTGGTAATGACGTTNTGCTGT	22	6
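
Each line printed above holds three tab-separated fields: the read, the species label and the position label. The sketch below shows one way to split such a line in plain Python. `parse_labeled_read` is a hypothetical helper written for illustration and is not the parsing done by `strings_to_tensors`.

```python
def parse_labeled_read(line):
    """Split one tab-separated line into (read, species_label, position_label)."""
    read, species, position = line.rstrip('\n').split('\t')
    return read, int(species), int(position)

# First line of the chunk printed above
line = 'AAAAAGATTTTGAGAGAGGTCGACCTGTCCTCCTAAAACGTTTACAAAAG\t71\t0\n'
read, species, position = parse_labeled_read(line)
print(len(read), species, position)
```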



## Create the data loader for the model 

Define the batch size and create a first dataset that reads from the dataset text file. The batch size can be adjusted to the memory available on the GPU. For reference, `bs = 4096` was used with a 4 GB GPU.

Then transform the text dataset into a tensor dataset by applying the `strings_to_tensors` preprocessing function.

In [13]:
bs = 4096

text_ds = tf.data.TextLineDataset(p2ds).batch(bs)
ds = text_ds.map(strings_to_tensors)
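
To get a feel for what the batch size implies, the sketch below works out how many batches the 1,000,000 reads of `50mer_validating` produce with `bs = 4096`, and mimics the grouping in plain Python. It assumes the default behaviour of `Dataset.batch`, which keeps the final, smaller batch. `batched` is a hypothetical helper used only for illustration.

```python
import math
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk

n_reads, bs = 1_000_000, 4096
n_batches = math.ceil(n_reads / bs)               # full batches plus one partial batch
last_batch_size = n_reads - (n_batches - 1) * bs  # reads left for the final batch
print(n_batches, last_batch_size)

# Same grouping on a tiny example: 10 items in batches of 4
print([len(b) for b in batched(range(10), 4)])
```

Halving `bs` roughly doubles the number of batches and halves the memory needed per batch.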

The bases in the read sequences are encoded as a 5-dim one-hot-encoded vector, as the model expects.

In this example, each 50bp read is converted into a tensor of shape [50, 5].

In [14]:
for batch, (y1b, y2b) in ds.take(1):
    # show the shape of one batch
    print(batch.shape)
    # show the first 10 bases, after one-hot encoding
    print(batch[0, :10, :])

(4096, 50, 5)
tf.Tensor(
[[1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0.]], shape=(10, 5), dtype=float32)
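
The rows printed above are consistent with bases mapped to indices in the order A, C, G, T, N: the read starts with `AAAAAGATTT`, and `A` lands on index 0, `G` on index 2 and `T` on index 3. The positions of `C` and `N` do not appear in that output and are assumptions here. The NumPy sketch below reproduces the encoding under that assumption. It is not the implementation of `strings_to_tensors`, and `one_hot_read` is a hypothetical name.

```python
import numpy as np

# Assumed base order: A, G and T match the output above, C and N are assumptions
BASES = 'ACGTN'

def one_hot_read(read):
    """Encode a read as a float32 array of shape (len(read), 5)."""
    idx = np.array([BASES.index(base) for base in read])
    # row i of the identity matrix is the one-hot vector for class i
    return np.eye(len(BASES), dtype=np.float32)[idx]

encoded = one_hot_read('AAAAAGATTT')
print(encoded.shape)
print(encoded[5])  # sixth base of the read, a G
```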


# 4. Inference

Load and review the pretrained model

In [15]:
model = load_model(p2model)

In [16]:
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 50, 5)]      0           []                               
                                                                                                  
 conv1d_1 (Conv1D)              (None, 50, 512)      13312       ['input_1[0][0]']                
                                                                                                  
 batch_normalization_1 (BatchNo  (None, 50, 512)     2048        ['conv1d_1[0][0]']               
 rmalization)                                                                                     
                                                                                                  
 max_pooling1d_1 (MaxPooling1D)  (None, 25, 512)     0           ['batch_normalization_1[0][

Present the inference dataset to the model and collect the predictions.

The model returns two sets of probabilities:
- `prob_preds_species`: for each input read, a vector of 187 values giving the probability that the read belongs to each of the 187 species
- `prob_preds_pos`: for each input read, a vector of 10 values giving the probability that the read comes from the corresponding segment of the original sequence (1 to 10)

In [18]:
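# Note: `%time` alone on a line times an empty statement, hence the microsecond
# timings below; `%%time` as the first line of the cell would time the prediction.
%time
prob_preds_species, prob_preds_pos = model.predict(ds, verbose=1)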

CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 5.96 µs


In [19]:
prob_preds_species.shape, prob_preds_pos.shape

((1000000, 187), (1000000, 10))

To find the prediction, we pick the argmax probability, which gives us the index/code for the predicted virus species.

In [20]:
class_preds = np.argmax(prob_preds_species, axis=1)
class_preds.shape
class_preds[:10]

array([ 71,   1, 158,   6,  71,  87,  10, 178,  71,  22])
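
The argmax step can be checked on a small example. The sketch below uses made-up probabilities for 3 reads and 4 classes (not model output) and also keeps the probability of the selected class, which can serve as a rough confidence value.

```python
import numpy as np

# Toy probabilities for 3 reads and 4 classes (made-up values, not model output)
probs = np.array([
    [0.10, 0.70, 0.15, 0.05],
    [0.25, 0.25, 0.40, 0.10],
    [0.90, 0.05, 0.03, 0.02],
])

preds = np.argmax(probs, axis=1)    # index of the most probable class per read
confidence = np.max(probs, axis=1)  # probability assigned to that class
print(preds, confidence)
```

The same call with `axis=1` applies to `prob_preds_pos` to obtain the predicted position index for each read.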

# 5. Simple evaluation

The original model was trained on 187 different virus species.
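
A simple evaluation compares the predicted species codes with the true species labels, which are stored in the second column of `50mer_validating`. The sketch below computes accuracy on made-up labels. `accuracy` is a hypothetical helper, and reading the true labels from the file is left out.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of reads whose predicted label equals the true label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Toy labels (made-up values); in the notebook, `y_pred` would be `class_preds`
y_true = [71, 1, 158, 6, 71]
y_pred = [71, 1, 158, 6, 10]
print(accuracy(y_true, y_pred))
```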
