# Training covid models
### This notebook shows how to use the model together with the covid-data-collector to train, evaluate, and test the model
#### It walks through example usages of the model's core functionality

#### Import third-party modules, the data collector (covid19_genome), and the model module

In [1]:
import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "-1" # Uncomment to disable GPU
import glob

from model import Model, DatasetName, load_model, remove_model

__ORIG_WD__ = os.getcwd()

os.chdir(f"{__ORIG_WD__}/../data_collectors/")
from covid19_genome import Covid19Genome

os.chdir(__ORIG_WD__)


2023-08-30 11:56:07.038878: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-08-30 11:56:07.063718: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
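
The import cell above changes the working directory so that ```covid19_genome``` can be imported from a sibling folder. A minimal alternative sketch, assuming the module does not depend on the working directory at import time, is to extend ```sys.path``` temporarily instead. The helper ```prepended_sys_path``` and the ```demo_collector``` module are hypothetical and exist only for this illustration.

```python
import importlib
import os
import sys
import tempfile
from contextlib import contextmanager


@contextmanager
def prepended_sys_path(path):
    """Temporarily put `path` first on sys.path (hypothetical helper)."""
    sys.path.insert(0, path)
    try:
        yield
    finally:
        sys.path.remove(path)


# Stand-in for ../data_collectors/: a temporary folder holding a tiny module.
with tempfile.TemporaryDirectory() as collectors_dir:
    with open(os.path.join(collectors_dir, "demo_collector.py"), "w") as fh:
        fh.write("NAME = 'demo'\n")

    importlib.invalidate_caches()
    with prepended_sys_path(collectors_dir):
        demo_collector = importlib.import_module("demo_collector")

collector_name = demo_collector.NAME
path_restored = collectors_dir not in sys.path
```

With this pattern the working directory never changes, so relative paths used later in the notebook keep resolving against the original folder.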


#### Create a model, or load it if it has already been created

To use the model, you first have to provide it with a dataset (with the help of the data_collector). The following cell shows an example that creates the dataset.

Note that when you create the dataset, you pass the dataset type. You can obtain the dataset types available in the system by calling the model class function ```get_ds_types()```.

In [2]:
model_name = "covid19-1024examples"

try:
    model = load_model(model_name)
except Exception:
    covid19_genome = Covid19Genome()
    lineages = covid19_genome.getLocalLineages(1024)
    lineages.sort()
    def get_dataset():
        # Pair each lineage with the path to its locally stored accessions.
        return [
            (lineage, covid19_genome.getLocalAccessionsPath(lineage))
            for lineage in lineages
        ]

    portions = {
        DatasetName.trainset.name: 0.8,
        DatasetName.validset.name: 0.1,
        DatasetName.testset.name: 0.1
    }

    dataset = get_dataset()
    model = Model(model_name)
    model.create_datasets(model.get_ds_types()[0], dataset, portions)

2023-08-30 11:56:08.128304: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-30 11:56:08.144881: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-30 11:56:08.144995: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysf
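
The ```portions``` dictionary in the cell above maps each dataset name to a fraction of the data. How ```create_datasets``` applies those fractions is not shown in this notebook; the following is only a sketch of one straightforward way to turn fractions into disjoint splits. The helper ```split_by_portions``` is hypothetical, and the plain string keys stand in for ```DatasetName.<member>.name```.

```python
def split_by_portions(items, portions):
    """Split `items` into consecutive chunks sized by `portions` (hypothetical helper)."""
    if abs(sum(portions.values()) - 1.0) > 1e-9:
        raise ValueError("portions must sum to 1")
    splits, start = {}, 0
    names = list(portions)
    for i, name in enumerate(names):
        if i == len(names) - 1:
            # The last split takes the remainder so no item is lost to rounding.
            end = len(items)
        else:
            end = start + round(len(items) * portions[name])
        splits[name] = items[start:end]
        start = end
    return splits


portions = {"trainset": 0.8, "validset": 0.1, "testset": 0.1}
splits = split_by_portions(list(range(1024)), portions)
sizes = {name: len(chunk) for name, chunk in splits.items()}
```

Giving the remainder to the last split keeps the three parts disjoint and guarantees that together they cover every item.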

After you have created the model and its datasets, you can check which neural network structures are available by calling the model class function ```get_ml_model_structures()```.

Once you have seen the ml_model structures available in the system, you can check which hyperparameters (hps) are needed to define each one by calling the model class function ```get_ml_model_structure_hps()```. It returns each hp together with whether it is required or optional.

In [3]:
print(model.get_ml_model_structures())
print(model.get_ml_model_structure_hps(model.get_ml_model_structures()[0]))

['VitStructure']
{'d_model': 'required', 'd_val': 'required', 'd_key': 'required', 'd_ff': 'required', 'heads': 'required', 'dropout_rate': 'optional', 'regularizer': 'optional', 'initializer': 'optional', 'activation': 'optional', 'encoder_repeats': 'required', 'labels': 'required'}
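
The required/optional map printed above can be used to sanity-check a hyperparameter dictionary before passing it on. The following is a minimal sketch; ```check_hps``` is a hypothetical helper, not part of the model module, and the ```candidate``` dictionary is made up for the example.

```python
# Copied from the output of get_ml_model_structure_hps() shown above.
structure_hps = {
    "d_model": "required", "d_val": "required", "d_key": "required",
    "d_ff": "required", "heads": "required", "dropout_rate": "optional",
    "regularizer": "optional", "initializer": "optional",
    "activation": "optional", "encoder_repeats": "required",
    "labels": "required",
}


def check_hps(hps, structure_hps):
    """Return (missing required names, names the structure does not list)."""
    required = {name for name, kind in structure_hps.items() if kind == "required"}
    missing = sorted(required - set(hps))
    unlisted = sorted(set(hps) - set(structure_hps))
    return missing, unlisted


# Illustrative, deliberately incomplete candidate.
candidate = {"d_model": 128, "d_val": 128, "d_key": 128, "heads": 8, "dropout": 0.2}
missing, unlisted = check_hps(candidate, structure_hps)
```

An unlisted key is not necessarily an error: the model may accept keys that belong to other parts of the system rather than to the structure itself, so treat that list as a prompt to double-check spelling.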


You can also see which properties define the current type of dataset by calling the model class function ```get_ds_props()```. This function can be called only after the dataset has been successfully created. It returns the properties of the dataset along with their values.

In [4]:
print(model.get_ds_props())

{'coverage': 4, 'substitution_rate': 0.002, 'insertion_rate': 0.01, 'deletion_rate': 0.01, 'read_length': 128, 'frag_len': 128, 'num_frags': 256}


The following use case combines the VitStructure model with the minhash genome datasets (a.k.a. mh_genome_ds).

In the mh_genome_ds, coverage is a dataset property that sets the genome coverage rate.

In the VitStructure, the model depth (```ml_model_depth``` below, passed as ```encoder_repeats```) is the number of transformer encoders.

In this example use case, these two parameters define a neural network that will be trained on the dataset (with the current coverage rate).
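
To give a feel for what a coverage rate means, the sketch below uses the usual sequencing definition, coverage = (number of reads × read length) / genome length. Whether mh_genome_ds computes coverage exactly this way is an assumption. The function ```reads_for_coverage``` is hypothetical; the read length of 128 comes from the dataset properties printed above, and 29,903 is the length of the SARS-CoV-2 reference genome (NC_045512.2).

```python
import math


def reads_for_coverage(genome_length, read_length, coverage):
    """Number of reads needed to reach `coverage` (hypothetical helper)."""
    return math.ceil(coverage * genome_length / read_length)


GENOME_LENGTH = 29_903  # SARS-CoV-2 reference genome length in bases

reads_2x = reads_for_coverage(GENOME_LENGTH, read_length=128, coverage=2)
reads_4x = reads_for_coverage(GENOME_LENGTH, read_length=128, coverage=4)
```

Under this definition, halving the coverage roughly halves the number of reads the network sees per genome.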

In [12]:
coverage = 2
ml_model_depth = 2
sequencer_instrument = "roche"

In [6]:
sequencer_instrument_to_error_profile_map = {
    "illumina": {
        "substitution_rate": 0.005,
        "insertion_rate": 0.001,
        "deletion_rate": 0.001
    },
    "ont": {
        "substitution_rate": 0.01,
        "insertion_rate": 0.04,
        "deletion_rate": 0.04
    },
    "pacbio": {
        "substitution_rate": 0.005,
        "insertion_rate": 0.025,
        "deletion_rate": 0.025
    },
    "roche": {
        "substitution_rate": 0.002,
        "insertion_rate": 0.01,
        "deletion_rate": 0.01
    }
}

def get_model_name(ml_model_depth, coverage, sequencer_instrument):
    # Build a name such as "vit.2.2x.roche"; reject instruments without an error profile.
    if sequencer_instrument not in sequencer_instrument_to_error_profile_map:
        raise ValueError(f"Invalid sequencer instrument: {sequencer_instrument}")
    return f"vit.{ml_model_depth}.{coverage}x.{sequencer_instrument}"

ml_model_name = get_model_name(ml_model_depth, coverage, sequencer_instrument)
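
Each error profile above consists of per-base substitution, insertion, and deletion rates. How the dataset applies them is not shown in this notebook; the following is only a sketch of one common way to simulate such errors on a read. The function ```apply_error_profile``` is hypothetical, and the rates in the usage lines are the "roche" values from the map above.

```python
import random

BASES = "ACGT"


def apply_error_profile(read, profile, rng):
    """Return a noisy copy of `read` (hypothetical helper, not the project's code)."""
    out = []
    for base in read:
        # Insertion: emit a random extra base before the current one.
        if rng.random() < profile["insertion_rate"]:
            out.append(rng.choice(BASES))
        # Deletion: skip the current base entirely.
        if rng.random() < profile["deletion_rate"]:
            continue
        # Substitution: replace the base with a different one.
        if rng.random() < profile["substitution_rate"]:
            base = rng.choice([b for b in BASES if b != base])
        out.append(base)
    return "".join(out)


roche = {"substitution_rate": 0.002, "insertion_rate": 0.01, "deletion_rate": 0.01}
rng = random.Random(0)  # fixed seed so the example is repeatable
noisy_read = apply_error_profile("ACGT" * 32, roche, rng)
```

Passing the random generator explicitly keeps the simulation reproducible, which helps when comparing models trained on different instruments.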

#### Adding a new neural network

In this cell we create an ml_model with the required hps (and some optional ones) listed in the output above.

In [7]:
newly_added = True
try:
    model.add_ml_model(ml_model_name, hps={
        "structure": model.get_ml_model_structures()[0],
        "d_model": model.get_ds_props()["frag_len"],
        "d_val": 128,
        "d_key": 128,
        "heads": 8,
        "d_ff": 1024+256,
        "labels":  len(model.get_labels()),
        "activation": "relu",
        "optimizer": {
            "name": "AdamW",
            "params": {
                "learning_rate": 0.001,
            },
        },
        "encoder_repeats": ml_model_depth,
        "regularizer": {
            "name": "l2",
            "params": {
                "l2": 0.0001
            }
        },
        "dropout": 0.2,
    })
except Exception:
    # Assumes the failure means the model was added in a previous run.
    newly_added = False
    print("Model already exists")

2023-08-30 11:56:09.672400: I tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:606] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2023-08-30 11:56:09.766321: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:432] Loaded cuDNN version 8600


In [8]:
models = model.list_ml_models()
print(models)

['vit.1.4x.roche', 'vit.2.4x.roche', 'vit.2.2x.roche', 'vit.2.4x.illumina', 'vit.1.4x.illumina']
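
The names in this list follow the ```vit.<depth>.<coverage>x.<instrument>``` pattern produced by ```get_model_name```. The sketch below reverses that pattern to group the stored networks by instrument. ```parse_model_name``` is a hypothetical helper and assumes integer depth and coverage values, as in the names above.

```python
from collections import defaultdict


def parse_model_name(name):
    """Split 'vit.2.4x.roche' into its parts (hypothetical helper)."""
    structure, depth, coverage, instrument = name.split(".")
    return {
        "structure": structure,
        "depth": int(depth),
        "coverage": int(coverage.rstrip("x")),
        "instrument": instrument,
    }


# Copied from the output of list_ml_models() shown above.
models = ["vit.1.4x.roche", "vit.2.4x.roche", "vit.2.2x.roche",
          "vit.2.4x.illumina", "vit.1.4x.illumina"]

by_instrument = defaultdict(list)
for name in models:
    by_instrument[parse_model_name(name)["instrument"]].append(name)
```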


In [13]:
if newly_added:
    assert False, "Please consider doing transfer learning"
# model.transfer(get_model_name(ml_model_depth, coverage * 2, sequencer_instrument), ml_model_name, False)
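
The commented-out call above names, as the transfer source, the network with the same depth and instrument but twice the coverage. The sketch below checks whether such a source exists before attempting the transfer. ```transfer_source``` is a hypothetical helper, and the meaning of the final ```False``` argument to ```model.transfer``` is not documented in this notebook.

```python
def transfer_source(existing, depth, coverage, instrument):
    """Return the name of the same network at double coverage, or None."""
    candidate = f"vit.{depth}.{coverage * 2}x.{instrument}"
    return candidate if candidate in existing else None


# Copied from the output of list_ml_models() shown earlier.
existing = ["vit.1.4x.roche", "vit.2.4x.roche", "vit.2.2x.roche",
            "vit.2.4x.illumina", "vit.1.4x.illumina"]

source_for_2x = transfer_source(existing, depth=2, coverage=2, instrument="roche")
source_for_4x = transfer_source(existing, depth=2, coverage=4, instrument="roche")
```

Returning ```None``` when no source exists lets the caller decide whether to train from scratch instead.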

In [None]:
# model.change_ml_hps(ml_model_name, {
# # #     "regularizer": {
# # #         "name": "l2",
# # #         "params": {
# # #             "l2": 0.00005,
# # #         },
# # #     },
#     "optimizer": {
#         "name": "AdamW",
#         "params": {
#             "learning_rate": 0.00005,
#         },
#     },
# })



#### Updating the dataset coverage and error profile

In [None]:
model.update_ds_props({
    "coverage": coverage,
    } | sequencer_instrument_to_error_profile_map[sequencer_instrument])
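
The cell above merges two dictionaries with the ```|``` operator, which requires Python 3.9 or later. The sketch below shows the resulting dictionary for the "roche" profile and an equivalent form that also works on older versions. The values are copied from the cells above.

```python
# "roche" entry of sequencer_instrument_to_error_profile_map.
error_profile = {
    "substitution_rate": 0.002,
    "insertion_rate": 0.01,
    "deletion_rate": 0.01,
}
coverage = 2

# Python 3.9+: dictionary union.
merged = {"coverage": coverage} | error_profile

# Equivalent on older Python versions: unpack into a new dictionary.
unpacked = {"coverage": coverage, **error_profile}
```

In both forms, keys from the right-hand dictionary win on conflict, so an error profile could never be silently overridden by the left-hand side.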

#### Setting dataset batch size and training

In [None]:
model.set_ds_batch_size(320)
model.train(ml_model_name, epochs=1000)

Epoch 1/1000


2023-08-30 11:48:06.147815: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fc2d4025c90 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-08-30 11:48:06.147914: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA RTX A5000, Compute Capability 8.6
2023-08-30 11:48:06.156530: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:255] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-08-30 11:48:06.293223: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-08-30 11:48:06.375459: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


Epoch 2/1000
 68/427 [===>..........................] - ETA: 4:03 - loss: 0.4556 - accuracy: 0.8955 - auc: 0.9954

KeyboardInterrupt: 
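
The progress bar above reports 427 steps per epoch with a batch size of 320. Assuming every step is a full batch except possibly the last one, which is kept, this bounds the number of training examples. The sketch below works through the arithmetic; ```steps_per_epoch``` is a hypothetical helper.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Steps needed to see every example once, keeping the last partial batch."""
    return math.ceil(num_examples / batch_size)


batch_size = 320
steps = 427

# Largest and smallest example counts that yield exactly 427 steps.
max_examples = steps * batch_size
min_examples = (steps - 1) * batch_size + 1
```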