# Main notebook for training models using DeepVHPPI
This is the main notebook for running experiments and developing code in DeepVHPPI.

## Running terminal commands and bash in Jupyter
You can run a terminal command by prefixing it with a ```!```.

To run a Python script inside the notebook, with its output printed in the cell, use the ```%run``` magic command instead of ```!```. The ```-i``` flag runs the script in the notebook's own namespace.

To run a bash script, add ```%%bash``` at the top of the cell.
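
Outside a notebook, the effect of ```!``` can be approximated from plain Python. The sketch below is an illustration and not part of DeepVHPPI: it uses the standard-library `subprocess` module to run a command and capture what it prints. The command itself (a one-line Python program) is only a placeholder.

```python
import subprocess
import sys

# Placeholder command: start a child Python process that prints one line.
command = [sys.executable, "-c", "print('hello from a subprocess')"]

# capture_output=True collects stdout/stderr instead of sending them to the terminal;
# check=True raises CalledProcessError if the command exits with a non-zero status.
result = subprocess.run(command, capture_output=True, text=True, check=True)

print(result.stdout.strip())
```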

Here we start training on the Zhou virus-host interaction dataset.

In [None]:
import torch.nn
%run -i main.py --data_root ./data/ -tr zhou/h1n1/human/train.json  -va zhou/h1n1/human/test.json -v vocab.data -s 1024 -hs 512 -l 12  -o results --lr 0.00001 --dropout 0.1 --epochs 20000 --attn_heads 8 --activation gelu --task ppi --emb_type conv --overwrite  --batch_size 2 --grad_ac_steps 2 --name ''

## Datasets for TB training
### HPIDB dataset
This dataset was obtained in 2022 from [HPIDB](https://hpidb.igbb.msstate.edu/), a host-pathogen database containing experimental and predicted interactions between various hosts and pathogens. We downloaded 10704 host-bacterial pathogen interactions from the bacteria section of [The Pie Chart](https://hpidb.igbb.msstate.edu/hpi30_statistics.html).

The HPIDB dataset is used to train the BERT model for PPI prediction.

#### Negative HPIDB dataset
The negative interaction dataset was created by downloading random sequences from UniProt and pairing each with a bacterial pathogen sequence. This gives a negative human-pathogen interaction dataset that can be used for training. To ensure that no human sequence from the positive dataset also occurs in the negative dataset, we used CD-HIT-2D to compare the sequences in the two datasets.

1. Create a negative set of sequences with the same length as the positive set of human sequences.
2. Use CD-HIT to find sequences in the negative dataset that are more than 80% similar to a sequence in the positive dataset.
3. Remove these sequences, replace them with new ones (creating a new list), and compare the new list against the positive dataset again.
4. Repeat until we end up with a positive and a negative dataset that can be used for training.
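
The steps above can be sketched in a few lines of Python. This is a toy illustration and not the pipeline that was actually run: `difflib.SequenceMatcher` stands in for CD-HIT-2D (its ratio is not the same measure as CD-HIT's sequence identity), and the names `too_similar`, `build_negative_set`, and the example sequences are hypothetical.

```python
import random
from difflib import SequenceMatcher


def too_similar(candidate, positives, threshold=0.8):
    """Return True if candidate is more than `threshold` similar to any positive.

    Stand-in for CD-HIT-2D: uses difflib's ratio, not CD-HIT's identity.
    """
    return any(
        SequenceMatcher(None, candidate, p, autojunk=False).ratio() > threshold
        for p in positives
    )


def build_negative_set(positives, pool, threshold=0.8, seed=0):
    """Draw sequences from `pool` until we have as many as `positives`,
    discarding and replacing any draw that is too similar to a positive."""
    rng = random.Random(seed)
    remaining = list(pool)
    rng.shuffle(remaining)
    negatives = []
    while len(negatives) < len(positives):
        needed = len(positives) - len(negatives)
        if len(remaining) < needed:
            raise ValueError("pool exhausted before the negative set was filled")
        # Draw replacements, then keep only those that pass the similarity check.
        batch, remaining = remaining[:needed], remaining[needed:]
        negatives += [s for s in batch if not too_similar(s, positives, threshold)]
    return negatives


# Made-up toy sequences, for illustration only.
positives = ["MTAVVATA", "MKKLLPTA"]
pool = ["MTAVVATG", "GGGGSSSS", "PPPPWWWW", "MTAVVATA"]
negatives = build_negative_set(positives, pool)
print(sorted(negatives))
```

In the toy pool, `MTAVVATG` and `MTAVVATA` are rejected because they are too close to a positive sequence, so the two unrelated sequences form the negative set.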

## Training the MTB dataset
I used the same parameters for training on this dataset as were used for the Zhou PPIs. The command-line instruction is below. The batch_size was reduced from 8 to 2 because I am training on my desktop RTX3060 and not on V100s. It will take weeks to reach 20000 epochs, but training can be interrupted and restarted by loading the already trained model parameters.


In [None]:
%run -i main.py --data_root ./data/ -tr williams_MTB/hpidb_train.json  -va williams_MTB/hpidb_test.json  -te williams_MTB/mt37_HPI_test.json  -v vocab.data -s 1024 -hs 512 -l 12  -o results --lr 0.00001 --dropout 0.1 --epochs 20000 --attn_heads 8 --activation gelu --task ppi --emb_type conv --overwrite  --batch_size 2 --grad_ac_steps 2 --name ''

Note that there are 8384 iterations because we have a batch_size of 2.
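
The relationship between batch size and iteration count can be checked with a little arithmetic. The sketch below assumes that the 8384 iterations are counted per epoch and that ```--grad_ac_steps``` accumulates gradients over that many batches before each optimizer step; neither is confirmed by the output above. The example count of 16768 is simply 8384 × 2, back-calculated rather than taken from the dataset (16767 examples would give the same iteration count).

```python
import math


def iterations_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once (last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


def effective_batch_size(batch_size, grad_ac_steps):
    """Examples contributing to one optimizer step under gradient accumulation (assumed behaviour)."""
    return batch_size * grad_ac_steps


num_examples = 16768  # back-calculated assumption, see the text above

print(iterations_per_epoch(num_examples, batch_size=2))  # 8384
print(iterations_per_epoch(num_examples, batch_size=8))  # 2096
print(effective_batch_size(batch_size=2, grad_ac_steps=2))  # 4
```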


### Loading from pre-trained
If training is interrupted, we can resume it by loading the already trained model parameters.

Add the extra parameter ```--saved_model``` and point it to the file ```best_model.pt``` to start from a checkpoint.

In [None]:
%run -i main.py --data_root ./data/ -tr williams_MTB/hpidb_train.json  -va williams_MTB/hpidb_test.json  -te williams_MTB/mt37_HPI_test.json  -v vocab.data -s 1024 -hs 512 -l 12  -o results --lr 0.00001 --dropout 0.1 --epochs 20000 --attn_heads 8 --activation gelu --task ppi --emb_type conv --overwrite  --batch_size 2 --grad_ac_steps 2 --name from_saved --saved_model results/ppi.bert.bsz_4.layers_12.size_512.heads_8.drop_10.lr_1e-05.emb_conv.saved_model.h1n1.'mtb2'/best_model.pt
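
For readers unfamiliar with checkpointing in PyTorch, the general save-and-restore pattern looks roughly like the sketch below. It is a generic illustration under the assumption that the checkpoint holds a `state_dict`; how `main.py` actually writes and reads ```best_model.pt``` is not shown in this notebook. The small `nn.Linear` model and the temporary path are placeholders.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Placeholder model; the real DeepVHPPI model class is not used here.
model = nn.Linear(4, 2)

# Save only the parameters (the state_dict) to a temporary file.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "best_model.pt")
torch.save(model.state_dict(), checkpoint_path)

# Later: build a model with the same architecture and load the saved parameters.
resumed = nn.Linear(4, 2)
state = torch.load(checkpoint_path, map_location="cpu")
resumed.load_state_dict(state)

print(torch.equal(model.weight, resumed.weight))
```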

## Viewing Training Progress
Although tools such as Weights & Biases are available, for now we will use Plotly to view the training progress locally.

In [None]:
import pandas as pd
import json
import plotly.express as px
import numpy as np

I created two functions:

1. ```show_graph``` shows a graph of one metric against the epoch
2. ```log_to_pandas``` loads the JSON-formatted log file from disk into a DataFrame

In [None]:
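def show_graph(df, metric, title):
    """Plot one logged metric (a column of df) against the epoch."""
    fig = px.line(df, x='epoch', y=metric, title=title)
    fig.show()


def log_to_pandas(path_to_log_file):
    """Load the JSON training log into a DataFrame with one row per logged entry."""
    with open(path_to_log_file, 'r') as f:
        log = json.load(f)
    # One row per top-level log entry, each holding a dict of metrics
    log_df = pd.DataFrame([log]).T
    # Expand each metrics dict into its own columns
    normalized = pd.json_normalize(log_df[0])
    # Expose the row index as the 'epoch' column used by show_graph
    log_df = normalized.reset_index().rename({'index': 'epoch'}, axis='columns')
    return log_df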
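
To make the expected shape of the data concrete, here is a small sketch that builds the same kind of table from an in-memory log. The log structure (an epoch key mapping to a dict of metrics) is an assumption inferred from how the functions above are used, and the metric values are made up for illustration. It uses `pd.DataFrame.from_dict` as a shorter alternative, not the notebook's own function.

```python
import pandas as pd

# Hypothetical log contents: epoch -> metrics. Values are invented for illustration.
toy_log = {
    "0": {"loss": 0.69, "acc": 0.51},
    "1": {"loss": 0.60, "acc": 0.64},
    "2": {"loss": 0.55, "acc": 0.70},
}

# orient="index" turns each top-level key into a row and each metric into a column.
df = pd.DataFrame.from_dict(toy_log, orient="index")
df.index = df.index.astype(int)  # JSON keys are strings; convert them to integers
df = df.rename_axis("epoch").reset_index().sort_values("epoch")

print(df)
```

A frame of this shape has the `epoch`, `loss`, and `acc` columns that a call such as `show_graph(df, 'acc', 'Training Accuracy')` expects.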

In [None]:
training_log = log_to_pandas("results/ppi.bert.bsz_4.layers_12.size_512.heads_8.drop_10.lr_1e-05.emb_conv.saved_model.h1n1.'mtb2'/train_log.json")
show_graph(training_log, 'acc', 'Training Accuracy')

In [None]:
show_graph(training_log, 'loss', 'Training Loss')

In [None]:
training_log

## How to view the Vocab data
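```data/vocab.data``` is the data file that makes up the vocabulary of the model. The cell below loads it and prints its size and the token-to-index mapping (```stoi```).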

In [1]:
from data import WordVocab

vocab = WordVocab.load_vocab('data/vocab.data')
print(len(vocab))
print(vocab.stoi)

29
{'<pad>': 0, '<mask>': 1, '<cls>': 2, '<unk>': 3, 'L': 4, 'A': 5, 'G': 6, 'V': 7, 'S': 8, 'I': 9, 'E': 10, 'R': 11, 'D': 12, 'T': 13, 'K': 14, 'P': 15, 'F': 16, 'N': 17, 'Q': 18, 'Y': 19, 'H': 20, 'M': 21, 'W': 22, 'C': 23, 'X': 24, 'U': 25, 'O': 26, 'Z': 27, 'B': 28}


### Viewing how nn.Embedding works
We first need to tokenize the sentence and convert the tokens to a tensor. We can then use ```nn.Embedding```, with its weights fixed to an identity matrix, to create a one-hot encoded vector for each token of the input sentence.

In [6]:
import torch
import torch.nn as nn
onehot = nn.Embedding(29, 28, padding_idx=0)
onehot.weight.requires_grad = False
onehot.weight[1:] = torch.eye(28)
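
As a cross-check on the idea, the sketch below builds an equivalent frozen lookup table with `nn.Embedding.from_pretrained` and compares it with `torch.nn.functional.one_hot`. It is an alternative construction for illustration, not the notebook's own code; the vocabulary size of 29 and the example ids come from the vocabulary printed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 29  # size of the vocabulary printed above

# Row 0 (<pad>) stays all zeros; row i (i >= 1) is one-hot at position i - 1.
weights = torch.zeros(vocab_size, vocab_size - 1)
weights[1:] = torch.eye(vocab_size - 1)
onehot_table = nn.Embedding.from_pretrained(weights, freeze=True, padding_idx=0)

ids = torch.tensor([2, 21, 13, 5])  # <cls>, M, T, A in the printed vocabulary
vectors = onehot_table(ids)

# The same result via F.one_hot, dropping the column that belongs to <pad>.
reference = F.one_hot(ids, num_classes=vocab_size)[:, 1:].float()

print(vectors.shape)
print(torch.equal(vectors, reference))
```

An embedding layer is a lookup table: index `i` returns row `i` of the weight matrix, so fixing the weights to an identity matrix turns the lookup into one-hot encoding.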

In [8]:
def process_seq(sentence, vocab):
    tokens = list(sentence)
    for i, token in enumerate(tokens):
        tokens[i] = vocab.stoi.get(token, vocab.unk_index)

    tokens = [vocab.cls_index] + tokens
    return tokens
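
The tokenization step can also be tried without loading ```vocab.data```. The sketch below copies a few entries of the mapping printed above into a plain dict; `encode` is a hypothetical stand-alone helper that mirrors `process_seq`, with the `<cls>` and `<unk>` indices taken from the printed mapping.

```python
# Subset of the printed vocabulary (token -> index).
stoi = {"<pad>": 0, "<mask>": 1, "<cls>": 2, "<unk>": 3,
        "A": 5, "V": 7, "T": 13, "M": 21}
CLS_INDEX = stoi["<cls>"]
UNK_INDEX = stoi["<unk>"]


def encode(sequence, stoi):
    """Map each residue to its index, falling back to <unk>, and prepend <cls>."""
    return [CLS_INDEX] + [stoi.get(residue, UNK_INDEX) for residue in sequence]


print(encode("MTAVVATA", stoi))  # [2, 21, 13, 5, 7, 7, 5, 13, 5]
print(encode("MJ", stoi))        # [2, 21, 3]  ('J' is not in the vocabulary)
```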

In [13]:
seq = process_seq('MTAVVATA', vocab)
seq = torch.tensor(seq)

In [15]:
onehot(seq)

tensor([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
  

In [17]:
tokens = list('MTAVVATA')
tokens

['M', 'T', 'A', 'V', 'V', 'A', 'T', 'A']

In [23]:
vocab.stoi.get('U', vocab.unk_index)

25