**This is an instruction on how to apply our benchmark to your own task. This illustration covers the following tasks:**\

**1. Model Development:**\
**(1) Graph node embedding task**\
**(2) Graph-wise embedding task**\
**(3) Sequence embedding task**\
**(4) Sequence generative task**

**2. Bioinformatics Development:**\
**(1) Protein structure representation task**\
**(2) Protein sequence representation task**\
**(3) Topology/Structure-based protein design task**\
**(4) Antibody design task**

**By going through this tutorial, you will become familiar with our datasets, the data structure, and how to apply our pipeline for training and evaluation. The modules to be developed by users are marked with "User Defined".**

In [None]:
from DataLoading import Dataloader
import networks
import training_helper
import evaluation_helper
import numpy as np
import torch

# Topology / Structure-based Protein Design

### Database: SCOPe

**For Discriminative Tasks:**

|            | class | fold | super-family | family | protein |
| ---        | ---   | ---  | ---          | ---    |  ---    |
| Training   | 6     | 1080 | 1820         | 4304   | 40082   |
| Validation | 6     | 771  | 1232         | 2705   | 10069   |
| Test       | 6     | 902  | 1480         | 3373   | 10737   |
| All        | 6     | 1080 | 1820         | 4304   | 60888   |

**For the Generative Task:**

|            | class | fold | super-family | family | protein |
| ---        | ---   | ---  | ---          | ---    |  ---    |
| Training   | 6     | 870  | 1367         | 3022   | 39979   |
| Validation | 6     | 131  | 276          | 678    | 10678   |
| Test       | 6     | 152  | 259          | 662    | 10231   |
| All        | 6     | 1080 | 1820         | 4304   | 60888   |

## Discriminative Task
**Test node-wise or graph-wise graph embedding models.**

### Load the data

**Use the dataloader as shown below. On the first call, the dataloader downloads the processed data from the website. On later calls, it loads the already-downloaded data directly.**

In [2]:
train_set, vali_set, test_set = Dataloader(database = 'SCOPe', 
                                           path = '../Datasets/SCOPe/', 
                                           task = 'Discriminative', 
                                           batch_size = 16)

Downloading the database...
Downloading 1wc9o9p7nyg8s95_MO-UI80qqKVoSf4L9 into ../Datasets/SCOPe/SCOPe.zip... Done.
Unzipping...Done.

Database: SCOPe
Task: Discriminative
Shuffle: True False False
training: 40082 samples
validation: 10069 samples
test: 10737 samples
Batch size: 16


**For simplicity, here we load only part of the dataset (database = 'SCOPe_debug'). For the challenge, please use the complete dataset (database = 'SCOPe').**

In [2]:
train_set, vali_set, test_set = Dataloader(database = 'SCOPe_debug', 
                                           path = '../Datasets/SCOPe/', 
                                           task = 'Discriminative', 
                                           batch_size = 16)

Downloading the database...
Downloading 1BFsBdQzLiRKmc1lDOZBwREcCnfg4EiRU into ../Datasets/SCOPe/SCOPe_debug.zip... Done.
Unzipping...Done.

Database: SCOPe_debug
Task: Discriminative
Shuffle: True False False
training: 55 samples
validation: 55 samples
test: 59 samples
Batch size: 16


### Define the model (user defined)

**Users first define their own GNN and then use the class *GraphLevelEmbedding* as a container for it, after which the discriminative task can be run through our pipeline. The GNN must accept exactly two inputs: the feature matrix (max_num_of_nodes x feat_dim) and the adjacency tensor (channel_num x max_num_of_nodes x max_num_of_nodes). In this task max_num_of_nodes = 60 and feat_dim = 11; channel_num = 5 for a heterogeneous graph and 1 for a homogeneous graph. The GNN can be either node-wise or graph-wise.**
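To make the expected input shapes concrete, the sketch below pads a small toy graph to the fixed sizes stated above. The helper name `pad_graph`, the toy edges, the undirected treatment of edges, and the zero filling of unused node slots are all assumptions made for illustration; the actual preprocessing is done by the dataloader and is not shown in this tutorial.

```python
import numpy as np

MAX_NODES = 60   # max_num_of_nodes in this task
FEAT_DIM = 11    # per-node feature dimension
CHANNELS = 5     # adjacency channels for a heterogeneous graph


def pad_graph(features, edges, max_nodes=MAX_NODES, channels=CHANNELS):
    """Pad node features and build a multi-channel adjacency tensor.

    features: array of shape (num_nodes, feat_dim)
    edges: iterable of (channel, i, j) tuples, treated as undirected
    Zero padding for unused node slots is an assumption of this sketch.
    """
    features = np.asarray(features, dtype=np.float32)
    num_nodes, feat_dim = features.shape
    if num_nodes > max_nodes:
        raise ValueError("graph has more nodes than max_nodes")

    # Feature matrix: real nodes first, zero rows for the padding.
    x = np.zeros((max_nodes, feat_dim), dtype=np.float32)
    x[:num_nodes] = features

    # Adjacency tensor: one (max_nodes x max_nodes) matrix per channel.
    adj = np.zeros((channels, max_nodes, max_nodes), dtype=np.float32)
    for c, i, j in edges:
        adj[c, i, j] = 1.0
        adj[c, j, i] = 1.0
    return x, adj


toy_features = np.ones((4, FEAT_DIM))          # 4 nodes, 11 features each
toy_edges = [(0, 0, 1), (0, 1, 2), (3, 2, 3)]  # (channel, node_i, node_j)
x, adj = pad_graph(toy_features, toy_edges)
print(x.shape, adj.shape)
```

A user-defined GNN would then consume a batch of such `x` and `adj` arrays after conversion to tensors.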

In [2]:
### A node-wise graph embedding model (the part that needs to be defined by users)
gnn = networks.GraphConvolNetwork(feature_dim = 11, hidden_dim = 100, embedding_dim = 20, 
                                  num_layers = 3, channel_num=5)
# This is an illustrative implementation of a GCN. Its constructor arguments are not
# mandatory for user-defined models, but the input feature dimension must be 11.

In [4]:
### The container of the GNN. 
model_dis = networks.GNN_Container(model = gnn, embedding_dim = 20, 
                                   pooling = 'max', CUDA = False, channel_num=5)
# embedding_dim x channel_num = output dimension of the defined GNN (here 20 x 5 = 100)
# For a node-wise GNN, "pooling" can be 'max', 'sum' or 'mean'; for a graph-wise GNN, "pooling" needs to be set to None.
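The `pooling` option reduces per-node embeddings to a single graph-level vector. A minimal pure-Python sketch of the three modes follows. The function `pool_nodes` is hypothetical and only illustrates the arithmetic; the real container may, for example, treat padded nodes differently.

```python
def pool_nodes(node_embeddings, mode):
    """Reduce a list of per-node embedding vectors to one graph vector."""
    columns = list(zip(*node_embeddings))  # one tuple per embedding dimension
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "sum":
        return [sum(col) for col in columns]
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode}")


nodes = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]  # 3 nodes, embedding size 2
print(pool_nodes(nodes, "max"))   # [3.0, 4.0]
print(pool_nodes(nodes, "sum"))   # [6.0, 6.0]
print(pool_nodes(nodes, "mean"))  # [2.0, 2.0]
```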

### Train the model

In [10]:
model_dis, optimizer, results = training_helper.discriminative_train(train_set, # training set
                                                                     model_dis, # model 
                                                                     num_epochs = 3, # number of epochs
                                                                     val_dataset=vali_set, # validation set (optional)
                                                                     test_dataset=None, # test set (optional)
                                                                     heterogeous = True, # False for homogeneous graph
                                                                     USE_CUDA = False)

Epoch:  1
Average loss: 2.410692
Training accuracy: 0.6545
Validation accuracy: 0.0000
Training time for Epoch 1: 0.5225 s
Total time for Epoch 1: 0.9112 s
Epoch:  2
Average loss: 1.620403
Training accuracy: 0.8364
Validation accuracy: 0.0000
Training time for Epoch 2: 0.5063 s
Total time for Epoch 2: 0.8715 s
Epoch:  3
Average loss: 1.172347
Training accuracy: 0.8545
Validation accuracy: 0.0000
Training time for Epoch 3: 0.4726 s
Total time for Epoch 3: 0.8412 s
Best training result: 0.8545 (epoch 3)
Best validation result: 0.0000 (epoch 0)

### Evaluation

In [14]:
eval_result = evaluation_helper.Disc_evaluate(test_set, model_dis, heterogeous=True, USE_CUDA = False)
print(eval_result)

{'prec': 0.0, 'recall': 0.0, 'acc': 0.0, 'F1': 0.0}
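For reference, here is one common way to compute these four metrics for a multi-class problem, using macro averaging. Whether `Disc_evaluate` uses macro, micro, or weighted averaging is not stated in this tutorial, so the averaging choice and the function name `macro_metrics` are assumptions. The second call shows that all four values are zero whenever no prediction matches its label, as in the debug run above.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Undefined ratios (zero denominator) are counted as 0.0.
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    acc = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"prec": sum(precisions) / n, "recall": sum(recalls) / n,
            "acc": acc, "F1": sum(f1s) / n}


print(macro_metrics([0, 1, 2, 2], [0, 2, 2, 2]))
print(macro_metrics([0, 1], [2, 3]))  # no prediction matches its label
```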


## Generative Task
**Test node-wise graph embedding models in a sequence generation task.**

In [3]:
train_set, vali_set, test_set = Dataloader(database = 'SCOPe_debug', 
                                           path = '../Datasets/SCOPe/', 
                                           task = 'Generative', 
                                           batch_size = 16)

The database SCOPe_debug has already been downloaded.

Database: SCOPe_debug
Task: Generative
Shuffle: True False False
training: 68 samples
validation: 89 samples
test: 59 samples
Batch size: 16


### Define the model

In [4]:
# the "gnn" was already defined above
model_gen = networks.VAE_Container(gnn_model = gnn, gnn_out_dim = 100, CUDA=False)

### Train the model

In [5]:
model_gen, optimizer_gen = training_helper.VAE_training(model_gen, train_set, Epoch_NUM = 3,
                                                        heterogeous = True, USE_CUDA = False)

Epoch 1:
Average-Loss: 2.9342	Average-CE: 2.9340	Average-KLD: 0.2962
Training time: 7.2537s
Epoch 2:
Average-Loss: 2.7058	Average-CE: 2.7051	Average-KLD: 0.3352
Training time: 7.8602s
Epoch 3:
Average-Loss: 2.5504	Average-CE: 2.5492	Average-KLD: 0.3738
Training time: 7.8868s
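The three logged quantities are typical of a variational autoencoder objective: a reconstruction cross-entropy (CE), a KL divergence (KLD) between the approximate posterior and a standard normal prior, and a total loss that combines them. In the log above, the total loss exceeds CE by far less than the KLD value, which is consistent with a small weight on the KL term, but the exact weighting used by `VAE_training` is not documented here. The sketch below therefore uses a hypothetical `kl_weight` parameter and assumes a diagonal-Gaussian posterior.

```python
import math


def gaussian_kld(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for one latent vector."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, logvar))


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class under predicted probabilities."""
    return -math.log(probs[target_index])


def vae_loss(ce, kld, kl_weight):
    """Total loss = reconstruction term + weighted KL term (weight is hypothetical)."""
    return ce + kl_weight * kld


kld = gaussian_kld(mu=[1.0, 0.0], logvar=[0.0, 0.0])
ce = cross_entropy([0.25, 0.25, 0.5], target_index=2)
loss = vae_loss(ce, kld, kl_weight=0.001)
print(f"CE={ce:.4f} KLD={kld:.4f} Loss={loss:.4f}")
```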


### Evaluation

In [6]:
ele_all, seq_all, iden_list, ppl_list = evaluation_helper.Gen_evaluation(model_gen, train_set,
                                                                         heterogeous = True, 
                                                                         USE_CUDA = False)
print('The perplexity is %.4f.'%(float(torch.mean(torch.Tensor(ppl_list)))))
print('The average sequence identity is %.4f.'%(np.mean(iden_list)))
print()
print('Examples of the generated sequences:')
for s in seq_all[:10]:
    print(s)

The perplexity is 14.3335.
The average sequence identity is 0.0301.

Examples of the generated sequences:
EEKKQLVMMYMMMLMYMYYHHHMHYMYHHHHYHHMYHYHHMYYYHYHHYEHHSKKVKVSSEEQLQENVNNNVHNHNHHHVNNHNNHNVNVHVVVHNVNHVIFQ
CCCKKYYYESSEMMCTTCMMMMTCCTTCTCTMTCCMMMMMMMTMCCTEEGDVGGMMKKKGVQLKLLSLSLKLSKSLKSSSLLKSSLKLKSLLLKSKSWWWVWWPPMWPWPWPPWMMWMWWPMMPMWMPPWPWMHWWMMTTTVTNPTTPPKMMPMEELLEEEELSSSSSSSELELEESSEEESSSELEESHKH
HMHHTIIDEEEFHHFFHFTHEHDQEKEEDIFMGAYNNNNY
YYAPEMS
NNPPNIMIIIYASASAASAAFYYHQQQHYYQYQYHQYQYYHQQYYQYYYYHQHYQYSSFFLSSSLMMSSVQQGGTTTTTTICMMHHAHHHVMRMMRNHEFWATNNHFQEQQEEQ
EGVITFTNTTEESDSMTTTIMMTTMTMTMTMIMTIIMMITIMMIMMMTMMMMWMVVNVNEHHSSDRRRDDDRDMMMMDRMRDRMRDRDDRMRDRDRDDRSEEEMDDMHIHHHQTDTTTDTTHDDDMMMDMMTTSS
GFFQQEQQE
HASASSWTNDDSNNYRYYRRRRNRRRNNNNRRNRYYYNYYNYYRRNNDMMMQMDQDQDDDMDMMMDMDMMMMMMMQQDDQDD
MKSSKKSDNNEETATTHKKHHKHKHTEIIDYEDDMDFMM
SSSLSHIIESGSGGTSSGGTSGGTTSGTGGTSGSSSSSTTGSSGGTTTTTTNNHMYYYMYMEEEEHQMTMMMTVIKKAHMPMDGMVVKMVMVMVVKMMKVKVVMVVVKMKVKVKKKVVK
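As a reference for reading these numbers, the sketch below shows one common definition of each metric: sequence identity as the fraction of matching positions between a generated and a native sequence of equal length, and perplexity as the exponential of the mean per-residue negative log-likelihood. The exact definitions used in `evaluation_helper.Gen_evaluation` (for example, how sequences of different lengths are handled) are not shown in this tutorial, so both functions are illustrative assumptions.

```python
import math


def sequence_identity(generated, native):
    """Fraction of positions where two equal-length sequences agree."""
    if len(generated) != len(native):
        raise ValueError("this sketch only compares equal-length sequences")
    matches = sum(1 for g, n in zip(generated, native) if g == n)
    return matches / len(native)


def perplexity(target_probs):
    """exp of the mean negative log-probability assigned to the true residues."""
    nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(nll)


identity = sequence_identity("MKSSK", "MKTSR")  # 3 of 5 positions match
uniform_ppl = perplexity([1.0 / 20] * 8)        # uniform guess over 20 residues
print(f"identity={identity:.2f} perplexity={uniform_ppl:.2f}")
```

Under this definition, a uniform guess over the 20 standard amino acids gives a perplexity of 20, so lower values mean the model assigns more probability to the true residues than a uniform guess would.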


In [8]:
ele_all, seq_all, iden_list, ppl_list = evaluation_helper.Gen_evaluation(model_gen, test_set,
                                                                         heterogeous = True, 
                                                                         USE_CUDA = False)
print('The perplexity is %.4f.'%(float(torch.mean(torch.Tensor(ppl_list)))))
print('The average sequence identity is %.4f.'%(np.mean(iden_list)))
print()
print('Examples of the generated sequences:')
for s in seq_all[:10]:
    print(s)

The perplexity is 15.1727.
The average sequence identity is 0.0244.

Examples of the generated sequences:
FSSSKMEHHEFMFFYHT
AKAAKMMTTVMMVMVVMTTMTMMVTTVVTMTMMTTMMVMTMMYKHTTKTTKTTTHTHHHHHTKHKHHHKTTTTHHKHTFWWQQSAAAHRMGMMMSQQNNEIS
SSSPPKKPTWWHMIHIIHYIIIIIIYYYHYHHHIHHHYIHIIHHIIHYK
TKEVVEESLTTTKSTTSTTSKTSSSTTSKSTTSSKKKTSKTKSTDTTPLLTKKA
SKSKSSETNHTTLLESI
LMYYFYFYYWFWFFYWYWYWFFFFFWWWYWFWYYFYYNLLMMMT
AQQVIHWMWWMWGMWWMGMWMMWMWGWWWMMMMWMWWGMGMRNTRTNRTRTRTTRTNNNRNTRNTNNTRNRTRTTN
EETTMFTTMMMFFTMTMFTTFMMMFFMMMTMMMMFMFMFADDDAADWVEEEVEEEEEHVEVVHVVEHEEEEHEVVEEHVHVVPPPGGGIIITTTLETVNTTTTTTNLTLTLLLTTTNNNTLLTNTNNVQT
LLYLWDDMDRDRDMMDMMRMDRMDMRMMDDMMMDDRRDMDASSSLEELSSLLSLSEESLLLESSESLELESSSESLWA
QWLMYSSTHTTDMMDHHEHWRRRRREVEVEVV
