Author: Wolfgang Black <br>
date_modified: 2022-07-24
    
# Notebook purpose:
    
This notebook is meant to be run in Google Colab. It builds an ensemble ProtENN model made up of n ProtCNN models. The notebook loads the training and dev datasets from the raw data_dir for model training and development, loads the saved models from the model_dir, and saves the classification results to a results_dir.

This is the debugging notebook for inference.py and the inference step of main.py.
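
To make the ensemble idea concrete, here is a minimal sketch of one common way to combine the n member models: a majority vote over their predicted class ids. This is an illustration only. The function name `majority_vote` and the list-of-lists input format are assumptions, not code from inference.py, and the actual ensemble may combine predictions differently (for example by averaging class probabilities).

```python
from collections import Counter


def majority_vote(per_model_labels):
    """Combine class-id predictions from several models by majority vote.

    per_model_labels: one list of predicted class ids per model,
    all of the same length (hypothetical input format).
    """
    if not per_model_labels:
        raise ValueError("need predictions from at least one model")
    n_samples = len(per_model_labels[0])
    if any(len(preds) != n_samples for preds in per_model_labels):
        raise ValueError("every model must predict the same number of samples")

    combined = []
    # zip(*...) groups the votes that all models cast for the same sample.
    for votes in zip(*per_model_labels):
        # On a tie, the label seen first (from the earliest model) wins.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Three toy models voting on three sequences.
toy_predictions = [[1, 2, 3], [1, 5, 3], [4, 5, 6]]
ensemble_labels = majority_vote(toy_predictions)
```

With real models, each inner list would hold the argmax of one model's output for every sequence in the evaluation set.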

In [None]:
import os
from collections import Counter

import pandas as pd
import numpy as np

from google.colab import drive
drive.mount('/content/drive/')

!cp './drive/MyDrive/Colab Notebooks/ProtCNN/utils/datautils.py' ./
from datautils import *
!cp './drive/MyDrive/Colab Notebooks/ProtCNN/utils/modelutils.py' ./
from modelutils import *

import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf


Mounted at /content/drive/


In [None]:
data_dir = './drive/MyDrive/PFAM_database/data/random_split/'
model_dir = './drive/MyDrive/Colab Notebooks/ProtCNN/models/ensemble_model/'


In [None]:
train_data, train_targets = reader('train',data_dir)

In [None]:
fam2label = build_labels(train_targets)

There are 17930 labels.
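
For readers without datautils.py at hand, the label-building step can be pictured roughly as follows. This is a hypothetical stand-in, not the real `build_labels`: the name `build_labels_sketch`, the sorted ordering, and the zero-based ids are all assumptions made for illustration.

```python
def build_labels_sketch(targets):
    """Map each distinct family name to an integer class id.

    Hypothetical stand-in for build_labels in datautils.py.
    """
    # Sort the unique family names so the mapping is reproducible between runs.
    families = sorted(set(targets))
    fam2label = {fam: idx for idx, fam in enumerate(families)}
    print(f"There are {len(fam2label)} labels.")
    return fam2label


# Toy targets with one repeated family.
toy_fam2label = build_labels_sketch(["PF00002", "PF00001", "PF00002"])
```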


In [None]:
word2id = build_vocab(train_data)
vocab_len = len(word2id)

AA dictionary formed. the length of dictionary is: 22.


In [None]:
max_len = 120

In [None]:
train = SequenceData(word2id, fam2label, max_len, data_dir,"train")
train_dict = train.get_data_dictionaries()
dev = SequenceData(word2id, fam2label, max_len, data_dir,"dev")
dev_dict = dev.get_data_dictionaries()
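
The model input shape printed further down, (None, 120, 22), suggests that each sequence ends up as a max_len by vocab_len one-hot matrix. The sketch below shows one way such an encoding could be produced. It is not the SequenceData implementation: the function name `encode_sequence` and the `<pad>` and `<unk>` entries (and their ids) are assumptions for illustration.

```python
def encode_sequence(seq, word2id, max_len):
    """Integer-encode a sequence, pad or truncate it to max_len, then one-hot encode it.

    Assumes word2id contains '<pad>' and '<unk>' entries (hypothetical names).
    """
    pad_id = word2id["<pad>"]
    unk_id = word2id["<unk>"]

    # Truncate long sequences and map unknown residues to the <unk> id.
    ids = [word2id.get(aa, unk_id) for aa in seq[:max_len]]
    # Right-pad short sequences up to max_len.
    ids += [pad_id] * (max_len - len(ids))

    vocab_size = len(word2id)
    # One row per position, with a single 1 in the column of that position's id.
    return [[1 if col == idx else 0 for col in range(vocab_size)] for idx in ids]


# Toy vocabulary with two amino acids.
toy_word2id = {"<pad>": 0, "<unk>": 1, "A": 2, "C": 3}
toy_encoded = encode_sequence("ACX", toy_word2id, max_len=5)
```

In the toy call, 'X' is not in the vocabulary and maps to `<unk>`, and the last two positions are padding.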

In [None]:
num_classes = len(fam2label)

In [None]:
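# shuffle() takes a buffer size, not a boolean flag; use a buffer that covers the whole training set.
train_dataset = tf.data.Dataset.from_tensor_slices((train_dict['sequence'], train_dict['target'])).shuffle(len(train_dict['sequence'])).batch(256)
# Validation data does not need shuffling.
validation_dataset = tf.data.Dataset.from_tensor_slices((dev_dict['sequence'], dev_dict['target'])).batch(256)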
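
A note on shuffling: tf.data's `shuffle` draws elements from a fixed-size buffer, so its argument controls how well mixed the output is. The pure-Python simulation below mimics that buffer behaviour to show why a buffer of size 1 leaves the order unchanged. It is a simplified illustration, not TensorFlow's implementation, and `buffered_shuffle` is a hypothetical helper.

```python
import random
from itertools import islice


def buffered_shuffle(items, buffer_size, seed=0):
    """Simulate buffer-based shuffling: emit a random buffer element, then refill the slot."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    stream = iter(items)

    # Fill the buffer with the first buffer_size elements.
    buffer = list(islice(stream, buffer_size))
    output = []
    for item in stream:
        # Emit a random element from the buffer and replace it with the new one.
        idx = rng.randrange(len(buffer))
        output.append(buffer[idx])
        buffer[idx] = item

    # Drain whatever is left in the buffer in random order.
    rng.shuffle(buffer)
    output.extend(buffer)
    return output


unshuffled = buffered_shuffle(range(10), buffer_size=1)
fully_shuffled = buffered_shuffle(range(10), buffer_size=10)
```

With `buffer_size=1` the only candidate at each step is the oldest element, so `unshuffled` comes back in the original order. A buffer at least as large as the dataset gives a uniform shuffle.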


In [None]:
epochs = 5
num_models = 3

In [None]:
for i in range(num_models):
  # Build and compile a fresh ProtCNN for each ensemble member.
  model = get_protCNN_model(max_len, vocab_len, num_classes)
  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
  model.summary()

  # The datasets are already batched, so batch_size is not passed to fit().
  history = model.fit(train_dataset,
                      epochs=epochs,
                      validation_data=validation_dataset)

  # Save each ensemble member under its own index.
  model.save(os.path.join(model_dir, 'ProtENN_model_' + str(i) + '_two_resblock_5_epoch_model.h5'))

Model: "model_2"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_3 (InputLayer)           [(None, 120, 22)]    0           []                               
                                                                                                  
 conv1d_10 (Conv1D)             (None, 120, 128)     2944        ['input_3[0][0]']                
                                                                                                  
 batch_normalization_8 (BatchNo  (None, 120, 128)    512         ['conv1d_10[0][0]']              
 rmalization)                                                                                     
                                                                                                  
 activation_8 (Activation)      (None, 120, 128)     0           ['batch_normalization_8[0][

In [None]:
print('model has finished training and is saved')

model has finished training and is saved
