<a href="https://colab.research.google.com/github/veren4/Kinase_residue_classification/blob/master/SMILES_featurization_try_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is based on the official AllenNLP tutorial (https://allennlp.org/tutorials) \
and on this [second, more detailed tutorial](https://mlexplained.com/2019/01/30/an-in-depth-tutorial-to-allennlp-from-basics-to-elmo-and-bert/).

In [None]:
!pip install allennlp==0.9
!pip install SmilesPE

In [2]:
#import allennlp
#print(allennlp.__version__)

0.9.0


## Imports

In [3]:
from typing import Iterator, List, Dict         # type annotations

import torch                                    # AllenNLP is built on PyTorch
import torch.optim as optim
import numpy as np

from allennlp.data import Instance              # training sample = Instance containing fields

In [4]:
from allennlp.data.fields import TextField, SequenceLabelField      # possible fields: http://docs.allennlp.org/v0.9.0/api/allennlp.data.fields.html

In [5]:
from allennlp.data.dataset_readers import DatasetReader     # reads a file and produces a stream of Instances

In [6]:
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer     # TokenIndexer: rule for how to turn a token into indices
from allennlp.data.tokenizers import Token

# The Tokenizer I used before:
from SmilesPE.pretokenizer import atomwise_tokenizer

In [7]:
from allennlp.data.vocabulary import Vocabulary     # mapping from strings -> integers

from allennlp.models import Model                   # a PyTorch Module
                                                    # input: tensors
                                                    # output: dict of tensor output (including the training loss)

from allennlp.modules.text_field_embedders import TextFieldEmbedder, BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding
from allennlp.modules.seq2seq_encoders import Seq2SeqEncoder, PytorchSeq2SeqWrapper
from allennlp.nn.util import get_text_field_mask, sequence_cross_entropy_with_logits

from allennlp.training.metrics import CategoricalAccuracy       # for tracking accuracy on the training and validation datasets
                                                                # accuracy = (TP+TN)/(TP+FP+TN+FN)

from allennlp.data.iterators import BasicIterator, BucketIterator   # BucketIterator is the one used for training below

from allennlp.training.trainer import Trainer

from allennlp.predictors import SentenceTaggerPredictor # make predictions on new input

Iterator: Batches the data \
Trainer: Handles training and metric recording \
(Predictor: Generates predictions from raw strings)

Things to consider for the Iterator:
* Sequences of different lengths need to be padded
* To minimize padding, sequences of similar lengths can be put in the same batch
* Tensors need to be sent to the GPU if using the GPU
* Data needs to be shuffled at the end of each epoch during training, but we don't want to shuffle in the midst of an epoch in order to cover all examples evenly


The BucketIterator batches sequences of similar lengths together to minimize padding. Open question: whether to also cap the sequence length globally (see the commented-out `max_seq_len` in the DatasetReader).
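The bucketing idea can be sketched in plain Python. This is only an illustration of the principle, not AllenNLP's implementation; `bucket_batches` and the `PAD` symbol are hypothetical names chosen for this sketch. Each batch is padded only up to its own longest sequence, and whole batches are shuffled rather than their contents.

```python
import random
from typing import List

PAD = "@@PADDING@@"  # hypothetical padding symbol for this sketch


def bucket_batches(sequences: List[List[str]],
                   batch_size: int,
                   seed: int = 0) -> List[List[List[str]]]:
    """Sort by length, cut into batches, pad each batch, then shuffle the batch order."""
    ordered = sorted(sequences, key=len)
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    padded = []
    for batch in batches:
        width = max(len(seq) for seq in batch)   # pad only to the longest sequence in this batch
        padded.append([seq + [PAD] * (width - len(seq)) for seq in batch])
    random.Random(seed).shuffle(padded)          # shuffle whole batches, not their contents
    return padded


toy = [list("CCO"), list("C"), list("CCCCN"), list("CC")]
batches = bucket_batches(toy, batch_size=2)
for batch in batches:
    print(batch)
```

With the four toy sequences, grouping by length needs three padding symbols in total, whereas padding everything to the global maximum of five would need nine.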

## Setup

Note: I need to strip the line numbering from the PubChem sample file!
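A possible cleanup for the line numbering, as a minimal sketch. The exact layout of the PubChem sample file is not shown in this notebook, so the sketch assumes each line looks like `<line number><whitespace><SMILES>`; `strip_line_numbers` is a hypothetical helper name.

```python
import re
from typing import List

# Assumption: each line looks like "<line number><whitespace><SMILES>".
_LEADING_NUMBER = re.compile(r"^\s*\d+\s+")


def strip_line_numbers(lines: List[str]) -> List[str]:
    """Drop a leading integer column and skip empty lines."""
    cleaned = []
    for line in lines:
        smiles = _LEADING_NUMBER.sub("", line).strip()
        if smiles:
            cleaned.append(smiles)
    return cleaned


sample = ["1\tCCO", "2  c1ccccc1", "", "3 CC(=O)O"]
cleaned = strip_line_numbers(sample)
print(cleaned)
```

Because the pattern requires whitespace after the digits, a line that carries no number column is left unchanged.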

In [8]:
torch.manual_seed(1)   # Set random seed manually to replicate results

<torch._C.Generator at 0x7f26a700ed38>

DatasetReader: Extracts necessary information from data into a list of Instance objects \
1. Reading the data from disk
2. Extracting relevant information from the data
3. Converting the data into a list of Instances
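The three steps can be sketched without AllenNLP, assuming the `token###tag` line format that the reader's `_read` method parses. Plain dictionaries stand in for `Instance` objects here; `read_tagged_lines` and the tag values `0`/`1` are made up for illustration.

```python
import io
from typing import Dict, Iterator, List, TextIO


def read_tagged_lines(handle: TextIO) -> Iterator[Dict[str, List[str]]]:
    """Yield one record per non-empty line of 'token###tag' pairs."""
    for line in handle:                                              # 1. read the data
        pairs = line.strip().split()
        if not pairs:
            continue
        tokens, tags = zip(*(pair.split("###") for pair in pairs))  # 2. extract tokens and tags
        yield {"tokens": list(tokens), "tags": list(tags)}          # 3. pack one example


fake_file = io.StringIO("C###0 N###1 O###0\nC###0 C###0\n")
records = list(read_tagged_lines(fake_file))
print(records)
```

One caveat with this format for SMILES: the triple-bond token `#` collides with the `###` separator (`"####0".split("###")` gives `['', '#0']`), so a different separator may be the safer choice.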

In [None]:
from overrides import overrides     # decorator used below; installed as an AllenNLP dependency

class SMILES_tokens_DatasetReader(DatasetReader):

    def __init__(self, token_indexers: Dict[str, TokenIndexer] = None):
        super().__init__(lazy=False)
        self.tokenizer = atomwise_tokenizer     # a plain function: store it uncalled, use it as self.tokenizer(smiles)
        self.token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}      # unique ID for each distinct token
#       self.max_seq_len = max_seq_len          # Add max value!

    @overrides
    def text_to_instance(self,                                    # goal:  take the data for a single example and pack it into an Instance object
                         tokens: List[Token],
                         tags: List[str] = None) -> Instance:
        # I want to take 1 line (without maybe EOL characters!)

        # I could write my own Field class..
        SMILES_field = TextField(tokens, self.token_indexers)     # Before constructing this object,
                                                                  # I need to tokenize the raw strings using a Tokenizer
        fields = {"sentence": SMILES_field}

        if tags:
            label_field = SequenceLabelField(labels=tags, sequence_field=SMILES_field)
            fields["labels"] = label_field

        return Instance(fields)

    @overrides
    def _read(self, file_path: str) -> Iterator[Instance]:
        with open(file_path) as f:        # read from the path passed to reader.read(file_path)
            for line in f:
                pairs = line.strip().split()
                sentence, tags = zip(*(pair.split("###") for pair in pairs))
                yield self.text_to_instance([Token(word) for word in sentence], tags)

AllenNLP models are generally composed from the following components:
* A token embedder
* An encoder
* (For seq-to-seq models) A decoder

In [9]:
class LstmTagger(Model):                # subclass of the torch.nn.Module

    def __init__(self,
                 word_embeddings: TextFieldEmbedder,
                 encoder: Seq2SeqEncoder,
                 vocab: Vocabulary) -> None:

        super().__init__(vocab)
        self.word_embeddings = word_embeddings
        self.encoder = encoder

        self.hidden2tag = torch.nn.Linear(in_features=encoder.get_output_dim(),
                                          out_features=vocab.get_vocab_size('labels'))

        self.accuracy = CategoricalAccuracy()           # needed by forward() and get_metrics()


    def forward(self,
                sentence: Dict[str, torch.Tensor],
                labels: torch.Tensor = None) -> Dict[str, torch.Tensor]:
        mask = get_text_field_mask(sentence)            # 1 for real tokens, 0 for padding

        embeddings = self.word_embeddings(sentence)

        encoder_out = self.encoder(embeddings, mask)    # one hidden state per token

        tag_logits = self.hidden2tag(encoder_out)       # project each hidden state onto the label space

        output = {"tag_logits": tag_logits}

        # The loss must be computed within the forward method during training.

        if labels is not None:
            self.accuracy(tag_logits, labels, mask)
            output["loss"] = sequence_cross_entropy_with_logits(tag_logits, labels, mask)

        return output

    #def backward(self)

    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        return {"accuracy": self.accuracy.get_metric(reset)}



**Embedder**\
sequence of token IDs -> sequence of tensors\
Two classes handle embeddings: the Embedding class and the BasicTextFieldEmbedder class\
\
**Encoder**\
Seq2VecEncoder: sequence of embeddings -> 1 vector (for classifying a whole sequence; several variations are available)\
Seq2SeqEncoder: sequence of embeddings -> sequence of vectors, one per token (this is what the LstmTagger above uses)
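To make the shapes concrete, here is a toy sketch in plain Python. It does not use AllenNLP or PyTorch: the embedding table holds fixed numbers instead of learned weights, and mean pooling stands in for a real encoder. The two special entries mirror AllenNLP's default padding and unknown tokens; everything else is illustrative.

```python
from typing import Dict, List

tokens = ["C", "C", "(", "=", "O", ")", "O"]

# Vocabulary: string -> integer ID (0 reserved for padding, 1 for unknown tokens).
vocab: Dict[str, int] = {"@@PADDING@@": 0, "@@UNKNOWN@@": 1}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))

token_ids = [vocab[tok] for tok in tokens]

# Embedding: one row of EMBEDDING_DIM numbers per ID (toy values, not learned weights).
EMBEDDING_DIM = 3
table: List[List[float]] = [[float(i + j) for j in range(EMBEDDING_DIM)]
                            for i in range(len(vocab))]
embedded = [table[i] for i in token_ids]        # shape: (num_tokens, EMBEDDING_DIM)

# Seq2Seq-style output keeps one vector per token; Seq2Vec-style output pools them into one.
per_token = embedded
pooled = [sum(col) / len(embedded) for col in zip(*embedded)]

print(token_ids)
print(len(per_token), len(per_token[0]), len(pooled))
```

A tagger needs the per-token output, because every token gets its own label; the pooled vector would only support one label for the whole SMILES string.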

## Load the data

In [None]:
# load data as a cached_path??

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

file_id = '1lX3mV3DBYiMxp4xICvUt4fxRyeNSpwN7'
downloaded = drive.CreateFile({'id': file_id})

In [None]:
#print(downloaded.GetContentString())

## Training

Split the dataset into training and validation sets.

In [None]:
my_data = []

for line in downloaded.GetContentString().splitlines():
  my_data.append(line)

my_data = list(filter(None, my_data))               # 1 list item = 1 SMILES

In [None]:
from sklearn.model_selection import train_test_split
train_dataset, validation_dataset = train_test_split(my_data)     # name matches the variable used in the training cell

Tokenize the input
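For intuition, atom-wise tokenization can be approximated with a single regular expression. The pattern below is an assumption written in the style commonly used for SMILES splitting; SmilesPE's own `atomwise_tokenizer` may differ in detail, and `simple_atomwise_tokenize` is a hypothetical name.

```python
import re
from typing import List

# Assumption: a regex in the style commonly used for atom-wise SMILES splitting;
# SmilesPE's own pattern may differ in detail.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def simple_atomwise_tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into atoms, bonds, brackets and ring-closure digits."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Guard against characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize: {smiles!r}")
    return tokens


print(simple_atomwise_tokenize("CC(=O)Oc1ccccc1C(=O)O"))
print(simple_atomwise_tokenize("C[NH3+].[Cl-]"))
```

Bracketed atoms such as `[NH3+]` and two-letter elements such as `Cl` and `Br` stay intact as single tokens, which a character-level split would break apart.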

In [None]:
#toks = atomwise_tokenizer(smi)

In [None]:
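reader = SMILES_tokens_DatasetReader()                     # create an instance of the DatasetReader

# Vocabulary.from_instances and the Trainer expect Instance objects, not raw SMILES strings,
# so tokenize each string atom-wise and wrap it with the reader first.
# Note: these instances carry no tags yet; the tagger can only compute a loss once labels are supplied.
def to_instances(smiles_list):
    return [reader.text_to_instance([Token(t) for t in atomwise_tokenizer(smi)]) for smi in smiles_list]

train_dataset = to_instances(train_dataset)
validation_dataset = to_instances(validation_dataset)

vocab = Vocabulary.from_instances(train_dataset + validation_dataset)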

EMBEDDING_DIM = 6
HIDDEN_DIM = 6

# The embedder maps a sequence of token ids (or character ids) into a sequence of tensors. Here: simple
# embedding matrix:
token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)            # character encoding
                            
word_embeddings = BasicTextFieldEmbedder({"tokens": token_embedding})   # word encoding

# To tag each token, the encoder must return one vector per token rather than
# a single vector for the whole sequence. In AllenNLP this is a Seq2SeqEncoder;
# PytorchSeq2SeqWrapper turns a plain PyTorch LSTM into one.
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))

model = LstmTagger(word_embeddings, lstm, vocab)

if torch.cuda.is_available():
    cuda_device = 0

    model = model.cuda(cuda_device)
else:

    cuda_device = -1

optimizer = optim.SGD(model.parameters(), lr=0.1)

iterator = BucketIterator(batch_size=2, sorting_keys=[("sentence", "num_tokens")])

iterator.index_with(vocab)

trainer = Trainer(model=model,              # instantiate the trainer
                  optimizer=optimizer,
                  iterator=iterator,
                  train_dataset=train_dataset,
                  validation_dataset=validation_dataset,
                  patience=10,
                  num_epochs=1000,
                  cuda_device=cuda_device)

trainer.train()                             # run the trainer

predictor = SentenceTaggerPredictor(model, dataset_reader=reader)

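# Predict on a SMILES string (aspirin) instead of the tutorial's English sentence.
# predict_instance skips the predictor's word tokenizer, so the atom-wise tokens are kept.
sample = reader.text_to_instance([Token(t) for t in atomwise_tokenizer("CC(=O)Oc1ccccc1C(=O)O")])
tag_logits = predictor.predict_instance(sample)['tag_logits']

tag_ids = np.argmax(tag_logits, axis=-1)

print([model.vocab.get_token_from_index(i, 'labels') for i in tag_ids])

In [None]:
# model2 is never defined in this notebook; in the AllenNLP tutorial it is the model
# reloaded from saved weights, so it has to be created before this cell can run.
predictor2 = SentenceTaggerPredictor(model2, dataset_reader=reader)
tag_logits2 = predictor2.predict_instance(sample)['tag_logits']
np.testing.assert_array_almost_equal(tag_logits2, tag_logits)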