<a href="https://colab.research.google.com/github/veren4/SMILES_featurization/blob/master/My_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

###Imports

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
from torch.autograd import Variable
import os
import pandas as pd
#import torchvision    # data loaders for common datasets such as Imagenet, FashionMNIST, MNIST, etc. and data transformers for images

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

In [None]:
from google.colab import drive
drive.mount('/content/drive')
import pickle

Mounted at /content/drive


In [None]:
import platform
print('Using python: ', platform.python_version())
print('Using torch version: ', torch.__version__)
print('Using device: ', device)
# Machine: 2015 13" Macbook Pro, i5 dual core

#import torch.nn as nn
from torch import nn
import torch.nn.functional as F
import timeit

Using python:  3.6.9
Using torch version:  1.6.0+cu101
Using device:  cuda:0


In [None]:
import tensorboard
#print(f"Tensorboard version: {tensorboard.__version__}")

# Load the TensorBoard notebook extension
%load_ext tensorboard

# imports
%autoreload 2

In [None]:
from torch.utils.tensorboard import SummaryWriter

In [None]:
import torchvision.transforms as transforms

###Set up TensorBoard

[How to TensorBoard](https://pytorch.org/docs/stable/tensorboard.html)\
[PyTorch Tensorboard tutorial](https://pytorch.org/tutorials/intermediate/tensorboard_tutorial.html)\
([Medium article](https://medium.com/@iamsdt/using-tensorboard-in-google-colab-with-pytorch-458f9bb95212) on TensorBoard in Colab with PyTorch) <- no suitable version available\
[The best tutorial](https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/tensorboard_with_pytorch.ipynb#scrollTo=IRSe6eHcFPyT): in a colab notebook

[How to run Tensorboard in Colab](https://stackoverflow.com/questions/47818822/can-i-use-tensorboard-with-google-colab)

In [None]:
#%tensorboard --logdir logs

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


###Seed

In [None]:
torch.manual_seed(1)

<torch._C.Generator at 0x7fd7e8c34570>

https://discuss.pytorch.org/t/reproducibility-with-all-the-bells-and-whistles/81097
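Reproducibility usually needs more than `torch.manual_seed`. The following is a minimal sketch that also seeds Python's and NumPy's generators and switches cuDNN to deterministic kernels. The helper name `seed_everything` is an assumption made for illustration; it is not part of this notebook or of PyTorch.

```python
import random

import numpy as np
import torch


def seed_everything(seed: int = 1) -> None:
    """Seed the Python, NumPy and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Silently ignored when CUDA is not available.
    torch.cuda.manual_seed_all(seed)
    # Trade some speed for reproducible cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything(1)
first = torch.rand(3)
seed_everything(1)
second = torch.rand(3)
print(torch.equal(first, second))
```

Re-seeding before each experiment should make weight initialisation and shuffling repeatable on the same machine and library versions.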

###Description & documentation

Later: automated batching\
Visualization

[FloydHub](https://blog.floydhub.com/long-short-term-memory-from-zero-to-hero-with-pytorch/)\
[GitHub Minimal example](https://github.com/chrisvdweth/ml-toolkit/blob/master/pytorch/notebooks/minimal-example-lstm-input.ipynb)\
[PyTorch LSTM Beginner guide](https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html)\
[PyTorch LSTM Guide](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html)\
[Medium article](https://medium.com/@sunitachoudhary103/generating-molecules-using-a-char-rnn-in-pytorch-16885fd9394b)

[deeplearningwizard](https://www.deeplearningwizard.com/deep_learning/practical_pytorch/pytorch_lstm_neuralnetwork/): Steps

Step 1: Load Dataset\
Step 2: Make Dataset Iterable\
Step 3: Create Model Class\
Step 4: Instantiate Model Class\
Step 5: Instantiate Loss Class\
Step 6: Instantiate Optimizer Class\
Step 7: Train Model


##Data preparation (int_tokens)

In [None]:
# The context manager closes the file even if unpickling fails
with open('/content/drive/My Drive/Rostlab internship/7_PyTorch/Tokenized_Dataset', 'rb') as infile1:
  tokenized_dataset = pickle.load(infile1)

####Vocabulary (all occurring SMILES tokens)

In [None]:
alphabet = set()

for i in tokenized_dataset:
  alphabet.update(i)

In [None]:
#alphabet    # length: 12

####Dictionary (Token alphabet; vocabulary + UNK, SOS, EOS)

In [None]:
dict_token_alphabet = {}
dict_token_alphabet.update({'UNK': 0, 'SOS': 1, 'EOS':2})

index = 3
for i in alphabet:
  dict_token_alphabet.update({i: index})
  index = index+1

dict_token_alphabet

{'(': 3,
 ')': 4,
 '1': 11,
 '2': 9,
 '3': 5,
 '4': 7,
 '=': 6,
 'C': 13,
 'EOS': 2,
 'N': 12,
 'O': 10,
 'P': 14,
 'S': 8,
 'SOS': 1,
 'UNK': 0}
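Because `alphabet` is a set of strings, its iteration order can change between Python sessions, so the indices above are not guaranteed to stay the same across runs. A minimal sketch of a deterministic alternative, assuming a hypothetical helper `build_token_dict` and toy data in place of `tokenized_dataset`:

```python
def build_token_dict(sequences, specials=("UNK", "SOS", "EOS")):
    """Map special tokens and every observed token to a stable integer index."""
    token_to_index = {token: index for index, token in enumerate(specials)}
    # Sorting makes the assignment independent of set iteration order.
    observed = sorted({token for sequence in sequences for token in sequence})
    for token in observed:
        token_to_index[token] = len(token_to_index)
    return token_to_index


# Hypothetical toy data standing in for tokenized_dataset.
toy_dataset = [["C", "C", "(", "=", "O", ")", "O"], ["C", "1", "=", "C", "C", "1"]]
toy_dict = build_token_dict(toy_dataset)
print(toy_dict)
```

With a stable mapping, a saved embedding layer can be reloaded later without the token indices shifting underneath it.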

####Vectorize SMILES

In [None]:
# Map every token to its integer index; tokens missing from the dictionary fall back to 'UNK'
unk_index = dict_token_alphabet['UNK']
int_tokens = [[dict_token_alphabet.get(token, unk_index) for token in smiles_tokens]
              for smiles_tokens in tokenized_dataset]

int_tokens is a list of lists

In [None]:
a = torch.ShortTensor(int_tokens[0])   # or torch.cuda.ShortTensor for a GPU-tensor
len(int_tokens[0])

33

In [None]:
b = torch.ShortTensor(int_tokens[1])
len(int_tokens[1])

17

In [None]:
c = torch.cat(tensors=(a, b), dim=0)   # a and b are 1-D, so dim=0 is the only valid dimension

In [None]:
len(c)

50

In [None]:
batch = int_tokens[0]

# make a numpy array out of it
batch = np.array(batch)

# make a PyTorch tensor
batch = torch.tensor(batch, dtype=torch.long)

print(batch)
print('The shape of batch is:', batch.shape)

tensor([ 7, 11,  7, 11,  7, 11,  7,  4, 13, 11,  9,  4, 10, 11,  4, 10,  4, 10,
         4, 10,  4, 10,  7, 11,  9,  4, 10,  7, 11,  9,  4, 10,  4])
The shape of batch is: torch.Size([33])


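Now the data is an integer tensor of shape (seq_len,), i.e. a single sequence without a batch dimension. To feed it into an LSTM, it still needs a batch dimension and a feature dimension: (batch_size, seq_len, input_size). The embedding layer below provides input_size.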

####Embedding layer

Define layer

In [None]:
vocab_size = len(dict_token_alphabet)  # 15 (12 SMILES tokens + UNK, SOS, EOS)
embed_dim = 10 #<- What does the size of the embeddings mean in my case?

word_embedding_layer = nn.Embedding(vocab_size, embed_dim)

Push batch through layer

In [None]:
batch = word_embedding_layer(batch)

print('The shape of batch is:', batch.shape)   # (33, 10) = (seq_len, embed_dim)
#print()
#print(batch)

The shape of batch is: torch.Size([33, 10])


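Now the tensor has shape (seq_len, input_size). An LSTM needs a 3-D input, so a batch dimension is still missing. Note that `nn.LSTM` expects (seq_len, batch_size, input_size) by default and (batch_size, seq_len, input_size) only with `batch_first=True`.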
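A minimal sketch of how the three dimensions come together, using made-up token indices in place of `int_tokens[0]`. Keeping the integer indices and the embedded floats in separate variables avoids feeding floats back into the embedding layer later on.

```python
import torch
from torch import nn

# Hypothetical token indices standing in for int_tokens[0].
indices = torch.tensor([7, 11, 7, 4, 13], dtype=torch.long)  # (seq_len,)

embedding = nn.Embedding(num_embeddings=15, embedding_dim=10)

batched = indices.unsqueeze(0)   # (batch_size=1, seq_len)
embedded = embedding(batched)    # (batch_size=1, seq_len, embed_dim)

print(batched.shape)
print(embedded.shape)
```

The last dimension of `embedded` is what the LSTM calls `input_size`.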

####Dataloader

PyTorch Dataloader:\
[PyTorch guide to its Dataloader class](https://pytorch.org/docs/stable/data.html)\
[Parameters](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)

\
(Alternative to the PyTorch function:  Function from the [Medium article](https://medium.com/@sunitachoudhary103/generating-molecules-using-a-char-rnn-in-pytorch-16885fd9394b))

In [None]:
batch_size = 1
n_iters = 10
num_epochs = n_iters / (len(tokenized_dataset) / batch_size)
num_epochs = int(num_epochs)
print(num_epochs, 'epochs')

0 epochs


Customize my own dataloader.\
Basically, the variable-length inputs need to be padded to a common length and then combined with torch.stack() into a single tensor, which is then used as the input to the model.\
[torch.cat](https://pytorch.org/docs/stable/generated/torch.cat.html)
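As a sketch of the padding step, `torch.nn.utils.rnn.pad_sequence` pads and stacks in one call. The sequences are toy values, and reusing index 0 as the padding value is an assumption; a dedicated padding entry in the dictionary would be cleaner.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_INDEX = 0  # assumption: index 0 doubles as padding in this sketch

sequences = [
    torch.tensor([13, 13, 3, 13, 4], dtype=torch.long),
    torch.tensor([13, 11, 6], dtype=torch.long),
    torch.tensor([10, 13], dtype=torch.long),
]
# Keep the true lengths; they are needed later to ignore the padding.
lengths = torch.tensor([len(s) for s in sequences])

padded = pad_sequence(sequences, batch_first=True, padding_value=PAD_INDEX)
print(padded.shape)
print(lengths)
```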

In [None]:
#my_stack = nn.utils.rnn.pack_padded_sequence(tokenized_dataset) # ATTENTION: the sequences need to already be padded!

There is a [pack_padded_sequence](https://pytorch.org/docs/master/generated/torch.nn.utils.rnn.pack_padded_sequence.html#torch.nn.utils.rnn.pack_padded_sequence) function that packs a tensor containing padded sequences of variable length.
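A minimal sketch of packing, assuming an already padded and embedded batch of random values. Packing lets the LSTM skip the padded time steps, and `pad_packed_sequence` turns the result back into a regular tensor.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(1)

# Padded, already embedded batch: (batch_size=3, max_seq_len=5, input_size=10).
padded = torch.randn(3, 5, 10)
lengths = torch.tensor([5, 3, 2])  # true lengths, kept on the CPU

lstm = nn.LSTM(input_size=10, hidden_size=32, batch_first=True)

# enforce_sorted=False removes the need to sort the batch by length first.
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(output.shape)
print(out_lengths)
```

Positions beyond each sequence's true length come back as zeros in `output`.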

In [None]:
my_loader = torch.utils.data.DataLoader(dataset=tokenized_dataset,
                                        batch_size=batch_size,
                                        shuffle=True,
                                        sampler=None,       # optional: custom Sampler object
                                        batch_sampler=None, # optional: provide a custom sampler
                                        num_workers=0,      # if positive int => multi-process data loading
                                        collate_fn=None,    # optional: custom collate function
                                        pin_memory=False,   # to speed it up when working on a GPU
                                        drop_last=False,
                                        timeout=0,
                                        worker_init_fn=None)

In [None]:
for i in my_loader:
  print(i)

[('C',), ('C',), ('(',), ('C',), (')',), ('(',), ('C',), ('O',), ('P',), ('(',), ('=',), ('O',), (')',), ('(',), ('O',), (')',), ('O',), ('P',), ('(',), ('=',), ('O',), (')',), ('(',), ('O',), (')',), ('O',), ('C',), ('C',), ('1',), ('C',), ('(',), ('C',), ('(',), ('C',), ('(',), ('O',), ('1',), (')',), ('N',), ('2',), ('C',), ('=',), ('N',), ('C',), ('3',), ('=',), ('C',), ('(',), ('N',), ('=',), ('C',), ('N',), ('=',), ('C',), ('3',), ('2',), (')',), ('N',), (')',), ('O',), (')',), ('O',), ('P',), ('(',), ('=',), ('O',), (')',), ('(',), ('O',), (')',), ('O',), (')',), ('C',), ('(',), ('C',), ('(',), ('=',), ('O',), (')',), ('N',), ('C',), ('C',), ('C',), ('(',), ('=',), ('O',), (')',), ('N',), ('C',), ('C',), ('S',), ('C',), ('(',), ('=',), ('O',), (')',), ('C',), ('C',), ('(',), ('C',), ('C',), ('C',), ('(',), ('=',), ('O',), (')',), ('O',), (')',), ('O',), (')',), ('O',)]
[('C',), ('1',), ('=',), ('C',), ('C',), ('(',), ('=',), ('C',), ('C',), ('(',), ('=',), ('C',), ('1',), (')',)

I have a map-style dataset, at least for now with this testing dataset. However, when I take the huge PubChem dataset, this might change to iterable-style.\
When automatic batching is enabled, collate_fn is called with a list of data samples each time and is expected to collate them into a batch that the data loader iterator yields.
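The default collate function treats each sample as a sequence of fields and groups them position by position, which is why the loader above prints one-element tuples. A minimal sketch of a custom `collate_fn` that pads integer sequences instead; `collate_smiles` and the toy data are hypothetical names and values, not part of this notebook.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical stand-in for int_tokens: a list of variable-length index lists.
toy_int_tokens = [[13, 13, 3, 13, 4], [13, 11, 6], [10, 13], [12]]


def collate_smiles(samples):
    """Pad a list of index lists to equal length and return (batch, lengths)."""
    tensors = [torch.tensor(sample, dtype=torch.long) for sample in samples]
    lengths = torch.tensor([len(t) for t in tensors])
    batch = pad_sequence(tensors, batch_first=True, padding_value=0)
    return batch, lengths


loader = DataLoader(toy_int_tokens, batch_size=2, shuffle=False, collate_fn=collate_smiles)

for batch, lengths in loader:
    print(batch.shape, lengths.tolist())
```

Each batch is only padded to the longest sequence it contains, so batches can have different widths.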

##Torchtext Dataset Class

[How to Torchtext](https://towardsdatascience.com/use-torchtext-to-load-nlp-datasets-part-i-5da6f1c89d84)\
[TorchText Github Documentation](https://github.com/pytorch/text/blob/master/README.rst)

The newline characters need to be removed; otherwise torchtext cannot read the CSV files correctly:\
`df_test.comment_text.str.replace("\n", " ")`

In [None]:
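# Rows with a deviating number of fields are skipped (see the warnings below).
# pandas >= 1.3 uses on_bad_lines='skip' instead of error_bad_lines=False.
df_data = pd.read_csv('/content/drive/My Drive/Rostlab internship/7_PyTorch/tokenized_dataset.csv',
                      error_bad_lines=False)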
#df_data["comment_text"] = \
#    df_data.comment_text.str.replace("\n", " ")
#idx = np.arange(df_data.shape[0])
#df_test.to_csv("cache/dataset_test.csv", index=False)

b'Skipping line 3: expected 33 fields, saw 111\nSkipping line 5: expected 33 fields, saw 45\nSkipping line 7: expected 33 fields, saw 112\nSkipping line 10: expected 33 fields, saw 46\nSkipping line 12: expected 33 fields, saw 105\nSkipping line 13: expected 33 fields, saw 86\n'


In [None]:
import torchtext

[Field class](https://pytorch.org/text/data.html#field) models common text processing datatypes that can be represented by tensors. It holds a **Vocab object** that defines the set of possible values for elements of the field and their corresponding numerical representations. The Field object also holds other parameters relating to how a datatype should be numericalized, such as a **tokenization method** and the kind of Tensor that should be produced.

In [None]:
# input: the dataset       output: tensor
smiles = torchtext.data.Field(sequential=True,
                    use_vocab=True, 
                    init_token='<sos>', # or should this be numerical?
                    eos_token='<eos>',  # same as above
                    fix_length=80,    # TODO check for optimal fixed length!
                    dtype=torch.int64,    # what is the datatype of a BATCH of examples??
                                          # and also again: is this already in the numerical state?
                                          # For now, I just leave it at int64 (default)
                    preprocessing=None,   # Possibly padding here?
                    postprocessing=None,  # or padding here?
                    lower=False, 
                    #tokenize='spacy',         # re.findall(pattern = "'(\S+)'")
                    #tokenizer_language='en', 
                    include_lengths=False, 
                    batch_first=False, 
                    pad_token='<pad>', 
                    unk_token='<unk>', 
                    pad_first=False, 
                    truncate_first=False,
                    stop_words=None,
                    is_target=False)   # Is this a target variable? Kind of yes..

In [None]:
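my_dataset = torchtext.data.Dataset(
    # NOTE: `examples` expects a list of torchtext.data.Example objects, not a file name.
    # A string is iterated character by character, which explains the output below.
    # torchtext.data.TabularDataset(path=..., format='tsv', fields=...) reads from a file instead.
    examples='tokenized_dataset_with_header.tsv',  # tsv, because I only want one column.
    fields=[('smiles', torchtext.data.Field())]
)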

In [None]:
for i in range(100):
  print(my_dataset[i])

t
o
k
e
n
i
z
e
d
_
d
a
t
a
s
e
t
_
w
i
t
h
_
h
e
a
d
e
r
.
t
s
v


IndexError: ignored

##Putting my data into a model

In [None]:
batch.size()

torch.Size([3, 33, 10])

In [None]:
# choose the input parameters
input_size = batch.shape[2]
hidden_dim = 32

# define the model
my_LSTM=nn.LSTM(input_size, hidden_dim)

# initialise the lstm
for parameter in my_LSTM.parameters():
    nn.init.normal_(parameter)
    #print(parameter)

In [None]:
# Model architecture visualization
writer = SummaryWriter()    # default `log_dir` is "runs"

writer.add_graph(model=my_LSTM.cpu(),
                 input_to_model=batch,
                 verbose=True)
writer.close()

input must have 3 dimensions, got 2
Error occurs, No graph saved


RuntimeError: ignored

In [None]:
%tensorboard --writer

ERROR: Failed to launch TensorBoard (exited with 2).
Contents of stderr:
2020-10-21 17:57:25.351254: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
usage: tensorboard [-h] [--helpfull] [--logdir PATH] [--logdir_spec PATH_SPEC]
                   [--host ADDR] [--bind_all] [--port PORT]
                   [--purge_orphaned_data BOOL] [--db URI] [--db_import]
                   [--inspect] [--version_tb] [--tag TAG] [--event_file PATH]
                   [--path_prefix PATH] [--window_title TEXT]
                   [--max_reload_threads COUNT] [--reload_interval SECONDS]
                   [--reload_task TYPE] [--reload_multifile BOOL]
                   [--reload_multifile_inactive_secs SECONDS]
                   [--generic_data TYPE]
                   [--samples_per_plugin SAMPLES_PER_PLUGIN]
                   [--debugger_data_server_grpc_port PORT]
                   [--debugger_port PORT]
                   [--

In [None]:
# SummaryWriter() above writes to its default log_dir "runs", so point TensorBoard there
%tensorboard --logdir runs

####Instantiate Loss Class

In [None]:
#import torch.optim as optim

In [None]:
criterion = nn.CrossEntropyLoss()

####Instantiate Optimizer Class

In [None]:
# minibatch SGD
learning_rate = 0.1
optimizer = torch.optim.SGD(my_LSTM.parameters(), lr=learning_rate)  
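
A minimal sketch of how the loss and the optimizer could fit together in a training loop. It assumes a next-token prediction objective and a linear layer on top of the LSTM, neither of which is defined in this notebook; `TokenLSTM` and the toy sequence are hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(1)

vocab_size, embed_dim, hidden_dim = 15, 10, 32


class TokenLSTM(nn.Module):
    """Hypothetical model: embedding -> LSTM -> linear layer over the vocabulary."""

    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, indices):
        lstm_out, _ = self.lstm(self.embedding(indices))
        return self.head(lstm_out)  # (batch_size, seq_len, vocab_size)


model = TokenLSTM()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Toy sequence with SOS=1 and EOS=2 around arbitrary token indices.
sequence = torch.tensor([[1, 13, 13, 10, 2]], dtype=torch.long)
inputs, targets = sequence[:, :-1], sequence[:, 1:]

losses = []
for step in range(20):
    optimizer.zero_grad()
    logits = model(inputs)
    # CrossEntropyLoss expects (N, vocab_size) logits and (N,) class indices.
    loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])
```

The bare `nn.LSTM` outputs vectors of size `hidden_dim`, so some projection to `vocab_size` is needed before `nn.CrossEntropyLoss` can compare predictions with token indices.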

In [None]:
#for i in range(len(list(my_LSTM.parameters()))):
#    print(list(my_LSTM.parameters())[i].size())

torch.Size([128, 10])
torch.Size([128, 32])
torch.Size([128])
torch.Size([128])
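
The leading 128 in these shapes is 4 × hidden_dim: PyTorch stacks the weights of the input, forget, cell and output gates into one matrix. A short sketch that prints the parameter names next to their shapes, using the same sizes as above:

```python
from torch import nn

lstm = nn.LSTM(input_size=10, hidden_size=32)

shapes = {}
for name, parameter in lstm.named_parameters():
    shapes[name] = tuple(parameter.shape)
    print(name, shapes[name])
```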


##Train

In [None]:
lstm_out, (h_n, c_n) = my_LSTM(batch)   # the second return value is a tuple (hidden state, cell state)

print('The shape of lstm_out is:', lstm_out.shape) # (seq_len, batch_size, hidden_dim)
print('The shape of h_n is:', h_n.shape) # (num_layers*num_directions, batch_size, hidden_dim)

RuntimeError: ignored
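
A minimal sketch of a forward pass that runs, assuming a random 3-D input in place of the embedded batch and `batch_first=True`, which differs from the default layout used by `my_LSTM` above.

```python
import torch
from torch import nn

torch.manual_seed(1)

batch_size, seq_len, input_size, hidden_dim = 3, 33, 10, 32
embedded = torch.randn(batch_size, seq_len, input_size)

# batch_first=True makes the LSTM read (batch_size, seq_len, input_size).
lstm = nn.LSTM(input_size, hidden_dim, batch_first=True)

lstm_out, (h_n, c_n) = lstm(embedded)

print(lstm_out.shape)  # (batch_size, seq_len, hidden_dim)
print(h_n.shape)       # (num_layers * num_directions, batch_size, hidden_dim)
print(c_n.shape)       # same shape as h_n
```

`h_n` and `c_n` keep the layer dimension first even when `batch_first=True`.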

In [None]:
import timeit

In [None]:
runs=10**4

print("Time Pytorch LSTM {} runs: {:.3f}s".format(runs, timeit.timeit("my_LSTM(batch)", 
                                       setup="from __main__ import my_LSTM, batch", 
                                       number=runs))
     )

Time Pytorch LSTM 10000 runs: 18.690s


[Optuna](https://github.com/optuna/optuna): automated Hyperparameter Optimization

##Visualization

Let's write some stuff to TensorBoard and open it to see how things go :)
You can start TensorBoard by running the following command from this exercise folder in a terminal:

```tensorboard --logdir=runs```

Linux users can open a terminal and simply run it.

Windows users with Anaconda can open an Anaconda Prompt and run the command there. Otherwise, use your default setup for running Python code in cmd.

Note that **you may need to change into the root folder of this notebook** before running the command. Then navigate to http://localhost:6006 in a browser. If everything went well, you will be presented with the TensorBoard setup, and after executing the next cell you should see the following images in TensorBoard.


In [None]:
# write to tensorboard
#writer.add_image('four_mnist_images', img_grid)     # I need to write my data here!

NameError: ignored

In [None]:
# Model architecture visualization

writer.add_graph(model=word_embedding_layer.cpu(),
                 input_to_model=batch,
                 verbose=True)
writer.close()

Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.FloatTensor instead (while checking arguments for embedding)
Error occurs, No graph saved


RuntimeError: ignored

In [None]:
# Start TensorBoard within the notebook using magics function
%tensorboard --logdir runs