# SMILES Autoencoder 

Replicating the SMILES autoencoder described in this article:  
https://www.nature.com/articles/s42004-023-00932-3#Sec19  

Supplementary material for above paper:  
https://static-content.springer.com/esm/art%3A10.1038%2Fs42004-023-00932-3/MediaObjects/42004_2023_932_MOESM1_ESM.pdf  

Model architecture and training details from the supplementary material above:   

• Bidirectional GRU with an encoder-decoder architecture and  
– number of encoder and decoder layers: 3  
– hidden dimension for encoder and decoder: 512  
– Dimensionality of the embedding space: 512  
– nonlinearity: tanh  

Training Hyperparameters:  
• Batch size: 256  
• Learning rate: 0.0001  
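
For orientation, the sizes listed above could map onto PyTorch's `nn.Embedding` and `nn.GRU` roughly as shown below. This is a minimal sketch under stated assumptions, not the paper's code. The vocabulary size of 64 is borrowed from the tokenization step later in this notebook, and using 512 for the token embedding is an assumption, since "dimensionality of the embedding space" may instead refer to the latent vector. `nn.GRU` already applies tanh inside its candidate-state update, which is presumably what "nonlinearity: tanh" refers to.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 64    # assumption: number of unique characters found later in this notebook
EMBED_DIM = 512    # assumption: token embedding size (the list may mean the latent size)
HIDDEN_DIM = 512   # hidden dimension for encoder and decoder, from the list above
NUM_LAYERS = 3     # number of encoder and decoder layers, from the list above

embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
encoder = nn.GRU(
    input_size=EMBED_DIM,
    hidden_size=HIDDEN_DIM,
    num_layers=NUM_LAYERS,
    bidirectional=True,
    batch_first=True,
)

# two dummy token sequences of length 77, only to check the shapes
tokens = torch.randint(0, VOCAB_SIZE, (2, 77))
outputs, hidden = encoder(embedding(tokens))

print(outputs.shape)  # (batch, seq_len, 2 * hidden) -> torch.Size([2, 77, 1024])
print(hidden.shape)   # (num_layers * 2, batch, hidden) -> torch.Size([6, 2, 512])
```

The training cell further down uses much smaller sizes (embedding and hidden size of 10) to check that the pipeline runs end to end.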

Training - The autoencoder is trained as a translation model by translating from a randomized SMILES version of a molecule
to its canonical version. The model is trained on 135M molecules from Pubchem and ZINC12 datasets.
The architecture of the translation model and latent dimension of 512 is similar to the one used in Winter et al.
In order to make the learnt representations more meaningful, we also jointly trained a regression model to predict some
molecular properties that can be calculated using the molecular structure. The regression model uses two fully connected layers
with dimensions 512 and 128 and ReLU non-linearity. The properties that are predicted are: logP, molar refractivity, number
of valence electrons, number of hydrogen bond donors and acceptors, Balaban’s J value, topological polar surface area, drug
likeliness (QED) and synthetic accessibility (SA).
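
The joint regression model described above can be sketched as a small feed-forward head on top of the latent vector. This is a hedged illustration only: the class name `PropertyRegressor`, the final linear output layer, the use of mean squared error, and the unweighted sum of the two losses are all assumptions, since the text only gives the two hidden sizes (512 and 128) and the ReLU non-linearity. The nine outputs correspond to the nine listed properties, counting hydrogen bond donors and acceptors separately.

```python
import torch
import torch.nn as nn

LATENT_DIM = 512     # latent dimension stated in the text
NUM_PROPERTIES = 9   # logP, MR, valence electrons, HBD, HBA, Balaban J, TPSA, QED, SA


class PropertyRegressor(nn.Module):  # hypothetical name
    def __init__(self, latent_dim=LATENT_DIM, num_properties=NUM_PROPERTIES):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 128),
            nn.ReLU(),
            nn.Linear(128, num_properties),  # assumption: plain linear output layer
        )

    def forward(self, latent):
        return self.net(latent)


regressor = PropertyRegressor()
latent = torch.randn(4, LATENT_DIM)                # stand-in for encoder latents
property_targets = torch.randn(4, NUM_PROPERTIES)  # placeholder values, not real properties

predicted = regressor(latent)
regression_loss = nn.functional.mse_loss(predicted, property_targets)

# assumption: the translation (cross-entropy) loss and the regression loss are simply added
reconstruction_loss = torch.tensor(0.0)  # stand-in for the decoder's cross-entropy loss
total_loss = reconstruction_loss + regression_loss
print(predicted.shape, total_loss.item())
```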

In [None]:
# IMPORT NECESSARY PACKAGES
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
import numpy as np
import seaborn as sns

In [172]:
# LOAD DATASETS AND PAD SMILES STRINGS TO LEN=77

# load dataset
in_random_smiles_df = pd.read_csv('smiles.csv')     # smiles.csv file with 706863 rows
out_canon_smiles_df = pd.read_csv('sanitized_smiles.csv')       # sanitized_smiles.csv file with 706863 rows
out_canon_smiles_df.rename(columns={'SMILES': 'smiles'}, inplace=True)

# calculate the length of each string in the 'smiles' column
in_random_smiles_df['length'] = in_random_smiles_df['smiles'].apply(len)
out_canon_smiles_df['length'] = out_canon_smiles_df['smiles'].apply(len)

# filter dataframes to only keep input random SMILES with length 74 or less
in_filtered_random_smiles_df = in_random_smiles_df[in_random_smiles_df['length'] <= 74]
out_filtered_canon_smiles_df = out_canon_smiles_df.loc[in_filtered_random_smiles_df.index]

# filter dataframes to only keep output canonical SMILES with length 77 or less
out_filtered_canon_smiles_df = out_filtered_canon_smiles_df[out_filtered_canon_smiles_df['length'] <= 77]
in_filtered_random_smiles_df = in_filtered_random_smiles_df.loc[out_filtered_canon_smiles_df.index]

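# pad the strings in the 'smiles' column to the desired length using " "
max_length = 77

# copy the filtered slices first so that adding columns does not raise pandas' SettingWithCopyWarning
in_filtered_random_smiles_df = in_filtered_random_smiles_df.copy()
out_filtered_canon_smiles_df = out_filtered_canon_smiles_df.copy()

in_filtered_random_smiles_df['smiles_padded'] = in_filtered_random_smiles_df['smiles'].apply(lambda x: x.ljust(max_length, ' '))
out_filtered_canon_smiles_df['smiles_padded'] = out_filtered_canon_smiles_df['smiles'].apply(lambda x: x.ljust(max_length, ' '))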

# Find the maximum length
max_length_random = in_filtered_random_smiles_df['length'].max()
max_length_canon = out_filtered_canon_smiles_df['length'].max()

# Print the maximum length
print(f"The maximum length of a string in the input random 'smiles' column is: {max_length_random}")
print(f"The maximum length of a string in the output canon 'smiles' column is: {max_length_canon}")

print(f"Number of SMILES sequences for input random dataframe: {len(in_filtered_random_smiles_df)}")
print(f"Number of SMILES sequences for output canonical dataframe: {len(out_filtered_canon_smiles_df)}")

The maximum length of a string in the input random 'smiles' column is: 74
The maximum length of a string in the output canon 'smiles' column is: 77
Number of SMILES sequences for input random dataframe: 643168
Number of SMILES sequences for output canonical dataframe: 643168
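
The filter-then-pad logic in the cell above can also be written with a single boolean mask, which keeps the two dataframes row-aligned in one step. The sketch below uses three toy SMILES pairs and toy length limits of 8 instead of the real files and the 74/77 limits. It assumes, as the cell above does, that row `i` of both dataframes describes the same molecule.

```python
import pandas as pd

# toy stand-ins for smiles.csv and sanitized_smiles.csv
random_df = pd.DataFrame({"smiles": ["C(C)O", "c1ccccc1C(=O)O", "N"]})
canon_df = pd.DataFrame({"smiles": ["CCO", "O=C(O)c1ccccc1", "N"]})

MAX_IN, MAX_OUT, PAD_TO = 8, 8, 8  # toy limits; the notebook uses 74, 77 and 77

# keep a row only if BOTH its input and its output string are short enough
keep = (random_df["smiles"].str.len() <= MAX_IN) & (canon_df["smiles"].str.len() <= MAX_OUT)

# .copy() makes the filtered frames independent, so new columns can be added safely
random_kept = random_df[keep].copy()
canon_kept = canon_df[keep].copy()

# right-pad with spaces to a fixed length
random_kept["smiles_padded"] = random_kept["smiles"].str.ljust(PAD_TO)
canon_kept["smiles_padded"] = canon_kept["smiles"].str.ljust(PAD_TO)

print(random_kept["smiles_padded"].tolist())
print(canon_kept["smiles_padded"].tolist())
```

Here the second pair is dropped because both of its strings are 14 characters long, and the remaining strings are padded to 8 characters.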


In [173]:
# TOKENIZE SMILES STRINGS (CONVERT STRING OF CHARACTERS TO LIST OF INTEGER TOKEN IDS), CONVERT TO INPUT/OUTPUT TENSORS, SAMPLE 5 SMILES SEQUENCES

# Convert SMILES strings to list of characters tokens
in_filtered_random_smiles_df['smiles_tokenized_lists'] = in_filtered_random_smiles_df['smiles_padded'].apply(lambda x: list(x))
out_filtered_canon_smiles_df['smiles_tokenized_lists'] = out_filtered_canon_smiles_df['smiles_padded'].apply(lambda x: list(x))

# combines each list of SMILES characters into one list
flattened_list_all_random_smiles = [item for sublist in in_filtered_random_smiles_df['smiles_tokenized_lists'] for item in sublist]
flattened_list_all_canon_smiles = [item for sublist in out_filtered_canon_smiles_df['smiles_tokenized_lists'] for item in sublist]

# get unique characters
unique_characters_random = set(flattened_list_all_random_smiles)
unique_characters_canon = set(flattened_list_all_canon_smiles)

# convert back to list
unique_characters_random = list(unique_characters_random)
unique_characters_canon = list(unique_characters_canon)

# make one unique characters list
unique_characters = unique_characters_random + unique_characters_canon
unique_characters = set(unique_characters)
unique_characters = list(unique_characters)
print(f"Number of unique characters: {len(unique_characters)}")

# mapping from characters to integers
char_to_int = {char: i for i, char in enumerate(unique_characters)}

# convert SMILES tokenized lists into integer lists
int_lists_random_smiles = [
    [char_to_int[char] for char in sublist]
    for sublist in in_filtered_random_smiles_df['smiles_tokenized_lists']
]

int_lists_canon_smiles = [
    [char_to_int[char] for char in sublist]
    for sublist in out_filtered_canon_smiles_df['smiles_tokenized_lists']
]

in_filtered_random_smiles_df['smiles_integer_lists'] = int_lists_random_smiles
out_filtered_canon_smiles_df['smiles_integer_lists'] = int_lists_canon_smiles

# convert the tokenized sequences to tensors
int_lists_random_smiles = in_filtered_random_smiles_df['smiles_integer_lists'].tolist()
int_lists_canon_smiles = out_filtered_canon_smiles_df['smiles_integer_lists'].tolist()

in_smiles_tensor = torch.tensor(int_lists_random_smiles, dtype=torch.float32)
out_smiles_tensor = torch.tensor(int_lists_canon_smiles, dtype=torch.float32)

print(f"Shape of input SMILES tensor: {in_smiles_tensor.shape}")
print(f"Shape of output SMILES tensor: {out_smiles_tensor.shape}")

# create random sample of tensors

# five random row indices; the upper bound of torch.randint is exclusive, so pass the number of rows
random_indices = torch.randint(0, len(in_smiles_tensor), (5,))

# take the random sample of five rows from each tensor
sample_in_smiles_tensor = in_smiles_tensor[random_indices]
sample_out_smiles_tensor = out_smiles_tensor[random_indices]

# Print shape of sampled rows
print(f"Shape of sampled input SMILES tensor: {sample_in_smiles_tensor.shape}")
print(f"Shape of sampled output SMILES tensor: {sample_out_smiles_tensor.shape}")

Number of unique characters: 64
Shape of input SMILES tensor: torch.Size([643168, 77])
Shape of output SMILES tensor: torch.Size([643168, 77])
Shape of sampled input SMILES tensor: torch.Size([5, 77])
Shape of sampled output SMILES tensor: torch.Size([5, 77])
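
One caveat about the mapping built above: `list(set(...))` has no guaranteed order for strings, because Python randomizes string hashes per process by default. The integer assigned to each character, including the padding character, can therefore change between sessions. A minimal sketch of a reproducible alternative is shown below. The helper names `build_vocab` and `encode` are hypothetical, and sorting the characters is just one simple way to fix the order. The example also shows that character-level tokenization splits two-letter atoms such as `Cl` into two tokens.

```python
PAD = " "  # same padding character as in the notebook


def build_vocab(*smiles_lists):
    """Map every character seen in the given lists (plus PAD) to a stable integer id."""
    chars = set(PAD)
    for smiles_list in smiles_lists:
        for smiles in smiles_list:
            chars.update(smiles)
    # sorting makes the ids independent of set iteration order
    return {ch: i for i, ch in enumerate(sorted(chars))}


def encode(smiles, vocab, max_length):
    """Right-pad a SMILES string and convert it to a list of integer ids."""
    return [vocab[ch] for ch in smiles.ljust(max_length, PAD)]


char_to_int = build_vocab(["CCO", "CCl"], ["OCC"])
int_to_char = {i: ch for ch, i in char_to_int.items()}

ids = encode("CCl", char_to_int, max_length=6)
decoded = "".join(int_to_char[i] for i in ids).rstrip(PAD)

print(char_to_int)  # {' ': 0, 'C': 1, 'O': 2, 'l': 3}
print(ids)          # [1, 1, 3, 0, 0, 0]
print(decoded)      # CCl
```

With a sorted vocabulary the padding id can be looked up as `char_to_int[PAD]` instead of being hard-coded.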


In [174]:
print(f"Sampled input SMILES tensor: {sample_in_smiles_tensor}")

Sampled input SMILES tensor: tensor([[ 1.,  1., 50.,  1., 22.,  0., 50., 36.,  1.,  2., 16.,  2., 23., 23.,
         22., 23., 29., 40.,  2., 23., 22.,  2., 29., 36., 23., 32., 23., 23.,
         22., 13., 36., 23., 23., 23., 32., 13., 36., 23., 16., 23., 46., 23.,
         23., 23., 23., 23., 46., 17., 17., 17., 17., 17., 17., 17., 17., 17.,
         17., 17., 17., 17., 17., 17., 17., 17., 17., 17., 17., 17., 17., 17.,
         17., 17., 17., 17., 17., 17., 17.],
        [50.,  0.,  1., 22.,  1.,  1., 37., 23., 16., 23., 23., 23., 22., 59.,
         20., 36., 23., 23., 16., 36., 23., 16., 23., 23., 23., 22., 50., 36.,
         23., 23., 16., 17., 17., 17., 17., 17., 17., 17., 17., 17., 17., 17.,
         17., 17., 17., 17., 17., 17., 17., 17., 17., 17., 17., 17., 17., 17.,
         17., 17., 17., 17., 17., 17., 17., 17., 17., 17., 17., 17., 17., 17.,
         17., 17., 17., 17., 17., 17., 17.],
        [50.,  0.,  1., 22.,  1., 50., 23., 16., 23., 23., 23., 23., 22., 52.,
          1.

In [175]:
class bidirectional_GRU_AE(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers=3, embedding_dim=10):
        super(bidirectional_GRU_AE, self).__init__()

        # embedding layer to convert integer indices to dense float vectors
        self.embedding = nn.Embedding(input_size, embedding_dim)

        self.encoder = nn.GRU(
            input_size=embedding_dim, 
            hidden_size=hidden_size, 
            num_layers=num_layers, 
            bidirectional=True,
            # with batch_first=True, input and output tensors are (batch_size, seq_len, features)
            batch_first=True
        )

        self.decoder = nn.GRU(
            # the bidirectional encoder concatenates its forward and backward states at every position,
            # so the decoder's input size is hidden_size * 2
            input_size=hidden_size * 2,     
            hidden_size=hidden_size, 
            num_layers=num_layers, 
            bidirectional=True, 
            batch_first = True
        )

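        # fully connected layer that maps each position to one score per vocabulary character
        # NOTE: the trailing Tanh bounds these scores to [-1, 1] before they reach CrossEntropyLoss,
        # which limits how confident the softmax inside the loss can become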
        self.fc = nn.Sequential(
            nn.Linear(hidden_size * 2, input_size),  # Linear layer
            nn.Tanh()  # Tanh activation
        )

    def forward(self, x):
        # encoding
        # embed input integer indices to floating point vectors
        # after embedding, shape of x becomes (batch size, sequence length, embedding dimensions)
        x = self.embedding(x)

        # encoder_output shape is (batch size, sequence length, hidden_size * 2)
        # hidden shape is (num_layers * 2, batch size, hidden_size)
        encoder_output, hidden = self.encoder(x)
        
        # for bidirectional GRU, there are two separate hidden states for each hidden layer
        # forward state goes from start to end of sequence and backward state goes from end to start of sequence
        # we can concatenate forward and backward directions from last hidden layers or pass the hidden layers to the decoder directly
        # we are currently passing hidden layers to decoder directly

        # decoding
        # encoder_output has shape (batch size, sequence length, hidden_size * 2)
        # decoder output has shape (batch size, sequence length, hidden_size * 2)
        decoder_output, _ = self.decoder(encoder_output, hidden)
        print(f"Output shape after decoder: {decoder_output.shape}")

        # pass decoder output through the fully connected layer
        # output has shape of (batch size, sequence length, vocab size) = (5, 77, 64)
        output = self.fc(decoder_output)

        print(f"Output shape after fully connected layer: {output.shape}")
        return output
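
The comments in `forward` mention an alternative to passing `hidden` straight to the decoder: concatenating the forward and backward states of the last layer into one fixed-size vector per molecule. A minimal sketch of that option follows, using the same small sizes as the training cell below. It is an illustration of the tensor reshaping only, not a change to the model above, and the variable names are hypothetical.

```python
import torch
import torch.nn as nn

num_layers, hidden_size = 3, 10
batch_size, seq_len, embed_dim = 5, 77, 10

gru = nn.GRU(embed_dim, hidden_size, num_layers=num_layers,
             bidirectional=True, batch_first=True)

x = torch.randn(batch_size, seq_len, embed_dim)  # stand-in for embedded SMILES tokens
_, hidden = gru(x)                               # (num_layers * 2, batch, hidden_size)

# separate the layer and direction axes: (num_layers, 2, batch, hidden_size)
hidden = hidden.reshape(num_layers, 2, batch_size, hidden_size)

# final states of the last layer, one per direction
last_forward = hidden[-1, 0]    # (batch, hidden_size)
last_backward = hidden[-1, 1]   # (batch, hidden_size)

# one latent vector per sequence: (batch, hidden_size * 2)
latent = torch.cat([last_forward, last_backward], dim=1)
print(latent.shape)  # torch.Size([5, 20])
```

A vector like `latent` is the kind of fixed-size representation that a property-regression head could consume.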

In [178]:
# train bidirectional GRU AE model

# Hyperparameters
seq_len = 77
input_size = 64     # vocabulary size  - number of unique chars in SMILES  
embed_dim = 10     
hidden_dim = 10    
num_layers = 3
batch_size = 5
learning_rate = 0.0001
#dropout = 0.1

# training data
# shape: (batch_size, seq_len) = (5, 77)
input = sample_in_smiles_tensor.long()
# shape: (batch_size, seq_len) = (5, 77)
target = sample_out_smiles_tensor.long()

# instantiate model
model = bidirectional_GRU_AE(input_size, hidden_dim, num_layers, embed_dim)

# optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss(ignore_index=char_to_int[' '])  # ignore " " padding positions (index 17 in this run)

# training loop
losses = []
for epoch in range(10):
    model.train()

    # zeros out the gradients before computing the gradients for the current batch
    optimizer.zero_grad()
    
    # forward pass
    output = model(input)

    # CrossEntropyLoss expects scores of shape (N, num_classes) and targets of shape (N,),
    # so flatten the output from (batch_size, seq_len, vocab_size) to (batch_size * seq_len, vocab_size).
    # Here that is (5 * 77, 64) = (385, 64): 385 token positions, each with one score per
    # vocabulary character. These are unnormalized scores, not probabilities;
    # the loss applies log-softmax to them internally.
    output = output.contiguous().view(-1, input_size)  

    # flatten target to (batch_size * seq_len = 385)
    target = target.contiguous().view(-1) 

    print(f"input shape: {input.shape}")
    print(f"output shape: {output.shape}")
    print(f"target shape: {target.shape}")

    # compute the loss (cross-entropy) for each epoch
    loss = criterion(output, target)
    
    # backpropagation
    loss.backward()
    
    losses.append(loss.item())

    # update the parameters
    optimizer.step()

    print(f"Epoch [{epoch+1}/10], Loss: {loss.item():.4f}")

print(losses)

Output shape after decoder: torch.Size([5, 77, 20])
Output shape after fully connected layer: torch.Size([5, 77, 64])
input shape: torch.Size([5, 77])
output shape: torch.Size([385, 64])
target shape: torch.Size([385])
Epoch [1/10], Loss: 4.2592
Output shape after decoder: torch.Size([5, 77, 20])
Output shape after fully connected layer: torch.Size([5, 77, 64])
input shape: torch.Size([5, 77])
output shape: torch.Size([385, 64])
target shape: torch.Size([385])
Epoch [2/10], Loss: 4.2566
Output shape after decoder: torch.Size([5, 77, 20])
Output shape after fully connected layer: torch.Size([5, 77, 64])
input shape: torch.Size([5, 77])
output shape: torch.Size([385, 64])
target shape: torch.Size([385])
Epoch [3/10], Loss: 4.2540
Output shape after decoder: torch.Size([5, 77, 20])
Output shape after fully connected layer: torch.Size([5, 77, 64])
input shape: torch.Size([5, 77])
output shape: torch.Size([385, 64])
target shape: torch.Size([385])
Epoch [4/10], Loss: 4.2514
Output shape aft