# SMILES Autoencoder 

Replicating the SMILES autoencoder from this article:  
https://www.nature.com/articles/s42004-023-00932-3#Sec19  

Supplementary material for the above paper:  
https://static-content.springer.com/esm/art%3A10.1038%2Fs42004-023-00932-3/MediaObjects/42004_2023_932_MOESM1_ESM.pdf  

Model architecture and training details from the supplementary material:  

- Bidirectional GRU with an encoder-decoder architecture and
  - number of encoder and decoder layers: 3
  - hidden dimension for encoder and decoder: 512
  - dimensionality of the embedding space: 512
  - nonlinearity: tanh

Training hyperparameters:

- Batch size: 256
- Learning rate: 0.0001
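
As a rough sketch, the settings listed above could map onto PyTorch as follows. Several choices here are assumptions rather than details from the paper: the vocabulary size of 35 is a placeholder, "embedding space" is read as the token embedding width (it could equally mean the latent vector), Adam is an assumed optimizer, and "nonlinearity: tanh" is taken to be the tanh that `nn.GRU` already applies internally.

```python
import torch
import torch.nn as nn
import torch.optim as optim

VOCAB_SIZE = 35        # placeholder; depends on the tokenizer
EMBED_DIM = 512        # assumed reading of "embedding space" dimensionality
HIDDEN_DIM = 512
NUM_LAYERS = 3
BATCH_SIZE = 256       # listed for reference; a smaller batch is used for the shape check
LEARNING_RATE = 1e-4

embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
encoder = nn.GRU(
    input_size=EMBED_DIM,
    hidden_size=HIDDEN_DIM,
    num_layers=NUM_LAYERS,
    bidirectional=True,
    batch_first=True,
)

# optimizer choice is an assumption; the notes above only give the learning rate
params = list(embedding.parameters()) + list(encoder.parameters())
optimizer = optim.Adam(params, lr=LEARNING_RATE)

# shape check on a tiny batch of 4 sequences of length 77
tokens = torch.randint(0, VOCAB_SIZE, (4, 77))
states, hidden = encoder(embedding(tokens))
print(states.shape)    # (batch, seq_len, 2 * HIDDEN_DIM)
print(hidden.shape)    # (2 * NUM_LAYERS, batch, HIDDEN_DIM)
```

The per-step output is twice the hidden size because the forward and backward directions are concatenated.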

Training: The autoencoder is trained as a translation model that translates a randomized SMILES version of a molecule
to its canonical version. The model is trained on 135M molecules from the PubChem and ZINC12 datasets.
The architecture of the translation model and the latent dimension of 512 are similar to those used in Winter et al.
To make the learnt representations more meaningful, the authors also jointly trained a regression model to predict
molecular properties that can be calculated from the molecular structure. The regression model uses two fully connected
layers with dimensions 512 and 128 and ReLU non-linearity. The predicted properties are: logP, molar refractivity, number
of valence electrons, number of hydrogen bond donors and acceptors, Balaban's J value, topological polar surface area,
drug-likeness (QED) and synthetic accessibility (SA).
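
The regression model described above can be sketched as a small PyTorch module. This is a minimal illustration, not the authors' code: the final linear layer mapping 128 features to 9 outputs (counting hydrogen bond donors and acceptors separately), the MSE loss, and the equal weighting of the two losses are all assumptions, and `PropertyHead` is a hypothetical name.

```python
import torch
import torch.nn as nn

LATENT_DIM = 512
NUM_PROPERTIES = 9  # logP, MR, valence electrons, HBD, HBA, Balaban J, TPSA, QED, SA


class PropertyHead(nn.Module):
    """Two fully connected layers (512, 128) with ReLU, plus an assumed linear output layer."""

    def __init__(self, latent_dim=LATENT_DIM, num_properties=NUM_PROPERTIES):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 128),
            nn.ReLU(),
            nn.Linear(128, num_properties),  # assumed output projection
        )

    def forward(self, z):
        return self.net(z)


head = PropertyHead()
z = torch.randn(4, LATENT_DIM)                  # stand-in for latent vectors from the encoder
predicted = head(z)
true_properties = torch.randn(4, NUM_PROPERTIES)  # stand-in for precomputed property values

reconstruction_loss = torch.tensor(2.0)         # stand-in for the translation (cross-entropy) loss
property_loss = nn.functional.mse_loss(predicted, true_properties)
joint_loss = reconstruction_loss + property_loss  # equal weighting is an assumption
print(predicted.shape, joint_loss.item())
```

In practice the property targets would usually be standardized first, since quantities such as TPSA and QED live on very different scales.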

- try mol2vec

In [47]:
# IMPORT NECESSARY PACKAGES
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
import numpy as np
import seaborn as sns
import random
np.random.seed(42)

In [48]:
# LOAD DATASET AND KEEP SMILES STRINGS WITH LEN<=77 (PADDING HAPPENS IN A LATER CELL)

# load dataset
sanitized_smiles_df = pd.read_csv('dataset/sanitized_smiles_first100k.csv')     # smiles.csv file with 100K rows
sanitized_smiles_df.rename(columns={'SMILES': 'smiles'}, inplace=True)

# calculate the length of each string in the 'smiles' column
sanitized_smiles_df['length'] = sanitized_smiles_df['smiles'].apply(len)

# keep only SMILES with length 77 or less (.copy() avoids pandas SettingWithCopyWarning when columns are added later)
sanitized_smiles_df = sanitized_smiles_df[sanitized_smiles_df['length'] <= 77].copy()

# Find the maximum length
max_length_sanitized = sanitized_smiles_df['length'].max()

# Print the maximum length
print(f"The maximum length of a string in the input random 'smiles' column is: {max_length_sanitized}")

print(f"Number of SMILES sequences for input sanitized dataframe: {len(sanitized_smiles_df)}")

The maximum length of a string in the input random 'smiles' column is: 77
Number of SMILES sequences for input sanitized dataframe: 98414


In [49]:
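# CREATE RANDOM SMILES SEQUENCE BY SHUFFLING SMILES STRING AND ADD TO DATAFRAME
# NOTE: shuffling characters scrambles the string, so the result is generally not a valid SMILES.
# The paper's "randomized SMILES" is instead a valid, non-canonical SMILES of the same molecule.
def shuffle_string(smiles):
    list_smiles = list(smiles)      # split the string into single characters
    random.shuffle(list_smiles)     # shuffle in place using Python's `random` module
    return ''.join(list_smiles)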

random.seed(42)  # Set random seed for reproducibility

sanitized_smiles_df['random_smiles'] = sanitized_smiles_df['smiles'].apply(shuffle_string)

# pad the strings in the 'smiles' column to the desired length using " "
max_length = 77

sanitized_smiles_df['smiles_padded'] = sanitized_smiles_df['smiles'].apply(lambda x: x.ljust(max_length, ' '))
sanitized_smiles_df['random_smiles_padded'] = sanitized_smiles_df['random_smiles'].apply(lambda x: x.ljust(max_length, ' '))
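
For comparison with the character shuffle above, a randomized SMILES in the sense used by the paper can be generated with RDKit, which is not imported anywhere in this notebook and is therefore an assumed extra dependency. The sketch below is one possible approach; `randomize_smiles` and `canonicalize_smiles` are hypothetical helper names.

```python
def randomize_smiles(smiles):
    """Return a randomized, chemically equivalent SMILES string for the same molecule."""
    from rdkit import Chem  # imported lazily so this cell still loads without RDKit

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    # doRandom=True starts the traversal from a random atom; canonical=False disables canonical ordering
    return Chem.MolToSmiles(mol, canonical=False, doRandom=True)


def canonicalize_smiles(smiles):
    """Return RDKit's canonical SMILES, usable as the translation target."""
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)


# Possible usage with the dataframe from this notebook:
# sanitized_smiles_df['random_smiles'] = sanitized_smiles_df['smiles'].apply(randomize_smiles)
```

A randomized SMILES can be longer than the original string, so the length filter of 77 would need to be applied after randomization rather than before.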

In [50]:
# TOKENIZE SMILES STRINGS (CONVERT EACH CHARACTER TO AN INTEGER INDEX), CONVERT TO INPUT/OUTPUT TENSORS, SAMPLE 1000 SMILES SEQUENCES

# Convert SMILES strings to list of characters tokens
sanitized_smiles_df['smiles_tokenized_lists'] = sanitized_smiles_df['smiles_padded'].apply(lambda x: list(x))
sanitized_smiles_df['random_smiles_tokenized_lists'] = sanitized_smiles_df['random_smiles_padded'].apply(lambda x: list(x))

# combines each list of SMILES characters into one list
flattened_list_all_smiles = [item for sublist in sanitized_smiles_df['smiles_tokenized_lists'] for item in sublist]

# get unique characters
unique_characters = set(flattened_list_all_smiles)

# convert back to list
unique_characters = list(unique_characters)
unique_characters = sorted(unique_characters)

print(f"Number of unique characters: {len(unique_characters)}")

# mapping from characters to integers
char_to_int = {char: i for i, char in enumerate(unique_characters)}

# convert SMILES tokenized lists into integer lists
random_smiles_int_lists = [
    [char_to_int[char] for char in sublist]
    for sublist in sanitized_smiles_df['random_smiles_tokenized_lists']
]
smiles_int_lists = [
    [char_to_int[char] for char in sublist]
    for sublist in sanitized_smiles_df['smiles_tokenized_lists']
]

sanitized_smiles_df['random_smiles_integer_lists'] = random_smiles_int_lists
sanitized_smiles_df['smiles_integer_lists'] = smiles_int_lists

# convert the tokenized sequences to tensors
int_lists_random_smiles = sanitized_smiles_df['random_smiles_integer_lists'].tolist()
int_lists_smiles = sanitized_smiles_df['smiles_integer_lists'].tolist()

in_random_smiles_tensor = torch.tensor(int_lists_random_smiles, dtype=torch.float32)
out_smiles_tensor = torch.tensor(int_lists_smiles, dtype=torch.float32)

print(f"Shape of input SMILES tensor: {in_random_smiles_tensor.shape}")
print(f"Shape of output SMILES tensor: {out_smiles_tensor.shape}")

# create a random sample of tensor rows

# 1000 random row indices (sampled with replacement) between 0 and the number of sequences
random_indices = torch.randint(0, len(in_random_smiles_tensor), (1000,))

# take the same 1000 sampled rows from each tensor
sample_in_smiles_tensor = in_random_smiles_tensor[random_indices]
sample_out_smiles_tensor = out_smiles_tensor[random_indices]

# Print shape of sampled rows
print(f"Shape of sampled input SMILES tensor: {sample_in_smiles_tensor.shape}")
print(f"Shape of sampled output SMILES tensor: {sample_out_smiles_tensor.shape}")

Number of unique characters: 35
Shape of input SMILES tensor: torch.Size([98414, 77])
Shape of output SMILES tensor: torch.Size([98414, 77])
Shape of sampled input SMILES tensor: torch.Size([1000, 77])
Shape of sampled output SMILES tensor: torch.Size([1000, 77])


In [51]:
print(char_to_int)

{' ': 0, '#': 1, '(': 2, ')': 3, '+': 4, '-': 5, '1': 6, '2': 7, '3': 8, '4': 9, '5': 10, '6': 11, '7': 12, '8': 13, '=': 14, 'A': 15, 'B': 16, 'C': 17, 'F': 18, 'H': 19, 'I': 20, 'N': 21, 'O': 22, 'P': 23, 'S': 24, '[': 25, ']': 26, 'c': 27, 'e': 28, 'i': 29, 'l': 30, 'n': 31, 'o': 32, 'r': 33, 's': 34}
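
The vocabulary printed above contains `l` and `r` as standalone tokens, which suggests that two-letter atoms such as `Cl` are split into two characters, and bracket atoms such as `[nH]` are spread over several tokens. A regex-based tokenizer is one common alternative. The pattern below is a simplified sketch that covers the symbols seen in this vocabulary, not a complete SMILES grammar, and `tokenize_smiles` is a hypothetical helper.

```python
import re

# Order matters: bracket atoms and two-letter halogens must be tried before single letters.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-\+\(\)/\\\.@:~\*\$])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Guard against silently dropping characters the pattern does not cover
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


example = "Nc1nc(Cl)cc(N2CCOCC2)n1"
tokens = tokenize_smiles(example)
print(tokens)

# Build a vocabulary the same way as the notebook, but over tokens instead of characters
vocabulary = sorted(set(tokens))
token_to_int = {token: i for i, token in enumerate(vocabulary)}
print(token_to_int)
```

Treating each bracket atom as one token shortens the sequences but enlarges the vocabulary, since every distinct bracket expression gets its own index.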


In [52]:
print(f"Sampled input SMILES tensor: {sample_in_smiles_tensor}")

Sampled input SMILES tensor: tensor([[27.,  7., 17.,  ...,  0.,  0.,  0.],
        [26.,  2., 31.,  ...,  0.,  0.,  0.],
        [27., 27., 22.,  ...,  0.,  0.,  0.],
        ...,
        [31., 17., 22.,  ...,  0.,  0.,  0.],
        [ 7., 32., 17.,  ...,  0.,  0.,  0.],
        [14.,  6., 27.,  ...,  0.,  0.,  0.]])


In [None]:
class bidirectional_GRU_AE(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers=3, embedding_dim=10):
        super(bidirectional_GRU_AE, self).__init__()

        # embedding layer to convert integer indices to dense float vectors
        self.embedding = nn.Embedding(input_size, embedding_dim)

        self.encoder = nn.GRU(
            input_size=embedding_dim, 
            hidden_size=hidden_size, 
            num_layers=num_layers, 
            bidirectional=True,
            # input and output tensors are in (batch_size, seq_len, feature_dim) format
            batch_first = True      
        )

        self.decoder = nn.GRU(
            # the decoder consumes the encoder output, whose feature size is hidden_size * 2
            # (forward and backward states of the bidirectional encoder are concatenated)
            input_size=hidden_size * 2,     
            hidden_size=hidden_size, 
            num_layers=num_layers, 
            bidirectional=True, 
            batch_first = True
        )

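        # fully connected layer projecting decoder states to one score per vocabulary entry
        # NOTE: the trailing Tanh bounds every score to [-1, 1], while nn.CrossEntropyLoss expects unbounded logits,
        # so it caps how confident the softmax can become (the paper's tanh likely refers to the GRU nonlinearity)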
        self.fc = nn.Sequential(
            nn.Linear(hidden_size * 2, input_size),  # Linear layer
            nn.Tanh(),  # Tanh activation
        )

    def prob_to_char_out(self, output):
        _, predicted_classes = torch.max(output, dim=2)
        return predicted_classes     

    def forward(self, x):
        # encoding
        # embed input integer indices to floating point vectors
        # after embedding, shape of x becomes (batch size, sequence length, embedding dimensions)
        x = self.embedding(x)

        # encoder_output shape is (batch size, sequence length, hidden_size * 2)
        # hidden shape is (num_layers * 2, batch size, hidden_size)
        encoder_output, hidden = self.encoder(x)
        
        # for bidirectional GRU, there are two separate hidden states for each hidden layer
        # forward state goes from start to end of sequence and backward state goes from end to start of sequence
        # we can concatenate forward and backward directions from last hidden layers or pass the hidden layers to the decoder directly
        # we are currently passing hidden layers to decoder directly

        # decoding
        # encoder_output has shape (batch size, sequence length, hidden_size * 2)
        # decoder output has shape (batch size, sequence length, hidden_size * 2)
        decoder_output, _ = self.decoder(encoder_output, hidden)
        print(f"Output shape after decoder: {decoder_output.shape}")

        # pass decoder output through the fully connected layer
        # output has shape (batch size, sequence length, vocab size), e.g. (1000, 77, 35) in the run below
        output = self.fc(decoder_output)
        print(f"Output shape after fully connected layer: {output.shape}")

        # Define custom order of classes (e.g., reversing the class order)
        #custom_order = list(range(64))
        #custom_order = torch.tensor(custom_order)  # This means class 3 comes first, then class 2, etc.

        # Reorder the vocab_size dimension of the output tensor based on custom_order
        #reordered_output = output[:, :, custom_order]

        predicted_classes = self.prob_to_char_out(output)
        print(f"Predicted classes: {predicted_classes}") 

        return output
    
int_to_char = {i: char for char, i in char_to_int.items()}

def decode_smiles(tensor):
    decoded_smiles = []
    for sequence in tensor:
        smiles = ''.join([int_to_char[int(idx)] for idx in sequence if int(idx) in int_to_char])
        decoded_smiles.append(smiles.strip())  # Remove trailing spaces
    return decoded_smiles
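
The `fc` head in the model above ends in `nn.Tanh()`, so every score passed to `nn.CrossEntropyLoss` lies in [-1, 1]. The short calculation below is a worked illustration (not part of the original notebook) of the best case: the correct class sits at +1 and all other classes at -1. `tanh_bounded_limits` is a hypothetical helper.

```python
import math


def tanh_bounded_limits(num_classes, bound=1.0):
    """Best softmax probability and lowest cross-entropy when logits lie in [-bound, bound]."""
    correct = math.exp(bound)                        # correct class at the upper bound
    others = (num_classes - 1) * math.exp(-bound)    # every other class at the lower bound
    best_prob = correct / (correct + others)
    return best_prob, -math.log(best_prob)


best_prob, min_loss = tanh_bounded_limits(num_classes=35)
uniform_loss = math.log(35)  # loss of a uniform guess over 35 tokens

print(f"max probability for the correct token: {best_prob:.3f}")
print(f"minimum per-token cross-entropy: {min_loss:.3f}")
print(f"cross-entropy of a uniform guess: {uniform_loss:.3f}")
```

With 35 classes the correct token can reach a probability of only about 0.18, so the per-token loss cannot fall below roughly 1.72 however long the model trains. Returning the raw output of the `nn.Linear` layer would remove that floor.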

In [56]:
# Hyperparameters
seq_len = 77
input_size = len(unique_characters)  # vocabulary size
embed_dim = 64    
hidden_dim = 64  
num_layers = 3
batch_size = 1000
learning_rate = 0.0001

# training data
input = sample_in_smiles_tensor.long()
target = sample_out_smiles_tensor.long()

# instantiate model
model = bidirectional_GRU_AE(input_size, hidden_dim, num_layers, embed_dim)

# optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss(ignore_index=char_to_int[' '])  # ignore " " padding indices

# training loop
losses = []
for epoch in range(10):
    model.train()
    optimizer.zero_grad()
    output = model(input)
    output = output.contiguous().view(-1, input_size)  # Flatten for loss calculation
    target = target.contiguous().view(-1)  # Flatten targets

    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

    print(f"Epoch [{epoch+1}/10], Loss: {loss.item():.4f}")

    # Decode and print the predicted SMILES
    model.eval()
    with torch.no_grad():
        predicted_indices = torch.argmax(output.view(batch_size, seq_len, input_size), dim=2)
        predicted_smiles = decode_smiles(predicted_indices)
        random_smiles = decode_smiles(input)
        actual_smiles = sanitized_smiles_df['smiles'].iloc[random_indices.tolist()]
        print("Actual vs Predicted SMILES:")
        # `random_in` avoids rebinding the name `random`, which would shadow the imported module
        for actual, random_in, predicted in zip(actual_smiles, random_smiles, predicted_smiles):
            print(f"Actual: {actual}")
            print(f"Random input: {random_in}")
            print(f"Predicted: {predicted}")
            print("-" * 30)

print(losses)

Output shape after decoder: torch.Size([1000, 77, 128])
Output shape after fully connected layer: torch.Size([1000, 77, 35])
Predicted classes: tensor([[26, 26, 26,  ..., 27, 27, 27],
        [ 7, 27, 27,  ..., 27, 27, 27],
        [26, 26, 26,  ..., 27, 27, 27],
        ...,
        [27, 27, 27,  ..., 27, 27, 27],
        [30, 26, 26,  ..., 27, 27, 27],
        [30, 26, 26,  ..., 27, 27, 27]])
Epoch [1/10], Loss: 3.5527
Actual vs Predicted SMILES:
Actual: Nc1nc(Cl)cc(N2CCOCC2)n1
Random input: c2CncO1)2CNnc(l)NCCc(C1
Predicted: ]]]]]]]]]]]]]]]]]]]]ccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
------------------------------
Actual: CCc1ccc(-n2cc(-c3ccc([N+](=O)[O-])cc3)[nH]c2=S)cc1
Random input: ](nc][cc-)cn1)(-c+)cc)c3=CS[cH(c=23Cc]c-2c(1ccONO[
Predicted: 2cc]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]ccccccccccccccccccccccccccccc
------------------------------
Actual: O=C1Cc2cc(Cc3ccccc3)ccc2C(=O)N1O
Random input: ccOC2cccC(OccCcc=321)=N3)(OcCc1c
Predicted: ]]]]]]]]]]]]]]