# Self-Referencing Embedded Strings (SELFIES): A 100% robust molecular string representation

The discovery of novel materials and functional molecules can help to solve some of society’s
most urgent challenges, ranging from efficient energy harvesting and storage to uncovering novel
pharmaceutical drug candidates. Traditionally matter engineering – generally denoted as inverse
design – was based massively on human intuition and high-throughput virtual screening. The last
few years have seen the emergence of significant interest in computer-inspired designs based on
evolutionary or deep learning methods. The major challenge here is that the standard string
molecular representation, SMILES, shows substantial weaknesses in that task because large fractions
of strings do not correspond to valid molecules. Here, we solve this problem at a fundamental
level and introduce SELFIES (SELF-referencIng Embedded Strings), a string-based representation
of molecules which is 100% robust. Every SELFIES string corresponds to a valid molecule, and
SELFIES can represent every molecule. SELFIES can be directly applied in arbitrary machine
learning models without the adaptation of the models; each of the generated molecule candidates is
valid. In our experiments, the model’s internal memory stores two orders of magnitude more diverse
molecules than a similar test with SMILES. Furthermore, as all molecules are valid, it allows for
explanation and interpretation of the internal working of the generative models.

Link to paper: https://arxiv.org/abs/1905.13741

Credit: https://github.com/seyonechithrananda/selfies-mirror

## Installation

Install SELFIES from the command line using pip:

In [1]:
!pip install selfies

Collecting selfies
  Using cached selfies-1.0.4-py3-none-any.whl (30 kB)
Installing collected packages: selfies
Successfully installed selfies-1.0.4


In [None]:
# Install RDKit via conda
!wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh
!chmod +x Miniconda3-py37_4.8.2-Linux-x86_64.sh
!bash ./Miniconda3-py37_4.8.2-Linux-x86_64.sh -b -f -p /usr/local
!conda install -c rdkit rdkit -y
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')

In [2]:
# Import libraries
import selfies as sf
from rdkit import Chem

## Standard Usage

First let’s try translating from SMILES to SELFIES, and then from SELFIES to SMILES. We will use a non-fullerene acceptor for organic solar cells as an example.

In [4]:
smiles = "CN1C(=O)C2=C(c3cc4c(s3)-c3sc(-c5ncc(C#N)s5)cc3C43OCCO3)N(C)C(=O)" \
         "C2=C1c1cc2c(s1)-c1sc(-c3ncc(C#N)s3)cc1C21OCCO1"
encoded_selfies = sf.encoder(smiles)  # SMILES --> SELFIES
decoded_smiles = sf.decoder(encoded_selfies)  # SELFIES --> SMILES

print(f"Original SMILES: {smiles}")
print(f"Translated SELFIES: {encoded_selfies}")
print(f"Translated SMILES: {decoded_smiles}")

Original SMILES: CN1C(=O)C2=C(c3cc4c(s3)-c3sc(-c5ncc(C#N)s5)cc3C43OCCO3)N(C)C(=O)C2=C1c1cc2c(s1)-c1sc(-c3ncc(C#N)s3)cc1C21OCCO1
Translated SELFIES: [C][N][C][Branch1_2][C][=O][C][=C][Branch2_1][Ring2][Branch1_3][C][=C][C][=C][Branch1_1][Ring2][S][Ring1][Branch1_1][C][S][C][Branch1_1][N][C][=N][C][=C][Branch1_1][Ring1][C][#N][S][Ring1][Branch1_3][=C][C][Expl=Ring1][N][C][Ring1][S][O][C][C][O][Ring1][Branch1_1][N][Branch1_1][C][C][C][Branch1_2][C][=O][C][Ring2][Ring1][=N][=C][Ring2][Ring1][P][C][=C][C][=C][Branch1_1][Ring2][S][Ring1][Branch1_1][C][S][C][Branch1_1][N][C][=N][C][=C][Branch1_1][Ring1][C][#N][S][Ring1][Branch1_3][=C][C][Expl=Ring1][N][C][Ring1][S][O][C][C][O][Ring1][Branch1_1]
Translated SMILES: CN7C(=O)C6=C(C1=CC4=C(S1)C=3SC(C2=NC=C(C#N)S2)=CC=3C45OCCO5)N(C)C(=O)C6=C7C8=CC%11=C(S8)C=%10SC(C9=NC=C(C#N)S9)=CC=%10C%11%12OCCO%12
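
A SELFIES string is a flat sequence of bracketed symbols, which makes it convenient as input to sequence models. The following is a minimal, dependency-free sketch of splitting such a string into symbols and mapping them to integer indices. The helper names `split_symbols` and `build_vocab` are hypothetical and not part of the `selfies` package, and the regular expression assumes that every symbol is either a bracketed token or a `.` separator.

```python
import re

# Assumption: a SELFIES symbol is either "[...]" or the "." fragment separator.
_SYMBOL = re.compile(r"\[[^\[\]]*\]|\.")


def split_symbols(selfies_string):
    """Split a SELFIES string into its individual symbols (hypothetical helper)."""
    symbols = _SYMBOL.findall(selfies_string)
    # Guard against input that the pattern does not fully cover.
    if "".join(symbols) != selfies_string:
        raise ValueError("string contains text outside of SELFIES symbols")
    return symbols


def build_vocab(symbol_lists):
    """Map every distinct symbol to an integer index, in sorted order."""
    alphabet = sorted({s for symbols in symbol_lists for s in symbols})
    return {symbol: index for index, symbol in enumerate(alphabet)}


# First few symbols of the translated SELFIES shown above.
fragment = "[C][N][C][Branch1_2][C][=O]"
symbols = split_symbols(fragment)
vocab = build_vocab([symbols])
encoded = [vocab[s] for s in symbols]
print(symbols)
print(encoded)
```

In practice the vocabulary would be built from every string in a dataset rather than from a single fragment.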


When comparing the original and decoded SMILES, do not use <code>==</code> string equality: the same molecule can be written as many different SMILES strings. Instead, use <code>RDKit</code> to canonicalize both SMILES and compare the canonical forms.


In [5]:
print(f"== Equals: {smiles == decoded_smiles}")

# Recommended
can_smiles = Chem.CanonSmiles(smiles)
can_decoded_smiles = Chem.CanonSmiles(decoded_smiles)
print(f"RDKit Equals: {can_smiles == can_decoded_smiles}")

== Equals: False
RDKit Equals: True


## Advanced Usage

Now let’s try to customize the SELFIES constraints. We will first look at the default SELFIES semantic constraints.

In [6]:
default_constraints = sf.get_semantic_constraints()
print(f"Default Constraints:\n {default_constraints}")

Default Constraints:
 {'H': 1, 'F': 1, 'Cl': 1, 'Br': 1, 'I': 1, 'O': 2, 'O+1': 3, 'O-1': 1, 'N': 3, 'N+1': 4, 'N-1': 2, 'C': 4, 'C+1': 5, 'C-1': 3, 'P': 5, 'P+1': 6, 'P-1': 4, 'S': 6, 'S+1': 7, 'S-1': 5, '?': 8}


Consider two compounds, <code>CS=CC#S</code> and <code>[Li]=CC</code>, which we first encode as SELFIES. Under the default SELFIES settings, they decode back to the same SMILES. Note that since <code>Li</code> is not listed in the default constraints, it falls under the catch-all <code>'?'</code> entry and is constrained to 8 bonds.



In [10]:
c_s_compound = sf.encoder("CS=CC#S")
li_compound = sf.encoder("[Li]=CC")

print(f"CS=CC#S --> {sf.decoder(c_s_compound)}")
print(f"[Li]=CC --> {sf.decoder(li_compound)}")

CS=CC#S --> CS=CC#S
[Li]=CC --> [Li]=CC


We can add <code>Li</code> to the SELFIES constraints, and restrict it to 1 bond only. We can also restrict <code>S</code> to 2 bonds (instead of its default 6). After setting the new constraints, we can check to see if they were updated.

In [11]:
new_constraints = dict(default_constraints)  # copy, so default_constraints stays unchanged
new_constraints['Li'] = 1
new_constraints['S'] = 2

sf.set_semantic_constraints(new_constraints)  # update constraints

print(f"Updated Constraints:\n {sf.get_semantic_constraints()}")

Updated Constraints:
 {'H': 1, 'F': 1, 'Cl': 1, 'Br': 1, 'I': 1, 'O': 2, 'O+1': 3, 'O-1': 1, 'N': 3, 'N+1': 4, 'N-1': 2, 'C': 4, 'C+1': 5, 'C-1': 3, 'P': 5, 'P+1': 6, 'P-1': 4, 'S': 2, 'S+1': 7, 'S-1': 5, '?': 8, 'Li': 1}
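
One Python detail matters when editing a constraints dictionary: assigning a dict to a second name does not copy it, so changes made through either name affect the same object. A minimal sketch, using a small hand-written dictionary as a stand-in for the real constraints (the values are taken from the defaults printed earlier, and nothing here calls `selfies`):

```python
# Stand-in for a constraints dictionary; not fetched from selfies.
defaults = {"C": 4, "S": 6, "?": 8}

alias = defaults          # same object under a second name
alias["S"] = 2
print(defaults["S"])      # the "saved" defaults changed too

defaults = {"C": 4, "S": 6, "?": 8}
edited = dict(defaults)   # shallow copy: a new, independent dictionary
edited["S"] = 2
edited["Li"] = 1
print(defaults["S"], "Li" in defaults)
```

Because the values are plain integers, a shallow copy is sufficient here.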


Under the new settings, the same SELFIES strings decode to different molecules. Notice that the new semantic constraints are met: each <code>S</code> forms at most 2 bonds and <code>Li</code> at most 1.


In [12]:
print(f"CS=CC#S --> {sf.decoder(c_s_compound)}")
print(f"[Li]=CC --> {sf.decoder(li_compound)}")

CS=CC#S --> CSCC=S
[Li]=CC --> [Li]CC
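
To build intuition for why the decoded structures change, here is a toy sketch of how a per-atom bond cap can downgrade bond orders along a simple chain. It is not the `selfies` decoder: it handles only unbranched, ring-free chains, the function `clamp_chain` and its input format are hypothetical, and the real derivation rules are more involved. Each bond is reduced to the largest order that both neighbouring atoms can still accommodate.

```python
# Toy model only: unbranched, ring-free chains. Not the selfies algorithm.
BOND_SYMBOL = {1: "", 2: "=", 3: "#"}
ORGANIC_SUBSET = {"B", "C", "N", "O", "P", "S", "F", "Cl", "Br", "I"}


def clamp_chain(atoms, constraints):
    """Render a chain as SMILES while enforcing per-element bond caps.

    atoms: list of (requested bond order to the previous atom, element);
           the bond order of the first entry is ignored.
    constraints: dict of element -> maximum number of bonds, with a "?"
                 entry used for elements that are not listed.
    """
    parts = []
    prev_remaining = 0  # bonds the previous atom can still form
    for index, (requested, element) in enumerate(atoms):
        cap = constraints.get(element, constraints["?"])
        text = element if element in ORGANIC_SUBSET else f"[{element}]"
        if index == 0:
            used = 0
        else:
            # Largest order that both atoms can still accommodate.
            used = min(requested, prev_remaining, cap)
            if used == 0:
                break  # no capacity left: the chain ends here
            text = BOND_SYMBOL[used] + text
        parts.append(text)
        prev_remaining = cap - used
    return "".join(parts)


c_s_chain = [(0, "C"), (1, "S"), (2, "C"), (1, "C"), (3, "S")]  # CS=CC#S
li_chain = [(0, "Li"), (2, "C"), (1, "C")]                      # [Li]=CC

default_caps = {"C": 4, "S": 6, "?": 8}
custom_caps = {"C": 4, "S": 2, "Li": 1, "?": 8}

for caps in (default_caps, custom_caps):
    print(clamp_chain(c_s_chain, caps), clamp_chain(li_chain, caps))
```

With the default caps both chains come out unchanged. With `S` limited to 2 bonds and `Li` to 1, the sketch yields the same SMILES as the decoder output above.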


To revert to the default constraints, call <code>set_semantic_constraints</code> with no arguments:

In [13]:
sf.set_semantic_constraints()