Ucak UV, Ashyrmamatov I, Lee J (2023) Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization. J Cheminformatics 15:55. https://doi.org/10.1186/s13321-023-00725-9
Tokenization is an important preprocessing step in natural language processing that can have a significant influence on prediction quality. This research shows that traditional SMILES tokenization has a limitation: its tokens fail to reflect the true nature of molecules. To address this issue, we developed the atom-in-SMILES tokenization scheme, which eliminates ambiguities in the generic nature of SMILES tokens. Our results in multiple chemical translation and molecular property prediction tasks demonstrate that proper tokenization has a significant impact on prediction quality. In terms of prediction accuracy and token degeneration, atom-in-SMILES is a more effective method for generating higher-quality SMILES sequences from AI-based chemical models than other tokenization and representation schemes. We investigated the degree of token degeneration of various schemes and analyzed its adverse effects on prediction quality. Additionally, token-level repetitions were quantified, and generated examples were included for qualitative examination. We believe that atom-in-SMILES tokenization has great potential for adoption across related scientific communities, as it provides chemically accurate, tailor-made tokens for molecular property prediction, chemical translation, and molecular generative models.
atom-in-SMILES can be installed using pip:
pip install atomInSmiles
Alternatively, clone the GitHub repository and install it locally:
git clone https://github.com/snu-lcbc/atom-in-SMILES
cd atom-in-SMILES
python setup.py install
Brief descriptions of the main functions:
| Function | Description |
|---|---|
| atomInSmiles.encode | Converts a SMILES string into atom-in-SMILES tokens. |
| atomInSmiles.decode | Converts atom-in-SMILES tokens into a SMILES string. |
| atomInSmiles.similarity | Calculates the Tanimoto coefficient of two atom-in-SMILES token sequences. |
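To make the idea of a Tanimoto coefficient over tokens concrete, here is a minimal sketch that does not call the library. It assumes the coefficient is taken over the sets of whitespace-separated tokens; the packaged `atomInSmiles.similarity` may count or filter tokens differently. The function name `token_tanimoto` and the acetic acid token string are hypothetical and written by hand for illustration.

```python
def token_tanimoto(ais_a: str, ais_b: str) -> float:
    """Tanimoto coefficient over the sets of whitespace-separated tokens.

    Assumption: tokens are compared as a set (duplicates ignored).
    """
    set_a, set_b = set(ais_a.split()), set(ais_b.split())
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: two empty sequences are identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Glycine tokens as shown in this README.
glycine = "[NH2;!R;C] [CH2;!R;CN] [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C]"
# Hand-written illustrative tokens for acetic acid (not library output).
acetic_acid = "[CH3;!R;C] [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C]"

# 6 shared tokens out of 9 distinct tokens overall.
print(round(token_tanimoto(glycine, acetic_acid), 3))  # 0.667
```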
import atomInSmiles
smiles = 'NCC(=O)O'
# SMILES -> atom-in-SMILES
ais_tokens = atomInSmiles.encode(smiles) # '[NH2;!R;C] [CH2;!R;CN] [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C]'
# atom-in-SMILES -> SMILES
decoded_smiles = atomInSmiles.decode(ais_tokens) #'NCC(=O)O'
assert smiles == decoded_smiles
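The encoded string above suggests that each bracketed token carries three semicolon-separated fields: the central atom (with its hydrogens), ring membership (`R` or `!R`), and the neighboring atoms. The following sketch splits a token into those fields using only the standard library. The helper `parse_ais_token` and the `AisAtom` type are hypothetical names, not part of the package, and the three-field layout is inferred from the example output rather than taken from the library's source.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout of an atom token: [atom;ring;neighbors]
AIS_ATOM = re.compile(r"^\[([^;\]]+);(!?R);([^\]]*)\]$")


class AisAtom(NamedTuple):
    atom: str       # central atom, e.g. 'NH2' or aromatic 'c'
    in_ring: bool   # True for 'R', False for '!R'
    neighbors: str  # neighbor symbols as written, e.g. 'COO'


def parse_ais_token(token: str) -> Optional[AisAtom]:
    """Return the fields of an atom token, or None for bonds, branches, ring digits."""
    match = AIS_ATOM.match(token)
    if match is None:
        return None
    atom, ring, neighbors = match.groups()
    return AisAtom(atom, ring == "R", neighbors)


ais = "[NH2;!R;C] [CH2;!R;CN] [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C]"
for tok in ais.split():
    print(tok, "->", parse_ais_token(tok))
```

The neighbor field is kept as a plain string because splitting it into elements would require handling two-letter symbols such as Cl.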
NOTE: By default, the encoder first canonicalizes the input SMILES. To obtain atom-in-SMILES tokens in the same atom order as the input SMILES, provide the input SMILES with atom map numbers.
from rdkit.Chem import MolFromSmiles, MolToSmiles
import atomInSmiles
# ensuring the order of SMILES in atom-in-SMILES.
smiles = 'NCC(=O)O'
mol = MolFromSmiles(smiles)
random_smiles = MolToSmiles(mol, doRandom=True) # e.g. 'C(C(=O)O)N'
# map each atom index onto the SMILES string as an atom map number
tmp = MolFromSmiles(random_smiles)
for atom in tmp.GetAtoms():
    atom.SetAtomMapNum(atom.GetIdx())
smiles_1 = MolToSmiles(tmp) # 'C([C:1](=[O:2])[OH:3])[NH2:4]'
# SMILES -> atom-in-SMILES
ais_tokens_1 = atomInSmiles.encode(smiles_1, with_atomMap=True) # '[CH2;!R;CN] ( [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C] ) [NH2;!R;C]'
# atom-in-SMILES -> SMILES
decoded_smiles_1 = atomInSmiles.decode(ais_tokens_1) # 'C(C(=O)O)N'
assert random_smiles == decoded_smiles_1
| Implementation | Items | Description |
|---|---|---|
| Single-step retrosynthesis | python src/predict.py | Runs inference with the trained model |
| | --model_type | One of SMILES, SELFIES, DeepSmiles, SmilesPE, AIS |
| | --checkpoint_name | Name of the checkpoint file |
| | --input | Tokenized input sequence |
| Molecular property prediction | Molecular-property-prediction.ipynb | MoleculeNet: Regression (ESOL, FreeSolv, Lipo.), Classification (BBBP, BACE, HIV) |
| Normalized repetition rate | Normalized-Repetition-Rates.ipynb | Natural products, drugs, metal complexes, lipids, steroids, isomers |
| Fingerprint nature of AIS | AIS-as-fingerprint.ipynb | AIS fingerprint resolution |
| Single-token repetition (rep-l) | rep-l_USPTO50k.ipynb | USPTO-50K, retrosynthetic translations |
| Input-output equivalent mapping | GDB13-results.ipynb | Augmented subset of GDB-13, noncanon-2-canon translations |
For example, for the retrosynthesis task:
python src/predict.py --model_type AIS --checkpoint_name AIS_checkpoint.pth \
--input='[CH3;!R;O] [O;!R;CC] [C;!R;COO] ( = [O;!R;C] ) [c;R;CCS] 1 [cH;R;CC] [c;R;CCC] ( [CH2;!R;CC] [CH2;!R;CC] [CH2;!R;CC] [c;R;CCN] 2 [cH;R;CC] [c;R;CCC] 3 [c;R;CNO] ( = [O;!R;C] ) [nH;R;CC] [c;R;NNN] ( [NH2;!R;C] ) [n;R;CC] [c;R;CNN] 3 [nH;R;CC] 2 ) [cH;R;CS] [s;R;CC] 1'
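Because the tokenized input contains brackets, semicolons, and parentheses, quoting it correctly on a shell command line is easy to get wrong. One option is to assemble the command as an argument list in Python. This sketch uses only the flags listed in the table above; the helper name `build_predict_command` is hypothetical, and the block only builds and prints the command without running it.

```python
import shlex
import sys

# Model types as listed in this README.
MODEL_TYPES = {"SMILES", "SELFIES", "DeepSmiles", "SmilesPE", "AIS"}


def build_predict_command(model_type: str, checkpoint_name: str, tokens: str) -> list:
    """Build the argument list for src/predict.py (hypothetical helper)."""
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unknown model type: {model_type!r}")
    return [
        sys.executable, "src/predict.py",
        "--model_type", model_type,
        "--checkpoint_name", checkpoint_name,
        f"--input={tokens}",  # passed as one argument, so no shell quoting is needed
    ]


cmd = build_predict_command(
    "AIS",
    "AIS_checkpoint.pth",
    "[NH2;!R;C] [CH2;!R;CN] [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C]",
)
# Show the equivalent, safely quoted shell command.
print(shlex.join(cmd))
```

From the repository root, the list could then be passed to `subprocess.run(cmd, check=True)`, assuming the checkpoint file is available where `predict.py` expects it.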
@article{10.1186/s13321-023-00725-9,
year = {2023},
title = {{Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization}},
author = {Ucak, Umit V. and Ashyrmamatov, Islambek and Lee, Juyong},
journal = {Journal of Cheminformatics},
doi = {10.1186/s13321-023-00725-9},
pages = {55},
number = {1},
volume = {15},
keywords = {}
}
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.