GitHub - snu-lcbc/atom-in-SMILES: Atom-in-SMILES tokenizer for SMILES strings.

Atom-in-SMILES tokenization.

Ucak UV, Ashyrmamatov I, Lee J (2023) Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization. J Cheminformatics 15:55. https://doi.org/10.1186/s13321-023-00725-9

Tokenization is an important preprocessing step in natural language processing that may have a significant influence on prediction quality. This research showed that traditional SMILES tokenization has a limitation: its tokens fail to reflect the true nature of molecules. To address this issue, we developed the atom-in-SMILES tokenization scheme, which eliminates ambiguities in the generic nature of SMILES tokens. Our results in multiple chemical translation and molecular property prediction tasks demonstrate that proper tokenization has a significant impact on prediction quality. In terms of prediction accuracy and token degeneration, atom-in-SMILES is a more effective method for generating higher-quality SMILES sequences from AI-based chemical models than other tokenization and representation schemes. We investigated the degrees of token degeneration of various schemes and analyzed their adverse effects on prediction quality. Additionally, token-level repetitions were quantified, and generated examples were incorporated for qualitative examination. We believe that atom-in-SMILES tokenization has great potential for adoption across related scientific communities, as it provides chemically accurate, tailor-made tokens for molecular property prediction, chemical translation, and molecular generative models.

Tutorial - Google Colab Notes

Installation

Install the package with pip:

pip install atomInSmiles

Alternatively, clone the GitHub repository and install it locally:

git clone https://github.com/snu-lcbc/atom-in-SMILES
cd atom-in-SMILES
python setup.py install

Usage & Demo

Brief descriptions of the main functions:

Function	Description
`atomInSmiles.encode`	Converts a SMILES string into atom-in-SMILES tokens.
`atomInSmiles.decode`	Converts atom-in-SMILES tokens into a SMILES string.
`atomInSmiles.similarity`	Calculates the Tanimoto coefficient of two atom-in-SMILES token sequences.

import atomInSmiles

smiles = 'NCC(=O)O'

# SMILES -> atom-in-SMILES 
ais_tokens = atomInSmiles.encode(smiles) # '[NH2;!R;C] [CH2;!R;CN] [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C]'

# atom-in-SMILES -> SMILES
decoded_smiles = atomInSmiles.decode(ais_tokens) #'NCC(=O)O'

assert smiles == decoded_smiles
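
Reading the tokens in the example above, each bracketed token appears to carry three semicolon-separated fields: the central atom with its hydrogens (`NH2`), a ring flag (`R` or `!R`), and the neighbouring atoms (`C`). The sketch below splits tokens under that assumption. `AisToken` and `parse_ais_token` are hypothetical helpers written for illustration; they are not part of the `atomInSmiles` API.

```python
from typing import NamedTuple, Optional


class AisToken(NamedTuple):
    atom: str        # central atom with hydrogen count, e.g. "NH2"
    in_ring: bool    # True for "R", False for "!R"
    neighbors: str   # neighbouring element symbols as written, e.g. "COO"


def parse_ais_token(token: str) -> Optional[AisToken]:
    """Return an AisToken for a bracketed token, or None for plain SMILES symbols."""
    if not (token.startswith("[") and token.endswith("]")):
        return None  # "(", ")", "=", ring-closure digits, ...
    fields = token[1:-1].split(";")
    if len(fields) != 3:
        return None  # not in the assumed three-field layout
    atom, ring, neighbors = fields
    return AisToken(atom, ring == "R", neighbors)


ais = "[NH2;!R;C] [CH2;!R;CN] [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C]"
for tok in ais.split():
    print(tok, parse_ais_token(tok))
```

Non-atom symbols such as parentheses and bond characters pass through as `None`, so the same loop can walk a full token string.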

NOTE: By default, `encode` first canonicalizes the input SMILES. To obtain atom-in-SMILES tokens in the same atom order as the input SMILES, provide the input with atom map numbers and pass `with_atomMap=True`.

from rdkit.Chem import MolFromSmiles, MolToSmiles
import atomInSmiles

# Preserve the atom order of the input SMILES in the atom-in-SMILES tokens.
smiles = 'NCC(=O)O'
mol = MolFromSmiles(smiles)
random_smiles = MolToSmiles(mol, doRandom=True) # e.g. 'C(C(=O)O)N'

# Write each atom's index into the SMILES string as an atom map number.
tmp = MolFromSmiles(random_smiles)
for atom in tmp.GetAtoms():
    atom.SetAtomMapNum(atom.GetIdx())
smiles_1 = MolToSmiles(tmp) # 'C([C:1](=[O:2])[OH:3])[NH2:4]'

# SMILES -> atom-in-SMILES
ais_tokens_1 = atomInSmiles.encode(smiles_1, with_atomMap=True) # '[CH2;!R;CN] ( [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C] ) [NH2;!R;C]'

# atom-in-SMILES -> SMILES
decoded_smiles_1 = atomInSmiles.decode(ais_tokens_1) # 'C(C(=O)O)N'

assert random_smiles == decoded_smiles_1
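
The function table lists `atomInSmiles.similarity` as computing a Tanimoto coefficient. As a rough illustration of what such a score measures, the sketch below computes the Tanimoto (Jaccard) coefficient over the distinct bracketed atom tokens of two token strings. This is an assumption-laden stand-in, not the library's implementation: restricting the comparison to bracketed tokens, the helper names `atom_tokens` and `tanimoto`, and the hand-written acetic acid string are all choices made here.

```python
def atom_tokens(ais: str) -> set:
    """Collect the distinct bracketed atom tokens of a whitespace-separated AIS string."""
    return {tok for tok in ais.split() if tok.startswith("[")}


def tanimoto(ais_a: str, ais_b: str) -> float:
    """Tanimoto coefficient: shared distinct tokens divided by all distinct tokens."""
    set_a, set_b = atom_tokens(ais_a), atom_tokens(ais_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty token sets
    return len(set_a & set_b) / len(union)


glycine = "[NH2;!R;C] [CH2;!R;CN] [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C]"
glycine_reordered = "[CH2;!R;CN] ( [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C] ) [NH2;!R;C]"
# Hand-written token string for acetic acid, for illustration only (not library output).
acetic_acid = "[CH3;!R;C] [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C]"

print(tanimoto(glycine, glycine_reordered))  # 1.0
print(tanimoto(glycine, acetic_acid))        # 0.5
```

Because the score is set-based, the two glycine strings from the examples above compare as identical regardless of atom order.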

Implementations & Results

Implementation	Items	Description
Single-step retrosynthesis	`python src/predict.py`	Runs inference with the trained model
	`--model_type`	Tokenization scheme (`SMILES`, `SELFIES`, `DeepSmiles`, `SmilesPE`, `AIS`)
	`--checkpoint_name`	Name of the checkpoint file
	`--input`	Tokenized input sequence
Molecular property prediction	Molecular-property-prediction.ipynb	MoleculeNet: Regression (ESOL, FreeSolv, Lipo.), Classification (BBBP, BACE, HIV)
Normalized repetition rate	Normalized-Repetition-Rates.ipynb	Natural products, drugs, metal complexes, lipids, steroids, isomers
Fingerprint nature of AIS	AIS-as-fingerprint.ipynb	AIS fingerprint resolution
Single-token repetition (rep-l)	rep-l_USPTO50k.ipynb	USPTO-50K, retrosynthetic translations
Input-output equivalent mapping	GDB13-results.ipynb	Augmented subset of GDB-13, noncanon-2-canon translations

For example, in the retrosynthesis task:

python src/predict.py --model_type AIS --checkpoint_name AIS_checkpoint.pth \
 --input='[CH3;!R;O] [O;!R;CC] [C;!R;COO] ( = [O;!R;C] ) [c;R;CCS] 1 [cH;R;CC] [c;R;CCC] ( [CH2;!R;CC] [CH2;!R;CC] [CH2;!R;CC] [c;R;CCN] 2 [cH;R;CC] [c;R;CCC] 3 [c;R;CNO] ( = [O;!R;C] ) [nH;R;CC] [c;R;NNN] ( [NH2;!R;C] ) [n;R;CC] [c;R;CNN] 3 [nH;R;CC] 2 ) [cH;R;CS] [s;R;CC] 1'
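
The token string contains spaces, brackets, and semicolons, so it must reach `src/predict.py` as a single quoted argument. When driving predictions from Python, building the command as an argument list avoids shell quoting mistakes. A minimal sketch, assuming only the three flags listed in the table; `build_predict_command` is a hypothetical helper, and the input here is truncated for brevity.

```python
import shlex


def build_predict_command(model_type: str, checkpoint_name: str, ais_input: str) -> list:
    """Assemble the predict.py invocation as an argument list (no shell quoting needed)."""
    return [
        "python", "src/predict.py",
        "--model_type", model_type,
        "--checkpoint_name", checkpoint_name,
        f"--input={ais_input}",  # the whole token string stays one argument
    ]


cmd = build_predict_command("AIS", "AIS_checkpoint.pth", "[CH3;!R;O] [O;!R;CC]")

# shlex.join shows the equivalent, safely quoted shell command line.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.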

Cite this work

@article{10.1186/s13321-023-00725-9, 
year = {2023}, 
title = {{Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization}}, 
author = {Ucak, Umit V. and Ashyrmamatov, Islambek and Lee, Juyong}, 
journal = {Journal of Cheminformatics}, 
doi = {10.1186/s13321-023-00725-9}, 
pages = {55}, 
number = {1}, 
volume = {15}, 
keywords = {}
}

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
