SMILES parser rejects uncommon order of Atom properties #2632

Open
mkossner opened this issue Aug 28, 2019 · 6 comments

@mkossner

Description:
A customer who uses both MOE and RDKit found that RDKit does not parse the following SMILES:

[N+H4]

This was written by MOE2019.0102 and appears to be a valid SMILES.

@vfscalfani
Contributor

From everything I can infer from the Daylight SMILES specification, the standard order within brackets is hydrogen count, then charge (e.g., [NH4+]). So this is definitely a non-standard form, as you suggest, and perhaps even an invalid SMILES according to the Daylight specification.

On the other hand, OpenSMILES considers charge before hydrogen count to be valid but nonstandard. However, RDKit does not follow the OpenSMILES specification. I hope this gives a bit more context on whether this is a bug or a deliberate choice not to support the form in RDKit.

@bp-kelley bp-kelley added the bug label Sep 10, 2019
@bp-kelley
Contributor

I had a brief discussion with Greg about this, and we agreed that the original Daylight specs allow this form. I think we can classify this as a bug.

@nbehrnd
Contributor

nbehrnd commented Sep 18, 2019

Joining this thread a bit late, I would like to add a list of SMILES for which RDKit (release 2019.03.1, installed on Linux Debian 10, branch testing, CPython 3.7.4+) also had difficulty assigning a Murcko scaffold, in this case because of less common valences. These SMILES come from one of the export options for (small-molecule) crystallographic models in the CCDC CSD via its Python API; the affected structures are typically N-oxides and isocyanides.

The CSD Python API also allows exporting model structures as .mol2 files, which Open Babel can translate into SMILES that RDKit normally accepts. However, N-oxides and isocyanides represented this way still trigger the exception clause in my script.

Attached below are a listing of the SMILES that RDKit does not process successfully, a visual survey of them generated with Open Babel, and a list of the chemical names of these molecules as found in the corresponding .cif files.

chemical_names_unsuccessful.txt
unsuccessful.txt
unsuccessful.pdf
unsuccessful_color.pdf

@adalke
Contributor

adalke commented Sep 30, 2019

The Daylight toolkit definitely accepts [N+H4], which it normalizes to [NH4+]. OpenSMILES does not accept that form. Toolkits may accept it as an extension.

This is a deliberate design choice in OpenSMILES because there seemed to be little need for a non-normal form, and it was a source of bugs in SMILES parsers. (I pushed for this choice.)

For example, the Open Babel parser used to accept [N-H4+2], and treat the result as [N+H4] (that is, it summed the charges). Another parser might erroneously accept the last charge definition.
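To make the ambiguity concrete, here is a minimal Python sketch of the two lenient behaviours described above. The function names are hypothetical and the token scan is deliberately simplified (it ignores forms such as `++`); it is not the code of Open Babel or any other toolkit.

```python
import re

# One charge token: a sign followed by an optional magnitude (simplified).
CHARGE_TOKEN = re.compile(r"([+-])(\d*)")


def charge_tokens(bracket_atom):
    """Return every signed charge value found in a bracket atom string."""
    values = []
    for sign, digits in CHARGE_TOKEN.findall(bracket_atom):
        magnitude = int(digits) if digits else 1
        values.append(magnitude if sign == "+" else -magnitude)
    return values


def charge_by_sum(bracket_atom):
    """Lenient strategy 1: add up all charge tokens."""
    return sum(charge_tokens(bracket_atom))


def charge_by_last(bracket_atom):
    """Lenient strategy 2: keep only the last charge token."""
    tokens = charge_tokens(bracket_atom)
    return tokens[-1] if tokens else 0


print(charge_by_sum("[N-H4+2]"))   # summing strategy
print(charge_by_last("[N-H4+2]"))  # last-token strategy
```

The two strategies agree on a single-charge input such as `[NH4+]` but disagree on `[N-H4+2]`, which is why silently accepting the duplicate is risky.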

The Daylight toolkit kept track of which atom attributes it had seen, in order to detect duplicate specifications like this. This subtle source of errors can be eliminated by requiring a specific order. The resulting parser is also simpler.
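The trade-off between the two designs can be sketched in Python. This is only an illustration under strong assumptions: it handles just an element symbol, a hydrogen count, and a charge, and the names `parse_strict` and `parse_any_order` are hypothetical, not taken from the Daylight toolkit, OpenSMILES, or RDKit.

```python
import re

# Fixed order: symbol, then optional H count, then optional charge.
STRICT = re.compile(
    r"\[(?P<symbol>[A-Z][a-z]?)(?:H(?P<h>\d*))?(?:(?P<sign>[+-])(?P<mag>\d*))?\]"
)
# Any order: each property is one token, matched repeatedly.
TOKEN = re.compile(r"H(?P<h>\d*)|(?P<sign>[+-])(?P<mag>\d*)")


def _charge(sign, mag):
    value = int(mag or 1)
    return value if sign == "+" else -value


def parse_strict(text):
    """Accept only the fixed order; duplicates cannot occur by construction."""
    m = STRICT.fullmatch(text)
    if m is None:
        raise ValueError(f"not in hydrogen-then-charge order: {text}")
    hcount = 0 if m.group("h") is None else int(m.group("h") or 1)
    charge = _charge(m.group("sign"), m.group("mag")) if m.group("sign") else 0
    return {"symbol": m.group("symbol"), "hcount": hcount, "charge": charge}


def parse_any_order(text):
    """Accept either order, tracking seen properties to reject duplicates."""
    outer = re.fullmatch(r"\[([A-Z][a-z]?)(.*)\]", text)
    if outer is None:
        raise ValueError(f"not a bracket atom: {text}")
    symbol, rest = outer.groups()
    atom = {"symbol": symbol, "hcount": 0, "charge": 0}
    seen = set()
    pos = 0
    while pos < len(rest):
        m = TOKEN.match(rest, pos)
        if m is None:
            raise ValueError(f"unexpected text at offset {pos} in {rest!r}")
        kind = "charge" if m.group("sign") else "hcount"
        if kind in seen:
            raise ValueError(f"duplicate {kind} in {text}")
        seen.add(kind)
        if kind == "hcount":
            atom["hcount"] = int(m.group("h") or 1)
        else:
            atom["charge"] = _charge(m.group("sign"), m.group("mag"))
        pos = m.end()
    return atom
```

In this sketch `parse_any_order("[N+H4]")` and `parse_strict("[NH4+]")` return the same atom, `parse_strict("[N+H4]")` raises `ValueError`, and `parse_any_order("[N-H4+2]")` raises because the charge appears twice. The strict version needs no bookkeeping at all, which is the simplification the fixed order buys.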

@nbehrnd
Contributor

nbehrnd commented Sep 30, 2019 via email

@adalke
Contributor

adalke commented Oct 1, 2019

nbehrnd, the failures in your list of SMILES come from places where RDKit doesn't accept a 4-valent nitrogen or can't find a Kekulé assignment for the structure. That is a different problem from the one this issue thread is about, which is how to parse an uncommon order of atom properties in a SMILES string. You should create another issue.

However, I think you may want to discuss this on #358 ("Handling of non-standard or variable valency, like tetra-valent Nitrogen").


5 participants