SMILES parser rejects uncommon order of Atom properties #2632

mkossner · 2019-08-28T14:25:45Z

Description:
A shared customer who uses both MOE and RDKit found that RDKit does not parse the following SMILES:

[N+H4]

This was written by MOE2019.0102 and seems to be a valid SMILES.

vfscalfani · 2019-08-30T14:32:38Z

From everything I can infer from the Daylight SMILES specification, the standard order within brackets is hydrogen count, then charge (e.g., [NH4+]). So this is definitely a non-standard form, as you suggest, and perhaps even an invalid SMILES according to the Daylight specification.

On the other hand, OpenSMILES considers the charge before hydrogen count as valid, but nonstandard. But RDKit does not follow the OpenSMILES specification. Anyway, I hope this provides a bit more context on whether this is a bug or a deliberate choice not to support it in RDKit.
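
To make the ordering question concrete, here is a minimal sketch of a fixed-order check for the hydrogen-count-then-charge form described above. The pattern and the function name `matches_standard_order` are illustrative assumptions, not code from RDKit, Daylight, or OpenSMILES, and the pattern covers only an element symbol, an optional hydrogen count, and an optional single-sign charge.

```python
import re

# Illustrative pattern only (an assumption, not any toolkit's grammar):
# element symbol, then optional hydrogen count, then optional charge,
# in that fixed order. Isotopes, chirality, and atom classes are omitted.
BRACKET_ATOM = re.compile(
    r"\[(?P<symbol>[A-Z][a-z]?)(?P<hcount>H\d*)?(?P<charge>[+-]\d*)?\]"
)


def matches_standard_order(text):
    """Return True if text is a bracket atom in H-count-then-charge order."""
    return BRACKET_ATOM.fullmatch(text) is not None


for smiles in ("[NH4+]", "[N+H4]"):
    print(smiles, matches_standard_order(smiles))
```

A grammar written this way has no path that accepts the charge before the hydrogen count, so `[N+H4]` fails to match while `[NH4+]` matches.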

bp-kelley · 2019-09-10T14:39:05Z

I had a brief discussion with Greg about this, and we agreed that the original Daylight specs allow this. I think we can classify this as a bug.

nbehrnd · 2019-09-18T15:49:48Z

Joining this thread a bit late, I would like to add a list of SMILES for which RDKit (release 2019.03.1, installed on Linux Debian 10, branch testing, CPython 3.7.4+) also had difficulty assigning a Murcko scaffold, in this case because of less common valences. These SMILES were generated as one of the export options for (small-molecule) crystallographic models from the CCDC CSD via its Python API; they are typically N-oxides and isocyanides.

The CSD Python API also allows model structures to be exported as .mol2 files, which Open Babel can translate into SMILES that RDKit normally reads. However, N-oxides and isocyanides represented this way again trigger the exception clause in my script.

Attached below are a listing of the SMILES that RDKit does not process successfully, a visual survey of them generated with Open Babel, and a list of the chemical names of the molecules in question as found in the corresponding .cif files.

chemical_names_unsuccessful.txt
unsuccessful.txt
unsuccessful.pdf
unsuccessful_color.pdf

adalke · 2019-09-30T14:22:43Z

The Daylight toolkit definitely accepts [N+H4], which it normalizes to [NH4+]. OpenSMILES does not accept that form. Toolkits may accept it as an extension.

This is a deliberate design choice in OpenSMILES because there seemed to be little need for a non-normal form, and it was a source of bugs in SMILES parsers. (I pushed for this choice.)

For example, the Open Babel parser used to accept [N-H4+2], and treat the result as [N+H4] (that is, it summed the charges). Another parser might erroneously accept the last charge definition.

The Daylight toolkit kept track of which atom attributes it had seen, in order to detect duplicate specifications like this. This subtle source of errors can be eliminated by requiring a specific order. The resulting parser is also simpler.
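
A minimal sketch of the bookkeeping described above: fields are accepted in either order, but each may appear only once, so `[N-H4+2]` is rejected rather than summed or silently overwritten. The names `parse_bracket_atom`, `normalize`, and the `strict_order` flag are hypothetical and do not reflect how the Daylight toolkit, Open Babel, or RDKit are actually implemented. Only a hydrogen count and a single signed charge are handled.

```python
import re

# Hypothetical token pattern: a hydrogen count ("H", "H4") or a charge
# ("+", "-", "+2"). Repeated signs such as "++" are not supported here.
_TOKEN = re.compile(r"H(\d*)|([+-])(\d*)")


def parse_bracket_atom(text, strict_order=False):
    """Return (symbol, hcount, charge) for a simple bracket atom."""
    m = re.fullmatch(r"\[([A-Z][a-z]?)(.*)\]", text)
    if m is None:
        raise ValueError(f"not a bracket atom: {text!r}")
    symbol, rest = m.groups()
    seen = {}  # records which attributes have already been specified
    pos = 0
    while pos < len(rest):
        tok = _TOKEN.match(rest, pos)
        if tok is None:
            raise ValueError(f"unsupported text at {rest[pos:]!r}")
        if tok.group(2) is None:  # hydrogen-count branch matched
            key, value = "hcount", int(tok.group(1) or 1)
        else:  # charge branch matched
            magnitude = int(tok.group(3) or 1)
            key = "charge"
            value = magnitude if tok.group(2) == "+" else -magnitude
        if key in seen:
            raise ValueError(f"duplicate {key} in {text!r}")
        if strict_order and key == "hcount" and "charge" in seen:
            raise ValueError(f"charge before hydrogen count in {text!r}")
        seen[key] = value
        pos = tok.end()
    return symbol, seen.get("hcount", 0), seen.get("charge", 0)


def normalize(text):
    """Rewrite a bracket atom in hydrogen-count-then-charge order."""
    symbol, hcount, charge = parse_bracket_atom(text)
    h = "" if hcount == 0 else "H" if hcount == 1 else f"H{hcount}"
    if charge == 0:
        c = ""
    else:
        sign = "+" if charge > 0 else "-"
        c = sign if abs(charge) == 1 else f"{sign}{abs(charge)}"
    return f"[{symbol}{h}{c}]"
```

With `strict_order=True` the same function rejects `[N+H4]`, which corresponds to the fixed-order behaviour. The `seen` dictionary is the extra state that a fixed-order grammar does not need.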

nbehrnd · 2019-09-30T16:16:42Z

Dear Andrew, so far I have been lucky that occasionally removing ammonium compounds, either by removing a proton and the nearby anion or by skipping the molecule completely, did not cause significant harm. Still, I reported my additional cases in /this/ thread for two reasons:

+ I assumed that a «market place» of nitrogen-containing molecules that RDKit finds difficult to handle could extend the program's existing test bench.
+ Frankly, I did not see how the examples provided here, in both drawing and chemical name (typically isonitriles, enones, and N-oxides), could be discharged like the ammonium compounds.

adalke · 2019-10-01T14:27:17Z

nbehrnd, the failures in your list of SMILES occur where RDKit doesn't accept a 4-valent nitrogen or can't find a Kekule assignment for the structure. This is a different problem from the one this issue thread is about, which concerns how to parse an uncommon order of atom properties in the SMILES string. You should create another issue.

However, I think you may want to discuss this on #358 ("Handling of non-standard or variable valency, like tetra-valent Nitrogen").

bp-kelley added the bug label Sep 10, 2019

bp-kelley mentioned this issue May 21, 2020

RDKit can't parse non-standard smiles, but chemdraw can parse it #3179

Closed
