SMILES parser rejects uncommon order of Atom properties #2632
Comments
From everything I can infer from the Daylight SMILES specification, the standard order within brackets is hydrogen count, then charge (e.g., [NH4+]). So this is definitely a non-standard form, as you suggest, and perhaps even an invalid SMILES according to the Daylight specification. On the other hand, OpenSMILES considers the charge before hydrogen count as valid, but nonstandard. But RDKit does not follow the OpenSMILES specification. Anyway, I hope this provides a bit more context on whether this is a bug or a choice not to support it within RDKit.
I had a brief discussion with Greg about this and we agreed that the original Daylight specs allow this. I think we can assign this as a bug.
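As an illustration of the two orderings discussed here, the following is a minimal sketch of a preprocessing step that rewrites a bracket atom with the charge before the hydrogen count (`[N+H4]`) into the standard order (`[NH4+]`). The names `normalize_bracket_atoms` and `_CHARGE_BEFORE_H` are hypothetical and not part of RDKit or any other toolkit, and the pattern deliberately covers only a simplified bracket-atom grammar (optional isotope, element symbol, `@`/`@@` chirality, charge, hydrogen count, optional atom class).

```python
import re

# Hypothetical helper, not part of RDKit: matches a bracket atom in which the
# charge is written BEFORE the hydrogen count, e.g. [N+H4] or [15N+H4].
# Assumption: only a simplified subset of the bracket-atom grammar is handled.
_CHARGE_BEFORE_H = re.compile(
    r"\[(?P<isotope>\d*)"
    r"(?P<symbol>[A-Za-z][a-z]?)"
    r"(?P<chiral>@{0,2})"
    r"(?P<charge>\+\+|--|[+-]\d*)"
    r"(?P<hcount>H\d*)"
    r"(?P<rest>(?::\d+)?)\]"
)


def normalize_bracket_atoms(smiles: str) -> str:
    """Move the hydrogen count in front of the charge inside bracket atoms."""

    def _reorder(match: re.Match) -> str:
        return "[{isotope}{symbol}{chiral}{hcount}{charge}{rest}]".format(
            **match.groupdict()
        )

    return _CHARGE_BEFORE_H.sub(_reorder, smiles)


print(normalize_bracket_atoms("[N+H4]"))   # charge-first input
print(normalize_bracket_atoms("[NH4+]"))   # already in standard order
```

Bracket atoms that are already in the standard order, or that repeat a property (such as `[N-H4+2]`), do not match the pattern and are passed through unchanged, so the downstream parser still decides whether to accept them.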
Joining this thread a bit later, I would like to add a list of SMILES for which RDKit (release 2019.03.1, installed on Linux Debian 10, branch testing, CPython 3.7.4+) had difficulty assigning a Murcko scaffold, in this case because of less common valences. These SMILES are generated as one of the export options for (small molecule) crystallographic models from the CCDC CSD via its Python API; they are typically N-oxides and isocyanides. The CSD Python API also allows exporting model structures as .mol2 files, which Open Babel can translate into SMILES that are normally accessible to RDKit. However, N-oxides and isocyanides represented this way again trigger an exception clause in my script. Attached below are a listing of the SMILES that RDKit does not process successfully, a visual survey of them generated with Open Babel, and a list of the chemical names of the molecules as found in the corresponding .cif files. chemical_names_unsuccessful.txt
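The Daylight toolkit definitely accepts `[N+H4]`, which it normalizes to `[NH4+]`. OpenSMILES does not accept that form. Toolkits may accept it as an extension. This is a deliberate design choice in OpenSMILES because there seemed to be little need for a non-normal form, and it was a source of bugs in SMILES parsers. (I pushed for this choice.) For example, the Open Babel parser used to accept `[N-H4+2]`, and treat the result as `[N+H4]` (that is, it summed the charges). Another parser might erroneously accept the last charge definition. The Daylight toolkit kept track of which atom attributes it had seen, in order to detect duplicate specifications like this. This subtle source of errors can be eliminated by requiring a specific order. The resulting parser is also simpler.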
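The trade-off described in this comment can be sketched with a toy parser for just the hydrogen-count and charge part of a bracket atom (the text after the element symbol, such as `+H4` or `H4+`). This is only an illustration under stated assumptions: it is not the Daylight, Open Babel, or RDKit implementation, and the names `parse_props_any_order`, `parse_props_strict`, and `_charge_value` are hypothetical. The first function accepts either order and tracks which properties it has already seen so that duplicates are rejected; the second requires hydrogen count before charge, so a single pattern is enough.

```python
import re

# One property token: a hydrogen count (H, H4) or a charge (+, -, ++, --, +2).
_TOKEN = re.compile(r"(?P<hcount>H\d*)|(?P<charge>\+\+|--|[+-]\d*)")
# Fixed order: optional hydrogen count, then optional charge.
_STRICT = re.compile(r"(?P<hcount>H\d*)?(?P<charge>\+\+|--|[+-]\d*)?")


def _charge_value(text: str) -> int:
    """Convert a charge token such as '+', '--' or '+2' to an integer."""
    if text in ("++", "--"):
        return 2 if text[0] == "+" else -2
    sign = 1 if text[0] == "+" else -1
    return sign * int(text[1:] or 1)


def parse_props_any_order(props: str) -> dict:
    """Accept hydrogen count and charge in either order, rejecting duplicates."""
    seen = set()  # property kinds encountered so far
    result = {"hcount": 0, "charge": 0}
    pos = 0
    while pos < len(props):
        match = _TOKEN.match(props, pos)
        if match is None:
            raise ValueError(f"unexpected text {props[pos:]!r}")
        kind = match.lastgroup
        if kind in seen:
            raise ValueError(f"duplicate {kind} in {props!r}")
        seen.add(kind)
        if kind == "hcount":
            result["hcount"] = int(match.group()[1:] or 1)
        else:
            result["charge"] = _charge_value(match.group())
        pos = match.end()
    return result


def parse_props_strict(props: str) -> dict:
    """Accept only the fixed order: hydrogen count first, then charge."""
    match = _STRICT.fullmatch(props)
    if match is None:
        raise ValueError(f"properties out of order or repeated: {props!r}")
    hcount, charge = match.group("hcount"), match.group("charge")
    return {
        "hcount": int(hcount[1:] or 1) if hcount else 0,
        "charge": _charge_value(charge) if charge else 0,
    }
```

In this sketch, `parse_props_any_order` needs the explicit `seen` set to reject `-H4+2`, whereas `parse_props_strict` rejects both `-H4+2` and `+H4` without any bookkeeping because neither fits the fixed order.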
Dear Andrew,
so far, I was lucky enough that the occasional removal of ammonium
compounds (either by removing a proton and the nearby anion, or by
skipping the molecule completely) did not cause significant harm. Yet
I reported my additional encounters in /this/ thread because
+ I assumed that a «market place» of nitrogen-containing molecules
that are difficult for rdkit to handle could extend the already existing
test bench for the program, and
+ frankly, I did not see how the examples provided here, both as drawings
and chemical names (typically iso-nitriles, enones, and N-oxides),
could be discharged like the ammonium compounds.
…On Mon, 30 Sep 2019 07:22:49 -0700 Andrew Dalke ***@***.***> wrote:
The Daylight toolkit definitely accepts `[N+H4]`, which it normalizes
to `[NH4+]`. OpenSMILES does not accept that form. Toolkits may
accept it as an extension.
This is a deliberate design choice in OpenSMILES because there seemed
to be little need for a non-normal form, and it was a source of bugs
in SMILES parsers. (I pushed for this choice.)
For example, the Open Babel parser used to accept `[N-H4+2]`, and
treat the result as `[N+H4]` (that is, it summed the charges).
Another parser might erroneously accept the last charge definition.
The Daylight toolkit kept track of which atom attributes it had seen,
in order to detect duplicate specifications like this. This subtle
source of errors can be eliminated by requiring a specific order. The
resulting parser is also simpler.
nbehrnd, the SMILES in your list fail in places where RDKit doesn't accept a 4-valent nitrogen, or can't find a Kekulé assignment for the structure. This is a different problem from the one this issue thread is about, which concerns how to parse an uncommon order of atom properties in a SMILES string. You should create another issue. However, you may want to discuss this on #358 ("Handling of non-standard or variable valency, like tetra-valent Nitrogen").
Description:
One common customer using MOE and RDKit found that the following SMILES is not parsed by RDKit:
[N+H4]
This was written by MOE2019.0102 and seems to be a valid SMILES.