Tim bullet 4
Tim Vandermeersch:
Is "[0S]" the same as "[S]"?
Examples in aromaticity section.
Tim Vandermeersch:
Two comments about the specification:
* section 2.2 Grammar: the chiral definition for trigonal bipyramidal has
too many classes (currently TB1, ..., TB30, should be TB1, ..., TB20).
* Grammar: '|' missing between SPACE and TAB
* section 3.9.6: Second paragraph: "The shorthand notations '@' and
'@@' correspond to clockwise and anti-clockwise tetrahedral chirality,
and are the same a '@TH1' and '@TH2', respectively. Likewise, in an
allenal configuration, the shorthand notations '@' and '@@' correspond
to '@AL1' and '@AL2', respectively." The words "clockwise" and
"anti-clockwise" should be exchanged for clarity I think.
* OpenSMILES Section 3.9.2 (Tetrahedral Centers):
"A chiral center can be written starting from any of its neighbor
atoms, and the choice of whether to list the remaining neighbor in
clockwise or anticlockwise order is also arbitrary."
I think the first part should be changed. A chiral atom may also be
specified as the first atom. This is even required when writing a
canonical smiles where the canonicalization algorithm places a chiral
atom to be the one with the lowest rank.
The SMILES [C@H]1=[C@@H][C@@H]=[C@@H][C@@H]=[C@@H][C@@H]=[C@@H]1 in
section 6.3 also allows this. Most toolkits support this. See also:
Also, I think the phrasing of "The the remaining three neighbor atoms
are written by listing them in anticlockwise order" could be clarified to
"... by listing the bonds connecting them to the chiral center in
anticlockwise order in the SMILES string ..."
It's actually the bonds' order in the SMILES string that should be
considered. This differs from the atom order when one or more
ring bonds originate from the chiral atom. (see link to Noel's blog)
Noel O'Boyle:
I noted some of these in the past too, but didn't have the chance to
change them.
Andrew Dalke:
* The OpenSMILES spec does not place a bound on the acceptable range of charges.
My view is that we say OpenSMILES parsers must support charges from -15 to +15
(that's 5 bits). Support for a wider range is implementation specific.
This would go into the "Programming Practices" section.
There was general agreement that this is acceptable. See the thread
"spec clarification on charge limits" for a full discussion, including
my observation that existing PubChem records have charges from -4 to +9.
* support the empty string as valid SMILES for an empty molecule
* support for [b] (omission in the spec)
* only support for @TB1 ... @TB20 (.. @TB30 is an error in the spec)
* support for '0' as the first digit of two digit chiral indicators,
that is, I support both @OH01 and @OH1.
* The OpenSMILES list of element symbols ends with:
There have been three IUPAC elements since then: copernicium (Cn),
flerovium (Fl) and livermorium (Lv).
These likely should be added to OpenSMILES. However, this is a nearly
theoretical point as only a few score atoms of each have been detected
and they won't be in any cheminformatics databases.
There is a slight problem with adding them. Currently [Cn] means to
match an atom which is an aliphatic carbon and is an aromatic nitrogen.
Which of course means it matches nothing. This is a round-about way
to write "don't match".
With the above change, Cn would match copernicium atoms, so it's not
completely backwards compatible.
We could also say that elements after Rg must be written in '#' form,
as '[#112]' for Cn.
Or we could do nothing. Which annoys the OCD part of me. ;) I would
prefer adding Cn, Fl, and Lv as possible symbols.
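The chiral-indicator points in Andrew's list (only @TB1 ... @TB20, and a
leading '0' accepted so that @OH01 equals @OH1) can be sketched as a small
validator. This is only an illustration: the function name and the MAX_CLASS
table are assumptions made for this sketch (TB20 comes from the notes above;
the other limits are assumed), not text from the spec.

```python
import re

# Assumed maximum class number per chirality type. TB = 20 is taken from the
# notes above; the remaining limits are assumptions for this sketch.
MAX_CLASS = {"TH": 2, "AL": 2, "SP": 3, "TB": 20, "OH": 30}

# "@" + two-letter type + one or two digits (a leading zero is allowed).
_CHIRAL = re.compile(r"@(TH|AL|SP|TB|OH)([0-9]{1,2})")

def parse_chiral_class(text):
    """Return (type, number) for an indicator such as '@TB7' or '@OH01'."""
    m = _CHIRAL.fullmatch(text)
    if m is None:
        raise ValueError(f"not a chiral class indicator: {text!r}")
    kind, number = m.group(1), int(m.group(2))
    if not 1 <= number <= MAX_CLASS[kind]:
        raise ValueError(f"{kind} class out of range: {number}")
    return kind, number
```

With this sketch, "@OH01" and "@OH1" both give ("OH", 1), while "@TB21"
through "@TB30" are rejected.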
Updated through 10/05/2007 10:40 PM.
Greg Landrum: I would like to see something ... that flags the format type
... [and] version number in the format code. ...I'm not sure what form this
format/version coding would take, but we could borrow from InChI and do
something like: "SMILESv1=c1ccccc1".
Others suggested that this was a good idea but was outside the SMILES spec.
Polymers and Crystals: The proposed extension to allow polymers and
crystals using (for example) C&1&1&1 for graphite was vetoed by one
participant (Andrew?), who points out that it makes things like
substructure searching, aromaticity detection, canonicalization, and many
other things, very difficult.
Get rid of [Mg++] notation, require [Mg+2] form.
ORGANIC_SYMBOL should include 's' and 'p'
Is "[0S]" the same as "[S]"?
uses "atomic mass" instead of "atomic weight."
What is the minimum hcount that a library must support? 9 hydrogens?
What's the minimum valence that a library must support? 15?
Agrees that '$' for quadruple bond should be allowed.
Disagrees that '.' is a valid bond.
See Andrew: 09/30/07 11:10 for better wording on hydrogen and charge.
Move "Isotopes" section in with atoms instead of separate.
> If no isotope is specified, an atom is assumed to have either the
> naturally occuring isotopes for that element, or the isotope of the
> element is unknown, or the isotope is unspecified. SMILES has no
> way to distinguish these three cases.
Rewrite to say it's the naturally occurring isotopic mixture.
Missing ring-closure bonds - Allow all three forms
C=1CCCCC1 recommended form
are all valid. Conflicts such as C-1CCCCC=1 are illegal
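A minimal sketch of the ring-closure rule, assuming the three allowed forms
are: bond symbol on the opening digit only, on the closing digit only, or the
same symbol on both. The function name is hypothetical; None stands for "no
bond symbol written".

```python
def ring_closure_bond(open_symbol, close_symbol):
    """Reconcile the bond symbols written at the two ends of a ring closure.

    Returns the bond symbol to use, or None if neither end specifies one
    (the default bond). Raises ValueError if the two ends disagree.
    """
    if open_symbol and close_symbol and open_symbol != close_symbol:
        raise ValueError(
            f"conflicting ring-closure bonds: {open_symbol!r} vs {close_symbol!r}"
        )
    # At most one distinct symbol remains; take whichever end supplied it.
    return open_symbol or close_symbol
```

For C=1CCCCC1 the call is ring_closure_bond("=", None), which gives "=".
For C-1CCCCC=1 the call is ring_closure_bond("-", "="), which raises.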
Missing items in spec (Craig -- not posted to list)
* Atom can't be bonded to itself
* Only one bond can join two atoms,
Thus the following are illegal:
Last branch can be in parentheses, e.g.
should be legal.
Make any whitespace (space, tab, CR, LF) the official end of the SMILES string
by rule. Simplifies parsing.
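As a sketch of how this rule simplifies parsing, a reader can cut the line at
the first whitespace character and hand only the first part to the SMILES
parser. The function name and the (smiles, rest) return shape are assumptions
made for this example.

```python
def split_smiles_line(line):
    """Split a line into (smiles, rest) at the first space, tab, CR, or LF.

    The SMILES part may be empty, which matches the proposal elsewhere in
    these notes to accept the empty string as an empty molecule.
    """
    for i, ch in enumerate(line):
        if ch in " \t\r\n":
            return line[:i], line[i + 1:]
    # No whitespace at all: the whole line is the SMILES string.
    return line, ""
```

For example, split_smiles_line("c1ccccc1\tbenzene") gives
("c1ccccc1", "benzene").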
Document structure
Separate the introductory material:
* Motivation
* Background: valence model
* History
Move SMILES file discussion etc. to separate doc, or to "best practices"
Section 2.3, "see below" regarding aromaticity needs to be a hyperlink
Minor point: say "is not present" - missing implies that it
should be there, while the preferred form is without the digit
if there is only one hydrogen.
Suggested H-count description:
- Hydrogens are specified as "H" or "Hn" where "n" is a digit
from 0 to 9. Examples are "H", "H3" and "H1" meaning
that there is 1, 3 and 1 attached hydrogen, respectively.
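The suggested H-count wording maps directly onto a one-digit pattern. A
minimal sketch (the function name is hypothetical):

```python
import re

# "H" optionally followed by exactly one digit, as in the suggested wording.
_HCOUNT = re.compile(r"H([0-9])?")

def parse_hcount(text):
    """Return the hydrogen count encoded by 'H' or 'Hn' (n = 0..9)."""
    m = _HCOUNT.fullmatch(text)
    if m is None:
        raise ValueError(f"not a valid H-count: {text!r}")
    # A bare "H" means exactly one hydrogen, the same as "H1".
    return 1 if m.group(1) is None else int(m.group(1))
```

Here "H", "H3" and "H1" give 1, 3 and 1, matching the examples in the
suggested text.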
Suggested charge description:
- Positive charge is specified by "+", "++", or "+n" where
n is a digit from 0 to 9. Examples are "+", "++", "+2", "+4"
and "+0" meaning charges of +1, +2, +2, +4 and 0, respectively.
A charge of +0 means that the atom is uncharged.
- Negative charge is specified by "-", "--", or "-n" where ...
n is a digit from 0 to 9. Examples are "-", "--", "-2", "-4"
and "-0" meaning charges of -1, -2, -2, -4 and 0, respectively.
A charge of -0 means that the atom is uncharged.
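The suggested charge wording can be sketched as follows. Two things here are
assumptions rather than part of the suggested text: the sketch accepts up to
two digits so that it can also illustrate the -15 to +15 range proposed
elsewhere in these notes, and the names are invented for the example.

```python
import re

# Assumed bounds, taken from the "-15 to +15" proposal in these notes.
MIN_CHARGE, MAX_CHARGE = -15, 15

# "+", "++", "+n" and the "-" equivalents. Allowing two digits is an
# assumption; the suggested wording only mentions a single digit.
_CHARGE = re.compile(r"(\+\+|--|\+|-)([0-9]{1,2})?")

def parse_charge(text):
    """Return the signed formal charge encoded by a charge token."""
    m = _CHARGE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a valid charge: {text!r}")
    sign, digits = m.groups()
    if len(sign) == 2:
        if digits is not None:
            # Forms like "++2" are not part of the suggested syntax.
            raise ValueError(f"not a valid charge: {text!r}")
        magnitude = 2
    else:
        magnitude = 1 if digits is None else int(digits)
    charge = magnitude if sign[0] == "+" else -magnitude
    if not MIN_CHARGE <= charge <= MAX_CHARGE:
        raise ValueError(f"charge out of supported range: {charge}")
    return charge
```

The examples from the suggested text behave as described: "+", "++", "+2",
"+4" and "+0" give 1, 2, 2, 4 and 0.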
Suggested text for "organic subset":
An element in the "organic subset" of B, C, N, O, P, S, F, Cl, Br, I,
and * (the "wildcard" atom) can be written using only its atomic
symbol (that is, without the square brackets, H-count, etc.). When
written in this form the atom:
- has no formal charge
- has undefined atomic mass
- does not specify any chirality
The hydrogen count is implied from the possible valences for the element
(given XXX) and the sum of the orders of the bond types. If the sum is
equal to a possible valence or greater than the highest possible valence
then the hydrogen count is 0. Otherwise the implied hydrogen count is
the difference between the sum and the next highest possible valence.
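The implied-hydrogen rule above can be sketched in a few lines. The valence
table is an assumption for this example (the text above leaves its source as
"XXX"), and the function name is hypothetical.

```python
# Assumed "possible valences" per organic-subset element. The notes leave
# the source of this table open ("given XXX"), so treat it as a placeholder.
VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,),
    "P": (3, 5), "S": (2, 4, 6),
    "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}

def implicit_hcount(symbol, bond_order_sum):
    """Implied hydrogen count for an unbracketed organic-subset atom."""
    for valence in sorted(VALENCES[symbol]):
        if bond_order_sum <= valence:
            # Equal to a possible valence gives 0; otherwise fill up to the
            # next highest possible valence.
            return valence - bond_order_sum
    # Bond order sum exceeds the highest possible valence.
    return 0
```

For example, implicit_hcount("N", 4) is 1 (fill up to valence 5), while
implicit_hcount("N", 3) and implicit_hcount("N", 6) are both 0.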
> An atom with three or more bonds is called a branched atom
Is that the usual terminology? I'm not used to it. I also
see that the term "branched" does not occur elsewhere in the doc.
Controversial subject: Is C.1CCCCC.1 legal? I say yes, everybody else says no.
Need clarification of term "ring bond": it's usually, but not always, a
ring bond.
Andrew argues for allowing different models of chemistry. Craig strongly
disagrees with this idea, because it means a given SMILES can have
different interpretations.
Normalized SMILES: Several people argued for reusing ring digits to avoid
the %nn form as much as possible.
Several people suggested adopting OpenEye's atom-class feature,
e.g. [CH2:1]
For completeness we should do all forms of chirality.
Craig's comment: Contrary to several people's assertion that the other
forms of chirality are rare, they're not. All of the forms of chiral
centers are fairly common in large databases. The only reason they're not
seen much is because most file formats and toolkits can't handle them.
Several people suggested adding R groups: "to support existing Cactus and
JChem/Marvin usage, OEChem allows "[R2]" to be used as short-hand for [*:2]"
Andrew suggested a "Dumb SMILES" normalization, where single (non aromatic)
bonds between aromatic atoms would always be written as '-'. In this form, a very
simple parser that doesn't have Hueckel's rule code to detect
aromaticity can still figure out every bond's type.
Craig's comment: This would be a nice feature for all "normalized" SMILES.
Single bonds between aromatic atoms are just a small percentage of all
bonds, so it wouldn't add much length to most SMILES.
Missing a rule that says your normalized SMILES must use the minimum number
of ring closures. Otherwise, C1.C12.C23.C34.C4 is considered "normalized".
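One way to state the missing rule: the minimum number of ring closures for a
molecular graph is its cyclomatic number (bonds minus atoms plus connected
components). A small sketch, with a hypothetical function name:

```python
def minimum_ring_closures(n_atoms, n_bonds, n_components=1):
    """Fewest ring-closure pairs needed to write a molecular graph as SMILES.

    This is the cyclomatic number of the graph: every bond that is not part
    of a spanning tree of its component needs one ring closure.
    """
    return n_bonds - n_atoms + n_components
```

C1.C12.C23.C34.C4 describes 5 atoms joined by 4 bonds in one component, so
minimum_ring_closures(5, 4) is 0, yet the string uses four closures.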
For [atomic weights], the Blue Obelisk Data Repository should be used [1]
Question: To what extent should we reject completely absurd structures
such as C(C)(C)(C)(C)(C)(C)(C)(C)(C)(C)(C)(C)(C)(C)C? Daylight generates
a warning when non-standard valences are encountered.
Are boron and silicon aromatic?
What about selenium and arsenic?
Rich A's assertion: Lowercase (aromatic) is nothing more than: "if you see
a lower case atom symbol, subtract one from the implicit hydrogen count you
would otherwise assign."
Craig's question: Once you do this, can you deduce aromaticity
using Hueckel's rule, for atoms AND bonds? See the example above:
Rich later says:
These two simple definitions encompass all uses of the lower case SMILES
atom notation I'm aware of. And they bypass the sticky problem of trying to
work out what aromaticity really means. (We don't really want to go there,
do we?)
... these would all be legal under this simplified definition:
Ccc - propylene
c1ccc1 - cyclobutadiene
c1ccC#Cc1 - benzyne
co - formaldehyde
cccc - butadiene
And Chris M. replies:
Of course, this reduced hydrogen count interpretation ties in nicely
with the use of "c" as a radical center.
Cc - ethyl radical
ccc - allyl radical
But maybe you want to represent furan as c1cocc1 where the o is not
H-deficient (Daylight Depict accepts this, as well as c1cOcc1). But then
you might write pyrrole as c1cncc1 (which it does not; it accepts
c1c[nH]cc1 and c1c[NH]cc1). I think all these forms should be
acceptable, since they are unambiguous
Rich replies:
Using the simplified rule for lower-case atom labels eliminates the need to
write the brackets around pyrrole/indole nitrogen (c1cncc1 is
acceptable). I like the idea of c1cocc1 and c1cOcc1 both being acceptable
for furan as well (the underlying assumption being that reduction in
implicit hydrogen count can never lead to a negative number).
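Rich's rule, read literally, can be sketched as below. The assumptions are
labeled in the code: each bond to a written neighbor is counted as order 1,
the valence table is a placeholder, and the names are invented for this
example.

```python
# Assumed normal valences for a few elements; a placeholder table only.
NORMAL_VALENCES = {"C": (4,), "N": (3, 5), "O": (2,), "S": (2, 4, 6)}

def hcount_simplified(symbol, n_neighbors):
    """Implicit H count under the 'lowercase subtracts one hydrogen' rule.

    symbol is the atom symbol as written ('C' or 'c'); n_neighbors is the
    number of written neighbors, each bond counted as order 1 (assumption).
    """
    base = 0
    for valence in NORMAL_VALENCES[symbol.capitalize()]:
        if n_neighbors <= valence:
            base = valence - n_neighbors
            break
    if symbol.islower():
        # Subtract one, but never go below zero (the clamp discussed above).
        base = max(base - 1, 0)
    return base
```

Under this reading "co" gives CH2 and O with no hydrogens (formaldehyde), and
the "o" in c1cocc1 stays at zero hydrogens. Note that "n" with two ring
neighbors also comes out with zero hydrogens.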
Greg replies:
This new rule simplifies pyrrole, so you get to write c1cncc1 instead
of c1c[nH]cc1, but it is really confusing (to me at least) when you
compare pyrrole with pyridine, c1ccncc1. Sometimes an aromatic N with
two explicit neighbors has an implicit H, other times it doesn't.
Egon's suggestion:
Let's give that some more detail:
"a lowercase element symbol indicates an organic subset element
participating in a delocalized bonding system which is part of a ring,
such as aromatic and anti-aromatic rings, and labeled as 'aromatic'.".
"Bonds between 'aromatic' atoms are only labeled aromatic if they
participate in the delocalized bonding system which is part of a ring,
and the same bonding system as defined for the aromatic atoms between
the bond is defined."
"a ':' bond indicates that a bond participates in a delocalized bonding
system which is part of a ring, and is labeled as 'aromatic'"
More from Rich:
Hmm. Under the simplified reduced-implicit-hydrogen interpretation of lower
case atom symbols c1cncc1 is actually the pyrrole radical, if I'm not
mistaken. (Greg later agrees with this)
So under this system you'd have:
c1c[nH]cc1 (but not c1cncc1)
Greg provided some informal rules:
here are my "rules" for lower case atom symbols:
1) They're only allowed in rings.
2) They are only allowed for a subset of atoms (to be agreed upon later).
3) Aromaticity is *not* defined solely using "primitive rings" i.e. it
is possible to have an aromatic fused ring system where one or all of
the subrings is/are not aromatic. An example of this is: c12coccc1ccc2
in which neither of the subrings have 4N+2 electrons, but the fused
system does.
4) A bond of unspecified type between two aromatic atoms is either
aromatic or single. Examples of aromatic-aromatic single bonds are the
bond connecting the rings in biphenyl c1ccccc1c1ccccc1 and the fusing
bond in the previous example.
Craig: Has anyone come up with an example where a single bond or aromatic
bond *must* be written to correctly specify an aromatic system using
Daylight's rules? Or can single/aromatic bonds always be distinguished?
A good example to include: the molecule c12coccc1ccc2 (C12=COC=CC1=CC=C2 if
you'd prefer it kekulized) has a fused 6-ring and 5-ring; neither ring is
itself aromatic, but the fused system is (10 pi electrons). The ring
fusion bond is a ring bond, it's between two aromatic systems, but it's not
an aromatic bond (again, it's not part of the conjugation path).
Andrew provided a new grammar. See 10/02/2007 2:47
Andrew rewrote the whole document. ;-) See 10/03/2007 7:36
Andrew's version is more formal.
In section 3.2 of the draft spec, it would be useful to state explicitly
what happens when one or more of the bonds to the tetrahedral chiral
center is a ring closure. It's needed. I have just corrected a bug in
the OpenBabel parser where this had not been properly handled.
I think this means, "what does the ring digit represent?"