FrequentlyAskedQuestions

Greg Landrum edited this page Mar 31, 2023 · 4 revisions

Frequently Asked Questions, with answers

If you would like to contribute a question (and/or answer) to this list, or if you have suggestions about improvements to an existing answer, please use the FAQ Discussion topic. We will migrate questions and answers from there to this list.

Reading molecules

Can't kekulize mol.

Here's an example of what this looks like:

>>> m = Chem.MolFromSmiles('c1nccc1')
[14:05:30] Can't kekulize mol.  Unkekulized atoms: 0 1 2 3 4

When you read in a molecule, the RDKit by default does a lot of preprocessing, called sanitization, to perceive chemistry and detect errors. The documentation has a detailed description of this.

One of the checks performed by default is to convert any aromatic systems into their Kekule forms (i.e. all aromatic bonds are replaced with alternating single and double bonds). The kekulization process requires that the RDKit know how many implicit hydrogens an atom has so that it can figure out how many double bonds the atom can accept. Aromatic heteroatoms, like the N in the example above, are a challenge here, because they can have different numbers of implicit Hs depending on the nature of the ring they are in. Compare, for example, the N in a pyridine ring, which has no implicit hydrogens, with the N in pyrrole, which has one implicit H. The kekulization error shown above arises for rings where the RDKit cannot figure out a chemically reasonable Kekule form. In these cases an error is reported and the molecule is rejected: MolFromSmiles() returns None instead of a molecule.

Since these errors are generally the result of mistakes in the input, the best solution is usually to fix the input structure. So, for the example above, I can just add the H that I forgot:

>>> m = Chem.MolFromSmiles('c1[nH]ccc1')
>>> m.GetNumAtoms()
5

A natural follow-up question is "Why can't the RDKit just add the H to the heteroatom in these cases?" This is straightforward enough if there's a single heteroatom, but when more than one is present, as in the SMILES Cc1ncnc1, the code would have to arbitrarily pick one. The RDKit generally avoids this kind of guessing at what the user meant.
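The ambiguity is easy to demonstrate. In the sketch below, the two candidate SMILES are my own hand-written fixes for Cc1ncnc1, one for each nitrogen that could carry the H. Both parse, and they produce different canonical SMILES, so they are different structures as far as the RDKit is concerned.

```python
from rdkit import Chem

ambiguous = 'Cc1ncnc1'
# Two hand-written repairs: the H placed on one ring nitrogen or the other.
candidates = ['Cc1[nH]cnc1', 'Cc1nc[nH]c1']

# The ambiguous input is rejected, so this prints None.
print(ambiguous, '->', Chem.MolFromSmiles(ambiguous))

canonical = []
for smi in candidates:
    mol = Chem.MolFromSmiles(smi)
    canonical.append(Chem.MolToSmiles(mol))
    print(smi, '->', canonical[-1])
```

Which of the two is right depends on what the input was supposed to represent, and that is information only the person who supplied the structure has.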

Explicit valence for atom ... is greater than permitted.

Here's an example of what this looks like:

>>> m = Chem.MolFromSmiles('CN(C)(C)C')
[07:52:08] Explicit valence for atom # 1 N, 4, is greater than permitted

When you read in a molecule, the RDKit by default does a lot of preprocessing, called sanitization, to perceive chemistry and detect errors. The documentation has a detailed description of this.

One of the checks performed by default is to make sure that all of the atoms in the molecule have valence states which make chemical sense. If there's an atom with a chemically unreasonable valence state (like the four-coordinate neutral N in the example above), then an error is reported and the molecule is rejected: MolFromSmiles() returns None instead of a molecule.
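When you are working through a batch of structures, it can help to collect the offending atoms programmatically rather than reading them off the log. The following is a minimal sketch, assuming an RDKit release that provides `Chem.DetectChemistryProblems()` (2020.03 or later, as far as I know); the helper name `valence_problems` is invented for this example.

```python
from rdkit import Chem


def valence_problems(smiles):
    """List (atom index, element symbol) for atoms with an unreasonable valence.

    Hypothetical helper, not part of the RDKit API.
    """
    # Skip sanitization so that a molecule comes back despite the bad valence.
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    found = []
    for problem in Chem.DetectChemistryProblems(mol):
        # Other problem types (e.g. kekulization failures) are ignored here.
        if problem.GetType() == 'AtomValenceException':
            idx = problem.GetAtomIdx()
            found.append((idx, mol.GetAtomWithIdx(idx).GetSymbol()))
    return found


print(valence_problems('CN(C)(C)C'))     # neutral four-coordinate N
print(valence_problems('C[N+](C)(C)C'))  # same skeleton with the charge added
```

The atom index reported here is the same one that appears after "atom #" in the error message.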

Since these errors are generally the result of mistakes in the input, the best solution is usually to fix the input structure. So, for the example above, I can just add the positive charge that I forgot:

>>> m = Chem.MolFromSmiles('C[N+](C)(C)C')
>>> m.GetNumAtoms()
5