-
Notifications
You must be signed in to change notification settings - Fork 845
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework MaeMolSupplier, fix #2617 #2620
Conversation
Updated with changes to copy over structure and atom properties from the Mae files to the mol, and updated to current master, including @lorton's changes to include PDB info (which I refactored a little). Checks broken because this still depends on some changes that have not yet been updated into maeparser. |
Updated to parse chirality, pseudochirality and stereo bond properties instead of just copying them to the mol, and updated to current master again. This adds a new dependence on another maeparser PR, schrodinger/maeparser/pull/50, which adds constants for the prefixes used in stereo properties, and some comment on how these properties are build. |
I removed the dependence on maeparser's |
Merged in current master (after maeparser and coordgen update). This should build fine now, as schrodinger/maeparser/pull/50 has also been merged |
@ricrogz : those build failures on windows look real. Can you look into them? |
Yeah, sorry, I noticed, but something came up before I could look into it, but will do soon. |
Ok, now it should finally build and pass tests on all platforms :) |
31970d5
to
905c44d
Compare
905c44d
to
1da64c8
Compare
I realized this had become too messy, and have reset the original branch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the diff up to date?
d_reader.reset(new Reader(dp_sInStream)); | ||
d_next_struct = d_reader->next("f_m_ct"); | ||
} | ||
class PDBInfo { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused by this diff - this part seems like it's already merged in. I'm just trying to separate out the two fixes in my head.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I think I know what the problem is: I created the PR, and after that, the PDBInfo
was added, and I had to update the PR to include it too. Is it possible that it is that?
CHECK_INVARIANT(chiral_atom != nullptr, "bad prop value"); | ||
|
||
unsigned nSwaps = 2; | ||
switch (*((++tItr)->rbegin())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the token have a .back()
like vector
and string
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is actually a string, so it does. I'll make this a bit clearer.
CHECK_INVARIANT(bond_indexes.size() == chiral_atom->getDegree(), | ||
"bad prop value"); | ||
|
||
nSwaps += chiral_atom->getPerturbationOrder(bond_indexes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about hydrogens? If this is a chirality, any hydrogen would be at the end. But if this were a molecule with explicit hydrogens and atom-numbering-only absolute configuration, the hydrogen wouldn't necessarily be last in the mae property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the point of doing this, all atoms are still explicit: in the next()
method, first, atoms are parsed, then bonds, then mol properties, and then, Hydrogen nodes are removed, if we opted to do so.
|
||
for (const auto &prop : atom_block.getProperties<double>()) { | ||
if (prop.first == mae::ATOM_X_COORD || prop.first == mae::ATOM_Y_COORD || | ||
// Coordinates are used in defining a conformation, and should not be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weird place for the comment, please move it either into the block or out of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that one is quite weird. Not sure how it got there, though.
//! Set atom properties. Some of these have already been parsed to construct the | ||
//! Atom object, and should be skipped. Also, atom properties may be undefined | ||
//! for an atom, and should also be skipped for that atom. | ||
void set_atom_properties(Atom &atom, const mae::IndexedBlock &atom_block, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably more efficient to set a single property on all atoms and then move on to the next property. We can leave that for later, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually tried both options with just a few properties, and I didn't see too much of a difference. But, as you say, we might come back to this later on.
} | ||
for (const auto &prop : atom_block.getProperties<int>()) { | ||
if (prop.first == mae::ATOM_ATOMIC_NUM) { | ||
// Atomic number was already used in the creation of the atom |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we delete the property when we consume it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really have a preference about that. I don't think it is going to be a big performance improvement, so maybe we can just leave it be, and do it together with the other change about setting one property for all atoms at a time.
} | ||
void addAtoms(const mae::IndexedBlock &atom_block, RWMol &mol) { | ||
// All atoms are guaranteed to have these three field names: | ||
const auto atomic_numbers = atom_block.getIntProperty(mae::ATOM_ATOMIC_NUM); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not quite true. All atoms have either atomic number or atom type (or both).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but we don't have a way of translating to/from atom types without Schrodinger libraries, so we can't handle atom types, which means that need atomic numbers.
const auto from_atom = from_atoms->at(i) - 1; | ||
const auto to_atom = to_atoms->at(i) - 1; | ||
const auto order = bolookup.find(orders->at(i))->second; | ||
if (from_atom > to_atom) continue; // Maestro files double-list bonds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Older Maestro files double-list bonds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They still do if the bond has properties. I guess Maestro files double-list some bonds.
bond->setOwningMol(mol); | ||
bond->setBeginAtomIdx(from_atom); | ||
bond->setEndAtomIdx(to_atom); | ||
mol.addBond(bond, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've only used:
mol.addBond(from_atom, to_atom, order);
Is there a reason to prefer one idiom over the other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mol.addBond(from_atom, to_atom, order);
can't take indexes for from_atom
and to_atom
, they have to be Atom*
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't gone through this in excruciating detail, but it looks reasonable and seems to have tests there. Willing to merge as is.
This started as a fix for #2617, but I added some more changes. That issue is caused by #2579, but after finding the problem, the
MaeMolSupplier
does not stop iteration despite it does not get any more structures, and continues returning "None" forever.Also, this depends on schrodinger/maeparser#46 and schrodinger/maeparser#49, and should not be merged until these changes are accessible to RDKit.
Changes in the PR:
Check the state of the stream after the creation of the Supplier: on construction, the
Reader
populates the internal buffer, so that, if there aren't any problems, the stream may be left either in agood
state, if there was more data available than fits in the internal buffer, or with thefailed
andeof
bits activated if there was less data than the buffer tried to read (both states are ok).Do not read the next block / structure before it is requested. Before this change, the MaeMolSupplier attempted to read the first structure still inside the constructor, right after constructing the Reader object; the second one when the first was requested, etc. With this mechanism, if a problem was encountered while reading the second structure, the first one would not have been returned.
Catching of exceptions thrown during and after reading of the
mae::Block
, and rethrowing them asFileParseException
so that they can be handled by RDKit (e.g., by preventing further reads, like should happen in Chem.rdmolfiles.MaeMolSupplier Never Stops Reading #2617).Changed the way how
atEnd()
is handled. This depends on Provide eof() methods to check end of data schrodinger/maeparser#49.Provide translation of
FileParseException
s into PythonRuntimeErrors
. This is the part that would actually prevent an infinite reading loop as in Chem.rdmolfiles.MaeMolSupplier Never Stops Reading #2617, and also the one I am most interested in, especially because of the comment at:rdkit/Code/GraphMol/Wrap/MolSupplier.h
Lines 90 to 94 in 01fbec3
I tried to iterate over a
MaeMolSupplier
, aSDMolSupplier
and aForwardSDMolSupplier
after introducing this patch, and did not observe any problems like unexpectedRuntimeError
s, e.g. at the end of the iteration).