Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework MaeMolSupplier, fix #2617 #2620

Merged
merged 3 commits into from
Dec 3, 2019

Conversation

ricrogz
Copy link
Contributor

@ricrogz ricrogz commented Aug 17, 2019

This started as a fix for #2617, but I added some more changes. That issue is caused by #2579, but after finding the problem, the MaeMolSupplier does not stop iteration despite it does not get any more structures, and continues returning "None" forever.

Also, this depends on schrodinger/maeparser#46 and schrodinger/maeparser#49, and should not be merged until these changes are accessible to RDKit.

Changes in the PR:

  • Check the state of the stream after the creation of the Supplier: on construction, the Reader populates the internal buffer, so that, if there aren't any problems, the stream may be left either in a good state, if there was more data available than fits in the internal buffer, or with the failed and eof bits activated if there was less data than the buffer tried to read (both states are ok).

  • Do not read the next block / structure before it is requested. Before this change, the MaeMolSupplier attempted to read the first structure still inside the constructor, right after constructing the Reader object; the second one when the first was requested, etc. With this mechanism, if a problem was encountered while reading the second structure, the first one would not have been returned.

  • Catching of exceptions thrown during and after reading of the mae::Block, and rethrowing them as FileParseException so that they can be handled by RDKit (e.g., by preventing further reads, like should happen in Chem.rdmolfiles.MaeMolSupplier Never Stops Reading #2617).

  • Changed the way how atEnd() is handled. This depends on Provide eof() methods to check end of data schrodinger/maeparser#49.

  • Provide translation of FileParseExceptions into Python RuntimeErrors. This is the part that would actually prevent an infinite reading loop as in Chem.rdmolfiles.MaeMolSupplier Never Stops Reading #2617, and also the one I am most interested in, especially because of the comment at:

    // it's kind of doofy that we can't just catch the FileParseException
    // that is thrown when we run off the end of the supplier, but it seems
    // that the cross-shared library exception handling thing is just too
    // hard for me as of boost 1.35.0 (also 1.34.1) and g++ 4.1.x on linux
    // this approach works as well:

    I tried to iterate over a MaeMolSupplier, a SDMolSupplier and a ForwardSDMolSupplier after introducing this patch, and did not observe any problems like unexpected RuntimeErrors, e.g. at the end of the iteration).

@ricrogz
Copy link
Contributor Author

ricrogz commented Aug 28, 2019

Updated with changes to copy over structure and atom properties from the Mae files to the mol, and updated to current master, including @lorton's changes to include PDB info (which I refactored a little).

Checks broken because this still depends on some changes that have not yet been updated into maeparser.

@ricrogz
Copy link
Contributor Author

ricrogz commented Aug 30, 2019

Updated to parse chirality, pseudochirality and stereo bond properties instead of just copying them to the mol, and updated to current master again.

This adds a new dependence on another maeparser PR, schrodinger/maeparser/pull/50, which adds constants for the prefixes used in stereo properties, and some comment on how these properties are build.

@ricrogz
Copy link
Contributor Author

ricrogz commented Sep 14, 2019

I removed the dependence on maeparser's eof() method, as we had certain reservations on it, and probably won't implement it (or not this way/right now).

@ricrogz
Copy link
Contributor Author

ricrogz commented Sep 22, 2019

Merged in current master (after maeparser and coordgen update). This should build fine now, as schrodinger/maeparser/pull/50 has also been merged

@greglandrum
Copy link
Member

@ricrogz : those build failures on windows look real. Can you look into them?

@ricrogz
Copy link
Contributor Author

ricrogz commented Sep 22, 2019

@ricrogz : those build failures on windows look real. Can you look into them?

Yeah, sorry, I noticed, but something came up before I could look into it, but will do soon.

@ricrogz
Copy link
Contributor Author

ricrogz commented Oct 3, 2019

Ok, now it should finally build and pass tests on all platforms :)

@ricrogz
Copy link
Contributor Author

ricrogz commented Nov 1, 2019

I realized this had become too messy, and have reset the original branch.

Copy link
Contributor

@d-b-w d-b-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the diff up to date?

d_reader.reset(new Reader(dp_sInStream));
d_next_struct = d_reader->next("f_m_ct");
}
class PDBInfo {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused by this diff - this part seems like it's already merged in. I'm just trying to separate out the two fixes in my head.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I think I know what the problem is: I created the PR, and after that, the PDBInfo was added, and I had to update the PR to include it too. Is it possible that it is that?

CHECK_INVARIANT(chiral_atom != nullptr, "bad prop value");

unsigned nSwaps = 2;
switch (*((++tItr)->rbegin())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the token have a .back() like vector and string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is actually a string, so it does. I'll make this a bit clearer.

CHECK_INVARIANT(bond_indexes.size() == chiral_atom->getDegree(),
"bad prop value");

nSwaps += chiral_atom->getPerturbationOrder(bond_indexes);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about hydrogens? If this is a chirality, any hydrogen would be at the end. But if this were a molecule with explicit hydrogens and atom-numbering-only absolute configuration, the hydrogen wouldn't necessarily be last in the mae property.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the point of doing this, all atoms are still explicit: in the next() method, first, atoms are parsed, then bonds, then mol properties, and then, Hydrogen nodes are removed, if we opted to do so.


for (const auto &prop : atom_block.getProperties<double>()) {
if (prop.first == mae::ATOM_X_COORD || prop.first == mae::ATOM_Y_COORD ||
// Coordinates are used in defining a conformation, and should not be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weird place for the comment, please move it either into the block or out of it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that one is quite weird. Not sure how it got there, though.

//! Set atom properties. Some of these have already been parsed to construct the
//! Atom object, and should be skipped. Also, atom properties may be undefined
//! for an atom, and should also be skipped for that atom.
void set_atom_properties(Atom &atom, const mae::IndexedBlock &atom_block,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably more efficient to set a single property on all atoms and then move on to the next property. We can leave that for later, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually tried both options with just a few properties, and I didn't see too much of a difference. But, as you say, we might come back to this later on.

}
for (const auto &prop : atom_block.getProperties<int>()) {
if (prop.first == mae::ATOM_ATOMIC_NUM) {
// Atomic number was already used in the creation of the atom
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we delete the property when we consume it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really have a preference about that. I don't think it is going to be a big performance improvement, so maybe we can just leave it be, and do it together with the other change about setting one property for all atoms at a time.

}
void addAtoms(const mae::IndexedBlock &atom_block, RWMol &mol) {
// All atoms are guaranteed to have these three field names:
const auto atomic_numbers = atom_block.getIntProperty(mae::ATOM_ATOMIC_NUM);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not quite true. All atoms have either atomic number or atom type (or both).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but we don't have a way of translating to/from atom types without Schrodinger libraries, so we can't handle atom types, which means that need atomic numbers.

const auto from_atom = from_atoms->at(i) - 1;
const auto to_atom = to_atoms->at(i) - 1;
const auto order = bolookup.find(orders->at(i))->second;
if (from_atom > to_atom) continue; // Maestro files double-list bonds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Older Maestro files double-list bonds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They still do if the bond has properties. I guess Maestro files double-list some bonds.

bond->setOwningMol(mol);
bond->setBeginAtomIdx(from_atom);
bond->setEndAtomIdx(to_atom);
mol.addBond(bond, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've only used:

mol.addBond(from_atom, to_atom, order);

Is there a reason to prefer one idiom over the other?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mol.addBond(from_atom, to_atom, order); can't take indexes for from_atom and to_atom, they have to be Atom*.

Copy link
Member

@greglandrum greglandrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't gone through this in excruciating detail, but it looks reasonable and seems to have tests there. Willing to merge as is.

@greglandrum greglandrum added this to the 2020_03_1 milestone Dec 3, 2019
@greglandrum greglandrum merged commit 4b4085f into rdkit:master Dec 3, 2019
@ricrogz ricrogz deleted the rework_MaeMolSupplier branch March 17, 2020 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants