Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cleanup of the SMILES/SMARTS parsing and writing code #2912

Merged
merged 15 commits into from
Jan 29, 2020

Conversation

greglandrum
Copy link
Member

Not a major refactoring, but a bunch of cleanup work.
Along the way some bugs/new ideas came up and I dealt with those here too:

  1. Fixes SMARTS parser fails on high-numbered ring closures in branches #2909
  2. Adds Allow RDProps::clearProp to succeed even if the prop doesn't exist #2910, which lead to some cleanup of the Dict class
  3. Removed RDKit::round in favor of std::round
  4. Switched usage of some boost::math functions over to the equivalent C++11 std functions.

@greglandrum greglandrum added Cleanup Code cleanup and refactoring enhancement labels Jan 27, 2020
@greglandrum greglandrum added this to the 2020_03_1 milestone Jan 27, 2020
Copy link
Contributor

@d-b-w d-b-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice tidying.

}

std::string res;
unsigned int nAtoms = tmol->getNumAtoms();
UINT_VECT ranks(nAtoms);
std::vector<unsigned int> ranks(nAtoms);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice. I personally find this much more readable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, but back in the day when we were chiseling the code into stone tablets it was important to save those characters!
;-)

Copy link
Contributor

@ricrogz ricrogz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dan asked me to have a look at this, too. Looks good; I just added a couple suggestions :)

There's still one thing that I don't really like, which is the alternation of unsigned int and int for atom indexes, but that's not going to go away in one commit. Still, it would be good to decide which should be used, and start updating it.

@@ -241,14 +243,15 @@ void AUTOCORR3D(const ROMol& mol, std::vector<double>& res, int confId,
double* dist3D =
MolOps::get3DDistanceMat(mol, confId, false, true); // 3D distance matrix
if (customAtomPropName != "") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe if (!customAtomPropName.empty()) {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep. I did a few of those, but no doubt missed a few.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd make sense to get these in one big sweep through the code in a different PR though

std::vector<unsigned int> atomOrdering;

if (canonical) {
if (tmol->hasProp("_canonicalRankingNumbers")) {
for (unsigned int i = 0; i < tmol->getNumAtoms(); ++i) {
for (const auto atom : tmol->atoms()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth moving to detail::common_properties in types.h ?

Also, I can't find where this property is set. Is it some legacy property from older code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's an undocumented way to allow you to provide the atom ranking to be used in canonicalization

Comment on lines -1089 to -1091
bool *numSwapsChiralAtoms = (bool *)malloc(nAtoms * sizeof(bool));
CHECK_INVARIANT(numSwapsChiralAtoms, "failed to allocate memory");
memset(numSwapsChiralAtoms, 0, nAtoms * sizeof(bool));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one had been making me uncomfortable for some time...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I wasn't sad to remove that

@greglandrum
Copy link
Member Author

There's still one thing that I don't really like, which is the alternation of unsigned int and int for atom indexes, but that's not going to go away in one commit. Still, it would be good to decide which should be used, and start updating it.

Agreed. That would be a nice one to cleanup/rationalize. It's a non-zero amount of effort but might be a good one for the next time I have a chunk of time to devote to cleaning up code.

boost::make_iterator_range(mol.getAtomBonds(atom))) {
// can't just check for single bonds, because dative bonds also have an
// order of 1
if (mol[bndItr]->getBondTypeAsDouble() > 1.001) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1.0 is perfectly representable in floating point. I think you can just do getBondTypeAsDouble() > 1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed

ROMol &mol = dblBond->getOwningMol();
int firstVisitOrder = mol.getNumBonds() + 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be unsigned int

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixing this, along with a number of other signed/unsigned things

double dtmp;

for (int i = 0; i < 10; i++) {
double* Bimat = GetGeodesicMatrix(topologicaldistance, i + 1, numAtoms);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unique_ptr?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing a general refactoring pass through the 3D descriptors at some point would be a really good idea, but I'd rather not get too deeply into that in this PR. I just did mechanical things (the replacement of boost::math functionality) this time through


std::vector<double> customAtomarray(numAtoms, 0.0);
for (int i = 0; i < numAtoms; ++i) {
if (mol.getAtomWithIdx(i)->hasProp(customAtomPropName)) {
Copy link
Contributor

@bp-kelley bp-kelley Jan 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially could use

std::vector<double> customAtomarray(numAtoms, 1.0);
for (auto &atom : mol.atoms()) {
  atom.GetPropIfPresent(customAtomPropName, customAtomArray[atom.getIdx()]);
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

@@ -172,7 +172,8 @@ std::vector<double> getWhimD(std::vector<double> weightvector,
}
if (std::fabs(Scores(j, i) + Scores(k, i)) <= th) {
// those that are close opposite & not close to the axis!
ns += 1; // check only once the symmetric none null we need to add +2!
ns +=
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

odd formatting

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that was clang-format. I just moved the comments around and that cleared up

auto start_tok = static_cast<int>(START_BOND);
std::vector<RWMol *> molVect;
Atom *atom = nullptr;
auto res = smarts_parse_helper(inp, molVect, atom, bond, start_tok);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can simplify to

return smarts_parse_helper(...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I often leave the local variable there because it makes debugging easier. That's not particularly relevant here

mol->setBondBookmark(b, atIt->first);
frag->clearBondBookmark(atIt->first, b);
mol->setBondBookmark(b, atIt.first);
frag->clearBondBookmark(atIt.first, b);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does/Can this invalidate the iterator?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. The only range/iterator loops here are lines 177 and 181 and those both are over atom bookmarks. Here we're removing bond bookmarks.

@@ -447,8 +432,8 @@ void CloseMolRings(RWMol *mol, bool toleratePartials) {
Bond *bond2 = *bondIt;

// remove those bonds from the bookmarks:
mol->clearBondBookmark(bookmarkIt->first, bond1);
mol->clearBondBookmark(bookmarkIt->first, bond2);
mol->clearBondBookmark(bookmark.first, bond1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this invalidate the iterator?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same story here, the loop is over atom bookmarks

switch (mSE.type) {
case Canon::MOL_STACK_ATOM:
if (!ringClosuresToErase.empty()) {
BOOST_FOREACH (unsigned int rclosure, ringClosuresToErase) {
for (auto rclosure : ringClosuresToErase) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ringClosuresToErase.empty() call is not needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct

@@ -137,9 +137,8 @@ class RDKIT_RDGENERAL_EXPORT Dict {
*/
STR_VECT keys() const {
STR_VECT res;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small optimization maybe

res.reserve(_data.size());

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

@@ -170,10 +170,18 @@ class RDKIT_RDGENERAL_EXPORT Dict {
}
}
throw KeyErrorException(what);
}
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The semi colon isn't necessary. Mildly surprised this isn't an error.

Also in the function above here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's legal syntax (and is used pretty regularly in the RDKit), but it's not needed.
https://stackoverflow.com/questions/9997895/semicolon-after-function

I'll remove it here and will try to remember to remove those trailing semicolons in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A semicolon after a function (non class) only became legal in C++11 I think. It doesn't really matter except for consistency and as long as we don't start seeing:

void foo() {
};;;;;;;;; // Hi Brian!

@bp-kelley
Copy link
Contributor

@greglandrum just a few minor comments, otherwise looks good. There is one suspicious maybe iterator invalidation question.

Copy link
Contributor

@bp-kelley bp-kelley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to make the 1 -> 1.0 change go for it. Otherwise I approve.

@@ -23,7 +23,7 @@ bool isUnsaturated(const Atom *atom, const ROMol &mol) {
boost::make_iterator_range(mol.getAtomBonds(atom))) {
// can't just check for single bonds, because dative bonds also have an
// order of 1
if (mol[bndItr]->getBondTypeAsDouble() > 1.001) {
if (mol[bndItr]->getBondTypeAsDouble() > 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still should be a double value if only to silence warning :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That actually doesn't produce a warning on any the platforms where we build.
There's really no reason to warn on double > int or double > long int

@greglandrum greglandrum merged commit 9991c52 into rdkit:master Jan 29, 2020
@greglandrum greglandrum deleted the dev/refactor_smiles_writing branch January 29, 2020 14:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Cleanup Code cleanup and refactoring enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SMARTS parser fails on high-numbered ring closures in branches
4 participants