
Add extract mol fragment api#8811

Merged
greglandrum merged 31 commits into rdkit:master from bp-kelley:pr/addExtractMolFragmentApi
Dec 9, 2025

@bp-kelley
Contributor

@bp-kelley bp-kelley commented Sep 24, 2025

Fixes the root cause of #8649

This replaces the pr for #8742

Your Name and others added 6 commits August 29, 2025 13:26
…ew ROMol by creating new graph (rdkit#8742)

This adds a new api, `RDKit::MolOps::ExtractMolFragment`, to allow efficient
extractions of mol fragments from large mols. Compared to the approach where
we delete "unwanted" atoms/bonds from the input mol, this api is faster for
small mols (about 2x faster) and at least 3x faster for big mols
(was 10x faster for "CCC"*1000).
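
To illustrate the idea behind the commit message above, here is a minimal sketch on a toy graph. It is not RDKit's implementation: `ToyGraph` and `extractFragment` are hypothetical names, and the only point is that the fragment is assembled from the selected atoms plus an old-to-new index map, rather than by deleting everything else from a copy of the input.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy stand-in for a molecule: atoms are indices 0..numAtoms-1,
// bonds are pairs of atom indices. (Hypothetical type, not an RDKit class.)
struct ToyGraph {
  std::size_t numAtoms = 0;
  std::vector<std::pair<std::size_t, std::size_t>> bonds;
};

// Build a new graph that contains only the selected atoms and the bonds
// whose two endpoints are both selected. The input is never mutated.
ToyGraph extractFragment(const ToyGraph &mol,
                         const std::vector<bool> &selected) {
  assert(selected.size() == mol.numAtoms);
  constexpr std::size_t unmapped = static_cast<std::size_t>(-1);
  std::vector<std::size_t> atomMapping(mol.numAtoms, unmapped);
  ToyGraph frag;
  // First pass: assign new, contiguous indices to the selected atoms.
  for (std::size_t i = 0; i < mol.numAtoms; ++i) {
    if (selected[i]) {
      atomMapping[i] = frag.numAtoms++;
    }
  }
  // Second pass: copy bonds, translating endpoints through the mapping.
  for (const auto &[a, b] : mol.bonds) {
    if (selected[a] && selected[b]) {
      frag.bonds.emplace_back(atomMapping[a], atomMapping[b]);
    }
  }
  return frag;
}
```

In this sketch, selecting the two middle atoms of a four-atom chain yields a two-atom fragment with a single bond between new indices 0 and 1.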
bp-kelley pushed a commit to bp-kelley/rdkit that referenced this pull request Sep 24, 2025
…l as a new ROMol by creating new graph (rdkit#8742) (rdkit#8743)"

This reverts commit 040bdb6.

During testing of using this as a replacement for portions of
getTheFrags in getMolFrags, several issues came up regarding
how copies should actually work in practice.  These are being
corrected in a new pr:  rdkit#8811
@rachelnwalker
Collaborator

Is it possible to add back some of the tests @whosayn had in the original PR?

Also, I would definitely find this useful in python :)

@bp-kelley
Contributor Author

@rachelnwalker thanks for the note. I thought I had; I'll re-add them.

@bp-kelley
Contributor Author

@rachelnwalker I did, there's a merge conflict though. I'll fix it

@bp-kelley
Contributor Author

bp-kelley commented Sep 24, 2025 via email

greglandrum pushed a commit that referenced this pull request Sep 25, 2025
…l as a new ROMol by creating new graph (#8742) (#8743)" (#8814)

This reverts commit 040bdb6.

During testing of using this as a replacement for portions of
getTheFrags in getMolFrags, several issues came up regarding
how copies should actually work in practice.  These are being
corrected in a new pr:  #8811

Co-authored-by: Brian Kelley <bkelley@glysade.com>
@greglandrum
Member

@bp-kelley can you please push an empty commit here so that the CI builds run again?

Contributor

@whosayn whosayn left a comment

didn't look too closely at the implementation details in this first pass, so my comments will reflect that. my main concern with these changes is the lack of unit tests for the SubsetOptions and SubsetMethod::BOND configs. @bp-kelley is that coming in a subsequent diff?

}
}
SubsetOptions opts;
opts.copyCoordinates = copyConformers;
Contributor

nit: consider designated initializers like `SubsetOptions opts{.copyCoordinates = copyConformers, ...};`
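
A minimal sketch of what that suggestion looks like in C++20. `SubsetOptionsSketch` and `makeOptions` are hypothetical; only the field names `copyCoordinates` and `sanitize` appear in this thread, and the default values shown are assumptions.

```cpp
#include <cassert>

// Hypothetical mirror of the options struct under review. The defaults
// here are assumed for illustration only.
struct SubsetOptionsSketch {
  bool sanitize = false;
  bool copyCoordinates = true;
};

SubsetOptionsSketch makeOptions(bool copyConformers) {
  // C++20 designated initializers: each field is named at the point of
  // initialization, the names must follow declaration order, and any
  // field left out keeps its default member initializer.
  SubsetOptionsSketch opts{.copyCoordinates = copyConformers};
  assert(!opts.sanitize);  // untouched field keeps its default
  return opts;
}
```

Compared with default-constructing the struct and then assigning fields one by one, this form keeps the whole configuration in a single expression.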

Contributor Author

I'll make these changes when I add the python wrapper

@bp-kelley
Contributor Author

bp-kelley commented Oct 2, 2025

@whosayn @greglandrum I'll tackle some of the review comments and I sorted the ubuntu error with the if statement.

I would like a mechanism to get mapped bonds as well as atoms. For instance, just because the endpoints of a bond are mapped doesn't mean that the bond is. That means keeping SubsetInfo and probably making the following function public, since the other functions are helpers around it:

copyMolSubset(RWMol &mol, SubsetInfo &info, const SubsetOptions &options)

Thoughts?

Once this is in place I'll do the python API.

Then I can show you how to start investigating the weird stuff.
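
The point about bonds can be made concrete with a toy sketch. `ToySubsetInfo` and `mapSelectedBonds` are hypothetical names and this is not the PR's `SubsetInfo`; the sketch only shows why a separate bond mapping carries information that the atom mapping alone does not.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <optional>
#include <utility>
#include <vector>

using Bond = std::pair<std::size_t, std::size_t>;

// Hypothetical result type: for every original atom and bond, the index it
// received in the fragment, or std::nullopt if it was not copied.
struct ToySubsetInfo {
  std::vector<std::optional<std::size_t>> atomMapping;
  std::vector<std::optional<std::size_t>> bondMapping;
};

// Copy only the selected bonds, plus the atoms those bonds touch.
ToySubsetInfo mapSelectedBonds(std::size_t numAtoms,
                               const std::vector<Bond> &bonds,
                               const std::vector<bool> &selectedBonds) {
  assert(selectedBonds.size() == bonds.size());
  ToySubsetInfo info{
      std::vector<std::optional<std::size_t>>(numAtoms),
      std::vector<std::optional<std::size_t>>(bonds.size())};
  std::size_t nextAtom = 0;
  std::size_t nextBond = 0;
  for (std::size_t i = 0; i < bonds.size(); ++i) {
    if (!selectedBonds[i]) {
      continue;
    }
    // Map each endpoint the first time a selected bond touches it.
    for (std::size_t atom : {bonds[i].first, bonds[i].second}) {
      if (!info.atomMapping[atom]) {
        info.atomMapping[atom] = nextAtom++;
      }
    }
    info.bondMapping[i] = nextBond++;
  }
  return info;
}
```

For a triangle with bonds 0-1, 1-2 and 0-2, selecting only the first two bonds maps all three atoms, yet bond 0-2 stays unmapped even though both of its endpoints are in the fragment.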

@greglandrum
Member

> @whosayn @greglandrum I'll tackle some of the review comments and I sorted the ubuntu error with the if statement.
>
> I would like a mechanism to get mapped bonds as well as atoms. For instance, just because the endpoints of a bond are mapped doesn't mean that the bond is. That means keeping SubsetInfo and probably making the following function public, since the other functions are helpers around it.

I can definitely see that being able to track where bonds go would be useful.

@bp-kelley
Contributor Author

@whosayn @rachelnwalker @greglandrum I think this is finally ready. I think I have handled all critiques and made the Java and python wrappers.

@greglandrum greglandrum self-assigned this Nov 19, 2025
Member

@greglandrum greglandrum left a comment

I haven't finished with Subset.cpp. I'll finish it after the suggested changes have been made

#define RD_SUBSET_H

#include <RDGeneral/export.h>
#include <RDGeneral/BetterEnums.h>
Member

you aren't using this here. Probably don't need to include it

Contributor Author

Fixed

//! Subsetting Options for copyMolSubset
/*
* These control what is copied over from the original molecule
* \param sanitize - perform sanitization automatically on the subset
Member

\param is for arguments to a function.
To document data members of a struct, we add the docs directly after declaring the member.
As an example, take a look at the docs of the options in FileParsers.h
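
A short sketch of the member-documentation style being described, using Doxygen's trailing `//!<` comments. `DocumentedOptionsSketch` is hypothetical: the `sanitize` description is taken from the snippet above, while the `copyCoordinates` description and both defaults are assumptions.

```cpp
//! Options controlling what is copied from the original molecule
//! (hypothetical sketch, not the struct from this PR)
struct DocumentedOptionsSketch {
  bool sanitize = false;  //!< perform sanitization automatically on the subset
  bool copyCoordinates = true;  //!< copy coordinates (description assumed)
};

// \param is reserved for function arguments, as in this helper.
/*!
  \param opts the options to inspect
  \return whether coordinates should be copied
*/
bool wantsCoordinates(const DocumentedOptionsSketch &opts) {
  return opts.copyCoordinates;
}
```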

Contributor Author

Should be resolved

frag->addConformer(conf);
}
}
SubsetOptions opts {
Member

Just noticing this here. I'm sure it applies elsewhere:
Please run clang-format over all of the code in this PR

Contributor Author

I'll run clang-format last to help with the PR.

@@ -0,0 +1,96 @@
#
# Copyright (C) 2003-2021 Greg Landrum and other RDKit contributors
Member

might as well fix this

Contributor Author

Fixed

Comment on lines +41 to +43
static constinit bool updateLabel = false;
static constinit bool takeOwnership = true;
atomMapping[ref_atom->getIdx()] = extracted_mol.addAtom(
Member

Curiosity: why are these constinit and not constexpr?

Contributor Author

Some of these (and the snake case) came from the original PR. I'll have to ask @whosayn, but constexpr or just const seems to be the right choice here.
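
For context on the question, a small sketch of the difference between the two specifiers. The names `callCount`, `bumpCallCount` and `flagForAddAtom` are hypothetical, and the `constinit` part is guarded by its feature-test macro so the snippet still compiles before C++20.

```cpp
// constexpr implies const: the value is a compile-time constant and can be
// used in constant expressions such as static_assert.
static constexpr bool takeOwnership = true;
static_assert(takeOwnership, "usable in a constant expression");

#if defined(__cpp_constinit)
// constinit only guarantees that initialization happens statically; the
// variable remains mutable and is not usable in constant expressions.
static constinit int callCount = 0;
int bumpCallCount() { return ++callCount; }
#endif

// A fixed flag that is only ever read fits constexpr (or plain const).
bool flagForAddAtom() { return takeOwnership; }
```

Under these assumptions, flags that never change, like the two in the snippet under review, do not need the mutability that `constinit` leaves open.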

Contributor Author

resolved

Comment on lines +82 to +86
// we need to update rings now
if(reference_mol.getRingInfo()->isInitialized()) {
extracted_mol.getRingInfo()->reset();
}
}
Member

Since we create the molecule just before the call to this function, we're never going to need to do this. Still, I guess it doesn't hurt to include it.

However, I don't think we need to do this for every bond that we add. Can't we move it outside of the loop and do:

Suggested change
- // we need to update rings now
- if(reference_mol.getRingInfo()->isInitialized()) {
- extracted_mol.getRingInfo()->reset();
- }
- }
+ // we need to update rings now
+ if(selectedBonds.any() && reference_mol.getRingInfo()->isInitialized()) {
+ extracted_mol.getRingInfo()->reset();
+ }
+ }
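
The effect of hoisting the reset out of the loop can be sketched with a toy counter. `ToyRingInfo` and `copyBondsThenReset` are hypothetical stand-ins, not RDKit types, and the bond copy itself is reduced to a count.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for ring bookkeeping that counts its resets.
struct ToyRingInfo {
  bool initialized = true;
  int resetCalls = 0;
  void reset() {
    initialized = false;
    ++resetCalls;
  }
};

// Copy the selected bonds, then invalidate the ring info once, after the
// loop, and only if at least one bond was actually copied.
std::size_t copyBondsThenReset(const std::vector<bool> &selectedBonds,
                               bool referenceRingsInitialized,
                               ToyRingInfo &rings) {
  std::size_t copied = 0;
  for (bool isSelected : selectedBonds) {
    if (isSelected) {
      ++copied;  // stands in for adding the bond to the extracted molecule
    }
  }
  if (copied > 0 && referenceRingsInitialized) {
    rings.reset();
  }
  return copied;
}
```

With the check inside the loop, `reset()` would run once per copied bond; here it runs at most once per call.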

Contributor Author

I seem to remember needing this, but you are right about the location.

Member

You didn't move it. Was that intentional?

Contributor Author

Resolved

}
}

[[nodiscard]] static bool is_selected_sgroup(
Member

Why does this need to be [[nodiscard]]?

Member

Also, it's not in the public API, so I won't insist, but we don't use underscores in function/variable names

Contributor Author

I think it's a safety thing: if the return value isn't used, the compiler issues a warning.

Member

I understand what [[nodiscard]] does, but I think it only makes sense to use it when there's a reason that you shouldn't ignore the return value of the function (like when it's passing back a pointer that you own).
I don't think that's the case here.

Collaborator

The compiler might helpfully optimise it out and you’re relying on a side effect of the function? That might not be the case here but it’s why I’ve used it in the past.

Member

> The compiler might helpfully optimise it out and you’re relying on a side effect of the function? That might not be the case here but it’s why I’ve used it in the past.

Definitely could happen, but there are no side-effects here.
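
A sketch of the distinction drawn in this thread, with hypothetical functions: a side-effect-free predicate where `[[nodiscard]]` adds little, and a function handing back an owning raw pointer where ignoring the result is a real bug.

```cpp
#include <memory>
#include <vector>

// Hypothetical pure predicate: discarding the result is pointless but
// harmless, so [[nodiscard]] is arguably unnecessary here.
bool isSelected(const std::vector<bool> &selection, unsigned int idx) {
  return idx < selection.size() && selection[idx];
}

// Hypothetical factory returning a raw pointer that the caller owns:
// discarding the result leaks memory, so a warning is worth having.
[[nodiscard]] int *makeOwnedCounter(int start) { return new int(start); }

// Intended use: take ownership of the returned pointer immediately.
std::unique_ptr<int> adoptCounter(int start) {
  return std::unique_ptr<int>(makeOwnedCounter(start));
}
```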

Contributor Author

Resolved

}
}

static void copySelectedAtomsAndBonds(::RDKit::RWMol &extracted_mol,
Member

Since we're inside the RDKit namespace, I don't think it needs to be used anywhere in this code.

Contributor Author

Removed the qualifiers

bp-kelley and others added 3 commits November 25, 2025 16:16
Co-authored-by: Greg Landrum <greg.landrum@gmail.com>
Co-authored-by: Greg Landrum <greg.landrum@gmail.com>
…ests.java

Co-authored-by: Greg Landrum <greg.landrum@gmail.com>
@greglandrum
Member

@bp-kelley let me know when you think this is ready to be reviewed again

@bp-kelley
Contributor Author

I’ll be able to make the changes this weekend, focusing on the non-testing code first. I expect some of the stylistic issues were just not addressed in the original PR that I moved over. I’ll do a style pass and then run clang-format.

@bp-kelley
Contributor Author

@greglandrum should be up for a second look now

Member

@greglandrum greglandrum left a comment

I tagged a couple of comments from earlier reviews that still have no replies/fixes.
There are others.

Please at least look at those.


def check(info, selection):
  for i,v in enumerate(selection):
    if v: assert i in info
Member

just a reminder that there are still two comments in this file

CHECK(extracted_atoms == expected_atoms);
}

// This test makes sure we correctly extract atoms
Member

still here

copySelectedAtomsAndBonds(*extracted_mol, mol, selection_info, options);
copySelectedSubstanceGroups(*extracted_mol, mol, selection_info, options);
copySelectedStereoGroups(*extracted_mol, mol, selection_info);
if(options.copyCoordinates)
Member

please remove all of the one liners

Contributor Author

fixed

Contributor Author

@bp-kelley bp-kelley Dec 4, 2025

@greglandrum I believe these are all finished.

Member

@greglandrum greglandrum left a comment

LGTM

@greglandrum greglandrum changed the title from "Pr/add extract mol fragment api" to "Add extract mol fragment api" Dec 9, 2025
@greglandrum greglandrum added this to the 2025_09_4 milestone Dec 9, 2025
@greglandrum greglandrum merged commit 70540c2 into rdkit:master Dec 9, 2025
12 checks passed
greglandrum added a commit that referenced this pull request Dec 30, 2025
* Create a function to extract some specified atoms from a ROMol as a new ROMol by creating new graph (#8742)

This adds a new api, `RDKit::MolOps::ExtractMolFragment`, to allow efficient
extractions of mol fragments from large mols. Compared to the approach where
we delete "unwanted" atoms/bonds from the input mol, this api is faster for
small mols (about 2x faster) and at least 3x faster for big mols
(was 10x faster for "CCC"*1000).

* clang-format

* review comments

* cleanup

* Consolidate copying subsets of molecules

* Readd missing tests

* Update comment to restart build

* Remove missing test

* Remove debugging comment, fix warnings

* Fix warnings on gcc11

* Add docs

* Make vector<bool> dynamic_bitset<>

* Update copyright

* Add swig wrappers

* Use new designated constructor API

* Fix windows builds

* Change enum values from unsigned int to integer

* Fix unsigned int variable

* Update Code/GraphMol/Wrap/test_subset.py

Co-authored-by: Greg Landrum <greg.landrum@gmail.com>

* Update Code/GraphMol/Subset.cpp

Co-authored-by: Greg Landrum <greg.landrum@gmail.com>

* Update Code/JavaWrappers/gmwrapper/src-test/org/RDKit/ChemTransformsTests.java

Co-authored-by: Greg Landrum <greg.landrum@gmail.com>

* Reponse to review

* Fix documentation

* Remove comments

* Remove unnecessary comments

* Fix one liners

* Change assertion to be clearer (and not one-liners)

* Run clang-format

---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Hussein Faara <hussein.faara@schrodinger.com>
Co-authored-by: Brian Kelley <bkelley@glysade.com>
Co-authored-by: Greg Landrum <greg.landrum@gmail.com>