Add extract mol fragment api#8811
Conversation
…ew ROMol by creating new graph (rdkit#8742) This adds a new api, `RDKit::MolOps::ExtractMolFragment`, to allow efficient extractions of mol fragments from large mols. Compared to the approach where we delete "unwanted" atoms/bonds from the input mol, this api is faster for small mols (about 2x faster) and at least 3x faster for big mols (was 10x faster for "CCC"*1000).
…l as a new ROMol by creating new graph (rdkit#8742) (rdkit#8743)" This reverts commit 040bdb6. During testing of using this as a replacement for portions of getTheFrags in getMolFrags, several issues came up regarding how copies should actually work in practice. These are being corrected in a new pr: rdkit#8811
|
Is it possible to add back some of the tests @whosayn had in the original PR? Also, I would definitely find this useful in python :) |
|
@rachelnwalker thanks for the note, I thought I had. I'll readd them. |
|
@rachelnwalker I did, there's a merge conflict though. I'll fix it |
|
I added back the tests with one change, in commitBatch we only remove atoms
from stereogroups, not the stereogroups themselves so I use this behavior.
And yes, python/swig all that good stuff should be implemented.
…On Wed, Sep 24, 2025 at 1:52 PM Rachel Walker ***@***.***> wrote:
*rachelnwalker* left a comment (rdkit/rdkit#8811)
<#8811 (comment)>
Is it possible to add back some of the tests @whosayn
<https://github.com/whosayn> had in the original PR?
Also, I would definitely find this useful in python :)
—
Reply to this email directly, view it on GitHub
<#8811 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACLOFIBLEEJSMXXS4J54YUD3ULK7LAVCNFSM6AAAAACHL6XYOOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTGMZQGA2DCOJUHE>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
…l as a new ROMol by creating new graph (#8742) (#8743)" (#8814) This reverts commit 040bdb6. During testing of using this as a replacement for portions of getTheFrags in getMolFrags, several issues came up regarding how copies should actually work in practice. These are being corrected in a new pr: #8811 Co-authored-by: Brian Kelley <bkelley@glysade.com>
|
@bp-kelley can you please push an empty commit here so that the CI builds run again? |
whosayn
left a comment
There was a problem hiding this comment.
didn't look too closely at the implementation details in this first pass, so my comments will reflect that. my main concern with these changes is the lack of unit tests for the SubsetOptions and SubsetMethod::BOND configs. @bp-kelley is that coming in a subsequent diff?
Code/GraphMol/MolOps.cpp
Outdated
| } | ||
| } | ||
| SubsetOptions opts; | ||
| opts.copyCoordinates = copyConformers; |
There was a problem hiding this comment.
nit: consider designated initializers like SubsetOptions opts{.copyCoorsinates = copyConformers, ...};
There was a problem hiding this comment.
I'll make these changes when I add the python wrapper
|
@whosayn @greglandrum I'll tackle some of the review comments and I sorted the ubuntu error with the if statement. I would like a mechanism to get mapped bonds as well as atoms. For instance, just because the endpoints of a bond are mapped doesn't mean that the bond is. This means to either keep SubsetInfo and probably make the following public as the other functions are helper functions around this. Thoughts? Once this is in place I'll do the python API. Then I can show you how to start investigating the weird stuff. |
I can definitely see that being able to track where bonds go would be useful. |
Note: Subset.h SubsetMethod needed to change to an enum class because of a conflict in the java wrappers
|
@whosayn @rachelnwalker @greglandrum I think this is finally ready. I think I have handled all critiques and made the Java and python wrappers. |
greglandrum
left a comment
There was a problem hiding this comment.
I haven't finished with Subset.cpp. I'll finish it after the suggested changes have been made
Code/GraphMol/Subset.h
Outdated
| #define RD_SUBSET_H | ||
|
|
||
| #include <RDGeneral/export.h> | ||
| #include <RDGeneral/BetterEnums.h> |
There was a problem hiding this comment.
you aren't using this here. Probably don't need to include it
Code/GraphMol/Subset.h
Outdated
| //! Subsetting Options for copyMolSubset | ||
| /* | ||
| * These control what is copied over from the original molecule | ||
| * \param sanitize - perform sanitization automatically on the subset |
There was a problem hiding this comment.
\param is for arguments to a function.
To document data members of a struct, we add the docs directly after declaring the member.
As an example, take a look at the docs of the options in FileParsers.h
There was a problem hiding this comment.
Should be resolved
Code/GraphMol/MolOps.cpp
Outdated
| frag->addConformer(conf); | ||
| } | ||
| } | ||
| SubsetOptions opts { |
There was a problem hiding this comment.
Just noticing this here. I'm sure it applies elsewhere:
Please run clang-format over all of the code in this PR
There was a problem hiding this comment.
I'll run clang-format last to help with the PR.
Code/GraphMol/Wrap/test_subset.py
Outdated
| @@ -0,0 +1,96 @@ | |||
| # | |||
| # Copyright (C) 2003-2021 Greg Landrum and other RDKit contributors | |||
Code/GraphMol/Subset.cpp
Outdated
| static constinit bool updateLabel = false; | ||
| static constinit bool takeOwnership = true; | ||
| atomMapping[ref_atom->getIdx()] = extracted_mol.addAtom( |
There was a problem hiding this comment.
Curiosity: why are these constinit and not constexpr?
There was a problem hiding this comment.
Some of these (and the snake case) came from the original PR, I'll have to ask @whosayn, but constexpr or just const seem to be the right choice here.
Code/GraphMol/Subset.cpp
Outdated
| // we need to update rings now | ||
| if(reference_mol.getRingInfo()->isInitialized()) { | ||
| extracted_mol.getRingInfo()->reset(); | ||
| } | ||
| } |
There was a problem hiding this comment.
Since we create the molecule just before the call to this function, we're never going to need to do this. Still, I guess it doesn't hurt to include it.
However, I don't think we need to do this for every bond that we add. Can't we move it outside of the loop and do:
| // we need to update rings now | |
| if(reference_mol.getRingInfo()->isInitialized()) { | |
| extracted_mol.getRingInfo()->reset(); | |
| } | |
| } | |
| // we need to update rings now | |
| if(selectedBonds.any() && reference_mol.getRingInfo()->isInitialized()) { | |
| extracted_mol.getRingInfo()->reset(); | |
| } | |
| } |
There was a problem hiding this comment.
I seem to remember needing this, but you are right about the location.
There was a problem hiding this comment.
You didn't move it. Was that intentional?
Code/GraphMol/Subset.cpp
Outdated
| } | ||
| } | ||
|
|
||
| [[nodiscard]] static bool is_selected_sgroup( |
There was a problem hiding this comment.
Why does this need to be [[nodiscard]]?
There was a problem hiding this comment.
Also, it's not in the public API, so I won't insist, but we don't use underscores in function/variable names
There was a problem hiding this comment.
I think it's a safety thing, if the return value isn't used, it is a warning.
There was a problem hiding this comment.
I understand what [[nodiscard]] does, but I think it only makes sense to use it when there's a reason that you shouldn't ignore the return value of the function (like when it's passing back a pointer that you own).
I don't think that's the case here.
There was a problem hiding this comment.
The compiler might helpfully optimise it out and you’re relying on a side effect of the function? That might not be the case here but it’s why I’ve used it on the past.
There was a problem hiding this comment.
The compiler might helpfully optimise it out and you’re relying on a side effect of the function? That might not be the case here but it’s why I’ve used it on the past.
Definitely could happen, but there are no side-effects here.
Code/GraphMol/Subset.cpp
Outdated
| } | ||
| } | ||
|
|
||
| static void copySelectedAtomsAndBonds(::RDKit::RWMol &extracted_mol, |
There was a problem hiding this comment.
Since we're inside the RDKit namespace, I don't think it needs to be used anywhere in this code.
There was a problem hiding this comment.
Removed the qualifiers
Co-authored-by: Greg Landrum <greg.landrum@gmail.com>
Co-authored-by: Greg Landrum <greg.landrum@gmail.com>
…ests.java Co-authored-by: Greg Landrum <greg.landrum@gmail.com>
|
@bp-kelley let me know when you think this is ready to be reviewed again |
|
I’ll be able to make the changes this weekend focusing on the non testing code first. I expect some of the stylistic issues were just not addressed in the original pr that I moved over. I’ll do a style pass and then clang format |
|
@greglandrum should be up for a second look now |
greglandrum
left a comment
There was a problem hiding this comment.
I tagged a couple of comments from earlier reviews that still have no replies/fixes.
There are others
Please at least look at those
Code/GraphMol/Wrap/test_subset.py
Outdated
|
|
||
| def check(info, selection): | ||
| for i,v in enumerate(selection): | ||
| if v: assert i in info |
There was a problem hiding this comment.
just a reminder that there are still two comments in this file
Code/GraphMol/catch_molops.cpp
Outdated
| CHECK(extracted_atoms == expected_atoms); | ||
| } | ||
|
|
||
| // This test makes sure we correctly extract atoms |
Code/GraphMol/Subset.cpp
Outdated
| copySelectedAtomsAndBonds(*extracted_mol, mol, selection_info, options); | ||
| copySelectedSubstanceGroups(*extracted_mol, mol, selection_info, options); | ||
| copySelectedStereoGroups(*extracted_mol, mol, selection_info); | ||
| if(options.copyCoordinates) |
There was a problem hiding this comment.
please remove all of the one liners
There was a problem hiding this comment.
@greglandrum I believe these are all finished.
* Create a function to extract some specified atoms from a ROMol as a new ROMol by creating new graph (#8742) This adds a new api, `RDKit::MolOps::ExtractMolFragment`, to allow efficient extractions of mol fragments from large mols. Compared to the approach where we delete "unwanted" atoms/bonds from the input mol, this api is faster for small mols (about 2x faster) and at least 3x faster for big mols (was 10x faster for "CCC"*1000). * clang-format * review comments * cleanup * Consolidate copying subsets of molecules * Readd missing tests * Update comment to restart build * Remove missing test * Remove debugging comment, fix warnings * Fix warnings on gcc11 * Add docs * Make vector<bool> dynamic_bitset<> * Update copyright * Add swig wrappers * Use new designated constructor API * Fix windows builds * Change enum values from unsigned int to integer * Fix unsigned int variable * Update Code/GraphMol/Wrap/test_subset.py Co-authored-by: Greg Landrum <greg.landrum@gmail.com> * Update Code/GraphMol/Subset.cpp Co-authored-by: Greg Landrum <greg.landrum@gmail.com> * Update Code/JavaWrappers/gmwrapper/src-test/org/RDKit/ChemTransformsTests.java Co-authored-by: Greg Landrum <greg.landrum@gmail.com> * Reponse to review * Fix documentation * Remove comments * Remove unnecessary comments * Fix one liners * Change assertion to be clearer (and not one-liners) * Run clang-format --------- Co-authored-by: Your Name <you@example.com> Co-authored-by: Hussein Faara <hussein.faara@schrodinger.com> Co-authored-by: Brian Kelley <bkelley@glysade.com> Co-authored-by: Greg Landrum <greg.landrum@gmail.com>
Fixes the root cause of #8649
This replaces the pr for #8742