Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve AddHs for molecules read from PDB #1647

Merged
merged 8 commits into from Jan 18, 2018

Conversation

mwojcikowski
Copy link
Contributor

What does this implement/fix? Explain your changes.

This PR fixes behavior of AddHs for molecules read from PDB file. Current behavior is to ignore residue information. What's more, the Hs written to PDB file are always HETATM and of UNL residue.

Proposed changes add AtomPDBResidueInfo based on the information collected from the root atom.
I decided against adding a flag, since it doesn't break default behavior and is kind of expected when working with proteins.

@mwojcikowski mwojcikowski changed the title [MRG] Improve AddHs for molecules read freom PDB [MRG] Improve AddHs for molecules read from PDB Nov 6, 2017
@@ -386,6 +387,13 @@ void addHs(RWMol &mol, bool explicitOnly, bool addCoords,
mol.addBond(aidx, newIdx, Bond::SINGLE);
mol.getAtomWithIdx(newIdx)->updatePropertyCache();
if (addCoords) setHydrogenCoords(&mol, newIdx, aidx);
AtomPDBResidueInfo *info = (AtomPDBResidueInfo *)(newAt->getMonomerInfo());
if (info && info->getMonomerType() == AtomMonomerInfo::PDBRESIDUE) {
AtomPDBResidueInfo *newInfo = new AtomPDBResidueInfo(" H ", newIdx, "", info->getResidueName(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the atom index for the serial number might not work in practice as they are not related. I would find the current maximum serial number and add one to it.

It might be nice is the hydrogens have unique names, especially if one is using/making a refinement dictionary. In general, this is done by using the number of hydrogen atoms on the residue, i.e. " H1", " H2" again you will need to count the existing hydrogens and names first.

Note that in practice, the hydrogen index "wraps around" due to a mis reading of the initial PDB spec way back in the day. Hydrogen 123 is written "3H12" for instance. There was no reason for the H to be in column 1, but someone codified that and now we're stuck :)

The names only have to be unique for the residue, not for the protein.

@mwojcikowski mwojcikowski changed the title [MRG] Improve AddHs for molecules read from PDB Improve AddHs for molecules read from PDB Nov 16, 2017
@mwojcikowski
Copy link
Contributor Author

@bp-kelley All requested changes were implemented. I have also refactored the code to be in one place. The hydrogens are numbered one per-resiude basis + wrap-around. Effectively it supports 1099 hydrogens to be uniquely numbered (the 10XY are wrapped around to YH0X). Plus the serial number, as you suggested, is incremented maximum serial number seen among other atoms.

@@ -409,6 +422,47 @@ void addHs(RWMol &mol, bool explicitOnly, bool addCoords,
}
// update the atom's derived properties (valence count, etc.)
newAt->updatePropertyCache();

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not critical, but it seems like this code should be in a separate function (in an anonymous namespace).

@mwojcikowski
Copy link
Contributor Author

@greglandrum Done. I've fixed a bug: the H idx was reset for every atom, now it is for each residue.

@mwojcikowski
Copy link
Contributor Author

Ping @bp-kelley @greglandrum

Copy link
Member

@greglandrum greglandrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the change and think it's quite useful, but I don't like that it's called for every molecule, even ones that have not been generated from PDB files. That's an unneccessary performance hit. I think it would be better to have a new function in the public API: AddHsWithResidueInfo() that calls AddHs() and then this.

@mwojcikowski
Copy link
Contributor Author

Why not expose AssignHsPDBInfo to the world, then? User could decide should it be called.

@greglandrum
Copy link
Member

That's more or less what I was suggesting, but if you call it AddHsWithResidueInfo() then you could imagine the same call working for other types of polymers (not just things from PDB files).

@greglandrum
Copy link
Member

Note that I'm not suggesting that you actually make it general right now, just that we have a stub in place so that we can do that in the future.

@mwojcikowski
Copy link
Contributor Author

Yet another idea - what if we execute AssignHsResidueInfo inside the AddHs function, but optionally (new param AssignResidueInfo=False). This way we retain fast(er) AddHs for non-PDB molecules, give an opportunity to assign the info (user's choice), and finally we don't need to copy constructor/helper/wrapper code connected to AddHs (Java, python, duplicated API which further changes should be also modified).

@mwojcikowski
Copy link
Contributor Author

@greglandrum I've modified the code to my previous post.

Copy link
Member

@greglandrum greglandrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. The one change suggested is to change the argument name.

@@ -931,8 +931,11 @@ struct molops_wrapper {
- addCoords: (optional) if this toggle is set, The Hs will have 3D coordinates\n\
set. Default value is 0 (no 3D coords).\n\
\n\
- onlyOnHs: (optional) if this sequence is provided, only these atoms will be\n\
- onlyOnAtoms: (optional) if this sequence is provided, only these atoms will be\n\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. Thanks for that fix. :-)

void addHs(RWMol &mol, bool explicitOnly, bool addCoords,
const UINT_VECT *onlyOnAtoms) {
const UINT_VECT *onlyOnAtoms, bool residueInfo) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer the argument to be called addResidueInfo as I think it's more explicit.

@mwojcikowski
Copy link
Contributor Author

Done

@greglandrum greglandrum added this to the 2018_03_1 milestone Jan 18, 2018
@greglandrum greglandrum merged commit 50299cb into rdkit:master Jan 18, 2018
@greglandrum
Copy link
Member

Thanks @mwojcikowski !

@bp-kelley : I'll take care of getting this merged into modern_cxx over the next couple of days.

greglandrum pushed a commit that referenced this pull request Jan 20, 2018
* Add AtomPDBResidueInfo during molops::addhs

* Test addhs and AtomPDBResidueInfo

* Little cleanup

* Fix serial

* Refactor code, add unique Hs labels

* Move code to separate function + bugfix.

* Make function optional (AddHs residueInfo param)

* Rename argument
@mwojcikowski mwojcikowski deleted the fix-pdb-addhs branch February 6, 2018 13:36
ricrogz pushed a commit to ricrogz/rdkit that referenced this pull request Aug 21, 2018
* Add AtomPDBResidueInfo during molops::addhs

* Test addhs and AtomPDBResidueInfo

* Little cleanup

* Fix serial

* Refactor code, add unique Hs labels

* Move code to separate function + bugfix.

* Make function optional (AddHs residueInfo param)

* Rename argument
ricrogz pushed a commit to ricrogz/rdkit that referenced this pull request Aug 21, 2018
* Add AtomPDBResidueInfo during molops::addhs

* Test addhs and AtomPDBResidueInfo

* Little cleanup

* Fix serial

* Refactor code, add unique Hs labels

* Move code to separate function + bugfix.

* Make function optional (AddHs residueInfo param)

* Rename argument
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants