Integrating xyz2mol Into The RDKit Core (GSoC 2022) #5557

Merged: 55 commits into rdkit:master on Oct 6, 2022

@gosreya (Contributor) commented Sep 11, 2022

Organization: Open Chemistry

Student: Sreya Gogineni

Mentors: Greg Landrum, Joey Storer

This summer, I worked on integrating 'xyz2mol' into the RDKit, an open-source cheminformatics library. 'xyz2mol' was originally developed by Professor Jan H. Jensen's research group at the University of Copenhagen, based on the work published in this paper (DOI: 10.1002/bkcs.10334).

Given a molecule's charge and the spatial location of each atom, the program predicts the molecule's most favorable set of bonds. A user passes in the molecule's XYZ file, a format often used in computational chemistry that gives each atom's coordinates, and gets back an RDKit molecule object with the predicted bonds in place.

As the original program was written in Python, the heart of this project was translating it into C++, the language of the RDKit core.

Integrating xyz2mol into the RDKit required:

  • adding an XYZ file parser,
  • implementing atomic connectivity determination (knowing which atoms are bonded to each other),
  • implementing bond order determination (knowing whether each bond is single, double, or triple), and
  • adding Python and Java bindings.

As of the end of the GSoC coding period, the first three steps have been completed. The final step, adding bindings to make the features available to RDKit Python and Java users, remains to be done.

The XYZ File Parser

As with other RDKit file parsers (such as the Mol file parser), the XYZ parser constructs an RDKit molecule from the file data. Since an XYZ file contains only the element and location of each atom, the molecule built by the parser has atoms but no bonds, plus a conformer holding the atomic coordinates. The function XYZFileToMol() calls the file parser.
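To make the file layout concrete, here is a minimal standalone sketch of reading XYZ-formatted text. It is not the code behind XYZFileToMol(): the `XyzAtom` struct and `parseXyz()` function are hypothetical names used only for illustration, and the sketch assumes the common layout of an atom count on the first line, a comment on the second, and one `symbol x y z` line per atom.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for an atom and its position; the RDKit parser
// builds a molecule object with a conformer instead.
struct XyzAtom {
  std::string symbol;
  double x, y, z;
};

// Parse XYZ-formatted text: line 1 is the atom count, line 2 is a free-form
// comment, and each following line is "<symbol> <x> <y> <z>".
std::vector<XyzAtom> parseXyz(std::istream &in) {
  std::string line;
  if (!std::getline(in, line)) {
    throw std::runtime_error("empty XYZ input");
  }
  long declared = 0;
  std::istringstream countLine(line);
  if (!(countLine >> declared) || declared < 0) {
    throw std::runtime_error("first line must be a non-negative atom count");
  }
  const std::size_t numAtoms = static_cast<std::size_t>(declared);

  std::getline(in, line);  // comment line, ignored in this sketch

  std::vector<XyzAtom> atoms;
  atoms.reserve(numAtoms);
  while (atoms.size() < numAtoms && std::getline(in, line)) {
    std::istringstream fields(line);
    XyzAtom atom;
    if (!(fields >> atom.symbol >> atom.x >> atom.y >> atom.z)) {
      throw std::runtime_error("malformed atom line: " + line);
    }
    atoms.push_back(atom);
  }
  if (atoms.size() != numAtoms) {
    throw std::runtime_error("fewer atom lines than the declared count");
  }
  return atoms;
}
```

Note that the result carries no bond information at all, which is why the connectivity and bond order steps are needed afterwards.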

Atomic Connectivity Determination

The original xyz2mol offers two methods of predicting connectivity: the 'van der Waals' method and the 'Hueckel' method. The former uses atoms' covalent radii to predict bonding, while the latter uses extended Hueckel theory.

These two methods were made available through the function determineConnectivity(), which modifies a passed-in molecule object in place and adds a single bond wherever a bond is predicted.
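As a rough illustration of the radius-based idea, the sketch below treats two atoms as bonded when their distance is at most a scaling factor times the sum of their covalent radii. This is a simplified assumption about how such a test can work, not the code in determineConnectivity(): `Atom3D`, `guessConnectivity()`, the default factor of 1.3, and the four-element radius table are all hypothetical and chosen only for the example.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical atom record: element symbol plus Cartesian coordinates
// in angstroms.
struct Atom3D {
  std::string symbol;
  double x, y, z;
};

// Returns index pairs (i < j) whose distance is within
// covFactor * (radius_i + radius_j). Throws std::out_of_range for an
// element missing from the small illustrative table below.
std::vector<std::pair<std::size_t, std::size_t>> guessConnectivity(
    const std::vector<Atom3D> &atoms, double covFactor = 1.3) {
  // Illustrative covalent radii (angstroms) for a handful of elements.
  static const std::map<std::string, double> radii = {
      {"H", 0.31}, {"C", 0.76}, {"N", 0.71}, {"O", 0.66}};

  std::vector<std::pair<std::size_t, std::size_t>> bonds;
  for (std::size_t i = 0; i < atoms.size(); ++i) {
    for (std::size_t j = i + 1; j < atoms.size(); ++j) {
      const double dx = atoms[i].x - atoms[j].x;
      const double dy = atoms[i].y - atoms[j].y;
      const double dz = atoms[i].z - atoms[j].z;
      const double dist = std::sqrt(dx * dx + dy * dy + dz * dz);
      const double cutoff =
          covFactor * (radii.at(atoms[i].symbol) + radii.at(atoms[j].symbol));
      if (dist <= cutoff) {
        bonds.emplace_back(i, j);  // recorded as a plain (single) connection
      }
    }
  }
  return bonds;
}
```

For a water-like geometry this yields the two O-H pairs and skips the H-H pair, since the hydrogens sit farther apart than their scaled radii allow.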

Bond Order Determination

Determining bond order (whether a bond is single, double, or triple) was the largest part of this project. Given a molecule object whose bonds reflect atomic connectivity, the function determineBondOrdering() further modifies the molecule to give it a favorable bond ordering. A third function, determineBonds(), calls both determineConnectivity() and determineBondOrdering(), giving users of the original xyz2mol a similar workflow.

Some interesting tasks in implementing the function included writing an algorithm to calculate the Cartesian product of an arbitrary number of input vectors of arbitrary size, and using the Boost Graph Library.
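For readers curious about the Cartesian product task, one common approach is an "odometer" of indices, one per input vector, that is advanced from the rightmost position with carry. The sketch below shows that approach; it is an illustration only, not necessarily how the merged code does it, and `cartesianProduct()` is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Cartesian product of any number of input vectors, each of any size.
// Every output row takes one element from each input, in input order.
// If any input is empty the product is empty; with no inputs at all the
// product is a single empty row.
template <typename T>
std::vector<std::vector<T>> cartesianProduct(
    const std::vector<std::vector<T>> &inputs) {
  std::vector<std::vector<T>> result;
  for (const auto &input : inputs) {
    if (input.empty()) {
      return result;
    }
  }

  // idx[k] is the current position within inputs[k].
  std::vector<std::size_t> idx(inputs.size(), 0);
  while (true) {
    std::vector<T> row;
    row.reserve(inputs.size());
    for (std::size_t k = 0; k < inputs.size(); ++k) {
      row.push_back(inputs[k][idx[k]]);
    }
    result.push_back(row);

    // Advance like an odometer: bump the rightmost index, and on overflow
    // reset it to zero and carry into the index to its left.
    std::size_t k = inputs.size();
    bool advanced = false;
    while (k > 0 && !advanced) {
      --k;
      if (++idx[k] < inputs[k].size()) {
        advanced = true;
      } else {
        idx[k] = 0;
      }
    }
    if (!advanced) {
      return result;  // every index wrapped around: all rows were emitted
    }
  }
}
```

The iterative form avoids recursion depth that grows with the number of inputs, and the output size is the product of the input sizes.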

Looking Ahead

Through the integration of xyz2mol into the RDKit, its capabilities were made more modular. While the original program did file parsing, connectivity determination, and bond order determination at once, users can now do the three tasks independently of one another, enabling them to potentially swap out atomic connectivity and bond order determination methods or simply read in an XYZ file without using the rest of xyz2mol.

A lot of progress was made this summer in integrating xyz2mol into the RDKit, but there's yet more work to be done. The first order of business is doing a more comprehensive review of bond order determination for accuracy. This will involve thorough code review and also possibly testing the work with a larger, more diverse set of molecules. And, as mentioned earlier, Python and Java bindings still need to be added.

Sreya Gogineni and others added 30 commits July 13, 2022 06:23
Co-authored-by: Greg Landrum <greg.landrum@gmail.com>
Added file parser and accompanied tests
Added the DetermineBonds library with atomic connectivity determination
Added determineBondOrdering() function
@greglandrum (Member) left a comment

This is a first version of this code, but I think it's useful and worth merging already.

radius if the van der Waals method is used
*/
RDKIT_DETERMINEBONDS_EXPORT void determineConnectivity(RWMol &mol,
bool useHueckel = false,
Contributor

What are the odds of adding more method parameters? Should we add a param class?

Contributor

This is the same question for the functions below.
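To picture the parameter-class idea raised here, a hypothetical sketch follows. Only `useHueckel` comes from the signature under review; the struct name, the other fields, their defaults, and the `describeMethod()` stand-in are assumptions made up for illustration, not part of the RDKit API.

```cpp
#include <string>

// Hypothetical parameter bundle. Only useHueckel appears in the reviewed
// signature; the remaining fields and defaults are illustrative guesses.
struct ConnectivityParams {
  bool useHueckel = false;  // false selects the van der Waals method
  int charge = 0;           // overall molecular charge
  double covFactor = 1.3;   // multiplier applied to covalent radii
};

// Stand-in for a function that takes the bundle instead of a growing list
// of positional arguments. New options can then be added as struct fields
// without changing this signature.
std::string describeMethod(
    const ConnectivityParams &params = ConnectivityParams()) {
  return params.useHueckel ? "Hueckel" : "van der Waals";
}
```

Callers set only the fields they care about, for example `ConnectivityParams p; p.useHueckel = true;`, and everything else keeps its default.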

### Student: Sreya Gogineni
### Mentors: Greg Landrum, Joey Storer, Jan H. Jensen

This summer, I worked on integrating 'xyz2mol' into the RDKit, an open source cheminformatics library. '[xyz2mol](https://github.com/jensengroup/xyz2mol)' was originally developed by Professor Jan H. Jensen's research group at the University of Copenhagen, based off of the work published in [this paper](https://onlinelibrary.wiley.com/doi/10.1002/bkcs.10334) (DOI: 10.1002/bkcs.10334).
Contributor

Could we add the paper reference to the doc string? I doubt anyone but us will see the README.

@bp-kelley (Contributor)

Minor comments aside, I have three API points:

  1. Is it planned to have this replace, or possibly be a method for, the PDBParser? That is where I would find the most use currently.
  2. If so, we possibly need an option to ignore H-H contacts, like ConnectTheDots.
  3. We should try to avoid returning raw pointers in the exposed C++ API. This makes it slightly harder to wrap in Python, but I think the tradeoffs are worth it.
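The third point can be illustrated with a small standalone sketch of a factory that returns `std::unique_ptr` instead of a raw pointer. `ToyMol` and `makeToyMol()` are hypothetical stand-ins, not RDKit types or functions; the point is only the ownership pattern.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a molecule type.
struct ToyMol {
  std::vector<std::string> symbols;
};

// Returning std::unique_ptr states in the signature that the caller owns
// the result, and the object is freed automatically when the pointer goes
// out of scope. A raw-pointer return would leave that to documentation.
std::unique_ptr<ToyMol> makeToyMol(const std::vector<std::string> &symbols) {
  auto mol = std::make_unique<ToyMol>();  // requires C++14 or later
  mol->symbols = symbols;
  return mol;
}
```

A caller that needs shared ownership can still convert the result, since a `std::unique_ptr` can be moved into a `std::shared_ptr`.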

@greglandrum merged commit d6eab05 into rdkit:master on Oct 6, 2022
@greglandrum added this to the 2022_09_1 milestone on Oct 6, 2022