Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix dump_one and dump_many for SDF file format #266

Closed
1 of 2 tasks
FanwangM opened this issue May 20, 2021 · 6 comments · Fixed by #267
Closed
1 of 2 tasks

Fix dump_one and dump_many for SDF file format #266

FanwangM opened this issue May 20, 2021 · 6 comments · Fixed by #267

Comments

@FanwangM
Copy link
Contributor

FanwangM commented May 20, 2021

The current implementation of iodata has some problems with saving SDF files.

  • No M END tag and therefore not readable by other packages.
  • No connectivity information dumped and therefore, it is regarded as a set of atoms when loading with other packages.

I don't know a quick fix to it yet, but it would nice to have SDF dump functionality.

@tovrstra
Copy link
Member

Thanks for noticing these issues. The M END seems to be missing indeed.

I'm a bit confused with your second point. When connectivity information is present in the IOData instance, it should be dumped. Can you given an example showing this problem? That may clarify the issue.

@FanwangM
Copy link
Contributor Author

Yeah, you make the point. I didn't make it clear. Because when connectivity information is not available, such as in XYZ file format, there would be no such information dumped to SDF file. So, I think we need to figure out a good way of generating the connectivity. I know open babel handles this very well.

@tovrstra

@PaulWAyers
Copy link
Member

It's a bit of mission creep....we can dump connectivity when we have it, but defining it would be a utility external to IOData I think. It's implicit in GOpt, and we could use that to define connectivity to the extent we need it. Perhaps would require splitting off a utility from GOpt for connectivity.

@tovrstra
Copy link
Member

I agree with Paul. IOData does (at least for now) not attempt to guess where bonds are because it goes beyond the original scope of reading and writing data.

If we decide to extend the scope, there should also be some discussion on how far we'd like to go. I'll try to make a few guesses. Just detecting connectivity (without trying to guess the types of bonds) can be done with relatively little code (~15 lines) and a table of covalent radii. For PDB files, that would be fine. However, not for the SDF format, because it also describes the type of bond to represent a Lewis structures. Trying to guess a Lewis structure from the connectivity is quite complex and existing algorithms tend to break on exotic molecules. (Even humans don't always agree.)Such an algorithm would go quite far beyond the scope of IOData. Openbabel, RDKit and OpenEye have advanced solutions for this. You can also try to use variations in bond length to detect the bond order, but that would require well-optimized geometries. Effects from level of theory, basis set or just internal strain may be enough to break the algorithm.

In any case, I'd suggest to fix one thing at a time. If you can make a PR fixing the M END issue, that would already be very welcome, irrespective of the connectivity discussion.

@PaulWAyers
Copy link
Member

So to be clear, I wouldn't be averse to having a stand-alone utility that had the functionality:

  • Input: iodata instance without connectivity.
  • Output: iodata instance with connectivity generated by RDKit, OpenBabel, etc. Or even just covalent radii and interatomic distances.

I wouldn't want to include this in iodata (except maybe the last one) because it adds major external dependencies and goes beyond the simple mandate of iodata which is averse to "internal computation", and focused on actual input/output. Once one starts trying to duplicate RDKit and OpenBabel then one has gone down a whole different (very interest, but very difficult) rabbit hole.

It is a fascinating problem, though. I thought a little bit about the problem of generating atom/bond types this morning (for fun) and, wow, what a mess. Especially as we are interested in structures that are not necessarily equilibrium structures, coming up with anything sensible would be very difficult, except maybe for relatively simple organic compounds and inorganic molecules involving only elements from Groups 1,2, 16, 17, 18. Even in such easy cases, what one does with things like sulfur hexafluoride? One would almost need to run a semiempirical calculation (or minimal basis set HF) and then post-process the data to be reliable, and then one is really truly in the HORTON landscape, not merely IOData.

Also, (for now) iodata is mostly (not exclusively, obviously) focussed on ab initio quantum calculations, and having atom/bond types end up being mostly useful for molecular mechanics and some types of semiempirical calculations.

@FanwangM
Copy link
Contributor Author

Thanks for the comments! @PaulWAyers @tovrstra
Given this problem is beyond the scope of our IOData, let's leave this for RDKit or OpenBabel.

It makes things very clear to time for now. I will fix the missing tag issues shortly and make a new PR.

tovrstra added a commit that referenced this issue May 22, 2021
Fix missing `M END` tag for SDF format, close #266
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants