Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DetermineBondOrders running out of memory on medium-sized disconnected structure #5902

Closed
jasondbiggs opened this issue Dec 28, 2022 · 7 comments
Labels
Milestone

Comments

@jasondbiggs
Copy link
Contributor

This XYZ file is perhaps a bit pathological but it was in our test set. It has 13 disconnected fragments that have the same connectivity.

image

DetermineConnectivity has no problem with this molecule, but DetermineBondOrders runs away, consuming more and more memory, and is killed by the shell:

>>> mol = rdmolfiles.MolFromXYZFile('/Users/jasonb/test_13_components.txt')
>>> rdDetermineBonds.DetermineConnectivity(mol)
>>> rdDetermineBonds.DetermineBondOrders(mol)
zsh: killed     python

If I take one isolated fragment from the input file and run that through (see attached test_1_component.txt) it can determine bond orders, although it does return a charge-separated molecule that isn't quite right (see #5888).

I have run DetermineBondOrders on structures with many more atoms than this, even disconnected structures with more atoms: make a 3D table of ~50 benzene molecules translated in space and write them to an XYZ file, no issue with DetermineBondOrders. But this one fails.

Is the code somehow considering permutations of bond orders between fragments? I assumed DetermineBondOrders would work as if it iterated over the disconnected fragments.

test_1_component.txt
test_13_components.txt

@jasondbiggs
Copy link
Contributor Author

I spent a little more time on this today, and came up with a slightly less pathological example. This molecule has the SMILES string

C(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)C

test_20_ketones.txt

Importing the attached XYZ file will result in runaway memory usage, though I haven't waited to see if it crashes before I kill it because I see the memory go over 50GB.

The bottleneck is these two lines. The above molecule has 20 oxygen atoms, so the variable numCombos is 3^20 or about 3.5 trillion. The next line tries to allocate a numCombos x numAtoms matrix (about 355 trillion integers in this case), and that's where I kill it.

Even if it could finish that allocation I wonder about the timing for the 4-fold nested for loop immediately following.

It is interesting that the original python xyz2mol does not suffer this problem - it returns the above SMILES pretty quickly so the scaling issue described here is not a fundamental limitation of the algorithm.

@greglandrum
Copy link
Member

@jasondbiggs Thanks for the report and for doing the extra research. We'll take a look at it and see what we can do.

@gosreya: do you have the time and inclination to look into this or shall I do it?

@gosreya
Copy link
Contributor

gosreya commented Dec 30, 2022

@greglandrum Yes, I'll look into this!

@greglandrum
Copy link
Member

@greglandrum Yes, I'll look into this!

@gosreya thanks!

@jasondbiggs
Copy link
Contributor Author

@gosreya I had some time off over the holiday, so I took this as a puzzle to solve. It's not too difficult to emulate python's itertools.combinations to lazily return the valence combinations rather than precomputing them. (I could paste my solution here, or make a PR even, but I can hold back if you would like a go at it).

My fix works great for the test_20_ketones.txt example. It also makes the test_13_components.txt not die from consuming available memory, but it still just spins and spins because of the combinatorics. That will require a more extensive fix I think.

@gosreya
Copy link
Contributor

gosreya commented Jan 4, 2023

@jasondbiggs Thank you for writing up a solution! Feel free to make a PR, or I can integrate your solution if you'd prefer to copy and paste- whatever works for you.

I'd observed the same thing thing about test_13_components.txt when I ran it on the Python xyz2mol so yeah I suppose that case might require reworking of the algorithm itself.

jasondbiggs pushed a commit to jasondbiggs/rdkit that referenced this issue Jan 4, 2023
lazily generate the combinations of possible atom valences rather than computing them up front
@greglandrum greglandrum added this to the 2022_09_4 milestone Jan 11, 2023
greglandrum pushed a commit that referenced this issue Jan 16, 2023
* fixes #5902

lazily generate the combinations of possible atom valences rather than computing them up front

* changes in response to review

* mark input as const

forgot this one

Co-authored-by: Jason Biggs <jasonb@wolfram.com>
@viniavila
Copy link

I observed the same with this xyz data:

36

C	0.4070680770	1.4408559846	-0.0613969049
C	-0.0706049485	1.5694609670	-1.3560332864
C	-1.0463804263	0.7173849368	-1.8499980669
C	-1.5482670792	-0.2829323469	-1.0280244168
C	-1.1455152188	-0.4366378339	0.3275050436
C	-0.1172240023	0.4519439122	0.7579611778
N	0.4445216801	0.3928939836	2.1172773394
O	1.0022932427	1.3957006724	2.5427619172
O	0.3489788613	-0.6703882563	2.7405360968
N	-1.6598531608	-1.3709389969	1.1987611003
C	-2.9262538367	-1.9042638519	1.2935432637
C	-3.1015754213	-3.2486335104	1.7336174820
C	-4.3467611236	-3.8555130644	1.8056057617
C	-5.4728928652	-3.1293109194	1.4517192311
C	-5.3775066003	-1.8065175411	1.0465409221
C	-4.1279913341	-1.2030096428	0.9979884486
N	-4.1481602385	0.2492848407	0.7271196627
O	-3.2854095395	0.9419571995	1.2660527167
O	-5.0539292195	0.6733847184	0.0183186066
N	-6.7963570653	-3.7684902937	1.5324288356
O	-6.8324681627	-4.9310196769	1.9257597355
O	-7.7665320291	-3.0910632937	1.2040054167
N	-1.9600425090	-4.0936449382	2.1195946034
O	-0.9177679498	-3.5340986805	2.4784318922
O	-2.1187573624	-5.3069314739	2.0906442853
N	-2.4443851838	-1.2472142987	-1.6993462221
O	-2.4087606063	-2.4174618160	-1.3202153244
O	-3.1295293186	-0.8234979340	-2.6232277400
N	0.4888120017	2.6148528391	-2.2287454615
O	1.3699441783	3.3261718161	-1.7544372365
O	0.0353710184	2.6958900253	-3.3671171056
H	1.1701397844	2.1092194941	0.3153183128
H	-1.3920177855	0.8089586355	-2.8717813357
H	-0.9921238961	-1.7275800781	1.8873231984
H	-4.4304059883	-4.8874369944	2.1205622886
H	-6.2668407534	-1.2411565498	0.7995927926

That molecule corresponds to CID=8576 from PubChem, with optimized geometry from PubChemQC Project.
Total charge of the molecule is zero.

DetermineConnectivity works well and fast, either with Van der Waals method or Hueckel method, but DetermineBondOrders taker forever to execute and crashes with an error: ValueError: Final molecular charge does not match input; could not find valid bond ordering . Any way to set up a timeout or any tip to make this function faster?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants