DetermineBondOrders running out of memory on medium-sized disconnected structure #5902

jasondbiggs · 2022-12-28T00:01:26Z

This XYZ file is perhaps a bit pathological but it was in our test set. It has 13 disconnected fragments that have the same connectivity.

DetermineConnectivity has no problem with this molecule, but DetermineBondOrders runs away, consuming more and more memory, and is killed by the shell:

>>> mol = rdmolfiles.MolFromXYZFile('/Users/jasonb/test_13_components.txt')
>>> rdDetermineBonds.DetermineConnectivity(mol)
>>> rdDetermineBonds.DetermineBondOrders(mol)
zsh: killed     python

If I take one isolated fragment from the input file and run that through (see attached test_1_component.txt) it can determine bond orders, although it does return a charge-separated molecule that isn't quite right (see #5888).

I have run DetermineBondOrders on structures with many more atoms than this, even disconnected structures with more atoms: make a 3D table of ~50 benzene molecules translated in space and write them to an XYZ file, no issue with DetermineBondOrders. But this one fails.

Is the code somehow considering permutations of bond orders between fragments? I assumed DetermineBondOrders would work as if it iterated over the disconnected fragments.

test_1_component.txt
test_13_components.txt

The text was updated successfully, but these errors were encountered:

jasondbiggs · 2022-12-29T23:31:57Z

I spent a little more time on this today, and came up with a slightly less pathological example. This molecule has the SMILES string

C(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)CC(=O)C

test_20_ketones.txt

Importing the attached XYZ file will result in runaway memory usage, though I haven't waited to see if it crashes before I kill it because I see the memory go over 50GB.

The bottleneck is these two lines. The above molecule has 20 oxygen atoms, so the variable numCombos is 3^20 or about 3.5 trillion. The next line tries to allocate a numCombos x numAtoms matrix (about 355 trillion integers in this case), and that's where I kill it.

Even if it could finish that allocation I wonder about the timing for the 4-fold nested for loop immediately following.

It is interesting that the original python xyz2mol does not suffer this problem - it returns the above SMILES pretty quickly so the scaling issue described here is not a fundamental limitation of the algorithm.

greglandrum · 2022-12-30T05:58:41Z

@jasondbiggs Thanks for the report and for doing the extra research. We'll take a look at it and see what we can do.

@gosreya: do you have the time and inclination to look into this or shall I do it?

gosreya · 2022-12-30T18:31:34Z

@greglandrum Yes, I'll look into this!

greglandrum · 2022-12-30T19:06:59Z

@greglandrum Yes, I'll look into this!

@gosreya thanks!

jasondbiggs · 2023-01-03T23:58:39Z

@gosreya I had some time off over the holiday, so I took this as a puzzle to solve. It's not too difficult to emulate python's itertools.combinations to lazily return the valence combinations rather than precomputing them. (I could paste my solution here, or make a PR even, but I can hold back if you would like a go at it).

My fix works great for the test_20_ketones.txt example. It also makes the test_13_components.txt not die from consuming available memory, but it still just spins and spins because of the combinatorics. That will require a more extensive fix I think.

gosreya · 2023-01-04T05:30:15Z

@jasondbiggs Thank you for writing up a solution! Feel free to make a PR, or I can integrate your solution if you'd prefer to copy and paste- whatever works for you.

I'd observed the same thing thing about test_13_components.txt when I ran it on the Python xyz2mol so yeah I suppose that case might require reworking of the algorithm itself.

lazily generate the combinations of possible atom valences rather than computing them up front

* fixes #5902 lazily generate the combinations of possible atom valences rather than computing them up front * changes in response to review * mark input as const forgot this one Co-authored-by: Jason Biggs <jasonb@wolfram.com>

viniavila · 2023-03-13T12:48:44Z

I observed the same with this xyz data:

36

C	0.4070680770	1.4408559846	-0.0613969049
C	-0.0706049485	1.5694609670	-1.3560332864
C	-1.0463804263	0.7173849368	-1.8499980669
C	-1.5482670792	-0.2829323469	-1.0280244168
C	-1.1455152188	-0.4366378339	0.3275050436
C	-0.1172240023	0.4519439122	0.7579611778
N	0.4445216801	0.3928939836	2.1172773394
O	1.0022932427	1.3957006724	2.5427619172
O	0.3489788613	-0.6703882563	2.7405360968
N	-1.6598531608	-1.3709389969	1.1987611003
C	-2.9262538367	-1.9042638519	1.2935432637
C	-3.1015754213	-3.2486335104	1.7336174820
C	-4.3467611236	-3.8555130644	1.8056057617
C	-5.4728928652	-3.1293109194	1.4517192311
C	-5.3775066003	-1.8065175411	1.0465409221
C	-4.1279913341	-1.2030096428	0.9979884486
N	-4.1481602385	0.2492848407	0.7271196627
O	-3.2854095395	0.9419571995	1.2660527167
O	-5.0539292195	0.6733847184	0.0183186066
N	-6.7963570653	-3.7684902937	1.5324288356
O	-6.8324681627	-4.9310196769	1.9257597355
O	-7.7665320291	-3.0910632937	1.2040054167
N	-1.9600425090	-4.0936449382	2.1195946034
O	-0.9177679498	-3.5340986805	2.4784318922
O	-2.1187573624	-5.3069314739	2.0906442853
N	-2.4443851838	-1.2472142987	-1.6993462221
O	-2.4087606063	-2.4174618160	-1.3202153244
O	-3.1295293186	-0.8234979340	-2.6232277400
N	0.4888120017	2.6148528391	-2.2287454615
O	1.3699441783	3.3261718161	-1.7544372365
O	0.0353710184	2.6958900253	-3.3671171056
H	1.1701397844	2.1092194941	0.3153183128
H	-1.3920177855	0.8089586355	-2.8717813357
H	-0.9921238961	-1.7275800781	1.8873231984
H	-4.4304059883	-4.8874369944	2.1205622886
H	-6.2668407534	-1.2411565498	0.7995927926

That molecule corresponds to CID=8576 from PubChem, with optimized geometry from PubChemQC Project.
Total charge of the molecule is zero.

DetermineConnectivity works well and fast, either with Van der Waals method or Hueckel method, but DetermineBondOrders taker forever to execute and crashes with an error: ValueError: Final molecular charge does not match input; could not find valid bond ordering . Any way to set up a timeout or any tip to make this function faster?

jasondbiggs added the bug label Dec 28, 2022

jasondbiggs pushed a commit to jasondbiggs/rdkit that referenced this issue Jan 4, 2023

fixes rdkit#5902

668507b

lazily generate the combinations of possible atom valences rather than computing them up front

greglandrum added this to the 2022_09_4 milestone Jan 11, 2023

greglandrum closed this as completed in 64becfc Jan 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DetermineBondOrders running out of memory on medium-sized disconnected structure #5902

DetermineBondOrders running out of memory on medium-sized disconnected structure #5902

jasondbiggs commented Dec 28, 2022

jasondbiggs commented Dec 29, 2022

greglandrum commented Dec 30, 2022

gosreya commented Dec 30, 2022

greglandrum commented Dec 30, 2022

jasondbiggs commented Jan 3, 2023

gosreya commented Jan 4, 2023

viniavila commented Mar 13, 2023

DetermineBondOrders running out of memory on medium-sized disconnected structure #5902

DetermineBondOrders running out of memory on medium-sized disconnected structure #5902

Comments

jasondbiggs commented Dec 28, 2022

jasondbiggs commented Dec 29, 2022

greglandrum commented Dec 30, 2022

gosreya commented Dec 30, 2022

greglandrum commented Dec 30, 2022

jasondbiggs commented Jan 3, 2023

gosreya commented Jan 4, 2023

viniavila commented Mar 13, 2023