Exception raised when reading very large SMILES file #5692

DavidACosgrove · 2022-10-26T12:43:07Z

Describe the bug
When reading a very large SMILES file with the SmilesMolSupplier, I get:

Read 29559000 molecules N#Cc1ccc(N2CCC(Nc3c(C(N)=O)cnc4[nH]ccc34)CC2)nn1.
Traceback (most recent call last):
  File "/Users/david/Projects/GeneralRDKit/./test_reader.py", line 7, in <module>
    for i, mol in enumerate(suppl):
RuntimeError: File parsing error: ERROR: Index error (idx = 29559491): ran out of lines

To Reproduce

#!/usr/bin/env python

from sys import argv
from rdkit import Chem

suppl = Chem.SmilesMolSupplier(argv[1], titleLine=True, nameColumn=1, smilesColumn=0)
for i, mol in enumerate(suppl):
    if not i % 1000:
        print(f'Read {i} molecules {Chem.MolToSmiles(mol)}.')

I concatenated multiple copies of ChEMBL30 to make a large enough file; the original error was on a proprietary file. I'm not attaching the file as it is 2.5 GB.
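
For anyone without ChEMBL to hand, a file of the required size could probably be built by repeating a few records until a target size is reached. This is a minimal sketch, assuming a whitespace-delimited file with a single title line; the helper write_repeated_smiles, the header text and the sample records are all hypothetical and not part of RDKit. The demo at the end writes only a few kilobytes; a real reproduction would need a target above 2**31 bytes.

```python
import os
import tempfile


def write_repeated_smiles(records, out_path, min_bytes, header="smiles name"):
    """Write the header once, then cycle through records until the file
    holds at least min_bytes bytes. Returns the number of records written.

    Hypothetical helper for illustration; assumes ASCII-only content so
    that characters written equal bytes written.
    """
    written = 0
    count = 0
    # newline="\n" disables newline translation so the byte count stays exact.
    with open(out_path, "w", newline="\n") as out:
        written += out.write(header + "\n")
        while written < min_bytes:
            smiles, name = records[count % len(records)]
            written += out.write(f"{smiles} {name}_{count}\n")
            count += 1
    return count


# Small demo: a real test would use min_bytes above 2**31.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.smi")
demo_count = write_repeated_smiles(
    [("CCO", "ethanol"), ("c1ccccc1", "benzene")], demo_path, 4096
)
print(demo_count, os.path.getsize(demo_path))
```

The resulting file matches the titleLine=True, nameColumn=1, smilesColumn=0 arguments used in the script above.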

Expected behavior
I would expect it to read files of any size.

Configuration (please complete the following information):

RDKit version: 2022.09.1 and 2022.03.5
OS: macOS and RH7.9 respectively
Python version (if relevant): 3.10.6 and 3.7 respectively
Are you using conda? Yes
If you are using conda, which channel did you install the rdkit from? conda-forge

Additional context
I think this is a file size issue rather than a number-of-lines issue, despite what the error message says. The files are over 2.3 GB. I have tried different files, and the code fails at different line numbers but at roughly the same size point, as far as I can tell. I suspect the fseek pointer being used in next() is wrapping round on a 32-bit boundary; the file sizes look about right for that. I had a look at the C++ but couldn't see anything obvious, although I didn't follow exactly what was going on.

The error message could be raised in 2 different places in next(). As a minor thing, it would be convenient for debugging if they were distinguished in some way: it takes a long time for the code to reach the offending part of the input file, so another long test would need to be run to establish which of the two error points it is.
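
On the 32-bit suspicion: the arithmetic is at least consistent with it. A quick check, assuming the file position is held in a signed 32-bit integer (a guess, not something confirmed from the C++); the helper exceeds_int32 is only for illustration:

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def exceeds_int32(offset_bytes):
    """Return True if a byte offset does not fit in a signed 32-bit int."""
    return offset_bytes > INT32_MAX


print(INT32_MAX)                     # 2147483647, i.e. about 2.15 GB
print(exceeds_int32(2_300_000_000))  # True: a 2.3 GB file crosses the limit
print(exceeds_int32(2_000_000_000))  # False: a 2.0 GB file does not
```

Files that fail at different line numbers but at a similar byte offset, just over 2.15 GB, would fit this picture.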

greglandrum · 2022-10-29T03:09:50Z

I'm going to guess that the problem is the use of a standard int here:
https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/FileParsers/SmilesMolSupplier.cpp#L393
Changing that to auto nextP = this->skipComments() should resolve it since skipComments() returns a long.
At the same time d_line needs to be changed to size_t or long.
That should provide sufficient headroom for these files.
I'll do a PR for this.
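
To illustrate why the narrower type matters, here is a rough simulation in Python using ctypes. It assumes that storing a large offset in an int keeps only the low 32 bits, which is what typical platforms do; it does not call any RDKit code, and the function names are made up for the example.

```python
import ctypes


def store_as_int32(offset):
    """Mimic assigning a large stream offset to a 32-bit signed int.

    ctypes integer types do no overflow checking, so the value is
    truncated to its low 32 bits and reinterpreted as signed.
    """
    return ctypes.c_int32(offset).value


def store_as_int64(offset):
    """Mimic assigning the same offset to a 64-bit signed integer."""
    return ctypes.c_int64(offset).value


offset = 2_300_000_000  # a position a little beyond 2**31 bytes
print(store_as_int32(offset))  # -1994967296: the offset has wrapped negative
print(store_as_int64(offset))  # 2300000000: the offset survives intact
```

A negative or otherwise wrong position would plausibly make the supplier believe it had run out of lines, which matches the reported error.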

We're still limited by using an int as the molecule index in the API; fixing that is a minor API break

DavidACosgrove · 2022-10-30T08:04:34Z

It's now working for me after the PR, thanks.

DavidACosgrove added the bug label Oct 26, 2022

greglandrum added a commit to greglandrum/rdkit that referenced this issue Oct 29, 2022

Fixes rdkit#5692

abeb2eb

greglandrum added this to the 2022_09_2 milestone Oct 29, 2022

bp-kelley closed this as completed in 812568e Nov 2, 2022

greglandrum added a commit that referenced this issue Nov 23, 2022

Fixes #5692 (#5706)

b8a657e
