Exception raised when reading very large SMILES file #5692

Closed
DavidACosgrove opened this issue Oct 26, 2022 · 2 comments

@DavidACosgrove
Collaborator

Describe the bug
When reading a very large SMILES file with the SMILESMolSupplier, I get:

Read 29559000 molecules N#Cc1ccc(N2CCC(Nc3c(C(N)=O)cnc4[nH]ccc34)CC2)nn1.
Traceback (most recent call last):
  File "/Users/david/Projects/GeneralRDKit/./test_reader.py", line 7, in <module>
    for i, mol in enumerate(suppl):
RuntimeError: File parsing error: ERROR: Index error (idx = 29559491): ran out of lines

To Reproduce

#!/usr/bin/env python

from sys import argv
from rdkit import Chem

suppl = Chem.SmilesMolSupplier(argv[1], titleLine=True, nameColumn=1, smilesColumn=0)
for i, mol in enumerate(suppl):
    if not i % 1000:
        print(f'Read {i} molecules {Chem.MolToSmiles(mol)}.')

I concatenated multiple copies of ChEMBL30 to make a large enough file; the original error was on a proprietary file. I'm not attaching the test file because it is 2.5 GB.
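A rough sketch of how such a test file could be built from any smaller SMILES file with a title line. The function name `build_large_smiles_file` and the default threshold are illustrative assumptions, not part of the original report, and the source file is assumed to end with a newline so that appended copies do not merge lines.

```python
import shutil


def build_large_smiles_file(source_path, dest_path, min_bytes=2**31 + 1):
    """Append copies of source_path to dest_path until it reaches min_bytes.

    Hypothetical helper: the title line is written once, and only the data
    lines are repeated. Returns the number of copies written.
    """
    copies = 0
    with open(dest_path, "wb") as out:
        while out.tell() < min_bytes:
            size_before = out.tell()
            with open(source_path, "rb") as src:
                title = src.readline()
                if copies == 0:
                    out.write(title)  # keep the title line only once
                shutil.copyfileobj(src, out)  # stream the remaining lines
            if out.tell() == size_before:
                # Nothing was added, so the loop would never terminate.
                raise ValueError("source file has no data lines to copy")
            copies += 1
    return copies
```

The default threshold sits just past 2^31 bytes, the size at which a signed 32-bit file offset would no longer fit; a larger value can be passed to match the 2.3 GB to 2.5 GB files described here.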

Expected behavior
I would expect it to read files of any size.

Configuration (please complete the following information):

  • RDKit version: 2022.09.1 and 2022.03.5
  • OS: macOS and RH7.9 respectively
  • Python version (if relevant): 3.10.6 and 3.7 respectively
  • Are you using conda? Yes
  • If you are using conda, which channel did you install the rdkit from? conda-forge
  • If you are not using conda: how did you install the RDKit?

Additional context
I think this is a file-size issue rather than a line-count issue, despite what the error message says. The files are over 2.3 GB. I have tried different files, and the code fails at different line numbers but at roughly the same file size, as far as I can tell. I suspect the fseek pointer being used in next() is wrapping round on a 32-bit boundary; the file sizes look about right for that. I had a look at the C++ but couldn't see anything obvious, though I didn't follow exactly what was going on.

As a minor point, the error message can be raised in two different places in next(), and it would be convenient for debugging if they were distinguished in some way. It takes a long time for the code to reach the offending part of the input file, so another long test run will be needed to establish which of the two error points it is.
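To illustrate the suspected 32-bit wrap-around, here is a small sketch of what happens to a byte offset once it is forced into a signed 32-bit integer. It uses Python's `ctypes.c_int32` purely as a stand-in for a C++ `int`; the helper name `as_int32` is an assumption made for this example, and the exact behaviour of the real C++ code may differ by compiler and platform.

```python
import ctypes

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit int can hold


def as_int32(offset):
    """Return offset as it would read back from a signed 32-bit integer."""
    return ctypes.c_int32(offset).value


# An offset just inside the limit survives unchanged.
print(as_int32(INT32_MAX))

# One byte further and the stored value becomes negative.
print(as_int32(INT32_MAX + 1))

# An offset in the region of a 2.3 GB file also wraps to a negative value.
print(as_int32(2_300_000_000))
```

A negative or otherwise wrong offset passed to a seek call would plausibly make the reader believe it had run out of lines, which is consistent with the failure appearing at a similar file size regardless of line count.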

@greglandrum
Member

greglandrum commented Oct 29, 2022

I'm going to guess that the problem is the use of a standard int here:
https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/FileParsers/SmilesMolSupplier.cpp#L393
Changing that to auto nextP = this->skipComments() should resolve it since skipComments() returns a long.
At the same time d_line needs to be changed to size_t or long.
That should provide sufficient headroom for these files.
I'll do a PR for this.
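For anyone stuck on an affected release, one possible workaround is to bypass the supplier's stored offsets and stream the file line by line. The sketch below is an assumption-laden illustration, not RDKit code: `iter_smiles_records` is a hypothetical helper, it assumes whitespace-delimited columns, and it treats lines starting with `#` as comments.

```python
def iter_smiles_records(path, title_line=True, smiles_column=0, name_column=1):
    """Yield (smiles, name) tuples from a whitespace-delimited SMILES file.

    Hypothetical helper: reads one line at a time, so no byte offsets
    are stored and file size does not matter.
    """
    with open(path, "r") as handle:
        if title_line:
            next(handle, None)  # skip the header row if there is one
        for line in handle:
            fields = line.split()
            # Skip blank lines and comment lines.
            if not fields or fields[0].startswith("#"):
                continue
            # Skip lines that do not have a SMILES column at all.
            if smiles_column >= len(fields):
                continue
            if 0 <= name_column < len(fields):
                name = fields[name_column]
            else:
                name = ""  # no name column on this line
            yield fields[smiles_column], name
```

Each SMILES string could then be passed to `Chem.MolFromSmiles`, which returns `None` for strings it cannot parse, so the caller should check for that before using the molecule.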

greglandrum added a commit to greglandrum/rdkit that referenced this issue Oct 29, 2022
We're still limited by using an int as the molecule index in the API;
fixing that is a minor API break
@greglandrum greglandrum added this to the 2022_09_2 milestone Oct 29, 2022
@DavidACosgrove
Collaborator Author

It's now working for me after the PR, thanks.

greglandrum added a commit that referenced this issue Nov 23, 2022
We're still limited by using an int as the molecule index in the API;
fixing that is a minor API break