Describe the bug
When reading a very large SMILES file with SmilesMolSupplier, I get:
Read 29559000 molecules N#Cc1ccc(N2CCC(Nc3c(C(N)=O)cnc4[nH]ccc34)CC2)nn1.
Traceback (most recent call last):
File "/Users/david/Projects/GeneralRDKit/./test_reader.py", line 7, in <module>
for i, mol in enumerate(suppl):
RuntimeError: File parsing error: ERROR: Index error (idx = 29559491): ran out of lines
To Reproduce
#!/usr/bin/env python
from sys import argv

from rdkit import Chem

suppl = Chem.SmilesMolSupplier(argv[1], titleLine=True, nameColumn=1, smilesColumn=0)
for i, mol in enumerate(suppl):
    if not i % 1000:
        print(f'Read {i} molecules {Chem.MolToSmiles(mol)}.')
I concatenated multiple copies of ChEMBL30 to make a large enough file; the original error occurred on a proprietary file. I'm not attaching the test file because it is 2.5 GB.
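For anyone who wants to build a comparable test file, the concatenation step could look roughly like the sketch below. It is only an illustration under assumptions: the function name `build_large_smiles_file` and its parameters are hypothetical, the source file is assumed to have a single title line, and the whole source file is read into memory once.

```python
def build_large_smiles_file(src_path, dst_path, min_bytes):
    """Repeat the data lines of src_path into dst_path until it holds
    at least min_bytes bytes. Returns the number of bytes written.

    Hypothetical helper: assumes the first line of src_path is a title
    line, which is written only once so the output keeps one header.
    """
    with open(src_path, "rb") as src:
        header = src.readline()
        body = src.read()  # read once; fine for a source of modest size
    if not body:
        raise ValueError("source file has no data lines")
    if not body.endswith(b"\n"):
        body += b"\n"  # avoid gluing two records together at the seam

    written = 0
    with open(dst_path, "wb") as dst:
        dst.write(header)
        written += len(header)
        while written < min_bytes:
            dst.write(body)
            written += len(body)
    return written
```

Calling it with `min_bytes=2_500_000_000` would give a file of about the size described here.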
Expected behavior
I would expect it to read files of any size.
Configuration (please complete the following information):
RDKit version: 2022.09.1 and 2022.03.5
OS: macOS and RH7.9 respectively
Python version (if relevant): 3.10.6 and 3.7 respectively
Are you using conda? Yes
If you are using conda, which channel did you install the rdkit from? conda-forge
Additional context
I think this is a file-size issue rather than a line-count issue, despite what the error message says. The files are over 2.3 GB. I have tried different files, and the code fails at different line numbers but at roughly the same file size, as far as I can tell. I suspect the fseek pointer used in next() is wrapping around at a 32-bit boundary; the file sizes look about right for that. I looked at the C++ but couldn't see anything obvious, although I didn't follow exactly what was going on. The error message can be raised in two different places in next(). As a minor point, it would help debugging if the two were distinguished in some way: it takes a long time for the code to reach the offending part of the input file, so another long test run is needed to establish which of the two error points is responsible.
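To make the suspected 32-bit wraparound concrete, here is a small sketch of what happens to a byte offset when it is stored in a signed 32-bit integer. It is not taken from the RDKit source; the helper `as_int32` is hypothetical and uses `ctypes.c_int32` only to mimic the truncation.

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2147483647, the largest signed 32-bit value


def as_int32(offset):
    """Return offset as it would appear after being stored in a signed 32-bit int."""
    return ctypes.c_int32(offset).value


print(as_int32(INT32_MAX))      # 2147483647 (still fits)
print(as_int32(INT32_MAX + 1))  # -2147483648 (wraps to the most negative value)
print(as_int32(2_500_000_000))  # -1794967296 (an offset inside a 2.5 GB file)
```

Any offset past 2147483647 bytes (about 2.15 GB) comes out negative, which would be consistent with failures that depend on file size rather than on line number.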
I'm going to guess that the problem is the use of a standard int here: https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/FileParsers/SmilesMolSupplier.cpp#L393
Changing that to auto nextP = this->skipComments() should resolve it since skipComments() returns a long.
At the same time d_line needs to be changed to size_t or long.
That should provide sufficient headroom for these files.
I'll do a PR for this.
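Until a fix is available, one possible workaround (my assumption, not something proposed in this thread) is to stream the file line by line in Python and parse each SMILES string individually, so that no file offset is held in a 32-bit integer. The generator below is a hypothetical sketch using only the standard library; it assumes whitespace-delimited columns and skips blank lines, lines starting with `#`, and lines with too few columns.

```python
def iter_smiles_records(path, title_line=True, smiles_column=0, name_column=1):
    """Yield (smiles, name) tuples from a whitespace-delimited SMILES file.

    Hypothetical helper: the parameter names loosely mirror the options
    used in the reproduction script but are not an RDKit API.
    """
    needed = max(smiles_column, name_column)
    with open(path, "r") as handle:
        if title_line:
            next(handle, None)  # discard the header line if present
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            fields = line.split()
            if len(fields) <= needed:
                continue  # skip lines missing a required column
            yield fields[smiles_column], fields[name_column]
```

Each yielded SMILES string could then be passed to `Chem.MolFromSmiles`, checking for a `None` result on unparseable input.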