Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SDMolSupplier requires several attempts to load a SDF file in Python 3.6/3.7 #3517

Closed
jaimergp opened this issue Oct 21, 2020 · 2 comments
Closed
Labels
Milestone

Comments

@jaimergp
Copy link

jaimergp commented Oct 21, 2020

Configuration:

  • RDKit Version: 2020.03.6
  • Operating system: Ubuntu 18.04 (on WSL2)
  • Python version (if relevant): 3.6, 3.7
  • Are you using conda? Yes
  • If you are using conda, which channel did you install the rdkit from? Conda Forge
  • If you are not using conda: how did you install the RDKit?

Description:

This is a weird one :)

SDMolSupplier needs two iterations on the supplier instance to get the molecules loaded on Python 3.6 and 3.7. Python 3.8 does not exhibit this problem.

Python 3.6 🚫

import sys
import rdkit
from rdkit.Chem import SDMolSupplier
print("System:", sys.platform)
print("Python:", ".".join(map(str, sys.version_info)))
print("RDKit:", rdkit.__version__)
# System: linux
# Python: 3.6.11.final.0
# RDKit: 2020.03.6

supplier = SDMolSupplier("sample.sdf")
mols, attempts = [], 0
while not mols and attempts < 10:
    mols = list(supplier)
    attempts += 1
print(f"Loaded {len(mols)} molecules after {attempts} attempts.")
# Loaded 145 molecules after 2 attempts.

Python 3.7 🚫

import sys
import rdkit
from rdkit.Chem import SDMolSupplier
print("System:", sys.platform)
print("Python:", ".".join(map(str, sys.version_info)))
print("RDKit:", rdkit.__version__)
# System: linux
# Python: 3.7.8.final.0
# RDKit: 2020.03.6

supplier = SDMolSupplier("sample.sdf")
mols, attempts = [], 0
while not mols and attempts < 10:
    mols = list(supplier)
    attempts += 1
print(f"Loaded {len(mols)} molecules after {attempts} attempts.")
# Loaded 145 molecules after 2 attempts.

Python 3.8 ✅

import sys
import rdkit
from rdkit.Chem import SDMolSupplier
print("System:", sys.platform)
print("Python:", ".".join(map(str, sys.version_info)))
print("RDKit:", rdkit.__version__)
# System: linux
# Python: 3.8.6.final.0
# RDKit: 2020.03.6

supplier = SDMolSupplier("sample.sdf")
mols, attempts = [], 0
while not mols and attempts < 10:
    mols = list(supplier)
    attempts += 1
print(f"Loaded {len(mols)} molecules after {attempts} attempts.")
# Loaded 145 molecules after 1 attempts.

Python 3.9 ✅

import sys
import rdkit
from rdkit.Chem import SDMolSupplier
print("System:", sys.platform)
print("Python:", ".".join(map(str, sys.version_info)))
print("RDKit:", rdkit.__version__)
# System: linux
# Python: 3.9.0.final.0
# RDKit: 2020.03.6

supplier = SDMolSupplier("sample.sdf")
mols, attempts = [], 0
while not mols and attempts < 10:
    mols = list(supplier)
    attempts += 1
print(f"Loaded {len(mols)} molecules after {attempts} attempts.")
# Loaded 145 molecules after 1 attempts.

Zipped sample SDF: sample.zip

Thanks!

@jaimergp jaimergp changed the title SDMolSupplier requires several attempts to load a SDF file in Python 3.6 SDMolSupplier requires several attempts to load a SDF file in Python 3.6/3.7 Oct 21, 2020
@ptosco
Copy link
Contributor

ptosco commented Oct 22, 2020

@jaimergp I will investigate more in-depth over the weekend. For now I guess the reason is that the SDMolSupplier wrapper exposes a random access container to Python, while it does not 100% fulfil Python requirements for iterators.
For the time being, here are some workarounds:

import sys
import rdkit
from rdkit.Chem import SDMolSupplier, ForwardSDMolSupplier
print("System:", sys.platform)
print("Python:", ".".join(map(str, sys.version_info)))
print("RDKit:", rdkit.__version__)
System: linux
Python: 3.6.10.final.0
RDKit: 2021.03.1dev1

supplier = SDMolSupplier("/home/toscopa1/sdf/chembl100.sdf")
mols = list(supplier)
len(mols)
0

# Use ForwardSDMolSupplier
supplier = ForwardSDMolSupplier("/home/toscopa1/sdf/chembl100.sdf")
mols = list(supplier)
len(mols)
100

# Use ForwardSDMolSupplier
supplier = ForwardSDMolSupplier("/home/toscopa1/sdf/chembl100.sdf")
mols = tuple(supplier)
len(mols)
100

# Explicitly iterate with a list comprehension
supplier = SDMolSupplier("/home/toscopa1/sdf/chembl100.sdf")
mols = [m for m in supplier]
len(mols)
100

# Explicitly iterate with a map
supplier = SDMolSupplier("/home/toscopa1/sdf/chembl100.sdf")
mols = list(map(lambda m: m, supplier))
len(mols)
100

@jaimergp
Copy link
Author

Oh, I see. ForwardSDMolSupplier seems to be the way to go. We were using a plain comprehension before and it worked, but it was refactored into that list(...) call for simplicity. Interesting 🤔 Thanks for looking into it!

ptosco added a commit to ptosco/rdkit that referenced this issue Oct 24, 2020
@greglandrum greglandrum added the bug label Nov 4, 2020
@greglandrum greglandrum added this to the 2020_09_2 milestone Nov 4, 2020
greglandrum added a commit that referenced this issue Nov 24, 2020
* fixes #3517

* added test in response to review

Co-authored-by: Greg Landrum <greg.landrum@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants