issue with V3000 SD files containing enhanced stereochemistry information #5165

Closed
GLPG-GT opened this issue Apr 5, 2022 · 1 comment · Fixed by #5209

GLPG-GT commented Apr 5, 2022

Describe the bug
Chem.SDMolSupplier emits warnings when reading a V3000 SD file written by Dassault/Biovia/Scitegic software and ignores the enhanced stereochemistry information, whereas the same molecule written to a V3000 SDF from within RDKit is read back correctly.

To Reproduce
Run:

from rdkit import Chem

with Chem.SDMolSupplier('mol_with_enhanced_stereo_2_And_groups.sdf') as SDF:
    ms2 = [m for m in SDF if m is not None]

on file:
mol_with_enhanced_stereo_2_And_groups.sdf.txt
after renaming it to .sdf (GitHub did not accept sdf as file type)

For comparison, run:

with Chem.SDMolSupplier('m_with_enh_stereo.sdf') as SDF:
    ms = [m for m in SDF if m is not None]

on file:
m_with_enh_stereo.sdf.txt
(again after renaming it to .sdf).

Expected behavior
Both files should result in a V3000 MolBlock like the following:

 RDKit          2D

  0  0  0  0  0  0  0  0  0  0999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 22 24 0 0 0
M  V30 BEGIN ATOM
M  V30 1 O 7.414605 -6.052405 0.000000 0
M  V30 2 C 6.201079 -6.934083 0.000000 0
M  V30 3 N 4.830761 -6.323978 0.000000 0
M  V30 4 C 4.673969 -4.832195 0.000000 0
M  V30 5 C 3.303650 -4.222090 0.000000 0
M  V30 6 C 2.004612 -4.972090 0.000000 0
M  V30 7 C 0.889895 -3.968394 0.000000 0
M  V30 8 C 1.500000 -2.598076 0.000000 0
M  V30 9 C 0.750000 -1.299038 0.000000 0
M  V30 10 C 1.500000 0.000000 0.000000 0
M  V30 11 C 0.750000 1.299038 0.000000 0
M  V30 12 C -0.750000 1.299038 0.000000 0
M  V30 13 C -1.500000 0.000000 0.000000 0
M  V30 14 C -0.750000 -1.299038 0.000000 0
M  V30 15 O 2.991783 -2.754869 0.000000 0
M  V30 16 N 6.357872 -8.425866 0.000000 0
M  V30 17 C 7.728190 -9.035971 0.000000 0
M  V30 18 C 9.027228 -8.285971 0.000000 0
M  V30 19 O 10.141946 -9.289667 0.000000 0
M  V30 20 C 9.531841 -10.659985 0.000000 0
M  V30 21 C 8.040058 -10.503192 0.000000 0
M  V30 22 O 7.036362 -11.617910 0.000000 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 2 1 2
M  V30 2 1 2 3
M  V30 3 1 3 4
M  V30 4 1 5 4 CFG=3
M  V30 5 1 5 6
M  V30 6 1 6 7
M  V30 7 1 8 7 CFG=3
M  V30 8 1 8 9
M  V30 9 2 9 10
M  V30 10 1 10 11
M  V30 11 2 11 12
M  V30 12 1 12 13
M  V30 13 2 13 14
M  V30 14 1 8 15
M  V30 15 1 2 16
M  V30 16 1 17 16 CFG=3
M  V30 17 1 17 18
M  V30 18 1 18 19
M  V30 19 1 19 20
M  V30 20 1 20 21
M  V30 21 1 21 22 CFG=3
M  V30 22 1 15 5
M  V30 23 1 21 17
M  V30 24 1 14 9
M  V30 END BOND
M  V30 BEGIN COLLECTION
M  V30 MDLV30/STERAC1 ATOMS=(2 5 8)
M  V30 MDLV30/STERAC2 ATOMS=(2 17 21)
M  V30 END COLLECTION
M  V30 END CTAB
M  END

Instead, the first file (the one causing warnings) results in the following V3000 MolBlock, which lacks the COLLECTION block with the two stereo groups:

2 And groups, from CXSMILES
RDKit 2D

  0  0  0  0  0  0  0  0  0  0999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 22 24 0 0 0
M  V30 BEGIN ATOM
M  V30 1 O 7.414600 -6.052410 0.000000 0
M  V30 2 C 6.201080 -6.934080 0.000000 0
M  V30 3 N 4.830760 -6.323980 0.000000 0
M  V30 4 C 4.673970 -4.832190 0.000000 0
M  V30 5 C 3.303650 -4.222090 0.000000 0
M  V30 6 C 2.004610 -4.972090 0.000000 0
M  V30 7 C 0.889900 -3.968390 0.000000 0
M  V30 8 C 1.500000 -2.598080 0.000000 0
M  V30 9 C 0.750000 -1.299040 0.000000 0
M  V30 10 C 1.500000 0.000000 0.000000 0
M  V30 11 C 0.750000 1.299040 0.000000 0
M  V30 12 C -0.750000 1.299040 0.000000 0
M  V30 13 C -1.500000 0.000000 0.000000 0
M  V30 14 C -0.750000 -1.299040 0.000000 0
M  V30 15 O 2.991780 -2.754870 0.000000 0
M  V30 16 N 6.357870 -8.425870 0.000000 0
M  V30 17 C 7.728190 -9.035970 0.000000 0
M  V30 18 C 9.027230 -8.285970 0.000000 0
M  V30 19 O 10.141950 -9.289670 0.000000 0
M  V30 20 C 9.531840 -10.659990 0.000000 0
M  V30 21 C 8.040060 -10.503190 0.000000 0
M  V30 22 O 7.036360 -11.617910 0.000000 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 2 1 2
M  V30 2 1 2 3
M  V30 3 1 3 4
M  V30 4 1 5 4 CFG=3
M  V30 5 1 5 6
M  V30 6 1 6 7
M  V30 7 1 8 7 CFG=3
M  V30 8 1 8 9
M  V30 9 2 9 10
M  V30 10 1 10 11
M  V30 11 2 11 12
M  V30 12 1 12 13
M  V30 13 2 13 14
M  V30 14 1 8 15
M  V30 15 1 2 16
M  V30 16 1 17 16 CFG=3
M  V30 17 1 17 18
M  V30 18 1 18 19
M  V30 19 1 19 20
M  V30 20 1 20 21
M  V30 21 1 21 22 CFG=3
M  V30 22 1 15 5
M  V30 23 1 21 17
M  V30 24 1 14 9
M  V30 END BOND
M  V30 END CTAB
M  END
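
The two MolBlocks differ mainly in the COLLECTION section, which is easy to miss when comparing by eye. As a rough sketch (the helper name `collection_lines` is made up for this illustration and is not part of RDKit), the section can be pulled out of a V3000 MolBlock string with plain Python:

```python
from typing import List


def collection_lines(molblock: str) -> List[str]:
    """Return the entries between BEGIN COLLECTION and END COLLECTION.

    Hypothetical helper for eyeballing MolBlock text; it does not use RDKit.
    """
    inside = False
    found = []
    for raw in molblock.splitlines():
        tokens = raw.split()
        # V3000 lines start with "M  V30"; splitting on whitespace keeps
        # this tolerant of collapsed or trailing spaces.
        if tokens[:2] != ["M", "V30"]:
            continue
        body = tokens[2:]
        if body == ["BEGIN", "COLLECTION"]:
            inside = True
        elif body == ["END", "COLLECTION"]:
            inside = False
        elif inside:
            found.append(" ".join(body))
    return found
```

Applied to the expected MolBlock this should return the two `MDLV30/STERAC…` entries, and an empty list for the MolBlock produced from the problematic file.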

Screenshots
Not applicable.

Configuration (please complete the following information):

  • RDKit version: 2021.09.2 build py39hccf6a74_0
  • OS: CentOS Linux 7
  • Python version (if relevant): 3.9.7
  • Are you using conda? yes
  • If you are using conda, which channel did you install the rdkit from? conda-forge
  • If you are not using conda: how did you install the RDKit? NA

Additional context
Not applicable.

@GLPG-GT GLPG-GT added the bug label Apr 5, 2022
Contributor

d-b-w commented Apr 5, 2022

The specific warnings are:

 In [1]: from rdkit import Chem
 In [2]: Chem.MolFromMolFile('/Users/wandschn/Downloads/mol_with_enhanced_stereo_2_And_groups.sdf')
[16:16:42] Skipping unrecognized collection type at line 58: MDLV30/STERAC1 ATOMS=(2 5 8) 
[16:16:42] Skipping unrecognized collection type at line 59: MDLV30/STERAC2 ATOMS=(2 17 21) 

It appears that lines 58 & 59 have trailing spaces. This is absolutely a bug in the regex we're using to parse this line.
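
To illustrate the failure mode, here is a minimal sketch in Python. The two patterns below are made up for this example and are not the regex RDKit actually uses (the parser is C++); they only show how an end-anchored pattern rejects a line with a trailing space while a whitespace-tolerant one accepts it.

```python
import re

# Hypothetical patterns for illustration only; NOT RDKit's actual regex.
STRICT = re.compile(r"^MDLV30/STE(ABS|REL\d+|RAC\d+) ATOMS=\(([\d ]+)\)$")
TOLERANT = re.compile(
    r"^\s*MDLV30/STE(ABS|REL\d+|RAC\d+)\s+ATOMS=\(([\d\s]+)\)\s*$"
)


def parse_stereo_collection(line, pattern=TOLERANT):
    """Return (group label, atom indices), or None if the line is not recognized."""
    match = pattern.match(line)
    if match is None:
        return None
    label = match.group(1)
    numbers = [int(tok) for tok in match.group(2).split()]
    # In a V3000 list the first number is the count of entries that follow.
    count, atoms = numbers[0], numbers[1:]
    if count != len(atoms):
        return None
    return label, atoms


line = "MDLV30/STERAC1 ATOMS=(2 5 8) "  # note the trailing space
print(parse_stereo_collection(line, STRICT))    # None: trailing space breaks the match
print(parse_stereo_collection(line, TOLERANT))  # ('RAC1', [5, 8])
```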

@greglandrum greglandrum self-assigned this Apr 6, 2022
@gosreya gosreya mentioned this issue Apr 17, 2022
greglandrum added a commit that referenced this issue Apr 17, 2022
* Improved regex whitespace handling

Change was made in the parseEnhancedStereo function

* Files for Github #5165 test case

Both files use enhanced stereochemistry, but differ in whitespace content

* Test case for Github Issue #5165

Catches whitespace parsing error

* Improves test case check

Makes test case more specific, less prone to potential invalid access to container

Co-authored-by: Greg Landrum <greg.landrum@gmail.com>

* Improves test case check

Makes test case more specific, less prone to potential invalid access to container

Co-authored-by: Greg Landrum <greg.landrum@gmail.com>

* Update test case "Github #5165"

Add 'require(mol)' to confirm valid molecule before additional testing

* Cleans up test for Issue #5165

* Cleans up test for Issue #5165

Co-authored-by: Greg Landrum <greg.landrum@gmail.com>
@greglandrum greglandrum added this to the 2022_03_2 milestone Apr 17, 2022
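
For RDKit builds that predate the fix (the report was against 2021.09.2), one possible workaround is to strip trailing whitespace from the V3000 lines before parsing. The function below is a sketch under that assumption; its name is invented here, and it only touches lines starting with `M  V30` so that data fields elsewhere in the SD file are left alone.

```python
def strip_v30_trailing_whitespace(sd_text: str) -> str:
    """Remove trailing whitespace from 'M  V30' lines only.

    Hypothetical pre-processing helper, not part of RDKit.
    """
    cleaned = []
    for line in sd_text.splitlines():
        if line.startswith("M  V30"):
            line = line.rstrip()
        cleaned.append(line)
    # Re-join with newlines and keep a final newline, as SD files expect.
    return "\n".join(cleaned) + "\n"
```

The cleaned text could then be handed to RDKit as a string, for example through `Chem.SDMolSupplier().SetData(...)`; that step was not tested as part of this sketch.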
greglandrum added a commit that referenced this issue Apr 25, 2022