PCQM4Mv2 sdf problem #336

Closed
PierreHao opened this issue Jun 8, 2022 · 13 comments

Comments

@PierreHao

Hi, I am trying to read molecules from pcqm4m-v2-train.sdf and compare their structures with the molecules returned by rdkit.Chem.MolFromSmiles. For example, running
obabel -ixyz 1.xyz -osmi -O 1.smi
gives the SMILES CC(=O)N(C)/C=C/c1ccc(cc1OC)OC, whereas the original is COc1cc(OC)ccc1/C=C/N(C(=O)C)C.
When I run GNN inference on these two graphs, the final results differ slightly. It seems that the 2D graphs are not the same?
By the way, does the SDF not provide xyz coordinates for the H atoms, as pcqm4m-v2_xyz.zip does?

@weihua916
Contributor

weihua916 commented Jun 8, 2022

Hi! Why did you use 1.xyz? You should not use pcqm4m-v2_xyz.zip because it is outdated. Did you use pcqm4m-v2-train.sdf?

@weihua916
Contributor

You should be able to reproduce the 2D graph with:

from rdkit import Chem

# Iterate over the molecules stored in the SDF file
suppl = Chem.SDMolSupplier('pcqm4m-v2-train.sdf')
for idx, mol in enumerate(suppl):
    print(f'{idx}-th rdkit mol obj: {mol}')

See here for more details.
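
A note on the loader above, offered as a hedged sketch rather than as the maintainers' method: RDKit's SDMolSupplier and MolFromMolBlock both accept a removeHs keyword that defaults to True, so hydrogens stored explicitly in a file are dropped from the molecule graph unless removeHs=False is passed. This may be relevant when hydrogens appear to be missing after loading. The example uses methane built in memory as a stand-in (an assumption, so that it runs without the dataset file).

```python
from rdkit import Chem

# Stand-in mol block: methane with explicit hydrogens (not taken from the dataset).
methane = Chem.AddHs(Chem.MolFromSmiles("C"))
block = Chem.MolToMolBlock(methane)

# Default parsing removes explicit hydrogens (removeHs=True).
default_mol = Chem.MolFromMolBlock(block)
# Passing removeHs=False keeps every hydrogen as a node of the graph.
full_mol = Chem.MolFromMolBlock(block, removeHs=False)

print(default_mol.GetNumAtoms(), full_mol.GetNumAtoms())

# The same keyword applies when reading the dataset file:
# suppl = Chem.SDMolSupplier("pcqm4m-v2-train.sdf", removeHs=False)
```

Whether to keep the hydrogens as nodes is a modelling choice; the keyword only controls what the parser retains.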

@PierreHao
Author

@weihua916 Thank you for your reply. I have also tried pcqm4m-v2-train.sdf, but it does not seem to provide H atoms. In addition, the inference results with the 2D graphs from pcqm4m-v2-train.sdf differ slightly from those with the original data.csv.gz (some of them are the same). I am debugging to find the reason now. Have you seen the same problem?

@weihua916
Contributor

weihua916 commented Jun 8, 2022

Correct, we do not provide H atoms, for chemistry-related reasons (@nakatamaho can elaborate).

And yes, it is a known issue that some 2D graphs of pcqm4m-v2-train.sdf are different. Here, we wrote "Known issue: A very small number of training molecules (around 46 out of 3,378,606) have 2D graph structures that are inconsistent with the ones calculated from SMILES. These molecules often involve Si atom(s). For the rest of the training molecules, the 2D graphs constructed from SDF and SMILES are identical (even though the atom-to-atom correspondence is not available)."

@PierreHao
Author

I had read your description carefully beforehand. However, I find that GNN inference gives a different result for the 2D graph of the very first molecule, which does not contain Si. Maybe it is a problem in my code; I will check again.

@nakatamaho

Hi, PierreHao,

  1. The SMILES obtained through rdkit.Chem.MolFromSmiles and the SMILES written by Open Babel can differ slightly.
  2. The SDF does contain hydrogen atoms. If you look at the file "pcqm4m-v2-train.sdf", you will find the following part:
    /Volumes/PubChemQCDataBaseWork/pubchemqc2017database/xyz/00000000_00009999/1.xyz
    OpenBabel02162213453D

34 34 0 0 0 0 0 0 0 0999 V2000
7.0068 1.8970 3.2727 C 0 0 0 0 0 0 0 0 0 0 0 0
4.4650 -0.2257 1.3353 C 0 0 0 0 0 0 0 0 0 0 0 0
0.4432 2.1011 9.5057 C 0 0 0 0 0 0 0 0 0 0 0 0
0.3366 -2.5350 5.1551 C 0 0 0 0 0 0 0 0 0 0 0 0
2.6396 1.6000 5.9257 C 0 0 0 0 0 0 0 0 0 0 0 0
1.9474 1.8190 7.1190 C 0 0 0 0 0 0 0 0 0 0 0 0
3.1830 0.2123 3.8873 C 0 0 0 0 0 0 0 0 0 0 0 0
4.3870 0.7439 3.5935 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7754 -0.2729 6.7631 C 0 0 0 0 0 0 0 0 0 0 0 0
6.3568 1.0842 2.1622 C 0 0 0 0 0 0 0 0 0 0 0 0
2.4537 0.4611 5.1348 C 0 0 0 0 0 0 0 0 0 0 0 0
1.0098 0.8730 7.5380 C 0 0 0 0 0 0 0 0 0 0 0 0
1.4835 -0.4763 5.5827 C 0 0 0 0 0 0 0 0 0 0 0 0
5.0856 0.5626 2.3959 N 0 0 0 0 0 0 0 0 0 0 0 0
6.9369 0.8984 1.1006 O 0 0 0 0 0 0 0 0 0 0 0 0
0.2593 0.9658 8.6759 O 0 0 0 0 0 0 0 0 0 0 0 0
1.3108 -1.5731 4.7848 O 0 0 0 0 0 0 0 0 0 0 0 0
7.1466 1.3059 4.1845 H 0 0 0 0 0 0 0 0 0 0 0 0
6.4174 2.7835 3.5327 H 0 0 0 0 0 0 0 0 0 0 0 0
7.9816 2.2170 2.9038 H 0 0 0 0 0 0 0 0 0 0 0 0
4.3066 -1.2574 1.6696 H 0 0 0 0 0 0 0 0 0 0 0 0
5.1340 -0.2156 0.4776 H 0 0 0 0 0 0 0 0 0 0 0 0
3.4967 0.2086 1.0638 H 0 0 0 0 0 0 0 0 0 0 0 0
-0.2404 1.9714 10.3469 H 0 0 0 0 0 0 0 0 0 0 0 0
0.1957 3.0318 8.9780 H 0 0 0 0 0 0 0 0 0 0 0 0
1.4730 2.1641 9.8816 H 0 0 0 0 0 0 0 0 0 0 0 0
0.3625 -3.3009 4.3776 H 0 0 0 0 0 0 0 0 0 0 0 0
-0.6682 -2.0951 5.1999 H 0 0 0 0 0 0 0 0 0 0 0 0
0.5723 -2.9938 6.1242 H 0 0 0 0 0 0 0 0 0 0 0 0
3.3352 2.3628 5.5870 H 0 0 0 0 0 0 0 0 0 0 0 0
2.1307 2.7235 7.6863 H 0 0 0 0 0 0 0 0 0 0 0 0
2.7052 -0.4677 3.1921 H 0 0 0 0 0 0 0 0 0 0 0 0
4.8961 1.3664 4.3173 H 0 0 0 0 0 0 0 0 0 0 0 0
0.0389 -0.9814 7.1205 H 0 0 0 0 0 0 0 0 0 0 0 0

The last 17 lines of this atom block are the hydrogen atoms.
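
To check the hydrogen count without any chemistry toolkit, one can read the atom block directly. The sketch below assumes the standard fixed-width V2000 layout: three header lines, a counts line whose first three columns hold the atom count, then one line per atom with the element symbol as the fourth field. The excerpt quoted above has had its whitespace collapsed, so it is not used verbatim here. The water block and the function name count_elements are illustrative and not part of the dataset.

```python
from collections import Counter

# Illustrative V2000 mol block for water (coordinates are approximate).
WATER = "\n".join([
    "water",                      # header line 1: molecule name
    "  hypothetical-generator",   # header line 2: program / timestamp
    "",                           # header line 3: comment
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.1173 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.0000    0.7572   -0.4692 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.0000   -0.7572   -0.4692 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  1  3  1  0",
    "M  END",
])


def count_elements(mol_block: str) -> Counter:
    """Count element symbols in the atom block of a V2000 mol block."""
    lines = mol_block.splitlines()
    # The counts line is the fourth line; columns 1-3 hold the atom count.
    n_atoms = int(lines[3][0:3])
    # Each atom line is "x y z symbol ...", so the symbol is the fourth field.
    symbols = [line.split()[3] for line in lines[4:4 + n_atoms]]
    return Counter(symbols)


counts = count_elements(WATER)
print(counts)
```

Applied to the first record of the SDF, a count like this would be expected to report the hydrogens listed above.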

@PierreHao
Author

@nakatamaho I have seen it, thanks

@nakatamaho

  1. it seems that 2D graph is not the same?
    That can happen. We generated both the SMILES and the SDF from the xyz files using Open Babel. Because there is no rigorous or standard algorithm for converting atomic xyz coordinates to SMILES, the resulting SMILES strings can differ slightly. However, the differences should not be large.
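
One way to check whether two differently written SMILES strings describe the same 2D graph is to canonicalize both with a single toolkit and compare the results. A minimal sketch using RDKit and the two strings quoted earlier in this thread (the variable names are illustrative):

```python
from rdkit import Chem

# The two SMILES strings quoted in the opening comment of this thread.
smiles_from_xyz = "CC(=O)N(C)/C=C/c1ccc(cc1OC)OC"   # written by Open Babel from 1.xyz
smiles_original = "COc1cc(OC)ccc1/C=C/N(C(=O)C)C"   # string given as the original

# Canonicalize both with the same toolkit so that atom ordering and
# notation differences no longer affect the comparison.
canonical = {
    Chem.MolToSmiles(Chem.MolFromSmiles(smiles))
    for smiles in (smiles_from_xyz, smiles_original)
}

# A single canonical form means both strings encode the same 2D graph.
same_graph = len(canonical) == 1
print(same_graph)
```

If same_graph is True, the strings differ only in atom ordering and notation. The check says nothing about why a model's predictions differ between the two inputs.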

@PierreHao
Author

Yes, there is a slight difference. My experiments show a performance difference of 0.005 on the PCQM4Mv2 validation set.
For the 2nd OGB-LSC, should we not use the H atom information from the xyz files?

@nakatamaho

Note that you can also extract molecular graphs from the SDF, including bond orders.
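
As a rough illustration of reading bond orders straight from a mol block, the sketch below parses the V2000 bond block into (begin, end, order) tuples with zero-based atom indices. It assumes the standard fixed-width layout, in which the counts line stores the atom and bond counts in its first two three-character fields, and each bond line stores the two 1-based atom indices followed by the bond type (1 single, 2 double, 3 triple, 4 aromatic). The formaldehyde block, its coordinates, and the name read_bonds are made up for the example.

```python
# Illustrative V2000 mol block for formaldehyde (coordinates are made up).
FORMALDEHYDE = "\n".join([
    "formaldehyde",
    "  hypothetical-generator",
    "",
    "  4  3  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.0000    0.0000    1.2000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.9400    0.0000   -0.5400 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.9400    0.0000   -0.5400 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  2  0",
    "  1  3  1  0",
    "  1  4  1  0",
    "M  END",
])


def read_bonds(mol_block: str) -> list[tuple[int, int, int]]:
    """Return (begin, end, bond_type) tuples with zero-based atom indices."""
    lines = mol_block.splitlines()
    counts_line = lines[3]
    n_atoms = int(counts_line[0:3])
    n_bonds = int(counts_line[3:6])
    first_bond = 4 + n_atoms  # the bond block starts right after the atom block
    edges = []
    for line in lines[first_bond:first_bond + n_bonds]:
        begin, end, bond_type = int(line[0:3]), int(line[3:6]), int(line[6:9])
        edges.append((begin - 1, end - 1, bond_type))
    return edges


edges = read_bonds(FORMALDEHYDE)
print(edges)
```

A toolkit such as RDKit or Open Babel handles many more cases (charges, aromaticity perception, V3000 files); the hand parser is only meant to show where the bond orders live in the file.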

@nakatamaho

For the 2nd OGB-LSC, we should not use atom H information from xyz file?
It is up to you! The SDF provides all the information for the molecules, including the xyz coordinates of the atoms in each molecule (and all molecules are neutral).

@nakatamaho

First, an SDF is simply a collection of mol files. Second, you don't need the xyz files: the SDF contains everything you need!
Anyway, I have attached a.sdf.txt. Please save it as a.sdf.
a.sdf.txt
Then,
$ obabel -isdf a.sdf -o xyz
34
/Volumes/PubChemQCDataBaseWork/pubchemqc2017database/xyz/00000000_00009999/1.xyz
C 7.00680 1.89700 3.27270
C 4.46500 -0.22570 1.33530
C 0.44320 2.10110 9.50570
C 0.33660 -2.53500 5.15510
C 2.63960 1.60000 5.92570
C 1.94740 1.81900 7.11900
C 3.18300 0.21230 3.88730
C 4.38700 0.74390 3.59350
C 0.77540 -0.27290 6.76310
C 6.35680 1.08420 2.16220
C 2.45370 0.46110 5.13480
C 1.00980 0.87300 7.53800
C 1.48350 -0.47630 5.58270
N 5.08560 0.56260 2.39590
O 6.93690 0.89840 1.10060
O 0.25930 0.96580 8.67590
O 1.31080 -1.57310 4.78480
H 7.14660 1.30590 4.18450
H 6.41740 2.78350 3.53270
H 7.98160 2.21700 2.90380
H 4.30660 -1.25740 1.66960
H 5.13400 -0.21560 0.47760
H 3.49670 0.20860 1.06380
H -0.24040 1.97140 10.34690
H 0.19570 3.03180 8.97800
H 1.47300 2.16410 9.88160
H 0.36250 -3.30090 4.37760
H -0.66820 -2.09510 5.19990
H 0.57230 -2.99380 6.12420
H 3.33520 2.36280 5.58700
H 2.13070 2.72350 7.68630
H 2.70520 -0.46770 3.19210
H 4.89610 1.36640 4.31730
H 0.03890 -0.98140 7.12050

@PierreHao
Author

@nakatamaho Thank you so much
