PCQM4Mv2 sdf problem #336

PierreHao · 2022-06-08T07:37:17Z

Hi, I tried to read the molecules from pcqm4m-v2-train.sdf and compare their structures with the molecules built by rdkit.Chem.MolFromSmiles. For example:
obabel -ixyz 1.xyz -osmi -O 1.smi gives the SMILES CC(=O)N(C)/C=C/c1ccc(cc1OC)OC, but the original is COc1cc(OC)ccc1/C=C/N(C(=O)C)C.
When I run GNN inference on these two graphs, the final results differ slightly. It seems that the 2D graphs are not the same?
By the way, does the SDF not provide xyz coordinates for H atoms, as pcqm4m-v2_xyz.zip does?

weihua916 · 2022-06-08T07:52:37Z

Hi! Why did you use 1.xyz? You should not use pcqm4m-v2_xyz.zip because it is outdated. Did you use pcqm4m-v2-train.sdf?

weihua916 · 2022-06-08T07:54:17Z

You should be able to reproduce the 2D graph by

from rdkit import Chem

suppl = Chem.SDMolSupplier('pcqm4m-v2-train.sdf')
for idx, mol in enumerate(suppl):
    print(f'{idx}-th rdkit mol obj: {mol}')

See here for more details.

PierreHao · 2022-06-08T08:05:53Z

@weihua916 Thank you for your reply. I have also tried pcqm4m-v2-train.sdf, but it does not seem to provide H atoms. Also, the inference results with the 2D graphs from pcqm4m-v2-train.sdf are slightly different from those with the original data.csv.gz (some are the same). I am debugging to find the reason now. Have you seen the same problem?

weihua916 · 2022-06-08T08:10:44Z

Correct, we do not provide H atoms, for chemistry-related reasons (@nakatamaho can elaborate).

And yes, it is a known issue that some 2D graphs of pcqm4m-v2-train.sdf are different. Here, we wrote "Known issue: A very small number of training molecules (around 46 out of 3,378,606) have 2D graph structures that are inconsistent with the ones calculated from SMILES. These molecules often involve Si atom(s). For the rest of the training molecules, the 2D graphs constructed from SDF and SMILES are identical (even though the atom-to-atom correspondence is not available)."

PierreHao · 2022-06-08T08:19:01Z

I had read that description carefully. However, I get a different GNN inference result on the 2D graph of the very first molecule, which does not contain Si. Maybe it is a problem in my code; I'll check again.

nakatamaho · 2022-06-08T08:40:55Z

Hi, PierreHao,

The SMILES handled by rdkit.Chem.MolFromSmiles and the SMILES generated by Open Babel can be (slightly) different.
The SDF does contain hydrogen atoms. If you look at the file "pcqm4m-v2-train.sdf", you'll find the following part:
/Volumes/PubChemQCDataBaseWork/pubchemqc2017database/xyz/00000000_00009999/1.xyz
OpenBabel02162213453D

34 34 0 0 0 0 0 0 0 0999 V2000
7.0068 1.8970 3.2727 C 0 0 0 0 0 0 0 0 0 0 0 0
4.4650 -0.2257 1.3353 C 0 0 0 0 0 0 0 0 0 0 0 0
0.4432 2.1011 9.5057 C 0 0 0 0 0 0 0 0 0 0 0 0
0.3366 -2.5350 5.1551 C 0 0 0 0 0 0 0 0 0 0 0 0
2.6396 1.6000 5.9257 C 0 0 0 0 0 0 0 0 0 0 0 0
1.9474 1.8190 7.1190 C 0 0 0 0 0 0 0 0 0 0 0 0
3.1830 0.2123 3.8873 C 0 0 0 0 0 0 0 0 0 0 0 0
4.3870 0.7439 3.5935 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7754 -0.2729 6.7631 C 0 0 0 0 0 0 0 0 0 0 0 0
6.3568 1.0842 2.1622 C 0 0 0 0 0 0 0 0 0 0 0 0
2.4537 0.4611 5.1348 C 0 0 0 0 0 0 0 0 0 0 0 0
1.0098 0.8730 7.5380 C 0 0 0 0 0 0 0 0 0 0 0 0
1.4835 -0.4763 5.5827 C 0 0 0 0 0 0 0 0 0 0 0 0
5.0856 0.5626 2.3959 N 0 0 0 0 0 0 0 0 0 0 0 0
6.9369 0.8984 1.1006 O 0 0 0 0 0 0 0 0 0 0 0 0
0.2593 0.9658 8.6759 O 0 0 0 0 0 0 0 0 0 0 0 0
1.3108 -1.5731 4.7848 O 0 0 0 0 0 0 0 0 0 0 0 0
7.1466 1.3059 4.1845 H 0 0 0 0 0 0 0 0 0 0 0 0
6.4174 2.7835 3.5327 H 0 0 0 0 0 0 0 0 0 0 0 0
7.9816 2.2170 2.9038 H 0 0 0 0 0 0 0 0 0 0 0 0
4.3066 -1.2574 1.6696 H 0 0 0 0 0 0 0 0 0 0 0 0
5.1340 -0.2156 0.4776 H 0 0 0 0 0 0 0 0 0 0 0 0
3.4967 0.2086 1.0638 H 0 0 0 0 0 0 0 0 0 0 0 0
-0.2404 1.9714 10.3469 H 0 0 0 0 0 0 0 0 0 0 0 0
0.1957 3.0318 8.9780 H 0 0 0 0 0 0 0 0 0 0 0 0
1.4730 2.1641 9.8816 H 0 0 0 0 0 0 0 0 0 0 0 0
0.3625 -3.3009 4.3776 H 0 0 0 0 0 0 0 0 0 0 0 0
-0.6682 -2.0951 5.1999 H 0 0 0 0 0 0 0 0 0 0 0 0
0.5723 -2.9938 6.1242 H 0 0 0 0 0 0 0 0 0 0 0 0
3.3352 2.3628 5.5870 H 0 0 0 0 0 0 0 0 0 0 0 0
2.1307 2.7235 7.6863 H 0 0 0 0 0 0 0 0 0 0 0 0
2.7052 -0.4677 3.1921 H 0 0 0 0 0 0 0 0 0 0 0 0
4.8961 1.3664 4.3173 H 0 0 0 0 0 0 0 0 0 0 0 0
0.0389 -0.9814 7.1205 H 0 0 0 0 0 0 0 0 0 0 0 0

The last 17 lines are the hydrogen atoms.
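The hydrogen records can be checked without any chemistry toolkit by counting element symbols in the atom block. Below is a minimal sketch, assuming a standard V2000 mol block (three header lines, then a counts line whose first three characters hold the atom count). The `count_elements` helper and the methanol record are illustrative only and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical methanol record in V2000 format (rough placeholder coordinates,
# not taken from pcqm4m-v2-train.sdf).
MOL_BLOCK = "\n".join([
    "methanol",
    "  illustrative",
    "",
    "  6  5  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.4000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.4000    1.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.4000   -0.5000    0.9000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.4000   -0.5000   -0.9000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.7000    0.9000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  1  3  1  0",
    "  1  4  1  0",
    "  1  5  1  0",
    "  2  6  1  0",
    "M  END",
])


def count_elements(mol_block):
    """Count element symbols in the atom block of a V2000 mol block."""
    lines = mol_block.splitlines()
    # Line 4 is the counts line; its first 3 characters are the atom count.
    n_atoms = int(lines[3][:3])
    # Each atom line is "x y z symbol ...", so the symbol is the 4th field.
    symbols = [line.split()[3] for line in lines[4:4 + n_atoms]]
    return Counter(symbols)


counts = count_elements(MOL_BLOCK)
print(dict(counts))
```

If the hydrogens appear to be missing after loading the file with RDKit, note that `Chem.SDMolSupplier` takes a `removeHs` argument that defaults to `True`; passing `removeHs=False` keeps the explicit hydrogens. Whether that explains the observation earlier in this thread is not confirmed here.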

PierreHao · 2022-06-08T08:44:13Z

@nakatamaho I have seen it, thanks

nakatamaho · 2022-06-08T08:45:33Z

> It seems that the 2D graph is not the same?
It can be. We generated both the SMILES and the SDF from the xyz files using Open Babel. Because there is no rigorous or standard algorithm for converting atomic xyz coordinates to SMILES, the resulting SMILES strings can differ slightly. However, the differences should not be large.
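Two SMILES strings that look different can still describe the same 2D graph, because a SMILES string depends on the atom order chosen by the writer. One way to check is to canonicalize both strings with the same toolkit and compare the results. The sketch below does this with RDKit for the two strings quoted at the top of this thread; the `canonical` helper is an illustrative name, not part of OGB.

```python
from rdkit import Chem


def canonical(smiles):
    """Return RDKit's canonical SMILES for the given SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)


# The two strings reported in the first post of this thread.
obabel_smiles = "CC(=O)N(C)/C=C/c1ccc(cc1OC)OC"
original_smiles = "COc1cc(OC)ccc1/C=C/N(C(=O)C)C"

# Equal canonical strings mean the same atoms, bonds, and double-bond stereo.
same_graph = canonical(obabel_smiles) == canonical(original_smiles)
print(same_graph)
```

If the canonical strings match, any remaining difference in model output is more likely to come from atom ordering or featurization than from the molecular graph itself; that interpretation is a suggestion, not something established in this thread.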

PierreHao · 2022-06-08T08:52:26Z

Yes, slightly different. My experiments show a performance difference of 0.005 on the PCQM4Mv2 validation set.
For the 2nd OGB-LSC, should we not use the H atom information from the xyz file?

nakatamaho · 2022-06-08T08:53:34Z

Note that you can also extract the molecular graphs directly from the SDF, including bond orders.
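As a rough illustration of reading a graph with bond orders from one SDF record, the sketch below parses the bond block of a V2000 mol block using only the standard library. It assumes the fixed-width V2000 layout (three characters each for the two atom indices and the bond type, where 1, 2, 3 mean single, double, triple and 4 means aromatic). The `read_graph` helper and the formaldehyde record are hypothetical and not taken from the dataset.

```python
# Hypothetical formaldehyde record in V2000 format (placeholder coordinates).
MOL_BLOCK = "\n".join([
    "formaldehyde",
    "  illustrative",
    "",
    "  4  3  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.2000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.6000    0.9000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.6000   -0.9000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  2  0",
    "  1  3  1  0",
    "  1  4  1  0",
    "M  END",
])


def read_graph(mol_block):
    """Return (atom symbols, bonds) where bonds are (i, j, order), 0-based."""
    lines = mol_block.splitlines()
    counts = lines[3]
    # Counts line: atom count in columns 1-3, bond count in columns 4-6.
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = [line.split()[3] for line in lines[4:4 + n_atoms]]
    bonds = []
    start = 4 + n_atoms
    for line in lines[start:start + n_bonds]:
        # Bond line: first atom, second atom, bond type (3 characters each).
        a, b, order = int(line[0:3]), int(line[3:6]), int(line[6:9])
        bonds.append((a - 1, b - 1, order))  # convert to 0-based indices
    return atoms, bonds


atoms, bonds = read_graph(MOL_BLOCK)
print(atoms)
print(bonds)
```

In practice a toolkit such as RDKit or Open Babel handles the format's edge cases; this sketch only shows that the atom list and bond orders are stored in the record itself.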

nakatamaho · 2022-06-08T08:55:22Z

> For the 2nd OGB-LSC, should we not use the H atom information from the xyz file?
It is up to you! The SDF provides all the information for each molecule, including the xyz coordinates of its atoms (and the molecules are all neutral).

nakatamaho · 2022-06-08T09:08:38Z

First, an SDF is simply a collection of mol files. Second, you don't need the xyz files; the SDF contains everything you need!
Anyway, I have attached a.sdf.txt. Please save it as a.sdf.
a.sdf.txt
Then,
$ obabel -isdf a.sdf -o xyz
34
/Volumes/PubChemQCDataBaseWork/pubchemqc2017database/xyz/00000000_00009999/1.xyz
C 7.00680 1.89700 3.27270
C 4.46500 -0.22570 1.33530
C 0.44320 2.10110 9.50570
C 0.33660 -2.53500 5.15510
C 2.63960 1.60000 5.92570
C 1.94740 1.81900 7.11900
C 3.18300 0.21230 3.88730
C 4.38700 0.74390 3.59350
C 0.77540 -0.27290 6.76310
C 6.35680 1.08420 2.16220
C 2.45370 0.46110 5.13480
C 1.00980 0.87300 7.53800
C 1.48350 -0.47630 5.58270
N 5.08560 0.56260 2.39590
O 6.93690 0.89840 1.10060
O 0.25930 0.96580 8.67590
O 1.31080 -1.57310 4.78480
H 7.14660 1.30590 4.18450
H 6.41740 2.78350 3.53270
H 7.98160 2.21700 2.90380
H 4.30660 -1.25740 1.66960
H 5.13400 -0.21560 0.47760
H 3.49670 0.20860 1.06380
H -0.24040 1.97140 10.34690
H 0.19570 3.03180 8.97800
H 1.47300 2.16410 9.88160
H 0.36250 -3.30090 4.37760
H -0.66820 -2.09510 5.19990
H 0.57230 -2.99380 6.12420
H 3.33520 2.36280 5.58700
H 2.13070 2.72350 7.68630
H 2.70520 -0.46770 3.19210
H 4.89610 1.36640 4.31730
H 0.03890 -0.98140 7.12050

PierreHao · 2022-06-08T09:12:04Z

@nakatamaho Thank you so much

PierreHao closed this as completed Jun 8, 2022
