PCQM4Mv2 sdf problem #336
Comments
Hi! Why did you use 1.xyz? You should not use |
You should be able to reproduce the 2D graph by
See here for more details. |
@weihua916 Thank you for your reply. I have also tried pcqm4m-v2-train.sdf, but it does not provide H atoms. Also, the inference result with the 2D graph from pcqm4m-v2-train.sdf is a little different from the one obtained with the original data.csv.gz (parts are the same). I am debugging to find the reason now. Have you seen the same problem? |
Correct, we do not provide atom H for some chemistry reason (@nakatamaho can elaborate). And yes, it is a known issue that some 2D graphs of pcqm4m-v2-train.sdf are different. Here, we wrote "Known issue: A very small number of training molecules (around 46 out of 3,378,606) have 2D graph structures that are inconsistent with the ones calculated from SMILES. These molecules often involve Si atom(s). For the rest of the training molecules, the 2D graphs constructed from SDF and SMILES are identical (even though the atom-to-atom correspondence is not available)." |
I had read your description carefully before. Now I find that we get a different GNN inference result on the 2D graph of the first molecule (which does not contain Si). Maybe it's a code problem; I'll check again. |
Hi, PierreHao,
34 34 0 0 0 0 0 0 0 0999 V2000
In the last 17 lines, you'll find hydrogen atoms. |
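For readers following along: the `34 34 ... 0999 V2000` line quoted above is a molfile counts line, whose first two fixed-width fields give the number of atoms and bonds. The snippet below is a minimal sketch of reading the counts line and tallying element symbols in the atom block using only the Python standard library. The sample record is a made-up water molecule with illustrative coordinates, not an entry from pcqm4m-v2-train.sdf, and `count_elements` is a hypothetical helper name.

```python
from collections import Counter

# Made-up V2000 molfile for water; coordinates are illustrative only.
MOLFILE = "\n".join([
    "water",
    "  hypothetical-example",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.9600    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.2400    0.9300    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  1  3  1  0",
    "M  END",
])


def count_elements(molfile: str) -> Counter:
    """Tally element symbols in the atom block of a V2000 molfile."""
    lines = molfile.splitlines()
    # Line 4 is the counts line. Its fields are fixed-width (3 characters
    # each), so slice by column instead of splitting on whitespace:
    # adjacent fields can run together, as in "0999" above.
    n_atoms = int(lines[3][0:3])
    atom_block = lines[4:4 + n_atoms]
    # In an atom line, the element symbol occupies columns 32-34 (1-based).
    return Counter(line[31:34].strip() for line in atom_block)


element_counts = count_elements(MOLFILE)
print(element_counts["H"], "hydrogen atoms out of", sum(element_counts.values()))
```

Applied to a record whose counts line starts with `34 34`, the same slicing would report 34 atoms and 34 bonds.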
@nakatamaho I have seen it, thanks |
Yes, slightly different. My experiments show a performance difference of 0.005 on the PCQM4Mv2 validation set.
Note that you can also extract molecular graphs from the SDF, including bond orders. |
First, SDF is just a naive collection of mol files. Second, you don't need xyz. SDF contains everything you need! |
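As a rough illustration of the two points above (an SDF is a sequence of mol records, and each record already carries atoms and bond orders), here is a minimal standard-library sketch that splits an SDF on the `$$$$` delimiter and reads each V2000 record into a list of element symbols plus a list of `(atom_i, atom_j, bond_type)` edges. The two-record sample is made up for illustration, and `parse_molblock` and `parse_sdf` are hypothetical helper names. In practice a toolkit reader such as RDKit's `Chem.SDMolSupplier` is the more robust choice.

```python
from typing import List, Tuple

Graph = Tuple[List[str], List[Tuple[int, int, int]]]

# Made-up two-record SDF (formaldehyde, then water); coordinates are illustrative.
SDF_TEXT = "\n".join([
    "formaldehyde",
    "  hypothetical-example",
    "",
    "  4  3  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.2000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.5500    0.9500    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.5500   -0.9500    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  2  0",
    "  1  3  1  0",
    "  1  4  1  0",
    "M  END",
    "$$$$",
    "water",
    "  hypothetical-example",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.9600    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.2400    0.9300    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  1  3  1  0",
    "M  END",
    "$$$$",
]) + "\n"


def parse_molblock(lines: List[str]) -> Graph:
    """Read atoms and bonds from one V2000 mol record (fixed-width fields)."""
    n_atoms = int(lines[3][0:3])
    n_bonds = int(lines[3][3:6])
    atoms = [line[31:34].strip() for line in lines[4:4 + n_atoms]]
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # Bond line: first atom, second atom, bond type (1 = single,
        # 2 = double, 3 = triple, 4 = aromatic). Atom indices are 1-based
        # in the file; convert them to 0-based here.
        a, b, bond_type = int(line[0:3]), int(line[3:6]), int(line[6:9])
        bonds.append((a - 1, b - 1, bond_type))
    return atoms, bonds


def parse_sdf(text: str) -> List[Graph]:
    """Split SDF text on the '$$$$' record delimiter and parse each record."""
    graphs: List[Graph] = []
    record: List[str] = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if record:
                graphs.append(parse_molblock(record))
            record = []
        else:
            record.append(line)
    return graphs


for atoms, bonds in parse_sdf(SDF_TEXT):
    print(atoms, bonds)
```

The sketch ignores data items after `M  END`, charges, and V3000 records, so it is only meant to show where the graph information lives in the file.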
@nakatamaho Thank you so much |
Hi, I am trying to get the mol from pcqm4m-v2-train.sdf and compare its structure with the mol from rdkit.Chem.MolFromSmiles. For example:
With obabel -ixyz 1.xyz -osmi -O 1.smi, we get the SMILES CC(=O)N(C)/C=C/c1ccc(cc1OC)OC, but the original is COc1cc(OC)ccc1/C=C/N(C(=O)C)C.
Then I ran GNN inference on these two graphs, and the final results are slightly different. It seems that the 2D graphs are not the same?
By the way, does the SDF not provide xyz coordinates for H atoms, as pcqm4m-v2_xyz.zip does?