Support valences of 4 and 6 for Te #1204

Closed
hsiaoyi0504 opened this issue Dec 15, 2016 · 13 comments

@hsiaoyi0504
Contributor

I have some SMILES from the ZINC database, but some of them appear to be invalid, or at least RDKit is unable to parse (read + canonicalize) them. I have collected them in this file. I also tried submitting these SMILES to the PubChem Standardization Service. Only CC12O[Te]34OC(C)(C1(C)O3)C2(C)O4 could be standardized (but RDKit can't handle it).

@greglandrum
Member

@hsiaoyi0504 all of those molecules have chemistry problems (i.e. invalid valence states on main-group elements). What do you expect to happen with them?

There's documentation out there on the web about why these are rejected by the RDKit and how to process them anyway. Does that not help you?
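
For readers who have not seen that workflow, here is a minimal sketch of the difference between default parsing and parsing without sanitization. It assumes a reasonably recent RDKit install, and the SMILES is a made-up example of a neutral four-valent nitrogen, not one of the molecules from the list.

```python
from rdkit import Chem

smiles = "CN(C)(C)C"  # made-up example: neutral nitrogen with four bonds

# Default parsing sanitizes the molecule and returns None on a valence error.
strict = Chem.MolFromSmiles(smiles)

# Skipping sanitization still builds the molecular graph.
raw = Chem.MolFromSmiles(smiles, sanitize=False)

# catchErrors=True reports the first failed sanitization step instead of raising.
failed = Chem.SanitizeMol(raw, catchErrors=True)
print(strict is None, raw.GetNumAtoms(), failed)
```

The unsanitized molecule can be inspected or edited, but properties that depend on sanitization (aromaticity perception, implicit hydrogens, ring information) should not be trusted until the underlying problem is fixed.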

@jir322

jir322 commented Dec 15, 2016 via email

@hsiaoyi0504
Contributor Author

@jir322 Sorry, I eventually found that it actually comes from ChEMBL 22.

@hsiaoyi0504
Contributor Author

@greglandrum Yes, when I parse them with RDKit, they are all reported as having invalid valence states, but is the valence state in CC12O[Te]34OC(C)(C1(C)O3)C2(C)O4 really invalid? When I use it as a query against PubChem, I find this record.

@jir322

jir322 commented Dec 15, 2016 via email

@hsiaoyi0504
Contributor Author

@jir322 Ha, you deserve it. There is also an interesting project called keras-molecules that tries to reproduce the results of a deep learning paper. It uses datasets from ZINC and ChEMBL.

@greglandrum
Member

PubChem is a great public resource, but it's not necessarily the best source to cite as evidence that something is reasonable chemistry. :-)
I can add acceptable valences of 4 and 6 to Te in order to make it analogous to Se and S.
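
To illustrate what "acceptable valences" means here, the following is a small pure-Python sketch of a valence check. The table and the function name `valence_is_allowed` are hypothetical and only for illustration; this is not RDKit's internal data structure or API.

```python
# Hypothetical lookup table for illustration; this is not RDKit's internal data.
ALLOWED_VALENCES = {
    "S": (2, 4, 6),
    "Se": (2, 4, 6),
    "Te": (2, 4, 6),  # 4 and 6 are the additions discussed above
}


def valence_is_allowed(symbol, valence):
    """Return True if `valence` is listed for `symbol`; unknown elements fail."""
    return valence in ALLOWED_VALENCES.get(symbol, ())


# Te in CC12O[Te]34OC(C)(C1(C)O3)C2(C)O4 has four single bonds to oxygen.
print(valence_is_allowed("Te", 4))
```

With only a valence of 2 listed for Te, the same check would reject that molecule, which matches the behavior reported at the top of this issue.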

@greglandrum changed the title from "Error parsing some SMILES" to "Support valences of 4 and 6 for Te" on Dec 16, 2016
@greglandrum added this to the 2016_09_3 milestone on Dec 16, 2016
@hsiaoyi0504
Contributor Author

@greglandrum Thanks. I don't think I made my point clearly: I think the valence of Te here is acceptable. I am also now testing reading SMILES from ZINC, and I may add more cases to the current list later.

greglandrum added a commit to greglandrum/rdkit that referenced this issue Dec 16, 2016
@hsiaoyi0504
Contributor Author

I have updated the list here.
More examples from the ZINC database have been added. All of them fail because of nitrogen valences of 4 and 5.

@greglandrum
Member

Those structures are chemically wrong. What exactly do you expect the RDKit to do with them?

@hsiaoyi0504
Contributor Author

hsiaoyi0504 commented Dec 16, 2016

@greglandrum I know they are chemically wrong, but is it possible to correct them? In my opinion, correcting some cases is possible. For instance, if I submit CC(=O)Nc1ccc(cc1)S(=O)(=O)N=N=N to the PubChem standardization service, I get CC(=O)NC1=CC=C(C=C1)S(=O)(=O)N=[N+]=N. So I think it can be done.

@greglandrum
Member

The RDKit generally does not "guess" about these things unless it's really clear what the user intended and the incorrect form is a more or less standard one. I don't think either of those conditions applies here. This is one where, if you care about correctness, a human being needs to look at what's intended and fix the input. If you don't care about correctness, it's easy to write some RDKit code that reads a molecule in without sanitizing it and adds a charge to neutral four-valent nitrogens.

The PubChem solution, as you present it, actually changes the overall charge on the molecule. I'm pretty sure that's not correct.
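
A minimal sketch of the "don't care about correctness" route described above, assuming a reasonably recent RDKit install. The function name `charge_four_valent_nitrogens` is hypothetical, and counting valence from bond orders plus explicit hydrogens is a simplification chosen for this sketch; aromatic nitrogens are skipped because their bond orders are ambiguous before kekulization.

```python
from rdkit import Chem


def charge_four_valent_nitrogens(smiles):
    """Parse without sanitization, add +1 to neutral 4-valent N, then sanitize."""
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:  # syntax error in the SMILES itself
        return None
    for atom in mol.GetAtoms():
        if atom.GetAtomicNum() != 7 or atom.GetFormalCharge() != 0:
            continue
        if atom.GetIsAromatic():
            # Simplification: aromatic nitrogens are left alone.
            continue
        valence = atom.GetNumExplicitHs() + sum(
            bond.GetBondTypeAsDouble() for bond in atom.GetBonds()
        )
        if valence == 4:
            atom.SetFormalCharge(1)
    Chem.SanitizeMol(mol)  # still raises if other problems remain
    return mol


mol = charge_four_valent_nitrogens("CC(=O)Nc1ccc(cc1)S(=O)(=O)N=N=N")
print(Chem.MolToSmiles(mol), Chem.GetFormalCharge(mol))
```

Note that this reproduces the problem with the PubChem result: the molecule comes out with a net charge of +1, so the output parses but is not necessarily what the original author intended.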

@hsiaoyi0504
Contributor Author

OK, I get your point, but I still think it is good for software to prompt the user to check or modify the input data. How about adding suggestions for some cases? Most of the cases here involve N=N=N, a common error where the charge on the middle N is not shown. These errors are common even in chemistry textbooks. More advanced features for guessing the intended structure could also be possible.
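
As a rough sketch of what such a suggestion could look like, here is a string-level heuristic in plain Python. The function name `suggest_azide_fix` is hypothetical and this is not an RDKit feature. It only handles an uncharged N=N=N written at the end of the SMILES or of a branch, and it proposes the charge-separated azide form N=[N+]=[N-], which keeps the molecule neutral overall.

```python
import re

# Matches N=N=N only at the end of the string or of a branch,
# so that the last nitrogen is the terminal one.
_TRAILING_AZIDE = re.compile(r"N=N=N(?=\)|$)")


def suggest_azide_fix(smiles):
    """Return a hint if the SMILES ends a chain with uncharged N=N=N, else None."""
    if not _TRAILING_AZIDE.search(smiles):
        return None
    candidate = _TRAILING_AZIDE.sub("N=[N+]=[N-]", smiles)
    return f"Found uncharged N=N=N; did you mean {candidate}?"


print(suggest_azide_fix("CC(=O)Nc1ccc(cc1)S(=O)(=O)N=N=N"))
```

A hint like this leaves the decision with the user, who still has to confirm that an azide was really intended.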

greglandrum added a commit that referenced this issue Dec 21, 2016