Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sanitize molecules when SMILES needs to be produced in PandasTools #3818

Merged

Conversation

mwojcikowski
Copy link
Contributor

When loading SDF to a Pandas dataframe and discarding of molecules and retaining SMILES, the SDMolSupplier was not sanitizing the molecules before producing of SMILES. This tiny PR addresses this issue.

Additional change to improve the performance by getting property dict in one call instead of a loop over all properties.

@mwojcikowski mwojcikowski force-pushed the bugfix/sanitize_molecules_for_dataframe branch from c596f3e to c2ed5ac Compare February 16, 2021 16:30
@greglandrum
Copy link
Member

@mwojcikowski : the CI test failures here seem real.
It looks like you need to take a look at the tests for the PandasTools module and update the expected results.

@mwojcikowski
Copy link
Contributor Author

@greglandrum Yes, I've seen them. I simply didn't manage to address them yesterday yet. The issue is that mol.GetPropsAsDict() returns typed properties instead of strings. I will handle that and ping you as soon as this is sorted out.

@mwojcikowski mwojcikowski force-pushed the bugfix/sanitize_molecules_for_dataframe branch from 13d8e9f to 9e4ec69 Compare February 17, 2021 10:23
@mwojcikowski mwojcikowski force-pushed the bugfix/sanitize_molecules_for_dataframe branch from 9e4ec69 to f67438e Compare February 17, 2021 10:24
@mwojcikowski
Copy link
Contributor Author

@greglandrum I removed the changes improving performance, and left only the bugfix part and now tests pass.

@greglandrum
Copy link
Member

@greglandrum I removed the changes improving performance, and left only the bugfix part and now tests pass.

Thanks.
I'm not convinced the use of GetPropsAsDict() is a good idea here anyway: there's no guarantee that the given property will have the same type for every molecule and I'm not sure if pandas will do the right thing.
It would be nice (though not 100% backwards compatible) to have real types there when it can work though... maybe something for the future.

@greglandrum greglandrum added this to the 2021_03_1 milestone Feb 17, 2021
@greglandrum greglandrum merged commit 8ebf50f into rdkit:master Feb 17, 2021
@mwojcikowski
Copy link
Contributor Author

Indeed, I'd expect the sanitization to be the bottleneck anyways.

@mwojcikowski mwojcikowski deleted the bugfix/sanitize_molecules_for_dataframe branch February 17, 2021 13:23
@greglandrum
Copy link
Member

Indeed, I'd expect the sanitization to be the bottleneck anyways.

most definitely

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants