Methods for Assessing the Diversity of a Set (and how optimal it is) #4

Closed · PaulWAyers opened this issue Jan 25, 2022 · 16 comments

@PaulWAyers
Member

PaulWAyers commented Jan 25, 2022

For example, we could measure how much more diverse a selected sample is than a random sample.

If the clusters are of similar size, then random-ish samples do a lot better than when the clusters are very inhomogeneous in size.

@RichRick1
Collaborator

RichRick1 commented Jan 26, 2022

Found this paper really easy to follow; it describes a couple of algorithms (clustering and Dissimilarity-Based Compound Selection) that might be relatively easy to implement at the beginning. Dissimilarity-Based Compound Selection also provides a decent metric on its own, while the quality of the clustering approach can be assessed with the Gini index/entropy. Here are some screenshots:
[screenshot: Dissimilarity-Based Compound Selection algorithm from the paper]

And for the clustering algorithm:
[screenshot: clustering-based selection algorithm from the paper]

So, overall steps might be:

  1. Build (or let the user input) the dissimilarity matrix as an np.array / sparse array / .txt file (?).
  2. Use the proposed Dissimilarity-Based Compound Selection, which doesn't require much computational power or resources (a minimal sketch is given after this list). Another choice at this stage might be a clustering algorithm (hierarchical (an application can be found in this paper), k-means / DBSCAN / etc.) and then just select the required number of samples from each cluster.
  3. Use the Gini index / entropy to decide how diverse the dataset is.
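
A minimal sketch of the dissimilarity-based (maximin) selection in step 2, assuming the dissimilarity matrix from step 1 is a dense NumPy array; the function name and defaults are just illustrative:

```python
import numpy as np

def maximin_select(dissim, n_select, start=0):
    """Greedy maximin (dissimilarity-based) selection: repeatedly add the point
    whose minimum dissimilarity to the already-selected set is largest."""
    dissim = np.asarray(dissim, dtype=float)
    selected = [start]
    # min_dist[i] = smallest dissimilarity from point i to the selected set
    min_dist = dissim[start].copy()
    min_dist[start] = -np.inf          # never re-pick a selected point
    for _ in range(n_select - 1):
        pick = int(np.argmax(min_dist))
        selected.append(pick)
        min_dist = np.minimum(min_dist, dissim[pick])
        min_dist[pick] = -np.inf
    return selected
```

Calling `maximin_select(D, 10)` on a precomputed dissimilarity matrix `D` would return the indices of 10 mutually dissimilar compounds.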

Potential problems might be:

  1. How to select the number of clusters for the clustering approach (could be done with the elbow method and/or the silhouette coefficient)
  2. Clustering algorithms in general don't always work well, so maybe we need to compare the output against a random-sampling baseline/threshold
  3. How to deal with imbalanced datasets

@fwmeng88, please let me know what you think about this.

@PaulWAyers
Member Author

I think these are two great brute-strength methods for generating samples. Selecting the right number of clusters can probably be left to the user; a default option could be to set the number of clusters equal to the desired sample size. A simple revision would be to take n_members elements from each of n_clusters clusters to generate n_members*n_clusters samples, where any deficit (because a cluster doesn't have enough members) is resolved in some way (random sampling or dissimilarity-based selection).

I think some of the other methods are more efficient, but for most molecular databases these methods should be very robust and extremely reliable; the only potential issue is cost.
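
A rough sketch of this cluster-then-select idea, assuming feature vectors are available and using scikit-learn's KMeans; the deficit handling here (random top-up) is just one of the options mentioned above, and all names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_select(X, n_clusters, n_members, seed=0):
    """Take up to n_members points from each of n_clusters KMeans clusters;
    any deficit is topped up with a random draw from the unselected points."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    selected = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        take = min(n_members, members.size)
        selected.extend(rng.choice(members, size=take, replace=False).tolist())
    # top up any deficit with a random draw from the remaining points
    deficit = n_clusters * n_members - len(selected)
    if deficit > 0:
        remaining = np.setdiff1d(np.arange(len(X)), selected)
        extra = min(deficit, remaining.size)
        selected.extend(rng.choice(remaining, size=extra, replace=False).tolist())
    return np.array(selected)
```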

@FanwangM
Collaborator

This paper is very interesting. Thanks for proposing the idea. @RichRick1

DBCS could be one of the very first methods we implement. But we may need to think about its tendency to select outliers, which is closely related (though not identical) to the class-imbalance problem you mentioned. An alternative is the sphere-of-exclusion algorithm (J. Chem. Inf. Comput. Sci. 1995, 35, 1, 59–67 and ACS Chem. Biol. 2011, 6, 3, 208–217).

Another possibility is to try clustering together with a Monte Carlo search. Based on @PaulWAyers's comments above, we can try a clustering algorithm that provides probability information (soft clustering), take a few members out of each cluster to form a new sample set, and then use a Monte Carlo search to improve the diversity.
@RichRick1
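
A minimal sketch of the sphere-of-exclusion idea, assuming a precomputed dissimilarity matrix; the exclusion radius r is a user-chosen parameter, and the cited papers describe more refined variants:

```python
import numpy as np

def sphere_exclusion(dissim, r, order=None):
    """Visit points in the given order; keep a point only if it is farther
    than r from every point already kept (i.e. outside all exclusion spheres)."""
    n = dissim.shape[0]
    order = np.arange(n) if order is None else np.asarray(order)
    kept = []
    for i in order:
        if all(dissim[i, j] > r for j in kept):
            kept.append(int(i))
    return kept
```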

@PaulWAyers
Member Author

An article I was trying to remember but (finally) found is:
https://pubs-acs-org.libaccess.lib.mcmaster.ca/doi/10.1021/ci500749q
https://europepmc.org/article/med/25594586
This basically uses the maximin algorithm (the dissimilarity-based selection), but uses a grid to make it linear-scaling.

The directed sphere exclusion method is a sphere-of-exclusion method that seems (slightly) better than the most direct approach. It's implemented in the chemalot code (but not in Python):
https://github.com/chemalot/chemalot

I'm starting to feel like we probably have enough algorithms here, unless there is something really amazing out there. We would then have:

  • brute-strength dissimilarity maximization (expensive; prone to selecting outliers)
  • dissimilarity maximization with a grid (not perfectly optimal; less prone to selecting outliers)
  • directed sphere exclusion (cheaper; less prone to outliers)
  • clustering + selection

The first papers above suggest that when you have values of the target property, you can sample (even with stratification) within boxes/clusters to ensure diverse samples for the target property, or to select a library that is good for the target property.

This would give us three or four families of methods. With enough work there, the bigger indicator of performance is likely to end up being how good our metric for selecting molecules is, which is a never-ending task; mostly we need to support a wide variety of metrics.

@PaulWAyers
Member Author

Determinantal point processes seem really cool here; they maximize the volume of the parallelepiped containing the data, and in that sense ensure that the "interpolation regime" is as big as possible.
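
The volume picture can be made concrete as a log-determinant score on the Gram matrix of the selected feature vectors; a minimal sketch, assuming feature vectors are available (the small ridge term is an assumption to keep the matrix well-conditioned):

```python
import numpy as np

def logdet_diversity(X, subset, ridge=1e-6):
    """log-determinant of the (lightly regularized) Gram matrix of the selected
    feature vectors; larger values mean the points span a larger parallelepiped."""
    S = np.asarray(X, dtype=float)[list(subset)]
    gram = S @ S.T + ridge * np.eye(len(subset))  # ridge keeps the matrix nonsingular
    _, logdet = np.linalg.slogdet(gram)
    return float(logdet)
```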

@FanwangM
Collaborator

Using the Gini coefficient to measure chemical diversity is a great idea! @RichRick1
This paper explains more details: Journal of Computational Chemistry 2016, 37, 2091–2097. @PaulWAyers

@FanwangM
Collaborator

FanwangM commented Jan 28, 2022

So far, I have seen the following ways to compute dataset diversity:

  1. total diversity column, Eq. 4 on page 843 of J. Chem. Inf. Comput. Sci. 1997, 37, 841-851 (the first method in this paper)
  2. Eq. 9 on page 844 of J. Chem. Inf. Comput. Sci. 1997, 37, 841-851 (the second method in this paper)
  3. Gini coefficient from Journal of Computational Chemistry 2016, 37, 2091–2097 (see the sketch after this list)
  4. entropy from Eq. 9 on page 4 of Leguy et al. J Cheminform (2021) 13:76, with implementation at https://github.com/jules-leguy/EvoMol/blob/master/evomol/evaluation_entropy.py
  5. log-determinant function from Sci. Rep. 12: 1124 (2022)
  6. Wasserstein distance for property-based evaluation of diversity, from Sci. Rep. 12: 1124 (2022), with code available at https://github.com/tomotomonakanaka/SUBMO
  7. Explicit Diversity Index from J. Chem. Inf. Model. 2006, 46, 5, 1898–1904
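
A minimal sketch of one common way to compute a Gini coefficient over binary fingerprints (item 3), assuming the coefficient is taken over the per-bit "on" frequencies; this is an interpretation for illustration, not necessarily the exact formulation in the cited paper:

```python
import numpy as np

def fingerprint_gini(fps):
    """Gini coefficient of the per-bit 'on' counts of a 0/1 fingerprint matrix
    (rows = molecules, columns = bits): 0 means the bits are used evenly,
    values near 1 mean a few bits dominate."""
    counts = np.sort(np.asarray(fps, dtype=float).sum(axis=0))
    n = counts.size
    total = counts.sum()
    if total == 0:
        return 0.0
    index = np.arange(1, n + 1)                 # 1-based ranks of the sorted counts
    return float(((2 * index - n - 1) * counts).sum() / (n * total))
```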

I will add more as I read.
@PaulWAyers @RichRick1

Update: fixed wrong link

@PaulWAyers
Member Author

The Gini coefficient works primarily for binary fingerprints, I think. Entropy could be made to work for more general options, but it is probably easiest for a bit-string or another case where there is a discrete number of states (or a known distribution of values) for each element of the molecular descriptor vector. I'm not sure how the Gini coefficient could be used if only a (dis)similarity matrix were available (ditto for the entropy there, though). But that's a case where some sort of determinantal measure could work...
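
For the bit-string case, a minimal sketch of a per-bit Shannon entropy averaged over fingerprint positions; the averaging choice is an assumption for illustration, not a formulation from any of the cited papers:

```python
import numpy as np

def mean_bit_entropy(fps):
    """Average Shannon entropy (in bits) of each fingerprint position across the set;
    high values mean each bit is close to 50/50 on/off, i.e. a more varied set."""
    p = np.clip(np.asarray(fps, dtype=float).mean(axis=0), 1e-12, 1 - 1e-12)
    h = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return float(h.mean())
```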

The links Fanwang posted above seem wrong; try the one below instead.
https://pubs-acs-org.libaccess.lib.mcmaster.ca/doi/10.1021/ci9700337
Equations 12 and 13 in this paper are also sensible.

There may be some other good things here too.

@PaulWAyers
Member Author

PaulWAyers commented Mar 1, 2022

moved to issue #7

@PaulWAyers
Member Author

PaulWAyers commented Mar 1, 2022

Recommendation: split this into separate issues for similarity/dissimilarity computation and diversity computation. This amounts to putting some key info in issue #7.

@PaulWAyers
Member Author

Notes on Wasserstein Distance to Uniform Distribution (for @Khaleeh)
Wasserstein-Dist-to-Uniform-Dist.pdf
Wasserstein-Dist-to-Uniform-Dist.md
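
For the 1D property case, a minimal sketch of the distance-to-uniform idea using scipy.stats.wasserstein_distance; comparing against a uniform random sample over the observed range is an assumption for illustration, not necessarily how the attached notes set things up:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def distance_to_uniform(values, n_ref=10_000, seed=0):
    """Earth-mover's distance between a 1D property distribution and a uniform
    reference over the same range; smaller means more uniform coverage."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    reference = rng.uniform(values.min(), values.max(), size=n_ref)
    return wasserstein_distance(values, reference)
```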

@PaulWAyers
Member Author

@FarnazH pointed out that it would be very useful to have a very simple API for evaluating the diversity of a subset.

@PaulWAyers
Member Author

We should implement determinantal point processes as a diversity measure (using the Gram matrix if features are given; using the distance matrix or one of its implied quantities otherwise). There are greedy algorithms for optimizing the determinantal point process, and this makes for a very good selection method.
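
A minimal greedy sketch: at each step, add the point that most increases the log-determinant of the selected kernel submatrix (a standard greedy MAP heuristic for DPPs). The kernel K is an assumption here: it could be a Gram matrix of features or something derived from the distance matrix, and this brute-force version recomputes the determinant for every candidate at every step.

```python
import numpy as np

def greedy_logdet_select(K, n_select):
    """Greedily grow a subset maximizing log det K[S, S], a standard greedy
    MAP heuristic for determinantal point processes."""
    n = K.shape[0]
    selected = []
    for _ in range(n_select):
        best, best_val = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            trial = selected + [i]
            sign, logdet = np.linalg.slogdet(K[np.ix_(trial, trial)])
            val = logdet if sign > 0 else -np.inf  # reject singular/degenerate additions
            if val > best_val:
                best, best_val = i, val
        selected.append(best)
    return selected
```

With features X, K = X @ X.T gives the Gram-matrix variant mentioned above.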

@FanwangM
Collaborator

FanwangM commented Apr 8, 2022

> @FarnazH pointed out that it would be very useful to have a very simple API for evaluating the diversity of a subset.

This would be good to have, and the original plan was to address it soon. Thanks for the advice.

@PaulWAyers
Member Author

Note that we should also support p-sum versions (not just the minimum (p = -infinity) and the sum (p = 1)) for the brute-strength algorithm, so that a balanced representation can be achieved.
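
A minimal sketch of the p-sum score for a candidate's dissimilarities to the already-selected set; the limiting cases recover the two versions mentioned above (p = 1 is the plain sum, p → -infinity approaches the minimum used by maximin):

```python
import numpy as np

def p_sum_score(dists, p):
    """Power-sum score (sum_j d_j**p)**(1/p) of a candidate's dissimilarities
    to the selected set: p = 1 gives the plain sum, large negative p approaches
    the minimum (maximin); zero distances must be avoided when p < 0."""
    dists = np.asarray(dists, dtype=float)
    if np.isneginf(p):
        return float(dists.min())
    return float(np.power(dists, p).sum() ** (1.0 / p))
```

In the brute-strength loop, the greedy step would then pick the candidate maximizing p_sum_score(dissim[candidate, selected], p).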
