Methods for Assessing the Diversity of a Set (and how optimal it is) #4

Closed · PaulWAyers opened this issue Jan 25, 2022 · 16 comments

@PaulWAyers
Member

PaulWAyers commented Jan 25, 2022

For example, we could measure how much more diverse a selected sample is than a random sample.

If the clusters are of similar size, then random-ish samples do a lot better than when the clusters are very inhomogeneous in size.

@RichRick1
Collaborator

RichRick1 commented Jan 26, 2022

Found this paper really easy to follow; it describes a couple of algorithms (clustering and Dissimilarity-Based Compound Selection) that might be relatively easy to implement at the beginning. Dissimilarity-Based Compound Selection also provides a decent metric on its own, while the quality of the clustering approach can be assessed with the Gini index/entropy. Here are some screenshots:
[screenshot: Dissimilarity-Based Compound Selection algorithm from the paper]

And for the clustering algorithm:
[screenshot: clustering-based selection algorithm from the paper]

So, overall steps might be:

  1. Build (or let the user input) the dissimilarity matrix as an np.array / sparse array / .txt file (?).
  2. Use the proposed Dissimilarity-Based Compound Selection, which doesn't require much computational power or resources (a minimal sketch is given after this list). Another choice at this stage might be a clustering algorithm (hierarchical (an application can be found in this paper), k-means / DBSCAN / etc.) and then just select the required number of samples from each cluster.
  3. Use the Gini index / entropy to decide how diverse the dataset is.
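
A minimal sketch of the dissimilarity-based (maximin) selection in step 2, assuming the dissimilarity matrix from step 1 is a dense NumPy array; the function name and defaults are just illustrative:

```python
import numpy as np

def maximin_select(dissim, n_select, start=0):
    """Greedy maximin (dissimilarity-based) selection: repeatedly add the point
    whose minimum dissimilarity to the already-selected set is largest."""
    dissim = np.asarray(dissim, dtype=float)
    selected = [start]
    # min_dist[i] = smallest dissimilarity from point i to the selected set
    min_dist = dissim[start].copy()
    min_dist[start] = -np.inf          # never re-pick a selected point
    for _ in range(n_select - 1):
        pick = int(np.argmax(min_dist))
        selected.append(pick)
        min_dist = np.minimum(min_dist, dissim[pick])
        min_dist[pick] = -np.inf
    return selected
```

Calling `maximin_select(D, 10)` on a precomputed dissimilarity matrix `D` would return the indices of 10 mutually dissimilar compounds.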

Potential problems might be:

  1. How to select the number of clusters for the clustering approach (could be done with the elbow method and/or the silhouette coefficient)
  2. Clustering algorithms in general don't always work well, so maybe we need to compare the output against a random-sampling baseline/threshold
  3. How to deal with imbalanced datasets

@fwmeng88, please let me know what you think about this.

@PaulWAyers
Member Author

I think these are two great brute-strength methods for generating samples. Selecting the right number of clusters can probably be left to the user; a default option could be to set the number of clusters equal to the desired sample size. A simple revision would be to take n_members elements from each of n_clusters clusters to generate n_members*n_clusters samples, where any deficit (because a cluster doesn't have enough members) is resolved in some way (random sampling or dissimilarity-based selection).

I think some of the other methods are more efficient, but for most molecular databases these methods should be very robust and extremely reliable; the only potential issue is cost.
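
A rough sketch of this cluster-then-select idea, assuming feature vectors are available and using scikit-learn's KMeans; the deficit handling here (random top-up) is just one of the options mentioned above, and all names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_select(X, n_clusters, n_members, seed=0):
    """Take up to n_members points from each of n_clusters KMeans clusters;
    any deficit is topped up with a random draw from the unselected points."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    selected = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        take = min(n_members, members.size)
        selected.extend(rng.choice(members, size=take, replace=False).tolist())
    # top up any deficit with a random draw from the remaining points
    deficit = n_clusters * n_members - len(selected)
    if deficit > 0:
        remaining = np.setdiff1d(np.arange(len(X)), selected)
        extra = min(deficit, remaining.size)
        selected.extend(rng.choice(remaining, size=extra, replace=False).tolist())
    return np.array(selected)
```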

@FanwangM
Collaborator

This paper is very interesting. Thanks for proposing the idea. @RichRick1

DBCS could be one of the very first methods we implement. But we may need to think about its tendency to select outliers, which is closely related (though not identical) to the class-imbalance problem you mentioned. An alternative is the sphere-of-exclusion algorithm (J. Chem. Inf. Comput. Sci. 1995, 35, 1, 59–67 and ACS Chem. Biol. 2011, 6, 3, 208–217).

Another possibility is to try clustering together with a Monte Carlo search. Based on @PaulWAyers's comments above, we can try a clustering algorithm that provides probability information (soft clustering), take a few members out of each cluster to form a new sample set, and then use a Monte Carlo search to improve the diversity.
@RichRick1
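
A minimal sketch of the sphere-of-exclusion idea, assuming a precomputed dissimilarity matrix; the exclusion radius r is a user-chosen parameter, and the cited papers describe more refined variants:

```python
import numpy as np

def sphere_exclusion(dissim, r, order=None):
    """Visit points in the given order; keep a point only if it is farther
    than r from every point already kept (i.e. outside all exclusion spheres)."""
    n = dissim.shape[0]
    order = np.arange(n) if order is None else np.asarray(order)
    kept = []
    for i in order:
        if all(dissim[i, j] > r for j in kept):
            kept.append(int(i))
    return kept
```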

@PaulWAyers
Member Author

An article I was trying to remember but (finally) found is:
https://pubs-acs-org.libaccess.lib.mcmaster.ca/doi/10.1021/ci500749q
https://europepmc.org/article/med/25594586
This basically uses the maximin algorithm (the dissimilarity-based selection), but uses a grid to make it linear-scaling.

The directed sphere exclusion method is a sphere-of-exclusion method that seems (slightly) better than the most direct approach. It's implemented in the chemalot code (but not in Python):
https://github.com/chemalot/chemalot

I'm starting to feel like we probably have enough algorithms here, unless there is something really amazing out there. We would then have:

  • brute-strength dissimilarity maximization (expensive; prone to selecting outliers)
  • dissimilarity maximization with a grid (not perfectly optimal; less prone to selecting outliers)
  • directed sphere exclusion (cheaper; less prone to outliers)
  • clustering + selection

The first papers above suggest that when you have values of the target property, you can sample (even with stratification) within boxes/clusters to ensure diverse samples for the target property, or to select a library that is good for the target property.

This would give us three or four families of methods. With enough work there, the bigger indicator of performance is likely to end up being how good our metric for selecting molecules is, which is a never-ending task; mostly we need to support a wide variety of metrics.

@PaulWAyers
Member Author

Determinantal point processes seem really cool here; they maximize the volume of the parallelepiped containing the data, and in that sense ensure that the "interpolation regime" is as big as possible.
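
The volume picture can be made concrete as a log-determinant score on the Gram matrix of the selected feature vectors; a minimal sketch, assuming feature vectors are available (the small ridge term is an assumption to keep the matrix well-conditioned):

```python
import numpy as np

def logdet_diversity(X, subset, ridge=1e-6):
    """log-determinant of the (lightly regularized) Gram matrix of the selected
    feature vectors; larger values mean the points span a larger parallelepiped."""
    S = np.asarray(X, dtype=float)[list(subset)]
    gram = S @ S.T + ridge * np.eye(len(subset))  # ridge keeps the matrix nonsingular
    _, logdet = np.linalg.slogdet(gram)
    return float(logdet)
```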

@FanwangM
Collaborator

Using the Gini coefficient to measure chemical diversity is a great idea! @RichRick1
This paper explains more details: Journal of Computational Chemistry 2016, 37, 2091–2097. @PaulWAyers

@FanwangM
Collaborator

FanwangM commented Jan 28, 2022

So far, I have seen the following ways to compute dataset diversity:

  1. total diversity column, Eq. 4 on page 843 of J. Chem. Inf. Comput. Sci. 1997, 37, 841-851 (the first method in this paper)
  2. Eq. 9 on page 844 of J. Chem. Inf. Comput. Sci. 1997, 37, 841-851 (the second method in this paper)
  3. Gini coefficient from Journal of Computational Chemistry 2016, 37, 2091–2097 (see the sketch after this list)
  4. entropy from Eq. 9 on page 4 of Leguy et al. J Cheminform (2021) 13:76, with implementation at https://github.com/jules-leguy/EvoMol/blob/master/evomol/evaluation_entropy.py
  5. log-determinant function from Sci. Rep. 12: 1124 (2022)
  6. Wasserstein distance for property-based evaluation of diversity, from Sci. Rep. 12: 1124 (2022), with code available at https://github.com/tomotomonakanaka/SUBMO
  7. Explicit Diversity Index from J. Chem. Inf. Model. 2006, 46, 5, 1898–1904
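
A minimal sketch of one common way to compute a Gini coefficient over binary fingerprints (item 3), assuming the coefficient is taken over the per-bit "on" frequencies; this is an interpretation for illustration, not necessarily the exact formulation in the cited paper:

```python
import numpy as np

def fingerprint_gini(fps):
    """Gini coefficient of the per-bit 'on' counts of a 0/1 fingerprint matrix
    (rows = molecules, columns = bits): 0 means the bits are used evenly,
    values near 1 mean a few bits dominate."""
    counts = np.sort(np.asarray(fps, dtype=float).sum(axis=0))
    n = counts.size
    total = counts.sum()
    if total == 0:
        return 0.0
    index = np.arange(1, n + 1)                 # 1-based ranks of the sorted counts
    return float(((2 * index - n - 1) * counts).sum() / (n * total))
```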

I will add more as I read.
@PaulWAyers @RichRick1

Update: fixed wrong link

@PaulWAyers
Member Author

The Gini coefficient works primarily for binary fingerprints, I think. Entropy could be made to work for more general options, but it is probably easiest for a bit-string or another case where there is a discrete number of states (or a known distribution of values) for each element of the molecular descriptor vector. I'm not sure how the Gini coefficient could be used if only a (dis)similarity matrix were available (ditto for the entropy there, though). But that's a case where some sort of determinantal measure could work...
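
For the bit-string case, a minimal sketch of a per-bit Shannon entropy averaged over fingerprint positions; the averaging choice is an assumption for illustration, not a formulation from any of the cited papers:

```python
import numpy as np

def mean_bit_entropy(fps):
    """Average Shannon entropy (in bits) of each fingerprint position across the set;
    high values mean each bit is close to 50/50 on/off, i.e. a more varied set."""
    p = np.clip(np.asarray(fps, dtype=float).mean(axis=0), 1e-12, 1 - 1e-12)
    h = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return float(h.mean())
```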

The links Fanwang posted above seem wrong; try the one below instead.
https://pubs-acs-org.libaccess.lib.mcmaster.ca/doi/10.1021/ci9700337
Equations 12 and 13 in this paper are also sensible.

There may be some other good things here too.

@PaulWAyers
Member Author

PaulWAyers commented Mar 1, 2022

moved to issue #7

@PaulWAyers
Member Author

PaulWAyers commented Mar 1, 2022

Recommendation: split this into separate issues for similarity/dissimilarity computation and diversity computation. This amounts to putting some key info in issue #7.

@PaulWAyers
Member Author

Notes on Wasserstein Distance to Uniform Distribution (for @Khaleeh)
Wasserstein-Dist-to-Uniform-Dist.pdf
Wasserstein-Dist-to-Uniform-Dist.md
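
For the 1D property case, a minimal sketch of the distance-to-uniform idea using scipy.stats.wasserstein_distance; comparing against a uniform random sample over the observed range is an assumption for illustration, not necessarily how the attached notes set things up:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def distance_to_uniform(values, n_ref=10_000, seed=0):
    """Earth-mover's distance between a 1D property distribution and a uniform
    reference over the same range; smaller means more uniform coverage."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    reference = rng.uniform(values.min(), values.max(), size=n_ref)
    return wasserstein_distance(values, reference)
```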

@PaulWAyers
Member Author

@FarnazH pointed out that it would be very useful to have a very simple API for evaluating the diversity of a subset.

@PaulWAyers
Member Author

We should implement determinantal point processes as a diversity measure (using the Gram matrix if features are given; using the distance matrix or one of its implied quantities otherwise). There are greedy algorithms for optimizing the determinantal point process, and this makes for a very good selection method.
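
A minimal greedy sketch: at each step, add the point that most increases the log-determinant of the selected kernel submatrix (a standard greedy MAP heuristic for DPPs). The kernel K is an assumption here: it could be a Gram matrix of features or something derived from the distance matrix, and this brute-force version recomputes the determinant for every candidate at every step.

```python
import numpy as np

def greedy_logdet_select(K, n_select):
    """Greedily grow a subset maximizing log det K[S, S], a standard greedy
    MAP heuristic for determinantal point processes."""
    n = K.shape[0]
    selected = []
    for _ in range(n_select):
        best, best_val = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            trial = selected + [i]
            sign, logdet = np.linalg.slogdet(K[np.ix_(trial, trial)])
            val = logdet if sign > 0 else -np.inf  # reject singular/degenerate additions
            if val > best_val:
                best, best_val = i, val
        selected.append(best)
    return selected
```

With features X, K = X @ X.T gives the Gram-matrix variant mentioned above.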

@FanwangM
Collaborator

FanwangM commented Apr 8, 2022

> @FarnazH pointed out that it would be very useful to have a very simple API for evaluating the diversity of a subset.

This would be good to have, and the original plan was to address it soon. Thanks for the advice.

@PaulWAyers
Member Author

Note that we should also support p-sum versions (not just the minimum (p = -infinity) and the sum (p = 1)) for the brute-strength algorithm, so that a balanced representation can be achieved.
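
A minimal sketch of the p-sum score for a candidate's dissimilarities to the already-selected set; the limiting cases recover the two versions mentioned above (p = 1 is the plain sum, p → -infinity approaches the minimum used by maximin):

```python
import numpy as np

def p_sum_score(dists, p):
    """Power-sum score (sum_j d_j**p)**(1/p) of a candidate's dissimilarities
    to the selected set: p = 1 gives the plain sum, large negative p approaches
    the minimum (maximin); zero distances must be avoided when p < 0."""
    dists = np.asarray(dists, dtype=float)
    if np.isneginf(p):
        return float(dists.min())
    return float(np.power(dists, p).sum() ** (1.0 / p))
```

In the brute-strength loop, the greedy step would then pick the candidate maximizing p_sum_score(dissim[candidate, selected], p).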
