Methods for Assessing the Diversity of a Set (and how optimal it is) #4
Found this paper really easy to follow, and it describes a couple of algorithms (clustering and Dissimilarity-Based Compound Selection) that might be relatively easy to implement at the beginning. Also, Dissimilarity-Based Compound Selection provides a decent metric, while the quality of the clustering approach can be assessed with a Gini index/entropy. (Screenshots of the relevant sections, and of the clustering algorithm, were attached here.) So, overall steps might be:
Potential problems might be:
@fwmeng88 please let me know what you think about this.
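A minimal sketch of the Dissimilarity-Based Compound Selection idea described above, as a greedy MaxMin loop in plain Python; Euclidean distance stands in for whatever molecular (dis)similarity is actually used, and `maxmin_select` is an illustrative name, not an existing API:

```python
import math

def maxmin_select(points, n_selected, start=0):
    """Greedy dissimilarity-based (MaxMin) selection: repeatedly add
    the candidate whose minimum distance to the already-selected set
    is largest, starting from an arbitrary seed point."""
    selected = [start]
    while len(selected) < n_selected:
        best, best_score = None, -1.0
        for i in range(len(points)):
            if i in selected:
                continue
            # score = distance to the closest already-selected point
            score = min(math.dist(points[i], points[j]) for j in selected)
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

# Two tight pairs plus one stray point: the greedy loop picks points
# far from everything chosen so far.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 5)]
print(maxmin_select(points, 3))  # -> [0, 3, 4]
```

The cost noted above is visible here: each step scans all remaining candidates against the selected set, so a naive implementation is roughly O(n * k) distance evaluations for k picks.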
I think these are two great brute-strength methods for generating samples. Selecting the right number of clusters is probably best left to the user; a default option could be to set the number of clusters equal to the size of the desired sample. A simple revision would be to take one representative from each cluster. I think some of the other methods are more efficient, but for most molecular databases these methods (very robust; the only potential issue is cost) should be extremely reliable.
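The default suggested above (number of clusters equal to the sample size, then one member per cluster) could look like the following sketch; the tiny Lloyd-style k-means and the function names are illustrative assumptions, not part of any existing codebase:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Tiny Lloyd-style k-means (illustrative only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest center
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        # recompute centers; keep the old center for an empty cluster
        centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return centers, clusters

def cluster_representatives(points, sample_size):
    """One pick per cluster, with #clusters == sample size:
    take the member closest to each cluster center."""
    centers, clusters = kmeans(points, sample_size)
    return [min(cl, key=lambda p: math.dist(p, c))
            for c, cl in zip(centers, clusters) if cl]

# Two well-separated pairs; asking for 2 samples returns one member
# from each pair.
points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.0, 10.1)]
print(cluster_representatives(points, 2))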
This paper is very interesting. Thanks for proposing the idea, @RichRick1. DBCS could be one of the very first examples that we implement. But we may need to think about its tendency to select outliers, which relates to the problem you mentioned, class imbalance; the two problems are not identical, but closely related. An alternative is the sphere-of-exclusion algorithm (J. Chem. Inf. Comput. Sci. 1995, 35, 1, 59–67 and ACS Chem. Biol. 2011, 6, 3, 208–217). Another possibility is to try clustering together with a Monte Carlo search. Based on @PaulWAyers's comments above, we could use a clustering algorithm that provides probability information (soft clustering), take a few members out of each cluster to form a new sample set, and then use a Monte Carlo search to improve its diversity.
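The sphere-of-exclusion alternative mentioned above fits in a few lines; this is a hedged sketch in plain Python (Euclidean distance and the `sphere_exclusion` name are placeholders for the real similarity measure and API):

```python
import math

def sphere_exclusion(points, radius, order=None):
    """Sphere-of-exclusion selection: walk the candidates in some
    order and accept a point only if it lies outside the exclusion
    sphere (of the given radius) of every point accepted so far."""
    order = range(len(points)) if order is None else order
    selected = []
    for i in order:
        if all(math.dist(points[i], points[j]) > radius for j in selected):
            selected.append(i)
    return selected

# Points on a line: each accepted point blocks its close neighbors.
points = [(0, 0), (0.5, 0), (2, 0), (2.4, 0), (5, 0)]
print(sphere_exclusion(points, radius=1.0))  # -> [0, 2, 4]
```

Unlike MaxMin, this never chases the most extreme remaining point, which is exactly why it is less prone to the outlier problem; the trade-off is that the result depends on the visiting order and the chosen radius.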
An article I was trying to remember but (finally) found is: The directed sphere evolution method is a sphere-of-exclusion method that seems (slightly) better than the most direct approach, and an implementation is available. I'm starting to feel like we probably have enough algorithms here, unless there is something really amazing out there. But having:
The first papers above suggest that when you have values of the target property, you can sample (even with stratification) within boxes/clusters to ensure samples that are diverse with respect to the target property, or to select a library that is good for the target property. This would give us three or four families of methods. With enough work there, the bigger indicator of performance is likely to end up being how good our metric is for selecting molecules...which is a never-ending task, where we mostly need to support a wide variety of metrics.
Determinantal point processes seem really cool here; they maximize the volume of the parallelepiped containing the data, and in that sense ensure that the "interpolation regime" is as big as possible.
The brute-strength dissimilarity method has been implemented in https://rdrr.io/cran/caret/man/maxDissim.html, and the original paper is Willett, P. (1999), "Dissimilarity-Based Algorithms for Selecting Structurally Diverse Sets of Compounds," Journal of Computational Biology, 6, 447–457.
Using the Gini coefficient to measure chemical diversity is a great idea, @RichRick1!
At this moment, I have seen a few ways to compute the dataset diversity; I will add more as I read. Update: fixed a wrong link.
The Gini coefficient works primarily for binary fingerprints, I think. Entropy could be made to work for more general options, but is probably easiest for a bit-string or another case where there is a discrete number of states (or a known distribution of values) for each element of the molecular descriptor vector. I'm not sure how the Gini coefficient could be used if only a (dis)similarity matrix were available (ditto for the entropy, though). But that's a case where some sort of determinantal measure could work... The links Fanwang had above seem wrong; try the ones below instead. There may be some other good things there too.
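One concrete reading of "Gini coefficient over binary fingerprints" is to apply it to the per-bit "on" frequencies across the library: 0 means every bit is set equally often (even coverage of feature space), values near 1 mean a few bits dominate. This interpretation and the function name are assumptions for illustration, not a fixed convention:

```python
def gini_from_fingerprints(fps):
    """Gini coefficient of the per-bit 'on' counts of a list of
    equal-length binary fingerprints (lists of 0/1)."""
    n_bits = len(fps[0])
    counts = sorted(sum(fp[b] for fp in fps) for b in range(n_bits))
    total = sum(counts)
    if total == 0:
        return 0.0
    n = len(counts)
    # Standard closed form on sorted values x_1 <= ... <= x_n:
    # G = 2 * sum_i i * x_i / (n * sum_i x_i) - (n + 1) / n
    cum = sum((i + 1) * x for i, x in enumerate(counts))
    return 2.0 * cum / (n * total) - (n + 1.0) / n

print(gini_from_fingerprints([[1, 1], [1, 1]]))  # perfectly even -> 0.0
print(gini_from_fingerprints([[0, 1], [0, 1]]))  # one bit dominates -> 0.5
```

As the comment above notes, this needs the fingerprint representation itself; it has no obvious analogue when only a (dis)similarity matrix is available.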
moved to issue #7 |
Recommendation: split this issue into similarity/dissimilarity computation and diversity computation. This amounts to moving some key info to issue #7.
Notes on Wasserstein Distance to Uniform Distribution (for @Khaleeh ) |
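As a concrete version of the Wasserstein-to-uniform idea, in one dimension the Wasserstein-1 distance between the empirical distribution of a subset (values scaled to [0, 1]) and U[0, 1] is just the area between the empirical CDF and the line F(t) = t, which can be computed exactly; the function below is a sketch under that 1-D assumption:

```python
def w1_to_uniform(sample):
    """Exact 1-D Wasserstein-1 distance between the empirical
    distribution of `sample` (values in [0, 1]) and U[0, 1].
    Smaller = the subset covers [0, 1] more evenly."""
    xs = sorted(sample)
    n = len(xs)
    pts = [0.0] + xs + [1.0]
    area = 0.0
    for k in range(n + 1):  # on (pts[k], pts[k+1]) the ECDF equals k/n
        a, b, f = pts[k], pts[k + 1], k / n
        if a >= b:
            continue
        def seg(lo, hi, sign):  # integral of sign * (f - t) over [lo, hi]
            return sign * (f * (hi - lo) - (hi * hi - lo * lo) / 2.0)
        if f <= a:          # ECDF below the diagonal on the whole segment
            area += seg(a, b, -1.0)
        elif f >= b:        # ECDF above the diagonal on the whole segment
            area += seg(a, b, 1.0)
        else:               # the diagonal crosses the step at t = f
            area += seg(a, f, 1.0) + seg(f, b, -1.0)
    return area

print(w1_to_uniform([0.5]))         # single central point -> 0.25
print(w1_to_uniform([0.25, 0.75]))  # two spread points -> 0.125
```

For higher-dimensional descriptors one would need a sliced or projected variant, since the CDF trick is specific to one dimension.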
@FarnazH pointed out that it would be very useful to have a very simple API for evaluating the diversity of a subset. |
We should implement determinantal point processes as a diversity measure (using the Gram matrix if features are given; using the distance matrix or one of its implied quantities otherwise). There are greedy algorithms for optimizing the determinantal point process, which is a very good selection method.
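A hedged sketch of the greedy approach mentioned above: greedy MAP inference for a DPP adds, at each step, the item that maximizes the determinant of the selected kernel submatrix, i.e. the squared volume of the parallelepiped spanned by the selected feature vectors. The naive determinant recomputation below is for clarity only (real implementations use incremental Cholesky updates):

```python
def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    d = 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[p][i]) < 1e-12:
            return 0.0
        if p != i:
            a[i], a[p] = a[p], a[i]
            d = -d
        d *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return d

def greedy_dpp(K, k):
    """Greedy MAP selection for a DPP with kernel (Gram) matrix K."""
    S = []
    for _ in range(k):
        def gain(i):  # volume of the selection if item i is added
            T = S + [i]
            return det([[K[a][b] for b in T] for a in T])
        S.append(max((i for i in range(len(K)) if i not in S), key=gain))
    return S

# Gram matrix of features (2,0), (1,0), (0,1): items 0 and 1 are
# collinear, so the greedy picks 0 (largest norm) and then 2.
K = [[4, 2, 0], [2, 1, 0], [0, 0, 1]]
print(greedy_dpp(K, 2))  # -> [0, 2]
```

Note how the collinear item is rejected automatically: adding it would collapse the determinant to zero, which is the volume-maximizing behavior described in the earlier comment.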
This is good to have; the original plan was to address it soon. Thanks for the advice.
Note we should also support p-sum versions (not just the minimum, p = -infinity, and the sum, p = 1) for the brute-strength algorithm, so that a balanced representation can be achieved.
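The p-sum family above can be written as a single power-mean-style score over a point's distances to the selected set; this is a sketch of that aggregation (the function name is illustrative):

```python
def p_score(dists, p):
    """p-sum aggregation of distances: p = 1 recovers the plain sum,
    p -> -infinity approaches the minimum (MaxMin behavior), and
    intermediate negative p trades the two off for a more balanced
    selection criterion."""
    if p == float("-inf"):
        return min(dists)
    return sum(d ** p for d in dists) ** (1.0 / p)

dists = [1.0, 2.0, 3.0]
print(p_score(dists, 1))             # plain sum -> 6.0
print(p_score(dists, float("-inf"))) # minimum -> 1.0
print(p_score(dists, -100))          # already very close to the minimum
```

Plugging this in where a brute-strength loop currently takes `min(...)` or `sum(...)` of the distances gives the whole family from one knob.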
For example, we could measure how much more diverse a selected sample is than a random sample.
If clusters are of similar size, then random-ish samples do much better than if the cluster sizes are very inhomogeneous.
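The random-baseline comparison suggested above can be sketched as a simple ratio: score the selected subset, score many random subsets of the same size, and report the ratio of the two. Everything here (the `diversity_gain` name, the minimum-pairwise-distance score) is an illustrative assumption:

```python
import math
import random

def min_pairwise(pts):
    """Diversity score: smallest pairwise distance within the subset."""
    return min(math.dist(a, b) for i, a in enumerate(pts) for b in pts[i + 1:])

def diversity_gain(points, selected_idx, score=min_pairwise,
                   n_trials=200, seed=0):
    """Ratio of the selection's diversity score to the mean score of
    random same-size subsets; > 1 means the method beats random."""
    rng = random.Random(seed)
    k = len(selected_idx)
    ours = score([points[i] for i in selected_idx])
    rand = [score(rng.sample(points, k)) for _ in range(n_trials)]
    return ours / (sum(rand) / n_trials)

# Ten collinear points: a deliberately spread-out pick of 3 should
# comfortably beat the average random triple.
points = [(float(i), 0.0) for i in range(10)]
print(diversity_gain(points, [0, 4, 9]))
```

This also exposes the cluster-size effect noted above: on data with very uneven cluster sizes the random baseline drops, so the same selection method shows a larger gain.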