# Practice 2: Ligand-based screening: Compound similarity and compound clustering

> **Note:** This book is available in two ways:
> 1. Downloading the repository and following the instructions in the file [README.md](https://github.com/ramirezlab/CHEMO/blob/main/README.md)
> 2. Clicking here on [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ramirezlab/PILE/blob/main/2.%20De%20datos%20a%20gr%C3%A1ficas%3A%20Propiedades%20drug-likeness%20y%20similitud%20qu%C3%ADmica%20con%20python/2.4_Practice-2.en.ipynb?hl=es)

## Theory

### **Molecular fingerprints**
Molecular fingerprints are essential cheminformatics tools for virtual screening and mapping chemical space. A fingerprint describes a molecular structure by converting it into a bit string<sup> **1** </sup>. Each bit corresponds to a predefined molecular feature or environment, where "1" represents the presence and "0" the absence of that feature. Because a molecular fingerprint encodes the structure of a molecule, it can be used as a molecular descriptor to quantify the structural similarity between molecules.

#### **Morgan fingerprints**
The most popular molecular fingerprint is the Morgan fingerprint, which is based on the Morgan algorithm. The bits generated by this algorithm correspond to the circular environment of each atom in a molecule, and the number of neighboring bonds and atoms to consider is set by the radius. These circular environments are predictive of the biological activities of small organic molecules<sup> **2** </sup>.
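The circular-environment idea can be illustrated with a toy sketch in plain Python. This is not the RDKit implementation: the function names, the SHA-1 based hashing, and the 64-bit length are assumptions made only for illustration. Each atom starts with an identifier derived from its element symbol (radius 0), and every further round mixes in the identifiers of its direct neighbors, so the environment grows by one bond per round.

```python
import hashlib


def stable_hash(value):
    """Deterministic integer hash of a printable value (hypothetical helper)."""
    return int(hashlib.sha1(repr(value).encode()).hexdigest(), 16)


def toy_circular_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Toy circular fingerprint: atoms is a list of element symbols,
    bonds is a list of (i, j) atom-index pairs."""
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Radius 0: the identifier depends only on the atom itself
    ids = [stable_hash(symbol) for symbol in atoms]
    features = set(ids)

    for _ in range(radius):
        # Each round extends every atom's environment by one bond:
        # the new identifier combines the atom's identifier with the
        # sorted identifiers of its neighbors from the previous round
        ids = [
            stable_hash((ids[i], sorted(ids[j] for j in neighbors[i])))
            for i in range(len(atoms))
        ]
        features.update(ids)

    # Fold the identifiers into a fixed-length bit string
    on_bits = {feature % n_bits for feature in features}
    return "".join("1" if bit in on_bits else "0" for bit in range(n_bits))


# Ethanol heavy atoms (C-C-O), written in two different atom orders
ethanol_a = toy_circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = toy_circular_fingerprint(["O", "C", "C"], [(0, 1), (1, 2)])
print(ethanol_a == ethanol_b)
```

Because the neighbor identifiers are sorted before hashing, the result does not depend on the order in which the atoms are listed, and a larger radius can only switch on additional bits.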

### **Molecular similarity measure: Tanimoto coefficient**
Two such fingerprints are most commonly compared with the Tanimoto similarity metric. This metric takes a value between 0 and 1, with 1 corresponding to identical fingerprints<sup> **3** </sup>.


<img src="img/Tanimoto-coefficient-en.jpg" alt="Tanimoto-coefficient" width="800"/>
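As a minimal sketch of the coefficient, the function below works on plain bit strings instead of RDKit objects. The symbols are assumptions of this example: `a` and `b` are the numbers of on-bits in each fingerprint and `c` is the number of on-bits they share, so the coefficient is `c / (a + b - c)`. Returning 1.0 when both fingerprints are empty is also a convention chosen here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two equal-length bit strings."""
    if len(bits_a) != len(bits_b):
        raise ValueError("fingerprints must have the same length")
    a = bits_a.count("1")  # on-bits in the first fingerprint
    b = bits_b.count("1")  # on-bits in the second fingerprint
    # on-bits present in both fingerprints
    c = sum(1 for x, y in zip(bits_a, bits_b) if x == "1" and y == "1")
    if a + b - c == 0:
        return 1.0  # assumption: two empty fingerprints count as identical
    return c / (a + b - c)


print(tanimoto("110100", "100110"))  # 0.5 (2 shared bits, 3 + 3 - 2 = 4)
print(tanimoto("110100", "110100"))  # 1.0 (identical fingerprints)
```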

### **Clustering**
Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters). Compound clustering in pharmaceutical research is often based on chemical or structural similarity between compounds to find groups that share properties.

There are [key steps](https://www.sciencedirect.com/science/article/pii/B008045044X001474) in the clustering approach that we will follow:

**1. Data preparation and compound encoding:**

- The compounds in the input data will be encoded as molecular fingerprints.
    
**2. Tanimoto similarity (or distance) matrix:**

- The similarity between two fingerprints is calculated using the Tanimoto coefficient.
- The result is a matrix with the Tanimoto similarities between all possible molecule/fingerprint pairs (an n x n similarity matrix, where n is the number of molecules; since the matrix is symmetric, only the upper triangle is used).
- Equivalently, the distance matrix can be calculated (1 - similarity).
    
**3. Clustering molecules**

- The clustering result depends on the threshold chosen by the user:
    - The smaller the distance cutoff, the more similar the compounds must be to belong to the same cluster.
    - The higher the threshold (distance cutoff), the more molecules are considered similar, so there are fewer clusters.
    - The lower the threshold, the more small clusters and "singletons" appear.
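The effect of the threshold can be sketched with a small, made-up distance matrix. The grouping rule used here (join any two compounds whose distance is at or below the cutoff) is a simple stand-in chosen for illustration, not the hierarchical or Butina procedures used later in this practice. The matrix values and the function name `count_clusters` are hypothetical.

```python
def count_clusters(dist, cutoff):
    """Number of groups when compounds with distance <= cutoff are linked."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the representative of the group
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i):
            if dist[i][j] <= cutoff:
                parent[find(i)] = find(j)  # merge the two groups
    return len({find(i) for i in range(n)})


# Hypothetical Tanimoto distances (1 - similarity) for five compounds
toy_dist = [
    [0.0, 0.2, 0.7, 0.8, 0.9],
    [0.2, 0.0, 0.6, 0.8, 0.9],
    [0.7, 0.6, 0.0, 0.3, 0.9],
    [0.8, 0.8, 0.3, 0.0, 0.85],
    [0.9, 0.9, 0.9, 0.85, 0.0],
]

for cutoff in (0.1, 0.25, 0.4, 0.65, 0.9):
    print(cutoff, count_clusters(toy_dist, cutoff))
# The cluster counts are 5, 4, 3, 2 and 1: a higher cutoff gives fewer clusters
```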

## Problem Statement
We have a dataset with many compounds and we want to group them because similar compounds might bind to the same targets and show similar effects. From such a clustering, a diverse set of compounds can also be selected from a larger set of screening compounds for further experimental testing.

In [None]:
# Import libraries needed to run this notebook
!pip install rdkit
from pathlib import Path
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from rdkit import Chem, DataStructs
from rdkit.ML.Cluster import Butina
from rdkit.Chem import (
    PandasTools,
    Draw,
    Descriptors,
    rdFingerprintGenerator,
)

# Example 1: Compare a molecule with a data set

First, we want to compare a query molecule with all the molecules in the dataset of bioactive compounds against *glycogen synthase kinase-3 beta*. The query molecule is **Ruboxistaurin**, and we want to find the molecules most similar to it.

## Step 1: Load data set

The dataset contains the bioactive compounds against glycogen synthase kinase-3 beta that we built in the tutorial 2.1_Dataframes.

In [None]:
import pandas as pd

# CSV file URL
csv_url = 'https://raw.githubusercontent.com/ramirezlab/PILE/refs/heads/main/2.%20De%20datos%20a%20gr%C3%A1ficas%3A%20Propiedades%20drug-likeness%20y%20similitud%20qu%C3%ADmica%20con%20python/data/compounds_P49841_full.csv'
# Read the CSV file from the URL and extract specific columns
columns_to_use = ["molecule_chembl_id", "smiles"]
molecule_dataset = pd.read_csv(csv_url, usecols=columns_to_use)

# Print the total number of compounds and show the first rows
print(f'# total compounds: {len(molecule_dataset)}')
molecule_dataset.head()

## Step 2: Generate the fingerprint of the query molecule
For the Ruboxistaurin molecule, we generate the ROMol object from its SMILES string.

In [None]:
from rdkit import Chem  # Make sure to import Chem from RDKit

# Create a molecule from a SMILES string (Simplified Molecular Input Line Entry System)
# This string represents the chemical structure of Ruboxistaurin, a protein kinase C inhibitor
# RDKit interprets the SMILES string and internally builds a Mol (molecule) object
query = Chem.MolFromSmiles("CN(C)CC1CCN2C=C(C3=CC=CC=C32)C4=C(C5=CN(CCO1)C6=CC=CC=C65)C(=O)NC4=O")

# Display the generated Mol object. This object can be used in chemical computations,
# substructure searches, similarity comparisons, visualization, etc.
query


We then generate the Morgan fingerprint for the Ruboxistaurin molecule.

In [None]:
# Generate the circular fingerprint (ECFP) for the molecule represented by 'query'.
# GetFPs takes a list of molecules as input. Here, it is passed a list
# containing only one molecule ('[query]'). The result is a list of
# fingerprints, even if only one molecule is processed.
# [0] is used to extract the first (and only) fingerprint from the resulting list.
circular_fp_query = rdFingerprintGenerator.GetFPs([query])[0]

# Convert the circular fingerprint, which is in an object format,
# to a bit string (a sequence of 0s and 1s).
# This allows viewing the fingerprint as a string representation.
circular_fp_query.ToBitString()


## Step 3: Calculate the fingerprints of the dataset
We now generate Morgan fingerprints for all the molecules in our data set.

In [None]:
# Add a column named "ROMol" to the DataFrame molecule_dataset.
# This column contains RDKit Mol objects generated from the SMILES strings
# in the "smiles" column of the DataFrame.
PandasTools.AddMoleculeColumnToFrame(molecule_dataset, "smiles")

# Generate a list of circular fingerprints (ECFP) for the molecules
# in the molecule_dataset DataFrame.
# molecule_dataset["ROMol"] accesses the "ROMol" column (which contains RDKit Mol objects).
# .tolist() converts this column into a Python list of RDKit Mol objects, which is
# the expected input for rdFingerprintGenerator.GetFPs().
# The result, circular_fp_list, is a list of fingerprint objects.
circular_fp_list = rdFingerprintGenerator.GetFPs(molecule_dataset["ROMol"].tolist())


## Step 4: Calculate the similarity between the query molecule and the data set
We calculate the Tanimoto similarity between the Ruboxistaurin molecule and all molecules in our data set using Morgan fingerprints.

In [None]:
# Calculate the Tanimoto similarity between the query fingerprint and the fingerprints of the molecules in the DataFrame
molecule_dataset["tanimoto_morgan"] = DataStructs.BulkTanimotoSimilarity(circular_fp_query, circular_fp_list)

# Display the first rows with the ChEMBL identifier of the molecule and its Tanimoto similarity
molecule_dataset[["molecule_chembl_id", "tanimoto_morgan"]].head()


We can now organize the values to identify the molecules most similar to Ruboxistaurin.

In [None]:
# Sort the molecule_dataset DataFrame in descending order based on the values in the "tanimoto_morgan" column.
# The 'by' argument specifies the column to sort by, 'ascending=False' indicates descending order,
# and 'inplace=True' modifies the original DataFrame instead of returning a new one.
molecule_dataset.sort_values(by = ["tanimoto_morgan"], ascending=False, inplace=True)

# Return the first 5 rows of the sorted molecule_dataset DataFrame.
# This allows inspection of the molecules with the highest "tanimoto_morgan" values.
molecule_dataset.head(5)

Finally, we can see the Ruboxistaurin molecule and the five most similar molecules in the data set.

In [None]:
# Set the variable top_n_molecules to 5, indicating that the top 5 molecules will be selected.
top_n_molecules = 5

# Create a new DataFrame, top_molecules, containing the first 5 rows of molecule_dataset.
# This selects the 5 molecules with the highest "tanimoto_morgan" values (due to the previous sorting).
top_molecules = molecule_dataset[:top_n_molecules]

legends = [
    f"#{index} {molecule['molecule_chembl_id']}"
    for index, molecule in top_molecules.iterrows()
]
# Create a list of legends for the molecules to be drawn.
# Iterate over the rows of the top_molecules DataFrame.
# For each molecule, format a string that includes the row index and the 'molecule_chembl_id' value.
# This list will be used to label the molecules in the grid image.

Chem.Draw.MolsToGridImage(
    mols=[query] + top_molecules["ROMol"].tolist(),
    # Generate a grid image showing the molecules.
    # 'mols' is a list of RDKit Mol objects to draw.
    #  [query] is the reference molecule.
    #  top_molecules["ROMol"].tolist() is the list of Mol objects for the 5 most similar molecules.
    legends=(["Ruboxistaurin"] + legends),
    # 'legends' is a list of strings to label each molecule in the grid.
    #  ["Ruboxistaurin"] is the label for the reference molecule ('query').
    molsPerRow=4,
    # 'molsPerRow' specifies the number of molecules per row in the grid.
    subImgSize=(250, 270),
    # 'subImgSize' specifies the size of each subimage (molecule) in the grid.
)
# Generate the grid image using RDKit's MolsToGridImage function.


## Similarity distribution
To see the distribution of Tanimoto similarities graphically, we can plot a histogram. Remember that the closer the value is to 1, the more similar the molecules are.

In [None]:
# Set up the Matplotlib figure with a size of 10x6
fig, axes = plt.subplots(figsize=(10, 6), nrows=1, ncols=1)

# Generate a histogram of the Tanimoto similarity values in the "tanimoto_morgan" column
# You can find more color names at: https://www.w3schools.com/colors/colors_names.asp
molecule_dataset.hist(["tanimoto_morgan"], ax=axes, color="darkorchid")

# X-axis label
axes.set_xlabel("Similarity value")

# Y-axis label
axes.set_ylabel("# molecules")

# Display the figure
fig;

We can also plot a histogram with a kernel density estimate curve using seaborn. This plot makes it easier to see how the similarity values in the dataset are distributed.

In [None]:
# Generate a distribution plot (displot) with kernel density estimation (kde) enabled
# Set the height to 8 and the aspect ratio to 2 for a wider visualization
# Use the color "darkorchid" for the plot
sns.displot(data=molecule_dataset["tanimoto_morgan"], kde=True, height=8, aspect=2, color="darkorchid")

# Example 2: Hierarchical clustering

Hierarchical clustering is a method commonly used to group data with similar characteristics (the groups of data are called **clusters**)<sup> **4** </sup>.

The hierarchical clustering algorithm groups the data based on the distances between them, so that the data within a cluster are as similar to each other as possible. In our case, we can group the most similar compounds according to the Tanimoto distance.

Initially, we are going to use **agglomerative clustering**, which starts with each compound as a separate cluster. At each step, the two closest clusters are merged into a new cluster. These mergers continue successively until a single cluster contains all the elements.

Another aspect to take into account is how the **distance** between two clusters is measured. By default the *Euclidean distance* is used, but the algorithms allow this metric to be changed.
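The merging procedure can be sketched in a few lines of plain Python. This is a naive illustration under stated assumptions, not the SciPy implementation used below: the distance between two clusters is taken as the average of all pairwise distances between their members, and the 4 x 4 distance matrix and the name `agglomerate` are made up for the example. Each merge is recorded as `(cluster_a, cluster_b, distance, size)`, and new clusters are numbered after the original items.

```python
def agglomerate(dist):
    """Naive average-linkage agglomerative clustering; returns the merge history."""
    clusters = {i: [i] for i in range(len(dist))}  # every item starts alone
    next_id = len(dist)
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                p, q = ids[x], ids[y]
                # Average distance over all cross-cluster pairs of members
                total = sum(dist[i][j] for i in clusters[p] for j in clusters[q])
                avg = total / (len(clusters[p]) * len(clusters[q]))
                if best is None or avg < best[0]:
                    best = (avg, p, q)
        avg, p, q = best
        # Merge the two closest clusters into a new one
        clusters[next_id] = clusters.pop(p) + clusters.pop(q)
        merges.append((p, q, round(avg, 3), len(clusters[next_id])))
        next_id += 1
    return merges


# Hypothetical distances between four compounds
toy_dist = [
    [0.0, 0.2, 0.8, 0.9],
    [0.2, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.3],
    [0.9, 0.8, 0.3, 0.0],
]

for merge in agglomerate(toy_dist):
    print(merge)
# (0, 1, 0.2, 2)
# (2, 3, 0.3, 2)
# (4, 5, 0.8, 4)
```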

## Data preparation

We start by loading the dataset of bioactive compounds against *glycogen synthase kinase-3 beta* that we built in tutorial 2.1_Dataframes.

From the SMILES we create the *ROMol* objects and the *fingerprints* of each compound.

In [None]:
# Read a CSV file from a GitHub URL and load it into a Pandas DataFrame.
# The file contains compound data, and only the 'molecule_chembl_id' and 'smiles' columns are read.
molecule_dataset = pd.read_csv('https://raw.githubusercontent.com/ramirezlab/PILE/refs/heads/main/2.%20De%20datos%20a%20gr%C3%A1ficas%3A%20Propiedades%20drug-likeness%20y%20similitud%20qu%C3%ADmica%20con%20python/data/compounds_P49841_full.csv', usecols=["molecule_chembl_id", "smiles"])

# Print the total number of compounds (rows) in the molecule_dataset DataFrame.
print(f'# total compounds: {len(molecule_dataset)}')

# Add a column named "ROMol" to the molecule_dataset DataFrame.
# This column contains RDKit Mol objects, which are representations of the molecules
# derived from the SMILES strings in the "smiles" column.
PandasTools.AddMoleculeColumnToFrame(molecule_dataset, "smiles") # add ROMol

# Generate a list of Morgan (ECFP) fingerprints for the molecules in the DataFrame.
# First, extract the "ROMol" column from the DataFrame, which contains the RDKit Mol objects.
# .tolist() is used to convert the column into a list of Mol objects.
# rdFingerprintGenerator.GetFPs() generates the fingerprints for each molecule in the list.
morgan_fp_list = rdFingerprintGenerator.GetFPs(molecule_dataset["ROMol"].tolist()) # Morgan FP

# Add a new column named "morgan_fp" to the molecule_dataset DataFrame.
# This column contains the Morgan fingerprints generated in the previous step.
molecule_dataset['morgan_fp'] = morgan_fp_list

# Print the first 5 rows of the molecule_dataset DataFrame.
# This allows you to view the columns and initial data, including the new "morgan_fp" column.
molecule_dataset.head()

## Tanimoto similarity matrix
As in Example 1, we are going to compute similarities, but now between each molecule and all the other molecules in the set.

We are going to create a function whose input is the list of *fingerprints* of the compounds and whose output is the Tanimoto similarity matrix, in which each entry is the similarity between two compounds.

In [None]:
def tanimoto_matrix(fp_list):
    # Start from an identity matrix (each compound is identical to itself)
    N = len(fp_list)
    similarity_matrix = np.identity(N)
    # Row and column indices of the lower triangle (main diagonal included), in row-major order
    a, b = np.tril_indices(N, 0)
    similarities = list()
    for ind, i in enumerate(fp_list):
        # Compare the current fingerprint with itself and all previous ones in the list
        similarities = np.append(similarities, DataStructs.BulkTanimotoSimilarity(i, fp_list[:ind+1]))
    # Fill the lower triangle of the similarity matrix and mirror it into the upper triangle
    similarity_matrix[a,b] = similarities
    similarity_matrix[b,a] = similarity_matrix[a,b]
    return similarity_matrix

### Cluster of ten molecules
To understand the clustering process we are going to work only with the first ten molecules of the set. We start by creating a list with the fingerprints of the ten molecules.

In [None]:
# Select the first 10 fingerprints from the list of Morgan fingerprints
list_fingerprints = morgan_fp_list[0:10]

Now, we compute the Tanimoto similarity matrix for the ten fingerprints:

In [None]:
# Calculate the Tanimoto similarity matrix for only the first 10 compounds
similarity_matrix = tanimoto_matrix(list_fingerprints)

# Display the similarity matrix
similarity_matrix

We can represent it by means of a heat map:

In [None]:
similarity_matrix = tanimoto_matrix(list_fingerprints) # Only for the first 10 compounds
ax = sns.heatmap(similarity_matrix, annot=True, fmt='.2f', cmap="vlag") # annot=True displays the Tanimoto coefficient and fmt='.2f' shows only two decimal places
ax.set(xlabel="", ylabel="")
ax.xaxis.tick_top()

#### Grouping by distances
As explained above, **agglomerative clustering** consists of successively merging the clusters that are closest to each other. To understand the grouping we can use the function [`linkage`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage), which creates a *linkage matrix* recording the order in which the clusters were merged. By default, the metric is the *Euclidean distance* and the linkage method is the *nearest point* (*single*). Here we work with the mean distance between the elements of the clusters (*average*).

In [None]:
# Perform hierarchical clustering using the average linkage method
# 'similarity_matrix' is the previously computed Tanimoto similarity matrix
Z = linkage(similarity_matrix, method='average')

# Display the resulting hierarchical clustering structure
Z

In the *i-th* row, `Z[i,0]` and `Z[i,1]` indicate the clusters that are merged to form cluster $n+i$, `Z[i,2]` indicates the distance between those clusters, and `Z[i,3]` is the number of compounds in the new cluster.
Recall that we start with ten clusters numbered from 0 to 9 (the initial ten molecules), so the rows of the linkage matrix `Z` are:
- **row-0**: Cluster 10 is created, made up of molecule 1 (cluster 1) and molecule 5 (cluster 5); the distance between clusters 1 and 5 is 0.235548, and the new cluster has 2 molecules
- **row-1**: Cluster 11 is created, made up of molecule 8 (cluster 8) and molecule 9 (cluster 9); the distance between clusters 8 and 9 is 0.343349, and the new cluster has 2 molecules
- **row-2**: Cluster 12 is created, made up of molecule 2 (cluster 2) and molecule 7 (cluster 7); the distance between clusters 2 and 7 is 0.364662, and the new cluster has 2 molecules
- **row-3**: Cluster 13 is created, made up of molecule 4 (cluster 4) and cluster 10 (created in **row-0**); the distance between clusters 4 and 10 is 0.500123, and the new cluster has 3 molecules
- **row-4**: Cluster 14 is created, made up of cluster 11 (created in **row-1**) and cluster 12 (created in **row-2**); the distance between clusters 11 and 12 is 0.571593, and the new cluster has 4 molecules
- **row-5**: Cluster 15 is created, made up of molecule 0 (cluster 0) and molecule 6 (cluster 6); the distance between clusters 0 and 6 is 0.639747, and the new cluster has 2 molecules
- **row-6**: Cluster 16 is created, made up of molecule 3 (cluster 3) and cluster 15 (created in **row-5**); the distance between clusters 3 and 15 is 1.084561, and the new cluster has 3 molecules
- **row-7**: Cluster 17 is created, made up of cluster 13 (created in **row-3**) and cluster 16 (created in **row-6**); the distance between clusters 13 and 16 is 1.555059, and the new cluster has 6 molecules
- **row-8**: Cluster 18 is created, made up of cluster 14 (created in **row-4**) and cluster 17 (created in **row-7**); the distance between clusters 14 and 17 is 1.611812, and the new cluster has 10 molecules
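The bookkeeping described in the list above can be reproduced with a short helper that expands every row of the linkage matrix into the molecules it contains. The rows are copied from the description above; the helper name `cluster_members` is hypothetical and is not part of SciPy.

```python
# Rows of the linkage matrix as described above:
# (cluster_a, cluster_b, distance, number_of_molecules)
Z_rows = [
    (1, 5, 0.235548, 2),
    (8, 9, 0.343349, 2),
    (2, 7, 0.364662, 2),
    (4, 10, 0.500123, 3),
    (11, 12, 0.571593, 4),
    (0, 6, 0.639747, 2),
    (3, 15, 1.084561, 3),
    (13, 16, 1.555059, 6),
    (14, 17, 1.611812, 10),
]


def cluster_members(z_rows, n_leaves):
    """Map every cluster id to the sorted list of molecules it contains."""
    members = {i: [i] for i in range(n_leaves)}  # clusters 0..n-1 are single molecules
    for step, (a, b, _dist, _count) in enumerate(z_rows):
        # Row number 'step' creates cluster n_leaves + step
        members[n_leaves + step] = sorted(members[int(a)] + members[int(b)])
    return members


members = cluster_members(Z_rows, n_leaves=10)
for cluster_id in range(10, 19):
    print(cluster_id, members[cluster_id])
```

For example, cluster 13 expands to molecules 1, 4 and 5, and cluster 18 contains all ten molecules.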

#### Representation: the *dendrogram*
A hierarchical clustering is usually represented with a dendrogram.

In [None]:
# Generate the dendrogram from the clustering matrix 'Z'
dn = dendrogram(Z)

The vertical lines in the dendrogram illustrate the mergers (or splits) made at each stage of clustering. We can see the distance, the different levels of association between the individual data, and also the associations between clusters. Recall that the distance used was the Euclidean distance (computed by `linkage` between the rows of the similarity matrix), which we can change when we build `Z`.
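One possible way to change the distance, sketched here as an alternative rather than as the approach followed in this practice, is to give `linkage` precomputed Tanimoto distances. When `linkage` receives a one-dimensional condensed distance vector it uses those distances directly, and `scipy.spatial.distance.squareform` converts a square distance matrix into that form. The 4 x 4 similarity matrix below is made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Hypothetical Tanimoto similarity matrix (symmetric, ones on the diagonal)
toy_similarity = np.array([
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
])

# Tanimoto distance = 1 - similarity
toy_distance = 1.0 - toy_similarity

# Condensed form: the off-diagonal distances as a 1-D vector
condensed = squareform(toy_distance, checks=False)

# Average linkage computed directly on the Tanimoto distances
Z_tanimoto = linkage(condensed, method="average")
print(Z_tanimoto)
```

With this input the merge distances in the third column stay between 0 and 1, because they are averages of Tanimoto distances instead of Euclidean distances between matrix rows.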

#### Clustermap
All of the above can be arranged in a matrix and plotted as a hierarchically clustered heatmap. Note that the order of the compounds is not necessarily the same as in the dendrogram.

In [None]:
# Create a clustermap (heatmap with hierarchical clustering) based on the Tanimoto similarity matrix
g = sns.clustermap(similarity_matrix, method='average',
                   cmap="vlag",  # Color map
                   dendrogram_ratio=(.1, .2),  # Dendrogram size ratio on the axes (rows, columns)
                   linewidths=.5)  # Line width between heatmap cells

# Remove the row dendrogram for a clearer visualization
g.ax_row_dendrogram.remove()

#### Clustering threshold
We can use the distance between clusters as a **threshold** to group the compounds. For example, if we choose to group with a distance less than or equal to 0.6, five clusters would be formed with the following compounds:
- Cluster-1: (1, 4, 8, 9)
- Cluster-2: (7, 2, 5)
- Cluster-3: 3
- Cluster-4: 0
- Cluster-5: 6

The function [`fcluster`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html#scipy.cluster.hierarchy.fcluster) returns an array of $n$ elements, where each element indicates the number of the cluster to which the compound in that position belongs.

In [None]:
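# Perform hierarchical clustering on the Tanimoto similarity matrix
# Note: no 'method' is given here, so linkage uses its default ('single'),
# not the 'average' method used earlier
Z = linkage(similarity_matrix)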

# Obtain clusters from the linkage matrix Z
# 't=0.6' defines the distance threshold for forming clusters
# 'criterion="distance"' indicates that linkage distance will be used to define the clusters
fcluster(Z, t=0.6, criterion='distance')

### *Butina* Clustering Algorithm: Centroids and Exclusion Spheres
A commonly used algorithm for clustering molecules is known as the *Butina clustering algorithm*<sup> **5** </sup>.

We can use the `rdkit` library to run this algorithm (`Butina.ClusterData`). As input it needs a list with the Tanimoto distances between the compounds. This list can be obtained from the Tanimoto similarity matrix with the formula $distance = 1 - similarity$.

To build the list, we can use the `tanimoto_matrix` function, keep only the elements below the main diagonal, and convert them to distances.

In [None]:
similarity_matrix = tanimoto_matrix(list_fingerprints)  # Similarity matrix
a, b = np.tril_indices(len(list_fingerprints), -1)  # Indices of the elements below the main diagonal
dist_similarity_matrix = 1 - similarity_matrix[a, b]  # Compound distances


Now we choose a distance threshold to carry out the grouping, for example, if we choose 0.4 as the threshold, five clusters are obtained.

In [None]:
# Apply the Butina clustering algorithm to the Tanimoto distances
# 'dist_similarity_matrix' holds the Tanimoto distances (1 - similarity) taken from the lower triangle of the similarity matrix
# 'len(list_fingerprints)' is the total number of fingerprints to consider
# 'distThresh=0.4' sets a distance threshold to define the clusters
# 'isDistData=True' indicates that the input data are distances
clusters = Butina.ClusterData(dist_similarity_matrix, len(list_fingerprints), distThresh=0.4, isDistData=True)

# Sort the clusters by size in descending order (largest first)
clusters = sorted(clusters, key=len, reverse=True)

# Display the obtained clusters
clusters

**Note**: Although the clustering is similar to that found with the `fcluster` function, the threshold used is different. In addition, `Butina`'s algorithm determines the *centroid* of each cluster, which is within the given distance threshold of every other molecule in that cluster. The first element of each cluster is the centroid.
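The idea of centroids and exclusion spheres can be sketched in plain Python. This is a simplified illustration, not RDKit's `Butina.ClusterData`: the function name `toy_butina`, the tie-breaking order, and the distance matrix are assumptions of the example. Every compound counts its neighbors within the cutoff, the compound with the most neighbors becomes the first centroid and claims them, and claimed compounds are excluded from later clusters.

```python
def toy_butina(dist, cutoff):
    """Simplified Butina-style clustering; the first element of each cluster is its centroid."""
    n = len(dist)
    # Neighbors of each compound within the distance cutoff
    neighbors = [
        [j for j in range(n) if j != i and dist[i][j] <= cutoff]
        for i in range(n)
    ]
    # Candidate centroids, most neighbors first (ties keep the original order)
    order = sorted(range(n), key=lambda i: len(neighbors[i]), reverse=True)

    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue  # already inside another centroid's exclusion sphere
        members = [centroid] + [j for j in neighbors[centroid] if j not in assigned]
        assigned.update(members)
        clusters.append(tuple(members))
    return clusters


# Hypothetical Tanimoto distances for five compounds
toy_dist = [
    [0.0, 0.2, 0.7, 0.8, 0.9],
    [0.2, 0.0, 0.6, 0.8, 0.9],
    [0.7, 0.6, 0.0, 0.3, 0.9],
    [0.8, 0.8, 0.3, 0.0, 0.85],
    [0.9, 0.9, 0.9, 0.85, 0.0],
]

print(toy_butina(toy_dist, cutoff=0.65))  # [(1, 0, 2), (3,), (4,)]
print(toy_butina(toy_dist, cutoff=0.25))  # [(0, 1), (2,), (3,), (4,)]
```

In the first call compound 3 ends up alone even though it is close to compound 2, because compound 2 was already claimed by centroid 1.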

## Elbow Method
One of the problems we face when clustering is choosing the number of clusters. There is no objective or broadly valid criterion for choosing an optimal number of clusters, but a bad choice can lead either to very heterogeneous groups (too few clusters) or to very similar data being split across different clusters (too many clusters).

The elbow method uses the inertia values obtained after clustering with different numbers of clusters (from 1 to N clusters), where the **inertia** is the *sum of the squared distances of each object in a cluster from the cluster centroid*. We can then compute the *average* of the inertias for each N (commonly called *distortion*) and plot the distortion against the number of clusters<sup> **6** </sup>. The shape of the graph shows where the distortion stops changing sharply, and from this we can estimate the optimal number of clusters.
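As a small worked example of these two quantities, the sketch below computes them from a made-up distance matrix. It assumes that the first element of each cluster is its centroid; the function name `toy_distortion` and the data are hypothetical.

```python
def toy_distortion(dist, clusters):
    """Mean over clusters of the inertia (sum of squared distances to the centroid)."""
    inertias = []
    for cluster in clusters:
        centroid = cluster[0]  # assumption: the first member is the centroid
        inertias.append(sum(dist[centroid][member] ** 2 for member in cluster))
    return sum(inertias) / len(inertias)


# Hypothetical Tanimoto distances for five compounds
toy_dist = [
    [0.0, 0.2, 0.7, 0.8, 0.9],
    [0.2, 0.0, 0.6, 0.8, 0.9],
    [0.7, 0.6, 0.0, 0.3, 0.9],
    [0.8, 0.8, 0.3, 0.0, 0.85],
    [0.9, 0.9, 0.9, 0.85, 0.0],
]

# Three clusters: inertias are 0.2**2 + 0.6**2 = 0.4, 0 and 0
print(toy_distortion(toy_dist, [(1, 0, 2), (3,), (4,)]))

# Five singletons: every compound is its own centroid, so the distortion is 0
print(toy_distortion(toy_dist, [(0,), (1,), (2,), (3,), (4,)]))
```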

We begin by defining a function that computes the inertia and the distortion. Since we are going to use Butina's method for the clustering, we must bear in mind that the result depends on the *distance threshold*, which determines the number of clusters (the smaller the threshold, the more clusters, since fewer molecules are considered similar).

In [None]:
# Define a function called 'distorion_tanimoto' that calculates the Tanimoto distortion for a set of clusters.
def distorion_tanimoto(clusters, full_dataset):
    # Initialize an empty list called 'inertia' to store the inertia of each cluster.
    inertia = list()

    # Iterate over each cluster in the 'clusters' list.
    for cluster in clusters:
        # Create a subset of the 'full_dataset' DataFrame corresponding to the molecules in the current cluster.
        # 'cluster' contains the indices of the molecules belonging to the cluster.
        # 'full_dataset.iloc[list(cluster)]' selects the rows of the DataFrame using these indices.
        cluster_dataset = full_dataset.iloc[list(cluster)]

        # Select the fingerprint of the first molecule in the cluster as the "centroid" of the cluster.
        circular_fp_query = cluster_dataset['morgan_fp'].iloc[0]

        # Create a list of fingerprints for all molecules in the cluster.
        circular_fp_list = list(cluster_dataset['morgan_fp'])

        # Compute the squared Tanimoto distance between the centroid and each molecule in the cluster.
        # DataStructs.BulkTanimotoSimilarity calculates Tanimoto similarity between the centroid and every molecule in the cluster (itself included).
        # (1 - similarity) gives the distance, which is then squared.
        sq_dist_to_centroid = (1 - np.array(DataStructs.BulkTanimotoSimilarity(circular_fp_query, circular_fp_list)))**2

        # Compute the sum of squared distances from each molecule to the centroid of the cluster.
        # This sum represents the cluster's inertia, a measure of how dispersed the points are within the cluster.
        inertia.append(sum(sq_dist_to_centroid))

    # Calculate the mean inertia of all clusters.
    # This mean represents the total distortion of the partition, a measure of how well the clusters represent the data.
    distortion = np.mean(inertia)
    
    # Return the number of clusters and the distortion.
    return len(clusters), distortion

Now we repeat the above process, varying the distance threshold from 0 to 0.95 in steps of 0.05 (note the loop over `np.arange(0,1,0.05)`), and then build a table of results so that we can plot them.

In [None]:
# Initialize an empty list called 'result' to store the clustering analysis results.
result = list()

# Create a subset of the 'molecule_dataset' DataFrame containing the first 10 molecules.
molecule_mini_dataset = molecule_dataset[:10]

# Extract the Morgan fingerprint column ('morgan_fp') from the subset and assign it to the variable 'list_fingerprints'.
list_fingerprints = molecule_mini_dataset['morgan_fp']

# Calculate the Tanimoto similarity matrix between the fingerprints in 'list_fingerprints' using the 'tanimoto_matrix' function.
# The result is a square matrix where each element (i, j) represents the similarity between fingerprint i and fingerprint j.
similarity_matrix = tanimoto_matrix(list_fingerprints)  # Similarity matrix

# Generate the indices for the positions below the main diagonal of a square matrix.
# 'a' contains the row indices, and 'b' contains the column indices.
# The argument '-1' ensures that the main diagonal is excluded.
a, b = np.tril_indices(len(list_fingerprints), -1)  # Indices of elements below the main diagonal

# Calculate the distance matrix from the similarity matrix.
# Distance is computed as 1 - similarity. This converts similarities into distances,
# where a similarity of 1 (maximum similarity) corresponds to a distance of 0, and a similarity of 0
# (minimum similarity) corresponds to a distance of 1. Then, only the distances
# corresponding to the lower triangle of the matrix (excluding the diagonal) are selected using indices 'a' and 'b'.
dist_similarity_matrix = 1 - similarity_matrix[a, b]  # Compound distances

# Iterate over a range of threshold (cutoff) values from 0 to 1 with a step of 0.05.
for i in np.arange(0,1,0.05):
    # Round the threshold value to 2 decimal places.
    cutoff = round(i,2)

    # Perform Butina clustering using the distance matrix and the current threshold.
    # 'dist_similarity_matrix' contains the distances between compounds.
    # 'len(list_fingerprints)' is the number of compounds.
    # 'distThresh' is the distance threshold for forming clusters.
    # 'isDistData=True' indicates that the input data are distances, not similarities.
    clusters = Butina.ClusterData(dist_similarity_matrix,len(list_fingerprints), distThresh=cutoff, isDistData=True)

    # Compute the number of clusters ('n') and the distortion ('dist') for the current clustering.
    # Distortion is a measure of how well the clusters represent the data.
    n, dist = distorion_tanimoto(clusters, molecule_mini_dataset)

    # Append the results (cutoff, number of clusters, distortion) to the 'result' list.
    result.append((cutoff, n, dist))

# Create a Pandas DataFrame called 'table' from the 'result' list.
# The columns of the DataFrame are 'cutoff', 'N_clusters', and 'distortion'.
table = pd.DataFrame(result, columns=['cutoff', 'N_clusters', 'distortion'])

# Print the 'table' DataFrame.
# This DataFrame contains the results of the clustering analysis for different threshold values.
print(table)


In [None]:
# Plot the distortion as a function of the number of clusters
# ---------------------------------------------------------------
# This plot shows how the distortion (mean over clusters of the summed squared
# Tanimoto distances to the centroid) changes as the number of clusters varies with the cutoff.
# A good "cutoff" value is typically found where there is a balance
# between a reasonable number of clusters and low distortion.
# ---------------------------------------------------------------
table.plot(x='N_clusters', y='distortion')


The graph shows that when the number of clusters is `N=4` there is an abrupt change (as if it were the elbow of an arm), therefore, we can choose this as the optimal number of clusters. Reviewing the table we see that the threshold we must choose is `cutoff=0.45`.

## Hierarchical grouping of the total data
We can use what we learned with the clustering of ten compounds to represent the clustering of the entire set of compounds.

Let's start by computing the Tanimoto similarity matrix. Since there are 2605 compounds, the similarity matrix has a size of 2605 x 2605.

**Note**: the variable *circular_fp_list* holds the list of fingerprints of all the compounds.

In [None]:
# Calculate the Tanimoto similarity matrix for all molecules in the fingerprint list
# ----------------------------------------------------------------------------
# This is a square matrix (n x n), where n is the total number of molecules.
# Each element [i, j] of the matrix represents the Tanimoto similarity between molecule i and molecule j.
# Tanimoto similarity compares the overlap between two binary vectors (fingerprints).
# ----------------------------------------------------------------------------
similarity_matrix_full = tanimoto_matrix(circular_fp_list)

# Show the shape (dimensions) of the similarity matrix
# It should return a tuple (n, n), confirming that there is one row and column per molecule
similarity_matrix_full.shape

Let's look at the heat map of the similarity matrix, which has not yet been reordered by clustering.

In [None]:
# Create a figure and axes to visualize the similarity matrix
# --------------------------------------------------------------
# figsize=(10, 10) sets the size of the figure in inches.
# This is useful to prevent a large matrix from appearing compressed or unreadable.
fig, ax = plt.subplots(figsize=(10, 10))

# Draw the similarity matrix as a heatmap
# --------------------------------------------------------------
# sns.heatmap visualizes the matrix values as colors.
# cmap="vlag" defines the color palette, which in this case is divergent
# (useful for representing differences centered around a middle value, like similarity ~0.5).
# yticklabels=False and xticklabels=False remove axis labels for a cleaner view,
# especially helpful when visualizing a large number of molecules.
ax = sns.heatmap(similarity_matrix_full, cmap="vlag",
                 yticklabels=False, xticklabels=False)

### Clustering: Butina Clustering Algorithm
As with the ten compounds, we will create a function that lets us choose the Tanimoto distance threshold. For example, with *cutoff*=0.2 we group molecules whose Tanimoto distance is at most 0.2, that is, whose similarity is at least 0.8.

**Note**: Remember that we are using the *Butina* clustering algorithm.
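
Because the cutoff is a distance, it helps to see the conversion and the flat pair ordering explicitly. The sketch below assumes the row-by-row lower-triangle layout that `np.tril_indices(n, -1)` produces, which is the order used in this section: (1,0), (2,0), (2,1), (3,0), and so on. The helper names and the toy matrix are illustrative.

```python
def condensed_index(i, j):
    """Position of the pair (i, j), with i > j, in the flattened lower triangle."""
    if i <= j:
        raise ValueError("expected i > j")
    return i * (i - 1) // 2 + j


def lower_triangle_distances(similarity):
    """Flatten 1 - similarity for every pair below the main diagonal, row by row."""
    return [
        1.0 - similarity[i][j]
        for i in range(len(similarity))
        for j in range(i)
    ]


# Toy 3 x 3 similarity matrix (made-up values).
toy_similarity = [
    [1.0, 0.75, 0.25],
    [0.75, 1.0, 0.5],
    [0.25, 0.5, 1.0],
]
toy_distances = lower_triangle_distances(toy_similarity)
print(toy_distances)  # [0.25, 0.75, 0.5]

# Distance between molecules 2 and 1, looked up in the flat list.
print(toy_distances[condensed_index(2, 1)])  # 0.5
```

With a distance cutoff of 0.2, only the first pair in this toy example would not qualify either: its distance of 0.25 corresponds to a similarity of 0.75, below the required 0.8.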

In [None]:
# Define a function called 'cluster_fingerprints' that performs clustering of fingerprints based on Tanimoto similarity.
def cluster_fingerprints(fp_list, cutoff=0.2):
    # Compute the Tanimoto similarity matrix for the provided fingerprint list.
    # The result is a square matrix where each element (i, j) represents the similarity between fingerprint i and fingerprint j.
    similarity_matrix = tanimoto_matrix(fp_list)  # Similarity matrix

    # Compute the distances between compounds from the similarity matrix.
    # 'np.tril_indices' generates the indices for the lower triangle of the matrix (excluding the diagonal).
    # Distance is calculated as 1 - similarity.
    a, b = np.tril_indices(len(fp_list), -1)
    dist_similarity_matrix = 1 - similarity_matrix[a, b]

    # Cluster the data using the Butina clustering algorithm.
    # 'dist_similarity_matrix' contains the distances between fingerprints.
    # 'len(fp_list)' is the total number of fingerprints.
    # 'cutoff' is the distance threshold for forming a cluster; points within this threshold are grouped together.
    # 'isDistData=True' indicates that distances, not similarities, are being provided.
    clusters = Butina.ClusterData(dist_similarity_matrix, len(fp_list), cutoff, isDistData=True)

    # Sort the clusters in descending order by size (number of elements).
    # This allows easy identification of the largest clusters.
    clusters = sorted(clusters, key=len, reverse=True)

    # Return the list of sorted clusters. Each cluster is a list of indices indicating which fingerprints belong to that cluster.
    return clusters
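
For intuition about what the clustering step does, here is a simplified pure-Python sketch of the idea behind Butina clustering: count each molecule's neighbours within the cutoff, take the molecule with the most neighbours as a cluster centre, assign its still-unassigned neighbours to that cluster, and repeat. This is an illustration under those assumptions, not RDKit's implementation; `butina_sketch` and its tie-breaking rule are hypothetical, and details may differ from `Butina.ClusterData`.

```python
def butina_sketch(distances, n_points, cutoff):
    """Cluster points from a flattened lower-triangle distance list.

    `distances` is ordered as (1,0), (2,0), (2,1), (3,0), ...
    Returns a list of tuples; the first index in each tuple is the cluster centre.
    """
    # Build the neighbour set of every point (pairs within the cutoff).
    neighbours = [set() for _ in range(n_points)]
    position = 0
    for i in range(1, n_points):
        for j in range(i):
            if distances[position] <= cutoff:
                neighbours[i].add(j)
                neighbours[j].add(i)
            position += 1

    # Visit candidate centres from most to fewest neighbours
    # (ties are broken by the lower index, an arbitrary choice).
    order = sorted(range(n_points), key=lambda idx: (-len(neighbours[idx]), idx))

    assigned = set()
    clusters = []
    for centre in order:
        if centre in assigned:
            continue
        members = [m for m in sorted(neighbours[centre]) if m not in assigned]
        clusters.append((centre, *members))
        assigned.update([centre, *members])
    return clusters


# Four toy points: 0, 1 and 2 are close to each other, 3 is far from all.
toy_distances = [0.1, 0.15, 0.1, 0.9, 0.8, 0.85]
toy_clusters = butina_sketch(toy_distances, n_points=4, cutoff=0.2)
print(toy_clusters)  # [(0, 1, 2), (3,)]
```

Molecules with no neighbour inside the cutoff end up as single-member clusters, which is why a strict cutoff produces many singletons.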

Running the *cluster_fingerprints* function returns the clusters sorted from largest to smallest. Let's look at the first 10.

In [None]:
# Execute the clustering procedure for the dataset, distance: 0.2
clusters = cluster_fingerprints(morgan_fp_list, cutoff=0.2)
# Largest clusters
print(clusters[:10])

Since we are working with a large data set, printing all of the clusters is not practical. Instead, we can summarize the information in a frequency table.
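
As a library-free illustration of such a frequency table, the sketch below counts how many clusters there are of each size using only the standard library. It assumes `clusters` is a sequence of index tuples; `size_frequency` and the toy clusters are illustrative, not part of the practice code.

```python
from collections import Counter


def size_frequency(clusters):
    """Return (cluster_size, number_of_clusters) pairs, largest size first."""
    counts = Counter(len(cluster) for cluster in clusters)
    return sorted(counts.items(), reverse=True)


# Toy clustering result (made-up indices):
# one cluster of 3, two clusters of 2 and three singletons.
toy_clusters = [(0, 4, 7), (1, 2), (3, 9), (5,), (6,), (8,)]
toy_table = size_frequency(toy_clusters)

for size, n_clusters in toy_table:
    print(f"size {size}: {n_clusters} cluster(s)")
```

Keeping the pairs as `(size, count)` tuples makes it explicit which value is which when the table is later plotted.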

In [None]:
# Get the size of each group (number of elements in each cluster)
# 'clusters' is a list of lists, where each sublist contains the indices of the molecules in that cluster
agrup = list(map(len, clusters))

# Calculate how many clusters have the same number of elements
# np.unique returns the unique values (cluster sizes) and how many times each appears
unique, counts = np.unique(agrup, return_counts=True)

# Create a frequency table:
# - np.array([unique, counts]).T has one row per cluster size: [cluster size, number of clusters]
# - np.flip without an 'axis' argument reverses both the rows and the columns, so each row
#   becomes [number of clusters, cluster size], with the largest cluster sizes first
frec_table = np.flip(np.array([unique, counts]).T)  # reversed rows and columns

# Display the table: each row shows [number of clusters of that size, cluster size]
frec_table

We can see that the largest cluster (the first one) has 14 elements and that there are 1494 compounds that did not cluster (individual clusters).
Let's look graphically at the sizes of the clusters and how many clusters there are of each size:

In [None]:
# Set up the figure and axes for a bar chart using matplotlib.
# 'figsize' specifies the size of the figure in inches (width x height).
fig, ax = plt.subplots(figsize=(10, 3))  # Set up the matplotlib figure

# Create a bar chart.
# 'list(map(str, frec_table[:, 0]))' converts the first column of frec_table (number of clusters of each size) to strings and uses them as the x-axis.
# 'frec_table[:, 1]' provides the second column of frec_table (cluster size) for the y-axis (bar heights).
ax.bar(list(map(str, frec_table[:, 0])), frec_table[:, 1])

# Set the x-axis label for the chart.
ax.set_xlabel("# total clusters")

# Set the y-axis label for the chart.
ax.set_ylabel("# total elements")

# Display the chart on screen.
plt.show()

# Close the figure to free memory.
plt.close()


We can examine the first 14-element cluster in more detail:

In [None]:
# Convert the first cluster (index 0) from 'clusters' into a list of indices.
list_ind_cluster0 = list(clusters[0])

# Use the indices from the first cluster to select the corresponding rows from the 'molecule_dataset' DataFrame.
# The result is a new DataFrame, 'molecules_cluster0', containing only the molecules that belong to the first cluster.
molecules_cluster0 = molecule_dataset.iloc[list_ind_cluster0]

# Print the 'molecules_cluster0' DataFrame.
# This displays the molecules that are part of the first cluster.
molecules_cluster0

In [None]:
# Print the number of molecules in the 'molecules_cluster0' DataFrame, along with a descriptive message.
print(f'{len(molecules_cluster0)} molecules from the largest cluster')

# Create a list of legends for the molecules to be drawn in the grid image.
# Iterate over the rows of the 'molecules_cluster0' DataFrame.
# For each molecule, format a string that includes the row index and the value from the 'molecule_chembl_id' column.
legends = [
    f"#{index} {molecule['molecule_chembl_id']}"
    for index, molecule in molecules_cluster0.iterrows()
]

# Generate a grid image showing the molecules from the first cluster.
# 'mols' is a list of RDKit Mol objects to draw, taken from the "ROMol" column of 'molecules_cluster0'.
# 'legends' is the list of strings created earlier, used to label each molecule in the grid.
# 'molsPerRow' specifies the number of molecules per row in the grid.
# 'subImgSize' specifies the size of each subimage (molecule) in the grid.
Chem.Draw.MolsToGridImage(
    mols= molecules_cluster0["ROMol"].tolist(),
    legends=legends,
    molsPerRow=5,
    subImgSize=(250, 270),
)

We can also print a brief report on the number of clusters, broken down by cluster size:

In [None]:
# Calculate the number of clusters that contain exactly 1 compound.
# Iterate through the 'clusters' list, where each element 'c' is a cluster (a list of indices).
# The expression 'len(c) == 1' checks if the cluster size is 1.
# 'sum(...)' counts how many times this condition is true.
num_clust_g1 = sum(1 for c in clusters if len(c) == 1)

# Calculate the number of clusters that contain more than 5 compounds.
# Similar to the above, but the condition is 'len(c) > 5'.
num_clust_g5 = sum(1 for c in clusters if len(c) > 5)

# Calculate the number of clusters that contain more than 10 compounds.
# Similar to the above, but the condition is 'len(c) > 10'.
num_clust_g10 = sum(1 for c in clusters if len(c) > 10)

# Print the total number of clusters.
print("Total number of clusters: ", len(clusters))

# Print the number of clusters with only 1 compound.
print("# clusters with only 1 compound: ", num_clust_g1)

# Print the number of clusters with more than 5 compounds.
print("# clusters with >5 compounds: ", num_clust_g5)

# Print the number of clusters with more than 10 compounds.
print("# clusters with >10 compounds: ", num_clust_g10)

#### Optimal number of clusters and similarity threshold

Using the same procedure as for the ten compounds, we can compute the *distortion* for different numbers of clusters (`N`) by varying the cutoff.
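
As a reminder of what is being measured, the sketch below computes one plausible version of the distortion: one minus the average Tanimoto similarity between each molecule and the centre of its cluster (taken to be the first index of each cluster). This definition is an assumption made for illustration; the `distorion_tanimoto` function used in this practice may compute it differently. Fingerprints are represented here as sets of on-bit indices.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def distortion_sketch(clusters, fingerprints):
    """Return (number of clusters, 1 - mean similarity of members to their centre)."""
    similarities = []
    for cluster in clusters:
        centre = fingerprints[cluster[0]]
        # Every member, including the centre itself, is compared to the centre.
        similarities.extend(tanimoto(centre, fingerprints[idx]) for idx in cluster)
    mean_similarity = sum(similarities) / len(similarities)
    return len(clusters), 1.0 - mean_similarity


# Toy fingerprints (made-up bit positions).
toy_fps = [{1, 2, 3}, {1, 2, 4}, {8, 9}, {8, 9}]

# One cluster per molecule: every molecule is its own centre, so distortion is 0.
all_singletons = distortion_sketch([(0,), (1,), (2,), (3,)], toy_fps)
print(all_singletons)  # (4, 0.0)

# Two clusters: molecule 1 is only 50% similar to its centre, molecule 0.
two_clusters = distortion_sketch([(0, 1), (2, 3)], toy_fps)
print(two_clusters)  # (2, 0.125)
```

Under this definition the distortion falls to zero as the number of clusters approaches the number of molecules, which is why the curve is read for its elbow rather than its minimum.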

In [None]:
# Initialize an empty list called 'result' to store the clustering analysis results.
result = list()

# Assign the Morgan fingerprint column from the molecule_dataset DataFrame to the variable list_fingerprints.
# Unlike the earlier snippet, this uses the full DataFrame, not just the first 10 rows.
list_fingerprints = molecule_dataset['morgan_fp']

# Calculate the Tanimoto similarity matrix between the fingerprints in list_fingerprints.
# The result is a square matrix where each element (i, j) represents the similarity between fingerprint i and fingerprint j.
similarity_matrix = tanimoto_matrix(list_fingerprints)  # Similarity matrix

# Generate the indices for the positions below the main diagonal of a square matrix.
# 'a' contains the row indices, and 'b' contains the column indices.
# The argument '-1' ensures that the main diagonal is not included.
a, b = np.tril_indices(len(list_fingerprints), -1)  # Indices of elements below the main diagonal

# Calculate the distance matrix from the similarity matrix.
# Distance is calculated as 1 - similarity. This converts similarities into distances,
# where a similarity of 1 (maximum similarity) corresponds to a distance of 0, and a similarity of 0
# (minimum similarity) corresponds to a distance of 1. Only the distances
# corresponding to the lower triangle of the matrix (excluding the diagonal) are selected using indices 'a' and 'b'.
dist_similarity_matrix = 1 - similarity_matrix[a, b]  # Compound distances

# Iterate over cutoff threshold values from 0 up to (but not including) 1, with a step of 0.05.
for i in np.arange(0, 1, 0.05):
    # Round the cutoff value to 2 decimal places.
    cutoff = round(i, 2)

    # Perform Butina clustering using the distance matrix and the current threshold.
    # 'dist_similarity_matrix' contains the distances between compounds.
    # 'len(list_fingerprints)' is the total number of compounds.
    # 'distThresh' is the distance threshold to form clusters; points within this threshold are grouped together.
    # 'isDistData=True' indicates that the input data are distances, not similarities.
    clusters = Butina.ClusterData(dist_similarity_matrix,len(list_fingerprints), distThresh=cutoff, isDistData=True)

    # Compute the number of clusters ('n') and the distortion ('dist') for the current clustering.
    # Distortion is a measure of how well the clusters represent the data. Here, the full dataset is used.
    n, dist = distorion_tanimoto(clusters, molecule_dataset)

    # Append the results (cutoff, number of clusters, distortion) to the 'result' list.
    result.append((cutoff, n, dist))

# Create a Pandas DataFrame called 'table' from the 'result' list.
# The columns of the DataFrame are 'cutoff', 'N_clusters', and 'distortion'.
table = pd.DataFrame(result, columns=['cutoff', 'N_clusters', 'distortion'])

# Print the 'table' DataFrame.
# This DataFrame contains the clustering analysis results for different threshold values.
print(table)

In [None]:
table.plot(x='N_clusters', y='distortion')

As we can see, beyond `N=200` clusters the distortion varies little, so we do not need to sweep the threshold (`cutoff`) all the way from 0 to 1. Let's change the range in the `for` loop so that it stops at 0.5 instead of 1.
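
To check which thresholds a given range actually visits, the grid can be generated explicitly. The sketch below uses integer steps and rounding; for the values used in this section it yields the same cutoffs as `np.arange(0, stop, 0.05)` followed by `round(..., 2)`, including the fact that the stop value itself is excluded. `cutoff_grid` is an illustrative helper, not part of the practice code.

```python
def cutoff_grid(stop, step=0.05):
    """Cutoffs 0, step, 2*step, ... strictly below `stop`, rounded to 2 decimals."""
    n_steps = round(stop / step)
    return [round(k * step, 2) for k in range(n_steps)]


short_grid = cutoff_grid(0.5)
print(short_grid)
# [0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45]

full_grid = cutoff_grid(1.0)
print(len(full_grid), full_grid[-1])  # 20 0.95
```

Halving the range halves the number of clustering runs, which matters here because each run processes the full distance list.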

In [None]:
# Initialize an empty list called 'result' to store the clustering analysis results.
result = list()

# Assign the Morgan fingerprint column from the molecule_dataset DataFrame to the variable list_fingerprints.
# Unlike the previous snippet, this uses the full DataFrame, not just the first 10 rows.
list_fingerprints = molecule_dataset['morgan_fp']

# Calculate the Tanimoto similarity matrix between the fingerprints in 'list_fingerprints'.
# The result is a square matrix where each element (i, j) represents the similarity between fingerprint i and fingerprint j.
similarity_matrix = tanimoto_matrix(list_fingerprints)  # Similarity matrix

# Generate the indices for the positions below the main diagonal of a square matrix.
# 'a' contains the row indices, and 'b' contains the column indices.
# The argument '-1' ensures that the main diagonal is not included.
a, b = np.tril_indices(len(list_fingerprints), -1)  # Indices of elements below the main diagonal

# Calculate the distance matrix from the similarity matrix.
# Distance is calculated as 1 - similarity. This converts similarities into distances,
# where a similarity of 1 (maximum similarity) corresponds to a distance of 0, and a similarity of 0
# (minimum similarity) corresponds to a distance of 1. Only the distances
# corresponding to the lower triangle of the matrix (excluding the diagonal) are selected using indices 'a' and 'b'.
dist_similarity_matrix = 1 - similarity_matrix[a, b]  # Compound distances

# Iterate over cutoff threshold values from 0 up to (but not including) 0.5, with a step of 0.05.
for i in np.arange(0, 0.5, 0.05):
    # Round the cutoff value to 2 decimal places.
    cutoff = round(i, 2)

    # Perform Butina clustering using the distance matrix and the current threshold.
    # 'dist_similarity_matrix' contains the distances between compounds.
    # 'len(list_fingerprints)' is the total number of compounds.
    # 'distThresh' is the distance threshold to form clusters; points within this threshold are grouped together.
    # 'isDistData=True' indicates that the input data are distances, not similarities.
    clusters = Butina.ClusterData(dist_similarity_matrix, len(list_fingerprints), distThresh=cutoff, isDistData=True)

    # Compute the number of clusters ('n') and the distortion ('dist') for the current clustering.
    # Distortion is a measure of how well the clusters represent the data. Here, the full dataset is used.
    n, dist = distorion_tanimoto(clusters, molecule_dataset)

    # Append the results (cutoff, number of clusters, distortion) to the 'result' list.
    result.append((cutoff, n, dist))

# Create a Pandas DataFrame called 'table' from the 'result' list.
# The columns of the DataFrame are 'cutoff', 'N_clusters', and 'distortion'.
table = pd.DataFrame(result, columns=['cutoff', 'N_clusters', 'distortion'])

# Print the 'table' DataFrame.
# This DataFrame contains the clustering analysis results for different cutoff values.
print(table)

In [None]:
table.plot(x='N_clusters', y='distortion')

In the graph we can see an abrupt change in the distortion around `N = 1500`. The table shows that `N=1470` corresponds to `cutoff=0.25`. Let's see how the clustering looks with this value:

In [None]:
# Perform clustering of the Morgan fingerprints ('morgan_fp_list') using the 'cluster_fingerprints' function.
# The distance threshold to form clusters is set to 0.25.
clusters = cluster_fingerprints(morgan_fp_list, cutoff=0.25)

# Set up the figure and axes for a bar chart using matplotlib.
# 'figsize' specifies the size of the figure in inches (width x height).
fig, ax = plt.subplots(figsize=(10, 3))  # Set up the matplotlib figure

# Calculate the size of each cluster and store it in a list called 'agrup'.
# 'map(len, clusters)' applies the 'len' function to each cluster in the 'clusters' list,
# obtaining the number of elements in each cluster.
agrup = list(map(len, clusters))

# Calculate the frequency of each cluster size.
# 'np.unique(agrup, return_counts=True)' returns two arrays:
#  - 'unique': the unique cluster sizes.
#  - 'counts': the number of times each cluster size appears.
unique, counts = np.unique(agrup, return_counts=True)

# Create a frequency table from the unique cluster sizes and their counts.
# 'np.array([unique, counts]).T' creates a 2D array where each row is [cluster size, number of clusters].
# 'np.flip(...)' without an 'axis' argument reverses both the rows and the columns, so each row becomes
# [number of clusters, cluster size], with the largest cluster sizes first.
frec_table = np.flip(np.array([unique, counts]).T)  # Reversed rows and columns

# Create a bar chart.
# 'list(map(str, frec_table[:, 0]))' converts the number of clusters of each size (first column of 'frec_table') to strings for use as x-axis labels.
# 'frec_table[:, 1]' provides the cluster sizes (second column of 'frec_table') for the y-axis (bar heights).
# 'color="mediumseagreen"' sets the color of the bars to medium sea green.
ax.bar(list(map(str, frec_table[:, 0])), frec_table[:, 1], color="mediumseagreen")

# Set the chart title.
ax.set_title("Threshold: 0.25")

# Set the x-axis label.
ax.set_xlabel("# total clusters")

# Set the y-axis label.
ax.set_ylabel("# total elements")

# Display the chart on screen.
plt.show()

# Close the figure to free up memory.
plt.close()

With this threshold, the largest cluster contains 18 molecules.

In [None]:
# Convert the first cluster (index 0) from the 'clusters' list into a list of indices.
# This assumes that 'clusters' is a list of lists, where each sublist contains the indices
# of the molecules that belong to that cluster. The first cluster is considered the largest.
list_ind_cluster0 = list(clusters[0])

# Use the indices from the first cluster to select the corresponding rows from the 'molecule_dataset' DataFrame.
# 'molecule_dataset.iloc[list_ind_cluster0]' selects the rows in the DataFrame
# whose indices match the indices of the molecules in the first cluster.
# The result is a new DataFrame called 'molecules_cluster0' containing only the molecules from the first cluster.
molecules_cluster0 = molecule_dataset.iloc[list_ind_cluster0]

# Print the 'molecules_cluster0' DataFrame.
# This displays all the columns and rows of the DataFrame, allowing inspection of the properties
# of the molecules belonging to the largest cluster.
molecules_cluster0

In [None]:
# Print the number of molecules in the 'molecules_cluster0' DataFrame, along with a descriptive message.
print(f'{len(molecules_cluster0)} molecules from the largest cluster')

# Create a list of legends for the molecules to be drawn in the grid image.
# Iterate over the rows of the 'molecules_cluster0' DataFrame.
# For each molecule, format a string that includes the row index and the value from the 'molecule_chembl_id' column.
legends = [
    f"#{index} {molecule['molecule_chembl_id']}"
    for index, molecule in molecules_cluster0.iterrows()
]

# Generate a grid image showing the molecules from the first cluster.
# 'mols' is a list of RDKit Mol objects to draw, obtained from the "ROMol" column of 'molecules_cluster0'.
# 'legends' is the list of strings created earlier, used to label each molecule in the grid.
# 'molsPerRow' specifies the number of molecules per row in the grid.
# 'subImgSize' specifies the size of each subimage (molecule) in the grid.
Chem.Draw.MolsToGridImage(
    mols= molecules_cluster0["ROMol"].tolist(),
    legends=legends,
    molsPerRow=5,
    subImgSize=(250, 270),
)

## Clustermap
We can also organize the Tanimoto similarity matrix into a hierarchical clustering heat map where we can see how the most similar molecules cluster together.
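
The effect of reordering can be illustrated without any plotting. The sketch below is a simplification: instead of the hierarchical linkage that `sns.clustermap` computes, it places the members of each already known cluster next to each other and permutes the rows and columns accordingly. `reorder_by_clusters` and the toy matrix are illustrative assumptions.

```python
def reorder_by_clusters(matrix, clusters):
    """Permute rows and columns so that members of each cluster are adjacent."""
    order = [idx for cluster in clusters for idx in cluster]
    return [[matrix[i][j] for j in order] for i in order]


# Toy similarity matrix (made-up values):
# molecules 0 and 2 are similar, and so are molecules 1 and 3.
toy_matrix = [
    [1.0, 0.1, 0.9, 0.2],
    [0.1, 1.0, 0.2, 0.8],
    [0.9, 0.2, 1.0, 0.1],
    [0.2, 0.8, 0.1, 1.0],
]
toy_clusters = [(0, 2), (1, 3)]

reordered = reorder_by_clusters(toy_matrix, toy_clusters)
for row in reordered:
    print(row)
# After reordering, the high similarities form blocks along the diagonal:
# [1.0, 0.9, 0.1, 0.2]
# [0.9, 1.0, 0.2, 0.1]
# [0.1, 0.2, 1.0, 0.8]
# [0.2, 0.1, 0.8, 1.0]
```

In the clustermap the same kind of block structure appears, with the dendrogram indicating how the blocks were merged.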

In [None]:
# Calculate the Tanimoto similarity matrix for the full list of circular fingerprints ('circular_fp_list').
# This generates a square matrix where each element (i, j) represents the Tanimoto similarity between fingerprint i and fingerprint j.
similarity_matrix_full = tanimoto_matrix(circular_fp_list)

# Create a clustered heatmap (clustermap) of the Tanimoto similarity matrix using the Seaborn library.
# 'similarity_matrix_full' is the input matrix to visualize.
# 'cmap="vlag"' specifies the color palette to use (useful for showing similarities, where extreme colors represent high and low similarity).
# 'dendrogram_ratio=(.1,.2)' adjusts the proportion of space dedicated to the row and column dendrograms, respectively.
# 'yticklabels=False' and 'xticklabels=False' hide the axis labels, which simplifies visualization when the number of points is large.
# 'figsize=(10,10)' sets the figure size in inches (width x height).
g = sns.clustermap(similarity_matrix_full, cmap="vlag",
                   dendrogram_ratio=(.1,.2),
                   yticklabels=False,xticklabels=False,
                   figsize=(10,10))

# Remove the row dendrogram from the clustered heatmap.
# This can be useful to simplify the visualization if only the column dendrogram is needed.
g.ax_row_dendrogram.remove()

# Create a directory named 'data/' if it doesn't already exist.
# 'mkdir -p data/' is a shell command that creates the 'data/' directory and its parent directories if needed.
# The '-p' flag prevents an error if the directory already exists. The '!' executes the shell command.
!mkdir -p data/

# Save the clustered heatmap as a PNG file.
# './data/TanimotoSimilarity.png' specifies the path and filename.
# 'bbox_inches="tight"' adjusts the figure boundaries to ensure everything fits in the saved image.
# 'dpi=500' sets the resolution of the image
plt.savefig('./data/TanimotoSimilarity.png', bbox_inches='tight', dpi=500)

# Display the clustered heatmap on screen.
plt.show()

# Close the figure to free up memory.
plt.close()

# Practical activity

Taking into account what has been reviewed in this second part, write Python code that lets you:

1. Vary the similarity threshold from 0 to 0.7. Discuss your results.
2. Select a different appropriate threshold and display the cluster centers of the first 10 clusters.

At the end, you must prepare a document in PDF format in which you attach the proposed code and the output of the execution.

# Conclusion

In this lab, we learned how to use fingerprints and similarity measures to compare a query molecule against a data set of molecules and rank the molecules by similarity. We also learned how to cluster a compound data set and discussed how to choose a reasonable clustering threshold.

# References

1. Seo, M., Shin, H. K., Myung, Y., Hwang, S., & No, K. T. (2020). Development of natural compound molecular fingerprint (NC-MFP) with the Dictionary of Natural Products (DNP) for natural product-based drug development. Journal of Cheminformatics, 12(1), 6. https://doi.org/10.1186/s13321-020-0410-3
2. Capecchi, A., Probst, D., & Reymond, J.-L. (2020). One molecular fingerprint to rule them all: Drugs, biomolecules, and the metabolome. Journal of Cheminformatics, 12(1), 43. https://doi.org/10.1186/s13321-020-00445-4
3. Rácz, A., Bajusz, D., & Héberger, K. (2018). Life beyond the Tanimoto coefficient: Similarity measures for interaction fingerprints. Journal of Cheminformatics, 10(1), 48. https://doi.org/10.1186/s13321-018-0302-y
4. Nielsen, F. (2016). Hierarchical clustering. In F. Nielsen (Ed.), Introduction to HPC with MPI for Data Science (pp. 195-211). Springer International Publishing. https://doi.org/10.1007/978-3-319-21903-5_8
5. Butina, D. (1999). Unsupervised data base clustering based on Daylight’s fingerprint and Tanimoto similarity: A fast and automated way to cluster small and large data sets. Journal of Chemical Information and Computer Sciences, 39(4), 747-750. https://doi.org/10.1021/ci9803381
6. Shi, C., Wei, B., Wei, S., Wang, W., Liu, H., & Liu, J. (2021). A quantitative discriminant method of elbow point for the optimal number of clusters in clustering algorithm. EURASIP Journal on Wireless Communications and Networking, 2021(1), 31. https://doi.org/10.1186/s13638-021-01910-w