# Introduction
Cheminformatics is a field at the intersection of chemistry and information science. Its primary motivation is to use data mining, information retrieval and machine learning techniques to make predictions and inferences that can later be verified experimentally. Chemical compounds are frequently represented as feature vectors, where every feature represents the absence, presence or frequency count of certain important substructures or other chemical properties. These feature vectors are known as "chemical fingerprints". You can read more about cheminformatics here: https://www.emolecules.com/info/molecular-informatics

The conversion of compounds to this vector format is known as fingerprinting. In computer science, fingerprinting is a technique that hashes or maps a large data item to a much shorter string, generally a bit vector, known as its fingerprint. The fingerprint serves as a compact identifier for the data item, much as a human fingerprint identifies an individual, although two different items can occasionally map to the same fingerprint. Fingerprinting avoids transferring and comparing large, bulky data.
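As a rough illustration of the hashing idea only, the sketch below maps a few substructure labels to bit positions in a short bit vector. The labels, the 64-bit length and the use of MD5 are all assumptions made for this example; real chemical fingerprinting schemes are considerably more involved.

```python
import hashlib

def hashed_fingerprint(tokens, n_bits=64):
    # Map each token (a hypothetical substructure label) to one bit position.
    # Different tokens may collide on the same bit, so the result is a compact
    # summary of the item rather than a guaranteed-unique identifier.
    bits = set()
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        bits.add(int(digest, 16) % n_bits)
    return sorted(bits)

# The three labels below are made up for illustration
fp = hashed_fingerprint(["C=O", "C-N", "benzene ring"])
print(fp)
```

The result is a sorted list of set bit positions, which is the same sparse form used for the fingerprints in the rest of this notebook.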

The technique we will use to store the vectors is an indexing method based on the structure of the metric tree, or M-tree, which we will discuss here. The standard similarity measure for chemical fingerprints is the Tanimoto similarity (Min-Max similarity in the non-binary case). The corresponding distance measure satisfies the triangle inequality, and the M-tree structure helps us exploit this fact.


# Similarity Measure
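The similarity measure generally used to compare two fingerprints X and Y is the Tanimoto similarity, which compares two bit vectors. Assume that each vector is of length N, corresponding to the total number of features, and that $X_i$ denotes the i-th bit of X. The numerator counts the features set in both fingerprints and the denominator counts the features set in at least one of them.

 $$Tanimoto~Similarity~(X, Y) = \frac{\sum_{i=1}^N X_i \land Y_i}{\sum_{i=1}^N X_i \lor Y_i} $$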
 
Notice that the similarity measure always lies between 0 and 1. Hence we define the corresponding distance measure between the two fingerprints as:

 $$Distance~(X,Y) = 1 - sim(X,Y) $$
 
 where 'sim' is the Tanimoto similarity
 
 You can read more about chemical similarity here : https://en.wikipedia.org/wiki/Chemical_similarity
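The M-tree relies on this distance satisfying the triangle inequality, d(a, c) <= d(a, b) + d(b, c). The sketch below is a quick empirical sanity check on random binary fingerprints, not a proof. The helper name `tanimoto_distance`, the feature range of 30 and the fixed seed are choices made for this example.

```python
import random

def tanimoto_distance(x, y):
    # x, y: sets of non-zero feature indices (binary fingerprints)
    union = x | y
    if not union:
        return 0.0  # convention assumed here: two empty fingerprints are identical
    return 1.0 - float(len(x & y)) / len(union)

random.seed(0)  # fixed seed so the check is repeatable
violations = 0
for _ in range(1000):
    a, b, c = [set(random.sample(range(30), random.randint(1, 10))) for _ in range(3)]
    # Triangle inequality: d(a, c) <= d(a, b) + d(b, c), with a small float tolerance
    if tanimoto_distance(a, c) > tanimoto_distance(a, b) + tanimoto_distance(b, c) + 1e-12:
        violations += 1

print("violations found: %d" % violations)  # violations found: 0
```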
 


# Representation of fingerprints

Chemical fingerprints are highly sparse vectors with a large number of features, so storing them as dense lists would waste space. Ideally, a chemical fingerprint is stored as a set of feature indices and feature values for the non-zero features only. We will therefore store each fingerprint as a list of the feature indices that correspond to its non-zero features. Note that we deal only with binary fingerprints, so the feature value is always 1.0.

Let's say, for example, that a chemical fingerprint has only its 2nd, 3rd and 500th feature bits set to 1.0. We will store it as [2,3,500].
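To make the storage saving concrete, here is a small sketch that converts between a dense bit list and the sparse index list. The helper names `to_sparse` and `to_dense`, the vector length of 1500 and the 1-based feature numbering (chosen to match "2nd, 3rd and 500th") are assumptions for this example.

```python
def to_sparse(dense_bits):
    # Keep only the positions whose bit is set, numbered from 1
    return [i + 1 for i, bit in enumerate(dense_bits) if bit]

def to_dense(indices, n_features):
    # Rebuild the full bit list from the list of set positions
    dense = [0] * n_features
    for i in indices:
        dense[i - 1] = 1
    return dense

dense = [0] * 1500            # 1500 is an assumed vector length
for position in (2, 3, 500):
    dense[position - 1] = 1

sparse = to_sparse(dense)
print(sparse)                            # [2, 3, 500]
print(to_dense(sparse, 1500) == dense)   # True
```

The dense form stores 1500 values, while the sparse form stores only 3.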

Now let's answer the question of what these features represent. The features generally represent particular attributes of the compound. Some examples are: the absence or presence of a double oxygen bond or of hydrogen atoms, whether the compound is photosensitive, whether its molecular weight is greater than a threshold, how it reacts with specific compounds, how many carbon atoms are present, whether the bond angle is very high, etc.

I have generated a synthetic dataset in the file "fingerprint2.txt" which we will use for our calculations. The dataset has been generated with statistics gathered from actual data to produce similar patterns in the data.

Let us first compute the similarity of two compounds given their fingerprint representations.

In [68]:
#Computing similarity and distance measures

def Tanimoto_sim(X,Y):
    #Input: X, Y are lists of feature indices for the two fingerprints
    #Output: similarity measure, which is a float in [0, 1]
    X, Y = set(X), set(Y)
    union = X | Y
    if not union:
        # Both fingerprints are empty; treat them as identical to avoid dividing by zero
        return 1.0
    return 1.0*len(X & Y)/len(union)
    
def distance(X,Y):
    #Input: X, Y are lists of feature indices for the two fingerprints
    #Output: distance measure, which is a float in [0, 1]
    return 1 - Tanimoto_sim(X,Y)
    

In [69]:
#Example for finding Tanimoto similarity
# In the example below the two fingerprints share only one feature, which is feature number 4
fingerprint_1 = [2,3,4]
fingerprint_2 = [4,5,6]
print("X:"+ str(fingerprint_1))
print("Y:"+ str(fingerprint_2))
print("The Tanimoto similarity of the two fingerprints, X and Y is: "+ str(Tanimoto_sim(fingerprint_1, fingerprint_2)))
print("The distance of the two fingerprints is: "+ str(distance(fingerprint_1, fingerprint_2)))



X:[2, 3, 4]
Y:[4, 5, 6]
The Tanimoto similarity of the two fingerprints, X and Y is: 0.2
The distance of the two fingerprints is: 0.8


In [70]:

#Dataset load from file
fingerprints=[]
with open('fingerprint2.txt') as f:
    for line in f:
        line= line.split(" ")
        # The last token is dropped; this assumes every line ends with a space before the newline
        fingerprints.append([int(i) for i in line[:-1]])
        
print("The total number of points in our dataset is :" + str(len(fingerprints)) + "\n")

print("The following statistics of the dataset show how sparse it is.")
print("The maximum number of features set in a fingerprint is: "+ str(max([len(i) for i in fingerprints])))
print("The minimum number of features set in a fingerprint is: "+ str(min([len(i) for i in fingerprints])))
# Computed as the largest feature index + 1
print("The total number of unique features in the dataset is: "+ str(max([max(i) for i in fingerprints]) + 1))



The total number of points in our dataset is :10000

The following statistics of the dataset show how sparse it is.
The maximum number of features set in a fingerprint is: 100
The minimum number of features set in a fingerprint is: 50
The total number of unique features in the dataset is: 1500


# M-tree

The M-tree, or metric tree, is a tree data structure constructed using a metric distance measure; it relies on the triangle inequality to answer range search queries efficiently. Like other tree data structures, an M-tree has leaf nodes and non-leaf nodes. Every non-leaf node has a pointer to its parent node, a pointer to its sub-tree, its object information and a covering radius denoting the maximum distance from the node to any node in its sub-tree. Every leaf node keeps a pointer to its parent and its object information.

The M-tree partitions the objects into nodes, which define regions of the metric space. Each database object to be indexed corresponds to some node of the tree. A node stores the object information (mainly the feature values of the data point and the object id), a pointer to its sub-tree, and the maximum distance from the node to any node in that sub-tree. In general, an M-tree organizes the metric space into a set of possibly overlapping regions, to which the same principle is applied recursively. You can think of an M-tree as clustering the dataset into groups, and doing so recursively within each group until the group size reaches a minimum.
<img src="mtree.jpg">

To summarize, the following information is stored in each entry of a node in our tree:
1. Object identifier id
2. Object features, i.e. a feature vector corresponding to the object stored in the node
3. Pointer to sub-tree $S_i$ (one way to do this is to store the immediate children in an array; each child in turn stores its own children)
4. Farthest child distance, i.e. the covering radius $r_i$

Other information that you can store in a node includes pointers to the parent, the number of nodes in the sub-tree, the depth of the node in the M-tree, etc. You can find more information about the M-tree here: https://en.wikipedia.org/wiki/M-tree. There are many variants of the M-tree. We will use a simple non-overlapping tree in which every node can be a child of only one other node and there are no overlapping regions, as shown in the figure above.

Let us first construct the class Node and write some helper functions that we will need while constructing the M-tree. Note that the buildTree function is explained in the next section.

In [71]:
from random import shuffle
class Node:
    def __init__(self, idx, features):
        # Initialize Node class with node id and feature vector
        # covered_nodes holds the node's immediate children
        # covering_radius is the max distance of the node to any node in its subtree
        # subtree_ids holds the ids of all nodes in its subtree
        self.features =  features
        self.idx = idx
        self.covered_nodes = []
        self.covering_radius = 0.0
        
        self.subtree_ids =[]
    
    
       
    #Add list of nodes to its children    
    def add_all_nodes(self, nodes, idxs):
        if idxs is None:
            return
        self.covered_nodes.extend(nodes)
    
    # Get distance to given node
    def get_distance(self,node):
        return distance(self.features, node.features)
    
    # Recursively build a subtree by choosing pivots
    # The algorithm is explained in the section "Building the M-tree index"
    def buildTree(self, nodes, idxs, P, M):
        if self.features is None:
            return
        self.subtree_ids = [i for i in idxs]
        # Covering radius: largest distance from this node to any node in its subtree
        for node in nodes:    
            dist=self.get_distance(node)
            if dist > self.covering_radius:
                self.covering_radius = dist
        num=len(nodes)
        # Too few nodes to partition further: store them all as direct children
        if(num < M or num < P):
            self.add_all_nodes(nodes, idxs)
            return
        
        # Pick P random pivots; every other node is assigned to its closest pivot
        rnd_idc =[i for i in range(num)]
        shuffle(rnd_idc)
        pivots = [nodes[rnd_idc[i] ] for i in range(P)]
        nodes_to_be_alloted = [nodes[rnd_idc[i]] for i in range(P,num)]
        idxs_lst =[[] for i in range(P)]
        idxs_lst_actual =[[] for i in range(P)]
        for i in range(len(nodes_to_be_alloted)):
            idx = self.find_closest_pivot(nodes_to_be_alloted[i], pivots)
            # Record the node's own id; idxs[P+i] would not match because the nodes were shuffled
            idxs_lst[idx].append(nodes_to_be_alloted[i].idx)
            idxs_lst_actual[idx].append(i)
        self.add_all_nodes(pivots,[p.idx for p in pivots])
        # Recurse into each pivot with the nodes that were assigned to it
        j=0
        for node in self.covered_nodes:
            node.buildTree([nodes_to_be_alloted[i]  for i in idxs_lst_actual[j]], idxs_lst[j], P, M)
            j+=1
        return
    
    # Find the closest node among the list of pivots 
    def find_closest_pivot(self,node, pivots):
        min_distance = 1.0
        min_idx =0
        for i in range(len(pivots)):
            curr_distance = node.get_distance(pivots[i])
            if curr_distance < min_distance:
                min_distance = curr_distance
                min_idx=i
        return min_idx
    
    # Recursive test function to print all nodes in the subtree
    def print_children(self):
        if self.covered_nodes is None:
            return
        print("%s %s" % (self.idx, [node.idx for node in self.covered_nodes]))
        for node in self.covered_nodes:
            node.print_children()

In [72]:
# Let's initialize a node with id = 0, features = [2,3,4]
node_test = Node(0,[2,3,4])
print("The feature indices of the node are "+ str(node_test.features))
print("The distance of the node from itself is " + str(node_test.get_distance(node_test)))

# We will see more examples below while building the M-tree index

The feature indices of the node are [2, 3, 4]
The distance of the node from itself is 0.0


# Building the M-tree index

While building the M-tree, we partition the dataset into groups using pivots, which enables us to exploit the triangle inequality. In the baseline approach the pivots are chosen randomly. The algorithm is described below.
1. Select the given number P of random pivots from the database of chemical compounds.
2. After choosing the pivots, assign every other chemical compound in the database to one of them. A compound is assigned to the pivot nearest to it, i.e. the pivot with which it shares the highest Min-Max similarity. This is shown in the figure below.
3. Apply this process recursively in each partition until we reach a partition of size less than M, which is another input to the algorithm (or of size less than P, in which case there are not enough compounds left to choose P pivots).

The figure below shows the first iteration of building an M-tree with P=3, M=6. Since each partition has fewer than 6 nodes, the indexing process terminates at this step.
<img src="mtree-2.jpg">
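A single level of steps 1 and 2 can be sketched on a toy dataset as follows. This is a simplified, standalone illustration rather than the `buildTree` method itself; the function name `partition_once`, the toy fingerprints and the seed are assumptions for this example, and the fingerprints are assumed to be non-empty.

```python
import random

def distance(x, y):
    # Tanimoto distance between two lists of feature indices (assumed non-empty)
    x, y = set(x), set(y)
    return 1.0 - float(len(x & y)) / len(x | y)

def partition_once(fingerprints, P, seed=None):
    # Step 1: choose P pivots at random (identified by position in the list)
    rng = random.Random(seed)
    order = list(range(len(fingerprints)))
    rng.shuffle(order)
    pivot_ids, rest = order[:P], order[P:]
    # Step 2: assign every remaining fingerprint to its nearest pivot
    groups = dict((p, []) for p in pivot_ids)
    for i in rest:
        nearest = min(pivot_ids, key=lambda p: distance(fingerprints[i], fingerprints[p]))
        groups[nearest].append(i)
    return groups

toy = [[1, 2, 3], [1, 2, 4], [7, 8, 9], [7, 8, 10], [20, 21], [20, 22]]
groups = partition_once(toy, P=3, seed=1)
for pivot, members in sorted(groups.items()):
    print("pivot %d -> members %s" % (pivot, members))
```

Step 3 would then call the same procedure on each group's members until a group has fewer than M elements.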




# Searching for similar compounds in the database

In this section we provide the motivation for using M-trees to store chemical fingerprints. Our primary goal is to perform range queries on our compound database as fast as possible. In the real world, scientists are always looking for alternatives to drugs. For example, if a particular drug fails to reach production for reasons such as high molecular weight or instability, they start looking for drugs with similar features. Hence we deal with range queries: given a query compound, we want to find all compounds within a threshold distance of it, and any of the results could be a suitable alternative to the query compound. We want this search to be exact, so that no similar compound is missed. The similarity of chemical fingerprints is established using the Min-Max distance, which is the generalization of the Tanimoto distance (explained in the Similarity Measure section) to non-binary datasets.
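For reference, the Min-Max similarity divides the sum of feature-wise minima by the sum of feature-wise maxima. The sketch below assumes fingerprints stored as dictionaries from feature index to a non-negative count, which is a representation chosen for this example and not the one used elsewhere in this notebook.

```python
def min_max_similarity(x, y):
    # x, y: dicts mapping feature index -> non-negative count (sparse vectors)
    keys = set(x) | set(y)
    numerator = sum(min(x.get(k, 0), y.get(k, 0)) for k in keys)
    denominator = sum(max(x.get(k, 0), y.get(k, 0)) for k in keys)
    # Convention assumed here: two all-zero vectors are treated as identical
    return float(numerator) / denominator if denominator else 1.0

counts_a = {2: 3, 3: 1, 4: 2}
counts_b = {4: 1, 5: 2, 6: 1}
print(min_max_similarity(counts_a, counts_b))

# With every count equal to 1, the value matches the Tanimoto similarity of [2,3,4] and [4,5,6]
binary_a = {2: 1, 3: 1, 4: 1}
binary_b = {4: 1, 5: 1, 6: 1}
print(min_max_similarity(binary_a, binary_b))  # 0.2
```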

So, given a query compound, the goal is to find all compounds within a distance threshold $\theta$. The trivial way to do this is to compare the query with every compound in the database one by one, but this forces us to look at the entire database and is time consuming. Instead, we will use the M-tree to search for similar compounds efficiently. The motivation for using M-trees comes from two properties of chemical fingerprints: first, they have a very high number of features, and second, most of these features are zero, which means that the feature vector is very sparse. The structure of the M-tree helps us exploit the triangle inequality.
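The trivial approach amounts to one distance computation per compound in the database. A minimal sketch, assuming fingerprints are non-empty lists of feature indices and using the hypothetical helper names `tanimoto_distance` and `linear_range_search`:

```python
def tanimoto_distance(x, y):
    # Tanimoto distance between two lists of feature indices (assumed non-empty)
    x, y = set(x), set(y)
    return 1.0 - float(len(x & y)) / len(x | y)

def linear_range_search(database, query, threshold):
    # Compare the query against every fingerprint in the database
    return [i for i, fp in enumerate(database) if tanimoto_distance(fp, query) < threshold]

database = [[1, 2, 3], [1, 2, 4], [7, 8, 9]]
print(linear_range_search(database, [1, 2, 3], 0.6))   # [0, 1]
```

This serves as the baseline whose result any index-based search should reproduce.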

Given a query chemical fingerprint q and a distance threshold $\theta$, we want to find the set of chemical fingerprints in the database whose distance to the query is less than the threshold. We exploit the triangle inequality to do so. The basic idea is to prune sub-trees based on the covering radius of the pivot of the sub-tree and the distance of the query to the pivot. The range search procedure can be described by the following steps:
1. Let the query fingerprint be q and the pivot fingerprint be $p_i$ with sub-tree $S_i$. (We start with the root of the tree as $p_i$.) We can bound the distance of any node in $S_i$ from q.
2. Let the covering radius of pivot $p_i$ be $r_i$. By the triangle inequality, the distance of any node in $S_i$ to the query is at most dist(q, $p_i$) + $r_i$. Similarly, it is at least max(dist(q, $p_i$) - $r_i$, 0).
3. Hence we can compute the range of the distance of any node in $S_i$ to q.
4. If the upper bound of the range, i.e. the maximum distance, is less than the threshold distance $\theta$, we can add all the nodes of the sub-tree $S_i$ to our result set.
5. If the lower bound of the range is greater than the threshold distance $\theta$, we can prune the sub-tree $S_i$, since we can say with certainty that the distance of every node in $S_i$ to the query point is greater than $\theta$.
6. If $\theta$ lies inside the range, we apply this technique recursively on the children of $p_i$ until we reach a leaf node. Each child of the root of the sub-tree $S_i$ becomes the new $p_i$, and we repeat the steps in the corresponding sub-tree.
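The decision made at each pivot in steps 2 to 6 can be sketched in isolation as below. The function name `subtree_decision` and the numeric inputs are made up for this example; the cap at 1.0 reflects that the Tanimoto distance never exceeds 1.

```python
def subtree_decision(dist_q_pivot, covering_radius, threshold):
    # Bounds on dist(q, x) for every node x in the pivot's sub-tree
    upper = min(1.0, dist_q_pivot + covering_radius)
    lower = max(0.0, dist_q_pivot - covering_radius)
    if upper < threshold:
        return "include all"   # step 4: the whole sub-tree is within the threshold
    if lower > threshold:
        return "prune"         # step 5: no node in the sub-tree can be within the threshold
    return "descend"           # step 6: inspect the children one by one

print(subtree_decision(0.10, 0.05, 0.30))  # include all
print(subtree_decision(0.90, 0.20, 0.30))  # prune
print(subtree_decision(0.40, 0.25, 0.30))  # descend
```

Only the "descend" case costs further distance computations, which is where the saving over a linear scan comes from.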


In [73]:
class M_tree:
    # Initialize M-tree with parameter variables
    # P is number of pivots
    # M is the minimum size of a group required for us to choose pivots in it
    def __init__(self, P,M):
        self.root = None
        self.P = P
        self.M = M
    
    # Builds the M-tree by calling the buildTree function in class Node
    # The first data point becomes the root; all remaining points go into its subtree
    def make_pivot(self, data):
        self.root = Node(0,data[0])
        nodes =[Node(i, data[i]) for i in range(1,len(data))]
        self.root.buildTree(nodes,[i for i in range(1,len(data))],self.P,self.M)
    
    # The range search function
    # Given a threshold theta, and a query q, finds all compounds in the database which are within
    # the threshold distance from q
    def range_search(self, pivot,query, threshold):
        lst_idxs=[]
        dst_to_qry=pivot.get_distance(query)
        if dst_to_qry < threshold:
            lst_idxs.append(pivot.idx)
        # Upper bound on the distance from the query to any node in the subtree (capped at 1.0)
        max_distance = min(1.0,dst_to_qry+pivot.covering_radius)
        if max_distance < threshold:
            # Every node in the subtree qualifies
            lst_idxs.extend(pivot.subtree_ids)
            return lst_idxs
        # Lower bound on the distance from the query to any node in the subtree
        min_distance = max(0.0, dst_to_qry - pivot.covering_radius)
        if min_distance > threshold:
            # No node in the subtree can qualify, so prune it
            return lst_idxs
        # Otherwise check each child recursively
        for node in pivot.covered_nodes:
            new_idcs = self.range_search(node, query, threshold)
            lst_idxs.extend(new_idcs)
        return lst_idxs

# Evaluation by Comparison with Linear Scan

We compare the running time of searching the database using M-tree indexing with that of a linear scan. We evaluate the model for different threshold values and different queries. Note that the pivots are chosen randomly, so every time you run the indexing method it produces a different index, and the structure of the tree differs from run to run. The model must therefore be validated empirically on a few queries.

In [75]:
import time


# Ideally queries are compounds from outside the database
# I have used compounds from within the database to show correctness.
queries=[Node(1000,fingerprints[2]),Node(1001,fingerprints[134]),Node(1002,fingerprints[567]),Node(1003,fingerprints[889])]
thresholds=[0.01,0.2, 0.25, 0.3]

mtree = M_tree(10,20)
mtree.make_pivot(fingerprints)
nodes =[Node(i, fingerprints[i]) for i in range(len(fingerprints))]


# Let's look at the first level pivots
print("The indices for the first level pivots are:" + str([ i.idx for i in mtree.root.covered_nodes]))

# Let's find all nodes within a distance of 0.75 of the root
# This should contain 0 (the id of the root) because the root is at distance 0 from itself.
# If there are other nodes within that distance, they are returned as well
print("The nodes within a threshold distance of 0.75 from the root is :" + str(mtree.range_search(mtree.root,mtree.root,0.75)))


#Note that every time you run the indexing method, it produces a different index
#Hence the structure of the tree will differ with every run

print("\n Evaluation of M-tree vs linear scan")
m_tree_time=0
linear_time=0
for i in range(len(queries)):
    query = queries[i]
    threshold = thresholds[i]
    
    # Time the M-tree range search
    start = time.time()
    range_query = mtree.range_search(mtree.root,query,threshold)
    end = time.time()
    m_tree_time +=end-start
    
    # Time the linear scan over every node in the database
    lst=[]
    start = time.time()
    for node in nodes:
        if node.get_distance(query) < threshold:
            lst.append(node.idx)
    end = time.time()
    linear_time +=end-start
    
    
print("The average time taken by M-tree :" + str(m_tree_time*1.0/len(queries)))
print("The average time taken by Linear Scan :" + str(linear_time*1.0/len(queries)))


The indices for the first level pivots are:[8003, 4938, 5291, 2974, 7630, 8928, 4182, 1643, 2987, 658]
The nodes within a threshold distance of 0.75 from the root is :[0]

 Evaluation of M-tree vs linear scan
The average time taken by M-tree :0.228827238083
The average time taken by Linear Scan :0.26600420475


# Conclusion
This notebook covered why an M-tree is well suited to indexing chemical compounds, for the reasons explained above. In our run, range queries averaged about 0.23 seconds with the M-tree versus about 0.27 seconds with a linear scan of the database, so the M-tree was faster, although the margin varies between runs because the pivots are chosen randomly.

The basic ideas I want you to take from this notebook are how an M-tree can be used to store points in a database, why it is useful for storing chemical compounds whose similarity is defined by the Tanimoto similarity, and how the triangle inequality can be leveraged to perform range search queries in an M-tree indexed database.