## Locality Sensitive Hashing
(by Tevfik Aytekin)

### Approximate Nearest Neighbor

Finding nearest neighbors in a set of objects is a very general problem with applications in many areas. If the set of objects is very large, an exhaustive pairwise comparison of all objects can be very costly.

**Randomized near-neighbor reporting:** Given a set $P$ of points in a $d$-dimensional space $\mathbb{R}^d$, and parameters $R > 0$, $\delta > 0$, construct a data structure which, given any query point $q$, reports each $R$-near neighbor of $q$ in $P$ with probability $1 - \delta$. Here a point $p$ is an $R$-near neighbor of $q$ if the distance between $p$ and $q$ is at most $R$.
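To make the definition concrete, here is a minimal sketch of the exact baseline that an LSH structure tries to approximate: a linear scan that reports every $R$-near neighbor (so effectively $\delta = 0$) at a cost proportional to $n \cdot d$ per query. The function name `report_near_neighbors` and the use of Euclidean distance are assumptions made for illustration; the definition above does not fix a particular metric.

```python
import numpy as np

def report_near_neighbors(points, query, radius):
    """Return the indices of all points within Euclidean distance `radius` of `query`.

    Hypothetical helper: an exhaustive scan, not an LSH structure.
    """
    dists = np.linalg.norm(points - query, axis=1)
    return np.flatnonzero(dists <= radius)

rng = np.random.default_rng(0)
P = rng.standard_normal((1000, 3))   # n = 1000 points in R^3
q = np.zeros(3)                      # query point
near = report_near_neighbors(P, q, radius=0.5)
print(len(near), "points lie within distance 0.5 of q")
```

Every query touches all $n$ points, which is exactly the cost that becomes prohibitive for large datasets.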


### The main idea of LSH

The main idea of LSH is to design hash functions such that the probability of a collision is much higher for points that are close to each other than for points that are far apart. Given such hash functions, one can hash a query point and retrieve only the elements in the buckets that the query point falls into.


### An LSH Function for Cosine Similarity

<img src="images/lsh_cosine.jpg" width = "400">
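The sketch below illustrates the sign-of-random-projection hash that the `LSH` class in this notebook uses (assuming this is also what the figure depicts). Each random vector contributes one bit: whether the data vector lies on the non-negative side of the hyperplane orthogonal to it. For a single Gaussian random vector, two vectors at angle $\theta$ receive the same bit with probability $1 - \theta/\pi$, so vectors with high cosine similarity collide more often. The helper names `hyperplane_hash`, `collision_rate` and `angle` are made up for this example.

```python
import numpy as np

def hyperplane_hash(x, random_vectors):
    """One bit per random vector: True if x is on its non-negative side."""
    return random_vectors @ x >= 0

def collision_rate(u, v, n_trials=20000, seed=0):
    """Fraction of independent random hyperplanes that give u and v the same bit."""
    rng = np.random.default_rng(seed)
    r = rng.standard_normal((n_trials, u.shape[0]))
    return np.mean(hyperplane_hash(u, r) == hyperplane_hash(v, r))

def angle(u, v):
    """Angle between u and v in radians."""
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

u = np.array([1.0, 0.0])
close = np.array([1.0, 0.2])    # small angle to u
far = np.array([-1.0, 0.5])     # large angle to u

for name, v in (("close", close), ("far", far)):
    print(name, "empirical:", collision_rate(u, v),
          "theory:", 1 - angle(u, v) / np.pi)
```

Concatenating several such bits into one key, as the `LSH` class does, makes collisions between distant vectors rarer still, at the price of also splitting up some true neighbors.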

In [156]:
import numpy as np
from scipy.spatial import distance
import time
from sklearn.metrics.pairwise import pairwise_distances



### Performance measurement

Nearest neighbor search: Suppose that we have a large dataset of vectors X and, given a target vector, we want to find the k vectors in X that are most similar to it. In practice this search is usually repeated for many different target vectors, so the cost per query matters.

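First let us generate the dataset X and a set of target (query) vectors, then time the exhaustive search and the LSH search on the same queries and compare their results (the `NaiveSearch` and `LSH` classes used here are defined in the cells further down):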

In [177]:
n_vectors = 100000
dim = 10
n_neighbors = 100
n_queries = 2000
n_random_vectors = 20

# Queries and dataset are both drawn from a standard normal distribution.
target_vectors = np.random.randn(n_queries, dim)
dataset = np.random.randn(n_vectors, dim)

# Baseline: exhaustive cosine-similarity search over the whole dataset.
ns = NaiveSearch(dataset)
ns.build()
ns_nn = []
tic = time.time()
for i in range(n_queries):
    target_vector = target_vectors[i,:].reshape(1,dim)
    neighbors = ns.find_nn(target_vector, n_neighbors)
    ns_nn.append(neighbors)
toc = time.time()
print("Time:"+str(1000*(toc-tic))+"ms")
# LSH search: the hash table is built once (not timed); only the queries are timed.
lsh_nn = []
lsh = LSH(dataset)
lsh.build(n_random_vectors)
tic = time.time()
for i in range(n_queries):
    target_vector = target_vectors[i,:].reshape(1,dim)
    neighbors = lsh.find_nn(target_vector, n_neighbors)
    lsh_nn.append(neighbors)
toc = time.time()
print("Time:"+str(1000*(toc-tic))+"ms")

# Hit ratio: fraction of the exact top-k neighbors that LSH also returned.
hits = [len(np.intersect1d(ns_nn[i], lsh_nn[i])) for i in range(n_queries)]
print("Hit ratio: ",sum(hits) / (n_queries*n_neighbors))

Time:20857.75399208069ms
Time:596.580982208252ms
Hit ratio:  0.023155
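The hit ratio above is low, and one plausible contributor (an interpretation, not something measured in the run above) is bucket sparsity. With 20 sign bits there can be up to $2^{20} = 1{,}048{,}576$ distinct keys, roughly ten times the 100,000 vectors in the dataset, so the single bucket a query falls into will likely hold far fewer than the 100 neighbors requested. The sketch below counts bucket sizes for different numbers of bits; the helper `bucket_sizes` and the smaller dataset of 20,000 vectors are assumptions chosen to keep the example quick.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
n_vectors, dim = 20000, 10          # smaller than the experiment above, for speed
data = rng.standard_normal((n_vectors, dim))

def bucket_sizes(data, n_bits, rng):
    """Sizes of the non-empty buckets produced by n_bits random hyperplanes."""
    random_vectors = rng.standard_normal((n_bits, data.shape[1]))
    sign_bits = data @ random_vectors.T >= 0        # shape (n_vectors, n_bits)
    packed = np.packbits(sign_bits, axis=1)         # compact byte key per vector
    counts = Counter(row.tobytes() for row in packed)
    return np.array(list(counts.values()))

for n_bits in (5, 10, 20):
    sizes = bucket_sizes(data, n_bits, rng)
    print(f"{n_bits:2d} bits: {len(sizes):6d} non-empty buckets, "
          f"mean size {sizes.mean():.1f}")
```

Fewer bits give larger buckets and therefore more candidates per query, but each query then has to score more vectors, so the number of random vectors trades recall against speed.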


In [49]:
class NaiveSearch:
    def __init__(self, data):
        # data is an n-by-d matrix, where n is the number of vectors
        # and d is the length of each vector.
        self.data = data
        self.norms = None
        self.data_normalized_T = None
    
        
    def build(self):
        self.norms = np.linalg.norm(self.data, axis=1)
        self.norms.shape = (len(self.norms), 1)
        
        data_normalized = np.divide(self.data, self.norms)
        self.data_normalized_T = data_normalized.T
            
            
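    def find_nn(self, target_vector, n_neighbors=10):
        # target_vector has shape (1, d). It is not normalized here: dividing
        # by its norm would scale every similarity by the same positive
        # constant and leave the ranking unchanged.
        sims = np.dot(target_vector,self.data_normalized_T)[0]
        # indices of the n_neighbors largest similarities, best first
        return sims.argsort()[::-1][:n_neighbors]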


In [1]:

class LSH:
    def __init__(self, data):
        # data is an n-by-d matrix, where n is the number of vectors
        # and d is the length of each vector.
        self.data = data
        self.hash_table = {}
        self.random_vectors = None
    
    def build(self, n_random_vectors):
        # generate random vectors
        dim = self.data.shape[1]
        self.random_vectors = np.random.randn(n_random_vectors, dim)
        # sign bits: one row per data vector, one column (bit) per random vector
        sign_bits = np.dot(self.data, self.random_vectors.T) >= 0
        n_data_vectors = self.data.shape[0]
        for i in range(n_data_vectors):
            key = tuple(sign_bits[i,:])
            if key not in self.hash_table:
                self.hash_table[key] = []
            self.hash_table[key].append(i)
            
            
    def find_nn(self, target_vector, n_neighbors=10, max_radius = 0):
        # Note: max_radius is accepted but not used; only the query's own
        # bucket is searched.
        sign_bits = (np.dot(target_vector, self.random_vectors.T) >= 0).flatten()
        sign_bits_tuple = tuple(sign_bits)
        candidate_ids = self.hash_table.get(sign_bits_tuple)
        if candidate_ids is not None:
            # rank only the candidates in the bucket by cosine similarity
            candidate_vectors = self.data[candidate_ids, :]
            sims = 1 - pairwise_distances(target_vector, candidate_vectors, metric='cosine').flatten()
            sorted_nn = sims.argsort()[::-1][:n_neighbors]
            return np.array([candidate_ids[i] for i in sorted_nn])
        else:
            # empty bucket: return an empty index array for a consistent type
            return np.array([], dtype=int)
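The `max_radius` argument of `LSH.find_nn` is never used in the class above. One possible reading, which is an assumption and not something the notebook states, is that it was meant to widen the search to buckets whose keys differ from the query's key in at most `max_radius` bits. A minimal sketch of that idea follows; the helpers `probe_keys` and `collect_candidates` are hypothetical and not part of the class.

```python
import numpy as np
from itertools import combinations

def probe_keys(sign_bits, max_radius):
    """Yield every bucket key within Hamming distance max_radius of sign_bits."""
    sign_bits = np.asarray(sign_bits, dtype=bool)
    for radius in range(max_radius + 1):
        for flip in combinations(range(len(sign_bits)), radius):
            probe = sign_bits.copy()
            idx = np.array(flip, dtype=int)
            probe[idx] = ~probe[idx]        # flip the chosen bits
            yield tuple(probe)

def collect_candidates(hash_table, sign_bits, max_radius=0):
    """Gather candidate ids from the query's bucket and its near neighbors."""
    candidate_ids = []
    for key in probe_keys(sign_bits, max_radius):
        candidate_ids.extend(hash_table.get(key, []))
    return candidate_ids

# Toy hash table with 3-bit keys mapping to lists of vector ids.
table = {
    (True, True, False): [0, 1],
    (True, False, False): [2],
    (False, False, True): [3],
}
query_bits = np.array([True, True, False])
print(collect_candidates(table, query_bits, max_radius=0))
print(collect_candidates(table, query_bits, max_radius=1))
```

The number of probed keys grows combinatorially with the radius, so in this sketch a small `max_radius` is the only practical choice when the key has 20 bits.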
            
        