## Locality Sensitive Hashing
(by Tevfik Aytekin)

### Approximate Nearest Neighbor

Finding nearest neighbors in a set of objects is a very general problem which has applications in many areas. If the size of the set of objects is very large then an exhaustive pairwise comparision of all ojects can be very costly.

**Randomized near-neighbor reporting:** Given a set $P$ of points in a $d$-dimensional space $\mathbb{R}^d$, and parameters $R > 0, > δ > 0$, construct a data structure which, given any query point $q$, reports each $R$-near neighbor of $q$ in $P$ with probability $1 − δ$.


### The main idea of LSH

The main idea of LSH is to design hash functions such that the probability of a collision is much higher for closer points compared to points which are far apart. Given such hash functions one can hash a query point and retrive the elements in the buckets that contain the query point.


### An LSH Function for Cosine Similarity

<img src="images/lsh_cosine.jpg" width = "400">

In [1]:
import numpy as np
from scipy.spatial import distance
import time
from sklearn.metrics.pairwise import pairwise_distances



### Performance measurement

Nearest neighbor search: Suppose that we have a large dataset of vectors X and given a target vector we want to find the most similar k vectors to the target vector in the dataset X. Note that we can do this type of search more than once. 

First let us generate the dataset X:

In [44]:
n_vectors = 100000
dim = 10
n_neighbors = 100
n_queries = 2000
n_random_vectors = 10

target_vectors = np.random.randn(n_queries, dim)
dataset = np.random.randn(n_vectors, dim)

In [47]:
ns = NaiveSearch(dataset)
ns.build()
tic = time.time()
ns_nn = []
for i in range(n_queries):
    target_vector = target_vectors[i,:].reshape(1,dim)
    neighbors = ns.find_nn(target_vector, n_neighbors)
    ns_nn.append(neighbors)
toc = time.time()
print("Time:"+str(1000*(toc-tic))+"ms")

Time:19261.359930038452ms


In [70]:

#target_vector_b = target_vector
#print(neighbors)

lsh_nn = []

lsh = LSH(dataset)
lsh.build(n_random_vectors, n_bands = 10)
tic = time.time()
for i in range(n_queries):
    target_vector = target_vectors[i,:].reshape(1,dim)
    neighbors = lsh.find_nn(target_vector, n_neighbors, n_bands = 10)
    lsh_nn.append(neighbors)
toc = time.time()
print("Time:"+str(1000*(toc-tic))+"ms")
#target_vector_b = target_vector
#print(neighbors)

hits = [len(np.intersect1d(ns_nn[i], lsh_nn[i])) for i in range(n_queries)]
#print(hits)
print("Hit ratio: ",sum(hits) / (n_queries*n_neighbors))

Time:3326.2650966644287ms
Hit ratio:  0.49389


In [55]:
len(lsh.bands)

2

In [12]:
class NaiveSearch:
    def __init__(self, data):
        # data is a n-by-d matrix where d is the length of the vectors
        # and n is the number of vectors. 
        self.data = data
        self.norms = None
        self.data_normalized_T = None
    
        
    def build(self):
        self.norms = np.linalg.norm(self.data, axis=1)
        self.norms.shape = (len(self.norms), 1)
        
        data_normalized = np.divide(self.data, self.norms)
        self.data_normalized_T = data_normalized.T
            
            
    def find_nn(self, target_vector, n_neighbors=10):

        #target_vector_normalized = np.linalg.norm(target_vector)
        
        sims = np.dot(target_vector,self.data_normalized_T)[0]
        return sims.argsort()[::-1][:n_neighbors]


In [64]:

class LSH:
    def __init__(self, data):
        # data is a n-by-d matrix where d is the length of the vectors
        # and n is the number of vectors. 
        self.data = data
        self.bands = []
        self.random_vectors = []
    
    def build(self, n_random_vectors, n_bands = 1):
        for b in range(n_bands):
            # generate random vectors
            self.bands.append({})
            dim = self.data.shape[1]
            self.random_vectors.append(np.random.randn(n_random_vectors, dim))
            # generate dim-by-n index bits
            sign_bits = np.dot(self.data, self.random_vectors[b].T) >= 0
            n_data_vectors = self.data.shape[0]
            for i in range(n_data_vectors):
                key = tuple(sign_bits[i,:])
                if key not in self.bands[b]:
                    self.bands[b][key] = []
                self.bands[b][key].append(i)
            
            
    def find_nn(self, target_vector, n_neighbors=10, n_bands = 1):

        candidate_ids = []
        for b in range(n_bands):
            sign_bits = (np.dot(target_vector, self.random_vectors[b].T) >= 0).flatten()
            sign_bits_tuple = tuple(sign_bits)
            ids = self.bands[b].get(sign_bits_tuple)
            if ids is None: 
                ids = []
            candidate_ids = candidate_ids + ids
        if len(candidate_ids) > 0:
                candidate_vectors = self.data[candidate_ids, :]
                sims = 1 - pairwise_distances(target_vector, candidate_vectors, metric='cosine').flatten()
                sorted_nn = sims.argsort()[::-1][:n_neighbors]
                return np.array([candidate_ids[i] for i in sorted_nn])
        else:
            return []
            
        

### Making sense of the cost of pairwise similarity computation

Suppose that we have n objects represented as vectors of size d. The cost of computing all pairwise similarities is $O(n^2d)$. When $n$ is large the cost of this computation can be quite large. And in some applications like finding near duplicate web pages, in order to eliminate them from search results, $n$ (number of web pages) can be really large.

To get a sense of this cost below is a simple code which measures the time two multiply two matrices of size $n$-by-$d$. (Note that some version of multiplication is needed in order to find similarities between the vectors)

In [None]:
n = 100000
d = 1000
X = np.random.randn(n,d)
Y = np.random.randn(n,d)
tic = time.process_time()

z = np.dot(X,Y.T)

toc = time.process_time()
print("Time:"+str(1000*(toc-tic))+"ms")

Since the time complexity is quadratic, we expect the figures in the following table if we increase $n$. You can try some larger $n$'s to test this.

|  n | time  | 
|:---|:---|
|  100k | 400 seconds  |
|  1m | 11 hours  |
|  10m | 46 days  |
|  100m | 12 years  |
|  1b | 12 centuries |

The above results are taken on a 2,3 GHz Dual-Core Intel Core i5 laptop.

In [13]:
40000000000/(60*60*24*365)

1268.3916793505834

In [10]:
from itertools import combinations
for i in combinations([1,2,3],1):
    print(i)

(1,)
(2,)
(3,)


In [None]:

for different_bits in combinations(range(num_vector), search_radius):
            alternate_bits = copy(query_bin_bits)
            for i in different_bits:
                alternate_bits[i] = 1 if alternate_bits[i] == 0 else 0

            # Convert the new bit vector to an integer index
            nearby_bin = alternate_bits.dot(powers_of_two)