# Locality-Sensitive Hashing

Locality-Sensitive Hashing (LSH) is a method used to efficiently find approximate nearest neighbors in large datasets. The main idea behind LSH is to hash input items in such a way that similar items are mapped to the same "buckets" with high probability (i.e., they have the same hash value), while items that are dissimilar have a low probability of being placed in the same bucket. This method significantly reduces the computational cost of searching for similar items by limiting the search to items in the same bucket rather than comparing every item with every other item.


1. **Hash Functions in LSH**: In LSH, hash functions are drawn at random from a family designed so that the probability of collision (i.e., two items being mapped to the same hash value) grows with their similarity. In the idealized case, a randomly drawn hash function $h$ satisfies two key properties:
   - $P(h(x) = h(y)) = 1$ if $x = y$, meaning identical items will always hash to the same value.
   - $P(h(x) = h(y)) = \text{sim}(x, y)$ if $x \neq y$, meaning the probability that two different items share a hash value equals their similarity. The similarity measure $\text{sim}(x, y)$ is a value between 0 and 1, where 1 indicates identical items and 0 indicates completely dissimilar items.

2. **Similarity and Distance**: The distance $d(x, y)$ between two items $x$ and $y$ can be defined as $1 - \text{sim}(x, y)$, the complement of their similarity, so the more similar two items are, the smaller the distance between them. For $d$ to be a valid metric, it is the distance (not the similarity itself) that must satisfy the triangle inequality.

3. **Triangle Inequality and LSH**: The triangle inequality for distances states that for any three items $x$, $y$, and $z$, the distance between $x$ and $z$ is at most the sum of the distances through $y$, i.e., $d(x, y) + d(y, z) \geq d(x, z)$. Substituting $d = 1 - \text{sim}$ gives $\text{sim}(x, z) \geq \text{sim}(x, y) + \text{sim}(y, z) - 1$. In the context of LSH, this means that if two items are both highly similar to a third item, they cannot be very dissimilar to each other. This principle keeps the similarity and distance measures consistent within the hashed space.
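
The triangle inequality above can be checked numerically. The following is a minimal sketch, assuming the angular similarity $\text{sim}(x, y) = 1 - \theta/\pi$ (the collision probability derived in the SimHash section) as the example similarity; the function names are illustrative and not part of the original notes.

```python
import numpy as np

def angular_similarity(x, y):
    """sim(x, y) = 1 - theta/pi, where theta is the angle between x and y."""
    cos_theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    return 1.0 - theta / np.pi

def distance(x, y):
    """d(x, y) = 1 - sim(x, y)."""
    return 1.0 - angular_similarity(x, y)

def count_triangle_violations(num_trials=1000, dim=5, seed=0, tol=1e-9):
    """Count random triples with d(x, y) + d(y, z) < d(x, z)."""
    rng = np.random.default_rng(seed)
    violations = 0
    for _ in range(num_trials):
        x, y, z = rng.normal(size=(3, dim))
        # Small tolerance absorbs floating-point rounding.
        if distance(x, y) + distance(y, z) < distance(x, z) - tol:
            violations += 1
    return violations

print("violations:", count_triangle_violations())
```

Since the angle between vectors is itself a metric on directions, no violations should be reported for this choice of similarity.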

In practice, LSH is used by search engines like Google to efficiently identify and eliminate duplicate or nearly identical web pages. By hashing web pages and grouping them into buckets based on their hash values, a search engine can quickly find pages that are likely to be duplicates or near-duplicates, because such pages hash to the same or similar buckets with high probability. This reduces redundancy in search results and improves the overall quality of the search experience. LSH makes this approach scalable and efficient even as the number of pages on the web continues to grow.
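
The bucketing idea can be sketched in a few lines. This is an illustrative example only, not how any particular search engine implements it: it assumes items are already represented as numeric vectors, uses the signs of a few random projections as the bucket key, and all names (`signature`, `build_buckets`) are hypothetical.

```python
from collections import defaultdict
import numpy as np

def signature(vector, hyperplanes):
    """Bucket key: one sign bit per random hyperplane (assumed hash family)."""
    bits = (hyperplanes @ vector) > 0
    return tuple(int(b) for b in bits)

def build_buckets(vectors, hyperplanes):
    """Group item indices by their signature."""
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        buckets[signature(vec, hyperplanes)].append(idx)
    return buckets

rng = np.random.default_rng(0)
dim, num_bits = 50, 8
hyperplanes = rng.normal(size=(num_bits, dim))

base = rng.normal(size=dim)
vectors = [
    base,        # item 0
    2.0 * base,  # item 1: rescaled copy, same direction as item 0
    -base,       # item 2: opposite direction, every sign bit flips
]

buckets = build_buckets(vectors, hyperplanes)

# Only items in the query's bucket are compared, not the whole collection.
candidates = buckets[signature(vectors[0], hyperplanes)]
print("candidates for item 0:", candidates)
```

Items 0 and 1 share every sign bit and land in the same bucket, while item 2 lands in the complementary bucket and is never considered as a candidate.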

# SimHash
SimHash is an LSH scheme for cosine (angular) similarity. To derive the probability that two different vectors receive the same SimHash bit, we consider the geometry of the hashing process. The critical factor is the angle $\theta$ between the two vectors.

### Background

SimHash uses random hyperplanes (defined by random vectors) to divide the space. The side of the hyperplane on which a vector falls determines each bit of its hash. The angle $\theta$ between two vectors affects the probability that they will be hashed to the same side of a randomly chosen hyperplane.

### Collision Probability Derivation

1. **Hyperplane Projection**: Consider a hyperplane through the origin defined by a random normal vector $Z$. Each of the two vectors $X$ and $Y$ is projected onto $Z$, and the sign of that projection determines the hash bit: the bit for $X$ is $1$ if the dot product $X \cdot Z > 0$ and $0$ otherwise, and likewise for $Y$.

2. **Angle Between Vectors**: The angle $\theta$ between $X$ and $Y$ can be related to their dot product through the cosine similarity formula:

   $$
   \cos(\theta) = \frac{X \cdot Y}{\|X\|\|Y\|}
   $$

3. **Probability of Different Hash Bits**: $X$ and $Y$ receive different hash bits for a single projection exactly when the random hyperplane separates them, i.e., when they fall on opposite sides of it. Within the plane spanned by $X$ and $Y$, the hyperplane appears as a line through the origin whose orientation is uniformly distributed over a half circle of $\pi$ radians. It separates the two vectors only when it falls inside the wedge of angle $\theta$ between them, so:

   $$
   P(\text{different hash bits}) = \frac{\theta}{\pi}
   $$

   Here $\theta$ measures the "arc" of the unit circle, out of a total of $\pi$, over which the projections of $X$ and $Y$ have different signs.

4. **Probability of Collision**: The probability that $X$ and $Y$ have the same hash bit for a single projection (a collision) is the complement of them having different hash bits:

   $$
   P(\text{collision}) = 1 - P(\text{different hash bits}) = 1 - \frac{\theta}{\pi}
   $$
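
The result of this derivation can be checked empirically. The sketch below is an illustration under stated assumptions (Gaussian random normal vectors, a fixed seed, and hypothetical function names): it draws many random hyperplanes and compares the observed fraction of matching bits with $1 - \theta/\pi$.

```python
import numpy as np

def empirical_collision_rate(x, y, num_hyperplanes=100_000, seed=0):
    """Fraction of random hyperplanes that put x and y on the same side."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(num_hyperplanes, x.shape[0]))
    same_side = (Z @ x > 0) == (Z @ y > 0)
    return same_side.mean()

def theoretical_collision_rate(x, y):
    """1 - theta/pi, with theta computed from the cosine similarity."""
    cos_theta = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    return 1.0 - theta / np.pi

x = np.array([1.0, 0.0, 0.0])
y = np.array([1.0, 1.0, 0.0])  # 45 degrees from x, so 1 - theta/pi = 0.75

empirical = empirical_collision_rate(x, y)
theoretical = theoretical_collision_rate(x, y)
print(f"empirical={empirical:.3f}, theoretical={theoretical:.3f}")
```

With this many hyperplanes the two values should agree to roughly two decimal places; the exact empirical figure depends on the random draw.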

### Conclusion

The derived equation $P(\text{collision}) = 1 - \frac{\theta}{\pi}$ quantifies the probability that two vectors will be hashed to the same value for a single bit in their SimHash, based solely on the angle $\theta$ between them. This highlights the geometric basis of SimHash: vectors that are closer together (smaller $\theta$) are more likely to collide, which is useful for identifying similar items in applications like duplicate detection or near-duplicate document retrieval.
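
Because each bit differs with probability $\theta/\pi$, concatenating many independent bits lets the angle be estimated from the fraction of differing bits. The following is a minimal sketch of that idea, assuming Gaussian random hyperplanes; the helper names `simhash_bits` and `estimate_angle` and the parameter values are illustrative choices, not taken from the notes.

```python
import numpy as np

def simhash_bits(x, Z):
    """One bit per row of Z: 1 if the projection of x onto that row is positive."""
    return (Z @ x > 0).astype(np.uint8)

def estimate_angle(bits_x, bits_y):
    """Estimate theta as pi times the fraction of differing bits."""
    fraction_different = np.mean(bits_x != bits_y)
    return np.pi * fraction_different

rng = np.random.default_rng(1)
dim, num_bits = 100, 4096
Z = rng.normal(size=(num_bits, dim))  # shared hyperplanes for all vectors

x = rng.normal(size=dim)
y = x + 0.5 * rng.normal(size=dim)    # a noisy copy of x

true_angle = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
estimated = estimate_angle(simhash_bits(x, Z), simhash_bits(y, Z))
print(f"true angle={true_angle:.3f} rad, estimated={estimated:.3f} rad")
```

The same matrix `Z` must be used for every vector; signatures produced with different hyperplanes are not comparable. More bits reduce the variance of the estimate at the cost of a longer signature.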


```python
import numpy as np

def simhash_single(Z, X):
    """
    Compute a single SimHash bit for input vector X using a given random vector Z.

    Parameters:
    - Z: numpy array of shape (1, p), random vector used for hashing.
    - X: numpy array of shape (1, p), document vector.

    Returns:
    - float, the sign of the projection of X onto Z: 1.0 or -1.0
      (0.0 only if the dot product is exactly zero). A value of 1.0
      corresponds to hash bit 1 and -1.0 to hash bit 0.
    """
    assert Z.shape[0] == 1, "Z must be a row vector"
    assert X.shape[0] == 1, "X must be a row vector"
    assert Z.shape[1] == X.shape[1], "Z and X must have the same number of columns"

    # Calculate the hash: h(X) = sign(X^T Z)
    hash_bit = np.sign(np.dot(Z, X.T))
    hash_bit = hash_bit[0, 0]  # Extract the scalar from the 1x1 result
    return hash_bit

# Example usage
p = 1000
X1 = np.random.rand(1, p)  # Simulate a document vector
X2 = X1 + np.random.normal(0, 0.1, (1, p))  # Simulate a similar document vector
X3 = -X1  # Simulate a dissimilar document vector (opposite direction, so its sign is flipped)

# Generate a single random vector Z to be used for all documents
d = 32  # Number of hash bits (not used in this single-bit example)
Z = np.random.normal(0, 1, (1, p))

hash1 = simhash_single(Z, X1)
hash2 = simhash_single(Z, X2)
hash3 = simhash_single(Z, X3)

print("Original document hash:", hash1)
print("Similar document hash:", hash2)
print("Dissimilar document hash:", hash3)
```

```
Original document hash: -1.0
Similar document hash: -1.0
Dissimilar document hash: 1.0
```
