# Lab 03 - Similarity Search in Large Datasets - Locality-Sensitive Hashing (LSH) with Random Projection

During this lab we will study the concept of Similarity Search in large datasets. In particular, we
will implement a simple version of Locality-Sensitive Hashing (LSH) using Random Projection.

LSH is a popular technique for approximate nearest neighbor search in high-dimensional spaces. The
main idea behind it is to hash the input items in such a way that similar items are mapped to the
same buckets.

This approach differs from traditional hashing techniques, which aim to minimize collisions between
different items. Here, we want to maximize the probability of collision for items that are similar.
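
To make the idea concrete before starting, here is a minimal sketch of hashing with random projections. It assumes the angular setting, where each hash bit records on which side of a random hyperplane a vector falls. The function names, the number of bits, and the synthetic vectors are illustrative choices, not part of the lab's reference solution.

```python
import numpy as np


def random_hyperplanes(dim: int, n_bits: int, seed: int = 0) -> np.ndarray:
    """Draw n_bits random hyperplane normals from a standard Gaussian."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_bits, dim))


def lsh_signature(vectors: np.ndarray, planes: np.ndarray) -> np.ndarray:
    """Return one boolean signature (n_bits long) per row of `vectors`."""
    # The sign of each projection says on which side of the hyperplane a vector lies.
    return (vectors @ planes.T) >= 0


planes = random_hyperplanes(dim=50, n_bits=16, seed=0)

rng = np.random.default_rng(1)
base = rng.standard_normal(50)
near = base + 0.01 * rng.standard_normal(50)  # small perturbation of base
far = -base                                   # opposite direction

sig = lsh_signature(np.stack([base, near, far]), planes)
shared_near = int((sig[0] == sig[1]).sum())
shared_far = int((sig[0] == sig[2]).sum())
print("bits shared by base and near:", shared_near)
print("bits shared by base and far: ", shared_far)
```

Vectors that point in almost the same direction agree on most bits, so they tend to land in the same bucket, while vectors pointing in opposite directions disagree on every bit.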

## 1. The Dataset

We will use one of the datasets employed to evaluate algorithms for approximate nearest neighbor
search as part of the [ANN Benchmarks](https://ann-benchmarks.com/) initiative.

Please download one of the GloVe variants from the above link. Alternatively, you can download the
dataset directly from the [GloVe](https://nlp.stanford.edu/projects/glove/) project page. Try to
pick as large a version of the dataset as possible to fully appreciate the benefits of LSH.

For more details about GloVe, see Jeffrey Pennington, Richard Socher, and Christopher D. Manning.
2014. GloVe: Global Vectors for Word Representation ([pdf](https://nlp.stanford.edu/pubs/glove.pdf)).


In [21]:
# write your code here

import pathlib
from tqdm.notebook import tqdm
import requests

DATA_DIR = pathlib.Path.cwd().parent.parent / "data" / "lab-03"

def download_file(url: str, dest: pathlib.Path) -> None:
    """Download file from url to dest path."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    chunk_size = 8192

    with requests.get(url, stream=True) as r:
        # Fail on HTTP errors before trusting any response headers.
        r.raise_for_status()
        total_size = int(r.headers.get("content-length", 0))
        print(total_size)
        with (
            open(dest, "wb") as f,
            tqdm(total=total_size, unit="B", unit_scale=True, desc=str(dest)) as pbar,
        ):
            for chunk in r.iter_content(chunk_size=chunk_size):
                if chunk:
                    f.write(chunk)
                    pbar.update(len(chunk))


download_file(
    "http://ann-benchmarks.com/glove-200-angular.hdf5",
    DATA_DIR / "glove-200-angular.hdf5",
)


962819488



In [36]:
download_file(
    "https://nlp.stanford.edu/data/wordvecs/glove.2024.wikigiga.300d.zip",
    DATA_DIR / "glove.2024.wikigiga.300d.zip",
)

1705239207



## 2. Investigate the Dataset

What is the type of the dataset? What is the dimensionality? How many vectors does it contain?
Can you fit the whole dataset in memory?
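
As a starting point for the memory question, a dense array needs roughly `n_vectors * dim * bytes_per_value` bytes. The sketch below assumes the training split of `glove-200-angular` holds 1,183,514 vectors of dimension 200 stored as 32-bit floats. Treat these numbers as an assumption and replace them with the shape and dtype you read from your own file (for example with `h5py`, if you chose the HDF5 variant). The helper name `dense_size_bytes` is hypothetical.

```python
def dense_size_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Memory needed to hold an n_vectors x dim dense array."""
    return n_vectors * dim * bytes_per_value


# Assumed shape of the training split; replace with the values read from your file.
n_vectors, dim = 1_183_514, 200

size = dense_size_bytes(n_vectors, dim)  # float32 -> 4 bytes per value
print(f"{size / 2**30:.2f} GiB as float32")
print(f"{dense_size_bytes(n_vectors, dim, 8) / 2**30:.2f} GiB as float64")
```

Compare the result with the RAM available on your machine, and remember that intermediate copies (for example a normalized version of the data) need the same amount again.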



In [None]:
# write your code here


## 3. Similarity Search

What is the preferred similarity metric for this dataset?

What methods do you already know for performing similarity search? 

For example, try brute-force search, KD-trees, and ball trees from `sklearn.neighbors` on a subset
of the data. What subset size lets you run the computations in reasonable time? What are the time
complexities of these methods?

You do not have to spend too much time on this section. Just get a feel for the problem and the
main challenges associated with it.
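
As a baseline to compare against, here is a minimal brute-force sketch in plain NumPy. It assumes cosine similarity as the metric and uses synthetic Gaussian vectors as a stand-in for the real data. The function name `brute_force_cosine` is hypothetical. The tree-based methods from `sklearn.neighbors` are left for you to try.

```python
import numpy as np


def brute_force_cosine(data: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k rows of `data` most cosine-similar to `query`."""
    data_unit = data / np.linalg.norm(data, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    sims = data_unit @ query_unit            # one dot product per row: O(n * d)
    top = np.argpartition(-sims, k - 1)[:k]  # unordered top-k in O(n)
    return top[np.argsort(-sims[top])]       # sort only the k winners


rng = np.random.default_rng(0)
data = rng.standard_normal((10_000, 200)).astype(np.float32)  # synthetic stand-in
query = data[42] + 0.01 * rng.standard_normal(200).astype(np.float32)

neighbors = brute_force_cosine(data, query, k=5)
print(neighbors)
```

Every query scans all `n` vectors, so the cost per query grows linearly with the dataset size. Timing this function on subsets of increasing size gives a quick feel for where exact search stops being practical.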

In [None]:
# write your code here


## 4. Approximate Nearest Neighbors with LSH

Sometimes, it is acceptable to retrieve approximate nearest neighbors instead of the exact ones. In
many applications, we are absolutely fine with getting neighbors that are approximately close to
the query point if we can get them much faster.

What types of error can we observe when we retrieve approximate nearest neighbors? What are the
interpretations of false positives and false negatives in this context?
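
One common way to quantify these errors is recall at k: the fraction of the true k nearest neighbors that the approximate search actually returned. The sketch below is only one possible framing, and the neighbor ids are made up for illustration. In practice the exact list would come from a brute-force search or from the ground truth shipped with the dataset.

```python
def recall_at_k(approx_ids, exact_ids) -> float:
    """Fraction of the true nearest neighbors that the approximate search returned."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)


exact = [3, 7, 1, 9, 4]   # hypothetical ground truth for one query
approx = [3, 7, 2, 9, 8]  # hypothetical output of an approximate index

false_negatives = set(exact) - set(approx)  # true neighbors that were missed
false_positives = set(approx) - set(exact)  # returned items that are not true neighbors

recall = recall_at_k(approx, exact)
print(recall, false_negatives, false_positives)
```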

In [None]:
# write your code here
