# Prepare, index, then search

## 1. Data generation

Use this code in your solutions to generate the dataset.

In [1]:
%%time
def build(N, D):
    dataset = [None] * N
    for i in range(N):
        dataset[i] = [((i % 9997 - d) + (i * d - d)) % 9999 for d in range(D)]
        dataset[i] = tuple(dataset[i])
    return dataset
DATASET = build(100000, 3)

CPU times: user 116 ms, sys: 8.28 ms, total: 124 ms
Wall time: 122 ms


Here we check that all vectors are unique: even the most common vectors occur exactly once.

In [2]:
from collections import Counter
Counter(DATASET).most_common(10)

[((0, 9997, 9995), 1),
 ((1, 0, 9998), 1),
 ((2, 2, 2), 1),
 ((3, 4, 5), 1),
 ((4, 6, 8), 1),
 ((5, 8, 11), 1),
 ((6, 10, 14), 1),
 ((7, 12, 17), 1),
 ((8, 14, 20), 1),
 ((9, 16, 23), 1)]
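`most_common(10)` only shows the top of the ranking, so uniqueness has to be read off the counts. A more direct check compares the number of distinct vectors with the dataset size. The snippet below is a minimal sketch: it repeats `build` from the cell above so that it runs on its own, and the name `all_unique` is introduced here only for illustration.

```python
def build(N, D):
    # Same generator as in the data generation cell, repeated for self-containment.
    dataset = [None] * N
    for i in range(N):
        dataset[i] = tuple(((i % 9997 - d) + (i * d - d)) % 9999 for d in range(D))
    return dataset

DATASET = build(100000, 3)

# Tuples are hashable, so a set collapses any duplicates.
# If no vector repeats, the set has as many elements as the list.
all_unique = len(set(DATASET)) == len(DATASET)
print(all_unique)
```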

## 2. Search with indexing

Implement search using the following indexing techniques:
- [Annoy](https://github.com/spotify/annoy). Build 5+ trees (take more if 5 is not enough) with Euclidean distance.
- SciPy [KD-tree](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html) implementation. Use Euclidean distance (`p=2`) in the query.

All vectors are unique. You are given a query: a search method (`annoy`|`kdtree`) and a 3D integer vector that is present in the dataset. Print the **index** of this vector in `DATASET`.

For example, if `input.txt` contains:
```
annoy
0 9997 9995
```
then `output.txt` will be:
```
0
```
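Because every query vector is present in the dataset and all vectors are unique, the expected answer can be cross-checked with a plain dictionary lookup. This is not one of the required indexing techniques, only a reference for validating what Annoy or the KD-tree returns. The sketch below repeats `build` so that it runs standalone; `position` and `expected_index` are hypothetical names used for illustration.

```python
def build(N, D):
    # Same generator as in the data generation cell, repeated for self-containment.
    dataset = [None] * N
    for i in range(N):
        dataset[i] = tuple(((i % 9997 - d) + (i * d - d)) % 9999 for d in range(D))
    return dataset

DATASET = build(100000, 3)

# Map each vector to its position; this is safe because the vectors are unique.
position = {vec: i for i, vec in enumerate(DATASET)}

def expected_index(query):
    """Return the index of an exactly matching vector (raises KeyError if absent)."""
    return position[tuple(query)]

print(expected_index([0, 9997, 9995]))  # 0, matching the example output above
```

Comparing this value with the index returned by the approximate search is a quick way to tell whether the number of Annoy trees is sufficient for exact hits.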

Example solution covering both Annoy and the KD-tree:

In [24]:
%%time
from annoy import AnnoyIndex
from scipy.spatial import KDTree

with open("input.txt", "r") as f:
    method = f.readline().strip()  # drop the trailing newline, otherwise method == "annoy" never matches
    query = [int(num) for num in f.readline().split()]

if method == "annoy":
    # ANNOY INITIALIZATION
    ann_ind = AnnoyIndex(3, "euclidean")
    for i, v in enumerate(DATASET):
        ann_ind.add_item(i, list(v)) 
    ann_ind.build(10)
    index = ann_ind.get_nns_by_vector(query, 1)[0]
else:
    # KD-TREE INITIALIZATION
    tree = KDTree(DATASET)
    index = tree.query(query, 1, p=2)[1]
    
with open("output.txt", "w") as f:
    f.write(str(index))
# ... for [0, 9997, 9995] returns 0 in 1.37s with index building

CPU times: user 687 ms, sys: 9.87 ms, total: 697 ms
Wall time: 697 ms


