# Real World Application

## Annoy [Spotify's product]

Annoy is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mapped into memory so that many processes may share the same data. 

To install: `pip install annoy`

Why is this useful? If you want to find nearest neighbors and you have many CPUs, you only need to build the index once. You can also pass around and distribute the static index files for use in a production environment, in Hadoop jobs, etc. Any process can load the index into memory and start doing lookups immediately.

Spotify uses it for music recommendations. 

Now we will run a few experiments to see its advantages in terms of time and space.

In [2]:
# !pip install annoy

In [3]:
from annoy import AnnoyIndex
import random

possible_metrics = ['angular', 'euclidean', 'manhattan', 'hamming', 'dot'] # you can try using one of these distance metrics
f = 50 # you can play with this argument (dimension)
metric = 'angular' 

Now we will create synthetic data drawn from a standard Gaussian distribution. Concretely, we create 1000 data points in f-dimensional space.

In [4]:
t = AnnoyIndex(f, metric)  # Length of item vector that will be indexed

# Create 1000 vectors of dimension f (50 here), each component drawn from a standard Gaussian distribution
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)

t.build(10) # 10 trees
t.save('./Lab6_Extra/test.ann')

True

We built a forest of 10 trees. More trees give higher precision when querying, at the cost of a larger index. After calling `build`, no more items can be added. We did not pass the second argument of `build` (`n_jobs`), so its default of `-1` applies and all available CPU cores are used. Finally, we saved the index to disk; the `True` above is the return value of `save`.
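
To build some intuition for why more trees help, here is a deliberately simplified, pure-Python sketch of a forest of random-hyperplane trees. It is not Annoy's actual implementation: the helper names (`build_tree`, `find_leaf`, `candidates`) and the sizes are made up for illustration, and the query only descends to a single leaf per tree, whereas Annoy also explores nearby branches. Each tree splits the points with a hyperplane halfway between two randomly sampled points, so a true neighbor that one tree separates from the query may still share a leaf with it in another tree.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_tree(ids, vectors, rng, leaf_size=10):
    """Recursively split item ids with random hyperplanes until leaves are small."""
    if len(ids) <= leaf_size:
        return ids  # a leaf is simply a list of item ids
    a, b = rng.sample(ids, 2)
    # The hyperplane is perpendicular to (a - b) and passes through their midpoint.
    normal = [x - y for x, y in zip(vectors[a], vectors[b])]
    midpoint = [(x + y) / 2 for x, y in zip(vectors[a], vectors[b])]
    offset = dot(normal, midpoint)
    left = [i for i in ids if dot(normal, vectors[i]) >= offset]
    right = [i for i in ids if dot(normal, vectors[i]) < offset]
    return (normal, offset,
            build_tree(left, vectors, rng, leaf_size),
            build_tree(right, vectors, rng, leaf_size))

def find_leaf(tree, query):
    """Follow the query down one tree and return the ids stored in its leaf."""
    while isinstance(tree, tuple):
        normal, offset, left, right = tree
        tree = left if dot(normal, query) >= offset else right
    return tree

rng = random.Random(0)
dim, n_items = 8, 300  # hypothetical sizes, chosen only for this sketch
vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_items)]
forest = [build_tree(list(range(n_items)), vectors, rng) for _ in range(10)]

def candidates(n_trees, query):
    """Union of the leaves reached in the first n_trees trees."""
    found = set()
    for tree in forest[:n_trees]:
        found.update(find_leaf(tree, query))
    return found

one_tree = candidates(1, vectors[0])
ten_trees = candidates(10, vectors[0])
print(len(one_tree), len(ten_trees))
```

The candidate set from ten trees always contains the one from a single tree, so the chance that a true neighbor is among the candidates can only grow as trees are added. The price is more memory and more work per query.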

In [6]:
u = AnnoyIndex(f, metric)
u.load('./Lab6_Extra/test.ann') # super fast, will just mmap the file

res = u.get_nns_by_item(0, 100)
print(res) # will find the 100 nearest neighbors

[0, 31, 319, 750, 239, 858, 975, 250, 496, 192, 719, 131, 281, 747, 590, 606, 735, 386, 621, 703, 226, 177, 466, 650, 597, 121, 714, 814, 5, 476, 87, 138, 705, 351, 576, 116, 203, 445, 951, 410, 535, 831, 285, 431, 277, 197, 166, 933, 513, 653, 798, 181, 734, 418, 587, 187, 631, 90, 149, 761, 82, 495, 130, 13, 752, 69, 17, 961, 223, 6, 176, 523, 451, 692, 760, 530, 227, 706, 89, 561, 385, 80, 890, 211, 112, 543, 690, 956, 657, 108, 457, 976, 849, 65, 570, 815, 93, 10, 932, 560]


Here, we loaded an index from disk and found the 100 items closest to item `0`. The query item itself comes first in the result.

Next, to show how closeness is defined, we will compute the distance between the query point and each of the points found.

In [7]:
distances = []
for i in res:
    distances.append(u.get_distance(0, i))

distances

[0.0,
 1.0622214078903198,
 1.0642154216766357,
 1.078370451927185,
 1.0847351551055908,
 1.1373401880264282,
 1.1433602571487427,
 1.1528515815734863,
 1.1550509929656982,
 1.1594780683517456,
 1.1604605913162231,
 1.1620707511901855,
 1.1735116243362427,
 1.1737746000289917,
 1.1813374757766724,
 1.182036280632019,
 1.1840945482254028,
 1.186084508895874,
 1.1897029876708984,
 1.1929341554641724,
 1.1945185661315918,
 1.195190668106079,
 1.1964315176010132,
 1.1981054544448853,
 1.20073664188385,
 1.2013227939605713,
 1.2032915353775024,
 1.204382300376892,
 1.2047511339187622,
 1.2056618928909302,
 1.2100143432617188,
 1.210219383239746,
 1.2106143236160278,
 1.2134289741516113,
 1.214616060256958,
 1.2154388427734375,
 1.2154576778411865,
 1.215738296508789,
 1.216549038887024,
 1.219921350479126,
 1.2199440002441406,
 1.2202807664871216,
 1.2222844362258911,
 1.2246688604354858,
 1.2260814905166626,
 1.226647973060608,
 1.2268146276474,
 1.2289409637451172,
 1.2295904159545898,
 1
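
For reference, Annoy's documentation describes the `angular` metric as the Euclidean distance between normalized vectors, which equals sqrt(2(1 - cos(u, v))). The sketch below recomputes that formula in plain Python; `angular_distance` is a hypothetical helper written for illustration, not part of the Annoy API.

```python
import math

def angular_distance(u, v):
    """sqrt(2 * (1 - cosine similarity)), i.e. Euclidean distance of unit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # Clamp at zero in case rounding pushes the cosine slightly above 1.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))

same = angular_distance([1.0, 2.0], [2.0, 4.0])        # same direction
orthogonal = angular_distance([1.0, 0.0], [0.0, 1.0])  # right angle
opposite = angular_distance([1.0, 0.0], [-1.0, 0.0])   # opposite direction

print(round(same, 6), round(orthogonal, 6), round(opposite, 6))  # 0.0 1.414214 2.0
```

Under this definition the distance ranges from 0 to 2, and orthogonal vectors sit at about 1.414. The listed distances between roughly 1.06 and 1.23 therefore correspond to cosine similarities of about 0.44 down to 0.24.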

## Comparison with k-NN

In [10]:
# !pip install scikit-learn

In [11]:
# import sklearn datasets library
from sklearn import datasets

# load wines dataset
dataset = datasets.load_wine()

In [16]:
dataset

{'data': array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
         1.065e+03],
        [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
         1.050e+03],
        [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
         1.185e+03],
        ...,
        [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
         8.350e+02],
        [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
         8.400e+02],
        [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
         5.600e+02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [12]:
dataset.feature_names 

['alcohol',
 'malic_acid',
 'ash',
 'alcalinity_of_ash',
 'magnesium',
 'total_phenols',
 'flavanoids',
 'nonflavanoid_phenols',
 'proanthocyanins',
 'color_intensity',
 'hue',
 'od280/od315_of_diluted_wines',
 'proline']

In [13]:
dataset.target_names 

array(['class_0', 'class_1', 'class_2'], dtype='<U7')

In [14]:
X = dataset.data
y = dataset.target 

X.shape, y.shape

((178, 13), (178,))

Next, let's compare the neighbors that Annoy finds with the true labels, using the steps below.

In [15]:
t = AnnoyIndex(13, 'euclidean')  # the data has 13 features (dimensions)

for i, elem in enumerate(X):
    t.add_item(i, elem)

t.build(10) # 10 trees
t.save('./Lab6_Extra/wine.ann')

True

In [23]:
u = AnnoyIndex(13, 'euclidean')
u.load('./Lab6_Extra/wine.ann') 
res = u.get_nns_by_item(0, 20)
print(res) # will find the 20 nearest neighbors

[0, 54, 45, 48, 46, 1, 34, 9, 8, 42, 22, 29, 41, 37, 38, 55, 23, 17, 32, 73]


The data point at index `0` belongs to class 0, so its nearest neighbors should mostly belong to class 0 as well. We can check their labels:

In [17]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])

In [24]:
y[res]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

As we can see, the closest points are very likely to be in the same class: 19 of the 20 neighbors have label 0.
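
This is the idea behind k-NN classification: predict the label that most of the neighbors share. The sketch below applies a majority vote to the labels shown in the `y[res]` output above. `majority_vote` is a hypothetical helper for illustration, and note that the neighbor list includes item `0` itself, which a real classifier would normally exclude.

```python
from collections import Counter

# Labels of the 20 neighbors returned for item 0, copied from the y[res] output.
neighbor_labels = [0] * 19 + [1]

def majority_vote(labels):
    """Return the most common label and the fraction of neighbors that agree."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

predicted, agreement = majority_vote(neighbor_labels)
print(predicted, agreement)  # 0 0.95
```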

Next, try to classify these data points using k-NN and determine which approach is better in terms of time complexity and accuracy. Keep in mind that the task is to return items similar to a given input, and that you are dealing with Big Data.

### k-NN (Nearest Neighbours) Implementation

In [24]:
X[0]

array([1.423e+01, 1.710e+00, 2.430e+00, 1.560e+01, 1.270e+02, 2.800e+00,
       3.060e+00, 2.800e-01, 2.290e+00, 5.640e+00, 1.040e+00, 3.920e+00,
       1.065e+03])

In [25]:
from sklearn.neighbors import NearestNeighbors
knn = NearestNeighbors(n_neighbors=20)  # exact search for the 20 nearest neighbours
knn.fit(X)

NearestNeighbors(n_neighbors=20)

In [27]:
distances, neighbourIndices = knn.kneighbors([X[0]])

In [None]:
neighbourIndices

In [31]:
y[neighbourIndices]

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])

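In the example above, k-NN and the Annoy implementation of nearest neighbours return neighbors with the same class labels (19 of class 0 and one of class 1), so on this small dataset they are equally accurate. However, the time cost of brute-force k-NN becomes very high on large, high-dimensional data: each query is O(n × m), where n is the number of data points and m is the number of dimensions. Annoy instead performs an approximate search over a prebuilt index, giving up some precision in exchange for faster queries.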
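
To make the per-query cost concrete, here is a minimal brute-force search in plain Python. It is only a sketch: `brute_force_knn` and the synthetic data are assumptions made for illustration, and scikit-learn's `NearestNeighbors` may pick a tree-based algorithm instead of a full scan. The outer loop visits all n points and each distance sums over m dimensions, which is where the n × m term comes from; the final sort adds a further n log n.

```python
import math
import random

def brute_force_knn(data, query, k):
    """Exact k nearest neighbors: compute n distances, each over m dimensions."""
    scored = []
    for index, point in enumerate(data):
        dist = math.sqrt(sum((p - q) ** 2 for p, q in zip(point, query)))
        scored.append((dist, index))
    scored.sort()  # order by distance, smallest first
    return [index for _, index in scored[:k]]

rng = random.Random(0)
n, m = 1000, 50  # same sizes as the synthetic Annoy example
data = [[rng.gauss(0, 1) for _ in range(m)] for _ in range(n)]

neighbors = brute_force_knn(data, data[0], k=5)
print(neighbors)
```

As with the Annoy queries by item, the query point is its own nearest neighbor, so index `0` comes first in the result.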