# KD-Tree

* Each level of the tree splits the data along a different dimension, cycling through the dimensions repeatedly.

<center> (7, 2) -> (5, 4) -> (9, 6) -> (4, 7) -> (2, 3) -> (8, 1) </center>

red - along x; blue - along y; 

Structure                                                   | Plot
:----------------------------------------------------------:|:----------------------------------------------------------:
<center><img src="https://ucarecdn.com/c8cf11d9-4ef1-4fcf-a1e8-8f8aa49288d8/" width="500" height="350"/></center>| <center><img src="https://ucarecdn.com/18563484-639c-433d-a478-4da33b704da8/" width="500" height="350"/></center>

* first point as root -> the tree may be unbalanced

> Open question: how to find the median with binary data?

* median point as root -> balanced tree

x (all points): 2, 4, 5, 7, 8, 9 -> median 7 (the upper of the two middle values)

left subtree (x < 7), y: 3, 4, 7 -> median 4
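
To make the difference between the two root choices concrete, here is a minimal sketch (not taken from any library) that builds a 2-D tree by inserting points one at a time, so the first point becomes the root. The names `Node`, `insert`, `height`, and `build_by_insertion` are hypothetical helpers introduced only for this illustration. With the order listed at the top, the tree happens to come out with height 3, while the same points fed in sorted by x give a much deeper tree.

```python
class Node:
    """One tree node holding a 2-D point (hypothetical helper)."""
    def __init__(self, point):
        self.point = point
        self.left = None
        self.right = None


def insert(root, point, depth=0):
    """Insert a point; the split axis alternates x, y, x, ... with depth."""
    if root is None:
        return Node(point)
    axis = depth % 2
    if point[axis] < root.point[axis]:
        root.left = insert(root.left, point, depth + 1)
    else:
        root.right = insert(root.right, point, depth + 1)
    return root


def height(node):
    """Number of levels in the tree."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))


def build_by_insertion(points):
    """The first point becomes the root; no balancing is done."""
    root = None
    for p in points:
        root = insert(root, p)
    return root


given_order = [(7, 2), (5, 4), (9, 6), (4, 7), (2, 3), (8, 1)]
sorted_by_x = sorted(given_order)

print(height(build_by_insertion(given_order)))  # 3
print(height(build_by_insertion(sorted_by_x)))  # 5
```

The shape of an insertion-built tree depends entirely on the input order, which is why the median is used as the root when all points are known in advance.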

### Build the tree from a set of points 

1. if points is empty, return None 
2. if the dimension to split on is 0, sort along x; 
   otherwise sort along y 
3. med = median of the sorted points 
4. root = med 
5. left_subtree = buildTree(points to the left of med, other dim)
6. right_subtree = buildTree(points to the right of med, other dim)
7. return root 
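
The steps above can be sketched in Python as follows. This is a minimal illustration under a few assumptions: points are 2-D tuples, nodes are plain dicts, the function name `build_tree` is hypothetical, and for an even number of points the upper of the two middle points is taken as the median (which matches the example, where x = 7 is chosen).

```python
def build_tree(points, depth=0):
    """Build a balanced 2-D tree by splitting on the median point."""
    if not points:
        return None
    axis = depth % 2                      # 0 -> split on x, 1 -> split on y
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2               # upper middle for even-sized lists
    return {
        "point": ordered[mid],
        "left": build_tree(ordered[:mid], depth + 1),
        "right": build_tree(ordered[mid + 1:], depth + 1),
    }


points = [(7, 2), (5, 4), (9, 6), (4, 7), (2, 3), (8, 1)]
tree = build_tree(points)

print(tree["point"])           # (7, 2)
print(tree["left"]["point"])   # (5, 4)
print(tree["right"]["point"])  # (9, 6)
```

Sorting at every level keeps the sketch short; it is not the fastest way to find a median.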


### Find nearest neighbor 

1. search for the query point until we reach a leaf (the leaf is not always the nearest point, so the region within that radius of the query must still be checked)
2. compute the Euclidean distance between this leaf point and the query point 



<center><img src="https://ucarecdn.com/c8cf11d9-4ef1-4fcf-a1e8-8f8aa49288d8/" width="500" height="350"/></center>

* find the nearest neighbor of (5, 5): go left at the root (x: 5 < 7), then right at (5, 4) (y: 5 > 4) `[5 < 7; 5 > 4]`

* leaf (4, 7)

$$ dist = \sqrt{ (4 - 5)^2 + (7 - 5)^2 }  = \sqrt{5} $$

(4, 7) is the current nearest neighbor, so the true nearest neighbor can be no farther than $ \sqrt{5} $ from the query

<center><img src="https://ucarecdn.com/90eacce4-df32-4c35-8dff-7aac0771ea1b/" width="500" height="350"/></center>            


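* back up the tree (unwinding the recursion): go up and check the other branches only where needed

1. check whether the node is closer to the query than the current nearest neighbor 

    1.1. If it is, this node becomes the nearest 

2. check whether the other subtree can contain a closer point (compare the distance to the splitting line with the current nearest distance)

    2.1. If it can, explore it and update the nearest whenever a closer point is found 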

<center><img src="https://ucarecdn.com/c8cf11d9-4ef1-4fcf-a1e8-8f8aa49288d8/" width="500" height="350"/></center>

* current nearest = (4, 7), dist = $ \sqrt{5} $ 
* check (5, 4), dist = $ \sqrt{1} = 1 $ 
* current nearest = (5, 4)
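* decide whether to check the left subtree of (5, 4) (all points with y < 4): the distance from the query to the splitting line is (y = 5) - (y = 4) = 1, which is not less than the current dist = 1 -> DO NOT EXPLORE 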

<center><img src="https://ucarecdn.com/6788235e-42b7-437b-8ae9-36467f00c435/" width="500" height="350"/></center>

* return to the root (7, 2), $ dist = \sqrt{13} > 1 $ 
* (5, 4) is still the nearest 
* check the distance from (5, 5) to the splitting line x = 7: d = 2 > 1 -> DO NOT EXPLORE the right subtree 
* (5, 4) is the nearest neighbor 

* [visualization](http://lti.cs.vt.edu/LTI_ruby/AV/Development/kd-treeAV.html)
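
The walkthrough above can be reproduced with a short sketch. It assumes 2-D tuple points and dict-based nodes; `build_tree`, `nearest`, and the optional `visited` list are hypothetical names used only for this illustration. The search descends to a leaf first, compares each node with the current nearest while unwinding, and enters the other subtree only when the splitting line is closer than the current nearest distance.

```python
import math


def build_tree(points, depth=0):
    """Median-split construction, as in the build steps above."""
    if not points:
        return None
    axis = depth % 2
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return {
        "point": ordered[mid],
        "left": build_tree(ordered[:mid], depth + 1),
        "right": build_tree(ordered[mid + 1:], depth + 1),
    }


def nearest(node, query, depth=0, best=None, visited=None):
    """Return (point, distance) of the stored point closest to query."""
    if node is None:
        return best
    axis = depth % 2
    diff = query[axis] - node["point"][axis]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]

    # 1. descend first on the side of the split that contains the query
    best = nearest(near, query, depth + 1, best, visited)

    # 2. while unwinding, compare this node with the current nearest
    dist = math.dist(node["point"], query)
    if visited is not None:
        visited.append(node["point"])
    if best is None or dist < best[1]:
        best = (node["point"], dist)

    # 3. explore the other side only if the splitting line is closer
    #    than the current nearest distance
    if abs(diff) < best[1]:
        best = nearest(far, query, depth + 1, best, visited)
    return best


points = [(7, 2), (5, 4), (9, 6), (4, 7), (2, 3), (8, 1)]
visited = []
result = nearest(build_tree(points), (5, 5), visited=visited)

print(result)   # ((5, 4), 1.0)
print(visited)  # [(4, 7), (5, 4), (7, 2)]
```

Only three of the six points are compared with the query; (2, 3), (9, 6), and (8, 1) are pruned, as in the walkthrough.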

In [1]:
import numpy as np 
import pandas as pd 
import random 

In [2]:
number_of_users = 10 
number_of_options = 8

In [3]:
np.random.seed(10)

data = np.random.randint(0, 2, size=(number_of_users, number_of_options))
data, data.shape

(array([[1, 1, 0, 1, 0, 1, 1, 0],
        [1, 1, 0, 1, 1, 0, 0, 1],
        [0, 0, 0, 0, 0, 1, 0, 0],
        [1, 1, 0, 0, 1, 0, 0, 1],
        [0, 0, 0, 1, 1, 0, 1, 1],
        [1, 1, 1, 0, 1, 0, 0, 0],
        [0, 1, 0, 1, 1, 1, 0, 1],
        [0, 1, 1, 0, 1, 0, 0, 1],
        [0, 0, 0, 1, 1, 0, 0, 0],
        [1, 0, 1, 1, 0, 1, 1, 0]]),
 (10, 8))

In [4]:
pd_data = pd.DataFrame(
    data=data, 
    index=[i for i in range(number_of_users)],
    columns=['f'+str(j) for j in range(number_of_options)]
)
pd_data

Unnamed: 0,f0,f1,f2,f3,f4,f5,f6,f7
0,1,1,0,1,0,1,1,0
1,1,1,0,1,1,0,0,1
2,0,0,0,0,0,1,0,0
3,1,1,0,0,1,0,0,1
4,0,0,0,1,1,0,1,1
5,1,1,1,0,1,0,0,0
6,0,1,0,1,1,1,0,1
7,0,1,1,0,1,0,0,1
8,0,0,0,1,1,0,0,0
9,1,0,1,1,0,1,1,0


In [5]:
# https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html
# note: KDTree offers no interface to modify the tree after it is built (?!)
from sklearn.neighbors import KDTree

In [6]:
# leaf_size 
# Number of points at which to switch to brute-force.
# Changing leaf_size will not affect the results of a query, 
# but can significantly impact the speed of a query
# and the memory required to store the constructed tree.
kd_tree_l2 = KDTree(data, leaf_size=number_of_options, metric='l2')

In [7]:
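# metrics accepted by KDTree
# whichever metric is used, query() returns neighbors sorted by ascending distance,
# so the FIRST element of the result is the most similar one
# (in newer scikit-learn releases this may need to be called as KDTree.valid_metrics())
kd_tree_l2.valid_metrics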

['euclidean',
 'l2',
 'minkowski',
 'p',
 'manhattan',
 'cityblock',
 'l1',
 'chebyshev',
 'infinity']

| Euclidean                                | L2 (= Euclidean)                         | Minkowski (p = 2 => Euclidean)                              |
|:----------------------------------------:|:----------------------------------------:|:-----------------------------------------------------------:|
| $ \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} $ | $ \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} $ | $ \sqrt[p]{\mid x_1 - x_2 \mid^p + \mid y_1 - y_2 \mid^p} $ |
|<img width="250"/>|<img width="250"/>|<img width="250"/>|


| Manhattan                                    | Cityblock (= Manhattan)                     | L1                                            |
|:--------------------------------------------:|:-------------------------------------------:|:---------------------------------------------:|
| $ \mid x_1 -x_2 \mid + \mid y_1 - y_2 \mid $ | $ \mid x_1 -x_2 \mid + \mid y_1 - y_2 \mid $ | $ \mid x_1 -x_2 \mid + \mid y_1 - y_2 \mid $ |
|<img width="250"/>|<img width="250"/>|<img width="250"/>|



| Chebyshev                                           | Infinity (= Chebyshev)                              |
|:---------------------------------------------------:|:---------------------------------------------------:|
| $ \max( \mid x_1 - x_2 \mid, \mid y_1 - y_2 \mid ) $ | $ \max( \mid x_1 - x_2 \mid, \mid y_1 - y_2 \mid ) $ |
|<img width="250"/>|<img width="250"/>|
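
As a quick check of the formulas in the tables, the following sketch evaluates them with one helper. `minkowski` is a hypothetical function written for this note (it is not the scikit-learn implementation), and the two points are the query (5, 5) and the leaf (4, 7) from the example above. Setting p = 1 gives the Manhattan distance, p = 2 the Euclidean distance, and p = infinity the Chebyshev distance.

```python
import math


def minkowski(a, b, p):
    """Minkowski distance of order p; p = math.inf gives Chebyshev."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)


a, b = (5, 5), (4, 7)

print(minkowski(a, b, 1))         # Manhattan / cityblock / L1: 3.0
print(minkowski(a, b, 2))         # Euclidean / L2: sqrt(5)
print(minkowski(a, b, math.inf))  # Chebyshev / infinity: 2
```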

In [8]:
np.random.seed(10)
new_user = np.random.randint(0, 2, size=(1, number_of_options))
new_user

array([[1, 1, 0, 1, 0, 1, 1, 0]])

In [9]:
# L2 
# results are sorted by ascending distance: the FIRST element is the most similar 
dist_l2, ind_l2 = kd_tree_l2.query(new_user, k=3)
print(f"L2 Indices {ind_l2}\nL2 Dist {dist_l2}")

L2 Indices [[0 9 1]]
L2 Dist [[0.         1.41421356 2.        ]]


In [10]:
def print_tree_compare(query_data, dat, d, indexes):
    print(f"Query user:\t{query_data[0]}\n")
    for i, user_i in enumerate(indexes[0]):
        print(f"User {user_i}:\t\t{dat[user_i]}\t D:\t{d[0][i]}")

In [11]:
print_tree_compare(new_user, data, dist_l2, ind_l2)

Query user:	[1 1 0 1 0 1 1 0]

User 0:		[1 1 0 1 0 1 1 0]	 D:	0.0
User 9:		[1 0 1 1 0 1 1 0]	 D:	1.4142135623730951
User 1:		[1 1 0 1 1 0 0 1]	 D:	2.0


In [12]:
kd_tree_l1 = KDTree(data, leaf_size=number_of_options, metric='l1')
# L1 
# the FIRST element in array - the biggest similarity [+]
dist_l1, ind_l1 = kd_tree_l1.query(new_user, k=3)
print(f"L1 Indices {ind_l1}\nL1 Dist {dist_l1}")

L1 Indices [[0 9 1]]
L1 Dist [[0. 2. 4.]]


In [13]:
print_tree_compare(new_user, data, dist_l1, ind_l1)

Query user:	[1 1 0 1 0 1 1 0]

User 0:		[1 1 0 1 0 1 1 0]	 D:	0.0
User 9:		[1 0 1 1 0 1 1 0]	 D:	2.0
User 1:		[1 1 0 1 1 0 0 1]	 D:	4.0


In [14]:
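kd_tree_inf = KDTree(data, leaf_size=number_of_options, metric='infinity')
# inf (Chebyshev)
# on binary data the Chebyshev distance is 0 for identical rows and 1 otherwise,
# so every non-identical user ties at distance 1 and the ranking is not informative
dist_inf, ind_inf = kd_tree_inf.query(new_user, k=3)
print(f"INF Indices {ind_inf}\nINF Dist {dist_inf}")

INF Indices [[0 2 1]]
INF Dist [[0. 1. 1.]]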


In [15]:
print_tree_compare(new_user, data, dist_inf, ind_inf)

Query user:	[1 1 0 1 0 1 1 0]

User 0:		[1 1 0 1 0 1 1 0]	 D:	0.0
User 2:		[0 0 0 0 0 1 0 0]	 D:	1.0
User 1:		[1 1 0 1 1 0 0 1]	 D:	1.0
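
The `infinity` query above returns users 2 and 1 at the same distance 1.0, although they differ from the query in several features. A small stdlib-only sketch (the helper names `chebyshev` and `manhattan` are hypothetical, the rows are copied from `data`) shows why: on 0/1 vectors the largest per-feature difference is 1 as soon as any feature differs, while the Manhattan distance still counts how many features differ.

```python
query = (1, 1, 0, 1, 0, 1, 1, 0)

# rows copied from the data array above, keyed by user index
users = {
    9: (1, 0, 1, 1, 0, 1, 1, 0),
    2: (0, 0, 0, 0, 0, 1, 0, 0),
    7: (0, 1, 1, 0, 1, 0, 0, 1),
}


def chebyshev(a, b):
    """Largest absolute difference over all features."""
    return max(abs(x - y) for x, y in zip(a, b))


def manhattan(a, b):
    """Sum of absolute differences (number of differing bits for 0/1 data)."""
    return sum(abs(x - y) for x, y in zip(a, b))


for user, row in users.items():
    print(user, chebyshev(query, row), manhattan(query, row))
# 9 1 2
# 2 1 4
# 7 1 7
```

With the Chebyshev metric the order among non-identical users is therefore arbitrary, which explains why user 2 appears ahead of user 9 in that result.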


In [16]:
from sklearn.metrics.pairwise import cosine_similarity

# cosine_similarity returns a (1, n_users) matrix; take its single row
# (this also avoids assigning a 1x1 array to a scalar slot)
cos_similarities = cosine_similarity(new_user, data)[0]
for i, sim in enumerate(cos_similarities):
    print(f"User {i}\tCos Similarity: {round(sim, 4)}")

User 0	Cos Similarity: 1.0
User 1	Cos Similarity: 0.6
User 2	Cos Similarity: 0.4472
User 3	Cos Similarity: 0.4472
User 4	Cos Similarity: 0.4472
User 5	Cos Similarity: 0.4472
User 6	Cos Similarity: 0.6
User 7	Cos Similarity: 0.2236
User 8	Cos Similarity: 0.3162
User 9	Cos Similarity: 0.8


In [17]:
from sklearn.metrics.pairwise import euclidean_distances

# euclidean_distances returns a (1, n_users) matrix of distances:
# unlike cosine similarity, a SMALLER value means a more similar user
euc_distances = euclidean_distances(new_user, data)[0]
for i, dist in enumerate(euc_distances):
    print(f"User {i}\tEuc Distance: {round(dist, 4)}")

User 0	Euc Distance: 0.0
User 1	Euc Distance: 2.0
User 2	Euc Distance: 2.0
User 3	Euc Distance: 2.2361
User 4	Euc Distance: 2.2361
User 5	Euc Distance: 2.2361
User 6	Euc Distance: 2.0
User 7	Euc Distance: 2.6458
User 8	Euc Distance: 2.2361
User 9	Euc Distance: 1.4142


In [18]:
# https://github.com/stefankoegl/kdtree
import kdtree

In [19]:
# dimensions is the length of each point (number of features), not the number of users
empty_tree = kdtree.create(dimensions=number_of_options)

In [20]:
new_tree = kdtree.create(list(data))
new_tree


<KDNode - array([1, 1, 0, 1, 0, 1, 1, 0])>

In [21]:
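# search_knn returns (node, distance) pairs, not row indices )= 
# the reported distance matches the L1 distance on this data
# (for binary vectors L1 equals the squared L2 distance)
neighbors = new_tree.search_knn(new_user[0], 3)

print(f"Value {neighbors[0][0].data}")
print(f"Dist {neighbors[0][1]}")

# recover the row index of each neighbor by comparing it with every row of data
indices = [] 
for neighbor in neighbors:
    for i, d in enumerate(data):
        if (d == neighbor[0].data).all():
            indices.append(i)

print(indices)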




Value [1 1 0 1 0 1 1 0]
Dist 0.0
[0, 9, 1]
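
The nested loop above compares every neighbor with every row. One possible alternative, sketched here with plain tuples so it runs without any dependency, is to build a row-to-indices lookup once and reuse it for each query. The names `rows`, `index_of`, and `neighbor_rows` are hypothetical; with the notebook's objects the keys would presumably be `tuple(row)` for each row of `data` and `tuple(neighbor[0].data)` for each result, which is an assumption and not something the `kdtree` package provides.

```python
# the ten rows of the data array above, as hashable tuples
rows = [
    (1, 1, 0, 1, 0, 1, 1, 0),
    (1, 1, 0, 1, 1, 0, 0, 1),
    (0, 0, 0, 0, 0, 1, 0, 0),
    (1, 1, 0, 0, 1, 0, 0, 1),
    (0, 0, 0, 1, 1, 0, 1, 1),
    (1, 1, 1, 0, 1, 0, 0, 0),
    (0, 1, 0, 1, 1, 1, 0, 1),
    (0, 1, 1, 0, 1, 0, 0, 1),
    (0, 0, 0, 1, 1, 0, 0, 0),
    (1, 0, 1, 1, 0, 1, 1, 0),
]

# map each distinct row to the list of indices where it occurs
index_of = {}
for i, row in enumerate(rows):
    index_of.setdefault(row, []).append(i)

# the three neighbors found for the query user, in the order returned
neighbor_rows = [
    (1, 1, 0, 1, 0, 1, 1, 0),
    (1, 0, 1, 1, 0, 1, 1, 0),
    (1, 1, 0, 1, 1, 0, 0, 1),
]

indices = [i for row in neighbor_rows for i in index_of[row]]
print(indices)  # [0, 9, 1]
```

Keeping a list per row means duplicated rows (such as a user added a second time) report all of their indices.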


In [22]:
kdtree.visualize(new_tree)



                                     [1 1 0 1 0 1 1 0]                                 

                 [0 0 0 1 1 0 0 0]                        [1 1 0 0 1 0 0 1]             

       [0 0 0 1 1 0 1 1]    [0 1 1 0 1 0 0 1]    [1 0 1 1 0 1 1 0]    [1 1 1 0 1 0 0 0]   

      [0 0 0 0 0 1 0 0]            [0 1 0 1 1 1 0 1]            [1 1 0 1 1 0 0 1]                                  



In [23]:
new_user

array([[1, 1, 0, 1, 0, 1, 1, 0]])

In [24]:
# can add new node to kdtree
new_tree.add(new_user[0])

<KDNode - array([1, 1, 0, 1, 0, 1, 1, 0])>

In [25]:
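# search_knn returns (node, distance) pairs, not row indices )= 
# the new point is now stored twice, so the two closest neighbors both have distance 0
neighbors = new_tree.search_knn(new_user[0], 3)

print(neighbors)
print(f"Value {neighbors[0][0].data}")
print(f"Dist {neighbors[0][1]}")
# append the new user to data so that its row (index 10) can be matched as well
data = np.vstack((data, new_user[0]))
indices = set()

for neighbor in neighbors:
    for i, d in enumerate(data):
        if (d == neighbor[0].data).all():
            indices.add(i)

print(indices)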

[(<KDNode - array([1, 1, 0, 1, 0, 1, 1, 0])>, 0.0), (<KDNode - array([1, 1, 0, 1, 0, 1, 1, 0])>, 0.0), (<KDNode - array([1, 0, 1, 1, 0, 1, 1, 0])>, 2.0)]
Value [1 1 0 1 0 1 1 0]
Dist 0.0
{0, 9, 10}


## Short Summary 

* A KD-tree is suitable for building a recommendation system 

| Methods                  | sklearn.neighbors.KDTree | kdtree           |
|:------------------------:|:------------------------:|:----------------:|
| Create tree              |     +                    |    +             |
| Add new node             |     -                    |    +             |
| Index of neighbors       |     +                    |    -             |
| Different distances      |     +                    |    -             |
| Standard types of output |     +                    |    -             |
|<img width="250"/>        |<img width="250"/>        |<img width="250"/>|