# Nearest Neighbors Algorithms in Euclidean and Metric Spaces: Algorithms and Data Structures

## Efficient and Balanced Trees

We are looking for binary trees that allow **efficient query answers**. This relies on:
- A tree being **balanced**
- A tree that can perform **efficient dictionary operations**:
    - 1) *present* (membership lookup), 2) *insert*, 3) *delete* (comparable to the CRUD operations -- Create, Read, Update, Delete -- found in databases, e.g. [NoSQL](https://towardsdatascience.com/crud-create-read-update-delete-operations-on-nosql-database-mongodb-using-node-js-3979573b9b24))
    - These operations run in *O(log n)* time
    - The tree can be built/sorted in **O(n log n) time**
   
In the worst case, the height of an unbalanced tree grows linearly with *n*, while the height of a balanced tree is bounded by *O(log n)*.
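
The height gap between the two cases can be checked on a small example. The sketch below is illustrative only: the names `Node`, `insert`, `build_balanced` and `height` are hypothetical helpers, and the "balanced" tree is built statically by recursive median selection rather than with a self-balancing scheme.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key into a plain binary search tree (no rebalancing)."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right

def build_balanced(sorted_keys):
    """Build a balanced tree by recursively picking the median of sorted keys."""
    if not sorted_keys:
        return None
    mid = len(sorted_keys) // 2
    node = Node(sorted_keys[mid])
    node.left = build_balanced(sorted_keys[:mid])
    node.right = build_balanced(sorted_keys[mid + 1:])
    return node

def height(root):
    """Number of nodes on the longest root-to-leaf path (iterative traversal)."""
    best, stack = 0, [(root, 1)]
    while stack:
        node, depth = stack.pop()
        if node is None:
            continue
        best = max(best, depth)
        stack.append((node.left, depth + 1))
        stack.append((node.right, depth + 1))
    return best

keys = list(range(1, 128))  # 127 keys, already sorted: worst case for naive insertion

chain = None
for k in keys:
    chain = insert(chain, k)      # every key goes to the right: the tree is a chain

balanced = build_balanced(keys)   # median split keeps both halves the same size

print(height(chain))     # 127, linear in n
print(height(balanced))  # 7, i.e. log2(128)
```

With sorted input, naive insertion degenerates into a linked list, which is exactly the situation that balancing is meant to prevent.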

![balanced](images/balanced_tree.png)

<u>Note on efficient tree DS:</u> self-balancing binary search trees such as Adelson-Velsky-Landis (AVL) trees ([wiki](https://en.wikipedia.org/wiki/AVL_tree)) and red-black trees ([wiki](https://en.wikipedia.org/wiki/Red%E2%80%93black_tree)) provide these guarantees.

### Why Trees?

Voronoi diagrams become impractical in higher-dimensional spaces: for *n* points in dimension *d*, their size can grow as $O(n^{\lceil d/2 \rceil})$. As such, we need **more efficient data structures** to handle and process data.

<u>Note on tree construction using the median:</u> Finding the median can be used to build balanced trees. However, this is only efficient if the tree is static, as any modification (insertion or deletion) can break the balance.

### Focused Solution: K-Dimensional Trees ([wiki](https://en.wikipedia.org/wiki/K-d_tree))

**Definition**: A k-d tree (short for *k-dimensional tree*) is a space-partitioning data structure for organizing points in a k-dimensional space. k-d trees are useful for several applications, such as searches involving a multidimensional search key (e.g. range searches and nearest neighbor searches) and creating point clouds. They are a special case of binary space partitioning trees.

**Construction**: One picks a direction (in the space of the data) and projects the points of the underlying dataset onto it. The median of the projections is then taken and the dataset is split at that median along the chosen direction. The process is applied recursively to each half. *Points that are no longer bisected form the bottommost leaves* of the k-d tree.
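
The construction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the split direction simply cycles through the coordinate axes, the median point is stored at each internal node (a common variant, whereas the description above keeps points in the leaves), and `KDNode`, `build_kdtree`, `sq_dist` and `nearest` are hypothetical names.

```python
from collections import namedtuple

# Each node stores the median point, the axis it splits on, and two subtrees.
KDNode = namedtuple("KDNode", ["point", "axis", "left", "right"])

def build_kdtree(points, depth=0):
    """Recursively split the points at the median along one axis."""
    if not points:
        return None
    k = len(points[0])
    axis = depth % k  # assumption: cycle through the axes, one per level
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return KDNode(
        point=pts[mid],
        axis=axis,
        left=build_kdtree(pts[:mid], depth + 1),
        right=build_kdtree(pts[mid + 1:], depth + 1),
    )

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, best=None):
    """Exact nearest neighbor search with pruning of far subtrees."""
    if node is None:
        return best
    if best is None or sq_dist(query, node.point) < sq_dist(query, best):
        best = node.point
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the current best.
    if diff ** 2 < sq_dist(query, best):
        best = nearest(far, query, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(nearest(tree, (9, 2)))  # (8, 1)
```

Sorting at every level keeps the sketch short; a linear-time median selection would be needed to reach the O(n log n) construction bound.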

**Goal**: To choose, at each split, the direction along which the points of the dataset are spread the most

![kdtree](images/kd_tree.png)

**Relation to PCA**: PCA likewise looks for the directions along which the variance of the projected points is maximal
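
To make the variance-maximizing direction concrete, the sketch below computes the first principal direction of a 2D point set from its covariance entries, using the closed-form angle $\theta = \tfrac{1}{2}\operatorname{atan2}(2 s_{xy},\, s_{xx} - s_{yy})$, and compares the projected variance with that of the coordinate axes. The function names and the sample points are illustrative assumptions; using this direction as a k-d tree split direction is one possible reading of the relation, not something stated in these notes.

```python
import math

def principal_direction_2d(points):
    """Unit vector along which the projected variance of 2D points is maximal."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Angle of the major axis of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))

def projected_variance(points, direction):
    """Variance of the points after projection onto a unit direction."""
    proj = [p[0] * direction[0] + p[1] * direction[1] for p in points]
    m = sum(proj) / len(proj)
    return sum((v - m) ** 2 for v in proj) / len(proj)

# Made-up sample: points roughly aligned with the diagonal y = x.
sample = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8), (4.0, 4.1)]
d = principal_direction_2d(sample)

print(d)
print(projected_variance(sample, d))           # variance along the principal direction
print(projected_variance(sample, (1.0, 0.0)))  # variance along the x axis
print(projected_variance(sample, (0.0, 1.0)))  # variance along the y axis
```

For data elongated along a diagonal, the principal direction captures more variance than either coordinate axis, which is why an axis-aligned split is only an approximation of the PCA direction.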

#### Driving Questions

- Does it matter if we miss the exact nearest neighbors?
> no, especially in high dimensions

- Will a failure be frequent?
> it depends on a complexity/differentiability factor $\phi$

