### Instance Based Learning Before/Now

Take all the training data, put it in a database, then look it up to make predictions on new data.

* remembers things
* very fast
* very simple

Downsides: doesn't generalize, and is very sensitive to noise
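
A minimal sketch of the "database lookup" idea, to make the trade-off concrete. The class name `LookupLearner` and the toy data are hypothetical, chosen only for illustration.

```python
class LookupLearner:
    """Hypothetical memorizer: stores every training pair and looks it up verbatim."""

    def __init__(self):
        self.table = {}

    def fit(self, xs, ys):
        # "Learning" is just storing each (x, y) pair.
        for x, y in zip(xs, ys):
            self.table[tuple(x)] = y
        return self

    def predict(self, x):
        # Exact-match lookup; an unseen input has no stored answer.
        return self.table.get(tuple(x))


learner = LookupLearner().fit([(1, 2), (3, 4)], ["a", "b"])
print(learner.predict((1, 2)))  # a
print(learner.predict((1, 3)))  # None
```

The second query is close to a stored point but gets no answer, which is the "doesn't generalize" problem; a mislabeled training point would likewise be returned as-is, which is the noise problem.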


### K Nearest Neighbors

Given:

>Training Data D = {x$_i$, y$_i$}<br>
Distance Metric d(q, x) (domain knowledge)<br>
Number of Neighbors K (domain knowledge)<br>
Query Point q

NN = {i: d(q, x$_i$), K smallest}<br>
(the K elements of the data closest to the query point q under the chosen distance metric)

Return:

>classification: vote of the y$_i$'s that are nearest (NN) (plurality)<br>
regression: mean of the y$_i$'s of NN (weighted by distance)
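
The definition above can be sketched in a few lines of plain Python. This is an illustrative version, not a reference implementation: the function names, the toy data, and the tie handling (ties fall back to sort order) are all assumptions, and the regression here uses an unweighted mean.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def k_nearest(data, q, k, dist=euclidean):
    """Return the k (x, y) pairs whose x is closest to q under dist."""
    return sorted(data, key=lambda pair: dist(q, pair[0]))[:k]


def knn_classify(data, q, k, dist=euclidean):
    # Plurality vote over the labels of the k nearest neighbors.
    votes = Counter(y for _, y in k_nearest(data, q, k, dist))
    return votes.most_common(1)[0][0]


def knn_regress(data, q, k, dist=euclidean):
    # Unweighted mean of the targets of the k nearest neighbors.
    neighbors = k_nearest(data, q, k, dist)
    return sum(y for _, y in neighbors) / len(neighbors)


# Hypothetical toy data.
labeled = [((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
           ((5, 5), "blue"), ((6, 5), "blue")]
numeric = [((1,), 1.0), ((2,), 2.0), ((3,), 3.0), ((10,), 10.0)]

print(knn_classify(labeled, (1.5, 1.5), 3))  # red
print(knn_regress(numeric, (2.2,), 3))       # 2.0
```

Both `k` and `dist` are passed in rather than learned, matching the note that they come from domain knowledge.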

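Cheap to learn (just store the data), comparatively expensive to query. A single query can be as cheap as logarithmic in the number of points (e.g. binary search over sorted one-dimensional data), but you may need to query many times, while training generally happens only once.

K-NN is considered a 'lazy learner', which defers the work to query time, as opposed to an 'eager learner', which does its work at training time (linear regression and most other machine learning algorithms).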
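
To illustrate the lazy/eager contrast, here is a rough one-dimensional sketch. Both class names are hypothetical. The lazy learner only stores the data (sorted here so that each lookup can use binary search), while the eager learner fits a least-squares line up front and then answers queries in constant time.

```python
import bisect


class Lazy1NN:
    """Lazy: fit() only stores sorted data; the real work happens per query."""

    def fit(self, xs, ys):
        pairs = sorted(zip(xs, ys))
        self.xs = [x for x, _ in pairs]
        self.ys = [y for _, y in pairs]
        return self

    def predict(self, q):
        # Binary search for the insertion point, then compare both neighbors.
        i = bisect.bisect_left(self.xs, q)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(self.xs)]
        best = min(candidates, key=lambda j: abs(self.xs[j] - q))
        return self.ys[best]


class EagerLine:
    """Eager: fit() does the work (least-squares line); queries are trivial."""

    def fit(self, xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        self.slope = sxy / sxx
        self.intercept = my - self.slope * mx
        return self

    def predict(self, q):
        return self.slope * q + self.intercept


xs, ys = [1, 2, 3, 4], [2, 4, 6, 8]
lazy = Lazy1NN().fit(xs, ys)
eager = EagerLine().fit(xs, ys)
print(lazy.predict(2.4))   # value stored for the nearest x (x = 2)
print(eager.predict(2.4))  # value read off the fitted line
```

The eager model can throw the training data away after `fit`; the lazy one has to keep all of it around.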

### Quiz: Domain K NNowledge

```python
import numpy as np
from scipy.spatial import distance
from operator import itemgetter

A = [4, 2]  # query point q
X = [[1, 6], [2, 4], [3, 7], [6, 8], [7, 1], [8, 4]]  # training inputs

def k_nearest_mean(A, X, k, dist_function='euclidean'):
    """Mean of y over the k nearest neighbors of A, keeping points tied with the k-th."""
    X_points = []
    for v in X:
        if dist_function in ['minkowski', 'manhattan']:
            dist = distance.minkowski(A, v, 1)  # Minkowski with p=1 is Manhattan distance
        else:
            dist = distance.euclidean(A, v)
        # Target values for this quiz: y = x1^2 + x2
        X_points.append({'x': v, 'd': dist, 'y': v[0]**2 + v[1]})
    X_points.sort(key=itemgetter('d'))
    k_points = X_points[:k]
    # Ties: also include any further points at the same distance as the k-th neighbor
    for point in X_points[k:]:
        if point['d'] == k_points[-1]['d']:
            k_points.append(point)
        else:
            break
    return np.mean([point['y'] for point in k_points])

print("K=1, distance='euclidean': ", k_nearest_mean(A, X, 1))
print("K=3, distance='euclidean': ", k_nearest_mean(A, X, 3))
print("K=1, distance='manhattan': ", k_nearest_mean(A, X, 1, 'manhattan'))
print("K=3, distance='manhattan': ", k_nearest_mean(A, X, 3, 'manhattan'))
```

```
K=1, distance='euclidean':  8.0
K=3, distance='euclidean':  42.0
K=1, distance='manhattan':  29.0
K=3, distance='manhattan':  35.5
```


### K-NN Bias

Preference bias:

* Locality: nearby points are similar
* Smoothness: averaging
* All features matter equally
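
One consequence of treating all features equally is that whichever feature has the largest numeric range dominates the distance. The sketch below uses made-up points and a hypothetical helper `nearest` to show the nearest neighbor changing when one feature is merely rescaled.

```python
import math


def nearest(points, q):
    """Return the point closest to q under Euclidean distance."""
    return min(points, key=lambda p: math.dist(p, q))


q = (0, 0)
points = [(1, 100), (5, 10)]  # hypothetical data; second feature has a large range

# Same data with the second feature divided by 100 (e.g. a change of units).
rescaled = [(x1, x2 / 100) for x1, x2 in points]

print(nearest(points, q))    # (5, 10): the second feature decides
print(nearest(rescaled, q))  # (1, 1.0): now the first feature decides
```

Nothing about the underlying data changed between the two calls, only the units, so the choice of scaling is effectively part of the distance metric.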

### Curse of Dimensionality

As the number of features (dimensions) grows, the amount of data we need to generalize accurately grows exponentially. The curse applies to ML in general, not just K-NN.

Better off giving the model more data than giving it more dimensions
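
A back-of-the-envelope way to see the exponential growth: suppose (as an assumption for illustration) that we want training points spaced evenly so that each one "covers" a tenth of the range along every axis. The hypothetical helper below counts how many points that takes.

```python
def points_needed(dims, per_axis=10):
    """Points required to keep per_axis evenly spaced points along every dimension."""
    return per_axis ** dims


for d in (1, 2, 3, 10):
    print(d, points_needed(d))
# 1 10
# 2 100
# 3 1000
# 10 10000000000
```

Each added dimension multiplies the requirement by ten in this setup, which is why adding data tends to help more than adding features.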

### Some Other Stuff

* Distance metric: the choice of distance metric/function has a huge impact; Euclidean and Manhattan are really useful for regression
* Weighted distance: weighting dimensions differently changes the distance, and can also help with the dimensionality problem
* K = n: taking all the data points and averaging their y values together, which basically ignores the query
* But what if you do a weighted average? Points near the query are weighted more heavily than distant ones, so it does matter where you put your query point. This is locally weighted regression; in place of the averaging function, you can use a decision tree, neural network, linear regression, pretty much anything
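
The difference between a plain average and a weighted average at K = n can be sketched as follows. Inverse-distance weights are an assumption here (one common choice among many), the data is made up, and this is only the weighted-average case, not a full locally weighted regression with a fitted local model.

```python
def plain_average(data, q):
    # K = n with equal weights: the query point q has no effect.
    return sum(y for _, y in data) / len(data)


def weighted_average(data, q, eps=1e-9):
    # K = n with inverse-distance weights (assumed weighting scheme);
    # eps avoids division by zero when q coincides with a training point.
    num = den = 0.0
    for x, y in data:
        w = 1.0 / (abs(x - q) + eps)
        num += w * y
        den += w
    return num / den


data = [(1, 1.0), (2, 2.0), (3, 3.0), (10, 10.0)]  # hypothetical 1-D data

for q in (2.1, 9.0):
    print(q, plain_average(data, q), round(weighted_average(data, q), 2))
```

The plain average returns 4.0 for both queries, while the weighted version is pulled toward the targets of whichever points lie near the query.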

