# <div style="text-align: right"> KNN from scratch. </div>

---

<div style="text-align: right"> Geoff Counihan - Oct 5, 2017 </div>

### Notes

---

It is still unclear why sklearn's predictions differ from mine. I suspected the tie-breaking behavior, but changing it does not make the two agree.
    
Tried scipy's euclidean, which was ~2x slower than the numpy version.


    
__Additions__: Matrix implementation, Test cases, Weighted distance, Entropy, Predict Probability, [This?](https://www.kaggle.com/mineshjethva/knn-from-scratch-in-python-at-97-1)

In [3]:
import numpy as np
from sklearn.datasets import load_iris

In [4]:
iris = load_iris()
X = iris.data[:,:2]
y = iris.target

In [5]:
# stack the two features and the class label side by side
Xy = np.column_stack((X,y))
Xy_point = Xy[:3]
print(Xy_point)

### Similarity

---

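__Numeric__ data needs a distance metric like Euclidean or Manhattan.

__Categorical__ data needs a measure built for categories, such as Hamming distance. Gini and entropy describe how mixed a set of class labels is, not how far apart two samples are.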
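
For categorical features, one common choice (an assumption here, not something this notebook implements) is the Hamming distance: the number of positions where two samples disagree. A minimal sketch, with `hamming` as a hypothetical helper name:

```python
import numpy as np

def hamming(a, b):
    # count the positions where the two samples hold different categories
    a = np.asarray(a)
    b = np.asarray(b)
    return int(np.sum(a != b))

# the two samples differ only in the second feature
print(hamming(['red', 'small', 'round'], ['red', 'large', 'round']))
```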

In [6]:
import numpy as np
import operator

__Euclidean distance__ - defined as the square root of the sum of squared differences between two arrays of numbers.

In [13]:
def euclidean_loop(a, b):
    dist = 0
    for i in range(len(a)):
        dist += np.square(a[i]-b[i])
    return np.sqrt(dist)

In [14]:
def euclidean_matrix(a, b):
    return np.sqrt(((a-b)**2).sum(axis=0))

In [20]:
def minkowski_matrix(a, b, p):
    return ((np.abs(a-b)**p).sum(axis=0))**(1/p)
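
For reference, the formula that `minkowski_matrix` computes for two vectors $a$ and $b$ of length $n$ can be written as:

$$
d_p(a, b) = \left( \sum_{i=1}^{n} |a_i - b_i|^p \right)^{1/p}
$$

With $p = 2$ this is the Euclidean distance, and with $p = 1$ it is the Manhattan distance covered later in this section. For $a = (0,0,0,0)$ and $b = (2,2,2,2)$ that gives $\sqrt{4 \cdot 2^2} = 4$ for $p = 2$ and $4 \cdot 2 = 8$ for $p = 1$.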

In [21]:
%%time
a = np.array([0,0,0,0])
b = np.array([2,2,2,2])

distance4d = minkowski_matrix(a,b,2)
print('4d Distance: {}'.format(distance4d))

c = np.array([0,0])
d = np.array([2,2])

distance2d = minkowski_matrix(c,d,2)
print('2d Distance: {}'.format(distance2d))

4d Distance: 4.0
2d Distance: 2.8284271247461903
CPU times: user 573 µs, sys: 586 µs, total: 1.16 ms
Wall time: 1.56 ms


In [22]:
%%time
a = np.array([0,0,0,0])
b = np.array([2,2,2,2])

distance4d = euclidean_loop(a,b)
print('4d Distance: {}'.format(distance4d))

c = np.array([0,0])
d = np.array([2,2])

distance2d = euclidean_loop(c,d)
print('2d Distance: {}'.format(distance2d))

4d Distance: 4.0
2d Distance: 2.8284271247461903
CPU times: user 730 µs, sys: 505 µs, total: 1.24 ms
Wall time: 862 µs


In [15]:
%%time
a = np.array([0,0,0,0])
b = np.array([2,2,2,2])

distance4d = euclidean_matrix(a,b)
print('4d Distance: {}'.format(distance4d))

c = np.array([0,0])
d = np.array([2,2])

distance2d = euclidean_matrix(c,d)
print('2d Distance: {}'.format(distance2d))

4d Distance: 4.0
2d Distance: 2.8284271247461903
CPU times: user 491 µs, sys: 358 µs, total: 849 µs
Wall time: 629 µs
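
A single `%%time` run on arrays this small mostly measures noise, so the three timings are hard to rank. A sketch of a steadier comparison with the standard-library `timeit` module, assuming larger random vectors; the vector length and repeat counts are arbitrary choices, not values from these notes:

```python
import timeit
import numpy as np

def euclidean_loop(a, b):
    dist = 0
    for i in range(len(a)):
        dist += np.square(a[i] - b[i])
    return np.sqrt(dist)

def euclidean_matrix(a, b):
    return np.sqrt(((a - b) ** 2).sum(axis=0))

rng = np.random.default_rng(0)
a = rng.random(1000)
b = rng.random(1000)

for fn in (euclidean_loop, euclidean_matrix):
    # best of 5 repeats, each repeat averaging over 20 calls
    best = min(timeit.repeat(lambda: fn(a, b), number=20, repeat=5)) / 20
    print('{}: {:.1f} microseconds per call'.format(fn.__name__, best * 1e6))
```

Taking the minimum over several repeats is the usual way to reduce the effect of other processes on the measurement.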


__Manhattan distance__ - defined as the sum of absolute differences along each axis, i.e. the distance covered when moving from one point to another only parallel to the axes.

In [25]:
def manhattan(a, b):
    dist = 0
    for i in range(len(a)):
        dist += np.abs(a[i]-b[i])
    return dist

In [23]:
%%time
a = np.array([0,0,0,0])
b = np.array([2,2,2,2])

distance4d = minkowski_matrix(a,b,1)
print('4d Distance: {}'.format(distance4d))

c = np.array([0,0])
d = np.array([2,2])

distance2d = minkowski_matrix(c,d,1)
print('2d Distance: {}'.format(distance2d))

4d Distance: 8.0
2d Distance: 4.0
CPU times: user 567 µs, sys: 560 µs, total: 1.13 ms
Wall time: 639 µs


In [26]:
a = np.array([0,0,0,0])
b = np.array([2,2,2,2])

distance4d = manhattan(a,b)
print('4d Distance: {}'.format(distance4d))

c = np.array([0,0])
d = np.array([2,2])

distance2d = manhattan(c,d)
print('2d Distance: {}'.format(distance2d))

4d Distance: 8
2d Distance: 4
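
The notes list a matrix implementation as a possible addition. One way to sketch it is with NumPy broadcasting, which computes the Minkowski distance from every row of one array to every row of another in a single call. `pairwise_minkowski` is a hypothetical name, not part of this notebook:

```python
import numpy as np

def pairwise_minkowski(A, B, p=2):
    # A has shape (n, d), B has shape (m, d); the result has shape (n, m)
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    diff = np.abs(A[:, None, :] - B[None, :, :])   # shape (n, m, d)
    return (diff ** p).sum(axis=2) ** (1 / p)

A = np.array([[0, 0], [2, 2]])
B = np.array([[2, 2], [0, 2], [0, 0]])
print(pairwise_minkowski(A, B, p=2))   # Euclidean
print(pairwise_minkowski(A, B, p=1))   # Manhattan
```

The intermediate array holds n x m x d values, so for large training sets this trades memory for speed.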


### Other components

---

__Find neighbors__ - finds the k closest samples to the new sample

In [665]:
def find_neighbors(Xy, new_sample, k):
    distances = []
    neighbors = []
    for i in range(len(Xy)):
        # euclidean_loop is the loop-based distance defined above
        distance = euclidean_loop(Xy[i], new_sample)
        distances.append((Xy[i],distance))
    # sort by distance, closest first
    distances = sorted(distances,key=operator.itemgetter(1))
    #print(distances)
    for i in range(k):
        #print(distances[i])
        neighbors.append(distances[i][0])
    return neighbors

In [666]:
# Xy = np.column_stack((X,y))
# Xy_point = Xy[0]
# print(Xy_point)

Xy = [[2, 2, 2], [4, 4, 4], [3, 3, 3], [5, 5, 5]]
new_sample = [5, 5, 5]

#new_sample = Xy_point

k = 2
neighbors = find_neighbors(Xy, new_sample, k)
print(neighbors)

[([5, 5, 5], 0.0), ([4, 4, 4], 1.7320508075688772), ([3, 3, 3], 3.4641016151377544), ([2, 2, 2], 5.196152422706632)]
([5, 5, 5], 0.0)
([4, 4, 4], 1.7320508075688772)
[[5, 5, 5], [4, 4, 4]]
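
Sorting a list of `(sample, distance)` tuples works, but the same search can be sketched with NumPy by computing all distances at once and ordering them with `np.argsort`. This assumes features and labels are kept in separate arrays so the label never enters the distance; `find_neighbors_np` is a hypothetical name:

```python
import numpy as np

def find_neighbors_np(X, new_sample, k):
    X = np.asarray(X, dtype=float)
    new_sample = np.asarray(new_sample, dtype=float)
    # Euclidean distance from new_sample to every row of X
    distances = np.sqrt(((X - new_sample) ** 2).sum(axis=1))
    # kind='stable' keeps training order among equidistant samples
    nearest = np.argsort(distances, kind='stable')[:k]
    return nearest, distances[nearest]

X = [[2, 2, 2], [4, 4, 4], [3, 3, 3], [5, 5, 5]]
idx, dist = find_neighbors_np(X, [5, 5, 5], k=2)
print(idx)    # row indices of the two closest samples
print(dist)   # their distances to the new sample
```

Returning indices rather than the samples themselves makes it easy to look up the matching labels afterwards.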


__Majority vote__ - calculate the majority class within a set of points. What to do about a tie?

In [118]:
def majority_vote(neighbors):
    class_votes = {}
    for x in range(len(neighbors)):
        sample_class = neighbors[x][-1]
        if sample_class in class_votes:
            class_votes[sample_class] += 1
        else:
            class_votes[sample_class] = 1
    # sort by vote count, highest first; sorting the items alone would order by class label
    sorted_votes = sorted(class_votes.items(), key=operator.itemgetter(1), reverse=True)
    print(sorted_votes)
    return sorted_votes[0][0]

In [774]:
neighbors = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b'], [4,4,4,'c'], [5,5,5,'c']]
response = majority_vote(neighbors)
print(response)


[('a', 2), ('c', 2), ('b', 1)]
a
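
On the tie question, one option (an assumption, not the choice this notebook ends up making) is to break ties in favor of the tied class whose neighbors are closest overall. A sketch with `collections.Counter`, where `vote_with_distance_tiebreak` is a hypothetical helper that takes parallel lists of neighbor labels and distances:

```python
from collections import Counter

def vote_with_distance_tiebreak(labels, distances):
    counts = Counter(labels)
    top = max(counts.values())
    tied = [label for label, count in counts.items() if count == top]
    if len(tied) == 1:
        return tied[0]
    # tie: prefer the class with the smallest total distance to the new sample
    totals = {label: 0.0 for label in tied}
    for label, dist in zip(labels, distances):
        if label in totals:
            totals[label] += dist
    return min(totals, key=totals.get)

labels = ['a', 'a', 'b', 'c', 'c']
distances = [1.0, 2.0, 0.5, 0.2, 0.3]
# 'a' and 'c' both have two votes, but the 'c' neighbors are closer
print(vote_with_distance_tiebreak(labels, distances))
```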


### Create class.

---

__Tie__ - Added to modify behavior when there is a tie for majority class

In [1196]:
class k_nearest_neighbors(object):
    def __init__(self,k=5,metric='euclidean',ties=True):
        '''K nearest neighbors model. Determine the class of an unknown sample
        with a majority vote of the most similar known samples.
        
        k = number of neighbors to use
        metric = similarity measure
            'euclidean' is defined by the square root of the sum of squared differences between two arrays of numbers. 
            'manhattan' is defined by the sum of the absolute differences between two arrays of numbers.
        ties = in the case of a majority tie, the winner is the tied class that occurs most frequently in the training data
        
        '''
        self.k = k
        self.metric = metric
        self.ties = ties
        
    def minkowski_matrix(self, a, b, p):
        return ((np.abs(a-b)**p).sum(axis=0))**(1/p)

    def euclidean(self, a, b):
        dist = 0
        for i in range(len(a)):
            dist += np.square(a[i]-b[i])
        return np.sqrt(dist)
    
    def manhattan(self, a, b):
        dist = 0
        for i in range(len(a)):
            dist += np.abs(a[i]-b[i])
        return dist
    
    def find_neighbors(self, new_sample):
        '''List the k neighbors closest to the new sample.
        
        '''
        distances = []      
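        for i in range(len(self.X)):
            if self.metric == 'euclidean':  
                distance = self.euclidean(self.X[i], new_sample)
            elif self.metric == 'manhattan':
                distance = self.manhattan(self.X[i], new_sample)
            else:
                # fail loudly instead of reusing a stale or undefined distance
                raise ValueError('unknown metric: {}'.format(self.metric))
            distances.append((self.y[i],distance))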
        distances = sorted(distances,key=operator.itemgetter(1))
        
        neighbors = []
        for i in range(self.k):
            neighbors.append(distances[i][0])
        return neighbors
    
    def majority_vote(self, neighbors):
        '''Determine majority class from the set of neighbors.
        
        '''
        class_votes = {}
        for sample_class in neighbors:
            class_votes[sample_class] = class_votes.get(sample_class, 0) + 1
        # order by class label first so equal counts keep label order, then
        # by vote count (highest first) so the majority class comes first
        sorted_votes = sorted(sorted(class_votes.items()),
                              key=operator.itemgetter(1), reverse=True)
        if self.ties:
            sorted_votes = self.tie(sorted_votes)
        return sorted_votes[0][0]
    
    # addition to inspect how often there are ties in counts
    def tie(self,sorted_votes):
        '''Detect a tie for the highest vote count. Of the tied classes,
        choose the class most frequent in the training data.
        
        Ties are counted in self.tie_count.
        '''
        top_count = sorted_votes[0][1]
        tied_classes = [label for label, count in sorted_votes if count == top_count]
        if len(tied_classes) > 1:
            self.tie_count += 1
            tie_class_frequency = {}
            for tie_class in tied_classes:
                tie_class_frequency[tie_class] = np.count_nonzero(self.y == tie_class)
            max_class = max(tie_class_frequency, key=tie_class_frequency.get)
            sorted_votes = [(max_class, top_count)]
        return sorted_votes

    def fit(self,X,y):
        '''Save training data.
        
        '''
        self.X = X
        self.y = y
        self.Xy = np.column_stack((X, y))
        
    def predict(self, X_test):
        '''Predict class for each value in array of new samples.
        
        '''
        self.tie_count = 0
        y_pred = []
        for i in range(len(X_test)):
            neighbors = self.find_neighbors(X_test[i])
            pred_class = self.majority_vote(neighbors)
            y_pred.append(pred_class)
        if self.ties:
            print('{} ties'.format(self.tie_count))
        return y_pred
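
The notes mention predicted probabilities as a possible addition. A sketch of how that could look as a standalone function, assuming the probability of a class is simply its share of the k neighbor labels (uniform weights); `neighbor_class_proba` and its `classes` argument are hypothetical:

```python
import numpy as np

def neighbor_class_proba(neighbor_labels, classes):
    # share of the k neighbors that belong to each class, in the order of `classes`
    neighbor_labels = np.asarray(neighbor_labels)
    return np.array([np.mean(neighbor_labels == c) for c in classes])

# five neighbors: one of class 0, three of class 1, one of class 2
proba = neighbor_class_proba([1, 2, 1, 1, 0], classes=[0, 1, 2])
print(proba)
```

With the class above, `neighbor_labels` would be the list returned by `find_neighbors`, which already holds one label per neighbor.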
    

### Test.

---

In [1155]:
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:,:2]
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=35)

In [1194]:
knn = k_nearest_neighbors(k=2,ties=True)#,metric='euclidean')
#knn = k_nearest_neighbors(k=10,metric='manhattan')
knn.fit(X_train,y_train)

In [1195]:
%%time
my_pred = knn.predict(X_test)

11 ties
CPU times: user 42.3 ms, sys: 1.55 ms, total: 43.9 ms
Wall time: 43.5 ms


### Compare performance

---

In [1185]:
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

In [1186]:
sklearn_knn = KNeighborsClassifier(n_neighbors=2,algorithm='brute')
sklearn_knn.fit(X_train,y_train)

KNeighborsClassifier(algorithm='brute', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=2, p=2,
           weights='uniform')

In [1187]:
%%time
sk_pred = sklearn_knn.predict(X_test)

CPU times: user 1.18 ms, sys: 835 µs, total: 2.02 ms
Wall time: 1.26 ms


### Sample points

---

In [1188]:
a = X_test[:3]

In [1189]:
print('sk_pred: {}'.format(sklearn_knn.predict(a)))
print('my_pred: {}'.format(knn.predict(a)))
print('true: {}'.format(y_test[:3]))

sk_pred: [1 2 2]
1 ties
my_pred: [1, 2, 1]
true: [1 1 2]


### Accuracy differences

---

It is unclear to me why sklearn's predictions differ from mine. This needs a deeper look.

In [1190]:
def accuracy(pred,true):
    correct = 0
    pred_len = len(pred)
    for i in range(pred_len):
        if pred[i] == true[i]:
            correct += 1
    return correct/pred_len
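
The same accuracy can be sketched in vectorized form by comparing the two arrays elementwise and taking the mean. `accuracy_np` is a hypothetical name; the inputs below are the three sample predictions and true labels printed earlier:

```python
import numpy as np

def accuracy_np(pred, true):
    # fraction of positions where the prediction equals the true label
    pred = np.asarray(pred)
    true = np.asarray(true)
    return float(np.mean(pred == true))

# one of the three sample predictions matches the true label
print(accuracy_np([1, 2, 1], [1, 1, 2]))
```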

In [1191]:
accuracy(my_pred,y_test)

0.7105263157894737

In [1192]:
accuracy(sk_pred,y_test)

0.7894736842105263

In [1193]:
list(zip(my_pred,sk_pred,y_test))

[(1, 1, 1),
 (2, 2, 1),
 (1, 2, 2),
 (1, 1, 1),
 (0, 0, 0),
 (2, 2, 2),
 (1, 1, 2),
 (1, 1, 1),
 (1, 1, 1),
 (0, 0, 0),
 (1, 1, 1),
 (1, 1, 2),
 (0, 0, 0),
 (1, 1, 2),
 (0, 0, 0),
 (1, 1, 2),
 (1, 1, 1),
 (0, 0, 0),
 (0, 0, 0),
 (0, 0, 0),
 (1, 1, 1),
 (2, 1, 1),
 (1, 1, 2),
 (1, 1, 1),
 (0, 0, 0),
 (0, 0, 0),
 (0, 0, 0),
 (2, 2, 2),
 (0, 0, 0),
 (2, 2, 2),
 (0, 0, 0),
 (1, 1, 1),
 (1, 1, 2),
 (0, 0, 0),
 (2, 1, 1),
 (1, 1, 2),
 (0, 0, 0),
 (2, 2, 2)]
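
One possible line of inquiry (a guess, not a confirmed explanation) is that several training points sit at exactly the same distance from a test point, so which of them count as the k nearest depends on how each implementation orders equal distances. With only two iris features, duplicate or equidistant points are plausible. A sketch that flags such test points using NumPy only; `boundary_ties` is a hypothetical helper and the data below is made up for illustration:

```python
import numpy as np

def boundary_ties(X_train, X_test, k):
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    flagged = []
    for i, sample in enumerate(X_test):
        # sorted Euclidean distances from this test point to every training point
        d = np.sort(np.sqrt(((X_train - sample) ** 2).sum(axis=1)))
        # k-th and (k+1)-th nearest are equally far: the neighbor set is ambiguous
        if len(d) > k and np.isclose(d[k - 1], d[k]):
            flagged.append(i)
    return flagged

X_train_demo = [[0, 0], [1, 0], [0, 1], [3, 3]]
X_test_demo = [[0.5, 0.5], [3, 2]]
print(boundary_ties(X_train_demo, X_test_demo, k=2))
```

Running it on `X_train`, `X_test` and `k=2` from the cells above would list the test rows worth comparing by hand against the rows where `my_pred` and `sk_pred` disagree.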