## Randomized K-NN Algorithm
In this notebook I implement a "randomized" K-NN model. It partitions the training data into groups by label, takes a random sample from each group, and uses the cosine similarity between the new data point and those samples to decide which class the new point belongs to.

In [1]:
import numpy as np

from sklearn.metrics.pairwise import cosine_similarity

In [2]:
train_data = np.loadtxt("mnist_train.csv", 
                        delimiter=",")
test_data = np.loadtxt("mnist_test.csv", 
                       delimiter=",")

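# column 0 of each row holds the digit label
# (np.asfarray was removed in NumPy 2.0; np.asarray with dtype=float is equivalent here)
# note: train_data and test_data keep this label column, so rows passed to knn below include it
train_labels = np.asarray(train_data[:, :1], dtype=float)
test_labels = np.asarray(test_data[:, :1], dtype=float)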

First we want to group the labels by class while keeping track of their original indices; from there we can partition everything.

In [3]:
# build one list of training-row indices per digit class (0-9)

def make_partitions(train_labels):
    value_list = [[] for _ in range(10)]
    # np.ravel flattens the (n, 1) label column so each label is a scalar
    for i, label in enumerate(np.ravel(train_labels)):
        value_list[int(label)].append(i)
        
    return value_list

In [4]:
value_list = make_partitions(train_labels)
# the partitions have different lengths, so store them as an object array
value_list = np.array(value_list, dtype=object)
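
As a sketch of an alternative, the same partitions can be built without an explicit Python loop by asking NumPy for the positions of each label. The function name `make_partitions_vectorized` and the `n_classes` parameter are illustrative assumptions, not part of the notebook above.

```python
import numpy as np

def make_partitions_vectorized(labels, n_classes=10):
    """Return a list with one index array per class (hypothetical helper)."""
    flat = np.ravel(labels).astype(int)
    # np.flatnonzero gives the positions where the condition holds
    return [np.flatnonzero(flat == c) for c in range(n_classes)]

# Tiny example with three classes
demo_labels = np.array([[2.], [0.], [1.], [0.], [2.]])
demo_parts = make_partitions_vectorized(demo_labels, n_classes=3)
print([p.tolist() for p in demo_parts])  # [[1, 3], [2], [0, 4]]
```

Each entry is a NumPy integer array rather than a Python list, so it can be indexed the same way as the lists built above.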

Now we have our partitioned data, where each partition contains the training-set indices of the points with that label. Next we need the algorithm. It draws a random vector of k positions and uses those positions to pick k training points from each partition. For each picked point it computes the cosine similarity with the point being classified, which gives one vector of similarity scores per partition.

Because cosine similarity is larger for more similar vectors, the largest score wins. One possible decision rule takes the sum and the average of each partition's scores and "re-rolls" with a new random vector until the highest-sum partition matches the highest-average partition. However, since every partition contributes the same number k of scores, the sum and the average always rank the partitions identically, so the re-roll would never trigger. The implementation below instead keeps the maximum score of each partition, and the partition with the highest maximum is the label of the new point.
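
To make the "largest value wins" rule concrete, here is a minimal sketch of cosine similarity written with plain NumPy. The helper name `cosine_sim` and the toy vectors are assumptions for illustration; the notebook itself uses `sklearn.metrics.pairwise.cosine_similarity`.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1-D vectors (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 2.0, 0.0])
same_direction = np.array([2.0, 4.0, 0.0])   # scaled copy of query
orthogonal = np.array([0.0, 0.0, 3.0])       # shares no non-zero component

print(cosine_sim(query, same_direction))  # close to 1.0
print(cosine_sim(query, orthogonal))      # 0.0
```

Higher values mean the vectors point in more similar directions, which is why the prediction goes to the partition with the largest score rather than the smallest.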

In [5]:
def knn(k, classify_point, value_list, train_data):
    
    # k random positions, reused for every partition (sampled with replacement);
    # the hard-coded upper bound 5300 must not exceed the size of the smallest partition
    rand_vec = np.random.randint(1, 5300, k)
    
    # the query point only needs to be reshaped once
    classify_point = classify_point.reshape(1, -1)
    
    # similarities[i] holds the cosine similarities between the query
    # and the points sampled from partition i
    similarities = [[] for _ in range(value_list.shape[0])]
    
    for i in range(value_list.shape[0]):
        for j in rand_vec:
            curr_indx = value_list[i][j]
            curr_point = train_data[curr_indx].reshape(1, -1)
            sim = cosine_similarity(curr_point, classify_point)
            similarities[i].append(sim[0][0])
    
    # the partition whose best sampled point is most similar to the query wins
    maximum = [max(s) for s in similarities]
    return np.argmax(maximum)
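
The double loop calls `cosine_similarity` once per sampled point. As a rough sketch, and assuming each partition holds at least one index, the same max-similarity rule can be computed with one matrix product per partition. The name `knn_vectorized`, the `rng` parameter, and drawing separate positions for each partition with `rng.integers` are my assumptions, not the notebook's method.

```python
import numpy as np

def knn_vectorized(k, classify_point, partitions, data, rng=None):
    """Predict a class as the partition holding the most similar sampled point."""
    rng = np.random.default_rng() if rng is None else rng
    query = np.asarray(classify_point, dtype=float)
    query_norm = np.linalg.norm(query)

    best = []
    for indices in partitions:
        indices = np.asarray(indices, dtype=int)
        # draw k positions (with replacement) inside this partition
        picks = indices[rng.integers(0, len(indices), size=k)]
        points = data[picks]
        # cosine similarity of every sampled row against the query at once
        norms = np.linalg.norm(points, axis=1) * query_norm
        sims = (points @ query) / np.where(norms == 0, 1.0, norms)
        best.append(sims.max())
    return int(np.argmax(best))

# Toy data: class 0 points lie near the x-axis, class 1 near the y-axis
toy_data = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.1, 3.0]])
toy_partitions = [[0, 1], [2, 3]]
print(knn_vectorized(5, np.array([0.0, 2.0]), toy_partitions, toy_data))  # 1
```

Sampling inside each partition's own length avoids the fixed upper bound, at the cost of no longer reusing one random vector across all partitions.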

In [6]:
pred = knn(10, test_data[5], value_list, train_data)
pred, test_labels[5]

(1, array([1.]))

Now we test the accuracy of the model on the full test set, starting with k = 10 and then increasing k.

In [7]:
%%time
correct = 0
for i in range(0, len(test_labels)):
    pred = knn(10, test_data[i], value_list, train_data)
    actual = test_labels[i]
    if(pred == actual):
        correct += 1
print(correct, len(test_labels))

7546 10000
CPU times: user 2min, sys: 260 ms, total: 2min
Wall time: 2min


In [8]:
%%time
correct = 0
for i in range(0, len(test_labels)):
    pred = knn(50, test_data[i], value_list, train_data)
    actual = test_labels[i]
    if(pred == actual):
        correct += 1
print(correct, len(test_labels))

8703 10000
CPU times: user 9min 56s, sys: 776 ms, total: 9min 57s
Wall time: 9min 55s


In [9]:
%%time
correct = 0
for i in range(0, len(test_labels)):
    pred = knn(100, test_data[i], value_list, train_data)
    actual = test_labels[i]
    if(pred == actual):
        correct += 1
print(correct, len(test_labels))

8980 10000
CPU times: user 19min 52s, sys: 1.26 s, total: 19min 53s
Wall time: 19min 51s


In [None]:
%%time
correct = 0
for i in range(0, len(test_labels)):
    pred = knn(500, test_data[i], value_list, train_data)
    actual = test_labels[i]
    if(pred == actual):
        correct += 1
print(correct, len(test_labels))

In [None]:
%%time
correct = 0
for i in range(0, len(test_labels)):
    pred = knn(700, test_data[i], value_list, train_data)
    actual = test_labels[i]
    if(pred == actual):
        correct += 1
print(correct, len(test_labels))
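
The evaluation cells above differ only in `k`, so the loop can be wrapped in a small helper. This is a sketch under assumptions: `evaluate` and its `predict` argument are hypothetical names, and `predict` is expected to take one data row and return a label.

```python
import numpy as np

def evaluate(predict, data, labels):
    """Return (number correct, total) for a row-wise predictor (hypothetical helper)."""
    flat_labels = np.ravel(labels).astype(int)
    correct = int(sum(int(predict(row)) == label
                      for row, label in zip(data, flat_labels)))
    return correct, len(flat_labels)

# Stand-in predictor for demonstration: label is 1 when the first feature is positive
demo_data = np.array([[1.0, 0.0], [-1.0, 2.0], [3.0, 1.0], [-2.0, 0.5]])
demo_labels = np.array([[1.], [0.], [0.], [0.]])
correct, total = evaluate(lambda row: row[0] > 0, demo_data, demo_labels)
print(correct, total, f"{correct / total:.2%}")  # 3 4 75.00%
```

With the notebook's objects, a call might look like `evaluate(lambda row: knn(50, row, value_list, train_data), test_data, test_labels)`.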

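This method is fast in the sense that each prediction compares the test point with only 10·k sampled training points, and the run time grows roughly linearly with k (about 2 minutes for k = 10, 10 minutes for k = 50, and 20 minutes for k = 100 over the 10,000 test points). However, it is not very accurate: 75.46% for k = 10, 87.03% for k = 50, and 89.80% for k = 100.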