# Analysis of Cosine-Similarity Classifier

This notebook proposes a version of K-Nearest Neighbors that achieves slightly higher accuracy than a standard K-Nearest Neighbors model, along with a classification speedup of roughly 1.5 to 2 times on batches of test points and a much larger speedup when classifying a single point.

## Introduction

#### Standard K-Nearest Neighbors
K-Nearest Neighbor (K-NN) algorithms are simple supervised classification algorithms capable of robust classification on complex datasets.  Because K-NN is simple, it is easy to implement.  It is a lazy learner: it requires no training and can go straight to classification, which makes it much faster to set up than other classification models such as SVM, regression, or a multi-layer perceptron.  K-NN is also non-parametric, so it makes no assumptions about the data.  Because the algorithm requires no training, data can be added or removed seamlessly, without any major adjustments.

Given a point $p$ to classify, a K-NN model "compares" $p$ with every point $x_i$ available to the model using some distance metric (most commonly Euclidean distance).  This process generates the unordered set $D$ of distances $d_i$ between $p$ and each stored point $x_i$.  Next, the algorithm pulls the $k$ lowest distances (or greatest similarities) from $D$ and uses either a classic or a weighted voting technique to classify $p$ as a member of some class $C$.
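The steps above can be sketched in a few lines of NumPy.  This is only a minimal illustration, assuming Euclidean distance and a classic (unweighted) vote; the function name `knn_predict` and the toy data are made up for the example and are not part of the notebook's code.

```python
import numpy as np
from collections import Counter

def knn_predict(p, X, y, k=3):
    """Classify one point p by majority vote among its k nearest stored points."""
    # D: Euclidean distance from p to every stored point x_i
    D = np.linalg.norm(X - p, axis=1)
    # indices of the k smallest distances
    nearest = np.argsort(D)[:k]
    # classic (unweighted) vote over the neighbors' labels
    return Counter(y[nearest].tolist()).most_common(1)[0][0]

# toy data (made up for illustration): two small, well-separated clusters
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
y = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(np.array([0.3, 0.3]), X, y, k=3)
pred_b = knn_predict(np.array([4.9, 5.0]), X, y, k=3)
print(pred_a, pred_b)
```

A full sort is used here for readability; only the $k$ smallest distances are actually needed.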

#### The Cosine-Similarity Classifier
The Cosine-Similarity Classifier works in the same general way as most K-NN classifiers.  The primary difference is in its name: it uses cosine similarity to compare points instead of the standard Euclidean or Manhattan distance.  Cosine similarity is given by

$$
\text{similarity} = \cos(\theta) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|}
$$

where $\vec{a}$ and $\vec{b}$ are the vectors whose similarity is returned.
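As a quick check of the formula, here is a small sketch in NumPy.  The helper name `cos_sim` and the example vectors are assumptions made for illustration; the notebook itself uses Scikit-Learn's `cosine_similarity`.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])

same_direction = cos_sim(a, 2 * a)   # parallel vectors: scaling does not matter
orthogonal = cos_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cos_sim(a, -a)            # vectors pointing in opposite directions

print(same_direction, orthogonal, opposite)
```

The value depends only on the angle between the vectors, not on their lengths.  For non-negative data such as pixel intensities, the similarity always falls between 0 and 1.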

After testing the Cosine-Similarity Classifier on the MNIST data set, I found that the classifier is both faster than and at least as accurate as the go-to K-NN model from the Scikit-Learn library.  In the analysis below, I will build the Cosine-Similarity Classifier and run it on the MNIST data set.  I will then test a go-to K-NN model from Scikit-Learn on the same data, and finally compare the accuracy and classification time of the two models in a variety of situations.  All tests were run on an Intel Core 3570K CPU (no GPU here, unfortunately).

## Analysis

Start with the required imports.

In [1]:
import numpy as np
import heapq
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity
from sklearn import datasets, model_selection

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix  

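# Note: fetch_mldata was removed in scikit-learn 0.22.  On newer versions,
# datasets.fetch_openml('mnist_784', version=1, as_frame=False) loads MNIST instead;
# its labels are strings, so cast them with .astype(float) to match the outputs below.
mnist = datasets.fetch_mldata('MNIST original')
data, target = mnist.data, mnist.target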

Let's look over the MNIST data and build several data sets from it for testing the classifiers.

In [2]:
data.shape, target.shape

((70000, 784), (70000,))

In [54]:
# make an array of indices to use, in random order, the same length of the MNIST dataset
indx = np.random.choice(len(target), 70000, replace=False)

# use the random array indx to build testing data sets
####################################################

# Data set #1

# stored_data/stored_target: largest size of all test datasets, with 60,000 stored examples
# for model to use for classification
train_img1 = [data[i] for i in indx[:60000]]
train_img1 = np.array(train_img1)
train_target1 = [target[i] for i in indx[:60000]]
train_target1 = np.array(train_target1)

# will be keeping test set the same for different stored data sets
# test_data/test_target: the smaller dataset used to test model accuracy for the data sets
test_img = [data[i] for i in indx[60000:70000]]
test_img = np.array(test_img)
test_target = [target[i] for i in indx[60000:70000]]
test_target = np.array(test_target)

train_img1.shape, train_target1.shape, test_img.shape, test_target.shape

((60000, 784), (60000,), (10000, 784), (10000,))

In [66]:
# Single random test image, this is just used for testing the speed at which each model can 
# classify a single point
t_img1 = test_img[563]
t_img1 = np.array([t_img1])
t_target1 = test_target[563]
t_target1 = np.array([t_target1])
t_target1

array([0.])

In [56]:
# Data set #2

# stored_data/stored_target: data used by model for classification of size 50,000
train_img2 = [data[i] for i in indx[:50000]]
train_img2 = np.array(train_img2)
train_target2 = [target[i] for i in indx[:50000]]
train_target2 = np.array(train_target2)

train_img2.shape, train_target2.shape

((50000, 784), (50000,))

In [57]:
# Data set #3

# stored_data/stored_target: data used by model for classification of size 40,000
train_img3 = [data[i] for i in indx[:40000]]
train_img3 = np.array(train_img3)
train_target3 = [target[i] for i in indx[:40000]]
train_target3 = np.array(train_target3)

train_img3.shape, train_target3.shape

((40000, 784), (40000,))

In [58]:
# Data set #4

# stored_data/stored_target: data used by model for classification of size 30,000
train_img4 = [data[i] for i in indx[:30000]]
train_img4 = np.array(train_img4)
train_target4 = [target[i] for i in indx[:30000]]
train_target4 = np.array(train_target4)

train_img4.shape, train_target4.shape

((30000, 784), (30000,))

In [59]:
# Data set #5

# stored_data/stored_target: data used by model for classification of size 20,000
train_img5 = [data[i] for i in indx[:20000]]
train_img5 = np.array(train_img5)
train_target5 = [target[i] for i in indx[:20000]]
train_target5 = np.array(train_target5)

train_img5.shape, train_target5.shape

((20000, 784), (20000,))

In [60]:
# Data set #6

# stored_data/stored_target: data used by model for classification of size 10,000
train_img6 = [data[i] for i in indx[:10000]]
train_img6 = np.array(train_img6)
train_target6 = [target[i] for i in indx[:10000]]
train_target6 = np.array(train_target6)

train_img6.shape, train_target6.shape

((10000, 784), (10000,))

In [61]:
# Data set #7

# stored_data/stored_target: data used by model for classification of size 1,000
train_img7 = [data[i] for i in indx[:1000]]
train_img7 = np.array(train_img7)
train_target7 = [target[i] for i in indx[:1000]]
train_target7 = np.array(train_target7)

train_img7.shape, train_target7.shape

((1000, 784), (1000,))

Great.  Now we have 7 data sets to use with the classifiers, ranging in size from 1,000 to 60,000.  We also have a testing data set of size 10,000 to measure the accuracy and speed of the classifiers, as well as a smaller testing data set of size 1 used to measure single-point classification speed.
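The seven cells above repeat the same pattern, so the subsets could also be built in one pass with NumPy fancy indexing.  The sketch below is only an illustration: it uses small random stand-in arrays instead of MNIST, sizes scaled down by a factor of 100, and a hypothetical dictionary named `stored`, none of which appear in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in arrays (assumption): random values in place of the MNIST images and labels
data = rng.random((700, 8))
target = rng.integers(0, 10, size=700).astype(float)

# shuffled indices, playing the role of indx above
indx = rng.permutation(len(target))

# the last 100 indices form the fixed test set
test_idx = indx[600:]
test_img, test_target = data[test_idx], target[test_idx]

# each stored set is a prefix of the first 600 shuffled indices
sizes = [600, 500, 400, 300, 200, 100, 10]
stored = {n: (data[indx[:n]], target[indx[:n]]) for n in sizes}

for n, (imgs, labels) in stored.items():
    print(n, imgs.shape, labels.shape)
```

Because every stored set is a prefix of the same shuffled index array, the smaller sets are subsets of the larger ones, and none of them overlaps the test set.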

Now we build the Cosine-Similarity Classifier.  The function takes the `test_target` argument only to calculate prediction accuracy; it is not needed for classification itself.

In [13]:
def cos_knn(k, test_data, test_target, stored_data, stored_target):
    """k: number of top most similar values to vote on
    test_data: a set of unobserved images to classify
    test_target: the labels for the test_data (for calculating accuracy)
    stored_data: the images already observed and available to the model
    stored_target: labels for stored_data
    """
    
    # find similarity for every point in test_data between every other point in stored_data
    cosim = cosine_similarity(test_data, stored_data)
    
    # for each test point, get the indices of the k+1 most similar images in stored_data
    top = [(heapq.nlargest((k+1), range(len(i)), i.take)) for i in cosim]
    # convert indices to class labels, keeping only the top k
    top = [[stored_target[j] for j in i[:k]] for i in top]
    
    # majority vote to get a prediction for every image in test_data
    pred = [max(set(i), key=i.count) for i in top]
    pred = np.array(pred)
    
    # print table giving classifier accuracy using test_target
    print(classification_report(test_target, pred))
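For comparison, the same idea can be sketched with NumPy alone, using `np.argpartition` to select the top $k$ similarities per row instead of `heapq`.  This is a hypothetical variant, not the function timed in this notebook: the name `cos_knn_predict` is made up, it returns predictions rather than printing a report, it assumes no image is an all-zero vector, and it breaks voting ties by picking the smallest label.

```python
import numpy as np

def cos_knn_predict(k, test_data, stored_data, stored_target):
    """Return predicted labels only (hypothetical helper, not the notebook's cos_knn)."""
    # row-normalize so that a plain matrix product yields cosine similarities
    # (assumes every row has a non-zero norm)
    a = test_data / np.linalg.norm(test_data, axis=1, keepdims=True)
    b = stored_data / np.linalg.norm(stored_data, axis=1, keepdims=True)
    cosim = a @ b.T                      # shape (n_test, n_stored)

    # indices of the k largest similarities per row (unordered within the top k)
    top = np.argpartition(-cosim, k - 1, axis=1)[:, :k]
    votes = stored_target[top]           # shape (n_test, k)

    # majority vote per row; ties go to the smallest label
    preds = []
    for row in votes:
        labels, counts = np.unique(row, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# toy data (made up for illustration)
stored_data = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
stored_target = np.array([0.0, 0.0, 1.0, 1.0])
test_data = np.array([[1.0, 0.05], [0.05, 1.0]])

pred_k1 = cos_knn_predict(1, test_data, stored_data, stored_target)
pred_k3 = cos_knn_predict(3, test_data, stored_data, stored_target)
print(pred_k1, pred_k3)
```

Whether this is faster than the `heapq` version on the full MNIST data has not been measured here.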

Now let's look at the Scikit-Learn K-NN method.  I will wrap the entire Scikit-Learn K-NN classifier in a function so it can be called and tested with greater ease.

With the Scikit-Learn K-NN algorithm, there are three arguments of `KNeighborsClassifier()` to consider.  The first is `n_neighbors`, the number of neighbors to use for classification.  The second is `weights`, which we leave at its default value of `uniform`, as that is the same voting method used in the Cosine-Similarity Classifier.  The third is `algorithm`, which we also leave at its default value of `auto`, so that Scikit-Learn attempts to choose the most appropriate algorithm for the given data.

In [14]:
def skl_knn(k, test_data, test_target, stored_data, stored_target):
    """k: number of neighbors to use in classification
    test_data: the data used to test the classifier
    test_target: the labels for test_data (for calculating accuracy)
    stored_data: the data used to classify the test_data
    stored_target: labels for stored_data
    """
    
    classifier = KNeighborsClassifier(n_neighbors=k)  
    classifier.fit(stored_data, stored_target)

    y_pred = classifier.predict(test_data) 

    print(classification_report(test_target, y_pred))

That is all there is to it.  Now we test how each model does on different data sets.

Below, we test the Scikit-Learn and Cosine-Similarity K-NN classifiers on each of the seven data sets using the standard test data, and then on the single-point test data at the end to measure the speed of single-point classification for each model.

For each data set/model pair, we measure classification accuracy and the time taken to classify `test_data`.  A $k$ value of 5 is used for the Scikit-Learn model and a $k$ value of 3 for the Cosine-Similarity model.  These values performed best across the many tests I ran, but other values of $k$ may well do better, so they should not be taken as absolutely optimal.  Here goes...

In [15]:
%%time
cos_knn(3, test_img, test_target, train_img1, train_target1)

             precision    recall  f1-score   support

        0.0       0.98      1.00      0.99       941
        1.0       0.98      1.00      0.99      1137
        2.0       0.99      0.97      0.98      1006
        3.0       0.98      0.97      0.98      1040
        4.0       0.98      0.97      0.98       931
        5.0       0.98      0.97      0.97       911
        6.0       0.98      0.99      0.99       994
        7.0       0.97      0.97      0.97      1036
        8.0       0.97      0.96      0.96       966
        9.0       0.95      0.97      0.96      1038

avg / total       0.98      0.98      0.98     10000

CPU times: user 5min 45s, sys: 1.16 s, total: 5min 46s
Wall time: 5min 23s


In [16]:
%%time
skl_knn(5, test_img, test_target, train_img1, train_target1)

             precision    recall  f1-score   support

        0.0       0.98      0.99      0.99       941
        1.0       0.96      0.99      0.98      1137
        2.0       0.98      0.97      0.97      1006
        3.0       0.97      0.97      0.97      1040
        4.0       0.97      0.97      0.97       931
        5.0       0.96      0.97      0.97       911
        6.0       0.98      0.99      0.99       994
        7.0       0.96      0.97      0.97      1036
        8.0       0.99      0.93      0.96       966
        9.0       0.96      0.96      0.96      1038

avg / total       0.97      0.97      0.97     10000

CPU times: user 9min 5s, sys: 280 ms, total: 9min 5s
Wall time: 9min 5s


In [17]:
%%time
cos_knn(3, test_img, test_target, train_img2, train_target2)

             precision    recall  f1-score   support

        0.0       0.97      1.00      0.99       941
        1.0       0.97      1.00      0.98      1137
        2.0       0.99      0.98      0.98      1006
        3.0       0.99      0.97      0.98      1040
        4.0       0.98      0.97      0.98       931
        5.0       0.98      0.96      0.97       911
        6.0       0.98      0.99      0.99       994
        7.0       0.97      0.97      0.97      1036
        8.0       0.96      0.95      0.96       966
        9.0       0.95      0.97      0.96      1038

avg / total       0.98      0.98      0.98     10000

CPU times: user 4min 52s, sys: 1.03 s, total: 4min 53s
Wall time: 4min 32s


In [18]:
%%time
skl_knn(5, test_img, test_target, train_img2, train_target2)

             precision    recall  f1-score   support

        0.0       0.98      0.99      0.98       941
        1.0       0.96      0.99      0.98      1137
        2.0       0.98      0.97      0.97      1006
        3.0       0.97      0.97      0.97      1040
        4.0       0.98      0.97      0.97       931
        5.0       0.96      0.97      0.97       911
        6.0       0.98      0.99      0.98       994
        7.0       0.96      0.97      0.97      1036
        8.0       0.99      0.92      0.96       966
        9.0       0.96      0.96      0.96      1038

avg / total       0.97      0.97      0.97     10000

CPU times: user 8min 1s, sys: 264 ms, total: 8min 2s
Wall time: 8min 2s


In [19]:
%%time
cos_knn(3, test_img, test_target, train_img3, train_target3)

             precision    recall  f1-score   support

        0.0       0.97      1.00      0.99       941
        1.0       0.97      0.99      0.98      1137
        2.0       0.98      0.97      0.98      1006
        3.0       0.99      0.96      0.97      1040
        4.0       0.98      0.97      0.98       931
        5.0       0.98      0.97      0.97       911
        6.0       0.99      0.99      0.99       994
        7.0       0.97      0.97      0.97      1036
        8.0       0.96      0.96      0.96       966
        9.0       0.95      0.97      0.96      1038

avg / total       0.97      0.97      0.97     10000

CPU times: user 3min 52s, sys: 864 ms, total: 3min 53s
Wall time: 3min 37s


In [20]:
%%time
skl_knn(5, test_img, test_target, train_img3, train_target3)

             precision    recall  f1-score   support

        0.0       0.97      0.99      0.98       941
        1.0       0.95      0.99      0.97      1137
        2.0       0.98      0.96      0.97      1006
        3.0       0.97      0.97      0.97      1040
        4.0       0.97      0.96      0.97       931
        5.0       0.95      0.97      0.96       911
        6.0       0.98      0.99      0.98       994
        7.0       0.96      0.97      0.96      1036
        8.0       0.99      0.92      0.95       966
        9.0       0.95      0.95      0.95      1038

avg / total       0.97      0.97      0.97     10000

CPU times: user 6min 52s, sys: 384 ms, total: 6min 52s
Wall time: 6min 52s


In [21]:
%%time
cos_knn(3, test_img, test_target, train_img4, train_target4)

             precision    recall  f1-score   support

        0.0       0.97      1.00      0.98       941
        1.0       0.97      0.99      0.98      1137
        2.0       0.98      0.97      0.98      1006
        3.0       0.98      0.96      0.97      1040
        4.0       0.98      0.96      0.97       931
        5.0       0.98      0.96      0.97       911
        6.0       0.99      0.99      0.99       994
        7.0       0.97      0.96      0.97      1036
        8.0       0.96      0.96      0.96       966
        9.0       0.94      0.96      0.95      1038

avg / total       0.97      0.97      0.97     10000

CPU times: user 2min 55s, sys: 592 ms, total: 2min 55s
Wall time: 2min 43s


In [22]:
%%time
skl_knn(5, test_img, test_target, train_img4, train_target4)

             precision    recall  f1-score   support

        0.0       0.97      0.99      0.98       941
        1.0       0.95      0.99      0.97      1137
        2.0       0.98      0.96      0.97      1006
        3.0       0.96      0.96      0.96      1040
        4.0       0.97      0.95      0.96       931
        5.0       0.96      0.96      0.96       911
        6.0       0.98      0.99      0.98       994
        7.0       0.96      0.97      0.96      1036
        8.0       0.99      0.91      0.95       966
        9.0       0.94      0.95      0.95      1038

avg / total       0.96      0.96      0.96     10000

CPU times: user 4min 31s, sys: 60.1 ms, total: 4min 32s
Wall time: 4min 31s


In [23]:
%%time
cos_knn(3, test_img, test_target, train_img5, train_target5)

             precision    recall  f1-score   support

        0.0       0.96      1.00      0.98       941
        1.0       0.97      0.99      0.98      1137
        2.0       0.98      0.97      0.97      1006
        3.0       0.98      0.95      0.96      1040
        4.0       0.98      0.95      0.97       931
        5.0       0.98      0.95      0.97       911
        6.0       0.98      0.98      0.98       994
        7.0       0.97      0.95      0.96      1036
        8.0       0.96      0.96      0.96       966
        9.0       0.92      0.96      0.94      1038

avg / total       0.97      0.97      0.97     10000

CPU times: user 1min 59s, sys: 687 ms, total: 2min
Wall time: 1min 53s


In [24]:
%%time
skl_knn(5, test_img, test_target, train_img5, train_target5)

             precision    recall  f1-score   support

        0.0       0.97      0.99      0.98       941
        1.0       0.93      0.99      0.96      1137
        2.0       0.98      0.95      0.96      1006
        3.0       0.96      0.96      0.96      1040
        4.0       0.96      0.94      0.95       931
        5.0       0.95      0.96      0.95       911
        6.0       0.98      0.98      0.98       994
        7.0       0.94      0.96      0.95      1036
        8.0       0.98      0.90      0.94       966
        9.0       0.93      0.95      0.94      1038

avg / total       0.96      0.96      0.96     10000

CPU times: user 3min 39s, sys: 440 ms, total: 3min 39s
Wall time: 3min 40s


In [25]:
%%time
cos_knn(3, test_img, test_target, train_img6, train_target6)

             precision    recall  f1-score   support

        0.0       0.95      1.00      0.97       941
        1.0       0.96      0.99      0.97      1137
        2.0       0.98      0.96      0.97      1006
        3.0       0.97      0.94      0.95      1040
        4.0       0.97      0.92      0.95       931
        5.0       0.97      0.93      0.95       911
        6.0       0.98      0.98      0.98       994
        7.0       0.96      0.94      0.95      1036
        8.0       0.94      0.94      0.94       966
        9.0       0.89      0.95      0.92      1038

avg / total       0.96      0.96      0.96     10000

CPU times: user 1min, sys: 308 ms, total: 1min
Wall time: 57.4 s


In [26]:
%%time
skl_knn(5, test_img, test_target, train_img6, train_target6)

             precision    recall  f1-score   support

        0.0       0.97      0.99      0.98       941
        1.0       0.91      0.99      0.95      1137
        2.0       0.97      0.93      0.95      1006
        3.0       0.95      0.95      0.95      1040
        4.0       0.96      0.93      0.94       931
        5.0       0.94      0.95      0.94       911
        6.0       0.97      0.98      0.97       994
        7.0       0.94      0.95      0.94      1036
        8.0       0.98      0.86      0.92       966
        9.0       0.91      0.93      0.92      1038

avg / total       0.95      0.95      0.95     10000

CPU times: user 1min 49s, sys: 348 ms, total: 1min 49s
Wall time: 1min 50s


In [27]:
%%time
cos_knn(3, test_img, test_target, train_img7, train_target7)

             precision    recall  f1-score   support

        0.0       0.85      0.99      0.91       941
        1.0       0.90      0.99      0.94      1137
        2.0       0.97      0.88      0.92      1006
        3.0       0.93      0.86      0.89      1040
        4.0       0.95      0.80      0.87       931
        5.0       0.93      0.84      0.88       911
        6.0       0.96      0.94      0.95       994
        7.0       0.95      0.85      0.90      1036
        8.0       0.83      0.88      0.85       966
        9.0       0.76      0.91      0.83      1038

avg / total       0.90      0.89      0.90     10000

CPU times: user 6.87 s, sys: 104 ms, total: 6.97 s
Wall time: 6.09 s


In [28]:
%%time
skl_knn(5, test_img, test_target, train_img7, train_target7)

             precision    recall  f1-score   support

        0.0       0.92      0.96      0.94       941
        1.0       0.75      0.99      0.86      1137
        2.0       0.95      0.82      0.88      1006
        3.0       0.89      0.84      0.87      1040
        4.0       0.87      0.86      0.87       931
        5.0       0.85      0.88      0.87       911
        6.0       0.94      0.95      0.95       994
        7.0       0.87      0.85      0.86      1036
        8.0       0.96      0.71      0.82       966
        9.0       0.81      0.84      0.82      1038

avg / total       0.88      0.87      0.87     10000

CPU times: user 11.6 s, sys: 8.05 ms, total: 11.6 s
Wall time: 11.7 s


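A possible follow-up is to test on even less stored data, since the accuracy of the Scikit-Learn model seems to drop off faster as the stored data set shrinks.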

The test results show the following:
* In terms of accuracy, the Cosine-Similarity Classifier either matches the Scikit-Learn K-NN or beats it by 1 to 3 percentage points in the averaged scores, with the largest gap on the smallest stored data set.
* In terms of classification speed, the Cosine-Similarity Classifier tends to be between 1.5 and 2 times faster than the Scikit-Learn K-NN.
* Strangely, the Cosine-Similarity Classifier tends to underperform when classifying the digit 9.  This could be because 4 and 9 look so similar that they muddle the similarity metric.
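The speedup figure can be checked directly from the wall times reported by `%%time` above.  The snippet below simply copies those times (converted to seconds) and divides them; the dictionary names `wall` and `speedup` are made up for this example.

```python
# wall times in seconds, copied from the %%time outputs above:
# stored set size -> (cos_knn wall time, skl_knn wall time)
wall = {
    60000: (5 * 60 + 23, 9 * 60 + 5),
    50000: (4 * 60 + 32, 8 * 60 + 2),
    40000: (3 * 60 + 37, 6 * 60 + 52),
    30000: (2 * 60 + 43, 4 * 60 + 31),
    20000: (1 * 60 + 53, 3 * 60 + 40),
    10000: (57.4, 1 * 60 + 50),
    1000: (6.09, 11.7),
}

# how many times faster cos_knn was than skl_knn for each stored set size
speedup = {n: skl / cos for n, (cos, skl) in wall.items()}

for n, s in speedup.items():
    print(f"{n:>6} stored examples: {s:.2f}x")
```

Every ratio falls between 1.5 and 2.  Note that both timings include the call to `classification_report`, and the `skl_knn` timing also includes `fit`.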

Now we test the classification speed of the two models on the single-point test set.

In [67]:
%%time
cos_knn(3, t_img1, t_target1, train_img1, train_target1)

             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00         1

avg / total       1.00      1.00      1.00         1

CPU times: user 573 ms, sys: 244 ms, total: 817 ms
Wall time: 590 ms


In [68]:
%%time
skl_knn(5, t_img1, t_target1, train_img1, train_target1)

             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00         1

avg / total       1.00      1.00      1.00         1

CPU times: user 25.9 s, sys: 144 ms, total: 26 s
Wall time: 26 s


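As the timings show, the Cosine-Similarity Classifier is significantly faster than the Scikit-Learn classifier at classifying a single point against the largest stored data set: about 0.59 s versus 26 s of wall time.  Note that `skl_knn` fits the model on every call, so its time includes `fit` as well as prediction.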


## Room For Improvement
Below are some ways the Cosine-Similarity model could be improved.
* Use a tree-based index for faster classification
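One possible route to a tree, offered here only as a sketch: for unit-length vectors, $\|\vec{a} - \vec{b}\|^2 = 2 - 2\cos(\theta)$, so ranking neighbors by Euclidean distance on L2-normalized vectors gives the same order as ranking by cosine similarity.  That would let a Euclidean tree index be applied to normalized images.  The code below only checks the identity on random stand-in data (not MNIST); whether a tree actually helps in 784 dimensions is untested here.

```python
import numpy as np

rng = np.random.default_rng(0)
stored = rng.random((50, 8))   # stand-in data (assumption), not MNIST
query = rng.random(8)

# L2-normalize everything to unit length
stored_u = stored / np.linalg.norm(stored, axis=1, keepdims=True)
query_u = query / np.linalg.norm(query)

cos = stored_u @ query_u                            # cosine similarities
dist = np.linalg.norm(stored_u - query_u, axis=1)   # Euclidean distances

# identity for unit vectors: dist^2 == 2 - 2*cos
identity_holds = bool(np.allclose(dist ** 2, 2 - 2 * cos))

# so the 5 nearest by distance are the 5 most similar by cosine
nearest_by_dist = set(np.argsort(dist)[:5].tolist())
nearest_by_cos = set(np.argsort(-cos)[:5].tolist())

print(identity_holds, nearest_by_dist == nearest_by_cos)
```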


## Lessons Learned & Moving Forward
Below are the main takeaways from the project.
* Sometimes the best models to use are the simplest.
* In these tests, cosine similarity proved to be both an accurate and an efficient similarity metric.