## Continuing Cosine Similarity Model Improvement
In this notebook I continue the work from the latest version of the model in the last notebook.

Here I will be focusing on:
* Further optimization
* Seeing how the model performs on less data

In [1]:
import numpy as np
from tqdm import tqdm
import cupy as cp
import heapq
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity
from sklearn import datasets, model_selection
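# fetch_mldata was removed in scikit-learn 0.22; on newer versions,
# datasets.fetch_openml('mnist_784', version=1, as_frame=False) is the replacement
# (note that it returns the labels as strings).
mnist = datasets.fetch_mldata('MNIST original')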

data, target = mnist.data, mnist.target

Below is the latest version of the algorithm, version 4, which is further optimized and simplified from version 3 in the last notebook.

In [2]:
def classif(comparisons, n):
    """comparisons: the number of numbers to test
    n: number of most similar images that vote on the label
    returns: number of correct predictions, total predictions, and percentage accuracy
    """
    
    # find random images to test
    random_indx = np.random.choice(len(target), comparisons, replace=False)
    X = [data[i] for i in random_indx]
    
    # comparisons x size data structure
    cosim = cosine_similarity(X, data)
    
    # get top most similar image indices, excluding most similar as that is the tested image
    top = [(heapq.nlargest((n+1), range(len(i)), i.take)) for i in cosim]
    top = [[target[j] for j in i[1:(n+1)]] for i in top]
    
    # given top most similar, vote on what input is
    pred = [max(set(i), key=i.count) for i in top]
    pred = np.array(pred)
    
    # use target labels to check accuracy
    correct_classification = [target[i] for i in random_indx]
    correct_classification = np.array(correct_classification)
    
    correct = np.count_nonzero(pred == correct_classification)
    total = len(correct_classification)
            
    acc = (correct / total) * 100
    
    return correct, total, acc

In [3]:
%%time
classif(10000, 3)

CPU times: user 6min 32s, sys: 1.76 s, total: 6min 34s
Wall time: 6min 6s


(9768, 10000, 97.68)

Great, now the algorithm is much more optimized than it was in version 1, and it scales to larger numbers of comparisons far better than any previous version.  It is still by no means perfect, but I will leave it for now and keep experimenting.
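One possible further optimization, offered as a sketch rather than something this notebook ran: the per-row `heapq.nlargest` call could be swapped for NumPy's `np.argpartition`, which selects the top `n + 1` similarities for every row at once. The helper name `top_n_vote` is hypothetical. It assumes the labels are non-negative integers (they are cast with `astype(int)`), and ties in the vote go to the smallest label, which may differ from what `max(set(i), key=i.count)` picks.

```python
import numpy as np

def top_n_vote(cosim, labels, n):
    """Majority vote over the n most similar columns, skipping the single best match.

    cosim:  (queries, dataset) similarity matrix
    labels: (dataset,) class labels, castable to non-negative ints
    """
    labels = np.asarray(labels).astype(int)
    # the last n+1 columns of the partition hold the n+1 largest similarities (unordered)
    cand = np.argpartition(cosim, -(n + 1), axis=1)[:, -(n + 1):]
    # order only those n+1 candidates by similarity, best first
    cand_sim = np.take_along_axis(cosim, cand, axis=1)
    order = np.argsort(-cand_sim, axis=1)
    cand = np.take_along_axis(cand, order, axis=1)
    # drop the best match (assumed to be the query image itself), then look up labels
    votes = labels[cand[:, 1:]]
    # majority vote per query; ties go to the smallest label
    return np.array([np.bincount(row).argmax() for row in votes])
```

The result could stand in for `pred` inside `classif`, with `cosim` and `target` passed in unchanged. Whether it is actually faster on the full dataset is untested here.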

Next, I will build smaller datasets out of the original by pulling images at random, and watch what happens to accuracy as the amount of data shrinks.  The subsets step down by 10,000 images at a time from 60,000 to 10,000, followed by 5,000, 1,000, and 200.  I first draw one random index array, and every smaller subset takes a prefix of that array, so each smaller dataset is contained in the larger ones and the data tested on stays relatively constant.
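The same prefix idea can be written as one loop instead of one cell per size. This is only a sketch: the helper `nested_subsets`, its `seed` argument, and the stand-in arrays in the demo are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

def nested_subsets(data, target, sizes, seed=0):
    """Return {size: (images, labels)}; every subset is a prefix of one random draw."""
    rng = np.random.default_rng(seed)
    # one random draw of the largest size; smaller subsets reuse its first entries
    base_indx = rng.choice(len(target), max(sizes), replace=False)
    subsets = {}
    for size in sizes:
        indx = base_indx[:size]
        # fancy indexing copies the selected rows into new arrays
        subsets[size] = (data[indx], target[indx])
    return subsets

# small stand-in arrays so the sketch runs without downloading MNIST
demo_data = np.arange(100 * 4).reshape(100, 4)
demo_target = np.arange(100) % 10
demo = nested_subsets(demo_data, demo_target, [60, 30, 10])
print({size: imgs.shape for size, (imgs, labels) in demo.items()})
```

With the real arrays, `nested_subsets(data, target, [60000, 50000, 40000])` would play the role of the `sixty_*`, `fifty_*`, and `fourty_*` variables below.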

In [98]:
data.shape

(70000, 784)

In [6]:
sixty_indx = np.random.choice(len(target), 60000, replace=False)
sixty_img = [data[i] for i in sixty_indx]
sixty_img = np.array(sixty_img)
sixty_target = [target[i] for i in sixty_indx]
sixty_target = np.array(sixty_target)
sixty_img.shape, sixty_target.shape

((60000, 784), (60000,))

In [7]:
fifty_indx = sixty_indx[:50000]
fifty_img = [data[i] for i in fifty_indx]
fifty_img = np.array(fifty_img)
fifty_target = [target[i] for i in fifty_indx]
fifty_target = np.array(fifty_target)
fifty_img.shape, fifty_target.shape

((50000, 784), (50000,))

In [8]:
fourty_indx = sixty_indx[:40000]
fourty_img = [data[i] for i in fourty_indx]
fourty_img = np.array(fourty_img)
fourty_target = [target[i] for i in fourty_indx]
fourty_target = np.array(fourty_target)
fourty_img.shape, fourty_target.shape

((40000, 784), (40000,))

In [9]:
thirty_indx = sixty_indx[:30000]
thirty_img = [data[i] for i in thirty_indx]
thirty_img = np.array(thirty_img)
thirty_target = [target[i] for i in thirty_indx]
thirty_target = np.array(thirty_target)
thirty_img.shape, thirty_target.shape

((30000, 784), (30000,))

In [10]:
twenty_indx = sixty_indx[:20000]
twenty_img = [data[i] for i in twenty_indx]
twenty_img = np.array(twenty_img)
twenty_target = [target[i] for i in twenty_indx]
twenty_target = np.array(twenty_target)
twenty_img.shape, twenty_target.shape

((20000, 784), (20000,))

In [11]:
ten_indx = sixty_indx[:10000]
ten_img = [data[i] for i in ten_indx]
ten_img = np.array(ten_img)
ten_target = [target[i] for i in ten_indx]
ten_target = np.array(ten_target)
ten_img.shape, ten_target.shape

((10000, 784), (10000,))

In [25]:
five_indx = sixty_indx[:5000]
five_img = [data[i] for i in five_indx]
five_img = np.array(five_img)
five_target = [target[i] for i in five_indx]
five_target = np.array(five_target)
five_img.shape, five_target.shape

((5000, 784), (5000,))

In [26]:
one_indx = sixty_indx[:1000]
one_img = [data[i] for i in one_indx]
one_img = np.array(one_img)
one_target = [target[i] for i in one_indx]
one_target = np.array(one_target)
one_img.shape, one_target.shape

((1000, 784), (1000,))

In [29]:
twohun_indx = sixty_indx[:200]
twohun_img = [data[i] for i in twohun_indx]
twohun_img = np.array(twohun_img)
twohun_target = [target[i] for i in twohun_indx]
twohun_target = np.array(twohun_target)
twohun_img.shape, twohun_target.shape

((200, 784), (200,))

Now I alter the function so it takes the dataset and labels as arguments, then test it on the dataset sizes built above.

In [13]:
def data_classif(comparisons, n, dataset, targetset):
    """comparisons: the number of numbers to test
    n: number of most similar images that vote on the label
    dataset: images to draw test samples from and compare against
    targetset: labels for dataset, in the same order
    returns: number of correct predictions, total predictions, and percentage accuracy
    """
    
    # find random images to test
    random_indx = np.random.choice(len(dataset), comparisons, replace=False)
    X = [dataset[i] for i in random_indx]
    
    # comparisons x size data structure
    cosim = cosine_similarity(X, dataset)
    
    # get top most similar image indices, excluding most similar as that is the tested image
    top = [(heapq.nlargest((n+1), range(len(i)), i.take)) for i in cosim]
    top = [[targetset[j] for j in i[1:(n+1)]] for i in top]
    
    # given top most similar, vote on what input is
    pred = [max(set(i), key=i.count) for i in top]
    pred = np.array(pred)
    
    # use target labels to check accuracy
    correct_classification = [targetset[i] for i in random_indx]
    correct_classification = np.array(correct_classification)
    
    correct = np.count_nonzero(pred == correct_classification)
    total = len(correct_classification)
            
    acc = (correct / total) * 100
    
    return correct, total, acc

In [None]:
%%time
data_classif(1000, 3, data, target)

In [19]:
%%time
data_classif(1000, 3, sixty_img, sixty_target)

CPU times: user 40.4 s, sys: 456 ms, total: 40.8 s
Wall time: 37.8 s


(978, 1000, 97.8)

In [20]:
%%time
data_classif(1000, 3, fifty_img, fifty_target)

CPU times: user 34 s, sys: 340 ms, total: 34.3 s
Wall time: 31.9 s


(975, 1000, 97.5)

In [21]:
%%time
data_classif(1000, 3, fourty_img, fourty_target)

CPU times: user 27.1 s, sys: 304 ms, total: 27.4 s
Wall time: 25.2 s


(980, 1000, 98.0)

In [22]:
%%time
data_classif(1000, 3, thirty_img, thirty_target)

CPU times: user 20.5 s, sys: 240 ms, total: 20.7 s
Wall time: 19 s


(960, 1000, 96.0)

In [23]:
%%time
data_classif(1000, 3, twenty_img, twenty_target)

CPU times: user 13.7 s, sys: 164 ms, total: 13.9 s
Wall time: 12.6 s


(956, 1000, 95.6)

In [24]:
%%time
data_classif(1000, 3, ten_img, ten_target)

CPU times: user 7.19 s, sys: 100 ms, total: 7.29 s
Wall time: 6.34 s


(957, 1000, 95.7)

In [27]:
%%time
data_classif(1000, 3, five_img, five_target)

CPU times: user 4 s, sys: 79.8 ms, total: 4.08 s
Wall time: 3.32 s


(939, 1000, 93.89999999999999)

In [28]:
%%time
data_classif(1000, 3, one_img, one_target)

CPU times: user 1.12 s, sys: 47.8 ms, total: 1.17 s
Wall time: 675 ms


(892, 1000, 89.2)

In [30]:
%%time
data_classif(200, 3, twohun_img, twohun_target)

CPU times: user 170 ms, sys: 16.1 ms, total: 186 ms
Wall time: 66 ms


(151, 200, 75.5)

As you can see, accuracy falls as the dataset shrinks, but by much less than I expected: only about 2 percentage points (97.68% to 95.7%) when the available data drops from 70,000 to 10,000 images.  The steps in between are not perfectly monotonic (40,000 images scored 98.0%), because each run tests a different random sample of 1,000 images.  What is more remarkable is that with a dataset of only 1,000 images, accuracy is still around 89%.  The funniest part is that the simple MNIST CNN (see notebook) only reaches 85% accuracy on about 55,000 examples of the MNIST dataset, which is worse than this simple similarity model on just 1,000 samples.  Granted, the MNIST CNN is very simple and could be improved in a ton of ways, and the data it was trained and validated on has a slightly different composition than the scikit-learn dataset I used here. Still, the CNN acts as a solid benchmark.
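To get a feel for how much of the run-to-run variation is sampling noise, here is a rough sketch that treats each test image as an independent trial and computes the binomial standard error of the accuracy. That independence is an assumption (the test images are drawn from the same set they are compared against), and the helper name `accuracy_std_error` is hypothetical. The counts are the ones reported in the cells above.

```python
import math

def accuracy_std_error(correct, total):
    """Binomial standard error of an accuracy estimate, in percentage points."""
    p = correct / total
    return 100 * math.sqrt(p * (1 - p) / total)

# (dataset size, correct, total) taken from the outputs above
runs = [(60000, 978, 1000), (10000, 957, 1000), (1000, 892, 1000)]
for size, correct, total in runs:
    acc = 100 * correct / total
    print(f"{size:>6} images: {acc:.1f}% +/- {accuracy_std_error(correct, total):.2f}")
```

Under that assumption, a 97.8% result on 1,000 tests carries an uncertainty of roughly half a percentage point, so the small differences among the 40,000 to 60,000 runs are within noise, while the drop at 1,000 and 200 images is not.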

That just about wraps up the analysis I will be doing here.  Note that while this technique worked very well on the MNIST dataset, it is highly unlikely to work as well on more complicated image datasets. It is always worth a shot, though, so I will be trying that out later.

Improvements and things to think about:
* Weighted voting
* Max pooling
* Put all the datasets into one tensor
* The algorithm seems to be similar in some ways to k-means clustering and k-nearest neighbors
* These results are certainly not state of the art, but they are interesting and provide several lessons
* Most CNNs are far better than the one I used and get higher testing accuracy, and are thus better than this "classifier".  However, they don't work as well on so little data, and they are far heavier.  This is also a "lazy classifier" in that no training is required
* It turns out this model is k-nearest neighbors.  Same idea: take a distance metric, then classify the point by voting among the most similar points or vectors
* Could try cross-validation to find the best n parameter
* Supposedly humans get about 97.5% accuracy on MNIST
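As a rough idea of what the weighted voting item could look like, the sketch below lets each neighbor vote with its cosine similarity instead of a flat count of one. The function name `weighted_vote`, the `num_classes=10` default, and the choice of similarity as the weight are all assumptions; this has not been run on MNIST here.

```python
import numpy as np

def weighted_vote(neighbor_labels, neighbor_sims, num_classes=10):
    """For each query, pick the class whose neighbors have the largest total similarity.

    neighbor_labels: (queries, n) labels of the n nearest neighbors
    neighbor_sims:   (queries, n) cosine similarity of each of those neighbors
    """
    neighbor_labels = np.asarray(neighbor_labels).astype(int)
    neighbor_sims = np.asarray(neighbor_sims, dtype=float)
    scores = np.zeros((neighbor_labels.shape[0], num_classes))
    rows = np.arange(neighbor_labels.shape[0])[:, None]
    # np.add.at accumulates correctly when a class appears more than once in a row
    np.add.at(scores, (rows, neighbor_labels), neighbor_sims)
    return scores.argmax(axis=1)

# one very close "3" outweighs two distant "7"s; a plain majority vote would say 7
print(weighted_vote([[3, 7, 7]], [[0.99, 0.40, 0.40]]))
```

To plug this into `data_classif`, the similarities of the chosen neighbors would need to be kept alongside their labels, since the current code discards them after `heapq.nlargest`.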

Some PCA optimization

In [9]:
def svd_pca(data, k):
    """Reduce DATA using its K principal components."""
    data = data.astype("float64")
    data -= np.mean(data, axis=0)
    U, S, V = np.linalg.svd(data, full_matrices=False)
    return U[:,:k].dot(np.diag(S)[:k,:k])

In [24]:
dfg = svd_pca(data, 784)

In [25]:
dfg.shape

(70000, 784)

In [26]:
%%time
data_classif(1000, 3, dfg, target)

CPU times: user 43.6 s, sys: 441 ms, total: 44 s
Wall time: 40.6 s


(977, 1000, 97.7)

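Doesn't really make much of a difference in speed or accuracy, even when shrinking the data to a much lower dimension.  (The run shown above keeps all 784 components, so it only centers and rotates the data.)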
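One way to pick a smaller `k` for `svd_pca` is to look at how much variance the leading components explain. The sketch below is an illustration only: the helper `components_for_variance` and the 0.95 threshold are assumptions, and it has not been run on the MNIST arrays in this notebook.

```python
import numpy as np

def components_for_variance(data, threshold=0.95):
    """Smallest number of principal components whose cumulative variance reaches threshold."""
    data = np.asarray(data, dtype="float64")
    centered = data - data.mean(axis=0)
    # only the singular values are needed, so skip computing U and V
    S = np.linalg.svd(centered, compute_uv=False)
    explained = np.cumsum(S ** 2) / np.sum(S ** 2)
    # first position where the cumulative fraction reaches the threshold
    k = int(np.searchsorted(explained, threshold)) + 1
    return min(k, len(S))

# toy data: almost all variance lies along the first column
rng = np.random.default_rng(0)
toy = np.hstack([10 * rng.normal(size=(200, 1)), 0.01 * rng.normal(size=(200, 2))])
print(components_for_variance(toy))
```

The returned value could then be passed as `k` to `svd_pca` before calling `data_classif` on the reduced data.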