# First Exercise for Pattern Recognition

**Goal:** Implement a KNN (K nearest neighbour) classification algorithm from scratch

## Overview of the approach
1. Read the data from the MNIST Dataset
2. Write KNN Classification Algorithm
4. Run the classification function over the train set to parametrise the mode
5. Test the outcome with the test set

Import needed packages

In [1]:
import numpy
from math import sqrt, pow
import time
from scipy import spatial


Define Reading function for file paths

In [2]:
def readMnistData(filePath):
    return numpy.genfromtxt(filePath, delimiter=",")

In [3]:
def getClassification(mnistDataEntry: numpy.ndarray):
    return mnistDataEntry[0]

def getMnistData(mnistDataEntry: numpy.ndarray):
    return numpy.delete(mnistDataEntry, [0])

Define finding the euclidean distance

In [4]:
def euclideanDistance(vectorA: numpy.ndarray, vectorB: numpy.ndarray):
    return numpy.sum(numpy.sqrt(numpy.power(vectorA - vectorB, 2)))


Read the train file

In [5]:
train = readMnistData("train.csv")
trainClassification = train[:, 0]
trainData = train[:, 1:]

testDataSet = readMnistData("test.csv")
testData = testDataSet[:,1:]
testClassification = testDataSet[:, 0]
print(testClassification.max())

9.0


First try to brute force one classification

## Brute force Iterative
Here follows the first try with just a brute force approach with no optimizations. The programm runs extremely slow.

In [26]:
def bruteforce1nn(toClassify: numpy.ndarray, trainSet: numpy.ndarray):
    withoutClassification = getMnistData(toClassify)
    smallestDistance = 100000
    bestCurrentClassifier = -1
    for currentIndex in range(0, len(trainSet)):
        currentTrainData = trainDataSet[currentIndex]
        currentDistance = euclideanDistance(getMnistData(currentTrainData), withoutClassification)
        if smallestDistance > currentDistance:
            smallestDistance = currentDistance
            bestCurrentClassifier = getClassification(currentTrainData)
    return bestCurrentClassifier


def checkClassification(classification, toClassify: numpy.ndarray):
    return classification == getClassification(toClassify)

In [33]:
def getTimeApproxPerClassification(functionToTime, trainDataSetToUse):
    start = time.time()
    for index, currentDataSet in enumerate(testDataSet):
        functionToTime(currentDataSet, trainDataSetToUse)
        print(index)
    end = time.time()
    print(end - start)
    return (end - start)/100

In [34]:
getTimeApproxPerClassification(bruteforce1nn, trainDataSet)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
49.63020205497742


0.4963020205497742

In [37]:
correct = 0
false = 0
print("Total number of Test cases: "+ str(len(testDataSet)))

for index, currentDataSet in enumerate(testDataSet):
    classifiedAs = bruteforce1nn(currentDataSet, trainDataSet)
    if classifiedAs == getClassification(currentDataSet):
        correct += 1
    else:
        false += 1
    print(str(index) + ": c: " + str(correct) + " f: " + str(false))

accuracy = correct / len(testDataSet)
print("Correct Classified: " + str(correct))
print("False Classified: " + str(false))

print("Accuracy: " + str(accuracy))

Total number of Test cases: 15001
0: c: 1 f: 0
1: c: 2 f: 0
2: c: 3 f: 0
3: c: 4 f: 0
4: c: 5 f: 0
5: c: 6 f: 0
6: c: 7 f: 0
7: c: 8 f: 0
8: c: 9 f: 0
9: c: 10 f: 0
10: c: 11 f: 0
11: c: 12 f: 0
12: c: 13 f: 0
13: c: 14 f: 0
14: c: 15 f: 0
15: c: 16 f: 0
16: c: 17 f: 0
17: c: 18 f: 0
18: c: 19 f: 0
19: c: 20 f: 0
20: c: 21 f: 0
21: c: 22 f: 0
22: c: 23 f: 0
23: c: 24 f: 0
24: c: 25 f: 0
25: c: 26 f: 0
26: c: 27 f: 0
27: c: 28 f: 0
28: c: 29 f: 0
29: c: 30 f: 0
30: c: 31 f: 0
31: c: 32 f: 0
32: c: 33 f: 0
33: c: 34 f: 0
34: c: 35 f: 0
35: c: 35 f: 1
36: c: 36 f: 1
37: c: 37 f: 1
38: c: 38 f: 1
39: c: 39 f: 1
40: c: 40 f: 1
41: c: 41 f: 1
42: c: 41 f: 2
43: c: 42 f: 2
44: c: 42 f: 3
45: c: 43 f: 3
46: c: 44 f: 3
47: c: 45 f: 3
48: c: 46 f: 3
49: c: 47 f: 3
50: c: 48 f: 3
51: c: 49 f: 3
52: c: 50 f: 3
53: c: 51 f: 3
54: c: 52 f: 3
55: c: 53 f: 3
56: c: 54 f: 3
57: c: 55 f: 3
58: c: 56 f: 3
59: c: 57 f: 3
60: c: 58 f: 3
61: c: 59 f: 3
62: c: 60 f: 3
63: c: 61 f: 3
64: c: 61 f: 4
65: c: 62 

KeyboardInterrupt: 

## Next Approach: Don't calculate iteratively but use pre build optimized spacial package.

In [6]:
distanceMatrix = spatial.distance.cdist(testData, trainData, metric='euclid')

In [7]:
print(distanceMatrix[1].min())
print(numpy.argmin(distanceMatrix[1]))
print(distanceMatrix[1,261])

1274.0616154644954
23518
1349.6914462202092


In [12]:
def getDistanceMatrix(testData, trainData, metric):
    return  spatial.distance.cdist(getOnlyData(testData), getOnlyData(trainData), metric=metric)


def getAllClassifications(dataMatrix: numpy.matrix):
    return numpy.squeeze(numpy.asarray(dataMatrix[:,0]))

def getOnlyData(dataMatrix: numpy.matrix):
    return numpy.asarray(dataMatrix[:,1:])


def getKNearestNeightbourIndices(distanceMatrix: numpy.ndarray, k):
    return numpy.argpartition(distanceMatrix,k)[:,:k]

def performKNearestNeighbour(distances, trainDataToUse, testDataToUse, k):
    neighrestNeighbours = getKNearestNeightbourIndices(distances, k)
    classifications = getAllClassifications(trainDataToUse)[neighrestNeighbours]
    return classifications

def getCorrectness(madeClassifications, trueClassifications):
     subtractedClassifications = numpy.subtract(trueClassifications, madeClassifications)
     numberOfFalse = numpy.count_nonzero(subtractedClassifications)
     return 1 - (numberOfFalse / len(testClassification))

classifications1 = performKNearestNeighbour(distanceMatrix, trainData, testData, 1)
print(getCorrectness(classifications1, getAllClassifications(testData)))


1.0


In [None]:
distanceMatrix = getDistanceMatrix(testData, trainData, 'euclid')


In [11]:
print(getOnlyData(testData))
print(distanceMatrix[1].min())
print(numpy.argmin(distanceMatrix[1]))
print(distanceMatrix[1,261])
print(classifications1.min())

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
1274.0616154644954
23518
1349.6914462202092
0.0
