### Task:
We look at the task of classifying images of digits using k-nearest neighbor classification. The files pa1train.txt, pa1validate.txt and pa1test.txt contain the training, validation and test data sets respectively. The images have already been converted into vectors of pixel colors. The data files are in ASCII text format, and each line contains a feature vector of size 784, followed by its label. The coordinates of the feature vector are separated by spaces.


## Summary:
### Part 1 without Projections
| k | Train Errors | Validation Errors |
| --- | --- | --- |
| 1 | 0.0 | 0.082 |
| 5 | 0.0565 | 0.095 |
| 9 | 0.0685 | 0.104 |
| 15 | 0.0925 | 0.108 |

* Since $k=1$ has the lowest validation error, I computed the test error for $k=1$, which is $0.094$.

### Part 2 with Projections
| k | Train Errors | Validation Errors |
| --- | --- | --- |
| 1 | 0.0 | 0.32 |
| 5 | 0.1945 | 0.299 |
| 9 | 0.2305 | 0.302 |
| 15 | 0.257 | 0.289 |

* Since $k=15$ has the lowest validation error, I computed the test error for $k=15$, which is $0.296$.

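### How is the classification accuracy affected by projection?
* The classification accuracy with projection *went down*: the test error rose from $0.094$ ($k=1$, no projection) to $0.296$ ($k=15$, with projection). Projecting from 784 dimensions to 20 discards most of the pixel information, so digits that were far apart in the original space can end up close together.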

### How does the running time change on projected data?
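| Runtime (seconds) | Training errors (k = 1, 3, 5, 9, 15) | Validation errors (k = 1, 5, 9, 15) | Test error (k = 1 without projection, k = 15 with projection) |
| --- | --- | --- | --- |
| without projection | 304.971 | 133.117 | 30.797 |
| with projection | 282.221 | 115.279 | 28.491 |

* The running time *decreases* on projected data, because fewer columns enter each Euclidean distance computation. The saving is modest (roughly 7% to 13%), which suggests that most of the time goes to the Python loop and the sort rather than to the distance arithmetic itself.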


# Part 1
##  Load Data

In [154]:
import numpy as np
from scipy.spatial import distance
from scipy import stats
import time

In [155]:
train=np.loadtxt("pa1train.txt")
test=np.loadtxt("pa1test.txt")
validate=np.loadtxt("pa1validate.txt")


## Create a Euclidean Function

In [156]:
def eu(a, b):
    # The last entry of each row is the label, so it is excluded from the distance
    return distance.euclidean(a[:-1], b[:-1])
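
For reference, the distance used here is $\sqrt{\sum_j (a_j - b_j)^2}$ taken over the feature columns only. A minimal NumPy-only sketch of the same computation follows; `eu_numpy` and the two rows are made up for illustration and are not part of the assignment data.

```python
import numpy as np

def eu_numpy(a, b):
    """Euclidean distance between two rows, ignoring the trailing label."""
    diff = a[:-1] - b[:-1]
    return float(np.sqrt(np.dot(diff, diff)))

# Two made-up rows: three features followed by a label.
row_a = np.array([0.0, 3.0, 0.0, 7.0])
row_b = np.array([4.0, 0.0, 0.0, 2.0])

# The feature difference is (-4, 3, 0), so the distance is 5.0;
# the labels 7 and 2 play no part.
print(eu_numpy(row_a, row_b))
```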

## Create a Function for KNN

In [157]:
def knn(data, data_, index, k):
    # data: labelled reference rows; data_[index]: the row to classify
    n = data.shape[0]  # named n to avoid shadowing the built-in len
    dist = []
    for i in range(n):
        # Add the distance and the index of the example to an ordered collection
        eu_d = eu(data_[index], data[i])
        dist.append((eu_d, i))
    # Sort the collection of distances and indices in ascending order by distance
    dist = sorted(dist)
    # Pick the first k entries from the sorted collection
    preds = dist[:k]
    # Get the labels of the selected k entries
    labels = [data[i[1]][-1] for i in preds]
    # Get the mode of the k labels; np.atleast_1d copes with SciPy versions
    # that return either a scalar or a length-1 array for 1-D input
    pred = np.atleast_1d(stats.mode(labels).mode)[0]
    # Return True if the predicted label matches the real label, False otherwise
    return data_[index][-1] == pred
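
The `knn` function above computes one distance per Python loop iteration. As a rough sketch of an alternative (not the method used for the results in this notebook), all distances can be computed at once with NumPy. The names `knn_predict` and `error_rate` and the toy arrays are hypothetical. Ties in the vote go to the smallest label, which is my understanding of what `stats.mode` does. Because the sketch ranks by squared distance computed through an expanded formula, near-ties could in principle be ordered differently from the loop version.

```python
import numpy as np
from collections import Counter

def knn_predict(train, queries, k):
    """Predict a label for every row of `queries`; the last column of each array is the label."""
    X, y = train[:, :-1], train[:, -1]
    Q = queries[:, :-1]
    # Squared distances via ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2
    d2 = (Q ** 2).sum(axis=1)[:, None] - 2.0 * Q @ X.T + (X ** 2).sum(axis=1)[None, :]
    # Indices of the k nearest training rows for each query, nearest first
    nearest = np.argsort(d2, axis=1, kind="stable")[:, :k]
    preds = []
    for idx in nearest:
        counts = Counter(y[idx].tolist())
        top = max(counts.values())
        # Majority vote; ties go to the smallest label
        preds.append(min(label for label, c in counts.items() if c == top))
    return np.array(preds)

def error_rate(train, queries, k):
    """Fraction of query rows whose predicted label differs from the true one."""
    return float(np.mean(knn_predict(train, queries, k) != queries[:, -1]))

# Made-up 2-D data: two features followed by a label
toy_train = np.array([[0.0, 0.0, 0], [0.0, 1.0, 0], [5.0, 5.0, 1], [5.0, 6.0, 1]])
toy_query = np.array([[0.2, 0.1, 0], [4.8, 5.2, 1], [0.1, 0.9, 1]])
print(knn_predict(toy_train, toy_query, 1))  # third query is misclassified
print(error_rate(toy_train, toy_query, 1))
```

The full distance matrix has one entry per (query, training row) pair, so this trades memory for speed.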

## Get the Training Errors 

In [158]:
start = time.time()
n = train.shape[0]  # named n to avoid shadowing the built-in len
ks = [1, 3, 5, 9, 15]
for k in ks:
    correct = []
    for i in range(n):
        correct.append(knn(train, train, i, k))
    print('The Training Error when k is equal to ' + str(k) + ' is ' + str(1 - np.mean(correct)))
end = time.time()
print(end - start)

The Training Error when k is equal to 1 is 0.0
The Training Error when k is equal to 3 is 0.04349999999999998
The Training Error when k is equal to 5 is 0.056499999999999995
The Training Error when k is equal to 9 is 0.0685
The Training Error when k is equal to 15 is 0.09250000000000003
304.97118067741394


## Get the Validation Errors

In [159]:
start = time.time()
n = validate.shape[0]  # named n to avoid shadowing the built-in len
ks = [1, 5, 9, 15]
for k in ks:
    correct = []
    for i in range(n):
        correct.append(knn(train, validate, i, k))
    print('The Validation Error when k is equal to ' + str(k) + ' is ' + str(1 - np.mean(correct)))
end = time.time()
print(end - start)

The Validation Error when k is equal to 1 is 0.08199999999999996
The Validation Error when k is equal to 5 is 0.09499999999999997
The Validation Error when k is equal to 9 is 0.10399999999999998
The Validation Error when k is equal to 15 is 0.10799999999999998
133.11728882789612


As shown above, k = 1 gives the lowest validation error of these classifiers, so I use k = 1 to compute the test error.
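
The selection step can also be written out explicitly. The sketch below copies the validation errors printed above (rounded to three decimals) into a dictionary; the variable names are illustrative only.

```python
# Validation errors printed above, rounded to three decimals
validation_errors = {1: 0.082, 5: 0.095, 9: 0.104, 15: 0.108}

# Pick the k with the smallest validation error; on a tie, min() keeps
# the first key in insertion order, which here is the smaller k
best_k = min(validation_errors, key=validation_errors.get)
print(best_k)  # 1
```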

## Get the Test Error when k = 1

In [160]:
start = time.time()
n = test.shape[0]  # named n to avoid shadowing the built-in len
correct = []
for i in range(n):
    correct.append(knn(train, test, i, 1))
print('The Test Error when k is equal to 1 is ' + str(1 - np.mean(correct)))
end = time.time()
print(end - start)

The Test Error when k is equal to 1 is 0.09399999999999997
30.797391891479492


### Part 1 summary
| k | Train Errors | Validation Errors |
| --- | --- | --- |
| 1 | 0.0 | 0.082 |
| 5 | 0.0565 | 0.095 |
| 9 | 0.0685 | 0.104 |
| 15 | 0.0925 | 0.108 |

* Since k=1 performs the best on validation, I calculated the test error for k = 1, and it is 0.094

# Part 2
We will look at how using a projection as a pre-processing step affects the accuracy and running time of nearest neighbor classification.

## Load the Projection Data

The file projection.txt represents a projection matrix P with 784 rows and 20 columns. Each column is a 784-dimensional unit vector, and the columns are orthogonal to each other.
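
Unit-length, mutually orthogonal columns mean that $P^\top P$ is the $20 \times 20$ identity. A small sketch of how that property could be checked is shown below. `has_orthonormal_columns` is a hypothetical helper, and the matrix is a random stand-in built with a QR factorization rather than the contents of projection.txt. For the real file, which stores only a few significant digits per entry, a looser tolerance is probably needed.

```python
import numpy as np

def has_orthonormal_columns(P, tol=1e-6):
    """True when P^T P equals the identity up to `tol`."""
    gram = P.T @ P
    return bool(np.allclose(gram, np.eye(P.shape[1]), atol=tol))

# Stand-in for projection.txt: a random 784 x 20 matrix with orthonormal columns
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(784, 20)))

print(Q.shape, has_orthonormal_columns(Q))
# Scaling the columns breaks the unit-length property
print(has_orthonormal_columns(2.0 * Q))
```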

In [114]:
project=np.loadtxt("projection.txt")
project

array([[-0.015626  , -0.019702  , -0.005087  , ..., -0.019012  ,
        -0.053294  , -0.059311  ],
       [-0.043534  ,  0.038514  ,  0.061698  , ...,  0.03185   ,
         0.0024749 ,  0.00052616],
       [-0.016051  , -0.016189  ,  0.030413  , ..., -0.0085923 ,
         0.036698  ,  0.0069459 ],
       ...,
       [ 0.0084749 ,  0.017197  , -0.063448  , ...,  0.065384  ,
        -0.0032644 ,  0.026077  ],
       [-0.025274  ,  0.0051254 , -0.023191  , ...,  0.069659  ,
        -0.047332  , -0.0442    ],
       [-0.023108  , -0.0092047 , -0.021068  , ...,  0.010774  ,
        -0.01426   ,  0.0037587 ]])

## Project the training, validation and test data onto the column space of this matrix

In [148]:
# Project the 784 pixel columns and re-attach the label column
train_ = np.hstack((train[:, :-1].dot(project), train[:, -1:]))

In [146]:
validate_ = np.hstack((validate[:, :-1].dot(project), validate[:, -1:]))

In [147]:
test_ = np.hstack((test[:, :-1].dot(project), test[:, -1:]))
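
One property worth keeping in mind before rerunning the classifier: when the columns of the matrix are orthonormal, projecting can only shrink (never stretch) the Euclidean distance between two points, so points that were well separated in 784 dimensions may land close together in 20. The sketch below illustrates this on made-up data; the random matrix and vectors are stand-ins, not the assignment files.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the matrix in projection.txt: orthonormal columns from QR
P, _ = np.linalg.qr(rng.normal(size=(784, 20)))

# Two made-up "images" with pixel values in [0, 255]
a = rng.uniform(0, 255, size=784)
b = rng.uniform(0, 255, size=784)

d_full = np.linalg.norm(a - b)          # distance in the original space
d_proj = np.linalg.norm(a @ P - b @ P)  # distance after projection

print(d_proj <= d_full)  # True
```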

## Repeat Part 1

In [161]:
start = time.time()
n = train_.shape[0]  # named n to avoid shadowing the built-in len
ks = [1, 3, 5, 9, 15]
for k in ks:
    correct = []
    for i in range(n):
        correct.append(knn(train_, train_, i, k))
    print('The Training Error when k is equal to ' + str(k) + ' is ' + str(1 - np.mean(correct)))
end = time.time()
print(end - start)

The Training Error when k is equal to 1 is 0.0
The Training Error when k is equal to 3 is 0.16049999999999998
The Training Error when k is equal to 5 is 0.1945
The Training Error when k is equal to 9 is 0.23050000000000004
The Training Error when k is equal to 15 is 0.257
282.2208981513977


In [162]:
start = time.time()
n = validate_.shape[0]  # named n to avoid shadowing the built-in len
ks = [1, 5, 9, 15]
for k in ks:
    correct = []
    for i in range(n):
        correct.append(knn(train_, validate_, i, k))
    print('The Validation Error when k is equal to ' + str(k) + ' is ' + str(1 - np.mean(correct)))
end = time.time()
print(end - start)

The Validation Error when k is equal to 1 is 0.31999999999999995
The Validation Error when k is equal to 5 is 0.29900000000000004
The Validation Error when k is equal to 9 is 0.30200000000000005
The Validation Error when k is equal to 15 is 0.28900000000000003
115.27900719642639


In [163]:
start = time.time()
n = test_.shape[0]  # named n to avoid shadowing the built-in len
correct = []
for i in range(n):
    correct.append(knn(train_, test_, i, 15))
print('The Test Error when k is equal to 15 is ' + str(1 - np.mean(correct)))
end = time.time()
print(end - start)

The Test Error when k is equal to 15 is 0.29600000000000004
28.491159200668335
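
The timings printed in Part 1 and Part 2 can be compared directly. The sketch below copies the six wall-clock values (rounded to three decimals) and computes the relative saving; the dictionary layout is just for illustration. The test timing compares k = 1 without projection against k = 15 with projection, which should matter little since every query is compared against all training rows regardless of k.

```python
# Wall-clock seconds printed by the cells above: (without projection, with projection)
timings = {
    "train": (304.971, 282.221),
    "validation": (133.117, 115.279),
    "test": (30.797, 28.491),
}

# Relative saving in percent for each data set
savings = {
    name: 100.0 * (full - projected) / full
    for name, (full, projected) in timings.items()
}

for name, saving in savings.items():
    print(f"{name}: {saving:.1f}% less time on projected data")
```

The savings come out at roughly 7.5%, 13.4% and 7.5%, far less than the 784 to 20 reduction in columns.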


