# Euclidean distance
The vectorized formula used for the Euclidean distance is:

**distance_matrix = np.sqrt(((test_data\*\*2).sum(axis = 1)[:, None]) + ((train_data\*\*2).sum(axis = 1)) - 2\*test_data.dot(train_data.T))**

Let's take it apart to understand what it is doing. The Euclidean distance between two points is defined as:
    
<img src="https://github.com/rushikesh12/KNN-using-Vectorized-Computation-of-Distance-Metrics/blob/master/euclidean.png">
<br>
<center><i> <a href="https://en.wikipedia.org/wiki/Euclidean_distance">Image Source: Wikipedia </a> </i></center>


Expanding the square, this formula can be rewritten as:
    
<img src="https://github.com/rushikesh12/KNN-using-Vectorized-Computation-of-Distance-Metrics/blob/master/euclidean-1.png">

*image & explanation reference: https://ljvmiranda921.github.io/notebook/2017/02/09/k-nearest-neighbors/*
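
In case the images above do not render, the expansion behind the vectorized formula can be written out as below. The symbols are notation introduced here for illustration only: x is one test row, y is one train row, and d is the number of features.

```latex
\begin{aligned}
\lVert \mathbf{x} - \mathbf{y} \rVert_2^2
  &= \sum_{k=1}^{d} (x_k - y_k)^2 \\
  &= \sum_{k=1}^{d} x_k^2 + \sum_{k=1}^{d} y_k^2 - 2 \sum_{k=1}^{d} x_k y_k \\
  &= \lVert \mathbf{x} \rVert_2^2 + \lVert \mathbf{y} \rVert_2^2 - 2\, \mathbf{x} \cdot \mathbf{y}
\end{aligned}
```

The three terms on the last line correspond to `(test_data**2).sum(axis = 1)[:, None]`, `(train_data**2).sum(axis = 1)` and `2*test_data.dot(train_data.T)` in the NumPy expression; taking the square root of the result gives the distance.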





In our case, the distance matrix has the test data as the rows and the training data as the columns.

| Test data (rows) / Training data (columns) | Train-1 | Train-2 | Train-3 | Train-n |
| :-: | :-: | :-: | :-: | :-: |
| Test data-1 | distance(train-1, test-1) | distance(train-2, test-1) | distance(train-3, test-1) | distance(train-n, test-1) |

The train and test data are 2-D arrays with the following shapes:

train = (number of training data, number of relevant features)

test = (number of test data, number of relevant features)
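
To make the layout of the distance matrix concrete, here is a minimal loop-based sketch that fills one entry per (test, train) pair. The helper name `euclidean_loop` is hypothetical, and the arrays are hard-coded to the same 5 training rows and 2 test rows printed in this walkthrough so that the block runs on its own. It is meant only as a slow reference to compare against, not as the vectorized method itself.

```python
import numpy as np

# Same values as the 5 training rows and 2 test rows used in this walkthrough.
X_train = np.array([[1, 1, 2], [1, 3, 2], [1, 4, 1], [4, 2, 2], [1, 3, 4]])
X_test = np.array([[1, 1, 2], [1, 2, 1]])

def euclidean_loop(test_data, train_data):
    """Hypothetical reference helper: one distance per (test, train) pair."""
    n_test, n_train = test_data.shape[0], train_data.shape[0]
    distances = np.zeros((n_test, n_train))
    for i in range(n_test):          # rows follow the test data
        for j in range(n_train):     # columns follow the training data
            diff = test_data[i] - train_data[j]
            distances[i, j] = np.sqrt(np.sum(diff ** 2))
    return distances

loop_result = euclidean_loop(X_test, X_train)
print(loop_result.shape)  # (2, 5)
```

The result has one row per test sample and one column per training sample, which is the shape the vectorized formula has to reproduce.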


In [1]:
import pandas as pd
import numpy as np

# For understanding we will be selecting 5 rows of train and 2 rows of test data.

df = pd.read_csv("hayes-roth.data", header = None, names = ["name","hobby","age","educational level","marital status","class"])
df = df.iloc[:5,2:]
test_df = pd.read_csv("hayes-roth.test", header = None, names = ["hobby","age","educational level","marital status","class"])
test_df = test_df.iloc[:2,1:]

# Get X_train & X_test to compute the distance between them

X_train = np.asarray(df, dtype=np.int64)
X_train = X_train[:,:-1]

X_test = np.asarray(test_df, dtype=np.int64)
X_test = X_test[:,:-1]


In [2]:
X_train

array([[1, 1, 2],
       [1, 3, 2],
       [1, 4, 1],
       [4, 2, 2],
       [1, 3, 4]], dtype=int64)

In [3]:
train_data_square = X_train ** 2
train_data_square

array([[ 1,  1,  4],
       [ 1,  9,  4],
       [ 1, 16,  1],
       [16,  4,  4],
       [ 1,  9, 16]], dtype=int64)

In [4]:
X_test

array([[1, 1, 2],
       [1, 2, 1]], dtype=int64)

In [5]:
# distance_matrix = np.sqrt(((test_data**2).sum(axis = 1)[:, None]) + ((train_data**2).sum(axis = 1)) - 2*test_data.dot(train_data.T))

test_data_square = X_test ** 2
test_data_square

array([[1, 1, 4],
       [1, 4, 1]], dtype=int64)

In [6]:
# Using sum function to sum all the elements of a row i.e (axis = 1)
test_data_square.sum(axis=1)

array([6, 6], dtype=int64)

In [7]:
# Adding a dimension to change it from (2,) to (2,1)
(X_test**2).sum(axis = 1)[:, None].shape

(2, 1)

In [8]:
# Similarly with X_train
(X_train**2).sum(axis=1)

array([ 6, 14, 18, 24, 26], dtype=int64)

In [9]:
(X_train**2).sum(axis=1).shape

(5,)

In [10]:
# Adding the X_train_square and X_test_square
((X_test**2).sum(axis = 1)[:, None] + (X_train**2).sum(axis=1)).shape

(2, 5)
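
The (2, 5) shape comes from NumPy broadcasting: the (2, 1) column is stretched across 5 columns and the (5,) vector is stretched across 2 rows. A small sketch using the row sums computed above (the variable names `test_sq`, `train_sq` and `summed` are chosen here for illustration):

```python
import numpy as np

test_sq = np.array([6, 6])[:, None]        # shape (2, 1): one value per test row
train_sq = np.array([6, 14, 18, 24, 26])   # shape (5,): one value per train row

# (2, 1) + (5,) broadcasts to (2, 5): entry [i, j] is test_sq[i] + train_sq[j]
summed = test_sq + train_sq
print(summed)
# [[12 20 24 30 32]
#  [12 20 24 30 32]]
```

Both rows are identical here only because the two test rows happen to have the same sum of squares.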

In [11]:
print("X_test shape:",X_test.shape)
print("X_train shape:",X_train.shape)

X_test shape: (2, 3)
X_train shape: (5, 3)


In [12]:
# To multiply X_test by X_train we need to transpose X_train
# (2, 3) x (3, 5) = (2, 5), i.e. test as rows and train as columns, matching the layout of the distance matrix
np.dot(X_test, X_train.T)

array([[ 6,  8,  7, 10, 12],
       [ 5,  9, 10, 10, 11]], dtype=int64)

In [13]:
# To put it all under sqrt
np.sqrt(((X_test**2).sum(axis = 1)[:, None]) + ((X_train**2).sum(axis = 1)) - 2*np.dot(X_test, X_train.T))

array([[0.        , 2.        , 3.16227766, 3.16227766, 2.82842712],
       [1.41421356, 1.41421356, 2.        , 3.16227766, 3.16227766]])
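
The steps above can be collected into a single function. This is a minimal sketch, not the notebook's own code: the function name `euclidean_distance_matrix` is hypothetical, and the `np.maximum(..., 0)` clip is an added assumption that guards against tiny negative values which floating-point round-off can produce with real-valued features (it has no effect on the integer data used here). The result is cross-checked against a direct broadcasted computation.

```python
import numpy as np

def euclidean_distance_matrix(test_data, train_data):
    """Hypothetical wrapper: returns an array of shape (n_test, n_train)."""
    test_sq = (test_data ** 2).sum(axis=1)[:, None]   # (n_test, 1)
    train_sq = (train_data ** 2).sum(axis=1)          # (n_train,)
    cross = test_data.dot(train_data.T)               # (n_test, n_train)
    squared = test_sq + train_sq - 2 * cross
    # Assumption: clip small negatives caused by round-off before the sqrt.
    return np.sqrt(np.maximum(squared, 0))

# Same values as the arrays used in this walkthrough.
X_train = np.array([[1, 1, 2], [1, 3, 2], [1, 4, 1], [4, 2, 2], [1, 3, 4]])
X_test = np.array([[1, 1, 2], [1, 2, 1]])

vectorized = euclidean_distance_matrix(X_test, X_train)

# Direct computation via a (2, 5, 3) difference array, used only as a cross-check.
direct = np.sqrt(((X_test[:, None] - X_train) ** 2).sum(-1))
print(np.allclose(vectorized, direct))  # True
```

The expanded form avoids building the (n_test, n_train, n_features) intermediate array that the direct version needs, which is the main reason to prefer it on larger data.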

# Manhattan Distance

The vectorized formula used for the Manhattan distance is:

**distance_matrix = (np.abs(train_data[:,None] - test_data).sum(-1)).T**

The Manhattan distance between two points is defined as:

<img src="https://github.com/rushikesh12/KNN-using-Vectorized-Computation-of-Distance-Metrics/blob/master/manhattan.png">
<br>
<center><i> <a href="https://en.wikipedia.org/wiki/Taxicab_geometry">Image Source: Wikipedia </a> </i></center>
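
In case the image does not render, the definition can be written as below. As before, x (a test row), y (a train row) and d (the number of features) are notation introduced here for illustration.

```latex
d_1(\mathbf{x}, \mathbf{y}) = \lVert \mathbf{x} - \mathbf{y} \rVert_1 = \sum_{k=1}^{d} \lvert x_k - y_k \rvert
```

Note that the absolute value is taken for each feature before summing; summing first and taking the absolute value afterwards gives a different quantity.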

In [14]:
# We will use the same small parts of data used above

print("X-train",X_train)
print("\nX-test",X_test)

X-train [[1 1 2]
 [1 3 2]
 [1 4 1]
 [4 2 2]
 [1 3 4]]

X-test [[1 1 2]
 [1 2 1]]


In [15]:
# The shape of X_train and X_test are:

print("X-train shape:",X_train.shape)
print("\nX-test shape:",X_test.shape)

X-train shape: (5, 3)

X-test shape: (2, 3)


In [16]:
# We add an axis to the train NumPy array to create 5 sets of (1,3) vectors. Broadcasting then lets us perform...
# element-wise subtraction with the test NumPy array to get (5,2,3), i.e. each test row is subtracted from each train row

print(X_train[:,None])
print("\nAfter adding an axis the X_train shape is:",X_train[:,None].shape)

[[[1 1 2]]

 [[1 3 2]]

 [[1 4 1]]

 [[4 2 2]]

 [[1 3 4]]]

After adding an axis the X_train shape is: (5, 1, 3)
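
A short sketch of why the extra axis is needed: with shapes (5, 1, 3) and (2, 3), broadcasting pairs every train row with every test row. The arrays are hard-coded to the values shown above, and `train_expanded`, `result_shape` and `diff` are names chosen here for illustration.

```python
import numpy as np

X_train = np.array([[1, 1, 2], [1, 3, 2], [1, 4, 1], [4, 2, 2], [1, 3, 4]])
X_test = np.array([[1, 1, 2], [1, 2, 1]])

train_expanded = X_train[:, None]             # shape (5, 1, 3)

# np.broadcast reports the shape both operands are stretched to.
result_shape = np.broadcast(train_expanded, X_test).shape
print(result_shape)                           # (5, 2, 3)

diff = train_expanded - X_test
print(diff[3])                                # train row 4 minus each of the 2 test rows
# [[3 1 0]
#  [3 0 1]]
```

Without the added axis, `X_train - X_test` would try to align shapes (5, 3) and (2, 3) and raise an error, because 5 and 2 are not compatible for broadcasting.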


In [17]:
# Here, np.abs is used to find the absolute values after the subtraction

np.abs(X_train[:,None] - X_test)

array([[[0, 0, 0],
        [0, 1, 1]],

       [[0, 2, 0],
        [0, 1, 1]],

       [[0, 3, 1],
        [0, 2, 0]],

       [[3, 1, 0],
        [3, 0, 1]],

       [[0, 2, 2],
        [0, 1, 3]]], dtype=int64)

In [18]:
# We now sum the absolute differences; sum(-1) sums over the last (feature) axis, giving shape (5,2)
# The result is then transposed to get the form of (test_cases, train_data), here (2,5)
np.abs(X_train[:,None] - X_test).sum(-1).T

array([[0, 2, 4, 4, 4],
       [2, 2, 2, 4, 4]], dtype=int64)

In [19]:
# Going the other way around, i.e. test - train, gives the same result without needing the transpose

np.abs(X_test[:,None] - X_train).sum(-1)

array([[0, 2, 4, 4, 4],
       [2, 2, 2, 4, 4]], dtype=int64)
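
As a sanity check, the broadcasted computation can be wrapped in a function and compared against a plain double loop. This is a minimal sketch under the same assumptions as the cells above: the function name `manhattan_distance_matrix` is hypothetical and the arrays are hard-coded to the values used in this walkthrough.

```python
import numpy as np

def manhattan_distance_matrix(test_data, train_data):
    """Hypothetical wrapper: returns an array of shape (n_test, n_train)."""
    # (n_test, 1, d) - (n_train, d) -> (n_test, n_train, d)
    # Take absolute values first, then sum over the feature axis.
    return np.abs(test_data[:, None] - train_data).sum(-1)

X_train = np.array([[1, 1, 2], [1, 3, 2], [1, 4, 1], [4, 2, 2], [1, 3, 4]])
X_test = np.array([[1, 1, 2], [1, 2, 1]])

vectorized = manhattan_distance_matrix(X_test, X_train)

# Loop-based reference, one entry per (test, train) pair.
loop = np.zeros(vectorized.shape, dtype=vectorized.dtype)
for i, test_row in enumerate(X_test):
    for j, train_row in enumerate(X_train):
        loop[i, j] = np.abs(test_row - train_row).sum()

print(np.array_equal(vectorized, loop))  # True
```

One trade-off to keep in mind: the broadcasted version materializes an intermediate array of shape (n_test, n_train, n_features), so memory use grows quickly on large datasets.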