## Question:
### Generate a random matrix of shape (1 million, 2) and perform the operations below:
#### 1. Find the distance between each 2-dimensional data point and the centroid of the dataset
#### 2. Append the newly calculated distances as a new column in the dataset
#### 3. Given the centroid, find its 3 closest neighbours
#### 4. Given any data point, find its 3 closest neighbours

In [1]:
import numpy as np

#### Generate a random integer matrix with the given shape

In [2]:
arr = np.random.randint(100, 2000, size=(1000000,2))
arr

array([[1057,  569],
       [ 989,  871],
       [ 484,  353],
       ...,
       [ 181,  656],
       [1970,  185],
       [ 359,  565]])

#### Calculate the centroid of the data and compute each point's Euclidean distance from it

In [3]:
centroid = np.mean(arr, axis=0)
centroid

array([1050.017234, 1049.154379])

In [4]:
distances = np.sqrt(np.sum((arr-centroid)**2, axis=1))
distances

array([ 480.20515063,  188.31379557,  897.22150475, ...,  953.81409023,
       1262.19296484,  843.74775875])
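
The row-wise distance above can also be written with `np.linalg.norm`. The sketch below uses a small hypothetical array (`sample`, not part of the original notebook) to check that the two forms agree.

```python
import numpy as np

# Small hypothetical sample standing in for the 1,000,000 x 2 matrix
sample = np.array([[1, 2], [3, 4], [5, 9]])
sample_centroid = sample.mean(axis=0)

# Explicit form, as in the cell above: square, sum per row, square root
dist_explicit = np.sqrt(np.sum((sample - sample_centroid) ** 2, axis=1))

# Equivalent form: L2 norm of each row of the centred data
dist_norm = np.linalg.norm(sample - sample_centroid, axis=1)

print(np.allclose(dist_explicit, dist_norm))
```

Both versions rely on broadcasting: the length-2 centroid is subtracted from every row of the (n, 2) array.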

#### Add the distances as another column in the data

In [5]:
np.hstack([arr, distances.reshape(-1,1)])

array([[1057.        ,  569.        ,  480.20515063],
       [ 989.        ,  871.        ,  188.31379557],
       [ 484.        ,  353.        ,  897.22150475],
       ...,
       [ 181.        ,  656.        ,  953.81409023],
       [1970.        ,  185.        , 1262.19296484],
       [ 359.        ,  565.        ,  843.74775875]])
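
Note that the cell above only displays the stacked array; it is not stored anywhere, and the integer coordinates are upcast to floats because a NumPy array has a single dtype. A minimal sketch of keeping the result, on hypothetical small data (`pts`, `d`, and `pts_with_dist` are illustrative names, and the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
pts = rng.integers(100, 2000, size=(5, 2))
d = np.linalg.norm(pts - pts.mean(axis=0), axis=1)

# column_stack treats the 1-D distance vector as a column, so no reshape is needed
pts_with_dist = np.column_stack([pts, d])

print(pts_with_dist.shape, pts_with_dist.dtype)
```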

#### Get the 3 points closest to the centroid

In [6]:
arr[distances.argsort()[:3]]

array([[1051, 1048],
       [1048, 1051],
       [1048, 1047]])
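
The same idea extends to item 4 of the question, where the reference point is an arbitrary data point rather than the centroid. The helper below is a sketch, not part of the original notebook: `k_nearest` is a hypothetical name, and it uses `np.argpartition` so that only the k candidates are sorted instead of all one million distances.

```python
import numpy as np

def k_nearest(points, query, k=3):
    """Return indices of the k rows of `points` closest to `query`, nearest first."""
    d = np.linalg.norm(points - query, axis=1)
    # argpartition moves the k smallest distances into the first k slots (unordered)
    idx = np.argpartition(d, k - 1)[:k]
    # sort only those k candidates by distance
    return idx[np.argsort(d[idx])]

# Hypothetical toy data
pts = np.array([[0, 0], [10, 10], [1, 1], [2, 2], [50, 50]])

# The query is itself a row of the dataset, so it comes back first (distance 0)
nearest = k_nearest(pts, pts[0], k=3)
print(pts[nearest])
```

When the query point belongs to the dataset, one option is to request `k + 1` neighbours and drop the first result so that the point is not reported as its own neighbour.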

## Perform the same operations using pandas only

### Method-1

In [10]:
import pandas as pd

In [12]:
df = pd.DataFrame(arr, columns=['x', 'y'])

In [13]:
df.to_csv('matrix-data.csv', index=False)

In [16]:
df = pd.read_csv('matrix-data.csv')
df.shape

(1000000, 2)

In [17]:
mx, my = df['x'].mean(), df['y'].mean()

In [18]:
df['distance'] = df['x'].apply(lambda x:(x-mx)**2) + df['y'].apply(lambda y:(y-my)**2)

In [19]:
df['distance'] = df['distance'].apply(np.sqrt)
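
`Series.apply` with a Python lambda calls the function once per element. The same result can be computed with column arithmetic, which is usually much faster on a million rows. A small sketch on hypothetical data (`df_demo` is an illustrative name) showing that the two agree:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'x': [1, 3, 5], 'y': [2, 4, 9]})
mx, my = df_demo['x'].mean(), df_demo['y'].mean()

# Element-wise version, as in the cells above
via_apply = (
    df_demo['x'].apply(lambda x: (x - mx) ** 2)
    + df_demo['y'].apply(lambda y: (y - my) ** 2)
).apply(np.sqrt)

# Vectorized version: arithmetic on whole columns at once
vectorized = np.sqrt((df_demo['x'] - mx) ** 2 + (df_demo['y'] - my) ** 2)

print(np.allclose(via_apply, vectorized))
```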

In [20]:
df.sort_values(by='distance').head(3)

           x     y  distance
606219  1051  1048  1.516054
476031  1048  1051  2.734145
408810  1048  1047  2.951369


### Method-2

In [21]:
df = pd.read_csv('matrix-data.csv')
df.shape

(1000000, 2)

In [22]:
centroid = df.mean()
centroid

x    1050.017234
y    1049.154379
dtype: float64

In [23]:
df['distances'] = ((df-centroid)**2).sum(axis=1)**0.5

In [24]:
df.sort_values(by='distances').head(3)

           x     y  distances
606219  1051  1048   1.516054
476031  1048  1051   2.734145
408810  1048  1047   2.951369
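
For picking only a few rows, `DataFrame.nsmallest` is an alternative to sorting the whole frame. The sketch below uses hypothetical toy data (`df_small` and `center` are illustrative names) and compares it with the `sort_values(...).head(3)` approach used above.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({'x': [0, 10, 1, 2, 50], 'y': [0, 10, 1, 2, 50]})
center = df_small[['x', 'y']].mean()

# Same formula as Method-2, restricted to the coordinate columns
df_small['distances'] = np.sqrt(((df_small[['x', 'y']] - center) ** 2).sum(axis=1))

# Select the 3 rows with the smallest distance without a full sort
closest = df_small.nsmallest(3, 'distances')
print(closest)
```

Selecting `['x', 'y']` explicitly keeps the formula safe to re-run after the `distances` column has been added.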
