<a href="https://colab.research.google.com/github/Sterls/colabs/blob/master/knn_demo_colab_0_8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# _k_-Nearest Neighbors (KNN)

_k_-Nearest Neighbors (KNN) is an unsupervised algorithm that finds the datapoints “closest” to a new, unseen query point: it computes a suitable “distance” between the query and every point in the dataset, then returns the _k_ points with the smallest distances.
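
The definition above can be sketched as a brute-force search in NumPy. This is only an illustration of the idea, not how cuML or scikit-learn implement it; the function name `brute_force_knn` and the toy data are made up for this example, and squared Euclidean distance is assumed as the metric.

```python
import numpy as np

def brute_force_knn(data, queries, k):
    """Return (distances, indices) of the k nearest rows of `data` for each query row."""
    # Pairwise squared Euclidean distances, shape (n_queries, n_data).
    diff = queries[:, None, :] - data[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    # Sort each row by distance and keep the k smallest entries.
    idx = np.argsort(sq_dist, axis=1, kind="stable")[:, :k]
    dist = np.take_along_axis(sq_dist, idx, axis=1)
    return dist, idx

# Hypothetical toy dataset: four 2-D points and one query point.
data = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
queries = np.array([[0.9, 0.1]])

dist, idx = brute_force_knn(data, queries, k=2)
print(idx)   # indices of the two closest rows of `data`
print(dist)  # their squared distances to the query
```

Every query is compared with every datapoint here, so the cost grows with both the number of samples and the number of features.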

cuML’s KNN expects a cuDF DataFrame or a NumPy array (automatic chunking of NumPy arrays is planned for a future release). It first fits a special data structure that approximates the distance calculations, which allows query times of O(p log n) rather than the brute-force O(np), where n is the number of samples and p is the number of features.

The KNN function accepts the following parameters:

- n_neighbors: int (default = 5). The number of closest datapoints (_k_) the algorithm returns. Larger values make queries slower.
- should_downcast: bool (default = False). The underlying index currently supports only single precision. Setting this to True automatically downcasts double-precision input arrays to single precision.
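
To show what downcasting means in practice, here is a minimal sketch of the same conversion done by hand in NumPy. The helper `downcast_to_float32` is hypothetical and is not part of cuML; it only shows the change of dtype and the small rounding error that comes with it.

```python
import numpy as np

def downcast_to_float32(X):
    """Return X as float32, converting only when the input is float64."""
    X = np.asarray(X)
    if X.dtype == np.float64:
        return X.astype(np.float32)
    return X

X64 = np.random.random((4, 3))   # NumPy creates float64 values by default
X32 = downcast_to_float32(X64)

print(X64.dtype, "->", X32.dtype)
# Largest rounding error introduced by the conversion
print(np.max(np.abs(X64 - X32)))
```

For features scaled to a small range the rounding error is usually negligible, but it can matter when values differ only in their least significant digits.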

The methods that can be used with KNN are:

- fit: Fit GPU index for performing nearest neighbor queries.
- kneighbors: Query the GPU index for the k nearest neighbors of row vectors in X.

The model accepts only NumPy arrays or cuDF DataFrames as input. To convert your dataset to cuDF format, see the cuDF documentation at https://docs.rapids.ai/api/cudf/stable.

For additional information on the _k_-Nearest Neighbors model, refer to the documentation at https://rapidsai.github.io/projects/cuml/en/stable/api.html#neighbors

# Install RAPIDS
- First, make sure you're running a GPU instance of Colab.  
- Next, we'll check whether you are running a P100 or T4 GPU. If you get an error, check that you are running a **GPU Runtime** by going to `Runtime`>`Change Runtime Type`. If you are not, switch to one and restart your instance by going to `Runtime`>`Reset all Runtimes`.
- Finally, we'll run the RAPIDS install script.
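
One possible way to perform the GPU check from Python is sketched below. It assumes the `nvidia-smi` command-line tool is on the path when a GPU runtime is active; the helper name `gpu_name` is made up for this example and is not part of the RAPIDS install script.

```python
import shutil
import subprocess

def gpu_name():
    """Return the GPU name(s) reported by nvidia-smi, or None if unavailable."""
    # Without nvidia-smi on the path there is most likely no GPU runtime.
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None

name = gpu_name()
if name is None:
    print("No GPU detected: switch to a GPU runtime before installing RAPIDS.")
else:
    print("Detected GPU:", name)
```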

In [None]:
# Install RAPIDS
!wget -nc https://raw.githubusercontent.com/rapidsai/notebooks-contrib/master/utils/rapids-colab.sh
!bash rapids-colab.sh

import sys, os

dist_package_index = sys.path.index("/usr/local/lib/python3.6/dist-packages")
sys.path = sys.path[:dist_package_index] + ["/usr/local/lib/python3.6/site-packages"] + sys.path[dist_package_index:]
sys.path
if os.path.exists('update_pyarrow.py'): ## This file only exists if you're using RAPIDS version 0.11 or higher
  exec(open("update_pyarrow.py").read(), globals())

## Imports

In [None]:
import cudf
import numpy as np
import os
import pandas as pd

from cuml.neighbors.nearest_neighbors import NearestNeighbors as cuKNN
from sklearn.neighbors import NearestNeighbors as skKNN

## Helper Functions

In [None]:
# Load the mortgage dataset if a cached copy exists; otherwise generate a random dataset
import gzip

# Change `cached` if you saved the mortgage dataset in a different directory
def load_data(nrows, ncols, cached='data/mortgage.npy.gz', source='mortgage'):
    if os.path.exists(cached) and source == 'mortgage':
        print('use mortgage data')
        with gzip.open(cached) as f:
            X = np.load(f)
        # Sample nrows rows (with replacement) and keep the first ncols columns;
        # randint's upper bound is exclusive, so X.shape[0] covers every row
        X = X[np.random.randint(0, X.shape[0], nrows), :ncols]
    else:
        # create a random dataset
        print('use random data')
        X = np.random.random((nrows, ncols)).astype('float32')
    # Build a DataFrame with one column per feature, replacing NaNs with 0
    df = pd.DataFrame({'fea%d' % i: X[:, i] for i in range(X.shape[1])}).fillna(0)
    return df

In [None]:
from sklearn.metrics import mean_squared_error

# check whether the results obtained from scikit-learn and cuML agree within a threshold
def array_equal(a, b, threshold=1e-3, with_sign=True, metric='mse'):
    a = to_nparray(a)
    b = to_nparray(b)
    if not with_sign:
        a, b = np.abs(a), np.abs(b)
    if metric == 'mse':
        error = mean_squared_error(a, b)
        res = error < threshold
    elif metric == 'abs':
        # use the absolute difference so that negative deviations are also caught
        error = np.abs(a - b)
        res = not np.any(error > threshold)
    elif metric == 'acc':
        error = np.sum(a != b) / (a.shape[0] * a.shape[1])
        res = error < threshold
    else:
        raise ValueError("metric must be 'mse', 'abs', or 'acc'")
    return res

# check whether the fraction of entries where a exceeds b by more than 1 is below the threshold
def accuracy(a,b, threshold=1e-4):
    a = to_nparray(a)
    b = to_nparray(b)
    c = a-b
    c = len(c[c>1]) / (c.shape[0]*c.shape[1])
    return c<threshold

# convert a NumPy array, pandas DataFrame, or cuDF DataFrame/Series to a NumPy array
def to_nparray(x):
    if isinstance(x, (np.ndarray, pd.DataFrame)):
        return np.array(x)
    elif isinstance(x, np.float64):
        return np.array([x])
    elif isinstance(x, (cudf.DataFrame, cudf.Series)):
        return x.to_pandas().values
    return x

## Define parameters

In [None]:
# nrows = number of samples
# ncols = number of features per sample
nrows = 2**15 
ncols = 40    

# the number of nearest neighbors to return for each query point
n_neighbors = 10

## Load Data

In [None]:
%%time
X = load_data(nrows, ncols)
print('data', X.shape)

## Scikit-learn Model

In [None]:
%%time
# use the scikit-learn KNN model to fit the dataset 
knn_sk = skKNN(metric = 'sqeuclidean')
knn_sk.fit(X)

# find the distances to, and the indices of, the n_neighbors nearest neighbors of every sample
D_sk, I_sk = knn_sk.kneighbors(X, n_neighbors)
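
The scikit-learn model above is built with `metric = 'sqeuclidean'`, which reports squared Euclidean distances. The notebook does not say why; a plausible reading (an assumption, not something stated here) is that this keeps the scikit-learn distances on the same scale as those returned by cuML. The sketch below uses made-up random data to show that squaring changes the distance values but not which points count as nearest.

```python
import numpy as np

rng = np.random.RandomState(0)
data = rng.random_sample((50, 4))   # hypothetical dataset: 50 samples, 4 features
query = data[0]

sq_dist = np.sum((data - query) ** 2, axis=1)   # squared Euclidean distance
dist = np.sqrt(sq_dist)                          # plain Euclidean distance

# Ranking by squared distance also ranks the plain distances,
# because the square root never reverses the order of two values.
order_sq = np.argsort(sq_dist, kind="stable")
print(np.all(np.diff(dist[order_sq]) >= 0))
```

The two metrics therefore agree on the neighbor indices, while the reported distances differ by a square root.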

## cuML Model

In [None]:
%%time
# convert the pandas dataframe to cuDF dataframe
X = cudf.DataFrame.from_pandas(X)

In [None]:
%%time
# use cuML's KNN model to fit the dataset
knn_cuml = cuKNN()
knn_cuml.fit(X)

# find the distances to, and the indices of, the n_neighbors nearest neighbors of every sample
D_cuml, I_cuml = knn_cuml.kneighbors(X, n_neighbors)

## Compare Results

In [None]:
# compare the distances obtained from the scikit-learn and cuML models
passed = array_equal(D_sk, D_cuml, metric='abs')  # metric can be 'acc', 'mse', or 'abs'
message = 'compare knn: cuML vs scikit-learn distances %s' % ('equal' if passed else 'NOT equal')
print(message)
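
The comparison above checks distances only. As a complementary check, one could also measure how many neighbor indices the two models share. The helper `neighbor_overlap` below is a hypothetical sketch that runs on small made-up index arrays; it is not part of cuML or scikit-learn.

```python
import numpy as np

def neighbor_overlap(I_a, I_b):
    """Mean fraction of neighbor indices per row that appear in both results."""
    I_a, I_b = np.asarray(I_a), np.asarray(I_b)
    # Compare rows as sets so that ties returned in a different order still match.
    shared = [len(set(a) & set(b)) for a, b in zip(I_a.tolist(), I_b.tolist())]
    return float(np.mean(shared)) / I_a.shape[1]

# Made-up neighbor indices for two query rows with k = 3.
I_ref = np.array([[0, 3, 5], [1, 2, 4]])
I_new = np.array([[0, 5, 3], [1, 2, 7]])

print(neighbor_overlap(I_ref, I_new))
```

In this notebook the same function could be applied to `I_sk` and `to_nparray(I_cuml)`, assuming both have shape `(nrows, n_neighbors)`.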