# K-Nearest Neighbors Classification Demo

K-nearest neighbors (KNN) classification predicts the label of an unseen sample from the labels of its k nearest neighbors in the training data, typically by majority vote.
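To make the idea concrete, here is a minimal brute-force sketch in NumPy. It is an illustration only, not how scikit-learn or cuML implement KNN; the function name `knn_predict` and the toy data are hypothetical, and it assumes Euclidean distance and non-negative integer labels.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Label each query row by majority vote among its k nearest training rows."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.einsum("qtf,qtf->qt", diffs, diffs)
    # Indices of the k smallest distances per query row (order within the k is irrelevant).
    nearest = np.argpartition(dists, k - 1, axis=1)[:, :k]
    # Majority vote over the neighbors' labels; ties go to the smallest label.
    votes = y_train[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

# Toy data (hypothetical): two well-separated clusters labeled 0 and 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[0.05, 0.05], [5.0, 5.1]])

pred = knn_predict(X_train, y_train, X_query, k=3)
print(pred)  # [0 1]
```

The sketch materializes the full distance matrix, which is fine for toy data but scales with the product of query and training set sizes.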

The model accepts array-like inputs: NumPy arrays in host memory, device arrays (Numba arrays or any cuda_array_interface-compliant object), and cuDF DataFrames.

For information on converting your dataset to cuDF format, refer to the cuDF documentation: https://rapidsai.github.io/projects/cudf/en/latest/

For additional information on cuML's Nearest Neighbors implementation, see the API documentation: https://rapidsai.github.io/projects/cuml/en/latest/api.html#nearest-neighbors

In [1]:
import os

import numpy as np

from sklearn.datasets import make_blobs

import pandas as pd
import cudf as gd

from sklearn.neighbors import KNeighborsClassifier as skKNC
from cuml.neighbors import KNeighborsClassifier as cumlKNC

## Define Parameters

In [2]:
n_samples = 2**17
n_features = 40

n_query = 5000

n_neighbors = 4
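
As a rough sense of scale for these parameters, the arithmetic below counts the query/training pairs a brute-force search has to consider. It is a back-of-the-envelope sketch: the float32 assumption and the idea of holding the whole distance matrix in memory at once are illustrative assumptions, and real implementations may process the queries in batches instead.

```python
n_samples = 2**17   # 131,072 training rows, as defined above
n_features = 40
n_query = 5000

# One distance per (query, training) pair.
n_pairs = n_query * n_samples
# Each distance touches every feature once.
n_feature_ops = n_pairs * n_features
# Size of a full float32 distance matrix (assumption: 4 bytes per entry).
dist_matrix_gib = n_pairs * 4 / 2**30

print(f"pairs:            {n_pairs:,}")
print(f"feature ops:      {n_feature_ops:,}")
print(f"distance matrix:  {dist_matrix_gib:.2f} GiB")
```

With these values there are 655,360,000 pairs, and a full float32 distance matrix would take about 2.44 GiB.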

## Generate Data

### Host

In [3]:
%%time
X_host_train, y_host_train = make_blobs(
    n_samples=n_samples, n_features=n_features, centers=5, random_state=0)

X_host_train = pd.DataFrame(X_host_train)
y_host_train = pd.DataFrame(y_host_train)

CPU times: user 264 ms, sys: 13.1 ms, total: 277 ms
Wall time: 277 ms


In [4]:
%%time
X_host_test, y_host_test = make_blobs(
    n_samples=n_query, n_features=n_features, centers=5, random_state=0)

X_host_test = pd.DataFrame(X_host_test)
y_host_test = pd.DataFrame(y_host_test)

CPU times: user 10.2 ms, sys: 2.17 ms, total: 12.4 ms
Wall time: 11.4 ms


### Device

In [5]:
X_device_train = gd.DataFrame.from_pandas(X_host_train)
y_device_train = gd.DataFrame.from_pandas(y_host_train)

In [6]:
X_device_test = gd.DataFrame.from_pandas(X_host_test)
y_device_test = gd.DataFrame.from_pandas(y_host_test)

## Scikit-learn Model

In [7]:
%%time
knn_sk = skKNC(algorithm="brute", n_neighbors=n_neighbors, n_jobs=-1)
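# Pass y as a 1-D array; a single-column DataFrame can trigger a DataConversionWarning.
knn_sk.fit(X_host_train, y_host_train.values.ravel())

sk_result = knn_sk.predict(X_host_test)
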
CPU times: user 1min 44s, sys: 51.9 s, total: 2min 36s
Wall time: 18.4 s


## cuML Model

In [8]:
%%time
knn_cuml = cumlKNC(n_neighbors=n_neighbors)
knn_cuml.fit(X_device_train, y_device_train)

cuml_result = knn_cuml.predict(X_device_test)

CPU times: user 1.39 s, sys: 358 ms, total: 1.75 s
Wall time: 1.79 s
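
The two wall times reported above can be turned into a rough speedup figure. This is plain arithmetic on the numbers printed by the two `%%time` cells, not a benchmark: each figure comes from a single run covering both `fit` and `predict`, with no warm-up, on hardware the notebook does not state.

```python
# Wall times copied from the %%time outputs above (seconds).
sk_wall_s = 18.4     # scikit-learn cell
cuml_wall_s = 1.79   # cuML cell

speedup = sk_wall_s / cuml_wall_s
print(f"cuML was about {speedup:.1f}x faster in wall time")  # about 10.3x
```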


## Compare Results

In [10]:
# Compare the cuML predictions element-wise with the scikit-learn predictions.
passed = np.array_equal(np.asarray(cuml_result.as_gpu_matrix())[:, 0], sk_result)
print('compare knn: cuml vs sklearn classes %s' % ('equal' if passed else 'NOT equal'))

compare knn: cuml vs sklearn classes equal
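
`np.array_equal` yields a single boolean, so one differing label (for example, if two implementations break a voting tie differently) reports the same as a complete mismatch. A softer check is the fraction of matching labels. The helper below is a hypothetical sketch that operates on host arrays; the name `agreement_rate` and the toy inputs are illustrative and are not the notebook's results.

```python
import numpy as np

def agreement_rate(a, b):
    """Fraction of positions where two label arrays agree (inputs are flattened)."""
    a = np.asarray(a).ravel()
    b = np.asarray(b).ravel()
    if a.shape != b.shape:
        raise ValueError(f"length mismatch: {a.shape[0]} vs {b.shape[0]}")
    return float(np.mean(a == b))

# Toy labels (hypothetical): a 1-D array and a single-column 2-D array.
labels_a = np.array([0, 1, 2, 2, 1])
labels_b = np.array([[0], [1], [2], [1], [1]])

rate = agreement_rate(labels_a, labels_b)
print(f"agreement: {rate:.0%}")  # agreement: 80%
```

Flattening both inputs first lets the helper accept a column-shaped result alongside a 1-D one, which is the shape difference the comparison above handles with `[:, 0]`.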
