# KNN -> sklearn VS RAPIDS

# Setup
This notebook was tested with a RAPIDS 0.15 nightly build on a Titan RTX GPU (the sample output below was captured on a Tesla T4).

Before we begin, let's check out our hardware setup by running the nvidia-smi command.

In [1]:
!nvidia-smi

Sat Feb  6 00:36:44 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   37C    P0    26W /  70W |    552MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Next, let's see what CUDA version we have:

In [2]:
!nvcc --version

/bin/bash: nvcc: command not found


# What is K-Nearest Neighbors?

The K-nearest neighbors (KNN) algorithm is a supervised machine learning algorithm.
It is non-parametric, which means that it makes no assumptions about the form of the underlying data distribution.


It simply calculates the distance from a new data point to every training data point. The distance can be of any type, e.g. Euclidean or Manhattan. It then selects the K nearest training points, where K is a positive integer. Finally, it assigns the new point to the class to which the majority of those K points belong.
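
The three steps just described (measure distances, keep the K nearest, take a majority vote) can be sketched in plain Python. This is only an illustration, not what sklearn or RAPIDS do internally: the function name `knn_predict` and the toy points are made up for this example, and Euclidean distance via `math.dist` (Python 3.8+) is assumed as the default.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3, distance=math.dist):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 1: distance from the query to every training point, paired with its label
    distances = [
        (distance(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 2: keep the k closest training points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote over the labels of those k points
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters in 2-D
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.1, 1.0), k=3))  # a
print(knn_predict(points, labels, (5.1, 5.0), k=3))  # b
```

Note that there is no separate training step: the "model" is just the stored training data, and all the work happens at prediction time.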

# Pros and Cons of KNN 
(courtesy of Google, of course)
# Pros
->It is extremely easy to implement.

->It is a lazy learning algorithm: it needs no training phase before making predictions, because it simply stores the training data. This makes KNN much faster to set up than algorithms that require training, e.g. SVM or linear regression, although the work is deferred to prediction time.

->Since the algorithm requires no training before making predictions, new data can be added seamlessly.

->Only two parameters are required to implement KNN: the value of K and the distance function (e.g. Euclidean or Manhattan).
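
To make the second parameter concrete, here is a small sketch of the two distance functions named above. The helper names `euclidean` and `manhattan` are chosen for this example only.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of the absolute differences per dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

In scikit-learn's `KNeighborsClassifier`, these two choices correspond to the `n_neighbors` and `metric` arguments.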

# Cons

->The KNN algorithm doesn't work well with high-dimensional data: as the number of dimensions grows, the distances between points become increasingly similar, so the "nearest" neighbors are less meaningful.

->The KNN algorithm has a high prediction cost for large datasets, because every prediction requires computing the distance between the new point and each existing point.

->Finally, the KNN algorithm doesn't work well with categorical features, since it is difficult to define a meaningful distance between categorical values.
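
The high-dimensionality problem can be seen in a small experiment. The sketch below uses uniformly random points (an assumption made for illustration, not the iris data) and a hypothetical helper `nearest_to_farthest_ratio`. It compares the distance to the nearest point with the distance to the farthest point: when that ratio approaches 1, "nearest" stops meaning much.

```python
import math
import random

def nearest_to_farthest_ratio(n_dims, n_points=200, seed=0):
    """Ratio of the nearest to the farthest distance from a random query point."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    query = [rng.random() for _ in range(n_dims)]
    dists = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return min(dists) / max(dists)

# The ratio grows towards 1 as the number of dimensions increases
for n_dims in (2, 20, 200):
    print(n_dims, round(nearest_to_farthest_ratio(n_dims), 3))
```

With only four features, the iris data is nowhere near the regime where this becomes a concern.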

# What are we doing here?

We are going to use the famous iris dataset for our KNN example. The dataset consists of four attributes: sepal-width, sepal-length, petal-width and petal-length. These are the attributes of specific types of iris plants. The task is to predict the class to which these plants belong. There are three classes in the dataset: Iris-setosa, Iris-versicolor and Iris-virginica.

# Part 1: SKLEARN 
What you've seen previously

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Assign column names to the dataset

In [4]:
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
names_numeric = [1,2,3,4,5]

# Data Loading from Amazon S3 bucket

# Read dataset to pandas dataframe

In [5]:
dataset = pd.read_csv("https://rapids-keerthi.s3-us-west-1.amazonaws.com/iris.csv", names=names)

print("Number of records = ", len(dataset))

Number of records =  1042128


# Print the first five rows

In [6]:
dataset.head()

Unnamed: 0,sepal-length,sepal-width,petal-length,petal-width,Class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


# Time to preprocess our flowers

In [7]:
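from sklearn.model_selection import train_test_split

# Features: every column except the last; labels: the 'Class' column
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

# Hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)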

# How many rows end up in train and how many in test? (note the test_size above)

In [8]:
len(y_test)

260532
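
The same numbers can be worked out by hand from the record count and `test_size`. This sketch assumes the test share is rounded up and the training set receives the remainder, which is how sklearn's `train_test_split` behaves when only `test_size` is given.

```python
import math

n_records = 1042128  # from the "Number of records" output above
test_size = 0.25     # the test_size passed to train_test_split

# Assumption: test rows are rounded up, train rows are whatever is left
n_test = math.ceil(n_records * test_size)
n_train = n_records - n_test

print(n_test)   # 260532
print(n_train)  # 781596
```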

# Let's scale the features
The scaler is fitted on X_train (the independent variables of the training set) and then applied to both the training and test sets.

In [9]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
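
As a rough picture of what the scaler does, the sketch below standardizes a single made-up feature by hand: the mean and standard deviation are taken from the training values only and then reused for the test values. The numbers are hypothetical, and the population standard deviation is used on the assumption that it matches `StandardScaler`'s default.

```python
import statistics

# Hypothetical values of one feature
train_values = [4.0, 6.0, 8.0]
test_values = [5.0, 10.0]

# "fit": learn the statistics from the training data only
mean = statistics.mean(train_values)
std = statistics.pstdev(train_values)  # population standard deviation

# "transform": apply the same statistics to both sets
train_scaled = [(x - mean) / std for x in train_values]
test_scaled = [(x - mean) / std for x in test_values]

print(train_scaled)
print(test_scaled)
```

Scaling matters for KNN because a feature with a larger numeric range would otherwise dominate the distance calculation.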

In [10]:
from sklearn.neighbors import KNeighborsClassifier
import time

starttime=time.time()

classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

endtime=time.time()

sklearn_time = (endtime-starttime)
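
The cell above times fit and predict together. To see how the time divides between the two calls, a small helper along these lines could be used. `timed` is a hypothetical name, not part of sklearn or RAPIDS, and the workload below is only a stand-in so that the sketch runs on its own.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload: sum the first million integers
total, elapsed = timed(sum, range(1_000_000))
print(total, f"{elapsed:.4f} s")
```

In this notebook, `_, fit_time = timed(classifier.fit, X_train, y_train)` followed by `y_pred, predict_time = timed(classifier.predict, X_test)` would report the two phases separately.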

# Part 2: Time to check how fast RAPIDS is

Import the necessary libraries.

In [11]:
import cudf, cuml
import cupy as cp
from cuml.neighbors import KNeighborsClassifier as cuKNeighbors

In [12]:
train = cudf.read_csv('https://rapids-keerthi.s3-us-west-1.amazonaws.com/iris_rapids_numeric.csv',names=names_numeric)

In [13]:
X = train.iloc[:,:-1]
y = train.iloc[:,4]

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

The call model.predict(X_test) does all the work.
Here we use 3 as the number of neighbors.

In [15]:
model = cuKNeighbors(n_neighbors=3)
start_time_rapids = time.time()
model.fit(X_train,y_train)
y_hat = model.predict(X_test)
end_time_rapids = time.time()

In [16]:
rapids_time = (end_time_rapids - start_time_rapids )

In [17]:
print("Time taken by sklearn = ", sklearn_time)
print("Time taken by rapids = ", rapids_time)

Time taken by sklearn =  103.05224370956421
Time taken by rapids =  25.882997512817383
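
The two timings are easier to compare as a ratio. The values below are copied from the output of this particular run; a different GPU, CPU or dataset would give different numbers.

```python
sklearn_time = 103.05224370956421  # seconds, from the output above
rapids_time = 25.882997512817383   # seconds, from the output above

speedup = sklearn_time / rapids_time
print(f"RAPIDS was about {speedup:.1f}x faster on this run")
```

On this run the ratio works out to roughly 4x.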
