![](https://i.imgur.com/bpU9JuI.gif)

# Introduction
Nearest neighbor classifiers label an unlabeled example by assigning it the class of the most similar labeled examples.

Nearest neighbor methods work well for classification tasks where the relationships among the features and the target classes are numerous and extremely difficult to understand, but items of the same class tend to be fairly homogeneous.

They struggle most when no clear distinction exists among the groups.



# k-NN algorithm
The k-Nearest Neighbors (k-NN) algorithm is an example of a nearest neighbor classifier.

## Strengths
* Simple and effective
* Makes no assumptions about the underlying data distribution
* Fast training phase


## Weaknesses
* Doesn't produce a model, which limits the ability to understand how the features relate to the class
* Requires selecting an appropriate k
* Slow classification phase
* Categorical features and missing data require additional preprocessing

The k-NN algorithm labels an unlabeled example using its k nearest neighbors: the example is assigned the majority class among those k neighbors.

To measure how close two examples are, k-NN typically uses **Euclidean distance**, although other distance measures can be used.
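
For reference, the Euclidean distance between two examples $p$ and $q$ with $n$ features can be written as follows (the symbols $p$, $q$ and $n$ are notation introduced here for illustration only):

$$
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
$$

For example, for $p = (1, 2)$ and $q = (4, 6)$ the distance is $\sqrt{3^2 + 4^2} = 5$.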

# Choosing an appropriate k
The choice of k determines how well the model will generalize to future data. A large k reduces the impact of variance caused by noisy data, but it can bias the learner so that it risks ignoring small but important patterns. A small k does the opposite: it can pick up fine patterns, but noisy or mislabeled examples have more influence on each prediction.

In practice, choosing k depends on the difficulty of the concept to be learned and the number of records in the training data.

* Start with a k equal to the square root of the number of training examples.
* Use cross-validation to determine the best value of k.
* Weighted voting is another interesting option: closer neighbors are given a higher weight in the vote than more distant ones.
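
The first two suggestions can be combined in a small experiment. The following is only a sketch, assuming scikit-learn is installed and using its `KNeighborsClassifier` and `cross_val_score` on the iris dataset; the names `k_start`, `candidate_ks` and `cv_scores`, and the choice of 5 folds, are illustrative assumptions rather than part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Rule-of-thumb starting point: square root of the number of examples
k_start = int(round(len(X) ** 0.5))

# Candidate values of k around the starting point (odd values only)
candidate_ks = list(range(1, 2 * k_start + 1, 2))

# Mean 5-fold cross-validated accuracy for each candidate k
cv_scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in candidate_ks
}

# Keep the k with the highest mean accuracy
best_k = max(cv_scores, key=cv_scores.get)
print("starting guess:", k_start)
print("best k by cross-validation:", best_k)
```

The square-root rule only supplies a range to search; the cross-validated scores decide which k is actually kept.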

# k-NN from scratch
* Compute distances between x and all examples in the training set
* Sort by distance and return indexes of the first k neighbors
* Extract the labels of the k nearest neighbor training samples
* Return the most common class label

In [2]:
import numpy as np
from collections import Counter


def euclidean_distance(x1, x2):
    # Square root of the summed squared differences between the two vectors
    return np.sqrt(np.sum((x1 - x2)**2))

We use Euclidean distance to find the nearest neighbors.

Now we define our KNN class.

In [3]:
class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        y_pred = [self._predict(x) for x in X]
        return np.array(y_pred)

    def _predict(self, x):
        # Compute distances between x and all examples in the training set
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
        # Sort by distance and return indexes of the first k neighbors
        k_idx = np.argsort(distances)[:self.k]
        # Extract the labels of the k nearest neighbor training samples
        k_neighbor_labels = [self.y_train[i] for i in k_idx]  
        # return the most common class label
        most_common = Counter(k_neighbor_labels).most_common(1)
        return most_common[0][0]
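
The class above gives every neighbor an equal vote. As a rough sketch of the weighted voting idea mentioned earlier, the variant below lets each neighbor vote with weight 1 / distance. The function `weighted_knn_predict`, the `eps` parameter and the toy data are hypothetical and introduced here only for illustration; inverse distance is one possible weighting among several.

```python
import numpy as np
from collections import defaultdict


def weighted_knn_predict(X_train, y_train, x, k=3, eps=1e-9):
    """Predict the label of x using inverse-distance weighted voting."""
    X_train = np.asarray(X_train, dtype=float)
    # Euclidean distance from x to every training example in one call
    distances = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    # Indices of the k closest training examples
    k_idx = np.argsort(distances)[:k]
    # Each neighbor votes with weight 1 / distance (eps avoids division by zero)
    votes = defaultdict(float)
    for i in k_idx:
        votes[y_train[i]] += 1.0 / (distances[i] + eps)
    # Return the label with the largest total weight
    return max(votes, key=votes.get)


# Toy data: one "a" very close to the query, two "b" examples further away
X_toy = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
y_toy = ["a", "b", "b"]
print(weighted_knn_predict(X_toy, y_toy, [0.5, 0.0], k=3))  # prints: a
```

With plain majority voting and k=3 the same query would be labeled "b", because two of the three neighbors are "b"; the weighting lets the single very close "a" outweigh them.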

We will use the iris dataset to test the KNN model we just created.

In [4]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)


clf = KNN()
clf.fit(X_train, y_train)


def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels
    return np.sum(y_true == y_pred) / len(y_true)


predictions = clf.predict(X_test)
print("custom KNN classification accuracy", accuracy(y_test, predictions))

custom KNN classification accuracy 1.0


# k-NN Classifier

## Defining the dataset

In [6]:
# First Feature
weather=['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny',
'Rainy','Sunny','Overcast','Overcast','Rainy']
# Second Feature
temp=['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','Hot','Mild']

# Label or target variable
play=['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']

We have two features, weather and temperature, and one label, play.

## Encoding data columns

In [7]:
from sklearn import preprocessing

#creating labelEncoder
le = preprocessing.LabelEncoder()

# Converting string labels into numbers.
weather_encoded=le.fit_transform(weather)

Similarly, you can encode temperature and label into numeric columns.

In [9]:
# converting string labels into numbers
temp_encoded=le.fit_transform(temp)
label=le.fit_transform(play)
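
It helps to know which integer each category received. `LabelEncoder` assigns codes in the sorted order of the class labels, and calling `fit_transform` again on the same encoder replaces the earlier mapping. The sketch below uses one encoder per column so every mapping stays available; the names `weather_le`, `temp_le`, `weather_codes` and `temp_codes` are illustrative and not part of the original notebook.

```python
from sklearn.preprocessing import LabelEncoder

weather = ['Sunny', 'Overcast', 'Rainy']
temp = ['Hot', 'Mild', 'Cool']

# One encoder per column, so each mapping stays available later
weather_le = LabelEncoder().fit(weather)
temp_le = LabelEncoder().fit(temp)

# classes_ is sorted; the position of each class is its integer code
weather_codes = {str(c): i for i, c in enumerate(weather_le.classes_)}
temp_codes = {str(c): i for i, c in enumerate(temp_le.classes_)}
print(weather_codes)  # {'Overcast': 0, 'Rainy': 1, 'Sunny': 2}
print(temp_codes)     # {'Cool': 0, 'Hot': 1, 'Mild': 2}
```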

## Combining Features

In [10]:
# combining weather and temp into a single list of tuples
features=list(zip(weather_encoded,temp_encoded))

## Generating Models

In [11]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3)

# Train the model using the training sets
model.fit(features,label)


KNeighborsClassifier(n_neighbors=3)

## Predict Output

In [12]:
predicted= model.predict([[0,2]]) # 0:Overcast, 2:Mild
print(predicted)

[1]
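
The output `[1]` is the encoded label. One way to make the result easier to read is to encode the query by name and decode the prediction back to its original string. This is a sketch that rebuilds the same data with a separate encoder per column; the names `weather_le`, `temp_le`, `play_le`, `query` and `decoded` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

weather = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast',
           'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
temp = ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
        'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
        'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

# Separate encoders keep the mapping of every column
weather_le, temp_le, play_le = LabelEncoder(), LabelEncoder(), LabelEncoder()
features = np.column_stack([weather_le.fit_transform(weather),
                            temp_le.fit_transform(temp)])
label = play_le.fit_transform(play)

model = KNeighborsClassifier(n_neighbors=3).fit(features, label)

# Encode the query by name instead of hard-coding [0, 2]
query = [[weather_le.transform(['Overcast'])[0],
          temp_le.transform(['Mild'])[0]]]
predicted = model.predict(query)

# Map the numeric prediction back to the original label
decoded = play_le.inverse_transform(predicted)
print(decoded[0])  # prints: Yes
```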


# k-NN Regressor

In [13]:
from sklearn.datasets import load_boston
from sklearn.neighbors import KNeighborsRegressor
X,y = load_boston(return_X_y=True)
mod = KNeighborsRegressor()
mod.fit(X,y)


Note: on scikit-learn 1.0 and 1.1 this cell prints a deprecation warning. The Boston housing prices dataset has an ethical problem, and the scikit-learn maintainers strongly discourage its use unless the purpose of the code is to study and educate about ethical issues in data science and machine learning. `load_boston` was removed in scikit-learn 1.2. The warning suggests the California housing dataset (`sklearn.datasets.fetch_california_housing`) and the Ames housing dataset as alternatives.

KNeighborsRegressor()
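
Because `load_boston` is not available in recent scikit-learn releases, the following sketch shows the same idea on the bundled diabetes dataset. That dataset is a substitution made here for illustration, not the one used in this notebook, and the train/test split and variable names are likewise assumptions. The sketch also checks that, with the default uniform weights, a k-NN regression prediction is the mean target of the k nearest training examples.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234
)

# n_neighbors=5 is also the scikit-learn default
reg = KNeighborsRegressor(n_neighbors=5)
reg.fit(X_train, y_train)

# Predict on held-out data rather than on the training data
preds = reg.predict(X_test)

# Indices of the 5 nearest training examples for the first test row
idx = reg.kneighbors(X_test[:1], return_distance=False)[0]
manual_pred = y_train[idx].mean()

print("first prediction:", preds[0])
print("mean of neighbor targets:", manual_pred)
print("R^2 on held-out data:", reg.score(X_test, y_test))
```

Evaluating on a held-out split gives a more honest picture than predicting on the same rows the model was fitted on.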

In [14]:
mod.predict(X)

array([21.78, 22.9 , 25.36, 26.06, 27.1 , 27.1 , 20.88, 19.1 , 18.4 ,
       19.48, 19.28, 22.  , 24.34, 20.52, 24.66, 21.3 , 30.48, 20.4 ,
       15.7 , 23.54, 16.82, 17.64, 18.3 , 17.08, 16.66, 15.1 , 16.78,
       14.94, 19.94, 18.34, 14.1 , 16.82, 15.12, 14.1 , 15.12, 26.92,
       22.14, 27.4 , 28.44, 31.88, 31.88, 25.36, 25.36, 24.22, 20.68,
       20.44, 20.44, 18.1 , 18.1 , 24.  , 21.54, 24.  , 27.16, 27.16,
       25.7 , 39.82, 27.08, 38.28, 24.8 , 25.64, 21.78, 33.6 , 21.78,
       24.06, 31.74, 25.3 , 26.98, 22.18, 20.42, 20.42, 27.76, 29.5 ,
       27.76, 27.76, 22.92, 21.64, 25.82, 21.64, 21.38, 22.02, 24.8 ,
       21.88, 25.22, 25.64, 25.98, 25.98, 23.28, 25.98, 24.02, 25.58,
       25.58, 25.06, 26.34, 26.04, 30.1 , 24.84, 23.62, 24.32, 28.52,
       24.96, 22.1 , 22.2 , 15.34, 19.74, 19.74, 19.66, 19.56, 21.34,
       19.66, 19.56, 22.08, 20.1 , 19.6 , 17.54, 20.1 , 17.7 , 20.2 ,
       20.1 , 20.66, 19.8 , 22.76, 20.6 , 19.66, 18.52, 19.66, 20.6 ,
       18.52, 16.62,