## K-Nearest Neighbours (Regression)

K-Nearest Neighbours is one of the most basic yet essential classification algorithms in Machine Learning. It belongs to the supervised learning domain and finds intense application in pattern recognition, data mining and intrusion detection.

It is widely disposable in real-life scenarios since it is non-parametric, meaning, it does not make any underlying assumptions about the distribution of data.

### problem Statement :
create a 1 nn algorithm for regression

data = cars

target = MPG

predictors = [weight,horsepower,displacement,acceleration]

1. divide the data into train and test (300,106)
2. seperate x train, y train, x test, y test,
3. standardize x
4. design a 1nn algorithm to predict mpg for all x test using x train

In [1]:
# Libraries
import os
import numpy as np
import pandas as pd
os.chdir(r'C:\Users\shameel\Desktop\Praxis')
cars = pd.read_csv("cars.csv")
predictors = cars[['Weight','Horsepower','Displacement','Acceleration']]
target = cars["MPG"]
shuffle = cars.sample(406)


In [2]:
# dividing the data
xtrain = shuffle.iloc[0:300,3:7]
ytrain = shuffle.iloc[0:300,1]
xtest = shuffle.iloc[301:406,3:7]
ytest = shuffle.iloc[301:406,1]

In [3]:
# standardizing the data
xtm = predictors.mean()
xts = predictors.std()
ytm = target.mean()
yts = target.std()
xtrainst = ((xtrain - xtm)/xts)
ytrainst = ((ytrain - ytm)/yts)
xtestst = ((xtest - xtm)/xts)
ytestst = ((ytest - ytm)/yts)


In [4]:
# defining a function to determine the distance between 2 points
def equiddist(a,b):
    d = np.sqrt(sum((a-b)**2))
    return (d)

In [28]:
#1nn algorithm to predict the ytrain
def oneNNclass(l):
    dist2 = []
    y = []
    for i in range (0,len(l)):    
        for j in range (0,len(xtrain)):
            d = equiddist (xtrainst.iloc[j,],xtestst.iloc[i,])  # to find the distance for every xtest value to every xtrain value                  
            dist2.append(d)                       # appending all distance in a list
        yindex = dist2.index(min(dist2))          # finding the index of every min(dist)
        y.append(ytrain.iloc[yindex,])            # appending the appropriate y train value using the index found above   
        dist2 = [] 
    return np.array(y)        
    #return(main)                 #printing the predicted value



In [30]:
xt = xtest.copy()

In [31]:
# all the Predicted value for the test data
xt['ypred'] = oneNNclass(xtest)

In [33]:
xt['yactual'] = ytest

In [34]:
# comparing the actual and predicted value
xt

Unnamed: 0,Displacement,Horsepower,Weight,Acceleration,ypred,yactual
328,108.0,75,2265,15.2,24.0,32.2
320,151.0,90,2678,16.5,26.6,28.0
166,351.0,148,4657,13.5,13.0,14.0
184,115.0,95,2694,15.0,24.0,23.0
86,96.0,69,2189,18.0,33.8,26.0
...,...,...,...,...,...,...
171,231.0,110,3039,15.0,22.0,21.0
400,151.0,90,2950,17.3,22.3,27.0
115,97.0,88,2279,19.0,26.0,20.0
212,97.0,75,2155,16.4,29.0,28.0


In [36]:
#calculating the rmse error
error = np.mean((xt['yactual'] - xt['ypred'])**2)

In [38]:
print('rmse for the predicted value:',error)

rmse for the predicted value: 29.93057142857143


### problem Statement :
create a knn algorithm for regression

data = cars

target = MPG

predictors = [weight,horsepower,displacement,acceleration]

1. divide the data into train and test (300,106)
2. seperate x train, y train, x test, y test,
3. standardize x
4. design a knn algorithm to predict mpg for all x test using x train

In [39]:
# knn algorithm to find 5 nearest neigbour to predict ytrain
def KNNclass(k):
    dist2 = []
    y = []
    b = [] 
    c = []
    for i in range (0,len(k)):    
        for j in range (0,len(xtrain)):
            d = equiddist (xtrainst.iloc[j,],xtestst.iloc[i,]) # to find the distance for every xtest value to every xtrain value
            dist2.append(d)
            v = dist2.copy()
            v.sort()                        #sorting the distance found in ascending order

        for k in v[0:5]:
            yindex = dist2.index(k)          #finding the index of the first 5 shortest distance
            b.append(ytrain.iloc[yindex,])   #appending the value of the ytrain of the 5 values
        c.append(sum(b)/len(b))              # finding the average for the 5 ytrain values
        dist2 = []
        v = []
        b = []
    return np.array(c)
    #print(main2)



In [41]:
xt = xtest.copy()

In [46]:
# the predicted value for knn algorithm
xt['ypred'] = KNNclass(xtest)

In [43]:
xt['yactual'] = ytest

In [47]:
# comparing the actual and predicted value
xt

Unnamed: 0,Displacement,Horsepower,Weight,Acceleration,yactual,ypred
328,108.0,75,2265,15.2,32.2,31.00
320,151.0,90,2678,16.5,28.0,26.30
166,351.0,148,4657,13.5,14.0,13.40
184,115.0,95,2694,15.0,23.0,26.58
86,96.0,69,2189,18.0,26.0,35.14
...,...,...,...,...,...,...
171,231.0,110,3039,15.0,21.0,20.50
400,151.0,90,2950,17.3,27.0,24.50
115,97.0,88,2279,19.0,20.0,29.28
212,97.0,75,2155,16.4,28.0,28.50


In [48]:
#calculating the rmse error
error = np.mean((xt['yactual'] - xt['ypred'])**2)

In [50]:
print('rmse for the predicted value:',error)

rmse for the predicted value: 28.705603809523797


as you can see the rmse error is lesser in KNN model than in 1NN model

### Problem Statement:

1nn algorithm for classification 

data - iris

traget - species

predictors - [sepal.length, sepal.width, petal length, petal width]

1. divide the data into train and test (100,49)
2. seperate x train, y train, x test, y test,
3. standardize x
4. design a 1nn algorithm to classify species for all x test using x train


In [51]:
#reading the file
os.chdir(r'C:\Users\shameel\Desktop\Praxis')
cars = pd.read_csv("iris.csv")
cars

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [52]:
# assigning values
predictors = cars[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
target = cars['species']
shuffle = cars.sample(150)

In [53]:
# dividing the data
xtrain = shuffle.iloc[0:100,0:4]
ytrain = shuffle.iloc[0:100,4]
xtest = shuffle.iloc[101:,0:4]
ytest = shuffle.iloc[101:,4]

In [54]:
# standardizing the data
xtm = predictors.mean()
xts = predictors.std()
xtrainst = ((xtrain - xtm)/xts)
xtestst = ((xtest - xtm)/xts)

In [55]:
# defining a function to determine the distance between 2 points
def equiddist(a,b):
    d = np.sqrt(sum((a-b)**2))
    return (d)

In [56]:
# 1nn algorithm to find 5 nearest neigbour to predict ytrain
def onenniris(k):
    dist = []
    y = []
    for i in range (0,len(k)):                   
        for j in range (0,len(xtrain)):
            d = equiddist (xtrain.iloc[j],xtest.iloc[i])    # to find the distance for every xtest value to every xtrain value       
            dist.append(d)                                  # appending all distance in a list
        yindex = dist.index(min(dist),)                     # finding the index of every min(dist)
        y.append(ytrain.iloc[yindex,])                      # appending the appropriate y train value using the index found above
        dist = []
    return np.array(y)
    print(main)

In [57]:
xt = xtest.copy()

In [58]:
# the predicted value for knn algorithm
xt['ypred'] = onenniris(xtest)

In [59]:
xt['yactual'] = ytest

In [61]:
# compare test and pred data
xt

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,ypred,yactual
2,4.7,3.2,1.3,0.2,Iris-setosa,Iris-setosa
42,4.4,3.2,1.3,0.2,Iris-setosa,Iris-setosa
49,5.0,3.3,1.4,0.2,Iris-setosa,Iris-setosa
60,5.0,2.0,3.5,1.0,Iris-versicolor,Iris-versicolor
148,6.2,3.4,5.4,2.3,Iris-virginica,Iris-virginica
80,5.5,2.4,3.8,1.1,Iris-versicolor,Iris-versicolor
56,6.3,3.3,4.7,1.6,Iris-versicolor,Iris-versicolor
23,5.1,3.3,1.7,0.5,Iris-setosa,Iris-setosa
111,6.4,2.7,5.3,1.9,Iris-virginica,Iris-virginica
32,5.2,4.1,1.5,0.1,Iris-setosa,Iris-setosa


In [63]:
from sklearn.metrics import confusion_matrix, f1_score, classification_report, accuracy_score
from sklearn.metrics import confusion_matrix

In [64]:
#finding the overall accuracy of the prediction made by the model
print('Confusion Matrix:\n', confusion_matrix(ytest, xt['ypred']))
print('\n')
print('Classification Report:\n', classification_report(ytest, xt['ypred']))

Confusion Matrix:
 [[17  0  0]
 [ 0 17  1]
 [ 0  1 13]]


Classification Report:
                  precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        17
Iris-versicolor       0.94      0.94      0.94        18
 Iris-virginica       0.93      0.93      0.93        14

       accuracy                           0.96        49
      macro avg       0.96      0.96      0.96        49
   weighted avg       0.96      0.96      0.96        49



### Problem Statement:

knn algorithm for classification 

data - iris

traget - species

predictors - [sepal.length, sepal.width, petal length, petal width]

1. divide the data into train and test (100,49)
2. seperate x train, y train, x test, y test,
3. standardize x
4. design a knn algorithm to classify species for all x test using x train

In [65]:
# knn algorithm to find 5 nearest neigbour to predict ytrain
def knniris(k):
    dist = []
    v = []
    y = []
    for i in range (0,len(k)):
        for j in range (0,len(xtrain)):
            d = equiddist (xtrain.iloc[j],xtest.iloc[i])    # to find the distance for every xtest value to every xtrain value
            dist.append(d)
            dist2 = dist.copy()
            dist2.sort()                                    # sorting the distance found in ascending order
        for k in dist2[0:5]:
            yindex = dist.index(k)                          #finding the index of the first 5 shortest distance
            v.append(ytrain.iloc[yindex])                   #appending the value of the ytrain of the 5 values
        final = pd.DataFrame(v)
        finalmode = final.mode()                            # finding the mode for the 5 ytrain values
        y.append(finalmode[0])
        v=[]
        dist = []
    return np.array(y)
    print(main2)

In [66]:
xt = xtest.copy()

In [67]:
# the predicted value for knn algorithm
xt['ypred'] = onenniris(xtest)

In [68]:
xt['yactual'] = ytest

In [69]:
#compare test and pred data
xt

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,ypred,yactual
2,4.7,3.2,1.3,0.2,Iris-setosa,Iris-setosa
42,4.4,3.2,1.3,0.2,Iris-setosa,Iris-setosa
49,5.0,3.3,1.4,0.2,Iris-setosa,Iris-setosa
60,5.0,2.0,3.5,1.0,Iris-versicolor,Iris-versicolor
148,6.2,3.4,5.4,2.3,Iris-virginica,Iris-virginica
80,5.5,2.4,3.8,1.1,Iris-versicolor,Iris-versicolor
56,6.3,3.3,4.7,1.6,Iris-versicolor,Iris-versicolor
23,5.1,3.3,1.7,0.5,Iris-setosa,Iris-setosa
111,6.4,2.7,5.3,1.9,Iris-virginica,Iris-virginica
32,5.2,4.1,1.5,0.1,Iris-setosa,Iris-setosa


In [70]:
#finding the overall accuracy of the prediction made by the model
print('Confusion Matrix:\n', confusion_matrix(ytest, xt['ypred']))
print('\n')
print('Classification Report:\n', classification_report(ytest, xt['ypred']))

Confusion Matrix:
 [[17  0  0]
 [ 0 17  1]
 [ 0  1 13]]


Classification Report:
                  precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        17
Iris-versicolor       0.94      0.94      0.94        18
 Iris-virginica       0.93      0.93      0.93        14

       accuracy                           0.96        49
      macro avg       0.96      0.96      0.96        49
   weighted avg       0.96      0.96      0.96        49

