# Part 1: Implementation

In [1]:
import csv
import random
import math  
import numpy as np

# Load the data and split it randomly into a train dataset and a test dataset; the function returns both datasets
def retrievData(filename,train,test):
    with open(filename,'r') as f:
        lines=f.readlines()   
        for l in lines:            
            if random.random()<0.66:                
                train.append([float(i) for i in l.split(",")])
            else:
                test.append([float(i) for i in l.split(",")])
               
    return train,test
# Function that calculates and returns the Euclidean distance between a test row and a train row.
# The last column of a train row is the class label, so it is left out of the distance.
def calcDistance(x_test,x_train):
    distance = 0
    for i in range(len(x_train)-1):
        distance += pow((x_test[i] - x_train[i]), 2)
    return math.sqrt(distance)
# Function to find the k nearest neighbours of a given test object
def findNeighbours(testd,trainX, k):
    similarityList=[]
    neighbours=[]
    for t in trainX:
        d=calcDistance(testd,t)
        similarityList.append((t,d))
        similarityList.sort(key = lambda x: x[1]) 
    for i in range(k):
        neighbours.append(similarityList[i][0])
    return neighbours
# Function which appends the predicted label to a copy of a given test object,
# so that it can be compared with the actual label in a later process
def addPrediction(test,train,k):
    neighbours=findNeighbours(test,train,k)
    # collect the labels (last column) of the neighbours; np.bincount needs non-negative integers
    labels=[]
    for n in neighbours:
        labels.append(int(n[len(n)-1]))

    # return a new row, so that repeated calls do not keep growing the original test row
    return test+[np.bincount(labels).argmax()]
# Function to attach the prediction to all the test objects in the test dataset
def makePredForAll(test,train,k): 
    predTest=[]
    for t in test:
        predTest.append(addPrediction(t,train,k))
    return predTest
# Function which counts the correct predictions and returns the accuracy as a fraction
def classificationAcc(predictedD):
    count=0
    for p in predictedD:
        if(p[len(p)-1]==p[len(p)-2]):
            count+=1
    return float(count/len(predictedD))
# main function which pulls everything together to show the final results
def main():
    print("Time measurement for the function retrievData.......")
    %timeit retrievData("IRIS-2.csv",train=[],test=[])
    trainS,testS=retrievData("IRIS-2.csv",train=[],test=[])
    print()
    print("Time measurement for the function makePredForAll.......")
    %timeit -r 1 -n 1 makePredForAll(testS,trainS,5)
    d=makePredForAll(testS,trainS,5)
    print()
    print("Time measurement for the function classificationAcc.......")
    %timeit classificationAcc(d)*100
    percenInAccuracy=classificationAcc(d)*100
    print()
    print("The accuracy in percent is : %2.2f" %percenInAccuracy,"%",sep='' )
    print()

main()

Time measurement for the function retrievData.......
947 µs ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Time measurement for the function makePredForAll.......
93 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Time measurement for the function classificationAcc.......
36.1 µs ± 837 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The accuracy in percent is : 93.75%



## Source code documentation

The k-NN implementation is organized into functions that each perform one specific task. The functions are listed below, each with a short description of what it does.

### retrievData(filename,train,test) 

Its main task is to read the data from the given file and split it at random: each row is added to the training list with probability 0.66 and to the test list otherwise. It returns the two lists, which are used as the training and test datasets.
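
Because the split is random, every run produces different training and test sets, and therefore a slightly different accuracy. A minimal sketch of a repeatable split is shown below. The helper `split_rows` and its `seed` parameter are hypothetical and not part of the implementation above.

```python
import random

def split_rows(rows, train_fraction=0.66, seed=0):
    """Randomly assign each row to the training or the test list (hypothetical helper)."""
    rng = random.Random(seed)  # private generator, so the global random state is untouched
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test

# Ten dummy rows: one feature value followed by a label of 0.0
rows = [[float(i), 0.0] for i in range(10)]
train_a, test_a = split_rows(rows, seed=42)
train_b, test_b = split_rows(rows, seed=42)
print(train_a == train_b and test_a == test_b)  # same seed, same split
```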
 
### calcDistance(x_test,x_train)

This function takes one row from the test dataset and one row from the training dataset and returns the Euclidean distance between them, which serves as the similarity measure. The last column of a row holds the class label and is not included in the distance.
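
As a small worked example of the distance calculation, the sketch below computes the Euclidean distance between two rows with two features each, followed by a label. The name `euclidean_distance` and the sample rows are made up for illustration.

```python
import math

def euclidean_distance(row_a, row_b):
    """Distance over the feature columns; the last column of each row is the label."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row_a[:-1], row_b[:-1])))

# Features (0, 0) and (3, 4): sqrt(3**2 + 4**2) = 5
d = euclidean_distance([0.0, 0.0, 1.0], [3.0, 4.0, 2.0])
print(d)  # 5.0
```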

### findNeighbours(testd,trainX, k)

This function takes three parameters:

- testd: a single row from the test dataset
- trainX: the whole training dataset, given as a list of rows
- k: the number of nearest neighbours to select

The function calculates the distance from testd to every row in trainX and stores the results in a similarity list. From this list, sorted by distance, the k nearest neighbours are selected and returned.
   
### addPrediction(test,train,k)

It finds the most common label among the k nearest neighbours and returns the test row with that label appended, so that the prediction can be compared with the actual label in a later process.
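
The majority vote can also be written with `collections.Counter` from the standard library, which does not require the labels to be non-negative integers. This is only a sketch of an alternative: `majority_label` and the sample rows are hypothetical, and ties may be resolved differently than with `np.bincount(...).argmax()`, which picks the smallest label.

```python
from collections import Counter

def majority_label(neighbours):
    """Return the label (last column) that occurs most often among the neighbours."""
    labels = [row[-1] for row in neighbours]
    return Counter(labels).most_common(1)[0][0]

# Two neighbours with label 0.0 and one with label 1.0
neighbours = [[5.1, 3.5, 0.0], [4.9, 3.0, 0.0], [6.0, 2.9, 1.0]]
print(majority_label(neighbours))  # 0.0
```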

### makePredForAll(test,train,k)

It uses the addPrediction(...) function to make the predictions for the whole test dataset.
 
### classificationAcc(predictedD)
   
It counts how many of the predictions are correct and returns the ratio of correct predictions to the total number of predictions made.
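
To illustrate the ratio, the sketch below applies the same comparison to four made-up rows whose last two entries are the actual label and the predicted label. The function name `accuracy` and the data are assumptions for this example only.

```python
def accuracy(rows):
    """Fraction of rows whose last two entries (actual label, prediction) agree."""
    correct = sum(1 for row in rows if row[-2] == row[-1])
    return correct / len(rows)

predicted = [
    [5.1, 3.5, 0.0, 0.0],  # correct
    [6.0, 2.9, 1.0, 1.0],  # correct
    [6.3, 3.3, 2.0, 1.0],  # wrong
    [5.8, 2.7, 2.0, 2.0],  # correct
]
print(accuracy(predicted))  # 0.75
```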
   
###  main()

This is the function the program is run from. It uses some of the functions described above to time each step and to display the classification accuracy in percent.
   


# Part 2: Evaluation of the efficiency

### Using the magic command %prun


In [None]:
 """ ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     4784    0.056    0.000    0.091    0.000 {method 'sort' of 'list' objects}
   251160    0.035    0.000    0.035    0.000 <ipython-input-10-003561e71200>:31(<lambda>)
     4784    0.017    0.000    0.028    0.000 <ipython-input-10-003561e71200>:19(calcDistance)
    19136    0.008    0.000    0.008    0.000 {built-in method builtins.pow}
       46    0.007    0.000    0.126    0.003 <ipython-input-10-003561e71200>:25(findNeighbours)
     4784    0.001    0.000    0.001    0.000 {built-in method math.sqrt}
     5486    0.001    0.000    0.001    0.000 {method 'append' of 'list' objects}
     5107    0.001    0.000    0.001    0.000 {built-in method builtins.len}
       46    0.001    0.000    0.001    0.000 {built-in method numpy.bincount}
       46    0.001    0.000    0.128    0.003 <ipython-input-10-003561e71200>:35(addPrediction)
        1    0.001    0.001    0.001    0.001 {built-in method io.open}
      104    0.000    0.000    0.000    0.000 <ipython-input-10-003561e71200>:13(<listcomp>)
        1    0.000    0.000    0.002    0.002 <ipython-input-10-003561e71200>:7(retrievData)
        1    0.000    0.000    0.000    0.000 <ipython-input-10-003561e71200>:51(classificationAcc)
       46    0.000    0.000    0.000    0.000 {method 'argmax' of 'numpy.ndarray' objects}
       46    0.000    0.000    0.000    0.000 <ipython-input-10-003561e71200>:15(<listcomp>)
        1    0.000    0.000    0.000    0.000 {method 'readlines' of '_io._IOBase' objects}
        1    0.000    0.000    0.128    0.128 <ipython-input-10-003561e71200>:45(makePredForAll)
      150    0.000    0.000    0.000    0.000 {method 'split' of 'str' objects}
        1    0.000    0.000    0.130    0.130 {built-in method builtins.exec}
      150    0.000    0.000    0.000    0.000 {method 'random' of '_random.Random' objects}
        1    0.000    0.000    0.130    0.130 <string>:1(<module>)
        1    0.000    0.000    0.130    0.130 <ipython-input-10-003561e71200>:57(main)
        2    0.000    0.000    0.000    0.000 {built-in method _codecs.charmap_decode}
        1    0.000    0.000    0.000    0.000 {built-in method _locale._getdefaultlocale}
        2    0.000    0.000    0.000    0.000 cp1252.py:22(decode)
        1    0.000    0.000    0.000    0.000 _bootlocale.py:11(getpreferredencoding)
        1    0.000    0.000    0.000    0.000 codecs.py:260(__init__)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects} """
    

 
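From the profiling data we observe that most of the execution time is taken by the list sort method (0.056 s) and the lambda function used as its sort key (0.035 s), out of a total of about 0.130 s. The sort is called 4784 times, once for every pair of test row and training row (46 × 104), because it is executed inside the loop of findNeighbours rather than once after the loop.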
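
One possible remedy, which was not measured here, is to select the k smallest distances in a single pass instead of sorting the growing list on every iteration. The sketch below uses `heapq.nsmallest` from the standard library. The names `distance` and `k_nearest` and the sample data are hypothetical. Moving the existing sort call out of the loop would be a smaller change with a similar effect.

```python
import heapq
import math

def distance(row_a, row_b):
    """Euclidean distance over the feature columns (the last column is the label)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row_a[:-1], row_b[:-1])))

def k_nearest(test_row, train_rows, k):
    """Return the k training rows closest to test_row, nearest first."""
    return heapq.nsmallest(k, train_rows, key=lambda row: distance(test_row, row))

train = [[0.0, 0.0, 0.0], [1.0, 1.0, 0.0], [5.0, 5.0, 1.0], [6.0, 5.0, 1.0]]
nearest = k_nearest([0.9, 0.8, 0.0], train, 2)
print(nearest)  # [[1.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
```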
 
### Time measurements using the magic command %timeit
  
  Time measurement for the function retrievData.......
  902 µs ± 7.53 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

  Time measurement for the function makePredForAll.......
  68.1 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

  Time measurement for the function classificationAcc.......
  23.6 µs ± 1.37 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  
  As shown above, the execution time was measured for each of the functions called in the main function. The makePredForAll(...) function takes by far the longest execution time. Measuring makePredForAll(...) with the default %timeit settings was not possible, so the command '%timeit -r 1 -n 1' was used instead, which limits the measurement to a single run with a single loop.
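
Outside IPython, a comparable single-run measurement can be made with the standard `timeit` module, where `repeat=1` and `number=1` play the roles of `-r 1` and `-n 1`. This is a minimal sketch in which `work` is a placeholder for the function being timed.

```python
import timeit

def work():
    """Placeholder workload standing in for the function to be measured."""
    return sum(i * i for i in range(1000))

# One repetition of one call; the result is a list with a single time in seconds
single = timeit.repeat(work, repeat=1, number=1)
print(len(single))  # 1
print("%.6f s" % single[0])
```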

  
