# Similarity-Based Learning Lab - focus on Python

Similarity-based learning is a family of algorithms which follows the intuition that similar items have similar labels. We can tell that items are similar by coming up with some notion of **distance** between instances. A distance function takes two instances as inputs and returns a number representing the distance (or dissimilarity) between the two instances. Instances with a low distance are considered to be similar.

The most common similarity-based algorithm is the k-Nearest-Neighbours (kNN) algorithm. When predicting a label for instance *q*, this algorithm looks for the *k* closest (most similar) instances, and predicts the majority label for these instances.

In this lab we'll build distance functions for some of the most popular distance metrics. Along the way we'll learn how to write clean, *pythonic* code. We'll go on to build our own kNN model from scratch (don't worry, it's easier than you think!). Finally, we'll look at the tools scikit-learn provides for normalising our data, an essential preprocessing step for similarity-based learning models.

* [Distance Metrics](#distanceMetrics)
    * [Implementing a Distance Metric](#implementingADistanceMetric)
        * [Function Definitions and TypeHints](#functionDefinitions)
        * [From Equations to Functions - Euclidean Distance Pseudocode](#euclideanPseudo)
        * [Handling Errors - Defensive Programming](#handlingErrors)
    * [Pythonic Programming](#pythonicProgramming)
        * [Avoid Counters](#avoidCounters)
        * [Use Comprehensions](#useComprehensions)
        * [Use Lambdas where Appropriate](#useLambdas)
    * [Distance Metrics Exercises](#distanceMetricExercises)
    * [Distance Metrics in SkLearn](#distanceMetricsSklearn)
* [kNN From Scratch](#knnFromScratch)
    * [Classes in Python](#classesInPython)
    * [Building a kNN Class](#buildingAKnnClass)
        * [Implementing the Predict Method](#implementingPredict)
        * [The K Parameter](#theKParameter)
        * [Passing a Distance Metric as a Parameter](#distanceMetricParameter)
* [Data Preprocessing with SkLearn - Normalisation](#normalisation)
    * [Exercise - Preprocessing Iris](#preprocessingIris)
    

# Distance Metrics <a class="anchor" id="distanceMetrics"></a>

A **distance metric** is a function which takes two *instances* and returns a number describing the dissimilarity, or *distance*, between those instances. We saw that there are a few other rules which distance metrics need to obey (such as the triangle inequality) to be truly considered *metrics* and not similarity indexes. Loosely speaking, if any of these rules is violated, the numbers no longer behave like distances between points in a space, so we couldn't consistently plot the instances on a graph.
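
To make these rules a little more concrete, here is a minimal sketch which checks them on a handful of made-up points. The helper names (`euclidean`, `squared_euclidean`, `satisfies_metric_rules`) and the sample points are illustrative assumptions rather than part of the lab. Passing the check on a few points does not prove a function is a metric; it can only show that a rule is broken.

```python
import math
from itertools import permutations

def euclidean(p, q):
    # straight-line distance between two points of equal length
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def squared_euclidean(p, q):
    # same as above but without the square root
    return sum((a - b) ** 2 for a, b in zip(p, q))

def satisfies_metric_rules(distance, points, tolerance=1e-9):
    # try every ordered triple of distinct points and test each rule
    for p, q, r in permutations(points, 3):
        if distance(p, q) < 0:                                # non-negativity
            return False
        if abs(distance(p, p)) > tolerance:                   # identity
            return False
        if abs(distance(p, q) - distance(q, p)) > tolerance:  # symmetry
            return False
        if distance(p, r) > distance(p, q) + distance(q, r) + tolerance:
            return False                                      # triangle inequality
    return True

# hypothetical sample points, chosen so that three of them lie on one line
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, 7.0)]

print(satisfies_metric_rules(euclidean, points))          # True
print(satisfies_metric_rules(squared_euclidean, points))  # False
```

The squared version fails because going directly from (0, 0) to (6, 8) "costs" 100, which is more than the 25 + 25 it costs to go via (3, 4). That breaks the triangle inequality.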

If we want to implement our own distance metric in Python we can use a **function**. A function allows us to take in any number of parameters (in this case, the instances we want to measure) and return a value (in this case, the distance). We'll start by taking a look at how we would build a Euclidean distance function.

## Implementing a Distance Metric <a class="anchor" id="implementingADistanceMetric"></a>

### Function Definitions and TypeHints <a class="anchor" id="functionDefinitions"></a>

Let's begin by working out our function definition. The first thing we need is a descriptive name. 

```python 
    def get_euclidean_distance
```

We could call it something like *euclidean*, but because this is a function, including a verb like *get* or *calculate* will remind us that it needs to be called to be used. This is a matter of style and it's up to you which you prefer.

Now that we've got a name we need to think about parameters. We can see from the Euclidean distance equation that the function takes two parameters. In the equation they're called *p* and *q*, but we can give them more descriptive names if we so choose.

```python
    def get_euclidean_distance(instanceA, instanceB):
```

The parameter names above are descriptive but they require more keystrokes. Again, it's a tradeoff and it's up to you to choose whichever you think works best.

Finally, it's good practice to make it clear what type of data your function expects and returns. This is an optional feature in Python known as **type annotation**. Type annotation makes it clear to you, to other developers, and to your IDE what type of data is going to come back from the function when it's called. This can be very helpful in that it allows your IDE to give you intelligent hints when you code. See PyCharm for more details.

```python
    def get_euclidean_distance(instanceA: list, instanceB: list) -> float:
```

Type annotations for parameters are written using a colon, followed by the type expected for that parameter. Annotations for return types are written using a minus sign followed by a greater-than sign, which together make an arrow symbol. Our function takes two lists and is going to return a float.
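
It's worth knowing that annotations are hints only: Python stores them but doesn't enforce them when the code runs. The short sketch below uses a hypothetical `add_numbers` function (not part of the lab) to show where the annotations end up and what happens when we ignore them.

```python
def add_numbers(a: float, b: float) -> float:
    # the annotations say "floats in, float out"
    return a + b

# Python keeps the annotations in a dictionary attached to the function
print(add_numbers.__annotations__)

# nothing stops us passing strings: the hints are not checked at runtime
print(add_numbers("ab", "cd"))  # abcd
```

Tools such as your IDE or a separate type checker are what actually make use of the hints.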

In [None]:
def get_euclidean_distance(instanceA: list, instanceB: list) -> float:
    pass

Notice that we've added the keyword **pass** to the body of our function. It's not valid python to define a function without any body, so the pass keyword acts as a placeholder. It tells the Python interpreter that you've intentionally left this function empty. It's the coding equivalent of the somewhat amusing *this page is intentionally left blank* you'll sometimes see on exams or official documents.

### From Equations to Functions: Euclidean Distance Pseudo-Code <a class="anchor" id="euclideanPseudo"></a>

![Euclidean Distance Formula](euclidean_distance.svg "Euclidean Distance")

In the definition above, p and q are **instances**

**d(p, q)** is the (euclidean) distance between p and q

p<sub>i</sub> is the *i<sup>th</sup>* **feature** of the instance p


Now it's time to get down to the implementation. Notice the big Greek letter that looks a little bit like a capital E. This is the Greek letter *sigma*, and it is used in equations to represent **the sum of**. The sum of what, exactly? The letters above and below the sigma will give us more information. We read from bottom to top.

Starting with i=1 and continuing until i=n, we work out the bit in brackets. q<sub>i</sub> - p<sub>i</sub> tells us that we need to subtract the *i<sup>th</sup>* feature in p from the *i<sup>th</sup>* feature in q and square the result. We then add up the results for each value of i to give us our final total.
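
As a worked example, here is the formula written out in LaTeX (assuming the image shows the sum of squared differences under a square root, as described above) and applied to two illustrative instances, p = (1, 1) and q = (4, 5):

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}
        = \sqrt{(4 - 1)^2 + (5 - 1)^2}
        = \sqrt{9 + 16}
        = 5
```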

So, how do we do this in Python? 
* Create a counter variable to let us loop through each list
* Subtract the current feature in instanceA from the corresponding feature in instanceB and square the result
* Add this to our running total
* Return the square root of the running total

We start by creating a counter variable, *i*, to let us loop through each list and access the individual features.

In [None]:
# the math module provides a square-root function (no pip install needed)
import math

def get_euclidean_distance(instanceA: list, instanceB: list) -> float:

    grand_total = 0 # create a variable to hold the running total
    for i in range(0, len(instanceA)):
        # Double asterisk is the exponent (power) operator
        current_total = (instanceB[i] - instanceA[i]) ** 2

        grand_total = grand_total + current_total 
        
    #sqrt is the square root function
    return math.sqrt(grand_total)

In [None]:
instanceA = [1, 1]
instanceB = [4, 5]
get_euclidean_distance(instanceA, instanceA) # this should be 0
get_euclidean_distance(instanceA, instanceB) # this should be 5

### Handling Errors - Defensive Programming <a class="anchor" id="handlingErrors"></a>

It looks like our Euclidean distance function is working as intended. However, what happens if our lists are different lengths?

In [None]:
instanceC = [1, 3, 4]
instanceD = [2, 2]

get_euclidean_distance(instanceC, instanceD)

We've run into an error because our script is trying to find the third element of instanceD, which doesn't exist. Euclidean distance only works on instances with the same number of features. We should check that our supplied parameters have the same number of features and raise an error with a helpful message for the user if not. The **raise** keyword allows us to halt execution and return an error message to the user.
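
Before applying this to our distance function, here is a minimal sketch of **raise** in isolation. The `halve_even_number` function is a hypothetical example, and it uses `ValueError`, a built-in and more specific kind of exception than the general `Exception` used elsewhere in this lab. The `try` / `except` block shows how calling code could catch the error instead of crashing.

```python
def halve_even_number(number: int) -> int:
    # hypothetical helper, used only to demonstrate raise
    if number % 2 != 0:
        # execution of the function stops here when the input is odd
        raise ValueError(f"Expected an even number, got {number}")
    return number // 2

print(halve_even_number(10))  # 5

try:
    halve_even_number(3)
except ValueError as error:
    # the message we supplied to raise is available on the caught error
    print(f"Something went wrong: {error}")
```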

In [None]:
# the math module provides a square-root function (no pip install needed)
import math

def get_euclidean_distance(instanceA: list, instanceB: list) -> float:

    if not len(instanceA) == len(instanceB):
        raise Exception("Instances must be of equal length", instanceA, instanceB)
        
    grand_total = 0 # create a variable to hold the running total
    for i in range(0, len(instanceA)):
        # Double asterisk is the exponent (power) operator
        current_total = (instanceB[i] - instanceA[i]) ** 2

        grand_total = grand_total + current_total 
        
    #sqrt is the square root function
    return math.sqrt(grand_total)

In [None]:
instanceC = [1, 3, 4]
instanceD = [2, 2]

# it still doesn't work but at least we now have a useful error message
get_euclidean_distance(instanceC, instanceD)

## Pythonic Programming <a class="anchor" id="pythonicProgramming"></a>

The code above works perfectly well and does exactly what it needs to do. However, it doesn't quite follow the *pythonic* style. *Pythonic* code should be fluent, it should read well and it should be terse. Here's a guide talking about [pythonic style](https://docs.python-guide.org/writing/style/) in more depth for those of you who are interested.

In this section we're going to re-write the code above in a more pythonic way. This is entirely optional for those of you like me who are excited by these kinds of things : )

### Avoid Counters <a class="anchor" id="avoidCounters"></a>

Generally speaking, counters aren't very descriptive. You need to take a step back to realise what a counter is doing. It's a computer's way of thinking about a problem rather than a human way. Pythonic code avoids counters where possible. The standard form of the Python loop

```python
for item in collection:
    print(item)
```

iterates through a collection without using additional counter variables. The reason we used counters here is that we need to loop through two lists simultaneously, and a loop only supports looping through one collection at a time.

We can work around this using the **zip()** function. The zip function takes two collections of items and returns an iterator over the corresponding pairs.

In [None]:
letters = ['a', 'b', 'c']
numbers = [1, 2, 3]

for letter, number in zip(letters, numbers):
    print(f"{letter}: {number}")
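
One detail worth knowing: zip stops as soon as the shortest input runs out, and it does so silently. The small sketch below (with made-up lists) shows this, which is one reason the explicit length check in our distance function is still worth keeping.

```python
short_list = [1, 2]
long_list = [10, 20, 30]

# the third element of long_list has no partner, so it is dropped without warning
pairs = list(zip(short_list, long_list))
print(pairs)  # [(1, 10), (2, 20)]
```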

When we apply this to the function below the code becomes a little shorter, a little more direct and a little more fluent.

In [None]:
# the math module provides a square-root function (no pip install needed)
import math

def get_euclidean_distance(instanceA: list, instanceB: list) -> float:

    if not len(instanceA) == len(instanceB):
        raise Exception("Instances must be of equal length", instanceA, instanceB)
        
    grand_total = 0 # create a variable to hold the running total
    for a, b in zip(instanceA, instanceB):
        current_total = (a-b) ** 2
        grand_total = grand_total + current_total 
        
    #sqrt is the square root function
    return math.sqrt(grand_total)

### Use Comprehensions <a class="anchor" id="useComprehensions"></a>

We've removed the list counters from our code, but we still have a lot of messing around with the grand_total variable. We're creating a counter, grand_total, setting it to 0 and then adding to it repeatedly in the loop. This is a standard way of maintaining a running total while we sum up a list of items, but again, it's a computer's way of looking at the problem rather than a human way.

List comprehensions allow us to get rid of a lot of this boilerplate code. They may look a little funny at first but as you get used to them you'll find them easier to read and write than the more verbose non-pythonic alternative. The easiest way to understand how a list comprehension works is by example.

We want to write a function which takes a list of items and doubles each item in the list. Here's the traditional way of doing that:

In [None]:
# this allows us to specify the type of a list's contents
from typing import List


def double_them(singles: List[float]):
    doubles = []
    for s in singles:
        # the + operator between two lists appends
        doubles = doubles + [s * 2]
    return doubles

double_them([1, 2, 6, 5])

<pre>
def double_them(singles: List[float]):
    doubles = []
    <b>for s in singles</b>:
        # the + operator between two lists appends
        doubles = doubles + <b>[s * 2]</b>
    return doubles
</pre>

Most of the code above is boilerplate. I've highlighted the two important bits. First, we're going through each item in singles, and then, we're multiplying it by two. In general, we can think of this as an operation to perform on a collection. In this case the collection is the list called *singles*. The operation is to multiply each element by 2. Let's see how we can use a list comprehension to cut down on the amount of code.

In [None]:
def double_them(singles: List[float]):
    return [s * 2 for s in singles]

double_them([1, 2, 6, 5])

To create a list comprehension we use square brackets just like we would for an actual list. However, a list comprehension is written as

[ **operation** for **variable_name(s)** in **collection** ]
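
The **variable_name(s)** part can hold more than one name when the collection yields pairs. Here is a small sketch with made-up `widths` and `heights` lists, combining a comprehension with zip:

```python
widths = [2, 3, 4]
heights = [5, 6, 7]

# operation:      w * h
# variable names: w, h
# collection:     zip(widths, heights)
areas = [w * h for w, h in zip(widths, heights)]

print(areas)  # [10, 18, 28]
```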

Let's try out a few examples

In [None]:
# Write a function which triples each item in the list
def triple_them(singles: List[float]):
    pass

In [None]:
# Write a function which gets the square root of each element
def square_roots(squares: List[float]):
    pass

In [None]:
# Write a function which gets the length of each string
def string_lengths(strings: List[str]):
    pass

Let's replace the for loop in our function above with a list comprehension.

```python
# the math module provides a square-root function (no pip install needed)
import math

def get_euclidean_distance(instanceA: list, instanceB: list) -> float:

    if not len(instanceA) == len(instanceB):
        raise Exception("Instances must be of equal length", instanceA, instanceB)
        
    grand_total = 0 # create a variable to hold the running total
    for a, b in zip(instanceA, instanceB):
        current_total = (a-b) ** 2
        grand_total = grand_total + current_total 
        
    #sqrt is the square root function
    return math.sqrt(grand_total)
```

Our first step is to identify our operation. Check the for loop, what are we actually doing with the variables?

```python
(a-b) ** 2
```

We're subtracting b from a and squaring the result.

Step two is to work out the collection statement. This is straightforward: we can take it directly from the first line of the loop

```python
for a, b in zip(instanceA, instanceB):
```

Putting it all together we get

```python
[(a-b) ** 2 for a, b in zip(instanceA, instanceB)]
```

Our list comprehension above gives us the squared difference between each pair of features. We want to add all of the differences together. We can do this using the **sum()** function

```python
sum([(a-b) ** 2 for a, b in zip(instanceA, instanceB)])
```

Finally, we want to return the square root of the sum total

```python
math.sqrt(sum([(a-b) ** 2 for a, b in zip(instanceA, instanceB)]))
```

In [None]:
from typing import List
import math

def get_euclidean_distance(instanceA: List[float], instanceB: List[float]) -> float:

    if not len(instanceA) == len(instanceB):
        raise Exception("Instances must be of equal length", instanceA, instanceB)
        
    return math.sqrt(sum([(a-b) ** 2 for a, b in zip(instanceA, instanceB)]))

instanceA = [1, 1]
instanceB = [4, 5]
print(f"euclidean(a,a): {get_euclidean_distance(instanceA, instanceA)}") # this should be 0
print(f"euclidean(a,b): {get_euclidean_distance(instanceA, instanceB)}") # this should be 5
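
As a quick sanity check, and assuming Python 3.8 or later, the standard library's `math.dist` function computes the same Euclidean distance. The sketch below compares it with a compact copy of the comprehension-based approach; the name `euclidean` is just a stand-in for our own function.

```python
import math

def euclidean(instance_a, instance_b):
    # same comprehension-based calculation as in the lab, without the length check
    return math.sqrt(sum([(a - b) ** 2 for a, b in zip(instance_a, instance_b)]))

instance_a = [1, 1]
instance_b = [4, 5]

ours = euclidean(instance_a, instance_b)
builtin = math.dist(instance_a, instance_b)

# compare with isclose rather than == because floating point results can differ slightly
print(ours, builtin, math.isclose(ours, builtin))
```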

### Use Lambdas Where Appropriate <a class="anchor" id="useLambdas"></a>

We've significantly reduced the number of lines of code in our get_euclidean_distance function. However, the (a-b)\*\*2 is still a bit of an eyesore. It's not immediately clear what this is doing. One way of making this more readable is to use a named function

```python
def get_euclidean_distance(instanceA: List[float], instanceB: List[float]) -> float:

    if not len(instanceA) == len(instanceB):
        raise Exception("Instances must be of equal length", instanceA, instanceB)
    
    def squared_distance(a: float, b: float) -> float:
        return (a - b) ** 2

    return math.sqrt(sum([squared_distance(a, b) for a, b in zip(instanceA, instanceB)]))
```

It's now easier to see what's going on in our return statement. However, we've added a lot of keywords and other fluff which we don't really need. We can reduce the boilerplate by using a **lambda**. A lambda is essentially an anonymous function in Python.


In [None]:
double_it = lambda x: x * 2

double_it(2)
double_it(4)

Notice that the lambda function doesn't have a name. We are assigning it to a variable so we can use it later on, but it's possible to use it without doing that. It's not possible for a function defined with **def** to be anonymous.

Notice also that the lambda expression doesn't have a **return** keyword. Lambdas can only consist of a single expression, so the return statement is implied. This means lambdas are suitable for simple operations. If we want to do something more complex, something that needs multiple lines of code, we must use a regular function.
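
A lambda really can be used without ever being given a name. The sketch below passes lambdas directly to the built-in `sorted` and `map` functions; the `words` list is made up for illustration.

```python
words = ["kiwi", "fig", "banana"]

# the lambda tells sorted() to order the words by their length
by_length = sorted(words, key=lambda word: len(word))
print(by_length)  # ['fig', 'kiwi', 'banana']

# map() applies the lambda to every item; list() collects the results
doubled = list(map(lambda x: x * 2, [1, 2, 3]))
print(doubled)  # [2, 4, 6]
```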

We'll finish up *pythonicising*<sup>*</sup> our code by using a lambda to define the square distance function. This isn't exactly necessary but in this case it's a lot more readable than **(a - b) \*\* 2**

\* (almost definitely not a word)

In [None]:
from typing import List
import math

def get_euclidean_distance(instanceA: List[float], instanceB: List[float]) -> float:

    if not len(instanceA) == len(instanceB):
        raise Exception("Instances must be of equal length", instanceA, instanceB)
    
    squared_distance = lambda a, b: (a - b) ** 2

    return math.sqrt(sum([squared_distance(a, b) for a, b in zip(instanceA, instanceB)]))

instanceA = [1, 1]
instanceB = [4, 5]
print(f"euclidean(a,a): {get_euclidean_distance(instanceA, instanceA)}") # this should be 0
print(f"euclidean(a,b): {get_euclidean_distance(instanceA, instanceB)}") # this should be 5

In [None]:
def get_cosine(vectorA, vectorB):
    if not len(vectorA) == len(vectorB):
        raise Exception("instances not of equal length", vectorA, vectorB)
        
    dot_product = sum([a * b for a, b in zip(vectorA, vectorB)])
    mag = lambda v: math.sqrt(sum([f ** 2 for f in v]))
    # the brackets matter: divide by the product of the two magnitudes
    return dot_product / (mag(vectorA) * mag(vectorB))

instance_a = [1, 0]
instance_b = [math.sqrt(2) / 2, math.sqrt(2) / 2]
#instance_b = [2, 2]
instance_c = [0, 1]

cosine_ab = get_cosine(instance_a, instance_b)
cosine_bc = get_cosine(instance_b, instance_c)
cosine_ac = get_cosine(instance_a, instance_c)


print(f"cosine_ab is {cosine_ab}")
print(f"cosine_bc is {cosine_bc}")
print(f"cosine_ac is {cosine_ac}")

# Distance Metric Exercises <a class="anchor" id="distanceMetricExercises"></a>

## Manhattan Distance

Look at the definition of the Manhattan distance in your notes and implement a function which calculates the Manhattan distance between 2 instances. The math module provides a function giving the absolute value of a number (look this up).

## Minkowski Distance

Look at the definition of the Minkowski distance in your notes and implement a function which calculates the Minkowski distance between 2 instances.

*hints*
* The Minkowski distance requires an extra parameter, what is it?
* The square root of a number is the same as that number to the power of 1/2; the cube root is the same as that number to the power of 1/3
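
The hint about roots can be checked directly with the exponent operator. This is only a sketch of the arithmetic and not a solution to the exercise; the numbers are arbitrary.

```python
import math

# square root of 9, written as a fractional power
square_root = 9 ** (1 / 2)

# cube root of 27, written as a fractional power
cube_root = 27 ** (1 / 3)

# fractional powers use floating point, so compare with isclose rather than ==
print(math.isclose(square_root, 3))
print(math.isclose(cube_root, 3))
```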

## Cosine Distance

Look at the definition of the Cosine distance formula in your notes and implement a function which calculates the cosine distance between 2 instances.

In [None]:
import math

def get_cosine(vectorA, vectorB):
    if not len(vectorA) == len(vectorB):
        raise Exception("instances not of equal length", vectorA, vectorB)
        
    dot_product = 0
    
    for i in range(0, len(vectorA)):
        dot_product = dot_product + (vectorA[i] * vectorB[i])
        
    print(f"dot product: {dot_product}")
    
    def calculate_magnitude(vector):
        total = 0
        for feature in vector:
            total += feature ** 2
        return math.sqrt(total)
    
    magnitude_a = calculate_magnitude(vectorA)
    magnitude_b = calculate_magnitude(vectorB)
    
    return dot_product / (magnitude_a * magnitude_b)

instance_a = [1, 0]
instance_b = [math.sqrt(2) / 2, math.sqrt(2) / 2]
#instance_b = [2, 2]
instance_c = [0, 1]

cosine_ab = get_cosine(instance_a, instance_b)
cosine_bc = get_cosine(instance_b, instance_c)
cosine_ac = get_cosine(instance_a, instance_c)


print(f"cosine_ab is {cosine_ab}")
print(f"cosine_bc is {cosine_bc}")
print(f"cosine_ac is {cosine_ac}")
        

## Distance Metrics in sklearn <a class="anchor" id="distanceMetricsSklearn"></a>

scikit-learn provides its own implementations of the most common distance metrics. As always, the [online documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html) is the best place to find out more. Test each of your distance metric implementations above by making sure they give the same answer as the scikit-learn versions.


In [None]:
from sklearn.neighbors import DistanceMetric
from sklearn.datasets import load_iris
import pandas as pd

# load the iris dataset using the sklearn's convenience function
data = load_iris()

# convert the dataset into a Pandas dataframe (descriptive features only)
df = pd.DataFrame(data.data, columns=data.feature_names)

# this is the target variable, the column we're interested in
labels = data.target

# Compare the distance between the first 3 rows
X = df.loc[0:2,]
print(X)

euc = DistanceMetric.get_metric('euclidean')

# This gives a distance matrix showing the pairwise distance between each pair of rows
pairwise = euc.pairwise(X)
print(pairwise)

# get the distance between the first and second rows
print(pairwise[0, 1])

# get the distance between the second and third rows
print(pairwise[1, 2])

# get the distance between the first row and itself
print(pairwise[0, 0])

# kNN From Scratch <a class="anchor" id="knnFromScratch"></a>

We're going to take what we've learned about kNNs and use it to build our own basic kNN algorithm from scratch. By breaking down the steps involved and taking it little by little it will become clear that it's not really that complicated to do. Before we begin we're going to make sure we've covered the basics and take a look at how we can use **classes** to build objects in Python.


## Classes in Python <a class="anchor" id="classesInPython"></a>

Loosely speaking, a class in python is a data structure which represents a complex "thing in the world". We'll take the example of a student class. This class will contain all of the variables we may want to hold on a student. Variables belonging to a class are known as **properties**. Below is an example of some properties belonging to a student

* student number
* first name
* last name
* grades
* status

A student class will also contain functions which allow us to manipulate the student. For example, we may want to register a student, or get the average grade a student has achieved. A function belonging to a class is known as a **method**

* register()
* add_grade()
* get_average_grade()

Classes give us as programmers an easy way to manage a bunch of related variables. Rather than having to work with 5 or 6 separate variables, we can store them all together. The example below shows a possible implementation of a student class in python

In [None]:
from typing import List

# The keyword class lets Python know we're creating a class here
class Student:
    
    # Below is a list of all of the properties (variables) belonging to a student
    student_number: str
    first_name: str
    last_name: str
    grades: List[float]
    status: str
        
    # This special function is known as a constructor. This function is called whenever
    # we create a new student. We use it to assign initial values. The self parameter
    # is a special parameter which is passed automatically to all instance methods. It allows
    # us to reference the object itself, and is the equivalent of the C++ **this** keyword
    def __init__(self, student_number: str, first_name: str, last_name: str):
        if not len(student_number) == 9:
            raise Exception("Invalid student number, must be 9 characters", student_number)
        self.student_number = student_number
        self.first_name = first_name
        self.last_name = last_name
        
        # we can set default values in the constructor
        self.grades = []
        self.status = 'awaiting registration'

# To create a new student we use the class name followed by a list of parameters
# We can look at the __init__ function to see which parameters are required
# Note that the **self** parameter is passed automatically
rita = Student('d12345678', 'Rita', 'White')

# We can access properties of an object using the dot "." operator
print(rita.student_number)
print(rita.first_name)
print(rita.last_name)
print(rita.status)



We need to be able to change a student's status and add grades to the student's record. We can do this using methods.

In [None]:
from typing import List

# The keyword class lets Python know we're creating a class here
class Student:
    
    # Below is a list of all of the properties (variables) belonging to a student
    student_number: str
    first_name: str
    last_name: str
    grades: List[float]
    status: str
        
    # This special function is known as a constructor. This function is called whenever
    # we create a new student. We use it to assign initial values. The self parameter
    # is a special parameter which is passed automatically to all instance methods. It allows
    # us to reference the object itself, and is the equivalent of the C++ **this** keyword
    def __init__(self, student_number: str, first_name: str, last_name: str):
        if not len(student_number) == 9:
            raise Exception("Invalid student number, must be 9 characters", student_number)
        self.student_number = student_number
        self.first_name = first_name
        self.last_name = last_name
        
        # we can set default values in the constructor
        self.grades = []
        self.status = 'awaiting registration'
        
    # The self parameter is required in all methods    
    def register(self) -> None:
        if not self.status == 'awaiting registration':
            raise Exception("Student is not awaiting registration")
        
        self.status = 'registered'
        
    def add_grade(self, grade: float) -> None:
        self.grades.append(grade)
        
    def get_average_grade(self) -> float:
        return sum(self.grades) / len(self.grades)
    
rita = Student('d12345678', 'Rita', 'White')

# like properties, we call methods using the dot "." operator. We don't supply the "self"
# parameter. The python interpreter does that for us
rita.register()
rita.add_grade(50)
rita.add_grade(70)

print(rita.get_average_grade())
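
One case the class above does not handle: `get_average_grade` divides by `len(self.grades)`, which is zero before any grades have been added. The sketch below shows the problem on a plain list and one possible guard. The `average_or_none` helper and its choice of returning `None` are assumptions for illustration; raising an error with a helpful message would be an equally reasonable design.

```python
from typing import List, Optional

def average_or_none(grades: List[float]) -> Optional[float]:
    # hypothetical helper: returns None when there is nothing to average
    if not grades:
        return None
    return sum(grades) / len(grades)

# without a guard, averaging an empty list fails
try:
    empty: List[float] = []
    print(sum(empty) / len(empty))
except ZeroDivisionError:
    print("Cannot average an empty list of grades")

print(average_or_none([]))        # None
print(average_or_none([50, 70]))  # 60.0
```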

## Building a kNN class <a class="anchor" id="buildingAKnnClass"></a>

Before we can start creating our class we need to think about two things: what data does our kNN need, and what does it need to do? It's often easier to ask what it needs to do first, as this will often make it clear what data it needs to hold to do it.

Our kNN needs to predict labels for data. We're going to need a predict method. That method is going to take an instance (which we've been representing as a list) and it's going to return a string, representing the class for that instance

* predict(X: List[float]) -> str

In order to make the prediction it's going to have to find the distance between the input instance and all the instances it's been trained on. This means that we're going to need a property containing a list of all the training data. (This is a list of lists.)

* data: List[List]

We need to be able to train the model, by providing a list of input data and a corresponding list of labels

* train(X: List[List], y: List[str]) -> None

We've added a new variable, y, which is a list of the class labels for the training data. We're going to have to add this as a property, too.

* labels: List[str]

Now that we've defined our properties and methods, we're ready to go.

In [None]:
from typing import List


class KNN:
    data: List[List]
    labels: List[str]
        
    def train(self, X: List[List], y: List[str]):
        self.data = X
        self.labels = y
        
    def predict(self, X: List[float]):
        pass
    
    
from sklearn.neighbors import DistanceMetric
from sklearn.datasets import load_iris
import pandas as pd

# load the iris dataset using the sklearn's convenience function
data = load_iris()

# convert the dataset into a Pandas dataframe (descriptive features only)
df = pd.DataFrame(data.data, columns=data.feature_names)

# this is the target variable, the column we're interested in
labels = data.target

model = KNN()
model.train(df, labels)
print(model.data)
print(model.labels)

### Implementing the Predict method <a class="anchor" id="implementingPredict"></a>

We've created a predict() function, but now we need to actually implement it. Some pseudo-code will help focus the mind on how to do this. How does a KNN work? When we make a prediction for query **q** using data **D** we need to

1. Calculate the distance between each instance *d* in the training set and *q*
2. Take the top *k* most similar *d*s, where most similar means the shortest distance to *q*
3. Check the label of each of the top *k* instances and return the majority label
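
The three steps above can be sketched as a single standalone function before we build them into the class. This is a rough illustration under stated assumptions and not the implementation the lab builds: the function name `knn_predict_sketch`, the toy data, and the use of `collections.Counter` for the majority vote are all choices made here for brevity.

```python
import math
from collections import Counter
from typing import List

def euclidean(a: List[float], b: List[float]) -> float:
    # compact Euclidean distance, assuming equal-length inputs
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict_sketch(query: List[float], data: List[List[float]],
                       labels: List[str], k: int = 3) -> str:
    # 1. distance between the query and each training instance
    distances = [euclidean(query, d) for d in data]

    # 2. indexes of the k training instances with the smallest distances
    nearest = sorted(range(len(data)), key=lambda i: distances[i])[:k]

    # 3. count the labels of those k instances and return the most common one
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# made-up training data: three instances of class "a", two of class "b"
toy_data = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9]]
toy_labels = ["a", "a", "a", "b", "b"]

print(knn_predict_sketch([1.1, 1.0], toy_data, toy_labels, k=3))  # a
print(knn_predict_sketch([5.1, 5.0], toy_data, toy_labels, k=3))  # b
```

In the second query the three nearest neighbours are two "b" instances and one "a" instance, so the majority vote returns "b".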

We'll start by considering only Euclidean distance. We'll add our Euclidean distance function from earlier.

In [None]:
from typing import List
import math


def get_euclidean_distance(instanceA: List[float], instanceB: List[float]) -> float:

    if not len(instanceA) == len(instanceB):
        raise Exception("Instances must be of equal length", instanceA, instanceB)
    
    squared_distance = lambda a, b: (a - b) ** 2

    return math.sqrt(sum([squared_distance(a, b) for a, b in zip(instanceA, instanceB)]))


class KNN:
    data: List[List]
    labels: List[str]
        
    def train(self, X: List[List], y: List[str]):
        self.data = X
        self.labels = y
        
    def predict(self, X: List[float]):
        distances = []
        for d in self.data:
            distances.append(get_euclidean_distance(X, d))
        print(distances[0:5])
        
from sklearn.neighbors import DistanceMetric
from sklearn.datasets import load_iris
import pandas as pd

# load the iris dataset using the sklearn's convenience function
data = load_iris()

# convert the dataset into a Pandas dataframe (descriptive features only)
df = pd.DataFrame(data.data, columns=data.feature_names)

# this is the target variable, the column we're interested in
labels = data.target

# take all but the last row from the dataframe and convert it to a list of lists
train = df.iloc[:-1].values.tolist()

# take only the last row and convert it to a list
test = df.iloc[-1:].values.flatten().tolist()

model = KNN()
model.train(train, labels)
model.predict(test)



So far so good. We're calculating the distance between each item and *q*. Now we need to know how to pull out the top K values.

We have a list of items, *data* representing all of the training data. We have a separate list, labels, which tells us which class each instance in data belongs to. The first value in labels corresponds to the first value of data, and the third value in labels corresponds to the third value in data *etc.*
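
As a small illustration of that index correspondence, the sketch below uses made-up toy values (not the iris data). The variable names are only for this example.

```python
# toy values, invented purely for illustration
data = [[5.1, 3.5], [6.2, 2.9], [7.3, 2.8]]
labels = ["setosa", "versicolor", "virginica"]

# the instance at index i is described by the label at index i
index = 1
print(data[index], "->", labels[index])

# zip walks both lists in step, pairing each instance with its label
pairs = list(zip(data, labels))
print(pairs[2])
```

This is why finding the *index* of the closest instance is enough: the same index looks up the matching label.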

We'll start with k=1. If we can find the **index** of the closest item in data we can use that to find the corresponding label in the labels list. NumPy provides a very useful function, *argmin*, which does exactly this for us. It looks through an array and returns the *index* of the smallest element.

In [None]:
import numpy as np

items = [4, 3, 1, 5, 2]
# the smallest item, 1, is in position [2] (third position) of the array
print(np.argmin(items))

In [None]:
from typing import List
import math
import numpy as np

def get_euclidean_distance(instanceA: List[float], instanceB: List[float]) -> float:

    if not len(instanceA) == len(instanceB):
        raise Exception("Instances must be of equal length", instanceA, instanceB)
    
    squared_distance = lambda a, b: (a - b) ** 2

    return math.sqrt(sum([squared_distance(a, b) for a, b in zip(instanceA, instanceB)]))


class KNN:
    data: List[List]
    labels: List[str]
        
    def train(self, X: List[List], y: List[str]):
        self.data = X
        self.labels = y
        
    def predict(self, X: List[float]):
        distances = []
        for d in self.data:
            distances.append(get_euclidean_distance(X, d))
        min_index = np.argmin(distances)
        return self.labels[min_index]
        
from sklearn.neighbors import DistanceMetric
from sklearn.datasets import load_iris
import pandas as pd

# load the iris dataset using sklearn's convenience function
data = load_iris()

# convert the dataset into a Pandas dataframe (descriptive features only)
df = pd.DataFrame(data.data, columns=data.feature_names)

# this is the target variable, the column we're interested in
labels = data.target

# take all but the last row from the dataframe and convert it to a list of lists
train = df.iloc[:-1].values.tolist()

# take only the last row and convert it to a list
test = df.iloc[-1:].values.flatten().tolist()

model = KNN()
model.train(train, labels)
model.predict(test)

### The K Parameter <a class="anchor" id="theKParameter"></a>

So far we've looked at how to make this work for k=1. We now want to expand this to work for any value of k. As well as **argmin**, numpy provides an **argsort** function. The argsort function returns the indices that would sort an array in ascending order. The first item in the argsort array is the index of the smallest item, the second item is the index of the next smallest *etc.*

We can use slicing to take the first *k* items from the array. We then need to find the most common value among these k items. The scipy.stats module provides a function **mode**, which takes an array and returns the most common value. See [the docs](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mode.html) for more info. The return value of the mode() function needs a little bit of massaging to get it into the right format.
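
To make the sort-then-vote idea concrete, here is a plain-Python sketch on made-up distances and labels. It stands in for numpy's argsort with `sorted` and for scipy's mode with `collections.Counter`. Both substitutions are assumptions chosen to keep the example dependency-free; they are not what the model below uses.

```python
from collections import Counter

# made-up distances from a query to five training instances, with their labels
distances = [4.0, 3.0, 1.0, 5.0, 2.0]
labels = ["a", "b", "b", "a", "a"]
k = 3

# indices that would sort the distances in ascending order (what argsort returns)
sorted_indices = sorted(range(len(distances)), key=lambda i: distances[i])

# labels of the k closest instances
k_labels = [labels[i] for i in sorted_indices[:k]]

# most_common(1) returns a list holding one (label, count) pair
majority_label, votes = Counter(k_labels).most_common(1)[0]
print(sorted_indices[:k], k_labels, majority_label)
```

With an even *k* a tie is possible. How to break it is a design decision this sketch does not address.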

In [None]:
import numpy as np

items = [4, 3, 1, 5, 2]
# the three smallest items, 1, 2 and 3, are at indices 2, 4 and 1 of the array
print(np.argsort(items)[:3])

So now we need a way of supplying a parameter *k*. The value of *k* shouldn't change once the model has been created, so we can put this in the constructor.

In [None]:
from typing import List
from scipy.stats import mode
import math
import numpy as np

def get_euclidean_distance(instanceA: List[float], instanceB: List[float]) -> float:

    if not len(instanceA) == len(instanceB):
        raise Exception("Instances must be of equal length", instanceA, instanceB)
    
    squared_distance = lambda a, b: (a - b) ** 2

    return math.sqrt(sum([squared_distance(a, b) for a, b in zip(instanceA, instanceB)]))


class KNN:
    data: List[List]
    labels: List[str]
    k: int
        
    def __init__(self, k: int):
        if not (k > 0):
            raise Exception("K must be greater than 0")
        self.k = k
    
    def train(self, X: List[List], y: List[str]):
        self.data = X
        self.labels = y
        
    def predict(self, X: List[float]):
        distances = []
        for d in self.data:
            distances.append(get_euclidean_distance(X, d))
        sorted_indices = np.argsort(distances)
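        k_neighbours = np.asarray(self.labels)[sorted_indices[:self.k]]
        # find the most common value; atleast_1d copes with mode() returning
        # either a scalar (newer SciPy) or a one-element array (older SciPy)
        return np.atleast_1d(mode(k_neighbours).mode)[0]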
        
from sklearn.neighbors import DistanceMetric
from sklearn.datasets import load_iris
import pandas as pd

# load the iris dataset using sklearn's convenience function
data = load_iris()

# convert the dataset into a Pandas dataframe (descriptive features only)
df = pd.DataFrame(data.data, columns=data.feature_names)

# this is the target variable, the column we're interested in
labels = data.target

# take all but the last row from the dataframe and convert it to a list of lists
train = df.iloc[:-1].values.tolist()

# take only the last row and convert it to a list
test = df.iloc[-1:].values.flatten().tolist()

model = KNN(3)
model.train(train, labels)
model.predict(test)

### Passing a Distance Metric as a Parameter <a class="anchor" id="distanceMetricParameter"></a>

So far we've hard-coded our model to use Euclidean distance. However, we've already seen that there are lots of other distances we might want to use. We don't want to write a new class for every different type of distance metric, just as we don't want to write a new class for every possible value of *k*. How do we make our model able to work with any kind of distance metric? We pass it as a parameter.

You're used to passing values (strings, numbers, lists) as parameters, but there's no reason you can't pass a function too (we just don't need to do it as often). In Python we can pass a function as a parameter using its name.



In [None]:
from typing import List
import math

def get_euclidean_distance(instanceA: List[float], instanceB: List[float]) -> float:

    if not len(instanceA) == len(instanceB):
        raise Exception("Instances must be of equal length", instanceA, instanceB)
    
    squared_distance = lambda a, b: (a - b) ** 2

    return math.sqrt(sum([squared_distance(a, b) for a, b in zip(instanceA, instanceB)]))

def calculate_distance(a, b, dist):
    return dist(a, b)

print(calculate_distance([1, 1], [4, 5], get_euclidean_distance))

In the example above we're passing the function **get_euclidean_distance** to calculate_distance. calculate_distance takes whatever distance function has been passed in as *dist*, executes it with parameters *a* and *b*, and returns the result. This is a fairly long-winded way of achieving our goal, but we'll soon see why this can be a very powerful technique when used correctly.

Before we go any further can you think of any potential problems or difficulty using this method?

The calculate_distance function expects the dist parameter to be a function taking exactly two parameters. It's not immediately clear to the developer that this is the case. As before, we can use *type hints* to make it explicit.

The type hint for a function, *Callable*, needs to be imported from the typing module. When we declare a function type we also need to state what parameters it expects and what type of data it returns.

In general, we type-hint a function using the following syntax:

Callable[**[Param1Type, Param2Type, Param3Type]**, *ReturnType*]

The input parameter types are grouped together using square brackets; the last type hint is the return type.

```python
from typing import Callable, List
# we're creating our own custom type here, Instance, which is just an alias for a list of floats
Instance = List[float]

def calculate_distance(a: List[float], b: List[float], dist: Callable[[Instance, Instance], float]) -> float:
    return dist(a, b)
```

As always, this isn't essential, but it makes for cleaner, easier-to-read code and provides additional information which your IDE can use to make intelligent suggestions as you type.

In [None]:
from typing import Callable, List
# we're creating our own custom type here, Instance, which is just an alias for a list of floats
Instance = List[float]

def calculate_distance(a: List[float], b: List[float], dist: Callable[[Instance, Instance], float]) -> float:
    return dist(a, b)

print(calculate_distance([1, 1], [4, 5], get_euclidean_distance))

### Exercise

In the snippet above, swap out Euclidean distance for the Manhattan distance you created earlier.

### Adding a Distance Metric Property to the Model

In order to update our model with a configurable distance metric parameter we need to

* create a distance_metric property to hold the distance function
* add a distance_metric parameter to the constructor to allow the user to specify the metric when creating the model
* replace the hardcoded call to Euclidean distance with a call to the distance metric property



In [None]:
from typing import Callable, List
from scipy.stats import mode
import math
import numpy as np

Instance = List[float]

def get_euclidean_distance(instanceA: Instance, instanceB: Instance) -> float:

    if not len(instanceA) == len(instanceB):
        raise Exception("Instances must be of equal length", instanceA, instanceB)
    
    squared_distance = lambda a, b: (a - b) ** 2

    return math.sqrt(sum([squared_distance(a, b) for a, b in zip(instanceA, instanceB)]))

class KNN:
    data: List[List]
    labels: List[str]
    k: int
    # add a distance_metric property
    distance_metric: Callable[[Instance, Instance], float]
        
    def __init__(self, k: int, distance_metric: Callable[[Instance, Instance], float]):
        if not (k > 0):
            raise Exception("K must be greater than 0")
        self.k = k
        # assign the distance_metric to the property
        self.distance_metric = distance_metric
    
    def train(self, X: List[List], y: List[str]):
        self.data = X
        self.labels = y
        
    def predict(self, X: List[float]):
        distances = []
        for d in self.data:
            distances.append(self.distance_metric(X, d))
        sorted_indices = np.argsort(distances)
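        k_neighbours = np.asarray(self.labels)[sorted_indices[:self.k]]
        # find the most common value; atleast_1d copes with mode() returning
        # either a scalar (newer SciPy) or a one-element array (older SciPy)
        return np.atleast_1d(mode(k_neighbours).mode)[0]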
        
from sklearn.neighbors import DistanceMetric
from sklearn.datasets import load_iris
import pandas as pd

# load the iris dataset using sklearn's convenience function
data = load_iris()

# convert the dataset into a Pandas dataframe (descriptive features only)
df = pd.DataFrame(data.data, columns=data.feature_names)

# this is the target variable, the column we're interested in
labels = data.target

# take all but the last row from the dataframe and convert it to a list of lists
train = df.iloc[:-1].values.tolist()

# take only the last row and convert it to a list
test = df.iloc[-1:].values.flatten().tolist()

model = KNN(3, get_euclidean_distance)
model.train(train, labels)
model.predict(test)
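
Because the metric is now just a parameter, trying a different one needs no change to the prediction logic. The sketch below shows this on toy data with a stripped-down 1-nearest-neighbour helper. `nearest_label` and `get_chebyshev_distance` are hypothetical names introduced for this example and are not part of the lab code.

```python
import math
from typing import Callable, List

Instance = List[float]
Metric = Callable[[Instance, Instance], float]


def get_euclidean_distance(a: Instance, b: Instance) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def get_chebyshev_distance(a: Instance, b: Instance) -> float:
    # Chebyshev distance: the largest absolute difference across features
    return max(abs(x - y) for x, y in zip(a, b))


def nearest_label(query: Instance, data: List[Instance],
                  labels: List[str], metric: Metric) -> str:
    # compute every distance with whichever metric was passed in
    distances = [metric(query, d) for d in data]
    return labels[distances.index(min(distances))]


# toy data, invented for illustration
data = [[0.0, 3.0], [2.5, 2.5]]
labels = ["a", "b"]
query = [0.0, 0.0]

euclidean_answer = nearest_label(query, data, labels, get_euclidean_distance)
chebyshev_answer = nearest_label(query, data, labels, get_chebyshev_distance)
print(euclidean_answer, chebyshev_answer)
```

On this toy data the two metrics disagree about which neighbour is closest, which is exactly why the choice of metric is worth exposing as a parameter.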

# Data Preprocessing with SKLearn: Normalisation <a class="anchor" id="normalisation"></a>

We've seen in the lecture that it's very important to normalise all data in a KNN. This ensures that all features are of equal importance. Two of the most common normalisation techniques are minMax scaling and standard scaling.

MinMax scaling expresses each number as a value between 0 and 1 where 0 is the smallest value, 1 is the largest, and 0.5 is halfway between the two. The formula for minMax scaling is given below

![Min-Max Scaling Formula](min_max.png "Min Max Scaling")

When we train a KNN we apply the scaling formula to each feature individually. This ensures that all of our feature values are between 0 and 1. Whenever we want to make a new prediction we need to run it through the **same transformation**. This means that we need to remember the minimum and maximum values from our initial transformation and use those same values whenever we normalise a query.



In [None]:
import numpy as np

# numpy allows us to use vector operations e.g. X1 / 2 divides each item in X1 by 2 
# and returns the results as an array
X1 = np.array([3, 6, 9, 8, 1, 4, 15])
X2 = np.array([20, 40, 13, 69, 37, 57, 14])

X1_scaled = (X1 - min(X1)) / (max(X1) - min(X1))
X2_scaled = (X2 - min(X2)) / (max(X2) - min(X2))

print(X1_scaled)
print(X2_scaled)

We could pull this out into a function

In [None]:
import numpy as np

def min_max_scale(X):
    return (X - min(X)) / (max(X) - min(X))

X1 = np.array([3, 6, 9, 8, 1, 4, 15])
X2 = np.array([20, 40, 13, 69, 37, 57, 14])

print(min_max_scale(X1))

We've just scaled the training data. What happens when a new query comes in?

In [None]:
import numpy as np

def min_max_scale(X):
    return (X - min(X)) / (max(X) - min(X))

Q = np.array([8]) # 8 is halfway between 1 and 15... expecting 0.5

print(min_max_scale(Q))

Instead of the 0.5 we expected, we get `nan` (along with a runtime warning from numpy). The problem here is that we're recalculating the minimum and maximum from the new query alone: with a single value both are 8, so we end up dividing zero by zero. We *should* use the values we found when we initially carried out the scaling. This shows us that, as with machine learning models, scaling is a two-step process. First, we need to work out the minimum/maximum values etc. Then we can use those values to actually perform the scaling operation.
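
One way to picture the two-step process is a small class that remembers the minimum and maximum it saw during fitting. This is a minimal sketch for a single feature in plain Python. The class name `SimpleMinMaxScaler` is made up for illustration, and this is not a description of how scikit-learn is implemented.

```python
from typing import List


class SimpleMinMaxScaler:
    def fit(self, values: List[float]) -> "SimpleMinMaxScaler":
        # step 1: remember the range of the training data
        self.min_ = min(values)
        self.max_ = max(values)
        return self

    def transform(self, values: List[float]) -> List[float]:
        # step 2: scale using the remembered range, not the range of `values`
        return [(v - self.min_) / (self.max_ - self.min_) for v in values]


X1 = [3, 6, 9, 8, 1, 4, 15]
scaler = SimpleMinMaxScaler().fit(X1)

print(scaler.transform(X1))
print(scaler.transform([8]))  # 8 is halfway between 1 and 15
```

Note that a query outside the training range, such as 22, would scale to a value outside 0 to 1 (here 1.5), because the remembered minimum and maximum no longer bound it.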

SkLearn provides a set of data transformations for us to cover the most common scenarios. As we saw last week, each of these transformers implements two methods, a **fit()** method, which works out maximum/minimum values etc. and a **transform()** method, which actually scales the data. We need to fit the data before we can carry out any transformations.

In [None]:
from sklearn import preprocessing
import numpy as np

# This is a 2-d array, consisting of rows
# [1,2,3]
# [5,8,2]
# [9,1,1]

X = np.array([[1, 2, 3], 
              [5, 8, 2], 
              [9, 1, 1]
             ])

min_max_scaler = preprocessing.MinMaxScaler().fit(X)
Scaled = min_max_scaler.transform(X)
print(Scaled)

# the transform function expects a list of lists
Q = [ [3, 9, 3] ]
print(min_max_scaler.transform(Q))
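
Standard scaling, the other technique mentioned earlier, follows the same fit-then-transform pattern but subtracts the mean and divides by the standard deviation. The sketch below works through it for a single feature in plain Python. The function names are hypothetical, the values are made up, and it assumes the population standard deviation is the convention you want.

```python
import statistics
from typing import List, Tuple


def fit_standard(values: List[float]) -> Tuple[float, float]:
    # step 1: learn the mean and (population) standard deviation
    return statistics.fmean(values), statistics.pstdev(values)


def transform_standard(values: List[float], mean: float, std: float) -> List[float]:
    # step 2: centre on the learned mean and divide by the learned spread
    return [(v - mean) / std for v in values]


train_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, std = fit_standard(train_values)
print(mean, std)

# a new query is scaled with the training mean and std, not its own
query_scaled = transform_standard([7.0], mean, std)
print(query_scaled)
```

After this transformation a value is expressed as the number of standard deviations it lies above or below the training mean.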

## Exercise - Preprocessing the iris dataset <a class="anchor" id="preprocessingIris"></a>

1. Load the Iris dataset using the sklearn convenience function
2. Extract instances 0-39, 50-89 and 100-139 as training data
3. Scale the training data using a standard scaler
4. Extract instances 40-49, 90-99 and 140-149 as test data
5. Scale these instances using the *same scaler*
6. Create an SkLearn Knn
7. Train the Knn on the training data
8. Predict the instances of the test data using the KNN

### Exercise - Exploring the SKLearn KNeighborsClassifier algorithm

1. Create a KNeighborsClassifier with k=5
2. Find out what distance metric is used by default (see [the docs](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html))
3. Create a model using the Manhattan distance, rather than the default
4. Create a model using the Minkowski distance with p=4

## Exercise - Putting it all together

1. Load the wine dataset using the load_wine() convenience function
2. Extract only the numerical columns
3. Extract rows 0-49, 60-119 and 130-168 as training data
4. Extract rows 50-59, 120-129 and 169-177 as test data (the dataset has 178 rows, indexed from 0)
5. Create a Knn using default parameters and evaluate accuracy on the test data
6. Try to improve on this accuracy by adjusting the parameters of the model