## Introduction

This tutorial introduces the k-nearest neighbors (KNN) classification algorithm and some of its practical applications.

Machine learning and data science are two deeply connected fields. In this tutorial, we will see how to complete the data science pipeline, beginning with data collection, and ending with utilization of KNN to tell us something meaningful about the data.

The tutorial begins with a brief introduction to basic machine learning concepts.

### Tutorial Content

- [What is Machine Learning?](#What-is-Machine-Learning?)
- [KNN Classification Algorithm](#KNN-Classification-Algorithm)
- [Practical Applications](#Practical-Applications)

## What is Machine Learning?

Before we get started, let's take a closer look at what "machine learning" means.

_"Machine learning explores the study and construction of algorithms that can learn from and make predictions on data"_ - Wikipedia

Basically, people like to observe and make predictions about the universe. Unfortunately, as humans, we can only take in so much information at once. This is where computers (_**machines**_) come in! They can handle large data sets, and perform computations much more quickly than we can.

So, we observe what happens in the world, and try to describe it in a way that machines will understand ("study and construction of algorithms"). Once they have _**learned**_ what the question is, we can feed them tons of data and they will start to make predictions.

There are three main types of machine learning problems. They are as follows:
- Classification
    * Example: Is an email spam or not spam?
- Regression
    * Example: Given someone's weight, can you predict their height?
- Clustering
    * Example: Each person has a course list. Can you group together people who are probably in the same year, without being told anyone's year?

### Setting up

Regardless of what type of problem we are trying to solve, we need to do a bit of setup to get started.

**1. Define your question**

We start with a question. This is the problem we are trying to solve. We saw some examples of this in the section above!

**2. Gather training data**

Experts in any field generally have a lot of experience. Well, in order for our machine learning model to become an expert in answering our question, we need to give it a lot of experience doing that!

This is what we call "training data". This data is special in that we know what our model should output on each data point. For example, if we are trying to decide whether or not an email is spam, our training data might be of the form:
(email contents (text), spam (bool)).

Make sure that your training data set is large enough; otherwise, the model might not be accurate.

**3. Split into train and test**

We have all this data for which we already know the correct behavior of the model. This data would be great to figure out how accurate our model is as well!

Often, we split the training data set into training data, and testing data. Training data is used to actually create the model, while testing data is used later to verify its accuracy.

You get to decide how much data to test or train with. Sometimes, people go with 80% train, 20% test, but it is up to you! More training data generally makes the model more accurate, but you also need enough testing data to trust the accuracy you measure.
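As a rough illustration of the split described above, here is a minimal sketch. The function name `splitData`, the `seed` parameter, and the toy records are assumptions made up for this example. Shuffling before splitting is a common precaution so that the order of the file does not decide which points land in each set.

```python
import random

def splitData(records, trainPercent, seed=0):
    # copy the list so the caller's data is left untouched
    shuffled = list(records)
    # a fixed seed makes the shuffle repeatable (an assumption of this sketch)
    random.Random(seed).shuffle(shuffled)
    numTrain = round(len(shuffled) * trainPercent)
    return (shuffled[:numTrain], shuffled[numTrain:])

# toy records of the form (label, value), made up for illustration
records = [("spam", i) if i % 2 == 0 else ("not spam", i) for i in range(10)]
(training, testing) = splitData(records, 0.8)
print(len(training), len(testing))  # 8 2
```

Every record ends up in exactly one of the two lists, so nothing is used for both training and testing.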

**4. Decide on an algorithm**

Algorithms are at the heart of machine learning and data science. There are a million ways to solve the same problem, but the quality of the results (among other measures) is dependent on picking the right algorithm.

If you are using Python, a great resource to explore machine learning algorithms is `sklearn`. This package contains a plethora of machine learning algorithms that are easy to test and play with.

For the rest of this tutorial, we will explore the k-nearest neighbors (KNN) algorithm.

**5. Data Pre-Processing and Feature Selection**

Sometimes, the data you get may not be in a useable form. For example, maybe your dataset consists of images, but your algorithm takes in a vector. Here, you would need a way to convert images into something your algorithm can understand.

Additionally, data can come with a ton of extra information that you might not care about. For example, a Twitter profile contains a handle, bio, and tweets. If we only care about analyzing handles, we do not need the extra data and it would just slow the computation down. Picking what parts of our data we "care" about is _feature selection_.
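To make feature selection concrete, here is a small sketch that keeps only the fields we care about. The profile records, their field names, and the helper `selectFeatures` are all hypothetical and exist only for this example.

```python
# hypothetical profile records; the field names are assumptions
profiles = [
    {"handle": "@ada", "bio": "mathematician", "tweets": ["hello"]},
    {"handle": "@alan", "bio": "computer scientist", "tweets": ["hi", "bye"]},
]

def selectFeatures(records, wanted):
    # keep only the keys listed in `wanted` for each record
    return [{key: record[key] for key in wanted} for record in records]

handlesOnly = selectFeatures(profiles, ["handle"])
print(handlesOnly)  # [{'handle': '@ada'}, {'handle': '@alan'}]
```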

**6. Train your model**

Now that we have an algorithm, we feed our training data to the algorithm. As the computer sees more and more data, it begins to get a sense for how to answer our question.

In other words, the computer builds a model out of the training data.

**7. Test your model**

Now we can use our testing data to see how accurate our model is. To do this, we remove the answer from that data and give it to the model. Then, we ask the model our _question_ (from #1!).

Since we still do know the right answer for these data, we can figure out how often our machine is correct vs incorrect. Depending on the results, we might iterate on steps 2-6 a few more times.

**8. You're good to go!**

Woo!

## KNN Classification Algorithm

### Description

You know how people often say "you are the average of your closest friends"? This is kind of what KNN does.

For any new point, call it `newPoint`, the algorithm looks at the _k_ data points that are closest to `newPoint` (neighbors). Then, the _k_ neighbors vote. `newPoint` is classified as the most commonly cast vote.

So we've given an intuitive description of the algorithm, but it is imprecise. A few questions you may have:
1. What is _k_?
2. How is "closest" defined?
3. What does "vote" mean?

#### What is k?

This question is actually quite a bit more complex than it sounds. In the context of KNN, _k_ is a _parameter_. This means that for any implementation of KNN, the algorithm will ask you to provide _k_.

So how do we decide on a _k_?

If we pick a _k_ that is too large, it will be hard to distinguish the true correct answer. It's like when you are trying to make a decision with too many people. There can sometimes be so many different opinions that it is hard to pick out one to go with.

If we pick a _k_ that is too small, it may not be accurate. This is the other side of the scenario we just looked at. Suppose we try to make a big decision with just three people. Have we really listened to enough voices to trust this decision?

Remember how we said that algorithms are at the heart of machine learning? Here is another great example of that. There are tons of different algorithms that we can use to decide _k_.

A simple (but not always optimal) way to decide is to make `k = sqrt(number of training data points)`, rounded to a whole number.
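Another option is to let held-out data decide: try a few candidate values of _k_ and keep the one that classifies the held-out points most accurately. The sketch below is one possible way to do that, not the only one. The helper names (`predict`, `accuracy`, `chooseK`) and the toy points are assumptions made for this example, and `math.dist` requires Python 3.8 or newer.

```python
import math
from collections import Counter

def predict(k, trainData, point):
    # sort training points by distance to `point` and let the k closest vote
    nearest = sorted(trainData, key=lambda item: math.dist(item[1], point))[:k]
    return Counter(label for (label, _) in nearest).most_common(1)[0][0]

def accuracy(k, trainData, heldOut):
    # fraction of held-out points whose predicted label matches the real one
    correct = sum(1 for (label, point) in heldOut
                  if predict(k, trainData, point) == label)
    return correct / len(heldOut)

def chooseK(candidates, trainData, heldOut):
    # max() keeps the first best candidate, so ties go to the earliest k listed
    return max(candidates, key=lambda k: accuracy(k, trainData, heldOut))

# made-up points; ("b", (0.4, 0.4)) plays the role of a noisy label
trainData = [("a", (0.0, 0.0)), ("a", (0.0, 1.0)), ("a", (1.0, 0.0)),
             ("b", (0.4, 0.4)),
             ("b", (5.0, 5.0)), ("b", (5.0, 6.0)), ("b", (6.0, 5.0))]
heldOut = [("a", (0.5, 0.5)), ("b", (5.5, 5.5))]

bestK = chooseK([1, 3, 5], trainData, heldOut)
print(bestK)  # 3
```

With `k = 1`, the single noisy point decides the vote for `(0.5, 0.5)`, while `k = 3` lets its two other neighbors outvote it.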

#### How is "closest" defined?

There are many ways to define "closest".

One common way is by calculating the Euclidean distance between two data points. In this case, you would need to figure out how to represent your data such that this is possible.

Vectors would be a useful data form here, because computers understand vectors!
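As a sketch of what this looks like for vectors of any length, here are two possible definitions of distance. The function names are assumptions for this example. Manhattan distance is included only to show that "closest" can be defined in more than one way.

```python
import math

def euclidean(p, q):
    # square root of the summed squared differences, one term per coordinate
    return math.sqrt(sum((a - b) ** 2 for (a, b) in zip(p, q)))

def manhattan(p, q):
    # sum of absolute differences, one term per coordinate
    return sum(abs(a - b) for (a, b) in zip(p, q))

print(euclidean((0.0, 0.0), (3.0, 4.0)))  # 5.0
print(manhattan((0.0, 0.0), (3.0, 4.0)))  # 7.0
```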

#### What does "vote" mean?

You know how we said "you are the average of your closest friends"? Suppose we wanted to figure out whether you like pineapples on pizza or not.

Well, we might go to your closest friends and poll them on their preferences. Then, we could guess your preference based on the most common answer that they gave. (As a sidenote, pineapples _definitely_ belong on pizza.)

This is essentially what KNN does too! Our training data set comes with labels already, right? So after we find the k nearest neighbors, we look at their labels. Then, we take the most common label and classify the new data point as that.

You can decide on the exact way you want to pick an answer from the votes. Here, we just pick the most common.
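To illustrate that there is more than one way to count votes, the sketch below compares a plain majority vote with a vote in which closer neighbors count more. Both function names and the weighting scheme (one over the distance) are assumptions chosen for this example, not a required part of KNN.

```python
from collections import Counter

def majorityVote(labels):
    # most_common(1) returns a one-item list holding the (label, count) pair with the highest count
    return Counter(labels).most_common(1)[0][0]

def weightedVote(neighbors):
    # neighbors: list of (label, distance) pairs; closer neighbors count more
    weights = {}
    for (label, dist) in neighbors:
        # the small constant avoids dividing by zero (an assumption of this sketch)
        weights[label] = weights.get(label, 0.0) + 1.0 / (dist + 1e-9)
    return max(weights, key=weights.get)

print(majorityVote(["yes", "no", "yes"]))                        # yes
print(weightedVote([("yes", 4.0), ("yes", 5.0), ("no", 0.5)]))   # no
```

In the second call, the single very close "no" neighbor outweighs the two distant "yes" neighbors, so the two schemes can disagree on the same neighbors.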

### KNN implementation

First, let's import a few built-in modules. 

`math` provides mathematical functions such as `sqrt`, which we will use to compute distances.

`csv` helps us handle .csv format files. CSV files are often used to store data.

We will not use any additional packages, but `sklearn` is a good option as mentioned above.

#### Some setup before we begin

Let's assume that our data is of the form:

`(label, (x,y))`

label: a string representing the correct classification of the data point

(x,y): floats representing the data point

#### Time for code!

In [None]:
import math
import csv

Our first step is to read in our dataset and process it. Depending on the format of your data, you may need a function `processData(data, features)` that goes through the dataset and extracts only the features you want.

In [None]:
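# parameters:
    # filePath (string): path to some csv file
    # (assumes every row looks like: label,x,y and that there is no header row)
# returns:
    # a list of data points of the form (label, (x,y))
def readCSV(filePath):
    result = []
    with open(filePath, 'r', encoding = 'utf-8', newline = '') as file:
        for row in csv.reader(file):
            if not row: # skip blank lines
                continue
            (label, x, y) = row
            result.append((label, (float(x), float(y))))
    
    return result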
        
# parameters:
    # filePath (string): contains the location of the dataset
    # trainPercent (float): number between 0 and 1, representing percentage of data to be used for training
# returns:
    # a tuple of lists (training, testing)
def makeData(filePath, trainPercent):
    training = []
    testing = []
    
    contents = readCSV(filePath)
    # contents = processData(contents, features) # comment this in if data needs pre-processing
    numData = len(contents)
    numTrain = round(numData*trainPercent) # round so that it is a whole number
    
    lenTrainSet = 0
    for line in contents:
        if lenTrainSet < numTrain:
            training.append(line)
            lenTrainSet += 1
        else:
            testing.append(line)
    
    return (training, testing)  

Then, we will define our distance function. Here, we will implement Euclidean distance between points in 2D space, but this is interchangeable with any other definition of distance.

In [None]:
# parameters:
    # point1, point2: two points of the form (x,y), where x and y are floats
# returns:
    # euclidean distance as a float
def distance(point1, point2):
    (x1,y1) = point1
    (x2,y2) = point2
    d = (x1-x2)**2 + (y1-y2)**2
    return math.sqrt(d)

Next, we will find our k nearest neighbors using the distance function defined above.

In [None]:
# parameters:
    # k: integer representing number of neighbors to be selected
    # trainData: list of training data, each of the form (label, (x,y))
    # newPoint: new data point of the form (x,y)
# returns:
    # list of the labels of the k nearest neighbors
def findNeighbors(k, trainData, newPoint):
    dist = []
    for data in trainData:
        (label, point) = data
        dist.append((label, distance(newPoint, point)))
    # sort by distance (the second item in each pair), closest first
    dist = sorted(dist, key=lambda pair: pair[1])
    
    neighbors = dist[0:k]
    
    result = []
    for n in neighbors:
        (label, d) = n
        result.append(label)
    
    return result

Finally, we will let them vote!

In [None]:
# parameters:
    # neighbors: list of neighbor labels
# returns:
    # the most common label (result of the neighbor vote)
def vote(neighbors):
    d = dict()
    for neighbor in neighbors:
        d[neighbor] = d.get(neighbor,0) + 1
    
    bestCount = 0
    bestGuess = None
    for key in d:
        currCount = d[key]
        if (bestGuess is None) or (currCount > bestCount):
            bestCount = currCount
            bestGuess = key
    return bestGuess

And finally... Let's put this all together!

In [None]:
(train, test) = makeData('data.csv', 0.8) # load the data and split it into training and testing sets

# parameters:
    # k: integer representing number of neighbors to be selected
    # newPoint: new data point of the form (x,y)
# returns:
    # classification of newPoint
def knn(k, newPoint):
    neighborSet = findNeighbors(k, train, newPoint)
    classification = vote(neighborSet)
    return classification
    
knn(5, (3.0, 4.0)) # example call to knn

If we are curious about how accurate we are, we can check that by doing the following:

In [None]:
# parameters:
    # testData: list of testing data
    # k: integer representing number of neighbors to be selected
# returns:
    # (output, labels): tuple of lists where output = guess classification, labels = real classification
def classifyTestingData(testData, k):
    output = []
    labels = []
    for data in testData:
        (label, point) = data
        labels.append(label)
        classified = knn(k, point)
        output.append(classified)
    
    return (output, labels)

# parameters:
    # testData: list of testing data
    # k: integer representing number of neighbors to be selected
# returns:
    # float between 0 and 1 that represents the fraction of correct results
def accuracyCheck(testData, k):
    (guessLabel, realLabel) = classifyTestingData(testData, k)
    assert(len(guessLabel) == len(realLabel)) # sanity check: one guess per test point
    
    numPoints = len(guessLabel)
    correctGuesses = 0
    for i in range(numPoints):
        if guessLabel[i] == realLabel[i]:
            correctGuesses += 1
    
    return correctGuesses / numPoints
        
accuracyCheck(test, 5)

#### Brief aside: my model is 100% accurate. Is that good?

...Not necessarily! A perfect score can be a sign of an issue called _overfitting_.

Because your model is trained on a specific set of data, it can actually fit that data _too_ well. Since there is no way to train the model on all of the data that will ever exist, this is an issue: a model that fits one specific set of data too closely may not perform well on other data sets that look different. A perfect score can also mean that the testing data is too small, or too similar to the training data, to be a real challenge.

Of course, an accuracy that is too low is also not desirable.

This is why it is important to double check for accuracy!
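One way to probe for overfitting is to compare accuracy on the training data with accuracy on the testing data; a large gap suggests the model has memorized its training points. The sketch below is self-contained and uses its own compact helpers (`predict`, `accuracy`) and made-up points, all of which are assumptions for this example. `math.dist` requires Python 3.8 or newer.

```python
import math
from collections import Counter

def predict(k, trainData, point):
    # the k training points closest to `point` vote on its label
    nearest = sorted(trainData, key=lambda item: math.dist(item[1], point))[:k]
    return Counter(label for (label, _) in nearest).most_common(1)[0][0]

def accuracy(k, trainData, data):
    # fraction of points in `data` that are classified correctly
    correct = sum(1 for (label, point) in data
                  if predict(k, trainData, point) == label)
    return correct / len(data)

# made-up points; ("b", (0.4, 0.4)) plays the role of a noisy training label
train = [("a", (0.0, 0.0)), ("a", (0.0, 1.0)), ("a", (1.0, 0.0)),
         ("b", (0.4, 0.4)),
         ("b", (5.0, 5.0)), ("b", (5.0, 6.0)), ("b", (6.0, 5.0))]
test = [("a", (0.5, 0.5)), ("a", (0.3, 0.5)),
        ("b", (5.5, 5.5)), ("b", (5.2, 5.9))]

# with k = 1, every training point is its own nearest neighbor
trainAccuracy = accuracy(1, train, train)
testAccuracy = accuracy(1, train, test)
print(trainAccuracy, testAccuracy)  # 1.0 0.5
```

Here the model is perfect on data it has already seen but right only half the time on new points, which is the gap to watch for.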

## Practical Applications

### Digit Recognition

Suppose you had a hand-written number and you wanted the computer to recognize what it was. We can use KNN to solve this problem! Let's talk about one approach to this problem here:

You can imagine creating a bounding box around the number, where pen-strokes are black and everything else within the box is white. So, divide this box into some number of smaller boxes (100x150 for example). Each small box is colored either white or black.

Computers are really good at working with vectors. So, let's turn this grid into a vector such that white boxes are 0s and black boxes are 1s. So, we end up with something that looks like this:

[[white, black, white]  
 [white, black, white]  
 [white, black, white]  
 [white, black, white]]

=> [0,1,0,0,1,0,0,1,0,0,1,0]
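The conversion above can be sketched in a few lines. The function name `flatten` and the use of the strings "white" and "black" to stand for box colors are assumptions made for this example.

```python
grid = [["white", "black", "white"],
        ["white", "black", "white"],
        ["white", "black", "white"],
        ["white", "black", "white"]]

def flatten(grid):
    # read the grid row by row: black boxes become 1, white boxes become 0
    return [1 if box == "black" else 0 for row in grid for box in row]

vector = flatten(grid)
print(vector)  # [0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
```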

Then, we can graph this and find the closest points by Euclidean distance like we did before.

But there's an issue. There are so many dimensions! How would we ever graph this? One technique we could use is PCA, which stands for principal component analysis. The algorithm finds the directions along which the data varies the most and re-expresses each vector using only those directions. Let's see how this could work given some data point `newPoint`!

In [None]:
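from sklearn.decomposition import PCA

# trainVectors (assumed name): list of flattened training images, one vector per image
pca = PCA(2)
pca.fit(trainVectors) # PCA has to be fit on many examples, not on a single point
X = pca.transform([newPoint]) # transform expects a list of vectors, so wrap newPoint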

Constructing `PCA` with 2 as the parameter means that `transform` returns vectors with 2 dimensions.

In order to integrate this with the code we wrote above, all we need to do is define `processData` and pick k!

In [None]:
from sklearn.decomposition import PCA

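def processData(data, features):
    # assumes data is a list of (label, vector) pairs
    labels = [label for (label, vector) in data]
    vectors = [vector for (label, vector) in data]
    
    # fit PCA once on all of the vectors, then reduce each one to `features` dimensions
    pca = PCA(features)
    reduced = pca.fit_transform(vectors)
    
    result = []
    for (label, point) in zip(labels, reduced):
        result.append((label, tuple(point)))
    
    return result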

def makeData(filePath, trainPercent):
    training = []
    testing = []
    
    contents = readCSV(filePath)
    contents = processData(contents, 2) # reduce each data point to 2 dimensions
    numData = len(contents)
    numTrain = round(numData*trainPercent) # round so that it is a whole number
    
    lenTrainSet = 0
    for line in contents:
        if lenTrainSet < numTrain:
            training.append(line)
            lenTrainSet += 1
        else:
            testing.append(line)
    
    return (training, testing)  

### Loan eligibility

Suppose you had a set of people whose loan defaulting habits you knew. Given a new person, can you figure out how likely they are to default on a loan?

You can with KNN!

Here it would be important to pick appropriate features. There are a lot of characteristics of a person, but not all of them would be relevant in determining loan defaulting habits.

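Again, all we need to integrate this with our code above is to define `processData` and pick k! Here is a sketch to get you started:

In [None]:
# assumes each data point is of the form (label, person), where person is a
# dictionary from feature names to numbers, and features lists the names to keep
def processData(data, features):
    result = []
    for d in data:
        (label, person) = d
        newData = tuple(person[name] for name in features)
        result.append((label, newData))
    
    return result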

### Other examples

KNN is useful in many other practical applications. Just some of the possibilities are:
- facial recognition
- detecting fraudulent credit card activity
- credit ratings
- detecting suspicious activity in surveillance footage

As you can see, KNN is a simple yet powerful tool to use in extracting meaningful results from data!