# DSC Week 8
This week, you will examine the following: K-Nearest Neighbors (KNN) and distance metrics including Euclidean, Manhattan, Minkowski, and Cosine. 


## Learning Objectives
At the end of this week, you should be able to: 
- Describe how K-Nearest Neighbors makes predictions for classification and regression tasks based on the values of nearby data points. 
- Compare different distance metrics (Euclidean, Manhattan, Minkowski, and Cosine) and explain when each might be more appropriate. 
- Explain why feature scaling is important when using distance-based models, especially with Euclidean distance. 
- Analyze how outliers and feature scaling affect different distance metrics and how to choose or adjust metrics accordingly.

## 8.1 Lesson: K-Nearest Neighbors
K-Nearest Neighbors (KNN) is a simple approach to classification or regression. Given a data sample, it measures the distance from that sample to every training sample and selects the $k$ closest ones. 

That is, if $k$ is 4, it picks the 4 nearest. It then looks at the target value for these 4 nearest neighbors and averages them (for regression) or picks the most common answer (for classification). 
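The prediction step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names, the toy data, and the choice of Euclidean distance are all assumptions made for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Square root of the sum of squared feature differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(train_X, train_y, query, k):
    """Return the target values of the k training points closest to query."""
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], query))
    return [train_y[i] for i in order[:k]]

def knn_classify(train_X, train_y, query, k):
    """Classification: the most common label among the k nearest neighbors."""
    return Counter(k_nearest(train_X, train_y, query, k)).most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k):
    """Regression: the mean target value of the k nearest neighbors."""
    neighbors = k_nearest(train_X, train_y, query, k)
    return sum(neighbors) / len(neighbors)

# Hypothetical toy data: two features per sample.
X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
labels = ["a", "a", "a", "b", "b", "b"]
values = [10.0, 12.0, 11.0, 30.0, 32.0, 34.0]

query = (2.0, 2.0)
print(knn_classify(X, labels, query, k=4))  # three of the four neighbors are "a"
print(knn_regress(X, values, query, k=4))   # mean of 12, 11, 10, and 30
```

Note that with $k = 4$ the fourth neighbor comes from the far cluster and pulls the regression average upward, which shows why the choice of $k$ matters.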

**This method is useful because it doesn't require a complex model — just a good way to measure similarity**. 

For example: 
- A health care provider could use K-Nearest Neighbors to predict whether a patient has a certain disease based on symptoms and medical history. 
- By comparing a new patient's data to that of similar past patients (their “nearest neighbors”), the system can classify the new case based on the majority outcome among those neighbors. 
- The nearest neighbors are a specific number of patients who have symptoms and medical history similar to the patient of interest. 

It matters how we define “nearest”: 
- We could use the Euclidean distance (the square root of the sum of the squares of the feature differences) or the Manhattan distance (the sum of the absolute values of the feature differences). 
- The Euclidean distance has the property that if a sample is far from other samples in just one dimension, that one dimension matters a lot (it’s squared). 
- With the Manhattan distance, the one outlier dimension matters less. 
- Thus, the choice between the two depends on how you want to treat a large difference in a single dimension. Euclidean distance penalizes it heavily, so the nearest neighbors tend to have small to moderate differences in every dimension. Manhattan distance tolerates it, so a nearest neighbor may differ greatly in one dimension as long as it is close in the others. 
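To make the trade-off concrete, the sketch below compares two hypothetical candidate neighbors: one that differs moderately from the query in every dimension, and one that is very close except for a single large difference. The points and names are invented for illustration.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared feature differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute values of the feature differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

query = (0.0, 0.0, 0.0)
moderate = (3.0, 3.0, 3.0)  # moderate difference in every dimension
one_big = (6.0, 0.5, 0.5)   # close, except for one outlier dimension

for name, point in [("moderate", moderate), ("one_big", one_big)]:
    print(name, round(euclidean(query, point), 3), manhattan(query, point))
```

Under Euclidean distance the `moderate` point is nearer, while under Manhattan distance the `one_big` point is nearer, so the two metrics can disagree about which neighbor is "nearest."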

It is important to use feature scaling. For example, if one feature is “height in feet,” and another is “weight in pounds,” then the weight will be (for humans) a much larger number, and the differences will be much larger. This will unfairly put too much importance on the weight number. 

If we scale the features so that they are $Z$ scores (subtract the mean, then divide by the standard deviation), then each feature is measured in units of its own standard deviation, and the features contribute to the distance on a comparable scale. 

This is especially important for Euclidean distances, which square the feature differences and hence exaggerate the scale differences. 
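As a small illustration of $Z$-score scaling, the sketch below standardizes two made-up features. The measurements are hypothetical, and using the population standard deviation (`statistics.pstdev`) rather than the sample standard deviation is an assumption of this example.

```python
import statistics

def z_scores(values):
    """Subtract the mean, then divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical measurements for four people.
heights_ft = [5.0, 5.5, 6.0, 6.5]
weights_lb = [120.0, 150.0, 180.0, 210.0]

# Raw differences between the first two people: weight dominates.
print(abs(heights_ft[0] - heights_ft[1]), abs(weights_lb[0] - weights_lb[1]))

heights_z = z_scores(heights_ft)
weights_z = z_scores(weights_lb)

# After scaling, the same two people differ by a comparable amount per feature.
print(abs(heights_z[0] - heights_z[1]), abs(weights_z[0] - weights_z[1]))
```

Before scaling, the weight difference is 60 times the height difference; after scaling, the two differences are the same size in this toy data set.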

The Manhattan distance is less sensitive to outliers than the Euclidean distance. If we assume that outliers involve a large difference in one feature only, then Manhattan doesn’t square that feature and so is less sensitive. 

### Minkowski
The Minkowski distance raises the absolute value of each feature difference to the power $p$, adds the results, then raises the sum to the power $1/p$, that is, $d(x, y) = \left(\sum_i |x_i - y_i|^p\right)^{1/p}$: 
- If $p = 2$, it’s Euclidean. 
- If $p = 1$, it’s Manhattan. 

This is versatile, since you can tune the power $p$ to meet your needs. 
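A minimal sketch of the Minkowski distance as a Python function follows. The function name and the sample points are assumptions for illustration; the two special cases can be checked by hand.

```python
def minkowski(a, b, p):
    """Sum |difference|**p over the features, then take the 1/p power."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

a, b = (0.0, 0.0), (3.0, 4.0)

print(minkowski(a, b, 1))  # Manhattan case: 3 + 4
print(minkowski(a, b, 2))  # Euclidean case: sqrt(9 + 16)
print(minkowski(a, b, 3))  # an in-between choice of p
```

For these points, larger values of $p$ give smaller distances, because the largest single difference increasingly dominates the sum.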

For example: 
- Say you are using K-Nearest Neighbors to determine which songs are most similar to a particular song on your playlist. 
- You want to assign the category “classical” if it is most similar to classical songs, “blues” if it is most similar to blues songs, etc. 
- You use cross-validation to check whether you are right. 
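- Then you might run a hyperparameter search where your model uses $p = 1, 1.5, 2, 2.5$, and $3$ and check which one is best. (The value $p = 0$ is undefined because of the $1/p$ power, and values of $p$ below 1 do not satisfy the triangle inequality, so they do not give a true distance metric.) 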
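The search described above might look like the following sketch, which uses leave-one-out cross-validation written in plain Python. Everything here is an assumption for illustration: the song features are invented and presumed already scaled, the candidate values of $p$ are arbitrary choices with $p \ge 1$, and $k = 3$ is picked only to keep the example small.

```python
from collections import Counter

def minkowski(a, b, p):
    """Sum |difference|**p over the features, then take the 1/p power."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn_predict(train_X, train_y, query, k, p):
    """Most common label among the k nearest neighbors under Minkowski-p."""
    order = sorted(range(len(train_X)),
                   key=lambda i: minkowski(train_X[i], query, p))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

def loo_accuracy(X, y, k, p):
    """Leave-one-out cross-validation: hold out each sample exactly once."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if knn_predict(train_X, train_y, X[i], k, p) == y[i]:
            correct += 1
    return correct / len(X)

# Hypothetical, already-scaled song features (two per song).
songs = [(-1.2, 0.9), (-0.8, 1.1), (-1.0, 0.4), (-0.3, 0.8),
         (1.1, -0.7), (0.9, -1.2), (0.4, -0.9), (1.3, -0.2)]
genres = ["classical"] * 4 + ["blues"] * 4

candidates = [1, 1.5, 2, 3]
scores = {p: loo_accuracy(songs, genres, k=3, p=p) for p in candidates}
best_p = max(scores, key=scores.get)  # on ties, the first candidate wins
print(scores, best_p)
```

On a tiny, well-separated toy set like this one the candidates may all score the same; on real data the scores would typically differ, and that difference is what the search is meant to reveal.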

### Cosine
The **cosine distance** is 1 minus the cosine of the angle between two vectors. 

Because it depends only on the angle, the magnitude of the vectors doesn’t matter, which is useful when only the direction is meaningful. 

For example: 
- If you have one document that says “egg egg bacon” and another that says “egg egg egg egg bacon bacon,” then you might want to assign vectors of $(2, 1)$ and $(4, 2)$ to these documents, respectively, where the first is the number of “eggs” and the second is the number of “bacons.” 

However, you might also want to treat these documents as equivalent; that is, you want the cosine distance between them to be zero. 
- The angle between the vectors is 0, because they point in the same direction, even though they are different vectors. 
- This means that their cosine distance is $1 - \cos(0) = 1 - 1 = 0$.
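The calculation above can be checked with a short sketch. The function name is an assumption, and the third document is a hypothetical addition used only to show a nonzero distance.

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

doc1 = (2, 1)  # "egg egg bacon"
doc2 = (4, 2)  # "egg egg egg egg bacon bacon"
doc3 = (1, 4)  # hypothetical: "egg bacon bacon bacon bacon"

print(cosine_distance(doc1, doc2))  # approximately 0 (floating-point rounding)
print(cosine_distance(doc1, doc3))  # clearly greater than 0
```

Because of floating-point rounding, the first result may be a tiny number rather than exactly zero, so comparisons in practice should use a small tolerance.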


## Think About It
- Why would we use the Manhattan distance when there are outliers? 
- When $p$ is a very large number (say, 1,000,000), what does the Minkowski distance behave like? This is sometimes referred to as $p = \infty$. 
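    - That is, use Python to try computing $(x^p + y^p)^{1 / p}$ for large $p$ and for different nonnegative values of $x$ and $y$. What do you get? (Start with $p$ in the tens or hundreds, since very large powers can overflow floating-point numbers.) 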
- Can there be outliers if we standardize the data, or does the standardizing get rid of the outliers? (Prove that there can still be outliers!) 
- Why might we choose cosine distance instead of Euclidean or Manhattan when comparing documents or other data where the direction of features matters more than their size? 


