## K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) is a simple, instance-based, non-parametric machine learning algorithm used for both classification and regression tasks. The basic idea is to predict the label or value of a new data point based on the labels or values of the closest data points (neighbors) in the feature space.

### **How KNN Works**
KNN works by following these steps:
1. **Select the number of neighbors (K):** Choose a value of K, which is the number of nearest neighbors to consider for making a prediction.
2. **Compute distances:** Calculate the distance between the new data point and every point in the training set using a distance metric (typically Euclidean distance, though others such as Manhattan or Minkowski distance can be used).
3. **Identify neighbors:** Find the K points that are closest to the new data point based on the computed distances.
4. **Make a prediction:**
   - **Classification:** Assign the most common class (mode) among the K nearest neighbors to the new data point.
   - **Regression:** Assign the average (mean) value of the K nearest neighbors to the new data point.
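
The four steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function name `knn_predict`, its parameters, and the toy data are assumptions made up for this example.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3, task="classification"):
    """Predict for `query` from its k nearest training points (hypothetical helper)."""
    # Step 2: Euclidean distance from the query to every training point.
    distances = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    # Step 3: keep the k closest points.
    neighbors = sorted(distances, key=lambda pair: pair[0])[:k]
    targets = [y for _, y in neighbors]
    # Step 4: majority vote for classification, mean for regression.
    if task == "classification":
        return Counter(targets).most_common(1)[0][0]
    return sum(targets) / len(targets)

# Toy data, made up for illustration only.
X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "b", "b", "b"]
values = [10.0, 12.0, 50.0, 52.0, 48.0]

print(knn_predict(X, labels, (1.1, 1.0), k=3))                     # classification
print(knn_predict(X, values, (5.0, 5.0), k=3, task="regression"))  # regression
```

In this sketch a tied vote goes to the label that appears first among the distance-sorted neighbors; other tie-breaking rules are equally reasonable.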

### **KNN for Classification: Example**

Let’s consider an example where we classify whether a flower is a **setosa** or **versicolor** based on its petal width and length.

- **Dataset:**
   - **Features:** Petal width and length.
   - **Labels:** `Setosa` and `Versicolor`.
  
- **New Data Point:** Suppose we have a new flower with petal width = 1.5 and length = 4.5, and we want to classify whether it is `Setosa` or `Versicolor`.

**Steps:**
1. **Choose K:** Let’s select \( K = 3 \).
2. **Calculate Distances:** Calculate the Euclidean distance between the new flower and all the other flowers in the dataset:
   ```python
   # Euclidean distance between feature vectors x and x_i (each of length n)
   d = sum((x_j - x_ij) ** 2 for x_j, x_ij in zip(x, x_i)) ** 0.5
   ```
3. **Find the 3 Nearest Neighbors:** Based on the calculated distances, identify the 3 nearest flowers to the new data point.
4. **Classify:** Take a majority vote among the 3 nearest neighbors. If 2 of the 3 neighbors are labeled `Setosa` and 1 is labeled `Versicolor`, the new flower is classified as `Setosa`.
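
The same walk-through can be run on a small dataset. The measurements below are made up for illustration (they are not real Iris records), and only the new flower's values come from the example above. With these particular points all three neighbors happen to be `Versicolor`; a 2-to-1 split would be resolved by the same vote.

```python
import math
from collections import Counter

# Hypothetical measurements as (petal width, petal length); not real Iris records.
flowers = [
    ((0.2, 1.4), "Setosa"),
    ((0.3, 1.5), "Setosa"),
    ((0.4, 1.7), "Setosa"),
    ((1.3, 4.0), "Versicolor"),
    ((1.4, 4.6), "Versicolor"),
    ((1.6, 4.7), "Versicolor"),
]
new_flower = (1.5, 4.5)  # petal width = 1.5, petal length = 4.5
k = 3

# Rank every flower by its Euclidean distance to the new flower.
ranked = sorted(flowers, key=lambda item: math.dist(item[0], new_flower))

# Count the labels of the k closest flowers and take the most common one.
votes = Counter(label for _, label in ranked[:k])
prediction = votes.most_common(1)[0][0]

print(votes)
print(prediction)
```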

### **KNN for Regression: Example**

Now, let’s consider a regression problem where we want to predict the house price based on features like the number of rooms, lot size, and proximity to a city center.

- **Dataset:**
   - **Features:** Number of rooms, lot size, distance from city center.
   - **Labels:** House prices.

- **New Data Point:** Suppose we have a new house with 3 rooms, a lot size of 500 sq meters, and it is 10 km from the city center. We want to predict its price.

**Steps:**
1. **Choose K:** Let’s select \( K = 3 \).
2. **Calculate Distances:** Compute the Euclidean distance between the new house and all the other houses in the dataset based on their features.
3. **Find the 3 Nearest Neighbors:** Identify the 3 houses that are closest to the new house.
4. **Predict the Price:** Take the average of the prices of the 3 nearest houses. Suppose the prices of the nearest houses are $200,000, $220,000, and $210,000. Then the predicted price would be:
   ```python
   predicted_price = (200000 + 220000 + 210000) / 3  # 210000.0
   ```
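
As a sketch, the regression steps can be traced end to end. The three prices of the nearest houses and the new house's features are taken from the example above; every other value in the list is hypothetical. Features are left unscaled here to keep the example short, so lot size contributes most of the distance.

```python
import math

# Hypothetical houses as ((rooms, lot size in sq m, km from center), price).
houses = [
    ((3, 480, 9), 200000),
    ((3, 520, 11), 220000),
    ((4, 510, 10), 210000),
    ((5, 900, 3), 450000),
    ((2, 200, 25), 120000),
]
new_house = (3, 500, 10)
k = 3

# Sort by Euclidean distance to the new house and keep the k closest.
ranked = sorted(houses, key=lambda item: math.dist(item[0], new_house))
nearest_prices = [price for _, price in ranked[:k]]

# The prediction is the mean price of those neighbors.
predicted_price = sum(nearest_prices) / k

print(nearest_prices)
print(predicted_price)
```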

### **Key Points to Consider**

1. **Distance Metric:** The choice of distance metric affects KNN performance. Euclidean (L2) distance is the most common choice; Manhattan (L1) distance is a frequent alternative, and Hamming distance is often used for categorical data.

2. **Value of K:**
   - Small values of K (e.g., \( K = 1 \)) make the model sensitive to noise in the dataset but provide more flexible decision boundaries.
   - Large values of K smooth out predictions but may blur the distinctions between different classes.

3. **Scaling/Normalization:** Since KNN relies on distance calculations, it’s important to scale or normalize the feature values to avoid one feature dominating the distance calculation. For instance, features with larger ranges can distort the distances unless the data is scaled.

4. **Lazy Learner:** KNN is a **lazy learning algorithm**, meaning it doesn’t learn a model during the training phase. All computation is deferred until prediction, which can lead to high computational cost when the dataset is large.
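
The points about distance metrics and scaling can be made concrete with a short sketch. The helper names (`manhattan`, `hamming`, `fit_min_max`, `scale`, `nearest_index`) and the house values are assumptions invented for this illustration. Min-max scaling is only one of several common scaling choices.

```python
import math

def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(p - q) for p, q in zip(a, b))

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(p != q for p, q in zip(a, b))

def fit_min_max(rows):
    """Return per-column minimums and maximums of the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def scale(row, lows, highs):
    """Map each feature to [0, 1] relative to the training range."""
    return tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                 for v, lo, hi in zip(row, lows, highs))

def nearest_index(points, q):
    """Index of the point closest to q under Euclidean distance."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], q))

print(math.dist((0, 0), (3, 4)))  # Euclidean distance
print(manhattan((0, 0), (3, 4)))  # Manhattan distance
print(hamming("red", "rod"))      # Hamming distance

# Hypothetical houses as (rooms, lot size in sq meters).
houses = [(6, 505), (2, 530), (4, 600)]
query = (2, 505)

# Unscaled, lot size swamps the room count: the 6-room house is "nearest".
print(nearest_index(houses, query))

# After scaling, both features matter and the 2-room house is nearest.
lows, highs = fit_min_max(houses)
scaled_houses = [scale(h, lows, highs) for h in houses]
scaled_query = scale(query, lows, highs)
print(nearest_index(scaled_houses, scaled_query))
```

The scaling parameters are computed from the stored houses only and then applied to the query, so the query is measured on the same scale as the data it is compared against.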

### **Advantages of KNN**
- **Simplicity:** Easy to implement and understand.
- **No Training Phase:** No explicit model training is required, making KNN suitable for dynamic datasets where data frequently changes.
- **Versatility:** Can be used for both classification and regression tasks.

### **Disadvantages of KNN**
- **Computationally Expensive:** During prediction, a straightforward KNN search computes the distance between the query point and every point in the dataset, which can be slow for large datasets.
- **Memory Intensive:** KNN stores all training data, which can consume a lot of memory.
- **Sensitive to Irrelevant Features:** If irrelevant features are included, they can distort the distance metric and lead to poor performance. Therefore, feature selection is crucial.
- **Imbalanced Datasets:** KNN struggles with imbalanced datasets where one class dominates, as the majority class will be more likely to be chosen.
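
The imbalance issue can be seen on a toy dataset. The points, the labels `common` and `rare`, and the helper `knn_classify` are all made up for this sketch. The query sits right next to the two `rare` points, yet a large enough K lets the majority class outvote them.

```python
import math
from collections import Counter

def knn_classify(points, labels, query, k):
    """Majority vote among the k nearest points (hypothetical helper)."""
    order = sorted(range(len(points)),
                   key=lambda i: math.dist(points[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Made-up data: 8 "common" points and only 2 "rare" points.
points = [(0.0, 0.0), (0.5, 0.2), (1.0, 0.1), (0.2, 0.9), (0.8, 1.0),
          (1.2, 0.6), (0.4, 0.5), (0.9, 0.4),
          (3.0, 3.0), (3.1, 2.9)]
labels = ["common"] * 8 + ["rare"] * 2
query = (3.0, 2.9)

# Small K follows the nearby minority points; large K is swamped by the majority.
for k in (1, 3, 7):
    print(k, knn_classify(points, labels, query, k))
```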

### **Conclusion**

KNN is a versatile algorithm that can be used for both classification and regression. Its performance depends heavily on the choice of K, the distance metric, and feature scaling. Because prediction cost and memory use grow with the size of the dataset, it is best suited to smaller datasets. Even so, its simplicity and effectiveness make it a popular choice for many applications.