# K-Nearest Neighbors (KNN)

## Problem Type
**K-Nearest Neighbors (KNN)** is primarily used for:
- **Supervised Learning**
- **Classification** and **Regression** tasks
- **Applications**: Image classification, recommendation systems, anomaly detection, and pattern recognition.

### How K-Nearest Neighbors Works
- **Instance-Based Learning:**
  - KNN is a lazy learner, meaning it does not explicitly learn a model but rather stores all training instances and uses them directly for making predictions.
- **Distance Calculation:**
  - To make a prediction, KNN calculates the distance between the test instance and all training instances. Common distance metrics include Euclidean distance, Manhattan distance, and Minkowski distance.
- **Voting (Classification):**
  - For classification, the algorithm identifies the `K` nearest neighbors to the test instance and assigns the class that is most frequent among these neighbors.
- **Averaging (Regression):**
  - For regression, KNN takes the average (or weighted average) of the target values of the `K` nearest neighbors to predict the value for the test instance.
- **Choosing K:**
  - The choice of `K` (the number of neighbors) is critical. A small `K` makes the model sensitive to noise, while a large `K` may smooth out predictions too much.
- **No Training Phase:**
  - KNN has no training phase in the usual sense: fitting only stores the dataset (and, for tree-based search, builds an index over it). All of the real work happens at prediction time.
- **Non-Parametric:**
  - KNN is non-parametric, meaning it makes no assumptions about the form of the underlying data distribution and has no fixed set of learned parameters.
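
The steps above (distance calculation, voting, averaging) can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements it; the helper names `minkowski_distances` and `knn_predict` and the toy data are made up for this example.

```python
from collections import Counter

import numpy as np


def minkowski_distances(X_train, x, p=2):
    """Distance from query x to every row of X_train (p=1 Manhattan, p=2 Euclidean)."""
    return np.sum(np.abs(X_train - x) ** p, axis=1) ** (1.0 / p)


def knn_predict(X_train, y_train, x, k=3, p=2, task="classification"):
    """Predict for a single query point x from its k nearest training points."""
    distances = minkowski_distances(X_train, x, p=p)
    nearest = np.argsort(distances)[:k]  # indices of the k closest points
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the k neighbors.
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: unweighted mean of the neighbors' target values.
    return float(np.mean(neighbor_targets))


# Toy data (hypothetical): two well-separated clusters in 2D.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

query = np.array([0.5, 0.5])
predicted_class = knn_predict(X_toy, y_class, query, k=3)
predicted_value = knn_predict(X_toy, y_value, query, k=3, task="regression")
print(predicted_class, predicted_value)
```

The query sits inside the first cluster, so its three nearest neighbors all come from that cluster and determine both the voted class and the averaged value.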

### Key Tuning Parameters
- **`K` (`n_neighbors` in scikit-learn):**
  - **Description:** The number of neighbors considered for making predictions.
  - **Impact:** A small `K` can lead to high variance (overfitting), while a large `K` can lead to high bias (underfitting).
  - **Default:** `5` in scikit-learn; small odd values such as `3` or `5` are common starting points, but `K` should be tuned for the specific dataset.
- **`distance_metric`:**
  - **Description:** The function used to calculate the distance between instances (e.g., Euclidean, Manhattan, Minkowski).
  - **Impact:** Different metrics can capture different aspects of the data, influencing the model's performance.
  - **Default:** `Euclidean` is the usual choice (in scikit-learn, `metric="minkowski"` with `p=2`); `Manhattan` (`p=1`) is sometimes preferred for high-dimensional data.
- **`weight_function`:**
  - **Description:** Determines how the neighbors contribute to the prediction (e.g., uniform weighting, distance weighting).
  - **Impact:** Weighted KNN can improve performance by giving closer neighbors more influence.
  - **Default:** `uniform` (all neighbors have equal weight). The alternative, `distance`, weights each neighbor by the inverse of its distance so that closer neighbors count more.
- **`algorithm`:**
  - **Description:** The algorithm used to compute the nearest neighbors (e.g., brute-force, KD-tree, Ball-tree).
  - **Impact:** Affects computational efficiency, especially with large datasets.
  - **Default:** `auto` lets the model choose the most appropriate method based on the data.
- **`leaf_size`:**
  - **Description:** Used in KD-tree and Ball-tree algorithms, affecting the speed of queries and memory usage.
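  - **Impact:** Affects how quickly the tree is built and queried and how much memory it needs; it does not change the predictions. Smaller leaves mean more tree nodes and therefore more memory, and the fastest setting depends on the dataset.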
  - **Default:** Typically `30`, but can be adjusted based on dataset size and dimensionality.
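
One common way to tune these parameters is a cross-validated grid search. The sketch below assumes scikit-learn and the Iris dataset; the grid values are illustrative choices, not recommendations, and the `knn__` prefixes simply refer to the pipeline step named `knn` here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# Scaling lives inside the pipeline so each CV fold is scaled
# using statistics from its own training portion only.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Illustrative grid (assumed values).
param_grid = {
    "knn__n_neighbors": [1, 3, 5, 7, 9, 11],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],  # Minkowski power: 1 = Manhattan, 2 = Euclidean
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_iris, y_iris)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

`algorithm` and `leaf_size` are left out of the grid because they change speed and memory use rather than the predictions.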

### Pros vs Cons

| Pros                                                  | Cons                                                   |
|-------------------------------------------------------|--------------------------------------------------------|
| Simple and easy to implement                          | Computationally expensive, especially with large datasets |
| No training phase required                            | Performance degrades with high-dimensional data (curse of dimensionality) |
| Can handle multi-class classification and regression tasks | Sensitive to irrelevant features and noise in the data |
| Intuitive and easy to understand                      | Requires careful choice of `K` and distance metric      |
| Flexible with the choice of distance metric           | Large memory requirements due to storing the entire training set |
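
The curse of dimensionality listed in the table can be made concrete with a small experiment. The sketch below uses uniformly random points (an assumption chosen purely for illustration) and compares the nearest and farthest distances from a random query as the number of dimensions grows; `nearest_to_farthest_ratio` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)


def nearest_to_farthest_ratio(n_points, n_dims):
    """Ratio of the nearest to the farthest Euclidean distance from a random query."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()


ratios = {d: nearest_to_farthest_ratio(1000, d) for d in (2, 10, 100, 1000)}
for d, ratio in ratios.items():
    print(f"dims={d:>4}  nearest/farthest distance ratio = {ratio:.3f}")
```

As the ratio moves toward 1, the "nearest" neighbor is barely closer than the farthest point, which is why distance-based voting becomes less informative in high dimensions.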

### Evaluation Metrics
- **Accuracy (Classification):**
  - **Description:** Ratio of correct predictions to total predictions.
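  - **Good Value:** Higher is better, but what counts as strong depends on the task and class balance; compare against a baseline such as always predicting the majority class.
  - **Bad Value:** Accuracy at or below that baseline (0.5 for a balanced binary problem) suggests poor model performance.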
- **Precision (Classification):**
  - **Description:** Proportion of true positives among all positive predictions.
  - **Good Value:** Higher values indicate fewer false positives, especially important in imbalanced datasets.
  - **Bad Value:** Low values suggest many false positives.
- **Recall (Classification):**
  - **Description:** Proportion of actual positives correctly identified.
  - **Good Value:** Higher values indicate fewer false negatives, important in recall-sensitive applications.
  - **Bad Value:** Low values suggest many false negatives.
- **F1 Score (Classification):**
  - **Description:** Harmonic mean of Precision and Recall.
  - **Good Value:** Higher values indicate a good balance between Precision and Recall.
  - **Bad Value:** Low values suggest a poor balance between Precision and Recall.
- **R-squared (Regression):**
  - **Description:** Proportion of variance in the dependent variable explained by the model.
  - **Good Value:** Higher is better; values closer to 1 indicate a strong model.
  - **Bad Value:** Values near 0 mean the model explains little of the variance; negative values mean it does worse than always predicting the mean.
- **Mean Absolute Error (MAE) (Regression):**
  - **Description:** Measures the average absolute difference between predicted and actual values.
  - **Good Value:** Lower is better; values close to `0` indicate high accuracy.
  - **Bad Value:** Higher values suggest significant prediction errors.
- **Root Mean Squared Error (RMSE) (Regression):**
  - **Description:** Measures the square root of the average squared difference between predicted and actual values.
  - **Good Value:** Lower is better; values close to `0` indicate high accuracy.
  - **Bad Value:** Higher values suggest the model's predictions deviate significantly from actual values.
- **AUC-ROC (Classification):**
  - **Description:** Measures the model's ability to distinguish between classes across all thresholds.
  - **Good Value:** Values closer to 1 indicate strong separability between classes.
  - **Bad Value:** Values near 0.5 suggest random guessing.
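
The metrics above can all be computed with `sklearn.metrics`. The sketch below uses small made-up arrays so the numbers are easy to check by hand; none of the values come from a real model.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, mean_absolute_error,
                             mean_squared_error, precision_score, r2_score,
                             recall_score, roc_auc_score)

# Made-up binary labels, hard predictions, and predicted probabilities.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3])

classification_metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "auc_roc": roc_auc_score(y_true, y_score),  # needs scores, not hard labels
}

# Made-up regression targets and predictions.
y_true_reg = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_reg = np.array([2.0, 5.0, 8.0, 11.0])

regression_metrics = {
    "r2": r2_score(y_true_reg, y_pred_reg),
    "mae": mean_absolute_error(y_true_reg, y_pred_reg),
    # RMSE taken as the square root of MSE to stay independent of sklearn version.
    "rmse": float(np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))),
}

for name, value in {**classification_metrics, **regression_metrics}.items():
    print(f"{name}: {value:.4f}")
```

For a KNN classifier, the scores passed to `roc_auc_score` would typically come from `predict_proba`, which reports the (possibly weighted) share of neighbors in each class.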



In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

In [None]:
# Load dataset (e.g., Iris dataset)
data = load_iris()
X = data.data[:, :2]  # Use only the first two features for 2D plot
y = data.target

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
# Initialize and train the KNN model
knn = KNeighborsClassifier(
    n_neighbors=5,
    metric="minkowski",  # with the default p=2 this is Euclidean distance
    weights="distance",
    algorithm="auto",
    leaf_size=30
)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

In [None]:
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=data.target_names)
conf_matrix = confusion_matrix(y_test, y_pred)

# Print metrics
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:")
print(report)
print("Confusion Matrix:")
print(conf_matrix)

In [None]:
# Create a mesh grid for plotting decision boundaries
h = .02  # Step size in the mesh
x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Predict classification for each point in the mesh
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot decision boundaries
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolor='k', s=20)
plt.xlabel('sepal length (standardized)')
plt.ylabel('sepal width (standardized)')
plt.title('KNN Decision Boundaries')
plt.show()