## KNN Algorithm



The K-Nearest Neighbours (KNN) algorithm is a supervised machine learning algorithm generally used for classification, but it can also be used for regression tasks.
- It works by finding the k closest data points to a given input.
- For classification, it predicts the majority class among those neighbours.
    - For regression, it predicts the average of their values.
- KNN makes no assumptions about the underlying data distribution, which makes it a non-parametric, instance-based learning method.
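
The regression case described above can be sketched as follows. This is a minimal illustration rather than a tuned implementation; the function name `knn_regress` and the toy data are assumptions made for this example.

```python
import math


def knn_regress(X_train, y_train, x, k=3):
    """Predict a numeric target as the mean of the k nearest training targets."""
    # Pair each training point's Euclidean distance to x with its target value.
    distances = [
        (math.dist(x, x_train), target)
        for x_train, target in zip(X_train, y_train)
    ]
    # Keep the k closest points and average their targets.
    k_nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    return sum(target for _, target in k_nearest) / len(k_nearest)


# Toy data (hypothetical): one feature, target is 10 times the feature.
X_train = [[1], [2], [3], [10], [11]]
y_train = [10.0, 20.0, 30.0, 100.0, 110.0]

prediction = knn_regress(X_train, y_train, [2.5], k=3)
print(prediction)  # 20.0
```

The three nearest points to 2.5 are 2, 3 and 1, so the prediction is the mean of 20, 30 and 10.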

#### Cons
- KNN is also called a lazy learner because it does not learn a model from the training set up front. Instead, it stores all the training data and defers the computation to prediction time, which makes inference slow and memory-hungry on large datasets.

#### K in KNN
- K is the number of nearby data points (neighbours) considered when making a prediction. It is a hyper-parameter.
- When selecting k: if it is too large, the model becomes too simple and underfits; if it is too small, the model becomes sensitive to noise and overfits.
    - If the data has a lot of noise and outliers, a slightly larger k usually gives better results. Choosing the optimal k therefore improves accuracy.

##### Statistical Methods for Selecting K
- Cross-validation
- Elbow method
- Odd values for K (avoids tied votes in binary classification)
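
One way to apply cross-validation to the choice of k is leave-one-out evaluation over a few odd candidates. The sketch below is only illustrative: the helper names `knn_predict` and `loo_accuracy` and the toy data (two clusters plus one deliberately mislabelled point) are assumptions for this example.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, x, k):
    """Classify x by majority vote among its k nearest training points."""
    nearest = sorted(
        zip(X_train, y_train), key=lambda pair: math.dist(x, pair[0])
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_accuracy(X, y, k):
    """Leave-one-out accuracy: hold out each point once, predict it from the rest."""
    correct = 0
    for i in range(len(X)):
        X_rest = X[:i] + X[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        if knn_predict(X_rest, y_rest, X[i], k) == y[i]:
            correct += 1
    return correct / len(X)


# Toy data (hypothetical): two clusters plus one noisy point labelled 1
# that sits in the middle of the class-0 cluster.
X = [[1, 1], [1, 2], [2, 1], [2, 2], [6, 6], [6, 7], [7, 6], [7, 7], [1.5, 1.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

# Evaluate a few odd candidates and keep the first one with the best score.
scores = {k: loo_accuracy(X, y, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

With k=1 the noisy point misleads every class-0 neighbour, while k=3 and k=5 outvote it, which is the noise-smoothing effect of a larger k.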

#### Choosing the right distance
1. Euclidean Distance [L2 Norm]: $dist(x^{(1)},x^{(2)}) = \sqrt{\sum_{j=1}^{d} (x^{(1)}_j - x^{(2)}_j)^2}$ (not ideal for high-dimensional data)
- It measures the straight-line distance between two points. Works well when all features are continuous and similarly scaled.
- Sensitive to large differences in feature values. Performs well on low-dimensional, normalized data. Used for geometric interpretation.

2. Manhattan Distance [L1 Norm]: $dist(x^{(1)},x^{(2)}) = \sum_{j=1}^{d} |x^{(1)}_j - x^{(2)}_j|$
- Computed by summing the absolute differences across dimensions.
- Useful when features represent directions or grid-based movement.
- More robust to outliers than Euclidean distance; often preferred for high-dimensional data and sparse feature spaces.

3. Minkowski Distance: $dist(x^{(1)},x^{(2)}) = \left(\sum_{j=1}^{d} |x^{(1)}_j - x^{(2)}_j|^p\right)^{\frac{1}{p}}$
- Generalised version of both Manhattan (L1, $p=1$) and Euclidean (L2, $p=2$), controlled by the parameter $p$.
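
A short sketch of how one function can cover both cases through `p`. The function name `minkowski_distance` is chosen for this example only.

```python
def minkowski_distance(x1, x2, p=2):
    """Minkowski distance; p=1 gives Manhattan (L1), p=2 gives Euclidean (L2)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    # Sum the p-th powers of the absolute differences, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x1, x2)) ** (1 / p)


a, b = [0, 0], [3, 4]
print(minkowski_distance(a, b, p=1))  # 7.0 (Manhattan)
print(minkowski_distance(a, b, p=2))  # 5.0 (Euclidean)
```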

4. Cosine Similarity: $sim(x^{(1)},x^{(2)}) = \frac{x^{(1)} \cdot x^{(2)}}{\lVert x^{(1)} \rVert \, \lVert x^{(2)} \rVert}$
- Measures the angle between two vectors and ignores their magnitude.
- Range: $[-1, 1]$
- Good for text-based or high-dimensional data.
- Scale independent.
- To use it as a distance in KNN, take $1 - sim(x^{(1)},x^{(2)})$.
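
The formula above can be sketched in plain Python. The function name `cosine_similarity` and the zero-vector check are choices made for this example.

```python
import math


def cosine_similarity(x1, x2):
    """Cosine of the angle between x1 and x2 (dot product over product of norms)."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm1 * norm2)


print(cosine_similarity([1, 0], [0, 1]))    # 0.0 (orthogonal)
print(cosine_similarity([1, 0], [-1, 0]))   # -1.0 (opposite direction)
print(cosine_similarity([1, 2], [10, 20]))  # close to 1: same direction, different magnitude
```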

In [3]:
# Implementation of KNN from scratch
import math
from collections import Counter


class KNN:
    """Minimal k-nearest-neighbours classifier using Euclidean distance."""

    def __init__(self, k=3):
        if k <= 0:
            raise ValueError('k must be greater than 0')
        self.k = k

    def fit(self, X, y):
        if len(X) != len(y):
            raise ValueError('X and y must have same length')
        self.X_train = X
        self.y_train = y

    def _euclidean_distance(self, x1, x2):
        return math.sqrt(sum((a-b)**2 for a, b in zip(x1, x2)))

    def predict(self, X):
        return [self._predict_single(x) for x in X]

    def _predict_single(self, x):
        distances = [
            (self._euclidean_distance(x, x_train), y_train)
            for x_train, y_train in zip(self.X_train, self.y_train)
        ]
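        # Sort by distance and keep the k closest training points
        k_nearest = sorted(distances, key=lambda pair: pair[0])[:self.k]

        # Majority vote among the labels of the k nearest neighbours
        labels = [label for _, label in k_nearest]
        return Counter(labels).most_common(1)[0][0]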


In [4]:
X_train = [
    [1, 2],
    [2, 3],
    [3, 3],
    [6, 5],
    [7, 7]
]

y_train = [0, 0, 0, 1, 1]

knn = KNN(k=3)
knn.fit(X_train, y_train)

X_test = [
    [2, 2],
    [6, 6]
]

predictions = knn.predict(X_test)
print(predictions)  # [0, 1]


[0, 1]


#### Curse of Dimensionality in KNN
`What is the curse of dimensionality`
- The curse of dimensionality refers to the problems that arise when a model works with high-dimensional data.
- High-dimensional data is sparsely distributed, which leads to several challenges such as increased computation, overfitting and deteriorating performance of certain algorithms.

`How does it affect KNN`:
- Increased sparsity in data: adding dimensions increases the volume of the space, so it becomes difficult to find meaningful nearest neighbours because fewer points lie within a given distance.
- Equal distance: in high dimensions the concept of distance becomes less meaningful. The distances between points tend to become uniform, so all points look roughly equidistant.
- Computational complexity: as the number of dimensions grows, each distance computation involves more calculations. This makes KNN slower and less efficient.
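
The "equal distance" effect can be observed with a small experiment on random data. The sketch below is illustrative only: the function name `relative_contrast`, the uniform random points and the chosen dimensions are assumptions for this example, not part of the notes above.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)


# The contrast shrinks as the dimension grows: the nearest and farthest
# points end up at almost the same distance from the query.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

A small contrast means the "nearest" neighbours are barely nearer than everything else, which is why the majority vote loses its meaning.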

`How to fix the curse of dimensionality`:
- Dimensionality reduction techniques such as PCA reduce the number of features while preserving the most relevant information. t-SNE also reduces dimensions, but it is mainly suited to visualization because it does not preserve distances.

### Questions about the KNN algorithm
1. **Distance Metrics and Trade-offs**
   How do you decide between Euclidean, Manhattan, cosine, or Minkowski distance in KNN for a real-world problem, and what practical impact does this choice have on model performance and stability?

2. **Curse of Dimensionality**
   KNN often degrades in high-dimensional spaces. Explain why this happens and what concrete techniques you would apply in production to mitigate it.

3. **Choosing *k* in Practice**
   How do you select an optimal value of *k* beyond textbook cross-validation, especially when dealing with noisy or imbalanced datasets?

4. **Scalability and Production Constraints**
   KNN is considered a “lazy learner.” How would you make KNN feasible for a dataset with millions of samples and low-latency inference requirements?

5. **Feature Scaling and Data Leakage**
   Why is feature scaling critical for KNN, and how can improper scaling introduce data leakage during training and evaluation?


> Answers
1. **Distance Metrics and Trade-offs**
* Euclidean distance works well on low-dimensional data with few features, but it is sensitive to outliers, so the model may be unstable if the dataset contains them.
* Manhattan distance is ideal if the features are direction-based or grid-based. It is more robust to outliers, so if the dataset has many outliers you can switch to Manhattan distance to make the model more stable.
* Cosine distance is based on the angle between vectors and ignores magnitude. It is most suitable for text-based features, embeddings and so on. Choose it when the direction matters and the magnitude does not.
* Minkowski: if you are unsure which of the first two to use, you can vary the value of p and compare the results.

**Upgraded answers:** <br>

Euclidean (L2) penalizes large deviations heavily, making it sensitive to outliers. Manhattan (L1) grows linearly, hence more robust. Cosine is scale-invariant and preferred for sparse/high-dimensional embeddings. Minkowski generalizes L1/L2 but is rarely tuned in practice due to interpretability and stability concerns.

---

2. **Curse of Dimensionality**
- As the number of features increases, the volume of the space grows and the data becomes sparser.
- It becomes difficult to capture meaningful distances, and the computational cost increases as well. This degrades the model's performance.
- To reduce the number of features and mitigate the curse of dimensionality, we can use dimensionality reduction techniques such as PCA and t-SNE, which reduce the features while preserving the information.
- These techniques project the high-dimensional data into a lower-dimensional space.

**Upgraded answers:** <br>

As dimensions increase, distances concentrate, making nearest neighbors indistinguishable. KNN fails because locality is lost. In production, I reduce dimensionality using PCA or autoencoders and apply feature selection; visualization methods like t-SNE are avoided since they distort distances.

---

3. **Choosing K**
* When dealing with imbalanced data, balance it using an oversampling or undersampling technique.
* Apart from cross-validation, you can use the elbow method and the bias-variance trade-off to arrive at the optimal k value.

**Upgraded answers:** <br>
**What interviewers want** <br>
* Small k → low bias, high variance
* Large k → high bias, low variance
* Metric-driven selection (F1, ROC-AUC, Recall)

**Better framing** <br>

I choose k by analyzing bias–variance trade-offs using task-specific metrics. For imbalanced data, I optimize recall/F1 instead of accuracy and sometimes use distance-weighted KNN to reduce majority-class dominance.
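
Distance-weighted voting, mentioned above, can be sketched like this. The function name `knn_vote`, the `1 / (distance + eps)` weighting and the toy data are assumptions for illustration; other weighting schemes are possible.

```python
import math
from collections import defaultdict


def knn_vote(X_train, y_train, x, k=3, weighted=True, eps=1e-9):
    """Classify x from its k nearest neighbours.

    With weighted=True each neighbour votes with weight 1 / (distance + eps),
    so closer points count more; with weighted=False every vote counts 1.
    """
    nearest = sorted(
        ((math.dist(x, x_train), label) for x_train, label in zip(X_train, y_train)),
        key=lambda pair: pair[0],
    )[:k]
    votes = defaultdict(float)
    for distance, label in nearest:
        votes[label] += 1.0 / (distance + eps) if weighted else 1.0
    return max(votes, key=votes.get)


# Imbalanced toy data (hypothetical): four class-0 points, one class-1 point.
X_train = [[0, 0], [0, 1], [1, 0], [1, 1], [3, 3]]
y_train = [0, 0, 0, 0, 1]
x = [2.8, 2.8]  # very close to the single class-1 point

print(knn_vote(X_train, y_train, x, k=3, weighted=False))  # 0
print(knn_vote(X_train, y_train, x, k=3, weighted=True))   # 1
```

The plain vote is dominated by two distant majority-class points, while the weighted vote lets the one very close minority-class neighbour win.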

---

4. **Scalability and Production:** I am not sure about this

**Upgraded answers:** <br>
You should know at least one of these:
* KD-tree / Ball-tree (low dimensions)
* Approximate Nearest Neighbors (ANN)
* FAISS (library) with HNSW or IVF indexes
* Vector indexing

**What to say** <br>

Brute-force KNN is O(N·d) per query (N training points, d features). To scale, I use approximate nearest neighbor search with FAISS (HNSW or IVF), reduce embedding size, and precompute indexes. For strict latency, KNN is replaced with parametric models.
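
To show the idea behind a precomputed index, here is a minimal pure-Python KD-tree sketch for exact search in low dimensions. It is not how FAISS works internally, and production code would use a library rather than this; the names `build_kdtree`, `knn_search` and `k_nearest` are hypothetical.

```python
import heapq
import math


def build_kdtree(points, depth=0):
    """Recursively split the points at the median of alternating axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def knn_search(node, query, k, heap=None):
    """Collect the k nearest points in a max-heap of (-distance, point)."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    distance = math.dist(query, node["point"])
    if len(heap) < k:
        heapq.heappush(heap, (-distance, node["point"]))
    elif distance < -heap[0][0]:
        heapq.heapreplace(heap, (-distance, node["point"]))
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    knn_search(near, query, k, heap)
    # Visit the far side only if the splitting plane is closer than the
    # current k-th neighbour; otherwise that whole subtree is skipped.
    if len(heap) < k or abs(diff) < -heap[0][0]:
        knn_search(far, query, k, heap)
    return heap


def k_nearest(tree, query, k):
    """Return [(distance, point), ...] for the k nearest points, closest first."""
    return sorted((-neg, point) for neg, point in knn_search(tree, query, k))


points = [[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]]
tree = build_kdtree(points)  # built once, reused for every query
nearest = k_nearest(tree, [9, 2], k=2)
print([point for _, point in nearest])  # [[8, 1], [7, 2]]
```

The tree is built once up front, and each query then skips subtrees that cannot contain a closer point. That pruning stops helping as the number of dimensions grows, which is why approximate methods are used for large embedding spaces.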

---

5. **Feature Scaling and Data Leakage**
- KNN assumes the features are normalized and on comparable scales so that distances are computed accurately; otherwise features with large ranges dominate the distance.
- So it is mandatory to scale the features.
- When scaling, split the data first and fit the scaler on the training data only. If you fit it on the combined data, statistics such as the mean are computed using the test set as well, which is a form of data leakage.

**Upgraded answers:** <br>
Since KNN is distance-based, features must be scaled. Scalers are fit only on training data and applied to validation/test via pipelines to prevent leakage.
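
A minimal sketch of leakage-free standardization, using only the standard library. The helper names `fit_scaler` and `transform` and the toy data are assumptions for this example; in practice a library pipeline would play this role.

```python
import statistics


def fit_scaler(X_train):
    """Compute per-feature mean and standard deviation from training data only."""
    columns = list(zip(*X_train))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for constant features.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(X, means, stds):
    """Standardize X with statistics that were fitted elsewhere."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


# Toy data (hypothetical): the second feature has a much larger range.
X_train = [[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]]
X_test = [[2.0, 10000.0]]

means, stds = fit_scaler(X_train)                # fit on the training split only
X_train_scaled = transform(X_train, means, stds)
X_test_scaled = transform(X_test, means, stds)   # reuse the training statistics
print(means, X_test_scaled)
```

Fitting on `X_train + X_test` instead would let the extreme test value shift the mean and standard deviation, so the evaluation would no longer reflect unseen data.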

**What to Fix Next (High ROI)**

* Learn ANN, FAISS, HNSW (non-negotiable for AI engineers)
* Stop mentioning t-SNE outside visualization
* Practice structured answers (Problem → Cause → Solution)
* Tie every choice to latency, scale, or metrics