<a href="https://colab.research.google.com/github/tgarg535/Machine-Learning/blob/main/KNN%26PCA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Theoretical Questions**
### **1. What is K-Nearest Neighbors (KNN) and how does it work?**

KNN is a **non-parametric, lazy learning** algorithm. It does not learn a discriminative function from the training data; instead, it "memorizes" the dataset.

* **How it works:** When a new data point is introduced, the algorithm calculates the distance between that point and every point in the training set. It then selects the $K$ points closest to it and assigns a label based on a majority vote (Classification) or an average (Regression).
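The procedure can be sketched from scratch in a few lines. This is a minimal illustration, not scikit-learn's implementation; the helper `knn_predict` and the toy arrays `X_toy` and `y_toy` are made up for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters labelled 0 and 1
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                  [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.8, 1.6]), k=3))  # query near the first cluster
print(knn_predict(X_toy, y_toy, np.array([6.2, 6.4]), k=3))  # query near the second cluster
```

For regression, the vote would be replaced by the mean of the neighbors' target values.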

### **2. What is the difference between KNN Classification and KNN Regression?**

* **KNN Classification:** The output is a class membership. The object is assigned to the class most common among its $K$ nearest neighbors. If $K = 1$, the object is assigned to the class of its single nearest neighbor.
* **KNN Regression:** The output is the property value for the object. This value is the **average** (or weighted average) of the values of its $K$ nearest neighbors.

### **3. What is the role of the distance metric in KNN?**

The distance metric determines how "closeness" is defined.

* **Euclidean Distance:** The most common (straight-line distance).
* **Manhattan Distance:** Sum of absolute differences (city block distance).
* **Minkowski Distance:** A generalized form of both; its order $p$ gives Manhattan distance at $p = 1$ and Euclidean distance at $p = 2$.

Choosing the right metric is vital because KNN's predictions depend directly on how similarity is defined.
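As a small worked example, the Minkowski distance $d_p(a, b) = \left(\sum_i |a_i - b_i|^p\right)^{1/p}$ can be computed directly. The helper `minkowski` and the two points below are illustrative choices, not part of the original notebook.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two vectors."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

d_manhattan = minkowski(a, b, p=1)   # |3| + |4| = 7
d_euclidean = minkowski(a, b, p=2)   # sqrt(9 + 16) = 5
print(d_manhattan, d_euclidean)
```

The same pair of points is 7 units apart under Manhattan distance and 5 under Euclidean distance, so the ranking of neighbors can change with the metric.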

### **4. What is the Curse of Dimensionality in KNN?**

As the number of features (dimensions) increases, the volume of the space grows exponentially, making the data points **sparse**. In high-dimensional space, the distance between the nearest and farthest points becomes almost the same, causing KNN to lose its predictive power because "closeness" becomes meaningless.
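A rough way to see this effect is to sample random points and compare the nearest and farthest distance from a query point as the dimension grows. This is a sketch on uniformly random data (an assumption; real datasets may behave differently), and `nearest_to_farthest_ratio` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_dims, n_points=500):
    """Ratio of the nearest to the farthest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

ratios = {d: nearest_to_farthest_ratio(d) for d in [2, 10, 100, 1000]}
for d, r in ratios.items():
    print(f"{d:>4} dims: nearest/farthest = {r:.3f}")
```

A ratio close to 1 means the nearest neighbor is barely closer than the farthest point, which is the situation in which KNN's notion of "closeness" stops being informative.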

### **5. How can we choose the best value of K in KNN?**

* **Small K:** Low bias but high variance (sensitive to noise/outliers).
* **Large K:** High bias but low variance (smoother decision boundaries, but may include points from other classes).
* **Method:** We typically use **Cross-Validation**. We plot the error rate against various $K$ values and select the "elbow" point where the error stabilizes.
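The cross-validation approach could look like the following sketch. The choice of the Iris dataset, 5 folds, and the range 1 to 20 are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

k_values = list(range(1, 21))
cv_errors = []
for k in k_values:
    # Mean accuracy over 5 folds, converted to an error rate
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_iris, y_iris, cv=5)
    cv_errors.append(1 - scores.mean())

best_k = k_values[cv_errors.index(min(cv_errors))]
print(f"K with lowest CV error: {best_k}")
```

Plotting `cv_errors` against `k_values` gives the error curve on which the "elbow" can be read off.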

### **6. What are KD Tree and Ball Tree in KNN?**

To avoid calculating the distance to *every* point (which costs $O(n)$ distance computations per query for $n$ training points), we use spatial data structures:

* **KD Tree (K-Dimensional Tree):** A binary tree that partitions space into axis-aligned boxes.
* **Ball Tree:** Partitions data points into a series of nesting hyper-spheres (balls).
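Both structures are available directly in scikit-learn as `KDTree` and `BallTree`. The sketch below, on made-up random data, builds each one and checks that they return the same neighbors, since both perform exact search.

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree

rng = np.random.default_rng(0)
points = rng.random((200, 3))        # 200 random points in 3-D (hypothetical data)
query = rng.random((1, 3))

# Each query returns (distances, indices) of the 5 nearest points
kd_dist, kd_idx = KDTree(points).query(query, k=5)
ball_dist, ball_idx = BallTree(points).query(query, k=5)

# Both structures are exact, so they should agree on the neighbors
print(kd_idx[0], ball_idx[0])
```

The trees differ in how they partition the space and therefore in speed, not in the answer they give.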

### **7. When should you use KD Tree vs. Ball Tree?**

* **KD Tree:** Efficient for low-dimensional data (under 20 features) but becomes inefficient as dimensionality grows.
* **Ball Tree:** Better suited for high-dimensional data because it uses hyperspheres rather than axis-aligned partitions, handling the "curse of dimensionality" slightly better than KD Trees.

### **8. What are the disadvantages of KNN?**

* **Computationally Expensive:** Since it's a lazy learner, all computation happens during the prediction phase.
* **Memory Intensive:** Requires storing the entire dataset.
* **Sensitive to Scale:** Features with larger magnitudes dominate the distance.
* **Sensitive to Outliers:** A single noisy point can change the classification of nearby points if $K$ is small.

### **9. How does feature scaling affect KNN?**

Since KNN relies on distance, features must be on comparable scales. If one feature ranges from 0–1 and another from 0–1000, the latter will dominate the distance. **Normalization** or **Standardization** is therefore effectively mandatory for KNN.

---

## **Principal Component Analysis (PCA)**

### **10. What is PCA (Principal Component Analysis)?**

PCA is an unsupervised linear dimensionality reduction technique. It transforms a large set of variables into a smaller one that still contains most of the information (variance) of the original set.

### **11. How does PCA work?**

1. **Standardize** the data.
2. Compute the **Covariance Matrix** to see how variables vary from the mean with respect to each other.
3. Calculate **Eigenvectors and Eigenvalues** of the covariance matrix.
4. Sort Eigenvalues in descending order to identify the **Principal Components**.
5. Project the original data onto these new axes.
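The five steps above can be sketched with plain NumPy. This is an illustrative implementation on the Iris data (an assumed choice), not how scikit-learn's `PCA` is implemented internally.

```python
import numpy as np
from sklearn.datasets import load_iris

X_raw = load_iris().data

# 1. Standardize each feature to zero mean and unit variance
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# 2. Covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)

# 3. Eigen-decomposition (eigh is intended for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort components by eigenvalue, largest first
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 5. Project the data onto the top two principal components
X_projected = X_std @ eigenvectors[:, :2]

explained_ratio = eigenvalues / eigenvalues.sum()
print(X_projected.shape, explained_ratio.round(3))
```

Each eigenvalue equals the variance of the data along its eigenvector, which is why dividing by their sum gives the explained variance ratio.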

### **12. What is the geometric intuition behind PCA?**

Geometrically, PCA looks for a new coordinate system. The **First Principal Component (PC1)** is a line that passes through the data in a way that captures the maximum possible variance (the "longest" direction of the data cloud). Each subsequent component is perpendicular (orthogonal) to the previous one and captures the next highest variance.

### **13. What are Eigenvalues and Eigenvectors in PCA?**

* **Eigenvectors:** The directions of the axes where there is the most variance (the principal components).
* **Eigenvalues:** Scalars that determine the magnitude or amount of variance captured in that specific eigenvector direction.

### **14. What is the difference between Feature Selection and Feature Extraction?**

* **Feature Selection:** Keeping a subset of the original variables and discarding the rest (e.g., choosing "Age" and "Income" but dropping "Zip Code").
* **Feature Extraction:** Transforming data into a new set of features that are combinations of the original variables (e.g., PCA creating "PC1" and "PC2" from 10 original variables).

### **15. How do you decide the number of components to keep in PCA?**

We use a **Scree Plot** or the **Cumulative Explained Variance**. Usually, we select the number of components that explain 90–95% of the total variance.
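As a sketch of the cumulative-variance rule, the component count can be read off the cumulative curve, or a target fraction can be passed to scikit-learn's `PCA` as a float `n_components`. The Wine dataset and the 95% threshold are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_wine_std = StandardScaler().fit_transform(load_wine().data)

# Option 1: read the count off the cumulative explained variance curve
cumulative = np.cumsum(PCA().fit(X_wine_std).explained_variance_ratio_)
n_for_95 = int(np.argmax(cumulative >= 0.95)) + 1

# Option 2: pass the target fraction directly to PCA
pca_95 = PCA(n_components=0.95).fit(X_wine_std)

print(n_for_95, pca_95.n_components_)
```

Both routes should select the same number of components; the second is simply more convenient inside a pipeline.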

### **16. Can PCA be used for classification?**

PCA itself is not a classifier; it is a preprocessing step. It is often used to reduce dimensions before feeding the data into a classifier like KNN or SVM to improve speed and reduce overfitting.

### **17. What are the limitations of PCA?**

* **Linearity:** It assumes that the relationships between variables are linear.
* **Information Loss:** Some variance (information) is always lost during reduction.
* **Interpretability:** Principal components are combinations of original features, making it hard to explain what a "component" actually represents in the real world.

### **18. How do KNN and PCA complement each other?**

PCA is frequently used as a **preprocessing step for KNN**. By reducing the number of dimensions, PCA mitigates the "Curse of Dimensionality," making the distance calculations in KNN more meaningful and significantly faster.

### **19. How does KNN handle missing values in a dataset?**

While standard KNN cannot handle missing values, we can use **KNN Imputation**. To fill a missing value for a sample, we find its $K$ nearest neighbors (using the features that *are* present) and take the mean or mode of that feature from those neighbors.

### **20. What are the key differences between PCA and Linear Discriminant Analysis (LDA)?**

* **PCA:** Unsupervised. It ignores class labels and focuses on maximizing variance.
* **LDA:** Supervised. It uses class labels and focuses on maximizing the distance between different classes while minimizing the spread within each class.
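The difference shows up in the API itself: PCA is fit on the features alone, while LDA also needs the labels. A minimal sketch on Iris (an assumed dataset choice):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_iris, y_iris = load_iris(return_X_y=True)

# PCA never sees the labels
X_pca_2d = PCA(n_components=2).fit_transform(X_iris)

# LDA requires the labels; it yields at most (n_classes - 1) components, i.e. 2 for Iris
X_lda_2d = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_iris, y_iris)

print(X_pca_2d.shape, X_lda_2d.shape)
```

Scatter-plotting each projection colored by `y_iris` is a quick way to compare how well the two methods separate the classes.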



---
# **Practical Questions**

### **21. Train a KNN Classifier on the Iris Dataset**

This is the standard entry point for classification.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(f"Model Accuracy: {accuracy_score(y_test, knn.predict(X_test)):.2f}")

```

### **22. KNN Regressor on Synthetic Data**

KNN can predict continuous values by averaging the targets of the $K$ nearest neighbors.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

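np.random.seed(42)  # fixed seed so the synthetic data is reproducible
X = np.sort(5 * np.random.rand(40, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

knr = KNeighborsRegressor(n_neighbors=5)
knr.fit(X, y)
# Note: this is the error on the training data itself, so it is optimistic
mse = mean_squared_error(y, knr.predict(X))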
print(f"Mean Squared Error: {mse:.4f}")

```

### **23. Comparing Euclidean vs. Manhattan Metrics**

The distance metric changes how neighbors are selected.

```python
for metric in ['euclidean', 'manhattan']:
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn.fit(X_train, y_train)
    print(f"Accuracy with {metric}: {knn.score(X_test, y_test):.2f}")

```

### **24. K-Value Decision Boundaries**

Small $K$ values create jagged boundaries (overfitting), while large $K$ values create smoother ones (underfitting).
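Without drawing the boundaries, the same trade-off can be observed by comparing training and test accuracy for a few values of $K$. This is a sketch; the Iris split and the values 1, 15, and 75 are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42)

k_scores = {}
for k in [1, 15, 75]:
    model = KNeighborsClassifier(n_neighbors=k).fit(Xi_train, yi_train)
    # Store (training accuracy, test accuracy) for each K
    k_scores[k] = (model.score(Xi_train, yi_train), model.score(Xi_test, yi_test))
    print(f"K={k:>2}: train={k_scores[k][0]:.2f}, test={k_scores[k][1]:.2f}")
```

With $K = 1$ every training point is its own nearest neighbor, so training accuracy is near perfect, a typical sign of a boundary that follows the training data very closely.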

### **25. Feature Scaling vs. Unscaled Data**

Since KNN depends on distance, large-scale features dominate. Scaling is mandatory.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

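# Compare accuracy on unscaled vs. scaled features
for name, (tr, te) in {'unscaled': (X_train, X_test), 'scaled': (X_train_scaled, X_test_scaled)}.items():
    model = KNeighborsClassifier(n_neighbors=5).fit(tr, y_train)
    print(f"Accuracy ({name}): {model.score(te, y_test):.2f}")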

```

### **26. PCA Explained Variance Ratio**

This shows the percentage of the dataset's total variance that lies along each principal component.

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(X_train_scaled)
print(f"Explained Variance Ratio: {pca.explained_variance_ratio_}")

```

### **27. PCA as a Preprocessing Step**

Reducing dimensions can sometimes improve KNN accuracy by removing noise.

```python
# Apply PCA, then feed into KNN and compare with raw KNN results

```
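One way to run this comparison is with two pipelines, one with and one without a PCA step. The sketch below assumes the Wine dataset, 2 components, and $K = 5$; none of these choices come from the original notebook, and whether PCA helps depends on the data.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)
Xw_train, Xw_test, yw_train, yw_test = train_test_split(
    X_wine, y_wine, test_size=0.2, random_state=42)

# Pipelines fit the scaler (and PCA) on the training split only
knn_only = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pca_knn = make_pipeline(StandardScaler(), PCA(n_components=2),
                        KNeighborsClassifier(n_neighbors=5))

acc_knn = knn_only.fit(Xw_train, yw_train).score(Xw_test, yw_test)
acc_pca_knn = pca_knn.fit(Xw_train, yw_train).score(Xw_test, yw_test)
print(f"KNN: {acc_knn:.2f} | PCA + KNN: {acc_pca_knn:.2f}")
```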

### **28. Hyperparameter Tuning with GridSearchCV**

Automate the search for the best $K$ and neighbor weighting scheme.

```python
from sklearn.model_selection import GridSearchCV

params = {'n_neighbors': range(1, 20), 'weights': ['uniform', 'distance']}
grid = GridSearchCV(KNeighborsClassifier(), params, cv=5)
grid.fit(X_train_scaled, y_train)
print(f"Best K: {grid.best_params_['n_neighbors']}")

```

### **29. Identifying Misclassified Samples**

```python
y_pred = knn.predict(X_test)
misclassified = (y_test != y_pred).sum()
print(f"Number of misclassified samples: {misclassified}")

```

### **30. Cumulative Explained Variance Plot**

This helps decide how many components to keep.

```python
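import matplotlib.pyplot as plt

pca = PCA().fit(X_train_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
# Start the x-axis at 1 so it matches the component count
plt.plot(range(1, len(cumulative) + 1), cumulative, marker='o')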
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.show()

```

### **31. Weights: Uniform vs. Distance**

'Distance' weights give closer neighbors more influence on the prediction.

```python
# Compare KNeighborsClassifier(weights='uniform') vs weights='distance'

```

### **32. KNN Imputation for Missing Values**

```python
from sklearn.impute import KNNImputer
import numpy as np

X_missing = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X_missing))

```

### **33. PCA Data Projection (2D)**

### **34. KD Tree vs. Ball Tree Performance**

Both are algorithms used to speed up the search for nearest neighbors.

```python
import time
for algo in ['kd_tree', 'ball_tree']:
    start = time.time()
    KNeighborsClassifier(algorithm=algo).fit(X_train, y_train).predict(X_test)
    print(f"{algo} time: {time.time() - start:.5f}s")

```

### **35. PCA Scree Plot**

A bar chart of individual variances to find the "elbow."

### **36. Classification Report (Wine Dataset)**

```python
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report

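wine = load_wine()
# Split, scale (fit on the training split only), then train KNN
Xw_train, Xw_test, yw_train, yw_test = train_test_split(wine.data, wine.target, test_size=0.2, random_state=42)
wine_scaler = StandardScaler().fit(Xw_train)
wine_knn = KNeighborsClassifier().fit(wine_scaler.transform(Xw_train), yw_train)
yw_pred = wine_knn.predict(wine_scaler.transform(Xw_test))
print(classification_report(yw_test, yw_pred, target_names=wine.target_names))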

```

### **37. KNN Classifier with ROC-AUC**

Useful for evaluating the performance of a classifier at various threshold settings.

```python
from sklearn.metrics import roc_auc_score
# Note: For multi-class, use multi_class='ovr'

```
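A possible complete version for the multi-class case is sketched below. It assumes the Iris dataset and a stratified split so that every class appears in the test set; `roc_auc_score` is given the class probabilities from `predict_proba` rather than hard labels.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris)

clf = KNeighborsClassifier(n_neighbors=5).fit(Xi_train, yi_train)

# Class probabilities: the fraction of the 5 neighbors belonging to each class
proba = clf.predict_proba(Xi_test)
auc = roc_auc_score(yi_test, proba, multi_class='ovr')
print(f"One-vs-rest ROC-AUC: {auc:.3f}")
```

Because KNN probabilities are neighbor fractions, they take only a few distinct values for small $K$, which makes the ROC curve coarse.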

### **38. Data Reconstruction Error**

Measure how much information is lost when you compress data with PCA and then decompress it.

```python
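# Keep 2 of the 4 components so that some variance is actually discarded
pca = PCA(n_components=2).fit(X_train_scaled)
X_reduced = pca.transform(X_test_scaled)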
X_recovered = pca.inverse_transform(X_reduced)
loss = np.mean((X_test_scaled - X_recovered) ** 2)
print(f"Reconstruction Error: {loss:.5f}")

```



### **41. KNN Weights: Uniform vs. Distance**

In 'uniform' weighting, all neighbors have an equal vote. In 'distance' weighting, closer neighbors have a higher influence.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for weight in ['uniform', 'distance']:
    knn = KNeighborsClassifier(n_neighbors=5, weights=weight)
    knn.fit(X_train, y_train)
    print(f"Accuracy with weights='{weight}': {knn.score(X_test, y_test):.4f}")

```

### **42. KNN Regressor: Impact of K Values**

As $K$ increases, the regression line becomes smoother (lower variance, higher bias).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor

X = np.sort(5 * np.random.rand(40, 1), axis=0)
y = np.sin(X).ravel()

for k in [1, 5, 20]:
    knr = KNeighborsRegressor(n_neighbors=k)
    y_pred = knr.fit(X, y).predict(X)
    plt.plot(X, y_pred, label=f'K={k}')

plt.scatter(X, y, color='black', label='Data')
plt.legend()
plt.title("Effect of K on Regression Line")
plt.show()

```

### **43. KNN Imputation for Missing Values**

This replaces `NaN` values with the mean value from the $K$ nearest neighbors found in the remaining features.

```python
from sklearn.impute import KNNImputer

X_missing = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X_missing)

print("Imputed Data:\n", X_filled)

```

### **44. PCA: 2D Data Projection**

Projecting high-dimensional data (like the 4D Iris dataset) into 2D allows us to visualize class separation.

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title("PCA Projection of Iris Dataset")
plt.show()

```

### **45. KD Tree vs. Ball Tree Performance**

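This compares the speed of neighbor searches. Ball Trees are generally faster for high-dimensional data, but on a small, 4-feature dataset like Iris the timings are dominated by overhead, so the difference measured here is not representative.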

```python
import time
for algo in ['kd_tree', 'ball_tree']:
    knn = KNeighborsClassifier(algorithm=algo)
    start = time.time()
    knn.fit(X_train, y_train).predict(X_test)
    print(f"{algo} Execution Time: {time.time() - start:.6f} seconds")

```

### **46. PCA Scree Plot**

A Scree plot helps identify the "elbow" to determine how many principal components are necessary.

```python
pca = PCA().fit(X)
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_)
plt.step(range(1, len(pca.explained_variance_ratio_) + 1), np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.title('Scree Plot')
plt.show()

```

### **47. KNN: Precision, Recall, and F1-Score**

These metrics provide a deeper look at model performance than accuracy alone, especially for imbalanced data.

```python
from sklearn.metrics import classification_report

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))

```

### **48. Number of Components vs. Accuracy**

This analyzes how many principal components are needed to maintain high classification accuracy.

```python
results = []
for n in range(1, 5):
    pca = PCA(n_components=n)
    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)
    
    knn = KNeighborsClassifier().fit(X_train_pca, y_train)
    results.append(knn.score(X_test_pca, y_test))

plt.plot(range(1, 5), results, marker='o')
plt.xlabel('Number of Components')
plt.ylabel('KNN Accuracy')
plt.title('PCA Components vs. Accuracy')
plt.show()

```

---

