### 1. **What is K-Nearest Neighbors (KNN) and how does it work?**
KNN is a non-parametric, instance-based (lazy) machine learning algorithm used for classification and regression. It has no explicit training phase: it stores the training set and, for a new point, finds the K closest training samples under a distance metric (e.g., Euclidean distance). It then predicts the majority class among those neighbors (classification) or the average of their target values (regression).
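
The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements it; the helper name `knn_predict` and the toy data are assumptions made for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3)
pred_b = knn_predict(X_train, y_train, np.array([5.1, 5.0]), k=3)
print(pred_a, pred_b)
```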

---

### 2. **Difference between KNN Classification and Regression**
- **KNN Classification**: Predicts the class with the majority vote from K-nearest neighbors.
- **KNN Regression**: Predicts the average (or weighted average) value from K-nearest neighbors.
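
As a small sketch of the difference, the snippet below applies both aggregation rules to the same set of neighbors. The distances, labels, and targets are made-up values for illustration, and inverse-distance weighting is just one common weighting choice.

```python
import numpy as np

# Hypothetical neighbors of one query point
dist = np.array([0.5, 1.0, 2.0])         # distances to the query
labels = np.array([1, 0, 1])             # class labels (classification)
targets = np.array([10.0, 20.0, 40.0])   # numeric targets (regression)

# Classification: most frequent label among the neighbors
majority_class = int(np.bincount(labels).argmax())

# Regression: plain mean of the neighbors' targets
mean_pred = targets.mean()

# Weighted regression: closer neighbors count more (weight = 1 / distance)
w = 1.0 / dist
weighted_pred = np.sum(w * targets) / np.sum(w)

print(majority_class, mean_pred, weighted_pred)
```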

---

### 3. **Role of the distance metric in KNN**
The distance metric (e.g., Euclidean, Manhattan, Minkowski) determines how "closeness" is measured. It directly affects which points are considered neighbors.
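
To make the metrics concrete, here is a short sketch using the Minkowski distance, which reduces to Manhattan for p=1 and Euclidean for p=2. The function name `minkowski` and the two points are illustrative assumptions.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p; p=1 is Manhattan, p=2 is Euclidean."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

manhattan = minkowski(a, b, 1)   # |3| + |4|
euclidean = minkowski(a, b, 2)   # sqrt(3**2 + 4**2)
cubic = minkowski(a, b, 3)       # (3**3 + 4**3) ** (1/3)
print(manhattan, euclidean, cubic)
```

The same pair of points is 7 apart under Manhattan distance and 5 apart under Euclidean distance, so the choice of metric can change which training points rank as nearest.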

---

### 4. **Curse of Dimensionality in KNN**
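As the number of dimensions grows, data points become sparse and the distances from a query to its nearest and farthest points become nearly equal. Distance then carries little information, so the "nearest" neighbors are barely closer than any other point and KNN's predictions degrade.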
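
One way to see this effect is to compare the nearest and farthest distances from a random query point as the dimension grows. The sketch below uses uniformly random synthetic data (an assumption for illustration); a ratio close to 1 means near and far points are hard to tell apart.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(dim, n_points=500):
    """Ratio of the smallest to the largest distance from a random query."""
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    d = np.linalg.norm(points - query, axis=1)
    return d.min() / d.max()

ratios = {dim: nearest_to_farthest_ratio(dim) for dim in (2, 10, 100, 1000)}
for dim, ratio in ratios.items():
    print(f"dim={dim:5d}  nearest/farthest distance ratio = {ratio:.3f}")
```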

---

### 5. **How to choose the best value of K in KNN**
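Use **cross-validation** to find the K that minimizes error on held-out data. A K that is too small makes the model sensitive to noise (high variance, overfitting); a K that is too large over-smooths the decision boundary (high bias, underfitting). For binary classification, an odd K avoids tied votes.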
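
A minimal sketch of that search with scikit-learn's `cross_val_score` is shown below. The candidate list of K values, the Iris dataset, and the 5-fold setting are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

k_values = [1, 3, 5, 7, 9, 11, 15]   # illustrative candidates
mean_scores = {}
for k in k_values:
    # Scaling sits inside the pipeline so it is refit on each training fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    mean_scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
print("Mean CV accuracy per K:", mean_scores)
print("Best K:", best_k)
```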

---

### 6. **What are KD Tree and Ball Tree in KNN**
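Both are tree-based data structures that speed up nearest-neighbor search by avoiding a comparison against every training point:
- **KD Tree**: Recursively splits the data along coordinate axes. Efficient for low-dimensional data.
- **Ball Tree**: Partitions the data into nested hyperspheres. Usually holds up better in higher dimensions or on non-uniform data.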

---

### 7. **When to use KD Tree vs Ball Tree**
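- **KD Tree**: Use when the number of features is small (a common rule of thumb is fewer than about 20).
- **Ball Tree**: Use for higher-dimensional data or data whose structure is not aligned with the coordinate axes.
- In very high dimensions both lose their advantage, and brute-force search can be just as fast.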
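
Both structures are exact, so they return the same neighbors and differ only in build and query cost. The sketch below queries scikit-learn's `KDTree` and `BallTree` on the same synthetic data; the data shape and `leaf_size` value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree

rng = np.random.default_rng(0)
X = rng.random((1000, 3))       # synthetic low-dimensional data
queries = rng.random((5, 3))

kd = KDTree(X, leaf_size=30)
ball = BallTree(X, leaf_size=30)

# query() returns (distances, indices) of the k nearest points
dist_kd, idx_kd = kd.query(queries, k=3)
dist_ball, idx_ball = ball.query(queries, k=3)

print("Same neighbor distances:", np.allclose(dist_kd, dist_ball))
```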

---

### 8. **Disadvantages of KNN**
- Slow prediction time: every query is compared against the stored training set, which must also be kept in memory.
- Sensitive to irrelevant features and feature scaling.
- Struggles with high-dimensional data.

---

### 9. **Effect of feature scaling on KNN**
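Scaling is essential. Because KNN relies on distances, features with large numeric ranges dominate the computation when left unscaled. Use standardization (zero mean, unit variance) or min-max normalization, and fit the scaler on the training data only.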
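
The effect can be seen by measuring how much of a squared Euclidean distance comes from a single large-range feature before and after standardization. The data and the helper `income_share` below are hypothetical, made up for this illustration.

```python
import numpy as np

# Hypothetical data: column 0 = age in years, column 1 = income in dollars
X = np.array([[25.0, 50000.0],
              [60.0, 50500.0],
              [30.0, 52000.0],
              [45.0, 48000.0]])

def income_share(data, i, j):
    """Fraction of the squared distance between rows i and j due to income."""
    sq_diff = (data[i] - data[j]) ** 2
    return sq_diff[1] / sq_diff.sum()

# Standardize each column to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

share_raw = income_share(X, 0, 1)
share_std = income_share(X_std, 0, 1)
print(f"income share of distance: raw={share_raw:.3f}, standardized={share_std:.3f}")
```

On the raw data the 35-year age gap between rows 0 and 1 is almost invisible next to a 500-dollar income gap; after standardization the age difference drives the distance instead.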

---

### 10. **What is PCA (Principal Component Analysis)**
PCA is a dimensionality reduction technique that transforms data to a new coordinate system, maximizing variance along principal components.

---

### 11. **How does PCA work**
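1. Standardize the data (at minimum, center each feature to zero mean)
2. Compute the covariance matrix
3. Compute its eigenvectors and eigenvalues, and sort them by decreasing eigenvalue
4. Project the data onto the top k eigenvectors (the principal components)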
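
These steps can be sketched directly in NumPy. This is a teaching sketch on synthetic data (an assumption); library implementations such as scikit-learn's `PCA` typically use an SVD rather than an explicit eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * X[:, 1]   # make two features strongly correlated

# 1. Standardize: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the features
cov = np.cov(X_std, rowvar=False)

# 3. Eigen-decomposition (eigh suits symmetric matrices), sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the top k components
k = 2
X_proj = X_std @ eigvecs[:, :k]

explained_ratio = eigvals / eigvals.sum()
print("Explained variance ratio:", explained_ratio)
```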

---

### 12. **Geometric intuition of PCA**
PCA finds new axes (principal components) along which variance in data is maximized, effectively rotating the dataset.

---

### 13. **Feature Selection vs. Feature Extraction**
- **Selection**: Choose a subset of original features.
- **Extraction**: Create new features (e.g., PCA components).

---

### 14. **Eigenvalues and Eigenvectors in PCA**
- **Eigenvectors**: Directions of new feature space (principal components).
- **Eigenvalues**: Amount of variance each eigenvector explains.

---

### 15. **How to decide number of PCA components**
- Use the **explained variance ratio**.
- Use a **Scree Plot** or keep enough components to explain 95%+ of variance.
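
A sketch of the explained-variance approach follows. The 95% threshold and the Wine dataset are illustrative choices; passing a float between 0 and 1 as `n_components` asks scikit-learn's `PCA` to choose the count itself.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)

# Fit all components, then find the smallest count that reaches the threshold
cumulative = np.cumsum(PCA().fit(X_std).explained_variance_ratio_)
threshold = 0.95  # illustrative choice
n_keep = int(np.argmax(cumulative >= threshold) + 1)

# Shortcut: let PCA pick the number of components for the same threshold
pca_auto = PCA(n_components=threshold).fit(X_std)
print("Components needed:", n_keep, "| chosen by PCA:", pca_auto.n_components_)
```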

---

### 16. **Can PCA be used for classification?**
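Not on its own. PCA is an unsupervised preprocessing step and does not predict labels. Its reduced features can be fed to a classifier, which may improve accuracy and speed, though performance can also drop if the discarded components carried class information.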

---

### 17. **Limitations of PCA**
- Assumes linear relationships.
- Ignores class labels, so the directions it keeps need not preserve class separability.
- Sensitive to outliers.

---

### 18. **How KNN and PCA complement each other**
PCA reduces noise and dimensionality, which can improve KNN's accuracy and speed, especially on high-dimensional datasets where distances are otherwise less informative.
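
A common way to combine the two is a single pipeline, so that scaling and PCA are refit on each training fold. The sketch below is one possible setup; the Wine dataset, 5 components, and 5-fold cross-validation are assumptions made for illustration, and whether PCA helps depends on the data.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

knn_only = make_pipeline(StandardScaler(), KNeighborsClassifier())
pca_knn = make_pipeline(StandardScaler(), PCA(n_components=5), KNeighborsClassifier())

score_knn = cross_val_score(knn_only, X, y, cv=5).mean()
score_pca_knn = cross_val_score(pca_knn, X, y, cv=5).mean()
print(f"KNN on 13 features: {score_knn:.3f} | PCA(5) + KNN: {score_pca_knn:.3f}")
```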

---

### 19. **How KNN handles missing values**
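It doesn't natively: distances cannot be computed on incomplete rows. Impute first, for example with scikit-learn's **`KNNImputer`**, which replaces each missing value with the mean of that feature over the nearest neighbors that have it.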
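
A small sketch of `KNNImputer` on a toy matrix is shown below; the matrix and `n_neighbors=2` are illustrative choices.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each missing entry becomes the mean of that feature over the 2 nearest rows,
# with distances computed on the features both rows have observed
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```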

---

### 20. **PCA vs. Linear Discriminant Analysis (LDA)**
- **PCA**: Unsupervised, maximizes variance.
- **LDA**: Supervised, maximizes class separability.
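
The practical difference shows up in the API: PCA is fitted on the features alone, while LDA also needs the labels. A minimal sketch on the Iris dataset (chosen for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised, the labels are never used
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised, at most (n_classes - 1) components, here 2
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)
```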

In [None]:
# KNN & PCA Solutions (Practical Answers)

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris, make_regression, make_classification, load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor, KNeighborsTransformer
from sklearn.impute import KNNImputer
from sklearn.metrics import accuracy_score, mean_squared_error, confusion_matrix, classification_report, roc_auc_score, precision_score, recall_score, f1_score, roc_curve
from sklearn.decomposition import PCA

# 1. KNN Classifier on Iris dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("1. KNN Classifier Accuracy on Iris:", accuracy_score(y_test, y_pred))

# 2. KNN Regressor on synthetic dataset
X, y = make_regression(n_samples=200, n_features=1, noise=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
knn_reg = KNeighborsRegressor()
knn_reg.fit(X_train, y_train)
y_pred = knn_reg.predict(X_test)
print("2. KNN Regressor MSE:", mean_squared_error(y_test, y_pred))

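# 3. KNN Classifier with Euclidean and Manhattan distance
# Re-split the Iris data: the split from step 2 holds continuous regression targets,
# which a classifier cannot be fitted on.
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
for metric in ['euclidean', 'manhattan']:
    knn = KNeighborsClassifier(metric=metric)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    print(f"3. Accuracy with {metric} distance:", accuracy_score(y_test, y_pred))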

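# 4. KNN with different K values and decision boundaries (example for 2D data)
X_vis, y_vis = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_vis, y_vis, test_size=0.2, random_state=42)
# Grid covering the feature space, used to colour the predicted class regions
xx, yy = np.meshgrid(np.linspace(X_vis[:, 0].min() - 1, X_vis[:, 0].max() + 1, 200),
                     np.linspace(X_vis[:, 1].min() - 1, X_vis[:, 1].max() + 1, 200))
for k in [1, 3, 5, 7]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.figure()
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')  # predicted regions
    sns.scatterplot(x=X_test[:, 0], y=X_test[:, 1], hue=y_test, palette='coolwarm')  # true test labels
    plt.title(f"4. Decision boundary for k={k}")
    plt.show()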

# 5. Feature Scaling comparison (Wine data: features on very different scales)
X_w, y_w = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_w, y_w, test_size=0.2, random_state=42)
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)
acc_unscaled = knn_unscaled.score(X_test, y_test)
# Fit the scaler on the training split only, then apply it to both splits,
# so rows stay aligned with their labels and no test information leaks in.
scaler = StandardScaler().fit(X_train)
X_train_scaled, X_test_scaled = scaler.transform(X_train), scaler.transform(X_test)
knn_scaled = KNeighborsClassifier().fit(X_train_scaled, y_train)
acc_scaled = knn_scaled.score(X_test_scaled, y_test)
print("5. Accuracy without scaling:", acc_unscaled)
print("   Accuracy with scaling:", acc_scaled)

# 6. PCA Explained Variance
# Keep the labels (y_pca): step 7 needs labels that belong to this dataset.
X_pca, y_pca = make_classification(n_samples=200, n_features=10, random_state=42)
pca = PCA()
pca.fit(X_pca)
print("6. Explained variance ratio:", pca.explained_variance_ratio_)

# 7. KNN with and without PCA (same split for both; PCA fitted on the training set only)
X_train, X_test, y_train, y_test = train_test_split(X_pca, y_pca, test_size=0.2, random_state=42)
knn_orig = KNeighborsClassifier().fit(X_train, y_train)
acc_orig = knn_orig.score(X_test, y_test)
pca_2d = PCA(n_components=2).fit(X_train)
X_train, X_test = pca_2d.transform(X_train), pca_2d.transform(X_test)
knn_pca = KNeighborsClassifier().fit(X_train, y_train)
acc_pca = knn_pca.score(X_test, y_test)
print("7. Accuracy without PCA:", acc_orig)
print("   Accuracy with PCA:", acc_pca)

# 8. GridSearchCV for KNN
param_grid = {'n_neighbors': [3, 5, 7], 'metric': ['euclidean', 'manhattan']}
gs = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
gs.fit(X_train, y_train)
print("8. Best Params from GridSearch:", gs.best_params_)

# 9. Misclassified samples
print("9. Misclassified samples:", np.sum(knn_pca.predict(X_test) != y_test))

# 10. Cumulative Explained Variance Plot
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.title("10. Cumulative Explained Variance")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Variance")
plt.grid(True)
plt.show()

# 11. Weights parameter in KNN
for w in ['uniform', 'distance']:
    knn = KNeighborsClassifier(weights=w)
    knn.fit(X_train, y_train)
    print(f"11. Accuracy with weights={w}:", knn.score(X_test, y_test))

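# 12. Regressor K values effect (uses the regression data X, y from step 2)
# Separate variable names keep the classification split available for later steps.
X_tr_r, X_te_r, y_tr_r, y_te_r = train_test_split(X, y, test_size=0.2, random_state=42)
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_tr_r, y_tr_r)
    print(f"12. MSE for k={k}:", mean_squared_error(y_te_r, knn.predict(X_te_r)))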

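# 13. KNN Imputer (blank out one feature in every 10th Iris row, then fill it in)
X_missing = iris.data.copy()
X_missing[::10, 0] = np.nan
imputer = KNNImputer(n_neighbors=5)
X_filled = imputer.fit_transform(X_missing)
print("13. Missing values before:", np.isnan(X_missing).sum(), "| after imputation:", np.isnan(X_filled).sum())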

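# 14. PCA 2D projection (Iris; the step 2 regression data has only one feature)
X_proj = PCA(n_components=2).fit_transform(iris.data)
plt.scatter(X_proj[:, 0], X_proj[:, 1], c=iris.target)
plt.title("14. PCA 2D Projection")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()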

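# 15. KD Tree vs Ball Tree performance
# Both searches are exact, so accuracy matches; only the timing can differ.
import time
for algo in ['kd_tree', 'ball_tree']:
    knn = KNeighborsClassifier(algorithm=algo)
    knn.fit(X_train, y_train)
    start = time.perf_counter()
    acc = knn.score(X_test, y_test)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"15. {algo}: accuracy={acc:.3f}, prediction time={elapsed_ms:.2f} ms")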

# 16. PCA Scree Plot
plt.plot(pca.explained_variance_)
plt.title("16. Scree Plot")
plt.xlabel("Component")
plt.ylabel("Eigenvalue")
plt.grid(True)
plt.show()

# 17. Precision, Recall, F1-Score
y_pred = knn.predict(X_test)
print("17. Precision:", precision_score(y_test, y_pred, average='macro'))
print("    Recall:", recall_score(y_test, y_pred, average='macro'))
print("    F1 Score:", f1_score(y_test, y_pred, average='macro'))

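# 18. Accuracy with different PCA components (Wine data, 13 features)
X_w, y_w = load_wine(return_X_y=True)
X_train_w, X_test_w, y_train, y_test = train_test_split(X_w, y_w, test_size=0.2, random_state=42)
scaler_w = StandardScaler().fit(X_train_w)  # fitted on the training split only
for n in [1, 2, 5]:
    pca_n = PCA(n_components=n).fit(scaler_w.transform(X_train_w))
    X_train = pca_n.transform(scaler_w.transform(X_train_w))
    X_test = pca_n.transform(scaler_w.transform(X_test_w))
    knn = KNeighborsClassifier().fit(X_train, y_train)
    print(f"18. Accuracy with {n} PCA components:", knn.score(X_test, y_test))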

# 19. Leaf size comparison
for leaf in [10, 20, 30, 40]:
    knn = KNeighborsClassifier(leaf_size=leaf)
    knn.fit(X_train, y_train)
    print(f"19. Accuracy with leaf_size={leaf}:", knn.score(X_test, y_test))

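# 20. Before/After PCA plot (Iris: first two raw features vs. first two principal components)
X_iris_proj = PCA(n_components=2).fit_transform(iris.data)
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(iris.data[:, 0], iris.data[:, 1], c=iris.target)
plt.title("Original Data")
plt.subplot(1, 2, 2)
plt.scatter(X_iris_proj[:, 0], X_iris_proj[:, 1], c=iris.target)
plt.title("After PCA")
plt.show()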

# 21. KNN on Wine Dataset
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.2)
knn = KNeighborsClassifier().fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("21. Classification Report on Wine:")
print(classification_report(y_test, y_pred))

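# 22. KNN Regressor with different metrics (regression data X, y from step 2, not the Wine class labels)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
for metric in ['euclidean', 'manhattan']:
    knn = KNeighborsRegressor(metric=metric)
    knn.fit(X_train, y_train)
    print(f"22. MSE with {metric}:", mean_squared_error(y_test, knn.predict(X_test)))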

# 23. ROC-AUC Score (binary example)
X_bin, y_bin = make_classification(n_classes=2)
X_train, X_test, y_train, y_test = train_test_split(X_bin, y_bin)
knn = KNeighborsClassifier().fit(X_train, y_train)
y_prob = knn.predict_proba(X_test)[:, 1]
print("23. ROC-AUC Score:", roc_auc_score(y_test, y_prob))

# 24. Variance captured by each component
plt.bar(range(len(pca.explained_variance_ratio_)), pca.explained_variance_ratio_)
plt.title("24. Variance by Each PCA Component")
plt.show()

# 25. Feature selection before training
# Keeps the first 5 Wine features, an arbitrary subset chosen only for illustration.
X_selected = wine.data[:, :5]
X_train, X_test, y_train, y_test = train_test_split(X_selected, wine.target, test_size=0.2, random_state=42)
knn = KNeighborsClassifier().fit(X_train, y_train)
print("25. Accuracy with Feature Selection:", knn.score(X_test, y_test))

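# 26. PCA Reconstruction Error (standardized Wine data, 5 of 13 components kept)
X_wine_std = StandardScaler().fit_transform(wine.data)
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X_wine_std)
X_reconstructed = pca.inverse_transform(X_reduced)
recon_error = np.mean((X_wine_std - X_reconstructed)**2)  # mean squared error per entry
print("26. Reconstruction Error:", recon_error)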

# 27. Decision Boundary Visualization (2D data X_vis, y_vis from step 4)
# The classifier must be trained on 2 features to predict on a 2D grid.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_vis, y_vis)
x_min, x_max = X_vis[:, 0].min() - 1, X_vis[:, 0].max() + 1
y_min, y_max = X_vis[:, 1].min() - 1, X_vis[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X_vis[:, 0], X_vis[:, 1], c=y_vis, edgecolors='k')
plt.title("27. KNN Decision Boundary")
plt.show()

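# 28. PCA variance effect (standardized Wine data, 13 features)
X_wine_std = StandardScaler().fit_transform(wine.data)
for n in [2, 4, 6]:
    pca = PCA(n_components=n)
    X_reduced = pca.fit_transform(X_wine_std)
    print(f"28. Variance with {n} components:", np.sum(pca.explained_variance_ratio_))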
