### Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?

**Answer:**  
K-Nearest Neighbors (KNN) is a supervised learning algorithm used for both classification and regression.  
- In **classification**, the algorithm assigns a class to a new data point based on the majority class of its k nearest neighbors.  
- In **regression**, the prediction is the average (or weighted average) of the values of its k nearest neighbors.  

KNN relies on distance metrics like Euclidean or Manhattan distance to find the nearest neighbors.  
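It is simple and non-parametric. Because it stores the training data and compares every query against it at prediction time, it works best on small to medium datasets.  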

**Example Code (Classification):**  
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train KNN Classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
```
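
**Example Code (From-Scratch Sketch):**  
To make the voting and averaging steps described above concrete, here is a minimal NumPy sketch of KNN for a single query point. The helper name `knn_predict`, its `task` parameter, and the toy data are made up for illustration; it assumes Euclidean distance and unweighted neighbors, and it is not how scikit-learn implements the algorithm internally.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point from its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the k neighbors
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: unweighted mean of the neighbors' target values
    return float(np.mean(neighbor_targets))

# Toy data (illustrative only): two groups of points on a line
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

query = np.array([2.5])
print("Predicted class:", knn_predict(X_toy, y_class, query, k=3))                      # 0
print("Predicted value:", knn_predict(X_toy, y_value, query, k=3, task="regression"))  # 2.0
```

The three nearest neighbors of 2.5 are the points 1, 2, and 3, so the vote returns class 0 and the regression prediction is their mean, 2.0.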

### Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?

**Answer:**  
The Curse of Dimensionality refers to various problems that arise when analyzing and organizing data in high-dimensional spaces.  
- In high dimensions, distances between points become less meaningful because all points tend to look equally far apart.  
- For KNN, this means that finding "nearest" neighbors becomes difficult and noisy.  
- As a result, model performance may degrade due to poor neighborhood structure.  

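**Impact on KNN:**  
- More training data is needed to achieve the same accuracy, because the volume of the feature space grows exponentially with the number of dimensions.  
- Computation time increases, since every distance calculation involves every feature.  
- Dimensionality reduction (e.g., PCA) or feature selection helps mitigate the problem. Feature scaling is still needed for KNN, but it addresses a different issue: features with large ranges dominating the distance.  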
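
**Example Code (Distance Concentration):**  
The claim that points "look equally far apart" can be checked with a small simulation. The sketch below draws uniform random points and measures the relative contrast, (farthest - nearest) / nearest, from one random query. The function name `relative_contrast`, the sample size, and the chosen dimensions are assumptions for illustration; real datasets with correlated features may behave less extremely.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(max - min) / min of the distances from one random query to random points."""
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

# Compare a low-dimensional space with increasingly high-dimensional ones
contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"dims={d:>4}  relative contrast={c:.3f}")
```

A contrast close to zero means the nearest and farthest points are almost the same distance away, which is the situation in which "nearest" neighbors carry little information.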


### Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?

**Answer:**  
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms original features into a smaller set of uncorrelated variables called principal components.  
- These components capture the maximum variance in the data.  

**Difference from Feature Selection:**  
- PCA creates **new features** (linear combinations of original ones), while feature selection chooses a subset of original features.  
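- PCA is unsupervised and maximizes variance without looking at the target, whereas feature selection ranks the original features by a relevance criterion, often measured against the target.  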
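
**Example Code (PCA vs. Feature Selection):**  
The difference can be seen side by side on the Wine dataset. This is a minimal sketch: the choice of `SelectKBest` with `f_classif` as the selection method, and of 2 as the number of features, are assumptions made for illustration, and other selection methods would work equally well.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine_data = load_wine()
X_wine_scaled = StandardScaler().fit_transform(wine_data.data)

# PCA: each new feature is a weighted combination of all 13 original features
pca_demo = PCA(n_components=2).fit(X_wine_scaled)
X_components = pca_demo.transform(X_wine_scaled)
print("PCA weight matrix shape:", pca_demo.components_.shape)  # (2, 13)

# Feature selection: keep 2 of the original columns, unchanged
selector = SelectKBest(score_func=f_classif, k=2).fit(X_wine_scaled, wine_data.target)
X_selected = selector.transform(X_wine_scaled)
kept_names = np.array(wine_data.feature_names)[selector.get_support()]
print("Selected original features:", list(kept_names))
```

Both outputs have two columns, but only the selected features keep their original names and meaning; the principal components are mixtures and have to be interpreted through their weights.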


### Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?

**Answer:**  
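- **Eigenvectors**: The eigenvectors of the data's covariance matrix define the directions of the principal components. The first one points along the direction of greatest variance, and each subsequent one is orthogonal to those before it.  
- **Eigenvalues**: Each eigenvalue is the amount of variance captured along its eigenvector.  
- In PCA, eigenvalues help decide which principal components to keep (larger eigenvalue = more variance).  
- They are crucial for dimensionality reduction because the eigenvalues of the retained components, divided by the sum of all eigenvalues, give the fraction of the total variance that is preserved.  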
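
**Example Code (Eigen-Decomposition):**  
The relationship between the covariance matrix, its eigenvectors, and the explained variance can be sketched directly in NumPy. The synthetic two-feature data below is made up for illustration; scikit-learn's `PCA` computes the same quantities through an SVD rather than this explicit decomposition.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (illustrative): two correlated features
x = rng.normal(size=200)
data = np.column_stack([x, 0.5 * x + rng.normal(scale=0.3, size=200)])

# Center the data and compute the covariance matrix of the features
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh handles symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]       # re-sort to descending
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]       # columns are principal directions

# Fraction of total variance captured by each component
explained_ratio = eigenvalues / eigenvalues.sum()
print("Eigenvalues:", eigenvalues)
print("Explained variance ratio:", explained_ratio)

# Project onto the first principal component (2-D -> 1-D)
projected = centered @ eigenvectors[:, :1]
```

The variance of `projected` equals the largest eigenvalue, which is why sorting by eigenvalue tells you how much information each component keeps.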


### Question 5: How do KNN and PCA complement each other when applied in a single pipeline?

**Answer:**  
- KNN suffers in high dimensions due to the Curse of Dimensionality.  
- PCA reduces dimensions by keeping only the most informative components.  
- When PCA is applied before KNN:  
  - Noise is reduced.  
  - Computational efficiency improves.  
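  - Accuracy may improve due to better neighborhood structure.  

Thus, a PCA + KNN pipeline is often a good fit for high-dimensional datasets. The gain should still be confirmed with cross-validation, because PCA can also discard low-variance directions that happen to separate the classes.  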
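
**Example Code (KNN With and Without PCA):**  
One way to check whether PCA actually helps KNN is to cross-validate both variants on the same data. The sketch below uses a synthetic dataset from `make_classification` with many uninformative features; the dataset sizes, the choice of 10 components, and the variable names are assumptions for illustration, and the outcome on real data may differ in either direction.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic high-dimensional data: 200 features, only a few informative
X_hd, y_hd = make_classification(
    n_samples=300, n_features=200, n_informative=10, n_redundant=10, random_state=0
)

# Variant 1: scaling + KNN on all features
knn_only = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Variant 2: scaling + PCA + KNN on the reduced features
pca_knn = make_pipeline(
    StandardScaler(), PCA(n_components=10), KNeighborsClassifier(n_neighbors=5)
)

# Pipelines are refit inside each fold, so PCA never sees the held-out data
score_knn = cross_val_score(knn_only, X_hd, y_hd, cv=5).mean()
score_pca = cross_val_score(pca_knn, X_hd, y_hd, cv=5).mean()

print(f"KNN only:  {score_knn:.3f}")
print(f"PCA + KNN: {score_pca:.3f}")
```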


### Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

**Answer (Code + Output):**  
```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Wine dataset
wine = load_wine()
X, y = wine.data, wine.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Without scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("Accuracy without scaling:", accuracy_score(y_test, y_pred))

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn.fit(X_train_scaled, y_train)
y_pred_scaled = knn.predict(X_test_scaled)
print("Accuracy with scaling:", accuracy_score(y_test, y_pred_scaled))
```

### Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.

**Answer (Code + Output):**  
```python
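from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first: PCA is variance-based, so features with large
# ranges (e.g., proline) would otherwise dominate the components
X_std = StandardScaler().fit_transform(X)

# Fit PCA keeping all components
pca = PCA()
X_pca = pca.fit_transform(X_std)

# Explained variance ratio of each principal component
print("Explained variance ratio:", pca.explained_variance_ratio_)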
```
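
**Example Code (Cumulative Explained Variance):**  
The per-component ratios are usually easier to act on as a running total, which shows how many components are needed to reach a target share of the variance. The sketch below is self-contained and standardizes the Wine features before PCA; the 95% threshold and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_wine = load_wine().data
X_wine_std = StandardScaler().fit_transform(X_wine)

# Fit PCA with all components and accumulate the variance ratios
pca_full = PCA().fit(X_wine_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.95  # assumed target share of variance
# Index of the first cumulative value >= threshold, plus 1 to get a count
n_needed = int(np.searchsorted(cumulative, threshold) + 1)

for i, total in enumerate(cumulative, start=1):
    print(f"{i:>2} components: {total:.3f}")
print(f"Components needed for {threshold:.0%} variance:", n_needed)
```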

### Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.

**Answer (Code + Output):**  
```python
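from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Split first so the scaler and PCA never see the test set (avoids leakage)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler and PCA on the training data only, then apply to both sets
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(scaler.transform(X_train))
X_train_pca = pca.transform(scaler.transform(X_train))
X_test_pca = pca.transform(scaler.transform(X_test))

# Train KNN on the 2 principal components
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
print("Accuracy with PCA (2 components):", accuracy_score(y_test, y_pred_pca))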
```

### Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.

**Answer (Code + Output):**  
```python
# Euclidean distance
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
print("Euclidean Accuracy:", accuracy_score(y_test, y_pred_euclidean))

# Manhattan distance
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
print("Manhattan Accuracy:", accuracy_score(y_test, y_pred_manhattan))
```
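
**Example Code (How the Two Metrics Differ):**  
To show what the `metric` argument changes, here is a small sketch that computes both distances by hand. Euclidean and Manhattan distance are the p = 2 and p = 1 cases of the Minkowski distance; the helper name `minkowski_distance` and the two sample points are made up for illustration.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

point_a = np.array([0.0, 0.0])
point_b = np.array([3.0, 4.0])

euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(3^2 + 4^2) = 5.0
manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4| = 7.0

print("Euclidean:", euclidean)
print("Manhattan:", manhattan)
```

Because Euclidean distance squares each difference, a single large gap in one feature weighs more heavily than under Manhattan distance, which can change which neighbors are considered nearest.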

### Question 10: High-dimensional gene expression dataset classification problem

**Answer:**  
Steps to build a robust pipeline:  
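1. **Use PCA for Dimensionality Reduction**: Reduce thousands of gene features to a smaller set of components that retain most of the variance. Standardize the features first, and fit both the scaler and PCA on the training data only.  
2. **Decide Number of Components**: Use the cumulative explained variance ratio (e.g., keep ~95% of the variance), or tune the number with cross-validation.  
3. **Apply KNN**: Train the classifier on the reduced dataset.  
4. **Evaluate**: Use cross-validation and report accuracy, F1-score, and the confusion matrix.  
5. **Justification**:  
   - PCA reduces the risk of overfitting when features far outnumber samples.  
   - KNN is simple, and each prediction can be traced back to specific neighboring samples.  
   - Wrapping the steps in one pipeline keeps preprocessing consistent between training and prediction.  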

**Example Code:**  
```python
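from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X and y stand for the gene expression matrix and the class labels.
# Split first so that scaling and PCA are fitted on the training data only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale, keep enough components for 95% of the variance, then classify
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    KNeighborsClassifier(n_neighbors=5),
)
model.fit(X_train, y_train)
y_pred_r = model.predict(X_test)

# Evaluate
print("Accuracy after PCA + KNN:", accuracy_score(y_test, y_pred_r))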
```
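
**Example Code (Cross-Validated Evaluation):**  
Step 4 above also calls for cross-validation, F1-score, and a confusion matrix, which matter when samples are scarce and a single train-test split is noisy. The sketch below is one possible way to do this. The data is a synthetic placeholder from `make_classification` standing in for a gene expression matrix (few samples, many features); the sizes, the 5-fold setting, and the variable names are assumptions for illustration, not results from real biomedical data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data: 120 samples, 2000 features, two classes
X_genes, y_genes = make_classification(
    n_samples=120, n_features=2000, n_informative=20, random_state=0
)

# The whole pipeline is refit in every fold, so there is no leakage
gene_model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    KNeighborsClassifier(n_neighbors=5),
)

# Stratified folds keep the class balance similar in each split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_cv_pred = cross_val_predict(gene_model, X_genes, y_genes, cv=cv)

macro_f1 = f1_score(y_genes, y_cv_pred, average="macro")
cm = confusion_matrix(y_genes, y_cv_pred)

print("Macro F1 (5-fold):", round(macro_f1, 3))
print("Confusion matrix:\n", cm)
```

Each sample is predicted exactly once by a model that did not see it during fitting, so the confusion matrix covers the full dataset.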