# KNN & PCA Assignment

## Q1. What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?

**Answer:**  

**K-Nearest Neighbors (KNN):**  
KNN is a **supervised machine learning algorithm** used for both **classification** and **regression** tasks.  
It is a **non-parametric** and **instance-based (lazy learning)** method, meaning it does not assume any specific distribution about the data and does not explicitly build a model during training. Instead, it memorizes the training data and makes predictions at query time.

### How KNN Works:
1. Choose a value of **K** (number of nearest neighbors).
2. For a new data point:
   - Calculate the **distance** between the new point and all training data points.  
     (Common metrics: Euclidean, Manhattan, Minkowski, etc.)
   - Identify the **K closest points** (neighbors).
3. Make predictions based on these neighbors.
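
The neighbor search in steps 1 and 2 can be sketched in a few lines of NumPy. This is a minimal illustration, not an optimized implementation; the helper name `k_nearest` and the toy data are assumptions made for the example.

```python
import numpy as np

def k_nearest(X_train, query, k):
    """Return indices and distances of the k training points closest to `query`."""
    # Euclidean distance from the query to every training point
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances, nearest first
    nearest = np.argsort(distances)[:k]
    return nearest, distances[nearest]

# Toy data (hypothetical): four 2-D training points and one query point
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
query = np.array([0.0, 0.0])

idx, dist = k_nearest(X_train, query, k=3)
print(idx)   # indices of the three nearest training points
print(dist)  # their distances to the query
```

The prediction in step 3 then depends only on the targets of the points in `idx`: a vote for classification, an average for regression.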

### KNN for Classification:
- Each of the K neighbors "votes" for its class.
- The class with the majority vote is assigned to the new data point.  

**Example:** If K=5 and among neighbors: 3 belong to class A and 2 belong to class B →  
the new point is classified as **class A**.
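
The vote in this example can be reproduced with the standard library. The label list below is simply the 3-versus-2 split from the example, written out in an arbitrary order.

```python
from collections import Counter

# Class labels of the K=5 nearest neighbors from the example above
neighbor_labels = ["A", "A", "B", "A", "B"]

# Count the votes and pick the most common class
votes = Counter(neighbor_labels)
predicted_class, vote_count = votes.most_common(1)[0]
print(predicted_class, vote_count)  # A 3
```

With two classes, an odd K avoids tied votes; with more classes, ties are still possible and need a tie-breaking rule.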

### KNN for Regression:
- Instead of voting, KNN takes the **average (or weighted average)** of the target values of the K nearest neighbors.
- The predicted value is a continuous number.  

**Example:** If K=3 and neighbors have target values 10, 12, 14 →  
predicted value = **(10 + 12 + 14) / 3 = 12**.
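
A short sketch of both averaging options follows. The target values are the ones from the example; the neighbor distances are hypothetical and only serve to illustrate inverse-distance weighting, which is one common weighting choice.

```python
import numpy as np

# Target values of the K=3 nearest neighbors from the example above
neighbor_targets = np.array([10.0, 12.0, 14.0])
# Hypothetical distances from the query to each neighbor (assumed for illustration)
neighbor_distances = np.array([1.0, 2.0, 4.0])

# Plain average: every neighbor counts equally
uniform_prediction = neighbor_targets.mean()

# Weighted average: closer neighbors get larger weights (inverse distance)
weights = 1.0 / neighbor_distances
weighted_prediction = np.average(neighbor_targets, weights=weights)

print(uniform_prediction)   # 12.0
print(weighted_prediction)  # pulled toward 10, the target of the closest neighbor
```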

### Key Points:
- KNN is simple and effective for smaller datasets.
- The choice of **K** is important:  
  - Small K → sensitive to noise (**overfitting**)  
  - Large K → smoother decision boundary but may **underfit**
- Requires **feature scaling** (normalization/standardization) since distance measures are used.
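
One way to choose K in practice is to compare several candidates with cross-validation. The sketch below assumes the Wine dataset and an arbitrary list of candidate values; scaling is placed inside the pipeline so that it is refit on each training fold.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Candidate values of K (an arbitrary choice for illustration)
candidate_ks = [1, 3, 5, 7, 9, 11, 15]

mean_scores = {}
for k in candidate_ks:
    # Scale, then classify; the scaler is fit on the training folds only
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    mean_scores[k] = cross_val_score(model, X, y, cv=5).mean()

# K with the highest mean cross-validated accuracy
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("Best K:", best_k)
```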

## Q2. What is the Curse of Dimensionality and how does it affect KNN performance?

**Answer:**  

**Curse of Dimensionality:**  
The Curse of Dimensionality refers to the various problems that arise when working with data that has a very high number of features (dimensions). As the number of dimensions increases:

- The **volume of the feature space increases exponentially**, making the data points sparse.
- The **distance between points becomes less meaningful**, because all points tend to appear similarly far from each other.
- This sparsity makes it difficult for algorithms like KNN to find truly "nearest" neighbors.
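
The loss of distance contrast can be observed with a small simulation. This sketch assumes uniformly random points in the unit hypercube, which is a simplification of real data; the helper `relative_contrast` is a name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

# Contrast between nearest and farthest neighbor for growing dimensionality
contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} dimensions: relative contrast = {c:.3f}")
```

As the number of dimensions grows, the relative contrast shrinks: the farthest point is only slightly farther away than the nearest one.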

### How it affects KNN performance:
1. **Distance measure loses effectiveness:**  
   KNN relies on distance metrics (like Euclidean distance) to find neighbors. In high dimensions, the difference in distance between the nearest and farthest neighbors becomes very small, making neighbor selection unreliable.

2. **Increased computation:**  
   More features mean more calculations for each distance computation, slowing down prediction.

3. **Overfitting risk:**  
   With many features, KNN may fit noise instead of true patterns, reducing generalization performance.

### Mitigation:
- **Dimensionality reduction techniques** like PCA (Principal Component Analysis) are commonly used to reduce the number of features while retaining most of the variance.
- Feature selection or normalization can also help improve KNN performance in high-dimensional spaces.

## Q3. What is Principal Component Analysis (PCA)? How is it different from feature selection?

**Answer:**  

**Principal Component Analysis (PCA):**  
PCA is an **unsupervised dimensionality reduction technique** used to reduce the number of features in a dataset while retaining most of the original variance. It transforms the original features into a new set of **uncorrelated variables** called **principal components**, which are linear combinations of the original features.

- The first principal component captures the **maximum variance** in the data.
- The second principal component captures the maximum remaining variance, and so on.
- Typically, only the top few principal components are retained to reduce dimensionality.

**Difference from Feature Selection:**  

| Aspect                     | PCA (Feature Extraction)               | Feature Selection                     |
|-----------------------------|---------------------------------------|--------------------------------------|
| Approach                    | Creates **new features** (principal components) | Selects a **subset of original features** |
| Data Transformation         | Yes (linear combinations of original features) | No (keeps original features)        |
| Goal                        | Reduce dimensionality while retaining variance | Keep most relevant features          |
| Use Case                    | High-dimensional datasets where feature correlation exists | When some features are irrelevant or redundant |

**Summary:**  
- PCA reduces dimensions by **combining features** into principal components.  
- Feature selection reduces dimensions by **choosing the most important features** without creating new ones.  
- Both help in improving model performance, reducing overfitting, and speeding up computation, but they do so in fundamentally different ways.
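
The contrast can be made concrete on the Wine dataset. In this sketch, `SelectKBest` with the ANOVA F-score is only one of many feature-selection methods and is chosen here as an assumption for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
y = wine.target

# Feature selection: keep 2 of the 13 original columns (supervised, uses y)
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_scaled, y)
kept = [wine.feature_names[i] for i in selector.get_support(indices=True)]

# PCA: build 2 new columns, each a linear combination of all 13 features (ignores y)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Selected original features:", kept)
print("PCA loading matrix shape:", pca.components_.shape)  # (2, 13)
```

Both outputs have two columns, but the selected columns are still named, measurable features, while each PCA column mixes all thirteen.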

## Q4. What are eigenvalues and eigenvectors in PCA, and why are they important?

**Answer:**  

**Eigenvectors and Eigenvalues in PCA:**  
PCA uses **linear algebra** to transform the original features into principal components. This involves computing the **covariance matrix** of the data and then finding its **eigenvectors** and **eigenvalues**.

- **Eigenvectors:**  
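  These are vectors that define the **directions** of the new feature space (principal components). The first eigenvector points along the direction of greatest variance; each subsequent one points along the direction of greatest remaining variance, orthogonal to the previous ones.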

- **Eigenvalues:**  
  These are scalars that measure the **magnitude of variance** along each eigenvector. A larger eigenvalue indicates that the corresponding eigenvector captures more of the data's variance.

**Importance in PCA:**
1. **Determine principal components:**  
   Eigenvectors with the largest eigenvalues are selected as principal components because they capture the most significant variation in the data.

2. **Dimensionality reduction:**  
   By keeping only the top eigenvectors (with highest eigenvalues), we reduce the number of dimensions while preserving most of the information.

3. **Feature transformation:**  
   Eigenvectors provide the new axes for the transformed feature space, and eigenvalues quantify how much variance each axis explains.

**Summary:**  
- Eigenvectors → Directions of principal components  
- Eigenvalues → Importance (variance) of each principal component  
- PCA chooses eigenvectors with largest eigenvalues to reduce dimensions efficiently.
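
The link between the covariance matrix and PCA can be checked numerically. The sketch below computes the eigen-decomposition with NumPy on the standardized Wine data and compares it with scikit-learn's `PCA`; the variable names are chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the features (columns are variables)
cov = np.cov(X_scaled, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]      # sort from largest to smallest
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]      # columns are the eigenvectors

# Share of total variance explained by each direction
explained_ratio = eigenvalues / eigenvalues.sum()

# Compare with scikit-learn's PCA
pca = PCA().fit(X_scaled)
print(np.allclose(explained_ratio, pca.explained_variance_ratio_))
```

The eigenvectors match the rows of `pca.components_` up to sign, since an eigenvector and its negation describe the same direction.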

## Q5. How do KNN and PCA complement each other when applied in a single pipeline?

**Answer:**  

**Combining PCA and KNN:**  
- **KNN** is sensitive to the **curse of dimensionality**: as the number of features increases, distances between points become less meaningful, which can degrade KNN performance.  
- **PCA** reduces the number of features by creating **principal components** that capture most of the variance, discarding low-variance directions that are often redundant or noisy.

**How they complement each other:**
1. **Dimensionality Reduction:**  
   PCA reduces high-dimensional data to a smaller set of components, making distance calculations in KNN more meaningful.

2. **Improved KNN Performance:**  
   With fewer dimensions, KNN often classifies or regresses more accurately and is less sensitive to noise, provided the discarded components carry little signal relevant to the target.

3. **Faster Computation:**  
   Fewer features mean fewer distance calculations for KNN, which speeds up prediction.

4. **Better Generalization:**  
   By reducing irrelevant or correlated features, PCA helps KNN avoid overfitting and improves model generalization.
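
A single pipeline of this kind might look as follows. The sketch assumes the Wine dataset, 2 components, and K=5; wrapping the steps in a scikit-learn pipeline means the scaler and PCA are fit only on the training folds during cross-validation.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale -> reduce to 2 principal components -> classify with KNN
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)

# 5-fold cross-validated accuracy of the whole pipeline
scores = cross_val_score(pipeline, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```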

## Q6. Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

**Answer:**  

We will train a KNN classifier on the Wine dataset twice:  
1. **Without feature scaling**  
2. **With feature scaling** (StandardScaler)  

We will then compare the model accuracy in both cases.   


```python
# Import libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Wine dataset
wine = load_wine()
X, y = wine.data, wine.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ---------------------------
# 1. KNN without feature scaling
# ---------------------------
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred)
print("Accuracy without scaling:", accuracy_no_scaling)

# ---------------------------
# 2. KNN with feature scaling
# ---------------------------
# Fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print("Accuracy with scaling:", accuracy_scaled)
```

```
Accuracy without scaling: 0.7222222222222222
Accuracy with scaling: 0.9444444444444444
```


## Q7. Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.

**Answer:**  

We will apply PCA to the Wine dataset to reduce dimensionality and understand how much variance each principal component captures.


```python
# Import libraries
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine

# Load Wine dataset
wine = load_wine()
X, y = wine.data, wine.target

# Standardize features before applying PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA (keeps all 13 components by default)
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Print explained variance ratio for each principal component
explained_variance = pca.explained_variance_ratio_
for i, ratio in enumerate(explained_variance, start=1):
    print(f"Principal Component {i}: {ratio:.4f}")

# Optional: cumulative explained variance
cumulative_variance = explained_variance.cumsum()
print("\nCumulative Explained Variance:", cumulative_variance)
```

```
Principal Component 1: 0.3620
Principal Component 2: 0.1921
Principal Component 3: 0.1112
Principal Component 4: 0.0707
Principal Component 5: 0.0656
Principal Component 6: 0.0494
Principal Component 7: 0.0424
Principal Component 8: 0.0268
Principal Component 9: 0.0222
Principal Component 10: 0.0193
Principal Component 11: 0.0174
Principal Component 12: 0.0130
Principal Component 13: 0.0080

Cumulative Explained Variance: [0.36198848 0.55406338 0.66529969 0.73598999 0.80162293 0.85098116
 0.89336795 0.92017544 0.94239698 0.96169717 0.97906553 0.99204785
 1.        ]
```


## Q8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.

**Answer:**  

We will apply PCA to reduce the Wine dataset to the top 2 principal components and then train a KNN classifier.  
We will compare the accuracy with the KNN trained on the original (scaled) dataset.


```python
# Import libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Wine dataset
wine = load_wine()
X, y = wine.data, wine.target

# Standardize features
# Note: the scaler (and the PCA below) are fit on the full dataset before splitting,
# so test-set statistics leak into preprocessing; fitting them on the training
# split only would give a stricter accuracy estimate.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# ---------------------------
# 1. KNN on original scaled data
# ---------------------------
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train, y_train)
y_pred_original = knn_original.predict(X_test)
accuracy_original = accuracy_score(y_test, y_pred_original)
print("Accuracy on original dataset:", accuracy_original)

# ---------------------------
# 2. Apply PCA (top 2 components)
# ---------------------------
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Same test_size and random_state as above, so both models see the same split
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(X_pca, y, test_size=0.2, random_state=42)

# ---------------------------
# 3. KNN on PCA-transformed data
# ---------------------------
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train_pca)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test_pca, y_pred_pca)
print("Accuracy on PCA-transformed dataset (top 2 components):", accuracy_pca)
```

```
Accuracy on original dataset: 0.9444444444444444
Accuracy on PCA-transformed dataset (top 2 components): 1.0
```


## Q9. Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.

**Answer:**  

We will train KNN classifiers on the scaled Wine dataset using **Euclidean** and **Manhattan** distance metrics and compare their accuracies.


```python
# Import libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Wine dataset
wine = load_wine()
X, y = wine.data, wine.target

# Standardize features
# Note: the scaler is fit on the full dataset before splitting; fitting it on the
# training split only would avoid leaking test-set statistics.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# ---------------------------
# 1. KNN with Euclidean distance
# ---------------------------
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)
print("Accuracy with Euclidean distance:", accuracy_euclidean)

# ---------------------------
# 2. KNN with Manhattan distance
# ---------------------------
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)
print("Accuracy with Manhattan distance:", accuracy_manhattan)
```

```
Accuracy with Euclidean distance: 0.9444444444444444
Accuracy with Manhattan distance: 0.9444444444444444
```


## Q10. You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer. Explain how you would:  
- Use PCA to reduce dimensionality  
- Decide how many components to keep  
- Use KNN for classification post-dimensionality reduction  
- Evaluate the model  
- Justify this pipeline to your stakeholders

**Answer:**  

**1. Use PCA to reduce dimensionality:**  
- Gene expression datasets often have **thousands of features (genes)** but relatively few samples.  
- Directly applying KNN or other models can lead to **overfitting**.  
- Apply **PCA** to transform the original features into **principal components**, which capture the most variance in the data.  
- This reduces the dimensionality while retaining the most important information for classification.

**2. Decide how many components to keep:**  
- Examine the **explained variance ratio** of principal components.  
- Retain enough components to capture **~90–95% of the total variance**.  
- This balances **information retention** and **overfitting risk**.
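
The variance threshold can be applied in two equivalent ways, sketched below. No gene expression data is available here, so the Wine dataset is used as a stand-in (an assumption for illustration); the 95% threshold is taken from the range above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data: Wine instead of a real gene expression matrix
X_scaled = StandardScaler().fit_transform(load_wine().data)

# Option 1: fit a full PCA and find the first component count reaching 95%
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1

# Option 2: pass the variance fraction directly; PCA picks the count itself
pca_95 = PCA(n_components=0.95).fit(X_scaled)

print(n_components_95, pca_95.n_components_)
```

The variance threshold is a starting point; the number of components can also be tuned with cross-validation, like any other hyperparameter.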

**3. Use KNN for classification post-dimensionality reduction:**  
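- Standardize the features before applying PCA, since PCA is sensitive to feature scale, and fit both the scaler and PCA on the training data only.  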
- Train a **KNN classifier** on the reduced feature set.  
- KNN works well here because distances in lower-dimensional space are more meaningful and less noisy.

**4. Evaluate the model:**  
- Use **cross-validation** (e.g., k-fold) to assess model stability and generalization.  
- Evaluate performance metrics such as **accuracy, precision, recall, F1-score**, or **ROC-AUC** depending on class imbalance.  
- Compare results with and without PCA to demonstrate dimensionality reduction benefits.
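
An evaluation along these lines could be sketched as follows. The data is synthetic (generated with `make_classification`) and only mimics the shape of a gene expression problem: few samples, many features. The sample count, feature count, number of classes, and the 90% variance threshold are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (hypothetical): 120 "patients", 2000 "genes", 3 classes
X, y = make_classification(
    n_samples=120, n_features=2000, n_informative=20, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Scaling and PCA live inside the pipeline, so they are refit on each training fold
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.90),
    KNeighborsClassifier(n_neighbors=5),
)

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(pipeline, X, y, cv=cv, scoring=["accuracy", "f1_macro"])

print("Mean accuracy:", results["test_accuracy"].mean())
print("Mean macro F1:", results["test_f1_macro"].mean())
```

Running the same cross-validation on a pipeline without the PCA step gives the with/without comparison mentioned above.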

**5. Justify this pipeline to stakeholders:**  
- **Robustness:** PCA reduces overfitting by removing redundant or noisy features.  
- **Interpretability:** PCA highlights the most informative patterns in gene expression.  
- **Efficiency:** Reduced feature space leads to faster model training and prediction.  
- **Accuracy:** KNN on PCA-transformed data maintains strong classification performance while minimizing complexity.  
- This approach is widely accepted in **biomedical data analysis**, where high-dimensional datasets are common and interpretability is important.
