<a href="https://colab.research.google.com/github/sumitabh-naskar/KNN-PCA/blob/main/KNN_%26_PCA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 1. What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?

**K-Nearest Neighbors (KNN)** is a simple, non-parametric machine learning algorithm used for both **classification** and **regression**. It's considered a "lazy" learning algorithm because it doesn't build a model during the training phase. Instead, it memorizes the entire dataset and performs all the computational work during prediction.

***

#### How KNN Works

The core idea of KNN is that an object's characteristics are similar to those of its neighbors. When you want to make a prediction for a new, unknown data point, the algorithm follows these steps:

1.  **Choose a number K**: This is the number of neighbors to consider.
2.  **Calculate Distance**: It calculates the distance from the new data point to all other points in the training dataset. Common distance metrics include Euclidean distance or Manhattan distance.
3.  **Find the K-Nearest Neighbors**: The algorithm identifies the K data points in the training set that are closest to the new point based on the calculated distances.
4.  **Make a Prediction**:
    * **For Classification**: It looks at the classes of the K-nearest neighbors and predicts the class that is most common among them. For example, if K=5 and three neighbors are "Class A" and two are "Class B", the new data point will be classified as "Class A".
    * **For Regression**: It takes the average of the values of the K-nearest neighbors to predict a continuous value for the new data point. For example, if K=5 and the neighbor values are [10, 12, 11, 15, 12], the predicted value would be the average, which is 12.
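
The four steps above can be sketched from scratch in a few lines of NumPy. This is only an illustrative sketch, not how scikit-learn implements KNN: the helper `knn_predict` and the toy arrays are hypothetical names made up for this example, and the distance is assumed to be Euclidean.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=5, task="classification"):
    """Predict for a single point x_new using its k nearest training points."""
    # Step 2: Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    # Step 4: majority vote (classification) or mean (regression)
    if task == "classification":
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    return float(np.mean(neighbor_targets))


# Toy 1-D data (hypothetical): five points near 0, one far away
X_toy = np.array([[0.0], [0.1], [0.2], [0.3], [0.4], [10.0]])
labels = np.array(["A", "A", "A", "B", "B", "B"])
values = np.array([10.0, 12.0, 11.0, 15.0, 12.0, 100.0])
query = np.array([0.2])

predicted_class = knn_predict(X_toy, labels, query, k=5, task="classification")
predicted_value = knn_predict(X_toy, values, query, k=5, task="regression")
print(predicted_class)  # A (3 votes for "A" vs 2 for "B")
print(predicted_value)  # 12.0, the mean of [10, 12, 11, 15, 12]
```

The far-away point at 10.0 is never among the five nearest neighbors, so it influences neither the vote nor the average.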

In [1]:
# 1: K-Nearest Neighbors (KNN) in Classification & Regression

# Import libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris, load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import accuracy_score, mean_squared_error

# ----------------------------
# 🔹 KNN for Classification
# ----------------------------
print("=== KNN for Classification (Iris Dataset) ===")

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features (important for KNN)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train KNN Classifier
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)

# Predictions
y_pred_clf = knn_clf.predict(X_test)

# Accuracy
print("Classification Accuracy:", accuracy_score(y_test, y_pred_clf))

# ----------------------------
# >> KNN for Regression <<
# ----------------------------
print("\n=== KNN for Regression (Diabetes Dataset) ===")

# Load dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train KNN Regressor
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)

# Predictions
y_pred_reg = knn_reg.predict(X_test)

# Mean Squared Error
print("Regression Mean Squared Error:", mean_squared_error(y_test, y_pred_reg))

# ----------------------------
# >>Explanation:
# - Classification → Majority voting among nearest neighbors
# - Regression → Average value among nearest neighbors
# - Scaling features is strongly recommended since KNN relies on distances
# ----------------------------


=== KNN for Classification (Iris Dataset) ===
Classification Accuracy: 1.0

=== KNN for Regression (Diabetes Dataset) ===
Regression Mean Squared Error: 3047.449887640449


### 2. What is the Curse of Dimensionality and how does it affect KNN performance?

**Curse of Dimensionality and Its Effect on KNN**

#### What is the Curse of Dimensionality?

The curse of dimensionality refers to problems that arise when the number of features (dimensions) in the dataset becomes very large.

In high-dimensional space:

- Data becomes sparse → Points are far apart.

- Distance measures lose meaning → The difference between the nearest and farthest neighbor becomes very small.

- More data needed → The volume of space increases exponentially, so we need exponentially more data to maintain density.
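
The "distances lose meaning" effect can be observed with a small simulation. The sketch below is an illustration under stated assumptions: points are drawn uniformly at random from the unit hypercube, and the helper `relative_contrast` is a hypothetical name introduced here. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


contrast_by_dim = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for n_dims, contrast in contrast_by_dim.items():
    print(f"{n_dims:>5} dims: relative contrast = {contrast:.2f}")
```

The contrast shrinks as the number of dimensions grows, so the "nearest" neighbor is barely closer than any other point.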

#### How It Affects KNN Performance

Since KNN relies on distance (Euclidean/Manhattan, etc.), the curse of dimensionality creates problems:

- Distances become less meaningful → All points seem equally far, so KNN struggles to identify true neighbors.

- Overfitting risk → With many irrelevant features, noise dominates the distance calculation.

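- High computation cost → Brute-force KNN computes a distance to every training point for each query, and the cost of each distance grows with the number of features. Tree-based indexes (KD-tree, ball tree) also lose their speed advantage in high dimensions.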

#### Ways to Reduce the Effect

- Feature Selection → Keep only relevant features.

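- Dimensionality Reduction → Use PCA or autoencoders. t-SNE is better suited to visualization, because standard t-SNE cannot project unseen points into the embedding.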

- Scaling/Normalization → Helps but does not fully solve the issue.

### 3. What is Principal Component Analysis (PCA)? How is it different from feature selection?

**Principal Component Analysis (PCA)** is a dimensionality reduction technique used to simplify a dataset while retaining its most important information. It transforms a set of correlated variables into a smaller set of uncorrelated variables called **principal components**. The first principal component accounts for the largest possible variance in the data, the second component accounts for the next largest variance while being orthogonal to the first, and so on. Essentially, PCA creates new, synthesized features from the original ones.

***

#### PCA vs. Feature Selection

While both PCA and feature selection are techniques for dimensionality reduction, they achieve it in fundamentally different ways:

* **PCA (Feature Extraction)**: PCA doesn't remove features; it transforms the existing ones into a new set of components. It creates a new, smaller set of features that are linear combinations of the original features. This can be very useful for reducing the complexity of a dataset and combating the "curse of dimensionality." However, the new components can be difficult to interpret, as they don't directly correspond to any of the original variables.

* **Feature Selection**: This is a process that chooses a **subset of the most relevant features** from the original dataset and discards the rest. The selected features are the actual, original variables. Feature selection methods can be simple (like removing features with low variance) or more complex (like using a model to rank feature importance). The primary advantage of feature selection is that the final model is more interpretable because it uses the original, understandable features. However, it may discard valuable information contained in the discarded features.
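
The difference can be made concrete on the Wine dataset used later in this notebook. The sketch below is illustrative only: `SelectKBest` with `f_classif` is just one of many possible feature selection methods, and keeping 2 features or components is an arbitrary choice for the demonstration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
y = wine.target

# Feature extraction: each new column mixes all 13 original features
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("PCA weights shape:", pca.components_.shape)  # (2, 13)

# Feature selection: keep 2 original columns, untouched
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_scaled, y)
kept = selector.get_support(indices=True)
print("Selected features:", [wine.feature_names[i] for i in kept])

# The selected columns are exact copies of original columns
same_columns = np.array_equal(X_selected, X_scaled[:, kept])
# The PCA columns are linear combinations of the centered originals
is_linear_combo = np.allclose(X_pca, (X_scaled - pca.mean_) @ pca.components_.T)
print(same_columns, is_linear_combo)
```

Feature selection returns columns that still carry their original names, while each PCA column is a weighted sum of all 13 features and has no single original name.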



### 4. What are eigenvalues and eigenvectors in PCA, and why are they important?

**Eigenvalues and Eigenvectors in PCA**

#### What Are Eigenvalues and Eigenvectors?

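- Eigenvectors: The eigenvectors of the data's covariance matrix. Each one defines a direction (a principal component); the eigenvector with the largest eigenvalue is the direction along which the data varies the most.

- Eigenvalues: The amount of variance captured along each eigenvector (importance/weight of each component).

In PCA: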

- Eigenvectors = new feature axes (principal components).

- Eigenvalues = how much variance (information) each component explains.

#### Why Are They Important in PCA?

1. Eigenvectors determine the orientation of new axes (principal components).

2. Eigenvalues tell us how much information (variance) is retained by each component.

3. We keep the eigenvectors with the largest eigenvalues; together they span the reduced feature space.

4. The eigenvalues also help decide how many components to keep (e.g., keep enough components to explain 95% of the variance).
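
These relationships can be checked numerically. The sketch below is an illustration that assumes the standardized Wine dataset as input: it computes the eigen-decomposition of the covariance matrix with NumPy and compares it with what scikit-learn's `PCA` reports. The variable names are chosen for this example only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Eigen-decomposition of the covariance matrix (features in columns)
cov = np.cov(X_scaled, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh returns ascending order

# Sort from largest to smallest eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of total variance per component, and components needed for 95%
explained_ratio = eigenvalues / eigenvalues.sum()
n_keep = int(np.argmax(np.cumsum(explained_ratio) >= 0.95)) + 1
print("Explained variance ratio:", np.round(explained_ratio, 3))
print("Components needed for 95% variance:", n_keep)

# Cross-check against scikit-learn (eigenvectors match up to sign)
pca = PCA().fit(X_scaled)
eigenvalues_match = np.allclose(eigenvalues, pca.explained_variance_)
first_axis_match = np.allclose(np.abs(eigenvectors[:, 0]), np.abs(pca.components_[0]))
print(eigenvalues_match, first_axis_match)
```

The comparison uses absolute values because an eigenvector and its negation describe the same axis.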

### 5. How do KNN and PCA complement each other when applied in a single pipeline?

**How KNN and PCA Complement Each Other**

#### KNN Recap

- KNN is a distance-based algorithm.

- Performance depends heavily on feature space and distances.

- Struggles in high-dimensional data (curse of dimensionality).

#### PCA Recap

- PCA reduces dimensionality by creating new features (principal components).

- Decorrelates the features and can filter out low-variance noise.

- Retains maximum variance in fewer dimensions.

#### How They Work Together

- PCA before KNN → PCA reduces dimensionality, keeping only the most informative components.

- This makes distances more meaningful for KNN (less noise, less redundancy).

- PCA also reduces computation cost → KNN is faster with fewer features.

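- PCA can reduce overfitting in KNN by discarding low-variance directions, which often carry noise. Because PCA is unsupervised, a low-variance direction is not guaranteed to be irrelevant to the labels, so the number of components should be validated.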
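
The points above can be sketched as a single scikit-learn pipeline on the Wine dataset. This is a minimal example rather than a tuned model: keeping 2 components and using 5 neighbors are illustrative assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaling -> PCA -> KNN chained into one estimator
pca_knn = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),  # illustrative choice, not a tuned value
    KNeighborsClassifier(n_neighbors=5),
)

# Each fold re-fits the scaler and PCA on its own training part
scores = cross_val_score(pca_knn, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```

Chaining the steps means the scaler and PCA never see the validation fold, which keeps the accuracy estimate honest.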

### 6. Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

In [10]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Wine dataset
# The Wine dataset is a classic classification problem with 13 features.
wine_data = load_wine()
X = wine_data.data
y = wine_data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# 2. Case 1: Train KNN without feature scaling
# We train the KNN classifier directly on the raw data.
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)
y_pred_unscaled = knn_unscaled.predict(X_test)
accuracy_unscaled = accuracy_score(y_test, y_pred_unscaled)
print("Accuracy without feature scaling:")
print(f"{accuracy_unscaled:.4f}")

# 3. Case 2: Train KNN with feature scaling
# We use StandardScaler to normalize the features. This is a crucial step
# for distance-based algorithms like KNN.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the KNN classifier on the scaled data.
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print("\nAccuracy with feature scaling:")
print(f"{accuracy_scaled:.4f}")

Accuracy without feature scaling:
0.7037

Accuracy with feature scaling:
0.9815


### 7. Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.

In [11]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA

# 1. Load the Wine dataset
# The Wine dataset is a classic classification problem with 13 features.
wine_data = load_wine()
X = wine_data.data
y = wine_data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Case 1: Train KNN without feature scaling
# We train the KNN classifier directly on the raw data.
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)
y_pred_unscaled = knn_unscaled.predict(X_test)
accuracy_unscaled = accuracy_score(y_test, y_pred_unscaled)
print("Accuracy without feature scaling:")
print(f"{accuracy_unscaled:.4f}")

# 3. Case 2: Train KNN with feature scaling
# We use StandardScaler to normalize the features. This is a crucial step
# for distance-based algorithms like KNN.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the KNN classifier on the scaled data.
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print("\nAccuracy with feature scaling:")
print(f"{accuracy_scaled:.4f}")

# 4. Train a PCA model and print explained variance ratio
# We train a PCA model on the scaled training data.
# The `explained_variance_ratio_` attribute shows the proportion of
# variance in the data that is captured by each principal component.
pca = PCA()
pca.fit(X_train_scaled)
explained_variance_ratio = pca.explained_variance_ratio_
print("\nExplained Variance Ratio of Principal Components:")
print(explained_variance_ratio)

# Sum of the first two components to show cumulative variance
cumulative_variance = np.sum(explained_variance_ratio[:2])
print(f"\nCumulative explained variance of the first two components: {cumulative_variance:.4f}")


Accuracy without feature scaling:
0.7407

Accuracy with feature scaling:
0.9630

Explained Variance Ratio of Principal Components:
[0.36196226 0.18763862 0.11656548 0.07578973 0.07043753 0.04552517
 0.03584257 0.02646315 0.02174942 0.01958347 0.01762321 0.01323825
 0.00758114]

Cumulative explained variance of the first two components: 0.5496


### 8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.

In [6]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA

# 1. Load the Wine dataset
# The Wine dataset is a classic classification problem with 13 features.
wine_data = load_wine()
X = wine_data.data
y = wine_data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Case 1: Train KNN without feature scaling
# We train the KNN classifier directly on the raw data.
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)
y_pred_unscaled = knn_unscaled.predict(X_test)
accuracy_unscaled = accuracy_score(y_test, y_pred_unscaled)
print("Accuracy without feature scaling:")
print(f"{accuracy_unscaled:.4f}")

# 3. Case 2: Train KNN with feature scaling
# We use StandardScaler to normalize the features. This is a crucial step
# for distance-based algorithms like KNN.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the KNN classifier on the scaled data.
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print("\nAccuracy with feature scaling:")
print(f"{accuracy_scaled:.4f}")

# 4. Train a PCA model and print explained variance ratio
# We train a PCA model on the scaled training data.
# The `explained_variance_ratio_` attribute shows the proportion of
# variance in the data that is captured by each principal component.
pca = PCA()
pca.fit(X_train_scaled)
explained_variance_ratio = pca.explained_variance_ratio_
print("\nExplained Variance Ratio of Principal Components:")
print(explained_variance_ratio)

# Sum of the first two components to show cumulative variance
cumulative_variance = np.sum(explained_variance_ratio[:2])
print(f"\nCumulative explained variance of the first two components: {cumulative_variance:.4f}")

# 5. Case 3: Train KNN on PCA-transformed data (top 2 components)
# We use PCA to reduce the dimensionality to the top 2 principal components.
# This helps to remove noise and can improve performance for KNN.
pca_2_components = PCA(n_components=2)
X_train_pca = pca_2_components.fit_transform(X_train_scaled)
X_test_pca = pca_2_components.transform(X_test_scaled)

# Train a new KNN classifier on the PCA-transformed data
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)
print("\nAccuracy with PCA (top 2 components):")
print(f"{accuracy_pca:.4f}")

Accuracy without feature scaling:
0.7407

Accuracy with feature scaling:
0.9630

Explained Variance Ratio of Principal Components:
[0.36196226 0.18763862 0.11656548 0.07578973 0.07043753 0.04552517
 0.03584257 0.02646315 0.02174942 0.01958347 0.01762321 0.01323825
 0.00758114]

Cumulative explained variance of the first two components: 0.5496

Accuracy with PCA (top 2 components):
0.9815


### 9. Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.

In [5]:
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features (important for KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --------------------------
# KNN with Euclidean Distance (same as the default: minkowski with p=2)
# --------------------------
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)

# --------------------------
# KNN with Manhattan Distance
# --------------------------
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)

# --------------------------
# Results
# --------------------------
print("Accuracy with Euclidean Distance:", acc_euclidean)
print("Accuracy with Manhattan Distance:", acc_manhattan)


Accuracy with Euclidean Distance: 0.9444444444444444
Accuracy with Manhattan Distance: 0.9444444444444444


### 10. You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models overfit.
Explain how you would:
- Use PCA to reduce dimensionality
- Decide how many components to keep
- Use KNN for classification post-dimensionality reduction
- Evaluate the model
- Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data

### Pipeline for High-Dimensional Gene Expression Classification

#### 1. **Use PCA to Reduce Dimensionality**

Gene expression data often has thousands of features (genes) and very few samples. PCA helps by transforming the data into a set of orthogonal components that capture the most variance.

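```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# X: (n_samples, n_genes) expression matrix, y: cancer-type labels (assumed already loaded)
# Split first so the test set never influences scaling or PCA
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Standardize using training-set statistics only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit PCA with all components to inspect the variance spectrum
pca = PCA()
pca.fit(X_train_scaled)
```

---

#### 2. **Decide How Many Components to Keep**

We would want to retain enough components to preserve most of the variance while reducing noise and overfitting risk.

```python
import numpy as np

# Smallest number of components whose cumulative explained variance reaches 95%
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.95)) + 1

# Refit PCA with the selected number of components (training data only)
pca = PCA(n_components=n_components)
X_train_reduced = pca.fit_transform(X_train_scaled)
X_test_reduced = pca.transform(X_test_scaled)
```
---

#### 3. **Use KNN for Classification Post-PCA**

KNN is simple, non-parametric, and works well when the feature space is clean and reduced.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Tune the number of neighbors on the PCA-reduced training data
# (cv=5 uses stratified folds for classifiers; the largest k must not
# exceed the number of samples in a fold's training part)
param_grid = {'n_neighbors': list(range(1, 21))}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid_search.fit(X_train_reduced, y_train)

# Best model
best_knn = grid_search.best_estimator_
```

---

#### 4. **Evaluate the Model**

Evaluate on the held-out test set with several metrics. With so few samples, a single split is noisy, so stratified cross-validation gives a more stable estimate.

```python
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# The test set goes through the same scaler and PCA fitted on the training data
y_pred = best_knn.predict(X_test_reduced)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
```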
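
As a more compact alternative, the steps above can be wrapped in one scikit-learn `Pipeline`, so that scaling and PCA are re-fit inside every cross-validation fold. The sketch below is only an illustration: the data is a synthetic stand-in generated with `make_classification` (not real gene expression data), and the grid values, fold count, and `balanced_accuracy` scoring are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: 120 samples, 2000 features
X, y = make_classification(
    n_samples=120, n_features=2000, n_informative=30,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# A float n_components keeps enough components to reach that variance share
param_grid = {
    "pca__n_components": [0.80, 0.90, 0.95],
    "knn__n_neighbors": [3, 5, 7, 9],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="balanced_accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Cross-validated balanced accuracy: {search.best_score_:.3f}")
print(f"Held-out balanced accuracy: {search.score(X_test, y_test):.3f}")
```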
---

#### 5. **Justify the Pipeline to Stakeholders**

The case for this pipeline rests on the following points:

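- **Dimensionality Reduction:** PCA compresses thousands of correlated gene measurements into a few components, which reduces the overfitting risk that comes with having far more features than samples.
- **Interpretability:** Each component's loadings show how strongly every gene contributes to it, so components can be traced back to specific genes for biological review.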
- **Simplicity & Transparency:** KNN is intuitive and easy to explain, especially in clinical settings.
- **Validation:** Cross-validation and metric-based evaluation ensure the model generalizes well.
- **Scalability:** The pipeline is modular and can be extended to other classifiers or integrated with biological priors.
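
The interpretability point can be illustrated by reading gene weights from `pca.components_`. The sketch below uses random placeholder data, not real expression measurements: the gene names, the matrix sizes, and the injected shared signal in the first 20 genes are all hypothetical and exist only to show the mechanics.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical expression matrix: 80 samples x 200 genes (random placeholder data)
n_samples, n_genes, n_signal = 80, 200, 20
gene_names = np.array([f"gene_{i}" for i in range(n_genes)])
X = rng.normal(size=(n_samples, n_genes))
# Inject a shared signal into the first 20 genes so they co-vary
X[:, :n_signal] += 5 * rng.normal(size=(n_samples, 1))

pca = PCA(n_components=3).fit(StandardScaler().fit_transform(X))

# Loadings: weight of each gene in each component (rows of components_)
first_component = pca.components_[0]
ranking = np.argsort(np.abs(first_component))[::-1]
top_genes = gene_names[ranking[:5]]
print("Genes with the largest weight in PC1:", sorted(top_genes))
```

On real data, the top-weighted genes of each retained component would be the ones handed to domain experts for review.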