# **KNN and PCA Assignment**

**Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?**


**Answer:-** K-Nearest Neighbors (KNN) is a supervised learning algorithm used for both classification and regression problems. It is one of the simplest and most intuitive machine learning methods: it stores the training data and predicts the output for a new point from the K training points closest to it, on the idea that similar data points tend to have similar outputs.

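**How does it work for classification?**

Suppose we want to predict whether a fruit is an apple or an orange based on its weight and color, with K = 3:

KNN finds the 3 fruits in the training data that are nearest to the new fruit (by a distance measure such as Euclidean distance).

The new fruit gets the majority class among those neighbors: if 2 are apples and 1 is an orange → predict apple.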
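The majority vote described above can be sketched in a few lines of NumPy. This is only an illustration: the fruit measurements, the 0-to-1 "color score", and the helper name `knn_classify` are assumptions made up for the example, not part of the assignment data.

```python
import numpy as np
from collections import Counter

# Hypothetical training data: [weight in grams, color score from 0 (red) to 1 (orange)]
X_train = np.array([
    [150, 0.20], [160, 0.30], [170, 0.25],   # apples
    [140, 0.80], [130, 0.90], [120, 0.85],   # oranges
])
y_train = np.array(["apple", "apple", "apple", "orange", "orange", "orange"])

def knn_classify(x_new, X, y, k=3):
    """Return the majority label among the k training points nearest to x_new."""
    distances = np.linalg.norm(X - x_new, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]             # indices of the k smallest distances
    neighbor_labels = y[nearest].tolist()
    majority = Counter(neighbor_labels).most_common(1)[0][0]
    return majority, neighbor_labels

label, neighbor_labels = knn_classify(np.array([152, 0.5]), X_train, y_train, k=3)
print(neighbor_labels, "->", label)
```

Note that the weight column is numerically much larger than the color score, so it dominates the distance here. That is the scaling issue examined in Question 6.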

**How does it work for regression?**

Suppose we want to predict a person’s income based on age and education, with K = 3:

KNN finds the 3 people in the training set who are nearest to the new person.

It takes the average income of those 3 → predicted income.
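The same idea for regression can be sketched with scikit-learn's `KNeighborsRegressor`. The ages, education years, and incomes below are invented for illustration; with the default `weights="uniform"`, the prediction is the plain mean of the neighbors' targets.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data: [age in years, years of education] -> annual income
X_train = np.array([[25, 12], [30, 16], [35, 16], [40, 18], [50, 12], [45, 20]])
y_train = np.array([30_000, 50_000, 60_000, 80_000, 45_000, 95_000])

knn = KNeighborsRegressor(n_neighbors=3)   # default weights="uniform": plain average
knn.fit(X_train, y_train)

x_new = np.array([[32, 16]])
distances, indices = knn.kneighbors(x_new)   # the 3 nearest training rows
neighbor_incomes = y_train[indices[0]]
prediction = knn.predict(x_new)[0]

print("Neighbor incomes:", neighbor_incomes.tolist())
print("Predicted income:", round(float(prediction), 2))
```

Passing `weights="distance"` instead would give closer neighbors more influence on the average.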

**Question 2: What is the Curse of Dimensionality and how does it affect KNN
performance?**


**Answer:-** The Curse of Dimensionality refers to the problems that arise when the number of features (dimensions) in your dataset becomes very large.
As dimensions increase, the data becomes sparse, and the concept of “closeness” or “distance” between points becomes less meaningful.

**Effect on KNN Performance**

KNN relies on distance to find the nearest neighbors.
When the number of features increases:

**All distances become similar** — the difference between the nearest and farthest neighbors becomes very small.

**Distance loses meaning** — the algorithm struggles to identify which points are truly “close”.

**Increased noise** — irrelevant or redundant features distort distance calculations.

**Model accuracy drops** — predictions become unreliable because neighbors may not actually be similar.
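A small simulation can make the "all distances become similar" effect concrete. The sketch below assumes uniformly random points in the unit hypercube (an illustrative choice, not real data) and measures the relative contrast, (farthest − nearest) / nearest, from one query point as the number of dimensions grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one query to uniform random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

contrasts = {n_dims: relative_contrast(n_dims) for n_dims in (2, 10, 100, 1000)}
for n_dims, contrast in contrasts.items():
    print(f"{n_dims:>4} dims: relative contrast = {contrast:.2f}")
```

The contrast shrinks as dimensions are added: in high dimensions the nearest neighbor is barely closer than the farthest point, which is why KNN's notion of a "neighbor" weakens.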

**Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?**

**Answer:-** **Principal Component Analysis (PCA)** is a dimensionality reduction technique used to transform a large set of features into a smaller one while preserving as much variance (information) as possible.

It does this by creating new features (called principal components) that are:

Linear combinations of the original features.

Uncorrelated with each other.

Ordered by importance: the first component captures the most variance, the second captures the next most, and so on.

**How is it different from feature selection?**

Feature selection keeps a subset of the original features and discards the rest, so every retained column still has its original meaning. PCA, by contrast, creates new features from the old ones.
It doesn’t remove features; it transforms them. These new features (the principal components) are linear combinations of the original variables and are constructed to capture the maximum amount of variance in the data.

Think of PCA as taking many correlated features and compressing them into a smaller number of uncorrelated ones. For example, instead of 10 related financial indicators, PCA might give you 2–3 new components that represent most of the variation those 10 had together.

The trade-off is interpretability: a selected feature is still a measured quantity, whereas a principal component is a mathematical mix of many features, so it is hard to say what it “means” in real-world terms.
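To see the contrast in code, the sketch below reduces the Wine data to two columns in both ways. Feature selection, which keeps a subset of the original columns unchanged, is represented here by `SelectKBest` with the ANOVA F-score; that is one of many possible selection methods and is an assumption of this example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)
y = data.target

# Feature selection: keep 2 of the 13 original columns, unchanged
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_scaled, y)
kept = np.array(data.feature_names)[selector.get_support()]
print("Selected original features:", list(kept))

# PCA: build 2 new columns, each a weighted mix of all 13 original features
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
n_contributing = int(np.count_nonzero(pca.components_[0]))
print("Original features contributing to PC1:", n_contributing)
```

Both outputs have two columns, but the selected columns can be named, while each PCA column draws on every original feature.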

**Question 4: What are eigenvalues and eigenvectors in PCA, and why are they
important?**

**Answer:-** In Principal Component Analysis (PCA), eigenvalues and eigenvectors come from the covariance matrix of your data.
They are the mathematical foundation that allows PCA to identify the directions (axes) along which the data varies the most.

**Eigenvectors** represent the directions (axes) of maximum variance in the data;
these are your principal components.
The entries of each eigenvector show how much each original feature contributes to that principal component.

**Eigenvalues** represent the amount of variance captured along each eigenvector;
they tell you how important each principal component is.
A higher eigenvalue means that component captures more information (variance) from the data, and an eigenvalue divided by the sum of all eigenvalues gives that component’s explained variance ratio.

**Why are they important in PCA?**

***Identify Principal Components:*** Eigenvectors define the directions (axes) for the new reduced feature space.

***Measure Information Content:*** Eigenvalues tell you how much information (variance) each component carries.

***Dimensionality Reduction:*** By keeping only the components with the largest eigenvalues, PCA reduces data size while retaining most information.

***Noise Filtering:*** Components with very small eigenvalues often represent mostly noise and can be discarded.
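The link between the covariance matrix and PCA can be checked directly. The following sketch computes eigenvalues and eigenvectors with NumPy on the standardized Wine data and compares them with scikit-learn's `PCA`; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the standardized features (13 x 13)
cov = np.cov(X_scaled, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]        # largest variance first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]        # column i is the direction of PC i+1

explained_ratio = eigenvalues / eigenvalues.sum()
print("Top 3 explained variance ratios:", np.round(explained_ratio[:3], 4))

# Cross-check against scikit-learn's PCA
pca = PCA().fit(X_scaled)
ratios_match = np.allclose(explained_ratio, pca.explained_variance_ratio_)
print("Ratios match sklearn:", ratios_match)
```

Eigenvectors are only defined up to sign, so a direction from `eigh` may be the negative of the corresponding row in `pca.components_`.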

**Question 5: How do KNN and PCA complement each other when applied in a single
pipeline?**


**Answer:-** **KNN (K-Nearest Neighbors)** is a distance-based algorithm, and **PCA (Principal Component Analysis)** is a dimensionality reduction technique.
When used together in a single pipeline, PCA can help KNN perform better, especially when the dataset has many features or correlated variables.

**How do they work together?**

**PCA step (preprocessing):**

PCA reduces the number of features by creating new, uncorrelated components that capture most of the data’s variance.

Dropping the low-variance components removes much of the noise and redundant information.

This makes the data simpler and cleaner for the next step.

**KNN step (modeling):**

KNN then works on this transformed, lower-dimensional data.

Since KNN depends on distance (Euclidean or similar), having fewer, uncorrelated features makes distances more meaningful and less distorted. Prediction is also faster, because each distance is computed over fewer dimensions.
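One way to express the two steps as a single estimator is scikit-learn's `make_pipeline`. The sketch below is a minimal example on the Wine data; the 95% variance threshold and K = 5 are illustrative choices, and a `StandardScaler` is placed first because PCA is sensitive to feature scale.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale -> PCA -> KNN, refit from scratch inside every cross-validation fold
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),   # keep enough components for 95% variance
    KNeighborsClassifier(n_neighbors=5),
)

scores = cross_val_score(pipeline, X, y, cv=5)
print("Fold accuracies:", scores.round(3))
print("Mean CV accuracy:", round(scores.mean(), 3))
```

Wrapping the steps in one pipeline means the scaler and PCA are fitted only on the training part of each fold, so the held-out data never influences the preprocessing.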

**Question 6: Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.**
Use the Wine Dataset from sklearn.datasets.load_wine().


```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load Wine dataset
data = load_wine()
X, y = data.data, data.target

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Initialize KNN
knn_raw = KNeighborsClassifier(n_neighbors=5)

# Train
knn_raw.fit(X_train, y_train)

# Predict
y_pred_raw = knn_raw.predict(X_test)

# Accuracy
acc_raw = accuracy_score(y_test, y_pred_raw)
print("Accuracy without scaling:", round(acc_raw, 3))

# Scale features (fit on the training set only, then apply to the test set)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize KNN again
knn_scaled = KNeighborsClassifier(n_neighbors=5)

# Train on scaled data
knn_scaled.fit(X_train_scaled, y_train)

# Predict
y_pred_scaled = knn_scaled.predict(X_test_scaled)

# Accuracy
acc_scaled = accuracy_score(y_test, y_pred_scaled)
print("Accuracy with scaling:", round(acc_scaled, 3))

print("\nComparison:")
print(f"Without Scaling: {acc_raw:.3f}")
print(f"With Scaling   : {acc_scaled:.3f}")
```



```
Accuracy without scaling: 0.722
Accuracy with scaling: 0.944

Comparison:
Without Scaling: 0.722
With Scaling   : 0.944
```

Scaling raises test accuracy from 0.722 to 0.944. Without scaling, features with large numeric ranges (such as proline, whose values run into the hundreds and thousands) dominate the Euclidean distance, so the neighbors are chosen almost entirely by those features.


**Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.**

```python
# Step 1: Import Libraries
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Step 2: Load Dataset
data = load_wine()
X = data.data
y = data.target

# Step 3: Standardize the Data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 4: Apply PCA
pca = PCA()   # keep all components
X_pca = pca.fit_transform(X_scaled)

# Step 5: Print Explained Variance Ratio
print("Explained Variance Ratio of each Principal Component:")
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {ratio:.4f}")

# Optional: print total variance retained
print("\nTotal variance retained:", round(sum(pca.explained_variance_ratio_), 4))
```


```
Explained Variance Ratio of each Principal Component:
PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080

Total variance retained: 1.0
```
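
The per-component ratios are easier to act on as a running total. The sketch below (variable names are illustrative) accumulates the ratios and finds the smallest number of components that reaches a 95% threshold; the threshold itself is an assumed example value.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)
ratios = PCA().fit(X_scaled).explained_variance_ratio_

# Running total of explained variance as components are added
cumulative = np.cumsum(ratios)
for i, total in enumerate(cumulative, start=1):
    print(f"First {i:>2} components: {total:.4f}")

# argmax on a boolean array returns the first index where the condition holds
n_for_95 = int(np.argmax(cumulative >= 0.95) + 1)
print("Components needed for 95% variance:", n_for_95)
```

With the ratios printed above, the first 9 components sum to about 0.942 and the first 10 to about 0.962, so 10 components are needed to pass 95%.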


**Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.**


```python
# Step 1: Import Libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Step 2: Load the Dataset
data = load_wine()
X, y = data.data, data.target

# Step 3: Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Step 4: Standardize Data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 5: Train KNN on Original Scaled Data
knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)
acc_original = accuracy_score(y_test, y_pred_original)
print("Accuracy on original scaled data:", round(acc_original, 3))

# Step 6: Apply PCA (keep top 2 components)
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Step 7: Train KNN on PCA-transformed data
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
acc_pca = accuracy_score(y_test, y_pred_pca)
print("Accuracy on PCA (2 components) data:", round(acc_pca, 3))

# Step 8: Compare Results
print("\nComparison:")
print(f"Original Data Accuracy : {acc_original:.3f}")
print(f"PCA (2 Components) Accuracy : {acc_pca:.3f}")
```


```
Accuracy on original scaled data: 0.944
Accuracy on PCA (2 components) data: 0.944

Comparison:
Original Data Accuracy : 0.944
PCA (2 Components) Accuracy : 0.944
```

On this split, KNN on just 2 principal components matches the accuracy on all 13 scaled features (0.944 in both cases), even though the first two components explain only roughly 55% of the variance (0.3620 + 0.1921 in Question 7). The class structure of the Wine data is largely captured by those two directions.


**Question 9: Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.**

```python
# Step 1: Import Libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Step 2: Load the Wine Dataset
data = load_wine()
X, y = data.data, data.target

# Step 3: Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Step 4: Scale Features (important for distance-based algorithms)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 5: Train KNN with Euclidean Distance (default)
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)

# Step 6: Train KNN with Manhattan Distance
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)

# Step 7: Compare Results
print("Accuracy using Euclidean distance :", round(acc_euclidean, 3))
print("Accuracy using Manhattan distance :", round(acc_manhattan, 3))
```


```
Accuracy using Euclidean distance : 0.944
Accuracy using Manhattan distance : 0.981
```

On this split, Manhattan distance scores slightly higher than Euclidean (0.981 vs 0.944). The test set has 54 samples, so the gap amounts to two samples (53 vs 51 correct) and should not be read as a general advantage of one metric over the other.


**Question 10: You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data**

High-dimensional gene expression datasets often have thousands of features (genes) but few samples (patients). This creates a high risk of overfitting for traditional models because they try to fit noise instead of true patterns. Using PCA + KNN is a common and effective pipeline to handle this.

**Step 1: Use PCA to Reduce Dimensionality**

**Why:** Thousands of genes are often highly correlated, and many carry noise or redundant information. PCA reduces dimensionality while retaining the directions of largest variance.

**How:**

Standardize the gene expression data (mean = 0, std = 1), fitting the scaler on the training data only.

Apply PCA to transform the data into principal components: new axes that summarize the variance.

This step compresses thousands of features into far fewer components, which makes distance-based models like KNN more reliable and cheaper to run.

**Step 2: Decide How Many Components to Keep**

Use the explained variance ratio:

Compute how much variance each principal component explains.

Retain the top components that together explain ~80–95% of total variance. PCA cannot produce more components than there are samples, so with few patients the upper bound is small to begin with.

**Example:** If 100 components explain 90% of the variance, you can reduce 10,000 genes to just 100 components.

**Optional:** Scree plot: plot the variance explained by each component (or the cumulative total) and choose the “elbow” point where adding more components adds little extra information.
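
The variance-threshold rule can be sketched as follows. No gene expression data accompanies this answer, so the example uses `make_classification` to generate a synthetic stand-in with many more features than samples; the sizes (60 samples, 2,000 features) and the 90% threshold are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: 60 "patients", 2000 "genes" (hypothetical sizes)
X, y = make_classification(n_samples=60, n_features=2000, n_informative=20,
                           n_classes=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the threshold
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)
print("Components available:", pca.n_components_)
print("Components for 90% variance:", n_keep)
```

Because there are only 60 samples, PCA returns at most 60 components regardless of the 2,000 input features. On real data the threshold is best treated as a tunable choice and checked against cross-validated accuracy.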

**Step 3: Use KNN for Classification Post-Dimensionality Reduction**

**Why KNN:**

KNN is a simple, non-parametric, distance-based algorithm. It has nothing to tune beyond K and the distance metric, which suits small sample sizes once the dimensionality has been reduced.

**How:**

Train KNN on the PCA-transformed components.

Choose K using cross-validation rather than fixing it in advance.

Distance metric: usually Euclidean, but Manhattan can be tested for robustness.
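
Choosing K and the distance metric by cross-validation might look like the sketch below. It again uses synthetic data from `make_classification` as a stand-in for real expression data, and the step names (`scale`, `pca`, `knn`), the candidate values, and the fixed 10 components are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 60 samples, 2000 features (hypothetical sizes)
X, y = make_classification(n_samples=60, n_features=2000, n_informative=20,
                           n_classes=3, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("knn", KNeighborsClassifier()),
])

# Candidate values for K and the distance metric
param_grid = {
    "knn__n_neighbors": [1, 3, 5, 7],
    "knn__metric": ["euclidean", "manhattan"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```

The accuracy on this random data carries no meaning; the point is the mechanism. Adding `"pca__n_components"` to the grid would let the same search pick the number of components as well.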

**Step 4: Evaluate the Model**

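Metrics: Accuracy, precision, recall, and F1-score, weighted according to clinical relevance (for example, recall matters most when missing a cancer type is costly).

Cross-validation: Use stratified k-fold or leave-one-out cross-validation, and refit scaling and PCA inside each fold so that no information from the held-out patients leaks into the preprocessing.

Optional: ROC-AUC, computed one-vs-rest for multi-class problems.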
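
A minimal evaluation sketch, under the same assumptions as before (synthetic data standing in for real expression profiles, 10 components, K = 5), collects out-of-fold predictions with `cross_val_predict` and reports per-class metrics.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 60 samples, 2000 features (hypothetical sizes)
X, y = make_classification(n_samples=60, n_features=2000, n_informative=20,
                           n_classes=3, random_state=0)

model = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    KNeighborsClassifier(n_neighbors=5),
)

# Every sample is predicted by a model that never saw it during fitting
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(model, X, y, cv=cv)

cm = confusion_matrix(y, y_pred)
print(cm)
print(classification_report(y, y_pred, zero_division=0))
```

The confusion matrix shows which classes are mistaken for one another, which is often more informative for clinicians than a single accuracy figure.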

**Step 5: Justify the Pipeline to Stakeholders**

Problem: High-dimensional gene data → overfitting in traditional models.

Solution:

PCA compresses the features and filters out low-variance noise → reduces overfitting risk.

KNN classifies each patient by similarity to known cases (distance in the compressed space).

Cross-validation estimates how well the model generalizes to new patients.

**Advantages:**

Simple, easy to explain, and computationally feasible.

Preserves the directions of largest variance in the expression data, which often reflect major biological differences.

Reduces the risk of overfitting caused by the small sample size.

Can be visualized (e.g., first 2–3 PCA components) for presentation to clinicians.

**Summary Statement**

By combining PCA with KNN, we create a robust, explainable, and scalable pipeline for high-dimensional biomedical data. PCA reduces dimensionality and noise, KNN classifies patients by their similarity in the reduced space, and rigorous cross-validation gives an honest estimate of predictive performance, which makes the pipeline a reasonable candidate for real-world clinical applications.