                         KNN & PCA | Assignment

 1:What is K-Nearest Neighbors (KNN) and how does it work in both
     classification and regression problems?
    
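  - K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for both classification and regression. To predict for a new point, it measures the distance (e.g., Euclidean) from that point to every training point, picks the K closest ones, and combines their labels or values.
* It is a non-parametric method (it makes no assumption about the probability distribution of the data).

* It is also a lazy learner: it builds no explicit model during training, but stores the training data and does all the work at prediction time.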

* KNN in Classification:-

Example: Predict whether an email is spam or not spam.

Suppose K=5.

The algorithm checks the 5 nearest training emails to the new one.

If 3 out of 5 neighbors are labeled "spam", the majority vote makes the prediction "spam".

* KNN in Regression:-

Example: Predict the price of a house based on its size and location.

Suppose K=4.

The algorithm finds the 4 nearest houses.

The predicted price = average price of these 4 houses.
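
The two prediction rules above (majority vote and averaging) can be sketched in a few lines of NumPy. This is only a minimal illustration, not how scikit-learn implements KNN; the helper name `knn_predict` and the toy data are made up for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k, task="classification"):
    """Predict for one point from its k nearest training points (hypothetical helper)."""
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the neighbors
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: average of the neighbors' target values
    return float(np.mean(neighbor_targets))

# Toy 1-D data (invented for illustration)
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
labels = np.array(["ham", "ham", "ham", "spam", "spam", "spam"])
prices = np.array([100.0, 110.0, 120.0, 300.0, 310.0, 320.0])

x_new = np.array([9.0])
predicted_label = knn_predict(X_toy, labels, x_new, k=3)
predicted_price = knn_predict(X_toy, prices, x_new, k=3, task="regression")
print(predicted_label)
print(predicted_price)
```

The same neighbor search serves both tasks; only the final step (vote or mean) differs.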

 2:What is the Curse of Dimensionality and how does it affect KNN
   performance?

 -  The Curse of Dimensionality refers to the problems that arise when data has too many features (dimensions) compared to the number of samples.

As dimensions increase:

Data becomes sparse (spread out).

Distance metrics (like Euclidean) become less meaningful.

Models that rely on distance or density, like KNN, suffer.


* Effect on KNN

KNN relies heavily on distances to find neighbors.
In high dimensions:

The nearest and farthest points end up almost equally distant, so KNN struggles to identify meaningful neighbors.

Predictions become unreliable.

Exponentially more data is needed to keep the same sample density and accuracy.

The risk of overfitting is high when irrelevant features are included, because they add noise to every distance.
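
The "equally distant" effect can be observed with a small experiment. The sketch below assumes uniformly random points (not real data) and uses a hypothetical helper `relative_contrast` that compares the farthest and nearest distance from a random query point.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from a random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

# Compare low- and high-dimensional spaces
contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>4} dimensions: relative contrast = {c:.2f}")
```

A contrast close to zero means the nearest neighbor is barely closer than the farthest point, which is the situation in which KNN's notion of "nearest" loses its value.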

 3:What is Principal Component Analysis (PCA)? How is it different from
   feature selection?

 -  PCA is a dimensionality reduction technique that transforms high-dimensional data into a new set of uncorrelated variables called principal components (PCs).

 * Each principal component is a linear combination of the original features.

 *  The first PC captures the maximum variance in the data.

 * The second PC captures the maximum remaining variance (orthogonal to the first), and so on.

| **Aspect**     | **PCA (Feature Extraction)**                                                | **Feature Selection**                           |
| -------------- | --------------------------------------------------------------------------- | ----------------------------------------------- |
| **Definition** | Creates new features (principal components) by combining original features. | Chooses a subset of the original features.      |
| **Type**       | **Feature extraction**                                                      | **Feature selection**                           |
| **Output**     | Transformed features (not easily interpretable).                            | Original features (still interpretable).        |
| **Goal**       | Preserve maximum variance, reduce redundancy.                               | Keep only the most relevant predictors.         |
| **Example**    | From 100 features → create 10 PCs.                                          | From 100 features → pick top 10 important ones. |
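
The difference in the table can be made concrete on the Wine dataset. The sketch below uses `SelectKBest` with an ANOVA F-test as one possible feature selection method; that choice, and reducing to 2 features, are assumptions made only for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)
y = data.target

# Feature extraction: each new column is a linear mix of all 13 original features
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Feature selection: keep 2 of the original columns unchanged
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_scaled, y)
selected_names = [data.feature_names[i] for i in selector.get_support(indices=True)]

print("PCA output shape:", X_pca.shape)
print("PCA loadings shape:", pca.components_.shape)
print("Selected original features:", selected_names)
```

Both outputs have 2 columns, but the selected columns still carry their original names, while each PCA column is defined by 13 weights (one row of `pca.components_`).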


 4:What are eigenvalues and eigenvectors in PCA, and why are they
   important?

  * Eigenvector = a direction in feature space (an eigenvector of the data's covariance matrix).

  * Eigenvalue = the variance of the data along that direction.

  Why Are They Important in PCA?

 Eigenvectors:

Define the new coordinate system (principal components).

The eigenvector with the largest eigenvalue is the direction along which the data varies most; each later one is the direction of greatest remaining variance, orthogonal to the earlier ones.

 Eigenvalues:

Tell us the importance (variance explained) of each principal component.

Larger eigenvalue = more variance (information) captured by that component.

Used to decide how many components to keep (e.g., “keep enough PCs to explain 95% of variance”).
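
As a rough check of these statements, the eigenvalues of the covariance matrix can be computed directly and compared with what scikit-learn's `PCA` reports. This is a sketch on the standardized Wine data; variable names such as `n_for_95` are chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the standardized features (13 x 13)
cov = np.cov(X_scaled, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # columns are the principal directions

# Share of total variance carried by each component
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)
# First position where the cumulative share reaches 95%, as a component count
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)

pca = PCA().fit(X_scaled)
print("Matches scikit-learn:", np.allclose(eigenvalues, pca.explained_variance_))
print("Components needed for 95% variance:", n_for_95)
```

Dividing each eigenvalue by the sum of all eigenvalues gives the explained variance ratio, which is the quantity used for the "keep 95% of variance" rule.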

 5:How do KNN and PCA complement each other when applied in a single
   pipeline?

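 - KNN and PCA fit together because KNN depends on distances and PCA cleans up the space in which those distances are measured. Taking the Wine dataset as an example: it has 13 chemical features → not very high-dimensional, but the features are still correlated.

KNN suffers when features are correlated or noisy, since redundant and noisy features distort the distances.

PCA (applied after scaling) reduces the data to fewer uncorrelated principal components while keeping most of the variance.

Combining them → faster computation + potentially better generalization.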

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
import numpy as np

# 1. Load dataset
data = load_wine()
X, y = data.data, data.target

# 2. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Build pipeline: Scaling → PCA → KNN
pipeline = Pipeline([
    ('scaler', StandardScaler()),     # scale features
    ('pca', PCA()),                   # PCA step
    ('knn', KNeighborsClassifier())   # KNN classifier
])

# 4. Define hyperparameter grid
param_grid = {
    'pca__n_components': [2, 5, 7, 10, None],   # try different dimensions
    'knn__n_neighbors': [3, 5, 7, 9],           # different K values
    'knn__weights': ['uniform', 'distance']     # voting strategy
}

# 5. Grid search with cross-validation
grid = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

# 6. Results
print("Best Parameters:", grid.best_params_)
print("Train Accuracy:", grid.score(X_train, y_train))
print("Test Accuracy:", grid.score(X_test, y_test))


Best Parameters: {'knn__n_neighbors': 9, 'knn__weights': 'uniform', 'pca__n_components': 7}
Train Accuracy: 0.9788732394366197
Test Accuracy: 0.9722222222222222


 6:Train a KNN Classifier on the Wine dataset with and without feature
   scaling. Compare model accuracy in both cases.

In [2]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Load dataset
data = load_wine()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 1. KNN without scaling
knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)
acc_no_scaling = knn_no_scaling.score(X_test, y_test)

# 2. KNN with scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaling = KNeighborsClassifier(n_neighbors=5)
knn_scaling.fit(X_train_scaled, y_train)
acc_scaling = knn_scaling.score(X_test_scaled, y_test)

print("Accuracy without scaling:", acc_no_scaling)
print("Accuracy with scaling:", acc_scaling)


Accuracy without scaling: 0.8055555555555556
Accuracy with scaling: 0.9722222222222222

Scaling raises test accuracy from about 0.81 to 0.97. Without it, features with large numeric ranges (such as proline) dominate the Euclidean distance, so the other features barely influence which neighbors are chosen.


7:Train a PCA model on the Wine dataset and print the explained variance
   ratio of each principal component.

In [3]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load dataset
data = load_wine()
X, y = data.data, data.target

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA()
pca.fit(X_scaled)

# Explained variance ratio
print("Explained variance ratio of each principal component:")
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {ratio:.4f}")


Explained variance ratio of each principal component:
PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080

The first two components explain about 55% of the variance, and the first ten about 96%.


8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.

In [4]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Load dataset
data = load_wine()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 1. KNN on original scaled dataset
knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
acc_original = knn_original.score(X_test_scaled, y_test)

# 2. PCA transform (top 2 components)
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
acc_pca = knn_pca.score(X_test_pca, y_test)

print("Accuracy on original dataset (scaled):", acc_original)
print("Accuracy on PCA (2 components):", acc_pca)


Accuracy on original dataset (scaled): 0.9722222222222222
Accuracy on PCA (2 components): 0.9166666666666666

Keeping only 2 components lowers test accuracy from about 0.97 to 0.92. Two components capture only part of the total variance, so some class-separating information is lost, although the model now works in 2 dimensions instead of 13.


9: Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.

In [8]:
# Step 1: Import libraries
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Step 2: Load dataset
wine = load_wine()
X, y = wine.data, wine.target

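# Step 3: Scale features
# Note: the scaler is fit on the full dataset before the split, so test-set
# statistics influence the scaling; fitting it on the training split only
# (as in the earlier cells) avoids this leakage.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)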

# Step 4: Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42, stratify=y
)

# Step 5: Train KNN with Euclidean distance
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)

# Step 6: Train KNN with Manhattan distance
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)

# Step 7: Compare results
print("KNN with Euclidean Distance Accuracy:", acc_euclidean)
print("KNN with Manhattan Distance Accuracy:", acc_manhattan)


KNN with Euclidean Distance Accuracy: 0.9444444444444444
KNN with Manhattan Distance Accuracy: 0.9814814814814815

On this split, Manhattan distance (about 0.98) scores higher than Euclidean (about 0.94). With only 54 test samples the gap amounts to two predictions, so it should not be read as a general advantage of one metric.


 10: You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data




1. Use PCA to Reduce Dimensionality

Gene expression datasets often have thousands of features (genes) but only hundreds of patients (samples) → very high feature-to-sample ratio.

This leads to the curse of dimensionality and overfitting if we feed raw features to a model.

Principal Component Analysis (PCA), applied after standardizing the features, transforms correlated features into a smaller number of uncorrelated components that capture the maximum variance in the data.

This filters out low-variance noise, merges redundant features, and tends to make models generalize better.

2. Decide How Many Components to Keep

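Use the explained variance ratio from PCA.

Plot a scree plot (variance explained by each component) or a cumulative explained-variance curve against the number of components, and look for the point where the curve flattens.

Keep enough components to capture 90–95% of the variance (a common rule of thumb, not a fixed standard). With fewer samples than features, PCA can produce at most (number of samples minus 1) components with non-zero variance.

Alternatively, treat the number of components as a hyperparameter and pick the value with the best cross-validated classification performance.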
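
The variance-threshold rule can be sketched as follows. The data here is a synthetic stand-in generated with `make_classification` (100 "patients", 2000 "genes"), not real gene expression data, and the 95% threshold is just one choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: 100 "patients", 2000 "genes"
X, y = make_classification(
    n_samples=100, n_features=2000, n_informative=20,
    n_classes=3, n_clusters_per_class=1, random_state=0
)
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components that reach that variance share
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca_95.fit_transform(X_scaled)

cumulative = np.cumsum(pca_95.explained_variance_ratio_)
print("Components kept:", pca_95.n_components_)
print("Cumulative variance explained:", round(float(cumulative[-1]), 3))
```

Because there are only 100 samples, the number of components kept cannot exceed 99, however many features the data has.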

3. Use KNN for Classification Post-Dimensionality Reduction

After dimensionality reduction, train a K-Nearest Neighbors (KNN) classifier.

Why KNN works better after PCA:

KNN relies on distance metrics → high dimensions distort distances.

PCA reduces dimensions → distances become more meaningful.

Choose k (number of neighbors) using GridSearchCV with cross-validation.

Try different distance metrics (Euclidean, Manhattan).
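
A possible way to tune these choices together is a single pipeline searched with `GridSearchCV`. The sketch below again uses synthetic data as a stand-in; the grid values are arbitrary examples, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 100 samples, 2000 features, 3 classes
X, y = make_classification(
    n_samples=100, n_features=2000, n_informative=20,
    n_classes=3, n_clusters_per_class=1, random_state=0
)

# Scaling and PCA are refit inside every fold, so held-out data is never seen
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

param_grid = {
    "pca__n_components": [5, 10, 20],
    "knn__n_neighbors": [3, 5, 7],
    "knn__metric": ["euclidean", "manhattan"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```

Searching over `pca__n_components` in the same grid covers the cross-validation alternative for choosing the number of components.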

4. Evaluate the Model

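Split data into train/test (or use cross-validation if data is small). Fit the scaler and PCA on the training portion only (for example inside a scikit-learn Pipeline), so no information from held-out patients leaks into the model.

Use metrics relevant for biomedical data:

Accuracy (overall correctness).

Precision/Recall/F1-score (important if one cancer type is rare).

ROC-AUC (threshold-independent; with several cancer types, compute it one-vs-rest and average).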

Perform stratified cross-validation to ensure all cancer types are represented in each fold.

Possibly use nested cross-validation (tuning hyperparameters inside CV) to avoid optimistic bias.
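
Nested cross-validation with per-class metrics might look like the sketch below. It uses synthetic stand-in data and a deliberately small grid to keep the run short; the fold counts are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 100 samples, 2000 features, 3 classes
X, y = make_classification(
    n_samples=100, n_features=2000, n_informative=20,
    n_classes=3, n_clusters_per_class=1, random_state=0
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# Inner loop tunes hyperparameters; outer loop estimates performance
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

tuned_model = GridSearchCV(
    pipeline,
    {"pca__n_components": [5, 10], "knn__n_neighbors": [3, 5]},
    cv=inner_cv,
)

# Each outer fold re-runs the inner search on its own training part
y_pred = cross_val_predict(tuned_model, X, y, cv=outer_cv)
report = classification_report(y, y_pred)
print(report)
```

Every prediction in `y_pred` comes from a model that never saw that sample during tuning or fitting, which is what removes the optimistic bias.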

5. Justify this Pipeline to Stakeholders

Simplicity: PCA compresses thousands of genes into a few components, so the model has far fewer inputs to inspect. Each component is a weighted mix of genes, so its loadings still need to be examined to explain it.

Robustness: Reduces noise and lowers the risk of overfitting, which is critical given small sample sizes.

Generalization: A model trained on PCA-transformed data is more likely to perform well on new patient cohorts, which should be confirmed on held-out data.

Scalability: The pipeline can be refit as a whole when new genes or patients are added.

Biomedical validity: PCA often groups genes into biologically meaningful patterns (e.g., co-expressed gene clusters), which can provide insights beyond classification.