THEORY QUESTION

Q-1 What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?


  ANS-- K-Nearest Neighbors (KNN) is a simple, intuitive, supervised machine learning algorithm used for both classification and regression tasks. It makes predictions based on the idea that similar data points exist close to each other in feature space.

   How KNN works: KNN is a lazy, instance-based algorithm, meaning it does not build an explicit model during training. Instead, it stores the training data and defers all computation until a query point needs a prediction.

            1. Choose a value for K (number of neighbors).

            2. Compute the distance (often Euclidean) between the query point and all points in the training dataset.

            3. Select the K closest points.

            4. Combine their targets: for classification, predict the majority class among the K neighbors; for regression, predict the average of their target values.
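
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements KNN; the function name `knn_predict`, its `task` parameter, and the toy data are assumptions made up for this example. The same neighbor search serves both tasks, and only the final step differs: a majority vote for classification, a mean for regression.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3, task="classification"):
    """Predict the target of one query point from its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    query = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]
    if task == "classification":
        # Majority vote among the neighbors
        return Counter(neighbor_targets).most_common(1)[0][0]
    # Regression: mean of the neighbors' target values
    return float(np.mean(neighbor_targets))

# Toy data (made up for illustration): two well-separated clusters in 2-D
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_class = ["a", "a", "a", "b", "b", "b"]
y_value = [10.0, 12.0, 11.0, 50.0, 52.0, 48.0]

label = knn_predict(X_toy, y_class, [1.1, 1.0], k=3, task="classification")
value = knn_predict(X_toy, y_value, [5.0, 5.0], k=3, task="regression")
print(label, value)
```

The query near the first cluster is labeled by the three points around it, and the regression query at the center of the second cluster gets the mean of that cluster's three target values.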

          

Q-2 What is the Curse of Dimensionality and how does it affect KNN
performance?

  ANS-- The Curse of Dimensionality refers to a set of problems that arise when data has many features (high dimensionality). As the number of dimensions grows, the data becomes sparse and distance measures become less meaningful, because the nearest and farthest points end up at nearly the same distance from any query. This directly harms algorithms like K-Nearest Neighbors (KNN), which rely entirely on distance calculations to decide which points count as neighbors.
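
The loss of distance contrast can be observed with a small simulation. The sketch below assumes uniformly random points in the unit hypercube, which is a simplification rather than a model of any real dataset, and the helper name `relative_contrast` is made up for this example. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

The contrast is large in 2 dimensions and much smaller in 1000, which means the "nearest" neighbor in high dimensions is barely nearer than any other point.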

  

Q-3 What is Principal Component Analysis (PCA)? How is it different from
feature selection?

   ANS-- Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a high-dimensional dataset into a smaller set of new variables (called principal components) while retaining as much variance (information) as possible.

It is one of the most widely used unsupervised learning methods for preprocessing, visualization, noise reduction, and compressing data.


   How PCA differs from feature selection:

| Aspect                   | PCA                                                  | Feature Selection                                      |
| ------------------------ | ---------------------------------------------------- | ------------------------------------------------------ |
| **Type**                 | Feature **extraction** / transformation              | Feature **selection**                                  |
| **Original features?**   | Creates **new features** (linear combinations)       | Keeps **original features**                            |
| **Interpretability**     | Lower (components are combinations of many features) | High (selected features are original variables)        |
| **Goal**                 | Capture maximum variance                             | Keep most important original features                  |
| **Supervision**          | Usually **unsupervised**                             | Can be supervised or unsupervised                      |
| **Correlation handling** | Removes correlation (makes components orthogonal)    | May keep correlated features unless explicitly removed |
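
The contrast in the table can be made concrete on the Wine dataset. The sketch below uses `SelectKBest` with the ANOVA F-score as one example of a supervised feature selector; that choice, and keeping 2 features or components, are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)
y = data.target

# Feature selection: keeps 2 of the original 13 columns unchanged
selector = SelectKBest(score_func=f_classif, k=2).fit(X_scaled, y)
kept = selector.get_support(indices=True)
X_selected = selector.transform(X_scaled)
print("Selected original features:", [data.feature_names[i] for i in kept])

# PCA: builds 2 new columns, each a weighted mix of all 13 originals
pca = PCA(n_components=2).fit(X_scaled)
X_pca = pca.transform(X_scaled)
print("PCA loadings shape (components x original features):", pca.components_.shape)
```

The selected columns are exact copies of original features and keep their names, while each PCA column has a loading for every original feature, which is why it is harder to interpret.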
  

Q-4 What are eigenvalues and eigenvectors in PCA, and why are they
important?

   ANS-- In Principal Component Analysis (PCA), eigenvalues and eigenvectors are fundamental mathematical concepts that determine the directions of the new feature space (principal components) and how much information (variance) each direction captures.

| Concept              | Meaning                              | Role in PCA                        |
| -------------------- | ------------------------------------ | ---------------------------------- |
| **Eigenvector**      | Direction of a principal axis        | Defines principal components       |
| **Eigenvalue**       | Amount of variance in that direction | Ranks the importance of components |
| **Large eigenvalue** | Component keeps lots of information  | Should be retained                 |
| **Small eigenvalue** | Component contributes little         | Can be removed                     |
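
To connect the table to actual numbers, the eigenvalues and eigenvectors can be computed directly from the covariance matrix and compared with scikit-learn's `PCA`. This is a sketch for checking the relationship on the standardized Wine data, not a description of how `PCA` is implemented internally.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the standardized features (13 x 13)
cov = np.cov(X_scaled, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]   # largest variance first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # each column is one eigenvector

# Share of total variance captured along each eigenvector
explained_ratio = eigenvalues / eigenvalues.sum()

pca = PCA().fit(X_scaled)
print(np.allclose(eigenvalues, pca.explained_variance_))
print(np.allclose(explained_ratio, pca.explained_variance_ratio_))
```

The sorted eigenvalues correspond to `explained_variance_`, and dividing each by their sum gives `explained_variance_ratio_`. The eigenvectors correspond to the rows of `components_`, possibly with a flipped sign.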


Q-5 How do KNN and PCA complement each other when applied in a single
pipeline?

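   ANS-- K-Nearest Neighbors (KNN) and Principal Component Analysis (PCA) are often used together in a machine-learning pipeline because each covers a weakness of the other, especially on high-dimensional data. PCA compresses the features into a few uncorrelated components before KNN computes distances, which mitigates the curse of dimensionality, filters out low-variance noise, and reduces the cost of each prediction. KNN, in turn, provides a simple non-parametric classifier or regressor on top of the compressed features, which PCA alone cannot do because it is only a transformation.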

Q-6  Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.

In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# -----------------------------
# KNN WITHOUT feature scaling
# -----------------------------
knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)

y_pred_no_scaling = knn_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# -----------------------------
# KNN WITH feature scaling
# -----------------------------
knn_with_scaling = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5))
])

knn_with_scaling.fit(X_train, y_train)
y_pred_scaling = knn_with_scaling.predict(X_test)
accuracy_scaling = accuracy_score(y_test, y_pred_scaling)

print("Accuracy without scaling:", accuracy_no_scaling)
print("Accuracy with scaling:", accuracy_scaling)


Accuracy without scaling: 0.7777777777777778
Accuracy with scaling: 0.9333333333333333
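
A likely reason for the gap between the two accuracies is that the Wine features are measured in very different units, so the unscaled Euclidean distance is dominated by whichever feature has the largest numeric spread. The sketch below is an added check, not part of the original exercise, and simply prints each feature's standard deviation.

```python
import numpy as np
from sklearn.datasets import load_wine

data = load_wine()
stds = data.data.std(axis=0)

# Print features from largest to smallest spread
for i in np.argsort(stds)[::-1]:
    print(f"{data.feature_names[i]:<30} std = {stds[i]:.3f}")

# How many times larger the widest feature is than the narrowest
scale_ratio = stds.max() / stds.min()
print(f"Largest / smallest std: {scale_ratio:.0f}")
```

`proline` has by far the largest spread, so without `StandardScaler` the neighbors are chosen almost entirely by that one feature.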


Q-7 Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.

In [None]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the Wine dataset
X, y = load_wine(return_X_y=True)

# Standardize features (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA (keep all components)
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Print explained variance ratio
for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"Principal Component {i}: {ratio:.4f}")


Principal Component 1: 0.3620
Principal Component 2: 0.1921
Principal Component 3: 0.1112
Principal Component 4: 0.0707
Principal Component 5: 0.0656
Principal Component 6: 0.0494
Principal Component 7: 0.0424
Principal Component 8: 0.0268
Principal Component 9: 0.0222
Principal Component 10: 0.0193
Principal Component 11: 0.0174
Principal Component 12: 0.0130
Principal Component 13: 0.0080
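
The per-component ratios are easier to act on as a running total. The sketch below accumulates them and finds the smallest number of components that reaches a variance threshold; the 95% value is an assumed target chosen for illustration, not a rule.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)
pca = PCA().fit(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
threshold = 0.95  # assumed target; pick whatever the application tolerates

# First index where the running total reaches the threshold (+1 for a count)
n_components = int(np.searchsorted(cumulative, threshold) + 1)

for i, total in enumerate(cumulative, start=1):
    print(f"First {i:>2} components: {total:.4f}")
print(f"Components needed for {threshold:.0%} variance:", n_components)
```

Summing the ratios printed above gives about 0.9424 after nine components and about 0.9617 after ten, so the running total passes 0.95 at the tenth component.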


Q-8  Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.

In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# -------------------------------------------------
# KNN on ORIGINAL data (with scaling)
# -------------------------------------------------
knn_original = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5))
])

knn_original.fit(X_train, y_train)
y_pred_original = knn_original.predict(X_test)
accuracy_original = accuracy_score(y_test, y_pred_original)

# -------------------------------------------------
# KNN on PCA-reduced data (top 2 components)
# -------------------------------------------------
knn_pca = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("knn", KNeighborsClassifier(n_neighbors=5))
])

knn_pca.fit(X_train, y_train)
y_pred_pca = knn_pca.predict(X_test)
accuracy_pca = accuracy_score(y_test, y_pred_pca)

print("Accuracy on original dataset:", accuracy_original)
print("Accuracy on PCA (2 components):", accuracy_pca)


Accuracy on original dataset: 0.9333333333333333
Accuracy on PCA (2 components): 0.9333333333333333


Q-9 Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.

In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# -----------------------------
# KNN with Euclidean distance
# -----------------------------
knn_euclidean = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(
        n_neighbors=5,
        metric="euclidean"
    ))
])

knn_euclidean.fit(X_train, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

# -----------------------------
# KNN with Manhattan distance
# -----------------------------
knn_manhattan = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(
        n_neighbors=5,
        metric="manhattan"
    ))
])

knn_manhattan.fit(X_train, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

print("Accuracy (Euclidean):", accuracy_euclidean)
print("Accuracy (Manhattan):", accuracy_manhattan)


Accuracy (Euclidean): 0.9333333333333333
Accuracy (Manhattan): 0.9777777777777777


Q-10 You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.

Due to the large number of features and a small number of samples, traditional models
overfit.

Explain how you would:

● Use PCA to reduce dimensionality

● Decide how many components to keep

● Use KNN for classification post-dimensionality reduction

● Evaluate the model

● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data

In [7]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Simulate high-dimensional gene expression data
X, y = make_classification(
    n_samples=120,
    n_features=5000,
    n_informative=50,
    n_classes=3,
    random_state=42
)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 1. Linear SVM on raw data (overfitting)
svm_raw = SVC(kernel="linear")
svm_raw.fit(X_train, y_train)

train_acc_raw = accuracy_score(y_train, svm_raw.predict(X_train))
test_acc_raw = accuracy_score(y_test, svm_raw.predict(X_test))

# 2. PCA + Linear SVM
pca = PCA(n_components=50)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

svm_pca = SVC(kernel="linear")
svm_pca.fit(X_train_pca, y_train)

train_acc_pca = accuracy_score(y_train, svm_pca.predict(X_train_pca))
test_acc_pca = accuracy_score(y_test, svm_pca.predict(X_test_pca))

print("Raw SVM - Train Accuracy:", train_acc_raw)
print("Raw SVM - Test Accuracy:", test_acc_raw)
print("PCA + SVM - Train Accuracy:", train_acc_pca)
print("PCA + SVM - Test Accuracy:", test_acc_pca)


Raw SVM - Train Accuracy: 1.0
Raw SVM - Test Accuracy: 0.5
PCA + SVM - Train Accuracy: 1.0
PCA + SVM - Test Accuracy: 0.4722222222222222
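
The cell above uses a linear SVM, while the question asks for KNN after PCA. A KNN variant might look like the sketch below. It reuses the same simulated data as a stand-in for real gene expression measurements. Three things are assumptions added for illustration: the stratified split, the scaling step, and the parameter grid. Cross-validation picks the number of components and neighbors, and because scaling and PCA sit inside the pipeline they are refit on each training fold, so no information leaks from the validation samples.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same simulated data as the cell above (a stand-in for real gene expression)
X, y = make_classification(
    n_samples=120, n_features=5000, n_informative=50,
    n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scaling and PCA live inside the pipeline, so each CV fold fits them
# on its own training part only
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(random_state=42)),
    ("knn", KNeighborsClassifier()),
])

# Illustrative grid; n_components must stay below each fold's sample count
param_grid = {
    "pca__n_components": [5, 10, 20, 40],
    "knn__n_neighbors": [3, 5, 7],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

# The held-out test set is used once, after model selection
y_pred = search.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Best parameters:", search.best_params_)
print("Mean CV accuracy:", search.best_score_)
print("Held-out test accuracy:", test_accuracy)
print(classification_report(y_test, y_pred))
```

For stakeholders, the useful points are that the reported accuracy comes from samples the pipeline never saw during tuning, that per-class precision and recall are shown alongside accuracy, and that every preprocessing step is reproducible from a single fitted object. Whether this pipeline actually outperforms the alternatives has to be checked on the real data; the simulated data here says nothing about that.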
