
Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

Answer:
- K-Nearest Neighbors (KNN) is a supervised machine learning algorithm that makes predictions based on the similarity (or distance) between data points. It assumes that similar data points exist close to each other in feature space.

**How KNN Works (General Steps)**
- Choose K: Select the number of neighbors (K) to consider, typically a small positive integer.
- Calculate Distance: Compute the distance (e.g., Euclidean distance) between the new, unknown data point and all points in the training dataset.
- Identify Neighbors: Find the K closest training points to the new point.
- Predict: Make a decision based on these K neighbors.

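**KNN for Classification**
- Process: The algorithm identifies the K nearest neighbors, each with a class label.
- Prediction: The new point is assigned the most common class among its K neighbors (majority vote).
- Example: Classifying an email as spam or not spam from the labels of the K most similar emails in the training data.

**KNN for Regression**
- Process: The algorithm identifies the K nearest neighbors, each with a continuous target value.
- Prediction: The predicted value for the new point is the average (mean) of the target values of its K neighbors.
- Example: Predicting house prices; the new house gets the average price of the K closest houses in the training data.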
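
The general steps above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not how scikit-learn implements KNN; the function name `knn_predict`, its `task` parameter, and the toy data are assumptions made for the example. Classification takes a majority vote over the neighbors' labels, while regression averages their target values.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for a single point with brute-force Euclidean KNN (illustrative helper)."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)

    # Distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)

    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]

    if task == "classification":
        # Majority vote over the neighbors' class labels
        return Counter(neighbor_targets).most_common(1)[0][0]

    # Regression: mean of the neighbors' target values
    return float(np.mean(neighbor_targets))


# Toy data (made up for this sketch): two well-separated clusters
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
labels = ["a", "a", "a", "b", "b", "b"]
prices = [100.0, 110.0, 90.0, 500.0, 520.0, 480.0]

label_pred = knn_predict(X_toy, labels, [1.1, 1.0], k=3, task="classification")
price_pred = knn_predict(X_toy, prices, [5.0, 5.0], k=3, task="regression")
print(label_pred, price_pred)  # a 500.0
```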

**Key Characteristics**
- Lazy Learning: Stores data instead of building a model, performing computation only during prediction.
- Non-Parametric: Makes no assumptions about the data's underlying distribution.
- Distance Metric: Crucial for defining "closeness" (e.g., Euclidean, Manhattan).
- Feature Scaling: Features should be standardized before distances are computed; otherwise features with large numeric ranges dominate the distance and bias the choice of neighbors.
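
To see why scaling matters, the sketch below compares nearest neighbors before and after standardization. The four rows are hypothetical (an income-like column with a large range and an age-like column with a small range), chosen only to make the effect visible.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 has a large range (income), column 1 a small one (age)
X = np.array([
    [30000.0, 25.0],  # row 0: the query point
    [32000.0, 60.0],  # row 1: similar income, very different age
    [45000.0, 27.0],  # row 2: different income, similar age
    [60000.0, 45.0],  # row 3: different on both
])

# Euclidean distances from row 0 to the other rows on the raw features
raw_dist = np.linalg.norm(X[1:] - X[0], axis=1)

# The same distances after standardizing each column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)
scaled_dist = np.linalg.norm(X_scaled[1:] - X_scaled[0], axis=1)

raw_nearest = int(np.argmin(raw_dist)) + 1
scaled_nearest = int(np.argmin(scaled_dist)) + 1
print("Nearest row on raw features:   ", raw_nearest)
print("Nearest row on scaled features:", scaled_nearest)
```

On the raw features the income column decides the neighbor almost entirely, so the 35-year age gap to row 1 is ignored. After scaling, both columns contribute and the nearest neighbor changes to row 2.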

Question 2: What is the Curse of Dimensionality and how does it affect KNN
performance?

Answer:
- The Curse of Dimensionality refers to problems that arise when the number of features (dimensions) in a dataset becomes very large. As dimensionality increases, data becomes sparse and distance-based algorithms struggle to find meaningful similarity.

**What is the Curse of Dimensionality?**

- Exponential Growth: The volume of the feature space increases exponentially with each added dimension, meaning data points become spread out (sparse).
- Data Sparsity: With fixed data, the density of points decreases dramatically as dimensions rise, making it hard to find truly "close" neighbors.
- Meaningless Distances: Distances between points become less discriminatory; the difference between the nearest and farthest neighbor diminishes, making all points seem equidistant.
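
The "meaningless distances" effect can be observed on random data. The sketch below is an illustration under simple assumptions (points drawn uniformly from the unit hypercube, Euclidean distance, a fixed seed); the helper `relative_contrast` is a name introduced here. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_points, n_dims):
    """(farthest - nearest) / nearest distance from one random query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dist = np.linalg.norm(points - query, axis=1)
    return (dist.max() - dist.min()) / dist.min()


# The contrast shrinks as the number of dimensions grows
contrasts = {d: relative_contrast(500, d) for d in (2, 10, 100, 1000)}
for n_dims, value in contrasts.items():
    print(f"{n_dims:>5} dims: relative contrast = {value:.2f}")
```

A large contrast means the nearest neighbor is much closer than a typical point; a contrast near zero means every point is roughly as far away as every other, which is the situation KNN struggles with.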

**How it affects KNN performance:**
- Loss of Locality: The concept of "nearest neighbors" loses its meaning because points that seem close in high dimensions might not be meaningfully close in a practical sense.
- Increased Computational Cost: Brute-force KNN computes a distance to every training point, and each distance costs more as features are added, so prediction slows down as dimensionality grows.
- Overfitting & Poor Generalization: With data spread thin, KNN can easily pick up noise, leading to models that perform well on training data but poorly on new data (overfitting).
- Irrelevant Features: Extra, irrelevant features add noise, making it harder to find true patterns and further reducing accuracy.

Question 3: What is Principal Component Analysis (PCA)? How is it different from
feature selection?

Answer:
- Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique that transforms the original features into a new set of features called principal components. These components are linear combinations of the original features and are ordered such that the first few components retain most of the variance (information) in the data.

**How PCA Works (Conceptually)**

- Standardize the data
- Compute covariance matrix
- Find eigenvalues & eigenvectors
- Select top components that explain most variance
- Project data onto these components
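
The five steps above map directly onto a few NumPy calls. This is a minimal sketch for illustration (in practice `sklearn.decomposition.PCA` would be used, as in the later cells); the function name `pca_from_scratch` and the synthetic data are assumptions of the example.

```python
import numpy as np


def pca_from_scratch(X, n_components):
    """Illustrative PCA via eigendecomposition of the covariance matrix."""
    X = np.asarray(X, dtype=float)

    # 1. Standardize each feature to zero mean and unit variance
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2. Covariance matrix of the standardized features
    cov = np.cov(X_std, rowvar=False)

    # 3. Eigenvalues and eigenvectors (eigh is meant for symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # 4. Sort by decreasing eigenvalue and keep the top components
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    explained_ratio = eigenvalues / eigenvalues.sum()
    components = eigenvectors[:, :n_components]

    # 5. Project the data onto the selected components
    return X_std @ components, explained_ratio


# Synthetic data: two strongly correlated features plus one independent feature
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X_demo = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

X_proj, ratio = pca_from_scratch(X_demo, n_components=2)
print(X_proj.shape)
print(ratio.round(3))
```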

| Aspect               | PCA (Dimensionality Reduction)                | Feature Selection                     |
| -------------------- | --------------------------------------------- | ------------------------------------- |
| **Approach**         | Creates new transformed features              | Selects subset of existing features   |
| **Nature**           | Feature extraction                            | Feature filtering                     |
| **Interpretability** | Harder to interpret (components mix features) | Easy to interpret (original features) |
| **Supervision**      | Unsupervised                                  | Can be supervised or unsupervised     |
| **Output**           | Principal components (linear combinations)    | Original features (reduced set)       |
| **Example**          | PC1 = 0.6x₁ + 0.4x₂ + …                       | Choose x₁, x₃, x₄ only                |

**Intuitive Explanation**

- PCA compresses the feature space by creating new axes
- Feature selection removes unimportant features and keeps key ones
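
The contrast in the table can be made concrete on the Wine dataset. In the sketch below, `SelectKBest` with `f_classif` is just one possible (supervised) feature selection method, and keeping 3 features or components is an arbitrary choice for illustration. Feature selection returns a subset of the original columns, while each principal component assigns a weight to every original column.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)
y = data.target

# Feature selection: keep 3 of the original 13 columns, unchanged
selector = SelectKBest(score_func=f_classif, k=3).fit(X_scaled, y)
kept = [name for name, keep in zip(data.feature_names, selector.get_support()) if keep]
print("Selected original features:", kept)

# PCA: build 3 new features, each a weighted mix of all 13 columns
pca = PCA(n_components=3).fit(X_scaled)
pc1_weights = dict(zip(data.feature_names, pca.components_[0].round(2)))
print("PC1 weights:", pc1_weights)
```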

Question 4: What are eigenvalues and eigenvectors in PCA, and why are they
important?

Answer:
- In Principal Component Analysis (PCA), eigenvectors and eigenvalues are derived from the covariance matrix of the data and are used to transform high-dimensional data into a lower-dimensional space.

**1. What are Eigenvectors and Eigenvalues?**

- Eigenvectors (Principal Components):
  - They define the directions or axes of the new feature space.
  - They are vectors that, when subjected to the linear transformation of the covariance matrix, do not change their direction, but only scale.
  - In PCA, the first eigenvector points in the direction of maximum variance; each subsequent eigenvector is orthogonal to all earlier ones and captures the largest share of the remaining variance.
- Eigenvalues (Magnitudes):
  - They represent the scalar magnitude of the variance explained by each corresponding eigenvector.
  - They indicate how much of the data's total variance (information) is captured by each principal component.
  - A higher eigenvalue indicates that the corresponding eigenvector captures more information.
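
In symbols, the definitions above can be written as follows. The notation is introduced here for illustration: \(C\) is the covariance matrix of the standardized data, \(v_i\) and \(\lambda_i\) are its \(i\)-th eigenvector and eigenvalue, and \(d\) is the number of original features.

```latex
% Eigenvectors keep their direction under C and are only scaled by the eigenvalue
C\,v_i = \lambda_i\,v_i, \qquad i = 1, \dots, d

% C is symmetric, so the eigenvectors can be chosen mutually orthogonal,
% and the eigenvalues are non-negative and sorted in decreasing order
v_i^{\top} v_j = 0 \quad (i \neq j), \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0

% Share of the total variance captured by component i
\text{explained variance ratio}_i = \frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j}
```

The last expression is the quantity that scikit-learn exposes as `explained_variance_ratio_`.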

**2. Why are They Important in PCA?**

- Eigenvalues and eigenvectors are the core mechanism of PCA, making it possible to:

  - Dimensionality Reduction: By calculating eigenvalues/eigenvectors, we can identify which directions (components) have high variance (large eigenvalues) and which have low variance (small eigenvalues). Eigenvectors with small eigenvalues can usually be discarded to reduce dimensions with little information loss.
  - Maximized Information Retention: The top eigenvectors (principal components) are chosen based on the largest eigenvalues, ensuring that the most significant patterns and variability in the data are preserved.
  - Decoupling Variables (Orthogonality): Eigenvectors are orthogonal (perpendicular) to each other, meaning the new components are uncorrelated. This eliminates multicollinearity, which is a common issue in high-dimensional datasets.
  - Data Compression and Visualization: By reducing the number of dimensions to 2 or 3 principal components, high-dimensional datasets can be easily visualized and analyzed.
  - Noise Reduction: Low eigenvalues often correspond to noise rather than signal. By eliminating these components, PCA acts as a filter to improve the signal-to-noise ratio.

Question 5: How do KNN and PCA complement each other when applied in a single
pipeline?

Answer:
- When applied in a single pipeline, PCA and KNN complement each other: PCA transforms high-dimensional, noisy data into a lower-dimensional, cleaner space, which addresses KNN's main weaknesses and can improve both its computational efficiency and its predictive accuracy.

**Here is how they complement each other**:
 - Solving the "Curse of Dimensionality": KNN relies on distance metrics (e.g., Euclidean distance) to find the nearest neighbors. In high-dimensional spaces, the distance between points becomes less meaningful, which degrades KNN performance. PCA reduces this dimensionality by projecting data into a lower-dimensional space, allowing KNN to function more effectively.
 - Improving Computational Efficiency: Because KNN calculates the distance from a test point to every training point, it is computationally expensive and slow on large, high-dimensional datasets. PCA reduces the number of features (dimensions), significantly speeding up the prediction time of the KNN model.
 - Noise Reduction and Data Cleaning: High-dimensional data often contains noise or redundant information that can trick KNN into picking wrong neighbors. PCA filters out this noise by focusing only on the components that explain the most variance, thus acting as a preprocessing step that improves classification accuracy.
 - Decoupling Features: PCA transforms raw, potentially correlated features into orthogonal (uncorrelated) principal components. This ensures that the distance metric used by KNN is not skewed by redundant or heavily correlated features.

**Typical Pipeline**
- A common ML pipeline looks like:
  - Standardize features
  - Apply PCA to reduce dimension
  - Fit KNN classifier/regressor
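
One way to express this pipeline is scikit-learn's `Pipeline`, which chains the three steps and fits the scaler and PCA on the training data only. The sketch below assumes the Wine dataset, 2 components, and `n_neighbors=5`; the step names `"scale"`, `"pca"`, and `"knn"` are arbitrary labels.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Each step is fit on the training data; predict/score reuse the fitted transforms
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X_train, y_train)

test_accuracy = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Wrapping the steps in a single estimator also means the whole chain can be passed to `cross_val_score` or `GridSearchCV` without refitting the preprocessing by hand.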

In [1]:
'''
Question 6: Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.

'''
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

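# Load the Wine dataset (13 numeric features, 3 classes)
data = load_wine()
X, y = data.data, data.target

# Stratified 70/30 split so class proportions match in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Baseline: KNN on the raw, unscaled features
knn_no_scale = KNeighborsClassifier(n_neighbors=5)
knn_no_scale.fit(X_train, y_train)
pred_no_scale = knn_no_scale.predict(X_test)
acc_no_scale = accuracy_score(y_test, pred_no_scale)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Same KNN configuration on the standardized features
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
pred_scaled = knn_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, pred_scaled)

# (accuracy without scaling, accuracy with scaling)
acc_no_scale, acc_scaled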


(0.7222222222222222, 0.9444444444444444)

In [2]:
'''
Question 7: Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.
'''
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

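data = load_wine()
X = data.data

# Standardize first: PCA is driven by variance, so unscaled features would dominate
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Keep all 13 components to see how the variance is distributed
pca = PCA()
pca.fit(X_scaled)

# Fraction of total variance explained by each component, in decreasing order
pca.explained_variance_ratio_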


array([0.36198848, 0.1920749 , 0.11123631, 0.0706903 , 0.06563294,
       0.04935823, 0.04238679, 0.02680749, 0.02222153, 0.01930019,
       0.01736836, 0.01298233, 0.00795215])

In [3]:
'''
Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.
'''
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

data = load_wine()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Reference model: KNN on all 13 scaled features
knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
pred_original = knn_original.predict(X_test_scaled)
acc_original = accuracy_score(y_test, pred_original)

# Reduce to the top 2 principal components (PCA fit on the training set only)
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Same KNN configuration on the 2-dimensional PCA features
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
pred_pca = knn_pca.predict(X_test_pca)
acc_pca = accuracy_score(y_test, pred_pca)

# (accuracy on 13 scaled features, accuracy on 2 principal components)
acc_original, acc_pca


(0.9444444444444444, 0.9444444444444444)

Question 9: Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.

Answer:

- Training a K-Nearest Neighbors (KNN) classifier on the scaled Wine dataset involves preprocessing the 13 chemical features and evaluating the model with different distance metrics. The comparison here covers the two most common choices: Euclidean (straight-line) and Manhattan (grid-like) distance.

**Training Procedure**
- Scaling: The Wine dataset features have very different scales (e.g., proline is in the hundreds to thousands, while hue is close to 1). Feature scaling (such as StandardScaler) is essential because KNN relies on distance; unscaled features with larger ranges would otherwise dominate the result.
- Training: Use the KNeighborsClassifier from scikit-learn.
  - Euclidean Metric: Set metric='euclidean' (equivalent to \(p=2\) in Minkowski).
  - Manhattan Metric: Set metric='manhattan' (equivalent to \(p=1\) in Minkowski).
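
The procedure above could be run as follows. This is a sketch that assumes the same split (`test_size=0.3`, `random_state=42`, stratified) and `n_neighbors=5` as the earlier cells; the resulting accuracies are not reproduced here, so run the cell to see them.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train one KNN per distance metric and record its test accuracy
results = {}
for metric in ("euclidean", "manhattan"):
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn.fit(X_train_scaled, y_train)
    results[metric] = accuracy_score(y_test, knn.predict(X_test_scaled))

for metric, acc in results.items():
    print(f"{metric:>10}: {acc:.4f}")
```

With a test set of only 54 samples, a difference of one or two correct predictions between the metrics is within noise; cross-validation would give a steadier comparison.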

In [4]:
'''
Question 10: You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data
'''
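import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import make_classification

# Synthetic stand-in for gene expression data: 100 samples, 1000 features, 3 classes
X, y = make_classification(n_samples=100, n_features=1000, n_informative=50,
                           n_classes=3, random_state=42)

# Standardize so every feature contributes on the same scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit a full PCA to inspect how variance accumulates across components
pca = PCA()
pca.fit(X_scaled)

cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 95%
n_components = np.argmax(cumulative_variance >= 0.95) + 1
print(f"Number of components to explain 95% variance: {n_components}")

# Refit PCA with that many components and project the data
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X_scaled)

# Hold-out split and KNN fit (the test split is not scored in this cell)
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.3,
                                                    random_state=42, stratify=y)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# 5-fold cross-validation on the PCA-transformed data.
# Note: the scaler and PCA were fit on all samples before the folds were made,
# so the folds are not fully independent and this estimate can be optimistic.
cv_scores = cross_val_score(knn, X_pca, y, cv=5)
print(f"Cross-Validation Accuracy: {cv_scores.mean():.2f} (+/- {cv_scores.std():.2f})")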


Number of components to explain 95% variance: 90
Cross-Validation Accuracy: 0.43 (+/- 0.13)
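
For the evaluation step, a stricter variant keeps the scaler and PCA inside a `Pipeline` so that both are refit on the training part of every fold. The sketch below uses the same synthetic data as the cell above; the shuffled `StratifiedKFold` and passing `n_components=0.95` (which asks PCA to keep enough components for 95% of the variance) are choices made for this example, and its scores are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same synthetic stand-in for gene expression data as in the cell above
X, y = make_classification(n_samples=100, n_features=1000, n_informative=50,
                           n_classes=3, random_state=42)

# Scaling and PCA live inside the pipeline, so each fold fits them on its own
# training portion and the held-out portion never influences the transforms
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv)
print(f"Leakage-free CV accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")
```

For stakeholders, this structure is the main argument for robustness: every reported score comes from samples that played no part in fitting any step of the model.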
