In [None]:
Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for both classification and regression. It is a non-parametric, instance-based (lazy learning) method, meaning it does not explicitly learn a model during training. Instead, it stores the training data and predicts from the “closeness” of a new data point to those stored examples.

How KNN Works

Choose K:
Decide on the number of neighbors (k) to consider.

Small k → more sensitive to noise.

Large k → smoother decision boundaries but may ignore local patterns.



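Compute Distances:
Measure the distance from the new point to every training sample using a distance metric (e.g., Euclidean or Manhattan).

Select K Nearest Neighbors:
Pick the k closest training samples based on that distance.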

Prediction Rule:

For Classification:
The class is decided by majority voting among the k neighbors.
Example: If 3 nearest neighbors have labels {A, A, B}, prediction = A.

For Regression:
The output is the average (or weighted average) of the k neighbors’ values.
Example: If the nearest neighbors’ values are {5, 6, 7}, prediction = 6.
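
The steps and the two prediction rules above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not how scikit-learn implements KNN; the function name knn_predict and the toy data are assumptions made up for this example, chosen so that the three nearest neighbors reproduce the {A, A, B} and {5, 6, 7} cases.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for a single point using brute-force Euclidean KNN."""
    X_train = np.asarray(X_train, dtype=float)
    # Distance from the new point to every stored training sample
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]
    if task == "classification":
        # Majority vote among the k neighbors
        return Counter(neighbor_targets).most_common(1)[0][0]
    # Regression: plain (unweighted) mean of the neighbors' values
    return float(np.mean(neighbor_targets))


# Toy 1-D data: the 3 nearest neighbors of x = 2.0 are the first three points
X_toy = [[1.0], [2.0], [3.0], [10.0]]
label_pred = knn_predict(X_toy, ["A", "A", "B", "B"], [2.0], k=3)
value_pred = knn_predict(X_toy, [5.0, 6.0, 7.0, 50.0], [2.0], k=3, task="regression")
print(label_pred, value_pred)
```

The far-away fourth point (label B, value 50) is ignored because it is not among the 3 nearest neighbors.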

KNN in Classification

Steps:

Find k nearest neighbors of the test point.

Count the class labels among them.

Assign the most frequent class.

Example:
Predict if a fruit is an apple or orange based on size & color. The new fruit is compared with known labeled fruits, and whichever label dominates among its k neighbors is chosen.

KNN in Regression

Steps:

Find k nearest neighbors.

Take the mean (or weighted mean) of their values.

Example:
Predict the price of a house based on area & location. KNN looks at the k most similar houses and averages their prices to predict.

Key Characteristics of KNN

 Simple & intuitive
 No training phase (just stores data)
 Computationally expensive at prediction time (must calculate distance to all training points)
 Performance depends heavily on choice of k and distance metric
 Sensitive to irrelevant features & scale (hence normalization/standardization is usually required)

Question 2: What is the Curse of Dimensionality and how does it affect KNN
performance?

Curse of Dimensionality

The curse of dimensionality refers to the various problems that arise when working with data in high-dimensional spaces (i.e., when the number of features is very large).

As dimensions increase:

Data points become sparse.

Distances between points become less meaningful.

Algorithms like KNN that rely on distance/similarity suffer.

How It Affects KNN Performance

KNN works by finding the nearest neighbors based on a distance metric (e.g., Euclidean distance). In high dimensions:

Distances Lose Contrast

In low dimensions, "near" and "far" points are well-separated.

In high dimensions, the difference between the nearest and farthest neighbor distances shrinks.

This makes it hard for KNN to distinguish neighbors effectively.

Example:
In 1D, the nearest neighbor may be 1 unit away, the farthest 10 units → clear difference.
In 100D, nearest might be 9.5 units away, farthest 10 units → almost the same!
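
The loss of contrast can be observed with a small simulation. The sketch below assumes points drawn uniformly at random from the unit cube (an assumption for illustration only) and measures how much farther the farthest point is than the nearest one, relative to the nearest distance. The helper name relative_contrast is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


contrast = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, value in contrast.items():
    print(f"{d:>4} dims: relative contrast = {value:.2f}")
```

A large value means "near" and "far" are clearly different; a value close to 0 means all points sit at almost the same distance from the query.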

Increased Noise

With more features, not all are relevant.

Irrelevant features add noise to the distance calculation, misleading KNN about which neighbors are "close."
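
A rough way to see this effect is to append pure-noise columns to a real dataset and compare cross-validated accuracy. The sketch below uses the Wine dataset with 500 random Gaussian columns standing in for irrelevant features; the number 500 and the 5-fold setup are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Assumption for illustration: 500 pure-noise columns play the irrelevant features
rng = np.random.default_rng(0)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 500))])

# Scaling happens inside the pipeline, so it is refit on each training fold
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
acc_clean = cross_val_score(model, X, y, cv=5).mean()
acc_noisy = cross_val_score(model, X_noisy, y, cv=5).mean()
print(f"13 real features:             {acc_clean:.3f}")
print(f"13 real + 500 noise features: {acc_noisy:.3f}")
```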

Data Sparsity

Volume of space grows exponentially with dimensions.

To cover the space, exponentially more data is needed.

With limited data, neighbors may be very far away, hurting generalization.

Impact on KNN

Lower accuracy: Misclassification in classification tasks and poor predictions in regression.

Higher computation cost: Distance calculations become expensive with many features.

Overfitting risk: Since noise dominates, the model may fit random fluctuations instead of true patterns.

How to Mitigate Curse of Dimensionality in KNN

 Feature Selection – Remove irrelevant or redundant features.
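 Dimensionality Reduction – Use PCA or autoencoders to project data into fewer dimensions (t-SNE is better kept for visualization, since it does not learn a mapping that can be applied to new points).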
 Distance Weighting – Give more importance to closer neighbors instead of treating all equally.
 Scaling/Normalization – Ensures no single feature dominates distance calculations.

In [None]:
Question 3: What is Principal Component Analysis (PCA)? How is it different from
feature selection?

Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique used to transform high-dimensional data into a smaller number of new features while preserving as much variance (information) as possible.

It creates new features (principal components), which are linear combinations of the original features.

These components are orthogonal (uncorrelated) and ordered:

1st component → captures the most variance.

2nd component → captures the next most variance, orthogonal to the 1st.

and so on…

How PCA Works (Steps)

Standardize the data (important, since PCA is sensitive to scale).

Compute covariance matrix (or correlation matrix).

Find eigenvalues & eigenvectors of the covariance matrix.

Eigenvectors = directions of maximum variance (principal components).

Eigenvalues = amount of variance captured by each component.

Select top k components that capture most of the variance.

Transform data into this reduced k-dimensional space.
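
The steps above can be followed by hand with NumPy. This is a sketch for understanding only; in practice sklearn.decomposition.PCA does the same job (via SVD). The Wine dataset and k = 2 are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_wine

X, _ = load_wine(return_X_y=True)

# 1) Standardize each feature to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2) Covariance matrix of the standardized features (13 x 13)
cov = np.cov(X_std, rowvar=False)

# 3) Eigen-decomposition; eigh suits symmetric matrices and returns
#    eigenvalues in ascending order, so re-order them to descending
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4) Share of variance per component, used to decide how many to keep
explained_ratio = eigenvalues / eigenvalues.sum()
k = 2

# 5) Project the data onto the top k eigenvectors
X_reduced = X_std @ eigenvectors[:, :k]

print("Explained variance ratio (top 2):", np.round(explained_ratio[:k], 4))
print("Reduced shape:", X_reduced.shape)
```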


| Aspect                   | **PCA (Dimensionality Reduction)**                                                                       | **Feature Selection**                                                                                  |
| ------------------------ | -------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ |
| **Approach**             | Creates **new features** (linear combinations of old ones).                                              | Selects a **subset of existing features**.                                                             |
| **Goal**                 | Reduce dimensionality while retaining maximum variance.                                                  | Keep the most relevant/important features, drop the rest.                                              |
| **Interpretability**     | Harder to interpret (new components are combinations, not original features).                            | Easy to interpret (keeps original features).                                                           |
| **Correlation Handling** | Removes redundancy by creating uncorrelated components.                                                  | Might still keep correlated features if not explicitly handled.                                        |
| **Use Case**             | Best when you want to compress data and capture overall variance (e.g., visualization, noise reduction). | Best when you want model simplicity and interpretability (e.g., selecting biomarkers in medical data). |


Example

Suppose you have Height and Weight as features.

PCA may create PC1 = 0.7×Height + 0.7×Weight and PC2 = -0.7×Height + 0.7×Weight.

These PCs are new features, uncorrelated.

Feature selection would simply decide to keep either Height or Weight, whichever is more informative.
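
The Height/Weight example can be reproduced on simulated data. The numbers used to generate the data below are invented for illustration. For two standardized features the loadings always have magnitude 1/√2 ≈ 0.707; only their signs can differ between runs or libraries.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Simulated data (an assumption for illustration): weight rises with height plus noise
rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=500)
weight_kg = 0.9 * (height_cm - 170) + 70 + rng.normal(0, 6, size=500)
X = np.column_stack([height_cm, weight_kg])

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)

# Rows are PC1 and PC2; columns are the weights on Height and Weight.
# The overall sign of each row is arbitrary, so only the magnitudes are fixed.
print("Loadings:\n", np.round(pca.components_, 3))

# The new features (PC scores) are uncorrelated with each other
scores = pca.transform(X_std)
pc_correlation = np.corrcoef(scores, rowvar=False)[0, 1]
print("Correlation between PC1 and PC2:", round(pc_correlation, 6))

# Feature selection, by contrast, keeps an original column unchanged
X_selected = X[:, [0]]   # e.g. keep Height only
print("Selected shape:", X_selected.shape)
```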

✅ In short:

PCA: Creates new compressed features → good for variance preservation, but less interpretable.

Feature Selection: Keeps only important original features → better for interpretability.

In [None]:
Question 4: What are eigenvalues and eigenvectors in PCA, and why are they
important?

Eigenvalues and Eigenvectors in PCA
1. Eigenvectors

In PCA, eigenvectors represent the directions (axes) of maximum variance in the data.

Each eigenvector points to a new axis (called a principal component) along which the data is projected.

They are orthogonal to each other (uncorrelated).

👉 Think of eigenvectors as the new coordinate system we rotate the data into.

2. Eigenvalues

Each eigenvalue tells us how much variance is captured by its corresponding eigenvector (principal component).

Large eigenvalue → that component captures a lot of information (variance).

Small eigenvalue → that component contributes little information (could be dropped).

👉 Think of eigenvalues as the “importance score” of each principal component.

Why They’re Important in PCA

Finding Principal Components

PCA computes eigenvectors of the covariance matrix of the dataset.

These eigenvectors define the new principal axes.

Ranking Components

Eigenvalues rank these components by importance.

First PC = eigenvector with largest eigenvalue → captures max variance.

Second PC = eigenvector with 2nd largest eigenvalue, and so on.

Dimensionality Reduction

By selecting the top k eigenvectors (with highest eigenvalues), we can reduce the dataset from n features to k principal components.

This keeps most of the information while discarding noise/redundancy.

Analogy

Imagine shining a flashlight on a 3D object:

The eigenvectors are the directions you choose to shine the light.

The eigenvalues tell you how much of the object’s shadow (variance) falls along each direction.

You keep the directions (eigenvectors) that produce the biggest, most informative shadows.

Example in 2D

Suppose you have 2 correlated features: Height and Weight.

PCA finds a new axis (PC1) along the line where both vary together the most.

Eigenvector of PC1 = direction of max variance.

Eigenvalue of PC1 = how much variance in the data is explained by that axis.

If PC1 explains 95% variance, you can drop PC2 (small eigenvalue), reducing dimensionality from 2D → 1D.
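
These relationships can be checked numerically. The sketch below uses scikit-learn's PCA on the standardized Wine data (a dataset chosen for illustration) and verifies that the first component satisfies the eigenvector equation, that the variance along it equals its eigenvalue, and that different components are orthogonal.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)
cov = np.cov(X_std, rowvar=False)       # sample covariance (divides by n - 1)

v1 = pca.components_[0]                 # eigenvector = direction of PC1
lam1 = pca.explained_variance_[0]       # eigenvalue = variance along PC1

# Defining property of an eigenpair: cov @ v = lambda * v
is_eigenpair = np.allclose(cov @ v1, lam1 * v1)
print("cov @ v1 == lam1 * v1:", is_eigenpair)

# The variance of the data projected onto v1 equals the eigenvalue
projected_var = np.var(X_std @ v1, ddof=1)
print("variance along PC1 == lam1:", np.isclose(projected_var, lam1))

# Different eigenvectors are orthogonal
dot_12 = v1 @ pca.components_[1]
print("v1 . v2 is ~0:", np.isclose(dot_12, 0.0))
```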

✅ In short:

Eigenvectors = directions (principal components).

Eigenvalues = amount of variance explained (importance of each component).

Together, they let PCA compress data while keeping the most meaningful structure.

In [None]:
Question 5: How do KNN and PCA complement each other when applied in a single
pipeline?


How KNN and PCA Complement Each Other
1. Problem with KNN Alone

KNN depends heavily on distance metrics (Euclidean, Manhattan, etc.).

In high-dimensional data, distance calculations suffer due to the curse of dimensionality → distances between points become less meaningful.

Also, redundant or noisy features can mislead KNN.

2. What PCA Does for KNN

When you apply PCA before KNN:

Dimensionality Reduction → PCA projects data into fewer dimensions, keeping only the most important variance.

Removes Correlations → PCA creates uncorrelated principal components, so KNN’s distance metric works more reliably.

Noise Filtering → Components with very small eigenvalues (low variance) are dropped, reducing irrelevant noise.

Speed Improvement → With fewer dimensions, KNN computes distances faster.

3. Pipeline: PCA + KNN

Step 1: Standardize the data (important before PCA).
Step 2: Apply PCA → reduce from, say, 100 features to top 20 components.
Step 3: Run KNN on this reduced dataset.

4. Example

Suppose we want to classify handwritten digits (MNIST dataset: 784 features/pixels).

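KNN directly on 784D space → slow, and distances are diluted by many uninformative pixels.

PCA reduces to, say, 50 components (these keep most of the variance; the exact share should be checked with explained_variance_ratio_).

KNN then works faster and often just as well or better, since distances in 50D are less affected by noise than in 784D.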
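
A pipeline of this kind might look like the sketch below. To keep it small and self-contained it uses scikit-learn's load_digits (8x8 images, 64 features) as a stand-in for MNIST, and 20 components as an illustrative choice; both are assumptions, not values from the example above.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# load_digits is used here as a small stand-in for MNIST
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# The pipeline fits the scaler and PCA on the training data only,
# then applies the same fitted transforms to the test data.
pca_knn = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=20)),      # 20 is an illustrative choice
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pca_knn.fit(X_train, y_train)

retained = pca_knn.named_steps["pca"].explained_variance_ratio_.sum()
test_accuracy = pca_knn.score(X_test, y_test)
print(f"Variance retained by 20 components: {retained:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

Wrapping the three steps in one Pipeline object keeps the order Standardize → PCA → KNN fixed and avoids fitting the scaler or PCA on test data by accident.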

5. Complementary Roles

PCA → prepares data (denoising, reducing dimensionality, making distances meaningful).

KNN → performs learning (classification/regression using distance in this reduced space).

✅ In short:

PCA combats the curse of dimensionality and noise.

KNN relies on meaningful distances.

Together, PCA+KNN often yields better accuracy, lower computation, and more robust predictions, though the gain should be confirmed on held-out data.

In [None]:
Dataset:
Use the Wine Dataset from sklearn.datasets.load_wine().
Question 6: Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.
(Include your Python code and output in the code box below.)



In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# ----------- KNN without Scaling -----------
knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = knn_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# ----------- KNN with Scaling -----------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaling = KNeighborsClassifier(n_neighbors=5)
knn_scaling.fit(X_train_scaled, y_train)
y_pred_scaling = knn_scaling.predict(X_test_scaled)
accuracy_scaling = accuracy_score(y_test, y_pred_scaling)

print(f"Accuracy without scaling: {accuracy_no_scaling:.4f}")
print(f"Accuracy with scaling:    {accuracy_scaling:.4f}")


In [None]:
Accuracy without scaling: 0.7222
Accuracy with scaling:    0.9444


In [None]:
Question 7: Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.
(Include your Python code and output in the code box below.)



In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Standardize features before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train PCA model
pca = PCA()
pca.fit(X_scaled)

# Explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
print("Explained Variance Ratio:", explained_variance_ratio)


In [None]:
Explained Variance Ratio: 
[0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 
 0.04935823 0.04238679 0.02680749 0.02222153 0.01930019 
 0.01736836 0.01298233 0.00795215]


In [None]:
Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.
(Include your Python code and output in the code box below.)

In [None]:
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

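# Standardize features
# Note: the scaler and PCA below are fit on the full dataset before splitting,
# so test-set statistics leak into the transform; a stricter evaluation would
# fit them on the training split only.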
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# ----------- PCA with top 2 components -----------
pca_2 = PCA(n_components=2)
X_pca_2 = pca_2.fit_transform(X_scaled)

X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(
    X_pca_2, y, test_size=0.3, random_state=42, stratify=y
)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train_pca)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test_pca, y_pred_pca)

# ----------- KNN on Original Scaled Data -----------
X_train_orig, X_test_orig, y_train_orig, y_test_orig = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42, stratify=y
)

knn_orig = KNeighborsClassifier(n_neighbors=5)
knn_orig.fit(X_train_orig, y_train_orig)
y_pred_orig = knn_orig.predict(X_test_orig)
accuracy_orig = accuracy_score(y_test_orig, y_pred_orig)

print(f"Accuracy on original scaled dataset: {accuracy_orig:.4f}")
print(f"Accuracy on PCA (2 components):      {accuracy_pca:.4f}")


In [None]:
Accuracy on original scaled dataset: 0.9444
Accuracy on PCA (2 components):      0.9630


In [None]:
Question 9: Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.
(Include your Python code and output in the code box below.)

In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

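# Standardize features
# Note: the scaler is fit on the full dataset before splitting, so test-set
# statistics leak into it; a stricter evaluation would fit it on the training split only.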
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42, stratify=y
)

# ----------- KNN with Euclidean distance (default: p=2) -----------
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn_euclidean.fit(X_train, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

# ----------- KNN with Manhattan distance (p=1) -----------
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=1)
knn_manhattan.fit(X_train, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

print(f"Accuracy with Euclidean distance: {accuracy_euclidean:.4f}")
print(f"Accuracy with Manhattan distance: {accuracy_manhattan:.4f}")


In [None]:
Accuracy with Euclidean distance: 0.9444
Accuracy with Manhattan distance: 0.9259



In [None]:
Question 10: You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data
(Include your Python code and output in the code box below.)

How I’d approach it

Use PCA to reduce dimensionality

Standardize features (gene expression magnitudes vary a lot).

Fit PCA on the training set only to avoid data leakage.

Project train/validation/test splits into the principal-component space.

Decide how many components to keep

Start with a target like 95% cumulative explained variance to capture most of the biological signal while discarding noise.

Optionally sweep a small set of candidates (e.g., [10, 20, 50, n_components@95%]) using a validation set to see which count performs best.
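
As a sketch of the 95% rule: scikit-learn's PCA accepts a float between 0 and 1 for n_components and keeps the fewest components that reach that share of variance. The example below uses the Wine dataset as a small stand-in (an assumption; real gene expression data would have far more features) and cross-checks the result against the cumulative variance curve.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scaler = StandardScaler().fit(X_train)          # fit on training data only
X_train_s = scaler.transform(X_train)

# A float in (0, 1) asks PCA for the fewest components reaching that variance share
pca_95 = PCA(n_components=0.95).fit(X_train_s)
n_kept = pca_95.n_components_
retained = pca_95.explained_variance_ratio_.sum()

# Cross-check against the cumulative curve from a full PCA
cumulative = np.cumsum(PCA().fit(X_train_s).explained_variance_ratio_)
print(f"Components kept: {n_kept} of {X_train_s.shape[1]}")
print(f"Variance retained: {retained:.4f}")
print("Cumulative curve:", np.round(cumulative, 3))
```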

Use KNN for classification post-reduction

KNN benefits from PCA because distances become more meaningful in the denoised, lower-dimensional space.

Tune a small set of k (e.g., [3, 5, 7]) on the validation split.

Evaluate the model

Keep a held-out test set for final evaluation.

Report Accuracy and macro-F1 (macro-F1 is important when classes are imbalanced).

Include the classification report per class.
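
With very few samples, a single validation split can be noisy. One alternative, sketched below, swaps the fixed validation set for stratified cross-validation inside GridSearchCV while still keeping a held-out test set. This is a variation on the approach described above, not the same procedure; the synthetic data and the grid values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: few samples, many features
X, y = make_classification(
    n_samples=150, n_features=300, n_informative=20, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])
param_grid = {"pca__n_components": [5, 10, 20], "knn__n_neighbors": [3, 5, 7]}

# Each fold refits the scaler and PCA on its own training part, so nothing leaks
search = GridSearchCV(
    pipeline, param_grid, scoring="f1_macro",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)

y_pred = search.predict(X_test)          # best pipeline, refit on all training data
test_f1 = f1_score(y_test, y_pred, average="macro")
print("Best parameters:", search.best_params_)
print(f"Test macro-F1: {test_f1:.4f}")
print(classification_report(y_test, y_pred, digits=4))
```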

Justify this pipeline to stakeholders

Robustness with small n, large p: PCA lowers model variance and the overfitting risk that is common with gene expression data.

Transparency: PCA explains how much variance each component carries; KNN is simple and interpretable in the reduced space.

Generalization: Using train/val/test discipline and macro-F1 focuses on reliable performance across cancer subtypes.

Efficiency: Fewer components → faster, more stable distance computations for KNN.

Code (PCA + KNN with tiny validation sweep)

Running the script below prints the number of components needed for 95% variance, the best validation configuration, and the final test metrics.

In [None]:
# High-dimensional gene expression classification with PCA + KNN
# --------------------------------------------------------------
# - Simulate a gene-expression-like dataset (many features, few samples)
# - Standardize -> PCA -> KNN
# - Choose number of components via 95% variance + a tiny validation sweep
# - Evaluate on a held-out test set (Accuracy + macro-F1 + class report)

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report
import numpy as np

# 1) Simulate high-dimensional data (adjust sizes if you like)
X, y = make_classification(
    n_samples=180,      # few patients
    n_features=1000,    # many genes
    n_informative=60,   # a small subset truly informative
    n_redundant=0,
    n_repeated=0,
    n_classes=3,
    n_clusters_per_class=2,
    class_sep=2.0,
    random_state=42
)

# Train/Val/Test = 60% / 20% / 20% (stratified)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)  # 0.25 of 0.80 -> 0.20

# 2) Standardize
scaler = StandardScaler(with_mean=True)
X_train_s = scaler.fit_transform(X_train)
X_val_s   = scaler.transform(X_val)
X_test_s  = scaler.transform(X_test)

# 3) PCA fit on training only (avoid leakage)
pca_full = PCA(random_state=42)
pca_full.fit(X_train_s)

explained = pca_full.explained_variance_ratio_
cumulative = np.cumsum(explained)
n95 = int(np.searchsorted(cumulative, 0.95) + 1)  # components to reach 95% variance

# 4) Tiny validation sweep to pick n_components and k
component_candidates = sorted(set([10, 20, 50, n95]))
k_candidates = [3, 5, 7]

best_cfg = None
best_val_f1 = -1.0

for n in component_candidates:
    pca = PCA(n_components=n, random_state=42)
    X_train_p = pca.fit_transform(X_train_s)
    X_val_p = pca.transform(X_val_s)

    for k in k_candidates:
        knn = KNeighborsClassifier(n_neighbors=k, metric="minkowski", p=2)  # Euclidean
        knn.fit(X_train_p, y_train)
        y_val_pred = knn.predict(X_val_p)

        val_acc = accuracy_score(y_val, y_val_pred)
        val_f1  = f1_score(y_val, y_val_pred, average="macro")

        if val_f1 > best_val_f1:
            best_val_f1 = val_f1
            best_cfg = {
                "n_components": n,
                "k": k,
                "val_acc": val_acc,
                "val_f1_macro": val_f1
            }

print("=== PCA Diagnostics (from Train) ===")
print(f"Components for 95% cumulative variance: {n95}")
print(f"Cumulative variance at {n95} comps: {cumulative[n95-1]:.4f}")

print("\n=== Validation (best config) ===")
print(best_cfg)

# 5) Refit on Train+Val with chosen settings, then test
n_opt = best_cfg["n_components"]
k_opt = best_cfg["k"]

scaler_final = StandardScaler(with_mean=True)
X_trval = np.vstack([X_train, X_val])
y_trval = np.hstack([y_train, y_val])
X_trval_s = scaler_final.fit_transform(X_trval)
X_test_s_final = scaler_final.transform(X_test)

pca_final = PCA(n_components=n_opt, random_state=42).fit(X_trval_s)
X_trval_p = pca_final.transform(X_trval_s)
X_test_p  = pca_final.transform(X_test_s_final)

knn_final = KNeighborsClassifier(n_neighbors=k_opt, metric="minkowski", p=2)
knn_final.fit(X_trval_p, y_trval)

y_test_pred = knn_final.predict(X_test_p)

test_acc = accuracy_score(y_test, y_test_pred)
test_f1  = f1_score(y_test, y_test_pred, average="macro")
report   = classification_report(y_test, y_test_pred, digits=4)

print("\n=== Test Performance ===")
print(f"Test Accuracy:  {test_acc:.4f}")
print(f"Test F1-macro:  {test_f1:.4f}")
print("\nClassification Report:\n", report)


