1. What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?
Answer: K-Nearest Neighbors (KNN) is a simple, supervised learning algorithm that classifies or predicts outcomes for a new data point by finding its 'k' closest neighbors in the training data, using their majority class for classification (voting) or their average value for regression (averaging). It's non-parametric, meaning it makes no assumptions about data distribution, and works on the principle that similar data points cluster together, using distance metrics like Euclidean distance to find neighbors.

  How KNN Works
  Choose 'k': Select the number of neighbors (k) to consider.
  Calculate Distance: Find the distances (e.g., Euclidean, Manhattan) between the new data point and all points in the training set.
  Identify Neighbors: Select the 'k' training points with the smallest distances.
  Predict: Use the neighbors to make a decision.

  KNN for Classification
  Method: majority vote.
  Process: The new point is assigned the class label that appears most frequently among its 'k' neighbors.
  Example: If k=5 and 3 neighbors are 'Class A' and 2 are 'Class B', the new point becomes 'Class A'.

  KNN for Regression
  Method: averaging.
  Process: The predicted value for the new point is the average (mean) of the target values of its 'k' nearest neighbors.
  Example: Predicting house prices: the average price of the k-closest houses gives the estimate for the new house.
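
The steps above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not how scikit-learn implements KNN; the function name `knn_predict` and the toy data are assumptions made for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for one point with brute-force KNN (Euclidean distance)."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)

    # Distance from the new point to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)

    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]

    # Majority vote for classification, mean for regression.
    if task == "classification":
        return Counter(neighbor_targets).most_common(1)[0][0]
    return float(np.mean(neighbor_targets))


# Hypothetical toy data: two well-separated groups on a line.
X_toy = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
labels = ["A", "A", "A", "B", "B", "B"]
prices = [100.0, 110.0, 120.0, 300.0, 310.0, 320.0]

predicted_class = knn_predict(X_toy, labels, [2.5], k=3, task="classification")
predicted_price = knn_predict(X_toy, prices, [2.5], k=3, task="regression")
print(predicted_class, predicted_price)
```

The query point 2.5 sits inside the first group, so its three nearest neighbors all carry label 'A', and the regression estimate is the mean of their prices (100, 110 and 120).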

  Key Considerations
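  Choosing 'k': A small 'k' makes predictions sensitive to noise and outliers (high variance), while a large 'k' smooths over local structure and can blur class boundaries (high bias); cross-validation helps find a good 'k', and an odd 'k' avoids tied votes in binary classification.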
  Distance Metric: Euclidean distance (straight-line) and Manhattan distance (grid-like) are common choices.
  Curse of Dimensionality: Performance can degrade in high-dimensional spaces where distances become less meaningful.
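
To make the two distance metrics concrete, here is a small sketch that computes both as special cases of the Minkowski distance. The helper name `minkowski_distance` and the two sample points are assumptions chosen for illustration.

```python
import numpy as np


def minkowski_distance(a, b, p):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


point_a = [0.0, 0.0]
point_b = [3.0, 4.0]

euclidean = minkowski_distance(point_a, point_b, p=2)  # straight-line distance
manhattan = minkowski_distance(point_a, point_b, p=1)  # sum of per-axis steps
print(euclidean, manhattan)
```

For these two points the straight-line distance is 5 (the 3-4-5 triangle), while the grid distance is 3 + 4 = 7.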

2. What is the Curse of Dimensionality and how does it affect KNN
performance?
Answer: The Curse of Dimensionality refers to the problems that arise as the number of features grows. The volume of the feature space increases exponentially, so a fixed number of samples becomes sparse and exponentially more data is needed to cover the space equally well. At the same time, pairwise distances concentrate: the nearest and farthest points end up almost equally far from a query point. This hurts KNN directly, because the "nearest" neighbors are no longer meaningfully closer than other points, so neighbor selection becomes unreliable, class distinctions blur, and the model generalizes poorly.

  How it Affects KNN
  Unreliable Neighbors: Because distances become similar, the "nearest" neighbors might not be truly similar, diluting their predictive power.
  Overfitting: With sparse data, KNN can easily find "neighbors" that are just noise, leading to models that memorize training data but fail on new data (poor generalization).
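  Computational Cost: A brute-force search computes a distance to every training point, and the cost of each distance grows with the number of features; tree-based indexes such as KD-trees and ball trees also lose most of their speed advantage in high dimensions.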
  Loss of Structure: The inherent structure in lower dimensions (like clusters) gets lost as features increase, making classification boundaries blurry.
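
The "distances become similar" effect can be observed with random data. The sketch below is an illustration under assumed settings (500 uniformly random points, one random query, a fixed seed), and the helper name `relative_contrast` is hypothetical. It measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from a random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return float((distances.max() - distances.min()) / distances.min())


contrast_by_dim = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, contrast in contrast_by_dim.items():
    print(f"{d:>4} dimensions: relative contrast = {contrast:.2f}")
```

In two dimensions the farthest point is many times farther away than the nearest one; in a thousand dimensions the two distances differ by only a small fraction, which is why the nearest neighbors stop being informative.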


3. What is Principal Component Analysis (PCA)? How is it different from
feature selection?

Answer: Principal Component Analysis (PCA) is a feature extraction technique for dimensionality reduction. It creates a smaller set of new, uncorrelated features (Principal Components), each a linear combination of the original features, ordered so that the first components preserve most of the data's variance. Feature Selection, in contrast, chooses a subset of the original features and discards the irrelevant ones, which keeps the model interpretable and reduces noise.

Key Differences
Method: PCA extracts/transforms, Feature Selection selects/filters.
Features: PCA creates new (artificial) features; Feature Selection keeps original features.
Target Variable: PCA is typically unsupervised (no target needed); Feature Selection often uses the target for evaluation (supervised).
Interpretability: Feature selection usually offers better model interpretability because it uses original features, while PCA's components are harder to interpret.
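
The contrast can be seen side by side on the Wine data. This is a minimal sketch: the choice of `SelectKBest` with `f_classif` as the feature selector and of three output columns is an assumption for illustration, and many other selection methods exist.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
y = wine.target

# Feature selection: keep 3 of the original 13 columns (uses the labels).
selector = SelectKBest(score_func=f_classif, k=3).fit(X_scaled, y)
kept_columns = selector.get_support(indices=True)
X_selected = selector.transform(X_scaled)
print("Selected original features:", [wine.feature_names[i] for i in kept_columns])

# PCA: build 3 new columns, each a weighted mix of all 13 (ignores the labels).
pca = PCA(n_components=3).fit(X_scaled)
X_pca = pca.transform(X_scaled)
print("Weights of PC1 on the original features:", np.round(pca.components_[0], 2))
```

Both results have three columns, but the selected columns are unchanged copies of original features, while each principal component has a weight on every original feature.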

4. What are eigenvalues and eigenvectors in PCA, and why are they important?

Answer: Eigenvectors in PCA
Definition: These are the eigenvectors of the data's covariance matrix: directions in the original feature space along which the data varies. Each eigenvector defines one Principal Component (PC), and the data is projected onto the leading ones to obtain the lower-dimensional representation.
Role: They show the orientation of the data's spread; the first eigenvector points in the direction of the most variance, the second (orthogonal to the first) in the direction of the next most, and so on.
Eigenvalues in PCA
Definition: A scalar associated with each eigenvector that quantifies the magnitude of variance along that eigenvector's direction.
Role: They act as a measure of "importance" or "information" content; a larger eigenvalue means more variance (more information) is captured along that PC.
Why They Are Important
Ranking Significance: By calculating eigenvalues, you know which principal components (eigenvectors) hold the most data information.
Dimensionality Reduction: You can discard components with small eigenvalues (low variance) and keep those with large eigenvalues, reducing the number of features (dimensions) while losing little of the total variance, which makes models faster and less prone to the curse of dimensionality.
Data Visualization: They help transform high-dimensional data into 2D or 3D space for easier understanding and plotting.
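
The relationship between eigenvectors, eigenvalues and explained variance can be worked through directly with NumPy. The sketch below uses hypothetical toy data (three features, two of them correlated) and is meant as an illustration of the idea rather than a replacement for a library PCA.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy data: 200 samples, 3 features, the first two correlated.
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2.0 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Center the data and form the covariance matrix (features x features).
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]      # largest variance first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]      # columns are the components

explained_variance_ratio = eigenvalues / eigenvalues.sum()
print("Eigenvalues:", np.round(eigenvalues, 3))
print("Explained variance ratio:", np.round(explained_variance_ratio, 3))

# Project onto the top 2 eigenvectors to get the reduced representation.
X_reduced = X_centered @ eigenvectors[:, :2]
print("Reduced shape:", X_reduced.shape)
```

Each eigenvalue equals the variance of the data after projection onto its eigenvector, which is why dividing by the sum of eigenvalues gives the explained variance ratio.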

5. How do KNN and PCA complement each other when applied in a single pipeline?

Answer: In a single pipeline, Principal Component Analysis (PCA) acts as a preprocessing step for K-Nearest Neighbors (KNN), complementing it primarily by mitigating the "curse of dimensionality," thereby improving computational efficiency and often enhancing the accuracy of the KNN algorithm.
How PCA and KNN Complement Each Other
Addressing the Curse of Dimensionality: KNN's performance deteriorates in high-dimensional spaces because the distances between points become less meaningful and more uniform, making it difficult to find true nearest neighbors. PCA mitigates this by reducing the number of features (dimensions) while preserving most of the variance in the data.
Improving Computational Efficiency: KNN is a distance-based algorithm that, at prediction time, calculates the distance between a new data point and the stored training points. KNN has almost no training cost, so the gain is mainly at prediction: with fewer features, each distance calculation is cheaper and less memory is needed to store the training set, which is particularly beneficial for large datasets.
Reducing Noise and Redundancy: High-dimensional data often contains redundant or noisy features. PCA transforms the original correlated variables into a new set of uncorrelated principal components, and dropping the low-variance components discards much of the noise so that KNN focuses on the dominant patterns. This can keep the KNN model from overfitting to irrelevant features.
Potentially Enhancing Accuracy: By reducing noise and removing redundant information, the PCA-transformed data can lead to a more robust KNN model, and a PCA-KNN pipeline can outperform KNN on the original high-dimensional data. This is not guaranteed: PCA is unsupervised, so the highest-variance directions are not always the most discriminative ones, and accuracy can drop when too few components are kept.
Better Visualization: PCA allows for the reduction of data to two or three principal components, which can then be easily visualized to understand the data's underlying structure before applying the KNN algorithm.
In summary, PCA enhances KNN by acting as a powerful feature extraction and dimensionality reduction tool, creating a more efficient and effective feature space for KNN's distance-based classification or regression task.
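
A single pipeline of this kind might look like the following sketch on the Wine dataset. The step names and the values `n_components=5` and `n_neighbors=5` are illustrative assumptions, not tuned choices.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale -> reduce -> classify. The values n_components=5 and n_neighbors=5
# are illustrative starting points, not tuned values.
pca_knn = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Each fold refits the scaler and PCA on its own training portion only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pca_knn, X, y, cv=cv, scoring="accuracy")
print(f"Mean CV accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
```

Wrapping the three steps in one `Pipeline` object means the scaler and PCA are always fit on the same data as the classifier, which keeps test samples out of the preprocessing.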

In [1]:
# Dataset: Use the Wine Dataset from sklearn.datasets.load_wine().
# Question 6: Train a KNN Classifier on the Wine dataset with and without feature
# scaling. Compare model accuracy in both cases.
# (Include your Python code and output in the code box below.)
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split the dataset into training and testing sets (80% train, 20% test)
# Use a fixed random state for reproducible results
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("--- Question 6: KNN Classifier Accuracy Comparison ---")

# --- Case 1: KNN without Feature Scaling ---

print("\nCase 1: Without Feature Scaling")
# Initialize the KNN Classifier
# Using n_neighbors=5, a common default
knn_no_scale = KNeighborsClassifier(n_neighbors=5)

# Train the model
knn_no_scale.fit(X_train, y_train)

# Predict on the test set
y_pred_no_scale = knn_no_scale.predict(X_test)

# Calculate accuracy
accuracy_no_scale = accuracy_score(y_test, y_pred_no_scale)
print(f"Accuracy without scaling: {accuracy_no_scale:.4f}")


# --- Case 2: KNN with Feature Scaling ---

print("\nCase 2: With Feature Scaling")
# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform the training data
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data using the *same* scaler
X_test_scaled = scaler.transform(X_test)

# Initialize the KNN Classifier
knn_scaled = KNeighborsClassifier(n_neighbors=5)

# Train the model on the scaled data
knn_scaled.fit(X_train_scaled, y_train)

# Predict on the scaled test set
y_pred_scaled = knn_scaled.predict(X_test_scaled)

# Calculate accuracy
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Accuracy with scaling:    {accuracy_scaled:.4f}")

# --- Comparison Summary ---
print("\n--- Summary ---")
print(f"Accuracy without scaling: {accuracy_no_scale:.4f}")
print(f"Accuracy with scaling:    {accuracy_scaled:.4f}")

if accuracy_scaled > accuracy_no_scale:
    print("\nConclusion: Feature scaling significantly improved the KNN model's accuracy.")
elif accuracy_scaled < accuracy_no_scale:
    print("\nConclusion: Accuracy was slightly better without scaling in this specific test split (uncommon for KNN).")
else:
    print("\nConclusion: Feature scaling did not change the accuracy in this specific test split.")



--- Question 6: KNN Classifier Accuracy Comparison ---

Case 1: Without Feature Scaling
Accuracy without scaling: 0.7222

Case 2: With Feature Scaling
Accuracy with scaling:    0.9444

--- Summary ---
Accuracy without scaling: 0.7222
Accuracy with scaling:    0.9444

Conclusion: Feature scaling significantly improved the KNN model's accuracy.
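
The gap between the two accuracies above comes from differences in feature scale: without scaling, the features with the largest numeric spread dominate the Euclidean distance. A short sketch that inspects the per-feature standard deviations of the Wine data (variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
feature_std = wine.data.std(axis=0)

# List features from the largest to the smallest spread.
for idx in np.argsort(feature_std)[::-1]:
    print(f"{wine.feature_names[idx]:<30} std = {feature_std[idx]:.3f}")

spread_ratio = feature_std.max() / feature_std.min()
print(f"Largest / smallest std: {spread_ratio:.0f}")
```

The `proline` feature varies over hundreds of units while several others vary by less than one, so unscaled distances are driven almost entirely by `proline`. `StandardScaler` puts all features on a comparable footing.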


In [3]:
# Question 7: Train a PCA model on the Wine dataset and print the explained variance
# ratio of each principal component.
# (Include your Python code and output in the code box below.)

import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd

# 1. Load the wine dataset
wine = load_wine()
X = wine.data
y = wine.target
feature_names = wine.feature_names

# 2. Standardize the data (PCA is sensitive to scaling)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Train a PCA model
# Set n_components=None to keep all components (13 in this case)
pca = PCA(n_components=None)
pca.fit(X_scaled)

# 4. Get the explained variance ratio of each principal component
explained_variance_ratio = pca.explained_variance_ratio_

# 5. Print the results
print("Explained Variance Ratio of Each Principal Component:")
for i, ratio in enumerate(explained_variance_ratio):
    print(f"Principal Component {i+1}: {ratio:.4f}")

# You can also print the cumulative explained variance to see how many components are needed for a certain threshold
cumulative_variance = np.cumsum(explained_variance_ratio)
print(f"\nCumulative Explained Variance: {cumulative_variance[-1]:.4f} (should be 1.0)")


Explained Variance Ratio of Each Principal Component:
Principal Component 1: 0.3620
Principal Component 2: 0.1921
Principal Component 3: 0.1112
Principal Component 4: 0.0707
Principal Component 5: 0.0656
Principal Component 6: 0.0494
Principal Component 7: 0.0424
Principal Component 8: 0.0268
Principal Component 9: 0.0222
Principal Component 10: 0.0193
Principal Component 11: 0.0174
Principal Component 12: 0.0130
Principal Component 13: 0.0080

Cumulative Explained Variance: 1.0000 (should be 1.0)
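
The ratios printed above can also answer how many components are needed to reach a given variance threshold. The sketch below copies the printed (rounded) values into an array; the helper name `components_needed` and the thresholds are illustrative assumptions.

```python
import numpy as np

# Explained variance ratios copied from the output printed above.
ratios = np.array([
    0.3620, 0.1921, 0.1112, 0.0707, 0.0656, 0.0494, 0.0424,
    0.0268, 0.0222, 0.0193, 0.0174, 0.0130, 0.0080,
])
cumulative = np.cumsum(ratios)


def components_needed(threshold):
    """Smallest number of leading components whose ratios sum to >= threshold."""
    return int(np.argmax(cumulative >= threshold) + 1)


for threshold in (0.80, 0.90, 0.95):
    print(f"{threshold:.0%} of variance: {components_needed(threshold)} components")
```

With these values, 5 components cover about 80% of the variance, 8 cover about 90%, and 10 are needed to pass 95%. The first two components alone account for roughly 55%.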


In [6]:
# Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
# components). Compare the accuracy with the original dataset.
# (Include your Python code and output in the code box below.)
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

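# 1. Load the dataset. Note: this cell uses the Iris dataset as an example rather than
# the Wine dataset named in the assignment, so the output below describes Iris;
# replace load_iris with load_wine to run the same comparison on Wine.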
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Standardize the features
# Standardization is important for both PCA and KNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- KNN on Original Dataset ---
knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)
accuracy_original = accuracy_score(y_test, y_pred_original)

# --- KNN on PCA-transformed Dataset (retain top 2 components) ---

# 4. Apply PCA to the scaled data, retaining 2 components
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# 5. Train a KNN classifier on the PCA-transformed data
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)

# 6. Compare the accuracies
print(f"Accuracy on Original Dataset (scaled): {accuracy_original:.4f}")
print(f"Accuracy on PCA-transformed Dataset (2 components): {accuracy_pca:.4f}")

# Optional: Print explained variance by 2 components
print(f"Total variance explained by 2 components: {np.sum(pca.explained_variance_ratio_):.4f}")


Accuracy on Original Dataset (scaled): 1.0000
Accuracy on PCA-transformed Dataset (2 components): 0.9556
Total variance explained by 2 components: 0.9521


In [7]:
# Question 9: Train a KNN Classifier with different distance metrics (euclidean,
# manhattan) on the scaled Wine dataset and compare the results.
# (Include your Python code and output in the code box below.)

import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Wine dataset
wine = load_wine()
X, y = wine.data, wine.target

# 2. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Scale the features (essential for distance-based algorithms like KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Function to train and evaluate KNN with a specific distance metric
def train_and_evaluate_knn(metric):
    # 'euclidean' and 'manhattan' are equivalent to metric='minkowski'
    # with p=2 and p=1 respectively.
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)

    # Train the model
    knn.fit(X_train_scaled, y_train)

    # Predict on the test set
    y_pred = knn.predict(X_test_scaled)

    # Calculate and return accuracy
    return accuracy_score(y_test, y_pred)

# Train with Euclidean distance
accuracy_euclidean = train_and_evaluate_knn('euclidean')

# Train with Manhattan distance
accuracy_manhattan = train_and_evaluate_knn('manhattan')

# Display results
print(f"Accuracy with Euclidean distance: {accuracy_euclidean:.4f}")
print(f"Accuracy with Manhattan distance: {accuracy_manhattan:.4f}")

# Compare and conclude
if accuracy_euclidean > accuracy_manhattan:
    print("\nEuclidean distance performed slightly better on the scaled Wine dataset.")
elif accuracy_manhattan > accuracy_euclidean:
    print("\nManhattan distance performed slightly better on the scaled Wine dataset.")
else:
    print("\nBoth distance metrics performed equally well on the scaled Wine dataset.")



Accuracy with Euclidean distance: 0.9630
Accuracy with Manhattan distance: 0.9630

Both distance metrics performed equally well on the scaled Wine dataset.


10. You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data
(Include your Python code and output in the code box below.)

Answer: The following explains the methodology for applying PCA and KNN to a high-dimensional biomedical dataset, with accompanying conceptual Python code demonstrating the process.
Methodology Explanation
1. Use PCA to Reduce Dimensionality
Principal Component Analysis (PCA) is an unsupervised learning technique used to transform high-dimensional data into a new, lower-dimensional subspace while retaining as much variance as possible. In a gene expression context, it identifies the primary axes (principal components) of variation across thousands of genes, effectively compressing redundant information [1].
The process involves:
Standardization: Scaling the gene expression data (e.g., using StandardScaler in Python) so that each gene has a mean of zero and unit variance, ensuring all features are weighted equally.
Transformation: Applying the PCA algorithm to the standardized data, projecting the original features onto the selected principal components.
2. Decide How Many Components to Keep
The number of components is a crucial hyperparameter determined by analyzing the trade-off between dimensionality reduction and information loss.
The standard method is to use a scree plot and calculate the cumulative explained variance:
A scree plot visualizes the variance explained by each individual component.
A cumulative explained variance plot shows the total variance explained as components are added.
We typically select the minimum number of components that capture a substantial percentage of the total variance (e.g., 90% to 95%) or look for an "elbow point" in the scree plot where the drop-off in explained variance levels out [1].
3. Use KNN for Classification Post-Dimensionality Reduction
K-Nearest Neighbors (KNN) is a non-parametric, simple classification algorithm. The challenge with standard KNN in high-dimensional space ("curse of dimensionality") is that distances become less meaningful, leading to poor performance [1].
By applying KNN to the PCA-transformed data:
We operate in a lower-dimensional, noise-reduced subspace where distance metrics are more reliable.
The model learns the classification boundaries based on the proximity of samples in this new feature space.
4. Evaluate the Model
Robust evaluation is vital, especially with small sample sizes common in biomedical research.
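Cross-Validation: Instead of a single train/test split, we use k-fold cross-validation (or stratified k-fold, to maintain class proportions). This provides a more reliable estimate of the model’s true performance by training and testing on different data subsets multiple times [1]. The scaler and PCA should be refit on the training portion of each fold (for example, by wrapping them and the classifier in a single pipeline) so that no information from the held-out samples leaks into the transformation.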
Metrics: We would track relevant metrics like accuracy, precision, recall, and the F1-score to assess performance comprehensively.
Hyperparameter Tuning: A grid search within the cross-validation framework would be used to find the optimal number of neighbors (k) for the KNN algorithm.
5. Justify this Pipeline to Stakeholders
This pipeline is a robust solution for real-world biomedical data because it directly addresses common challenges in genomics:
Mitigates Overfitting and the "Curse of Dimensionality": PCA discards low-variance directions, which often carry noise and redundancy, and transforms the data into a concise representation, which reduces the risk of overfitting on irrelevant variables [1].
Improved Computational Efficiency: Reducing the number of features from thousands to dozens makes training faster and requires less memory.
Interpretability (of features): While the components themselves are abstract, the methodology is transparent. We can quantify exactly how much information (variance) is retained in the reduced dataset.
Data-Driven Approach: The key choices (number of components, optimal K for KNN) are made from the data and checked through cross-validation, so the reported performance comes from samples the model did not see during fitting.
"""import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
import matplotlib.pyplot as plt

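# --- 1. Simulate a High-Dimensional Gene Expression Dataset ---
# 100 samples, 2000 genes (features), 3 cancer types (classes)
# The features and labels are independent random draws, so accuracy near
# chance (about 1/3) is expected; the code only demonstrates the workflow.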
np.random.seed(42)
n_samples, n_features, n_classes = 100, 2000, 3
X = np.random.rand(n_samples, n_features)
y = np.random.randint(0, n_classes, n_samples)

print(f"Original data shape: {X.shape}\n")

# --- 2. Standardization ---
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

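# --- 3. Use PCA to Reduce Dimensionality ---
# Fit PCA with all available components to analyze variance first.
# With 100 samples and 2000 features, PCA can return at most
# min(n_samples, n_features) = 100 components, so n_components=n_features
# would raise a ValueError; n_components=None keeps the maximum allowed.
pca_full = PCA(n_components=None)
X_pca_full = pca_full.fit_transform(X_scaled)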

# --- 4. Decide How Many Components to Keep (Analysis) ---
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

# Find number of components to explain 95% variance
n_components_95 = np.argmax(cumulative_variance >= 0.95) + 1
# In a real environment, you would plot the cumulative explained variance:
# plt.figure(figsize=(8, 5))
# plt.plot(cumulative_variance)
# plt.xlabel('Number of Components')
# plt.ylabel('Cumulative Explained Variance')
# plt.title('Scree Plot/Explained Variance Analysis')
# plt.show()

print(f"Keeping {n_components_95} components explains >95% of the variance.\n")

# Re-run PCA with the chosen number of components
pca = PCA(n_components=n_components_95)
X_pca = pca.fit_transform(X_scaled)
print(f"Reduced data shape after PCA: {X_pca.shape}\n")

# --- 5. Use KNN for Classification Post-Dimensionality Reduction ---
# We will use cross-validation for robust evaluation
knn = KNeighborsClassifier(n_neighbors=5) # K=5 chosen as a starting point

# --- 6. Evaluate the Model using Stratified K-Fold Cross-Validation ---
# Use 5 folds for evaluation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

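# Calculate cross-validation scores (e.g., accuracy)
# Note: the scaler and PCA above were fit on all samples before splitting, so
# the held-out folds influenced the transformation; fitting both inside each
# fold (e.g., with a Pipeline) gives a cleaner estimate.
cv_scores = cross_val_score(knn, X_pca, y, cv=skf, scoring='accuracy')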

print("--- Model Evaluation ---")
print(f"Cross-validation accuracy scores for each fold: {cv_scores}")
print(f"Mean CV Accuracy: {np.mean(cv_scores):.4f}")
print(f"Std Dev of CV Accuracy: {np.std(cv_scores):.4f}")

# Optional: Further hyperparameter tuning for optimal K would involve GridSearch within CV
"""
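
The cross-validation and grid-search steps described in the evaluation section can be combined so that scaling and PCA are refit inside every fold. The following is a minimal sketch under stated assumptions: the data comes from `make_classification` as a hypothetical stand-in for gene expression measurements, and the candidate grid values and the macro-F1 scoring choice are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for gene expression data: 90 samples, 500 features,
# 3 classes, with only a few informative features.
X, y = make_classification(
    n_samples=90, n_features=500, n_informative=10, n_redundant=20,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# Candidate values are illustrative; n_components must stay below the
# number of training samples in each fold (72 here).
param_grid = {
    "pca__n_components": [5, 10, 20],
    "knn__n_neighbors": [3, 5, 7],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="f1_macro")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best mean macro-F1 across folds: {search.best_score_:.4f}")
```

Because the best score is selected over the same folds used for tuning, it is still somewhat optimistic; with enough samples, a separate held-out test set or nested cross-validation gives a fairer final estimate.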
