1.What is K-Nearest Neighbors (KNN) and how does it work?
Ans.KNN is a lazy learning algorithm, meaning it doesn't learn a model during training. Instead, it memorizes the training dataset and uses it when making predictions.

How Does KNN Work?
Here's a step-by-step breakdown of how KNN works:

Choose the number of neighbors (K): This is a user-defined constant. It represents how many nearby data points will be considered to make a prediction.

Calculate distance: When given a new data point to classify, KNN calculates the distance (commonly Euclidean distance) between the new point and all points in the training dataset.

$$\text{Euclidean Distance} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$$

Find the K nearest neighbors: Sort the distances and select the K closest data points.

Make a prediction:

Classification: The class most common among the K nearest neighbors is assigned to the new point (majority vote).

Regression: The average (or weighted average) of the values of the K nearest neighbors is used.
 Example (Classification)
Suppose you're classifying fruits based on size and color. You want to classify a new fruit. KNN:

Checks the distance of this new fruit to every fruit in the training set.

Picks the K closest fruits.

Assigns the most frequent fruit type among those K neighbors.
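
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function name `knn_predict` and the toy fruit data (size in cm and a color score) are hypothetical and were made up for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: [size in cm, color score from 0 (green) to 1 (red)]
X_train = np.array([[7.0, 0.9], [7.5, 0.8], [6.8, 0.85],
                    [12.0, 0.2], [11.5, 0.3], [12.5, 0.25]])
y_train = np.array(["apple", "apple", "apple", "melon", "melon", "melon"])

prediction = knn_predict(X_train, y_train, np.array([7.2, 0.7]), k=3)
print(prediction)
```

Note that there is no training step: all the work happens inside `knn_predict`, which is what "lazy learning" means in practice.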



2.What is the difference between KNN Classification and KNN Regression?
Ans.K-Nearest Neighbors (KNN) uses the same underlying algorithm for both classification and regression. The key difference lies in how the prediction is made from the K nearest neighbors.

🧩 Difference Between KNN Classification and KNN Regression
Aspect	KNN Classification	KNN Regression
Prediction Type	Predicts a class/label	Predicts a continuous value
Output	Discrete (e.g., "dog", "cat", "apple")	Continuous (e.g., 42.7, 15.3)
Decision Rule	Majority vote among K neighbors	Average (or weighted average) of neighbors' values
Evaluation Metric	Classification metrics (e.g., accuracy, error rate)	Regression error (e.g., MSE, MAE)
Use Case Examples	Email spam detection, disease diagnosis	Predicting house prices, stock forecasting
Boundary Behavior	Creates distinct class regions	Produces piecewise-constant predictions when neighbors are weighted uniformly
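
The two decision rules can be compared side by side. The sketch below assumes scikit-learn's `KNeighborsClassifier` and `KNeighborsRegressor` and uses hypothetical 1-D data invented for illustration; the same three neighbors produce a voted label in one case and an averaged value in the other.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Hypothetical 1-D training data
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])                   # discrete labels
y_value = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])    # continuous targets

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_value)

x_new = np.array([[2.5]])
label = clf.predict(x_new)[0]   # majority vote of the 3 nearest labels
value = reg.predict(x_new)[0]   # mean of the 3 nearest target values
print(label, value)
```

The three nearest neighbors of 2.5 are the points at 1, 2, and 3, so the classifier votes for class 0 and the regressor returns their mean.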


3.What is the role of the distance metric in KNN?
Ans.In KNN, we classify or predict a new data point based on the nearest data points in the training set. The distance metric defines what we mean by "nearest."
Example: Euclidean vs. Manhattan
Imagine you’re in a city with grid-like roads:

Euclidean Distance: "As the crow flies" (straight-line distance).

Manhattan Distance: The path you’d actually walk on city blocks (like a taxi).

In high-dimensional data, Manhattan can sometimes be more stable because Euclidean gets distorted due to the curse of dimensionality.
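
A small numeric check makes the difference concrete. The two points below are hypothetical; the sketch computes both distances with plain NumPy.

```python
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance
manhattan = np.sum(np.abs(a - b))          # sum of per-axis distances
print(euclidean, manhattan)
```

In scikit-learn, the metric is selected through the `metric` parameter of `KNeighborsClassifier`, for example `metric="manhattan"`.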

4.What is the Curse of Dimensionality in KNN?
Ans.The Curse of Dimensionality is a critical concept in KNN (and many other machine learning algorithms). It refers to the problems and challenges that arise when working with high-dimensional data (i.e., data with many features).

As the number of features (dimensions) increases:

Data becomes sparse: In high dimensions, data points are more spread out, making it harder to find "close" neighbors.

Distance metrics lose meaning: In higher dimensions, distances between data points tend to become very similar, reducing the effectiveness of KNN which relies on distinguishing between "near" and "far".

Computational cost increases: More dimensions = more calculations for each distance = slower predictions.

Risk of overfitting: With more features, KNN may fit noise instead of learning useful patterns, especially with small datasets.
Intuition
Imagine trying to find neighbors in:

Dimension	Points Needed to Fill the Space Well
1D	10 points
2D	100 points
3D	1,000 points
100D	10¹⁰⁰ points (!), which is impractical
So, to maintain the same data density, data requirements grow exponentially with dimensions.

 How It Affects KNN
KNN depends on "local neighborhoods".

In high dimensions, all points tend to be far from each other, so "nearest neighbors" may not actually be similar.

This leads to poor generalization and lower accuracy.
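
The loss of contrast between "near" and "far" can be observed in a small simulation. This is a rough sketch under stated assumptions: the points are drawn uniformly from the unit cube, and the helper name `relative_contrast` is hypothetical. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(dim, n_points=500):
    """(farthest - nearest) / nearest distance from a random query point."""
    points = rng.random((n_points, dim))   # uniform points in the unit cube
    query = rng.random(dim)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()


contrasts = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, c in contrasts.items():
    print(f"{dim:>4} dimensions: relative contrast = {c:.2f}")
```

As the dimension grows, the contrast shrinks toward zero, meaning the nearest and farthest points sit at almost the same distance from the query.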

 How to Deal With It
Feature selection: Choose only the most relevant features.

Dimensionality reduction: Use techniques like:

PCA (Principal Component Analysis)

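t-SNE (mainly for visualization; it cannot transform new, unseen points, so it is rarely used as a preprocessing step for KNN)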

LDA (Linear Discriminant Analysis)

Normalize features: Helps mitigate distance distortion.

Use other algorithms: Some models (e.g., decision trees, random forests) may handle high-dimensional data better than KNN.

5.How can we choose the best value of K in KNN?
Ans.Choosing the best value of K in K-Nearest Neighbors (KNN) is super important—it can make a huge difference in your model’s performance. The value of K controls the bias-variance trade-off, so choosing it wisely helps balance underfitting vs. overfitting.

The goal is to find the K that gives the best accuracy on unseen data (i.e., the K that generalizes well).
Here's How to Choose the Best K:
1. Use Cross-Validation (CV)
K-Fold Cross-Validation is the standard way to test multiple values of K.

Try several values of K (like 1 to 30), evaluate performance on validation sets, and choose the K with the best average accuracy (or lowest error).

2. Plot the Error vs. K Curve
Try a range of K values (e.g., 1 to 30).

For each K, compute the validation error or accuracy.

Plot it:

If K is too small (e.g., 1): Model overfits, high variance.

If K is too large: Model underfits, high bias.

Ideal K is where the error is minimized, just before the curve levels off.

3. Use Odd Values of K (for Classification)
Avoid ties in majority voting by using odd K (like 3, 5, 7) especially for binary classification.

With more than two classes, an odd K does not guarantee a tie-free vote, so this rule is less useful there.

In [None]:
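from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

# Iris is used here only as a stand-in dataset so the cell runs on its own
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

k_range = range(1, 31)
scores = []

# Mean 5-fold cross-validated accuracy for each candidate K
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X_train, y_train, cv=5)
    scores.append(score.mean())

plt.plot(k_range, scores)
plt.xlabel('Value of K')
plt.ylabel('Cross-Validated Accuracy')
plt.title('K vs Accuracy')
plt.show()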


6.What are KD Tree and Ball Tree in KNN?
Ans.When using KNN with large datasets, searching through every training point to find the nearest neighbors becomes slow and inefficient. That’s where KD Tree and Ball Tree come in — they’re data structures that help speed up nearest neighbor search.

 What is a KD Tree?
KD Tree stands for k-dimensional tree.

It’s a binary tree used to partition data in a k-dimensional space.

At each level of the tree, the data is split along one of the dimensions (e.g., x, y, z).

It organizes the points to reduce the number of distance calculations needed when searching for nearest neighbors.

 How It Works:
Choose a dimension (e.g., x-axis) and split the data at the median.

Recursively split left and right subspaces along other dimensions.

When querying, the tree is searched in a way that prunes large chunks of data that can't possibly be the nearest neighbors.

 Best For:
Low to medium dimensions (up to ~20).

Faster than brute-force for moderate-sized datasets.

 What is a Ball Tree?
Ball Tree is another tree-based structure, but instead of splitting on axes, it divides space into nested balls (spheres).

Each node contains a "ball" (a point + radius) that encloses a subset of the data.

During search, it uses triangle inequality to prune entire balls that are too far from the query point.

 How It Works:
Build tree by grouping nearby points into balls.

Recursively divide each ball into smaller sub-balls.

During search, skip balls that are too far from the query.

 Best For:
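Higher-dimensional data where KD Trees lose their advantage (in very high dimensions, both structures approach brute-force cost).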

Non-axis-aligned data distributions.

 KD Tree vs. Ball Tree
Feature	KD Tree	Ball Tree
Split method	Axis-aligned median splits	Spherical partitions (balls)
Best for	Low to medium dimensions	High dimensions
Speed	Very fast for low-dim data	Better for complex, high-dim data
Scikit-learn support	algorithm='kd_tree'	algorithm='ball_tree'
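
Both structures can also be built and queried directly. The following is a minimal sketch, assuming scikit-learn's `KDTree` and `BallTree` classes and hypothetical random 3-D data; it checks that both trees return the same neighbors as a brute-force search, since they are exact methods that only skip unnecessary distance calculations.

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree

rng = np.random.default_rng(42)
X = rng.random((1000, 3))        # hypothetical 3-D points
query = rng.random((1, 3))

# Build each index once, then query the 5 nearest neighbors
kd_dist, kd_idx = KDTree(X).query(query, k=5)
ball_dist, ball_idx = BallTree(X).query(query, k=5)

# Brute-force reference: distance to every point, then sort
brute_idx = np.argsort(np.linalg.norm(X - query, axis=1))[:5]

print(kd_idx[0], ball_idx[0], brute_idx)
```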


In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Use KD Tree
knn_kd = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')

# Use Ball Tree
knn_ball = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree')


7.When should you use KD Tree vs. Ball Tree?
Ans.Choosing between KD Tree and Ball Tree depends mainly on your dataset’s dimensionality, size, and the distribution of data.
When to Use KD Tree:
Your dataset has less than ~20 features.

Data is well-distributed (not too skewed or clustered).

You want fast, exact nearest neighbor queries.

Your application needs deterministic performance (no randomness).

Example use cases:

Image classification with simple color histograms.

2D/3D spatial data (e.g., map coordinates, game development).

 When to Use Ball Tree:
Your dataset has more than ~20 features (high-dimensional).

Data is not aligned to axes or is clustered/spherical in nature.

You need better scalability in complex or messy spaces.

Example use cases:

Text embeddings (e.g., word2vec, BERT features).

Bioinformatics data.

Complex similarity search where distances aren’t cleanly separable.



8.What are the disadvantages of KNN?
Ans.Although K-Nearest Neighbors (KNN) is simple and powerful, it has several key disadvantages that can limit its use in real-world applications, especially as datasets grow in size and complexity.
Disadvantages of KNN
1. 🐢 Slow Prediction Time (Lazy Learning)
KNN doesn’t "learn" during training. It stores the entire training set and defers all computation until prediction.

At prediction time, it must compute the distance between the new point and every training point.

Time complexity: $O(n \cdot d)$, where

$n$ = number of training samples

$d$ = number of features

 Big issue with large datasets.

2. Sensitive to Feature Scale
If features are not normalized (e.g., height in cm vs. weight in kg), the algorithm will favor features with larger scales.

Distance calculations (like Euclidean) can get distorted.

Solution: Use StandardScaler or MinMaxScaler before fitting KNN.

3.  Curse of Dimensionality
In high-dimensional spaces, all points tend to be far apart.

The concept of "nearest" becomes less meaningful, leading to poor accuracy.

Distances start to converge, making it hard to distinguish between classes.

 Dimensionality reduction (like PCA) can help.

4.  Memory Inefficient
KNN stores all training data, which means:

High memory usage

Slow for large datasets

Consider using KD Tree, Ball Tree, or approximate nearest neighbor techniques to speed up the search; note that the training data still has to be kept in memory.

5. Sensitive to Noise & Outliers
A few noisy data points can skew predictions, especially with small K values.

KNN doesn’t inherently filter or handle noise.

 Using a larger K or distance-weighted voting can reduce this effect.

6. No Model Interpretability
KNN doesn’t produce a model with parameters or coefficients.

Hard to explain why a prediction was made (other than pointing to neighbors).

7. Choosing the Right K is Tricky
Small K → overfitting

Large K → underfitting

There's no universal "best K" — it must be tuned for each problem.

Summary Table
Disadvantage	Impact	Possible Fix
Slow prediction	Poor scalability	Use KD Tree, Ball Tree, or ANN
Needs feature scaling	Skewed distance calculations	Normalize/standardize features
High dimensionality issues	Lower accuracy	Apply PCA or feature selection
Sensitive to noise	Inconsistent predictions	Use larger K or weighted KNN
High memory usage	Doesn’t scale to big data	Use approximate neighbors


9.How does feature scaling affect KNN?
Ans.Feature scaling is critical in K-Nearest Neighbors (KNN) because the algorithm relies on distance calculations (like Euclidean, Manhattan, etc.) to determine which neighbors are "closest."

If your features are on different scales or units, KNN can make very biased decisions.

 Why Feature Scaling Matters in KNN
In KNN, we compute distance like this:

$$\text{Euclidean distance} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$$

If one feature has a much larger range, it will dominate the distance — even if it's not the most important feature.

Example:
Suppose you’re classifying fruits using:

Feature	Scale
Height	1 to 10 cm
Weight	100 to 1000 g
Without scaling, weight will completely outweigh height in the distance metric.

So, even if height is a better predictor, KNN might ignore it due to scale differences. Not good.
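
The effect can be reproduced numerically. The fruit measurements below are hypothetical values chosen to fit the ranges in the table, and `dist` is a small helper defined only for this sketch. Before scaling, fruit A looks closest to a fruit with a very different height; after Min-Max scaling, it is closest to the fruit with a similar height.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical fruits: [height in cm, weight in g]
X = np.array([[1.0, 100.0],
              [10.0, 1000.0],
              [2.0, 500.0],    # fruit A
              [9.0, 520.0],    # fruit B: very different height, similar weight
              [2.5, 700.0]])   # fruit C: similar height, different weight


def dist(u, v):
    """Euclidean distance between two feature vectors."""
    return np.linalg.norm(u - v)


# Unscaled: weight differences dominate the distance
raw_ab, raw_ac = dist(X[2], X[3]), dist(X[2], X[4])

# Scaled to [0, 1]: both features contribute on equal footing
X_scaled = MinMaxScaler().fit_transform(X)
scaled_ab = dist(X_scaled[2], X_scaled[3])
scaled_ac = dist(X_scaled[2], X_scaled[4])

print(f"raw:    A-B = {raw_ab:.2f}, A-C = {raw_ac:.2f}")
print(f"scaled: A-B = {scaled_ab:.2f}, A-C = {scaled_ac:.2f}")
```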

 Feature Scaling Fixes This
Common techniques:

Technique	What It Does	Use When
Min-Max Scaling	Scales features to [0, 1] range	General use
Standardization (Z-score)	Centers around 0 with std. dev. 1	When data is normally distributed
Robust Scaler	Uses median and IQR (handles outliers well)	When data has outliers
Scikit-learn Example:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Wrap scaling + KNN in a pipeline
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)


10.What is PCA (Principal Component Analysis)?
Ans.Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction. It transforms a dataset with many features into a new set of features called principal components, which capture the most important patterns or directions of variance in the data.

Why Use PCA
To reduce the number of features while preserving as much information as possible.

To remove noise and redundancy.

To visualize high-dimensional data in 2D or 3D.

To speed up machine learning algorithms and reduce overfitting.

How PCA Works (Step-by-Step)
Standardize the Data: PCA is sensitive to scale, so features are typically standardized to have zero mean and unit variance.

Compute the Covariance Matrix: This matrix shows how the features vary with each other.

Calculate Eigenvalues and Eigenvectors: These represent the magnitude and direction of variance.

Select Principal Components: Choose the top k eigenvectors (based on the largest eigenvalues) that capture the most variance.

Transform the Data: Project the original data onto the new axes (principal components) to get the reduced dataset.

Key Terms
Principal Components: New axes or directions that capture the most variance in the data.

Explained Variance: How much of the total variance is captured by each principal component.

Dimensionality Reduction: Keeping only the top k principal components instead of all original features.

When to Use PCA
When you have many features and want to reduce them to a smaller set.

When features are highly correlated.

Before applying algorithms that are sensitive to the curse of dimensionality, such as KNN or clustering.

11. How does PCA work?
Ans.1. Standardize the Data
PCA is affected by the scale of the features, so we first standardize the data:

For each feature, subtract the mean and divide by the standard deviation.

This gives each feature a mean of 0 and standard deviation of 1.

2. Compute the Covariance Matrix
The covariance matrix shows how the features vary with respect to one another.

It helps us understand the relationships and dependencies between features.

If your data matrix is X (after standardization), the covariance matrix is:

$$\text{Cov}(X) = \frac{1}{n-1} X^T X$$
3. Calculate Eigenvalues and Eigenvectors
Eigenvectors give the directions (principal components).

Eigenvalues tell how much variance is explained by each principal component.

These are computed from the covariance matrix.

The first principal component corresponds to the eigenvector with the largest eigenvalue.

The second principal component is the one with the second-largest eigenvalue, and so on.

4. Select Top k Principal Components
Sort eigenvectors by their eigenvalues in descending order.

Choose the top k eigenvectors (where k is the number of dimensions you want to keep).

These form the new basis for your data.

5. Transform the Data
Multiply the original standardized data by the matrix of selected eigenvectors.

This projects the data onto the new lower-dimensional space.

The transformed data is:

$$Z = X \cdot W$$
Where:

X is the standardized data matrix.

W is the matrix of top k eigenvectors.

Z is the reduced dataset in the new coordinate system.
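
The five steps can be followed line by line in NumPy. This is a sketch for illustration, not how a library necessarily implements PCA: the synthetic data (three features, two of them strongly correlated) is hypothetical, and the variable names `X`, `W`, and `Z` mirror the symbols used above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 3 features, two of them strongly correlated
base = rng.normal(size=(200, 1))
data = np.hstack([base + 0.1 * rng.normal(size=(200, 1)),
                  2 * base + 0.1 * rng.normal(size=(200, 1)),
                  rng.normal(size=(200, 1))])

# 1. Standardize: zero mean and unit standard deviation per feature
X = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# 2. Covariance matrix: Cov(X) = X^T X / (n - 1)
n = X.shape[0]
cov = X.T @ X / (n - 1)

# 3. Eigen-decomposition (eigh handles symmetric matrices, ascending order)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort by eigenvalue, largest first, and keep the top k eigenvectors
order = np.argsort(eigenvalues)[::-1]
k = 2
W = eigenvectors[:, order[:k]]

# 5. Project the standardized data: Z = X . W
Z = X @ W
print(Z.shape, eigenvalues[order])
```

The variance of each column of `Z` equals the corresponding eigenvalue, which is why eigenvalues are read as "variance explained".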



12.What is the geometric intuition behind PCA?
Ans.The geometric intuition behind Principal Component Analysis (PCA) is that it finds a new coordinate system (a new set of axes) for your data that captures the directions of maximum variance. These new axes are orthogonal (perpendicular) and are called principal components.

Here’s a step-by-step geometric interpretation:

1. Data as Points in Space
Imagine your dataset as a cloud of points in a multi-dimensional space:

Each data point is a vector.

If you have 2 features, you’re in a 2D space; 3 features → 3D space; more features → high-dimensional space.

2. Principal Components as New Axes
PCA rotates this cloud of points so that:

The first principal component (PC1) points in the direction of maximum variance in the data.

The second principal component (PC2) is orthogonal to the first and points in the direction of the next highest variance.

This continues for all components.

Think of fitting a line or plane through the data that best explains how it is spread out.

3. Projection Onto the Principal Components
Once you have these new axes:

You project the original data onto them.

This gives you new coordinates for each data point in the rotated system.

If you keep only the top k components, you effectively compress the data while preserving most of its structure.

4. Dimensionality Reduction
Geometrically, reducing dimensions means:

Instead of describing each point by its coordinates in the full space (e.g., 10D),

You describe it using its coordinates along the most meaningful directions (e.g., top 2 or 3 principal components).

This flattens the data onto a lower-dimensional subspace that preserves its overall shape and variation.

Visual Example (in 2D):
Imagine an elliptical cloud of points stretched diagonally in 2D space:

The longest axis of the ellipse is PC1 (most variance).

The shorter axis, perpendicular to PC1, is PC2 (second most variance).

PCA rotates the axes to align with those directions.

If you drop PC2 and keep only PC1, you’ve reduced the 2D data to 1D while retaining most of its structure.
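
The 2D picture can be checked with a short experiment. The sketch below assumes scikit-learn's `PCA` and builds a hypothetical elliptical cloud whose long axis lies along the 45-degree diagonal; PC1 should then point along that diagonal.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Hypothetical elliptical cloud: long axis 3.0, short axis 0.5
cloud = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
# Rotate the cloud by 45 degrees so it is stretched diagonally
angle = np.pi / 4
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
points = cloud @ rotation.T

pca = PCA(n_components=2).fit(points)
pc1 = pca.components_[0]                  # unit vector along the longest axis
diagonal = np.array([1.0, 1.0]) / np.sqrt(2)
alignment = abs(pc1 @ diagonal)           # 1.0 = aligned (the sign is arbitrary)

print(pc1, alignment, pca.explained_variance_ratio_)
```

Because PC1 carries most of the variance here, dropping PC2 loses very little of the cloud's structure.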

13.What is the difference between Feature Selection and Feature Extraction?
Ans.Feature Selection and Feature Extraction are both techniques used for dimensionality reduction, but they work in very different ways.

1. Feature Selection
Definition:
Feature selection is the process of selecting a subset of existing features from the dataset that are most relevant to the target variable.

How it works:

It keeps the original features unchanged.

It removes irrelevant, redundant, or less important features.

Methods:

Filter methods: Based on statistical tests (e.g., correlation, chi-square)

Wrapper methods: Use machine learning models to test different subsets (e.g., recursive feature elimination)

Embedded methods: Feature importance from models (e.g., Lasso, tree-based models)

Example: From a dataset with 10 features, you select the top 5 based on correlation with the target.

Use when:

You want to improve model interpretability.

You prefer to keep features in their original form.

2. Feature Extraction
Definition:
Feature extraction creates new features by transforming or combining the original ones.

How it works:

It projects the data into a new feature space.

It captures the most important information in fewer features.

Methods:

Principal Component Analysis (PCA)

Linear Discriminant Analysis (LDA)

Autoencoders (in deep learning)

Example: From 10 original features, PCA creates 3 new features (principal components) that summarize the most variance.

Use when:

You want to reduce dimensionality while retaining most of the data’s information.

You’re okay with losing the original feature meanings.

Key Differences
Aspect	Feature Selection	Feature Extraction
What it does	Selects a subset of original features	Creates new features from existing ones
Original features	Retained	Transformed or combined
Interpretability	High	Lower
Goal	Keep useful features	Capture key patterns in fewer dimensions
Examples	Correlation filtering, Lasso	PCA, LDA, Autoencoders


14.What are Eigenvalues and Eigenvectors in PCA?
Ans.In Principal Component Analysis (PCA), eigenvalues and eigenvectors are central concepts that help us understand the structure and variance in the data.

Here’s what they mean and how they relate to PCA:
1. What Is an Eigenvector?
An eigenvector of a matrix is a nonzero vector whose direction does not change when the matrix is applied to it; the vector is only stretched or shrunk by a factor, its eigenvalue. In PCA, the matrix is the covariance matrix of the data, and its eigenvectors are the principal components: the new axes along which the data varies the most.

Each eigenvector defines a new axis (principal component) in the transformed feature space.

The eigenvectors are orthogonal (perpendicular) to each other.

2. What Is an Eigenvalue?
An eigenvalue tells you how much variance in the data is captured by its corresponding eigenvector.

A large eigenvalue means that the corresponding eigenvector (principal component) captures a lot of the spread or variability in the data.

A small eigenvalue means it captures very little variance.

3. How PCA Uses Eigenvalues and Eigenvectors
Here’s how PCA applies these concepts:

Compute the covariance matrix of the standardized data.

Calculate the eigenvalues and eigenvectors of this matrix.

Sort the eigenvectors by descending eigenvalues.

Select the top k eigenvectors (with the largest eigenvalues) to form the new feature space.

Project the original data onto these eigenvectors to reduce dimensionality.

4. Interpretation
Eigenvectors = directions of maximum variance (principal components)

Eigenvalues = amount of variance explained in each of those directions

For example:

If the first eigenvalue is much larger than the rest, the first principal component captures most of the data's variance.

The ratio of each eigenvalue to the sum of all eigenvalues gives the explained variance ratio.
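
The relationship between eigenvalues and the explained variance ratio can be verified directly. This sketch uses the Iris dataset as a stand-in and compares a manual eigen-decomposition against scikit-learn's `PCA`; it is meant as a consistency check, not as a description of the library's internals.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# Eigen-decomposition of the covariance matrix (columns of X are features)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
eigenvalues = eigenvalues[::-1]              # largest first
ratio = eigenvalues / eigenvalues.sum()      # explained variance ratio

pca = PCA().fit(X)
print(ratio)
print(pca.explained_variance_ratio_)
```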



15. How do you decide the number of components to keep in PCA?
Ans.Deciding how many principal components to keep in PCA depends on how much of the total variance you want to preserve in the data. The goal is to reduce dimensions while retaining the most important information.

Here are the most common methods to choose the number of components:

1. Explained Variance Threshold
Each principal component explains a certain percentage of the total variance.

You can choose to keep components that together explain a target amount, such as 95% of the total variance.

Steps:

Calculate the explained variance ratio for each component.

Compute the cumulative sum.

Choose the smallest number of components such that the cumulative variance reaches your threshold.

Example (cumulative variance):

Component	Explained Variance (%)	Cumulative (%)
PC1	55	55
PC2	25	80
PC3	10	90
PC4	5	95
Here, you would choose 4 components to retain 95% of the variance.
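
The threshold rule amounts to a cumulative sum and a lookup. The sketch below reuses the percentages from the table; the last two values (3 and 2) are hypothetical and were added only so the components sum to 100.

```python
import numpy as np

# Percentages for PC1..PC4 from the table; 3 and 2 are hypothetical
# values for PC5 and PC6 so that the total is 100
explained = np.array([55, 25, 10, 5, 3, 2])
cumulative = np.cumsum(explained)

threshold = 95
# First index where the cumulative variance reaches the threshold, plus 1
n_components = int(np.argmax(cumulative >= threshold)) + 1
print(cumulative, n_components)
```

scikit-learn can do this selection itself: passing a float between 0 and 1, as in `PCA(n_components=0.95, svd_solver="full")`, keeps the smallest number of components that reaches that share of the variance.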

2. Scree Plot (Elbow Method)
Plot the eigenvalues (or explained variance) in decreasing order.

Look for the point where the curve starts to level off—the "elbow".

Choose the number of components before the elbow, where adding more components gives diminishing returns.

This is a visual method and works well when the drop in variance is clear.

3. Kaiser Criterion (for standardized data)
Keep components with eigenvalues > 1.

This rule is based on the idea that a component should carry more information than an individual original feature.

This method is often used in factor analysis but can be applied to PCA in some cases.

4. Cross-Validation (For Model Performance)
Use PCA as part of a pipeline in a machine learning model.

Try different numbers of components and evaluate model performance using cross-validation.

Choose the number that gives the best performance.
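
One way to set this up is a grid search over the PCA step of a pipeline. The sketch below is only an illustration: it uses Iris as a stand-in dataset and a KNN classifier as the downstream model, and the step names `scale`, `pca`, and `knn` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Try every possible number of components (Iris has 4 features)
search = GridSearchCV(pipeline, {"pca__n_components": [1, 2, 3, 4]}, cv=5)
search.fit(X, y)

best_n = search.best_params_["pca__n_components"]
print(best_n, round(search.best_score_, 3))
```

Keeping the scaler and PCA inside the pipeline means they are refit on each training fold, so the validation folds do not leak into the preprocessing.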

16.Can PCA be used for classification?
Ans.PCA itself is not a classification algorithm, but it can be used as a preprocessing step to improve classification tasks.

Here’s how and when PCA can be helpful in classification:

1. How PCA Helps in Classification
PCA transforms the original high-dimensional data into a lower-dimensional space by capturing the directions of maximum variance. This helps in:

Reducing dimensionality, which simplifies the problem space

Removing noise and redundant features

Speeding up training time for classification algorithms

Preventing overfitting, especially when the number of features is much larger than the number of samples

2. When to Use PCA Before Classification
Use PCA when:

You have a large number of features that might be correlated

Your model performs poorly due to high dimensionality

You want to visualize the data in 2D or 3D before classification

You aim to improve training speed on large datasets

3. How It Works in a Classification Pipeline
Standardize the dataset

Apply PCA to reduce dimensions

Feed the reduced dataset into a classifier (e.g., logistic regression, SVM, decision tree)

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    LogisticRegression()
)

model.fit(X_train, y_train)


17.What are the limitations of PCA?
Ans.Principal Component Analysis (PCA) is a powerful tool, but it comes with several limitations. Understanding these helps decide when it is appropriate to use PCA and when it might hurt model performance or interpretation.

1. PCA Is Linear
PCA assumes that the relationships in the data are linear.

It cannot capture nonlinear patterns or structures.

If your data lies on a nonlinear manifold (like curves or spirals), PCA will not model it well.

2. Loss of Interpretability
After transformation, the principal components are combinations of original features.

It becomes difficult to interpret what each component means in real-world terms.

This can be a problem in applications where understanding feature impact is important (e.g. healthcare, finance).

3. Variance Does Not Always Equal Importance
PCA ranks components based on variance, not on their relevance to a target variable.

A feature with high variance might not be useful for prediction or classification.

PCA is unsupervised, so it ignores class labels or outcomes during transformation.

4. Sensitive to Scaling
PCA is sensitive to the scale of features.

If features are not standardized, those with larger scales will dominate the principal components.

Always standardize or normalize data before applying PCA.

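5. Works Best with Roughly Gaussian Data
PCA does not strictly require normally distributed features, but it relies only on means and covariances, so it describes the data most completely when the features are roughly Gaussian.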

If the data is heavily skewed or has outliers, PCA may produce misleading components.

6. Affected by Outliers
Outliers can distort the covariance matrix and significantly influence the principal components.

Preprocessing steps like outlier removal or robust PCA variants may be needed.

7. Information Loss
Reducing dimensions means some information is always lost.

If too few components are retained, the reduced dataset may not capture enough structure for downstream tasks.

8. Not Ideal for Categorical Data
PCA works with numerical data.

Applying PCA to categorical features (even if one-hot encoded) may produce unreliable or meaningless results.

18.How do KNN and PCA complement each other?
Ans.K-Nearest Neighbors (KNN) and Principal Component Analysis (PCA) can complement each other effectively when used together, especially in high-dimensional datasets. Here's how they work well in combination:

1. Reducing Dimensionality Before KNN
KNN is sensitive to the number of features—as the dimensionality increases, the distance between data points becomes less meaningful. This is known as the curse of dimensionality.

How PCA helps:

PCA reduces the number of dimensions by capturing the most important patterns in the data.

This makes distance calculations in KNN more reliable and meaningful.

It also reduces computational cost and noise, improving KNN's performance.

2. Improving Classification Accuracy
In high-dimensional spaces, irrelevant or redundant features can distort the distance metric used in KNN.

PCA merges correlated features into shared components and discards low-variance directions, helping KNN focus on the dominant structure in the data.

As a result, KNN may classify points more accurately when PCA is applied first.

3. Faster Computation in KNN
KNN stores all training data and performs lazy learning, meaning it computes distances at prediction time. This can be slow for high-dimensional data.

PCA reduces the number of features, which:

Speeds up prediction time

Reduces memory usage

Makes KNN feasible on larger datasets

4. Noise Reduction
KNN can be negatively affected by noisy features. PCA acts as a denoising tool by:

Projecting data onto components that capture meaningful variance

Ignoring small, potentially noisy variations in the data

This makes KNN more robust.

5. Visualization and Analysis
PCA reduces data to 2D or 3D, allowing you to visualize the data distribution.

This helps in understanding how KNN might behave, such as whether classes are separable or clustered.

Typical Pipeline
Standardize the data (PCA requires it)

Apply PCA to reduce dimensions

Use KNN for classification or regression on the reduced data

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

model = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    KNeighborsClassifier(n_neighbors=5)
)

model.fit(X_train, y_train)


19.How does KNN handle missing values in a dataset?
Ans.K-Nearest Neighbors (KNN) does not handle missing values natively—meaning, if your dataset contains missing values, KNN will not work properly unless you handle those missing values first.

However, there are a few ways to deal with missing values before or with KNN:

1. Preprocessing: Imputation Before Using KNN
Before using KNN for classification or regression, you should fill in missing values. Some common imputation strategies:

Mean/Median Imputation: Replace missing values with the mean or median of the column.

Mode Imputation: For categorical features, use the most frequent value.

KNN Imputation (see below): Use a KNN-based imputation method to estimate missing values.

2. KNN Imputation (Separate Use of KNN)
This is a different use of KNN: not as a classifier or regressor, but as an imputer.

How it works:

For a sample with a missing value, find its k nearest neighbors using the non-missing features.

Use the corresponding values from neighbors to fill in the missing data.

In [None]:
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
X_filled = imputer.fit_transform(X_with_missing)


3. Dropping Rows or Features (Only If Necessary)
If only a few rows or columns have missing values, you might drop them.

But this can lead to data loss, especially in small datasets.

20.What are the key differences between PCA and Linear Discriminant Analysis (LDA)?
Ans.Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are both dimensionality reduction techniques, but they are fundamentally different in their goals, assumptions, and methods.

Here's a clear breakdown of the key differences between PCA and LDA:

1. Supervision
Aspect	PCA	LDA
Type	Unsupervised	Supervised
Uses class labels?	No	Yes
PCA does not use target (class) labels.

LDA uses target labels to maximize class separability.

2. Goal
Aspect	PCA	LDA
Primary objective	Maximize variance in the data	Maximize class separability
Projection direction	Directions of maximum variance	Directions that best separate classes
PCA focuses on capturing as much information (spread) as possible.

LDA focuses on finding the axes that best discriminate between classes.

3. How It Works
Step	PCA	LDA
Computation basis	Eigenvectors of the covariance matrix	Eigenvectors of scatter matrices (within and between classes)
Optimization target	Total variance	Ratio of between-class variance to within-class variance
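
The PCA column of the table above can be checked numerically. The sketch below (variable names are illustrative) computes the eigenvalues of the covariance matrix of the standardized Iris features with NumPy and compares them with what scikit-learn's PCA reports as explained_variance_. It covers only the PCA side, not the LDA scatter matrices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

# Covariance matrix of the features (np.cov divides by n - 1 by default)
cov = np.cov(X_std, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order,
# so reorder them from largest to smallest
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

pca_check = PCA().fit(X_std)

print("Eigenvalues of covariance matrix:", np.round(eigenvalues, 4))
print("PCA explained_variance_:         ", np.round(pca_check.explained_variance_, 4))
```

The two printed vectors should agree, and the eigenvectors match the rows of pca_check.components_ up to sign.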
4. Number of Components You Can Get
Aspect	PCA	LDA
Max number of components	≤ min(number of samples, number of features)	≤ min(number of classes − 1, number of features)
LDA’s number of useful components is limited by the number of classes in your dataset.

PCA can return as many components as the smaller of the number of samples and the number of features.
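
A short sketch of both the supervision difference and the component limit, using the Iris dataset (4 features, 3 classes). The variable names are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_iris, y_iris = load_iris(return_X_y=True)  # 4 features, 3 classes

# PCA is unsupervised: it is fitted on X only and can keep up to 4 components here
X_pca_full = PCA(n_components=4).fit_transform(X_iris)

# LDA is supervised: it needs y and allows at most n_classes - 1 = 2 components here
X_lda_full = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_iris, y_iris)

# Asking LDA for a third component is rejected when fitting
try:
    LinearDiscriminantAnalysis(n_components=3).fit(X_iris, y_iris)
    lda_error = None
except ValueError as exc:
    lda_error = str(exc)

print("PCA output shape:", X_pca_full.shape)
print("LDA output shape:", X_lda_full.shape)
print("LDA with 3 components:", lda_error)
```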

5. Interpretability
PCA components are combinations of original features with no class-specific meaning.

LDA components are more interpretable in classification tasks, as they focus on separating the classes.

6. Performance Use Case
Aspect	PCA	LDA
Best used for	Data visualization and compression, or when labels are not available	Classification tasks where labels are available and classes need separation
Example Scenario
Use PCA if you want to reduce dimensions of raw data without labels or for data compression.

Use LDA if you are preparing data for a classification task and want to enhance class separability.
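
One way to try both options on a labeled dataset is to put each reducer in front of the same KNN classifier and compare cross-validated accuracy. This is a sketch under assumed settings (Wine dataset, 2 components, K = 5, 5-fold cross-validation), not a benchmark, and results on other data may differ.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)

# Same scaler and classifier; only the dimensionality reduction step differs
pca_knn = make_pipeline(StandardScaler(), PCA(n_components=2),
                        KNeighborsClassifier(n_neighbors=5))
lda_knn = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis(n_components=2),
                        KNeighborsClassifier(n_neighbors=5))

# The pipeline passes y to every step, so LDA receives the labels it needs
pca_scores = cross_val_score(pca_knn, X_wine, y_wine, cv=5)
lda_scores = cross_val_score(lda_knn, X_wine, y_wine, cv=5)

print(f"PCA + KNN mean accuracy: {pca_scores.mean():.4f}")
print(f"LDA + KNN mean accuracy: {lda_scores.mean():.4f}")
```

Because the reducers sit inside the pipeline, they are refitted on each training fold, which avoids leaking information from the validation fold.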



**Practice**


1.Train a KNN Classifier on the Iris dataset and print model accuracy?
Ans

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Predict and calculate accuracy
y_pred = knn.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")


2.Train a KNN Regressor on a synthetic dataset and evaluate using Mean Squared Error (MSE)?
Ans.

In [None]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Generate a synthetic regression dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (fit the scaler on the training split only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a KNN regressor
knn_regressor = KNeighborsRegressor(n_neighbors=5)
knn_regressor.fit(X_train_scaled, y_train)

# Predict and evaluate using MSE
y_pred = knn_regressor.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)

mse


3.Train a KNN Classifier using different distance metrics (Euclidean and Manhattan) and compare accuracy
Ans.

In [None]:
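from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Reload the Iris dataset for classification
iris = load_iris()
X, y = iris.data, iris.target

# Split and scale (fit the scaler on the training split only)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)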

# Train KNN classifier using Euclidean distance
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

# Train KNN classifier using Manhattan distance
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

accuracy_euclidean, accuracy_manhattan


4.Train a KNN Classifier with different values of K and visualize decision boundaries
Ans.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Load the Iris dataset
iris = load_iris()
X = iris.data[:, :2]  # Only use the first two features for 2D plotting
y = iris.target

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Set up mesh grid for plotting decision boundaries
x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))

# Plot for different values of K
plt.figure(figsize=(15, 4))
for i, k in enumerate([1, 5, 15]):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.subplot(1, 3, i + 1)
    plt.contourf(xx, yy, Z, alpha=0.4, cmap=plt.cm.RdYlBu)
    plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolor='k', cmap=plt.cm.RdYlBu)
    plt.title(f'KNN (k={k})')

plt.tight_layout()
plt.show()


5.Apply Feature Scaling before training a KNN model and compare results with unscaled data
Ans

In [None]:
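from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline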

# Load the Iris dataset again
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate KNN without scaling
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)
y_pred_unscaled = knn_unscaled.predict(X_test)
accuracy_unscaled = accuracy_score(y_test, y_pred_unscaled)

# Train and evaluate KNN with feature scaling
pipeline_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipeline_scaled.fit(X_train, y_train)
y_pred_scaled = pipeline_scaled.predict(X_test)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

accuracy_unscaled, accuracy_scaled


6.Train a PCA model on synthetic data and print the explained variance ratio for each component
Ans

In [None]:
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

# Generate synthetic data
X, _ = make_classification(n_samples=300, n_features=5, n_informative=3, random_state=42)

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA()
pca.fit(X_scaled)

# Print explained variance ratio
print("Explained Variance Ratio per Component:")
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"Component {i + 1}: {ratio:.4f}")


7.Apply PCA before training a KNN Classifier and compare accuracy with and without PCA
Ans.

In [None]:
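from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score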

# Load and split the Iris dataset again
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN without PCA
knn_no_pca = KNeighborsClassifier(n_neighbors=5)
knn_no_pca.fit(X_train_scaled, y_train)
y_pred_no_pca = knn_no_pca.predict(X_test_scaled)
accuracy_no_pca = accuracy_score(y_test, y_pred_no_pca)

# Apply PCA and train KNN with PCA-transformed data
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_with_pca = KNeighborsClassifier(n_neighbors=5)
knn_with_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_with_pca.predict(X_test_pca)
accuracy_with_pca = accuracy_score(y_test, y_pred_pca)

accuracy_no_pca, accuracy_with_pca


8. Perform Hyperparameter Tuning on a KNN Classifier using GridSearchCV
Ans.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define pipeline with scaling and KNN
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

# Define the parameter grid for GridSearch
param_grid = {
    'knn__n_neighbors': [1, 3, 5, 7, 9],
    'knn__weights': ['uniform', 'distance'],
    'knn__metric': ['euclidean', 'manhattan']
}

# Initialize GridSearchCV
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print best parameters and accuracy
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)

# Evaluate on test data
test_accuracy = grid_search.score(X_test, y_test)
print("Test Set Accuracy:", test_accuracy)


9.Train a KNN Classifier and check the number of misclassified samples?
Ans.

In [None]:
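from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix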

# Load and preprocess the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a KNN Classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)

# Calculate number of misclassified samples
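misclassified = (y_test != y_pred).sum()
print("Misclassified samples:", misclassified)

# The confusion matrix shows which classes were confused with each other
print(confusion_matrix(y_test, y_pred))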


10. Train a PCA model and visualize the cumulative explained variance.
Ans.

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the Iris dataset
iris = load_iris()
X = iris.data

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA
pca = PCA()
pca.fit(X_scaled)

# Calculate cumulative explained variance
cumulative_variance = pca.explained_variance_ratio_.cumsum()

# Plot
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--')
plt.title('Cumulative Explained Variance by PCA Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.grid(True)
plt.axhline(y=0.95, color='r', linestyle=':')
plt.text(1, 0.95, '95% threshold', color='red', fontsize=9, va='bottom')
plt.show()
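
Instead of reading the component count off the plot, scikit-learn's PCA also accepts a fraction between 0 and 1 as n_components and keeps just enough components to reach that share of the variance. A minimal sketch, assuming the same 95% threshold as in the plot (names such as pca_95 are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

# A float n_components asks PCA to keep enough components
# to explain at least that fraction of the total variance
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_std)

print("Components kept:", pca_95.n_components_)
print("Variance explained:", pca_95.explained_variance_ratio_.sum())
```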


11.Train a KNN Classifier using different values of the weights parameter (uniform vs. distance) and compare accuracy
Ans.

In [None]:
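from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load and prepare the Iris dataset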
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN with 'uniform' weights
knn_uniform = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn_uniform.fit(X_train_scaled, y_train)
y_pred_uniform = knn_uniform.predict(X_test_scaled)
accuracy_uniform = accuracy_score(y_test, y_pred_uniform)

# Train KNN with 'distance' weights
knn_distance = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn_distance.fit(X_train_scaled, y_train)
y_pred_distance = knn_distance.predict(X_test_scaled)
accuracy_distance = accuracy_score(y_test, y_pred_distance)

accuracy_uniform, accuracy_distance


12.Train a KNN Regressor and analyze the effect of different K values on performance
Ans.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic regression data
X, y = make_regression(n_samples=300, n_features=5, noise=20, random_state=42)

# Split and scale the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Evaluate performance for different values of K
k_values = list(range(1, 21))
mse_scores = []

for k in k_values:
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)

# Plot the results
plt.figure(figsize=(10, 5))
plt.plot(k_values, mse_scores, marker='o')
plt.title("KNN Regressor: Effect of K on MSE")
plt.xlabel("Number of Neighbors (K)")
plt.ylabel("Mean Squared Error (MSE)")
plt.grid(True)
plt.show()


13.Implement KNN Imputation for handling missing values in a dataset?
Ans.

In [None]:
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.datasets import load_iris

# Load Iris dataset and introduce missing values
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
np.random.seed(42)

# Randomly set 5% of values to NaN
mask = np.random.rand(*X.shape) < 0.05
X_missing = X.mask(mask)

print("Data with Missing Values:")
print(X_missing.head())

# Apply KNN Imputer
imputer = KNNImputer(n_neighbors=3)
X_imputed = imputer.fit_transform(X_missing)

# Convert back to DataFrame
X_imputed_df = pd.DataFrame(X_imputed, columns=X.columns)

print("\nData after KNN Imputation:")
print(X_imputed_df.head())


14. Train a PCA model and visualize the data projection onto the first two principal components?
Ans.

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plot the projection
plt.figure(figsize=(8, 6))
for target, label in enumerate(target_names):
    plt.scatter(
        X_pca[y == target, 0],
        X_pca[y == target, 1],
        label=label,
        alpha=0.7
    )
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Projection of the Iris Dataset')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


15.Train a KNN Classifier using the KD Tree and Ball Tree algorithms and compare performance?
Ans.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load and split the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# KNN with KD Tree
knn_kd = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
knn_kd.fit(X_train_scaled, y_train)
y_pred_kd = knn_kd.predict(X_test_scaled)
accuracy_kd = accuracy_score(y_test, y_pred_kd)

# KNN with Ball Tree
knn_ball = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree')
knn_ball.fit(X_train_scaled, y_train)
y_pred_ball = knn_ball.predict(X_test_scaled)
accuracy_ball = accuracy_score(y_test, y_pred_ball)

# Print results
print("KD Tree Accuracy:", accuracy_kd)
print("Ball Tree Accuracy:", accuracy_ball)
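
KD Tree and Ball Tree are exact search structures, so on the same data they are expected to return the same neighbors; what usually differs is speed. The sketch below times fitting and prediction on a larger synthetic dataset. The dataset size, the added brute-force baseline, and all variable names are assumptions made for illustration, and the timings will vary from machine to machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Larger synthetic dataset so that timing differences become measurable
X_big, y_big = make_classification(n_samples=5000, n_features=10,
                                   n_informative=5, random_state=42)
X_fit, y_fit = X_big[:4000], y_big[:4000]
X_query = X_big[4000:]

timings = {}
for algorithm in ["kd_tree", "ball_tree", "brute"]:
    clf = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)

    # Fit time: building the tree (almost nothing to build for brute force)
    start = time.perf_counter()
    clf.fit(X_fit, y_fit)
    fit_time = time.perf_counter() - start

    # Predict time: neighbor search for every query point
    start = time.perf_counter()
    clf.predict(X_query)
    predict_time = time.perf_counter() - start

    timings[algorithm] = (fit_time, predict_time)
    print(f"{algorithm:>9}: fit {fit_time:.4f}s, predict {predict_time:.4f}s")
```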


16.Train a PCA model on a high-dimensional dataset and visualize the Scree plot?
Ans

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load high-dimensional dataset (64 features)
digits = load_digits()
X = digits.data

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA()
pca.fit(X_scaled)

# Plot the Scree plot
plt.figure(figsize=(10, 5))
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_, marker='o', linestyle='--')
plt.title("Scree Plot: Explained Variance by PCA Components")
plt.xlabel("Principal Component")
plt.ylabel("Explained Variance Ratio")
plt.grid(True)
plt.tight_layout()
plt.show()


17.Train a KNN Classifier and evaluate performance using Precision, Recall, and F1-Score
Ans.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Predict on test set
y_pred = knn.predict(X_test_scaled)

# Evaluate performance
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))


18.Train a PCA model and analyze the effect of different numbers of components on accuracy
Ans.

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Try different PCA components and track accuracy
component_range = range(1, X.shape[1] + 1)
accuracy_scores = []

for n_components in component_range:
    pca = PCA(n_components=n_components)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)

    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train_pca, y_train)
    y_pred = knn.predict(X_test_pca)
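    accuracy = accuracy_score(y_test, y_pred)
    accuracy_scores.append(accuracy)

# Plot accuracy against the number of PCA components
plt.figure(figsize=(8, 5))
plt.plot(list(component_range), accuracy_scores, marker='o')
plt.title("KNN Accuracy vs. Number of PCA Components")
plt.xlabel("Number of PCA Components")
plt.ylabel("Accuracy")
plt.xticks(list(component_range))
plt.grid(True)
plt.tight_layout()
plt.show()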


19. Train a KNN Classifier with different leaf_size values and compare accuracy
Ans.

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

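# Try different leaf_size values. leaf_size affects tree build/query speed and
# memory use, not which neighbors are found, so accuracy is expected to stay the same.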
leaf_sizes = range(5, 51, 5)
accuracy_scores = []

for leaf_size in leaf_sizes:
    knn = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree', leaf_size=leaf_size)
    knn.fit(X_train_scaled, y_train)
    y_pred = knn.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_scores.append(accuracy)

# Plot the results
plt.figure(figsize=(8, 5))
plt.plot(leaf_sizes, accuracy_scores, marker='o')
plt.title("KNN Classifier Accuracy vs. leaf_size")
plt.xlabel("leaf_size")
plt.ylabel("Accuracy")
plt.grid(True)
plt.tight_layout()
plt.show()


20.Train a PCA model and visualize how data points are transformed before and after PCA
Ans.

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA to reduce to 2 dimensions
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plot original feature space (first 2 features) and PCA space
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Original space (first 2 standardized features)
for target in set(y):
    axes[0].scatter(
        X_scaled[y == target, 0],
        X_scaled[y == target, 1],
        label=target_names[target],
        alpha=0.7
    )
axes[0].set_title("Original Feature Space (First 2 Scaled Features)")
axes[0].set_xlabel("Feature 1")
axes[0].set_ylabel("Feature 2")
axes[0].legend()
axes[0].grid(True)

# PCA space
for target in set(y):
    axes[1].scatter(
        X_pca[y == target, 0],
        X_pca[y == target, 1],
        label=target_names[target],
        alpha=0.7
    )
axes[1].set_title("PCA-Transformed Space (First 2 Components)")
axes[1].set_xlabel("Principal Component 1")
axes[1].set_ylabel("Principal Component 2")
axes[1].legend()
axes[1].grid(True)

plt.tight_layout()
plt.show()


21.Train a KNN Classifier on a real-world dataset (Wine dataset) and print classification report
Ans.

In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Load the Wine dataset
wine = load_wine()
X, y = wine.data, wine.target
target_names = wine.target_names

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = knn.predict(X_test_scaled)
report = classification_report(y_test, y_pred, target_names=target_names)

print("Classification Report for KNN on Wine Dataset:")
print(report)


22.Train a KNN Regressor and analyze the effect of different distance metrics on prediction error
Ans.

In [None]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Create synthetic regression data
X, y = make_regression(n_samples=300, n_features=5, noise=15, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Distance metrics to compare
metrics = ['euclidean', 'manhattan', 'chebyshev']
mse_scores = []

# Train and evaluate KNN Regressor with each metric
for metric in metrics:
    knn = KNeighborsRegressor(n_neighbors=5, metric=metric)
    knn.fit(X_train_scaled, y_train)
    y_pred = knn.predict(X_test_scaled)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"{metric.capitalize()} Distance - Mean Squared Error: {mse:.2f}")

# Optional: plot the results
plt.figure(figsize=(7, 5))
plt.bar(metrics, mse_scores, color='skyblue')
plt.title("KNN Regressor - MSE by Distance Metric")
plt.xlabel("Distance Metric")
plt.ylabel("Mean Squared Error")
plt.grid(True, axis='y')
plt.tight_layout()
plt.show()


23.Train a KNN Classifier and evaluate using ROC-AUC score?
Ans.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Predict probabilities
y_proba = knn.predict_proba(X_test_scaled)[:, 1]  # Probability for positive class

# Compute ROC-AUC score
roc_auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC Score: {roc_auc:.4f}")


24.Train a PCA model and visualize the variance captured by each principal component
Ans.

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the Wine dataset
data = load_wine()
X = data.data

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train PCA model
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Plot the explained variance ratio
plt.figure(figsize=(8, 5))
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1),
        pca.explained_variance_ratio_,
        alpha=0.7,
        color='skyblue')
plt.title("Explained Variance by Each Principal Component")
plt.xlabel("Principal Component")
plt.ylabel("Explained Variance Ratio")
plt.xticks(range(1, len(pca.explained_variance_ratio_) + 1))
plt.grid(axis='y')
plt.tight_layout()
plt.show()


25.Train a KNN Classifier and perform feature selection before training
Ans.

In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Load the dataset
data = load_wine()
X, y = data.data, data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Feature selection: select top 8 features
selector = SelectKBest(score_func=f_classif, k=8)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)

# Train KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_selected, y_train)

# Evaluate
y_pred = knn.predict(X_test_selected)
print("Classification Report after Feature Selection:")
print(classification_report(y_test, y_pred, target_names=data.target_names))


26.Train a PCA model and visualize the data reconstruction error after reducing dimensions?
Ans.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load dataset
data = load_wine()
X = data.data

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reduce to a chosen number of components (e.g., 2)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Reconstruct data from reduced representation
X_reconstructed = pca.inverse_transform(X_pca)

# Compute reconstruction error (mean squared error per sample)
reconstruction_error = np.mean((X_scaled - X_reconstructed) ** 2, axis=1)

# Plot reconstruction error
plt.figure(figsize=(8, 5))
plt.hist(reconstruction_error, bins=30, color='salmon', edgecolor='black', alpha=0.7)
plt.title("Distribution of Reconstruction Errors (PCA with 2 Components)")
plt.xlabel("Reconstruction Error")
plt.ylabel("Number of Samples")
plt.grid(True)
plt.tight_layout()
plt.show()


27.Train a KNN Classifier and visualize the decision boundary?
Ans.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Load Wine dataset
data = load_wine()
X, y = data.data, data.target
target_names = data.target_names

# Standardize and reduce to 2 components using PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Train KNN Classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_pca, y)

# Create mesh grid
x_min, x_max = X_pca[:, 0].min() - 1, X_pca[:, 0].max() + 1
y_min, y_max = X_pca[:, 1].min() - 1, X_pca[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                     np.linspace(y_min, y_max, 300))

# Predict on grid points
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.coolwarm)
for idx, label in enumerate(np.unique(y)):
    plt.scatter(X_pca[y == label, 0], X_pca[y == label, 1],
                label=target_names[label], edgecolor='k')
plt.title("KNN Decision Boundary (PCA-reduced Wine Data)")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


28.Train a PCA model and analyze the effect of different numbers of components on data variance.
Ans.

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np

# Load dataset
data = load_wine()
X = data.data

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA with all components
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Calculate cumulative explained variance
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

# Plot cumulative variance vs number of components
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', color='navy')
plt.title("Cumulative Explained Variance by Number of PCA Components")
plt.xlabel("Number of Principal Components")
plt.ylabel("Cumulative Explained Variance")
plt.grid(True)
plt.axhline(y=0.95, color='red', linestyle='--', label='95% Variance Threshold')
plt.legend()
plt.tight_layout()
plt.show()
