# Assignment no 72 (K Nearest Neighbours - Basic) (20.4.23)

### Q1. What is the KNN algorithm?

**Ans -** 
- The K-Nearest Neighbor (KNN) algorithm is a simple, yet powerful, supervised learning technique used for classification and regression problems. 
- It is a non-parametric algorithm, meaning it does not make any assumptions about the underlying data distribution. 
- The KNN algorithm assumes that similar data points lie close together: to classify a new data point, it finds the K stored points closest to it and assigns the category that is most common among them.
- It is also called a lazy learner algorithm because it does not learn from the training set immediately. Instead, it stores the dataset and defers all computation to the time of prediction. Although it handles both regression and classification, it is mostly used for classification problems.
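
The points above can be sketched in a few lines of plain Python. This is a minimal illustration, not an optimized implementation: the function name `knn_predict` and the toy data are made up for this example, and Euclidean distance is assumed as the similarity measure.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Lazy" learning: nothing is fitted in advance; all work happens here.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data (hypothetical): two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
labels = ["A", "A", "A", "B", "B", "B"]

prediction_near_a = knn_predict(points, labels, (2.0, 2.0), k=3)  # "A"
prediction_near_b = knn_predict(points, labels, (7.5, 8.5), k=3)  # "B"
```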

### Q2. How do you choose the value of K in KNN?

**Ans -** 

Choosing the value of K in the KNN algorithm is an important step as it can significantly impact the performance of the model. There are several methods to determine the optimal value of K for a given dataset. Here are some common techniques used to choose the value of K:

1. **Cross-validation**: One way to determine the optimal value of K is to use cross-validation. This involves repeatedly splitting the dataset into training and validation folds, evaluating each candidate value of K on the held-out folds, and keeping the value that gives the best average performance.

2. **The square root rule**: Another simple approach to select K is to set K = sqrt(n), where n = total number of data points in the dataset.

3. **The elbow method**: This method involves plotting the error rate for different values of K and choosing the value of K where the error rate starts to decrease more slowly (the "elbow" point).

It is also important to note that **choosing a very low value of K can lead to inaccurate predictions due to noise in the data (overfitting), while choosing a very high value over-smooths the predictions (underfitting) and adds computation at prediction time.**

It is generally recommended to **choose an odd value for K if there are only two classes in the dataset, to avoid ties in classification.**
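
As a sketch of the cross-validation approach, the snippet below scores several odd values of K with 5-fold cross-validation using scikit-learn. The Iris dataset, the candidate range, and the variable names are assumptions chosen only for illustration. Plotting `1 - mean_scores[k]` against `k` would give the error curve used by the elbow method.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Example dataset (an assumption for this sketch): 150 samples, 3 classes.
X, y = load_iris(return_X_y=True)

# Candidate values of K: odd numbers from 1 to 21.
candidate_ks = range(1, 22, 2)

mean_scores = {}
for k in candidate_ks:
    # Scale inside the pipeline so each fold is scaled using its own training part.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)  # accuracy on each of 5 folds
    mean_scores[k] = scores.mean()

# The K with the highest mean validation accuracy.
best_k = max(mean_scores, key=mean_scores.get)
```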

### Q3. What is the difference between KNN classifier and KNN regressor?

**Ans -** 
The KNN algorithm can be used for both classification and regression problems. The main difference between the KNN classifier and the KNN regressor is the way they make predictions based on the nearest neighbors.

**1. KNN Classifier:** The KNN classifier predicts the class of a new data point by taking a majority vote among its K nearest neighbors. In other words, it assigns the new data point to the class that is most common among its K nearest neighbors.

**2. KNN Regressor:** The KNN regressor, on the other hand, predicts a continuous value for a new data point by taking the average of the values of its K nearest neighbors.

In summary, the KNN classifier is used for classification problems where the output variable is categorical, while the KNN regressor is used for regression problems where the output variable is continuous. 
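
The only difference is how the neighbors' targets are combined. A minimal sketch, assuming the K = 5 nearest neighbors have already been found (the labels and values below are hypothetical):

```python
from collections import Counter
from statistics import mean

# Targets of the 5 nearest neighbors (made-up values for illustration).
neighbor_classes = ["spam", "ham", "spam", "spam", "ham"]   # classification
neighbor_values = [200.0, 220.0, 210.0, 250.0, 230.0]       # regression

# Classifier: majority vote among the neighbors' classes.
predicted_class = Counter(neighbor_classes).most_common(1)[0][0]  # "spam"

# Regressor: average of the neighbors' numeric targets.
predicted_value = mean(neighbor_values)  # 222.0
```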

### Q4. How do you measure the performance of KNN?

**Ans -** 
The performance of the KNN algorithm can be measured using various evaluation metrics, depending on the type of problem (classification or regression) and the specific requirements of the task.

For classification problems, some common evaluation metrics used to measure the performance of the KNN classifier include **accuracy, precision, recall, F1-score, and confusion matrix**. These metrics can be calculated by comparing the predicted class labels with the true class labels of the test data.

For regression problems, some common evaluation metrics used to measure the performance of the KNN regressor include **mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and R-squared**. These metrics can be calculated by comparing the predicted continuous values with the true values of the test data.

It is important to note that when using a KNN classifier or regressor, it is **essential to normalize the features as KNN measures the distance between points**. The model is then evaluated on held-out test data by making predictions and comparing them to the actual target values.
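
The metrics listed above are available in `sklearn.metrics`. The sketch below computes them on small hypothetical label and value lists, standing in for the true targets and the predictions of a KNN model on a test set.

```python
import math

from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
)

# Classification: hypothetical true labels and KNN predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)     # 6 of 8 correct -> 0.75
precision = precision_score(y_true, y_pred)   # 3 TP / (3 TP + 1 FP)
recall = recall_score(y_true, y_pred)         # 3 TP / (3 TP + 1 FN)
f1 = f1_score(y_true, y_pred)
confusion = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted

# Regression: hypothetical true values and KNN predictions.
y_true_reg = [3.0, 5.0, 2.0, 7.0]
y_pred_reg = [2.0, 5.0, 4.0, 8.0]

mse = mean_squared_error(y_true_reg, y_pred_reg)   # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = math.sqrt(mse)
mae = mean_absolute_error(y_true_reg, y_pred_reg)  # (1 + 0 + 2 + 1) / 4 = 1.0
r2 = r2_score(y_true_reg, y_pred_reg)
```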

### Q5. What is the curse of dimensionality in KNN?

**Ans -** The curse of dimensionality is a phenomenon that occurs when the number of features (dimensions) in a dataset increases, causing the feature space to become increasingly sparse. This can have a significant impact on the performance of the KNN algorithm, as it relies on measuring the distance between data points to make predictions.

In high-dimensional spaces, the distance between data points can become less meaningful, as all points tend to be almost equidistant from each other. This can make it difficult for the KNN algorithm to accurately identify the nearest neighbors of a given data point, and can result in inaccurate predictions.

To overcome the curse of dimensionality in KNN, it is often necessary to use techniques such as dimensionality reduction to reduce the number of features in the dataset. This can help to improve the performance of the KNN algorithm by making the feature space less sparse and allowing the algorithm to more accurately identify the nearest neighbors of a given data point.
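
The "almost equidistant" effect can be observed numerically. The sketch below draws uniformly random points (an assumption made for illustration; real data behaves differently in detail) and measures how much farther the farthest point is than the nearest one, relative to the nearest distance. The helper name `relative_contrast` is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_dims, n_points=500):
    """Return (farthest - nearest) / nearest distance from a random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


# A large value means "nearest" is clearly distinguishable from "farthest";
# a value near 0 means all points are roughly the same distance away.
contrast_by_dim = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
```

In this setup the contrast in 2 dimensions is far larger than in 1000 dimensions, which is why nearest-neighbor lookups become less informative as dimensions are added.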

### Q6. How do you handle missing values in KNN?

**Ans -** Missing values can cause problems for many machine learning algorithms, including the KNN algorithm. As such, it is important to identify and handle missing values in the dataset prior to modeling your prediction task. This is called missing data imputation, or imputing for short.

One popular approach to missing data imputation is to use a model to predict the missing values. This requires a model to be created for each input variable that has missing values. Although any one among a range of different models can be used to predict the missing values, the k-nearest neighbor (KNN) algorithm has proven to be generally effective, often referred to as “nearest neighbor imputation”.

The KNN imputation method works by finding the K nearest neighbors of the data point with missing values, based on the other variables in the dataset. The missing value is then imputed by taking the average (for continuous variables) or the mode (for categorical variables) of the K nearest neighbors.

In Python, you can use the `KNNImputer` class from the scikit-learn library (`sklearn.impute`) to perform KNN imputation on your dataset. It works on numeric features and fills each missing value with the mean of that feature across the nearest neighbors, providing an easy-to-use interface for KNN-based imputation.
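
A minimal sketch of KNN imputation with scikit-learn's `KNNImputer` on a small made-up numeric array. With `n_neighbors=2`, each missing entry is replaced by the mean of that column over the two nearest rows that have a value there; distances are computed only over the coordinates both rows have present.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy numeric data with two missing entries (values chosen for illustration).
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# X_filled[0, 2] becomes 4.0 (mean of 3.0 and 5.0 from the two nearest rows).
# X_filled[2, 0] becomes 5.5 (mean of 3.0 and 8.0 from the two nearest rows).
```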

### Q7. Compare and contrast the performance of the KNN classifier and regressor. Which one is better for which type of problem?

**Ans -** The KNN classifier and regressor are two variants of the same algorithm that are used for different types of problems. Both find the K nearest neighbors in the same way and differ only in how they combine the neighbors' targets. The KNN classifier is used for classification problems, where the goal is to predict the class or category that a data point belongs to. The KNN regressor is used for regression problems, where the goal is to predict a continuous value.

Because of this, the two are not direct competitors, and the choice between them depends on the type of problem being solved. For classification problems, the KNN classifier is the appropriate choice, as it is designed to predict classes. For regression problems, the KNN regressor is the appropriate choice, as it is designed to predict continuous values.

### Q8. What are the strengths and weaknesses of the KNN algorithm for classification and regression tasks, and how can these be addressed?

**Ans -** 

The k-Nearest Neighbor (kNN) algorithm is a simple but effective machine learning algorithm that can be used for both classification and regression tasks. 

Some of the strengths of the kNN algorithm include its 
1. flexibility, 
2. ease of implementation, and 
3. robustness to noisy data, provided k is not too small.

However, the kNN algorithm also has some weaknesses. 
1. One weakness is that it can be computationally expensive, especially when dealing with large datasets. 
2. Another weakness is that it can be sensitive to the choice of the number of nearest neighbors (k) and the distance metric used.

To address these weaknesses, several techniques and modified versions of the kNN algorithm have been developed. For example, one approach is to use dimensionality reduction techniques to reduce the computational cost of the distance calculations. Another approach is to use cross-validation or other methods to select the value of k and the distance metric.
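
One way to tune k and the distance metric together is a cross-validated grid search. The sketch below uses scikit-learn's `GridSearchCV`; the Iris dataset and the candidate values in `param_grid` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Example dataset (an assumption for this sketch).
X, y = load_iris(return_X_y=True)

# make_pipeline names each step after its lowercased class name,
# so the classifier's parameters are prefixed "kneighborsclassifier__".
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {
    "kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9, 11],
    "kneighborsclassifier__metric": ["euclidean", "manhattan"],
}

# Evaluate every (k, metric) combination with 5-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_  # best (k, metric) pair found
best_score = search.best_score_    # its mean cross-validated accuracy
```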

In summary, while the kNN algorithm has several strengths, it also has some weaknesses that can be addressed through various techniques and modifications.

### Q9. What is the difference between Euclidean distance and Manhattan distance in KNN?

**Ans -** In the context of the k-Nearest Neighbor (kNN) algorithm, *Euclidean distance and Manhattan distance are two different distance metrics that can be used to calculate the distance between data points.* The choice of distance metric can affect the performance of the kNN algorithm, so it is important to understand the differences between these two metrics.

**1. Euclidean distance:** is the straight-line distance between two points in Euclidean space. It is calculated using the Pythagorean theorem, which states that the square of the distance between two points is equal to the sum of the squares of the differences between their coordinates.

**2. Manhattan distance:** is also known as taxicab distance or city block distance. It is calculated as the sum of the absolute differences between the coordinates of two points. This distance metric is often preferred over Euclidean distance when dealing with high dimensionality, although the best choice depends on the dataset.
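
Both definitions translate directly into code. A small sketch in plain Python, with function names and example points chosen for illustration:

```python
import math


def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


p, q = (1.0, 2.0), (4.0, 6.0)  # coordinate differences are 3 and 4

euclid = euclidean_distance(p, q)     # sqrt(9 + 16) = 5.0
manhattan = manhattan_distance(p, q)  # 3 + 4 = 7.0
```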

In summary, Euclidean and Manhattan distances are two different distance metrics that can be used in kNN. Euclidean distance measures the straight-line distance between two points, while Manhattan distance measures the sum of the absolute differences between their coordinates. The choice of distance metric can affect the performance of kNN, so it is important to choose the appropriate metric for a given problem. 

### Q10. What is the role of feature scaling in KNN?

**Ans -** 

Feature scaling is an important preprocessing step for many machine learning algorithms, including the k-Nearest Neighbor (kNN) algorithm. It involves rescaling the features to a comparable range, for example by standardizing each feature so that it has a mean of 0 and a standard deviation of 1. This is crucial for the kNN algorithm, as it helps in preventing features with larger magnitudes from dominating the distance calculations.

The kNN algorithm is a distance-based algorithm, which means that it calculates the distance between data points to determine their similarity. If the features have different scales, then the distance calculation can be dominated by the features with larger magnitudes. This can lead to suboptimal performance of the kNN algorithm.

To prevent this from happening, it is important to scale the data before applying the kNN algorithm. This can be done using techniques such as normalization or standardization. By scaling the data, all features are brought to the same scale, which ensures that no single feature dominates the distance calculation. This can improve the performance of the kNN algorithm and lead to more accurate predictions.
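
The effect can be seen on a tiny made-up dataset with an age column and an income column. In the sketch below, standardization is done by hand with NumPy (scikit-learn's `StandardScaler` performs the same transformation), and the helper `nearest_to_first_row` is a hypothetical name for this example.

```python
import numpy as np

# Columns: age (years), income (currency units). Values are made up.
X = np.array([
    [25.0, 50_000.0],   # row 0: the query point
    [60.0, 50_200.0],   # row 1: very different age, almost the same income
    [26.0, 52_000.0],   # row 2: almost the same age, slightly different income
    [45.0, 90_000.0],   # row 3: different in both
])


def nearest_to_first_row(data):
    """Index of the row closest (Euclidean) to row 0, excluding row 0 itself."""
    distances = np.linalg.norm(data[1:] - data[0], axis=0 + 1)
    return int(np.argmin(distances)) + 1


# Standardize each column to mean 0 and standard deviation 1.
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

nearest_raw = nearest_to_first_row(X)                  # 1: income dominates
nearest_scaled = nearest_to_first_row(X_standardized)  # 2: both features count
```

Without scaling, the 35-year age gap to row 1 is invisible next to income differences in the hundreds or thousands, so the nearest neighbor changes once both features are put on the same scale.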

In summary, feature scaling is an important step when using the kNN algorithm, as it helps to prevent features with larger magnitudes from dominating the distance calculations. 