# Interview Questions:

# 1. What are the key hyperparameters in KNN?

The performance of the K-Nearest Neighbors (KNN) algorithm is influenced by the following key hyperparameters:

1) n_neighbors (k): Specifies the number of nearest neighbors to consider when making a prediction.
* Effect: A small k gives a flexible, high-variance model that is sensitive to noise and tends to overfit, while a large k smooths the decision boundary but increases bias and can underfit.

2) weights: Determines the weight assigned to each neighbor’s contribution. Options include:
   - "uniform": All neighbors contribute equally.
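   - "distance": Neighbors are weighted by the inverse of their distance, so closer neighbors have a higher influence.
* Effect: With "uniform", every one of the k neighbors gets an equal vote regardless of how far away it is; with "distance", far neighbors count for less, which makes predictions less sensitive to the exact choice of k.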

3) metric: Specifies the distance metric used to calculate the closeness of neighbors (e.g., Euclidean, Manhattan).
* Effect: The distance metric influences how "similarity" is defined, potentially affecting accuracy depending on the data's structure.

4) p (for Minkowski distance): Sets the power parameter of the Minkowski distance. It only has an effect when the metric is Minkowski.
* Effect: If p=1, it corresponds to Manhattan distance; if p=2, it corresponds to Euclidean distance.

5) algorithm: Determines the method used to compute nearest neighbors (e.g., "auto", "ball_tree", "kd_tree", or "brute").
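* Effect: Impacts computational efficiency, particularly for large datasets. All of these options perform an exact neighbor search, so the choice affects speed rather than the quality of the predictions.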

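6) leaf_size (for tree-based algorithms): Specifies the approximate number of points per leaf in kd_tree or ball_tree, that is, the point at which the search switches to brute force within a node.
* Effect: Impacts the speed of tree construction and queries, as well as memory usage. It does not change which neighbors are found.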
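
To make the effect of k, weights, and p concrete, here is a minimal brute-force sketch in plain Python. The parameter names mirror the hyperparameters listed above, but the function `knn_predict`, the helper `minkowski`, and the toy data are hypothetical and written only for illustration; this is not a library implementation and it ignores algorithm and leaf_size entirely.

```python
import math
from collections import defaultdict


def minkowski(a, b, p=2):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def knn_predict(X_train, y_train, query, n_neighbors=3, weights="uniform", p=2):
    """Classify `query` by a vote among its n_neighbors nearest training points."""
    # Brute-force search: measure the distance to every training point.
    pairs = [(minkowski(x, query, p), label) for x, label in zip(X_train, y_train)]
    nearest = sorted(pairs, key=lambda pair: pair[0])[:n_neighbors]

    votes = defaultdict(float)
    for dist, label in nearest:
        if weights == "uniform":
            votes[label] += 1.0  # every neighbor counts equally
        elif weights == "distance":
            # Inverse-distance weighting; an exact match gets an infinite vote.
            votes[label] += 1.0 / dist if dist > 0 else math.inf
        else:
            raise ValueError("weights must be 'uniform' or 'distance'")
    return max(votes, key=votes.get)


# Toy data (made up): two points of class "a" near the query, three of class "b" far away.
X_train = [(1.0, 1.0), (1.0, 2.0), (4.0, 4.0), (5.0, 4.0), (5.0, 5.0)]
y_train = ["a", "a", "b", "b", "b"]
query = (2.0, 2.0)

print(knn_predict(X_train, y_train, query, n_neighbors=3))                      # a
print(knn_predict(X_train, y_train, query, n_neighbors=5))                      # b
print(knn_predict(X_train, y_train, query, n_neighbors=5, weights="distance"))  # a
```

With k=5 and uniform weights the three distant "b" points outvote the two nearby "a" points, which is the bias a large k introduces. Switching to distance weighting restores "a" because the far neighbors contribute much smaller votes.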


# 2. What distance metrics can be used in KNN?


In the K-Nearest Neighbors (KNN) algorithm, the choice of distance metric determines how similarity or closeness between data points is measured. Here are the common distance metrics used:

1) Euclidean Distance:
* A popular metric for continuous data, it measures the straight-line distance between two points in multi-dimensional space. It works well when the features are numeric and on comparable scales; without scaling, features with large ranges dominate the distance.

2) Manhattan Distance:
* This measures the distance between points along the grid (sum of absolute differences). Because differences are not squared, it is less sensitive to outliers than Euclidean distance, and it is a reasonable choice when the features are independent and not highly correlated.

3) Minkowski Distance:
* A generalization of Euclidean and Manhattan distances, controlled by a power parameter p (p=1 gives Manhattan, p=2 gives Euclidean). Larger values of p put more weight on the largest coordinate difference, so p lets the user tune sensitivity to outliers.

4) Hamming Distance:
* Used for categorical or binary data, this metric counts the attributes on which two data points differ (some libraries report it as a fraction of all attributes). It’s ideal for datasets where features are non-numeric, such as category labels or binary flags.

5) Cosine Distance:
* Commonly used for high-dimensional data such as text or document similarity, it is computed as one minus the cosine of the angle between two vectors, so it compares their direction and ignores their magnitude. It’s useful when the magnitude of the data is less important than the direction.

6) Mahalanobis Distance:
* This metric uses the inverse covariance matrix of the data, so it accounts for both correlations between features and differences in their variances. It’s particularly useful when features have different units or when multivariate relationships exist.

The choice of metric depends on the type of data and the problem. For instance, Euclidean distance works well for numerical data, Hamming distance is suited for categorical data, and Cosine distance is effective in text or sparse data scenarios.
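
The metrics listed above can be sketched in a few lines of plain Python. This is an illustrative sketch, not a library API: the function names and sample vectors are made up, `hamming` returns a raw count rather than a fraction, and `mahalanobis` assumes the caller already has the inverse covariance matrix (here called `inv_cov`) as a nested list.

```python
import math


def euclidean(a, b):
    """Straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def minkowski(a, b, p):
    """Generalization: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def hamming(a, b):
    """Number of positions at which the two points differ."""
    return sum(x != y for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine of the angle between a and b (both must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mahalanobis(a, b, inv_cov):
    """sqrt(diff^T * inv_cov * diff), with inv_cov the inverse covariance matrix."""
    diff = [x - y for x, y in zip(a, b)]
    n = len(diff)
    total = sum(diff[i] * inv_cov[i][j] * diff[j] for i in range(n) for j in range(n))
    return math.sqrt(total)


a, b = (0.0, 0.0), (3.0, 4.0)
print(euclidean(a, b))                                  # 5.0
print(manhattan(a, b))                                  # 7.0
print(hamming(("red", "S", "yes"), ("red", "M", "no")))  # 2
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))          # 1.0 (orthogonal vectors)
print(mahalanobis(a, b, [[1.0, 0.0], [0.0, 1.0]]))      # 5.0 (identity matrix reduces to Euclidean)
```

Note that `cosine_distance((1.0, 2.0), (2.0, 4.0))` is approximately zero even though the second vector is twice as long, which is the "direction over magnitude" property described above.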