## KNN: 
- The KNN algorithm assumes that similar things exist in close proximity. 
- It's a supervised machine learning algorithm that can be used to solve both classification and regression problems. 
- KNN works on a similarity measure (e.g., a distance function).
- KNN is a non-parametric and lazy learning algorithm. 
    - Non-parametric means it makes no assumption about the underlying data distribution.
    - Lazy means there is no explicit training phase that builds a model. The algorithm simply stores the training data, and all of it is used at prediction (testing) time.

### Steps: The KNN Algorithm:
- 1. Load the data.
- 2. Initialize K to your chosen number of neighbors.
- 3. For each query data point:
      - Calculate the distance between the query data point and every data point in the training data.
      - There are multiple distance functions to choose from, such as Euclidean, Manhattan, Minkowski, and Hamming distance.
      <img src = './Image/10.1 Image a.png' width=30% height=20%/>
- 4. Collect each distance together with the index of its data point, then sort this collection in ascending order of distance.
- 5. Pick the first K entries from the sorted collection.
- 6. Get the labels of the selected K entries.
- 7. If regression, return the mean of the K labels.
- 8. If classification, return the mode of the K labels, i.e., the class of the query data point is decided by **majority voting**.
<img src = './Image/10.1 Image b.gif' width=50% height=50%/>
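
The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names `minkowski` and `knn_predict`, the `task` parameter, and the toy data are all assumptions made for this example. Minkowski distance is used because it covers Manhattan (`p=1`) and Euclidean (`p=2`) as special cases.

```python
from collections import Counter
from statistics import mean


def minkowski(a, b, p=2):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(train_X, train_y, query, k, task="classification", p=2):
    # Step 3: distance from the query to every training point, kept with its index.
    distances = [(minkowski(x, query, p), i) for i, x in enumerate(train_X)]
    # Step 4: sort in ascending order of distance.
    distances.sort()
    # Steps 5-6: labels of the K nearest points.
    k_labels = [train_y[i] for _, i in distances[:k]]
    # Steps 7-8: mean for regression, mode (majority vote) for classification.
    if task == "regression":
        return mean(k_labels)
    return Counter(k_labels).most_common(1)[0][0]


# Toy data (made up for illustration): two features, two classes.
X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(X, y, query=(2.0, 2.0), k=3))  # a
print(knn_predict(X, y, query=(6.0, 7.0), k=3))  # b
```

Sorting every distance costs O(n log n) per query, which is fine for a sketch. Libraries typically use faster neighbor-search structures.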


### Effect of K or n_neighbors:
The number of neighbors is the core deciding factor.
   - 1. As we decrease the value of K towards 1, our predictions become less stable, because a single noisy neighbor can decide the result.
   - 2. Inversely, as we increase the value of K, our predictions become more stable due to majority voting / averaging, and thus more likely to be accurate (up to a certain point). Beyond that point the number of errors starts to rise again, which tells us we have pushed the value of K too far.
   - 3. In cases where we are taking a majority vote among labels (e.g. picking the mode in a classification problem), we usually make K an odd number to have a tiebreaker. Note that an odd K only guarantees no ties when there are two classes.
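
A small sketch of this effect, using made-up one-dimensional data with one deliberately mislabeled point. The helper `knn_classify` and the data are assumptions for illustration only.

```python
from collections import Counter


def knn_classify(train, query, k):
    """train is a list of (feature, label) pairs with a single numeric feature."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Made-up 1-D data: class "a" sits around 1-3, class "b" around 7-10,
# plus one mislabeled (noisy) "b" point at 2.4 inside the "a" region.
train = [(1.0, "a"), (2.0, "a"), (2.4, "b"), (2.8, "a"), (3.0, "a"),
         (7.0, "b"), (8.0, "b"), (8.5, "b"), (9.0, "b"), (10.0, "b")]

for k in (1, 3, len(train)):
    print(k, knn_classify(train, 2.5, k))
# 1 b   (the single nearest neighbor is the noisy point)
# 3 a   (majority voting outvotes the noise)
# 10 b  (K is too large: the overall majority class always wins)
```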

### How to choose the value of K?
The number of neighbors (K) in KNN is a hyperparameter (a controlling variable for the prediction model) that we need to choose at the time of model building.
   - 1) We can ask a domain expert on the problem we are solving to suggest a good value of K.
   - 2) We can use cross-validation to find the best K. We try different K values, plot how the validation error rate varies, and choose the K at which the validation error is lowest.
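
The cross-validation approach might look like the sketch below. It uses leave-one-out cross-validation (each point is held out once and predicted from the rest), which is one simple variant. The helper names, the range of K values tried, and the synthetic data are all assumptions for illustration.

```python
import random
from collections import Counter


def knn_classify(train_X, train_y, query, k):
    # Rank training points by squared Euclidean distance to the query.
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], query)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_error_rate(X, y, k):
    """Leave-one-out validation error: fraction of held-out points misclassified."""
    errors = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_classify(rest_X, rest_y, X[i], k) != y[i]:
            errors += 1
    return errors / len(X)


# Synthetic data (an assumption for illustration): two overlapping Gaussian blobs.
rng = random.Random(0)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)] + \
    [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(40)]
y = [0] * 40 + [1] * 40

# Try odd K values and keep the one with the lowest validation error.
error_by_k = {k: loo_error_rate(X, y, k) for k in range(1, 22, 2)}
best_k = min(error_by_k, key=error_by_k.get)

for k, err in error_by_k.items():
    print(f"K={k:2d}  validation error={err:.3f}")
print("best K:", best_k)
```

Plotting `error_by_k` gives the graph described above. With real data, k-fold cross-validation is usually cheaper than leave-one-out.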

### Summary:
#### Pros:
- KNN is a non-parametric algorithm, which means there are no assumptions about the data distribution to be met before using it.
- KNN doesn't explicitly build a model; it simply labels a new data entry based on the stored historical data.
- KNN can also be used for multiclass classification.
- KNN is an instance-based (memory-based) learning approach. The classifier immediately adapts as we collect new training data, which allows the algorithm to respond quickly to changes in the input during real-time use.

#### Cons:
- One of the biggest issues with KNN is choosing the optimal number of neighbors to consider when classifying a new data entry.
- As the dataset grows, prediction speed declines quickly, because every query is compared against all stored training points.
- KNN works well with a small number of input features, but as the number of features increases it struggles to predict the output for a new data point (the curse of dimensionality).
- We need to normalise the data, because features with larger numeric ranges otherwise dominate the distance calculation.
- KNN doesn't perform well on imbalanced data.
- KNN is very sensitive to outliers, as it simply chooses the neighbors based on a distance criterion.
- KNN has no inherent way of dealing with missing values.
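
To illustrate the normalisation point, the sketch below compares nearest neighbors before and after min-max scaling. The helper names and the `(age, income)` rows are made up for this example, and `min_max_scale` assumes no column is constant (otherwise it would divide by zero).

```python
def min_max_scale(rows):
    """Rescale each column to the range [0, 1] (min-max normalisation)."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def nearest_index(points, query_index):
    """Index of the point closest (Euclidean) to points[query_index], excluding itself."""
    q = points[query_index]
    others = [i for i in range(len(points)) if i != query_index]
    return min(others, key=lambda i: sum((a - b) ** 2 for a, b in zip(points[i], q)))


# Made-up rows of (age in years, income in dollars).
people = [(25, 50_000), (60, 50_500), (27, 54_000), (45, 90_000)]

print(nearest_index(people, 0))                 # 1: income dominates, age is ignored
print(nearest_index(min_max_scale(people), 0))  # 2: both features now contribute
```

On the raw data the 25-year-old's nearest neighbor is the 60-year-old, purely because their incomes differ by only 500. After scaling, the 27-year-old with a similar income is closest.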